Bag-of-Words Model: Concept and Python Implementation - CTyun (天翼云)

2024-04-18 09:15:34 Views: 37

Bag-of-Words Model

- 1. Basic concepts
- 2. Code implementation

1. Basic concepts

Before text can be classified, it must first be represented as vectors; the bag-of-words model is a common way to do this.

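The bag-of-words model (BoW, Bag of Words) ignores the contextual relationships between the words in a text and considers only the weight of each word, which is related to how often the word occurs in the text. It is like putting all the words into a bag: each word is treated independently and carries no semantic information.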

Building the bag-of-words representation of a text takes three steps:

Tokenizing
Counting term frequencies
Normalizing the features
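The three steps can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the helper name `bag_of_words` is hypothetical, tokens are lowercased runs of letters and digits, and normalization divides each count by the total number of tokens in the text.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return (counts, normalized) for one text; a toy illustration only."""
    # Step 1: tokenizing (lowercase, keep runs of letters and digits)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Step 2: counting (term frequency of each token)
    counts = Counter(tokens)
    # Step 3: normalizing (divide each count by the number of tokens)
    total = sum(counts.values())
    normalized = {word: n / total for word, n in counts.items()}
    return counts, normalized

counts, normalized = bag_of_words("I come to China to travel")
print(counts["to"], normalized["to"])
```

Here "to" occurs twice among six tokens, so its count is 2 and its normalized weight is one third.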

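The set-of-words model (SoW, Set of Words) is similar to the bag-of-words model. The only difference is that it records only whether a word occurs in the text, not how often. In most cases the bag-of-words model is used.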
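To make the difference concrete, the following sketch builds a set-of-words vector. The function name `set_of_words` and the small vocabulary are assumptions chosen for illustration; the tokenization rule (lowercased runs of letters and digits) is likewise a simplification.

```python
import re

def set_of_words(text, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the text, else 0."""
    present = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [1 if word in present else 0 for word in vocabulary]

# A small illustrative vocabulary, in sorted order
vocabulary = ["china", "come", "i", "tea", "to", "travel"]
sow = set_of_words("I come to China to travel", vocabulary)
print(sow)  # "to" occurs twice in the text but is still recorded as 1
```

A bag-of-words vector over the same vocabulary would hold 2 in the "to" position instead of 1.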

For example, suppose the corpus contains 4 texts:

I come to China to travel
This is a car polupar in China
I love tea and Apple
The work is to write some papers in science

The dictionary built from this corpus contains 21 words:

‘a’,
‘and’,
‘apple’,
‘car’,
‘china’,
‘come’,
‘i’,
‘in’,
‘is’,
‘love’,
‘papers’,
‘polupar’,
‘science’,
‘some’,
‘tea’,
‘the’,
‘this’,
‘to’,
‘travel’,
‘work’,
‘write’

The one-hot representation of each word is:

‘a’： $\;\;\;\;[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]$
‘and’： $\;\;[0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]$
…
‘write’： $[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]$
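As a rough sketch, a one-hot vector can be produced by placing a single 1 at the word's position in the sorted dictionary. The helper name `one_hot` is an assumption made for this example.

```python
# The 21-word dictionary from the corpus, in sorted order
vocabulary = sorted([
    "a", "and", "apple", "car", "china", "come", "i", "in", "is", "love",
    "papers", "polupar", "science", "some", "tea", "the", "this", "to",
    "travel", "work", "write",
])

def one_hot(word, vocabulary):
    """Vector of zeros with a single 1 at the word's index in the vocabulary."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec

print(one_hot("a", vocabulary))
print(one_hot("write", vocabulary))
```

Summing the one-hot vectors of all the words in a text gives that text's bag-of-words count vector.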

The bag-of-words representations of the four texts are:

$[0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0]$
$[1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]$
$[0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]$
$[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1]$

After term-frequency normalization, the results are:

$[0, 0, 0, 0, 1 / 6, 1 / 6, 1 / 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 / 3, 1 / 6, 0, 0]$
$[1 / 7, 0, 0, 1 / 7, 1 / 7, 0, 0, 1 / 7, 1 / 7, 0, 0, 1 / 7, 0, 0, 0, 0, 1 / 7, 0, 0, 0, 0]$
$[0, 1 / 5, 1 / 5, 0, 0, 0, 1 / 5, 0, 0, 1 / 5, 0, 0, 0, 0, 1 / 5, 0, 0, 0, 0, 0, 0]$
$[0, 0, 0, 0, 0, 0, 0, 1 / 9, 1 / 9, 0, 1 / 9, 0, 1 / 9, 1 / 9, 0, 1 / 9, 0, 1 / 9, 0, 1 / 9, 1 / 9]$
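The count and normalized vectors for the whole corpus can be checked with a short standard-library sketch. It assumes tokens are lowercased runs of letters and that normalization divides each count by the row total; `Fraction` is used only so the results print as exact fractions.

```python
import re
from collections import Counter
from fractions import Fraction

corpus = [
    "I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple",
    "The work is to write some papers in science",
]

# Tokenize every text, then build the sorted vocabulary shared by all texts
tokenized = [re.findall(r"[a-z]+", text.lower()) for text in corpus]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Count vectors: one row per text, one column per vocabulary word
count_rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_rows.append([counts[word] for word in vocabulary])

# Normalized vectors: each count divided by its row total, as exact fractions
norm_rows = [[Fraction(c, sum(row)) for c in row] for row in count_rows]

print(len(vocabulary))
print(count_rows[0])
print([str(f) for f in norm_rows[3]])
```

Each normalized row sums to 1, and every word of the fourth text, which has nine distinct words occurring once each, receives the weight 1/9.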

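In large-scale text processing, the feature dimension equals the size of the tokenized vocabulary, so it becomes very high; the hashing trick is commonly used to reduce it.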
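The idea behind the hashing trick can be sketched as follows: each token is hashed to one of a fixed number of buckets, so no vocabulary needs to be stored. This is a simplified illustration, not the implementation of any particular library. The function name `hashed_bow`, the use of md5, and the way the sign bit is chosen are all assumptions made for the example.

```python
import hashlib
import re

def hashed_bow(text, n_features=6):
    """Map tokens to n_features buckets with a signed hash; collisions are possible."""
    vec = [0] * n_features
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # md5 is stable across runs (Python's built-in hash() of str is salted per process)
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        index = h % n_features                       # bucket for this token
        sign = 1 if ((h >> 64) & 1) == 0 else -1     # one hash bit chooses the sign
        vec[index] += sign
    return vec

vec = hashed_bow("I come to China to travel")
print(len(vec), vec)
```

The output length is fixed at `n_features` regardless of vocabulary size. The price is that different words may collide in the same bucket, and the random sign makes such collisions tend to cancel rather than accumulate.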

In addition, the values in the bag-of-words vectors can also be the TF-IDF values of the words.
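A minimal TF-IDF sketch is shown below. It assumes one common textbook variant, tf = count / text length and idf = ln(N / df). Libraries often use other variants (scikit-learn's TfidfVectorizer, for instance, smooths the IDF and L2-normalizes each row by default), so the numbers here will not match theirs.

```python
import math
import re
from collections import Counter

corpus = [
    "I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple",
    "The work is to write some papers in science",
]
tokenized = [re.findall(r"[a-z]+", text.lower()) for text in corpus]
n_docs = len(tokenized)

# Document frequency: the number of texts in which each word appears
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """TF-IDF weights for one text, with tf = count / len and idf = ln(N / df)."""
    counts = Counter(tokens)
    return {w: (c / len(tokens)) * math.log(n_docs / df[w]) for w, c in counts.items()}

weights = tfidf(tokenized[0])
print(weights)
```

In the first text, "come" appears in no other text and so outweighs "china", which also appears in the second text, even though both occur once.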

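2. Code implementation

The implementation mainly uses the CountVectorizer class in sklearn.feature_extraction.text.

CountVectorizer is a commonly used feature-extraction class (it accepts a stop-word list). Its fit_transform method counts how often each word occurs in each text and returns a term-frequency matrix.
get_feature_names lists all the terms in the vocabulary (in recent scikit-learn releases the method is get_feature_names_out), and toarray shows the bag-of-words result as a dense array.
Note that with its default token pattern CountVectorizer ignores single-character tokens, so 'a' and 'i' are dropped and the vocabulary for the corpus above has 19 terms rather than 21.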

Input: a list whose elements are strings
Output: a term-frequency matrix whose element $a[i][j]$ is the frequency of term $j$ in text $i$
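As a small sketch of how to read one element $a[i][j]$, the fitted `vocabulary_` attribute maps each term to its column index. The example assumes scikit-learn is installed and uses CountVectorizer with its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple",
    "The work is to write some papers in science",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse matrix, shape (n_texts, n_terms)

# vocabulary_ maps each term to its column index j
j = vectorizer.vocabulary_["to"]
a = X.toarray()
print(X.shape)
print(a[0][j])   # frequency of "to" in the first text
```

With the default settings the matrix has 4 rows and 19 columns, because the single-character tokens "a" and "i" are not kept as terms.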

scikit-learn's HashingVectorizer class implements an algorithm based on the signed hash trick.

The code is as follows:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I come to China to travel",
          "This is a car polupar in China",
          "I love tea and Apple ",
          "The work is to write some papers in science"]
vectorizer = CountVectorizer()
# Fit once and reuse the resulting sparse matrix
X = vectorizer.fit_transform(corpus)
print("Term counts:")
# Sparse output for the 4 texts: each line shows (text index, term index) followed by the count
print(X)
print("\nBag-of-words matrix:")
print(X.toarray())

The output is as follows:
[Image: program output]

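from sklearn.feature_extraction.text import HashingVectorizer

# Hash the 19-term vocabulary down to 6 dimensions; norm=None keeps the raw signed counts
vectorizerH = HashingVectorizer(n_features=6, norm=None)
# HashingVectorizer is stateless, so transform the corpus once and reuse the result
XH = vectorizerH.fit_transform(corpus)
print("Term counts:")
print(XH)
print("\nBag-of-words matrix:")
print(XH.toarray())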
