1. Install the required packages
```
# Read .docx files
!pip install python-docx
# Alternative: the same package via the Tsinghua PyPI mirror
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple python-docx
# Chinese and English word segmentation
!pip install jieba
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple jieba
# Output to Excel
!pip install pandas
!pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pandas
```
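After installing, it can help to confirm that the packages are importable from the notebook kernel. The following is a minimal sketch, not part of the original walkthrough; `REQUIRED` and `missing_packages` are hypothetical names. Note that `python-docx` is imported as `docx`.

```python
import importlib.util

# Map PyPI names to import names (python-docx is imported as "docx")
REQUIRED = {"python-docx": "docx", "jieba": "jieba", "pandas": "pandas"}

def missing_packages(required=REQUIRED):
    """Return the PyPI names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]

# An empty list means everything needed is installed
print(missing_packages())
```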
2. Read the docx file into one large string
```python
import docx

# Open the Word file and join the text of every paragraph into one string
document = docx.Document("Python.docx")
content = " ".join(para.text for para in document.paragraphs)
```
3. Chinese word segmentation
```
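import jieba

# jieba.cut returns a generator; cut_all=False selects precise mode
seg_list = jieba.cut(content, cut_all=False)
print(type(seg_list))
# Filter out punctuation and meaningless single characters
seg_list = [
    word
    for word in seg_list
    if len(word) > 1
]
print(seg_list[:30])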
```
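The `len(word) > 1` filter still lets through numbers, runs of punctuation, and very common function words. A stricter filter could look like the sketch below. This is an illustration only: `STOP_WORDS` is a tiny made-up list, and `clean_tokens` is a hypothetical helper that works on any iterable of tokens, including the output of `jieba.cut`.

```python
import re

# Hypothetical stop-word list for illustration; a real one would be much longer
STOP_WORDS = {"我们", "可以", "一个", "the", "and"}

# Matches tokens made up only of Chinese characters or ASCII letters
WORD_PATTERN = re.compile(r"[\u4e00-\u9fffA-Za-z]+")

def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Keep multi-character word tokens that are not stop words."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if len(token) <= 1:
            continue  # single characters and empty strings
        if not WORD_PATTERN.fullmatch(token):
            continue  # numbers, punctuation, mixed symbols
        if token.lower() in stop_words:
            continue  # common words that carry little meaning
        cleaned.append(token)
    return cleaned

sample = ["Python", "可以", "读取", "。。", "2023", "的", " ", "The"]
print(clean_tokens(sample))  # ['Python', '读取']
```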
4. Count word frequencies
```
from collections import Counter

counter = Counter(seg_list)
# Preview 10 entries (first-seen order, not yet sorted by count)
for key, count in list(counter.items())[:10]:
    print(key, count)
```
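The preview above lists entries in the order they were first seen. To look at the most frequent words directly, `Counter.most_common` returns `(word, count)` pairs sorted by count. A small self-contained example with made-up words:

```python
from collections import Counter

# Made-up tokens standing in for seg_list
words = ["python", "word", "python", "excel", "word", "python"]
counter = Counter(words)

# The two most frequent words, highest count first
top = counter.most_common(2)
print(top)  # [('python', 3), ('word', 2)]
```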
5. Build a pandas DataFrame and sort it
```
import pandas as pd

# One row per word, then sort so the most frequent words come first
df = pd.DataFrame(list(counter.items()), columns=['word', 'count'])
df.sort_values(by="count", ascending=False, inplace=True)
df.head()
```
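The steps above stop at the sorted DataFrame. Writing it to Excel can be sketched as follows. The helper names and the file name `word_count.xlsx` are assumptions, and `DataFrame.to_excel` needs an Excel writer engine such as `openpyxl`, which is not in the install list above.

```python
from collections import Counter

import pandas as pd

def counts_to_frame(counter):
    """Build a DataFrame of words and counts, most frequent first."""
    # most_common() with no argument returns every entry sorted by count
    return pd.DataFrame(counter.most_common(), columns=["word", "count"])

def export_counts(counter, path="word_count.xlsx"):
    """Write the sorted counts to an Excel file and return the DataFrame.

    The default file name is only an example. Requires an Excel engine
    such as openpyxl to be installed.
    """
    df = counts_to_frame(counter)
    # index=False leaves the DataFrame's row index out of the sheet
    df.to_excel(path, index=False)
    return df
```

With the `counter` from step 4, `export_counts(counter)` would write the file next to the notebook.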
Converting a list to a dict
```
# Even positions become keys, odd positions become values
a = ['hello', 'world', '1', '2']
b = dict(zip(a[0::2], a[1::2]))
b  # {'hello': 'world', '1': '2'}
```
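One thing to watch with this pattern: `zip` stops at the shorter input, so a list with an odd number of items silently loses its last key. A defensive variant, with `pairs_to_dict` as a hypothetical helper name, might look like this:

```python
def pairs_to_dict(items):
    """Turn [key, value, key, value, ...] into a dict, rejecting odd lengths."""
    if len(items) % 2 != 0:
        raise ValueError("expected an even number of items (key, value, ...)")
    # Even positions are keys, odd positions are the matching values
    return dict(zip(items[0::2], items[1::2]))

print(pairs_to_dict(['hello', 'world', '1', '2']))  # {'hello': 'world', '1': '2'}
```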