[Web Scraping in Practice] Gensim: A Python Library for Text Analysis

Posted 1 year ago | Author: 认真写程序的强哥 | Category: Toy Blog


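01. Introduction

Gensim is a Python library for natural language processing (NLP) and text analysis. Its features include document similarity computation, keyword extraction, and topic analysis. Before using it for any of these tasks, install it with pip: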

pip install gensim

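02. Topic Analysis and Text Similarity Analysis

Gensim can perform topic modeling, text similarity analysis, and other NLP tasks. The example below trains a Latent Dirichlet Allocation (LDA) topic model, then computes document similarity with MatrixSimilarity from the similarities module (SparseMatrixSimilarity is its sparse-index counterpart). The code is as follows: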

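from gensim import corpora, models, similarities

# A small sample corpus
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Preprocessing: split each document into words.
# Note: str.split() keeps punctuation attached, so "document." and "document?"
# become different tokens; a real pipeline should normalize the text first.
text = [document.split() for document in documents]

# Build a dictionary that maps each word to a unique integer ID
dictionary = corpora.Dictionary(text)

# Convert each document to a bag-of-words vector: a list of (word_id, count) pairs
corpus = [dictionary.doc2bow(doc) for doc in text]

# Topic modeling: train the LDA model
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)

# Print each topic with its top words
for topic in lda_model.print_topics():
    print(topic)

# Text similarity analysis
# Build an index over the documents' topic vectors. num_features is passed
# explicitly because LDA omits low-probability topics from its output.
index = similarities.MatrixSimilarity(lda_model[corpus], num_features=lda_model.num_topics)

# Define a query text
query = "This is a new document."

# Preprocess the query the same way as the corpus
query_bow = dictionary.doc2bow(query.split())

# Similarity score between the query and every document
sims = index[lda_model[query_bow]]

# Sort the documents by similarity score, highest first
sims = sorted(enumerate(sims), key=lambda item: -item[1])

# Print each document with its score
for document_id, similarity in sims:
    print(f"Document {document_id}: Similarity = {similarity}")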

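Result: the script prints the two topics with their top words, then the documents ranked by similarity to the query. (The output screenshot from the original post is not reproduced here.)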
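To make the similarity step less of a black box, here is a minimal sketch in plain Python of the two ideas it relies on: bag-of-words vectors and cosine similarity, which is the measure MatrixSimilarity uses. It compares raw word counts rather than LDA topic vectors, and the helper names (tokenize, build_vocab, to_bow, cosine) are hypothetical, not part of Gensim.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and strip trailing punctuation (a deliberate simplification)."""
    tokens = (word.strip(".,?!").lower() for word in text.split())
    return [tok for tok in tokens if tok]


def build_vocab(token_lists):
    """Map each distinct token to an integer ID, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs; tokens outside the vocab are skipped."""
    counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
    return sorted(counts.items())


def cosine(bow_a, bow_b):
    """Cosine similarity between two sparse (id, weight) vectors."""
    a, b = dict(bow_a), dict(bow_b)
    dot = sum(weight * b.get(idx, 0) for idx, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
token_lists = [tokenize(doc) for doc in documents]
vocab = build_vocab(token_lists)
corpus = [to_bow(tokens, vocab) for tokens in token_lists]

query_bow = to_bow(tokenize("This is a new document."), vocab)
scores = sorted(
    ((i, cosine(query_bow, doc)) for i, doc in enumerate(corpus)),
    key=lambda item: -item[1],
)
for doc_id, score in scores:
    print(f"Document {doc_id}: cosine similarity = {score:.4f}")
```

Because the tokenizer strips punctuation, documents 0 and 3 end up with identical vectors here, which the plain str.split() preprocessing would not give you.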

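Another approach is to compare documents with the Wasserstein distance. Gensim builds the bag-of-words vectors here; the distance itself comes from SciPy's wasserstein_distance. The code is as follows: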

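from gensim import corpora
from scipy.stats import wasserstein_distance
import numpy as np

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Preprocess the text and build the dictionary
text = [document.split() for document in documents]
dictionary = corpora.Dictionary(text)

# Bag-of-words representation of each document
corpus = [dictionary.doc2bow(doc) for doc in text]

# Build a probability distribution over the vocabulary for each document.
# np.zeros gives float arrays; an integer array would truncate every fraction to 0.
document_distributions = [np.zeros(len(dictionary)) for _ in corpus]

for i, doc_bow in enumerate(corpus):
    total = sum(count for _, count in doc_bow)  # total tokens, so each row sums to 1
    for word_id, count in doc_bow:
        document_distributions[i][word_id] = count / total

# Compute the Wasserstein distance between the first document and each of the others.
# Note: SciPy treats the two arrays as samples of values, so this compares how the
# per-word probabilities are distributed, not how mass moves between word IDs.
for i in range(1, len(document_distributions)):
    wasserstein_dist = wasserstein_distance(document_distributions[0], document_distributions[i])
    print(f"Wasserstein Distance between Document 0 and Document {i}: {wasserstein_dist}")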

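Result: the script prints the Wasserstein distance between document 0 and each of documents 1 to 3. (The output screenshot from the original post is not reproduced here.)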
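For intuition about what a Wasserstein distance measures, the following is a small sketch in plain Python for the one-dimensional case: two distributions over the same evenly spaced positions 0..n-1, where the distance equals the summed absolute difference of the cumulative distributions. The function names are hypothetical. Treating word IDs as positions on a line is an assumption made only for illustration, since ID order carries no linguistic meaning.

```python
from itertools import accumulate


def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive sum")
    return [c / total for c in counts]


def wasserstein_1d(p, q):
    """W1 distance between two distributions on the positions 0..n-1.

    With unit spacing between positions, the distance is the sum of
    |CDF_p(k) - CDF_q(k)| over all positions k.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# All mass at position 0 versus all mass at position 2: the mass moves 2 steps.
print(wasserstein_1d([1, 0, 0], [0, 0, 1]))

# Shifting a two-word distribution one position to the right costs about 1.
print(wasserstein_1d(normalize([1, 1, 0]), normalize([0, 1, 1])))
```

As far as I know, the SciPy equivalent is to pass the positions as u_values and v_values and the probabilities as u_weights and v_weights, rather than passing the probability vectors as the values themselves.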

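03. Keyword Extraction

Gensim lets you weight the words in a document with TF-IDF (or other algorithms) to extract keywords. models.TfidfModel computes the TF-IDF weights, and the highest-weighted words in a document are its keyword candidates. The example also calls get_document_topics on an LDA model to get each document's topic distribution. The code is as follows: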

from gensim import corpora, models
from gensim.parsing.preprocessing import preprocess_string, strip_punctuation

# Sample documents
documents = [
    "This is the first document. It contains important information.",
    "This document is the second document. It also has important content.",
    "And this is the third one. It may contain some relevant details.",
    "Is this the first document? Yes, it is."
]

# Preprocess the text
def preprocess(text):
    # Use Gensim's preprocessing helpers; here the only filter removes punctuation
    custom_filters = [strip_punctuation]
    processed_text = preprocess_string(text, custom_filters)
    return processed_text

# Preprocess the documents and build the bag-of-words representation
text = [preprocess(document) for document in documents]
dictionary = corpora.Dictionary(text)
corpus = [dictionary.doc2bow(doc) for doc in text]

# Fit the TF-IDF model, plus a small LDA model for the topic distributions.
# id2word and num_topics are set explicitly; the default of 100 topics is far
# too many for four short documents.
tfidf_model = models.TfidfModel(corpus)
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary)

# TF-IDF weights for each document, as (word_id, weight) pairs
for i, doc in enumerate(corpus):
    tfidf_weights = tfidf_model[doc]
    print(f"TF-IDF Weights for Document {i}: {tfidf_weights}")

# Topic distribution for each document
for i, doc in enumerate(corpus):
    document_topics = lda_model.get_document_topics(doc)
    print(f"Topic Distribution for Document {i}: {document_topics}")

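Result: the script prints the TF-IDF weights of each document as (word_id, weight) pairs, then each document's topic distribution. (The output screenshot from the original post is not reproduced here.)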
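The TF-IDF output above is a list of (word_id, weight) pairs, which still has to be ranked to get actual keywords. The sketch below does the whole computation by hand in plain Python and returns the top-weighted words per document. It assumes the weighting that, to my understanding, TfidfModel uses by default: raw term frequency times log2(N / document frequency), followed by L2 normalization. The function name tfidf_keywords is hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(token_lists, top_n=3):
    """Return, per document, the top_n (word, weight) pairs ranked by TF-IDF."""
    n_docs = len(token_lists)

    # Document frequency: in how many documents each word appears
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    results = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        weights = {
            word: count * math.log2(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
        # Words that occur in every document get weight 0 and are dropped
        weights = {word: w for word, w in weights.items() if w > 0}

        # Scale the vector to unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {word: w / norm for word, w in weights.items()}

        # Highest weight first; ties are broken alphabetically
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_n])
    return results


token_lists = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
    ["and", "this", "is", "the", "third", "one"],
]
keywords = tfidf_keywords(token_lists)
for i, ranked in enumerate(keywords):
    print(f"Document {i}:", [(word, round(weight, 3)) for word, weight in ranked])
```

Words such as "this", "is", and "the" appear in every document, so their weight is zero and they never show up as keywords, which is the behavior that makes TF-IDF useful for this task.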

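04. Word2Vec Embeddings (Word Embeddings)

Gensim supports training and using Word2Vec models, which map words to a low-dimensional vector space. Word2Vec is a word-embedding technique that captures semantic relationships between words, which makes it useful for word similarity, word clustering, and other NLP tasks. The code is as follows: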

from gensim.models import Word2Vec

# Sample data: a list of tokenized sentences
sentences = [
    ["this", "is", "a", "sample", "sentence"],
    ["word2vec", "is", "used", "to", "create", "word", "embeddings"],
    ["it", "maps", "words", "to", "low-dimensional", "vectors"],
    ["these", "vectors", "capture", "semantic", "meaning", "of", "words"],
]

# Train the Word2Vec model (sg=0 selects CBOW; sg=1 would select skip-gram).
# A corpus this small only demonstrates the API; the vectors are not meaningful.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# Save the model
model.save("word2vec.model")

# Load the model
model = Word2Vec.load("word2vec.model")

# Look up the vector for a word
word_vector = model.wv["word2vec"]
print("Vector representation for 'word2vec':", word_vector)

# Find the words most similar to a given word
similar_words = model.wv.most_similar("word2vec", topn=3)
print("Most similar words to 'word2vec':", similar_words)

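Result: the script prints the 100-dimensional vector for 'word2vec', then its three most similar words with their similarity scores. (The output screenshot from the original post is not reproduced here.)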
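The window and sg parameters are easier to picture with the training examples written out. This is a simplified sketch in plain Python of how a context window turns a sentence into training pairs: CBOW predicts the target from its whole context, while skip-gram predicts each context word from the target. The function names are hypothetical, and real Word2Vec training adds steps not shown here, such as random window shrinking and negative sampling.

```python
def context_windows(sentence, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + 1 + window]
        yield target, left + right


def skipgram_pairs(sentence, window=2):
    """Flatten the windows into (target, context_word) pairs, as in skip-gram."""
    return [
        (target, context_word)
        for target, context in context_windows(sentence, window)
        for context_word in context
    ]


sentence = ["this", "is", "a", "sample", "sentence"]

# CBOW-style examples: the context words jointly predict the target
for target, context in context_windows(sentence, window=1):
    print(f"context {context} -> target {target!r}")

# Skip-gram-style examples: the target predicts each context word separately
print(skipgram_pairs(sentence, window=1))
```

A larger window produces more pairs per word and tends to capture broader topical similarity, while a small window focuses on immediate neighbors.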

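05. FastText Embeddings (Subword Embeddings)

Gensim also supports FastText, a subword-based embedding model that captures the internal structure and morphology of words. FastText performs well on many NLP tasks and is especially useful for morphologically rich languages. The code is as follows: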

from gensim.models.fasttext import FastText

# Sample data: a list of tokenized sentences
sentences = [
    ["this", "is", "a", "sample", "sentence"],
    ["fasttext", "is", "used", "to", "capture", "word", "subword", "embeddings"],
    ["it", "can", "handle", "morphological", "variations", "in", "words"],
    ["fasttext", "embeddings", "are", "useful", "for", "NLP", "tasks"],
]

# Train the FastText model (sg=0 selects CBOW; sg=1 would select skip-gram)
model = FastText(sentences, vector_size=100, window=5, min_count=1, sg=0)

# Save the model
model.save("fasttext.model")

# Load the model
model = FastText.load("fasttext.model")

# Look up the vector for a word
word_vector = model.wv["fasttext"]
print("Vector representation for 'fasttext':", word_vector)

# Find the words most similar to a given word
similar_words = model.wv.most_similar("fasttext", topn=3)
print("Most similar words to 'fasttext':", similar_words)

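Result: the script prints the 100-dimensional vector for 'fasttext', then its three most similar words with their similarity scores. (The output screenshot from the original post is not reproduced here.)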
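The "subword" part of FastText refers to character n-grams. As a rough illustration in plain Python, the sketch below lists the n-grams of a word after wrapping it in the boundary markers < and >. The function name char_ngrams is hypothetical, and the range of 3 to 6 characters is my assumption about the default min_n and max_n settings. The real model additionally hashes these n-grams into a fixed number of buckets, which is not shown.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("cat"))

# Related word forms share many n-grams, so their vectors share components.
shared = set(char_ngrams("document")) & set(char_ngrams("documents"))
print(sorted(shared))
```

Because a word's vector is built from the vectors of its n-grams, FastText can produce a vector even for a word that never appeared in the training data, as long as some of its n-grams did.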

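06. Document Vectorization

Gensim can represent documents as bag-of-words vectors and as TF-IDF vectors. Both turn text into a numeric form that can be used for text classification, retrieval, and clustering. The code is as follows: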

from gensim import corpora

# Sample documents
documents = [
    "This is the first document. It contains important information.",
    "This document is the second document. It also has important content.",
    "And this is the third one. It may contain some relevant details.",
    "Is this the first document? Yes, it is.",
]

# Preprocess the text and build the bag-of-words representation
text = [document.split() for document in documents]
dictionary = corpora.Dictionary(text)
corpus = [dictionary.doc2bow(doc) for doc in text]

# Document vectors as {word_id: count} dictionaries
document_vectors = [dict(doc) for doc in corpus]

# Print the document vectors
for i, doc_vector in enumerate(document_vectors):
    print(f"Document {i} Vector: {doc_vector}")

# Second snippet: the same corpus represented as TF-IDF vectors
from gensim import corpora, models

# Sample documents
documents = [
    "This is the first document. It contains important information.",
    "This document is the second document. It also has important content.",
    "And this is the third one. It may contain some relevant details.",
    "Is this the first document? Yes, it is.",
]

# Preprocess the text and build the bag-of-words representation
text = [document.split() for document in documents]
dictionary = corpora.Dictionary(text)
corpus = [dictionary.doc2bow(doc) for doc in text]

# Fit the TF-IDF model and transform the whole corpus
tfidf_model = models.TfidfModel(corpus)
corpus_tfidf = tfidf_model[corpus]

# TF-IDF document vectors as {word_id: weight} dictionaries
tfidf_vectors = [dict(doc) for doc in corpus_tfidf]

# Print the TF-IDF document vectors
for i, tfidf_vector in enumerate(tfidf_vectors):
    print(f"TF-IDF Vector for Document {i}: {tfidf_vector}")

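Result: the first snippet prints each document's word counts keyed by word ID, and the second prints each document's TF-IDF weights keyed by word ID. (The output screenshot from the original post is not reproduced here.)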
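Classifiers and clustering algorithms usually expect fixed-length dense vectors rather than sparse (word_id, weight) pairs. The sketch below shows the conversion in plain Python; sparse_to_dense is a hypothetical helper and the two toy documents are made up for illustration. Gensim ships its own utility for this, gensim.matutils.sparse2full, which is likely the better choice in real code.

```python
def sparse_to_dense(sparse_vec, length):
    """Expand a list of (index, weight) pairs into a dense list of floats."""
    dense = [0.0] * length
    for idx, weight in sparse_vec:
        if not 0 <= idx < length:
            raise IndexError(f"index {idx} is outside a vector of length {length}")
        dense[idx] = float(weight)
    return dense


# Two toy documents over a 4-word vocabulary, in (word_id, count) form
corpus = [
    [(0, 1), (2, 2)],
    [(1, 1), (3, 1)],
]
vocab_size = 4

# One row per document, one column per word ID
matrix = [sparse_to_dense(doc, vocab_size) for doc in corpus]
for row in matrix:
    print(row)
```

The vector length should be the dictionary size, so that the same column always refers to the same word across documents.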

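That concludes this overview of text analysis with Gensim. I hope it helps you with your own text analysis problems. If you are interested, try the examples yourself!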
