A Tutorial on Using TF-IDF in Python

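TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used text mining technique for estimating how important a word is to one document within a collection of documents. A word scores high when it occurs often in that document but in few of the others. The short Python tutorial below shows how to use TF-IDF to analyze text.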
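To make the definition concrete, here is a minimal from-scratch sketch of one common textbook variant, weight = tf * log(N / df). The function name `tf_idf`, the whitespace tokenizer, and the toy corpus are assumptions made for illustration; scikit-learn's default formula adds smoothing and normalization, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, chosen only for illustration.
toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]
scores = tf_idf(toy_corpus)
print(scores[0])
```

In this toy corpus "the" appears in every document, so its weight is zero, while a word that appears in only one document, such as "ran", gets the largest weight in its document.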

Installing the required library

First, make sure the following Python library is installed:

pip install scikit-learn

Importing the library

from sklearn.feature_extraction.text import TfidfVectorizer

Preparing the data

The sample texts below are used to demonstrate TF-IDF.

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

Creating the TF-IDF vectorizer

vectorizer = TfidfVectorizer()
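The tutorial uses the vectorizer with its default settings. As a rough sketch of how it can be tuned, the block below sets a few real `TfidfVectorizer` parameters; the specific values and the names `custom_vectorizer` and `sample_docs` are illustrative assumptions, not recommendations from the tutorial.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical configuration for illustration; the tutorial itself uses the defaults.
custom_vectorizer = TfidfVectorizer(
    lowercase=True,      # fold case before tokenizing (this is the default)
    ngram_range=(1, 2),  # extract single words and two-word phrases
    sublinear_tf=True,   # replace the raw count tf with 1 + log(tf)
)

sample_docs = [
    "This is the first document.",
    "This document is the second document.",
]
custom_vectorizer.fit(sample_docs)

# The learned vocabulary now holds both single words and two-word phrases.
print(sorted(custom_vectorizer.vocabulary_))
```

With `ngram_range=(1, 2)` the vocabulary grows quickly on real corpora, so it is usually worth checking its size before going further.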

Computing the TF-IDF matrix

tfidf_matrix = vectorizer.fit_transform(documents)
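The result of `fit_transform` is a sparse matrix with one row per document and one column per distinct term. The sketch below repeats the setup so that it runs on its own and then inspects the shape and the row lengths. The names `dense` and `row_norms` are introduced here for illustration. It relies on the default `norm="l2"` setting, under which every row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
tfidf_matrix = TfidfVectorizer().fit_transform(documents)

# (number of documents, number of distinct terms)
print(tfidf_matrix.shape)

# Convert to a dense array only for small examples; real corpora should stay sparse.
dense = tfidf_matrix.toarray()

# With the default L2 normalization, each row has Euclidean length 1.
row_norms = np.linalg.norm(dense, axis=1)
print(np.round(row_norms, 6))
```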

Getting the feature names

feature_names = vectorizer.get_feature_names_out()

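Getting the TF-IDF terms for each document

# get_feature_names_out() already returns a NumPy array, so no conversion
# (and no separate NumPy import) is needed to index it.
for doc_index in range(len(documents)):
    print(f"Document {doc_index}:")
    # Column indices of the terms with a nonzero TF-IDF weight in this document.
    term_indices = tfidf_matrix[doc_index].nonzero()[1]
    print(" ".join(feature_names[term_indices]))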

Image example

[Image not available: an illustration of text mining, intended to help explain the concept of TF-IDF.]
