Machine Learning: Spam Detection with a Naive Bayes Classifier (Python Code + Dataset)


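Series Contents

Machine Learning: Learning and Applying the scikit-learn Library
Machine Learning: Least-Squares Curve Fitting and Regularization
Machine Learning: Spam Detection with a Naive Bayes Classifier (Python Code + Dataset)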

1. Concepts

Bayes' theorem: $P(A \mid B)=\frac{P(A) P(B \mid A)}{P(B)}$
For an introduction to Bayes' theorem you can refer to this video, which I think is quite good.
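For spam classification, the idea is simply to compute the probability that an email is spam and the probability that it is ham (a normal email), and to assign the email to whichever class has the higher probability. Let $h^{+}$ denote spam and $h^{-}$ denote ham, let D denote the email to be classified, and let d denote the words in that email:
$D = d_{1}, d_{2}, d_{3}, d_{4}, d_{5}, \ldots, d_{n}$
The probabilities that a test email is spam or ham are then: $P\left(h^{+} \mid D\right)=\frac{P\left(h^{+}\right) * P\left(D \mid h^{+}\right)}{P(D)}$
$P\left(h^{-} \mid D\right)=\frac{P\left(h^{-}\right) * P\left(D \mid h^{-}\right)}{P(D)}$
$P\left(h^{-} \mid D\right)$: the probability that the test email is ham, given its content
$P\left(h^{+} \mid D\right)$: the probability that the test email is spam, given its content
$P\left(h^{-}\right)$: the probability that an arbitrary email is ham (the prior, estimated from the training set)
$P\left(h^{+}\right)$: the probability that an arbitrary email is spam (the prior, estimated from the training set)
The remaining three terms are:
$P\left(D \mid h^{+}\right)$: the probability of observing the content D given that the email is spam (the likelihood)
$P\left(D \mid h^{-}\right)$: the probability of observing the content D given that the email is ham (the likelihood)
$P(D)$: the probability of observing the content D regardless of class; it is the same in both formulas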
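To make these terms concrete, here is a small worked example. The numbers are made up for illustration and are not estimated from the Enron data; D is treated as a single observation, and $P(D)$ is obtained by summing over the two classes.

```python
# Toy numbers, assumed for illustration only.
p_spam = 0.4            # P(h+): prior probability of spam
p_ham = 0.6             # P(h-): prior probability of ham
p_d_given_spam = 0.05   # P(D | h+): chance of seeing D in a spam email
p_d_given_ham = 0.01    # P(D | h-): chance of seeing D in a ham email

# P(D) via the law of total probability over the two classes.
p_d = p_spam * p_d_given_spam + p_ham * p_d_given_ham

p_spam_given_d = p_spam * p_d_given_spam / p_d   # P(h+ | D)
p_ham_given_d = p_ham * p_d_given_ham / p_d      # P(h- | D)

print(f"P(spam | D) = {p_spam_given_d:.3f}")
print(f"P(ham  | D) = {p_ham_given_d:.3f}")
```

With these assumed values the two posteriors sum to 1 and the spam posterior is the larger one, so the email would be labelled spam.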

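Probabilities of independent events are much easier to compute, so to simplify the calculation naive Bayes assumes that the input features are independent of one another given the class. In other words, each word in an email is treated as independent of the others, regardless of the order in which the words appear. Under this assumption $P(D \mid h)$ factors into a product over the words, and the formulas become:
$P\left(h^{+} \mid D\right)=\frac{P\left(h^{+}\right) * P\left(d_{1} \mid h^{+}\right) * P\left(d_{2} \mid h^{+}\right) * \ldots * P\left(d_{n} \mid h^{+}\right)}{P(D)}$
$P\left(h^{-} \mid D\right)=\frac{P\left(h^{-}\right) * P\left(d_{1} \mid h^{-}\right) * P\left(d_{2} \mid h^{-}\right) * \ldots * P\left(d_{n} \mid h^{-}\right)}{P(D)}$
$P\left(d_{n} \mid h^{+}\right)$: the probability that a word of the test email appears in spam emails (estimated from the training set); $P\left(d_{n} \mid h^{-}\right)$ is defined in the same way for ham.
The whole expression for $P\left(h^{+} \mid D\right)$ is therefore: ($P\left(h^{+}\right)$ times the probability of each word of the test email appearing in spam) divided by (the probability of the email's content D, which does not depend on the class). $P\left(h^{-} \mid D\right)$ is analogous.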
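The factorization can be sketched in a few lines of Python. The per-word probabilities and priors below are hypothetical values chosen for illustration, and `unnormalized_score` is a helper name introduced here, not part of the article's code. Only the numerators are computed, since both classes share the same denominator.

```python
from math import prod

# Hypothetical per-word probabilities, assumed for illustration only.
p_word_given_spam = {"free": 0.05, "winner": 0.03, "meeting": 0.001}
p_word_given_ham = {"free": 0.01, "winner": 0.002, "meeting": 0.04}
p_spam, p_ham = 0.5, 0.5   # assumed priors


def unnormalized_score(words, prior, p_word_given_class):
    """Prior times the product of the per-word conditional probabilities."""
    return prior * prod(p_word_given_class[w] for w in words)


email = ["free", "winner"]
score_spam = unnormalized_score(email, p_spam, p_word_given_spam)
score_ham = unnormalized_score(email, p_ham, p_word_given_ham)

# The shared denominator is not needed to pick the larger posterior.
label = "spam" if score_spam > score_ham else "ham"
print(label)  # prints: spam
```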

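In practice, however, multiplying many small probabilities can underflow to 0 in floating-point arithmetic, so we take the logarithm of each probability above and add the logs instead.
Because $P\left(h^{+} \mid D\right)$ and $P\left(h^{-} \mid D\right)$ share the same denominator and our goal is only to compare the two, the denominator is omitted to simplify the calculation.
The explanation above draws on the lectures by Tang Yudi (唐宇迪) on Bilibili.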
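The underflow problem is easy to reproduce. The sketch below assumes an email of 100 words in which every word has the same small probability under each class; these values are illustrative, not taken from the dataset. The direct products both collapse to 0.0 and can no longer be compared, while the sums of logs remain distinct.

```python
import math

n_words = 100                          # assumed email length
p_word_spam, p_word_ham = 1e-5, 2e-5   # assumed per-word probabilities

# Direct multiplication: the running products fall below the smallest
# positive float and become exactly 0.0.
prod_spam = prod_ham = 1.0
for _ in range(n_words):
    prod_spam *= p_word_spam
    prod_ham *= p_word_ham

# The same quantities in log space stay well within floating-point range.
log_spam = n_words * math.log(p_word_spam)
log_ham = n_words * math.log(p_word_ham)

print(prod_spam, prod_ham)   # both underflow to 0.0
print(log_ham > log_spam)    # True: the comparison still works
```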

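The code implements a naive Bayes classifier for emails, covering data loading, text preprocessing, training, and prediction on new emails. The get_data function reads the data from the directory DATA_DIR and stores the contents of the spam and ham emails in the data list and the corresponding labels in the target list. During preprocessing, the preprocess function lowercases the text, strips punctuation, and removes stopwords.

The NaiveBayesClassifier class then implements the fit and predict methods: fit trains the classifier and predict classifies new emails. The constructor defines four attributes: class_total, a defaultdict holding the number of documents in each class; word_total, a defaultdict holding the total number of word occurrences in each class; word_given_class, a nested defaultdict holding the number of occurrences of each word in each class; and vocabulary, a set of all words seen in training. The fit method iterates over the training set X and the labels y, runs preprocess on each text, and uses the resulting words and label to update class_total, word_total, word_given_class, and vocabulary.

The predict method first uses the per-class document counts to compute the log of each class's prior probability; working with logs avoids underflow from very small probabilities. It then iterates over the emails in X, preprocesses each one, computes the log probability that the email belongs to each class, and picks the class with the largest value as the prediction.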

2. Code

For notes on how some of the functions used in the code work, see: http://t.csdn.cn/VGQjB

Loading the data

import os

DATA_DIR = 'enron'  # path to the dataset
target_names = ['ham', 'spam']  # index 0 = ham (normal), index 1 = spam
with open('stopwords.txt', 'r') as f:
    stopwords = set(f.read().splitlines())  # .splitlines() splits on line breaks

def get_data(DATA_DIR):
    subfolders = ['enron%d' % i for i in range(1, 7)]  # ['enron1', ..., 'enron6']
    data = []
    target = []
    for subfolder in subfolders:  # visit enron1 to enron6 in turn
        # spam
        spam_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'spam'))  # list of all files in the folder
        for spam_file in spam_files:  # open each listed file in turn
            with open(os.path.join(DATA_DIR, subfolder, 'spam', spam_file), encoding="latin-1") as f:
                data.append(f.read())
                target.append(1)  # spam is labelled 1
        # ham
        ham_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'ham'))
        for ham_file in ham_files:
            with open(os.path.join(DATA_DIR, subfolder, 'ham', ham_file), encoding="latin-1") as f:
                data.append(f.read())
                target.append(0)  # ham is labelled 0
    return data, target

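Preprocessing the data

import re
import string

def preprocess(text):
    """
    Clean the text: lowercase it, strip punctuation, and remove stopwords.
    """
    text = text.lower()  # convert to lowercase
    # Replace punctuation with spaces; re.escape keeps characters such as ']' and '\' literal inside the class
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
    words = [word for word in text.split() if word not in stopwords]  # drop stopwords
    return words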
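As a quick check of what the preprocessing step produces, the sketch below runs it on one sentence. The small `stopwords` set is a stand-in assumed for illustration, since the contents of stopwords.txt are not shown in the article. The punctuation is passed through `re.escape` so that characters such as `]` and `\` are treated literally inside the character class.

```python
import re
import string

# A tiny stand-in for stopwords.txt, assumed for illustration only.
stopwords = {"the", "to", "of", "a", "is"}


def preprocess(text):
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return [word for word in text.split() if word not in stopwords]


tokens = preprocess("Congratulations to the WINNERS of the photo-contest!")
print(tokens)  # ['congratulations', 'winners', 'photo', 'contest']
```

Note that hyphenated words are split in two, because each punctuation character is replaced by a space rather than deleted.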

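Naive Bayes classifier

Training and prediction

import math
from collections import defaultdict

class NaiveBayesClassifier():
    def __init__(self):
        self.vocabulary = set()  # vocabulary of all words seen in training
        self.class_total = defaultdict(int)  # number of documents in each class
        self.word_total = defaultdict(int)  # total number of word occurrences in each class
        self.word_given_class = defaultdict(lambda: defaultdict(int))  # occurrences of each word in each class

    def fit(self, X, y):
        """
        Train the classifier, where X is the training data and y is the training labels.
        """
        for text, label in zip(X, y):
            words = preprocess(text)  # lowercase, strip punctuation, remove stopwords
            self.class_total[label] += 1   # one more document in this class
            for word in words:             # iterate over every word
                self.vocabulary.add(word)   # add it to the vocabulary
                self.word_given_class[label][word] += 1  # one more occurrence of this word in this class
                self.word_total[label] += 1  # one more word occurrence in this class

    def predict(self, X):
        """
        Classify new emails, where X is the list of emails to classify.
        """
        log_priors = {}  # log prior probability of each class (logs avoid underflow)
        total_docs = sum(self.class_total.values())
        for c in self.class_total.keys():
            # log(documents in this class / all documents), i.e. the share of ham or spam in the training set
            log_priors[c] = math.log(self.class_total[c] / total_docs)

        predictions = []
        for text in X:  # iterate over every email to classify
            words = preprocess(text)  # lowercase, strip punctuation, remove stopwords

            log_probs = {}
            for c in self.class_total.keys():  # iterate over the two classes, ham and spam
                log_probs[c] = log_priors[c]
                for word in words:  # iterate over every word
                    if word in self.vocabulary:  # skip words that never appeared in the training set
                        # Add the log of the conditional probability.
                        # The +1 (add-one smoothing) avoids a zero probability when a word of the test email
                        # was seen in training but never in this class.
                        # Taking the log keeps the product of tiny probabilities from underflowing to 0.
                        log_probs[c] += math.log((self.word_given_class[c][word] + 1) / (self.word_total[c] + len(self.vocabulary)))
            predictions.append(max(log_probs, key=log_probs.get))  # the class with the largest log probability wins

        return predictions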
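The smoothed estimate used in predict, (count + 1) / (class total + vocabulary size), can be checked on its own. The sketch below uses a toy word list assumed for illustration, and `smoothed_prob` is a helper name introduced here. It shows that a word never seen in spam still gets a non-zero probability, and that the smoothed probabilities over the vocabulary still sum to 1.

```python
from collections import Counter

# Toy training words, assumed for illustration only.
spam_words = ["free", "free", "winner", "cash"]
ham_words = ["meeting", "report", "free"]
vocabulary = set(spam_words) | set(ham_words)


def smoothed_prob(word, class_counts, class_total, vocab_size):
    """Add-one (Laplace) estimate of P(word | class)."""
    return (class_counts[word] + 1) / (class_total + vocab_size)


spam_counts = Counter(spam_words)
n_spam, n_vocab = len(spam_words), len(vocabulary)

# "meeting" never occurs in spam, yet its smoothed probability is not zero.
p_meeting_spam = smoothed_prob("meeting", spam_counts, n_spam, n_vocab)
p_free_spam = smoothed_prob("free", spam_counts, n_spam, n_vocab)
print(p_meeting_spam, p_free_spam)

# The smoothed probabilities over the whole vocabulary still sum to 1.
total = sum(smoothed_prob(w, spam_counts, n_spam, n_vocab) for w in vocabulary)
print(total)
```

Adding the vocabulary size to the denominator is what keeps the distribution normalized after 1 has been added to every count.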

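Testing accuracy

import numpy as np
from sklearn.model_selection import train_test_split

# Load the data
X, y = get_data(DATA_DIR)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # split into training and test sets

# Train the classifier
clf = NaiveBayesClassifier()  # instantiate the classifier
clf.fit(X_train, y_train)     # train it on the training data
predictions = clf.predict(X_test)

# Model accuracy
accuracy = np.sum(np.array(predictions) == np.array(y_test)) / len(y_test)  # fraction of correct predictions
print(f'Accuracy: {accuracy:.2f}')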
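Accuracy counts both kinds of mistake equally, so for a spam filter it can be useful to also look at precision and recall for the spam class. The sketch below is an optional addition, not part of the original code: `spam_metrics` is a hypothetical helper, and the label lists are toy values assumed for illustration (1 = spam, 0 = ham).

```python
def spam_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall), treating `positive` as the spam label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy labels, assumed for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
acc, prec, rec = spam_metrics(y_true, y_pred)
print(f'Accuracy: {acc:.2f}, precision: {prec:.2f}, recall: {rec:.2f}')
```

In the article's setting the same function could be called with y_test and the predictions list.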

Predicting a new email

# Classify a new email, then print the predicted class and the test accuracy
new_email = 'Subject: et & s photo contest - announcing the winners\nCongratulations to the following winners of the 2001 ET & S photo contest. Over 200 entries were submitted! The winning photos will be displayed in the 2001 ET & S public education calendar.'
prediction = clf.predict([new_email])[0]
predictions = clf.predict(X_test)
accuracy = np.sum(np.array(predictions) == np.array(y_test)) / len(y_test)

print(f'Prediction: {target_names[prediction]}')
print(f'Accuracy: {accuracy:.2f}')

Full code

Almost every line of the code is commented, so I won't explain it further here.

import os
import re
import string
import math
import numpy as np
from collections import defaultdict
from sklearn.model_selection import train_test_split


DATA_DIR = 'enron'  # path to the dataset
target_names = ['ham', 'spam']  # index 0 = ham (normal), index 1 = spam
with open('stopwords.txt', 'r') as f:
    stopwords = set(f.read().splitlines())  # .splitlines() splits on line breaks


def get_data(DATA_DIR):
    subfolders = ['enron%d' % i for i in range(1, 7)]  # ['enron1', ..., 'enron6']
    data = []
    target = []
    for subfolder in subfolders:  # visit enron1 to enron6 in turn
        # spam
        spam_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'spam'))  # list of all files in the folder
        for spam_file in spam_files:  # open each listed file in turn
            with open(os.path.join(DATA_DIR, subfolder, 'spam', spam_file), encoding="latin-1") as f:
                data.append(f.read())
                target.append(1)  # spam is labelled 1
        # ham
        ham_files = os.listdir(os.path.join(DATA_DIR, subfolder, 'ham'))
        for ham_file in ham_files:
            with open(os.path.join(DATA_DIR, subfolder, 'ham', ham_file), encoding="latin-1") as f:
                data.append(f.read())
                target.append(0)  # ham is labelled 0
    return data, target


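def preprocess(text):
    """
    Clean the text: lowercase it, strip punctuation, and remove stopwords.
    """
    text = text.lower()  # convert to lowercase
    # Replace punctuation with spaces; re.escape keeps characters such as ']' and '\' literal inside the class
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
    words = [word for word in text.split() if word not in stopwords]  # drop stopwords
    return words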


class NaiveBayesClassifier():
    def __init__(self):
        self.vocabulary = set()  # vocabulary of all words seen in training
        self.class_total = defaultdict(int)  # number of documents in each class
        self.word_total = defaultdict(int)  # total number of word occurrences in each class
        self.word_given_class = defaultdict(lambda: defaultdict(int))  # occurrences of each word in each class

    def fit(self, X, y):
        """
        Train the classifier, where X is the training data and y is the training labels.
        """
        for text, label in zip(X, y):
            words = preprocess(text)  # lowercase, strip punctuation, remove stopwords
            self.class_total[label] += 1   # one more document in this class
            for word in words:             # iterate over every word
                self.vocabulary.add(word)   # add it to the vocabulary
                self.word_given_class[label][word] += 1  # one more occurrence of this word in this class
                self.word_total[label] += 1  # one more word occurrence in this class

    def predict(self, X):
        """
        Classify new emails, where X is the list of emails to classify.
        """
        log_priors = {}  # log prior probability of each class (logs avoid underflow)
        total_docs = sum(self.class_total.values())
        for c in self.class_total.keys():
            # log(documents in this class / all documents), i.e. the share of ham or spam in the training set
            log_priors[c] = math.log(self.class_total[c] / total_docs)

        predictions = []
        for text in X:  # iterate over every email to classify
            words = preprocess(text)  # lowercase, strip punctuation, remove stopwords

            log_probs = {}
            for c in self.class_total.keys():  # iterate over the two classes
                log_probs[c] = log_priors[c]
                for word in words:  # iterate over every word
                    if word in self.vocabulary:  # skip words that never appeared in the training set
                        # Add the log of the add-one smoothed conditional probability
                        log_probs[c] += math.log((self.word_given_class[c][word] + 1) / (self.word_total[c] + len(self.vocabulary)))
            predictions.append(max(log_probs, key=log_probs.get))  # the class with the largest log probability wins

        return predictions

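# Load the data and split it into training and test sets
X, y = get_data(DATA_DIR)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the classifier
clf = NaiveBayesClassifier()
clf.fit(X_train, y_train)

# Classify a new email, then print the predicted class and the test accuracy
new_email = 'Subject: et & s photo contest - announcing the winners\nCongratulations to the following winners of the 2001 ET & S photo contest. Over 200 entries were submitted! The winning photos will be displayed in the 2001 ET & S public education calendar.'
prediction = clf.predict([new_email])[0]
predictions = clf.predict(X_test)
accuracy = np.sum(np.array(predictions) == np.array(y_test)) / len(y_test)

print(f'Prediction: {target_names[prediction]}')
print(f'Accuracy: {accuracy:.2f}')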