Evaluation Metrics for Machine Learning Models: Confusion Matrix, P-R Curve, and Average Precision (with Code)


Reference: Mean Average Precision (mAP) Explained | Paperspace Blog

I. Confusion Matrix

II. Precision-Recall Curve and Average Precision

III. Worked Example and Code

(1) From Prediction Score to Class Label

(2) Precision-Recall Curve

(3) Average Precision (AP)

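We start with the simplest case, binary classification:

I. Confusion Matrix

[Figure: confusion matrix with Predicted Class (the class the model predicts) against Actual Class (the true class); image source in the watermark]

The confusion matrix itself is just a 2 x 2 table. What matters here is understanding the Type I Error (a false positive, FP) and the Type II Error (a false negative, FN). The metrics derived from the table are Accuracy, Precision, and Recall (also called Sensitivity).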

(1) Precision: of all positive predictions (that is, everything predicted as 1), the fraction that is correct.

$Precision = \frac{TP}{TP + FP}$

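(2) Recall/Sensitivity: of all cases that are actually 1, the fraction that the model predicts as positive (that is, predicted as 1 and therefore correct). When a case is actually 1 but the model predicts 0, that is the Type II Error in the figure above, and every such miss lowers recall.

*Note that when a model is used to screen for disease, Recall/Sensitivity is a key metric: we would rather raise a false alarm than miss a patient who may have the disease.

$Recall = \frac{TP}{TP + FN}$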

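(3) Accuracy: the two metrics above are each a ratio within a single row sum or column sum of the matrix. Accuracy instead looks at the whole table: the cases predicted 1 that are actually 1, plus the cases predicted 0 that are actually 0, divided by the total number of cases.

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$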

(4) F1-score: a single metric that takes both (1) Precision and (2) Recall into account. It is their harmonic mean.

$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$
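To make the four definitions concrete, here is a minimal sketch that computes them directly from the confusion-matrix counts. The function name `classification_metrics` and the counts passed to it are hypothetical, chosen only for illustration, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy, f1) from confusion-matrix counts.

    Assumes tp + fp > 0, tp + fn > 0 and precision + recall > 0.
    """
    precision = tp / (tp + fp)                  # correct share of positive predictions
    recall = tp / (tp + fn)                     # share of actual positives that were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct share of all cases
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1


# Hypothetical counts, for illustration only.
precision, recall, accuracy, f1 = classification_metrics(tp=4, fp=1, fn=2, tn=3)
print(precision, recall, accuracy, f1)
```

With these counts, precision is 4/5, recall is 4/6, and accuracy is 7/10.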

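In the figure below, what the doctor says can be read as the model's positive/negative prediction.

[Figure: a doctor's statements, standing in for the model's positive and negative predictions]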

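II. Precision-Recall Curve and Average Precision

Precision takes the point of view of the predictions: it describes how many of the samples the binary classifier predicts as positive are truly positive, that is, how accurate its positive predictions are. Recall takes the point of view of the ground truth: it describes how many of the true positives in the test set the classifier picks out, that is, how many true positives it recalls.

Precision and Recall usually pull against each other. In general, the higher the Precision, the lower the Recall tends to be. To raise Precision, so that the predicted positives are as likely as possible to be true positives, we have to raise the bar for predicting a positive. For example, if a sample used to be labelled positive when its probability was at least 0.5, we now require at least 0.7, so that the samples the classifier picks out are more likely to be true positives. That goal runs directly against raising Recall. To raise Recall, so that the classifier picks out as many true positives as possible, we have to lower the bar: instead of requiring a probability of at least 0.5, we label a sample positive as soon as its probability is at least 0.3, so that the classifier catches as many true positives as possible.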
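The trade-off can be seen numerically with a small sketch that evaluates the same scores at the three thresholds mentioned above (0.3, 0.5, 0.7). The helper name `precision_recall_at` and the labels and scores are illustrative assumptions, with 1 standing for a positive and 0 for a negative, and the sketch assumes at least one sample is predicted positive at each threshold.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when samples with score >= threshold are predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)


# Illustrative labels and scores (assumed data, not measurements).
y_true = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1]
scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3]

results = {t: precision_recall_at(y_true, scores, t) for t in (0.3, 0.5, 0.7)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.3f}, recall={r:.3f}")
```

On this data, precision rises from 5/9 to 4/5 to 1 as the threshold goes up, while recall falls from 5/6 to 4/6 to 2/6.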

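Is there a metric that captures a binary classifier's combined performance on Precision and Recall? Yes. As described above, different thresholds for predicting a positive give different (Precision, Recall) pairs. In many cases the classifier's predictions can be sorted, with the samples it considers most likely to be positive first and those it considers least likely last. Lowering the threshold one sample at a time in that order, we can compute the current Precision and Recall at each step. Plotting Recall on the horizontal axis and Precision on the vertical axis gives the Precision-Recall curve, or P-R plot for short:

[Figure: P-R curves of binary classifiers A, B, and C, with Recall on the horizontal axis and Precision on the vertical axis]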
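The sweep described above can be sketched as follows: sort the samples by score, then admit them one at a time as predicted positives and record (recall, precision) after each step. The function name `pr_points_by_rank` and the data are hypothetical, and the sketch treats tied scores as separate steps, which is a simplification.

```python
def pr_points_by_rank(y_true, scores):
    """Return a (recall, precision) pair after each sample, in descending score order.

    y_true holds 1 for a positive and 0 for a negative; at least one positive is assumed.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(y_true)
    points = []
    tp = 0
    for k, i in enumerate(order, start=1):
        tp += y_true[i]  # the top-k samples are now predicted positive
        points.append((tp / total_positives, tp / k))
    return points


# Hypothetical labels and scores, for illustration only.
points = pr_points_by_rank(y_true=[1, 0, 1, 1, 0], scores=[0.9, 0.8, 0.7, 0.4, 0.2])
for recall, precision in points:
    print(f"recall={recall:.3f}, precision={precision:.3f}")
```

Each pair is one point of the P-R plot. Recall never decreases along the sweep, while precision can move up or down.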

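The P-R plot shows a binary classifier's Precision and Recall at a glance. When comparing classifiers, if one classifier's P-R curve is completely enclosed by another's, the latter performs better; in the figure above, classifier A outperforms C. If two P-R curves cross, it is hard to say which classifier is better, as with A and B in the figure above.

Even so, people often still want to rank classifiers A and B. A reasonable metric in that case is the area under the P-R curve, which to some extent reflects the classifier's combined performance on Precision and Recall. This is AP (Average Precision). Put simply, it averages the Precision values along the P-R curve. For a continuous P-R curve it is computed as an integral, where $p(r)$ denotes the Precision at Recall $r$:

$AP = \int_0^1 p(r) \, dr$

In practice there are multi-class problems as well as binary ones. AP is normally computed for a single class, and mAP (mean Average Precision) is the mean of the AP values over all classes.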
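As a small illustration of that last point, mAP is just the arithmetic mean of the per-class AP values. The class names and AP values below are made up for the example, and `mean_average_precision` is a hypothetical helper name.

```python
def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values; expects a non-empty dict of class -> AP."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical per-class AP values, for illustration only.
ap_per_class = {"cat": 0.90, "dog": 0.75, "bird": 0.60}
m_ap = mean_average_precision(ap_per_class)
print(m_ap)
```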

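Moreover, in practice AP is not computed directly from the raw P-R curve. The curve is smoothed first: at each point on the curve, the Precision is replaced by the largest Precision found at that point or anywhere to its right:

[Figure: a P-R curve before and after smoothing]

As a formula, $p_{interp}(r) = \max_{r' \ge r} p(r')$. AP is then computed from the smoothed curve. One example is Interpolated AP (the AP calculation used in Pascal VOC 2008): on the smoothed P-R curve, take the Precision at the points that split the Recall axis from 0 to 1 into 10 equal parts (11 points including both endpoints) and use their mean as the final AP:

$AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{interp}(r)$

Of course, the smoothed curve can also be integrated directly.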
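Here is a minimal sketch of the smoothing step and the 11-point average, assuming the P-R points are given as two lists ordered by increasing recall. The function names and the sample points are hypothetical, and treating the interpolated precision as 0 when no point has recall at or above the sampled value is an assumption of this sketch.

```python
def interpolate_precisions(precisions):
    """Replace each precision by the maximum precision at that point or to its right.

    Assumes the points are ordered by increasing recall.
    """
    smoothed = list(precisions)
    for i in range(len(smoothed) - 2, -1, -1):
        smoothed[i] = max(smoothed[i], smoothed[i + 1])
    return smoothed


def eleven_point_ap(recalls, precisions):
    """Mean interpolated precision at recall = 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for j in range(11):
        r = j / 10
        # Interpolated precision at r: best precision among points with recall >= r.
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0  # assumption: 0 if none
    return total / 11


# Hypothetical P-R points, for illustration only.
recalls = [0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 0.5, 0.75, 0.6, 0.5]

smoothed = interpolate_precisions(precisions)
ap_11 = eleven_point_ap(recalls, precisions)
print(smoothed)
print(ap_11)
```

The dip at recall 0.4 (precision 0.5) is lifted to 0.75 by the smoothing, because a higher precision is available further to the right.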

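III. Worked Example and Code

Consider a binary classification problem. The whole process consists of the following steps:

(1) From Prediction Score to Class Label

Suppose there are two classes, Positive and Negative. Here are the ground-truth labels of 10 samples, stored in y_true: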

y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive"]

When these samples are fed to the model, it returns the following prediction scores, pred_scores:

pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3]

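Based on these scores we classify the samples, that is, assign each one a class label. First we choose a threshold for the prediction score by hand. A sample whose score is equal to or above the threshold is assigned to one class (usually the positive class, 1); otherwise it is assigned to the other class (usually the negative class, 0). Here we set the threshold to 0.5 and obtain the predicted labels y_pred: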

import numpy
import sklearn.metrics

pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3]
y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive"]

threshold = 0.5
y_pred = ["positive" if score >= threshold else "negative" for score in pred_scores]
print(y_pred)

Output:

['positive', 'negative', 'positive', 'positive', 'positive', 'positive', 'negative', 'negative', 'negative', 'negative']

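The variables y_true and y_pred now hold the ground-truth and predicted labels. From these labels, the confusion matrix, precision, and recall can be computed using the definitions given earlier.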

r = numpy.flip(sklearn.metrics.confusion_matrix(y_true, y_pred))
print(r)

precision = sklearn.metrics.precision_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
print(precision)

recall = sklearn.metrics.recall_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
print(recall)

Result:

# Confusion Matrix (From Left to Right & Top to Bottom: True Positive, False Negative, False Positive, True Negative)
[[4 2]
 [1 3]]

# Precision = 4/(4+1)
0.8

# Recall = 4/(4+2)
0.6666666666666666
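The four counts in the matrix can also be checked by hand with a plain loop. This is only a verification sketch: the helper name `confusion_counts` is hypothetical, and y_pred is copied from the printed output for a threshold of 0.5.

```python
def confusion_counts(y_true, y_pred, pos_label="positive"):
    """Count (TP, FN, FP, TN) by comparing each actual label with its prediction."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == pos_label:
            if predicted == pos_label:
                tp += 1
            else:
                fn += 1  # Type II Error: an actual positive was missed
        else:
            if predicted == pos_label:
                fp += 1  # Type I Error: a negative was flagged as positive
            else:
                tn += 1
    return tp, fn, fp, tn


y_true = ["positive", "negative", "negative", "positive", "positive",
          "positive", "negative", "positive", "negative", "positive"]
y_pred = ["positive", "negative", "positive", "positive", "positive",
          "positive", "negative", "negative", "negative", "negative"]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)
print(tp / (tp + fp), tp / (tp + fn))  # precision, recall
```

The counts come out in the same left-to-right, top-to-bottom order as the flipped matrix: TP, FN, FP, TN.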

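(2) Precision-Recall Curve

Because precision and recall both matter, a precision-recall curve is used to show the trade-off between the two at different thresholds. The curve helps in choosing a threshold that keeps both metrics as high as possible.

Creating a precision-recall curve needs a few inputs:

1. The ground-truth labels. 2. The prediction scores of the samples. 3. A set of thresholds for converting prediction scores into class labels.

The following snippet creates the y_true list for the ground-truth labels, the pred_scores list for the prediction scores, and the thresholds array for the different thresholds (here from 0.2 up to but not including 0.7, in steps of 0.05).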

import numpy

y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive", "positive", "positive", "positive", "negative", "negative", "negative"]

pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3, 0.7, 0.5, 0.8, 0.2, 0.3, 0.35]

thresholds = numpy.arange(start=0.2, stop=0.7, step=0.05)

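Because thresholds holds 10 values, 10 precision values and 10 recall values will be produced. The next function, precision_recall_curve(), takes the ground-truth labels, the prediction scores, and the thresholds. It returns two lists of equal length holding the precision and recall values: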

import sklearn.metrics

def precision_recall_curve(y_true, pred_scores, thresholds):
    precisions = []
    recalls = []
    
    for threshold in thresholds:
        y_pred = ["positive" if score >= threshold else "negative" for score in pred_scores]

        precision = sklearn.metrics.precision_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
        recall = sklearn.metrics.recall_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
        
        precisions.append(precision)
        recalls.append(recall)

    return precisions, recalls

Calling it with the data above gives:

precisions, recalls = precision_recall_curve(y_true=y_true, pred_scores=pred_scores,thresholds=thresholds)

Output:

# Precision
[0.5625,
 0.5714285714285714,
 0.5714285714285714,
 0.6363636363636364,
 0.7,
 0.875,
 0.875,
 1.0,
 1.0,
 1.0]
# Recall
[1.0,
 0.8888888888888888,
 0.8888888888888888,
 0.7777777777777778,
 0.7777777777777778,
 0.7777777777777778,
 0.7777777777777778,
 0.6666666666666666,
 0.5555555555555556,
 0.4444444444444444]

Given two lists of equal length, their values can be plotted in a 2D chart as follows:

import matplotlib.pyplot
matplotlib.pyplot.plot(recalls, precisions, linewidth=4, color="red")
matplotlib.pyplot.xlabel("Recall", fontsize=12, fontweight='bold')
matplotlib.pyplot.ylabel("Precision", fontsize=12, fontweight='bold')
matplotlib.pyplot.title("Precision-Recall Curve", fontsize=15, fontweight="bold")
matplotlib.pyplot.show()

[Figure: Precision-Recall curve, with Recall on the horizontal axis and Precision on the vertical axis]

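As the plot shows, precision falls as recall rises. The reason is that as more samples are classified as positive (high recall), a smaller share of those positive predictions is correct (low precision). This is expected: the more samples the model labels positive, the more likely it is to include negatives among them.

The precision-recall curve makes it easy to spot the point where precision and recall are both high. From the plot above, the best point is (recall, precision) = (0.778, 0.875). A more systematic way is to use the F1-score (formula given earlier):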

f1 = 2 * ((numpy.array(precisions) * numpy.array(recalls)) / (numpy.array(precisions) + numpy.array(recalls)))

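In the F1 list, the highest score is 0.82352941. It first appears as the 6th element (index 5), and the 7th element ties with it. The 6th elements of the recall and precision lists are 0.778 and 0.875 respectively, and the corresponding threshold is 0.45: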

# F1-score
[0.72, 
 0.69565217, 
 0.69565217, 
 0.7,
 0.73684211,
 0.82352941, 
 0.82352941, 
 0.8, 
 0.71428571, 
 0.61538462]
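Picking the best threshold from these lists can be automated. The sketch below recomputes F1 in plain Python from the precision and recall values printed earlier and takes the index of the first maximum; the threshold list is rebuilt by hand with `round` so that the values are exact, which is a choice of this sketch rather than part of the original code.

```python
# Values copied from the printed precision and recall lists.
precisions = [0.5625, 0.5714285714285714, 0.5714285714285714, 0.6363636363636364,
              0.7, 0.875, 0.875, 1.0, 1.0, 1.0]
recalls = [1.0, 0.8888888888888888, 0.8888888888888888, 0.7777777777777778,
           0.7777777777777778, 0.7777777777777778, 0.7777777777777778,
           0.6666666666666666, 0.5555555555555556, 0.4444444444444444]
thresholds = [round(0.2 + 0.05 * i, 2) for i in range(10)]  # 0.2, 0.25, ..., 0.65

f1_scores = [2 * p * r / (p + r) for p, r in zip(precisions, recalls)]

# max() returns the first index with the highest F1, so ties go to the lower threshold.
best_index = max(range(len(f1_scores)), key=lambda i: f1_scores[i])
print(best_index, thresholds[best_index], f1_scores[best_index])
```

Because indices 5 and 6 tie, either threshold (0.45 or 0.5) gives the same F1 on this data; the sketch simply reports the first.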

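The figure below marks in blue the point that gives the best balance between recall and precision. In short, the best threshold for balancing precision and recall is 0.45, where precision is 0.875 and recall is 0.778.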

matplotlib.pyplot.plot(recalls, precisions, linewidth=4, color="red", zorder=0)
matplotlib.pyplot.scatter(recalls[5], precisions[5], zorder=1, linewidth=6)

matplotlib.pyplot.xlabel("Recall", fontsize=12, fontweight='bold')
matplotlib.pyplot.ylabel("Precision", fontsize=12, fontweight='bold')
matplotlib.pyplot.title("Precision-Recall Curve", fontsize=15, fontweight="bold")
matplotlib.pyplot.show()

[Figure: Precision-Recall curve with the best-F1 point marked in blue]

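(3) Average Precision (AP)

AP is computed with the formula below: a loop runs over all the precision/recall values, takes the difference between the current recall and the next recall, and multiplies it by the current precision. In other words, Average Precision is a weighted sum of the precision at each threshold, where the weight is the change in recall. This is the "partition and sum" step familiar from calculus:

$AP = \sum_{k=0}^{n-1} [Recalls(k) - Recalls(k+1)] \times Precisions(k)$

where Recalls(n) = 0, Precisions(n) = 1, and n is the number of thresholds. In other words, 0 is appended to the recalls list and 1 to the precisions list.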

AP = numpy.sum((recalls[:-1] - recalls[1:]) * precisions[:-1])

Here is the complete code for computing AP:

import numpy
import sklearn.metrics

def precision_recall_curve(y_true, pred_scores, thresholds):
    precisions = []
    recalls = []
    
    for threshold in thresholds:
        y_pred = ["positive" if score >= threshold else "negative" for score in pred_scores]

        precision = sklearn.metrics.precision_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
        recall = sklearn.metrics.recall_score(y_true=y_true, y_pred=y_pred, pos_label="positive")
        
        precisions.append(precision)
        recalls.append(recall)

    return precisions, recalls

y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative", "positive", "negative", "positive", "positive", "positive", "positive", "negative", "negative", "negative"]
pred_scores = [0.7, 0.3, 0.5, 0.6, 0.55, 0.9, 0.4, 0.2, 0.4, 0.3, 0.7, 0.5, 0.8, 0.2, 0.3, 0.35]
thresholds=numpy.arange(start=0.2, stop=0.7, step=0.05)

precisions, recalls = precision_recall_curve(y_true=y_true, 
                                             pred_scores=pred_scores, 
                                             thresholds=thresholds)

precisions.append(1)
recalls.append(0)

precisions = numpy.array(precisions)
recalls = numpy.array(recalls)

AP = numpy.sum((recalls[:-1] - recalls[1:]) * precisions[:-1])
print(AP)
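The same sum can be written out term by term without numpy, which makes it easier to see which thresholds actually contribute. This is a sketch using the precision and recall values printed in section (2), with 1 and 0 appended as described above. Only the steps where recall changes produce a nonzero term.

```python
# Values copied from the printed lists, with Precisions(n) = 1 and Recalls(n) = 0 appended.
precisions = [0.5625, 0.5714285714285714, 0.5714285714285714, 0.6363636363636364,
              0.7, 0.875, 0.875, 1.0, 1.0, 1.0] + [1.0]
recalls = [1.0, 0.8888888888888888, 0.8888888888888888, 0.7777777777777778,
           0.7777777777777778, 0.7777777777777778, 0.7777777777777778,
           0.6666666666666666, 0.5555555555555556, 0.4444444444444444] + [0.0]

# One term per threshold: (current recall - next recall) * current precision.
terms = [(recalls[k] - recalls[k + 1]) * precisions[k] for k in range(len(recalls) - 1)]
ap = sum(terms)

for k, term in enumerate(terms):
    if term != 0:
        print(f"k={k}: {term:.4f}")
print(ap)  # approximately 0.8899
```

The largest term comes from the last step, where recall drops from 4/9 to the appended 0 at a precision of 1.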
