【人工智能概论】 XGBoost应用——特征筛选

这篇具有很好参考价值的文章主要介绍了【人工智能概论】 XGBoost应用——特征筛选。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

【人工智能概论】 XGBoost应用——特征筛选

换一个评价指标，特征排序结果就会不一样，甚至同样的数据同样的方法多次执行得到的结果也不是完全一样，特征筛选这件事见仁见智，要理性看待，但确实可以提供一种交叉验证的角度。

一. 梯度提升算法是如何计算特征重要性的？

使用梯度提升算法的好处是在提升树被创建后，可以相对直接地得到每个特征的重要性得分。
一般来说，一个特征越多的被用来在模型中构建决策树，它的重要性就相对越高。
在单个决策树中通过每个特征分裂点改进性能度量的量来计算特征的重要性。由节点负责加权和记录次数，也就是说一个特征对分裂点改进性能度量越大（或越靠近根节点），权值越大；被越多提升树所选择，特征越重要。性能度量可以是选择分裂节点的Gini纯度，也可以是其他度量函数。
最终将一个特征在所有提升树中的结果进行加权求和后然后平均，得到重要性得分。

二. 动手绘制特征的重要性

2.1 特征关键度分数 feature_importances_

通过xgboost模块提供的方法构建一个XGBoost模型，在训练的过程中模型会自动给各特征的重要性打分。
这些特征重要性分数可以通过模型成员变量 feature_importances_ 获得。
可以将它们打印出来：

print(model.feature_importances_)

也可以将它们绘制于条形图上：

# from matplotlib import pyplot as plt
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()

2.2 应用举例

训练一个XGBoost模型，并展示特征的重要性。

from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot as plt

# load data
dataset = loadtxt('machine-1-1.csv', delimiter=",")

# split data into X and y
X = dataset[:,1:39]
y = dataset[:,39]

# fit model no training data
model = XGBClassifier()
model.fit(X, y)

# feature importance
print(model.feature_importances_)

# plot
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()

得分情况与直方图显示

2.3 特征关键度排序可视化显示 plot_importance

前面用条形图显示的方法很不错，但是并没有按照重要程度进行排序，所幸xgboost提供内置的绘图函数可以实现这个功能。
xgboost库提供了plot_importance（）函数，其可以按重要性顺序绘制要素。

# plot feature importance
# from matplotlib import pyplot as plt
plot_importance(model)
plt.show()

2.4 应用举例

还是上面的数据

# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot as plt
# load data
dataset = loadtxt('machine-1-1.csv', delimiter=",")
# split data into X and y
X = dataset[:,1:39]
y = dataset[:,39]
# fit model no training data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model,importance_type='gain')
plt.show()

得分情况

2.5 解决plot_importance和feature_importance获得的特征排序不同

在使用xgboost提供的plot_importance和feature_importance方法获得的特征排序时，可能会出现获得的排名不一样的问题。
为什么会出现？

因为plot_importance默认的importance_type是’weight’，而feature_importance_默认的importance_type是’gain’。

怎么办？

换成一样的就行了。

xgboost里面的feature importance是怎么计算的呢？

importance type一共有三种类型：weight, gain, cover

weight 是特征在提升树里出现的次数，即所有树中，某个特征作为分裂节点的次数。

gain 是在所有树中，某个特征在分裂后带来的平均信息增益。

cover 是与特征相关的记录(observation)的相对数量。例如，如果有100条记录(observation)，4个特征(feature) 和3棵树(tree)，并且假设特征1分别用于确定树1，树2和树3中10、5和2个记录的叶节点；则cover指标会将该特征的coverage计算为10 + 5 + 2 = 17个记录。这将针对所有4个特征进行计算，其cover将以所有特征的cover指标的17%表示。

换一个评价指标，结果就会不一样，这其实告诉我们一个什么道理，特征筛选这件事见仁见智，要理性看待，但确实可以提供一种交叉验证的角度。

三. 基于评分的特征选择

3.1 基本原理

特征重要性评分可用于scikit-learn中的特征选择。
这是通过使用SelectFromModel类完成的，该类采用一个模型，并且可以将数据集转换为具有选定要素的子集。
该类可以采用预先训练好的模型，如在整个训练数据集上进行训练的模型。
然后，它可以使用阈值来确定要选择的特征。即当在SelectFromModel实例上调用transform()方法时，该阈值被用于在训练集和测试集上一致性选择相同特征。

3.2 实际举例

首先，在训练集上训练xgboost模型，并在测试集上检测效果；
然后，将模型封装在一个SelectFromModel实例中，通过该实例与特征重要性评分来选择特征；
最后，用所选择的特征子集训练模型，并在相同的特征方案下在测试集上评估效果。
核心代码：

# 用阈值选择特征
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# 训练模型
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# 评估模型
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)

可以通过测试多个阈值，获取多组特征子集，进行性能与成本之间的衡量。
完整代码：

# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel

# load data
dataset = loadtxt('machine-1-1.csv', delimiter=",")

# split data into X and y
X = dataset[:,1:39]
Y = dataset[:,39]

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)

# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:

    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

结果展示：
从直觉来说随着阈值的变大，特征数量会减少，模型的准确率也应随之下降。
这是有道理的，因此就需要在模型复杂度（特征数量）和准确率做一个权衡。
但是有些情况（就像上边），特征数量的减少反而会让准确率升高，或许因为这些被剔除特征是噪声。文章来源地址https://www.toymoban.com/news/detail-818205.html

四. XGBoost做回归任务

总体与分类的差不多，只是细节需要调整。
accuracy_score -> XGBRegressor；XGBClassifier -> mean_squared_error

from numpy import loadtxt
from numpy import sort
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectFromModel
import pandas as pd

# load data
dataset = pd.read_csv('diabetes.csv', header=0)
dataset = dataset.iloc[:,1:].values

# split data into X and y
X = dataset[:,:-2]
Y = dataset[:,-2]

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

# fit model on all training data
model = XGBRegressor()
model.fit(X_train, y_train)

# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = mean_squared_error(y_test, predictions)
print("Accuracy: %.2f" % accuracy)

# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    
    # train model
    selection_model = XGBRegressor()
    selection_model.fit(select_X_train, y_train)
    
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = mean_squared_error(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f" % (thresh, select_X_train.shape[1], accuracy))

也可以继续看特征关键程度

from matplotlib import pyplot as plt
from xgboost import plot_importance

# feature importance
print(model.feature_importances_)

# plot
plt.bar(range(len(model.feature_importances_)), model.feature_importances_)
plt.show()

# plot feature importance
plot_importance(model,importance_type='gain')
plt.show()

五. 其它内容

5.1 参数的问题

XGBoost有多种超参数，它们对模型的性能有很大的影响，调参是门艺术。

5.2 网格调参法

xgboost既可以用来做二分类、多分类，也可以用来做回归，除了数据特征以外，对模型调参也是影响模型性能的关键环节，一般是按一定的步骤、网格搜索最优参数，如下两篇文章一个是用来分类，一个是用来预测数值的案例，并且详细给出了调参的步骤和代码：
分类器XGBClassifier的调参
回归器XGBRegressor的调参