Machine Learning: Linear Regression



Linear regression, prediction, and model evaluation implemented in Python.

1 Model Specification

We use the Boston housing dataset as an example. MEDV is the label and all remaining columns are features:

CRIM per capita crime rate by town

ZN proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS proportion of non-retail business acres per town

CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

NOX nitric oxides concentration (parts per 10 million)

RM average number of rooms per dwelling

AGE proportion of owner-occupied units built prior to 1940

DIS weighted distances to five Boston employment centres

RAD index of accessibility to radial highways

TAX full-value property-tax rate per $10,000

PTRATIO pupil-teacher ratio by town

B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

LSTAT % lower status of the population

MEDV Median value of owner-occupied homes in $1000’s

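To predict MEDV, we build the following linear regression model:
$MEDV = a_0 + a_1 CRIM + a_2 ZN + \dots + a_{13} LSTAT + u$
where $u$ is the disturbance term. Unlike in econometrics, no assumptions about the properties of $u$ are needed here, because the goal is prediction rather than inference on the coefficients.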
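To make the model concrete, the following minimal sketch estimates the coefficients $a_0, a_1, \dots$ by ordinary least squares and checks that scikit-learn arrives at the same values. It runs on synthetic data rather than the Boston file; the names `X_demo`, `y_demo`, `true_coef`, and `design` are assumptions introduced only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data (assumption: 200 rows, 3 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = 4.0 + X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept a_0 is estimated with the slopes.
design = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

# LinearRegression solves the same least-squares problem.
ols = LinearRegression().fit(X_demo, y_demo)
print('intercept matches:', np.allclose(beta[0], ols.intercept_))
print('slopes match:', np.allclose(beta[1:], ols.coef_))
```

Nothing in this computation relies on a distributional assumption about the disturbance term; least squares only minimizes the sum of squared residuals.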

2 Training the Model

import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv('boston.csv')
# Label
y = data['MEDV']
# Features: the first 13 columns
x = data.iloc[0:506, 0:13]
# Split into training and test sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
# Fit a linear regression on the training set
model = LinearRegression()
model.fit(X_train, y_train)
# Regression coefficients
print(f'Regression coefficients\n{model.coef_}\n')

# [-1.21310401e-01  4.44664254e-02  1.13416945e-02  2.51124642e+00
#  -1.62312529e+01  3.85906801e+00 -9.98516565e-03 -1.50026956e+00
#   2.42143466e-01 -1.10716124e-02 -1.01775264e+00  6.81446545e-03
#  -4.86738066e-01]
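A bare coefficient array is hard to read because it does not say which value belongs to which feature. The sketch below shows one possible way to label the coefficients and to report the intercept, which `coef_` does not include. The data are a hypothetical miniature set; only the column names are borrowed from the article, and `X_small`, `y_small`, `fitted`, and `coef_table` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical miniature data set; values are random, not real housing data.
rng = np.random.default_rng(1)
X_small = pd.DataFrame(rng.normal(size=(100, 3)), columns=['CRIM', 'RM', 'LSTAT'])
y_small = (20 - 0.5 * X_small['CRIM'] + 4 * X_small['RM']
           - 0.7 * X_small['LSTAT'] + rng.normal(scale=0.5, size=100))

fitted = LinearRegression().fit(X_small, y_small)

# Pair each coefficient with its column name; the intercept is stored separately.
coef_table = pd.Series(fitted.coef_, index=X_small.columns, name='coefficient')
print(coef_table)
print('intercept:', fitted.intercept_)
```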

3 Prediction

# Predict on the test set
pred = model.predict(X_test)
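The imports at the top of the training code include `mean_squared_error` and `r2_score`, which can be used to score test-set predictions. The sketch below shows one way to do that on synthetic data; the names `X_syn`, `y_syn`, `reg`, and `pred_te` are assumptions made for this example, and the numbers it prints say nothing about the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption: 300 rows, 4 features).
rng = np.random.default_rng(2)
X_syn = rng.normal(size=(300, 4))
y_syn = 1.0 + X_syn @ np.array([3.0, -2.0, 0.0, 1.5]) + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
pred_te = reg.predict(X_te)

# Out-of-sample error measures computed on the held-out 30%.
test_mse = mean_squared_error(y_te, pred_te)
test_rmse = np.sqrt(test_mse)   # same units as the label
test_r2 = r2_score(y_te, pred_te)
print(f'test MSE={test_mse:.3f}  RMSE={test_rmse:.3f}  R^2={test_r2:.3f}')
```

With the article's variables, the corresponding calls would be `mean_squared_error(y_test, pred)` and `r2_score(y_test, pred)`.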

4 Cross-Validation

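Ten-fold or five-fold cross-validation is commonly used to evaluate a model's predictive ability. The smaller the mean squared error (MSE), the better the predictions. Applying cross-validation to the full sample: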

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import RepeatedKFold
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
# scikit-learn returns the negated MSE, so flip the sign
scores_mse = -cross_val_score(model, x, y, cv=kfold, scoring='neg_mean_squared_error')
print('MSE of each fold:', scores_mse)
print('Mean MSE over 10-fold cross-validation:', scores_mse.mean())
# MSE of each fold: [20.54427466 24.47650033  9.49619045 48.63290854 12.11906454
#  18.14673907 17.53359386 38.67822303 34.22829546 13.73556966]
# Mean MSE over 10-fold cross-validation: 23.759135960073124
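To show what `cross_val_score` does internally, the sketch below writes out the K-fold loop by hand: for each fold, fit on the other nine folds, predict the held-out fold, and record its MSE. It uses synthetic data, and the names `X_cv`, `y_cv`, `folds`, `manual_mse`, and `auto_mse` are illustrative assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption: 120 rows, 3 features).
rng = np.random.default_rng(3)
X_cv = rng.normal(size=(120, 3))
y_cv = X_cv @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=120)

folds = KFold(n_splits=10, shuffle=True, random_state=1)
base = LinearRegression()

manual_mse = []
for train_idx, test_idx in folds.split(X_cv):
    # Fit a fresh copy of the model on the nine training folds.
    fold_model = clone(base).fit(X_cv[train_idx], y_cv[train_idx])
    # Score it on the single held-out fold.
    fold_pred = fold_model.predict(X_cv[test_idx])
    manual_mse.append(mean_squared_error(y_cv[test_idx], fold_pred))
manual_mse = np.array(manual_mse)

# The library helper should reproduce the same ten values.
auto_mse = -cross_val_score(base, X_cv, y_cv, cv=folds,
                            scoring='neg_mean_squared_error')
print('manual loop matches cross_val_score:', np.allclose(manual_mse, auto_mse))
```

Because `random_state` is a fixed integer, the splitter produces the same folds in both passes, which is what makes the comparison meaningful.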

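To be conservative, the 10-fold cross-validation can be repeated M times, which yields 10M MSE values in total. The code below uses M = 10.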

rkfold = RepeatedKFold(n_splits=10, n_repeats=10, random_state=1234567)
scores_mse = -cross_val_score(model, x, y, cv=rkfold, scoring='neg_mean_squared_error')
print('Mean MSE over 10-fold cross-validation repeated 10 times:\n', scores_mse.mean())
# Mean MSE over 10-fold cross-validation repeated 10 times:
#  23.719695852306927

# Histogram of the 100 MSE values
# (pass the 1-D array directly so that the color argument applies)
sns.histplot(scores_mse, color='green', kde=True)
plt.xlabel('MSE')
plt.title('10-fold CV Repeated 10 Times')
plt.grid()
plt.show()
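Besides plotting all scores at once, it can help to see how much the 10-fold average moves from one repetition to the next. The sketch below, on synthetic data, groups the scores from `RepeatedKFold` by repetition; it assumes the scores come back repeat by repeat, which matches how the splitter yields its folds. The names `X_rep`, `y_rep`, `all_mse`, and `per_repeat` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data (assumption: 150 rows, 3 features).
rng = np.random.default_rng(4)
X_rep = rng.normal(size=(150, 3))
y_rep = X_rep @ np.array([0.5, -1.5, 2.0]) + rng.normal(size=150)

n_splits, n_repeats = 10, 5
rk = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
all_mse = -cross_val_score(LinearRegression(), X_rep, y_rep, cv=rk,
                           scoring='neg_mean_squared_error')

# One row per repetition, one column per fold.
per_repeat = all_mse.reshape(n_repeats, n_splits).mean(axis=1)
print('number of scores:', all_mse.size)
print('mean MSE of each repetition:', per_repeat)
print('spread across repetitions:', per_repeat.std())
```

A small spread across repetitions suggests that the estimate does not depend much on how the folds happened to be drawn.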


When the sample is small, leave-one-out cross-validation (LeaveOneOut) can be used instead:

loo = LeaveOneOut()
scores_mse = -cross_val_score(model, x, y, cv=loo, scoring='neg_mean_squared_error')
print('Mean MSE over leave-one-out cross-validation:\n', scores_mse.mean())
# Mean MSE over leave-one-out cross-validation:
#  23.725745519476153
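Leave-one-out can be read as the extreme case of K-fold in which the number of folds equals the number of observations, so the model is refit once per row. The sketch below checks this equivalence on a small synthetic sample; `X_loo`, `y_loo`, `loo_mse`, and `kf_mse` are names made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Small synthetic sample (assumption: 40 rows, 2 features).
rng = np.random.default_rng(5)
X_loo = rng.normal(size=(40, 2))
y_loo = X_loo @ np.array([1.0, -2.0]) + rng.normal(size=40)

est = LinearRegression()
loo_mse = -cross_val_score(est, X_loo, y_loo, cv=LeaveOneOut(),
                           scoring='neg_mean_squared_error')
# Unshuffled K-fold with one row per fold holds out rows 0, 1, 2, ... in turn.
kf_mse = -cross_val_score(est, X_loo, y_loo, cv=KFold(n_splits=len(X_loo)),
                          scoring='neg_mean_squared_error')

print('number of fits:', loo_mse.size)
print('same scores as n-fold CV:', np.allclose(loo_mse, kf_mse))
```

The cost grows with the sample size, since each observation requires its own fit, which is one reason the method is usually reserved for small samples.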