Machine Learning Starter Example: California Housing Price Prediction, Part 4 (More Hyperparameter Tuning + Evaluation)

Randomized Search

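If you only need to try a handful of hyperparameter values, as in the earlier example, grid search is enough. When the search space is very large, use RandomizedSearchCV instead. It has three advantages:

It supports much larger parameter ranges, because each parameter can be given as a distribution rather than an explicit list of values.
It usually finds a good hyperparameter combination faster. Instead of enumerating every combination, it randomly samples a fixed number of combinations from the specified ranges and evaluates each one.
It is more flexible: you can set the upper and lower bounds of each parameter, and the number of sampled combinations, to fit the compute resources you have.
The drawback is that it may miss the true optimum and return only an acceptable "best". If time allows, GridSearchCV is still an option.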
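
To make the difference concrete, here is a minimal sketch that counts how many combinations an exhaustive grid would contain for the ranges used in this article, and then draws 20 random combinations from the same ranges. It uses scikit-learn's ParameterGrid and ParameterSampler helpers directly; the random_state value is an arbitrary choice for reproducibility and is not part of the original code.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Exhaustive grid over the same ranges as the randomized search in this article
grid = ParameterGrid({
    "n_estimators": list(range(1, 200)),  # 1..199
    "max_features": list(range(1, 8)),    # 1..7
})
print(len(grid))  # 199 * 7 = 1393 combinations

# The same ranges expressed as distributions (high is exclusive)
distributions = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}

# Draw 20 random combinations; random_state=42 is an arbitrary seed
sampled = list(ParameterSampler(distributions, n_iter=20, random_state=42))
print(len(sampled))  # 20
print(sampled[:3])   # first three sampled combinations
```

With 5-fold cross-validation, the full grid would need 1393 * 5 model fits, while the random sample needs only 20 * 5.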

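    import numpy as np
    from scipy.stats import randint
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    forest_reg = RandomForestRegressor()

    # randint(low=1, high=101).rvs(5) returns something like array([64, 98, 35,  2, 72]).
    # Pass the distribution itself; the search draws from it, so do not fix a sample size here.
    param_grid = {
        # 'n_estimators': list(range(1, 200)),
        'n_estimators': randint(low=1, high=200),  # samples 1..199 (high is exclusive)
        'max_features': randint(low=1, high=8),    # samples 1..7
    }

    # n_iter is the number of parameter combinations to sample
    grid_search = RandomizedSearchCV(forest_reg, param_grid, cv=5,
                                     n_iter=20,
                                     scoring="neg_mean_squared_error",
                                     return_train_score=True)
    grid_search.fit(housing_prepared, housing_labels)
    print(grid_search.best_params_)
    print(grid_search.best_estimator_)
    # The score is a negative MSE, so negate it before taking the square root
    print(np.sqrt(-grid_search.best_score_))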

{'max_features': 6, 'n_estimators': 199}
RandomForestRegressor(max_features=6, n_estimators=199)
49012.16057617387

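Here n_iter is the total number of parameter combinations to try. With n_iter=20 and cv=5, the search performs 100 model fits, plus one final refit of the best combination on the full training set. If n_iter is too small, the search may miss the best hyperparameter combination; if it is too large, the search takes longer and consumes more compute.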
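
The effect of n_iter can be checked on a small example. The sketch below is not the housing pipeline: it uses a synthetic dataset from make_regression, a narrower n_estimators range, and cv=3 so that it runs in a few seconds. The helper name run_search and all seed values are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the California housing set)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

distributions = {
    "n_estimators": randint(low=5, high=30),
    "max_features": randint(low=1, high=8),
}

def run_search(n_iter):
    """Hypothetical helper: run one randomized search with the given budget."""
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=0),
        distributions,
        n_iter=n_iter,
        cv=3,
        scoring="neg_mean_squared_error",
        random_state=0,
    )
    search.fit(X, y)
    return search

small = run_search(3)
large = run_search(10)

# The number of evaluated candidates equals n_iter
print(len(small.cv_results_["params"]), np.sqrt(-small.best_score_))
print(len(large.cv_results_["params"]), np.sqrt(-large.best_score_))
```

Each candidate costs cv fits, so the second search does roughly three times the work of the first.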

Evaluating the Model

Inspecting the Importance of Each Column for Prediction

    from sklearn.model_selection import GridSearchCV

    param_grid = [
        {'n_estimators': [3, 10, 30, 50], 'max_features': [2, 4, 6, 8, None]},
        {'bootstrap': [False], 'n_estimators': [3, 10, 30], 'max_features': [2, 3, 4, 8]}
    ]
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring="neg_mean_squared_error",
                               return_train_score=True)

    grid_search.fit(housing_prepared, housing_labels)
    print(grid_search.best_params_)
    print(grid_search.best_estimator_)
    print(np.sqrt(-grid_search.best_score_))

    # Collect the column labels
    housing_num = housing.drop("ocean_proximity", axis=1)
    num_attribs = list(housing_num)
    extra_attribs = ["rooms_per_household", "pop_per_household", "bedrooms_per_room"]
    # Relative importance of each column for making accurate predictions
    feature_importances = grid_search.best_estimator_.feature_importances_
    # I modified the data-preparation function so that it also returns full_pipeline
    # Look up the categories seen by the 'cat' transformer inside the pipeline
    cat_encoder = full_pipeline.named_transformers_['cat']
    cat_one_hot_attribs = list(cat_encoder.categories_[0])
    # Final column names = numeric columns + the three added columns + one-hot columns
    attributes = num_attribs + extra_attribs + cat_one_hot_attribs
    print(sorted(zip(feature_importances, attributes), reverse=True))

{'bootstrap': False, 'max_features': 8, 'n_estimators': 30}
RandomForestRegressor(bootstrap=False, max_features=8, n_estimators=30)
49442.37738967349
[(0.3250563395483288, 'median_income'), 
(0.1633435907899842, 'INLAND'), 
(0.11059555286375254, 'pop_per_household'), 
(0.08114145071753134, 'longitude'), 
(0.0728049997803568, 'latitude'), 
(0.07264703358828413, 'bedrooms_per_room'), 
(0.06346893798818128, 'rooms_per_household'), 
(0.04130518938735756, 'housing_median_age'), 
(0.014117547726336705, 'total_rooms'), 
(0.01405138434431168, 'population'), 
(0.013966918312688084, 'total_bedrooms'), 
(0.013656643753704638, 'households'), 
(0.009607652315968867, '<1H OCEAN'), 
(0.002484053857680537, 'NEAR OCEAN'), 
(0.001674961006904646, 'NEAR BAY'), 
(7.774401862815335e-05, 'ISLAND')]

Once you know the importances, you can drop some of the less important columns, or re-engineer them so that they become more informative.
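
As a rough illustration of the first option, the sketch below keeps only the top-k features by importance. The (importance, name) pairs are copied from the output above and rounded to four decimals; the cutoff k = 8 is a hypothetical choice, not something the original analysis settled on.

```python
# (importance, name) pairs taken from the printed output, rounded to 4 decimals
importances = [
    (0.3251, "median_income"), (0.1633, "INLAND"),
    (0.1106, "pop_per_household"), (0.0811, "longitude"),
    (0.0728, "latitude"), (0.0726, "bedrooms_per_room"),
    (0.0635, "rooms_per_household"), (0.0413, "housing_median_age"),
    (0.0141, "total_rooms"), (0.0141, "population"),
    (0.0140, "total_bedrooms"), (0.0137, "households"),
    (0.0096, "<1H OCEAN"), (0.0025, "NEAR OCEAN"),
    (0.0017, "NEAR BAY"), (0.0001, "ISLAND"),
]

k = 8  # hypothetical cutoff: how many features to keep

# Sort from most to least important and keep the first k names
ranked = sorted(importances, reverse=True)
top_k = [name for _, name in ranked[:k]]

# Share of total importance carried by the kept features
covered = sum(score for score, _ in ranked[:k])

print(top_k)
print(round(covered, 4))
```

Whether dropping the remaining columns actually helps should be checked by retraining and comparing cross-validation scores, since importance alone does not guarantee that a column is redundant.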

Evaluating on the Test Set

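    import numpy as np
    from sklearn.metrics import mean_squared_error
    # Use the best model found by the search directly
    final_model = grid_search.best_estimator_
    # Split the test set into features and labels
    X_test = test_set.drop("median_house_value", axis=1)
    y_test = test_set["median_house_value"].copy()
    # Transform the test set with the pipeline already fitted on the training data.
    # Call transform(), not fit_transform(), so that no test-set statistics leak in.
    X_test_prepared = full_pipeline.transform(X_test)
    # Predict with the model
    final_predictions = final_model.predict(X_test_prepared)
    # Compute the RMSE
    final_mse = mean_squared_error(y_test, final_predictions)
    final_rmse = np.sqrt(final_mse)
    print(final_rmse)

    # Compute a 95% confidence interval for the RMSE
    from scipy import stats
    confidence = 0.95
    squared_errors = (final_predictions - y_test) ** 2
    # t-interval around the mean squared error; the square root converts it to RMSE units
    interval = np.sqrt(stats.t.interval(confidence, len(squared_errors)-1,
                                        loc=squared_errors.mean(),
                                        scale=stats.sem(squared_errors)))
    print(interval)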
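
To show what the stats.t.interval call computes, here is a small self-contained sketch that reproduces the same interval by hand as mean ± t * sem. The data is synthetic (random numbers standing in for y_test and final_predictions), so the printed values say nothing about the housing model; the names y_true and y_pred are placeholders.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-ins for y_test and final_predictions (not the housing data)
y_true = rng.normal(200_000, 50_000, size=500)
y_pred = y_true + rng.normal(0, 48_000, size=500)

squared_errors = (y_pred - y_true) ** 2
n = len(squared_errors)
mean = squared_errors.mean()
sem = stats.sem(squared_errors)  # sample std (ddof=1) divided by sqrt(n)

confidence = 0.95
# Library call, same form as in the article's code
lib = np.sqrt(stats.t.interval(confidence, n - 1, loc=mean, scale=sem))

# Same interval by hand: mean +/- t_crit * sem, then square root for RMSE units
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
manual = np.sqrt([mean - t_crit * sem, mean + t_crit * sem])

print(lib)
print(manual)
```

The interval is built on the mean squared error and only then converted with a square root, which is why it is not symmetric around the RMSE itself.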