Otto Product Classification with Random Forest

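Dataset Overview

The Otto Group dataset comes from the "Otto Group Product Classification Challenge". According to the competition description: The Otto Group is one of the world's largest e-commerce companies, with subsidiaries in more than 20 countries. We sell millions of products worldwide every day, and several thousand products are added to our product line.

A consistent analysis of the performance of our products is crucial. However, because of our diverse global infrastructure, many identical products get classified differently. The quality of our product analysis therefore depends heavily on the ability to accurately group similar products. The better the classification, the more insight we gain into our product range.

For this competition, we have provided a dataset with 93 features for more than 200,000 products. The objective is to build a predictive model that can distinguish between our main product categories. The winning models will be open sourced.

Otto Group product classification dataset:

Target: 9 product categories
Features: 93 features, all integer-valued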

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV
%matplotlib inline

Loading the Data

Check the current working directory

os.path.abspath('.')

Read the training data

data = pd.read_csv("./otto-group-product-classification-challenge/train.csv")
data.head()

	id	feat_1	feat_4	feat_5	feat_6	feat_7	feat_8	...	feat_85	feat_86	feat_87	feat_90	target
0	1	1	0	0	0	0	0	...	1	0	0	0	Class_1
1	2	0	0	0	0	0	1	...	0	0	0	0	Class_1
2	3	0	0	0	0	0	1	...	0	0	0	0	Class_1
3	4	1	1	6	1	5	0	...	0	1	2	0	Class_1
4	5	0	0	0	0	0	0	...	1	0	0	1	Class_1

5 rows × 95 columns

# Data dimensions
data.shape

(61878, 95)

Feature Analysis

# Descriptive statistics
data.describe()

	id	feat_1	feat_2	feat_3	feat_4	feat_5	feat_6	feat_7	feat_8	feat_9	...	feat_84	feat_85	feat_86	feat_87	feat_88	feat_89	feat_90	feat_91	feat_92	feat_93
count	61878.000000	61878.00000	61878.000000	61878.000000	61878.000000	61878.000000	61878.000000	61878.000000	61878.000000	61878.000000	...	61878.000000	61878.000000	61878.000000	61878.000000	61878.000000	61878.000000	61878.000000	61878.000000	61878.000000	61878.000000
mean	30939.500000	0.38668	0.263066	0.901467	0.779081	0.071043	0.025696	0.193704	0.662433	1.011296	...	0.070752	0.532306	1.128576	0.393549	0.874915	0.457772	0.812421	0.264941	0.380119	0.126135
std	17862.784315	1.52533	1.252073	2.934818	2.788005	0.438902	0.215333	1.030102	2.255770	3.474822	...	1.151460	1.900438	2.681554	1.575455	2.115466	1.527385	4.597804	2.045646	0.982385	1.201720
min	1.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	15470.250000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	30939.500000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	46408.750000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	...	0.000000	0.000000	1.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000
max	61878.000000	61.00000	51.000000	64.000000	70.000000	19.000000	10.000000	38.000000	76.000000	43.000000	...	76.000000	55.000000	65.000000	67.000000	30.000000	61.000000	130.000000	52.000000	19.000000	87.000000

8 rows × 94 columns

# View the class distribution
sns.countplot(x=data.target)

<AxesSubplot:xlabel='target', ylabel='count'>

[Figure: count plot of the number of samples per target class]

The plot shows that the classes are imbalanced.

Data Preprocessing

# Features
x = data.drop(["id","target"], axis=1)
# Target
y = data["target"]

x.head()

	feat_1	feat_4	feat_5	feat_6	feat_7	feat_8	feat_10	...	feat_84	feat_85	feat_86	feat_87	feat_90
0	1	0	0	0	0	0	0	...	0	1	0	0	0
1	0	0	0	0	0	1	0	...	0	0	0	0	0
2	0	0	0	0	0	1	0	...	0	0	0	0	0
3	1	1	6	1	5	0	1	...	22	0	1	2	0
4	0	0	0	0	0	0	0	...	0	1	0	0	1

5 rows × 93 columns

y.value_counts().sort_index()

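# The dataset is fairly large and the classes are imbalanced, so undersampling could be used to shrink it.
# The import is left commented out; undersampling is not applied in the steps below.
# from imblearn.under_sampling import RandomUnderSampler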
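
The commented-out import points at undersampling, but the walkthrough never applies it (the split further down still covers all 61878 rows). As a rough sketch of the idea, assuming a plain pandas approach instead of imblearn's RandomUnderSampler, every class can be randomly reduced to the size of the smallest class. The toy DataFrame and the helper name `undersample_to_smallest` are illustrative and not part of the original notebook.

```python
import pandas as pd

def undersample_to_smallest(df: pd.DataFrame, label_col: str, seed: int = 0) -> pd.DataFrame:
    """Randomly keep n rows per class, where n is the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Restore the original row order after concatenating the per-class samples
    return pd.concat(parts).sort_index()

# Toy data standing in for train.csv (hypothetical values)
toy = pd.DataFrame({
    "feat_1": range(10),
    "target": ["Class_1"] * 6 + ["Class_2"] * 3 + ["Class_3"] * 1,
})
balanced = undersample_to_smallest(toy, "target")
print(balanced["target"].value_counts().sort_index())
```

Undersampling throws away data from the larger classes, so it trades some accuracy for a smaller, balanced training set; whether that trade is worthwhile here is not evaluated in this article.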

Converting Labels to Integers

y = LabelEncoder().fit_transform(y)
y

array([0, 0, 0, ..., 8, 8, 8])
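
The call above discards the fitted encoder, so the integer codes cannot be mapped back to class names later. A small sketch on toy labels (not the real `target` column) shows how keeping a reference to the encoder exposes the mapping through `classes_` and `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Class_1", "Class_9", "Class_2", "Class_1"]  # toy labels for illustration

le = LabelEncoder()
codes = le.fit_transform(labels)          # classes are sorted, then numbered from 0

print(codes)                              # integer code for each label
print(list(le.classes_))                  # position i holds the name for code i
print(list(le.inverse_transform(codes)))  # round trip back to the original strings
```

Because LabelEncoder sorts the class names, `Class_1` through `Class_9` map to 0 through 8 in the expected order.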

Splitting the Data

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train.shape, y_train.shape, y_test.shape, x_test.shape

((49502, 93), (49502,), (12376,), (12376, 93))
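
The split above is purely random and unseeded, so the class proportions in the two parts can drift and the scores change from run to run. One possible variation, shown here on toy data with an assumed seed of 42, is to pass `stratify` and `random_state` to `train_test_split`; this is a sketch of the option, not what the original notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(1000, 4))   # toy count-like features
y_toy = np.array([0] * 900 + [1] * 100)      # 90% / 10% class imbalance

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    stratify=y_toy,      # keep the 90/10 ratio in both parts
    random_state=42,     # make the split reproducible
)
print(np.bincount(y_tr), np.bincount(y_te))
```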

Model Training

from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(oob_score=True)
rf_model.fit(x_train, y_train)

RandomForestClassifier(oob_score=True)

y_pred = rf_model.predict(x_test)

Model Evaluation

# Accuracy on the training set
rf_model.score(x_train, y_train)

0.9999797987960083

# Accuracy on the test set
rf_model.score(x_test, y_test)

0.8089043309631545

# Out-of-bag (OOB) estimate
rf_model.oob_score_

0.7993818431578522
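
The three scores above tell a consistent story: training accuracy is close to 1 because the trees are scored on rows they were fitted on, while the out-of-bag score (each row scored only by trees whose bootstrap sample did not include it) lands near the held-out test accuracy. The sketch below reproduces that pattern on synthetic data; the dataset, the seed, and `n_estimators=100` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the Otto features (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

train_acc = rf.score(X_tr, y_tr)   # scored on rows the trees were fitted on
oob_acc = rf.oob_score_            # each row scored only by trees that never saw it
test_acc = rf.score(X_te, y_te)    # scored on the held-out split
print(train_acc, oob_acc, test_acc)
```

Since the OOB score needs no separate validation split, it is a convenient sanity check on generalization while training.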

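# One-hot encode the labels (sparse=False is spelled sparse_output=False from scikit-learn 1.2 on)
encoder = OneHotEncoder(sparse=False)
y_test = encoder.fit_transform(y_test.reshape(-1,1))
# Reuse the encoder fitted on y_test so both arrays share the same columns
y_pred = encoder.transform(y_pred.reshape(-1,1))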
y_test,

(array([[0., 0., 1., ..., 0., 0., 0.],
        [0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 1., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 1., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]]),)

y_pred

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

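# Log loss on the hard one-hot predictions; every misclassified sample is penalized by about -log(eps), so the value is large
log_loss(y_test, y_pred, eps=1e-15, normalize=True)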

6.600210582899472
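
The value of about 6.6 follows directly from the test accuracy. With hard 0/1 predictions, a correct sample contributes essentially nothing, while a wrong sample has its true-class probability clipped to `eps` and contributes -log(eps). The average is therefore roughly the error rate times -log(eps). A quick check of that arithmetic, using the test accuracy reported earlier:

```python
import math

eps = 1e-15
test_accuracy = 0.8089043309631545    # rf_model.score(x_test, y_test) reported earlier

penalty_per_error = -math.log(eps)    # cost of one confidently wrong prediction
approx_log_loss = (1 - test_accuracy) * penalty_per_error

print(penalty_per_error, approx_log_loss)
```

The result agrees with the reported log loss, which is why log loss should be computed on predicted probabilities rather than on hard class labels.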

# Output class probabilities
y_pred_proba = rf_model.predict_proba(x_test)
y_pred_proba

array([[0.  , 0.2 , 0.77, ..., 0.  , 0.02, 0.  ],
       [0.02, 0.48, 0.16, ..., 0.06, 0.  , 0.  ],
       [0.03, 0.02, 0.03, ..., 0.3 , 0.32, 0.02],
       ...,
       [0.12, 0.01, 0.05, ..., 0.08, 0.11, 0.53],
       [0.01, 0.56, 0.32, ..., 0.01, 0.02, 0.  ],
       [0.18, 0.09, 0.01, ..., 0.1 , 0.2 , 0.14]])

rf_model.oob_score_

0.7993818431578522

log_loss(y_test, y_pred_proba, eps=1e-15, normalize=True)

0.6232249914857839
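
Scoring the probabilities brings the log loss down from about 6.6 to about 0.62. The one-hot step is not strictly needed for this: `log_loss` also accepts integer labels, as long as the `labels` argument lists the classes in the same order as the probability columns (for a fitted forest that order is `classes_`). The sketch below uses toy labels and probabilities, not the model output, and compares the library result with the mean negative log of the true-class probability.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 2, 1, 2])       # toy integer labels
proba = np.array([                    # toy predicted probabilities, one row per sample
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])

loss = log_loss(y_true, proba, labels=[0, 1, 2])

# Manual version: average of -log(probability assigned to the true class)
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))
print(loss, manual)
```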
