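1 Introduction
The breast cancer dataset poses a binary classification problem. It contains 569 samples, each with 31 feature columns and 1 label column.
If you need anything, you can get in touch here: https://docs.qq.com/doc/DWEtRempVZ1NSZHdQ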
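For readers who do not have data.csv at hand, the sketch below loads the copy of the Wisconsin Diagnostic Breast Cancer data that ships with scikit-learn. That data.csv is this same dataset is an assumption, not something stated above. The bundled copy has 30 feature columns and no id column, and it encodes the label as integers rather than strings.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: data.csv is the Wisconsin Diagnostic Breast Cancer dataset,
# which scikit-learn also ships as a built-in dataset.
bunch = load_breast_cancer(as_frame=True)
features = bunch.data      # DataFrame of numeric feature columns
labels = bunch.target      # Series: 0 = malignant, 1 = benign

print(features.shape)
print(list(bunch.target_names))
```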
2 Importing the common libraries
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
np.random.seed(123)
3 Loading the dataset
data = pd.read_csv("data.csv")
4 Data exploration
4.1 Printing basic information
print(data.shape)
print(data.head())
print(data.describe())
data.info()  # info() prints the summary itself, so call it instead of printing the bound method
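4.2 Visualization
4.2.1 Computing and visualizing the correlation coefficients
co = data.corr(numeric_only=True)  # numeric columns only; diagnosis is still a string here (needs pandas >= 1.5)
plt.subplots(figsize=(8, 8))
sns.heatmap(co.round(2), annot=True)  # plot the matrix itself, not the correlation of the correlations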
plt.show()
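A heatmap with this many columns can be hard to read, so it may help to list the most strongly correlated pairs as numbers too. The sketch below does this on a small made-up frame (the columns a, b and c are hypothetical); the same steps should apply to the correlation matrix computed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly collinear with "a"
    "c": rng.normal(size=200),                      # independent noise
})

corr = toy.corr()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
# Sort by absolute value so strong negative correlations rank high as well.
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs.head())
```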
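4.2.2 Showing the number of samples in each class
sns.countplot(x=data['diagnosis'])  # pass the column by keyword; in seaborn >= 0.12 the first positional argument is data, not x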
plt.show()
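The counts behind the bar chart can also be read off directly with value_counts. This is a minimal sketch on a made-up label column; the "M" and "B" values are an assumption about what data['diagnosis'] contains.

```python
import pandas as pd

# Hypothetical miniature label column; the real one is data["diagnosis"].
diagnosis = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

counts = diagnosis.value_counts()                 # absolute count per class
shares = diagnosis.value_counts(normalize=True)   # fraction of samples per class
print(counts)
print(shares.round(2))
```

Checking the class shares early shows how imbalanced the problem is, which matters when accuracy is later used as the only metric.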
5 Data preprocessing
5.1 Encoding the class label with LabelEncoder
data["diagnosis"] = LabelEncoder().fit_transform(data["diagnosis"])
print(data["diagnosis"].head(5))
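LabelEncoder assigns integer codes in the sorted order of the class names, so it is worth checking which class ends up as 0 and which as 1. The sketch below assumes the raw labels are the strings "B" and "M"; the sample list is hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels; "B"/"M" are assumed to be the values in data["diagnosis"].
raw = ["M", "B", "B", "M"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw)

# classes_ is sorted, and each class is encoded as its position in classes_.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoder.inverse_transform(encoded))  # recovers the original strings
```

Keeping the fitted encoder in a variable, rather than creating it inline, makes inverse_transform available later when predictions need to be reported with the original labels.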
5.2 Dropping the id column with drop
data.drop(["id"],axis=1, inplace=True)
print(data.columns)
5.3 Checking for missing values with isnull
print(data.isnull().sum())
There are no missing values, so no missing-value handling is needed.
5.4 Splitting into training and test sets with train_test_split
from sklearn.utils import shuffle
data = shuffle(data, random_state=123)  # shuffle the samples (train_test_split also shuffles by default)
x = data.drop(["diagnosis"], axis=1)
y = data["diagnosis"]
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=123)
Total number of samples: 569
Number of training samples: 398
Number of test samples: 171
Article source: https://www.toymoban.com/news/detail-480013.html
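The 398/171 split follows from train_size=0.7 applied to 569 samples. The sketch below reproduces those sizes on scikit-learn's bundled breast cancer data (assumed here to stand in for data.csv) and additionally passes stratify=y, which is not used in the code above, to keep the class proportions similar in both parts.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled breast cancer data stands in for data.csv.
X, y = load_breast_cancer(return_X_y=True)

# stratify=y is an addition for illustration; the original split does not use it.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, random_state=123, stratify=y
)

print(len(X), len(X_tr), len(X_te))
# Share of class 1 overall, in the training part, and in the test part.
print(round(y.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```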
5.5 Normalizing the data with MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
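MinMaxScaler maps each feature to (x - min) / (max - min), using the minimum and maximum learned from the training set only. The sketch below works this through on hypothetical one-feature data and shows that test values outside the training range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature data, chosen so the arithmetic is easy to follow.
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[2.0], [7.0]])

scaler = MinMaxScaler().fit(train)        # learns min=1 and max=5 from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # 7.0 exceeds the training max, so it maps above 1

# The same result by hand: (x - min) / (max - min)
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))
print(train_scaled.ravel())
print(test_scaled.ravel())
```

Fitting on the training set alone, as the code above does, keeps information about the test set from leaking into preprocessing.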
6 Training and predicting with several models
model_list = [KNeighborsClassifier(), SVC(), DecisionTreeClassifier(), RandomForestClassifier()]
for model in model_list:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = round(accuracy_score(y_test, y_pred), 2)  # accuracy on the held-out test set
    print("{} accuracy: {}".format(model, acc))
Output:
KNeighborsClassifier() accuracy: 0.96
SVC() accuracy: 0.98
DecisionTreeClassifier() accuracy: 0.94
RandomForestClassifier() accuracy: 0.95
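The accuracies above come from a single train/test split, so differences of a point or two may be due to the split rather than the model. One possible way to get a steadier comparison is cross-validation, sketched below on scikit-learn's bundled breast cancer data (assumed to stand in for data.csv). The 5 folds and the fixed random_state values are illustrative choices, not part of the original code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for data.csv.
X, y = load_breast_cancer(return_X_y=True)

models = [
    KNeighborsClassifier(),
    SVC(),
    DecisionTreeClassifier(random_state=123),
    RandomForestClassifier(random_state=123),
]

cv_means = {}
for model in models:
    # The pipeline refits the scaler on the training part of every fold.
    pipe = make_pipeline(MinMaxScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    name = type(model).__name__
    cv_means[name] = scores.mean()
    print("{}: {:.3f} +/- {:.3f}".format(name, scores.mean(), scores.std()))
```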
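7 Hyperparameter tuning with GridSearchCV
- Look up the available scoring metrics with SCORERS (removed in scikit-learn 1.3; newer versions provide get_scorer_names() instead)
- verbose=3 prints every individual fit together with its score
- Different scoring criteria can lead to different best parameters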
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import get_scorer_names  # replaces SCORERS, which was removed in scikit-learn 1.3
param_grid = {"C": [0.01, 0.1, 1, 10, 100],
"gamma": [0.0001, 0.001, 0.01, 0.1, 1, 10, 20]}
grid_search = GridSearchCV(SVC(), param_grid, cv=2, verbose=3, scoring="accuracy")
grid_search.fit(X_train, y_train)
print(grid_search.best_score_)
print(grid_search.best_params_)
print(grid_search.score(X_test, y_test))
The best parameters found and the score on the test set:
{'C': 1, 'gamma': 1}
0.9766081871345029
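In the search above the data was scaled once before cross-validation, so each validation fold has already influenced the scaler. A possible variant, sketched below, puts the scaler and the SVC into a Pipeline so that scaling is refit inside every fold, and then inspects cv_results_ to see how close the runner-up settings are. The bundled dataset, the smaller grid, and cv=5 are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for data.csv.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=123)

pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC())])
# Pipeline parameters are addressed as <step name>__<parameter name>.
grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}

search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[["params", "mean_test_score", "std_test_score"]]
print(top.head(3))
print(search.best_params_, round(search.score(X_te, y_te), 3))
```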