Typical Python Methods for Splitting Machine Learning Sample Data

Date        Author   Version  Note
2023.08.16  Dog Tao  V1.0     Document completed.

Categories of Sample Data

In machine learning and deep learning, the data used to develop a model can be divided into three distinct sets: training data, validation data, and test data. Understanding the differences among them and their distinct roles is crucial for effective model development and evaluation.

Training Data

  • Purpose: The training data is used to train the model. It’s the dataset the algorithm will learn from.
  • Usage: The model parameters are adjusted or “learned” using this data. For example, in a neural network, weights are adjusted using backpropagation on this data.
  • Fraction: Typically, a significant majority of the dataset is allocated to training (e.g., 60%-80%).
  • Issues: Overfitting can be a concern if the model becomes too specialized to the training data, leading it to perform poorly on unseen data.

Validation Data

  • Purpose: The validation data is used to tune the model’s hyperparameters and make decisions about the model’s structure (e.g., choosing the number of hidden units in a neural network or the depth of a decision tree).
  • Usage: After training on the training set, the model is evaluated on the validation set, and adjustments to the model (like changing hyperparameters) are made based on this evaluation. The process might be iterative.
  • Fraction: Often smaller than the training set, typically 10%-20% of the dataset.
  • Issues: Overfitting to the validation set can happen if you make too many adjustments based on the validation performance. This phenomenon is sometimes called “validation set overfitting” or “leakage.”

Test Data

  • Purpose: The test data is used to evaluate the model’s final performance after training and validation. It provides an unbiased estimate of model performance in real-world scenarios.
  • Usage: Only for evaluation. The model does not “see” this data during training or hyperparameter tuning. Once the model is finalized, it is tested on this dataset to gauge its predictive performance.
  • Fraction: Typically, 10%-20% of the dataset.
  • Issues: To preserve the unbiased nature of the test set, it should never be used to make decisions about the model. If it’s used in this way, it loses its purpose, and one might need a new test set.

Note: The exact percentages mentioned can vary based on the domain, dataset size, and specific methodologies. In practice, strategies like k-fold cross-validation might be used, where the dataset is split into k subsets, and the model is trained and validated multiple times, each time using a different subset as the validation set and the remaining data as the training set.

In summary, the distinction among training, validation, and test data sets is crucial for robust model development, avoiding overfitting, and ensuring that the model will generalize well to new, unseen data.
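
One common way to obtain all three sets is to call train_test_split from sklearn.model_selection twice. The sketch below assumes a 60/20/20 split and synthetic data; both are illustrative choices, not a recommendation from the original text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 100 samples, 5 features, binary labels
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.integers(0, 2, size=100)

# Step 1: hold out 20% of all samples as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The test set is split off first so that no later decision about the validation split can touch it.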

numpy.ndarray Data

Direct Split

To split numpy.ndarray data into a training set and validation set, you can use the train_test_split function provided by the sklearn.model_selection module.

Here’s a brief explanation followed by an example:

  • Function Name: train_test_split()

  • Parameters:

    1. arrays: One or more indexable sequences of the same length, such as lists, NumPy arrays, SciPy sparse matrices, or pandas DataFrames.
    2. test_size: If float, should be between 0.0 and 1.0, representing the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
    3. train_size: The counterpart of test_size for the training split (a float proportion or an int count). If not provided, it is set to the complement of test_size.
    4. random_state: Seed for reproducibility.
    5. shuffle: Whether to shuffle before splitting. Default is True.
    6. stratify: If not None, the data is split in a stratified fashion using this as the class labels.
  • Returns: Split arrays.

Example:

Let’s split an example dataset into a training set (80%) and a validation set (20%):

import numpy as np
from sklearn.model_selection import train_test_split

# Sample data
X = np.random.rand(100, 5)  # 100 samples, 5 features
y = np.random.randint(0, 2, 100)  # 100 labels, binary classification

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", len(X_train))
print("Validation set size:", len(X_val))
  • If you want the split to be reproducible (i.e., get the same split each time you run the code), set the random_state to any integer value.
  • If you’re working with imbalanced datasets and want to ensure that the class distribution is the same in both the training and validation sets, you can use the stratify parameter. Setting stratify=y will ensure that the splits have the same class distribution as the original dataset.
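
To make the effect of stratify concrete, here is a minimal sketch on a deliberately imbalanced label vector. The 90/10 class ratio and the data are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced data: 90 samples of class 0, 10 samples of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 9:1 class ratio in both resulting subsets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_counts = np.bincount(y_train)  # 72 of class 0, 8 of class 1
val_counts = np.bincount(y_val)      # 18 of class 0, 2 of class 1
print("Training class counts:", train_counts)
print("Validation class counts:", val_counts)
```

Without stratify, a small validation set could by chance end up with very few (or no) minority-class samples.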

Cross-Validation

Using KFold

To perform n-fold cross-validation on numpy.ndarray data, you can use the KFold class from the sklearn.model_selection module.

Here’s how you can use n-fold cross-validation:

  • Class Name: KFold

  • Parameters of KFold:

    1. n_splits: Number of folds.
    2. shuffle: Whether to shuffle the data before splitting into batches.
    3. random_state: Seed used by the random number generator for reproducibility.

Example:

Let’s say you want 5-fold cross-validation:

import numpy as np
from sklearn.model_selection import KFold

# Sample data
X = np.random.rand(100, 5)  # 100 samples, 5 features
y = np.random.randint(0, 2, 100)  # 100 labels, binary classification

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, val_index in kf.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    print("Training set size:", len(X_train))
    print("Validation set size:", len(X_val))
    print("---")
  • Each iteration in the loop gives you a different split of training and validation data.
  • The training and validation indices are generated based on the size of X.
  • If you want the split to be reproducible (i.e., get the same split each time you run the code), set the random_state parameter.
  • In case you want stratified k-fold cross-validation (where the folds are made by preserving the percentage of samples for each class), use StratifiedKFold instead of KFold. This can be particularly useful for imbalanced datasets.
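
A minimal sketch of the StratifiedKFold variant mentioned above, assuming an illustrative 80/20 class imbalance. Note that split() needs the labels y in addition to X.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative imbalanced data: 80 samples of class 0, 20 samples of class 1
X = np.zeros((100, 3))
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

val_counts = []
for fold, (train_index, val_index) in enumerate(skf.split(X, y)):
    counts = np.bincount(y[val_index])
    val_counts.append(counts)
    # Each validation fold holds 16 samples of class 0 and 4 of class 1
    print("Fold", fold, "validation class counts:", counts)
```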

Using RepeatedKFold

RepeatedKFold repeats K-Fold cross-validator. For each repetition, it splits the dataset into k-folds and then the k-fold cross-validation is performed. This results in having multiple scores for multiple runs, which might give a more comprehensive evaluation of the model’s performance.

Parameters:

  • n_splits: Number of folds.
  • n_repeats: Number of times cross-validator needs to be repeated.
  • random_state: Random seed for reproducibility.

Example:

import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])

rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=42)

for train_index, test_index in rkf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
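
As a quick check of how many splits to expect, the sketch below (with made-up data and parameter values) shows that RepeatedKFold yields n_splits * n_repeats train/test pairs, and that every sample lands in a test fold once per repetition.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(20).reshape(10, 2)  # 10 illustrative samples

rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
splits = list(rkf.split(X))

print(rkf.get_n_splits(X))  # 15, i.e. 5 folds x 3 repetitions
print(len(splits))          # 15

# Count how often each sample index appears in a test fold
test_counts = np.bincount(
    np.concatenate([test_index for _, test_index in splits]), minlength=len(X)
)
print(test_counts)  # every sample is tested once per repetition, so 3 times
```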

Using cross_val_score

cross_val_score evaluates a score by cross-validation. It’s a quick utility that wraps both the steps of splitting the dataset and evaluating the estimator’s performance.

Parameters:

  • estimator: The object to use to fit the data.
  • X: The data to fit.
  • y: The target variable for supervised learning problems.
  • cv: Cross-validation strategy.
  • scoring: A string (see model evaluation documentation) or a scorer callable object/function.

Example:

Here’s an example using RepeatedKFold with cross_val_score for a simple regression model:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, RepeatedKFold

# Generate a sample dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1)

# Define the model
model = LinearRegression()

# Define the evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate the model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

# Summary of performance (scores are negated MAE values, so the mean is <= 0)
print('Mean negated MAE: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

In the above example:

  • cross_val_score is used to evaluate the performance of a LinearRegression model using the mean absolute error (MAE) metric.
  • We employ a 10-fold cross-validation strategy that is repeated 3 times, as specified by RepeatedKFold.
  • The scores from all these repetitions and folds are aggregated into the scores array.

Note:

  • In the scoring parameter, the ‘neg_mean_absolute_error’ is used because in sklearn, the convention is to maximize the score, so loss functions are represented with negative values (the closer to 0, the better).
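
To report the error on its usual positive scale, the scores can simply be negated. The following is a small sketch with synthetic regression data; the dataset size and the plain KFold strategy are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each score is the negated MAE of one fold, so all values are <= 0
scores = cross_val_score(
    LinearRegression(), X, y, scoring="neg_mean_absolute_error", cv=cv
)

# Flip the sign to recover the ordinary (non-negative) MAE per fold
mae = -scores
print("MAE per fold:", mae)
print("Mean MAE: %.3f (std %.3f)" % (mae.mean(), mae.std()))
```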

torch.Tensor Data

Direct Split

Using TensorDataset

To split a tensor into training and validation sets, you can use the random_split function from torch.utils.data. It operates on Dataset objects, so tensors need to be wrapped in a dataset first.

Here’s how you can do it:

  1. Wrap your tensor in a TensorDataset:
    Before using random_split, you might need to wrap your tensors in a TensorDataset so they can be treated as a dataset.

  2. Use random_split to divide the dataset:
    The random_split function requires two arguments: the dataset you’re splitting and a list of lengths for each resulting subset.

Here’s an example using random_split:

import torch
from torch.utils.data import TensorDataset, random_split

# Sample tensor data
X = torch.randn(1000, 10)  # 1000 samples, 10 features each
Y = torch.randint(0, 2, (1000,))  # 1000 labels

# Wrap tensors in a dataset
dataset = TensorDataset(X, Y)

# Split into 80% training (800 samples) and 20% validation (200 samples)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print(len(train_dataset))  # 800
print(len(val_dataset))    # 200

Once you’ve split your data into training and validation sets, you can easily load them in batches using DataLoader if needed.

  • The random_split function does not make a deep copy of the dataset. Instead, it returns Subset objects that hold indices into the original dataset. This makes the splitting operation efficient in terms of memory.

  • Each time you call random_split, the split will be different because the function draws a new random permutation of the indices. If you want reproducibility, set the random seed using torch.manual_seed() before calling random_split.
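
As an alternative to seeding the global random state, random_split also accepts a generator argument. The sketch below uses a hypothetical helper, split, and a tiny illustrative dataset to show that the same seed reproduces the same split.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Tiny illustrative dataset of 10 samples
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))

def split(seed):
    """Hypothetical helper: split 8/2 using a dedicated, seeded generator."""
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [8, 2], generator=generator)

train_a, val_a = split(0)
train_b, val_b = split(0)

# The same seed gives the same index assignment
print(list(train_a.indices) == list(train_b.indices))  # True
```

A dedicated generator keeps the split reproducible without affecting other random operations, such as weight initialization.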

The resulting subsets from random_split can be directly passed to DataLoader to create training and validation loaders:

from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

This allows you to efficiently iterate over the batches of data during training and validation.
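
For illustration, iterating over a loader yields one tuple of tensors per batch. The sketch below uses a made-up dataset of 100 samples, so the last batch is smaller than batch_size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative data: 100 samples, 10 features, binary labels
dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))
loader = DataLoader(dataset, batch_size=32, shuffle=False)

batch_shapes = []
for x_batch, y_batch in loader:
    # x_batch has shape [batch, 10]; y_batch has shape [batch]
    batch_shapes.append((tuple(x_batch.shape), tuple(y_batch.shape)))

print(batch_shapes)
# [((32, 10), (32,)), ((32, 10), (32,)), ((32, 10), (32,)), ((4, 10), (4,))]
```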

If you have a TensorDataset and you want to retrieve all the data pairs from it, you can simply iterate over the dataset. Each iteration will give you a tuple where each element of the tuple corresponds to a tensor in the TensorDataset.

Here’s an example:

import torch
from torch.utils.data import TensorDataset

# Sample tensor data
X = torch.randn(100, 10)  # 100 samples, 10 features each
Y = torch.randint(0, 2, (100,))  # 100 labels

# Wrap tensors in a dataset
dataset = TensorDataset(X, Y)

# Get all data pairs
data_pairs = [data for data in dataset]

# If you want to get them separately
X_data, Y_data = zip(*data_pairs)

# Convert back to tensors if needed
X_data = torch.stack(X_data)
Y_data = torch.stack(Y_data)

print(X_data.shape)  # torch.Size([100, 10])
print(Y_data.shape)  # torch.Size([100])

In the code above:

  • We first create a TensorDataset from sample data.
  • Then, we use list comprehension to retrieve all data pairs from the dataset.
  • Finally, we separate the features and labels using the zip function, and then convert them back to tensors.

The zip(*data_pairs) expression is a neat Python trick that involves unpacking and transposing pairs (or tuples) of data.

To break it down:

  1. zip function: This is a built-in Python function that allows you to iterate over multiple lists (or other iterable objects) in parallel. For example, if you have two lists a = [1,2,3] and b = [4,5,6], calling zip(a,b) will yield pairs (1,4), (2,5), and (3,6).

  2. The * unpacking operator: When used in a function call, it unpacks a list (or tuple) into individual elements. For instance, if you have func(*[1,2,3]), it’s the same as calling func(1,2,3).

When you use them together as in zip(*data_pairs), you’re doing the following:

  • Unpacking the data_pairs: This treats the list of tuples in data_pairs as separate arguments to zip.
  • Transposing with zip: Since each element of data_pairs is a tuple of (X, Y), using zip effectively transposes the data, separating all the X’s from the Y’s.

Here’s a simple example to illustrate:

data_pairs = [(1, 'a'), (2, 'b'), (3, 'c')]
x_data, y_data = zip(*data_pairs)
print(x_data)  # Outputs: (1, 2, 3)
print(y_data)  # Outputs: ('a', 'b', 'c')

In the context of our previous discussion, this operation allowed us to efficiently separate the feature tensors from the label tensors in the TensorDataset.
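
If the goal is only to get the full tensors back, a TensorDataset also exposes them through its tensors attribute, which avoids iterating sample by sample. The following is a minimal sketch with illustrative data.

```python
import torch
from torch.utils.data import TensorDataset

X = torch.randn(100, 10)
Y = torch.randint(0, 2, (100,))
dataset = TensorDataset(X, Y)

# dataset.tensors is the tuple of tensors that was passed to the constructor
X_data, Y_data = dataset.tensors

print(X_data.shape)  # torch.Size([100, 10])
print(Y_data.shape)  # torch.Size([100])
print(X_data is X)   # True: no copy is made
```

The iteration-based approach is still useful for generic Dataset objects that do not expose their underlying storage.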

Using Slicing

To split a PyTorch tensor into training and validation sets, you can use simple slicing. Here’s a straightforward way to do this:

  1. Decide on a split ratio (e.g., 80% training and 20% validation).
  2. Shuffle the tensor (optional, but often a good idea).
  3. Split the tensor based on the desired ratio.

Here’s an example using an 80-20 split:

import torch

# Sample data
X = torch.randn(1000, 10)  # 1000 samples, 10 features each
Y = torch.randint(0, 2, (1000,))

# Shuffle data
indices = torch.randperm(X.size(0))
X = X[indices]
Y = Y[indices]

# Split ratios
train_size = int(0.8 * X.size(0))
val_size = X.size(0) - train_size

# Split data
X_train = X[:train_size]
Y_train = Y[:train_size]
X_val = X[train_size:]
Y_val = Y[train_size:]

print(X_train.size())
print(Y_train.size())
print(X_val.size())
print(Y_val.size())

In this example:

  • We first shuffled the data by generating a permutation of indices with torch.randperm().
  • We then split the data based on the desired ratio (in this case, 80-20).
  • The resulting tensors (X_train, Y_train, X_val, Y_val) represent the training and validation sets respectively.

This method works well when you have independent and identically distributed data. If you need to perform stratified sampling (e.g., you want to ensure the training and validation sets have similar class distributions), consider using utilities from libraries like scikit-learn to generate the splits, and then index into the PyTorch tensor using those splits.
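
One way to do this is sketched below: split an array of sample indices with scikit-learn's train_test_split using stratify, then index the tensors with the result. The 900/100 class imbalance is an illustrative assumption.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split

# Illustrative imbalanced data: 900 samples of class 0, 100 of class 1
X = torch.randn(1000, 10)
Y = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])

# Split the sample indices rather than the tensors themselves
indices = np.arange(len(Y))
train_idx, val_idx = train_test_split(
    indices, test_size=0.2, random_state=42, stratify=Y.numpy()
)

# Convert the index arrays to long tensors and index into the original data
train_idx = torch.as_tensor(train_idx, dtype=torch.long)
val_idx = torch.as_tensor(val_idx, dtype=torch.long)
X_train, Y_train = X[train_idx], Y[train_idx]
X_val, Y_val = X[val_idx], Y[val_idx]

print(torch.bincount(Y_train))  # 720 of class 0, 80 of class 1
print(torch.bincount(Y_val))    # 180 of class 0, 20 of class 1
```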

The torch.randperm(n) function generates a random permutation of integers from 0 to n-1. This is particularly useful for shuffling data. Let’s break down the function torch.randperm(X.size(0)):

  1. X.size(0):

    • This retrieves the size of the first dimension of tensor X.
    • If X is a 2D tensor with shape [samples, features], then X.size(0) will return the number of samples.
  2. torch.randperm(...):

    • This generates a tensor of random permutations of integers from 0 to n-1, where n is the input argument.
    • The result is effectively a shuffled sequence of integers in the range [0, n-1].

In the context of splitting data into training and validation sets, the random permutation ensures that the data is shuffled randomly before the split, so that the training and validation sets are likely to be representative of the overall dataset.
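
The short sketch below checks two properties on a tiny example: the result of torch.randperm contains every index exactly once, and indexing X and Y with the same permutation keeps each feature row paired with its label. The seeded generator is an optional addition for reproducibility.

```python
import torch

# A seeded generator makes the permutation reproducible (optional)
generator = torch.Generator().manual_seed(0)
perm = torch.randperm(10, generator=generator)

# Sorting the permutation recovers 0..9, so each index occurs exactly once
print(torch.sort(perm).values)  # tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Toy data where row i of X contains the value i and Y[i] == i
X = torch.arange(10).float().unsqueeze(1)
Y = torch.arange(10)

# Applying the same permutation to both keeps samples and labels aligned
X_shuffled, Y_shuffled = X[perm], Y[perm]
print(torch.equal(X_shuffled.squeeze(1).long(), Y_shuffled))  # True
```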

Cross-Validation

To perform n-fold cross-validation on PyTorch tensor data, you can use the KFold class from sklearn.model_selection. Here’s a step-by-step guide:

  1. Convert the PyTorch tensor to numpy arrays using the .numpy() method.
  2. Use KFold from sklearn.model_selection to generate training and validation indices.
  3. Use these indices to split your PyTorch tensor data into training and validation sets.
  4. Train and validate your model using these splits.

Let’s see a practical example:

import torch
from sklearn.model_selection import KFold

# Sample tensor data
X = torch.randn(100, 10)  # 100 samples, 10 features each
Y = torch.randint(0, 2, (100,))  # 100 labels

# Convert tensor to numpy
X_np = X.numpy()
Y_np = Y.numpy()

# Number of splits
n_splits = 5
kf = KFold(n_splits=n_splits)

for train_index, val_index in kf.split(X_np):
    # Convert indices to tensor
    train_index = torch.tensor(train_index)
    val_index = torch.tensor(val_index)

    X_train, X_val = X[train_index], X[val_index]
    Y_train, Y_val = Y[train_index], Y[val_index]
    
    # Now, you can train and validate your model using X_train, X_val, Y_train, Y_val

Note:

  • The KFold class provides indices which we then use to slice our tensor and obtain the respective training and validation sets.
  • In the example above, we’re performing a 5-fold cross-validation on the data. Each iteration provides a new training-validation split.

If you want to shuffle the data before splitting, you can set the shuffle parameter of KFold to True.
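
To show how the training step could fit into the loop, here is a rough sketch that trains a fresh model on each fold and records the validation accuracy. The linear model, optimizer, learning rate, and number of steps are all illustrative assumptions, and the labels are random, so the accuracies themselves carry no meaning.

```python
import torch
from torch import nn
from sklearn.model_selection import KFold

torch.manual_seed(0)
X = torch.randn(100, 10)                   # 100 samples, 10 features
Y = torch.randint(0, 2, (100,)).float()    # random binary labels

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = []

for train_index, val_index in kf.split(X.numpy()):
    train_index = torch.as_tensor(train_index, dtype=torch.long)
    val_index = torch.as_tensor(val_index, dtype=torch.long)

    # Create a fresh model per fold so no weights carry over between folds
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()

    # A few full-batch gradient steps on the training part of this fold
    for _ in range(20):
        optimizer.zero_grad()
        loss = loss_fn(model(X[train_index]).squeeze(1), Y[train_index])
        loss.backward()
        optimizer.step()

    # Evaluate on the held-out part of this fold
    with torch.no_grad():
        preds = (model(X[val_index]).squeeze(1) > 0).float()
        accuracy = (preds == Y[val_index]).float().mean().item()
    fold_accuracies.append(accuracy)

print("Validation accuracy per fold:", fold_accuracies)
```

Averaging fold_accuracies gives the cross-validated estimate of the model's performance.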