When doing classification with deep learning you often need cross-validation, and PyTorch currently ships no general-purpose utility for it. You can lean on sklearn's StratifiedKFold and KFold; StratifiedKFold splits the data according to each class's sample count, so with 5 folds every class is divided 4:1 between training and validation.
A minimal example:
from sklearn.model_selection import StratifiedKFold
import numpy as np

skf = StratifiedKFold(n_splits=5)
for i, (train_idx, val_idx) in enumerate(skf.split(imgs, labels)):
    trainset, valset = np.array(imgs)[train_idx], np.array(imgs)[val_idx]
    traintag, valtag = np.array(labels)[train_idx], np.array(labels)[val_idx]
In the example above, the whole imgs list and its matching labels list are passed to split; train_idx holds the indices of the training samples and val_idx those of the validation samples. The rest of the code only needs to feed the resulting trainset and valset into a Dataset.
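To see the stratification at work, here is a small self-contained sketch (the file names and the 90/10 class ratio are made up for illustration); with 5 folds, each validation fold keeps the same 9:1 class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy imbalanced data: 90 samples of class 0, 10 of class 1
imgs = [f"img_{i}.png" for i in range(100)]   # stand-ins for file paths
labels = [0] * 90 + [1] * 10

skf = StratifiedKFold(n_splits=5)
for i, (train_idx, val_idx) in enumerate(skf.split(imgs, labels)):
    val_labels = np.array(labels)[val_idx]
    # every validation fold holds 18 samples of class 0 and 2 of class 1
    print(f"fold {i}: class 0 = {(val_labels == 0).sum()}, "
          f"class 1 = {(val_labels == 1).sum()}")
```

A plain KFold on the same data would give the last two folds nothing but class-0 samples, which is exactly the failure mode stratification avoids.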
Next I will walk through the whole process on my own dataset, from reading the data to starting training. If your dataset is stored differently, just adapt the data-reading code; the key is how to obtain imgs and the matching labels.
My data is stored like this (the class is the folder name, and images belonging to that class sit inside that folder):
"""A generic data loader where the images are arranged in this way: :: root/dog/xxx.png root/dog/xxy.png root/dog/xxz.png root/cat/123.png root/cat/nsdf3.png root/cat/asd932_.png
The following code collects imgs and labels:
import os
import numpy as np

IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def is_image_file(filename):
    return filename.lower().endswith(IMG_EXTENSIONS)

def find_classes(dir):
    # each sub-directory name is a class
    classes = [d.name for d in os.scandir(dir) if d.is_dir()]
    classes.sort()
    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
    return classes, class_to_idx

if __name__ == "__main__":
    dir = 'your root path'
    classes, class_to_idx = find_classes(dir)
    imgs = []
    labels = []
    for target_class in sorted(class_to_idx.keys()):
        class_index = class_to_idx[target_class]
        target_dir = os.path.join(dir, target_class)
        if not os.path.isdir(target_dir):
            continue
        for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
            for fname in sorted(fnames):
                path = os.path.join(root, fname)
                if is_image_file(path):
                    imgs.append(path)
                    labels.append(class_index)
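As a quick sanity check, the scanning logic above can be exercised on a throwaway directory tree (the cat/dog folders and file names below are placeholders, not part of the original script):

```python
import os
import tempfile

IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def is_image_file(filename):
    return filename.lower().endswith(IMG_EXTENSIONS)

def find_classes(dir):
    classes = [d.name for d in os.scandir(dir) if d.is_dir()]
    classes.sort()
    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
    return classes, class_to_idx

# build a tiny fake dataset tree in a temp directory
root_dir = tempfile.mkdtemp()
for cls, names in {'cat': ['a.png', 'b.jpg'], 'dog': ['c.png', 'notes.txt']}.items():
    os.makedirs(os.path.join(root_dir, cls))
    for name in names:
        open(os.path.join(root_dir, cls, name), 'w').close()

classes, class_to_idx = find_classes(root_dir)
imgs, labels = [], []
for target_class in sorted(class_to_idx.keys()):
    class_index = class_to_idx[target_class]
    target_dir = os.path.join(root_dir, target_class)
    for r, _, fnames in sorted(os.walk(target_dir)):
        for fname in sorted(fnames):
            path = os.path.join(r, fname)
            if is_image_file(path):
                imgs.append(path)
                labels.append(class_index)

print(classes)   # ['cat', 'dog']
print(labels)    # [0, 0, 1] -- notes.txt is skipped
```

Note that the non-image file is filtered out by is_image_file, so stray files in the class folders do no harm.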
In the code above, just change dir to your own root path. Next, all the data are split into 5 folds. The MyDataset class below is one I wrote myself and can be used as-is.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)  # 5 folds
for i, (train_idx, val_idx) in enumerate(skf.split(imgs, labels)):
    trainset, valset = np.array(imgs)[train_idx], np.array(imgs)[val_idx]
    traintag, valtag = np.array(labels)[train_idx], np.array(labels)[val_idx]
    train_dataset = MyDataset(trainset, traintag, data_transforms['train'])
    val_dataset = MyDataset(valset, valtag, data_transforms['val'])
from PIL import Image
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, imgs, labels, transform=None, target_transform=None):
        self.imgs = imgs
        self.labels = labels
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        path = self.imgs[idx]
        target = self.labels[idx]
        with open(path, 'rb') as f:
            img = Image.open(f)
            img = img.convert('RGB')
        if self.transform:
            img = self.transform(img)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return img, target
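A minimal usage sketch of the class (a condensed copy, with a tiny generated PNG standing in for a real image; target_transform is omitted for brevity):

```python
import os
import tempfile
from PIL import Image
from torch.utils.data import Dataset

# condensed copy of the MyDataset above
class MyDataset(Dataset):
    def __init__(self, imgs, labels, transform=None):
        self.imgs = imgs
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, idx):
        with open(self.imgs[idx], 'rb') as f:
            img = Image.open(f).convert('RGB')
        if self.transform:
            img = self.transform(img)
        return img, self.labels[idx]

# one dummy 8x8 PNG stands in for a real image file
path = os.path.join(tempfile.mkdtemp(), 'dummy.png')
Image.new('RGB', (8, 8), (255, 0, 0)).save(path)

ds = MyDataset([path], [0])
img, target = ds[0]
print(img.size, img.mode, target)   # (8, 8) RGB 0
```

In real use you would pass a transform such as torchvision's transforms.ToTensor() so that __getitem__ yields tensors rather than PIL images.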
With the datasets in place, you can create the dataloaders; what follows is a normal training loop:
from sklearn.model_selection import StratifiedKFold
from torchvision.models import resnet18
import torch.optim as optim

skf = StratifiedKFold(n_splits=5)  # 5 folds
for i, (train_idx, val_idx) in enumerate(skf.split(imgs, labels)):
    trainset, valset = np.array(imgs)[train_idx], np.array(imgs)[val_idx]
    traintag, valtag = np.array(labels)[train_idx], np.array(labels)[val_idx]
    train_dataset = MyDataset(trainset, traintag, data_transforms['train'])
    val_dataset = MyDataset(valset, valtag, data_transforms['val'])
    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size,
                                                   shuffle=True, num_workers=args.workers)
    test_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=args.batch_size,
                                                  shuffle=False, num_workers=args.workers)
    # define model (a fresh one per fold, so folds do not share weights)
    model = resnet18().cuda()
    # define criterion
    criterion = torch.nn.CrossEntropyLoss()
    # observe that all parameters are being optimized
    optimizer = optim.SGD(model.parameters(),
                          lr=args.lr,
                          momentum=args.momentum,
                          weight_decay=args.weight_decay)
    for epoch in range(args.epoch):
        train_acc, train_loss = train(train_dataloader, model, criterion, args)
        test_acc, test_acc_top5, test_loss = validate(test_dataloader, model, criterion, args)
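The train and validate helpers are not shown in the original post; the following is a hypothetical minimal sketch of what they might look like, with simplified signatures (an explicit optimizer argument instead of args) and a small linear model on random CPU data standing in for resnet18:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train(loader, model, criterion, optimizer):
    # one epoch of training; returns (accuracy, mean loss)
    model.train()
    correct, total, loss_sum = 0, 0, 0.0
    for x, y in loader:
        optimizer.zero_grad()
        out = model(x)
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()
        loss_sum += loss.item() * y.size(0)
        correct += (out.argmax(1) == y).sum().item()
        total += y.size(0)
    return correct / total, loss_sum / total

def validate(loader, model, criterion):
    # evaluation pass without gradients; returns (accuracy, mean loss)
    model.eval()
    correct, total, loss_sum = 0, 0, 0.0
    with torch.no_grad():
        for x, y in loader:
            out = model(x)
            loss_sum += criterion(out, y).item() * y.size(0)
            correct += (out.argmax(1) == y).sum().item()
            total += y.size(0)
    return correct / total, loss_sum / total

# exercise the helpers on random data with a linear classifier
torch.manual_seed(0)
x = torch.randn(40, 10)
y = torch.randint(0, 2, (40,))
loader = DataLoader(TensorDataset(x, y), batch_size=8)
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

train_acc, train_loss = train(loader, model, criterion, optimizer)
val_acc, val_loss = validate(loader, model, criterion)
print(f"train acc {train_acc:.2f}, val loss {val_loss:.3f}")
```

Per-fold metrics are typically averaged over the k folds at the end to report a single cross-validated score.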
To guarantee that the data split is identical on every run, note that shuffle=False is the default:
StratifiedKFold(n_splits=5, shuffle=False)
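A quick check that the default shuffle=False yields identical folds across runs, and that shuffle=True is only reproducible with a fixed random_state (the toy labels below are made up):

```python
from sklearn.model_selection import StratifiedKFold

labels = [0] * 8 + [1] * 4
data = list(range(12))

def fold_indices(skf):
    return [val.tolist() for _, val in skf.split(data, labels)]

# shuffle=False (the default): splits are deterministic run after run
a = fold_indices(StratifiedKFold(n_splits=4, shuffle=False))
b = fold_indices(StratifiedKFold(n_splits=4, shuffle=False))
print(a == b)   # True

# shuffle=True is only reproducible when random_state is fixed
c = fold_indices(StratifiedKFold(n_splits=4, shuffle=True, random_state=42))
d = fold_indices(StratifiedKFold(n_splits=4, shuffle=True, random_state=42))
print(c == d)   # True
```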
That covers the basic implementation. The reason for doing k-fold at the code level rather than at the data level (say, pre-partitioning the data into 5 equal parts on disk) is that this approach tolerates adding or removing samples freely: there is no need to re-partition the data by hand, which is very convenient.