DiT training error: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0

Posted 7 months ago. Author: 计算机视觉-Archer. Category: Toy博客


Running DiT with the following command fails with an error:
torchrun --nnodes=1 --nproc_per_node=8 train.py --model DiT-XL/2 --data-path /home/pansiyuan/jupyter/qianyu/data

1 Full error output

[Screenshot: full error output]

2 Key part of the error


ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 83746) of binary: /opt/conda/bin/python
Traceback (most recent call last):

torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Solution:

In a multi-GPU run, the launcher output does not show the underlying error message.

Switch to a single GPU:
torchrun --nnodes=1 --nproc_per_node=1 train.py --model DiT-XL/2 --data-path /home/pansiyuan/jupyter/qianyu/data
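
If rerunning on one GPU is inconvenient, another option is to have every worker save its own traceback. The sketch below is only an illustration and is not part of DiT: `run_with_rank_traceback` and the `error_rank<N>.log` file name are hypothetical, and the only fact it relies on is that torchrun sets the `RANK` environment variable for each worker. It can only capture Python exceptions; a process killed by a signal leaves no traceback.

```python
import os
import sys
import traceback


def run_with_rank_traceback(main, log_dir="."):
    """Run main(); on failure, save the traceback to a per-rank file and re-raise.

    Hypothetical helper for illustration, not part of DiT or PyTorch.
    """
    # torchrun sets RANK for each worker; fall back to "0" when run without it.
    rank = os.environ.get("RANK", "0")
    try:
        return main()
    except Exception:
        log_path = os.path.join(log_dir, f"error_rank{rank}.log")
        with open(log_path, "w") as f:
            f.write(traceback.format_exc())
        print(f"[rank {rank}] traceback saved to {log_path}", file=sys.stderr)
        raise
```

In a training script this would wrap the entry point, for example `run_with_rank_traceback(lambda: main(args))`, assuming the script exposes a `main(args)` function.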

On a single GPU, the error turns out to be that the dataset cannot be found.


The file was not found, which suggests that args.data_path is the problem.
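
A quick check of the data directory before launching can catch this kind of problem early. The sketch below assumes the dataset is read in an ImageFolder-style layout (`data_path/<class_name>/<image files>`); that layout is an assumption here, not something stated in this post, and `list_class_dirs` is a hypothetical helper name.

```python
from pathlib import Path


def list_class_dirs(data_path):
    """Return the sorted class subdirectory names under data_path.

    Assumes an ImageFolder-style layout: data_path/<class_name>/<image files>.
    """
    root = Path(data_path)
    if not root.is_dir():
        raise FileNotFoundError(f"data path does not exist: {root}")
    # Each immediate subdirectory is treated as one class.
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    if not classes:
        raise FileNotFoundError(f"no class subdirectories found in: {root}")
    return classes
```

Calling it on the value passed to `--data-path` shows right away whether the path points at the directory that holds the class folders or at its parent.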


Note that path options such as data_path are declared in the argument parser of train.py with a hyphen (--data-path), not an underscore.


The command must use the hyphenated flag as well; argparse then exposes the value in the code as args.data_path. Here the path also points at the train subdirectory:
torchrun --nnodes=1 --nproc_per_node=8 train.py --model DiT-XL/2 --data-path /home/pansiyuan/jupyter/qianyu/data/train
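
The relationship between the hyphenated flag and the underscore attribute comes from standard argparse behavior, which the following minimal sketch shows. The parser here is a stand-in written for illustration, not a copy of train.py, and `/path/to/data/train` is a placeholder path.

```python
import argparse

parser = argparse.ArgumentParser()
# The flag is declared with a hyphen, as it is typed on the command line.
parser.add_argument("--data-path", type=str, required=True)

# Parse an explicit list instead of sys.argv so the example runs anywhere.
args = parser.parse_args(["--data-path", "/path/to/data/train"])

# argparse stores the value under the underscore name.
print(args.data_path)
```

Passing `--data_path` to this parser would be rejected as an unrecognized argument, since only `--data-path` was declared.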

After this fix, retrying with a smaller batch size is enough to get training running.
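
One plausible reason a smaller batch size helps: a negative exit code such as -9 means the worker was terminated by signal 9, which on Linux is SIGKILL, and the kernel's out-of-memory killer is a common sender of that signal. Whether that is what happened in this run is an assumption; the post does not confirm it. The hypothetical helper below simply decodes such exit codes.

```python
import signal


def describe_exit_code(exit_code):
    """Describe a child exit code as reported by a process launcher."""
    if exit_code >= 0:
        return f"exited with status {exit_code}"
    # A negative exit code -N means the process was terminated by signal N.
    try:
        name = signal.Signals(-exit_code).name
    except ValueError:
        # Signal number not known on this platform.
        name = f"signal {-exit_code}"
    return f"killed by {name}"


print(describe_exit_code(-9))
```

If the out-of-memory killer was involved, the kernel log (for example via `dmesg`) usually contains a matching entry for the killed pid.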


With 7 GPUs and a batch size of 256, the run fails with an error, because 256 is not evenly divisible by 7.
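
The arithmetic behind this constraint can be sketched as follows. The function names are hypothetical and the check is a simplified stand-in for whatever train.py does internally; it only assumes that the global batch is split evenly across GPUs.

```python
def per_gpu_batch_size(global_batch_size, world_size):
    """Split a global batch evenly across world_size GPUs, or fail loudly."""
    if global_batch_size % world_size != 0:
        raise ValueError(
            f"batch size {global_batch_size} is not divisible by "
            f"world size {world_size}"
        )
    return global_batch_size // world_size


def nearest_divisible(global_batch_size, world_size):
    """Largest batch size <= global_batch_size that world_size divides."""
    return (global_batch_size // world_size) * world_size


print(per_gpu_batch_size(256, 8))   # 8 GPUs divide 256 evenly
print(nearest_divisible(256, 7))    # a batch size that works with 7 GPUs
```

With 8 GPUs each worker gets 32 samples, while with 7 GPUs the closest usable global batch size at or below 256 is 252.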

Article source: https://www.toymoban.com/news/detail-756012.html
