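1. Problem description
I ran into this error while trying to run a model's training source code. I searched online for a while but found no fix, and what I did find did not match my situation. The problem itself is related to PyTorch distributed training.
The actual output was as follows.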
root@node02:~/data/zjx/others/DDPtry# python -m torch.distributed.launch --nproc_per_node 3 tryDDP_1.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Traceback (most recent call last):
File "tryDDP_1.py", line 92, in <module>
Traceback (most recent call last):
File "tryDDP_1.py", line 92, in <module>
Traceback (most recent call last):
File "tryDDP_1.py", line 92, in <module>
b = c
NameError: name 'c' is not defined
b = c
b = c
NameError: name 'c' is not defined
NameError: name 'c' is not defined
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'tryDDP_1.py', '--local_rank=2']' returned non-zero exit status 1.
2. Solving the problem
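When this error appears, the key to fixing it is not the error itself but the errors reported before it.
Distributed training cannot proceed because of one or more errors in the original code, and whenever that happens this error is reported. In the real example above, another error appears before this one: the NameError: name 'c' is not defined raised at line 92 of tryDDP_1.py by each of the three processes.
Of course, I introduced this error on purpose to show where the problem comes from: I added a faulty statement, b = c, to the code.
It is precisely this error in the code that prevents distributed training from running, which is why an error like
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'tryDDP_1.py', '--local_rank=2']' returned non-zero exit status 1.
is returned. So the key to solving this problem is to fix every error reported before it; after that, distributed training runs normally.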
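The mechanism can be reproduced without PyTorch. The sketch below is not the code of torch.distributed.launch; it only assumes, based on the traceback above, that the launcher starts each worker with subprocess and raises CalledProcessError when a worker exits with a non-zero status. The helper names run_child and launch are hypothetical.

```python
import subprocess
import sys


def run_child(code):
    """Run `code` in a fresh Python interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


def launch(code):
    """Hypothetical stand-in for a launcher: fail if the child fails."""
    result = run_child(code)
    # The child's own traceback (the real error) is shown first ...
    sys.stderr.write(result.stderr)
    # ... and only then does the parent raise CalledProcessError.
    result.check_returncode()
    return result


if __name__ == "__main__":
    try:
        launch("b = c")  # same deliberate bug as in tryDDP_1.py
    except subprocess.CalledProcessError as err:
        print("launcher-side error:", err)
```

The child prints the NameError, and the parent then reports only that the command returned a non-zero exit status, which is why the useful information is always in the earlier traceback.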
To confirm this, remove the faulty b = c statement and run the script again; it now runs normally.
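Because every process prints its own traceback, the output of several workers gets interleaved, as in the log above. One optional way to make the earlier errors easier to read is to prefix each traceback line with the worker's local rank. This is only a sketch: the helpers parse_local_rank and run_with_rank_tag are hypothetical, and it assumes the script receives a --local_rank argument, as the command in the error message suggests.

```python
import argparse
import sys
import traceback


def parse_local_rank(argv):
    """Read --local_rank from argv, ignoring any other arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


def run_with_rank_tag(fn, local_rank, stream=sys.stderr):
    """Call fn(); on failure, print a rank-tagged traceback and return 1."""
    try:
        fn()
    except Exception:
        for line in traceback.format_exc().splitlines():
            print(f"[rank {local_rank}] {line}", file=stream)
        return 1  # non-zero status so the launcher still sees the failure
    return 0
```

In a training script, this could be used as sys.exit(run_with_rank_tag(main, parse_local_rank(sys.argv[1:]))), where main is the script's own entry function.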