Error details: RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
This error occurred while I was using torch.nn.parallel.DistributedDataParallel
to train models in parallel. I launched program A with python -m torch.distributed.launch --nproc_per_node=2 trainA.py
and it worked fine. Then, while A was still running, I tried to launch program B on the same machine with python -m torch.distributed.launch --nproc_per_node=2 trainB.py
and got the error above.
It turns out that the issue is a port conflict, not a problem with the training code. As the error reports, port 29500
is already in use: it is the default master port of torch.distributed.launch, so program A had already bound it. Giving program B a different port should therefore work. I used the command python -m torch.distributed.launch --nproc_per_node=2 --master_port='29501' trainB.py
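If you are not sure whether 29501 is free either, you can ask the operating system for an unused port instead of guessing. The following is a minimal sketch using only Python's standard socket module. The helper names find_free_port and is_port_free are hypothetical (they are not part of PyTorch), and the check assumes IPv4 on the local machine.

```python
import socket


def is_port_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical helper: it only attempts an IPv4 bind on this machine.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Covers errno 98 (Address already in use) and similar failures.
            return False
        return True


def find_free_port(host: str = "") -> int:
    """Ask the OS for a TCP port that is unused right now (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS choose any free port
        return s.getsockname()[1]


if __name__ == "__main__":
    # Print a port number that can be passed to --master_port.
    print(find_free_port())
```

The printed number can then be passed as the --master_port value when launching program B. Note that the port is only known to be free at the moment of the check; another process could still take it before the launch, so this reduces the chance of a conflict rather than ruling it out.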
Problem solved!