Solving PyTorch DDP: Finding the cause of "Expected to mark a variable ready only once"


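This morning, while running ablation experiments, I needed to reproduce results from two months ago. For no obvious reason, the same code in the same environment no longer ran, and loss.backward() raised the following error: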
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the ``forward`` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple ``checkpoint`` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.

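After searching online, I found that very few people had ever asked about this error. One poster on Stack Overflow resolved it by setting find_unused_parameters to False, which inexplicably made the error go away. When I did the same, though, a different error appeared while training G with D frozen: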
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
But fixing "Expected to have finished reduction in the prior iteration before starting a new one" involves setting find_unused_parameters to True, which appears to directly contradict the fix above...
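The second error message says that some parameters were not used in producing the loss. One way to check which ones, sketched below, is to run a single forward/backward pass on the plain (non-DDP) module and list the trainable parameters whose gradient is still None. The helper name `params_without_grad` and the toy `TinyNet` model are hypothetical and only serve as illustration; they are not from the original code.

```python
from typing import List

import torch
import torch.nn as nn


def params_without_grad(model: nn.Module) -> List[str]:
    """Return names of trainable parameters that received no gradient."""
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and p.grad is None
    ]


class TinyNet(nn.Module):
    # Hypothetical toy model: `unused` never takes part in forward().
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)

    def forward(self, x):
        return self.used(x)


model = TinyNet()
loss = model(torch.randn(3, 4)).sum()
loss.backward()

# Parameters listed here did not contribute to the loss in this pass.
missing = params_without_grad(model)
print(missing)
```

In a GAN setup, running this separately for the G step and the D step may show whether the set of unused parameters differs between the two, which is the kind of situation the find_unused_parameters flag is meant to handle.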

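In the end I was out of options. I suspected the problem came from running distributed training on a single GPU, so I simply turned DDP off and trained again, and only then did the errors disappear. It all feels like black magic...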
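The workaround of turning DDP off on a single GPU can be written as a small guard, so the same script still uses DDP when several processes are launched. This is only a minimal sketch: the helper name `maybe_wrap_ddp` is hypothetical, and it assumes that torch.distributed.init_process_group() has already been called whenever world_size is greater than 1.

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def maybe_wrap_ddp(model: nn.Module, world_size: int, **ddp_kwargs) -> nn.Module:
    """Wrap `model` in DDP only when more than one process is training."""
    if world_size <= 1:
        # Single process: skip DDP entirely and train the plain module.
        return model
    # Assumption: the process group is already initialized at this point.
    return DDP(model, **ddp_kwargs)


net = nn.Linear(4, 2)
wrapped = maybe_wrap_ddp(net, world_size=1)
print(wrapped is net)
```

With world_size=1 the original module is returned unchanged, so neither of the two DDP errors above can be raised by the wrapper.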
