环境:
在容器中跑pytorch模型的训练
问题表现:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/root/anaconda3/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/root/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 358, in reduce_storage
fd, size = storage._share_fd_cpu_()
RuntimeError: unable to write to file </torch_27822_2114653298_869>: No space left on device (28)
问题原因:
torch.utils.data.dataloader使用了共享内存。容器根目录没有空间,导致共享内存写到文件的时候报错。
解决方案,dataloader不使用共享内存:
from torch.utils.data import dataloader
from torch.multiprocessing import reductions
from multiprocessing.reduction import ForkingPickler
default_collate_func = dataloader.default_collate
def default_collate_override(batch):
dataloader._use_shared_memory = False
return default_collate_func(batch)
setattr(dataloader, 'default_collate', default_collate_override)
for t in torch._storage_classes:
if sys.version_info[0] == 2:
if t in ForkingPickler.dispatch:
del ForkingPickler.dispatch[t]
else:
if t in ForkingPickler._extra_reducers:
del ForkingPickler._extra_reducers[t]
参考:文章来源:https://www.toymoban.com/news/detail-822780.html
https://blog.csdn.net/u012796629/article/details/105936386文章来源地址https://www.toymoban.com/news/detail-822780.html
到了这里,关于pytorch无法把共享内存写入文件的文章就介绍完了。如果您还想了解更多内容,请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章,希望大家以后多多支持TOY模板网!