RuntimeError: CUDA error: an illegal memory access was encountered

While training on the GPU I kept hitting the error below and could not get rid of it. It looked like it was related to the amount of data: training on a small dataset was fine, but the error appeared as soon as I switched to a larger one. However, judging from what I found online, other people have run into the same error under completely different circumstances.

RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Later I found a discussion online where someone suggested running the training script like this:

CUDA_LAUNCH_BLOCKING=1 python train.py
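
The same effect can also be had from inside the script, as long as the variable is set before the first CUDA call. This is just a minimal sketch; putting it at the very top of a hypothetical train.py is my own assumption:

# Force synchronous kernel launches so the Python traceback points at the
# op that actually failed (must run before torch touches the GPU).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch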

I tried it, but after a few epochs the error came back. Still, with CUDA_LAUNCH_BLOCKING=1 set, the error message became much more detailed:

Traceback (most recent calls WITHOUT Sacred internals):
File "train.py", line 98, in run
model(data_loader)
File "/home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/zcy/python_workspace/DSTex/model.py", line 117, in forward
self.train_epoch(pbar_train, cur_epoch, train_batch_num, train_statistics_every)
File "/home/zcy/python_workspace/DSTex/model.py", line 162, in train_epoch
train_perf = self.train_batch(train_data, cur_epoch)
File "/home/zcy/python_workspace/DSTex/model.py", line 215, in train_batch
loss.backward()
File "/home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (copy_device_to_device at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/ATen/native/cuda/Copy.cu:61)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f5fc5c77b5e in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::native::copy_device_to_device(at::TensorIterator&, bool) + 0x861 (0x7f5fc82b12b1 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x240f91c (0x7f5fc82b391c in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x9146ac (0x7f5fed76d6ac in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x911d73 (0x7f5fed76ad73 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x44 (0x7f5fed76c834 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::native::embedding_dense_backward_cuda(at::Tensor const&, at::Tensor const&, long, long, bool) + 0x4bd (0x7f5fc83fdbdd in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xde41dc (0x7f5fc6c881dc in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0xe2404c (0x7f5fedc7d04c in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x28037f1 (0x7f5fef65c7f1 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0xe2404c (0x7f5fedc7d04c in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::native::embedding_backward(at::Tensor const&, at::Tensor const&, long, long, bool, bool) + 0x124 (0x7f5fed7ca1a4 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0xeaefe0 (0x7f5fedd07fe0 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x29acffa (0x7f5fef805ffa in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xee78d9 (0x7f5fedd408d9 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::generated::EmbeddingBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x1cd (0x7f5fef45ef9d in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x2ae8215 (0x7f5fef941215 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f5fef93e513 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f5fef93f2f2 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f5fef937969 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f5ff2c7e558 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #21: <unknown function> + 0xc819d (0x7f5ff56e119d in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #22: <unknown function> + 0x76db (0x7f6017a0d6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x3f (0x7f601773688f in /lib/x86_64-linux-gnu/libc.so.6)

It looks as if the error happens while copying a tensor. I remembered someone claiming that "a small code change fixes it", so I decided to try the line they suggested:

torch.cuda.set_device(<device_num>)

In short, the advice was to add the call above after calling tensor.cuda() or model.to(device), roughly as in the sketch below. Update: this did not work either; the error came back after a while.
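
For completeness, a minimal sketch of the placement I tried (nn.Linear and device index 0 are placeholders, not my real model):

import torch
import torch.nn as nn

model = nn.Linear(10, 2).cuda()   # move the model to the GPU first, as advised
torch.cuda.set_device(0)          # then pin the current CUDA device explicitly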

Later I also set torch.backends.cudnn.benchmark=False, which failed as well.

If none of that works, you can try the assorted voodoo fixes people post in the related GitHub issue...

It seemed to be a random error, and for a long time I had no solution...

Update: I think I finally found the cause.

(This post seems to get a lot of readers, so I have updated it once more.)

Update, many days later: the actual cause was that, when calling the cross-entropy loss, some ground-truth labels were larger than the dimension of the predicted probability distribution (valid labels must lie in [0, num_classes - 1]). For example, the probability distribution had 200 dimensions while the corresponding label was 205. In my case I was using a pointer network to predict index positions within a sentence, and a data-processing mistake added an extra 1 to the ground-truth labels.
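
To make the failure mode concrete, here is a minimal sketch (the shapes and values are made up), together with the cheap sanity check that would have caught my bug:

import torch
import torch.nn as nn

num_classes = 200
logits = torch.randn(4, num_classes)       # predicted distribution for 4 samples
targets = torch.tensor([3, 17, 205, 42])   # 205 is out of range by mistake

# nn.CrossEntropyLoss expects targets in [0, num_classes - 1]; an out-of-range
# label indexes past the end of the tensor. On the CPU this raises a clear
# IndexError, but on the GPU it can surface later as an illegal memory access.
if targets.min() < 0 or targets.max() >= num_classes:
    print(f"bad label: max={targets.max().item()}, num_classes={num_classes}")
else:
    loss = nn.CrossEntropyLoss()(logits, targets)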

Note: this error can be caused by many different problems, so I cannot guarantee that the fix above will work for you.