DXGI_ERROR_DEVICE_REMOVED Error #95
Looks like the Radeon Vega 8 is an integrated GPU, and from the logs you shared it looks like it's having trouble allocating memory. How much system memory (RAM) do you have? If you can provide a dxdiag.txt, it would be helpful in understanding the capabilities of your system.

One thing you can try is lowering the default DML heap allocator's allocation size from 4GB to something smaller. For example, you can add these lines to the top of your first script (or set the environment variable elsewhere before the Python process launches):

```python
import os
os.environ["TF_DIRECTML_MAX_ALLOC_SIZE"] = "536870912"  # 512MB
```

You can also enable verbose logging, which will print even more details that might help here. Example:

```python
import os
os.environ["TF_CPP_MIN_VLOG_LEVEL"] = "3"
```
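Combining the two suggestions into a single preamble might look like the sketch below. The 512MB cap is only a starting point (you can lower it further if the error persists), and the variable name `max_alloc_bytes` is just for illustration:

```python
import os

# 512 MB expressed in bytes: 512 * 1024 * 1024 = 536870912.
max_alloc_bytes = 512 * 1024 * 1024

# Both variables must be set BEFORE `import tensorflow`,
# otherwise TensorFlow-DirectML will not pick them up.
os.environ["TF_DIRECTML_MAX_ALLOC_SIZE"] = str(max_alloc_bytes)
os.environ["TF_CPP_MIN_VLOG_LEVEL"] = "3"

print(os.environ["TF_DIRECTML_MAX_ALLOC_SIZE"])  # 536870912
```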
@jstoecker, thank you for your help. Here is the DxDiag.txt file. My PC has 6GB of RAM and 2GB of GPU memory. I tested the parameter here to limit memory allocation and it removed the error. Thanks a lot for the help.
Good to hear, and thanks for the dxdiag! I'll open a bug internally to see if we can improve this experience so it's not necessary to set an environment variable.
One more thing to add: if you're still seeing the error with the yolov3 sample, don't forget to run setup.py first before trying detect_video.py, because it looks like it's having trouble finding the checkpoint file. :)
@jstoecker and @adtsai, with the memory allocation change it really did work. One thing I noticed now is that detect_video.py is using shared memory and not dedicated memory. Do you know if DirectML supports access to dedicated memory? I ask because object detection is very slow.
In short: yes, DirectML supports access to dedicated memory!

DirectML itself doesn't allocate memory for GPU resources: that's up to the application/framework using it, such as TensorFlow-DirectML (TFDML) in this case. TFDML has a number of allocators for different purposes, but the bulk of the memory (to store the tensors used in GPU calculations) will be backed by subregions of a so-called default heap. Default heaps reflect different memory pools based on the GPU architecture (UMA or NUMA/discrete).

Your Radeon Vega 8 is an integrated GPU, so the 2GB of dedicated memory you see isn't physical VRAM but rather reserved system memory. In other words, your system actually has 8GB of RAM, but the integrated GPU is claiming 2GB of it for exclusive access. This blog explains some of the differences between dedicated and shared memory, how they are reported in Task Manager, and some differences between discrete and integrated GPUs in this respect.

Integrated GPUs are, unfortunately, not going to be particularly fast in machine learning. It's worth pointing out that we haven't really optimized TFDML for integrated GPUs (e.g. we could avoid some memory copies since default-heap resources will always live in the "L0" memory pool); however, it's unlikely that you'll see huge performance gains over the CPU without using a more powerful discrete GPU.
Hello, I have a problem. I don't know if anyone else has had this issue. I have a Vega 8 and the drivers are all installed correctly, but it gives the error DXGI_ERROR_DEVICE_REMOVED when I try to run the following script.

I've already followed the instructions at https://aka.ms/tfdmltimeout but it doesn't work.

I think this is the problem when I try to run it.