Replies: 6 comments
-
Thank you for using Amazon SageMaker. Would you be able to provide the following to help us troubleshoot your issue?
We look forward to hearing back from you.
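For reference, a minimal sketch (the imports and printed items are assumptions, not the list AWS support asked for) of how the environment details usually requested in threads like this, such as SDK and framework versions, could be collected:

```python
# Hypothetical diagnostic sketch: print package versions and the active AWS
# region, details commonly requested when troubleshooting SageMaker training.
import boto3
import sagemaker
import tensorflow as tf

print("SageMaker Python SDK:", sagemaker.__version__)
print("TensorFlow:", tf.__version__)
print("AWS region:", boto3.session.Session().region_name)
```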
-
Thank you. Here are answers to your questions:
SageMaker Python SDK version: 2.13.0
This appears to be a problem with the TensorFlow Estimator: GPU utilization goes to zero. I am supplying a notebook that demonstrates this by training my model both with the estimator and without it. The link below is my codebase that recreates the issue. Please open the notebook DeepTradingAI.ipynb on an AWS ml.p3.2xlarge instance. The notebook has two parts:
Note: you do not need to worry about the data, as it is fetched from a database connection. Here is the download link:
Thanks,
Nektarios
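For context, a hedged sketch of the kind of SageMaker Python SDK 2.x TensorFlow Estimator launch described above; this is not the author's notebook code, and the role, entry point, framework version, and S3 path are placeholders:

```python
# Minimal sketch of a SageMaker TensorFlow Estimator run on ml.p3.2xlarge
# (SageMaker Python SDK 2.x). Entry point, framework version, and data
# location are illustrative assumptions.
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()  # assumes a SageMaker notebook/Studio environment

estimator = TensorFlow(
    entry_point="train.py",           # hypothetical training script
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",    # single V100 GPU instance
    framework_version="2.3.0",        # assumption; should match the local TF version
    py_version="py37",
)

estimator.fit({"training": "s3://my-bucket/training-data"})  # placeholder S3 URI
```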
-
Even if I use my own custom Docker container, I get:
-
I managed to get the above loaded, but I still get:
2020-10-07 16:17:42.690258: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
This is my Dockerfile:
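(The Dockerfile itself is not reproduced above.) As an aside, that dlopen warning usually means a CUDA or cuDNN shared library is missing or does not match the TensorFlow build inside the container; a hedged diagnostic sketch like the following can confirm whether TensorFlow actually sees the GPU:

```python
# Hedged diagnostic sketch: check whether TensorFlow inside the container can
# see the GPU. If CUDA/cuDNN libraries are missing or mismatched, the list of
# visible GPUs will be empty even though nvidia-smi shows the device.
import subprocess
import tensorflow as tf

build = tf.sysconfig.get_build_info()  # available in TF 2.3+
print("TF built against CUDA:", build.get("cuda_version", "n/a"))
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# nvidia-smi confirms the driver and device are exposed to the container
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```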
-
Do the boxes have CUDA installed properly? This all used to work, and all of a sudden it broke.
-
Did you solve this issue? I am facing similar problems: running the same script locally takes about 10 s/epoch, while on a SageMaker instance it takes about 2 min/epoch. I don't understand what the problem is.
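One hedged way to narrow this down (an illustrative sketch, not code from the thread) is to log GPU utilization per epoch and see whether training on the SageMaker instance has silently fallen back to the CPU:

```python
# Illustrative sketch: a Keras callback that logs GPU utilization after each
# epoch via nvidia-smi, to check whether the slow instance is using the GPU.
import subprocess
import tensorflow as tf

class GpuUtilizationLogger(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        query = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        print(f"epoch {epoch}: GPU utilization = {query.stdout.strip()}")

# usage (placeholder model and data):
# model.fit(x_train, y_train, epochs=10, callbacks=[GpuUtilizationLogger()])
```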
-
Training my deep learning model on a p3.2xlarge instance used to take 2 seconds per epoch. Now, all of a sudden, it takes around 39 seconds per epoch! Total training time used to be 15 minutes and can now go up to 2 hours!
Please advise why this is happening.
See image attachment.
Thank you!
Nektarios
Describe the bug
Training is very slow on p3.2xlarge; the GPU does not appear to be used.