Replies: 6 comments
-
Thank you for using Amazon SageMaker. Would you be able to provide the following to help us troubleshoot your issue?
We look forward to hearing back from you.
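For reference, a minimal sketch (the imports and printed items are assumptions, not the list AWS support asked for) of how the environment details usually requested in threads like this, such as SDK and framework versions, could be collected:

```python
# Hypothetical diagnostic sketch: print package versions and the active AWS
# region, details commonly requested when troubleshooting SageMaker training.
import boto3
import sagemaker
import tensorflow as tf

print("SageMaker Python SDK:", sagemaker.__version__)
print("TensorFlow:", tf.__version__)
print("AWS region:", boto3.session.Session().region_name)
```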
-
Thank you. Here are answers to your questions:
SageMaker Python SDK version: 2.13.0
This appears to be a problem with the TensorFlow Estimator: GPU utilization goes to zero. I am supplying a notebook that demonstrates this by training my model both with the estimator and without it. The link below is my codebase that recreates the issue. Please open the notebook DeepTradingAI.ipynb on an AWS ml.p3.2xlarge instance. The notebook has two parts:
Note: you do not need to worry about the data, as it is fetched from a database connection. Here is the download link:
Thanks,
Nektarios
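For context, a hedged sketch of the kind of SageMaker Python SDK 2.x TensorFlow Estimator launch described above; this is not the author's notebook code, and the role, entry point, framework version, and S3 path are placeholders:

```python
# Minimal sketch of a SageMaker TensorFlow Estimator run on ml.p3.2xlarge
# (SageMaker Python SDK 2.x). Entry point, framework version, and data
# location are illustrative assumptions.
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()  # assumes a SageMaker notebook/Studio environment

estimator = TensorFlow(
    entry_point="train.py",           # hypothetical training script
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",    # single V100 GPU instance
    framework_version="2.3.0",        # assumption; should match the local TF version
    py_version="py37",
)

estimator.fit({"training": "s3://my-bucket/training-data"})  # placeholder S3 URI
```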
-
Even if I use my own custom Docker container, I get:
-
I managed to get the above loaded, but I still get:
2020-10-07 16:17:42.690258: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
This is my Dockerfile:
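(The Dockerfile itself is not reproduced above.) As an aside, that dlopen warning usually means a CUDA or cuDNN shared library is missing or does not match the TensorFlow build inside the container; a hedged diagnostic sketch like the following can confirm whether TensorFlow actually sees the GPU:

```python
# Hedged diagnostic sketch: check whether TensorFlow inside the container can
# see the GPU. If CUDA/cuDNN libraries are missing or mismatched, the list of
# visible GPUs will be empty even though nvidia-smi shows the device.
import subprocess
import tensorflow as tf

build = tf.sysconfig.get_build_info()  # available in TF 2.3+
print("TF built against CUDA:", build.get("cuda_version", "n/a"))
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# nvidia-smi confirms the driver and device are exposed to the container
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```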
-
Do the boxes have CUDA installed properly? This all used to work, and all of a sudden it broke.
-
Did you solve this issue? I am facing similar problems: running the same script locally takes about 10 s/epoch, while on a SageMaker instance it takes about 2 min/epoch. I don't understand what the problem is.
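One hedged way to narrow this down (an illustrative sketch, not code from the thread) is to log GPU utilization per epoch and see whether training on the SageMaker instance has silently fallen back to the CPU:

```python
# Illustrative sketch: a Keras callback that logs GPU utilization after each
# epoch via nvidia-smi, to check whether the slow instance is using the GPU.
import subprocess
import tensorflow as tf

class GpuUtilizationLogger(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        query = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        print(f"epoch {epoch}: GPU utilization = {query.stdout.strip()}")

# usage (placeholder model and data):
# model.fit(x_train, y_train, epochs=10, callbacks=[GpuUtilizationLogger()])
```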
-
Training my deep learning model on a p3.2xlarge instance used to take 2 seconds per epoch. Now, all of a sudden, it takes around 39 seconds per epoch! Total training time used to be 15 minutes and can now go up to 2 hours!
Please advise why this is happening.
See image attachment.
Thank you!
Nektarios
Describe the bug
Training is very slow on p3.2xlarge; the GPU does not appear to be used.