🐛 Describe the bug
Hi,
I tried the BERT and ResNet examples from the tutorial https://pytorch.org/blog/Accelerating-Hugging-Face-and-TIMM-models/, but they ran slower with torch.compile on a V100 under the Ubuntu environment I have (Linux GCRHYP3C148 4.15.0-193-generic #204-Ubuntu SMP).
Isn't it supposed to be faster?
Thanks
Error logs
No response
Minified repro
"""
resnet
"""
import torch

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
opt_model = torch.compile(model, backend="inductor")

# time one forward pass of the eager model
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
model(torch.randn(1, 3, 64, 64))
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")

# time one forward pass of the compiled model (this is its first call)
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
opt_model(torch.randn(1, 3, 64, 64))
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")
This runs as follows, and the compiled model runs ~74x slower:
~/project/sandbox$ python hello_torchdynamo4.py
Using cache found in /home/styoun/.cache/torch/hub/pytorch_vision_v0.10.0
estimated_ms=223.81260681152344
estimated_ms=16573.572265625
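For reference, since torch.compile does its one-time compilation on the model's first call, here is a sketch of how one could instead time a steady-state forward pass after a few warmup calls; the timed_forward helper is my own, not from the tutorial:

import torch

def timed_forward(fn, x, n_warmup=3):
    # timed_forward is a hypothetical helper: run a few warmup passes first
    # (the first call of a torch.compile'd model pays the compilation cost),
    # then time a single steady-state forward pass with CUDA events
    for _ in range(n_warmup):
        fn(x)
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    fn(x)
    end_event.record()
    torch.cuda.synchronize()
    return start_event.elapsed_time(end_event)

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True).to("cuda").eval()
opt_model = torch.compile(model, backend="inductor")
x = torch.randn(1, 3, 64, 64, device="cuda")
with torch.no_grad():
    print(f"eager_ms={timed_forward(model, x)}")
    print(f"compiled_ms={timed_forward(opt_model, x)}")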
It's similar for the following BERT example in the tutorial: it's 14.7x slower with the extra line "model = torch.compile(model)".
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
model = torch.compile(model) # This is the only line of code that we changed
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
output = model(**encoded_input)
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")
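Likewise for BERT, a single timed call is noisy and, for the compiled model, includes the one-time compilation, so here is a sketch that averages over repeated forward passes after a warmup (the avg_forward_ms helper is again my own):

import torch
from transformers import BertTokenizer, BertModel

def avg_forward_ms(fn, inputs, n_warmup=3, n_iters=20):
    # avg_forward_ms is a hypothetical helper: warm up first (the compiled
    # model compiles on its first call), then average n_iters forward passes
    for _ in range(n_warmup):
        fn(**inputs)
    torch.cuda.synchronize()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(n_iters):
        fn(**inputs)
    end_event.record()
    torch.cuda.synchronize()
    return start_event.elapsed_time(end_event) / n_iters

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = torch.compile(BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0"))
encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors='pt').to(device="cuda:0")
print(f"avg_ms={avg_forward_ms(model, encoded_input)}")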