compiled model run in v100 GPU is slower #2008

Closed
stephen-youn opened this issue Dec 21, 2022 · 6 comments
Labels
bad-ux (Slow compilation, High memory footprint), bug (Something isn't working)

Comments

@stephen-youn commented Dec 21, 2022

🐛 Describe the bug

Hi,
I tried the BERT and ResNet examples from the tutorial https://pytorch.org/blog/Accelerating-Hugging-Face-and-TIMM-models/,
but they ran slower with "torch.compile" on a V100 under the Ubuntu environment I have (i.e., Linux GCRHYP3C148 4.15.0-193-generic #204-Ubuntu SMP).
Isn't it supposed to be faster?
Thanks

Error logs

No response

Minified repro

"""
resnet
"""

import torch

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
opt_model = torch.compile(model, backend="inductor")

# time a single forward pass of the eager model
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
model(torch.randn(1, 3, 64, 64))
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")

# time a single forward pass of the compiled model (this is its first call)
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
opt_model(torch.randn(1, 3, 64, 64))
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")

This runs as follows, and the compiled model is roughly 74x slower, as shown below:

~/project/sandbox$ python hello_torchdynamo4.py
Using cache found in /home/styoun/.cache/torch/hub/pytorch_vision_v0.10.0
estimated_ms=223.81260681152344
estimated_ms=16573.572265625

It's similar for the following BERT example from the tutorial: it's 14.7x slower with the extra line "model = torch.compile(model)".

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
model = torch.compile(model)  # This is the only line of code that we changed
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")

# time a single forward pass of the compiled model (its first call)
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
output = model(**encoded_input)
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")

@stephen-youn added the bug (Something isn't working) label Dec 21, 2022
@williamwen42 (Member)

Different torch.compile modes may result in different performance results (e.g. torch.compile(model, mode="max-autotune")).

Also, torch.compile will generally take longer on the first pass since it needs to compile, but future passes are expected to be faster than baseline.
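For example, a minimal sketch (not from the original thread; it reuses the ResNet-18 from the repro above) of trying a non-default mode and timing only calls made after the first, already-compiled one:

import torch

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
opt_model = torch.compile(model, mode="max-autotune")

x = torch.randn(1, 3, 64, 64)
opt_model(x)  # first call: compilation happens here, so it is slow
opt_model(x)  # later calls run the already-compiled code and should be faster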

@williamwen42 added the bad-ux (Slow compilation, High memory footprint) label Dec 22, 2022
@stephen-youn (Author)

I tried running it twice, but it was still slower. Is there any suggestion for debugging this (e.g., passing a particular option to compile, or enabling trace or verbose output)?

@anijain2305 (Contributor) commented Dec 30, 2022

@stephen-youn Thanks for trying out torch.compile. PyTorch 2.0 compilers are JIT compilers, i.e., they compile the model on the first iteration. In your script, you are measuring the first-iteration latency, which is why you are observing the high latency. I modified your script and am observing better numbers on an A100 GPU (the numbers are not stable, probably because we are measuring just one iteration, but the speedup is evident).

Script

import torch
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
opt_model = torch.compile(model, backend="inductor")


# warmup
for _ in range(3):
    model(torch.randn(1,3,64,64))

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
model(torch.randn(1,3,64,64))
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")


# warmup (the first call compiles the model)
for _ in range(3):
    opt_model(torch.randn(1,3,64,64))

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
opt_model(torch.randn(1,3,64,64))
end_event.record()
torch.cuda.synchronize()
estimate_ms = start_event.elapsed_time(end_event)
print(f"estimated_ms={estimate_ms}")

Output

estimated_ms=1222.14990234375
estimated_ms=326.3006286621094

Please let me know if you have any other questions, and feel free to close the issue if your question is answered.
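To stabilize the numbers, a minimal sketch (not from the original thread; it assumes a CUDA-capable machine as in the thread, since the CUDA events require one) that averages over several timed iterations after warmup:

import torch

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
opt_model = torch.compile(model, backend="inductor")
x = torch.randn(1, 3, 64, 64)

# warm up both models so compilation cost is excluded from the measurement
for _ in range(3):
    model(x)
    opt_model(x)

def bench(fn, iters=20):
    # average over several iterations to reduce run-to-run noise
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(iters):
        fn(x)
    end_event.record()
    torch.cuda.synchronize()
    return start_event.elapsed_time(end_event) / iters

print(f"eager   : {bench(model):.2f} ms/iter")
print(f"compiled: {bench(opt_model):.2f} ms/iter")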

@stephen-youn (Author) commented Dec 30, 2022

Yes, I also modified the code similarly and got a perf gain on V100 too.
One follow-up question: what is the difference between torch.compile(model, passes={"triton-autotune": True}) and torch.compile(model, backend="inductor")?
Does one use Triton for matmul and the other not?
What is the default matmul kernel in Inductor, isn't it Triton?
It seems the default mm is set to "aten", not "triton" (link).
How can I make sure I use Triton for matmuls?

@anijain2305 (Contributor) commented Dec 30, 2022

@stephen-youn

  • backend="inductor" uses the TorchInductor backend. This is also the default backend, so torch.compile(model, passes={"triton-autotune": True}) is equivalent to torch.compile(model, backend="inductor", passes={"triton-autotune": True}).
  • The passes argument can be used to set up TorchInductor flags. The triton_autotune flag is already set to True by default. triton_autotune is not used for tuning matmul operations; it is used for tuning the fused kernels (pointwise, reduction, scatter, etc.). So all of these are exactly the same:
    • torch.compile(mod)
    • torch.compile(mod, backend="inductor")
    • torch.compile(model, passes={"triton-autotune": True})

Reading between the lines, it seems you are interested in mm operators. For those:

  • By default, Inductor falls back to the aten implementation for mm/bmm ops. We do not use Triton to generate the code for these matmul ops.
  • If you want to use Triton for matmul, you could use passes={'triton-mm': True, 'triton-bmm': True} (see the sketch below). This part is not super heavily tested, so please be gentle, and do open issues if you run into problems.
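A minimal sketch (not from the original thread) of opting into the Triton matmul path via the passes flags described above; the small nn.Sequential model is hypothetical and only for illustration, it assumes a CUDA-capable machine, and the flags follow the comment's description rather than a tested recipe:

import torch
import torch.nn as nn

# hypothetical matmul-heavy model, for illustration only
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).cuda()

# opt into Triton-generated mm/bmm kernels via the passes flags described above
opt_model = torch.compile(model, backend="inductor",
                          passes={"triton-mm": True, "triton-bmm": True})

x = torch.randn(8, 512, device="cuda")
for _ in range(3):
    opt_model(x)  # warm up so compilation cost is excluded from any later timing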

@stephen-youn (Author)

I tried "opt_model = torch.compile(model, passes={'triton-mm': "triton", 'triton-bmm': True})"
but it crashed.
so i opened an issue here (link)
