enable finegrained_fp8 and granite_speech cases on XPU #38036
Changes from all commits: c2b7d1a, ab89c7c, 234b759, fbb1c67, 612e51d, 7dded96, 56639cf, fe4ce7b, 279faa9
```diff
@@ -18,11 +18,13 @@
 from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, FineGrainedFP8Config, OPTForCausalLM
 from transformers.testing_utils import (
+    backend_empty_cache,
     require_accelerate,
     require_read_token,
-    require_torch_gpu,
-    require_torch_multi_gpu,
+    require_torch_accelerator,
+    require_torch_multi_accelerator,
     slow,
+    torch_device,
 )
 from transformers.utils import is_accelerate_available, is_torch_available
```
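The new imports are the device-agnostic counterparts of the CUDA-only helpers: require_torch_accelerator skips a test unless some supported accelerator is present, and backend_empty_cache routes the cache-clearing call to whichever backend torch_device resolves to. A rough sketch of the dispatch idea (illustrative only; the real helpers live in transformers.testing_utils and differ in detail):

```python
import torch

def empty_cache_for(device: str) -> None:
    # Hypothetical re-implementation of the backend_empty_cache idea:
    # call the matching backend's empty_cache instead of hard-coding CUDA.
    if device == "cuda":
        torch.cuda.empty_cache()
    elif device == "xpu":
        # Requires a PyTorch build with Intel XPU support.
        torch.xpu.empty_cache()
    # Other backends (mps, npu, ...) would get their own branch.
```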
```diff
@@ -34,7 +36,7 @@
     from accelerate import init_empty_weights


-@require_torch_gpu
+@require_torch_accelerator
 class FineGrainedFP8ConfigTest(unittest.TestCase):
     def test_to_dict(self):
         """
```
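For context on what FineGrainedFP8ConfigTest covers: the quantization config can be serialized to a plain dict and rebuilt from it. A minimal sketch of that round trip, assuming the weight_block_size keyword seen later in the diff (this snippet is not part of the PR):

```python
from transformers import FineGrainedFP8Config

# Build a config with an explicit block size and round-trip it through a dict,
# which is essentially what test_to_dict/test_from_dict exercise.
config = FineGrainedFP8Config(weight_block_size=(32, 32))
config_dict = config.to_dict()
restored = FineGrainedFP8Config.from_dict(config_dict)
assert restored.weight_block_size == config.weight_block_size
```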
```diff
@@ -60,13 +62,13 @@ def test_from_dict(self):
 @slow
 @require_accelerate
 @require_read_token
-@require_torch_gpu
+@require_torch_accelerator
 class FP8QuantizerTest(unittest.TestCase):
     model_name = "meta-llama/Llama-3.2-1B"
     input_text = "Once upon a time"
     max_new_tokens = 10
     EXPECTED_OUTPUT = "Once upon a time, there was a man who was very rich."
-    device_map = "cuda"
+    device_map = torch_device
     offload_device_map = {
         "model.embed_tokens": 0,
         "model.layers.0": 0,
```
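Switching device_map to torch_device is what makes the class portable: torch_device is the device string transformers' test utilities pick for the current machine, so the same test targets CUDA, XPU, or CPU without edits. A rough sketch of that selection logic (an assumption for illustration, not the actual testing_utils code):

```python
import torch

def pick_test_device() -> str:
    # Hypothetical stand-in for transformers.testing_utils.torch_device.
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    return "cpu"

device_map = pick_test_device()  # e.g. "xpu" on an Intel GPU machine
```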
```diff
@@ -103,7 +105,7 @@ def setUpClass(cls):

     def tearDown(self):
-        gc.collect()
-        torch.cuda.empty_cache()
+        backend_empty_cache(torch_device)
+        gc.collect()

     def test_quantized_model_conversion(self):
```
```diff
@@ -151,7 +153,8 @@ def test_quantized_model(self):
         input_ids = self.tokenizer(self.input_text, return_tensors="pt").to(self.device_map)

         output = self.quantized_model.generate(**input_ids, max_new_tokens=self.max_new_tokens, do_sample=False)
-        self.assertEqual(self.tokenizer.decode(output[0], skip_special_tokens=True), self.EXPECTED_OUTPUT)
+        output_tokens = self.tokenizer.decode(output[0], skip_special_tokens=True)
+        self.assertEqual(output_tokens, self.EXPECTED_OUTPUT)

     def test_save_pretrained(self):
         """
```
```diff
@@ -188,11 +191,12 @@ def test_block_size(self):
         )
         self.assertEqual(quantized_model.config.quantization_config.weight_block_size, (32, 32))

-    @require_torch_multi_gpu
-    def test_quantized_model_multi_gpu(self):
+    @require_torch_multi_accelerator
+    def test_quantized_model_multi_accelerator(self):
         """
-        Simple test that checks if the quantized model is working properly with multiple GPUs
-        set CUDA_VISIBLE_DEVICES=0,1 if you have more than 2 GPUs
+        Simple test that checks if the quantized model is working properly with multiple accelerators
+        set CUDA_VISIBLE_DEVICES=0,1 if you have more than 2 GPUs; or set ZE_AFFINITY_MASK=0,1 if you
+        have more than 2 XPUs.
         """
         input_ids = self.tokenizer(self.input_text, return_tensors="pt").to(self.device_map)
         quantization_config = FineGrainedFP8Config()
```

Review discussion on this change:

- For […] I don't know whether there are any other scenarios I didn't consider, but for this case it seems the correct ground truth should be 0. @ydshieh, please let me know your insights, thanks.
- I believe this depends on the VRAM of your GPUs/XPUs; it will only use both if one is not enough. Otherwise, maybe it would make sense to use another device_map strategy here, like "balanced".
- I will let @SunMarc or @MekkCyber share their thoughts on this. On our CI these tests are not collected; I believe it is due to the […]. @yao-matrix, you are able to run this test...? I am surprised. I will take a look at this issue.
- embed_tokens is indeed tied to lm_head, but the layers can be dispatched to other GPUs. Setting "auto" in device_map will default to the "balanced" strategy.
- I removed […]
- @IlyasMoutawwakil @SunMarc yes, I tried […]
- Will investigate a bit more. @MekkCyber tested locally and it works, but when running with pytest it fails.
- @SunMarc @MekkCyber You will need to remove […]; related issue: #38093
- If you think there is no more change required for this […]. From my side, I am just waiting for a nit change regarding a variable name.
- I think it's good to go from my side; we need to figure out why it fails with […]
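The reviewers point out that embed_tokens and lm_head are tied (so they stay on the same device) while the decoder layers can be spread across devices, and that device_map="auto" falls back to the "balanced" strategy. A hedged sketch of loading the test model that way and inspecting the resulting placement (model name taken from the test class above; this snippet is not part of the PR):

```python
from transformers import AutoModelForCausalLM, FineGrainedFP8Config

# "balanced" asks accelerate to spread modules evenly over all visible
# accelerators rather than filling device 0 first.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    device_map="balanced",
    quantization_config=FineGrainedFP8Config(),
)

# hf_device_map records the device each dispatched module landed on; tied
# weights such as embed_tokens/lm_head end up on the same device.
print(model.hf_device_map)
```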
```diff
@@ -204,8 +208,8 @@ def test_quantized_model_multi_gpu(self):
         output = quantized_model.generate(**input_ids, max_new_tokens=self.max_new_tokens, do_sample=False)
         self.assertEqual(self.tokenizer.decode(output[0], skip_special_tokens=True), self.EXPECTED_OUTPUT)

-    @require_torch_multi_gpu
-    def test_save_pretrained_multi_gpu(self):
+    @require_torch_multi_accelerator
+    def test_save_pretrained_multi_accelerators(self):
         """
         Simple test that checks if the quantized model is working properly after being saved and loaded
         """
```
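The updated docstring tells contributors to cap visibility at two devices when more are installed. A small sketch of doing that from Python before torch initializes its backends (environment variable names taken from the docstring; which one applies depends on whether the machine has NVIDIA GPUs or Intel XPUs):

```python
import os

# Limit the test process to the first two devices. CUDA_VISIBLE_DEVICES is read
# by the CUDA runtime, ZE_AFFINITY_MASK by Intel's Level Zero runtime; only the
# one matching your hardware has any effect.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")
os.environ.setdefault("ZE_AFFINITY_MASK", "0,1")

import torch  # imported after the variables are set so the masking applies

if torch.cuda.is_available():
    print("visible CUDA devices:", torch.cuda.device_count())
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    print("visible XPU devices:", torch.xpu.device_count())
```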
```diff
@@ -245,9 +249,9 @@ def test_save_pretrained_offload(self):
         self.assertEqual(self.tokenizer.decode(output[0], skip_special_tokens=True), self.EXPECTED_OUTPUT)


-@require_torch_gpu
+@require_torch_accelerator
 class FP8LinearTest(unittest.TestCase):
-    device = "cuda"
+    device = torch_device

     @unittest.skipIf(
         torch.cuda.is_available() and torch.cuda.get_device_capability()[0] < 9,
```
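One detail worth noting about the guard above: it only skips when a CUDA device is present but older than compute capability 9 (Hopper class). On an XPU-only machine torch.cuda.is_available() is False, so the whole condition is False and the test still runs. An illustrative restatement of that logic (not code from the PR):

```python
import unittest
import torch

# True only when an NVIDIA GPU is present but too old for the FP8 path the
# test needs; False on XPU-only or CPU-only machines, so they are not skipped.
too_old_cuda = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] < 9

@unittest.skipIf(too_old_cuda, "CUDA device older than compute capability 9")
class ExampleGuardedTest(unittest.TestCase):
    def test_still_collected_on_xpu(self):
        self.assertTrue(True)
```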