Conversation

@mseeger (Contributor) commented Aug 28, 2025

The computation of lora_ind only works for models with n_head == n_query_groups and n_embd == n_head * head_size. This is not the case for the Qwen3-4B model, for example.
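
For illustration only, a minimal sketch of the shape arithmetic (not the actual litgpt implementation, and assuming a simple [q | k | v] layout of the fused QKV output): with grouped-query attention the query block has n_head * head_size columns while the key and value blocks each have n_query_groups * head_size, so the index computation cannot rely on n_embd or on n_head == n_query_groups.

# Minimal sketch, not the litgpt code: index ranges into the fused QKV output,
# assuming its columns are laid out as [q | k | v].
n_head, n_query_groups, head_size = 32, 8, 128   # illustrative values only
n_embd = 2560                                    # note: != n_head * head_size

q_size = n_head * head_size            # 4096, not n_embd
kv_size = n_query_groups * head_size   # 1024 each for k and v

q_ind = list(range(0, q_size))
k_ind = list(range(q_size, q_size + kv_size))
v_ind = list(range(q_size + kv_size, q_size + 2 * kv_size))

# lora_ind would then be the concatenation of the ranges whose projection has
# LoRA enabled, e.g. enable_lora = (True, False, True) -> q_ind + v_ind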

@mseeger mseeger requested review from lantiga, t-vi and Borda as code owners August 28, 2025 20:04
@mseeger (Contributor, Author) commented Aug 29, 2025

This PR only changes LoRA code. The failing test does not use LoRA. I suspect the atol and rtol for this test may be a little too tight. Or do you see a different explanation?

@Borda (Member) commented Sep 1, 2025

> I suspect the atol and rtol for this test may be a little too tight. Or do you see a different explanation?

I would be fine with relaxing the tolerances a bit...
cc: @t-vi
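
Concretely, relaxing the tolerances would mean loosening the rtol/atol used in the comparison. A made-up illustration with torch.testing.assert_close (the tensors and values here are hypothetical, not the ones from the test):

import torch

# Hypothetical numbers only; the real test compares litgpt outputs against the
# HF reference model.
ours = torch.randn(2, 8, 16)
theirs = ours + 1e-4 * torch.randn_like(ours)

# A tight tolerance such as rtol=atol=1e-5 can fail depending on hardware and
# dtype; a slightly relaxed one passes.
torch.testing.assert_close(ours, theirs, rtol=1e-3, atol=1e-3)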

@mseeger (Contributor, Author) commented Sep 1, 2025

The relative difference is large. But when I run this on my Mac, the test passes. I have not tried it on a GPU.

Are we sure the test passes on main?

@mseeger (Contributor, Author) commented Sep 2, 2025

I ran the test in question on a GPU instance. It passes before and after this PR:

(valkeyrie) ubuntu@ip-172-31-26-205:~/git/litgpt$ pytest tests/test_model.py -k test_against_original_gemma_2
======================================================================= test session starts =======================================================================
platform linux -- Python 3.12.3, pytest-8.4.1, pluggy-1.6.0
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/ubuntu/git/litgpt
configfile: pyproject.toml
plugins: dependency-0.6.0, anyio-4.10.0, rerunfailures-16.0, benchmark-5.1.0, timeout-2.4.0
collected 583 items / 579 deselected / 4 selected

tests/test_model.py ..xx                                                                                                                                    [100%]

========================================================== 2 passed, 579 deselected, 2 xfailed in 9.98s ===========================================================

@mseeger (Contributor, Author) commented Sep 2, 2025

The run seems to exclude the two GPU tests, and I have no idea why.

I don't know how to diagnose this further.
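
For reference, a generic sketch of what the xfail marker does (not the actual litgpt test): with strict=False, a test that raises AssertionError is reported as "xfailed" (an "x" in the progress line, so the two "x" entries in the run above are the 2 xfailed tests) and a passing test is reported as "xpassed"; neither makes the run fail. Running pytest with --runxfail makes it ignore the marker, so such tests report as ordinary passes or failures.

import pytest

# Generic sketch, not the litgpt test: with strict=False, an AssertionError is
# reported as "xfailed" instead of a failure, and a pass is reported as "xpassed".
@pytest.mark.xfail(raises=AssertionError, strict=False)
def test_comparison_that_sometimes_fails():
    assert False  # reported as xfailed, not as a test failure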

@mseeger (Contributor, Author) commented Sep 2, 2025

OK, I commented out the pytest.mark.xfail(raises=AssertionError, strict=False) marker, and now the test in question (test_against_original_gemma_2 in test_model.py) fails with quite large errors, BOTH on main and on my branch.

This means the test should either be fixed or disabled. The latter effectively happens on my instance, due to the pytest.mark.xfail(raises=AssertionError, strict=False) marker, but somehow the CI system seems to run the test anyway?

How to proceed here? @t-vi

@mseeger (Contributor, Author) commented Sep 2, 2025

Maybe the CI system runs the tests on CPU and they fail there. But the CPU tests work for me, both on my Mac laptop and on an EC2 instance. The comments in the test do not sound reassuring. I'd recommend disabling this test altogether until we are sure the code can be made to do what the HF side does.
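
If the test is to be disabled, a skip marker with a reason may be preferable to commenting it out, since it stays visible in reports. A hypothetical sketch (the decorator would go on the existing test; the stub is only for illustration):

import pytest

# Hypothetical sketch; reason text is illustrative.
@pytest.mark.skip(reason="large mismatch against the HF Gemma 2 reference; see PR discussion")
def test_against_original_gemma_2():
    ...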

@Borda (Member) commented Sep 4, 2025

cc: @t-vi ^^ 🐿️

@t-vi (Collaborator) left a comment

Thank you @mseeger @Borda
