CUDA: avoid mul + bias fusion when buffers are split #16935
Conversation
Yes, unsurprisingly, since this just disables fusion in this case, it fixes the issue.

Yes, it's unlikely that fusion would help in this case anyway.
We should probably have some multi-GPU unit tests to catch this sort of thing.
@am17an should we merge this?
I'm wondering whether we should just disable fusion outright if we detect any buffer is split, or …
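For illustration only, a minimal sketch of the blanket alternative floated here: scan the graph once and, if any tensor's buffer is split, disable fusion for the whole graph rather than checking per fused op. The types and names below are hypothetical stand-ins, not ggml's actual API.

```cpp
// Hypothetical stand-ins for ggml's tensor/graph types (not the real structs).
struct tensor {
    bool buffer_is_split; // backing buffer is row-sharded across GPUs
};

struct graph {
    tensor **nodes;
    int      n_nodes;
};

// One pass over the graph: any split buffer disables fusion outright. This
// trades a few missed fusions on single-device tensors for never having to
// audit each fused kernel against split buffers individually.
static bool fusion_allowed(const graph *g) {
    for (int i = 0; i < g->n_nodes; ++i) {
        if (g->nodes[i]->buffer_is_split) {
            return false;
        }
    }
    return true;
}
```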
At least for …
If there are issues with any … (references llama.cpp/ggml/src/ggml-cuda/mmvq.cu, lines 656 to 665 at commit 070ff4d)
I think we already check this with …
To be clear, there have been no crashes reported with …
What I mean is that the padding is being cleared for … More generally, …
Works fine here (present bug aside). I think the perception that it doesn't work is that rocr has had multiple bugs relating to handling various p2p scenarios.
I am merging this as it solves quite a bunch of …
Fixes #16799. When fusing just a mul-mat + bias, we don't check whether the buffer is split; we do check this when fusing gate + up. Tested on 3x 4090 with gpt-oss-120b.
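To make the described check concrete, here is a minimal self-contained sketch. It is not the PR's actual diff: the struct and function names are simplified stand-ins for ggml's types, and in the real tree the test would query whether each operand's CUDA buffer type is split, as the gate + up fusion path already does.

```cpp
// Hypothetical stand-in for ggml_tensor, reduced to what the check needs.
struct tensor {
    bool    buffer_is_split; // backing buffer is row-sharded across GPUs
    tensor *src[2];          // operands, as in ggml's src array
};

// The fused mul-mat + bias kernel assumes every operand lives on a single
// device. Mirroring the check the gate + up fusion already performs, refuse
// to fuse when any operand's buffer is split; the caller then runs the
// mul-mat and the bias add as two separate ops.
static bool can_fuse_mul_mat_bias(const tensor *mul_mat, const tensor *bias_add) {
    // bias_add is ADD(mul_mat_result, bias), so src[1] is the bias tensor.
    const tensor *ops[] = { mul_mat->src[0], mul_mat->src[1], bias_add->src[1] };
    for (const tensor *t : ops) {
        if (t && t->buffer_is_split) {
            return false; // split buffer: take the unfused fallback path
        }
    }
    return true;
}
```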