Reduce time it takes to import SGLang #12510
                
     Draft
            
            
          
      
        
          +188
        
        
          −91
        
        
          
        
      
    
  
  Add this suggestion to a batch that can be applied as a single commit.
  This suggestion is invalid because no changes were made to the code.
  Suggestions cannot be applied while the pull request is closed.
  Suggestions cannot be applied while viewing a subset of changes.
  Only one suggestion per line can be applied in a batch.
  Add this suggestion to a batch that can be applied as a single commit.
  Applying suggestions on deleted lines is not supported.
  You must change the existing code in this line in order to create a valid suggestion.
  Outdated suggestions cannot be applied.
  This suggestion has been applied or marked resolved.
  Suggestions cannot be applied from pending reviews.
  Suggestions cannot be applied on multi-line comments.
  Suggestions cannot be applied while the pull request is queued to merge.
  Suggestion cannot be applied right now. Please check back later.
  
    
  
    
Motivation
We notice that the time to import things in SGLang takes a lot of time (#10492). I have been looking into what is taking up a lot of time and if there are simple ways to help reduce this import time. From the original issue, we want to reduce:
which is what I have been focusing my efforts on. However, I think there are things we can do to reduce time for other imports. This is more of a V1 to get community feedback from experts.
Modifications
There are some heavy imports. For example, the quantization methods import at the module level is heavy. Moving some imports to the function level (only time it is used), we can reduce module import time. However, I can see how this can easily be an antipattern. In fact, it can hurt performance if we have a function that is used a lot that we have an import in. I tried to only do this in functions that we only expect to run once or a small number of times. However, I can understand the argument against this kind of code. I also don't think all the changes to
hf_transformer_utils.pyhelp so I will be taken a deeper look, since the changes are a bit invasive.Accuracy Tests
These changes should not affect model outputs.
Benchmarking and Profiling
Running
for i in {1..100}; do (time python -B -c "from sglang.srt.managers.scheduler import Scheduler") 2>&1 | grep "^real"; done | python calc_avg.py(calc_avg.py)With these changes:
Compared to top-of-main:
so we have ~1.5 second improvement. Not the best, so I am going to keep working on it. I mostly targeted improving the timing in the creation of
ModelConfigobject. The difference so far is largely from removing the quantization import:Without these changes import_sglang_tom.log

With these changes import_sglang_new.log

Machine:
Checklist