[Bug] sglang cannot reach the preset concurrency level #1477

Open
rangehow opened this issue Sep 20, 2024 · 0 comments

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

To achieve maximum throughput, I had to increase the number of requests sent concurrently. However, while the sglang server initially reaches the expected concurrency level shortly after the connections are established, the concurrency later drops to single digits.

Reproduction

To make this easier to reproduce, I have abstracted the logic into three files: server, client, and main. The server simulates sglang responding to requests, the client simulates a single user making OpenAI-style requests, and main creates users in bulk.

I replaced the sglang response with a simple fake response server that always returns after a 1 s delay, to simulate sglang's generation latency.

server.py

from fastapi import FastAPI, Request
from pydantic import BaseModel
import asyncio
import logging
from datetime import datetime
import threading
import time
import csv

app = FastAPI()

# Track the number of currently active requests
active_requests = 0

# Output file path
output_file = 'active_requests_log.csv'

class CompletionRequest(BaseModel):
    model: str
    messages: list
    temperature: float

@app.middleware("http")
async def track_requests(request: Request, call_next):
    global active_requests
    active_requests += 1  # Increment the counter when a request arrives
    logging.info(f"Active requests: {active_requests}")

    response = await call_next(request)

    active_requests -= 1  # Decrement the counter when the request completes
    logging.info(f"Active requests: {active_requests}")

    return response

@app.post("/v1/chat/completions")
async def completions(request: CompletionRequest):
    await asyncio.sleep(1)  # Simulate the LLM generation latency
    return {
        "choices": [
            {"message": {"content": f"Response to {request.messages[-1]['content']}"}}
        ]
    }

def record_active_requests():
    """ 每秒记录一次活跃请求数量到文件 """
    global active_requests
    with open(output_file, mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["timestamp", "active_requests"])  # Write the header row
        
        while True:
            # Record one sample per second
            current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            writer.writerow([current_time, active_requests])
            file.flush()  # Flush so each sample is written to disk right away
            time.sleep(1)

# Start a background thread that records the active request count
threading.Thread(target=record_active_requests, daemon=True).start()

if __name__ == "__main__":
    import uvicorn
    logging.basicConfig(level=logging.INFO)
    uvicorn.run(app, host="127.0.0.1", port=8203)

client.py

import asyncio
from functools import wraps
import httpx
import logging
from openai import AsyncOpenAI


# Decorator that limits the number of concurrent calls
def limit_async_func_call(max_size: int):
    sem = asyncio.Semaphore(max_size)

    def final_decro(func):
        @wraps(func)
        async def wait_func(*args, **kwargs):
            async with sem:
                try:
                    return await func(*args, **kwargs)
                except Exception as e:
                    logging.error(f"Exception in {func.__name__}: {e}")
  
        return wait_func
    return final_decro



custom_http_client = httpx.AsyncClient(
    limits=httpx.Limits(max_connections=4096, max_keepalive_connections=1024),
    timeout=httpx.Timeout(timeout=None)
)

openai_async_client = AsyncOpenAI(
    api_key="EMPTY", base_url="http://localhost:8203/v1",  # Point at the local mock server
    http_client=custom_http_client
)


# The function under concurrency test
@limit_async_func_call(max_size=1024)  # Limit concurrency to 1024
async def custom_model_if_cache(prompt, system_prompt=None, history_messages=[], **kwargs):
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.extend(history_messages)
    messages.append({"role": "user", "content": prompt})

    # Call the (mock) OpenAI-compatible API
    response = await openai_async_client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0, **kwargs
    )

    return "hi"

main.py

import asyncio
import logging
from client import custom_model_if_cache
# Simulate 100,000 requests
TOTAL_REQUESTS = 100000

async def simulate_requests():
    tasks = []
    for i in range(TOTAL_REQUESTS):
        prompt = f"Test prompt {i}"  # A different prompt for each request
        task = custom_model_if_cache(prompt=prompt)  # Call the concurrency-limited async function
        tasks.append(task)

    # Run all requests concurrently
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Print the first 10 results as a sanity check
    for result in results[:10]:
        print(result)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    asyncio.run(simulate_requests())

I preset the concurrency level to 1024 and used the server-side logging above to track how many requests are in flight while handling this workload. As the statistics below show, against the mock server the concurrency while completing the 100,000 tasks stays close to the preset value of 1024.

[screenshot: active request count over time against the mock server, staying near the preset 1024]
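
As a quick sanity check, the recorded CSV can be summarized along these lines (a minimal sketch; the summarize_log.py name is just illustrative), assuming the active_requests_log.csv written by server.py above with the columns "timestamp" and "active_requests":

summarize_log.py

import csv

def summarize(path: str = "active_requests_log.csv") -> None:
    # Read all samples recorded by server.py (one row per second).
    with open(path, newline="") as f:
        counts = [int(row["active_requests"]) for row in csv.DictReader(f)]
    if not counts:
        print("no samples recorded yet")
        return
    print(f"samples: {len(counts)}")
    print(f"peak concurrency: {max(counts)}")
    print(f"mean concurrency: {sum(counts) / len(counts):.1f}")

if __name__ == "__main__":
    summarize()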

However, when I used the same logic to request the sglang-hosted OpenAI-compatible server, I could only reach a concurrency level of around a dozen requests, which is close to serial processing. This confused me. After ruling out issues in other parts of the system with simple logic checks, I believe the problem may lie in sglang.
[screenshot: active request count over time against the sglang server, staying around a dozen]
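
To compare the two servers at the HTTP level, one way to measure how many requests are actually in flight is a standalone probe along these lines (a minimal sketch; the concurrency_probe.py name, counters, and request counts are illustrative, and the endpoint mirrors the mock server above; swap TARGET_URL for the sglang endpoint to compare):

concurrency_probe.py

import asyncio
import httpx

TARGET_URL = "http://localhost:8203/v1/chat/completions"  # point at the sglang server to compare
CONCURRENCY = 1024
TOTAL = 5000

in_flight = 0
peak = 0

async def one_request(client: httpx.AsyncClient, sem: asyncio.Semaphore, i: int) -> None:
    global in_flight, peak
    async with sem:  # same semaphore-style limit as client.py
        in_flight += 1  # single event loop, so plain ints are safe here
        peak = max(peak, in_flight)
        try:
            await client.post(
                TARGET_URL,
                json={
                    "model": "gpt-3.5-turbo",
                    "messages": [{"role": "user", "content": f"Test prompt {i}"}],
                    "temperature": 0,
                },
            )
        finally:
            in_flight -= 1

async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    limits = httpx.Limits(max_connections=4096, max_keepalive_connections=1024)
    async with httpx.AsyncClient(timeout=None, limits=limits) as client:
        await asyncio.gather(*(one_request(client, sem, i) for i in range(TOTAL)))
    print(f"peak in-flight HTTP requests: {peak}")

if __name__ == "__main__":
    asyncio.run(main())

Against the mock server this should report a peak near 1024; a much lower peak against sglang would match the behavior described above.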

Environment

Python: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 3090
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.6
CUDA_HOME: /data/ruanjh/cuda/cuda12.2
NVCC: Cuda compilation tools, release 12.2, V12.2.140
CUDA Driver Version: 550.54.14
PyTorch: 2.4.0+cu121
sglang: 0.3.1
flashinfer: 0.1.6+cu121torch2.3
triton: 3.0.0
transformers: 4.43.3
requests: 2.32.3
tqdm: 4.66.4
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.111.0
hf_transfer: 0.1.6
huggingface_hub: 0.24.3
interegular: 0.3.3
packaging: 23.2
PIL: 10.3.0
psutil: 6.0.0
pydantic: 2.7.4
uvicorn: 0.30.1
uvloop: 0.19.0
zmq: 26.2.0
vllm: 0.6.0
multipart: 0.0.9
openai: 1.44.1
anthropic: Module Not Found
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     PXB     PXB     SYS     SYS     SYS     SYS     0-23,48-71      0               N/A
GPU1    PIX      X      PXB     PXB     SYS     SYS     SYS     SYS     0-23,48-71      0               N/A
GPU2    PXB     PXB      X      PXB     SYS     SYS     SYS     SYS     0-23,48-71      0               N/A
GPU3    PXB     PXB     PXB      X      SYS     SYS     SYS     SYS     0-23,48-71      0               N/A
GPU4    SYS     SYS     SYS     SYS      X      PIX     PXB     PXB     24-47,72-95     1               N/A
GPU5    SYS     SYS     SYS     SYS     PIX      X      PXB     PXB     24-47,72-95     1               N/A
GPU6    SYS     SYS     SYS     SYS     PXB     PXB      X      PXB     24-47,72-95     1               N/A
GPU7    SYS     SYS     SYS     SYS     PXB     PXB     PXB      X      24-47,72-95     1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 100002