feat: out-of-box free model providers + dynamic model discovery#2

Open
KylinMountain wants to merge 13 commits into main from claude/review-fallback-logic-iRPJj

Conversation


KylinMountain (Owner) commented Mar 26, 2026

Summary

  • Pre-configure 8 LLM platforms (Zhipu, SiliconFlow, OpenRouter, ModelScope, Meituan LongCat, Volcengine, Aliyun, DeepSeek): users only need to cp .env.example .env and fill in an API key to get started out of the box
  • New model auto-discovery service: periodically calls each platform's /v1/models endpoint to fetch the latest free-model lists and register routes automatically
  • 5 new pre-configured model routes with automatic multi-platform fallback

Supported Platforms

| Platform | Free quota | Environment variable |
| --- | --- | --- |
| Zhipu GLM | 6 free models (glm-4-flash etc.) | ZHIPU_API_KEY |
| SiliconFlow | DeepSeek-V3/R1, Qwen2.5 etc. free | SILICONFLOW_API_KEY |
| ModelScope | Qwen3-235B, DeepSeek-R1 etc., 2000 calls/day | MODELSCOPE_API_KEY |
| OpenRouter | ~30 `:free` models | OPENROUTER_API_KEY |
| Meituan LongCat | Fully free during public beta; Lite 50M tokens/day | LONGCAT_API_KEY |
| Volcengine/Doubao | 500K tokens per model for new users | VOLCENGINE_API_KEY |
| Aliyun Bailian | Free quota for qwen-turbo | ALIYUN_API_KEY |
| DeepSeek | deepseek-chat | DEEPSEEK_API_KEY |

Model Discovery

A new discovery package periodically fetches free-model lists from each platform's API:

discovery:
  enabled: true
  interval: 24h
  providers: [openrouter, siliconflow, modelscope]
  • OpenRouter: selects free models via pricing.prompt == "0"
  • SiliconFlow: excludes the Pro/ prefix (paid accelerated variants)
  • ModelScope: all models are free (2000 calls/day)
  • Generic: a generic fetcher for any OpenAI-compatible platform

Discovered models are registered automatically as provider/model routes and show up in the /v1/models list. Statically configured routes are never overwritten. The Router uses an RWMutex for concurrency safety.
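The per-provider filters can be sketched as simple predicates over the /v1/models entries. This is a minimal sketch with illustrative field names, not the project's actual types:

```go
package main

import (
	"fmt"
	"strings"
)

// Model is a simplified view of one entry from a provider's /v1/models
// response; the field names are illustrative.
type Model struct {
	ID            string
	PromptPricing string // OpenRouter-style pricing.prompt, kept as a string
}

// isFreeOpenRouter keeps models whose prompt price is "0".
func isFreeOpenRouter(m Model) bool {
	return m.PromptPricing == "0"
}

// isFreeSiliconFlow excludes the paid accelerated "Pro/" variants.
func isFreeSiliconFlow(m Model) bool {
	return !strings.HasPrefix(m.ID, "Pro/")
}

func main() {
	models := []Model{
		{ID: "meta-llama/llama-3.3-70b:free", PromptPricing: "0"},
		{ID: "Pro/deepseek-ai/DeepSeek-V3"},
		{ID: "deepseek-ai/DeepSeek-V3"},
	}
	for _, m := range models {
		fmt.Println(m.ID, isFreeOpenRouter(m), isFreeSiliconFlow(m))
	}
}
```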

Pre-configured Routes

| Route | Description | Fallback chain |
| --- | --- | --- |
| chat-free | Free chat | Zhipu → SiliconFlow → LongCat → ModelScope → OpenRouter |
| reasoning-free | Free reasoning / chain-of-thought | SiliconFlow → ModelScope → LongCat → Zhipu |
| chat-free-large | Large free models | ModelScope → SiliconFlow → OpenRouter |
| chat-lite | Lightweight, fast | LongCat → Zhipu → SiliconFlow |
| chat | Paid, high quality | DeepSeek → Aliyun → Zhipu |

Test plan

  • Verify that each platform's base_url is reachable
  • Test chat-free fallback with API keys from at least 2 platforms
  • Verify that docker-compose up passes environment variables through correctly
  • Test pass-through access in provider/model format
  • Verify that discovery fetches the OpenRouter free-model list after startup
  • Verify that discovery does not overwrite statically configured routes
  • Verify that /v1/models returns the discovered models

https://claude.ai/code/session_01Hjjpqc8fw2Rxz7mGN4WdU6

claude added 4 commits March 26, 2026 00:18
Pre-configure popular LLM platforms (Zhipu GLM, SiliconFlow, OpenRouter,
Aliyun DashScope, DeepSeek) with free model routes so users can start
by just setting API keys in .env. Includes multi-platform fallback routes
like chat-free and reasoning-free.

- Add ModelScope with free Qwen3-235B, DeepSeek-R1 etc (2000 calls/day)
- Add Meituan LongCat with free Flash-Chat/Thinking/Lite models
- Update Zhipu with full free model list (glm-4.7-flash, glm-z1-flash etc)
- Update OpenRouter with latest free models (gemini-2.5-flash, llama-3.3-70b)
- Add new routes: chat-free-large (large free models), chat-lite (fast/light)
- Fix OpenRouter RPM to match actual free tier limit (20 RPM)
- Fix LongCat base_url to correct path (api.longcat.chat/openai/v1)

Add Volcengine/Doubao (ark.cn-beijing.volces.com) as the 8th provider.
New users get 500K tokens per model for free. API key format: ark-xxxxx.

Add a discovery package that periodically fetches free models from
provider /v1/models APIs and dynamically updates the router:

- OpenRouter: filters by pricing.prompt == "0"
- SiliconFlow: filters out Pro/ prefix (paid accelerated)
- ModelScope: all models are free (2000 calls/day)
- Generic: fallback fetcher for any OpenAI-compatible provider

Router now uses RWMutex for thread-safe dynamic updates. Static routes
from config are never overwritten by discovered models.

Configurable via:
  discovery:
    enabled: true
    interval: 24h
    providers: [openrouter, siliconflow, modelscope]
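The "static routes are never overwritten" rule plus RWMutex updates can be sketched like this; a minimal illustration of the concurrency pattern, not the project's actual Router API:

```go
package main

import (
	"fmt"
	"sync"
)

// Router holds model routes; discovered routes never overwrite static ones.
type Router struct {
	mu     sync.RWMutex
	static map[string]bool   // route names that came from config
	routes map[string]string // route name -> upstream target
}

func NewRouter() *Router {
	return &Router{static: map[string]bool{}, routes: map[string]string{}}
}

func (r *Router) SetStatic(name, target string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.static[name] = true
	r.routes[name] = target
}

// AddDiscovered registers a discovered model unless a static route exists.
func (r *Router) AddDiscovered(name, target string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.static[name] {
		return // static config always wins
	}
	r.routes[name] = target
}

func (r *Router) Lookup(name string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	t, ok := r.routes[name]
	return t, ok
}

func main() {
	r := NewRouter()
	r.SetStatic("chat-free", "zhipu/glm-4-flash")
	r.AddDiscovered("chat-free", "openrouter/llama-3.3-70b:free") // ignored: static wins
	t, _ := r.Lookup("chat-free")
	fmt.Println(t) // zhipu/glm-4-flash
}
```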

KylinMountain changed the title from "feat: add out-of-box config with 8 free model providers" to "feat: out-of-box free model providers + dynamic model discovery" on Mar 28, 2026
claude added 9 commits March 28, 2026 23:05
OpenRouter pricing fields can be "0", "free", or empty string for
free models. Previously only checked for "0".

Add daily request limit support to MultiLimiter via a new RPD field.
Uses a token bucket with 86400s refill window. This enables accurate
rate limiting for providers like ModelScope (2000 calls/day).

Config example:
  rate_limit:
    rpm: 60
    rpd: 2000
    tpm: 100000

Discovery now builds an aggregated 'free' route from all discovered
free models. Each provider contributes its best model (heuristic:
prefer large/popular models like Qwen3, DeepSeek-V3, Llama-3.3-70b).

Users can now simply call model:"free" to get automatic cross-platform
fallback across all configured free providers. The alias name is
configurable via discovery.free_alias (default "free").
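The "best model per provider" heuristic can be sketched as a ranked preference list; the family names below are illustrative, not the project's actual ranking:

```go
package main

import (
	"fmt"
	"strings"
)

// bestModel returns the first model matching a ranked list of preferred
// families (large/popular first), falling back to the first model.
func bestModel(models []string) string {
	preferred := []string{"qwen3", "deepseek-v3", "llama-3.3-70b"}
	for _, p := range preferred {
		for _, m := range models {
			if strings.Contains(strings.ToLower(m), p) {
				return m
			}
		}
	}
	if len(models) > 0 {
		return models[0]
	}
	return ""
}

func main() {
	fmt.Println(bestModel([]string{"glm-4-flash", "DeepSeek-V3", "qwen2.5-7b"}))
	// → DeepSeek-V3 (first preferred family found)
}
```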

New 'auto' mode routes requests to different model tiers based on
complexity analysis:

- lite: simple greetings, short Q&A → fast small models
- standard: normal conversation, summaries → standard models
- large: code generation, long writing, function calling → large models
- reasoning: math, logic, debugging, step-by-step → reasoning models

Classification heuristics:
- Tool/function calls → large
- Reasoning keywords (prove, debug, step by step) → reasoning
- Code/complex keywords (implement, algorithm, analyze) → large
- Content length > 4000 chars or > 10 turns → large
- Content length > 500 chars or > 4 turns → standard
- Otherwise → lite

Usage: model:"auto" in the request, maps to configured route names.
Each tier falls back to adjacent tiers if not configured.
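The classification heuristics above can be sketched as ordered rules; keyword lists and thresholds mirror the description but are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// classify picks a tier from the request: tools and keywords first,
// then length/turn thresholds, defaulting to lite.
func classify(content string, turns int, hasTools bool) string {
	if hasTools {
		return "large" // tool/function calls need capable models
	}
	lower := strings.ToLower(content)
	for _, kw := range []string{"prove", "debug", "step by step"} {
		if strings.Contains(lower, kw) {
			return "reasoning"
		}
	}
	for _, kw := range []string{"implement", "algorithm", "analyze"} {
		if strings.Contains(lower, kw) {
			return "large"
		}
	}
	if len(content) > 4000 || turns > 10 {
		return "large"
	}
	if len(content) > 500 || turns > 4 {
		return "standard"
	}
	return "lite"
}

func main() {
	fmt.Println(classify("hi there", 1, false))                // lite
	fmt.Println(classify("please debug this panic", 1, false)) // reasoning
	fmt.Println(classify("summarize this", 1, true))           // large (tool call)
}
```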

Fix bug where tryTarget error was shadowed by outer 'err' variable,
causing 'failed: <nil>' in logs and losing the actual error info.
Affected all three handlers (chat, embedding, rerank) and stream mode.

Also fix stale model ID: qwen3-coder-480b → qwen3-coder (verified
via OpenRouter API discovery).
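The shadowing bug has this shape (hypothetical handler code; tryTarget stands in for the real call):

```go
package main

import (
	"errors"
	"fmt"
)

var errUpstream = errors.New("upstream 429")

func tryTarget(ok bool) error {
	if !ok {
		return errUpstream
	}
	return nil
}

// buggy: `err :=` inside the if-statement shadows the outer err, so after
// the loop the outer err is still nil and the log prints "failed: <nil>".
func buggy() error {
	var err error
	for _, ok := range []bool{false, false} {
		if err := tryTarget(ok); err != nil {
			continue // inner err is lost here
		}
		return nil
	}
	return err // still nil!
}

// fixed: assign to the outer err so the last failure survives the loop.
func fixed() error {
	var err error
	for _, ok := range []bool{false, false} {
		if err = tryTarget(ok); err != nil {
			continue
		}
		return nil
	}
	return err
}

func main() {
	fmt.Printf("buggy: failed: %v\n", buggy())
	fmt.Printf("fixed: failed: %v\n", fixed())
}
```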

Auto route now uses a two-stage classification:
1. Rules (zero latency) - handles obvious cases: tools→large,
   keywords→reasoning/large, short messages→lite, long content→large
2. LLM classifier (optional) - for ambiguous cases, calls a small
   model (e.g. glm-4-flash) with a classification prompt, 5s timeout,
   max_tokens=10, temperature=0

Config:
  auto_route:
    classifier: "zhipu/glm-4-flash"  # provider/model format

If classifier is not configured, falls back to rule-based standard
tier for ambiguous cases. The LLM prompt asks the model to reply
with just one word: lite/standard/large/reasoning.
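The two-stage flow can be sketched like this; askLLM is a hypothetical stand-in for the real classifier call, and the rule thresholds are illustrative:

```go
package main

import "fmt"

// classifyTwoStage runs cheap rules first, then consults an optional
// LLM classifier for ambiguous cases. With no classifier configured
// (askLLM == nil) or on error, it falls back to "standard".
func classifyTwoStage(content string, askLLM func(string) (string, error)) string {
	// Stage 1: rules handle the obvious cases with zero latency.
	if len(content) < 20 {
		return "lite"
	}
	if len(content) > 4000 {
		return "large"
	}
	// Stage 2: ambiguous case, ask the small model if configured.
	if askLLM != nil {
		if tier, err := askLLM(content); err == nil {
			switch tier {
			case "lite", "standard", "large", "reasoning":
				return tier // only accept the four expected words
			}
		}
	}
	return "standard"
}

func main() {
	fmt.Println(classifyTwoStage("hi", nil)) // lite, by rules alone
	llm := func(string) (string, error) { return "reasoning", nil }
	fmt.Println(classifyTwoStage("why does this recursion overflow the stack?", llm))
}
```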

Discovery now parses model size and capability from model IDs:
- Size: regex extracts "70b", "405b", MoE active params "120b-a12b"→12B
- Capability: keywords like "thinking", "r1", "qwq" → reasoning

Auto-generates tier routes (auto:lite, auto:standard, auto:large,
auto:reasoning) with parameter-count-based weights. AutoRouter reads
these directly - no manual tier→route mapping needed.

Classification: <10B→lite, 10-72B→standard, >72B→large,
reasoning keywords→reasoning, unknown size→standard.

The "free" alias now picks the best model from large/standard/lite
tiers. Manual tier overrides still supported as fallback.
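The size/capability parsing can be sketched with two regexes; the patterns and keyword list are illustrative, not the project's actual ones:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var (
	// MoE active-parameter suffix, e.g. "120b-a12b" → 12
	moeRe = regexp.MustCompile(`-a(\d+(?:\.\d+)?)b\b`)
	// Plain size, e.g. "70b" → 70
	sizeRe = regexp.MustCompile(`(\d+(?:\.\d+)?)b\b`)
)

// paramsB extracts a parameter count in billions from a model ID,
// preferring the MoE active-parameter suffix; 0 means unknown.
func paramsB(id string) float64 {
	id = strings.ToLower(id)
	if m := moeRe.FindStringSubmatch(id); m != nil {
		v, _ := strconv.ParseFloat(m[1], 64)
		return v
	}
	if m := sizeRe.FindStringSubmatch(id); m != nil {
		v, _ := strconv.ParseFloat(m[1], 64)
		return v
	}
	return 0
}

// tier maps an ID to a tier: reasoning keywords first, then size bands.
func tier(id string) string {
	lower := strings.ToLower(id)
	for _, kw := range []string{"thinking", "r1", "qwq"} {
		if strings.Contains(lower, kw) {
			return "reasoning"
		}
	}
	switch p := paramsB(id); {
	case p == 0:
		return "standard" // unknown size
	case p < 10:
		return "lite"
	case p <= 72:
		return "standard"
	default:
		return "large"
	}
}

func main() {
	fmt.Println(paramsB("llama-3.3-70b-instruct")) // 70
	fmt.Println(paramsB("gpt-oss-120b-a12b"))      // 12
	fmt.Println(tier("deepseek-r1"))               // reasoning
}
```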

- Require 3 consecutive rate-limit failures before marking backend
  unhealthy (was: 1 failure = instant unhealthy)
- Skip already-tried target in fallback retry loop to avoid
  double-counting failures and wasting retry attempts
- Reset fail count on successful request

This fixes the issue where one model's 429 would take down all
models sharing the same provider API key.
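The consecutive-failure rule is just a counter with a threshold and a reset on success; a minimal sketch, not the project's health-check type:

```go
package main

import "fmt"

// health marks a backend unhealthy only after `threshold` consecutive
// rate-limit failures; any success resets the streak.
type health struct {
	fails     int
	threshold int
}

func (h *health) onRateLimited() { h.fails++ }
func (h *health) onSuccess()     { h.fails = 0 }
func (h *health) healthy() bool  { return h.fails < h.threshold }

func main() {
	h := &health{threshold: 3}
	h.onRateLimited()
	h.onRateLimited()
	fmt.Println(h.healthy()) // true: two 429s in a row are tolerated
	h.onRateLimited()
	fmt.Println(h.healthy()) // false: three in a row
	h.onSuccess()
	fmt.Println(h.healthy()) // true: success resets the streak
}
```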

Major changes to rate limit handling:
- 429 now marks the specific model as hot (30s cooldown), NOT the
  entire backend/API key. Other models on the same provider work fine.
- handleWithRetry skips hot models when selecting targets
- Auto route degrades from large→standard→lite when all models in
  a tier are hot, ensuring requests always find an available model

New cooldown package tracks per-model (provider+model) overload state.
This properly handles providers like OpenRouter where different models
have independent rate limits sharing one API key.

Flow: model:auto → classify(large) → auto:large all hot →
      degrade → auto:standard has available models → route there
