Promote Dev #122

Open
wants to merge 68 commits into
base: main
dc174c2
- implemented kvcache-aware-scorer
vMaroon Apr 30, 2025
3476f59
undo gofumpt
vMaroon Apr 30, 2025
e6ca553
added scorer initialization debug msg
vMaroon Apr 30, 2025
388a7db
- added debug logging
vMaroon Apr 30, 2025
01f019d
updated KVCacheAwareScorer comments
vMaroon Apr 30, 2025
bc2fee3
reused envutils (review comment)
vMaroon Apr 30, 2025
d08e32f
Merge pull request #34 from vMaroon/kvcache-aware
vMaroon Apr 30, 2025
e7d8837
testing new lint config to diff: false
clubanderson Apr 30, 2025
fe5168f
fix lint
clubanderson May 1, 2025
b96946e
fix lint
clubanderson May 1, 2025
476cbab
fix lint
clubanderson May 1, 2025
847daf8
fix lint
clubanderson May 1, 2025
288589c
fix lint
clubanderson May 1, 2025
5c364e2
fix lint
clubanderson May 1, 2025
d835ccb
fix lint
clubanderson May 1, 2025
4818a31
fix lint
clubanderson May 1, 2025
ed2db83
fix lint
clubanderson May 1, 2025
869f7cd
fix lint
clubanderson May 1, 2025
91a0f85
fix lint
clubanderson May 1, 2025
424d5b4
fix lint
clubanderson May 1, 2025
27a62bb
fix lint
clubanderson May 1, 2025
d88c441
fix lint
clubanderson May 1, 2025
c56372c
fix lint
clubanderson May 1, 2025
1e5466a
fix lint
clubanderson May 1, 2025
6f36dfd
fix lint
clubanderson May 1, 2025
3a95881
fix lint
clubanderson May 1, 2025
ce547eb
fix lint
clubanderson May 1, 2025
39dd257
fix lint
clubanderson May 1, 2025
5e6f5b6
fix lint
clubanderson May 1, 2025
d62d457
fix
clubanderson May 1, 2025
508fd29
fix
clubanderson May 1, 2025
06f6d26
fix
clubanderson May 1, 2025
19617ad
fix
clubanderson May 1, 2025
255beb5
fix
clubanderson May 1, 2025
f9e6530
add comments to working golang file
clubanderson May 1, 2025
e2f398a
Provide a way to enable the PDFilter
lionelvillard May 1, 2025
01c043e
update readme
lionelvillard May 1, 2025
58f3213
Merge pull request #104 from neuralmagic/enable-pd
vMaroon May 1, 2025
a1d7254
add log lines
lionelvillard May 1, 2025
c2d68de
Update pod labels to match ModelService
jgchn May 2, 2025
2f6a763
Merge pull request #106 from jgchn/label
vMaroon May 2, 2025
867b18c
address review comments
lionelvillard May 2, 2025
ca12b30
Merge pull request #105 from neuralmagic/debug-pd
vMaroon May 2, 2025
e0eee4c
fix build:
vMaroon May 2, 2025
466e773
Fixed scorer tests
shmuelk May 4, 2025
49b6afa
Added PostResponse to scheduler config
shmuelk May 4, 2025
3e8284c
Use an init() function instead of modifying the scheduler code to inj…
shmuelk May 4, 2025
32e43b1
Added code to scheduler to enable running the PostResponse plugins
shmuelk May 4, 2025
4655be4
Invoke the PostResponse handlers and send any added headers to the user
shmuelk May 4, 2025
6fffe9e
Added a simple unit test for the PostResponse plugin invocation
shmuelk May 4, 2025
220915f
Merge pull request #113 from shmuelk/post-response
shmuelk May 4, 2025
403fae6
[build]: Updating vllm deployment to the latest image and scorers (#112)
kfirtoledo May 5, 2025
9f01f6c
Add P/D scheduler (#115)
mayabar May 5, 2025
b7689d0
Add decode filter to the default filters list in case pd is enabled (…
mayabar May 5, 2025
e45e31c
cherry-picked prefix_score
oglok Apr 22, 2025
073069a
Add prefix store functionality
oglok Apr 22, 2025
9e30e07
Prefix Aware Scorer
oglok Apr 23, 2025
a481c85
Add unit tests for prefix store
oglok Apr 23, 2025
53c550d
Add unit tests for prefix aware scorer
oglok Apr 23, 2025
d7f20fe
implemented PrefixAwareScorer based on Ricardo's work
vMaroon May 4, 2025
b852c92
Remove KVcache scorer changes for traceability
oglok May 5, 2025
a0e02c0
addressed review comments
vMaroon May 5, 2025
09f7448
Merge pull request #48 from oglok/prefix_scorer
vMaroon May 5, 2025
f52caa2
Session affinity scorer (#117)
dmitripikus May 5, 2025
b98733f
[docs]: Add prefix flags to README
kfirtoledo May 6, 2025
575678e
Merge pull request #123 from kfirtoledo/dev
vMaroon May 6, 2025
1bd3e92
switch prefix_scorer updates to post-schedule temporarily
vMaroon May 6, 2025
26e423e
Merge pull request #126 from vMaroon/dev
vMaroon May 6, 2025
6 changes: 6 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -30,3 +30,9 @@ go.work.sum

# generated docs
site

# tokenizer lib
lib

# local configuration files
.envrc
13 changes: 5 additions & 8 deletions .golangci.yml
@@ -1,14 +1,15 @@
run:
timeout: 5m
allow-parallel-runners: true

# Settings related to issues
issues:
# Report issues on new code only (since we're bringing in from upstream)
new: true
# Which dirs to exclude: issues from them won't be reported
exclude-dirs:
- bin

linters:
disable-all: true
enable:
Expand All @@ -18,7 +19,7 @@ linters:
- fatcontext
- ginkgolinter
- gocritic
- govet
# - govet # do not enable - this causes some metalinter issue
- loggercheck
- misspell
- perfsprint
Expand All @@ -27,17 +28,13 @@ linters:
- makezero
- errcheck
- goconst
- gofmt
- goimports
- gosimple
- ineffassign
- nakedret
- prealloc
- typecheck
- unparam
- unused

linters-settings:
revive:
rules:
- name: comment-spacings
- name: comment-spacings
15 changes: 13 additions & 2 deletions .tekton/buildah-build.yaml
Expand Up @@ -44,6 +44,15 @@ spec:
USERNAME=$(jq -r '.auths["quay.io"].username' /root/.docker/config.json)
PASSWORD=$(jq -r '.auths["quay.io"].password' /root/.docker/config.json)

echo "🔐 Extracting Git credentials from workspace..."
GIT_USER=$(cat /workspace/git-auth/username)
GIT_TOKEN=$(cat /workspace/git-auth/token)

if [ -z "$GIT_USER" ] || [ -z "$GIT_TOKEN" ]; then
echo "❌ Error: Missing git-auth credentials"
exit 1
fi

if [ "$USERNAME" = "null" ] || [ "$PASSWORD" = "null" ]; then
echo "❌ Error: Missing registry credentials"
exit 1
Expand All @@ -56,8 +65,10 @@ spec:
export DOCKER_CONFIG=/root/.docker
export BUILDER=buildah
export IMG=$(params.image_tag_base):$(params.dev-version)

export GIT_NM_USER=$GIT_USER
export NM_TOKEN=$GIT_TOKEN

echo "🚀 Calling make buildah-build with IMG=$IMG..."
make buildah-build IMG=$IMG
make buildah-build IMG=$IMG

echo "$IMG" > /tekton/results/image-url
19 changes: 19 additions & 0 deletions .tekton/go-build-task.yaml
Expand Up @@ -12,5 +12,24 @@ spec:
script: |
#!/bin/bash
cd $(workspaces.source.path)

echo "🔐 Extracting Git credentials from workspace..."
GIT_USER=$(cat /workspace/git-auth/username)
GIT_TOKEN=$(cat /workspace/git-auth/token)

if [ -z "$GIT_USER" ] || [ -z "$GIT_TOKEN" ]; then
echo "❌ Error: Missing git-auth credentials"
exit 1
fi

echo "🔐 Configuring Git..."
git config --global user.email "[email protected]"
git config --global user.name "ci-tag-bot"
git config --global url."https://${GIT_USER}:${GIT_TOKEN}@github.com".insteadOf "https://github.com"
git config --global --add safe.directory "$(pwd)"

# required for go build with tokenizer lib linking
dnf install -y gcc-c++ libstdc++ libstdc++-devel && dnf clean all

go env -w GOFLAGS=-buildvcs=false
make build
1 change: 1 addition & 0 deletions .tekton/go-lint-task.yaml
Expand Up @@ -11,6 +11,7 @@ spec:
steps:
- name: run-lint
image: us.icr.io/ibm-hc4ai-operator/golangci-lint:v1.64.8
# image: us.icr.io/ibm-hc4ai-operator/golangci-lint:v2.0.3
imagePullPolicy: IfNotPresent
script: |
#!/bin/bash
5 changes: 5 additions & 0 deletions .tekton/pipelinerun.yaml
Expand Up @@ -165,6 +165,9 @@ spec:
workspaces:
- name: source
workspace: source
- name: git-auth
workspace: git-auth


- name: extract-version-and-registry
params:
Expand Down Expand Up @@ -328,6 +331,8 @@ spec:
workspace: registry-secret
- name: container-storage
workspace: container-storage
- name: git-auth
workspace: git-auth

- name: vulnerability-scan
when:
23 changes: 21 additions & 2 deletions Dockerfile
Expand Up @@ -3,28 +3,47 @@ FROM quay.io/projectquay/golang:1.24 AS builder
ARG TARGETOS
ARG TARGETARCH

# ENV GOPROXY=https://goproxy.io,direct
# Install build tools
RUN dnf install -y gcc-c++ libstdc++ libstdc++-devel && dnf clean all

WORKDIR /workspace

## NeuralMagic internal repos pull config
ARG GIT_NM_USER
ARG NM_TOKEN
### use git token
RUN echo -e "machine github.com\n\tlogin ${GIT_NM_USER}\n\tpassword ${NM_TOKEN}" >> ~/.netrc
ENV GOPRIVATE=github.com/neuralmagic
ENV GIT_TERMINAL_PROMPT=1

# Copy the Go Modules manifests
COPY go.mod go.mod
COPY go.sum go.sum
# cache deps before building and copying source so that we don't need to re-download as much
# and so that source changes don't invalidate our downloaded layer
RUN go mod download
RUN rm -rf ~/.netrc # remove git token

# Copy the go source
COPY cmd ./cmd
COPY pkg ./pkg
COPY internal ./internal
COPY api ./api

# HuggingFace tokenizer bindings
RUN mkdir -p lib
RUN curl -L https://github.com/daulet/tokenizers/releases/download/v1.20.2/libtokenizers.${TARGETOS}-${TARGETARCH}.tar.gz | tar -xz -C lib
RUN ranlib lib/*.a

# Build
# GOARCH has no default value so that the binary is built for the platform of the host where the command
# was called. For example, if we call make image-build in a local env on an Apple Silicon M1 machine,
# the docker BUILDPLATFORM arg will be linux/arm64, while for Apple x86 it will be linux/amd64. Therefore,
# by leaving it empty we can ensure that the container and the binary shipped in it have the same platform.
RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} go build -o bin/epp cmd/epp/main.go cmd/epp/health.go
ENV CGO_ENABLED=1
ENV GOOS=${TARGETOS:-linux}
ENV GOARCH=${TARGETARCH}
RUN go build -o bin/epp -ldflags="-extldflags '-L$(pwd)/lib'" cmd/epp/main.go cmd/epp/health.go

# Use distroless as minimal base image to package the manager binary
# Refer to https://github.com/GoogleContainerTools/distroless for more details
34 changes: 29 additions & 5 deletions Makefile
Expand Up @@ -439,11 +439,20 @@ lint: check-golangci-lint ## Run lint
golangci-lint run

##@ Build
LDFLAGS ?= -extldflags '-L$(shell pwd)/lib'
CGO_ENABLED=1 # Enable CGO

.PHONY: download-tokenizer
download-tokenizer: ## Download the HuggingFace tokenizer bindings.
@echo "Downloading HuggingFace tokenizer bindings..."
mkdir -p lib
curl -L https://github.com/daulet/tokenizers/releases/download/v1.20.2/libtokenizers.$(TARGETOS)-$(TARGETARCH).tar.gz | tar -xz -C lib
ranlib lib/*.a

.PHONY: build
build: check-go ##
build: check-go download-tokenizer ##
@printf "\033[33;1m==== Building ====\033[0m\n"
go build -o bin/epp cmd/epp/main.go cmd/epp/health.go
go build -ldflags="$(LDFLAGS)" -o bin/epp cmd/epp/main.go cmd/epp/health.go

##@ Container Build/Push

Expand All @@ -456,7 +465,12 @@ buildah-build: check-builder load-version-json ## Build and push image (multi-ar
for arch in amd64; do \
ARCH_TAG=$$FINAL_TAG-$$arch; \
echo "📦 Building for architecture: $$arch"; \
buildah build --arch=$$arch --os=linux --layers -t $(IMG)-$$arch . || exit 1; \
buildah build \
--arch=$$arch \
--build-arg GIT_NM_USER=$(GIT_NM_USER) \
--build-arg NM_TOKEN=$(NM_TOKEN) \
--os=linux \
--layers -t $(IMG)-$$arch . || exit 1; \
echo "🚀 Pushing image: $(IMG)-$$arch"; \
buildah push $(IMG)-$$arch docker://$(IMG)-$$arch || exit 1; \
done; \
Expand All @@ -474,7 +488,11 @@ buildah-build: check-builder load-version-json ## Build and push image (multi-ar
sed -e '1 s/\(^FROM\)/FROM --platform=$${BUILDPLATFORM}/' Dockerfile > Dockerfile.cross; \
- docker buildx create --use --name image-builder || true; \
docker buildx use image-builder; \
docker buildx build --push --platform=$(PLATFORMS) --tag $(IMG) -f Dockerfile.cross . || exit 1; \
docker buildx build --push \
--platform=$(PLATFORMS) \
--build-arg GIT_NM_USER=$(GIT_NM_USER)\
--build-arg NM_TOKEN=$(NM_TOKEN) \
--tag $(IMG) -f Dockerfile.cross . || exit 1; \
docker buildx rm image-builder || true; \
rm Dockerfile.cross; \
elif [ "$(BUILDER)" = "podman" ]; then \
Expand All @@ -489,7 +507,13 @@ buildah-build: check-builder load-version-json ## Build and push image (multi-ar
.PHONY: image-build
image-build: check-container-tool load-version-json ## Build container image using $(CONTAINER_TOOL)
@printf "\033[33;1m==== Building container image $(IMG) ====\033[0m\n"
$(CONTAINER_TOOL) build --build-arg TARGETOS=$(TARGETOS) --build-arg TARGETARCH=$(TARGETARCH) -t $(IMG) .
$(CONTAINER_TOOL) build --platform=$(TARGETOS)/$(TARGETARCH) \
--build-arg TARGETOS=$(TARGETOS) \
--build-arg TARGETARCH=$(TARGETARCH) \
--build-arg GIT_NM_USER=$(GIT_NM_USER)\
--build-arg NM_TOKEN=$(NM_TOKEN) \
--progress=plain \
-t $(IMG) .

.PHONY: image-push
image-push: check-container-tool load-version-json ## Push container image $(IMG) to registry
72 changes: 69 additions & 3 deletions README.md
Expand Up @@ -6,6 +6,73 @@

This project offers tools for AI Inference, enabling developers to build [Inference Gateways].

---
## Temporary Fork Configuration

To enable the KVCacheAwareScorer, the following environment variables must be configured:
```
export ENABLE_KVCACHE_AWARE_SCORER=true
export KVCACHE_AWARE_SCORER_WEIGHT=1.0
export KVCACHE_INDEXER_REDIS_ADDR=<redis-service>
export HF_TOKEN=<HuggingFace Token that has access to the vLLM models>
```

To enable the PrefixAwareScorer, the following environment variables must be configured:
```
export ENABLE_PREFIX_AWARE_SCORER=true
export PREFIX_AWARE_SCORER_WEIGHT=1.0
```

To enable the LoadAwareScorer, the following environment variables must be configured:
```
export ENABLE_LOAD_AWARE_SCORER=true
export LOAD_AWARE_SCORER_WEIGHT=1.0
```

To enable the SessionAwareScorer, the following environment variables must be configured:
```
export ENABLE_SESSION_AWARE_SCORER=true
export SESSION_AWARE_SCORER_WEIGHT=1.0
```

To enable Prefill/Decode (PD) processing, the following environment variable must be configured:
```
export PD_ENABLED=true
```

To define the prompt length threshold, the following environment variable must be configured. Requests with a prompt longer than this value are processed using the prefill/decode process:
```
export PD_PROMPT_LEN_THRESHOLD=10
```

Prefill configuration:

To enable and configure the kv cache scorer for prefill, the following environment variables must be configured:
```
export PREFILL_ENABLE_KVCACHE_AWARE_SCORER=true
export PREFILL_KVCACHE_AWARE_SCORER_WEIGHT=1.0
```

To enable and configure the load aware scorer for prefill, the following environment variables must be configured:
```
export PREFILL_ENABLE_LOAD_AWARE_SCORER=true
export PREFILL_LOAD_AWARE_SCORER_WEIGHT=1.0
```

Decode configuration:

To enable and configure the kv cache scorer for decode, the following environment variables must be configured:
```
export DECODE_ENABLE_KVCACHE_AWARE_SCORER=true
export DECODE_KVCACHE_AWARE_SCORER_WEIGHT=1.0
```

To enable and configure the load aware scorer for decode, the following environment variables must be configured:
```
export DECODE_ENABLE_LOAD_AWARE_SCORER=true
export DECODE_LOAD_AWARE_SCORER_WEIGHT=1.0
```
---
[Inference Gateways]:#concepts-and-definitions

## Concepts and Definitions
Expand Down Expand Up @@ -79,8 +146,8 @@ See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for deta
## Roadmap

As Inference Gateway builds towards a GA release, we will continue to expand our capabilities, namely:
1. Prefix-cache aware load balancing with interfaces for remote caches
1. Recommended LoRA adapter pipeline for automated rollout
1. Prefix-cache aware load balancing with interfaces for remote caches
1. Recommended LoRA adapter pipeline for automated rollout
1. Fairness and priority between workloads within the same criticality band
1. HPA support for autoscaling on aggregate metrics derived from the load balancer
1. Support for large multi-modal inputs and outputs
Expand All @@ -104,4 +171,3 @@ Contributions are readily welcomed, follow the [dev guide](./docs/dev.md) to sta
### Code of conduct

Participation in the Kubernetes community is governed by the [Kubernetes Code of Conduct](code-of-conduct.md).

9 changes: 6 additions & 3 deletions deploy/components/vllm-p2p/vllm-deployment.yaml
Expand Up @@ -31,13 +31,12 @@ spec:
- "-c"
args:
- |
export LMCACHE_DISTRIBUTED_URL=$${${POD_IP}}:80 && \
export LMCACHE_DISTRIBUTED_URL=$${${POD_IP}} && \
vllm serve ${MODEL_NAME} \
--host 0.0.0.0 \
--port 8000 \
--enable-chunked-prefill false \
--max-model-len ${MAX_MODEL_LEN} \
--kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_both"}'
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
ports:
- name: http
containerPort: 8000
Expand Down Expand Up @@ -78,6 +77,10 @@ spec:
secretKeyRef:
name: ${HF_SECRET_NAME}
key: ${HF_SECRET_KEY}
- name: VLLM_ENABLE_V1_MULTIPROCESSING
value: "1"
- name: VLLM_WORKER_MULTIPROC_METHOD
value: spawn
- name: LMCACHE_LOOKUP_URL
value: ${REDIS_HOST}:${REDIS_PORT}
- name: LMCACHE_ENABLE_DEBUG
Expand Up @@ -29,4 +29,12 @@ spec:
valueFrom:
secretKeyRef:
name: hf-token
key: ${HF_SECRET_KEY}
key: ${HF_SECRET_KEY}
- name: ENABLE_KVCACHE_AWARE_SCORER
value: "true"
- name: KVCACHE_AWARE_SCORER_WEIGHT
value: "2.0"
- name: ENABLE_LOAD_AWARE_SCORER
value: "true"
- name: LOAD_AWARE_SCORER_WEIGHT
value: "1.0"