
inzh-studio/llm-service-java

Java 17+ · llama.cpp b2930 · java-llama.cpp v3.0.1-inzh

Java LLM service

The main goal of Java LLM service is to provide a packaged solution for running an LLM service. Release assets contain ready-to-use builds for multiple inference implementation types.

This project is based on llama.cpp and java-llama.cpp.

Quick Start

Download the latest release with your chosen inference implementation.

  • For example, use this archive for CPU + GPU inference with the Vulkan SDK:

    inzh-llm-http-service-0.1-vulkan-win-x64.zip
    
  • Start the service with the launch script:

    start.bat

    The start script requires the JAVA_HOME environment variable; if it is missing, set it in the command console. For example:

    set JAVA_HOME=C:\Program Files\Java\jdk-17

Advanced start

  • Start with a preloaded model

    start.bat  --modelPath ./models  --open Meta-Llama-3-8B-Instruct.Q8_0
  • Avoid VRAM capacity overrun errors

    If your configuration has a low amount of VRAM, you can limit the number of layers processed by the GPU to avoid "Out Of Memory" errors:

    start.bat  --modelPath ./models  --open Meta-Llama-3-8B-Instruct.Q8_0 --nGpuLayers 8

Model list

Below are tested GGUF models with their resource consumption (RAM and VRAM).

Model                              RAM & VRAM
Meta-Llama-3-8B-Instruct.Q8_0      ~ 9.6 GB
mistral-7b-instruct-v0.2.Q5_K_M    ~ 5.3 GB
vigogne-2-7b-instruct.Q8_0         ~ 11.6 GB

REST Service usage

All request types (except completion) support bulk requests. For bulk usage, simply send an array of request objects.
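For instance, a bulk embeddings payload would wrap several single-request bodies in an array. This is a sketch of the assumed shape (based on the single-request examples below), not output from the service:

```json
[
  { "model": "Meta-Llama-3-8B-Instruct.Q8_0", "input": "First text to embed." },
  { "model": "Meta-Llama-3-8B-Instruct.Q8_0", "input": "Second text to embed." }
]
```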

  • Completion request for chat:
curl  http://localhost:18000/v1/chat/completions \
 --json '
{
   "context":"This is a conversation between ${user.name} and ${bot.name}, a friendly chatbot.${bot.name} is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.",
   "model":"Meta-Llama-3-8B-Instruct.Q8_0",
   "messages":[
      {
         "role":"user",
         "content":"List me the numbers between 1 and 6, from the largest value to the smallest."
      }
   ]
}'
{
   "duration":1654,
   "text":"6, 5, 4, 3, 2, 1. That's the list of numbers between 1 and 6, in descending order. Is there anything else I can help you with? "
}
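In practice the reply text is usually extracted from the JSON response. A minimal sed sketch using the sample response above (fragile, since it assumes the single-line response shape shown; a JSON-aware tool like jq is more robust):

```shell
# Extract the "text" field from a completion response.
# The sample response is hard-coded here; a real pipeline would pipe
# the curl output into sed instead.
response='{"duration":1654,"text":"6, 5, 4, 3, 2, 1."}'
echo "$response" | sed -n 's/.*"text":"\([^"]*\)".*/\1/p'
```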
  • Completion request for chat (EventStream):
curl -N  http://localhost:18000/v1/chat/completions \
 --json '
{
   "context":"This is a conversation between ${user.name} and ${bot.name}, a friendly chatbot.${bot.name} is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision.",
   "model":"Meta-Llama-3-8B-Instruct.Q8_0",
   "configuration": {
    "stream":true
   },
   "messages":[
      {
         "role":"user",
         "content":"List me the numbers between 1 and 6, from the largest value to the smallest."
      }
   ]
}'
data: {"duration":390,"text":"6"}

data: {"duration":27,"text":","}

.......

data: {"duration":29,"text":"1"}

data: {"duration":29,"text":"."}

data: {"duration":29,"text":" Would"}

data: {"duration":29,"text":" you"}

data: {"duration":29,"text":" like"}

data: {"duration":29,"text":" to"}

data: {"duration":30,"text":" know"}

data: {"duration":29,"text":" any"}

data: {"duration":30,"text":" other"}

data: {"duration":29,"text":" information"}

data: {"duration":29,"text":"?"}
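When streaming is enabled, the full reply can be reassembled client-side by stripping the "data: " SSE framing and concatenating the "text" fragments. A minimal sketch over a few hard-coded sample events in the format shown above (again, jq would be more robust than sed):

```shell
# Reassemble a streamed reply from SSE events.
# Sample events are hard-coded; a real client would pipe `curl -N` output in.
printf '%s\n' \
  'data: {"duration":390,"text":"6"}' \
  'data: {"duration":27,"text":","}' \
  'data: {"duration":29,"text":" 5"}' \
  | sed -n 's/^data: .*"text":"\([^"]*\)".*/\1/p' | tr -d '\n'
```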
  • Embedding request
curl http://localhost:18000/v1/embeddings \
 --json '
{
   "model":"Meta-Llama-3-8B-Instruct.Q8_0",
   "input":"When i was a child i was a Jedi."
}'
{
  "duration": 53,
  "embedding": [
    -0.028669784,
    -0.004733609,
    0.011325254,
    
	.......
	
    0.0013322092,
    0.014150904,
    0.010830845,
    -0.00008983313
  ]
}
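Embedding vectors are typically compared with cosine similarity. A minimal awk sketch over two toy 3-dimensional vectors (real vectors returned by the endpoint have many more dimensions):

```shell
# Cosine similarity of two comma-separated vectors.
# Toy inputs; in practice these would be "embedding" arrays from the service.
a='1,0,1'
b='1,1,0'
echo "$a $b" | awk '{
  n = split($1, x, ","); split($2, y, ",")
  for (i = 1; i <= n; i++) { dot += x[i]*y[i]; na += x[i]^2; nb += y[i]^2 }
  printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))
}'
```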
  • Tokenize request
curl http://localhost:18000/v1/tokenizes \
 --json '
{
   "model":"Meta-Llama-3-8B-Instruct.Q8_0",
   "input":"When i was a child i was a Jedi."
}'
{
  "duration": 1,
  "tokens": [
    4599,
    602,
    574,
    264,
    1716,
    602,
    574,
    264,
    41495,
    13
  ]
}
  • Resolve token values back to text:
curl http://localhost:18000/v1/resolves \
 --json '
{
   "model":"Meta-Llama-3-8B-Instruct.Q8_0",
   "tokens":[
      4599,
      602,
      574,
      264,
      1716,
      602,
      574,
      264,
      41495,
      13
   ]
}'
{
  "duration": 1,
  "text": "When i was a child i was a Jedi."
}