Serving
We use a vLLM backend, which currently supports most popular LLM architectures, including Llama 2 (base and chat finetunes) and Vicuna. A full list of supported models is available in the vLLM documentation.
Deployment
Model deployments are referenced by their HuggingFace Hub name. Finetuned models trained through LLM-ATC are referenced with `--name llm-atc`, with `--source` pointing to the model checkpoint in object storage.
# serve an llm-atc finetuned model; requires `--name llm-atc` and grabs the model checkpoint from the object store given by `--source`
$ llm-atc serve --name llm-atc --source s3://my-bucket/my_vicuna/ --accelerator A100:1 -c servecluster --cloud gcp --region asia-southeast1 --envs "HF_TOKEN=<HuggingFace_token>"
# serve a HuggingFace model, e.g. `lmsys/vicuna-13b-v1.3`
$ llm-atc serve --name lmsys/vicuna-13b-v1.3 --accelerator A100:1 -c servecluster --cloud gcp --region asia-southeast1 --envs "HF_TOKEN=<HuggingFace_token>"
# Llama 7b can be served on a V100
$ llm-atc serve --name meta-llama/Llama-2-7b-chat-hf --accelerator V100:1 -c servecluster --cloud aws --region us-east-2 --envs "HF_TOKEN=<HuggingFace_token>"
# Llama 70b requires more VRAM, request A100-80GB:2 at least
$ llm-atc serve --name meta-llama/Llama-2-70b-chat-hf --accelerator A100-80GB:2 -c servecluster --cloud aws --region us-east-2 --envs "HF_TOKEN=<HuggingFace_token>"
Querying the Endpoint
This creates an OpenAI API compatible endpoint on the provisioned instance on port 8000, which can receive HTTP requests from your laptop.
# get the ip address of the OpenAI API endpoint
$ ip=$(sky status --ip servecluster)
# test which models are available
$ curl http://$ip:8000/v1/models
# chat completion (replace "my-model" with a model id returned by /v1/models)
$ curl http://$ip:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [{"role": "user", "content": "Hello! What is your name?"}]
}'
# shutdown when done
$ sky stop servecluster
The endpoint supports a subset of the OpenAI API schema:
- chat completions (not including function calling)
- text completions
- embeddings
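The curl calls above can also be issued from Python. The sketch below builds the same chat completion request against the endpoint; it assumes only the Python standard library. The IP `1.2.3.4` and the model id `my-model` are placeholders — substitute the output of `sky status --ip servecluster` and a model id returned by `/v1/models`.

```python
# Minimal sketch of querying the OpenAI-compatible endpoint from Python.
# Assumes the cluster IP was obtained via `sky status --ip servecluster`.
import json
import urllib.request


def chat_request(ip, model, prompt, port=8000):
    """Build an OpenAI-style chat completion HTTP request for the endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"http://{ip}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


def send(req):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a live cluster, `send(chat_request(ip, "my-model", "Hello!"))` returns a response whose reply text is under `choices[0]["message"]["content"]`, matching the curl example above.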