💰 Save 50% on OpenAI API costs with a one-line change.
OpenAI offers a 50% discount through its batch API if you can tolerate higher latency and are willing to make significant changes to your code.
With llm-proxy, you can start using OpenAI's batch API by pointing the OpenAI client at the proxy as its base URL. No other code changes are required.
Your application sends a request and waits for a response just as before, only with increased latency.
- The proxy receives individual API requests from the application.
- It groups them into batches based on configurable criteria (time window, batch size, etc.).
- Batched requests are sent to OpenAI's batch API endpoint.
- The proxy waits for the batch to finish, then routes each response back to its original caller (see the sketch after the diagram below).
```mermaid
sequenceDiagram
    participant App
    participant llm-proxy
    participant OpenAI
    App->>llm-proxy: req1
    App->>llm-proxy: req...
    App->>llm-proxy: req1000
    llm-proxy->>OpenAI: batch(req1, req..., req1000)
    App->>llm-proxy: req1001
    App->>llm-proxy: req1002
    llm-proxy->>OpenAI: batch(req1001, req1002)
    OpenAI-->>llm-proxy: async resp1, resp..., resp1000
    llm-proxy->>App: resp1
    llm-proxy->>App: resp...
    llm-proxy->>App: resp1000
    OpenAI-->>llm-proxy: async resp1002, resp1001
    llm-proxy->>App: resp1002
    llm-proxy->>App: resp1001
```
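To make the flow concrete, here is a minimal sketch in Go of how such a batching loop can work. It is an illustration under assumptions, not the proxy's actual code: `pending`, `submitBatch`, and the flush conditions are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"time"
)

// pending pairs an incoming request with the channel on which the proxy
// will eventually deliver its response (types simplified for illustration).
type pending struct {
	body string
	resp chan string
}

// submitBatch stands in for the real work: write the requests to a .jsonl
// file, upload it, create a job via POST /v1/batches, poll until it
// completes, and fan the results back out to each caller.
func submitBatch(batch []pending) {
	fmt.Printf("submitting batch of %d requests\n", len(batch))
	for _, p := range batch {
		p.resp <- "(batched response)"
	}
}

// batchLoop buffers requests until the batch is full or the time window
// expires, then hands the batch off for submission.
func batchLoop(in <-chan pending, maxSize int, window time.Duration) {
	var batch []pending
	flush := func() {
		if len(batch) > 0 {
			go submitBatch(batch)
			batch = nil
		}
	}
	ticker := time.NewTicker(window)
	defer ticker.Stop()
	for {
		select {
		case p := <-in:
			batch = append(batch, p)
			if len(batch) >= maxSize {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}

func main() {
	in := make(chan pending)
	go batchLoop(in, 1000, 5*time.Second)

	p := pending{body: "Say this is a test", resp: make(chan string, 1)}
	in <- p
	fmt.Println(<-p.resp) // blocks until the batch containing it completes
}
```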
You can run it via Docker or directly in a Go environment.
Docker (a lightweight 26MiB image):
```sh
docker build -t llm-proxy .
docker run -d --restart unless-stopped -p 3030:3030 llm-proxy
```
Directly using Go:
```sh
go run .
```
See the examples folder for more details and usage examples.
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:3030/v1")  # only change needed

completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-4o-mini",
)

print(completion.choices[0].message.content)
```
```javascript
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: 'http://127.0.0.1:3030/v1' // only change needed
});

async function main() {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: 'Say this is a test' }],
  });
  console.log(completion.choices[0]?.message?.content);
}

main();
```
```sh
curl http://127.0.0.1:3030/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.7
  }'
```
```go
package main

import (
	"context"
	"fmt"
	"os"

	openai "github.com/sashabaranov/go-openai"
)

func main() {
	config := openai.DefaultConfig(os.Getenv("OPENAI_API_KEY"))
	config.BaseURL = "http://127.0.0.1:3030/v1" // only change needed
	client := openai.NewClientWithConfig(config)

	resp, err := client.CreateChatCompletion(
		context.Background(),
		openai.ChatCompletionRequest{
			Model: openai.GPT4oMini,
			Messages: []openai.ChatCompletionMessage{
				{
					Role:    openai.ChatMessageRoleUser,
					Content: "Hello!",
				},
			},
		},
	)
	if err != nil {
		fmt.Printf("ChatCompletion error: %v\n", err)
		return
	}

	fmt.Println(resp.Choices[0].Message.Content)
}
```
Both endpoints that support batching (`/v1/chat/completions` and `/v1/embeddings`) are supported. Any other endpoint is relayed to OpenAI as-is.
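For completeness, here is an embeddings request through the proxy, using the same Go client as the chat example above. This is a hedged sketch: the model constant is only an example and assumes a recent version of go-openai.

```go
package main

import (
	"context"
	"fmt"
	"os"

	openai "github.com/sashabaranov/go-openai"
)

func main() {
	config := openai.DefaultConfig(os.Getenv("OPENAI_API_KEY"))
	config.BaseURL = "http://127.0.0.1:3030/v1" // same one-line change
	client := openai.NewClientWithConfig(config)

	// Embedding requests are batched by the proxy just like chat requests.
	resp, err := client.CreateEmbeddings(
		context.Background(),
		openai.EmbeddingRequest{
			Model: openai.SmallEmbedding3, // illustrative; any embedding model works
			Input: []string{"Say this is a test"},
		},
	)
	if err != nil {
		fmt.Printf("Embeddings error: %v\n", err)
		return
	}
	fmt.Println(len(resp.Data[0].Embedding))
}
```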
Simple real-time statistics are accessible through the http://127.0.0.1:3030/stats endpoint, providing insight into request counts, batch efficiency, and latency. Monitor `/stats` to ensure the proxy is performing as expected in your environment.
Sample output:
```json
{
  "requests": {
    "total": 2998,
    "successful": 2997,
    "failed": 0,
    "synthesized_error_responses": 999,
    "avg_time_ms": 153959.67467467466,
    "p50_time_ms": 203733,
    "p95_time_ms": 250896,
    "p99_time_ms": 251073
  },
  "batches": {
    "total": 3,
    "successful": 3,
    "failed": 0,
    "avg_time_ms": 152846.33333333334,
    "p50_time_ms": 104655.5,
    "p95_time_ms": 226016,
    "p99_time_ms": 226016
  }
}
```
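If you want to consume these numbers programmatically (for dashboards or alerts), a minimal sketch in Go follows. The struct mirrors only a few of the fields shown in the sample above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// stats models a subset of the /stats response shown above.
type stats struct {
	Requests struct {
		Total      int     `json:"total"`
		Successful int     `json:"successful"`
		Failed     int     `json:"failed"`
		AvgTimeMs  float64 `json:"avg_time_ms"`
	} `json:"requests"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:3030/stats")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var s stats
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		panic(err)
	}
	fmt.Printf("requests: %d total, %d failed, avg %.0f ms\n",
		s.Requests.Total, s.Requests.Failed, s.Requests.AvgTimeMs)
}
```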
- Not suitable for applications requiring real-time responses (e.g., chatbots).
- Streaming is not supported, since the batch API has no streaming mode.
OpenAI's commitment for this API is a 24-hour turnaround. In practice, I've observed good latencies during nights and weekends (a few seconds), but latency can increase significantly on weekdays, sometimes reaching an hour.
- Configurable grace period: if a batch doesn't complete within the allotted time, cancel it, return the partial results, and send the remaining requests via the synchronous API (see the sketch below).
- Add support for Gemini, which also offers a batch API.
- Add usage tracking (token counts, cost estimates).
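The grace period is not implemented yet; the sketch below only illustrates the planned shape of the fallback. All names here are hypothetical, and `pollBatch` is a stub standing in for polling the batch job's status.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollBatch stands in for polling the batch job until it completes; it
// returns the IDs of requests that finished. Stubbed with a fixed delay.
func pollBatch(ctx context.Context, batchID string) ([]string, error) {
	select {
	case <-time.After(2 * time.Second): // pretend the batch takes 2s
		return []string{"req1", "req2"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// awaitWithGracePeriod waits up to `grace` for the batch, then falls back.
func awaitWithGracePeriod(batchID string, grace time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()

	completed, err := pollBatch(ctx, batchID)
	if errors.Is(err, context.DeadlineExceeded) {
		// Planned behavior: cancel the batch, return partial results,
		// and resend outstanding requests via the synchronous API.
		fmt.Println("grace period expired: falling back to synchronous API")
		return
	}
	fmt.Printf("batch finished in time with %d results\n", len(completed))
}

func main() {
	awaitWithGracePeriod("batch_abc", time.Second)
}
```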