A high-throughput and memory-efficient inference and serving engine for LLMs
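As a quick illustration, this is a minimal sketch of vLLM's offline batch-inference API; the model name, prompt, and sampling settings are placeholders:

```python
# Minimal vLLM offline-inference sketch; model, prompt, and sampling
# settings are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any Hugging Face model vLLM supports
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```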
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
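A hedged sketch of what such a service can look like, assuming BentoML's 1.2+ decorator API; the class name, endpoint, and pipeline task are illustrative placeholders:

```python
# Hypothetical BentoML-style service sketch; the service name, endpoint,
# and summarization pipeline are placeholders, not a canonical example.
import bentoml

@bentoml.service
class Summarizer:
    def __init__(self):
        from transformers import pipeline  # assumes transformers is installed
        self.pipe = pipeline("summarization")

    @bentoml.api
    def summarize(self, text: str) -> str:
        return self.pipe(text)[0]["summary_text"]
```

Served with `bentoml serve`, a class like this is exposed as an HTTP inference endpoint.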
In this repository, I share useful notes and references on deploying deep learning models in production.
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI job on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Standardized Serverless ML Inference Platform on Kubernetes
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc. 🗃️ Llama3, Mistral, etc. 🧑‍💻 Video Tutorials.
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
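The multi-LoRA pattern routes each request to a fine-tuned adapter layered on one shared base model, so thousands of adapters can share a single GPU deployment. A hedged sketch of such a request; the host, port, endpoint path, and adapter id are placeholders, and the exact JSON schema varies by server:

```python
# Illustrative multi-LoRA request; URL and adapter id are placeholders.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize: LoRA adapters share one base model.",
        "parameters": {"max_new_tokens": 64, "adapter_id": "acme/support-bot-lora"},
    },
)
print(resp.json())
```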
🏕️ Reproducible development environment
AICI: Prompts as (Wasm) Programs
Olares: An Open-Source Sovereign Cloud OS for Local AI
MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.
Hopsworks - Data-Intensive AI platform with a Feature Store
The simplest way to serve AI/ML models in production
A high-performance ML model serving framework, offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine
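Dynamic batching itself is a general technique: buffer incoming requests briefly, then run them through the model as one batch to raise GPU utilization and throughput. A toy illustration of the idea (my own sketch, not any particular framework's API):

```python
# Toy dynamic batcher: illustrative only, not a production implementation.
import queue
import threading
import time

class DynamicBatcher:
    """Collects requests until the batch is full or a short timeout
    expires, then runs one batched inference call for all of them."""

    def __init__(self, infer_fn, max_batch_size=8, max_wait_ms=10):
        self.infer_fn = infer_fn  # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000.0
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        """Called by request handlers; blocks until the result is ready."""
        done = threading.Event()
        slot = {"input": item, "done": done, "output": None}
        self.requests.put(slot)
        done.wait()
        return slot["output"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block for the first request
            deadline = time.monotonic() + self.max_wait
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=timeout))
                except queue.Empty:
                    break
            outputs = self.infer_fn([s["input"] for s in batch])
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["done"].set()

# Usage: a stand-in "model" that doubles its batched inputs.
batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs])
print(batcher.submit(21))  # 42
```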
Model Deployment at Scale on Kubernetes 🦄️
A scalable inference server for models optimized with OpenVINO™
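OpenVINO Model Server exposes standard inference protocols, including the KServe v2 REST/gRPC API. A sketch of a v2-style REST call; the model name, input name, shape, and data are placeholders for whatever the deployed model expects:

```python
# KServe v2-style REST inference call; model name, input name, shape,
# and data are placeholders, not tied to a specific deployment.
import requests

body = {
    "inputs": [
        {"name": "input_1", "shape": [1, 4], "datatype": "FP32",
         "data": [5.1, 3.5, 1.4, 0.2]},
    ]
}
resp = requests.post("http://localhost:8000/v2/models/my_model/infer", json=body)
print(resp.json())
```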
A throughput-oriented high-performance serving framework for LLMs
An open source DevOps tool for packaging and versioning AI/ML models, datasets, code, and configuration into an OCI artifact.
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.