PyTorch Inference Framework
Production-ready inference with TensorRT acceleration. Handles batching, model versioning, and GPU memory management out of the box — built under Kolosal AI.
Overview
What is it?
Running PyTorch models in production is messy: GPU memory leaks under load, requests served one at a time with no batching, model versions tracked by hand, and no API layer in front of the model. This framework wraps all of that into a clean, deployable service.
Built as part of the Kolosal AI open-source toolchain, it compiles models with TensorRT for maximum GPU throughput, exposes them via a FastAPI endpoint, and ships in Docker so deployment is a single command.
Features
What it does
TensorRT Acceleration
Compiles PyTorch models to TensorRT engines for maximum GPU throughput with minimal latency.
Dynamic Batching
Automatically groups incoming requests into optimal batch sizes to saturate GPU compute.
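The grouping step can be sketched in plain Python. This is a minimal illustration of the idea, not the framework's actual API: `collect_batch`, the queue, and the timeout value are all hypothetical. Requests accumulate in a queue, and the server drains up to a maximum batch size, waiting a few milliseconds for stragglers before dispatching to the GPU.

```python
import queue
import time
from typing import Any, List

def collect_batch(requests: "queue.Queue[Any]", max_batch: int = 8,
                  timeout_s: float = 0.005) -> List[Any]:
    """Drain up to max_batch requests, waiting at most timeout_s for stragglers."""
    batch: List[Any] = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout hit: dispatch a partial batch rather than stall
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q: "queue.Queue[int]" = queue.Queue()
for i in range(20):
    q.put(i)
print(collect_batch(q))  # first 8 queued requests form one batch
```

The timeout bounds worst-case latency: under heavy load batches fill instantly, while a lone request waits at most a few milliseconds before running alone.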
Model Versioning
Register and serve multiple model versions simultaneously, with instant rollback support.
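A minimal sketch of how such a registry could work. `ModelRegistry` and its method names are hypothetical stand-ins, not the framework's real interface; callables play the role of loaded models:

```python
from typing import Any, Callable, Dict, Optional

class ModelRegistry:
    """Toy registry: serve several versions at once, roll back instantly."""

    def __init__(self) -> None:
        self._versions: Dict[str, Dict[int, Callable[[Any], Any]]] = {}
        self._active: Dict[str, int] = {}

    def register(self, name: str, version: int, model: Callable[[Any], Any]) -> None:
        self._versions.setdefault(name, {})[version] = model
        self._active[name] = version  # newest registration becomes active

    def rollback(self, name: str, version: int) -> None:
        if version not in self._versions.get(name, {}):
            raise KeyError(f"{name} v{version} is not registered")
        self._active[name] = version  # just a pointer swap, hence "instant"

    def predict(self, name: str, x: Any, version: Optional[int] = None) -> Any:
        v = version if version is not None else self._active[name]
        return self._versions[name][v](x)

registry = ModelRegistry()
registry.register("classifier", 1, lambda x: x * 2)  # stand-in for a real model
registry.register("classifier", 2, lambda x: x * 3)
print(registry.predict("classifier", 10))  # v2 is active
registry.rollback("classifier", 1)
print(registry.predict("classifier", 10))  # back on v1
```

Because every version stays loaded, rollback is a pointer swap rather than a reload, and callers can still pin an explicit `version` for A/B comparisons.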
GPU Memory Management
Pools and recycles GPU memory across requests to prevent OOM errors under load.
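The pooling strategy can be illustrated without a GPU. This toy `BufferPool` recycles host-side `bytearray` buffers, whereas the framework manages CUDA memory; treat it as a sketch of the reuse pattern only:

```python
from collections import defaultdict
from typing import DefaultDict, List

class BufferPool:
    """Recycle fixed-size buffers instead of reallocating on every request."""

    def __init__(self) -> None:
        self._free: DefaultDict[int, List[bytearray]] = defaultdict(list)
        self.allocations = 0  # count of fresh allocations actually made

    def acquire(self, size: int) -> bytearray:
        if self._free[size]:
            return self._free[size].pop()  # reuse a recycled buffer
        self.allocations += 1
        return bytearray(size)  # stands in for a CUDA allocation

    def release(self, buf: bytearray) -> None:
        self._free[len(buf)].append(buf)  # return to pool, never free

pool = BufferPool()
for _ in range(100):             # 100 requests, one live buffer at a time
    buf = pool.acquire(1 << 20)  # 1 MiB scratch buffer per request
    pool.release(buf)
print(pool.allocations)  # 1: every request after the first reuses the buffer
```

Capping allocations at the steady-state working set is what keeps memory use flat under load; fragmentation and OOM errors come from churning the allocator, not from the buffers themselves.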
FastAPI Endpoints
Auto-generated REST API with async support, health checks, and OpenAPI docs.
Docker-ready
Ships as a Docker image — deploy to any GPU-enabled environment with a single command.
Stack
Built with
PyTorch · TensorRT · FastAPI · Docker
See the code
Full source, docs, and usage examples on GitHub.