Open Source Infrastructure

PyTorch Inference Framework

Production-ready inference with TensorRT acceleration. Handles batching, model versioning, and GPU memory management out of the box. Built under Kolosal AI.

What is it?

Running PyTorch models in production is messy — GPU memory leaks, no batching, manual versioning, and no API layer. This framework wraps all of that into a clean, deployable service.

Built as part of the Kolosal AI open-source toolchain, it compiles models with TensorRT for maximum GPU throughput, exposes them via a FastAPI endpoint, and ships in Docker so deployment is a single command.
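The compile-then-serve idea can be sketched as below. This is an illustrative helper, not the framework's actual code: `compile_for_gpu` is a hypothetical name, and the graceful fallback to the uncompiled model is an assumption. The `torch_tensorrt.compile(model, inputs=...)` call follows the torch-tensorrt API.

```python
import importlib.util

def compile_for_gpu(model, example_input):
    """Compile a PyTorch model to a TensorRT engine when torch_tensorrt
    is installed; otherwise fall back to serving the uncompiled model.
    (Hypothetical sketch -- the framework's real pipeline may differ.)"""
    if importlib.util.find_spec("torch_tensorrt") is None:
        return model  # no TensorRT on this host: serve the eager model as-is
    import torch_tensorrt  # deferred import: only needed on GPU hosts
    return torch_tensorrt.compile(model, inputs=[example_input])
```

The fallback path keeps development machines without a GPU usable with the same code.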


What it does

TensorRT Acceleration

Compiles PyTorch models to TensorRT engines for maximum GPU throughput with minimal latency.

📦

Dynamic Batching

Automatically groups incoming requests into optimal batch sizes to saturate GPU compute.
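The core batching pattern, collecting concurrent requests until a batch fills up or a deadline expires, can be sketched with `asyncio`. Class and parameter names here are illustrative, not the framework's:

```python
import asyncio

class DynamicBatcher:
    """Groups concurrent requests into batches before invoking the model.
    (Hypothetical sketch; names and defaults are illustrative.)"""

    def __init__(self, batch_fn, max_batch=8, max_wait=0.005):
        self.batch_fn = batch_fn    # callable: list of inputs -> list of outputs
        self.max_batch = max_batch  # flush once this many requests are queued
        self.max_wait = max_wait    # ...or once the first request waited this long
        self.queue = asyncio.Queue()

    async def infer(self, x):
        """Enqueue one input and await its individual result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        """Background worker: drain the queue into batches and dispatch."""
        loop = asyncio.get_running_loop()
        while True:
            items = [await self.queue.get()]          # block until work arrives
            deadline = loop.time() + self.max_wait
            while len(items) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = self.batch_fn([x for x, _ in items])
            for (_, fut), result in zip(items, outputs):
                fut.set_result(result)
```

Callers simply `await batcher.infer(x)`; the batching is invisible to them, which is what lets the GPU stay saturated under concurrent load.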

🔖

Model Versioning

Register and serve multiple model versions simultaneously, with instant rollback support.
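A minimal registry that supports concurrent versions and instant rollback might look like this sketch (`ModelRegistry` and its method names are assumptions for illustration, not the framework's API):

```python
class ModelRegistry:
    """Tracks multiple versions per model name with an 'active' pointer.
    Rollback is just repointing 'active' -- no reload needed.
    (Illustrative sketch; the framework's real API may differ.)"""

    def __init__(self):
        self._versions = {}  # name -> {version: model}
        self._active = {}    # name -> currently served version

    def register(self, name, version, model, activate=True):
        self._versions.setdefault(name, {})[version] = model
        if activate:
            self._active[name] = version

    def rollback(self, name, version):
        if version not in self._versions.get(name, {}):
            raise KeyError(f"{name!r} has no registered version {version!r}")
        self._active[name] = version  # instant: old version is still loaded

    def get(self, name, version=None):
        """Fetch a specific version, or the active one by default."""
        return self._versions[name][version or self._active[name]]
```

Keeping old versions resident is what makes rollback instant: it is a pointer swap, not a model load.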

🧠

GPU Memory Management

Pools and recycles GPU memory across requests to prevent OOM errors under load.
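The pooling idea, reusing same-shape buffers across requests rather than allocating fresh ones, can be sketched in a few lines. This is a shape-keyed free-list only; the real framework manages CUDA tensors, and `BufferPool` is a hypothetical name:

```python
from collections import defaultdict

class BufferPool:
    """Recycles fixed-shape buffers across requests instead of
    reallocating, avoiding allocator churn and OOM spikes under load.
    (Illustrative sketch; the framework pools GPU tensors, not lists.)"""

    def __init__(self, alloc):
        self.alloc = alloc             # e.g. lambda shape: torch.empty(shape, device="cuda")
        self.free = defaultdict(list)  # shape -> stack of released buffers

    def acquire(self, shape):
        pool = self.free[shape]
        return pool.pop() if pool else self.alloc(shape)

    def release(self, shape, buf):
        self.free[shape].append(buf)   # recycle rather than free
```

Because batch shapes repeat in a serving workload, the pool's hit rate climbs quickly and steady-state allocations drop to near zero.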

🌐

FastAPI Endpoints

Auto-generated REST API with async support, health checks, and OpenAPI docs.

🐳

Docker-ready

Ships as a Docker image — deploy to any GPU-enabled environment with a single command.
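The single deploy command would look something like the following; the image name is a placeholder, not the project's published tag:

```bash
# --gpus all exposes the host's NVIDIA GPUs to the container (requires the
# NVIDIA Container Toolkit); port 8000 is an assumed default for the API.
docker run --gpus all -p 8000:8000 <your-registry>/pytorch-inference:latest
```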


Built with

PyTorch TensorRT FastAPI Docker CUDA Python

See the code

Full source, docs, and usage examples on GitHub.