Open Source Infrastructure

PyTorch Inference Framework

Production-ready inference with TensorRT acceleration. Handles batching, model versioning, and GPU memory management out of the box — built under Kolosal AI.

What is it?

Running PyTorch models in production is messy — GPU memory leaks, no batching, manual versioning, and no API layer. This framework wraps all of that into a clean, deployable service.

Built as part of the Kolosal AI open-source toolchain, it compiles models with TensorRT for maximum GPU throughput, exposes them via a FastAPI endpoint, and ships in Docker so deployment is a single command.


What it does

TensorRT Acceleration

Compiles PyTorch models to TensorRT engines for maximum GPU throughput with minimal latency.

Dynamic Batching

Automatically groups incoming requests into optimal batch sizes to saturate GPU compute.
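The core idea behind dynamic batching can be sketched in a few lines: queue incoming requests, then flush them in groups no larger than the GPU's sweet spot. This is an illustrative sketch, not the framework's actual implementation; the class name, parameters, and the omission of the wait-timeout logic are all simplifications.

```python
from collections import deque

class DynamicBatcher:
    """Illustrative sketch: collect requests, then flush them in batches
    capped at max_batch_size so each flush maps to one GPU call.
    (A real batcher also flushes on a wait timeout, omitted here.)"""

    def __init__(self, max_batch_size=8):
        self.max_batch_size = max_batch_size
        self._queue = deque()

    def submit(self, request):
        # Requests arrive one at a time from independent clients.
        self._queue.append(request)

    def drain(self):
        # Group whatever is queued into batches of at most max_batch_size.
        batches = []
        while self._queue:
            take = min(self.max_batch_size, len(self._queue))
            batches.append([self._queue.popleft() for _ in range(take)])
        return batches
```

Batching ten single requests with `max_batch_size=4` yields groups of 4, 4, and 2, so the GPU runs three kernels instead of ten.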

Model Versioning

Register and serve multiple model versions simultaneously, with instant rollback support.
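Instant rollback usually means every registered version stays loaded and "serving" is just a pointer swap. The sketch below shows that pattern with hypothetical names (`ModelRegistry`, `register`, `rollback`); the framework's real API may differ.

```python
class ModelRegistry:
    """Illustrative sketch of versioned serving with instant rollback:
    all versions stay resident, and switching is an O(1) pointer swap."""

    def __init__(self):
        self._versions = {}  # model name -> {version: loaded model}
        self._active = {}    # model name -> version currently served

    def register(self, name, version, model):
        self._versions.setdefault(name, {})[version] = model
        self._active[name] = version  # newest registration serves by default

    def rollback(self, name, version):
        if version not in self._versions.get(name, {}):
            raise KeyError(f"{name} has no registered version {version}")
        self._active[name] = version  # no reload: the old version is resident

    def get(self, name):
        # Resolve the currently active version for inference.
        return self._versions[name][self._active[name]]
```

Because no weights are reloaded on `rollback`, reverting a bad deploy takes microseconds rather than the seconds a cold model load would cost.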

GPU Memory Management

Pools and recycles GPU memory across requests to prevent OOM errors under load.
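Pooling works by pre-allocating a fixed set of buffers and recycling them instead of allocating per request, which is what causes fragmentation and OOM under load. A minimal CPU-only sketch, with `bytearray` standing in for a CUDA allocation and all names hypothetical:

```python
class BufferPool:
    """Illustrative sketch of memory pooling: fixed-size buffers are
    allocated once and recycled via a free list, so steady-state load
    performs zero allocations. `alloc` stands in for a CUDA allocator."""

    def __init__(self, num_buffers, buffer_bytes, alloc=bytearray):
        self._free = [alloc(buffer_bytes) for _ in range(num_buffers)]

    def acquire(self):
        if not self._free:
            # Bounded pool: callers queue or shed load instead of OOM-ing.
            raise MemoryError("pool exhausted")
        return self._free.pop()

    def release(self, buf):
        # Return the buffer to the free list; never actually freed.
        self._free.append(buf)
```

The key property is that `acquire` after `release` hands back the same buffer object, and exhausting the pool fails fast with a catchable error rather than an opaque CUDA OOM mid-request.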

FastAPI Endpoints

Auto-generated REST API with async support, health checks, and OpenAPI docs.

Docker-ready

Ships as a Docker image — deploy to any GPU-enabled environment with a single command.
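A single-command deployment typically looks like the following; the image name, tag, port, and volume path here are placeholders, not the project's published coordinates (those live in the repo's docs):

```
# Hypothetical invocation: swap in the real image name and model path.
docker run --gpus all \
    -p 8000:8000 \
    -v /path/to/models:/models \
    kolosal/pytorch-inference:latest
```

`--gpus all` requires the NVIDIA Container Toolkit on the host; the mounted volume lets the container pick up model weights without rebuilding the image.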


Built with

PyTorch · TensorRT · FastAPI · Docker · CUDA · Python

See the code

Full source, docs, and usage examples on GitHub.