AI Inference Gateway
Proof-of-concept unified API gateway that routes inference requests to the optimal provider: HuggingFace, OpenAI, or self-hosted vLLM.
Overview
The AI Inference Gateway provides a single OpenAI-compatible API endpoint that intelligently routes requests to the best available inference provider based on cost, latency, model capability, and rate-limit headroom. It supports chat completions, embeddings, text classification, and structured extraction — with automatic failover, request caching, and per-user token management.
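The routing decision described above could be sketched as a weighted scoring function over the four signals (cost, latency, capability, rate-limit headroom). The provider names, field shapes, and weights below are illustrative assumptions, not the gateway's actual tuning:

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    cost_per_1k_tokens: float   # USD; assumed pricing
    p50_latency_ms: float       # rolling median latency observed by the gateway
    supports_model: bool        # capability check for the requested model
    rate_limit_headroom: float  # 0.0 (quota exhausted) .. 1.0 (idle)

def score(p: Provider) -> float:
    """Lower is better. Weights are illustrative placeholders."""
    if not p.supports_model or p.rate_limit_headroom <= 0.05:
        return float("inf")  # ineligible providers are never selected
    return (0.5 * p.cost_per_1k_tokens
            + 0.3 * (p.p50_latency_ms / 1000.0)
            + 0.2 * (1.0 - p.rate_limit_headroom))

def route(providers: list[Provider]) -> Provider:
    best = min(providers, key=score)
    if score(best) == float("inf"):
        raise RuntimeError("no eligible provider for this model")
    return best
```

Making ineligible providers score infinity keeps capability and quota checks in the same code path as cost/latency trade-offs, which simplifies failover: removing a failed provider from the list is the only change needed.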
Technologies Used
- OpenAI-compatible REST API
- Redis (request caching)
- vLLM (self-hosted inference)
- HuggingFace Inference API
- Server-Sent Events (streaming)
Key Features
- OpenAI-compatible /v1/chat/completions endpoint
- Multi-provider routing with automatic failover
- Request caching via Redis for repeated queries
- Per-user API token issuance and rate limiting
- Streaming response support (SSE)
- Cost tracking and usage analytics dashboard
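For the Redis-backed request cache above, repeated queries can only hit the cache if semantically identical requests map to the same key. One plausible approach (the key prefix and caching policy here are assumptions, not the gateway's documented behaviour) is to canonicalise the request body before hashing, and to cache only deterministic, non-streaming requests:

```python
import hashlib
import json

def cache_key(request: dict) -> str:
    """Derive a deterministic cache key from the request body.
    Sorting keys ensures that JSON objects with the same fields in a
    different order hash identically."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return "infcache:" + hashlib.sha256(canonical.encode()).hexdigest()

def should_cache(request: dict) -> bool:
    """Only deterministic (temperature 0), non-streaming requests are
    safe to serve from cache; sampled outputs would otherwise repeat."""
    return not request.get("stream", False) and request.get("temperature", 1.0) == 0
```

A gateway worker would then consult Redis with this key (e.g. a GET before routing, and a SETEX with a TTL after a successful response) so that cache entries expire rather than serving stale model output indefinitely.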
Challenges & Solutions
Challenge: Normalising response formats across different inference providers.
Solution: Built adapter layers that transform provider-specific responses into a unified OpenAI-compatible schema.
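One such adapter might look like the sketch below, mapping a HuggingFace text-generation style payload (assumed shape: a list of objects with a "generated_text" field) onto the OpenAI chat-completion schema. The function name and the exact provider payload shape are illustrative assumptions:

```python
def from_hf_text_generation(hf_response: list, model: str) -> dict:
    """Adapter: HuggingFace text-generation output -> OpenAI-compatible
    chat completion. Assumes hf_response = [{"generated_text": "..."}]."""
    text = hf_response[0]["generated_text"]
    return {
        "object": "chat.completion",
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
    }
```

With one adapter per provider, the rest of the gateway (caching, streaming, usage tracking) only ever sees the unified schema.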
Challenge: Managing cold-start latency on free-tier HuggingFace models.
Solution: Implemented a warm-up scheduler and fallback routing to pre-warmed alternative models.
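The fallback half of that solution can be sketched as a tracker that remembers when each model was last invoked and treats long-idle models as cold. The idle threshold and model names below are illustrative guesses, not the project's actual values:

```python
import time

class WarmupTracker:
    """Tracks when each model last served a request; models idle longer
    than `ttl_s` are assumed to have gone cold (threshold is a guess)."""

    def __init__(self, ttl_s: float = 300.0):
        self.ttl_s = ttl_s
        self.last_hit: dict[str, float] = {}

    def record(self, model: str) -> None:
        self.last_hit[model] = time.monotonic()

    def is_warm(self, model: str) -> bool:
        t = self.last_hit.get(model)
        return t is not None and time.monotonic() - t < self.ttl_s

    def pick(self, requested: str, fallbacks: list[str]) -> str:
        """Prefer the requested model if warm, else the first warm
        fallback; if nothing is warm, accept the cold start."""
        for m in [requested, *fallbacks]:
            if self.is_warm(m):
                return m
        return requested
```

A companion warm-up scheduler would simply call `record` after sending periodic keep-alive requests to the models it wants to keep hot.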
Challenge: Preventing abuse while keeping the API accessible.
Solution: Designed a tiered token system with configurable rate limits, usage quotas, and automatic throttling.
Outcome
Serving internal demos and research workloads; public beta planned for Q2 2026