
AI Inference Gateway

Proof of concept

Unified API gateway that routes inference requests to the optimal provider — HuggingFace, OpenAI, or self-hosted vLLM.

View Live Demo (Unavailable)
View on GitHub

Overview

The AI Inference Gateway provides a single OpenAI-compatible API endpoint that intelligently routes requests to the best available inference provider based on cost, latency, model capability, and rate-limit headroom. It supports chat completions, embeddings, text classification, and structured extraction — with automatic failover, request caching, and per-user token management.
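The routing decision described above can be sketched as a simple scoring function over provider stats. The type fields, weights, and function names below are illustrative assumptions, not the gateway's actual schema:

```typescript
// Illustrative provider stats — field names are assumptions for this sketch.
interface ProviderStats {
  name: string;
  costPer1kTokens: number;    // USD per 1k tokens
  p50LatencyMs: number;       // median observed latency
  supportsModel: boolean;     // can this provider serve the requested model?
  rateLimitRemaining: number; // requests left in the current window
}

// Score a provider: lower is better. The weights are arbitrary here;
// a real gateway would tune them per workload.
function scoreProvider(p: ProviderStats): number {
  if (!p.supportsModel || p.rateLimitRemaining <= 0) return Infinity;
  return p.costPer1kTokens * 100 + p.p50LatencyMs / 1000;
}

// Pick the best-scoring eligible provider, or null if none qualifies.
function routeRequest(providers: ProviderStats[]): ProviderStats | null {
  const eligible = providers.filter((p) => scoreProvider(p) < Infinity);
  if (eligible.length === 0) return null;
  return eligible.reduce((best, p) =>
    scoreProvider(p) < scoreProvider(best) ? p : best
  );
}
```

Capability and rate-limit headroom act as hard filters (score of `Infinity`), while cost and latency trade off continuously — which is why a slow free-tier model can still win for cost-sensitive traffic.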

Technologies Used

Next.js · TypeScript · HuggingFace API · OpenAI API · Redis · Zod

Key Features

  • OpenAI-compatible /v1/chat/completions endpoint
  • Multi-provider routing with automatic failover
  • Request caching via Redis for repeated queries
  • Per-user API token issuance and rate limiting
  • Streaming response support (SSE)
  • Cost tracking and usage analytics dashboard
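The request-caching feature above can be sketched as keying on a hash of the request body. The real gateway caches in Redis; a `Map` stands in here so the sketch is self-contained, and the function names are illustrative:

```typescript
import { createHash } from "node:crypto";

// Derive a deterministic cache key from the request body.
function cacheKey(body: unknown): string {
  return "infer:" + createHash("sha256").update(JSON.stringify(body)).digest("hex");
}

// In-memory stand-in for Redis, used only to keep this sketch runnable.
const cache = new Map<string, string>();

async function cachedCompletion(
  body: { model: string; messages: { role: string; content: string }[] },
  callProvider: (b: typeof body) => Promise<string>
): Promise<{ response: string; cached: boolean }> {
  const key = cacheKey(body);
  const hit = cache.get(key);
  if (hit !== undefined) return { response: hit, cached: true };
  const response = await callProvider(body);
  cache.set(key, response);
  return { response, cached: false };
}
```

Note that hashing the serialized body means any change to the model, messages, or sampling parameters produces a distinct key, so only truly identical requests share a cache entry.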

Challenges & Solutions

Challenge:

Normalising response formats across different inference providers

Solution:

Built adapter layers that transform provider-specific responses into a unified OpenAI-compatible schema
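An adapter of this kind can be sketched as a pure mapping function. The HuggingFace payload shape below is a simplified assumption for illustration, not the exact HF API schema, and the OpenAI response type is trimmed to the fields the sketch needs:

```typescript
// OpenAI-style chat completion shape (trimmed for the sketch).
interface OpenAIChatResponse {
  object: "chat.completion";
  model: string;
  choices: {
    index: number;
    message: { role: "assistant"; content: string };
    finish_reason: string;
  }[];
}

// Hypothetical HuggingFace text-generation payload — an assumption
// for illustration, not the exact HF response schema.
interface HFTextGenResponse {
  generated_text: string;
}

// Adapter: provider-specific response in, unified OpenAI-compatible shape out.
function adaptHFResponse(model: string, hf: HFTextGenResponse): OpenAIChatResponse {
  return {
    object: "chat.completion",
    model,
    choices: [
      {
        index: 0,
        message: { role: "assistant", content: hf.generated_text },
        finish_reason: "stop",
      },
    ],
  };
}
```

One adapter per provider keeps the routing core provider-agnostic: callers always see the same schema regardless of which backend served the request.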

Challenge:

Managing cold-start latency on free-tier HuggingFace models

Solution:

Implemented a warm-up scheduler and fallback routing to pre-warmed alternative models
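A warm-up scheduler of this kind can be sketched as a periodic loop that pings each model so free-tier instances stay loaded. `pingModel` is a hypothetical stand-in for a tiny one-token inference call:

```typescript
// Minimal warm-up scheduler sketch. `pingModel` is a stand-in for a
// real lightweight inference call that keeps the model instance warm.
function startWarmupScheduler(
  models: string[],
  pingModel: (model: string) => Promise<void>,
  intervalMs = 5 * 60_000 // every 5 minutes by default
): () => void {
  const timer = setInterval(() => {
    for (const model of models) {
      // Fire-and-forget; a failed ping just means the next
      // real request may hit a cold start instead.
      pingModel(model).catch(() => {});
    }
  }, intervalMs);
  // Return a stop function so the scheduler can be shut down cleanly.
  return () => clearInterval(timer);
}
```

The fallback half of the solution pairs naturally with this: if a request arrives while a model is still cold, the router can redirect it to a pre-warmed alternative instead of blocking.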

Challenge:

Preventing abuse while keeping the API accessible

Solution:

Designed a tiered token system with configurable rate limits, usage quotas, and automatic throttling
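The tiered limits can be sketched as a token bucket per API key, with capacity and refill rate set by tier. Tier names and numbers below are illustrative, not the gateway's actual quotas:

```typescript
// Illustrative tiers — names and limits are assumptions for this sketch.
const TIER_LIMITS: Record<string, { capacity: number; refillPerSec: number }> = {
  free: { capacity: 10, refillPerSec: 0.1 },
  pro: { capacity: 100, refillPerSec: 5 },
};

interface Bucket { tokens: number; lastRefill: number; }
const buckets = new Map<string, Bucket>();

// Token-bucket check: returns true if the request may proceed,
// false if the key is throttled. `now` is injectable for testing.
function allowRequest(apiKey: string, tier: string, now = Date.now()): boolean {
  const limits = TIER_LIMITS[tier];
  if (!limits) return false; // unknown tier: reject
  let b = buckets.get(apiKey);
  if (!b) {
    b = { tokens: limits.capacity, lastRefill: now };
    buckets.set(apiKey, b);
  }
  // Refill proportionally to elapsed time, capped at the tier's capacity.
  const elapsedSec = (now - b.lastRefill) / 1000;
  b.tokens = Math.min(limits.capacity, b.tokens + elapsedSec * limits.refillPerSec);
  b.lastRefill = now;
  if (b.tokens < 1) return false; // throttled
  b.tokens -= 1;
  return true;
}
```

A token bucket gives both properties the solution calls for: bursts up to the tier's capacity stay accessible, while the refill rate enforces a sustained quota and throttles abuse automatically.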

Outcome

Currently serving internal demos and research workloads; a public beta is planned for Q2 2026.