Skip to content

// projects

GenAI API Gateway

One hardened front door for every LLM call — cost caps, rate limits, and prompt-injection defense.

A service that sits between your apps and model providers and enforces the controls no single app should reimplement: authentication, per-tenant rate and cost limits, caching, retries with fallback, and input/output safety checks including prompt-injection defense. Apps call one stable API instead of talking to providers directly. It is the deployment-and-safety project that shows you can put an LLM in front of untrusted users.

AdvancedPythonFastAPIRedisDockerOpenAI
01The problem

The moment more than one app calls an LLM, the same problems get copy-pasted everywhere: no shared budget, no consistent rate limiting, secrets scattered across services, and no defense against users who try to override the system prompt or exfiltrate data. Exposing a raw model endpoint to untrusted input is a security and cost liability. The problem is centralizing those cross-cutting concerns into one gateway apps can trust.

02Architecture
  1. 01

    API & auth

    A FastAPI gateway exposes a stable, provider-agnostic endpoint; every request carries a scoped API key mapped to a tenant with its own limits.

  2. 02

    Rate & cost limiting

    Redis-backed token-bucket rate limits and a rolling per-tenant spend counter reject or queue requests before they reach a provider, so no tenant can exhaust the shared budget.

  3. 03

    Prompt-injection defense

    Untrusted input is separated from system instructions and screened against known injection patterns; system prompts are never concatenated with raw user text.

  4. 04

    Routing & fallback

    A router selects the model and provider, retries transient failures with backoff, and fails over to a secondary model when the primary errors or times out.

  5. 05

    Caching

    Redis caches responses for identical requests to cut cost and latency on repeat traffic, scoped per tenant so nothing leaks across keys.

  6. 06

    Output filtering

    Responses pass a safety and PII check before they are returned, and structured outputs are schema-validated so a malformed generation cannot reach the caller.

  7. 07

    Observability

    Every request is traced with tenant, model, tokens, cost, latency, and cache status, exported for dashboards and alerts.

03Key trade-offs

A gateway service over a shared client library

A network hop and an extra service to run buys you central control — one place to rotate keys, change limits, and add defenses — versus a library every app must upgrade in lockstep. Past one consumer, the gateway wins.

Token buckets in Redis over in-process counters

Centralizing limits in Redis makes them correct across many gateway replicas; you take a Redis dependency in exchange for limits that actually hold under horizontal scaling.

Instruction/data separation over a 'do not obey injected commands' prompt

Structurally separating system instructions from user content stops most prompt injection far more reliably than politely asking the model not to comply — defense by construction beats defense by wording.

Fallback to a secondary model over hard failure

Degrading to a backup model keeps the product up during a provider incident, accepting a possible quality dip instead of an outage — a trade you make explicit and monitor.

04How you know it works
  • Load tests that assert rate and cost limits actually cap throughput and spend per tenant under burst traffic.
  • A prompt-injection test corpus (override, exfiltration, and role-swap attempts) with an asserted block-or-neutralize rate.
  • Fault-injection tests that kill the primary provider and verify fallback, retries, and timeouts behave as designed.
  • Cache-correctness tests confirming cached responses stay scoped per tenant and never leak across keys.
05Deployment
  • A stateless Dockerized gateway that scales horizontally behind a load balancer; Redis holds shared limit and cache state.
  • Provider keys live only in the gateway secret store, so app services never hold model credentials.
  • CI runs the injection corpus and limit tests before promotion; a drop in block rate fails the build.
  • Per-tenant cost, latency, error, and cache-hit metrics are exported with alerts on spend and error-rate spikes.
06Interview talking points
  1. 01Why cross-cutting LLM concerns belong in a gateway, and what breaks when each app rolls its own.
  2. 02How you defend against prompt injection by construction, separating instructions from untrusted data.
  3. 03How Redis token buckets keep rate and cost limits correct across horizontally-scaled replicas.
  4. 04Your fallback and retry strategy, and how you keep a provider outage from becoming a product outage.

Video walkthrough

Watch it built, end to end

A full video walkthrough — architecture, trade-offs, evals, and deployment — ships with the AI Engineer Interview & Portfolio Kit at launch (August 2026). There is no fake demo here: join the waitlist and you will get it the day it lands.

Production AI Notes

One practical AI engineering email each week

One concept, one architecture, one project idea, and one interview question — written for developers who want to build and ship real AI systems.

No spam. Unsubscribe anytime.