
CASE STUDY

Distributed Recommendation System

Event-driven pipeline with LLM-powered re-ranking at scale.

Cached hot path · LLM re-ranking · Event-driven freshness
FastAPI · Apache Kafka · PostgreSQL · Redis · Groq · llama-3.3-70b-versatile · Docker

What It Is

A production-grade backend that processes behavioural signals at scale and serves real-time personalised recommendations: top-N retrieval from SQL over aggregated per-user scores, then LLM re-ranking with per-item explanations. Built to demonstrate event-driven architecture, Redis cache-aside, and a batched LLM refinement step analogous to the re-ranking stage in production RAG-style pipelines (not full vector/document retrieval).
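
The retrieval step is plain SQL over those aggregated scores. A minimal sketch of the query, assuming a user_item_scores(user_id, item_id, score) table maintained by the Kafka consumer (the table, column, and function names here are illustrative, not taken from the project):

```python
# Sketch of the score-based top-N retrieval step.
# Assumed schema: user_item_scores(user_id, item_id, score).
import asyncpg

TOP_N_SQL = """
    SELECT item_id, score
    FROM user_item_scores
    WHERE user_id = $1
    ORDER BY score DESC
    LIMIT $2
"""

async def fetch_candidates(pool: asyncpg.Pool, user_id: str, n: int = 10):
    """Return the top-N items for a user, ranked by aggregated signal weight."""
    async with pool.acquire() as conn:
        rows = await conn.fetch(TOP_N_SQL, user_id, n)
    return [{"item_id": r["item_id"], "score": r["score"]} for r in rows]
```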

What I Built

  • Event ingestion API in FastAPI feeding a Kafka consumer that aggregates click, view, and purchase signals into weighted per-user item scores stored in PostgreSQL
  • Recommendations API serving top-N personalised results per user, queried from PostgreSQL based on aggregated signal weights
  • Redis cache-aside layer for low-latency reads on frequently requested user recommendations (sub-100ms recommendation latency on cache hit; read path sketched after this list)
  • LLM re-ranking via Groq (llama-3.3-70b-versatile) that re-orders score-based SQL candidates and generates per-item natural language explanations (“Recommended because…”)
  • Full observability stack: Prometheus metrics on all three microservices — request rates, latency histograms, cache hit ratios, and Kafka consumer lag — plus Grafana dashboards with 7 panels
  • Designed around cache-invalidation consistency, event-driven data freshness, and score-based retrieval (SQL over aggregated weights), with a batched LLM refinement step analogous to production RAG re-ranking rather than vector or document retrieval
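
The cache-aside read path from the list above could look roughly like this; the key format, TTL, and the fetch_candidates helper (from the retrieval sketch earlier) are assumptions rather than the project's exact code:

```python
# Cache-aside sketch for the recommendations read path:
# try Redis first, fall back to the PostgreSQL top-N query, then populate the cache.
import json
import redis.asyncio as redis

CACHE_TTL_SECONDS = 300  # assumed freshness window

async def get_recommendations(r: redis.Redis, pool, user_id: str, n: int = 10):
    key = f"recs:{user_id}:{n}"
    cached = await r.get(key)
    if cached is not None:                    # cache hit: the sub-100ms path
        return json.loads(cached)

    candidates = await fetch_candidates(pool, user_id, n)   # miss: query PostgreSQL
    await r.set(key, json.dumps(candidates), ex=CACHE_TTL_SECONDS)
    return candidates
```

One way to keep this consistent with event-driven freshness is to have the Kafka consumer delete a user's cached keys whenever it updates that user's scores, so the next read repopulates the cache from PostgreSQL.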

The Interesting Problem

The LLM re-ranking layer sits after retrieval, not during it. The challenge was getting Llama 3.3 to re-order candidates semantically without blowing up latency. The solution was batching all top-N candidates into a single structured prompt and parsing the re-ranked order from the response, keeping the LLM call to one per request regardless of N.
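
A rough sketch of that single batched call, using the Groq chat completions client; the prompt wording, output format, and parsing here are assumptions, not the project's exact implementation:

```python
# One LLM call per request: all top-N candidates go into a single structured
# prompt, and the re-ranked order (plus explanations) is parsed from the reply.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def rerank(user_context: str, candidates: list[dict]) -> list[dict]:
    numbered = "\n".join(
        f"{i}. {c['title']}" for i, c in enumerate(candidates, start=1)
    )
    prompt = (
        f"User context: {user_context}\n"
        f"Candidate items:\n{numbered}\n\n"
        "Re-rank these candidates for this user, best first. Reply with one "
        "line per item formatted as '<original number>: <one-sentence reason>'."
    )
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    ranked = []
    for line in resp.choices[0].message.content.splitlines():
        idx, _, reason = line.partition(":")
        idx = idx.strip()
        if idx.isdigit() and 1 <= int(idx) <= len(candidates):
            item = dict(candidates[int(idx) - 1])
            item["explanation"] = reason.strip()  # "Recommended because…"
            ranked.append(item)
    return ranked
```

Because the candidate list is already small (top-N from SQL), the prompt stays short and the cost is one call per request rather than one call per item.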

Key Results

  • 21 automated tests (unit + integration), 100% pass rate
  • 3 microservices with independent scaling
  • <100ms recommendation latency on cache hit
  • 1000+ concurrent users handled via the async pipeline