CASE STUDY
Distributed Recommendation System
Event-driven pipeline with LLM-powered re-ranking at scale.
What It Is
A production-grade backend that processes behavioural signals at scale and serves real-time personalised recommendations: top-N retrieval from PostgreSQL over aggregated per-user scores, followed by LLM re-ranking with per-item explanations. Built to demonstrate event-driven architecture, a Redis cache-aside layer, and a batched LLM refinement step analogous to the re-ranking stage in production RAG-style pipelines (score-based retrieval, not full vector/document retrieval).
What I Built
- Event ingestion API in FastAPI feeding a Kafka consumer that aggregates click, view, and purchase signals into weighted per-user item scores stored in PostgreSQL
- Recommendations API serving top-N personalised results per user, queried from PostgreSQL based on aggregated signal weights
- Redis cache-aside layer for low-latency reads on frequently requested user recommendations (sub-100ms recommendation latency on cache hit)
- LLM re-ranking via Groq (llama-3.3-70b-versatile) that re-orders score-based SQL candidates and generates per-item natural language explanations (“Recommended because…”)
- Full observability stack: Prometheus metrics on all three microservices — request rates, latency histograms, cache hit ratios, and Kafka consumer lag — plus Grafana dashboards with 7 panels
- Designed for cache-invalidation consistency, event-driven data freshness, and score-based retrieval (SQL over aggregated weights), with the batched LLM refinement step standing in for production RAG re-ranking rather than vector or document retrieval
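The signal-weighting step in the ingestion pipeline can be sketched as a simple fold over events. This is a minimal illustration, not the project's actual consumer: the event names and weight values (`view`, `click`, `purchase`; 1.0/2.0/5.0) are assumed for the example, and the real system persists the scores to PostgreSQL rather than returning a dict.

```python
# Hypothetical weights for each behavioural signal type; the case study
# does not publish the real values, these are illustrative only.
EVENT_WEIGHTS = {"view": 1.0, "click": 2.0, "purchase": 5.0}

def aggregate(events):
    """Fold raw behavioural events into weighted per-(user, item) scores.

    Unknown event types contribute nothing, so a schema change upstream
    degrades gracefully instead of crashing the consumer.
    """
    scores = {}
    for e in events:
        key = (e["user_id"], e["item_id"])
        scores[key] = scores.get(key, 0.0) + EVENT_WEIGHTS.get(e["event_type"], 0.0)
    return scores

events = [
    {"user_id": 1, "item_id": "a", "event_type": "view"},
    {"user_id": 1, "item_id": "a", "event_type": "purchase"},
    {"user_id": 1, "item_id": "b", "event_type": "click"},
]
print(aggregate(events))  # {(1, 'a'): 6.0, (1, 'b'): 2.0}
```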
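The Redis cache-aside read path follows the standard pattern: try the cache, fall back to the database on a miss, then populate the cache with a TTL. The sketch below assumes a 300-second TTL and substitutes an in-memory stand-in for Redis so it runs without a server; the `recs:{user_id}` key format and `fetch_from_db` callback are illustrative, not the project's actual identifiers.

```python
import json
import time

CACHE_TTL = 300  # seconds; an assumed value, not the project's setting

class InMemoryCache:
    """Stand-in exposing the Redis get/setex surface used below."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]
        return None
    def setex(self, key, ttl, value):
        self._store[key] = (value, time.time() + ttl)

def get_recommendations(user_id, cache, fetch_from_db):
    """Cache-aside: serve from cache on a hit, else query the DB and
    repopulate the cache so subsequent reads skip PostgreSQL entirely."""
    key = f"recs:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                 # cache hit: fast path
    recs = fetch_from_db(user_id)                 # cache miss: hit PostgreSQL
    cache.setex(key, CACHE_TTL, json.dumps(recs))
    return recs
```

Serving hits straight from memory is what makes the sub-100ms cache-hit latency plausible: the database is only touched once per TTL window per user.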
The Interesting Problem
The LLM re-ranking layer sits after retrieval, not during it. The challenge was making Llama 3.3 re-order candidates semantically without blowing up latency. The solution was batching all top-N candidates into a single structured prompt and parsing the re-ranked order from the response, keeping the LLM call to one per request regardless of N.
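The batching idea above can be sketched as two pure functions: one builds a single numbered prompt for all N candidates, the other maps the model's reply back onto the candidate list. The Groq API call itself is omitted; the prompt wording and the comma-separated reply format are assumptions for illustration, and the parser falls back to the original SQL order for anything unparseable so a malformed LLM response never breaks the request.

```python
def build_rerank_prompt(user_context, candidates):
    """One structured prompt covering all top-N candidates, so the
    request costs exactly one LLM call regardless of N."""
    items = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates, 1))
    return (
        "Re-rank these items for the user described below. "
        "Reply with the item numbers in best-first order, comma-separated.\n"
        f"User: {user_context}\nItems:\n{items}"
    )

def parse_ranking(response, candidates):
    """Map a '3, 1, 2'-style reply back onto the candidates; any indices
    the model omitted or garbled keep their original score-based order."""
    order = []
    for tok in response.replace("\n", ",").split(","):
        tok = tok.strip().rstrip(".")
        if tok.isdigit() and 1 <= int(tok) <= len(candidates):
            idx = int(tok) - 1
            if idx not in order:
                order.append(idx)
    order += [i for i in range(len(candidates)) if i not in order]
    return [candidates[i] for i in order]
```

Keeping the parse defensive matters more than the prompt wording: the SQL candidate order is already a reasonable ranking, so the LLM step can only refine it, never lose items.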
Key Results
- 21 automated tests (unit + integration), 100% pass rate
- 3 microservices with independent scaling
- <100ms recommendation latency on cache hit
- 1000+ concurrent users handled via the async pipeline