
CASE STUDY

Distributed Recommendation System

Event-driven pipeline with LLM-powered re-ranking at scale.

Cached hot path · LLM re-ranking · Event-driven freshness
FastAPI · Apache Kafka · PostgreSQL · Redis · Groq · llama-3.3-70b-versatile · Docker

What It Is

A production-grade backend that processes behavioural signals at scale and serves real-time personalised recommendations: top-N retrieval from SQL over aggregated per-user scores, then LLM re-ranking with per-item explanations. Built to demonstrate event-driven architecture, Redis cache-aside, and a batched LLM refinement step analogous to the re-ranking stage in production RAG-style pipelines (not full vector/document retrieval).
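
The retrieval step is plain SQL over those aggregated scores. A minimal sketch of the query, assuming a user_item_scores(user_id, item_id, score) table maintained by the Kafka consumer (the table, column, and function names here are illustrative, not taken from the project):

```python
# Sketch of the score-based top-N retrieval step.
# Assumed schema: user_item_scores(user_id, item_id, score).
import asyncpg

TOP_N_SQL = """
    SELECT item_id, score
    FROM user_item_scores
    WHERE user_id = $1
    ORDER BY score DESC
    LIMIT $2
"""

async def fetch_candidates(pool: asyncpg.Pool, user_id: str, n: int = 10):
    """Return the top-N items for a user, ranked by aggregated signal weight."""
    async with pool.acquire() as conn:
        rows = await conn.fetch(TOP_N_SQL, user_id, n)
    return [{"item_id": r["item_id"], "score": r["score"]} for r in rows]
```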

What I Built

  • Event ingestion API in FastAPI feeding a Kafka consumer that aggregates click, view, and purchase signals into weighted per-user item scores stored in PostgreSQL
  • Recommendations API serving top-N personalised results per user, queried from PostgreSQL based on aggregated signal weights
  • Redis cache-aside layer for low-latency reads on frequently requested user recommendations (sub-100ms recommendation latency on cache hit; read path sketched after this list)
  • LLM re-ranking via Groq (llama-3.3-70b-versatile) that re-orders score-based SQL candidates and generates per-item natural language explanations (“Recommended because…”)
  • Full observability stack: Prometheus metrics on all three microservices — request rates, latency histograms, cache hit ratios, and Kafka consumer lag — plus Grafana dashboards with 7 panels
  • Designed around cache-invalidation consistency, event-driven data freshness, and score-based retrieval (SQL over aggregated weights), with a batched LLM refinement step analogous to production RAG re-ranking rather than vector or document retrieval
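
The cache-aside read path from the list above could look roughly like this; the key format, TTL, and the fetch_candidates helper (from the retrieval sketch earlier) are assumptions rather than the project's exact code:

```python
# Cache-aside sketch for the recommendations read path:
# try Redis first, fall back to the PostgreSQL top-N query, then populate the cache.
import json
import redis.asyncio as redis

CACHE_TTL_SECONDS = 300  # assumed freshness window

async def get_recommendations(r: redis.Redis, pool, user_id: str, n: int = 10):
    key = f"recs:{user_id}:{n}"
    cached = await r.get(key)
    if cached is not None:                    # cache hit: the sub-100ms path
        return json.loads(cached)

    candidates = await fetch_candidates(pool, user_id, n)   # miss: query PostgreSQL
    await r.set(key, json.dumps(candidates), ex=CACHE_TTL_SECONDS)
    return candidates
```

One way to keep this consistent with event-driven freshness is to have the Kafka consumer delete a user's cached keys whenever it updates that user's scores, so the next read repopulates the cache from PostgreSQL.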

The Interesting Problem

The LLM re-ranking layer sits after retrieval, not during it. The challenge was getting Llama 3.3 to re-order candidates semantically without blowing up latency. The solution was batching all top-N candidates into a single structured prompt and parsing the re-ranked order from the response, keeping the LLM call to one per request regardless of N.
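
A rough sketch of that single batched call, using the Groq chat completions client; the prompt wording, output format, and parsing here are assumptions, not the project's exact implementation:

```python
# One LLM call per request: all top-N candidates go into a single structured
# prompt, and the re-ranked order (plus explanations) is parsed from the reply.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def rerank(user_context: str, candidates: list[dict]) -> list[dict]:
    numbered = "\n".join(
        f"{i}. {c['title']}" for i, c in enumerate(candidates, start=1)
    )
    prompt = (
        f"User context: {user_context}\n"
        f"Candidate items:\n{numbered}\n\n"
        "Re-rank these candidates for this user, best first. Reply with one "
        "line per item formatted as '<original number>: <one-sentence reason>'."
    )
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    ranked = []
    for line in resp.choices[0].message.content.splitlines():
        idx, _, reason = line.partition(":")
        idx = idx.strip()
        if idx.isdigit() and 1 <= int(idx) <= len(candidates):
            item = dict(candidates[int(idx) - 1])
            item["explanation"] = reason.strip()  # "Recommended because…"
            ranked.append(item)
    return ranked
```

Because the candidate list is already small (top-N from SQL), the prompt stays short and the cost is one call per request rather than one call per item.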

Key Results

  • 21 automated tests (unit + integration), 100% pass rate
  • 3 microservices with independent scaling
  • <100ms recommendation latency on cache hit
  • 1000+ concurrent users handled via the async pipeline