Top 'K' Frequent Items (Heavy Hitters)¶

1. Requirements¶

Functional¶

Track the most frequent items in a real-time stream (e.g., top trending hashtags on Twitter, most played songs on Spotify).
Provide the top 'K' items for a specific time window (e.g., last 1 minute, 1 hour, 1 day).

Non-Functional¶

Scalability: Must handle millions of events per second.
Low Latency: Retrieval of Top 'K' should be near-instant.
Accuracy vs. Efficiency: Approximate counts are often acceptable in exchange for massive space savings (e.g., using Count-Min Sketch).

2. Capacity Estimation¶

Traffic: 10 million events per second.
Storage: If tracking 1 million unique items, a simple hash map would take ~100MB. However, for billion-scale items, we need probabilistic data structures.
Bandwidth: 10M events/sec * 100 bytes/event $\approx$ 1GB/s.

3. APIs¶

GET /v1/top-k?k=10&window=1h

Response: {"items": [{"id": "item1", "count": 5000}, ...], "window": "1h"}

4. DB Design¶

In-Memory: Redis (Sorted Sets) for real-time storage.
Persistent: Cassandra or DynamoDB for historical data.
Schema: (item_id, count, window_timestamp).

5. HLD with Mermaid¶

graph TD A[Client Events] --> B[API Gateway] B --> C[Kafka/Message Queue] C --> D[Stream Processor - Flink/Spark] D --> E[(Redis - Top K Cache)] D --> F[(Database - Historical)] G[Query Service] --> E

6. Detailed Design¶

Count-Min Sketch¶

To handle massive cardinality, we use a Count-Min Sketch (probabilistic data structure).

Uses a 2D array of counters and multiple hash functions.
Provides an upper bound on frequency with fixed memory.

Heavy Keepers / Space Saving Algorithm¶

Used to track the actual IDs of the top items, as Count-Min Sketch only tracks counts.

Data Aggregation¶

Log Aggregator: Collects logs from app servers.
Stream Processing: Groups items and updates counts in sliding windows.
Merging Results: Distributed nodes track local top-K; a central aggregator merges them to find the global top-K.

7. Bottlenecks¶

Hot Keys: A single very popular item can overwhelm a partition. Solution: Use sub-partitioning or local aggregation before sending to the global counter.
Accuracy: For small 'K', error rates must be strictly monitored.
Memory: Keeping many time windows in memory can be expensive.