Case Study: Facebook Newsfeed¶

1. Requirements clarifications (Functional & Non-Functional)¶

Functional¶

Users see a feed of posts from friends and followed pages.
Users can post updates (text, image, video).
Real-time or near-real-time delivery of updates.
Ranking/Sorting (most recent vs. top stories).

Non-Functional¶

Low Latency: Feed generation should take < 200ms.
Scalability: Support billions of users and millions of posts per day.
Availability: Feed should be accessible even if some services are down.

2. System interface definition (APIs)¶

getFeed(user_id, count, cursor)
postStatus(user_id, text, media_ids)

3. Back-of-the-envelope estimation (Capacity Estimation)¶

Users: 2B total, 500M DAU.
Feed Content: Each user follows ~500 entities.
Traffic: 500M users * 20 feed refreshes/day $\rightarrow$ 10B refreshes/day.
Write Traffic: 100M posts/day.

4. Defining data model (Database Schema/Model)¶

User DB: Relational (MySQL with Sharding) for profiles/friendships.
Post DB: NoSQL (Cassandra/HBase) for high write throughput.
Feed Store: In-memory (Redis) for pre-generated feeds.
- Key: user_id, Value: List of [post_id, timestamp]

5. High-level design (with Mermaid)¶

graph TD A[Client] --> B[API Gateway] B --> C[Feed Service] B --> D[Post Service] D --> E[Fan-out Service] E --> F[(Redis Feed Cache)] C --> F D --> G[(Post DB)] H[Ranking Service] --> F

6. Detailed design (Deep dive into components)¶

Feed Generation Strategies¶

Pull (Fan-out on Load): Keep posts in a DB. When a user requests their feed, query all friends' posts and merge. Issue: Very slow for users with many friends.
Push (Fan-out on Write): When a post is made, immediately push it to the pre-generated feeds of all followers in Redis. Issue: "Celebrity Problem" (millions of followers).
Hybrid: Use Push for normal users and Pull for celebrities.

Ranking Algorithm¶

Feeds aren't just chronological. Scores are calculated based on:

Affinity: How often the user interacts with the author.
Weight: Type of post (video > photo > text).
Time Decay: Newer posts are more relevant.

Feed Cache Sharding¶

Shard the Redis cluster by user_id to ensure that a single user's feed is always on one node.

7. Identifying and resolving bottlenecks (Scaling/Bottlenecks)¶

Celebrity Fan-out: Handling users with 10M+ followers requires dedicated workers and asynchronous processing.
Cache Eviction: Keeping feeds for 2B users in RAM is expensive. Use LRU eviction for inactive users.
Media Storage: Images and videos require a CDN and optimized blob storage.

Likely Follow-Up Questions¶

How do you ensure feed staleness is acceptable to users?

Feed staleness (delay before new post appears) affects user experience:

Push staleness: With push fan-out, post appears within 100-500ms (network + Redis update latency).
Pull staleness: With pull fan-out, staleness depends on cache TTL (0-5 minutes).
User tolerance: Users tolerate 1-2 second delays; >5 seconds feels broken.
Eventual consistency: For normal posts, show within 100ms. For sensitive content (policy violations), delay review (24hr OK).
Monitoring: Track p50, p99 staleness; alert if >1s consistently.

Strategy: Use push for fast posts; async indexing for search/archival (can be slower).

How do you handle real-time notifications alongside feed updates?

Notifications inform users of likes, comments, and new posts from close friends:

Real-time delivery: Notifications sent via WebSocket or long-polling; appear immediately.
Persistence: Store notifications in database for 30 days; queryable via API if user was offline.
Batching: Don't notify user for every like; batch into "5 people liked your post" after 5 minutes.
Filtering: Don't notify for posts you've already seen in feed; use deduplication.
Importance: Distinguish urgent (messages from friends) vs casual (stranger liked post).

Architecture: Separate notification service listening to post/like/comment events; push via Kafka to notification delivery service.

What happens when ranking service is slow or down?

Ranking is computationally expensive (ML models); failures degrade gracefully:

Fallback: If ranking times out, serve posts in reverse chronological order.
Cache: Pre-compute rankings for top users hourly; serve cached rankings if ranking service down.
Approximation: Use simpler ranking (likes count only) instead of full ML model when under load.
Circuit breaker: If ranking fails >5% of requests, disable ranking; use chronological order.
Monitoring: Track ranking latency; alert if >500ms (target <100ms).

Result: Feed always available, just less personalized during ranking issues.