Lesson 1: The Mindset
The Shift
When you write code, you are building a single room. You care about the furniture (variables), the flow (logic), and the usability (UI).
System Design is city planning.
You stop caring about the furniture in every room. Instead, you care about:
- Traffic flow: Can the roads handle rush hour? (Throughput)
- Utilities: Is there enough water and electricity? (Capacity)
- Disaster recovery: What happens if the power plant explodes? (Reliability)
- Expansion: Can we add a new suburb next year? (Scalability)
Real-World Case Studies
Case Study 1: The Netflix Chaos Monkey (Success)
In August 2008, while Netflix was still mainly a DVD-by-mail business and its streaming service was young, a major database corruption in its own datacenter halted DVD shipments for about three days. The incident exposed a single point of failure: one vertically scaled system that the company ran itself. Over the following years, as streaming took over, Netflix rebuilt its infrastructure.
The System Design Decision: Instead of buying bigger, better datacenters (Vertical Scaling), Netflix chose to:
- Move everything to cloud infrastructure (AWS)
- Adopt a microservices architecture
- Build Chaos Monkey—a tool that randomly kills services to test resilience
The Result:
- Netflix went from 99.9% uptime to 99.99%+ uptime
- They handle millions of concurrent streams globally
- When AWS has an outage in one region, Netflix users don't even notice
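To make the gap between 99.9% and 99.99% uptime concrete, the percentages can be converted into a yearly downtime budget. This is a minimal sketch; `downtime_minutes_per_year` is a hypothetical helper, and the calculation ignores leap years and planned maintenance windows.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for target in (99.9, 99.99):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten: roughly 8.8 hours per year at 99.9% versus under an hour at 99.99%.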
Case Study 2: Healthcare.gov Launch (Failure)
In 2013, the US government launched Healthcare.gov with these requirements:
- Handle millions of users trying to sign up simultaneously
- Integrate with dozens of legacy systems (IRS, insurance companies, etc.)
- 100% data accuracy (no room for errors in health coverage)
The System Design Mistakes:
- No load testing before launch (assumed it would "just work")
- Tightly coupled architecture with no caching layer
- Single database bottleneck (no sharding)
- No graceful degradation (the entire site crashed instead of showing partial results)
The Consequences:
- Site crashed immediately—only 6 people could sign up on day one
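- Total cost was later estimated at about $1.7 billion, and an emergency repair effort took roughly two months to make the site usable for most visitors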
- Public relations disaster and loss of trust
The Fix:
- Added a caching layer (Redis) to handle read-heavy traffic
- Implemented horizontal scaling with auto-scaling groups
- Added circuit breakers to prevent cascading failures
- Built queue-based architecture for background processing
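Of the fixes listed above, the circuit breaker is the easiest to show in a few lines. The sketch below illustrates the general pattern only; it is not the code used on Healthcare.gov. The class name, thresholds, and the `flaky_eligibility_check` function are all hypothetical.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures, then allow a trial call after a cooldown."""

    def __init__(self, max_failures=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock      # injectable so the cooldown can be tested
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            # Circuit is open: skip the dependency until the cooldown has passed.
            if self.clock() - self.opened_at < self.reset_after_seconds:
                return fallback()
            self.opened_at = None  # cooldown over: let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # (re)open the circuit
            return fallback()
        self.failures = 0  # a success closes the circuit fully
        return result


def flaky_eligibility_check():
    # Hypothetical stand-in for a slow legacy dependency.
    raise ConnectionError("legacy system timed out")


breaker = CircuitBreaker(max_failures=2, reset_after_seconds=60)
for attempt in range(4):
    print(breaker.call(flaky_eligibility_check, fallback=lambda: "please try again later"))
```

After the second failure the breaker stops calling the dependency at all, so a struggling downstream system is not hammered by retries while it recovers.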
Case Study 3: Instagram's 2010 Growth Spike
When Instagram launched in 2010, they had:
- 2 servers (one for the app, one for the database)
- About 25,000 users on launch day
- About 1 million users within roughly three months
- About 10 million users within a year
The Challenge: Their architecture couldn't handle the exponential growth. The database was overwhelmed, and image uploads were timing out.
The System Design Solution:
- Database Sharding: Split user data across multiple database servers by user ID
- Content Delivery Network (CDN): Host images on edge servers globally
- Read Replicas: Created multiple read-only copies of the database
- Async Processing: Moved image processing to background queues
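Sharding by user ID, the first item above, comes down to a stable function from user ID to shard. The following is a minimal sketch under assumed values, not Instagram's actual scheme: `SHARD_COUNT` and `shard_for_user` are hypothetical, and a fixed modulo like this makes adding shards later expensive because most keys move.

```python
import hashlib
from collections import Counter

SHARD_COUNT = 4  # hypothetical number of database shards


def shard_for_user(user_id: int, shard_count: int = SHARD_COUNT) -> int:
    """Map a user ID to a shard index using a stable hash."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then wrap into the shard range.
    return int.from_bytes(digest[:8], "big") % shard_count


# Every lookup for the same user lands on the same shard.
distribution = Counter(shard_for_user(user_id) for user_id in range(10_000))
print(sorted(distribution.items()))
```

Hashing the ID instead of using it directly spreads sequential IDs evenly, so a burst of new signups does not pile onto one shard.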
The Numbers:
- Before fix: 95% uptime, 30-second image uploads
- After fix: 99.9% uptime, <1-second image uploads
- They scaled from 2 servers to 500+ servers in 6 months
- Eventually acquired by Facebook for $1 billion
Important
The golden rule: In system design, there are no right answers, only trade-offs.
Functional vs Non-Functional
Every system has two sets of requirements. In an interview (and real life), 90% of your initial grade comes from clarifying these before you draw a single box.
1. Functional Requirements (The "What")
These are the features. If the system doesn't do this, it's useless.
Basic Examples:
- User can post a tweet.
- User can follow others.
- User sees a news feed.
Real-World Examples by Industry:
| Industry | Functional Requirements | Real System Example |
|---|---|---|
| E-commerce | Browse products, add to cart, checkout, track orders | Amazon, Shopify |
| Social Media | Post content, follow users, like/comment, real-time notifications | Twitter, Instagram |
| Streaming | Video playback, quality adjustment, search, watchlist | Netflix, YouTube |
| Banking | Transfer money, view balance, pay bills, transaction history | Chase, Revolut |
| Healthcare | Book appointments, view records, message doctors, prescription management | Teladoc, Epic |
Advanced Examples from Production Systems:
Uber's Real-Time Requirements:
- Driver tracking updates every 4 seconds
- Passenger requests must be matched to drivers in under 5 seconds
- Surge pricing calculated dynamically based on real-time supply/demand
- Payment processing in under 2 seconds after ride completion
Spotify's Music Streaming Requirements:
- <200ms latency for track start (no buffering delay)
- Support offline playback with 10,000+ songs cached
- Real-time collaborative playlists with <500ms sync
- Personalized recommendations with 50M+ tracks in catalog
Airbnb's Booking Requirements:
- Handle concurrent booking attempts for the same property so that only one succeeds (prevent double-booking)
- 24-hour hold on bookings before payment
- Real-time availability sync across 190+ countries
- Instant book feature (no host approval required)
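The double-booking requirement is a useful one to sketch, because it shows how a functional requirement forces a concurrency decision. The example below is an in-memory illustration only: `PropertyCalendar` is a hypothetical class, and a real booking system would enforce the same check inside a database transaction or constraint, not a process-local lock.

```python
import threading
from datetime import date


class PropertyCalendar:
    """Accepts a booking only if it does not overlap an existing one."""

    def __init__(self):
        self._booked = []              # list of (check_in, check_out) pairs
        self._lock = threading.Lock()  # makes check-then-insert atomic

    def try_book(self, check_in: date, check_out: date) -> bool:
        if check_in >= check_out:
            raise ValueError("check_out must be after check_in")
        with self._lock:
            for start, end in self._booked:
                # Half-open intervals: a checkout day can be the next check-in day.
                if check_in < end and start < check_out:
                    return False
            self._booked.append((check_in, check_out))
            return True


calendar = PropertyCalendar()
print(calendar.try_book(date(2025, 1, 1), date(2025, 1, 5)))  # first request wins
print(calendar.try_book(date(2025, 1, 4), date(2025, 1, 6)))  # overlaps, rejected
```

The key design choice is that the availability check and the insert happen as one atomic step; checking first and writing later leaves a window where two guests both see the dates as free.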
2. Non-Functional Requirements (The "How")
These are the constraints. If the system doesn't meet these, it will crash, fail, or be too slow to use.
Basic Examples:
- Scalability: Must handle 100M daily active users.
- Latency: Feed must load in < 200ms.
- Consistency: A tweet must appear on followers' feeds within 5 seconds.
Real-World Production Requirements:
| System | Availability | Latency | Throughput | Data Size |
|---|---|---|---|---|
| Google Search | 99.99% | <0.5 seconds | 63,000 queries/second | 100+ petabytes |
| Netflix Streaming | 99.99% | <2 seconds (start) | 100M+ concurrent streams | 1+ petabytes/day |
| (unnamed messaging service) | 99.9% | <100ms (message delivery) | 65B+ messages/day | 4+ petabytes/year |
| Twitter (X) | 99.9% | <200ms (timeline) | 500M+ tweets/day | 500+ petabytes |
| AWS S3 | 99.99% (designed for 11 nines of durability, 99.999999999%) | <100ms (GET) | 20M+ requests/second | 100+ exabytes |
Industry-Specific Requirements:
Finance (Banking/Trading):
- Strong Consistency: Account balances must be 100% accurate (no eventual consistency)
- Auditability: Every transaction must be logged and traceable
- Compliance: GDPR, PCI-DSS, SOX compliance required
- Low Latency: Trading decisions in microseconds for high-frequency trading
Healthcare:
- HIPAA Compliance: All data encrypted at rest and in transit
- High Availability: Patient data must be accessible 24/7
- Privacy: Strict access controls and audit logs
- Disaster Recovery: RPO (Recovery Point Objective) < 1 hour, RTO (Recovery Time Objective) < 4 hours
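The RPO and RTO targets above translate directly into a check on a backup plan. This is a simplified sketch: it assumes worst-case data loss equals the interval between backups, and `meets_recovery_objectives` and its parameters are hypothetical names.

```python
from datetime import timedelta


def meets_recovery_objectives(
    backup_interval: timedelta,
    estimated_restore_time: timedelta,
    rpo: timedelta = timedelta(hours=1),
    rto: timedelta = timedelta(hours=4),
) -> dict:
    """Compare a backup plan against RPO (data loss) and RTO (time to restore)."""
    # If the system fails just before the next backup, everything written
    # since the previous backup is lost.
    worst_case_data_loss = backup_interval
    return {
        "rpo_met": worst_case_data_loss < rpo,
        "rto_met": estimated_restore_time < rto,
    }


print(meets_recovery_objectives(timedelta(minutes=15), timedelta(hours=2)))
print(meets_recovery_objectives(timedelta(hours=6), timedelta(hours=5)))
```

A nightly backup can never satisfy a one-hour RPO, however fast the restore is; the two objectives constrain different parts of the design.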
Gaming:
- Real-Time: <50ms latency for multiplayer gaming
- High Throughput: Handle millions of concurrent players
- Scalable: Auto-scale for game launches and events
- Anti-Cheat: Prevent cheating and hacking
IoT (Internet of Things):
- High Ingest Rate: Handle millions of devices sending data simultaneously
- Edge Computing: Process data locally to reduce bandwidth
- Low Power: Devices operate on battery for years
- Intermittent Connectivity: Work with unstable network connections
graph TD
    A[Requirements] --> B[Functional]
    A --> C[Non-Functional]
    B --> B1[Features]
    B --> B2[APIs]
    B --> B3[User Flows]
    C --> C1[Scalability]
    C --> C2[Reliability]
    C --> C3[Latency]
    C --> C4[Cost]
    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#bfb,stroke:#333
The "It Depends" Game
Junior engineers search for the "best" database. Senior engineers ask "what are we optimizing for?"
| You Optimize For | You Might Sacrifice | Example |
|---|---|---|
| Consistency | Availability | Banking (Balances must be correct, even if system goes down briefly) |
| Availability | Consistency | Social Media (Better to show old likes than an error page) |
| Write Speed | Read Speed | Logging (Write fast, read rarely) |
| Development Speed | Performance | Startups (Ship Python/Ruby MVP fast, rewrite later) |
Practical Trade-Off Scenarios
Scenario 1: Building a Real-Time Chat App
Context: You're building a chat app like Slack or Discord. Users expect messages to appear instantly.
The Trade-Off Decisions:
| Decision | Option A | Option B | What You Choose & Why |
|---|---|---|---|
| Message Storage | Relational DB (PostgreSQL) | NoSQL (Cassandra) | NoSQL - High write throughput, eventual consistency acceptable |
| Real-time Updates | Polling (client asks server every 5s) | WebSockets (server pushes updates) | WebSockets - Lower latency, less server load |
| Message History | Keep forever | 90-day retention | 90-day retention - Reduce storage costs, most users don't need old messages |
| Online Status | Check on every message | Heartbeat every 30s | Heartbeat - Scale better, less database load |
Performance Impact:
- Polling approach: 100K users × 1 request/5s = 20,000 requests/second just for checking messages
- WebSocket approach: 100K users × 1 heartbeat/30s ≈ 3,300 small heartbeat messages/second, with chat messages pushed only when they exist
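The load comparison is simple division, so it is worth checking by hand or in code. This back-of-the-envelope sketch assumes 100K connected users, the 5-second polling interval, and the 30-second heartbeat from the table above; `requests_per_second` is a hypothetical helper.

```python
def requests_per_second(clients: int, interval_seconds: float) -> float:
    """Steady-state request rate when every client sends one request per interval."""
    return clients / interval_seconds


USERS = 100_000

polling_rps = requests_per_second(USERS, interval_seconds=5)
heartbeat_rps = requests_per_second(USERS, interval_seconds=30)

print(f"Polling every 5s:    {polling_rps:,.0f} requests/second")
print(f"Heartbeat every 30s: {heartbeat_rps:,.0f} messages/second")
print(f"Reduction factor:    {polling_rps / heartbeat_rps:.0f}x")
```

The larger saving is not captured by this count: each polling request is a full HTTP round trip that usually returns nothing, while a heartbeat is a tiny frame on a connection that is already open.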
Scenario 2: Building an E-Commerce Platform
Context: You're building Amazon-scale e-commerce. Need to handle Black Friday traffic spikes.
The Architecture Trade-Offs:
import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce Platform" {
  description "High-volume retail platform with trade-off decisions documented"

  // TRADE-OFF 1: Read-heavy vs Write-heavy
  ProductDB = database "Product Catalog Database" {
    technology "PostgreSQL with Read Replicas"
    description "Optimized for READ operations (99% of traffic is reads)"
    tradeoff {
      decision "Use read replicas for product browsing"
      sacrifice "Write latency (updates take longer to propagate)"
      reason "Users browse products 100x more than they add products"
      metric "Read:Write ratio = 100:1"
    }
  }

  // TRADE-OFF 2: Strong vs Eventual Consistency
  CartService = container "Shopping Cart Service" {
    technology "Redis"
    description "In-memory cache for cart state"
    tradeoff {
      decision "Use Redis (in-memory) for cart storage"
      sacrifice "Durability (cart data lost if Redis crashes)"
      reason "Cart data is temporary and can be recreated from product catalog"
      mitigation "Periodic snapshots to persistent storage"
    }
  }

  // TRADE-OFF 3: Cost vs Performance
  SearchEngine = container "Product Search" {
    technology "Elasticsearch"
    description "Full-text search with caching layer"
    tradeoff {
      decision "Use expensive Elasticsearch cluster"
      sacrifice "Infrastructure cost ($5K/month)"
      reason "Search performance directly impacts conversion rates (1% latency = 1% revenue loss)"
      metric "Search latency <200ms required for optimal UX"
    }
  }

  // TRADE-OFF 4: Availability vs Consistency
  OrderService = container "Order Processing" {
    technology "Kafka + Microservices"
    description "Async order processing pipeline"
    tradeoff {
      decision "Use async messaging (eventual consistency)"
      sacrifice "Real-time inventory accuracy"
      reason "Better availability and scalability during peak traffic"
      mitigation "Compensating transactions to handle over-selling"
    }
  }
}
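Trade-off 1 in the model above (read replicas for a read-heavy catalog) can be illustrated with a small routing sketch. This is a toy example under stated assumptions: `ReplicaRouter` and the connection names are hypothetical, and real routers must also handle replication lag, transactions, and replica failures.

```python
import itertools


class ReplicaRouter:
    """Send reads to replicas in round-robin order and all writes to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # cycle() repeats the replica list forever; None means "no replicas".
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str):
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = ReplicaRouter("primary", ["replica-1", "replica-2"])
for statement in (
    "SELECT * FROM products WHERE id = 7",
    "SELECT * FROM products WHERE category = 'books'",
    "UPDATE products SET price = 12 WHERE id = 7",
):
    print(f"{router.route(statement):<10} <- {statement}")
```

The sacrifice named in the model shows up here as well: a read routed to a replica immediately after the `UPDATE` may still return the old price until replication catches up.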
Scenario 3: CAP Theorem in Practice
Real-World Example: Netflix vs. PayPal
Netflix (Choose Availability):
- If a user can't watch a video, they might cancel subscription
- Trade-off: Occasionally show stale content recommendations
- Architecture: AP system (Available, Partition-tolerant, Eventually consistent)
- Data: Video recommendations, watch history, user preferences
PayPal (Choose Consistency):
- If a transaction is processed incorrectly, lawsuits happen
- Trade-off: Brief service interruptions during network partitions
- Architecture: CP system (Consistent, Partition-tolerant, Limited availability)
- Data: Account balances, transaction records, payment processing
The Decision Matrix:
Ask yourself:
1. What happens if the data is wrong?
→ If lawsuits/financial loss → Prioritize Consistency (PayPal model)
→ If just bad UX → Prioritize Availability (Netflix model)
2. What's the tolerance for downtime?
→ Zero tolerance → Prioritize Availability (Instagram for celebrity photos)
→ Some tolerance OK → Prioritize Consistency (Banking)
3. Can you design around the trade-off?
→ Yes: Use hybrid approach (read-optimized cache + write-optimized DB)
→ No: Pick one and accept the consequences
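The three questions can be written down as a small function, which makes the conflicting cases explicit. This is only a sketch of the matrix as stated above, not a general rule; `recommend_priority` and its parameter names are hypothetical.

```python
def recommend_priority(
    wrong_data_is_costly: bool,
    downtime_tolerated: bool,
    can_design_around: bool,
) -> str:
    """Apply the three decision-matrix questions in order."""
    votes = set()
    # Question 1: lawsuits or financial loss points to consistency.
    votes.add("consistency" if wrong_data_is_costly else "availability")
    # Question 2: zero tolerance for downtime points to availability.
    votes.add("consistency" if downtime_tolerated else "availability")
    if len(votes) == 1:
        return votes.pop()
    # Question 3: the first two answers disagree.
    return "hybrid" if can_design_around else "pick one and accept the consequences"


print(recommend_priority(True, True, False))    # payment-style system
print(recommend_priority(False, False, False))  # streaming-style system
print(recommend_priority(True, False, True))    # answers conflict, design around it
```

Most real systems land in the conflicting case, which is why the third question matters: different data within one product often gets different answers.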
Sruja Integration
In Sruja, we treat requirements as code. This keeps your constraints right next to your architecture.
Why Kinds and Types Matter
In Sruja, you declare kinds to establish the vocabulary of your architecture. This isn't just syntax—it provides real benefits:
- Early Validation: If you typo an element type (e.g., `sytem` instead of `system`), Sruja catches it immediately.
- Better Tooling: IDEs can provide autocomplete and validation based on your declared kinds.
- Self-Documentation: Anyone reading your model knows exactly which element types are available.
- Custom Vocabulary: You can define your own kinds (e.g., `microservice = kind "Microservice"`) to match your domain.
- Flat and Clean: With Sruja's flat syntax, these declarations live at the top of your file, with no `specification` wrapper block required.
Example: Requirements-Driven Architecture
import { * } from 'sruja.ai/stdlib'

// 1. Defining the "What" (Functional)
requirement R1 functional "Users can post short text messages (tweets)"

// 2. Defining the "How" (Non-Functional)
requirement R2 performance "500ms p95 latency for reading timeline"
requirement R3 scale "Store 5 years of tweets (approx 1PB)"
requirement R4 availability "99.9% uptime SLA"

// 3. The Architecture follows the requirements
Twitter = system "The Platform" {
  description "Satisfies R1, R2, R3, R4"

  TimelineAPI = container "Timeline API" {
    technology "Rust"
    description "Satisfies R2 - optimized for low latency"
    slo {
      latency {
        p95 "500ms"
        window "7 days"
      }
      availability {
        target "99.9%"
        window "30 days"
      }
    }
  }

  TweetDB = database "Tweet Storage" {
    technology "Cassandra"
    description "Satisfies R3 - distributed storage for 1PB scale"
  }

  TimelineAPI -> TweetDB "Reads/Writes"
}

// 4. Document the decision
ADR001 = adr "Use Cassandra for tweet storage" {
  status "Accepted"
  context "Need to store 1PB of tweets with high write throughput"
  decision "Use Cassandra for distributed, scalable storage"
  consequences "Excellent scalability, eventual consistency trade-off"
}

view index {
  title "Twitter Platform Overview"
  include *
}

// Performance-focused view
view performance {
  title "Performance View"
  include Twitter.TimelineAPI Twitter.TweetDB
}
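A requirement such as R2 only helps if it can be checked against measurements. The sketch below shows one way a p95 target might be evaluated, using the nearest-rank percentile method. It is independent of Sruja: the function names and the sample latencies are hypothetical.

```python
import math

P95_TARGET_MS = 500  # from requirement R2 in the model above


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(samples_ms, target_ms=P95_TARGET_MS):
    """True when the 95th percentile latency is within the target."""
    return percentile(samples_ms, 95) <= target_ms


# Hypothetical timeline read latencies in milliseconds.
observed = [120, 135, 150, 180, 210, 240, 260, 300, 320, 350,
            380, 400, 420, 440, 450, 460, 470, 480, 490, 900]
print(f"p95 = {percentile(observed, 95)} ms, SLO met: {meets_latency_slo(observed)}")
```

With these samples the single 900 ms outlier does not break the target, because p95 deliberately ignores the slowest 5% of requests. A p99 or maximum-latency requirement would fail on the same data.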
Knowledge Check
Q: My boss says "We need to handle infinite users". How do you respond?
Bad Answer: "Okay, I'll use Kubernetes and sharding."
Senior Answer: "Infinite is expensive. Do we expect 1k users or 100M users? The design for 1k costs $50/mo. The design for 100M costs $50k/mo. Let's define a realistic target for the next 12 months."
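The gap between 1k and 100M users becomes obvious once it is converted into requests per second. This is a rough estimate only: the 50 requests per user per day and the peak factor of 3 are assumptions chosen for illustration, and `estimate_peak_rps` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def estimate_peak_rps(daily_active_users: int,
                      requests_per_user_per_day: float,
                      peak_factor: float = 3.0) -> float:
    """Average requests/second over a day, scaled up for the busiest period."""
    average_rps = daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY
    return average_rps * peak_factor


for users in (1_000, 100_000_000):
    rps = estimate_peak_rps(users, requests_per_user_per_day=50)
    print(f"{users:>11,} users -> about {rps:,.1f} requests/second at peak")
```

At 1k users a single small server is mostly idle; at 100M users the same product needs load balancing, sharding, and caching, which is where the cost difference in the answer above comes from.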
Q: Why not just use the fastest database for everything?
Because "fastest" depends on the workload. A database fast at writing (Cassandra) might be complex to manage. A database fast at relationships (Neo4j) might scale poorly for heavy writes. Trade-offs.
Quiz: Test Your Knowledge
Ready to apply what you've learned? Take the interactive quiz for this lesson!
1. In system design, what do we call requirements that describe the features and functionality of a system (what it should do)?
Answer: Functional
Alternative answers:
- functional requirements
Explanation: Functional requirements define the features and capabilities of the system. Examples: "User can post a tweet," "User can browse products."
2. In system design, what do we call requirements that describe how the system should perform (constraints like speed, scalability, reliability)?
Answer: Non-functional
Alternative answers:
- non-functional
- non functional
- NFR
- NFRs
Explanation: Non-functional requirements define the quality attributes and constraints of the system. Examples: "Must handle 100M users," "Response time <200ms."
3. A banking system must ensure that account balances are always accurate and transactions cannot be lost. Which trade-off would you prioritize?
- a) Prioritize availability over consistency (it's better to show wrong data than no data)
- b) Prioritize development speed over performance (ship an MVP first)
- c) Prioritize write speed over read speed (logging-focused optimization)
- d) Prioritize consistency over availability (brief downtime is acceptable, but data must be correct)
4. You're building a real-time chat application like Discord. Users expect messages to appear instantly across all devices. What's the best architecture approach?
- a) Use relational database with strong consistency (PostgreSQL) for all message storage
- b) Use HTTP polling where clients check for new messages every 5 seconds
- c) Use eventual consistency with 24-hour delay synchronization
- d) Use WebSockets for real-time push with eventual consistency for message storage
5. Netflix experienced a major outage in 2008 when their single datacenter failed. What was their system design solution?
- a) Bought a bigger, more expensive datacenter with better hardware (Vertical scaling)
- b) Hired more operations engineers to manually failover systems
- c) Built a single, massive monolithic application on dedicated servers
- d) Moved to cloud infrastructure with microservices and built Chaos Monkey to test resilience
6. Healthcare.gov's initial launch in 2013 was a disaster. Which of these was NOT one of their system design mistakes?
- a) No load testing before launch
- b) Tightly coupled architecture with no caching layer
- c) Single database bottleneck with no sharding
- d) Using cloud infrastructure instead of dedicated on-premise servers
7. You're building a product search engine for an e-commerce site handling 10M products. The search feature generates 99% of traffic. What optimization should you prioritize?
- a) Optimize for write speed (since products are added frequently)
- b) Use a single-node relational database for simplicity
- c) Disable caching to ensure always-fresh search results
- d) Use read replicas and a specialized search engine like Elasticsearch
8. Instagram launched with 2 servers and grew to roughly 10M users within a year. What was their key architectural change to handle this growth?
- a) Rewrote the entire application in a different programming language
- b) Bought the biggest available server (vertical scaling)
- c) Removed all features to reduce complexity
- d) Implemented database sharding, CDN for images, and async processing for image handling
9. Which of the following statements best describes the relationship between latency and throughput?
- a) Low latency always means high throughput
- b) High latency always means high throughput
- c) Latency and throughput are the same thing
- d) A system can have low latency but low throughput, or high latency but high throughput
10. Your boss says "We need to handle infinite users." What's the most appropriate response?
- a) Great! I'll immediately implement Kubernetes and distributed sharding
- b) Impossible! Let's cap users at 1,000 and reject anyone else
- c) Let's build the system assuming unlimited resources regardless of cost
- d) Infinite is expensive. Let's define realistic targets for the next 12 months (e.g., 100K users) and design for that
11. What is the term for the system design principle that means every decision involves sacrificing one quality to gain another (e.g., choosing consistency means sacrificing availability)?
Answer: Trade-off
Alternative answers:
- trade-off
- tradeoff
Explanation: Trade-offs are fundamental to system design. There are no perfect solutions; every architecture choice has benefits and costs. "It depends" is the correct answer because the right choice depends on what you are optimizing for.
This quiz covers:
- Functional vs Non-functional requirements
- Real-world case studies (Netflix, Healthcare.gov, Instagram)
- Trade-off decisions in system design
- Practical scenarios and decision-making
Next Steps
Now that we have the mindset, let's learn the language. 👉 Lesson 2: The Vocabulary of Scale