
Lesson 2: Databases

SQL vs. NoSQL: The Fundamental Choice

Choosing between SQL and NoSQL is one of the most important architectural decisions you'll make. Here's how to think about it.

SQL (Relational Databases)

What it does: Stores structured data in tables with predefined schemas using the relational model.

Characteristics:

  • Structured data - Tables, rows, columns with defined types
  • ACID compliance - Strong guarantees for Atomicity, Consistency, Isolation, Durability
  • Powerful querying - Complex joins, aggregations, transactions
  • Data integrity - Foreign keys, constraints, referential integrity
  • Vertical scaling - Limited to single server (mostly)
  • Schema changes - ALTER TABLE can be slow/complex

Examples: PostgreSQL, MySQL, Oracle, SQL Server

Best for:

  • Financial transactions (banking, payments)
  • Complex queries with joins and aggregations
  • Strong consistency requirements
  • Data relationships (e.g., user orders and order items)
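The ACID point above is easy to see in code. Below is a minimal sketch using Python's built-in sqlite3 module (the table, account names, and balances are invented for illustration): a money transfer either fully commits or fully rolls back — there is never a state where money left one account but didn't arrive in the other.

```python
import sqlite3

# In-memory database; hypothetical accounts table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` between accounts atomically: both updates commit, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        return True
    except sqlite3.IntegrityError:  # CHECK constraint fires if src would go negative
        return False

transfer(conn, "alice", "bob", 30)   # succeeds: balances become 70 / 80
transfer(conn, "alice", "bob", 500)  # fails: rolled back, balances unchanged
```

The `with conn:` block is what a database transaction gives you for free; without it, a crash between the two UPDATEs would leave the books unbalanced.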

NoSQL (Non-Relational Databases)

What it does: Stores data using flexible models without rigid schemas, designed for horizontal scaling.

Characteristics:

  • Flexible schemas - Add fields on the fly
  • Horizontal scaling - Distribute data across multiple servers
  • High performance - Optimized for specific access patterns
  • Multiple models - Document, Key-Value, Graph, Column-family
  • Weaker consistency - Often eventual consistency
  • Limited querying - Complex joins harder or impossible

Examples:

  • Document: MongoDB, CouchDB
  • Key-Value: Redis, DynamoDB
  • Column-family: Cassandra, HBase
  • Graph: Neo4j, JanusGraph

Best for:

  • Rapidly evolving data structures
  • Massive scale (PBs of data, millions of users)
  • High throughput write workloads
  • Hierarchical or graph data
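As a rough sketch of what "flexible schema" means in practice, here is a toy in-memory document collection in Python. The `insert`/`find` helpers are invented for illustration (not a real MongoDB API): documents are plain dicts, so a new field needs no migration.

```python
# Minimal in-memory "document collection": documents are plain dicts,
# so new fields can be added without a schema migration (as in MongoDB/CouchDB).
collection = []

def insert(doc):
    collection.append(doc)

def find(**filters):
    """Return documents matching every key/value filter (hypothetical helper)."""
    return [d for d in collection if all(d.get(k) == v for k, v in filters.items())]

insert({"user": "alice", "plan": "free"})
# A brand-new field on the next document — no ALTER TABLE, no downtime:
insert({"user": "bob", "plan": "pro", "beta_features": ["dark_mode"]})

find(plan="pro")  # → [{'user': 'bob', 'plan': 'pro', 'beta_features': ['dark_mode']}]
```

The flip side, as noted above, is that nothing stops two documents from spelling the same field differently — the integrity checks a SQL schema enforces now live in your application code.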

Decision Matrix: SQL vs NoSQL

| Factor | Choose SQL When... | Choose NoSQL When... |
|---|---|---|
| Data structure | Fixed, well-defined schema | Flexible, evolving schema |
| Consistency | Strong consistency required | Eventual consistency OK |
| Scale | Moderate (up to TBs) | Massive (TBs to PBs) |
| Complexity | Complex queries, joins | Simple access patterns |
| Transactions | ACID transactions needed | Eventual consistency OK |
| Speed | Complex queries take time | Fast for specific patterns |

Scaling Databases

Replication: Copying Data

Replication copies data across multiple servers for redundancy and performance.

Master-Slave (Primary-Replica)

How it works: All writes go to the master. Reads can go to slaves (read replicas).

Write → Master DB (Primary)
         ↓ (Replication)
         → Slave 1 (Read Replica)
         → Slave 2 (Read Replica)
         → Slave 3 (Read Replica)

Read → Can go to any slave

Pros:

  • Read scalability - Distribute read load across replicas
  • Data redundancy - Multiple copies for backup/failover
  • Improved read latency - Read from geographically closer replica
  • Simplifies backups - Backup from replicas, not production

Cons:

  • Single write bottleneck - All writes go through master
  • Replication lag - Slaves might be slightly behind master
  • Complex failover - Promoting a replica to master needs care
  • Write scaling - Cannot scale writes (still single master)

Real-world use: Read-heavy systems (e.g., product catalog, user profiles, content delivery)

Performance impact:

  • Read throughput: 10x improvement (1 master → 10 replicas)
  • Write throughput: No improvement (still limited by master)
  • Replication lag: Typically 10-100ms for local replicas
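The routing rule above can be sketched with in-memory dicts standing in for real servers (all names are illustrative): writes go to the primary, reads round-robin across replicas. Real replication is asynchronous, which is where the lag figures above come from — this sketch replicates synchronously for simplicity.

```python
import itertools

class ReplicatedStore:
    """Sketch of primary/replica routing: writes hit the single primary,
    reads are spread round-robin across read replicas."""

    def __init__(self, num_replicas=3):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]
        self._rr = itertools.cycle(range(num_replicas))

    def write(self, key, value):
        self.primary[key] = value
        self._replicate(key, value)  # real systems do this asynchronously (replication lag)

    def _replicate(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Each read goes to the next replica, distributing the read load.
        return self.replicas[next(self._rr)].get(key)

store = ReplicatedStore()
store.write("user:1", {"name": "alice"})
store.read("user:1")  # served by a replica, not the primary
```

Note the structural limit this makes visible: adding replicas scales `read`, but every `write` still funnels through one primary.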

Master-Master (Active-Active)

How it works: Writes can go to any node. Changes are propagated between masters.

Write → Master 1 → Master 2 → Master 3
Write → Master 2 → Master 1 → Master 3
Write → Master 3 → Master 1 → Master 2

Read → Can go to any master

Pros:

  • Write scalability - Distribute writes across masters
  • No single point of failure - Any master can accept writes
  • Geographic distribution - Masters in different regions for local writes

Cons:

  • Conflict resolution - Two masters writing same row simultaneously
  • Complex setup - More complex to configure and maintain
  • Eventual consistency - Harder to maintain strong consistency

Real-world use: Global applications needing local writes in each region (e.g., social media feeds)

Performance impact:

  • Write throughput: 3-5x improvement (1 master → 3-5 masters)
  • Conflict rate: Depends on write patterns (0.1-5% of writes may conflict)
  • Consistency window: 100-500ms for cross-region replication
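One common answer to the conflict-resolution problem above is last-write-wins (LWW): every write carries a timestamp, and when two masters exchange changes, the newer value per key survives. A minimal sketch (the key/value shapes are invented for illustration):

```python
def merge_lww(local, remote):
    """Merge two replicas' {key: (timestamp, value)} maps, keeping the
    newer value per key (last-write-wins conflict resolution)."""
    merged = dict(local)
    for key, (ts, value) in remote.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

master_a = {"profile:1": (100, "name=Alice")}
master_b = {"profile:1": (105, "name=Alicia"),  # later concurrent write wins
            "profile:2": (90, "name=Bob")}

merge_lww(master_a, master_b)
# → {'profile:1': (105, 'name=Alicia'), 'profile:2': (90, 'name=Bob')}
```

LWW is simple and deterministic, but it silently discards the losing write — which is exactly why master-master setups are harder to reason about than a single master.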

Sharding: Partitioning Data

Sharding (partitioning) splits data across multiple database servers based on a shard key.

Horizontal Sharding

How it works: Split rows of a table across servers based on a shard key.

Users Table (100M rows)
├── Shard 1: Users A-G (25M users) → DB Server 1
├── Shard 2: Users H-N (25M users) → DB Server 2
├── Shard 3: Users O-T (25M users) → DB Server 3
└── Shard 4: Users U-Z (25M users) → DB Server 4
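The routing in the diagram above fits in a few lines (server names and letter ranges are illustrative):

```python
# Range-based shard routing: each shard owns a contiguous range of first letters.
SHARD_RANGES = [
    ("A", "G", "db-server-1"),
    ("H", "N", "db-server-2"),
    ("O", "T", "db-server-3"),
    ("U", "Z", "db-server-4"),
]

def route(username):
    """Return the database server responsible for this user."""
    first = username[0].upper()
    for lo, hi, server in SHARD_RANGES:
        if lo <= first <= hi:
            return server
    raise ValueError(f"no shard for {username!r}")

route("alice")  # → 'db-server-1'
route("oscar")  # → 'db-server-3'
```

This makes the hot-spot risk concrete: if usernames cluster around certain letters, one server does far more work than the others, and rebalancing means redrawing the ranges and physically moving rows.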

Pros:

  • Write scalability - Distribute writes across shards
  • Massive scale - Handle petabytes of data
  • Parallel queries - Query multiple shards in parallel
  • Independent scaling - Scale hot shards independently

Cons:

  • Complex joins - Cross-shard joins are hard/expensive
  • Rebalancing - Moving data between shards is complex
  • Query complexity - Application needs to route queries to right shard
  • Hot spots - Popular shard keys can create imbalances

Real-world use: User data, time-series data, any massive dataset

Performance impact:

  • Write throughput: 10x improvement (1 server → 10 shards)
  • Query performance: Depends on shard key (local shard = fast, cross-shard = slow)
  • Rebalancing time: Days to weeks for large datasets

Consistent Hashing

How it works: Use a hash ring to determine shard, minimizing data movement when adding/removing shards.

Hash Ring: [0, 1, 2, ..., 1023]
├── Shard 1: 0-255
├── Shard 2: 256-511
├── Shard 3: 512-767
└── Shard 4: 768-1023

User "alice" → hash("alice") % 1024 = 342 → Shard 2
User "bob"   → hash("bob") % 1024 = 789 → Shard 4
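The example above uses fixed ranges for clarity; a real consistent-hash ring places many virtual nodes per shard, so adding a shard steals only about 1/n of the keys instead of reshuffling nearly all of them (as `hash(key) % n` would). A minimal Python sketch, with an arbitrary vnode count and shard names:

```python
import bisect
import hashlib

def _pos(key: str) -> int:
    """Map a string to a ring position (md5 keeps it deterministic across runs)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Sketch of a consistent hash ring with virtual nodes. Each shard owns
    many small arcs of the ring, so adding a shard remaps only ~1/n of keys."""

    def __init__(self, shards, vnodes=100):
        self.vnodes = vnodes
        self._positions = []  # sorted vnode positions on the ring
        self._owners = {}     # position -> shard name
        for shard in shards:
            self.add(shard)

    def add(self, shard):
        for i in range(self.vnodes):
            p = _pos(f"{shard}#{i}")
            bisect.insort(self._positions, p)
            self._owners[p] = shard

    def lookup(self, key):
        """The first vnode clockwise from the key's position owns the key."""
        i = bisect.bisect(self._positions, _pos(key)) % len(self._positions)
        return self._owners[self._positions[i]]

ring = HashRing(["shard-1", "shard-2", "shard-3", "shard-4"])
keys = [f"user-{n}" for n in range(10_000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("shard-5")  # scale out by one shard
moved = sum(before[k] != ring.lookup(k) for k in keys)
# moved is roughly 1/5 of the keys; modulo hashing would remap ~4/5 of them
```

Virtual nodes also smooth out distribution: with only one position per shard, a single unlucky hash could give one shard a huge arc of the ring.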

Pros:

  • Minimal rebalancing - Adding a shard moves ~1/n of data
  • Even distribution - Hash function spreads data evenly
  • Deterministic - Same key always routes to same shard

Cons:

  • Range queries hard - Finding users A-G requires querying all shards
  • Hash collisions - Poor hash function can create imbalances
  • No hot spot handling - Popular keys still create imbalances

Real-world use: Key-value stores, caching, session storage

Real-World Case Studies

Case Study 1: Netflix's Migration to Cassandra

The Challenge: As Netflix grew from DVD-by-mail to streaming, it moved from a single database to needing to handle billions of play events per day.

Before (Single PostgreSQL):

  • Single master database
  • 100M subscribers
  • Limited by single server capacity
  • SPOF (Single Point of Failure)

The Migration Journey:

Phase 1: Master-Slave Replication

PostgreSQL Master (writes) → 5 Read Replicas
                               ↓ (replication lag: 50-100ms)
  • Result: 5x read scalability, writes still bottlenecked
  • Problem: Still can't handle streaming scale

Phase 2: Sharded PostgreSQL

  • Sharded by subscriber ID (100 shards)
  • Each shard: 1M subscribers
  • Result: Write scalability improved, but complex cross-shard queries
  • Problem: Manual rebalancing, hot shard issues

Phase 3: Migration to Cassandra (Final Solution)

Cassandra Cluster (100 nodes)
├── Datacenter US-East (50 nodes)
└── Datacenter US-West (50 nodes)

Data Model:
├── Play Events Table (sharded by user_id)
├── Watch History Table (sharded by user_id)
└── Catalog Table (replicated 3x, no sharding)

Key Decisions:

  1. Cassandra choice - NoSQL for horizontal scalability
  2. Multi-datacenter - Geographic redundancy and local reads
  3. Consistency level - QUORUM for reads/writes (balanced)
  4. Replication factor - 3 (2 datacenters + 1 local replica)

The Results:

  • Scale: 100M → 260M subscribers
  • Play events: 1B+ events/day (vs 10M with PostgreSQL)
  • Availability: 99.99% (was 99.9% with PostgreSQL)
  • Latency: p99 latency <200ms (vs 500ms with sharded PostgreSQL)
  • Data volume: 10+ PB of data (far beyond what a single SQL deployment could hold)

💡 Key Insight: Netflix chose Cassandra for its write-heavy workload (play events). They kept some data in PostgreSQL for complex queries (billing, analytics) because different workloads benefit from different database technologies.

Case Study 2: Airbnb's Hybrid Database Strategy

The Challenge: Airbnb needed to handle rapid growth (1M → 100M+ users) while maintaining complex relationships between listings, bookings, and reviews.

The Architecture:

Airbnb Platform
├── PostgreSQL (Core Business)
│   ├── Users Table
│   ├── Listings Table
│   ├── Bookings Table
│   └── Reviews Table
│       └── Replicated (Master + 3 Read Replicas)
│
└── MongoDB (Flexible Data)
    ├── Search Index
    ├── Message Queue
    └── Activity Feed

Why PostgreSQL for Core Business?

  1. Data relationships matter

    • Listing → Bookings → Reviews (complex joins)
    • User → Listings → Reviews (many-to-many relationships)
    • Foreign keys ensure referential integrity
  2. ACID transactions

    • Booking = Create booking record + Update availability + Send notification
    • All-or-nothing: booking succeeds or fails completely
    • No partial bookings or data inconsistencies
  3. Complex queries

    • "Find listings with availability in date range" (complex WHERE clauses)
    • "Get user's booking history with reviews" (joins + aggregations)
    • "Calculate host revenue" (complex analytics)

Why MongoDB for Flexible Data?

  1. Evolving schema

    • Search index fields change frequently (add filters, sorting)
    • Message queue format evolves (add metadata, change structure)
    • No ALTER TABLE downtime
  2. Horizontal scaling

    • Search index: 100M+ listings, complex queries
    • Activity feed: Millions of updates/second
    • Sharding easier than PostgreSQL
  3. Performance patterns

    • Search: Read-heavy, simple queries
    • Activity feed: Write-heavy, append-only
    • Both benefit from NoSQL optimizations

Performance Results:

  • PostgreSQL: 50K writes/sec, 500K reads/sec (complex queries)
  • MongoDB: 200K writes/sec, 2M reads/sec (simple queries)
  • Overall latency: p95 <100ms for core flows

💡 Key Insight: Airbnb uses the right tool for the right job. PostgreSQL for complex relational data, MongoDB for flexible, high-throughput patterns. This hybrid approach gives them both data integrity and scalability.

Case Study 3: Uber's Sharding Evolution

The Challenge: Uber grew from a single PostgreSQL database to processing over a billion trips per day globally.

Stage 1: Single PostgreSQL

  • Single database in one datacenter
  • 1M trips/day
  • Bottleneck: Single server capacity
  • Risk: Datacenter outage = complete downtime

Stage 2: Sharded PostgreSQL

  • Sharded by city_id (US trips in US DB, Europe trips in Europe DB)
  • 100M trips/day
  • Benefits: Geographic isolation, write scalability
  • Problem: Complex cross-city queries (a rider who takes trips in multiple cities)

Stage 3: Schemaless Sharding (The Pivot)

  • Moved to MySQL with a schemaless design (JSON blobs stored in MySQL rows)
  • Sharded by rider_id (each rider's data on one shard)
  • 1B+ trips/day
  • Benefits: Schema flexibility, better shard distribution
  • Problem: Still hard to evolve, complex migrations

Stage 4: Multi-Database Strategy (Current)

Uber Platform
├── MySQL (Transactional Data)
│   ├── Trips Table (sharded by city)
│   ├── Users Table (sharded by user_id)
│   └── Payments Table (sharded by ride_id)
│
├── Cassandra (Time-Series Data)
│   ├── Event Stream (10M+ events/sec)
│   └── Telemetry (vehicle location updates)
│
├── Redis (Real-Time Data)
│   ├── Driver Location (updates every 2 sec)
│   └── Surge Pricing (calculations every 5 sec)
│
└── ElasticSearch (Search)
    ├── Location Search (geospatial queries)
    └── Destination Search (autocomplete)

Key Decisions:

  1. MySQL for transactions - Trip data needs ACID guarantees
  2. Cassandra for time-series - Event stream is write-heavy (append-only)
  3. Redis for real-time - Driver locations need <100ms latency
  4. ElasticSearch for search - Geospatial queries require specialized search

Performance Results:

  • Trips processed: 1B+ trips/day (from 1M)
  • Event stream: 10M+ events/sec (Cassandra handles write load)
  • Real-time data: 10M+ driver locations/sec (Redis provides <50ms latency)
  • Search: <100ms p95 for location search (ElasticSearch optimization)

💡 Key Insight: Uber doesn't try to fit everything in one database. Different workloads (transactions, time-series, real-time, search) benefit from different database technologies. The complexity is higher, but the performance and scalability are unmatched.

Production Metrics

Database Performance Comparison

| Database | Write Throughput | Read Latency (p95) | Scale |
|---|---|---|---|
| PostgreSQL | 50K writes/sec | 10-50ms | TBs |
| MySQL | 100K writes/sec | 10-40ms | TBs |
| MongoDB | 500K writes/sec | 5-20ms | PBs |
| Cassandra | 1M+ writes/sec | 20-100ms | PBs |
| Redis | 10M+ writes/sec | <5ms | GBs-TBs |

Replication Performance

| Strategy | Write Latency Impact | Read Scalability | Consistency |
|---|---|---|---|
| Master-Slave | +5-10ms | 10x | Strong |
| Master-Master | +50-200ms | 3-5x | Eventual |
| Multi-Master | +100-500ms | 5-10x | Eventual |

Sharding Performance

| Strategy | Write Throughput Improvement | Query Performance | Rebalancing |
|---|---|---|---|
| Horizontal (range) | 5-10x | Fast (local), slow (cross-shard) | Days-weeks |
| Consistent hashing | 8-12x | Medium (range queries slow) | Minimal |
| Geographic | 2-5x | Very fast (local reads) | Complex |

Trade-Off Scenarios

Scenario 1: E-Commerce Platform

Context: Building an Amazon-scale e-commerce platform that must handle Black Friday traffic spikes (10x normal). Data: product catalog, user accounts, orders, reviews.

The Trade-Off Decisions:

| Decision | Option A | Option B | What You Choose & Why |
|---|---|---|---|
| Product Catalog | PostgreSQL (complex queries) | MongoDB (flexible schema) | PostgreSQL - Complex filters, joins, ACID for inventory |
| User Sessions | PostgreSQL (sessions table) | Redis (key-value) | Redis - Fast access, TTL for expiration |
| Orders | PostgreSQL (ACID transactions) | MongoDB (eventual consistency) | PostgreSQL - Financial data needs strong consistency |
| Reviews | PostgreSQL (relational) | MongoDB (document) | MongoDB - Flexible schema, high write throughput |
| Replication | Master-Slave (read-heavy) | Master-Master (write-heavy) | Master-Slave - 90% reads, strong consistency for orders |
| Sharding | Not needed (scale vertically) | Yes (future-proof) | Yes - Black Friday traffic requires horizontal scale |

Result:

  • Pros: Right database for each workload, strong consistency for transactions, scalability for Black Friday
  • Cons: Complex architecture (4 databases), cross-database joins impossible
  • Performance: 1M+ products, 100K+ concurrent users, p95 <100ms

Scenario 2: Social Media Feed

Context: Building a real-time social media feed like Twitter. Users post updates, followers see them in timeline. Requirements: millions of posts/sec, low latency, global availability.

The Trade-Off Decisions:

| Decision | Option A | Option B | What You Choose & Why |
|---|---|---|---|
| Posts Storage | PostgreSQL (relational) | Cassandra (time-series) | Cassandra - Write-heavy (append-only), time-series access pattern |
| Timeline Query | PostgreSQL (complex joins) | Denormalized (pre-computed timelines) | Denormalized - Pre-compute user timelines, no complex joins |
| Likes/Follows | PostgreSQL (relational) | Redis (fast counters) | Redis - High-frequency reads/writes, simple counters |
| User Profiles | PostgreSQL (ACID needed) | MongoDB (flexible) | PostgreSQL - Profiles need ACID (security, privacy) |
| Replication | Master-Slave (eventual OK) | Multi-Master (global writes) | Multi-Master - Global users need local writes |
| Consistency | Strong | Eventual | Eventual - OK if likes show with a slight delay, as long as posts appear quickly |

Result:

  • Pros: Massive write scalability, global availability, low latency for feed
  • Cons: Eventual consistency (users might see stale data), complex architecture
  • Performance: 10M+ posts/sec, p99 feed latency <200ms, 99.99% availability

Scenario 3: Analytics Platform

Context: Building an analytics platform processing billions of events/day. Need to store raw events, compute aggregates in real-time, and serve ad-hoc queries.

The Trade-Off Decisions:

| Decision | Option A | Option B | What You Choose & Why |
|---|---|---|---|
| Raw Events | PostgreSQL (structured) | Cassandra (time-series) | Cassandra - Write-heavy (append-only), automatic TTL, time-series partitioning |
| Aggregates | PostgreSQL (materialized views) | Redis (pre-computed) | Redis - Fast access, TTL for expiration, updated on new events |
| Ad-hoc Queries | PostgreSQL (complex SQL) | ElasticSearch (search/analytics) | ElasticSearch - Fast aggregations, full-text search, scalable |
| User Metadata | PostgreSQL (ACID needed) | MongoDB (flexible) | PostgreSQL - User accounts need ACID (security, permissions) |
| Data Retention | Manual cleanup | TTL-based automatic | TTL-based - Automates cleanup (billions of events/day) |
| Consistency | Strong | Eventual | Eventual - OK if aggregates are slightly stale, but raw events must be durable |

Result:

  • Pros: Massive scale (PBs of data), automatic retention, fast aggregations
  • Cons: Eventual consistency (aggregates might be behind), complex ETL pipelines
  • Performance: 10B+ events/day, p95 query latency <500ms, 99.9% data durability

Sruja Perspective: Modeling Databases

In Sruja, we document database choices with clear trade-offs and performance characteristics.

Why Model Databases?

Modeling databases in your architecture provides:

  1. Technology choice clarity - Document why SQL vs NoSQL
  2. Scaling strategy - Show replication/sharding approach
  3. Performance visibility - Track throughput, latency, scale
  4. Failure analysis - Understand SPOFs and failover scenarios

Example: E-Commerce Database Architecture

import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce Platform" {
    description "Multi-database architecture for different workloads"
    
    // POSTGRESQL: Core transactional data
    PrimaryDB = database "Primary PostgreSQL" {
        technology "PostgreSQL 15"
        description "Stores users, products, orders - requires ACID"
        tags ["sql", "transactional"]
        
        tradeoff {
            decision "Use PostgreSQL for core business data"
            sacrifice "Write scalability (single master for writes)"
            reason "Strong consistency required for orders/payments, complex queries needed for product catalog"
            mitigation "Read replicas for read scalability, plan for sharding"
        }
        
        capacity {
            storage "10 TB"
            throughput_writes "50K/sec"
            throughput_reads "500K/sec"
            replication "Master + 3 Read Replicas"
        }
    }
    
    // MONGODB: Flexible, high-throughput data
    ReviewsDB = database "MongoDB Cluster" {
        technology "MongoDB 7"
        description "Stores reviews, activity feed - flexible schema, high writes"
        tags ["nosql", "document", "sharded"]
        
        tradeoff {
            decision "Use MongoDB for reviews and activity feed"
            sacrifice "Strong consistency (eventual consistency OK)"
            reason "Flexible schema for evolving data, high write throughput, easy sharding"
            mitigation "Write-ahead log for durability, read-your-writes for consistency"
        }
        
        capacity {
            storage "50 TB"
            throughput_writes "500K/sec"
            throughput_reads "2M/sec"
            replication "Replica Set 3 + Sharded by user_id"
        }
    }
    
    // REDIS: Real-time session data
    SessionStore = database "Redis Cluster" {
        technology "Redis 7"
        description "Stores user sessions, shopping carts - fast access, TTL"
        tags ["nosql", "key-value", "real-time"]
        
        tradeoff {
            decision "Use Redis for session data"
            sacrifice "Durability (in-memory, can lose data on crash)"
            reason "Sub-millisecond latency required for session checks, automatic TTL for cleanup"
            mitigation "AOF persistence, replicas for backup"
        }
        
        capacity {
            storage "1 TB"
            throughput_writes "10M/sec"
            throughput_reads "20M/sec"
            replication "Cluster mode (6 nodes)"
        }
    }
    
    // TRAFFIC FLOW
    UserService = container "User Service" {
        technology "Go"
    }
    
    ProductService = container "Product Service" {
        technology "Python"
    }
    
    OrderService = container "Order Service" {
        technology "Java"
    }
    
    UserService -> PrimaryDB "Reads user data"
    UserService -> SessionStore "Checks session (fast)"
    ProductService -> PrimaryDB "Queries product catalog"
    OrderService -> PrimaryDB "Creates order (transaction)"
    OrderService -> ReviewsDB "Posts review (high throughput)"
}

view index {
    title "E-Commerce Database Architecture"
    include *
}

view sql {
    title "SQL Databases"
    include ECommerce.PrimaryDB
}

view nosql {
    title "NoSQL Databases"
    include ECommerce.ReviewsDB ECommerce.SessionStore
}

Key Trade-Offs Documented

1. Database Technology Choice:

  • PostgreSQL for transactions - Strong consistency, complex queries, but write scalability limited
  • MongoDB for high throughput - Flexible schema, horizontal scaling, but eventual consistency
  • Redis for real-time - Sub-millisecond latency, but in-memory (durability concerns)

2. Scaling Strategy:

  • Read replicas for PostgreSQL (read-heavy workload: 10:1 read:write ratio)
  • Sharding for MongoDB (write-heavy reviews: user_id shard key)
  • Redis cluster for session data (high throughput, automatic failover)

3. Consistency vs Performance:

  • Sacrifice strong consistency for reviews (OK if reviews appear with slight delay)
  • Maintain strong consistency for orders (financial data cannot be wrong)
  • Use read-your-writes pattern for better user experience
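The read-your-writes pattern mentioned above can be sketched as a routing policy (the class and its names are hypothetical): after a user writes, serve that user's reads from the primary for a short window, so they never observe a replica that hasn't caught up with their own change.

```python
import time

class ReadYourWrites:
    """Sketch of read-your-writes routing: a user who just wrote is
    pinned to the primary until replication has likely caught up."""

    def __init__(self, sticky_seconds=1.0, clock=time.monotonic):
        self.sticky_seconds = sticky_seconds  # should exceed typical replication lag
        self.clock = clock                    # injectable for testing
        self.last_write = {}                  # user_id -> timestamp of most recent write

    def record_write(self, user_id):
        self.last_write[user_id] = self.clock()

    def choose(self, user_id):
        """Return which tier should serve this user's next read."""
        ts = self.last_write.get(user_id)
        if ts is not None and self.clock() - ts < self.sticky_seconds:
            return "primary"
        return "replica"

router = ReadYourWrites(sticky_seconds=1.0)
router.choose("alice")        # → 'replica'  (no recent write)
router.record_write("alice")
router.choose("alice")        # → 'primary'  (inside the sticky window)
```

Other users still read from replicas, so the primary only absorbs the small fraction of reads that immediately follow a write.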

Knowledge Check

Q: I'm building a social media feed like Twitter. Millions of posts/second. What database should I use for posts?

Cassandra (or similar time-series NoSQL)

Social media feeds are write-heavy (append-only time-series) and require massive horizontal scale. SQL databases cannot handle millions of writes/second effectively. Cassandra is designed exactly for this workload: write-optimized, automatically partitions by time, scales horizontally.

Q: Why use master-slave replication instead of master-master?

Master-slave is simpler and provides strong consistency.

Master-slave writes go to a single master, reads can go to slaves. This is simple to implement, provides strong consistency (slaves always see master's changes), and works great for read-heavy workloads. Master-master scales writes better but adds complexity: conflict resolution, eventual consistency, and harder to reason about.

Q: My app has flexible data (schema changes frequently). Should I use SQL or NoSQL?

NoSQL - specifically document databases like MongoDB.

SQL databases have rigid schemas. ALTER TABLE can be slow and complex. NoSQL document databases allow you to add fields on the fly without migrations. Perfect for rapidly evolving data structures (e.g., user profiles with new features, product catalogs with new attributes).

Q: What's the main benefit of consistent hashing over range sharding?

Minimal data movement when adding/removing shards.

With range sharding, adding a shard might require moving 50% of data. With consistent hashing, adding a shard only moves ~1/n of data (where n is number of shards). This makes scaling operations much faster and less disruptive.

Quiz: Test Your Knowledge

Q1: Which database type is best for financial transactions requiring ACID guarantees?

  • MongoDB (Document)
  • Redis (Key-Value)
  • PostgreSQL (Relational)
  • Cassandra (Column-family)
Answer

PostgreSQL (Relational) provides ACID compliance (Atomicity, Consistency, Isolation, Durability) which is critical for financial transactions. MongoDB and Redis sacrifice strong consistency for performance. Cassandra provides tunable consistency but is optimized for write-heavy workloads, not complex transactions.

Q2: Which replication strategy is best for a read-heavy system where reads outnumber writes 10:1?

  • Master-Master
  • Master-Slave
  • No replication needed
Answer

Master-Slave is ideal for read-heavy workloads. All writes go to the master (single write path), but reads can be distributed across multiple read replicas. This gives you 10x read scalability with the simplicity of a single master. Master-Master gives write scalability but adds complexity (conflict resolution, eventual consistency).

Q3: You're building a time-series analytics platform ingesting billions of events/day. Which database is best?

  • PostgreSQL
  • MongoDB
  • Cassandra
  • Redis
Answer

Cassandra is designed specifically for time-series workloads: write-heavy, append-only data that grows indefinitely. It automatically partitions data by time, scales horizontally, and provides TTL for automatic data retention. PostgreSQL would struggle with write throughput and storage. MongoDB works but Cassandra's time-series optimizations are better. Redis is in-memory and too expensive for billions of events.

Q4: Which of these is NOT a characteristic of NoSQL databases?

  • Flexible schemas
  • Horizontal scalability
  • Strong ACID compliance
  • Eventual consistency is often acceptable
Answer

Strong ACID compliance is NOT a characteristic of most NoSQL databases. NoSQL databases typically prioritize scalability and flexibility over strong ACID guarantees. While some NoSQL databases (like MongoDB) provide document-level ACID guarantees, they don't provide the same level of multi-document transaction support as SQL databases.

Q5: Uber uses multiple databases (MySQL, Cassandra, Redis, ElasticSearch). Why?

  • To reduce complexity
  • To choose the right database for each workload
  • Because they couldn't decide on one
  • To minimize costs
Answer

To choose the right database for each workload - this is a deliberate architectural decision. Different workloads benefit from different databases: MySQL for transactions (trips), Cassandra for time-series (event stream), Redis for real-time (driver locations), ElasticSearch for search. This polyglot persistence approach gives better performance and scalability than trying to force everything into one database.

Q6: What's the main drawback of sharding?

  • Slower writes (distributed across servers)
  • Complex joins across shards
  • More storage space needed
  • Slower reads (distributed across servers)
Answer

Complex joins across shards is the main drawback. When data is split across multiple database servers, joining data that resides on different shards becomes complex and expensive. Cross-shard joins require querying multiple shards and merging results in the application, which is slow and complex. This is why you carefully choose shard keys to minimize cross-shard queries.

Q7: Airbnb uses PostgreSQL for core business data and MongoDB for flexible data. Why this hybrid approach?

  • PostgreSQL is faster
  • MongoDB is cheaper
  • Different workloads benefit from different technologies
  • They couldn't scale PostgreSQL
Answer

Different workloads benefit from different technologies - this is a smart architectural decision. PostgreSQL excels at complex relational data with ACID transactions (users, listings, bookings). MongoDB excels at flexible, evolving schemas with high write throughput (search index, activity feed). Using the right database for each workload gives better performance and scalability than forcing everything into one technology.

Q8: Which database feature allows Redis to automatically delete session data after 30 minutes?

  • ACID transactions
  • TTL (Time To Live)
  • Sharding
  • Master-Slave replication
Answer

TTL (Time To Live) allows Redis to automatically delete data after a specified time. This is perfect for session data, shopping carts, and other temporary data. You don't need manual cleanup jobs - Redis handles it automatically. Set a TTL of 30 minutes when storing a session, and Redis deletes it automatically.

Next Steps

Now that we understand databases and how to choose between SQL and NoSQL, let's learn about caching to improve performance. 👉 Lesson 3: Caching (Strategies, Eviction Policies, Redis Case Study)