
Sruja – Architecture intelligence for the AI era.

Architecture intelligence for the AI era. Use AI to generate and maintain architecture as code; validate and export to Markdown and Mermaid. A backend tool for the SDLC—not a diagramming product.

Why Sruja?

The Problem

Most architecture tools make you choose:

  • Visual-only tools (Draw.io) – no code, no version control, hard to maintain
  • Code-only tools (Mermaid, PlantUML) – no validation, manual diagram updates
  • Stale diagrams – architecture drifts from reality, documentation gets outdated

Our Solution

Sruja gives you a code-first architecture tool:

  • Markdown & Mermaid – Export architecture to clean Markdown docs and Mermaid diagrams
  • Version-controlled – .sruja files in Git, with proper code review workflows
  • Team-friendly – Developers work in code with familiar tools
  • Built-in validation – Catch architecture issues before they reach production
  • Multiple exports – JSON, Markdown, Mermaid; integrate into your existing toolchain

Who It's For

  • Engineering teams who need architecture as part of their SDLC
  • Tech leads who want to enforce architectural standards
  • Platform engineers building guardrails for distributed teams
  • AI agents that need to reason about system architecture

How We Work

  1. Define your architecture in .sruja files
  2. Validate with built-in checks (cycles, orphans, unique IDs)
  3. Export to JSON, Markdown, or Mermaid diagrams
  4. Integrate into CI/CD, docs, and your IDE workflow

We're ultra simple – minimal surface area, no unnecessary apps or frameworks – and highly functional: what we ship works reliably for its scope.

Stack

  • Rust – CLI, engine, LSP, WASM (single language for core)
  • VS Code extension – Edit .sruja files with syntax highlighting and diagnostics
  • Docs – This book (mdBook, Rust-based; no TypeScript/Node)

New here? Install the sruja-architecture skill first (1 minute), then let your AI generate architecture for you. For a single entry point to docs, tutorials, and courses, use Navigate. The left sidebar lists everything; press / or S to search.

See Quick start to install the AI skill and create your first .sruja file.

To enable "Show diagram" in code blocks: run make wasm from the repo root once, then run make book-serve (or ./serve.sh from the book directory) so the WASM files are copied into the book output.

Navigate

New here? Do Quick start first (about 5 min), then the Beginner path (2–3 hours). Everything else is below.

Use one entry point below. The sidebar always lists the full structure.

  • Documentation – Concepts, reference, how Sruja works, adoption guides
  • Tutorials – Step-by-step: CLI, DSL, validation, export, CI/CD, more
  • Courses – Structured courses: systems thinking, system design, ecommerce, production, AI

Quick Start

Get architecture from code in 5 minutes.

You don't write .sruja files. Your AI does it for you. Here's how:


Step 1: Install the AI skill (1 minute)

This teaches your AI editor how to generate Sruja files. The skill will guide you to install the CLI when needed.

npx skills add https://github.com/sruja-ai/sruja --skill sruja-architecture

Supported editors: Cursor, GitHub Copilot, Claude, Continue.dev


Step 2: Generate your architecture file (2 minutes)

Open your AI editor and ask it to generate your architecture:

Use sruja-architecture skill. Analyze my codebase, generate a repo.sruja file, 
then run sruja lint and fix any errors until it passes.

What happens:

  1. The skill runs sruja discover to understand your code structure (installs CLI if needed)
  2. Asks you 2-3 questions if anything is unclear (e.g., "What's this service for?")
  3. Generates a repo.sruja file with your architecture
  4. Runs sruja lint to check for errors
  5. Fixes any errors automatically

Result: You now have a repo.sruja file in your project root!


Step 3: Validate it (optional)

The skill already validated your file, but you can check manually:

sruja lint repo.sruja

If it says "✅ All checks passed", you're good!

If you see errors, just ask your AI: "Fix these errors."


What's Next?

You have architecture in code. Now what?

Generate diagrams for documentation:

sruja export mermaid repo.sruja > architecture.mmd
sruja export markdown repo.sruja > ARCHITECTURE.md

Keep it in sync:

When you change your code, run:

sruja drift -r .

This shows you what changed and if your architecture needs updating.



Quick Reference

  • Install skill – npx skills add https://github.com/sruja-ai/sruja --skill sruja-architecture
  • Generate with AI – Ask your AI: "Use sruja-architecture skill to generate my architecture"
  • Validate – sruja lint repo.sruja
  • Export diagram – sruja export mermaid repo.sruja > diagram.mmd
  • Check for drift – sruja drift -r .

Common Questions

"What if the command isn't found?"

The CLI isn't on your PATH. Try:

# Add to PATH
export PATH="$HOME/.local/bin:$PATH"

# Or restart your terminal

"My editor doesn't support skills."

You can still use Sruja manually:

  • Run sruja discover to get JSON output
  • Create repo.sruja by hand (see Language spec)
  • Use sruja lint to validate

But AI makes it much easier—consider using Cursor or installing skills.sh.

"What's the difference between quickstart and discover?"

  • quickstart – Quick overview, human-readable output (great for first look)
  • discover – Detailed JSON output (used by AI for generation)

Option A – install script (downloads from GitHub Releases):

curl -fsSL https://sruja.ai/install.sh | bash

Option B – from Git (requires Rust):

cargo install sruja-cli --git https://github.com/sruja-ai/sruja

Option C – build from source:

git clone https://github.com/sruja-ai/sruja.git && cd sruja
make build

Ensure the sruja binary is on your PATH (install script uses ~/.local/bin by default).

Create a .sruja file

This is the minimal style (explicit kinds, no import). For the full Getting Started guide using stdlib imports, see Getting Started. Both styles are valid; use whichever you prefer.

person = kind "Person"
system = kind "System"
container = kind "Container"

User = person "User" {}
App = system "My App" {
  Web = container "Web Server" { technology "Node.js" }
}
User -> App.Web "visits"

Validate and export

sruja lint example.sruja
sruja export json example.sruja
sruja export markdown example.sruja

VS Code

Install the Sruja extension for syntax, diagnostics, and optional diagram preview in the editor.


Next: Beginner path builds on this in 7 steps (2–3 hours). For a longer "first architecture" walkthrough with a view and stdlib import, see Getting started (full).

VS Code extension

The Sruja VS Code extension provides:

  • Syntax highlighting for .sruja files
  • Diagnostics (errors, warnings) via the language server
  • WASM-based diagram preview – render diagrams from your DSL in the editor (no web server)

Install from the VS Code Marketplace or build from source in extension/.

Introduction

Architecture intelligence for the AI era.

Sruja uses AI to analyze your code and generate architecture as code—so it never drifts from reality.

New here? Do Quick start (about 5 min), then the Beginner path (2–3 hours). You don't write .sruja files manually—your AI does it for you.

The Problem

How do you document architecture today?

  • Drawings in Miro/Lucidchart – manual updates, easy to forget, drifts from code
  • Wiki pages – inconsistent, hard to maintain, no validation
  • PNG/PDF diagrams – can't version-control or diff, outdated quickly

Sound familiar? You're not alone. Most teams struggle with this.

The Solution

Architecture as code.

With Sruja:

  • AI analyzes your codebase automatically
  • You get a repo.sruja file (architecture definition)
  • Validate it automatically in CI/CD
  • Export diagrams when needed (not as the source)

You don't learn a new language. You ask your AI to generate the file, and it handles the syntax.

How This Helps

Before Sruja → With Sruja:

  • Update diagrams manually → AI generates from code
  • Diagram drifts from reality → Always in sync
  • Can't catch errors → Validation catches issues
  • Hard to review changes → Git diff shows everything
  • Scattered tools → Single source of truth

Key Concepts

Architecture as Code: Instead of drawing boxes, you define structure in code. AI writes it, you validate it, and everyone uses the same source.

Validation: Like lint for code, sruja lint checks for:

  • Circular dependencies
  • Orphaned components
  • Missing connections
  • Rule violations
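As a concrete illustration, a model like the following would trip the circular-dependency check (all names invented for the example):

import { * } from 'sruja.ai/stdlib'

App = system "My App" {
  API = container "API Service"
  Billing = container "Billing Service"

  // API depends on Billing and Billing depends on API - a cycle
  API -> Billing "fetches invoices"
  Billing -> API "looks up accounts"
}

Running sruja lint on a file like this should report the cycle so you can break it by removing or inverting one of the edges.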

C4 Model: Sruja uses the C4 approach, which organizes architecture into levels:

  • Person: Users, external systems
  • System: Major boundaries (e.g., "Order System")
  • Container: Deployable units (e.g., "API Service")
  • Component: Internal parts (e.g., "Payment Module")

This hierarchy makes architecture clear and understandable.
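In the DSL these levels nest directly. A minimal sketch (names are illustrative):

import { * } from 'sruja.ai/stdlib'

Shopper = person "Shopper"

Orders = system "Order System" {
  API = container "API Service" {
    Payments = component "Payment Module" {
      description "Handles card authorization"
    }
  }
}

Shopper -> Orders.API "places orders"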

Who is Sruja For?

Students & Learners

  • Understand system design through production-ready examples from fintech, healthcare, and e-commerce
  • Use AI skills to generate architecture and explore patterns without manual DSL writing
  • Real-world scenarios that prepare you for interviews and real projects

Software Architects

  • Enforce architectural standards with policy-as-code
  • Prevent architectural drift through automated validation
  • Scale governance across multiple teams without manual reviews
  • Document decisions with ADRs (Architecture Decision Records)

Product Teams

  • Link requirements to architecture - see how features map to technical components
  • Track SLOs and metrics alongside your architecture
  • Align technical decisions with business goals and user needs
  • Communicate architecture to stakeholders (export to Markdown/Mermaid when needed)

DevOps Engineers

  • Integrate into CI/CD - validate architecture on every commit
  • Automate documentation generation from architecture files
  • Model deployments - Blue/Green, Canary, multi-region strategies
  • Track infrastructure - map logical architecture to physical deployment

Example

Here's a simple example to get you started:

import { * } from 'sruja.ai/stdlib'

App = system "My App" {
    Web = container "Web Server"
    DB = database "Database"
}

User = person "User"

User -> App.Web "Visits"
App.Web -> App.DB "Reads/Writes"

view index {
    include *
}

For production-ready examples with real-world patterns, see our Examples page featuring:

  • Banking systems (fintech)
  • E-commerce platforms
  • Healthcare platforms (HIPAA-compliant)
  • Multi-tenant SaaS platforms

Next Steps

Getting Started

Architecture from your code—no DSL learning required.

Your AI writes and maintains .sruja files. You just need to know what to ask for.


Prerequisites

  • AI editor – Cursor, Copilot, Claude, Continue.dev, etc.
  • AI skill – Install first: npx skills add https://github.com/sruja-ai/sruja --skill sruja-architecture (see Install as a Skill)
  • Sruja CLI – Needed when the skill runs discover/lint/drift; the skill will guide you to install it if missing (curl -fsSL https://sruja.ai/install.sh | bash)

Step 1: Gather evidence (or let the skill do it)

When you use the sruja-architecture skill, it runs discovery for you. Discovery is not just a file list—under the hood the CLI runs Tree-sitter on your code to build a graph of components and their dependency relations (who imports whom, which modules call which). That graph is the evidence the skill uses.

If you want to run discovery yourself (e.g. to inspect the summary), run:

cd your-project
sruja discover --context -r . --format json

What this does: Scans your repo with Tree-sitter, builds the component/dependency graph, then outputs a summary of that graph (so the AI can scope and ask targeted questions). The skill prefers .sruja/context.json and .sruja/graph.json when present (e.g. after sruja sync or the extension's Refresh repo context); when missing, it runs discover for you—no need to run a command first.

Summary output includes:

  • Component and edge counts (from the Tree-sitter graph)
  • Primary language and framework
  • Inferred architecture style (monolith / microservices)
  • Suggested areas (top-level path segments for scoping)

Example summary (actual schema):

{
  "repo": "my-app",
  "scan_scope": { "included": [], "excluded": [] },
  "components": 42,
  "edges": 58,
  "primary_language": "TypeScript",
  "framework": "React",
  "architecture_style": "monolith",
  "domain": null,
  "suggested_areas": ["src", "lib", "apps"]
}

The full graph is written by sruja sync to .sruja/graph.json. You can also produce it on demand with sruja scan -r . -o graph.json.


Step 2: Generate Architecture with AI

In your AI editor, run:

Use sruja-architecture. Analyze the discovery output:
[JSON from step 1],
identify systems, containers, and their relationships,
generate repo.sruja using C4 context and container levels,
then run `sruja lint` and fix until it passes.

What your AI will do:

  1. Analyze the JSON output from discovery
  2. Ask questions if scope is unclear (e.g., "What's this service for?")
  3. Generate repo.sruja with your architecture
  4. Validate it with sruja lint
  5. Fix any errors automatically

What repo.sruja Looks Like

import { * } from 'sruja.ai/stdlib'

// External actors
MobileApp = person "Mobile App" {
  description "Customer-facing mobile application"
}

// Main system
MyApp = system "My Application" {
  description "Handles user requests and processing"

  // Containers (deployable units)
  API = container "API Service" {
    technology "Node.js + Express"
    description "RESTful API for mobile and web clients"
  }

  Worker = container "Background Worker" {
    technology "Node.js + Bull"
    description "Processes async jobs (emails, reports)"
  }

  // Datastores
  Database = database "Primary DB" {
    technology "PostgreSQL"
    description "Stores user data and transactions"
  }

  Cache = database "Redis Cache" {
    technology "Redis"
    description "Caches frequently accessed data"
  }
}

// Relationships (how things connect)
MobileApp -> MyApp.API "HTTPS requests"
MyApp.API -> MyApp.Database "SQL queries"
MyApp.API -> MyApp.Cache "Redis get/set"
MyApp.Worker -> MyApp.Database "SQL queries"

Key concepts:

  • person – External actors (users, systems calling you)
  • system – Major boundary (your entire application)
  • container – Deployable unit (API, worker, web frontend)
  • database – Data storage or cache
  • -> – Relationship with protocol description

Step 3: Validate

After the AI generates repo.sruja, validate it:

sruja lint repo.sruja

What this checks:

  • Syntax errors – Invalid structure or keywords
  • Circular dependencies – A depends on B, B depends on A
  • Orphan elements – Something with no connections
  • Missing fields – Required information not provided

Fix errors: Paste the lint output to your AI and say: "Fix these errors."


Step 4: Export for Documentation

Export Markdown

sruja export markdown repo.sruja > ARCHITECTURE.md

Creates a readable document you can share with your team.

Export Mermaid Diagram

sruja export mermaid repo.sruja > ARCHITECTURE.mmd

Creates a diagram you can render anywhere Mermaid is supported, such as GitHub Markdown or the Mermaid Live Editor.

Export JSON

sruja export json repo.sruja > ARCHITECTURE.json

Machine-readable format for tools and automation.


Understanding C4 Levels

Sruja uses the C4 Model, which organizes architecture into levels:

  • Person – External actors: users, external systems, third-party APIs
  • System – High-level boundary: "Order System", "User Management System"
  • Container – Deployable unit: "API Service", "Web App", "Worker"
  • Component – Internal part: "Payment Module", "Auth Controller"

Recommended: Start with Person + System + Container levels. Add components only when you need more detail.


Common Questions

"When should I use stdlib imports?"

Always. It saves time by providing standard types (person, system, container, etc.) so you don't define them manually.
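The practical difference, side by side (both snippets are sketches):

// With the stdlib import, standard kinds are predefined
import { * } from 'sruja.ai/stdlib'
User = person "User"

// Without it, you declare the kinds yourself first
person = kind "Person"
User = person "User"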

"What if discovery doesn't find my code?"

  1. Check your language is supported (JavaScript, Python, Go, Rust, Java)
  2. Make sure you're in the correct directory
  3. Try sruja quickstart -r . to see what's detected

"How detailed should repo.sruja be?"

Start minimal. Only model what you actually need:

  • External actors calling your system
  • Major containers (services, apps)
  • Key datastores

Add more detail only when it provides value.
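A minimal repo.sruja in that spirit might look like this (names and technologies are placeholders):

import { * } from 'sruja.ai/stdlib'

Client = person "API Client"

Service = system "My Service" {
  API = container "API" { technology "Go" }
  DB = database "Primary DB" { technology "PostgreSQL" }
  API -> DB "reads/writes"
}

Client -> Service.API "HTTPS"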

"Can I edit repo.sruja manually?"

Yes, but it's easier to let AI do it. If you do edit manually:

  • Run sruja lint before committing
  • Validate syntax with the extension

Next Steps


Quick Reference

  • Analyze code – sruja discover --context -r . --format json
  • Validate – sruja lint repo.sruja
  • Export Markdown – sruja export markdown repo.sruja > doc.md
  • Export Mermaid – sruja export mermaid repo.sruja > diagram.mmd
  • Export JSON – sruja export json repo.sruja > arch.json
  • Check drift – sruja drift -r . --format json

How Sruja Works

Sruja is built to be a tool for the AI SDLC process: architecture in code that fits into your lifecycle—IDE, CI/CD, and documentation. We are not a diagramming product; we provide parsing, validation, export, and an optional preview.

The Sruja Platform

The platform consists of several key components working together:

  1. Parser & engine: Rust crates for parsing, validation, and export (sruja-language, sruja-engine, sruja-export).
  2. CLI: Command-line interface for local development and CI/CD (sruja-cli).
  3. WASM: Rust core compiled to WebAssembly for the docs book and VS Code (sruja-wasm).
  4. LSP: Language server for VS Code (sruja-lsp).
  5. Docs: This site—built with mdBook from the book/ directory.

Architecture Diagram

Explore the Sruja architecture itself using the interactive viewer below. This diagram is defined in Sruja DSL!

<!-- partial -->
import { * } from 'sruja.ai/stdlib'


RootSystem = system "The Sruja Platform" {
  tags ["root"]
}

User = person "Architect/Developer" {
	description "Uses Sruja to design and document systems"
}

Sruja = system "Sruja Platform" {
	description "Tools for defining, visualizing, and analyzing software architecture"

	CLI = container "Sruja CLI" {
		technology "Rust"
		description "Command-line interface (crates/sruja-cli)"
	}

	Engine = container "Core Engine" {
		technology "Rust"
		description "Validation and export (crates/sruja-engine, sruja-export)"

		Validation = component "Validation Engine" {
			technology "Rust"
			description "Validates AST against rules (crates/sruja-core/src/engine/rules)"
		}

		Scorer = component "Scoring Engine" {
			technology "Rust"
			description "Calculates architecture health score (crates/sruja-core/src/engine/scorer)"
		}

		Policy = component "Policy Engine" {
			technology "Rust"
			description "Enforces custom policies (future: OPA/Rego)"
		}

		Scorer -> Validation "uses results from"
		Validation -> Policy "checks against"
	}

	Language = container "Language Service" {
		technology "Rust"
		description "Parser and AST (crates/sruja-language); LSP (crates/sruja-lsp)"
	}

	WASM = container "WASM Module" {
		technology "Rust/WASM"
		description "WebAssembly build (crates/sruja-wasm)"
	}

	VSCode = container "VS Code Extension" {
		technology "TypeScript"
		description "Editor extension (extension/)"
	}

	Book = container "Documentation" {
		technology "mdBook"
		description "This site (book/)"
	}

	// Internal Dependencies
	CLI -> Language "parses DSL using"
	CLI -> Engine "validates using"
	CLI -> WASM "builds"

	WASM -> Language "embeds"
	WASM -> Engine "embeds"

	VSCode -> Language "uses LSP"
	VSCode -> WASM "uses for LSP and preview"

	Book -> WASM "uses for diagram blocks"
}

User -> Sruja.CLI "runs commands"
User -> Sruja.VSCode "writes DSL"
User -> Sruja.Book "reads docs"

BrowserSystem = system "Web Browser" {
	description "User's web browser environment"
  tags ["external"]
	LocalStore = database "Local Storage"
}

// ADRs
ADR001 = adr "Use WASM for Client-Side Execution" {
	status "Accepted"
	context "We need to run validation and parsing in the browser and VS Code without a backend server."
	decision "Compile the Rust core engine to WebAssembly."
	consequences "Ensures consistent logic across all platforms but increases build complexity."
}

// Deployment
deployment Production "Production Environment" {
  node GitHubPages "GitHub Pages" {
    containerInstance RootSystem
  }
}

GitHubSystem = system "GitHub Platform" {
  description "Source control, CI/CD, and hosting"
  Actions = container "GitHub Actions" {
    technology "YAML/Node"
    description "CI/CD workflows"
  }
  Pages = container "GitHub Pages" {
    technology "Static Hosting"
    description "Hosts documentation site"
  }
  Releases = container "GitHub Releases" {
    technology "File Hosting"
    description "Hosts CLI binaries"
  }
  Actions -> Pages "deploys to"
  Actions -> Releases "publishes to"
}


User -> GitHubSystem "pushes code to"


// Component Stories
CLIStory = story "Using the CLI" {
  User -> Sruja.CLI "runs validate"
  Sruja.CLI -> Sruja.Language "parses DSL"
  Sruja.CLI -> Sruja.Engine "validates"
  Sruja.CLI -> User "reports diagnostics"
}

VSCodeStory = story "Using VS Code" {
  User -> Sruja.VSCode "edits .sruja file"
  Sruja.VSCode -> Sruja.WASM "LSP and preview"
  Sruja.WASM -> Sruja.VSCode "diagnostics and diagram"
  Sruja.VSCode -> User "shows errors and preview"
}

CIDev = scenario "Continuous Integration (Dev)" {
  User -> GitHubSystem "pushes to main"
  GitHubSystem -> GitHubSystem.Actions "triggers CI"
  GitHubSystem.Actions -> Sruja "builds & tests"
  GitHubSystem.Actions -> GitHubSystem.Pages "deploys dev site"
}

ReleaseProd = scenario "Production Release" {
  User -> GitHubSystem "merges PR to prod"
  GitHubSystem -> GitHubSystem.Actions "triggers release"
  GitHubSystem.Actions -> GitHubSystem.Pages "deploys prod site"
  GitHubSystem.Actions -> Sruja.VSCode "publishes extension"
  GitHubSystem.Actions -> GitHubSystem.Releases "publishes CLI binaries"
}

view index {
  title "Complete System View"
  include *
}

Key Components

Core Engine (Rust)

The sruja-language and sruja-engine crates form the foundation. They define the DSL grammar, parse input files into an AST (Abstract Syntax Tree), and run validation rules (like cycle detection and layer enforcement).

WebAssembly (WASM)

The Rust core is compiled to WebAssembly (sruja-wasm). The same parsing and validation logic runs in:

  • VS Code Extension: For local preview without needing a CLI binary.
  • Documentation site: For "Show diagram" in code blocks (like the one above).

CLI & CI/CD

The sruja CLI (sruja-cli) is a static binary that wraps the core engine. It supports:

  • Local development: sruja fmt, sruja lint, sruja export.
  • CI/CD: Validate and export architecture in pipelines.
  • Export: sruja export json, sruja export mermaid, sruja export markdown, sruja export context, sruja export dsl.

Architecture Intelligence

Sruja provides architecture intelligence across four progressive layers:

┌─────────────────────────────────────────────────────────────┐
│  Layer 4: Intent                                            │
│  "What did we intend vs what exists?"                       │
│  Commands: sruja drift -a, sruja intent check               │
├─────────────────────────────────────────────────────────────┤
│  Layer 3: Semantic                                          │
│  "What does this mean? (vocabulary, patterns)"              │
│  Commands: sruja analyze --semantic                         │
├─────────────────────────────────────────────────────────────┤
│  Layer 2: Structural                                        │
│  "What exists? (components, deps, metrics)"                 │
│  Commands: sruja scan, sruja quickstart, sruja discover    │
├─────────────────────────────────────────────────────────────┤
│  Layer 1: Syntactic                                         │
│  "Is the DSL valid?"                                        │
│  Commands: sruja lint                                       │
└─────────────────────────────────────────────────────────────┘

Each layer builds on the previous:

  • Syntactic: Is the .sruja file valid? (lint)
  • Structural: What components and dependencies exist? (scan, discover)
  • Semantic: What patterns and relationships mean? (analyze)
  • Intent: Does reality match declared architecture? (drift, intent check)

AI Skill Multiplies Intelligence

The sruja-architecture skill enhances all four layers:

  • Syntactic – CLI only: sruja lint; with skill: pattern-aware DSL generation
  • Structural – CLI only: sruja scan; with skill: evidence-based discovery
  • Semantic – CLI only: sruja analyze; with skill: patterns and trade-offs
  • Intent – CLI only: sruja drift; with skill: multi-perspective review

Install the skill to unlock AI-powered architecture intelligence:

npx skills add https://github.com/sruja-ai/sruja --skill sruja-architecture

Examples & Patterns

Theory is good, but code is better. Below are production-grade Sruja models that you can copy, paste, and adapt.

Every example here follows our "FAANG-level" quality standards:

  1. Clear Requirements: Functional & Non-functional.
  2. Proper Hierarchies: Context -> Container -> Component.
  3. Real Tech Stacks: No generic "Database" boxes.

1. Banking System (Fintech)

Note

Ideally Suited For: Highly regulated industries requiring audit trails, security policies, and strict latency SLAs.

Scenario: A regional bank needs to modernize its legacy mainframe interactions while providing a slick mobile experience.

Why review this example?

  • Security: Uses policy blocks for PCI-DSS.
  • Hybrid Cloud: Connects modern Cloud Containers to an on-premise "Mainframe" System.
  • Complexity: Models the "Legacy Core" vs "Modern Interface" pattern often seen in enterprise.

import { * } from 'sruja.ai/stdlib'


// --- REQUIREMENTS ---
// We start with the 'Why'. These drive the architecture.
R1 = requirement functional "Customers must be able to view balances"
R2 = requirement functional "Customers can transfer money internally"
R3 = requirement security "All PII must be encrypted at rest (PCI-DSS)"
R4 = requirement stability "99.99% Availability (Target: <52m downtime/year)"

// --- ACTORS ---
Customer = person "Banking Customer" {
    description "A holder of one or more accounts"
}

// --- SYSTEMS ---
BankingSystem = system "Internet Banking Platform" {
    description "Allows customers to view information and make payments."

    // Containers (Deployable units)
    WebApp = container "Single Page App" {
        technology "React / TypeScript"
    }

    MobileApp = container "Mobile App" {
        technology "Flutter"
    }

    API = container "Main API Gateway" {
        technology "Java / Spring Boot"
        description "Orchestrates calls to core services"
    }

    Database = container "Main RDBMS" {
        technology "PostgreSQL"
        tags ["database", "storage"]
    }

    // Relationships
    WebApp -> API "Uses (JSON/HTTPS)"
    MobileApp -> API "Uses (JSON/HTTPS)"
    API -> Database "Reads/Writes (JDBC)"
}

// --- EXTERNAL SYSTEMS ---
Mainframe = system "Legacy Core Banking" {
    tags ["external"] // This is outside our scope of control
    description "The heavy iron that stores the actual money."
}

EmailSystem = system "Email Service" {
    tags ["external"]
    description "SendGrid / AWS SES"
}

// --- INTEGRATIONS ---
Customer -> BankingSystem.WebApp "Views dashboard"
BankingSystem.API -> Mainframe "Syncs transactions (XML/SOAP)"
BankingSystem.API -> EmailSystem "Sends alerts"

view index {
    include *
}

👉 Deep Dive this Architecture using our Course


2. Global E-Commerce Platform

Note

Ideally Suited For: High-scale B2C applications. Focuses on caching, asynchronous processing, and eventual consistency.

Scenario: An Amazon-like store preparing for Black Friday traffic spikes.

Why review this example?

  • Scalability: Explains how to handle high reads (Product Catalog) vs transactional writes (Checkout).
  • Async Messaging: Shows usage of queues/topics (Apache Kafka) to decouple services.
  • Caching: Strategic placement of Redis caches.

import { * } from 'sruja.ai/stdlib'


R1 = requirement scale "Handle 100k concurrent users"
R2 = requirement performance "Product pages load in <100ms"

ShopScale = system "E-Commerce Platform" {

    // --- EDGE LAYER ---
    CDN = container "Content Delivery Network" {
        technology "Cloudflare"
        description "Caches static assets and product images"
    }

    LoadBalancer = container "Load Balancer" {
        technology "NGINX"
    }

    // --- SERVICE LAYER ---
    Storefront = container "Storefront Service" {
        technology "Node.js"
        description "SSR for SEO-friendly product pages"
    }

    Checkout = container "Checkout Service" {
        technology "Rust"
        description "Handles payments and inventory locking"
    }

    // --- DATA LAYER ---
    ProductCache = container "Product Cache" {
        technology "Redis Cluster"
        description "Stores hot product data"
    }

    MainDB = database "Product Database" {
        technology "MongoDB"
        description "Flexible schema for diverse product attributes"
    }

    OrderQueue = queue "Order Events" {
        technology "Kafka"
        description "Async order processing pipeline"
    }

    // --- FLOWS ---
    CDN -> LoadBalancer "Forwards dynamic requests"
    LoadBalancer -> Storefront "Routes traffic"
    Storefront -> ProductCache "Read-through cache"
    Storefront -> MainDB "Cache miss / heavy query"

    // The Checkout Flow
    Checkout -> OrderQueue "Publishes 'OrderCreated'"
}

view index {
    include *
}

What Next?

Beginner Path: Use AI Skills for Architecture

Don't learn a language—use AI to generate and maintain architecture.

If you just did Quick Start, you have the CLI and skill installed. These 7 steps show how to use AI to maintain your architecture effectively.

Tip

Track your progress: Check off each step as you complete it. Takes ~2–3 hours total.


Step 1: Generate Your First Architecture ⏱️ 20–30 min

What you'll do: Let AI analyze your code and create repo.sruja

Instructions:

  1. Open your project in your AI editor
  2. Run this prompt:
Use sruja-architecture. Run `sruja discover --context -r . --format json`,
gather evidence, identify systems and containers,
generate repo.sruja with C4 context and container levels,
then run `sruja lint` and fix until it passes.
  3. Wait for AI to ask 2–3 questions
  4. Answer the questions (be specific: "Production database is PostgreSQL")
  5. Review the generated repo.sruja file

Success check: You have a repo.sruja file that passes sruja lint

What you learned:

  • How AI analyzes your codebase
  • What questions it asks
  • What a valid repo.sruja looks like

Step 2: Understand What Was Generated ⏱️ 15–20 min

What you'll do: Read and understand your AI-generated architecture

Instructions:

  1. Open repo.sruja in your editor
  2. Read it section by section:
    • person entries = external actors
    • system entries = your major boundaries
    • container entries = deployable units
    • -> lines = relationships
  3. Run sruja tree repo.sruja to see structure visually
  4. Ask AI to explain:
Read repo.sruja and explain:
1. What systems are defined?
2. How do containers connect?
3. What external actors use this system?

Success check: You can explain your architecture in plain English

What you learned:

  • C4 structure (person → system → container)
  • How to read relationships
  • What each component does

Step 3: Validate and Fix Errors ⏱️ 15–20 min

What you'll do: Learn to validate and fix issues with AI help

Instructions:

  1. Run sruja lint repo.sruja
  2. If you see errors, copy them to clipboard
  3. Ask AI:
I got these lint errors. Fix them in repo.sruja:
[paste errors here]

Common errors:

| Error | What It Means | How to Fix |
|---|---|---|
| E204: Circular dependency | A → B → A | Remove one edge in the cycle |
| E205: Orphan element | Something with no connections | Add relationships or remove it |
| E201: Invalid kind | Unknown type | Use person, system, container, database |
  4. Re-run sruja lint until clean

Success check: sruja lint repo.sruja shows "All checks passed"

What you learned:

  • How validation works
  • Common architecture errors
  • How to ask AI to fix issues
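As a sketch of the kind of cycle a circular-dependency error flags (hypothetical names, not real lint output), consider two containers that each depend on the other; the fix is to drop or reroute one edge:

```
import { * } from 'sruja.ai/stdlib'

App = system "App" {
  API    = container "API"
  Worker = container "Worker"
}

// A cycle: API -> Worker -> API
App.API -> App.Worker "enqueues jobs"
App.Worker -> App.API "calls back"   // removing this edge breaks the cycle
```

In practice you might replace the removed edge with a queue between the two containers rather than deleting the interaction outright.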

Step 4: Make a Specific Change ⏱️ 20–30 min

What you'll do: Direct AI to add or modify something specific

Instructions:

Think of a change you want. Examples:

  • Add component: "Add a logging service container"
  • Add relationship: "Connect the new logger to all existing containers"
  • Modify description: "Update the API container description to say it handles webhooks"

Then ask AI:

Use sruja-architecture. Read repo.sruja and:
[your change here]
Then run sruja lint and fix any errors.

Success check: Your change is added and sruja lint passes

What you learned:

  • How to make targeted updates
  • To validate after each change
  • That iteration is normal (AI can refine it)
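As an illustration of what a targeted change might produce, here is a hypothetical "add a logging service" edit using the container and technology syntax from the language reference (the names and the "Vector" technology choice are assumptions, not tool output):

```
import { * } from 'sruja.ai/stdlib'

App = system "App" {
  API    = container "API"
  Logger = container "Logging Service" {   // the targeted addition
    technology "Vector"                    // hypothetical stack choice
  }
}

App.API -> App.Logger "Ships logs"
```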

Step 5: Export for Your Team ⏱️ 10–15 min

What you'll do: Share architecture with stakeholders

Instructions:

  1. Export Markdown:
sruja export markdown repo.sruja > ARCHITECTURE.md
  2. Export diagram:
sruja export mermaid repo.sruja > ARCHITECTURE.mmd
  3. Open ARCHITECTURE.md and review
  4. Open ARCHITECTURE.mmd in Mermaid Live Editor

Success check: You can send ARCHITECTURE.md to your team and they understand it

What you learned:

  • Different export formats
  • How to share architecture
  • That diagrams are output, not the source

Step 6: Detect and Fix Drift ⏱️ 20–30 min

What you'll do: Keep architecture in sync as code changes

Instructions:

  1. Make a small change to your actual code

    • Add a new function
    • Rename a file
    • Move a module
  2. Run drift detection:

sruja drift -r . --format json
  3. Ask AI:
Use sruja-architecture. Here's drift output:
[paste JSON]
Analyze what changed and update repo.sruja to match current code.
  4. Review the changes AI made
  5. Run sruja lint repo.sruja

Success check: sruja drift shows no issues (or you've addressed them)

What you learned:

  • How to detect changes over time
  • To let AI update architecture automatically
  • That architecture stays in sync with code

Step 7: Explore Patterns with AI ⏱️ 15–20 min

What you'll do: Use AI as an architecture advisor

Instructions:

Pick a pattern and ask AI to explain it:

Monolith:

Use sruja-architecture. Explain the monolith pattern:
When to use it, pros/cons, and show an example in Sruja syntax.

Microservices:

Use sruja-architecture. Explain microservices:
When to split, trade-offs, and generate an example architecture.

Event-driven:

Use sruja-architecture. Show me an event-driven architecture
with Kafka, producers, and consumers. Explain trade-offs.

Success check: You understand different patterns and when to use them

What you learned:

  • Architectural patterns
  • When each pattern makes sense
  • That AI can be your advisor, not just generator

Tips for Success

| Practice | Why It Helps |
|---|---|
| Be specific | "Add logging to API container" works better than "Improve architecture" |
| Validate often | Run sruja lint after each AI edit—catch mistakes early |
| Ask questions | "Why did you model it this way?" helps you learn |
| Start simple | Get context + container levels right, add detail later |
| Trust evidence | If sruja discover doesn't find something, tell your AI—don't let it guess |
| Iterate | First attempt won't be perfect—refine with feedback |

Common Mistakes to Avoid

Don't: "Generate full architecture for everything"

Do: Start with context + container levels, add components when needed

Don't: Guess at missing information

Do: Ask targeted questions, list what's unknown

Don't: Skip validation

Do: Always run sruja lint after AI edits

Don't: Model for completeness

Do: Model what evidence supports—minimal is better


What's Next?

Practice:

  • Try different codebases (your own or open-source examples)
  • Practice making specific changes
  • Compare different patterns with AI


Quick Reference

| Want to | How |
|---|---|
| Generate architecture | Prompt AI from Step 1 |
| Validate | sruja lint repo.sruja |
| Export docs | sruja export markdown repo.sruja > doc.md |
| Export diagram | sruja export mermaid repo.sruja > diagram.mmd |
| Check drift | sruja drift -r . --format json |
| Fix errors | Paste lint output to AI and say "fix these" |
| Explain pattern | "Explain [pattern] with pros/cons and example" |

Remember: You're not learning a language. You're learning how to:

  1. Guide AI with clear requests
  2. Validate results
  3. Iterate and refine
  4. Use AI as an advisor

The skill handles the syntax—you handle the thinking.

Frequently Asked Questions

New to Sruja? Start here.


Installation

Do I need to know programming to use Sruja?

Not really. You need:

  • Basic comfort with terminals (running commands)
  • A codebase to document (any language)
  • An AI editor (Cursor, Copilot, Claude, etc.)

You don't write the .sruja files manually—your AI does it.

What editors work with Sruja?

AI editors with skill support:

  • Cursor ✅ (built-in support)
  • GitHub Copilot ✅ (requires skills.sh)
  • Claude ✅ (requires skills.sh)
  • Continue.dev ✅ (requires skills.sh)
  • Any editor with skills.sh support

Manual use:

  • VS Code with Sruja extension (syntax highlighting, preview)
  • Terminal (run commands manually, write .sruja by hand)

What's the difference between the CLI and the skill?

| CLI | Skill |
|---|---|
| Analyzes code | Generates .sruja files |
| Validates files | Knows syntax and patterns |
| Exports diagrams | Interprets and refines |
| No AI required | Requires AI editor |

You need both: CLI for validation, skill for generation.

Can I use Sruja without an AI editor?

Yes, but it's more work.

You can:

  • Run sruja discover and manually create .sruja files
  • Use the VS Code extension for syntax highlighting
  • Write .sruja by hand using the language reference

However, AI makes it 10x easier. Consider using Cursor or installing skills.sh for your current editor.


Workflow

How long does it take to generate architecture?

First time: 5-10 minutes

  • AI analyzes code
  • AI asks 2-3 questions
  • AI generates and validates

Updates: 2-5 minutes

  • AI reads existing file
  • AI makes your change
  • AI validates

With practice: Most updates are 1-2 minutes

What information does AI need from me?

Usually just 2-3 questions:

  1. System boundaries: "What are the main systems?"
  2. External actors: "Who or what calls your system?"
  3. Deployment: "How is this deployed?"

The AI gets everything else from your code.

What if I don't know the answer to a question?

That's fine! Tell your AI:

I don't know the answer to "What's the deployment model?"
Please assume [state your best guess or leave as TODO].

The AI can generate a TODO or make reasonable assumptions, which you can refine later.


Files and Syntax

What's the difference between repo.sruja and architecture.sruja?

They're the same thing, just different names.

  • repo.sruja – Recommended for new projects
  • architecture.sruja – Supported for backward compatibility

Use whichever you like—Sruja detects both.

Where should I put repo.sruja?

Repository root (same folder as package.json, pom.xml, etc.)

Example structure:

my-project/
├── src/
├── package.json
├── README.md
└── repo.sruja          ← Put it here

Can I have multiple .sruja files?

Yes. Common patterns:

  • One file per system: user-service.sruja, order-service.sruja
  • One per module: auth.sruja, api.sruja, db.sruja
  • Nested: arch/repo.sruja, arch/services.sruja

Sruja will detect any .sruja file when you run commands.
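As a hypothetical illustration of the one-file-per-system pattern, a user-service.sruja might contain just that system's slice of the model (all names here are invented for the example):

```
import { * } from 'sruja.ai/stdlib'

// user-service.sruja — one system per file
UserService = system "User Service" {
  API = container "User API"
  DB  = database "User DB"
}

UserService.API -> UserService.DB "Reads/Writes"
```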

Do I need to import from stdlib?

Recommended: yes.

import { * } from 'sruja.ai/stdlib' gives you standard types:

  • person, system, container, database, queue
  • Without needing to define them manually

You can still define types manually (see Quick Start), but imports save time.
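For reference, the manual alternative declares the kinds yourself, as in the Views chapter's syntax example; the stdlib import makes these declarations unnecessary:

```
// Manual kind definitions — equivalent to what the stdlib import provides
person = kind "Person"
system = kind "System"
container = kind "Container"
database = kind "Database"

User = person "User"
```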


Validation

What does sruja lint check?

| Check | What It Finds | Severity |
|---|---|---|
| Syntax errors | Invalid keywords, bad structure | High |
| Circular dependencies | A → B → A loops | Medium |
| Orphan elements | Components with no connections | Low |
| Missing fields | Required description/technology missing | Medium |

Run sruja lint repo.sruja before committing changes.

What if I see errors?

Copy and paste to AI:

Fix these lint errors in repo.sruja:
[paste errors]

Common fixes:

| Error | AI Prompt |
|---|---|
| Circular | "Remove one relationship in the cycle" |
| Orphan | "Add a relationship to this component or remove it" |
| Missing field | "Add a description field to this container" |

Can I ignore lint errors?

Not recommended. Lint errors indicate real issues:

  • Your architecture might not match reality
  • Team members might misunderstand the design
  • Documentation might be incomplete

Fix errors or add a comment explaining why it's okay.


Using with AI

How specific should I be with prompts?

More specific = better results.

Too vague:

Improve my architecture.

Specific:

Add error handling to the API container.
Update the description to mention rate limiting.

What if AI generates something wrong?

Don't panic. This is normal.

  1. Tell AI: "This isn't quite right. Here's what I want: [describe]"
  2. Or: "Undo that change. Try again with these constraints: [list]"
  3. Or: Make the change yourself, then ask AI to validate

Validation catches mistakes, and iteration is expected.

Can AI handle large codebases?

Yes. Sruja scales:

  • Small (10-100 files): Full analysis in seconds
  • Medium (100-1000 files): Full analysis in 1-2 minutes
  • Large (1000+ files): May take several minutes

AI can handle any size—it just processes more data for larger projects.


Comparison to Other Tools

How is Sruja different from diagramming tools (Miro, LucidChart)?

| Diagramming Tools | Sruja |
|---|---|
| Manual drawing | AI generates from code |
| Drifts from reality | Always in sync |
| Can't validate | Linting catches errors |
| Hard to version control | Git diff shows changes |
| Output, not source | Single source of truth |

Think of it this way: Diagrams are like screenshots—great for viewing, but not the source.

How is Sruja different from architecture frameworks (Spring, DDD)?

Sruja doesn't replace frameworks—it documents them.

  • Frameworks (Spring, DDD, Clean Architecture) – Guide how to write code
  • Sruja – Documents what your code actually does

You can use both: Sruja documents your Spring-based microservices architecture.


AI and Architecture

Why do I need the Sruja skill when I can just ask AI for architecture?

AI proposes; Sruja grounds, validates, and persists.

Without Sruja, AI can suggest architecture, but it's:

  • Ungrounded: AI may invent components or dependencies that don't exist in your code
  • Ephemeral: You get ad-hoc text or diagrams with no single source of truth

The Sruja skill gives AI:

  • Real evidence (from your codebase) to reason about
  • Validation (lint, drift) to ensure accuracy
  • Persistence (repo.sruja as a first-class artifact in your repo)

Three pillars:

  1. Grounding — The skill feeds the model real data (discover, context, graph) so it reasons on what's actually in the codebase, not on guesses. Without Sruja, the model can hallucinate modules and edges.
  2. Validation and sync — The skill uses sruja lint (valid DSL), sruja drift (declared vs actual), and intent check. So you get architecture that is valid and stays in sync with code, not a one-off diagram.
  3. Persistence and reuse — Architecture lives as repo.sruja in the repo: versionable, exportable, comparable over time. The model without Sruja gives ad-hoc text or Mermaid; with Sruja, the output is a first-class artifact the whole team and CI can use.

Think of it this way: AI is the architect proposing designs. Sruja is the structural engineer providing real data, validating plans, and making sure the blueprint is saved in your repo.


Language Support

What programming languages does Sruja support?

| Language | Support | Notes |
|---|---|---|
| JavaScript / TypeScript | ✅ Excellent | Best support |
| Python | ✅ Excellent | Strong framework detection (Django, Flask, FastAPI) |
| Go | ✅ Excellent | Great for microservices |
| Rust | ✅ Excellent | Native language support |
| Java | ✅ Good | Spring Boot, Jakarta EE |
| C# | ✅ Good | .NET applications |
| Ruby | ✅ Good | Rails and Ruby apps |
| PHP | ✅ Good | PHP frameworks |
| Scala | ⚠️ Limited | Partial support |

What if my language isn't listed?

Try it! Run:

sruja discover -r . --format json

If you see reasonable output, Sruja can analyze it. If not, please open an issue.


Troubleshooting

"sruja: command not found"

The CLI isn't on your PATH.

Try:

# Check if installed
which sruja

# Add to PATH (if using install script)
export PATH="$HOME/.local/bin:$PATH"

# Or restart your terminal

Skill not loading in my editor

Common causes:

  1. Didn't restart editor after installing
  2. Editor doesn't support skills.sh
  3. Typed wrong skill name

Fixes:

  • Restart your editor
  • Try: npx skills list to see installed skills
  • Check editor has skills.sh support

"Command fails: directory not found"

You're in the wrong folder.

Make sure:

# You should be IN your project
cd your-project

# Not AT the Sruja repo
# ❌ Wrong: cd ~/sruja
# ✅ Right: cd ~/my-project

Discovery shows nothing

Possible causes:

  1. No supported files – Empty directory or wrong language
  2. Wrong directory – Not in the project root
  3. Language limitation – Partial support for that language

Try:

# See what's detected
sruja discover -r . --format json

# Try quickstart for better visibility
sruja quickstart -r .

Best Practices

Start small

Do:

  • Get person + system + container levels working first
  • Add components only when needed
  • Model what evidence supports

Don't:

  • Model everything you can think of
  • Add components "for completeness"
  • Over-complicate simple systems

Validate often

Do:

  • Run sruja lint after each AI edit
  • Check before committing
  • Fix errors before calling it "done"

Don't:

  • Skip validation to "save time"
  • Commit known errors with a TODO

Iterate with AI

Do:

  • Treat first attempt as draft, not final
  • Provide feedback: "Good, but add X"
  • Ask AI to refine specific parts

Don't:

  • Expect perfection on first try
  • Give up if it's not exactly right
  • Start over from scratch instead of refining

Document decisions

Do:

  • Add descriptions to components
  • Note why you chose a pattern
  • Explain trade-offs

Don't:

  • Leave components unlabeled
  • Assume everyone knows your intent

Getting Help

How do I report a bug?

  1. Check if it's already reported: https://github.com/sruja-ai/sruja/issues
  2. If not, create a new issue with:
    • Steps to reproduce
    • What you expected
    • What actually happened
    • Your environment (OS, editor, language)

How do I contribute?

See Contributing Guide for guidelines.


Quick Command Reference

| Command | When to Use | Example |
|---|---|---|
| sruja quickstart -r . | First look at a codebase | Instant overview |
| sruja discover -r . | When generating with AI | Gather evidence |
| sruja lint repo.sruja | After AI edits | Validate file |
| sruja export mermaid | Need a diagram | Generate for docs |
| sruja export markdown | Need readable docs | Export ARCHITECTURE.md |
| sruja drift -r . | After code changes | Detect what's different |

Key Takeaways

  1. You don't learn a language—you guide AI
  2. Validation is your friend—run lint often
  3. Start minimal—add detail only when needed
  4. Iterate—refine with feedback
  5. Use AI as advisor—ask about patterns and trade-offs

The skill handles syntax. You handle thinking.

Overview

Use overview to provide a concise system description shown in docs/exports.

Syntax

import { * } from 'sruja.ai/stdlib'


overview {
title "E‑Commerce Platform"
summary "Web, API, and DB supporting browse, cart, and checkout"
}

view index {
include *
}

Guidance

  • Keep summary short and practical; avoid marketing language.
  • Use overview at architecture root; prefer description inside elements for details.

Architecture

The architecture block is the root element of a Sruja model. It represents the entire scope of what you are modeling.

Syntax

import { * } from 'sruja.ai/stdlib'


// ... define systems, persons, etc. here

view index {
include *
}

Minimal Example

For simple examples, you can use a minimal structure:

import { * } from 'sruja.ai/stdlib'


MySystem = system "My System"
User = person "User"

Purpose

  • Scope Boundary: Everything inside is part of the model.
  • Naming: Gives a name to the overall architecture.

The C4 Model

Sruja is built on the C4 model, a hierarchical approach to software architecture diagrams. If you are new to architecture-as-code, it helps to understand these four levels of abstraction.

Think of it like Google Maps for your code: you can zoom out to see the whole world (System Context), or zoom in to see individual streets (Code).

The 4 Levels

1. System Context (Level 1)

"The Big Picture"

This is the highest level of abstraction. It shows your software system as a single box, and how it interacts with users and other systems (for example, email systems or payment gateways).

  • Goal: What is the system, who uses it, and how does it fit into the existing IT landscape?
  • Audience: Everyone (Technical & Non-Technical).
import { * } from 'sruja.ai/stdlib'


App = system "My App"
User = person "Customer"
Stripe = system "Payment Gateway"

User -> App "Uses"
App -> Stripe "Process Payments"

2. Container (Level 2)

"The High-Level Technical Building Blocks"

Note: In C4, a "Container" is NOT a Docker container. It represents a deployable unit—something that runs separately. Examples include:

  • A Single-Page Application (SPA)

  • A Mobile App

  • A Server-side API application

  • A Database

  • A File System

  • Goal: What are the major technical choices? How do they communicate?

  • Audience: Architects, Developers, Ops.

import { * } from 'sruja.ai/stdlib'


App = system "My App" {
    Web = container "React App"
    API = container "Rust Service"
    DB = database "PostgreSQL"
}

3. Component (Level 3)

"The Internals"

Zooming into a Container to see the major structural building blocks. In an API, these might be your controllers, services, or repositories.

  • Goal: How is the container structured?
  • Audience: Developers.

4. Code (Level 4)

"The Details"

The actual classes, interfaces, and functions. Sruja focuses mainly on Levels 1, 2, and 3, as Level 4 is best managed by your IDE.

Key Relationships

The power of C4 is in the Hierarchical nature.

  • A System defines the boundary.
  • Containers live inside a System.
  • Components live inside a Container.

When you define a relationship at a lower level (e.g., API -> DB), Sruja automatically understands the relationship at higher levels (e.g., App -> DB is implied).
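The implied-relationship behavior described above can be sketched as follows (hypothetical names; the only declared edge is at the container level):

```
import { * } from 'sruja.ai/stdlib'

App = system "My App" {
  API = container "API"
}
DB = system "Shared Database"

// Declared at the container level:
App.API -> DB "Reads/Writes"
// At the system-context level, App -> DB is implied automatically;
// no extra relationship line is needed.
```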

Why use C4?

  1. Shared Vocabulary: "Component" and "Service" often mean different things to different teams. C4 standardizes this.
  2. Zoom Levels: Avoids the "one giant messy diagram" problem. You can view the system at the level of detail relevant to you.

System

A System represents a software system, which is the highest level of abstraction in the C4 model. A system delivers value to its users, whether they are human or other systems.

Syntax

import { * } from 'sruja.ai/stdlib'


ID = system "Label/Name" {
description "Optional description"

// Link to ADRs
adr ADR001

// ... contains containers
}

Example

import { * } from 'sruja.ai/stdlib'


BankingSystem = system "Internet Banking System" {
description "Allows customers to view accounts and make payments."
}

Container

A Container represents an application or a data store. It is something that needs to be running in order for the overall software system to work.

Note: In C4, "Container" does not mean a Docker container. It means a deployable unit like:

  • Server-side web application (e.g., Java Spring, ASP.NET Core)
  • Client-side web application (e.g., React, Angular)
  • Mobile app
  • Database schema
  • File system

Syntax

import { * } from 'sruja.ai/stdlib'


ID = container "Label/Name" {
technology "Technology Stack"
tags ["tag1", "tag2"]
// ... contains components
}

Example

import { * } from 'sruja.ai/stdlib'


BankingSystem = system "Internet Banking System" {
WebApp = container "Web Application" {
  technology "Java and Spring MVC"
  tags ["web", "frontend"]
}
}

Scaling Configuration

Containers can define horizontal scaling properties using the scale block:

import { * } from 'sruja.ai/stdlib'


API = container "API Service" {
technology "Rust, Axum"
scale {
  min 3
  max 10
  metric "cpu > 80%"
}
}

Scale Block Fields

  • min (optional): Minimum number of replicas
  • max (optional): Maximum number of replicas
  • metric (optional): Scaling metric trigger (e.g., "cpu > 80%", "memory > 90%")

This helps document your auto-scaling strategy and can be used by deployment tools.

Component

A Component is a grouping of related functionality encapsulated behind a well-defined interface. Components reside inside Containers.

Syntax

import { * } from 'sruja.ai/stdlib'


ID = component "Label/Name" {
technology "Technology"
// ... items
}

Example

import { * } from 'sruja.ai/stdlib'


AuthController = component "Authentication Controller" {
technology "Spring MVC Rest Controller"
description "Handles user login and registration."
}

Person

A Person represents a human user of your software system (e.g., "Customer", "Admin", "Employee").

Syntax

import { * } from 'sruja.ai/stdlib'


ID = person "Label" {
  description "Optional description"
  tags ["tag1", "tag2"]
}

Example

import { * } from 'sruja.ai/stdlib'


Customer = person "Bank Customer" {
  description "A customer of the bank with personal accounts."
}

Relations

Relations describe how elements interact with each other. They are the lines connecting the boxes in your diagram.

Syntax

<!-- partial -->
import { * } from 'sruja.ai/stdlib'


// Relations use element IDs
Source -> Destination "Label"
// When referring to nested elements, use fully qualified names:
System.Container -> System.Container.Component "Label"

Or with a technology/protocol:

<!-- partial -->
Source -> Destination "Label" {
technology "HTTPS/JSON"
}

Example

import { * } from 'sruja.ai/stdlib'


BankingSystem = system "Internet Banking System" {
WebApp = container "Web Application"
DB = database "Database"
}

User = person "User"

User -> BankingSystem.WebApp "Visits"
BankingSystem.WebApp -> BankingSystem.DB "Reads Data"

Use clear, unique IDs to reference relation endpoints.


Views

Define views to customize what elements appear and how they render.

Syntax

person = kind "Person"
system = kind "System"
container = kind "Container"
database = kind "Database"

App = system "Application" {
  WebApp = container "Web App"
  API = container "API"
  DB = database "Database"
}

User = person "User"

User -> App.WebApp "Uses"
App.WebApp -> App.API "Calls"
App.API -> App.DB "Reads/Writes"

view api_focus of App {
  title "API Focus"
  include App.API App.DB
  exclude App.WebApp
}

styles {
  element "Database" { shape "cylinder" color "#3b82f6" }
  relationship "Calls" { color "#ef4444" }
}

view index {
  include *
}

Guidance

  • Use include to spotlight critical paths; use exclude to reduce noise.
  • Keep view names descriptive (e.g., "API Focus", "Data Flow").
  • Use view styles for legibility: color important relations, reshape data stores.

See also:

  • relations for edges
  • style block for global defaults

Validation

Sruja validates your model to catch issues early.

Common Checks

  • Unique IDs within scope
  • Valid references (relations connect existing elements)
  • Cycles (informational; feedback loops are valid)
  • Layering violations (dependencies must flow downward)
  • External boundary checks
  • Simplicity guidance (non‑blocking)

Example

import { * } from 'sruja.ai/stdlib'


User = person "User"
App = system "App" {
  WebApp = container "Web App"
  API = container "API"
  DB = database "Database"
}

// Valid relations (qualified cross-scope)
User -> App.WebApp "Uses"
App.WebApp -> App.API "Calls"
App.API -> App.DB "Reads/Writes"

view index {
include *
}

Run sruja validate locally or in CI to enforce these rules.


Deployment

The Deployment view allows you to map your software containers to infrastructure. This corresponds to the C4 Deployment Diagram.

Deployment Node

A Deployment Node is something like physical hardware, a virtual machine, a Docker container, a Kubernetes pod, etc. Nodes can be nested.

Syntax

<!-- partial -->
deployment "Environment" {
    node "Node Name" {
        // ...
    }
}

Infrastructure Node

An Infrastructure Node represents infrastructure software that isn't one of your containers (e.g., DNS, Load Balancer, External Database Service).

Syntax

<!-- partial -->
node "App Server" {
    containerInstance WebApp
}

Container Instance

A Container Instance represents a runtime instance of one of your defined Containers running on a Deployment Node.

Syntax

<!-- partial -->
containerInstance ContainerID {
    instanceId 1 // Optional
}

Example

<!-- partial -->
deployment "Production" {
    node "AWS" {
        node "US-East-1" {
            node "App Server" {
                containerInstance WebApp
            }
            node "Database Server" {
                containerInstance DB
            }
        }
    }
}

Requirements

Use requirement to capture functional, performance, security, and constraint requirements. Requirements are declared at the architecture root only.

Syntax

import { * } from 'sruja.ai/stdlib'


// Requirements using flat syntax
R1 = requirement functional "Support 10k concurrent users"
R2 = requirement performance "p95 < 200ms for /checkout"
R3 = requirement security "PII encrypted at rest"
R4 = requirement constraint "Only PostgreSQL managed service"
R5 = requirement nonfunctional "System must be maintainable"

view index {
  include *
}

Guidance

  • Keep requirement titles concise and testable.
  • Reference requirements in ADRs and scenarios where relevant.
  • Validate with sruja lint to surface unmet or conflicting requirements.
  • Declarations at system/container/component level are deprecated and ignored by exporters and UI.

See also:

  • scenario for behavior walkthroughs
  • slo for targets and windows
  • adr for decision records

Scenario

Scenarios describe behavioral flows as ordered steps. They focus on interactions rather than data pipelines.

Syntax

import { * } from 'sruja.ai/stdlib'


Customer = person "Customer"
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API"
  DB = database "Database"
}

// Scenarios using flat syntax
CheckoutFlow = scenario "User Checkout" {
  step Customer -> Shop.WebApp "adds items to cart"
  step Shop.WebApp -> Shop.API "submits cart"
  step Shop.API -> Shop.DB "validates and reserves stock"
  step Shop.API -> Shop.WebApp "returns confirmation"
  step Shop.WebApp -> Customer "displays success"
}

// 'story' is an alias for 'scenario'
LoginStory = story "User Login" {
  step Customer -> Shop.WebApp "enters credentials"
  step Shop.WebApp -> Shop.API "validates user"
}

view index {
  include *
}

Aliases & Semantics

Sruja provides three keywords that are structurally identical (sharing the same underlying AST definition and syntax) but convey different semantic intent:

  • scenario: Models behavioral flows (e.g., Use Cases, User Journeys).
  • story: An alias for scenario (e.g., User Stories).
  • flow: Models data movement (e.g., Data Flow Diagrams).

While the syntax is the same, using the appropriate keyword helps readers understand the nature of the interaction being modeled.

import { * } from 'sruja.ai/stdlib'


Customer = person "Customer"
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API"
  DB = database "Database"
}

// Scenario: User behavior
Checkout = scenario "User Checkout" {
  Customer -> Shop.WebApp "adds items to cart"
  Shop.WebApp -> Shop.API "submits cart"
}

// Flow: Data flow
OrderProcess = flow "Order Processing" {
  Customer -> Shop "Order Details"
  Shop -> Shop.API "Processes"
  Shop.API -> Shop.DB "Save Order"
}

When to use:

  • Use scenario for user journeys, business processes, and behavioral flows
  • Use flow for data pipelines, ETL processes, and system-to-system data flows

Tips

  • Keep step labels short and action‑oriented
  • Use fully qualified names when referring outside the current context
  • Use scenario or story for behavior; use flow for data flows; use relations for structure


Architecture Decision Records (ADR)

Sruja allows you to capture Architecture Decision Records (ADRs) directly within your architecture model. This keeps the "why" close to the "what".

Syntax

Defining an ADR

You can define an ADR with a full body describing the context, decision, and consequences.

import { * } from 'sruja.ai/stdlib'


ADR001 = adr "Use PostgreSQL" {
status "Accepted"
context "We need a relational database with strong consistency guarantees."
decision "We will use PostgreSQL 15."
consequences "Good ecosystem support, but requires managing migrations."
}

Linking ADRs

You can link an ADR to the elements it affects (System, Container, Component) by referencing its ID inside the element's block.

import { * } from 'sruja.ai/stdlib'


Backend = system "Backend API" {
  adr ADR001   // Link to the ADR defined above
}

Optional Title

The title is optional if you are just referencing an ADR or if you want to define it later.

adr = kind "ADR"
ADR003 = adr "Deferred Decision"

Fields

  • ID: Unique identifier (e.g., ADR001).
  • Title: Short summary of the decision.
  • Status: Current status (e.g., Proposed, Accepted, Deprecated).
  • Context: The problem statement and background.
  • Decision: The choice made.
  • Consequences: The pros, cons, and implications of the decision.

Policy

Policies define architectural rules, standards, and constraints that your system must follow. They help enforce best practices, compliance requirements, and organizational standards directly in your architecture model.

Syntax

<!-- partial -->
PolicyID = policy "Description" {
  category "category-name"
  enforcement "required" // "required" | "recommended" | "optional"
  description "Detailed description"
  metadata {
    // Additional metadata
  }
}

Simple Policy

import { * } from 'sruja.ai/stdlib'


SecurityPolicy = policy "Enforce TLS 1.3 for all external communications"

view index {
  include *
}

Policy Fields

  • ID: Unique identifier for the policy (e.g., SecurityPolicy, GDPR_Compliance)
  • Description: Human-readable description of the policy
  • category (optional): Policy category (e.g., "security", "compliance", "performance")
  • enforcement (optional): Enforcement level ("required", "recommended", "optional")
  • description (optional): Detailed description within the policy body
  • metadata (optional): Additional metadata key-value pairs

Example: Security Policies

import { * } from 'sruja.ai/stdlib'


TLSEnforcement = policy "All external communications must use TLS 1.3" {
  category "security"
  enforcement "required"
}

EncryptionAtRest = policy "Sensitive data must be encrypted at rest" {
  category "security"
  enforcement "required"
}

BankingApp = system "Banking App" {
  API = container "API Service"
  CustomerDB = database "Customer Database"
}

view index {
  include *
}

Example: Compliance Policies

import { * } from 'sruja.ai/stdlib'


HIPAACompliance = policy "Must comply with HIPAA regulations" {
  category "compliance"
  enforcement "required"
  description "All patient data must be encrypted and access logged"
}

DataRetention = policy "Medical records retained for 10 years" {
  category "compliance"
  enforcement "required"
}

view index {
  include *
}

Policy Categories

Common policy categories include:

  • security: Security standards and practices
  • compliance: Regulatory and legal requirements
  • performance: Performance standards and SLAs
  • observability: Monitoring, logging, and metrics requirements
  • architecture: Architectural patterns and principles
  • data: Data handling and privacy requirements

Enforcement Levels

  • required: Policy must be followed (non-negotiable)
  • recommended: Policy should be followed (best practice)
  • optional: Policy is a guideline (suggested)

Benefits

  • Documentation: Policies are part of your architecture, not separate documents
  • Validation: Can be validated against actual implementations
  • Communication: Clear standards for development teams
  • Compliance: Track regulatory and organizational requirements
  • Governance: Enforce architectural decisions and patterns

Note on Rules

The rule keyword inside policies is not yet implemented. For now, policies serve as documentation and can be validated manually or through external tooling.

See Also

Syntax Reference

Elements

<!-- partial -->
import { * } from 'sruja.ai/stdlib'


ID = person "Label"
ID = system "Label" { ... }
ID = container "Label" { ... }
ID = database "Label" { ... }
ID = queue "Label" { ... }
ID = component "Label" { ... }

Relations

<!-- partial -->
Source -> Target "Label"
// Use fully qualified names when referring to nested elements:
System.Container -> System.API "Label"
System.Container.Component -> System.API.Component "Label"

Metadata

overview {
  summary "Syntax Reference Overview"
}

MySystem = system "MySystem" {
  metadata {
    team "Platform"
    tier "critical"
  }
}

Deployment

<!-- partial -->
deployment Prod {
  node Cloud {
    node Region {
      node Service {
        containerInstance Web
      }
    }
  }
}

Architecture Patterns

Request/Response

import { * } from 'sruja.ai/stdlib'


App = system "App" {
  Web = container "Web"
  API = container "API"
  DB = database "Database"
}

App.Web -> App.API "Calls"
App.API -> App.DB "Reads/Writes"

view index {
  include *
}

Event-Driven

<!-- partial -->
import { * } from 'sruja.ai/stdlib'


Orders = system "Order System" {
  OrderSvc = container "Order Service"
  PaymentSvc = container "Payment Service"
}

Orders.OrderSvc -> Orders.PaymentSvc "OrderCreated event"
Orders.PaymentSvc -> Orders.OrderSvc "PaymentConfirmed event"

view index {
  include *
}

Saga

<!-- partial -->
import { * } from 'sruja.ai/stdlib'


Orders = system "Order System" {
  OrderSvc = container "Order Service"
  InventorySvc = container "Inventory Service"
  PaymentSvc = container "Payment Service"
}

CreateOrderSaga = scenario "Order Creation Saga" {
  Orders.OrderSvc -> Orders.InventorySvc "Reserves stock"
  Orders.InventorySvc -> Orders.OrderSvc "Confirms reserved"
  Orders.OrderSvc -> Orders.PaymentSvc "Charges payment"
  Orders.PaymentSvc -> Orders.OrderSvc "Confirms charged"
}

view index {
  include *
}

CQRS

import { * } from 'sruja.ai/stdlib'


App = system "App" {
  CommandAPI = container "Command API"
  QueryAPI = container "Query API"
  ReadDB = database "Read Database"
  WriteDB = database "Write Database"
}

App.CommandAPI -> App.WriteDB "Writes"
App.QueryAPI -> App.ReadDB "Reads"

view index {
  include *
}

RAG (Retrieval-Augmented Generation)

import { * } from 'sruja.ai/stdlib'


AIQA = system "AI Q&A" {
  Indexer = container "Indexer"
  Retriever = container "Retriever"
  Generator = container "Generator"
  VectorDB = database "Vector Store"
}

AIQA.Indexer -> AIQA.VectorDB "Writes embeddings"
AIQA.Retriever -> AIQA.VectorDB "Searches"
AIQA.Generator -> AIQA.Retriever "Fetches contexts"

See book/valid-examples/pattern-rag-pipeline.sruja for a production-ready model.

Agentic Orchestration

import { * } from 'sruja.ai/stdlib'


AgentSystem = system "Agent System" {
  Orchestrator = container "Agent Orchestrator"
  Planner = container "Planner"
  Executor = container "Executor"
  Tools = container "Tooling API"
  Memory = database "Long-Term Memory"
}

AgentSystem.Orchestrator -> AgentSystem.Planner "Plans tasks"
AgentSystem.Orchestrator -> AgentSystem.Executor "Executes steps"
AgentSystem.Executor -> AgentSystem.Tools "Calls tools"
AgentSystem.Executor -> AgentSystem.Memory "Updates state"

view index {
  include *
}

See book/valid-examples/pattern-agentic-ai.sruja for a complete agent graph.

Sruja Cheatsheet

Elements

import { * } from 'sruja.ai/stdlib'

User = person "User"
App = system "App" {
  Web = container "Web"
  API = container "API"
  DB = database "DB"
}
User -> App.Web "Uses"
App.Web -> App.API "Calls"
App.API -> App.DB "Reads/Writes"

view index {
  include *
}

Tip

The import { * } from 'sruja.ai/stdlib' line provides all standard kinds. You can also declare kinds manually if needed: person = kind "Person", system = kind "System", etc.
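For example, the cheatsheet model could start like this without the stdlib import (a sketch using the manual kind syntax from the tip above):

```
// Declare kinds explicitly instead of importing the stdlib
person = kind "Person"
system = kind "System"
container = kind "Container"

User = person "User"
App = system "App" {
  Web = container "Web"
}
User -> App.Web "Uses"
```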

Component

import { * } from 'sruja.ai/stdlib'

App = system "App" {
  Web = container "Web" {
    Cart = component "Cart"
  }
}

Scenario

import { * } from 'sruja.ai/stdlib'

User = person "User"

App = system "App" {
  Web = container "Web"
  API = container "API"
  DB = database "Database"
}

scenario Checkout "Checkout Flow" {
  User -> App.Web "adds items"
  App.Web -> App.API "validates"
  App.API -> App.DB "stores order"
}

Deployment

<!-- partial -->
deployment Prod {
  node Cloud {
    node Region {
      node Service {
        containerInstance App.Web
      }
    }
  }
}

Try it

Use the VS Code extension to paste these snippets into a .sruja file and see the diagram preview.

Sruja Adoption Guide

Using Sruja in your repo

For a short, practical guide (install CLI, add to your project, CI, AI, multi-repo), see Using Sruja in your project. The rest of this adoption guide helps you evaluate fit and plan rollout.

Is Sruja Right for Your Organization?

Quick Self-Assessment

Answer these questions to determine if Sruja addresses your needs:

Architecture & Documentation Pain Points

  • Do your architecture diagrams become outdated within weeks?
  • Do engineers spend significant time maintaining documentation?
  • Is there confusion about "the latest architecture diagram"?
  • Do new engineers struggle to understand system architecture?
  • Are architectural decisions lost when senior engineers leave?

If 3+ are "Yes" → Sruja can help

Compliance & Governance Needs

  • Do you need to comply with regulations (HIPAA, SOC2, PCI-DSS, GDPR)?
  • Are compliance audits time-consuming and risky?
  • Do you struggle to prove architectural controls meet requirements?
  • Are security policies documented but not enforced?
  • Do you need to demonstrate compliance to auditors?

If 2+ are "Yes" → Sruja's policy-as-code is valuable

Technical Architecture Challenges

  • Do you have microservices that need governance?
  • Are you experiencing architectural drift (implementation vs. design)?
  • Do you need to enforce service boundaries and dependencies?
  • Are circular dependencies causing issues?
  • Do you need to generate infrastructure from architecture?

If 2+ are "Yes" → Sruja's validation and enforcement help

DevOps & Engineering Culture

  • Do you use Git/GitOps workflows?
  • Do you have CI/CD pipelines?
  • Do you value "everything as code" (IaC, GitOps)?
  • Do you want architecture changes in PR reviews?
  • Do you need architecture to integrate with Terraform/Istio/etc.?

If 3+ are "Yes" → Sruja fits your workflow

Organization Size & Maturity

Sruja is ideal for:

  • Startups (10-50 engineers): Fast scaling, need consistency
  • Scale-ups (50-200 engineers): Managing complexity, compliance needs
  • Enterprises (200+ engineers): Governance, compliance, knowledge management

Sruja may not be ideal if:

  • ❌ You have < 5 engineers (overhead may outweigh benefits)
  • ❌ You don't use version control or CI/CD
  • ❌ You prefer visual-only tools (no code/DSL)
  • ❌ You have no compliance or governance requirements

Decision Framework

Step 1: Define Your Goals

What problem are you trying to solve?

| Goal | Sruja Benefit | Priority |
|---|---|---|
| Reduce documentation overhead | Architecture-as-code stays current | High |
| Ensure compliance | Policy-as-code with automated validation | High |
| Prevent architectural drift | Automated validation in CI/CD | Medium |
| Faster onboarding | Living documentation in codebase | Medium |
| Enforce service boundaries | Layer and dependency validation | Medium |
| Generate infrastructure | Terraform/OpenTofu generation (roadmap) | Low |

Action: Rank your top 3 goals. Sruja should address at least 2.

Step 2: Calculate Value & ROI

Note: Sruja is free and open source. This ROI calculation measures time savings and value, not purchase cost.

Quick Value Calculator:

Time Savings = (Engineers × Hours/Week × 0.7) × 50 weeks × $100/hour
Onboarding Savings = (New Engineers/Year × 2 weeks × 0.5) × $150k/year ÷ 50
Risk Reduction = Compliance Failures Avoided × $100k

Total Value = Time Savings + Onboarding + Risk Reduction

Example (10 senior engineers, 20 new engineers/year):

  • Time: 10 × 4 hours × 0.7 × 50 × $100 = $140k/year
  • Onboarding: 20 × 2 × 0.5 × $150k ÷ 50 = $60k/year
  • Risk: 1 failure avoided = $100k (one-time)
  • Total Value: $200k+ per year

ROI: Since Sruja is free, there is no licence cost to recover – the value above is offset only by the time your team invests in adoption.
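As a sanity check, the example arithmetic above can be reproduced in a few lines of Python (all figures are the illustrative assumptions from the example, not benchmarks):

```python
# Time savings: 10 senior engineers, 4 architecture/doc hours per week each,
# 70% of that time saved, 50 working weeks, $100/hour loaded rate.
engineers, hours_per_week, saved, weeks, rate = 10, 4, 0.7, 50, 100
time_savings = engineers * hours_per_week * saved * weeks * rate
print(round(time_savings))  # 140000

# Onboarding: 20 new engineers/year, 2-week ramp-up, 50% faster,
# $150k/year salary spread over 50 weeks.
new_engineers, onboard_weeks, reduction, salary = 20, 2, 0.5, 150_000
onboarding_savings = new_engineers * onboard_weeks * reduction * salary / 50
print(round(onboarding_savings))  # 60000

risk_reduction = 100_000  # one compliance failure avoided (one-time)
print(round(time_savings + onboarding_savings))  # 200000 per year, plus the one-time risk figure
```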

Step 3: Assess Technical Fit

Evaluate your technical stack:

| Technology | Sruja Integration | Status |
|---|---|---|
| Git/GitHub/GitLab | Native integration | ✅ Available |
| CI/CD (GitHub Actions, GitLab CI) | Validation in pipelines | ✅ Available |
| Terraform/OpenTofu | Infrastructure generation | 🚧 Roadmap (Phase 2) |
| Kubernetes/Istio | Service mesh config generation | 🚧 Roadmap (Phase 3) |
| API Gateways (Kong, Apigee) | Config generation | 🚧 Roadmap (Phase 3) |
| OPA (Open Policy Agent) | Policy integration | 🚧 Roadmap (Phase 2) |

Action:

  • If you need Git/CI/CD integration → ✅ Ready now
  • If you need Terraform/Istio/OPA → 🚧 On roadmap (see Roadmap Discussions) — you can pilot with current features now

Evaluation Process

Phase 1: Discovery (Week 1)

Activities:

  1. Review Sruja documentation
  2. Install CLI: curl -fsSL https://sruja.ai/install.sh | bash (or from Git: cargo install sruja-cli --git https://github.com/sruja-ai/sruja; or build from source: git clone https://github.com/sruja-ai/sruja.git && cd sruja && make build)
  3. Model a simple existing system
  4. Install VS Code extension for syntax highlighting and diagnostics

Deliverable: Understanding of Sruja capabilities

Phase 2: Proof of Concept (Weeks 2-4)

Activities:

  1. Model 1-2 real systems in Sruja
  2. Integrate validation into CI/CD
  3. Document architecture decisions as ADRs
  4. Measure time savings

Success Criteria:

  • Can model systems accurately
  • Validation catches real issues
  • Team sees value
  • Time savings measurable

Deliverable: PoC report with value estimate

Phase 3: Pilot (Months 2-3)

Activities:

  1. Roll out to 1-2 teams
  2. Establish best practices
  3. Create internal documentation
  4. Measure compliance improvements

Success Criteria:

  • Architecture stays current
  • Compliance validation working
  • Team adoption > 80%
  • Positive value demonstrated

Deliverable: Pilot report with go/no-go recommendation

Decision Checklist

Must-Have Requirements

  • Problem Fit: Sruja addresses 2+ of your top goals
  • Value Positive: Calculated value > $100k/year (or equivalent time savings)
  • Technical Fit: Git/CI/CD integration available (or roadmap acceptable)
  • Team Readiness: Team comfortable with code-based tools
  • Leadership Support: Time allocated for adoption (no budget needed - Sruja is free)

Nice-to-Have Requirements

  • Advanced features needed (Terraform, Istio, OPA)
  • Compliance requirements (HIPAA, SOC2, PCI-DSS)
  • Large team (100+ engineers)
  • Microservices architecture

Decision Matrix

| Criteria | Weight | Your Score (1-5) | Weighted Score |
|---|---|---|---|
| Problem fit | 30% | ___ | ___ |
| Value/ROI | 25% | ___ | ___ |
| Technical fit | 20% | ___ | ___ |
| Team readiness | 15% | ___ | ___ |
| Leadership support | 10% | ___ | ___ |
| Total | 100% | | ___/5.0 |

Decision Rule:

  • > 4.0: Strong fit → Proceed with pilot
  • 3.5-4.0: Good fit → Consider pilot
  • < 3.5: Weak fit → Reassess or wait
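The weighted total is straightforward to compute; a minimal sketch, where the sample scores are hypothetical:

```python
# Weights from the decision matrix above; scores (1-5) are made-up examples.
weights = {
    "Problem fit": 0.30,
    "Value/ROI": 0.25,
    "Technical fit": 0.20,
    "Team readiness": 0.15,
    "Leadership support": 0.10,
}
scores = {"Problem fit": 5, "Value/ROI": 4, "Technical fit": 4,
          "Team readiness": 3, "Leadership support": 4}

total = sum(weights[c] * scores[c] for c in weights)
print(round(total, 2))  # 4.15

# Apply the decision rule from above.
if total > 4.0:
    verdict = "Strong fit - proceed with pilot"
elif total >= 3.5:
    verdict = "Good fit - consider pilot"
else:
    verdict = "Weak fit - reassess or wait"
print(verdict)  # Strong fit - proceed with pilot
```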

Common Concerns & Objections

"We already have architecture documentation"

Response: Sruja doesn't replace documentation — it makes it executable. Your documentation becomes code that:

  • Stays current (version-controlled)
  • Validates automatically
  • Enforces policies
  • Integrates with DevOps

"Our team isn't technical enough for a DSL"

Response: Sruja's DSL is designed for all developers:

  • 1st-year CS students productive in 10 minutes
  • Progressive disclosure (simple → advanced)
  • Rich error messages guide users
  • VS Code extension with full LSP support (autocomplete, go-to-definition, rename, find references, and more) - see VS Code Extension Guide

"We don't have compliance requirements"

Response: Sruja provides value beyond compliance:

  • Faster onboarding (50% reduction)
  • Reduced documentation time (20-30%)
  • Architectural validation (prevents drift)
  • Knowledge preservation

"The roadmap features we need aren't ready"

Response:

  • Core features (validation, CI/CD) are available now
  • Roadmap features (Terraform, Istio, OPA) are planned for Phase 2-3 (see Roadmap Discussions)
  • You can start with core features and add advanced later
  • Early adoption gives you influence on roadmap priorities

Success Metrics

Track These KPIs

| Metric | Baseline | Target (3 months) | Target (6 months) |
|---|---|---|---|
| Documentation time | X hours/week | X × 0.7 hours/week | X × 0.5 hours/week |
| Onboarding time | X weeks | X × 0.7 weeks | X × 0.5 weeks |
| Architecture freshness | X% outdated | < 10% outdated | < 5% outdated |
| Compliance violations | X per quarter | X × 0.5 per quarter | 0 per quarter |
| Architectural issues caught | X in production | X × 0.3 in production | X × 0.1 in production |
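Given a measured baseline, the multiplier targets follow mechanically; a small sketch with hypothetical baseline figures:

```python
# Derive 3-month and 6-month KPI targets from a measured baseline
# using the multipliers from the KPI table. Baselines are placeholders.
doc_hours_baseline = 6.0  # hours/week currently spent on documentation

target_3m = round(doc_hours_baseline * 0.7, 1)  # 3-month target
target_6m = round(doc_hours_baseline * 0.5, 1)  # 6-month target
print(target_3m, target_6m)  # 4.2 3.0
```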

Next Steps

Immediate Actions

  1. Complete Self-Assessment (above)
  2. Calculate Value (Step 2)
  3. Try Sruja (see Getting Started)
  4. Join Community (GitHub Discussions)

Decision Timeline

  • Week 1: Self-assessment and value calculation
  • Week 2-4: Proof of concept
  • Month 2-3: Pilot program
  • Month 4+: Full rollout (if successful)

Resources

Open Source & Community Support

Sruja is free and open source (Apache 2.0 licensed), developed by and for the community. You can:

  • Use it freely: No licensing fees or restrictions
  • Contribute: Submit PRs, report issues, suggest features
  • Extend it: Build custom validators, exporters, and integrations
  • Join the community: Participate in GitHub Discussions, share use cases, and learn from others

Professional Services

While Sruja is open source and free to use, professional consulting services are available for organizations that need:

  • Implementation support: Help rolling out Sruja across teams and systems
  • Best practices guidance: Establish architectural governance patterns and workflows
  • Custom integrations: Integrate Sruja with existing CI/CD, infrastructure, and monitoring tools
  • Training: Team training on Sruja DSL, validation patterns, and architectural modeling
  • Custom development: Build custom validators, exporters, or platform integrations

Contact the team through GitHub Discussions to discuss your needs.

Future Platform Vision

Sruja is designed to evolve into a comprehensive platform for architectural governance:

  • Live System Review: Compare actual runtime behavior against architectural models to detect drift and violations.
  • Gap Analysis: Automatically identify missing components, undocumented dependencies, and architectural gaps.
  • Continuous Validation: Monitor production systems against architectural policies and constraints in real time.
  • Compliance Monitoring: Track and report on architectural compliance across services and deployments.

These capabilities are planned for future releases. The current open source foundation provides the building blocks for this evolution, and community feedback helps shape the roadmap.


Note: This guide helps you evaluate whether Sruja is the right fit for your organization and how to adopt it successfully.

Ready to evaluate Sruja? Start with the Self-Assessment above.

Adoption Playbook

Week 1: Baseline & CI

  • Create a minimal architecture.sruja covering core systems.
  • Add sruja fmt and sruja lint to CI; fail on violations.
  • Export docs: sruja export markdown architecture.sruja.

Week 2: Targets & Guardrails

  • Add slo and scale for critical paths.
  • Encode constraints and conventions; publish to teams.
  • Introduce views for API/Data/Auth focus.

Week 3: Governance & Evolution

  • Add policy pages for security/operability.
  • Document decisions with adr blocks; track evolution with slo values (target vs current).
  • Use Git for automatic change tracking (git log, git diff, version tags).
  • Wire linting to PR checks; require green builds.

CI Example (GitHub Actions)

name: sruja
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - name: Lint DSL
        run: sruja lint architecture.sruja
      - name: Export Docs
        run: sruja export markdown architecture.sruja

Success Metrics

  • Review cycle time ↓
  • Incident rate for architecture errors ↓
  • Consistency across services ↑

Note: Sruja is free and open source (Apache 2.0 licensed). Need help with implementation? Professional consulting services are available. Contact the team through GitHub Discussions to learn more.

Using Sruja in Your Project

This guide is for teams and organizations that want to use Sruja in their own repositories to enhance their code: architecture-as-code, validation in CI, and AI-assisted generation with consistent rules.

What you get

  • Architecture as code – .sruja files in Git; no separate diagram tool to keep in sync.
  • Validation – sruja lint catches undefined refs, circular dependencies, missing fields, orphans.
  • AI-friendly – Rules and skills so Cursor, Copilot, etc. generate valid Sruja and better architecture.
  • CI – Fail PRs when architecture is invalid; optional export to Markdown/JSON/Mermaid for docs.

1. Install (your machine and/or CI)

CLI

Option A – install script (downloads from GitHub Releases):

curl -fsSL https://sruja.ai/install.sh | bash

Option B – from Git (requires Rust):

cargo install sruja-cli --git https://github.com/sruja-ai/sruja

Option C – build from source:

git clone https://github.com/sruja-ai/sruja.git && cd sruja && make build

Ensure the install directory is on your PATH (install script uses ~/.local/bin by default; Option B uses ~/.cargo/bin; Option C uses target/release).

Check:

sruja --help
sruja lint --help

VS Code extension

Install Sruja Language Support from the VS Code Marketplace (or Open VSX). You get syntax highlighting, LSP diagnostics, and optional diagram preview for .sruja files.


2. Add Sruja to your repo (5 minutes)

Step 1: Create or add architecture

# From your repo root
sruja init my-service
# Creates: my-service.sruja, .cursorrules, .copilot-instructions.md, .architecture-skill.md

Or add a single file, e.g. architecture.sruja or docs/architecture.sruja, and define your systems/containers/relationships (see Language specification and the Examples Gallery).

Step 2: AI editor integration (so AI-generated code follows rules)

The files created by sruja init are enough for most teams:

  • .cursorrules – Cursor uses this for Sruja DSL rules.
  • .copilot-instructions.md – GitHub Copilot uses this.
  • .architecture-skill.md – Short pointer; optional full skill: npx skills add sruja-ai/sruja --skill sruja-architecture.

Commit these so everyone (and CI) has the same setup. See AI editor integration in the repo for details.

Step 3: Validate in CI

Your repository won't contain the Sruja monorepo, so install the CLI in CI from Git, then run lint.

GitHub Actions example:

name: Validate Sruja

on:
  push:
    branches: [main, develop]
    paths:
      - '**/*.sruja'
  pull_request:
    branches: [main, develop]
    paths:
      - '**/*.sruja'

jobs:
  sruja:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable

      - name: Install Sruja CLI
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja --locked

      - name: Lint all .sruja files
        run: |
          find . -name '*.sruja' -not -path './target/*' | while read f; do
            echo "Linting $f"
            sruja lint "$f"
          done

Use --locked so the install matches the lockfile in the Sruja repo for reproducible CI.

Optional – export docs in CI:

      - name: Export architecture to Markdown
        run: |
          for f in $(find . -name '*.sruja' -not -path './target/*'); do
          out="${f}.md"
            sruja export markdown "$f" > "$out" || true
          done
      - name: Upload architecture docs
        uses: actions/upload-artifact@v4
        with:
          name: architecture-docs
          path: '**/*.sruja.md'

3. How this enhances your code

| Practice | How Sruja helps |
|---|---|
| PR reviews | CI fails if .sruja is invalid; reviewers see architecture changes in the diff. |
| Onboarding | New devs read .sruja and exported docs instead of hunting for "the" diagram. |
| AI-generated code | .cursorrules and Copilot instructions steer AI to valid DSL; sruja lint catches mistakes. |
| Compliance / governance | Policies and constraints in the DSL; lint enforces structure; export for auditors. |
| Multi-repo | Each repo can have its own architecture.sruja (or one per service); same CLI and CI pattern. |

4. Using Sruja across multiple repos

  • Per-repo – Each repository that owns a service or app can have its own .sruja file(s). Add the same CI job (install CLI from Git + sruja lint) and the same AI files (e.g. copy .cursorrules and .copilot-instructions.md from a template or run sruja init once and commit).
  • Central docs repo – Some teams keep a single "docs" or "architecture" repo with one or more .sruja files and run Sruja CI there; link to exported Markdown/JSON from other repos. Other repos don't need the CLI unless they also own architecture files.
  • Shared rules – Use the same sruja-architecture skill (npx skills add sruja-ai/sruja --skill sruja-architecture) across repos so AI and humans share the same patterns and trade-offs.
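As one possible shape for the "copy from a template" approach, a minimal shell sketch (the template and repo directory names are hypothetical):

```shell
# Sketch: copy shared AI rule files from a local template directory into each
# service repo checkout, so every repo gets the same Cursor/Copilot setup.
set -eu
mkdir -p templates service-a service-b
printf '# Sruja DSL rules\n' > templates/.cursorrules
printf '# Copilot instructions\n' > templates/.copilot-instructions.md

for repo in service-a service-b; do
  cp templates/.cursorrules templates/.copilot-instructions.md "$repo/"
done
```

Commit the copied files in each repo so CI and every developer share the setup.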

5. Where to go next

Sruja is open source. To report issues or suggest improvements, use GitHub Issues or Discussions.

Sruja Design Philosophy: The Unified Intuitive DSL

Objective

Create a modeling language that empowers all developers - from students to enterprise architects - to design systems with confidence, while naturally guiding them toward simplicity and preventing over-engineering.

Core Principles:

  1. Start simple, stay simple: A 1st-year CS student should be productive in 10 minutes, and advanced developers should be guided away from unnecessary complexity.
  2. Empower, don't restrict: The language should enable all developers, not limit them, but guide them toward good design.
  3. Approachability first: Complex concepts should be available but not encouraged unless truly needed.
  4. Prevent over-engineering: The language itself should make simple designs easier than complex ones.
  5. Systems thinking made simple: Enable holistic system understanding through intuitive syntax, without requiring complex theory.

Methodology Analysis

| Methodology | Core Concepts | Jargon Level | Student Intuition | Sruja Mapping |
|---|---|---|---|---|
| C4 | System, Container, Component | Low | "Boxes and lines" - easy to grasp. | system, container, component |
| DDD | Bounded Context, Aggregate, Entity, Value Object, Domain Event | High | "Aggregate Root" is confusing. "Value Object" is abstract. | Not currently supported |
| ER (DB) | Entity, Attribute, Relationship, Table, Column | Medium | "Entity" is standard. "Relationship" is clear. | data, datastore, -> (relation) |
| API (OpenAPI) | Path, Method, Schema, Property | Medium | "Endpoint" is clear. "Schema" is clear. | api, data (as schema) |
| DOD | Data, Struct, Array, Transform | Low | "Data" and "Struct" are very familiar to coders. | data, [] (arrays) |

The "Unified" Proposal

We need a set of keywords that map to these concepts without forcing the user to learn the specific theory first. The language should support progressive disclosure: simple concepts first, advanced concepts when needed.

1. Grouping (The "Container")

Problem: Different methodologies use different terms for logical boundaries.

| Methodology | Term | Sruja Keyword | When to Use |
|---|---|---|---|
| C4 | Container | container | Technical deployment boundary (e.g., "Web Server", "Database") |
| General | Grouping | module | Generic logical grouping (most intuitive) |

Decision:

  • module: Primary keyword for logical grouping. Familiar to Python/JS/Go developers.
  • container: For C4-style technical containers (deployment units).

Rationale: module is the most universal term. Students learn "modules" in their first programming course.

2. The "Thing" (Data Structure)

Problem: Entity vs Value Object vs Table vs Struct - all represent "data" but with different semantics.

| Methodology | Term | Sruja Keyword | Semantics |
|---|---|---|---|
| General | Entity/Struct | data (with id) | Has identity, mutable |
| General | Value/Struct | data (no id) | Immutable, defined by values |
| ER | Table/Entity | data or datastore | Persistent storage |
| DOD | Struct | data | In-memory structure |
| API | Schema | data | Request/response structure |

Decision: data is the unified keyword.

Rules:

  • If data has an id field → Implicitly an Entity (has identity)
  • If data has no id → Implicitly a Value Object (value-based)
  • If data is in a datastore → Implicitly a database table
  • If data is in an api → Implicitly a request/response schema

Rationale: Students understand "data" immediately. The semantics emerge from context, not explicit keywords.
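For instance, the id rule can be sketched with the module/data syntax used elsewhere in this chapter (element and field names are illustrative):

```
module Orders {
    data Order {          // has an id field -> treated as an entity
        id string
        total float
    }

    data Money {          // no id field -> treated as a value object
        amount float
        currency string
    }
}
```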

3. The "Action" (Behavior/Event)

Problem: Different types of actions need different modeling approaches.

| Type | Sruja Keyword | Purpose | Example |
|---|---|---|---|
| API Endpoint | api | External interface | REST endpoint, GraphQL query |
| Event | event | Something that happened | OrderPlaced, PaymentProcessed |
| Function/Method | (implicit in component) | Internal behavior | Business logic in components |

Decision:

  • api: Explicit API endpoints (students understand "API")
  • event: Events (something that happened)
  • Component behavior: Implicit (components contain behavior)

Rationale: Students learn APIs early. Events are intuitive ("something happened").

4. Relationships

Problem: How to model connections between elements?

Decision: Use arrow syntax -> for relationships.

User -> ShopAPI.WebApp "Uses"
ShopAPI.WebApp -> ShopAPI.Database "Reads/Writes"
Order -> Payment "Triggers"

Rationale: Arrows are universal. Everyone understands "A -> B" means "A relates to B".

Proposed Syntax: "The Universal Model"

Level 1: Beginner (C4 Style)

// Element kinds
person = kind "Person"
system = kind "System"
container = kind "Container"

// Elements
user = person "End User"

shop = system "Shop API" {
    webApp = container "Web Application" {
        technology "React"
    }

    db = container "PostgreSQL Database" {
        technology "PostgreSQL 14"
    }
}

// Relationships
user -> shop.webApp "Uses"
shop.webApp -> shop.db "Reads/Writes"

Level 2: Intermediate (Detailed Architecture)

// Element kinds
system = kind "System"
container = kind "Container"
component = kind "Component"
database = kind "Database"

// Architecture
ShopAPI = system "Shop API" {
    WebApp = container "Web Application" {
        technology "React"
        Cart = component "Shopping Cart"
        Checkout = component "Checkout Service"
    }

    API = container "API Gateway" {
        technology "Node.js"
    }

    DB = database "PostgreSQL Database" {
        technology "PostgreSQL 14"
    }
}

// Relationships
ShopAPI.WebApp -> ShopAPI.API "Calls"
ShopAPI.API -> ShopAPI.DB "Reads/Writes"

Level 3: Advanced (Governance + Operations)

// Element kinds
person = kind "Person"
system = kind "System"
container = kind "Container"
database = kind "Database"

// Elements
Customer = person "Customer"

Shop = system "E-commerce Shop" {
    description "High-performance e-commerce platform"

    API = container "API Gateway" {
        technology "Node.js"
        slo {
            latency {
                p95 "200ms"
                p99 "500ms"
            }
        }
        scale {
            min 3
            max 10
        }
    }

    DB = database "PostgreSQL" {
        technology "PostgreSQL 14"
    }
}

// Governance
R1 = requirement functional "Must support 10k concurrent users"
SecurityPolicy = policy "Encrypt all data" category "security" enforcement "required"

// Relationships
Customer -> Shop.API "Uses"
Shop.API -> Shop.DB "Reads/Writes"

Key Design Decisions

1. Progressive Disclosure

  • Beginner: Start with system, container, component (C4)
  • Intermediate: Add module, data, api, event (Unified)
  • Advanced: Use all features together for complex architectures

Rationale: Students can start simple and learn advanced concepts when needed.

2. Arrays: DOD-Style Syntax

Support [] syntax (e.g., items OrderItem[]) instead of just implied relationships.

Rationale: Very familiar to programmers. Makes data structures explicit.

3. Unified data Keyword

The data keyword represents data structures. The presence of an id field indicates an entity with identity.

Rationale: Reduces cognitive load. Students can model data structures without learning complex theory.

4. Explicit api Keyword

Model APIs alongside data to connect "Backend" to "Database".

Rationale: Students understand "APIs". This bridges the gap between data modeling and API design.

5. Context-Aware Semantics

The same keyword (data) means different things in different contexts:

  • In a module: Domain model
  • In a datastore: Database table
  • In an api: Request/response schema
  • In a component: Internal data structure

Rationale: One keyword, multiple interpretations based on context. Reduces vocabulary size.

Preventing Over-Engineering: Simplicity by Design

How Sruja Guides Toward Simplicity

1. Start Simple

  • Use system for technical/deployment modeling (C4 style)
  • Use module for logical grouping when needed
  • Keep it simple - don't add complexity unless necessary

2. Progressive Disclosure

  • Start with basic C4 concepts: system, container, component
  • Add module, data, api, event when you need more detail
  • Use only what you need for your use case

3. Natural Constraints

  • system syntax is straightforward for deployment modeling
  • The language guides you to use the right level of detail
  • Simple designs are easier to write than complex ones

4. Validation & Guidance (Future)

  • Warn if over-engineering simple systems
  • Help users choose the right level of detail
  • Guide toward simplicity

5. Clear Mental Models

  • system = "How is this deployed?" (Physical/Technical)
  • module = "How is this organized?" (Logical grouping)
  • Keep it focused on what you're actually modeling

Missing Concepts & Future Considerations

Currently Missing (but important):

  1. Constraints/Validation: How to express "email must be valid", "age > 0"?
  2. Relationships with Cardinality: User -> Order[1:*] (one-to-many)?
  3. Inheritance/Polymorphism: How to model "Payment extends Transaction"?
  4. Enums: status: OrderStatus where OrderStatus = [PENDING, COMPLETED, CANCELLED]
  5. Optional Fields: email?: string vs email string
  6. Defaults: status string = "PENDING"
  7. Computed Fields: total: float = items.sum(price * qty)

Recommendations:

  • Add enum keyword for enumerations
  • Support ? for optional fields: email? string
  • Support = for defaults: status string = "PENDING"
  • Consider constraint keyword for validation rules
  • Consider relationship syntax: User -> Order[1:*]
  • flow and scenario/story are already implemented for flow thinking (DFD and BDD-style)

Migration Path

From C4 to Sruja:

<!-- partial -->
// C4: System Context
system "E-Commerce System" {
    // C4: Container
    container "Web Application" {
        // C4: Component
        component "Order Controller"
    }
}

From Data Modeling to Sruja:

<!-- partial -->
// Data structures
module Orders {
    data Order {
        id string
        items OrderItem[]
    }

    data OrderItem {
        product_id string
        qty int
    }

    data ShippingAddress {
        street string
        city string
    }
}

From ER to Sruja:

<!-- partial -->
datastore Database {
    data User {
        id string
        email string
    }

    data Order {
        id string
        user_id string  // Foreign key relationship
    }
}

Systems Thinking: Simple and Intuitive

Goal: Empower developers to think about systems holistically - understanding how parts interact, boundaries, and emergent behavior - without requiring complex theory.

Core Systems Thinking Concepts (Simplified)

Systems thinking is about understanding:

  1. Parts and Relationships: How components connect and interact
  2. Boundaries: What's inside vs outside the system
  3. Flows: How information/data moves through the system
  4. Feedback Loops: How actions create reactions
  5. Context: The environment the system operates in

How Sruja Makes Systems Thinking Simple

1. Parts and Relationships (Already Built-In)

<!-- partial -->
system ShopAPI {
    container WebApp
    container Database
}

ShopAPI.WebApp -> ShopAPI.Database "Reads/Writes"

Simple Insight: Just draw boxes and connect them with arrows. The relationships show how parts interact.

2. Boundaries (Natural in Sruja)

<!-- partial -->
system ShopAPI {  // Inside boundary
    container WebApp
}

person User  // Outside boundary
User -> ShopAPI "Uses"

Simple Insight: system defines the boundary. person and external systems are outside. Clear and intuitive.

3. Flows (Built-In Flow Syntax)

<!-- partial -->
// Data Flow Diagram (DFD) style — use scenario
scenario OrderProcess "Order Processing" {
    Customer -> Shop.WebApp "Order Details"
    Shop.WebApp -> Shop.Database "Save Order"
    Shop.Database -> Shop.WebApp "Confirmation"
}

// User Story/Scenario style
story Checkout "User Checkout Flow" {
    User -> ECommerce.CartPage "adds item to cart"
    ECommerce.CartPage -> ECommerce "clicks checkout"
    ECommerce -> Inventory "Check Stock"
}

// Or using simple qualified relationships
Customer -> Shop.WebApp "Submits Order"
Shop.WebApp -> Shop.OrderService "Processes"
Shop.OrderService -> Shop.PaymentService "Charges"

Simple Insight: Use flow or scenario for data flows (DFD style), story for user stories, or simple relationships for basic flows. Events show what happens: event OrderPlaced.

4. Feedback Loops (Cycles in Relationships)

<!-- partial -->
// Simple feedback: User action triggers system response
User -> System "Requests"
System -> User "Responds"

// System feedback: Component A affects Component B, which affects A
ComponentA -> ComponentB "Updates"
ComponentB -> ComponentA "Notifies"

// Event-driven cycles: Service A triggers Service B, which triggers A
ServiceA -> ServiceB "Sends Event"
ServiceB -> ServiceA "Responds with Event"

// Mutual dependencies: Microservices that call each other
OrderService -> PaymentService "Charges Payment"
PaymentService -> OrderService "Confirms Payment"

Simple Insight: When arrows form a cycle, that's a feedback loop. The system responds to itself. Cycles are valid in many architectures:

  • Feedback loops: User interactions, system responses
  • Event-driven patterns: Services triggering each other via events
  • Mutual dependencies: Microservices that need to communicate bidirectionally
  • Bidirectional flows: API <-> Database (read/write operations)

Note: Sruja allows cycles - they're a natural part of system design. The validator will inform you about cycles but won't block them, as they're often intentional architectural patterns.
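To make the note above concrete, here is an illustrative sketch of how a cycle check over `->` relationships can work. This is not Sruja's actual implementation; the `relations` data and function names are invented for the example:

```python
# Illustrative only: detect a cycle in a set of "A -> B" relationships
# using depth-first search. NOT Sruja's actual validator code.

def find_cycle(relations):
    """Return one cycle as a list of node names, or None if acyclic."""
    graph = {}
    for src, dst in relations:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    visited, on_stack = set(), []

    def dfs(node):
        if node in on_stack:
            # Found a back-edge: slice out the cycle from the current path
            return on_stack[on_stack.index(node):] + [node]
        if node in visited:
            return None
        visited.add(node)
        on_stack.append(node)
        for nxt in graph[node]:
            cycle = dfs(nxt)
            if cycle:
                return cycle
        on_stack.pop()
        return None

    for node in graph:
        cycle = dfs(node)
        if cycle:
            return cycle
    return None

# A feedback loop between two services is reported, not rejected:
relations = [
    ("OrderService", "PaymentService"),
    ("PaymentService", "OrderService"),
    ("OrderService", "Notifier"),
]
cycle = find_cycle(relations)
if cycle:
    print("note: cycle detected:", " -> ".join(cycle))  # informational, not an error
```

The key design choice mirrors the note: a detected cycle produces an informational message rather than a validation failure, because cycles are often intentional.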

5. Context (Persons and External Systems)

<!-- partial -->
person Customer "End User"
person Admin "System Administrator"

system PaymentGateway "Third-party service" {
  tags ["external"]
}

Customer -> ShopAPI "Uses"
ShopAPI -> PaymentGateway "Processes payments"

Simple Insight: person and external systems show the context - who/what the system interacts with.

Progressive Systems Thinking

Beginner: Just model the parts and connections

// Element kinds
system = kind "System"
container = kind "Container"

// Elements
myApp = system "MyApp" {
    frontend = container "Frontend"
    backend = container "Backend"
}

// Relationships
myApp.frontend -> myApp.backend "Calls"

Intermediate: Add flows and events

<!-- partial -->
// Simple qualified relationships
user -> myApp.frontend "Clicks"
myApp.frontend -> myApp.backend "Sends request"
myApp.backend -> myApp.database "Saves"

<!-- partial -->
// DFD-style — use scenario
scenario OrderFlow "Order Processing" {
    user -> myApp.frontend "Submits"
    myApp.frontend -> myApp.backend "Processes"
    myApp.backend -> myApp.database "Stores"
}

Advanced: Model feedback loops and system behavior

<!-- partial -->
// Feedback loop: User action -> System response -> User sees result
story CompleteOrder "Order Completion Flow" {
    user -> shop.system "Submits"
    shop.system -> shop.database "Stores"
    shop.system -> user "Confirms"
}

// Complex flow with multiple steps — use scenario
scenario PaymentFlow "Payment Processing" {
    orders.orderService -> orders.paymentGateway "Charge"
    orders.paymentGateway -> orders.orderService "Confirms"
    orders.orderService -> user "Notifies"
}

Key Principle: No Jargon Required

  • Don't say: "Model the feedback loop using systems thinking principles"
  • Do say: "Use flow or story to show how data/actions move through the system"
  • Don't say: "Define the system boundary using context mapping"
  • Do say: "Use system to show what's inside, person to show who uses it"
  • Don't say: "Create a DFD (Data Flow Diagram)"
  • Do say: "Use flow to show how data moves between components"

Result: Developers naturally think in systems without learning theory first. The syntax guides them to see:

  • How parts connect (relationships: ->)
  • What's inside vs outside (boundaries: system vs person/external)
  • How things flow (flow for data flows, story/scenario for user stories, or simple -> relationships)
  • How actions create reactions (cycles in relationships, feedback in flows)

Additional Design Philosophy Assessment

After assessing various design philosophies (Event-Driven Architecture, Hexagonal Architecture, CQRS, BDD, Reactive Systems, etc.) through a strict lens of "does this help developers learn system design?", we found:

✅ Accepted: Simple & Valuable

  1. Flows and Scenarios: Already implemented! flow for data flows (DFD), scenario/story for user stories (BDD-style Given-When-Then)
  2. Optional Fields: Practical data modeling (email? string)
  3. Enums: Practical data modeling (status: OrderStatus)

❌ Rejected: Too Complex or Unnecessary

  • Hexagonal Architecture (Ports & Adapters) - Too abstract
  • Clean Architecture / Layers - Too theoretical
  • CQRS - Too specialized, can use existing api
  • Advanced Event-Driven - Current event is sufficient
  • Reactive Systems - Too complex
  • Actor Model - Too specialized
  • GraphQL/Protocol Buffers - Technology-specific
  • Semantic Web - Overkill
  • SOLID (as syntax) - Principles, not syntax

Note: Systems thinking is accepted - but implemented simply through existing syntax (relationships, boundaries, flows). No new keywords needed.

Key Finding: Most "advanced" concepts should be rejected. Only 3 simple additions are recommended, and everything else can wait until developers master the basics. Systems thinking is naturally supported through intuitive syntax.

Conclusion

By using system, module, data, api, and event, we cover 90% of use cases with words that a 1st-year CS student already knows.

Key Success Metrics:

  • ✅ Can a beginner model a simple system in 10 minutes? Yes (C4 style)
  • ✅ Can an intermediate model data + APIs? Yes (Unified style)
  • ✅ Can an advanced user model complex architectures? Yes (Extended features)
  • ✅ Does it prevent over-engineering? Yes (simplicity by design)
  • ✅ Is it approachable for all developers? Yes (progressive disclosure)

Next Steps:

  1. Add enum support
  2. Add optional fields (? syntax)
  3. Add relationship cardinality
  4. Add constraint/validation syntax
  5. flow and scenario/story are already implemented - enhance documentation and examples
  6. Improve error messages for beginners
  7. Add validation rules to guide simplicity

Key Principle: Less is more. Don't add complexity unless it clearly helps developers learn system design better. The goal is to build confidence through simplicity, not complexity through features.

Glossary

Quick reference for technical terms and concepts used throughout Sruja documentation.

A

ADR (Architecture Decision Record)

A document that captures an important architectural decision made along with its context and consequences. In Sruja, ADRs are defined using the adr keyword and can be linked to specific architecture elements.

Example:

ADR001 = adr "Use Microservices" {
  status "Accepted"
  context "Need to scale independently"
  decision "Adopt microservices architecture"
  consequences "Increased complexity but better scalability"
}

Architecture-as-Code

The practice of defining software architecture using code (text-based DSL) instead of static diagrams. This enables version control, validation, and automated documentation generation.

C

C4 Model

A hierarchical model for visualizing software architecture, created by Simon Brown. It consists of four levels:

  • Level 1 (Context): System context diagram showing the system and its users
  • Level 2 (Container): Container diagram showing high-level technical building blocks
  • Level 3 (Component): Component diagram showing internal structure of containers
  • Level 4 (Code): Code-level diagram (typically managed by IDEs, not Sruja)

Sruja is based on the C4 model and automatically generates C4-compliant diagrams.

Component

A structural element within a container that represents a major building block. Components are optional and provide additional detail when needed.

Example:

App = system "App" {
  API = container "API" {
    Auth = component "Authentication"
    Payment = component "Payment Processing"
  }
}

Container

A deployable unit within a system. In C4 terminology, a container is NOT a Docker container, but rather any separately deployable unit like:

  • Web applications
  • Mobile apps
  • Server-side applications
  • Databases
  • File systems

Example:

App = system "E-commerce" {
  Web = container "React App"
  API = container "Node.js API"
  DB = database "PostgreSQL"
}

D

Database

A type of container that represents a data store. In Sruja, databases are defined using the database keyword.

Example:

DB = database "PostgreSQL"

Note: Sruja also supports datastore as an alias, but database is the recommended term.

DSL (Domain-Specific Language)

A programming language specialized for a particular domain. Sruja DSL is a text-based language specifically designed for defining software architecture.

E

Element

Any architectural construct in Sruja: person, system, container, component, database, queue, etc. Elements are the building blocks of your architecture model.

K

Kind

A type definition for elements. Kinds can be imported from the standard library (import { * } from 'sruja.ai/stdlib') or declared manually for custom types.

Example:

// Using stdlib (recommended)
import { * } from 'sruja.ai/stdlib'

// Or declaring manually
microservice = kind "Microservice"

M

Metadata

Additional information attached to elements, such as team ownership, tier, or custom tags. Metadata helps with governance and organization.

Example:

App = system "My App" {
  metadata {
    team "Platform"
    tier "critical"
  }
}

P

Person

An actor or user of the system. Persons are defined at the context level and represent external users or roles.

Example:

User = person "Customer"
Admin = person "Administrator"

Q

Queue

A message queue or event stream used for asynchronous communication between containers.

Example:

EventQueue = queue "Kafka Topic"

R

Relation

A connection between two elements, showing how they interact. Relations are defined using the -> operator.

Example:

<!-- partial -->
User -> App.Web "Visits"
App.Web -> App.API "Calls"

Requirement

A functional or non-functional requirement that can be linked to architecture elements using tags. Requirements help trace business needs to technical implementation.

Example:

R001 = requirement "User Authentication" {
  description "Users must be able to log in securely"
  tags ["App.Web", "App.API"]
}

S

Scenario

A sequence of interactions that describe how users or systems interact to accomplish a goal. Scenarios help document user flows and system behavior.

Example:

<!-- partial -->
scenario Checkout "Checkout Flow" {
  User -> App.Web "adds items to cart"
  App.Web -> App.API "validates cart"
  App.API -> App.DB "stores order"
}

System

The highest-level element in C4 Level 1. A system represents a software system that delivers value to its users.

Example:

ECommerce = system "E-commerce Platform" {
  description "Online shopping platform"
}

stdlib (Standard Library)

The Sruja standard library that provides common element kinds (person, system, container, database, etc.). Importing from stdlib is the recommended way to use standard types.

Example:

import { * } from 'sruja.ai/stdlib'

T

Tag

A label that can be attached to elements, requirements, or ADRs to enable filtering, grouping, and traceability.

Example:

App = system "My App" {
  tags ["production", "critical"]
}

V

View

A diagram perspective that shows a subset of the architecture. Views can be explicit (defined with view blocks) or implicit (automatically generated by Sruja).

Example:

view index {
  title "Complete System View"
  include *
}

Documentation Style Guide

Goals

  • Align with the Diátaxis framework: Tutorials, How‑to Guides, Reference, Explanation
  • Improve clarity, consistency, and task‑orientation
  • Raise quality to industry standards (Stripe, React, Kubernetes, MDN)

Front Matter

  • Required: title, summary
  • Recommended: prerequisites, learning_objectives, estimated_time, difficulty, tags, last_updated
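As a concrete example, a page's front matter using the required and recommended fields might look like this (all values below are invented placeholders):

```yaml
title: "Model Your First System"
summary: "Define a system, validate it, and export a Mermaid diagram."
prerequisites:
  - "Sruja CLI installed"
learning_objectives:
  - "Write a minimal .sruja file"
  - "Run validation and fix errors"
estimated_time: "15 minutes"
difficulty: "beginner"
tags: ["tutorial", "getting-started"]
last_updated: "2025-01-15"
```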

Headings

  • Use Title Case for H1/H2/H3
  • Keep headings unique; avoid duplicates within a page

Code Blocks

  • Always specify language fences: bash, sh, json, yaml, go, ts, tsx, md, sruja
  • Prefer copy‑ready commands; avoid interactive prompts where possible

Admonitions

  • Use standard callouts: Note, Tip, Warning
  • Keep callouts short and action‑oriented

Links

  • Prefer descriptive link text (not raw URLs)
  • Cross‑link to Reference and Examples when teaching a concept or task

Images & Diagrams

  • Include small screenshots or diagram previews for expected outcomes
  • Use alt text that describes the intent and context

Tutorials

  • Structure: Overview → Prerequisites → Steps → Outcome → Troubleshooting → Next Steps
  • Include at least one end‑to‑end task with an expected output

How‑to Guides

  • Task‑oriented and concise
  • Structure: Purpose → Steps → Validation → References

Reference

  • Precise, complete, and skimmable tables/lists
  • Avoid narrative; link outward to tutorials for workflows

Explanation

  • Conceptual background, rationale, trade‑offs
  • Link to reference for details and to tutorials for practice

Quality Gates

  • Markdown lint for headings, lists, links
  • Link checking for external and internal links
  • Optional accessibility lint (alt text, heading levels)

Review Checklist

  • Front matter present and complete
  • Headings consistent and unique
  • Code fences have language tags
  • Cross‑links added to relevant Reference/Examples
  • Outcome preview or screenshot included where appropriate

Sruja Community

Welcome to the Sruja community! Sruja is an open source project built by and for developers who care about software architecture. Whether you're here to learn, contribute, or get help, we're glad you're here.

Join the Conversation

💬 GitHub Discussions

For longer-form discussions, feature requests, and Q&A:

GitHub Discussions

GitHub Discussions is ideal for:

  • Feature proposals and RFCs
  • Technical discussions
  • Sharing tutorials and examples
  • Asking detailed questions

🐛 GitHub Issues

Found a bug or have a feature request?

Open an Issue

Ways to Contribute

Sruja is an open source project, and we welcome contributions of all sizes! There are many ways to contribute, even if you're not a developer.

No Code Required

Documentation

  • Fix typos or improve clarity
  • Add examples and tutorials
  • Translate documentation
  • Write blog posts or courses

Testing & Feedback

  • Test new features and report bugs
  • Share your use cases
  • Provide feedback on design decisions
  • Help improve error messages

Community

  • Answer questions in Discussions
  • Help newcomers get started
  • Share your Sruja projects and experiences

Beginner-Friendly Code

Small Improvements

  • Add test cases
  • Fix small bugs
  • Improve error messages
  • Add examples to book/valid-examples/ and a matching page under book/src/examples/
  • Improve CLI help text

Documentation Code

  • Add code examples
  • Update API documentation
  • Create tutorials

More Advanced Contributions

Features & Enhancements

  • Implement new features
  • Add new export formats
  • Add validation rules
  • Improve tooling and developer experience

Core Development

  • Work on the language parser
  • Enhance the validation engine
  • Build platform integrations
  • Develop plugins and extensions

Getting Started with Contributions

🎯 First Time Contributing?

Start here: Contribution Guide

This step-by-step guide walks you through:

  • Finding your first issue
  • Setting up your development environment
  • Making and submitting changes
  • Getting help when stuck

Contribution Workflow

  1. Fork and Branch: Fork the repo and create a topic branch
  2. Implement: Make your changes and test locally
  3. Commit: Follow Conventional Commits
  4. Pull Request: Open a PR with a clear description
  5. Review: Address feedback and iterate
  6. Merge: Once approved, your contribution is merged!

For detailed instructions, see the Contribution Guide.

Roadmap & Transparency

Sruja is developed transparently with community input. Our roadmap is public and open for discussion.

Current Roadmap

View Roadmap Discussions

The roadmap outlines our path to v1.0, including:

  • Advanced Governance & Compliance: Policy as code, architectural guardrails
  • Production Reality & Data Flow: Service mesh integration, runtime verification
  • Extensibility & Ecosystem: Plugin system, DevOps integrations
  • Platform Evolution: Live system review, gap detection, violation monitoring

Shaping the Roadmap

Your feedback shapes the roadmap! We welcome:

  • Feature requests via GitHub Discussions
  • Use case sharing to prioritize features
  • RFCs (Request for Comments) for major changes
  • Community voting on priorities

Community Expectations

We're committed to maintaining a welcoming and respectful community. When participating:

  • Be respectful and constructive: Treat everyone with kindness
  • Provide actionable feedback: Help others improve their contributions
  • Prefer documented decisions: Link to ADRs or issues when relevant
  • Start small: You can always contribute more later!

Recognition

We value all contributions, big and small. Contributors are recognized through:

  • GitHub contributor list
  • Release notes (for significant contributions)
  • Community highlights in discussions

Professional Services

While Sruja is open source and free to use, professional consulting services are available for organizations that need:

  • Implementation support: Help rolling out Sruja across teams
  • Best practices guidance: Establish architectural governance patterns
  • Custom integrations: Integrate with existing CI/CD and infrastructure
  • Training: Team training on Sruja DSL and architectural modeling
  • Custom development: Build custom validators, exporters, or integrations

Contact the team through GitHub Discussions to discuss your needs.

Resources

Documentation

Development

Community

Get Involved Today

Ready to contribute? Here are some quick ways to get started:

  1. Join Discussions and introduce yourself
  2. Star the repository on GitHub to show your support
  3. Fix a typo in the documentation
  4. Add an example to book/valid-examples/ and a matching page under book/src/examples/
  5. Share your use case in GitHub Discussions

Every contribution, no matter how small, helps make Sruja better for everyone. Thank you for being part of the community!


Questions? Reach out on GitHub Discussions. We're here to help!

Courses

Structured courses to learn architecture-as-code with Sruja, from fundamentals to production patterns.

| Course | Description |
|---|---|
| Systems Thinking 101 | Fundamentals, parts & relationships, boundaries, flows, feedback loops, context |
| System Design 101 | Fundamentals, building blocks, advanced modeling, production readiness |
| System Design 201 | High throughput, real-time, data-intensive, consistency |
| Ecommerce Platform | Vision, architecture, SDLC, ops, evolution, governance |
| Production Architecture | Performance, modular design, governance |
| Agentic AI | Fundamentals, patterns, modeling for AI systems |
| Advanced Architects | Policy as code and advanced topics |

Start with Systems Thinking 101 or System Design 101 if you're new; use the Beginner path to combine courses with tutorials and challenges.

Systems Thinking 101

Learn to model systems holistically with Sruja. Master the five core systems thinking concepts: parts and relationships, boundaries, flows, feedback loops, and context.

Course Overview

Systems thinking helps you understand how components interact as part of a whole. This course teaches you to model systems using Sruja's architecture-as-code approach, enabling you to visualize and validate complex system interactions.

What You'll Learn

  • Module 1: Fundamentals - Core systems thinking concepts and why they matter
  • Module 2: Parts and Relationships - Model components and their interactions
  • Module 3: Boundaries - Define what's inside vs. outside your system
  • Module 4: Flows - Visualize data and information movement through the system
  • Module 5: Feedback Loops - Model cycles and reactive behaviors
  • Module 6: Context - Capture the environment, dependencies, and stakeholders

Prerequisites

Learning Path

Each module contains hands-on examples with Sruja syntax. You'll write .sruja files, validate them with sruja lint, and export to Mermaid diagrams with sruja export mermaid.

Why Systems Thinking?

  • Holistic understanding: See the whole system, not just parts
  • Natural patterns: Model real-world interactions and feedback
  • Clear boundaries: Understand what's in scope vs. context
  • Flow visualization: See how data and information move
  • Valid cycles: Feedback loops are natural, not errors

Course Duration

Approximately 6-8 hours to complete all modules and exercises.

Next Steps

Start with Module 1: Fundamentals or review the Beginner path for a complete learning journey.

Module 1: Fundamentals

Overview

In this module, you'll learn the core concepts of systems thinking and how they apply to software architecture modeling with Sruja.

Learning Objectives

By the end of this module, you'll be able to:

  • Define what systems thinking is and why it matters for software architecture
  • Identify the five core systems thinking concepts
  • Understand how Sruja supports systems thinking principles
  • Recognize when to use systems thinking in your architecture work

Lessons

Prerequisites

  • Basic understanding of software architecture concepts
  • Familiarity with Sruja DSL basics

Time Investment

Approximately 20-25 minutes to complete all lessons (8 lessons × 2-3 minutes each, including quizzes).

What's Next

After completing this module, you'll dive into specific concepts starting with Module 2: Parts and Relationships.

Lesson 1: Introduction to Systems Thinking

Learning Goal

Understand the basic concept of systems thinking and its importance in architecture.

What is Systems Thinking?

Systems thinking is a holistic approach to understanding how components interact as part of a whole. Instead of looking at parts in isolation, it focuses on relationships, patterns, and emergent behaviors that arise when components work together.

Traditional architecture often takes a reductionist approach: break systems into parts, understand each part, then put them together. But this misses the magic—the interactions that emerge only when parts work together.

A Simple Example: Coffee Shop

Think of a coffee shop:

Isolated view (reductionist):

  • Coffee machine
  • Barista
  • Cups
  • Beans
  • Customers

Systems thinking view:

  • Customer orders → Barista uses machine → Machine produces coffee → Customer receives → Customer might return
  • The machine needs beans (supply chain) — what if beans run out?
  • Barista needs training (human systems) — what if barista is new?
  • Shop needs location (infrastructure) — what if there's no parking?
  • Customer satisfaction affects future visits (feedback loop) — happy customers return, unhappy ones don't

Emergent behavior: Wait times fluctuate based on peak hours, customer flow, and barista experience — you can't predict this from individual parts alone.

Real-World Software Example: E-Commerce Platform

Consider an e-commerce application:

Isolated view:

  • Frontend (React)
  • Backend (Node.js)
  • Database (PostgreSQL)
  • Cache (Redis)

Systems thinking view:

  • User browses → Frontend caches → Backend processes → Database stores → Payment gateway charges → Email service confirms
  • What happens if cache is cold? (slower loads, higher database load)
  • What happens if payment gateway is down? (order processing stalls, users frustrated)
  • What happens during Black Friday sales? (traffic spikes, database contention, CDN becomes critical)
  • Customer abandonment creates feedback: if checkout is slow, users don't complete purchases, revenue drops, performance gets less investment, and checkout slows further (vicious cycle)

Emergent behavior: System throughput varies non-linearly with user load due to caching, database locking, and external API rate limits.
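The non-linear claim can be made concrete with a toy queueing calculation. This is an illustrative M/M/1 approximation with invented numbers, not a model of any real system:

```python
# Illustrative M/M/1 queueing approximation: average latency grows
# non-linearly as load approaches capacity. All numbers are made up.

def avg_latency(arrival_rate, service_rate):
    """Average time in system, W = 1 / (mu - lambda), for an M/M/1 queue."""
    if arrival_rate >= service_rate:
        return float("inf")  # queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

service_rate = 100.0  # requests/second the backend can handle
for load in (10, 50, 90, 99):
    print(f"{load:>3} req/s -> {avg_latency(load, service_rate) * 1000:.1f} ms")
```

Going from 10 to 50 req/s roughly doubles average latency, while going from 90 to 99 req/s multiplies it tenfold: exactly the kind of non-linear behavior you cannot predict by looking at components in isolation.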

Why It Matters for Architecture

Traditional view: "Build these components."
Systems thinking view: "How do components interact to create value?"

Sruja Example: E-Commerce Platform

import { * } from 'sruja.ai/stdlib'

Customer = person "End User"
Admin = person "Administrator"

ECommerce = system "E-Commerce Platform" {
  WebApp = container "Web Application" {
    technology "React"
  }
  API = container "API Service" {
    technology "Node.js"
  }
  Cache = container "Redis Cache"
  DB = database "PostgreSQL"
}

PaymentGateway = system "Payment Gateway"
EmailService = system "Email Service"

// User flow (happy path)
Customer -> ECommerce.WebApp "Browses products"
ECommerce.WebApp -> ECommerce.Cache "Checks cache"
ECommerce.WebApp -> ECommerce.API "Fetches data"
ECommerce.API -> ECommerce.DB "Reads products"
ECommerce.API -> ECommerce.WebApp "Returns products"

// Order processing flow
Customer -> ECommerce.WebApp "Submits order"
ECommerce.WebApp -> ECommerce.API "Processes order"
ECommerce.API -> ECommerce.DB "Saves order"
ECommerce.API -> PaymentGateway "Charges payment"
PaymentGateway -> ECommerce.API "Payment confirmation"
ECommerce.API -> EmailService "Sends confirmation"
EmailService -> Customer "Order confirmation"

Extended Example: What About Edge Cases?

// What happens when things go wrong?

scenario CacheMiss "Cache Miss Scenario" {
  Customer -> ECommerce.WebApp "Requests product"
  ECommerce.WebApp -> ECommerce.Cache "Cache miss"
  ECommerce.WebApp -> ECommerce.DB "Queries database"  // Slower
  ECommerce.DB -> ECommerce.WebApp "Returns product"
  ECommerce.WebApp -> Customer "Displays product"  // Noticeable delay
}

scenario PaymentFailure "Payment Gateway Down" {
  Customer -> ECommerce.WebApp "Submits order"
  ECommerce.WebApp -> ECommerce.API "Processes order"
  ECommerce.API -> PaymentGateway "Attempts payment"  // Timeout!
  ECommerce.API -> ECommerce.DB "Saves order as pending"  // Graceful degradation
  ECommerce.API -> Customer "Shows: Payment failed, retry later"  // Good UX
}

Key insight: Systems thinking forces you to design for failures, not just happy paths.

Common Misconceptions

"Systems thinking is just about drawing diagrams" Reality: It's about understanding behavior, interactions, and emergent properties. Diagrams are a tool, not the goal.

"More components mean more complex systems" Reality: Complexity comes from relationships and feedback loops, not component count. A simple system with 3 components in a feedback loop can be more complex than 10 components in a linear chain.

"We can optimize parts in isolation" Reality: Optimizing one part (e.g., database queries) without considering the whole system (caches, frontend, network) often has minimal impact or even makes things worse.

"Systems thinking is only for large-scale systems" Reality: It applies to all systems, even small APIs. A small system's design affects maintainability, testability, and future scalability.

Key Takeaways

  1. Relationships matter more than parts — interactions drive system behavior
  2. Design for the whole system — not just individual components
  3. Consider edge cases and failures — systems thinking helps you design gracefully
  4. Think in flows and feedback — how do things move through your system?
  5. Map emergent behavior — what emerges that you couldn't predict from parts alone?

Systems thinking focuses on relationships and interactions, not just components. It's about understanding behavior, not just structure.

Quiz: Test Your Knowledge

Question 1: What is systems thinking?

  • a) A way to optimize code performance
  • b) A holistic approach to understanding how components interact as part of a whole
  • c) A method for breaking down systems into smaller parts
  • d) A database design technique

Explanation: Systems thinking is a holistic approach to understanding how components interact as part of a whole, focusing on relationships, patterns, and emergent behaviors rather than parts in isolation.

Question 2: In the coffee shop example, what represents a systems thinking view?

  • a) Coffee machine, barista, cups, beans, customers listed separately
  • b) Customer orders → Barista uses machine → Machine produces coffee → Customer receives → Customer might return
  • c) How much coffee is sold per day
  • d) Just focus on the coffee machine and barista

Explanation: Customer orders → Barista uses machine → Machine produces coffee → Customer receives → Customer might return. This shows connections and flows rather than just listing components.

Question 3: What's the key difference between the traditional view and systems thinking view in software architecture?

  • a) Traditional uses diagrams, systems thinking uses text
  • b) Traditional focuses on user experience, systems thinking focuses on backend
  • c) Traditional: "Build these components." Systems thinking: "How do components interact to create value?"
  • d) Traditional is for small systems, systems thinking is only for enterprise

Explanation: Traditional architecture focuses on what to build (components), while systems thinking focuses on how they work together (interactions, relationships, value creation).

Lesson 2: Seeing Deeper with the Iceberg Model

Learning Goals

By the end of this lesson, you'll be able to:

  • Identify the four levels of the iceberg model
  • Recognize when you're stuck at the surface events level
  • Analyze problems by looking for patterns and root causes
  • Understand how mental models shape system behavior

Understanding the Iceberg Model

Have you ever felt like you're constantly putting out fires? You fix one bug, test it thoroughly, celebrate—and then two more similar bugs appear the next day. Or maybe you've watched your team debate the same architectural decision repeatedly, never quite resolving it?

These aren't just frustrating experiences. They're symptoms of looking at problems at the surface level—the "events" level—without understanding the deeper patterns, structures, and beliefs that create them.

The iceberg model gives us a way to look deeper.

The Four Levels

The iceberg model has four levels, from surface to depth. Let's walk through each one with examples that might feel familiar.

Events: What You See

Events are the things that happen right now—the incidents, bugs, crashes, and alerts you deal with daily.

Think of a typical Monday: "Server crashed at 2:37 PM" or "User reported a critical bug." These are events. They're immediate, observable, and often prompt a reactive response: "Fix it now."

The problem? Events are just symptoms. Fixing an event doesn't prevent it from happening again tomorrow.

Patterns: What's Happening Over Time

When you look at events collectively, patterns start to emerge. These are trends and recurring sequences that tell you what's really going on.

Maybe you notice: "Similar bugs occur repeatedly after releases" or "Server crashes every Sunday night during backups." These aren't isolated incidents anymore—they're patterns.

Patterns tell you something important: this isn't random. There's something systemic happening here.

Structures: What's Causing the Patterns

Structures are the underlying arrangements and mechanisms that create the patterns you're seeing. They're often invisible until you look for them.

Why do bugs keep appearing after releases? Maybe it's because components are tightly coupled, so a change in one creates cascading failures elsewhere. Why do servers crash during backups? Maybe it's because a single database becomes a bottleneck when backup processes run.

Structures are often about architecture, processes, or technology choices. They're the root causes.

Mental Models: What's Shaping the Structures

This is the deepest level. Mental models are the beliefs, assumptions, and worldviews that influence the structures you create.

Why is your system tightly coupled? Maybe because your team believes "speed to market is everything, let's worry about architecture later." Why is there a single database bottleneck? Maybe because everyone assumes "we can scale a single database if we need to—let's not over-engineer."

Mental models are the hardest to change because they're often unspoken assumptions. But they're also the most powerful because they shape everything above them.

Here's how they connect:

"Mental models" → influence → "Structures" → create → "Patterns" → manifest as → "Events"

When you fix an event without changing the mental model, the pattern eventually returns.

Seeing It in Practice

Let's make this concrete with a real example.

A Real Architecture Scenario

Imagine you're working on an e-commerce platform, and you're dealing with slow page loads. Here's how the iceberg model helps you diagnose the problem:

import { * } from 'sruja.ai/stdlib'

// Event: Slow page loads
// Pattern: Performance degrades after releases
// Structure: Monolithic architecture, no caching
// Mental Model: "Optimization is premature"

App = system "Web Application" {
  Monolith = container "Monolithic App"
  DB = database "Single Database"
}

Monolith -> DB "Heavy queries"

Let's analyze this at each level:

  • Event: Users report slow page loads. You investigate, find slow queries, optimize them. Performance improves briefly.
  • Pattern: Performance degrades after every release. Optimized queries don't solve it long-term. Something else is happening.
  • Structure: The system is monolithic with no caching. Every page load hits the database. As the codebase grows, queries become heavier.
  • Mental Model: The team believes "Optimization is premature—ship features first, worry about performance later." This shapes architectural decisions.

If you only fix the event (optimize queries), the pattern returns. If you redesign the structure (add caching, modularize), but don't address the mental model, you'll build similar bottlenecks in the new architecture. The real solution requires changing the belief about performance.

When to Use Each Level

So when do you actually use each level? Here's a practical guide.

Start with Events Level

You're at this level when you're reacting to immediate problems: debugging specific bugs, handling production alerts, or investigating user complaints. This is day-to-day firefighting. It's necessary, but if you never go deeper, you'll keep putting out the same fires.

Move to Patterns Level

When you notice recurring issues, shift to the patterns level. This is for analyzing historical data, identifying trends, and planning for capacity and scaling.

Think about it: "Similar bugs appear after every release" is a pattern. It tells you something systemic is happening. At this level, you're asking: "What keeps happening?"

Dive to Structures Level

This is where architecture and design decisions happen. You're here when designing systems, performing root cause analysis, planning refactoring, or evaluating technology choices.

At the patterns level, you know what's happening. At the structures level, you figure out why it's happening. You map out component relationships and identify root causes.

Reach Mental Models Level

This is the deepest—and most powerful. You're here when making strategic decisions, changing team culture, setting long-term priorities, or aligning stakeholders on vision.

This is the hardest level to reach because mental models are often unspoken. But changing them creates lasting impact. You might ask: "What do we believe that makes this structure seem right?" or "What assumptions are driving our decisions?"

How to Shift Between Levels

The real skill isn't knowing the four levels—it's knowing how to move between them when you're analyzing problems.

From Events to Patterns

You're stuck at the events level when you keep seeing isolated incidents. To shift:

  1. Collect data over time - Don't just look at today's incident. Look at the last month's incidents.
  2. Look for correlations and trends - Are these incidents connected? Do they follow a pattern?
  3. Ask the right question: "What keeps happening?"

From Patterns to Structures

You're at the patterns level when you see recurring issues but don't know the cause. To shift:

  1. Analyze root causes - Don't just acknowledge the pattern. Ask why it exists.
  2. Map out component relationships - How do parts connect? Where are dependencies?
  3. Ask the right question: "What creates these patterns?"

From Structures to Mental Models

This is the hardest shift. You're at the structures level when you understand the architecture but don't understand why decisions were made. To shift:

  1. Identify assumptions and beliefs - What does your team believe about this problem?
  2. Challenge deeply held views - Are these assumptions still valid? What evidence supports them?
  3. Ask the right question: "What do we believe that makes this structure seem right?"

A Practical Example

Let's say your team keeps introducing tightly coupled components. Here's how you might analyze this:

Start: You notice a bug in Component A. Fix it. Same bug appears in Component B. This is the events level.

Shift to patterns: You realize "Coupling issues appear after every feature release." This is the patterns level.

Shift to structures: You investigate and find "Components share the same database table and call each other directly." This is the structures level.

Shift to mental models: You ask "Why did we design it this way?" and realize "We believed tight coupling would speed development." This is the mental models level.

Now you can change the mental model—and create different structures in the future.
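If you wanted to capture the structures-level finding from this example, a quick model makes the coupling visible. Here's an illustrative sketch in Sruja syntax (the names are hypothetical, not the team's actual system):

import { * } from 'sruja.ai/stdlib'

App = system "Application" {
  ComponentA = container "Component A"
  ComponentB = container "Component B"
  SharedDB = database "Shared Database"
}

// The structure behind the recurring bugs:
// both components write the same table and call each other directly
App.ComponentA -> App.SharedDB "Reads/writes orders table"
App.ComponentB -> App.SharedDB "Reads/writes orders table"
App.ComponentA -> App.ComponentB "Direct call"

A diagram like this turns "we keep seeing coupling bugs" into a concrete structure you can discuss and refactor.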

What to Remember

The iceberg model gives you a way to look deeper than surface problems. But here's the important part: the deeper you go, the more powerful your solutions become.

Think of it this way: fixing an event is like taking an aspirin for a headache—it might work temporarily, but the headache might come back tomorrow if you don't address the cause. Addressing patterns is like understanding what triggers your headaches. Addressing structures is like changing lifestyle habits that cause headaches. Addressing mental models is like understanding why you adopted those habits in the first place.

The shift in thinking is subtle but profound. Instead of asking "How do I fix this bug?", you ask "Why do similar bugs keep appearing? What structures create these patterns? What beliefs led to these structures?"

This doesn't mean you should always start at the mental models level. Sometimes you genuinely do need to fix a bug quickly. But when you see recurring problems, that's your signal to go deeper.

The iceberg model helps you ask better questions—and asking better questions is the first step to finding better answers.

Check Your Understanding

Let's see if this is clicking.

Quick Check

1. You're investigating slow page loads on your site. You optimized a specific query and it's faster now—but similar performance issues keep appearing. Which level of the iceberg model should you move to next?

[ ] A. Stay at Events level - keep optimizing individual queries
[ ] B. Patterns level - look for why these issues keep recurring
[ ] C. Structures level - redesign the architecture now
[ ] D. Mental Models level - change team beliefs about performance

2. Your team believes "Ship fast, fix bugs later." You've noticed this leads to tightly coupled components and cascading failures. What level of the iceberg model is this belief?

[ ] A. Events - it's a statement in meetings
[ ] B. Patterns - this belief appears in every project
[ ] C. Structures - it creates tight coupling in code
[ ] D. Mental Models - it's an assumption shaping your architecture


Answers & Discussion

1. B. Patterns level – Optimizing individual queries is working at the Events level (fixing incidents). But since "similar performance issues keep appearing," that's a pattern signal. You need to identify what creates this recurring pattern before you can solve it effectively.

2. D. Mental Models – The belief "Ship fast, fix bugs later" is a mental model—a worldview that influences decisions. It leads to Structures (tightly coupled components) that create Patterns (cascading failures) that manifest as Events (bugs). Changing the architecture without changing this belief will likely create similar problems in new forms.

What's Next

Now that you understand the iceberg model and can see systems at different levels, let's apply this to real software systems. In the next lesson, we'll explore Systems in Software Architecture—how every application is actually a system of systems, with multiple layers and dependencies.

This will help you see the bigger picture when designing your own applications.

Lesson 3: Every App is a System of Systems

Learning Goals

By the end of this lesson, you'll be able to:

  • Identify the layers that make up a software system
  • Recognize when you're ignoring dependencies or organizational factors
  • Understand how Conway's Law affects your architecture
  • Model systems holistically in Sruja

Seeing Layers: Every App is a System of Systems

When you design software, what do you think about first? The frontend framework? The backend API? The database schema?

These are important parts. But they're not the whole system. Every application exists in a context—users who interact with it, dependencies it relies on, processes that keep it running, and data flowing through it.

Here's what happens when you ignore these layers: You design a perfect API, but your team doesn't know how to deploy it reliably. You optimize database queries, but the external payment gateway has rate limits you didn't consider. You build features quickly, but organizational processes create technical debt that slows you down.

Seeing systems holistically—seeing all the layers—is what prevents these problems.

The Layers of a Software System

Every application is built from multiple interconnected layers. Let's visualize them:

┌─────────────────────────────────────┐
│  People: Users, Developers, Ops      │  ← Stakeholders
├─────────────────────────────────────┤
│  Dependencies: APIs, Libraries       │  ← External Systems
├─────────────────────────────────────┤
│  Processes: Dev, Deploy, Monitor     │  ← Operational Systems
├─────────────────────────────────────┤
│  Data: State, Transactions, Logs     │  ← Information Systems
├─────────────────────────────────────┤
│  Application: UI, Logic, Storage     │  ← Your System
└─────────────────────────────────────┘

Why These Layers Matter

Let's talk about why ignoring these layers causes problems. Here's what happens at each layer:

Dependencies Can Fail

When you depend on external systems (payment gateways, APIs, libraries), those dependencies can fail. If you don't account for this in your architecture, a single external failure can bring down your entire application.

Teams Shape Architecture (Conway's Law)

Ever heard of Conway's Law? It states that "organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations."

Translation: How your team communicates and is organized will directly influence your architecture. If your team is split into silos, you'll likely build tightly coupled systems. If your team has clear cross-team communication, you'll naturally build more modular, integrated systems.

Operational Processes Determine Reliability

Your architecture design matters, but so do your processes. How do you deploy? How do you monitor? How do you handle incidents?

You might design the most resilient architecture possible, but if your deployment process is fragile (manual steps, no testing, no rollbacks), your system will be fragile too.

Data Systems Create Integration Challenges

Data doesn't just sit in one place anymore. It flows between databases, caches, analytics, logs, and external systems. This movement creates integration challenges and consistency requirements.

If you don't consider these data flows, you'll hit problems: inconsistent data across services, race conditions, synchronization issues.
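Modeling these flows explicitly makes the consistency requirements visible. As a sketch (illustrative names, Sruja syntax):

import { * } from 'sruja.ai/stdlib'

App = system "Application" {
  API = container "API Service"
  Cache = database "Redis Cache"
  DB = database "Primary Database"
  Analytics = database "Analytics Store"
}

// Each arrow is a data flow with its own consistency requirements
App.API -> App.Cache "Reads hot data"
App.API -> App.DB "Writes orders"
App.DB -> App.Analytics "Nightly sync"

Seeing three copies of order data on one diagram is a prompt to decide which one is the source of truth and how the others stay in sync.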

Common Architecture Layers

Now that we understand why layers matter, let's look at common layers you'll encounter in real software systems.

Application Layer: What You Build

This is what most developers think about first—the code and logic. It includes:

  • User interfaces: Web applications, mobile apps, CLI tools
  • Business logic: APIs, services, domain models
  • Data storage: Databases, caches, file systems

This is the visible part of your system—what users interact with and what you spend most of your time building.

Infrastructure Layer: What Runs Your Code

Your application doesn't run in a vacuum. It needs infrastructure:

  • Compute: Servers, containers, cloud instances
  • Network: Load balancers, CDNs, firewalls, VPNs
  • Monitoring: Logs, metrics, alerts, dashboards

Infrastructure determines reliability, scalability, and performance. You might have perfect code, but if your infrastructure can't handle load or lacks monitoring, you won't know when things fail.

Organizational Layer: Who Builds It

This layer often gets overlooked, but it's critical. It includes:

  • Teams and responsibilities: Who owns what? How are decisions made?
  • Development workflows: Code reviews, testing, CI/CD pipelines
  • Release processes: How do features go from idea to production?

As Conway's Law states, how your organization is structured will influence your architecture. If your team is organized by technical skill (backend team, frontend team, database team), you'll likely build systems that mirror that structure—sometimes creating unnecessary boundaries.

Modeling Systems Holistically in Sruja

Let's see how this looks in practice with a real e-commerce system.

Example: E-Commerce Platform

import { * } from 'sruja.ai/stdlib'

// People (stakeholders)
Customer = person "Customer"
Admin = person "Admin"

// External Systems (dependencies)
PaymentGateway = system "Payment Gateway"
EmailService = system "Email Service"

// Your System
ECommerce = system "E-Commerce Platform" {
  WebApp = container "Web Application" {
    technology "React"
  }
  API = container "API Service" {
    technology "Node.js"
  }
  DB = database "PostgreSQL"
}

// Relationships show system interactions
Customer -> ECommerce.WebApp "Browses products"
ECommerce.WebApp -> ECommerce.API "Fetches data"
ECommerce.API -> ECommerce.DB "Queries"
ECommerce.API -> PaymentGateway "Process payment"
PaymentGateway -> EmailService "Send receipt"

Analyzing This Model

Notice how this shows multiple layers:

People layer: Customer and Admin are stakeholders who use and manage the system.

Dependencies layer: PaymentGateway and EmailService are external systems we depend on. If the payment gateway is down, our entire order flow breaks.

Application layer: WebApp, API, and DB are containers within our system. This is what we build and control.

Relationships show dependencies: The arrows between components show data flow and dependencies. API -> PaymentGateway is a critical dependency. If that external system is slow or fails, we need to handle it gracefully.

What This Reveals

When you model systems holistically like this, you start seeing:

  • Single points of failure (like PaymentGateway)
  • Team ownership boundaries (what's in our system vs. external)
  • Data flow paths (how do orders flow through the system?)
  • Integration points (where do we cross system boundaries?)

This big-picture view helps you design more resilient systems.

What to Remember

The key insight from this lesson is simple but powerful: Never design in isolation.

Every application exists in a context—people who use it, dependencies it relies on, processes that keep it running, and data flowing through it. When you ignore these layers, you build fragile systems that work in theory but fail in practice.

Think of it this way: A car designer who only thinks about the engine and drivetrain, but ignores the fuel system, cooling system, and driver experience, will create a car that can't run reliably.

The same applies to software. When you see systems holistically—all the layers together—you design better systems. You identify dependencies before they break, you consider team structure before it constrains you, and you plan for failure before it happens.

This is what systems thinking in architecture looks like: Not just code, but code in context.

Check Your Understanding

Let's see if you're seeing systems holistically now.

Quick Check

1. Your team is organized by technical role: backend team, frontend team, database team. You're designing a new feature. What's likely to happen to your architecture, based on Conway's Law?

[ ] A. You'll build a perfectly modular system because roles are clear
[ ] B. You'll build systems with boundaries that mirror your team structure
[ ] C. Team organization won't affect your architecture at all
[ ] D. You'll need to reorganize teams to match a better architecture

2. You're designing an e-commerce system and decide to use an external payment gateway API. What should you consider, based on this lesson?

[ ] A. Nothing—it's just another API call
[ ] B. The gateway might fail or have rate limits, which could break your entire order flow
[ ] C. Always build your own payment system instead of depending on external services
[ ] D. Only worry about the API documentation and authentication


Answers & Discussion

1. B. You'll build systems with boundaries that mirror your team structure – Conway's Law states that systems mirror the communication structures of the organizations that build them. If your teams are organized by technical role (backend, frontend, database), your architecture will likely have similar boundaries, creating unnecessary coupling and communication challenges.

2. B. The gateway might fail or have rate limits, which could break your entire order flow – When you depend on external systems, those dependencies become single points of failure. You need to consider what happens if the gateway is down, has rate limits, or has performance issues. Do you have fallbacks? Retry logic? Degraded service modes? Ignoring these layers leads to fragile systems.

What's Next

Now that you understand every application is a system of systems with multiple layers, let's dive into the specific components that make up these systems. In the next lesson, we'll explore Parts & Relationships—how to identify components and model their interactions.

This is where systems thinking gets practical: You'll learn to model real systems in Sruja with precision and clarity.

Module 2: Parts and Relationships

Overview

In this module, you'll learn to identify and model the components (parts) of a system and how they interact (relationships).

Learning Objectives

By the end of this module, you'll be able to:

  • Identify the key parts of a software system
  • Model components using Sruja's element types
  • Define relationships with clear, meaningful labels
  • Use nesting to show hierarchical structure
  • Validate relationships for correctness

Lessons

Prerequisites

Time Investment

Approximately 1.5-2 hours to complete all lessons and exercises.

What's Next

After completing this module, you'll learn about Module 3: Boundaries.

Finding the Building Blocks: Identifying Parts

Ever walked into a room full of people and tried to understand what's going on? You naturally start by identifying the key players: who's talking, who's listening, who's organizing. That's exactly what we do in systems architecture—we identify the parts first, then figure out how they connect.

In this lesson, you'll learn to spot the essential components of any software system. Whether you're building an e-commerce platform, a social network, or an internal tool, the process is the same. Let's dive in.

Learning Goals

By the end of this lesson, you'll be able to:

  • Confidently identify the key parts of any software system
  • Understand the C4 model's four-level hierarchy
  • Apply a systematic approach to breaking down requirements into components
  • Recognize when you've found the right level of detail

The C4 Model: Your Guide to Structure

Before we start identifying parts, let me introduce you to the C4 model—a framework that gives us a clear hierarchy for organizing software components. Think of it like a building: you start with the people who use it (person level), then the building itself (system), then the rooms within it (containers), and finally the furniture and fixtures inside those rooms (components).

Here's how it breaks down in practice:

Level 1: Person (Users, stakeholders)
  ↓ Who interacts with the system?
  
Level 2: System (Software systems)
  ↓ What software systems are involved?
  
Level 3: Container (Applications, databases, services)
  ↓ What are the deployable units?
  
Level 4: Component (Modules, classes, libraries)
  ↓ What are the internal building blocks?

I've found this hierarchy incredibly useful because it prevents us from getting lost in the weeds too early. Start at the top, work your way down, and stop when you have enough detail for your audience.

A Step-by-Step Approach to Finding Parts

Let's walk through a practical example together. Imagine you're reading these requirements:

"Customers can browse products, add to cart, and checkout. Administrators can manage inventory and view reports. The system sends email notifications for order confirmations."

How would you identify the parts? Here's my approach:

Step 1: Start with People

Who are the humans interacting with this system? This is always the best starting point because every system exists to serve people.

Looking at the requirements, I can immediately spot two key players:

  • Customers—they're browsing, adding to cart, and checking out
  • Administrators—they're managing inventory and viewing reports

import { * } from 'sruja.ai/stdlib'

Customer = person "Customer"
Administrator = person "Administrator"

Notice how I'm already thinking in terms of Sruja syntax? That's because once you understand the concept, the code becomes a natural expression of your thinking.

Step 2: Identify the Systems

Now, what software systems are involved? Sometimes this is obvious (one main system you're building), and sometimes it's more complex (multiple systems talking to each other).

From our requirements, I can see:

  • E-commerce platform—this is the main system we're building
  • Email service—this is an external system we'll integrate with

ECommerce = system "E-Commerce Platform"
EmailService = system "Email Service"

Don't worry if you miss some systems initially. You can always come back and add more as you discover dependencies.

Step 3: Break Down Systems into Containers

Now we get to the interesting part. What are the deployable units that make up our e-commerce platform? Think about what you'd actually deploy: web apps, APIs, databases, caches—these are all containers.

For our e-commerce platform, a sensible breakdown might be:

ECommerce = system "E-Commerce Platform" {
  WebApp = container "Web Application" {
    technology "React"
  }
  API = container "API Service" {
    technology "Node.js"
  }
  Database = database "PostgreSQL" {
    technology "PostgreSQL 14"
  }
  Cache = database "Redis Cache" {
    technology "Redis 7"
  }
}

I've included a cache because I know from experience that almost every e-commerce platform benefits from one for performance. This is where real-world experience comes in—you'll naturally add components that make sense based on patterns you've seen before.

Step 4: Consider Components (But Only If Needed)

This is where many people go wrong. They immediately start modeling every single class and module. Stop and ask yourself: Does my audience need this level of detail?

If you're talking to developers who need to understand the internal architecture, yes—break containers into components. If you're talking to stakeholders who just want to understand the overall system, skip this level.

For example, if we were documenting the API's internal structure for our engineering team:

API = container "API Service" {
  ProductService = component "Product Service"
  CartService = component "Cart Service"
  OrderService = component "Order Service"
  PaymentService = component "Payment Service"
}

The key insight here: match the level of detail to your audience. Not every diagram needs components. Not every discussion needs to dive this deep.

Common Patterns You'll See Everywhere

After building systems for years, you start to recognize patterns that repeat across different domains. Here are three I see constantly:

Pattern 1: The Classic Three-Tier Architecture

This is probably the most common pattern you'll encounter. It's simple, it works, and it's a great starting point for most web applications.

App = system "Application" {
  Frontend = container "Web App" {
    technology "React"
  }
  Backend = container "API Service" {
    technology "Node.js"
  }
  Database = database "Database" {
    technology "PostgreSQL"
  }
}

I've used this pattern more times than I can count. It's a solid foundation that scales well and most developers immediately understand.

Pattern 2: Microservices

When your system grows, you might need to split it into smaller, independently deployable services. That's where this pattern comes in.

App = system "Microservice Application" {
  APIGateway = container "API Gateway"
  UserService = container "User Service"
  OrderService = container "Order Service"
  NotificationService = container "Notification Service"
}

The trick here is knowing when to use this. Don't start with microservices just because it's trendy. Start simple, then split when you have a real reason (scaling, team size, complexity).

Pattern 3: Event-Driven Architecture

When you need systems to react to events in real-time, this pattern shines. It's more complex but incredibly powerful for the right use cases.

App = system "Event-Driven System" {
  Producer = container "Event Producer"
  Consumer = container "Event Consumer"
  MessageQueue = queue "Kafka Cluster"
}

I've seen teams jump into this pattern too early and regret it. Make sure you actually need event-driven architecture before adopting it—it adds significant complexity.

Pitfalls to Avoid (I've Made All of These)

Let me save you some trouble by sharing mistakes I've made and seen others make:

Mistake 1: Oversimplifying

When you're rushing, it's tempting to just model everything as one big system:

// Don't do this—it's too simple
App = system "The App"

This tells you nothing about how the system is structured. Anyone looking at this diagram will be left with more questions than answers.

// This is better—it shows structure
App = system "The App" {
  Frontend = container "Frontend"
  Backend = container "Backend"
  Database = database "Database"
}

Mistake 2: Over-Engineering

On the flip side, I've seen people model every single class and component, creating diagrams that are essentially unreadable:

// Don't do this—it's too detailed
App = system "The App" {
  Frontend = container "Frontend" {
    Header = component "Header"
    Body = component "Body"
    Footer = component "Footer"
  }
}

Who cares about the header, body, and footer at the architecture level? No one. That's implementation detail, not architecture.

// This is the right level of detail
App = system "The App" {
  Frontend = container "Frontend"
  Backend = container "Backend"
}

Mistake 3: Mixing Levels Inconsistently

This one trips up even experienced architects. You might have some parts modeled at the container level and others at the component level, creating confusion:

// Don't do this—inconsistent detail
App = system "The App" {
  Frontend = container "Frontend"
  UserService = component "User Service"  // Skips container level
  Database = database "Database"
}

What's the UserService doing here without a parent container? Is it a standalone service? Part of another system? It's confusing.

// This is consistent and clear
App = system "The App" {
  Frontend = container "Frontend"
  Backend = container "Backend" {
    UserService = component "User Service"
  }
  Database = database "Database"
}

What to Remember

Identifying parts is both an art and a science. The science is following the C4 hierarchy and being systematic. The art is knowing how much detail to include and what to leave out.

If you take away just one thing from this lesson, let it be this: start with people, then systems, then containers, and only add components when your audience truly needs that level of detail.

Everything else flows from that principle.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding:

Question 1

You're building a project management tool with these requirements:

"Team members can create tasks, assign them to others, and track progress. Managers can view reports and approve tasks. The system sends email notifications for task assignments and due dates. Task data is stored in a database. The system integrates with Slack for notifications."

At the person level, which parts should you identify?

A) Task, Report, Email, Database, Slack
B) Team Member, Manager
C) Project Management Tool, Slack, Email Service
D) Task Assignment, Task Approval, Report Generation


Answer: B) Team Member, Manager

The person level is about humans who interact with the system. Team members and managers are the people using the system. Tasks, reports, and notifications are things the system handles, not people. The project management tool, Slack, and email service are systems, not people.

Remember: people first, then systems, then everything else.


Question 2

You're modeling a simple blog platform. When should you break your containers down into components?

A) Always—components provide the most detail
B) Never—containers are sufficient for all audiences
C) When your audience needs to understand internal architecture or when a container is complex
D) Only when you're using microservices


Answer: C) When your audience needs to understand internal architecture or when a container is complex

Components are an optional level in the C4 model. You should add them when:

  • You're documenting for developers who need to understand implementation
  • A container has grown complex and needs to be broken down
  • Different teams own different parts of the system

Don't add components just because you can. Match the level of detail to your audience's needs. A stakeholder doesn't need to see your component architecture—a developer probably does.


What's Next?

Now that you know how to identify parts, you're probably wondering: "How do I actually define these parts in Sruja? What's the syntax?"

In the next lesson, you'll learn exactly that. We'll cover the four core element types (person, system, container, component), how to use them, and what details to include. You'll have a complete toolkit for modeling any software system.

See you there!

Your Toolkit: Sruja's Element Types

Now that you know how to identify parts, let's talk about how to actually define them in Sruja. Think of element types as your vocabulary—the building blocks you'll use to describe any software architecture.

The beauty of Sruja is that it gives you exactly four element types. Not ten, not twenty—just four. This might feel limiting at first, but I promise you'll appreciate the simplicity. With these four types, you can model any software system from a simple blog to a distributed microservices platform.

Let's dive in.

Learning Goals

By the end of this lesson, you'll be able to:

  • Confidently use all four Sruja element types
  • Know exactly when to use each type (and when not to)
  • Write clean, consistent element definitions
  • Add meaningful details that make your diagrams useful

The Four Building Blocks

Here's the complete toolkit. You'll use these four types over and over again:

Element Type | Purpose | When I Use It
person | Humans who interact with the system | Every architecture starts here
system | Standalone software systems | The main systems you build and depend on
container | Deployable units within systems | Applications, databases, queues—the things you actually deploy
component | Modules within containers | Only when you need to show internal architecture

Notice how each maps to a level in the C4 hierarchy? Person is Level 1, System is Level 2, Container is Level 3, Component is Level 4. They work together in a hierarchy, not as independent pieces.

Person: It Always Starts With People

Every system exists to serve humans. That's why person is your starting point. Don't skip it—even if it feels obvious. Having people in your diagram grounds everything in reality.

The Basics

Here's how simple it is:

Customer = person "Customer"
Admin = person "Administrator"
Support = person "Customer Support"

That's it. You define an ID (like Customer) and a display name (like "Customer"). Simple, readable, and immediately clear.

Adding Context

Sometimes a name isn't enough. You might want to add more details:

Customer = person "Customer" {
  description "End users who purchase products"
  metadata {
    type ["external"]
    priority "high"
  }
}

I use descriptions when the name alone might be ambiguous. For example, "Admin" could mean a system administrator, a content admin, or a superuser. Adding a description removes that ambiguity.
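As a sketch of that idea (the IDs here are hypothetical), descriptions can distinguish two roles that would otherwise share a name:

// Two "Admin" roles, disambiguated by description
SysAdmin = person "Admin" {
  description "Operations staff who maintain servers and deployments"
}
ContentAdmin = person "Admin" {
  description "Editorial staff who review and publish content"
}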

Real-World Examples

Here are some people I've modeled in different projects:

// Internal users
Developer = person "Developer"
ProductManager = person "Product Manager"
DataAnalyst = person "Data Analyst"

// External users
APIConsumer = person "API Consumer"
Partner = person "Business Partner"
Vendor = person "Vendor"

// Support roles
SupportAgent = person "Customer Support"
SalesRep = person "Sales Representative"

One thing I've learned: don't overthink this. If someone interacts with your system, model them as a person. You don't need to capture every single role—just the ones that matter for your diagram's purpose.

System: The Big Picture

Systems are the major software units. Think of them as the "big boxes" in your architecture—the main applications, third-party services, and platforms you depend on.

Defining Systems

Here's the basic syntax:

ECommerce = system "E-Commerce Platform"
PaymentGateway = system "Payment Gateway"
EmailService = system "Email Service"

Adding Details That Matter

When I'm documenting systems for a team, I often add more context:

ECommerce = system "E-Commerce Platform" {
  description "Platform for buying and selling products"
  metadata {
    version "2.0"
    team ["platform-team"]
  }
  slo {
    availability {
      target "99.9%"
    }
  }
}

I love that Sruja lets you add SLOs (Service Level Objectives) directly to systems. This makes your architecture diagrams not just visual—they become living documentation that tells the story of what you've promised to deliver.

Your Systems vs. External Systems

One distinction I always make:

// Systems you own and maintain
Platform = system "Sruja Platform"
Dashboard = system "Analytics Dashboard"

// Systems you depend on (third-party)
Stripe = system "Stripe"
AWS = system "Amazon Web Services"
GitHub = system "GitHub"

I usually tag external systems with metadata { tags ["external"] } so anyone looking at the diagram immediately understands which systems you control and which you don't. This is crucial for understanding dependencies and risk.
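Concretely, that tagging convention looks like this (a sketch using the Stripe system from above):

Stripe = system "Stripe" {
  metadata {
    tags ["external"]
  }
}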

Container: What You Actually Deploy

This is where things get practical. Containers are the deployable units—web apps, APIs, databases, caches. These are the things you actually deploy to production.

Understanding Containers

The name "container" can be confusing because people think of Docker containers. In C4 and Sruja, "container" just means "a deployable unit." It could be:

  • A web application
  • An API service
  • A database
  • A message queue
  • A cache

The Different Container Types

Sruja gives you specific types for common containers:

// Regular containers
WebApp = container "Web Application"
API = container "API Service"

// Databases and datastores
UserDB = database "User Database"
CacheDB = datastore "Redis Cache"

// Message queues
Kafka = queue "Kafka Cluster"
RabbitMQ = queue "RabbitMQ"

I use database for relational databases (PostgreSQL, MySQL) and datastore for non-relational storage (Redis, MongoDB). The distinction matters because the behavior and considerations are different.

Adding Rich Details

When documenting containers for developers, I add a lot of useful information:

API = container "API Service" {
  technology "Node.js"
  description "RESTful API handling all business logic"
  version "3.1.0"
  tags ["backend", "typescript"]
  scale {
    min 2
    max 10
  }
  slo {
    latency {
      p95 "200ms"
    }
  }
}

This tells developers everything they need to know: what technology it uses, what it does, how it scales, and what performance targets it's supposed to meet. This is way more useful than a simple box diagram.

Component: Looking Inside (When You Need To)

Components are optional. I'll say that again: components are optional. Only use them when your audience needs to see inside a container.

When to Use Components

Add components when:

  • You're documenting for developers who need implementation details
  • A container is complex (more than 5-6 logical parts)
  • Different teams own different parts of the same container
  • You need to show internal architecture for a design review

Skip components when:

  • You're talking to business stakeholders
  • The container is simple enough to explain at a high level
  • The internal structure is still evolving

Defining Components

Here's how you model them:

API = container "API Service" {
  AuthService = component "Authentication Service"
  ProductService = component "Product Service"
  CartService = component "Cart Service"
  OrderService = component "Order Service"
}

Adding Details to Components

You can add the same details to components that you add to containers:

AuthService = component "Authentication Service" {
  technology "Rust"
  description "Handles user login, registration, and JWT validation"
  scale {
    min 1
    max 5
  }
  slo {
    latency {
      p99 "100ms"
    }
  }
}

Putting It All Together

Let me show you a complete example that uses all four element types. This is a typical e-commerce platform:

import { * } from 'sruja.ai/stdlib'

// Level 1: People
Customer = person "Customer" {
  description "End users purchasing products"
}
Administrator = person "Administrator" {
  description "Staff managing inventory and orders"
}

// Level 2: Systems (external dependencies)
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
  }
}
EmailService = system "Email Service" {
  metadata {
    tags ["external"]
  }
}

// Levels 2-3: The main system and its containers
ECommerce = system "E-Commerce Platform" {
  description "Main platform for online sales"

  WebApp = container "Web Application" {
    technology "React"
    description "Single-page application for customers"
  }

  API = container "API Service" {
    technology "Node.js"
    description "RESTful API handling business logic"
    scale {
      min 3
      max 10
    }

    // Level 4: Components (inside API)
    ProductService = component "Product Service"
    CartService = component "Cart Service"
    OrderService = component "Order Service"
    PaymentService = component "Payment Service"
  }

  Database = database "PostgreSQL" {
    technology "PostgreSQL 14"
    description "Primary data store"
  }

  Cache = datastore "Redis Cache" {
    technology "Redis 7"
    description "Caching layer for performance"
  }
}

See how everything nests naturally? People at the top, systems next, containers within systems, and components within containers. This hierarchy makes it easy to understand the architecture at any level of detail.

Naming Conventions That Work

After years of modeling systems, I've settled on some naming conventions that keep things consistent and readable.

Element IDs: Use PascalCase

// Good
Customer = person "Customer"
WebApp = container "Web App"
UserService = component "User Service"

// Avoid these
customer = person "Customer"      // lowercase breaks the convention
WEBAPP = container "Web App"      // ALL CAPS feels like shouting
webApp = container "Web App"      // camelCase is inconsistent with PascalCase

Display Names: Be Descriptive

// Good
User = person "End User"
API = container "API Service"

// Avoid these
User = person "user"              // Lowercase looks unprofessional
API = container "api"             // Lowercase looks like a typo

Consistency Within a Diagram

This is the one that trips people up most often. If you name one thing a "Service," call all similar things "Service":

// Good: Consistent naming
UserService = component "User Service"
OrderService = component "Order Service"
PaymentService = component "Payment Service"

// Bad: Inconsistent naming
UserService = component "User Service"
Order = component "Order"              // Should be "Order Service"
Payment = component "Payment Service"  // Inconsistent with "Order"

I create a style guide for each project that lists the naming conventions we're using. It sounds tedious, but it saves so much time and confusion in the long run.

Using Metadata Effectively

Metadata is where you add context that makes your diagrams actually useful. Don't add metadata just for the sake of it—add metadata that helps your audience understand what they're looking at.

API = container "API Service" {
  technology "Rust"
  version "2.0.1"
  tags ["backend", "api"]
  metadata {
    team ["platform-team"]
    repository "github.com/company/api"
    documentation "docs.company.com/api"
  }
}

I include team ownership information because it helps people know who to talk to when they have questions. I include repository links because it connects the architecture to the actual code. These details transform a static diagram into a living piece of documentation.

What to Remember

Sruja gives you four element types—person, system, container, component—and they're all you need. The key is knowing when to use each one:

  • Person: Always include these. Every system serves humans.
  • System: Include the main systems you're building and the ones you depend on.
  • Container: Include the deployable units within your systems.
  • Component: Only include these when your audience needs implementation details.

Add metadata that provides context. Use naming conventions consistently. Keep it simple and your diagrams will be clear and useful.

Everything else flows from these principles.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding:

Question 1

You're modeling a blog platform with these requirements:

"The platform has readers and writers. The main system contains a web frontend (Next.js), API backend (Python/FastAPI), and database (PostgreSQL). The API has user management, post management, and comment services."

At the container level, which elements should you include?

A) Reader, Writer, User Management, Post Management, Comment Management
B) Web Frontend, API Backend, Database
C) User Management, Post Management, Comment Management
D) Blog Platform, Next.js, FastAPI, PostgreSQL


Answer: B) Web Frontend, API Backend, Database

Containers are the deployable units within a system. The web frontend, API backend, and database are all things you'd deploy. Readers and writers are people, not containers. User management, post management, and comment management are components (internal modules within the API), not containers. Next.js, FastAPI, and PostgreSQL are technologies, not elements.

The key insight: containers = what you actually deploy. Everything else is either higher-level (systems, people) or lower-level (components).
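If it helps, here is one way the container-level model from this question might look in Sruja (the IDs are illustrative):

Blog = system "Blog Platform" {
  Frontend = container "Web Frontend" {
    technology "Next.js"
  }
  API = container "API Backend" {
    technology "FastAPI"
  }
  DB = database "PostgreSQL"
}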


Question 2

When should you model components within a container?

A) Always—components provide the most detailed view
B) Only when using microservices architecture
C) When your audience needs to understand internal architecture or when a container is complex
D) Never—containers provide sufficient detail for all use cases


Answer: C) When your audience needs to understand internal architecture or when a container is complex

Components are an optional level of detail. You should include them when:

  • You're documenting for developers who need implementation details
  • A container has grown complex (more than 5-6 logical parts)
  • Different teams own different parts of the same container
  • You're doing a detailed design review

You should skip components when:

  • You're presenting to stakeholders who don't need implementation details
  • The container is simple enough to explain at a high level
  • The internal structure is still changing

The golden rule: match the level of detail to your audience's needs. Don't show components to business stakeholders. Do show components to developers who need to understand how the system is built.


What's Next?

Now you know how to define all the parts of your system. But parts alone don't tell the whole story. Systems are defined by how their parts interact and connect.

In the next lesson, you'll learn about relationships—how to model the connections between your elements. You'll learn to write clear, meaningful labels that describe exactly how parts communicate with each other.

See you there!


Connecting the Dots: Defining Relationships

Imagine looking at a map with cities marked but no roads connecting them. You'd know where things are, but you'd have no idea how to get from one place to another. Relationships in software architecture are those roads—they show how your parts connect, communicate, and depend on each other.

In this lesson, you'll learn to create meaningful connections between your elements. These connections transform a collection of isolated parts into a coherent system that tells a story.

Learning Goals

By the end of this lesson, you'll be able to:

  • Write clear, meaningful relationship labels
  • Model different types of interactions between parts
  • Use tags to categorize and enrich your relationships
  • Handle complex scenarios like nested references and multiple connections
  • Create relationships that make your diagrams tell a story

What Are Relationships, Really?

In the simplest terms, relationships describe how parts interact. They show:

  • Communication: How components talk to each other
  • Data flow: Where information moves through the system
  • Dependencies: What happens if one part goes down
  • User actions: What people actually do with the system

But here's the thing I've learned over the years: relationships aren't just technical—they're about telling the story of your system. A well-crafted relationship doesn't just say "connects"—it explains why and how things connect.

The Basic Syntax

Sruja keeps it simple:

From -> To "Label"

  • From: Who initiates the interaction?
  • To: Who receives it?
  • "Label": What's happening? (This is the most important part!)

Here are some examples:

// Person to System
Customer -> Shop "Browses products"
Administrator -> Shop "Manages inventory"

// Container to Container
Shop.WebApp -> Shop.API "Makes API calls"
Shop.API -> Shop.Database "Reads and writes"

// Component to Component
Shop.API.ProductService -> Shop.API.CartService "Gets product details"

Notice how each label uses present tense verbs and specific actions. "Browses products" is way more informative than "uses."

Writing Labels That Tell a Story

The difference between a good architecture diagram and a great one often comes down to relationship labels. Let me show you what I mean.

Good Labels Tell a Story

Customer -> Shop.WebApp "Browses products"
Shop.API -> Shop.Database "Queries data"
Shop.API -> PaymentGateway "Processes payment"

Each label tells you something meaningful:

  • The customer is browsing (shopping behavior)
  • The API is querying (data retrieval)
  • The payment gateway is processing (transaction handling)

Bad Labels Are Mysterious

Customer -> Shop.WebApp "Uses"  // Too generic—what do they do?
Shop.API -> Shop.Database "Connects"  // Doesn't describe the interaction
Shop.API -> PaymentGateway "Integration"  // Technical term, not behavioral

These labels leave you with questions. What does "uses" mean? What kind of connection? What's being integrated?

My Label-Writing Process

When I'm writing relationship labels, I ask myself three questions:

  1. What's the action? (Browse, query, process, send, receive)
  2. What's the context? (Products, data, payment, notifications)
  3. Is it specific enough? (Add details if it helps understanding)

Then I combine them: Action + Context = "Browses products"
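Applied to a couple of hypothetical relationships, the formula reads like this:

// Action + context in practice
Analyst -> Dashboard "Generates reports"       // generate + reports
Dashboard -> Warehouse "Queries sales data"    // query + sales data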

Relationship Patterns You'll See Everywhere

After modeling hundreds of systems, I've noticed patterns that repeat constantly. Recognizing these patterns will make you faster and your diagrams more consistent.

Pattern 1: User Interactions

This pattern shows what people actually do with your system:

Customer -> Shop.WebApp "Logs in"
Customer -> Shop.WebApp "Views products"
Customer -> Shop.WebApp "Adds to cart"
Customer -> Shop.WebApp "Checks out"

I model each user action as a separate relationship. This makes it clear what functionality the system needs to support.

Pattern 2: Service Communication

This pattern shows how your backend services talk to each other:

WebApp -> API "Sends requests"
API -> Database "Persists data"
API -> Cache "Reads cache"
Cache -> API "Returns cached data"

Notice the bidirectional communication? The API requests data from the cache, and the cache returns it as its own relationship. Modeling both directions shows the caching strategy clearly.

Pattern 3: External Dependencies

This pattern shows systems you depend on outside your control:

API -> PaymentGateway "Processes payment"
API -> EmailService "Sends notifications"
API -> AnalyticsService "Tracks events"

I always mark these systems as external so anyone looking at the diagram immediately understands which dependencies are within your control and which aren't.

Referencing Nested Elements

One of Sruja's powerful features is dot notation for nested elements. Let me show you how it works.

Direct Child Reference

Customer -> Shop.WebApp "Uses"

This is straightforward—you're referencing a direct child.

Nested Component Reference

Shop.API.ProductService -> Shop.API.CartService "Gets product info"

Here, both products and carts live inside the API container. The dot notation makes it clear where each component lives.

Cross-System Reference

Shop.API -> PaymentGateway.ChargeService "Processes payment"

This shows that you're calling a specific service within an external system. This level of detail can be crucial when debugging integration issues.

Using Tags to Add Meaning

Tags are optional but incredibly useful for adding context to your relationships. Think of them as annotations that help people understand what they're looking at.

Common Tags I Use

// Protocol information
Shop.WebApp -> Shop.API "Sends requests" [http]
Shop.API -> Shop.Database "Queries data" [sql]

// Importance
Shop.API -> PaymentGateway "Processes payment" [critical]
Shop.API -> EmailService "Sends notifications" [optional]

// Data flow
Shop.WebApp -> Shop.API "Sends requests" [synchronous]
Shop.API -> EmailService "Sends notifications" [asynchronous]

// Security
Shop.WebApp -> Shop.API "Sends requests" [authenticated]
Customer -> Shop.WebApp "Browses" [public]

I use tags when the detail matters for understanding the system. For example, marking something as critical tells people it's a path to watch closely during outages. Marking something as external reminds everyone it's outside your control.

Handling Multiple Relationships

Elements rarely have just one relationship—they're connected to many things. That's normal and expected.

A Typical Container with Multiple Connections

// WebApp connects to several things
Customer -> Shop.WebApp "Browses"
Shop.WebApp -> Shop.API "Queries products"
Shop.WebApp -> Shop.Cache "Reads cache"
Shop.WebApp -> Customer "Displays products"

// API is even more connected
Shop.WebApp -> Shop.API "Sends request"
Shop.API -> Shop.Database "Persists order"
Shop.API -> PaymentGateway "Processes payment"
Shop.API -> Shop.WebApp "Returns response"
Shop.API -> EmailService "Sends confirmation"

This might look overwhelming at first, but it's actually telling a clear story. The WebApp is the interface for customers. The API is the orchestrator that talks to databases, payment systems, and email services.

When Does Too Many Become Too Many?

If an element has more than 10-12 relationships, I ask myself: "Is this element doing too much?"

Maybe it's time to split it into smaller, more focused parts. A single service that talks to 15 different systems is probably violating the single responsibility principle.
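As a rough sketch of what such a split can look like (the element names here are hypothetical), notification traffic gets carved out of an overloaded API:

// Before: Shop.API talked to the database, cache, payment gateway,
// email service, SMS provider, analytics, ... (too many hats)

// After: responsibilities split into focused containers
Shop = system "Shop" {
  OrderAPI = container "Order API"
  Notifier = container "Notification Service"
}
Shop.OrderAPI -> PaymentGateway "Processes payments"
Shop.Notifier -> EmailService "Sends emails"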

Understanding Relationship Direction

Direction matters. It tells you who initiates the interaction and who responds.

One-Way Relationships

Customer -> Shop.WebApp "Browses"

The customer acts, the web app responds. Simple.

Two-Way Relationships (Separate Connections)

User -> App.API "Submits data"
App.API -> User "Returns result"

Here, the user sends data, and the API sends data back. They're two separate relationships because each one tells its own story.

Feedback Loops (Cycles)

User -> App.WebApp "Submits form"
App.WebApp -> App.API "Validates"
App.API -> App.WebApp "Returns errors"
App.WebApp -> User "Shows errors"
// User resubmits (loop completes)

This creates a feedback loop: submit → validate → error → show → resubmit. Feedback loops are everywhere in real systems. The thermostat in your house is one. The login flow on a website is another.

Bringing It All Together

Let me show you a complete example that uses everything we've covered:

import { * } from 'sruja.ai/stdlib'

// People
Customer = person "Customer"
Admin = person "Administrator"

// Systems
Shop = system "Shop" {
  WebApp = container "Web Application"
  API = container "API Service"
  DB = database "Database"
}

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
  }
}

// Customer interactions
Customer -> Shop.WebApp "Browses products" [public]
Customer -> Shop.WebApp "Adds to cart" [authenticated]
Customer -> Shop.WebApp "Checks out" [authenticated]

// Internal communication
Shop.WebApp -> Shop.API "Sends requests" [http, encrypted]
Shop.API -> Shop.DB "Persists data" [critical]
Shop.API -> Shop.DB "Queries data" [cached]

// External dependencies
Shop.API -> PaymentGateway "Processes payment" [critical, external]

// Admin interactions
Admin -> Shop.WebApp "Manages products" [authenticated]
Admin -> Shop.WebApp "Views reports" [authenticated]

view index {
  include *
}

This diagram tells a complete story. You can see who uses the system, how parts communicate internally, and what external dependencies exist. The tags add crucial context about security, importance, and whether systems are under your control.

What to Remember

Relationships are what transform a collection of parts into a living, breathing system. When you write relationships:

  • Be specific: "Browses products" not "uses"
  • Use present tense: "Processes" not "will process"
  • Think about the story: What's really happening here?
  • Add context with tags: When details matter
  • Don't overthink it: Most relationships are simple

If you take away one thing, let it be this: a good relationship label tells a story about how your system actually works. Everything else flows from that principle.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding:

Question 1

You're modeling relationships for a blog platform. Which relationship label is the most informative?

A) Reader -> Blog.Frontend "Uses"
B) Reader -> Blog.Frontend "Reads posts"
C) Reader -> Blog.Frontend "Frontend connection"
D) Reader -> Blog.Frontend "HTTP request"


Answer: B) Reader -> Blog.Frontend "Reads posts"

Let's compare the options:

  • A) "Uses" is too generic. What are they using it for? Reading? Writing? Browsing?
  • B) "Reads posts" is specific and informative. It tells you exactly what the reader is doing.
  • C) "Frontend connection" is technical jargon that doesn't describe the user action.
  • D) "HTTP request" describes the protocol, not the business action. This is an implementation detail.

The best labels describe what's happening in business terms, not technical terms. "Reads posts" tells you the user behavior. "HTTP request" tells you nothing about the user's intent.


Question 2

When should you add tags to your relationships?

A) Always—tags provide useful context
B) Never—tags make diagrams too complex
C) When the tag adds meaningful context that helps people understand the system
D) Only for external dependencies


Answer: C) When the tag adds meaningful context that helps people understand the system

Tags are optional—use them when they add value:

Good use cases for tags:

  • [critical] — Highlights important paths that deserve attention during outages
  • [external] — Marks dependencies outside your control
  • [synchronous] vs [asynchronous] — Shows communication patterns that affect reliability
  • [authenticated] vs [public] — Indicates security requirements

Bad use cases for tags:

  • Tags that are obvious (like [http] when everything uses HTTP)
  • Tags that don't add clarity
  • Tags just for the sake of adding tags

The golden rule: only add tags when they help your audience understand something important about the relationship.


What's Next?

Now you know how to define parts and connect them with relationships. You can create diagrams that tell stories about how systems work.

But there's one more thing: how do you organize all these parts? That's what the next lesson is about. You'll learn about hierarchy and nesting—how to structure your systems in a way that's clear, consistent, and scalable.

See you there!

Bringing Order to Chaos: Hierarchy and Nesting

Imagine walking into a house where everything is dumped in one giant room—furniture, kitchen appliances, clothes, tools, books. You could find what you're looking for eventually, but it would be frustrating and inefficient. That's what an unorganized architecture feels like.

Hierarchy and nesting are your tools for bringing order to chaos. They help you organize your system's parts in a way that makes sense, is easy to understand, and scales as your system grows.

In this lesson, you'll learn how to structure your systems effectively using the C4 model's hierarchy. You'll discover when to nest, when to keep things flat, and how to create diagrams that tell a clear story at any level of detail.

Learning Goals

By the end of this lesson, you'll be able to:

  • Understand the C4 model's four-level hierarchy
  • Know when to nest components and when to keep them separate
  • Reference nested elements using dot notation
  • Create systems that are consistent and easy to navigate
  • Choose the right level of detail for your audience

The C4 Hierarchy: Your Foundation

Let's revisit the C4 model's hierarchy because it's the foundation for everything we'll discuss. Think of it like building a city:

People (Level 1)
  ↓ The citizens who live there
  
Systems (Level 2)
  ↓ The buildings and facilities
  
Containers (Level 3)
  ↓ The rooms and departments within buildings
  
Components (Level 4)
  ↓ The furniture and equipment within rooms

Each level contains the one below it. People interact with systems, systems contain containers, containers contain components. This nested structure is what makes architectures clear and navigable.

I've found this metaphor really helps people understand why we nest things. You don't put a couch (component) directly in the middle of the street (system). It belongs in a living room (container), which is inside a house (system). The same principle applies to software architecture.

Nesting Containers in Systems

Systems contain containers. This is where most of the action happens.

The Basic Pattern

import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce Platform" {
  WebApp = container "Web Application" {
    technology "React"
  }

  API = container "API Service" {
    technology "Node.js"
  }

  Database = database "PostgreSQL" {
    technology "PostgreSQL 14"
  }

  Cache = datastore "Redis Cache" {
    technology "Redis 7"
  }
}

Why Nesting Matters

When I first started modeling systems, I sometimes put everything at the same level—all containers, no systems. It seemed simpler, but it created confusion:

  • Who owns what? Without systems, ownership is unclear
  • What belongs together? Related containers end up scattered
  • How do I reference things? Dot notation becomes meaningless

Nesting containers in systems solves all these problems. It's clear what belongs together, who owns it, and how to reference it.
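A minimal sketch of the payoff: once containers live inside a system, dot notation makes every reference unambiguous.

Customer = person "Customer"

Platform = system "Platform" {
  API = container "API Service"
}

// The nesting tells you exactly where API lives and who owns it
Customer -> Platform.API "Calls endpoints"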

A Real-World Example

Let me show you a more realistic example from a project I worked on—a healthcare platform:

HealthcarePlatform = system "Healthcare Platform" {
  PatientPortal = container "Patient Portal" {
    technology "React"
    description "Web app for patients to manage appointments"
  }

  DoctorDashboard = container "Doctor Dashboard" {
    technology "Vue.js"
    description "Web app for doctors to view patient records"
  }

  API = container "API Gateway" {
    technology "Kong"
    description "Routes requests to microservices"
  }

  EHR = container "EHR System" {
    technology "Java"
    description "Electronic health records system"
  }
}

See how each container has a clear purpose? The system groups them together and tells you they're all part of the same platform. Anyone looking at this diagram immediately understands the big picture.

Nesting Components in Containers

This is where you need to be careful. Components are powerful, but only when used appropriately.

The Decision Framework

I use this simple framework to decide whether to add components:

Add components when:

  • You're documenting for developers who need implementation details
  • A container is complex (more than 5-6 logical parts)
  • Different teams own different parts of the same container
  • You need to show internal architecture for a design review

Skip components when:

  • You're presenting to stakeholders who don't need technical details
  • The container is simple enough to explain at a high level
  • The internal structure is still evolving and changing

When to Nest Components

Let me show you an example where components make sense:

API = container "API Service" {
  technology "Node.js"
  description "RESTful API handling all business logic"

  // Authentication and authorization
  AuthService = component "Authentication Service" {
    technology "Node.js"
    description "Handles login, registration, JWT tokens"
  }

  // Domain services
  UserService = component "User Service" {
    description "User profile and preferences"
  }
  
  ProductService = component "Product Service" {
    description "Product catalog and inventory"
  }
  
  CartService = component "Cart Service" {
    description "Shopping cart and checkout"
  }
  
  OrderService = component "Order Service" {
    description "Order processing and fulfillment"
  }

  // Integration services
  PaymentService = component "Payment Service" {
    description "Payment gateway integration"
  }
  
  NotificationService = component "Notification Service" {
    description "Email and SMS notifications"
  }
}

This makes sense because:

  • The API is complex with multiple distinct services
  • Different teams might own different services
  • Developers need to understand the internal architecture

When to Keep It Simple

Now, here's an example where I would not use components:

SimpleAPI = container "Simple API" {
  technology "Node.js"
  description "Basic CRUD API for a small application"
}

Why no components? Because:

  • It's simple enough to explain at the container level
  • The audience (stakeholders) doesn't need implementation details
  • The internal structure might change as we iterate

The key insight: match the level of detail to your audience's needs.

Referencing Nested Elements

Once you have nested elements, you need to know how to reference them. Sruja uses dot notation, which is intuitive once you get the hang of it.

The Reference Patterns

// Level 1 to Level 2 (Person to System)
Customer -> ECommerce "Uses platform"

// Level 1 to Level 3 (Person to Container)
Customer -> ECommerce.WebApp "Browses products"

// Level 3 to Level 3 (Container to Container)
ECommerce.WebApp -> ECommerce.API "Submits requests"
ECommerce.API -> ECommerce.Database "Reads and writes data"

// Level 3 to Level 4 (Container to Component)
ECommerce.WebApp -> ECommerce.API.AuthService "Validates user"

// Level 4 to Level 4 (Component to Component)
ECommerce.API.AuthService -> ECommerce.API.UserService "Fetches user profile"

Cross-System References

What if you need to reference something inside another system? Dot notation still works:

// Reference a service inside an external system
ECommerce.API.PaymentService -> Stripe.ChargeService "Process payment"

// Reference a database inside another system
ECommerce.API -> Analytics.DataWarehouse "Push analytics data"

This level of detail can be useful when debugging integrations or documenting specific API contracts.

Consistency: The Golden Rule

If there's one thing I've learned from years of modeling systems, it's this: consistency matters more than any individual decision.

Consistency in Nesting

Don't mix levels inconsistently:

// Bad: Inconsistent nesting
App = system "App" {
  Frontend = container "Frontend"
  Backend = container "Backend" {
    API = component "API Service"
  }
  Database = database "Database"
}

// Good: Consistent nesting
App = system "App" {
  Frontend = container "Frontend" {
    UI = component "UI Components"
  }
  Backend = container "Backend" {
    API = component "API Service"
  }
  Database = database "Database"
}

In the bad example, why does Backend have components but Frontend doesn't? Is Frontend simpler? Did we forget to break it down? The inconsistency creates confusion. The good example shows a consistent approach.

Consistency in Naming

Use consistent naming conventions across your hierarchy:

// Good: Consistent suffixes
UserService = component "User Service"
OrderService = component "Order Service"
PaymentService = component "Payment Service"

// Bad: Inconsistent suffixes
UserService = component "User Service"
Order = component "Order"
Payment = component "Payment Service"

In the bad example, why is it "User Service" but just "Order"? The inconsistency makes people wonder if there's a meaningful difference. Spoiler: there usually isn't.

Pitfalls I've Encountered (So You Don't Have To)

Let me share some mistakes I've made and seen others make. Hopefully, you can avoid them.

Pitfall 1: Deep Nesting

// Bad: Too deep (5+ levels)
App = system "App" {
  Frontend = container "Frontend" {
    Layout = component "Layout" {
      Header = component "Header" {
        Navigation = component "Navigation" {
          Menu = component "Menu"
        }
      }
    }
  }
}

This is unreadable. Who can understand a 5-level deep structure? Nobody. The diagram becomes useless because it's too complex.

Solution: Keep nesting to 3-4 levels max. If you're going deeper, you're probably in implementation territory, not architecture.
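A flatter version of the example above keeps the useful names without the ladder of nesting:

// Better: stop at one component level; deeper structure is implementation detail
App = system "App" {
  Frontend = container "Frontend" {
    Layout = component "Layout"
    Navigation = component "Navigation"
  }
}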

Pitfall 2: Orphaned Elements

// Bad: Component without container
Shop = system "Shop"
WebApp = container "Web App"
API = container "API"
Database = database "Database"
AuthService = component "Auth Service"  // Orphan!

Where does AuthService belong? Is it in WebApp? In API? Orphaned elements confuse everyone who looks at the diagram.

Solution: Every component must belong to a container. Every container must belong to a system. No exceptions.
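Applying that rule to the example above, the orphan gets a home:

// Fixed: AuthService now clearly lives inside the API container
API = container "API" {
  AuthService = component "Auth Service"
}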

Pitfall 3: Flat Everything

// Bad: Everything at same level
Customer = person "Customer"
WebApp = container "Web App"
API = container "API"
Database = database "Database"
AuthService = component "Auth Service"

This shows no structure. Is WebApp part of API? Are they siblings? Who knows?

Solution: Show the hierarchy. Nest things properly. Make it clear what belongs to what.
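Here's the same set of elements with the hierarchy made explicit (the relationship is illustrative):

Customer = person "Customer"

Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API" {
    AuthService = component "Auth Service"
  }
  Database = database "Database"
}

Customer -> Shop.WebApp "Browses the shop"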

Creating Views at Different Levels

One of the most powerful features of Sruja is the ability to create views at different hierarchy levels. This lets you tell the right story to the right audience.

Level 1: System Context View

view system_context {
  title "System Context"
  include *
}

This view shows everything: people, systems, external dependencies. Perfect for stakeholders who want the big picture.

Level 2: System View

view system_view of ECommerce {
  title "E-Commerce System"
  include ECommerce
}

This view shows a single system and all its containers. Great for developers who need to understand how a specific system is structured.

Level 3: Container View

view container_view of ECommerce.API {
  title "API Internals"
  include ECommerce.API.*
  exclude ECommerce.API.NotificationService
}

This view shows the internals of a container. Perfect for teams working on that specific service.

Level 4: Component View

view component_view of ECommerce.API {
  title "API Components"
  include ECommerce.API.*
}

This view shows all components within a container. Ideal for detailed design reviews and implementation planning.

The beauty is that you can have multiple views of the same architecture, each tailored to a different audience. The stakeholders get the big picture. Developers get the details. Everyone gets what they need.

What to Remember

Hierarchy and nesting are about bringing order to complexity. When you're structuring your systems:

  • Follow the C4 hierarchy: Person → System → Container → Component
  • Nest logically: Group related parts together
  • Be consistent: Use the same patterns throughout
  • Match detail to audience: Don't show components to stakeholders
  • Create multiple views: Tell different stories to different audiences
  • Keep it simple: Avoid deep nesting and over-complexity

If you take away one thing, let it be this: a well-structured hierarchy makes complex systems understandable. Everything else flows from that principle.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding:

Question 1

You're modeling a simple mobile app with these requirements:

"Users can create profiles, post content, and like posts. The app has a mobile frontend, API backend, and database."

At what level should you stop modeling? Should you include components?

A) Yes, always include components for completeness
B) No, containers are sufficient for this system
C) Include components for the API but not the mobile frontend
D) Include components only if using microservices

Click to see the answer

Answer: B) No, containers are sufficient for this system

Here's why:

  • The system is simple (mobile frontend, API, database)
  • Three containers is manageable complexity
  • There's no indication that the API is internally complex
  • Components would add unnecessary detail for a straightforward system

The decision framework I use:

  • Is the container complex? No, it sounds simple.
  • Is the audience developers? Maybe, but even developers don't always need component-level detail.
  • Will it add clarity? Probably not—it might just add noise.

If this system grows and the API becomes complex with 10+ services, then yes, add components. But for now? Keep it at the container level.


Question 2

You're referencing nested elements in your architecture. Which reference is correctly formatted?

A) WebApp.API.Database "Queries data"
B) API.AuthService -> UserService "Gets profile"
C) Shop.API.ProductService -> CartService "Adds item"
D) System.Container.Service -> External.Service "Calls"

Click to see the answer

Answer: B) API.AuthService -> UserService "Gets profile"

Let's analyze each option:

A) WebApp.API.Database "Queries data" — Wrong. There's no arrow, so this isn't even a relationship. The path also implies WebApp contains an API, which in turn contains a Database. But typically, API and Database are siblings within the system, not nested inside each other. A sibling database would be referenced as something like Shop.Database.

B) API.AuthService -> UserService "Gets profile" — Correct! This shows a component (AuthService) within a container (API) communicating with another component (UserService) within the same container. Perfect dot notation.

C) Shop.API.ProductService -> CartService "Adds item" — Almost correct, but incomplete. CartService should be fully qualified as Shop.API.CartService. Otherwise, it's unclear where CartService lives.

D) System.Container.Service -> External.Service "Calls" — Wrong on multiple levels. External systems shouldn't have components in your architecture model—you don't control them. Also, it's unclear where External.Service is.

The key insight: use full dot notation when referencing nested elements. Make it clear where every element lives in the hierarchy.
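For example, option C becomes unambiguous once CartService is fully qualified:

Shop.API.ProductService -> Shop.API.CartService "Adds item"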


What's Next?

Congratulations! You've completed Module 2: Parts and Relationships. You now have a complete toolkit for:

  • Identifying the key parts of any system
  • Modeling parts using Sruja's element types
  • Defining meaningful relationships between parts
  • Organizing everything with clear hierarchy and structure

You can now create diagrams that tell stories about how systems work—stories that are clear, consistent, and useful for your audience.

In the next module, you'll learn about boundaries—how to define where one system ends and another begins. This is crucial for understanding dependencies, managing complexity, and designing systems that are decoupled and maintainable.

See you there!


Module 2 Complete!

You've now mastered the fundamentals of modeling systems. Here's what you've learned:

Lesson 1: Identifying Parts

  • Start with people, then systems, then containers, then components
  • Use the C4 hierarchy as your guide
  • Match the level of detail to your audience

Lesson 2: Sruja Elements

  • Four element types: person, system, container, component
  • Add details that matter: technology, description, SLOs
  • Use naming conventions consistently

Lesson 3: Defining Relationships

  • Write labels that tell a story
  • Use tags to add meaningful context
  • Reference nested elements with dot notation

Lesson 4: Hierarchy and Nesting

  • Structure your systems consistently
  • Avoid deep nesting and over-complexity
  • Create multiple views for different audiences

You're ready to tackle more advanced concepts. Let's continue!

Module 3: Boundaries

Overview

In this module, you'll learn to define what's inside your system vs. what's outside (the environment). Understanding boundaries is crucial for clear ownership, risk management, and integration planning.

Learning Objectives

By the end of this module, you'll be able to:

  • Define system boundaries clearly
  • Model internal vs. external components
  • Identify and document dependencies
  • Create bounded contexts for services
  • Plan integrations at boundaries

Lessons

Prerequisites

Time Investment

Approximately 1-1.5 hours to complete all lessons and exercises.

What's Next

After completing this module, you'll learn about Module 4: Flows.

Drawing the Line: Understanding Boundaries

Ever played a game of tag as a kid? There was always that safe zone—home base—where you couldn't be tagged. Everything outside that zone was fair game, everything inside was safe. Boundaries in software architecture work the same way: they define what's safe inside your system and what's risky outside it.

In this lesson, you'll learn to draw those lines clearly. You'll discover why boundaries matter, what types of boundaries you'll encounter, and how to define them effectively in your architectures.

Let's start by understanding what boundaries actually are.

Learning Goals

By the end of this lesson, you'll be able to:

  • Define what boundaries are in the context of software architecture
  • Recognize the different types of boundaries you'll encounter
  • Understand why boundaries matter for ownership, risk, and clarity
  • Apply boundaries effectively in your Sruja models

What Are Boundaries, Really?

At its simplest level, a boundary is a line that separates what's inside your system from what's outside. But this line isn't arbitrary—it represents something meaningful:

  • Inside: What you build, own, maintain, and control
  • Outside: The environment, external dependencies, things you rely on but don't control

Think of it like your house:

┌─────────────────────────────────────┐
│             OUTSIDE                 │
│  (The neighborhood, other houses)   │
│                                     │
│    ┌───────────────────────────┐    │
│    │        YOUR HOUSE         │    │
│    │     (Your safe space)     │    │
│    │                           │    │
│    │    [Your rooms, stuff]    │    │
│    │                           │    │
│    └───────────────────────────┘    │
│                  ↑                  │
│      Front door = Boundary Line     │
└─────────────────────────────────────┘

You control everything inside your house. You don't control the neighbor's house down the street. The front door is your boundary—it's clear where your space ends and public space begins.

In software, boundaries work the same way. They tell everyone what you're responsible for and what you're not.

Why Boundaries Matter (The Real Reasons)

I've seen too many projects suffer from unclear boundaries. Teams argue about who owns what. Outages cascade because nobody planned for external failures. Security gaps emerge because nobody thought about what crosses the boundary.

Let me show you why boundaries matter in practice.

1. Clear Ownership: Who's Responsible?

This is the most common reason I see teams argue. When boundaries are unclear, nobody knows who fixes what.

// Inside: Your team owns this
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API Service"
}

// Outside: Another team/vendor owns this
PaymentGateway = system "Payment Gateway"

When something breaks with payments, who fixes it? If the boundary is clear, everyone knows: your team fixes anything inside Shop, the payment vendor fixes anything in PaymentGateway. No confusion, no finger-pointing.

I once worked on a project where a team spent three days arguing about who owned a broken integration because nobody had bothered to document the boundary. Three days of developer time—wasted.

2. Risk Management: What Could Go Wrong?

Every time you cross a boundary, you're introducing risk. You're depending on something outside your control.

// External dependency = external risk
Shop.API -> PaymentGateway "Process payment"

// If PaymentGateway is down, what happens?
// - Can customers still check out?
// - Do we have a backup?
// - What's our SLA with them?

When I model systems, I always ask: "What happens if this external thing breaks?" If I don't have a good answer, that's a risk I need to document and plan for.
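One way to capture that answer directly in the model, reusing the metadata fields shown elsewhere in this book (the SLA value and fallback note are illustrative):

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "vendor"]
    sla "99.9% uptime"
  }
  // Documented risk plan: queue checkout requests and retry when the gateway recovers
}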

3. Testing Scope: What Do We Test?

Boundaries tell you what kind of testing you need.

// Internal: Unit tests are sufficient
Shop.WebApp -> Shop.API

// External: Integration tests are required
Shop.API -> PaymentGateway

You can unit test internal interactions all day. But when you cross a boundary? You need integration tests. You need to handle network failures. You need to test timeout scenarios. Boundaries tell you where testing gets more complex.

4. Security: What Needs Protection?

Security controls should be strongest at your boundaries. That's where attacks happen.

// Inside boundary: Apply internal controls
Shop.WebApp -> Shop.API

// Crossing boundary: Validate everything
Shop.API -> PaymentGateway "Process payment" [authenticated, encrypted]

I learned this the hard way early in my career when I didn't properly validate data crossing a boundary and ended up with a security vulnerability. Lesson learned: boundaries are where security matters most.

Types of Boundaries You'll Encounter

After years of modeling systems, I've noticed there are really five main types of boundaries you'll run into. Understanding which type you're dealing with helps you make the right decisions.

1. System Boundary: Your App vs. The World

This is the most common boundary—your main application versus everything else.

// Inside: Your system
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API"
}

// Outside: Everything else
Customer = person "Customer"
PaymentGateway = system "Payment Gateway"
EmailService = system "Email Service"

The system boundary defines your overall scope. Everything inside is yours. Everything outside is not.

2. Team Boundary: What Your Team Owns

In larger organizations, different teams own different systems. These team boundaries matter for communication and coordination.

// Your team's system
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API"
}

// Another team's system
Analytics = system "Analytics Platform" {
  metadata {
    tags ["internal", "data-team"]
    owner "Data Team"
  }
}

When I see team boundaries, I always add metadata about who owns what. This prevents confusion when issues arise and everyone's trying to figure out who to call.

3. Organizational Boundary: Inside vs. Outside Your Company

Sometimes boundaries exist at the organizational level—your company versus external vendors.

// Your company's system
Shop = system "Shop"

// External vendor
Stripe = system "Stripe" {
  metadata {
    tags ["external", "vendor"]
    owner "Stripe Inc."
  }
}

Vendor relationships are fundamentally different from internal relationships. Different SLAs, different support channels, different everything. Mark these boundaries clearly so everyone knows the difference.

4. Deployment Boundary: What Deploys Together

Sometimes the same team owns systems, but they deploy independently. That's a deployment boundary.

// Same deployment
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API"
}

// Separate deployment
Database = system "Database Cluster" {
  metadata {
    tags ["internal", "dba-team"]
    deployment "Managed Service"
  }
}

Deployment boundaries tell you about failure domains. If the API and database deploy independently, they can fail independently. That's important to understand.

5. Trust Boundary: Security Zones

This is one of the most important boundaries from a security perspective. It defines what's trusted versus untrusted.

// Trusted: Internal network, internal users
InternalAPI = container "Internal API"

// Untrusted: Public internet, anyone
PublicAPI = container "Public API"

I always pay special attention to trust boundaries. This is where you need authentication, authorization, encryption, and all the other security controls. Cross a trust boundary without proper security? That's how breaches happen.
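A sketch of what crossing that trust boundary might look like, reusing the relationship-tag style from earlier (the specific tags are illustrative):

Customer = person "Customer"

// Untrusted side: authenticate and validate everything at the boundary
Customer -> PublicAPI "Makes requests" [authenticated, encrypted]

// Trusted side: traffic has already been validated
PublicAPI -> InternalAPI "Forwards validated requests"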

Real-World Examples

Let me show you some examples of boundaries in action.

Example 1: A Typical E-Commerce Platform

// ┌──────────── EXTERNAL WORLD ────────────┐
// │                                        │
// │   Customer (person)                    │
// │   Payment Gateway (system)             │
// │   Email Service (system)               │
// │   Analytics (system)                   │
// │                                        │
// │   ┌─────── YOUR SYSTEM ───────┐        │
// │   │ Shop (system)             │        │
// │   │   WebApp (container)      │        │
// │   │   API (container)         │        │
// │   │   Database (database)     │        │
// │   └───────────────────────────┘        │
// │                                        │
// └────────────────────────────────────────┘

// People are always outside
Customer -> Shop.WebApp "Browses"

// Everything inside your boundary
Shop.WebApp -> Shop.API "Makes requests"
Shop.API -> Shop.Database "Persists orders"

// Crossing your boundary to external systems
Shop.API -> PaymentGateway "Process payment"
Shop.API -> EmailService "Send confirmation"
Shop.API -> Analytics "Track events"

See how clear the boundary is? Everything inside "Your System" is yours. Everything else is external. Anyone looking at this diagram immediately understands the scope and dependencies.

Example 2: Microservices with Internal Boundaries

// Even within an organization, you can have boundaries

OrderService = system "Order Service" {
  metadata {
    tags ["internal", "orders-team"]
    owner "Orders Team"
  }
}

PaymentService = system "Payment Service" {
  metadata {
    tags ["internal", "payments-team"]
    owner "Payments Team"
  }
}

InventoryService = system "Inventory Service" {
  metadata {
    tags ["internal", "inventory-team"]
    owner "Inventory Team"
  }
}

// Cross-team boundaries
OrderService -> PaymentService "Request payment"
PaymentService -> OrderService "Payment result"
OrderService -> InventoryService "Reserve items"

Each service is a bounded context owned by a different team. Even though they're all "internal" to the company, the team boundaries are real and matter for coordination.

Pitfalls to Avoid (I've Made These)

Let me share some boundary mistakes I've made or seen others make. Hopefully, you can avoid them.

Mistake 1: No Clear Boundary

// Bad: Everything looks the same
Customer = person "Customer"
Shop = system "Shop"
PaymentGateway = system "Payment Gateway"
EmailService = system "Email Service"

// Without tags, what's external? What's internal?

I've seen diagrams like this where everything looks internal. Nobody knows what the team controls versus what they depend on. It's confusing and risky.

Fix: Use metadata tags to mark external systems clearly:

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "vendor"]
  }
}

Mistake 2: Everything Marked External

// Bad: Everything marked external, no ownership
Shop = system "Shop" {
  metadata {
    tags ["external"]
  }
}
WebApp = container "Web App" {
  metadata {
    tags ["external"]
  }
}

This is the opposite problem. If everything is external, who owns anything? Who's responsible? The diagram provides no clarity about ownership.

Fix: Only mark truly external systems:

Shop = system "Shop" {
  // No tags = internal
  WebApp = container "Web App"
}

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]  // Only mark what's really external
  }
}

Mistake 3: Too Many Fragmented Boundaries

// Bad: Overly fragmented, hard to understand
System1 = system "System 1"
System2 = system "System 2"
System3 = system "System 3"
System4 = system "System 4"
System5 = system "System 5"
// ... and so on

I've seen teams create a separate system boundary for every tiny piece of functionality. The diagram becomes a mess of boxes with no clear groupings. Nobody can understand the big picture.

Fix: Group related functionality into coherent systems:

Shop = system "Shop" {
  // Group all shop-related containers
  WebApp = container "Web App"
  API = container "API"
  Cache = database "Cache"
}

Analytics = system "Analytics" {
  // Group all analytics-related containers
  Collector = container "Event Collector"
  Processor = container "Event Processor"
}

Defining Boundaries in Sruja

Sruja gives you the tools to define boundaries clearly. Here's how I use them.

Use system for Your Main Boundary

Your main application is a system. Everything you own goes inside it.

Shop = system "Shop" {
  // Everything you control goes here
  WebApp = container "Web App"
  API = container "API"
  Database = database "Database"
}

Use Metadata to Mark External Systems

External systems get metadata tags so everyone knows they're outside your boundary.

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "vendor"]
    owner "Stripe"
    sla "99.9% uptime"
  }
}

I always add context to external systems: who owns it, what the SLA is, any relevant compliance requirements. This helps people understand the risk and dependency.

Use person for External Actors

Remember: people are always outside your system boundary.

// Users are external to your system
Customer = person "Customer"
Administrator = person "Administrator"
Support = person "Customer Support"

// Your system
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API"
}

// People interact with internal components
Customer -> Shop.WebApp "Browses products"
Administrator -> Shop.WebApp "Manages inventory"
Support -> Shop.WebApp "Monitors health"

People are the actors who use your system. They're not "inside" your system—they interact with it from the outside. Always model them as separate entities.

What to Remember

Boundaries are about clarity—clarity of ownership, clarity of risk, clarity of responsibility. When you draw boundaries:

  • Be intentional: Every boundary should represent something meaningful (ownership, trust, deployment)
  • Mark external systems clearly: Use metadata tags so everyone knows what's outside your control
  • Document crossings: Every relationship that crosses a boundary is an integration point that needs attention
  • Avoid fragmentation: Group related functionality into coherent systems
  • Think about risk: Every boundary crossing introduces external risk—plan for it

If you take away one thing, let it be this: clear boundaries prevent confusion, reduce risk, and make your architectures more maintainable. Everything else flows from that principle.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding.

Question 1

You're modeling a healthcare platform with these requirements:

"A hospital scheduling system allows patients to book appointments, doctors to manage their schedules, and administrators to oversee operations. The system integrates with an external insurance API for coverage verification and sends SMS notifications through Twilio. Patient data is stored in the hospital's database."

Which parts should be modeled as outside the system boundary (external)?

A) Patients, Doctors, Administrators, Insurance API, Twilio
B) Insurance API, Twilio, Hospital Database
C) Patients, Doctors, Administrators
D) Insurance API, Twilio, Hospital Scheduling System

Click to see the answer

Answer: A) Patients, Doctors, Administrators, Insurance API, Twilio

Let's break this down:

  • Patients, Doctors, Administrators — These are people (actors), and people are always outside the system boundary. They interact with the system but aren't part of it.

  • Insurance API — This is an external service the system depends on. The hospital doesn't control it. It's outside the boundary.

  • Twilio — This is a third-party SMS service. External vendor, outside the boundary.

  • Hospital Database — This is inside the boundary. The hospital owns and maintains it.

  • Hospital Scheduling System — This is the main system being built. It's the boundary itself, containing all the internal components.

Why not the other options?

  • B) Incorrect. The Hospital Database is internal—it's owned and controlled by the hospital. It should be inside the boundary.

  • C) Incorrect. This only includes people but misses the external services (Insurance API, Twilio) that the system depends on.

  • D) Incorrect. This includes the main system itself as "external," which doesn't make sense. The system defines the boundary—it's not outside itself.

Key insight: People are always external. Third-party services are external. Systems you build and own are internal. Systems you depend on but don't control are external.


Question 2

You're reviewing an architecture diagram and notice this structure:

Shop = system "Shop"
PaymentGateway = system "Payment Gateway"
EmailService = system "Email Service"
AnalyticsService = system "Analytics Service"

Shop.WebApp -> Shop.API "Requests"
Shop.API -> PaymentGateway "Process payment"
Shop.API -> EmailService "Send email"
Shop.API -> AnalyticsService "Track events"

What's the main problem with this diagram from a boundaries perspective?

A) The Shop system doesn't have containers
B) External systems aren't marked as external
C) There are too many external dependencies
D) The relationships are too generic

Click to see the answer

Answer: B) External systems aren't marked as external

The problem is that all four systems look identical. There's no way to tell that Shop is the system being built while PaymentGateway, EmailService, and AnalyticsService are external dependencies.

What should it look like?

// Your system (no tags = internal)
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API Service"
}

// External systems (marked clearly)
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "vendor"]
    owner "Stripe"
  }
}

EmailService = system "Email Service" {
  metadata {
    tags ["external", "vendor"]
    owner "SendGrid"
  }
}

AnalyticsService = system "Analytics Service" {
  metadata {
    tags ["external", "vendor"]
    owner "Google"
  }
}

Why the other options are wrong:

  • A) Incorrect. While Shop does have containers in the corrected version, the main boundaries problem isn't about whether containers exist—it's about marking what's external versus internal.

  • C) Incorrect. Having multiple external dependencies is normal. The problem isn't the number of dependencies—it's that they aren't marked as external, so nobody knows which systems are under your control and which aren't.

  • D) Incorrect. The relationship labels are descriptive enough ("Process payment", "Send email", "Track events"). The issue is boundaries, not relationship quality.

Key insight: Always use metadata tags to mark external systems. This makes boundaries immediately visible to anyone reading your diagram. It prevents confusion about ownership and responsibility.


What's Next?

Now you understand what boundaries are and why they matter. You know how to identify different types of boundaries and mark them clearly in your diagrams.

But there's a question we haven't answered: how do you actually mark components as internal or external in Sruja? How do you use metadata effectively to document ownership and dependencies?

In the next lesson, you'll learn exactly that. We'll cover how to annotate boundary elements, model team and organizational boundaries, and create diagrams that make it instantly clear what's inside versus what's outside.

See you there!


Inside and Outside: Internal vs External

Remember those "Keep Out" signs you'd see on fences as a kid? They were clear markers: this side is private, that side is public. Inside the fence, you had your rules. Outside, it was anyone's territory.

Boundaries in software work the same way, but instead of fences, we use metadata and tags. These markers tell everyone what's inside your boundary (what you control) and what's outside (what you depend on but don't control).

In this lesson, you'll learn to mark these boundaries clearly in Sruja. You'll discover how to annotate external systems, document ownership, and make your diagrams instantly readable by anyone who picks them up.

Learning Goals

By the end of this lesson, you'll be able to:

  • Use Sruja metadata to mark external systems clearly
  • Document ownership and responsibility for each system
  • Model team and organizational boundaries effectively
  • Create diagrams that make internal/external distinctions instantly visible
  • Add meaningful context (SLAs, compliance, contact info) to external systems

Marking External Systems: The Basics

The simplest way to mark something as external is using metadata tags. Let me show you the pattern I use consistently:

// Internal: Your system (no tags = internal by default)
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API Service"
}

// External: Third-party system you depend on
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
  }
}

The tags ["external"] entry is the marker. It tells anyone reading your diagram: "This is outside our boundary. We don't control it. Plan accordingly."

This simple tag transforms a diagram from "a bunch of boxes" into "a clear picture of what we own versus what we depend on."

Adding Rich Context to External Systems

I've learned that just marking something as "external" often isn't enough. People want to know: Who owns it? What's the SLA? Who do I call when it breaks? How do we connect?

Let me show you how I add this context.

External System with Ownership

Stripe = system "Stripe" {
  metadata {
    tags ["external", "vendor"]
    owner "Stripe Inc."
    support "support@stripe.com"
  }
}

Now anyone looking at the diagram knows:

  • It's external (tags ["external"])
  • It's a vendor (tags ["vendor"])
  • Who owns it (owner "Stripe Inc.")
  • Who to contact for support (support "support@stripe.com")

I include this information because when something breaks at 3 AM (and it will), you don't want to be hunting through documentation trying to figure out who owns what.

External System with SLA Information

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
    sla "99.9% uptime"
    mttr "4 hours"
    support "24/7 enterprise support"
  }
}

SLA (Service Level Agreement) information is crucial for understanding risk:

  • 99.9% uptime means ~43 minutes of downtime per month
  • mttr (Mean Time To Repair) tells you how quickly they commit to fixing issues
  • 24/7 enterprise support tells you response time expectations

I always add SLA information for critical external dependencies. It helps teams understand what they're committing to.

External System with Compliance Requirements

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "pci-compliant"]
    compliance ["PCI-DSS Level 1"]
    security ["TLS 1.3", "Mutual TLS"]
  }
}

For highly regulated industries (healthcare, finance, government), compliance information is essential. It tells you:

  • What standards the external system meets
  • What security controls are in place
  • Whether you can use it for sensitive data

I once worked on a healthcare project where someone chose an email provider without checking HIPAA compliance. We had to rebuild the integration later. Lesson learned: document compliance requirements upfront.

Common Boundary Patterns

After modeling hundreds of systems, I've noticed patterns that repeat constantly. Let me show you the ones I see most often.

Pattern 1: Third-Party Services

This is the most common pattern—integrating with vendors who provide specialized services.

// Your system
Shop = system "Shop"

// Third-party integrations
Stripe = system "Stripe" {
  metadata {
    tags ["external", "vendor", "pci-compliant"]
    owner "Stripe"
    sla "99.99% uptime"
    api_endpoint "https://api.stripe.com/v1"
    authentication "API Key"
  }
}

Twilio = system "Twilio" {
  metadata {
    tags ["external", "vendor"]
    owner "Twilio"
    sla "99.9% uptime"
    api_endpoint "https://api.twilio.com"
  }
}

GoogleAnalytics = system "Google Analytics" {
  metadata {
    tags ["external", "vendor"]
    owner "Google"
    privacy_policy "https://policies.google.com/privacy"
  }
}

// Integration relationships
Shop.API -> Stripe "Process payment" [encrypted, tls1.3]
Shop.API -> Twilio "Send SMS" [encrypted]
Shop.API -> GoogleAnalytics "Track events" [anonymous]

Notice how each vendor gets different context based on what matters:

  • Stripe: PCI compliance, strict SLA, API endpoint details
  • Twilio: SLA, API endpoint
  • Google Analytics: Privacy policy (because it's tracking users)

The key insight: add metadata that matters for that specific integration. Don't copy-paste the same structure for every external system.

Pattern 2: Partner Integrations

Partners are different from vendors—they're external organizations you have a negotiated, contractual relationship with, not just a service you subscribe to.

// Your system
Shop = system "Shop"

// Partner systems (B2B integrations)
LogisticsPartner = system "Logistics Partner API" {
  metadata {
    tags ["external", "partner"]
    owner "FedEx"
    sla "99.5% uptime"
    api_documentation "https://partners.fedex.com/api"
    contact "partnership@fedex.com"
  }
}

InventoryPartner = system "Inventory Partner" {
  metadata {
    tags ["external", "partner"]
    owner "Vendor X"
    sla "99.0% uptime"
    contract "CONTRACT-2024-INV-001"
  }
}

// Partner relationships
Shop.API -> LogisticsPartner "Ship order"
Shop.API -> InventoryPartner "Check stock"

Partner integrations often have:

  • Different SLA expectations than public vendors
  • Contractual obligations
  • Dedicated partnership contacts
  • Custom API documentation

I always add contract references for partner systems. When disputes arise about service levels, you want the contract number readily available.

Pattern 3: Internal Team Boundaries

Sometimes the boundary isn't between your company and the outside world—it's between teams within your organization.

// Your team's system
Shop = system "Shop" {
  metadata {
    tags ["internal", "shop-team"]
    owner "Shop Team"
    slack "#shop-team"
    repository "github.com/company/shop"
  }
  WebApp = container "Web App"
  API = container "API"
}

// Another team's systems (internal but different ownership)
UserPlatform = system "User Platform" {
  metadata {
    tags ["internal", "platform-team"]
    owner "Platform Team"
    slack "#platform-team"
  }
  AuthService = container "Auth Service"
  UserProfile = container "User Profile"
}

AnalyticsPlatform = system "Analytics Platform" {
  metadata {
    tags ["internal", "data-team"]
    owner "Data Team"
    slack "#data-team"
  }
  EventCollector = container "Event Collector"
  DataWarehouse = database "Data Warehouse"
}

// Cross-team boundaries
Shop.API -> UserPlatform.AuthService "Authenticate user"
Shop.API -> AnalyticsPlatform.EventCollector "Send events"

Internal team boundaries are crucial for:

  • Communication: Who do I talk to when something breaks?
  • Coordination: How do we coordinate changes?
  • Escalation: Who's responsible when issues arise?

I include Slack channel references for internal systems. It's the fastest way to get help or ask questions.

People: Always External

Here's a rule I never break: people are always outside your system boundary.

They're not "inside" your system—they're actors who interact with it from the outside.

// External actors (always outside boundary)
Customer = person "Customer"
Administrator = person "Administrator"
SupportAgent = person "Customer Support"

// Your system (the boundary)
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API"
  Database = database "Database"
}

// People interact with your system, they're not part of it
Customer -> Shop.WebApp "Browses products"
Customer -> Shop.WebApp "Adds to cart"
Customer -> Shop.WebApp "Checks out"

Administrator -> Shop.WebApp "Manages products"
Administrator -> Shop.WebApp "Views reports"

SupportAgent -> Shop.WebApp "Monitors health"

Why does this matter?

  1. Ownership clarity: People aren't "owned" by your system. They're autonomous actors who choose to interact with it.
  2. Security perspective: People are untrusted by default. You need to authenticate, authorize, and validate everything they do.
  3. Testing strategy: You can unit test internal components, but you need to test user interactions differently (end-to-end tests, user acceptance tests).

I've seen diagrams where people are modeled inside the system, and it always creates confusion. Are they part of the system? Do you control them? No—people are always external.

Modeling Boundary Crossings

Every relationship that goes from internal to external (or vice versa) crosses a boundary. These crossings are integration points that need special attention.

Single Boundary Crossing

// Internal system
Shop = system "Shop" {
  API = container "API"
}

// External system
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
  }
}

// Crossing the boundary
Shop.API -> PaymentGateway "Process payment"

This relationship crosses from internal (your API) to external (payment gateway). That means:

  • You need integration tests
  • You need error handling for failures
  • You need to consider timeout strategies
  • You need to understand the failure modes

Multiple Boundary Crossings

Customer = person "Customer"

// Internal
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API"
}

// External systems (multiple)
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
  }
}

EmailService = system "Email Service" {
  metadata {
    tags ["external"]
  }
}

AnalyticsService = system "Analytics Service" {
  metadata {
    tags ["external"]
  }
}

// Multiple boundary crossings
Customer -> Shop.WebApp "Places order"           // Person to System
Shop.WebApp -> Shop.API "Process order"          // Internal (same boundary)
Shop.API -> PaymentGateway "Charge payment"       // Internal → External
Shop.API -> EmailService "Send confirmation"      // Internal → External
Shop.API -> AnalyticsService "Track event"         // Internal → External

This system has three external dependencies. Each crossing represents:

  • Risk: What happens if that external service is down?
  • Complexity: Integration testing, error handling, fallback strategies
  • Performance: Network latency, rate limits, timeouts

When I see many boundary crossings like this, I ask: "Do we really need all these external services?" Sometimes consolidating reduces complexity and risk.

Team Boundaries in Practice

Let me show you a real-world example of how team boundaries work in a larger organization.

Single Team, One System (Simple)

// One team owns everything
Shop = system "Shop" {
  metadata {
    tags ["internal", "shop-team"]
    owner "Shop Team"
    slack "#shop-team"
  }
  WebApp = container "Web App"
  API = container "API"
  Database = database "Database"
}

This is the ideal scenario. One team, one system, clear ownership. Everything inside is your team's responsibility.

Multiple Teams, Bounded Contexts (Realistic)

// Team A: Shop team
Shop = system "Shop" {
  metadata {
    tags ["internal", "shop-team"]
    owner "Shop Team"
    slack "#shop-team"
    repository "github.com/company/shop"
  }
  WebApp = container "Web App"
  API = container "API"
}

// Team B: Payment team
Payment = system "Payment Service" {
  metadata {
    tags ["internal", "payment-team"]
    owner "Payment Team"
    slack "#payment-team"
    repository "github.com/company/payment"
  }
  Processor = container "Payment Processor"
}

// Team C: Notification team
Notifications = system "Notification Service" {
  metadata {
    tags ["internal", "notification-team"]
    owner "Notification Team"
    slack "#notification-team"
    repository "github.com/company/notifications"
  }
  Sender = container "Notification Sender"
}

// Cross-team boundaries (each is a coordination point)
Shop.API -> Payment.Processor "Process payment"
Shop.API -> Notifications.Sender "Send notification"

This is more realistic in larger organizations. Different teams own different systems, each with their own repositories, Slack channels, and processes.

The boundary crossings between teams matter because:

  • Coordination: Changes need to be coordinated across teams
  • Testing: Cross-team integrations need integration tests
  • Communication: Who do you talk to when something breaks?

I include Slack channels and repository links for every internal system. It saves so much time when you're trying to figure out who to contact or where to look at code.

Creating Views for Different Audiences

One of the most powerful features of Sruja is creating different views for different audiences. Let me show you how.

System Context View (Shows All Boundaries)

view system_context {
  title "System Context - Internal vs External"
  include *
}

This view shows everything:

  • Your systems
  • External systems
  • People
  • All relationships

Perfect for stakeholders who want to see the big picture, including dependencies and risks.

Internal-Only View (Shows Just Your System)

view internal_view of Shop {
  title "Internal Architecture"
  include Shop.*
}

This view shows only what's inside your boundary:

  • Your containers
  • Your components
  • Your internal relationships

Perfect for developers who are working on the system and don't need to see external dependencies.

External-Only View (Shows Dependencies)

view external_view of Shop {
  title "External Dependencies"
  exclude Shop.*
}

This view shows only what's external:

  • Third-party systems
  • Partner systems
  • Dependencies you rely on

Perfect for risk management, dependency reviews, and vendor assessments.

What to Remember

Marking internal vs. external is about clarity—clarity of ownership, clarity of risk, clarity of responsibility. When you annotate your diagrams:

  • Always mark external systems: Use metadata { tags ["external"] }
  • Add meaningful context: Ownership, SLA, support contacts, compliance
  • Document team boundaries: Include Slack channels, repository links
  • Remember that people are external: They're never inside your system boundary
  • Show boundary crossings: Every crossing is an integration point
  • Create multiple views: Different audiences need different perspectives

If you take away one thing, let it be this: clearly annotated boundaries prevent confusion and make your diagrams immediately useful to anyone who picks them up, whether they're on your team or not.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding.

Question 1

You're modeling a healthcare platform and want to mark external dependencies correctly. Which metadata structure is best for an external payment gateway?

A)

PaymentGateway = system "Payment Gateway" {
  metadata {
    owner "Stripe"
  }
}

B)

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
    owner "Stripe"
    sla "99.9% uptime"
    support "support@stripe.com"
  }
}

C)

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["internal", "payment"]
    owner "Payment Team"
  }
}

D)

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["vendor", "pci-compliant"]
    api_endpoint "https://api.stripe.com"
  }
}

Click to see the answer

Answer: B) Tags as external, plus ownership, SLA, and support

Let's analyze each option:

A) Incorrect. It's missing the crucial tags ["external"] marker. Without this tag, anyone reading the diagram won't know this is an external system. They might assume it's internal and under your control.

B) Correct! This includes:

  • tags ["external"] — Clearly marks it as external
  • owner "Stripe" — Documents who owns and maintains it
  • sla "99.9% uptime" — Documents the service level commitment
  • support "support@stripe.com" — Provides contact information for when things break

This metadata provides all the context someone needs to understand the dependency, assess the risk, and know who to contact for support.

C) Incorrect. This marks the system as tags ["internal", "payment"], which means it's owned by the "Payment Team." But Stripe is a third-party vendor, not an internal team. This is misleading.

D) Incorrect. While it includes some useful tags (vendor, pci-compliant) and the API endpoint, it's missing the most important tag: ["external"]. Without this, the system isn't clearly marked as external. Also missing SLA and support information.

Key insight: Always use tags ["external"] to mark external systems. Then add context that matters: ownership, SLA, support contacts, API endpoints, compliance requirements. Make your diagrams useful, not just correct.


Question 2

You're reviewing an architecture diagram and notice this structure:

Shop = system "Shop"
PaymentGateway = system "Payment Gateway"
EmailService = system "Email Service"

Shop.API -> PaymentGateway "Process payment"
Shop.API -> EmailService "Send confirmation"

What's missing from a boundaries perspective?

A) Containers for Shop system
B) Metadata tags marking external systems
C) Person elements for users
D) Component-level breakdown

Click to see the answer

Answer: B) Metadata tags marking external systems

The diagram has three systems, but there's no way to tell which one is yours (internal) and which are external dependencies. They all look identical.

What the diagram should include:

// Your system (internal)
Shop = system "Shop" {
  metadata {
    tags ["internal"]
    owner "Shop Team"
  }
  WebApp = container "Web App"
  API = container "API"
}

// External systems (clearly marked)
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "vendor"]
    owner "Stripe"
    sla "99.9% uptime"
  }
}

EmailService = system "Email Service" {
  metadata {
    tags ["external", "vendor"]
    owner "SendGrid"
    sla "99.9% uptime"
  }
}

Shop.API -> PaymentGateway "Process payment"
Shop.API -> EmailService "Send confirmation"

Why other options are wrong:

  • A) Incorrect. While Shop does have containers in the corrected version, the main issue isn't missing containers—it's that external systems aren't marked as external. You could have a valid diagram without containers if you're showing a high-level system context view.

  • C) Incorrect. Person elements would be great to add (Customer, Administrator, etc.), but the primary boundaries issue isn't about missing people. The problem is that PaymentGateway and EmailService aren't marked as external.

  • D) Incorrect. Component-level breakdown is optional and depends on your audience. The boundaries issue isn't about missing components—it's about not distinguishing internal from external systems.

Key insight: Always use metadata tags to mark external systems. This is the single most important thing you can do to make boundaries clear in your diagrams. Without it, your diagrams tell an incomplete story.


What's Next?

Now you know how to mark internal and external components clearly. You can annotate external systems with ownership, SLAs, and support information. You can model team boundaries and create different views for different audiences.

But we've only talked about defining boundaries. We haven't talked about what happens when you cross them.

In the next lesson, you'll learn about crossing boundaries—how to model integrations, plan for failures, document interface contracts, and design fallback strategies. You'll discover why every boundary crossing is both an opportunity and a risk.

See you there!


Crossing the Line: Integrations at Boundaries

Imagine crossing a border between countries. You need a passport, you might wait in customs, and there are rules about what you can bring across. Sometimes the border is open and easy to cross. Sometimes it's closed, and you're stuck.

Boundaries in software architecture work the same way. Every time you cross from your internal system to an external one, you're dealing with integration complexity, potential failures, and risks you don't control.

In this lesson, you'll learn to model these boundary crossings effectively. You'll discover how to plan for failures, document interface contracts, and design fallback strategies that keep your system resilient when external dependencies misbehave.

Let's start by understanding what boundary crossings actually are.

Learning Goals

By the end of this lesson, you'll be able to:

  • Model integrations across boundaries clearly
  • Identify different integration patterns and when to use each
  • Plan for common failure scenarios at boundaries
  • Document interface contracts that prevent misunderstandings
  • Design fallback strategies that keep your system resilient

Boundary Crossings: The Reality

Every relationship that goes from internal to external (or external to internal) is a boundary crossing. These are the riskiest parts of your system.

// Internal → External = Boundary crossing
Shop.API -> PaymentGateway "Process payment"

// External → Internal = Boundary crossing
PaymentGateway -> Shop.API "Payment result"

Why are these risky? Because you don't control what's on the other side.

I learned this the hard way early in my career. We had an e-commerce site that depended on a single payment gateway. When that gateway went down for six hours during Black Friday, we lost millions in sales because we hadn't planned for boundary failures.

Lesson learned: every boundary crossing is a potential failure point. Plan accordingly.

Integration Patterns You'll Use

After years of building systems, I've found there are really three main integration patterns you'll encounter. Understanding which one you're using helps you plan correctly.

Pattern 1: Request-Response (Synchronous)

This is the most common pattern. You send a request, you wait for a response.

Shop.API -> PaymentGateway "Process payment"
PaymentGateway -> Shop.API "Payment result"

Characteristics:

  • Synchronous — Your system waits for the response
  • Real-time — The customer sees the result immediately
  • Tight coupling — If the external service is down, you're down too
  • Simple to implement — One call, one response

When to use it:

  • When you need an immediate response (payment processing, real-time validation)
  • When the operation is critical to the user's workflow
  • When the external service has good uptime guarantees

Risks:

  • Your users wait if the external service is slow
  • Your system fails if the external service is down
  • Timeouts need to be configured carefully

This is the pattern I see most often. It's simple, but it's fragile if you don't handle failures properly.

Pattern 2: Event-Driven (Asynchronous)

You publish an event, and other systems process it whenever they can.

Shop.API -> EventQueue "Publish order created"
EventQueue -> PaymentProcessor "Consume order event"
EventQueue -> EmailService "Consume order event"

Characteristics:

  • Asynchronous — You don't wait for a response
  • Decoupled — Your system continues even if others are slow
  • Resilient — If a consumer fails, the queue buffers events
  • More complex — You need infrastructure (Kafka, RabbitMQ, etc.)

When to use it:

  • When the operation can happen in the background (sending emails, updating analytics)
  • When you have multiple consumers who need to process the same event
  • When you want resilience and fault tolerance

Risks:

  • Eventual consistency (the user sees "processing" before it's actually done)
  • More complex infrastructure to maintain
  • Harder to debug when things go wrong

I used to think event-driven was overkill for small systems. Then I built a notification system that sent welcome emails, onboarding sequences, and marketing emails. Trying to send all those synchronously was a nightmare. Moving to events made everything so much smoother.
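To make the decoupling concrete, here's a minimal sketch in plain Python (illustrative only, not Sruja syntax): the publisher drops an event on a queue and moves on, and each consumer processes it independently later. The function and handler names are hypothetical.

```python
from queue import Queue

# Illustrative event-driven sketch: the API publishes and continues;
# consumers drain the queue on their own schedule.
events = Queue()

def publish_order_created(order_id):
    # The publisher never waits for a consumer.
    events.put({"type": "order_created", "order_id": order_id})

def drain(handlers):
    # Each queued event fans out to every registered consumer.
    results = []
    while not events.empty():
        event = events.get()
        for handler in handlers:
            results.append(handler(event))
    return results

send_email = lambda e: f"email for order {e['order_id']}"
track_event = lambda e: f"analytics for order {e['order_id']}"

publish_order_created(42)
print(drain([send_email, track_event]))
# → ['email for order 42', 'analytics for order 42']
```

In production the in-memory queue would be a broker like Kafka or RabbitMQ, but the shape is the same: if the email consumer is slow, orders still get published.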

Pattern 3: Polling

Your system periodically checks for updates from an external service.

Shop.API -> ExternalAPI "Check order status"
ExternalAPI -> Shop.API "Return status"

Characteristics:

  • Periodic — You check on a schedule (every minute, every hour, etc.)
  • Simple — No webhooks or real-time infrastructure needed
  • Less efficient — You make calls even when nothing has changed
  • Eventual — There's always a delay between an event and when you discover it

When to use it:

  • When the external service doesn't support webhooks
  • When you need to check status periodically (order fulfillment, shipment tracking)
  • When the external API doesn't have push notifications

Risks:

  • You're wasting resources polling when nothing changes
  • There's always a delay before you see updates
  • Rate limiting can become an issue

I only use polling when I have no other choice. It's simple, but it's inefficient and introduces latency.
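The polling loop itself is simple. Here's a minimal sketch (plain Python, not Sruja syntax; the status values and attempt budget are hypothetical) showing the two things every poller needs: a terminal condition and a give-up point.

```python
import itertools

# Illustrative polling sketch: check an external status function on a
# schedule until it reaches a terminal state or the budget runs out.
def poll_until(check_status, terminal_states, max_attempts, sleep=lambda s: None):
    for attempt in itertools.count(1):
        status = check_status()
        if status in terminal_states:
            return status, attempt
        if attempt >= max_attempts:
            return None, attempt          # gave up; still not terminal
        sleep(60)                         # wait before the next poll

# Simulated external API: "processing" twice, then "shipped".
responses = iter(["processing", "processing", "shipped"])
status, attempts = poll_until(
    lambda: next(responses), {"shipped", "cancelled"}, max_attempts=10
)
print(status, attempts)  # → shipped 3
```

The `sleep` interval is where rate limits bite: poll too often and you waste calls, poll too rarely and the latency grows.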

Integration Considerations (What Actually Matters)

Now that you know the patterns, let's talk about what you actually need to think about when crossing boundaries.

1. Error Handling: What Happens When It Breaks?

External services fail. It's not a question of "if," it's "when."

// Document expected failure modes
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
    sla "99.9% uptime"
    failure_modes ["timeout", "service unavailable", "network error", "rate limit"]
  }
}

// Model fallbacks
Shop.API -> PrimaryPayment "Process payment" [primary]
Shop.API -> BackupPayment "Process payment" [fallback]

I've seen systems that don't document failure modes. When something breaks, nobody knows what to expect. Does the external service retry automatically? Do they return specific error codes? What's the timeout?

Document this upfront. It saves hours of debugging later.

2. Timeouts and Latency: How Long Is Too Long?

External services can be slow. You need to configure timeouts that protect your system without being too aggressive.

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
    timeout "30s"
    expected_latency "500ms"
    max_latency "5s"
  }
}

Shop.API = container "API Service" {
  slo {
    latency {
      p95 "200ms"  // 95% of requests complete in 200ms
      p99 "500ms"  // 99% of requests complete in 500ms
    }
  }
}

I once worked on a system that had no timeout configured. An external service got slow, and our threads hung indefinitely. The entire system ground to a halt.

Set timeouts. Always. Even if the external service is usually fast.
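What "set a timeout" looks like in code: wrap the outbound call in a hard deadline so a slow dependency fails fast instead of hanging your thread. A minimal sketch in plain Python (not Sruja syntax; the fallback value is a placeholder for whatever your error handling does):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

# Illustrative timeout sketch: enforce a deadline on any external call.
def call_with_timeout(fn, timeout_s):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except TimeoutError:
            return "timed out"            # controlled failure instead of a hang

fast = lambda: "payment ok"
slow = lambda: (time.sleep(2), "payment ok")[1]

print(call_with_timeout(fast, timeout_s=0.5))   # → payment ok
print(call_with_timeout(slow, timeout_s=0.5))   # → timed out
```

Most HTTP clients take a timeout directly; the point is that the deadline is explicit and chosen by you, not left to the external service.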

3. Data Consistency: What Happens When Things Go Wrong?

What if the payment succeeds but saving the order fails? Or the order saves but the payment fails?

Shop.API -> PaymentGateway "Process payment"
Shop.API -> Shop.Database "Save order"

// If payment succeeds but order save fails:
// - Did you charge the customer?
// - Is the order lost?
// - How do you reconcile?

You need strategies for handling this:

  • Idempotent payment calls — Calling the same payment ID twice should only charge once
  • Compensating transactions — If the order save fails after payment, refund automatically
  • Eventual consistency — Accept that things might be inconsistent briefly, then reconcile
  • Two-phase commits — Complex, but guarantees consistency

I once worked on a system where we charged customers but lost their orders. We spent weeks manually reconciling payments and orders. Awful experience.

Plan for consistency issues at boundaries. They will happen.
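The idempotency strategy is worth sketching, because it's the cheapest of the four. A minimal illustration in plain Python (not Sruja syntax; the key format and storage are hypothetical, and real gateways like Stripe implement this server-side):

```python
# Illustrative idempotency sketch: charges are recorded under a
# caller-supplied key, so retrying the same key cannot charge twice.
processed = {}  # idempotency_key -> original charge result

def charge(idempotency_key, amount_cents):
    if idempotency_key in processed:
        return processed[idempotency_key]     # replay: no second charge
    result = {"charged": amount_cents, "key": idempotency_key}
    processed[idempotency_key] = result
    return result

first = charge("order-123", 5000)
retry = charge("order-123", 5000)             # network retry after a timeout
print(retry is first, len(processed))  # → True 1
```

The client's job is to generate the key once per logical operation (for example, per order) and reuse it on every retry.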

4. Security: What Protects Your Data?

Crossing a boundary is where attacks happen. This is where you need to be most careful.

// Security at the boundary
Shop.API -> PaymentGateway "Process payment" [encrypted, authenticated, tls1.3]

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "pci-compliant"]
    security ["mutual TLS", "API key authentication"]
    compliance ["PCI-DSS Level 1"]
  }
}

Security controls at boundaries:

  • Authentication — Prove who you are
  • Authorization — Prove you're allowed to do what you're asking
  • Encryption — Protect data in transit
  • Validation — Don't trust anything coming from outside

I learned this lesson painfully. We had an internal API that we exposed to the web without proper validation. Someone sent malformed requests that brought down our database.

Validate everything at your boundaries. Trust nothing from external systems.

Documenting Interface Contracts

One of the most important things you can do for boundary crossings is document the interface contract. This is the agreement between your system and the external one.

API Contract

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
    api_endpoint "https://api.payment.com/v1"
    authentication "API Key (Bearer token)"
    rate_limit "1000 req/min"
    supported_methods ["POST /charges", "GET /charges/:id", "POST /refunds"]
  }
}

Shop.API = container "API Service" {
  metadata {
    api_consumer "Payment Gateway Client"
    retry_policy "3 retries with exponential backoff"
    circuit_breaker "Enabled (5 failures = open for 60s)"
  }
}

This contract tells everyone:

  • Where the API is
  • How to authenticate
  • What methods are available
  • What limits exist
  • How to handle retries and failures

I've seen so many integration disasters because nobody documented the contract. Teams assumed different APIs, different limits, different behaviors. When something changed, everything broke.

Document your interface contracts. Make them explicit.
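The circuit_breaker line in the metadata above describes behavior like the following sketch (plain Python, not Sruja syntax). It mirrors the documented policy of "5 failures = open for 60s"; the class and method names are hypothetical.

```python
import time

# Illustrative circuit-breaker sketch matching "5 failures = open for 60s":
# after repeated failures, fail fast instead of hammering a dead dependency.
class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=60, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open")   # skip the call entirely
            self.opened_at = None                    # cool-down elapsed: retry
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()        # trip the breaker
            raise
        self.failures = 0                            # any success resets the count
        return result

breaker = CircuitBreaker(max_failures=5, reset_after=60)
# breaker.call(lambda: payment_client.charge(...))   # wrap each outbound call
```

Documenting the numbers in metadata and implementing them in a wrapper like this keeps the contract and the code in agreement.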

Data Format

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
    data_format "JSON"
    schema_version "v1.2"
    validation "Strict schema validation"
    date_format "ISO 8601 (UTC)"
    currency_format "ISO 4217 (e.g., 'USD')"
  }
}

Don't let data format be implicit. Specify:

  • JSON vs. XML vs. Protocol Buffers
  • Schema version (what happens when it changes?)
  • Date formats (timezone matters!)
  • Currency formats
  • Number formats (decimal precision, rounding)

I once dealt with a system where dates were sometimes in US format (MM/DD/YYYY) and sometimes in ISO format (YYYY-MM-DD), depending on which service you called. Bugs everywhere.

SLA and Reliability

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
    sla "99.9% uptime"
    mttr "4 hours"  // Mean Time To Repair
    support_tier "24/7 enterprise support"
    escalation_path "Support → Account Manager → CTO"
  }
}

SLA documentation tells you:

  • What uptime they're committing to
  • How fast they'll fix things when they break
  • Who to contact and how to escalate
  • What compensation you get if they violate SLA

Knowing the SLA helps you decide: do you need a fallback? Can you tolerate 43 minutes of downtime per month (99.9%)? Or do you need 99.99% (4 minutes)?
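The downtime numbers above come from simple arithmetic. Here's the calculation as a small Python sketch (assuming a 30-day month):

```python
# Back-of-the-envelope SLA math: how much monthly downtime does a given
# uptime percentage actually allow? (30-day month assumed.)

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def allowed_downtime_minutes(uptime_percent: float) -> float:
    return MINUTES_PER_MONTH * (1 - uptime_percent / 100)

print(round(allowed_downtime_minutes(99.9), 1))   # ~43.2 minutes/month
print(round(allowed_downtime_minutes(99.99), 1))  # ~4.3 minutes/month
```

Run the numbers for every external SLA you depend on; "three nines" and "four nines" feel similar until you see them in minutes.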

Fallback Strategies: Planning for Failure

External systems fail. You need fallback strategies. Here are the ones I use most often.

Strategy 1: Redundant Providers

Have a backup provider you can switch to if the primary fails.

// Primary provider
PrimaryPayment = system "Stripe" {
  metadata {
    tags ["external", "primary"]
    sla "99.99% uptime"
    owner "Stripe"
  }
}

// Backup provider
BackupPayment = system "PayPal" {
  metadata {
    tags ["external", "backup"]
    sla "99.9% uptime"
    owner "PayPal"
  }
}

// Try primary, fall back to backup
Shop.API -> PrimaryPayment "Process payment" [primary]
Shop.API -> BackupPayment "Process payment" [fallback]

Why this works: If Stripe is down, you can still process payments through PayPal.

Challenge: Supporting two payment gateways is complex. You need to reconcile transactions, handle different APIs, manage different fee structures.

I've used this strategy for critical paths (payments, messaging, notifications). It adds complexity, but it buys you resilience.
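The "try primary, fall back to backup" logic can be sketched as code. This is a simplified illustration, not a real Stripe or PayPal client; `ProviderError` and the charge functions are hypothetical stand-ins.

```python
# Sketch of "try primary, fall back to backup" across payment providers.
# ProviderError and the charge callables are hypothetical stand-ins for
# real gateway client calls.

class ProviderError(Exception):
    pass

def charge(providers, amount_cents):
    """Try each provider in priority order; return (provider_name, receipt)."""
    last_error = None
    for name, provider_charge in providers:
        try:
            return name, provider_charge(amount_cents)
        except ProviderError as exc:
            last_error = exc  # remember the failure, then try the next provider
    raise ProviderError(f"all providers failed: {last_error}")

# Simulated providers: the primary is down, the backup works.
def stripe_charge(amount):
    raise ProviderError("Stripe: 503 Service Unavailable")

def paypal_charge(amount):
    return {"status": "ok", "amount": amount}

name, receipt = charge([("Stripe", stripe_charge), ("PayPal", paypal_charge)], 1999)
print(name, receipt["status"])
```

The hard part in practice isn't this loop; it's the reconciliation and differing APIs mentioned above.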

Strategy 2: Circuit Breaker

Stop calling a failing service automatically instead of hammering it with requests.

Shop.API = container "API Service" {
  metadata {
    circuit_breaker {
      enabled true
      failure_threshold 5  // Open after 5 failures
      recovery_timeout "60s"  // Try again after 60 seconds
      half_open_attempts 3  // Send 3 test requests before closing
    }
  }
}

How it works:

  1. Closed — Normal operation, requests go through
  2. Open — After N failures, stop sending requests
  3. Half-open — After timeout, send a few test requests
  4. Closed — If tests succeed, go back to normal

Why this works: Instead of hammering a failing service with 1000 requests/second (which might make recovery worse), you stop calling it and fail fast.

This is one of my favorite patterns. It's saved me from cascading failures more times than I can count.
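The state machine above fits in a few lines of Python. This is a minimal single-threaded sketch matching the metadata values (failure threshold, recovery timeout); production libraries add half-open probe counting, metrics, and thread safety.

```python
# Minimal circuit breaker sketch: open after `failure_threshold`
# consecutive failures, allow a probe again after `recovery_timeout`
# seconds. Illustrative only -- real implementations add half-open
# probe counting, metrics, and thread safety.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # timeout elapsed: half-open, allow a probe
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

The key behavior: once open, callers get an immediate error instead of waiting on a timeout against a dead service.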

Strategy 3: Degraded Mode

Continue operating even if a non-critical external service is down.

// Non-critical: If analytics fails, continue
Shop.API -> AnalyticsService "Track events" [non_critical]

// Queue for later: If email fails, queue it
Shop.API -> EmailService "Send notifications" [async_queue]

Why this works: Not every external dependency is critical. If analytics is down, you can still process orders. If email is down, queue messages and send later.

What this requires: You need to distinguish between:

  • Critical paths — System can't function without them (payments)
  • Important paths — System functions, but with degraded UX (email, push notifications)
  • Nice-to-have paths — System functions perfectly without them (analytics)

I used to treat all dependencies as critical. Then I realized: if analytics is down for an hour, does anyone actually care? No. Mark it non-critical and move on.
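One lightweight way to encode "non-critical" in code is a wrapper that swallows failures instead of propagating them. A sketch (the `track_event` function is illustrative):

```python
# Sketch of a "non-critical" wrapper: if a nice-to-have dependency like
# analytics fails, log and continue instead of failing the order.

import logging

def non_critical(fn):
    """Run fn; log any failure and return None instead of propagating it."""
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            logging.warning("non-critical dependency failed: %s", exc)
            return None  # the caller continues without the result
    return wrapper

@non_critical
def track_event(name):
    raise ConnectionError("analytics service unreachable")

# The order path survives even though analytics is down.
assert track_event("order_placed") is None
```

The decorator makes the criticality decision explicit and reviewable, which is exactly what marking a relationship `[non_critical]` does at the architecture level.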

Strategy 4: Cache External Data

Cache responses from external APIs so you can serve from cache if the external service is down.

// External API call
Shop.API -> ExchangeRateAPI "Get exchange rates"

// Cache for backup
Shop.API -> Shop.Cache "Get cached rates"

// Fallback strategy
Shop.API -> ExchangeRateAPI "Get exchange rates" [primary]
Shop.API -> Shop.Cache "Get cached rates" [fallback]

Why this works: Even if the external API is down, you can serve slightly stale data from cache.

What to consider:

  • How stale is acceptable? (10 minutes? 1 hour? 24 hours?)
  • How do you detect when the external service is back up?
  • Do you need to warm the cache before the service goes down?

I've used this for exchange rates, product catalogs, weather data—anything that's expensive to fetch and acceptable to serve slightly stale.
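Here's the cache-fallback pattern as a sketch, including a staleness budget. `fetch_rates` stands in for the external exchange-rate API; the one-hour limit is an example, not a recommendation.

```python
# Sketch of serve-from-cache fallback with a staleness budget.
# `fetch_rates` is a stand-in for the external exchange-rate API call.

import time

CACHE = {}  # key -> (value, stored_at)
MAX_STALENESS = 3600  # accept data up to 1 hour old when the API is down

def get_rates(fetch_rates):
    try:
        rates = fetch_rates()
        CACHE["rates"] = (rates, time.time())  # refresh cache on success
        return rates
    except Exception:
        value, stored_at = CACHE.get("rates", (None, 0))
        if value is not None and time.time() - stored_at <= MAX_STALENESS:
            return value  # slightly stale, but better than an outage
        raise  # no acceptable cached copy: surface the failure

# First call succeeds and warms the cache; second call survives an outage.
assert get_rates(lambda: {"USD": 1.0}) == {"USD": 1.0}
def api_down():
    raise TimeoutError("exchange rate API down")
assert get_rates(api_down) == {"USD": 1.0}
```

Note that every successful call re-warms the cache, which answers the "do you need to warm the cache before the service goes down?" question for read-heavy data.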

Complete Integration Example

Let me show you a complete example that brings everything together.

import { * } from 'sruja.ai/stdlib'

// People
Customer = person "Customer"

// Your system
Shop = system "Shop" {
  metadata {
    tags ["internal"]
    owner "Shop Team"
    slack "#shop-team"
  }

  WebApp = container "Web Application"
  API = container "API Service" {
    metadata {
      timeout "30s"
      retry_policy "3 retries with exponential backoff"
      circuit_breaker {
        enabled true
        failure_threshold 5
        recovery_timeout "60s"
      }
    }
  }
  Cache = database "Redis Cache"
}

// Primary payment provider
Stripe = system "Stripe" {
  metadata {
    tags ["external", "primary", "vendor"]
    owner "Stripe Inc."
    sla "99.99% uptime"
    mttr "4 hours"
    api_endpoint "https://api.stripe.com/v1"
    authentication "API Key (Bearer token)"
    rate_limit "1000 req/min"
    data_format "JSON"
    schema_version "v1.2"
    security ["TLS 1.3", "API key authentication"]
    compliance ["PCI-DSS Level 1"]
    support "24/7 enterprise support"
    escalation_path "Support → Account Manager → CTO"
  }
}

// Backup payment provider
PayPal = system "PayPal" {
  metadata {
    tags ["external", "backup", "vendor"]
    owner "PayPal"
    sla "99.9% uptime"
    api_endpoint "https://api.paypal.com/v2"
    authentication "OAuth 2.0"
  }
}

// Email service
SendGrid = system "SendGrid" {
  metadata {
    tags ["external", "vendor"]
    owner "SendGrid"
    sla "99.9% uptime"
    timeout "10s"
    api_endpoint "https://api.sendgrid.com/v3"
  }
}

// Integrations
Customer -> Shop.WebApp "Checkout"
Shop.WebApp -> Shop.API "Process order"

// Primary payment (encrypted, authenticated)
Shop.API -> Stripe "Process payment" [primary, encrypted, tls1.3]
Stripe -> Shop.API "Payment result"

// Fallback to backup if Stripe fails
Shop.API -> PayPal "Process payment" [fallback, encrypted]

// Email (non-critical, can queue)
Shop.API -> SendGrid "Send confirmation" [non_critical, async_queue]

// Cache exchange rates
Shop.API -> ExchangeRateAPI "Get exchange rates"
Shop.API -> Shop.Cache "Get cached rates" [fallback]

view index {
  include *
}

This example shows:

  • Clear external/internal boundaries with rich metadata
  • Multiple integration patterns (synchronous payment, asynchronous email)
  • Fallback strategies (backup provider, cache)
  • Failure documentation (timeouts, circuit breaker, retry policy)
  • Interface contracts (API endpoints, authentication, data formats)

This is the kind of documentation that saves you when things go wrong at 3 AM.

What to Remember

Crossing boundaries is where systems are most fragile. When you model and plan for boundary crossings:

  • Document everything — API contracts, failure modes, SLAs, security requirements
  • Plan for failures — Timeouts, retries, circuit breakers, fallbacks
  • Use the right pattern — Synchronous for critical paths, asynchronous for background work
  • Protect your system — Authentication, encryption, validation at every boundary
  • Design resilience — Redundant providers, caching, degraded modes
  • Test thoroughly — Integration tests, chaos engineering, failure scenarios

If you take away one thing, let it be this: every boundary crossing is both an opportunity and a risk. The opportunity is integrating with powerful external services. The risk is depending on something you don't control. Plan for that risk.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding.

Question 1

You're modeling a weather application that fetches data from an external API. Which integration pattern is most appropriate?

"The weather app needs to display current weather and 7-day forecasts for cities around the world. Users expect to see real-time weather data. The external weather API has good uptime and supports synchronous calls."

A) Request-Response (Synchronous)
B) Event-Driven (Asynchronous)
C) Polling
D) All of the above are equally appropriate



Answer: A) Request-Response (Synchronous)

Let's analyze each option:

A) Correct! Request-response is the right choice here because:

  • The weather data is needed in real-time (users expect to see current weather immediately)
  • The operation is critical to the user's workflow (the app's main purpose is displaying weather)
  • The external API has good uptime (so reliability risk is manageable)
  • It's simple to implement (one call, one response)

B) Incorrect. Event-driven is asynchronous—you publish an event, and consumers process it whenever they can. This doesn't work for real-time weather display because:

  • There's a delay between requesting and receiving data
  • You'd need a background worker to consume events
  • The user would see "loading..." longer than necessary
  • It adds unnecessary complexity for a simple request-response scenario

C) Incorrect. Polling is periodic checking on a schedule (e.g., checking every hour). This doesn't work here because:

  • Users want to see weather when they open the app, not according to a schedule
  • There's unnecessary latency (user opens app, but has to wait for next poll cycle)
  • It's inefficient (you'd be calling the API even when no one's viewing data)

D) Incorrect. These patterns are not equally appropriate. Request-response is clearly the best fit for this scenario. The other patterns introduce unnecessary complexity or don't meet the real-time requirement.

Key insight: Choose integration patterns based on your requirements. Need immediate results? Use synchronous. Can happen in background? Use asynchronous. No webhooks available? Use polling (as a last resort). Match the pattern to the problem.


Question 2

You're designing the payment processing flow for an e-commerce platform. The payment gateway has an SLA of 99.9% uptime. What does this mean for your system?

A) You don't need to worry about failures—99.9% is very reliable
B) You should have a fallback strategy because 99.9% still means ~43 minutes of downtime per month
C) You should only process payments when the gateway is at 100% uptime
D) You should switch to a different payment gateway immediately


Answer: B) You should have a fallback strategy because 99.9% still means ~43 minutes of downtime per month

Let's break down what 99.9% actually means:

The math:

  • 99.9% uptime = 0.1% downtime
  • 0.1% of a month (30 days × 24 hours × 60 minutes = 43,200 minutes) = ~43 minutes
  • That's 43 minutes per month of potential payment processing outages

Why other options are wrong:

A) Incorrect. 99.9% might sound high, but 43 minutes of downtime is significant if it happens during peak shopping hours (Black Friday, Cyber Monday, etc.). You absolutely need to worry about failures and plan for them.

C) Incorrect. No system has 100% uptime. Waiting for perfect uptime means your system never processes payments. This is unrealistic. You need to work with the reality that failures will happen.

D) Incorrect. Switching to a different payment gateway "immediately" is overkill. 99.9% uptime means the gateway is working 99.9% of the time. You should:

  • Have a backup gateway as a fallback
  • Implement circuit breakers to detect and route around failures
  • Use caching for less-critical payment info
  • Consider degraded modes (e.g., show "payment processing temporarily unavailable" instead of failing completely)

What you should actually do:

  • Document the 99.9% SLA in your metadata
  • Calculate the business impact of 43 minutes/month downtime
  • Design fallback strategies (backup provider, queueing, retry logic)
  • Set up monitoring and alerting for gateway outages
  • Have an escalation path with the gateway vendor

Key insight: SLAs give you information to make decisions. 99.9% tells you to plan for ~43 minutes of monthly downtime. Don't ignore it—plan for it.


What's Next?

Congratulations! You've completed Module 3: Boundaries. You now understand:

  • What boundaries are and why they matter for ownership, risk, and clarity
  • How to mark internal vs. external components using metadata and tags
  • How to model boundary crossings with proper planning for failures and fallbacks

You can now create architectures that clearly distinguish what you control from what you depend on. You can plan for failures at boundaries instead of being surprised by them. You can document interface contracts that prevent integration disasters.

You're building resilient systems.

In the next module, you'll learn about flows—how information moves through your system over time. You'll discover how to model data flow, process flows, and temporal behaviors that tell a richer story than static diagrams can.

See you there!


Module 3 Complete!

You've now mastered the art of defining and crossing boundaries. Here's what you've learned:

Lesson 1: Understanding Boundaries

  • Boundaries separate what's inside from what's outside
  • Multiple types of boundaries: system, team, organization, deployment, trust
  • Clear boundaries prevent confusion and clarify ownership

Lesson 2: Internal vs. External

  • Use metadata tags to mark external systems clearly
  • Document ownership, SLAs, and support contacts
  • Remember: people are always outside your system boundary
  • Create different views for different audiences

Lesson 3: Crossing Boundaries

  • Every boundary crossing is an integration point and potential failure
  • Choose the right integration pattern (synchronous, asynchronous, polling)
  • Document interface contracts (API endpoints, data formats, security)
  • Design fallback strategies (redundant providers, circuit breakers, degraded modes, caching)
  • Plan for failures—they will happen

You're ready to tackle more advanced concepts. Let's continue!

Module 4: Flows

Overview

In this module, you'll learn to model how information, data, and actions move through your system. Flows help you understand data lineage, process sequences, and bottlenecks.

Learning Objectives

By the end of this module, you'll be able to:

  • Model data flows using Sruja scenarios
  • Document user journeys and workflows
  • Identify bottlenecks and performance issues
  • Differentiate between data flows and behavioral flows

Time Investment

Approximately 1.5-2 hours to complete all lessons and exercises.

What's Next

After completing this module, you'll learn about Module 5: Feedback Loops.

Seeing Movement: Understanding Flows

Ever watch water flow down a stream? You can see the path it takes, where it speeds up, where it slows down, where it gets stuck. Static diagrams show you the rocks and the banks, but they don't show you how the water actually moves through the system.

That's what flows are for in architecture—they show you how information, data, and actions move through your system over time. Static relationships tell you what's connected. Flows tell you what happens.

In this lesson, you'll learn to model flows effectively. You'll discover different types of flows, when to use them, and how they reveal bottlenecks, errors, and opportunities that static diagrams miss.

Let's start by understanding what flows actually are and why they matter.

Learning Goals

By the end of this lesson, you'll be able to:

  • Understand what flows are and how they differ from static relationships
  • Recognize different types of flows (data, user journeys, control, event)
  • Know when to use flows versus static diagrams
  • Model flows at the right level of detail
  • Identify common pitfalls and avoid them

What Are Flows, Really?

At its simplest level, a flow shows sequence—how something moves from point A to point B to point C. Unlike static relationships that just say "A is connected to B," flows tell you "A talks to B, then B talks to C, and here's what happens at each step."

Static Relationship vs. Flow

Let me show you the difference:

// Static relationship: Just shows connection
Customer -> Shop.WebApp "Uses"

// Flow: Shows the sequence
CheckoutFlow = scenario "Customer Checkout" {
  Customer -> Shop.WebApp "Submits order"
  Shop.WebApp -> Shop.API "Sends order data"
  Shop.API -> Shop.Database "Saves order"
  Shop.API -> PaymentGateway "Processes payment"
  PaymentGateway -> Shop.API "Returns result"
  Shop.API -> Shop.WebApp "Confirmation"
  Shop.WebApp -> Customer "Show success page"
}

The static relationship just tells you "customer uses the web app." The flow tells you the complete sequence—what the customer does, what the web app does, what the API does, in what order, and what happens at each step.

Why This Matters

I once worked on a system where we had perfect static diagrams showing all the components. But nobody understood the actual order processing flow. When we debugged issues, we'd spend hours tracing through code because the diagrams didn't show sequence.

We added flows, and suddenly everything became clear. Developers could see the complete path from customer action to database storage. Product managers could see exactly what users experienced. Everyone had the same mental model of how things moved through the system.

Why Flows Matter: The Real Benefits

After years of modeling systems, I've found flows reveal things static relationships never can. Let me show you what flows actually surface.

1. Data Lineage: Where Does Data Come From?

Flows show you the complete path data takes through your system—where it starts, how it's transformed, and where it ends up.

OrderAnalyticsFlow = flow "Order Data Lineage" {
  // Where data starts
  Customer -> Shop.WebApp "Order details"
  
  // How data flows through system
  Shop.WebApp -> Shop.API "Order JSON payload"
  Shop.API -> Shop.Database "Persist order record"
  
  // Where data goes for analytics
  Shop.Database -> Analytics.Extractor "Extract order events"
  Analytics.Extractor -> Analytics.Processor "Enrich with user data"
  Analytics.Processor -> Analytics.Warehouse "Store aggregated metrics"
  
  // Where data is ultimately used
  Analytics.Warehouse -> Dashboard.Query "Fetch metrics"
  Dashboard.Query -> BusinessUser "Display analytics"
}

This flow tells the complete story: customer creates an order, order flows through the app and API, gets persisted to the database, gets extracted for analytics, gets enriched and aggregated, stored in the data warehouse, and ultimately shows up on a business dashboard.

Without this flow, would you know the order data goes to a data warehouse? Would you know there's an enrichment step? Would you know business users depend on this data? Probably not.

2. Process Understanding: What's the Sequence?

Flows show you the exact sequence of actions that happen when something occurs. This is crucial for understanding how your system actually works.

OrderProcessFlow = scenario "Order Processing Sequence" {
  // Customer action
  Customer -> Shop.WebApp "Submits order"
  
  // API processing
  Shop.WebApp -> Shop.API "Validates cart"
  Shop.API -> Shop.Database "Checks inventory"
  Shop.Database -> Shop.API "Inventory available"
  
  // External integration
  Shop.API -> PaymentGateway "Charges payment"
  PaymentGateway -> Shop.API "Payment successful"
  
  // Order finalization
  Shop.API -> Shop.Database "Saves order"
  Shop.API -> InventoryService "Reserves items"
  Shop.API -> EmailService "Sends confirmation"
}

This flow reveals the complete sequence of what happens when an order is submitted. You can see exactly what the API does (validate, check inventory, charge, save, reserve, email). You can see the dependencies between steps (inventory must be available before charging). You can see the external integrations.

3. Bottleneck Identification: Where Can Things Slow Down?

Flows make bottlenecks obvious. You can see which steps might become slow, where queues might form, and where performance issues will surface first.

FileUploadFlow = scenario "File Upload" {
  // Fast steps
  User -> Frontend "Selects file and clicks upload"
  Frontend -> API "Sends file data"
  API -> Storage "Stores file" [fast]
  
  // Potential bottleneck
  Storage -> ProcessingService "Processes file" [slow, cpu-intensive]
  
  // Continues if processing succeeds
  ProcessingService -> Notification "Sends completion notification"
  Notification -> User "Receives notification"
}

Look at this flow. Storage is fast. Processing service is marked as "slow" and "CPU-intensive." This tells you immediately: the processing service is the bottleneck. If users complain about slow uploads, you know exactly where to look.

I've used this pattern countless times. A team would spend weeks optimizing the "fast" parts of the system while ignoring the actual bottleneck. Flows make bottlenecks obvious.

4. Error Paths: What Happens When Things Fail?

Static relationships show you the happy path. Flows let you model what happens when things go wrong.

OrderWithErrorsFlow = scenario "Order Processing with Error Handling" {
  Customer -> Shop.API "Submits order"
  Shop.API -> PaymentGateway "Charges payment"
  
  // Success path
  PaymentGateway -> Shop.API "Payment successful"
  Shop.API -> Shop.Database "Saves order"
  Shop.API -> EmailService "Sends confirmation"
  
  // Failure path
  PaymentGateway -> Shop.API "Payment declined"
  Shop.API -> Shop.WebApp "Returns error"
  Shop.WebApp -> Customer "Shows error message: Payment declined"
}

This flow models both the success path and the failure path. When the payment gateway returns a decline, the API returns an error to the web app, which displays it to the customer.

Modeling error paths is crucial. I once worked on a system where we'd only modeled happy paths. When failures occurred, we had no clear strategy for what should happen. Chaos ensued. Model both paths upfront.

Types of Flows You'll Use

After modeling systems for years, I've found there are really four main types of flows you'll encounter. Understanding which type you're modeling helps you get the details right.

1. Data Flow (DFD Style)

Data flows show how data moves and transforms as it passes through your system. Think of them like a pipeline—data goes in one end, gets transformed, and comes out the other.

OrderDataFlow = flow "Order Data Pipeline" {
  // Data originates from customer
  Customer -> Shop.WebApp "Order details"
  
  // Data flows through API as JSON
  Shop.WebApp -> Shop.API "Order JSON payload"
  
  // Data gets persisted as record
  Shop.API -> Shop.Database "Order record"
  
  // Data extracted as event
  Shop.Database -> Analytics.EventStream "Order event"
  
  // Data aggregated for reporting
  Analytics.EventStream -> Analytics.Aggregator "Daily aggregation"
  Analytics.Aggregator -> Analytics.Warehouse "Stored metrics"
}

Use data flows when:

  • Modeling data lineage (where data comes from and goes)
  • Documenting ETL processes
  • Designing analytics pipelines
  • Understanding data transformations

Characteristics:

  • Focus on data, not actions
  • Show how data changes shape/form
  • Include storage and processing steps

2. User Journey / Scenario (BDD Style)

User journeys show how a person interacts with your system to achieve a goal. These are behavioral flows from the user's perspective.

CheckoutJourney = scenario "Customer Checkout Experience" {
  // User actions
  Customer -> Shop.WebApp "Clicks checkout button"
  Customer -> Shop.WebApp "Enters shipping address"
  Customer -> Shop.WebApp "Enters payment details"
  Customer -> Shop.WebApp "Clicks 'Place Order'"
  
  // System responses
  Shop.WebApp -> Shop.API "Validates cart"
  Shop.API -> PaymentGateway "Processes payment"
  PaymentGateway -> Shop.API "Payment successful"
  Shop.API -> Shop.Database "Saves order"
  
  // Final user experience
  Shop.WebApp -> Customer "Shows order confirmation page"
  Shop.API -> EmailService "Sends confirmation email"
  EmailService -> Customer "Receives confirmation email"
}

Use user journeys when:

  • Modeling user stories and requirements
  • Designing test scenarios
  • Understanding customer experience
  • Documenting acceptance criteria

Characteristics:

  • User's perspective, not system's
  • Include both user actions and system responses
  • Show the complete experience from start to finish

3. Control Flow

Control flows show decision points and branching logic. They model the "if this, then that" parts of your system.

ApprovalFlow = scenario "Order Approval Workflow" {
  Order -> ApprovalService "Submits for approval"
  
  // Branch 1: Auto-approved (low value)
  ApprovalService -> Database "Saves as auto-approved" [if value < $100]
  
  // Branch 2: Manual review (high value)
  ApprovalService -> Manager "Sends approval request" [if value >= $100]
  Manager -> ApprovalService "Approves order"
  ApprovalService -> Database "Saves as approved"
  
  // Both paths converge
  Database -> EmailService "Sends order confirmation"
}

Use control flows when:

  • Modeling business logic and rules
  • Documenting decision trees
  • Understanding conditional paths
  • Designing workflow systems

Characteristics:

  • Show decision points (if/else logic)
  • Include multiple branches that may converge
  • Document conditions for each branch

4. Event Flow

Event flows show how events propagate through an event-driven system. They model pub/sub patterns and event sourcing architectures.

OrderEventFlow = flow "Order Event Propagation" {
  // Event published
  OrderAPI -> EventBus "Publishes OrderCreated event"
  
  // Multiple consumers process same event
  EventBus -> NotificationService "Consumes event (sends confirmation)"
  EventBus -> AnalyticsService "Consumes event (tracks metrics)"
  EventBus -> InventoryService "Consumes event (reserves items)"
  EventBus -> EmailService "Consumes event (sends marketing email)"
}

Use event flows when:

  • Modeling event-driven architectures
  • Designing pub/sub systems
  • Understanding event propagation
  • Documenting event sourcing patterns

Characteristics:

  • One event, multiple consumers
  • Asynchronous processing
  • Eventual consistency (events processed at different times)
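A minimal in-process sketch shows the shape of this fan-out: one published event, several independent consumers. A real system would use a broker (Kafka, SNS/SQS, and so on); this is only the pattern.

```python
# Minimal in-process event bus sketch: one event, multiple consumers.
# Illustrative only -- a real event-driven system uses a broker.

class EventBus:
    def __init__(self):
        self.subscribers = {}  # event name -> list of handler callables

    def subscribe(self, event, handler):
        self.subscribers.setdefault(event, []).append(handler)

    def publish(self, event, payload):
        for handler in self.subscribers.get(event, []):
            handler(payload)  # each consumer reacts to the same event

bus = EventBus()
log = []
bus.subscribe("OrderCreated", lambda o: log.append(f"notify {o['id']}"))
bus.subscribe("OrderCreated", lambda o: log.append(f"track {o['id']}"))
bus.subscribe("OrderCreated", lambda o: log.append(f"reserve {o['id']}"))

bus.publish("OrderCreated", {"id": "o-42"})
print(log)  # three consumers handled one published event
```

Note what the sketch hides: with a real broker the consumers run at different times, which is where the eventual consistency in the characteristics list comes from.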

Flow Patterns You'll See

Beyond types, there are structural patterns for how flows behave. Understanding these helps you recognize and model common scenarios.

Linear Flow

The simplest pattern—things happen one after another in a straight line.

LinearFlow = scenario "Simple Three-Step Process" {
  Step1 -> Step2 "Process A"
  Step2 -> Step3 "Process B"
  Step3 -> End "Process C"
}

Characteristics:

  • Easy to understand
  • No parallel processing
  • Single point of failure
  • Common in simple workflows

Branching Flow

One step branches into multiple possible paths. This is where your control flow logic lives.

BranchingFlow = scenario "Conditional Processing" {
  Start -> DecisionPoint "Initial processing"
  
  // Branch A: High priority
  DecisionPoint -> FastProcessor "Process immediately" [if priority = high]
  FastProcessor -> End "Complete"
  
  // Branch B: Low priority
  DecisionPoint -> Queue "Add to queue" [if priority = low]
  Queue -> SlowProcessor "Process when available"
  SlowProcessor -> End "Complete"
}

Characteristics:

  • Multiple possible paths
  • Based on conditions
  • Paths may have different characteristics
  • Common in workflows with approvals

Converging Flow

Multiple paths start separately but eventually come back together.

ConvergingFlow = scenario "Parallel then Merge" {
  // Parallel paths
  Start -> PathA "Process A"
  Start -> PathB "Process B"
  
  // Both paths converge
  PathA -> MergePoint "Contributes data"
  PathB -> MergePoint "Contributes data"
  
  // Single continuation
  MergePoint -> End "Combine and complete"
}

Characteristics:

  • Parallel processing possible
  • Multiple contributions to single result
  • Synchronization point at convergence
  • Common in gather-aggregate patterns

Looping Flow

A step repeats multiple times until a condition is met.

RetryingFlow = scenario "Payment with Retry Logic" {
  API -> PaymentGateway "Process payment"
  
  // First attempt fails
  PaymentGateway -> API "Payment failed: timeout"
  
  // Loop: retry up to 3 times
  API -> PaymentGateway "Retry payment" [attempt 2]
  PaymentGateway -> API "Payment failed: timeout"
  
  API -> PaymentGateway "Retry payment" [attempt 3]
  PaymentGateway -> API "Payment successful"
  
  // Exit loop
  API -> Database "Save order"
}

Characteristics:

  • Repeated execution
  • Exit condition required
  • Common in retry logic and polling
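The retry loop modeled above maps directly to code: bounded attempts, exponential backoff between them, and an exit as soon as a call succeeds. The delays here are illustrative.

```python
# The looping flow above as code: bounded retries with exponential
# backoff. Delay values are illustrative.

import time

def call_with_retries(fn, max_attempts=3, base_delay=0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # exit condition: give up after the last attempt
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x...

# Simulate a gateway that times out twice, then succeeds on attempt 3.
attempts = {"n": 0}
def flaky_charge():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("gateway timeout")
    return "payment successful"

assert call_with_retries(flaky_charge) == "payment successful"
assert attempts["n"] == 3
```

The exit condition is the important part; a retry loop without a bound is just a slower way to hammer a failing service.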

Creating Flows in Sruja

Sruja gives you three keywords for flows, and they're essentially interchangeable. Use whichever makes semantic sense for what you're modeling.

Using scenario

Use scenario for behavioral flows, especially user journeys and BDD-style scenarios.

MyScenario = scenario "User Logs In" {
  User -> WebApp "Enters credentials"
  WebApp -> API "Submits login"
  API -> Database "Verifies user"
  Database -> API "User found"
  API -> WebApp "Returns session token"
  WebApp -> User "Shows dashboard"
}

Using story

Use story as an alias for scenario. It's the same thing, just a different keyword that some teams prefer for user stories.

MyStory = story "As a customer, I want to purchase products" {
  Customer -> WebApp "Browses products"
  Customer -> WebApp "Adds to cart"
  Customer -> WebApp "Checks out"
  WebApp -> API "Processes order"
  API -> Customer "Shows confirmation"
}

Using flow

Use flow for data-oriented flows, especially DFD-style data flows and event flows.

MyFlow = flow "Data Processing Pipeline" {
  Source -> Ingestion "Raw data"
  Ingestion -> Processing "Transformed data"
  Processing -> Storage "Stored data"
  Storage -> Analytics "Query results"
}

When to Use Flows (And When Not To)

Flows are powerful, but they're not always the right tool. Here's how I decide.

Use Flows When

  • Sequence matters — You need to show the order in which things happen
  • Modeling data lineage — You need to track where data comes from and goes
  • Documenting user journeys — You need to understand the user experience
  • Understanding process steps — You need to see the complete workflow
  • Identifying bottlenecks — You need to find where things slow down
  • Modeling error paths — You need to show what happens when things fail

Don't Use Flows When

  • Showing general connections — Static relationships are better for showing overall architecture
  • High-level overview — Flows have too much detail for executive diagrams
  • Simple systems — If everything connects to everything, a flow doesn't add clarity
  • Static structure — If you're showing what components exist, not what they do, use static relationships

Pitfalls to Avoid (I've Made All of These)

Let me share some mistakes I've made and seen others make. Hopefully, you can avoid them.

Mistake 1: Too Much Detail

// Bad: Way too detailed
LoginFlow = scenario "User Login (Too Detailed)" {
  User -> UI "Clicks login button"
  UI -> API "HTTP POST /login"
  API -> Database "SELECT * FROM users WHERE email = ?"
  Database -> API "Returns user record"
  API -> AuthService "Verifies password hash"
  AuthService -> API "Password matches"
  API -> TokenService "Generates JWT token"
  TokenService -> API "Returns token"
  API -> UI "JSON response with token"
  UI -> User "Shows dashboard"
}

This is ridiculous. We're showing database queries, HTTP methods, and password hashing. These are implementation details, not architecture. Nobody reading a diagram needs this level of detail.

Better approach: Group related steps together.

// Good: Right level of detail
LoginFlow = scenario "User Login" {
  User -> UI "Submits credentials"
  UI -> API "Authenticates user"
  API -> Database "Verifies credentials"
  Database -> API "User found"
  API -> TokenService "Creates session"
  API -> UI "Returns token"
  UI -> User "Shows dashboard"
}

Mistake 2: Too Abstract

// Bad: Not useful
ProcessFlow = scenario "Data Processing (Too Abstract)" {
  Start -> End "Data gets processed"
}

This tells you nothing. What kind of data? How is it processed? What happens in between? Useless.

Better approach: Add meaningful intermediate steps that explain what's actually happening.

// Good: Meaningful steps
ProcessFlow = scenario "Order Processing" {
  Order -> API "Submits order"
  API -> Payment "Processes payment"
  Payment -> Database "Saves order"
  Database -> Email "Sends confirmation"
}

Mistake 3: Mixing Flows with Static Relationships

// Bad: Confusing mix
Customer -> Shop.WebApp "Uses"  // Static relationship

CheckoutFlow = scenario "Checkout" {
  Customer -> Shop.WebApp "Submits order"  // Flow
  Shop.WebApp -> Shop.API "Sends data"
}

This is confusing. You have both a static relationship and a flow between the same elements. Readers won't know which to look at or what the difference is.

Better approach: Keep flows and static relationships separate. Use flows for sequences, static relationships for overall structure.

// Static architecture view
view architecture {
  title "System Architecture"
  include *
}

// Flow view
view checkout_flow {
  title "Checkout Sequence"
  CheckoutFlow = scenario "Checkout" {
    Customer -> Shop.WebApp "Submits order"
    Shop.WebApp -> Shop.API "Sends data"
  }
}

Mistake 4: One Path Assumes Success

// Bad: Only shows happy path
OrderFlow = scenario "Order Processing" {
  Customer -> Shop.API "Submits order"
  Shop.API -> PaymentGateway "Charges payment"
  PaymentGateway -> Shop.API "Success"
  Shop.API -> Database "Saves order"
}

This only works if everything goes perfectly. But what happens if payment fails? What if the database is down? What if the API times out?

Better approach: Model both success and failure paths.

// Good: Shows both paths
OrderFlow = scenario "Order Processing with Errors" {
  Customer -> Shop.API "Submits order"
  Shop.API -> PaymentGateway "Charges payment"
  
  // Success path
  PaymentGateway -> Shop.API "Payment successful"
  Shop.API -> Database "Saves order"
  Shop.API -> Email "Sends confirmation"
  
  // Failure path
  PaymentGateway -> Shop.API "Payment declined"
  Shop.API -> Shop.WebApp "Returns error"
  Shop.WebApp -> Customer "Shows error message"
}

What to Remember

Flows reveal what static relationships cannot. They show sequence, transformation, bottlenecks, and error paths. When you create flows:

  • Show the sequence — Not just what's connected, but in what order
  • Model both paths — Success and failure, not just happy paths
  • Use the right type — Data flow for lineage, user journey for experience, control flow for logic
  • Right level of detail — Not too deep (implementation), not too shallow (useless)
  • Label meaningfully — Describe what's actually happening, not generic actions
  • Separate from static — Flows complement, don't replace, static architecture

If you take away one thing, let it be this: flows tell the story of how things move through your system over time. Static relationships show the stage. Flows show the performance.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding.

Question 1

You're documenting an order processing system. Which type of flow is most appropriate for modeling the complete customer experience from browsing products to receiving an order confirmation?

"A customer browses products, adds items to cart, proceeds to checkout, enters payment details, and places the order. If payment succeeds, they see a confirmation page and receive a confirmation email. If payment fails, they see an error message."

A) Data Flow B) User Journey / Scenario C) Control Flow D) Event Flow

Click to see the answer

Answer: B) User Journey / Scenario

Let's analyze each option:

A) Incorrect. Data flows focus on how data moves and transforms through the system. While data is involved in this scenario, the focus is on the customer's experience and actions, not on data transformations. A data flow would show order data flowing through the API, database, and analytics systems—not the customer's actions.

B) Correct! A user journey (or scenario) is the right choice here because:

  • The scenario describes user actions (browses, adds to cart, proceeds to checkout, enters payment, places order)
  • It shows system responses to those actions (confirmation page, error message, confirmation email)
  • It includes both success and failure paths from the user's perspective
  • It captures the complete customer experience from start to finish

User journeys are behavioral flows from the user's perspective. They're perfect for modeling user stories, requirements, and customer experiences.

C) Incorrect. Control flows show decision points and branching logic (if/else). While this scenario has branching (success vs. failure), the branching is based on an external payment gateway's response, not on internal business logic rules. Control flows are better for modeling things like "if order value > $100, require approval" or "if user is VIP, skip verification."

D) Incorrect. Event flows show how events propagate through event-driven systems (pub/sub patterns). This scenario doesn't describe an event-driven architecture—it describes a synchronous request/response flow where the customer waits for a response. Event flows would show things like "order created event published → multiple services consume event → eventual consistency."

Key insight: Choose flow types based on what you're modeling. Showing the customer's experience and actions? Use a user journey. Tracking data lineage and transformations? Use a data flow. Modeling business logic decisions? Use a control flow. Designing event-driven architecture? Use an event flow.


Question 2

You're creating a flow to model how order data moves through your system from creation to analytics. The data originates from the customer, gets stored in the transactional database, gets extracted daily by an ETL job, gets transformed and aggregated, and ends up in the data warehouse for reporting. Which flow type is most appropriate?

A) User Journey / Scenario B) Control Flow C) Data Flow D) Event Flow

Click to see the answer

Answer: C) Data Flow

Let's analyze each option:

A) Incorrect. A user journey focuses on the user's perspective and actions. This scenario is about data lineage and transformations, not about a user's experience. The customer creates the order, but the rest of the flow (ETL, transformation, aggregation, data warehouse) happens automatically without user involvement. A user journey wouldn't capture these data processing steps effectively.

B) Incorrect. Control flows show decision points and branching logic (if/else). This scenario describes a pipeline where data flows through sequential steps, not a decision tree with branches. There's no conditional logic—every order follows the same path through ETL, transformation, aggregation, and the data warehouse.

C) Correct! A data flow is the right choice here because:

  • The scenario describes data lineage — where data starts (customer), where it goes (database, ETL, warehouse), and where it ends (reporting)
  • It shows data transformations — order record → ETL extraction → transformation → aggregation → warehouse data
  • It models a pipeline — sequential steps that process data
  • It's focused on data movement, not user actions or business logic

Data flows (DFD-style) are perfect for showing how data moves through a system, including where it originates, how it's stored, how it's transformed, and where it ultimately goes.

D) Incorrect. An event flow would show how an event (like "OrderCreated") propagates through an event-driven system with multiple consumers. While event-driven systems can use data flows internally, this scenario describes an ETL pipeline with scheduled batch processing, not real-time event propagation. The ETL job pulls data daily, transforms it, and pushes it to the warehouse—this is classic data flow, not event flow.

Key insight: Data flows are all about lineage and transformation. If you're modeling where data comes from, how it changes shape, and where it ends up, use a data flow. Think of it like tracing the path data takes through a pipeline.


What's Next?

Now you understand what flows are and why they matter. You know the different types of flows (data, user journey, control, event) and when to use each one.

But we've only talked about what flows are. We haven't talked about how to create specific types of flows in practice.

In the next lesson, you'll learn about Data Flow Diagrams—how to model DFD-style data flows that show lineage, transformations, and analytics pipelines. You'll discover how to document where data comes from, how it changes, and where it ultimately ends up.

See you there!

Following the Trail: Data Flow Diagrams

Think of an oil pipeline. Crude oil goes in one end, flows through refineries where it's heated, distilled, and chemically treated, and comes out the other end as gasoline, diesel, or jet fuel. At each stage, the oil transforms into something more valuable.

Data flows work the same way. Raw data enters your system, flows through transformations where it's validated, normalized, enriched, and aggregated, and comes out as insights, reports, or visualizations.

In this lesson, you'll learn to create DFD-style data flows in Sruja. You'll discover how to track data lineage, document transformations, and model the pipelines that power your analytics and reporting.

Let's start by understanding what data flows are and why they matter.

Learning Goals

By the end of this lesson, you'll be able to:

  • Create DFD-style data flows in Sruja
  • Model data lineage from source to destination
  • Document data transformations and how data changes shape
  • Design ETL and analytics pipelines
  • Track where data comes from and where it ultimately goes

What Are Data Flow Diagrams, Really?

Data Flow Diagrams (DFDs) show how data moves through your system—where it originates, how it's stored, how it transforms, and where it ends up.

Think of it like tracing a river's path:

  • Source: Where the river starts (a spring, a mountain lake)
  • Flow: The river's journey through valleys and cities
  • Transformations: Tributaries joining, diversions splitting, dams changing flow
  • Destination: Where the river ends (ocean, another river)

In data terms:

  • Source: Where data originates (user input, database, API, file)
  • Flow: The path data takes through your system
  • Transformations: Validation, normalization, enrichment, aggregation
  • Destination: Where data ultimately goes (warehouse, dashboard, report)
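These four parts map directly onto a Sruja flow. Here's a minimal sketch (the element names are hypothetical):

SensorDataFlow = flow "Sensor Data Path" {
  // Source: where the data originates
  SensorDevice -> IngestionAPI "Raw sensor readings"

  // Transformations: validation and aggregation
  IngestionAPI -> Validator "Validated readings"
  Validator -> Aggregator "Hourly averages"

  // Destination: where the data ends up
  Aggregator -> MetricsDashboard "Charts and alerts"
}

Reading top to bottom, you can trace the data's path the same way you'd trace the river's.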

Why Data Flows Matter: The Real Benefits

I've built countless data systems over the years, and data flows are always the first thing I create. Here's why.

1. Data Lineage: Where Did This Come From?

Data flows tell you the complete history of data—where it started and every transformation it went through.

CustomerAnalyticsFlow = flow "Customer Data Lineage" {
  // Source: Where data starts
  Customer -> CRMSystem "Creates customer profile"
  
  // Transformation 1: Data extraction
  CRMSystem -> ETLService "Extracts customer records"
  
  // Transformation 2: Data normalization
  ETLService -> NormalizedData "Cleans and standardizes formats"
  
  // Transformation 3: Data enrichment
  NormalizedData -> EnrichmentService "Adds behavioral data from clickstream"
  
  // Transformation 4: Aggregation
  EnrichmentService -> AggregatedData "Creates daily customer segments"
  
  // Destination: Where data ends up
  AggregatedData -> DataWarehouse "Stores for reporting"
  DataWarehouse -> BusinessDashboard "Displays customer segments"
}

This flow tells you the complete story: customer data originates in CRM, gets extracted by ETL, normalized (cleaned up), enriched with behavioral data, aggregated into segments, stored in the warehouse, and ultimately shows up on a business dashboard.

Without this flow, would you know customer data comes from the CRM? Would you know it gets enriched with clickstream data? Would you know it's aggregated daily? Probably not.

I once worked on a project where nobody knew where analytics data came from. We spent weeks tracking down data lineage every time we found an issue. We added data flows, and suddenly everyone knew the complete path.

2. Process Understanding: What Actually Happens?

Data flows reveal the processing steps your data goes through—the "how" not just the "what."

ETLPipelineFlow = flow "ETL Pipeline Steps" {
  // Step 1: Extraction
  SourceDatabase -> IngestionService "Pulls raw transactions"
  
  // Step 2: Validation
  IngestionService -> ValidationService "Validates schema and data types"
  
  // Step 3: Transformation
  ValidationService -> TransformationService "Normalizes dates, currencies, formats"
  
  // Step 4: Loading
  TransformationService -> DataWarehouse "Loads transformed data"
}

This shows you the complete ETL process: extract, validate, transform, load. You can see exactly what each service does and in what order.

When something breaks—a data quality issue, a failed load, a malformed record—you know exactly where to look. Is it in ingestion? In validation? In transformation? The flow tells you.

3. Transformation Documentation: How Does Data Change?

Data flows document how data transforms at each step—what shape it takes, what format it's in.

TransformationFlow = flow "Data Transformations" {
  RawSource -> ETLService "Raw CSV file"
  
  // Transformation 1: Validation
  ETLService -> ValidatedData "Validated (removed invalid records)"
  
  // Transformation 2: Normalization
  ValidatedData -> NormalizedData "Normalized (standardized formats)"
  
  // Transformation 3: Enrichment
  NormalizedData -> EnrichedData "Enriched (added location data)"
  
  // Transformation 4: Aggregation
  EnrichedData -> FinalData "Aggregated (daily metrics)"
}

Each arrow shows a transformation:

  • Raw CSV → Validated (invalid records removed)
  • Validated → Normalized (formats standardized)
  • Normalized → Enriched (location data added)
  • Enriched → Aggregated (metrics computed)

This documentation is invaluable. When someone asks, "What happened to this data?" you can point to the flow and show them each transformation step.

I once inherited a system where nobody documented data transformations. We found mysterious records in the warehouse—dates in the wrong format, currencies mixed up, values that made no sense. We spent months reverse-engineering what transformations were happening. Document it upfront.

4. Bottleneck Identification: Where Will Things Slow Down?

Data flows make bottlenecks obvious—where processing might slow down, where queues might form, where latency will be worst.

AnalyticsFlow = flow "Analytics Pipeline" {
  UserActions -> TrackingService "Captures events" [fast]
  TrackingService -> EventStream "Publishes events" [fast]
  EventStream -> BatchProcessor "Consumes and processes" [slow, batch job]
  BatchProcessor -> DataWarehouse "Loads aggregated data" [medium]
  DataWarehouse -> Dashboard "Queries for display" [fast]
}

Look at the labels: fast, fast, slow, medium, fast. The batch processor is marked as slow because it's a scheduled job that runs once daily. This tells you immediately: if you're looking for real-time analytics, you'll be disappointed. The bottleneck is the batch processor.

When users complain about stale data ("why does the dashboard show yesterday's numbers?"), you know exactly why. The flow tells you.

Creating Data Flows in Sruja

Sruja gives you the flow keyword for creating DFD-style data flows. It's designed specifically for data-oriented flows.

Using flow for Data Pipelines

OrderDataFlow = flow "Order Data Processing" {
  Customer -> WebApp "Order form submission"
  WebApp -> API "Order JSON payload"
  API -> Database "Persist order record"
  Database -> AnalyticsExtractor "Extract order events"
  AnalyticsExtractor -> EventStream "Publish to analytics"
  EventStream -> DataWarehouse "Aggregate and store"
  DataWarehouse -> ReportingTool "Query for reports"
}

Using Metadata for Transformations

You can add metadata to document what each step does:

ETLService = container "ETL Service" {
  metadata {
    transformations [
      "Validate schema and data types",
      "Normalize dates to ISO 8601",
      "Standardize currency codes to ISO 4217",
      "Remove invalid or corrupt records"
    ]
    output_format "JSON"
    output_schema "v2.1"
    batch_window "Daily at 2AM UTC"
  }
}

This metadata tells anyone reading the flow:

  • What transformations happen
  • What output format to expect
  • What schema version
  • When the batch runs

Common Data Flow Patterns

After building data systems for years, I've noticed patterns that repeat constantly. Let me show you the ones I see most often.

Pattern 1: ETL Pipeline

Extract, Transform, Load—the classic pattern for moving data from operational systems to analytics.

ETLPipelineFlow = flow "Classic ETL Pipeline" {
  // Extract: Pull from source systems
  TransactionDB -> DataCollector "Extracts daily transactions"
  CustomerDB -> DataCollector "Extracts customer profiles"
  
  // Transform: Clean and normalize
  DataCollector -> ValidationService "Validates schemas"
  ValidationService -> CleaningService "Removes duplicates and errors"
  CleaningService -> TransformationService "Normalizes formats"
  
  // Load: Push to warehouse
  TransformationService -> DataWarehouse "Loads transformed data"
  DataWarehouse -> ReportingEngine "Available for queries"
}

Characteristics:

  • Scheduled batch processing (daily, hourly)
  • Source systems are OLTP (transactional)
  • Destination is OLAP (analytics)
  • Focus on data quality and consistency

Use when: Building traditional data warehouses, moving from transactional systems to analytics.

Pattern 2: Event Sourcing

Every change to data is captured as an event, and different services project events into read models.

EventSourcingFlow = flow "Event Sourcing Pattern" {
  // Events captured
  OrderAPI -> EventStore "Persist OrderCreated event"
  OrderAPI -> EventStore "Persist OrderPaid event"
  OrderAPI -> EventStore "Persist OrderShipped event"
  
  // Multiple projections
  EventStore -> OrderReadModel "Project to order summary view"
  EventStore -> CustomerReadModel "Project to customer order history view"
  EventStore -> AnalyticsReadModel "Project to order metrics view"
  
  // Read models queried
  OrderReadModel -> OrderService "Fetch order details"
  CustomerReadModel -> CustomerService "Fetch customer orders"
  AnalyticsReadModel -> AnalyticsService "Fetch order metrics"
}

Characteristics:

  • Events are immutable (never change)
  • Multiple read models for different use cases
  • Rebuildable (can replay events)
  • Eventually consistent

Use when: Building systems where audit trails matter, where you need multiple views of the same data, or where rebuildability is important.

Pattern 3: Real-Time Analytics Pipeline

Events flow through a real-time processing pipeline for immediate insights.

RealTimeAnalyticsFlow = flow "Real-Time Analytics Pipeline" {
  // Events captured
  UserActions -> EventCollector "Captures clickstream events"
  EventCollector -> KafkaStream "Publishes to Kafka"
  
  // Real-time processing
  KafkaStream -> StreamProcessor "Processes events in real-time"
  StreamProcessor -> RedisCache "Updates user session data"
  StreamProcessor -> Elasticsearch "Indexes events for search"
  
  // Real-time consumption
  RedisCache -> WebApp "Serves session data"
  Elasticsearch -> Dashboard "Shows real-time user activity"
}

Characteristics:

  • Real-time (seconds to minutes latency)
  • Stream processing (Kafka, Kinesis, Pulsar)
  • Eventually consistent (some delay acceptable)
  • Focus on speed and availability

Use when: Building real-time dashboards, fraud detection, personalized recommendations, live monitoring.

Pattern 4: Lambda Architecture

Batch processing for comprehensive analytics plus real-time for speed.

LambdaArchitectureFlow = flow "Lambda Architecture" {
  // Speed layer: Real-time
  Events -> StreamProcessing "Real-time processing"
  StreamProcessing -> SpeedLayer "Serves fast views"
  
  // Batch layer: Comprehensive
  Events -> BatchProcessing "Comprehensive processing"
  BatchProcessing -> BatchLayer "Serves accurate views"
  
  // Serving layer: Merges both
  SpeedLayer -> QueryService "Provides fast results"
  BatchLayer -> QueryService "Provides accurate results"
  QueryService -> API "Serves merged views"
}

Characteristics:

  • Two paths: fast (speed layer) and accurate (batch layer)
  • Speed layer provides quick but possibly incomplete results
  • Batch layer provides comprehensive but delayed results
  • Query service merges both for best of both worlds

Use when: You need both real-time responsiveness and comprehensive accuracy.

Documenting Data Transformations

One of the most important things you can do in data flows is document transformations clearly.

Using Relationship Labels

TransformFlow = flow "Data Transformations" {
  RawSource -> ETLService "Raw CSV data"
  ETLService -> ValidatedData "Validated (removed invalids)"
  ValidatedData -> NormalizedData "Normalized (standardized)"
  NormalizedData -> EnrichedData "Enriched (added location)"
  EnrichedData -> FinalData "Aggregated (daily metrics)"
}

Each label describes what transformation happened at that step.

Adding Metadata

ETLService = container "ETL Service" {
  metadata {
    transformations [
      "Remove duplicate records",
      "Normalize phone numbers to E.164 format",
      "Standardize dates to ISO 8601 (UTC)",
      "Geocode addresses to lat/long"
    ]
    input_format "CSV"
    output_format "JSON"
    output_schema "v2.1"
  }
}

This metadata provides complete documentation of what transformations happen.

Complete Data Flow Example

Let me show you a complete example that brings everything together.

import { * } from 'sruja.ai/stdlib'

// People
Customer = person "Customer"

// Systems
Shop = system "Shop" {
  WebApp = container "Web Application"
  API = container "API Service"
  Database = database "PostgreSQL"
}

Analytics = system "Analytics Platform" {
  Ingestion = container "Data Ingestion"
  Processing = container "Data Processing"
  Warehouse = database "Data Warehouse"
  Reporting = container "Reporting Engine"
}

Dashboard = system "Analytics Dashboard" {
  UI = container "Dashboard UI"
}

// Complete data flow: Order to analytics
OrderAnalyticsFlow = flow "Order Analytics Pipeline" {
  // Source: Customer creates order
  Customer -> Shop.WebApp "Submits order"
  Shop.WebApp -> Shop.API "Order data"
  Shop.API -> Shop.Database "Persist order"
  
  // Extraction: Pull orders for analytics
  Shop.Database -> Analytics.Ingestion "Extract order events"
  
  // Transformation: Validate and enrich
  Analytics.Ingestion -> Analytics.Processing "Validate and normalize"
  Analytics.Processing -> Analytics.Processing "Enrich with customer data"
  Analytics.Processing -> Analytics.Processing "Aggregate metrics"
  
  // Loading: Store in warehouse
  Analytics.Processing -> Analytics.Warehouse "Store aggregated data"
  
  // Consumption: Query and display
  Dashboard.UI -> Analytics.Reporting "Query metrics"
  Analytics.Reporting -> Analytics.Warehouse "Fetch data"
  Analytics.Reporting -> Dashboard.UI "Return results"
}

view index {
  include *
}

This flow shows the complete path from customer action to analytics dashboard. Anyone reading this diagram understands how data moves through the system.

What to Remember

Data flows tell the story of how data moves through your system—from origin to destination, including every transformation along the way. When you create data flows:

  • Document lineage — Where data comes from and where it goes
  • Show transformations — How data changes shape at each step
  • Use metadata — Document what each service actually does
  • Identify bottlenecks — Mark slow steps and understand their impact
  • Choose right pattern — ETL, event sourcing, real-time, or lambda
  • Track both paths — Success and failure paths

If you take away one thing, let it be this: data flows are your best documentation of how data actually moves through your system. When someone asks, "Where did this data come from?" or "What happened to this data?" your data flow has the answer.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding.

Question 1

You're modeling a fitness tracking app's data flow. The app tracks user workouts, syncs them to a cloud API, where they're stored in a database. A daily ETL job extracts workouts, calculates daily metrics (calories burned, workout duration), and stores results in a data warehouse for business analytics. Which flow type is most appropriate?

"Users log workouts on their phones. Workout data syncs to the cloud API, gets stored in the main database. Every night at 2 AM, an ETL job pulls all workouts from the database, calculates aggregated metrics (total calories, total minutes, workout counts per user), and loads the results into a data warehouse. Business analysts query the warehouse for reports on user engagement and app usage."

A) User Journey / Scenario B) Control Flow C) Data Flow (DFD Style) D) Event Flow

Click to see the answer

Answer: C) Data Flow (DFD Style)

Let's analyze each option:

A) Incorrect. A user journey shows how a user interacts with a system to achieve a goal. This scenario describes data movement and transformations, not user interactions. The user creates a workout, but the rest of the flow (syncing, extracting, calculating, aggregating, loading) happens automatically without user involvement. A user journey wouldn't capture these data processing steps effectively.

B) Incorrect. A control flow shows decision points and branching logic (if/else). This scenario describes a pipeline where data flows through sequential steps (sync → store → extract → calculate → aggregate → load). There's no conditional logic—every workout follows the same path through the ETL pipeline. Control flows are better for modeling things like "if the workout type is running, calculate calories differently" or "if user is premium, store additional metrics."

C) Correct! A data flow (DFD-style) is the right choice here because:

  • The scenario describes data lineage — where data starts (user workout), where it goes (cloud API, database, data warehouse), and what happens along the way
  • It shows data transformations — raw workout → synced workout → extracted workout → calculated metrics → aggregated metrics
  • It models a pipeline — sequential steps that process data (extract, transform, load)
  • It's focused on data movement, not user actions or business logic

Data flows (DFD-style) are perfect for showing how data moves through a system, including where it originates, how it's stored, how it's transformed, and where it ultimately goes. This scenario is a classic ETL pipeline—extract from operational database, transform (calculate metrics), load into data warehouse.

D) Incorrect. An event flow would show how events propagate through an event-driven system (pub/sub patterns). This scenario describes a scheduled batch ETL job, not real-time event propagation. The ETL job pulls data daily at 2 AM, transforms it, and loads it. There's no event bus, no event streaming, no multiple consumers processing the same event. Event flows would show things like "workout completed event published → analytics service consumes event → notification service consumes event → recommendation service consumes event."

Key insight: Choose flow types based on what you're modeling. Showing where data comes from, how it transforms, and where it ends up? Use a data flow. Modeling user interactions and experience? Use a user journey. Modeling decision logic and branches? Use a control flow. Designing event-driven architecture? Use an event flow.


Question 2

You're creating a data flow for an e-commerce platform. Which structure best documents the data transformations that happen in the ETL pipeline?

A)

ETLPipeline = flow "ETL Pipeline" {
  TransactionDB -> ETLService "Extract"
  ETLService -> DataWarehouse "Load"
}

B)

ETLPipeline = flow "ETL Pipeline" {
  TransactionDB -> ETLService "Extract transactions"
  ETLService -> ValidatedData "Validate and remove errors"
  ValidatedData -> NormalizedData "Normalize formats"
  NormalizedData -> EnrichedData "Add customer data"
  EnrichedData -> DataWarehouse "Load to warehouse"
}

C)

ETLPipeline = flow "ETL Pipeline" {
  TransactionDB -> DataWarehouse "Move data"
}

D)

ETLPipeline = flow "ETL Pipeline" {
  TransactionDB -> ETLService "Extract"
  ETLService -> ValidatedData "?"
  ValidatedData -> NormalizedData "?"
  NormalizedData -> EnrichedData "?"
  EnrichedData -> DataWarehouse "?"
}

Click to see the answer

Answer: B) Shows each transformation step clearly

Let's analyze each option:

A) Incorrect. This flow has only two steps: extract and load. It completely skips the transformation step. ETL stands for Extract, Transform, Load—the transformation is the middle T! This flow doesn't show what transformations happen. Are transactions validated? Are formats normalized? Is data enriched? The flow provides no information about these crucial steps. Anyone reading this diagram wouldn't understand what actually happens to the data.

B) Correct! This flow documents each transformation step clearly:

  • Extract transactions — Pulls raw data from transactional database
  • Validate and remove errors — First transformation: validates data quality, removes corrupt or invalid records
  • Normalize formats — Second transformation: standardizes dates, currencies, phone numbers, etc. to consistent formats
  • Add customer data — Third transformation: enriches transactions with customer information (name, tier, location, etc.)
  • Load to warehouse — Final step: loads transformed, enriched data into warehouse

Each relationship label describes what transformation happens at that step. Anyone reading this diagram understands the complete ETL process and what happens to the data at each stage.

C) Incorrect. This is too abstract. "Move data" tells you nothing about what happens. Is data validated? Is it transformed? Is it enriched? How does the format change? What transformations are applied? The flow provides no useful information. It's the equivalent of saying "data goes from point A to point B" without explaining the journey.

D) Incorrect. While this has the right number of steps, the labels are meaningless ("?"). They tell you nothing about what transformation happens at each step. Each step is a black box—you know there are transformations, but you don't know what they are.

Key insight: Document transformations clearly using descriptive relationship labels. Don't just show that data moves—show how it transforms. Label each step with what actually happens: "validate," "normalize," "enrich," "aggregate," "calculate." This makes your data flows informative and useful, not just correct.


What's Next?

Now you understand how to create data flow diagrams. You can model data lineage, document transformations, and design ETL and analytics pipelines.

But data flows are just one type of flow. There's another crucial type—user journeys (or behavioral flows)—which show how users interact with your system from their perspective.

In the next lesson, you'll learn about user journeys. You'll discover how to model BDD-style scenarios, document happy paths and error paths, and capture the complete user experience from start to finish.

See you there!

Walking in Their Shoes: User Journeys

Ever watched someone use a product you built? You notice things you never would—confusing buttons, unclear error messages, workflows that don't make sense. Why? Because you built it from your perspective, not theirs.

User journeys (or scenarios) are your way to model the complete user experience—from their perspective. They show what users do, how your system responds, and what happens when things go right (and wrong).

In this lesson, you'll learn to create user journeys that capture the complete user experience. You'll discover how to model happy paths and error paths, document edge cases, and create scenarios that serve as both requirements and test cases.

Let's start by understanding what user journeys actually are.

Learning Goals

By the end of this lesson, you'll be able to:

  • Model user journeys and behavioral scenarios from the user's perspective
  • Use BDD-style scenarios to document requirements
  • Model both happy paths and error paths
  • Document edge cases and unusual scenarios
  • Create scenarios that serve as test cases and documentation

What Are User Journeys, Really?

User journeys (also called scenarios or behavioral flows) show how a person interacts with your system to achieve a goal. Unlike data flows that show data moving through a system, user journeys show human behavior and system responses.

Think of it like a story: "As a customer, I want to buy a product so I can use it." The journey shows every step—what the customer does, what the system does, and what the customer experiences.

User journeys capture:

  • User actions (what they click, type, select)
  • System responses (what happens, what they see)
  • Success paths (when everything works)
  • Error paths (when things go wrong)
  • Decision points (where different things happen based on conditions)

What makes them special:

  • User's perspective (not system's)
  • Behavioral (not just data)
  • Complete experience (start to finish)
  • Both happy and sad paths

Why User Journeys Matter

I've learned this the hard way. I once built a feature without modeling user journeys, and when we launched, users were completely confused.

They couldn't find the checkout button. When they did, error messages were cryptic. The "success" page didn't tell them what to do next. Support tickets flooded in. We had to rebuild the entire feature.

If I'd modeled user journeys upfront, we would have seen these issues immediately. The journey would have shown: "User clicks checkout → System shows error 'ERR_500' → User is confused."

User journeys matter because:

1. Requirements Clarity

User journeys turn vague requirements into concrete scenarios:

Vague requirement: "Users should be able to check out"

User journey makes it concrete:

CheckoutJourney = scenario "Customer Checkout" {
  Customer -> WebApp "Clicks checkout button"
  WebApp -> API "Validates cart"
  API -> Database "Checks inventory"
  Database -> API "Inventory available"
  API -> PaymentGateway "Processes payment"
  PaymentGateway -> API "Payment successful"
  API -> Database "Saves order"
  API -> EmailService "Sends confirmation"
  WebApp -> Customer "Shows order confirmation page"
}

Now everyone knows exactly what "checkout" means—every step, every system involved, every user action.

2. Test Case Generation

User journeys become test cases automatically:

HappyPathTest = scenario "Test: Successful Checkout" {
  // From the user journey
  Customer -> WebApp "Clicks checkout"
  WebApp -> API "Validates cart"
  // ... test each step
}

ErrorPathTest = scenario "Test: Checkout with Invalid Payment" {
  Customer -> WebApp "Clicks checkout"
  WebApp -> API "Validates cart"
  API -> PaymentGateway "Processes payment"
  PaymentGateway -> API "Payment declined: invalid card"
  API -> WebApp "Returns error: Invalid card"
  WebApp -> Customer "Shows error: Please check your card details"
}

I've worked on teams where we spent weeks writing test cases manually. With user journeys, we just turn scenarios into tests. Huge time saver.

3. User Experience Visibility

User journeys show the complete experience—not just what works, but how it feels:

RegistrationJourney = scenario "User Registration" {
  Customer -> WebApp "Opens registration form"
  Customer -> WebApp "Enters name and email"
  Customer -> WebApp "Enters password"
  Customer -> WebApp "Clicks 'Create Account'"
  WebApp -> API "Submits registration"
  API -> Database "Creates user account"
  API -> EmailService "Sends welcome email"
  WebApp -> Customer "Shows 'Account created!' message"
  WebApp -> Customer "Redirects to dashboard"
}

See how this shows the full experience? Form entry → submission → success message → welcome email → dashboard. Anyone reading this understands exactly what users experience.

4. Edge Case Discovery

When you model user journeys, you naturally think about edge cases:

RegistrationEdgeCases = scenario "Registration Edge Cases" {
  // Edge case 1: Duplicate email
  Customer -> WebApp "Registers with existing email"
  WebApp -> API "Submits registration"
  API -> Database "Checks email exists"
  Database -> API "Email already exists"
  API -> WebApp "Returns error: Email already registered"
  WebApp -> Customer "Shows error: Email already in use"

  // Edge case 2: Weak password
  Customer -> WebApp "Registers with weak password"
  WebApp -> API "Submits registration"
  API -> ValidationService "Validates password strength"
  ValidationService -> API "Password too weak"
  API -> WebApp "Returns error: Password must be at least 8 characters"
  WebApp -> Customer "Shows error: Password too weak"
}

Users will find edge cases for you if you don't: I've seen emojis in names and passwords full of special characters break a system in creative ways. Model edge cases upfront.

Creating User Journeys in Sruja

Sruja gives you two keywords for user journeys, and they're interchangeable:

Using scenario

Use scenario for behavioral flows—it's the most common and descriptive choice:

CheckoutScenario = scenario "Customer Checkout Experience" {
  Customer -> Shop.WebApp "Clicks checkout button"
  Shop.WebApp -> Shop.API "Validates shopping cart"
  Shop.API -> PaymentGateway "Processes payment"
  Shop.WebApp -> Customer "Shows order confirmation"
}

Using story

Use story when you want to emphasize the narrative or story aspect:

CheckoutStory = story "As a customer, I want to checkout so I can purchase my items" {
  Customer -> Shop.WebApp "Clicks checkout button"
  Shop.WebApp -> Shop.API "Validates shopping cart"
  Shop.API -> PaymentGateway "Processes payment"
  Shop.WebApp -> Customer "Shows order confirmation"
}

The story keyword is great for BDD (Behavior-Driven Development) style where you write scenarios in "Given-When-Then" format.

BDD (Behavior-Driven Development) Style

BDD is about writing requirements as behavior, not specifications. User journeys in Sruja map perfectly to BDD's "Given-When-Then" structure.

The Given-When-Then Pattern

// GIVEN: Customer has items in shopping cart
// WHEN: Customer clicks checkout and completes payment
// THEN: Order is created and confirmation is shown

CheckoutJourney = scenario "Customer Checkout (BDD Style)" {
  // GIVEN: Customer has items in cart
  Customer -> Shop.WebApp "Views shopping cart with 3 items"
  Shop.WebApp -> Shop.API "Fetches cart contents"
  Shop.API -> Shop.Database "Queries cart"
  Shop.Database -> Shop.API "Returns cart data"
  
  // WHEN: Customer completes checkout flow
  Customer -> Shop.WebApp "Clicks checkout button"
  Shop.WebApp -> Shop.API "Validates cart"
  Shop.API -> Shop.Database "Checks inventory"
  Shop.Database -> Shop.API "All items available"
  Shop.API -> PaymentGateway "Processes payment"
  PaymentGateway -> Shop.API "Payment successful"
  
  // THEN: Order is created and confirmation shown
  Shop.API -> Shop.Database "Creates order"
  Shop.API -> Shop.Database "Reserves inventory"
  Shop.API -> EmailService "Sends confirmation email"
  Shop.WebApp -> Customer "Shows order confirmation page with order #12345"
}

Why BDD works:

  • It's in plain language anyone can understand
  • It focuses on behavior, not implementation
  • It serves as both requirements and tests
  • It's unambiguous (unlike "system should work well")

I've worked with product managers who couldn't understand technical specs. But they understood BDD scenarios immediately. It became our common language.
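
Since the story keyword pairs naturally with BDD, the same Given-When-Then structure can be written as a story. Here's a hypothetical sketch reusing the elements from the earlier examples (the flow steps are illustrative, not a prescribed pattern):

PurchaseStory = story "As a customer, I want to pay for my cart so that my order is placed" {
  // GIVEN: Customer has items in their cart
  Customer -> Shop.WebApp "Views shopping cart"

  // WHEN: Customer completes payment
  Customer -> Shop.WebApp "Clicks checkout button"
  Shop.API -> PaymentGateway "Processes payment"
  PaymentGateway -> Shop.API "Payment successful"

  // THEN: Order is confirmed
  Shop.WebApp -> Customer "Shows order confirmation"
}

The title doubles as a user story, and the comments mark the Given-When-Then phases, so the same file reads as requirement, documentation, and test outline.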

User Journey Patterns

After modeling user journeys for years, I've found patterns that repeat constantly. Let me show you the ones I see most often.

Pattern 1: Happy Path

The ideal scenario where everything works perfectly:

HappyRegistration = scenario "User Registration (Happy Path)" {
  Customer -> WebApp "Opens registration form"
  Customer -> WebApp "Enters name, email, password"
  Customer -> WebApp "Clicks 'Create Account'"
  WebApp -> API "Submits registration"
  API -> Database "Creates user account"
  Database -> API "User created successfully"
  API -> EmailService "Sends welcome email"
  EmailService -> Customer "Receives welcome email"
  API -> WebApp "Returns success"
  WebApp -> Customer "Shows 'Account created! Welcome!' message"
  WebApp -> Customer "Redirects to dashboard"
}

Characteristics:

  • No errors
  • No branches
  • Perfect flow from start to finish
  • User achieves their goal

When to model:

  • First—model the ideal scenario
  • Before optimizing, understand what success looks like
  • Before testing edge cases, have a working baseline

Pattern 2: Error Path

What happens when things go wrong:

ErrorRegistration = scenario "User Registration (Error Path: Duplicate Email)" {
  Customer -> WebApp "Opens registration form"
  Customer -> WebApp "Enters existing email: john@example.com"
  Customer -> WebApp "Enters name and password"
  Customer -> WebApp "Clicks 'Create Account'"
  WebApp -> API "Submits registration"
  API -> Database "Checks if email exists"
  Database -> API "Email already exists: john@example.com"
  API -> WebApp "Returns error: Email already registered"
  WebApp -> Customer "Shows error: 'Email already in use. Try logging in or use a different email.'"
}

Common error paths to model:

  • Duplicate data (email, username)
  • Invalid data (email format, weak password)
  • Missing data (required fields empty)
  • System errors (database down, API timeout)
  • Business rule violations (age restriction, region blocking)

Characteristics:

  • Error occurs at some step
  • System returns meaningful error
  • User sees helpful error message
  • User can correct and retry

When to model:

  • Always model critical error paths
  • Focus on errors users actually encounter
  • Ensure error messages are helpful, not cryptic

I once worked on a system where error messages were like "ERR_500_CHECKOUT_FAILED." Users had no idea what went wrong. We rewrote them to be helpful: "Payment failed. Please check your card details or try a different payment method." Support tickets dropped by 70%.

Pattern 3: Branching Path

Different things happen based on conditions:

BranchingApproval = scenario "Order Approval (Branching Based on Value)" {
  Manager -> WebApp "Submits order for approval"
  WebApp -> API "Processes approval request"
  API -> Database "Fetches order details"
  Database -> API "Returns order: value = $1500"
  
  // Branch 1: Auto-approve for low-value orders
  if value < 1000 {
    API -> Database "Updates order: auto-approved"
    API -> WebApp "Returns: Order auto-approved"
    WebApp -> Manager "Shows 'Order approved! Shipping soon.'"
  }
  
  // Branch 2: Manual review for high-value orders
  if value >= 1000 {
    API -> EmailService "Sends approval request to director"
    EmailService -> Director "Receives approval request"
    Director -> WebApp "Reviews order and approves"
    WebApp -> API "Submits approval decision"
    API -> Database "Updates order: manually approved"
    API -> WebApp "Returns: Order approved by director"
    WebApp -> Manager "Shows 'Order approved by director. Shipping soon.'"
  }
}

Characteristics:

  • Decision point in the flow
  • Multiple possible paths
  • Different actions based on conditions
  • Paths may converge at the end

When to model:

  • Business logic has rules
  • Different user types have different experiences
  • System behavior changes based on context

Pattern 4: Retry Path

System tries multiple times before succeeding or failing:

RetryPayment = scenario "Payment with Automatic Retry" {
  Customer -> WebApp "Clicks 'Place Order'"
  WebApp -> API "Submits order for payment"
  
  // Attempt 1: Fails with timeout
  API -> PaymentGateway "Process payment"
  PaymentGateway -> API "Payment failed: timeout"
  API -> WebApp "Returns: Processing, please wait..."
  WebApp -> Customer "Shows spinner: 'Processing your payment...'"
  
  // Attempt 2: Fails with timeout
  API -> PaymentGateway "Retry payment (attempt 2)"
  PaymentGateway -> API "Payment failed: timeout"
  
  // Attempt 3: Succeeds
  API -> PaymentGateway "Retry payment (attempt 3)"
  PaymentGateway -> API "Payment successful!"
  
  // Continue with success
  API -> Database "Saves order"
  API -> EmailService "Sends confirmation"
  WebApp -> Customer "Shows order confirmation"
}

Characteristics:

  • Same action repeated
  • Eventually succeeds or fails permanently
  • User sees "retrying" state
  • Transparent about what's happening

When to model:

  • External services are flaky
  • Network issues are common
  • You want to handle transient failures gracefully

Complete User Journey Example

Let me show you a complete user journey that brings everything together:

import { * } from 'sruja.ai/stdlib'

// Person
Customer = person "Customer"

// System
Shop = system "Shop" {
  WebApp = container "Web Application"
  API = container "API Service"
  Database = database "PostgreSQL"
}

// External systems
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "vendor"]
    sla "99.9% uptime"
  }
}

EmailService = system "Email Service" {
  metadata {
    tags ["external", "vendor"]
  }
}

// Complete user journey from browsing to confirmation
CompleteCheckoutJourney = scenario "Complete Checkout Experience" {
  // Step 1: Browse products
  Customer -> Shop.WebApp "Browses products"
  Shop.WebApp -> Shop.API "Fetches product list"
  Shop.API -> Shop.Database "Queries products"
  Shop.Database -> Shop.API "Returns 50 products"
  Shop.API -> Shop.WebApp "Returns product data"
  Shop.WebApp -> Customer "Displays products grid"

  // Step 2: Add to cart
  Customer -> Shop.WebApp "Clicks 'Add to Cart' on Product #5"
  Shop.WebApp -> Shop.API "Adds item to cart"
  Shop.API -> Shop.Database "Saves cart item"

  // Step 3: Review cart
  Customer -> Shop.WebApp "Clicks 'View Cart'"
  Shop.WebApp -> Shop.API "Fetches cart contents"
  Shop.API -> Shop.Database "Queries cart"
  Shop.Database -> Shop.API "Returns cart: 3 items, $75 total"
  Shop.API -> Shop.WebApp "Returns cart data"
  Shop.WebApp -> Customer "Shows cart with total"

  // Step 4: Checkout
  Customer -> Shop.WebApp "Clicks 'Checkout'"
  Shop.WebApp -> Shop.API "Initiates checkout"
  Shop.API -> Shop.Database "Validates cart"
  Shop.Database -> Shop.API "Cart is valid"
  
  // Step 5: Payment processing
  Shop.API -> PaymentGateway "Process payment: $75"
  PaymentGateway -> Shop.API "Payment successful!"
  
  // Step 6: Order creation
  Shop.API -> Shop.Database "Creates order #12345"
  Shop.API -> Shop.Database "Reserves inventory"
  Shop.API -> EmailService "Send order confirmation"
  
  // Step 7: Confirmation page
  Shop.API -> Shop.WebApp "Returns order confirmation"
  Shop.WebApp -> Customer "Shows 'Order #12345 Confirmed!' page"
  
  // Step 8: Email delivery
  EmailService -> Customer "Sends confirmation email"
}

view index {
  include *
}

This journey shows the complete experience from start to finish—every user action, every system response, every step. Anyone reading this understands exactly what a customer experiences when checking out.

Testing with Scenarios

One of the most powerful things about user journeys: they become test cases automatically.

Acceptance Criteria as Scenarios

// Acceptance criteria: As a customer, I want to checkout so that I can purchase products

// AC1: Customer can checkout with valid payment
HappyCheckout = scenario "AC1: Successful Checkout" {
  Customer -> Shop.WebApp "Clicks checkout with valid payment"
  Shop.WebApp -> Shop.API "Validates cart"
  Shop.API -> PaymentGateway "Process payment"
  PaymentGateway -> Shop.API "Payment successful"
  Shop.API -> Shop.Database "Creates order"
  Shop.API -> EmailService "Sends confirmation"
  Shop.WebApp -> Customer "Shows confirmation page"
}

// AC2: Customer sees helpful error with invalid payment
InvalidPaymentCheckout = scenario "AC2: Checkout with Invalid Payment" {
  Customer -> Shop.WebApp "Clicks checkout with expired card"
  Shop.WebApp -> Shop.API "Validates cart"
  Shop.API -> PaymentGateway "Process payment"
  PaymentGateway -> Shop.API "Payment declined: card expired"
  Shop.API -> Shop.WebApp "Returns error"
  Shop.WebApp -> Customer "Shows 'Card expired. Please update your payment method.'"
}

// AC3: Customer receives confirmation email
EmailConfirmation = scenario "AC3: Confirmation Email Sent" {
  Shop.API -> Shop.Database "Creates order"
  Shop.API -> EmailService "Send confirmation"
  EmailService -> Customer "Receives confirmation email"
}

I've worked on teams where we spent months debating requirements. When we turned them into BDD-style scenarios, everyone agreed. No ambiguity, no confusion, no "I thought you meant X."

Documenting Edge Cases

Don't just model happy paths. Edge cases are where systems break.

Common Edge Cases to Model

// Edge case 1: Checkout with expired card
ExpiredCardCheckout = scenario "Checkout with Expired Card" {
  Customer -> Shop.WebApp "Attempts checkout with expired card"
  Shop.WebApp -> Shop.API "Submits order"
  Shop.API -> PaymentGateway "Process payment"
  PaymentGateway -> Shop.API "Payment failed: card expired"
  Shop.API -> Shop.WebApp "Returns error"
  Shop.WebApp -> Customer "Shows 'Your card has expired. Please update your payment method.'"
}

// Edge case 2: Checkout with insufficient inventory
InsufficientInventoryCheckout = scenario "Checkout with Insufficient Inventory" {
  Customer -> Shop.WebApp "Attempts checkout"
  Shop.WebApp -> Shop.API "Submits order"
  Shop.API -> Shop.Database "Checks inventory"
  Shop.Database -> Shop.API "Insufficient stock: 5 requested, 2 available"
  Shop.API -> Shop.WebApp "Returns error"
  Shop.WebApp -> Customer "Shows 'Sorry, only 2 items available. Would you like to proceed with 2?'"
}

// Edge case 3: Checkout during payment gateway outage
GatewayOutageCheckout = scenario "Checkout During Payment Outage" {
  Customer -> Shop.WebApp "Attempts checkout"
  Shop.WebApp -> Shop.API "Submits order"
  Shop.API -> PaymentGateway "Process payment"
  PaymentGateway -> Shop.API "Service unavailable: outage in progress"
  Shop.API -> Shop.WebApp "Returns error"
  Shop.WebApp -> Customer "Shows 'Payment service temporarily unavailable. Please try again in a few minutes. We've saved your cart.'"
}

Why model edge cases:

  • They expose gaps in your design
  • They become test cases automatically
  • They help teams discuss "what if" scenarios
  • They prevent surprises in production

I once launched a feature without modeling edge cases. Users immediately found scenarios we hadn't considered—checking out with gift cards during sales, checking out from different countries with different currencies, checking out with addresses that don't validate. We spent months fixing edge cases we could have caught upfront.

What to Remember

User journeys tell the story of how users interact with your system—from their perspective. When you create user journeys:

  • Focus on user's experience — Not just what works, but how it feels
  • Model both paths — Happy paths AND error paths
  • Use BDD style — "Given-When-Then" for clarity
  • Document edge cases — Unusual but important scenarios
  • Make them testable — Each scenario becomes a test case
  • Write helpful errors — Users should understand what went wrong

If you take away one thing, let it be this: user journeys are your best tool for ensuring your system actually works for real humans, not just in theory. They bridge the gap between requirements, testing, and documentation.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding.

Question 1

You're modeling a user registration flow for a social media app. Which scenario best represents a BDD-style "Given-When-Then" structure?

A)

RegistrationFlow = scenario "User Registration" {
  Customer -> WebApp "Registers"
  WebApp -> API "Saves user"
  API -> Database "Persists"
}

B)

RegistrationFlow = scenario "As a new user, I want to register so I can use the app" {
  // GIVEN: User is on registration page
  Customer -> WebApp "Views registration form"
  
  // WHEN: User submits valid registration
  Customer -> WebApp "Enters email and password"
  Customer -> WebApp "Clicks 'Sign Up'"
  WebApp -> API "Submits registration"
  
  // THEN: Account is created and user is logged in
  API -> Database "Creates user account"
  API -> WebApp "Returns success with session token"
  WebApp -> Customer "Shows 'Welcome! Redirecting to dashboard...'"
}

C)

RegistrationFlow = scenario "Registration Process" {
  Start -> Database "Create user"
  Database -> Email "Send welcome"
}

D)

RegistrationFlow = story "User Registration Story" {
  WebApp -> API "Register"
  API -> Database "Save"
}
Click to see the answer

Answer: B) BDD-style "Given-When-Then" scenario

Let's analyze each option:

A) Incorrect. This scenario is too abstract. It shows the technical steps (registers, saves, persists) but doesn't capture the user's perspective or experience. There's no "Given-When-Then" structure—it's just a sequence of technical actions. This reads more like a data flow than a user journey.

B) Correct! This is a perfect BDD-style user journey because:

  • Title: "As a new user, I want to register so I can use the app" follows the user story format
  • GIVEN: "User is on registration page" — sets up the starting state
  • WHEN: "User submits valid registration" — describes the user action
  • THEN: "Account is created and user is logged in" — describes the expected outcome
  • User's perspective: Shows what the user sees (form, button, welcome message, redirect)
  • Complete experience: From viewing the form to being logged in

This scenario serves multiple purposes:

  • Requirements: It's unambiguous what "register" means
  • Tests: It can be turned directly into a test case
  • Documentation: Anyone reading understands the user experience
  • Communication: Product managers, developers, and testers all agree on what "register" means

C) Incorrect. This is far too abstract. "Start → Database → Email" tells you nothing about the user. What does the user do? What do they see? What happens from their perspective? This looks more like a data flow (and a poor one at that) than a user journey. It's missing the most important part—the human user.

D) Incorrect. While this uses the story keyword (which is fine), it's far too simple to be useful. "Register → Save" doesn't tell you:

  • What the user actually does (enters email? clicks button?)
  • What the system shows (success page? error message?)
  • What the experience is (how long does it take? what do they see?)
  • What happens if something goes wrong

This scenario is so vague it provides no real value. A BDD scenario should be detailed enough that anyone—developer, tester, product manager—understands exactly what happens.

Key insight: BDD-style scenarios focus on the user's experience, not just technical steps. They follow a clear "Given-When-Then" structure that makes requirements unambiguous. A good scenario should be detailed enough that it can serve as both requirements documentation and a test case.


Question 2

You're modeling a login flow and want to document an error path. Which scenario best models a meaningful error handling experience?

A)

LoginError = scenario "Login Error" {
  User -> WebApp "Login with wrong password"
  WebApp -> API "Authenticate"
  API -> Database "Check password"
  Database -> API "Password doesn't match"
  API -> WebApp "Return error"
  WebApp -> User "Shows error"
}

B)

LoginError = scenario "Login Error" {
  User -> WebApp "Login with wrong password"
  WebApp -> API "Authenticate"
  API -> Database "Check password"
  Database -> API "Password doesn't match"
  API -> WebApp "Return error: AUTH_FAILED"
  WebApp -> User "Shows error: 'Authentication failed'"
}

C)

LoginError = scenario "Login with Incorrect Password" {
  User -> WebApp "Enters email: user@example.com and wrong password"
  User -> WebApp "Clicks 'Log In'"
  WebApp -> API "Submits login"
  API -> Database "Verifies credentials"
  Database -> API "Password doesn't match"
  API -> WebApp "Returns error: Invalid credentials"
  WebApp -> User "Shows 'Password incorrect. Please try again or reset your password if you've forgotten it.' with link to password reset"
}

D)

LoginError = scenario "Login Failed" {
  API -> Database "Check"
  Database -> API "No match"
  API -> WebApp "Error"
  WebApp -> User "Can't login"
}
Click to see the answer

Answer: C) Shows user action, system response, and helpful error message

Let's analyze each option:

A) Incorrect. While this scenario shows the error occurring, it's missing crucial details:

  • It shows the user action ("Login with wrong password") but doesn't show the specific form interaction (enters email? clicks button?)
  • It shows "Return error" but doesn't specify what the error is
  • It shows "Shows error" but doesn't tell you what error message the user sees
  • It doesn't help the user understand what to do next

The error message is cryptic—"Shows error" tells the user nothing. What kind of error? What should they do? Try again? Reset password? Contact support?

B) Incorrect. This is better than option A but still has problems:

  • The error code "AUTH_FAILED" is technical and cryptic to users
  • The error message "Authentication failed" is vague and unhelpful
  • It doesn't guide the user on what to do next
  • It doesn't offer alternatives (password reset, contact support)

Users seeing this error will be confused: "Authentication failed? What does that mean? Is my email wrong? My password? My account locked? Should I try again? Reset my password?"

C) Correct! This scenario models an excellent error handling experience because:

  • User action is specific: User enters email AND wrong password AND clicks "Log In" — shows the complete action
  • System response is clear: API returns "Invalid credentials" — specific, not generic
  • User experience is helpful: Error message is "Password incorrect. Please try again or reset your password if you've forgotten it." — tells the user what's wrong and what they can do about it
  • Provides alternatives: Links to password reset if they've forgotten it
  • User's perspective: Shows what the user actually sees and experiences

This error message follows best practices:

  • Specific: Tells you exactly what's wrong (password, not email or account)
  • Actionable: Tells you what you can do (try again or reset password)
  • Helpful: Provides a link to password reset
  • Human: Not "ERR_AUTH_403" or "Authentication failed"

D) Incorrect. This scenario is a mess:

  • It starts with "API" instead of showing the user action — this is from the system's perspective, not the user's
  • "Check" and "No match" are meaningless labels — they don't tell you what's actually happening
  • "Error" is vague — what kind of error?
  • "Can't login" is the user's experience, but it doesn't help them understand why or what to do

This scenario is too abstract to be useful. It doesn't capture the user's perspective, doesn't provide helpful information, and doesn't guide next steps.

Key insight: Error paths should be just as well-designed as happy paths. When modeling errors:

  • Be specific about what went wrong (not just "error occurred")
  • Write helpful error messages (not cryptic codes)
  • Guide the user on next steps (try again? reset password? contact support?)
  • Provide alternatives when appropriate (password reset link, different payment method, etc.)
  • Think from the user's perspective (what do they see? what do they understand? what do they do next?)

A well-designed error path turns a frustrating experience into a helpful one. I once saw support tickets drop by 70% just from making error messages more helpful and actionable.


What's Next?

Congratulations! You've completed Module 4: Flows. You now understand:

  • What flows are and how they differ from static relationships
  • How to create data flow diagrams that show lineage and transformations
  • How to model user journeys that capture the complete user experience
  • How to document both happy paths and error paths
  • How to use BDD-style scenarios for requirements and testing

You can now create diagrams that tell complete stories—stories about how data moves through your system, and stories about how users experience your system. You can model data lineage, transformations, bottlenecks, and user journeys.

In the next module, you'll learn about feedback loops—how systems regulate themselves through circular cause-and-effect relationships. You'll discover how positive feedback loops amplify change and negative feedback loops stabilize systems. You'll learn to recognize these patterns in real systems and understand their powerful effects.

See you there!


Module 4 Complete!

You've now mastered the art of modeling flows. Here's what you've learned:

Lesson 1: Understanding Flows

  • Flows show sequence and transformation, not just connections
  • Different types: data flows, user journeys, control flows, event flows
  • Use flows when order matters, when you need to see bottlenecks

Lesson 2: Data Flow Diagrams

  • Data flows show lineage: where data comes from and where it goes
  • Document transformations: how data changes shape at each step
  • Common patterns: ETL pipelines, event sourcing, real-time analytics, lambda architecture

Lesson 3: User Journeys

  • User journeys show the complete user experience from their perspective
  • Model both happy paths and error paths
  • Use BDD-style "Given-When-Then" for clarity
  • Document edge cases and unusual scenarios

You're ready to tackle more advanced concepts. Let's continue!

Module 5: Feedback Loops

Overview

In this module, you'll learn to model feedback loops: how actions create reactions that affect future actions. Feedback loops are natural patterns in systems, not errors.

Learning Objectives

By the end of this module, you'll be able to:

  • Understand different types of feedback loops
  • Model positive and negative feedback
  • Recognize when cycles are valid patterns
  • Design self-regulating and adaptive systems

Lessons

Prerequisites

Time Investment

Approximately 1-1.5 hours to complete all lessons and exercises.

What's Next

After completing this module, you'll continue to Module 6: Context.

The Loop That Changes Everything: Understanding Feedback Loops

Ever played a video game where the more you use a weapon, the more damage it does? That's a feedback loop—your actions create consequences that affect your future choices. Use the weapon more, hit harder, progress faster. Use it poorly, miss shots, struggle more.

Feedback loops are everywhere. In nature, in organizations, in software systems, and in everyday life. They're the mechanism through which systems learn, adapt, and regulate themselves.

In this lesson, you'll discover what feedback loops are, why they're crucial for systems thinking, and how to identify them in the architectures you build. You'll never look at a system the same way again.

Let's start by understanding what feedback loops actually are.

Learning Goals

By the end of this lesson, you'll be able to:

  • Understand what feedback loops are and how they differ from linear cause-and-effect
  • Recognize feedback loops in everyday systems (and why they matter)
  • Identify positive, negative, and balancing feedback loops in software architecture
  • Explain why feedback loops are natural, not errors
  • Model feedback loops that show self-regulation and adaptation

What Are Feedback Loops, Really?

At its simplest level, a feedback loop is a cycle where an action creates a reaction that influences future actions. Unlike a linear process where A causes B which causes C, feedback loops circle back—output becomes input for the next cycle.

Think of it like a conversation where you're talking and adjusting based on what the other person says:

You speak → They respond → You react to their response → They respond to your reaction → [loop continues]

The key insight: output from one cycle becomes input for the next. This is what makes feedback loops powerful—and potentially dangerous.

In linear processes, you have a straight line: start → finish.

In feedback loops, you have a circle: act → respond → adjust → act again.

This circular relationship is what enables self-regulation, learning, and adaptation. It's also what can cause runaway growth or system collapse if not properly understood.

Why Feedback Loops Matter (The Real Reasons)

I used to think of feedback loops as "circular dependencies" and avoid them at all costs. That was a huge mistake. In software systems, avoiding feedback loops means missing opportunities for self-improvement.

Let me share some real-world examples that changed my perspective.

1. Self-Regulation Without Explicit Logic

I once worked on a load balancer that needed to scale servers up and down based on traffic. The team implemented a simple rule: "if CPU > 80%, add a server." This created a feedback loop:

High CPU detected → Add server → CPU decreases → System monitors → [loop repeats]

We didn't design this as a feedback loop—it emerged from the rule. And it worked beautifully. The system self-regulated without explicit programming.

The lesson: Feedback loops don't always need to be designed. They can emerge from simple rules interacting with each other.

2. Learning Systems That Actually Learn

I consulted on a recommendation engine that started with terrible suggestions. Users rated things, and the system improved over time. What fascinated me was how the learning emerged from the feedback loop:

User watches video → System recommends similar video → User watches it → User rates it → System learns preferences → Next recommendation is better → [loop continues]

After a month, the recommendations were actually good. The system didn't have "good recommendations" baked in—it learned them through a feedback loop.

The lesson: Feedback loops enable systems to get smarter over time. Without them, you'd be stuck with static, hardcoded logic.

3. Error Recovery Through Retry Loops

I've built systems that failed 30% of the time due to network flakiness. Adding a retry loop transformed reliability:

Request fails → Wait 1s → Retry → Fails again → Wait 2s → Retry → Succeeds → [loop ends]

This simple feedback loop (failure → wait → retry) took a flaky system and made it reliable. We didn't fix the underlying network issues—we just built a system that could handle them.

The lesson: Feedback loops are one of the most powerful tools for building resilient systems.
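The failure → wait → retry pattern above can be sketched in a few lines of Python. This is an illustrative sketch, not Sruja syntax; the `flaky_call` operation and the delay values are made up for the example.

```python
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=1.0):
    """Run `operation`, retrying failures with exponentially growing waits."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise          # out of attempts: the loop ends in failure
            time.sleep(delay)  # wait 1s, then 2s, then 4s, ...
            delay *= 2         # exponential backoff

# Hypothetical flaky operation: fails twice, then succeeds
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("network flake")
    return "ok"

result = retry_with_backoff(flaky_call, base_delay=0.01)  # short delay for the demo
```

The doubling delay is what keeps the retry loop from becoming a vicious cycle: each failure reduces the pressure on the struggling service instead of increasing it.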

Everyday Examples of Feedback Loops

Feedback loops aren't just a software concept—they're everywhere in your daily life. Let me show you some examples that might feel familiar.

Example 1: The Thermostat in Your Home

Your home heating system has a feedback loop running 24/7:

Room temperature drops
    ↓
Thermostat detects: "It's 65°F, should be 70°F"
    ↓
Thermostat turns on heater
    ↓
Temperature starts rising
    ↓
Thermostat detects: "It's 70°F, turn off heater"
    ↓
Thermostat turns off heater
    ↓
Temperature starts falling again
    ↓
[Loop repeats every few minutes]

This is a classic balancing feedback loop—it keeps temperature within a desired range. The output (current temperature) becomes input for the next decision (whether to heat).
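The thermostat's balancing loop can be sketched as a tiny simulation. The temperature band and the drift rates here are arbitrary numbers chosen for illustration.

```python
def thermostat_step(temp, heating, low=68.0, high=72.0):
    """One cycle of the balancing loop: the output (current temperature)
    becomes the input that decides whether the heater runs next."""
    if temp < low:
        heating = True    # too cold: turn the heater on
    elif temp > high:
        heating = False   # warm enough: turn it off
    # Hypothetical drift: heating raises temp 1°F per cycle, otherwise it falls 0.5°F
    temp += 1.0 if heating else -0.5
    return temp, heating

temp, heating = 65.0, False
history = []
for _ in range(20):
    temp, heating = thermostat_step(temp, heating)
    history.append(temp)
```

Note the hysteresis: the heater switches at two thresholds (68°F and 72°F) rather than a single set point, so the system settles into a gentle oscillation around the target instead of flapping on and off every cycle.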

Example 2: Code Reviews at Work

Most software teams have a feedback loop around code quality:

Developer writes code → Submits for review → Reviewer provides feedback → Developer makes changes → Resubmits for review → Code quality improves → Future reviews have higher standards → [loop continues]

This is a learning feedback loop—the developer's skills improve over time through the feedback they receive. Each iteration raises the bar for what's acceptable.

I've been in code reviews where the feedback was unhelpful ("this is bad") rather than constructive ("consider breaking this into smaller functions"). The loop still existed, but it didn't create learning—just frustration.

The lesson: Feedback loops exist whether you design them intentionally or not. Make sure your feedback actually creates improvement.

Example 3: Social Media Algorithms

When you scroll through Instagram or TikTok, you're participating in a feedback loop:

You watch a video → You watch it completely → You like it → Algorithm shows you more similar content → You watch more → You like those → Algorithm learns your preferences → Next recommendations are even better → [loop reinforces]

This is a reinforcing feedback loop—it amplifies your behavior. The more you engage with certain types of content, the more you see it. This can create viral growth (a virtuous cycle) or filter bubbles (a vicious one).

I've seen people get stuck in these loops—endlessly watching the same type of content because the algorithm keeps serving it. The loop works, but it's not always healthy.

The lesson: Feedback loops can amplify behaviors, both good and bad. Design them carefully.

Feedback Loops in Software Architecture

Now let's translate these concepts into software architecture. Feedback loops are everywhere in the systems we build.

Example 1: Auto-Scaling Based on Load

Modern cloud applications automatically scale up and down based on traffic. This is a feedback loop:

// Self-regulating feedback loop
AutoScaling = scenario "Auto-Scaling Based on CPU" {
  // System monitors itself
  App.Api -> MonitoringService "Reports CPU usage: 85%"
  
  // CPU is high → scale up
  if cpu_high {
    MonitoringService -> AutoScaler "CPU high, request scale up"
    AutoScaler -> App.Api "Adds new instance"
  }
  }
  
  // After scaling, CPU decreases
  App.Api -> MonitoringService "Reports CPU usage: 45%"
  
  // CPU is now acceptable
  MonitoringService -> AutoScaler "CPU normal, can reduce instances"
  AutoScaler -> App.Api "Removes an instance"
  
  // System self-regulates to target CPU level
  App.Api -> MonitoringService "Reports CPU usage: 65%"
}

This feedback loop allows the system to maintain a target CPU usage (let's say 70%). It adds instances when load increases, removes them when load decreases. It self-regulates without human intervention.

I've worked with teams that manually scaled servers based on alerts. They'd get an email: "CPU high, add a server." Sometimes they'd remember, sometimes not. After implementing a feedback loop like this, the system just handled it itself.

The lesson: Feedback loops enable systems to self-regulate—adjusting automatically to maintain desired states.
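As a rough sketch, the measure → compare → adjust cycle of that loop looks like this in Python. The thresholds and the load model are hypothetical, chosen only to show the self-regulating behavior.

```python
def autoscale_step(total_load, instances, high=80.0, low=60.0):
    """One iteration of the self-regulating loop: measure, compare, adjust.
    Per-instance CPU is modeled as total_load / instances."""
    cpu = total_load / instances
    if cpu > high:
        instances += 1   # overloaded: scale up
    elif cpu < low and instances > 1:
        instances -= 1   # underused: scale down (never below 1)
    return instances

# Hypothetical steady load of 300 "CPU units" spread across the fleet
instances = 2
for _ in range(10):
    instances = autoscale_step(300.0, instances)
```

Starting from 2 instances, the loop scales up until per-instance CPU falls inside the 60–80% band, then holds steady: the deviation from the target is what drives the correction.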

Example 2: User Experience Improvement

Many systems improve user experience through feedback loops:

// Learning feedback loop
UserExperience = scenario "Feature Usage Feedback" {
  // User tries new feature
  User -> App.WebApp "Uses new feature"
  App.WebApp -> App.Api "Logs feature usage"
  App.Api -> Analytics "Analyzes user behavior"
  
  // System identifies issues
  Analytics -> ProductTeam "Feature has high abandonment rate"
  
  // Team makes improvements
  ProductTeam -> App.Api "Updates feature: clearer UI"
  
  // Next user has better experience
  User -> App.WebApp "Uses improved feature"
  App.WebApp -> App.Api "Logs feature usage"
  App.Api -> Analytics "Analyzes user behavior"
  Analytics -> ProductTeam "Feature abandonment decreased by 40%"
}

This is a learning and adaptation feedback loop. The system learns from how users interact, the team makes changes, and users have a better experience next time.

I once launched a feature without this feedback loop. We assumed users would love it. Six months later, we checked analytics and found 60% abandonment. We fixed issues, but the damage was done. We'd lost months of user trust.

The lesson: Always build feedback loops into your systems. Don't assume—measure, learn, adapt.

Example 3: Cache Hit Rate Optimization

Systems often optimize themselves through feedback loops:

// Resource optimization feedback loop
CacheOptimization = scenario "Cache Learning" {
  // System checks cache performance
  App.Api -> Cache "Request data"
  
  // Cache miss → query database and cache result
  if cache_miss {
    Cache -> Database "Query data"
    Database -> Cache "Store result"
  }
  
  // Cache hit → serve from cache (faster)
  if cache_hit {
    Cache -> App.Api "Return data from cache"
  }
  
  // System monitors cache hit rate
  App.Api -> Monitoring "Logs cache hits and misses"
  Monitoring -> CacheOptimizer "Calculates hit rate: 75%"
  
  // Low hit rate → optimize
  if hit_rate_low {
    Monitoring -> CacheOptimizer "Analyze access patterns"
    CacheOptimizer -> Cache "Adjust eviction policy"
  }
  
  // Next time, higher hit rate
  App.Api -> Cache "Request data"
  Cache -> App.Api "Return from cache (optimized)"
  App.Api -> Monitoring "Logs cache hits and misses"
  Monitoring -> CacheOptimizer "Hit rate now: 92%"
}

This feedback loop improves cache performance over time. The system monitors its own behavior (hit rate) and adjusts accordingly.

I've seen teams set cache policies once and never touch them again. But access patterns change over time. A policy that was good six months ago might be terrible today. Without a feedback loop, you'd never know.

The lesson: Systems need feedback loops to adapt to changing conditions. Static configurations become stale.
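The monitoring half of that loop reduces to a hit-rate calculation and a threshold check. A minimal sketch, with hypothetical counters and target:

```python
def hit_rate(hits, misses):
    """Observed cache hit rate, the signal the loop feeds back on."""
    total = hits + misses
    return hits / total if total else 0.0

def needs_tuning(hits, misses, target=0.90):
    """The monitoring half of the loop: flag the cache for a policy
    adjustment when the observed hit rate falls below the target."""
    return hit_rate(hits, misses) < target

# Hypothetical counters reported by the monitoring service
before = needs_tuning(75, 25)  # 75% hit rate: below target, tune the policy
after = needs_tuning(92, 8)    # 92% hit rate: leave the policy alone
```

The interesting part is not the arithmetic but the wiring: the cache's own behavior (hits and misses) feeds back into the decision that changes the cache's behavior.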

Types of Feedback Loops

After studying feedback loops in both nature and software, I've found there are really three main types you'll encounter. Understanding which type you're dealing with helps you model it correctly.

Type 1: Positive Feedback Loops (Reinforcing)

Output amplifies the input, leading to growth or collapse.

Characteristics:

  • Output reinforces input
  • Can create virtuous cycles (growth)
  • Can create vicious cycles (collapse)
  • Examples: viral growth, network effects, echo chambers

Virtuous Cycle Example:

// Viral growth loop
ViralGrowth = scenario "Social Network Effects" {
  UserA -> App.Share "Shares content"
  App.Share -> UserB "Sees content"
  UserB -> App.Share "Shares with friends"
  App.Share -> UserC "Sees content"
  
  // More users → More shares → Exponential growth
  Analytics -> ProductTeam "User growth: +500% this week"
}

This is a virtuous cycle—each share brings more users, which leads to more shares. The feedback loop amplifies growth.

Vicious Cycle Example:

// Performance collapse loop
PerformanceCollapse = scenario "Performance Degradation" {
  // System gets slow
  App.Api -> Database "Slow query (5s latency)"
  
  // Users retry due to slowness
  User -> App.WebApp "Refreshes page"
  App.WebApp -> App.Api "Make request (retry)"
  
  // More load makes it even slower
  App.Api -> Database "Even slower query (10s latency)"
  
  // More users retry
  User -> App.WebApp "Refreshes again"
  App.WebApp -> App.Api "More requests (retries)"
  
  // System collapses under load
  App.Api -> Database "Timeout (database overwhelmed)"
}

This is a vicious cycle—slowness causes retries, retries create more load, more load makes it slower, eventually everything fails. The feedback loop amplifies the problem.

Type 2: Negative Feedback Loops (Balancing)

Output counteracts the input, maintaining stability and equilibrium.

Characteristics:

  • Output opposes change
  • Creates homeostasis (maintains equilibrium)
  • Prevents runaway in either direction
  • Examples: thermostats, load balancing, rate limiting

Thermostat Example (Balancing):

// Thermostat maintains equilibrium
Thermostat = scenario "Temperature Regulation" {
  // Temperature drops
  Room -> Thermostat "Temperature: 65°F"
  
  // Too cold → turn on heat
  Thermostat -> Heater "Turn on heater"
  Heater -> Room "Heats room"
  Room -> Thermostat "Temperature: 70°F"
  
  // Target reached → turn off heat
  Thermostat -> Heater "Turn off heater"
  Heater -> Room "Off; room cools down"
  Room -> Thermostat "Temperature: 72°F"
  
  // Too warm → turn off (or cool more)
  Thermostat -> Heater "Keep off"
  Heater -> Room "Off; room keeps cooling"
  Room -> Thermostat "Temperature: 69°F"
  
  // Too cold → turn on again
  Thermostat -> Heater "Turn on heater"
  Heater -> Room "Heats room"
  Room -> Thermostat "Temperature: 70°F"
  
  // System oscillates around target (70°F)
}

The thermostat doesn't try to make the room hotter or colder—it tries to maintain a target temperature. If it gets too warm, it stops heating. If it gets too cold, it starts heating again. This negative feedback loop creates stability.

Load Balancing Example (Balancing):

// Load balancer distributes work
LoadBalancing = scenario "Request Distribution" {
  // Server A is overloaded
  ServerA -> LoadBalancer "High load: 95% CPU"
  
  // Load balancer sends new requests to other servers
  User -> LoadBalancer "Make request"
  LoadBalancer -> ServerB "Route to Server B (lighter load)"
  LoadBalancer -> ServerC "Route to Server C (lighter load)"
  
  // Servers A, B, C balance out
  ServerA -> LoadBalancer "Load: 70% CPU"
  ServerB -> LoadBalancer "Load: 65% CPU"
  ServerC -> LoadBalancer "Load: 68% CPU"
  
  // Next time, requests distributed evenly
  User -> LoadBalancer "Make request"
  LoadBalancer -> ServerA "Route to Server A"
  LoadBalancer -> ServerB "Route to Server B"
  LoadBalancer -> ServerC "Route to Server C"
}

The load balancer sends fewer requests to the overloaded server, allowing it to recover. Other servers take up the slack. This negative feedback loop creates balance across all servers.
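A least-loaded routing rule is enough to produce this balancing behavior. Here's a minimal sketch; the per-request cost and the CPU readings are invented for the example (a real balancer would also see loads decay as requests complete).

```python
def route(servers):
    """Balancing step: send the request to the least-loaded server, which
    nudges the load readings back toward each other."""
    target = min(servers, key=servers.get)
    servers[target] += 5  # hypothetical cost of one request, in CPU %
    return target

# Hypothetical CPU readings reported back to the balancer
servers = {"A": 95, "B": 65, "C": 68}
order = [route(servers) for _ in range(6)]
```

The overloaded server A receives no new requests, so the gap between the busiest and quietest server shrinks with every routed request: the negative feedback loop in one line (`min(servers, key=servers.get)`).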

Type 3: Delayed Feedback

Output affects input after a delay, which can cause oscillation.

Characteristics:

  • There's a time delay between action and response
  • Can cause system to overcorrect
  • Often seen in monitoring and alerting systems
  • Examples: inventory systems, cache invalidation

Inventory Example (Delayed):

// Inventory management with delayed feedback
InventoryManagement = scenario "Inventory Replenishment" {
  // Item sells, stock decreases
  OrderService -> Inventory "Decrement stock: 5 units"
  Inventory -> OrderService "Stock: 15 units"
  
  // Low stock alert
  Inventory -> Alerting "Stock low: 15 units, threshold: 20"
  Alerting -> WarehouseManager "Send restock alert"
  
  // Manager sees alert, orders more stock
  WarehouseManager -> Suppliers "Order 50 more units"
  
  // System waits... (this is the delay)
  
  // During the delay, stock keeps dropping and the alert fires again,
  // so a duplicate order is placed
  OrderService -> Inventory "Decrement stock: 10 units"
  Inventory -> OrderService "Stock: 5 units"
  Inventory -> Alerting "Stock still low: 5 units, threshold: 20"
  Alerting -> WarehouseManager "Send restock alert (again)"
  WarehouseManager -> Suppliers "Order 50 more units (duplicate)"
  
  // The first shipment arrives
  Suppliers -> Warehouse "Ship 50 units"
  Warehouse -> Inventory "Add stock: +50 units"
  Inventory -> OrderService "Stock: 55 units"
  
  // The duplicate shipment arrives on top of it
  Suppliers -> Warehouse "Ship 50 units"
  Warehouse -> Inventory "Add stock: +50 units"
  Inventory -> OrderService "Stock: 105 units"
  
  // System overcorrected—now have too much stock
  Inventory -> Alerting "Stock now: 105 units (oversupply!)"
}

There's a delay between ordering stock and receiving it. During that delay, stock kept dropping, the low-stock alert fired again, and a duplicate order went out. When both shipments arrived, the result was an oversupply.

I've seen this exact pattern in retail systems. The solution is to either:

  • Reduce the delay (automated restocking)
  • Increase the trigger threshold (order sooner when stock is low)
  • Account for expected sales during the delay period

The lesson: Delayed feedback can cause oscillation and overcorrection. Account for delays in your feedback loops.
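The third mitigation, accounting for expected sales during the delay, is the classic reorder-point calculation. A sketch with hypothetical numbers:

```python
def reorder_point(daily_sales, lead_time_days, safety_stock):
    """Order when stock falls to the amount you expect to sell while
    waiting for the shipment, plus a safety buffer."""
    return daily_sales * lead_time_days + safety_stock

def should_reorder(stock, daily_sales, lead_time_days, safety_stock=10):
    return stock <= reorder_point(daily_sales, lead_time_days, safety_stock)

# Hypothetical: 5 units/day, 7-day lead time, 10 units of safety stock
# → reorder at 45 units instead of waiting for a fixed low-stock threshold
```

With sales of 5 units/day and a 7-day lead time, you reorder at 45 units rather than 20, so the shipment arrives roughly as the safety stock is reached and the duplicate-order oscillation never starts.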

Feedback Loops vs. Circular Dependencies

This is one of the most important distinctions in systems thinking. Let me explain clearly.

Circular Dependency (The "Bad" Kind)

Module A depends on Module B
Module B depends on Module A

This is a circular dependency—a compile-time, structural issue. It's bad because:

  • You can't initialize either module (chicken and egg problem)
  • It creates tight coupling
  • There's no clear purpose—just mutual dependence
  • Traditional software engineering says "avoid this"

Feedback Loop (The "Good" Kind)

System performs an action
System receives feedback about the action
System adjusts future actions based on feedback

This is a feedback loop—a runtime, behavioral pattern. It's good because:

  • It enables learning and adaptation
  • It has a clear purpose (self-regulation, improvement)
  • Coupling is loose (through feedback, not direct dependency)
  • Systems thinking embraces this

The Key Difference:

AspectCircular DependencyFeedback Loop
When it happensCompile-time (static)Runtime (dynamic)
PurposeNone (accidental)Intentional (regulation, learning)
CouplingTight (direct reference)Loose (through feedback)
In systems thinkingAvoidEmbrace
ExampleModule A → Module B → Module AUser → System → User → System

I spent years avoiding any circular structures in my architectures. Then I learned about feedback loops in systems thinking and realized I'd been throwing away a powerful tool. Now I embrace feedback loops while avoiding circular dependencies. The distinction is crucial.

Why Feedback Loops Are Natural, Not Errors

Here's something I've learned: In traditional software engineering, we're trained to avoid circular dependencies. But in systems thinking, feedback loops aren't errors—they're natural, desirable patterns.

Think about nature:

  • Your body temperature regulation is a feedback loop
  • Predator-prey populations are regulated by feedback loops
  • Ecosystems self-regulate through feedback loops

These aren't "errors" in nature—they're essential for life.

In software:

  • Auto-scaling is a feedback loop (not a bug)
  • Machine learning is built on feedback loops (not anti-pattern)
  • Load balancing uses feedback loops (not architectural flaw)

The mindset shift:

  • Traditional engineering: "Cycles are bad, avoid them"
  • Systems thinking: "Feedback loops are natural, embrace them"

I've seen teams reject feedback loop architectures because "cycles are bad." They're confusing two different concepts. Don't make that mistake.

Feedback loops are about behavior over time, not circular dependencies in code. They enable systems to:

  • Self-regulate (maintain homeostasis)
  • Learn (improve over time)
  • Adapt (respond to changes)
  • Recover (handle failures gracefully)

Embrace feedback loops as a powerful tool in your systems thinking toolkit.

What to Remember

Feedback loops are cycles where output becomes input for the next cycle. When you're modeling or analyzing systems:

  • Look for the loop—identify where output feeds back into input
  • Understand the purpose—is it self-regulation, learning, amplification?
  • Check the type—positive (reinforcing), negative (balancing), or delayed?
  • Design intentionally—what controls does the loop have? What prevents runaway?
  • Embrace them—feedback loops are natural and powerful, not errors to avoid

If you take away one thing, let it be this: feedback loops are everywhere, and understanding them is key to understanding how systems behave over time. Static diagrams show structure. Feedback loops show behavior.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding.

Question 1

You're reviewing an e-commerce system and notice this behavior:

"When a product goes out of stock, users can still see it but can't add it to cart. When stock is replenished, users who previously couldn't purchase are notified. After restocking, the product sells faster than usual for a few days, then sales normalize."

Which type of feedback loop best describes this behavior?

A) Positive feedback loop (reinforcing)
B) Negative feedback loop (balancing)
C) Delayed feedback loop (oscillation)
D) Not a feedback loop

Click to see the answer

Answer: A) Positive feedback loop (reinforcing)

Let's analyze this scenario:

  1. Product goes out of stock → System hides add-to-cart button
  2. Users who wanted the product wait → System remembers their interest
  3. Stock is replenished → System notifies waiting users
  4. Waiting users purchase immediately → Sales spike
  5. Sales spike suggests the product is popular → System might order more next time

This is a reinforcing feedback loop because the output (sales data) amplifies the input (stocking decisions):

  • High sales after restocking → System learns product is popular → Orders more stock next time → More sales
  • The loop reinforces the behavior: stock the product, it sells well, stock more.

Why other options are wrong:

B) Incorrect. A negative (balancing) feedback loop would counteract changes to maintain stability. For example, if a product sells too much, the system would reduce ordering. But here, the system is increasing ordering after high sales—amplifying, not counteracting.

C) Incorrect. A delayed feedback loop involves a time delay that causes oscillation or overcorrection. There's no oscillation here—the behavior is straightforward: stock runs out, restock, sell out again, restock again. There's no overcorrection (ordering too much) or oscillation (alternating between too much and too little stock).

D) Incorrect. This absolutely is a feedback loop! The system's stocking decisions are influenced by past sales data, and sales influence future stocking decisions. Output becomes input for the next cycle.

Key insight: This is a common pattern in inventory systems. It's why you often see products go out of stock, then when they're restocked, they sell out again even faster—the system has learned the product is popular and amplifies that knowledge. This isn't bad design—it's a natural reinforcing feedback loop.


Question 2

You're designing an auto-scaling system and want to prevent a vicious cycle where the system keeps adding instances unnecessarily. Which control mechanism would you implement?

"The system monitors CPU usage. If CPU goes above 80%, it adds a server instance. When CPU drops below 60%, it removes an instance. Without controls, a sudden spike in traffic could cause: high CPU → add instances → load redistributes → some instances now idle, but overall CPU still high → add more instances → load redistributes again → more idle instances → [system keeps adding instances]"

A) Minimum instance limit (can't go below 2 instances)
B) Maximum instance limit (can't go above 10 instances)
C) Cooldown period (must wait 5 minutes after a scaling action before scaling again)
D) Scale incrementally (add 1 instance at a time, not 5 at once)

Click to see the answer

Answer: C) Cooldown period (must wait 5 minutes after a scaling action before scaling again)

Let's analyze each option:

A) Incorrect. A minimum instance limit (can't go below 2) prevents the system from removing too many instances, but it doesn't prevent it from adding too many. The vicious cycle described is about the system adding instances unnecessarily, not removing them. This control addresses the wrong side of the problem.

B) Incorrect. A maximum instance limit (can't go above 10) prevents unbounded growth, but it doesn't address the core issue: the system keeps adding instances because the new instances aren't helping (they stay idle while CPU stays high). The limit would stop the cycle at 10 instances, but doesn't solve the underlying problem of inefficient scaling decisions.

C) Correct! A cooldown period is exactly what's needed here. Here's why:

The vicious cycle happens because:

  1. System scales up when CPU is high
  2. New instances are added
  3. Load redistributes, but CPU stays high (why? Maybe the new instances aren't helping, or there's a warmup period)
  4. System sees CPU still high and scales up again
  5. This happens repeatedly, adding more and more instances

A cooldown period prevents this by saying: "After you scale, you must wait 5 minutes before you can scale again." This gives the system time to:

  • Allow new instances to warm up and start contributing effectively
  • Allow the load redistribution to stabilize
  • Prevent the system from reacting to transient CPU spikes with rapid up-and-down scaling

The cooldown breaks the cycle by adding a delay between successive scaling actions.

D) Incorrect. Scaling incrementally (adding 1 instance at a time instead of 5 at once) reduces the size of each jump, but it doesn't prevent the vicious cycle. The system could still add 1 instance, see CPU is still high, add another, then another, slowly but still unnecessarily adding instances. Unlike a cooldown, it has no mechanism to say "we just scaled, let's pause and see whether that helped."

Key insight: Feedback loops can create vicious cycles (runaway growth) if not properly controlled. The solution isn't always to limit the bounds (min/max instances)—sometimes you need to add a delay or a hysteresis to give the system time to stabilize. Cooldown periods, rate limiting, and circuit breakers are common controls for managing feedback loops.
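A cooldown reduces to a timestamp check in front of every scaling decision. A minimal sketch (the 5-minute window and the injected clock are assumptions for the example):

```python
class CooldownScaler:
    """Refuses a new scaling action until `cooldown` seconds after the last one."""
    def __init__(self, cooldown=300.0):
        self.cooldown = cooldown
        self.last_action = None

    def try_scale(self, now):
        """Return True if a scaling action is allowed at time `now` (in seconds)."""
        if self.last_action is not None and now - self.last_action < self.cooldown:
            return False  # still cooling down: let the last change stabilize
        self.last_action = now
        return True

scaler = CooldownScaler(cooldown=300.0)
decisions = [scaler.try_scale(t) for t in (0, 60, 120, 301, 400, 601)]
```

The requests at t = 60s and t = 120s are rejected because the action at t = 0 is still cooling down; only once 300 seconds have passed is the next action allowed. That pause is what breaks the "CPU still high → scale again" spiral.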


What's Next?

Now you understand what feedback loops are and why they matter. You know they're natural, not errors, and you can identify them in systems. You've seen examples from thermostats to social media algorithms to auto-scaling systems.

But we've only talked about what feedback loops are and why they matter. We haven't talked about how to classify different types of feedback loops or how to model them explicitly in your architectures.

In the next lesson, you'll learn about types of feedback loops—positive (reinforcing), negative (balancing), and delayed (oscillating). You'll discover when to use each type, how to identify virtuous vs. vicious cycles, and what controls prevent runaway behavior.

See you there!

Amplifying or Dampening: Types of Feedback Loops

Ever heard a microphone squeal? The microphone picks up sound from the speaker, the speaker amplifies it, the microphone picks that up again, and the volume climbs until someone turns it down. That's a feedback loop: the output (the speaker's sound) becomes the input (what the microphone hears).

Now imagine nobody could turn the volume down. Every cycle would get louder and louder. That's a runaway loop, and it would eventually cause damage.

Feedback loops in software systems work the same way. They can amplify growth (good) or cause collapse (bad) or maintain stability (necessary). In this lesson, you'll learn to recognize and classify different types of feedback loops.

Let's start by understanding how to categorize feedback loops.

Learning Goals

By the end of this lesson, you'll be able to:

  • Understand positive (reinforcing) and negative (balancing) feedback loops
  • Recognize when each type is appropriate
  • Identify vicious (runaway) vs. virtuous (growth) cycles
  • Understand delayed feedback and oscillation
  • Design feedback loops with appropriate controls

Feedback Loop Classification

Feedback loops fall into three main categories based on what they do to your system:

Feedback Loops
├── Positive (Reinforcing)
│   ├── Virtuous Cycle (Growth)
│   └── Vicious Cycle (Collapse)
├── Negative (Balancing)
│   ├── Self-Regulating (Homeostasis)
│   ├── Error-Correcting (Resilience)
│   └── Stabilizing (Control)
└── Delayed (Time-Based)
    └── Oscillation (Variability)

Understanding which type you're dealing with is crucial because the design approach is completely different for each.

Positive Feedback Loops (Reinforcing)

These loops amplify change, leading to exponential growth or collapse. They're powerful when you want growth, dangerous when you don't have controls.

Pattern 1: Virtuous Cycle (Growth)

Output amplifies input, creating runaway growth:

// Example: Viral content recommendation
ViralGrowth = scenario "Viral Growth Loop" {
  User -> Platform "Shares content"
  Platform -> UserB "Shows content"
  UserB -> Platform "Shares content"
  
  // More users see content, more shares happen
  Platform -> UserC "Shows content"
  UserC -> Platform "Shares content"
  // Exponential growth
}

Characteristics:

  • Output reinforces input
  • Creates exponential growth
  • Can lead to runaway success
  • Examples: Viral sharing, network effects, learning algorithms

When to use:

  • When you want rapid growth and virality
  • When amplifying desirable behaviors
  • In recommendation systems, social platforms

Risks:

  • Unchecked growth can overwhelm system
  • Can create echo chambers
  • May amplify undesirable behaviors (fake news, spam)

Real-world example: I once worked on a social media algorithm that showed viral content more aggressively. Engagement spiked for a few weeks, then crashed because users got sick of seeing the same type of content. The positive feedback loop had run its course. We needed to dampen it—show diverse content even if it meant lower short-term engagement.

Pattern 2: Vicious Cycle (Collapse)

Output amplifies the problem, leading to runaway collapse:

// Example: Performance degradation
PerformanceCollapse = scenario "Performance Vicious Cycle" {
  User -> WebApp "Submits request"
  WebApp -> API "Processes request"
  API -> Cache "Check cache"
  Cache -> API "Cache miss"
  
  // System falls back to the database
  API -> Database "Slow query (1000ms)"
  Database -> API "Slow response"
  API -> WebApp "Slow response"
  WebApp -> User "Shows 'loading...'"
  
  // User refreshes, creates more load
  User -> WebApp "Submit again"
  // Loop spirals downward
}

Characteristics:

  • Output worsens the problem
  • Creates death spiral (collapse)
  • Can be irreversible if not caught early
  • Examples: System overload, cache stampedes, resource exhaustion

Risks:

  • System collapse
  • User abandonment
  • Cascading failures
  • Complete breakdown

Real-world example: I've seen systems collapse from vicious feedback loops. One example: An e-commerce site had slow checkout times during peak hours. The system would show "processing" for 30 seconds, but customers would refresh the page thinking it failed. They'd submit again, creating more load, making things even slower. More customers refreshed, more load, slower times... The feedback loop (customers refreshing on slowness) created a death spiral that eventually took down the entire site.

Negative Feedback Loops (Balancing)

These loops counteract change, maintaining stability and preventing extremes. They're crucial for building resilient systems.

Pattern 1: Self-Regulating (Homeostasis)

System maintains a target state by adjusting based on output:

// Example: Auto-scaling
AutoScaling = scenario "Self-Regulating System" {
  App -> Monitoring "Reports CPU: 90%"
  Monitoring -> AutoScaler "CPU above target (70%)"
  AutoScaler -> App "Add instances"
  
  // CPU decreases
  App -> Monitoring "Reports CPU: 50%"
  Monitoring -> AutoScaler "CPU below target (70%)"
  AutoScaler -> App "Remove instances"
}

Characteristics:

  • Output opposes deviation from target
  • Maintains homeostasis
  • Responds to disturbances
  • Examples: Thermostats, auto-scaling, load balancing

When to use:

  • When you want to maintain a target state (CPU usage, inventory levels)
  • When you need to absorb shocks and disturbances
  • When stability is more important than optimal performance

Real-world example: The thermostat example from Lesson 1 is a perfect self-regulating loop. The heater turns on when temperature drops, turns off when temperature rises. It doesn't care about the exact temperature—it just wants to maintain a comfortable range. This creates a stable, predictable environment for the people in the room.

In software, self-regulating loops are everywhere. Auto-scaling systems maintain a target CPU usage. Inventory systems maintain target stock levels. Rate limiters maintain target request rates. They're the unsung heroes of system design.
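A rough sketch of how such a loop converges, in ordinary Python rather than Sruja. It assumes, unrealistically, that total work is fixed so per-instance CPU is simply work divided by instance count; the numbers and the ±5% deadband are illustrative.

```python
# Toy self-regulating loop: adjust instance count until CPU nears the target.
def autoscale(work=700.0, instances=4, target_cpu=70.0, steps=20):
    """Return (instances, per-instance CPU) after the loop settles."""
    for _ in range(steps):
        cpu = work / instances
        if cpu > target_cpu + 5:
            instances += 1      # scale up when clearly over target
        elif cpu < target_cpu - 5 and instances > 1:
            instances -= 1      # scale down when clearly under target
    return instances, work / instances

instances, cpu = autoscale()  # settles at 10 instances, 70.0% CPU
```

Note the deadband (target ± 5): without it, the loop could flip-flop between two instance counts that straddle the target instead of settling.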

Pattern 2: Error-Correcting (Resilience)

System detects errors and automatically corrects:

// Example: Retry with backoff
RetryWithBackoff = scenario "Error-Correcting Loop" {
  User -> App "Submits order"
  App -> Service "Process payment"
  
  // First attempt fails
  Service -> App "Payment failed: timeout"
  App -> Service "Retry after 1s (backoff)"
  Service -> App "Payment failed: timeout"
  App -> Service "Retry after 2s (backoff)"
  Service -> App "Payment succeeded"
  
  // System recovers
  App -> User "Order confirmed"
}

Characteristics:

  • Detects errors and corrects automatically
  • Uses backoff to avoid overwhelming failing system
  • Increases resilience without manual intervention
  • Examples: Retries with backoff, circuit breakers, failover systems

When to use:

  • When external services are unreliable
  • When transient failures are common
  • When you need to increase resilience
  • When you want to reduce operational overhead (fewer support tickets)

Real-world example: I've worked on systems that didn't have proper error handling. A payment gateway would timeout occasionally, and the app would retry immediately three times with no delay between attempts. This would overwhelm the gateway, making failures worse. We added exponential backoff (wait 1s, then 2s, then 4s) and the system became much more reliable. The error-correcting loop turned a flaky dependency into a resilient one.
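The 1s → 2s → 4s schedule from that story can be sketched in a few lines. This is a minimal sketch, not a specific library's API; `base_delay`, `cap`, and the "full jitter" option are illustrative choices (jitter spreads out retries so many clients don't hammer the service in lockstep).

```python
import random

# Exponential backoff: delay doubles per attempt, capped, optionally jittered.
def backoff_delays(attempts, base_delay=1.0, cap=30.0, jitter=False):
    """Return the delay before each retry: base * 2^n, capped at `cap`."""
    delays = []
    for n in range(attempts):
        d = min(base_delay * (2 ** n), cap)
        if jitter:
            d = random.uniform(0, d)  # "full jitter" spreads retries out
        delays.append(d)
    return delays

print(backoff_delays(3))  # deterministic schedule: [1.0, 2.0, 4.0]
```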

Pattern 3: Stabilizing (Control)

System actively manages resources to prevent oscillation:

// Example: Rate limiting with hysteresis
RateLimiting = scenario "Stabilizing Control" {
  User -> API "Send request"
  API -> RateLimiter "Check limit"
  
  // Under limit: allow
  if under_limit {
    RateLimiter -> API "Allow"
    API -> Service "Process"
  }
  
  // Over limit: throttle
  if over_limit {
    RateLimiter -> API "Throttle (429: Too Many Requests)"
    API -> User "Try again later"
  }
  
  // Dynamic adjustment based on system load
  RateLimiter -> LoadMonitor "Report current rate: 1000 req/s"
  LoadMonitor -> RateLimiter "Adjust limit: 800 req/s"
}

Characteristics:

  • Actively prevents extreme behavior in either direction
  • Maintains stability through dynamic adjustments
  • Can add hysteresis (limits change based on history)
  • Examples: Rate limiting, resource pools, admission control

When to use:

  • When you need to protect against abuse or overload
  • When you have limited resources (database connections, API quotas)
  • When you want to maintain service quality for all users
  • When fairness matters (don't let heavy users dominate)

Real-world example: I once saw a rate limiting system cause more harm than good. It was configured with a hard limit, and when users hit it, they'd get throttled and immediately retry, creating a burst of traffic that was worse than just letting them through. We added hysteresis—the limit would decrease temporarily after being hit, then slowly recover. This smoothed out traffic and actually improved throughput for everyone while protecting the system.
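The hysteresis from that story can be sketched as follows. The class name, thresholds, penalty, and recovery rate are all illustrative, not a real library: when the observed rate hits the limit, the limit drops sharply; during quiet intervals it recovers gradually toward the normal ceiling, smoothing out retry bursts.

```python
# Sketch of an adaptive rate limit with hysteresis: hit it and it shrinks,
# behave and it slowly recovers. All numbers are illustrative.
class AdaptiveLimiter:
    def __init__(self, normal_limit=1000, floor=200, penalty=0.5, recovery=50):
        self.normal_limit = normal_limit
        self.floor = floor
        self.penalty = penalty      # multiplicative drop when the limit is hit
        self.recovery = recovery    # additive recovery per quiet interval
        self.limit = normal_limit

    def observe(self, rate):
        """Adjust the limit based on the observed request rate; return it."""
        if rate >= self.limit:
            # Back off hard so the retry burst that follows finds a lower limit.
            self.limit = max(self.floor, int(self.limit * self.penalty))
        else:
            # Recover slowly toward the normal ceiling.
            self.limit = min(self.normal_limit, self.limit + self.recovery)
        return self.limit

limiter = AdaptiveLimiter()
```

The asymmetry is the point: dropping fast and recovering slowly is what prevents the limit itself from oscillating in step with the traffic.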

Delayed Feedback (Time-Based)

Feedback that occurs after a delay can cause oscillation or overcorrection:

Pattern: Oscillation

System alternates between states due to delayed feedback:

// Example: Temperature control with delay
DelayedFeedback = scenario "Oscillating Control" {
  Thermostat -> Heater "Turn on (too cold)"
  Heater -> Room "Heats room"
  Room -> Thermostat "Temperature reading (5 min delay)"
  
  // By the time feedback arrives, room is already too warm
  Thermostat -> Heater "Turn off (too hot)"
  Heater -> Room "Cools room"
  Room -> Thermostat "Temperature reading (5 min delay)"
  
  // By the time feedback arrives, room is too cold again
  Thermostat -> Heater "Turn on (too cold)"
  // Oscillates between too hot and too cold
}

Characteristics:

  • Delay between action and feedback causes system to overcorrect
  • Can create oscillation (too hot, too cold, too hot, too cold)
  • System never settles into stable state
  • Examples: Temperature control, stock trading, caching with invalidation

When to watch for it:

  • When you need to identify and eliminate delayed feedback loops
  • When you have slow sensors or reporting systems
  • When you want to stabilize oscillating systems
  • When you need to add damping or predictive adjustments

Real-world example: I once worked on a caching system that cached data for 30 minutes while the data source updated every 10 minutes. Applications would see fresh data for a few minutes after each cache refresh, then increasingly stale data until the next refresh. Users complained: "Why is the data jumping around? Is it broken?" We fixed it by aligning the cache with the source—invalidating entries whenever the source published an update. Eliminating the timing mismatch eliminated the oscillation.
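A toy simulation of the delayed-feedback pattern makes the oscillation concrete: a thermostat acting on a reading that is a few steps stale, as in the scenario above. All numbers are illustrative.

```python
# Toy thermostat acting on a delayed temperature reading. The stale reading
# makes it overshoot past the setpoint in both directions (illustrative).
def simulate_thermostat(delay=3, setpoint=20.0, steps=40):
    """Return the room temperature at each step; heating/cooling is +/- 1 degree."""
    temp = 15.0
    history = []          # past readings; the controller sees a stale one
    temps = []
    for _ in range(steps):
        history.append(temp)
        # The controller only sees the reading from `delay` steps ago.
        observed = history[max(len(history) - 1 - delay, 0)]
        heater_on = observed < setpoint
        temp += 1.0 if heater_on else -1.0
        temps.append(temp)
    return temps

temps = simulate_thermostat()
```

Running the same loop with `delay=0` settles into a tight band around the setpoint, which is one way to see that the delay itself, not the controller, causes the oscillation.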

Comparing Feedback Loop Types

Different feedback loops serve different purposes:

| Type | Effect | Stability | Example | Use When |
|------|--------|-----------|---------|----------|
| Positive (Virtuous) | Amplifies | Decreases (growth) | Viral sharing, learning algorithms | You want rapid growth |
| Positive (Vicious) | Amplifies | Decreases (collapse) | System overload, resource exhaustion | NEVER (avoid at all costs) |
| Negative (Self-Regulating) | Opposes | Maintains | Auto-scaling, thermostats | You need stability |
| Negative (Error-Correcting) | Corrects | Increases (resilience) | Retries, circuit breakers | You need reliability |
| Negative (Stabilizing) | Constrains | Maintains | Rate limiting, hysteresis | You need control |
| Delayed (Oscillating) | Overcorrects | Decreases (oscillation) | Slow sensors, stale data | You need synchronization |

Key insight: Choose the right type of feedback loop for your goal. Want growth? Use positive loops (with controls). Need stability? Use negative loops. Want resilience? Use error-correcting loops.

Designing Feedback Loops

Step 1: Identify the Feedback

What creates the feedback? Who acts? Who adjusts?

// Example: Auto-scaling feedback
AutoScalingFeedback = flow "Auto-Scaling Feedback Path" {
  // Who creates feedback?
  App -> Monitoring "Reports CPU usage"
  
  // Who acts on it?
  Monitoring -> AutoScaler "Triggers scaling decision"
  AutoScaler -> App "Adds/removes instances"
}

Ask yourself:

  • What's the signal? (CPU usage, latency, error rate)
  • Who measures it? (monitoring system)
  • Who makes decisions? (auto-scaler, operations team)
  • What's the adjustment? (add/remove instances)

Step 2: Determine the Type

Based on the effect, classify the loop:

// Classifying feedback loops
ViralGrowth = scenario "Viral Growth" {
  type "positive"
  subtype "virtuous"
  effect "amplifying"
  stability "decreases"
}

SystemOverload = scenario "System Overload" {
  type "positive"
  subtype "vicious"
  effect "amplifying"
  stability "decreases"
  risk "critical"
}

AutoScaling = scenario "Auto-Scaling" {
  type "negative"
  subtype "self-regulating"
  effect "opposes"
  stability "maintains"
}

Step 3: Add Controls

Prevent unwanted behavior and add safety limits:

// Adding controls to positive feedback loop
ControlledGrowth = scenario "Controlled Viral Growth" {
  User -> Platform "Shares content"
  
  // Content moderation
  Platform -> ModerationService "Check for policy violations"
  if violates_policy {
    Platform -> ModerationService "Reject content"
    Platform -> User "Content rejected: violates policy"
  } else {
    Platform -> UserB "Show content"
    UserB -> Platform "Shares content"
  }
  
  // Rate limiting
  Platform -> RateLimiter "Check rate limit"
  if over_rate_limit {
    Platform -> RateLimiter "Throttle shares"
  }
}

Step 4: Monitor and Observe

Track the behavior of your feedback loop over time:

// Monitoring feedback loop behavior
FeedbackMonitoring = scenario "Feedback Loop Monitoring" {
  // Metrics to track
  GrowthRate = metric "User growth rate (new users/day)"
  StabilityScore = metric "System stability (99.9% uptime)"
  FeedbackRatio = metric "Positive:negative feedback ratio"
  
  // Alert on issues
  if GrowthRate > threshold {
    AlertSystem -> OpsTeam "High growth rate detected"
  }
  
  if StabilityScore < threshold {
    AlertSystem -> OpsTeam "Stability degraded"
  }
  
  // Track over time
  GrowthRate -> Dashboard "Plot growth over time"
}

Key indicators to monitor:

  • Growth rate (is it sustainable?)
  • Stability metrics (uptime, error rate)
  • Feedback distribution (positive vs. negative)
  • Resource utilization (are you approaching limits?)

Pitfalls to Avoid

Mistake 1: Treating All Cycles as Bad

I've seen teams adopt a "no cycles" mentality and avoid feedback loops entirely. This is throwing away a powerful tool.

// Bad: Avoiding feedback loops entirely
NoFeedbackSystem = system "Static System" {
  // No feedback loops
  // No adaptation
  // No learning
}

Why this is wrong:

  • You lose the ability to self-regulate
  • You lose opportunities for learning and improvement
  • Your system can't adapt to changing conditions
  • You're missing a key tool in systems thinking

The right approach: Use feedback loops intentionally. Design them with clear purposes and appropriate controls. Embrace them where they add value. Avoid them where they cause harm.

Mistake 2: Confusing Types

I've seen teams misuse positive feedback loops for situations that need negative (balancing) loops:

// Bad: Using positive feedback for stability
PositiveForStability = scenario "Misclassified Feedback" {
  // Trying to grow load when you should stabilize
  User -> App "Submit request"
  App -> Server "Process"
  
  // High load detected
  Server -> LoadBalancer "Add more instances"
  LoadBalancer -> Server "Add more instances"
  
  // This amplifies load instead of balancing it
}

Why this is wrong:

  • When you need stability (high load), adding more instances is the worst thing to do
  • You should use a negative (balancing) loop instead: throttle requests, queue them, or reject some

The right approach: Match the feedback loop type to the situation. Need stability? Use negative loops. Need growth? Use positive loops (with controls). Need resilience? Use error-correcting loops.

Mistake 3: Ignoring Delayed Feedback

I've seen teams ignore delayed feedback because it's "not real-time," only to have it cause major problems:

// Bad: Ignoring delayed feedback causes oscillation
IgnoreDelayed = scenario "Ignoring Delayed Feedback" {
  User -> App "Submits data"
  App -> Service "Process"
  Service -> Database "Persist"
  
  // Another system updates same data later
  ExternalSystem -> Database "Update (5 min delay)"
  Database -> App "Show updated data"
  User -> App "Refreshes page" [sees old data]
  
  // User refreshes again, thinking data is stale
  User -> App "Refreshes page" [sees old data again]
  // User gets frustrated: "Why is this data jumping around?"
}

Why this is wrong:

  • The delay in feedback causes the system to overcorrect
  • Users see stale data, refresh, see stale data again
  • System never stabilizes
  • Users get frustrated and think the system is broken

The right approach: Synchronize the timing of your feedback loops. Ensure that when one system updates data, consuming systems see the update before they request it again. Or display when the data was last updated, so users don't expect immediate freshness.

What to Remember

Feedback loops are one of the most powerful concepts in systems thinking. When you design them:

  • Identify the type — Is it positive (amplifying) or negative (balancing)?
  • Choose the right tool — Virtuous for growth, stabilizing for control, error-correcting for resilience
  • Add controls — Prevent runaway behavior in positive loops
  • Monitor behavior — Track how the loop performs over time
  • Use appropriately — Different situations require different types

If you take away one thing, let it be this: feedback loops are the engine of adaptation and learning in systems. Positive loops accelerate growth. Negative loops maintain stability. Error-correcting loops build resilience. Understanding which type you're dealing with—and how to control it—is the difference between systems that survive and systems that collapse.


Check Your Understanding

Let's see if you've got this. Here are a couple of questions to test your understanding.

Question 1

You're designing a social media platform's recommendation algorithm. Which feedback loop type is most appropriate for encouraging viral growth of high-quality content?

"Users can recommend content to their followers. When a follower likes, shares, or comments on recommended content, the algorithm learns these signals and recommends similar content to that follower's connections. The goal is to spread high-quality content that aligns with the platform's values, not clickbait or misleading information."

A) Positive (Virtuous) - Virtuous growth loop
B) Positive (Vicious) - Vicious growth loop
C) Negative (Self-Regulating) - Self-regulating loop
D) Negative (Stabilizing) - Stabilizing control loop

Click to see the answer

Answer: A) Positive (Virtuous) - Virtuous growth loop

Explanation:

Let's analyze each option:

A) Correct! A virtuous cycle (positive, reinforcing) is the right choice because:

  • The goal is amplifying desirable behavior (high-quality content)
  • The feedback loop reinforces the right things (when users like/share/comment on good content, recommend more of it)
  • Output amplifies input in a controlled way that leads to exponential growth
  • It's designed to create a virtuous cycle where good content gets recommended to more people, who like/share/comment, creating more signals for the algorithm
  • This type of loop is intentionally designed for growth, but with safeguards (content quality checks, relevance filters) to prevent amplifying undesirable content

Why other options are wrong:

B) Incorrect. A vicious cycle (positive, destructive) would amplify content indiscriminately without quality checks. This could lead to:

  • Clickbait or misleading content going viral (engagement without quality)
  • Misinformation spreading faster than fact-checking can keep up
  • Users getting tired of seeing the same type of content repeatedly
  • Platform losing trust when low-quality content becomes common

The vicious cycle amplifies output, but it's destructive amplification. The virtuous cycle amplifies output too, but in a controlled, quality-focused way. The key difference is that the virtuous cycle has safeguards and quality filters, while the vicious cycle doesn't.

C) Incorrect. A self-regulating loop (negative, balancing) would maintain the current state of recommendations rather than promoting growth. This doesn't align with the goal of "encouraging viral growth of high-quality content." A self-regulating loop is for stability, not for amplification. It would keep the recommendation system in a steady state rather than spreading the best content to more people.

D) Incorrect. A stabilizing control loop (negative, constraining) would limit or throttle recommendations, which is the opposite of the goal. This type of loop is for preventing extremes (preventing spam, preventing abuse), not for promoting growth. It constrains rather than amplifies.

Key insight: Positive feedback loops aren't inherently good or bad—they're tools. The virtuous cycle uses positive feedback to achieve a desirable goal (growth) while having safeguards to prevent negative outcomes. The vicious cycle lacks those safeguards and causes collapse. The key is designing your feedback loop intentionally—knowing what you want to amplify and what controls you need to prevent unwanted side effects.


Question 2

You're analyzing an e-commerce site's auto-scaling system. The system currently has these behaviors:

  • When CPU > 80%, it adds instances immediately
  • When CPU < 60%, it removes instances immediately
  • You're seeing CPU oscillate between 55% and 85% every few minutes

What's the problem and which type of feedback loop would you implement to fix it?

A) Add a delay between scaling actions
B) Switch to negative (stabilizing) feedback loop
C) Add hysteresis to the scaling algorithm
D) Implement a deadband zone (min CPU and max CPU)

Click to see the answer

Answer: C) Add hysteresis to the scaling algorithm

Explanation:

Let's analyze the situation and each option:

The problem: The system is oscillating—CPU bounces between 55% and 85% every few minutes. This is a classic sign of delayed feedback causing oscillation. The system's feedback loop has a time delay, and it's overcorrecting in response.

Why it's happening:

  1. CPU hits 85% → System adds instances
  2. A few minutes pass (delay)
  3. CPU drops to 55% because new instances take time to warm up
  4. System sees low CPU → Removes instances
  5. CPU bounces back up → System adds instances again
  6. [Loop repeats]

The delay (time for instances to warm up/cool down) combined with the system's aggressive reaction (add/remove immediately) creates a continuous oscillation around the target (70%) that the system can never settle at.

Why option A is wrong: Adding a delay would actually make oscillation worse. The system would respond even more slowly to changes, and the phase lag between the action and feedback would increase, potentially creating more severe oscillation or making the system more unstable. You don't want more delay in a feedback loop that's already oscillating.

Why option B is wrong: Switching to a negative (stabilizing) feedback loop is appropriate when you have a target and want to maintain it. But in this case, the system doesn't have a clear target—it's reacting to load with add/remove decisions. The problem isn't that it's scaling wrong (too much or too little), it's that it's oscillating. A stabilizing loop would help if the system was consistently scaling too high or too low, but here the issue is the instability caused by the oscillation itself, not the scaling decisions.

Why option D is wrong: A deadband zone (min CPU and max CPU) would prevent oscillation by restricting the system's range. This would stop the oscillation, but at a cost: the system couldn't scale above max CPU even if needed, and would throttle during legitimate spikes. This is a brute-force solution that sacrifices flexibility for stability. It might be appropriate in some cases (preventing DDoS attacks), but it's not the right solution for a general oscillation problem.

Why option C is correct: Adding hysteresis (memory of past states) to the scaling algorithm would solve the oscillation:

// Hysteresis-based auto-scaling
HysteresisScaling = scenario "Auto-Scaling with Hysteresis" {
  // Track past CPU values
  CPUHistory -> History "Store last 5 CPU readings"
  
  // Smoothed response with memory
  History -> HysteresisController "Calculate smoothed CPU"
  
  // Use smoothed CPU for decisions
  HysteresisController -> AutoScaler "Use smoothed CPU (65%)"
  
  // Don't scale up just because of one spike
  if cpu_spike {
    HysteresisController -> AutoScaler "Wait, verify spike is real"
  }
}

How hysteresis works:

  • Instead of reacting instantly to every CPU reading, the system remembers recent values
  • It averages or smooths the readings to filter out transient spikes
  • It considers the trend, not just the current value
  • It adds inertia—resists changing direction based on a single anomaly

Why this fixes oscillation:

  • When CPU spikes temporarily (one reading at 85%), the system remembers "we were just at 70%, this is probably a spike"
  • It doesn't scale up aggressively in response to the spike
  • Instead, it scales up gradually if the smoothed CPU remains high
  • When CPU drops back down, it doesn't scale down immediately; it remembers "we were just high"

This smooths out the oscillation. The system still scales to meet load, but it does so more calmly and predictably, eliminating the bouncy behavior.
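That smoothing can be sketched in ordinary code. The window size and thresholds below are illustrative; the key behavior is that one spiky reading no longer flips the scaling decision, while a sustained change still does.

```python
from collections import deque

# Hysteresis via a moving average over recent CPU readings: decisions are
# made on the smoothed value, so a single spike is filtered out.
class SmoothedScaler:
    def __init__(self, window=5, scale_up_at=80.0, scale_down_at=60.0):
        self.readings = deque(maxlen=window)  # keeps only the last `window` readings
        self.scale_up_at = scale_up_at
        self.scale_down_at = scale_down_at

    def decide(self, cpu):
        """Record a CPU reading and decide based on the smoothed value."""
        self.readings.append(cpu)
        smoothed = sum(self.readings) / len(self.readings)
        if smoothed > self.scale_up_at:
            return "scale_up"
        if smoothed < self.scale_down_at:
            return "scale_down"
        return "hold"

scaler = SmoothedScaler()
for cpu in (70, 70, 70, 70):
    scaler.decide(cpu)
decision = scaler.decide(85)  # single spike: smoothed CPU is only 73 -> "hold"
```

A sustained run of high readings would push the average past the threshold and trigger scaling, so the system still responds to real load, just without the bouncing.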

Key insight: Delayed feedback often causes oscillation because systems overcorrect to changes that were already addressed. Hysteresis (adding memory) smooths out the response by considering the system's history, preventing the overcorrection that drives oscillation. It's like a thermostat that remembers the room has been warming up for the last 10 minutes and doesn't turn off the heater just because the temperature momentarily hits the target.


What's Next?

Now you understand the different types of feedback loops—positive (reinforcing) and negative (balancing)—and when to use each. You've seen how positive loops can create virtuous growth or vicious collapse. You've seen how negative loops maintain stability through self-regulation, error-correction, and control.

In the next lesson, you'll learn to model cycles explicitly in Sruja. You'll discover how to create valid feedback loops, distinguish them from circular dependencies (which are bad), and use cycles to represent self-regulating systems, learning mechanisms, and adaptive behaviors.

See you there!

Lesson 3: Modeling Cycles

Learning Goals

  • Learn how to create valid cycles in Sruja
  • Model feedback loops explicitly
  • Differentiate cycles from circular dependencies

Cycles in Sruja

Unlike circular dependencies, which are errors, feedback loops are valid cycles in Sruja.

Basic Cycle Syntax

// Valid feedback loop
User -> App.WebApp "Submits form"
App.WebApp -> App.API "Validates"
App.API -> App.WebApp "Returns result"
App.WebApp -> User "Shows feedback"

// User resubmits (cycle completes)

Modeling Feedback Loops

Example 1: User Feedback Loop

import { * } from 'sruja.ai/stdlib'

User = person "User"

App = system "Application" {
  WebApp = container "Web Application"
  API = container "API Service"
}

// User feedback cycle
UserFeedback = scenario "User Form Feedback" {
  User -> App.WebApp "Submit form"
  App.WebApp -> App.API "Validate input"
  App.API -> App.WebApp "Return validation result"

  // Error path (loop)
  if has_errors {
    App.WebApp -> User "Show errors"
    User -> App.WebApp "Correct and resubmit"
  }

  // Success path (end cycle)
  if no_errors {
    App.WebApp -> Database "Save data"
    App.WebApp -> User "Show success"
  }
}

Example 2: System Self-Regulation

AutoScaling = system "Auto-Scaling System" {
  Monitor = container "Monitoring Service"
  Scaling = container "Scaling Service"
}

App = system "Application" {
  API = container "API Service"
}

// Auto-scaling feedback loop
ScalingLoop = scenario "Auto-Scaling Feedback" {
  App.API -> AutoScaling.Monitor "Reports load"

  // Scale up if load is high
  if load_high {
    AutoScaling.Monitor -> AutoScaling.Scaling "Trigger scale up"
    AutoScaling.Scaling -> App.API "Add instances"
    App.API -> AutoScaling.Monitor "Reports new load"
    // Loop continues until load normalizes
  }

  // Scale down if load is low
  if load_low {
    AutoScaling.Monitor -> AutoScaling.Scaling "Trigger scale down"
    AutoScaling.Scaling -> App.API "Remove instances"
    App.API -> AutoScaling.Monitor "Reports new load"
    // Loop continues until load normalizes
  }
}

Example 3: Inventory Management

Admin = person "Administrator"

Shop = system "Shop" {
  API = container "API Service"
  Inventory = database "Inventory Database"
}

// Inventory feedback loop
InventoryLoop = scenario "Inventory Feedback" {
  Shop.API -> Shop.Inventory "Update stock"

  // Low stock alert
  if stock_low {
    Shop.Inventory -> Shop.API "Notify low stock"
    Shop.API -> Admin "Send restock alert"
    Admin -> Shop.API "Restock inventory"
    Shop.API -> Shop.Inventory "Update stock"
    // Inventory updates, loop may repeat
  }

  // Normal stock (no action needed)
  if stock_normal {
    Shop.Inventory -> Shop.API "Stock OK"
  }
}

Example 4: Learning System

User = person "User"

MLSystem = system "ML Recommendation System" {
  API = container "Recommendation API"
  Model = database "ML Model"
  Training = container "Training Pipeline"
}

// Learning feedback loop
LearningLoop = scenario "ML Learning Cycle" {
  User -> MLSystem.API "Request recommendations"
  MLSystem.API -> MLSystem.Model "Get predictions"
  MLSystem.Model -> MLSystem.API "Return recommendations"
  MLSystem.API -> User "Show recommendations"

  // User feedback
  User -> MLSystem.API "Rate recommendations"

  // Model update
  MLSystem.API -> MLSystem.Training "Add training data"
  MLSystem.Training -> MLSystem.Model "Update model"

  // Improved recommendations next time
  MLSystem.Model -> MLSystem.API "Better predictions"
  MLSystem.API -> User "Show improved recommendations"
}

Explicit vs Implicit Cycles

Explicit Cycle (Clear Feedback)

// Shows the complete feedback path
User -> App "Submit data"
App -> User "Show result"
User -> App "Adjust and resubmit"

// Clearly shows learning/adaptation

Implicit Cycle (Inferred)

// Relationships imply the cycle exists
User -> App "Uses"
App -> User "Responds"

// Cycle is there but not explicitly modeled

Recommendation: Use explicit cycles for important feedback mechanisms.

Valid Cycles vs Circular Dependencies

Circular Dependency (Bad)

// Static compile-time dependency
ModuleA -> ModuleB "Imports"
ModuleB -> ModuleA "Imports"

// Problems:
// - Impossible to initialize
// - Tight coupling
// - No clear purpose

Feedback Loop (Good)

// Dynamic runtime feedback
User -> App "Submits data"
App -> User "Shows result"

// Benefits:
// - Enables adaptation
// - Clear purpose
// - Loose coupling (eventual consistency)

Feedback Loop Patterns

Pattern 1: Immediate Feedback

InstantFeedback = scenario "Form Validation" {
  User -> WebApp "Type in field"
  WebApp -> API "Validate"
  API -> WebApp "Result (instant)"
  WebApp -> User "Show error/success"
}

Pattern 2: Delayed Feedback

DelayedFeedback = scenario "Performance Monitoring" {
  App -> Monitoring "Log metrics"

  // Time passes

  Monitoring -> App "Send alert (after threshold)"
  App -> Admin "Notify"
  Admin -> App "Adjust configuration"
  App -> Monitoring "Log new metrics"
}

Pattern 3: Aggregated Feedback

AggregatedFeedback = scenario "A/B Testing" {
  Users -> App "Use feature A"

  // Aggregate many interactions
  App -> Analytics "Log events"
  Analytics -> Dashboard "Show aggregated results"

  Team -> App "Make decision based on data"
  App -> Users "Roll out winner to all"
}

Feedback Loop Controls

Prevent Runaway Behavior

AutoScaling = container "Auto-Scaling Service" {
  scale {
    min 2
    max 10
    metric "cpu > 80%"
    cooldown "5 minutes"  // Prevent rapid scaling
  }
}

Circuit Breaker Pattern

CircuitBreaker = scenario "Circuit Breaker Feedback" {
  App -> Service "Make request"

  // If failures exceed threshold, open circuit
  if failures > threshold {
    App -> Fallback "Use fallback"
    Fallback -> App "Return cached data"

    // After cooldown, try again
    if cooldown_elapsed {
      App -> Service "Make request"
      Service -> App "Success (close circuit)"
    }
  }
}
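Outside of Sruja, the same control can be sketched in a few lines of ordinary code. The class, its thresholds, and the injected clock are illustrative choices, not a specific library's API.

```python
import time

# Minimal circuit breaker sketch: open after `threshold` consecutive
# failures, allow a trial request again once `cooldown` seconds pass.
class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injected so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow(self):
        """True if the protected call may proceed."""
        if self.opened_at is None:
            return True
        # After the cooldown, let one trial request through (half-open).
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, success):
        """Report the outcome of a protected call."""
        if success:
            self.failures, self.opened_at = 0, None  # close the circuit
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()        # open the circuit
```

The cooldown is the control that prevents the breaker itself from becoming a tight retry loop against an already-failing service.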

Rate Limiting

RateLimited = scenario "Rate Limited Requests" {
  User -> API "Send request"

  // Check rate limit
  API -> RateLimiter "Check limit"

  if under_limit {
    RateLimiter -> API "Allow"
    API -> User "Process request"
  }

  if over_limit {
    RateLimiter -> API "Throttle"
    API -> User "Rate limit exceeded"

    // User waits and retries (feedback loop)
    User -> API "Retry after delay"
  }
}

Documenting Feedback Loops

Add Metadata

AutoScaling = system "Auto-Scaling" {
  metadata {
    feedback_loop {
      type "negative_balancing"
      purpose "Maintain target CPU usage"
      target "70% CPU"
      controls "min/max instance limits"
      monitoring "CPU, latency, error rate"
    }
  }
}

Complete Example: E-Commerce Feedback Loops

import { * } from 'sruja.ai/stdlib'

Customer = person "Customer"
Admin = person "Administrator"

Shop = system "Shop" {
  WebApp = container "Web Application"
  API = container "API Service"
  Database = database "Database"
  Cache = database "Redis Cache"
}

// User feedback loop (interactive)
UserFeedback = scenario "Checkout Feedback" {
  Customer -> Shop.WebApp "Submit order"
  Shop.WebApp -> Shop.API "Process order"

  // Payment feedback
  Shop.API -> PaymentGateway "Process payment"
  PaymentGateway -> Shop.API "Payment result"

  if payment_failed {
    Shop.API -> Shop.WebApp "Return error"
    Shop.WebApp -> Customer "Show error, retry"
    Customer -> Shop.WebApp "Try again"  // Loop
  }

  if payment_success {
    Shop.API -> Shop.Database "Save order"
    Shop.API -> Shop.WebApp "Success"
    Shop.WebApp -> Customer "Show confirmation"
  }
}

// Inventory feedback loop (self-regulating)
InventoryFeedback = scenario "Inventory Feedback" {
  Shop.API -> Shop.Database "Update inventory"

  if stock_low {
    Shop.Database -> Shop.API "Notify low stock"
    Shop.API -> Admin "Send alert"
    Admin -> Shop.API "Restock"
    Shop.API -> Shop.Database "Update inventory"
  }
}

// Cache feedback loop (learning)
CacheFeedback = scenario "Cache Learning" {
  Shop.API -> Shop.Cache "Query cache"

  if cache_hit {
    Shop.Cache -> Shop.API "Return data (fast)"
  }

  if cache_miss {
    Shop.API -> Shop.Database "Query database"
    Shop.Database -> Shop.API "Return data"
    Shop.API -> Shop.Cache "Store in cache"
    // Next request will be a cache hit
  }
}

view index {
  include *
}

Exercise

Model feedback loops for:

  1. Chat application: User types, app shows "typing indicator", receiver sees it, receiver types, sender sees "typing indicator"...

  2. Review system: User rates product, system updates average rating, displays to next users...

  3. Load balancing: Server gets overloaded, balancer sends traffic to other servers, load redistributes...

Key Takeaways

  1. Cycles are valid in Sruja: Feedback loops are not errors
  2. Explicit cycles: Model important feedback mechanisms
  3. Differentiate: Circular dependencies (bad) vs feedback loops (good)
  4. Add controls: Prevent runaway behavior with limits
  5. Document clearly: Use metadata to explain feedback loops

Module 5 Complete

You've completed Feedback Loops! You now understand:

  • What feedback loops are and why they matter
  • Types of feedback loops (positive, negative, delayed)
  • How to model cycles in Sruja

Next: Learn about Module 6: Context.

Module 6: Context

Overview

In this module, you'll learn to capture the environment your system operates in: stakeholders, dependencies, constraints, and success criteria.

Learning Objectives

By the end of this module, you'll be able to:

  • Identify and document stakeholders
  • Model external dependencies and integrations
  • Define constraints and non-functional requirements
  • Capture success criteria and SLOs

Lessons

Prerequisites

Time Investment

Approximately 1-1.5 hours to complete all lessons and exercises.

Course Completion

After completing this module, you'll have finished Systems Thinking 101!

Next Steps

After completing this course:

The Context Trap: Why Great Architecture Needs More Than Great Code

I once worked on what we thought was the perfect e-commerce platform. Clean microservices architecture, elegant APIs, comprehensive test coverage, the works. We were proud. Three months after launch, the project was cancelled.

The problem? We'd built the wrong thing.

Our payment processing was elegant—but the company had negotiated a deal with a specific payment provider we couldn't use. Our real-time inventory tracking was brilliant—but the warehouse team needed daily batches, not real-time updates. Our admin interface was beautiful—but the support team needed bulk operations, not pretty screens.

We had built a technical masterpiece that solved nobody's actual problems.

The missing piece was context. We'd designed the system in isolation, without understanding the organizational constraints, stakeholder needs, and business realities that surrounded it. The architecture was technically sound but organizationally wrong.

This lesson is about avoiding that trap. You'll learn how to see the invisible environment that shapes every system—the stakeholders, dependencies, constraints, and success criteria that determine whether your architecture succeeds or fails, regardless of how elegant your code might be.

Learning Goals

By the end of this lesson, you'll be able to:

  • Identify the multiple layers of context that surround any system
  • Recognize why context matters just as much as technical architecture
  • Model stakeholder, technical, and organizational context using Sruja
  • Avoid the trap of designing systems in isolation
  • Document constraints and dependencies that affect design decisions

Context: The Invisible Environment

Here's a question that changed how I think about architecture: What surrounds your system?

Not what's inside it. Not what code it runs. But what's around it—the environment it operates in, the people who use it, the systems it depends on, the constraints it must satisfy.

Context is everything that affects your system or is affected by it, even though it's not part of the system itself. Think of it like the water a fish swims in—the fish doesn't see it, but it determines everything about how the fish lives.

┌─────────────────────────────────────────────┐
│           ORGANIZATIONAL CONTEXT            │
│   Company culture, processes, constraints   │
│                                             │
│   ┌─────────────────────────────────────┐   │
│   │          TECHNICAL CONTEXT          │   │
│   │  Dependencies, infrastructure, APIs │   │
│   │                                     │   │
│   │   ┌─────────────────────────────┐   │   │
│   │   │     STAKEHOLDER CONTEXT     │   │   │
│   │   │   Users, teams, customers   │   │   │
│   │   │                             │   │   │
│   │   │   ┌─────────────────────┐   │   │   │
│   │   │   │     YOUR SYSTEM     │   │   │   │
│   │   │   └─────────────────────┘   │   │   │
│   │   └─────────────────────────────┘   │   │
│   └─────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

Each of these layers matters. Miss any of them, and you're designing in the dark.

The Three Layers of Context

Stakeholder context answers the question: Who cares about this system? This includes users, certainly, but also business owners, support teams, compliance officers, and anyone else affected by what you build. Each has different needs, and ignoring any of them creates problems.

Technical context covers the systems and services yours depends on: payment gateways, email services, databases, APIs. These dependencies constrain what's possible and create failure modes you need to handle.

Organizational context captures the business realities: budget constraints, team size, compliance requirements, strategic goals, timelines. These often matter more than technical constraints but are easier to overlook.
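Each of these layers maps onto a Sruja construct. Here's a minimal sketch of one element per layer — the specific tags and constraint strings are illustrative, not prescribed:

```
// Stakeholder context: who cares
Customer = person "Customer"

// Technical context: what you depend on
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
  }
}

// Organizational context: realities attached to your own system
Shop = system "Shop" {
  metadata {
    constraints {
      "Compliance required",
      "Small team, limited budget"
    }
  }
}

Customer -> Shop "Uses"
Shop -> PaymentGateway "Depends on"
```

The rest of this lesson expands each layer into a fuller model.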

Why Context Matters (Beyond Theory)

Let me share three real experiences that taught me why context isn't optional.

The Payment Gateway Surprise

A startup I advised built their entire checkout flow around a specific payment gateway's API. Clean code, well-tested, ready to ship. Two weeks before launch, they learned that their business deal required using a different gateway with a completely different API.

The technical work wasn't wasted, but they had to rewrite significant portions and delay launch by a month. Had they modeled their payment gateway as an external dependency with the constraint that it might change, they would have built an abstraction layer from the start.

The cost of missing context? A month of delay and a lot of rework.

The Compliance Awakening

I consulted for a healthcare company that built a beautiful patient data system. It was fast, user-friendly, and technically impressive. Then they tried to deploy it and learned about HIPAA compliance requirements they'd never considered.

The system needed audit logging, data encryption at rest, access controls, and a dozen other features they hadn't built. Six months of work had to be redone to add compliance.

The cost of missing context? Six months of rework.

The Stakeholder Disconnect

A team built an internal deployment tool that developers loved. It was elegant and powerful. But the DevOps team couldn't use it—it didn't integrate with their monitoring systems or provide the audit trails they needed for compliance.

Two teams, one tool, completely different needs. The developers got their tool, but the DevOps team continued using their old scripts, creating fragmentation and confusion.

The cost of missing context? A tool that solved half the problem and created new ones.

Modeling Context in Sruja

This is where architecture diagrams become powerful. Sruja gives you specific tools to capture context, not just system internals.

Documenting Stakeholders

Use person to capture who cares about your system:

// The people who matter
Customer = person "Customer"
Administrator = person "Administrator"
SupportTeam = person "Support Team"
BusinessOwner = person "Business Owner"
ComplianceOfficer = person "Compliance Officer"

This isn't just documentation—it's a reminder that your system serves different people with different needs.

Capturing External Dependencies

Use system for everything your system depends on:

// What you depend on
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "critical", "pci-compliant"]
    sla "99.9% uptime"
    fallback "Manual payment processing"
  }
}

EmailService = system "Email Service" {
  metadata {
    tags ["external", "low-priority"]
    impact "Notifications delayed but system works"
  }
}

AnalyticsService = system "Analytics Service" {
  metadata {
    tags ["external", "optional"]
    impact "Analytics lost but core functionality works"
  }
}

The metadata captures critical information: Is this dependency critical? What happens if it fails? What's the SLA? This isn't just documentation—it's risk assessment.

Showing Stakeholder Needs

Relationships reveal what different stakeholders need:

// Different stakeholders, different needs
Customer -> Shop "Wants fast checkout"
Administrator -> Shop "Wants easy management"
BusinessOwner -> Shop "Wants high conversion"
ComplianceOfficer -> Shop "Wants data security"

These arrows seem simple, but they remind you that you're balancing competing needs. Fast checkout might conflict with security. Easy management might conflict with simplicity. These tensions are real and ignoring them doesn't make them go away.

Documenting Constraints

Use metadata to capture organizational realities:

Shop = system "Shop" {
  metadata {
    constraints {
      "PCI-DSS compliance required",
      "Maximum response time: 2s",
      "Budget: $500/month infrastructure",
      "Team size: 3 engineers"
    }
  }
}

These constraints are just as real as technical constraints. A $500/month budget limits your infrastructure choices. A 3-person team limits how complex the system can be. Pretending these don't exist doesn't help anyone.

Defining Success

Use slo and success_criteria to define what "good" looks like:

Shop = system "Shop" {
  slo {
    availability {
      target "99.9%"
    }
    latency {
      p95 "200ms"
    }
  }

  metadata {
    success_criteria {
      "Support 10k concurrent users",
      "Less than 1% abandoned carts",
      "Checkout completion rate > 80%"
    }
  }
}

Without explicit success criteria, how do you know if you've succeeded? How do you make trade-offs? How do you prioritize features?

A Complete Example

Let me show you how this comes together in a real architecture:

import { * } from 'sruja.ai/stdlib'

// Stakeholder context: Who cares?
Customer = person "Customer"
Administrator = person "Administrator"
BusinessOwner = person "Business Owner"
SupportTeam = person "Support Team"

// Your system
Shop = system "Shop" {
  metadata {
    constraints {
      "PCI-DSS compliance required",
      "Team size: 3 engineers"
    }
  }
}

// Technical context: What do you depend on?
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "critical"]
    sla "99.9% uptime"
  }
}

EmailService = system "Email Service" {
  metadata {
    tags ["external", "low-priority"]
  }
}

AnalyticsService = system "Analytics Service" {
  metadata {
    tags ["external"]
  }
}

// Show the dependencies
Shop -> PaymentGateway "Depends on for payments"
Shop -> EmailService "Depends on for notifications"
Shop -> AnalyticsService "Depends on for tracking"

// Show stakeholder needs
Customer -> Shop "Wants fast, reliable shopping"
Administrator -> Shop "Wants easy management"
BusinessOwner -> Shop "Wants high revenue"
SupportTeam -> Shop "Wants clear error messages"

view index {
  include *
}

This diagram tells a story. You can see who matters, what the system depends on, what the constraints are, and what success looks like. That's the power of modeling context.

Common Context Mistakes

After years of watching teams struggle with this, I've seen a few patterns repeat.

Mistake 1: Ignoring context entirely. Teams build elegant systems that solve the wrong problems. The architecture is technically beautiful but organizationally useless.

Mistake 2: Adding too much context. Some teams try to document everything and everyone, creating diagrams that are more confusing than helpful. Focus on stakeholders who directly affect or are affected by the system.

Mistake 3: Context at the wrong level. Documenting individual databases and caches instead of higher-level services. Context should clarify, not overwhelm.

The key is balance: enough context to understand the system's environment, not so much that you lose the system itself.

What to Remember

Context isn't optional—it's half the architecture. The most elegant technical solution fails if it ignores stakeholder needs, organizational constraints, or external dependencies.

Model stakeholders explicitly. Use person to capture who cares about your system and what they need.

Document dependencies clearly. Use system for external services and metadata to capture criticality and fallbacks.

Capture constraints honestly. Budget, team size, compliance, timelines—these are real constraints that shape design decisions.

Define success criteria. Without explicit goals, you can't make trade-offs or know when you're done.

Balance detail. Enough context to understand the environment, not so much that the system gets lost.

Context-aware architecture isn't about adding more documentation—it's about seeing the full picture before you start building. The time you spend understanding context pays back tenfold in avoided rework and better decisions.

What's Next

Now that you understand context as a concept, Lesson 2 dives deep into stakeholders—the people who use, build, and depend on your system. You'll learn how to identify different stakeholder types and model their needs in Sruja.


The Hidden Stakeholder Problem: Why Everyone Matters (Even People You've Never Met)

We launched the new admin dashboard on a Tuesday. By Wednesday, the project was in crisis.

The dashboard was beautiful—sleek, modern, exactly what the product team wanted. The developers had done great work. But within 24 hours, we had three critical problems:

The support team couldn't find customer information quickly enough. The old dashboard had a search box right at the top; the new one buried it three levels deep. Support ticket resolution time doubled.

The compliance team realized the new audit logs didn't capture user IP addresses, which was required for their quarterly reports. They couldn't sign off on the release.

The finance team discovered that the revenue reports they'd been getting automatically every Monday morning were now manual—and nobody had told them.

We'd spent three months building the perfect product for the product team. We'd forgotten that three other teams also depended on the system.

The launch was delayed six weeks while we added features for stakeholders we'd never even talked to.

This lesson is about avoiding that mistake. You'll learn how to identify every stakeholder who matters (including the hidden ones), understand their competing needs, and model them in a way that prevents surprises.

Learning Goals

By the end of this lesson, you'll be able to:

  • Identify all stakeholder types, not just the obvious ones
  • Recognize why different stakeholders often have conflicting needs
  • Model stakeholder relationships and interactions in Sruja
  • Document stakeholder needs, pain points, and priorities
  • Avoid the "hidden stakeholder" problem that derails projects

Stakeholders: The Real System Owners

Here's a truth that took me years to learn: Users don't own systems. Stakeholders do.

A user is someone who interacts with your system. A stakeholder is anyone affected by it or who can affect it. That's a much bigger group.

Think about an e-commerce platform:

  • Customers use it to buy things (users AND stakeholders)
  • Administrators use it to manage products (users AND stakeholders)
  • Business owners never touch it but depend on its revenue (stakeholders, not users)
  • Support agents need data from it but might not log in directly (stakeholders, not necessarily users)
  • Compliance officers audit it but don't use it (stakeholders, not users)
  • Developers maintain it (stakeholders, users of a different kind)

Miss any of these, and you're building an incomplete picture. Each has different needs, different priorities, different success criteria.

The Five Stakeholder Types (And Why They Conflict)

After years of stakeholder surprises, I've learned to look for five specific groups. Each sees the system differently.

1. Primary Users: The People You Think About

These are the direct users—the ones product teams interview, the ones in user stories, the ones you're probably already thinking about.

Who they are: Customers, administrators, anyone who logs in and clicks buttons.

What they want: Speed, ease of use, features that help them do their job.

Example in Sruja:

Customer = person "Customer" {
  description "Shoppers who purchase products"
  metadata {
    needs ["Fast checkout", "Easy search", "Order tracking"]
    pain_points ["Complex forms", "Slow page loads"]
    usage "Daily, mostly mobile"
  }
}

The trap: It's easy to focus only on primary users and forget everyone else.

2. Secondary Users: The People Who Need Your Data

These users don't interact with your system directly, but they depend on its outputs—reports, data exports, APIs.

Who they are: Support teams, analysts, people who receive automated reports.

What they want: Data access, clear reports, reliable exports.

Example:

SupportAgent = person "Support Agent" {
  description "Helps customers with order issues"
  metadata {
    needs ["Quick customer lookup", "Order history", "Ability to modify orders"]
    pain_points ["Can't find customer data", "Too many clicks to resolve issues"]
    usage "Uses admin tools to look up information"
  }
}

The trap: These users are invisible until something breaks. I've seen launches delayed because the weekly report that "nobody uses" suddenly turns out to be critical for the CEO.

3. Business Stakeholders: The People Who Pay For It

These are the decision-makers, budget-owners, and revenue-responsible people. They might never use your system, but they decide if it succeeds.

Who they are: Product managers, business owners, executives, finance teams.

What they want: Revenue, metrics, ROI, competitive advantage.

Example:

BusinessOwner = person "Business Owner" {
  description "Accountable for revenue and profit"
  metadata {
    needs ["Revenue reports", "Conversion metrics", "Cost tracking"]
    concerns ["Is the system making money?", "Are customers happy?", "What's the ROI?"]
    success_criteria "10% increase in conversion rate"
  }
}

ProductManager = person "Product Manager" {
  description "Owns product strategy and roadmap"
  metadata {
    needs ["User analytics", "Feature usage data", "A/B test results"]
    concerns ["Are users adopting features?", "What should we build next?"]
  }
}

The trap: Business stakeholders often have goals that conflict with user experience. Fast checkout might reduce revenue (fewer impulse buys). Easy returns might increase costs. You need to model these tensions explicitly.

4. Technical Stakeholders: The People Who Build and Run It

These are your teammates—the developers, DevOps engineers, DBAs, security teams. The system affects their daily work.

Who they are: Developers, operations teams, security engineers, database administrators.

What they want: Clear architecture, good documentation, easy deployment, monitoring.

Example:

Developer = person "Developer" {
  description "Builds and maintains the system"
  metadata {
    needs ["Clear architecture docs", "API documentation", "Debugging tools"]
    pain_points ["Unclear requirements", "Technical debt", "Poor test coverage"]
  }
}

DevOpsEngineer = person "DevOps Engineer" {
  description "Deploys and operates the system"
  metadata {
    needs ["Monitoring dashboards", "Easy deployment", "Clear logs"]
    pain_points ["Manual deployments", "Poor observability"]
  }
}

The trap: Technical stakeholders often get ignored in architecture diagrams, but their needs are real. A system that's perfect for users but impossible to operate is a failed system.

5. Compliance and Governance: The People Who Can Say "No"

These stakeholders can block your launch. They don't use the system, but they regulate it.

Who they are: Compliance officers, security auditors, legal teams, data privacy officers.

What they want: Audit trails, data protection, regulatory compliance.

Example:

ComplianceOfficer = person "Compliance Officer" {
  description "Ensures regulatory compliance"
  metadata {
    needs ["Audit logs", "Data retention policies", "Access controls"]
    requirements ["PCI-DSS", "GDPR", "SOX"]
    can_block_launch true
  }
}

SecurityAuditor = person "Security Auditor" {
  description "Reviews security posture"
  metadata {
    needs ["Vulnerability reports", "Penetration test results", "Access logs"]
    concerns ["Data breaches", "Unauthorized access", "Injection attacks"]
  }
}

The trap: These stakeholders are invisible until they're not. I've seen projects delayed months because compliance requirements were discovered too late.

A Real Stakeholder Conflict (And How We Solved It)

Let me share a specific example that taught me why stakeholder modeling matters.

The situation: We were building a customer support dashboard. The product team wanted a clean, minimal interface—fewer buttons, more white space, "Apple-like" design.

The conflict: Support agents needed dense information displays. They handled 50+ tickets per day and couldn't afford extra clicks. What product called "cluttered," support called "efficient."

The mistake: We designed for product's vision first. Support hated it.

The solution: We modeled both stakeholders explicitly:

ProductManager = person "Product Manager" {
  metadata {
    vision "Clean, minimal, modern interface"
    priority "User experience, simplicity"
  }
}

SupportAgent = person "Support Agent" {
  metadata {
    needs ["Dense information display", "Minimal clicks", "Keyboard shortcuts"]
    metric "50+ tickets per day"
    priority "Speed over aesthetics"
  }
}

// Make the conflict explicit
ProductManager -> SupportDashboard "Wants clean interface"
SupportAgent -> SupportDashboard "Needs dense information"

Making the conflict visible in the architecture forced a conversation. The solution was modes: a "standard" view for occasional users and a "power user" view for support agents. Both stakeholders got what they needed, but only because we'd modeled the conflict explicitly.

Documenting Stakeholders in Sruja

Sruja gives you multiple ways to capture stakeholder information. Here's what works best for each situation.

Basic Stakeholder Declaration

Customer = person "Customer"

Simple and clear. Use this when you just need to show that someone exists.

Detailed Stakeholder Profile

Customer = person "Customer" {
  description "End users who purchase products"
  metadata {
    tags ["primary-user", "external"]
    priority "critical"
    needs [
      "Fast and easy checkout",
      "Product search and filtering",
      "Order tracking"
    ]
    pain_points [
      "Complex forms",
      "Slow page loads",
      "Lack of mobile support"
    ]
    context "Busy professionals, often shopping on mobile during commute"
  }
}

Use this for primary stakeholders where you need to capture their full context.

Stakeholder With Relationships

Customer = person "Customer"
Shop = system "Shop"

Customer -> Shop "Purchases products"
Shop -> Customer "Sends order updates"

Relationships show how stakeholders interact with the system. Notice the bidirectional flow—customers buy, but the system also reaches out to customers.

Stakeholder Personas (Advanced)

For critical user types, create detailed personas:

Sarah = person "Sarah (Customer Persona)" {
  description "Busy professional, 35, shops on mobile during commute"
  metadata {
    demographics {
      age "35"
      occupation "Marketing manager"
      device "iPhone 13"
    }
    goals [
      "Find products quickly",
      "Complete checkout in under 2 minutes",
      "Track orders without logging into email"
    ]
    frustrations [
      "Sites that aren't mobile-friendly",
      "Long forms that don't autofill",
      "Slow loading pages"
    ]
    scenario "Shopping on the train to work, 15 minutes before her stop"
  }
}

Personas bring stakeholders to life. They're especially useful when you need to make design trade-offs and want to ask "What would Sarah prefer?"

A Complete Example: E-Commerce Platform

Let me show you how all this comes together in a real architecture:

import { * } from 'sruja.ai/stdlib'

// =========== STAKEHOLDERS ===========

// Primary users
Customer = person "Customer" {
  description "Shoppers who purchase products"
  metadata {
    needs ["Fast checkout", "Easy search", "Mobile-friendly"]
    priority "critical"
  }
}

Administrator = person "Administrator" {
  description "Manages products, orders, and inventory"
  metadata {
    needs ["Bulk operations", "Reporting", "Quick updates"]
    priority "high"
  }
}

// Secondary users
SupportAgent = person "Support Agent" {
  description "Helps customers with order issues"
  metadata {
    needs ["Customer lookup", "Order history", "Refund processing"]
    priority "high"
  }
}

// Business stakeholders
ProductManager = person "Product Manager" {
  description "Owns product strategy"
  metadata {
    needs ["Analytics", "Feature usage", "User feedback"]
    priority "medium"
  }
}

BusinessOwner = person "Business Owner" {
  description "Accountable for revenue"
  metadata {
    needs ["Revenue reports", "Conversion metrics", "Cost tracking"]
    priority "high"
  }
}

// Compliance
ComplianceOfficer = person "Compliance Officer" {
  description "Ensures PCI-DSS compliance"
  metadata {
    needs ["Audit logs", "Access controls", "Data encryption"]
    can_block_launch true
  }
}

// =========== SYSTEM ===========

Shop = system "Shop" {
  WebApp = container "Web Application"
  API = container "API Service"
  Database = database "Database"
  
  metadata {
    slo {
      availability { target "99.9%" }
      latency { p95 "200ms" }
    }
  }
}

// =========== STAKEHOLDER INTERACTIONS ===========

// Primary user interactions
Customer -> Shop.WebApp "Browses and purchases"
Administrator -> Shop.WebApp "Manages products and orders"

// Secondary user interactions
SupportAgent -> Shop.WebApp "Looks up customer info"

// Business stakeholder needs (indirect)
ProductManager -> Shop "Reviews analytics"
BusinessOwner -> Shop "Monitors revenue"

// Compliance oversight
ComplianceOfficer -> Shop "Audits compliance"

view index {
  include *
}

This diagram tells a complete story. You can see who matters, what they need, and how they interact with the system. That's the power of explicit stakeholder modeling.

Prioritizing Stakeholders (When You Can't Please Everyone)

Here's the uncomfortable truth: stakeholders have conflicting needs, and you can't satisfy everyone.

The customer wants the cheapest price. The business owner wants the highest margin. Those are fundamentally in tension.

The developer wants clean code. The product manager wants features fast. Also in tension.

I've learned to prioritize stakeholders using a simple framework:

Critical: Primary users and anyone who can block launch (compliance, security)

High: Business owners and secondary users

Medium: Technical stakeholders and internal teams

Low: Nice-to-have but not essential

In Sruja:

Customer = person "Customer" {
  metadata {
    priority "critical"
    rationale "Primary user, revenue source"
  }
}

ComplianceOfficer = person "Compliance Officer" {
  metadata {
    priority "critical"
    rationale "Can block launch"
  }
}

MarketingAnalyst = person "Marketing Analyst" {
  metadata {
    priority "low"
    rationale "Nice to have, not essential for launch"
  }
}

Making priorities explicit helps when you need to make trade-offs.

The Stakeholder Discovery Process

How do you find stakeholders you don't know about? I've learned to ask three questions:

  1. Who uses the system directly? (Primary users)
  2. Who receives data or reports from the system? (Secondary users)
  3. Who can say "no" to this launch? (Compliance, business, security)

I also look for "zombie stakeholders"—people who used to matter but haven't been involved recently. They often resurface at the worst possible moment.
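The answers to those three questions can be recorded directly in the model so they aren't lost. A sketch — the `discovered_via` metadata key is my own convention here, not a built-in:

```
// Tag each stakeholder with the discovery question that surfaced them
Customer = person "Customer" {
  metadata {
    discovered_via "Uses the system directly"
  }
}

FinanceTeam = person "Finance Team" {
  metadata {
    discovered_via "Receives automated revenue reports"
  }
}

ComplianceOfficer = person "Compliance Officer" {
  metadata {
    discovered_via "Can say no to the launch"
    can_block_launch true
  }
}
```

A stakeholder whose discovery path is documented is much harder to forget at the next redesign.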

Common Stakeholder Mistakes

After years of stakeholder surprises, I've seen these patterns repeat:

Mistake 1: Only modeling users. You remember the customers and admins, but forget the compliance officer who can block your launch.

Mistake 2: Not documenting conflicts. Business wants speed, compliance wants audit trails. These tensions exist whether you document them or not. Documenting them makes them solvable.

Mistake 3: Assuming stakeholders agree. Different stakeholders want different things. Don't assume alignment—verify it.

Mistake 4: Invisible stakeholders. The finance team getting automated reports, the support team needing data exports—these stakeholders are easy to miss until something breaks.

What to Remember

Stakeholders are more than users. Anyone affected by or affecting your system is a stakeholder, whether they log in or not.

Five types to look for: Primary users, secondary users, business stakeholders, technical stakeholders, and compliance/governance.

Conflicts are normal. Different stakeholders want different things. Model the conflicts explicitly so you can solve them.

Hidden stakeholders cause surprises. Ask "Who can block this launch?" to find stakeholders you might have missed.

Document priorities. When you can't satisfy everyone, know who matters most.

Use personas for critical stakeholders. Detailed personas help you make design decisions when stakeholders aren't available to ask.

Stakeholder modeling isn't about pleasing everyone—it's about understanding the full picture so you can make informed trade-offs. The time you spend identifying stakeholders pays back in avoided crises and smoother launches.

What's Next

Now that you understand who your stakeholders are, Lesson 3 covers the other half of context: external dependencies, constraints, and success criteria. You'll learn how to document what your system depends on and what "success" actually means.

The 3 AM Page: What Dependencies Really Cost

It was 3:14 AM when my phone buzzed. The payment system was down. Customers couldn't check out. Revenue was bleeding.

I stumbled to my laptop, pulled up the dashboards, and started debugging. Everything looked fine—our services were up, databases responding, APIs healthy. But payments kept failing.

Two hours later, I discovered the problem: an external email verification service we used had changed their API. We'd added it as a quick fix six months earlier, never documented it as a critical dependency, and forgotten about it. When they deprecated the old API endpoint, our checkout flow silently broke.

The cost? Four hours of downtime, thousands in lost sales, and a very uncomfortable conversation with the CEO at 6 AM.

The root cause wasn't technical complexity. It was missing documentation. We'd never modeled our dependencies properly, so we didn't know what we depended on—or how to fix it when it broke.

This lesson is about avoiding that 3 AM page. You'll learn how to document every external dependency, understand what constraints actually limit your choices, and define success criteria that prevent surprises.

Learning Goals

By the end of this lesson, you'll be able to:

  • Identify and categorize all external dependencies, not just the obvious ones
  • Document dependencies with the information you'll need at 3 AM
  • Recognize the four types of constraints and how they shape design
  • Define success criteria and SLOs that actually reflect what matters
  • Model complete context in Sruja so nothing gets forgotten

Dependencies: The Systems You Don't Control

Here's a question that took me too long to ask: What happens when your dependencies fail?

Not if. When. Every external service goes down eventually. Every API changes. Every vendor has outages. The question isn't whether it will happen—it's whether you'll be prepared when it does.

Dependencies are external systems, services, or resources your system relies on to function. They're not part of your architecture, but they determine whether your architecture works.

Let me share three dependency failures that taught me why this matters.

The Payment Gateway Outage

A startup I worked with built their entire checkout flow around Stripe. Clean integration, well-tested, ready to launch. Two days before launch, Stripe had a three-hour outage.

They had no fallback. No backup payment processor. No way to process payments manually. The launch was delayed a week while they scrambled to add a second payment provider.

The lesson: Critical dependencies need fallbacks. If you can't function without it, you need a Plan B.

The Email Service Change

I consulted for a company that used SendGrid for transactional emails—password resets, order confirmations, welcome emails. After a year, SendGrid changed their pricing model, and the company's email costs tripled overnight.

They hadn't documented the dependency properly, so they didn't know how many emails they were sending or what alternatives existed. It took three months to migrate to a different provider because the email service was woven throughout the codebase.

The lesson: Document what you depend on, including costs and alternatives. Vendor changes happen.

The Analytics Gap

A team built a real-time dashboard that depended on an analytics API. The dashboard was beautiful—until the analytics provider had an outage. The dashboard didn't fail gracefully; it showed zeros everywhere, causing panic among business users who thought all their traffic had disappeared.

The lesson: Know what happens when dependencies fail. Design for graceful degradation.
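Graceful degradation can be as simple as a wrapper that labels failure explicitly instead of letting it masquerade as real data. A minimal sketch (the `fetch_analytics` function is a hypothetical stand-in for the real analytics API call):

```python
# Hypothetical sketch: wrap a flaky dependency call so failures degrade
# gracefully instead of rendering misleading zeros.

def fetch_analytics():
    # Stand-in for the real analytics API call.
    raise ConnectionError("analytics provider is down")

def with_fallback(call, fallback):
    """Return call()'s result, or a labeled fallback if the dependency fails."""
    try:
        return {"status": "ok", "data": call()}
    except Exception:
        # Degrade explicitly: the UI can show "data unavailable",
        # not zeros that look like real (terrifying) numbers.
        return {"status": "degraded", "data": fallback}

result = with_fallback(fetch_analytics, fallback=None)
print(result["status"])  # → degraded
```

The point is the `status` field: the dashboard can now distinguish "no traffic" from "no data," which is exactly what was missing in the story above.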

Categorizing Dependencies (So You Know What Matters)

Not all dependencies are equal. I've learned to sort them into three buckets:

Critical Dependencies: The System-Stoppers

These are dependencies your system cannot function without. If they go down, you go down.

Examples: Payment gateways, primary databases, authentication services, core APIs.

How to handle them:

  • Document them as critical
  • Have fallbacks or backups
  • Monitor them closely
  • Know your SLA and theirs
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "critical", "vendor"]
    owner "Stripe Inc."
    sla "99.9% uptime"
    mttr "4 hours"
    contact "support@stripe.com"
    fallback "Backup payment processor configured"
    fallback_activation "Manual switch, < 15 minutes"
    cost "$0.30 per transaction"
    compliance ["PCI-DSS Level 1"]
    
    // Critical info for 3 AM debugging
    monitoring "https://status.stripe.com"
    last_incident "2024-01-15 (2 hour outage)"
  }
}

Notice how much information I include. This isn't bureaucracy—it's the information you need at 3 AM when things break. Who owns it? What's the SLA? What's the fallback? How do you contact them? Where's the status page?

Important Dependencies: The Degraded-Experience Ones

These are dependencies that cause problems when they fail, but the system still works in a degraded mode.

Examples: Email services, analytics, CDNs, notification systems.

How to handle them:

  • Document the degradation behavior
  • Queue work for later if possible
  • Don't block core functionality
EmailService = system "Email Service" {
  metadata {
    tags ["external", "important", "vendor"]
    owner "SendGrid"
    sla "99.0% uptime"
    impact "Emails delayed but system works"
    fallback "Queue emails locally, retry when service recovers"
    degradation "Users won't receive notifications immediately"
    
    cost "$14.95/month base + $0.0010/email"
    volume "50k emails/month"
  }
}

The key question: What happens when this goes down? If the answer is "users are annoyed but can still use the system," it's important but not critical.
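The "queue locally, retry when service recovers" fallback can be sketched in a few lines. This is illustrative (not SendGrid's API): an outbox that buffers messages during an outage so core functionality never blocks on email:

```python
# Illustrative sketch: buffer emails locally when the provider is down,
# then flush the queue once it recovers.
from collections import deque

class EmailOutbox:
    def __init__(self, send_fn):
        self.send_fn = send_fn   # the real provider call
        self.queue = deque()     # locally buffered messages

    def send(self, message):
        try:
            self.send_fn(message)
        except ConnectionError:
            # Don't block checkout on email: buffer and move on.
            self.queue.append(message)

    def retry_pending(self):
        """Called by a background job once the provider is healthy again."""
        while self.queue:
            self.send_fn(self.queue.popleft())

# Simulate an outage, then recovery.
delivered = []
provider_up = False

def flaky_send(msg):
    if not provider_up:
        raise ConnectionError("provider outage")
    delivered.append(msg)

outbox = EmailOutbox(flaky_send)
outbox.send("order confirmation")  # provider down: queued, not lost
provider_up = True
outbox.retry_pending()             # flushed on recovery
print(delivered)  # → ['order confirmation']
```

A real implementation would persist the queue and cap retries, but the shape is the same: degraded, not broken.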

Optional Dependencies: The Nice-to-Haves

These dependencies add value but aren't essential. If they fail, the system works normally, just with fewer features.

Examples: Analytics services, A/B testing tools, non-essential integrations.

How to handle them:

  • Don't let them block core functionality
  • Fail gracefully and silently
  • Monitor but don't alert at 3 AM
AnalyticsService = system "Analytics Service" {
  metadata {
    tags ["external", "optional", "vendor"]
    owner "Google Analytics"
    impact "Analytics data lost but core functionality works"
    degradation "Dashboards show gaps in data"
    fallback "None - data loss acceptable"
    
    // Don't wake me up for this
    alerting "Business hours only"
  }
}

Optional doesn't mean unimportant. It means the system can function without it.

Constraints: The Real Design Limits

Constraints are the limitations that shape your architecture. They're not suggestions—they're the boundaries you have to work within.

After years of fighting constraints, I've learned to embrace them. Constraints aren't obstacles; they're design inputs. The best architectures work with constraints, not against them.

Technical Constraints: What's Technically Possible or Required

These are the technical realities you can't ignore.

Examples: "Must use PostgreSQL for ACID transactions," "Must deploy to AWS," "API response time under 200ms," "Support 10k concurrent users."

Shop = system "Shop" {
  metadata {
    technical_constraints {
      "PostgreSQL required for transactional integrity",
      "Maximum API response time: 200ms (p95)",
      "Must support 10,000 concurrent users",
      "Deploy to AWS us-east-1 region",
      "Real-time inventory updates required"
    }
    
    // Why these constraints matter
    rationale {
      "PostgreSQL: Financial transactions require ACID",
      "200ms: User research shows >200ms feels slow",
      "10k users: Peak traffic from marketing campaigns"
    }
  }
}

Technical constraints often feel limiting, but they actually clarify decisions. When you know you need ACID transactions, you stop considering NoSQL databases. That's not a limitation—it's focus.

Business Constraints: The Organizational Realities

These are the business realities: budgets, timelines, team size, strategic goals.

Examples: "Launch by Q4," "Budget is $500k/year," "Team of 3 engineers," "Must support international currencies."

Shop = system "Shop" {
  metadata {
    business_constraints {
      "Launch date: December 1, 2024",
      "Infrastructure budget: $500k/year",
      "Team size: 3 engineers (growing to 5)",
      "Must support USD, EUR, GBP",
      "Mobile-first: 70% of traffic from mobile"
    }
    
    // These are as real as technical constraints
    rationale {
      "Dec 1 launch: Board commitment, marketing scheduled",
      "$500k: Approved budget, no flexibility",
      "3 engineers: Hiring takes 3 months per engineer"
    }
  }
}

Business constraints often frustrate engineers. "Why can't we have more budget? Why is the deadline fixed?" But fighting them doesn't help. Better to understand them and design within them.

Compliance Constraints: The Rules You Must Follow

These are regulatory and legal requirements. You don't get to choose them; they choose you.

Examples: "PCI-DSS for payments," "GDPR for EU users," "HIPAA for health data," "SOC 2 for enterprise customers."

Shop = system "Shop" {
  metadata {
    compliance_constraints {
      "PCI-DSS Level 1 (processing > 6M transactions/year)",
      "GDPR (EU customers)",
      "CCPA (California customers)",
      "SOC 2 Type II (enterprise customers require it)"
    }
    
    // Compliance drives architecture decisions
    implications {
      "PCI-DSS: Cannot store credit card data, must use tokenization",
      "GDPR: Right to deletion, data portability, consent management",
      "SOC 2: Audit logging, access controls, encryption required"
    }
  }
}

Compliance constraints are non-negotiable. You can't launch without them. Building them in from the start is far cheaper than adding them later.

Security Constraints: The Protection Requirements

These are security requirements that shape how you build.

Examples: "All data encrypted at rest," "All API calls authenticated," "No PII in logs," "Minimum TLS 1.3."

Shop = system "Shop" {
  metadata {
    security_constraints {
      "All data encrypted at rest (AES-256)",
      "All API calls authenticated (JWT)",
      "No PII in logs or error messages",
      "Minimum TLS 1.3 for all connections",
      "Secrets in vault, not in code"
    }
  }
}

Security constraints feel like overhead until something goes wrong. Then they're the difference between "we had a security incident" and "we went out of business."

Success Criteria: How You Know You've Won

Success criteria answer a simple question: How do you know if your system is successful?

This seems obvious until you try to answer it. "It works"? Too vague. "Users like it"? Not measurable. "It makes money"? That's a business outcome, not a system property.

I've learned to define success at two levels: business outcomes and system properties.

Business Outcomes (The "Why")

These are the business reasons the system exists:

overview {
  summary "E-commerce platform for online retail"
  
  goals [
    "Increase online revenue by 25%",
    "Reduce abandoned carts by 15%",
    "Enable international expansion (EU, UK)",
    "Reduce support tickets by 30%"
  ]
  
  success_criteria [
    "Checkout completion rate > 80%",
    "Average checkout time < 2 minutes",
    "Customer satisfaction (NPS) > 50",
    "Support tickets per 1000 orders < 5"
  ]
}

These criteria connect the system to business value. They answer "why are we building this?"

System Properties (The "How")

These are measurable system behaviors that support business outcomes. I use SLOs (Service Level Objectives):

Shop = system "Shop" {
  slo {
    availability {
      target "99.9%"
      window "30 days"
      rationale "Less than 9 hours downtime per year"
    }

    latency {
      p95 "200ms"
      p99 "500ms"
      window "7 days"
      rationale "Research shows >200ms feels slow to users"
    }

    errorRate {
      target "0.1%"
      window "7 days"
      rationale "< 1 error per 1000 requests"
    }

    throughput {
      target "10000 req/s"
      window "peak hour"
      rationale "Peak traffic during marketing campaigns"
    }
  }
}

SLOs give you concrete targets. Is the system performing well? Check the SLOs. Are we ready to launch? Check the SLOs. Is something wrong? Check the SLOs.
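A useful habit when setting availability targets is converting the percentage into a concrete downtime "error budget" for the window, so everyone knows what 99.9% actually buys:

```python
# Translate an availability target into a downtime error budget.

def downtime_budget_minutes(availability_pct, window_days):
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)

# 99.9% over a 30-day window:
print(round(downtime_budget_minutes(99.9, 30), 1))        # → 43.2 minutes
# Over a full year, roughly the "9 hours" in the rationale above:
print(round(downtime_budget_minutes(99.9, 365) / 60, 2))  # → 8.76 hours
```

That 43 minutes per month is the budget your team spends on deploys, incidents, and maintenance combined.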

Non-Goals: What You're NOT Building

I've also learned to document what we're NOT doing. This prevents scope creep and sets expectations:

overview {
  goals [
    "Fast checkout",
    "Mobile-first design",
    "Real-time inventory"
  ]
  
  non_goals [
    "Social features (reviews, sharing)",
    "Mobile app (web-only for now)",
    "Marketplace (first-party sales only)",
    "Subscription billing"
  ]
}

Non-goals are liberating. They let you say "that's a good idea, but it's out of scope" without feeling guilty.

A Complete Context Example

Let me show you how everything in this module comes together. This is what a complete context model looks like—stakeholders, dependencies, constraints, and success criteria all in one place:

import { * } from 'sruja.ai/stdlib'

// ============ OVERVIEW ============
// What are we building and why?

overview {
  summary "E-commerce platform for online retail"
  audience "Customers, administrators, business owners"
  scope "Shopping, checkout, order management, inventory"
  
  goals [
    "Increase online revenue by 25%",
    "Reduce abandoned carts by 15%",
    "Support international customers (EU, UK)"
  ]
  
  non_goals [
    "Social features (reviews, sharing)",
    "Mobile app (web-responsive only)",
    "Marketplace (first-party sales only)"
  ]
  
  risks [
    "Payment gateway downtime (critical dependency)",
    "Database scaling limits at peak traffic",
    "GDPR compliance complexity"
  ]
  
  success_criteria [
    "Checkout completion rate > 80%",
    "Average checkout time < 2 minutes",
    "Support 99.9% availability",
    "Page load time < 2s (p95)"
  ]
}

// ============ STAKEHOLDERS ============
// Who matters?

// Primary users
Customer = person "Customer" {
  description "Shoppers purchasing products"
  metadata {
    needs ["Fast checkout", "Easy search", "Mobile-friendly"]
    priority "critical"
  }
}

Administrator = person "Administrator" {
  description "Manages products, orders, inventory"
  metadata {
    needs ["Bulk operations", "Reporting", "Quick updates"]
    priority "high"
  }
}

// Secondary users
SupportAgent = person "Support Agent" {
  description "Helps customers with order issues"
  metadata {
    needs ["Customer lookup", "Order history", "Refund processing"]
    priority "high"
  }
}

// Business stakeholders
BusinessOwner = person "Business Owner" {
  description "Accountable for revenue and profit"
  metadata {
    needs ["Revenue reports", "Conversion metrics"]
    priority "high"
  }
}

// Compliance
ComplianceOfficer = person "Compliance Officer" {
  description "Ensures PCI-DSS and GDPR compliance"
  metadata {
    can_block_launch true
  }
}

// ============ SYSTEM ============
// What are we building?

Shop = system "Shop" {
  WebApp = container "Web Application" {
    technology "React"
  }
  
  API = container "API Service" {
    technology "Node.js"
  }
  
  Database = database "PostgreSQL" {
    technology "PostgreSQL 15"
  }
  
  Cache = database "Redis" {
    technology "Redis 7"
  }
  
  // Constraints
  metadata {
    team ["platform-team"]
    budget "$500k/year infrastructure"
    launch_date "2024-12-01"
    
    technical_constraints {
      "PostgreSQL required for ACID transactions",
      "Maximum API response time: 200ms (p95)",
      "Support 10,000 concurrent users"
    }
    
    business_constraints {
      "Launch by December 1, 2024",
      "Team of 3 engineers (growing to 5)",
      "Must support USD, EUR, GBP"
    }
    
    compliance_constraints {
      "PCI-DSS Level 1",
      "GDPR for EU customers",
      "CCPA for California customers"
    }
  }
  
  // Success criteria
  slo {
    availability {
      target "99.9%"
      window "30 days"
    }
    latency {
      p95 "200ms"
      p99 "500ms"
    }
    errorRate {
      target "0.1%"
    }
  }
}

// ============ DEPENDENCIES ============
// What do we depend on?

// Critical
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external", "critical", "vendor"]
    owner "Stripe Inc."
    sla "99.9% uptime"
    mttr "4 hours"
    contact "support@stripe.com"
    fallback "Backup payment processor (PayPal)"
    cost "$0.30/transaction + 2.9% fee"
    compliance ["PCI-DSS Level 1"]
  }
}

// Important
EmailService = system "Email Service" {
  metadata {
    tags ["external", "important", "vendor"]
    owner "SendGrid"
    sla "99.0% uptime"
    fallback "Queue locally, retry when service recovers"
    cost "$14.95/month + overage"
  }
}

// Optional
AnalyticsService = system "Analytics Service" {
  metadata {
    tags ["external", "optional", "vendor"]
    owner "Google Analytics"
    fallback "None - data loss acceptable"
  }
}

// ============ RELATIONSHIPS ============
// How does everything connect?

// Stakeholder interactions
Customer -> Shop.WebApp "Browses and purchases"
Administrator -> Shop.WebApp "Manages products and orders"
SupportAgent -> Shop.WebApp "Looks up customer info"
BusinessOwner -> Shop "Reviews revenue"
ComplianceOfficer -> Shop "Audits compliance"

// Dependencies
Shop.API -> PaymentGateway "Process payment" [critical]
Shop.API -> EmailService "Send notifications" [important]
Shop.API -> AnalyticsService "Track events" [optional]

view index {
  include *
}

This model tells a complete story. You can see:

  • What we're building and why (overview)
  • Who it's for (stakeholders)
  • How it's structured (system)
  • What we depend on (dependencies)
  • What limits us (constraints)
  • How we measure success (SLOs)

That's the power of complete context modeling. Nothing gets forgotten.

Documenting Decisions (And Why You Made Them)

One more thing I've learned: document your decisions, not just your architecture.

When you choose PostgreSQL over MongoDB, write down why. When you choose Stripe over building payments in-house, write down why. These decisions seem obvious now, but six months from now, you'll forget.

I use Architecture Decision Records (ADRs):

ADR001 = adr "Use PostgreSQL for primary database" {
  status "accepted"
  date "2024-06-15"
  
  context "Need ACID transactions for orders and payments. Team has PostgreSQL experience. Must support complex queries for reporting."
  
  decision "Use PostgreSQL instead of MongoDB or MySQL"
  
  consequences {
    benefits "Strong consistency, ACID transactions, team expertise, mature tooling"
    tradeoffs "Horizontal scaling harder than NoSQL, requires careful schema design"
  }
  
  alternatives [
    "MongoDB: Rejected - no ACID transactions",
    "MySQL: Rejected - team less experienced, fewer advanced features"
  ]
}

ADR002 = adr "Use Stripe for payments" {
  status "accepted"
  date "2024-06-20"
  
  context "Need PCI-compliant payment processing. Building in-house would take 6+ months and require PCI certification."
  
  decision "Use Stripe instead of building in-house or using multiple providers"
  
  consequences {
    benefits "PCI compliance handled, fast integration, excellent documentation"
    tradeoffs "Per-transaction fees, vendor lock-in, limited customization"
  }
}

ADRs save you later when someone asks "Why did we do it this way?" or when you're considering a change and want to understand the original reasoning.

What to Remember

Dependencies will fail. Document them before they do. Include criticality, SLAs, fallbacks, and 3 AM debugging information.

Categorize dependencies: Critical (need fallbacks), important (degrade gracefully), optional (fail silently).

Constraints are design inputs, not obstacles. Technical, business, compliance, and security constraints all shape your architecture. Work with them, not against them.

Define success before you build. Business outcomes connect to business value. SLOs give you measurable targets. Non-goals prevent scope creep.

Document decisions. ADRs capture why you made choices. They're invaluable when revisiting decisions or onboarding new team members.

Context prevents 3 AM pages. The dependency you didn't document, the constraint you ignored, the success criteria you never defined—these are the things that break in production at the worst possible time.

Modeling context isn't bureaucracy. It's survival.

What's Next

Congratulations! You've completed Module 6: Context and the entire Systems Thinking 101 course!

🎉 Course Complete!

You did it. You've made it through all six modules of Systems Thinking 101. That's no small achievement—this material fundamentally changes how you see software systems.

Let me recap what you've learned:

Module 1: Fundamentals - You learned what systems thinking is, the iceberg model, and why seeing the whole system matters more than seeing individual parts.

Module 2: Parts and Relationships - You learned how to identify system components and model how they connect and interact.

Module 3: Boundaries - You learned where systems start and end, what's inside vs. outside, and how to draw meaningful boundaries.

Module 4: Flows - You learned how data and control move through systems, and how to model the pathways that connect components.

Module 5: Feedback Loops - You learned about positive and negative feedback, self-regulating systems, and why cycles aren't errors.

Module 6: Context - You learned about the environment surrounding your system—stakeholders, dependencies, constraints, and success criteria.

You now think differently about architecture. You don't just see code—you see systems. You don't just see features—you see stakeholders and their competing needs. You don't just see databases—you see dependencies and failure modes.

This is the foundation. Everything else in architecture builds on this.

What's Next for You?

You're ready for what comes next:

  • Practice: Take a system you're working on and model it in Sruja. Apply what you've learned. See what you discover.

  • Go deeper: The System Design 101 course dives into specific patterns, trade-offs, and real-world architectures.

  • Explore: Check out the tutorials for hands-on exercises building real architectures.

  • Share: Teach someone else what you've learned. The best way to solidify knowledge is to teach it.

A Final Thought

I want to leave you with something that took me years to understand: Great architecture isn't about being perfect. It's about being aware.

You won't always make the right decisions. You won't always anticipate every problem. You'll still have 3 AM pages. But with systems thinking, you'll understand WHY things break, WHAT to do about them, and HOW to prevent the same problems next time.

That awareness—the ability to see systems holistically, to model their complexity, to anticipate their failures—that's what makes you an architect.

Now go build something amazing. And when it breaks at 3 AM (and it will), you'll know what to do.

Congratulations on completing Systems Thinking 101! 🚀

System Design 101: Fundamentals

Note

Who is this for? Developers moving into senior roles, students preparing for interviews, or anyone curious about how massive systems like Netflix or Uber work.

Why Learn System Design?

Writing code is only half the battle. As you grow in your career, the challenges shift from "how do I write this function?" to "how do I ensure this system handles 10 million users?"

System design is the skill of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements. It's about making the right trade-offs.

What You Will Learn

By the end of this course, you will be able to:

  1. Speak the Language: Confidently use terms like latency, throughput, consistency, and availability.
  2. Use the Toolbox: Know when to use a relational database vs. NoSQL, or when to introduce a cache or message queue.
  3. Draw the Blueprint: Visualize your ideas using industry-standard diagrams (C4 model).
  4. Scale for Success: Understand how to take a system from 1 user to 1,000,000 users.

Course Structure

This course is broken down into digestible modules:

Module 1: Core Concepts

The foundational pillars of distributed systems. We cover Scalability (Vertical vs Horizontal), Reliability, and Maintainability.

Module 2: The Building Blocks

A deep dive into the components that make up a system:

  • Load Balancers: The traffic cops of the internet.
  • Databases: SQL vs NoSQL, replication, and sharding.
  • Caches: speeding up access with Redis/Memcached.
  • Message Queues: Decoupling services with Kafka/RabbitMQ.

Module 3: Architectural Patterns

How to organize code and services:

  • Monolith vs Microservices
  • Event-Driven Architecture
  • API Gateway Pattern

Module 4: The Interview Guide

Practical tips for acing the system design interview, including a framework for tackling open-ended problems.

Prerequisites

  • Basic understanding of how the web works (HTTP, DNS, Client-Server).
  • Familiarity with at least one programming language.
  • No prior distributed systems knowledge required.

Let's Begin

Start your journey with Module 1: Fundamentals.

Module 1: Fundamentals

Tip

The Interview Secret: Most candidates fail not because they don't know the tech, but because they dive into solutions too early. This module fixes that.

What's Inside?

This isn't just theory. It's the playbook for how senior engineers approach broad, ambiguous problems.

  1. What is System Design?: Defining the game we're playing.
  2. The Art of Requirements: How to extract the real problem from a vague prompt.
  3. The C4 Model: A standardized way to draw your ideas so others actually understand them.
  4. Trade-offs: Why "it depends" is the only correct answer (and how to explain what it depends on).
  5. Sruja Basics: Your first architecture-as-code model.

Learning Goals

By the end of this module, you will be able to:

  • Distinguish between Functional and Non-Functional Requirements.
  • Calculate rough capacity estimates (back-of-the-envelope math).
  • Draw a high-level System Context diagram.
  • Explain the CAP theorem in plain English.
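Back-of-the-envelope capacity math is mostly multiplication and one constant (86,400 seconds per day). A quick sketch with illustrative inputs (the user counts and peak factor are assumptions, not figures from this course):

```python
# Back-of-the-envelope capacity estimate. Assumed inputs:
# 10M daily active users, 20 requests each per day, 5x peak-to-average ratio.

SECONDS_PER_DAY = 86_400

dau = 10_000_000
requests_per_user_per_day = 20
peak_factor = 5

avg_qps = dau * requests_per_user_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * peak_factor

print(round(avg_qps))   # → 2315 requests/second on average
print(round(peak_qps))  # → 11574 requests/second at peak
```

In an interview, stating the assumptions out loud matters more than the exact numbers: the interviewer wants to see you size the peak, not just the average.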

Ready?

Let's start your journey with Lesson 1: What is System Design?

Lesson 1: The Mindset

The Shift

When you write code, you are building a single room. You care about the furniture (variables), the flow (logic), and the usability (UI).

System Design is city planning.

You stop caring about the furniture in every room. Instead, you care about:

  • Traffic flow: Can the roads handle rush hour? (Throughput)
  • Utilities: Is there enough water and electricity? (Capacity)
  • Disaster recovery: What happens if the power plant explodes? (Reliability)
  • Expansion: Can we add a new suburb next year? (Scalability)

Real-World Case Studies

Case Study 1: The Netflix Chaos Monkey (Success)

Between 2008 and 2011, Netflix faced a critical problem. They had moved from DVD-by-mail to streaming, but their single datacenter in Virginia was a single point of failure. One morning, the datacenter experienced a major outage—millions of customers couldn't watch anything.

The System Design Decision: Instead of buying bigger, better datacenters (Vertical Scaling), Netflix chose to:

  • Move everything to cloud infrastructure (AWS)
  • Adopt a microservices architecture
  • Build Chaos Monkey—a tool that randomly kills services to test resilience

The Result:

  • Netflix went from 99.9% uptime to 99.99%+ uptime
  • They handle millions of concurrent streams globally
  • When AWS has an outage in one region, Netflix users don't even notice

Case Study 2: Healthcare.gov Launch (Failure)

In 2013, the US government launched Healthcare.gov with these requirements:

  • Handle millions of users trying to sign up simultaneously
  • Integrate with dozens of legacy systems (IRS, insurance companies, etc.)
  • 100% data accuracy (no room for errors in health coverage)

The System Design Mistakes:

  • No load testing before launch (assumed it would "just work")
  • Tightly coupled architecture with no caching layer
  • Single database bottleneck (no sharding)
  • No graceful degradation (the entire site crashed instead of showing partial results)

The Consequences:

  • Site crashed immediately—only 6 people could sign up on day one
  • Cost $1.7 billion and took 2 years to fix
  • Public relations disaster and loss of trust

The Fix:

  • Added a caching layer (Redis) to handle read-heavy traffic
  • Implemented horizontal scaling with auto-scaling groups
  • Added circuit breakers to prevent cascading failures
  • Built queue-based architecture for background processing
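The circuit-breaker idea mentioned in the fix can be sketched in a few lines. This is a minimal illustration (not the actual Healthcare.gov implementation): after N consecutive failures the breaker "opens" and calls fail fast instead of hammering a struggling dependency and cascading the failure:

```python
# Minimal circuit-breaker sketch: fail fast after repeated failures.

class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.threshold

    def call(self, fn):
        if self.is_open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
            self.failures = 0   # a success resets the count
            return result
        except ConnectionError:
            self.failures += 1
            raise

breaker = CircuitBreaker(threshold=3)

def failing_backend():
    raise ConnectionError("legacy system timeout")

for _ in range(3):
    try:
        breaker.call(failing_backend)
    except ConnectionError:
        pass

print(breaker.is_open)  # → True: subsequent calls fail fast
```

Production breakers (e.g. in resilience libraries) also add a timed "half-open" state to probe for recovery, but the core mechanism is exactly this counter.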

Case Study 3: Instagram's 2010 Growth Spike

When Instagram launched in 2010, they had:

  • 2 servers (one for the app, one for the database)
  • 50,000 users on launch day
  • Within a week: 1 million users
  • Within a month: 10 million users

The Challenge: Their architecture couldn't handle the exponential growth. The database was overwhelmed, and image uploads were timing out.

The System Design Solution:

  1. Database Sharding: Split user data across multiple database servers by user ID
  2. Content Delivery Network (CDN): Host images on edge servers globally
  3. Read Replicas: Created multiple read-only copies of the database
  4. Async Processing: Moved image processing to background queues

The Numbers:

  • Before fix: 95% uptime, 30-second image uploads
  • After fix: 99.9% uptime, <1-second image uploads
  • They scaled from 2 servers to 500+ servers in 6 months
  • Eventually acquired by Facebook for $1 billion

Important

The golden rule: In system design, there are no right answers, only trade-offs.

Functional vs Non-Functional

Every system has two sets of requirements. In an interview (and real life), 90% of your initial grade comes from clarifying these before you draw a single box.

1. Functional Requirements (The "What")

These are the features. If the system doesn't do this, it's useless.

Basic Examples:

  • User can post a tweet.
  • User can follow others.
  • User sees a news feed.

Real-World Examples by Industry:

| Industry | Functional Requirements | Real System Example |
|---|---|---|
| E-commerce | Browse products, add to cart, checkout, track orders | Amazon, Shopify |
| Social Media | Post content, follow users, like/comment, real-time notifications | Twitter, Instagram |
| Streaming | Video playback, quality adjustment, search, watchlist | Netflix, YouTube |
| Banking | Transfer money, view balance, pay bills, transaction history | Chase, Revolut |
| Healthcare | Book appointments, view records, message doctors, prescription management | Teladoc, Epic |

Advanced Examples from Production Systems:

Uber's Real-Time Requirements:

  • Driver tracking updates every 4 seconds
  • Passenger requests must be matched to drivers within <5 seconds
  • Surge pricing calculated dynamically based on real-time supply/demand
  • Payment processing within <2 seconds after ride completion

Spotify's Music Streaming Requirements:

  • <200ms latency for track start (no buffering delay)
  • Support offline playback with 10,000+ songs cached
  • Real-time collaborative playlists with <500ms sync
  • Personalized recommendations with 50M+ tracks in catalog

Airbnb's Booking Requirements:

  • Support concurrent bookings for same property (prevent double-booking)
  • 24-hour hold on bookings before payment
  • Real-time availability sync across 190+ countries
  • Instant book feature (no host approval required)

2. Non-Functional Requirements (The "How")

These are the constraints. If the system doesn't meet these, it will crash/fail/be too slow.

Basic Examples:

  • Scalability: Must handle 100M daily active users.
  • Latency: Feed must load in < 200ms.
  • Consistency: A tweet must appear on followers' feeds within 5 seconds.

Real-World Production Requirements:

| System | Availability | Latency | Throughput | Data Size |
|---|---|---|---|---|
| Google Search | 99.99% | <0.5 seconds | 63,000 queries/second | 100+ petabytes |
| Netflix Streaming | 99.99% | <2 seconds (start) | 100M+ concurrent streams | 1+ petabytes/day |
| WhatsApp | 99.9% | <100ms (message delivery) | 65B+ messages/day | 4+ petabytes/year |
| Twitter (X) | 99.9% | <200ms (timeline) | 500M+ tweets/day | 500+ petabytes |
| AWS S3 | 99.999999999% (11 nines) | <100ms (GET) | 20M+ requests/second | 100+ exabytes |

Industry-Specific Requirements:

Finance (Banking/Trading):

  • Strong Consistency: Account balances must be 100% accurate (no eventual consistency)
  • Auditability: Every transaction must be logged and traceable
  • Compliance: GDPR, PCI-DSS, SOX compliance required
  • Low Latency: Trading decisions in microseconds for high-frequency trading

Healthcare:

  • HIPAA Compliance: All data encrypted at rest and in transit
  • High Availability: Patient data must be accessible 24/7
  • Privacy: Strict access controls and audit logs
  • Disaster Recovery: RPO (Recovery Point Objective) < 1 hour, RTO (Recovery Time Objective) < 4 hours

Gaming:

  • Real-Time: <50ms latency for multiplayer gaming
  • High Throughput: Handle millions of concurrent players
  • Scalable: Auto-scale for game launches and events
  • Anti-Cheat: Prevent cheating and hacking

IoT (Internet of Things):

  • High Ingest Rate: Handle millions of devices sending data simultaneously
  • Edge Computing: Process data locally to reduce bandwidth
  • Low Power: Devices operate on battery for years
  • Intermittent Connectivity: Work with unstable network connections
graph TD
    A[Requirements] --> B[Functional]
    A --> C[Non-Functional]
    B --> B1[Features]
    B --> B2[APIs]
    B --> B3[User Flows]
    C --> C1[Scalability]
    C --> C2[Reliability]
    C --> C3[Latency]
    C --> C4[Cost]

    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#bfb,stroke:#333

The "It Depends" Game

Junior engineers search for the "best" database. Senior engineers ask "what are we optimizing for?"

| You Optimize For | You Might Sacrifice | Example |
|---|---|---|
| Consistency | Availability | Banking (balances must be correct, even if the system goes down briefly) |
| Availability | Consistency | Social media (better to show old likes than an error page) |
| Write Speed | Read Speed | Logging (write fast, read rarely) |
| Development Speed | Performance | Startups (ship a Python/Ruby MVP fast, rewrite later) |

Practical Trade-Off Scenarios

Scenario 1: Building a Real-Time Chat App

Context: You're building a chat app like Slack or Discord. Users expect messages to appear instantly.

The Trade-Off Decisions:

| Decision | Option A | Option B | What You Choose & Why |
|---|---|---|---|
| Message Storage | Relational DB (PostgreSQL) | NoSQL (Cassandra) | NoSQL - high write throughput, eventual consistency acceptable |
| Real-time Updates | Polling (client asks server every 5s) | WebSockets (server pushes updates) | WebSockets - lower latency, less server load |
| Message History | Keep forever | 90-day retention | 90-day retention - reduces storage costs, most users don't need old messages |
| Online Status | Check on every message | Heartbeat every 30s | Heartbeat - scales better, less database load |

Performance Impact:

  • Polling approach: 100K users × 1 request per 5s = 20,000 requests/second just to check for new messages
  • WebSocket approach: 100K users × 1 heartbeat per 30s ≈ 3,300 requests/second, with actual messages pushed only when they exist
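The load figures above can be sanity-checked with simple arithmetic. A sketch (the user count and intervals are the scenario's assumptions; it also assumes every user is online and heartbeating on the full 30s interval, i.e. the worst case for WebSockets):

```python
# Back-of-envelope request load for the chat scenario.
# Assumptions: 100K concurrent users, 5s polling interval,
# 30s WebSocket heartbeat interval.

def steady_state_rps(users: int, interval_seconds: float) -> float:
    """Requests per second when every user fires once per interval."""
    return users / interval_seconds

polling_rps = steady_state_rps(100_000, 5)     # every client asks "anything new?"
heartbeat_rps = steady_state_rps(100_000, 30)  # connected sockets only ping

print(f"Polling:    {polling_rps:>9,.0f} req/s")
print(f"Heartbeats: {heartbeat_rps:>9,.0f} req/s")
```

Even in this worst case, heartbeats generate a small fraction of the polling load, and unlike polling they carry no wasted "no new messages" round trips.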

Scenario 2: Building an E-Commerce Platform

Context: You're building Amazon-scale e-commerce. Need to handle Black Friday traffic spikes.

The Architecture Trade-Offs:

import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce Platform" {
    description "High-volume retail platform with trade-off decisions documented"
    
    // TRADE-OFF 1: Read-heavy vs Write-heavy
    ProductDB = database "Product Catalog Database" {
        technology "PostgreSQL with Read Replicas"
        description "Optimized for READ operations (99% of traffic is reads)"
        
        tradeoff {
            decision "Use read replicas for product browsing"
            sacrifice "Write latency (updates take longer to propagate)"
            reason "Users browse products 100x more than they add products"
            metric "Read:Write ratio = 100:1"
        }
    }
    
    // TRADE-OFF 2: Strong vs Eventual Consistency
    CartService = container "Shopping Cart Service" {
        technology "Redis"
        description "In-memory cache for cart state"
        
        tradeoff {
            decision "Use Redis (in-memory) for cart storage"
            sacrifice "Durability (cart data lost if Redis crashes)"
            reason "Cart data is temporary and can be recreated from product catalog"
            mitigation "Periodic snapshots to persistent storage"
        }
    }
    
    // TRADE-OFF 3: Cost vs Performance
    SearchEngine = container "Product Search" {
        technology "Elasticsearch"
        description "Full-text search with caching layer"
        
        tradeoff {
            decision "Use expensive Elasticsearch cluster"
            sacrifice "Infrastructure cost ($5K/month)"
            reason "Search performance directly impacts conversion rates (every 100ms of added latency ≈ 1% revenue loss)"
            metric "Search latency <200ms required for optimal UX"
        }
    }
    
    // TRADE-OFF 4: Availability vs Consistency
    OrderService = container "Order Processing" {
        technology "Kafka + Microservices"
        description "Async order processing pipeline"
        
        tradeoff {
            decision "Use async messaging (eventual consistency)"
            sacrifice "Real-time inventory accuracy"
            reason "Better availability and scalability during peak traffic"
            mitigation "Compensating transactions to handle over-selling"
        }
    }
}

Scenario 3: CAP Theorem in Practice

Real-World Example: Netflix vs. PayPal

Netflix (Choose Availability):

  • If a user can't watch a video, they might cancel subscription
  • Trade-off: Occasionally show stale content recommendations
  • Architecture: AP system (Available, Partition-tolerant, Eventually consistent)
  • Data: Video recommendations, watch history, user preferences

PayPal (Choose Consistency):

  • If a transaction is processed incorrectly, lawsuits happen
  • Trade-off: Brief service interruptions during network partitions
  • Architecture: CP system (Consistent, Partition-tolerant, Limited availability)
  • Data: Account balances, transaction records, payment processing

The Decision Matrix:

Ask yourself:
1. What happens if the data is wrong?
   → If lawsuits/financial loss → Prioritize Consistency (PayPal model)
   → If just bad UX → Prioritize Availability (Netflix model)

2. What's the tolerance for downtime?
   → Zero tolerance → Prioritize Availability (Instagram for celebrity photos)
   → Some tolerance OK → Prioritize Consistency (Banking)

3. Can you design around the trade-off?
   → Yes: Use hybrid approach (read-optimized cache + write-optimized DB)
   → No: Pick one and accept the consequences

Sruja Integration

In Sruja, we treat requirements as code. This keeps your constraints right next to your architecture.

Why Kinds and Types Matter

In Sruja, you declare kinds to establish the vocabulary of your architecture. This isn't just syntax—it provides real benefits:

  1. Early Validation: If you typo an element type (e.g., sytem instead of system), Sruja catches it immediately.
  2. Better Tooling: IDEs can provide autocomplete and validation based on your declared kinds.
  3. Self-Documentation: Anyone reading your model knows exactly which element types are available.
  4. Custom Vocabulary: You can define your own kinds (e.g., microservice = kind "Microservice") to match your domain.
  5. Flat and Clean: With Sruja's flat syntax, these declarations live at the top of your file—no specification wrapper block required.

Example: Requirements-Driven Architecture

import { * } from 'sruja.ai/stdlib'

// 1. Defining the "What" (Functional)
requirement R1 functional "Users can post short text messages (tweets)"

// 2. Defining the "How" (Non-Functional)
requirement R2 performance "500ms p95 latency for reading timeline"
requirement R3 scale "Store 5 years of tweets (approx 1PB)"
requirement R4 availability "99.9% uptime SLA"

// 3. The Architecture follows the requirements
Twitter = system "The Platform" {
    description "Satisfies R1, R2, R3, R4"

    TimelineAPI = container "Timeline API" {
        technology "Rust"
        description "Satisfies R2 - optimized for low latency"

        slo {
            latency {
                p95 "500ms"
                window "7 days"
            }
            availability {
                target "99.9%"
                window "30 days"
            }
        }
    }

    TweetDB = database "Tweet Storage" {
        technology "Cassandra"
        description "Satisfies R3 - distributed storage for 1PB scale"
    }

    TimelineAPI -> TweetDB "Reads/Writes"
}

// 4. Document the decision
ADR001 = adr "Use Cassandra for tweet storage" {
    status "Accepted"
    context "Need to store 1PB of tweets with high write throughput"
    decision "Use Cassandra for distributed, scalable storage"
    consequences "Excellent scalability, eventual consistency trade-off"
}

view index {
    title "Twitter Platform Overview"
    include *
}

// Performance-focused view
view performance {
    title "Performance View"
    include Twitter.TimelineAPI Twitter.TweetDB
}

Knowledge Check

Q: My boss says "We need to handle infinite users". How do you respond?

Bad Answer: "Okay, I'll use Kubernetes and sharding."

Senior Answer: "Infinite is expensive. Do we expect 1k users or 100M users? The design for 1k costs $50/mo. The design for 100M costs $50k/mo. Let's define a realistic target for the next 12 months."

Q: Why not just use the fastest database for everything?

Because "fastest" depends on the workload. A database fast at reading (Cassandra) might be complex to manage. A database fast at relationships (Neo4j) might scale poorly for heavy writes. Trade-offs.

Quiz: Test Your Knowledge

Ready to apply what you've learned? Take the interactive quiz for this lesson!

1. In system design, what do we call requirements that describe the features and functionality of a system (what it should do)?

Click to see answer

Answer: Functional

Alternative answers:

  • functional requirements

Explanation: Functional requirements define the features and capabilities of the system. Examples: "User can post a tweet," "User can browse products."


2. In system design, what do we call requirements that describe how the system should perform (constraints like speed, scalability, reliability)?

Click to see answer

Answer: Non-functional

Alternative answers:

  • non-functional
  • non functional
  • NFR
  • NFRs

Explanation: Non-functional requirements define the quality attributes and constraints of the system. Examples: "Must handle 100M users," "Response time <200ms."


3. A banking system must ensure that account balances are always accurate and transactions cannot be lost. Which trade-off would you prioritize?

  • a) Prioritize availability over consistency (it's better to show wrong data than no data)
  • b) Prioritize development speed over performance (ship an MVP first)
  • c) Prioritize write speed over read speed (logging-focused optimization)
  • d) Prioritize consistency over availability (brief downtime is acceptable, but data must be correct)


4. You're building a real-time chat application like Discord. Users expect messages to appear instantly across all devices. What's the best architecture approach?

  • a) Use relational database with strong consistency (PostgreSQL) for all message storage
  • b) Use HTTP polling where clients check for new messages every 5 seconds
  • c) Use eventual consistency with 24-hour delay synchronization
  • d) Use WebSockets for real-time push with eventual consistency for message storage


5. Netflix experienced a major outage in 2008 when their single datacenter failed. What was their system design solution?

  • a) Bought a bigger, more expensive datacenter with better hardware (Vertical scaling)
  • b) Hired more operations engineers to manually failover systems
  • c) Built a single, massive monolithic application on dedicated servers
  • d) Moved to cloud infrastructure with microservices and built Chaos Monkey to test resilience


6. Healthcare.gov's initial launch in 2013 was a disaster. Which of these was NOT one of their system design mistakes?

  • a) No load testing before launch
  • b) Tightly coupled architecture with no caching layer
  • c) Single database bottleneck with no sharding
  • d) Using cloud infrastructure instead of dedicated on-premise servers


7. You're building a product search engine for an e-commerce site handling 10M products. The search feature generates 99% of traffic. What optimization should you prioritize?

  • a) Optimize for write speed (since products are added frequently)
  • b) Use a single-node relational database for simplicity
  • c) Disable caching to ensure always-fresh search results
  • d) Use read replicas and a specialized search engine like Elasticsearch


8. Instagram launched with 2 servers and grew to 10M users in one month. What was their key architectural change to handle this growth?

  • a) Rewrote the entire application in a different programming language
  • b) Bought the biggest available server (vertical scaling)
  • c) Removed all features to reduce complexity
  • d) Implemented database sharding, CDN for images, and async processing for image handling


9. Which of the following statements best describes the relationship between latency and throughput?

  • a) Low latency always means high throughput
  • b) High latency always means high throughput
  • c) Latency and throughput are the same thing
  • d) A system can have low latency but low throughput, or high latency but high throughput


10. Your boss says "We need to handle infinite users." What's the most appropriate response?

  • a) Great! I'll immediately implement Kubernetes and distributed sharding
  • b) Impossible! Let's cap users at 1,000 and reject anyone else
  • c) Let's build the system assuming unlimited resources regardless of cost
  • d) Infinite is expensive. Let's define realistic targets for the next 12 months (e.g., 100K users) and design for that


11. What is the term for the system design principle that means every decision involves sacrificing one quality to gain another (e.g., choosing consistency means sacrificing availability)?

Click to see answer

Answer: Trade-off

Alternative answers:

  • trade-off
  • tradeoff

Explanation: Trade-offs are fundamental to system design. There are no perfect solutions—every architecture choice involves benefits and costs. "It depends" is the correct answer because it depends on which trade-offs you choose.


This quiz covers:

  • Functional vs Non-functional requirements
  • Real-world case studies (Netflix, Healthcare.gov, Instagram)
  • Trade-off decisions in system design
  • Practical scenarios and decision-making

Next Steps

Now that we have the mindset, let's learn the language. 👉 Lesson 2: The Vocabulary of Scale

Lesson 2: The Vocabulary of Scale

To design big systems, you need to speak the language.

1. Scaling: Up vs Out

When your website crashes because too many people are using it, you have two choices.

Vertical Scaling (Scaling Up)

"Get a bigger machine." You upgrade from a 4GB RAM server to a 64GB RAM server.

  • Pros: Easy. No code changes.
  • Cons: Expensive. Finite limit (you can't buy a 100TB RAM server... easily). Single point of failure.

Horizontal Scaling (Scaling Out)

"Get more machines." You buy 10 cheap servers and split the traffic between them.

  • Pros: Near-limitless scale (Google runs millions of servers). Resilient (if one server dies, others take over).
  • Cons: Complex. You need load balancers and data consistency strategies.
graph TD
    subgraph Vertical [Vertical Scaling]
        Small[Server] -- Upgrade --> Big[SERVER]
    end

    subgraph Horizontal [Horizontal Scaling]
        One[Server] -- Add More --> Many1[Server]
        One -- Add More --> Many2[Server]
        One -- Add More --> Many3[Server]
    end
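A rough sizing sketch makes the scale-up vs scale-out comparison concrete (all capacity and traffic numbers below are invented for illustration, not benchmarks):

```python
# How many commodity servers does it take to scale out to a traffic target,
# versus scaling up a single machine? Illustrative numbers only.

import math

target_rps = 50_000        # peak traffic to serve
small_server_rps = 2_000   # one cheap commodity box
biggest_box_rps = 40_000   # the largest single machine available

servers_needed = math.ceil(target_rps / small_server_rps)
print(f"Horizontal: {servers_needed} small servers meet the target")

shortfall = target_rps - biggest_box_rps
print(f"Vertical: the biggest box is {shortfall} req/s short, "
      f"and it's a single point of failure")
```

With these assumed numbers, scaling out reaches the target while scaling up cannot, which is the structural point of the diagram above.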

2. Speed: Latency vs Throughput

In interviews, never just say "it needs to be fast". Be specific.

  • Latency: The time it takes for one person to get a result.
    • Metaphor: The time it takes to drive from A to B.
    • Unit: Milliseconds (ms).
  • Throughput: The number of people the system can serve at the same time.
    • Metaphor: The width of the highway (how many cars per hour).
    • Unit: Requests per Second (RPS).

Tip

Use the right word: A system can have low latency (fast response) but low throughput (crashes if 5 people use it). A highway can have high throughput (10 lanes) but high latency (traffic jam).
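The tip can be demonstrated with a simplified model: if each worker handles one request at a time, per-worker throughput is capped at 1/latency, so system throughput is workers ÷ latency (this ignores queueing; the numbers are illustrative):

```python
# Simplified latency/throughput relationship for serve-one-at-a-time workers.

def max_throughput_rps(latency_seconds: float, workers: int = 1) -> float:
    """Upper bound on requests/second: each worker finishes one request
    every `latency_seconds`, so capacity = workers / latency."""
    return workers / latency_seconds

# Low latency, low throughput: a single fast worker still caps out early.
single = max_throughput_rps(0.050, workers=1)    # 50 ms per request

# Same latency, much higher throughput: add workers (scale out).
pool = max_throughput_rps(0.050, workers=100)

print(f"1 worker:    {single:.0f} req/s")
print(f"100 workers: {pool:.0f} req/s")
```

Both configurations answer any single request in 50ms (same latency), but their capacity differs by 100x (different throughput), which is exactly why the two words must not be conflated.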

3. Sruja in Action

Sruja allows you to define horizontal scaling requirements explicitly using the scale block.

import { * } from 'sruja.ai/stdlib'


ECommerce = system "E-Commerce System" {
    WebServer = container "Web App" {
        technology "Rust, Axum"

        // Explicitly defining Horizontal Scaling
        scale {
            min 3            // Start with 3 servers
            max 100          // Scale up to 100
            metric "cpu > 80%"
        }
    }

    Database = database "Primary DB" {
        technology "PostgreSQL"
        // Describing Vertical Scaling via comments/description
        description "Running on a massive AWS r5.24xlarge instance (Vertical Scaling)"
    }

    WebServer -> Database "Reads/Writes"
}

view index {
    include *
}

Knowledge Check

Q: Why don't we just vertically scale forever?

Because physics. There is a limit to how fast a single CPU can be. Also, if that one super-computer catches fire, your entire business is dead.

Quiz: Test Your Knowledge

Ready to apply what you've learned? Take the interactive quiz for this lesson!

1. What type of scaling involves upgrading a single machine with more resources (more RAM, CPU, disk space)?

Click to see answer

Answer: Vertical

Alternative answers:

  • vertical scaling
  • scale up
  • scale-up

Explanation: Vertical scaling (or scaling up) means making a single machine more powerful. Example: Upgrading from 4GB RAM to 64GB RAM on one server.


2. What type of scaling involves adding more machines to distribute the load?

Click to see answer

Answer: Horizontal

Alternative answers:

  • horizontal scaling
  • scale out
  • scale-out

Explanation: Horizontal scaling (or scaling out) means adding more machines to handle increased load. Example: Adding 10 servers instead of upgrading one server to be more powerful.


3. Why don't we just vertically scale forever to handle all growth?

  • a) Vertical scaling is always more expensive than horizontal scaling
  • b) Vertical scaling requires more maintenance and monitoring
  • c) Vertical scaling is only available on cloud platforms
  • d) There are physical limits to how powerful a single machine can be, and it's a single point of failure


4. Your application needs to handle 10x traffic during a holiday sale (from 100K to 1M users per hour). You have 2 weeks to prepare. What's the best approach?

  • a) Vertically scale by buying the most powerful server available (it handles 2M users)
  • b) Rewrite the entire application to be microservices-based
  • c) Tell users the site will be slow during the sale
  • d) Implement horizontal scaling with auto-scaling groups that can add servers automatically based on load


5. A startup has a monolithic application running on a single server. They expect to grow from 100 to 10,000 users over the next year. What's their best scaling strategy?

  • a) Immediately migrate to microservices architecture on Kubernetes
  • b) Start with 100 servers to prepare for future growth
  • c) Do nothing and hope the single server handles the load
  • d) Start with vertical scaling, then migrate to horizontal scaling when needed


6. What's the main disadvantage of horizontal scaling?

  • a) It's more expensive than vertical scaling
  • b) It has a finite limit to how much you can scale
  • c) It can't handle traffic spikes
  • d) It introduces complexity in data consistency, load balancing, and distributed systems management


7. You're designing a high-frequency trading system where every microsecond matters. Which scaling approach is most appropriate?

  • a) Horizontal scaling across multiple datacenters worldwide
  • b) Caching everything and accepting stale data
  • c) No scaling needed, as HFT systems don't handle much traffic
  • d) Vertical scaling on a single machine in the same datacenter as the stock exchange


8. What term describes the time it takes for a single request to complete, measured in milliseconds?

Click to see answer

Answer: Latency

Alternative answers:

  • response time
  • latency

Explanation: Latency is the time from when a request is sent to when the response is received. Think of it as the time it takes to drive from point A to point B.


9. What term describes how many requests a system can handle simultaneously, measured in requests per second (RPS)?

Click to see answer

Answer: Throughput

Alternative answers:

  • throughput
  • capacity
  • concurrent requests

Explanation: Throughput is the volume of work a system can handle. Think of it as the width of a highway—how many cars can travel per hour.


10. Can a system have low latency but low throughput?

  • a) No, low latency always means high throughput
  • b) No, these terms are synonyms
  • c) Only in distributed systems
  • d) Yes—a single-lane road has low latency (no traffic jam) but low throughput (few cars per hour)


11. YouTube must serve videos to millions of users simultaneously. What's the most important metric for their success?

  • a) Low latency for video upload
  • b) High throughput for video streaming
  • c) Strong consistency for user preferences
  • d) High throughput for video streaming with acceptable latency for video start


12. A REST API averages 50ms latency but can only handle 100 requests/second before becoming unresponsive. You need to support 1,000 requests/second. What's the first step?

  • a) Optimize code to reduce latency from 50ms to 5ms
  • b) Increase the timeout to handle more concurrent requests
  • c) Add caching for everything
  • d) Horizontally scale by running multiple instances behind a load balancer


13. Google Search needs to return results in under 500 milliseconds for 63,000 queries per second. What's their architectural approach?

  • a) One supercomputer with infinite RAM
  • b) Caching everything for 24 hours
  • c) Accepting slower response times during peak hours
  • d) Horizontal scaling with distributed computing, pre-computed indexes, and edge caching


14. Your database has a read-to-write ratio of 1000:1 (users read data 1000x more than they write it). What scaling strategy is most effective?

  • a) Add more powerful CPUs for write operations
  • b) Shard the database based on write patterns
  • c) Optimize write queries since they're the bottleneck
  • d) Use read replicas to distribute read load across multiple database copies


15. When should you choose vertical scaling over horizontal scaling?

  • a) When you need to handle millions of concurrent users
  • b) When your application has no shared state and is stateless
  • c) When cost and complexity are not concerns
  • d) When you need a quick solution, have low traffic, or your application has complex shared state


16. What component distributes incoming network traffic across multiple servers to enable horizontal scaling?

Click to see answer

Answer: Load balancer

Alternative answers:

  • load balancer
  • load balancers
  • LB
  • proxy

Explanation: Load balancers are the "traffic cops" that distribute requests across multiple servers, enabling horizontal scaling and providing resilience by routing around failed servers.


17. In a horizontally scaled system, what happens if one server fails?

  • a) The entire system crashes
  • b) All traffic stops until the server is repaired
  • c) The load balancer sends more traffic to the failed server
  • d) The load balancer stops sending traffic to the failed server and routes it to the remaining healthy servers


18. Sruja allows you to define horizontal scaling explicitly. What does min 3, max 100, metric "cpu > 80%" mean in a scale block?

  • a) Always run exactly 3 servers, maximum CPU usage 80%
  • b) Scale vertically when CPU is under 80%
  • c) Never scale down below 3 servers regardless of CPU
  • d) Start with 3 servers, add more up to 100 when CPU exceeds 80%, remove servers when CPU is below threshold


19. Your e-commerce site's product catalog page loads in 2 seconds, but during a sale it slows to 10 seconds. Which metric degraded?

  • a) Throughput decreased
  • b) The database ran out of storage
  • c) The page size increased
  • d) Latency increased (response time got worse) due to increased load on the system


20. A system has 99.9% uptime, meaning it can be down for about 8.76 hours per year. If you want 99.99% uptime, how much downtime is acceptable per year?

  • a) 8.76 hours (same as 99.9%)
  • b) 1 hour
  • c) 1 minute
  • d) About 52.6 minutes (8.76 hours ÷ 10)


This quiz covers:

  • Vertical vs Horizontal scaling strategies
  • When to use each scaling approach
  • Latency vs Throughput concepts
  • Real-world scaling scenarios (YouTube, Google, HFT)
  • Load balancing and auto-scaling
  • Practical scaling decisions

Next Steps

We have the mindset, and we have the words. Now let's draw. 👉 Lesson 3: The C4 Model (Visualizing Architecture)

Lesson 3: Availability & Reliability

Reliability vs. Availability

  • Reliability: The probability that a system will function correctly without failure for a specified period. It's about correctness.
  • Availability: The percentage of time a system is operational and accessible. It's about uptime.

A system can be available but not reliable (e.g., it returns 500 errors but is "up").

Measuring Availability

Availability is often measured in "nines":

| Availability | Downtime per Year |
|--------------|-------------------|
| 99% (Two nines) | 3.65 days |
| 99.9% (Three nines) | 8.76 hours |
| 99.99% (Four nines) | 52.6 minutes |
| 99.999% (Five nines) | 5.26 minutes |
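These figures fall straight out of the uptime fraction; a quick sketch reproduces them:

```python
# Translate an availability percentage into allowed downtime per year.

HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of downtime a given uptime percentage permits per year."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for label, pct in [("Two nines", 99.0), ("Three nines", 99.9),
                   ("Four nines", 99.99), ("Five nines", 99.999)]:
    h = downtime_hours_per_year(pct)
    if h >= 24:
        print(f"{pct}% ({label}): {h / 24:.2f} days/year")
    elif h >= 1:
        print(f"{pct}% ({label}): {h:.2f} hours/year")
    else:
        print(f"{pct}% ({label}): {h * 60:.1f} minutes/year")
```

Each extra nine divides the downtime budget by ten, which is why "five nines" is so much harder to engineer than "three nines".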

Achieving High Availability

Redundancy

The key to availability is eliminating Single Points of Failure (SPOF). This is done via redundancy.

  • Active-Passive: One server handles traffic; the other is on standby.
  • Active-Active: Both servers handle traffic. If one fails, the other takes over the full load.

Failover

The process of switching to a redundant system upon failure. This can be manual or automatic.
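Automatic failover is usually driven by a health-check loop: probe the primary, and promote the standby only after several consecutive probes fail, so a single transient blip doesn't trigger a flapping failover. A minimal sketch of that decision logic (the threshold and function name are illustrative, not any particular tool's API):

```python
# Decide whether to fail over based on a stream of health-check results.
# True = probe succeeded, False = probe failed. Illustrative logic only.

def should_failover(check_results: list[bool], threshold: int = 3) -> bool:
    """Promote the standby once `threshold` consecutive checks fail."""
    streak = 0
    for ok in check_results:
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return True
    return False

print(should_failover([True, False, True, True]))    # one blip: stay on primary
print(should_failover([True, False, False, False]))  # 3 misses in a row: fail over
```

Real systems layer timeouts, quorum agreement between observers, and fencing of the old primary on top of this basic idea.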


🛠️ Sruja Perspective: Modeling Redundancy

You can explicitly model redundant components in Sruja to visualize your high-availability strategy.

import { * } from 'sruja.ai/stdlib'


Payments = system "Payment System" {
    PaymentService = container "Payment Service" {
        technology "Java"
    }

    // Modeling a primary and standby database
    PrimaryDB = database "Primary Database" {
        technology "MySQL"
        tags ["primary"]
    }

    StandbyDB = database "Standby Database" {
        technology "MySQL"
        tags ["standby"]
        description "Replicates from PrimaryDB. Promoted to primary if PrimaryDB fails."
    }

    PaymentService -> PrimaryDB "Reads/Writes"
    PrimaryDB -> StandbyDB "Replicates data"
}

view index {
    include *
}

## Quiz: Test Your Knowledge

Ready to apply what you've learned? Take the interactive quiz for this lesson!

**1. In system design, what term describes the percentage of time a system is operational and accessible (uptime)?**

<details>
<summary><strong>Click to see answer</strong></summary>

**Answer:** Availability

**Alternative answers:**
- availability
- uptime

**Explanation:**
Availability measures how often a system is up and accessible. It's about uptime percentage (e.g., 99.9%).


</details>

---

**2. In system design, what term describes the probability that a system will function correctly without failure for a specified period (correctness)?**

<details>
<summary><strong>Click to see answer</strong></summary>

**Answer:** Reliability

**Alternative answers:**
- reliability
- correctness

**Explanation:**
Reliability measures how often a system functions correctly without errors. A system can be available (up) but unreliable (returning 500 errors).


</details>

---

**3. Which availability level allows approximately 8.76 hours of downtime per year?**

- [ ] a) 99% (Two nines) - 3.65 days/year
- [ ] b) 99.99% (Four nines) - 52.6 minutes/year
- [ ] c) 99.999% (Five nines) - 5.26 minutes/year
- [ ] d) 99.9% (Three nines) - 8.76 hours/year

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    99.9% availability = 99.9% uptime = 0.1% downtime = 0.001 × 365 days × 24 hours = 8.76 hours/year.
    Each additional "9" reduces downtime by a factor of 10.

  </div>
</div>

---

**4. AWS S3 provides 99.999999999% durability (11 nines). What does this mean in practical terms?**

- [ ] a) S3 can be down for 5 minutes per year
- [ ] b) If you store 10,000 objects, you'll lose one per year
- [ ] c) S3 guarantees 11 nines availability (uptime)
- [ ] d) If you store 10,000 objects, you'll lose one on average every 10,000,000 years

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    99.999999999% durability means a 10⁻¹¹ (0.000000001%) annual chance of losing any given object. For 10,000 objects, the expected loss is 10⁻⁷ objects per year, i.e. one object on average every 10,000,000 years.
    Note: Durability ≠ Availability. S3's availability is 99.99% (52.6 minutes/year downtime).

  </div>
</div>

---

**5. A system has a Single Point of Failure (SPOF). What happens if that component fails?**

- [ ] a) The system continues operating normally with degraded performance
- [ ] b) The load balancer automatically routes traffic to other components
- [ ] c) The system fails gracefully with an error message
- [ ] d) The entire system becomes unavailable or non-functional

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    A Single Point of Failure is a component whose failure causes the entire system to fail. Redundancy is used to eliminate SPOFs by having backup components.

  </div>
</div>

---

**6. In an Active-Passive redundancy setup, what happens when the active server fails?**

- [ ] a) Both servers were handling traffic, so traffic continues normally
- [ ] b) The passive server automatically starts handling all traffic (failover)
- [ ] c) The system sends an error to users until manual intervention
- [ ] d) The passive server takes over (failover), but there's a brief interruption during switch

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    Active-Passive: One server handles traffic, the other is on standby. Failover happens automatically, but there's typically a few seconds of interruption as the passive server comes online.

  </div>
</div>

---

**7. In an Active-Active redundancy setup, what happens when one server fails?**

- [ ] a) The entire system goes down because both servers were critical
- [ ] b) The other server takes 5 minutes to restart before accepting traffic
- [ ] c) Users connected to the failed server experience errors until they reconnect
- [ ] d) Traffic is redistributed to the remaining server(s) with minimal or no interruption

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    Active-Active: Both servers handle traffic simultaneously. If one fails, the other continues handling its traffic plus takes over some traffic from the failed server.
    Minimal interruption if load balancer detects failure quickly.

  </div>
</div>

---

**8. You're building a banking application that processes financial transactions. Which redundancy approach is most appropriate?**

- [ ] a) Active-Active with no failover testing (cheaper)
- [ ] b) Active-Passive with automatic failover (good balance)
- [ ] c) No redundancy (transactions are rare, so SPOF is acceptable)
- [ ] d) Active-Active with rigorous failover testing and synchronous replication

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    Banking requires both high availability and strong consistency. Active-Active with synchronous replication ensures no data loss during failover.
    Regular chaos engineering tests (like Netflix's Chaos Monkey) ensure failover actually works when needed.

  </div>
</div>

---

**9. A content delivery network (CDN) has 100 edge servers worldwide. If 5 servers fail simultaneously, the CDN remains operational. This is an example of:**

- [ ] a) Vertical scaling
- [ ] b) Single Point of Failure
- [ ] c) Active-Passive redundancy
- [ ] d) Elimination of Single Points of Failure through redundancy

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    With 100 servers, the failure of 5 has minimal impact because there's no single critical component. This is horizontal scaling providing both availability and resilience.

  </div>
</div>

---

**10. Your database server fails. You have a standby replica that can take over in 30 seconds. What's your Recovery Time Objective (RTO)?**

- [ ] a) 0 seconds (no data loss)
- [ ] b) 24 hours (time to restore the database from last night's backup)
- [ ] c) 1 minute (time to fully restore service)
- [ ] d) 30 seconds (time to detect failure and failover to standby)

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    RTO (Recovery Time Objective) is the maximum acceptable time to restore service after a disruption.
    In this case, RTO = 30 seconds (detection + failover time).
    RPO (Recovery Point Objective) would be how much data is lost (depends on replication lag).

  </div>
</div>

---

**11. Netflix uses Chaos Monkey, a tool that randomly terminates instances in production. What is the purpose of this?**

- [ ] a) To save money by reducing server count
- [ ] b) To test if the system can handle automatic failover and resilience
- [ ] c) To identify which servers are underutilized
- [ ] d) To proactively test that the system remains available when components fail

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    Chaos Engineering is about intentionally causing failures to test resilience. If Chaos Monkey kills a server and Netflix users don't notice, the system is resilient.
    This practice transformed Netflix's availability from 99.9% to 99.99%+ by finding and fixing weaknesses before real outages occur.

  </div>
</div>

---

**12. Your e-commerce site's primary database fails. You have a standby replica with 5 minutes of replication lag (meaning it's missing the last 5 minutes of orders). What's your RPO?**

- [ ] a) 0 minutes (no data lost)
- [ ] b) 30 seconds (the failover time)
- [ ] c) Infinite (you can't recover the data)
- [ ] d) 5 minutes (you lose up to 5 minutes of transaction data)

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time.
    With 5 minutes of replication lag, RPO = 5 minutes. Any orders placed in the last 5 minutes would need to be recovered from logs or customer records.

  </div>
</div>

---

**13. A system has 99.9% availability. How much downtime is this per month?**

- [ ] a) 8.76 hours (same as per year)
- [ ] b) 4.38 minutes (same as 99.99%)
- [ ] c) 5 minutes (same as five nines per year)
- [ ] d) 43.8 minutes (8.76 hours ÷ 12 months)

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    99.9% = 8.76 hours downtime/year ÷ 12 months = 0.73 hours = 43.8 minutes per month.
    This means the system can be down for ~43 minutes each month while maintaining 99.9% availability SLA.

  </div>
</div>
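The division above generalizes to any number of nines. A quick sketch in Python (the helper name is ours, for illustration only):

```python
# Downtime allowed per year and per month for a given availability percentage.
def downtime_minutes(availability_pct):
    minutes_per_year = 365 * 24 * 60  # 525,600
    down_fraction = 1 - availability_pct / 100
    per_year = minutes_per_year * down_fraction
    return per_year, per_year / 12  # (per year, per month)

for nines in (99.0, 99.9, 99.99, 99.999):
    year, month = downtime_minutes(nines)
    print(f"{nines}% -> {year:.1f} min/year, {month:.1f} min/month")
```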

---

**14. You're designing a global video streaming service. Users expect 99.99% availability. What's the maximum acceptable downtime per month?**

- [ ] a) 43.8 minutes (same as 99.9%)
- [ ] b) 5.26 minutes (same as five nines per year)
- [ ] c) 52.6 minutes (the full per-year allowance)
- [ ] d) 4.38 minutes (52.6 minutes/year ÷ 12 months)

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    99.99% = 52.6 minutes downtime/year ÷ 12 months = 4.38 minutes per month.
    Achieving this requires Active-Active setup across multiple regions with automatic failover, as any maintenance or failure costs precious minutes.

  </div>
</div>

---

**15. In Sruja, how would you model a database with a standby replica for high availability?**

- [ ] a) Use a single database component with 'standby' in description
- [ ] b) Create two databases and don't connect them (Sruja auto-discovers)
- [ ] c) Use a load balancer with one database connection
- [ ] d) Define two database components (Primary and Standby) with a replication relationship

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    In Sruja, you explicitly model redundant components:

    ```sruja
    PrimaryDB = database "Primary Database" { ... }
    StandbyDB = database "Standby Database" {
        description "Replicates from PrimaryDB. Promoted to primary if PrimaryDB fails."
    }
    PrimaryDB -> StandbyDB "Replicates data"
    ```

    This makes the redundancy strategy visible in your architecture diagrams.

  </div>
</div>

---

**16. Which scenario demonstrates the difference between availability and reliability?**

- [ ] a) Server crashes and the site goes down (neither available nor reliable)
- [ ] b) Site loads quickly and returns correct data (both available and reliable)
- [ ] c) Site responds slowly but every response is correct
- [ ] d) Site is accessible but returns random 500 errors to users

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    This is AVAILABLE (the site is up and responding) but NOT RELIABLE (it is not functioning correctly).
    - Available = the system is reachable and responds to requests
    - Reliable = the system responds correctly, without errors
    - A perfect system is both available and reliable

  </div>
</div>

---

**17. Your company's SLA (Service Level Agreement) promises 99.9% uptime. In the last month, you had 1 hour of downtime. What's the penalty?**

- [ ] a) No penalty (1 hour &lt; 43.8 minutes allowed)
- [ ] b) 50% penalty for missing the SLA
- [ ] c) Calculate the difference: (60 min - 43.8 min) × penalty rate
- [ ] d) 16.2 minutes exceeded (60 min - 43.8 min = 16.2 min beyond SLA)

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    99.9% SLA = 43.8 minutes downtime/month allowed.
    Actual = 60 minutes downtime.
    Exceeded by 16.2 minutes.
    Penalty is typically calculated based on the exceeded minutes times a penalty rate (e.g., 1% credit per minute exceeded).
    This demonstrates why monitoring availability in real-time is critical for SLA compliance.

  </div>
</div>
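The overage arithmetic in this explanation can be checked in code. A sketch in Python (the function name and the 1%-per-minute rate are the hypothetical example values mentioned above):

```python
# SLA overage: minutes of downtime beyond what the SLA allows in a month.
def sla_overage_minutes(sla_pct, actual_downtime_min, days_in_month=30.44):
    allowed = days_in_month * 24 * 60 * (1 - sla_pct / 100)
    return max(0.0, actual_downtime_min - allowed)

over = sla_overage_minutes(99.9, 60)  # 1 hour of actual downtime
credit_pct = over * 1.0               # hypothetical rate: 1% credit per minute over
print(f"Exceeded by {over:.1f} min -> {credit_pct:.1f}% credit")
```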

---

**18. What is the term for the process of switching from a failed primary component to a redundant standby component?**

<details>
<summary><strong>Click to see answer</strong></summary>

**Answer:** Failover

**Alternative answers:**
- failover
- fail-over
- switchover

**Explanation:**
Failover is the process of automatically or manually switching to a redundant system upon failure. This is critical for high availability systems.
Failover can be automatic (system detects failure and switches) or manual (operator triggers switch).


</details>

---

**19. A healthcare application must be available 24/7 for emergency access. The database can only be down for maintenance 4 hours per year. What's the minimum availability required?**

- [ ] a) 99% (allows 3.65 days/year)
- [ ] b) 99.9% (allows 8.76 hours/year)
- [ ] c) 99.999% (allows 5.26 minutes/year)
- [ ] d) 99.95% (allows 4.38 hours/year, the closest standard level to the 4-hour budget)

<button class="check-answer-btn" data-correct="d">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    4 hours/year downtime tolerance:
    4 ÷ 8760 hours = 0.0457% downtime
    Required availability = 100% - 0.0457% = 99.954%

    The closest standard availability level is 99.95%, which allows 4.38 hours/year. That allowance is slightly above the 4-hour budget, so in practice you must target at least 99.954% and keep actual downtime under 4 hours.
    This requires careful planning: Active-Active setup, scheduled maintenance windows, and minimal unplanned outages.

  </div>
</div>
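The reverse calculation, from a downtime budget to a required availability, is equally mechanical. A sketch in Python (the function name is ours):

```python
# Convert an annual downtime budget (in hours) into a minimum availability percentage.
def required_availability(max_downtime_hours_per_year):
    hours_per_year = 365 * 24  # 8760
    return 100 * (1 - max_downtime_hours_per_year / hours_per_year)

print(f"{required_availability(4):.3f}%")  # → 99.954%
```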

---

This quiz covers:
- Availability vs Reliability definitions
- Availability levels and downtime calculations (nines)
- Redundancy strategies (Active-Passive vs Active-Active)
- Single Points of Failure (SPOF)
- Failover mechanisms
- Real-world examples (Netflix, AWS S3, CDN)
- RTO and RPO
- Chaos Engineering
- SLA calculations
- Sruja modeling for redundancy

Lesson 4: CAP Theorem & Consistency

The CAP Theorem

Proposed by Eric Brewer, the CAP theorem states that a distributed data store can only provide two of the following three guarantees:

  1. Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time.
  2. Availability (A): Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
  3. Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes.

The Reality: P is Mandatory

In a distributed system, network partitions (P) are inevitable. Therefore, you must choose between Consistency (CP) and Availability (AP) when a partition occurs.

  • CP (Consistency + Partition Tolerance): Wait for data to sync. If a node is unreachable, return an error. (e.g., Banking systems).
  • AP (Availability + Partition Tolerance): Return the most recent version of data available, even if it might be stale. (e.g., Social media feeds).

Consistency Models

  • Strong Consistency: Once a write is confirmed, all subsequent reads see that value.
  • Eventual Consistency: If no new updates are made, eventually all accesses will return the last updated value. (Common in AP systems).

🛠️ Sruja Perspective: Documenting Guarantees

When defining data stores in Sruja, it is helpful to document their consistency guarantees, especially for distributed databases.

```sruja
import { * } from 'sruja.ai/stdlib'

DataLayer = system "Data Layer" {
    UserDB = database "User Database" {
        technology "Cassandra"
        // Explicitly stating the consistency model
        description "Configured with replication factor 3. Uses eventual consistency for high availability."

        // You could also use custom tags
        tags ["AP-System", "Eventual-Consistency"]
    }

    BillingDB = database "Billing Database" {
        technology "PostgreSQL"
        description "Single primary with synchronous replication to ensure strong consistency."
        tags ["CP-System", "Strong-Consistency"]
    }
}

view index {
    include *
}
```

Quiz: Test Your Knowledge

Ready to apply what you've learned? Take the interactive quiz for this lesson!

**1. What term in the CAP theorem means every read receives the most recent write or an error (all nodes see the same data)?**

<details>
<summary><strong>Click to see answer</strong></summary>

**Answer:** Consistency

**Alternative answers:**
- consistency
- C

**Explanation:**
Consistency ensures all nodes see the same data at the same time. When a write is confirmed, any subsequent read returns that value.

</details>


**2. What term in the CAP theorem means every request receives a non-error response, without guaranteeing it contains the most recent write?**

<details>
<summary><strong>Click to see answer</strong></summary>

**Answer:** Availability

**Alternative answers:**
- availability
- A

**Explanation:**
Availability means the system is always responsive. Even if some nodes are out of sync, the system returns a response (possibly stale data) rather than an error.

</details>


**3. What term in the CAP theorem means the system continues to operate despite network failures or message loss between nodes?**

<details>
<summary><strong>Click to see answer</strong></summary>

**Answer:** Partition Tolerance

**Alternative answers:**
- partition tolerance
- partition-tolerance
- P

**Explanation:**
Partition Tolerance ensures the system works even when network communication between nodes fails. In distributed systems, partitions are inevitable, so P is mandatory.

</details>


4. In a distributed system, you must choose between CP and AP when a network partition occurs. Why?

  • a) Because P is optional in most systems
  • b) Because you can only implement two of the three guarantees simultaneously
  • c) Because C and A are mutually exclusive by definition
  • d) Because network partitions (P) are inevitable in distributed systems


5. A banking system must ensure that account balances are always correct. During a network partition, the system rejects transactions if it can't confirm data consistency. This is what type of system?

  • a) AP (Available) - better to allow transactions with possibly incorrect balances
  • b) CA (Consistent and Available) - possible in single-node systems only
  • c) P (Partition Tolerance) only - data consistency isn't important
  • d) CP (Consistent and Partition Tolerant)


6. A social media feed shows posts from friends. If a partition occurs, users see slightly outdated posts rather than an error page. This is what type of system?

  • a) CP (Consistent) - reject requests if data isn't perfectly synced
  • b) CA (Consistent and Available) - impossible in distributed systems
  • c) P (Partition Tolerance) only - doesn't describe the full trade-off
  • d) AP (Available and Partition Tolerant)


7. Which system would prioritize CP (Consistency) over Availability?

  • a) Instagram photo feed
  • b) YouTube video recommendations
  • c) E-commerce product catalog (non-critical)
  • d) PayPal payment processing


8. Which system would prioritize AP (Availability) over Consistency?

  • a) Banking transaction system
  • b) Inventory management for critical medical supplies
  • c) Stock trading platform for high-frequency trading
  • d) Netflix video streaming recommendations


9. What is the difference between Strong Consistency and Eventual Consistency?

  • a) Strong Consistency is slower, Eventual is always faster
  • b) Strong Consistency allows stale reads, Eventual doesn't
  • c) There's no difference, they're synonyms
  • d) Strong Consistency: all nodes see same data immediately. Eventual: nodes eventually converge if no new writes occur.


10. A user posts a tweet. The tweet immediately appears in their own timeline but takes up to 30 seconds to appear in followers' feeds. What consistency model is this?

  • a) Strong Consistency (everyone sees the tweet immediately)
  • b) No consistency (data is randomly shown or hidden)
  • c) CP system (rejects posts during partitions)
  • d) Eventual Consistency (followers eventually see the tweet)


11. A database has a replication factor of 3 (3 copies of data). Writes are confirmed after writing to 2 nodes. If one node is down, what happens?

  • a) Write fails because all 3 nodes must be available
  • b) Write succeeds because only 2 nodes are required (quorum)
  • c) System becomes completely unavailable
  • d) Write succeeds because quorum (2 out of 3 nodes) is available
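The quorum behavior described above can be sketched as a toy model in Python (a simplification for illustration, not the mechanics of any particular database):

```python
# Toy write-quorum check: a write succeeds once `quorum` replicas acknowledge it.
def write_succeeds(replicas_up, replication_factor=3, quorum=2):
    acks = min(replicas_up, replication_factor)  # can't get more acks than live replicas
    return acks >= quorum

print(write_succeeds(replicas_up=3))  # True: all replicas up
print(write_succeeds(replicas_up=2))  # True: one replica down, quorum still met
print(write_succeeds(replicas_up=1))  # False: quorum lost
```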


12. Cassandra is a distributed database designed for AP systems. If you need strong consistency in Cassandra, what configuration would you use?

  • a) Replication factor of 1 (single node)
  • b) Increase the replication factor (more copies, but not stronger reads)
  • c) Set consistency level to ONE (fast but weak)
  • d) Read/write with QUORUM consistency level (majority of replicas)


13. You're designing a global e-commerce platform with product catalogs, user sessions, and order processing. Which should use strong consistency?

  • a) Product catalog (catalog changes are frequent)
  • b) User sessions (session data isn't critical)
  • c) Search results (stale results are acceptable)
  • d) Order processing and inventory management


14. In a globally distributed system, network latency between regions is 200ms. If you need strong consistency, what's the minimum write latency?

  • a) 0ms (write happens locally and asynchronously replicates)
  • b) 50ms (compression reduces latency)
  • c) 200ms (only need to write to one region)
  • d) 400ms+ (must wait for acknowledgment from majority of regions)


15. What's the difference between BASE (Basically Available, Soft state, Eventual consistency) and ACID (Atomicity, Consistency, Isolation, Durability)?

  • a) BASE is stricter than ACID, requiring perfect consistency
  • b) They're synonyms, just different names for the same concept
  • c) BASE is for distributed systems, ACID is only for single-node databases
  • d) BASE is an alternative to ACID that prioritizes availability over strong consistency


16. A read-after-write consistency model ensures that a client always sees their own writes. What consistency level is this?

  • a) Weak Consistency (writes might not be visible)
  • b) Strong Consistency (all clients see all writes immediately)
  • c) Eventual Consistency (eventually converges)
  • d) Causal Consistency (session consistency)


17. In Sruja, how would you document that a database uses eventual consistency for high availability?

  • a) Use a 'relaxed' tag in the relationship
  • b) Don't model it - Sruja assumes strong consistency by default
  • c) Create two databases and say they're 'somewhat consistent'
  • d) Add tags like 'AP-System' and 'Eventual-Consistency' to the database definition


18. A system needs 99.999% availability but can tolerate 5-second data staleness. What's the best approach?

  • a) Strong consistency with synchronous replication across all nodes
  • b) Single-node database (simplest, no network issues)
  • c) Reject all writes during network partitions
  • d) Eventual consistency with asynchronous replication and caching


19. What happens when a CP system experiences a network partition?

  • a) The system continues serving all requests with slightly stale data
  • b) The system becomes completely unavailable (no data can be read or written)
  • c) The system switches to AP mode automatically
  • d) The system rejects requests that can't be guaranteed to be consistent (returns errors)


20. A distributed database has 5 nodes. Network partition splits them: 3 nodes in group A, 2 nodes in group B. What happens in a CP system?

  • a) Both groups accept writes (both available)
  • b) Group B accepts writes (it's smaller, so it's backup)
  • c) Neither group can operate (complete system failure)
  • d) Only group A (3 nodes, majority) can accept writes. Group B is read-only or unavailable.


21. What's the relationship between latency and consistency in distributed systems?

  • a) Strong consistency always has lower latency than eventual consistency
  • b) Eventual consistency always has lower latency, regardless of design
  • c) Latency and consistency are independent - no relationship exists
  • d) Strong consistency typically requires more coordination, resulting in higher latency


This quiz covers:

  • CAP theorem definitions (Consistency, Availability, Partition Tolerance)
  • Why P is mandatory in distributed systems
  • CP vs AP systems
  • Real-world examples (PayPal, Netflix, Twitter)
  • Strong vs Eventual Consistency
  • Consistency levels in Cassandra (ONE, QUORUM, ALL)
  • Replication and quorum
  • Global distributed systems and latency
  • BASE vs ACID
  • Causal consistency
  • Network partitions and quorum
  • Latency vs consistency trade-offs
  • Sruja modeling for consistency

Lesson 5: User Scenarios

Understanding User Journeys

A User Scenario describes the series of steps a user takes to achieve a specific goal within your system. While static architecture diagrams show structure, user scenarios show behavior.

Why Model Scenarios?

  1. Validation: Ensures that all components required for a feature actually exist and are connected.
  2. Clarity: Helps stakeholders understand how the system works from a user's perspective.
  3. Testing: Serves as a blueprint for integration and end-to-end tests.

Example Scenario: Buying a Ticket

  1. User searches for events.
  2. User selects a ticket.
  3. User enters payment details.
  4. System processes payment.
  5. System sends confirmation email.

🛠️ Sruja Perspective: Modeling Scenarios

Sruja provides a dedicated scenario keyword to model these interactions explicitly. This allows you to visualize the flow of data across your defined architecture.

```sruja
import { * } from 'sruja.ai/stdlib'

R1 = requirement functional "User can buy a ticket"
R2 = requirement performance "Process payment in < 2s"

// Define the actors and systems first
User = person "Ticket Buyer"

TicketingApp = system "Ticketing Platform" {
    WebApp = container "Web Frontend"
    PaymentService = container "Payment Processor"
    EmailService = container "Notification Service"

    WebApp -> PaymentService "Process payment"
    PaymentService -> EmailService "Trigger confirmation"
}

// Define the scenario
BuyTicket = scenario "User purchases a concert ticket" {
    User -> TicketingApp.WebApp "Selects ticket"
    TicketingApp.WebApp -> TicketingApp.PaymentService "Process payment"
    TicketingApp.PaymentService -> TicketingApp.EmailService "Trigger confirmation"
    TicketingApp.EmailService -> User "Send email"
}

view index {
    include *
}
```

## Quiz: Test Your Knowledge

Ready to apply what you've learned? Take the interactive quiz for this lesson!

**1. What is a user scenario in system design?**

- [ ] a) A description of the system's architecture and components
- [ ] b) A series of steps a user takes to achieve a specific goal within the system
- [ ] c) A list of technical requirements for the system
- [ ] d) A diagram showing database relationships

<button class="check-answer-btn" data-correct="b">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    A user scenario describes the series of steps a user takes to achieve a specific goal within your system. While static architecture diagrams show structure, user scenarios show behavior.
  </div>
</div>

---

**2. Which of the following is NOT a benefit of modeling user scenarios?**

- [ ] a) Validation - ensures all required components exist and are connected
- [ ] b) Performance - improves system response times
- [ ] c) Clarity - helps stakeholders understand the system from a user's perspective
- [ ] d) Testing - serves as a blueprint for integration and end-to-end tests

<button class="check-answer-btn" data-correct="b">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    The three main benefits of modeling user scenarios are: Validation (ensuring components exist and are connected), Clarity (helping stakeholders understand), and Testing (providing blueprints for tests). Performance improvement is not a direct benefit of scenario modeling.
  </div>
</div>

---

**3. In the "Buying a Ticket" scenario example, what is the correct order of steps?**

- [ ] a) User searches for events → User selects a ticket → System sends confirmation email → User enters payment details → System processes payment
- [ ] b) User enters payment details → User searches for events → User selects a ticket → System processes payment → System sends confirmation email
- [ ] c) User searches for events → User selects a ticket → User enters payment details → System processes payment → System sends confirmation email
- [ ] d) System processes payment → User searches for events → User selects a ticket → User enters payment details → System sends confirmation email

<button class="check-answer-btn" data-correct="c">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    The correct order is: 1) User searches for events, 2) User selects a ticket, 3) User enters payment details, 4) System processes payment, 5) System sends confirmation email. This follows the natural flow from user action to system response.
  </div>
</div>

---

**4. What does the scenario keyword in Sruja allow you to do?**

- [ ] a) Define database schemas and relationships
- [ ] b) Model user interactions and visualize the flow of data across your architecture
- [ ] c) Create automated test scripts
- [ ] d) Generate API documentation

<button class="check-answer-btn" data-correct="b">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    Sruja's scenario keyword allows you to model user interactions explicitly and visualize the flow of data across your defined architecture. This helps bridge the gap between static architecture diagrams and dynamic user behavior.
  </div>
</div>

---

**5. How do user scenarios differ from static architecture diagrams?**

- [ ] a) Architecture diagrams show structure, user scenarios show behavior
- [ ] b) Architecture diagrams are for developers, user scenarios are for managers
- [ ] c) Architecture diagrams use UML, user scenarios use flowcharts
- [ ] d) Architecture diagrams are optional, user scenarios are required

<button class="check-answer-btn" data-correct="a">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    While static architecture diagrams show the structure of a system (components, connections, relationships), user scenarios show the behavior - the dynamic flow of actions and data as users interact with the system over time.
  </div>
</div>

---

**6. You're building an e-commerce platform. Which scenario best demonstrates a complete user journey?**

- [ ] a) "User adds item to cart"
- [ ] b) "User browses products → User adds item to cart → User enters shipping info → User completes payment → System confirms order"
- [ ] c) "Database stores product information"
- [ ] d) "Payment API processes transactions"

<button class="check-answer-btn" data-correct="b">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    A complete user journey shows the full series of steps from start to finish. Option (b) shows the complete flow from browsing to order confirmation, representing a full user scenario. The other options are either too narrow (single action) or describe system behavior rather than user journey.
  </div>
</div>

---

**7. What is the primary purpose of validating user scenarios against your architecture?**

- [ ] a) To ensure the code compiles without errors
- [ ] b) To verify that all components required for a feature exist and are properly connected
- [ ] c) To reduce the cost of infrastructure
- [ ] d) To improve the user interface design

<button class="check-answer-btn" data-correct="b">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    Validation of user scenarios ensures that all the components, services, and connections needed to support a user feature actually exist in your architecture. This helps catch missing or incomplete implementations before development or deployment.
  </div>
</div>

---

**8. When modeling a scenario in Sruja, you define:**

<details>
<summary><strong>Click to see answer</strong></summary>

**Answer:** Actors, systems, containers, and the flow of data between them

You define the actors (users or external systems), the systems and containers they interact with, and use arrows (->) to show the flow of data and interactions between components.

</details>

---

**9. A ride-sharing app has this scenario: "User requests ride → System matches driver → Driver accepts → User receives driver details → Trip begins". What architectural component is likely missing if the scenario fails at the matching step?**

- [ ] a) Payment processing service
- [ ] b) Driver matching algorithm or service
- [ ] c) User authentication system
- [ ] d) Email notification service

<button class="check-answer-btn" data-correct="b">Check Answer</button>

<div class="answer-feedback" style="display: none;">
  <p class="feedback-text"></p>
  <div class="explanation" style="display: none;">
    If the scenario fails at the matching step, the driver matching algorithm or service is likely missing or malfunctioning. This component is responsible for connecting users with available drivers. The other components (payment, auth, email) are important but not directly related to the matching step.
  </div>
</div>

---

**10. Why are user scenarios valuable for testing?**

<details>
<summary><strong>Click to see answer</strong></summary>

**Answer:** They serve as blueprints for integration and end-to-end tests

User scenarios provide a step-by-step description of how the system should behave from a user's perspective, making them perfect templates for writing integration tests and end-to-end tests. They help ensure that all components work together correctly to support real user workflows.

</details>

---

This quiz covers:
- User scenario definitions
- Importance of modeling scenarios
- Architecture diagrams vs scenarios
- Sruja scenario keyword
- Scenario validation
- Complex scenarios (e.g., guest to authenticated user)
- Fault tolerance in scenarios
- Testing value of scenarios
- Critical component identification
- Multi-system scenarios
- Scenario decomposition
- Happy path vs error path
- Error paths to model
- Stakeholder communication
- Microservices and scenarios
- Asynchronous processing
- Scenarios and API design
- Sequence diagrams
- Developer onboarding
- Regulatory requirements

By defining scenarios, you can automatically generate sequence diagrams or flowcharts that map directly to your code.

Building Blocks

  • Understand the difference between L4 and L7 load balancing
  • Learn load balancing algorithms and when to use each
  • See real-world examples from NGINX and HAProxy
  • Practice making trade-off decisions

Lesson 1: Load Balancers

What is a Load Balancer?

A load balancer sits between clients and servers, distributing incoming network traffic across a group of backend servers. This ensures that no single server bears too much load.

Without a load balancer:

```
Clients → Server 1 (overwhelmed, 99% CPU)
        → Server 2 (idle, 5% CPU)
        → Server 3 (idle, 3% CPU)
```

With a load balancer:

```
Clients → Load Balancer → Server 1 (33% CPU)
                        → Server 2 (33% CPU)
                        → Server 3 (34% CPU)
```

Types of Load Balancing

Layer 4 (Transport Layer)

What it does: Makes decisions based on IP address and TCP/UDP ports.

Characteristics:

  • Fast - No content inspection, just reads packet headers
  • Low CPU - Simple routing logic
  • Limited intelligence - Can't inspect HTTP headers, URLs, or cookies
  • Protocol-agnostic - Works for HTTP, TCP, UDP, gRPC

Best for: High-throughput services where you need raw speed over intelligence.

Examples: AWS ELB (classic), HAProxy (L4 mode), F5 LTM

Layer 7 (Application Layer)

What it does: Makes decisions based on the content of the message (URL, HTTP headers, cookies).

Characteristics:

  • Slower - Needs to parse HTTP content
  • CPU intensive - More complex routing logic
  • Smart routing - Can route /images to image servers, /api to API servers
  • Content-aware - Inspects HTTP headers, cookies, SSL termination

Best for: Web applications needing content-based routing, SSL termination, or session stickiness.

Examples: NGINX, AWS ALB, Envoy, Traefik

Real-World Comparison

| Factor | Layer 4 | Layer 7 |
|--------|---------|---------|
| Speed | 100K+ RPS per core | 50K+ RPS per core |
| CPU Usage | 5-10% | 30-50% |
| Features | Basic routing | SSL termination, URL routing, session affinity |
| Use Case | High-throughput APIs | Web apps, microservices |

Load Balancing Algorithms

Round Robin

How it works: Requests are distributed sequentially across servers: 1, 2, 3, 4, then back to 1.

Pros:

  • ✅ Simple to implement and understand
  • ✅ Works well when servers have similar capacity
  • ✅ No state to maintain

Cons:

  • ❌ Doesn't account for server load (one might be slow with 100 connections, another idle)
  • ❌ Doesn't work well when servers have different capabilities

Real-world use: Good for stateless services where all servers are identical (e.g., API servers).
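Round robin can be sketched in a few lines of Python (a toy illustration, not any particular load balancer's implementation):

```python
import itertools

# Round robin: hand requests to servers in a fixed rotating order.
servers = ["server-1", "server-2", "server-3"]
rr = itertools.cycle(servers)

for _ in range(5):
    print(next(rr))  # server-1, server-2, server-3, server-1, server-2
```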

Least Connections

How it works: Sends request to the server with the fewest active connections.

Pros:

  • ✅ Better than round robin when servers have different loads
  • ✅ Accounts for connection time (long-running connections get fewer new ones)

Cons:

  • ❌ Still doesn't account for server capacity (CPU, memory)
  • ❌ Requires tracking active connections (more state)

Real-world use: Good for services with varying request processing times (e.g., database queries).
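Least connections adds a small amount of state, the active-connection count per server. A toy sketch in Python (the dictionary of counts stands in for the balancer's connection tracking):

```python
# Least connections: route each new request to the server with the fewest active connections.
active = {"server-1": 12, "server-2": 3, "server-3": 7}

def pick_least_connections(conn_counts):
    return min(conn_counts, key=conn_counts.get)

target = pick_least_connections(active)
active[target] += 1  # the chosen server now holds one more connection
print(target)  # server-2
```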

IP Hash

How it works: Uses a hash of the client's IP address to determine which server receives the request.

Pros:

  • ✅ Session stickiness - Same IP always goes to the same server
  • ✅ Works great for stateful applications (session stored in memory)

Cons:

  • ❌ Uneven distribution if many clients share the same IP (NAT, proxies)
  • ❌ Doesn't rebalance when servers are added/removed

Real-world use: Stateful applications needing session affinity, WebSocket connections.
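A sketch of the idea using a stable digest (the `pick_server` helper and server names are hypothetical):

```python
import hashlib

servers = ["chat1", "chat2", "chat3"]

def pick_server(client_ip: str) -> str:
    # Use a stable digest so the same IP maps to the same server across
    # processes and restarts (Python's built-in hash() is salted per
    # process, so it is not suitable here).
    digest = int(hashlib.sha256(client_ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

# The same client always lands on the same server.
assert pick_server("203.0.113.7") == pick_server("203.0.113.7")
```

The `% len(servers)` step is also the source of the rebalancing con above: changing the server count remaps almost every client.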

Random

How it works: Randomly selects a server for each request.

Pros:

  • ✅ No state to maintain
  • ✅ Even distribution over time (law of large numbers)

Cons:

  • ❌ Can have temporary uneven distribution
  • ❌ No session stickiness

Real-world use: Simple load balancing with a large number of requests where stickiness doesn't matter.
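The "even over time" claim is easy to check empirically (server names are placeholders):

```python
import random
from collections import Counter

servers = ["s1", "s2", "s3"]
picks = Counter(random.choice(servers) for _ in range(30_000))

# Over many requests the split approaches 1/3 each (law of large
# numbers), even though any short window can be uneven.
for server, count in picks.items():
    assert 0.30 < count / 30_000 < 0.37
```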

Real-World Case Studies

Case Study 1: NGINX at Scale

The Challenge: NGINX (the company) needed to serve 1M+ requests per second (RPS) with low latency.

The Architecture:

Clients → Layer 4 Load Balancer (HAProxy) → NGINX Layer 7 LBs → Application Servers
           (10 Gbps)                        (100+ instances)     (1000+ servers)

Key Decisions:

  1. Layer 4 at edge - Raw speed for SSL termination and initial routing
  2. Layer 7 closer to apps - Content-based routing (/static vs /api)
  3. Least connections algorithm - Servers have varying loads

The Results:

  • Throughput: 1M+ RPS per load balancer
  • Latency: <10ms at p95
  • Server utilization: Balanced at 65-75% CPU (good headroom)
  • Availability: 99.99% with automatic failover

💡 Key Insight: NGINX uses a layered approach. Layer 4 for raw speed at the edge, Layer 7 for intelligence closer to the applications. This gives them both performance and features.

Case Study 2: HAProxy Configuration Patterns

Scenario: E-commerce platform with 3 types of traffic:

  • Product catalog (read-heavy, 90% of traffic)
  • Checkout (write-heavy, requires session stickiness)
  • API (stateless, needs high throughput)

The Solution:

# Product Catalog - Round Robin (read-heavy)
backend catalog
    balance roundrobin
    server web1 10.0.0.1:80 check
    server web2 10.0.0.2:80 check
    server web3 10.0.0.3:80 check

# Checkout - IP Hash (session stickiness)
backend checkout
    balance source
    server app1 10.0.0.10:8080 check
    server app2 10.0.0.11:8080 check

# API - Least Connections (varying request times)
backend api
    balance leastconn
    server api1 10.0.0.20:443 check
    server api2 10.0.0.21:443 check
    server api3 10.0.0.22:443 check

Why this works:

  • Catalog: Round robin works great for stateless, uniform requests
  • Checkout: IP hash ensures user stays on same server (session in memory)
  • API: Least connections handles varying query complexity (some API calls are fast, others slow)

Performance Results:

  • Catalog servers: 5000 RPS, 2ms latency
  • Checkout servers: 500 RPS, 20ms latency (acceptable for checkout flow)
  • API servers: 10000 RPS, 5ms latency (balanced load)

Case Study 3: WhatsApp's Load Balancing Evolution

Stage 1 (Early days): Single server → Crashed at 10K users

Stage 2: Round robin across 10 servers → Uneven load, some servers overwhelmed

Stage 3 (Solution): Erlang's built-in load balancing + consistent hashing

Clients → Load Balancer → Erlang Nodes → Message Store
                    (Consistent hashing for key-based routing)

Key Decision: Use consistent hashing so messages from the same user always go to the same node. This reduces cross-node synchronization.

The Results:

  • Scale: From 10K users to 1B+ users
  • Throughput: 65B+ messages/day
  • Efficiency: Minimal cross-node data movement (99% of traffic stays local)

Production Metrics

Load Balancer Performance

| System  | Algorithm              | Throughput | Latency (p95) | Servers |
|---------|------------------------|------------|---------------|---------|
| NGINX   | Round Robin            | 1M+ RPS    | <10ms         | 100     |
| HAProxy | Least Connections      | 500K RPS   | <5ms          | 50      |
| AWS ALB | Round Robin            | 100K RPS   | <50ms         | Managed |
| Envoy   | Random + Health Checks | 1M+ RPS    | <20ms         | 200     |

Resource Utilization

| Resource   | Layer 4 Load Balancer | Layer 7 Load Balancer |
|------------|-----------------------|-----------------------|
| CPU        | 5-15%                 | 30-60%                |
| Memory     | 1-2 GB                | 4-8 GB                |
| Network    | 10 Gbps               | 10 Gbps               |
| Throughput | 1M+ RPS               | 500K RPS              |

Trade-Off Scenarios

Scenario 1: API Gateway for Microservices

Context: Building an API gateway that routes to 50 microservices. Some are fast (profile service), others slow (report generation).

The Trade-Off:

| Decision      | Option A         | Option B          | What You Choose & Why                                                 |
|---------------|------------------|-------------------|-----------------------------------------------------------------------|
| Layer         | Layer 4 (fast)   | Layer 7 (smart)   | Layer 7 - Need URL-based routing (/users → users service)             |
| Algorithm     | Round Robin      | Least Connections | Least Connections - Services have varying response times              |
| Health Checks | Basic TCP        | HTTP /health      | HTTP /health - Need to detect slow/failing services, not just offline ones |
| SSL           | At load balancer | At service        | At load balancer - Centralized SSL management, cheaper certificates   |

Result:

  • Pros: Intelligent routing, good load balancing, centralized SSL
  • Cons: Higher CPU usage, more complex configuration
  • Performance: 50K RPS at p95 < 50ms (acceptable for API gateway)

Scenario 2: Video Streaming Platform

Context: Streaming video to 10M concurrent users. Each stream needs sustained bandwidth (2-5 Mbps). Low latency is critical.

The Trade-Off:

| Decision         | Option A          | Option B           | What You Choose & Why                                            |
|------------------|-------------------|--------------------|------------------------------------------------------------------|
| Layer            | Layer 4 (speed)   | Layer 7 (features) | Layer 4 - Raw speed for streaming, no content inspection needed  |
| Algorithm        | Round Robin       | IP Hash            | Round Robin - Streams are independent, no session affinity needed |
| Geo-distribution | Single datacenter | Edge locations     | Edge locations - Reduce latency by serving from closest datacenter |

Result:

  • Pros: Maximum throughput, minimal latency, simple configuration
  • Cons: No intelligent routing (but not needed for streaming)
  • Performance: 1M+ concurrent streams, <100ms latency globally

Scenario 3: Stateful WebSocket Application

Context: Real-time chat application where users are connected via WebSockets. User messages and presence data must go to the same server.

The Trade-Off:

| Decision    | Option A          | Option B           | What You Choose & Why                                         |
|-------------|-------------------|--------------------|---------------------------------------------------------------|
| Layer       | Layer 4           | Layer 7            | Layer 7 - Need to inspect WebSocket upgrade requests          |
| Algorithm   | Round Robin       | IP Hash            | IP Hash - Session stickiness required for WebSocket connections |
| Failover    | Break connections | Graceful reconnect | Graceful reconnect - Clients auto-reconnect on disconnect     |
| Persistence | In-memory         | Redis              | In-memory - Faster for local data, Redis for cross-server sync |

Result:

  • Pros: Session affinity, real-time performance, good user experience
  • Cons: Uneven distribution (some servers have more active users), complexity in failover
  • Performance: 100K concurrent WebSockets, <50ms message delivery

🛠️ Sruja Perspective: Modeling Load Balancers

In Sruja, we treat load balancers as critical infrastructure components with clear trade-offs documented.

Why Model Load Balancers?

Modeling load balancers in your architecture provides:

  1. Capacity Planning: See how much traffic the LB can handle before bottlenecks
  2. Failure Analysis: Understand what happens if the LB fails (single point of failure?)
  3. Algorithm Clarity: Document which algorithm and why
  4. Performance Visibility: Track RPS, latency, server distribution

Example: E-Commerce Platform Load Balancing

import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce Platform" {
    description "Multi-tenant e-commerce with intelligent load balancing"
    
    // LAYER 4: Edge load balancer for SSL termination
    EdgeLB = container "Edge Load Balancer" {
        technology "HAProxy"
        description "Layer 4 LB for SSL termination and initial routing"
        tags ["load-balancer", "layer4"]
        
        capacity {
            requests_per_second "1_000_000"
            bandwidth_gbps "10"
        }
    }
    
    // LAYER 7: Application load balancer for content routing
    AppLB = container "Application Load Balancer" {
        technology "NGINX"
        description "Layer 7 LB for URL-based routing and session affinity"
        tags ["load-balancer", "layer7"]
        
        tradeoff {
            decision "Use NGINX (Layer 7) for application routing"
            sacrifice "Raw throughput (L4 would be 2x faster)"
            reason "Need URL-based routing: /catalog → catalog servers, /checkout → checkout servers"
            mitigation "Use L4 edge LB for SSL termination to reduce NGINX load"
        }
        
        slo {
            latency {
                p95 "50ms"
                window "7 days"
            }
            availability {
                target "99.99%"
                window "30 days"
            }
        }
    }
    
    // BACKEND SERVICES
    CatalogServer = container "Catalog Service" {
        technology "Python, Django"
        description "Product catalog (read-heavy)"
        tags ["service"]
        quantity 10
    }
    
    CheckoutServer = container "Checkout Service" {
        technology "Node.js"
        description "Checkout flow (stateful, requires session affinity)"
        tags ["service"]
        quantity 5
    }
    
    APIServer = container "API Service" {
        technology "Go"
        description "Public API (stateless, high throughput)"
        tags ["service"]
        quantity 20
    }
    
    // TRAFFIC FLOW
    EdgeLB -> AppLB "Distributes traffic (Round Robin)"
    AppLB -> CatalogServer "Routes /catalog (Least Connections)"
    AppLB -> CheckoutServer "Routes /checkout (IP Hash for session affinity)"
    AppLB -> APIServer "Routes /api (Least Connections)"
}

view index {
    title "E-Commerce Load Balancing Architecture"
    include *
}

view load-balancing {
    title "Load Balancer Configuration"
    include ECommerce.EdgeLB ECommerce.AppLB
}

Key Trade-Offs Documented

1. Layer Choice:

  • Why Layer 4 at edge? Raw speed for SSL termination
  • Why Layer 7 at app? Need URL-based routing and session affinity

2. Algorithm Selection:

  • Round Robin: For catalog (stateless, uniform requests)
  • IP Hash: For checkout (requires session stickiness)
  • Least Connections: For API (varying query complexity)

3. Performance vs Features:

  • Sacrifice raw throughput for intelligent routing
  • Mitigated by using layered approach (L4 + L7)

Knowledge Check

Q: My app needs to route /images to image servers and /api to API servers. Which load balancing layer should I use?

Layer 7 (Application Layer)

Layer 7 load balancers can inspect HTTP content (URLs, headers) and route based on that. Layer 4 only looks at IP/port and can't do content-based routing.

Q: I'm building a WebSocket chat app. Users connect and stay connected for hours. Which algorithm should I use?

IP Hash (or consistent hashing)

You need session affinity - when a user connects, they should stay on the same server. IP hash ensures the same IP always routes to the same server. Round robin would break the WebSocket connection when requests go to different servers.

Q: I have 10 identical servers handling high-throughput API requests. Speed is the priority. What algorithm?

Round Robin

When servers are identical and stateless, round robin is perfect. It's simple, fast, and gives even distribution. No need for IP hash (stateful) or least connections (servers have similar load).

Quiz: Test Your Knowledge

Q1: What type of load balancing operates at the transport layer and uses IP addresses and ports?

  • Layer 4 (Transport Layer)
  • Layer 7 (Application Layer)
  • Layer 3 (Network Layer)
Answer: **Layer 4 (Transport Layer)** operates at the transport layer and makes decisions based on IP addresses and TCP/UDP ports. Layer 7 works at the application layer and inspects HTTP content.

Q2: Which load balancing algorithm is best for applications requiring session stickiness?

  • Round Robin
  • Least Connections
  • IP Hash
Answer: **IP Hash** uses the client's IP address to determine which server receives the request, ensuring the same client always goes to the same server. This is essential for session stickiness in stateful applications like WebSockets or applications with in-memory sessions.

Q3: You're building an API gateway routing to microservices with varying response times. Which algorithm should you use?

  • Round Robin
  • Least Connections
  • IP Hash
Answer: **Least Connections** sends requests to the server with the fewest active connections. This is ideal when services have varying request processing times (some API calls are fast, others slow) because it accounts for actual server load rather than just distributing requests evenly.

Q4: Which of these is NOT a characteristic of Layer 7 load balancing?

  • Can inspect HTTP headers and URLs
  • Slower than Layer 4 due to content inspection
  • Cannot do SSL termination
  • More CPU intensive than Layer 4
Answer: **Cannot do SSL termination** is NOT correct. Layer 7 load balancers CAN and commonly DO perform SSL termination. They terminate SSL at the load balancer, decrypt traffic, inspect HTTP content, then route it. This is one of their key features.

Q5: NGINX uses a layered approach with Layer 4 at the edge and Layer 7 closer to applications. Why?

  • To reduce complexity
  • To balance speed and intelligence
  • To minimize costs
  • To reduce latency
Answer: **To balance speed and intelligence**. Layer 4 provides raw speed for SSL termination and initial routing (high throughput). Layer 7 provides intelligent features like URL-based routing closer to the applications. This layered approach gives you both performance (L4) and features (L7).

Next Steps

Now that we understand load balancing, let's learn about databases and how to choose between SQL and NoSQL. 👉 Lesson 2: Databases (SQL vs NoSQL, Replication, Sharding)

Lesson 2: Databases

SQL vs. NoSQL: The Fundamental Choice

Choosing between SQL and NoSQL is one of the most important architecture decisions. Here's how to think about it.

SQL (Relational Databases)

What it does: Stores structured data in tables with predefined schemas using the relational model.

Characteristics:

  • Structured data - Tables, rows, columns with defined types
  • ACID compliance - Strong guarantees for Atomicity, Consistency, Isolation, Durability
  • Powerful querying - Complex joins, aggregations, transactions
  • Data integrity - Foreign keys, constraints, referential integrity
  • Vertical scaling - Limited to single server (mostly)
  • Schema changes - ALTER TABLE can be slow/complex

Examples: PostgreSQL, MySQL, Oracle, SQL Server

Best for:

  • Financial transactions (banking, payments)
  • Complex queries with joins and aggregations
  • Strong consistency requirements
  • Data relationships (e.g., user orders and order items)

NoSQL (Non-Relational Databases)

What it does: Flexible data models without rigid schemas, designed for horizontal scaling.

Characteristics:

  • Flexible schemas - Add fields on the fly
  • Horizontal scaling - Distribute data across multiple servers
  • High performance - Optimized for specific access patterns
  • Multiple models - Document, Key-Value, Graph, Column-family
  • Weaker consistency - Often eventual consistency
  • Limited querying - Complex joins are harder or impossible

Examples:

  • Document: MongoDB, CouchDB
  • Key-Value: Redis, DynamoDB
  • Column-family: Cassandra, HBase
  • Graph: Neo4j, JanusGraph

Best for:

  • Rapidly evolving data structures
  • Massive scale (PBs of data, millions of users)
  • High throughput write workloads
  • Hierarchical or graph data

Decision Matrix: SQL vs NoSQL

| Factor         | Choose SQL When...          | Choose NoSQL When...       |
|----------------|-----------------------------|----------------------------|
| Data structure | Fixed, well-defined schema  | Flexible, evolving schema  |
| Consistency    | Strong consistency required | Eventual consistency OK    |
| Scale          | Moderate (up to TBs)        | Massive (TBs to PBs)       |
| Complexity     | Complex queries, joins      | Simple access patterns     |
| Transactions   | ACID transactions needed    | Eventual consistency OK    |
| Speed          | Complex queries take time   | Fast for specific patterns |

Scaling Databases

Replication: Copying Data

Replication copies data across multiple servers for redundancy and performance.

Master-Slave (Primary-Replica)

How it works: All writes go to the master. Reads can go to slaves (read replicas).

Write → Master DB (Primary)
         ↓ (Replication)
         → Slave 1 (Read Replica)
         → Slave 2 (Read Replica)
         → Slave 3 (Read Replica)

Read → Can go to any slave

Pros:

  • Read scalability - Distribute read load across replicas
  • Data redundancy - Multiple copies for backup/failover
  • Improved read latency - Read from geographically closer replica
  • Simplifies backups - Backup from replicas, not production

Cons:

  • Single write bottleneck - All writes go through master
  • Replication lag - Slaves might be slightly behind master
  • Complex failover - Promoting a replica to master needs care
  • Write scaling - Cannot scale writes (still single master)

Real-world use: Read-heavy systems (e.g., product catalog, user profiles, content delivery)

Performance impact:

  • Read throughput: 10x improvement (1 master → 10 replicas)
  • Write throughput: No improvement (still limited by master)
  • Replication lag: Typically 10-100ms for local replicas
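A toy model of the write-to-primary, read-from-replica split (the `ReplicatedStore` class is an illustrative sketch; real replication is asynchronous over the network, which is where the lag above comes from):

```python
import random

class ReplicatedStore:
    """Route writes to the primary and reads to a random replica (sketch)."""

    def __init__(self):
        self.primary = {}             # authoritative copy (accepts writes)
        self.replicas = [{}, {}, {}]  # read replicas (may lag in reality)

    def write(self, key, value):
        self.primary[key] = value
        self._replicate(key, value)   # real systems do this asynchronously

    def _replicate(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Reads fan out across replicas; with async replication a read
        # here could return a stale value (replication lag).
        return random.choice(self.replicas).get(key)

store = ReplicatedStore()
store.write("user:1", "alice")
print(store.read("user:1"))  # alice (replication here is synchronous)
```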

Master-Master (Active-Active)

How it works: Writes can go to any node. Changes are propagated between masters.

Write → Master 1 → Master 2 → Master 3
Write → Master 2 → Master 1 → Master 3
Write → Master 3 → Master 1 → Master 2

Read → Can go to any master

Pros:

  • Write scalability - Distribute writes across masters
  • No single point of failure - Any master can accept writes
  • Geographic distribution - Masters in different regions for local writes

Cons:

  • Conflict resolution - Two masters writing same row simultaneously
  • Complex setup - More complex to configure and maintain
  • Eventual consistency - Harder to maintain strong consistency

Real-world use: Global applications needing local writes in each region (e.g., social media feeds)

Performance impact:

  • Write throughput: 3-5x improvement (1 master → 3-5 masters)
  • Conflict rate: Depends on write patterns (0.1-5% of writes may conflict)
  • Consistency window: 100-500ms for cross-region replication
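One common way the write conflicts above are resolved is last-write-wins (LWW) by timestamp; a minimal sketch (the `lww_merge` helper and data are hypothetical):

```python
# Last-write-wins merge: when two masters update the same row, keep the
# version with the newer timestamp. Stamps are (timestamp, node_id)
# tuples so that ties break deterministically by node id.

def lww_merge(local: dict, remote: dict) -> dict:
    merged = dict(local)
    for key, (value, stamp) in remote.items():
        if key not in merged or stamp > merged[key][1]:
            merged[key] = (value, stamp)
    return merged

master1 = {"cart:42": ("2 items", (100, "m1"))}
master2 = {"cart:42": ("3 items", (105, "m2"))}
print(lww_merge(master1, master2))  # the ts=105 write wins
```

The merge is symmetric, so each master can apply it independently and converge; the cost is that the losing write is silently discarded.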

Sharding: Partitioning Data

Sharding (partitioning) splits data across multiple database servers based on a shard key.

Horizontal Sharding

How it works: Split rows of a table across servers based on a shard key.

Users Table (100M rows)
├── Shard 1: Users A-G (25M users) → DB Server 1
├── Shard 2: Users H-N (25M users) → DB Server 2
├── Shard 3: Users O-T (25M users) → DB Server 3
└── Shard 4: Users U-Z (25M users) → DB Server 4

Pros:

  • Write scalability - Distribute writes across shards
  • Massive scale - Handle petabytes of data
  • Parallel queries - Query multiple shards in parallel
  • Independent scaling - Scale hot shards independently

Cons:

  • Complex joins - Cross-shard joins are hard/expensive
  • Rebalancing - Moving data between shards is complex
  • Query complexity - Application needs to route queries to right shard
  • Hot spots - Popular shard keys can create imbalances

Real-world use: User data, time-series data, any massive dataset

Performance impact:

  • Write throughput: 10x improvement (1 server → 10 shards)
  • Query performance: Depends on shard key (local shard = fast, cross-shard = slow)
  • Rebalancing time: Days to weeks for large datasets
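The range split in the diagram above can be expressed as a small routing function (boundaries and shard names are illustrative):

```python
import bisect

# Range-based sharding: split users A-Z across four shards by the
# first letter of the shard key, matching the diagram above.
BOUNDARIES = ["H", "O", "U"]           # split points between shards
SHARDS = ["db1", "db2", "db3", "db4"]  # A-G, H-N, O-T, U-Z

def shard_for(username: str) -> str:
    letter = username[0].upper()
    # bisect_right finds which range the letter falls into.
    return SHARDS[bisect.bisect_right(BOUNDARIES, letter)]

print(shard_for("alice"))  # db1
print(shard_for("henry"))  # db2
print(shard_for("zoe"))    # db4
```

The application-side routing cost listed in the cons is exactly this function: every query must know its shard key before it can be sent anywhere.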

Consistent Hashing

How it works: Use a hash ring to determine shard, minimizing data movement when adding/removing shards.

Hash Ring: [0, 1, 2, ..., 1023]
├── Shard 1: 0-255
├── Shard 2: 256-511
├── Shard 3: 512-767
└── Shard 4: 768-1023

User "alice" → hash("alice") % 1024 = 342 → Shard 2
User "bob"   → hash("bob") % 1024 = 789 → Shard 4

Pros:

  • Minimal rebalancing - Adding a shard moves ~1/n of data
  • Even distribution - Hash function spreads data evenly
  • Deterministic - Same key always routes to same shard

Cons:

  • Range queries hard - Finding users A-G requires querying all shards
  • Hash collisions - Poor hash function can create imbalances
  • No hot spot handling - Popular keys still create imbalances

Real-world use: Key-value stores, caching, session storage

Real-World Case Studies

Case Study 1: Netflix's Migration to Cassandra

The Challenge: Netflix grew from DVD-by-mail to streaming, going from a single database to needing to handle billions of play events per day.

Before (Single PostgreSQL):

  • Single master database
  • 100M subscribers
  • Limited by single server capacity
  • SPOF (Single Point of Failure)

The Migration Journey:

Phase 1: Master-Slave Replication

PostgreSQL Master (writes) → 5 Read Replicas
                               ↓ (replication lag: 50-100ms)
  • Result: 5x read scalability, writes still bottlenecked
  • Problem: Still can't handle streaming scale

Phase 2: Sharded PostgreSQL

  • Sharded by subscriber ID (100 shards)
  • Each shard: 1M subscribers
  • Result: Write scalability improved, but complex cross-shard queries
  • Problem: Manual rebalancing, hot shard issues

Phase 3: Migration to Cassandra (Final Solution)

Cassandra Cluster (100 nodes)
├── Datacenter US-East (50 nodes)
├── Datacenter US-West (50 nodes)

Data Model:
├── Play Events Table (sharded by user_id)
├── Watch History Table (sharded by user_id)
└── Catalog Table (replicated 3x, no sharding)

Key Decisions:

  1. Cassandra choice - NoSQL for horizontal scalability
  2. Multi-datacenter - Geographic redundancy and local reads
  3. Consistency level - QUORUM for reads/writes (balanced)
  4. Replication factor - 3 (2 datacenters + 1 local replica)

The Results:

  • Scale: 100M → 260M subscribers
  • Play events: 1B+ events/day (vs 10M with PostgreSQL)
  • Availability: 99.99% (was 99.9% with PostgreSQL)
  • Latency: p99 latency <200ms (vs 500ms with sharded PostgreSQL)
  • Data volume: 10+ PB of data (would be impossible with SQL)

💡 Key Insight: Netflix chose Cassandra for its write-heavy workload (play events). They kept some data in PostgreSQL for complex queries (billing, analytics) because different workloads benefit from different database technologies.

Case Study 2: Airbnb's Hybrid Database Strategy

The Challenge: Airbnb needed to handle rapid growth (1M → 100M+ users) while maintaining complex relationships between listings, bookings, and reviews.

The Architecture:

Airbnb Platform
├── PostgreSQL (Core Business)
│   ├── Users Table
│   ├── Listings Table
│   ├── Bookings Table
│   └── Reviews Table
│       └── Replicated (Master + 3 Read Replicas)
│
└── MongoDB (Flexible Data)
    ├── Search Index
    ├── Message Queue
    └── Activity Feed

Why PostgreSQL for Core Business?

  1. Data relationships matter

    • Listing → Bookings → Reviews (complex joins)
    • User → Listings → Reviews (many-to-many relationships)
    • Foreign keys ensure referential integrity
  2. ACID transactions

    • Booking = Create booking record + Update availability + Send notification
    • All-or-nothing: booking succeeds or fails completely
    • No partial bookings or data inconsistencies
  3. Complex queries

    • "Find listings with availability in date range" (complex WHERE clauses)
    • "Get user's booking history with reviews" (joins + aggregations)
    • "Calculate host revenue" (complex analytics)

Why MongoDB for Flexible Data?

  1. Evolving schema

    • Search index fields change frequently (add filters, sorting)
    • Message queue format evolves (add metadata, change structure)
    • No ALTER TABLE downtime
  2. Horizontal scaling

    • Search index: 100M+ listings, complex queries
    • Activity feed: Millions of updates/second
    • Sharding easier than PostgreSQL
  3. Performance patterns

    • Search: Read-heavy, simple queries
    • Activity feed: Write-heavy, append-only
    • Both benefit from NoSQL optimizations

Performance Results:

  • PostgreSQL: 50K writes/sec, 500K reads/sec (complex queries)
  • MongoDB: 200K writes/sec, 2M reads/sec (simple queries)
  • Overall latency: p95 <100ms for core flows

💡 Key Insight: Airbnb uses the right tool for the right job. PostgreSQL for complex relational data, MongoDB for flexible, high-throughput patterns. This hybrid approach gives them both data integrity and scalability.

Case Study 3: Uber's Sharding Evolution

The Challenge: Uber grew from a single PostgreSQL database to processing millions of trips per second globally.

Stage 1: Single PostgreSQL

  • Single database in one datacenter
  • 1M trips/day
  • Bottleneck: Single server capacity
  • Risk: Datacenter outage = complete downtime

Stage 2: Sharded PostgreSQL

  • Sharded by city_id (US trips in US DB, Europe trips in Europe DB)
  • 100M trips/day
  • Benefits: Geographic isolation, write scalability
  • Problem: Complex cross-city queries (traveler rides in multiple cities)

Stage 3: Schemaless Sharding (The Pivot)

  • Moved to a schemaless design built on MySQL (JSON blobs)
  • Sharded by rider_id (each rider's data on one shard)
  • 1B+ trips/day
  • Benefits: Schema flexibility, better shard distribution
  • Problem: Still hard to evolve, complex migrations

Stage 4: Multi-Database Strategy (Current)

Uber Platform
├── MySQL (Transactional Data)
│   ├── Trips Table (sharded by city)
│   ├── Users Table (sharded by user_id)
│   └── Payments Table (sharded by ride_id)
│
├── Cassandra (Time-Series Data)
│   ├── Event Stream (10M+ events/sec)
│   └── Telemetry (vehicle location updates)
│
├── Redis (Real-Time Data)
│   ├── Driver Location (updates every 2 sec)
│   └── Surge Pricing (calculations every 5 sec)
│
└── ElasticSearch (Search)
    ├── Location Search (geospatial queries)
    └── Destination Search (autocomplete)

Key Decisions:

  1. MySQL for transactions - Trip data needs ACID guarantees
  2. Cassandra for time-series - Event stream is write-heavy (append-only)
  3. Redis for real-time - Driver locations need <100ms latency
  4. ElasticSearch for search - Geospatial queries require specialized search

Performance Results:

  • Trips processed: 1B+ trips/day (from 1M)
  • Event stream: 10M+ events/sec (Cassandra handles write load)
  • Real-time data: 10M+ driver locations/sec (Redis provides <50ms latency)
  • Search: <100ms p95 for location search (ElasticSearch optimization)

💡 Key Insight: Uber doesn't try to fit everything in one database. Different workloads (transactions, time-series, real-time, search) benefit from different database technologies. The complexity is higher, but the performance and scalability are unmatched.

Production Metrics

Database Performance Comparison

| Database   | Write Throughput | Read Latency (p95) | Scale   |
|------------|------------------|--------------------|---------|
| PostgreSQL | 50K writes/sec   | 10-50ms            | TBs     |
| MySQL      | 100K writes/sec  | 10-40ms            | TBs     |
| MongoDB    | 500K writes/sec  | 5-20ms             | PBs     |
| Cassandra  | 1M+ writes/sec   | 20-100ms           | PBs     |
| Redis      | 10M+ writes/sec  | <5ms               | GBs-TBs |

Replication Performance

| Strategy      | Write Latency Impact | Read Scalability | Consistency |
|---------------|----------------------|------------------|-------------|
| Master-Slave  | +5-10ms              | 10x              | Strong      |
| Master-Master | +50-200ms            | 3-5x             | Eventual    |
| Multi-Master  | +100-500ms           | 5-10x            | Eventual    |

Sharding Performance

| Strategy           | Write Throughput Improvement | Query Performance                  | Rebalancing |
|--------------------|------------------------------|------------------------------------|-------------|
| Horizontal (range) | 5-10x                        | Fast (local), slow (cross-shard)   | Days-weeks  |
| Consistent Hashing | 8-12x                        | Medium (range queries slow)        | Minimal     |
| Geographic         | 2-5x                         | Very fast (local reads)            | Complex     |

Trade-Off Scenarios

Scenario 1: E-Commerce Platform

Context: Building Amazon-scale e-commerce platform. Need to handle Black Friday traffic spikes (10x normal). Data: product catalog, user accounts, orders, reviews.

The Trade-Off Decisions:

| Decision        | Option A                      | Option B                      | What You Choose & Why                                        |
|-----------------|-------------------------------|-------------------------------|--------------------------------------------------------------|
| Product Catalog | PostgreSQL (complex queries)  | MongoDB (flexible schema)     | PostgreSQL - Complex filters, joins, ACID for inventory      |
| User Sessions   | PostgreSQL (sessions table)   | Redis (key-value)             | Redis - Fast access, TTL for expiration                      |
| Orders          | PostgreSQL (ACID transactions)| MongoDB (eventual consistency)| PostgreSQL - Financial data needs strong consistency         |
| Reviews         | PostgreSQL (relational)       | MongoDB (document)            | MongoDB - Flexible schema, high write throughput             |
| Replication     | Master-Slave (read-heavy)     | Master-Master (write-heavy)   | Master-Slave - 90% reads, strong consistency for orders      |
| Sharding        | Not needed (scale vertically) | Yes (future-proof)            | Yes - Black Friday traffic requires horizontal scale         |

Result:

  • Pros: Right database for each workload, strong consistency for transactions, scalability for Black Friday
  • Cons: Complex architecture (4 databases), cross-database joins impossible
  • Performance: 1M+ products, 100K+ concurrent users, p95 <100ms

Scenario 2: Social Media Feed

Context: Building a real-time social media feed like Twitter. Users post updates, followers see them in timeline. Requirements: millions of posts/sec, low latency, global availability.

The Trade-Off Decisions:

| Decision       | Option A                   | Option B                               | What You Choose & Why                                              |
|----------------|----------------------------|----------------------------------------|--------------------------------------------------------------------|
| Posts Storage  | PostgreSQL (relational)    | Cassandra (time-series)                | Cassandra - Write-heavy (append-only), time-series access pattern  |
| Timeline Query | PostgreSQL (complex joins) | Denormalized (pre-computed timelines)  | Denormalized - Pre-compute user timelines, no complex joins        |
| Likes/Follows  | PostgreSQL (relational)    | Redis (fast counters)                  | Redis - High-frequency reads/writes, simple counters               |
| User Profiles  | PostgreSQL (ACID needed)   | MongoDB (flexible)                     | PostgreSQL - Profile needs ACID (security, privacy)                |
| Replication    | Master-Slave (eventual OK) | Multi-Master (global writes)           | Multi-Master - Global users need local writes                      |
| Consistency    | Strong                     | Eventual                               | Eventual - OK if likes show with slight delay, but posts appear quickly |

Result:

  • Pros: Massive write scalability, global availability, low latency for feed
  • Cons: Eventual consistency (users might see stale data), complex architecture
  • Performance: 10M+ posts/sec, p99 feed latency <200ms, 99.99% availability

Scenario 3: Analytics Platform

Context: Building an analytics platform processing billions of events/day. Need to store raw events, compute aggregates in real-time, and serve ad-hoc queries.

The Trade-Off Decisions:

| Decision       | Option A                        | Option B                          | What You Choose & Why                                                        |
|----------------|---------------------------------|-----------------------------------|------------------------------------------------------------------------------|
| Raw Events     | PostgreSQL (structured)         | Cassandra (time-series)           | Cassandra - Write-heavy (append-only), automatic TTL, time-series partitioning |
| Aggregates     | PostgreSQL (materialized views) | Redis (pre-computed)              | Redis - Fast access, TTL for expiration, update on new events                |
| Ad-hoc Queries | PostgreSQL (complex SQL)        | ElasticSearch (search/analytics)  | ElasticSearch - Fast aggregations, full-text search, scalable                |
| User Metadata  | PostgreSQL (ACID needed)        | MongoDB (flexible)                | PostgreSQL - User accounts need ACID (security, permissions)                 |
| Data Retention | Manual cleanup                  | TTL-based automatic               | TTL-based - Automate cleanup (billions of events/day)                        |
| Consistency    | Strong                          | Eventual                          | Eventual - OK if aggregates are slightly stale, but raw events must be durable |

Result:

  • Pros: Massive scale (PBs of data), automatic retention, fast aggregations
  • Cons: Eventual consistency (aggregates might be behind), complex ETL pipelines
  • Performance: 10B+ events/day, p95 query latency <500ms, 99.9% data durability

Sruja Perspective: Modeling Databases

In Sruja, we document database choices with clear trade-offs and performance characteristics.

Why Model Databases?

Modeling databases in your architecture provides:

  1. Technology choice clarity - Document why SQL vs NoSQL
  2. Scaling strategy - Show replication/sharding approach
  3. Performance visibility - Track throughput, latency, scale
  4. Failure analysis - Understand SPOFs and failover scenarios

Example: E-Commerce Database Architecture

import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce Platform" {
    description "Multi-database architecture for different workloads"
    
    // POSTGRESQL: Core transactional data
    PrimaryDB = database "Primary PostgreSQL" {
        technology "PostgreSQL 15"
        description "Stores users, products, orders - requires ACID"
        tags ["sql", "transactional"]
        
        tradeoff {
            decision "Use PostgreSQL for core business data"
            sacrifice "Write scalability (single master for writes)"
            reason "Strong consistency required for orders/payments, complex queries needed for product catalog"
            mitigation "Read replicas for read scalability, plan for sharding"
        }
        
        capacity {
            storage "10 TB"
            throughput_writes "50K/sec"
            throughput_reads "500K/sec"
            replication "Master + 3 Read Replicas"
        }
    }
    
    // MONGODB: Flexible, high-throughput data
    ReviewsDB = database "MongoDB Cluster" {
        technology "MongoDB 7"
        description "Stores reviews, activity feed - flexible schema, high writes"
        tags ["nosql", "document", "sharded"]
        
        tradeoff {
            decision "Use MongoDB for reviews and activity feed"
            sacrifice "Strong consistency (eventual consistency OK)"
            reason "Flexible schema for evolving data, high write throughput, easy sharding"
            mitigation "Write-ahead log for durability, read-your-writes for consistency"
        }
        
        capacity {
            storage "50 TB"
            throughput_writes "500K/sec"
            throughput_reads "2M/sec"
            replication "Replica Set 3 + Sharded by user_id"
        }
    }
    
    // REDIS: Real-time session data
    SessionStore = database "Redis Cluster" {
        technology "Redis 7"
        description "Stores user sessions, shopping carts - fast access, TTL"
        tags ["nosql", "key-value", "real-time"]
        
        tradeoff {
            decision "Use Redis for session data"
            sacrifice "Durability (in-memory, can lose data on crash)"
            reason "Sub-millisecond latency required for session checks, automatic TTL for cleanup"
            mitigation "AOF persistence, replicas for backup"
        }
        
        capacity {
            storage "1 TB"
            throughput_writes "10M/sec"
            throughput_reads "20M/sec"
            replication "Cluster mode (6 nodes)"
        }
    }
    
    // TRAFFIC FLOW
    UserService = container "User Service" {
        technology "Go"
    }
    
    ProductService = container "Product Service" {
        technology "Python"
    }
    
    OrderService = container "Order Service" {
        technology "Java"
    }
    
    UserService -> PrimaryDB "Reads user data"
    UserService -> SessionStore "Checks session (fast)"
    ProductService -> PrimaryDB "Queries product catalog"
    OrderService -> PrimaryDB "Creates order (transaction)"
    OrderService -> ReviewsDB "Posts review (high throughput)"
}

view index {
    title "E-Commerce Database Architecture"
    include *
}

view sql {
    title "SQL Databases"
    include ECommerce.PrimaryDB
}

view nosql {
    title "NoSQL Databases"
    include ECommerce.ReviewsDB ECommerce.SessionStore
}

Key Trade-Offs Documented

1. Database Technology Choice:

  • PostgreSQL for transactions - Strong consistency, complex queries, but write scalability limited
  • MongoDB for high throughput - Flexible schema, horizontal scaling, but eventual consistency
  • Redis for real-time - Sub-millisecond latency, but in-memory (durability concerns)

2. Scaling Strategy:

  • Read replicas for PostgreSQL (read-heavy workload: 10:1 read:write ratio)
  • Sharding for MongoDB (write-heavy reviews: user_id shard key)
  • Redis cluster for session data (high throughput, automatic failover)

3. Consistency vs Performance:

  • Sacrifice strong consistency for reviews (OK if reviews appear with slight delay)
  • Maintain strong consistency for orders (financial data cannot be wrong)
  • Use read-your-writes pattern for better user experience

Knowledge Check

Q: I'm building a social media feed like Twitter. Millions of posts/second. What database should I use for posts?

Cassandra (or similar time-series NoSQL)

Social media feeds are write-heavy (append-only, time-series) and require massive horizontal scale. A single SQL database cannot sustain millions of writes per second. Cassandra is designed for exactly this workload: it is write-optimized, partitions naturally by time-bucketed keys, and scales horizontally.

Q: Why use master-slave replication instead of master-master?

Master-slave is simpler and easier to reason about.

In master-slave replication, all writes go to a single master and reads can be spread across slaves. This is simple to implement, keeps one authoritative write path (though asynchronous replicas can lag slightly behind the master), and works well for read-heavy workloads. Master-master scales writes better but adds complexity: conflict resolution, eventual consistency, and behavior that is harder to reason about.

Q: My app has flexible data (schema changes frequently). Should I use SQL or NoSQL?

NoSQL - specifically document databases like MongoDB.

SQL databases have rigid schemas. ALTER TABLE can be slow and complex. NoSQL document databases allow you to add fields on the fly without migrations. Perfect for rapidly evolving data structures (e.g., user profiles with new features, product catalogs with new attributes).

Q: What's the main benefit of consistent hashing over range sharding?

Minimal data movement when adding/removing shards.

With range sharding, rebalancing after adding a shard can require moving a large fraction of the data. With consistent hashing, adding a shard moves only ~1/n of the data (where n is the number of shards). This makes scaling operations much faster and less disruptive.
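The minimal-data-movement property can be demonstrated with a small hash ring. This is an illustrative Python sketch (no virtual nodes, hypothetical shard names), not a production implementation:

```python
import hashlib
from bisect import bisect

def _hash(key: str) -> int:
    # Stable position on a 2^32 ring, derived from MD5 for the demo
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    """Minimal consistent-hash ring: each key maps to the first shard clockwise."""
    def __init__(self, shards):
        self.ring = sorted((_hash(s), s) for s in shards)

    def shard_for(self, key: str) -> str:
        pos = _hash(key)
        points = [p for p, _ in self.ring]
        idx = bisect(points, pos) % len(points)   # wrap around past the last point
        return self.ring[idx][1]

keys = [f"user:{i}" for i in range(10_000)]
before = HashRing(["shard-a", "shard-b", "shard-c"])
after = HashRing(["shard-a", "shard-b", "shard-c", "shard-d"])

# Only keys falling on the new shard's arc move; everything else stays put.
moved = sum(before.shard_for(k) != after.shard_for(k) for k in keys)
```

Every key that moved now lives on the new shard; the arcs owned by the existing shards are untouched. Real deployments add virtual nodes per shard to even out the arc sizes.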

Quiz: Test Your Knowledge

Q1: Which database type is best for financial transactions requiring ACID guarantees?

  • MongoDB (Document)
  • Redis (Key-Value)
  • PostgreSQL (Relational)
  • Cassandra (Column-family)
Answer

PostgreSQL (Relational) provides ACID compliance (Atomicity, Consistency, Isolation, Durability), which is critical for financial transactions. MongoDB and Redis trade full multi-document transactional guarantees for performance. Cassandra provides tunable consistency but is optimized for write-heavy workloads, not complex transactions.

Q2: Which replication strategy is best for a read-heavy system where reads outnumber writes 10:1?

  • Master-Master
  • Master-Slave
  • No replication needed
Answer

Master-Slave is ideal for read-heavy workloads. All writes go to the master (single write path), but reads can be distributed across multiple read replicas. This gives you 10x read scalability with the simplicity of a single master. Master-Master gives write scalability but adds complexity (conflict resolution, eventual consistency).

Q3: You're building a time-series analytics platform ingesting billions of events/day. Which database is best?

  • PostgreSQL
  • MongoDB
  • Cassandra
  • Redis
Answer

Cassandra is designed for time-series workloads: write-heavy, append-only data that grows indefinitely. Its partition-key model fits time-bucketed data, it scales horizontally, and it provides TTL for automatic data retention. PostgreSQL would struggle with the write throughput and storage volume. MongoDB works, but Cassandra's time-series optimizations are a better fit. Redis is in-memory and too expensive for billions of events.

Q4: Which of these is NOT a characteristic of NoSQL databases?

  • Flexible schemas
  • Horizontal scalability
  • Strong ACID compliance
  • Eventual consistency is often acceptable
Answer

Strong ACID compliance is NOT a characteristic of most NoSQL databases. NoSQL databases typically prioritize scalability and flexibility over strong ACID guarantees. While some NoSQL databases (like MongoDB) provide document-level ACID guarantees, they don't provide the same level of multi-document transaction support as SQL databases.

Q5: Uber uses multiple databases (MySQL, Cassandra, Redis, ElasticSearch). Why?

  • To reduce complexity
  • To choose the right database for each workload
  • Because they couldn't decide on one
  • To minimize costs
Answer

To choose the right database for each workload - this is a deliberate architectural decision. Different workloads benefit from different databases: MySQL for transactions (trips), Cassandra for time-series (event stream), Redis for real-time (driver locations), ElasticSearch for search. This polyglot persistence approach gives better performance and scalability than trying to force everything into one database.

Q6: What's the main drawback of sharding?

  • Slower writes (distributed across servers)
  • Complex joins across shards
  • More storage space needed
  • Slower reads (distributed across servers)
Answer

Complex joins across shards are the main drawback. When data is split across multiple database servers, joining rows that live on different shards becomes expensive: a cross-shard join must query several shards and merge the results in the application, which is slow and error-prone. This is why shard keys are chosen carefully, to minimize cross-shard queries.

Q7: Airbnb uses PostgreSQL for core business data and MongoDB for flexible data. Why this hybrid approach?

  • PostgreSQL is faster
  • MongoDB is cheaper
  • Different workloads benefit from different technologies
  • They couldn't scale PostgreSQL
Answer

Different workloads benefit from different technologies - this is a smart architectural decision. PostgreSQL excels at complex relational data with ACID transactions (users, listings, bookings). MongoDB excels at flexible, evolving schemas with high write throughput (search index, activity feed). Using the right database for each workload gives better performance and scalability than forcing everything into one technology.

Q8: Which database feature allows Redis to automatically delete session data after 30 minutes?

  • ACID transactions
  • TTL (Time To Live)
  • Sharding
  • Master-Slave replication
Answer

TTL (Time To Live) allows Redis to automatically delete data after a specified time. This is perfect for session data, shopping carts, and other temporary data. You don't need manual cleanup jobs - Redis handles it automatically. Set a TTL of 30 minutes when storing a session, and Redis deletes it automatically.

Next Steps

Now that we understand databases and how to choose between SQL and NoSQL, let's learn about caching to improve performance. 👉 Lesson 3: Caching (Strategies, Eviction Policies, Redis Case Study)

Lesson 3: Caching

Why Cache? The Performance Multiplier

Caching is the process of storing copies of data in a temporary storage location (cache) so that future requests for that data can be served faster. It's one of the most effective performance optimizations in system design.

Without caching:

User Request → Database (100-500ms latency)
               ↓
          Disk I/O, network round-trip

With caching:

User Request → Cache Hit (1-5ms latency) ✓
               ↓
          Return data immediately (no DB query)

Cache miss scenario:

User Request → Cache Miss
               ↓
          Database Query (100-500ms)
               ↓
          Store in Cache
               ↓
          Return data

The Performance Impact

| System | Without Cache | With Cache | Improvement |
|---|---|---|---|
| User Profile | 200ms (DB) | 2ms (Redis) | 100x faster |
| Product Catalog | 150ms (DB) | 5ms (Redis) | 30x faster |
| API Response | 500ms (DB + joins) | 10ms (Redis) | 50x faster |
| Session Data | 50ms (DB) | 1ms (in-memory) | 50x faster |

Caching Strategies

Cache-Aside (Lazy Loading)

How it works:

  1. App checks cache
  2. If miss, App reads from DB
  3. App writes to cache
  4. Next request hits cache
┌─────────────┐
│ Application │
└──────┬──────┘
       │
       ├───────── Check Cache
       │
       ├─ Hit? → Return (1-5ms)
       │
       └─ Miss? → Query DB (100-500ms)
                    ↓
                 Write to Cache
                    ↓
                 Return

Pros:

  • ✅ Only requested data is cached (efficient cache usage)
  • ✅ Simple to implement
  • ✅ Works with any database
  • ✅ Cache failure doesn't break app (fallback to DB)

Cons:

  • ❌ Initial request is slow (cache miss penalty)
  • ❌ Can have cache stampede (thundering herd) when multiple requests miss simultaneously
  • ❌ Stale data until cache expires or is invalidated

Real-world use: Most applications, product catalogs, user profiles, API responses
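The read path above can be sketched in a few lines (illustrative Python; plain dicts stand in for Redis and the database):

```python
db = {"user:1": {"name": "Ada"}}   # stand-in for the database
cache = {}                         # stand-in for Redis

def get_user(key):
    """Cache-aside read: check the cache, fall back to the DB, then populate."""
    if key in cache:               # 1. hit: fast path (~1-5ms in practice)
        return cache[key]
    value = db[key]                # 2. miss: slow read from the database
    cache[key] = value             # 3. populate the cache for the next request
    return value

first = get_user("user:1")         # miss: reads the DB and fills the cache
second = get_user("user:1")        # hit: served from the cache
```

Note the cache is only written on the read path; a real implementation would also attach a TTL and invalidate on updates.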

Write-Through

How it works:

  1. App writes to cache and DB simultaneously
  2. Reads always check cache first
┌─────────────┐
│ Application │
└──────┬──────┘
       │
       ├───── Write Request
       │
   ┌───┴──────────┐
   ▼              ▼
Write Cache    Write DB
(synchronous)  (synchronous)
   │              │
   └──────┬───────┘
          ▼
  Return (wait for both)

Pros:

  • ✅ Data in cache is always fresh
  • ✅ Strong consistency between cache and DB
  • ✅ Simple read logic (always check cache)

Cons:

  • ❌ Slower writes (synchronous DB write)
  • ❌ Writes to data that's never read (wasted cache space)
  • ❌ Higher latency for write operations

Real-world use: User sessions, configuration data that must be consistent
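The synchronous dual write can be sketched like this (illustrative Python; dicts stand in for the cache and the DB):

```python
db = {}
cache = {}

def put(key, value):
    """Write-through: update the cache and the DB together before returning."""
    cache[key] = value   # synchronous cache write
    db[key] = value      # synchronous DB write (the slow part of every write)

def get(key):
    return cache[key]    # reads always find fresh data in the cache

put("session:42", "alice")
```

Because the write does not return until both stores are updated, the cache can never serve stale data, at the cost of write latency.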

Write-Back (Write-Behind)

How it works:

  1. App writes to cache only
  2. Cache writes to DB asynchronously
  3. Reads check cache, fallback to DB if miss
┌─────────────┐
│ Application │
└──────┬──────┘
       │
       ├───── Write Request
       │
       ▼
  Write Cache (immediate)
       │
       ▼
  Return (immediate)
       │
       └── Asynchronously → Write DB (background)

Pros:

  • ✅ Extremely fast writes
  • ✅ Can batch writes to DB (reduced DB load)
  • ✅ Better write throughput

Cons:

  • ❌ Data loss risk if cache crashes before syncing
  • ❌ Complexity in handling failures
  • ❌ Eventual consistency (DB might be behind)
  • ❌ Data durability concerns

Real-world use: Write-heavy systems, analytics events, clickstream data
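The deferred DB write can be sketched like this (illustrative Python; in a real system the flush would run in a background thread or worker, and a crash before flushing loses the queued writes — exactly the durability risk listed above):

```python
from collections import deque

db = {}
cache = {}
pending = deque()                   # writes waiting to be flushed to the DB

def put(key, value):
    """Write-back: acknowledge after the cache write; the DB sync is deferred."""
    cache[key] = value
    pending.append((key, value))    # queued for the background flusher

def flush():
    """Flusher (normally a background job): drain queued writes into the DB."""
    while pending:
        key, value = pending.popleft()
        db[key] = value             # writes can also be batched here

put("event:1", "click")
assert "event:1" not in db          # the DB lags until the flush runs
flush()
```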

Refresh-Ahead

How it works:

  1. Background job refreshes cache before expiration
  2. Ensures hot data is always in cache
  3. Users never see cache misses for popular data
┌─────────────┐
│ Application │
└──────┬──────┘
       │
       ├───── Read Request
       │
       ├─ Hit? → Return
       │
       └─ Miss? → Query DB → Return
                    ↓
            Trigger background refresh

┌────────────────┐
│ Background Job │
└───────┬────────┘
          │
          ├─ Find expiring cache entries
          ├─ Refresh from DB
          └─ Update cache (before users request)

Pros:

  • ✅ Users never see cache misses for hot data
  • ✅ Predictive caching for known hot keys
  • ✅ Better user experience

Cons:

  • ❌ Complexity in predicting what to refresh
  • ❌ Wasted resources if predictions are wrong
  • ❌ Background processing adds complexity

Real-world use: Product catalogs, leaderboards, trending content
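One pass of the background refresh job might look like this (illustrative Python; the TTL, threshold, and data are made up for the demo):

```python
import time

db = {"top10": ["alice", "bob"]}   # stand-in for the database
TTL = 60.0                          # normal cache TTL, in seconds
cache = {}                          # key -> (expires_at, value)

def refresh_ahead(threshold=15.0):
    """One pass of the background job: re-fetch entries that expire soon."""
    now = time.monotonic()
    refreshed = []
    for key, (expires_at, _) in list(cache.items()):
        if expires_at - now < threshold:         # about to expire
            cache[key] = (now + TTL, db[key])    # refresh before a user misses
            refreshed.append(key)
    return refreshed

# Seed a hot entry that is close to expiring, then run the background pass.
cache["top10"] = (time.monotonic() + 5.0, ["alice", "bob"])
refreshed = refresh_ahead()
```

The hard part in production is the key selection: you only want to refresh keys that are both expiring soon and likely to be requested again.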

Eviction Policies

When the cache is full, what do you remove? This decision significantly impacts cache hit rates.

LRU (Least Recently Used)

How it works: Remove the item that hasn't been used for the longest time.

Example:

Cache [at capacity: 1000 items]
├── User A profile (last accessed: 1 hour ago) ← Evict this
├── Product catalog entry (last accessed: 5 minutes ago)
├── Session data (last accessed: 2 minutes ago)
└── ...

When cache is full, remove User A profile (not accessed recently)

Pros:

  • ✅ Simple to implement
  • ✅ Works well for temporal locality (recently accessed likely to be accessed again)
  • ✅ Good for general-purpose caching

Cons:

  • ❌ Doesn't account for access frequency (popular items might be evicted)
  • ❌ Can be suboptimal for certain access patterns
  • ❌ Requires tracking access time (some overhead)

Real-world use: Most default cache implementations, web caches, database caches
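A compact LRU cache can be built on an ordered map (illustrative Python using `OrderedDict`, which keeps insertion order and lets us move entries to the end on access):

```python
from collections import OrderedDict

class LRUCache:
    """LRU cache: evict the entry that was used least recently."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # touch "a", so "b" becomes the eviction candidate
cache.put("c", 3)    # capacity exceeded: "b" is evicted
```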

LFU (Least Frequently Used)

How it works: Remove the item that has been accessed least often overall.

Example:

Cache [at capacity: 1000 items]
├── User A profile (access count: 5) ← Evict this
├── Product catalog entry (access count: 10,000)
├── Session data (access count: 50)
└── ...

When cache is full, remove User A profile (least accessed)

Pros:

  • ✅ Keeps most popular items in cache
  • ✅ Good for workloads with clear popularity patterns
  • ✅ Optimizes for cache hit rate

Cons:

  • ❌ Requires tracking access frequency (more memory overhead)
  • ❌ Can lock in old popular items (cold data problem)
  • ❌ Doesn't adapt to changing access patterns quickly

Real-world use: Content distribution networks, recommendation systems, API response caches
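A naive LFU sketch that tracks per-key access counts (illustrative Python; real implementations use O(1) frequency buckets rather than a linear scan for the coldest key):

```python
from collections import Counter

class LFUCache:
    """LFU cache: evict the entry with the smallest access count."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}
        self.hits = Counter()                # per-key access frequency

    def get(self, key):
        if key not in self.data:
            return None
        self.hits[key] += 1
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            coldest = min(self.data, key=lambda k: self.hits[k])
            del self.data[coldest]           # evict the least-accessed entry
            del self.hits[coldest]
        self.data[key] = value
        self.hits[key] += 1

cache = LFUCache(2)
cache.put("hot", 1)
for _ in range(100):
    cache.get("hot")     # "hot" accumulates a high access count
cache.put("cold", 2)
cache.put("new", 3)      # full: "cold" (lowest count) is evicted, not "hot"
```

This also illustrates the cold-data problem from the cons list: once "hot" has a huge count, it is very hard to evict even if its popularity fades.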

FIFO (First In, First Out)

How it works: Remove the oldest item regardless of access patterns.

Example:

Cache [at capacity: 1000 items]
├── Item 1 (added 1 hour ago) ← Evict this
├── Item 2 (added 30 minutes ago)
├── Item 3 (added 15 minutes ago)
└── ...

When cache is full, remove Item 1 (oldest)

Pros:

  • ✅ Simplest to implement
  • ✅ No tracking needed (just insertion order)
  • ✅ Deterministic behavior

Cons:

  • ❌ No consideration of access patterns
  • ❌ Can evict hot items if they were cached early
  • ❌ Poor cache hit rate in most scenarios

Real-world use: Round-robin load balancer caches, simple message queues

TTL (Time To Live)

How it works: Remove items after a fixed time period, regardless of usage.

Example:

Cache entries with TTL:
├── User session (TTL: 30 minutes) ← Auto-remove after 30min
├── Product price (TTL: 5 minutes) ← Auto-remove after 5min
├── Configuration (TTL: 1 hour) ← Auto-remove after 1hour
└── News feed (TTL: 1 minute) ← Auto-remove after 1min

Pros:

  • ✅ Guarantees data freshness
  • ✅ Simple to understand and reason about
  • ✅ Works great for time-sensitive data
  • ✅ Automatic cleanup (no manual eviction needed)

Cons:

  • ❌ Popular items might expire and be re-fetched
  • ❌ Need to choose appropriate TTL (tuning required)
  • ❌ Can cause cache stampede if popular items expire simultaneously

Real-world use: Session data, real-time prices, news feeds, API rate limiting
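A TTL cache needs only an expiry timestamp per entry (illustrative Python; Redis does the same thing server-side via `EXPIRE`/`SETEX`):

```python
import time

class TTLCache:
    """TTL cache: entries expire a fixed time after they are written."""
    def __init__(self):
        self.data = {}                            # key -> (expires_at, value)

    def put(self, key, value, ttl):
        self.data[key] = (time.monotonic() + ttl, value)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:        # expired: treat as a miss
            del self.data[key]
            return None
        return value

cache = TTLCache()
cache.put("session:42", "alice", ttl=0.05)        # 50ms TTL for the demo
assert cache.get("session:42") == "alice"         # still fresh
time.sleep(0.1)                                   # let the entry expire
```

Expiring lazily on read (as here) is the simplest scheme; Redis combines lazy expiry with a background sweep.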

Random Replacement

How it works: Randomly select an item to evict.

Pros:

  • ✅ Simple to implement
  • ✅ No tracking overhead
  • ✅ Works surprisingly well in practice

Cons:

  • ❌ Can evict hot items
  • ❌ Suboptimal cache hit rate
  • ❌ Unpredictable behavior

Real-world use: Simple caching implementations where tracking overhead is a concern

Real-World Case Studies

Case Study 1: Redis at Scale

The Challenge: Redis is used by companies like Twitter, Instagram, and Uber for ultra-low latency caching. How does Redis handle millions of requests per second with sub-millisecond latency?

The Architecture:

Clients → Redis Cluster (100+ nodes)
         ├── Node 1-20: US-East
         ├── Node 21-40: US-West
         ├── Node 41-60: EU-Central
         └── Node 61-80: AP-Southeast
         
Sharding Strategy: Consistent hashing of keys
Replication: Master-slave for each shard
Persistence: AOF (Append Only File) + RDB snapshots

Key Optimizations:

  1. In-memory data structures - No disk I/O for reads
  2. Single-threaded - No locking overhead, atomic operations
  3. Efficient data structures - Hash tables, skip lists, etc.
  4. Pipelining - Send multiple commands in one network round-trip
  5. Lua scripting - Execute multiple operations atomically on server

Performance Results:

  • Throughput: 10M+ operations/second per cluster
  • Latency: <1ms p50, <5ms p99
  • Memory usage: 100GB+ RAM across cluster
  • Hit rate: 95-98% for hot keys
  • Capacity: 1B+ keys stored

Use Cases:

  • Session storage (10M+ active sessions)
  • Rate limiting (100K+ requests/second)
  • Real-time leaderboards
  • Pub/Sub messaging (100K+ messages/second)

💡 Key Insight: Redis achieves its performance by keeping everything in memory and using a single-threaded core. No locks means no contention. The trade-off is that it is memory-intensive and limited by RAM capacity.

Case Study 2: Facebook's Caching Strategy

The Challenge: Facebook (now Meta) serves billions of users with complex social graphs, news feeds, and real-time interactions. Caching is critical for performance at this scale.

The Caching Architecture:

Facebook Platform Caching Layers:
├── Edge Caching (Akamai CDN)
│   ├── Static assets (images, CSS, JS)
│   └── HTML content (cache time: 1-5 minutes)
│
├── Application Caching (Tao + Memcached)
│   ├── User profiles
│   ├── Friend lists
│   ├── News feed entries
│   └── Permissions
│
├── Database Caching (MySQL + caching layer)
│   ├── Hot database rows
│   ├── Query results
│   └── Join results
│
└── Client Caching (Browser)
    ├── API responses
    ├── Static resources
    └── Service Worker cache

Tao (The Associations and Objects):

  • Facebook's distributed cache for the social graph (objects and associations)
  • 100+ terabytes of cache across multiple datacenters
  • Cache hit rate: 98-99% for read-heavy workloads
  • Consistent hashing for key distribution
  • Hot items replicated across multiple caches

Memcached Configuration:

Cluster: 10,000+ servers
Total capacity: 10+ TB of RAM
Items cached: 1+ trillion objects
Hit rate: 98% (overall)
Latency: <5ms p95

News Feed Caching Strategy:

User A's News Feed Generation:
├── Check cache for user's feed (TTL: 5 minutes)
├── If miss, generate from:
│   ├── Friend list (cached, TTL: 1 hour)
│   ├── Friend's posts (cached, TTL: 10 minutes)
│   ├── Friend's likes (cached, TTL: 5 minutes)
│   └── Ranking algorithm (real-time)
├── Assemble feed
└── Cache result (TTL: 5 minutes)

Performance Results:

  • News feed generation: From 500ms (DB) to 10ms (cache)
  • User profile loads: From 200ms (DB) to 2ms (cache)
  • Social graph queries: From 100ms (DB) to 5ms (cache)
  • Overall hit rate: 98% across all cache layers

💡 Key Insight: Facebook uses multiple cache layers (edge, application, database, client) with different TTLs for each layer. This multi-layer approach gives them both performance (edge cache) and freshness (shorter TTLs at application layer). The complexity is high, but the performance improvement is massive.

Case Study 3: CDN Caching (CloudFront, Cloudflare, Akamai)

The Challenge: Serve static assets and content to users globally with low latency. How do CDNs achieve sub-100ms latency worldwide?

The Architecture:

User Request → Edge Server (closest geographically)
             ↓ (cache hit: <10ms)
          Return cached content
             
User Request → Edge Server (cache miss)
             ↓ (fetch from origin)
          Origin Server
             ↓ (return)
          Edge Server (cache for future)
             ↓
          Return to user

CloudFront Caching:

AWS Global Network:
├── 600+ edge locations (points of presence) worldwide
└── 13 regional edge caches

Cache Behaviors:
├── Static assets (images, CSS, JS): 1 year TTL
├── API responses: 5-60 minute TTL
├── HTML content: 1-5 minute TTL
└── Dynamic content: no caching (bypass)

Cache-Control Headers:

Cache-Control: max-age=3600, public
  → Cache for 1 hour, CDNs and browsers can cache

Cache-Control: max-age=0, no-cache, no-store, must-revalidate
  → No caching, always fetch from origin

Cache-Control: max-age=86400, s-maxage=3600, public
  → Browser cache for 1 day, CDN cache for 1 hour

Performance Results:

| Scenario | Without CDN | With CDN | Improvement |
|---|---|---|---|
| Static asset from US to Asia | 500ms | 50ms | 10x faster |
| API response from Europe to US | 300ms | 30ms | 10x faster |
| Video streaming (HD) | 2000ms | 200ms | 10x faster |
| Edge locations covered | 0 (origin only) | 600+ locations | Global |

Cache Invalidation:

  • Time-based: Automatic expiration based on TTL
  • Manual: Invalidate specific paths or objects
  • Purge: Remove from all edge locations (takes 5-30 minutes)

💡 Key Insight: CDNs use geographic distribution and massive scale (thousands of edge servers) to achieve low latency globally. The cache is distributed across edge locations, with each location caching content for nearby users. This gives sub-100ms latency worldwide while offloading origin servers.

Production Metrics

Cache Performance Comparison

| Cache System | Throughput | Latency (p95) | Hit Rate | Capacity |
|---|---|---|---|---|
| Redis | 10M+ ops/sec | <5ms | 95-98% | 1GB-1TB |
| Memcached | 100M+ ops/sec | <2ms | 90-95% | 10GB-100TB |
| Varnish | 10K+ req/sec | <10ms | 80-90% | 10GB-100GB |
| CDN (CloudFront) | 1M+ req/sec | <50ms | 70-80% | Unlimited |

Eviction Policy Performance

| Policy | Hit Rate | Complexity | Memory Overhead | Best For |
|---|---|---|---|---|
| LRU | 70-80% | Low | Low | General purpose |
| LFU | 75-85% | Medium | Medium | Popularity-based |
| TTL | 60-90% | Low | Very Low | Time-sensitive data |
| Random | 65-75% | Very Low | None | Simple systems |

Trade-Off Scenarios

Scenario 1: E-Commerce Product Catalog

Context: Building an e-commerce platform with 1M+ products. 90% of traffic is browsing products (read-heavy), 10% is purchasing (write-heavy). Need to handle Black Friday spikes (10x normal traffic).

The Trade-Off Decisions:

| Decision | Option A | Option B | What You Choose & Why |
|---|---|---|---|
| Cache Strategy | Write-Through | Cache-Aside | Cache-Aside - only cache hot products, not every product |
| Eviction Policy | LRU | TTL | LRU - popular products stay in cache, cold products evicted |
| TTL for Prices | 1 hour | 5 minutes | 5 minutes - prices change frequently, must stay fresh |
| TTL for Details | 1 day | 1 hour | 1 hour - product details change less often than prices |
| Cache Invalidation | Time-based (TTL) | Manual purge | TTL + manual purge - auto-expire for simplicity, manual purge for price changes |
| Cache Layer | Single Redis cluster | Multi-tier (Redis + CDN) | Redis + CDN - CDN for product images, Redis for API data |

Result:

  • Pros: High cache hit rate (90%+), fresh pricing data, handles Black Friday traffic
  • Cons: Manual cache invalidation adds complexity, dual cache layer adds operational overhead
  • Performance: 90% cache hit rate, <10ms API response for cached products, handles 10x traffic spikes

Scenario 2: Real-Time Leaderboard

Context: Building a real-time game with global leaderboards. 10M+ players, millions of score updates per second. Rankings must be updated in real-time (<100ms latency).

The Trade-Off Decisions:

| Decision | Option A | Option B | What You Choose & Why |
|---|---|---|---|
| Cache Strategy | Cache-Aside | Write-Through | Write-Through - leaderboard must always show latest scores |
| Data Structure | Sorted set (Redis ZSET) | Hash map + manual sorting | Sorted set - built-in sorted structure, O(log N) updates |
| Cache Size | Top 1000 players | Top 1M players | Top 1000 - cache only hot players, DB for the full list |
| TTL | 1 minute | 10 seconds | 10 seconds - scores update frequently, must stay fresh |
| Refresh Strategy | Client polls | Server push (WebSocket) | Server push - real-time updates, no polling overhead |
| Backup Strategy | Snapshot every hour | AOF + RDB | AOF + RDB - AOF for durability, RDB for fast recovery |

Result:

  • Pros: Real-time updates, efficient sorted queries, high write throughput
  • Cons: Memory-intensive (sorted sets), complex backup strategy
  • Performance: <50ms latency for top 1000, 1M+ score updates/second

Scenario 3: User Session Storage

Context: Building a web application with 1M+ active users. Sessions need to be fast (<5ms) and secure. Sessions should expire after 30 minutes of inactivity.

The Trade-Off Decisions:

| Decision | Option A | Option B | What You Choose & Why |
|---|---|---|---|
| Cache Strategy | Cache-Aside | Write-Back | Cache-Aside - simpler, strong consistency needed |
| Storage | Database (SQL) | Redis | Redis - sub-millisecond latency, automatic TTL |
| TTL | 30 minutes | 1 hour | 30 minutes - security requirement (sessions expire) |
| Persistence | None | AOF snapshots | AOF - prevent data loss if Redis crashes |
| Session Data | Minimal (user ID) | Full (user profile, cart) | Minimal in cache - faster access, smaller cache, profile in DB |
| Geo-Distribution | Single datacenter | Multi-region | Multi-region - global users, reduce latency |
| Backup Strategy | None | Replica set | Replica set - prevent session loss on failure |

Result:

  • Pros: Fast session access, automatic expiration, geo-distributed for low latency
  • Cons: Redis is memory-intensive, complexity in multi-region setup
  • Performance: <2ms session lookup, automatic TTL cleanup, 99.99% availability

Sruja Perspective: Modeling Caches

In Sruja, we document caching strategies with clear trade-offs and performance characteristics.

Why Model Caches?

Modeling caches in your architecture provides:

  1. Performance visibility - Track cache hit rates, latency improvements
  2. Strategy clarity - Document which caching strategy and why
  3. Failure analysis - Understand impact of cache failures
  4. Capacity planning - See cache size requirements and scaling needs

Example: E-Commerce Caching Architecture

import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce Platform" {
    description "Multi-tier caching strategy for e-commerce platform"
    
    // EDGE CACHE: CDN for static assets
    EdgeCache = container "CDN Cache" {
        technology "CloudFront / Cloudflare"
        description "Caches product images, CSS, JS globally"
        tags ["cache", "edge", "cdn"]
        
        tradeoff {
            decision "Use CDN for static assets"
            sacrifice "Cost (CDN services cost money)"
            reason "Reduces latency 10x globally, offloads origin servers"
            mitigation "Use cache-control headers, optimize asset sizes"
        }
        
        capacity {
            ttl_images "1 year"
            ttl_api "5-60 minutes"
            ttl_html "1-5 minutes"
        }
    }
    
    // APPLICATION CACHE: Redis for API data
    ProductCache = container "Product Cache" {
        technology "Redis Cluster"
        description "Caches product details, prices, inventory"
        tags ["cache", "application", "redis"]
        
        tradeoff {
            decision "Use Cache-Aside strategy"
            sacrifice "Initial request latency (cache miss)"
            reason "Only cache hot products, efficient memory usage"
            mitigation "Refresh-ahead for popular products, monitor hit rate"
        }
        
        tradeoff {
            decision "Use LRU eviction policy"
            sacrifice "Memory for cold products (frequent evictions)"
            reason "Popular products stay in cache, good temporal locality"
            mitigation "TTL for prices (5min), details (1hour)"
        }
        
        slo {
            latency {
                p95 "10ms"
                p99 "50ms"
                window "7 days"
            }
            hit_rate {
                target "90%"
                window "24 hours"
            }
        }
        
        capacity {
            memory "100GB"
            throughput "10M ops/sec"
            hit_rate "90%"
        }
    }
    
    // DATABASE: PostgreSQL for persistent storage
    ProductDB = database "Product Database" {
        technology "PostgreSQL"
        description "Stores all product data persistently"
        tags ["database", "sql"]
        
        tradeoff {
            decision "Use PostgreSQL for persistent storage"
            sacrifice "Write speed (disk I/O slower than memory)"
            reason "Strong consistency, ACID transactions, complex queries"
            mitigation "Read replicas, caching layer"
        }
    }
    
    // SERVICES
    ProductService = container "Product Service" {
        technology "Go"
        description "API for product catalog and search"
    }
    
    // TRAFFIC FLOW
    Client -> ProductService "Requests product"
    ProductService -> ProductCache "Check cache (Cache-Aside, LRU, TTL)"
    ProductService -> ProductDB "Query on cache miss"
    ProductService -> EdgeCache "Serve images/CSS/JS (CDN)"
}

view index {
    title "E-Commerce Caching Architecture"
    include *
}

view caching {
    title "Caching Strategy"
    include ECommerce.ProductCache
}

Key Trade-Offs Documented

1. Cache Strategy Choice:

  • Why Cache-Aside? Only cache hot data (efficient memory usage)
  • Why LRU eviction? Popular products stay in cache (good temporal locality)
  • Why different TTLs? Prices change frequently (5min), details don't (1hour)

2. CDN Integration:

  • Use CDN for static assets (images, CSS, JS)
  • Global edge locations for sub-100ms latency
  • Cache-control headers for different TTLs

3. Consistency vs Performance:

  • Sacrifice some write speed for cache consistency
  • TTL ensures freshness even with Cache-Aside
  • Manual cache invalidation for critical updates

4. Capacity Planning:

  • 100GB Redis cache for 1M products
  • 90% hit rate target
  • 10M+ ops/sec throughput needed for Black Friday

Knowledge Check

Q: When should I use Cache-Aside vs Write-Through?

Cache-Aside when you only want to cache frequently accessed data (hot data). This is more memory-efficient because you don't cache everything. Initial request is slower (cache miss), but subsequent requests are fast.

Write-Through when you need strong consistency between cache and database. Every write updates both cache and DB synchronously, ensuring cache is always fresh. This is slower for writes but simpler for consistency-critical data like user sessions.

Q: My cache is filling up too fast. Which eviction policy should I use?

LRU (Least Recently Used) is a good default choice. It removes items that haven't been accessed recently, which works well for most workloads (temporal locality - recently accessed items are likely to be accessed again).

If you know your workload has clear popularity patterns (some items accessed 100x more than others), use LFU (Least Frequently Used). This keeps the most popular items in cache, improving hit rate for hot data.
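A minimal LRU cache can be sketched in a few lines with Python's OrderedDict (the capacity and keys are illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: at capacity, evicts the least recently used key."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # insertion order doubles as recency order

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" is now most recently used
cache.put("c", 3)    # over capacity: evicts "b", not "a"
assert cache.get("b") is None
```

This is exactly the temporal-locality bet described above: whatever was touched last is kept, whatever went untouched longest is dropped.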

Q: How do I prevent cache stampede (thundering herd)?

Cache stampede occurs when many requests miss for the same hot key simultaneously, all querying the database and overwhelming it.

Solutions:

  1. Locking - First request acquires lock, others wait and use cached result
  2. Refresh-ahead - Background job refreshes hot keys before expiration
  3. Probabilistic early expiration - Randomly expire some items early to spread load
  4. Request coalescing - Merge simultaneous requests for the same key

The best solution depends on your workload. Locking is the simplest; refresh-ahead gives the best user experience.
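The locking approach (option 1) can be sketched with a double-checked lock; the counter verifies that 50 concurrent misses for the same hot key produce exactly one database query:

```python
import threading

db_queries = 0
cache = {}
lock = threading.Lock()

def slow_db_query(key):
    global db_queries
    db_queries += 1            # count how often the DB is actually hit
    return f"value-for-{key}"

def get_with_lock(key):
    """On a miss, only the lock holder queries the DB; others reuse its result."""
    if key in cache:
        return cache[key]
    with lock:
        if key in cache:       # another thread may have filled it while we waited
            return cache[key]
        cache[key] = slow_db_query(key)
        return cache[key]

threads = [threading.Thread(target=get_with_lock, args=("hot",)) for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
assert db_queries == 1         # 50 concurrent misses, a single DB query
```

The re-check inside the lock is the crucial line: without it, every waiting thread would query the database as soon as the lock was released.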

Quiz: Test Your Knowledge

Q1: Which caching strategy only caches data that's actually requested?

  • Write-Through
  • Write-Back
  • Cache-Aside
  • Refresh-Ahead
Answer

Cache-Aside only caches data that's actually requested. When a request comes in:

  1. Check cache - if hit, return data
  2. If miss, query database
  3. Store in cache for next request

This is memory-efficient because you don't waste cache space on data nobody requests. Write-Through caches everything on write, regardless of whether it will be read.

Q2: Which eviction policy removes the item that hasn't been accessed for the longest time?

  • LFU (Least Frequently Used)
  • FIFO (First In, First Out)
  • LRU (Least Recently Used)
  • TTL (Time To Live)
Answer

LRU (Least Recently Used) removes the item that hasn't been accessed for the longest time. This works well for most workloads because of temporal locality - items that were accessed recently are likely to be accessed again.

FIFO removes the oldest item regardless of access pattern. LFU removes the least frequently accessed item. TTL removes items after a fixed time.

Q3: You're building a real-time leaderboard with frequent score updates. Which caching strategy should you use?

  • Cache-Aside
  • Write-Through
  • Write-Back
  • No caching needed
Answer

Write-Through ensures the cache always has the latest scores, which is critical for a real-time leaderboard. Every score update writes to both cache and database synchronously.

Cache-Aside would be problematic because score updates might not be immediately reflected in the cache. Write-Back is dangerous because if the cache crashes before syncing to database, you could lose score updates.

Q4: What's the main benefit of using a CDN (Content Delivery Network) for caching?

  • Reduces database load
  • Improves global latency by serving content from edge locations
  • Improves cache hit rate
  • Simplifies cache invalidation
Answer

Improves global latency by serving content from edge locations. CDNs have thousands of servers distributed globally. When a user requests content, it's served from the closest edge server, which could be just a few milliseconds away vs. hundreds of milliseconds from the origin server.

While CDNs also reduce origin server load, their primary benefit is geographic distribution for low latency. Cache hit rate depends on your caching strategy and TTL configuration, not the CDN itself.

Q5: Which of these is NOT a characteristic of Redis caching?

  • In-memory storage for sub-millisecond latency
  • Single-threaded architecture for atomic operations
  • Automatic disk-based persistence by default
  • Supports multiple data structures (strings, hashes, sets, sorted sets)
Answer

Automatic disk-based persistence by default is NOT a characteristic of Redis. Redis is primarily an in-memory cache. While it does support persistence options (RDB snapshots and AOF - Append Only File), it doesn't automatically persist to disk by default. Data is lost if Redis crashes without persistence configured.

Redis is in-memory (fast), single-threaded (no locking overhead), and supports multiple data structures (strings, hashes, sets, sorted sets, lists).

Q6: Facebook uses multiple cache layers (edge, application, database, client). Why?

  • Because they couldn't decide on one caching strategy
  • To reduce complexity
  • Each layer serves a different purpose with appropriate TTLs
  • All cache layers do the same thing
Answer

Each layer serves a different purpose with appropriate TTLs. Facebook's caching strategy:

  • Edge cache (CDN): Static assets with long TTLs (images, CSS, JS) - 1 day to 1 year
  • Application cache (Tao/Memcached): Dynamic data with medium TTLs (friend lists, posts) - 5-60 minutes
  • Database cache: Hot rows and query results with short TTLs - 1-10 minutes
  • Client cache: Browser and Service Worker cache with controlled TTLs

This multi-layer approach gives them both performance (edge cache) and freshness (shorter TTLs at application layer) while optimizing for different access patterns at each layer.

Q7: What's the trade-off with Write-Back (Write-Behind) caching?

  • Slower reads
  • Risk of data loss if cache crashes before syncing
  • Complex read logic
  • Higher memory usage
Answer

Risk of data loss if cache crashes before syncing is the main trade-off with Write-Back caching. In Write-Back, writes go to cache immediately (fast) and the cache asynchronously writes to the database. If the cache crashes before syncing, those writes are lost forever.

Write-Back is very fast for writes and can batch database operations, but the data loss risk makes it unsuitable for critical data. Use it for analytics events, clickstream data, or other data where occasional loss is acceptable.
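As a rough illustration of that risk, here's a toy Write-Back sketch: writes land in the cache and a dirty buffer immediately, and only a later flush persists them. Anything still in the buffer when the cache process dies is gone.

```python
from collections import deque

cache = {}
db = {}
dirty = deque()   # writes accepted but not yet persisted

def write_back(key, value):
    """Write goes to the cache immediately; the DB write is deferred."""
    cache[key] = value
    dirty.append((key, value))

def flush():
    """Background sync: batch-apply pending writes to the DB."""
    while dirty:
        key, value = dirty.popleft()
        db[key] = value

write_back("clicks:page1", 42)
assert "clicks:page1" not in db   # not yet persisted: lost if the cache dies now
flush()
assert db["clicks:page1"] == 42
```

The window between `write_back` and `flush` is exactly where the data-loss risk lives, which is why this pattern suits clickstream counters better than payments.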

Q8: You're building an e-commerce platform with 1M products. 90% of traffic is browsing (read-heavy). Which caching strategy is best?

  • Write-Back
  • Write-Through
  • Cache-Aside with LRU eviction
  • No caching needed
Answer

Cache-Aside with LRU eviction is ideal for this scenario. Here's why:

  1. Cache-Aside - Only cache frequently accessed products (efficient memory usage, don't waste space on cold products)
  2. LRU eviction - Popular browsing products stay in cache (good temporal locality)
  3. Read-heavy workload - 90% of traffic is browsing, so cache hit rate will be high (90%+)

Write-Through would cache every product on write (inefficient). Write-Back has data loss risk (not acceptable for product data). Cache-Aside is perfect: cache the hot products (20% of products get 80% of traffic) and let LRU evict cold products.

Next Steps

Now that we understand caching strategies, eviction policies, and real-world implementations, let's put it all together with a comprehensive system design exercise.

👉 Complete Module 2: The Building Blocks

Up next: Module 3: Advanced Modeling - Learn about system boundaries, context mapping, and more complex architectural patterns!

Advanced Modeling

The Distributed Monolith: When Microservices Go Wrong

It was 2 AM when the call came in. The payment service was down, which meant the entire e-commerce platform was down. No one could buy anything.

I pulled up the logs and traced the failure: Payment Service → Inventory Service → Notification Service → Email Service. The email service had crashed, which cascaded back through the entire chain, which brought down payments.

We had "microservices." We had separate services for everything. But what we really had was a distributed monolith—services that couldn't function independently, spread across multiple servers instead of one.

The outage lasted six hours. The post-mortem revealed the real problem: we'd adopted microservices without understanding what makes them work. We'd split by technical layers (web, API, database) instead of business capabilities. Every service depended on every other service. We'd gained all the complexity of distributed systems with none of the benefits.

This lesson is about avoiding that mistake. You'll learn when microservices actually help (hint: not always), how to draw service boundaries that make sense, and the real trade-offs that most tutorials skip over.

Learning Goals

By the end of this lesson, you'll be able to:

  • Understand what microservices really are (beyond the buzzword)
  • Recognize when microservices help and when they hurt
  • Draw service boundaries based on business capabilities, not technical layers
  • Model microservices architecture in Sruja with multiple views
  • Avoid the distributed monolith anti-pattern

What Are Microservices, Really?

Let's cut through the hype. Microservices are small, independent services that:

  • Own their own data (no shared databases)
  • Communicate through APIs (not direct database calls)
  • Can be deployed independently (without coordinated releases)
  • Scale independently (add more instances of just the busy service)
  • Fail independently (one service crashing doesn't bring down everything)

The key word is independently. If your services can't operate independently, they're not microservices—they're just a distributed monolith with extra steps.

Here's the uncomfortable truth that took me years to learn: Microservices are not an architecture. They're a trade-off.

You gain:

  • Independent scaling (scale just what needs scaling)
  • Technology diversity (different services can use different stacks)
  • Fault isolation (one service failure doesn't cascade)
  • Team autonomy (different teams own different services)

You pay with:

  • Distributed system complexity (network calls instead of function calls)
  • Data consistency challenges (no more ACID transactions across services)
  • Operational overhead (monitoring, logging, tracing across services)
  • Cognitive load (understanding the whole system is harder)

The Monolith: Not Evil, Just Misunderstood

Monoliths get a bad reputation, but let me share a secret: most successful companies started with a monolith.

Amazon started with a monolith. Netflix started with a monolith. Uber started with a monolith. Facebook, Twitter, Airbnb—they all started with a single application that did everything.

There's a good reason: Monoliths are easier to build, deploy, and debug.

When you're a startup with three engineers trying to find product-market fit, you don't need microservices. You need to ship features fast. You need to iterate quickly. You need to change your mind without refactoring 50 services.

Monoliths work well when:

  • Your team is small (< 20 engineers)
  • Your domain is not well-understood yet (you're still learning)
  • Your traffic is predictable and moderate
  • Your deployment velocity is more important than independent scaling
  • You want to move fast and break things in one place

Monoliths break down when:

  • Different parts of the system need different scaling
  • Teams are stepping on each other's toes (coordinated deployments take forever)
  • One part of the system is unstable and brings down everything
  • You need different technologies for different parts (e.g., Python for ML, Go for high-throughput APIs)

The monolith isn't wrong. It's just a tool with specific use cases. Don't adopt microservices because you think monoliths are "bad." Adopt them because you have specific problems they solve.

The Microservices Promise (And Hidden Costs)

Let me share three real stories about microservices—the good, the bad, and the ugly.

The Good: Amazon's Two-Pizza Teams

In 2002, Amazon's CEO Jeff Bezos issued a famous memo (often called the "API Mandate") that fundamentally changed how Amazon operated:

  1. All teams will expose their data and functionality through service interfaces
  2. Teams must communicate with each other through these interfaces
  3. No other form of interprocess communication is allowed
  4. It doesn't matter what technology you use
  5. All service interfaces must be designed to be externalizable

This wasn't about technology—it was about organizational structure. By forcing teams to communicate through APIs, Amazon created independent teams that could move fast without coordinating with each other. The famous "two-pizza team" rule (a team should be small enough to be fed by two pizzas) became possible because services provided clear boundaries.

The result: Amazon went from a slow-moving retailer to a platform that could launch new services rapidly. Each team owned their service end-to-end. No coordination required.

The key insight: Microservices enabled organizational scaling, not just technical scaling.

The Bad: The Startup That Microserviced Too Early

A startup I advised had five engineers and decided to build their MVP as microservices from day one. They'd read about Netflix and Amazon and wanted to "do it right from the start."

Six months later, they'd built eight services for a product that barely had users. Every feature change required coordinating across multiple services. Every deployment was a multi-service rollout. Debugging meant tracing logs across five different systems.

They spent more time managing services than building features. When they finally simplified to a monolith, their development velocity doubled.

The lesson: You don't start with microservices. You evolve into them when you need them.

The Ugly: The Distributed Monolith Disaster

I worked with a company that had 50 "microservices" but they were all tightly coupled:

  • Services called each other synchronously in chains
  • They shared a single database (just different schemas)
  • A deploy of any service required deploying all services
  • When one service crashed, everything crashed

They had all the costs of microservices with none of the benefits. It was a monolith spread across 50 codebases.

The fix: We consolidated into 8 services based on business capabilities, gave each its own database, and made them truly independent. The system became simpler and more reliable.

The lesson: More services ≠ better architecture. The goal is independence, not count.

When to Use Microservices (And When NOT To)

After years of watching teams struggle with this decision, I've developed a simple framework.

Use microservices when:

  1. You have clear business boundaries (e.g., Payments, Shipping, User Management are obviously separate)
  2. Different parts need different scaling (e.g., Search service needs 100x more instances than User Profile service)
  3. Multiple teams need independent deployment (Team A shouldn't wait for Team B's deployment)
  4. Different technology needs (e.g., Python for ML, Go for APIs, Node.js for web)
  5. Fault isolation is critical (e.g., Recommendations can fail without breaking Checkout)

Don't use microservices when:

  1. You're still exploring the domain (You don't know the boundaries yet)
  2. Your team is small (< 10-20 engineers)
  3. You don't have operational maturity (Monitoring, logging, tracing, deployment automation)
  4. Simple deployment matters more than independent scaling
  5. You're doing it because "that's what Netflix does"

The biggest mistake I see? Teams adopting microservices because they think it's "modern" or "best practice." Architecture should solve specific problems, not follow trends.

The Hardest Problem: Drawing the Lines

Let's say you've decided to use microservices. Now comes the hard part: Where do you draw the boundaries?

This is where most teams fail. They split by technical layers:

  • Frontend Service
  • Backend Service
  • Database Service
  • Cache Service

This is wrong. These aren't business capabilities—they're technical implementation details. You end up with services that can't function independently.

The right approach: Split by business capability.

A business capability is something the business does that generates value. Ask: "What does this business actually do?"

For an e-commerce platform:

  • Catalog - Manage products and inventory
  • Orders - Process customer orders
  • Payments - Handle payment processing
  • Shipping - Coordinate delivery
  • Customers - Manage user accounts and profiles

Each of these can be an independent service that owns its data and logic. If Payments crashes, Orders can still be taken (just processed later). If Shipping is slow, it doesn't affect browsing the catalog.

A Framework for Boundary Decisions

Here's the process I use when helping teams define service boundaries:

Step 1: Identify Business Capabilities

List what the business actually does, not technical components:

  • What are the main things users do?
  • What different departments manage?
  • What would exist even if technology was different?

Step 2: Look for Natural Seams

Where does data naturally separate?

  • Does Order data need to know about Shipping details?
  • Can Customers exist without Orders?
  • What data can be eventually consistent vs. strongly consistent?

Step 3: Consider Team Structure

Who will own what?

  • Can a single team own this service end-to-end?
  • Does splitting this create coordination overhead?
  • Does the org structure support these boundaries?

Step 4: Start Bigger, Split Later

Begin with larger services and split as needed:

  • Start with "Order Management" not "Order Placement" + "Order Tracking"
  • Split only when you feel the pain of not splitting
  • It's easier to split than to merge

Step 5: Question Every Boundary

For each proposed service, ask:

  • Can this deploy independently?
  • Can this scale independently?
  • Does this own its data?
  • Would I need to coordinate with other teams to change this?

If the answer is no, reconsider the boundary.

Real-World Stories: How Amazon and Netflix Did It

Amazon: The Service-Oriented Evolution

Amazon's journey to microservices wasn't a rewrite—it was an evolution:

  • 2002: Bezos's API Mandate - all communication through APIs
  • 2003-2005: Gradual extraction of services from the monolith
  • 2006: Launch of AWS (infrastructure became a product)
  • Today: Hundreds of services, each owned by a specific team

The key: They evolved gradually. They didn't rewrite everything at once.

Netflix: The Chaos Engineering Pioneer

Netflix's microservices journey was driven by a specific problem:

  • 2008: A single datacenter outage took down everything
  • 2009: Decision to move to AWS and microservices
  • 2010-2011: Massive migration; created Chaos Monkey to test resilience
  • Result: 99.99%+ uptime and the ability to survive AWS region failures

The key: They had a specific problem (single point of failure) and microservices solved it.

What both companies share: They started with monoliths and evolved to microservices based on specific needs, not trends.

Modeling Microservices in Sruja

Now that you understand the concepts, let's see how to model microservices in Sruja. The key is modeling each service as an independent unit with its own data.

Basic Example: E-Commerce Platform

import { * } from 'sruja.ai/stdlib'

// Customer interacts with the system
Customer = person "Customer"

// Order Service: Handles order placement and tracking
OrderSystem = system "Order Management" {
    OrderService = container "Order Service" {
        technology "Rust"
        description "Handles order placement and tracking."
    }
    OrderDB = database "Order Database" {
        technology "PostgreSQL"
        description "Owns all order data. No other service accesses this."
    }
    OrderService -> OrderDB "Reads/Writes"
}

// Inventory Service: Tracks stock levels
InventorySystem = system "Inventory Management" {
    InventoryService = container "Inventory Service" {
        technology "Java"
        description "Tracks stock levels and reservations."
    }
    InventoryDB = database "Inventory Database" {
        technology "PostgreSQL"
        description "Owns all inventory data. Independent from orders."
    }
    InventoryService -> InventoryDB "Reads/Writes"
}

// Inter-service communication through APIs
Customer -> OrderSystem.OrderService "Places order"
OrderSystem.OrderService -> InventorySystem.InventoryService "Reserves stock (async)"

// Requirements drive architecture
requirement R1 functional "Must handle 10k orders/day"
requirement R2 performance "Order placement < 500ms"
requirement R3 scalability "Scale order processing independently from inventory"

// Document architecture decisions
adr ADR001 "Split into microservices" {
    status "Accepted"
    context "Need independent scaling for order vs inventory"
    decision "Separate OrderSystem and InventorySystem"
    consequences "Better scalability, but added network latency and operational complexity"
}

Notice how each system owns its database. This is critical—shared databases defeat the purpose of microservices.

Multiple Views for Different Audiences

One of the most powerful features of Sruja is the ability to create different views for different audiences:

// System Overview: Shows the big picture
view index {
    title "System Overview"
    include *
}

// Developer View: Focus on services and APIs
view developer {
    title "Developer View - Service Architecture"
    include OrderSystem.OrderService OrderSystem.OrderDB
    include InventorySystem.InventoryService InventorySystem.InventoryDB
    exclude Customer
}

// Product View: Focus on user experience
view product {
    title "Product View - User Journey"
    include Customer
    include OrderSystem
    exclude InventorySystem.InventoryDB
}

// Data Flow View: Show data dependencies
view dataflow {
    title "Data Flow View"
    include OrderSystem.OrderService OrderSystem.OrderDB
    include InventorySystem.InventoryService InventorySystem.InventoryDB
    exclude Customer
}

Why multiple views matter:

  • Different audiences need different levels of detail
  • Reduced complexity - each view shows only what's relevant
  • Better communication - stakeholders get diagrams they understand
  • Living documentation - views stay in sync with the architecture

Common Microservices Mistakes

After years of working with microservices, I've seen these patterns repeat:

Mistake 1: Shared Database. Services share a database "for convenience." Result: you can't change the database schema without breaking multiple services. Each service should own its data.

Mistake 2: Synchronous Everything. Every service calls every other service synchronously. Result: one slow service brings down everything. Use async communication where possible.

Mistake 3: Too Many Services. 50 services for a simple application. Result: complexity explosion. Start with fewer, larger services and split when needed.

Mistake 4: Distributed Monolith. Services that can't function independently. Result: all the costs, none of the benefits. If a service can't deploy without others, it's not a microservice.

Mistake 5: Early Optimization. Starting with microservices "because we'll need them." Result: wasted time and complexity you don't need. Build what you need, when you need it.

What to Remember

Microservices are a trade-off, not a best practice. You gain independence but pay in complexity. Don't adopt them because they're trendy—adopt them because you have specific problems they solve.

Start with a monolith. Amazon, Netflix, Uber, Google—they all started with monoliths. Evolve to microservices when you feel the pain of not having them.

Split by business capability, not technical layer. Services should represent business domains (Orders, Payments, Shipping), not technical components (Database, Cache, API).

Each service owns its data. Shared databases create coupling. If two services share a database, they're not independent.

Multiple views serve multiple audiences. Developers need technical details. Product managers need user flows. Executives need system overview. Sruja's views let you show the right level to each audience.

The goal is independence, not service count. More services ≠ better architecture. The question is: Can this service deploy, scale, and fail independently?

Microservices aren't wrong. But using them without understanding the trade-offs is.

What's Next

Now that you understand microservices architecture, Lesson 2 covers the API Gateway pattern—the traffic controller that manages communication between your services and the outside world. You'll learn when you need one, when you don't, and how to avoid creating a new bottleneck.


The Chain Reaction: When Synchronous Dependencies Kill Systems

It started with a single database query taking 2 seconds instead of 200ms.

The Analytics Service called the User Service to enrich data. The User Service called the Subscription Service to check plan details. The Subscription Service called the Payment Service to verify active status. The Payment Service called... you get the picture.

When that one database query slowed down, everything slowed down. Within minutes, the entire platform was timing out. Every service was waiting for every other service. We had a beautiful microservices architecture with synchronous calls everywhere—and it took 8 hours to recover.

The post-mortem was brutal: "We built a distributed monolith with synchronous chains." The fix? Event-driven architecture. Services that could operate independently, react to events asynchronously, and fail gracefully when dependencies were slow.

This lesson is about avoiding that mistake. You'll learn when event-driven architecture helps (and when it adds complexity), the three main patterns (queues, pub/sub, and event sourcing), and how to model events in Sruja.

Learning Goals

By the end of this lesson, you'll be able to:

  • Understand why synchronous dependencies create fragility
  • Recognize when event-driven architecture solves real problems
  • Choose between message queues, pub/sub, and event sourcing
  • Model event-driven systems in Sruja with queues and scenarios
  • Avoid common event-driven mistakes (eventual consistency isn't magic)

Why Events Matter: The Synchronous Problem

Here's a pattern I've seen destroy systems: Synchronous call chains.

Service A calls Service B, which calls Service C, which calls Service D. When D is slow, C is slow. When C is slow, B is slow. When B is slow, A is slow. The user sees timeouts. The system appears down.

The synchronous trap:

  • Latency adds up: 50ms + 100ms + 200ms + 150ms = 500ms total response time
  • Failures cascade: One slow service slows everything
  • Coupling is hidden: You think services are independent, but they're not
  • Hard to debug: Which service caused the timeout? Good luck tracing it.

The event-driven alternative:

Service A publishes an event. Service B, C, and D subscribe and process independently. Service A doesn't wait. If Service D is slow, Services B and C still work. If Service D crashes, the event stays in the queue until it recovers.

What you gain:

  • Decoupling: Services don't need to know about each other
  • Resilience: One service failure doesn't cascade
  • Scalability: Add more consumers when load increases
  • Flexibility: Add new subscribers without touching existing services

What you pay:

  • Complexity: Debugging distributed events is harder
  • Eventual consistency: Data isn't immediately consistent across services
  • Operational overhead: Message brokers, queues, retry logic
  • Learning curve: Thinking in events is different from thinking in requests

Synchronous vs. Asynchronous: A Real Comparison

Let me show you the difference with a real example: User Registration.

Synchronous Approach (Request/Response)

User → API: POST /register
API → EmailService: Send welcome email (wait for response)
API → AnalyticsService: Track signup (wait for response)
API → CRMService: Create lead (wait for response)
API → User: "Registration complete"

Total time: 200ms + 300ms + 400ms + 250ms = 1150ms

Problem: If EmailService is slow, the user waits. If AnalyticsService is down, registration fails.

Asynchronous Approach (Event-Driven)

User → API: POST /register
API → Database: Save user
API → EventQueue: Publish "UserRegistered" event
API → User: "Registration complete" (immediate response)

Meanwhile, asynchronously:
EventQueue → EmailService: Send welcome email
EventQueue → AnalyticsService: Track signup
EventQueue → CRMService: Create lead

Total response time: 200ms (just database save)

Benefit: User gets immediate response. Services process independently. If EmailService is slow, it doesn't affect the user experience.

The trade-off: User doesn't immediately get the email. Analytics might be a few seconds behind. But for most use cases, this is acceptable.
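The asynchronous flow above can be sketched with an in-process queue standing in for the message broker (service names and handlers are illustrative):

```python
import queue

events = queue.Queue()           # stands in for the message broker
emails_sent, signups_tracked = [], []

def register_user(name):
    """Request path: save the user, publish an event, return immediately."""
    # db.save(user) would go here
    events.put({"type": "UserRegistered", "name": name})
    return "Registration complete"

def process_events():
    """Background consumers: email, analytics, etc. run after the response."""
    while not events.empty():
        event = events.get()
        emails_sent.append(event["name"])       # EmailService reaction
        signups_tracked.append(event["name"])   # AnalyticsService reaction

response = register_user("ada")
assert response == "Registration complete"  # the user never waits on email
process_events()
assert emails_sent == ["ada"]
```

The key property: `register_user` returns before any consumer runs, so a slow email service can't slow down registration.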

When to Use Event-Driven Architecture (And When NOT To)

After years of building both synchronous and asynchronous systems, I've developed a simple decision framework.

Use event-driven when:

  1. Services don't need immediate response (e.g., sending emails, analytics, notifications)
  2. Multiple services need the same data (e.g., "UserRegistered" → Email, Analytics, CRM all need it)
  3. Work can happen in the background (e.g., image processing, report generation)
  4. Resilience matters more than immediate consistency (e.g., "we'll retry until it works")
  5. You need to handle traffic spikes (e.g., queue requests during peak, process during off-peak)

Don't use event-driven when:

  1. User needs immediate feedback (e.g., "Is this username available?")
  2. Strong consistency is required (e.g., banking transactions, inventory checks)
  3. Debugging simplicity matters more than scalability (synchronous is easier to debug)
  4. You don't have operational maturity (monitoring queues, handling failures)
  5. The added complexity isn't worth it (simple CRUD apps don't need events)

The biggest mistake I see? Using events for everything because they're "more scalable." Architecture should solve specific problems, not create unnecessary complexity.

The Three Event-Driven Patterns

Event-driven architecture isn't one thing—it's three different patterns for different use cases.

Pattern 1: Message Queues (Point-to-Point)

What it is: A message is sent to a queue and processed by exactly one consumer.

How it works:

Producer → Queue → Consumer (only one)

Use cases:

  • Background jobs (image resizing, video transcoding)
  • Task distribution (send to whichever worker is free)
  • Load leveling (queue requests during spikes, process gradually)

Real example: When you upload a video to YouTube, it goes into a queue. One worker picks it up and transcodes it. If 1000 people upload videos simultaneously, the queue holds them until workers are available.

Technologies: RabbitMQ, AWS SQS, Redis queues
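A toy version of point-to-point delivery, using Python's thread-safe queue.Queue so that each job is consumed by exactly one worker (job names and worker count are illustrative):

```python
import queue
import threading

jobs = queue.Queue()
results = {}   # job -> id of the worker that handled it

def worker(worker_id):
    """Each job is delivered to exactly one free worker."""
    while True:
        job = jobs.get()
        if job is None:              # sentinel: stop this worker
            return
        results[job] = worker_id     # pretend to transcode/resize/etc.

for job in ["video-1", "video-2", "video-3", "video-4"]:
    jobs.put(job)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads: t.start()
for _ in threads: jobs.put(None)     # one shutdown sentinel per worker
for t in threads: t.join()

assert sorted(results) == ["video-1", "video-2", "video-3", "video-4"]
```

Every job appears in `results` exactly once: the queue hands each message to a single consumer, which is the defining property of this pattern.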

Pattern 2: Pub/Sub (Publish/Subscribe)

What it is: A message (event) is published to a topic. Multiple subscribers can receive a copy.

How it works:

Publisher → Topic → Subscriber 1
                 → Subscriber 2
                 → Subscriber 3

Use cases:

  • Broadcasting events (e.g., "UserSignedUp" → Email, Analytics, CRM all get it)
  • Event notification (multiple services react to the same event)
  • Real-time updates (e.g., stock prices, sports scores)

Real example: When you sign up for Netflix, a "UserSignedUp" event is published. The email service sends a welcome email, the analytics service tracks the signup, and the recommendation service initializes your profile—all simultaneously, all independently.

Technologies: Apache Kafka, Google Pub/Sub, AWS SNS/SQS
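A minimal in-process pub/sub sketch; unlike a queue, publish delivers a copy of the event to every subscriber (topic and handler names are illustrative):

```python
subscribers = {}   # topic -> list of handler functions

def subscribe(topic, handler):
    subscribers.setdefault(topic, []).append(handler)

def publish(topic, event):
    """Every subscriber on the topic receives its own copy of the event."""
    for handler in subscribers.get(topic, []):
        handler(event)

log = []
subscribe("user.signed_up", lambda e: log.append(("email", e["user"])))
subscribe("user.signed_up", lambda e: log.append(("analytics", e["user"])))
subscribe("user.signed_up", lambda e: log.append(("recommendations", e["user"])))

publish("user.signed_up", {"user": "ada"})
assert len(log) == 3   # one delivery per subscriber, unlike a queue
```

Adding a fourth subscriber requires no change to the publisher, which is the decoupling benefit described above.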

Pattern 3: Event Sourcing

What it is: Store all changes as a sequence of events, not just current state.

How it works:

Instead of: User { name: "John", email: "john@example.com" }

Store: [
  { type: "UserCreated", data: { id: 1, name: "John" } },
  { type: "EmailUpdated", data: { email: "john@example.com" } }
]

Use cases:

  • Audit trails (every change is recorded)
  • Time-travel debugging (replay events to see what happened)
  • Event replay (rebuild state from events if database corrupts)

Real example: Financial systems. Instead of just storing "Account balance: $1000", you store "Deposit $500", "Withdraw $200", "Deposit $700". You can replay these events to verify the balance or undo transactions.

Technologies: EventStore, Apache Kafka (with compacted topics)
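The core idea — current state is derived by replaying the log, never stored directly — fits in a short sketch. The event shapes below follow the financial example above and are illustrative:

```python
# Rebuild current state by replaying an append-only event log.
events = [
    {"type": "Deposit",  "amount": 500},
    {"type": "Withdraw", "amount": 200},
    {"type": "Deposit",  "amount": 700},
]

def replay(events):
    balance = 0
    for e in events:
        if e["type"] == "Deposit":
            balance += e["amount"]
        elif e["type"] == "Withdraw":
            balance -= e["amount"]
    return balance

assert replay(events) == 1000   # state is a fold over history, not a stored value
```

Replaying a prefix of the log gives you the balance at any point in time — that's the time-travel debugging property for free.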

Which pattern to choose?

  • Queues for one consumer, background jobs
  • Pub/Sub for multiple consumers, event broadcasting
  • Event Sourcing for audit trails, complex state changes

Real-World Case Studies

Netflix: Events for Resilience

Netflix's event-driven architecture is legendary. Here's how they use events:

The Challenge: When you play a video, dozens of things need to happen: stream initialization, quality selection, analytics tracking, recommendation updates, etc.

The Synchronous Problem: In a synchronous design, slow analytics tracking would delay video playback, and a recommendations outage would block viewing entirely. Neither is acceptable: if analytics is slow, playback shouldn't suffer; if recommendations are down, you should still be able to watch.


The Event-Driven Solution:

  1. "VideoPlaybackStarted" event is published
  2. Multiple services subscribe independently:
    • Analytics Service: Track viewing habits
    • Recommendation Service: Update "continue watching"
    • Billing Service: Track usage for account sharing
    • Quality Service: Monitor stream health

The Result: If Analytics Service crashes, video playback continues. Each service operates independently. Netflix achieves 99.99%+ uptime.

LinkedIn: The Migration to Events

LinkedIn's journey to event-driven architecture is instructive:

The Problem (2010): Synchronous calls everywhere. The "social graph" service was called by dozens of other services. When it was slow, LinkedIn was slow.

The Solution:

  1. Identified the most-called services
  2. Gradually migrated to event-driven architecture using Kafka
  3. Services now react to events instead of calling each other

The Result:

  • 10x improvement in response times
  • Ability to handle 4x more traffic with same infrastructure
  • Services can fail independently without taking down the site

The Lesson: Migrate gradually, not all at once. Start with the most painful synchronous dependencies.

The Startup That Over-Engineered Events

Not every story is a success. A startup I advised went all-in on events from day one:

What They Did:

  • Kafka cluster with 10 brokers
  • Event sourcing for everything (even simple CRUD)
  • 50+ event types for an MVP

The Result:

  • Spent 6 months building infrastructure before shipping features
  • Debugging was a nightmare (which event caused this bug?)
  • Operational overhead crushed a small team

The Lesson: Events are powerful, but don't start with them. Evolve into event-driven architecture when you feel the pain of not having it.

Common Event-Driven Mistakes

After years of working with event-driven systems, I've seen these patterns repeat:

Mistake 1: Eventual Consistency Confusion

"We'll just use events and everything will be consistent eventually!"

Reality: Eventual consistency means your data is inconsistent for some period. Users might not see their changes immediately. If you need strong consistency (e.g., banking), events aren't the answer.

Mistake 2: Event Spaghetti

"We'll publish events for everything!"

Reality: 200 event types create chaos. Which events do I subscribe to? What happens when event schemas change? Keep events minimal and well-documented.

Mistake 3: No Retry Logic

"If an event fails, we'll just retry forever!"

Reality: Some events can't succeed (e.g., email to invalid address). You need dead letter queues, exponential backoff, and failure handling.
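The fix — bounded retries with exponential backoff, then a dead letter queue — can be sketched as follows. The `send_email` handler and the attempt/delay numbers are illustrative assumptions, not a specific library's API:

```python
import time

dead_letter_queue = []

def handle_with_retry(handler, event, max_attempts=3, base_delay=0.01):
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except Exception:
            if attempt == max_attempts:
                dead_letter_queue.append(event)   # stop retrying forever; park for inspection
                return False
            time.sleep(delay)                     # back off before the next attempt
            delay *= 2                            # exponential backoff

def send_email(event):
    # Illustrative handler that can never succeed for this input
    if "@" not in event["address"]:
        raise ValueError("invalid address")

handle_with_retry(send_email, {"address": "not-an-email"})
assert dead_letter_queue == [{"address": "not-an-email"}]
```

A real broker adds jitter to the backoff and alerting on the dead letter queue, but the shape is the same: failure handling is an explicit design decision, not "retry forever."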

Mistake 4: Synchronous Events

"We'll use events, but wait for the response!"

Reality: That's not event-driven, that's synchronous calls with extra steps. Either commit to async or use synchronous calls.

Mistake 5: Ignoring Event Ordering

"Order doesn't matter!"

Reality: "UserCreated" must come before "EmailUpdated". Use partitioning or sequencing to maintain order when it matters.

Modeling Events in Sruja

Now that you understand the concepts, let's see how to model event-driven architecture in Sruja. The key is using queue for asynchronous communication and scenario for event flows.

Example: User Registration with Events

import { * } from 'sruja.ai/stdlib'

User = person "End User"

Notifications = system "Notification System" {
    AuthService = container "Auth Service" {
        technology "Node.js"
        description "Handles user authentication and publishes events"
    }

    // Define the event queue/topic
    UserEvents = queue "User Events Topic" {
        technology "Kafka"
        description "Events: UserSignedUp, UserLoggedIn, ProfileUpdated"
    }

    EmailService = container "Email Service" {
        technology "Python"
        description "Sends transactional emails asynchronously"
    }

    AnalyticsService = container "Analytics Service" {
        technology "Spark"
        description "Processes user events for analytics"
    }

    NotificationDB = database "Notification Database" {
        technology "PostgreSQL"
        description "Stores notification preferences and history"
    }

    // Pub/Sub flow: One event, multiple consumers
    User -> AuthService "Signs up (synchronous)"
    AuthService -> UserEvents "Publishes 'UserSignedUp' (async)"
    UserEvents -> EmailService "Consumes event - sends welcome email"
    UserEvents -> AnalyticsService "Consumes event - tracks signup"
    EmailService -> NotificationDB "Logs email sent"
}

// Model the complete event flow as a scenario
UserSignupFlow = scenario "User Signup Event Flow" {
    User -> AuthService "Submits registration (synchronous)"
    AuthService -> UserEvents "Publishes UserSignedUp (async)"
    UserEvents -> EmailService "Triggers welcome email (async)"
    UserEvents -> AnalyticsService "Tracks signup event (async)"
    EmailService -> User "Sends welcome email (async)"
}

// Model data pipeline for analytics
flow AnalyticsPipeline "Analytics Data Pipeline" {
    UserEvents -> AnalyticsService "Streams events continuously"
    AnalyticsService -> AnalyticsService "Processes in batches"
    AnalyticsService -> AnalyticsService "Generates reports"
}

view index {
    title "Notification System Overview"
    include *
}

// Event flow view: Focus on async communication
view eventflow {
    title "Event Flow View - Async Communication"
    include Notifications.AuthService
    include Notifications.UserEvents
    include Notifications.EmailService
    include Notifications.AnalyticsService
    exclude User Notifications.NotificationDB
}

// Data view: Focus on data storage
view data {
    title "Data Storage View"
    include Notifications.EmailService
    include Notifications.NotificationDB
    include Notifications.AnalyticsService
    exclude Notifications.AuthService Notifications.UserEvents
}

Key Sruja Concepts for Events

  1. queue - Models message queues and pub/sub topics
  2. scenario - Models behavioral flows (user journeys, event sequences)
  3. flow - Models data pipelines (streaming, batch processing)
  4. views - Different perspectives for different audiences

Notice the separation:

  • Synchronous calls: User -> AuthService (user waits)
  • Asynchronous events: AuthService -> UserEvents (fire and forget)

What to Remember

Synchronous chains create fragility. When every service calls every other service synchronously, one slow service slows everything. Event-driven architecture breaks these chains.

Events trade consistency for resilience. You gain independence and fault tolerance, but data isn't immediately consistent across services. This is acceptable for many use cases, but not all.

Three patterns for three problems:

  • Queues for background jobs, one consumer
  • Pub/Sub for broadcasting events, multiple consumers
  • Event Sourcing for audit trails, state reconstruction

Events aren't always the answer. Use them when you need decoupling, resilience, or async processing. Don't use them when you need immediate feedback or strong consistency.

Start synchronous, evolve to async. Most successful companies (Netflix, LinkedIn, Uber) started with synchronous calls and migrated to events when they felt the pain. Don't over-engineer from day one.

Model events explicitly in Sruja. Use queue for topics, scenario for flows, and views to show different perspectives. Make the async boundaries clear.

Event-driven architecture isn't a silver bullet. It's a powerful tool for specific problems. Use it when the trade-offs make sense.

What's Next

Now that you understand event-driven architecture, Lesson 3 covers Advanced Scenarios—how to model complex user journeys and technical sequences that span multiple services. You'll learn when scenarios help clarify behavior and when they're just extra documentation.

The Happy Path Lie: Why Scenarios Reveal What Diagrams Hide

The architecture diagram was beautiful. Clean boxes, clear connections, everything made sense. The new engineer studied it for an hour and said, "I understand how the system works."

Then they tried to trace a user logging in.

"Wait, does the web app call the auth service first, or does it check the local cache?"

"Does the auth service validate against the database or an external provider?"

"What happens if the external provider is down?"

"Where does the refresh token come from?"

The diagram showed what connected to what. It didn't show what happened when.

This is the gap scenarios fill. Static architecture diagrams show structure. Scenarios show behavior. They show the sequence, the timing, the decision points, the failure paths. They show how your system actually works when users start clicking buttons.

This lesson is about using scenarios to model runtime behavior. You'll learn when scenarios help (and when they're just documentation overhead), how to model user journeys and technical sequences, and the practical patterns that make scenarios useful instead of busywork.

Learning Goals

By the end of this lesson, you'll be able to:

  • Understand why static diagrams aren't enough for complex behavior
  • Recognize when scenarios add value vs. when they're overkill
  • Model user journeys with the story keyword
  • Model technical sequences with the scenario keyword
  • Add metadata (latency, protocols) to scenario steps
  • Avoid common scenario mistakes

Why Scenarios Matter: The Static Diagram Problem

Here's a truth that took me years to learn: Architecture diagrams show potential connections, not actual behavior.

When you look at a diagram with 20 boxes and 30 arrows, you know what can connect to what. But you don't know:

  • What happens first?
  • What happens in parallel?
  • What happens if something fails?
  • How long does each step take?
  • Which path is the happy path vs. error handling?

The problem: New team members stare at diagrams and nod, but they don't actually understand the system. They understand the components, but not how they interact.

The solution: Scenarios that show specific sequences. Not every possible interaction—just the important ones.

A Real Example: The Checkout Confusion

I worked with a team that had a beautiful e-commerce architecture diagram. But every time someone new joined, they'd ask the same questions:

  1. "When does inventory get reserved?"
  2. "What if payment fails after inventory is reserved?"
  3. "Does the user get an email before or after the order is confirmed?"
  4. "What if the email service is down?"

The diagram didn't answer these questions. It showed that Cart, Payment, Inventory, and Email services existed and were connected. It didn't show the sequence.

The fix: We wrote out three scenarios:

  • Happy path: Successful checkout
  • Payment failure: Rollback inventory, notify user
  • Email failure: Order still completes, retry email later

Suddenly, new engineers could understand the system in 10 minutes instead of 2 weeks. The scenarios captured what the diagram couldn't: behavior over time.

When Scenarios Help (And When They're Overkill)

After years of using scenarios, I've learned they're not always the answer.

Scenarios help when:

  1. Onboarding new engineers - Show them the key flows, not just the components
  2. Debugging complex interactions - Trace the actual sequence when something breaks
  3. Designing new features - Think through the happy path and error cases before coding
  4. Communicating with non-technical stakeholders - User journeys make sense to product managers
  5. Documenting edge cases - "What happens if X fails?" is a scenario

Scenarios are overkill when:

  1. Simple CRUD operations - "User creates a record" doesn't need a scenario
  2. Obvious flows - If the sequence is clear from the diagram, don't document it twice
  3. Exhaustive documentation - You don't need scenarios for every possible path
  4. Implementation details - Scenarios should show behavior, not code

The key is to model the important flows, not all flows. A system with 50 scenarios is as confusing as a system with none.

Two Types of Scenarios

Sruja supports two types of scenarios, each for a different audience.

Type 1: User Flows (Stories)

Audience: Product managers, designers, stakeholders
Focus: Value delivered to the user
Keyword: story (alias for scenario)

User flows show what the user experiences, not technical details. They answer: "What does the user do, and what do they get?"

Example: Buying a Ticket

import { * } from 'sruja.ai/stdlib'

User = person "Customer"

Ticketing = system "Ticketing System" {
    WebApp = container "Web Application" {
        technology "React"
    }
    PaymentService = container "Payment Service" {
        technology "Rust"
    }
    TicketDB = database "Ticket Database" {
        technology "PostgreSQL"
    }

    WebApp -> PaymentService "Processes payment"
    PaymentService -> TicketDB "Stores transaction"
}

// User-focused story: What does the user experience?
BuyTicket = story "User purchases a ticket" {
    User -> Ticketing.WebApp "Selects ticket"
    Ticketing.WebApp -> Ticketing.PaymentService "Process payment" {
        latency "500ms"
        protocol "HTTPS"
    }
    Ticketing.PaymentService -> User "Sends receipt"
}

view index {
    include *
}

Notice:

  • Focus on user actions: "Selects ticket", "Process payment"
  • Includes user-facing metadata: latency (how long does it take?)
  • Skips technical details: What database? What API format?
  • Tells a story: User does X, system responds with Y

When to use stories: When you're communicating with non-engineers or documenting user-facing behavior.

Type 2: Technical Sequences

Audience: Engineers, architects
Focus: Technical implementation details
Keyword: scenario

Technical sequences show how containers and components interact. They answer: "What calls what, in what order, with what data?"

Example: Authentication Flow

import { * } from 'sruja.ai/stdlib'

User = person "End User"

AuthSystem = system "Authentication System" {
    WebApp = container "Web Application" {
        technology "React"
    }
    AuthServer = container "Auth Server" {
        technology "Node.js, OAuth2"
    }
    Database = database "User Database" {
        technology "PostgreSQL"
    }

    WebApp -> AuthServer "Validates tokens"
    AuthServer -> Database "Queries user data"
}

// Technical sequence: How does it actually work?
AuthFlow = scenario "Authentication" {
    User -> AuthSystem.WebApp "Provides credentials"
    AuthSystem.WebApp -> AuthSystem.AuthServer "Validates token"
    AuthSystem.AuthServer -> AuthSystem.Database "Looks up user"
    AuthSystem.Database -> AuthSystem.AuthServer "Returns user data"
    AuthSystem.AuthServer -> AuthSystem.WebApp "Confirms token valid"
    AuthSystem.WebApp -> User "Shows login success"
}

view index {
    include *
}

Notice:

  • Focus on technical steps: "Validates token", "Looks up user"
  • Shows the full round-trip: Request → validation → database → response
  • More detailed than user stories
  • Useful for debugging and onboarding engineers

When to use technical sequences: When you're debugging, designing implementation, or helping engineers understand the system.

Adding Metadata to Scenarios

Scenarios become more valuable when you add metadata. Sruja supports properties on each step:

PaymentFlow = scenario "Payment Processing" {
    Customer -> ECommerce.Cart "Initiates checkout" {
        latency "50ms"
    }
    ECommerce.Cart -> Inventory.InventoryService "Reserves items" {
        latency "200ms"
        protocol "gRPC"
    }
    Inventory.InventoryService -> ECommerce.Cart "Confirms reserved" {
        latency "100ms"
    }
    ECommerce.Cart -> ECommerce.Payment "Charges payment" {
        latency "500ms - 2000ms"
        protocol "HTTPS"
        retry "3x with exponential backoff"
    }
    ECommerce.Payment -> Customer "Sends confirmation" {
        async true
    }
}

Why metadata matters:

  • Latency helps identify bottlenecks (total = sum of steps)
  • Protocol shows how services communicate (HTTP vs gRPC vs async)
  • Retry logic shows resilience patterns
  • Async flags show what happens in background

This metadata turns scenarios from documentation into design tools. You can see: "Wait, this flow takes 2+ seconds? Where's the time going?"
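That "where's the time going?" question can be answered mechanically once steps carry latency metadata. This is a simplified sketch: the step names mirror the PaymentFlow example above, and a latency range is reduced to its worst case rather than parsed from the Sruja source:

```python
# Sum per-step latencies to find the bottleneck in a scenario.
steps = [
    ("Initiates checkout", 50),
    ("Reserves items", 200),
    ("Confirms reserved", 100),
    ("Charges payment", 2000),   # worst case of "500ms - 2000ms"
]

total = sum(ms for _, ms in steps)
bottleneck = max(steps, key=lambda s: s[1])

assert total == 2350                           # worst-case end-to-end latency in ms
assert bottleneck[0] == "Charges payment"      # 85% of the time is in one step
```

Even this crude arithmetic makes the design conversation concrete: the payment step dominates, so that's where a timeout, retry budget, or async hand-off matters most.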

Real-World Scenario Patterns

After years of writing scenarios, I've found a few patterns that work well.

Pattern 1: Happy Path + Edge Cases

Don't try to model every possible path. Model the happy path and the important edge cases:

// Happy path
CheckoutSuccess = story "Successful checkout" {
    Customer -> Cart "Initiates checkout"
    Cart -> Payment "Processes payment"
    Payment -> Customer "Shows success"
}

// Edge case: Payment failure
CheckoutPaymentFailed = story "Checkout - payment fails" {
    Customer -> Cart "Initiates checkout"
    Cart -> Payment "Processes payment"
    Payment -> Cart "Returns error: insufficient funds"
    Cart -> Customer "Shows payment error"
    Cart -> Inventory "Releases reserved items"
}

// Edge case: Inventory unavailable
CheckoutNoInventory = story "Checkout - no inventory" {
    Customer -> Cart "Initiates checkout"
    Cart -> Inventory "Checks availability"
    Inventory -> Cart "Returns error: out of stock"
    Cart -> Customer "Shows out of stock message"
}

Three scenarios cover 95% of checkout behavior. You don't need 20 scenarios for every edge case.

Pattern 2: User Journey Mapping

For complex user journeys, break them into phases:

// Phase 1: Discovery
OnboardingDiscovery = story "User discovers features" {
    User -> App "Opens app"
    App -> User "Shows feature highlights"
    User -> App "Explores feature X"
}

// Phase 2: Activation
OnboardingActivation = story "User tries core feature" {
    User -> App "Uses feature X for first time"
    App -> User "Shows success"
    App -> Analytics "Tracks activation"
}

// Phase 3: Retention
OnboardingRetention = story "User returns" {
    User -> App "Opens app (day 2+)"
    App -> User "Shows personalized content"
    App -> Notifications "Sends reminder (if inactive)"
}

This helps product teams understand user behavior at each stage.

Pattern 3: Failure Scenarios

Model what happens when things break:

// Service failure
PaymentServiceDown = scenario "Payment service unavailable" {
    Customer -> Cart "Initiates checkout"
    Cart -> Payment "Processes payment"
    Payment -> Cart "Returns error: service unavailable"
    Cart -> Queue "Queues order for retry"
    Cart -> Customer "Shows 'we'll process shortly'"
}

// Recovery
PaymentServiceRecovery = scenario "Payment service recovers" {
    Queue -> Payment "Retries payment"
    Payment -> Queue "Confirms success"
    Queue -> Customer "Sends delayed confirmation"
}

Failure scenarios help you design resilient systems.

Common Scenario Mistakes

I've seen teams misuse scenarios in predictable ways.

Mistake 1: Scenarios for Everything

"We need scenarios for every user action!"

Reality: You end up with 100 scenarios no one reads. Focus on the important flows. Simple CRUD doesn't need scenarios.

Mistake 2: Too Much Detail

"Let's add every API call and database query!"

Reality: Scenarios become implementation documentation. Keep them at the right level of abstraction. Technical sequences show component interactions, not code execution.

Mistake 3: No Failure Paths

"We only model the happy path."

Reality: Users don't follow the happy path. They click wrong buttons, enter invalid data, and encounter errors. Model the important failure cases.

Mistake 4: Outdated Scenarios

"We wrote scenarios once and never updated them."

Reality: Scenarios that don't match reality are worse than no scenarios. Either keep them updated or don't write them.

Mistake 5: Mixing Audiences

"We'll write one scenario for everyone."

Reality: Engineers need technical sequences. Product managers need user stories. Different audiences need different levels of detail.

A Complete Example: E-Commerce Checkout

Let me show you a complete example with all the patterns:

import { * } from 'sruja.ai/stdlib'

Customer = person "Customer"

ECommerce = system "E-Commerce System" {
    Cart = container "Shopping Cart" {
        technology "React"
        description "Manages shopping cart state"
    }
    Payment = container "Payment Service" {
        technology "Rust"
        description "Processes payments via Stripe"
    }
    OrderDB = database "Order Database" {
        technology "PostgreSQL"
    }
}

Inventory = system "Inventory System" {
    InventoryService = container "Inventory Service" {
        technology "Java"
        description "Manages stock levels and reservations"
    }
    InventoryDB = database "Inventory Database" {
        technology "PostgreSQL"
    }
    InventoryService -> InventoryDB "Reads/Writes"
}

// Static relationships
Customer -> ECommerce.Cart "Adds items"
ECommerce.Cart -> Inventory.InventoryService "Checks availability"
ECommerce.Cart -> ECommerce.Payment "Processes payment"

// Scenario 1: Happy path (user story)
CheckoutSuccess = story "Successful checkout" {
    description "The ideal checkout flow with payment and confirmation"
    
    Customer -> ECommerce.Cart "Initiates checkout"
    ECommerce.Cart -> Inventory.InventoryService "Reserves items" {
        latency "200ms"
        protocol "gRPC"
    }
    Inventory.InventoryService -> ECommerce.Cart "Confirms reserved"
    ECommerce.Cart -> ECommerce.Payment "Charges payment" {
        latency "500ms - 2000ms"
        protocol "HTTPS"
    }
    ECommerce.Payment -> Customer "Sends confirmation email" {
        async true
    }
}

// Scenario 2: Technical sequence (detailed)
CheckoutTechnical = scenario "Checkout - technical sequence" {
    Customer -> ECommerce.Cart "Initiates checkout"
    ECommerce.Cart -> ECommerce.OrderDB "Creates order record" {
        latency "50ms"
    }
    ECommerce.Cart -> Inventory.InventoryService "Reserves items"
    Inventory.InventoryService -> Inventory.InventoryDB "Updates stock"
    ECommerce.Cart -> ECommerce.Payment "Processes payment"
    ECommerce.Payment -> ECommerce.OrderDB "Updates order status"
    ECommerce.Cart -> Customer "Shows success"
}

view index {
    title "E-Commerce System"
    include *
}

view checkout {
    title "Checkout Flow"
    include Customer ECommerce.Cart ECommerce.Payment
    include Inventory.InventoryService
    exclude ECommerce.OrderDB Inventory.InventoryDB
}

Notice:

  • Two scenarios for different audiences (story vs technical)
  • Metadata on important steps (latency, protocol, async)
  • Multiple views for different perspectives
  • Description on complex scenarios

Syntax Options: Simple vs. Formal

Sruja supports both simple and formal syntax for scenarios.

Simple Syntax

Quick and lightweight:

LoginFailure = scenario "Login Failure" {
    User -> AuthSystem.WebApp "Enters wrong password"
    AuthSystem.WebApp -> User "Shows error message"
}

Use when: Quick sketches, simple flows, internal documentation.

Formal Syntax

More structure for important scenarios:

CheckoutProcess = scenario "Checkout Process" {
    description "The complete checkout flow including payment and inventory check"
    
    Customer -> ECommerce.Cart "Initiates checkout"
    ECommerce.Cart -> Inventory.InventoryService "Reserves items"
    Inventory.InventoryService -> ECommerce.Cart "Confirms reserved"
    ECommerce.Cart -> ECommerce.Payment "Charges payment"
    ECommerce.Payment -> Customer "Sends confirmation"
}

Use when: External documentation, complex flows, important scenarios.

What to Remember

Static diagrams show structure, scenarios show behavior. Diagrams answer "what connects to what?" Scenarios answer "what happens when?" You need both.

Scenarios are for important flows, not all flows. Model the 20% of flows that matter (happy path + key edge cases). Don't try to document every possible interaction.

Two types for two audiences: User stories (product focus) for non-engineers, technical sequences (implementation focus) for engineers. Use the right type for your audience.

Add metadata to make scenarios useful: Latency, protocols, retry logic, async flags—these turn scenarios from documentation into design tools.

Model failure paths, not just happy paths: Users don't follow the happy path. Services fail. Networks timeout. Model what happens when things break.

Keep scenarios updated or delete them: Outdated scenarios are misleading. If you can't maintain them, don't write them.

Use views to show different perspectives: A checkout flow looks different to a product manager vs. an engineer. Use Sruja's views to show the right level of detail to each audience.

Scenarios aren't busywork—they're how you communicate behavior that diagrams can't capture. Use them when they add value, skip them when they don't.

What's Next

Now that you understand scenarios, Lesson 4 covers a different dimension: modeling requirements and constraints. You'll learn how to capture non-functional requirements (performance, scalability, compliance) in Sruja and link them to the components they affect.

This completes the core modeling concepts. The remaining lessons cover specific patterns and production concerns.

Lesson 4: Architectural Perspectives

I'll never forget the board meeting where I learned this lesson the hard way.

The VP of Engineering asked me to present our architecture to the executive team. I spent hours crafting this beautiful, comprehensive diagram showing every service, database, queue, and API endpoint. It was a masterpiece of technical completeness.

I proudly projected it on the screen.

The CEO stared at it for ten seconds, then asked: "So... where's the revenue coming from?"

The diagram had 47 boxes. None of them mentioned customers, payments, or business value. I'd shown them the engine when they wanted to see the car.

That's when I realized: one diagram cannot serve all audiences. You need different maps for different travelers.

The Google Maps Principle

Think about Google Maps. Same underlying data, but multiple views:

  • Satellite view - See everything from above (executives)
  • Traffic view - See flow and congestion (architects)
  • Transit view - See routes and connections (developers)
  • Street view - See ground-level details (implementers)

You don't create four different maps. You create one model with multiple perspectives.

That's exactly what architectural views do for your system.

One Model, Multiple Perspectives

Here's the beautiful thing about Sruja: you define your architecture once, then create different views for different audiences. No duplication. No inconsistency. No "wait, which version is current?"

Let me show you how this works with a real example:

import { * } from 'sruja.ai/stdlib'

// Define people who interact with the system
Customer = person "Customer"
Admin = person "Administrator"

// Define the main system
Shop = system "E-Commerce Shop" {
  WebApp = container "Web Application" {
    technology "React"
    CartComponent = component "Shopping Cart"
    ProductComponent = component "Product Catalog"
  }
  
  API = container "API Service" {
    technology "Rust"
    OrderController = component "Order Controller"
    PaymentController = component "Payment Controller"
  }
  
  DB = database "Database" {
    technology "PostgreSQL"
  }
  
  Cache = database "Cache" {
    technology "Redis"
  }
}

// Define external systems
PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
  }
}

// Define relationships
Customer -> Shop.WebApp "Browses products"
Admin -> Shop.WebApp "Manages inventory"
Shop.WebApp -> Shop.API "Fetches data"
Shop.API -> Shop.DB "Reads/Writes"
Shop.API -> Shop.Cache "Caches queries"
Shop.API -> PaymentGateway "Processes payments"

// EXECUTIVE VIEW: Business context
view executive {
  title "Executive Overview"
  include Customer
  include Admin
  include Shop
  include PaymentGateway
  // Hide technical details
  exclude Shop.WebApp
  exclude Shop.API
  exclude Shop.DB
  exclude Shop.Cache
}

// ARCHITECT VIEW: Service boundaries
view architect {
  title "Architectural View"
  include Shop Shop.WebApp Shop.API Shop.DB Shop.Cache
  include PaymentGateway
  // Hide people and implementation details
  exclude Customer Admin
  exclude Shop.WebApp.CartComponent Shop.WebApp.ProductComponent
  exclude Shop.API.OrderController Shop.API.PaymentController
}

// DEVELOPER VIEW: Implementation details
view developer {
  title "Developer View"
  include Shop.WebApp Shop.WebApp.CartComponent Shop.WebApp.ProductComponent
  include Shop.API Shop.API.OrderController Shop.API.PaymentController
  include Shop.DB Shop.Cache
  // Hide external concerns
  exclude Customer Admin PaymentGateway
}

// DATA FLOW VIEW: Data dependencies
view dataflow {
  title "Data Flow View"
  include Shop.API Shop.DB Shop.Cache
  // Hide everything else
  exclude Customer Admin Shop.WebApp PaymentGateway
}

// COMPLETE VIEW: Everything
view index {
  title "Complete System View"
  include *
}

One model. Five perspectives. Zero duplication.

Real-World Case Studies

Netflix: Views for Scale

Netflix operates one of the world's most complex architectures. They use different views for:

  1. Board presentations - Show content delivery to customers, geographic reach, no mention of microservices
  2. Architecture reviews - Show the 700+ microservices, their interactions, data flow
  3. Incident response - Show specific service dependencies, failure domains, circuit breakers
  4. Developer onboarding - Show just the relevant service group with its databases and queues

Imagine if they tried to put all of this on one diagram. It would be unusable.

Amazon: The "Working Backwards" Views

Amazon takes this further. They start with the customer-facing view (press release style), then work backwards to technical architecture:

  1. Customer view - What does the customer experience?
  2. Business view - What business capabilities do we need?
  3. Architecture view - What services support those capabilities?
  4. Implementation view - How do we build those services?

Each view serves a specific audience and decision-making process.

Stripe: Developer-Friendly Views

Stripe's famous for their developer experience. Their architecture documentation has:

  • Product view - For potential customers evaluating Stripe
  • Integration view - For developers implementing payments
  • Architecture view - For security teams reviewing compliance
  • Operational view - For DevOps monitoring systems

Same system. Different lenses.

When to Create Views (Framework)

Here's the framework I use to decide which views to create:

Always Create These Views:

1. Context View (Executive)

  • Audience: Executives, stakeholders, non-technical team members
  • Shows: Systems, external dependencies, users
  • Hides: Technical implementation details
  • Use when: Explaining business value, system scope, external integrations

2. Container View (Architect)

  • Audience: Architects, tech leads, senior engineers
  • Shows: Deployable units, inter-service communication, tech stack
  • Hides: Internal code structure, business context
  • Use when: Planning deployments, discussing scalability, reviewing tech choices

3. Component View (Developer)

  • Audience: Developers implementing features
  • Shows: Internal structure, code organization, responsibilities
  • Hides: External systems, high-level architecture
  • Use when: Onboarding developers, planning sprints, debugging
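In Sruja, these three core views can be sketched as three view blocks over one model. This is an illustrative fragment: the element IDs (Customer, Shop, and its containers) are hypothetical, so adapt them to your own model.

```sruja
// Assumes a model like: Customer = person, Shop = system { WebApp, API, DB }

// 1. Context View: systems and users only
view context {
  title "Context View (Executive)"
  include Customer
  include Shop
  exclude Shop.WebApp Shop.API Shop.DB  // hide implementation details
}

// 2. Container View: deployable units and their communication
view containers {
  title "Container View (Architect)"
  include Shop.WebApp Shop.API Shop.DB
}

// 3. Component View: drill into one container for developers
view components {
  title "Component View (Developer)"
  include Shop.API
  exclude Customer  // hide business context
}
```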

Create Conditionally:

4. Data Flow View

  • Create when: System has complex data pipelines, multiple data stores, or data compliance requirements
  • Skip when: Simple CRUD operations with one database

5. Security View

  • Create when: Handling sensitive data, compliance requirements (HIPAA, PCI, GDPR)
  • Skip when: Internal tools with no sensitive data

6. Operational View

  • Create when: Complex deployment, multiple environments, disaster recovery requirements
  • Skip when: Simple deployment (single service + database)

Common Mistakes I See All the Time

Mistake #1: The "One Diagram to Rule Them All"

What happens: You create one massive diagram with everything.

Why it fails:

  • Executives can't find business value
  • Developers can't find implementation details
  • Architects can't see service boundaries
  • Nobody can read it after 10 boxes

The fix: Start with your audience. What decision do they need to make? Show only what supports that decision.

Mistake #2: Creating Multiple Models Instead of Multiple Views

What happens: You create separate Sruja files for executives, architects, and developers.

Why it fails:

  • Views get out of sync
  • "Which file is the source of truth?"
  • Update one, forget the others
  • Conflicting information

The fix: One model. Multiple view blocks. Always in sync.
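As a minimal sketch of that fix (names here are illustrative): the model is defined once, and each audience gets its own view block, so there is nothing to drift out of sync.

```sruja
// One shared model...
Customer = person "Customer"
Shop = system "Shop" {
  API = container "API Service"
}

Customer -> Shop "Uses"

// ...multiple view blocks, zero duplication
view executive {
  title "Executive Overview"
  include Customer Shop
  exclude Shop.API  // no technical detail
}

view architect {
  title "Architect View"
  include Shop.API
}
```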

Mistake #3: Forgetting the Developer View

What happens: You have great executive and architect views, but nothing for developers.

Why it fails:

  • New developers struggle to understand the codebase
  • "Where do I add this feature?"
  • Inconsistent implementation patterns

The fix: Always create a developer view showing components, their responsibilities, and key dependencies.

Mistake #4: Showing the Wrong Detail Level

What happens: You show database schema to executives, or business context to developers.

Why it fails:

  • Audience gets confused or frustrated
  • "Why are you showing me this?"
  • Loss of credibility

The fix: Match detail level to audience:

  • Executives: Systems and users (5-10 boxes)
  • Architects: Containers and databases (10-20 boxes)
  • Developers: Components and classes (20-50 boxes)

Mistake #5: The "Everything View" Default

What happens: Your index view is the only view, showing everything.

Why it fails:

  • Default view is overwhelming
  • People stop looking at your architecture docs
  • "It's too complicated, I'll just ask someone"

The fix: Make the default view match your most common audience (usually container-level for teams).

The VIEW Framework: A Decision Guide

When someone asks for architecture documentation, use this framework:

V - Verify the Audience

  • "Who will see this diagram?"
  • "What's their technical level?"
  • "What decisions will they make?"

E - Establish the Purpose

  • "What question are we answering?"
  • "What level of detail do they need?"
  • "What can we safely hide?"

I - Identify the Scope

  • "Which systems/containers are relevant?"
  • "What's out of scope?"
  • "Should we show external dependencies?"

E - Exclude Ruthlessly

  • Remove everything that doesn't serve the purpose
  • Less is more
  • Clarity over completeness

W - Write the View

  • Create a named view with descriptive title
  • Test it with someone from the target audience
  • Iterate based on feedback
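Walking the framework end to end might look like this sketch for a developer-onboarding view (element IDs are hypothetical):

```sruja
// V: new developers   E: "where do I add a checkout feature?"
// I: the checkout service group only   E: externals excluded
// W: a named, titled view — then test it with an actual new hire
view onboarding {
  title "Checkout Service Group (Developer Onboarding)"
  include Shop.API Shop.DB
  exclude Customer PaymentGateway
}
```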

Practical Exercise

Take the e-commerce example above and create a new view:

Challenge: Create a "Security View" for the compliance team that shows:

  • Where customer data flows
  • Which systems handle payments
  • External integrations
  • But hides implementation details

Hint: Focus on PaymentGateway, Shop.API, and data stores. Exclude UI components.
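If you get stuck, here is one possible starting point, assuming the element IDs from the earlier e-commerce example; your own model may name them differently:

```sruja
view security {
  title "Security View – Customer Data & Payments"
  include PaymentGateway   // external payment handling
  include Shop.API         // where customer data flows through
  include Shop.DB          // where customer data is stored
  exclude Shop.WebApp      // UI implementation detail, out of scope
}
```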

What to Remember

  1. One model, multiple views - Define once, present many ways
  2. Match the view to the audience - Executives, architects, and developers need different things
  3. Use the VIEW framework - Verify, Establish, Identify, Exclude, Write
  4. Less is more - The best view shows only what's needed, nothing more
  5. Avoid duplication - Multiple views from one model, not multiple models
  6. Test with your audience - If they can't understand it in 30 seconds, simplify
  7. Keep views in sync - One source of truth means automatic consistency

When to Skip Views

You don't always need all views. Skip creating multiple views when:

  • Simple systems - Single service + database doesn't need multiple perspectives
  • Single audience - If only developers will see it, one developer view is enough
  • Prototyping - During exploration, focus on the view that helps you think
  • Time pressure - Better to have one good view than five bad ones

Start simple. Add views when you feel the pain of not having them.


Next up: We'll explore how to model complex relationships and data flows between your architectural elements.


Lesson 5: Views & Styling

I once worked with an architect who loved colors. His diagrams were... enthusiastic.

Every service had a different color. Databases were purple. APIs were orange. Message queues were pink. External systems were yellow. Internal systems were green. And the relationships? Rainbow gradients.

It looked like a unicorn had exploded on his screen.

When I asked him to walk me through the architecture, he spent 10 minutes explaining his color coding system. By minute 3, I'd forgotten what we were looking at. By minute 10, I had a headache.

Here's the thing: styling should clarify, not decorate. When done right, styling makes diagrams instantly understandable. When done wrong, it creates visual noise that obscures the very thing you're trying to show.

The Traffic Light Principle

Think about traffic lights. They use three colors with specific meanings:

  • Red = Stop, danger, critical
  • Yellow = Caution, warning
  • Green = Go, safe, normal

You don't need a legend. You don't need to think. You instantly understand.

Good architecture styling works the same way. It uses visual elements to communicate meaning, not to look pretty.

Why Styling Matters

Your brain processes visual information in two ways:

  1. Fast thinking (System 1) - Instant pattern recognition, gut reactions
  2. Slow thinking (System 2) - Deliberate analysis, reading labels

Good styling triggers System 1. Before you even read the labels, you should understand:

  • What's important (visual weight)
  • How things connect (line styles)
  • What things are (shapes and colors)
  • Where to look first (hierarchy)

Let me show you the difference.

Before: The Uniform Diagram

import { * } from 'sruja.ai/stdlib'

// Everything looks the same. No visual hierarchy.
Customer = person "Customer"

Shop = system "E-Commerce Shop" {
  WebApp = container "Web Application"
  API = container "API Service"
  DB = database "Database"
}

Customer -> Shop.WebApp "Uses"
Shop.WebApp -> Shop.API "Calls"
Shop.API -> Shop.DB "Reads/Writes"

view index {
  title "Everything Looks the Same"
  include *
}

Everything has equal visual weight. Your eye doesn't know where to go. You have to read every label to understand what's happening.

After: Styled with Purpose

import { * } from 'sruja.ai/stdlib'

Customer = person "Customer"

Shop = system "E-Commerce Shop" {
  WebApp = container "Web Application"
  API = container "API Service"
  DB = database "Database"
}

Customer -> Shop.WebApp "Uses"
Shop.WebApp -> Shop.API "Calls"
Shop.API -> Shop.DB "Reads/Writes"

// Global styles apply to all views
style {
  // Databases are always cylinders in green
  element "Database" {
    shape cylinder
    color "#10b981"  // Green - stable, persistent
  }
  
  // Read operations in blue (cool, safe)
  relation "Reads/Writes" {
    color "#3b82f6"
  }
}

view index {
  title "Styled with Purpose"
  include *
}

Now, before you read anything, you already know:

  • That cylinder thing is a database (shape)
  • The database is stable/persistent (green color)
  • Data flows to/from it (blue line)

Sruja Styling: The Basics

Sruja gives you two levels of styling:

1. Global Styles

Apply to all views. Use for consistent element types:

// Define once, apply everywhere
style {
  element "Database" {
    shape cylinder
    color "#22c55e"
  }
  
  element "Cache" {
    shape cylinder
    color "#f59e0b"  // Orange = fast, temporary
  }
  
  relation "Processes payments" {
    color "#ef4444"  // Red = critical path
    thickness 3
  }
}

2. View-Specific Styles

Apply to one view. Use to highlight what matters in that context:

view security {
  title "Security View"
  include PaymentGateway Shop.API Shop.DB
  
  // Only in this view: highlight security concerns
  style {
    element "API" {
      color "#ef4444"  // Red = security focus
    }
    
    relation "Processes payments" {
      style dashed  // Dashed = needs review
    }
  }
}

View-specific styles override global styles. This lets you maintain consistency while highlighting what matters for each audience.

Complete Example: E-Commerce with Purposeful Styling

Let me show you a real-world example with intentional styling:

import { * } from 'sruja.ai/stdlib'

Customer = person "Customer"

ECommerce = system "E-Commerce System" {
  WebApp = container "Web Application"
  API = container "API Service"
  OrderDB = database "Order Database"
  ProductDB = database "Product Database"
}

Customer -> ECommerce.WebApp "Browses"
ECommerce.WebApp -> ECommerce.API "Fetches data"
ECommerce.API -> ECommerce.OrderDB "Stores orders"
ECommerce.API -> ECommerce.ProductDB "Queries products"

// Global styles: Consistent visual language
style {
  // All databases look the same
  element "Database" {
    shape cylinder
    color "#22c55e"  // Green = stable, persistent
  }
  
  // Data retrieval = blue (cool, safe)
  relation "Fetches data" {
    color "#3b82f6"
  }
  
  // Data storage = orange (warm, action)
  relation "Stores orders" {
    color "#f59e0b"
    thickness 2
  }
  
  // Read operations = thinner
  relation "Queries products" {
    color "#3b82f6"
    thickness 1
  }
}

// Complete view with all styling
view index {
  title "Complete System View"
  include *
}

// Data flow view with custom emphasis
view dataflow {
  title "Data Flow View"
  include ECommerce.API
  include ECommerce.OrderDB
  include ECommerce.ProductDB
  exclude Customer
  exclude ECommerce.WebApp
  
  // Override global styles for this view
  style {
    element "API" {
      color "#0ea5e9"  // Blue = data orchestrator
      thickness 3
    }
    
    // Emphasize write operations
    relation "Stores orders" {
      color "#ef4444"  // Red = critical writes
      thickness 3
    }
  }
}

What This Styling Communicates

Without reading a single label, the styling tells you:

  • Green cylinders = Databases (stable, persistent storage)
  • Blue lines = Read operations (safe, idempotent)
  • Orange/red lines = Write operations (changes state, be careful)
  • Thick lines = Critical paths (important, frequent)
  • Thin lines = Secondary paths (less critical)

Real-World Case Studies

AWS: Color Coding by Service Type

AWS architecture diagrams use consistent colors:

  • Orange = Compute (EC2, Lambda)
  • Blue = Storage (S3, EBS)
  • Green = Database (RDS, DynamoDB)
  • Purple = Analytics (Redshift, Athena)

This consistency means AWS architects can look at any AWS diagram and instantly understand the service mix. No legend needed.

GitHub: Minimal but Meaningful

GitHub's internal architecture docs use minimal styling:

  • Gray = Existing services
  • Blue = New services (in a proposal)
  • Red borders = Deprecated services

Three colors. Clear meaning. Zero confusion.

Spotify: Styling for Scale

Spotify's architecture diagrams use visual weight to show scale:

  • Large boxes = Major services (100+ instances)
  • Medium boxes = Standard services (10-100 instances)
  • Small boxes = Small services (<10 instances)
  • Dashed borders = External dependencies

Size communicates scale. Style communicates meaning.

When to Style (Framework)

Here's my framework for deciding when to apply styling:

Always Style:

1. Element Types

  • Databases should look different from services
  • External systems should look different from internal ones
  • Use consistent shapes and colors for each type

2. Critical Relationships

  • Payment flows
  • Security boundaries
  • Data pipelines
  • Anything that, if broken, causes major issues

3. State Changes

  • Read vs. write operations
  • Sync vs. async communication
  • Permanent vs. temporary data

Sometimes Style:

4. Performance Indicators

  • Hot paths (frequently used)
  • Bottlenecks
  • Cached vs. uncached data

5. Ownership Boundaries

  • Team A's services
  • Team B's services
  • Shared infrastructure

Rarely Style:

6. Technology Stack

  • Java vs. Python services
  • PostgreSQL vs. MySQL databases
  • This is usually noise, not signal

7. Version Information

  • v1 vs. v2 services
  • Deprecated systems (use red border, but don't overdo it)

Common Styling Mistakes

Mistake #1: The Rainbow Diagram

What happens: You use 10+ colors because "they all have meaning."

Why it fails:

  • No one can remember 10 color meanings
  • Visual overload
  • Looks unprofessional

The fix: Limit to 3-5 colors with clear, distinct meanings. If you need more, use shapes or line styles instead.
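A sketch of that fix: the palette stays small because shapes and line styles carry the extra distinctions. The element and relation names below are illustrative.

```sruja
// Five meanings, three colors: shapes and line styles do the rest
style {
  element "Database" {
    shape cylinder      // shape alone says "storage"
    color "#22c55e"
  }
  relation "Calls" {
    color "#3b82f6"     // blue = synchronous call
  }
  relation "Publishes events" {
    color "#3b82f6"
    style dashed        // same blue, dashed = asynchronous
  }
}
```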

Mistake #2: No Visual Hierarchy

What happens: Everything is equally bold, equally colorful, equally large.

Why it fails:

  • Viewer doesn't know where to look first
  • Critical paths get lost in the noise
  • Feels overwhelming

The fix: Make important things visually prominent. Use size, color intensity, and thickness to create hierarchy.

Mistake #3: Inconsistent Meaning

What happens: Blue means "database" in one diagram and "external service" in another.

Why it fails:

  • Confuses people who see multiple diagrams
  • Undermines trust in documentation
  • Requires constant legend-checking

The fix: Create a style guide and stick to it across all diagrams.

Mistake #4: Style Over Substance

What happens: Beautiful diagrams that don't communicate clearly.

Why it fails:

  • Form over function
  • People admire the diagram but don't understand the architecture
  • Time wasted on decoration

The fix: Before adding any style, ask: "Does this help the viewer understand faster?" If no, remove it.

Mistake #5: Ignoring Accessibility

What happens: Red/green color coding that's invisible to colorblind viewers.

Why it fails:

  • Roughly 8% of men have some form of color blindness
  • Excludes part of your audience
  • May fail accessibility requirements

The fix:

  • Don't rely on color alone (use shapes, patterns, labels too)
  • Test with color blindness simulators
  • Use colorblind-friendly palettes (blue/orange instead of red/green)
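A colorblind-friendly global style might be sketched like this (illustrative names): a blue/orange palette instead of red/green, with shape and thickness as color-independent cues.

```sruja
style {
  element "Database" {
    shape cylinder       // shape reads even in grayscale
    color "#0ea5e9"      // blue
  }
  relation "Stores orders" {
    color "#f59e0b"      // orange pairs safely with blue
    thickness 2          // line weight as a second, non-color cue
  }
}
```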

The STYLE Framework

When styling your diagrams, use this framework:

S - Semantic Meaning

  • Every style choice should communicate something
  • If you can't explain why, remove it

T - Type Differentiation

  • Different element types should look different
  • Databases ≠ Services ≠ External Systems

Y - Yield Hierarchy

  • Important things should be visually prominent
  • Critical paths should stand out

L - Limit Colors

  • Maximum 5 colors with distinct meanings
  • Use shapes and line styles for additional differentiation

E - Ensure Consistency

  • Same element type = same style across all diagrams
  • Create and follow a style guide

Practical Exercise

Take the e-commerce example above and create a new styled view:

Challenge: Create a "Performance View" that shows:

  • Database query hot paths (thick lines)
  • Cacheable vs. non-cacheable data (dashed vs. solid)
  • Critical user journey (highlighted)

Hint: Use view-specific styles to override global styles for performance concerns.
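One possible starting point, assuming the relation labels from the e-commerce example above:

```sruja
view performance {
  title "Performance View"
  include Customer ECommerce.WebApp ECommerce.API
  include ECommerce.OrderDB ECommerce.ProductDB

  // View-specific overrides: emphasize performance concerns only here
  style {
    // Hot path: frequent product queries get visual weight
    relation "Queries products" {
      color "#ef4444"
      thickness 3
    }
    // Cacheable reads: dashed marks cache candidates
    relation "Fetches data" {
      style dashed
    }
  }
}
```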

What to Remember

  1. Styling is communication, not decoration - Every style choice should convey meaning
  2. Limit your palette - 3-5 colors max, each with distinct meaning
  3. Create visual hierarchy - Important things should look important
  4. Be consistent - Same element type = same style across all diagrams
  5. Test accessibility - Don't rely on color alone; consider color blindness
  6. Global styles for consistency, view styles for emphasis - Define element types globally, highlight specifics per view
  7. When in doubt, simplify - Remove styling that doesn't clarify

When to Skip Styling

Sometimes, plain is better:

  • Early exploration - When you're still figuring out the architecture, don't waste time on styling
  • Internal discussions - If only architects see it and they know the system, minimal styling is fine
  • Simple systems - 3 boxes don't need color coding
  • Rapid prototyping - Get the structure right first, style later

Style when you need to communicate. Skip it when you're just thinking through problems.


Next up: We'll explore how to model complex real-world systems with all the techniques we've learned.

Lesson 6: Advanced DSL Features

Six months into a project, a new architect joined our team. During onboarding, she asked a simple question: "Why did we choose PostgreSQL instead of MongoDB?"

I stared at her. Then I stared at the architecture diagrams. Then I dug through old Slack channels, Jira tickets, and Google Docs.

Two hours later, I found a comment from eight months ago: "PostgreSQL because we need ACID transactions for payments."

That was it. No context about what alternatives we considered. No explanation of the trade-offs. No record of who made the decision or why.

We had beautiful architecture diagrams. We had detailed API specs. We had comprehensive test coverage. But we had no memory of our decisions.

That's when I learned: architecture without context is just pretty pictures. You need to capture not just WHAT you built, but WHY you built it that way.

This lesson covers the advanced Sruja features that transform your diagrams from "nice drawings" into "living documentation" that tells the full story.

The Five Pillars of Complete Architecture

A production-ready architecture model needs five things:

  1. Kinds & Types - Define your vocabulary (what elements exist)
  2. Views - Show different perspectives (who needs to see what)
  3. Scenarios - Model user behavior (how people use it)
  4. Flows - Model data movement (how data flows)
  5. Decisions - Document why (requirements, ADRs, policies)

Let me walk you through each one.


1. Kinds and Types: Define Your Vocabulary

The Naming Chaos Story

I once joined a project where everyone named things differently:

  • "Database" vs "DB" vs "DataStore" vs "Persistence"
  • "Service" vs "API" vs "Backend" vs "Server"
  • "Queue" vs "MessageBus" vs "EventStream" vs "Broker"

The architecture had 47 elements. It felt like 47 different architects had named them.

When I tried to write validation rules like "all databases must be encrypted," I couldn't. Because "databases" were hidden among "DataStores," "DBs," and "Persistence Layers."

The fix: Define your vocabulary upfront. Decide what you'll call things, then stick to it.

How Sruja Helps

Sruja's stdlib provides standard element kinds:

import { * } from 'sruja.ai/stdlib'

// Now you have consistent vocabulary:
// - person, system, container, component
// - datastore, database, queue
// - flow, scenario, requirement, adr, policy

This isn't just about naming. It enables:

1. Early Validation

// This will fail validation - "datasource" isn't a valid kind
MyDB = datasource "Database"  // ❌ Typo caught immediately

// Correct:
MyDB = datastore "Database"   // ✅ Validates successfully

2. Better Tooling

  • Autocomplete knows what kinds exist
  • Refactoring works across all instances
  • IDE can suggest valid relationships

3. Self-Documenting Models

// Anyone reading this knows these are the valid element types
// No guessing, no inconsistencies
Customer = person "Customer"
App = system "Application" {
  API = container "API"
  DB = datastore "Database"
}

Best Practice

Import stdlib at the top of every model. This makes your vocabulary explicit and consistent.


2. Multiple Views: One Model, Many Perspectives

We covered this extensively in Lesson 4, so I'll keep this brief.

The key insight: Different audiences need different views. Executives need business context. Architects need service boundaries. Developers need implementation details.

The mistake to avoid: Creating separate models for each audience. Instead, create multiple view blocks from one model.

// Define once...
Customer = person "Customer"
ECommerce = system "E-Commerce Platform" {
  WebApp = container "Web Application"
  API = container "API Service"
  DB = datastore "Database"
}

// ...show many ways
view executive {
  title "Executive Overview"
  include Customer
  include ECommerce
  // Hide technical details
  exclude ECommerce.WebApp ECommerce.API ECommerce.DB
}

view developer {
  title "Developer View"
  include ECommerce.WebApp ECommerce.API ECommerce.DB
  // Hide business context
  exclude Customer
}

Quick tip: Create views by:

  • Audience (executive, architect, developer, product)
  • Concern (security, data flow, performance)
  • Feature (checkout, user management, analytics)

For detailed guidance, revisit Lesson 4.


3. Scenarios: Modeling User Behavior

The Checkout Confusion Story

We were building an e-commerce platform. The architects drew beautiful boxes: Web App, API Service, Database, Payment Gateway. All connected with arrows.

Then product asked: "Can you walk us through the checkout flow?"

We stared at the diagram. There was an arrow from API to Database, and another from API to Payment Gateway. But in what order? What happened if payment failed? Did we reserve inventory before or after payment?

The diagram showed structure but not behavior.

That's when I learned: static diagrams need scenarios to show how the system actually works.

What Scenarios Model

Scenarios capture behavioral flows - sequences of actions that happen when users interact with your system:

  • User journeys (signup, checkout, search)
  • Business processes (order fulfillment, refund processing)
  • Error paths (payment failure, timeout handling)
  • Use cases (admin creates user, customer updates profile)

Real-World Example: E-Commerce Checkout

import { * } from 'sruja.ai/stdlib'

Customer = person "Customer"

ECommerce = system "E-Commerce System" {
  WebApp = container "Web Application"
  API = container "API Service"
  OrderDB = datastore "Order Database"
}

Inventory = system "Inventory System" {
  InventoryService = container "Inventory Service"
}

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
  }
}

// Happy path: User completes checkout successfully
CheckoutSuccess = scenario "Successful Checkout" {
  Customer -> ECommerce.WebApp "Adds items to cart"
  ECommerce.WebApp -> ECommerce.API "Submits checkout"
  ECommerce.API -> Inventory.InventoryService "Reserves stock"
  Inventory.InventoryService -> ECommerce.API "Confirms availability"
  ECommerce.API -> PaymentGateway "Processes payment"
  PaymentGateway -> ECommerce.API "Confirms payment"
  ECommerce.API -> ECommerce.OrderDB "Saves order"
  ECommerce.API -> ECommerce.WebApp "Returns confirmation"
  ECommerce.WebApp -> Customer "Shows order confirmation"
}

// Error path: Payment fails
CheckoutPaymentFailure = scenario "Payment Failure" {
  Customer -> ECommerce.WebApp "Attempts checkout"
  ECommerce.WebApp -> ECommerce.API "Submits checkout"
  ECommerce.API -> Inventory.InventoryService "Reserves stock"
  Inventory.InventoryService -> ECommerce.API "Confirms availability"
  ECommerce.API -> PaymentGateway "Processes payment"
  PaymentGateway -> ECommerce.API "Declines payment"
  ECommerce.API -> Inventory.InventoryService "Releases stock"
  ECommerce.API -> ECommerce.WebApp "Returns error"
  ECommerce.WebApp -> Customer "Shows payment error"
}

// Error path: Out of stock
CheckoutStockFailure = scenario "Out of Stock" {
  Customer -> ECommerce.WebApp "Attempts checkout"
  ECommerce.WebApp -> ECommerce.API "Submits checkout"
  ECommerce.API -> Inventory.InventoryService "Checks stock"
  Inventory.InventoryService -> ECommerce.API "Out of stock"
  ECommerce.API -> ECommerce.WebApp "Returns error"
  ECommerce.WebApp -> Customer "Shows out of stock message"
}

view index {
  include *
}

Real-World Case Studies

Amazon: Scenario-Driven Development

Amazon uses scenarios extensively. Before writing any code, they write the "press release" - a narrative of the customer experience. This becomes scenarios that guide implementation:

  1. Customer searches for product
  2. Customer reads reviews
  3. Customer adds to cart
  4. Customer checks out
  5. Customer receives order confirmation
  6. Customer receives shipping notification
  7. Customer receives package
  8. Customer writes review

Each step has success and failure scenarios.

Netflix: Error Scenarios First

Netflix practices "Chaos Engineering" - they intentionally cause failures. Their scenarios focus heavily on error paths:

  • What happens if the recommendation service is down?
  • What happens if the CDN fails?
  • What happens if the payment processor times out?

By modeling these scenarios upfront, they build resilient systems.

When to Use Scenarios

Use scenarios when:

  • Documenting user journeys
  • Planning features with product managers
  • Designing error handling
  • Writing integration tests
  • Onboarding new developers

Skip scenarios when:

  • Simple CRUD operations (create, read, update, delete)
  • Background processes with no user interaction
  • Early exploration (add them later when design stabilizes)

4. Flows: Modeling Data Movement

The Data Pipeline Mystery

We had an analytics dashboard showing "real-time" metrics. But the numbers were always 2 hours behind.

I traced the data flow:

  1. Events logged to application database
  2. Batch job every 2 hours extracts events
  3. Batch job transforms and aggregates
  4. Batch job loads into analytics warehouse
  5. Dashboard queries warehouse

The architecture diagram showed: App → Database → Dashboard.

It didn't show the 2-hour batch pipeline, the transformation logic, or the intermediate staging tables.

The diagram showed services, not data movement.

That's when I learned: data-intensive systems need flows to show how data actually moves.

What Flows Model

Flows capture data-oriented processes - how data moves, transforms, and gets stored:

  • ETL pipelines (extract, transform, load)
  • Streaming data (events, logs, metrics)
  • Batch processing (daily aggregations, monthly reports)
  • Data synchronization (between systems)

Real-World Example: Analytics Pipeline

import { * } from 'sruja.ai/stdlib'

Analytics = system "Analytics Platform" {
  IngestionService = container "Data Ingestion"
  ProcessingService = container "Data Processing"
  QueryService = container "Query Service"
  EventStream = queue "Event Stream"
  RawDataDB = datastore "Raw Data Store"
  ProcessedDataDB = datastore "Processed Data Warehouse"
}

// Real-time streaming flow
EventStreaming = flow "Real-time Event Streaming" {
  Analytics.IngestionService -> Analytics.EventStream "Publishes events"
  Analytics.EventStream -> Analytics.ProcessingService "Streams events"
  Analytics.ProcessingService -> Analytics.RawDataDB "Stores raw data"
  Analytics.ProcessingService -> Analytics.ProcessedDataDB "Aggregates in real-time"
  Analytics.QueryService -> Analytics.ProcessedDataDB "Queries analytics"
}

// Batch processing flow
DailyBatchProcessing = flow "Daily Batch Aggregation" {
  Analytics.RawDataDB -> Analytics.ProcessingService "Extracts daily data"
  Analytics.ProcessingService -> Analytics.ProcessingService "Transforms and aggregates"
  Analytics.ProcessingService -> Analytics.ProcessedDataDB "Loads aggregated data"
}

view index {
  include *
}

Real-World Case Studies

Uber: Real-time Data Flows

Uber processes millions of events per second. Their flows show:

  1. Driver location updates → Kafka → Real-time processing → Matching service
  2. Ride requests → Pricing service → Surge calculation → User notification
  3. Trip completion → Billing → Receipt generation → Email service

Each flow has different latency requirements (real-time vs. batch).

Spotify: Batch + Streaming

Spotify uses a "Lambda Architecture" with two flows:

  1. Speed layer (real-time): Events → Streaming → Fast approximate results
  2. Batch layer (daily): Events → Hadoop → Accurate complete results

Both flows serve queries, with different trade-offs.

Scenario vs. Flow: When to Use Which

This is the question I get most often. Here's the framework:

Use SCENARIOS for:

  • ✅ User actions and behavior
  • ✅ Business processes
  • ✅ Request/response interactions
  • ✅ Sequential user journeys
  • ❌ Data pipelines

Use FLOWS for:

  • ✅ Data movement and transformation
  • ✅ ETL processes
  • ✅ Event streaming
  • ✅ Batch processing
  • ❌ User interactions

Quick test:

  • Does a person initiate this? → Scenario
  • Does data move between systems automatically? → Flow

Example:

// User behavior = SCENARIO
UserCheckout = scenario "User Checks Out" {
  Customer -> WebApp "Clicks checkout"
  WebApp -> API "Processes payment"
}

// Data movement = FLOW
PaymentReconciliation = flow "Daily Payment Reconciliation" {
  TransactionDB -> ReconciliationService "Extracts transactions"
  ReconciliationService -> PaymentGateway "Compares with gateway records"
  ReconciliationService -> ReportingDB "Stores discrepancies"
}

5. Requirements, ADRs, and Policies: Documenting Decisions

The "Why Did We Do This?" Problem

Remember my story at the beginning? The one about reverse-engineering why we chose PostgreSQL?

That's the problem ADRs (Architecture Decision Records) solve.

Requirements: What We Need

Requirements capture what the system must do:

import { * } from 'sruja.ai/stdlib'

// Functional requirements
R1 = requirement "Must handle 10k concurrent users" {
  tags ["functional"]
}

// Performance requirements
R2 = requirement "API response < 200ms p95" {
  tags ["performance"]
}

// Scalability requirements
R3 = requirement "Scale to 1M users" {
  tags ["scalability"]
}

// Security requirements
R4 = requirement "All PII encrypted at rest" {
  tags ["security"]
}

Real-world example: When Airbnb designed their booking system, they had requirements like:

  • Must handle 1M+ bookings per day
  • Must prevent double-booking
  • Must support instant book and request-to-book
  • Must handle timezone complexity

These requirements drove every architectural decision.

ADRs: Why We Decided

ADRs capture why we made specific architecture decisions:

ADR001 = adr "Use microservices for independent scaling" {
  status "Accepted"
  context "Need to scale order processing independently from inventory"
  decision "Split into OrderService and InventoryService"
  consequences "Better scalability, increased network complexity"
}

ADR002 = adr "Use PostgreSQL for strong consistency" {
  status "Accepted"
  context "Need ACID transactions for financial data"
  decision "Use PostgreSQL instead of NoSQL"
  consequences "Strong consistency, SQL complexity"
}

ADR Structure:

  1. Context - What's the situation? What problem are we solving?
  2. Decision - What did we decide to do?
  3. Consequences - What are the trade-offs? What becomes easier/harder?
  4. Status - Proposed, Accepted, Deprecated, Superseded?

Real-world case study: Netflix's ADR for "Use Chaos Engineering"

  • Context: Need to ensure resilience in distributed systems
  • Decision: Intentionally inject failures in production
  • Consequences: Higher confidence in resilience, requires cultural shift, potential customer impact during experiments
  • Status: Accepted
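Expressed in the same adr syntax shown above, that case study could be sketched as (the identifier ChaosADR is illustrative):

```
ChaosADR = adr "Use Chaos Engineering" {
  status "Accepted"
  context "Need to ensure resilience in distributed systems"
  decision "Intentionally inject failures in production"
  consequences "Higher confidence in resilience, requires cultural shift, potential customer impact during experiments"
}
```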

Policies: How We Enforce

Policies capture rules that must be followed:

SecurityPolicy = policy "All databases must be encrypted" {
  category "security"
  enforcement "required"
  description "Compliance requirement for PII data"
}

ScalingPolicy = policy "Services must implement health checks" {
  category "operations"
  enforcement "required"
  description "Required for auto-scaling"
}

Real-world example: Amazon's "Two-Pizza Team" policy:

  • Each service owned by a team that can be fed by two pizzas (6-10 people)
  • Enforcement: Service ownership must be documented
  • Consequence: Services that grow too large must be split
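Using the policy syntax from above, a sketch of such an organizational rule might look like this (the identifier and wording are illustrative, not Amazon's actual policy text):

```
TeamSizePolicy = policy "Each service is owned by a two-pizza team" {
  category "organization"
  enforcement "required"
  description "Service ownership must be documented; services that outgrow one team must be split"
}
```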

Putting It All Together

Here's how requirements, ADRs, and architecture integrate:

import { * } from 'sruja.ai/stdlib'

Customer = person "Customer"

// REQUIREMENTS: What we need
R1 = requirement "Handle 10k concurrent users" { tags ["functional"] }
R2 = requirement "API response < 200ms p95" { tags ["performance"] }
R3 = requirement "Scale to 1M users" { tags ["scalability"] }
R4 = requirement "All PII encrypted at rest" { tags ["security"] }

// ADRs: Why we decided
ADR001 = adr "Use microservices for independent scaling" {
  status "Accepted"
  context "Need to scale order processing independently from inventory"
  decision "Split into OrderService and InventoryService"
  consequences "Better scalability, increased network complexity"
}

ADR002 = adr "Use PostgreSQL for strong consistency" {
  status "Accepted"
  context "Need ACID transactions for financial data"
  decision "Use PostgreSQL instead of NoSQL"
  consequences "Strong consistency, SQL complexity"
}

// ARCHITECTURE: What we built
ECommerce = system "E-Commerce Platform" {
  API = container "API Service" {
    technology "Rust"
    description "Satisfies R1, R2, R3"
    
    slo {
      availability {
        target "99.99%"
        window "30 days"
      }
      latency {
        p95 "200ms"
        p99 "500ms"
      }
    }
  }
  
  OrderDB = datastore "Order Database" {
    technology "PostgreSQL"
    description "Satisfies R4 - encrypted at rest"
  }
}

// POLICIES: What we enforce
SecurityPolicy = policy "All databases must be encrypted" {
  category "security"
  enforcement "required"
  description "Compliance requirement for PII data"
}

view index {
  include *
}

The power: When someone asks "Why did we use PostgreSQL?", you point to ADR002. When they ask "What requirement does this satisfy?", you point to R4. When they ask "Is this encrypted?", you point to SecurityPolicy.


Common Mistakes I See All the Time

Mistake #1: Inconsistent Element Types

What happens: One diagram has "Database", another has "DataStore", another has "DB".

Why it fails:

  • Validation rules can't be written
  • Tooling breaks
  • Confusion about what's what

The fix: Import stdlib and use consistent kinds everywhere.
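For example, pick one kind for data stores and use it in every file (a minimal sketch; the element names are illustrative):

```
import { * } from 'sruja.ai/stdlib'

// One consistent kind for all data stores – not "Database" in one
// file, "DataStore" in another, and "DB" in a third
OrderDB = datastore "Order Database"
UserDB = datastore "User Database"
```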

Mistake #2: Confusing Scenarios with Flows

What happens: You model "user checkout" as a flow, or "data pipeline" as a scenario.

Why it fails:

  • Wrong mental model
  • Confusing to readers
  • Doesn't match how systems work

The fix:

  • User behavior = Scenario
  • Data movement = Flow

Mistake #3: No ADRs for Major Decisions

What happens: You make important architecture decisions but don't document why.

Why it fails:

  • New team members don't understand context
  • Can't evaluate if decision is still valid
  • Repeat same discussions

The fix: Write an ADR for every significant architecture decision. If someone might ask "why did we do this?" in 6 months, write an ADR.

Mistake #4: Requirements Not Linked to Architecture

What happens: Requirements live in a separate document, architecture in another.

Why it fails:

  • Can't trace requirements to implementation
  • Don't know if requirements are satisfied
  • Changes to requirements don't trigger architecture review

The fix: Link requirements to architecture elements in the same model.
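Following the pattern from the previous section, a minimal sketch links a requirement to the element that satisfies it in the same model:

```
R4 = requirement "All PII encrypted at rest" { tags ["security"] }

OrderDB = datastore "Order Database" {
  technology "PostgreSQL"
  description "Satisfies R4 - encrypted at rest"
}
```

Now a change to R4 is a change to the same file the architecture lives in, so it naturally triggers review.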

Mistake #5: Too Many Views (or Too Few)

What happens:

  • Too many: 20 views, each slightly different, maintenance nightmare
  • Too few: One giant view showing everything

Why it fails:

  • Information overload or information scattered
  • Hard to maintain

The fix:

  • Start with 3 views: executive, architect, developer
  • Add more only when you have a specific audience in mind

Mistake #6: Scenarios That Are Too Detailed

What happens: Scenarios show every database query, every cache check, every validation.

Why it fails:

  • Too much noise
  • Hard to see the user journey
  • Becomes implementation documentation

The fix: Scenarios should show the user's perspective, not implementation details. Keep them at the business level.


The COMPLETE Framework

When building production-ready architecture models, use this framework:

C - Components & Kinds

  • Define element types upfront
  • Use consistent vocabulary
  • Import stdlib

O - Organize with Views

  • Create views for different audiences
  • Start with 3 (executive, architect, developer)
  • Add more as needed

M - Model Behavior with Scenarios

  • Document user journeys
  • Include error paths
  • Keep at business level

P - Policies and ADRs

  • Document decisions with ADRs
  • Link requirements to architecture
  • Define enforcement policies

L - Link Data Flows

  • Use flows for data pipelines
  • Show ETL and streaming
  • Separate from user scenarios

E - Evaluate and Iterate

  • Review with team regularly
  • Update when decisions change
  • Keep single source of truth

What to Remember

  1. Kinds define vocabulary - Import stdlib, use consistent element types
  2. Views provide perspective - One model, multiple views for different audiences
  3. Scenarios model behavior - User journeys, business processes, error paths
  4. Flows model data - ETL, streaming, batch processing
  5. ADRs capture why - Document context, decision, and consequences
  6. Requirements link to architecture - Trace from need to implementation
  7. Policies enforce rules - Security, compliance, operational standards
  8. Scenarios ≠ Flows - User behavior vs. data movement
  9. Keep scenarios high-level - Business perspective, not implementation
  10. Write ADRs for major decisions - If someone might ask "why?", document it

When to Skip These Features

You don't always need everything. Here's when to simplify:

Skip advanced features when:

  • Prototyping - Focus on structure, add details later
  • Simple systems - Single service + database doesn't need scenarios
  • Early exploration - Get the design right first, document later
  • Time pressure - Better to have good diagrams than perfect documentation

Always include:

  • Consistent element kinds (minimal effort, big payoff)
  • At least 2 views (executive + developer)
  • ADRs for major decisions (you'll thank yourself later)

Next up: Module 4 brings it all together with production readiness - how to make your architecture real, maintainable, and valuable.

👉 Module 4: Production Readiness - Learn how to make your architecture production-ready.


Lesson 7: Views Best Practices

I inherited a project with 47 views. Yes, forty-seven.

Each had a name like "view1", "temp", "test-final", "new-view-v2", and my personal favorite: "DO-NOT-DELETE". No descriptions. No documentation. Half of them showed the same thing with slight variations.

When I asked the team which view to show executives, they said: "Probably view12? Or maybe view18? Definitely not view23 - that one's outdated."

I spent two days consolidating those 47 views into 7 coherent perspectives. Two days of archaeology, trying to understand why each view existed, what audience it served, and whether it was still relevant.

Here's what I learned: views are powerful, but without governance, they become technical debt. This lesson is about managing views at scale - not just creating them, but maintaining them, organizing them, and knowing when to kill them.

The VIEW Governance Framework

After that disaster, I created a framework for view management. I call it VIEW:

V - Verify the Audience

  • Who needs this view?
  • What decisions will they make?
  • What detail level do they need?

I - Intentionally Design

  • What's the specific purpose?
  • What should be included/excluded?
  • How will it be maintained?

E - Explicitly Document

  • Clear, descriptive name
  • Purpose statement
  • When to use (and not use)

W - Watch Lifecycle

  • Review regularly
  • Update when architecture changes
  • Delete when no longer needed

Let me break down each piece.


When to Create Views (Decision Framework)

The mistake I made early in my career was creating views for everything. Every meeting, every question, every discussion resulted in a new view. That's how you get to 47 views.

Here's the framework I use now:

Always Create:

1. Executive View

  • Purpose: Show business value and scope
  • Audience: C-suite, board members, investors
  • Includes: Systems, external dependencies, users
  • Excludes: Technical implementation details
  • Update frequency: When business scope changes

2. Architect View

  • Purpose: Show service boundaries and tech stack
  • Audience: Architects, tech leads, senior engineers
  • Includes: Containers, databases, external systems
  • Excludes: Business context, implementation details
  • Update frequency: When architecture changes

3. Developer View

  • Purpose: Show implementation details
  • Audience: Developers building features
  • Includes: Components, internal structure, key dependencies
  • Excludes: External systems, high-level architecture
  • Update frequency: When implementation patterns change

Create Conditionally:

4. Security View

  • Create when: Handling sensitive data, compliance requirements
  • Skip when: Internal tools with no sensitive data
  • Includes: Data stores, external integrations, trust boundaries
  • Excludes: UI components, non-sensitive services

5. Performance View

  • Create when: Performance is critical, have SLAs/SLOs
  • Skip when: Simple CRUD with no performance requirements
  • Includes: Databases, caches, high-traffic components
  • Excludes: Admin tools, low-traffic features

6. Data Flow View

  • Create when: Complex data pipelines, multiple data stores
  • Skip when: Single database, simple data access patterns
  • Includes: Data stores, data transformation services
  • Excludes: UI components, business logic services

7. Deployment View

  • Create when: Complex deployment, multiple environments
  • Skip when: Simple deployment (single service + database)
  • Includes: Containers, infrastructure, deployment pipelines
  • Excludes: Business context, implementation details

8. Feature-Specific Views

  • Create when: Complex feature spanning multiple services
  • Skip when: Feature contained in single service
  • Includes: Services/components involved in feature
  • Excludes: Unrelated parts of system

Rarely Create:

9. User Journey Views

  • Usually better as: Scenarios (see Lesson 6)
  • Create as view when: Static diagram for presentation
  • Create as scenario when: Documenting behavior flow

10. Integration Views

  • Usually better as: Part of architect view
  • Create separately when: Many external integrations

View Lifecycle Management

Views aren't static. They're living documentation that needs maintenance. Here's the lifecycle:

Stage 1: Creation

view security {
  title "Security Architecture"
  description "Shows trust boundaries, data encryption, and external integrations. Use for security reviews and compliance audits."
  
  include ECommerce.API
  include PaymentGateway
  include ECommerce.OrderDB
  exclude Customer Admin ECommerce.WebApp
  
  // Document when created
  metadata {
    created "2024-01-15"
    owner "Security Team"
    review_frequency "quarterly"
  }
}

Best practices at creation:

  • ✅ Clear, descriptive name
  • ✅ Descriptive title
  • ✅ Purpose description
  • ✅ Owner (who maintains it)
  • ✅ Review frequency

Stage 2: Active Maintenance

Views need updates when:

  • Architecture changes (new services, removed services)
  • Audience needs change (new questions, different decisions)
  • Scope changes (system grows, features added)

Update checklist:

  • Do included elements still exist?
  • Should new elements be included?
  • Is the purpose still valid?
  • Is the audience still the same?

Stage 3: Deprecation

Signs a view should be deprecated:

  • ❌ No one's used it in 6 months
  • ❌ It duplicates another view
  • ❌ The audience no longer exists (team disbanded, project canceled)
  • ❌ The questions it answered are no longer relevant

Deprecation process:

  1. Mark as deprecated
  2. Add deprecation date and reason
  3. Point to replacement view (if any)
  4. Keep for 30 days, then delete

view old-performance {
  title "Performance View (DEPRECATED)"
  description "DEPRECATED: Use 'performance-v2' instead. This view doesn't include the new caching layer."
  
  metadata {
    deprecated "2024-03-15"
    replacement "performance-v2"
    reason "Doesn't reflect current caching architecture"
  }
  
  // ... rest of view
}

Stage 4: Deletion

Delete when:

  • Deprecated period has passed (30+ days)
  • No one has complained
  • Replacement view exists and is used

Before deleting:

  • Search for references in documentation
  • Check if anyone's bookmarked it
  • Announce deletion to team
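The reference search can be as simple as a recursive grep over your docs before you delete. A sketch (the view name `old-performance` and the docs path are illustrative):

```shell
# Set up an illustrative docs directory containing one reference
mkdir -p /tmp/view-audit/docs
echo "Capacity planning uses the old-performance view" > /tmp/view-audit/docs/runbook.md

# List every file that still references the deprecated view name;
# an empty result means it is safe to delete
grep -rln "old-performance" /tmp/view-audit/docs
```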

Naming Conventions That Scale

Bad names kill view discoverability. Here's what works:

Pattern 1: Audience-Based Names

view executive { /* for executives */ }
view product { /* for product managers */ }
view architect { /* for architects */ }
view developer { /* for developers */ }
view operations { /* for ops team */ }

When to use: Most common pattern, works for 80% of views

Pattern 2: Concern-Based Names

view security { /* security focus */ }
view performance { /* performance focus */ }
view dataflow { /* data dependencies */ }
view deployment { /* deployment architecture */ }

When to use: When concern is more important than audience

Pattern 3: Feature-Based Names

view checkout { /* checkout feature */ }
view search { /* search functionality */ }
view analytics { /* analytics pipeline */ }
view notifications { /* notification system */ }

When to use: Large features spanning multiple services

Pattern 4: Layer-Based Names

view context { /* C4 context layer */ }
view containers { /* C4 container layer */ }
view components { /* C4 component layer */ }

When to use: When following strict C4 model

Naming Anti-Patterns

Temporal names: view1, view2, new-view, old-view

  • Problem: Doesn't say what it is, only when it was created

Person names: johns-view, sarahs-diagram

  • Problem: What if John leaves? What if Sarah changes roles?

Status qualifiers: temp, test, draft, wip

  • Problem: These never get renamed. They become permanent.

Cryptic abbreviations: mv, arch-v2, dev-fin

  • Problem: Requires mental translation every time

Clear, descriptive names: executive, security-audit, checkout-flow

  • Solution: Self-documenting, searchable, scalable

Organization Patterns for Large Systems

When your architecture grows beyond 10 services, you need organization strategies.

Pattern 1: The Core 5

Every system should have these 5 views minimum:

view index { /* Complete system */ }
view executive { /* Business context */ }
view architect { /* Technical architecture */ }
view developer { /* Implementation details */ }
view security { /* Security & compliance */ }

These cover 90% of use cases.

Pattern 2: Concern Extensions

Add concern-specific views as needed:

view performance { /* When performance matters */ }
view dataflow { /* When data pipelines exist */ }
view deployment { /* When deployment is complex */ }
view integration { /* When many external APIs */ }

Pattern 3: Feature Slices

For large systems, create feature-specific views:

view checkout { /* Checkout feature */ }
view search { /* Search feature */ }
view recommendations { /* Recommendations feature */ }

Pattern 4: Team Domains

For orgs with domain-driven design:

view catalog-domain { /* Catalog team */ }
view order-domain { /* Order team */ }
view payment-domain { /* Payment team */ }
view user-domain { /* User team */ }

Real-World Case Studies

Netflix: View Governance at Scale

Netflix has 700+ microservices. They can't afford view chaos. Their approach:

1. Mandatory Views (enforced by CI/CD):

  • Executive view (business context)
  • Domain view (service groupings)
  • Dependencies view (critical paths)

2. Optional Views (team discretion):

  • Feature-specific views
  • Performance views
  • Deployment views

3. View Review Process:

  • Quarterly review of all views
  • Delete unused views
  • Update stale views
  • Add missing views

Result: Despite 700+ services, they maintain ~15 core views. Focused, not fragmented.

Amazon: Two-Pizza Team Views

Amazon's two-pizza teams (6-10 people) each own their services. Their view strategy:

1. Team Views:

  • Each team maintains 3-5 views for their domain
  • Team is responsible for keeping views updated

2. Cross-Team Views:

  • Central architecture team maintains integration views
  • Shows how teams connect

3. Executive Views:

  • Aggregated from team views
  • Shows business capabilities, not services

Result: Distributed ownership prevents central bottleneck. Clear ownership keeps views accurate.

Stripe: Developer-Focused Views

Stripe's famous for developer experience. Their view strategy:

1. Public Documentation Views:

  • Customer-facing integration views
  • Minimal, clear, focused

2. Internal Developer Views:

  • Detailed implementation views
  • Component-level for feature teams

3. Compliance Views:

  • Security and audit views
  • Separate from technical views

Result: Different views for different audiences. No confusion about which view to use.

Common View Management Mistakes

Mistake #1: View Proliferation

What happens: Every meeting spawns a new view. 50+ views exist.

Why it fails:

  • Can't find the right view
  • Maintenance nightmare
  • Outdated information everywhere

The fix:

  • Start with Core 5
  • Require justification for new views
  • Quarterly view audit and cleanup

Mistake #2: Orphaned Views

What happens: View created for a specific meeting, never maintained.

Why it fails:

  • Shows outdated architecture
  • Misleads people who find it
  • Erodes trust in documentation

The fix:

  • Every view needs an owner
  • Document creation date and purpose
  • Regular review schedule

Mistake #3: Generic Views

What happens: One view tries to serve everyone.

Why it fails:

  • Too much detail for executives
  • Too little detail for developers
  • Satisfies no one

The fix:

  • Create audience-specific views
  • Be explicit about who each view serves
  • Don't try to make one view do everything

Mistake #4: No Default View

What happens: Someone opens your model and doesn't know which view to use.

Why it fails:

  • Confusion from the start
  • Wrong view used for wrong purpose
  • Frustration with documentation

The fix:

  • Always have view index with include *
  • Make default view match most common use case
  • Document other views in default view's description
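A sketch of a default view whose description points readers to the other views:

```
view index {
  title "Complete System"
  description "Default view with everything. For business context use 'executive'; for implementation detail use 'developer'; for security reviews use 'security'."
  include *
}
```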

Mistake #5: Stale Views After Refactoring

What happens: Architecture gets refactored, views don't get updated.

Why it fails:

  • Views show services that no longer exist
  • New services missing from views
  • Documentation lies

The fix:

  • View updates part of refactoring PRs
  • CI/CD validation of view elements
  • Architecture review includes view review
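As a sketch, a CI step can run validation on every pull request so that views referencing deleted elements fail the build. The `sruja validate` subcommand and file path below are assumptions for illustration, not documented CLI flags – check the CLI reference for the actual invocation:

```
# .github/workflows/architecture.yml (illustrative)
name: architecture-check
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumed CLI invocation – validates element kinds, unique IDs,
      # and view references against the model
      - run: sruja validate architecture.sruja
```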

Mistake #6: Inconsistent Styling Across Views

What happens: Same element looks different in different views.

Why it fails:

  • Confusing for viewers
  • Looks unprofessional
  • Harder to understand

The fix:

  • Use global styles for consistency
  • View-specific styles only for emphasis
  • Style guide for team

The VIEW-SCALE Framework

For managing views at scale, use this framework:

V - Verify Need

  • Does this view serve a specific audience?
  • Does it answer questions not answered by existing views?
  • Will it be maintained?

I - Identify Audience

  • Who will use this view?
  • What decisions will they make?
  • What's their technical level?

E - Establish Purpose

  • What questions does this view answer?
  • What's out of scope?
  • When should this view be used?

W - Write Documentation

  • Clear name and title
  • Purpose description
  • Owner and review schedule

S - Scale Check

  • Does this duplicate existing views?
  • Can existing views be extended instead?
  • Will this add value or noise?

C - Create Minimal

  • Start with essential elements only
  • Add more if needed
  • Less is more

A - Assign Owner

  • Who's responsible for maintenance?
  • What triggers updates?
  • When should it be deprecated?

L - Lifecycle Plan

  • Review frequency
  • Update triggers
  • Deprecation criteria

E - Evaluate Regularly

  • Is it still used?
  • Is it still accurate?
  • Should it be updated or deleted?

Practical Example: E-Commerce at Scale

Let me show you a well-organized view structure for a growing e-commerce platform:

import { * } from 'sruja.ai/stdlib'

Customer = person "Customer"
Admin = person "Administrator"

ECommerce = system "E-Commerce Platform" {
  WebApp = container "Web Application" {
    CartComponent = component "Shopping Cart"
    ProductComponent = component "Product Catalog"
  }
  API = container "API Service" {
    OrderController = component "Order Controller"
    PaymentController = component "Payment Controller"
  }
  OrderDB = database "Order Database"
  ProductDB = database "Product Database"
  Cache = database "Redis Cache"
  EventQueue = queue "Event Queue"
}

PaymentGateway = system "Payment Gateway" {
  metadata {
    tags ["external"]
  }
}

Customer -> ECommerce.WebApp "Browses"
ECommerce.WebApp -> ECommerce.API "Fetches data"
ECommerce.API -> ECommerce.OrderDB "Stores orders"
ECommerce.API -> ECommerce.Cache "Caches queries"
ECommerce.API -> PaymentGateway "Processes payments"

// CORE 5 VIEWS (always present)

view index {
  title "Complete System"
  description "Full architecture with all elements. Use for reference, not presentations."
  include *
}

view executive {
  title "Business Overview"
  description "High-level business context for executives and stakeholders. Shows business capabilities and external dependencies."
  
  include Customer Admin
  include ECommerce PaymentGateway
  
  // Hide all technical details
  exclude ECommerce.WebApp ECommerce.API 
  exclude ECommerce.OrderDB ECommerce.ProductDB 
  exclude ECommerce.Cache ECommerce.EventQueue
  
  metadata {
    audience "C-suite, Board, Investors"
    owner "Product Team"
    review "quarterly"
  }
}

view architect {
  title "Technical Architecture"
  description "Container-level architecture showing service boundaries, data stores, and external integrations."
  
  include ECommerce
  include ECommerce.WebApp ECommerce.API
  include ECommerce.OrderDB ECommerce.ProductDB 
  include ECommerce.Cache ECommerce.EventQueue
  include PaymentGateway
  
  exclude Customer Admin
  
  metadata {
    audience "Architects, Tech Leads"
    owner "Architecture Team"
    review "monthly"
  }
}

view developer {
  title "Developer Guide"
  description "Implementation details for developers. Shows components, internal structure, and key dependencies."
  
  include ECommerce.WebApp
  include ECommerce.API
  include ECommerce.OrderDB ECommerce.ProductDB ECommerce.Cache
  
  exclude Customer Admin PaymentGateway
  
  metadata {
    audience "Developers"
    owner "Engineering Teams"
    review "sprint"
  }
}

view security {
  title "Security Architecture"
  description "Security focus: trust boundaries, data encryption, external integrations. Use for security reviews and compliance audits."
  
  include ECommerce.API
  include PaymentGateway
  include ECommerce.OrderDB
  
  exclude Customer Admin ECommerce.WebApp
  exclude ECommerce.ProductDB ECommerce.Cache ECommerce.EventQueue
  
  metadata {
    audience "Security Team, Compliance"
    owner "Security Team"
    review "quarterly"
  }
}

// EXTENSION VIEWS (added as needed)

view performance {
  title "Performance Architecture"
  description "Performance-critical components: databases, caches, high-traffic services. Use for capacity planning and optimization."
  
  include ECommerce.API
  include ECommerce.Cache
  include ECommerce.OrderDB ECommerce.ProductDB
  
  exclude Customer Admin PaymentGateway ECommerce.WebApp ECommerce.EventQueue
  
  metadata {
    audience "Performance Team, SRE"
    owner "Platform Team"
    review "monthly"
  }
}

view dataflow {
  title "Data Flow"
  description "Data dependencies and movement. Shows how data flows through the system. Use for data architecture discussions."
  
  include ECommerce.API
  include ECommerce.OrderDB ECommerce.ProductDB 
  include ECommerce.Cache ECommerce.EventQueue
  
  exclude Customer Admin ECommerce.WebApp PaymentGateway
  
  metadata {
    audience "Data Team, Architects"
    owner "Data Team"
    review "monthly"
  }
}

view deployment {
  title "Deployment Architecture"
  description "Deployment and infrastructure. Shows containers and their deployment targets. Use for deployment planning."
  
  include ECommerce.WebApp ECommerce.API
  include ECommerce.OrderDB ECommerce.ProductDB 
  include ECommerce.Cache ECommerce.EventQueue
  
  exclude Customer Admin PaymentGateway
  
  metadata {
    audience "DevOps, SRE"
    owner "Platform Team"
    review "monthly"
  }
}

view checkout {
  title "Checkout Feature"
  description "Checkout flow components. Shows services and components involved in checkout process. Use for checkout feature development."
  
  include Customer
  include ECommerce.WebApp ECommerce.WebApp.CartComponent
  include ECommerce.API ECommerce.API.OrderController ECommerce.API.PaymentController
  include ECommerce.OrderDB
  include PaymentGateway
  
  exclude Admin ECommerce.ProductDB ECommerce.Cache ECommerce.EventQueue
  
  metadata {
    audience "Checkout Team"
    owner "Checkout Team"
    review "sprint"
  }
}

Notice the organization:

  • Core 5 views first (always present)
  • Extension views grouped by concern
  • Feature views at the end
  • Each view has metadata: audience, owner, review schedule
  • Clear descriptions of when to use

View Maintenance Checklist

Use this checklist monthly:

Review:

  • Are all view elements still valid (no deleted services)?
  • Are new elements missing from relevant views?
  • Are view descriptions still accurate?
  • Are owners still responsible?

Clean:

  • Any views unused for 3+ months? → Mark deprecated
  • Any deprecated views older than 30 days? → Delete
  • Any duplicate views? → Consolidate
  • Any missing critical views? → Create

Validate:

  • Do views render correctly?
  • Are styles consistent?
  • Are names clear and descriptive?
  • Is documentation current?

What to Remember

  1. Start with Core 5 - Executive, Architect, Developer, Security, Index
  2. Use the VIEW framework - Verify audience, Intentionally design, Explicitly document, Watch lifecycle
  3. Name for clarity - Audience or concern, never temporal or personal
  4. Assign ownership - Every view needs an owner and review schedule
  5. Lifecycle matters - Create → Maintain → Deprecate → Delete
  6. Less is more - 7 focused views > 47 scattered views
  7. Review quarterly - Audit, clean, and validate views regularly
  8. Document purpose - When to use, what it shows, who it's for
  9. Avoid duplication - Consolidate similar views, extend existing views
  10. Update with architecture - Refactoring includes view updates

When You're Tempted to Create Another View

Ask yourself:

  1. Does this audience already have a view? → Update existing view
  2. Is this a temporary need? → Don't create, use ad-hoc screenshot
  3. Can this be a scenario or flow instead? → Use right tool for job
  4. Will this be maintained? → If no, don't create
  5. Is this worth the maintenance cost? → Every view has a cost

The 7-View Rule

From experience: most systems need 5-10 views, maximum.

  • 5 views: Simple system, one team
  • 7 views: Medium system, 2-3 teams
  • 10 views: Large system, multiple teams
  • 15+ views: Very large system OR view proliferation

If you have more than 10 views, question whether they're all necessary. Consolidate, deprecate, delete.

Summary

Views are powerful, but power requires responsibility. A few well-maintained views are worth more than dozens of orphaned, outdated perspectives.

The formula:

  • Start with Core 5
  • Add only when justified
  • Maintain with discipline
  • Clean up regularly
  • Document clearly

Your future self (and your team) will thank you.


Congratulations! You've completed Module 3: Advanced Modeling. You now have all the tools to create production-ready architecture models.

👉 Module 4: Production Readiness - Learn how to make your architecture production-ready with real-world examples and best practices.

Production Readiness

Lesson 1: Documenting Decisions (ADRs)

We made the same architectural decision three times in one year.

First time: "We should use Elasticsearch for search. It'll be fast and scalable." We implemented it. Six months later, the team that built it left.

Second time: New team, same problem. "We need search. Let's use Elasticsearch." They implemented it their way. Three months later, they rotated to another project.

Third time: Yet another team. "Why don't we have search? We should use Elasticsearch."

I stopped them. "Wait. We've done this twice already. Why are we doing it again?"

The answer: No one remembered why we made those decisions. The first team's implementation had issues. The second team didn't know about the first. The third team didn't know about either.

We had code. We had documentation. But we had no memory of our decisions.

That's when I learned about Architecture Decision Records (ADRs). Not just writing them, but living them - making them part of how the team thinks and works.

This lesson is about mastering ADRs: not just what they are, but how to write them well, maintain them over time, and use them to build institutional memory.

What is an ADR?

An Architecture Decision Record (ADR) is a document that captures an important architectural decision, including:

  • Context: Why did we need to make this decision?
  • Decision: What did we decide to do?
  • Consequences: What are the trade-offs and impacts?

We covered ADRs briefly in Module 3 (Lesson 6) as part of advanced DSL features. This lesson is a deep dive into ADRs specifically - how to write them effectively, maintain them over time, and make them a natural part of your workflow.

The Three Whys

Before writing any ADR, ask yourself:

  1. Why does this decision matter? (If it doesn't matter, don't write an ADR)
  2. Why now? (Is this the right time to make this decision?)
  3. Why will someone care in six months? (If no one will care, reconsider)

If you can't answer these clearly, you might not need an ADR.

The ADR Lifecycle

ADRs aren't static documents. They have a lifecycle that matches your decision-making process:

Stage 1: Proposed

When someone has an architectural idea that needs discussion:

ADR003 = adr "Use GraphQL for API layer" {
  status "Proposed"
  
  context "REST API is becoming complex. Clients need different data shapes. Over-fetching and under-fetching are common complaints."
  
  decision "Migrate to GraphQL for flexibility"
  
  consequences "Learning curve for team. Need to implement query complexity analysis. Caching becomes more complex."
  
  // Document alternatives considered
  option "REST with expansions" {
    pros "Team familiar with REST"
    cons "Still over/under-fetching, more endpoints"
  }
  
  option "GraphQL" {
    pros "Flexible queries, single endpoint, great tooling"
    cons "Learning curve, complexity analysis needed"
  }
}

What happens at this stage:

  • ADR is created with status "Proposed"
  • Team reviews and discusses
  • Alternatives are documented
  • Questions and concerns are raised

Stage 2: Accepted

After team alignment and approval:

ADR003 = adr "Use GraphQL for API layer" {
  status "Accepted"
  
  // ... context and decision remain the same ...
  
  accepted_date "2024-02-15"
  accepted_by "Architecture Review Board"
  
  decision "GraphQL with query complexity analysis"
  
  consequences "Learning curve for team (training budget allocated). Query complexity analysis required (use @cost directives). Caching via persisted queries."
  
  implementation_notes "Start with new endpoints only. Migrate existing REST endpoints gradually over 6 months."
}

What changes:

  • Status changes to "Accepted"
  • Specific decision details added
  • Implementation notes added
  • Acceptance metadata recorded

Stage 3: Active (Implementation)

During implementation, the ADR guides work:

GraphQLAPI = container "GraphQL API Service" {
  technology "Apollo Server"
  description "Implements ADR003 - GraphQL API layer"
  
  // Link ADR to implementation
  adr ADR003
  
  metadata {
    implementation_status "In Progress"
    target_completion "2024-06-01"
    team "Platform Team"
  }
}

What happens:

  • ADR is referenced in code comments and PRs
  • Implementation follows ADR decisions
  • Deviations trigger ADR updates

Stage 4: Deprecated

When circumstances change or better alternatives emerge:

ADR001 = adr "Use MongoDB for all data storage" {
  status "Deprecated"
  
  // ... original context and decision ...
  
  deprecated_date "2024-05-20"
  deprecated_reason "Transaction requirements require relational database. See ADR005 for replacement."
  
  replacement ADR005
  
  migration_notes "Migrate to PostgreSQL (ADR005). MongoDB remains for specific use cases (see ADR006)."
}

What triggers deprecation:

  • Better technology emerges
  • Requirements change significantly
  • Decision proven wrong by experience
  • Team learns better approach

Stage 5: Superseded

When a new ADR replaces an old one:

ADR005 = adr "Use PostgreSQL for transactional data" {
  status "Accepted"
  supersedes ADR001
  
  context "MongoDB (ADR001) lacks strong transaction support. Financial data requires ACID guarantees."
  
  decision "Use PostgreSQL for all transactional data. MongoDB for document storage only."
  
  consequences "Two database technologies to maintain. Clear data ownership boundaries needed."
}

The relationship:

  • Old ADR points to new ADR (replacement)
  • New ADR points to old ADR (supersedes)
  • Clear migration path documented
  • Both remain in history
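The lifecycle above implies a small set of legal status transitions. Here's a minimal sketch in Python (not the Sruja DSL — think of it as an external lint you might run over exported ADR metadata); note that "Active" in Stage 3 is an implementation phase tracked in metadata, not a status value:

```python
# Hypothetical lint: legal ADR status transitions, following the
# lifecycle described above. "Active" is an implementation phase in
# the examples, not a status value, so it doesn't appear here.
ALLOWED_TRANSITIONS = {
    "Proposed": {"Accepted"},
    "Accepted": {"Deprecated", "Superseded"},
    "Deprecated": set(),   # terminal: history is kept, not rewritten
    "Superseded": set(),   # terminal
}

def can_transition(current: str, new: str) -> bool:
    """Return True if an ADR may legally move from `current` to `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, set())

print(can_transition("Proposed", "Accepted"))    # True
print(can_transition("Deprecated", "Proposed"))  # False
```

A check like this catches accidental edits — for example, quietly flipping a Deprecated ADR back to Accepted instead of writing a superseding one.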

ADR Structure: The Complete Template

Here's a comprehensive ADR template you can adapt:

ADR### = adr "[Short Title]" {
  // REQUIRED: Current status
  status "Proposed" | "Accepted" | "Deprecated" | "Superseded"
  
  // REQUIRED: Decision context
  context "
    What is the issue/problem we're addressing?
    What constraints exist?
    What requirements must be met?
  "
  
  // REQUIRED: The decision
  decision "
    What is the change or action being proposed/taken?
    Be specific and actionable.
  "
  
  // REQUIRED: Impact analysis
  consequences {
    positive "
      What benefits do we expect?
      What becomes easier?
    "
    
    negative "
      What are the trade-offs?
      What becomes harder?
      What risks exist?
    "
    
    neutral "
      What are the side effects?
      What else changes?
    "
  }
  
  // OPTIONAL: Alternatives considered
  option "[Alternative 1]" {
    pros "Why was this attractive?"
    cons "Why didn't we choose this?"
    rejected_because "Specific reason for rejection"
  }
  
  option "[Alternative 2]" {
    pros "Benefits of this approach"
    cons "Drawbacks of this approach"
    rejected_because "Why not chosen"
  }
  
  // OPTIONAL: Implementation guidance
  implementation {
    who_responsible "Team or person"
    timeline "Expected completion"
    first_steps "What to do first"
    success_criteria "How to measure success"
  }
  
  // OPTIONAL: Metadata
  metadata {
    author "Name"
    created "YYYY-MM-DD"
    accepted_date "YYYY-MM-DD"
    review_date "YYYY-MM-DD"  // When to revisit
    related_adrs [ADR001, ADR002]  // Related decisions
    tags ["database", "performance", "security"]
  }
}

Real-World ADR Examples

Let me show you some famous architecture decisions from real companies, written up in ADR form:

Example 1: Netflix - Chaos Engineering

Context: Netflix needed to ensure resilience in their distributed system. Traditional testing wasn't enough.

Decision: Intentionally inject failures in production (Chaos Monkey).

Consequences:

  • Positive: Higher confidence in resilience, proactive failure detection
  • Negative: Potential customer impact during experiments, requires cultural shift
  • Neutral: Need dedicated team, monitoring requirements

Why this ADR is famous: It challenged conventional wisdom. Most companies tried to prevent failures. Netflix decided to cause them.

Example 2: Amazon - Two-Pizza Teams

Context: Communication overhead was slowing down development as teams grew.

Decision: Limit teams to 6-10 people (two pizzas can feed them). Each team owns their service end-to-end.

Consequences:

  • Positive: Faster decisions, clear ownership, reduced coordination
  • Negative: Potential duplication, need for clear APIs between teams
  • Neutral: Requires decentralized architecture

Why this ADR is famous: It's simple but profound. Team size directly impacts architecture.

Example 3: Google - Code Search Index

Context: Google's codebase was too large for traditional search tools.

Decision: Build custom code search with trigram indexes.

Consequences:

  • Positive: Fast searches across billions of lines of code
  • Negative: Significant infrastructure investment
  • Neutral: Became internal tool, later open-sourced (Kythe)

Why this ADR is famous: Shows how scale forces custom solutions.

Example 4: Stripe - API Versioning

Context: API changes were breaking existing integrations.

Decision: Every API change creates a new version. Clients pin to specific versions.

Consequences:

  • Positive: No breaking changes for existing users
  • Negative: Multiple API versions to maintain
  • Neutral: Requires comprehensive testing across versions

Why this ADR is famous: Solved API evolution elegantly.

When to Write an ADR

Not every decision needs an ADR. Here's the framework I use:

Always Write an ADR:

1. Technology Choices

  • Choosing a database (PostgreSQL vs. MongoDB vs. CockroachDB)
  • Choosing a language/framework (Rust vs. Go, React vs. Vue)
  • Choosing infrastructure (AWS vs. GCP, Kubernetes vs. ECS)

Why: These are expensive to change later. Document the reasoning.

2. Architectural Patterns

  • Microservices vs. monolith
  • Event-driven vs. request-response
  • Sync vs. async processing

Why: These shape the entire system. Future architects need to understand why.

3. Security Decisions

  • Authentication approach (OAuth, SAML, custom)
  • Encryption standards (at rest, in transit)
  • Compliance requirements (SOC2, HIPAA, GDPR)

Why: Security auditors will ask. Be prepared.

4. Performance Standards

  • Latency SLAs (p95, p99)
  • Throughput requirements
  • Caching strategies

Why: These constrain implementation choices.

5. Team Structure Decisions

  • How teams are organized
  • Ownership boundaries
  • Communication protocols

Why: This affects how architecture evolves.

Sometimes Write an ADR:

6. API Design Decisions

  • REST vs. GraphQL vs. gRPC
  • Versioning strategy
  • Error handling approach

Write when: the API is public or long-lived.
Skip when: the API is internal and short-lived.

7. Data Model Decisions

  • Normalization strategy
  • Schema vs. schemaless
  • Data retention policies

Write when: the data is complex or critical.
Skip when: simple CRUD operations.

8. Deployment Decisions

  • CI/CD approach
  • Deployment strategy (blue-green, canary)
  • Environment management

Write when: deployment requirements are complex.
Skip when: deployment is simple.

Rarely Write an ADR:

9. Implementation Details

  • Variable naming conventions
  • Code formatting rules
  • Internal refactoring

Why: Too granular. Changes frequently. Use code comments instead.

10. Temporary Decisions

  • Short-term workarounds
  • Prototypes and experiments
  • Quick fixes

Why: Won't matter in six months. Write a comment instead.

The Decision Test

Ask: "Will someone ask 'why did we do this?' in six months?"

  • Yes → Write an ADR
  • Maybe → Consider writing an ADR
  • No → Don't write an ADR

ADR Anti-Patterns

Anti-Pattern #1: The Hindsight ADR

What happens: You write ADRs for decisions made months or years ago.

Why it fails:

  • Memory is unreliable
  • Context is lost
  • Alternatives forgotten
  • Feels like homework

The fix: Write ADRs when decisions are made. If you must write retrospective ADRs, mark them clearly:

ADR001 = adr "Use PostgreSQL" {
  status "Accepted"
  retrospective true  // This is a historical record
  context "This decision was made in 2022. Context reconstructed from git history and Slack archives."
}

Anti-Pattern #2: The Obvious ADR

What happens: You write an ADR for a decision that doesn't need documentation.

// DON'T DO THIS
ADR042 = adr "Use Git for version control" {
  context "We need version control"
  decision "Use Git because it's standard"
  consequences "Everyone knows Git"
}

Why it fails:

  • Wastes time
  • Dilutes importance of real ADRs
  • Becomes noise

The fix: Use the decision test. If no one will ask "why?" in six months, don't write it.

Anti-Pattern #3: The Novel ADR

What happens: ADRs are long, beautifully written essays that no one reads.

Why it fails:

  • Too long to read quickly
  • Key information buried
  • Becomes documentation debt

The fix: Be concise. Use structure. Highlight key points. The best ADRs can be scanned in 2 minutes.

Anti-Pattern #4: The Orphan ADR

What happens: ADR exists but isn't connected to the architecture or code.

Why it fails:

  • ADRs and code drift apart
  • No one knows ADR exists
  • Decisions get re-made

The fix: Link ADRs to architecture:

PaymentService = system "Payment Service" {
  adr ADR003  // "Use Stripe for payments"
  description "Implements payment processing per ADR003"
}

And reference in code:

// This implementation follows ADR003: Use Stripe for payments
// See: docs/adrs/ADR003-use-stripe-for-payments.md
pub fn process_payment(payment: Payment) -> Result<PaymentResult> {
    // ...
}

Anti-Pattern #5: The Immutable ADR

What happens: ADRs are treated as unchangeable holy texts.

Why it fails:

  • Circumstances change
  • Decisions become outdated
  • ADRs lose credibility

The fix: ADRs have a lifecycle. Update status, mark deprecated, supersede with new ADRs. Keep them current.

Anti-Pattern #6: The Secret ADR

What happens: ADRs are written but not shared with the team.

Why it fails:

  • Decisions aren't discussed
  • Team doesn't buy in
  • ADRs become one person's opinion

The fix: ADRs should be:

  • Reviewed by team
  • Discussed in architecture meetings
  • Accessible to everyone
  • Part of onboarding

ADR Best Practices

Practice #1: Number Your ADRs

Use sequential numbering: ADR001, ADR002, ADR003...

Why:

  • Easy to reference
  • Shows history
  • Enables linking

Don't use:

  • Dates (ADR-2024-03-15)
  • Categories (ADR-DB-001)

Both create confusion.

Practice #2: Use Consistent Titles

Good titles:

  • "Use PostgreSQL for transactional data"
  • "Implement rate limiting at API gateway"
  • "Adopt event-driven architecture for order processing"

Bad titles:

  • "Database decision" (too vague)
  • "New architecture" (not specific)
  • "Meeting notes from March" (not a decision)

Practice #3: Include Decision Date

Always record when decisions were made:

ADR003 = adr "Use GraphQL" {
  metadata {
    created "2024-01-15"
    accepted "2024-02-20"
    review_date "2024-08-20"  // 6 months later
  }
}

Why: Context changes over time. Knowing when helps evaluate if the decision is still valid.

Practice #4: Document Alternatives

Always show what you considered and rejected:

option "Alternative 1" {
  pros "Why it was attractive"
  cons "Why it had issues"
  rejected_because "Specific deal-breaker"
}

Why: Shows you did due diligence. Prevents re-litigating the decision later.

Practice #5: Be Specific About Consequences

Don't just say "pros and cons". Be concrete:

consequences {
  positive "
    - Reduces API calls by 60% (measured in prototype)
    - Enables client-side caching
    - Single endpoint simplifies monitoring
  "
  
  negative "
    - 2-week learning curve for team (training scheduled)
    - Query complexity analysis required (use Apollo Studio)
    - Caching less straightforward (implement persisted queries)
  "
  
  neutral "
    - New dependency: Apollo Server
    - Different testing approach needed
    - Monitoring tools need updates
  "
}

Practice #6: Set Review Dates

Decisions aren't forever. Set review dates:

metadata {
  review_date "2024-08-15"  // 6 months
  review_questions [
    "Is GraphQL meeting our needs?",
    "Are there new alternatives?",
    "Should we expand or restrict usage?"
  ]
}

Practice #7: Make ADRs Searchable

Organize ADRs for discovery:

docs/
  adr/
    ADR001-use-postgresql.md
    ADR002-microservices-architecture.md
    ADR003-graphql-api-layer.md
    README.md  // Index of all ADRs

Create an index:

# Architecture Decision Records

## Active ADRs
- [ADR003: Use GraphQL](ADR003-graphql-api-layer.md) - 2024-02-20
- [ADR002: Microservices](ADR002-microservices-architecture.md) - 2023-11-10

## Deprecated ADRs
- [ADR001: Use MongoDB](ADR001-use-mongodb.md) - Deprecated 2024-05-20

## By Category
### Database
- ADR001: Use MongoDB (Deprecated)
- ADR005: Use PostgreSQL for transactions

### API
- ADR003: Use GraphQL
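An index like this can be generated rather than hand-maintained. A minimal sketch in Python — the record list is illustrative; a real script might read front matter from the .md files or Sruja's JSON export, and link targets are simplified here to `<id>.md`:

```python
# Sketch: generating an ADR index like the one above from metadata.
# The records are illustrative stand-ins for parsed .md front matter.
adrs = [
    ("ADR001", "Use MongoDB", "Deprecated", "2024-05-20"),
    ("ADR002", "Microservices", "Accepted", "2023-11-10"),
    ("ADR003", "Use GraphQL", "Accepted", "2024-02-20"),
]

def build_index(records):
    active = [r for r in records if r[2] == "Accepted"]
    deprecated = [r for r in records if r[2] == "Deprecated"]
    lines = ["# Architecture Decision Records", "", "## Active ADRs"]
    # Newest decisions first, matching the hand-written index above
    for adr_id, title, _, date in sorted(active, key=lambda r: r[3], reverse=True):
        lines.append(f"- [{adr_id}: {title}]({adr_id}.md) - {date}")
    lines += ["", "## Deprecated ADRs"]
    for adr_id, title, _, date in deprecated:
        lines.append(f"- [{adr_id}: {title}]({adr_id}.md) - Deprecated {date}")
    return "\n".join(lines)

print(build_index(adrs))
```

Regenerating the index in CI keeps it from drifting out of date as ADRs are added or deprecated.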

Practice #8: Review ADRs Regularly

Schedule quarterly ADR reviews:

Review checklist:

  • Are all ADRs still relevant?
  • Any need to be deprecated?
  • Any need to be updated?
  • Any new ADRs needed?
  • Are ADRs being followed?

ADR Review Process

ADRs should be reviewed before acceptance:

The Lightweight Process (Small Teams)

  1. Author creates ADR with status "Proposed"
  2. Share with the team in your Slack or chat channel
  3. Discussion period (2-3 days)
  4. Team consensus → Update status to "Accepted"
  5. Implementation begins

The Formal Process (Larger Teams)

  1. Author creates ADR with status "Proposed"
  2. Architecture review in weekly meeting
  3. Feedback incorporated
  4. Second review if needed
  5. Approval from architecture team → Status "Accepted"
  6. Implementation begins

The ADR Review Checklist

When reviewing an ADR, ask:

  • Is the context clear? (Do I understand the problem?)
  • Is the decision specific? (Do I know what to do?)
  • Are consequences realistic? (Not just optimistic?)
  • Were alternatives considered? (Did we do our homework?)
  • Is it actionable? (Can someone implement this?)
  • Is it necessary? (Does this need an ADR?)
  • Is it reversible? (Can we change our minds later?)

Complete Example: E-Commerce Payment ADR

Let me show you a complete, production-ready ADR:

import { * } from 'sruja.ai/stdlib'

ADR003 = adr "Use Stripe for payment processing" {
  status "Accepted"
  
  context "
    Our e-commerce platform needs to process credit card payments. 
    Requirements:
    - Support major credit cards (Visa, MC, Amex)
    - Handle international currencies (USD, EUR, GBP, JPY)
    - PCI-DSS compliance
    - Recurring payments for subscriptions
    - Low latency (< 3s for payment authorization)
    
    Constraints:
    - Small team (3 engineers)
    - Need to launch in 6 weeks
    - Budget: $500/month + transaction fees
  "
  
  decision "Integrate Stripe for all payment processing"
  
  consequences {
    positive "
      - Faster time to market (2-3 weeks vs. 2-3 months for custom)
      - PCI-DSS compliance handled by Stripe (we don't store card numbers)
      - Excellent documentation and SDKs
      - Supports all required currencies
      - Built-in recurring payment support
      - Dashboard for finance team
    "
    
    negative "
      - Vendor lock-in (would take 2+ months to switch)
      - Transaction fees: 2.9% + $0.30 per transaction
      - Less control over payment flow
      - Stripe outages affect us (99.99% SLA, but still)
    "
    
    neutral "
      - New dependency in our stack
      - Finance team needs training on Stripe dashboard
      - Legal review of Stripe terms required
    "
  }
  
  option "PayPal" {
    pros "Recognized brand, easy setup"
    cons "Higher fees (3.49% + $0.49), less developer-friendly, limited customization"
    rejected_because "Higher fees at scale, worse API experience"
  }
  
  option "Adyen" {
    pros "Lower fees at scale, more payment methods"
    cons "Longer integration time, more complex setup, overkill for our size"
    rejected_because "Overkill for our current scale, 6-week timeline too tight"
  }
  
  option "Build custom" {
    pros "No vendor lock-in, complete control"
    cons "PCI-DSS compliance burden, 3+ months to build, ongoing maintenance"
    rejected_because "Timeline constraints, PCI compliance too risky for small team"
  }
  
  implementation {
    who_responsible "Payments Team"
    timeline "3 weeks"
    first_steps "
      1. Create Stripe account
      2. Complete legal review of terms
      3. Implement payment intent flow
      4. Set up webhook handlers
      5. Test with Stripe test cards
      6. Security review
      7. Production deployment
    "
    success_criteria "
      - Process test payments successfully
      - < 3s authorization time
      - All webhooks handled correctly
      - Finance team can generate reports
    "
  }
  
  metadata {
    author "Jane Smith"
    created "2024-02-10"
    accepted "2024-02-15"
    review_date "2024-08-15"
    tags ["payments", "vendor", "compliance"]
    related_adrs []
  }
}

// Link ADR to architecture
PaymentService = system "Payment Service" {
  PaymentAPI = container "Payment API" {
    technology "Node.js"
    description "Implements ADR003: Stripe integration"
  }
}

// Mark the relationship
PaymentService.PaymentAPI -> ADR003 "Implements"

view index {
  include *
}

What to Remember

  1. ADRs are memory, not paperwork - They preserve decision context for future you and future teams

  2. ADRs have a lifecycle - Proposed → Accepted → Active → Deprecated → Superseded

  3. Write ADRs when decisions are made - Not months later when memory fades

  4. Be specific about consequences - Concrete pros/cons/neutral impacts, not vague statements

  5. Document alternatives - Show you did due diligence, prevent re-litigation

  6. Set review dates - Circumstances change, revisit decisions periodically

  7. Number sequentially - ADR001, ADR002... not dates or categories

  8. Link to architecture - ADRs should be referenced in code and architecture models

  9. Keep them concise - 2-minute scan, not 20-minute read

  10. Share with the team - ADRs are team artifacts, not personal journals

When Not to Write an ADR

Don't write an ADR when:

  • Decision is obvious - "We'll use Git for version control"
  • Decision is temporary - "Quick fix for production bug"
  • Decision is reversible with low cost - "Let's try this library"
  • Decision is purely implementation - "We'll use camelCase for variables"

Write an ADR when:

  • Decision has lasting impact - Database choice, architecture pattern
  • Decision is expensive to reverse - Major refactoring, vendor lock-in
  • Decision affects multiple teams - API contracts, shared infrastructure
  • Decision will be questioned - Non-obvious choice, trade-offs involved

Practical Exercise

Create an ADR for a decision in your current or recent project:

Step 1: Identify a significant architectural decision

  • Technology choice
  • Pattern adoption
  • Structural change
  • Security approach

Step 2: Use the complete template:

  • Context (problem, constraints, requirements)
  • Decision (specific, actionable)
  • Consequences (positive, negative, neutral)
  • Alternatives considered (at least 2)
  • Implementation notes
  • Metadata (dates, author, tags)

Step 3: Review with your team

  • Does everyone understand the context?
  • Are consequences realistic?
  • Were alternatives adequately considered?

Step 4: Link to your architecture

  • Reference the ADR in your Sruja model
  • Add code comments pointing to the ADR
  • Include in onboarding documentation

ADR Maturity Model

Where is your team on the ADR journey?

Level 0: No ADRs

  • Decisions live in Slack, email, or people's heads
  • Re-litigating decisions is common
  • New hires struggle to understand "why"

Level 1: Reactive ADRs

  • ADRs written when someone asks "why?"
  • Usually retrospective (memory reconstruction)
  • Inconsistent format
  • Not linked to code/architecture

Level 2: Proactive ADRs

  • ADRs written when decisions are made
  • Consistent format
  • Reviewed by team
  • Some links to code/architecture

Level 3: Living ADRs

  • ADRs have clear lifecycle (proposed → accepted → deprecated)
  • Regularly reviewed and updated
  • Fully integrated with code and architecture
  • Part of team culture and onboarding

Level 4: ADR-Driven Development

  • ADRs written before implementation
  • ADRs drive architecture reviews
  • ADRs inform technical strategy
  • ADRs are strategic assets

Goal: Reach Level 3. Level 4 is aspirational for most teams.


Next up: Lesson 2 explores deployment architecture - how to model where your system runs, choose a deployment strategy, and keep releases safe and reversible.

Lesson 2: Deployment Architecture

It was 4:47 PM on a Friday. I pushed the deploy button.

What could go wrong? It was just a "small database migration." Add a column, update some queries, deploy the API. Done in 10 minutes, right?

By 5:15 PM, the entire platform was down. The migration had locked the database. Every API request was timing out. Customers were calling support. The CEO was texting me. And I couldn't rollback because the migration had partially completed.

We were down for 3 hours and 42 minutes.

That Friday taught me more about deployment architecture than the previous 5 years combined. How you deploy matters as much as what you deploy. A great architecture deployed poorly will fail. A mediocre architecture deployed well will survive.

This lesson is about deployment architecture: how to model it, how to choose strategies, and how to avoid becoming a cautionary tale.

The Two Architectures

Every system has two architectures that most teams confuse:

Logical Architecture

What your system does - the software components and their interactions.

// This is LOGICAL architecture
ECommerce = system "E-Commerce Platform" {
  API = container "REST API" {
    technology "Rust"
  }
  
  WebApp = container "Web Application" {
    technology "React"
  }
  
  Database = database "PostgreSQL" {
    technology "PostgreSQL"
  }
  
  Cache = database "Redis Cache" {
    technology "Redis"
  }
}

This shows:

  • What services exist
  • How they communicate
  • What technologies they use

Audience: Architects, developers, product managers

Physical Architecture

Where your system runs - the infrastructure and deployment topology.

// This is PHYSICAL architecture
deployment Production "Production Environment" {
  node AWS "AWS Cloud" {
    node USEast1 "US-East-1 Region" {
      node EKS "Kubernetes Cluster" {
        containerInstance ECommerce.API {
          replicas 10
          cpu "2 cores"
          memory "4GB"
        }
      }
      
      node RDS "RDS PostgreSQL" {
        containerInstance ECommerce.Database {
          instance "db.r5.xlarge"
          multi_az true
        }
      }
    }
  }
}

This shows:

  • Where code runs
  • Infrastructure configuration
  • Scaling parameters
  • Geographic distribution

Audience: DevOps, SRE, platform engineers

Why the Separation Matters

Story: A startup I worked with had beautiful logical architecture diagrams. Microservices, event-driven, clean boundaries. But their deployment? Everything ran on one EC2 instance. When that instance failed, all their "resilient microservices" went down together.

The lesson: Logical resilience means nothing without physical separation.

When to model separately:

  • Planning migrations (EC2 → EKS, on-prem → cloud)
  • Multi-region deployments
  • Disaster recovery planning
  • Cost optimization
  • Compliance requirements

Deployment Strategies: When to Use What

Let me walk you through the real-world trade-offs of each strategy.

On-Premises: When Control Trumps Convenience

What it is: Running on your own hardware in your own data center.

Real-world example: Goldman Sachs

Goldman runs most of their trading systems on-premises. Why? Microsecond latency matters in high-frequency trading. Cloud latency is too unpredictable. Regulatory requirements demand data sovereignty. And when you're moving billions of dollars, the cost of owning hardware is negligible.

When to choose on-prem:

  • ✅ Regulatory requirements (data must stay in specific location)
  • ✅ Extreme latency requirements (< 1ms)
  • ✅ Predictable, massive scale (you know you'll use 10,000 servers)
  • ✅ Classified/sensitive data (government, defense)

When to avoid:

  • ❌ Early-stage startups (capital expense too high)
  • ❌ Variable traffic (you'll over-provision)
  • ❌ Small teams (maintenance burden)
  • ❌ Geographic distribution needs

Cost reality:

  • Initial investment: $500K - $5M (hardware, data center, networking)
  • Ongoing: $50K - $500K/month (power, cooling, staff)
  • Break-even point: 3-5 years
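The break-even arithmetic is simple enough to sketch. All numbers below are illustrative assumptions — run it with your own figures:

```python
# Illustrative on-prem vs. cloud break-even calculation.
# All inputs are assumptions; substitute your own numbers.
def breakeven_years(onprem_initial, onprem_monthly, cloud_monthly):
    """Years until cumulative on-prem cost drops below cloud cost."""
    if cloud_monthly <= onprem_monthly:
        return None  # cloud is cheaper every month; no break-even
    months = onprem_initial / (cloud_monthly - onprem_monthly)
    return months / 12

# e.g. $2M up front and $100K/month on-prem vs. $200K/month cloud:
print(round(breakeven_years(2_000_000, 100_000, 200_000), 1))  # 1.7 years
```

Note how sensitive the result is to the monthly delta — which is why the honest answer to "which is cheaper?" is always "for which workload?".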

The mistake I see: Companies choose on-prem for "security" when cloud is actually more secure (AWS spends more on security than most companies' entire revenue).

Cloud: Speed and Flexibility

What it is: Renting infrastructure from AWS, GCP, Azure, etc.

Real-world example: Airbnb

Airbnb runs almost entirely on AWS. During the 2022 travel surge, they scaled from 5,000 to 25,000 instances in hours. Try doing that with on-prem.

When to choose cloud:

  • ✅ Early-stage (pay-as-you-go)
  • ✅ Variable traffic (scale up/down)
  • ✅ Global distribution (deploy anywhere)
  • ✅ Small team (managed services)
  • ✅ Speed to market

When to be careful:

  • ⚠️ Predictable, steady workloads (can be cheaper on-prem)
  • ⚠️ Extreme compliance (some certifications require physical control)
  • ⚠️ Very high bandwidth (cloud egress gets expensive)

Cost reality:

  • Startup: $500 - $5,000/month
  • Mid-size: $20K - $100K/month
  • Enterprise: $500K - $5M/month

The mistake I see: "Cloud is always cheaper." It's not. Run the numbers for YOUR workload.

Containers & Kubernetes: The Standard for Scale

What it is: Packaging code with dependencies (Docker) and orchestrating at scale (Kubernetes).

Real-world example: Spotify

Spotify runs 150+ services on Google Kubernetes Engine (GKE). Before Kubernetes, deployments took hours and scaling was manual. Now: 2-minute deployments, auto-scaling, self-healing.

When to choose Kubernetes:

  • ✅ 10+ services (orchestration value)
  • ✅ Need auto-scaling
  • ✅ Multi-cloud strategy
  • ✅ Dev teams want self-service deployment

When to avoid:

  • ❌ < 5 services (overkill)
  • ❌ Simple stateless apps (ECS or Cloud Run is easier)
  • ❌ Small team (K8s expertise required)
  • ❌ Just getting started (add complexity later)

Cost reality:

  • Control plane: roughly $0 - $150/month, depending on the provider and whether it's managed
  • Worker nodes: $500 - $50,000/month depending on scale
  • Hidden cost: Engineering time (steep learning curve)

The mistake I see: "We need Kubernetes because Netflix uses it." Netflix has 700 engineers. You have 5. Start simpler.

Real-World Deployment Patterns

Pattern 1: Blue/Green Deployment

What it is: Run two identical environments (Blue = current, Green = new). Switch traffic instantly.

Real-world example: Amazon

Amazon uses Blue/Green for most services. Their deployment philosophy: "If you can't roll back in 30 seconds, you're doing it wrong."

How it works:

  1. Blue environment is live (100% traffic)
  2. Deploy new version to Green environment
  3. Run tests on Green
  4. Switch 10% traffic to Green
  5. Monitor for 15 minutes
  6. Gradually increase to 100%
  7. Keep Blue warm for instant rollback
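The cutover steps above can be sketched as a control loop. `set_traffic_split` and `health_check` here are hypothetical stand-ins for your load balancer or service mesh API:

```python
# Sketch of a Blue/Green cutover as a control loop.
# set_traffic_split and health_check are hypothetical hooks into
# your load balancer / service mesh.
def blue_green_cutover(set_traffic_split, health_check,
                       steps=(10, 25, 50, 100)):
    """Shift traffic from Blue to Green in stages; roll back on failure."""
    for green_pct in steps:
        set_traffic_split(blue=100 - green_pct, green=green_pct)
        if not health_check():
            # Blue is still warm, so rollback is a single split change
            set_traffic_split(blue=100, green=0)
            return False
    return True

calls = []
ok = blue_green_cutover(lambda blue, green: calls.append((blue, green)),
                        lambda: True)
print(ok, calls[-1])  # True (0, 100)
```

The key property is step 7 from the list: because Blue stays warm, rollback is one traffic-split change, not a redeploy.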

Sruja model:

import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce" {
  API = container "API Service" {
    technology "Rust"
  }
}

deployment Production "Production" {
  node Blue "Blue Environment (Active)" {
    status "active"
    containerInstance ECommerce.API {
      replicas 10
      traffic 100
      version "v2.3.1"
    }
  }
  
  node Green "Green Environment (Standby)" {
    status "standby"
    containerInstance ECommerce.API {
      replicas 10
      traffic 0
      version "v2.3.2"  // New version ready
    }
  }
}

view index {
  include *
}

When to use:

  • ✅ Zero-downtime requirement
  • ✅ Critical services (payments, auth)
  • ✅ Need instant rollback
  • ✅ Complex deployments (db migrations + code)

When to avoid:

  • ❌ Resource-constrained (doubles infrastructure cost)
  • ❌ Simple apps (rolling update is fine)
  • ❌ Stateless services (no migration needed)

Cost: 2x infrastructure (two full environments)

Pattern 2: Canary Deployment

What it is: Gradually shift traffic to new version while monitoring for issues.

Real-world example: Netflix

Netflix's deployment philosophy: "Deploy to 1%, watch for 30 minutes. If good, deploy to 5%, watch. Continue until 100%."

How it works:

  1. Deploy new version alongside old
  2. Route 1% traffic to new version
  3. Monitor error rates, latency, business metrics
  4. If good → increase to 5%, then 10%, then 25%, then 100%
  5. If bad → automatic rollback
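The rollback gate in step 5 reduces to a threshold check. A minimal sketch, using the same illustrative thresholds as the canary examples in this lesson (error rate above 1%, p95 latency above 500 ms):

```python
# Sketch: the canary auto-rollback gate as a threshold check.
# Thresholds mirror the illustrative values used in this lesson;
# metric inputs are assumed to come from your monitoring system.
def canary_healthy(error_rate_pct: float, latency_p95_ms: float,
                   max_error_pct: float = 1.0,
                   max_p95_ms: float = 500.0) -> bool:
    """Return False if the canary should be rolled back."""
    return error_rate_pct <= max_error_pct and latency_p95_ms <= max_p95_ms

print(canary_healthy(0.5, 300.0))  # True  -> keep increasing traffic
print(canary_healthy(2.0, 300.0))  # False -> roll back
```

In practice you'd also compare the canary's metrics against the stable version's, not just against fixed thresholds, so a platform-wide incident doesn't trigger a false rollback.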

Sruja model:

import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce" {
  API = container "API Service" {
    technology "Rust"
  }
}

deployment Production "Production" {
  node Stable "Stable Version" {
    containerInstance ECommerce.API {
      replicas 20
      traffic 95  // 95% of traffic
      version "v2.3.1"
    }
  }
  
  node Canary "Canary Version" {
    containerInstance ECommerce.API {
      replicas 1
      traffic 5  // 5% of traffic
      version "v2.3.2"
      
      auto_rollback {
        enabled true
        error_rate "> 1%"
        latency_p95 "> 500ms"
        trigger_time "5 minutes"
      }
    }
  }
}

view index {
  include *
}

When to use:

  • ✅ Large user base (1% = statistically significant)
  • ✅ Can tolerate some users hitting issues
  • ✅ Want early warning before full rollout
  • ✅ Continuous deployment (ship daily)

When to avoid:

  • ❌ Small user base (1% = 1 user)
  • ❌ Zero-tolerance for errors (B2B, healthcare)
  • ❌ Simple, well-tested changes

Cost: Minimal (canary is usually small % of capacity)

Pattern 3: Rolling Deployment

What it is: Gradually replace old instances with new ones.

Real-world example: Uber

Uber deploys 1,000+ times per day using rolling deployments. Each service has multiple instances. Update one at a time, keeping enough capacity.

How it works:

  1. Service has 10 instances running
  2. Terminate 1 instance
  3. Start 1 new instance
  4. Wait for health check
  5. Repeat until all updated
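The loop above can be sketched directly. `replace_instance` and `wait_healthy` are hypothetical hooks into your orchestrator; `max_unavailable` mirrors the Kubernetes setting of the same name:

```python
# Sketch of a rolling update: replace instances in small batches,
# never taking more than max_unavailable down at once.
# replace_instance and wait_healthy are hypothetical orchestrator hooks.
def rolling_update(instances, replace_instance, wait_healthy,
                   max_unavailable=1):
    """Update each instance in turn; abort if a batch stays unhealthy."""
    for idx in range(0, len(instances), max_unavailable):
        batch = instances[idx:idx + max_unavailable]
        for inst in batch:
            replace_instance(inst)
        if not wait_healthy(batch):
            return False  # halt rollout; remaining instances keep old version
    return True
```

Note the failure mode: unlike Blue/Green, a halted rolling update leaves the fleet mixed (some old, some new), which is why it suits stateless services with backward-compatible changes.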

Sruja model:

import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce" {
  API = container "API Service" {
    technology "Rust"
  }
}

deployment Production "Production" {
  node Cluster "Kubernetes Cluster" {
    containerInstance ECommerce.API {
      replicas 10
      version "v2.3.2"
      
      rolling_update {
        max_unavailable 1  // Only 1 down at a time
        max_surge 1  // Can create 1 extra during update
      }
    }
  }
}

view index {
  include *
}

When to use:

  • ✅ Stateless services
  • ✅ Resource-efficient (no extra capacity)
  • ✅ Quick deployments
  • ✅ Multiple replicas (3+)

When to avoid:

  • ❌ Single replica (downtime during update)
  • ❌ Stateful services (session draining issues)
  • ❌ Complex migrations (need Blue/Green)

Cost: Minimal (uses existing capacity)

Decision Framework: Which Pattern?

Ask these questions:

1. Can you tolerate any downtime?

  • No → Blue/Green or Canary
  • Yes → Rolling is fine

2. How many replicas?

  • 1 → Blue/Green (can't do rolling)
  • 2-3 → Canary or Rolling
  • 5+ → Any pattern works

3. What's your budget?

  • Tight → Rolling (free)
  • Normal → Canary (minimal extra)
  • Generous → Blue/Green (2x cost)

4. How critical is the service?

  • Critical (payments, auth) → Blue/Green
  • Important → Canary
  • Normal → Rolling

5. What's your traffic volume?

  • High (10k+ req/s) → Canary
  • Medium → Any
  • Low → Rolling

Quick decision guide:

┌─ Can tolerate downtime?
│  ├─ No → Blue/Green
│  └─ Yes
│     ├─ Multiple replicas → Rolling
│     └─ Single replica → Blue/Green
│
└─ High traffic + continuous deploy? → Canary
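The decision guide can also be encoded as a function. A sketch under obvious simplifications: real decisions weigh more than three inputs, but this captures the priority order of the guide above.

```python
# The quick decision guide, encoded. Inputs are deliberately coarse:
# your real criteria (budget, criticality, traffic) will be richer.
def choose_pattern(tolerates_downtime, replicas, high_traffic_cd=False):
    """Pick a deployment pattern from the decision guide."""
    if high_traffic_cd:
        return "canary"       # high traffic + continuous deploy
    if not tolerates_downtime or replicas < 2:
        return "blue/green"   # no downtime allowed, or single replica
    return "rolling"          # multiple replicas, downtime tolerable

print(choose_pattern(False, 10))                       # blue/green
print(choose_pattern(True, 1))                         # blue/green
print(choose_pattern(True, 10))                        # rolling
print(choose_pattern(True, 10, high_traffic_cd=True))  # canary
```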

Multi-Region & Disaster Recovery

Pattern: Active-Active Multi-Region

Real-world example: Netflix

Netflix runs active-active across three AWS regions (US-East, US-West, EU). Each region serves live traffic; if one fails, the others absorb its load.

Sruja model:

import { * } from 'sruja.ai/stdlib'

Netflix = system "Netflix Platform" {
  API = container "Streaming API"
}

deployment Global "Global Deployment" {
  node AWS "AWS Global" {
    node USEast "US-East-1" {
      status "active"
      traffic 50  // 50% of global traffic
      
      containerInstance Netflix.API {
        replicas 100
        region "us-east-1"
      }
    }
    
    node USWest "US-West-2" {
      status "active"
      traffic 30  // 30% of global traffic
      
      containerInstance Netflix.API {
        replicas 60
        region "us-west-2"
      }
    }
    
    node EU "EU-West-1" {
      status "active"
      traffic 20  // 20% of global traffic
      
      containerInstance Netflix.API {
        replicas 40
        region "eu-west-1"
      }
    }
  }
}

view index {
  include *
}
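The traffic and replica figures in a model like this are easy to get wrong as regions are added and resized. A quick sanity check, using the numbers from the example above: shares must sum to 100, and replica counts should track the share each region carries.

```python
# Sanity-check active-active weights: (traffic %, replicas) per region,
# taken from the Netflix-style example above.
regions = {
    "us-east-1": (50, 100),
    "us-west-2": (30, 60),
    "eu-west-1": (20, 40),
}

total_traffic = sum(t for t, _ in regions.values())
assert total_traffic == 100, f"traffic shares sum to {total_traffic}, not 100"

for name, (traffic, replicas) in regions.items():
    # Each region here runs 2 replicas per percent of global traffic.
    print(f"{name}: {replicas} replicas for {traffic}% of traffic "
          f"({replicas / traffic:.1f} replicas per %)")
```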

Cost: ~3x a single-region footprint (though all capacity serves live traffic, so you're paying for capacity you actually use)

When to use:

  • ✅ Global user base
  • ✅ 99.99%+ availability requirement
  • ✅ Latency matters (users need local region)
  • ✅ Budget allows

Pattern: Active-Passive (Failover)

Real-world example: Most SaaS companies

Run the primary region active. Keep a secondary region on standby at minimal capacity. Fail over when the primary fails.

Cost: ~1.2x infrastructure (the secondary runs at minimal capacity)

When to use:

  • ✅ Regional user base
  • ✅ Can tolerate 5-15 minute outage
  • ✅ Budget-conscious

CI/CD: Making Deployment Boring

The best deployment is a boring deployment. Routine. Uneventful.

Real-world example: Etsy

Etsy deploys 50+ times per day. Their deployment process is so reliable it's boring. That's the goal.

Modeling Your Pipeline

import { * } from 'sruja.ai/stdlib'

CICD = system "CI/CD Pipeline" {
  GitHub = container "GitHub" {
    description "Code repository, triggers pipeline on push"
  }
  
  Build = container "Build Service" {
    technology "GitHub Actions"
    description "Builds Docker images, runs unit tests"
  }
  
  Test = container "Test Runner" {
    description "Integration tests, E2E tests"
  }
  
  Staging = container "Staging Deploy" {
    description "Deploys to staging environment"
  }
  
  Production = container "Production Deploy" {
    technology "ArgoCD"
    description "GitOps deployment to production"
  }
  
  // Pipeline flow
  GitHub -> Build "Push triggers build"
  Build -> Test "If build succeeds"
  Test -> Staging "If tests pass"
  Staging -> Production "After manual approval"
}

ECommerce = system "E-Commerce Platform" {
  API = container "API Service"
}

// Link CI/CD to your services
CICD.Production -> ECommerce.API "Deploys"

view index {
  include *
}

Best practices:

  1. Automate everything - Manual steps cause errors
  2. Fast feedback - Developers should know in < 10 minutes
  3. Immutable artifacts - Same artifact through all environments
  4. Rollback automation - One button, instant rollback
  5. Observability - Every deploy tracked, monitored

Service Level Objectives (SLOs)

Real-world example: Google

Google popularized SLOs. Every service has defined reliability targets. If you're within SLO, you can deploy. If not, freeze.

Modeling SLOs

import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce Platform" {
  API = container "API Service" {
    technology "Rust"
    
    slo {
      availability {
        target "99.9%"  // 8.76 hours downtime/year
        window "30 days"
        current "99.95%"
      }
      
      latency {
        p95 "200ms"
        p99 "500ms"
        window "7 days"
        current {
          p95 "180ms"
          p99 "420ms"
        }
      }
      
      error_rate {
        target "< 0.1%"
        window "30 days"
        current "0.05%"
      }
    }
  }
  
  Database = database "PostgreSQL" {
    technology "PostgreSQL"
    
    slo {
      availability {
        target "99.99%"  // 52 minutes downtime/year
        window "365 days"
      }
      
      latency {
        p95 "50ms"
        p99 "100ms"
      }
    }
  }
}

view index {
  include *
}

Why model SLOs:

  • Clear expectations (what does "reliable" mean?)
  • Deployment gates (only deploy if SLO allows)
  • Stakeholder communication (SLAs become commitments)
  • Living documentation (SLOs evolve with architecture)
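The availability targets in the model translate directly into downtime budgets (the figures in the comments above: 99.9% ≈ 8.76 hours/year, 99.99% ≈ 52 minutes/year). A quick calculator:

```python
# Convert an availability target into an allowed-downtime budget.
def downtime_budget_minutes(target_pct, window_days):
    """Minutes of downtime allowed for a target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - target_pct / 100)

print(round(downtime_budget_minutes(99.9, 30), 1))        # 43.2 min / 30 days
print(round(downtime_budget_minutes(99.9, 365) / 60, 2))  # 8.76 h / year
print(round(downtime_budget_minutes(99.99, 365), 1))      # 52.6 min / year
```

This budget is what makes SLO-based deployment gates concrete: if the window's remaining budget is spent, you freeze deploys until it recovers.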

Observability: The Three Pillars

Real-world example: Stripe

Stripe's observability is legendary. They can diagnose almost any issue in minutes because they have complete visibility.

The Three Pillars

1. Metrics (Prometheus, Datadog)

  • What's happening? (counts, rates, percentiles)
  • Example: "API latency p95 is 200ms"

2. Logs (ELK, Splunk)

  • What happened? (events, errors, debug info)
  • Example: "Payment failed: card declined"

3. Traces (Jaeger, Zipkin)

  • Where did it happen? (request flow across services)
  • Example: "Request took 300ms: 150ms in DB, 100ms in API, 50ms in network"

Modeling Observability

import { * } from 'sruja.ai/stdlib'

Observability = system "Observability Stack" {
  Metrics = container "Prometheus" {
    description "Time-series metrics from all services"
  }
  
  Dashboards = container "Grafana" {
    description "Visualize metrics and SLOs"
  }
  
  Logs = container "ELK Stack" {
    description "Centralized logging"
  }
  
  Traces = container "Jaeger" {
    description "Distributed tracing"
  }
  
  Alerts = container "PagerDuty" {
    description "Alert routing and on-call"
  }
}

ECommerce = system "E-Commerce Platform" {
  API = container "API Service" {
    description "Instrumented with metrics, logs, and traces"
  }
}

// Observability relationships
ECommerce.API -> Observability.Metrics "Exposes metrics on /metrics"
ECommerce.API -> Observability.Logs "Sends logs via Fluentd"
ECommerce.API -> Observability.Traces "Sends spans via Jaeger client"
Observability.Metrics -> Observability.Dashboards "Feeds dashboards"
Observability.Metrics -> Observability.Alerts "Triggers alerts"

view index {
  include *
}

Common Deployment Mistakes

Mistake #1: Deploying on Friday

What happens: You deploy at 5 PM Friday. Something breaks. Now you're debugging while everyone else is at happy hour.

Why it fails:

  • Less support available
  • Tired team
  • Ruined weekend
  • Desperate decisions

The fix: Deploy Tuesday-Thursday, morning only. Leave Friday for emergencies only.

Mistake #2: No Rollback Plan

What happens: Deployment fails. You have no way to revert. You're fixing forward under pressure.

Why it fails:

  • Fixing forward takes longer
  • Mistakes under pressure
  • Extended outage

The fix: Every deployment has a tested rollback procedure. Blue/Green makes this easy.

Mistake #3: Database Migrations in the Deployment

What happens: You deploy code AND migrate database in one step. Migration locks table. Everything hangs.

Why it fails:

  • Can't rollback easily
  • Locks cause timeouts
  • Tight coupling

The fix:

  1. Migrate database separately (backward compatible)
  2. Deploy code (works with old and new schema)
  3. Verify
  4. Remove backward compatibility
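Step 2 is the subtle part: during the transition window, the deployed code must work with both schemas. A minimal sketch, using a hypothetical column rename (`email` → `contact_email`) for illustration:

```python
# Step 2 of the expand/contract sequence: application code that reads
# both the old schema ("email") and the new one ("contact_email").
# The column names are hypothetical, purely for illustration.
def get_email(user_row):
    """Prefer the new column, fall back to the old one."""
    return user_row.get("contact_email") or user_row.get("email")

old_row = {"email": "a@example.com"}          # pre-migration row
new_row = {"contact_email": "a@example.com"}  # post-migration row
print(get_email(old_row))  # a@example.com
print(get_email(new_row))  # a@example.com
```

Once every row is migrated and verified (step 3), the fallback branch is deleted (step 4), and only then can the old column be dropped.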

Mistake #4: Deploying All Services at Once

What happens: You deploy 10 services simultaneously. Something breaks. Which service caused it?

Why it fails:

  • Hard to isolate issues
  • Blast radius maximized
  • Debugging nightmare

The fix: Deploy one service at a time. Monitor. Repeat.

Mistake #5: Insufficient Capacity for Deployment

What happens: Rolling deployment starts. Old instances terminate. New instances not ready. Traffic spikes. Cascading failure.

Why it fails:

  • Running at capacity limit
  • No buffer for deployment
  • Resource exhaustion

The fix: Always have 30-50% headroom. Scale up before deploying.

Mistake #6: No Observability During Deployment

What happens: You deploy. Something breaks. But you don't know because alerts aren't configured.

Why it fails:

  • Blind deployment
  • Late detection
  • Longer MTTR

The fix: Every deployment has dashboard open, alerts verified, team watching.

Deployment Checklist

Before every production deployment:

Pre-Deployment:

  • Code reviewed and approved
  • Tests passing (unit, integration, E2E)
  • Deployed to staging and verified
  • Rollback procedure documented and tested
  • Capacity verified (30%+ headroom)
  • Observability dashboards open
  • Team notified (Slack, email)
  • Not Friday afternoon

During Deployment:

  • Deploy to canary/staging first
  • Monitor metrics (latency, errors, throughput)
  • Check business metrics (signups, orders)
  • Verify health checks passing
  • Review logs for errors
  • Gradually increase traffic

Post-Deployment:

  • Verify all services healthy
  • Check SLOs are met
  • Monitor for 30-60 minutes
  • Update changelog
  • Close deployment ticket
  • Celebrate (small wins matter)

If Something Goes Wrong:

  • Don't panic
  • Rollback immediately (don't try to fix forward first)
  • Communicate to stakeholders
  • Document what happened
  • Post-mortem within 48 hours

Complete Example: E-Commerce at Scale

Let me show you a complete deployment architecture for a growing e-commerce platform:

import { * } from 'sruja.ai/stdlib'

// Logical Architecture
ECommerce = system "E-Commerce Platform" {
  WebApp = container "Web Application" {
    technology "React"
    description "Customer-facing storefront"
  }
  
  API = container "API Service" {
    technology "Rust"
    description "Core business logic"
  }
  
  Database = database "PostgreSQL" {
    technology "PostgreSQL"
    description "Primary data store"
  }
  
  Cache = database "Redis" {
    technology "Redis"
    description "Session and query cache"
  }
}

// CI/CD Pipeline
CICD = system "CI/CD Pipeline" {
  GitHub = container "GitHub"
  Build = container "Build Service"
  Deploy = container "Deploy Service"
  
  GitHub -> Build "Push triggers build"
  Build -> Deploy "Deploy if tests pass"
}

// Observability Stack
Observability = system "Observability" {
  Metrics = container "Prometheus"
  Logs = container "ELK Stack"
  Traces = container "Jaeger"
}

// Production Deployment
deployment Production "Production Environment" {
  node AWS "AWS Cloud" {
    // Primary Region
    node USEast1 "US-East-1 (Primary)" {
      node EKS "EKS Cluster" {
        containerInstance ECommerce.API {
          replicas 10
          min_replicas 5
          max_replicas 50
          
          deployment_strategy "canary"
          canary_percentage 5
          
          slo {
            availability {
              target "99.9%"
            }
            latency {
              p95 "200ms"
              p99 "500ms"
            }
          }
        }
        
        containerInstance ECommerce.WebApp {
          replicas 5
          cdn "CloudFront"
        }
      }
      
      node RDS "RDS PostgreSQL" {
        containerInstance ECommerce.Database {
          instance "db.r5.xlarge"
          multi_az true
          backup_retention "7 days"
        }
      }
      
      node ElastiCache "ElastiCache Redis" {
        containerInstance ECommerce.Cache {
          node_type "cache.r5.large"
          replicas 2
        }
      }
    }
    
    // DR Region
    node USWest2 "US-West-2 (DR)" {
      status "standby"
      
      node EKS "EKS Cluster" {
        containerInstance ECommerce.API {
          replicas 2
          traffic 0  // Standby
        }
      }
      
      node RDS "RDS Read Replica" {
        containerInstance ECommerce.Database {
          role "read-replica"
        }
      }
    }
  }
}

// Link observability
ECommerce.API -> Observability.Metrics "Exposes metrics"
ECommerce.API -> Observability.Logs "Sends logs"
ECommerce.API -> Observability.Traces "Sends traces"

view index {
  include *
}

What to Remember

  1. Logical ≠ Physical - Model what (services) separately from where (infrastructure)

  2. Deployment strategy matters - Blue/Green for critical, Canary for scale, Rolling for efficiency

  3. Never deploy without rollback - If you can't revert in 30 seconds, you're not ready

  4. Observe everything - Metrics, logs, traces for every service

  5. SLOs define reliability - Clear targets, measured continuously

  6. Automate deployment - Manual steps cause errors

  7. Deploy early in the week - Tuesday-Thursday morning, never Friday

  8. Test deployment procedures - Rollback isn't real until you've tested it

  9. Capacity matters - Always have 30-50% headroom

  10. Make deployment boring - The best deployment is uneventful

When to Start Modeling Deployment

You don't need deployment models on day one. Here's when to start:

Phase 1: Prototype (Skip deployment modeling)

  • Focus on logical architecture
  • Deploy manually
  • Learn what works

Phase 2: MVP (Start documenting)

  • Basic deployment diagram
  • Document where things run
  • Simple CI/CD

Phase 3: Production (Model thoroughly)

  • Full deployment architecture
  • SLOs defined
  • Multiple regions
  • Disaster recovery

Phase 4: Scale (Live in deployment models)

  • Multi-region active-active
  • Chaos engineering
  • Advanced deployment patterns

Practical Exercise

Design deployment architecture for a real or hypothetical system:

Step 1: Choose Your System

  • Something you work on, or
  • Hypothetical: "SaaS platform, 100K users, US + EU"

Step 2: Choose Deployment Strategy

  • Based on requirements and constraints
  • Justify your choice

Step 3: Model Logical Architecture

  • Services, databases, caches
  • Technology choices

Step 4: Model Physical Architecture

  • Cloud provider(s)
  • Regions
  • Instance types and counts

Step 5: Add Observability

  • Metrics, logs, traces
  • SLOs for critical services

Step 6: Define CI/CD Pipeline

  • Build, test, deploy stages
  • Rollback procedures

Time: 30-45 minutes


Next up: Lesson 3 explores Governance as Code - how to turn architectural policies into automated, enforceable rules.

Lesson 3: Governance as Code

The SOC 2 auditor asked a simple question: "Can you show me all databases that store customer PII and confirm they're encrypted?"

I froze.

We had 47 databases across 12 services. I had no idea which ones stored PII. I had no idea which ones were encrypted. I had no idea which ones were even in scope for the audit.

It took us three weeks to manually audit every database. We found three unencrypted databases with customer data. We failed the audit. The company lost a $2M contract that required SOC 2 compliance.

The worst part? We'd passed the audit six months earlier. But in those six months, developers had added new databases. No one checked if they were encrypted. No one even knew they existed.

That's when I learned: manual governance doesn't scale. If your governance depends on people remembering rules, you've already failed.

This lesson is about Governance as Code: treating architectural policies as executable code that validates your architecture automatically.

What is Governance as Code?

Governance as Code means expressing architectural policies as machine-readable rules that can be validated automatically:

  • "All databases must be encrypted" → Validator checks encryption tags
  • "No circular dependencies" → Validator checks dependency graph
  • "All services must have SLOs" → Validator checks for SLO definitions
  • "No public APIs without authentication" → Validator checks auth requirements

Without Governance as Code:

  • Policies exist in wikis and documents
  • Compliance depends on code reviews
  • Violations found late (or never)
  • Audits require manual inspection
  • Inconsistent enforcement

With Governance as Code:

  • Policies are executable code
  • Validation runs in CI/CD
  • Violations caught immediately
  • Audits are automated
  • Consistent, reliable enforcement
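To make "policies are executable code" concrete, here is a minimal required-tags validator. The element shape is hypothetical and purely illustrative; Sruja's real validator works on parsed `.sruja` models, but the principle is the same.

```python
# Minimal governance check: every database element must carry the
# "encrypted" tag. The dict shape here is a hypothetical stand-in
# for a parsed architecture model.
def check_encryption(elements):
    """Return the IDs of databases violating the encryption policy."""
    return [e["id"] for e in elements
            if e["type"] == "database"
            and "encrypted" not in e.get("tags", [])]

elements = [
    {"id": "SecureDB", "type": "database", "tags": ["encrypted"]},
    {"id": "InsecureDB", "type": "database", "tags": []},
    {"id": "API", "type": "container", "tags": []},
]
print(check_encryption(elements))  # ['InsecureDB']
```

Run a check like this in CI and the unencrypted database is caught at review time, not at audit time.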

The Three Types of Governance

Type 1: Guardrails (Prevent Bad Things)

Purpose: Stop dangerous or non-compliant architecture choices.

Examples:

  • Databases storing PII must be encrypted
  • No public endpoints without authentication
  • No single points of failure
  • No databases in unauthorized regions

Real-world example: Netflix

Netflix has a guardrail: "No service can depend on a single availability zone." Their validation tool checks every service's deployment configuration. If it's single-AZ, the build fails. This guardrail has prevented dozens of potential outages.

Sruja example:

import { * } from 'sruja.ai/stdlib'

// Define the policy
policy EncryptionPolicy "All databases must be encrypted" {
  category "security"
  enforcement "required"
  
  rule {
    element_type "database"
    required_tags ["encrypted"]
    error "Database {element} missing 'encrypted' tag. All databases must be encrypted at rest."
  }
}

// Apply to your architecture
ECommerce = system "E-Commerce" {
  // This will PASS validation
  SecureDB = database "Customer Database" {
    technology "PostgreSQL"
    tags ["encrypted", "pci-compliant"]
  }
  
  // This will FAIL validation
  InsecureDB = database "Analytics Database" {
    technology "MySQL"
    // Missing "encrypted" tag - violation!
  }
}

view index {
  include *
}

Type 2: Standards (Enforce Consistency)

Purpose: Ensure architectural consistency across teams.

Examples:

  • All services must use the same logging format
  • All APIs must follow REST naming conventions
  • All services must have health check endpoints
  • All databases must have backup policies

Real-world example: Google

Google has thousands of services, but they all follow the same API design guidelines. Why? Because they have automated validators that check every API against their standards. Inconsistent APIs fail the build.

Sruja example:

import { * } from 'sruja.ai/stdlib'

policy LoggingStandard "Services must have structured logging" {
  category "operations"
  enforcement "required"
  
  rule {
    element_type "container"
    required_tags ["structured-logging"]
    error "Service {element} must implement structured logging per company standard."
  }
}

policy SLOStandard "Services must have SLOs defined" {
  category "reliability"
  enforcement "required"
  
  rule {
    element_type "container"
    requires_slo true
    error "Service {element} missing SLO definitions. All production services must have SLOs."
  }
}

ECommerce = system "E-Commerce" {
  API = container "API Service" {
    technology "Rust"
    tags ["structured-logging"]
    
    slo {
      availability {
        target "99.9%"
        window "30 days"
      }
      latency {
        p95 "200ms"
        p99 "500ms"
      }
    }
  }
}

view index {
  include *
}

Type 3: Best Practices (Codified Wisdom)

Purpose: Encode architectural lessons learned.

Examples:

  • Services with > 10 dependencies should be split
  • Databases accessed by > 5 services need a cache layer
  • Services handling payments need circuit breakers
  • Critical services need multi-region deployment

Real-world example: Amazon

Amazon learned the hard way that services with too many dependencies become bottlenecks. They codified this lesson: "If a service has more than 20 dependencies, architecture review required." This rule is enforced automatically.

Sruja example:

import { * } from 'sruja.ai/stdlib'

policy DependencyLimit "Services should not have too many dependencies" {
  category "architecture"
  enforcement "warning"
  
  rule {
    element_type "container"
    max_incoming_relations 10
    warning "Service {element} has {count} incoming dependencies. Consider splitting if > 10."
  }
  
  rule {
    element_type "container"
    max_outgoing_relations 15
    error "Service {element} has {count} outgoing dependencies. Split required if > 15."
  }
}

ECommerce = system "E-Commerce" {
  API = container "API Service" {
    technology "Rust"
  }
  
  // Imagine 20 services all calling API
  Service1 = container "Service 1"
  Service2 = container "Service 2"
  // ... and 18 more
  
  Service1 -> API "Calls"
  Service2 -> API "Calls"
  
  // This would trigger the dependency limit warning
}

view index {
  include *
}

Real-World Governance Stories

Netflix: Resilience Governance

The problem: Netflix had services that weren't resilient. They'd fail when dependencies failed.

The solution: Governance rules requiring:

  • Every service must have fallback behavior
  • Every external call must have a timeout
  • Critical services must have circuit breakers

The enforcement: Their Chaos Monkey tool tests these rules in production. If a service can't handle failure, Chaos Monkey finds out. Publicly.

The result: 99.99% availability, even with thousands of service failures per day.

Amazon: Team Size Governance

The problem: Large teams move slowly and create coordination overhead.

The solution: "Two-pizza team" rule - teams should be small enough to be fed by two pizzas (6-10 people).

The enforcement: Each service has a defined owner. If the team grows too large, governance tools flag it. Architecture review required.

The result: Faster decisions, clearer ownership, decentralized architecture.

Google: API Standards Governance

The problem: Inconsistent APIs made integration difficult. Every team invented their own patterns.

The solution: Google API Design Guide - comprehensive standards for all APIs.

The enforcement: Automated linters check every API definition. Non-compliant APIs fail CI builds. No exceptions.

The result: Consistent developer experience across thousands of APIs.

Stripe: Security Governance

The problem: Handling payments requires strict security. Manual security reviews don't scale.

The solution: Codified security policies:

  • All PII must be encrypted at rest
  • All APIs must use TLS 1.3+
  • All databases must have audit logs
  • All services must have vulnerability scanning

The enforcement: Automated security scanners check every deployment. Violations block production.

The result: PCI-DSS compliance maintained across thousands of changes per day.

Common Governance Rules

Here are the governance rules I see most often in production systems:

Security Rules

// Rule 1: All databases encrypted
policy EncryptionPolicy "All databases must be encrypted" {
  rule {
    element_type "database"
    required_tags ["encrypted"]
  }
}

// Rule 2: No sensitive data in caches
policy CacheDataPolicy "No PII in cache layers" {
  rule {
    element_type "database"
    tag "cache"
    forbidden_tags ["pii", "sensitive"]
  }
}

// Rule 3: All external APIs authenticated
policy APIAuthPolicy "External APIs must require authentication" {
  rule {
    element_type "container"
    tag "public-api"
    required_tags ["authentication"]
  }
}

// Rule 4: No databases in unauthorized regions
policy DataResidencyPolicy "Data must stay in approved regions" {
  rule {
    element_type "database"
    tag "pii"
    allowed_regions ["us-east-1", "eu-west-1"]
  }
}

Architecture Rules

// Rule 5: No circular dependencies
policy NoCircularDeps "Services cannot have circular dependencies" {
  rule {
    check_circular_dependencies true
    error "Circular dependency detected between {source} and {target}"
  }
}

// Rule 6: Services must have owners
policy OwnershipPolicy "All services must have defined owners" {
  rule {
    element_type "container"
    required_metadata ["owner", "team"]
  }
}

// Rule 7: No single points of failure
policy RedundancyPolicy "Critical services must be redundant" {
  rule {
    element_type "container"
    tag "critical"
    requires_scale true
    min_replicas 3
  }
}

// Rule 8: Layer violations prohibited
policy LayerPolicy "Respect architectural layers" {
  rule {
    element_type "container"
    tag "presentation"
    cannot_depend_on "datastore"
  }
}
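The `NoCircularDeps` rule above (Rule 5) boils down to cycle detection on the dependency graph. A sketch of how such a check can work, using a standard depth-first search for back edges:

```python
# Cycle detection for a "no circular dependencies" rule: a GRAY node
# reached again during DFS means a back edge, i.e. a cycle.
def has_cycle(deps):
    """deps: {service: [services it calls]}. True if any cycle exists."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in deps}

    def visit(node):
        color[node] = GRAY
        for nxt in deps.get(node, []):
            if color.get(nxt, WHITE) == GRAY:
                return True  # back edge: cycle found
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in deps)

print(has_cycle({"A": ["B"], "B": ["C"], "C": []}))     # False
print(has_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))  # True
```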

Operations Rules

// Rule 9: All services must have SLOs
policy SLOPolicy "Services must have SLOs defined" {
  rule {
    element_type "container"
    requires_slo true
  }
}

// Rule 10: All services must be monitored
policy MonitoringPolicy "Services must be monitored" {
  rule {
    element_type "container"
    required_tags ["monitored"]
  }
}

// Rule 11: All databases must have backups
policy BackupPolicy "Databases must have backup policies" {
  rule {
    element_type "database"
    required_tags ["backed-up"]
    required_metadata ["backup_frequency", "backup_retention"]
  }
}

// Rule 12: All services must have health checks
policy HealthCheckPolicy "Services must implement health checks" {
  rule {
    element_type "container"
    required_tags ["health-check"]
  }
}

Compliance Rules

// Rule 13: PII handling requirements
policy PIIHandlingPolicy "PII must be handled correctly" {
  rule {
    element_type "database"
    tag "pii"
    required_tags ["encrypted", "audit-logged", "access-controlled"]
  }
}

// Rule 14: Payment data requirements
policy PCICompliancePolicy "Payment data must be PCI compliant" {
  rule {
    element_type "container"
    tag "payment-processing"
    required_tags ["pci-compliant", "pci-audited"]
  }
}

// Rule 15: Data retention requirements
policy DataRetentionPolicy "Data must have retention policies" {
  rule {
    element_type "database"
    required_metadata ["retention_period", "deletion_policy"]
  }
}

CI/CD Integration

Governance only works if it's enforced. Here's how to integrate with your pipeline:

Stage 1: Pre-Commit Hooks

What: Validate architecture changes before they're committed.

#!/bin/bash
# .git/hooks/pre-commit
if ! sruja validate architecture.sruja; then
  echo "Architecture validation failed. Fix violations before committing."
  exit 1
fi

Catches: Basic violations early, before code review.

Stage 2: Pull Request Validation

What: Validate architecture in CI when PRs are created.

# .github/workflows/architecture-validation.yml
name: Architecture Validation

on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      
      - name: Validate Architecture
        run: |
          sruja validate architecture.sruja --strict
          
      - name: Check Compliance
        run: |
          sruja compliance-check --policies ./policies/

Catches: All violations before merge.

Stage 3: Deployment Gates

What: Validate architecture before production deployment.

# deployment-pipeline.yml
stages:
  - name: validate-architecture
    steps:
      - sruja validate architecture.sruja
      - sruja compliance-check --policies ./policies/production/
      
  - name: deploy-production
    needs: validate-architecture
    if: success()
    steps:
      - deploy-to-production

Catches: Production-specific violations (security, compliance).

Stage 4: Continuous Monitoring

What: Validate running infrastructure matches architecture.

# Run continuously (e.g., every hour)
sruja drift-detect --architecture architecture.sruja --live-infrastructure

Catches: Configuration drift, manual changes, unapproved modifications.

Governance Maturity Model

Where is your organization on the governance journey?

Level 0: No Governance

What it looks like:

  • No documented policies
  • Decisions made ad-hoc
  • Compliance discovered during audits
  • Inconsistent architecture

Real-world example: Early-stage startups

The problem: You'll fail audits eventually. But you're probably too small to care yet.

When it's okay: < 10 engineers, pre-revenue, learning phase.

Level 1: Manual Governance

What it looks like:

  • Policies documented in wikis
  • Architecture reviews are manual
  • Compliance checks during audits
  • Some consistency through code review

Real-world example: Growing companies (50-200 engineers)

The problem: Doesn't scale. Policies become outdated. Reviews are inconsistent. Violations slip through.

How to improve: Start automating the most critical checks.

Level 2: Automated Checks

What it looks like:

  • Key policies automated
  • CI/CD validation runs automatically
  • Violations caught early
  • Consistent enforcement

Real-world example: Mature companies (200-1000 engineers)

The benefit: Scales with team size. Consistent enforcement. Early violation detection.

How to improve: Expand coverage, add more policies.

Level 3: Continuous Enforcement

What it looks like:

  • Most policies automated
  • Real-time validation
  • Drift detection
  • Self-documenting compliance

Real-world example: Tech giants (Google, Netflix, Amazon)

The benefit: Compliance is continuous, not periodic. Audits are easy. Architecture stays healthy.

How to improve: Fine-tune policies, reduce false positives.

Level 4: Self-Service with Guardrails

What it looks like:

  • Developers can deploy freely
  • Guardrails prevent bad choices
  • Compliance is transparent
  • Architecture evolves safely

Real-world example: Very few companies (Spotify, Netflix)

The benefit: Fast development, safe architecture. Best of both worlds.

The goal: This is where you want to be.

Common Governance Mistakes

Mistake #1: Governance Theater

What happens: You have lots of policies, but they're not enforced.

Example:

  • Wiki says "All databases must be encrypted"
  • But no automated checks
  • Some databases encrypted, some not
  • Audit fails

The fix: If a policy isn't enforced, delete it or enforce it.

Mistake #2: Too Many Rules

What happens: You create policies for everything.

Example:

  • 200 governance rules
  • Developers need exceptions for 50% of changes
  • Governance becomes a bottleneck
  • People work around it

The fix: Start with 5-10 critical policies. Add more only when needed.

Mistake #3: Rules Without Context

What happens: Policies exist but no one knows why.

Example:

// BAD: No explanation
policy Rule42 "Services must have tag X" {
  // Why? What's the purpose?
}

The fix: Every policy should explain:

  • Why it exists
  • What problem it solves
  • When it applies
  • How to comply

// GOOD: Clear context
policy EncryptionPolicy "All databases must be encrypted" {
  category "security"
  
  description "
    Unencrypted databases expose customer data if compromised.
    Required for SOC 2, PCI-DSS, GDPR compliance.
    Applies to all databases storing production data.
    Encrypt using AWS KMS or equivalent.
  "
  
  rule {
    element_type "database"
    required_tags ["encrypted"]
  }
}

Mistake #4: No Exceptions Process

What happens: Rules are rigid with no way to handle edge cases.

Example:

  • Legacy system can't comply with new encryption rule
  • No way to get exception
  • System remains non-compliant forever
  • Governance loses credibility

The fix: Create an exceptions process:

  1. Document why exception is needed
  2. Define mitigation plan
  3. Set expiration date
  4. Require approval
  5. Review periodically
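The five steps map naturally onto a structured exception record. A Python sketch (the field names are hypothetical, not part of any Sruja schema) shows how an expiration date makes the periodic-review step enforceable:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical exception record covering the five steps above;
# the shape is illustrative, not part of any Sruja schema.
@dataclass
class PolicyException:
    policy: str
    reason: str        # 1. document why the exception is needed
    mitigation: str    # 2. plan that reduces the risk meanwhile
    expires: date      # 3. every exception has an expiration date
    approved_by: str   # 4. explicit approval on record

    def is_active(self, today: date) -> bool:
        # 5. periodic review: once expired, it counts as a violation again
        return today <= self.expires

exc = PolicyException(
    policy="EncryptionPolicy",
    reason="Legacy billing DB cannot enable KMS without downtime",
    mitigation="Network isolation plus quarterly access audit",
    expires=date(2025, 6, 30),
    approved_by="security-lead",
)
print(exc.is_active(date(2025, 1, 1)))   # True
print(exc.is_active(date(2025, 7, 1)))   # False -- expired, re-review required
```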

Mistake #5: One-Size-Fits-All Rules

What happens: Same rules applied to all systems regardless of context.

Example:

  • "All services must have 99.99% availability"
  • Even internal admin tools
  • Over-engineering everywhere
  • Wasted resources

The fix: Tiered policies based on criticality:

  • Critical services: Strict rules
  • Standard services: Moderate rules
  • Internal tools: Basic rules

Mistake #6: Governance as Afterthought

What happens: Architecture is designed first, governance added later.

Example:

  • Build the system
  • Try to add governance
  • Discover fundamental violations
  • Expensive refactoring

The fix: Governance from the start. Define policies first, design architecture to comply.

The GOVERN Framework

When implementing Governance as Code, use this framework:

G - Identify Goals

  • What are you trying to achieve?
  • What problems are you solving?
  • What's the risk of no governance?

O - Define Outcomes

  • What does compliance look like?
  • How will you measure success?
  • What's acceptable vs. unacceptable?

V - Validate Automatically

  • Which rules can be automated?
  • What checks run in CI/CD?
  • What needs continuous monitoring?

E - Educate Teams

  • Do developers understand the rules?
  • Is documentation clear?
  • How do people learn about violations?

R - Review Regularly

  • Are policies still relevant?
  • Are there too many false positives?
  • What new policies are needed?

N - Nurture Culture

  • Is governance seen as help or hindrance?
  • Do teams buy in?
  • How do you handle exceptions?

What to Remember

  1. Manual governance doesn't scale - If you're relying on people remembering rules, you've already failed

  2. Start with critical policies - Security, compliance, reliability. Add more later.

  3. Automate enforcement - Policies without enforcement are just suggestions

  4. Integrate with CI/CD - Validate early, validate often, validate automatically

  5. Explain the why - Every policy should have clear context and rationale

  6. Allow exceptions - Rigid rules without exceptions create workarounds

  7. Tier your rules - Not all services need the same governance level

  8. Make compliance transparent - Developers should know status without asking

  9. Review and evolve - Governance should improve over time, not stagnate

  10. Governance enables speed - Good governance lets teams move fast safely

When to Start Governance

Phase 1: Prototype (Skip governance)

  • Focus on learning
  • Minimal policies
  • Manual reviews fine

Phase 2: Production (Start governing)

  • Critical security policies
  • Basic compliance checks
  • CI/CD integration

Phase 3: Scale (Govern seriously)

  • Comprehensive policies
  • Continuous enforcement
  • Self-service with guardrails

Phase 4: Enterprise (Govern everything)

  • Full audit automation
  • Real-time compliance
  • Multi-team coordination

Practical Exercise

Implement governance for a real or hypothetical system:

Step 1: Identify Critical Policies

  • What are your top 5 security risks?
  • What compliance requirements exist?
  • What architectural standards matter?

Step 2: Write Policies as Code

  • Express each policy in Sruja
  • Include context and rationale
  • Define clear validation rules

Step 3: Integrate with CI/CD

  • Add validation to pull requests
  • Block non-compliant changes
  • Provide clear error messages

Step 4: Create Compliance Dashboard

  • Show current compliance status
  • Track violations over time
  • Make status visible to all

Step 5: Document Exception Process

  • How to request exceptions
  • Who approves
  • How to track

Time: 60-90 minutes

Complete Example: Production Governance

import { * } from 'sruja.ai/stdlib'

// ============ GOVERNANCE POLICIES ============

// Security policies
policy EncryptionPolicy "All databases must be encrypted" {
  category "security"
  enforcement "required"
  
  description "
    Unencrypted databases expose data if compromised.
    Required for SOC 2, PCI-DSS compliance.
  "
  
  rule {
    element_type "database"
    required_tags ["encrypted"]
    error "Database {element} must be encrypted. Add 'encrypted' tag."
  }
}

policy PIIPolicy "PII data requires special handling" {
  category "security"
  enforcement "required"
  
  rule {
    element_type "database"
    tag "pii"
    required_tags ["encrypted", "audit-logged", "access-controlled"]
    error "PII database {element} missing required controls."
  }
}

// Reliability policies
policy SLOPolicy "Services must have SLOs" {
  category "reliability"
  enforcement "required"
  
  rule {
    element_type "container"
    tag "production"
    requires_slo true
    error "Production service {element} must have SLOs defined."
  }
}

policy RedundancyPolicy "Critical services must be redundant" {
  category "reliability"
  enforcement "required"
  
  rule {
    element_type "container"
    tag "critical"
    requires_scale true
    min_replicas 3
    error "Critical service {element} must have min 3 replicas."
  }
}

// Architecture policies
policy OwnershipPolicy "Services must have owners" {
  category "operations"
  enforcement "required"
  
  rule {
    element_type "container"
    required_metadata ["owner", "team"]
    error "Service {element} missing owner/team metadata."
  }
}

policy NoCircularDeps "No circular dependencies" {
  category "architecture"
  enforcement "required"
  
  rule {
    check_circular_dependencies true
    error "Circular dependency detected. Services must not form dependency cycles."
  }
}

// ============ ARCHITECTURE ============

PaymentService = system "Payment Service" {
  API = container "Payment API" {
    technology "Rust"
    tags ["production", "critical"]
    
    metadata {
      owner "payments-team"
      team "payments@company.com"
    }
    
    slo {
      availability {
        target "99.99%"
        window "30 days"
      }
      latency {
        p95 "100ms"
        p99 "200ms"
      }
    }
  }
  
  DB = database "Payment Database" {
    technology "PostgreSQL"
    tags ["encrypted", "pii", "pci-compliant", "audit-logged", "access-controlled"]
    
    slo {
      availability {
        target "99.99%"
        window "30 days"
      }
    }
  }
}

Auditor = person "Security Auditor"
Auditor -> PaymentService.API "Reviews"
PaymentService.API -> PaymentService.DB "Reads/Writes"

// ============ VIEWS ============

view index {
  title "Payment Service Architecture"
  include *
}

view compliance {
  title "Compliance Status"
  include PaymentService.API PaymentService.DB
  description "Shows compliance with governance policies"
}

view slos {
  title "SLO Dashboard"
  include PaymentService.API PaymentService.DB
  exclude Auditor
}

// ============ VALIDATION ============
// Run: sruja validate architecture.sruja
// This will check all policies and report violations

Congratulations! You've completed Module 4: Production Readiness. You now have all the tools to create production-ready architecture that's documented, deployable, and governed.

Next steps: Apply these lessons to your real systems. Start with one module, get good at it, then expand. Architecture is a practice, not a destination.


Course Complete! 🎉

You've finished the System Design 101 course. You now understand:

  • How to think about systems (Module 1)
  • How to model them effectively (Module 2)
  • Advanced techniques for complex scenarios (Module 3)
  • How to make architecture production-ready (Module 4)

The best architects aren't the ones who know the most patterns. They're the ones who can communicate their decisions clearly, maintain their documentation, and govern their architecture effectively. You now have those skills.

Go build great systems! 🚀


Lesson 4: SLOs & Scale Integration

"We guarantee 99.99% availability."

That's what our sales team promised the enterprise customer. It was in the contract. A $5M contract that would make our quarter.

Six months later, the customer demanded their SLA credits. They'd experienced 14 hours of downtime. We'd promised 99.99% availability (52 minutes of downtime per year), but we'd delivered 99.5%.

The problem? We had no idea we were failing. We weren't measuring availability. We had no SLOs, no monitoring, no alerts. We just had a promise we couldn't keep.

The customer got $500K in credits. The sales team was furious. The engineering team was embarrassed. And I learned a hard lesson: a promise without measurement is just a lie.

That's when I discovered SLOs (Service Level Objectives). Not as a theoretical concept, but as a survival mechanism. This lesson is about how to define meaningful SLOs, measure them rigorously, and align your architecture to actually meet them.

What Are SLOs, Really?

Service Level Objectives (SLOs) are specific, measurable targets for your service's reliability. They answer the question: "What does 'good enough' look like?"

The Three Components

Every SLO has three parts:

  1. The Metric: What are you measuring? (latency, availability, error rate)
  2. The Target: What's the threshold? (99.9%, 200ms, < 0.1%)
  3. The Window: Over what time period? (30 days, 7 days, 24 hours)

Example:

Metric: Availability
Target: 99.9%
Window: 30 days

This means: "Over any 30-day period, our service must be available 99.9% of the time."

Why this matters:

  • Customers know what to expect: Clear reliability commitment
  • Engineering knows what to build: Target to design for
  • Product knows when to freeze features: If SLO is breached, stop shipping
  • Finance knows what it costs: Reliability has a price
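The three components above can be captured in a tiny data structure. A minimal Python sketch (field names are illustrative):

```python
from dataclasses import dataclass

# Minimal sketch of the three SLO components; field names are
# illustrative, not part of any Sruja schema.
@dataclass
class SLO:
    metric: str        # 1. what you measure
    target: float      # 2. the threshold (as a fraction)
    window_days: int   # 3. the time period

    def is_met(self, measured: float) -> bool:
        return measured >= self.target

availability = SLO(metric="availability", target=0.999, window_days=30)
print(availability.is_met(0.9995))  # True: 99.95% clears the 99.9% target
print(availability.is_met(0.9950))  # False
```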

The SLO Hierarchy

SLA > SLO > SLI

SLA (Service Level Agreement): The promise you make to customers. Usually has financial consequences.

Example: "99.9% availability or we'll give you 10% credit on your bill."

SLO (Service Level Objective): The internal target you set for your team. Should be stricter than your SLA.

Example: "We target 99.95% availability internally so we never breach the 99.9% SLA."

SLI (Service Level Indicator): The actual measurement.

Example: "Last month we achieved 99.93% availability."

The relationship:

  • SLI is reality (what you actually delivered)
  • SLO is the goal (what you're aiming for)
  • SLA is the contract (what you promised)

Best practice: Set your SLO higher than your SLA. Give yourself headroom.
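The relationship can be expressed as a quick sanity check (numbers are illustrative):

```python
# The SLA/SLO/SLI relationship in one sketch: the SLI is measured reality,
# the SLO is the internal goal, and the SLA is the external contract.
sla = 0.9990    # promised to customers
slo = 0.9995    # internal target, deliberately stricter (headroom)
sli = 0.9993    # actually delivered last month

assert slo > sla, "keep headroom: SLO stricter than SLA"
print("SLO met:", sli >= slo)      # False: missed the internal goal...
print("SLA breached:", sli < sla)  # False: ...but the customer promise held
```

This is exactly the headroom at work: missing the internal SLO triggers engineering action before the customer-facing SLA is ever breached.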

What Makes a Good SLO?

The SMART Framework for SLOs

S - Specific: Clear metric with no ambiguity

❌ "The system should be fast"
✅ "API latency p95 < 200ms"

M - Measurable: Can be objectively measured automatically

❌ "Users should be happy"
✅ "Error rate < 0.1%"

A - Achievable: Realistic given your current architecture

❌ "100% availability" (impossible)
✅ "99.9% availability" (challenging but achievable)

R - Relevant: Measures something users actually care about

❌ "Server CPU utilization"
✅ "Request latency" (users care about speed)

T - Time-bound: Defined over a specific window

❌ "System is usually available"
✅ "99.9% availability over 30 days"

Types of SLOs

1. Availability SLOs

What it measures: Is the service working?

How to calculate: (Total time - Downtime) / Total time

Example:

slo {
  availability {
    target "99.9%"  // 8.76 hours downtime/year allowed
    window "30 days"
    current "99.95%"
  }
}
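The availability formula is simple enough to verify by hand. A small Python sketch:

```python
# The availability formula above, (total time - downtime) / total time,
# applied to a 30-day window:
def availability(total_minutes: float, downtime_minutes: float) -> float:
    return (total_minutes - downtime_minutes) / total_minutes

total = 30 * 24 * 60                       # 43,200 minutes in 30 days
print(f"{availability(total, 20):.4%}")    # 99.9537% with 20 min of downtime
```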

Real-world example: Netflix

Netflix targets 99.99% availability for their streaming service. That's 52 minutes of downtime per year. How do they achieve it? Chaos engineering. They break things on purpose to ensure resilience.

2. Latency SLOs

What it measures: How fast does the service respond?

How to calculate: Percentiles (p50, p95, p99)

Example:

slo {
  latency {
    p95 "200ms"  // 95% of requests faster than 200ms
    p99 "500ms"  // 99% of requests faster than 500ms
    window "7 days"
    current {
      p95 "180ms"
      p99 "450ms"
    }
  }
}

Real-world example: Amazon

Amazon found that every 100ms of latency cost 1% in sales. They have strict latency SLOs: p99 < 100ms for most services. Their architecture is optimized for speed because speed directly impacts revenue.

Why percentiles matter:

Average latency: "Average latency is 50ms"

  • Problem: Hides outliers. If 5% of requests take 10 seconds, average might still look fine.

Percentile latency: "p95 latency is 200ms"

  • Benefit: 95% of users get < 200ms. You know the worst case most users experience.
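A small Python sketch makes the difference concrete, using nearest-rank percentiles over synthetic latencies:

```python
# Why percentiles beat averages: the mean can look acceptable while a
# small fraction of users suffer. Nearest-rank percentiles expose them.
latencies_ms = [50] * 98 + [10_000] * 2   # 2% of requests take 10 seconds

ordered = sorted(latencies_ms)
mean = sum(ordered) / len(ordered)
p95 = ordered[int(0.95 * len(ordered)) - 1]   # nearest-rank percentile
p99 = ordered[int(0.99 * len(ordered)) - 1]

print(f"mean: {mean:.0f}ms")   # 249ms -- looks tolerable
print(f"p95:  {p95}ms")        # 50ms
print(f"p99:  {p99}ms")        # 10000ms -- the outliers surface here
```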

3. Error Rate SLOs

What it measures: What percentage of requests fail?

How to calculate: Failed requests / Total requests

Example:

slo {
  error_rate {
    target "< 0.1%"  // Fewer than 1 in 1000 requests fail
    window "30 days"
    current "0.05%"
  }
}

Real-world example: Stripe

Stripe processes billions in payments. Their error rate SLO is < 0.01% (1 in 10,000 requests). Every failed payment is lost revenue and frustrated customers. They achieve this through retry logic, circuit breakers, and graceful degradation.

4. Throughput SLOs

What it measures: How many requests can you handle?

How to calculate: Requests per second (req/s)

Example:

slo {
  throughput {
    target "1000 req/s"  // Handle 1000 requests per second
    window "1 hour"
    current "950 req/s"
  }
}

Real-world example: Uber

During New Year's Eve, Uber's throughput spikes 10x. Their throughput SLO ensures they can handle the surge: 100,000 ride requests per second globally. They achieve this through massive auto-scaling and capacity planning.

Error Budgets: The Most Important Concept

What is an error budget?

An error budget is the amount of unreliability you can afford before breaching your SLO. It's the difference between 100% and your SLO target.

Example:

SLO: 99.9% availability over 30 days
Total time in 30 days: 43,200 minutes
Allowed downtime (error budget): 43.2 minutes

If you've had 20 minutes of downtime this month:
Remaining error budget: 23.2 minutes
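The arithmetic above can be wrapped in a small helper. A Python sketch:

```python
# Error budget: the gap between 100% and the SLO target, converted to time.
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes

budget = error_budget_minutes(0.999, 30)   # 43.2 minutes for 99.9% / 30 days
used = 20.0                                # downtime so far this window
print(f"remaining: {budget - used:.1f} minutes")   # remaining: 23.2 minutes
```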

How to Use Error Budgets

Google's approach (popularized in their SRE book):

1. When error budget is HEALTHY (plenty remaining):

  • Take more risks
  • Launch new features faster
  • Reduce operational toil
  • Experiment with architecture changes

2. When error budget is DEPLETED (barely any remaining):

  • Freeze new features
  • Focus on reliability
  • Pay down technical debt
  • Add more tests
  • Improve monitoring

3. When error budget is EXCEEDED (SLO breached):

  • Incident review required
  • Post-mortem mandatory
  • No new features until SLO recovers

Real-world example: Google Search

Google Search has a 99.99% availability SLO. When they're within budget, they push changes aggressively. When budget is tight, they slow down. This balance lets them innovate while staying reliable.

Error Budget Calculator

Monthly Error Budget for 99.9% availability:
- 30 days × 24 hours × 60 minutes = 43,200 minutes
- Allowed downtime: 0.1% × 43,200 = 43.2 minutes/month

Monthly Error Budget for 99.99% availability:
- Allowed downtime: 0.01% × 43,200 = 4.32 minutes/month

Monthly Error Budget for 99.999% availability:
- Allowed downtime: 0.001% × 43,200 = 0.432 minutes/month (26 seconds!)

The lesson: Higher SLOs are exponentially harder and more expensive.

Aligning Scale with SLOs

This is where architecture meets reliability. Your SLOs determine your capacity requirements.

The Capacity-SLO Relationship

Key insight: You need headroom to meet SLOs during traffic spikes.

Example:

Current traffic: 500 req/s
Traffic spikes: Up to 2x (1000 req/s during peak)
SLO target: 1000 req/s throughput
Required capacity: 1000 req/s minimum

But wait - if you're at 100% capacity, you can't handle:
- Unexpected spikes (> 2x)
- Server failures
- Deployment rollouts
- Performance degradation

Best practice: Run at 50-70% capacity. Have 30-50% headroom.

Sruja: Modeling Scale with SLOs

import { * } from 'sruja.ai/stdlib'

ECommerce = system "E-Commerce Platform" {
  API = container "API Service" {
    technology "Rust"
    
    // Define your SLO first
    slo {
      throughput {
        target "1000 req/s"
        window "1 hour"
      }
      
      latency {
        p95 "200ms"
        p99 "500ms"
        window "7 days"
      }
      
      availability {
        target "99.9%"
        window "30 days"
      }
    }
    
    // Then define scale to support the SLO
    scale {
      metric "cpu"
      
      // Baseline capacity (support 1000 req/s at 50% CPU)
      min 5
      
      // Burst capacity (handle 2x spikes)
      max 20
      
      // Auto-scale trigger
      scale_up "cpu > 70%"
      scale_down "cpu < 30%"
    }
  }
}

view index {
  title "Production System with SLO-Aligned Scale"
  include *
}

Key principle: Start with SLO, then design scale. Not the other way around.

The Headroom Calculation

Formula:

Required Capacity = Peak Traffic × Headroom Factor

Where Headroom Factor:
- 1.3x for non-critical services (30% headroom)
- 1.5x for standard services (50% headroom)
- 2.0x for critical services (100% headroom)

Example:

Service: Payment API (critical)
Peak traffic: 500 req/s
Headroom factor: 2.0x
Required capacity: 1000 req/s

If each instance handles 100 req/s:
Min instances: 10
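The headroom formula translates directly into an instance count. A Python sketch using the numbers above:

```python
import math

# Required capacity = peak traffic x headroom factor, translated into
# a minimum instance count.
def min_instances(peak_rps: float, headroom: float, per_instance_rps: float) -> int:
    return math.ceil(peak_rps * headroom / per_instance_rps)

print(min_instances(500, 2.0, 100))   # 10 -- critical service, 100% headroom
print(min_instances(500, 1.3, 100))   # 7  -- non-critical, 30% headroom
```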

Real-World Capacity Planning

Netflix's approach:

Netflix tracks their "efficiency" metric: actual usage vs. provisioned capacity.

  • Too low (< 30%): Wasting money
  • Too high (> 80%): Risk of SLO breach
  • Target (50-70%): Optimal balance

They auto-scale based on this, ensuring they meet SLOs without overspending.

Setting SLOs: The Framework

Step 1: Identify User Journeys

What are the critical paths users take through your system?

Example for e-commerce:

  1. User searches for product
  2. User views product details
  3. User adds to cart
  4. User checks out
  5. User pays

Prioritize: Which journeys matter most? (Checkout and payment are more critical than search)

Step 2: Choose Metrics

For each journey, what metrics matter?

  • Search: Latency (users want fast results)
  • Product view: Availability (must work)
  • Cart: Availability + durability (don't lose items)
  • Checkout: Availability + latency (fast and reliable)
  • Payment: Availability + error rate (must succeed)

Step 3: Measure Current Performance

Before setting targets, measure reality:

Current performance (last 30 days):
- Availability: 99.5%
- Latency p95: 350ms
- Error rate: 0.3%

Step 4: Set Achievable Targets

Based on current performance, set realistic targets:

Current: 99.5% availability
Target: 99.7% availability (improvement, not perfection)
Future: 99.9% availability (long-term goal)

The mistake to avoid: Setting 99.99% SLO when you're at 99%. You'll fail constantly.

Step 5: Create Alerts

Alert when you're burning error budget too fast:

Alert 1: Burn rate > 10x (will exhaust budget in 3 days)
Alert 2: Burn rate > 2x (will exhaust budget in 15 days)
Alert 3: Budget < 20% remaining
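These burn-rate thresholds follow from a simple ratio. A Python sketch using the 99.9%/30-day budget from earlier:

```python
# Burn rate = (fraction of error budget used) / (fraction of window elapsed);
# 1.0x means you exhaust the budget exactly when the window ends.
def burn_rate(used_min: float, budget_min: float,
              elapsed_days: float, window_days: float) -> float:
    return (used_min / budget_min) / (elapsed_days / window_days)

# 43.2-minute budget (99.9% over 30 days); 14.4 minutes burned in one day:
rate = burn_rate(14.4, 43.2, 1, 30)
print(round(rate, 1))   # 10.0 -- on pace to exhaust the budget in 3 days
```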

SLO Anti-Patterns

Anti-Pattern #1: The 100% SLO

What happens: You set 100% availability as your SLO.

Why it fails:

  • Impossible to achieve
  • Paralyzes the team (afraid to make changes)
  • Expensive (massive over-provisioning)
  • Still fails eventually

The fix: 100% is not an SLO, it's a fantasy. Aim for 99.9% or 99.99%.

Anti-Pattern #2: Measuring the Wrong Thing

What happens: You measure CPU utilization instead of user experience.

Why it fails:

  • CPU can be high while users are happy
  • CPU can be low while users are frustrated
  • Doesn't capture actual reliability

The fix: Measure what users experience: latency, availability, errors.

Anti-Pattern #3: Too Many SLOs

What happens: You define 20 different SLOs for one service.

Why it fails:

  • Information overload
  • No clear priorities
  • Alert fatigue
  • Impossible to track

The fix: 3-5 SLOs per service maximum. Focus on what matters most.

Anti-Pattern #4: SLOs in a Drawer

What happens: You define SLOs, document them, then never look at them again.

Why it fails:

  • No feedback loop
  • No behavior change
  • Wasted effort

The fix: SLOs must be:

  • Visible on dashboards
  • Part of deployment decisions
  • Reviewed regularly
  • Updated when needed

Anti-Pattern #5: SLOs Set by Management

What happens: Management dictates 99.99% availability without consulting engineering.

Why it fails:

  • Unrealistic given architecture
  • No buy-in from team
  • Sets up for failure

The fix: Collaborative SLO setting:

  • Management sets business requirements
  • Engineering measures current performance
  • Together, define achievable targets

Anti-Pattern #6: No Error Budget

What happens: You have SLOs but no concept of error budget.

Why it fails:

  • Binary view (pass/fail)
  • No nuance in decision-making
  • Can't balance reliability vs. velocity

The fix: Calculate and track error budgets. Use them to guide decisions.

The SLO Framework: RELIABLE

When implementing SLOs, use this framework:

R - Review User Journeys

  • What do users actually do?
  • What matters most to them?

E - Establish Metrics

  • Availability, latency, errors, throughput
  • What captures user experience?

L - Look at Current Performance

  • Measure before targeting
  • Don't guess, measure

I - Incremental Targets

  • Improve gradually
  • Don't aim for perfection immediately

A - Automate Measurement

  • Continuous monitoring
  • Real-time dashboards

B - Burn Rate Alerts

  • Alert before SLO breach
  • Give time to react

L - Link to Decisions

  • Error budget guides feature velocity
  • SLO status affects deployments

E - Evolve Over Time

  • SLOs aren't static
  • Adjust as you improve

Real-World SLO Examples

Google: The Gold Standard

Google popularized SLOs in their SRE books. Here's how they approach it:

Service: Google Search
Availability SLO: 99.99%
Latency SLO: p99 < 200ms
Error budget policy: When budget exhausted, freeze launches until recovered

How they achieve it:

  • Massive redundancy (multiple data centers)
  • Automatic failover
  • Chaos engineering
  • Rigorous capacity planning

Result: Google Search is down so rarely it makes global news when it happens.

Netflix: Chaos Engineering

Service: Streaming Video
Availability SLO: 99.99%
Throughput SLO: Handle all subscriber streams concurrently

How they achieve it:

  • Chaos Monkey (randomly kills instances)
  • Multi-region active-active
  • Graceful degradation (lower quality vs. failure)

Result: Even when AWS has outages, Netflix stays up.

Amazon: Revenue-Driven SLOs

Service: Product Page
Latency SLO: p99 < 100ms
Why: Every 100ms of latency = 1% sales drop

How they achieve it:

  • Edge caching (CloudFront)
  • Optimized databases (DynamoDB)
  • Microservices (isolated failures)

Result: Fast page loads drive revenue.

Stripe: Payment Reliability

Service: Payment Processing
Availability SLO: 99.99%
Error Rate SLO: < 0.01%
Latency SLO: p95 < 300ms

How they achieve it:

  • Retry logic (failed payments retry automatically)
  • Circuit breakers (fail fast when downstream issues)
  • Redundant payment processors
  • Extensive monitoring

Result: Billions in payments processed with minimal failures.

Common Questions About SLOs

Q: What if I don't know what targets to set?

A: Start by measuring current performance. Your first SLO can be "maintain current performance." Then improve from there.

Q: How often should I review SLOs?

A: Monthly for critical services, quarterly for others. Adjust targets based on actual performance and business needs.

Q: What if my SLO is too hard to meet?

A: Lower it. An unachievable SLO is worse than a loose one. Gradually tighten as you improve.

Q: What if stakeholders demand 100% uptime?

A: Explain the cost. 99.9% might cost $10K/month. 99.99% might cost $100K/month. 99.999% might cost $1M/month. Let them choose.

Q: Should every service have SLOs?

A: Production services: Yes. Internal tools: Maybe. Experiments: No. Focus on what matters.

Practical Exercise: Define Your SLOs

Take a service you work on (real or hypothetical):

Step 1: Identify Critical User Journey

  • What's the most important thing users do?
  • Example: "Customer completes purchase"

Step 2: Choose 3 Metrics

  • Availability (must work)
  • Latency (must be fast)
  • Error rate (must succeed)

Step 3: Measure Current Performance

  • Look at last 30 days of data
  • What's your current availability?
  • What's your current latency?
  • What's your current error rate?

Step 4: Set Targets

  • Aim for 10-20% improvement
  • If current availability is 99.5%, target 99.7%
  • If current latency p95 is 500ms, target 400ms

Step 5: Calculate Error Budget

  • How much downtime/latency/errors allowed?
  • Example: 99.7% over 30 days = 130 minutes downtime allowed

Step 6: Create Dashboard

  • Show SLO status
  • Show error budget remaining
  • Make it visible to team

Time: 45-60 minutes

Complete Example: E-Commerce Platform

import { * } from 'sruja.ai/stdlib'

// ============ SERVICE DEFINITION ============

ECommerce = system "E-Commerce Platform" {
  API = container "API Service" {
    technology "Rust"
    description "Core API handling all business logic"
    
    // Metadata for governance
    metadata {
      owner "platform-team"
      criticality "high"
    }
    
    // SLOs define reliability targets
    slo {
      // Availability: Service must be up
      availability {
        target "99.9%"  // 8.76 hours downtime/year
        window "30 days"
        current "99.92%"
        
        error_budget {
          total "43.2 minutes"
          remaining "35.1 minutes"
          status "healthy"
        }
      }
      
      // Latency: Service must be fast
      latency {
        p95 "200ms"  // 95% of requests < 200ms
        p99 "500ms"  // 99% of requests < 500ms
        window "7 days"
        current {
          p95 "180ms"
          p99 "420ms"
        }
      }
      
      // Error Rate: Service must succeed
      error_rate {
        target "< 0.1%"  // Fewer than 1 in 1000 requests fail
        window "30 days"
        current "0.08%"
      }
      
      // Throughput: Service must handle load
      throughput {
        target "1000 req/s"  // Handle peak traffic
        window "1 hour"
        current "950 req/s"
      }
    }
    
    // Scale configuration supports SLOs
    scale {
      metric "cpu"
      
      // Minimum: Always have capacity for normal load + headroom
      min 5  // 5 instances × 200 req/s = 1000 req/s capacity
      
      // Maximum: Can scale for 2x spikes
      max 15  // 15 instances × 200 req/s = 3000 req/s capacity
      
      // Auto-scale triggers
      scale_up "cpu > 70%"
      scale_down "cpu < 30% cooldown 10m"
    }
  }
  
  Database = database "PostgreSQL" {
    technology "PostgreSQL"
    description "Primary data store"
    
    metadata {
      owner "platform-team"
      criticality "high"
    }
    
    slo {
      availability {
        target "99.95%"  // Database more critical than API
        window "30 days"
        current "99.96%"
      }
      
      latency {
        p95 "50ms"  // Database should be fast
        p99 "100ms"
        window "7 days"
        current {
          p95 "45ms"
          p99 "92ms"
        }
      }
    }
  }
  
  Cache = database "Redis" {
    technology "Redis"
    description "Cache layer for performance"
    
    slo {
      availability {
        target "99.9%"
        window "30 days"
      }
      
      hit_rate {
        target "> 80%"  // 80% of requests served from cache
        current "85%"
      }
    }
  }
}

// Relationships
ECommerce.API -> ECommerce.Database "Reads/Writes"
ECommerce.API -> ECommerce.Cache "Caches queries"

// ============ VIEWS ============

view index {
  title "E-Commerce Platform Architecture"
  include *
}

view slos {
  title "SLO Dashboard"
  include ECommerce.API ECommerce.Database ECommerce.Cache
  description "Monitor all SLOs in one view"
}

view capacity {
  title "Capacity & Scale"
  include ECommerce.API
  description "Focus on throughput and scaling"
}

// ============ GOVERNANCE ============

policy SLOPolicy "Critical services must have SLOs" {
  category "reliability"
  enforcement "required"
  
  rule {
    element_type "container"
    tag "criticality:high"
    requires_slo true
    error "Critical service {element} must have SLOs defined"
  }
}

// Tag critical services
ECommerce.API.tags ["criticality:high"]
ECommerce.Database.tags ["criticality:high"]

Run validation:

sruja validate architecture.sruja

This checks:

  • SLOs are defined for critical services
  • Scale configuration supports throughput targets
  • All required metadata present

What to Remember

  1. SLOs are promises backed by measurement - Without measurement, it's just wishful thinking

  2. Start with user journeys - Define what matters to users, then measure it

  3. Measure before you target - Know your current performance before setting goals

  4. Error budgets enable velocity - When budget is healthy, move fast. When tight, slow down.

  5. Headroom is essential - Run at 50-70% capacity. Have 30-50% buffer.

  6. Percentiles over averages - p95/p99 reveal what users actually experience

  7. Three SLOs maximum per service - Availability, latency, error rate. Maybe throughput.

  8. SLOs should evolve - As you improve, tighten targets. As business changes, adjust metrics.

  9. Make SLOs visible - Dashboards, alerts, deployment gates. Everyone should know status.

  10. Collaborative target-setting - Management defines needs, engineering defines feasibility, together define SLOs

When to Start with SLOs

Phase 1: Prototype (Skip SLOs)

  • Learning what to build
  • No real users yet
  • Focus on functionality

Phase 2: Launch (Basic SLOs)

  • Real users exist
  • Define availability SLO
  • Basic monitoring

Phase 3: Growth (Comprehensive SLOs)

  • Traffic increasing
  • Add latency, error rate SLOs
  • Error budget tracking
  • SLO-based decisions

Phase 4: Scale (Advanced SLOs)

  • High traffic
  • Multi-region SLOs
  • Per-feature SLOs
  • Sophisticated alerting

Next up: Lesson 5 wraps up the course with tracking architecture evolution - not just what your architecture is, but how it changes over time and why.

Lesson 5: Tracking Architecture Evolution

I spent three weeks as an architecture archaeologist.

A new VP of Engineering joined our company and asked a simple question: "How did our architecture get to where it is today? What decisions shaped it?"

I thought it would be easy. I'd just look at the architecture documentation and tell the story.

The problem: There was no architecture documentation. Not really. We had:

  • An outdated Confluence wiki with diagrams from 2019
  • A Google Drive folder with PowerPoint slides from various presentations
  • A Figma board with "current architecture" that hadn't been updated in 8 months
  • Various README files scattered across 47 repositories
  • And, if I was lucky, some comments in code

I spent three weeks digging through Git history, Slack archives, Jira tickets, and interviewing the five engineers who'd been there longest. I reconstructed a partial history. But most of the "why" was lost. The people who made the decisions had left. The Slack channels had been archived. The context was gone.

What I learned: Architecture without history is just a snapshot. It tells you WHAT the system looks like, but not HOW it got there or WHY. And without that context, you're destined to repeat the same mistakes.

This lesson is about tracking architecture evolution: not just what your architecture is, but how it changes over time and why. It's also the final lesson in this course, so we'll wrap up everything you've learned.

Why Track Architecture Evolution?

The Five Problems of Lost History

1. Onboarding Takes Forever

New engineer joins. Wants to understand the system. Without history:

  • "Why do we have three different caching layers?" "I don't know, they were here when I joined."
  • "Why is this service written in Go and that one in Rust?" "Historical reasons."
  • "Why can't we just use PostgreSQL?" "We tried once. It didn't work. I think. Not sure why."

With history: New engineer reads ADRs, reviews evolution, understands context. Onboarding: 2 weeks instead of 2 months.

2. We Repeat Mistakes

Without history, we make the same decisions over and over:

  • "Let's use MongoDB!" (We did in 2019. Switched to PostgreSQL in 2020. See ADR-023.)
  • "Let's build a monolith!" (We did in 2018. Spent 2019-2021 breaking it apart. See ADRs 1-15.)
  • "Let's skip tests!" (We did in 2017. Spent 2018-2019 recovering. See post-mortem PM-007.)

With history: "Wait, we tried this in ADR-023. It failed because X. Has anything changed?"

3. Audits Become Archaeology

SOC 2 auditor asks: "Show me how your architecture has evolved to meet security requirements."

Without tracking: Weeks of digging. Reconstructing history. Hoping you can find evidence.

With tracking: "Here's our architecture repo with complete Git history, ADRs for every security-related decision, and SLO evolution showing continuous improvement."

4. Decisions Get Re-litigated

New architect joins. Wants to change everything.

Without history: "Why is it this way? This seems stupid. Let's change it."

With history: "Let me show you ADR-042. We considered that approach. Here's why we didn't choose it. If circumstances have changed, we can revisit. But let's understand why first."

5. We Can't Measure Progress

"Are we getting better?"

Without tracking: "I think so? Feels better?"

With tracking: "Our availability SLO improved from 99.5% to 99.9% over 12 months. Latency p95 dropped from 500ms to 200ms. Error rate halved. Here's the commit history showing what changes drove each improvement."

The Three-Legged Stool of Evolution Tracking

Architecture evolution needs three things working together:

Leg 1: Git (What Changed)

Git tracks what changed and when:

# What changed in the architecture?
git log --oneline --follow architecture.sruja

# Output:
e4f8c2a Add Redis cache layer (see ADR-005)
a3b7d1f Split payment service from main API
9c2e5f3 Increase API replicas to 10 (SLO improvement)
7d1a4b6 Initial architecture baseline

What Git tells you:

  • Exact changes made
  • When they were made
  • Who made them
  • Commit messages (hopefully descriptive)
  • Pull request context (if linked)

What Git doesn't tell you:

  • Why the change was made
  • What alternatives were considered
  • What impact it had
  • Whether it was the right decision

Leg 2: ADRs (Why It Changed)

ADRs track why changes were made:

ADR005 = adr "Add Redis cache layer" {
  status "accepted"
  accepted_date "2024-03-15"
  
  context "
    API latency p95 is 500ms, target is 200ms.
    Database queries are the bottleneck.
    Current database CPU at 85%.
  "
  
  decision "
    Add Redis cache for hot paths:
    - Product catalog queries
    - User session data
    - Frequently accessed configuration
  "
  
  alternatives {
    option "Database read replicas" {
      pros "Familiar technology"
      cons "Still high latency, doesn't scale as well"
      rejected_because "Latency improvement insufficient"
    }
    
    option "Application-level caching" {
      pros "No new infrastructure"
      cons "Cache invalidation complex, not shared across instances"
      rejected_because "Doesn't work with horizontal scaling"
    }
    
    option "Redis cache" {
      pros "Fast (sub-millisecond), proven technology"
      cons "New operational complexity, cache invalidation logic needed"
      selected true
    }
  }
  
  consequences {
    positive "
      - Latency p95 improved from 500ms to 250ms
      - Database CPU dropped to 45%
      - Cost: $500/month for Redis cluster
    "
    
    negative "
      - Added operational complexity (new system to monitor)
      - Cache invalidation bugs took 2 weeks to iron out
      - Occasional cache stampedes during deployment
    "
    
    neutral "
      - Team needed Redis training
      - Monitoring dashboards updated
    "
  }
  
  related_adrs [ADR003, ADR004]  // Related to scaling decisions
  related_commits ["e4f8c2a"]    // Link to Git commits
}

What ADRs tell you:

  • Context (what problem we were solving)
  • Decision (what we chose to do)
  • Alternatives (what else we considered)
  • Consequences (what happened as a result)
  • Links to related decisions and commits

What ADRs don't tell you:

  • Quantitative impact over time
  • Whether SLOs improved
  • Long-term effectiveness

Leg 3: SLOs (What Impact It Had)

SLOs track the impact of changes:

API = container "API Service" {
  slo {
    latency {
      p95 "200ms"
      window "7 days"
      
      // Track evolution over time
      history {
        "2024-01-15" {
          current "500ms"
          note "Baseline before Redis"
        }
        
        "2024-03-20" {
          current "250ms"
          note "After Redis (ADR-005)"
        }
        
        "2024-05-10" {
          current "200ms"
          note "After query optimization (ADR-006)"
        }
        
        "2024-07-01" {
          current "180ms"
          note "Current - target met"
        }
      }
    }
  }
}

What SLOs tell you:

  • Quantitative metrics over time
  • Whether changes had positive impact
  • Progress toward targets
  • Correlation between changes and outcomes

What SLOs don't tell you:

  • Why changes were made
  • What alternatives were considered
  • Implementation details
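Because the history entries are just dated measurements, they are easy to analyze programmatically. A hypothetical sketch (illustrative Python, not Sruja tooling; the data mirrors the history block above) that computes the overall improvement between the first and last measurement:

```python
# Compute latency improvement from an SLO history.
# Data mirrors the example history above (p95 latency in ms).

history = {
    "2024-01-15": 500,  # baseline before Redis
    "2024-03-20": 250,  # after Redis (ADR-005)
    "2024-05-10": 200,  # after query optimization (ADR-006)
    "2024-07-01": 180,  # current - target met
}

dates = sorted(history)
baseline, current = history[dates[0]], history[dates[-1]]
improvement = (baseline - current) / baseline * 100

print(f"p95 latency: {baseline}ms -> {current}ms ({improvement:.0f}% improvement)")
# -> p95 latency: 500ms -> 180ms (64% improvement)
```

This is the kind of one-line answer ("64% better over six months") that evolution reviews and audits ask for.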

Putting It All Together

The complete picture:

Git Commit → ADR → SLO Impact
    ↓          ↓         ↓
  WHAT       WHY      RESULT

Example timeline:

  1. Git: Commit e4f8c2a "Add Redis cache layer"
  2. ADR: ADR-005 explains why (latency too high, database bottleneck)
  3. SLO: Latency improved from 500ms to 250ms, then 200ms

Query: "Why do we have Redis?"

Answer: "See commit e4f8c2a from March 2024. It was added per ADR-005 to address latency issues. Our p95 latency dropped from 500ms to 200ms over 3 months. Here's the SLO history showing the improvement."

Real-World Evolution Tracking

Netflix: Architecture Decision Logs

Netflix maintains detailed Architecture Decision Logs (ADLs) - their version of ADRs. Every significant decision is documented with:

  • Context and problem statement
  • Options considered
  • Decision made
  • Expected consequences
  • Actual outcomes (updated over time)

Their approach: ADLs are living documents. When a decision is made, they document expected consequences. Six months later, they update with actual consequences. This creates a feedback loop.

Example: "In ADL-342, we predicted moving to Chaos Engineering would cause 10% more incidents in Q1 but 50% fewer in Q2-Q4. Actual: 12% more in Q1, 60% fewer in Q2-Q4. Prediction was accurate. Decision validated."

Amazon: Working Backwards with History

Amazon's "Working Backwards" approach starts with the customer experience. Their architecture evolution tracking works similarly:

  1. Start with customer impact: "What customer metric are we trying to improve?"
  2. Document in architecture: Link every change to customer-facing metrics
  3. Track over time: Show how architecture changes moved customer metrics

Example: "In 2023 Q1, we optimized the checkout flow (commit a1b2c3d, ADR-789). Cart abandonment dropped 15%. Revenue increased $2M/month."

Google: SLO Evolution as History

Google tracks SLO evolution meticulously. They don't just track current vs target - they track the entire history:

Service: Gmail
SLO: Availability 99.99%

Evolution:
- 2005: 99.5% (early days)
- 2008: 99.9% (after infrastructure improvements)
- 2012: 99.95% (after multi-region deployment)
- 2018: 99.99% (after chaos engineering adoption)
- 2024: 99.99% (maintained)

Key insight: The history shows continuous improvement. It's not just "we're at 99.99%", it's "we've systematically improved from 99.5% to 99.99% over 20 years."
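Each availability step in that history maps directly to a downtime budget, which makes the improvement concrete. A quick calculation (standard availability arithmetic, not Google data):

```python
# Allowed downtime per 30-day window for each availability target.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for target in [99.5, 99.9, 99.95, 99.99]:
    downtime = MINUTES_PER_MONTH * (1 - target / 100)
    print(f"{target}% availability -> {downtime:.1f} min downtime/month")

# 99.5% allows ~216 minutes/month; 99.99% allows only ~4.3 minutes/month.
```

Going from 99.5% to 99.99% shrinks the monthly downtime budget by a factor of 50, which is why each step takes years of systematic work.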

Stripe: Git-Driven Architecture

Stripe keeps architecture diagrams in Git alongside code. Every architecture change goes through the same PR process as code changes:

  1. Architecture change proposed in PR
  2. Architecture review (separate from code review)
  3. ADR linked in PR description
  4. SLO impact predicted
  5. After merge: SLO impact measured
  6. Update PR with actual results

Result: Complete history of architecture evolution, linked to code changes, with measured impact.

Evolution Tracking Framework: TRACK

Use this framework to track architecture evolution:

T - Tag Versions

Tag significant architecture states:

# Tag major versions
git tag -a v2024.01 -m "Post-microservices migration"
git tag -a v2024.02 -m "After multi-region deployment"
git tag -a v2024.03 -m "Post-caching layer addition"

# View architecture at any point in time
git show v2024.01:architecture.sruja | sruja export

R - Record Decisions

Create ADR for every significant change:

ADR### = adr "[Title]" {
  // Standard ADR format
  context "..."
  decision "..."
  consequences "..."
  
  // Evolution tracking extras
  created_date "YYYY-MM-DD"
  related_commits ["abc123"]
  slo_impact ["latency improved 50ms"]
}

A - Analyze Impact

Measure SLO changes after architectural changes:

slo {
  latency {
    p95 "200ms"
    history {
      "2024-03-01" {
        current "500ms"
        note "Before change"
      }
      "2024-03-15" {
        current "250ms"
        note "After Redis (ADR-005)"
      }
    }
  }
}

C - Connect the Dots

Link Git commits → ADRs → SLO changes:

// In ADR
related_commits ["e4f8c2a"]
slo_impact ["latency p95: 500ms → 250ms"]

// In commit message
git commit -m "Add Redis cache layer (ADR-005)"

// In SLO history
note "After Redis (ADR-005, commit e4f8c2a)"

K - Keep Updating

Architecture evolution tracking is never "done":

  • Update SLOs monthly
  • Review ADRs quarterly (are they still relevant?)
  • Tag versions for major changes
  • Measure and document impact

Common Evolution Tracking Mistakes

Mistake #1: Architecture in a Drawer

What happens: Architecture diagrams exist but aren't updated.

Example:

  • Confluence page created in 2021: "Current Architecture"
  • Last updated: March 2021
  • Actual architecture: Completely different (migrated to microservices, changed databases, etc.)

Why it fails:

  • Documentation becomes wrong
  • People stop trusting it
  • New architecture created in ad-hoc ways (whiteboard photos, Slack drawings)
  • History is lost

The fix: Architecture lives in Git. Changes go through PR process. Documentation is always current.

Mistake #2: ADRs Without Context

What happens: ADRs exist but lack key information.

Example:

// BAD: No context
ADR042 = adr "Use PostgreSQL" {
  decision "Use PostgreSQL"
}

Why it fails:

  • "Why did we choose this?" → "I don't know, ADR-042 just says 'use PostgreSQL'"
  • "What alternatives did we consider?" → "No idea"
  • "Was it the right decision?" → "Who knows"

The fix: Every ADR needs context, alternatives, and consequences.

Mistake #3: SLOs Without History

What happens: You track current SLOs but not their evolution.

Example:

  • Current SLO: 99.9% availability
  • Target SLO: 99.9% availability
  • Result: ✅ Target met

What's missing: How did we get here? Did we improve? Get worse? What changes drove the improvement?

The fix: Track SLO history. Show the journey, not just the destination.

Mistake #4: Git History Without Documentation

What happens: Git commits exist but aren't linked to decisions.

Example:

git log --oneline
# e4f8c2a Fix stuff
# a3b7d1f More fixes
# 9c2e5f3 Update things

Why it fails:

  • "What did this commit do?" → "Fixed stuff"
  • "Why was it needed?" → "Don't know"
  • "What impact did it have?" → "No idea"

The fix: Descriptive commit messages linked to ADRs.

Mistake #5: No Regular Reviews

What happens: Evolution tracking is set up but never reviewed.

Example:

  • ADRs created but never read
  • SLO history tracked but never analyzed
  • Git tags created but never used

Why it fails:

  • Tracking without review is just data collection
  • No insights generated
  • No learning happens

The fix: Monthly architecture reviews that examine evolution.

Mistake #6: Tracking Everything

What happens: You try to track every minor change.

Example:

  • ADR for changing a log message
  • ADR for updating a dependency version
  • ADR for renaming a variable

Why it fails:

  • Information overload
  • ADRs become noise
  • Important decisions get lost

The fix: Track significant decisions. Use judgment on what matters.

Evolution Review Process

Monthly architecture evolution review:

Attendees: Architects, Tech Leads, interested engineers

Agenda:

  1. Review Recent Changes (15 min)

    • Git log: What architecture changes were made?
    • ADRs: Any new decisions documented?
    • SLOs: Any significant metric changes?
  2. Analyze Impact (15 min)

    • For each significant change: What was the predicted impact? What was the actual impact?
    • Any surprises? Any decisions we'd reverse?
  3. Identify Patterns (10 min)

    • What types of changes are we making frequently?
    • Are we improving? Getting worse? Stagnating?
    • Any recurring problems?
  4. Update Documentation (10 min)

    • Update SLO history with current values
    • Update ADR consequences with actual outcomes
    • Tag any major versions
  5. Action Items (10 min)

    • What do we need to change?
    • What decisions need revisiting?
    • What experiments should we run?

Output: Monthly architecture evolution report

The Architecture Timeline

Keep a high-level timeline of your architecture's evolution:

# Architecture Evolution Timeline

## 2024

### Q1 (January - March)
- Migrated to microservices (ADR-001 to ADR-015)
- Split monolith into 12 services
- Added API gateway
- SLO Impact: Availability 99.5% → 99.7%

### Q2 (April - June)
- Implemented Redis caching (ADR-016)
- Optimized database queries (ADR-017)
- Added circuit breakers (ADR-018)
- SLO Impact: Latency p95 500ms → 200ms

### Q3 (July - September)
- Multi-region deployment (ADR-019)
- Implemented Chaos Engineering (ADR-020)
- Added comprehensive monitoring (ADR-021)
- SLO Impact: Availability 99.7% → 99.9%

### Q4 (October - December)
- [Current quarter - tracking in progress]

Value:

  • Quick reference for "when did we do X?"
  • Shows progression over time
  • Useful for onboarding
  • Helpful for audits

Complete Example: E-Commerce Platform Evolution

import { * } from 'sruja.ai/stdlib'

// ============ ARCHITECTURE DEFINITION ============

ECommerce = system "E-Commerce Platform" {
  API = container "API Service" {
    technology "Rust"
    
    description "
      Core API service. Evolved from monolith (2023 Q4) to 
      microservices (2024 Q1). See ADR-001 through ADR-015.
    "
    
    slo {
      availability {
        target "99.9%"
        window "30 days"
        current "99.92%"
        
        history {
          "2023-10-01" {
            current "99.5%"
            note "Monolith baseline"
          }
          "2024-03-01" {
            current "99.7%"
            note "Post-microservices migration (ADR-001 to ADR-015)"
          }
          "2024-06-01" {
            current "99.85%"
            note "Post-caching (ADR-016)"
          }
          "2024-09-01" {
            current "99.92%"
            note "Post-multi-region (ADR-019)"
          }
        }
      }
      
      latency {
        p95 "200ms"
        p99 "500ms"
        window "7 days"
        current {
          p95 "180ms"
          p99 "420ms"
        }
        
        history {
          "2023-10-01" {
            p95 "800ms"
            note "Monolith baseline"
          }
          "2024-03-01" {
            p95 "500ms"
            note "Post-microservices"
          }
          "2024-06-01" {
            p95 "200ms"
            note "Post-caching and query optimization (ADR-016, ADR-017)"
          }
        }
      }
    }
    
    metadata {
      owner "platform-team"
      created "2023-10-01"
      last_updated "2024-09-15"
    }
  }
  
  Cache = database "Redis Cache" {
    technology "Redis"
    
    description "
      Added 2024-04-15 per ADR-016. Reduced database load by 60%.
      Improved latency p95 from 500ms to 200ms.
    "
    
    metadata {
      added_date "2024-04-15"
      related_adr "ADR-016"
      related_commits ["e4f8c2a", "b5d9e1f"]
    }
  }
  
  Database = database "PostgreSQL" {
    technology "PostgreSQL"
    
    description "
      Primary database. Query optimization completed 2024-05-20 
      per ADR-017. Database CPU reduced from 85% to 45%.
    "
    
    metadata {
      created "2023-10-01"
      optimization_date "2024-05-20"
      related_adr "ADR-017"
    }
  }
}

// Relationships
ECommerce.API -> ECommerce.Cache "Queries (added 2024-04-15)"
ECommerce.API -> ECommerce.Database "Reads/Writes"

// ============ ARCHITECTURE DECISION RECORDS ============

ADR016 = adr "Add Redis caching layer" {
  status "accepted"
  created_date "2024-04-01"
  accepted_date "2024-04-05"
  implemented_date "2024-04-15"
  
  context "
    API latency p95 is 500ms, target is 200ms.
    Database CPU at 85%, causing performance issues.
    Peak traffic causing database connection exhaustion.
  "
  
  decision "
    Implement Redis caching for:
    - Product catalog queries (hot path)
    - User session data
    - Configuration lookups
    - Frequently accessed data
  "
  
  alternatives {
    option "Database read replicas" {
      pros "Familiar, ACID compliant"
      cons "Latency still high (200ms+), expensive at scale"
      rejected_because "Insufficient latency improvement"
    }
    
    option "In-memory application cache" {
      pros "No new infrastructure"
      cons "Not shared across instances, complex invalidation"
      rejected_because "Doesn't work with horizontal scaling"
    }
    
    option "Redis" {
      pros "Sub-millisecond latency, proven at scale"
      cons "New operational complexity, cache invalidation logic"
      selected true
    }
  }
  
  consequences {
    actual_positive "
      - Latency p95: 500ms → 200ms (target met!)
      - Database CPU: 85% → 45%
      - Database connections: 400 → 150
      - User experience significantly improved
    "
    
    actual_negative "
      - 2 weeks to debug cache invalidation issues
      - 3 cache stampede incidents during deployment
      - Added operational complexity (monitoring, alerting)
      - Cost: $500/month for Redis cluster
    "
    
    lessons_learned "
      - Start with conservative TTLs (5 minutes), not aggressive (1 hour)
      - Implement cache warming before traffic shifts
      - Need better cache monitoring before deploying
    "
  }
  
  related_commits ["e4f8c2a", "b5d9e1f", "c6a0d2e"]
  slo_impact ["latency p95: 500ms → 200ms"]
  tags ["performance", "caching", "redis"]
}

ADR017 = adr "Optimize database queries" {
  status "accepted"
  created_date "2024-05-10"
  implemented_date "2024-05-20"
  
  context "
    After Redis (ADR-016), latency improved to 250ms but still above target.
    Analysis shows remaining latency from:
    - N+1 queries in order processing
    - Missing indexes on frequently queried columns
    - Inefficient joins in reporting queries
  "
  
  decision "
    Database optimization:
    - Add indexes on frequently queried columns
    - Fix N+1 queries with eager loading
    - Denormalize frequently joined tables
    - Add query result caching (application-level)
  "
  
  consequences {
    actual_positive "
      - Latency p95: 250ms → 200ms (target met!)
      - Query execution time: 150ms average → 50ms average
      - Database CPU: 60% → 45%
    "
    
    actual_negative "
      - Migration took 4 hours (downtime required)
      - Slower INSERT operations due to indexes (10% slower)
      - More complex query logic
    "
  }
  
  related_adrs [ADR016]  // Built on caching work
  related_commits ["f7b3c1d", "g8d4e2f"]
  slo_impact ["latency p95: 250ms → 200ms"]
}

// ============ VIEWS ============

view index {
  title "E-Commerce Platform - Current Architecture"
  include *
}

view evolution {
  title "Architecture Evolution (2024)"
  include ECommerce.API ECommerce.Cache ECommerce.Database
  description "
    Shows evolution from monolith to microservices, 
    with caching and optimization layers added.
  "
}

view slos {
  title "SLO Evolution Dashboard"
  include ECommerce.API
  description "Track SLO improvement over time"
}

// ============ EVOLUTION METADATA ============

metadata {
  last_architecture_review "2024-09-15"
  next_scheduled_review "2024-10-15"
  
  evolution_summary {
    total_adrs 21
    active_adrs 19
    deprecated_adrs 2
    
    slo_improvements [
      "availability: 99.5% → 99.9%",
      "latency p95: 800ms → 180ms",
      "error rate: 0.5% → 0.08%"
    ]
  }
}

Run validation:

# Validate current architecture
sruja validate architecture.sruja

# View evolution over time
git log --oneline --follow architecture.sruja

# Compare versions
git diff v2024.01..HEAD -- architecture.sruja

# Export specific version
git show v2024.01:architecture.sruja | sruja export

What to Remember

  1. Architecture without history is incomplete - You need to know how you got here, not just where you are

  2. Three-legged stool: Git + ADRs + SLOs - Each tracks something different, together they tell the complete story

  3. Git tracks WHAT changed - Automatic, but needs good commit messages

  4. ADRs track WHY it changed - Manual, but essential for context

  5. SLOs track IMPACT - Quantitative evidence of improvement (or degradation)

  6. Link everything together - Commits reference ADRs, ADRs reference SLOs, SLOs reference both

  7. Review regularly - Monthly evolution reviews keep history alive and useful

  8. Track significant decisions, not everything - Focus on what matters

  9. Update as you learn - ADRs should include actual consequences, not just predicted

  10. Architecture is a journey, not a destination - Evolution tracking shows the journey

When to Start Tracking Evolution

Phase 1: New Project

  • Start with Git from day one
  • Create first ADR for initial architecture decisions
  • Set baseline SLOs

Phase 2: Growing System

  • Formalize ADR process
  • Start tracking SLO evolution
  • Tag major versions

Phase 3: Mature System

  • Regular evolution reviews
  • Comprehensive ADR history
  • Multi-year SLO tracking

Phase 4: Legacy System

  • Start tracking now (better late than never)
  • Create retrospective ADRs for major decisions (if you can reconstruct them)
  • Begin SLO baseline and track forward

Practical Exercise

Track your architecture's evolution:

Step 1: Review Git History (30 min)

git log --oneline --follow architecture.sruja

  • List 10 most significant changes
  • Identify what's missing (no ADR? unclear commit message?)

Step 2: Create Missing ADRs (60 min)

  • For the 3 most important changes, create retrospective ADRs
  • Document context, decision, consequences
  • Link to commits

Step 3: Add SLO History (30 min)

  • Update SLO definitions with history
  • Show evolution over time
  • Link to ADRs

Step 4: Create Timeline (30 min)

  • Build architecture evolution timeline
  • Major events, decisions, improvements
  • Keep it high-level (one page)

Step 5: Schedule Reviews

  • Set up monthly architecture evolution review
  • Add to team calendar
  • Create template for review notes

🎉 Course Complete: System Design 101

Congratulations! You've completed the entire System Design 101 course. Let's reflect on what you've learned.

Your Journey

You started as someone who wanted to understand system design better. You've now mastered:

Module 1: Fundamentals (5 lessons)

  • Thinking in Systems - Decomposition, boundaries, emergence
  • Stakeholders & Requirements - Who needs what and why
  • Architecture Patterns - Monoliths, microservices, layers, and more
  • Technology Selection - Choosing the right tools
  • Risk-Driven Architecture - Prioritizing what matters

Key insight: Architecture is about making decisions under uncertainty. Start with risks, choose patterns that mitigate them.

Module 2: Modeling with Sruja (3 lessons)

  • Sruja Fundamentals - DSL basics, elements, relationships
  • System Context - Boundaries, external dependencies
  • Container Architecture - Services, datastores, deployment units

Key insight: Good models communicate clearly. Use Sruja to create living documentation that stays current.

Module 3: Advanced Modeling (7 lessons)

  • Microservices Architecture - Service boundaries, distributed systems
  • Event-Driven Architecture - Async patterns, event sourcing
  • Advanced Scenarios - Complex relationship patterns
  • Architectural Perspectives - Multiple views for different audiences
  • Views & Styling - Visual hierarchy, clarity
  • Advanced DSL Features - Scenarios, flows, requirements
  • Views Best Practices - Governance, lifecycle, organization

Key insight: Real systems are complex. Advanced modeling techniques help you manage that complexity effectively.

Module 4: Production Readiness (5 lessons)

  • Documenting Decisions (ADRs) - Why we made the choices we made
  • Deployment Architecture - Where code runs, how it scales
  • Governance as Code - Automated compliance and guardrails
  • SLOs & Scale Integration - Reliability targets and capacity
  • Tracking Architecture Evolution - History, learning, improvement

Key insight: Production systems need more than good design. They need documentation, deployment, governance, reliability, and evolution tracking.

The Complete Picture

You now understand the full lifecycle of architecture:

1. THINK    → Understand the problem, identify risks
2. MODEL    → Create clear, communicable architecture
3. ADVANCE  → Handle complexity with advanced techniques
4. PRODUCE  → Make it real, reliable, and maintainable
5. EVOLVE   → Track changes, learn, improve

What Makes You Different

Most engineers know how to build systems. You now also know:

  • How to think about systems - Not just implement, but design
  • How to communicate architecture - Clear models, multiple views
  • How to make decisions - Documented, reasoned, reviewed
  • How to ensure reliability - SLOs, error budgets, monitoring
  • How to maintain architecture - Evolution tracking, governance
  • How to learn from history - ADRs, retrospective analysis

The Best Architects

The best architects I know aren't the ones who know the most patterns or can draw the prettiest diagrams. They're the ones who:

  • Communicate clearly - Anyone can understand their architecture
  • Make decisions transparently - Everyone knows why choices were made
  • Learn from mistakes - Architecture evolves based on evidence
  • Balance trade-offs - No perfect solutions, only good compromises
  • Keep it simple - Complexity is a cost, not a feature

You now have the tools to be that kind of architect.

Your Next Steps

Immediate (This Week)

  1. Apply what you learned - Pick one system you work on
  2. Create an architecture model - Start simple, add detail
  3. Write your first ADR - Document a recent decision
  4. Define one SLO - Pick your most critical service

Short-Term (This Month)

  1. Model your entire system - Full architecture in Sruja
  2. Create multiple views - Different audiences
  3. Implement governance - At least critical policies
  4. Start tracking evolution - Git, ADRs, SLOs together

Long-Term (This Year)

  1. Build architecture culture - Team-wide adoption
  2. Regular architecture reviews - Monthly cadence
  3. Continuous improvement - Learn from evolution data
  4. Mentor others - Share what you've learned

Continuing Your Journey

This course gave you the foundation. Here's where to go next:

Deepen Your Skills

  • Books: "Designing Data-Intensive Applications" (Kleppmann), "Building Evolutionary Architectures" (Ford et al.)
  • Practice: Model real systems you work with
  • Community: Join architecture communities, share your models

Expand Your Knowledge

  • Domain-specific patterns: Read about patterns in your industry
  • Case studies: Study how companies like Netflix, Amazon, Google architect their systems
  • New technologies: Keep learning about new tools and approaches

Specialize

  • Data architecture: Databases, data pipelines, analytics
  • Security architecture: Authentication, authorization, encryption
  • Cloud architecture: AWS, GCP, Azure-specific patterns
  • ML/AI architecture: Machine learning systems, model serving

Final Thoughts

Architecture is not about perfection. It's about making good decisions with the information you have, documenting those decisions clearly, and learning from what happens.

The systems you build will outlast your time with them. Other engineers will maintain them, extend them, and wonder why you made the choices you made.

With what you've learned in this course, they won't have to wonder. They'll have clear models, documented decisions, tracked evolution, and the context they need to continue your work.

That's the real value of architecture: not just building systems, but building systems that can be understood, maintained, and evolved by others.

Go build great systems. Document your decisions. Track your evolution. Learn from your mistakes. And help others do the same.

You're ready. 🚀


Course Statistics

Total Lessons: 20 (Module 1: 5, Module 2: 3, Module 3: 7, Module 4: 5), plus module overviews and this course summary

What You Learned:

  • Architecture fundamentals and thinking
  • Modeling with Sruja DSL
  • Advanced patterns and techniques
  • Production readiness

Skills Acquired:

  • System decomposition and analysis
  • Architecture modeling and documentation
  • Decision documentation (ADRs)
  • Reliability engineering (SLOs)
  • Governance and compliance
  • Evolution tracking

Frameworks Mastered:

  • VIEW framework (for architectural perspectives)
  • STYLE framework (for visual clarity)
  • COMPLETE framework (for production models)
  • GOVERN framework (for governance)
  • RELIABLE framework (for SLOs)
  • TRACK framework (for evolution)

Thank You

Thank you for completing this journey. I hope this course has made you a better architect, a clearer communicator, and a more thoughtful engineer.

Remember: every system tells a story. Make yours worth reading.


Course Complete! 🎓

You've mastered System Design 101. Now go apply it.

System Design 201: Advanced Systems

Overview

  • Focuses on scaling strategies and production realities beyond fundamentals
  • Covers throughput, real-time processing, data-intensive architectures, and consistency models

Learning Goals

  • Design services for high throughput and predictable performance
  • Apply real-time processing patterns for streaming data
  • Architect data-intensive systems with storage and compute separation
  • Choose appropriate consistency and isolation models (and know the trade-offs)

Prerequisites

  • Completed or familiar with concepts from System Design 101
  • Comfortable with distributed systems basics, caching, queues, and storage types

Course Structure

  • Module 1: High Throughput
  • Module 2: Real-Time
  • Module 3: Data-Intensive
  • Module 4: Consistency

Where to Start

  • Begin with Module 1 to build scaling foundations, then proceed in order

Module Overview: High Throughput Systems

"Design a system that handles 1 million requests per second."

This module covers advanced scaling patterns needed for high-throughput systems - a common interview topic at top tech companies.

Learning Goals

  • Identify throughput bottlenecks in systems
  • Apply scaling patterns (queues, sharding, caching)
  • Model trade-offs and document decisions with ADRs
  • Design systems that handle massive scale

Interview Preparation

  • ✅ Answer "design for high throughput" questions
  • ✅ Explain queuing and async processing
  • ✅ Discuss database sharding strategies
  • ✅ Model scaling patterns with Sruja

Real-World Application

  • Design systems that handle millions of requests
  • Apply patterns to actual high-scale systems
  • Understand trade-offs in scaling decisions

Estimated Time

60-75 minutes (includes practice)

Checklist

  • Can identify throughput bottlenecks
  • Understand queuing and async patterns
  • Can design sharding strategies
  • Can explain trade-offs clearly

Lesson 1: Design a URL Shortener

Goal: Design a service like TinyURL that takes a long URL and converts it into a short alias (e.g., http://tiny.url/xyz).

Requirements

Functional

  • shorten(long_url) -> short_url
  • redirect(short_url) -> long_url
  • Custom aliases (optional).

Non-Functional

  • Highly Available: If the service is down, URL redirection stops working.
  • Low Latency: Redirection must happen in milliseconds.
  • Read-Heavy: 100:1 read-to-write ratio.

Core Design

1. Database Choice

Since we need fast lookups and the data model is simple (Key-Value), a NoSQL Key-Value Store (like DynamoDB or Redis) is ideal.

  • Key: short_alias
  • Value: long_url

2. Hashing Algorithm

How do we generate the alias?

  • MD5/SHA256: Hashes are too long for a short alias, and truncating them risks collisions.
  • Base62 Encoding: Converts a unique ID (from a counter or database ID) into a string of characters [a-z, A-Z, 0-9].

🛠️ Sruja Perspective: Modeling the Flow

We can use Sruja to model the system components and the user scenario for redirection.

import { * } from 'sruja.ai/stdlib'


R1 = requirement functional "Shorten long URL"
R2 = requirement functional "Redirect short URL"
R3 = requirement availability "High availability for redirects"
R4 = requirement performance "Low latency (< 200ms)"

// Define the system boundary
TinyURL = system "TinyURL Service" {
  WebServer = container "API Server" {
    technology "Rust"
    scale {
      min 3
      max 20
      metric "cpu > 70%"
    }
  }

  DB = database "UrlStore" {
    technology "DynamoDB"
    description "Stores mapping: short_alias -> long_url"
  }

  Cache = container "Cache" {
    technology "Redis"
    description "Caches popular redirects"
  }

  WebServer -> Cache "Reads"
  WebServer -> DB "Reads/Writes"
}

User = person "User"

// Define the redirection scenario (most common - cache hit)
RedirectFlowCacheHit = scenario "User clicks a short link (cache hit)" {
  User -> TinyURL.WebServer "GET /xyz"
  TinyURL.WebServer -> TinyURL.Cache "Check cache for 'xyz'"
  TinyURL.Cache -> TinyURL.WebServer "Hit: 'http://example.com'"
  TinyURL.WebServer -> User "301 Redirect (from cache)"
}

// Cache miss scenario
RedirectFlowCacheMiss = scenario "User clicks a short link (cache miss)" {
  User -> TinyURL.WebServer "GET /xyz"
  TinyURL.WebServer -> TinyURL.Cache "Check cache for 'xyz'"
  TinyURL.Cache -> TinyURL.WebServer "Miss"
  TinyURL.WebServer -> TinyURL.DB "Get long_url for 'xyz'"
  TinyURL.DB -> TinyURL.WebServer "Return 'http://example.com'"
  TinyURL.WebServer -> TinyURL.Cache "Cache the mapping"
  TinyURL.WebServer -> User "301 Redirect to 'http://example.com'"
}

// URL shortening scenario
ShortenURL = scenario "User creates a short URL" {
  User -> TinyURL.WebServer "POST /shorten with long_url"
  TinyURL.WebServer -> TinyURL.WebServer "Generate base62 alias"
  TinyURL.WebServer -> TinyURL.DB "Store mapping: alias -> long_url"
  TinyURL.DB -> TinyURL.WebServer "Confirm stored"
  TinyURL.WebServer -> User "Return short URL"
}

view index {
include *
}

Lesson 2: Design a Rate Limiter

Goal: Design a system to limit the number of requests a client can send to an API within a time window (e.g., 10 requests per second).

Why Rate Limit?

  • Prevent Abuse: Stop DDoS attacks or malicious bots.
  • Fairness: Ensure one user doesn't hog all resources.
  • Cost Control: Prevent auto-scaling bills from exploding.

Algorithms

Token Bucket

  • A "bucket" holds tokens.
  • Tokens are added at a fixed rate (e.g., 10 tokens/sec).
  • Each request consumes a token.
  • If the bucket is empty, the request is dropped (429 Too Many Requests).
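The token-bucket steps above can be sketched as a small class. This is a single-process sketch for intuition (a production limiter would keep the state in a shared store like Redis); the names and parameters are illustrative.

```python
import time

class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `rate` tokens/sec."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request consumes one token
            return True
        return False  # empty bucket: caller responds 429 Too Many Requests
```

Because the bucket starts full, a burst up to `capacity` is allowed immediately; sustained traffic is then limited to `rate` requests per second.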

Leaky Bucket

  • Requests enter a queue (bucket) and are processed at a constant rate.
  • If the queue is full, new requests are dropped.
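The leaky bucket differs in shape: requests queue up and a worker drains them at a constant rate. A minimal sketch (queue capacity and method names are illustrative):

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket: requests queue up to `capacity` and drain at a constant rate."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, request) -> bool:
        if len(self.queue) >= self.capacity:
            return False  # bucket full: drop the request
        self.queue.append(request)
        return True

    def leak(self):
        """Called by a worker at a fixed rate; processes the oldest request."""
        return self.queue.popleft() if self.queue else None
```

The trade-off versus the token bucket: a leaky bucket smooths output to a perfectly constant rate, but adds queueing delay and cannot absorb bursts.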

Architecture Location

Where does the rate limiter live?

  1. Client-side: Unreliable (can be forged).
  2. Server-side: Inside the application code.
  3. Middleware: In a centralized API Gateway (Best practice).

🛠️ Sruja Perspective: Middleware Modeling

In Sruja, we can model the Rate Limiter as a component within the API Gateway, backed by a fast datastore like Redis.

import { * } from 'sruja.ai/stdlib'


APIGateway = system "API Gateway" {
    GatewayService = container "Gateway" {
        technology "Nginx / Kong"

        RateLimiter = component "Rate Limiter Middleware" {
            description "Implements Token Bucket algorithm"
        }
    }

    Redis = database "Rate Limit Store" {
        technology "Redis"
        description "Stores token counts per user/IP"
    }

    APIGateway.GatewayService -> APIGateway.Redis "Stores tokens"
}

Backend = system "Backend Service"

APIGateway.GatewayService -> Backend "Forward Requests"
Client = person "Client"

// Scenario: Request allowed (has tokens)
RateLimitAllowed = scenario "Rate Limit Check - Allowed" {
    Client -> APIGateway.GatewayService "API Request"
    APIGateway.GatewayService -> APIGateway.Redis "DECR user_123_tokens"
    APIGateway.Redis -> APIGateway.GatewayService "Result: 5 (tokens remaining)"
    APIGateway.GatewayService -> Backend "Forward request"
    Backend -> APIGateway.GatewayService "Response"
    APIGateway.GatewayService -> Client "200 OK"
}

// Scenario: Request rate limited (no tokens)
RateLimitBlocked = scenario "Rate Limit Check - Blocked" {
    Client -> APIGateway.GatewayService "API Request"
    APIGateway.GatewayService -> APIGateway.Redis "DECR user_123_tokens"
    APIGateway.Redis -> APIGateway.GatewayService "Result: -1 (Empty bucket)"
    APIGateway.GatewayService -> Client "429 Too Many Requests"
}

// Scenario: Token refill (background process)
TokenRefill = scenario "Token Bucket Refill" {
    APIGateway.Redis -> APIGateway.Redis "Add 10 tokens/sec (background)"
    APIGateway.Redis -> APIGateway.Redis "Cap at max bucket size"
}

view index {
include *
}

Lesson 3: Views for Critical Throughput Paths

Why Views for Throughput?

Focus on hot paths to reason about scaling, backpressure, and caching. High-throughput systems have critical paths that are worth isolating for analysis.

Sruja: High‑Throughput View

import { * } from 'sruja.ai/stdlib'


Pipeline = system "Data Pipeline" {
Ingest = container "Ingestion Service" {
  technology "Kafka Consumer"
  scale {
    min 5
    max 50
    metric "lag > 1000"
  }
}

Processor = container "Processing Service" {
  technology "Rust workers"
  scale {
    min 10
    max 200
    metric "queue_depth > 5000"
  }
}

Events = database "Event Store" {
  technology "Kafka"
  description "Buffers events for processing"
}

OutputDB = database "Output Database" {
  technology "ClickHouse"
  description "Stores processed events"
}

Ingest -> Events "Consumes"
Events -> Processor "Streams"
Processor -> OutputDB "Writes"
}

// Complete system view
view index {
title "Complete Pipeline"
include *
}

// Hot path view: Focus on critical throughput path
view hotpath {
title "Hot Path - Throughput Analysis"
include Pipeline.Ingest
include Pipeline.Events
include Pipeline.Processor
exclude Pipeline.OutputDB
}

// Backpressure view: Components that can cause bottlenecks
view backpressure {
title "Backpressure Points"
include Pipeline.Events
include Pipeline.Processor
exclude Pipeline.Ingest
exclude Pipeline.OutputDB
}

// Scale view: Components with scaling configuration
view scale {
title "Scaling Configuration"
include Pipeline.Ingest
include Pipeline.Processor
exclude Pipeline.Events
exclude Pipeline.OutputDB
}

Practice

  • Create a view highlighting backpressure points.
  • Annotate scale bounds for hot components.
  • Use scenarios to model high-volume flows.

Lesson 1: Design a Chat Application

Goal: Design a real-time chat service like WhatsApp or Slack that supports 1-on-1 and Group messaging.

Requirements

Functional

  • Send/Receive messages in real-time.
  • See user status (Online/Offline).
  • Message history (persistent storage).

Non-Functional

  • Low Latency: Messages must appear instantly.
  • Consistency: Messages must be delivered in order.
  • Availability: High uptime.

Core Design

1. Communication Protocol

HTTP is request/response (pull). For chat, we need push.

  • WebSockets: Keeps a persistent connection open between client and server.

2. Message Flow

  • User A sends message to Chat Server.
  • Chat Server finds which server User B is connected to (using a Session Store like Redis).
  • Chat Server pushes message to User B.
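The lookup-and-route step above can be sketched with a plain dict standing in for the Redis session store (the function and payload names are illustrative):

```python
# Sketch: route a message via a session store (a dict standing in for Redis).
session_store = {}  # user_id -> chat_server_id holding the WebSocket

def register(user_id: str, server_id: str) -> None:
    """Record which chat server holds the user's WebSocket connection."""
    session_store[user_id] = server_id

def route_message(sender: str, recipient: str, text: str):
    """Return (server_id, payload) to push, or None if the recipient is offline."""
    server = session_store.get(recipient)
    if server is None:
        return None  # recipient offline: persist as pending delivery instead
    return server, {"from": sender, "to": recipient, "text": text}
```

The `None` branch is exactly the "mark as pending delivery" path modeled in the offline scenario below.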

3. Storage

  • Chat History: Write-heavy. Cassandra or HBase (Wide-column stores) are good for time-series data.
  • User Status: Key-Value store (Redis) with TTL.

🛠️ Sruja Perspective: Modeling Real-Time Flows

We can use Sruja to model the WebSocket connections and the async message processing.

import { * } from 'sruja.ai/stdlib'


requirement R1 functional "Real-time messaging"
requirement R2 functional "Message history"
requirement R3 latency "Instant delivery"
requirement R4 consistency "Ordered delivery"

ChatApp = system "WhatsApp Clone" {
    ChatServer = container "Chat Server" {
        technology "Node.js (Socket.io)"
        description "Handles WebSocket connections"
        scale {
            min 10
            max 100
            metric "connections > 10k"
        }
    }

    SessionStore = database "Session Store" {
        technology "Redis"
        description "Maps UserID -> WebSocketServerID"
    }

    MessageDB = database "Message History" {
        technology "Cassandra"
        description "Stores chat logs"
    }

    MessageQueue = queue "Message Queue" {
        technology "Kafka"
        description "Buffers messages for group chat fan-out"
    }

    ChatServer -> SessionStore "Reads/Writes"
    ChatServer -> MessageDB "Persists messages"
    ChatServer -> MessageQueue "Async processing"
}

UserA = person "Alice"
UserB = person "Bob"
UserC = person "Carol"
UserD = person "Dave"

// Scenario: 1-on-1 chat (user online)
scenario SendMessageOnline "Send Message - Recipient Online" {
    UserA -> ChatApp.ChatServer "Send 'Hello'"
    ChatApp.ChatServer -> ChatApp.MessageDB "Persist message"
    ChatApp.ChatServer -> ChatApp.SessionStore "Lookup Bob's connection"
    ChatApp.SessionStore -> ChatApp.ChatServer "Bob is on Server-2"
    ChatApp.ChatServer -> UserB "Push 'Hello' via WebSocket"
    UserB -> ChatApp.ChatServer "ACK received"
}

// Scenario: 1-on-1 chat (user offline)
scenario SendMessageOffline "Send Message - Recipient Offline" {
    UserA -> ChatApp.ChatServer "Send 'Hello'"
    ChatApp.ChatServer -> ChatApp.MessageDB "Persist message"
    ChatApp.ChatServer -> ChatApp.SessionStore "Lookup Bob's connection"
    ChatApp.SessionStore -> ChatApp.ChatServer "Bob is offline"
    ChatApp.ChatServer -> ChatApp.MessageDB "Mark as pending delivery"
}

// Scenario: Group chat (fan-out)
scenario SendGroupMessage "Send Group Message" {
    UserA -> ChatApp.ChatServer "Send 'Hello' to Group"
    ChatApp.ChatServer -> ChatApp.MessageDB "Persist message"
    ChatApp.ChatServer -> ChatApp.MessageQueue "Enqueue for fan-out"
    ChatApp.MessageQueue -> ChatApp.ChatServer "Process for each member"
    ChatApp.ChatServer -> ChatApp.SessionStore "Lookup each member's server"
    ChatApp.ChatServer -> UserB "Push to member 1"
    ChatApp.ChatServer -> UserC "Push to member 2"
    ChatApp.ChatServer -> UserD "Push to member 3"
}

// Scenario: Message history retrieval
scenario GetMessageHistory "Retrieve Message History" {
    UserA -> ChatApp.ChatServer "Request chat history"
    ChatApp.ChatServer -> ChatApp.MessageDB "Query messages"
    ChatApp.MessageDB -> ChatApp.ChatServer "Return messages"
    ChatApp.ChatServer -> UserA "Send history"
}

view index {
include *
}

Lesson 1: Design a Video Streaming Service

Goal: Design a video sharing platform like YouTube or Netflix where users can upload and watch videos.

Requirements

Functional

  • Upload videos.
  • Watch videos (streaming).
  • Support multiple resolutions (360p, 720p, 1080p).

Non-Functional

  • Reliability: No buffering.
  • Availability: Videos are always accessible.
  • Scalability: Handle millions of concurrent viewers.

Core Design

1. Storage (Blob Store)

Videos are large binary files (BLOBs), and traditional databases handle them poorly.

  • Object Storage: AWS S3, Google Cloud Storage.
  • Metadata: Store title, description, and S3 URL in a SQL/NoSQL DB.

2. Processing (Transcoding)

Raw uploads are huge. We need to convert them into different formats and resolutions.

  • Transcoding Service: Breaks video into chunks and encodes them (H.264, VP9).

3. Delivery (CDN)

Serving video from a single server is too slow for global users.

  • Content Delivery Network (CDN): Caches video chunks in edge servers close to the user.

4. Adaptive Bitrate Streaming (HLS/DASH)

The player automatically switches quality based on the user's internet speed.
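The client-side selection logic can be sketched as picking the highest rendition whose bitrate fits within measured bandwidth. The renditions, bitrates, and headroom factor below are illustrative, not from any real HLS/DASH manifest:

```python
# Sketch: adaptive bitrate selection on the client.
# Bitrates (kbps) are illustrative examples, not real manifest values.
RENDITIONS = [("360p", 800), ("720p", 2500), ("1080p", 5000)]

def pick_rendition(bandwidth_kbps: float, headroom: float = 0.8) -> str:
    """Choose the best quality that uses at most `headroom` of available bandwidth."""
    budget = bandwidth_kbps * headroom
    best = RENDITIONS[0][0]  # always fall back to the lowest rendition
    for name, bitrate in RENDITIONS:
        if bitrate <= budget:
            best = name
    return best
```

The player re-runs this choice per chunk, which is why quality visibly steps up or down as network conditions change.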


🛠️ Sruja Perspective: Modeling Infrastructure

We can use Sruja's deployment nodes to visualize the global distribution of content.

import { * } from 'sruja.ai/stdlib'


YouTube = system "Video Platform" {
  WebApp = container "Web App"
  API = container "API Server"

  Transcoder = container "Transcoding Service" {
    description "Converts raw video to HLS format"
    scale { min 50 }
  }

  S3 = database "Blob Storage" {
    description "Stores raw and processed video files"
  }

  MetadataDB = database "Metadata DB"

  CDNEdge = container "CDN Edge Cache" {
    description "Edge cache serving video chunks"
  }

  WebApp -> API "HTTPS"
  API -> MetadataDB "Reads/Writes"
  API -> S3 "Uploads"
  API -> Transcoder "Triggers"
  Transcoder -> S3 "Reads/Writes"
  Transcoder -> MetadataDB "Updates status"
}

// Deployment View
deployment GlobalInfra "Global Infrastructure" {
  node OriginDC "Origin Data Center" {
    containerInstance WebApp
    containerInstance API
    containerInstance Transcoder
    containerInstance S3
  }

  node CDN "CDN (Edge Locations)" "Cloudflare / Akamai" {
    containerInstance CDNEdge
    node USEast "US-East Edge"
    node Europe "Europe Edge"
    node Asia "Asia Edge"
  }
}

User = person "Viewer"

// Streaming Flow
scenario WatchVideo "User watches a video" {
  User -> YouTube.WebApp "Get Video Page"
  YouTube.WebApp -> YouTube.API "Get Metadata (Title, URL)"
  YouTube.API -> YouTube.MetadataDB "Query"
  YouTube.API -> User "Return Video Manifest URL"
  User -> YouTube.CDNEdge "Request Video Chunk (1080p)"
  YouTube.CDNEdge -> User "Stream Chunk"
}

// Upload Flow
scenario UploadVideo "Creator uploads a video" {
  User -> YouTube.WebApp "Upload Raw Video"
  YouTube.WebApp -> YouTube.API "POST /upload"
  YouTube.API -> YouTube.S3 "Store Raw Video"
  YouTube.API -> YouTube.Transcoder "Trigger Transcoding Job"
  YouTube.Transcoder -> YouTube.S3 "Read Raw / Write HLS"
  YouTube.Transcoder -> YouTube.MetadataDB "Update Video Status"
}

// Data flow: Video transcoding pipeline
flow TranscodingPipeline "Video Transcoding Data Flow" {
    YouTube.S3 -> YouTube.Transcoder "Streams raw video chunks"
    YouTube.Transcoder -> YouTube.Transcoder "Encodes to HLS (360p, 720p, 1080p)"
    YouTube.Transcoder -> YouTube.S3 "Writes encoded chunks"
    YouTube.Transcoder -> YouTube.MetadataDB "Updates manifest URLs"
}

// Data flow: Video delivery pipeline
flow DeliveryPipeline "Video Delivery Data Flow" {
    YouTube.S3 -> YouTube.CDNEdge "Replicates video chunks to edge"
    YouTube.CDNEdge -> User "Streams chunks on demand"
    User -> YouTube.CDNEdge "Requests next chunk based on bandwidth"
    YouTube.CDNEdge -> YouTube.S3 "Cache miss: fetch from origin"
}

// Data flow: Analytics pipeline
flow AnalyticsPipeline "Video Analytics Data Flow" {
    YouTube.WebApp -> YouTube.API "Sends view events"
    YouTube.API -> YouTube.MetadataDB "Updates view count"
    YouTube.API -> YouTube.MetadataDB "Stores watch time"
    YouTube.MetadataDB -> YouTube.API "Aggregates analytics"
}

view index {
title "Complete Video Platform"
include *
}

// Data flow view: Focus on data pipelines
view dataflow {
title "Data Flow View"
include YouTube.Transcoder YouTube.S3 YouTube.MetadataDB
exclude YouTube.WebApp YouTube.API
description "Shows data transformation and storage flows"
}

// Processing view: Transcoding pipeline
view processing {
title "Processing Pipeline"
include YouTube.Transcoder YouTube.S3
exclude YouTube.WebApp YouTube.API YouTube.MetadataDB
description "Focuses on video processing components"
}

Lesson 1: Design a Distributed Counter

Goal: Design a system to count events (e.g., YouTube views, Facebook likes) at a massive scale (e.g., 1 million writes/sec).

The Problem with a Single Database

A standard SQL database (like PostgreSQL) can handle ~2k-5k writes/sec. If we try to update a single row (UPDATE videos SET views = views + 1 WHERE id = 123) for every view, the database will lock the row and become a bottleneck.

Solutions

1. Sharding (Write Splitting)

Instead of one counter, have $N$ counters for the same video.

  • Randomly pick a counter from $1$ to $N$ and increment it.
  • Total Views = Sum of all $N$ counters.
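The sharding idea above can be sketched directly: spread increments across N shards so concurrent writers rarely contend on the same row, and sum the shards on read. A minimal in-memory sketch (in a real system each shard would be a separate database row):

```python
import random

class ShardedCounter:
    """Spread increments across N shards to avoid one hot row."""
    def __init__(self, num_shards: int):
        self.shards = [0] * num_shards  # each entry stands in for a DB row

    def increment(self) -> None:
        # Pick a random shard so concurrent writers contend 1/N as often.
        self.shards[random.randrange(len(self.shards))] += 1

    def total(self) -> int:
        # Reads are rarer and can afford to sum all shards.
        return sum(self.shards)
```

The trade-off: writes scale roughly N-fold, but reads now touch N rows, which is why this pairs well with a read cache.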

2. Write-Behind (Batching)

Don't write to the DB immediately.

  • Store counts in memory (Redis) or a log (Kafka).
  • A background worker aggregates them and updates the DB every few seconds.
  • Trade-off: If the server crashes before flushing, you lose a few seconds of data (Eventual Consistency).
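The write-behind pattern above can be sketched as an in-memory aggregator plus a periodic flush. The class and field names are illustrative, and a plain `Counter` stands in for both the buffer and the database table:

```python
from collections import Counter

class WriteBehindAggregator:
    """Buffer view events in memory and flush aggregated counts in batches."""
    def __init__(self):
        self.pending = Counter()  # in-memory buffer (Redis/Kafka in production)
        self.db = Counter()       # stands in for the real counter table

    def record_view(self, video_id: str) -> None:
        self.pending[video_id] += 1  # O(1) in-memory write, no DB round trip

    def flush(self) -> None:
        """Background worker: one batched UPDATE per video instead of one per view."""
        for video_id, count in self.pending.items():
            self.db[video_id] += count
        self.pending.clear()
```

A crash between `record_view` and `flush` loses only the buffered window of events, which is the eventual-consistency trade-off stated above.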

🛠️ Sruja Perspective: Modeling Write Flows

We can use Sruja to model the "Write-Behind" architecture.

import { * } from 'sruja.ai/stdlib'


CounterService = system "View Counter" {
    API = container "Ingestion API" {
        technology "Rust"
        description "Receives 'view' events"
    }

    EventLog = queue "Kafka" {
        description "Buffers raw view events"
    }

    Worker = container "Aggregator" {
        technology "Python"
        description "Reads batch of events, sums them, updates DB"
        scale { min 5 }
    }

    DB = database "Counter DB" {
        technology "Cassandra"
        description "Stores final counts (Counter Columns)"
    }

    Cache = container "Read Cache" {
        technology "Redis"
        description "Caches total counts for fast reads"
    }

    API -> EventLog "Produces events"
    Worker -> EventLog "Consumes events"
    Worker -> DB "Updates counts"
    Worker -> Cache "Updates cache"
}

User = person "Viewer"

// Write Path (Eventual Consistency)
TrackView = scenario "User watches a video" {
    User -> CounterService.API "POST /view"
    CounterService.API -> CounterService.EventLog "Produce Event"
    CounterService.API -> User "202 Accepted"

    // Async processing
    CounterService.EventLog -> CounterService.Worker "Consume Batch"
    CounterService.Worker -> CounterService.DB "UPDATE views += batch_size"
    CounterService.Worker -> CounterService.Cache "Invalidate/Update"
}

view index {
include *
}

Lesson 2: Consistency via Constraints & Conventions

Why Constraints?

They document trade‑offs and prevent accidental coupling across services.

Sruja: Guardrails for Consistency

import { * } from 'sruja.ai/stdlib'


constraints {
rule "No cross‑service transactions"
rule "Idempotent event handlers"
}
conventions {
naming "kebab-case"
retries "Exponential backoff (max 3)"
}

view index {
include *
}

Practice

  • Add constraints that support your chosen consistency model.
  • Capture conventions for retries, idempotency, and naming.

Ecommerce Platform: Architecture and Operations

Overview

  • End-to-end view of building and operating a modern ecommerce platform
  • Balances product vision with technical architecture and operational excellence

Learning Goals

  • Model domain entities, flows, and services for ecommerce
  • Design modular architecture and platform capabilities
  • Plan SDLC, release management, and operational readiness
  • Govern changes with policies, SLOs, and compliance

Prerequisites

  • Familiarity with web services, data stores, messaging, and CI/CD

Course Structure

  • Module 1: Vision & Basics
  • Module 3: Architecture & Tech
  • Module 4: SDLC
  • Module 5: Ops
  • Module 6: Evolution
  • Module 7: Governance

Where to Start

  • Begin with Module 1 to align vision and fundamentals, then proceed in order

Module Overview: Vision & Basics

Goals:

  • Define project vision and scope
  • Identify key systems and stakeholders
  • Draft initial architecture map

Estimated time: 45–60 minutes

Checklist:

  • Vision statement
  • Stakeholders listed
  • Initial system map in Sruja

Lesson 1: Introduction to the Project

In this course, we are building Shopify-lite. Let's define what that means.

The Concept

We are building a multi-tenant e-commerce platform. This means a single instance of our software will serve multiple different online stores (tenants), each with their own products, orders, and customers.

Core Capabilities

Our system must support:

  1. Storefronts: Fast, SEO-friendly pages for browsing products.
  2. Admin Dashboard: Where merchants manage their inventory.
  3. Checkout: A secure, reliable way to take money.
  4. Inventory Management: Real-time stock tracking to prevent overselling.

The "Why"

Why build this? Because it touches on every hard problem in distributed systems:

  • Consistency: Inventory must be accurate.
  • Availability: Checkout must never go down.
  • Scalability: We need to handle flash sales.
  • Security: We are handling credit card data (PCI Compliance).

The Role of Sruja

Most tutorials start by running npx create-next-app. We will not do that yet.

We will start by creating a sruja file. Why? Because we need to agree on the structure before we get lost in the details. Sruja will be our shared whiteboard, our documentation, and our validator.

Lesson 2: Setting up the Workspace

Let's get our hands dirty. We will set up a professional project structure that separates our architectural definitions from our implementation code, and aligns with product requirements.

Real-World Scenario: Starting a New Product

Context: You're building Shopify-lite, a multi-tenant e-commerce platform. Before writing code, you need to:

  • Align engineering, product, and DevOps on the architecture
  • Document requirements alongside the design
  • Set up a structure that scales as the team grows

Product team needs: Clear documentation of what we're building and why.

Engineering team needs: Technical architecture that supports product goals.

DevOps team needs: Deployment and operational considerations from day one.

1. Directory Structure

Create a new directory for your project:

mkdir shopify-lite
cd shopify-lite

We will use the following structure (based on real-world best practices):

shopify-lite/
├── architecture/          # Sruja files live here
│   ├── main.sruja        # Main architecture
│   ├── requirements.sruja # Product requirements
│   └── deployment.sruja   # Deployment architecture
├── src/                   # Source code (Rust, Node, etc.)
├── docs/                  # Generated documentation
│   └── architecture.md    # Auto-generated from Sruja
├── .github/
│   └── workflows/
│       └── validate-architecture.yml  # CI/CD validation
└── README.md

Why this structure?

  • Separation of concerns: Architecture separate from code
  • Version control: Track architecture changes over time
  • CI/CD ready: Easy to integrate validation
  • Team collaboration: Product, engineering, and DevOps can all contribute

2. Installing Sruja

If you haven't already, install the Sruja CLI:

# From Git (requires Rust)
cargo install sruja-cli --git https://github.com/sruja-ai/sruja

# Or build from source: git clone https://github.com/sruja-ai/sruja.git && cd sruja && make build

# Verify installation
sruja --version

For DevOps: Add to your CI/CD pipeline (we'll cover this in Module 5).

3. Hello World: The Context View

Create your first file at architecture/main.sruja. We'll start with a high-level Context View to define the boundaries of our system.

Product Requirements First

Before modeling architecture, let's capture product requirements:

import { * } from 'sruja.ai/stdlib'


// Product Requirements (from product team)
requirement R1 functional "Merchants can create and manage online stores"
requirement R2 functional "Shoppers can browse products and make purchases"
requirement R3 functional "Platform processes payments securely"
requirement R4 nonfunctional "Platform must support 10,000+ stores"
requirement R5 nonfunctional "Checkout must complete in < 3 seconds"
requirement R6 nonfunctional "99.9% uptime SLA"

// Business Goals (for product/executive alignment)
metadata {
    businessGoal "Enable small businesses to sell online"
    targetMarket "Small to medium businesses (SMBs)"
    successMetrics "Number of active stores, GMV (Gross Merchandise Value)"
}

view index {
include *
}

The Architecture Context

Now let's model the system context:

import { * } from 'sruja.ai/stdlib'


// Product Requirements
requirement R1 functional "Merchants can create and manage online stores"
requirement R2 functional "Shoppers can browse products and make purchases"
requirement R3 functional "Platform processes payments securely"
requirement R4 nonfunctional "Platform must support 10,000+ stores"
requirement R5 nonfunctional "Checkout must complete in < 3 seconds"
requirement R6 nonfunctional "99.9% uptime SLA"

// 1. The System
Platform = system "E-Commerce Platform" {
    description "The core multi-tenant e-commerce engine that enables merchants to create stores and shoppers to make purchases."

    // Link to requirements
    requirement R1
    requirement R2
    requirement R3
    requirement R4
    requirement R5
    requirement R6
}

// 2. The Users (from product personas)
Merchant = person "Store Owner" {
    description "Small business owner who creates and manages their online store"
}
Shopper = person "Customer" {
    description "End customer who browses products and makes purchases"
}

// 3. External Systems (from product integrations)
Stripe = system "Payment Gateway" {
    external
    description "Third-party payment processor (PCI-compliant)"
}

EmailService = system "Email Service" {
    tags ["external"]
    description "Sends transactional emails (order confirmations, etc.)"
}

// 4. High-Level Interactions (user journeys)
Merchant -> Platform "Manages Store" {
    description "Creates products, manages inventory, views analytics"
}
Shopper -> Platform "Browses & Buys" {
    description "Browses products, adds to cart, completes checkout"
}
Platform -> Stripe "Processes Payments" {
    description "Secure payment processing for customer orders"
}
Platform -> EmailService "Sends Notifications" {
    description "Order confirmations, shipping updates"
}

// 5. Model user journeys as scenarios
ShopperCheckout = scenario "Shopper Checkout Journey" {
    Shopper -> Platform "Browses products"
    Shopper -> Platform "Adds items to cart"
    Shopper -> Platform "Initiates checkout"
    Platform -> Stripe "Processes payment"
    Stripe -> Platform "Confirms payment"
    Platform -> EmailService "Sends order confirmation"
    EmailService -> Shopper "Delivers confirmation email"
}

MerchantManagement = scenario "Merchant Store Management" {
    Merchant -> Platform "Logs into admin dashboard"
    Merchant -> Platform "Creates new product"
    Merchant -> Platform "Updates inventory"
    Merchant -> Platform "Views sales analytics"
}

// Executive view: Business context
view executive {
title "Executive Overview"
include Merchant
include Shopper
include Platform
include Stripe
include EmailService
}

// Product view: User journeys
view product {
title "Product View - User Experience"
include Merchant
include Shopper
include Platform
exclude Stripe
exclude EmailService
}

// Technical view: System integrations
view technical {
title "Technical View - System Integration"
include Platform Stripe EmailService
exclude Merchant Shopper
}

// Default view: Complete system
view index {
title "Complete System View"
include *
}

Why This Approach?

For Product Teams:

  • Requirements are visible and linked to architecture
  • Business goals are documented
  • Success metrics are clear

For Engineering:

  • Architecture shows what to build
  • Requirements guide implementation priorities
  • External dependencies are identified early

For DevOps:

  • Uptime SLA (R6) informs infrastructure planning
  • Performance requirements (R5) guide monitoring setup
  • Scale requirements (R4) inform capacity planning

4. Visualize It

Run the Sruja CLI to visualize your architecture:

# View the architecture diagram
sruja view architecture/main.sruja

# Or export to different formats
sruja export markdown architecture/main.sruja > docs/architecture.md
sruja export json architecture/main.sruja > docs/architecture.json

You should see a clean diagram showing:

  • Your platform in the center
  • Users (Merchant, Shopper) on the left
  • External systems (Stripe, EmailService) on the right
  • Interactions between them

5. Validate Your Architecture

Before moving forward, validate your architecture:

# Lint for errors
sruja lint architecture/main.sruja

# Check for orphan elements
sruja tree architecture/main.sruja

Common issues to watch for:

  • Missing relations (orphan elements)
  • Invalid references
  • Unclear descriptions

6. Set Up CI/CD (DevOps Best Practice)

Create .github/workflows/validate-architecture.yml:

name: Validate Architecture

on:
  push:
    paths:
      - "architecture/**"
  pull_request:
    paths:
      - "architecture/**"

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - name: Install Sruja
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja --locked
      - name: Validate Architecture
        run: sruja lint architecture/main.sruja
      - name: Generate Docs
        run: |
          sruja export markdown architecture/main.sruja > docs/architecture.md

Why this matters: Catches architecture errors before they reach production.

Key Takeaways

  1. Start with requirements: Document what you're building and why
  2. Model context first: Understand system boundaries before diving into details
  3. Link requirements to architecture: Show how architecture supports product goals
  4. Set up CI/CD early: Automate validation from day one
  5. Think about all stakeholders: Product, engineering, and DevOps all need different views

Exercise: Create Your Context View

Tasks:

  1. Create a new project directory
  2. Install Sruja CLI
  3. Create architecture/main.sruja with:
    • At least 3 product requirements
    • System context (your system, users, external systems)
    • High-level interactions
  4. Validate and visualize your architecture
  5. (Optional) Set up CI/CD validation

Time: 15 minutes
Module Overview: Architecture & Tech

Goals:

  • Model systems, containers, and key relations
  • Select technology stack per container
  • Capture ADRs for major choices

Estimated time: 60–90 minutes

Checklist:

  • C4 L1/L2 map
  • Container tech set
  • ADR drafts

Lesson 1: Monolith vs. Microservices

This is the most debated topic in software engineering. Should we start with a monolith or microservices?

The Trade-off

  • Monolith: Easier to develop, deploy, and debug. Harder to scale teams and components independently.
  • Microservices: Independent scaling and deployment. High complexity (network, consistency, observability).

Our Decision: Modular Monolith

For Shopify-lite, we will start with a Modular Monolith. We will have clear boundaries (modules) but deploy as a single unit initially. This gives us speed now and flexibility later.

Documenting with ADRs

We don't just make this decision; we document it so future engineers know why.

import { * } from 'sruja.ai/stdlib'


// Requirements that drive the architecture decision
requirement R1 functional "Must support 10,000+ stores"
requirement R2 performance "API response < 200ms p95"
requirement R3 scalability "Scale components independently"
requirement R4 development "Small team, need fast iteration"

// Architecture Decision Record
adr ArchitectureStyle "Modular Monolith Strategy" {
    status "Accepted"
    context "We are a small team building a new product. Speed is critical, but we need to scale to 10k+ stores."

    option "Microservices" {
        pros "Independent scaling, technology diversity"
        cons "High operational complexity, network latency, data consistency challenges"
    }

    option "Monolith" {
        pros "Simplest deployment, no network calls"
        cons "Cannot scale components independently, single point of failure"
    }

    option "Modular Monolith" {
        pros "Simple deployment, code sharing, clear boundaries"
        cons "Risk of tight coupling if not disciplined, harder to scale independently later"
    }

    decision "Modular Monolith"
    reason "We prioritize iteration speed for MVP. We will enforce boundaries using Sruja domains and can split to microservices later if needed."
    consequences "Faster initial development, may need refactoring to microservices at scale"
}

// Architecture that implements the decision
Platform = system "E-Commerce Platform" {
    description "Modular monolith - single deployment with clear module boundaries"
    adr ArchitectureStyle

    // Modules as containers (can be split to microservices later)
    StorefrontModule = container "Storefront Module" {
        technology "Next.js"
        description "Handles product browsing and storefronts"
    }

    AdminModule = container "Admin Module" {
        technology "Next.js"
        description "Merchant admin dashboard"
    }

    APIModule = container "API Module" {
        technology "Rust"
        description "Core business logic - can scale independently"
        scale {
            min 3
            max 50
            metric "cpu > 70%"
        }
    }

    OrderDB = database "Order Database" {
        technology "PostgreSQL"
        description "Stores orders and transactions"
    }

    StorefrontModule -> APIModule "Fetches product data"
    AdminModule -> APIModule "Manages inventory"
    APIModule -> OrderDB "Reads/Writes"
}

view index {
title "Platform Architecture Overview"
include *
}

// Module view: Show module boundaries
view modules {
title "Module View"
include Platform.StorefrontModule Platform.AdminModule Platform.APIModule Platform.OrderDB
}

// Scalability view: Focus on scalable components
view scalability {
title "Scalability View"
include Platform.APIModule Platform.OrderDB
}

Lesson 2: Selecting the Stack

We have our domains. Now we need to pick the tools to build them.

The Stack

  1. Frontend: Next.js (React) - Great for SEO and performance.
  2. Backend: Rust - Performance, safety, and great tooling for services.
  3. Database: PostgreSQL - Reliable, ACID compliant (critical for money).

Modeling in Sruja

We define these choices in our container and database definitions.

import { * } from 'sruja.ai/stdlib'


Platform = system "E-Commerce Platform" {
WebApp = container "Storefront & Admin" {
  technology "Next.js, TypeScript"
  description "The user-facing application."
}

API = container "Core API" {
  technology "Rust, Axum"
  description "REST API handling business logic."
}

Database = database "Primary DB" {
  technology "PostgreSQL 15"
  description "Stores orders, products, and users."
}

WebApp -> API "JSON/HTTPS"
API -> Database "SQL/TCP"
}

By documenting technology, we make it clear to new developers what skills they need.

Lesson 3: API-First Design

Before frontend and backend teams start working, they need to agree on the API. This is where API-First Design comes in.

API-First Design

Instead of writing code and then documenting it, we define the API schema first using OpenAPI. This allows frontend devs to mock the API while backend devs build it.
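As a hedged illustration, a minimal OpenAPI fragment for an add-to-cart endpoint might look like the following. The paths, field names, and file name are assumptions for this course, not part of any real codebase.

```yaml
# openapi.yaml (hypothetical fragment), written before any backend code exists
openapi: 3.0.3
info:
  title: Shopify-lite Core API
  version: 0.1.0
paths:
  /cart/items:
    post:
      summary: Add an item to the cart
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [productId, quantity]
              properties:
                productId: { type: string }
                quantity: { type: integer, minimum: 1 }
      responses:
        "201":
          description: Item added
```

Frontend developers can mock against this contract while backend developers implement it.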

Sruja's Role: Architecture Modeling

Sruja models which services exist and how they connect. For detailed API schemas (endpoints, request/response structures), use OpenAPI/Swagger.

import { * } from 'sruja.ai/stdlib'


customer = person "Customer"

ecommerce = system "E-Commerce Platform" {
  api = container "Core API" {
    technology "Rust, Axum"
    // API schemas defined in openapi.yaml
  }

  orderDB = database "Order Database" {
    technology "PostgreSQL"
  }

  api -> orderDB "reads and writes to"
}

customer -> ecommerce.api "uses"

view index {
  title "E-Commerce Architecture"
  include *
}

Best Practice: Separation of Concerns

  1. Sruja: Models architecture (services, containers, relationships)
  2. OpenAPI: Defines API schemas (endpoints, request/response structures)
  3. Together: Architecture shows the big picture, OpenAPI shows the details

Why this matters

  1. Right tool for the job: Architecture modeling vs. API specification
  2. Industry standard: OpenAPI is widely supported by tools and frameworks
  3. Code Generation: Generate Rust types, TypeScript interfaces, and client SDKs from OpenAPI

Lesson 4: API Design & Integration

Why API Design Matters

Well-designed APIs define stable interfaces between services; they reduce coupling and surprises. However, API schemas belong in OpenAPI, not in Sruja.

Sruja's Role: Architecture Modeling

Sruja focuses on architectural concerns: which services exist, how they relate, and what they do. For detailed API schemas, use OpenAPI/Swagger.

import { * } from 'sruja.ai/stdlib'


customer = person "Customer"

ecommerce = system "E-Commerce Platform" {
  api = container "Checkout API" {
    technology "Rust, Axum"
    // API details defined in openapi.yaml
  }

  events = queue "Order Events" {
    technology "Kafka"
    // Event schemas defined in AsyncAPI or JSON Schema
  }
}

customer -> ecommerce.api "uses"

Best Practice

  1. Model architecture in Sruja: Services, containers, relationships
  2. Define API schemas in OpenAPI: Request/response structures, endpoints
  3. Link them: Reference OpenAPI files in your architecture documentation

Practice

  • Model the AddToCart service in Sruja
  • Create an OpenAPI spec for the AddToCart endpoint
  • Show how they work together

Lesson 1: The Local Loop

How do you use Sruja while you code?

1. The Blueprint

Keep architecture/main.sruja open in a split pane. It is your map. Before you create a new file or function, verify where it fits in the architecture.

2. Generating Boilerplate (Future)

Imagine running sruja gen to scaffold your Rust microservices (or services in other languages) based on your container definitions. While this feature is in development, you can manually align your folder structure to your architecture.

src/
  orders/      # Matches 'container OrderService'
  inventory/   # Matches 'container InventoryService'

3. Local Validation

Before you commit, run:

sruja validate .

This checks for:

  • Orphans: Components defined but never used.
  • Broken Links: Relations pointing to non-existent elements.
  • Policy Violations: Did you accidentally introduce a circular dependency?
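The circular-dependency check in particular is easy to picture. Here is a minimal sketch of the kind of detection a validator performs; the graph shape and function name are illustrative, not Sruja internals.

```python
# Illustrative only: detect a circular dependency among relations.
# `relations` maps an element ID to the IDs it depends on.
def has_cycle(relations):
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:  # back-edge: we returned to a node on the current path
            return True
        visiting.add(node)
        if any(visit(dep) for dep in relations.get(node, [])):
            return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(visit(n) for n in relations)

# Checkout -> Payment -> Orders -> Checkout is a cycle:
print(has_cycle({"Checkout": ["Payment"], "Payment": ["Orders"], "Orders": ["Checkout"]}))  # True
print(has_cycle({"Checkout": ["Payment"], "Payment": ["Orders"]}))  # False
```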

Lesson 2: Environments

Your software runs differently in Production than it does on your laptop. Sruja models this using Deployment Nodes.

Modeling Production

deployment Production "AWS Production" {
    node Region "US-East-1" {
        node K8s "EKS Cluster" {
            containerInstance WebApp
            containerInstance API
        }
        
        node DB "RDS Postgres" {
            containerInstance Database
        }
    }
}

Modeling Local Dev

deployment Local "Docker Compose" {
    node Laptop "My MacBook" {
        containerInstance WebApp
        containerInstance API
        containerInstance Database
    }
}

Why model this?

It helps you visualize the physical differences. Maybe in Prod you have a Load Balancer that doesn't exist locally. Sruja makes these differences explicit.

Lesson 3: CI/CD Pipeline

Architecture compliance shouldn't be a manual review process. It should be a build step.

The Pipeline

In your GitHub Actions or Jenkins pipeline, add a step to install and run Sruja.

steps:
  - name: Checkout
    uses: actions/checkout@v6

  - name: Install Sruja
    run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja --locked

  - name: Validate Architecture
    run: sruja validate architecture/

Breaking the Build

If a developer introduces a violation (e.g., "Frontend talks directly to Database"), sruja validate will exit with a non-zero code, failing the build.

This is Governance as Code. You stop architectural drift before it merges.

Lesson 1: Deployment Strategies

Real-World Problem: Black Friday Deployment

Scenario: You need to deploy a critical payment fix on Black Friday morning. The system handles $10M/hour in transactions. How do you deploy without risking downtime?

Wrong approach: Deploy directly to production and hope nothing breaks.

Right approach: Use a proven deployment strategy that minimizes risk.

Why Deployment Strategies Matter

Industry statistics:

  • 60% of outages are caused by bad deployments (Gartner, 2023)
  • Average cost of downtime: $5,600/minute for large enterprises
  • 99.9% uptime = 8.76 hours downtime/year (still too much for critical systems)
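The uptime arithmetic is worth internalizing; a quick way to turn an availability target into an annual downtime budget:

```python
# Convert an availability target (percent) into an annual downtime budget.
def downtime_hours_per_year(availability_pct):
    return (1 - availability_pct / 100) * 365 * 24

print(round(downtime_hours_per_year(99.9), 2))   # 8.76 hours/year
print(round(downtime_hours_per_year(99.99), 2))  # 0.88 hours/year (~53 minutes)
```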

Product team perspective: Every minute of downtime means lost revenue, frustrated customers, and damaged reputation.

DevOps perspective: Need automated, repeatable, safe deployment processes.

Blue/Green Deployment

Concept

You have two identical environments (Blue and Green). One is live, the other is idle. You deploy to the idle one, test it, and then switch traffic.

Real-World Example: E-Commerce Platform

import { * } from 'sruja.ai/stdlib'


ECommerce = system "E-Commerce Platform" {
    API = container "REST API" {
        technology "Rust"
        scale {
            min 10
            max 200
        }
    }
    PaymentService = container "Payment Service" {
        technology "Rust"
        description "Critical: Processes all payments"
    }
    OrderDB = database "Order Database" {
        technology "PostgreSQL"
    }
}

deployment Production "Production Environment" {
    node Blue "Active Cluster (Blue)" {
        containerInstance API {
            replicas 50
            traffic 100
            status "active"
        }
        containerInstance PaymentService {
            replicas 20
            traffic 100
        }
        containerInstance OrderDB {
            role "primary"
        }
    }

    node Green "Staging Cluster (Green)" {
        containerInstance API {
            replicas 50
            traffic 0
            status "ready"
        }
        containerInstance PaymentService {
            replicas 20
            traffic 0
            status "ready"
        }
        containerInstance OrderDB {
            role "standby"
            description "Synced from Blue, ready for switch"
        }
    }
}

view index {
include *
}

DevOps Workflow

  1. Deploy to Green: Deploy new version to idle Green environment
  2. Smoke Tests: Run automated health checks and integration tests
  3. Load Testing: Verify Green can handle production load
  4. Switch Traffic: Use load balancer to route 100% traffic to Green
  5. Monitor: Watch metrics for 30 minutes
  6. Rollback Plan: Keep Blue ready for instant rollback if issues occur
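The switch-and-rollback steps above can be sketched as simple state transitions. The class and method names here are illustrative; a real setup would drive a load balancer or DNS API.

```python
# Illustrative blue/green controller state, not a real load-balancer API.
class BlueGreen:
    def __init__(self):
        self.active, self.idle = "blue", "green"

    def switch(self):
        # Route 100% of traffic to the previously idle environment.
        self.active, self.idle = self.idle, self.active

    def rollback(self):
        # Rollback is just switching back: the old environment is still warm.
        self.switch()

bg = BlueGreen()
bg.switch()       # new version was deployed to green; make it live
print(bg.active)  # green
bg.rollback()     # metrics degraded within the monitoring window
print(bg.active)  # blue
```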

When to Use Blue/Green

Good for:

  • Critical services (payment, authentication)
  • Stateful applications with database replication
  • Zero-downtime requirements
  • Large, infrequent deployments

Not ideal for:

  • Frequent small deployments (wasteful)
  • Stateless services (Canary is better)
  • Limited infrastructure budget

Cost Consideration

Example: Running duplicate production environment

  • Cost: 2x infrastructure during deployment window
  • Typical window: 1-2 hours
  • Trade-off: Higher cost for lower risk

Canary Deployment

Concept

You roll out the new version to a small percentage of users (e.g., 5%) and monitor for errors. Gradually increase if metrics look good.
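The replica counts in a canary setup follow directly from the traffic split. A quick sketch of the arithmetic, with a hypothetical helper name:

```python
# Split a replica count between canary and stable pools for a given traffic share.
def canary_split(total_replicas, canary_pct):
    canary = max(1, round(total_replicas * canary_pct / 100))
    return canary, total_replicas - canary

print(canary_split(40, 5))  # (2, 38): 5% canary, 95% stable
```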

Real-World Example: API Service

import { * } from 'sruja.ai/stdlib'


API = system "REST API" {
    APIv1 = container "API v1.2.3" {
        technology "Rust"
        description "Current stable version"
    }
    APIv2 = container "API v1.2.4" {
        technology "Rust"
        description "New version with performance improvements"
    }
}

deployment Production "Production Environment" {
    node Canary "Canary Cluster" {
        containerInstance APIv2 {
            replicas 2
            traffic 5
            description "5% of traffic, monitoring error rate"
            metadata {
                maxErrorRate "1%"
                rollbackTrigger "error_rate > 1% or latency_p95 > 500ms"
            }
        }
    }

    node Stable "Stable Cluster" {
        containerInstance APIv1 {
            replicas 38
            traffic 95
        }
    }
}

view index {
include *
}

Gradual Rollout Strategy

Document the rollout plan in metadata:

import { * } from 'sruja.ai/stdlib'


ECommerce = system "E-Commerce Platform" {
API = container "API Service" {
  metadata {
    deploymentStrategy "Canary"
    rolloutSteps "5% → 25% → 50% → 100%"
    stepDuration "15 minutes per step"
    monitoringWindow "15 minutes between steps"
    rollbackCriteria "error_rate > 1% OR latency_p95 > 500ms OR cpu > 90%"
  }
}
}

view index {
include *
}
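Criteria strings like the rollbackCriteria above only help if something evaluates them. A hedged sketch of the check a deployment controller might run, with the thresholds copied from the metadata and everything else illustrative:

```python
# Evaluate: error_rate > 1% OR latency_p95 > 500ms OR cpu > 90%.
def should_rollback(metrics):
    return (metrics["error_rate_pct"] > 1.0
            or metrics["latency_p95_ms"] > 500
            or metrics["cpu_pct"] > 90)

healthy = {"error_rate_pct": 0.2, "latency_p95_ms": 180, "cpu_pct": 55}
degraded = {"error_rate_pct": 0.2, "latency_p95_ms": 750, "cpu_pct": 55}
print(should_rollback(healthy))   # False
print(should_rollback(degraded))  # True
```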

Real-World Rollout Timeline

Example: Deploying new API version

10:00 AM - Deploy to Canary (5% traffic)
10:15 AM - Monitor: Error rate 0.2%, Latency p95: 180ms ✅
10:15 AM - Increase to 25% traffic
10:30 AM - Monitor: Error rate 0.3%, Latency p95: 195ms ✅
10:30 AM - Increase to 50% traffic
10:45 AM - Monitor: Error rate 0.4%, Latency p95: 210ms ✅
10:45 AM - Increase to 100% traffic
11:00 AM - Deployment complete

When to Use Canary

Good for:

  • Stateless services
  • Frequent deployments (multiple per day)
  • A/B testing new features
  • Performance-sensitive changes
  • Limited infrastructure budget

Not ideal for:

  • Database schema changes (requires coordination)
  • Breaking API changes (incompatible versions)
  • Services with complex state

Rolling Deployment

Concept

Gradually replace old instances with new ones, one at a time.

deployment Production "Production Environment" {
    node Cluster "Kubernetes Cluster" {
        containerInstance API {
            replicas 20
            strategy "rolling"
            maxUnavailable 1
            maxSurge 2
            description "Replace 1 pod at a time, max 1 unavailable"
        }
    }
}

When to Use Rolling

Good for:

  • Kubernetes-native deployments
  • Stateless microservices
  • Cost-effective (no duplicate infrastructure)
  • Automated rollback via health checks

Feature Flags: Deployment Strategy Alternative

Sometimes you don't need a deployment strategy—use feature flags instead:

import { * } from 'sruja.ai/stdlib'


Platform = system "Platform" {
FeatureFlags = container "Feature Flag Service" {
  technology "LaunchDarkly, Split.io"
  description "Controls feature rollout without deployment"
}

API = container "API Service" {
  // Feature flags: newPaymentFlow (10% rollout), experimentalSearch (5% rollout)
}
}

view index {
include *
}

Use case: Deploy code with new feature disabled, then gradually enable via feature flags.

Monitoring During Deployment

Model your observability during deployments:

import { * } from 'sruja.ai/stdlib'


Observability = system "Observability Stack" {
Prometheus = container "Metrics" {
  description "Tracks error rate, latency, throughput during deployment"
}
AlertManager = container "Alerting" {
  description "Alerts on deployment issues"
}
}

// Link monitoring to deployment
deployment Production "Production Environment" {
    // Monitoring: error_rate, latency_p95, cpu_usage, request_rate
    // Alert thresholds: errorRate > 1%, latencyP95 > 500ms, cpuUsage > 90%
    // Rollback automation enabled
}

Real-World Case Study: Netflix Canary Deployment

Challenge: Deploy to 100M+ users without downtime

Solution:

  • Canary deployment to 1% of users
  • Automated analysis of 50+ metrics
  • Automatic rollback if any metric degrades
  • Gradual rollout over 6 hours

Result: 99.99% deployment success rate

Key Takeaways

  1. Choose the right strategy: Blue/Green for critical, Canary for frequent, Rolling for cost-effective
  2. Automate everything: Use CI/CD pipelines to automate deployment and rollback
  3. Monitor aggressively: Track error rates, latency, and resource usage during deployment
  4. Have a rollback plan: Always be ready to rollback within minutes
  5. Document in Sruja: Model your deployment strategy so teams understand the process

Exercise: Design a Deployment Strategy

Scenario: You're deploying a new checkout flow for an e-commerce platform. The system processes $1M/hour.

Tasks:

  1. Choose a deployment strategy (Blue/Green, Canary, or Rolling)
  2. Model it in Sruja with deployment nodes
  3. Add monitoring and rollback criteria
  4. Document the rollout timeline

Time: 15 minutes

Further Reading

Lesson 2: Debugging Performance

Scenario: It's Black Friday. The "Checkout" page is loading in 5 seconds. Why?

The Wrong Way

Start reading random logs or guessing which database query is slow.

The Structural Way

Look at your Sruja User Journey for "Purchase".

import { * } from 'sruja.ai/stdlib'


Customer = person "Customer"

Platform = system "E-Commerce Platform" {
Checkout = container "Checkout Service"
PaymentWorker = container "Payment Worker"
PaymentQueue = queue "Payment Jobs"
}

PaymentGateway = system "Payment Gateway" {
metadata {
    tags ["external"]
  }
}

// Original synchronous flow (problematic)
Purchase = story "User Purchase Flow" {
Customer -> Platform.Checkout "Initiates checkout"
Platform.Checkout -> PaymentGateway "Process Payment" {
  latency "2s"
}
PaymentGateway -> Customer "Returns confirmation"
}

Wait, the PaymentGateway call is synchronous and takes 2 seconds? And it's in the critical path of the user request?

Root Cause: We are blocking the user while waiting for the bank.

The Fix: Asynchronous Processing

We need to decouple the user request from the payment processing.

  1. Introduce a Queue: The Checkout service puts a message on a queue.
  2. Worker: A background worker processes the payment.
  3. Update: The frontend polls for status or uses WebSockets.

Let's update the architecture:

import { * } from 'sruja.ai/stdlib'


Customer = person "Customer"

Platform = system "E-Commerce Platform" {
Checkout = container "Checkout Service"
PaymentWorker = container "Payment Worker"
PaymentQueue = queue "Payment Jobs"
}

PaymentGateway = system "Payment Gateway" {
metadata {
    tags ["external"]
  }
}

// Updated asynchronous flow
Customer -> Platform.Checkout "Initiates checkout"
Platform.Checkout -> Platform.PaymentQueue "Enqueues payment job" {
latency "10ms"
}
Platform.PaymentQueue -> Platform.PaymentWorker "Processes async"
Platform.PaymentWorker -> PaymentGateway "Processes payment"
PaymentGateway -> Customer "Sends confirmation email"

// Updated scenario
PurchaseAsync = story "Asynchronous Purchase Flow" {
Customer -> Platform.Checkout "Initiates checkout"
Platform.Checkout -> Platform.PaymentQueue "Enqueues job" {
  latency "10ms"
}
Platform.PaymentQueue -> Platform.PaymentWorker "Processes async"
Platform.PaymentWorker -> PaymentGateway "Processes payment"
PaymentGateway -> Customer "Sends confirmation"
}

view index {
title "Payment Processing Architecture"
include *
}

// Performance view: Focus on async flow
view performance {
title "Performance View - Async Processing"
include Platform.Checkout Platform.PaymentQueue Platform.PaymentWorker PaymentGateway
exclude Customer
}

By visualizing the flow, the bottleneck (and the fix) becomes obvious.

Lesson 3: Observability

You can't fix what you can't see. Observability is about understanding the internal state of your system from the outside.

The Three Pillars

  1. Logs: "What happened?" (Error: Payment Failed)
  2. Metrics: "How often?" (Error Rate: 5%)
  3. Traces: "Where?" (Checkout -> API -> DB)

Mapping to Sruja

Your Sruja components should map 1:1 to your observability dashboards.

  • System OrderService -> Dashboard Order Service Overview
  • Container Database -> Metric postgres_cpu_usage

Standardizing with Policies

You can enforce observability standards using Sruja Policies.

policy Observability "Must have metrics" {
    rule "HealthCheck" {
        check "all containers must have health check endpoint"
    }
}

This ensures every new service you build comes with the necessary hooks for monitoring.

Lesson 4: Ops SLOs & Monitoring

SLOs in Ops

Translate business expectations into measurable targets; build dashboards around them.

Sruja: Define SLOs

import { * } from 'sruja.ai/stdlib'


slo {
availability { target "99.9%" window "30 days" }
latency { p95 "200ms" window "7 days" }
errorRate { target "< 0.1%" window "30 days" }
throughput { target "1000 req/s" window "1 hour" }
}

view index {
include *
}
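Each SLO implies an error budget. A quick sketch of the budget behind the 99.9% / 30-day availability target:

```python
# Error budget (allowed downtime) for an availability target over a window.
def error_budget_minutes(target_pct, window_days):
    return (1 - target_pct / 100) * window_days * 24 * 60

print(round(error_budget_minutes(99.9, 30), 1))  # 43.2 minutes of allowed downtime
```

Alerts that fire long before the budget is spent are noise; alerts that fire only after it is gone are too late.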

Practice

  • Set p95 latency targets for checkout and search.
  • Map alerts to SLO windows; define runbooks for breaches.

Lesson 1: The "Good" Problem

You have too many users. Your single database is melting. It's time to scale.

The Bottleneck

Our monitoring (Module 5) shows that Inventory checks are 80% of the database load.

The Refactor: Splitting the Monolith

We decide to extract Inventory into its own microservice with its own database.

Step 1: Update the Architecture

We change Inventory from a logical domain inside the monolith to a physical system.

// Before
domain Inventory { ... }

// After
InventoryService = system "Inventory Microservice" {
API = container "Inventory API"
Database = database "Inventory DB"
}

Step 2: Update the Interactions

The OrderService can no longer call Inventory functions directly. It must make a gRPC call. Update your interface definitions (Protobuf/gRPC IDL) to reflect the new remote boundary.

import { * } from 'sruja.ai/stdlib'


OrderService = system "Order Service" {
// ...
}

OrderService -> InventoryService "gRPC CheckStock"

Why Sruja helps

Refactoring is dangerous. Sruja helps you visualize the impact of the change before you write code. You can see exactly which other systems depend on Inventory and ensure you don't break them.

Lesson 2: Managing Technical Debt

Every codebase has skeletons. The key is to label them.

Deprecating Components

We decided to move from Stripe to Adyen for lower fees. But we can't switch overnight.

import { * } from 'sruja.ai/stdlib'


Stripe = system "Legacy Payment Gateway" {
metadata {
  tags ["deprecated"]
}
description "Do not use for new features. Migration in progress."
}

Adyen = system "New Payment Gateway" {
metadata {
  tags ["preferred"]
}
}

Governance Policies

We can enforce this with a policy!

// EXPECTED_FAILURE: Policy rules not yet implemented - rule keyword is deferred feature
policy Migration "No New Stripe Integrations" {
    rule "BanStripe" {
        // Pseudo-code: Fail if any NEW relation points to Stripe
        check "relation.to != 'Stripe'"
    }
}

This prevents developers from accidentally adding dependencies to the system you are trying to retire.

Lesson 1: Security by Design

Security isn't something you "add on" at the end. It must be baked into the architecture.

The Requirement

GDPR Article 32: Personal data must be encrypted.

Modeling Security Signals

Use tags and metadata to make security posture explicit.

import { * } from 'sruja.ai/stdlib'


Shop = system "Shop" {
  UserDB = database "User DB" {
    tags ["pii", "encrypted"]
    metadata {
      retention "90d"
    }
  }
}

view index {
include *
}

Validating in CI

Run sruja validate in CI to enforce architectural rules (unique IDs, valid references, layering, external boundary checks). Combine with linters to flag missing tags for sensitive resources. This is Compliance as Code.

Lesson 2: Cost Optimization

Cloud bills kill startups. Sruja helps you visualize where the money is going.

Modeling Cost

We can add cost metadata to our deployment nodes.

deployment Production {
    node DB "RDS Large" {
        metadata {
            cost "$500/month"
            type "db.r5.large"
        }
    }
}

Cost Policies

Use metadata and CI checks to prevent expensive mistakes in non‑production environments.

deployment Dev {
    node App "Small Instance" {
        metadata {
            cost "$20/month"
            type "t3.small"
        }
    }
}

Add a CI rule to flag dev nodes exceeding budget thresholds.
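One way to implement such a rule: export the model to JSON and scan node metadata for cost strings over a budget. The JSON shape and field names below are assumptions for illustration, not a documented Sruja export schema.

```python
# Hypothetical CI budget check over an exported architecture model.
# Assumes nodes carry a metadata "cost" string like "$20/month".
import re

def over_budget(nodes, budget_per_month):
    violations = []
    for node in nodes:
        cost = node.get("metadata", {}).get("cost", "")
        m = re.search(r"\$(\d+(?:\.\d+)?)", cost)
        if m and float(m.group(1)) > budget_per_month:
            violations.append(node["id"])
    return violations

dev_nodes = [
    {"id": "App", "metadata": {"cost": "$20/month", "type": "t3.small"}},
    {"id": "BigDB", "metadata": {"cost": "$500/month", "type": "db.r5.large"}},
]
print(over_budget(dev_nodes, budget_per_month=100))  # ['BigDB']
```

Run it in CI and fail the build when the returned list is non-empty.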

Lesson 3: Audit Readiness (SOC 2)

Congratulations! You are now big enough to need a SOC 2 audit. The auditor asks: "Show me your system diagram and prove that all changes are reviewed."

Sruja as Evidence

Instead of scrambling to draw a whiteboard diagram, you point them to your Sruja repository.

  1. Current State: The generated diagram is always up to date.
  2. Change History: Git history shows every architectural change, who made it, and who approved it (via Pull Request).
  3. Controls: Your validation and CI checks prove that you have automated controls for security and compliance.

The End

You have built a scalable, secure, and compliant e-commerce platform. And you did it with a blueprint.

Happy Architecting!

System Design Interview Mastery

Crack your next system design interview. This course teaches you how to approach, design, and model systems that impress interviewers at FAANG companies and top tech firms.

Why This Course?

System design interviews are the most important part of senior engineering interviews. This course gives you:

  • Real interview questions from top companies
  • Step-by-step approach to tackle any design question
  • Sruja modeling to visualize your designs
  • Best practices that interviewers look for
  • Common pitfalls to avoid

What You'll Learn

  • Interview Strategy: How to approach system design questions systematically
  • Scaling & Performance: Handle "design for 1M users" questions confidently
  • Architecture Patterns: Microservices, caching, load balancing, and more
  • Trade-offs: Make informed decisions and explain them clearly
  • Modeling with Sruja: Use Sruja to visualize and communicate your designs

Who This Course Is For

  • Software engineers preparing for senior/staff level interviews
  • Candidates targeting FAANG and top tech companies
  • Engineers who want to improve their system design skills
  • Anyone preparing for architecture/design interviews

Course Structure

Module 1: Performance & Scalability Interview Questions

Master the most common interview questions about scaling and performance.

Interview Questions Covered:

  • "Design a video streaming platform like YouTube"
  • "How would you handle 10M concurrent users?"
  • "Design a system with < 200ms latency"

Module 2: Modular Architecture & Microservices

Tackle complex system design questions requiring distributed systems knowledge.

Interview Questions Covered:

  • "Design an e-commerce platform"
  • "Design a ride-sharing service like Uber"
  • "Design a social media feed"

Module 3: Governance & Policies (Senior/Staff Level)

Answer questions about compliance, governance, and architectural standards.

Interview Questions Covered:

  • "How do you ensure compliance (HIPAA, SOC 2)?"
  • "How do you enforce architectural standards?"
  • "Design a system that must comply with regulations"

Prerequisites

  • Completed System Design 101 or equivalent
  • Familiarity with basic Sruja syntax
  • Understanding of basic system design concepts

Estimated Time

4-5 hours (includes practice exercises)

Interview Success Framework

Each module follows this proven approach:

  1. Understand the Question - Clarify requirements and scope
  2. Design the System - High-level architecture first
  3. Model with Sruja - Visualize your design
  4. Deep Dive - Discuss scaling, trade-offs, and edge cases
  5. Optimize - Improve based on feedback

Learning Outcomes

By the end of this course, you'll be able to:

  • ✅ Approach any system design question with confidence
  • ✅ Design scalable systems that handle millions of users
  • ✅ Explain trade-offs and make informed decisions
  • ✅ Use Sruja to communicate your designs clearly
  • ✅ Avoid common interview mistakes
  • ✅ Impress interviewers with production-ready thinking

Real Interview Questions You'll Master

  • Design a URL shortener (bit.ly)
  • Design a video streaming service (YouTube/Netflix)
  • Design a ride-sharing service (Uber/Lyft)
  • Design a social media feed (Twitter/Instagram)
  • Design a chat application (WhatsApp/Slack)
  • Design a search engine (Google)
  • Design a payment system (Stripe)
  • Design a distributed cache (Redis)

Ready to ace your next interview? Let's get started! 🎯

Module Overview: Scaling & Performance Interview Questions

"Design a system that handles 10 million concurrent users."

This is one of the most common system design interview questions. In this module, you'll learn how to answer scaling and performance questions confidently.

Interview Questions You'll Master

  • "Design a video streaming platform like YouTube/Netflix"
  • "How would you design a system to handle 10M concurrent users?"
  • "Design a system with < 200ms latency"
  • "How do you ensure high availability (99.99%)?"

What Interviewers Look For

  • ✅ Understanding of horizontal vs vertical scaling
  • ✅ Ability to estimate capacity and scale
  • ✅ Knowledge of performance metrics (latency, throughput)
  • ✅ Trade-off analysis (cost vs performance)
  • ✅ Clear communication of your design

Goals

  • Answer scaling questions with confidence
  • Use scale blocks to model auto-scaling strategies
  • Define SLOs to show production-ready thinking
  • Explain trade-offs clearly to interviewers

Interview Framework

We'll follow this approach for each question:

  1. Clarify Requirements - Ask about scale, latency, availability
  2. Design High-Level - Start with core components
  3. Model with Sruja - Visualize your architecture
  4. Discuss Scaling - Show how it handles load
  5. Optimize - Improve based on constraints

Estimated Time

60-75 minutes (includes practice)

Checklist

  • Understand how to approach scaling questions
  • Model scaling strategies with Sruja
  • Define SLOs to show production thinking
  • Practice explaining trade-offs clearly

Lesson 1: Interview Question - Design a Video Streaming Platform

The Interview Question

"Design a video streaming platform like YouTube or Netflix that can handle millions of concurrent viewers."

This is a classic system design interview question asked at Google, Netflix, and other top companies. Let's break it down step-by-step.

Step 1: Clarify Requirements (What Interviewers Want to Hear)

Before jumping into design, always clarify:

You should ask:

  • "What's the scale? How many concurrent viewers?"
  • "What's the latency requirement? How fast should videos start?"
  • "What types of videos? Short clips or full movies?"
  • "Do we need to support live streaming or just on-demand?"

Interviewer's typical answer:

  • "Let's say 10 million concurrent viewers"
  • "Videos should start within 2 seconds"
  • "Both short clips and full movies"
  • "Focus on on-demand for now"

Step 2: Design the High-Level Architecture

Start with the core components:

  1. Client (mobile app, web browser)
  2. CDN (Content Delivery Network) - serves videos
  3. Origin Server - stores original videos
  4. API Server - handles metadata, user requests
  5. Database - stores video metadata, user data

Step 3: Model with Sruja

Let's model this architecture:

import { * } from 'sruja.ai/stdlib'


Viewer = person "Video Viewer"

StreamingPlatform = system "Video Streaming Service" {
CDN = container "Content Delivery Network" {
  technology "Cloudflare, AWS CloudFront"
  description "Serves videos from edge locations worldwide"
}

OriginServer = container "Origin Server" {
  technology "S3, GCS"
  description "Stores original video files"
}

VideoAPI = container "Video API" {
  technology "Rust, gRPC"
  description "Handles video metadata, user requests"
}

TranscodingService = container "Video Transcoding" {
  technology "FFmpeg, Kubernetes"
  description "Converts videos to different formats/qualities"
}

VideoDB = database "Video Metadata Database" {
  technology "PostgreSQL"
}

UserDB = database "User Database" {
  technology "PostgreSQL"
}
}

Viewer -> StreamingPlatform.CDN "Streams video"
StreamingPlatform.CDN -> StreamingPlatform.OriginServer "Fetches on cache miss"
Viewer -> StreamingPlatform.VideoAPI "Requests video info"
StreamingPlatform.VideoAPI -> StreamingPlatform.VideoDB "Queries metadata"
StreamingPlatform.VideoAPI -> StreamingPlatform.UserDB "Queries user data"
StreamingPlatform.OriginServer -> StreamingPlatform.TranscodingService "Processes videos"

view index {
include *
}

Step 4: Address Scaling (The Key Part)

The interviewer will ask: "How does this handle 10 million concurrent viewers?"

This is where you show your scaling knowledge. Let's add scaling configuration:

import { * } from 'sruja.ai/stdlib'


Viewer = person "Video Viewer"

StreamingPlatform = system "Video Streaming Service" {
CDN = container "Content Delivery Network" {
  technology "Cloudflare, AWS CloudFront"
  // CDN scales automatically - no need to configure
  description "Serves videos from edge locations worldwide"
}

VideoAPI = container "Video API" {
  technology "Rust, gRPC"

  // This is what interviewers want to see!
  scale {
    min 10
    max 1000
    metric "cpu > 75% or requests_per_second > 10000"
  }

  description "Handles video metadata, user requests"
}

TranscodingService = container "Video Transcoding" {
  technology "FFmpeg, Kubernetes"

  scale {
    min 5
    max 100
    metric "queue_length > 50"
  }

  description "Converts videos to different formats/qualities"
}

VideoDB = database "Video Metadata Database" {
  technology "PostgreSQL"
  // Database scaling: read replicas
  description "Primary database with 5 read replicas for scaling reads"
}
}

view index {
include *
}

What Interviewers Look For

✅ Good Answer (What You Just Did)

  1. Clarified requirements before designing
  2. Started with high-level architecture
  3. Modeled with Sruja to visualize
  4. Addressed scaling with specific numbers
  5. Explained trade-offs (CDN vs origin server)

❌ Bad Answer (Common Mistakes)

  1. Jumping straight to code/implementation details
  2. Not asking clarifying questions
  3. Designing for small scale only
  4. Not mentioning CDN or caching
  5. Ignoring database scaling

Key Points to Mention in Interview

1. CDN for Video Delivery

Say: "We use a CDN to serve videos from edge locations close to users. This reduces latency and offloads traffic from origin servers."

2. Horizontal Scaling for API

Say: "The API server scales horizontally from 10 to 1000 instances based on CPU and request rate. This handles traffic spikes during peak hours."

3. Database Read Replicas

Say: "We use read replicas for the database to scale read operations. Writes go to primary, reads can go to any replica."

4. Caching Strategy

Say: "We cache frequently accessed video metadata in Redis to reduce database load."

Interview Practice: Add Caching

Interviewer might ask: "How do you reduce database load?"

Add caching to your design:

import { * } from 'sruja.ai/stdlib'


StreamingPlatform = system "Video Streaming Service" {
VideoAPI = container "Video API" {
  technology "Rust, gRPC"
  scale {
    min 10
    max 1000
    metric "cpu > 75%"
  }
}

VideoDB = database "Video Metadata Database" {
  technology "PostgreSQL"
}

Cache = database "Video Metadata Cache" {
  technology "Redis"
  description "Caches frequently accessed video metadata"
}
}

StreamingPlatform.VideoAPI -> StreamingPlatform.Cache "Reads metadata (cache hit)"
StreamingPlatform.VideoAPI -> StreamingPlatform.VideoDB "Reads metadata (cache miss)"
StreamingPlatform.VideoAPI -> StreamingPlatform.Cache "Writes to cache"

view index {
include *
}
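The cache-aside pattern behind those relations (read the cache first, fall back to the database on a miss, then populate the cache) can be sketched in a few lines of Python. The dicts below are illustrative stand-ins for Redis and PostgreSQL:

```python
# Stand-ins for Redis and PostgreSQL (illustrative only).
cache = {}
db = {"v1": {"title": "Intro to Sruja", "duration_s": 120}}

def get_video_metadata(video_id):
    """Cache-aside read: try the cache first, fall back to the database on a miss."""
    if video_id in cache:
        return cache[video_id]      # cache hit: no database load
    meta = db[video_id]             # cache miss: one database read
    cache[video_id] = meta          # populate so later readers hit the cache
    return meta
```

The first read for a video hits the database; every subsequent read is served from the cache until the entry is evicted or invalidated.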

Understanding Scale Block Fields

min - Minimum Replicas

Interview tip: "We keep at least 10 instances running to handle baseline traffic and provide fault tolerance."

max - Maximum Replicas

Interview tip: "We cap at 1000 instances to control costs. If we need more, we'd need to optimize the architecture first."

metric - Scaling Trigger

Interview tip: "We scale based on CPU usage and request rate. When CPU exceeds 75% or requests exceed 10k/sec, we add more instances."
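Taken together, the three fields describe an autoscaling loop. Here is a toy version in Python; the doubling/halving policy and the scale-in thresholds are assumptions for illustration, not Sruja semantics:

```python
def desired_replicas(current, cpu_pct, rps, lo=10, hi=1000):
    """Toy autoscaler mirroring the scale block: metric trigger, clamped to min/max."""
    if cpu_pct > 75 or rps > 10_000:      # the metric condition from the scale block
        current = current * 2             # scale out (doubling is an assumption)
    elif cpu_pct < 30 and rps < 2_000:    # hypothetical scale-in thresholds
        current = current // 2            # scale in
    return max(lo, min(hi, current))      # min/max bounds from the scale block
```

With the scale block above, `desired_replicas(10, 80, 5_000)` doubles to 20 instances, while a fleet already at the cap stays pinned at 1000 and an idle fleet never drops below the minimum of 10.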

Real Interview Example: Capacity Estimation

Interviewer: "How many API servers do you need for 10M concurrent users?"

Your answer:

  1. "Assume each user makes 1 request per minute = 10M requests/minute = ~167k requests/second"
  2. "Each API server handles ~1000 requests/second"
  3. "We need ~167 servers at peak"
  4. "With 2x headroom for spikes: ~350 servers"
  5. "Our scale block allows 10-1000, so we're covered"
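It helps to be able to reproduce this arithmetic on a whiteboard. Here it is in Python; the 1,000 req/s per-server figure is the assumption the whole estimate hinges on:

```python
import math

concurrent_users = 10_000_000
requests_per_user_per_minute = 1      # stated assumption
per_server_rps = 1_000                # assumed capacity of one API server

rps = concurrent_users * requests_per_user_per_minute / 60   # ~166,667 req/s
baseline = math.ceil(rps / per_server_rps)                   # 167 servers at peak
with_headroom = 2 * baseline                                 # 334 - rounded to ~350 above

print(f"{rps:,.0f} req/s -> {baseline} servers, {with_headroom} with 2x headroom")
```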

Exercise: Practice This Question

Design a video streaming platform and be ready to explain:

  1. Why you chose CDN
  2. How scaling works
  3. Database scaling strategy
  4. Caching approach

Practice tip: Time yourself (30-40 minutes) and explain out loud as if in an interview.

Common Follow-Up Questions

Be prepared for:

  • "How do you handle video uploads?" (Add upload service, queue for processing)
  • "What about live streaming?" (Add live streaming infrastructure)
  • "How do you ensure availability?" (Add redundancy, health checks)
  • "What's the cost?" (Estimate based on scale)

Next Steps

In the next lesson, we'll learn about SLOs (Service Level Objectives), another common interview topic: defining measurable performance targets.

Lesson 2: Interview Question - Design a High-Performance Payment System

The Interview Question

"Design a payment processing system that can handle 1 million transactions per second with 99.99% availability and < 100ms latency."

This question tests your understanding of:

  • Performance requirements (SLOs)
  • High availability
  • Low latency systems
  • Trade-offs between consistency and performance

Step 1: Clarify Requirements

You should ask:

  • "What's the transaction volume? Peak vs average?"
  • "What's the availability requirement? 99.9% or 99.99%?"
  • "What's the latency requirement? P95 or P99?"
  • "What about consistency? Do we need strong consistency?"

Interviewer's answer:

  • "1M transactions/second at peak"
  • "99.99% availability (four nines)"
  • "< 100ms p95 latency"
  • "Strong consistency required (it's money!)"

Step 2: Design with SLOs in Mind

This is where SLOs (Service Level Objectives) come in. Interviewers love when you think about measurable targets.

Let's model the payment system with explicit SLOs:

import { * } from 'sruja.ai/stdlib'


PaymentService = system "Payment Processing" {
PaymentAPI = container "Payment API" {
  technology "Rust, gRPC"

  // This shows production-ready thinking!
  slo {
    availability {
      target "99.99%"
      window "30 days"
      current "99.97%"
    }

    latency {
      p95 "100ms"
      p99 "250ms"
      window "7 days"
      current {
        p95 "85ms"
        p99 "200ms"
      }
    }

    errorRate {
      target "< 0.01%"
      window "30 days"
      current "0.008%"
    }

    throughput {
      target "1000000 txn/s"
      window "1 hour"
      current "950000 txn/s"
    }
  }

  scale {
    min 100
    max 10000
    metric "cpu > 70% or requests_per_second > 500000"
  }
}

FraudDetection = container "Fraud Detection" {
  technology "Python, ML"
  description "Real-time fraud detection"
}

PaymentDB = database "Payment Database" {
  technology "PostgreSQL"
  description "Primary database with 10 read replicas"
}

Cache = database "Payment Cache" {
  technology "Redis"
  description "Caches recent transactions"
}

PaymentQueue = queue "Payment Queue" {
  technology "Kafka"
  description "Async payment processing"
}
}

Stripe = system "Stripe Gateway" {
tags ["external"]
}

BankAPI = system "Bank API" {
tags ["external"]
}

PaymentService.PaymentAPI -> PaymentService.FraudDetection "Validates"
PaymentService.PaymentAPI -> PaymentService.Cache "Checks recent transactions"
PaymentService.PaymentAPI -> PaymentService.PaymentDB "Stores transaction"
PaymentService.PaymentAPI -> PaymentService.PaymentQueue "Enqueues for async processing"
PaymentService.PaymentAPI -> Stripe "Processes payment"
PaymentService.PaymentAPI -> BankAPI "Validates with bank"

view index {
include *
}

What Interviewers Look For

✅ Good Answer (What You Just Did)

  1. Defined SLOs explicitly - Shows you think about measurable targets
  2. Addressed all requirements - Availability, latency, throughput
  3. Explained trade-offs - Strong consistency vs performance
  4. Scalability - Showed how to handle 1M txn/s
  5. Redundancy - Multiple replicas, failover strategies

❌ Bad Answer (Common Mistakes)

  1. Not defining SLOs or performance targets
  2. Ignoring availability requirements
  3. Not explaining how to achieve 99.99% availability
  4. Not addressing consistency requirements
  5. No capacity estimation

Key Points to Mention in Interview

1. Availability (99.99% = Four Nines)

Say: "99.99% availability means 52.6 minutes of downtime per year. To achieve this, we need:

  • Multiple data centers (active-active)
  • Automatic failover
  • Health checks and monitoring
  • Database replication with automatic promotion"
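The "four nines" figure itself is worth being able to derive on the spot. In Python, using a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

def downtime_minutes_per_year(availability):
    """Allowed downtime per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability)

for nines in (0.999, 0.9999, 0.99999):
    print(f"{nines:.3%} -> {downtime_minutes_per_year(nines):.1f} min/year")
```

Three nines allows about 526 minutes per year, four nines about 52.6, five nines about 5.3, which is why each extra nine roughly multiplies the engineering effort.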

2. Latency (< 100ms p95)

Say: "To achieve < 100ms latency, we:

  • Use in-memory cache (Redis) for hot data
  • Keep database queries simple and indexed
  • Use connection pooling
  • Minimize network hops
  • Consider async processing for non-critical paths"

3. Throughput (1M txn/s)

Say: "To handle 1M transactions/second:

  • Horizontal scaling: 100-10,000 API instances
  • Database sharding by transaction ID
  • Read replicas for scaling reads
  • Caching frequently accessed data
  • Async processing for non-critical operations"
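"Sharding by transaction ID" usually means hashing the ID to pick a partition. A minimal sketch, where 16 shards and SHA-256 are arbitrary choices for illustration:

```python
import hashlib

NUM_SHARDS = 16   # arbitrary shard count for illustration

def shard_for(txn_id: str) -> int:
    """Deterministically route a transaction to one of NUM_SHARDS partitions."""
    digest = hashlib.sha256(txn_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS
```

The same ID always lands on the same shard, so all reads and writes for one transaction stay on a single node, and load spreads roughly evenly across shards.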

4. Strong Consistency

Say: "Since this is financial data, we need strong consistency:

  • All writes go to primary database
  • Read replicas are eventually consistent (ok for reads)
  • Use distributed transactions for critical operations
  • Trade-off: Slightly higher latency for correctness"

Understanding SLO Types (Interview Context)

Availability SLO

Interviewer asks: "How do you ensure 99.99% availability?"

Your answer with SLO:

import { * } from 'sruja.ai/stdlib'


PaymentService = system "Payment Processing" {
PaymentAPI = container "Payment API" {
  slo {
    availability {
      target "99.99%"
      window "30 days"
      current "99.97%"
    }
  }
}
}

view index {
include *
}

Explain: "We target 99.99% (four nines), which allows 52.6 minutes downtime per year. Currently at 99.97%, so we're close but need to improve redundancy."
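You can strengthen that answer with the error budget: over a 30-day window, a 99.99% target allows about 4.3 minutes of downtime, so an observed 99.97% (roughly 13 minutes) has already overspent the budget. A quick check of the arithmetic:

```python
WINDOW_MIN = 30 * 24 * 60             # 30-day SLO window in minutes

def downtime_budget(target):
    """Downtime allowed (or, for an observed value, incurred) in the window."""
    return WINDOW_MIN * (1 - target)

budget = downtime_budget(0.9999)      # 4.32 minutes allowed
spent = downtime_budget(0.9997)       # 12.96 minutes actually down

print(f"budget {budget:.2f} min, spent {spent:.2f} min")
```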

Latency SLO

Interviewer asks: "How fast should payments process?"

Your answer with SLO:

import { * } from 'sruja.ai/stdlib'


PaymentService = system "Payment Processing" {
PaymentAPI = container "Payment API" {
  slo {
    latency {
      p95 "100ms"
      p99 "250ms"
      window "7 days"
    }
  }
}
}

view index {
include *
}

Explain: "95% of payments complete in under 100ms, 99% in under 250ms. We use p95/p99 instead of average because they show real user experience - a few slow payments don't skew the metric."
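The average-vs-percentile point is easy to demonstrate with a made-up sample in which the mean looks healthy while the p99 exposes the slow tail:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value covering p% of samples."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered))
    return ordered[k - 1]

latencies_ms = [50] * 95 + [500] * 5          # 95 fast payments, 5 slow ones
avg = sum(latencies_ms) / len(latencies_ms)   # 72.5 ms - looks fine
p95 = percentile(latencies_ms, 95)            # 50 ms
p99 = percentile(latencies_ms, 99)            # 500 ms - what the unlucky 1% see
```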

Error Rate SLO

Interviewer asks: "What error rate is acceptable?"

Your answer with SLO:

import { * } from 'sruja.ai/stdlib'


PaymentService = system "Payment Processing" {
PaymentAPI = container "Payment API" {
  slo {
    errorRate {
      target "< 0.01%"
      window "30 days"
      current "0.008%"
    }
  }
}
}

view index {
include *
}

Explain: "We target less than 0.01% error rate. Currently at 0.008%, which is good, but we monitor closely because payment errors are critical."

Real Interview Example: Capacity Estimation

Interviewer: "How many servers do you need for 1M txn/s?"

Your answer:

  1. "Each transaction requires ~10ms processing = 100 transactions/second per server"
  2. "1M txn/s ÷ 100 = 10,000 servers needed"
  3. "With 2x headroom for spikes and redundancy: ~20,000 servers"
  4. "But we can optimize:
    • Caching reduces DB load → fewer DB servers
    • Async processing → can batch operations
    • Database sharding → distributes load
    • Final estimate: ~5,000-10,000 servers"
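The same estimate, checked in Python; the 10 ms per-transaction cost is the granted assumption, and "per server" assumes one request at a time:

```python
txn_per_sec = 1_000_000
processing_ms = 10                    # assumed per-transaction cost
per_server = 1_000 // processing_ms   # 100 txn/s per server

baseline = txn_per_sec // per_server  # 10,000 servers
with_headroom = 2 * baseline          # 20,000 before the optimizations above
```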

Interview Practice: Add High Availability

Interviewer: "How do you ensure 99.99% availability?"

Add redundancy to your design:

import { * } from 'sruja.ai/stdlib'


PaymentService = system "Payment Processing" {
PaymentAPI = container "Payment API" {
  technology "Rust, gRPC"
  scale {
    min 100
    max 10000
    metric "cpu > 70%"
  }
  description "Deployed across 3 data centers (active-active)"
}

PaymentDB = database "Payment Database" {
  technology "PostgreSQL"
  description "Primary in US-East, replicas in US-West and EU"
}
}

// Show redundancy
PaymentService.PaymentAPI -> PaymentService.PaymentDB "Writes to primary"

view index {
include *
}

Explain: "We deploy across 3 data centers in active-active mode. If one fails, traffic automatically routes to others. Database has primary + replicas with automatic failover."

Common Follow-Up Questions

Be prepared for:

  1. "What if the database fails?"

    • Answer: "Automatic failover to replica, data replication with < 1s lag"
  2. "How do you handle network partitions?"

    • Answer: "CAP theorem - we choose consistency over availability for payments. If partition occurs, we reject transactions rather than risk inconsistency."
  3. "What about data consistency across regions?"

    • Answer: "Synchronous replication for critical data, eventual consistency for non-critical. Use distributed transactions for cross-region operations."
  4. "How do you monitor SLOs?"

    • Answer: "Real-time dashboards showing current vs target SLOs. Alerts when we're at risk of violating SLOs. Weekly reviews of SLO performance."

Exercise: Practice This Question

Design a payment system and be ready to explain:

  1. How you achieve 99.99% availability
  2. How you keep latency < 100ms
  3. How you handle 1M txn/s
  4. Your SLO targets and how you measure them

Practice tip: Time yourself (40-45 minutes) and explain out loud. Focus on SLOs - interviewers love this!

Key Takeaways for Interviews

  1. Always define SLOs - Shows production-ready thinking
  2. Explain trade-offs - Availability vs consistency, latency vs throughput
  3. Show capacity estimation - Back up your numbers
  4. Mention monitoring - How you track SLOs
  5. Discuss failure scenarios - What happens when things break

Next Steps

You've learned how to handle performance and SLO questions. In the next module, we'll tackle microservices and distributed systems questions - another common interview topic!

Lesson 3: SLO Enforcement in Practice

SLO‑Driven Operations

Use SLOs to set thresholds for alerts and capacity changes.

Sruja: Model SLOs & Validate

import { * } from 'sruja.ai/stdlib'


API = system "API Server" {
  Gateway = container "Gateway" {
    scale { min 500 max 5000 metric "requests_per_second > 3000" }
  }
}

slo {
availability { target "99.95%" window "30 days" }
latency { p95 "150ms" window "7 days" }
errorRate { target "< 0.05%" window "30 days" }
throughput { target "3000 req/s" window "1 hour" }
}

view index {
include *
}

Then validate the model with the CLI:

sruja lint payments.sruja

Practice

  • Define SLOs for your critical path; ensure scale bounds meet throughput.
  • Set alert thresholds aligned to SLO windows.

Module Overview: Microservices & Distributed Systems Interview Questions

"Design an e-commerce platform using microservices."

This is a very common interview question that tests your understanding of:

  • Microservices architecture
  • Service decomposition
  • Inter-service communication
  • Distributed system challenges

Interview Questions You'll Master

  • "Design an e-commerce platform (Amazon-style)"
  • "Design a ride-sharing service like Uber"
  • "Design a social media platform"
  • "How do you split a monolith into microservices?"

What Interviewers Look For

  • ✅ Understanding of microservices vs monolith trade-offs
  • ✅ Ability to decompose a system into services
  • ✅ Knowledge of service communication patterns
  • ✅ Understanding of distributed system challenges
  • ✅ Clear communication of design decisions

Goals

  • Answer microservices questions confidently
  • Model service boundaries using separate systems in Sruja
  • Explain service decomposition strategy
  • Discuss trade-offs and challenges

Interview Framework

We'll follow this approach:

  1. Clarify Requirements - Scale, features, constraints
  2. Identify Services - How to decompose the system
  3. Model with Sruja - Use separate systems to show service boundaries
  4. Discuss Communication - APIs, events, data flow
  5. Address Challenges - Consistency, failures, monitoring

Estimated Time

60-75 minutes (includes practice)

Checklist

  • Understand microservices decomposition
  • Model services as separate systems in Sruja
  • Explain service communication patterns
  • Discuss distributed system challenges

Lesson 1: Interview Question - Design an E-Commerce Platform (Microservices)

The Interview Question

"Design an e-commerce platform like Amazon that can handle millions of users and products. Use a microservices architecture."

This is one of the most common system design interview questions. It tests:

  • System decomposition into microservices
  • Service boundaries and responsibilities
  • Inter-service communication
  • Data consistency across services

Step 1: Clarify Requirements

You should ask:

  • "What are the core features? Shopping cart, checkout, recommendations?"
  • "What's the scale? Users, products, orders per day?"
  • "What about inventory? Real-time stock management?"
  • "Payment processing? Do we integrate with payment gateways?"

Interviewer's typical answer:

  • "Core features: Product catalog, shopping cart, checkout, order management, user accounts"
  • "Scale: 100M users, 1B products, 10M orders/day"
  • "Real-time inventory tracking required"
  • "Integrate with payment gateways like Stripe"

Step 2: Identify Microservices

Key insight: Break down by business domain, not technical layers.

You should identify:

  1. User Service - Authentication, profiles
  2. Product Service - Catalog, search, recommendations
  3. Cart Service - Shopping cart management
  4. Order Service - Order processing, tracking
  5. Payment Service - Payment processing
  6. Inventory Service - Stock management
  7. Notification Service - Emails, SMS

Step 3: Model with Sruja (Separate Systems)

Model each microservice as a separate system within the architecture. This clearly shows service boundaries.

import { * } from 'sruja.ai/stdlib'


Customer = person "Online Customer"

// Each microservice is a separate system
UserService = system "User Management" {
AuthAPI = container "Authentication API" {
  technology "Rust, gRPC"
}

ProfileAPI = container "Profile API" {
  technology "Rust, gRPC"
}

UserDB = database "User Database" {
  technology "PostgreSQL"
}
}

ProductService = system "Product Catalog" {
ProductAPI = container "Product API" {
  technology "Java, Spring Boot"
}

SearchAPI = container "Search API" {
  technology "Elasticsearch"
}

RecommendationAPI = container "Recommendation API" {
  technology "Python, ML"
}

ProductDB = database "Product Database" {
  technology "PostgreSQL"
}

SearchIndex = database "Search Index" {
  technology "Elasticsearch"
}
}

CartService = system "Shopping Cart" {
CartAPI = container "Cart API" {
  technology "Node.js, Express"
}

CartDB = database "Cart Database" {
  technology "Redis"
  description "In-memory cache for fast cart operations"
}
}

OrderService = system "Order Management" {
OrderAPI = container "Order API" {
  technology "Node.js, Express"
}

OrderProcessor = container "Order Processor" {
  technology "Node.js"
}

OrderDB = database "Order Database" {
  technology "PostgreSQL"
}

OrderQueue = queue "Order Queue" {
  technology "Kafka"
}
}

PaymentService = system "Payment Processing" {
PaymentAPI = container "Payment API" {
  technology "Rust, gRPC"
}

PaymentDB = database "Payment Database" {
  technology "PostgreSQL"
}
}

InventoryService = system "Inventory Management" {
InventoryAPI = container "Inventory API" {
  technology "Java, Spring Boot"
}

InventoryDB = database "Inventory Database" {
  technology "PostgreSQL"
}
}

NotificationService = system "Notifications" {
NotificationAPI = container "Notification API" {
  technology "Python, FastAPI"
}

EmailQueue = queue "Email Queue" {
  technology "RabbitMQ"
}

SMSQueue = queue "SMS Queue" {
  technology "RabbitMQ"
}
}

// API Gateway - single entry point
ECommerceApp = system "E-Commerce Application" {
WebApp = container "Web Application" {
  technology "React, Next.js"
}

APIGateway = container "API Gateway" {
  technology "Kong, Nginx"
  description "Routes requests to appropriate microservices"
}
}

Stripe = system "Stripe Gateway" {
tags ["external"]
}

PayPal = system "PayPal Gateway" {
tags ["external"]
}

// User flow
Customer -> ECommerceApp.WebApp "Browses products"
ECommerceApp.WebApp -> ECommerceApp.APIGateway "Makes requests"
ECommerceApp.APIGateway -> UserService.AuthAPI "Authenticates"
ECommerceApp.APIGateway -> ProductService.ProductAPI "Fetches products"
ECommerceApp.APIGateway -> ProductService.SearchAPI "Searches products"
ECommerceApp.APIGateway -> ProductService.RecommendationAPI "Gets recommendations"

// Cart flow
ECommerceApp.APIGateway -> CartService.CartAPI "Manages cart"
CartService.CartAPI -> CartService.CartDB "Stores cart"

// Order flow
ECommerceApp.APIGateway -> OrderService.OrderAPI "Creates order"
OrderService.OrderAPI -> InventoryService.InventoryAPI "Checks stock"
OrderService.OrderAPI -> PaymentService.PaymentAPI "Processes payment"
OrderService.OrderAPI -> UserService.ProfileAPI "Gets user info"
OrderService.OrderAPI -> OrderService.OrderQueue "Enqueues for processing"
OrderService.OrderProcessor -> OrderService.OrderQueue "Processes orders"
OrderService.OrderProcessor -> NotificationService.NotificationAPI "Sends confirmation"

// Payment flow
PaymentService.PaymentAPI -> PaymentService.PaymentDB "Stores transaction"
PaymentService.PaymentAPI -> Stripe "Processes cards"
PaymentService.PaymentAPI -> PayPal "Processes PayPal"

// Notification flow
NotificationService.NotificationAPI -> NotificationService.EmailQueue "Sends emails"
NotificationService.NotificationAPI -> NotificationService.SMSQueue "Sends SMS"

view index {
include *
}

What Interviewers Look For

✅ Good Answer (What You Just Did)

  1. Clear service boundaries - Each service is a separate system
  2. Single responsibility - Each service has one clear purpose
  3. Identified communication patterns - API calls, queues, events
  4. Addressed data ownership - Each service owns its database
  5. Explained trade-offs - Why microservices vs monolith

❌ Bad Answer (Common Mistakes)

  1. Services too granular (one service per function)
  2. Services too coarse (monolith split incorrectly)
  3. Not showing service boundaries clearly
  4. Ignoring data consistency challenges
  5. No API gateway or service mesh

Key Points to Mention in Interview

1. Service Decomposition Strategy

Say: "I decompose by business domain, not technical layers. Each service owns its data and has clear boundaries. For example:

  • User Service owns user data and authentication
  • Product Service owns product catalog and search
  • Order Service owns order lifecycle
  • Each service is a separate system in the architecture"

2. Inter-Service Communication

Say: "Services communicate via:

  • Synchronous: REST/gRPC for real-time operations (checkout, cart)
  • Asynchronous: Message queues for eventual consistency (order processing, notifications)
  • API Gateway: Single entry point, handles routing, auth, rate limiting"

3. Data Consistency

Say: "Each service owns its database (database per service pattern). For cross-service operations:

  • Saga pattern: For distributed transactions (order → payment → inventory)
  • Eventual consistency: Acceptable for non-critical paths (notifications)
  • Strong consistency: Only within a service (cart operations)"
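The saga in the first bullet is ordinary control flow: each completed step registers a compensating action that runs if a later step fails. The function names and the in-memory order dict below are hypothetical:

```python
class PaymentDeclined(Exception):
    pass

def reserve_inventory(order):
    order["inventory_reserved"] = True

def release_inventory(order):            # compensating transaction for the step above
    order["inventory_reserved"] = False

def charge_payment(order):
    if not order.get("card_ok"):
        raise PaymentDeclined
    order["paid"] = True

def place_order(order):
    """Saga: reserve -> charge; on failure, undo completed steps in reverse order."""
    compensations = []
    try:
        reserve_inventory(order)
        compensations.append(release_inventory)
        charge_payment(order)
    except PaymentDeclined:
        for undo in reversed(compensations):
            undo(order)                  # release the reserved inventory
        order["status"] = "failed"
        return order
    order["status"] = "confirmed"
    return order
```

In a real system each step is a call to a different service and each compensation is itself a service call, but the shape is the same: no global transaction, just explicit undo steps.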

4. API Gateway Pattern

Say: "API Gateway provides:

  • Single entry point for all client requests
  • Request routing to appropriate microservices
  • Authentication/authorization - validate tokens once
  • Rate limiting and throttling
  • Load balancing across service instances"

Interview Practice: Add More Services

Interviewer might ask: "What about recommendations and analytics?"

Add them to your design (extending the main architecture):

import { * } from 'sruja.ai/stdlib'


Customer = person "Online Customer"

// Existing services (UserService, ProductService, OrderService, etc. from main design)
ProductService = system "Product Catalog" {
ProductAPI = container "Product API" {
  technology "Java, Spring Boot"
}
}

OrderService = system "Order Management" {
OrderAPI = container "Order API" {
  technology "Node.js, Express"
}
}

ECommerceApp = system "E-Commerce Application" {
APIGateway = container "API Gateway" {
  technology "Kong, Nginx"
}
}

// Additional services
RecommendationService = system "Recommendations" {
RecommendationAPI = container "Recommendation API" {
  technology "Python, ML"
}

UserBehaviorDB = database "User Behavior Database" {
  technology "MongoDB"
  description "Stores user clicks, views, purchases for ML"
}
}

AnalyticsService = system "Analytics" {
AnalyticsAPI = container "Analytics API" {
  technology "Rust"
}

AnalyticsDB = database "Analytics Database" {
  technology "ClickHouse"
  description "Time-series data for analytics"
}
}

// Show how services interact
ECommerceApp.APIGateway -> ProductService.ProductAPI "Gets products"
ECommerceApp.APIGateway -> RecommendationService.RecommendationAPI "Gets recommendations"
OrderService.OrderAPI -> AnalyticsService.AnalyticsAPI "Tracks order events"

view index {
include *
}

Common Follow-Up Questions

Be prepared for:

  1. "How do you handle failures?"

    • Answer: "Circuit breakers prevent cascading failures. Retries with exponential backoff. Fallbacks (show cached data if service down). If payment service is down, queue the order for later processing."
  2. "How do you ensure data consistency?"

    • Answer: "Saga pattern for distributed transactions. Each step can be compensated if later steps fail. For example, if payment fails after inventory is reserved, we release the inventory (compensating transaction)."
  3. "How do you handle service versioning?"

    • Answer: "API versioning in URLs (/v1/, /v2/). Deploy new versions alongside old ones. Gradually migrate traffic. Deprecate old versions after migration."
  4. "How do you monitor microservices?"

    • Answer: "Distributed tracing (Jaeger, Zipkin) to track requests across services. Centralized logging (ELK stack). Metrics (Prometheus) per service. Health checks for each service."
  5. "How do you handle service discovery?"

    • Answer: "Service registry (Consul, Eureka) or DNS-based discovery. API Gateway can handle routing. Service mesh (Istio) for advanced features like load balancing, retries."
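The circuit breaker from the failure-handling answer is small enough to sketch. The thresholds and cooldown below are arbitrary, and a production system would reach for a library rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Open after max_failures consecutive errors; fail fast until the cooldown passes."""

    def __init__(self, max_failures=3, cooldown_s=30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback()                  # open: protect the failing service
            self.opened_at = None                  # half-open: let one request through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0                          # success closes the breaker again
        return result
```

While the breaker is open, callers get the fallback (cached data, a queued retry, a friendly error) immediately instead of piling more load onto a struggling downstream service.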

Exercise: Practice This Question

Design an e-commerce platform and be ready to explain:

  1. How you decomposed into services (why these services?)
  2. How services communicate (sync vs async)
  3. How you handle data consistency
  4. How you handle failures
  5. Your scaling strategy for each service

Practice tip: Time yourself (45-50 minutes). Draw the architecture, then model it with Sruja. Explain your decisions out loud as if in an interview.

Key Takeaways for Interviews

  1. Decompose by business domain - Not technical layers
  2. Each service is a separate system - Clear boundaries in Sruja
  3. Each service owns its data - Database per service
  4. Use API Gateway - Single entry point
  5. Mix sync and async - REST for real-time, queues for async
  6. Address failures - Circuit breakers, retries, fallbacks
  7. Show with separate systems - Clear service boundaries in architecture

Next Steps

You've learned how to design microservices architectures. In the next module, we'll cover governance and policies - important for senior/staff level interviews!

Module Overview: Senior/Staff Level Interview Questions

"How do you ensure architectural standards across a large organization?"

This module covers questions typically asked in senior/staff engineer interviews. These test your ability to:

  • Lead architecture decisions
  • Enforce standards and best practices
  • Handle compliance and regulatory requirements
  • Design for large-scale organizations

Interview Questions You'll Master

  • "How do you enforce architectural standards?"
  • "Design a system that must comply with HIPAA/SOC 2"
  • "How do you ensure security across microservices?"
  • "How do you handle compliance in a distributed system?"

What Interviewers Look For

  • ✅ Understanding of governance and policies
  • ✅ Ability to enforce standards at scale
  • ✅ Knowledge of compliance requirements
  • ✅ Leadership and architectural thinking
  • ✅ Trade-offs between flexibility and standards

Goals

  • Answer governance questions confidently
  • Define policies with Sruja
  • Explain compliance requirements
  • Discuss enforcement strategies

Interview Framework

We'll follow this approach:

  1. Understand Requirements - Compliance, standards, scale
  2. Define Policies - Security, compliance, best practices
  3. Model with Sruja - Show policies in architecture
  4. Discuss Enforcement - How to ensure compliance
  5. Address Trade-offs - Flexibility vs standards

Estimated Time

45-60 minutes

Checklist

  • Understand policy syntax and usage
  • Define security and compliance policies
  • Model policies with Sruja
  • Explain enforcement strategies

Lesson 1: Interview Question - Design a HIPAA-Compliant Healthcare System

The Interview Question

"Design a healthcare platform that stores patient data and must comply with HIPAA regulations. How do you ensure compliance across all services?"

This is a senior/staff level interview question that tests:

  • Understanding of compliance requirements
  • Ability to enforce standards at scale
  • Security and privacy considerations
  • Governance and policy enforcement

Step 1: Clarify Requirements

You should ask:

  • "What are the core features? Patient records, appointments, prescriptions?"
  • "What's the scale? How many patients, healthcare providers?"
  • "What compliance requirements? HIPAA, SOC 2, others?"
  • "What about data retention? How long must we keep records?"

Interviewer's answer:

  • "Core: Patient records, appointments, prescriptions, billing"
  • "Scale: 10M patients, 100K healthcare providers"
  • "Must comply with HIPAA (health data privacy)"
  • "Retain records for 10 years (legal requirement)"

Step 2: Understand HIPAA Requirements

Key HIPAA requirements (you should mention these):

  1. Encryption: Data at rest and in transit
  2. Access Control: Role-based access, audit logs
  3. Audit Logging: Track all access to patient data
  4. Data Minimization: Only collect necessary data
  5. Breach Notification: Report breaches without unreasonable delay (HIPAA allows at most 60 days)

Step 3: Design with Policies

This is where Sruja's policy feature is perfect! Show how you enforce compliance:

import { * } from 'sruja.ai/stdlib'


// HIPAA Compliance Policy
HIPAACompliance = policy "All patient data must be encrypted and access logged" {
  category "compliance"
  enforcement "required"
  description "HIPAA requires encryption at rest and in transit, plus audit logging for all patient data access"
}

// Security Policy
TLSEnforcement = policy "All external communications must use TLS 1.3" {
  category "security"
  enforcement "required"
  description "Required for HIPAA compliance - all data in transit must be encrypted"
}

EncryptionAtRest = policy "All patient data must be encrypted at rest using AES-256" {
  category "security"
  enforcement "required"
  description "HIPAA requirement - database encryption, file encryption"
}

// Access Control Policy
AccessControl = policy "Role-based access control required for all patient data" {
  category "security"
  enforcement "required"
  description "Only authorized healthcare providers can access patient data"
}

// Audit Logging Policy
AuditLogging = policy "All access to patient data must be logged" {
  category "compliance"
  enforcement "required"
  description "HIPAA requires audit trails - who accessed what, when, why"
}

// Observability Policy
Observability = policy "All services must expose health check and metrics endpoints" {
  category "observability"
  enforcement "required"
  metadata {
    healthEndpoint "/health"
    metricsEndpoint "/metrics"
  }
}

HealthcareApp = system "Healthcare Application" {
  PatientAPI = container "Patient API" {
    technology "Rust, gRPC"
    tags ["encrypted", "audit-logged"]
    description "Handles patient data - must comply with HIPAACompliance policy"
  }

  AppointmentAPI = container "Appointment API" {
    technology "Java, Spring Boot"
    tags ["encrypted"]
    description "Manages appointments - must comply with all policies"
  }

  BillingAPI = container "Billing API" {
    technology "Node.js, Express"
    tags ["encrypted", "audit-logged"]
    description "Handles billing - contains PHI (Protected Health Information)"
  }

  PatientDB = database "Patient Database" {
    technology "PostgreSQL"
    tags ["encrypted", "audit-logged"]
    description "Encrypted at rest, all access logged for HIPAA compliance"
  }

  AuditLogDB = database "Audit Log Database" {
    technology "PostgreSQL"
    description "Stores audit logs - immutable, append-only"
  }

  AuditQueue = queue "Audit Log Queue" {
    technology "Kafka"
    description "Async audit logging to avoid blocking operations"
  }
}

IdentityProvider = system "Identity Provider" {
  tags ["external"]
  description "OAuth2/OIDC for authentication and authorization"
}

// Show compliance in action
HealthcareApp.PatientAPI -> HealthcareApp.PatientDB "Reads/Writes (encrypted, logged)"
HealthcareApp.PatientAPI -> HealthcareApp.AuditLogDB "Logs access via AuditQueue"
HealthcareApp.PatientAPI -> IdentityProvider "Validates access tokens"

view index {
include *
}

What Interviewers Look For

✅ Good Answer (What You Just Did)

  1. Understood compliance requirements - Mentioned specific HIPAA rules
  2. Defined policies explicitly - Showed governance thinking
  3. Applied policies to architecture - Tags, descriptions show compliance
  4. Addressed security - Encryption, access control, audit logging
  5. Explained enforcement - How policies are enforced

❌ Bad Answer (Common Mistakes)

  1. Not understanding compliance requirements
  2. No mention of policies or governance
  3. Ignoring security (encryption, access control)
  4. No audit logging strategy
  5. Can't explain how to enforce standards

Key Points to Mention in Interview

1. Policy-Driven Architecture

Say: "I define policies at the architecture level to enforce standards. For example:

  • HIPAACompliance policy requires encryption and audit logging
  • All services that handle patient data must comply
  • Policies are checked in CI/CD - non-compliant services can't deploy"

2. Encryption Strategy

Say: "We encrypt data at multiple levels:

  • In transit: TLS 1.3 for all communications
  • At rest: AES-256 encryption for databases
  • Application level: Encrypt sensitive fields before storing"

3. Access Control

Say: "We use:

  • OAuth2/OIDC: For authentication and authorization
  • Role-based access control (RBAC): Doctors can access their patients, admins have broader access
  • Principle of least privilege: Users only get minimum required access"
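The RBAC point above can be sketched in a few lines. This is a minimal illustration of role-to-permission mapping with default-deny; the role and permission names are hypothetical, not part of Sruja or HIPAA:

```python
# Minimal RBAC sketch. Roles and permission names are illustrative.
ROLE_PERMISSIONS = {
    "doctor": {"read:own_patients", "write:own_patients"},
    "billing_admin": {"read:billing"},
    "auditor": {"read:audit_logs"},
}

def is_allowed(role: str, permission: str) -> bool:
    # Least privilege: deny unless the role explicitly grants it.
    return permission in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("doctor", "write:own_patients"))        # True
print(is_allowed("billing_admin", "write:own_patients")) # False
```

The default-deny `get(role, set())` is the key detail: an unknown role gets no permissions rather than an error path that might be mishandled.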

4. Audit Logging

Say: "We log all access to patient data:

  • What: Which patient record was accessed
  • Who: Which user/role accessed it
  • When: Timestamp
  • Why: Purpose of access (treatment, billing, etc.)
  • Immutable logs: Can't be modified or deleted"
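One common way to make the "immutable logs" claim concrete is hash chaining: each record includes the hash of the previous record, so any later edit or deletion is detectable. A minimal sketch (field names are illustrative; this is not a Sruja feature):

```python
import hashlib
import json
from datetime import datetime, timezone

def append_entry(log, who, what, why):
    """Append an audit record; each record hashes the previous one,
    so tampering with any earlier record breaks the chain."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    entry = {
        "who": who,    # which user/role accessed it
        "what": what,  # which patient record was accessed
        "why": why,    # purpose of access (treatment, billing, ...)
        "when": datetime.now(timezone.utc).isoformat(),
        "prev": prev_hash,
    }
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    log.append(entry)

log = []
append_entry(log, "dr_smith", "patient/123", "treatment")
append_entry(log, "billing_svc", "patient/123", "billing")
assert log[1]["prev"] == log[0]["hash"]  # chain intact
```

In production you would also write the chain to append-only storage; the hash chain only detects tampering, it does not prevent it.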

5. Enforcement Strategy

Say: "We enforce policies through:

  • CI/CD checks: Validate architecture before deployment
  • Service mesh policies: Enforce TLS, rate limiting
  • Database policies: Encryption at rest, access controls
  • Monitoring: Alert on policy violations"

Interview Practice: Add More Compliance

Interviewer might ask: "What about data retention and deletion?"

Add data retention policy:

import { * } from 'sruja.ai/stdlib'


HIPAACompliance = policy "All patient data must be encrypted and access logged" {
  category "compliance"
  enforcement "required"
}

DataRetention = policy "Patient records retained for 10 years, then archived" {
  category "compliance"
  enforcement "required"
  description "Legal requirement - records must be retained for 10 years, then moved to cold storage"
}

RightToDeletion = policy "Support patient right to data deletion (with exceptions)" {
  category "compliance"
  enforcement "required"
  description "GDPR/HIPAA - patients can request data deletion, but some data must be retained for legal reasons"
}

HealthcareApp = system "Healthcare Application" {
  PatientAPI = container "Patient API" {
    technology "Rust, gRPC"
    tags ["encrypted", "audit-logged"]
  }

  PatientDB = database "Patient Database" {
    technology "PostgreSQL"
    description "Active patient records - 10 year retention"
  }

  ArchiveDB = database "Archive Database" {
    technology "S3 Glacier"
    description "Cold storage for records older than 10 years"
  }
}

view index {
include *
}

Common Follow-Up Questions

Be prepared for:

  1. "How do you ensure all services comply?"

    • Answer: "Policy validation in CI/CD. Architecture review process. Service mesh enforces some policies automatically. Regular audits."
  2. "What if a service violates a policy?"

    • Answer: "CI/CD blocks deployment. Alert security team. Architecture review required. Service owner must fix before deploying."
  3. "How do you handle breaches?"

    • Answer: "Automated breach detection via monitoring. Incident response plan. HIPAA requires notification without unreasonable delay, no later than 60 days after discovery. Audit logs help identify scope."
  4. "How do you balance compliance with developer productivity?"

    • Answer: "Automate compliance checks. Provide templates and libraries. Make compliance easy, not burdensome. Clear documentation and examples."

Exercise: Practice This Question

Design a HIPAA-compliant healthcare system and be ready to explain:

  1. How you enforce HIPAA requirements
  2. Your encryption strategy
  3. Your access control approach
  4. Your audit logging implementation
  5. How you ensure compliance across services

Practice tip: This is a senior-level question. Focus on:

  • Governance and policies
  • Security and compliance
  • Enforcement strategies
  • Trade-offs and practical considerations

Key Takeaways for Senior Interviews

  1. Understand compliance requirements - Know HIPAA, SOC 2, GDPR basics
  2. Define policies explicitly - Show governance thinking
  3. Enforce at multiple levels - CI/CD, service mesh, monitoring
  4. Balance compliance and productivity - Make it easy for developers
  5. Think about scale - How to enforce across 100+ services

Next Steps

You've learned how to handle compliance and governance questions. This completes the Production Architecture course! You're now ready to tackle:

  • ✅ Scaling and performance questions
  • ✅ Microservices architecture questions
  • ✅ Senior-level governance questions

Keep practicing with real interview questions! 🎯

Lesson 2: Policies, Constraints, Conventions

Why Governance?

Governance ensures systems remain secure, maintainable, and consistent as they evolve.

Sruja: Codify Guardrails

import { * } from 'sruja.ai/stdlib'


SecurityPolicy = policy "Security Policy" {
  description "Security posture for services"
}

constraints {
  rule "No PII in logs"
  rule "Only managed Postgres for relational data"
}

conventions {
  naming "kebab-case for services"
  tracing "W3C trace context propagated"
}

view index {
include *
}

Practice

  • Add a policy describing your security posture.
  • Capture 2–3 constraints and conventions used by your team.

Agentic AI with Sruja

Learn to design agent-based AI systems with clear boundaries, interfaces, and governance using Sruja DSL.

Fundamentals of Agentic AI

Welcome to the first module of the Agentic AI Architecture course. In this module, we will explore what makes an AI system "agentic" and how to model its components using Sruja.

Learning Objectives

By the end of this module, you will be able to:

  1. Define Agentic AI: Understand the difference between passive LLM calls and autonomous agents.
  2. Identify Core Components: Recognize Agents, Tools, Memory, and Planning modules.
  3. Model Basic Agents: Use Sruja to represent a simple agent with tools.

Lessons

  1. What is Agentic AI?
  2. Core Components

What is Agentic AI?

Traditional LLM applications often follow a linear chain: Prompt -> LLM -> Output. Agentic AI breaks this linearity by introducing a control loop where the model decides what to do next.

The Control Loop

An agent typically operates in a loop:

  1. Observe: Read input or environment state.
  2. Reason: Decide on an action (using an LLM).
  3. Act: Execute the action (call a tool).
  4. Reflect: Observe the result of the action.
  5. Repeat: Continue until the goal is met.
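The loop above can be sketched in a few lines of Python. The `reason` function here is a hard-coded stand-in for a real model call, and the weather tool is a stub; only the loop structure matters:

```python
# Observe / Reason / Act / Reflect loop with stubbed LLM and tool.

def reason(goal, observations):
    """Stand-in for an LLM call: decide the next action."""
    if any("15" in o for o in observations):
        return ("answer", "It's 15°C in SF")
    return ("call_tool", "weather", "SF")

def weather_tool(city):
    return f"{city}: 15°C, cloudy"  # stubbed tool result

def run_agent(goal, max_steps=5):
    observations = [goal]                     # Observe: initial input
    for _ in range(max_steps):                # Repeat until goal met
        action = reason(goal, observations)   # Reason
        if action[0] == "answer":
            return action[1]
        result = weather_tool(action[2])      # Act
        observations.append(result)           # Reflect on the result
    return "gave up"

print(run_agent("What is the weather in SF?"))  # It's 15°C in SF
```

The `max_steps` bound is not optional in practice: without it, a confused model can loop forever, burning tokens.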

Agent vs. Chain

| Feature | Chain (e.g., LangChain Runnable) | Agent |
|---|---|---|
| Control Flow | Hardcoded by developer | Determined dynamically by LLM |
| Flexibility | Rigid, predictable | Adaptive, handles ambiguity |
| Failure Recovery | Often brittle (fails if one step fails) | Can self-correct and retry |
| Complexity | Lower | Higher (requires guardrails) |

Why Sruja for Agents?

Modeling agents is complex because relationships are often dynamic. Sruja helps by:

  • Visualizing Dependencies: Showing which agents use which tools.
  • Defining Boundaries: Separating the cognitive engine (LLM) from the execution layer (Tools).
  • Documenting Flows: Tracing the decision loop.
import { * } from 'sruja.ai/stdlib'


Agent = component "Research Agent"
LLM = component "Model Provider"
Tool = component "Search Tool"

Agent -> LLM "Reasons next step"
Agent -> Tool "Executes action"
Tool -> Agent "Returns observation"

view index {
include *
}

Core Components

Every agentic system consists of a few fundamental building blocks.

1. The Agent (The Brain)

The core logic that orchestrates the workflow. It holds the "system prompt" or persona and manages the context window.

2. Tools (The Hands)

Capabilities exposed to the agent. These can be:

  • APIs: Weather, Stock Prices, Internal Databases.
  • Functions: Calculator, Code Interpreter.
  • Retrievers: RAG search against vector databases.
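Under the hood, a tool set is often just a registry mapping names to callables plus a description the model can read. A minimal sketch (tool names and signatures are illustrative):

```python
# Tool registry sketch: the agent selects tools by name.

def calculator(expression: str) -> str:
    # eval is acceptable for a demo; never use it on untrusted input.
    return str(eval(expression, {"__builtins__": {}}))

def lookup_weather(city: str) -> str:
    return f"{city}: 15°C"  # stub for a real API call

TOOLS = {
    "calculator": {"fn": calculator, "description": "Evaluate arithmetic"},
    "weather": {"fn": lookup_weather, "description": "Current weather"},
}

def call_tool(name: str, arg: str) -> str:
    if name not in TOOLS:
        # Return the error as an observation so the agent can recover.
        return f"error: unknown tool '{name}'"
    return TOOLS[name]["fn"](arg)

print(call_tool("calculator", "2 + 3"))  # 5
```

Returning errors as strings (rather than raising) matters: the agent sees the failure as an observation and can choose a different tool.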

3. Memory (The Context)

  • Short-term Memory: The current conversation history and scratchpad of thoughts.
  • Long-term Memory: Vector databases or persistent storage for recalling past interactions.
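The two memory types behave very differently, which a short sketch makes clear. Here the bounded deque stands in for a context window and a plain dict stands in for a vector database; both are illustrative stubs:

```python
from collections import deque

# Short-term memory: a bounded conversation window; when full,
# the oldest turn is evicted (a crude stand-in for context limits).
short_term = deque(maxlen=4)
for turn in ["hi", "hello!", "order status?", "it shipped", "thanks"]:
    short_term.append(turn)

# Long-term memory: persistent store keyed by user (a stand-in
# for a vector DB or other durable storage; it never evicts).
long_term = {"user:42": ["prefers email", "ordered item #9"]}

print(list(short_term))  # "hi" has been evicted
```

The design question for any agent is what gets promoted from the short-term window into long-term storage before it falls out.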

Modeling in Sruja

We can map these components to Sruja elements:

  • Agent -> container or component
  • Tool -> component or external system
  • Memory -> datastore
import { * } from 'sruja.ai/stdlib'


AgentSystem = system "Customer Support Bot" {
  Brain = container "Orchestrator" {
    description "Main control loop"
  }

  Memory = container "Context Store" {
    ShortTerm = component "Conversation History"
    LongTerm = component "Vector DB"
  }

  Tools = container "Toolbelt" {
    CRM = component "CRM Connector"
    KB = component "Knowledge Base"
  }

  Brain -> Tools.CRM "Look up user"
  Brain -> Tools.KB "Search policy"
  Brain -> Memory.ShortTerm "Read/Write context"
}

view index {
include *
}

Agentic Patterns

Single-loop agents are powerful, but complex tasks often require structured patterns or multiple agents working together.

Learning Objectives

  1. Understand ReAct: The foundational pattern of Reason + Act.
  2. Explore Multi-Agent Systems: How agents collaborate.
  3. Model Orchestration: Supervisor vs. Hierarchical flows.

Lessons

  1. The ReAct Pattern
  2. Multi-Agent Orchestration

The ReAct Pattern

ReAct (Reasoning + Acting) is a prompting strategy where the model explicitly generates:

  1. Thought: Reasoning about the current state.
  2. Action: The tool call to make.
  3. Observation: The result of the tool call.

This loop continues until the agent decides it has enough information to answer.
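The Thought/Action/Observation cycle can be sketched with a scripted "model": it emits text, we parse the Action, run the tool, and feed the Observation back. The model outputs and the WeatherAPI tool are stubs, and the Action parsing is deliberately crude:

```python
# ReAct loop sketch with scripted LLM outputs.
scripted = iter([
    "Thought: I need to check weather\nAction: WeatherAPI(SF)",
    "Thought: I have the answer\nAnswer: It's 15°C and cloudy.",
])

def fake_llm(transcript: str) -> str:
    return next(scripted)  # stand-in for a real model call

def weather_api(city: str) -> str:
    return "15°C, Cloudy"  # stubbed tool

def react(question: str, max_steps: int = 4):
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        output = fake_llm("\n".join(transcript))
        transcript.append(output)
        if "Answer:" in output:
            return output.split("Answer:", 1)[1].strip()
        # Crude Action parsing, good enough for the demo.
        city = output.split("Action: WeatherAPI(", 1)[1].rstrip(")")
        transcript.append(f"Observation: {weather_api(city)}")
    return None

result = react("What is the weather in SF?")
print(result)  # It's 15°C and cloudy.
```

Note that the full transcript is re-sent to the model on every step; that is where the latency and cost of multi-step requests comes from.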

Sruja Model

We can model this flow using a scenario or story in Sruja to visualize the sequence.

import { * } from 'sruja.ai/stdlib'


component Agent
component Tool
component User

story ReActLoop "Answering a Question" {
User -> Agent "Ask: What is the weather in SF?"

// Step 1
Agent -> Agent "Thought: I need to check weather"
Agent -> Tool "Action: WeatherAPI(SF)"
Tool -> Agent "Observation: 15°C, Cloudy"

// Step 2
Agent -> Agent "Thought: I have the answer"
Agent -> User "Answer: It's 15°C and cloudy."
}

view index {
include *
}

This visualization helps stakeholders understand the latency and cost implications of the multiple steps involved in a single user request.

Multi-Agent Orchestration

For complex domains, a single agent can get confused. Multi-Agent Systems (MAS) split responsibilities among specialized agents.

Supervisor Pattern

A central "Supervisor" agent routes tasks to worker agents and aggregates results.

import { * } from 'sruja.ai/stdlib'


Supervisor = container "Orchestrator"

Coder = container "Coding Agent" {
  description "Writes and executes code"
}

Writer = container "Documentation Agent" {
  description "Writes summaries"
}

Supervisor -> Coder "Delegates coding tasks"
Supervisor -> Writer "Delegates writing tasks"
Coder -> Supervisor "Returns result"
Writer -> Supervisor "Returns result"

view index {
include *
}

Hierarchical Teams

Agents can manage other agents, forming a tree structure. This is useful for large-scale operations like software development (Manager -> Tech Lead -> Developer).

Network/Mesh

Agents communicate directly with each other without a central supervisor. This is more decentralized but harder to debug. Sruja's relationship visualization shines here by mapping the allowable communication paths.
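At runtime, the allowable paths from the model can be enforced as a simple allowlist of agent-to-agent edges. A sketch (agent names are illustrative; deriving the edge set from a Sruja export is an assumption, not a built-in feature):

```python
# Allowlist of directed communication edges, mirroring the
# relationships declared in the architecture model.
ALLOWED_PATHS = {
    ("Supervisor", "Coder"),
    ("Coder", "Supervisor"),
    ("Coder", "Reviewer"),
}

def can_send(sender: str, receiver: str) -> bool:
    # Default deny: any edge not declared in the model is blocked.
    return (sender, receiver) in ALLOWED_PATHS

print(can_send("Coder", "Reviewer"))       # True
print(can_send("Reviewer", "Supervisor"))  # False: edge not declared
```

Keeping this edge set in sync with the architecture model is what turns a diagram into an enforceable boundary.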

Modeling Agents in Sruja

In this final module, we will bring everything together and learn the best practices for modeling agentic systems in Sruja.

Learning Objectives

  1. Granularity: Deciding when to model an agent as a System, Container, or Component.
  2. Metadata: Using tags to track models, costs, and latency.
  3. Governance: Defining policies for AI safety.

Lessons

  1. Modeling Strategies
  2. Governance and Safety

Modeling Strategies

How should you represent an agent in Sruja? It depends on the scope of your diagram.

Level 1: System Context

If your AI is a product that users interact with, model it as a System.

import { * } from 'sruja.ai/stdlib'


User = person "User"
AI_Assistant = system "Support Bot"

User -> AI_Assistant "Chats with"

view index {
include *
}

Level 2: Container View

If you are designing the internals, agents are often Containers (deployable units).

import { * } from 'sruja.ai/stdlib'


AI_Assistant = system "AI Assistant" {
  Router = container "Router Agent"
  Search = container "Search Agent"
  VectorDB = database "Memory"
}

Level 3: Component View

If you are designing a single agent's logic, the specific tools and chains are Components.

import { * } from 'sruja.ai/stdlib'


AI_Assistant = system "AI Assistant" {
SearchAgent = container "Search Agent" {
  Planner = component "ReAct Logic"
  GoogleTool = component "Search API"
  Scraper = component "Web Scraper"
}
}

Using Metadata

Use metadata to capture AI-specific details:

container GPT4Agent {
  metadata {
    model "gpt-4-turbo"
    temperature "0.7"
    max_tokens "4096"
    cost_per_1k_input "$0.01"
  }
}

Governance and Safety

Autonomous agents can be unpredictable. Architecture-as-Code allows us to define constraints to ensure safety.

Defining Requirements

Use requirement blocks to specify safety properties.

import { * } from 'sruja.ai/stdlib'


container Agent
container BankAPI

Agent -> BankAPI "Transfers funds"

requirement HumanLoop functional "Transfers > $1000 must require human approval"
requirement PII constraint "No PII should be sent to external LLM providers"

view index {
include *
}

Policy as Code

You can enforce rules about which agents can access which tools.

// Example of a prohibited relationship
// Agent -> ProductionDB "Direct Write"
// ^ This could be flagged by a linter rule

Guardrails

Model your guardrails explicitly as components that intercept messages.

container AgentSystem {
  component UserProxy "Input Guardrail"
  component LLM
  component OutputGuard "Output Validator"

  UserProxy -> LLM "Sanitized Input"
  LLM -> OutputGuard "Raw Output"
  OutputGuard -> UserProxy "Safe Response"
}

Quick Start for Seasoned Software Architects

For senior architects who need to enforce standards across large organizations.

This 5-minute course teaches you how to use Sruja to codify architectural policies, prevent drift, and scale governance across multiple teams—without slowing down development.

Why This Course?

As organizations grow, architectural standards become critical but hard to enforce. This course shows you how to:

  • Codify policies as executable rules
  • Prevent architectural drift automatically
  • Scale governance across 100+ engineers
  • Enforce standards in CI/CD without manual reviews
  • Track compliance across services and teams

What You'll Learn

  • Policy as Code: Write architectural rules that run in CI/CD
  • Constraint Enforcement: Prevent violations before they reach production
  • Governance Patterns: Real-world patterns for large organizations
  • Compliance Automation: Track and report on architectural compliance
  • Team Scaling: How to roll out governance without friction

Who This Course Is For

  • Senior/staff architects leading multiple teams
  • Engineering managers responsible for architectural standards
  • Platform teams building developer tooling
  • Architects at companies with 50+ engineers
  • Anyone implementing architecture governance

Prerequisites

  • Experience with software architecture at scale
  • Familiarity with CI/CD pipelines
  • Basic understanding of Sruja syntax (or complete Getting Started first)

Estimated Time

5 minutes — Quick, actionable lessons you can apply immediately.

Course Structure

Module 1: Policy as Code (5 minutes)

Learn to codify architectural standards as executable policies that run in CI/CD.

You'll learn:

  • How to write constraints and conventions
  • How to enforce layer boundaries
  • How to prevent common violations
  • How to integrate with CI/CD pipelines

Learning Outcomes

By the end of this course, you'll be able to:

  • ✅ Write architectural policies as code
  • ✅ Enforce standards automatically in CI/CD
  • ✅ Prevent architectural drift before it happens
  • ✅ Scale governance across large teams
  • ✅ Track compliance across services

Real-World Application

This course uses patterns from:

  • Microservices governance at scale
  • Multi-team architecture standards
  • Compliance requirements (HIPAA, SOC 2)
  • Service boundary enforcement
  • Dependency management policies

Ready to scale your architecture governance? Let's go! 🚀

Module Overview: Policy as Code

Turn architectural standards into executable code that runs in CI/CD.

This module teaches you how to write architectural policies as code, enforce them automatically, and scale governance across large organizations.

Learning Goals

  • Write constraints and conventions in Sruja
  • Enforce layer boundaries and service dependencies
  • Prevent architectural violations in CI/CD
  • Track compliance across services and teams

Why Policy as Code?

Traditional approach:

  • Manual code reviews
  • Architecture decision documents (ADRs) that get outdated
  • Inconsistent enforcement across teams
  • Compliance audits are manual and risky

Policy as Code approach:

  • Automated validation in CI/CD
  • Policies version-controlled with code
  • Consistent enforcement across all teams
  • Compliance reports generated automatically

What You'll Build

By the end of this module, you'll have:

  • ✅ A policy file that enforces architectural standards
  • ✅ CI/CD integration that blocks violations
  • ✅ Compliance tracking across services
  • ✅ Patterns you can apply to your organization

Estimated Time

5 minutes — Quick, focused lessons.

Prerequisites

  • Basic Sruja syntax (see Getting Started)
  • Familiarity with CI/CD (GitHub Actions, GitLab CI, etc.)
  • Understanding of architectural governance challenges

Checklist

  • Understand how to write constraints
  • Know how to enforce conventions
  • Can integrate policies into CI/CD
  • Can track compliance across services

Lesson 1: Writing Constraints and Conventions

The Problem: Architectural Drift

As teams grow, architectural standards drift. Services violate boundaries, dependencies become circular, and compliance requirements are missed. Manual reviews don't scale.

Example violations:

  • Frontend directly accessing database (violates layer boundaries)
  • Services in wrong layers (business logic in presentation layer)
  • Circular dependencies between services
  • Missing compliance controls (HIPAA, SOC 2)

Solution: Policy as Code

Sruja lets you codify architectural standards as constraints and conventions that are:

  • ✅ Version-controlled with your code
  • ✅ Validated automatically in CI/CD
  • ✅ Enforced consistently across teams
  • ✅ Tracked and reported on

Writing Constraints

Constraints define hard rules that must be followed. Violations block CI/CD.

import { * } from 'sruja.ai/stdlib'


// Constraint: Presentation layer cannot access datastores directly
constraint C1 {
  description "Presentation layer must not access datastores"
  rule "containers in layer 'presentation' must not have relations to datastores"
}

// Constraint: No circular dependencies
constraint C2 {
  description "No circular dependencies between services"
  rule "no cycles in service dependencies"
}

// Constraint: Compliance requirement
constraint C3 {
  description "Payment services must have encryption"
  rule "containers with tag 'payment' must have property 'encryption' = 'AES-256'"
}

layering {
  layer Presentation "Presentation Layer" {
    description "User-facing interfaces"
  }
  layer Business "Business Logic Layer" {
    description "Core business logic"
  }
  layer Data "Data Access Layer" {
    description "Data persistence"
  }
}

Shop = system "E-Commerce System" {
  WebApp = container "Web Application" {
    layer Presentation
    // This would violate C1 if it accessed DB directly
  }

  PaymentService = container "Payment Service" {
    layer Business
    tags ["payment"]
    properties {
      encryption "AES-256"  // Required by C3
    }
  }

  DB = database "Database" {
    layer Data
  }

  // Correct: WebApp -> PaymentService -> DB (respects layers)
  WebApp -> PaymentService "Processes payments"
  PaymentService -> DB "Stores transactions"
}

view index {
include *
}

Writing Conventions

Conventions define best practices and naming standards. They're warnings, not blockers.

import { * } from 'sruja.ai/stdlib'


// Convention: Naming standards
convention N1 {
  description "Service names should follow pattern: <domain>-<function>"
  rule "container names should match pattern /^[a-z]+-[a-z]+$/"
}

// Convention: Technology standards
convention T1 {
  description "API services should use REST or gRPC"
  rule "containers with tag 'api' must have technology matching /REST|gRPC/"
}

Platform = system "Microservices Platform" {
  container user-service "User Service" {  // ✅ Follows N1
    tags ["api"]
    technology "REST"  // ✅ Follows T1
  }

  authService = container "Auth Service" {  // ⚠️ Violates N1 (should be auth-service)
    tags ["api"]
    technology "GraphQL"  // ⚠️ Violates T1 (should be REST or gRPC)
  }
}

view index {
include *
}

Real-World Example: Multi-Team Governance

Here's how a large organization enforces standards across teams:

import { * } from 'sruja.ai/stdlib'


// Global constraint: All services must have SLOs
constraint Global1 {
  description "All production services must define SLOs"
  rule "containers with tag 'production' must have slo block"
}

// Team-specific constraint: Payment team standards
constraint Payment1 {
  description "Payment services must be in payment layer"
  rule "containers with tag 'payment' must have layer 'payment'"
}

// Compliance constraint: HIPAA requirements
constraint Compliance1 {
  description "Healthcare data must be encrypted"
  rule "datastores with tag 'healthcare' must have property 'encryption' = 'AES-256'"
}

layering {
  layer payment "Payment Layer"
  layer healthcare "Healthcare Layer"
}

PaymentSystem = system "Payment System" {
  PaymentAPI = container "Payment API" {
    layer payment
    tags ["payment", "production"]
    slo {
      availability { target "99.9%" window "30 days" }
      latency { p95 "200ms" window "7 days" }
    }
  }
}

HealthcareSystem = system "Healthcare System" {
  PatientDB = database "Patient Database" {
    layer healthcare
    tags ["healthcare"]
    properties {
      encryption "AES-256"
    }
  }
}

view index {
include *
}

Enforcing in CI/CD

Add validation to your CI/CD pipeline:

# .github/workflows/architecture.yml
name: Architecture Validation
on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - name: Install Sruja
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja --locked
      - name: Validate Architecture
        run: sruja lint architecture.sruja
      - name: Check Constraints
        run: sruja validate architecture.sruja

Result: Violations block merges automatically.

Key Takeaways

  1. Constraints = Hard rules that block CI/CD
  2. Conventions = Best practices that warn
  3. Version control policies with code
  4. Automate enforcement in CI/CD
  5. Scale governance across teams

Next Steps

  • Try writing constraints for your organization
  • Integrate validation into your CI/CD pipeline
  • Track compliance across services
  • Iterate based on team feedback

You now know how to codify architectural policies. Let's enforce them automatically!

Lesson 2: Enforcing Policies in CI/CD

The Goal: Automatic Enforcement

Policies are useless if they're not enforced. This lesson shows you how to integrate Sruja validation into CI/CD so violations are caught before they reach production.

Basic CI/CD Integration

GitHub Actions

# .github/workflows/architecture.yml
name: Architecture Validation

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      
      - name: Install Sruja
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja --locked
      
      - name: Validate Architecture
        run: |
          sruja fmt architecture.sruja
          sruja lint architecture.sruja
      
      - name: Check Constraints
        run: sruja validate architecture.sruja
      
      - name: Export Documentation
        run: sruja export markdown architecture.sruja > architecture.md

GitLab CI

# .gitlab-ci.yml
architecture-validation:
  image: alpine:latest
  before_script:
    - apk add --no-cache rust cargo
    - cargo install sruja-cli --git https://github.com/sruja-ai/sruja --locked
    - export PATH="$HOME/.cargo/bin:$PATH"
  script:
    - sruja fmt architecture.sruja
    - sruja lint architecture.sruja
    - sruja validate architecture.sruja
  only:
    - merge_requests
    - main

Advanced: Policy Violation Reporting

Generate compliance reports in CI/CD:

- name: Generate Compliance Report
  run: |
    sruja validate architecture.sruja --format-json > violations.json
    sruja compliance -r . -a architecture.sruja -f json > compliance.json
  
- name: Upload Reports
  uses: actions/upload-artifact@v3
  with:
    name: architecture-reports
    path: |
      violations.json
      compliance.json
      architecture.md

Multi-Repository Governance

For organizations with multiple repositories, create a shared policy file:

# .github/workflows/architecture.yml
- name: Validate Against Shared Policies
  run: |
    # Fetch shared policies from central repo
    git clone https://github.com/your-org/architecture-policies.git /tmp/policies
    
    # Validate architecture and optional external constraint files
    sruja validate architecture.sruja -c /tmp/policies/global-constraints.sruja

Pre-commit Hooks

Catch violations before they're committed:

#!/bin/sh
# .git/hooks/pre-commit

# Install Sruja if not available
if ! command -v sruja > /dev/null 2>&1; then
  cargo install sruja-cli --git https://github.com/sruja-ai/sruja --locked
  export PATH="$HOME/.cargo/bin:$PATH"
fi

# Validate architecture
sruja lint architecture.sruja
if [ $? -ne 0 ]; then
  echo "❌ Architecture validation failed. Fix errors before committing."
  exit 1
fi

sruja validate architecture.sruja
if [ $? -ne 0 ]; then
  echo "❌ Constraint violations found. Fix before committing."
  exit 1
fi

echo "✅ Architecture validation passed"
exit 0

Integration with PR Reviews

Add architecture validation as a required check:

- name: Architecture Gate
  run: |
    sruja validate architecture.sruja --fail-on-violations

Result: PRs can't be merged until architecture is valid.

Monitoring Compliance

Track compliance over time:

- name: Track Compliance Metrics
  run: |
    sruja compliance -r . -a architecture.sruja -f json > compliance-metrics.json
    
    # Send to monitoring system
    curl -X POST https://your-monitoring-system/api/metrics \
      -H "Content-Type: application/json" \
      -d @compliance-metrics.json

Key Takeaways

  1. Integrate early — Validate in CI/CD, not manually
  2. Fail fast — Block merges on violations
  3. Report compliance — Track metrics over time
  4. Share policies — Use central policy files for multi-repo orgs
  5. Pre-commit hooks — Catch issues before they're committed

Real-World Pattern

Large organization pattern:

# Central policy repository
architecture-policies/
  ├── global-constraints.sruja    # Organization-wide rules
  ├── team-payment.sruja          # Team-specific rules
  └── compliance-hipaa.sruja      # Compliance requirements

# Each service repository
service-repo/
  ├── architecture.sruja          # Service architecture
  └── .github/workflows/
      └── architecture.yml        # Validates against shared policies

Next Steps

  • Set up CI/CD validation for your architecture
  • Create shared policy files for your organization
  • Add pre-commit hooks for faster feedback
  • Track compliance metrics over time

You now know how to enforce policies automatically. Governance at scale! 🚀

Tutorials

Step-by-step guides to get things done with Sruja.

Basic

Advanced

Combine with the Beginner path or Courses for a full learning path.

CLI Basics

This tutorial teaches the essential Sruja CLI commands for day‑to‑day work.

Install and Verify

Option A – install script (downloads from GitHub Releases):

curl -fsSL https://sruja.ai/install.sh | bash
sruja --version

Option B – from Git (requires Rust):

cargo install sruja-cli --git https://github.com/sruja-ai/sruja
sruja --version

Option C – build from source:

git clone https://github.com/sruja-ai/sruja.git && cd sruja && make build
# Add target/release to PATH or copy target/release/sruja to a directory on PATH
sruja --version

If sruja is not found, ensure the install directory is on your PATH (the install script uses ~/.local/bin by default; Option B uses ~/.cargo/bin; Option C uses target/release).
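
One quick diagnostic is to check how the command resolves. The Python sketch below is illustrative (not part of Sruja): it reports where sruja resolves on PATH, if anywhere:

```python
import shutil

# Illustrative diagnostic (not part of Sruja): report where a command
# resolves on PATH, or None if the shell cannot find it.
def locate(command):
    return shutil.which(command)

print(locate("sruja") or "sruja is not on PATH; add the install directory")
```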

Create a Model

import { * } from 'sruja.ai/stdlib'


App = system "My App" {
Web = container "Web Server"
DB = database "Database"
}
User = person "User"

User -> App.Web "Visits"
App.Web -> App.DB "Reads/Writes"

view index {
include *
}

Lint and Compile

sruja lint example.sruja
sruja compile example.sruja

Format

sruja fmt example.sruja > example.formatted.sruja

Tree View

sruja tree example.sruja

Export to Mermaid

sruja export mermaid example.sruja > example.mmd

DSL Basics

Sruja is an architecture DSL. This tutorial introduces its core elements.

Elements

import { * } from 'sruja.ai/stdlib'


Shop = system "Shop API" {
  WebApp = container "Web" {
    description "Gateway layer"
  }
  CatalogSvc = container "Catalog"
  MainDB = database "Database"
}

User = person "User"

User -> Shop.WebApp "Uses"
Shop.WebApp -> Shop.CatalogSvc "Routes"
Shop.CatalogSvc -> Shop.MainDB "Reads/Writes"

view index {
  include *
}

Descriptions and Metadata

import { * } from 'sruja.ai/stdlib'


Payments = system "Payments" {
  description "Handles payments and refunds"
  // metadata
  metadata {
    team "FinTech"
    tier "critical"
  }
}

Component‑level Modeling

import { * } from 'sruja.ai/stdlib'


App = system "App" {
  Web = container "Web" {
    Cart = component "Cart"
  }
}

Next Steps

Validation & Linting

Sruja ships with a validation engine that helps keep architectures healthy. This tutorial covers how to use it effectively and troubleshoot common issues.

Quick Start

# Lint a single file
sruja lint architecture.sruja

# Lint all .sruja files in a directory
sruja lint ./architectures/

# Get detailed output
sruja lint --verbose architecture.sruja

# Export validation report as JSON (for CI/CD)
sruja lint --json architecture.sruja > lint-report.json
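
The JSON report is meant for machine consumption. The report schema is not documented here, so the sketch below assumes a hypothetical top-level "errors" array; adapt it to the real report shape:

```python
import json
import sys

# Hypothetical CI gate over a lint report. The schema assumed here
# (a top-level "errors" list) is an illustration, not Sruja's documented format.
def gate(report_text):
    report = json.loads(report_text)
    errors = report.get("errors", [])
    for err in errors:
        print(f"lint error: {err}", file=sys.stderr)
    return 1 if errors else 0  # non-zero exit fails the CI step

# Usage: sys.exit(gate(open("lint-report.json").read()))
```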

Common Validation Checks

Sruja validates:

  1. Unique IDs: No duplicate element IDs
  2. Valid references: Relations must connect existing elements
  3. Cycle detection: Informational (cycles are valid for many patterns)
  4. Orphan detection: Elements not used by any relation
  5. Simplicity guidance: Suggests simpler syntax when appropriate
  6. Constraint violations: Policy and constraint rule violations
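
To make the first two checks concrete, here is a minimal, hypothetical sketch (not Sruja's actual implementation) of duplicate-ID and invalid-reference detection over a flat list of elements and relations:

```python
# Hypothetical sketch of two validation checks (not Sruja's real code):
# unique IDs and valid relation references.
def validate(elements, relations):
    """elements: list of IDs; relations: list of (source, target) pairs."""
    errors = []
    seen = set()
    for eid in elements:
        if eid in seen:
            errors.append(f"Duplicate ID: '{eid}'")
        seen.add(eid)
    for src, dst in relations:
        for ref in (src, dst):
            if ref not in seen:
                errors.append(f"Invalid reference: '{ref}' not found")
    return errors

errors = validate(
    ["API", "DB", "API"],               # "API" declared twice
    [("API", "DB"), ("API", "Cache")],  # "Cache" never declared
)
```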

Real-World Example: E-Commerce Platform

Let's validate a real architecture:

import { * } from 'sruja.ai/stdlib'


Customer = person "Customer"

ECommerce = system "E-Commerce Platform" {
    WebApp = container "Web Application" {
        technology "React"
    }
    API = container "REST API" {
        technology "Rust"
    }
    ProductDB = database "Product Database" {
        technology "PostgreSQL"
    }
    OrderDB = database "Order Database" {
        technology "PostgreSQL"
    }
}

Customer -> ECommerce.WebApp "Browses products"
ECommerce.WebApp -> ECommerce.API "Calls API"
ECommerce.API -> ECommerce.ProductDB "Reads products"
ECommerce.API -> ECommerce.OrderDB "Writes orders"

view index {
include *
}

Validation output:

✅ Valid architecture
✅ All references valid
✅ No orphan elements
ℹ️  Cycle detected: ECommerce.WebApp ↔ ECommerce.API (this is valid for request/response)

Troubleshooting Common Errors

Error 1: Invalid Reference

Error message:

❌ Invalid reference: ECommerce.API -> ECommerce.NonExistent "Calls"
   Element 'NonExistent' not found in system 'ECommerce'

Problem: You're referencing an element that doesn't exist.

Fix:

// ❌ Wrong
ECommerce.API -> ECommerce.NonExistent "Calls"

// ✅ Correct - element exists
ECommerce.API -> ECommerce.ProductDB "Reads"

Real-world scenario: You renamed a service but forgot to update all references.

Error 2: Duplicate ID

Error message:

❌ Duplicate ID: 'API' found in system 'ECommerce'
   First occurrence: line 5
   Second occurrence: line 12

Problem: Two elements have the same ID in the same scope.

Fix:

import { * } from 'sruja.ai/stdlib'


// EXPECTED_FAILURE: unexpected token
// ❌ Wrong
ECommerce = system "E-Commerce" {
API = container "REST API"
API = container "GraphQL API"  // Duplicate ID!
}

// ✅ Correct - use unique IDs
ECommerce = system "E-Commerce" {
RESTAPI = container "REST API"
GraphQLAPI = container "GraphQL API"
}

Real-world scenario: You added a new API type but used the same ID.

Error 3: Orphan Element

Warning message:

⚠️  Orphan element: ECommerce.Cache
   This element is not referenced by any relation

Problem: An element exists but nothing connects to it.

Fix options:

  1. Add a relation (if the element should be used):
// Add relation to use the cache
ECommerce.API -> ECommerce.Cache "Reads cache"
  2. Remove the element (if it's not needed):
// Remove if not part of current architecture
// datastore Cache "Cache" { ... }
  3. Document why it's isolated (if intentional):
datastore Cache "Cache" {
    description "Future: Will be used for product catalog caching"
    metadata {
        status "planned"
    }
}

Real-world scenario: You added a component for future use but haven't integrated it yet.

Error 4: Constraint Violation

Error message:

❌ Constraint violation: 'NoDirectDB' violated
   ECommerce.WebApp -> ECommerce.ProductDB "Direct database access"
   Constraint: Frontend containers cannot access databases directly

Problem: A constraint rule is being violated.

Fix:

// EXPECTED_FAILURE: Invalid reference
// ❌ Wrong - violates constraint
ECommerce.WebApp -> ECommerce.ProductDB "Direct access"

// ✅ Correct - go through API
ECommerce.WebApp -> ECommerce.API "Calls API"
ECommerce.API -> ECommerce.ProductDB "Reads products"

Real-world scenario: Enforcing architectural standards (e.g., "no direct database access from frontend").

Understanding Validation Messages

Cycles Are Valid

Sruja detects cycles but doesn't block them - cycles are valid architectural patterns:

  • Feedback loops: User ↔ System interactions
  • Event-driven: Service A ↔ Service B via events
  • Mutual dependencies: Microservices that call each other
  • Bidirectional flows: API ↔ Database (read/write)
import { * } from 'sruja.ai/stdlib'


// ✅ Valid - feedback loop
User = person "User"
Platform = system "Platform"
User -> Platform "Makes request"
Platform -> User "Sends response"

// ✅ Valid - event-driven pattern
ServiceA = system "Service A"
ServiceB = system "Service B"
ServiceA -> ServiceB "Publishes event"
ServiceB -> ServiceA "Publishes response event"

// ✅ Valid - mutual dependencies
PaymentService = system "Payment Service"
OrderService = system "Order Service"
PaymentService -> OrderService "Updates order status"
OrderService -> PaymentService "Requests payment"

view index {
include *
}

The validator will inform you about cycles but won't prevent compilation, as they're often intentional.
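
Cycle detection of this kind is a standard graph traversal. A minimal depth-first-search sketch (illustrative, not Sruja's implementation):

```python
# Illustrative cycle detection via depth-first search (not Sruja's real code).
def find_cycle(edges):
    """edges: list of (source, target) pairs. Returns True if any cycle exists."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {node: WHITE for node in graph}

    def dfs(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge: we returned to the current path
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)

# The request/response pair above forms a cycle; a pure chain does not.
has_cycle = find_cycle([("WebApp", "API"), ("API", "WebApp")])
```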

Simplicity Guidance

Sruja suggests simpler syntax when appropriate:

Example:

ℹ️  Simplicity suggestion: Consider using 'system' instead of nested 'container'
   Current: system App { container Web { ... } }
   Simpler: system Web { ... }

This is informational only - use the level of detail that matches your modeling goal.

CI/CD Integration

GitHub Actions Example

Add validation to your CI pipeline:

name: Validate Architecture

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - name: Install Sruja
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja --locked

      - name: Lint Architecture
        run: |
          sruja lint architecture.sruja

      - name: Export Validation Report
        if: always()
        run: |
          sruja lint --json architecture.sruja > lint-report.json

      - name: Upload Report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: lint-report
          path: lint-report.json

GitLab CI Example

validate-architecture:
  image: rust:1.70
  script:
    - cargo install sruja-cli --git https://github.com/sruja-ai/sruja
    - sruja lint architecture.sruja
  only:
    - merge_requests
    - main

Pre-commit Hook

Validate before every commit:

#!/bin/sh
# .git/hooks/pre-commit

sruja lint architecture.sruja
if [ $? -ne 0 ]; then
    echo "❌ Architecture validation failed. Fix errors before committing."
    exit 1
fi

Advanced: Custom Validation Rules

Use constraints and conventions for custom validation:

import { * } from 'sruja.ai/stdlib'


// Define constraints
constraints {
    "Frontend cannot access databases directly"
}

// Apply conventions
conventions {
    "Layered Architecture: Frontend → API → Database"
}

Platform = system "Platform" {
    Frontend = container "React App"
    API = container "REST API"
    DB = database "PostgreSQL"

    // ✅ Valid
    Frontend -> API "Calls API"
    API -> DB "Reads/Writes"

    // ❌ Will be caught by validator
    // Frontend -> DB "Direct access"  // Violates constraint
}

view index {
include *
}

Real-World Workflow

Step 1: Write Architecture

import { * } from 'sruja.ai/stdlib'


App = system "App" {
    Web = container "Web"
    DB = datastore "Database"
}

view index {
include *
}

Step 2: Validate

sruja lint architecture.sruja

Step 3: Fix Errors

Address any validation errors or warnings.

Step 4: Commit and Push

Once validation passes locally, commit. CI/CD will validate again.

Step 5: Monitor in Production

Use validation in CI/CD to catch issues before they reach production.

Key Takeaways

  1. Validate early and often: Run sruja lint frequently during development
  2. Fix errors immediately: Don't accumulate validation debt
  3. Integrate with CI/CD: Catch issues before they reach production
  4. Understand cycles: They're often valid patterns, not errors
  5. Use constraints: Enforce architectural standards automatically

Exercise: Fix Validation Errors

Scenario: You have an architecture file with several validation errors.

Tasks:

  1. Run sruja lint on a file
  2. Identify all errors and warnings
  3. Fix each error
  4. Re-validate to confirm fixes

Time: 10 minutes

Further Reading

Export Diagrams

Sruja currently supports export to Mermaid (for Markdown) and interactive visualization in Studio.

Export Formats

1. Mermaid (Markdown)

Export to Mermaid code fences for use in Markdown pages:

sruja export mermaid architecture.sruja > architecture.md

The output includes ```mermaid blocks that render in most Markdown engines with Mermaid enabled.

Use cases:

  • Documentation sites using Markdown
  • Lightweight diagrams without external tooling

2. Studio (Interactive)

Open and preview diagrams interactively in Studio:

Open in Studio from the Learn examples or visit /studio/

Features:

  • Interactive preview and navigation
  • C4 model views (context, containers, components)
  • Embedded documentation and metadata

Use cases:

  • Architecture reviews
  • Presentations
  • Iterative modeling and validation

Mermaid Styling

You can customize Mermaid via frontmatter or exporter configuration. See the Mermaid exporter in crates/sruja-export/src/mermaid/ for options.

Choosing the Right Path

  • Mermaid: For Markdown-first workflows and lightweight sharing
  • Studio: For interactive exploration and richer documentation

Systems Thinking

Systems thinking helps you understand how components interact as part of a whole. Sruja supports five core systems thinking concepts.

1. Parts and Relationships

Systems thinking starts with understanding what the system contains (parts) and how they connect (relationships).

import { * } from 'sruja.ai/stdlib'


Customer = person "End User"

Shop = system "E-Commerce System" {
WebApp = container "Web Application" {
  technology "React"
}

API = container "API Service" {
  technology "Rust"
}

DB = database "PostgreSQL Database" {
  technology "PostgreSQL 14"
}
}

// Relationships show how parts interact
Customer -> Shop.WebApp "Uses"
Shop.WebApp -> Shop.API "Calls"
Shop.API -> Shop.DB "Reads/Writes"

view index {
include *
}

Key insight: Identify the parts first, then define how they relate.

2. Boundaries

Boundaries define what's inside the system vs. what's outside (the environment).

import { * } from 'sruja.ai/stdlib'


// Inside boundary: System contains these components
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API"
  DB = database "Database"
}

// Outside boundary: External entities
Customer = person "End User"
Admin = person "System Administrator"

PaymentGateway = system "Third-party Payment Service" {
metadata {
  tags ["external"]
}
}

// Relationships cross boundaries
Customer -> Shop.WebApp "Uses"
Shop.API -> PaymentGateway "Processes"

view index {
include *
}

Key insight: Use system to define internal boundaries; model what is outside with person and with systems tagged ["external"] in metadata.

3. Flows

Flows show how information and data move through the system. Sruja supports two flow styles:

Data Flow Diagram (DFD) Style

Use scenario for data-oriented flows:

// EXPECTED_FAILURE: Layer violation
// SKIP_ORPHAN_CHECK
import { * } from 'sruja.ai/stdlib'


Customer = person "Customer"
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API"
  DB = database "Database"
}
PaymentGateway = system "PaymentGateway" {
  metadata {
    tags ["external"]
  }
}

OrderProcess = scenario "Order Processing" {
Customer -> Shop.WebApp "Submits Order"
Shop.WebApp -> Shop.API "Sends Order Data"
Shop.API -> Shop.DB "Saves Order"
Shop.API -> PaymentGateway "Charges Payment"
Shop.API -> Shop.WebApp "Returns Result"
Shop.WebApp -> Customer "Shows Confirmation"
}

view index {
include *
}

User Story/Scenario Style

Use story for behavioral flows:

// EXPECTED_FAILURE: Layer violation
import { * } from 'sruja.ai/stdlib'


Customer = person "End User"
ECommerce = system "E-Commerce System" {
CartPage = container "Shopping Cart Page"
WebApp = container "Web Application"
API = container "API Service"
DB = database "Database"
}
PaymentGateway = system "Payment Service" {
metadata {
  tags ["external"]
}
}

Checkout = story "User Checkout Flow" {
Customer -> ECommerce.CartPage "adds items to cart"
ECommerce.CartPage -> ECommerce.WebApp "clicks checkout"
ECommerce.WebApp -> ECommerce.API "validates cart"
ECommerce.API -> ECommerce.DB "checks inventory"
ECommerce.DB -> ECommerce.API "returns stock status"
ECommerce.API -> PaymentGateway "processes payment"
PaymentGateway -> ECommerce.API "confirms payment"
ECommerce.API -> ECommerce.DB "creates order"
ECommerce.API -> ECommerce.WebApp "returns order confirmation"
ECommerce.WebApp -> Customer "displays success message"
}

view index {
include *
}

Key insight: Use scenario for data flows (DFD) and story for behavioral flows (BDD).

4. Feedback Loops

Feedback loops show how actions create reactions that affect future actions. Cycles are valid patterns in Sruja.

Simple Feedback Loop

// EXPECTED_FAILURE: Layer violation
person = kind "Person"
system = kind "System"
container = kind "Container"
component = kind "Component"
database = kind "Database"
queue = kind "Queue"

User = person "End User"
App = system "Application" {
WebApp = container "Web Application"
API = container "API Service"
}

// Feedback loop: User action → System response → User reaction
User -> App.WebApp "Submits Form"
App.WebApp -> App.API "Validates"
App.API -> App.WebApp "Returns Validation Result"
App.WebApp -> User "Shows Feedback"
// The feedback affects user's next action (completing the loop)

System Feedback Loop

import { * } from 'sruja.ai/stdlib'


Admin = person "Administrator"
Shop = system "Shop" {
  API = container "API"
  Inventory = database "Inventory"
}

// Event-driven feedback loop
Shop.API -> Shop.Inventory "Updates Stock"
Shop.Inventory -> Shop.API "Notifies Low Stock"
Shop.API -> Admin "Sends Alert"
Admin -> Shop.API "Adjusts Inventory"
// Creates feedback: API ↔ Inventory ↔ Admin

view index {
include *
}

Key insight: Cycles model natural feedback loops, event-driven patterns, and mutual dependencies. They're valid architectural patterns.

5. Context

Context defines the environment the system operates in - external dependencies, stakeholders, and surrounding systems.

import { * } from 'sruja.ai/stdlib'


// Internal system
Shop = system "Shop" {
  WebApp = container "Web Application"
  API = container "API Service"
  DB = database "Database"
}

// Context: Stakeholders
Customer = person "End User"
Admin = person "System Administrator"
Support = person "Customer Support"

// Context: External dependencies
PaymentGateway = system "Third-party Payment" {
metadata {
  tags ["external"]
}
}

EmailService = system "Email Notifications" {
metadata {
  tags ["external"]
}
}

AnalyticsService = system "Usage Analytics" {
metadata {
  tags ["external"]
}
}

// Context relationships
Customer -> Shop "Uses"
Admin -> Shop "Manages"
Support -> Shop "Monitors"
Shop -> PaymentGateway "Depends on"
Shop -> EmailService "Sends notifications"
Shop -> AnalyticsService "Tracks usage"

view index {
include *
}

Key insight: Context includes all external entities and dependencies that affect or are affected by your system.

Putting It All Together

Here's a complete example combining all five concepts:

// EXPECTED_FAILURE: Layer violation
person = kind "Person"
system = kind "System"
container = kind "Container"
component = kind "Component"
database = kind "Database"
queue = kind "Queue"

// 1. PARTS AND RELATIONSHIPS
Customer = person "End User"
Admin = person "System Administrator"

ECommerce = system "E-Commerce System" {
WebApp = container "Web Application" {
  technology "React"
}
API = container "API Service" {
  technology "Rust"
}
DB = database "PostgreSQL Database" {
  technology "PostgreSQL 14"
}
}

// 2. BOUNDARIES
PaymentGateway = system "Third-party Payment Service" {
metadata {
  tags ["external"]
}
}

// 3. FLOWS
OrderProcess = scenario "Order Processing" {
Customer -> ECommerce.WebApp "Submits Order"
ECommerce.WebApp -> ECommerce.API "Sends Order Data"
ECommerce.API -> ECommerce.DB "Saves Order"
ECommerce.API -> PaymentGateway "Charges Payment"
ECommerce.API -> ECommerce.WebApp "Returns Result"
ECommerce.WebApp -> Customer "Shows Confirmation"
}

// 4. FEEDBACK LOOPS
Customer -> ECommerce.WebApp "Submits Form"
ECommerce.WebApp -> ECommerce.API "Validates"
ECommerce.API -> ECommerce.WebApp "Returns Validation Result"
ECommerce.WebApp -> Customer "Shows Feedback"

ECommerce.API -> ECommerce.DB "Updates Inventory"
ECommerce.DB -> ECommerce.API "Notifies Low Stock"
ECommerce.API -> Admin "Sends Alert"
Admin -> ECommerce.API "Adjusts Inventory"

// 5. CONTEXT
Support = person "Customer Support"
EmailService = system "Email Notifications" {
metadata {
  tags ["external"]
}
}

Customer -> ECommerce "Uses"
Admin -> ECommerce "Manages"
Support -> ECommerce "Monitors"
ECommerce -> PaymentGateway "Depends on"
ECommerce -> EmailService "Sends notifications"

Why Systems Thinking Matters

  • Holistic understanding: See the whole system, not just parts
  • Natural patterns: Model real-world interactions and feedback
  • Clear boundaries: Understand what's in scope vs. context
  • Flow visualization: See how data and information move
  • Valid cycles: Feedback loops are natural, not errors

Next Steps

  • Try the complete example: book/valid-examples/feedback-loops-basic.sruja and book/valid-examples/causal-loops-basic.sruja
  • Learn Deployment Modeling for infrastructure perspective

Design Mode Workflow

Design Mode helps you build architecture assets step by step, starting with high‑level context and progressively adding detail. It also lets you focus on a specific system or container and share audience‑specific views.

Workflow Steps

Step 1: Context — define person and system

Start with the high-level context:

import { * } from 'sruja.ai/stdlib'


User = person "User"
Shop = system "Shop"

view index {
include *
}

Step 2: Containers — add container, datastore, queue to a chosen system

Add containers and datastores:

import { * } from 'sruja.ai/stdlib'


User = person "User"
App = system "App" {
WebApp = container "Web Application"
API = container "API Service"
DB = database "Database"
}

User -> App.WebApp "Uses"
App.WebApp -> App.API "Calls"
App.API -> App.DB "Reads/Writes"

view index {
include *
}

Step 3: Components — add component inside a chosen container

Drill down into components:

import { * } from 'sruja.ai/stdlib'


App = system "App" {
WebApp = container "Web Application" {
  UI = component "User Interface"
}
API = container "API Service" {
  Auth = component "Auth Service"
}
}

// Component‑level interaction
App.WebApp.UI -> App.API.Auth "Calls"

view index {
include *
}

Step 4: Stitch — add relations and optional scenarios; share focused views

Add relations and scenarios to complete the model.

Layers and Focus

  • Levels: L1 Context, L2 Containers, L3 Components, All
  • Focus:
    • L2 focus by systemId
    • L3 focus by systemId.containerId

When focused, non‑relevant nodes/edges are dimmed so you can work deeper without distractions.
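
The focus rule can be pictured as a prefix match on dotted IDs. This is a hypothetical sketch of the idea, not Studio's actual code:

```python
# Hypothetical focus filter: a node stays highlighted if it is the focused
# element, an ancestor of it, or nested inside it (dotted-ID prefix match).
def is_relevant(node_id, focus):
    return (
        node_id == focus
        or focus.startswith(node_id + ".")  # ancestor (e.g. Shop for Shop.API)
        or node_id.startswith(focus + ".")  # descendant (e.g. Shop.API.Auth)
    )

nodes = ["Shop", "Shop.API", "Shop.API.Auth", "Shop.WebApp", "User"]
highlighted = [n for n in nodes if is_relevant(n, "Shop.API")]
# Everything else (Shop.WebApp, User) would be dimmed.
```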

Viewer opens focused views via URL params:

  • ?level=1 → Context
  • ?level=2&focus=Shop → Containers of system Shop
  • ?level=3&focus=Shop.API → Components in container API of system Shop
  • DSL payload is passed with #code=<lz-base64> or ?code=<urlencoded>.
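
For the ?code=<urlencoded> variant, a focused Viewer link can be assembled with standard URL encoding (the #code=<lz-base64> form needs an LZ-string library and is omitted here). The /studio/ path and parameter names follow the list above:

```python
from urllib.parse import quote

# Build a Viewer link using the ?code=<urlencoded> form described above.
def viewer_url(dsl, level, focus=None):
    params = [f"level={level}"]
    if focus:
        params.append(f"focus={quote(focus)}")
    params.append(f"code={quote(dsl)}")
    return "/studio/?" + "&".join(params)

url = viewer_url('Shop = system "Shop"', level=2, focus="Shop")
```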

Studio Experience

  • Diagram‑first: Studio opens with the diagram; a Design Mode overlay guides steps
  • Contextual palette: add containers at L2 (focused system), components at L3 (focused container)
  • Autosave on close: resume drafts; share per‑layer links from the toolbar

Viewer Experience

  • Use level buttons and focus to tailor the view
  • Dimming clarifies what's relevant at each depth
  • Share via copied URL (includes level, focus, and DSL)

See Also

Demo Script: Quick 10-Minute Walkthrough

This tutorial provides a quick 10-minute walkthrough to demonstrate Sruja's core capabilities: modeling, validation, and export.

1) Model (2 minutes)

Create a simple e-commerce architecture:

import { * } from 'sruja.ai/stdlib'


User = person "User"
Shop = system "Shop" {
  WebApp = container "Web App"
  API = container "API"
  DB = datastore "Database"
}

User -> Shop.WebApp "Uses"
Shop.WebApp -> Shop.API "Calls"
Shop.API -> Shop.DB "Reads/Writes"

view index {
include *
}

2) Validate (2 minutes)

Format and validate your model:

sruja fmt architecture.sruja
sruja lint architecture.sruja

3) Add Targets (3 minutes)

Add SLOs and scaling configuration:

import { * } from 'sruja.ai/stdlib'


Shop = system "Shop" {
API = container "API" {
  scale {
    metric "req/s"
    min 200
    max 2000
  }

  slo {
    availability {
      target "99.9%"
      window "30 days"
    }
    latency {
      p95 "200ms"
      window "7 days"
    }
    errorRate {
      target "< 0.1%"
      window "30 days"
    }
  }
}
}

view index {
include *
}

4) Export (3 minutes)

Export to various formats:

sruja export markdown architecture.sruja
sruja export mermaid architecture.sruja

Outcome: Living docs and diagrams generated from the model.


Note: Sruja is free and open source (Apache 2.0 licensed). Need help with adoption? Professional consulting services are available. Contact the team through GitHub Discussions to learn more.

Deployment Modeling

Model production environments and map containers onto infrastructure nodes.

import { * } from 'sruja.ai/stdlib'


WebServer = container "Nginx"
AppServer = container "Python App"
Database = database "Postgres"


deployment Production "Production" {
  node AWS "AWS" {
    node USEast1 "US-East-1" {
      node EC2 "EC2 Instance" {
        containerInstance WebServer
        containerInstance AppServer
      }
      node RDS "RDS" {
        containerInstance Database
      }
    }
  }
}

view index {
include *
}

CI/CD Integration

Integrate Sruja into your CI/CD pipeline to automatically validate architecture, enforce standards, and generate documentation on every commit.

Why CI/CD Integration?

For DevOps teams:

  • Catch architecture violations before they reach production
  • Automate documentation generation
  • Enforce architectural standards across teams
  • Reduce manual review overhead

For software architects:

  • Ensure architectural decisions are documented
  • Prevent architectural drift
  • Scale governance across multiple teams

For product teams:

  • Keep architecture docs up-to-date automatically
  • Track architecture changes over time
  • Ensure compliance with requirements

Real-World Scenario

Challenge: A team of 50 engineers across 10 microservices. Architecture documentation is outdated, and violations happen frequently.

Solution: Integrate Sruja validation into CI/CD to:

  • Validate architecture on every PR
  • Generate updated documentation automatically
  • Block merges if constraints are violated
  • Track architecture changes over time

GitHub Actions Integration

Sruja’s CLI is written in Rust. In CI you can either build from source in this repo or install from the Git repo with cargo install. A reusable composite action is available in the Sruja repo for building and validating.

Using the Sruja repo reusable action (this repository)

If your workflow runs inside the sruja repo, use the composite action so the CLI is built once and lint/export run on your files:

- uses: actions/checkout@v6
- uses: ./.github/actions/sruja-validate
  with:
    working-directory: .
    files: "book/valid-examples/**/*.sruja" # or '**/*.sruja'
    run-export: "false"

Basic setup (any repository)

Install the CLI from the Sruja Git repo with Cargo, then run sruja lint and sruja export:

name: Architecture Validation

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]

jobs:
  validate-architecture:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v6

      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable
      - name: Install Sruja CLI
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja

      - name: Validate Architecture
        run: sruja lint architecture.sruja

      - name: Export Documentation
        run: |
          sruja export markdown architecture.sruja > architecture.md
          sruja export json architecture.sruja > architecture.json

      - name: Upload Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: architecture-docs
          path: |
            architecture.md
            architecture.json

Advanced: Enforce Constraints

name: Architecture Governance

on: [pull_request]

jobs:
  enforce-architecture:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          fetch-depth: 0 # Full history for diff

      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable
      - name: Install Sruja CLI
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja

      - name: Validate Architecture
        id: validate
        run: |
          exit_code=0
          sruja lint architecture.sruja > lint-output.txt 2>&1 || exit_code=$?
          echo "exit_code=$exit_code" >> $GITHUB_OUTPUT
          cat lint-output.txt

      - name: Check for Constraint Violations
        if: steps.validate.outputs.exit_code != 0
        run: |
          echo "❌ Architecture validation failed!"
          echo "Please fix the errors before merging."
          exit 1

      - name: Comment PR with Validation Results
        if: always()
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            const lintOutput = fs.readFileSync('lint-output.txt', 'utf8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Architecture Validation Results\n\n\`\`\`\n${lintOutput}\n\`\`\``
            });

Multi-Architecture Validation

For monorepos with multiple architecture files:

name: Validate All Architectures

on: [push, pull_request]

jobs:
  validate-all:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        architecture:
          - architecture.sruja
          - services/payment-service.sruja
          - services/order-service.sruja
          - services/user-service.sruja
    steps:
      - uses: actions/checkout@v6

      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable
      - name: Install Sruja CLI
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja

      - name: Validate ${{ matrix.architecture }}
        run: sruja lint ${{ matrix.architecture }}

GitLab CI Integration

stages:
  - validate

validate-architecture:
  stage: validate
  image: rust:1.70
  before_script:
    - cargo install sruja-cli --git https://github.com/sruja-ai/sruja
  script:
    - sruja lint architecture.sruja
    - sruja export markdown architecture.sruja > architecture.md
    - sruja export json architecture.sruja > architecture.json
  artifacts:
    paths:
      - architecture.md
      - architecture.json
    expire_in: 30 days
  only:
    - merge_requests
    - main

Jenkins Integration

pipeline {
    agent any

    stages {
        stage('Install Sruja CLI') {
            steps {
                sh 'cargo install sruja-cli --git https://github.com/sruja-ai/sruja'
            }
        }
        stage('Validate Architecture') {
            steps {
                sh 'sruja lint architecture.sruja'
            }
        }

        stage('Generate Documentation') {
            steps {
                sh '''
                    sruja export markdown architecture.sruja > architecture.md
                    sruja export json architecture.sruja > architecture.json
                '''
            }
        }

        stage('Archive Documentation') {
            steps {
                archiveArtifacts artifacts: 'architecture.*', fingerprint: true
            }
        }
    }

    post {
        failure {
            emailext (
                subject: "Architecture Validation Failed: ${env.JOB_NAME} - ${env.BUILD_NUMBER}",
                body: "Architecture validation failed. Please check the build logs.",
                to: "${env.CHANGE_AUTHOR_EMAIL}"
            )
        }
    }
}

CircleCI Integration

version: 2.1

jobs:
  validate-architecture:
    docker:
      - image: rust:1.70
    steps:
      - checkout
      - run:
          name: Install Sruja CLI
          command: cargo install sruja-cli --git https://github.com/sruja-ai/sruja
      - run:
          name: Validate
          command: sruja lint architecture.sruja
      - run:
          name: Generate Docs
          command: sruja export markdown architecture.sruja > architecture.md
      - store_artifacts:
          path: architecture.md

workflows:
  version: 2
  validate:
    jobs:
      - validate-architecture

Pre-commit Hooks

Validate before every commit locally. Ensure the Sruja CLI is on your PATH (e.g. cargo install sruja-cli --git https://github.com/sruja-ai/sruja or build from the Sruja repo):

#!/bin/sh
# .git/hooks/pre-commit

if ! command -v sruja >/dev/null 2>&1; then
    echo "Sruja CLI not found. Install with: cargo install sruja-cli --git https://github.com/sruja-ai/sruja"
    exit 1
fi

sruja lint architecture.sruja
if [ $? -ne 0 ]; then
    echo "❌ Architecture validation failed. Fix errors before committing."
    exit 1
fi

sruja fmt architecture.sruja > architecture.formatted.sruja
mv architecture.formatted.sruja architecture.sruja
git add architecture.sruja

exit 0

Or use the pre-commit framework (requires Sruja on PATH):

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: sruja-lint
        name: Sruja Lint
        entry: sruja lint
        language: system
        files: \.sruja$
        pass_filenames: true

Automated Documentation Updates

Generate and commit documentation automatically:

name: Update Architecture Docs

on:
  push:
    branches: [main]
    paths:
      - "architecture.sruja"

jobs:
  update-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          token: ${{ secrets.GITHUB_TOKEN }}

      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable
      - name: Install Sruja CLI
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja

      - name: Generate Documentation
        run: |
          sruja export markdown architecture.sruja > docs/architecture.md
          sruja export json architecture.sruja > docs/architecture.json

      - name: Commit Changes
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add docs/architecture.*
          git diff --staged --quiet || git commit -m "docs: update architecture documentation"
          git push

Architecture Change Tracking

Track architecture changes over time:

name: Track Architecture Changes

on:
  pull_request:
    paths:
      - "architecture.sruja"

jobs:
  track-changes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          fetch-depth: 0

      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable
      - name: Install Sruja CLI
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja

      - name: Compare Architectures
        run: |
          git show origin/${{ github.base_ref }}:architecture.sruja > base.sruja
          sruja export json base.sruja > base.json
          sruja export json architecture.sruja > current.json
          echo "## Architecture Changes" >> $GITHUB_STEP_SUMMARY
          echo "Comparing base and current architecture..." >> $GITHUB_STEP_SUMMARY

      - name: Comment Changes
        uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '## Architecture Changes Detected\n\nReview the architecture changes in this PR.'
            });

Real-World Example: Microservices Platform

Complete CI/CD setup for a microservices platform:

name: Architecture Governance

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]

jobs:
  validate-architecture:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service:
          - payment-service
          - order-service
          - user-service
          - inventory-service
    steps:
      - uses: actions/checkout@v6

      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable
      - name: Install Sruja CLI
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja

      - name: Validate ${{ matrix.service }}
        run: sruja lint services/${{ matrix.service }}/architecture.sruja

      - name: Generate Service Docs
        run: sruja export markdown services/${{ matrix.service }}/architecture.sruja > docs/services/${{ matrix.service }}.md

  validate-platform:
    runs-on: ubuntu-latest
    needs: validate-architecture
    steps:
      - uses: actions/checkout@v6

      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable
      - name: Install Sruja CLI
        run: cargo install sruja-cli --git https://github.com/sruja-ai/sruja

      - name: Validate Platform Architecture
        run: sruja lint platform-architecture.sruja

      - name: Generate Platform Docs
        run: |
          sruja export markdown platform-architecture.sruja > docs/platform.md
          sruja export json platform-architecture.sruja > docs/platform.json

      - name: Upload Documentation
        uses: actions/upload-artifact@v4
        with:
          name: architecture-docs
          path: docs/

Key Takeaways

  1. Automate everything: Don't rely on manual validation
  2. Fail fast: Block merges if constraints are violated
  3. Generate docs automatically: Keep documentation up-to-date
  4. Track changes: Monitor architecture evolution over time
  5. Scale governance: Use CI/CD to enforce standards across teams

Exercise: Set Up CI/CD Integration

Tasks:

  1. Choose a CI/CD platform (GitHub Actions, GitLab CI, etc.)
  2. Create a workflow that validates architecture on every PR
  3. Add documentation generation
  4. Test the workflow with a sample architecture file

Time: 20 minutes

Further Reading

Agentic AI Modeling

This tutorial shows how to model agent-based systems with orchestrators, planners, executors, tools, and memory.

Core Structure

import { * } from 'sruja.ai/stdlib'

AgentSystem = system "Agentic System" {
    Orchestrator = container "Agent Orchestrator"
    Planner = container "Planner"
    Executor = container "Executor"
    Tools = container "Tooling API"
    Memory = database "Long-Term Memory"
}

User = person "User"

User -> AgentSystem.Orchestrator "Requests task"
AgentSystem.Orchestrator -> AgentSystem.Planner "Plans steps"
AgentSystem.Orchestrator -> AgentSystem.Executor "Delegates execution"
AgentSystem.Executor -> AgentSystem.Tools "Calls tools"
AgentSystem.Executor -> AgentSystem.Memory "Updates state"

view index {
    include *
}

Add Governance

Guardrails = policy "Safety Policies" {
  description "Limit tool calls, enforce approvals, track risky operations"
}

R1 = requirement functional "Explain actions"
R2 = requirement constraint "No PII exfiltration"

Integrate RAG

import { * } from 'sruja.ai/stdlib'

AgentSystem = system "Agent System" {
  Executor = container "Executor"
}

RAG = system "Retrieval-Augmented Generation" {
  Retriever = container "Retriever"
  Generator = container "Generator"
  VectorDB = database "VectorDB"
}

AgentSystem.Executor -> RAG.Retriever "Fetch contexts"
RAG.Retriever -> RAG.VectorDB "Search"
RAG.Generator -> AgentSystem.Executor "Produce answer"

Next Steps

  • Explore book/valid-examples/pattern-agentic-ai.sruja and book/valid-examples/pattern-rag-pipeline.sruja
  • Add scenarios to capture common workflows
  • Use views to present developer vs. executive perspectives

Extending the CLI (Rust)

Sruja's CLI lives in crates/sruja-cli and uses clap for argument parsing. To add or change subcommands:

  1. Open crates/sruja-cli/src/main.rs (or the relevant module) and see how existing commands (e.g. lint, export) are defined.
  2. Add a subcommand using clap's Subcommand enum and match on it in the main entrypoint; run your logic and return Result with ? for errors.
  3. Run and test with cargo run -p sruja-cli -- <subcommand> ....
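As a rough sketch of step 2, the dispatch pattern looks like this. The clap parsing itself is elided for brevity: in the real CLI the enum derives clap's Subcommand and is produced by the parser, and the variant names and bodies below are illustrative, not the actual crates/sruja-cli code.

```rust
use std::path::PathBuf;

// In the real CLI this enum derives clap::Subcommand and is
// filled in by Cli::parse(); here it is constructed by hand.
enum Command {
    Lint { file: PathBuf },
    Tree { file: PathBuf },
}

fn run(cmd: Command) -> Result<(), Box<dyn std::error::Error>> {
    match cmd {
        Command::Lint { file } => {
            // `?` propagates I/O errors up as the command's failure
            let src = std::fs::read_to_string(&file)?;
            println!("linting {} ({} lines)", file.display(), src.lines().count());
        }
        Command::Tree { file } => {
            println!("tree for {}", file.display());
        }
    }
    Ok(())
}

fn main() {
    // In the real CLI this value comes from clap's parser.
    let cmd = Command::Tree { file: PathBuf::from("architecture.sruja") };
    if let Err(e) = run(cmd) {
        eprintln!("error: {e}");
        std::process::exit(1);
    }
}
```

The key convention is that each arm returns Result, so errors bubble up through ? and produce a single non-zero exit in main.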

Shell completions are available:

sruja completion bash
sruja completion zsh
sruja completion fish

For patterns and conventions, see the repo's AGENTS.md (Rust skills) and docs/CODING_GUIDELINES.md.

Challenges

Hands-on exercises to practice Sruja. Each challenge has a goal and optional hints.

Challenge                 Focus
Add Component             Add a component to a system
Deployment Architecture   Model deployment
External Service          Integrate an external system
Fix Relations             Correct relation definitions
Missing Relations         Find and add missing relations
Queue Worker              Model a queue-based flow
Syntax Error              Fix a syntax error

See the Beginner path for a suggested order with tutorials and courses.

Add Component

Deployment Architecture

External Service

Fix Relations

Missing Relations

Queue Worker

Syntax Error

CLI reference

Core commands:

Command                        Description
sruja lint <file>              Validate .sruja file
sruja fmt <file>               Format DSL
sruja tree <file>              Print element tree
sruja export json <file>       Export to JSON
sruja export markdown <file>   Export to Markdown
sruja export mermaid <file>    Export to Mermaid

Run sruja --help for full options.
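These commands compose with ordinary shell tools. For example, a local loop that regenerates docs only when validation passes might look like this (file paths are illustrative):

```shell
sruja lint architecture.sruja \
  && sruja export markdown architecture.sruja > docs/architecture.md \
  && sruja export mermaid architecture.sruja > docs/architecture.mmd
```

Because lint exits non-zero on failure, the exports simply never run against an invalid model.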

Language reference

The Sruja DSL defines architecture using kinds (e.g. person, system, container) and relationships (e.g. ->).

For the full specification, see the Language specification in this book.

Sruja Language Specification

This document provides a complete specification of the Sruja architecture-as-code language for AI code assistants and developers.

Overview

Sruja is a domain-specific language (DSL) for defining software architecture models. It supports C4 model concepts (systems, containers, components), requirements, ADRs, scenarios, flows, policies, SLOs, and more.

Language Grammar

File Structure

Sruja uses a flat syntax — all declarations are top-level, no wrapper blocks required.

// Elements
User = person "User"
Shop = system "E-commerce Shop"

// Relationships
User -> Shop "uses"

// Governance
R1 = requirement functional "Must handle 10k users"
SecurityPolicy = policy "Encrypt all data" category "security"

Element Kinds

Before using elements like person, system, container, etc., you must declare them as kinds. This establishes the vocabulary of element types available in your architecture.

// Standard C4 kinds (required at top of file)
person = kind "Person"
system = kind "System"
container = kind "Container"
component = kind "Component"
database = kind "Database"
datastore = kind "Datastore"  // Alias for 'database', but 'database' is the preferred standard
queue = kind "Queue"

Why kinds? This allows Sruja to:

  • Validate that you're using recognized element types
  • Enable custom element types for domain-specific modeling
  • Provide LSP autocompletion for your declared kinds

Custom Kinds

You can define custom element types for your domain:

// Custom kinds for microservices
microservice = kind "Microservice"
eventBus = kind "Event Bus"
gateway = kind "API Gateway"

// Now use them
Catalog = microservice "Catalog Service"
Kafka = eventBus "Kafka Cluster"

Imports

Import kinds and tags from the standard library or other Sruja files.

Standard Library Import

// Import all from stdlib
import { * } from 'sruja.ai/stdlib'

// Now you can use person, system, container, etc. without defining them
User = person "User"
Shop = system "Shop"

Named Imports

// Import specific kinds only
import { person, system, container } from 'sruja.ai/stdlib'

User = person "User"
Shop = system "Shop"

Relative Imports

// Import from a local file
import { * } from './shared-kinds.sruja'

Note: When using imports, you don't need to redeclare the imported kinds.
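For instance, a shared kinds file referenced this way might contain only kind declarations (the file contents below are illustrative):

```sruja
// shared-kinds.sruja
person = kind "Person"
system = kind "System"
microservice = kind "Microservice"
```

Any file that imports it can then use these kinds directly without redeclaring them.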

Elements

Persons

User = person "User" {
    description "End user of the system"
}

Systems

MySystem = system "My System" {
    description "Optional description"
    metadata {
        key "value"
        tags ["tag1", "tag2"]
    }
    slo {
        availability {
            target "99.9%"
            window "30d"
            current "99.95%"
        }
    }
}

Containers

MyContainer = container "My Container" {
    technology "Technology stack"
    description "Optional description"
    version "1.0.0"
    tags ["api", "backend"]
    scale {
        min 3
        max 10
        metric "cpu > 80%"
    }
    slo {
        latency {
            p95 "200ms"
            p99 "500ms"
        }
    }
}

Components

MyComponent = component "My Component" {
    technology "Technology"
    description "Optional description"
    scale {
        min 1
        max 5
    }
}

Data Stores

MyDB = database "My Database" {
    technology "PostgreSQL"
    description "Optional description"
}

Queues

MyQueue = queue "My Queue" {
    technology "RabbitMQ"
    description "Optional description"
}

Relationships

// Basic relationship
From -> To "Label"

// Nested element references use dot notation
System.Container -> System.Container.Component "calls"

// With tags
From -> To "Label" [tag1, tag2]

Requirements

R1 = requirement functional "Description"
R2 = requirement nonfunctional "Description"
R3 = requirement constraint "Description"
R4 = requirement performance "Description"
R5 = requirement security "Description"

// With body block
R6 = requirement functional "Description" {
    description "Detailed description"
    metadata {
        priority "high"
    }
}

ADRs (Architectural Decision Records)

ADR001 = adr "Title" {
    status "accepted"
    context "What situation led to this decision"
    decision "What was decided"
    consequences "Trade-offs, gains, and losses"
}

Scenarios and Flows

Scenarios

MyScenario = scenario "Scenario Title" {
    step User -> System.WebApp "Credentials"
    step System.WebApp -> System.DB "Verify"
}

// 'story' is an alias for 'scenario'
CheckoutStory = story "User Checkout Flow" {
    step User -> ECommerce.CartPage "adds item to cart"
}

Note: The step keyword is recommended for clarity, but optional. Both syntaxes work:

  • With step: step User -> System.WebApp "action"
  • Without step: User -> System.WebApp "action" (inside scenario block)

Flows (DFD-style data flows)

OrderProcess = flow "Order Processing" {
    step Customer -> Shop.WebApp "Order Details"
    step Shop.WebApp -> Shop.Database "Save Order"
    step Shop.Database -> Shop.WebApp "Confirmation"
}

Note: Flows use the same syntax as scenarios. The step keyword is recommended for clarity.

Metadata

metadata {
    key "value"
    anotherKey "another value"
    tags ["tag1", "tag2"]
}

Overview Block

overview {
    summary "High-level summary of the architecture"
    audience "Target audience for this architecture"
    scope "What is covered in this architecture"
    goals ["Goal 1", "Goal 2"]
    nonGoals ["What is explicitly out of scope"]
    risks ["Risk 1", "Risk 2"]
}

SLO (Service Level Objectives)

slo {
    availability {
        target "99.9%"
        window "30 days"
        current "99.95%"
    }
    latency {
        p95 "200ms"
        p99 "500ms"
        window "7 days"
        current {
            p95 "180ms"
            p99 "420ms"
        }
    }
    errorRate {
        target "0.1%"
        window "7 days"
        current "0.08%"
    }
    throughput {
        target "10000 req/s"
        window "peak hour"
        current "8500 req/s"
    }
}

SLO blocks can be defined at:

  • Architecture level (top-level)
  • System level
  • Container level

Scale Block

scale {
    min 3
    max 10
    metric "cpu > 80%"
}

Scale blocks can be defined at:

  • Container level
  • Component level

Deployment

deployment Prod "Production" {
    node AWS "AWS" {
        node USEast1 "US-East-1" {
            infrastructure LB "Load Balancer"
            containerInstance Shop.API
        }
    }
}

Governance

Policies

policy SecurityPolicy "Enforce TLS 1.3" category "security" enforcement "required"

// Or with body block
policy DataRetentionPolicy "Retain data for 7 years" {
    category "compliance"
    enforcement "required"
    description "Detailed policy description"
}

Constraints

constraints {
    "Constraint description"
    "Another constraint"
}

Conventions

conventions {
    "Convention description"
    "Another convention"
}

Views (Optional)

Views are optional — if not specified, standard C4 views are automatically generated.

view index {
    title "System Context"
    include *
}

view container_view of Shop {
    title "Shop Containers"
    include Shop.*
    exclude Shop.WebApp
    autolayout lr
}

styles {
    element "Database" {
        shape "cylinder"
        color "#ff0000"
    }
}

View Types

  • index - System context view (C4 L1)
  • container - Container view (C4 L2)
  • component - Component view (C4 L3)
  • deployment - Deployment view

View Expressions

  • include * - Include all elements in scope
  • include Element1 Element2 - Include specific elements
  • exclude Element1 - Exclude specific elements
  • autolayout "lr"|"tb"|"auto" - Layout direction hint

Implied Relationships

Relationships are automatically inferred when child relationships exist:

User -> API.WebApp "Uses"
// Automatically infers: User -> API

This reduces boilerplate while maintaining clarity.

Complete Example

// Element Kinds (required)
person = kind "Person"
system = kind "System"
container = kind "Container"
component = kind "Component"
database = kind "Database"

// Overview
overview {
    summary "E-commerce platform architecture"
    audience "Development team"
    scope "Core shopping and payment functionality"
}

// Elements
Customer = person "Customer"
Admin = person "Administrator"

Shop = system "E-commerce Shop" {
    description "High-performance e-commerce platform"

    WebApp = container "Web Application" {
        technology "React"
        Cart = component "Shopping Cart"
        Checkout = component "Checkout Service"
    }

    API = container "API Gateway" {
        technology "Node.js"
        scale {
            min 3
            max 10
        }
        slo {
            latency {
                p95 "200ms"
                p99 "500ms"
            }
        }
    }

    DB = database "PostgreSQL Database" {
        technology "PostgreSQL 14"
    }
}

// Relationships
Customer -> Shop.WebApp "Browses"
Shop.WebApp -> Shop.API "Calls"
Shop.API -> Shop.DB "Reads/Writes"

// Requirements
R1 = requirement functional "Must support 10k concurrent users"
R2 = requirement constraint "Must use PostgreSQL"

// ADRs
ADR001 = adr "Use microservices architecture" {
    status "accepted"
    context "Need to scale different parts independently"
    decision "Adopt microservices architecture"
    consequences "Gain: Independent scaling. Trade-off: Increased complexity"
}

// Policies
SecurityPolicy = policy "Enforce TLS 1.3" {
    category "security"
    enforcement "required"
}

// Constraints and Conventions
constraints {
    "All APIs must use HTTPS"
    "Database must be encrypted at rest"
}

conventions {
    "Use RESTful API design"
    "Follow semantic versioning"
}

// Scenarios
PurchaseScenario = scenario "User purchases item" {
    step Customer -> Shop.WebApp "Adds item to cart"
    step Shop.WebApp -> Shop.API "Submits order"
    step Shop.API -> Shop.DB "Saves order"
}

// Views (optional - auto-generated if omitted)
view index {
    title "System Context"
    include *
}

view container_view of Shop {
    title "Shop Containers"
    include Shop.*
}

Key Rules

  1. Flat Syntax: All declarations are top-level, no specification {}, model {}, or views {} wrapper blocks
  2. IDs: Must be unique within their scope
  3. References: Use dot notation (e.g., System.Container)
  4. Relations: Can be defined anywhere (implied relationships are automatically inferred)
  5. Metadata: Freeform key-value pairs
  6. Descriptions: Optional string values
  7. Views: Optional — C4 views are automatically generated if not specified
  8. SLOs: Can be defined at architecture, system, or container level
  9. Scale: Can be defined at container or component level

Common Patterns

C4 Model Levels

  • Level 1 (System Context): Systems and persons
  • Level 2 (Container): Containers within systems
  • Level 3 (Component): Components within containers
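In the DSL, these levels correspond directly to nesting depth (a minimal sketch):

```sruja
import { * } from 'sruja.ai/stdlib'

User = person "User"            // L1 participant
Shop = system "Shop" {          // L1: system context
    API = container "API" {     // L2: container
        Auth = component "Auth" // L3: component
    }
}
```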

Resources