Lesson 3: Availability & Reliability
Reliability vs. Availability
- Reliability: The probability that a system will function correctly without failure for a specified period. It’s about correctness.
- Availability: The percentage of time a system is operational and accessible. It’s about uptime.
A system can be available but not reliable: for example, it accepts requests and counts as “up”, yet answers valid requests with 500 errors.
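
To make the distinction concrete, here is a minimal sketch in Python. It assumes a hypothetical list of probe results (made-up sample data, not measurements from a real system) and treats "answered at all" as a rough proxy for availability and "answered with a 2xx status" as a rough proxy for correctness.

```python
# Each probe records whether the service answered and, if so, the HTTP status.
# The data below is invented purely for illustration.
probes = [
    (True, 200), (True, 500), (True, 500), (True, 200), (True, 500),
    (True, 200), (True, 500), (True, 200), (True, 500), (False, None),
]

# Statuses of the probes that received any response at all.
answered = [status for reachable, status in probes if reachable]

# Share of probes that got an answer (a simple availability proxy).
availability = len(answered) / len(probes)

# Share of answered probes that returned a 2xx status (a correctness proxy).
success_rate = sum(1 for s in answered if 200 <= s < 300) / len(answered)

print(f"availability: {availability:.0%}")
print(f"correct responses among answered probes: {success_rate:.0%}")
```

In this sample the service answers most probes, so it looks available, while fewer than half of its answers are correct.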
Measuring Availability
Availability is often measured in “nines”:
| Availability | Downtime per Year |
|---|---|
| 99% (Two nines) | 3.65 days |
| 99.9% (Three nines) | 8.76 hours |
| 99.99% (Four nines) | 52.6 minutes |
| 99.999% (Five nines) | 5.26 minutes |
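
The downtime figures in the table follow directly from the availability percentage. The sketch below shows the arithmetic, assuming a 365-day year; the function name `downtime_minutes_per_year` is chosen here for illustration.

```python
# Convert an availability percentage into the downtime budget it allows.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes per year a system may be down at this availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for percent in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(percent)
    print(f"{percent}% -> {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Each additional nine cuts the allowed downtime by a factor of ten.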
Achieving High Availability
Redundancy
The key to availability is eliminating Single Points of Failure (SPOF). This is done via redundancy.
- Active-Passive: One server handles traffic; the other is on standby.
- Active-Active: Both servers handle traffic. If one fails, the other takes over the full load, so each server needs enough spare capacity to carry it alone.
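
A rough way to see why redundancy helps is to estimate the availability of a redundant group. The sketch below assumes that replicas fail independently and that switching between them is instant and always succeeds. Real systems rarely meet both assumptions, so treat the result as an upper bound. The function name `combined_availability` is illustrative.

```python
def combined_availability(availabilities: list[float]) -> float:
    """Availability of redundant replicas, assuming independent failures.

    The group is down only when every replica is down at the same time.
    """
    all_down = 1.0
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("each availability must be between 0 and 1")
        all_down *= (1.0 - a)
    return 1.0 - all_down


single = 0.99  # one server at two nines
pair = combined_availability([single, single])
print(f"one server: {single:.4%}, two redundant servers: {pair:.4%}")
```

Under these assumptions, pairing two two-nines servers yields roughly four nines, because both must fail at once for the group to be down.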
Failover
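Failover is the process of switching to a redundant system when the active one fails. It can be manual (an operator triggers the switch) or automatic (monitoring detects the failure and triggers it).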
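
Automatic failover can be sketched as a health-check loop that promotes the standby after several consecutive failed checks. This is a simplified illustration, not a production design: the `Node` and `FailoverRouter` classes, the `is_healthy` probe, and the threshold of three failures are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe


class FailoverRouter:
    """Active-passive routing: use the primary until it fails enough checks."""

    def __init__(self, primary: Node, standby: Node, max_failures: int = 3):
        self.active = primary
        self.standby = standby
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def check(self) -> Node:
        """Run one health check and fail over if the threshold is reached."""
        if self.active.is_healthy():
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_failures:
                # Promote the standby; the old active becomes the new standby.
                self.active, self.standby = self.standby, self.active
                self.consecutive_failures = 0
        return self.active


primary_up = {"value": True}
router = FailoverRouter(
    primary=Node("primary", lambda: primary_up["value"]),
    standby=Node("standby", lambda: True),
)

primary_up["value"] = False  # simulate a primary outage
for _ in range(3):
    active = router.check()
print(active.name)
```

Requiring several consecutive failures before switching is a common way to avoid failing over on a single transient error, at the cost of a slightly longer outage before the standby takes over.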
🛠️ Sruja Perspective: Modeling Redundancy
You can explicitly model redundant components in Sruja to visualize your high-availability strategy.
```sruja
system Payments "Payment System" {
  container PaymentService "Payment Service" {
    technology "Java"
  }
  // Modeling a primary and standby database
  container PrimaryDB "Primary Database" {
    technology "MySQL"
    tags ["primary"]
  }
  container StandbyDB "Standby Database" {
    technology "MySQL"
    tags ["standby"]
    description "Replicates from PrimaryDB. Promoted to primary if PrimaryDB fails."
  }
  PaymentService -> PrimaryDB "Reads/Writes"
  PrimaryDB -> StandbyDB "Replicates data"
}