SRE
Overview
Requirements, analysis, and design
- Qualitative requirements define systems from the user's point of view
- Who
- Who are the users, developers, or stakeholders?
- What
- What does the system do?
- What are the main features?
- Why
- Why is the system needed?
- When
- When do the users need and/or want the solution?
- When can the developers be done?
- How
- How will the system work?
- How many users will there be?
- How much data will there be?
Key performance indicators (KPIs)
- In business, common KPIs include
- Return on investment (ROI)
- Earnings before interest and taxes (EBIT)
- Employee turnover
- Customer churn
- In software, common KPIs include
- Page views
- User registrations
- Clickthroughs
- Checkouts
- A KPI is not the same thing as a goal or objective
- Goal: Increase turnover for an online store
- KPI: The percentage of conversions on the website
Service Level Indicator (SLI)
- SLIs are carefully selected monitoring metrics that measure one aspect of a service's reliability
- Ideally, SLIs should have a close linear relationship with your users' experience of that reliability, and we recommend expressing them as the ratio of two numbers: the number of good events divided by the count of all valid events
- Must be time-bound and measurable
- 3-5 SLIs per user journey
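The ratio form described above can be sketched in a few lines. This is a minimal illustration, not a monitoring implementation: the function name `sli_ratio` and the event counts are hypothetical.

```python
def sli_ratio(good_events: int, valid_events: int) -> float:
    """Return the SLI as good events divided by all valid events (0.0 to 1.0)."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events


# Hypothetical counts for one measurement window.
availability = sli_ratio(good_events=999_532, valid_events=1_000_000)
print(f"availability SLI: {availability:.4%}")
```

Expressing every SLI this way keeps the values on the same 0 to 1 scale, which makes them easy to compare against a target.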
Service level objective (SLO)
- Combines a service level indicator with a target reliability
- If you express your SLIs as is commonly recommended, your SLOs will generally be somewhere just short of 100%, for example, 99.9%, or "three nines."
- Must be achievable and relevant
- Tips
- The goal isn't to make SLOs as high as possible
- The goal is to make them as low as you can get away with while still making users happy (That's why it's important to understand your users)
- The higher you set the SLO, the higher the cost of computer resources (redundancy) and operations effort (people time)
- Applications should not significantly outperform their SLOs, because users come to expect the level of reliability you usually give them
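As a rough sketch of the tips above, a measured SLI can be compared against its SLO in both directions: falling short, and overshooting by a wide margin. The function name `evaluate_slo` and the `overshoot_margin` default are assumptions chosen for illustration; what counts as "significantly" outperforming depends on the service.

```python
def evaluate_slo(sli: float, slo: float, overshoot_margin: float = 0.0005) -> str:
    """Classify a measured SLI against its SLO target.

    overshoot_margin is a hypothetical threshold for "significantly
    outperforming"; pick a value that fits your own service.
    """
    if sli < slo:
        return "violated"
    if sli - slo > overshoot_margin:
        return "overperforming"
    return "met"


print(evaluate_slo(sli=0.9985, slo=0.999))   # below target
print(evaluate_slo(sli=0.9992, slo=0.999))   # within the margin
print(evaluate_slo(sli=0.99999, slo=0.999))  # far above target
```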
S.M.A.R.T
- Specific
A question such as “Is the site fast enough for you?” is not specific; it's subjective. A statement such as “The 95th percentile of results are returned in under 100 milliseconds” is specific.
- Measurable
A lot of monitoring is numbers, grouped over time, with math applied. An SLI must be a number or a delta; something we can measure and place in a mathematical equation.
- Achievable
- Relevant
- Time-bound
Do you want a service to be 99% available? That’s fine. Is that per year? Per month? Per day? Does the calculation look at specific windows of set time, from Sunday to Sunday for example, or is it a rolling period of the last seven days? It can't be measured accurately if we don't know the answers to those questions.
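To show why the window matters, the sketch below computes availability over a rolling seven-day period and over a fixed calendar week from the same data. The helper names and the daily counts are hypothetical, and the calendar week is assumed to run Sunday through Saturday.

```python
from datetime import date, timedelta


def availability(daily_counts, start, end):
    """Good/total events over the inclusive date range [start, end]."""
    good = total = 0
    for day, (day_good, day_total) in daily_counts.items():
        if start <= day <= end:
            good += day_good
            total += day_total
    if total == 0:
        raise ValueError("no events in window")
    return good / total


def rolling_window(today, days=7):
    """The last `days` days, ending on and including `today`."""
    return today - timedelta(days=days - 1), today


def calendar_week(today):
    """The Sunday-through-Saturday week that contains `today`."""
    days_since_sunday = (today.weekday() + 1) % 7  # weekday(): Monday=0, Sunday=6
    start = today - timedelta(days=days_since_sunday)
    return start, start + timedelta(days=6)


# Hypothetical data: 1,000 requests per day, with one bad day on 2024-01-05.
counts = {date(2024, 1, 1) + timedelta(days=i): (1000, 1000) for i in range(10)}
counts[date(2024, 1, 5)] = (900, 1000)

today = date(2024, 1, 10)
print("rolling 7 days:", availability(counts, *rolling_window(today)))
print("calendar week: ", availability(counts, *calendar_week(today)))
```

With this data the bad day falls inside the rolling window but outside the current calendar week, so the two definitions report different availability for the same service.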
Service Level Agreements (SLA)
- Commitments are made to your customers that your systems and applications will have only a certain amount of “downtime.”
- An SLA describes the minimum levels of service that you promise to provide to your customers and what happens when you break that promise
- If your service has paying customers, an SLA may include some way of compensating them with refunds or credits when that service has an outage that is longer than this agreement allows
- To give you the opportunity to detect problems and take remedial action before your reputation is damaged, your alerting thresholds are often substantially higher than the minimum levels of service documented in your SLA
- Not all services have an SLA, but all services should have an SLO
- Your SLO thresholds should be stricter than your SLA
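A quick way to reason about these commitments is to convert a target into allowed downtime for a window. The sketch below assumes a time-based availability target; the 99.5% SLA, 99.9% SLO, and 30-day window are hypothetical values.

```python
from datetime import timedelta


def allowed_downtime(target: float, window: timedelta) -> timedelta:
    """Downtime permitted by an availability target over a window."""
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    return window * (1 - target)


# Hypothetical targets: the internal SLO is stricter than the external SLA.
SLA_TARGET = 0.995
SLO_TARGET = 0.999
assert SLO_TARGET > SLA_TARGET, "the SLO should be stricter than the SLA"

window = timedelta(days=30)
print("SLA allows:", allowed_downtime(SLA_TARGET, window))
print("SLO allows:", allowed_downtime(SLO_TARGET, window))
```

The gap between the two figures is the room you have to notice and fix a problem before the SLA itself is broken.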
Error Budget
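An error budget is commonly taken to be the unreliability that an SLO permits, that is, 1 minus the SLO. The sketch below works under that assumption with event-based SLIs; the function name and the counts are hypothetical.

```python
def error_budget_remaining(slo: float, valid_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    budget = (1 - slo) * valid_events  # number of bad events the SLO allows
    return 1 - bad_events / budget


# Hypothetical window: a 99.9% SLO over 1,000,000 valid requests allows
# about 1,000 bad requests; 250 of them have been used so far.
remaining = error_budget_remaining(slo=0.999, valid_events=1_000_000, bad_events=250)
print(f"error budget remaining: {remaining:.1%}")
```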
Alert
Four Golden Signals
- Latency
- Page load latency
- Number of requests waiting for a thread
- Query duration
- Service response time
- Transaction duration
- Time to the first response
- Time to complete return
- Traffic
- # HTTP requests per second
- # Requests for static vs. dynamic content
- Network I/O
- # Concurrent sessions
- # Transactions per second
- # of retrievals per second
- # of active requests
- # of write ops
- # of read ops
- # of active connections
- Saturation
- % memory utilization
- % thread pool utilization
- % cache utilization
- % disk utilization
- % CPU utilization
- Disk quota
- Memory quota
- # of available connections
- # of users on the system
- Errors
- Wrong answers or incorrect content
- # 4xx/5xx HTTP codes
- # failed requests
- # exceptions
- # stack traces
- # servers that fail liveness checks
- # dropped connections
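As an illustration, one metric from each of the four signals can be derived from a batch of request records plus a worker-pool reading. This is a sketch under stated assumptions: the `Request` record, the nearest-rank percentile, and the choice to count every 4xx and 5xx status as an error are all illustrative, not prescribed by the list above.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_seconds, busy_workers, total_workers):
    """Summarize one metric per signal for a single measurement window."""
    if not requests:
        raise ValueError("requests must not be empty")
    return {
        "latency_p95_ms": percentile([r.duration_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": sum(r.status >= 400 for r in requests) / len(requests),
        "saturation": busy_workers / total_workers,
    }


# Hypothetical 10-second window: 20 requests, two of which failed.
sample = [Request(duration_ms=10.0 * (i + 1), status=200) for i in range(20)]
sample[3].status = 404
sample[7].status = 500
print(golden_signals(sample, window_seconds=10, busy_workers=6, total_workers=8))
```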
Handling Excess Loads
- Load Shedding
- API limits
- Streaming Data
- Reduce Quality of Service
- Instead of talking to a recommendations API, return a hardcoded set of products
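Load shedding and reduced quality of service can be combined: when too many calls are already in flight, skip the expensive dependency and return a cheap fallback. The sketch below assumes a simple in-flight counter; `LoadShedder`, `FALLBACK_PRODUCTS`, and the `fetch` callable standing in for the recommendations API are hypothetical names.

```python
import threading

# Hypothetical hardcoded products served when the system is overloaded.
FALLBACK_PRODUCTS = ["sku-1", "sku-2", "sku-3"]


class LoadShedder:
    """Admit at most max_in_flight concurrent calls; reject the rest."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1


def recommendations(shedder, fetch):
    """Call fetch() if there is capacity, otherwise degrade to the fallback."""
    if not shedder.try_acquire():
        return FALLBACK_PRODUCTS
    try:
        return fetch()
    finally:
        shedder.release()


shedder = LoadShedder(max_in_flight=100)
print(recommendations(shedder, fetch=lambda: ["personalized-1", "personalized-2"]))
```

Returning a degraded answer instead of an error keeps the page usable while protecting the dependency from further load.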
Avoiding Cascading Failures
- Plan to avoid thrashing
- Circuit Breaker
- Reduce Quality of Service
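A circuit breaker stops calling a failing dependency for a while so that retries do not pile up and spread the failure. The following is a minimal, single-threaded sketch of the pattern; the class name, thresholds, and injectable `clock` parameter are assumptions for illustration, and a production breaker would also need locking.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or reopen after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
print(breaker.call(lambda: "ok"))
```

Callers can catch `CircuitOpenError` and serve a reduced-quality response until the dependency recovers.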
Postmortem
- Impact
- Root Causes and Trigger
- Detection
- Resolution
- Quantifiable metrics
- Lessons Learned
- Timeline
- Action items
Incident Management
Note
Blameless Culture
Fix problems, not people