Site Reliability Engineering
SLO, SLO
devops observability sre sli slo performance
Site Reliability Engineering
Google Cloud - SRE fundamentals: SLIs, SLAs and SLOs
SRE begins with the idea that a prerequisite to success is availability.
What Is a Site Reliability Engineer?
- A site reliability engineer is responsible for the monitoring, automation, and reliability of IT operations. They use software development tools to automate IT operations tasks like change management, incident response, and production system management. They’re also responsible for monitoring the health of software deployments and relaying logs and data back to the developers.
Service Level Indicators
SLI 는 시스템의 가용성을 파악하기 위한 핵심 지표 를 의미한다. 지표로는 error rate (http 요청에 대한 http status 5xx 의 비율) 와 latency 등이 있다.
Service Level Objectives
SLO 는 시스템에서 기대되는 가용성을 설정한 목표 이고, 이것을 위반하면 incidents 로 간주하고 opsgenie 등으로 incidents 알람이 온다.