DevOps
Development and Operations united - faster delivery, stable operation, continuous improvement.
What is DevOps?
DevOps is a set of culture, practices, and tools that brings Development and Operations teams together. It is not just tooling, it is a way of thinking.
The main goals of DevOps:
- Faster delivery - getting code from commit to production quickly
- High quality - automated tests and checks
- Stability - the system is reliable and predictable
- Collaboration - Dev and Ops share responsibility
Culture
Collaboration, ownership, continuous learning
Automation
Less manual work, repeatable processes
Measurement
Metrics, monitoring, data-driven decisions
Before: "It works on my machine" - developer. "It doesn't work on my server" - ops. Now, with DevOps: "We are jointly responsible, from code to production".
Why is it needed?
In the traditional approach, Dev and Ops were separate teams, which caused many problems. DevOps addresses them in several ways:
Faster time to market
Amazon is widely reported to deploy about every 11.7 seconds on average, and Netflix reportedly 1000+ times a day. That pace is not realistic without DevOps practices.
Feedback loop
Fast feedback from production = fast improvement. A problem is detected in 5 minutes, not 5 days later.
Security (DevSecOps)
Security "shifts left" - it is considered from the start, not at the end.
Cost efficiency
Automation = less manual work = fewer errors = lower cost.
Elite DevOps teams (by the DORA metrics): Deploy frequency - several times a day, Lead time - under 1 hour, Change failure rate - 0-15%, MTTR - under 1 hour. Where does your team stand?
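To answer that question with data, the four metrics can be computed from deployment and incident records. The following is a minimal sketch, assuming hypothetical in-memory records (the timestamps, the record layout, and the two-day observation window are illustrative, not taken from any real system):

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment records: (commit time, deploy time, caused a failure?)
deployments = [
    (datetime(2024, 5, 6, 9, 0), datetime(2024, 5, 6, 9, 40), False),
    (datetime(2024, 5, 6, 13, 0), datetime(2024, 5, 6, 13, 30), True),
    (datetime(2024, 5, 7, 10, 0), datetime(2024, 5, 7, 10, 50), False),
    (datetime(2024, 5, 7, 15, 0), datetime(2024, 5, 7, 15, 20), False),
]
# Hypothetical incidents: (service degraded at, service restored at)
incidents = [(datetime(2024, 5, 6, 13, 35), datetime(2024, 5, 6, 14, 5))]
days_observed = 2  # assumed observation window

# Deploy frequency: deployments per day
deploys_per_day = len(deployments) / days_observed
# Lead time: median time from commit to production
lead_time = median(deploy - commit for commit, deploy, _ in deployments)
# Change failure rate: share of deployments that caused a failure
change_fail_rate = sum(failed for _, _, failed in deployments) / len(deployments)
# MTTR: mean time from degradation to restoration
mttr = sum((end - start for start, end in incidents), timedelta()) / len(incidents)

# Thresholds are the "elite" values quoted above
is_elite = (
    deploys_per_day > 1
    and lead_time < timedelta(hours=1)
    and change_fail_rate <= 0.15
    and mttr < timedelta(hours=1)
)
print(deploys_per_day, lead_time, change_fail_rate, mttr, is_elite)
```

In this sample the team deploys often and recovers quickly, but one failed deployment out of four pushes the change failure rate above the 15% bound, so it does not qualify as elite.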
Core concepts
Infrastructure as Code (IaC)
Servers, networks, databases - everything is written as code and stored in version control:
- Terraform - cloud-agnostic, declarative
- Pulumi - general-purpose programming languages (TypeScript, Python)
- CloudFormation - AWS native
- Ansible - configuration management
Monitoring and Observability
Three pillars for seeing inside the system:
- Metrics - numbers: CPU, memory, request count, error rate (Prometheus, Datadog)
- Logs - events: what happened, when, where (ELK, Loki)
- Traces - request path: which services a request passed through (Jaeger, Zipkin)
Site Reliability Engineering (SRE)
An approach created at Google that applies a software engineering lens to operations:
- SLI (Service Level Indicator) - the measurement (latency, availability)
- SLO (Service Level Objective) - the target (99.9% uptime)
- SLA (Service Level Agreement) - the contract (compensation if it is breached)
- Error Budget - how much downtime is "allowed"
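The relationship between an SLI, an SLO, and the error budget can be sketched with request counts. This is an illustrative example, assuming a request-based availability SLI; the function names and the traffic numbers are hypothetical:

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: fraction of requests that succeeded."""
    return (total_requests - failed_requests) / total_requests


def budget_consumed(total_requests: int, failed_requests: int, slo: float) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    allowed_failures = (1 - slo) * total_requests  # the error budget, in requests
    return failed_requests / allowed_failures


# Hypothetical month: 1,000,000 requests, 400 failures, SLO of 99.9%
sli = availability_sli(1_000_000, 400)
used = budget_consumed(1_000_000, 400, slo=0.999)
print(f"SLI={sli:.4%}, error budget used={used:.0%}")
```

With these numbers the SLO allows roughly 1,000 failed requests, so 400 failures spend about 40% of the budget and the service is still within its objective.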
Incident Management
What to do when something goes wrong:
- Detection - catch it early through alerts
- Response - the on-call engineer responds quickly
- Mitigation - reduce user impact (rollback, traffic shift)
- Resolution - fix the root cause
- Postmortem - blameless analysis, learning, prevention
Secrets Management
Storing and distributing sensitive data securely:
- NEVER keep plaintext secrets in code
- HashiCorp Vault, AWS Secrets Manager, Azure Key Vault
- Rotation - renew secrets regularly
- Audit - who accessed which secret, and when
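A rotation policy is easier to enforce when it is checked automatically. The sketch below assumes a hypothetical inventory that maps secret names to their last rotation date, and an assumed 90-day policy; a real setup would read this metadata from the secrets manager instead:

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)  # assumed rotation policy, not a universal rule


def secrets_due_for_rotation(last_rotated: dict[str, date], today: date) -> list[str]:
    """Return the names of secrets older than the rotation policy allows."""
    return sorted(
        name for name, rotated in last_rotated.items() if today - rotated > MAX_AGE
    )


# Hypothetical inventory: secret name -> date of last rotation
inventory = {
    "db-password": date(2024, 1, 1),
    "api-key": date(2024, 5, 1),
}
print(secrets_due_for_rotation(inventory, today=date(2024, 6, 1)))
```

Only metadata is handled here; the secret values themselves never need to leave the secrets manager for this check.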
Practical process (step-by-step)
Adopt Infrastructure as Code
Move your infrastructure into code with Terraform or Pulumi. Version control, PR review, automated apply.
Set up a monitoring stack
Prometheus + Grafana for metrics, Loki for logs. Or managed: Datadog, New Relic.
Configure alerting
Alerts for critical metrics. On-call rotation with PagerDuty or Opsgenie.
Define SLIs/SLOs
For each service: availability, p99 latency, error rate. Build a dashboard.
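For the latency SLI, p99 is the value that 99% of requests stay at or below. A minimal sketch using the nearest-rank method follows; monitoring systems such as Prometheus estimate quantiles from histogram buckets instead, so their results will differ slightly. The sample data is hypothetical:

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 1, 2, ..., 100
latencies_ms = [float(ms) for ms in range(1, 101)]
p99 = percentile(latencies_ms, 99)
print(f"p99 latency: {p99} ms")
```

The average of these samples is about 50 ms, which would hide the slow tail that p99 exposes.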
Incident response process
Runbooks, escalation path, postmortem template. Practice: game days, chaos engineering.
Secrets management
Vault or cloud-native secrets. Inject them through environment variables at runtime. Rotation policy.
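On the application side, injection through environment variables can look like the sketch below. It assumes the platform (for example a Vault agent or the orchestrator) has already placed the value in the process environment; `DEMO_DB_PASSWORD` and both helper functions are hypothetical names used for illustration:

```python
import os


def require_secret(name: str) -> str:
    """Read a secret injected into the environment; fail fast if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value


def masked(value: str) -> str:
    """Return a log-safe form that reveals at most the first two characters."""
    if len(value) <= 4:
        return "****"
    return value[:2] + "*" * (len(value) - 2)


# For the demo only: normally the deployment platform sets this variable.
os.environ.setdefault("DEMO_DB_PASSWORD", "s3cr3t-value")

password = require_secret("DEMO_DB_PASSWORD")
print("database password loaded:", masked(password))
```

Failing at startup when a secret is missing is a deliberate choice: a crash during deploy is easier to diagnose than an authentication error at the first request.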
Documentation and runbooks
A runbook for every alert: what to do, escalation, useful commands.
Continuous improvement
Weekly ops review, postmortem action items, toil reduction.
Most common mistakes
"We installed Kubernetes = we did DevOps" - wrong. DevOps is a culture; tools only support it. Fix your processes first.
Too many alerts = nobody pays attention. Keep only actionable alerts. Tune or disable noisy ones.
Blaming the person who made the mistake = people hide mistakes. Run blameless postmortems - system failures, not people failures.
An alert fires, but what should be done? The on-call engineer panics. Write a runbook for every alert.
Follow SRE principles: error budget, toil measurement, automation over manual work. "Hope is not a strategy".
Best practices
- Everything as Code - infra, config, policy, monitoring - all of it in Git
- Immutable infrastructure - replace servers instead of updating them
- Cattle, not pets - servers are identical and replaceable, not special
- Shift left - think about security, testing, and quality from the start
- Blameless postmortems - learn from mistakes, do not blame people
- On-call rotation - someone is always responsible, on a rotating basis
- Error budgets - when the SLO is breached, feature development stops and the focus moves to reliability
- Toil automation - automate repetitive manual work
- Chaos engineering - simulate failures in production (Netflix Chaos Monkey)
- Documentation - runbooks, architecture diagrams, onboarding guides
Tools and technologies
Terraform
The de facto IaC standard. AWS, GCP, Azure - all in one language. State management, modules.
Prometheus + Grafana
Metrics collection and visualization. PromQL, alerting, dashboards.
HashiCorp Vault
Secrets management, dynamic credentials, encryption as a service.
Mini example
AWS ECS and monitoring with Terraform:
# Terraform - Infrastructure as Code
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

# Input variables referenced below (assumed definitions, added so the example is self-contained)
variable "aws_region" {
  type    = string
  default = "us-east-1" # must match the availability zones listed in the VPC module
}

variable "project" {
  type = string
}

variable "common_tags" {
  type    = map(string)
  default = {}
}

provider "aws" {
  region = var.aws_region
}

# VPC
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "${var.project}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = true

  tags = var.common_tags
}

# ECS Cluster
resource "aws_ecs_cluster" "main" {
  name = "${var.project}-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = var.common_tags
}

# SNS topic that receives alarm notifications (assumed, referenced by the alarm below)
resource "aws_sns_topic" "alerts" {
  name = "${var.project}-alerts"
  tags = var.common_tags
}

# CloudWatch Alarms: average cluster CPU above 80% for two 5-minute periods
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.project}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
  }
}
Prometheus alert rules:
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency (p99 > 1s)"
          description: "P99 latency is {{ $value | humanizeDuration }}"

      # Pod not ready
      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{condition="true"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} not ready"
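The HighErrorRate rule combines two ideas: a ratio of 5xx requests to all requests, and a `for:` clause that keeps short spikes from paging anyone. The sketch below is a simplified model of that logic, not Prometheus's actual evaluation engine; it assumes one evaluation per minute, so five consecutive breaches stand in for `for: 5m`:

```python
def error_ratio(errors_5xx: float, total: float) -> float:
    """Share of requests that returned a 5xx status."""
    return errors_5xx / total if total else 0.0


def alert_fires(ratios: list[float], threshold: float = 0.05, for_samples: int = 5) -> bool:
    """True once the ratio has stayed above the threshold for for_samples evaluations in a row."""
    streak = 0
    for ratio in ratios:
        streak = streak + 1 if ratio > threshold else 0  # any healthy sample resets the streak
        if streak >= for_samples:
            return True
    return False


# Hypothetical per-minute ratios: a spike that recovers, then a sustained problem
spike = [0.06, 0.06, 0.06, 0.06, 0.01, 0.06, 0.06, 0.06, 0.06]
sustained = [error_ratio(60, 1000)] * 5
print(alert_fires(spike), alert_fires(sustained))
```

The spike never fires because one healthy sample resets the streak, which is the noise-reduction effect the `for:` clause is there to provide.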
Security and reliability
- Secrets in Vault - never keep plaintext secrets in checked-in environment variable definitions; inject them at runtime from HashiCorp Vault or a cloud-native store.
- Least privilege - each service gets only the permissions it needs. IAM roles, RBAC.
- Network segmentation - public, private, and database subnets. Security groups, NACLs.
- Audit logging - all admin actions are logged. CloudTrail, audit logs.
- Immutable infrastructure - servers are replaced, not patched. Golden AMIs/images.
- Compliance as Code - policy enforcement with OPA or Sentinel. Automated compliance checks.
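The idea behind Compliance as Code is that a policy is a function from a resource description to a list of violations. The sketch below shows this in plain Python as a stand-in; it is not OPA's Rego or Sentinel syntax, and the resource layout, the required tags, and the SSH rule are assumptions chosen for illustration:

```python
REQUIRED_TAGS = {"owner", "environment"}  # assumed organization policy


def violations(resource: dict) -> list[str]:
    """Return human-readable policy violations for one resource description."""
    problems = []
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        problems.append(f"missing tags: {sorted(missing)}")
    for rule in resource.get("ingress", []):
        if rule["cidr"] == "0.0.0.0/0" and rule["port"] == 22:
            problems.append("SSH (port 22) is open to the internet")
    return problems


# Hypothetical security group as it might appear in a parsed plan
security_group = {
    "tags": {"owner": "team-a"},
    "ingress": [{"cidr": "0.0.0.0/0", "port": 22}],
}
for problem in violations(security_group):
    print("DENY:", problem)
```

Run in CI against the planned changes, a check like this blocks a non-compliant change before it is applied rather than after an audit finds it.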
Frequently asked questions (FAQ)
DevOps is the culture and practices (the what). SRE is Google's specific implementation (the how), often summarized as "class SRE implements DevOps". SRE focuses more on metrics, SLOs, and error budgets.
Terraform handles infrastructure provisioning (creating servers, VPCs, databases). Ansible handles configuration management (installing software, updating config). They are often used together: Terraform creates the infrastructure, Ansible configures it.
For on-call, use a weekly or biweekly rotation with a primary and a secondary engineer. Compensate it (extra pay or time off). Hold a handoff meeting. Manage the schedule and escalation with PagerDuty or Opsgenie. Keep a runbook for every alert.
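A weekly primary plus secondary rotation can be generated mechanically. This is a minimal sketch with hypothetical engineer names and start date; tools such as PagerDuty or Opsgenie provide their own scheduling, so this only illustrates the pattern:

```python
from datetime import date, timedelta


def build_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str, str]]:
    """Return (week start, primary, secondary) for each week of the rotation."""
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # The secondary is next week's primary, so they pick up context in advance
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


# Hypothetical team of three, rotation starting on a Monday
team = ["aziza", "bek", "dilnoza"]
for week_start, primary, secondary in build_rotation(team, date(2024, 6, 3), weeks=4):
    print(week_start, "primary:", primary, "secondary:", secondary)
```

Making this week's secondary next week's primary means the handoff meeting happens between two people who already shared the previous week's incidents.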
An SLO of 99.9% = about 43 minutes of "allowed" downtime per month. This is the error budget. When the budget runs out, feature development stops and work goes only to reliability. While budget remains, you can take risks (new features, experiments).
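The 43-minute figure follows directly from the SLO: a 30-day month has 30 × 24 × 60 = 43,200 minutes, and 0.1% of that is about 43.2 minutes. A small sketch of the same arithmetic, assuming a 30-day window (the function name is illustrative):

```python
def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime a time-based availability SLO permits over the window."""
    return (1 - slo) * days * 24 * 60


for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: {allowed_downtime_minutes(slo):.1f} minutes per 30 days")
```

Each additional nine cuts the budget by a factor of ten, which is why the choice of SLO is a cost decision as much as a reliability one.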
A postmortem covers: 1) Timeline - what happened and when. 2) Impact - how many users were affected. 3) Root cause - the underlying cause (system, not person). 4) What went well - things that were handled well. 5) Action items - what we will do so it does not happen again. It is written within 72 hours.
Toil is manual, repetitive work that could be automated. No more than 50% of an SRE's time should go to toil. Automate it: scripts, cron, the operator pattern. Track toil and set aside time to reduce it.
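Tracking toil can start as a simple tally of hours. The sketch below assumes a hypothetical weekly time log in which each task is marked as toil or not, and compares the result with the 50% ceiling mentioned above:

```python
TOIL_CEILING = 0.5  # the 50% limit quoted above


def toil_share(hours_by_task: dict[str, tuple[float, bool]]) -> float:
    """Fraction of logged hours spent on tasks marked as toil."""
    total = sum(hours for hours, _ in hours_by_task.values())
    toil = sum(hours for hours, is_toil in hours_by_task.values() if is_toil)
    return toil / total if total else 0.0


# Hypothetical week: task -> (hours, is it toil?)
week = {
    "manual deploys": (10, True),
    "certificate renewals by hand": (6, True),
    "feature work": (16, False),
    "automation project": (8, False),
}
share = toil_share(week)
print(f"toil: {share:.0%}, within limit: {share <= TOIL_CEILING}")
```

Sorting the toil entries by hours gives a natural priority list for what to automate first.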
Chaos engineering is injecting controlled failures in production - Netflix Chaos Monkey shuts down random servers. The goal is to prove that the system is resilient. Start in staging, then move to production (during low-traffic periods). Tools: Gremlin, LitmusChaos.
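The core of a Chaos Monkey style experiment is choosing a random target while respecting safety limits. The sketch below is a dry run only: it selects an instance from a hypothetical list and terminates nothing. The instance names, the protected set, and the "keep at least one running" rule are assumptions, and this is not how Chaos Monkey, Gremlin, or LitmusChaos are implemented:

```python
import random


def pick_victim(instances, rng, protected=frozenset()):
    """Choose one instance to terminate, or None if the experiment would be unsafe."""
    candidates = [name for name in instances if name not in protected]
    if len(candidates) < 2:
        return None  # never remove the last unprotected instance
    return rng.choice(candidates)


# Hypothetical fleet; the database host is excluded from experiments
fleet = ["web-1", "web-2", "web-3", "db-1"]
victim = pick_victim(fleet, random.Random(), protected=frozenset({"db-1"}))
print("dry run, would terminate:", victim)
```

Passing the random generator in as a parameter lets an experiment be replayed with a fixed seed when a failure needs to be reproduced.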
Platform engineering, an evolution of DevOps, means building an internal developer platform (IDP). Developers get infrastructure and deploy through self-service. The platform team creates "golden paths". Tools include Backstage and Port. The aim is to enable "You build it, you run it".