DevOps | From Code to Cloud

What is DevOps?

DevOps is a culture, a set of practices, and a set of tools that bring Development and Operations teams together. It is not just tooling; it is a way of thinking.

The main goals of DevOps:

Faster delivery - getting code from commit to production quickly
Higher quality - automated tests and checks
Stability - a system that is reliable and predictable
Collaboration - Dev and Ops share responsibility

Culture

Collaboration, shared responsibility, continuous learning

Automation

Less manual work, repeatable processes

Measurement

Metrics, monitoring, data-driven decisions

In plain terms

Before: "It works on my machine," says the developer. "It doesn't work on my server," says ops. Now, with DevOps: "We are jointly responsible, from code to production."

Why do you need it?

In the traditional approach, Dev and Ops were separate teams. That separation caused many problems. DevOps addresses them in four ways:

Faster time to market

Widely cited figures put Amazon at one deployment every 11.7 seconds on average and Netflix at more than 1,000 deployments a day. That pace is not achievable without DevOps practices.

Feedback loop

Fast feedback from production means fast improvement. A problem is detected in 5 minutes, not 5 days later.

Security (DevSecOps)

Security "shifts left": it is considered from the start, not bolted on at the end.

Cost efficiency

More automation means less manual work, fewer errors, and lower cost.

DORA Metrics

Elite DevOps teams, as measured by the four DORA metrics: Deployment frequency - multiple times per day; Lead time for changes - less than 1 hour; Change failure rate - 0-15%; MTTR (mean time to restore) - less than 1 hour. Where does your team stand?
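To see where a team stands, the four metrics can be computed from deployment records. The following is a minimal sketch, assuming each record carries a commit time, a deploy time, a failure flag, and a restore time; the `Deployment` type and its field names are hypothetical, not part of any DORA tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments, period_days):
    """Compute the four DORA metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_minutes = [
        (d.deployed_at - d.committed_at).total_seconds() / 60 for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    restore_minutes = [
        (d.restored_at - d.deployed_at).total_seconds() / 60
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deployments) / period_days,
        "median_lead_time_minutes": median(lead_minutes),
        "change_failure_rate": len(failures) / len(deployments),
        "mttr_minutes": (
            sum(restore_minutes) / len(restore_minutes) if restore_minutes else None
        ),
    }


# Example: four deployments in one day, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
h, m = timedelta(hours=1), timedelta(minutes=1)
deployments = [
    Deployment(t0, t0 + 30 * m),
    Deployment(t0 + h, t0 + h + 40 * m, failed=True, restored_at=t0 + 2 * h + 10 * m),
    Deployment(t0 + 3 * h, t0 + 3 * h + 20 * m),
    Deployment(t0 + 5 * h, t0 + 5 * h + 50 * m),
]
metrics = dora_metrics(deployments, period_days=1)
print(metrics)
```

In this sample the team deploys 4 times a day with a 35-minute median lead time, a 25% change failure rate, and a 30-minute MTTR, so only the failure rate falls outside the elite range quoted above.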

Key concepts

Infrastructure as Code (IaC)

Servers, networks, databases - everything is written as code and stored in version control:

Terraform - cloud-agnostic, declarative
Pulumi - general-purpose programming languages (TypeScript, Python)
CloudFormation - AWS-native
Ansible - configuration management

Monitoring and Observability

The three pillars for seeing inside a system:

Metrics - numbers: CPU, memory, request count, error rate (Prometheus, Datadog)
Logs - events: what happened, when, and where (ELK, Loki)
Traces - the path of a request: which services it passed through (Jaeger, Zipkin)
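Logs and traces become far more useful when they can be joined. One common way to do that, sketched here with only the Python standard library, is to emit structured (JSON) log lines that carry a shared trace ID. The `JsonFormatter` class, the service names, and the field names are assumptions made for illustration; real tracing systems such as Jaeger propagate IDs through their own client libraries.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed through `extra=` become attributes on the record.
            "service": getattr(record, "service", None),
            "trace_id": getattr(record, "trace_id", None),
        })


# An in-memory stream stands in for stdout or a log shipper.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("observability-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# One request, two services, one shared trace ID.
trace_id = uuid.uuid4().hex
logger.info("payment authorized", extra={"service": "checkout", "trace_id": trace_id})
logger.error("inventory lookup failed", extra={"service": "inventory", "trace_id": trace_id})

lines = [json.loads(line) for line in stream.getvalue().splitlines()]
for entry in lines:
    print(entry)
```

Because both entries share the same `trace_id`, a log backend can show everything that happened to one request across services.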

Site Reliability Engineering (SRE)

An approach created at Google that applies a software engineering lens to operations:

SLI (Service Level Indicator) - the measurement (latency, availability)
SLO (Service Level Objective) - the target (99.9% uptime)
SLA (Service Level Agreement) - the contract (compensation if it is breached)
Error Budget - how much downtime is "allowed"
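The relationship between an SLI, an SLO, and the error budget can be made concrete with request counts. This is a minimal sketch, assuming a request-based availability SLI (good requests divided by total requests); the function names and sample numbers are illustrative.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic, nothing failed
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int, slo: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0.0 is spent,
    and a negative value means the SLO has been violated."""
    allowed_failures = (1.0 - slo) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


# Example: 1,000,000 requests this month, 500 of them failed, SLO is 99.9%.
total, good, slo = 1_000_000, 999_500, 0.999
sli = availability_sli(good, total)
remaining = error_budget_remaining(good, total, slo)
print(f"SLI: {sli:.4f}, budget remaining: {remaining:.2f}")
# The SLO allows about 1,000 failures; 500 happened, so roughly half the budget is left.
```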

Incident Management

What to do when something goes wrong:

Detection - catch the problem early through alerts
Response - the on-call engineer responds quickly
Mitigation - reduce user impact (rollback, traffic shift)
Resolution - fix the root cause
Postmortem - blameless analysis, learning, prevention

Secrets Management

Storing and distributing sensitive data securely:

NEVER keep plaintext secrets in code
Use a secrets manager: HashiCorp Vault, AWS Secrets Manager, Azure Key Vault
Rotation - renew secrets regularly
Audit - who accessed which secret, and when
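On the application side, the rules above usually translate into two habits: read secrets from the runtime environment instead of source code, and never write them to logs in full. Here is a minimal sketch under those assumptions; `load_secret`, `redact`, and the `DB_PASSWORD` variable name are hypothetical, and a real deployment would have Vault or a cloud secrets manager populate the environment.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not injected into the environment."""


def load_secret(name: str) -> str:
    """Read a secret injected at runtime; fail fast if it is absent."""
    value = os.environ.get(name)
    if not value:
        raise MissingSecretError(f"secret {name!r} is not set")
    return value


def redact(value: str, visible: int = 4) -> str:
    """Mask a secret for log output, keeping only the last few characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# For demonstration only: in production the platform sets this, not the code.
os.environ["DB_PASSWORD"] = "s3cr3t-value"

password = load_secret("DB_PASSWORD")
print("Loaded DB_PASSWORD:", redact(password))
```

Failing fast on a missing secret surfaces misconfiguration at startup instead of at the first database call.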

Practical process (step-by-step)

1. Adopt Infrastructure as Code

Move your infrastructure into code with Terraform or Pulumi. Use version control, PR review, and automated apply.

2. Set up a monitoring stack

Prometheus + Grafana for metrics, Loki for logs. Or go managed: Datadog, New Relic.

3. Configure alerting

Alert on critical metrics. Use PagerDuty or Opsgenie for on-call rotation.

4. Define SLIs and SLOs

For each service: availability, p99 latency, error rate. Build a dashboard.

5. Establish an incident response process

Runbooks, an escalation path, a postmortem template. Practice with game days and chaos engineering.

6. Secrets management

Use Vault or a cloud-native secrets service. Inject secrets through environment variables at runtime. Define a rotation policy.

7. Documentation and runbooks

A runbook for every alert: what to do, how to escalate, useful commands.

8. Continuous improvement

Weekly ops review, postmortem action items, toil reduction.
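For the SLI/SLO step, it helps to be precise about what "p99 latency" means. The sketch below uses the nearest-rank definition (the smallest sample with at least 99% of samples at or below it); monitoring systems such as Prometheus estimate percentiles from histogram buckets instead, so treat this as an illustration of the concept, not of any tool's algorithm. The sample latencies and the 1000 ms target are made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Example: 100 requests, most fast, two slow outliers (milliseconds).
latencies_ms = [120] * 98 + [900, 2500]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
slo_met = p99 <= 1000  # hypothetical SLO: p99 latency under 1 second

print(f"p50={p50} ms, p99={p99} ms, SLO met: {slo_met}")
```

The median hides the slow requests entirely, which is why latency SLOs are usually stated on a high percentile.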

Most common mistakes

1. Tool-first thinking

"We installed Kubernetes, so we do DevOps" is a mistake. DevOps is a culture; tools only support it. Fix your processes first.

2. Alert fatigue

Too many alerts means nobody pays attention. Keep only actionable alerts. Tune noisy alerts or turn them off.

3. Blame culture

Blaming the person who made a mistake leads people to hide mistakes. Run blameless postmortems: treat incidents as system failures, not people failures.

4. No runbooks

An alert fires, but what should you do? The on-call engineer panics. Write a runbook for every alert.

The fix

Follow SRE principles: error budgets, toil measurement, automation over manual work. "Hope is not a strategy."

Best practices

Everything as Code - infra, config, policy, monitoring: all of it lives in Git
Immutable infrastructure - replace servers instead of updating them
Cattle, not pets - servers are identical and replaceable, not special
Shift left - think about security, testing, and quality from the start
Blameless postmortems - learn from mistakes, don't blame people
On-call rotation - someone is always responsible, on a rotating schedule
Error budgets - when the error budget is exhausted, feature development stops and the focus moves to reliability
Toil automation - automate repetitive manual work
Chaos engineering - simulate failures in production (Netflix Chaos Monkey)
Documentation - runbooks, architecture diagrams, onboarding guides

Tools and technologies

Terraform, Ansible, Docker, Kubernetes, Prometheus, Grafana, ELK Stack, Jaeger, HashiCorp Vault, PagerDuty, Datadog, Istio

Terraform

The de facto IaC standard. AWS, GCP, Azure: all in one language. State management, modules.

Prometheus + Grafana

Metrics collection va visualization. PromQL, alerting, dashboards.

HashiCorp Vault

Secrets management, dynamic credentials, encryption as a service.

Mini example

An AWS ECS cluster with monitoring, defined in Terraform:

hcl - main.tf

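# Terraform - Infrastructure as Code

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  # Remote state in S3; the bucket must already exist.
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

# Input variables, declared here so the example is self-contained.
# Descriptions and defaults are illustrative assumptions.
variable "aws_region" {
  description = "AWS region to deploy into"
  type        = string
  default     = "us-east-1"
}

variable "project" {
  description = "Project name, used as a prefix for resource names"
  type        = string
}

variable "common_tags" {
  description = "Tags applied to every resource"
  type        = map(string)
  default     = {}
}

provider "aws" {
  region = var.aws_region
}

# VPC
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "${var.project}-vpc"
  cidr = "10.0.0.0/16"

  # Derive the zones from the region so they always match the provider.
  azs             = ["${var.aws_region}a", "${var.aws_region}b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = true

  tags = var.common_tags
}

# ECS Cluster
resource "aws_ecs_cluster" "main" {
  name = "${var.project}-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = var.common_tags
}

# SNS topic that receives alarm notifications.
# The alarm below references it, so it is defined here (the name is an assumption).
resource "aws_sns_topic" "alerts" {
  name = "${var.project}-alerts"
  tags = var.common_tags
}

# CloudWatch Alarms
# Fires when average cluster CPU stays above 80% for two 5-minute periods.
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "${var.project}-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              = 300
  statistic           = "Average"
  threshold           = 80

  alarm_actions = [aws_sns_topic.alerts.arn]

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
  }
}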

Prometheus alert rules:

yaml - alert-rules.yml

groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) 
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
      
      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency (p99 > 1s)"
          description: "P99 latency is {{ $value | humanizeDuration }}"
      
      # Pod not ready
      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{condition="true"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} not ready"
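In the rules above, only HighErrorRate carries a `runbook_url`, even though every alert should link to a runbook. A check like this can run in CI. The sketch below assumes the rule file has already been parsed into Python dictionaries (for example by a YAML library) with the same `groups`/`rules`/`annotations` shape; the function name is hypothetical.

```python
def alerts_missing_runbook(rule_groups):
    """Return the names of alerting rules that lack a runbook_url annotation."""
    missing = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules do not need a runbook
            annotations = rule.get("annotations", {})
            if not annotations.get("runbook_url"):
                missing.append(rule["alert"])
    return missing


# The same three alerts as in alert-rules.yml, reduced to the relevant fields.
rule_groups = [
    {
        "name": "application",
        "rules": [
            {
                "alert": "HighErrorRate",
                "annotations": {
                    "summary": "High error rate detected",
                    "runbook_url": "https://wiki.example.com/runbooks/high-error-rate",
                },
            },
            {"alert": "HighLatency", "annotations": {"summary": "High latency (p99 > 1s)"}},
            {"alert": "PodNotReady", "annotations": {"summary": "Pod not ready"}},
        ],
    }
]

missing = alerts_missing_runbook(rule_groups)
print("Alerts without a runbook:", missing)
```

Failing the build when this list is non-empty keeps the "runbook for every alert" rule from eroding over time.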

Security and reliability

Secrets in Vault - never hard-code plaintext secrets in environment variable definitions. Keep them in HashiCorp Vault or a cloud-native secrets service.
Least privilege - each service gets only the permissions it needs. IAM roles, RBAC.
Network segmentation - public, private, and database subnets. Security groups, NACLs.
Audit logging - every admin action is logged. CloudTrail, audit logs.
Immutable infrastructure - servers are replaced, not patched. Golden AMIs/images.
Compliance as Code - policy enforcement with OPA or Sentinel. Automated compliance checks.
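The idea behind compliance as code is that a policy is an executable check over a description of your infrastructure. OPA and Sentinel have their own policy languages; the following Python sketch only illustrates the concept, with a hypothetical rule ("sensitive ports must not be open to the internet") and a made-up data shape for security groups.

```python
# Ports that should never be reachable from anywhere (illustrative choice).
SENSITIVE_PORTS = {22, 3389, 5432}
ANYWHERE = "0.0.0.0/0"


def find_violations(security_groups):
    """Return a message for each ingress rule that exposes a sensitive port."""
    violations = []
    for group in security_groups:
        for rule in group["ingress"]:
            if rule["cidr"] == ANYWHERE and rule["port"] in SENSITIVE_PORTS:
                violations.append(
                    f"{group['name']}: port {rule['port']} is open to the internet"
                )
    return violations


# Example input: a simplified, hypothetical view of two security groups.
security_groups = [
    {
        "name": "web",
        "ingress": [
            {"port": 443, "cidr": "0.0.0.0/0"},   # public HTTPS is fine
            {"port": 22, "cidr": "10.0.0.0/16"},  # SSH only from inside the VPC
        ],
    },
    {
        "name": "database",
        "ingress": [
            {"port": 5432, "cidr": "0.0.0.0/0"},  # PostgreSQL exposed publicly
        ],
    },
]

violations = find_violations(security_groups)
for message in violations:
    print("POLICY VIOLATION:", message)
```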

Frequently asked questions (FAQ)

DevOps vs. SRE: DevOps is a culture and a set of practices (the what). SRE is Google's specific implementation (the how). SRE is a "class that implements the DevOps interface". SRE focuses more on metrics, SLOs, and error budgets.

Terraform vs. Ansible: Terraform handles infrastructure provisioning (creating servers, VPCs, databases). Ansible handles configuration management (installing software, updating config). They are often used together: Terraform creates the infrastructure, Ansible configures it.

Organizing on-call: Rotate weekly or every two weeks. Have a primary and a secondary on-call. Compensate people (extra pay or time off). Hold a handoff meeting. Manage schedules and escalation with PagerDuty or Opsgenie. Keep a runbook for every alert.
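A weekly primary/secondary rotation is simple enough to generate programmatically. This sketch makes one design assumption that is not stated in the text: each week's secondary becomes the next week's primary, so the handoff goes to someone who already has context. The engineer names are placeholders; in practice PagerDuty or Opsgenie would own the schedule.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return a list of (week_start, primary, secondary) tuples."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # The secondary is next in line and takes over as primary the week after.
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


schedule = build_rotation(["aziza", "bobur", "dilnoza"], date(2024, 1, 1), weeks=4)
for week_start, primary, secondary in schedule:
    print(f"{week_start}: primary={primary}, secondary={secondary}")
```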

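Error budgets: An SLO of 99.9% "allows" about 43 minutes of downtime per month (0.1% of a 30-day month is 43.2 minutes). That is the error budget. When the budget is spent, feature development stops and only reliability work continues. While budget remains, you can take risks (new features, experiments).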
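The arithmetic generalizes to any SLO and window. A small sketch, assuming a time-based availability SLO and a 30-day month by default (the function name is illustrative):

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime permitted by an availability SLO over a window, in minutes."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(slo)
    print(f"SLO {slo}%: {minutes:.2f} minutes per 30 days")
```

Each extra nine cuts the budget by a factor of ten: 432 minutes at 99%, 43.2 at 99.9%, and about 4.3 at 99.99%.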

Postmortem contents: 1) Timeline - what happened and when. 2) Impact - how many users were affected. 3) Root cause - the underlying cause (the system, not a person). 4) What went well - things that were done right. 5) Action items - what we will do so it does not happen again. Write it within 72 hours.

Toil: Manual, repetitive work that could be automated. Toil should not take up more than 50% of an SRE's time. Automate it with scripts, cron jobs, or the operator pattern. Track toil and set aside time to reduce it.
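Tracking toil can start with something as simple as tagging time entries. A minimal sketch, assuming each entry records hours spent and whether the work counted as toil; the sample week and the `toil_fraction` helper are hypothetical.

```python
TOIL_CAP = 0.5  # the 50% ceiling mentioned above


def toil_fraction(entries):
    """entries: iterable of (description, hours, is_toil) tuples."""
    total = sum(hours for _, hours, _ in entries)
    toil = sum(hours for _, hours, is_toil in entries if is_toil)
    return toil / total if total else 0.0


# One engineer's week (illustrative numbers).
week = [
    ("manual deploys", 10, True),
    ("restarting stuck jobs", 6, True),
    ("building deploy automation", 16, False),
    ("capacity planning", 8, False),
]

fraction = toil_fraction(week)
print(f"Toil: {fraction:.0%} of the week, within cap: {fraction <= TOIL_CAP}")
```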

Chaos engineering: Injecting controlled failures in production; Netflix's Chaos Monkey shuts down servers at random. The goal is to prove that the system is resilient. Start in staging, then move to production (during low-traffic periods). Tools include Gremlin and LitmusChaos.

Internal developer platforms: The next step in the evolution of DevOps is building an internal developer platform (IDP). Developers get infrastructure and deploy through self-service. A platform team builds "golden paths". Tools include Backstage and Port. The aim is to enable "you build it, you run it".

Glossary

IaC - Infrastructure as Code: writing infrastructure as code and keeping it under version control.

SRE - Site Reliability Engineering: Google's implementation of DevOps, focused on reliability.

SLI/SLO/SLA - Service Level Indicator/Objective/Agreement: the measurement, the target, and the contract.

Error Budget - the amount of downtime "allowed" by the SLO.

Toil - manual, repetitive work that should be automated.

Postmortem - the blameless analysis and learning process that follows an incident.

Runbook - a document with step-by-step instructions for an alert or problem.

On-call - the period during which someone is responsible for responding to production problems 24/7.

Observability - the ability to see inside a system through metrics, logs, and traces.

Chaos Engineering - testing resilience by injecting controlled failures in production.

Golden Path - the recommended, optimized development path provided by the platform team.

MTTR - Mean Time To Recovery: the average time it takes to recover from an incident.