Practical 3: Complete Infrastructure Case Study

In this section we walk through a complete case study in which we build an infrastructure from scratch for a fictional startup called "CloudShop", an e-commerce platform.

Project Context

CloudShop is a startup that is launching its e-commerce platform. The team expects rapid growth and needs an infrastructure that can scale. The development team is small (3 developers and 1 DevOps engineer), so it needs to automate as much as possible.

Requirements

The architecture needs:
- Multiple environments: development, staging, and production
- Redundant web servers for high availability
- A database with replication
- A caching system
- Monitoring and alerting
- Automatic backups
- Zero-downtime deployments
- The ability to scale quickly by adding more servers

Decision: Use Ansible

The team decides to use Ansible for the following reasons:
- The team is small and Ansible is easier to learn
- They do not want the complexity of maintaining a Puppet Server
- They want to integrate the automation directly with their CI/CD pipeline
- The number of servers will be relatively small at first (fewer than 50)

Project Structure

The team structures its Ansible project as follows:

cloudshop-infrastructure/
├── ansible.cfg
├── inventories/
│   ├── development/
│   │   ├── hosts.ini
│   │   └── group_vars/
│   │       ├── all.yml
│   │       └── webservers.yml
│   ├── staging/
│   │   ├── hosts.ini
│   │   └── group_vars/
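│   └── production/
│       ├── hosts.ini
│       └── group_vars/
├── playbooks/
│   ├── group_vars/
│   │   └── all/
│   │       ├── vault.yml (encrypted)
│   │       └── common.yml
│   ├── site.yml
│   ├── webservers.yml
│   ├── databases.yml
│   ├── deploy-app.yml
│   └── backup.yml
├── roles/
│   ├── common/
│   ├── nginx/
│   ├── nodejs/
│   ├── postgresql/
│   ├── redis/
│   ├── monitoring/
│   └── backup/
└── README.md

The shared group_vars/ directory sits next to the playbooks because Ansible only loads group_vars/ directories that are located beside the inventory or beside the playbook being run; a group_vars/ directory at the project root would be ignored with this layout.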
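To make the layout more concrete, the following is a minimal sketch of what playbooks/site.yml could look like. The source does not show this file, so the mapping of roles to groups is an assumption; the group names are the ones used in the production inventory and the role names are the directories listed under roles/.

```yaml
---
# playbooks/site.yml (hypothetical sketch: the role-to-group mapping is assumed)
- name: Baseline configuration for every server
  hosts: all
  become: true
  roles:
    - common

- name: Web servers
  hosts: webservers
  become: true
  roles:
    - nginx
    - nodejs

- name: Database servers
  hosts: databases
  become: true
  roles:
    - postgresql
    - backup

- name: Cache servers
  hosts: cache
  become: true
  roles:
    - redis

- name: Monitoring server
  hosts: monitoring
  become: true
  roles:
    - monitoring
```

Keeping one play per group lets the same file configure any environment: only the inventory passed with -i changes.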

Step-by-Step Implementation

Step 1: Initial Configuration

The team creates the ansible.cfg file with a basic configuration:

[defaults]
inventory = inventories/production/hosts.ini
host_key_checking = False
retry_files_enabled = False
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 3600
roles_path = roles
callbacks_enabled = timer, profile_tasks

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s

Step 2: Define the Inventory

For production (inventories/production/hosts.ini):

[loadbalancers]
lb01.cloudshop.com ansible_host=203.0.113.10

[webservers]
web01.cloudshop.com ansible_host=203.0.113.20
web02.cloudshop.com ansible_host=203.0.113.21

[databases]
db01.cloudshop.com ansible_host=203.0.113.30
db02.cloudshop.com ansible_host=203.0.113.31 postgresql_role=replica

[cache]
redis01.cloudshop.com ansible_host=203.0.113.40

[monitoring]
monitor01.cloudshop.com ansible_host=203.0.113.50

[production:children]
loadbalancers
webservers
databases
cache
monitoring

[production:vars]
ansible_user=ubuntu
ansible_become=yes
# "environment" is a reserved Ansible keyword, so the variable uses a different name
deploy_env=production
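The role in the next step reads a few variables (common_timezone, common_auto_upgrade and admin_users) that the inventory itself does not define. A group variables file beside the inventory is one place to set them. The sketch below is illustrative only: the file path follows the project layout, while the user name and the key are placeholders, not values from the project.

```yaml
---
# inventories/production/group_vars/all.yml (illustrative values only)
common_timezone: "Europe/Madrid"
common_auto_upgrade: false   # assumption: no unattended dist-upgrade in production

# Each entry needs the two fields the role accesses: name and ssh_key
admin_users:
  - name: alice   # hypothetical account
    ssh_key: "ssh-ed25519 AAAAC3...REPLACE_ME alice@example.com"   # placeholder key
```

Because each environment has its own group_vars/ directory, development could keep common_auto_upgrade enabled while production turns it off.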

Step 3: Create the "common" Role

This role configures the aspects that are common to all servers. The file roles/common/tasks/main.yml:

---
- name: Set the hostname
  hostname:
    name: "{{ inventory_hostname }}"

- name: Upgrade all packages
  apt:
    upgrade: dist
    update_cache: yes
    cache_valid_time: 3600
  when: common_auto_upgrade | default(true)

- name: Install essential packages
  apt:
    name:
      - vim
      - curl
      - wget
      - git
      - htop
      - unzip
      - ntp
      - fail2ban
      - ufw
    state: present

- name: Set the timezone
  timezone:
    name: "{{ common_timezone | default('Europe/Madrid') }}"

- name: Enable and start NTP
  service:
    name: ntp
    state: started
    enabled: yes

- name: Configure system limits
  pam_limits:
    domain: "*"
    limit_type: "{{ item.type }}"
    limit_item: "{{ item.item }}"
    value: "{{ item.value }}"
  loop:
    - { type: 'soft', item: 'nofile', value: '65536' }
    - { type: 'hard', item: 'nofile', value: '65536' }
    - { type: 'soft', item: 'nproc', value: '32768' }
    - { type: 'hard', item: 'nproc', value: '32768' }

- name: Allow SSH through the firewall
  ufw:
    rule: allow
    port: '22'
    proto: tcp

- name: Enable the firewall
  ufw:
    state: enabled
    policy: deny

- name: Configure fail2ban
  template:
    src: jail.local.j2
    dest: /etc/fail2ban/jail.local
    owner: root
    group: root
    mode: '0644'
  notify: Restart fail2ban

- name: Ensure fail2ban is running
  service:
    name: fail2ban
    state: started
    enabled: yes

- name: Create administrator users
  user:
    name: "{{ item.name }}"
    groups: sudo
    append: yes
    shell: /bin/bash
  loop: "{{ admin_users }}"

- name: Add public SSH keys
  authorized_key:
    user: "{{ item.name }}"
    key: "{{ item.ssh_key }}"
    state: present
  loop: "{{ admin_users }}"

- name: Disable password authentication
  lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^#?PasswordAuthentication'
    line: 'PasswordAuthentication no'
    state: present
  notify: Restart sshd
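Two tasks above notify handlers named "Restart fail2ban" and "Restart sshd", but the source does not show them. A minimal sketch of roles/common/handlers/main.yml could look like the following. It assumes Ubuntu hosts, which the inventory's ansible_user=ubuntu suggests but does not confirm.

```yaml
---
# roles/common/handlers/main.yml (sketch; handler names must match the notify: values exactly)
- name: Restart fail2ban
  service:
    name: fail2ban
    state: restarted

- name: Restart sshd
  service:
    # Assumption: on Debian/Ubuntu the OpenSSH unit is called "ssh";
    # on other distributions it is usually "sshd"
    name: ssh
    state: restarted
```

Handlers run once at the end of the play, and only if a notifying task reported a change, so the SSH daemon is not restarted on runs where sshd_config was already correct.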

Step 4: Implement Zero-Downtime Deployment

The team creates a dedicated playbook to deploy the application without downtime (playbooks/deploy-app.yml):

---
- name: Deploy application with zero downtime
  hosts: webservers
  become: yes
  serial: 1  # Deploy to one server at a time

  vars:
    app_name: "cloudshop"
    app_repo: "https://github.com/cloudshop/shop-app.git"
    app_version: "{{ deploy_version | default('main') }}"
    health_check_url: "http://localhost:3000/health"

  pre_tasks:
    - name: Check that the server is healthy before deploying
      uri:
        url: "{{ health_check_url }}"
        status_code: 200
      register: pre_health
      failed_when: false

    - name: Abort if the server is already unhealthy
      fail:
        msg: "Server {{ inventory_hostname }} was already unhealthy before the deployment"
      when: pre_health.status != 200

    - name: Remove the server from the load balancer
      uri:
        url: "http://{{ hostvars[groups['loadbalancers'][0]]['ansible_host'] }}/admin/disable/{{ inventory_hostname }}"
        method: POST
      delegate_to: localhost
      become: no

    - name: Wait for existing connections to finish
      wait_for:
        timeout: 30

  tasks:
    # rescue is only valid as part of a block, so the deployment tasks are wrapped in one
    - name: Deploy the new version, rolling back automatically on failure
      block:
        - name: Record a backup path that stays fixed for this run
          set_fact:
            # Evaluated once here, so the backup and the rollback use the same path
            backup_path: "/opt/{{ app_name }}/backups/app-{{ now(utc=true, fmt='%Y%m%dT%H%M%S') }}"

        - name: Create backup directory
          file:
            path: "/opt/{{ app_name }}/backups"
            state: directory
            owner: "{{ app_name }}"
            group: "{{ app_name }}"

        - name: Back up the current version
          # cp -a preserves ownership, so a restored copy still belongs to the app user
          command: cp -a /opt/{{ app_name }}/app {{ backup_path }}
          args:
            creates: "{{ backup_path }}"

        - name: Check out the new version of the application
          git:
            repo: "{{ app_repo }}"
            dest: "/opt/{{ app_name }}/app"
            version: "{{ app_version }}"
            force: yes
          become_user: "{{ app_name }}"
          register: git_result

        - name: Install dependencies if the code changed
          npm:
            path: "/opt/{{ app_name }}/app"
            production: yes
          become_user: "{{ app_name }}"
          when: git_result.changed

        - name: Run database migrations
          command: npm run migrate
          args:
            chdir: "/opt/{{ app_name }}/app"
          become_user: "{{ app_name }}"
          # run_once applies per serial batch, so with serial: 1 it would run on every host;
          # restricting the task to the first host of the play runs the migrations only once
          when: git_result.changed and inventory_hostname == ansible_play_hosts_all[0]

        - name: Restart the application
          service:
            name: "{{ app_name }}"
            state: restarted
          when: git_result.changed

        - name: Wait for the application to become available
          uri:
            url: "{{ health_check_url }}"
            status_code: 200
          register: result
          until: result.status == 200
          retries: 30
          delay: 2

        - name: Run smoke tests
          uri:
            url: "http://localhost:3000/{{ item }}"
            status_code: 200
          loop:
            - "health"
            - "api/products"
            - "api/categories"

      rescue:
        - name: Restore the previous version
          # command does not expand shell globs, so the exact backup path is used
          command: rsync -a --delete {{ backup_path }}/ /opt/{{ app_name }}/app/
          args:
            removes: "{{ backup_path }}"

        - name: Restart the application with the previous version
          service:
            name: "{{ app_name }}"
            state: restarted

        - name: Verify that the rollback worked
          uri:
            url: "{{ health_check_url }}"
            status_code: 200
          register: rollback_health
          until: rollback_health.status == 200
          retries: 10
          delay: 3

        - name: Notify the team about the rollback
          slack:
            token: "{{ slack_token }}"
            msg: "⚠️ Deployment failed on {{ inventory_hostname }}. An automatic rollback was performed."
            channel: "#deployments"
          delegate_to: localhost
          become: no

  post_tasks:
    - name: Add the server back to the load balancer
      uri:
        url: "http://{{ hostvars[groups['loadbalancers'][0]]['ansible_host'] }}/admin/enable/{{ inventory_hostname }}"
        method: POST
      delegate_to: localhost
      become: no

    - name: Wait for the load balancer to recognise the server
      wait_for:
        timeout: 10

    - name: Verify that the server is receiving traffic
      uri:
        url: "http://{{ hostvars[groups['loadbalancers'][0]]['ansible_host'] }}/health"
        status_code: 200
      delegate_to: localhost
      become: no
This playbook implements a rolling deployment that:
1. Removes each server from the load balancer before deploying
2. Backs up the current version
3. Deploys the new version
4. Runs verification tests
5. Adds the server back to the load balancer
6. Rolls back automatically if anything fails
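The automatic rollback in the last step relies on Ansible's block / rescue / always construct, which is only valid at task level, not as a play keyword. The following self-contained sketch shows the control flow in isolation. It is a demonstration with a simulated failure, not part of the CloudShop project, and it runs against localhost only.

```yaml
---
# Standalone demonstration of block / rescue / always (not CloudShop code)
- name: Demonstrate block, rescue and always
  hosts: localhost
  connection: local
  gather_facts: false

  tasks:
    - name: Guarded section
      block:
        - name: Simulate a failing deployment step
          fail:
            msg: "simulated failure"

        - name: Skipped, because the previous task in the block failed
          debug:
            msg: "deployment finished"

      rescue:
        - name: Runs only when a task in the block fails
          debug:
            msg: "rolling back after task: {{ ansible_failed_task.name }}"

      always:
        - name: Runs whether or not the block failed
          debug:
            msg: "cleanup"
```

If every rescue task succeeds, Ansible treats the host as recovered and the play continues with it. In a rolling deployment this means the remaining servers are still attempted unless the rescue section ends with an explicit fail task.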

Step 5: CI/CD Integration

The team integrates Ansible with its GitHub Actions pipeline (.github/workflows/deploy.yml):

name: Deploy to Production

on:
  push:
    tags:
      - 'v*'

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4
      with:
        repository: cloudshop/infrastructure
        token: ${{ secrets.INFRA_REPO_TOKEN }}

    - name: Setup Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'

    - name: Install Ansible
      run: |
        # Pin to the 9.x series rather than a single patch release
        pip install "ansible>=9.0.1,<10"
        ansible-galaxy collection install community.general

    - name: Configure SSH
      run: |
        mkdir -p ~/.ssh
        echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
        chmod 600 ~/.ssh/id_rsa
        ssh-keyscan -H lb01.cloudshop.com >> ~/.ssh/known_hosts

    - name: Write Ansible Vault password file
      run: |
        echo "${{ secrets.ANSIBLE_VAULT_PASSWORD }}" > .vault_pass

    - name: Run Ansible Playbook
      run: |
        ansible-playbook -i inventories/production/hosts.ini \
          playbooks/deploy-app.yml \
          --vault-password-file .vault_pass \
          -e "deploy_version=${{ github.ref_name }}"

    - name: Notify Slack on Success
      if: success()
      uses: 8398a7/action-slack@v3
      with:
        status: success
        text: '✅ Successful deployment to production: ${{ github.ref_name }}'
      env:
        # action-slack reads the webhook from this environment variable, not from an input
        SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}

    - name: Notify Slack on Failure
      if: failure()
      uses: 8398a7/action-slack@v3
      with:
        status: failure
        text: '❌ Failed deployment to production: ${{ github.ref_name }}'
      env:
        SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
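A deployment pipeline like this is often paired with a cheaper check that runs before anything reaches production. The sketch below is a hypothetical second workflow, not something the source describes. It assumes the workflow lives in the infrastructure repository itself, and the file name .github/workflows/check.yml is invented for illustration.

```yaml
# .github/workflows/check.yml (hypothetical companion workflow)
name: Check Playbooks

on:
  pull_request:

jobs:
  syntax-check:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4

    - name: Setup Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'

    - name: Install Ansible
      run: pip install "ansible>=9,<10"

    - name: Syntax-check the deployment playbook
      # --syntax-check parses the playbook without connecting to any host.
      # If encrypted variable files are loaded, it may also need
      # --vault-password-file, as in the deploy workflow.
      run: |
        ansible-playbook -i inventories/staging/hosts.ini \
          playbooks/deploy-app.yml --syntax-check
```

A syntax check catches structural mistakes, such as a keyword placed at the wrong level, without needing SSH access to any server.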

Case Study Results

After implementing this automated infrastructure, CloudShop achieved:

Reduced deployment time: From 2 hours manually to 15 minutes automated
Zero-downtime deployments: Customers see no interruptions during deployments
Consistency: All environments (dev, staging, prod) are configured from the same roles and playbooks
Fast recovery: Automatic rollbacks if something fails
Scalability: New servers can be added by running a playbook
Living documentation: The infrastructure is code, so it documents itself
Auditing: All changes are versioned in Git

This case study shows how a small startup can implement professional DevOps practices with Ansible, achieving a reliable and scalable infrastructure without needing a large team.