Files
claude-plugins/infraestrutura/skills/proxmox-ha/references/failover-testing.md
Emanuel Almeida 6b3a6f2698 feat: refactor 30+ skills to Anthropic progressive disclosure pattern
- All SKILL.md files now <500 lines (avg reduction 69%)
- Detailed content extracted to references/ subdirectories
- Frontmatter standardised: only name + description (Anthropic standard)
- New skills: brand-guidelines, spec-coauthor, report-templates, skill-creator
- Design skills: anti-slop guidelines, premium-proposals reference
- Removed non-standard frontmatter fields (triggers, version, author, category)

Plugins affected: infraestrutura, marketing, dev-tools, crm-ops, gestao,
core-tools, negocio, perfex-dev, wordpress, design-media

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:05:03 +00:00

3.7 KiB

Failover Testing - Proxmox HA

Procedimentos detalhados de teste de failover e monitoring.


Criar VM Teste HA

# VM 999 para teste (nao production)
qm create 999 --name ha-test --memory 512 --cores 1

# Adicionar a HA
ha-manager add vm:999 --state started

Teste 1: Shutdown Clean

# Node onde VM 999 corre:
qm shutdown 999

# HA Manager deve:
# 1. Detectar shutdown (~30s)
# 2. Tentar restart no mesmo node (max_restart vezes)
# 3. Se continua down -> relocate para outro node

# Monitorizar
watch -n 1 'ha-manager status | grep vm:999'

Teste 2: Node Crash (Simulado)

# CUIDADO: Apenas em teste, nao production

# Shutdown abrupto do node onde VM 999 corre
# (simula hardware failure)
echo b > /proc/sysrq-trigger  # Reboot forcado

# Outro node deve:
# 1. Detectar node down via quorum (~1min)
# 2. Fence node (via watchdog/IPMI)
# 3. Boot VM 999 no node surviving

# Timeline esperado:
# - 0s: Node crash
# - ~60s: Quorum detecta node missing
# - ~90s: Fencing executado
# - ~120s: VM boota em outro node

# Total downtime: ~2-3min (sem shared storage)
# Com shared storage: ~30-60s

Teste 3: Live Migration Manual

# Migration manual (com VM running)
qm migrate 999 <target-node-name> --online

# Com shared storage: <10s downtime
# Sem shared storage: copia disk = lento (GB/min)

# Para production VMs:
# - Fazer em janela manutencao se sem shared storage
# - Live migration OK se shared storage

HA Policies e Tuning

Shutdown Policy

# Default: conditional (HA Manager decide)
# Opcoes: conditional, freeze, failover, migrate

# Para VMs criticas que NAO devem migrar durante manutencao:
ha-manager set vm:200 --state freeze

# Para forcar migrate durante manutencao:
ha-manager set vm:200 --state migrate

Maintenance Mode

# Colocar node em maintenance (nao recebe novos VMs HA)
ha-manager set-node-state <node-name> maintenance

# VMs HA existentes:
# - Nao migram automaticamente
# - Mas nao recebem novas em failover

# Sair de maintenance
ha-manager set-node-state <node-name> active

Priorities (Load Balance)

# Preferencia de nodes por VM

# VM 200: Preferir Node B
ha-manager set vm:200 --group critical --restricted

# restricted: VM so corre nos nodes do grupo
# unrestricted: VM pode correr em qualquer node (fallback)

Monitoring e Alertas

HA Manager Logs

# Logs HA Manager
journalctl -u pve-ha-lrm -f  # Local Resource Manager
journalctl -u pve-ha-crm -f  # Cluster Resource Manager

# Ver decisoes de failover
grep "migrate\|relocate" /var/log/pve/tasks/index

Configurar Alertas

# Via Web UI: Datacenter -> Notifications

# Email alerts para:
# - Node down
# - Quorum lost
# - VM failover events
# - Fencing executed

# SMTP: mail.descomplicar.pt
# To: admin@descomplicar.pt

Script Monitorizacao (cron cada 5min)

#!/bin/bash
# /usr/local/bin/check-ha-health.sh

ha_status=$(ha-manager status | grep "quorum:" | awk '{print $2}')

if [ "$ha_status" != "OK" ]; then
  echo "HA Quorum NOT OK" | mail -s "ALERT: HA Issue" admin@descomplicar.pt
fi

# Cron
# */5 * * * * /usr/local/bin/check-ha-health.sh

Production Rollout

Phased approach:

# Week 1: VMs nao-criticas (teste)
ha-manager add vm:250 --group low

# Week 2: VMs medias (se Week 1 OK)
ha-manager add vm:201,202 --group medium

# Week 3: VMs criticas (se tudo OK)
ha-manager add vm:200,300 --group critical

Documentar Runbook

Criar: 06-Operacoes/Procedimentos/D7-Tecnologia/PROC-HA-Failover.md

Conteudo:

  • Detectar failover event
  • Validar VM booted correctamente
  • Investigar causa node failure
  • Restore node original
  • Migrate VM back (se necessario)