
---
name: proxmox-ha
description: Configure High Availability (HA) on a Proxmox cluster - resource groups, fencing, automatic failover. Use when user mentions "configure ha", "proxmox ha", "high availability", "failover", "ha manager".
author: Descomplicar® Crescimento Digital
version: 1.0.0
quality_score: 75
user_invocable: true
desk_task: 1712
allowed-tools: Task, Read, Bash
dependencies:
  - ssh-unified
  - notebooklm
  - proxmox-cluster
---

Proxmox HA

Configure High Availability (HA) on a Proxmox cluster with HA Manager, fencing devices, and automatic failover for critical VMs.

When to Use

  • Configure HA after cluster formation (/proxmox-cluster)
  • Protect critical VMs with automatic failover
  • Configure fencing devices (STONITH)
  • Define HA groups by criticality
  • Test failover procedures

Syntax

/proxmox-ha configure --critical-vms <vm-ids> [--fencing watchdog|ipmi] [--max-relocate 2]

Examples

# HA for critical VMs with watchdog fencing
/proxmox-ha configure --critical-vms 200,300 --fencing watchdog

# HA with IPMI fencing (hardware)
/proxmox-ha configure --critical-vms 200,300,301 --fencing ipmi --max-relocate 1

# Only test failover (without enabling HA)
/proxmox-ha test --vm 999

Knowledge Sources

NotebookLM

mcp__notebooklm__notebook_query \
  notebook_id:"276ccdde-6b95-42a3-ad96-4e64d64c8d52" \
  query:"proxmox ha high availability fencing stonith failover"

Complete Workflow

Pre-Requisites

1. Cluster Formed

# Verify the cluster is healthy
pvecm status

# Expected:
# Quorum: Active
# Nodes: 2+ online

2. Shared Storage or Replication

Options:

  • Shared storage (NFS, Ceph): ideal for HA (failover <30s)
  • No shared storage: requires ZFS replication, or accept boot-time failover (~2-5 min)

For the Descomplicar cluster (no shared storage):

# Accept boot-time failover
# OR configure ZFS replication:

# Node A:
zfs snapshot rpool/vm-disks@ha-sync
zfs send rpool/vm-disks@ha-sync | ssh root@<node-b-ip> zfs receive rpool/vm-disks-replica

# Automate with pvesr (Proxmox Storage Replication)
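Setting up the pvesr jobs can be sketched as a dry run that only prints the commands for review before executing them on the cluster. The target node name node-b, the VM IDs, and the 15-minute schedule are illustrative assumptions; the <vmid>-<num> job-ID shape follows the Proxmox Storage Replication documentation.

```shell
# Build the pvesr command for one VM (pure string construction, nothing is executed)
make_replication_cmd() {
  vmid="$1"; target="$2"; schedule="$3"
  echo "pvesr create-local-job ${vmid}-0 ${target} --schedule ${schedule}"
}

# Print the commands for the critical VMs
for vmid in 200 300; do
  make_replication_cmd "$vmid" node-b "*/15"
done
```

Review the printed commands, then run them on the node that currently hosts each VM.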

3. Fencing Device Configured

CRITICAL: No fencing = split-brain risk

Phase 1: Fencing Configuration

1.1 Option A: Watchdog (Software Fencing)

Simplest, but less reliable:

# Install watchdog on both nodes
apt install watchdog

# Load the kernel module
modprobe softdog

# Auto-load on boot
echo "softdog" >> /etc/modules

# Configure HA Manager to use the watchdog
# (automatic once HA is enabled)

1.2 Option B: IPMI/iLO (Hardware Fencing)

More reliable, requires IPMI:

# Check that IPMI is available
ipmitool lan print

# Configure IPMI credentials (via BIOS or ipmitool)

# Configure in Proxmox (Web UI):
# Datacenter → Fencing → Add
# Type: IPMI
# IP: <node-ipmi-ip>
# Username: admin
# Password: <ipmi-pass>

# Test
fence_ipmilan -a <node-ipmi-ip> -l admin -p <pass> -o status

1.3 Option C: Network Fencing (Least Reliable)

Use only if IPMI is not available:

# SSH-based fencing (dangerous)
# Depends on the network being up
# Not recommended for production

Recommendation for the Descomplicar cluster:

  • Start: watchdog (simple, functional)
  • Production: IPMI if the hardware supports it
  • Avoid: network fencing

Phase 2: HA Manager Configuration

2.1 Enable HA Manager

# Automatic once the cluster is formed
# Check status
ha-manager status

# Expected:
# quorum:  OK
# master:  <node-name> (elected)
# lrm:     active

2.2 Create HA Groups (Optional)

HA groups by criticality:

# Via Web UI: Datacenter → HA → Groups → Add

# Critical (priority 100)
ha-manager groupadd critical \
  --nodes "server.descomplicar.pt:100,cluster.descomplicar.pt:100"

# Medium (priority 50)
ha-manager groupadd medium \
  --nodes "server.descomplicar.pt:50,cluster.descomplicar.pt:50"

# Low (priority 10)
ha-manager groupadd low \
  --nodes "server.descomplicar.pt:10,cluster.descomplicar.pt:10"

How priority works:

  • Higher priority = preference to run on that node
  • Used to balance load
  • On failover the priority is ignored (the VM goes to whichever node is available)
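The priority semantics can be illustrated with a small helper that picks the preferred (highest-priority) node out of a --nodes string. This is a minimal sketch, assuming the node:priority format used in the groupadd commands above.

```shell
# Pick the highest-priority node from an HA group --nodes string.
# Input format: "node1:prio1,node2:prio2,..."
preferred_node() {
  echo "$1" | tr ',' '\n' | sort -t: -k2 -rn | head -1 | cut -d: -f1
}

preferred_node "server.descomplicar.pt:100,cluster.descomplicar.pt:50"
# → server.descomplicar.pt
```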

Phase 3: Add VMs to HA

3.1 Add Critical VMs

Via Web UI:

  • Select VM → More → Manage HA
  • Enable HA
  • Group: critical
  • Max restart: 3
  • Max relocate: 2

Via CLI:

# VM 200 (EasyPanel Docker)
ha-manager add vm:200 \
  --group critical \
  --max_restart 3 \
  --max_relocate 2 \
  --state started

# VM 300 (CWP)
ha-manager add vm:300 \
  --group critical \
  --max_restart 3 \
  --max_relocate 2 \
  --state started

Parameters:

  • max_restart: restart attempts on the same node before relocating
  • max_relocate: maximum relocations between nodes
  • state started: HA Manager ensures the VM is always running
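As a convenience, the parameters can be composed into a ha-manager add command with a small wrapper. ha_add_cmd is a hypothetical helper that only prints the command for review, with this guide's defaults (max_restart 3, max_relocate 2); it does not call ha-manager.

```shell
# Build the ha-manager add command for a VM (string only, nothing executed)
ha_add_cmd() {
  vmid="$1"; group="$2"; max_restart="${3:-3}"; max_relocate="${4:-2}"
  echo "ha-manager add vm:${vmid} --group ${group}" \
       "--max_restart ${max_restart} --max_relocate ${max_relocate} --state started"
}

ha_add_cmd 200 critical
# → ha-manager add vm:200 --group critical --max_restart 3 --max_relocate 2 --state started
```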

3.2 Verify HA Resources

ha-manager status

# Should show:
# vm:200: started on <node-name>
# vm:300: started on <node-name>
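A quick way to verify the expected resources is to parse the status text. The function below is a sketch that works on sample input; the "vm:ID: started on node" line shape is assumed from the expected output above.

```shell
# $1 = ha-manager status output, remaining args = VM IDs expected to be started
check_started() {
  status="$1"; shift
  ok=0
  for vmid in "$@"; do
    if ! printf '%s\n' "$status" | grep -q "vm:${vmid}.*started"; then
      echo "vm:${vmid} NOT started"; ok=1
    fi
  done
  [ "$ok" -eq 0 ] && echo "all started"
  return "$ok"
}

# Sample output (shape assumed from the expected lines above)
sample='vm:200: started on node-a
vm:300: started on node-b'
check_started "$sample" 200 300
# → all started
```

On a live node the first argument would be "$(ha-manager status)".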

Phase 4: Failover Testing

4.1 Create an HA Test VM

# VM 999 for testing (not production)
qm create 999 --name ha-test --memory 512 --cores 1

# Add it to HA
ha-manager add vm:999 --state started

4.2 Test Automatic Failover

Test 1: Clean Shutdown

# On the node where VM 999 is running:
qm shutdown 999

# The HA Manager should:
# 1. Detect the shutdown (~30s)
# 2. Try to restart on the same node (max_restart times)
# 3. If it stays down → relocate to another node

# Monitor
watch -n 1 'ha-manager status | grep vm:999'

Test 2: Node Crash (Simulated)

# CAUTION: Test environments only, never production

# Abrupt shutdown of the node where VM 999 runs
# (simulates a hardware failure)
echo b > /proc/sysrq-trigger  # Forced reboot

# The other node should:
# 1. Detect the node is down via quorum (~1 min)
# 2. Fence the node (via watchdog/IPMI)
# 3. Boot VM 999 on the surviving node

# Expected timeline:
# - 0s: node crash
# - ~60s: quorum detects the missing node
# - ~90s: fencing executed
# - ~120s: VM boots on another node

# Total downtime: ~2-3 min (without shared storage)
# With shared storage: ~30-60s
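The timeline above is simple arithmetic over the three phases; a minimal sketch using this guide's estimates (not measured values):

```shell
# Rough failover-downtime estimate (all values in seconds,
# taken from the expected timeline above, no shared storage)
detect=60    # quorum notices the missing node
fence=30     # watchdog/IPMI fencing
boot=120     # VM boot on the surviving node

total=$((detect + fence + boot))
echo "expected downtime: ${total}s (~$((total / 60)) min)"
# → expected downtime: 210s (~3 min)
```

With shared storage the boot phase shrinks to roughly the VM start time, which is where the ~30-60s figure comes from.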

4.3 Test Manual Live Migration

# Manual migration (with the VM running)
qm migrate 999 <target-node-name> --online

# With shared storage: <10s downtime
# Without shared storage: the disk is copied = slow (GB/min)

# For production VMs:
# - Use a maintenance window if there is no shared storage
# - Live migration is fine with shared storage

Phase 5: HA Policies & Tuning

5.1 Configure the Shutdown Policy

# Default: conditional (HA Manager decides)
# Options: conditional, freeze, failover, migrate

# Note: the shutdown policy is a cluster-wide setting, not a per-VM state.
# Web UI: Datacenter → Options → HA Settings
# Or edit /etc/pve/datacenter.cfg:

# HA VMs must NOT migrate during node shutdown/maintenance:
# ha: shutdown_policy=freeze

# Force migration during node shutdown/maintenance:
# ha: shutdown_policy=migrate

5.2 Maintenance Mode

# Put a node into maintenance mode (PVE 7.3+)
ha-manager crm-command node-maintenance enable <node-name>

# HA VMs on that node are migrated to other nodes,
# and the node receives no new HA services until maintenance ends.

# Leave maintenance mode
ha-manager crm-command node-maintenance disable <node-name>

5.3 Configure Priorities (Load Balancing)

# Node preference per VM is controlled through its HA group

# VM 200: assign to the group whose node priorities it should follow
ha-manager set vm:200 --group critical

# Make the group restricted (its VMs may only run on the group's nodes):
ha-manager groupset critical --restricted 1

# restricted: the VM runs only on the group's nodes
# unrestricted (default): the VM can run on any node as a fallback

Phase 6: Monitoring & Alerts

6.1 HA Manager Logs

# HA Manager logs
journalctl -u pve-ha-lrm -f  # Local Resource Manager
journalctl -u pve-ha-crm -f  # Cluster Resource Manager

# View failover decisions
grep "migrate\|relocate" /var/log/pve/tasks/index

6.2 Configure Alerts

# Via Web UI: Datacenter → Notifications

# Email alerts for:
# - Node down
# - Quorum lost
# - VM failover events
# - Fencing executed

# SMTP: mail.descomplicar.pt
# To: admin@descomplicar.pt

6.3 Continuous Monitoring

#!/bin/bash
# /usr/local/bin/check-ha-health.sh
# Monitoring script (run from cron every 5 minutes)

ha_status=$(ha-manager status | grep "quorum:" | awk '{print $2}')

if [ "$ha_status" != "OK" ]; then
  echo "HA Quorum NOT OK" | mail -s "ALERT: HA Issue" admin@descomplicar.pt
fi

# Cron entry:
# */5 * * * * /usr/local/bin/check-ha-health.sh
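To sanity-check the alert logic without a live cluster, the quorum test from the script can be factored into a function that takes the status text as an argument. This is a sketch; the "quorum:  OK" line format is assumed from the expected output shown earlier.

```shell
# Returns success when the given status text reports quorum OK
quorum_ok() {
  printf '%s\n' "$1" | grep "quorum:" | awk '{print $2}' | grep -qx "OK"
}

quorum_ok "quorum:  OK" && echo "healthy"
quorum_ok "quorum:  FAILED" || echo "would send alert email"
```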

Phase 7: Production Rollout

7.1 Migrate Production VMs to HA

Phased approach:

# Week 1: non-critical VMs (test)
ha-manager add vm:250 --group low

# Week 2: medium-criticality VMs (if Week 1 went well)
ha-manager add vm:201 --group medium
ha-manager add vm:202 --group medium

# Week 3: critical VMs (if everything is OK)
ha-manager add vm:200 --group critical
ha-manager add vm:300 --group critical

7.2 Document the Runbook

Create: 06-Operacoes/Procedimentos/D7-Tecnologia/PROC-HA-Failover.md

Contents:

  • Detect the failover event
  • Validate that the VM booted correctly
  • Investigate the cause of the node failure
  • Restore the original node
  • Migrate the VM back (if needed)

Output Summary

✅ HA configured: Descomplicar cluster

🛡️ Fencing:
   - Type: Watchdog (softdog)
   - Nodes: 2 nodes configured
   - Test: Successful ✓

📋 HA Groups:
   - Critical (priority 100): 2 VMs
   - Medium (priority 50): 0 VMs
   - Low (priority 10): 0 VMs

🖥️ HA Resources:
   - vm:200 (EasyPanel) - Critical
   - vm:300 (CWP) - Critical
   - Max restart: 3
   - Max relocate: 2

⚡ Failover Tests:
   ✓ Clean shutdown → Auto restart
   ✓ Node crash → Relocate (~2min)
   ✓ Live migration → <10s downtime

📊 Expected Metrics:
   - Detection time: ~60s
   - Fencing time: ~30s
   - Boot time: ~60-120s
   - Total failover: ~2-3 min (without shared storage)

⚠️ Limitations (without shared storage):
   - Failover = boot time (not instant)
   - Live migration copies the disk (slow)
   - Consider shared storage in the future

🔔 Monitoring:
   - Quorum check: every 5 minutes
   - Alerts: Email admin@descomplicar.pt
   - Logs: journalctl -u pve-ha-*

📋 Next Steps:
   1. Monitor for 30 days
   2. Gradually add more VMs to HA
   3. Consider shared storage (NFS/Ceph)
   4. Document procedures in PROC-HA-Failover.md
   5. Train the team on manual failover

⏱️ Configuration time: ~30min

Best Practices

DO

  • Test failover on test VMs BEFORE production
  • Configure fencing (watchdog at minimum, IPMI ideally)
  • Monitor quorum 24/7
  • Document failover runbooks
  • Send email alerts for critical events
  • Back up BEFORE enabling HA

DON'T

  • Run HA without fencing (split-brain risk)
  • Set max_relocate too high (the VM keeps "bouncing")
  • Assume instant failover without shared storage
  • Test failover in production without a plan
  • Ignore quorum warnings

Troubleshooting

VM does not fail over

# Check that HA is enabled
ha-manager status | grep vm:ID

# Check quorum
pvecm status

# Check that fencing is functional
# (watchdog or IPMI test)

# Logs
journalctl -u pve-ha-crm -f

Split-brain detected

# CRITICAL: Both nodes think they are the master

# Completely shut down one node
systemctl poweroff

# On the remaining node:
pvecm expected 1  # Force quorum with 1 node

# Fix the networking issue
# Rejoin the node that was shut down

Failover loop (VM keeps restarting)

# The VM fails → restarts → fails → restarts

# Check:
# 1. The VM's task logs (Web UI → Task History)
# 2. Has max_restart been reached?
# 3. Is there a VM configuration problem?

# Temporarily pause HA
ha-manager set vm:ID --state disabled

# Fix the VM issue
# Re-enable HA
ha-manager set vm:ID --state started


Version: 1.0.0 | Author: Descomplicar® | Date: 2026-02-14


/** @author Descomplicar® | @copyright 2026 **/