Emanuel Almeida 6b3a6f2698 feat: refactor 30+ skills to Anthropic progressive disclosure pattern
- All SKILL.md files now <500 lines (avg reduction 69%)
- Detailed content extracted to references/ subdirectories
- Frontmatter standardised: only name + description (Anthropic standard)
- New skills: brand-guidelines, spec-coauthor, report-templates, skill-creator
- Design skills: anti-slop guidelines, premium-proposals reference
- Removed non-standard frontmatter fields (triggers, version, author, category)

Plugins affected: infraestrutura, marketing, dev-tools, crm-ops, gestao,
core-tools, negocio, perfex-dev, wordpress, design-media

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:05:03 +00:00


name: proxmox-ha
description: High Availability configuration on a Proxmox cluster: HA Manager, fencing devices (STONITH), and automatic failover for critical VMs.

Proxmox HA

Configure High Availability (HA) on a Proxmox cluster with HA Manager, fencing devices, and automatic failover for critical VMs.

When to Use

  • Configure HA after cluster formation (/proxmox-cluster)
  • Protect critical VMs with automatic failover
  • Configure fencing devices (STONITH)
  • Define HA groups by criticality
  • Test failover procedures

Syntax

/proxmox-ha configure --critical-vms <vm-ids> [--fencing watchdog|ipmi] [--max-relocate 2]

Knowledge Sources

mcp__notebooklm__notebook_query \
  notebook_id:"276ccdde-6b95-42a3-ad96-4e64d64c8d52" \
  query:"proxmox ha high availability fencing stonith failover"

Prerequisites

1. Cluster formed:

pvecm status
# Expected: "Quorate: Yes" and a "Nodes:" count of 2 or more
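
The quorum check above can be scripted so that later steps stop early on a cluster that has lost quorum. This is a minimal sketch, assuming `pvecm status` prints a `Quorate:` line with `Yes` or `No`; the function name `is_quorate` is illustrative and not part of Proxmox.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the "pvecm status" text on stdin reports quorum.
# Assumption: the output contains a line such as "Quorate:          Yes".
is_quorate() {
  grep -Eq '^Quorate:[[:space:]]+Yes[[:space:]]*$'
}

# Usage on a cluster node (skipped when pvecm is not installed):
if command -v pvecm >/dev/null 2>&1; then
  if pvecm status | is_quorate; then
    echo "cluster is quorate"
  else
    echo "cluster is NOT quorate, do not enable HA yet" >&2
  fi
fi
```

Reading from stdin keeps the check testable against saved output, without needing a live cluster.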

2. Shared storage or replication:

  • Shared storage (NFS, Ceph): ideal for HA (failover <30s)
  • Without shared storage: ZFS replication or boot-time failover (~2-5 min)

3. Fencing device configured. Without fencing there is a risk of split-brain.


Complete Workflow

Phase 1: Fencing Configuration

Full details of the 3 options (Watchdog, IPMI, Network) are in: references/fencing-configuration.md

Summary: start with Watchdog, use IPMI for production, and avoid network fencing.
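
Before choosing between the options, it can help to see which watchdog module a node is currently set up to use. This is a hedged sketch, assuming the module is selected by a `WATCHDOG_MODULE=` line in `/etc/default/pve-ha-manager` and that `softdog` applies when no such line is active; the helper name `watchdog_module` is hypothetical, and references/fencing-configuration.md remains the authoritative procedure.

```shell
#!/usr/bin/env bash
# Sketch: print the watchdog module configured for the HA stack.
# Assumptions: the module is set via WATCHDOG_MODULE=<name> in
# /etc/default/pve-ha-manager, and softdog is used when that line is
# absent or commented out.
watchdog_module() {
  local conf="${1:-/etc/default/pve-ha-manager}"
  local module=""
  if [ -r "$conf" ]; then
    # Take the last uncommented assignment, stripping optional quotes.
    module="$(sed -n 's/^[[:space:]]*WATCHDOG_MODULE=\(.*\)$/\1/p' "$conf" \
      | tail -n 1 | tr -d '"')"
  fi
  echo "${module:-softdog}"
}

# Usage: watchdog_module            # reads the default path
#        watchdog_module ./copy.cfg # reads a copy of the file
```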

Phase 2: HA Manager Configuration

# Check status
ha-manager status
# Expected: quorum OK, master <node-name> (active), lrm <node-name> (active)

Create HA groups by criticality:

# Critical (priority 100)
ha-manager groupadd critical \
  --nodes "server.descomplicar.pt:100,cluster.descomplicar.pt:100"

# Medium (priority 50)
ha-manager groupadd medium \
  --nodes "server.descomplicar.pt:50,cluster.descomplicar.pt:50"

# Low (priority 10)
ha-manager groupadd low \
  --nodes "server.descomplicar.pt:10,cluster.descomplicar.pt:10"
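
The three `groupadd` commands differ only in group name and priority, so they can be generated from a single node list. This is a minimal sketch that prints the commands for review instead of running them; `nodes_arg` and `print_group_cmds` are illustrative names, and the tiers are copied from the commands above.

```shell
#!/usr/bin/env bash
# Sketch: build the --nodes value ("node:priority,node:priority").
nodes_arg() {
  local priority="$1"; shift
  local out="" node
  for node in "$@"; do
    out="${out:+$out,}${node}:${priority}"
  done
  echo "$out"
}

# Print (do not execute) one groupadd command per tier.
print_group_cmds() {
  local tier
  for tier in critical:100 medium:50 low:10; do
    echo "ha-manager groupadd ${tier%%:*} --nodes \"$(nodes_arg "${tier##*:}" "$@")\""
  done
}

# Usage: print_group_cmds server.descomplicar.pt cluster.descomplicar.pt
```

After creating the groups, `ha-manager groupconfig` should list them for verification.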

Phase 3: Add VMs to HA

# VM 200 (EasyPanel Docker)
ha-manager add vm:200 \
  --group critical \
  --max_restart 3 \
  --max_relocate 2 \
  --state started

# VM 300 (CWP)
ha-manager add vm:300 \
  --group critical \
  --max_restart 3 \
  --max_relocate 2 \
  --state started

Parameters:

  • max_restart: number of restart attempts on the same node before relocating
  • max_relocate: maximum number of relocations between nodes
  • state started: HA Manager keeps the VM in the started state

# Verify
ha-manager status
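
When several VMs share the same settings, the `ha-manager add` calls can be generated from a comma-separated ID list, mirroring the `--critical-vms <vm-ids>` argument in the syntax section. This is a sketch under that assumption; `ha_add_cmds` is a hypothetical helper that only prints commands, so nothing changes until the output is reviewed and run.

```shell
#!/usr/bin/env bash
# Sketch: emit one "ha-manager add" command per VM ID.
# Arguments: "<id,id,...>" [group] [max_restart] [max_relocate]
ha_add_cmds() {
  local ids="$1" group="${2:-critical}"
  local max_restart="${3:-3}" max_relocate="${4:-2}"
  local id
  local IFS=','   # split the ID list on commas only
  for id in $ids; do
    case "$id" in
      # Reject empty or non-numeric IDs instead of passing them on.
      ''|*[!0-9]*) echo "skipping invalid VM ID: '$id'" >&2; continue ;;
    esac
    echo "ha-manager add vm:${id} --group ${group} --max_restart ${max_restart} --max_relocate ${max_relocate} --state started"
  done
}

# Usage: ha_add_cmds "200,300" critical 3 2
```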

Phase 4: Failover Testing

Detailed test procedures (clean shutdown, simulated node crash, live migration) and policy tuning are in: references/failover-testing.md

Phase 5: Production Rollout

Phased approach (low -> medium -> critical) with 30 days of monitoring.

Document the runbook in: 06-Operacoes/Procedimentos/D7-Tecnologia/PROC-HA-Failover.md


Best Practices

Do:

  • Test failover on test VMs BEFORE production
  • Configure fencing (watchdog at minimum, IPMI ideally)
  • Monitor quorum 24/7
  • Document failover runbooks
  • Take a backup BEFORE enabling HA

Don't:

  • Run HA without fencing (risk of split-brain)
  • Set max_relocate too high (the VM keeps "bouncing" between nodes)
  • Assume instant failover without shared storage
  • Test failover in production without a plan

Troubleshooting

VM does not fail over

ha-manager status | grep vm:ID
pvecm status
journalctl -u pve-ha-crm -f
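
To read one service's placement and state without scanning the full status output, the `grep` above can be narrowed to a small parser. This is a sketch, assuming service lines have the form `service vm:200 (node1, started)`; `service_state` is an illustrative name, and the line format should be checked against the installed version.

```shell
#!/usr/bin/env bash
# Sketch: print "<node> <state>" for one HA service, reading
# "ha-manager status" text from stdin.
# Assumption: service lines look like "service vm:200 (node1, started)".
service_state() {
  local sid="$1"
  sed -n "s/^service ${sid} (\([^,]*\), \(.*\))\$/\1 \2/p"
}

# Usage on a cluster node:
#   ha-manager status | service_state vm:200
```

An empty result means the service is not under HA management, or the output format differs from the assumed one.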

Split-brain detected

# Shut down one node completely
systemctl poweroff

# On the remaining node:
pvecm expected 1  # Force quorum with 1 node
# Fix the networking, then rejoin the node that was shut down

Failover loop (VM keeps restarting)

# Temporarily pause HA
ha-manager set vm:ID --state disabled
# Fix VM issue
# Re-enable HA
ha-manager set vm:ID --state started

References