claude-plugins/infraestrutura/skills/proxmox-ha/SKILL.md

---
name: proxmox-ha
description: High Availability configuration for a Proxmox cluster, covering HA Manager, fencing devices (STONITH), and automatic failover for critical VMs.
---

# Proxmox HA

Configure High Availability (HA) on a Proxmox cluster with HA Manager, fencing devices, and automatic failover for critical VMs.

## When to Use

- Configure HA after cluster formation (`/proxmox-cluster`)
- Protect critical VMs with automatic failover
- Configure fencing devices (STONITH)
- Define HA groups by criticality
- Test failover procedures

## Syntax

```bash
/proxmox-ha configure --critical-vms <vm-ids> [--fencing watchdog|ipmi] [--max-relocate 2]
```

## Knowledge Sources

```bash
mcp__notebooklm__notebook_query \
  notebook_id:"276ccdde-6b95-42a3-ad96-4e64d64c8d52" \
  query:"proxmox ha high availability fencing stonith failover"
```

---

## Prerequisites

**1. Cluster formed:**
```bash
pvecm status
# Expected: "Quorate: Yes" with 2+ nodes online
# (with only 2 votes, losing one node also loses quorum; a third node or a QDevice avoids that)
```

**2. Shared storage or replication:**
- **Shared storage** (NFS, Ceph): ideal for HA (failover <30s)
- **No shared storage**: ZFS replication, or failover that includes a full VM boot (~2-5 min)

**3. Fencing device configured.** Without fencing there is a risk of split-brain.
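
The quorum prerequisite can be checked in a script before any HA change is made. The following is a minimal sketch, assuming `pvecm status` prints a line of the form `Quorate: Yes`; the function name `is_quorate` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: read `pvecm status` output on stdin and succeed only if it reports quorum.
# Assumption: the output contains a line such as "Quorate:          Yes".
is_quorate() {
  grep -Eq '^[[:space:]]*Quorate:[[:space:]]*Yes'
}

# Usage on a cluster node:
#   pvecm status | is_quorate || { echo "No quorum: do not enable HA" >&2; exit 1; }
```

Reading from stdin keeps the check independent of the cluster, so it can be tried against saved output first.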

---

## Complete Workflow

### Phase 1: Fencing Configuration

Full details of the 3 options (Watchdog, IPMI, Network) are in `references/fencing-configuration.md`.

**Summary:** Watchdog to start with, IPMI for production; avoid network fencing.

### Phase 2: HA Manager Configuration

```bash
# Check status
ha-manager status
# Expected: quorum OK; once HA resources exist, also master <node-name> (active, ...) and lrm <node-name> (active, ...)
```

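**Create HA groups by criticality.** The number after each node name is that node's priority inside the group, and a service prefers the available node with the highest priority. With equal priorities on both nodes, as here, the groups label criticality but do not change placement: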
```bash
# Critical (priority 100)
ha-manager groupadd critical \
  --nodes "server.descomplicar.pt:100,cluster.descomplicar.pt:100"

# Medium (priority 50)
ha-manager groupadd medium \
  --nodes "server.descomplicar.pt:50,cluster.descomplicar.pt:50"

# Low (priority 10)
ha-manager groupadd low \
  --nodes "server.descomplicar.pt:10,cluster.descomplicar.pt:10"
```
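
If one node should be preferred over the other, the `--nodes` value can carry different priorities per node. The sketch below only builds that value; the helper `nodes_arg` and the group name `prefer_server` are hypothetical and not part of this skill.

```bash
#!/usr/bin/env bash
# Sketch: build a "node:prio,node:prio" string from "node prio" argument pairs.
nodes_arg() {
  local out=""
  while [ "$#" -ge 2 ]; do
    out="${out:+$out,}$1:$2"   # append "node:prio", comma-separated
    shift 2
  done
  printf '%s\n' "$out"
}

# Usage on a cluster node (hypothetical group that prefers the first node):
#   ha-manager groupadd prefer_server \
#     --nodes "$(nodes_arg server.descomplicar.pt 2 cluster.descomplicar.pt 1)"
#   ha-manager groupconfig   # review the resulting group definitions
```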

### Phase 3: Add VMs to HA

```bash
# VM 200 (EasyPanel Docker)
ha-manager add vm:200 \
  --group critical \
  --max_restart 3 \
  --max_relocate 2 \
  --state started

# VM 300 (CWP)
ha-manager add vm:300 \
  --group critical \
  --max_restart 3 \
  --max_relocate 2 \
  --state started
```

**Parameters:**
- `max_restart`: Restart attempts on the same node before the VM is relocated
- `max_relocate`: Maximum number of relocations between nodes
- `state started`: HA Manager keeps the VM in the started state
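
The two `ha-manager add` calls above differ only in the VM ID, so for longer lists the commands can be generated and reviewed before running. This is a minimal dry-run sketch; the function name `ha_add_cmds` is hypothetical, and the restart and relocate values are simply the ones used in this phase.

```bash
#!/usr/bin/env bash
# Sketch: print (do not run) one `ha-manager add` command per VM ID.
# $1 = HA group, remaining arguments = VM IDs.
ha_add_cmds() {
  local group="$1"; shift
  local vmid
  for vmid in "$@"; do
    printf 'ha-manager add vm:%s --group %s --max_restart 3 --max_relocate 2 --state started\n' \
      "$vmid" "$group"
  done
}

# Usage: inspect the output first, then run it on a cluster node:
#   ha_add_cmds critical 200 300
#   ha_add_cmds critical 200 300 | sh
```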

```bash
# Verify
ha-manager status
```
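
For scripted verification, the state of a single service can be extracted from the status output. A minimal sketch, assuming `ha-manager status` prints service lines of the form `service vm:200 (node1, started)`; the function name `ha_service_state` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: read `ha-manager status` output on stdin and print the state of one service.
# Assumption: service lines look like "service vm:200 (node1, started)".
ha_service_state() {
  awk -v sid="$1" '
    $1 == "service" && $2 == sid {
      state = $NF
      gsub(/[()]/, "", state)   # strip the closing parenthesis
      print state
    }'
}

# Usage on a cluster node:
#   [ "$(ha-manager status | ha_service_state vm:200)" = "started" ] || echo "vm:200 is not started"
```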

### Phase 4: Failover Testing

Detailed test procedures (clean shutdown, simulated node crash, live migration) and policy tuning are in `references/failover-testing.md`.

### Phase 5: Production Rollout

Phased approach (low -> medium -> critical) with 30 days of monitoring.

Document the runbook in `06-Operacoes/Procedimentos/D7-Tecnologia/PROC-HA-Failover.md`.

---

## Best Practices

**Do:**
- Test failover on test VMs BEFORE production
- Configure fencing (watchdog at minimum, IPMI ideally)
- Monitor quorum 24/7
- Document failover runbooks
- Take a backup BEFORE enabling HA

**Don't:**
- Run HA without fencing (risk of split-brain)
- Set `max_relocate` too high (the VM keeps "bouncing" between nodes)
- Assume instant failover without shared storage
- Test failover in production without a plan

---

## Troubleshooting

### VM does not fail over
```bash
ha-manager status | grep vm:ID
pvecm status
journalctl -u pve-ha-crm -f
```

### Split-brain detected
```bash
# Shut down one node completely
systemctl poweroff

# On the remaining node:
pvecm expected 1  # Force quorum with 1 node
# Fix networking, then bring the shut-down node back into the cluster
```

### Failover loop (VM keeps restarting)
```bash
# Temporarily pause HA for the VM
ha-manager set vm:ID --state disabled
# Fix VM issue
# Re-enable HA
ha-manager set vm:ID --state started
```

---

## References

- `references/fencing-configuration.md` - Watchdog, IPMI, and Network fencing details
- `references/failover-testing.md` - Tests, policies, monitoring, alerts, production rollout
- **NotebookLM:** 276ccdde-6b95-42a3-ad96-4e64d64c8d52
- **HA Manager Docs:** https://pve.proxmox.com/pve-docs/ha-manager.1.html
- **Fencing:** https://pve.proxmox.com/wiki/Fencing

---

## Healing Log

Log of known errors and how to avoid them. Read automatically before execution.

```jsonl
{"date":"","issue":"","fix":"","source":"user|auto"}
```

*Add a new line after each error is fixed.*
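
New entries can be generated in the same JSONL shape rather than typed by hand. A minimal sketch, assuming entries are single-line strings; the helpers `json_escape` and `healing_log_entry` are hypothetical, and the escaping covers only backslashes and double quotes, not control characters.

```bash
#!/usr/bin/env bash
# Sketch: escape backslashes and double quotes for use inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print one Healing Log entry. $1 = issue, $2 = fix, $3 = source ("user" or "auto").
healing_log_entry() {
  local issue="$1" fix="$2" src="${3:-user}"
  printf '{"date":"%s","issue":"%s","fix":"%s","source":"%s"}\n' \
    "$(date +%F)" "$(json_escape "$issue")" "$(json_escape "$fix")" "$src"
}

# Usage:
#   healing_log_entry "VM stuck in failover loop" "set state disabled, fixed VM, set state started" user
```

The printed line can then be pasted into the `jsonl` block above.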