---
name: proxmox-ha
description: High Availability configuration for a Proxmox cluster, covering HA Manager, fencing devices (STONITH), and automatic failover for critical VMs.
---

# Proxmox HA

Configure High Availability (HA) on a Proxmox cluster with HA Manager, fencing devices, and automatic failover for critical VMs.

## When to Use

- Configuring HA after cluster formation (/proxmox-cluster)
- Protecting critical VMs with automatic failover
- Configuring fencing devices (STONITH)
- Defining HA groups by criticality
- Testing failover procedures

## Syntax

```bash
/proxmox-ha configure --critical-vms <vm-ids> [--fencing watchdog|ipmi] [--max-relocate 2]
```

## Knowledge Sources

```bash
mcp__notebooklm__notebook_query \
  notebook_id:"276ccdde-6b95-42a3-ad96-4e64d64c8d52" \
  query:"proxmox ha high availability fencing stonith failover"
```

---

## Prerequisites

**1. Cluster formed:**
```bash
pvecm status
# Expected: "Quorate: Yes" and 2 or more nodes online
```

**2. Shared storage or replication:**
- **Shared storage** (NFS, Ceph): ideal for HA (failover <30s)
- **No shared storage**: ZFS replication or boot-time failover (~2-5 min)

**3. Fencing device configured:** without fencing there is a risk of split-brain.

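The cluster check in prerequisite 1 can be scripted. The following is a minimal sketch, not part of the skill itself: `is_quorate` is a hypothetical helper name, and it assumes that `pvecm status` prints a `Quorate:` line whose value is `Yes` when the cluster has quorum.

```bash
# Hypothetical helper: succeeds when the `pvecm status` output read from
# stdin reports a quorate cluster. Assumes a "Quorate: Yes" line.
is_quorate() {
    awk '$1 == "Quorate:" && $2 == "Yes" { found = 1 } END { exit !found }'
}

# Usage (on a cluster node):
#   pvecm status | is_quorate && echo "cluster is quorate"
```

Reading from stdin keeps the helper testable against saved output, without needing a live cluster.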
---

## Full Workflow

### Phase 1: Fencing Configuration

Full details of the 3 options (Watchdog, IPMI, Network) are in: `references/fencing-configuration.md`

**Summary:** Start with Watchdog, use IPMI for production, and avoid network fencing.

### Phase 2: HA Manager Configuration

```bash
# Check status
ha-manager status
# Expected: "quorum OK"; with HA resources configured, also
# "master <node-name> (active, ...)" and "lrm <node-name> (active, ...)" lines
```

**Create HA groups by criticality:**
```bash
# Critical (priority 100)
ha-manager groupadd critical \
  --nodes "server.descomplicar.pt:100,cluster.descomplicar.pt:100"

# Medium (priority 50)
ha-manager groupadd medium \
  --nodes "server.descomplicar.pt:50,cluster.descomplicar.pt:50"

# Low (priority 10)
ha-manager groupadd low \
  --nodes "server.descomplicar.pt:10,cluster.descomplicar.pt:10"
```

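The `--nodes` value is a comma-separated list of `node:priority` pairs. As a minimal sketch, the hypothetical helper `nodes_arg` builds that string from one priority and a list of node names, so the three group definitions stay consistent. As far as the HA Manager documentation describes it, the priority ranks nodes inside a single group, so giving every node the same value expresses no node preference.

```bash
# Hypothetical helper: print "node:prio,node:prio,..." for --nodes.
# $1 is the priority, the remaining arguments are node names.
nodes_arg() {
    prio="$1"
    shift
    out=""
    for node in "$@"; do
        # Prefix a comma only when out already holds an entry.
        out="${out:+$out,}$node:$prio"
    done
    printf '%s\n' "$out"
}

# Usage:
#   ha-manager groupadd critical \
#     --nodes "$(nodes_arg 100 server.descomplicar.pt cluster.descomplicar.pt)"
```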
### Phase 3: Add VMs to HA

```bash
# VM 200 (EasyPanel Docker)
ha-manager add vm:200 \
  --group critical \
  --max_restart 3 \
  --max_relocate 2 \
  --state started

# VM 300 (CWP)
ha-manager add vm:300 \
  --group critical \
  --max_restart 3 \
  --max_relocate 2 \
  --state started
```

**Parameters:**
- `max_restart`: restart attempts on the same node before relocating
- `max_relocate`: maximum number of relocations between nodes
- `state started`: HA Manager keeps the VM in the started state

```bash
# Verify
ha-manager status
```

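With more than a couple of VMs, the repeated `ha-manager add` calls can be generated. This is a sketch under stated assumptions: `ha_add_commands` is a hypothetical helper, it reuses the `max_restart 3` and `max_relocate 2` values from the commands above, and it only prints the commands so they can be reviewed before anything is executed.

```bash
# Hypothetical helper: print (do not run) one `ha-manager add` command
# per VM ID. $1 is the HA group, the remaining arguments are VM IDs.
ha_add_commands() {
    group="$1"
    shift
    for vmid in "$@"; do
        printf 'ha-manager add vm:%s --group %s --max_restart 3 --max_relocate 2 --state started\n' \
            "$vmid" "$group"
    done
}

# Usage: review the output first, then pipe it to a shell to apply.
#   ha_add_commands critical 200 300
#   ha_add_commands critical 200 300 | sh
```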
### Phase 4: Failover Testing

Detailed test procedures (clean shutdown, simulated node crash, live migration) and policy tuning are in: `references/failover-testing.md`

### Phase 5: Production Rollout

Phased approach (low -> medium -> critical) with 30 days of monitoring.

Document the runbook in: `06-Operacoes/Procedimentos/D7-Tecnologia/PROC-HA-Failover.md`

---

## Best Practices

**Do:**
- Test failover on test VMs BEFORE production
- Configure fencing (watchdog at minimum, IPMI ideally)
- Monitor quorum 24/7
- Document failover runbooks
- Take a backup BEFORE enabling HA

**Don't:**
- Run HA without fencing (risk of split-brain)
- Set max_relocate too high (the VM keeps "bouncing" between nodes)
- Assume instant failover without shared storage
- Test failover in production without a plan

---

## Troubleshooting

### VM does not fail over
```bash
ha-manager status | grep vm:ID   # HA state of the VM
pvecm status                     # cluster quorum
journalctl -u pve-ha-crm -f      # follow the HA cluster resource manager log
```

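A plain `grep vm:ID` also matches longer IDs (for example, `vm:20` matches `vm:200`). A minimal sketch of an exact match follows. `ha_service_state` is a hypothetical helper, and it assumes `ha-manager status` prints service lines of the form `service vm:200 (node1, started)`.

```bash
# Hypothetical helper: print the HA state of exactly one service ID,
# reading `ha-manager status` output from stdin.
# Assumes lines like: service vm:200 (node1, started)
ha_service_state() {
    awk -v sid="$1" '$1 == "service" && $2 == sid {
        state = $NF            # last field, e.g. "started)"
        sub(/\)$/, "", state)  # drop the closing parenthesis
        print state
    }'
}

# Usage:
#   ha-manager status | ha_service_state vm:200
```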
### Split-brain detected
```bash
# Shut down one node completely
systemctl poweroff

# On the remaining node:
pvecm expected 1  # Force quorum with 1 node
# Fix networking, then rejoin the node that was shut down
```

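The reason `pvecm expected 1` is needed comes down to majority arithmetic: quorum requires more than half of the expected votes. A short sketch of that calculation, assuming one vote per node (`votes_needed` is a hypothetical helper):

```bash
# Hypothetical helper: votes required for a majority,
# given the number of expected votes (one vote per node assumed).
votes_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Usage:
#   votes_needed 2   # prints 2: one surviving node of two is not enough
#   votes_needed 1   # prints 1: after lowering expected votes to 1
```

With two nodes, a single survivor holds 1 of the 2 votes required, so it stays without quorum until the expected vote count is lowered.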
### Failover loop (VM keeps restarting)
```bash
# Temporarily pause HA
ha-manager set vm:ID --state disabled
# Fix the VM issue
# Re-enable HA
ha-manager set vm:ID --state started
```

---

## References

- `references/fencing-configuration.md` - Watchdog, IPMI, and Network fencing details
- `references/failover-testing.md` - Tests, policies, monitoring, alerts, production rollout
- **NotebookLM:** 276ccdde-6b95-42a3-ad96-4e64d64c8d52
- **HA Manager Docs:** https://pve.proxmox.com/pve-docs/ha-manager.1.html
- **Fencing:** https://pve.proxmox.com/wiki/Fencing