feat: refactor 30+ skills to Anthropic progressive disclosure pattern

- All SKILL.md files now <500 lines (avg reduction 69%)
- Detailed content extracted to references/ subdirectories
- Frontmatter standardised: only name + description (Anthropic standard)
- New skills: brand-guidelines, spec-coauthor, report-templates, skill-creator
- Design skills: anti-slop guidelines, premium-proposals reference
- Removed non-standard frontmatter fields (triggers, version, author, category)

Plugins affected: infraestrutura, marketing, dev-tools, crm-ops, gestao,
core-tools, negocio, perfex-dev, wordpress, design-media

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Commit 6b3a6f2698 (parent 9404af7ac9), 2026-03-12 15:05:03 +00:00
397 changed files with 67154 additions and 17257 deletions
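For reference, a sketch of the progressive disclosure layout this refactor produces for the proxmox-ha skill shown below (the directory prefix is illustrative):
```
<plugin>/skills/proxmox-ha/
├── SKILL.md                      # <500 lines: name + description frontmatter, condensed workflow
└── references/
    ├── fencing-configuration.md  # detailed fencing options
    └── failover-testing.md       # detailed tests, policies, monitoring, rollout
```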

File: proxmox-ha/SKILL.md
---
name: proxmox-ha
description: High Availability configuration on a Proxmox cluster - HA Manager, fencing devices (STONITH), and automatic failover for critical VMs.
---
# Proxmox HA
Configure High Availability (HA) on a Proxmox cluster with HA Manager, fencing devices, and automatic failover for critical VMs.
## When to Use
- Configure HA after cluster formation (/proxmox-cluster)
- Protect critical VMs with automatic failover
- Configure fencing devices (STONITH)
- Define HA groups by criticality
- Test failover procedures
## Usage
```bash
/proxmox-ha configure --critical-vms <vm-ids> [--fencing watchdog|ipmi] [--max-relocate 2]
```
## Examples
```bash
# HA for critical VMs with watchdog
/proxmox-ha configure --critical-vms 200,300 --fencing watchdog
# HA with IPMI fencing (hardware)
/proxmox-ha configure --critical-vms 200,300,301 --fencing ipmi --max-relocate 1
# Only test failover (without enabling HA)
/proxmox-ha test --vm 999
```
## Knowledge Sources
### NotebookLM
```bash
mcp__notebooklm__notebook_query \
notebook_id:"276ccdde-6b95-42a3-ad96-4e64d64c8d52" \
query:"proxmox ha high availability fencing stonith failover"
```
---
## Prerequisites
**1. Cluster formed:**
```bash
# Verify the cluster is healthy
pvecm status
# Expected: Quorum: Active, Nodes: 2+ online
```
**2. Shared storage or replication:**
- **Shared storage** (NFS, Ceph): ideal for HA (failover <30s)
- **No shared storage**: ZFS replication (e.g. via pvesr, sketched below) or boot-time failover (~2-5min)
**3. Fencing device configured** - no fencing = split-brain risk
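A minimal sketch of ZFS replication with pvesr (Proxmox Storage Replication); the VM ID, target node, and 15-minute schedule are illustrative:
```bash
# Replicate VM 200's ZFS disks to the other node every 15 minutes
# (requires identically named ZFS pools on both nodes)
pvesr create-local-job 200-0 cluster.descomplicar.pt --schedule "*/15"
# Check replication job status
pvesr status
```
---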
## Complete Workflow
### Phase 1: Fencing Configuration
Full details on the three options (Watchdog, IPMI, Network) in: `references/fencing-configuration.md`
**Summary:** watchdog to start, IPMI for production, avoid network fencing.
### Phase 2: HA Manager Configuration
**2.1 Enable HA Manager**
```bash
# Automatic once the cluster is formed
# Check status
ha-manager status
# Expected: quorum: OK, master: <node-name> (elected), lrm: active
```
**2.2 Create HA groups by criticality (optional):**
```bash
# Via Web UI: Datacenter → HA → Groups → Add
# Critical (priority 100)
ha-manager groupadd critical \
--nodes "server.descomplicar.pt:100,cluster.descomplicar.pt:100"
# Medium (priority 50)
ha-manager groupadd medium \
--nodes "server.descomplicar.pt:50,cluster.descomplicar.pt:50"
# Low (priority 10)
ha-manager groupadd low \
--nodes "server.descomplicar.pt:10,cluster.descomplicar.pt:10"
```
**Priority explained:**
- Higher priority = preference to run on that node
- Used for load balancing
- On failover, priority is ignored (the VM goes to whichever node is available)
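To confirm the groups were created, `ha-manager groupconfig` lists the configured groups (a quick check; output format varies by PVE version):
```bash
# List HA groups and their node:priority lists
ha-manager groupconfig
```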
### Phase 3: Add VMs to HA
**3.1 Add critical VMs**
**Via Web UI:**
- Select VM → More → Manage HA
- Enable HA
- Group: critical
- Max restart: 3
- Max relocate: 2
**Via CLI:**
```bash
# VM 200 (EasyPanel Docker)
ha-manager add vm:200 \
--group critical \
--max_restart 3 \
--max_relocate 2 \
--state started
# VM 300 (CWP)
ha-manager add vm:300 \
--group critical \
--max_restart 3 \
--max_relocate 2 \
--state started
```
**Parameters:**
- `max_restart`: restart attempts on the same node before relocating
- `max_relocate`: maximum relocations between nodes
- `state started`: HA Manager ensures the VM is always started
**3.2 Verify HA Resources**
```bash
# Verify
ha-manager status
# Should show:
# vm:200: started on <node-name>
# vm:300: started on <node-name>
```
### Phase 4: Failover Testing
Detailed test procedures (clean shutdown, simulated node crash, live migration) and policy tuning in: `references/failover-testing.md`
### Phase 5: Production Rollout
Phased approach (low -> medium -> critical) with 30 days of monitoring.
Document the runbook in: `06-Operacoes/Procedimentos/D7-Tecnologia/PROC-HA-Failover.md`
## Output Summary
```
✅ HA configured: Cluster descomplicar
🛡️ Fencing:
- Type: Watchdog (softdog)
- Nodes: 2 nodes configured
- Test: Successful ✓
📋 HA Groups:
- Critical (priority 100): 2 VMs
- Medium (priority 50): 0 VMs
- Low (priority 10): 0 VMs
🖥️ HA Resources:
- vm:200 (EasyPanel) - Critical
- vm:300 (CWP) - Critical
- Max restart: 3
- Max relocate: 2
⚡ Failover Tests:
✓ Clean shutdown → auto restart
✓ Node crash → relocate (~2min)
✓ Live migration → <10s downtime
📊 Expected Metrics:
- Detection time: ~60s
- Fencing time: ~30s
- Boot time: ~60-120s
- Total failover: ~2-3min (without shared storage)
⚠️ Limitations (without shared storage):
- Failover = boot time (not instant)
- Live migration copies the disk (slow)
- Consider shared storage in the future
🔔 Monitoring:
- Quorum check: every 5min
- Alerts: email admin@descomplicar.pt
- Logs: journalctl -u pve-ha-*
📋 Next Steps:
1. Monitor for 30 days
2. Gradually add more VMs to HA
3. Consider shared storage (NFS/Ceph)
4. Document procedures in PROC-HA-Failover.md
5. Train the team on manual failover
⏱️ Configuration time: ~30min
```
---
## Best Practices
**Do:**
- Test failover on test VMs BEFORE production
- Configure fencing (watchdog at minimum, IPMI ideally)
- Monitor quorum 24/7
- Document failover runbooks
- Back up BEFORE enabling HA (see the vzdump sketch below)
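A minimal pre-HA backup sketch with vzdump; the storage name and compression choice are assumptions:
```bash
# Snapshot-mode backup of the critical VMs before enabling HA
vzdump 200 300 --mode snapshot --storage local --compress zstd
```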
**Don't:**
- HA without fencing (split-brain risk)
- max_relocate too high (the VM keeps "bouncing")
- Assume instant failover without shared storage
- Test failover in production without a plan
---
## Troubleshooting
### VM does not fail over
```bash
# Verify HA is enabled
ha-manager status | grep vm:ID
# Check quorum
pvecm status
# Verify fencing is functional
# (watchdog or IPMI test)
# Logs
journalctl -u pve-ha-crm -f
```
### Split-brain detected
```bash
# CRITICAL: both nodes think they are master
# Shut down one node completely
systemctl poweroff
# On the remaining node:
pvecm expected 1 # Force quorum with one node
# Fix networking, rejoin the node that was shut down
```
### Failover loop (VM keeps restarting)
```bash
# VM fails → restarts → fails → restarts
# Check:
# 1. VM logs (console, task log)
# 2. max_restart reached?
# 3. VM configuration problem?
# Pause HA temporarily
ha-manager set vm:ID --state disabled
# Fix VM issue
# Re-enable HA
ha-manager set vm:ID --state started
```
---
## References
- `references/fencing-configuration.md` - Watchdog, IPMI, and Network fencing details
- `references/failover-testing.md` - Tests, policies, monitoring, alerts, production rollout
- **NotebookLM:** 276ccdde-6b95-42a3-ad96-4e64d64c8d52
- **HA Manager Docs:** https://pve.proxmox.com/pve-docs/ha-manager.1.html
- **Fencing:** https://pve.proxmox.com/wiki/Fencing
---
**Version:** 1.0.0 | **Author:** Descomplicar® | **Date:** 2026-02-14
---
/** @author Descomplicar® | @copyright 2026 */

File: references/failover-testing.md (new file)
# Failover Testing - Proxmox HA
Detailed failover testing and monitoring procedures.
---
## Create an HA Test VM
```bash
# VM 999 for testing (not production)
qm create 999 --name ha-test --memory 512 --cores 1
# Add to HA
ha-manager add vm:999 --state started
```
## Test 1: Clean Shutdown
```bash
# On the node where VM 999 runs:
qm shutdown 999
# HA Manager should:
# 1. Detect the shutdown (~30s)
# 2. Try restarting on the same node (max_restart times)
# 3. If still down -> relocate to another node
# Monitor
watch -n 1 'ha-manager status | grep vm:999'
```
## Test 2: Node Crash (Simulated)
```bash
# CAUTION: test environments only, never production
# Abrupt shutdown of the node where VM 999 runs
# (simulates a hardware failure)
echo b > /proc/sysrq-trigger # Forced reboot
# The other node should:
# 1. Detect the node is down via quorum (~1min)
# 2. Fence the node (via watchdog/IPMI)
# 3. Boot VM 999 on the surviving node
# Expected timeline:
# - 0s: node crash
# - ~60s: quorum detects the missing node
# - ~90s: fencing is executed
# - ~120s: VM boots on the other node
# Total downtime: ~2-3min (without shared storage)
# With shared storage: ~30-60s
```
## Test 3: Manual Live Migration
```bash
# Manual migration (with the VM running)
qm migrate 999 <target-node-name> --online
# With shared storage: <10s downtime
# Without shared storage: the disk is copied = slow (GB/min)
# For production VMs:
# - Use a maintenance window if there is no shared storage
# - Live migration is fine with shared storage
```
---
## HA Policies and Tuning
### Shutdown Policy
```bash
# Default: conditional (HA Manager decides)
# Options: conditional, freeze, failover, migrate
# For critical VMs that must NOT migrate during maintenance:
ha-manager set vm:200 --state freeze
# To force migration during maintenance:
ha-manager set vm:200 --state migrate
```
### Maintenance Mode
```bash
# Put the node in maintenance (it receives no new HA VMs)
ha-manager set-node-state <node-name> maintenance
# Existing HA VMs:
# - Do not migrate automatically
# - But the node receives no new VMs on failover
# Leave maintenance
ha-manager set-node-state <node-name> active
```
### Priorities (Load Balance)
```bash
# Node preference per VM
# VM 200: prefer Node B
ha-manager set vm:200 --group critical --restricted
# restricted: the VM only runs on the group's nodes
# unrestricted: the VM can run on any node (fallback)
```
---
## Monitoring and Alerts
### HA Manager Logs
```bash
# HA Manager logs
journalctl -u pve-ha-lrm -f # Local Resource Manager
journalctl -u pve-ha-crm -f # Cluster Resource Manager
# View failover decisions
grep "migrate\|relocate" /var/log/pve/tasks/index
```
### Configure Alerts
```bash
# Via Web UI: Datacenter -> Notifications
# Email alerts for:
# - Node down
# - Quorum lost
# - VM failover events
# - Fencing executed
# SMTP: mail.descomplicar.pt
# To: admin@descomplicar.pt
```
### Monitoring Script (cron, every 5min)
```bash
#!/bin/bash
# /usr/local/bin/check-ha-health.sh
ha_status=$(ha-manager status | grep "quorum:" | awk '{print $2}')
if [ "$ha_status" != "OK" ]; then
    echo "HA Quorum NOT OK" | mail -s "ALERT: HA Issue" admin@descomplicar.pt
fi
# Cron
# */5 * * * * /usr/local/bin/check-ha-health.sh
```
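One way to wire the script into cron (a sketch; assumes mail delivery is already configured):
```bash
chmod +x /usr/local/bin/check-ha-health.sh
# Append the 5-minute schedule to root's crontab
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/local/bin/check-ha-health.sh") | crontab -
```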
---
## Production Rollout
Phased approach:
```bash
# Week 1: non-critical VMs (test)
ha-manager add vm:250 --group low
# Week 2: medium VMs (if Week 1 is OK)
ha-manager add vm:201 --group medium
ha-manager add vm:202 --group medium
# Week 3: critical VMs (if all is OK)
ha-manager add vm:200 --group critical
ha-manager add vm:300 --group critical
```
### Document the Runbook
Create: `06-Operacoes/Procedimentos/D7-Tecnologia/PROC-HA-Failover.md`
Contents (a skeleton follows this list):
- Detect the failover event
- Validate that the VM booted correctly
- Investigate the cause of the node failure
- Restore the original node
- Migrate the VM back (if necessary)
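A minimal skeleton for PROC-HA-Failover.md; the structure is a suggestion and the commands are drawn from this skill:
```markdown
# PROC-HA-Failover
## 1. Detect
- Email alert received, or `ha-manager status` shows the VM relocated
## 2. Validate
- `qm status <vmid>` on the new node; application health check
## 3. Investigate
- `journalctl -u pve-ha-crm` on the surviving node; IPMI/console on the failed node
## 4. Restore
- Fix hardware/network, rejoin the node, confirm `pvecm status` shows all nodes online
## 5. Migrate back (if necessary)
- `qm migrate <vmid> <original-node> --online`
```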

File: references/fencing-configuration.md (new file)
# Fencing Configuration - Proxmox HA
Detailed configuration of fencing devices (STONITH) for High Availability.
---
## Option A: Watchdog (Software Fencing)
Simplest, least reliable:
```bash
# Install watchdog on both nodes
apt install watchdog
# Load the kernel module
modprobe softdog
# Auto-load on boot
echo "softdog" >> /etc/modules
# Configure HA Manager to use the watchdog
# (automatic once HA is enabled)
```
## Option B: IPMI/iLO (Hardware Fencing)
More reliable, requires IPMI:
```bash
# Check that IPMI is available
ipmitool lan print
# Configure IPMI credentials (via BIOS or ipmitool)
# Configure in Proxmox (Web UI):
# Datacenter -> Fencing -> Add
# Type: IPMI
# IP: <node-ipmi-ip>
# Username: admin
# Password: <ipmi-pass>
# Test
fence_ipmilan -a <node-ipmi-ip> -l admin -p <pass> -o status
```
## Option C: Network Fencing (Least Reliable)
Use only if IPMI is not available:
```bash
# SSH-based fencing (dangerous)
# Depends on the network being up
# Not recommended for production
```
## Recommendation for Cluster Descomplicar
- **Start:** Watchdog (simple, functional)
- **Production:** IPMI if the hardware supports it
- **Avoid:** Network fencing