init: miscellaneous scripts (crawlers, converters, scrapers)

2026-03-05 20:38:36 +00:00
commit 6ac6f4be2a
925 changed files with 850330 additions and 0 deletions
@@ -0,0 +1,395 @@
+# 🕷️ Advanced Web Scraper
+
+A complete web scraping system for complex sites, forums, and Reddit.
+
+**Author**: Descomplicar® Crescimento Digital
+**Link**: https://descomplicar.pt
+**Copyright**: 2025 Descomplicar®
+
+---
+
+## 📋 **TABLE OF CONTENTS**
+
+1. [Features](#-features)
+2. [Requirements](#-requirements)
+3. [Installation](#-installation)
+4. [Configuration](#-configuration)
+5. [Basic Usage](#-basic-usage)
+6. [Advanced Usage](#-advanced-usage)
+7. [File Structure](#-file-structure)
+8. [Troubleshooting](#-troubleshooting)
+9. [Limitations](#-limitations)
+
+---
+
+## ✨ **FEATURES**
+
+### **Core**
+- ✅ Scraping with Playwright (handles JavaScript-rendered pages)
+- ✅ HTML → Markdown conversion
+- ✅ Automatic content cleanup
+- ✅ Optional AI formatting (OpenRouter)
+
+### **Advanced**
+- ✅ Official Reddit API (no ToS violations)
+- ✅ Batch processing (multiple sites)
+- ✅ User-agent rotation
+- ✅ Proxy support
+- ✅ Smart rate limiting
+- ✅ Retry logic with exponential backoff
+- ✅ Full logging
+
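The README lists retry logic with exponential backoff but does not show it. A minimal sketch of the general technique follows; the helper names, the defaults, and the 60-second cap are illustrative assumptions, not code taken from `scraper.py`.

```python
import random
import time


def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0):
    """Return the wait in seconds before each retry: base, base*factor, ... up to cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def retry_with_backoff(func, retries=3, base=1.0, jitter=0.0, sleep=time.sleep):
    """Call func() until it succeeds, waiting longer after each failure.

    Hypothetical helper: the real scraper may use different names and limits.
    """
    delays = backoff_delays(retries, base=base)
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Optional random jitter spreads out retries from parallel workers.
            sleep(delays[attempt] + random.uniform(0, jitter))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting; in normal use the default `time.sleep` applies.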
+### **Supported Site Types**
+- 🌐 WordPress sites
+- 💬 Forums (vBulletin, phpBB, etc.)
+- 🛒 E-commerce (blog/resource pages only)
+- 📰 News sites
+- 📖 Technical documentation
+
+---
+
+## 🔧 **REQUIREMENTS**
+
+### **System**
+- Python 3.8+
+- 2 GB RAM minimum
+- 5 GB free disk space (for output)
+
+### **APIs (optional)**
+- OpenRouter API (for AI formatting)
+- Reddit API (for Reddit scraping)
+
+---
+
+## 📦 **INSTALLATION**
+
+### **1. Clone/Download**
+```bash
+cd /media/ealmeida/Dados/Dev/Scripts/scraper/
+```
+
+### **2. Create a Virtual Environment**
+```bash
+python3 -m venv .venv
+source .venv/bin/activate  # Linux/Mac
+# or
+.venv\Scripts\activate  # Windows
+```
+
+### **3. Install Dependencies**
+```bash
+pip install -r requirements.txt
+```
+
+### **4. Install Playwright Browsers**
+```bash
+python -m playwright install chromium
+```
+
+### **5. Configure the Environment**
+```bash
+cp .env.example .env
+nano .env  # Edit with your credentials
+```
+
+---
+
+## ⚙️ **CONFIGURATION**
+
+### **1. `.env` File**
+
+```bash
+# API Keys
+OPENROUTER_API_KEY=sk-or-v1-your-key-here
+
+# Reddit API (get credentials at https://reddit.com/prefs/apps)
+REDDIT_CLIENT_ID=your-client-id
+REDDIT_CLIENT_SECRET=your-client-secret
+REDDIT_USER_AGENT=ScraperBot/1.0 by YourUsername
+
+# Proxy (optional)
+PROXY_USER=username
+PROXY_PASS=password
+```
+
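Before a run, it can help to check which credentials are actually set. The sketch below uses only the standard library; the grouping of variables per feature (`REQUIRED_FOR`) and both function names are assumptions for illustration, and the project itself may load `.env` differently.

```python
import os

# Assumed grouping of the variables shown in the .env template above.
REQUIRED_FOR = {
    "openrouter": ["OPENROUTER_API_KEY"],
    "reddit": ["REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT"],
}


def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(feature, env=None):
    """Return the variables a feature needs that are absent or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_FOR[feature] if not env.get(key)]
```

For example, `missing_keys("reddit")` checks the current process environment, while `missing_keys("reddit", parse_env_file(open(".env").read()))` checks the file directly.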
+### **2. Configure Sites**
+
+Edit `sites_config.json`:
+
+```json
+{
+  "sites": [
+    {
+      "name": "My Site",
+      "url": "https://exemplo.com",
+      "type": "wordpress",
+      "max_depth": 2,
+      "notes": "Optional description"
+    }
+  ],
+  "reddit_subreddits": ["subreddit1", "subreddit2"]
+}
+```
+
+**Available types**:
+- `wordpress` - WordPress sites
+- `forum` - Forums (automatically limited to depth=1)
+- `ecommerce` - E-commerce (blog/resource pages only)
+- `website` - Generic sites
+
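How the type filter and the forum depth limit interact can be sketched as follows. `select_sites` is a hypothetical helper written for this illustration; it mirrors the behaviour described above (filter by `type`, cap forums at depth 1) but is not taken from `batch_scraper.py`.

```python
import json


def select_sites(config, types=None):
    """Return site entries, optionally filtered by type, with forum depth capped at 1."""
    selected = []
    for site in config.get("sites", []):
        if types and site.get("type") not in types:
            continue
        site = dict(site)  # copy so the loaded config is not mutated
        if site.get("type") == "forum":
            site["max_depth"] = min(site.get("max_depth", 1), 1)
        selected.append(site)
    return selected


# Inline sample in the same shape as sites_config.json (the forum entry is made up).
config = json.loads("""
{
  "sites": [
    {"name": "My Site", "url": "https://exemplo.com", "type": "wordpress", "max_depth": 2},
    {"name": "Some Forum", "url": "https://exemplo.com/forum", "type": "forum", "max_depth": 3}
  ],
  "reddit_subreddits": ["subreddit1", "subreddit2"]
}
""")
```

Calling `select_sites(config, types={"forum"})` would correspond roughly to `--types forum`, with the forum entry's depth reduced to 1.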
+---
+
+## 🚀 **BASIC USAGE**
+
+### **Option 1: Batch Scraper (Recommended)**
+
+```bash
+# Process ALL sites in the config
+python batch_scraper.py --all
+
+# WordPress only
+python batch_scraper.py --types wordpress
+
+# Forums only
+python batch_scraper.py --types forum
+
+# Multiple types
+python batch_scraper.py --types wordpress forum
+
+# Include Reddit
+python batch_scraper.py --all --include-reddit
+
+# Reddit only
+python batch_scraper.py --reddit-only
+```
+
+### **Option 2: Single-Site Scraper**
+
+```bash
+# Edit scraper.py (line 489) and set:
+#   urls = ["https://meusite.com"]
+
+# Run
+python scraper.py
+```
+
+### **Option 3: Reddit Only**
+
+```bash
+python reddit_scraper.py
+```
+
+---
+
+## 🎯 **ADVANCED USAGE**
+
+### **Full Pipeline** (3 Phases)
+
+#### **Phase 1: Extraction**
+```bash
+python batch_scraper.py --all
+```
+**Output**: `output_md/*.md` (raw)
+
+#### **Phase 2: Cleanup**
+```bash
+python clean_md.py output_md/ output_cleaned/
+```
+**Output**: `output_cleaned/*.md` (cleaned)
+
+#### **Phase 3: AI Formatting** (optional)
+```bash
+python format_content.py
+```
+**Output**: `formatted/*.md` (professionally formatted)
+
+### **Custom Config**
+
+```bash
+# Use an alternative config
+python batch_scraper.py --config meu_config.json --all
+```
+
+### **Advanced Filters**
+
+Edit `sites_config.json`:
+
+```json
+{
+  "sites": [
+    {
+      "name": "Complex Site",
+      "url": "https://exemplo.com",
+      "type": "forum",
+      "max_depth": 1,
+      "excluded_patterns": [
+        "/admin/",
+        "/private/",
+        "/login/"
+      ],
+      "notes": "Forum with many pages"
+    }
+  ]
+}
+```
+
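The README does not say how `excluded_patterns` are matched. One plausible reading, shown here purely as an assumption, is a substring match against the URL path; `is_excluded` is a hypothetical name.

```python
from urllib.parse import urlparse


def is_excluded(url, excluded_patterns):
    """True if any pattern occurs as a substring of the URL path (assumed semantics)."""
    path = urlparse(url).path
    return any(pattern in path for pattern in excluded_patterns)


patterns = ["/admin/", "/private/", "/login/"]
urls = [
    "https://exemplo.com/thread/42",
    "https://exemplo.com/admin/users",
    "https://exemplo.com/login/",
]
# Keep only the URLs that match none of the excluded patterns.
allowed = [u for u in urls if not is_excluded(u, patterns)]
```

Under this reading the trailing slash matters: `https://exemplo.com/login` (no slash) would not match `/login/`, so patterns may need to be written with that in mind.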
+---
+
+## 📁 **FILE STRUCTURE**
+
+```
+scraper/
+├── scraper.py              # Main scraper (Playwright)
+├── batch_scraper.py        # Batch processor
+├── reddit_scraper.py       # Reddit API scraper
+├── clean_md.py             # Markdown cleanup
+├── format_content.py       # AI formatting
+├── sites_config.json       # Site configuration
+├── requirements.txt        # Python dependencies
+├── .env                    # Credentials (do NOT commit)
+├── .env.example            # Credentials template
+├── .gitignore              # Git exclusions
+├── README.md               # This documentation
+│
+├── output_md/              # Phase 1 output (raw)
+├── output_cleaned/         # Phase 2 output (cleaned)
+├── formatted/              # Phase 3 output (formatted)
+└── logs/                   # Execution logs
+```
+
+---
+
+## 🔍 **TROUBLESHOOTING**
+
+### **Error: "API key not found"**
+```bash
+# Check that .env exists
+ls -la .env
+
+# Check its contents
+cat .env
+
+# If it does not exist, create it
+cp .env.example .env
+nano .env
+```
+
+### **Error: "playwright not installed"**
+```bash
+python -m playwright install chromium
+```
+
+### **Error: "Timeout" while scraping**
+```python
+# Edit scraper.py, line 475
+request_timeout=120  # Increase to 120s
+```
+
+### **Site blocked (403/429)**
+```python
+# Add a proxy in .env:
+#   PROXY_USER=username
+#   PROXY_PASS=password
+
+# Or increase politeness_delay
+politeness_delay=(5, 10)  # 5-10s between requests
+```
+
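The `politeness_delay=(5, 10)` setting reads as a (min, max) range in seconds. A minimal sketch of how such a range is typically applied between requests, assuming a uniformly random wait; `polite_sleep` is a hypothetical helper and the scraper's actual implementation may differ.

```python
import random
import time


def polite_sleep(politeness_delay=(5, 10), sleep=time.sleep, rng=random):
    """Wait a random duration within the (min, max) range and return it."""
    low, high = politeness_delay
    delay = rng.uniform(low, high)  # a different wait each time avoids a fixed rhythm
    sleep(delay)
    return delay
```

Calling `polite_sleep((5, 10))` before each page fetch would space requests 5 to 10 seconds apart.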
+### **Reddit: "Invalid credentials"**
+```bash
+# Create a Reddit app:
+# 1. Go to https://reddit.com/prefs/apps
+# 2. Click "create app"
+# 3. Type: "script"
+# 4. Redirect URI: http://localhost:8080
+# 5. Copy CLIENT_ID and CLIENT_SECRET into .env
+```
+
+### **Logs do not appear**
+```bash
+# Check permissions
+ls -la *.log
+
+# Run and capture all output (stdout and stderr) in execution.log
+python batch_scraper.py --all 2>&1 | tee execution.log
+```
+
+---
+
+## ⚠️ **LIMITATIONS**
+
+### **Does Not Work With**
+- ❌ Sites behind aggressive Cloudflare protection (challenge pages)
+- ❌ Sites that require login
+- ❌ React/Vue SPAs without SSR (no initial HTML)
+- ❌ Sites with CAPTCHA
+
+### **Scale Limitations**
+- **Memory**: Loads files into RAM (a problem for files >100 MB)
+- **Disk**: Can generate thousands of files
+- **API costs**: AI formatting can be expensive at high volume
+
+### **Rate Limits**
+- **Playwright**: ~10-20 sites/hour (complex sites)
+- **Reddit API**: 60 requests/minute (free tier)
+- **OpenRouter**: Depends on the plan
+
+---
+
+## 📊 **ESTIMATED PERFORMANCE**
+
+| Site type | Pages/hour | Average time/page |
+|-----------|--------------|-------------------|
+| Simple WordPress | 100-200 | 30-60s |
+| Forum | 50-100 | 60-90s |
+| E-commerce | 20-50 | 90-120s |
+| Reddit (API) | 1000+ | <1s |
+
+---
+
+## 🔐 **SECURITY & ETHICS**
+
+### **Best Practices**
+- ✅ Respect `robots.txt`
+- ✅ Rate limiting (2-5s between requests)
+- ✅ Identifiable user-agent
+- ✅ Do not overload servers
+- ✅ Use official APIs when available (Reddit)
+
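Respecting `robots.txt` can be done with Python's standard `urllib.robotparser`. The sketch below parses a sample file from a string so it runs offline; the `build_robots_checker` helper and the sample rules are illustrative assumptions, and whether `scraper.py` performs this check is not stated in the README.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent):
    """Return a function that tells whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Sample rules for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
"""
allowed = build_robots_checker(robots_txt, "ScraperBot/1.0")
```

Against a live site, the same parser can instead be pointed at the real file with `set_url(...)` followed by `read()` before calling `can_fetch`.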
+### **Do Not**
+- ❌ Scrape aggressively (>1 req/s)
+- ❌ Ignore rate limits
+- ❌ Scrape login-protected content
+- ❌ Redistribute content without permission
+
+---
+
+## 📈 **ROADMAP**
+
+### **v2.0** (Upcoming improvements)
+- [ ] Support for more APIs (Twitter, HackerNews)
+- [ ] Database storage (SQLite/PostgreSQL)
+- [ ] Web dashboard (Flask/FastAPI)
+- [ ] Docker support
+- [ ] Scheduled scraping (cron)
+- [ ] Automatic change detection
+
+---
+
+## 📞 **SUPPORT**
+
+**Issues/Bugs**: Open an issue in the repository
+**Questions**: contacto@descomplicar.pt
+**Website**: https://descomplicar.pt
+
+---
+
+## 📄 **LICENSE**
+
+Copyright 2025 Descomplicar® Crescimento Digital
+All rights reserved
+
+---
+
+**Last updated**: 2025-11-05
+**Version**: 2.0