ealmeida 865a9459a6 feat(scraper): add Bizin.eu scrapers v1+v2 + Desk #2055 triangulation

- bizin_scraper.py: undetected-chromedriver + headless Selenium
- bizin_scraper_v2.py: curl_cffi impersonating Chrome110
- .desk-project: triangulation of task #2055 / DES 360º project

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-28 11:52:17 +01:00

File	Last commit	Date
.env.example	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
.gitignore	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
batch_scraper_v2_batch4.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
batch_scraper.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
bizin_scraper_v2.py	feat(scraper): add Bizin.eu scrapers v1+v2 + Desk #2055 triangulation	2026-04-28 11:52:17 +01:00
bizin_scraper.py	feat(scraper): add Bizin.eu scrapers v1+v2 + Desk #2055 triangulation	2026-04-28 11:52:17 +01:00
check_sites_availability.sh	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
clean_md.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
consolidate_knowledge_final.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
consolidate_knowledge.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
CTF_CARSTUFF_GUIDE.md	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
ctf_config_batch3.json	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
ctf_config_batch4.json	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
ctf_config.json	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
ctf_config.json.backup_depth1_20251105_030901	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
deploy_vps_batch4.sh	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
extract_knowledge_batch3_reddit.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
extract_knowledge_FINAL.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
extract_knowledge_production.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
extract_knowledge_v3_complete.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
extract_reddit_only.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
format_content.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
IMPLEMENTADO.md	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
input.md	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
monitor_batch3.sh	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
monitor_extraction_batch2.sh	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
monitor_extraction.sh	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
monitor_gemini.sh	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
monitor_local.sh	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
monitor_structure.sh	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
QUICKSTART.md	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
README.md	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
reddit_scraper.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
RELATORIO_ESTRUTURACAO_GEMINI.md	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
requirements.txt	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
scraper.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
sites_config.json	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
status_report.md	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
structure_content_ctf.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
structure_content_local.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
structure_content_test.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
test_extraction_fixed.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
test_gemini_response.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
test_improved_parser.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
test_single_file.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00
validate_setup.py	init: miscellaneous scripts (crawlers, converters, scrapers)	2026-03-05 20:38:36 +00:00

README.md

🕷️ Advanced Web Scraper

A complete web scraping system for complex sites, forums, and Reddit.

Author: Descomplicar® Crescimento Digital Link: https://descomplicar.pt Copyright: 2025 Descomplicar®

✨ FEATURES

Core

✅ Scraping with Playwright (supports JavaScript)
✅ HTML → Markdown conversion
✅ Automatic content cleanup
✅ Optional AI formatting (OpenRouter)

Advanced

✅ Official Reddit API (without violating the TOS)
✅ Batch processing (multiple sites)
✅ User-agent rotation
✅ Proxy support
✅ Smart rate limiting
✅ Retry logic with exponential backoff
✅ Full logging
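
To illustrate the "retry logic with exponential backoff" item, here is a minimal sketch of the general technique. It is not taken from scraper.py: the function name `retry_with_backoff` and its parameters (`max_attempts`, `base_delay`, `max_delay`) are hypothetical, and the real implementation may differ.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (capped), then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

With the defaults, the waits between attempts are roughly 1s, 2s, and 4s. The `sleep` argument exists only so the delay can be replaced in tests.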

Supported Site Types

🌐 WordPress sites
💬 Forums (vBulletin, phpBB, etc.)
🛒 E-commerce (resources/blog only)
📰 News sites
📖 Technical documentation

🔧 REQUIREMENTS

System

Python 3.8+
2 GB RAM minimum
5 GB free disk space (for output)

APIs (optional)

OpenRouter API (for AI formatting)
Reddit API (for Reddit scraping)

📦 INSTALLATION

1. Clone/Download

cd /media/ealmeida/Dados/Dev/Scripts/scraper/

2. Create a Virtual Environment

python3 -m venv .venv
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate  # Windows

3. Install Dependencies

pip install -r requirements.txt

4. Install Playwright Browsers

python -m playwright install chromium

5. Configure the Environment

cp .env.example .env
nano .env  # Edit with your credentials

⚙️ CONFIGURATION

1. The `.env` File

# API Keys
OPENROUTER_API_KEY=sk-or-v1-your-key-here

# Reddit API (get credentials at https://reddit.com/prefs/apps)
REDDIT_CLIENT_ID=your-client-id
REDDIT_CLIENT_SECRET=your-client-secret
REDDIT_USER_AGENT=ScraperBot/1.0 by YourUsername

# Proxy (optional)
PROXY_USER=username
PROXY_PASS=password
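
As a rough illustration of how these settings can be read from Python, the sketch below parses a `.env` file using only the standard library. The helpers `load_env_file` and `require` are hypothetical names, not part of this repository; the scripts themselves may load the file differently (for example through a dotenv package).

```python
def load_env_file(path=".env"):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            # Split on the first "=" only, so values may contain "=" or spaces.
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require(values, *keys):
    """Return the requested settings, failing early and naming what is missing."""
    missing = [key for key in keys if not values.get(key)]
    if missing:
        raise KeyError("missing settings in .env: " + ", ".join(missing))
    return {key: values[key] for key in keys}
```

For example, `require(load_env_file(), "REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")` would fail immediately if either Reddit credential were absent, rather than partway through a run.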

2. Configure Sites

Edit sites_config.json:

{
  "sites": [
    {
      "name": "My Site",
      "url": "https://exemplo.com",
      "type": "wordpress",
      "max_depth": 2,
      "notes": "Optional description"
    }
  ],
  "reddit_subreddits": ["subreddit1", "subreddit2"]
}

Available types:

wordpress - WordPress sites
forum - Forums (automatically limited to depth=1)
ecommerce - E-commerce (blog/resources only)
website - Generic sites
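
The sketch below shows one way a script could read this config, filter by type, and apply the depth=1 cap for forums. It is an assumption about the behaviour, not the code in batch_scraper.py: `select_sites`, `KNOWN_TYPES`, and the fallback values for a missing `type` or `max_depth` are all hypothetical.

```python
import json

KNOWN_TYPES = {"wordpress", "forum", "ecommerce", "website"}


def select_sites(config, types=None):
    """Return the site entries whose type is in `types` (all sites when None)."""
    selected = []
    for site in config.get("sites", []):
        site_type = site.get("type", "website")  # assumed default
        if site_type not in KNOWN_TYPES:
            raise ValueError(f"unknown site type: {site_type!r}")
        if types is not None and site_type not in types:
            continue
        entry = dict(site)  # copy so the loaded config is not mutated
        depth = entry.get("max_depth", 1)  # assumed default
        # Forums are capped at depth 1, as the type list states.
        entry["max_depth"] = min(depth, 1) if site_type == "forum" else depth
        selected.append(entry)
    return selected


config = json.loads("""
{
  "sites": [
    {"name": "Blog", "url": "https://exemplo.com", "type": "wordpress", "max_depth": 2},
    {"name": "Board", "url": "https://exemplo.com/forum", "type": "forum", "max_depth": 3}
  ]
}
""")

forums = select_sites(config, types={"forum"})
```

Here `forums` holds only the "Board" entry, with its `max_depth` reduced from 3 to 1.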

🚀 BASIC USAGE

Option 1: Batch Scraper (Recommended)

# Process ALL sites in the config
python batch_scraper.py --all

# WordPress only
python batch_scraper.py --types wordpress

# Forums only
python batch_scraper.py --types forum

# Multiple types
python batch_scraper.py --types wordpress forum

# Include Reddit
python batch_scraper.py --all --include-reddit

# Reddit only
python batch_scraper.py --reddit-only

Option 2: Single-Site Scraper

# Edit scraper.py (line 489)
urls = ["https://meusite.com"]

# Run
python scraper.py

Option 3: Reddit Only

python reddit_scraper.py

🎯 ADVANCED USAGE

Full Pipeline (3 Phases)

Phase 1: Extraction

python batch_scraper.py --all

Output: output_md/*.md (raw)

Phase 2: Cleanup

python clean_md.py output_md/ output_cleaned/

Output: output_cleaned/*.md (cleaned)

Phase 3: AI Formatting (optional)

python format_content.py

Output: formatted/*.md (professionally formatted)
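
The three phases can also be chained from a small driver script. The sketch below is a hypothetical wrapper (the names `PIPELINE` and `run_pipeline` do not exist in the repository); it only reuses the three commands shown above and assumes it is run from the project directory.

```python
import subprocess
import sys

# The three commands from the pipeline, in order.
PIPELINE = [
    ["batch_scraper.py", "--all"],                     # phase 1 -> output_md/
    ["clean_md.py", "output_md/", "output_cleaned/"],  # phase 2 -> output_cleaned/
    ["format_content.py"],                             # phase 3 -> formatted/
]


def run_pipeline(steps=PIPELINE, skip_formatting=False, runner=subprocess.run):
    """Run each phase in order; check=True stops at the first failing phase."""
    if skip_formatting:
        steps = steps[:-1]  # phase 3 is optional
    for args in steps:
        # sys.executable keeps every phase inside the active virtual environment.
        runner([sys.executable, *args], check=True)
```

Calling `run_pipeline(skip_formatting=True)` would run extraction and cleanup only, which avoids the API cost of the formatting phase.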

Custom Config

# Use an alternative config
python batch_scraper.py --config meu_config.json --all

Advanced Filters

Edit sites_config.json:

{
  "sites": [
    {
      "name": "Complex Site",
      "url": "https://exemplo.com",
      "type": "forum",
      "max_depth": 1,
      "excluded_patterns": [
        "/admin/",
        "/private/",
        "/login/"
      ],
      "notes": "Forum with many pages"
    }
  ]
}
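
How `excluded_patterns` is matched is not documented here. The sketch below assumes plain substring matching against the URL path, which is only one plausible reading (the scraper could equally use regular expressions or globs); `is_excluded` is a hypothetical helper.

```python
from urllib.parse import urlparse


def is_excluded(url, excluded_patterns):
    """True when the URL path contains any of the excluded substrings."""
    path = urlparse(url).path
    if not path.endswith("/"):
        path += "/"  # so "/login" also matches the "/login/" pattern
    return any(pattern in path for pattern in excluded_patterns)


patterns = ["/admin/", "/private/", "/login/"]
to_visit = [
    url
    for url in (
        "https://exemplo.com/forum/thread-1",
        "https://exemplo.com/admin/users",
        "https://exemplo.com/login",
    )
    if not is_excluded(url, patterns)
]
```

Under this assumption only the forum thread URL remains in `to_visit`.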

📁 FILE STRUCTURE

scraper/
├── scraper.py              # Main scraper (Playwright)
├── batch_scraper.py        # Batch processor
├── reddit_scraper.py       # Reddit API scraper
├── clean_md.py             # Markdown cleanup
├── format_content.py       # AI formatting
├── sites_config.json       # Site configuration
├── requirements.txt        # Python dependencies
├── .env                    # Credentials (do NOT commit)
├── .env.example            # Credentials template
├── .gitignore             # Git exclusions
├── README.md              # This documentation
│
├── output_md/             # Phase 1 output (raw)
├── output_cleaned/        # Phase 2 output (cleaned)
├── formatted/             # Phase 3 output (formatted)
└── logs/                  # Execution logs

🔍 TROUBLESHOOTING

Error: "API key not found"

# Check that .env exists
ls -la .env

# Check its contents
cat .env

# If it does not exist, create it
cp .env.example .env
nano .env

Error: "playwright not installed"

python -m playwright install chromium

Error: "Timeout" while scraping

# Edit scraper.py line 475
request_timeout=120  # Increase to 120s

Site blocked (403/429)

# Add a proxy in .env
PROXY_USER=username
PROXY_PASS=password

# Or increase politeness_delay
politeness_delay=(5, 10)  # 5-10s between requests
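
For context, a `(min, max)` politeness delay is typically applied as a random pause before each request. The sketch below shows that idea only; `polite_sleep` is a hypothetical helper, and how scraper.py actually consumes `politeness_delay` is not shown in this README.

```python
import random
import time


def polite_sleep(politeness_delay=(5, 10), sleep=time.sleep):
    """Pause for a random duration within the (min, max) range, in seconds."""
    low, high = politeness_delay
    wait = random.uniform(low, high)
    sleep(wait)
    return wait
```

Randomising the pause, instead of sleeping a fixed interval, makes the request pattern less regular, which can help with sites that answer 429 to evenly spaced traffic.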

Reddit: "Invalid credentials"

# Create a Reddit app:
# 1. Go to https://reddit.com/prefs/apps
# 2. Click "create app"
# 3. Type: "script"
# 4. Redirect URI: http://localhost:8080
# 5. Copy CLIENT_ID and CLIENT_SECRET into .env

Logs do not appear

# Check permissions
ls -la *.log

# Run with verbose output captured
python batch_scraper.py --all 2>&1 | tee execution.log

⚠️ LIMITATIONS

Does Not Work With

❌ Sites behind aggressive Cloudflare protection (Challenge)
❌ Sites that require login
❌ React/Vue SPAs without SSR (no initial HTML)
❌ Sites with CAPTCHA

Scale Limitations

Memory: Loads files into RAM (a problem for files >100MB)
Disk: May generate thousands of files
API costs: AI formatting can be expensive at high volume

Rate Limits

Playwright: ~10-20 sites/hour (complex sites)
Reddit API: 60 requests/minute (free tier)
OpenRouter: Depends on the plan
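
A limit such as 60 requests/minute can be respected on the client side with a sliding-window limiter. The class below is a generic sketch, not the mechanism used by reddit_scraper.py (whose Reddit client may already throttle itself); `RateLimiter` and its parameters are hypothetical.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls within any `period`-second window."""

    def __init__(self, max_calls=60, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)
```

Calling `limiter.wait()` before every API request keeps the client under the configured limit without hard-coding a fixed delay between calls.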

📊 ESTIMATED PERFORMANCE

Site type	Pages/hour	Average time/page
Simple WordPress	100-200	30-60s
Forum	50-100	60-90s
E-commerce	20-50	90-120s
Reddit (API)	1000+	<1s
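
These throughput ranges can be turned into a rough run-time estimate before starting a large job. The sketch below only does the arithmetic on the pages/hour column; `estimate_hours` and the dictionary keys are hypothetical, and the figures are the estimates from the table, not measurements.

```python
# Pages/hour ranges copied from the table (min, max).
PAGES_PER_HOUR = {
    "wordpress": (100, 200),
    "forum": (50, 100),
    "ecommerce": (20, 50),
}


def estimate_hours(site_type, pages):
    """Return (best_case_hours, worst_case_hours) for scraping `pages` pages."""
    slowest, fastest = PAGES_PER_HOUR[site_type]
    return pages / fastest, pages / slowest


print(estimate_hours("forum", 1000))  # (10.0, 20.0)
```

So a forum with about 1,000 pages would take roughly 10 to 20 hours at the estimated rates.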

🔐 SECURITY & ETHICS

Best Practices

✅ Respect robots.txt
✅ Rate limiting (2-5s between requests)
✅ Identifiable user-agent
✅ Do not overload servers
✅ Use official APIs when available (Reddit)
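
Respecting robots.txt can be checked with Python's standard `urllib.robotparser`. The sketch below parses robots.txt text that has already been downloaded; `build_robots_checker` and the `ScraperBot` user-agent token are illustrative assumptions, not code from this repository.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="ScraperBot"):
    """Return a function telling whether `user_agent` may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


robots_txt = """\
User-agent: *
Disallow: /private/
"""
allowed = build_robots_checker(robots_txt)
```

With these rules, `allowed("https://exemplo.com/blog/post")` is true and `allowed("https://exemplo.com/private/data")` is false. To fetch the file directly, `RobotFileParser` also offers `set_url()` followed by `read()`.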

Do Not

❌ Scrape aggressively (>1 req/s)
❌ Ignore rate limits
❌ Scrape content protected by login
❌ Redistribute content without permission

📈 ROADMAP

v2.0 (Upcoming improvements)

Support for more APIs (Twitter, HackerNews)
Database storage (SQLite/PostgreSQL)
Web dashboard (Flask/FastAPI)
Docker support
Scheduled scraping (cron)
Automatic change detection

📞 SUPPORT

Issues/Bugs: Open an issue in the repository
Questions: contacto@descomplicar.pt
Website: https://descomplicar.pt

📄 LICENSE

Last updated: 2025-11-05
Version: 2.0
