---
name: pdf
description: Complete processing of PDF files, including reading, text/table extraction, merge, split, watermarks, encryption, OCR, creation, and form filling.
---

# PDF Processing Guide

## Overview

A complete guide to processing PDFs with Python libraries and command-line tools. For PDF forms, follow the instructions in the "Form filling" section. For advanced features and JavaScript libraries, see the "Advanced reference" section.

## Quick Start

```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("/media/ealmeida/Dados/GDrive/Cloud/Descomplicar/documento.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
```

## Python libraries

### pypdf: basic operations

#### Merge PDFs

```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

#### Split PDF

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
# Write each page to its own single-page file
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```

#### Extract Metadata

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
```

#### Rotate Pages

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```

### pdfplumber: text and table extraction

#### Extract Text with Layout

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```

#### Extract Tables

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```

#### Advanced Table Extraction

```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                # First row is treated as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```

### reportlab: PDF creation

#### Basic PDF Creation

```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
```

#### Create PDF with Multiple Pages

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build PDF
doc.build(story)
```

#### Subscripts and Superscripts

**Important**: never use Unicode subscript/superscript characters (subscript: ₀-₉, superscript: ⁰-⁹) in ReportLab PDFs. The built-in fonts do not include these glyphs, so they render as black boxes.

Use ReportLab's XML tags inside Paragraph objects instead:

```python
from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet

styles = getSampleStyleSheet()

# Subscripts: use the <sub> tag
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])

# Superscripts: use the <super> tag
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])
```

For text drawn on the canvas (not Paragraph objects), adjust the font size and position manually.

## Command-line tools

### pdftotext (poppler-utils)

```bash
# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```

### qpdf

```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```

### pdftk (if available)

```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
```

## Common tasks

### Extract text from scanned PDFs (OCR)

```python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```

### Add a watermark

```python
from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```

### Extract images

```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# With -j, JPEG images are written as output_prefix-000.jpg, output_prefix-001.jpg, etc.;
# images in other formats are written as PPM/PBM files.
```

### Password protection

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```

## Quick reference

| Task | Best tool | Command/code |
|------|-----------|--------------|
| Merge PDFs | pypdf | `writer.add_page(page)` |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | `page.extract_text()` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| Create PDFs | reportlab | Canvas or Platypus |
| Merge (CLI) | qpdf | `qpdf --empty --pages ...` |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill forms | pypdf or annotations | See the "Form filling" section |

---

## Form filling

**Required: follow these steps in order. Do not jump straight to writing code.**

First check whether the PDF has fillable fields. Run this script from the skill's scripts folder:

`python scripts/check_fillable_fields.py`

Depending on the result, follow the "Fillable fields" or "Non-fillable fields" section.

### Fillable fields

If the PDF has native form fields:

1. Extract the field information:

   `python scripts/extract_form_field_info.py`

   The resulting JSON contains fields with this structure:

   ```json
   [
     {
       "field_id": "(unique field ID)",
       "page": "(page number, 1-based)",
       "rect": "[left, bottom, right, top]",
       "type": "text | checkbox | radio_group | choice"
     }
   ]
   ```

   - For **checkboxes**: `checked_value` and `unchecked_value` properties.
   - For **radio groups**: a `radio_options` list with `value` and `rect`.
   - For **choice fields**: a `choice_options` list with `value` and `text`.

2. Convert the PDF to images for visual analysis:

   `python scripts/convert_pdf_to_images.py`

   Inspect the images to determine the purpose of each field.

3. Create `field_values.json`:

   ```json
   [
     {
       "field_id": "last_name",
       "description": "User's last name",
       "page": 1,
       "value": "Silva"
     },
     {
       "field_id": "Checkbox12",
       "description": "Checkbox for users over 18",
       "page": 1,
       "value": "/On"
     }
   ]
   ```

4. Fill the form:

   `python scripts/fill_fillable_fields.py`

### Non-fillable fields

If the PDF has no native fields, use text annotations. Try structure-based extraction first (more accurate), then fall back to visual estimation.

#### Step 1: structure-based extraction

`python scripts/extract_form_structure.py form_structure.json`

This extracts text labels, horizontal lines, and checkboxes with exact coordinates.

- **If `form_structure.json` contains meaningful labels**: use Approach A (structure).
- **If the PDF is scanned or image-based**: use Approach B (visual).

#### Approach A: structure-based coordinates (preferred)

Analyze `form_structure.json` and identify:

- **Label groups**: adjacent text elements that together form a label
- **Row structure**: labels with a similar `top` value are on the same row
- **Field columns**: entry areas start after the label (x0 = label.x1 + gap)
- **Checkboxes**: use the coordinates directly from the JSON

Create `fields.json` with `pdf_width`/`pdf_height`:

```json
{
  "pages": [
    {"page_number": 1, "pdf_width": 612, "pdf_height": 792}
  ],
  "form_fields": [
    {
      "page_number": 1,
      "description": "Last name field",
      "field_label": "Last name",
      "label_bounding_box": [43, 63, 87, 73],
      "entry_bounding_box": [92, 63, 260, 79],
      "entry_text": {"text": "Silva", "font_size": 10}
    }
  ]
}
```

#### Approach B: visual estimation (fallback)

1. Convert the PDF to images: `python scripts/convert_pdf_to_images.py`
2. Identify the fields and their approximate positions in the images.
3. Refine by zooming in (ImageMagick):

   ```bash
   magick <input_image> -crop <width>x<height>+<x>+<y> +repage <output_image>
   ```

   Convert the crop coordinates back to full-image coordinates:

   - full_x = crop_x + crop_offset_x
   - full_y = crop_y + crop_offset_y

4. Create `fields.json` with `image_width`/`image_height`.

#### Hybrid approach

When structure-based extraction works for most fields but misses some:

1. Use Approach A for the detected fields.
2. Use Approach B for the missing fields.
3. Convert all coordinates to PDF coordinates:
   - pdf_x = image_x * (pdf_width / image_width)
   - pdf_y = image_y * (pdf_height / image_height)
4. Use a single coordinate system with `pdf_width`/`pdf_height`.

#### Validation and filling

Validate the bounding boxes before filling:

`python scripts/check_bounding_boxes.py fields.json`

Fill the form:

`python scripts/fill_pdf_form_with_annotations.py fields.json`

Check the result:

`python scripts/convert_pdf_to_images.py`

Create a validation image with the bounding boxes overlaid:

`python scripts/create_validation_image.py`

---

## Advanced reference

### pypdfium2: fast rendering

```python
import pypdfium2 as pdfium

# Load PDF
pdf = pdfium.PdfDocument("document.pdf")

# Render page to image (to_pil() requires Pillow)
page = pdf[0]
bitmap = page.render(scale=2.0, rotation=0)
img = bitmap.to_pil()
img.save("page_1.png", "PNG")

# Process multiple pages
for i, page in enumerate(pdf):
    bitmap = page.render(scale=1.5)
    img = bitmap.to_pil()
    img.save(f"page_{i+1}.jpg", "JPEG", quality=90)
```

### pdfplumber: advanced features

#### Text with precise coordinates

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Extract all text with coordinates
    chars = page.chars
    for char in chars[:10]:
        print(f"Char: '{char['text']}' at x:{char['x0']:.1f} y:{char['y0']:.1f}")

    # Extract text by bounding box (left, top, right, bottom)
    bbox_text = page.within_bbox((100, 100, 400, 200)).extract_text()
```

#### Complex tables with custom settings

```python
import pdfplumber
import pandas as pd

with pdfplumber.open("complex_table.pdf") as pdf:
    page = pdf.pages[0]

    table_settings = {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "snap_tolerance": 3,
        "intersection_tolerance": 15
    }
    tables = page.extract_tables(table_settings)

    # Visual debugging for table extraction
    img = page.to_image(resolution=150)
    img.save("debug_layout.png")
```

### reportlab: professional reports with tables

```python
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors

data = [
    ['Product', 'Q1', 'Q2', 'Q3', 'Q4'],
    ['Widgets', '120', '135', '142', '158'],
    ['Gadgets', '85', '92', '98', '105']
]

doc = SimpleDocTemplate("report.pdf")
elements = []

styles = getSampleStyleSheet()
title = Paragraph("Quarterly Sales Report", styles['Title'])
elements.append(title)

table = Table(data)
table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
    ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
    ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
    ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
    ('FONTSIZE', (0, 0), (-1, 0), 14),
    ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
    ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
    ('GRID', (0, 0), (-1, -1), 1, colors.black)
]))
elements.append(table)

doc.build(elements)
```

### JavaScript: pdf-lib (creation and modification)

#### Load and Manipulate Existing PDF

```javascript
import { PDFDocument } from 'pdf-lib';
import fs from 'fs';

async function manipulatePDF() {
  const existingPdfBytes = fs.readFileSync('input.pdf');
  const pdfDoc = await PDFDocument.load(existingPdfBytes);

  const pageCount = pdfDoc.getPageCount();

  // Append a new 600x400 page and draw text on it
  const newPage = pdfDoc.addPage([600, 400]);
  newPage.drawText('Added by pdf-lib', { x: 100, y: 300, size: 16 });

  const pdfBytes = await pdfDoc.save();
  fs.writeFileSync('modified.pdf', pdfBytes);
}
```

#### Advanced Merge and Split

```javascript
import { PDFDocument } from 'pdf-lib';
import fs from 'fs';

async function mergePDFs() {
  const mergedPdf = await PDFDocument.create();

  const pdf1 = await PDFDocument.load(fs.readFileSync('doc1.pdf'));
  const pdf2 = await PDFDocument.load(fs.readFileSync('doc2.pdf'));

  // Copy every page of the first document
  const pdf1Pages = await mergedPdf.copyPages(pdf1, pdf1.getPageIndices());
  pdf1Pages.forEach(page => mergedPdf.addPage(page));

  // Copy selected pages (zero-based indices) of the second document
  const pdf2Pages = await mergedPdf.copyPages(pdf2, [0, 2, 4]);
  pdf2Pages.forEach(page => mergedPdf.addPage(page));

  fs.writeFileSync('merged.pdf', await mergedPdf.save());
}
```

### Advanced CLI operations

#### poppler-utils

```bash
# Text with bounding box coordinates
pdftotext -bbox-layout document.pdf output.xml

# High-resolution PNG conversion
pdftoppm -png -r 300 document.pdf output_prefix

# Specific page range with high resolution
pdftoppm -png -r 600 -f 1 -l 3 document.pdf high_res_pages

# Extract all embedded images with metadata
pdfimages -j -p document.pdf page_images

# List image info without extracting
pdfimages -list document.pdf
```

#### Advanced qpdf

```bash
# Split PDF into groups of pages (%d is replaced with the page range)
qpdf --split-pages=3 input.pdf output_group_%d.pdf

# Complex page ranges from multiple PDFs
qpdf --empty --pages doc1.pdf 1-3 doc2.pdf 5-7 doc3.pdf 2,4 -- combined.pdf

# Optimize for web (linearize)
qpdf --linearize input.pdf optimized.pdf

# Check and repair a damaged PDF (qpdf attempts recovery when reading)
qpdf --check input.pdf
qpdf damaged.pdf repaired.pdf

# Advanced encryption with permissions
qpdf --encrypt user_pass owner_pass 256 --print=none --modify=none -- input.pdf encrypted.pdf

# Check encryption status
qpdf --show-encryption encrypted.pdf
```

### Batch processing with error handling

```python
import os
import glob
from pypdf import PdfReader, PdfWriter
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def batch_process_pdfs(input_dir, operation='merge'):
    # Sort so the processing (and merge) order is deterministic
    pdf_files = sorted(glob.glob(os.path.join(input_dir, "*.pdf")))

    if operation == 'merge':
        writer = PdfWriter()
        for pdf_file in pdf_files:
            try:
                reader = PdfReader(pdf_file)
                for page in reader.pages:
                    writer.add_page(page)
                logger.info(f"Processed: {pdf_file}")
            except Exception as e:
                logger.error(f"Failed to process {pdf_file}: {e}")
                continue

        with open("batch_merged.pdf", "wb") as output:
            writer.write(output)

    elif operation == 'extract_text':
        for pdf_file in pdf_files:
            try:
                reader = PdfReader(pdf_file)
                text = ""
                for page in reader.pages:
                    text += page.extract_text()

                output_file = pdf_file.replace('.pdf', '.txt')
                with open(output_file, 'w', encoding='utf-8') as f:
                    f.write(text)
                logger.info(f"Extracted text from: {pdf_file}")
            except Exception as e:
                logger.error(f"Failed to extract text from {pdf_file}: {e}")
                continue
```

### Advanced cropping

```python
from pypdf import PdfWriter, PdfReader

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Shrink the media box of the first page (coordinates in PDF points)
page = reader.pages[0]
page.mediabox.left = 50
page.mediabox.bottom = 50
page.mediabox.right = 550
page.mediabox.top = 750

writer.add_page(page)
with open("cropped.pdf", "wb") as output:
    writer.write(output)
```

### Memory management for large PDFs

```python
from pypdf import PdfReader, PdfWriter

def process_large_pdf(pdf_path, chunk_size=10):
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)

    # Write the document out in chunks of chunk_size pages
    for start_idx in range(0, total_pages, chunk_size):
        end_idx = min(start_idx + chunk_size, total_pages)
        writer = PdfWriter()

        for i in range(start_idx, end_idx):
            writer.add_page(reader.pages[i])

        with open(f"chunk_{start_idx//chunk_size}.pdf", "wb") as output:
            writer.write(output)
```

## Troubleshooting

### Encrypted PDFs

```python
from pypdf import PdfReader

try:
    reader = PdfReader("encrypted.pdf")
    if reader.is_encrypted:
        reader.decrypt("password")
except Exception as e:
    print(f"Failed to decrypt: {e}")
```

### Corrupted PDFs

```bash
# Report structural problems
qpdf --check corrupted.pdf

# Rewrite the file in place
qpdf --replace-input corrupted.pdf
```

### Text extraction failure (fallback to OCR)

```python
import pytesseract
from pdf2image import convert_from_path

def extract_text_with_ocr(pdf_path):
    # Render each page to an image, then run OCR on it
    images = convert_from_path(pdf_path)
    text = ""
    for i, image in enumerate(images):
        text += pytesseract.image_to_string(image)
    return text
```

## Performance tips

1. **Large PDFs**: use streaming instead of loading everything into memory; use `qpdf --split-pages` to split the file
2. **Text extraction**: `pdftotext -bbox-layout` is the fastest; use pdfplumber for tables
3. **Image extraction**: `pdfimages` is much faster than rendering pages
4. **Form filling**: pdf-lib is the better choice for preserving form structure
5. **Memory**: process pages one at a time with pypdfium2 for large documents

---

## Descomplicar integration

### Frequently used PDF paths

| Location | Path |
|----------|------|
| Company documents | `/media/ealmeida/Dados/GDrive/Cloud/Descomplicar/` |
| Proposals | `/media/ealmeida/Dados/Hub/03-Propostas/` |
| Client archive | `/media/ealmeida/Dados/GDrive/Arquivo_de_Clientes/` |
| Knowledge Base | `/media/ealmeida/Dados/Hub/06-Operacoes/Knowledge-Base/PDFs/` |
| Backups | `/media/ealmeida/Dados/GDrive/Backups/` |
| Temporary files | `~/.claude-work/` (clean up when done) |

### Relevant MCPs

- **mcp__filesystem__read_file** / **write_file**: read and write local PDFs
- **mcp__filesystem__search_files**: find PDFs on the system
- **mcp__google-workspace__drive_search_files**: find PDFs on Google Drive
- **mcp__google-workspace__drive_read_file_content**: read the content of files on Drive
- **mcp__google-workspace__drive_upload_file**: upload processed PDFs to Drive

### Typical Descomplicar workflow

1. Locate the PDF (filesystem or Google Drive)
2. Download it to `~/.claude-work/` if necessary
3. Process it (extract, merge, split, OCR, etc.)
4. Save the result to its final destination
5. Clean up temporary files in `~/.claude-work/`

---

## Library licenses

- **pypdf**: BSD | **pdfplumber**: MIT | **pypdfium2**: Apache/BSD | **reportlab**: BSD
- **poppler-utils**: GPL-2 | **qpdf**: Apache | **pdf-lib**: MIT | **pdfjs-dist**: Apache

---

**Version**: 1.0.0 | **Author**: Descomplicar®
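
---

## Appendix: coordinate conversion sketch

The crop-to-image and image-to-PDF formulas in the form-filling section are simple enough to get wrong by hand, so here is a minimal sketch of them in Python. The function names (`crop_to_full`, `image_to_pdf`, `image_box_to_pdf`) are hypothetical and not part of this skill's scripts. The sketch assumes boxes are `[x0, y0, x1, y1]` lists and applies only the scaling stated in those formulas, with no axis flip.

```python
"""Coordinate helpers for annotation-based form filling (illustrative only)."""


def crop_to_full(crop_x, crop_y, crop_offset_x, crop_offset_y):
    """Map a point measured inside a crop back to full-image coordinates."""
    return crop_x + crop_offset_x, crop_y + crop_offset_y


def image_to_pdf(image_x, image_y, image_size, pdf_size):
    """Scale a point from image pixels to PDF units (no axis flip applied)."""
    image_width, image_height = image_size
    pdf_width, pdf_height = pdf_size
    return (
        image_x * (pdf_width / image_width),
        image_y * (pdf_height / image_height),
    )


def image_box_to_pdf(box, image_size, pdf_size):
    """Scale an [x0, y0, x1, y1] box by converting its two corners."""
    x0, y0 = image_to_pdf(box[0], box[1], image_size, pdf_size)
    x1, y1 = image_to_pdf(box[2], box[3], image_size, pdf_size)
    return [x0, y0, x1, y1]


if __name__ == "__main__":
    # Assumed example: a 612x792 page rendered at twice its size.
    image_size = (1224, 1584)
    pdf_size = (612, 792)

    # A point found at (20, 30) inside a crop taken at offset (180, 270).
    full_x, full_y = crop_to_full(20, 30, 180, 270)
    print(image_to_pdf(full_x, full_y, image_size, pdf_size))

    # An entry box estimated on the image, converted for fields.json.
    print(image_box_to_pdf([184, 126, 520, 158], image_size, pdf_size))
```

Keeping the conversion in one helper means every field in `fields.json` ends up in the same coordinate system, which is what the hybrid approach requires before validation.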