Extracting Tables from PDF

Extracting Tables from PDF Documents

What is it?

I tried several methods for extracting tables from PDF files, and the camelot package turned out to be the most effective.
However, for documents with heavy design elements, Camelot sometimes fails to recognize the tables correctly.
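
One way to spot such misrecognized tables is to look at the `parsing_report` that Camelot attaches to each extracted table. The sketch below keeps only tables whose reported accuracy reaches a threshold. The helper name `filter_reliable_tables` and the default threshold of 80 are assumptions made for illustration, not part of Camelot or of the original script.

```python
def filter_reliable_tables(tables, min_accuracy=80.0):
    """Return the tables whose reported accuracy is at least min_accuracy.

    `tables` is any iterable of objects exposing a `parsing_report` dict,
    such as the TableList returned by camelot.read_pdf().
    The threshold is a hypothetical starting point; tune it per document.
    """
    reliable = []
    for table in tables:
        report = table.parsing_report  # keys include 'accuracy' and 'page'
        if report.get("accuracy", 0.0) >= min_accuracy:
            reliable.append(table)
    return reliable


# Possible usage (requires camelot and a real PDF file):
#   import camelot
#   tables = camelot.read_pdf("example.pdf", pages="all", flavor="lattice")
#   for table in filter_reliable_tables(tables):
#       print(table.parsing_report)
```

A high accuracy value does not guarantee that the table is correct, so it is still worth checking a few results by eye.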


Code sample

import camelot
import pandas as pd
from tkinter import Tk
from tkinter.filedialog import askopenfilename, asksaveasfilename

# Hide the main Tkinter window
root = Tk()
root.withdraw()

# 1. Select a PDF file
pdf_file = askopenfilename(
    title="Select PDF File",
    filetypes=[("PDF files", "*.pdf")]
)

if not pdf_file:
    # SystemExit works even where the interactive exit() helper is unavailable
    raise SystemExit("No PDF file selected.")

# 2. Extract tables from the PDF using Camelot
# flavor='lattice' suits tables with ruled lines; use flavor='stream' for tables without lines
tables = camelot.read_pdf(pdf_file, pages='all', flavor='lattice')

if tables.n == 0:
    raise SystemExit("No tables found in the PDF.")

print(f"Number of tables found: {tables.n}")

# 3. Select the location to save the Excel file
excel_file = asksaveasfilename(
    title="Save as Excel",
    defaultextension=".xlsx",
    filetypes=[("Excel files", "*.xlsx")]
)

if not excel_file:
    raise SystemExit("No Excel file selected.")

# 4. Save all extracted tables into an Excel file, each table in a separate sheet
with pd.ExcelWriter(excel_file, engine='openpyxl') as writer:
    for i, table in enumerate(tables):
        sheet_name = f"Table_{i+1}"  # Sheet name for each table
        table.df.to_excel(writer, index=False, sheet_name=sheet_name)

print(f"Excel file saved successfully: {excel_file}")
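
The script above always uses `flavor='lattice'`, so a PDF whose tables have no ruled lines yields no results. A minimal sketch of a fallback, assuming it is acceptable to retry with `flavor='stream'` when lattice finds nothing, could look like this. The helper name `read_tables_with_fallback` is hypothetical, and the reader function is passed in as a parameter so the helper can be tried without a PDF at hand.

```python
def read_tables_with_fallback(pdf_path, read_pdf, pages="all"):
    """Try the 'lattice' flavor first, then 'stream' if no table is found.

    `read_pdf` is expected to behave like camelot.read_pdf: it accepts
    `pages` and `flavor` keyword arguments and returns an object with an
    `n` attribute holding the number of tables found.
    Returns (tables, flavor_used); flavor_used is None if both flavors
    found nothing.
    """
    tables = None
    for flavor in ("lattice", "stream"):
        tables = read_pdf(pdf_path, pages=pages, flavor=flavor)
        if tables.n > 0:
            return tables, flavor
    return tables, None


# Possible usage (requires camelot and a real PDF file):
#   import camelot
#   tables, flavor = read_tables_with_fallback(pdf_file, camelot.read_pdf)
#   print(f"Found {tables.n} tables using flavor={flavor}")
```

The stream flavor guesses table boundaries from whitespace, so it may report text blocks that are not really tables. Results from the fallback deserve a closer look before saving them.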

 
Original post (Korean)