PDFstract - An easy way to extract and chunk documents for RAG

As many of you know, I have been working in the AI and data space for a while now, and I have seen various bottlenecks and complexities in the data engineering and augmentation layer at both startups and large enterprise clients.

When you think about where to use AI or how to get value out of it, the first and most prominent idea is usually: why not build a centralized, AI-powered knowledge management system? It could be a simple RAG-based system or a complex knowledge graph.

But it all starts like this

Data Source identification
Data Augmentation
Data Transformation to Vectors/Graphs
Building Intelligent ChatBots / Consumers on top

While we cannot solve every problem and every source in the data engineering pipeline (at least not yet), I wanted to build an accelerator for at least one of those sources.

When I say source, I mean this: most enterprise knowledge is stored in the form of articles, journals, workbooks, SOPs and so on, and in specific domains like BFSI there are tons of documents such as claims, rulebooks and forms.

I thought, why not solve the data augmentation and AI-readiness problem for PDFs?

That's when I decided to build PDFstract.

PDFstract is a framework, available as a CLI, a Python module and a web UI, that unifies PDF ingestion for AI knowledge management and RAG systems.

Why PDFStract Exists

Modern RAG and LLM systems depend on clean document ingestion.

But PDF extraction is fragmented:

Some libraries work better for structured reports
Others perform better on scanned or OCR-heavy documents
Output formats vary widely
Chunking strategies significantly impact retrieval performance

Teams often waste hours testing combinations manually.

PDFStract provides:

A unified abstraction over multiple PDF extractors
Standardized output formats (markdown, json, text)
Built-in chunking strategies for RAG pipelines
Easy benchmarking and comparison between libraries

It becomes the standardized ingestion layer of your AI data pipeline.

Get started in two lines

from pdfstract import PDFStract

ps = PDFStract()

# Convert and chunk in a single step with auto mode
chunks = ps.convert_chunk("path/to/pdf", library="auto", chunker="auto")

# Or do it in two steps:
# 1. Convert with your library of choice
md_content = ps.convert("path/to/pdf", library="docling")

# 2. Chunk the converted text with your chunking technique of choice
chunks = ps.chunk(text=md_content, chunker="token", chunk_size=512, chunk_overlap=50)

What Makes PDFStract Different?

Instead of committing to a single PDF extraction library, PDFStract lets you:

Swap extractors with one parameter
Benchmark multiple libraries on the same document
Automate library selection
Standardize downstream processing
Keep your ingestion layer future-proof

As new extraction libraries emerge, PDFStract allows you to integrate them without rewriting your pipeline.

PDFStract decouples your ingestion layer from any single extraction library.
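
To make the decoupling idea concrete, here is a minimal sketch of the general pattern: a registry that maps a library name to an extraction function, so the caller only changes one string. This is not PDFStract's actual internals; `EXTRACTORS`, `register`, `convert` and the two stub extractors are hypothetical names used only for illustration.

```python
from typing import Callable, Dict

# Hypothetical registry: library name -> function that turns a PDF path into text.
EXTRACTORS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds an extractor function to the registry under `name`."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        EXTRACTORS[name] = fn
        return fn
    return decorator


@register("fast")
def fast_extractor(pdf_path: str) -> str:
    # Stand-in for a real text-layer extractor; returns fake markdown.
    return f"# fast output for {pdf_path}"


@register("ocr")
def ocr_extractor(pdf_path: str) -> str:
    # Stand-in for a real OCR-based extractor; returns fake markdown.
    return f"# ocr output for {pdf_path}"


def convert(pdf_path: str, library: str) -> str:
    """Dispatch to whichever extractor was registered under `library`."""
    try:
        extractor = EXTRACTORS[library]
    except KeyError:
        raise ValueError(
            f"Unknown library {library!r}; available: {sorted(EXTRACTORS)}"
        ) from None
    return extractor(pdf_path)


if __name__ == "__main__":
    # Swapping extractors is a one-parameter change for the caller.
    print(convert("report.pdf", library="fast"))
    print(convert("report.pdf", library="ocr"))
```

With this shape, adding a new extraction library means registering one more function, and the downstream code that calls `convert` stays untouched.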

Installation and Usage

Choose the install option based on the extraction libraries you need.

pip install pdfstract
pip install "pdfstract[standard]"
pip install "pdfstract[advanced]"
pip install "pdfstract[all]"

CLI Usage

# List available libraries
pdfstract libs

# List available chunkers
pdfstract chunkers

# Convert a single PDF
pdfstract convert document.pdf --library pymupdf4llm --output result.md

# Convert and chunk in one command
pdfstract convert-chunk document.pdf --library pymupdf4llm --chunker semantic --output chunks.json

# Chunk an existing text file
pdfstract chunk document.md --chunker token --chunk-size 512 --output chunks.json

# Compare multiple libraries on one PDF
pdfstract compare sample.pdf -l pymupdf4llm -l marker -l docling --output ./comparison

# Batch convert 100+ PDFs in parallel
pdfstract batch ./documents --library pymupdf4llm --output ./converted --parallel 4

# Download models for a specific library
pdfstract download marker
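
If you are wondering what a comparison like the `compare` command involves conceptually, the rough idea can be sketched in a few lines of Python: run each extractor on the same file, then record whether it succeeded, how much text it produced and how long it took. This is only an illustration under my own assumptions, not how PDFStract implements it; `compare_extractors` and the stub extractors are hypothetical.

```python
import time
from typing import Callable, Dict


def compare_extractors(
    pdf_path: str, extractors: Dict[str, Callable[[str], str]]
) -> Dict[str, dict]:
    """Run every extractor on the same PDF and collect simple metrics."""
    report: Dict[str, dict] = {}
    for name, extract in extractors.items():
        start = time.perf_counter()
        try:
            text = extract(pdf_path)
            entry = {"ok": True, "chars": len(text)}
        except Exception as exc:  # one failing library should not stop the run
            entry = {"ok": False, "error": str(exc)}
        entry["seconds"] = time.perf_counter() - start
        report[name] = entry
    return report


def stub_markdown(pdf_path: str) -> str:
    # Stand-in for a working extractor.
    return "# Title\n\nSome body text."


def stub_failing(pdf_path: str) -> str:
    # Stand-in for an extractor whose models are missing.
    raise RuntimeError("model not downloaded")


if __name__ == "__main__":
    stubs = {"good": stub_markdown, "bad": stub_failing}
    for name, entry in compare_extractors("sample.pdf", stubs).items():
        print(name, entry)
```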

Module Usage with Python

You don't need to use the CLI! PDFStract can be easily integrated into your Python applications as a library.

Convert a PDF (One-liner)

from pdfstract import convert_pdf

# Quick conversion with default settings
result = convert_pdf('sample.pdf', library='marker')
print(result)  # Markdown content

List Available Libraries

from pdfstract import PDFStract

pdfstract = PDFStract()

# Get list of available libraries
available = pdfstract.list_available_libraries()
print(available)  # ['pymupdf4llm', 'marker', 'docling', ...]

Structured Conversion

from pdfstract import PDFStract

pdfstract = PDFStract()

# Convert with options
result = pdfstract.convert(
    pdf_path='document.pdf',
    library='marker',
    output_format='markdown'  # or 'json', 'text'
)

Batch Processing Multiple PDFs

from pdfstract import PDFStract

pdfstract = PDFStract()

# Convert all PDFs in a directory in parallel
results = pdfstract.batch_convert(
    pdf_directory='./pdfs',
    library='pymupdf4llm',
    output_format='markdown',
    parallel_workers=4
)

print(f"✓ Success: {results['success']}")
print(f"✗ Failed: {results['failed']}")
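
For readers curious about what parallel batch conversion looks like in general, here is a minimal sketch using only the standard library. It assumes a hypothetical `convert_one` function and returns lists under `success` and `failed`; the real `batch_convert` may structure its result differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def batch_convert_sketch(
    pdf_paths: List[str],
    convert_one: Callable[[str], str],
    workers: int = 4,
) -> Dict[str, list]:
    """Convert many PDFs concurrently, collecting successes and failures."""
    results: Dict[str, list] = {"success": [], "failed": []}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the path it is converting.
        futures = {pool.submit(convert_one, path): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                results["success"].append(path)
            except Exception as exc:  # keep going if one document fails
                results["failed"].append((path, str(exc)))
    return results


def stub_convert(pdf_path: str) -> str:
    # Stand-in for a real conversion call; fails on one specific file.
    if pdf_path == "bad.pdf":
        raise ValueError("corrupt file")
    return f"# converted {pdf_path}"


if __name__ == "__main__":
    outcome = batch_convert_sketch(["a.pdf", "bad.pdf", "b.pdf"], stub_convert)
    print(f"Success: {sorted(outcome['success'])}")
    print(f"Failed: {outcome['failed']}")
```

Threads are a reasonable default when the work is I/O-bound; for CPU-heavy extractors a process pool may be the better fit.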

Async Conversion (for Web Apps)

import asyncio
from pdfstract import PDFStract

async def process_pdfs():
    pdfstract = PDFStract()
    result = await pdfstract.convert_async(
        'document.pdf',
        library='docling',
        output_format='json'
    )
    return result

# Use in FastAPI, asyncio, etc.
asyncio.run(process_pdfs())
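
In a web app you will often want to convert several documents at once without blocking the event loop. The sketch below shows the general asyncio pattern for that, using a hypothetical blocking function `blocking_convert` as a stand-in for a real conversion call. It assumes Python 3.9+ for `asyncio.to_thread` and says nothing about how `convert_async` works internally.

```python
import asyncio
import time
from typing import List


def blocking_convert(pdf_path: str) -> str:
    # Stand-in for a slow, synchronous conversion call.
    time.sleep(0.01)
    return f"converted:{pdf_path}"


async def convert_many(pdf_paths: List[str]) -> List[str]:
    """Run blocking conversions in worker threads and await them together."""
    tasks = [asyncio.to_thread(blocking_convert, path) for path in pdf_paths]
    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(convert_many(["a.pdf", "b.pdf", "c.pdf"])))
```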

Text Chunking for RAG Pipelines

from pdfstract import PDFStract

pdfstract = PDFStract()

# 1. Extract PDF
text = pdfstract.convert('document.pdf', library='docling')

# 2. Chunk the text
chunks = pdfstract.chunk(
    text=text,
    chunker='semantic',  # or 'token', 'sentence', 'code', etc.
    chunk_size=512
)

print(f"Created {chunks['total_chunks']} chunks")

# 3. Process chunks for embedding/indexing
for chunk in chunks['chunks']:
    print(f"- {chunk['text'][:50]}... ({chunk['token_count']} tokens)")
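
To build some intuition for what `chunk_size` and `chunk_overlap` mean, here is a small self-contained sketch of fixed-size chunking with overlap. It splits on whitespace instead of using a real tokenizer, and it only mimics the `total_chunks` / `chunks` shape shown above, so treat `chunk_by_tokens` as a hypothetical illustration and not as PDFStract's token chunker.

```python
from typing import Dict


def chunk_by_tokens(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> Dict:
    """Split text into windows of `chunk_size` whitespace tokens that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    tokens = text.split()  # crude stand-in for a real tokenizer
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({"text": " ".join(window), "token_count": len(window)})
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return {"total_chunks": len(chunks), "chunks": chunks}


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(10))
    result = chunk_by_tokens(sample, chunk_size=4, chunk_overlap=1)
    for chunk in result["chunks"]:
        print(chunk["text"], f"({chunk['token_count']} tokens)")
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is still retrievable from at least one chunk.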

Powerful Web UI

# Clone the repository
git clone https://github.com/aksarav/pdfstract.git
cd pdfstract

# Download models and start services (first time)
make up

# Or step by step:
make models   # Download HuggingFace/MinerU models (~10GB)
make build    # Build Docker images
make up       # Start services

What is next

As described earlier, PDFstract is an honest attempt to ease the data ingestion layer of your AI RAG and knowledge management pipelines. From here, I have planned the following feature additions to PDFstract, some of which are already in development:

Embedding layer
Graph Extraction Layer for Knowledge Graph
Built-in RAG support for intelligence, powered by LanceDB
UI updates, customisation and much more

Please feel free to share your thoughts and support, and star the repository if you find it helpful.

Until the next post.

Cheers

Sarav