# Running Extraction
This guide covers different ways to run climatextract on your PDF reports.
## Using the Python API

### Single PDF

```python
from climatextract import extract

result_path = extract("./data/pdfs/report1.pdf")
```

### Multiple PDFs

```python
from climatextract import extract

files = [
    "./data/pdfs/report1.pdf",
    "./data/pdfs/report2.pdf",
]
result_path = extract(files)
```
### Directory of PDFs

```python
from climatextract import extract

# Processes all .pdf files in the directory
result_path = extract("./data/pdfs/")
```
## Using a Configuration File

For reproducible runs, create a `climatextract.toml` configuration file:

```python
from climatextract import extract

# All settings are read from climatextract.toml
result_path = extract(config_path="climatextract.toml")
```
Override specific inputs while using the config defaults:

```python
from climatextract import extract

result_path = extract(
    pdf_input="./data/pdfs/new_report.pdf",
    config_path="climatextract.toml",
)
```
See Configuration for all available options.
## Extract with Evaluation

Compare results against a gold-standard dataset:

```python
from climatextract import extract_and_evaluate

result_path = extract_and_evaluate(
    pdf_input="./data/pdfs/",
    gold_standard_path="./data/evaluation_dataset/gold_standard.csv",
)
```
### Evaluation Output

Evaluation writes additional files to the output directory containing precision, recall, and F1 scores.
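For reference, the F1 score is the harmonic mean of precision and recall. A minimal sketch of the relationship (standard definitions, not climatextract code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A run that recovers 60% of the gold-standard values at 80% precision:
print(round(f1_score(0.8, 0.6), 3))  # 0.686
```

A high F1 therefore requires both scores to be high; a run with perfect precision but near-zero recall still scores near zero.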
## MLflow Tracking

Enable experiment tracking by setting `enable_mlflow=True`:

```python
from climatextract import extract

result_path = extract(
    pdf_input="./data/pdfs/",
    enable_mlflow=True,
)
```
When enabled, metrics and parameters are logged to your configured MLflow server. See MLflow Setup for configuration details.
## Using a Custom LLM or Embedding Provider

`extract()` and `extract_and_evaluate()` accept two keyword-only arguments, `llm` and `embedder`, that let you override the default Azure AI Foundry adapter. Pass any subclass of `LlmHandler` or `EmbeddingModelHandler`.
Example: pass an explicit Foundry handler (equivalent to the default):

```python
from climatextract import extract
from climatextract.adapters.azure_ai_foundry import (
    AzureAIFoundryEmbeddingHandler,
    AzureAIFoundryLlmHandler,
)

llm = AzureAIFoundryLlmHandler()
embedder = AzureAIFoundryEmbeddingHandler()

result_path = extract("./data/pdfs/", llm=llm, embedder=embedder)
```
To route to a different provider entirely (OpenAI direct, Anthropic, a local model, etc.), you can implement your own handler. See Custom Providers.
## Processing Tips

### Large Batches

When processing many PDFs, consider:

- Reducing `max_parallel_llm_prompts_running` to avoid rate limits
- Using a pre-built embeddings database to skip re-embedding
- Monitoring the `output/` directory for intermediate results
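For very large runs, one simple pattern is to split the file list into fixed-size chunks and call `extract()` once per chunk. The `batches` helper below is illustrative, not part of climatextract:

```python
from pathlib import Path


def batches(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


pdfs = sorted(str(p) for p in Path("./data/pdfs/").glob("*.pdf"))
for chunk in batches(pdfs, 10):
    # One extraction run per batch of 10 PDFs:
    # result_path = extract(chunk)
    pass
```

Each batch produces its own result path, so you can inspect intermediate output between runs.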
### Memory Usage

Large PDFs can consume significant memory during embedding. If you run into memory issues, process your files in smaller batches.
## Next Steps

- Understanding Output – What the output files contain
- MLflow Setup – Configure experiment tracking