Quickstart
Get up and running with climatextract in just a few minutes.
Basic Extraction
The simplest way to extract emissions data from a PDF:
from climatextract import extract
# Extract from a single PDF
result_path = extract("./data/pdfs/company_2023_report.pdf")
print(f"Results saved to: {result_path}")
This will:
- Embed the PDF pages into a vector database
- Search for pages relevant to CO₂ emissions
- Use an LLM to extract Scope 1, 2, and 3 data
- Save results to the output/ directory
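To see what the run produced, you can list the files in the returned location. A minimal sketch, assuming result_path points at the run directory under output/ (see Output Structure below):
from pathlib import Path

# result_path from the extract() call above; assumed to point at the run directory
for f in sorted(Path(result_path).iterdir()):
    print(f.name)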
Many corporate annual and sustainability reports are available for download online, for example from the repository at https://www.srnav.com/reports.
Extract from Multiple PDFs
Process an entire directory of reports:
from climatextract import extract
# Process all PDFs in a directory
result_path = extract("./data/pdfs/")
Or provide a specific list:
from climatextract import extract
files = [
"./data/pdfs/report1.pdf",
"./data/pdfs/report2.pdf",
]
result_path = extract(files)
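You can also build the list programmatically; a plain-Python sketch using pathlib (only the extract() call is climatextract, the rest is standard library):
from pathlib import Path

from climatextract import extract

# Collect every PDF in the folder, sorted for a reproducible order
files = sorted(str(p) for p in Path("./data/pdfs/").glob("*.pdf"))
result_path = extract(files)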
Extract and Evaluate
If you have a gold standard dataset, validate your results:
from climatextract import extract_and_evaluate
result_path = extract_and_evaluate(
pdf_input="./data/pdfs/",
gold_standard_path="./data/evaluation_dataset/gold_standard.csv"
)
Or check out our gold-standard dataset: Paper, Data
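After an evaluation run, the metrics are stored alongside the run parameters in config_and_metrics.json (see Output Structure below). A minimal way to inspect it, assuming the returned path is the run directory:
import json
from pathlib import Path

# result_path from extract_and_evaluate() above; assumed to be the run directory
with open(Path(result_path) / "config_and_metrics.json") as f:
    run_info = json.load(f)

print(json.dumps(run_info, indent=2))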
Using a Configuration File
For more control, create a climatextract.toml configuration file:
from climatextract import extract
# Uses settings from climatextract.toml
result_path = extract(config_path="climatextract.toml")
See Configuration for all available options.
Working Directory Structure
In our projects, the working directory is typically organized as follows. This structure is used throughout this documentation.
<your-working-directory>/
├── data/
│   ├── evaluation_dataset/   # (optional) Benchmark datasets for evaluation
│   ├── pdfs/                 # PDF files (annual reports / sustainability reports). Move your files here.
│   ├── processed/            # (automatically generated) PDF files after initial processing
│   ├── embeddings/           # (automatically generated) .duckdb databases with embeddings and raw text for each PDF page
│   └── tables/               # (automatically generated, only used if input_mode = "text+table") Tables from the PDFs, converted to .csv with Docling
├── output/                   # (automatically generated) Each time you run extract(), a new subfolder with results is created here
│   ├── abc123-uuid/
│   └── def456-uuid/
├── climatextract.toml        # ClimXtract configuration file to override defaults
└── .env                      # Secrets specific to your local environment. Never share this file!
We recommend that you move your sustainability reports to the ./data/pdfs/ folder (or change the default names in the configuration file).
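If you are starting from an empty working directory, you can create the folders you are responsible for up front; a small standard-library sketch using the default names shown above (the remaining folders are generated automatically):
from pathlib import Path

# Only the input folders need to exist; processed/, embeddings/, output/ etc. are generated automatically
for folder in ("data/pdfs", "data/evaluation_dataset"):
    Path(folder).mkdir(parents=True, exist_ok=True)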
Teamwork: Populate Data Folders Automatically
If you work in a team, it can be tedious to ensure everyone has access to the same PDF files and embedding databases. Our solution:
- One superuser stores all the PDF files (and embedding databases) online in an Azure datalake.
- Everyone with access downloads the files from the datalake on demand. There is no need to copy PDF files around by hand, and everyone works against the same database.
Add the datalake connection settings to your .env file; see Sharing Large Files for details.
Output Structure
After extraction, you'll find results in output/<run-id>/:
output/
└── abc123-uuid/
    ├── raw_results.csv           # Page-level LLM responses (with duplicates)
    ├── raw_results_temp.csv      # Pre-normalization extraction results
    ├── results_long_format.csv   # Main results in long format (with duplicates)
    ├── results_wide_format.csv   # Results pivoted by year
    └── config.json               # Parameters, metrics, run info
                                  # (config_and_metrics.json when using extract_and_evaluate)
See Understanding Output for column definitions.
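To work with the extracted values in Python, you can load the main CSV directly; a minimal pandas sketch that picks the most recently created run folder (file names as shown above):
from pathlib import Path

import pandas as pd

# Most recently created run folder under output/
run_dirs = [p for p in Path("output").iterdir() if p.is_dir()]
run_dir = max(run_dirs, key=lambda p: p.stat().st_mtime)

# Main results in long format (see the tree above)
results = pd.read_csv(run_dir / "results_long_format.csv")
print(results.head())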
Next Steps
- Configuration – Customize extraction behavior
- Architecture – Understand how the pipeline works