
climatextract

Extract CO₂ emissions data from corporate sustainability reports using AI.

climatextract is an information extraction pipeline that surfaces Scope 1, 2, and 3 emissions data from PDF sustainability reports. Built by the LMU SODA Lab in collaboration with the Data Service Centre of Deutsche Bundesbank, it combines research on ESG reporting and Intelligent Document Processing. Using semantic search and large language models, climatextract automates what was previously a tedious manual annotation process.


Key Features

  • 📄 PDF Processing – Automatically extract and embed text from sustainability reports
  • 🔍 Semantic Search – Find relevant pages using vector similarity
  • 🤖 LLM Extraction – Use GPT models to extract structured emissions data
  • 📊 Scope 1-3 Coverage – Extract direct and indirect emissions across all scopes
  • Evaluation – Compare results against gold-standard datasets
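
To make the "Semantic Search" feature more concrete, here is a minimal sketch of ranking pages by vector similarity. It is not climatextract's implementation: the function names `cosine_similarity` and `top_k_pages`, the toy three-dimensional vectors, and the page numbers are all made up for illustration. A real pipeline would obtain the vectors from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k_pages(query_vec, page_vecs, k=2):
    """Return the k (page, score) pairs most similar to the query, best first."""
    scored = [
        (page, cosine_similarity(query_vec, vec))
        for page, vec in page_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical page embeddings (page number -> vector); values are invented.
page_vecs = {
    12: [0.1, 0.9, 0.0],
    34: [0.9, 0.1, 0.2],
    57: [0.7, 0.0, 0.6],
}
query_vec = [1.0, 0.0, 0.2]  # stands in for an embedded query about emissions

ranked = top_k_pages(query_vec, page_vecs)
print([page for page, _ in ranked])  # [34, 57]
```

Only the top-ranked pages would then be passed on to the LLM extraction step, which keeps prompts short even for long reports.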

Quick Example

from climatextract import extract

# Extract emissions from a PDF report
result_path = extract("./data/pdfs/company_2023_report.pdf")
print(f"Results saved to: {result_path}")

The main output is a well-structured table saved in .csv format.

report_id                year  indicator  value_std  ...  unit_std  ...  page
company_2023_report.pdf  2015  scope 1    135.0      ...  t CO2e    ...  34
company_2023_report.pdf  2015  scope 2lb  41962.0    ...  t CO2e    ...  34
company_2023_report.pdf  2015  scope 2mb  37674.0    ...  t CO2e    ...  34
company_2023_report.pdf  2015  scope 3    1834.0     ...  t CO2e    ...  34
company_2023_report.pdf  2016  scope 1    170.0      ...  t CO2e    ...  34
...                      ...   ...        ...        ...  ...       ...  ...
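
The output table can be post-processed with standard tooling. The sketch below groups values by year and indicator using only the Python standard library. It assumes a comma-delimited file with a header row and uses only the columns visible in the excerpt above; the real file contains further columns, which are elided here. The helper name `emissions_by_year` and the inline `sample_csv` string are illustrative, not part of climatextract.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for the output file, restricted to the columns shown above.
# Assumption: the real file is comma-delimited and has more columns.
sample_csv = """report_id,year,indicator,value_std,unit_std,page
company_2023_report.pdf,2015,scope 1,135.0,t CO2e,34
company_2023_report.pdf,2015,scope 2lb,41962.0,t CO2e,34
company_2023_report.pdf,2015,scope 2mb,37674.0,t CO2e,34
company_2023_report.pdf,2015,scope 3,1834.0,t CO2e,34
company_2023_report.pdf,2016,scope 1,170.0,t CO2e,34
"""


def emissions_by_year(csv_file):
    """Map each year to a dict of {indicator: standardized value}."""
    table = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        table[int(row["year"])][row["indicator"]] = float(row["value_std"])
    return dict(table)


by_year = emissions_by_year(io.StringIO(sample_csv))
print(by_year[2015]["scope 1"])  # 135.0
print(sorted(by_year))           # [2015, 2016]
```

For a real run, the same helper could be given an open file handle instead, for example `open(result_path, newline="")`, provided the file layout matches the assumptions stated above.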

Getting Started

New to climatextract? Start here:


Documentation Overview

Section        Description
User Guide     Configuration, running extractions, understanding output
Concepts       Architecture, RAG pipeline, prompts, evaluation
API Reference  Public API functions
Research       Academic background, methodology, citation