Background

NOTE: Initial draft. This page has not been reviewed by the project team and may not be up to date.

climatextract is a research project developed at the LMU SODA Lab (Statistics and Data Science in Social Sciences and the Humanities) at Ludwig-Maximilians-Universität München, in collaboration with the Data Service Centre of Deutsche Bundesbank.

Origin

The project began as the team's submission to the ClimateNLP Workshop at ACL 2024. The workshop focuses on natural language processing (NLP) applied to climate-related text, including:

Climate policy analysis
Sustainability report processing
Environmental data extraction

Problem Statement

Corporate sustainability reports contain valuable emissions data, but extracting this information manually is:

Time-consuming – Reports can run to 100 pages or more
Inconsistent – Data formats vary across companies
Error-prone – Manual transcription introduces mistakes
Unscalable – Thousands of reports are published annually

climatextract automates this extraction using retrieval-augmented generation (RAG), which combines document search with large language models.

Research Goals

The project aims to:

Demonstrate feasibility of automated emissions extraction
Benchmark performance against human annotations
Explore RAG architectures for structured data extraction
Enable large-scale analysis of corporate emissions trends

Institutional Context

The project is part of the GIST (Green IT and Sustainability Tracking) research efforts at LMU Munich, which contribute to:

Academic research on climate disclosure analysis
Open-source tools for sustainability researchers
Methodological advances in document AI

climatextract builds on advances in:

Retrieval-Augmented Generation – Combining search with LLMs
Document Understanding – Extracting structured data from PDFs
Climate NLP – Applying NLP to environmental text

For technical details, see Architecture and Methodology.

Next Steps

Methodology – Scientific methodology and reproducibility
Citation – How to cite this work