Background
NOTE: Initial draft. This page has not been reviewed by the project team and may not be up to date.
climatextract is a research project developed at the LMU SODA Lab (Statistics and Data Science in Social Sciences and the Humanities) at Ludwig-Maximilians-Universität München, in collaboration with the Data Service Centre of Deutsche Bundesbank.
Origin
The project began as the team's submission to the ClimateNLP Workshop at ACL 2024. The workshop focuses on natural language processing (NLP) techniques applied to climate-related text, including:
- Climate policy analysis
- Sustainability report processing
- Environmental data extraction
Problem Statement
Corporate sustainability reports contain valuable emissions data, but extracting this information manually is:
- Time-consuming – Reports can exceed 100 pages
- Inconsistent – Data formats vary across companies
- Error-prone – Manual transcription introduces mistakes
- Unscalable – Thousands of reports are published annually
climatextract automates this extraction with NLP techniques, in particular retrieval-augmented generation (RAG).
Research Goals
The project aims to:
- Demonstrate feasibility of automated emissions extraction
- Benchmark performance against human annotations
- Explore retrieval-augmented generation (RAG) architectures for structured data extraction
- Enable large-scale analysis of corporate emissions trends
Institutional Context
The project is part of the GIST (Green IT and Sustainability Tracking) research efforts at LMU Munich, contributing to:
- Academic research on climate disclosure analysis
- Open-source tools for sustainability researchers
- Methodological advances in document AI
Related Work
climatextract builds on advances in:
- Retrieval-Augmented Generation – Combining document search with large language models (LLMs)
- Document Understanding – Extracting structured data from PDFs
- Climate NLP – Applying NLP to environmental text
For technical details, see Architecture and Methodology.
Next Steps
- Methodology – Scientific methodology and reproducibility
- Citation – How to cite this work