Methodology
NOTE: Initial draft. This page has not been reviewed by the project team and may not be up to date.
This page describes the scientific methodology behind climatextract in enough detail to support academic citation and reproducibility.
Task Definition
Objective: Given a corporate sustainability report (PDF), extract all reported CO₂ emissions values for Scope 1, 2, and 3 across available years.
Output: Structured table with columns: report_id, year, indicator, value_std, unit_std, page (plus detail columns such as value_raw, value_score, unit_raw, unit_score, unit_cat, dupl_flag, select_flag)
Approach: Retrieval-Augmented Generation
We use a RAG pipeline consisting of:
1. Document Preprocessing
- PDF pages are converted to text with Docling, which handles both text and table extraction
- Each page is treated as an independent document chunk
- Metadata (page number, filename) is preserved for traceability
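The page-as-chunk idea can be sketched as follows. This is a minimal illustration that assumes the per-page text has already been extracted (the Docling call itself is omitted). The names `PageChunk` and `make_page_chunks`, the example filename, and the choice to skip empty pages are assumptions made for this example, not part of climatextract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageChunk:
    """One PDF page treated as an independent retrieval unit (hypothetical type)."""
    filename: str
    page: int  # 1-based page number, kept for traceability
    text: str


def make_page_chunks(filename: str, page_texts: list[str]) -> list[PageChunk]:
    """Wrap already-extracted page texts into chunks.

    Empty pages are skipped (an assumption of this sketch); the original
    page numbers of the remaining pages are preserved.
    """
    return [
        PageChunk(filename=filename, page=number, text=text.strip())
        for number, text in enumerate(page_texts, start=1)
        if text.strip()
    ]


# Toy input: three pages, the second one empty.
chunks = make_page_chunks("report_2023.pdf", ["Cover", "", "Scope 1: 1,200 tCO2e"])
```

Because the page number travels with each chunk, any value extracted later can be traced back to its source page.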
2. Semantic Retrieval
- Pages are embedded using OpenAI's text-embedding-ada-002 model
- A task-specific query ("What are the total CO2 emissions in different years? Include Scope 1, Scope 2, and Scope 3 emissions if available.") is embedded
- Cosine similarity identifies the most relevant pages
- Top-k pages are selected based on score thresholds
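The retrieval step can be illustrated with a small, dependency-free sketch. Toy two-dimensional vectors stand in for the real embeddings, and the function names, the `top_k` value, and the `min_score` threshold are assumptions chosen for this example. The actual thresholds used by climatextract are not stated on this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def select_pages(query_vec, page_vecs, top_k=3, min_score=0.5):
    """Return up to top_k (page, score) pairs with score >= min_score, best first."""
    scored = [(page, cosine_similarity(query_vec, vec)) for page, vec in page_vecs.items()]
    kept = [(page, score) for page, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Toy embeddings: page 1 points the same way as the query, page 2 is orthogonal.
query = [1.0, 0.0]
pages = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
selected = select_pages(query, pages, top_k=2, min_score=0.5)
```

Here page 2 falls below the threshold and is dropped, while pages 1 and 3 are returned in descending order of similarity.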
3. LLM Extraction
- Selected pages are passed to GPT models with structured prompts
- Prompts include:
  - Role definition (climate analyst)
  - KPI definitions (Scope 1, 2, 3)
  - Extraction constraints (company-level, absolute values only)
- Output is constrained to JSON format with Pydantic validation
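To show what validating the model's JSON output involves, here is a dependency-free stand-in. The pipeline itself uses Pydantic. This sketch swaps in a plain dataclass with manual checks so that it runs without third-party packages. The JSON field names, the indicator labels, and the `parse_llm_output` helper are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

# Assumed indicator labels; the real schema may use different names.
ALLOWED_INDICATORS = {"Scope 1", "Scope 2", "Scope 3"}


@dataclass
class Extraction:
    year: int
    indicator: str
    value: float
    unit: str
    page: int


def parse_llm_output(raw_json: str) -> list[Extraction]:
    """Parse a JSON array of extractions; raise ValueError on malformed input."""
    items = json.loads(raw_json)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of extractions")
    results = []
    for item in items:
        try:
            indicator = item["indicator"]
            if indicator not in ALLOWED_INDICATORS:
                raise ValueError(f"unknown indicator: {indicator!r}")
            results.append(
                Extraction(
                    year=int(item["year"]),
                    indicator=indicator,
                    value=float(item["value"]),
                    unit=str(item["unit"]),
                    page=int(item["page"]),
                )
            )
        except (KeyError, TypeError) as exc:
            raise ValueError(f"malformed extraction: {item!r}") from exc
    return results


raw = '[{"year": 2022, "indicator": "Scope 1", "value": 1200, "unit": "tCO2e", "page": 45}]'
rows = parse_llm_output(raw)
```

Rejecting malformed output at this point keeps incomplete or mistyped records from reaching post-processing.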
4. Post-Processing
- Unit normalization (e.g., "tonnes CO2" → "tCO2e")
- Duplicate resolution across pages
- Value validation and standardization
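Unit normalization and duplicate resolution might look roughly like the sketch below. The alias table, the conversion factors, and both helper functions are hypothetical. The sketch also simply drops repeated values, whereas the real pipeline exposes flag columns (dupl_flag, select_flag) whose exact semantics are not described here.

```python
# Hypothetical alias table: lowercase raw unit -> (standard unit, multiplier).
UNIT_ALIASES = {
    "tonnes co2": ("tCO2e", 1.0),
    "tco2e": ("tCO2e", 1.0),
    "kt co2e": ("tCO2e", 1_000.0),
}


def normalize_unit(value, unit_raw):
    """Return (value_std, unit_std), or None when the unit is not recognized."""
    key = " ".join(unit_raw.lower().split())  # lowercase and collapse whitespace
    if key not in UNIT_ALIASES:
        return None
    unit_std, factor = UNIT_ALIASES[key]
    return round(value * factor, 6), unit_std


def drop_duplicates(rows):
    """Keep the first row for each (year, indicator, value_std, unit_std)."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["year"], row["indicator"], row["value_std"], row["unit_std"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# The same Scope 1 figure reported on two pages in different units.
raw_rows = [
    {"year": 2022, "indicator": "Scope 1", "value_raw": 1500, "unit_raw": "tonnes CO2", "page": 45},
    {"year": 2022, "indicator": "Scope 1", "value_raw": 1.5, "unit_raw": "kt CO2e", "page": 80},
]
std_rows = []
for row in raw_rows:
    normalized = normalize_unit(row["value_raw"], row["unit_raw"])
    if normalized is not None:
        value_std, unit_std = normalized
        std_rows.append({**row, "value_std": value_std, "unit_std": unit_std})
unique_rows = drop_duplicates(std_rows)
```

Normalizing before deduplicating matters: the two rows only become recognizable as the same value once both are expressed in the standard unit.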
Evaluation Methodology
Gold Standard
- Human-annotated dataset of emissions values
- Covers diverse companies and reporting formats
- Annotations include page-level source references
Metrics
We report standard information extraction metrics:
- Precision: Fraction of extracted values that are correct
- Recall: Fraction of gold standard values that were extracted
- F1 Score: Harmonic mean of precision and recall
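For reference, the three metrics can be computed from simple counts as in this sketch. The function name and the example counts are illustrative only and do not represent reported results.

```python
def precision_recall_f1(n_correct, n_extracted, n_gold):
    """Compute precision, recall, and F1 from counts.

    n_correct:   extracted values that match a gold standard value
    n_extracted: all values the system extracted
    n_gold:      all values in the gold standard
    """
    precision = n_correct / n_extracted if n_extracted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 8 of 10 extracted values are correct, out of 16 gold values.
precision, recall, f1 = precision_recall_f1(n_correct=8, n_extracted=10, n_gold=16)
```

With these counts, precision is 8/10, recall is 8/16, and F1 is their harmonic mean.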
Matching Criteria
An extraction is considered correct if:
- Year matches exactly
- Scope matches exactly
- Value matches (with tolerance for minor differences)
- Unit is semantically equivalent after normalization
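A matching check along these lines is sketched below. The 1% relative tolerance is an assumption made for illustration, since the actual tolerance is not specified on this page. The record layout and the `is_match` name are likewise hypothetical.

```python
import math


def is_match(pred, gold, rel_tol=0.01):
    """True if year, indicator, and unit agree and the values are close.

    rel_tol=0.01 (1%) is an assumed tolerance, not the project's setting.
    Both records are expected to hold already-normalized values and units.
    """
    return (
        pred["year"] == gold["year"]
        and pred["indicator"] == gold["indicator"]
        and pred["unit_std"] == gold["unit_std"]
        and math.isclose(pred["value_std"], gold["value_std"], rel_tol=rel_tol)
    )


gold = {"year": 2022, "indicator": "Scope 1", "value_std": 1200.0, "unit_std": "tCO2e"}
close_pred = {"year": 2022, "indicator": "Scope 1", "value_std": 1204.0, "unit_std": "tCO2e"}
wrong_year = {"year": 2021, "indicator": "Scope 1", "value_std": 1200.0, "unit_std": "tCO2e"}

close_ok = is_match(close_pred, gold)
year_ok = is_match(wrong_year, gold)
```

The first prediction differs from the gold value by about 0.3% and is accepted under the assumed tolerance. The second is rejected because the year must match exactly.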
Reproducibility
All experiments are logged with:
- MLflow: Parameters, metrics, and artifacts
- Configuration files: climatextract.toml snapshots
- Version control: Git commit hashes
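As an illustration of what such a record might contain, the sketch below collects a configuration checksum and the current Git commit into one dictionary. Both helper functions and the placeholder configuration content are hypothetical. How climatextract actually passes this information to MLflow is not shown here, although a dictionary of this shape could plausibly be logged as run parameters.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path


def current_git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a Git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def run_fingerprint(config_path: str) -> dict[str, str]:
    """Summarize what is needed to identify a run: config checksum and commit."""
    config_bytes = Path(config_path).read_bytes()
    return {
        "config_file": Path(config_path).name,
        "config_sha256": hashlib.sha256(config_bytes).hexdigest(),
        "git_commit": current_git_commit(),
    }


# Demo with a throwaway config file; the content is a placeholder, not the real schema.
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "climatextract.toml"
    config.write_text('[example]\nkey = "value"\n')
    fingerprint = run_fingerprint(str(config))
```

Two runs with the same checksum and the same commit hash used identical configuration and code, which is the precondition for comparing their metrics.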
To reproduce results:
- Use the same climatextract.toml configuration
- Ensure identical model versions (specified in config)
- Run against the same gold standard dataset
Limitations
- Language: Currently optimized for English reports
- Format: Complex tables may not be fully captured
- Coverage: Some disclosure formats may be missed
- Model dependency: Results may vary with different LLM versions
Future Work
- Multi-language support
- Improved table extraction
- Fine-tuned extraction models
- Expanded KPI coverage beyond emissions
Next Steps
- Citation – How to cite this work
- Background – Academic context and motivation