# Architecture
climatextract uses a Retrieval-Augmented Generation (RAG) architecture to extract emissions data from PDF reports. This page provides a high-level overview of how the components work together.
## Pipeline Overview
```mermaid
flowchart LR
    A[PDF Reports] --> B[Text Extraction]
    B --> C[Embedding]
    C --> D[(Vector Database)]
    E[Search Query] --> F[Semantic Search]
    D --> F
    F --> G[Relevant Pages]
    G --> H[LLM Prompt]
    H --> I[GPT Model]
    I --> J[Structured Output]
    J --> K[Post-Processing]
    K --> L[CSV Results]
```
## Core Components
### 1. PDF Processing
The pipeline starts by extracting text and tables from PDF pages:
- Uses Docling for text and table extraction
- Each page is processed independently
- Maintains page number metadata for traceability
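The page-level bookkeeping can be sketched as follows. The `PageRecord` type and `to_page_records` helper are illustrative, not the package's actual classes; the real pipeline wraps Docling's per-page output.

```python
from dataclasses import dataclass

@dataclass
class PageRecord:
    """One extracted page, keeping its source page number for traceability."""
    page_number: int  # 1-based page index in the source PDF
    text: str         # extracted text and linearized tables

def to_page_records(pages_text: list[str]) -> list[PageRecord]:
    """Wrap per-page extraction output, preserving page numbers.

    `pages_text` stands in for whatever the PDF extractor returns per page;
    the real pipeline's types may differ.
    """
    return [PageRecord(page_number=i + 1, text=t) for i, t in enumerate(pages_text)]
```

Keeping the page number alongside the text is what later lets extracted values be traced back to the exact page of the source report.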
### 2. Embedding & Storage
Text is converted to vector embeddings for semantic search:
- Default model: `text-embedding-ada-002` (Azure-hosted, via the default adapter)
- Embeddings stored in DuckDB for efficient retrieval
- Query embeddings are cached to avoid redundant API calls
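The query-embedding cache can be sketched as a thin memoizing layer in front of the embedding call. The class and method names here are illustrative assumptions, not the package's API:

```python
from typing import Callable

class CachingEmbedder:
    """Caches query embeddings by text so repeated queries skip the API.

    `embed_fn` stands in for the real embedding call (e.g. the package's
    EmbeddingModel wrapper around a handler).
    """
    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.api_calls = 0  # how many times the underlying API was actually hit

    def embed(self, text: str) -> list[float]:
        if text not in self._cache:
            self.api_calls += 1
            self._cache[text] = self._embed_fn(text)
        return self._cache[text]
```

Because the same search query is reused across many documents, caching it means the embedding API is called once per distinct query rather than once per document.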
### 3. Semantic Search
When processing a document, the pipeline:
- Embeds the search query (e.g., "What are the total CO2 emissions in different years? Include Scope 1, Scope 2, and Scope 3 emissions if available.")
- Computes cosine similarity against all page embeddings
- Retrieves the top-k% (default: 5%) most relevant pages, clamped to a minimum of 4 and a maximum of 7 pages
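The selection step above can be sketched in a few lines. This is a minimal illustration of the ranking and clamping logic, not the package's actual implementation:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_pages(query_emb: list[float], page_embs: list[list[float]],
                 top_pct: float = 0.05, min_pages: int = 4,
                 max_pages: int = 7) -> list[int]:
    """Rank pages by similarity to the query, keep top-k% clamped to [min, max]."""
    ranked = sorted(range(len(page_embs)),
                    key=lambda i: cosine(query_emb, page_embs[i]),
                    reverse=True)
    k = max(min_pages, min(max_pages, round(len(page_embs) * top_pct)))
    return ranked[:k]
```

For a 100-page report, 5% yields 5 pages; for a 200-page report the cap of 7 applies, and for short documents the floor of 4 does.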
### 4. LLM Extraction
Relevant pages are passed to a large language model:
- Structured prompts define the extraction task
- Output parsed via Pydantic models (`structured_json`) or regex (default)
- Each scope-year combination is extracted independently
### 5. Post-Processing
Raw LLM output is cleaned and structured:
- Unit normalization (e.g., "tonnes" → "tCO2e")
- Duplicate resolution across pages
- Priority rules resolve conflicts when multiple values exist for the same scope and year, so each scope-year pair yields a single value
- Value standardization and validation
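The normalization and conflict-resolution rules above can be sketched as follows. The alias table and the priority field are illustrative assumptions, not the package's actual values:

```python
# Illustrative unit alias table (assumed, not the package's real mapping).
UNIT_ALIASES = {"tonnes": "tCO2e", "t co2e": "tCO2e", "tco2e": "tCO2e"}

def normalize_unit(unit: str) -> str:
    """Map a raw unit string to the canonical form, if an alias is known."""
    return UNIT_ALIASES.get(unit.strip().lower(), unit)

def resolve_duplicates(candidates: list[dict]) -> list[dict]:
    """Keep one value per (scope, year), preferring higher-priority sources.

    Each candidate dict is assumed to carry scope, year, value, and a
    priority (lower number = more trusted source page).
    """
    best: dict[tuple[int, int], dict] = {}
    for c in candidates:
        key = (c["scope"], c["year"])
        if key not in best or c["priority"] < best[key]["priority"]:
            best[key] = c
    return list(best.values())
```

Since each scope-year combination is extracted independently and relevant pages can overlap, the same figure may surface several times; this step collapses them to a single row per scope and year before writing the CSV.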
## Key Classes
| Class | Purpose |
|---|---|
| `ValueRetrieverPipeline` | Orchestrates the full extraction workflow |
| `EmbeddingsRepository` | Manages DuckDB storage for embeddings |
| `LlmHandler` / `EmbeddingModelHandler` | ABCs users subclass to plug in a provider |
| `Llm` / `EmbeddingModel` | Package-side wrappers that add usage counting and concurrency control around a user's handler |
| `AzureAIFoundryLlmHandler` / `AzureAIFoundryEmbeddingHandler` | Default Azure AI Foundry adapters, built on LiteLLM |
| `AzureOpenAILlmHandler` / `AzureOpenAIEmbeddingHandler` | Azure OpenAI Service adapters (legacy deployments), built on LiteLLM |
| `UsageCounter` | Accumulates token counts and USD cost across calls |
| `StructuredJsonPrompt` | Structures LLM prompts with Pydantic parsing |
| `EvaluatorData` | Computes evaluation metrics against a gold standard |
| `Pdfdoc` | PDF document representation with page data |
| `DataLakeManager` | Manages PDF downloads and file checks |
## Provider Abstraction
LLM and embedding calls go through a two-layer pattern:
- **Handler** (user-supplied): a subclass of `LlmHandler` or `EmbeddingModelHandler` that talks to a specific provider. The package ships Azure reference handlers; for other providers, users implement their own.
- **Wrapper** (`Llm` / `EmbeddingModel`): package-side, provider-agnostic. Adds usage accounting and concurrency control around whatever the handler does.
Pipeline code calls the wrapper, never the handler directly. This keeps the pipeline ignorant of provider details and lets users swap providers without touching package internals. See Custom Providers for how to write a handler.
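The two-layer pattern can be sketched as below. The class names `LlmHandler` and `Llm` come from the table above, but the `complete` method name, the toy handler, and the call counter are illustrative assumptions, not the package's real API:

```python
from abc import ABC, abstractmethod

class LlmHandler(ABC):
    """User-supplied layer: knows how to call one specific provider."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class Llm:
    """Package-side wrapper: provider-agnostic bookkeeping around any handler."""
    def __init__(self, handler: LlmHandler):
        self._handler = handler
        self.calls = 0  # stands in for UsageCounter's token/cost accounting

    def complete(self, prompt: str) -> str:
        self.calls += 1                       # accounting happens here,
        return self._handler.complete(prompt)  # provider logic stays in the handler

class EchoHandler(LlmHandler):
    """Toy handler standing in for a real provider adapter."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"
```

Swapping providers then means writing a new handler subclass; the wrapper, and everything in the pipeline that calls it, stays unchanged.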
## Data Flow
```mermaid
sequenceDiagram
    participant User
    participant API as Extract API
    participant Pipeline
    participant DB as Vector DB
    participant LLM
    User->>API: extract(pdf_path)
    API->>Pipeline: Process PDF
    Pipeline->>DB: Store/retrieve embeddings
    Pipeline->>DB: Semantic search
    DB-->>Pipeline: Relevant pages
    Pipeline->>LLM: Extract emissions
    LLM-->>Pipeline: Structured data
    Pipeline->>API: Save results
    API-->>User: Result path
```
## Next Steps
- RAG Pipeline – Deep dive into retrieval and generation
- Prompts – How prompts are structured
- Evaluation – Measuring extraction quality