
Architecture

climatextract uses a Retrieval-Augmented Generation (RAG) architecture to extract emissions data from PDF reports. This page provides a high-level overview of how the components work together.


Pipeline Overview

flowchart LR
    A[PDF Reports] --> B[Text Extraction]
    B --> C[Embedding]
    C --> D[(Vector Database)]
    E[Search Query] --> F[Semantic Search]
    D --> F
    F --> G[Relevant Pages]
    G --> H[LLM Prompt]
    H --> I[GPT Model]
    I --> J[Structured Output]
    J --> K[Post-Processing]
    K --> L[CSV Results]

Core Components

1. PDF Processing

The pipeline starts by extracting text and tables from PDF pages:

  • Uses Docling for text and table extraction
  • Each page is processed independently
  • Maintains page number metadata for traceability
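
Page-level results can be modeled as small records that carry the page number alongside the extracted content. A minimal sketch (the class name and fields are illustrative, not the package's actual types such as Pdfdoc):

```python
from dataclasses import dataclass, field

@dataclass
class PageRecord:
    """Text and tables extracted from a single PDF page."""
    page_number: int          # 1-based page index, kept for traceability
    text: str                 # plain text extracted from the page
    tables: list = field(default_factory=list)  # extracted tables, if any

# Each page is processed independently, so a document is just a list of records.
pages = [
    PageRecord(page_number=1, text="Sustainability Report 2023"),
    PageRecord(page_number=42, text="Scope 1 emissions: 12,300 tCO2e"),
]
print([p.page_number for p in pages])  # page numbers survive into the results
```

Keeping the page number on every record is what lets the final CSV point back to the exact page a value came from.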

2. Embedding & Storage

Text is converted to vector embeddings for semantic search:

  • Default model: text-embedding-ada-002 (Azure-hosted, via the default adapter)
  • Embeddings stored in DuckDB for efficient retrieval
  • Query embeddings are cached to avoid redundant API calls
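
Query-embedding caching can be as simple as memoizing the call that hits the embedding API. A sketch under that assumption (the stand-in `_call_embedding_api` fakes the network call; the real pipeline's cache lives alongside the DuckDB store):

```python
from functools import lru_cache

# Hypothetical stand-in for a real embedding API call (e.g. text-embedding-ada-002).
def _call_embedding_api(text: str) -> tuple:
    # In the real pipeline this is a network request; here we fabricate a vector.
    return tuple(float(ord(c)) for c in text[:4])

@lru_cache(maxsize=1024)
def embed_query(text: str) -> tuple:
    """Cache query embeddings so repeated searches reuse the same vector."""
    return _call_embedding_api(text)

v1 = embed_query("total CO2 emissions")
v2 = embed_query("total CO2 emissions")  # served from the cache, no second API call
assert v1 is v2  # identical object: the API was hit only once
```

Since the same search query is reused across every document in a batch, this saves one embedding call per document after the first.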

3. Semantic Search

When processing a document, the pipeline:

  1. Embeds the search query (e.g., "What are the total CO2 emissions in different years? Include Scope 1, Scope 2, and Scope 3 emissions if available.")
  2. Computes cosine similarity against all page embeddings
  3. Retrieves the top k% (default: 5%) most relevant pages, clamped to a minimum of 4 and a maximum of 7 pages
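
The three steps above can be sketched in a few lines; the function and parameter names here are illustrative, not the package's actual API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_pages(query_vec, page_vecs, k_pct=0.05, min_pages=4, max_pages=7):
    """Rank pages by similarity; keep the top k%, clamped to [min_pages, max_pages]."""
    n = max(min_pages, min(max_pages, round(len(page_vecs) * k_pct)))
    scored = sorted(
        ((cosine(query_vec, vec), page) for page, vec in page_vecs.items()),
        reverse=True,
    )
    return [page for _, page in scored[:n]]

# 100 pages at 5% -> 5 pages, which already sits inside the [4, 7] clamp.
example = top_pages((1.0, 50.0), {i: (1.0, i + 1.0) for i in range(100)})
print(len(example))  # 5
```

The clamp matters at the extremes: a 20-page report would yield only 1 page at 5%, so the minimum lifts it to 4; a 500-page report would yield 25, so the maximum caps it at 7.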

4. LLM Extraction

Relevant pages are passed to a large language model:

  • Structured prompts define the extraction task
  • Output parsed via Pydantic models (structured_json) or regex (default)
  • Each scope-year combination is extracted independently
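
In the default regex mode, the model is asked to answer in a fixed textual pattern that can be parsed back out. A minimal sketch of such a parser (the answer format shown is illustrative, not the package's actual prompt contract):

```python
import re

# Illustrative answer format: "Scope 1 (2022): 12,300 tCO2e"
PATTERN = re.compile(
    r"Scope\s*(?P<scope>[123])\s*\((?P<year>\d{4})\):\s*"
    r"(?P<value>[\d,.]+)\s*(?P<unit>\S+)"
)

def parse_answer(text):
    """Pull one scope-year-value triple out of an LLM answer, or None if off-format."""
    m = PATTERN.search(text)
    if not m:
        return None  # model answered off-format; caller can retry or skip
    return {
        "scope": int(m["scope"]),
        "year": int(m["year"]),
        "value": float(m["value"].replace(",", "")),
        "unit": m["unit"],
    }

print(parse_answer("Scope 1 (2022): 12,300 tCO2e"))
# {'scope': 1, 'year': 2022, 'value': 12300.0, 'unit': 'tCO2e'}
```

The structured_json mode replaces this hand-rolled pattern with a Pydantic schema, trading regex brittleness for a stricter output contract.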

5. Post-Processing

Raw LLM output is cleaned and structured:

  • Unit normalization (e.g., "tonnes" → "tCO2e")
  • Duplicate resolution across pages
  • Conflicting Scope values for the same year are resolved to a single value using priority rules
  • Value standardization and validation
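
The normalization and duplicate-resolution steps can be sketched as lookup tables plus a keyed reduction. The alias and priority tables below are hypothetical; the real ones live inside the package:

```python
# Hypothetical unit aliases and source priorities (lower number = higher priority).
UNIT_ALIASES = {"tonnes": "tCO2e", "t co2e": "tCO2e", "tco2e": "tCO2e"}
PRIORITY = {"table": 0, "text": 1}  # prefer values extracted from tables

def normalize_unit(unit):
    """Map a raw unit string onto the canonical spelling."""
    return UNIT_ALIASES.get(unit.strip().lower(), unit)

def resolve(candidates):
    """Collapse duplicate (scope, year) values using the priority rules."""
    best = {}
    for c in candidates:
        key = (c["scope"], c["year"])
        if key not in best or PRIORITY[c["source"]] < PRIORITY[best[key]["source"]]:
            best[key] = c
    return list(best.values())

rows = [
    {"scope": 1, "year": 2022, "value": 12300.0, "source": "text"},
    {"scope": 1, "year": 2022, "value": 12250.0, "source": "table"},
]
print(resolve(rows))  # one row per (scope, year); the table value wins
```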

Key Classes

| Class | Purpose |
| --- | --- |
| ValueRetrieverPipeline | Orchestrates the full extraction workflow |
| EmbeddingsRepository | Manages DuckDB storage for embeddings |
| LlmHandler / EmbeddingModelHandler | ABCs users subclass to plug in a provider |
| Llm / EmbeddingModel | Package-side wrappers that add usage counting and concurrency control around a user's handler |
| AzureAIFoundryLlmHandler / AzureAIFoundryEmbeddingHandler | Default Azure AI Foundry adapters, built on LiteLLM |
| AzureOpenAILlmHandler / AzureOpenAIEmbeddingHandler | Azure OpenAI Service adapters (legacy deployments), built on LiteLLM |
| UsageCounter | Accumulates token counts and USD cost across calls |
| StructuredJsonPrompt | Structures LLM prompts with Pydantic parsing |
| EvaluatorData | Computes evaluation metrics against a gold standard |
| Pdfdoc | PDF document representation with page data |
| DataLakeManager | Manages PDF downloads and file checks |

Provider Abstraction

LLM and embedding calls go through a two-layer pattern:

  • Handler (user-supplied): a subclass of LlmHandler or EmbeddingModelHandler that talks to a specific provider. The package ships Azure reference handlers; for other providers, users implement their own.
  • Wrapper (Llm / EmbeddingModel): package-side, provider-agnostic. Adds usage accounting and concurrency control around whatever the handler does.

Pipeline code calls the wrapper, never the handler directly. This keeps the pipeline ignorant of provider details and lets users swap providers without touching package internals. See Custom Providers for how to write a handler.
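
The two-layer pattern can be sketched as an ABC plus a thin wrapper. The method names here are illustrative and may differ from the package's actual interfaces:

```python
from abc import ABC, abstractmethod

class LlmHandler(ABC):
    """User-supplied: knows how to talk to one specific provider."""
    @abstractmethod
    def complete(self, prompt: str):
        """Return (completion_text, tokens_used)."""

class Llm:
    """Package-side wrapper: provider-agnostic usage accounting."""
    def __init__(self, handler: LlmHandler):
        self._handler = handler
        self.total_tokens = 0  # role of UsageCounter, simplified

    def complete(self, prompt: str) -> str:
        text, tokens = self._handler.complete(prompt)
        self.total_tokens += tokens
        return text

class EchoHandler(LlmHandler):  # stand-in for a real provider adapter
    def complete(self, prompt):
        return f"echo: {prompt}", len(prompt.split())

llm = Llm(EchoHandler())        # pipeline code only ever sees `Llm`
print(llm.complete("Scope 1 emissions?"), llm.total_tokens)
```

Because accounting lives in the wrapper, swapping EchoHandler for a real Azure (or any other) adapter changes nothing in the pipeline or the usage totals.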


Data Flow

sequenceDiagram
    participant User
    participant API as Extract API
    participant Pipeline
    participant DB as Vector DB
    participant LLM

    User->>API: extract(pdf_path)
    API->>Pipeline: Process PDF
    Pipeline->>DB: Store/retrieve embeddings
    Pipeline->>DB: Semantic search
    DB-->>Pipeline: Relevant pages
    Pipeline->>LLM: Extract emissions
    LLM-->>Pipeline: Structured data
    Pipeline->>API: Save results
    API-->>User: Result path
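
The same sequence, reduced to plain function calls. Every name below is a hypothetical stand-in for the real components, not the package's API:

```python
# Hypothetical stand-ins for the real components, wired in the diagram's order.
def embed_and_store(pdf_path):
    """Text extraction + embedding; returns per-page vectors (faked here)."""
    return {1: [0.1, 0.9], 2: [0.9, 0.1]}

def semantic_search(query_vec, page_vecs):
    """Vector DB lookup; returns ids of the most relevant pages."""
    return [2]

def llm_extract(page_ids):
    """LLM call plus parsing; returns structured rows."""
    return [{"scope": 1, "year": 2022, "value": 12300.0}]

def extract(pdf_path, out_csv="results.csv"):
    page_vecs = embed_and_store(pdf_path)              # Pipeline -> Vector DB
    relevant = semantic_search([1.0, 0.0], page_vecs)  # Vector DB -> Pipeline
    rows = llm_extract(relevant)                       # Pipeline -> LLM
    # post-processing and CSV writing would happen here
    return out_csv                                     # user gets back a result path

print(extract("report.pdf"))  # prints "results.csv"
```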

Next Steps