
Architecture

climatextract uses a Retrieval-Augmented Generation (RAG) architecture to extract emissions data from PDF reports. This page provides a high-level overview of how the components work together.


Pipeline Overview

flowchart LR
    A[PDF Reports] --> B[Text Extraction]
    B --> C[Embedding]
    C --> D[(Vector Database)]
    E[Search Query] --> F[Semantic Search]
    D --> F
    F --> G[Relevant Pages]
    G --> H[LLM Prompt]
    H --> I[GPT Model]
    I --> J[Structured Output]
    J --> K[Post-Processing]
    K --> L[CSV Results]

Core Components

1. PDF Processing

The pipeline starts by extracting text and tables from PDF pages:

  • Uses Docling for text and table extraction
  • Each page is processed independently
  • Maintains page number metadata for traceability
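
Page-level results can be modeled as small records that carry the page number alongside the extracted content. A minimal sketch (the class name and fields are illustrative, not the package's actual types such as Pdfdoc):

```python
from dataclasses import dataclass, field

@dataclass
class PageRecord:
    """Text and tables extracted from a single PDF page."""
    page_number: int          # 1-based page index, kept for traceability
    text: str                 # plain text extracted from the page
    tables: list = field(default_factory=list)  # extracted tables, if any

# Each page is processed independently, so a document is just a list of records.
pages = [
    PageRecord(page_number=1, text="Sustainability Report 2023"),
    PageRecord(page_number=42, text="Scope 1 emissions: 12,300 tCO2e"),
]
print([p.page_number for p in pages])  # page numbers survive into the results
```

Keeping the page number on every record is what lets the final CSV point back to the exact page a value came from.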

2. Embedding & Storage

Text is converted to vector embeddings for semantic search:

  • Default model: text-embedding-ada-002 (Azure-hosted, via the default adapter)
  • Embeddings stored in DuckDB for efficient retrieval
  • Query embeddings are cached to avoid redundant API calls
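
Query-embedding caching can be as simple as memoizing the call that hits the embedding API. A sketch under that assumption (the stand-in `_call_embedding_api` fakes the network call; the real pipeline's cache lives alongside the DuckDB store):

```python
from functools import lru_cache

# Hypothetical stand-in for a real embedding API call (e.g. text-embedding-ada-002).
def _call_embedding_api(text: str) -> tuple:
    # In the real pipeline this is a network request; here we fabricate a vector.
    return tuple(float(ord(c)) for c in text[:4])

@lru_cache(maxsize=1024)
def embed_query(text: str) -> tuple:
    """Cache query embeddings so repeated searches reuse the same vector."""
    return _call_embedding_api(text)

v1 = embed_query("total CO2 emissions")
v2 = embed_query("total CO2 emissions")  # served from the cache, no second API call
assert v1 is v2  # identical object: the API was hit only once
```

Since the same search query is reused across every document in a batch, this saves one embedding call per document after the first.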

3. Semantic Search

When processing a document, the pipeline:

  1. Embeds the search query (e.g., "What are the total CO2 emissions in different years? Include Scope 1, Scope 2, and Scope 3 emissions if available.")
  2. Computes cosine similarity against all page embeddings
  3. Retrieves the top k% (default: 5%) most relevant pages, clamped to a minimum of 4 and a maximum of 7 pages
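
The three steps above can be sketched in a few lines; the function and parameter names here are illustrative, not the package's actual API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_pages(query_vec, page_vecs, k_pct=0.05, min_pages=4, max_pages=7):
    """Rank pages by similarity; keep the top k%, clamped to [min_pages, max_pages]."""
    n = max(min_pages, min(max_pages, round(len(page_vecs) * k_pct)))
    scored = sorted(
        ((cosine(query_vec, vec), page) for page, vec in page_vecs.items()),
        reverse=True,
    )
    return [page for _, page in scored[:n]]

# 100 pages at 5% -> 5 pages, which already sits inside the [4, 7] clamp.
example = top_pages((1.0, 50.0), {i: (1.0, i + 1.0) for i in range(100)})
print(len(example))  # 5
```

The clamp matters at the extremes: a 20-page report would yield only 1 page at 5%, so the minimum lifts it to 4; a 500-page report would yield 25, so the maximum caps it at 7.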

4. LLM Extraction

Relevant pages are passed to a large language model:

  • Structured prompts define the extraction task
  • Output parsed via Pydantic models (structured_json) or regex (default)
  • Each scope-year combination is extracted independently
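
In the default regex mode, the model is asked to answer in a fixed textual pattern that can be parsed back out. A minimal sketch of such a parser (the answer format shown is illustrative, not the package's actual prompt contract):

```python
import re

# Illustrative answer format: "Scope 1 (2022): 12,300 tCO2e"
PATTERN = re.compile(
    r"Scope\s*(?P<scope>[123])\s*\((?P<year>\d{4})\):\s*"
    r"(?P<value>[\d,.]+)\s*(?P<unit>\S+)"
)

def parse_answer(text):
    """Pull one scope-year-value triple out of an LLM answer, or None if off-format."""
    m = PATTERN.search(text)
    if not m:
        return None  # model answered off-format; caller can retry or skip
    return {
        "scope": int(m["scope"]),
        "year": int(m["year"]),
        "value": float(m["value"].replace(",", "")),
        "unit": m["unit"],
    }

print(parse_answer("Scope 1 (2022): 12,300 tCO2e"))
# {'scope': 1, 'year': 2022, 'value': 12300.0, 'unit': 'tCO2e'}
```

The structured_json mode replaces this hand-rolled pattern with a Pydantic schema, trading regex brittleness for a stricter output contract.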

5. Post-Processing

Raw LLM output is cleaned and structured:

  • Unit normalization (e.g., "tonnes" → "tCO2e")
  • Duplicate resolution across pages
  • Conflicting Scope values for the same year are resolved to a single value using priority rules
  • Value standardization and validation
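
The normalization and duplicate-resolution steps can be sketched as lookup tables plus a keyed reduction. The alias and priority tables below are hypothetical; the real ones live inside the package:

```python
# Hypothetical unit aliases and source priorities (lower number = higher priority).
UNIT_ALIASES = {"tonnes": "tCO2e", "t co2e": "tCO2e", "tco2e": "tCO2e"}
PRIORITY = {"table": 0, "text": 1}  # prefer values extracted from tables

def normalize_unit(unit):
    """Map a raw unit string onto the canonical spelling."""
    return UNIT_ALIASES.get(unit.strip().lower(), unit)

def resolve(candidates):
    """Collapse duplicate (scope, year) values using the priority rules."""
    best = {}
    for c in candidates:
        key = (c["scope"], c["year"])
        if key not in best or PRIORITY[c["source"]] < PRIORITY[best[key]["source"]]:
            best[key] = c
    return list(best.values())

rows = [
    {"scope": 1, "year": 2022, "value": 12300.0, "source": "text"},
    {"scope": 1, "year": 2022, "value": 12250.0, "source": "table"},
]
print(resolve(rows))  # one row per (scope, year); the table value wins
```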

Key Classes

| Class | Purpose |
| --- | --- |
| ValueRetrieverPipeline | Orchestrates the full extraction workflow |
| EmbeddingsRepository | Manages DuckDB storage for embeddings |
| LlmHandler / EmbeddingModelHandler | ABCs users subclass to plug in a provider |
| Llm / EmbeddingModel | Package-side wrappers that add usage counting and concurrency control around a user's handler |
| AzureAIFoundryLlmHandler / AzureAIFoundryEmbeddingHandler | Default Azure AI Foundry adapters, built on LiteLLM |
| AzureOpenAILlmHandler / AzureOpenAIEmbeddingHandler | Azure OpenAI Service adapters (legacy deployments), built on LiteLLM |
| UsageCounter | Accumulates token counts and USD cost across calls |
| StructuredJsonPrompt | Structures LLM prompts with Pydantic parsing |
| EvaluatorData | Computes evaluation metrics against a gold standard |
| Pdfdoc | PDF document representation with page data |
| DataLakeManager | Manages PDF downloads and file checks |

Provider Abstraction

LLM and embedding calls go through a two-layer pattern:

  • Handler (user-supplied): a subclass of LlmHandler or EmbeddingModelHandler that talks to a specific provider. The package ships Azure reference handlers; for other providers, users implement their own.
  • Wrapper (Llm / EmbeddingModel): package-side, provider-agnostic. Adds usage accounting and concurrency control around whatever the handler does.

Pipeline code calls the wrapper, never the handler directly. This keeps the pipeline ignorant of provider details and lets users swap providers without touching package internals. See Custom Providers for how to write a handler.
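
The two-layer pattern can be sketched as an ABC plus a thin wrapper. The method names here are illustrative and may differ from the package's actual interfaces:

```python
from abc import ABC, abstractmethod

class LlmHandler(ABC):
    """User-supplied: knows how to talk to one specific provider."""
    @abstractmethod
    def complete(self, prompt: str):
        """Return (completion_text, tokens_used)."""

class Llm:
    """Package-side wrapper: provider-agnostic usage accounting."""
    def __init__(self, handler: LlmHandler):
        self._handler = handler
        self.total_tokens = 0  # role of UsageCounter, simplified

    def complete(self, prompt: str) -> str:
        text, tokens = self._handler.complete(prompt)
        self.total_tokens += tokens
        return text

class EchoHandler(LlmHandler):  # stand-in for a real provider adapter
    def complete(self, prompt):
        return f"echo: {prompt}", len(prompt.split())

llm = Llm(EchoHandler())        # pipeline code only ever sees `Llm`
print(llm.complete("Scope 1 emissions?"), llm.total_tokens)
```

Because accounting lives in the wrapper, swapping EchoHandler for a real Azure (or any other) adapter changes nothing in the pipeline or the usage totals.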


Data Flow

sequenceDiagram
    participant User
    participant API as Extract API
    participant Pipeline
    participant DB as Vector DB
    participant LLM

    User->>API: extract(pdf_path)
    API->>Pipeline: Process PDF
    Pipeline->>DB: Store/retrieve embeddings
    Pipeline->>DB: Semantic search
    DB-->>Pipeline: Relevant pages
    Pipeline->>LLM: Extract emissions
    LLM-->>Pipeline: Structured data
    Pipeline->>API: Save results
    API-->>User: Result path
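
The same sequence, reduced to plain function calls. Every name below is a hypothetical stand-in for the real components, not the package's API:

```python
# Hypothetical stand-ins for the real components, wired in the diagram's order.
def embed_and_store(pdf_path):
    """Text extraction + embedding; returns per-page vectors (faked here)."""
    return {1: [0.1, 0.9], 2: [0.9, 0.1]}

def semantic_search(query_vec, page_vecs):
    """Vector DB lookup; returns ids of the most relevant pages."""
    return [2]

def llm_extract(page_ids):
    """LLM call plus parsing; returns structured rows."""
    return [{"scope": 1, "year": 2022, "value": 12300.0}]

def extract(pdf_path, out_csv="results.csv"):
    page_vecs = embed_and_store(pdf_path)              # Pipeline -> Vector DB
    relevant = semantic_search([1.0, 0.0], page_vecs)  # Vector DB -> Pipeline
    rows = llm_extract(relevant)                       # Pipeline -> LLM
    # post-processing and CSV writing would happen here
    return out_csv                                     # user gets back a result path

print(extract("report.pdf"))  # prints "results.csv"
```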

Next Steps