RAG Pipeline

climatextract implements a Retrieval-Augmented Generation (RAG) pipeline specifically designed for extracting structured data from sustainability reports. This page explains the retrieval and generation stages in detail.


What is RAG?

RAG combines two approaches:

  1. Retrieval – Find relevant context from a large document corpus
  2. Generation – Use an LLM to extract/generate answers from that context

This is more effective than prompting an LLM with entire documents because:

  • LLMs have context window limits
  • Irrelevant content can confuse the model
  • Focused context improves extraction accuracy
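
In code, the two stages reduce to a retrieve-then-generate loop. The following is a minimal illustrative sketch, not the climatextract API: embed and llm stand in for real embedding and LLM clients, and pages is a list of page texts.

import numpy as np

def rag_answer(query, pages, embed, llm, k=3):
    # Stage 1 (retrieval): rank pages by cosine similarity to the query
    q = embed(query)
    vecs = [embed(p) for p in pages]
    sims = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            for v in vecs]
    top_pages = [pages[i] for i in np.argsort(sims)[::-1][:k]]
    # Stage 2 (generation): the LLM extracts answers from the focused context
    return llm(query, top_pages)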

Retrieval Stage

Embedding Documents

Each PDF page is converted to a vector embedding:

# Simplified view of the embedding process (function names are illustrative)
for page_id, page in enumerate(pdf_pages):
    text = extract_text(page)              # pull raw text from the PDF page
    embedding = embed_model.encode(text)   # map the text to a dense vector
    store_in_database(page_id, embedding)  # persist the vector for later search

The embeddings capture semantic meaning: pages with similar content end up with similar vectors.

When processing a report, the pipeline:

  1. Embeds a search query optimized for emissions data
  2. Computes cosine similarity against page embeddings
  3. Applies percentile and top-k/min-k filtering to select the most relevant pages (see Search Configuration below)

flowchart LR
    Q[Query: What are the total CO2 emissions...] --> E[Query Embedding]
    E --> S[Similarity Search]
    DB[(Page Embeddings)] --> S
    S --> R[Top-K Pages]
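
Steps 2 and 3 amount to a cosine-similarity ranking over the stored vectors. A minimal NumPy sketch (names are illustrative; page_embeddings is a 2D array with one row per page):

import numpy as np

def rank_pages(query_embedding, page_embeddings):
    # Normalize so the dot product equals cosine similarity
    q = query_embedding / np.linalg.norm(query_embedding)
    P = page_embeddings / np.linalg.norm(page_embeddings, axis=1, keepdims=True)
    scores = P @ q
    # Page indices, most similar first, plus the raw scores
    return np.argsort(scores)[::-1], scores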

Search Configuration

Control retrieval behavior in climatextract.toml:

[extraction]
percentile_threshold = 95 # Score cutoff percentile: keep the top 5% most similar pages
similarity_top_k = 7      # Maximum pages to retrieve; overrides percentile_threshold for long documents
similarity_min_k = 4      # Minimum pages to retrieve; overrides percentile_threshold for short documents

# Context window for semantic search (default: 0, i.e. no adjacent pages are included)
context_window = 0
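
These three settings could interact roughly as follows. This is a sketch of the selection logic under the interpretation above, not climatextract's exact implementation:

import numpy as np

def select_k(scores, percentile_threshold=95, top_k=7, min_k=4):
    # Count pages at or above the percentile cutoff...
    cutoff = np.percentile(scores, percentile_threshold)
    n_above = int(np.sum(scores >= cutoff))
    # ...but never retrieve fewer than min_k or more than top_k pages
    return int(np.clip(n_above, min_k, top_k))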

Generation Stage

Page-by-Page Extraction

Each relevant page is processed independently:

flowchart TD
    P1[Page 42] --> LLM1[LLM Call]
    P2[Page 43] --> LLM2[LLM Call]
    P3[Page 45] --> LLM3[LLM Call]
    LLM1 --> M[Merge Results]
    LLM2 --> M
    LLM3 --> M
    M --> O[Final Output]

This approach:

  • Handles multi-page reports effectively
  • Allows parallel processing for speed
  • Provides page-level traceability
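
A sketch of this fan-out/merge pattern with asyncio. Here llm_extract is a hypothetical per-page LLM call returning a list of KPI dicts, and pages are assumed to carry .text and .number attributes:

import asyncio

async def extract_report(pages, llm_extract):
    # One independent LLM call per retrieved page, run concurrently
    per_page = await asyncio.gather(*(llm_extract(p.text) for p in pages))
    # Merge: concatenate KPI entries, keeping page-level traceability
    merged = []
    for page, entries in zip(pages, per_page):
        merged.extend({**e, "source_page": page.number} for e in entries)
    return merged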

Structured Output

The LLM is prompted to return data in a specific format:

{
  "KPI_Entries": [
    {"year": 2023, "scope": "1", "value": 55000.0, "unit": "tCO2e"},
    {"year": 2023, "scope": "2", "value": 120000.0, "unit": "tCO2e"}
  ]
}

Pydantic models validate the output structure.
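
A minimal sketch of such models, matching the JSON above (the actual model names in climatextract may differ):

from pydantic import BaseModel

class KPIEntry(BaseModel):
    year: int
    scope: str
    value: float
    unit: str

class ExtractionResult(BaseModel):
    KPI_Entries: list[KPIEntry]

# Raises a ValidationError if the LLM response deviates from the schema
result = ExtractionResult.model_validate_json(llm_response_text)  # raw JSON string from the LLM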


Vector Database

climatextract uses DuckDB for embedding storage:

  • File-based (no server required)
  • Efficient similarity search with vector extensions
  • Persists embeddings across runs (avoids re-embedding)

The database stores:

  • PDF file metadata
  • Page text and embeddings
  • Search query embeddings (cached)
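
A sketch of a similarity query against such a store. The table and column names here are illustrative; list_cosine_similarity is a built-in DuckDB function, and query_embedding is assumed to be a Python list of floats computed earlier:

import duckdb

con = duckdb.connect("embeddings.duckdb")  # illustrative file name
top_pages = con.execute(
    """
    SELECT page_id, list_cosine_similarity(embedding, ?) AS score
    FROM page_embeddings
    ORDER BY score DESC
    LIMIT 7
    """,
    [query_embedding],
).fetchall()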

Performance Considerations

Caching

Pre-embed your PDF corpus once. Subsequent runs will reuse cached embeddings, significantly reducing API costs and processing time.

Parallel Processing

LLM calls are made concurrently. Adjust max_parallel_llm_prompts_running based on your API rate limits.
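
A concurrency limit of this kind is typically enforced with a semaphore. A sketch of the idea, not climatextract's actual code:

import asyncio

# e.g. max_parallel_llm_prompts_running = 4 in climatextract.toml
semaphore = asyncio.Semaphore(4)

async def bounded_llm_call(llm_call, prompt):
    # At most 4 prompts are in flight at any one time
    async with semaphore:
        return await llm_call(prompt)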


Next Steps