# Configuration

climatextract uses a TOML configuration file (`climatextract.toml`) to control all aspects of extraction. This guide explains each option.
## Configuration File

Create a `climatextract.toml` file in your working directory. You can also point the tool at a configuration file in a different location.
## Input Configuration

Control which PDF files to process:
```toml
[input]
# Option 1: List specific files
filename_list = [
    "/data/pdfs/company_2022_report.pdf",
    "/data/pdfs/company_2023_report.pdf",
]

# Option 2: Process all PDFs in a directory
# filename_list = "./data/pdfs/"
```
## Model Configuration

Select which models to use for embedding and extraction:
```toml
[models]
# LLM model for extraction
llm_model = "gpt-5-nano"

# Embedding model for semantic search
emb_model = "text-embedding-ada-002"

# Maximum concurrent LLM API calls (adjust based on rate limits)
max_parallel_llm_prompts_running = 4

# Maximum concurrent embedding API calls (adjust based on rate limits)
max_parallel_embedding_calls = 30
```
### Available LLM Models

Models supported by the default Azure AI Foundry adapter: `gpt-5-nano` (default), `gpt-5-chat`, `gpt-5.2-chat`, `gpt-4.1`, `gpt-4o`, `gpt-4o-mini`, `gpt-oss-120b`, `Llama-4-Maverick-17B-128E-Instruct-FP8`, `o3-mini`. To route to a different provider, see Custom Providers.
### Available Embedding Models

Models supported by the default Azure AI Foundry adapter: `text-embedding-ada-002`, `text-embedding-3-large`.
Suggested values for `max_parallel_llm_prompts_running` per model:

| Model | Suggested concurrency |
|---|---|
| gpt-4o, gpt-4o-mini, gpt-oss-120b | 25 |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | 25 |
| gpt-5-chat, gpt-5.2-chat | 8 |
| gpt-4.1 | 4 |
| o3-mini | 2 |
Suggested values for `max_parallel_embedding_calls` per model:

| Model | Suggested concurrency |
|---|---|
| text-embedding-ada-002 | 30 |
| text-embedding-3-large | 50 |
These are starting points; your actual ceiling depends on your Azure deployment tier and regional quota. Above the ceiling, rate-limit retries slow things down rather than speed them up.
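Conceptually, these concurrency caps behave like a semaphore around the API calls: at most N requests are in flight at once, and the rest wait their turn. A minimal sketch of the idea, not climatextract's actual implementation:

```python
import asyncio

MAX_PARALLEL = 4  # plays the role of max_parallel_llm_prompts_running

async def call_llm(sem: asyncio.Semaphore, prompt: str) -> str:
    async with sem:  # at most MAX_PARALLEL calls run concurrently
        await asyncio.sleep(0.01)  # stand-in for the real API round trip
        return f"answer for {prompt!r}"

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_PARALLEL)
    prompts = [f"page {i}" for i in range(10)]
    return await asyncio.gather(*(call_llm(sem, p) for p in prompts))

results = asyncio.run(main())
print(len(results))  # 10
```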
## Extraction Parameters

Fine-tune the extraction behavior:
```toml
[extraction]
# Year range for emissions data
year_min = 2013
year_max = 2024

# Input mode: "text" (default) or "text+table"
input_mode = "text"

# Only embed documents, skip extraction (default: false)
embed_only = false

# Prompt type: "default" or "structured_json"
prompt_type = "default"

# Semantic search settings
percentile_threshold = 95  # score cutoff percentile: keep the 5% most similar pages, discard the rest
similarity_top_k = 7       # maximum pages to retrieve; overrides percentile_threshold for long documents
similarity_min_k = 4       # minimum pages to retrieve; overrides percentile_threshold for short documents

# Context window for semantic search (default: 0, meaning no adjacent pages are used)
context_window = 0

# Custom path to a DuckDB embeddings file (optional)
# If omitted, the default is data/processed/embeddings/{emb_model}_from_2025_03_06.duckdb
# embeddings_repository = "./data/processed/embeddings/custom_embeddings.duckdb"
```
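The interplay of the three retrieval settings can be sketched as follows; this is a simplified model for illustration, not climatextract's actual code:

```python
def num_pages_to_keep(scores: list[float],
                      percentile_threshold: float = 95,
                      top_k: int = 7,
                      min_k: int = 4) -> int:
    """Simplified model of how the retrieval settings interact."""
    # Pages in the top (100 - percentile_threshold)% by similarity score.
    above_cutoff = max(1, round(len(scores) * (100 - percentile_threshold) / 100))
    # top_k caps the count for long documents; min_k raises it for short ones.
    return min(top_k, max(min_k, above_cutoff))

# A 200-page report: the 95th-percentile cutoff keeps 10 pages, capped at top_k = 7.
print(num_pages_to_keep([0.5] * 200))  # 7
```

For a 20-page document only 1 page clears the percentile cutoff, so `similarity_min_k` raises the count to 4.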
## Output Configuration

Control where results are saved:
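As in the full example at the end of this guide:

```toml
[output]
# Directory where result files are written
output_dir = "output"
```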
## Evaluation Configuration

Configure evaluation against a gold standard:
```toml
[evaluation]
# Path to gold standard dataset
gold_standard = "data/evaluation_dataset/gold_standard.csv"
```
Evaluation runs whenever you call `extract_and_evaluate()`; `extract()` skips it.
## Sharing Large Files

Optional Azure Blob Storage paths for shared PDF files and embedding databases:
```toml
[datalake]
# Blob paths: "container_name" or "container_name/subfolder/path"
blob_path_pdfs = "pdfs"
blob_path_embeddings = "embeddings"
```
See Sharing Large Files for details on setting up and using the data lake.
## MLflow Tracking

Optional experiment tracking with MLflow:
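In the full example at the end of this guide, the section sets only an experiment name:

```toml
[mlflow]
experiment_name = "climatextract_experiments"
```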
The tracking URI (`databricks`, `./mlruns`, or a server URL) is configured via the `MLFLOW_TRACKING_URI` environment variable in your `.env` file. See MLflow setup for details.
## Full Example

Here's a complete configuration file showing all available options:
```toml
[input]
filename_list = ["/data/pdfs/company_2023_report.pdf"]

[models]
llm_model = "gpt-5-nano"
emb_model = "text-embedding-ada-002"
max_parallel_llm_prompts_running = 4
max_parallel_embedding_calls = 30

[extraction]
year_min = 2018
year_max = 2024
input_mode = "text"        # "text" (default) or "text+table"
embed_only = false         # only embed, skip extraction
prompt_type = "default"    # "default" or "structured_json"
context_window = 0         # context window for semantic search
similarity_top_k = 7       # max pages to retrieve
similarity_min_k = 4       # min pages to retrieve
percentile_threshold = 95  # score cutoff percentile
# embeddings_repository = "./data/processed/embeddings/custom_embeddings.duckdb"

[output]
output_dir = "output"

[evaluation]
gold_standard = "data/evaluation_dataset/gold_standard.csv"

[datalake]
blob_path_pdfs = "pdfs"             # "container" or "container/subfolder"
blob_path_embeddings = "embeddings" # "container" or "container/subfolder"

[mlflow]
experiment_name = "climatextract_experiments"
```
## Next Steps
- Running Extraction – How to run the pipeline
- Understanding Output – What the output files contain