# Configuration

climatextract uses a TOML configuration file (`climatextract.toml`) to control all aspects of extraction. This guide explains each option.
## Configuration File

Create a `climatextract.toml` file in your working directory. You can also point the tool at a configuration file in a different location.
## Input Configuration

Control which PDF files to process:
```toml
[input]
# Option 1: List specific files
filename_list = [
    "/data/pdfs/company_2022_report.pdf",
    "/data/pdfs/company_2023_report.pdf",
]

# Option 2: Process all PDFs in a directory
# filename_list = "./data/pdfs/"
```
## Model Configuration

Select which models to use for embedding and extraction:
```toml
[models]
# LLM model for extraction
llm_model = "gpt-5-nano"

# Embedding model for semantic search
emb_model = "text-embedding-ada-002"

# Maximum concurrent LLM API calls (adjust based on rate limits)
max_parallel_llm_prompts_running = 4

# Maximum concurrent embedding API calls (adjust based on rate limits)
max_parallel_embedding_calls = 30
```
### Available LLM Models

Models supported by the default Azure AI Foundry adapter: `gpt-5-nano` (default), `gpt-5-chat`, `gpt-5.2-chat`, `gpt-4.1`, `gpt-4o`, `gpt-4o-mini`, `gpt-oss-120b`, `Llama-4-Maverick-17B-128E-Instruct-FP8`, `o3-mini`. To route to a different provider, see Custom Providers.
### Available Embedding Models

Models supported by the default Azure AI Foundry adapter: `text-embedding-ada-002`, `text-embedding-3-large`.
Suggested values for `max_parallel_llm_prompts_running` per model:

| Model | Suggested concurrency |
|---|---|
| gpt-4o, gpt-4o-mini, gpt-oss-120b | 25 |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | 25 |
| gpt-5-chat, gpt-5.2-chat | 8 |
| gpt-4.1 | 4 |
| o3-mini | 2 |
Suggested values for `max_parallel_embedding_calls` per model:

| Model | Suggested concurrency |
|---|---|
| text-embedding-ada-002 | 30 |
| text-embedding-3-large | 50 |
These are starting points; your actual ceiling depends on your Azure deployment tier and regional quota. Above the ceiling, rate-limit retries slow things down rather than speed them up.
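Conceptually, these concurrency caps behave like a semaphore around the API calls: at most N requests are in flight at once, and the rest wait their turn. A minimal sketch of the idea, not climatextract's actual implementation:

```python
import asyncio

MAX_PARALLEL = 4  # plays the role of max_parallel_llm_prompts_running

async def call_llm(sem: asyncio.Semaphore, prompt: str) -> str:
    async with sem:  # at most MAX_PARALLEL calls run concurrently
        await asyncio.sleep(0.01)  # stand-in for the real API round trip
        return f"answer for {prompt!r}"

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_PARALLEL)
    prompts = [f"page {i}" for i in range(10)]
    return await asyncio.gather(*(call_llm(sem, p) for p in prompts))

results = asyncio.run(main())
print(len(results))  # 10
```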
## Extraction Parameters

Fine-tune the extraction behavior:
```toml
[extraction]
# Year range for emissions data
year_min = 2013
year_max = 2024

# Input mode: "text" (default) or "text+table"
input_mode = "text"

# Only embed documents, skip extraction (default: false)
embed_only = false

# Prompt type: "default" or "structured_json"
prompt_type = "default"

# Semantic search settings
percentile_threshold = 95  # score cutoff percentile: keep the 5% most similar pages, discard the rest
similarity_top_k = 7       # maximum pages to retrieve; overrides percentile_threshold for long documents
similarity_min_k = 4       # minimum pages to retrieve; overrides percentile_threshold for short documents

# Context window for semantic search (default: 0, meaning no adjacent pages are used)
context_window = 0

# Custom path to a DuckDB embeddings file (optional)
# If omitted, the default is data/processed/embeddings/{emb_model}_from_2025_03_06.duckdb
# embeddings_repository = "./data/processed/embeddings/custom_embeddings.duckdb"
```
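The interplay of the three retrieval settings can be sketched as follows; this is a simplified model for illustration, not climatextract's actual code:

```python
def num_pages_to_keep(scores: list[float],
                      percentile_threshold: float = 95,
                      top_k: int = 7,
                      min_k: int = 4) -> int:
    """Simplified model of how the retrieval settings interact."""
    # Pages in the top (100 - percentile_threshold)% by similarity score.
    above_cutoff = max(1, round(len(scores) * (100 - percentile_threshold) / 100))
    # top_k caps the count for long documents; min_k raises it for short ones.
    return min(top_k, max(min_k, above_cutoff))

# A 200-page report: the 95th-percentile cutoff keeps 10 pages, capped at top_k = 7.
print(num_pages_to_keep([0.5] * 200))  # 7
```

For a 20-page document only 1 page clears the percentile cutoff, so `similarity_min_k` raises the count to 4.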
## Output Configuration

Control where results are saved:
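As in the full example at the end of this guide:

```toml
[output]
# Directory where result files are written
output_dir = "output"
```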
## Evaluation Configuration

Configure evaluation against a gold standard:
```toml
[evaluation]
# Path to gold standard dataset
gold_standard = "data/evaluation_dataset/gold_standard.csv"
```
Evaluation runs whenever you call `extract_and_evaluate()`; `extract()` skips it.
## Sharing Large Files

Optional Azure Blob Storage paths for shared PDF files and embedding databases:
```toml
[datalake]
# Blob paths: "container_name" or "container_name/subfolder/path"
blob_path_pdfs = "pdfs"
blob_path_embeddings = "embeddings"
```
See Sharing Large Files for details on setting up and using the data lake.
## MLflow Tracking

Optional experiment tracking with MLflow:
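In the full example at the end of this guide, the section sets only an experiment name:

```toml
[mlflow]
experiment_name = "climatextract_experiments"
```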
The tracking URI (`databricks`, `./mlruns`, or a server URL) is configured via the `MLFLOW_TRACKING_URI` environment variable in your `.env` file. See MLflow setup for details.
## Full Example

Here's a complete configuration file showing all available options:
```toml
[input]
filename_list = ["/data/pdfs/company_2023_report.pdf"]

[models]
llm_model = "gpt-5-nano"
emb_model = "text-embedding-ada-002"
max_parallel_llm_prompts_running = 4
max_parallel_embedding_calls = 30

[extraction]
year_min = 2018
year_max = 2024
input_mode = "text"        # "text" (default) or "text+table"
embed_only = false         # only embed, skip extraction
prompt_type = "default"    # "default" or "structured_json"
context_window = 0         # context window for semantic search
similarity_top_k = 7       # max pages to retrieve
similarity_min_k = 4       # min pages to retrieve
percentile_threshold = 95  # score cutoff percentile
# embeddings_repository = "./data/processed/embeddings/custom_embeddings.duckdb"

[output]
output_dir = "output"

[evaluation]
gold_standard = "data/evaluation_dataset/gold_standard.csv"

[datalake]
blob_path_pdfs = "pdfs"             # "container" or "container/subfolder"
blob_path_embeddings = "embeddings" # "container" or "container/subfolder"

[mlflow]
experiment_name = "climatextract_experiments"
```
## Next Steps
- Running Extraction – How to run the pipeline
- Understanding Output – What the output files contain