API Reference

This page documents the public API of climatextract: `extract` and `extract_and_evaluate`. These two functions cover typical usage.

`extract`

Extract CO₂ emissions data from PDF reports.

from climatextract import extract

result_path = extract(
    pdf_input="./data/pdfs/company_report.pdf",
    config_path="climatextract.toml",
    enable_mlflow=False
)

Parameters

Parameter	Type	Default	Description
`pdf_input`	`str | List[str] | None`	`None`	A directory path (processes all PDFs), a single file path, or a list of file paths. If `None`, uses `filename_list` from config.
`config_path`	`str`	`"climatextract.toml"`	Path to configuration file.
`enable_mlflow`	`bool`	`False`	Whether to log results to MLflow. If `True`, uses MLflow settings from config.
`verbose`	`bool`	`False`	Show detailed per-PDF output.
`llm`	`LlmHandler | None`	`None`	Custom LLM handler (keyword-only). If `None`, uses the default Azure AI Foundry adapter. See Custom Providers.
`embedder`	`EmbeddingModelHandler | None`	`None`	Custom embedding handler (keyword-only). If `None`, uses the default Azure AI Foundry adapter. See Custom Providers.
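Because `pdf_input` accepts three different shapes, it can help to check paths before starting a long run. The following is a minimal sketch of such a pre-flight check. `resolve_pdf_input` is a hypothetical helper, not part of climatextract, and the assumption that only top-level `*.pdf` files in a directory matter may not match how the library scans directories.

```python
from pathlib import Path
from typing import List, Optional, Union


def resolve_pdf_input(pdf_input: Union[str, List[str], None]) -> Optional[List[str]]:
    """Expand a directory, single file, or list of files into existing PDF paths.

    Hypothetical pre-flight helper; it is not part of climatextract.
    Returns None unchanged so the caller can fall back to the config.
    """
    if pdf_input is None:
        return None
    candidates = [pdf_input] if isinstance(pdf_input, str) else list(pdf_input)
    resolved: List[str] = []
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            # Assumption: only top-level *.pdf files in a directory are wanted.
            resolved.extend(str(p) for p in sorted(path.glob("*.pdf")))
        elif path.is_file() and path.suffix.lower() == ".pdf":
            resolved.append(str(path))
        else:
            raise FileNotFoundError(f"Not a PDF file or directory: {candidate}")
    return resolved
```

The returned list can be passed straight to `extract`, since a list of file paths is one of the accepted forms of `pdf_input`.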

Returns

Type	Description
`str | None`	Path to the results directory, or `None` if extraction failed.
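Since the return value may be `None`, callers probably want to check it before touching the results directory. A small sketch of one way to do that follows. `require_results_dir` is a hypothetical wrapper, and raising an exception on failure is a design choice made here, not something the library does.

```python
from pathlib import Path
from typing import Optional


def require_results_dir(result_path: Optional[str]) -> Path:
    """Turn the `str | None` return value into a Path, failing loudly on None."""
    if result_path is None:
        raise RuntimeError("Extraction failed: no results directory was returned")
    results_dir = Path(result_path)
    if not results_dir.is_dir():
        raise NotADirectoryError(f"Results directory not found: {results_dir}")
    return results_dir


# Usage (requires climatextract and a valid config):
# from climatextract import extract
# results_dir = require_results_dir(extract("./data/pdfs/report.pdf"))
# for item in sorted(results_dir.iterdir()):
#     print(item.name)
```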

Examples

Single PDF:

result = extract("./data/pdfs/report.pdf")

Directory of PDFs:

result = extract("./data/pdfs/sample_reports/")

Multiple specific files:

result = extract([
    "./data/pdfs/report1.pdf",
    "./data/pdfs/report2.pdf"
])

With MLflow tracking:

result = extract(
    pdf_input="./data/pdfs/",
    enable_mlflow=True
)

With explicitly constructed handlers (shown here with the Azure AI Foundry adapter classes):

from climatextract import extract
from climatextract.adapters.azure_ai_foundry import (
    AzureAIFoundryEmbeddingHandler,
    AzureAIFoundryLlmHandler,
)

llm = AzureAIFoundryLlmHandler()
embedder = AzureAIFoundryEmbeddingHandler()

result_path = extract("./data/pdfs/", llm=llm, embedder=embedder)

`extract_and_evaluate`

Extract CO₂ emissions data and evaluate against a gold standard dataset.

from climatextract import extract_and_evaluate

result_path = extract_and_evaluate(
    pdf_input="./data/pdfs/sample_reports/",
    gold_standard_path="./data/evaluation_dataset/gold_standard.csv",
    config_path="climatextract.toml",
    enable_mlflow=False
)

Parameters

Parameter	Type	Default	Description
`pdf_input`	`str | List[str] | None`	`None`	A directory path, a single file path, or a list of file paths. If `None`, uses `filename_list` from config.
`gold_standard_path`	`str | None`	`None`	Path to gold standard CSV. If `None`, uses `gold_standard` from config.
`config_path`	`str`	`"climatextract.toml"`	Path to configuration file.
`enable_mlflow`	`bool`	`False`	Whether to log results and metrics to MLflow.
`verbose`	`bool`	`False`	Show detailed per-PDF output.
`llm`	`LlmHandler | None`	`None`	Custom LLM handler (keyword-only). If `None`, uses the default Azure AI Foundry adapter. See Custom Providers.
`embedder`	`EmbeddingModelHandler | None`	`None`	Custom embedding handler (keyword-only). If `None`, uses the default Azure AI Foundry adapter. See Custom Providers.

Returns

Type	Description
`str | None`	Path to the results directory (includes evaluation files), or `None` if the run failed.

Examples

Basic evaluation:

result = extract_and_evaluate(
    pdf_input="./data/pdfs/test_set/",
    gold_standard_path="./data/evaluation_dataset/gold_standard.csv"
)

Using config defaults:

# Uses filename_list and gold_standard from climatextract.toml
result = extract_and_evaluate()

With MLflow tracking:

result = extract_and_evaluate(
    pdf_input="./data/pdfs/",
    gold_standard_path="./data/evaluation/gold.csv",
    enable_mlflow=True
)

Import

Both functions are available directly from the climatextract package:

from climatextract import extract, extract_and_evaluate

Next Steps

Configuration – Customize extraction behavior
Background – Academic context and motivation