API Reference

This page documents the public API of climatextract: extract and extract_and_evaluate. These two functions are all that typical usage requires.


extract

Extract CO₂ emissions data from PDF reports.

from climatextract import extract

result_path = extract(
    pdf_input="./data/pdfs/company_report.pdf",
    config_path="climatextract.toml",
    enable_mlflow=False
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pdf_input | str \| List[str] \| None | None | A directory path (processes all PDFs), a single file path, or a list of file paths. If None, uses filename_list from config. |
| config_path | str | "climatextract.toml" | Path to configuration file. |
| enable_mlflow | bool | False | Whether to log results to MLflow. If True, uses MLflow settings from config. |
| verbose | bool | False | Show detailed per-PDF output. |
| llm | LlmHandler \| None | None | Custom LLM handler (keyword-only). If None, uses the default Azure AI Foundry adapter. See Custom Providers. |
| embedder | EmbeddingModelHandler \| None | None | Custom embedding handler (keyword-only). If None, uses the default Azure AI Foundry adapter. See Custom Providers. |

Returns

| Type | Description |
| --- | --- |
| str \| None | Path to the results directory, or None if extraction failed. |
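
Because a failed extraction is reported as a None return value, callers may want to check the result before touching the results directory. The following is a minimal sketch of that check; list_outputs is a hypothetical helper (not part of climatextract), and it assumes nothing about which files the results directory contains.

```python
from pathlib import Path
from typing import List, Optional


def list_outputs(result_path: Optional[str]) -> List[Path]:
    """Return the entries of the results directory, sorted by path.

    Hypothetical helper: raises if the extraction call returned None.
    """
    if result_path is None:
        raise RuntimeError("extraction failed: no results directory was returned")
    return sorted(Path(result_path).iterdir())


# Usage (requires climatextract to be installed and configured):
# from climatextract import extract
# for entry in list_outputs(extract("./data/pdfs/report.pdf")):
#     print(entry.name)
```

Raising on None is only one option; a batch script might instead log the failure and continue with the next input.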

Examples

Single PDF:

result = extract("./data/pdfs/report.pdf")

Directory of PDFs:

result = extract("./data/pdfs/sample_reports/")

Multiple specific files:

result = extract([
    "./data/pdfs/report1.pdf",
    "./data/pdfs/report2.pdf"
])
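
When the files of interest are a subset of a directory, the list form of pdf_input can be built programmatically instead of typed out. This sketch uses only the standard library; select_pdfs and the "*_2023.pdf" pattern in the usage comment are hypothetical, chosen purely for illustration.

```python
from pathlib import Path
from typing import List


def select_pdfs(directory: str, pattern: str = "*.pdf") -> List[str]:
    """Return the sorted paths of regular files in `directory` matching `pattern`."""
    return sorted(str(p) for p in Path(directory).glob(pattern) if p.is_file())


# Usage (requires climatextract to be installed and configured):
# from climatextract import extract
# result = extract(select_pdfs("./data/pdfs/", "*_2023.pdf"))
```

The glob is not recursive, so PDFs in subdirectories are not picked up.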

With MLflow tracking:

result = extract(
    pdf_input="./data/pdfs/",
    enable_mlflow=True
)

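With explicitly supplied handlers (shown here with the default Azure AI Foundry adapter; custom provider handlers are passed the same way):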

from climatextract import extract
from climatextract.adapters.azure_ai_foundry import (
    AzureAIFoundryEmbeddingHandler,
    AzureAIFoundryLlmHandler,
)

llm = AzureAIFoundryLlmHandler()
embedder = AzureAIFoundryEmbeddingHandler()

result_path = extract("./data/pdfs/", llm=llm, embedder=embedder) 

extract_and_evaluate

Extract CO₂ emissions data and evaluate against a gold standard dataset.

from climatextract import extract_and_evaluate

result_path = extract_and_evaluate(
    pdf_input="./data/pdfs/sample_reports/",
    gold_standard_path="./data/evaluation_dataset/gold_standard.csv",
    config_path="climatextract.toml",
    enable_mlflow=False
)

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pdf_input | str \| List[str] \| None | None | A directory path, single file path, or list of file paths. If None, uses config. |
| gold_standard_path | str \| None | None | Path to gold standard CSV. If None, uses gold_standard from config. |
| config_path | str | "climatextract.toml" | Path to configuration file. |
| enable_mlflow | bool | False | Whether to log results and metrics to MLflow. |
| verbose | bool | False | Show detailed per-PDF output. |
| llm | LlmHandler \| None | None | Custom LLM handler (keyword-only). If None, uses the default Azure AI Foundry adapter. See Custom Providers. |
| embedder | EmbeddingModelHandler \| None | None | Custom embedding handler (keyword-only). If None, uses the default Azure AI Foundry adapter. See Custom Providers. |

Returns

| Type | Description |
| --- | --- |
| str \| None | Path to the results directory (includes evaluation files), or None if the run failed. |

Examples

Basic evaluation:

result = extract_and_evaluate(
    pdf_input="./data/pdfs/test_set/",
    gold_standard_path="./data/evaluation_dataset/gold_standard.csv"
)
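
An evaluation run depends on two inputs existing on disk, so it can be worth checking both before starting. The sketch below shows one possible pre-flight check. missing_inputs is a hypothetical helper, not part of climatextract, and whether climatextract performs its own validation is not stated on this page.

```python
from pathlib import Path
from typing import List


def missing_inputs(pdf_input: str, gold_standard_path: str) -> List[str]:
    """Return human-readable problems with the inputs; empty if none were found."""
    problems = []
    # pdf_input may be a directory or a single file, so only existence is checked.
    if not Path(pdf_input).exists():
        problems.append(f"PDF input not found: {pdf_input}")
    # The gold standard must be a file (a CSV, per the parameter table).
    if not Path(gold_standard_path).is_file():
        problems.append(f"gold standard CSV not found: {gold_standard_path}")
    return problems


# Usage (requires climatextract to be installed and configured):
# from climatextract import extract_and_evaluate
# pdfs, gold = "./data/pdfs/test_set/", "./data/evaluation_dataset/gold_standard.csv"
# if not missing_inputs(pdfs, gold):
#     result = extract_and_evaluate(pdf_input=pdfs, gold_standard_path=gold)
```

The helper covers only the single-path form of pdf_input; a list of paths would need each entry checked in turn.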

Using config defaults:

# Uses filename_list and gold_standard from climatextract.toml
result = extract_and_evaluate()

With MLflow tracking:

result = extract_and_evaluate(
    pdf_input="./data/pdfs/",
    gold_standard_path="./data/evaluation/gold.csv",
    enable_mlflow=True
)

Import

Both functions are available directly from the climatextract package:

from climatextract import extract, extract_and_evaluate

Next Steps