
Evaluation

climatextract includes a comprehensive evaluation framework to measure extraction quality against gold standard datasets. This page explains the evaluation metrics and process.


Evaluation Workflow

flowchart LR
    E[Extracted Results] --> M[Merge]
    G[Gold Standard] --> M
    M --> C[Compare]
    C --> Metrics[Precision / Recall / F1]
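Conceptually, this amounts to merging two flat tables and counting agreements. Below is a minimal sketch of that idea with pandas, assuming both sides share the columns report_name, year, scope, and value; the file paths and the counting logic are illustrative, not climatextract's internal implementation:

import pandas as pd

# Paths here are placeholders for this sketch.
extracted = pd.read_csv("output/extracted_results.csv")
gold = pd.read_csv("data/evaluation_dataset/gold_standard.csv")

# Merge on the identifying columns so each gold row meets its extracted counterpart.
merged = gold.merge(
    extracted,
    on=["report_name", "year", "scope"],
    how="outer",
    suffixes=("_gold", "_extracted"),
    indicator=True,
)

both = merged["_merge"] == "both"
correct = both & (merged["value_gold"] == merged["value_extracted"])
mismatch = both & ~correct

# A mismatched value both misses the gold entry and adds a wrong extraction,
# so in this sketch it counts toward the false negatives and the false positives.
tp = int(correct.sum())
fp = int((merged["_merge"] == "right_only").sum() + mismatch.sum())
fn = int((merged["_merge"] == "left_only").sum() + mismatch.sum())

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0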

Gold Standard Dataset

The evaluation uses human-annotated ground truth data:

  • Format: CSV with expected emissions values
  • Columns: report_name, year, scope, value, unit
  • Default: data/evaluation_dataset/gold_standard.csv
The path can be set in the configuration:

[evaluation]
gold_standard = "data/evaluation_dataset/gold_standard.csv"
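For illustration, the file might contain rows like the following (the report name and figures are made up):

report_name,year,scope,value,unit
acme_sustainability_2022,2022,1,12500,tCO2e
acme_sustainability_2022,2022,2,48300,tCO2e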

Evaluation Output

Calling extract_and_evaluate() produces:

  • Row-by-row comparison with the benchmark
  • Per-document metrics with error types
  • Overall precision, recall, and F1

Metrics Explained

Precision

The proportion of extracted values that were correct:

\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]

High precision = Few incorrect extractions

Recall

The proportion of expected values that were found:

\[ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]

High recall = Few missed values

F1 Score

Harmonic mean of precision and recall:

\[ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
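As a worked example with made-up numbers: if 8 of 10 extracted values match the gold standard and 2 expected values were not extracted (TP = 8, FP = 2, FN = 2), then

\[ \text{Precision} = \frac{8}{10} = 0.8, \quad \text{Recall} = \frac{8}{10} = 0.8, \quad \text{F1} = 2 \times \frac{0.8 \times 0.8}{0.8 + 0.8} = 0.8 \]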

Match Classification

Values are classified as:

Type           | Description
True Positive  | Extracted value matches the gold standard
False Positive | Extracted value not in the gold standard
False Negative | Gold standard value not extracted
True Negative  | No value extracted for a given scope/year, and none given in the gold standard

Matching considers:

  • Year and scope must match exactly
  • Value must be within tolerance (default: exact match)
  • Unit normalization is applied before comparison
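A minimal sketch of such a match check, purely for illustration (the unit table, row format, and function names are assumptions, not climatextract internals):

# Illustrative match check covering the three criteria above.
UNIT_FACTORS = {"tco2e": 1.0, "ktco2e": 1_000.0}  # hypothetical normalization table

def normalize(value: float, unit: str) -> float:
    """Convert a value to a common base unit (here: tCO2e)."""
    return value * UNIT_FACTORS[unit.strip().lower()]

def is_match(extracted: dict, gold: dict, tolerance: float = 0.0) -> bool:
    """Return True if an extracted row counts as a true positive."""
    if extracted["year"] != gold["year"] or extracted["scope"] != gold["scope"]:
        return False  # year and scope must match exactly
    diff = abs(normalize(extracted["value"], extracted["unit"])
               - normalize(gold["value"], gold["unit"]))
    return diff <= tolerance  # default tolerance of 0.0 means exact match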

Running Evaluation

From Python:

from climatextract import extract_and_evaluate

# Extract emissions data from every PDF in the folder and compare the
# results against the gold standard; the returned path points to the results.
result_path = extract_and_evaluate(
    pdf_input="./data/pdfs/sample/",
    gold_standard_path="./data/evaluation_dataset/gold_standard.csv"
)

Output Files

Evaluation creates additional files in the output directory:

File                                    | Contents
eval_results_vs_benchmark.csv           | Results matched row-wise with the benchmark data
eval_results_metrics_by_ReportName.csv  | Metrics aggregated per report
config_and_metrics.json                 | Configuration plus extraction and evaluation metrics
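The per-report metrics file, for example, can be inspected directly with pandas (the output directory below is a placeholder; substitute the one from your run):

import pandas as pd

metrics = pd.read_csv("output/eval_results_metrics_by_ReportName.csv")
print(metrics.head())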

MLflow Integration

When MLflow is enabled, metrics are logged for experiment tracking:

  • Per-run precision, recall, F1
  • Comparison with previous runs
  • Hyperparameter tracking
The experiment to log to is set in the configuration:

[mlflow]
experiment_name = "/Shared/Experiments/precision_recall_analysis"
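For orientation, logging metrics to that experiment with the MLflow client looks roughly like this (a generic sketch with made-up values and metric names, not climatextract's actual logging code):

import mlflow

mlflow.set_experiment("/Shared/Experiments/precision_recall_analysis")

with mlflow.start_run():
    # Parameter and metric names/values below are illustrative.
    mlflow.log_param("tolerance", 0.0)
    mlflow.log_metric("precision", 0.92)
    mlflow.log_metric("recall", 0.88)
    mlflow.log_metric("f1", 0.90)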

Next Steps