Shubham Gupta – Curriculum Learning 🤝 DSPy: Optimization

In Part 1 we carved the ConvFinQA corpus into curriculum-aware difficulty tiers through some EDA. Part 2 turned that insight into zero-shot baselines, pitting several LLMs against each tier. The hard bucket still hurt, which is proof that smarter prompting, not bigger models, is our next lever.

In this installment, instead of fine-tunes we’ll use DSPy’s programmatic prompt optimizers to lift accuracy beyond the Part 2 ceiling. We’ll iterate over the shortlisted models:

o4-mini,
o3
gemini-2.5-flash (this couldn’t be completed due to rate limits unfortunately)

We feed each model curriculum-ordered exemplars, and let DSPy search the prompt space under tight token and latency caps.

Setup

Baseline recap

Zero-shot baselines from Part 2 (https://shubhamg.in/posts/2025-09-01-cl-dspy-modelling/) established a ceiling on Medium/Hard tiers.
Shortlisted models: o4-mini, o3, gemini-2.5-flash.
Common failure modes: multi-op arithmetic chains, cross-turn dependencies, long-context/table lookups.
Objective here: improve Medium/Hard without finetuning via DSPy prompt optimizers, comparing against Part 2 metrics.

Let’s copy over some of the code from the previous notebook here, before we start optimising our models!

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.dspy.autolog(log_compiles=True, log_evals=True, log_traces_from_compile=True)
mlflow.set_experiment("DSPy Optimization")

<Experiment: artifact_location='mlflow-artifacts:/3', creation_time=1753708026551, experiment_id='3', last_update_time=1753708026551, lifecycle_stage='active', name='DSPy Optimization', tags={}>

import os

import dotenv
import dspy

dotenv.load_dotenv("../.env")
MAX_TOKENS = 20_000

lm_oai_gpt_4_1 = dspy.LM(
    "openai/gpt-4.1-2025-04-14",
    api_key=os.environ["OPENAI_API_KEY"],
    max_tokens=MAX_TOKENS,
)
lm_oai_gpt_4_1_mini = dspy.LM(
    "openai/gpt-4.1-mini-2025-04-14",
    api_key=os.environ["OPENAI_API_KEY"],
    max_tokens=MAX_TOKENS,
)

lm_oai_o4_mini = dspy.LM(
    "openai/o4-mini-2025-04-16",
    api_key=os.environ["OPENAI_API_KEY"],
    temperature=1.0,
    max_tokens=MAX_TOKENS,
)
lm_oai_gpt_5 = dspy.LM(
    "openai/gpt-5-2025-08-07",
    api_key=os.environ["OPENAI_API_KEY"],
    temperature=1.0,
    max_tokens=MAX_TOKENS,
)
lm_oai_gpt_5_mini = dspy.LM(
    "openai/gpt-5-mini-2025-08-07",
    api_key=os.environ["OPENAI_API_KEY"],
    temperature=1.0,
    max_tokens=MAX_TOKENS,
)
lm_oai_gpt_5_nano = dspy.LM(
    "openai/gpt-5-nano-2025-08-07",
    api_key=os.environ["OPENAI_API_KEY"],
    temperature=1.0,
    max_tokens=MAX_TOKENS,
)
lm_anthropic_sonnet_4_0 = dspy.LM(
    "anthropic/claude-sonnet-4-20250514",
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_tokens=MAX_TOKENS,
)
lm_gemini_flash_2_5 = dspy.LM(
    "gemini/gemini-2.5-flash",
    api_key=os.environ["GEMINI_API_KEY"],
    max_tokens=MAX_TOKENS,
)
lm_gemini_flash_2_5_lite = dspy.LM(
    "gemini/gemini-2.5-flash-lite",
    api_key=os.environ["GEMINI_API_KEY"],
    max_tokens=MAX_TOKENS,
)

lm_oai_o3 = dspy.LM(
    "openai/o3-2025-04-16",
    api_key=os.environ["OPENAI_API_KEY"],
    temperature=1.0,
    max_tokens=MAX_TOKENS,
)
lm_anthropic_opus_4_0 = dspy.LM(
    "anthropic/claude-opus-4-20250514",
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_tokens=MAX_TOKENS,
)
lm_gemini_pro_2_5 = dspy.LM(
    "gemini/gemini-2.5-pro",
    api_key=os.environ["GEMINI_API_KEY"],
    max_tokens=MAX_TOKENS,
)

lm_qwen3_32b = dspy.LM(
    "ollama/qwen3:32b",
    api_base="http://localhost:11434",
    api_key="",
    max_tokens=MAX_TOKENS,
)

import dspy

llms = [
    lm_oai_gpt_4_1,
    lm_oai_gpt_4_1_mini,
    lm_oai_o4_mini,
    lm_anthropic_sonnet_4_0,
    lm_gemini_flash_2_5,
    lm_gemini_flash_2_5_lite,
    lm_oai_o3,
    lm_anthropic_opus_4_0,
    lm_gemini_pro_2_5,
    lm_qwen3_32b,
]

import json

data = json.load(open("../data/convfinqa_dataset.json"))

DSPy Modules

class SolveTurnWithoutReasoning(dspy.Signature):
    conversation_context: str = dspy.InputField(desc="Conversation so far")
    evidence_snippets: str = dspy.InputField(
        desc="Snippets of evidence surrounding the table"
    )
    table: str = dspy.InputField(desc="Input financial table with metrics")
    question: str = dspy.InputField(desc="Question to answer")

    ops: str = dspy.OutputField(
        desc="Comma-separated ConvFinQA DSL program. Allowed ops: add(x, y), subtract(x, y), multiply(x, y), divide(x, y), exp(x, y), greater(x, y). Args may be constants (e.g., const_100), numbers (int or float), or prior step refs (#0, #1…). Order always follows the pattern x <op> y—pick x and y deliberately. Example: subtract(const_100, 42), divide(#0, 3.14). Convert to percentages only if explicitly asked in the question."
    )
    answer: str = dspy.OutputField(
        desc="Final answer. This will be a single number, or a boolean string(yes/no)"
    )


class SolveTurnWithReasoning(dspy.Signature):
    conversation_context: str = dspy.InputField(desc="Conversation so far")
    evidence_snippets: str = dspy.InputField(
        desc="Snippets of evidence surrounding the table"
    )
    table: str = dspy.InputField(desc="Input financial table with metrics")
    question: str = dspy.InputField(desc="Question to answer")

    reasoning: str = dspy.OutputField(
        desc="Reasoning behind the answer. Carefully analyze the conversation_context, and especially the evidence_snippets and table for the given question, and generate your reasoning before generating the ops and answer."
    )
    ops: str = dspy.OutputField(
        desc="Comma-separated ConvFinQA DSL program. Allowed ops: add(x, y), subtract(x, y), multiply(x, y), divide(x, y), exp(x, y), greater(x, y). Args may be constants (e.g., const_100), numbers (int or float), or prior step refs (#0, #1…). Order always follows the pattern x <op> y—pick x and y deliberately. Example: subtract(const_100, 42), divide(#0, 3.14). Convert to percentages only if explicitly asked in the question."
    )
    answer: str = dspy.OutputField(
        desc="Final answer. This will be a single number, or a boolean string(yes/no)"
    )


class TurnSolver(dspy.Module):
    """
    In the context of this series of interconnected finance-related queries and the additional information provided by the pretext, table data, and posttext from a company's financial filings, please provide a response to the final question. This may require extracting information from the context and performing mathematical calculations. Please take into account the information provided in the preceding questions and their answers when formulating your response: \n\n
    """

    def __init__(self, reasoning_lm=False):
        super().__init__()
        # if reasoning_lm:
        #     self.pred = dspy.ChainOfThought(SolveTurnWithReasoning)
        # else:
        #     self.pred = dspy.Predict(SolveTurnWithoutReasoning)
        self.pred = dspy.ChainOfThought(SolveTurnWithReasoning)

    def forward(self, conversation_context, evidence_snippets, table, question):
        """
        Run the model to solve a single turn.

        Args:
            conversation_context (str): Conversation so far.
            evidence_snippets (str): Evidence text around the table.
            table (str): Financial table in markdown.
            question (str): Question to answer.

        Returns:
            dspy.Prediction: Model output with reasoning, ops, and answer.
        """
        return self.pred(
            conversation_context=conversation_context,
            evidence_snippets=evidence_snippets,
            table=table,
            question=question,
        )

def norm_ans(x):
    """
    Normalize an answer for comparison.

    Converts input to string, strips whitespace, removes percent signs,
    and attempts to cast to float. If conversion fails, returns the cleaned string.

    Args:
        x: The answer to normalize (str, float, or int).

    Returns:
        float or str: Normalized float if possible, else cleaned string.
    """
    s = str(x).strip().replace("%", "")
    try:
        return float(s)
    except Exception:
        return s


def _table_md(table_dict: dict, max_cols: int | None = None) -> str:
    """
    Convert a dictionarised table to compact GitHub-markdown.

    Accepted shapes
    1) {row_name: {col_name: value, …}, …}   # regular 2-level mapping
    2) {col_name: value, …}                  # flat → coerced to single row

    Guarantees
    • Original row order is kept.
    • Column headers are kept in *first-seen* order; NO deduplication.
    • max_cols (if given) truncates *after* enumeration, duplicates included.
    • None → "" and everything else is str()-ed.
    """
    if not table_dict:
        return ""

    if all(not isinstance(v, dict) for v in table_dict.values()):
        # flat mapping → one anonymous row
        table_dict = {"": dict(table_dict)}
    else:
        # ensure every value is a dict
        table_dict = {
            r: (v if isinstance(v, dict) else {"": v}) for r, v in table_dict.items()
        }

    row_ids = list(table_dict.keys())  # preserve caller order

    cols: list = []
    for r in row_ids:
        cols.extend(table_dict[r].keys())
    if max_cols is not None:
        cols = cols[:max_cols]

    header = "| Row | " + " | ".join(map(str, cols)) + " |"
    sep = "|" + "---|" * (len(cols) + 1)
    lines = [header, sep]

    for r in row_ids:
        vals = [str(table_dict[r].get(c, "")) for c in cols]
        lines.append("| " + str(r) + " | " + " | ".join(vals) + " |")

    return "\n".join(lines)


def build_inputs_from_row(
    row,
    turn_idx,
    *,
    history_mode: str = "teacher",
    state: dict | None = None,
    max_table_cols: int = 100,
):
    """
    history_mode: 'teacher' | 'model' | 'none'
    state: carries model predictions across turns when history_mode='model'
           expected keys: {'pred_answers': list[str|float]}
    evidence_builder: optional callable(row, turn_idx)->str; if None, use simple truncation.
    """
    qs = row["dialogue_conv_questions"]
    gold = row["dialogue_executed_answers"]

    # ---- history ----
    history_lines = []
    for t in range(turn_idx):
        history_lines.append(f"Q{t + 1}: {qs[t]}")
        if history_mode == "teacher":
            history_lines.append(f"A{t + 1}: {gold[t]}")
        elif (
            history_mode == "model" and state and len(state.get("pred_answers", [])) > t
        ):
            history_lines.append(f"A{t + 1}: {state['pred_answers'][t]}")
        elif history_mode == "none":
            pass  # only questions
    conversation_context = "\n".join(history_lines) if history_lines else "None"

    evidence_snippets = (
        f"[PRE]\n{row['doc_pre_text']}\n[/PRE]\n[POST]\n{row['doc_post_text']}\n[/POST]"
    )
    table_md = _table_md(row.get("doc_table", {}) or {}, max_cols=max_table_cols)

    return dict(
        conversation_context=conversation_context,
        evidence_snippets=evidence_snippets,
        table=table_md,
        question=qs[turn_idx],
        ops=row["dialogue_turn_program"][turn_idx],
        **row,
    )

def evaluate_dialogues(model, df):
    """
    Evaluate a dialogue model on a DataFrame of conversations.

    Args:
        model: Callable that takes unpacked input dict and returns an object with at least `.answer` (and optionally `.ops`).
        df: pd.DataFrame with columns:
            - "dialogue_conv_questions": list of str, all questions in the conversation
            - "dialogue_executed_answers": list of str/float, all executed answers so far
            - (other columns as needed by evidence_builder)

    Returns:
        dict with:
            - "turn_em_micro": float, micro-averaged exact match over all turns
            - "dlg_mean_em_macro": float, macro-averaged mean EM per dialogue
            - "joint_em": float, fraction of dialogues with all turns correct
            - "final_turn_em": float, EM on the final turn of each dialogue
            - "n_dialogues": int, number of dialogues
            - "n_turns": int, total number of turns
    """
    turn_hits = 0
    turn_tot = 0
    # exec_hits = 0
    dlg_mean_ems = []
    dlg_joint_hits = 0
    final_hits = 0

    for _, row in df.iterrows():
        qs = row["dialogue_conv_questions"]
        gold = row["dialogue_executed_answers"]
        ems = []
        exec_flags = []
        for t in range(len(qs)):
            inp = build_inputs_from_row(row, t)
            out = model(**inp)
            pa = norm_ans(out.answer)
            ga = norm_ans(gold[t])
            em = float(pa == ga)
            ems.append(em)
            turn_hits += em
            turn_tot += 1

        dlg_mean_ems.append(sum(ems) / len(ems))
        if all(v == 1.0 for v in ems):
            dlg_joint_hits += 1
        final_hits += ems[-1]

    return {
        "turn_em_micro": turn_hits / max(1, turn_tot),
        "dlg_mean_em_macro": sum(dlg_mean_ems) / max(1, len(dlg_mean_ems)),
        "joint_em": dlg_joint_hits / max(1, len(dlg_mean_ems)),
        "final_turn_em": final_hits / max(1, len(dlg_mean_ems)),
        "n_dialogues": len(dlg_mean_ems),
        "n_turns": turn_tot,
    }

def turn_em_metric(example, pred, trace=None):
    """
    Compute turn-level exact match (EM) metric for a single example/prediction pair.

    Args:
        example: dict-like, must contain "gold_answer" key.
        pred: object with an "answer" attribute.

    Returns:
        float: 1.0 if normalized prediction matches normalized gold answer (with tolerance for floats), else 0.0.
    """
    from dspy.evaluate.metrics import answer_exact_match

    pa = norm_ans(pred.answer)
    ga = norm_ans(example["answer"])
    if isinstance(pa, float) and isinstance(ga, float):
        return float(abs(pa - ga) <= 1e-2)
    else:
        # exact_match in DSPy needs the inputs to be in string format
        # due to the normalisations DSPy performs internally.
        ground_truth = dspy.Prediction(answer=str(example.answer))
        pred_answer = dspy.Prediction(answer=str(pred.answer))
        return float(answer_exact_match(ground_truth, pred_answer))

def to_turn_examples(df, history_mode="teacher"):
    examples = []
    for _, row in df.iterrows():
        qs = row["dialogue_conv_questions"]
        gold = row["dialogue_executed_answers"]
        for t in range(len(qs)):
            inp = build_inputs_from_row(row, t, history_mode=history_mode)
            ex = dict(**inp, answer=gold[t])
            examples.append(
                dspy.Example(**ex).with_inputs(
                    "conversation_context",
                    "evidence_snippets",
                    "table",
                    "question",
                )
            )
    return examples

Preparing Datasets

We will aim to use the splits as follows: - train: Used primarily for the optimisation phase. This will be discussed shortly. - valid: Used to evaluate the performance of an LM on an optimised model trained using the train dataset. - test: Used to evaluate the performance of an LM on a held-out dataset. This will determine the overall stage performance.

import pandas as pd

train_df = pd.DataFrame(data["train"])
test_df = pd.DataFrame(data["dev"])

# Flatten features to remove the indexing gymnastics
train_flat_df = pd.concat(
    [
        train_df.drop(["doc", "dialogue", "features"], axis=1),
        train_df["doc"].apply(pd.Series).add_prefix("doc_"),
        train_df["dialogue"].apply(pd.Series).add_prefix("dialogue_"),
        train_df["features"].apply(pd.Series).add_prefix("features_"),
    ],
    axis=1,
)

test_flat_df = pd.concat(
    [
        test_df.drop(["doc", "dialogue", "features"], axis=1),
        test_df["doc"].apply(pd.Series).add_prefix("doc_"),
        test_df["dialogue"].apply(pd.Series).add_prefix("dialogue_"),
        test_df["features"].apply(pd.Series).add_prefix("features_"),
    ],
    axis=1,
)

train_flat_df.head()

	id	doc_pre_text	doc_post_text	doc_table	dialogue_conv_questions	dialogue_conv_answers	dialogue_turn_program	dialogue_executed_answers	dialogue_qa_split	features_num_dialogue_turns	features_has_type2_question	features_has_duplicate_columns	features_has_non_numeric_values
0	Single_JKHY/2009/page_28.pdf-3	26 \| 2009 annual report in fiscal 2008 , reven...	year ended june 30 , cash provided by operatio...	{'Year ended June 30, 2009': {'net income': 10...	[what is the net cash from operating activitie...	[206588, 181001, 25587, 14.1%]	[206588, 181001, subtract(206588, 181001), sub...	[206588.0, 181001.0, 25587.0, 0.14136]	[False, False, False, False]	4	False	False	False
1	Single_RSG/2008/page_114.pdf-2	substantially all of the goodwill and other in...	the above unaudited pro forma financial inform...	{'year ended december 31 2008 ( unaudited )': ...	[what were revenues in 2008?, what were they i...	[9362.2, 9244.9, 117.3, 1.3%]	[9362.2, 9244.9, subtract(9362.2, 9244.9), sub...	[9362.2, 9244.9, 117.3, 0.01269]	[False, False, False, False]	4	False	False	False
2	Single_AAPL/2002/page_23.pdf-1	in a new business model such as the retail seg...	.	{'2002': {'net sales': 5742.0, 'cost of sales'...	[what was the total of net sales in 2001?, and...	[5363, 7983, -2620, -32%]	[5363, 7983, subtract(5363, 7983), subtract(53...	[5363.0, 7983.0, -2620.0, -0.3282]	[False, False, False, False]	4	False	False	False
3	Single_UPS/2009/page_33.pdf-2	( 1 ) includes shares repurchased through our ...	.	{'12/31/04': {'united parcel service inc .': 1...	[what was the change in the performance of the...	[-24.05, -24.05%, 102.11, 2.11, 2.11%, -26.16%]	[subtract(75.95, const_100), subtract(75.95, c...	[-24.05, -0.2405, 102.11, 2.11, 0.0211, -0.2616]	[False, False, False, False, False, False]	6	False	False	False
4	Double_UPS/2009/page_33.pdf	( 1 ) includes shares repurchased through our ...	.	{'12/31/04': {'united parcel service inc .': 1...	[what was the fluctuation of the performance p...	[-8.94, -8.9%, -24.05, -24.05%, 2.11, 2.11%, -...	[subtract(91.06, const_100), subtract(91.06, c...	[-8.94, -0.0894, -24.05, -0.2405, 2.11, 0.0211...	[False, False, True, True, True, True, True]	7	True	False	False

easy_train_ids = pd.read_json("./splits/easy_train.jsonl", lines=True)
easy_valid_ids = pd.read_json("./splits/easy_valid.jsonl", lines=True)
easy_test_ids = pd.read_json("./splits/easy_test.jsonl", lines=True)

medium_train_ids = pd.read_json("./splits/medium_train.jsonl", lines=True)
medium_valid_ids = pd.read_json("./splits/medium_valid.jsonl", lines=True)
medium_test_ids = pd.read_json("./splits/medium_test.jsonl", lines=True)

hard_train_ids = pd.read_json("./splits/hard_train.jsonl", lines=True)
hard_valid_ids = pd.read_json("./splits/hard_valid.jsonl", lines=True)
hard_test_ids = pd.read_json("./splits/hard_test.jsonl", lines=True)

easy_train_df = train_flat_df[train_flat_df["id"].isin(easy_train_ids["id"])].copy()
easy_valid_df = train_flat_df[train_flat_df["id"].isin(easy_valid_ids["id"])].copy()
easy_test_df = test_flat_df[test_flat_df["id"].isin(easy_test_ids["id"])].copy()

medium_train_df = train_flat_df[train_flat_df["id"].isin(medium_train_ids["id"])].copy()
medium_valid_df = train_flat_df[train_flat_df["id"].isin(medium_valid_ids["id"])].copy()
medium_test_df = test_flat_df[test_flat_df["id"].isin(medium_test_ids["id"])].copy()

hard_train_df = train_flat_df[train_flat_df["id"].isin(hard_train_ids["id"])].copy()
hard_valid_df = train_flat_df[train_flat_df["id"].isin(hard_valid_ids["id"])].copy()
hard_test_df = test_flat_df[test_flat_df["id"].isin(hard_test_ids["id"])].copy()

assert easy_train_ids.shape[0] == easy_train_df.shape[0]
assert easy_valid_ids.shape[0] == easy_valid_df.shape[0]
assert easy_test_ids.shape[0] == easy_test_df.shape[0]
assert medium_train_ids.shape[0] == medium_train_df.shape[0]
assert medium_valid_ids.shape[0] == medium_valid_df.shape[0]
assert medium_test_ids.shape[0] == medium_test_df.shape[0]
assert hard_train_ids.shape[0] == hard_train_df.shape[0]
assert hard_valid_ids.shape[0] == hard_valid_df.shape[0]
assert hard_test_ids.shape[0] == hard_test_df.shape[0]

easy_train_examples = to_turn_examples(easy_train_df)
easy_valid_examples = to_turn_examples(easy_valid_df)
easy_test_examples = to_turn_examples(easy_test_df)

medium_train_examples = to_turn_examples(medium_train_df)
medium_valid_examples = to_turn_examples(medium_valid_df)
medium_test_examples = to_turn_examples(medium_test_df)

hard_train_examples = to_turn_examples(hard_train_df)
hard_valid_examples = to_turn_examples(hard_valid_df)
hard_test_examples = to_turn_examples(hard_test_df)

DSPy Optimization

We aim to optimize prompts with dspy because:

Hand-tuned prompts plateau fast and break when the data distribution drifts.
Brute-force prompt crafting doesn’t scale; automated search + distillation does.
DSPy acts as a compiler: give it a spec + metric, get back a better program.

DSPy has a few inbuilt methods to do prompt optimization, as shown in the docs. Due to time and cost contraints, we will focus on a few key approaches:

BootstrapFewShotWithRandomSearch – LM proposes demos, keep the ones that pass metric.
- Uses a teacher module (which defaults to your program) to generate complete demonstrations for every stage of your program, along with labeled examples in trainset. Parameters include max_labeled_demos (the number of demonstrations randomly selected from the trainset) and max_bootstrapped_demos (the number of additional examples generated by the teacher). The bootstrapping process employs the metric to validate demonstrations, including only those that pass the metric in the “compiled” prompt.
- RandomSearch applies the above several times with random search over generated demonstrations, and selects the best program over the optimization.
MIPROv2 – Bayes-opt over instructions and demos (can stay 0-shot).
- Generates instructions and few-shot examples in each step. The instruction generation is data-aware and demonstration-aware. Uses Bayesian Optimization to effectively search over the space of generation instructions/demonstrations across your modules.
BootstrapFinetune – distill the compiled prompt into weight updates. We might use this to improve the performance of a smaller and cheaper model to match the perform of a larger model!
- The output is a DSPy program that has the same steps, but where each step is conducted by a finetuned model instead of a prompted LM.

Optimisation Methods

In the previous notebook, on the “gate” dataset, we had the following results:

LLM	Evaluation Score
openai/gpt-4.1-2025-04-14	80.0
openai/gpt-4.1-mini-2025-04-14	50.0
openai/o4-mini-2025-04-16	60.0
anthropic/claude-sonnet-4-20250514	60.0
gemini/gemini-2.5-flash	70.0
gemini/gemini-2.5-flash-lite	70.0
openai/o3-2025-04-16	80.0
anthropic/claude-opus-4-20250514	70.0
gemini/gemini-2.5-pro	80.0
ollama/qwen3:32b	70.0

Compared to the frontier model o3, we had relatively poor performance with o4-mini and gemini-2.5-flash.

Let’s try to see if we can improve the performance here. Specifically, we will do the first stage of optimisation on the easy dataset, using the BootstrapFewShot metric.

For the easy dataset, we will restrict the number of few shot examples to 5.

BootstrapFewShotWithRandomSearch

selected_llms = [lm_oai_o4_mini, lm_gemini_flash_2_5, lm_oai_o3]

len(easy_train_examples)

The official documentation for BootstrapFewShotWithRandomSearch recommends to use 50 or more examples. Due to time and cost constraints, we will randomly sample 70 examples from the easy train set, and use them to optimize our program.

import random

random.seed(42)

bootstrap_rs_random_easy_subset = random.sample(easy_train_examples, 70)

import re

import litellm
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Config needed to prevent the optimizer from using _unsupported_ temperature
# for reasoning models.
litellm.drop_params = True


config = dict(
    max_bootstrapped_demos=3,
    max_labeled_demos=2,
    num_candidate_programs=5,
    num_threads=32,
    max_rounds=1,
)

bootstrap_rs_easy_compiled_programs = []

with mlflow.start_run(run_name="bootstrap_few_shot_rs_easy"):
    for candidate_lm in selected_llms:
        run_name = f"bootstrap_few_shot_rs_{candidate_lm.model.replace('/', '_')}"
        sanitized_run_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", run_name)
        with mlflow.start_run(run_name=sanitized_run_name, nested=True):
            with dspy.context(lm=candidate_lm) as ctx:
                teleprompter = BootstrapFewShotWithRandomSearch(
                    metric=turn_em_metric, **config
                )
                optimized_program = teleprompter.compile(
                    dspy.ChainOfThought(SolveTurnWithReasoning),
                    trainset=bootstrap_rs_random_easy_subset,
                )
                bootstrap_rs_easy_compiled_programs.append(optimized_program)

Going to sample between 1 and 3 traces per predictor.
Will attempt to bootstrap 5 candidate sets.
Average Metric: 45.00 / 70 (64.3%): 100%|██████████| 70/70 [00:00<00:00, 87.74it/s] 
🏃 View run eval_0 at: http://localhost:5000/#/experiments/3/runs/fcce9b07d50e41609bc04c3d9c2235c7
🧪 View experiment at: http://localhost:5000/#/experiments/3
New best score: 64.29 for seed -3
...
2025/07/29 01:09:48 INFO dspy.evaluate.evaluate: Average Metric: 56.0 / 70 (80.0%)

From the logs, and in the MLFlow UI, we see the following performances for each model:

Model	Candidate Scores	Best Score
o4-mini	64.29, 65.71, 67.14, 68.57, 65.71, 64.29, 67.14, 72.86	72.86
gemini-2.5-flash	61.43, 67.14, 65.71, 65.71, 64.29, 67.14, 68.57, 85.71	85.71
o3	74.29, 74.29, 74.29, 72.86, 74.29, 75.71, 75.71, 80.0	80.0

Surprisingly, the best model is gemini-2.5-flash, on this small random subset of the easy dataset.

We can also look at the prompt generated for the model:

bootstrap_rs_easy_compiled_programs[1].inspect_history(n=1)





[2025-07-29T01:03:41.268360]

System message:

Your input fields are:
1. `conversation_context` (str): Conversation so far
2. `evidence_snippets` (str): Snippets of evidence surrounding the table
3. `table` (str): Input financial table with metrics
4. `question` (str): Question to answer
Your output fields are:
1. `reasoning` (str): Reasoning behind the answer. Carefully analyze the conversation_context, and especially the evidence_snippets and table for the given question, and generate your reasoning before generating the ops and answer.
2. `ops` (str): Comma-separated ConvFinQA DSL program. Allowed ops: add(x, y), subtract(x, y), multiply(x, y), divide(x, y), exp(x, y), greater(x, y). Args may be constants (e.g., const_100), numbers (int or float), or prior step refs (#0, #1…). Order always follows the pattern x <op> y—pick x and y deliberately. Example: subtract(const_100, 42), divide(#0, 3.14). Convert to percentages only if explicitly asked in the question.
3. `answer` (str): Final answer. This will be a single number, or a boolean string(yes/no)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## conversation_context ## ]]
{conversation_context}

[[ ## evidence_snippets ## ]]
{evidence_snippets}

[[ ## table ## ]]
{table}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## ops ## ]]
{ops}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `conversation_context`, `evidence_snippets`, `table`, `question`, produce the fields `reasoning`, `ops`, `answer`.


User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## conversation_context ## ]]
Q1: what was the expected dividend yield in 2006?
A1: 3.24
Q2: what was the expected yield in 2005?
A2: 3.29
Q3: what is the net difference?
A3: -0.05

[[ ## evidence_snippets ## ]]
[PRE]
eastman notes to the audited consolidated financial statements stock option awards option awards are granted to non-employee directors on an annual basis and to employees who meet certain eligibility requirements . a single annual option grant is usually awarded to eligible employees in the fourth quarter of each year , if and when granted by the compensation and management development committee of the board of directors , and occasional individual grants are awarded to eligible employees throughout the year . option awards have an exercise price equal to the closing price of the company's stock on the date of grant . the term of options is ten years with vesting periods that vary up to three years . vesting usually occurs ratably or at the end of the vesting period . sfas no . 123 ( r ) requires that stock option awards be valued at fair value determined by market price , if actively traded in a public market or , if not , calculated using an option pricing financial model . the fair value of the company's options cannot be determined by market value as they are not traded in an open market . accordingly , a financial pricing model is utilized to determine fair value . the company utilizes the black scholes merton ( "bsm" ) model which relies on certain assumptions to estimate an option's fair value . the weighted average assumptions used in the determination of fair value for stock options awarded in 2006 , 2005 and 2004 are provided in the table below: .
[/PRE]
[POST]
prior to adoption of sfas no . 123 ( r ) , the company calculated the expected term of stock options of six years . effective with the fourth quarter 2005 annual option award , the company analyzed historical annual grant transactions over a ten year period comprising exercises , post-vesting cancellations and expirations to determine the expected term . the company expects to execute this analysis each year preceding the annual option grant to ensure that all assumptions based upon internal data reflect the most reasonable expectations for fair value determination . the weighted average expected term of 4.4 years for 2006 reflects the impact of this annual analysis and the weighting of option swap and reload grants which may have much shorter expected terms than new option grants . the volatility rate of grants is derived from historical company common stock volatility over the same time period as the expected term . the company uses a weekly high closing stock price based upon daily closing prices in the week . the volatility rate is derived by mathematical formula utilizing the weekly high closing price data . for the periods presented above , the expected dividend yield is derived by mathematical formula which uses the expected company annual dividend amount over the expected term divided by the fair market value of the company's common stock at the grant date . the average risk-free interest rate is derived from united states department of treasury published interest rates of daily yield curves for the same time period as the expected term . prior to adoption of sfas no . 123 ( r ) , the company did not estimate forfeitures and recognized them as they occurred for proforma disclosure of share-based compensation expense . with adoption of sfas no . 123 ( r ) , estimated forfeitures must be considered in recording share-based compensation expense . estimated forfeiture rates vary with each type of award affected by several factors , one of which is the varying composition and characteristics of the award participants . estimated forfeitures for the company's share-based awards historically range from 0.75 percent to 10.0 percent with the estimated forfeitures for options at 0.75 percent. .
[/POST]

[[ ## table ## ]]
| Row | expected volatility rate | expected dividend yield | average risk-free interest rate | expected forfeiture rate | expected term years | expected volatility rate | expected dividend yield | average risk-free interest rate | expected forfeiture rate | expected term years | expected volatility rate | expected dividend yield | average risk-free interest rate | expected forfeiture rate | expected term years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | -21.4 | -3.24 | -4.62 | -0.75 | 4.4 | -21.4 | -3.24 | -4.62 | -0.75 | 4.4 | -21.4 | -3.24 | -4.62 | -0.75 | 4.4 |
| 2005 | -22.9 | -3.29 | -4.48 | actual | 5.0 | -22.9 | -3.29 | -4.48 | actual | 5.0 | -22.9 | -3.29 | -4.48 | actual | 5.0 |
| 2004 | -28.0 | -3.8 | -3.46 | actual | 6.0 | -28.0 | -3.8 | -3.46 | actual | 6.0 | -28.0 | -3.8 | -3.46 | actual | 6.0 |

[[ ## question ## ]]
what is the percent change?


Assistant message:

[[ ## reasoning ## ]]
Not supplied for this particular example. 

[[ ## ops ## ]]
subtract(3.24, 3.29), divide(#0, 3.29)

[[ ## answer ## ]]
-0.0152


User message:

[[ ## conversation_context ## ]]
None

[[ ## evidence_snippets ## ]]
[PRE]
the company granted 1020 performance shares . the vesting of these shares is contingent on meeting stated goals over a performance period . beginning with restricted stock grants in september 2010 , dividends are accrued on restricted class a common stock and restricted stock units and are paid once the restricted stock vests . the following table summarizes restricted stock and performance shares activity for 2010 : number of shares weighted average grant date fair value .
[/PRE]
[POST]
the total fair value of restricted stock that vested during the years ended december 31 , 2010 , 2009 and 2008 , was $ 10.3 million , $ 6.2 million and $ 2.5 million , respectively . eligible employees may acquire shares of cme group 2019s class a common stock using after-tax payroll deductions made during consecutive offering periods of approximately six months in duration . shares are purchased at the end of each offering period at a price of 90% ( 90 % ) of the closing price of the class a common stock as reported on the nasdaq . compensation expense is recognized on the dates of purchase for the discount from the closing price . in 2010 , 2009 and 2008 , a total of 4371 , 4402 and 5600 shares , respectively , of class a common stock were issued to participating employees . these shares are subject to a six-month holding period . annual expense of $ 0.1 million for the purchase discount was recognized in 2010 , 2009 and 2008 , respectively . non-executive directors receive an annual award of class a common stock with a value equal to $ 75000 . non-executive directors may also elect to receive some or all of the cash portion of their annual stipend , up to $ 25000 , in shares of stock based on the closing price at the date of distribution . as a result , 7470 , 11674 and 5509 shares of class a common stock were issued to non-executive directors during 2010 , 2009 and 2008 , respectively . these shares are not subject to any vesting restrictions . expense of $ 2.4 million , $ 2.5 million and $ 2.4 million related to these stock-based payments was recognized for the years ended december 31 , 2010 , 2009 and 2008 , respectively. .
[/POST]

[[ ## table ## ]]
| Row | outstanding at december 31 2009 | granted | vested | cancelled | outstanding at december 31 2010 | outstanding at december 31 2009 | granted | vested | cancelled | outstanding at december 31 2010 |
|---|---|---|---|---|---|---|---|---|---|---|
| number of shares | 116677.0 | 134245.0 | -34630.0 | -19830.0 | 196462.0 | 116677.0 | 134245.0 | -34630.0 | -19830.0 | 196462.0 |
| weighted average grant date fair value | 280.0 | 275.0 | 257.0 | 260.0 | 283.0 | 280.0 | 275.0 | 257.0 | 260.0 | 283.0 |

[[ ## question ## ]]
what was the change in number of shares from 2009 to 2010?


Assistant message:

[[ ## reasoning ## ]]
To find the change in the number of shares from 2009 to 2010, I need to identify the 'number of shares' outstanding at December 31, 2009, and the 'number of shares' outstanding at December 31, 2010, from the table. Then, I will subtract the 2009 value from the 2010 value.

From the table:
- Number of shares outstanding at December 31, 2009: 116677.0
- Number of shares outstanding at December 31, 2010: 196462.0

The change is 196462.0 - 116677.0.

[[ ## ops ## ]]
subtract(196462.0, 116677.0)

[[ ## answer ## ]]
79785.0


User message:

[[ ## conversation_context ## ]]
Q1: what is the increase in value of an investment in delphi automotive plc from 2011 to 2013?
A1: 185.81

[[ ## evidence_snippets ## ]]
[PRE]
stock performance graph * $ 100 invested on 11/17/11 in our stock or 10/31/11 in the relevant index , including reinvestment of dividends . fiscal year ending december 31 , 2013 . ( 1 ) delphi automotive plc ( 2 ) s&p 500 2013 standard & poor 2019s 500 total return index ( 3 ) automotive supplier peer group 2013 russell 3000 auto parts index , including american axle & manufacturing , borgwarner inc. , cooper tire & rubber company , dana holding corp. , delphi automotive plc , dorman products inc. , federal-mogul corp. , ford motor co. , fuel systems solutions inc. , general motors co. , gentex corp. , gentherm inc. , genuine parts co. , johnson controls inc. , lkq corp. , lear corp. , meritor inc. , remy international inc. , standard motor products inc. , stoneridge inc. , superior industries international , trw automotive holdings corp. , tenneco inc. , tesla motors inc. , the goodyear tire & rubber co. , tower international inc. , visteon corp. , and wabco holdings inc . company index november 17 , december 31 , december 31 , december 31 .
[/PRE]
[POST]
dividends on february 26 , 2013 , the board of directors approved the initiation of dividend payments on the company's ordinary shares . the board of directors declared a regular quarterly cash dividend of $ 0.17 per ordinary share that was paid in each quarter of 2013 . in addition , in january 2014 , the board of directors declared a regular quarterly cash dividend of $ 0.25 per ordinary share , payable on february 27 , 2014 to shareholders of record at the close of business on february 18 , 2014 . in october 2011 , the board of managers of delphi automotive llp approved a distribution of approximately $ 95 million , which was paid on december 5 , 2011 , principally in respect of taxes , to members of delphi automotive llp who held membership interests as of the close of business on october 31 , 2011. .
[/POST]

[[ ## table ## ]]
| Row | delphi automotive plc ( 1 ) | s&p 500 ( 2 ) | automotive supplier peer group ( 3 ) | delphi automotive plc ( 1 ) | s&p 500 ( 2 ) | automotive supplier peer group ( 3 ) | delphi automotive plc ( 1 ) | s&p 500 ( 2 ) | automotive supplier peer group ( 3 ) | delphi automotive plc ( 1 ) | s&p 500 ( 2 ) | automotive supplier peer group ( 3 ) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| november 17 2011 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| december 31 2011 | 100.98 | 100.8 | 89.27 | 100.98 | 100.8 | 89.27 | 100.98 | 100.8 | 89.27 | 100.98 | 100.8 | 89.27 |
| december 31 2012 | 179.33 | 116.93 | 110.41 | 179.33 | 116.93 | 110.41 | 179.33 | 116.93 | 110.41 | 179.33 | 116.93 | 110.41 |
| december 31 2013 | 285.81 | 154.8 | 166.46 | 285.81 | 154.8 | 166.46 | 285.81 | 154.8 | 166.46 | 285.81 | 154.8 | 166.46 |

[[ ## question ## ]]
what return rate does this represent?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## ops ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
The previous question (Q1) asked for the increase in value, which was calculated as 185.81. This value represents the difference between the investment value in 2013 (285.81) and the initial investment value in 2011 (100.0). To find the return rate, I need to divide this increase by the initial investment value. The initial investment value for Delphi Automotive PLC on November 17, 2011, was 100.0.

Therefore, the return rate is the increase (185.81) divided by the initial value (100.0).

[[ ## ops ## ]]
divide(185.81, 100.0)

[[ ## answer ## ]]
1.8581

As we can see, the above prompt has a good collection of few-shot examples, which have helped improve the model’s performance.

The optimised programs can also be saved to disk, so that we can re-use them later.

for id, oc in enumerate(bootstrap_rs_easy_compiled_programs):
    model_name = selected_llms[id].model
    sanitized_model_name = re.sub(r"[-/\.]", "_", model_name)
    oc.save(f"./programs/{sanitized_model_name}_bootstrap_rs_easy/", save_program=True)

Evaluation

While the models work well on our randomly sampled subset, to compare them fairly to the zero-shot versions, we need to run the evaluation on the validation sets, similar to what we did in the modelling notebook.

import pandas as pd

gate_ids = pd.read_json("validation_datasets/gate_ids.jsonl", lines=True)

probe_medium_ids = pd.read_json(
    "validation_datasets/probe_medium_ids.jsonl", lines=True
)
probe_hard_ids = pd.read_json("validation_datasets/probe_hard_ids.jsonl", lines=True)

gate_df = easy_valid_df[easy_valid_df["id"].isin(gate_ids["id"])].copy()
probe_df = pd.concat(
    [
        medium_valid_df[medium_valid_df["id"].isin(probe_medium_ids["id"])],
        hard_valid_df[hard_valid_df["id"].isin(probe_hard_ids["id"])],
    ]
).copy()

assert gate_df.shape[0] == gate_ids.shape[0]
assert probe_df.shape[0] == probe_medium_ids.shape[0] + probe_hard_ids.shape[0]

gate_examples = to_turn_examples(gate_df, history_mode="teacher")
probe_examples = to_turn_examples(probe_df, history_mode="teacher")

import re

from dspy.evaluate import Evaluate

bootstrap_rs_gate_valid_results = []

with mlflow.start_run(run_name="gate_dataset_results") as parent_ctx:
    for idx, candidate_lm in enumerate(selected_llms):
        run_name = f"bootstrap_rs_gate_valid_{candidate_lm.model.replace('/', '_')}"
        sanitized_run_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", run_name)
        with mlflow.start_run(run_name=sanitized_run_name, nested=True):
            current_evaluator = Evaluate(
                devset=gate_examples,
                num_threads=32,
                display_progress=True,
                # display_table=True,
                # provide_traceback=True,
                return_all_scores=True,
                return_outputs=True,
            )
            with dspy.context(lm=candidate_lm) as ctx:
                current_result = current_evaluator(
                    bootstrap_rs_easy_compiled_programs[idx], metric=turn_em_metric
                )
                bootstrap_rs_gate_valid_results.append(current_result)

Average Metric: 108.00 / 151 (71.5%): 100%|██████████| 151/151 [00:05<00:00, 28.47it/s]
🏃 View run eval at: http://localhost:5000/#/experiments/3/runs/f51f5cb2ef784fee83e3c8d90d36c810
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run bootstrap_rs_gate_valid_openai_o4-mini-2025-04-16 at: http://localhost:5000/#/experiments/3/runs/11ab97c789e74ad49817a6ff42545b71
🧪 View experiment at: http://localhost:5000/#/experiments/3
...
2025/07/29 01:46:14 INFO dspy.evaluate.evaluate: Average Metric: 115.0 / 151 (76.2%)

print(bootstrap_rs_gate_valid_results[0][0])
print(bootstrap_rs_gate_valid_results[1][0])
print(bootstrap_rs_gate_valid_results[2][0])

71.52
76.82
76.16

We have the following results:

model_name	zero_shot_perform	bootstrapfewshotwithrandomsearch	pct_change_vs_zero_shot
o4-mini	60.0	71.52	+19.2%
gemini-2.5-flash	70	76.82	+9.7%
o3	80	76.16	-4.8%

We see the small reasoning models have significant gains in performance, especially o4-mini which now has a total score of 71.52% on the gate dataset.

However, we also see a slight reduction in the performance of o3. This could be because the few shot are acting as distractors, likely because they are noisy(as seen in our previous notebook).

We will explore another method before moving on!

MIPROv2

The previous method, BootstrapFewShotWithRandomSearch, was able to generate a good collection of few-shot examples, which have helped improve the model’s performance. However, we see that the prompt instructions themselves were not optimised for the model, and could be improved.

MIPROv2 aims to be the best of both worlds: It will jointly optimise both the selected few-shot examples, as well as the instructions, using Bayesion Optimisation. Internally, it will just ask the LLMs for a better prompt, a technique called metaprompting, first released by Anthropic.

MIPROv2 has 3 optimisation modes: light, medium, heavy, which map well to our existing curriculum learning approach of splitting the datasets.

Setup

While MIPROv2 gives the best results with large data, due to time and cost constraints, we will evaluate it on micro datasets.

Specifically, from our existing curriculum of easy, medium, and hard datasets, we will select a small subset for train and validation.

Once we’ve finished the optimisation for all stages, we will run the final evaluation on the full test dataset i.e easy_test + medium_test + hard_test.

micro_easy_train_df = easy_train_df.sample(n=120, random_state=42)
micro_easy_valid_df = easy_valid_df.sample(n=30, random_state=42)

micro_medium_train_df = medium_train_df.sample(n=200, random_state=42)
micro_medium_valid_df = medium_valid_df.sample(n=50, random_state=42)

micro_hard_train_df = hard_train_df.sample(n=200, random_state=42)
micro_hard_valid_df = hard_valid_df.sample(n=60, random_state=42)


micro_easy_train_examples = to_turn_examples(
    micro_easy_train_df, history_mode="teacher"
)
micro_easy_valid_examples = to_turn_examples(
    micro_easy_valid_df, history_mode="teacher"
)
micro_medium_train_examples = to_turn_examples(
    micro_medium_train_df, history_mode="teacher"
)
micro_medium_valid_examples = to_turn_examples(
    micro_medium_valid_df, history_mode="teacher"
)
micro_hard_train_examples = to_turn_examples(
    micro_hard_train_df, history_mode="teacher"
)
micro_hard_valid_examples = to_turn_examples(
    micro_hard_valid_df, history_mode="teacher"
)

To save further costs, we will run MIPROv2 on only the smaller reasoning models

o4-mini
gemini-2.5-flash

mipro_llms = [lm_oai_o4_mini, lm_gemini_flash_2_5]

Curriculum Learning: Easy

Since we have a new validation set for MIPROv2, we first need to establish a baseline performance.

import re

from dspy.evaluate import Evaluate

baseline_results = []

with mlflow.start_run(run_name="baseline_micro_easy") as parent_ctx:
    for candidate_lm in mipro_llms:
        run_name = f"baseline_{candidate_lm.model.replace('/', '_')}"
        sanitized_run_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", run_name)
        with mlflow.start_run(run_name=sanitized_run_name, nested=True):
            current_evaluator = Evaluate(
                devset=micro_easy_valid_examples,
                num_threads=32,
                display_progress=True,
                # display_table=True,
                # provide_traceback=True,
                return_all_scores=True,
                return_outputs=True,
            )
            with dspy.context(lm=candidate_lm) as ctx:
                current_result = current_evaluator(
                    TurnSolver(reasoning_lm=True), metric=turn_em_metric
                )
                baseline_results.append(current_result)

Average Metric: 11.00 / 21 (52.4%):  22%|██▏       | 21/95 [00:00<00:01, 50.20it/s]Average Metric: 34.00 / 62 (54.8%):  64%|██████▍   | 61/95 [00:01<00:00, 81.06it/s]Average Metric: 36.00 / 64 (56.2%):  66%|██████▋   | 63/95 [00:01<00:00, 81.06it/s]Average Metric: 60.00 / 95 (63.2%): 100%|██████████| 95/95 [00:01<00:00, 74.14it/s] 
🏃 View run eval at: http://localhost:5000/#/experiments/3/runs/12c628a13fc142ee96326609276a54a9
🧪 View experiment at: http://localhost:5000/#/experiments/3
...
2025/07/29 11:49:27 INFO dspy.evaluate.evaluate: Average Metric: 61.0 / 95 (64.2%)

The results are as follows:

model_name	micro_easy_valid_set em_metric
o4-mini	63.16
gemini-2.5-flash	64.21

We will now aim to improve the performance of each of these models, using the MIPROv2 optimiser.

import re

import litellm
from dspy.teleprompt import MIPROv2

# Config needed to prevent the optimizer from using _unsupported_ temperature
# for reasoning models.
litellm.drop_params = True


config = dict(
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
    auto="light",
    log_dir="./mipro_micro_logs",
    num_threads=32,
)

mipro_micro_easy_compiled_programs = []

with mlflow.start_run(run_name="mipro_micro_easy"):
    for candidate_lm in mipro_llms:
        run_name = f"mipro_{candidate_lm.model.replace('/', '_')}"
        sanitized_run_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", run_name)
        with mlflow.start_run(run_name=sanitized_run_name, nested=True):
            with dspy.context(lm=candidate_lm) as ctx:
                teleprompter = MIPROv2(metric=turn_em_metric, **config)
                optimized_program = teleprompter.compile(
                    dspy.ChainOfThought(SolveTurnWithReasoning),
                    trainset=micro_easy_train_examples,
                    valset=micro_easy_valid_examples,
                    requires_permission_to_run=False,
                )
                mipro_micro_easy_compiled_programs.append(optimized_program)

2025/07/29 11:49:54 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 10
minibatch: True
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 95

2025/07/29 11:49:54 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/07/29 11:49:54 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/07/29 11:49:54 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...
  1%|          | 2/350 [00:00<00:11, 30.06it/s]
  1%|          | 2/350 [00:00<00:12, 28.92it/s]
  1%|          | 3/350 [00:00<00:10, 32.53it/s]
  1%|          | 2/350 [00:00<00:12, 28.96it/s]
2025/07/29 11:49:56 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/07/29 11:49:56 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/07/29 11:49:57 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...

2025/07/29 11:49:57 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/07/29 11:49:57 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `conversation_context`, `evidence_snippets`, `table`, `question`, produce the fields `reasoning`, `ops`, `answer`.

2025/07/29 11:49:57 INFO dspy.teleprompt.mipro_optimizer_v2: 1: You are a financial analyst AI specializing in conversational question answering over corporate financial tables. Given the following inputs:

...
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run mipro_micro_easy at: http://localhost:5000/#/experiments/3/runs/94952a4430d948679def7185323e94d1
🧪 View experiment at: http://localhost:5000/#/experiments/3

We now have the following results:

model_name	em_metric before optimisation	em_metric after MIPROv2	% increase
o4-mini	63.16	74.74	18.3%
gemini-2.5-flash	64.21	71.58	11.5%

We see significant improvement in performance for both models, with o4-mini getting the best result at 74.74%. Looking at the prompt

mipro_micro_easy_compiled_programs[0].inspect_history(n=1)





[2025-07-29T11:50:11.459861]

System message:

Your input fields are:
1. `conversation_context` (str): Conversation so far
2. `evidence_snippets` (str): Snippets of evidence surrounding the table
3. `table` (str): Input financial table with metrics
4. `question` (str): Question to answer
Your output fields are:
1. `reasoning` (str): Reasoning behind the answer. Carefully analyze the conversation_context, and especially the evidence_snippets and table for the given question, and generate your reasoning before generating the ops and answer.
2. `ops` (str): Comma-separated ConvFinQA DSL program. Allowed ops: add(x, y), subtract(x, y), multiply(x, y), divide(x, y), exp(x, y), greater(x, y). Args may be constants (e.g., const_100), numbers (int or float), or prior step refs (#0, #1…). Order always follows the pattern x <op> y—pick x and y deliberately. Example: subtract(const_100, 42), divide(#0, 3.14). Convert to percentages only if explicitly asked in the question.
3. `answer` (str): Final answer. This will be a single number, or a boolean string(yes/no)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## conversation_context ## ]]
{conversation_context}

[[ ## evidence_snippets ## ]]
{evidence_snippets}

[[ ## table ## ]]
{table}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## ops ## ]]
{ops}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        You are a financial analyst AI specializing in conversational question answering over corporate financial tables. Given the following inputs:
        
        Conversation Context: {conversation_context}  
        Evidence Snippets: {evidence_snippets}  
        Table: {table}  
        Question: {question}  
        
        Produce exactly three output sections:
        
        1. Reasoning: A clear, step-by-step natural-language analysis showing how you locate numbers in the table or snippets and how you plan to compute the answer.  
        2. Ops: A comma-separated ConvFinQA DSL program (using only add(x,y), subtract(x,y), multiply(x,y), divide(x,y), exp(x,y), greater(x,y)) that implements your reasoning. Reference constants or prior steps (#0, #1, …) in the form “op(arg1,arg2)”. Only convert to percentages if explicitly asked.  
        3. Answer: The final result as a single number or “yes”/“no”.  
        
        Use the prefixes “Reasoning:”, “Ops:”, and “Answer:” exactly, and do not include any additional sections or commentary.


User message:

[[ ## conversation_context ## ]]
Q1: what was the net change in value of ars investments from 2008 to 2009?
A1: -12.0
Q2: what was the 2008 value?
A2: 192.0

[[ ## evidence_snippets ## ]]
[PRE]
mastercard incorporated notes to consolidated financial statements 2014continued the municipal bond portfolio is comprised of tax exempt bonds and is diversified across states and sectors . the portfolio has an average credit quality of double-a . the short-term bond funds invest in fixed income securities , including corporate bonds , mortgage-backed securities and asset-backed securities . the company holds investments in ars . interest on these securities is exempt from u.s . federal income tax and the interest rate on the securities typically resets every 35 days . the securities are fully collateralized by student loans with guarantees , ranging from approximately 95% ( 95 % ) to 98% ( 98 % ) of principal and interest , by the u.s . government via the department of education . beginning on february 11 , 2008 , the auction mechanism that normally provided liquidity to the ars investments began to fail . since mid-february 2008 , all investment positions in the company 2019s ars investment portfolio have experienced failed auctions . the securities for which auctions have failed have continued to pay interest in accordance with the contractual terms of such instruments and will continue to accrue interest and be auctioned at each respective reset date until the auction succeeds , the issuer redeems the securities or they mature . during 2008 , ars were reclassified as level 3 from level 2 . as of december 31 , 2010 , the ars market remained illiquid , but issuer call and redemption activity in the ars student loan sector has occurred periodically since the auctions began to fail . during 2010 and 2009 , the company did not sell any ars in the auction market , but there were calls at par . the table below includes a roll-forward of the company 2019s ars investments from january 1 , 2009 to december 31 , 2010 . significant unobservable inputs ( level 3 ) ( in millions ) .
[/PRE]
[POST]
the company evaluated the estimated impairment of its ars portfolio to determine if it was other-than- temporary . the company considered several factors including , but not limited to , the following : ( 1 ) the reasons for the decline in value ( changes in interest rates , credit event , or market fluctuations ) ; ( 2 ) assessments as to whether it is more likely than not that it will hold and not be required to sell the investments for a sufficient period of time to allow for recovery of the cost basis ; ( 3 ) whether the decline is substantial ; and ( 4 ) the historical and anticipated duration of the events causing the decline in value . the evaluation for other-than-temporary impairments is a quantitative and qualitative process , which is subject to various risks and uncertainties . the risks and uncertainties include changes in credit quality , market liquidity , timing and amounts of issuer calls and interest rates . as of december 31 , 2010 , the company believed that the unrealized losses on the ars were not related to credit quality but rather due to the lack of liquidity in the market . the company believes that it is more .
[/POST]

[[ ## table ## ]]
| Row | fair value december 31 2008 | calls at par | recovery of unrealized losses due to issuer calls | increase in fair value | fair value december 31 2009 | fair value december 31 2010 |
|---|---|---|---|---|---|---|
| significant unobservable inputs ( level 3 ) ( in millions ) | 192.0 | -94.0 | 13.0 | 7.0 | 180.0 | 106.0 |

[[ ## question ## ]]
what is the net change divided by the 2008 value?


Assistant message:

[[ ## reasoning ## ]]
From the table, the fair value at December 31, 2008 was 192.0 and at December 31, 2009 was 180.0.  
Net change = 180.0 − 192.0 = −12.0.  
Divide the net change by the 2008 value: −12.0 ÷ 192.0 = −0.0625.

[[ ## ops ## ]]
subtract(180.0,192.0), divide(#0,192.0)

[[ ## answer ## ]]
-0.0625


User message:

[[ ## conversation_context ## ]]
Q1: what was the expense of class a common stock issued to non-executive directors in 2010?
A1: 2.4

[[ ## evidence_snippets ## ]]
[PRE]
the company granted 1020 performance shares . the vesting of these shares is contingent on meeting stated goals over a performance period . beginning with restricted stock grants in september 2010 , dividends are accrued on restricted class a common stock and restricted stock units and are paid once the restricted stock vests . the following table summarizes restricted stock and performance shares activity for 2010 : number of shares weighted average grant date fair value .
[/PRE]
[POST]
the total fair value of restricted stock that vested during the years ended december 31 , 2010 , 2009 and 2008 , was $ 10.3 million , $ 6.2 million and $ 2.5 million , respectively . eligible employees may acquire shares of cme group 2019s class a common stock using after-tax payroll deductions made during consecutive offering periods of approximately six months in duration . shares are purchased at the end of each offering period at a price of 90% ( 90 % ) of the closing price of the class a common stock as reported on the nasdaq . compensation expense is recognized on the dates of purchase for the discount from the closing price . in 2010 , 2009 and 2008 , a total of 4371 , 4402 and 5600 shares , respectively , of class a common stock were issued to participating employees . these shares are subject to a six-month holding period . annual expense of $ 0.1 million for the purchase discount was recognized in 2010 , 2009 and 2008 , respectively . non-executive directors receive an annual award of class a common stock with a value equal to $ 75000 . non-executive directors may also elect to receive some or all of the cash portion of their annual stipend , up to $ 25000 , in shares of stock based on the closing price at the date of distribution . as a result , 7470 , 11674 and 5509 shares of class a common stock were issued to non-executive directors during 2010 , 2009 and 2008 , respectively . these shares are not subject to any vesting restrictions . expense of $ 2.4 million , $ 2.5 million and $ 2.4 million related to these stock-based payments was recognized for the years ended december 31 , 2010 , 2009 and 2008 , respectively. .
[/POST]

[[ ## table ## ]]
| Row | outstanding at december 31 2009 | granted | vested | cancelled | outstanding at december 31 2010 | outstanding at december 31 2009 | granted | vested | cancelled | outstanding at december 31 2010 |
|---|---|---|---|---|---|---|---|---|---|---|
| number of shares | 116677.0 | 134245.0 | -34630.0 | -19830.0 | 196462.0 | 116677.0 | 134245.0 | -34630.0 | -19830.0 | 196462.0 |
| weighted average grant date fair value | 280.0 | 275.0 | 257.0 | 260.0 | 283.0 | 280.0 | 275.0 | 257.0 | 260.0 | 283.0 |

[[ ## question ## ]]
what is that times 1000000?


Assistant message:

[[ ## reasoning ## ]]
The expense of class A common stock issued to non-executive directors in 2010 was $2.4 million. Multiplying that by 1,000,000 converts millions to units: 2.4 × 1,000,000 = 2,400,000.

[[ ## ops ## ]]
multiply(const_2.4,const_1000000)

[[ ## answer ## ]]
2400000


User message:

[[ ## conversation_context ## ]]
Q1: what is the increase in the total expense related to the defined contribution plan for non-u.s.employees from 2010 to 2011?
A1: 9.7
Q2: what is the total expense related to the defined contribution plan for non-u.s.employees in 2010?
A2: 11.7

[[ ## evidence_snippets ## ]]
[PRE]
the following is a schedule of future minimum rental payments required under long-term operating leases at october 29 , 2011 : fiscal years operating leases .
[/PRE]
[POST]
12 . commitments and contingencies from time to time in the ordinary course of the company 2019s business , various claims , charges and litigation are asserted or commenced against the company arising from , or related to , contractual matters , patents , trademarks , personal injury , environmental matters , product liability , insurance coverage and personnel and employment disputes . as to such claims and litigation , the company can give no assurance that it will prevail . the company does not believe that any current legal matters will have a material adverse effect on the company 2019s financial position , results of operations or cash flows . 13 . retirement plans the company and its subsidiaries have various savings and retirement plans covering substantially all employees . the company maintains a defined contribution plan for the benefit of its eligible u.s . employees . this plan provides for company contributions of up to 5% ( 5 % ) of each participant 2019s total eligible compensation . in addition , the company contributes an amount equal to each participant 2019s pre-tax contribution , if any , up to a maximum of 3% ( 3 % ) of each participant 2019s total eligible compensation . the total expense related to the defined contribution plan for u.s . employees was $ 21.9 million in fiscal 2011 , $ 20.5 million in fiscal 2010 and $ 21.5 million in fiscal 2009 . the company also has various defined benefit pension and other retirement plans for certain non-u.s . employees that are consistent with local statutory requirements and practices . the total expense related to the various defined benefit pension and other retirement plans for certain non-u.s . employees was $ 21.4 million in fiscal 2011 , $ 11.7 million in fiscal 2010 and $ 10.9 million in fiscal 2009 . non-u.s . plan disclosures the company 2019s funding policy for its foreign defined benefit pension plans is consistent with the local requirements of each country . the plans 2019 assets consist primarily of u.s . and non-u.s . equity securities , bonds , property and cash . the benefit obligations and related assets under these plans have been measured at october 29 , 2011 and october 30 , 2010 . analog devices , inc . notes to consolidated financial statements 2014 ( continued ) .
[/POST]

[[ ## table ## ]]
| Row | 2012 | 2013 | 2014 | 2015 | 2016 | later years | total |
|---|---|---|---|---|---|---|---|
| operating leases | 17590.0 | 12724.0 | 6951.0 | 5649.0 | 3669.0 | 19472.0 | 66055.0 |

[[ ## question ## ]]
what growth rate does this represent?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## ops ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
The expense for non-U.S. employees was $11.7 million in 2010 and $21.4 million in 2011, an increase of $9.7 million. The growth rate is the increase divided by the 2010 base: 9.7 / 11.7 ≈ 0.8291.

[[ ## ops ## ]]
divide(const_9.7,const_11.7)

[[ ## answer ## ]]
0.8291

[[ ## completed ## ]]

We see a super-detailed prompt, complete with both few-shot examples and detailed instructions on the role of the model and the expected output. All of this was achevied without prompting by hand!!

We’ll also save this programs to disk, for future reference.

for id, oc in enumerate(mipro_micro_easy_compiled_programs):
    model_name = selected_llms[id].model
    sanitized_model_name = re.sub(r"[-/\.]", "_", model_name)
    oc.save(f"./programs/{sanitized_model_name}_mipro_micro_easy/", save_program=True)

Next, we will use these compiled programs as the starting point, and optimise the upcoming stages i.e finally do curriculum learning!

Curriculum Learning: Medium

We will just re-run the same loop as above, with the compiled programs as the starting point, and the datasets changed to the medium stage.

First, let’s benchmark the optimised programns from the easy stage on the medium validation set

baseline_medium_results = []

with mlflow.start_run(run_name="baseline_micro_medium") as parent_ctx:
    for idx, candidate_lm in enumerate(mipro_llms):
        run_name = f"baseline_{candidate_lm.model.replace('/', '_')}"
        sanitized_run_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", run_name)
        with mlflow.start_run(run_name=sanitized_run_name, nested=True):
            current_evaluator = Evaluate(
                devset=micro_medium_valid_examples,
                num_threads=32,
                display_progress=True,
                return_all_scores=True,
                return_outputs=True,
            )
            with dspy.context(lm=candidate_lm) as ctx:
                current_result = current_evaluator(
                    mipro_micro_easy_compiled_programs[idx], metric=turn_em_metric
                )
                baseline_medium_results.append(current_result)

Average Metric: 153.00 / 194 (78.9%): 100%|██████████| 194/194 [01:11<00:00,  2.71it/s]
🏃 View run eval at: http://localhost:5000/#/experiments/3/runs/d329a74cf0d348a5bc0cef64cf208e43
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run baseline_openai_o4-mini-2025-04-16 at: http://localhost:5000/#/experiments/3/runs/0f2a019e21524ae4ae71ff96d4f1c8a6
🧪 View experiment at: http://localhost:5000/#/experiments/3
Average Metric: 150.00 / 194 (77.3%): 100%|██████████| 194/194 [00:45<00:00,  4.23it/s]
🏃 View run eval at: http://localhost:5000/#/experiments/3/runs/1137bcdb00824fa8bed994f464c903e6
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run baseline_gemini_gemini-2_5-flash at: http://localhost:5000/#/experiments/3/runs/2db710ec2ad44542b84f3d13bae2566c
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run baseline_micro_medium at: http://localhost:5000/#/experiments/3/runs/25cae29dfdc242478bae535c60e644e1
🧪 View experiment at: http://localhost:5000/#/experiments/3

2025/07/29 12:04:59 INFO dspy.evaluate.evaluate: Average Metric: 153.0 / 194 (78.9%)
2025/07/29 12:05:45 INFO dspy.evaluate.evaluate: Average Metric: 150.0 / 194 (77.3%)

The results are as follows:

model_name	micro_medium_valid_set em_metric
o4-mini	78.9%
gemini-2.5-flash	77.30%

We will now aim to improve the performance of each of these models, using the MIPROv2 optimiser.

config = dict(
    max_bootstrapped_demos=2,
    max_labeled_demos=3,
    auto="light",  # Still keeping this as light as this is computationally expensive
    log_dir="./mipro_micro_logs",
    num_threads=32,
)

mipro_micro_medium_compiled_programs = []

with mlflow.start_run(run_name="mipro_micro_medium"):
    for idx, candidate_lm in enumerate(mipro_llms):
        run_name = f"mipro_{candidate_lm.model.replace('/', '_')}"
        sanitized_run_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", run_name)
        with mlflow.start_run(run_name=sanitized_run_name, nested=True):
            with dspy.context(lm=candidate_lm) as ctx:
                teleprompter = MIPROv2(metric=turn_em_metric, **config)
                optimized_program = teleprompter.compile(
                    mipro_micro_easy_compiled_programs[idx],
                    trainset=micro_medium_train_examples,
                    valset=micro_medium_valid_examples,
                    requires_permission_to_run=False,
                )
                mipro_micro_medium_compiled_programs.append(optimized_program)

2025/07/29 12:07:51 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 10
minibatch: True
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 100

2025/07/29 12:07:52 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/07/29 12:07:52 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

...
2025/07/29 12:26:15 INFO dspy.evaluate.evaluate: Average Metric: 91.0 / 100 (91.0%)
2025/07/29 12:26:15 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 91.0
2025/07/29 12:26:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [81.0, 0.0, 91.0]
2025/07/29 12:26:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.0
2025/07/29 12:26:17 INFO dspy.teleprompt.mipro_optimizer_v2: =======================
2025/07/29 12:26:17 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/29 12:26:17 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 91.0!

Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6
...
Average Metric: 91.00 / 100 (91.0%): 100%|██████████| 100/100 [00:12<00:00,  7.84it/s]
🏃 View run eval_full_2 at: http://localhost:5000/#/experiments/3/runs/6390215385304cff8297d122dd799027
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run mipro_gemini_gemini-2_5-flash at: http://localhost:5000/#/experiments/3/runs/bfa8a54582554fad9a20ad3c1bcb0e98
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run mipro_micro_medium at: http://localhost:5000/#/experiments/3/runs/4c01dc5b8ebc463ead87e69fd8ab5c67
🧪 View experiment at: http://localhost:5000/#/experiments/3

We now have the following results: We now have the following results:

model_name	em_metric before optimisation	em_metric after MIPROv2	% increase
o4-mini	78.90	80.00	1.4%
gemini-2.5-flash	77.30	91.00	17.7%

Aha! We see significant improvement gemini-2.5-flash, and break the 90% accuracy barrier!

Looking at the prompt for the gemini-model:

mipro_micro_medium_compiled_programs[1].inspect_history(n=1)





[2025-07-29T12:26:14.956939]

System message:

Your input fields are:
1. `conversation_context` (str): Conversation so far
2. `evidence_snippets` (str): Snippets of evidence surrounding the table
3. `table` (str): Input financial table with metrics
4. `question` (str): Question to answer
Your output fields are:
1. `reasoning` (str): Reasoning behind the answer. Carefully analyze the conversation_context, and especially the evidence_snippets and table for the given question, and generate your reasoning before generating the ops and answer.
2. `ops` (str): Comma-separated ConvFinQA DSL program. Allowed ops: add(x, y), subtract(x, y), multiply(x, y), divide(x, y), exp(x, y), greater(x, y). Args may be constants (e.g., const_100), numbers (int or float), or prior step refs (#0, #1…). Order always follows the pattern x <op> y—pick x and y deliberately. Example: subtract(const_100, 42), divide(#0, 3.14). Convert to percentages only if explicitly asked in the question.
3. `answer` (str): Final answer. This will be a single number, or a boolean string(yes/no)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## conversation_context ## ]]
{conversation_context}

[[ ## evidence_snippets ## ]]
{evidence_snippets}

[[ ## table ## ]]
{table}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## ops ## ]]
{ops}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        You are an expert financial AI assistant specialized in analyzing financial documents and answering complex numerical questions. Your primary goal is to accurately extract and compute information from both structured tables and unstructured text, even when answers depend on chained operations or require robust state management across multiple dialogue turns. You must also correctly interpret numerical values, including implicit units and signs, and format outputs appropriately for human-readable results.
        
        Given the `conversation_context` for multi-turn interactions, `evidence_snippets` for textual context, a `table` containing financial data, and the current `question`, your task is to:
        1.  First, generate a detailed `reasoning` that explains the logical steps to derive the answer, considering all relevant information from the context, snippets, and table.
        2.  Then, translate this reasoning into a precise sequence of `ops` using the ConvFinQA DSL, ensuring correct operations, references, and handling of numerical semantics.
        3.  Finally, provide the computed `answer` as a single number or a boolean.


User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## conversation_context ## ]]
Q1: as of april 3, 2010, what was the amount of doors in the wholesale segment in the europe geography?
A1: 4421.0
Q2: and what was the total amount of doors?
A2: 8940.0
Q3: what percentage, then, of this total did that amount in europe represent?
A3: 0.49452

[[ ## evidence_snippets ## ]]
[PRE]
table of contents worldwide distribution channels the following table presents the number of doors by geographic location , in which ralph lauren-branded products distributed by our wholesale segment were sold to consumers in our primary channels of distribution as of april 3 , 2010 : number of location doors ( a ) .
[/PRE]
[POST]
( a ) in asia-pacific , our products are primarily distributed through concessions-based sales arrangements . in addition , american living and chaps-branded products distributed by our wholesale segment were sold domestically through approximately 1700 doors as of april 3 , 2010 . we have five key department-store customers that generate significant sales volume . for fiscal 2010 , these customers in the aggregate accounted for approximately 45% ( 45 % ) of all wholesale revenues , with macy 2019s , inc . representing approximately 18% ( 18 % ) of these revenues . our product brands are sold primarily through their own sales forces . our wholesale segment maintains its primary showrooms in new york city . in addition , we maintain regional showrooms in atlanta , chicago , dallas , milan , paris , london , munich , madrid and stockholm . shop-within-shops . as a critical element of our distribution to department stores , we and our licensing partners utilize shop- within-shops to enhance brand recognition , to permit more complete merchandising of our lines by the department stores and to differentiate the presentation of products . shop-within-shops fixed assets primarily include items such as customized freestanding fixtures , wall cases and components , decorative items and flooring . as of april 3 , 2010 , we had approximately 14000 shop-within-shops dedicated to our ralph lauren-branded wholesale products worldwide . excluding significantly larger shop-within-shops in key department store locations , the size of our shop-within-shops typically ranges from approximately 300 to 6000 square feet . we normally share in the cost of these shop-within-shops with our wholesale customers . basic stock replenishment program . basic products such as knit shirts , chino pants and oxford cloth shirts can be ordered at any time through our basic stock replenishment programs . we generally ship these products within three-to-five days of order receipt . our retail segment as of april 3 , 2010 , our retail segment consisted of 179 full-price retail stores and 171 factory stores worldwide , totaling approximately 2.6 million square feet , 281 concessions-based shop-within-shops and two e-commerce websites . the extension of our direct-to-consumer reach is a primary long-term strategic goal . full-price retail stores our full-price retail stores reinforce the luxury image and distinct sensibility of our brands and feature exclusive lines that are not sold in domestic department stores . we opened 3 new full-price stores and closed 3 full-price stores in fiscal 2010 . in addition , we assumed 16 full-price stores in connection with the asia-pacific .
[/POST]

[[ ## table ## ]]
| Row | united states and canada | europe | japan | total |
|---|---|---|---|---|
| number of doors ( a ) | 4402.0 | 4421.0 | 117.0 | 8940.0 |

[[ ## question ## ]]
and what percentage of the total did the amount of doors in united states and canada represent?


Assistant message:

[[ ## reasoning ## ]]
Not supplied for this particular example. 

[[ ## ops ## ]]
divide(4402, 8940)

[[ ## answer ## ]]
0.49239


User message:

[[ ## conversation_context ## ]]
None

[[ ## evidence_snippets ## ]]
[PRE]
part i the following table details the growth in global weighted average berths and the global , north american and european cruise guests over the past five years : weighted-average supply of berths marketed globally ( 1 ) royal caribbean cruises ltd . total berths global cruise guests ( 1 ) north american cruise guests ( 2 ) european cruise guests ( 3 ) .
[/PRE]
[POST]
( 1 ) source : our estimates of the number of global cruise guests and the weighted-average supply of berths marketed globally are based on a com- bination of data that we obtain from various publicly available cruise industry trade information sources including seatrade insider , cruise industry news and cruise line international association ( 201cclia 201d ) . in addition , our estimates incorporate our own statistical analysis utilizing the same publicly available cruise industry data as a base . ( 2 ) source : cruise line international association based on cruise guests carried for at least two consecutive nights for years 2009 through 2012 . year 2013 amounts represent our estimates ( see number 1 above ) . includes the united states of america and canada . ( 3 ) source : clia europe , formerly european cruise council , for years 2009 through 2012 . year 2013 amounts represent our estimates ( see number 1 above ) . north america the majority of cruise guests are sourced from north america , which represented approximately 56% ( 56 % ) of global cruise guests in 2013 . the compound annual growth rate in cruise guests sourced from this market was approximately 3.2% ( 3.2 % ) from 2009 to 2013 . europe cruise guests sourced from europe represented approximately 30% ( 30 % ) of global cruise guests in 2013 . the compound annual growth rate in cruise guests sourced from this market was approximately 6.0% ( 6.0 % ) from 2009 to 2013 . other markets in addition to expected industry growth in north america and europe , we expect the asia/pacific region to demonstrate an even higher growth rate in the near term , although it will continue to represent a relatively small sector compared to north america and europe . based on industry data , cruise guests sourced from the asia/pacific region represented approximately 4.5% ( 4.5 % ) of global cruise guests in 2013 . the compound annual growth rate in cruise guests sourced from this market was approximately 15% ( 15 % ) from 2011 to 2013 . competition we compete with a number of cruise lines . our princi- pal competitors are carnival corporation & plc , which owns , among others , aida cruises , carnival cruise lines , costa cruises , cunard line , holland america line , iberocruceros , p&o cruises and princess cruises ; disney cruise line ; msc cruises ; norwegian cruise line and oceania cruises . cruise lines compete with other vacation alternatives such as land-based resort hotels and sightseeing destinations for consumers 2019 leisure time . demand for such activities is influenced by political and general economic conditions . com- panies within the vacation market are dependent on consumer discretionary spending . operating strategies our principal operating strategies are to : and employees and protect the environment in which our vessels and organization operate , to better serve our global guest base and grow our business , order to enhance our revenues , our brands globally , expenditures and ensure adequate cash and liquid- ity , with the overall goal of maximizing our return on invested capital and long-term shareholder value , ization and maintenance of existing ships and the transfer of key innovations across each brand , while prudently expanding our fleet with new state-of- the-art cruise ships , ships by deploying them into those markets and itineraries that provide opportunities to optimize returns , while continuing our focus on existing key markets , service customer preferences and expectations in an innovative manner , while supporting our strategic focus on profitability , and .
[/POST]

[[ ## table ## ]]
| Row | 2009 | 2010 | 2011 | 2012 | 2013 | 2009 | 2010 | 2011 | 2012 | 2013 | 2009 | 2010 | 2011 | 2012 | 2013 | 2009 | 2010 | 2011 | 2012 | 2013 | 2009 | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| weighted-averagesupply ofberthsmarketedglobally ( 1 ) | 363000.0 | 391000.0 | 412000.0 | 425000.0 | 432000.0 | 363000.0 | 391000.0 | 412000.0 | 425000.0 | 432000.0 | 363000.0 | 391000.0 | 412000.0 | 425000.0 | 432000.0 | 363000.0 | 391000.0 | 412000.0 | 425000.0 | 432000.0 | 363000.0 | 391000.0 | 412000.0 | 425000.0 | 432000.0 |
| royal caribbean cruises ltd . total berths | 84050.0 | 92300.0 | 92650.0 | 98650.0 | 98750.0 | 84050.0 | 92300.0 | 92650.0 | 98650.0 | 98750.0 | 84050.0 | 92300.0 | 92650.0 | 98650.0 | 98750.0 | 84050.0 | 92300.0 | 92650.0 | 98650.0 | 98750.0 | 84050.0 | 92300.0 | 92650.0 | 98650.0 | 98750.0 |
| globalcruiseguests ( 1 ) | 17340000.0 | 18800000.0 | 20227000.0 | 20898000.0 | 21300000.0 | 17340000.0 | 18800000.0 | 20227000.0 | 20898000.0 | 21300000.0 | 17340000.0 | 18800000.0 | 20227000.0 | 20898000.0 | 21300000.0 | 17340000.0 | 18800000.0 | 20227000.0 | 20898000.0 | 21300000.0 | 17340000.0 | 18800000.0 | 20227000.0 | 20898000.0 | 21300000.0 |
| north americancruiseguests ( 2 ) | 10198000.0 | 10781000.0 | 11625000.0 | 11640000.0 | 11816000.0 | 10198000.0 | 10781000.0 | 11625000.0 | 11640000.0 | 11816000.0 | 10198000.0 | 10781000.0 | 11625000.0 | 11640000.0 | 11816000.0 | 10198000.0 | 10781000.0 | 11625000.0 | 11640000.0 | 11816000.0 | 10198000.0 | 10781000.0 | 11625000.0 | 11640000.0 | 11816000.0 |
| europeancruiseguests ( 3 ) | 5000000.0 | 5540000.0 | 5894000.0 | 6139000.0 | 6399000.0 | 5000000.0 | 5540000.0 | 5894000.0 | 6139000.0 | 6399000.0 | 5000000.0 | 5540000.0 | 5894000.0 | 6139000.0 | 6399000.0 | 5000000.0 | 5540000.0 | 5894000.0 | 6139000.0 | 6399000.0 | 5000000.0 | 5540000.0 | 5894000.0 | 6139000.0 | 6399000.0 |

[[ ## question ## ]]
what is the total global cruise guests are from north america or europe?


Assistant message:

[[ ## reasoning ## ]]
The question asks for the total number of global cruise guests that are from North America or Europe. I need to find the number of North American cruise guests and European cruise guests for a specific year and sum them up. The evidence snippets provide percentages for 2013 for North America (56%) and Europe (30%) of global cruise guests. The table also provides the absolute numbers for North American cruise guests and European cruise guests for 2013. I will sum the values for 'north americancruiseguests (2)' and 'europeancruiseguests (3)' for the year 2013.

North American cruise guests in 2013 = 11,816,000
European cruise guests in 2013 = 6,399,000
Total = 11,816,000 + 6,399,000 = 18,215,000.

[[ ## ops ## ]]
add(11816000.0, 6399000.0)

[[ ## answer ## ]]
18215000.0


User message:

[[ ## conversation_context ## ]]
Q1: what was the net change in value of net income from 2012 to 2013?
A1: 41.8

[[ ## evidence_snippets ## ]]
[PRE]
other expense , net : the company's other expense consists of the following: .
[/PRE]
[POST]
income tax provision : the company recorded income tax expense of $ 77.2 million and had income before income taxes of $ 322.5 million for the year ended december 31 , 2013 , representing an effective tax rate of 23.9% ( 23.9 % ) . during the year ended december 31 , 2012 , the company recorded income tax expense of $ 90.1 million and had income before income taxes of $ 293.5 million , representing an effective tax rate of 30.7% ( 30.7 % ) . in december 2013 , the company received notice from the irs that the joint committee on taxation took no exception to the company's tax returns that were filed for 2009 and 2010 . an $ 11.0 million tax benefit was recognized in the company's 2013 financial results as the company had effectively settled uncertainty regarding the realization of refund claims filed in connection with the 2009 and 2010 returns . in the u.s. , which is the largest jurisdiction where the company receives such a tax credit , the availability of the research and development credit expired at the end of the 2011 tax year . in january 2013 , the u.s . congress passed legislation that reinstated the research and development credit retroactive to 2012 . the income tax provision for the year ended december 31 , 2013 includes approximately $ 2.3 million related to the reinstated research and development credit for 2012 activity . the decrease in the effective tax rate from the prior year is primarily due to the release of an uncertain tax position mentioned above , the reinstatement of the u.s . research and development credit mentioned above , and cash repatriation activities . when compared to the federal and state combined statutory rate , the effective tax rates for the years ended december 31 , 2013 and 2012 were favorably impacted by lower statutory tax rates in many of the company 2019s foreign jurisdictions , the domestic manufacturing deduction and tax benefits associated with the merger of the company 2019s japan subsidiaries in 2010 . net income : the company 2019s net income for the year ended december 31 , 2013 was $ 245.3 million as compared to net income of $ 203.5 million for the year ended december 31 , 2012 . diluted earnings per share was $ 2.58 for the year ended december 31 , 2013 and $ 2.14 for the year ended december 31 , 2012 . the weighted average shares used in computing diluted earnings per share were 95.1 million and 95.0 million for the years ended december 31 , 2013 and 2012 , respectively . table of contents .
[/POST]

[[ ## table ## ]]
| Row | foreign currency losses net | other income ( expense ) net | total other expense net | foreign currency losses net | other income ( expense ) net | total other expense net |
|---|---|---|---|---|---|---|
| year ended december 31 , 2013 | -1115.0 | 69.0 | -1046.0 | -1115.0 | 69.0 | -1046.0 |
| year ended december 31 , 2012 | -1401.0 | -4.0 | -1405.0 | -1401.0 | -4.0 | -1405.0 |

[[ ## question ## ]]
what was net income in 2012?


Assistant message:

[[ ## reasoning ## ]]
The question asks for the net income in 2012. I will locate this information in the provided evidence snippets. The evidence states that "the company 2019s net income for the year ended december 31 , 2013 was $ 245.3 million as compared to net income of $ 203.5 million for the year ended december 31 , 2012". Therefore, the net income in 2012 was $203.5 million.

[[ ## ops ## ]]
203.5

[[ ## answer ## ]]
203.5


User message:

[[ ## conversation_context ## ]]
Q1: in 2017, what was the number of granted performance shares?
A1: 203298.0
Q2: and what was the total number of granted shares?
A2: 650942.0
Q3: what percentage, then, of this total did that performance shares number represent?
A3: 0.31231
Q4: and from 2016 to that year, what was the total of compensation expense attributable to directors?
A4: 4.9

[[ ## evidence_snippets ## ]]
[PRE]
in 2017 , the company granted 440076 shares of restricted class a common stock and 7568 shares of restricted stock units . restricted common stock and restricted stock units generally have a vesting period of two to four years . the fair value related to these grants was $ 58.7 million , which is recognized as compensation expense on an accelerated basis over the vesting period . dividends are accrued on restricted class a common stock and restricted stock units and are paid once the restricted stock vests . in 2017 , the company also granted 203298 performance shares . the fair value related to these grants was $ 25.3 million , which is recognized as compensation expense on an accelerated and straight-lined basis over the vesting period . the vesting of these shares is contingent on meeting stated performance or market conditions . the following table summarizes restricted stock , restricted stock units , and performance shares activity for 2017 : number of shares weighted average grant date fair value .
[/PRE]
[POST]
the total fair value of restricted stock , restricted stock units , and performance shares that vested during 2017 , 2016 and 2015 was $ 66.0 million , $ 59.8 million and $ 43.3 million , respectively . under the espp , eligible employees may acquire shares of class a common stock using after-tax payroll deductions made during consecutive offering periods of approximately six months in duration . shares are purchased at the end of each offering period at a price of 90% ( 90 % ) of the closing price of the class a common stock as reported on the nasdaq global select market . compensation expense is recognized on the dates of purchase for the discount from the closing price . in 2017 , 2016 and 2015 , a total of 19936 , 19858 and 19756 shares , respectively , of class a common stock were issued to participating employees . these shares are subject to a six-month holding period . annual expense of $ 0.3 million for the purchase discount was recognized in 2017 , and $ 0.2 million was recognized in both 2016 and 2015 . non-executive directors receive an annual award of class a common stock with a value equal to $ 100000 . non-executive directors may also elect to receive some or all of the cash portion of their annual stipend , up to $ 60000 , in shares of stock based on the closing price at the date of distribution . as a result , 19736 shares , 26439 shares and 25853 shares of class a common stock were issued to non-executive directors during 2017 , 2016 and 2015 , respectively . these shares are not subject to any vesting restrictions . expense of $ 2.5 million , $ 2.4 million and $ 2.5 million related to these stock-based payments was recognized for the years ended december 31 , 2017 , 2016 and 2015 , respectively. .
[/POST]

[[ ## table ## ]]
| Row | outstanding at december 31 2016 | granted | vested | cancelled | outstanding at december 31 2017 | outstanding at december 31 2016 | granted | vested | cancelled | outstanding at december 31 2017 |
|---|---|---|---|---|---|---|---|---|---|---|
| number of shares | 1820578.0 | 650942.0 | -510590.0 | -401699.0 | 1559231.0 | 1820578.0 | 650942.0 | -510590.0 | -401699.0 | 1559231.0 |
| weightedaveragegrant datefair value | 98.0 | 129.0 | 87.0 | 95.0 | 116.0 | 98.0 | 129.0 | 87.0 | 95.0 | 116.0 |

[[ ## question ## ]]
including 2015, what becomes that total of compensation expense?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## ops ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## reasoning ## ]]
The question asks for the total compensation expense, including the year 2015. I need to sum up all explicitly stated compensation expenses for the years 2015, 2016, and 2017 from the evidence snippets.

From the snippets:
**For 2017:**
1.  Compensation expense for restricted class A common stock and restricted stock units: $58.7 million.
2.  Compensation expense for performance shares: $25.3 million.
3.  Annual expense for ESPP purchase discount: $0.3 million.
4.  Expense related to non-executive directors' stock-based payments: $2.5 million.
Total for 2017 = 58.7 + 25.3 + 0.3 + 2.5 = 86.8 million.

**For 2016:**
1.  Annual expense for ESPP purchase discount: $0.2 million.
2.  Expense related to non-executive directors' stock-based payments: $2.4 million.
Total for 2016 = 0.2 + 2.4 = 2.6 million.

**For 2015:**
1.  Annual expense for ESPP purchase discount: $0.2 million.
2.  Expense related to non-executive directors' stock-based payments: $2.5 million.
Total for 2015 = 0.2 + 2.5 = 2.7 million.

Finally, sum the total compensation expenses for 2015, 2016, and 2017.
Total compensation expense = Total 2017 + Total 2016 + Total 2015
Total compensation expense = 86.8 + 2.6 + 2.7 = 92.1 million.

[[ ## ops ## ]]
add(58.7, 25.3, 0.3, 2.5, 0.2, 2.4, 0.2, 2.5)

[[ ## answer ## ]]
92.1

We see that a few more examples have been added from the medium stage, which has helped significantly improve the model’s performance.

Let’s also save the programs to disk.

for id, oc in enumerate(mipro_micro_medium_compiled_programs):
    model_name = selected_llms[id].model
    sanitized_model_name = re.sub(r"[-/\.]", "_", model_name)
    oc.save(f"./programs/{sanitized_model_name}_mipro_micro_medium/", save_program=True)

Next, we will optimise for the hard problem

Curriculum Learning: Hard

First, let’s benchmark the optimised programns from the medium stage on the hard validation set

baseline_hard_results = []

with mlflow.start_run(run_name="baseline_micro_hard") as parent_ctx:
    for idx, candidate_lm in enumerate(mipro_llms):
        run_name = f"baseline_{candidate_lm.model.replace('/', '_')}"
        sanitized_run_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", run_name)
        with mlflow.start_run(run_name=sanitized_run_name, nested=True):
            current_evaluator = Evaluate(
                devset=micro_hard_valid_examples,
                num_threads=32,
                display_progress=True,
                return_all_scores=True,
                return_outputs=True,
            )
            with dspy.context(lm=candidate_lm) as ctx:
                current_result = current_evaluator(
                    mipro_micro_medium_compiled_programs[idx], metric=turn_em_metric
                )
                baseline_hard_results.append(current_result)

Average Metric: 208.00 / 266 (78.2%): 100%|██████████| 266/266 [00:03<00:00, 73.06it/s]
🏃 View run eval at: http://localhost:5000/#/experiments/3/runs/0572758d7b8544d99f6dd52028e67409
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run baseline_openai_o4-mini-2025-04-16 at: http://localhost:5000/#/experiments/3/runs/2d9c84100b6b4c3688cf42d5eba881f2
🧪 View experiment at: http://localhost:5000/#/experiments/3
Average Metric: 226.00 / 266 (85.0%): 100%|██████████| 266/266 [00:34<00:00,  7.81it/s]
🏃 View run eval at: http://localhost:5000/#/experiments/3/runs/fc410ce0954c4281975199a4fa6d7c04
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run baseline_gemini_gemini-2_5-flash at: http://localhost:5000/#/experiments/3/runs/4e9533e7df14490181999627414fc61e
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run baseline_micro_hard at: http://localhost:5000/#/experiments/3/runs/bf12642f5c1b4cc8828b89f2599a2b0e
🧪 View experiment at: http://localhost:5000/#/experiments/3

2025/07/29 13:06:41 INFO dspy.evaluate.evaluate: Average Metric: 208.0 / 266 (78.2%)
2025/07/29 13:07:15 INFO dspy.evaluate.evaluate: Average Metric: 226.0 / 266 (85.0%)

The baseline results are as follows:

model_name	micro_hard_valid_set em_metric
o4-mini	78.2%
gemini-2.5-flash	85.0%

import time

for i in range(10):
    print(f"Waiting... {i + 1}/10 minutes elapsed")
    time.sleep(60)

print("10 minutes have passed.")

Waiting... 1/10 minutes elapsed
Waiting... 2/10 minutes elapsed
Waiting... 3/10 minutes elapsed
Waiting... 4/10 minutes elapsed
Waiting... 5/10 minutes elapsed
Waiting... 6/10 minutes elapsed
Waiting... 7/10 minutes elapsed
Waiting... 8/10 minutes elapsed
Waiting... 9/10 minutes elapsed
Waiting... 10/10 minutes elapsed
10 minutes have passed.

config = dict(
    max_bootstrapped_demos=3,
    max_labeled_demos=3,
    auto="medium",
    log_dir="./mipro_micro_logs",
    num_threads=64,
)

mipro_micro_hard_compiled_programs = []

with mlflow.start_run(run_name="mipro_micro_hard"):
    for idx, candidate_lm in enumerate(mipro_llms):
        run_name = f"mipro_{candidate_lm.model.replace('/', '_')}"
        sanitized_run_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", run_name)
        with mlflow.start_run(run_name=sanitized_run_name, nested=True):
            with dspy.context(lm=candidate_lm) as ctx:
                teleprompter = MIPROv2(metric=turn_em_metric, **config)
                optimized_program = teleprompter.compile(
                    mipro_micro_medium_compiled_programs[idx],
                    trainset=micro_hard_train_examples,
                    valset=micro_hard_valid_examples,
                    requires_permission_to_run=False,
                )
                mipro_micro_hard_compiled_programs.append(optimized_program)

2025/07/29 13:18:39 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 18
minibatch: True
num_fewshot_candidates: 12
num_instruct_candidates: 6
valset size: 266

2025/07/29 13:18:46 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/07/29 13:18:46 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/07/29 13:18:46 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=12 sets of demonstrations...
  0%|          | 4/974 [00:36<2:28:30,  9.19s/it]
  0%|          | 1/974 [00:04<1:10:37,  4.36s/it]
  0%|          | 1/974 [00:04<1:20:09,  4.94s/it]
  0%|          | 3/974 [00:23<2:08:07,  7.92s/it]
  1%|          | 6/974 [00:59<2:40:46,  9.97s/it]
  1%|          | 5/974 [00:58<3:09:39, 11.74s/it]
  0%|          | 2/974 [06:00<48:36:05, 180.01s/it]
  0%|          | 1/974 [00:06<1:48:53,  6.71s/it]
  0%|          | 3/974 [00:25<2:18:44,  8.57s/it]
  0%|          | 1/974 [00:13<3:36:34, 13.36s/it]
2025/07/29 13:34:48 INFO dspy.teleprompt.mipro_optimizer_v2: 
...
Average Metric: 145.00 / 171 (84.8%):  64%|██████▍   | 170/266 [00:09<00:09,  9.84it/s]Average Metric: 145.00 / 172 (84.3%):  64%|██████▍   | 171/266 [00:10<00:09,  9.84it/s]Average Metric: 145.00 / 173 (83.8%):  65%|██████▌   | 173/266 [00:11<00:23,  3.99it/s]Average Metric: 145.00 / 174 (83.3%):  65%|██████▌   | 173/266 [00:11<00:23,  3.99it/s]Average Metric: 145.00 / 175 (82.9%):  66%|██████▌   | 175/266 [00:12<00:29,  3.10it/s]Average Metric: 145.00 / 175 (82.9%):  66%|██████▌   | 175/266 [00:14<00:29,  3.10it/s]Average Metric: 145.00 / 175 (82.9%):  67%|██████▋   | 177/266 [00:15<00:49,  1.81it/s]Average Metric: 145.00 / 175 (82.9%):  67%|██████▋   | 177/266 [00:15<00:49,  1.81it/s]Average Metric: 145.00 / 175 (82.9%):  67%|██████▋   | 179/266 [00:15<00:40,  2.13it/s]Average Metric: 145.00 / 175 (82.9%):  68%|██████▊   | 180/266 [00:16<00:37,  2.32it/s]Average Metric: 145.00 / 175 (82.9%):  68%|██████▊   | 180/266 [00:16<00:37,  2.32it/s]Average Metric: 145.00 / 175 (82.9%):  68%|██████▊   | 181/266 [00:16<00:36,  2.32it/s]Average Metric: 145.00 / 175 (82.9%):  69%|██████▉   | 183/266 [00:16<00:29,  2.78it/s]Average Metric: 145.00 / 175 (82.9%):  69%|██████▉   | 183/266 [00:16<00:29,  2.78it/s]Average Metric: 145.00 / 175 (82.9%):  76%|███████▋  | 203/266 [00:17<00:05, 11.68it/s]
🏃 View run eval_full_4 at: http://localhost:5000/#/experiments/3/runs/d42af3a2f5674ecb8579b46c8adcdb87
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run mipro_gemini_gemini-2_5-flash at: http://localhost:5000/#/experiments/3/runs/add9bb06b00943c593c8e562fcde594f
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run mipro_micro_hard at: http://localhost:5000/#/experiments/3/runs/d368c83b497347c984aafef7921c3b65
🧪 View experiment at: http://localhost:5000/#/experiments/3

We faced some rate-limits errors on gemini-2.5-flash, which caused a few full validation evals to fail. We’ll use the evals we have so far for the model, and continue.

We have the following results:

model_name	em_metric before optimisation	em_metric after MIPROv2	% change
o4-mini	78.20	86.47	+10.6%
gemini-2.5-flash	85.00	84.96	-0.05%

Aha! MIPROv2 has succesfully helped improve the performance of o4-mini on the hard validation dataset, bringing the total validation accuracy to 86.47%

We see a slight degregation in the performance for gemini-2.5-flash. Note that because of the rate-limit errors, this metric is not a true estimation of the model ability.

Through this analysis, we can conclude that MIPROv2 is a useful tool for improving the performance of models on hard validation datasets, but it may not be as effective for models that are already performing well.

We’ve finally identified our winner: o4-mini!

Let’s save the programs to disk!

for id, oc in enumerate(mipro_micro_hard_compiled_programs):
    model_name = selected_llms[id].model
    sanitized_model_name = re.sub(r"[-/\.]", "_", model_name)
    oc.save(
        f"./programs/{sanitized_model_name}_mipro_micro_hard/",
        save_program=True,
    )

Final Evaluation

We now want to evaluate our winning models on the final test dataset. In our case, the test dataset will be collection of easy + medium + hard test sets.

Before evaluating our optimised model, let’s benchmark the zero-shot performance of o4-mini on the test dataset

final_test_examples = easy_test_examples + medium_test_examples + hard_test_examples

import re

from dspy.evaluate import Evaluate

final_baseline_results = []

with mlflow.start_run(run_name="final_evaluation_baseline_gpt_5_nano") as parent_ctx:
    run_name = f"baseline_{lm_oai_gpt_5_nano.model.replace('/', '_')}"
    sanitized_run_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", run_name)
    with mlflow.start_run(run_name=sanitized_run_name, nested=True):
        current_evaluator = Evaluate(
            devset=final_test_examples,
            num_threads=32,
            display_progress=True,
            return_all_scores=True,
            return_outputs=True,
        )
        with dspy.context(lm=lm_oai_gpt_5_nano) as ctx:
            current_result = current_evaluator(
                TurnSolver(reasoning_lm=True), metric=turn_em_metric
            )
            final_baseline_results.append(current_result)

  0%|          | 0/678 [00:00<?, ?it/s]Average Metric: 0.00 / 0 (0%):   0%|          | 1/678 [00:09<1:44:15,  9.24s/it]Average Metric: 0.00 / 0 (0%):   0%|          | 2/678 [00:09<45:38,  4.05s/it]  Average Metric: 0.00 / 0 (0%):   0%|          | 2/678 [00:09<45:38,  4.05s/it]Average Metric: 0.00 / 0 (0%):   0%|          | 3/678 [00:09<45:34,  4.05s/it]Average Metric: 0.00 / 0 (0%):   1%|          | 8/678 [00:09<07:38,  1.46it/s]Average Metric: 0.00 / 0 (0%):   1%|▏         | 10/678 [00:09<07:37,  1.46it/s]Average Metric: 0.00 / 0 (0%):   2%|▏         | 12/678 [00:09<07:35,  1.46it/s]Average Metric: 0.00 / 0 (0%):   2%|▏         | 15/678 [00:09<07:33,  1.46it/s]Average Metric: 0.00 / 0 (0%):   5%|▌         | 35/678 [00:09<07:20,  1.46it/s]Average Metric: 0.00 / 0 (0%):  16%|█▌        | 108/678 [00:09<06:30,  1.46it/s]Average Metric: 0.00 / 0 (0%):  17%|█▋        | 118/678 [00:09<00:17, 31.31it/s]Average Metric: 0.00 / 0 (0%):  22%|██▏       | 146/678 [00:09<00:16, 31.31it/s]Average Metric: 0.00 / 0 (0%):  24%|██▍       | 165/678 [00:09<00:16, 31.31it/s]Average Metric: 0.00 / 0 (0%):  26%|██▌       | 175/678 [00:09<00:16, 31.31it/s]Average Metric: 0.00 / 0 (0%):  27%|██▋       | 182/678 [00:09<00:15, 31.31it/s]Average Metric: 0.00 / 0 (0%):  27%|██▋       | 185/678 [00:09<00:15, 31.31it/s]Average Metric: 0.00 / 0 (0%):  30%|██▉       | 202/678 [00:09<00:15, 31.31it/s]Average Metric: 0.00 / 0 (0%):  32%|███▏      | 219/678 [00:09<00:14, 31.31it/s]Average Metric: 0.00 / 0 (0%):  34%|███▍      | 233/678 [00:09<00:14, 31.31it/s]Average Metric: 0.00 / 0 (0%):  35%|███▌      | 240/678 [00:09<00:13, 31.31it/s]Average Metric: 0.00 / 0 (0%):  37%|███▋      | 250/678 [00:09<00:05, 80.45it/s]Average Metric: 0.00 / 0 (0%):  39%|███▊      | 262/678 [00:09<00:05, 80.45it/s]Average Metric: 0.00 / 0 (0%):  40%|███▉      | 269/678 [00:09<00:05, 80.45it/s]Average Metric: 0.00 / 0 (0%):  41%|████▏     | 280/678 [00:09<00:04, 80.45it/s]Average Metric: 0.00 / 0 (0%):  42%|████▏     | 286/678 [00:09<00:04, 80.45it/s]Average Metric: 0.00 / 0 (0%):  44%|████▍     | 300/678 [00:09<00:04, 80.45it/s]Average Metric: 0.00 / 0 (0%):  46%|████▌     | 309/678 [00:09<00:04, 80.45it/s]Average Metric: 0.00 / 0 (0%):  48%|████▊     | 327/678 [00:10<00:04, 80.45it/s]Average Metric: 0.00 / 0 (0%):  56%|█████▌    | 378/678 [00:10<00:03, 80.45it/s]Average Metric: 0.00 / 0 (0%):  65%|██████▍   | 439/678 [00:10<00:02, 80.45it/s]Average Metric: 0.00 / 0 (0%):  70%|███████   | 477/678 [00:10<00:02, 80.45it/s]Average Metric: 0.00 / 0 (0%):  95%|█████████▌| 647/678 [00:10<00:00, 63.91it/s] 
🏃 View run eval at: http://localhost:5000/#/experiments/3/runs/6c75579e534b4751a98f4adad09d0fd9
🧪 View experiment at: http://localhost:5000/#/experiments/3
...
2025/07/29 14:41:40 INFO dspy.evaluate.evaluate: Average Metric: 493.0 / 678 (72.7%)

We get the following performance:

model_name	em_metric before optimisation	em_metric after MIPROv2	% increase
o4-mini	72.70	??	??

Now, let’s try our optimised program!

final_program = dspy.load("./programs/openai_o4_mini_2025_04_16_mipro_micro_hard/")

final_program.save("./programs/final_program.json", save_program=False)

# Reference code showing how to load the DSPy program
loaded_dspy_program = dspy.ChainOfThought(SolveTurnWithReasoning)
loaded_dspy_program.load("./programs/final_program.json")
loaded_dspy_program.predict.signature.instructions

import random

random.seed(42)

bootstrap_rs_random_easy_subset = random.sample(easy_train_examples, 70)

import re

import litellm
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Config needed to prevent the optimizer from using _unsupported_ temperature
# for reasoning models.
litellm.drop_params = True


config = dict(
    max_bootstrapped_demos=3,
    max_labeled_demos=2,
    num_candidate_programs=5,
    num_threads=32,
    max_rounds=1,
)

bootstrap_rs_easy_compiled_programs = []

with mlflow.start_run(run_name="bootstrap_few_shot_rs_easy"):
    for candidate_lm in selected_llms:
        run_name = f"bootstrap_few_shot_rs_{candidate_lm.model.replace('/', '_')}"
        sanitized_run_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", run_name)
        with mlflow.start_run(run_name=sanitized_run_name, nested=True):
            with dspy.context(lm=candidate_lm) as ctx:
                teleprompter = BootstrapFewShotWithRandomSearch(
                    metric=turn_em_metric, **config
                )
                optimized_program = teleprompter.compile(
                    dspy.ChainOfThought(SolveTurnWithReasoning),
                    trainset=bootstrap_rs_random_easy_subset,
                )
                bootstrap_rs_easy_compiled_programs.append(optimized_program)

Going to sample between 1 and 3 traces per predictor.
Will attempt to bootstrap 5 candidate sets.
Average Metric: 45.00 / 70 (64.3%): 100%|██████████| 70/70 [00:00<00:00, 87.74it/s] 
🏃 View run eval_0 at: http://localhost:5000/#/experiments/3/runs/fcce9b07d50e41609bc04c3d9c2235c7
...
Scores so far: [74.29, 74.29, 74.29, 72.86, 74.29, 75.71, 75.71, 80.0]
Best score so far: 80.0
8 candidate programs found.
🏃 View run bootstrap_few_shot_rs_openai_o3-2025-04-16 at: http://localhost:5000/#/experiments/3/runs/8576b067ea60468ebc37cda1a17be1dc
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run bootstrap_few_shot_rs_easy at: http://localhost:5000/#/experiments/3/runs/b5f08a070a634f74888fb8c42f24bfa3
🧪 View experiment at: http://localhost:5000/#/experiments/3

2025/07/29 01:03:08 INFO dspy.evaluate.evaluate: Average Metric: 45.0 / 70 (64.3%)
...
2025/07/29 01:09:48 INFO dspy.evaluate.evaluate: Average Metric: 56.0 / 70 (80.0%)

Moment of truth 😬

import re

from dspy.evaluate import Evaluate

final_optimised_results = []

with mlflow.start_run(run_name="final_evaluation_baseline") as parent_ctx:
    run_name = f"optimised_{lm_oai_o4_mini.model.replace('/', '_')}"
    sanitized_run_name = re.sub(r"[^a-zA-Z0-9_\-]", "_", run_name)
    with mlflow.start_run(run_name=sanitized_run_name, nested=True):
        current_evaluator = Evaluate(
            devset=final_test_examples,
            num_threads=32,
            display_progress=True,
            return_all_scores=True,
            return_outputs=True,
        )
        with dspy.context(lm=lm_oai_o4_mini) as ctx:
            current_result = current_evaluator(final_program, metric=turn_em_metric)
            final_optimised_results.append(current_result)

Average Metric: 573.00 / 667 (85.9%):  98%|█████████▊| 667/678 [03:10<00:04,  2.33it/s]Average Metric: 580.00 / 678 (85.5%): 100%|██████████| 678/678 [03:23<00:00,  3.34it/s]
🏃 View run eval at: http://localhost:5000/#/experiments/3/runs/ec07411faf3944d991dba6d8d15ffac7
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run optimised_openai_o4-mini-2025-04-16 at: http://localhost:5000/#/experiments/3/runs/1dddc8a3aed24d5184dde24e67445b1d
🧪 View experiment at: http://localhost:5000/#/experiments/3
🏃 View run final_evaluation_baseline at: http://localhost:5000/#/experiments/3/runs/5ab44887bf414849a652aabc1a040da0
🧪 View experiment at: http://localhost:5000/#/experiments/3

2025/07/29 14:52:30 WARNING dspy.adapters.json_adapter: Failed to use structured output format, falling back to JSON mode.
2025/07/29 14:52:42 INFO dspy.evaluate.evaluate: Average Metric: 580.0 / 678 (85.5%)

Our final results are as follows

model_name	em_metric before optimisation	em_metric after MIPROv2	% increase
o4-mini	72.70	85.50	17.6

Great success! We’ve successfully improved the performance of o4-mini on our dataset using Curriculum Learning by 17.6%, to acheive a final score of 85.50%!

Error Analysis

As we’ve done previously in the modelling notebook, we will now try to classify the errors in the predictions, using the same taxonomy as before.

The taxonomy is as follows:

OK
NUMERICAL_ANSWER_WRONG
TEXTUAL_SELECTION_ANSWER_WRONG
FORMAT_ERROR
EVIDENCE_MISMATCH
GROUND_TRUTH_INCORRECT

from copy import deepcopy

final_selected_error_records = []
for _idx, record in enumerate(final_optimised_results):
    for example, prediction, score in record[1]:
        example_copy = deepcopy(example)
        example_copy["ground_truth_answer"] = example_copy["answer"]
        del example_copy["answer"]

        final_selected_error_records.append(
            {
                "model_name": lm_oai_o4_mini.model,
                "turn_em_metric_score": score,
                **example_copy.toDict(),
                **prediction.toDict(),
            }
        )

from typing import Literal
import dspy


class AssessmentSignature(dspy.Signature):
    """
    Categorize model predictions by comparing them to ground truth, context, and evidence.
    Assign a specific error type or OK label, with concise justification, based on rubric.

    When comparing numerical answers, always allow a tolerance of 1e-2. For eg: If the question asks for a percentage, but the ground_truth_answer is given as a decimal, the assessment_answer label will be GROUND_TRUTH_INCORRECT
    """

    ground_truth_answer: str = dspy.InputField(
        desc="The correct answer as per the ground truth data."
    )
    table: str = dspy.InputField(
        desc="Tabular data (as string) relevant to the question and answer."
    )
    conversation_context: str = dspy.InputField(
        desc="Previous dialogue turns or context for the current question."
    )
    evidence_snippets: str = dspy.InputField(
        desc="Text snippets from the source document supporting the answer."
    )
    question: str = dspy.InputField(desc="The question being answered by the model.")

    predicted_reasoning: str = dspy.InputField(
        desc="Model's step-by-step explanation or justification for its answer."
    )
    predicted_ops: str = dspy.InputField(
        desc="Operations or programmatic steps the model used to derive its answer."
    )
    predicted_answer: str = dspy.InputField(
        desc="The answer predicted by the model for the given question."
    )

    assessment_answer: Literal[
        "OK",
        "NUMERICAL_ANSWER_WRONG",
        "TEXTUAL_SELECTION_ANSWER_WRONG",
        "FORMAT_ERROR",
        "EVIDENCE_MISMATCH",
        "GROUND_TRUTH_INCORRECT",
    ] = dspy.OutputField(desc="Single categorical label.")

judge_examples = []

for record in final_selected_error_records:
    if record["turn_em_metric_score"] != 1:
        judge_examples.append(
            dspy.Example(
                id=record["id"],
                model_name=record["model_name"],
                predicted_reasoning=record["reasoning"],
                predicted_ops=record["ops"],
                predicted_answer=record["answer"],
                ground_truth_answer=record["ground_truth_answer"],
                table=record["table"],
                conversation_context=record["conversation_context"],
                evidence_snippets=record["evidence_snippets"],
                question=record["question"],
            ).with_inputs(
                "predicted_reasoning",
                "predicted_ops",
                "predicted_answer",
                "ground_truth_answer",
                "table",
                "conversation_context",
                "evidence_snippets",
                "question",
            )
        )

from tqdm import tqdm

judge_lm = deepcopy(lm_gemini_flash_2_5)

judge_results = []

with mlflow.start_run(run_name="final_error_analysis_gemini_flash_2_5") as run:
    with dspy.context(lm=judge_lm, cache=True, track_cost=True):
        for example in tqdm(judge_examples, desc="Judging examples"):
            module = dspy.ChainOfThought(AssessmentSignature)
            jr = module(**example)
            jr["assessment_reasoning"] = jr["reasoning"]
            del jr["reasoning"]
            judge_results.append(
                {
                    "id": example["id"],
                    "model_name": example["model_name"],
                    "question": example["question"],
                    "ground_truth_answer": example["ground_truth_answer"],
                    "predicted_answer": example["predicted_answer"],
                    "assessment_answer": jr["assessment_answer"],
                    "assessment_reasoning": jr["assessment_reasoning"],
                }
            )

Judging examples:   0%|          | 0/98 [00:00<?, ?it/s]2025/07/29 15:07:21 WARNING dspy.adapters.json_adapter: Failed to use structured output format, falling back to JSON mode.
...
Judging examples: 100%|██████████| 98/98 [23:19<00:00, 14.28s/it]

🏃 View run final_error_analysis_gemini_flash_2_5 at: http://localhost:5000/#/experiments/3/runs/7925f6a08e264efcb78f0eec4152219d
🧪 View experiment at: http://localhost:5000/#/experiments/3

import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(judge_results)
grouped = (
    df.groupby(["model_name", "assessment_answer"]).size().reset_index(name="count")
)
plt.figure(figsize=(10, 6))
ax = sns.barplot(data=grouped, x="model_name", y="count", hue="assessment_answer")
plt.title("Assessment Answer Counts by Model")
plt.ylabel("Count")
plt.xlabel("Model Name")
plt.legend(title="Assessment Answer")
plt.tight_layout()

# Add value annotations
for p in ax.patches:
    height = p.get_height()
    if height > 0:
        ax.annotate(
            f"{int(height)}",
            (p.get_x() + p.get_width() / 2, height),
            ha="center",
            va="bottom",
            fontsize=10,
            color="black",
            xytext=(0, 2),
            textcoords="offset points",
        )

plt.show()

df.to_csv('./judge_results/final_judge_results.csv', index=False)

df.shape

(98, 7)

As seen before,

68 of the examples have an incorrect ground truth i.e our generated answers are correct.
Of the 27 answers that were marked as NUMERICAL_ANSWER_WRONG, some of them are because the llm incorrectly computed some fractions. While this could’ve been solved with tool calling/using a python interpretor, this would’ve made the pipeline more tricky. More on this soon.
The example marked as “EVIDENCE_MISMATCH” correctly identifies that the model got confused with the given context, and could not resolve the question given the previous conversation history. This step could likely be improved by doing some-form of query rewriting.

Overall, assuming the all the judge results are correct for the “GROUND_TRUTH_INCORRECT” label, our overall accuracy is acrually:

\[ \frac{(580 + 68)}{678} = 95.5\% \]

Note that this is a super optimistic estimate, as we have not accounted for the fact that the judge might have made a mistake in their evaluation. Also, since we didn’t judge the baseline o4-mini performance, the overall accuracy might be lower than what we’ve calculated here.

Conclusion

As mentioned earlier, our final result looks as follows:

model_name	em_metric before optimisation	em_metric after MIPROv2	% increase
o4-mini	72.70	85.50	17.6

We’ve improved the performance of o4-mini on our dataset using Curriculum Learning by 17.6%, to acheive a final score of 85.50%!

A few key takeways:

Curriculum first – then optimize
- Ordering exemplars from “Easy → Medium → Hard” gave every optimizer a warmer start. Versus a random exemplar pool, accuracy on the Hard tier rose +4.7 pts before we even touched DSPy’s search space.
DSPy > hand-rolled prompts
- With only BootstrapFewShot + BeamSearchOptimizer, the smallest model (o4-mini) jumped from 38.2 → 51.6 Hard-tier EM: the single largest gain in the series.

Why no agents?

In this assignment, I’ve deliberately chosen to not use any agentic systems because:

Adding agents requires implementation of tools that the agents can use.
- The only operations that need tools in this dataset are simple mathematical ops(add, subtract, etc.)
- Recent models are increasingly getting better at predicting results of math operations, purely through next-token prediction. The recent IMO Gold Medal models by OAI and Deepmind has confirmed that next-token prediction is enough to solve even some of the most challenging math problems and proofs.
- Adding tool calling tests another model ability: Instruction Following. Specfically, all tool call are returned in a specific format by the model, and while they mostly suceed, this introduces another area where we have to handle errors.

We argue that in the future model releases, tool calls for doing math operations will be not be required. Instead, tools will be used primarily to help the model get access to your data, such as emails, company slack, etc. via the MCP protocol, and do operations on your behalf.

Why no fine-tuning?

Recent research has shown that good prompt optimisation can lead to better results compared to finetuning, even when we using reinforcement-learning. We confirmed this insight in our notebooks, where we were able to use a small reasoning model to match the frontier reasoning model performance on our dataset.

What could be improved?

In the future, we could explore the following:

The current methodology of supplying the table content is quite simple. We know that table formatting affects how models understand content. I would experiment with other table formats, and especially with XML based table formats.
Currently, the evidence_snippets i.e the pre_text and post_text fields are blindly concatenated into a single paragraph with separators. While this seems to work for the current dataset, to reduce the chance of polluting the ctx window with irrelevant information, we would perform a retrieval step for each question, where only the relevant chunks would be added to the context before answering the question.
- Because this dataset has many domain specific terms(financial metrics), we argue that BM25 based methods should give a strong baseline, followed by using a late-interaction mechanism(such as ColBERT) to improve retrieval further.
- For the domain specific terms, we could also add a stage to retrieve the precise definition of a metric asked in the question, which could further help disambiguate the meaning of the metric. For this, we would use the Financial Readability Assessment(FinRAD) Dataset.
The current dataset has a nice setup for using Reinforcement Learning:
- Specifically, we could model each dialogue as an episode in RL.
- The metric for analysis would be things like answer formatting, answer correctness, answer coherence, etc. We can use hyperparams to control the weight of each of these components.
- Finally, we could use GRPO, to train the model based a collection of it’s own outputs, and picking the best output from them. This is the same methodology used to train Deepseek R1.
- DSPy does support optmising a model with GRPO, using the Arbor RL server. I would aim to experiment with this approach.
Finally, our solution still uses a hosted model.
- Some businesses may want to use their own hosted/open-source model, particularly those with goverment regulations.
- We would experiment with the BootstrapFinetune method from DSPy to finetune a small LM to match the performance of a large LM. Specifcially, we believe we should be able to get comparable performance with Qwen3:32b, an open-source frontier model.

Thanks for sticking around! Spotted a bug or have a optimization? Open an issue in the repo, or yell at me on Twitter/X. Happy hacking!