Fluency is not pedagogy
General-purpose foundation models are remarkably fluent. Ask one to explain the Krebs cycle and it will produce a coherent, well-structured answer. What it will not reliably do is explain the Krebs cycle differently to a learner who has just confused it with glycolysis than to one who understands the inputs but not the energy yield. Fluency is a language property. Pedagogy is a relational one — it depends on a model of the learner, not just a model of the subject. Off-the-shelf instruction tuning optimises for being helpful and harmless to an anonymous user. It does not optimise for tracking a specific person's evolving understanding across a session, a week, or a course. That gap is what our training pipeline exists to close.
What a foundation model is missing
- Learner state — what this specific person currently understands, and where the gap is
- Calibrated difficulty — matching explanation complexity to demonstrated readiness, not guessing
- Productive failure — knowing when to withhold the answer and ask a guiding question instead
- Misconception repair — recognising a wrong mental model, not just a wrong answer
- Restraint — stopping at the right depth instead of dumping everything the model knows
The data problem comes first
Every training decision downstream is constrained by what data exists to learn from. Generic web text teaches a model to sound like an explanation. It does not teach a model what happens after the explanation — whether the learner understood, what they asked next, where they got stuck. We build training corpora around interaction traces: real tutoring dialogues, annotated by where a learner's confusion surfaced and what response actually resolved it, plus synthetic dialogues generated to cover gaps the real data does not reach. The annotation layer is the expensive part. A transcript without a label for "this explanation worked" or "this one needed a second attempt" is just text. The label is what turns it into a training signal for pedagogy rather than for prose.
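As a rough illustration of what an annotation layer like this might look like, the sketch below attaches an outcome label to each tutor turn and keeps only labelled turns as training examples. The class names, field names (`confusion_surfaced`, `outcome`) and label values are assumptions made for this example, not the actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Outcome(Enum):
    """Hypothetical outcome labels for a tutor explanation."""
    RESOLVED = "resolved"          # "this explanation worked"
    NEEDED_RETRY = "needed_retry"  # "this one needed a second attempt"
    UNLABELLED = "unlabelled"      # just text: no training signal yet


@dataclass
class Turn:
    speaker: str                           # "tutor" or "learner"
    text: str
    confusion_surfaced: bool = False       # annotator flag on learner turns
    outcome: Outcome = Outcome.UNLABELLED  # annotator label on tutor turns


@dataclass
class Transcript:
    turns: List[Turn] = field(default_factory=list)
    synthetic: bool = False  # True for generated gap-filling dialogues

    def training_examples(self) -> List[Turn]:
        """Keep only tutor turns that carry an outcome label."""
        return [
            t for t in self.turns
            if t.speaker == "tutor" and t.outcome is not Outcome.UNLABELLED
        ]


transcript = Transcript(turns=[
    Turn("learner", "Isn't the Krebs cycle the same as glycolysis?",
         confusion_surfaced=True),
    Turn("tutor", "Where in the cell does each one happen?",
         outcome=Outcome.RESOLVED),
    Turn("tutor", "Here is some more background."),  # unlabelled: skipped
])
labelled = transcript.training_examples()
```

In this toy transcript only the second turn survives the filter, which is the point of the paragraph above: the unlabelled tutor turn is still readable prose, but it carries nothing to learn pedagogy from.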
The training pipeline, in order
- Domain pretraining continuation — further pretraining on curated educational and tutoring text, not just instruction data
- Supervised fine-tuning on annotated tutoring transcripts — explanations paired with labelled learner outcomes
- Misconception-targeted fine-tuning — a dedicated dataset of known error patterns and the corrections that resolve them
- Preference optimisation — reward signal built from which of two explanations produced better demonstrated understanding, not which one a rater preferred to read
- Calibration fine-tuning — penalising confident wrong answers more heavily than honest uncertainty
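The last step in the list can be made concrete with a scoring rule in which a confident wrong answer costs more than an admission of uncertainty. The function below is a minimal sketch under stated assumptions: the name `calibration_penalty`, the fixed `abstain_cost`, and the choice of negative log-likelihood are all illustrative, not the loss actually used in the pipeline.

```python
import math


def calibration_penalty(
    confidence: float,
    correct: bool,
    abstained: bool = False,
    abstain_cost: float = 0.5,  # hypothetical flat cost for "I'm not sure"
) -> float:
    """Penalty for one answer; lower is better.

    Uses negative log-likelihood, so the penalty for a wrong answer
    grows without bound as stated confidence approaches 1.
    """
    if abstained:
        return abstain_cost
    # Clamp to avoid log(0) at the extremes.
    p = min(max(confidence, 1e-6), 1.0 - 1e-6)
    return -math.log(p) if correct else -math.log(1.0 - p)


confident_wrong = calibration_penalty(0.95, correct=False)
hedged_wrong = calibration_penalty(0.55, correct=False)
honest_unsure = calibration_penalty(0.0, correct=False, abstained=True)
confident_right = calibration_penalty(0.95, correct=True)
```

With these assumed settings the ordering is `confident_wrong > hedged_wrong > honest_unsure > confident_right`, so the model is never rewarded for bluffing, and is not rewarded for abstaining when it could have answered correctly.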
Why preference data cannot just be "which answer is better"
Most preference optimisation in the industry asks human raters which of two model outputs they would rather read. That signal rewards explanations that sound satisfying: comprehensive, confident, well-organised. It does not reward explanations that actually teach. The two frequently diverge: the answer that resolves a misconception is sometimes shorter, less complete, and more likely to ask the learner a question than to give them everything at once. Wherever we could, we replaced rater preference with a measured outcome: whether the learner answered a follow-up retrieval question correctly after one explanation more often than after the alternative. Where that signal is too slow or too sparse to train against directly, we use a reward model trained specifically to predict it, rather than a reward model trained on general human preference.
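To show how an outcome-based preference signal differs from a rater-based one, the sketch below builds a (chosen, rejected) pair from follow-up results alone. The `Trial` record, the `min_gap` threshold and the function names are hypothetical, and a real pipeline would need far more trials and a proper significance test before trusting a gap.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trial:
    """One explanation shown to one learner, plus the follow-up result."""
    explanation: str
    followup_correct: bool


def outcome_rate(trials: List[Trial], explanation: str) -> Optional[float]:
    """Share of learners who got the follow-up right after this explanation."""
    shown = [t for t in trials if t.explanation == explanation]
    if not shown:
        return None
    return sum(t.followup_correct for t in shown) / len(shown)


def preference_pair(
    trials: List[Trial], a: str, b: str, min_gap: float = 0.1
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) by measured outcome, or None if unclear."""
    rate_a, rate_b = outcome_rate(trials, a), outcome_rate(trials, b)
    if rate_a is None or rate_b is None:
        return None  # no data for one side
    if abs(rate_a - rate_b) < min_gap:
        return None  # gap too small to call a winner
    return (a, b) if rate_a > rate_b else (b, a)


short = "What goes into the cycle, and what comes out?"
long = "Here is the complete eight-step pathway with every enzyme."
trials = [
    Trial(short, True), Trial(short, True),
    Trial(long, True), Trial(long, False),
]
pair = preference_pair(trials, short, long)
```

Here the shorter, question-led explanation is chosen because more learners answered the follow-up correctly after it, regardless of which text a rater would have preferred to read.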
The hardest thing to train into a model is not knowledge. It is the discipline to not say everything it knows the moment it is asked.
Evaluation has to look like tutoring, not like a benchmark
Standard language model benchmarks measure whether a model produces a correct final answer. That is close to irrelevant for a tutoring model, which is judged on how the learner arrives at an answer, not on whether the model can produce one. Our evaluation harness runs the model against simulated learners seeded with specific, realistic misconceptions. It scores the model on whether the learner's simulated belief state moves toward the correct one over the course of a conversation, and on how many turns that takes. A model that gives the textbook-perfect explanation on turn one but leaves the misconception untouched scores worse than one that asks a clarifying question first and resolves it by turn three. This is the model that ends up inside Talos and reaches a learner through Gripho: not the most fluent one, but the one that most reliably changes what a learner understands.
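A toy version of this kind of evaluation loop might look like the following. Everything in it is an assumption made for illustration: the single-number belief state, the per-move gains, the threshold, and the two hand-written policies standing in for models. It is a sketch of the scoring idea, not the harness itself.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical belief gain per tutor move. A lecture barely touches the
# misconception; probing and targeted repair move the learner further.
MOVE_GAIN = {"lecture": 0.05, "probe": 0.25, "repair": 0.30}


@dataclass
class SimulatedLearner:
    """Toy learner: belief in the correct model is a number in [0, 1]."""
    belief: float = 0.1  # seeded low, i.e. holding a misconception

    def respond(self, tutor_move: str) -> None:
        gain = MOVE_GAIN.get(tutor_move, 0.0)
        self.belief = min(1.0, self.belief + gain)


def turns_to_resolve(
    policy: Callable[[int, float], str],
    max_turns: int = 6,
    threshold: float = 0.8,
) -> Optional[int]:
    """Turns until belief crosses the threshold, or None if it never does."""
    learner = SimulatedLearner()
    for turn in range(1, max_turns + 1):
        learner.respond(policy(turn, learner.belief))
        if learner.belief >= threshold:
            return turn
    return None


def lecture_policy(turn: int, belief: float) -> str:
    return "lecture"  # textbook explanation every turn


def probe_then_repair(turn: int, belief: float) -> str:
    return "probe" if turn == 1 else "repair"  # clarify first, then fix


lecture_score = turns_to_resolve(lecture_policy)
probe_score = turns_to_resolve(probe_then_repair)
```

Under these invented dynamics the lecturing policy never moves the learner past the threshold within six turns, while the policy that opens with a clarifying question gets there on turn three. The harness rewards the second policy even though the first may produce the more polished text.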