Teaching Machines to Read
Chapter 1 · Published March 29, 2026

Early Exploration

What happens when you ask an AI to read a 400-year-old manuscript — and what it reveals about how reading works.

Can Claude learn to transcribe early modern recipe manuscripts? I started with this research question because of a larger interest in how recipe books function as a genre that produced early medical and scientific knowledge outside the more codified genre of the anatomy book. By asking whether Claude can perform the role of a paleographer, I hoped to open up larger archives, including archives that have never been transcribed by a human. But I soon realized that the question of teaching a machine to read deserves exploration in its own right, for its pedagogical and methodological implications. As I began running test examples on Claude, a striking trend emerged: how you structure the learning matters more than what you tell the learner.

Here’s what I’ve learned about teaching machines to read during these early implementation stages.

Only structural process changes improved accuracy — better instructions never helped

Paleography is a legitimately challenging task

Paleography challenges both humans and machines, though the evaluation metrics differ. Both face non-standard spelling and letterforms unlike contemporary handwriting. And just as today, some hands are harder to read than others, a difficulty compounded by damaged manuscript pages and poor image quality.

Henslow MS688 — the error categories that moved across runs

As a discipline, paleography also requires decisions about how to transcribe: there is no single ground truth for what counts as a “correct” transcription, especially given the challenges of working in an early modern archive. Diplomatic transcription is the most literal type: the goal is to write exactly what you see on the page, so every word, mark, and abbreviation stays exactly the same. Semi-diplomatic transcription is somewhat less strict. The goal remains to faithfully preserve the original text (spelling, punctuation, capitalization, line breaks) while allowing a small number of standardized editorial interventions, including expanded abbreviations, lowered superscript letters, and replaced thorns. This project uses semi-diplomatic transcription practices to preserve the integrity of the original manuscript page while ensuring legibility for contemporary readers.

Measuring Claude’s success

Character Error Rate (CER) is the standard metric for measuring the accuracy of a transcription across the Handwritten Text Recognition (HTR) and Optical Character Recognition (OCR) fields. The lower the CER (i.e. the closer to 0%), the greater the accuracy. The field currently recognizes <5% as usable for most research purposes. The current benchmark for this project is 3.80% CER on Henslow MS688 (Run 6).

Henslow MS688 — Character Error Rate across eight blind evaluation runs. For comparison, Transkribus achieves ~5–8% CER with its general model (no training) and ~3% with a hand-trained model (2,500 pages of ground truth).
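To make the metric concrete, here is a minimal sketch of how CER is conventionally computed: the Levenshtein edit distance between the hypothesis and the reference transcription, divided by the reference length. This is illustrative only, not the project's actual evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edits needed to fix the hypothesis / reference length."""
    return edit_distance(ref, hyp) / len(ref)

# One misread character in a ten-character reference line → 10.0% CER
print(f"{cer('take sages', 'take sapes'):.1%}")  # → 10.0%
```

Note that insertions and deletions count too, which is why a hallucinated run of invented text can push CER far past 100% of a short reference.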

These initial results are encouraging, but they come with a few major caveats. The Henslow manuscript is an easy hand, and only one agent reached this CER, only once across eight runs. When agents struggled to transcribe the manuscript, they often hallucinated their results rather than admitting they were unsure.

These hallucinations, however, yielded an important methodological discovery: simply telling agents to “never make up text when they are unsure” does not work. What does work is providing pedagogical materials that help agents through challenging manuscripts.

Pedagogical materials: Successes

The most promising methodological change was telling agents to study a manuscript’s alphabet before trying to read a single word. This is closest to how humans are taught early modern paleography. The pedagogical act of building an alphabet helped immensely against hallucination, since agents were now working out how a specific hand forms letters rather than guessing what a word could be.

The alphabet chart the AI built by studying Henslow MS688 — what it learned before reading a single word

The other promising methodological change was giving agents access to a vocabulary verification list to check their readings against. The vocabulary reference is ~19,000 words from 40 early modern sources relevant to recipe books, including 38 FromThePage community transcriptions, three EMROC triple-keyed transcriptions, and two printed herbals (Gerard’s 1597 and Culpeper’s 1652). Importantly, the vocabulary list works as a verification tool rather than a predictive one. For example, in Run 6 of the Sedley MS534, the agent initially misread a word as Sallanders (a real word, a horse skin condition), but when it checked against the vocabulary list, it found celandine (an herb) with the unexpected spellings Sallandine / Sallendine across 15 different manuscript sources. Since this manuscript is a medical recipe, celandine makes more sense in context. The vocabulary list does not override what the agent sees but offers a plausible alternative from the same genre that the agent can verify against the letterforms.
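The Sallanders/Sallandine case can be sketched as a fuzzy lookup that surfaces nearby spellings for the agent to weigh, rather than auto-correcting. The word list and function name below are hypothetical placeholders, not the project's actual implementation.

```python
import difflib

# Hypothetical miniature wordlist; the real reference is ~19,000 words
# drawn from 40 early modern sources.
VOCABULARY = ["sallanders", "sallandine", "sallendine", "rosemary", "sage"]

def verify_reading(candidate: str, cutoff: float = 0.7) -> list[str]:
    """Return close vocabulary spellings as alternatives to *check*, not corrections.
    The agent still decides by comparing each alternative against the letterforms."""
    return difflib.get_close_matches(candidate.lower(), VOCABULARY, n=5, cutoff=cutoff)

# The initial reading is itself a valid word, but the lookup also surfaces
# spellings of celandine; genre context then favours the herb.
print(verify_reading("Sallanders"))
```

A plain set-membership test would have accepted Sallanders outright; the point of the fuzzy lookup is that even a valid word gets neighbouring candidates for the agent to re-examine.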

Key lessons learned

Manuscript           Difficulty   Run 1   Run 2   Run 4   Run 5   Run 6   Run 7   Run 8   Best
Henslow MS688        Easy         11.3%   12%     4.96%   5.38%   3.8%    7.59%   4.54%   3.8%
Sedley MS534         Moderate     15.8%   21%     15.13%  16.55%  16.96%  16.42%  15.94%  15.13%
Bulkeley MS169       Moderate     22.8%   18%     18.7%   20.9%   16.21%  18.29%  18.29%  16.21%
Brumwich MS160       Hard         96.1%   93%     9.3%    50.62%  69.29%  62.49%  79.8%   9.3%
Jane Jackson MS373   Very hard    95.6%   95%     77.41%  46.85%  80.62%  67.22%  89.27%  46.85%
Best CER results for all five test manuscripts
Run 3 tested only Henslow and is not included in this table.
  • Blind experiments (all runs). Agents will cheat when given the opportunity, producing artificially low CER scores. The lesson: a blind set-up, where the transcribing agent has no way to reference the answer key, is essential.

  • Better instructions (Runs 2 & 5). Adding two specific anti-hallucination rules failed to move the needle. The lesson: telling a learner (even a machine) what to do isn’t the same as structuring how they do it.

  • More reference material (Run 7). Adding a visual alphabet from the Folger showing alphabet variation yielded worse CER across all manuscripts. The lesson: more information without a framework creates confusion.

  • Multi-agent consensus / triple-keying (Run 8). Inspired by EMROC’s triple-keying, which is the gold standard for human transcription. Three independent agents transcribed, a fourth reconciled using majority rule. It didn’t help. When agents can’t read the page, merging three bad readings just amplifies uncertainty. The agent ended up placing 200+ […] markers on the hardest manuscripts. The lesson: what works as a gold standard for human paleographers doesn’t necessarily transfer to machines.

  • Post-hoc review (Run 5). A separate review agent re-examined flagged passages. It reduced over-flagging (Brumwich dropped from 70% of words flagged to 13%) but couldn’t fix misreadings. The lesson: calibration and accuracy are two separate problems.
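The Run 8 triple-keying failure mode above can be sketched as a naive majority-rule merge: where no two agents agree, the reconciler can only flag uncertainty, which is exactly where the […] markers proliferate on hard hands. This is a simplified, token-aligned illustration, not the project's reconciliation code.

```python
from collections import Counter

def reconcile(a: list[str], b: list[str], c: list[str]) -> list[str]:
    """Majority-rule merge of three aligned token streams.
    With no majority, the reconciler flags the token rather than pick a reading."""
    merged = []
    for tokens in zip(a, b, c):
        word, count = Counter(tokens).most_common(1)[0]
        merged.append(word if count >= 2 else "[...]")
    return merged

# Two of three agents agree on "celandine"; none agree on the last token.
print(reconcile(
    ["take", "celandine", "rootes"],
    ["take", "celandine", "leaues"],
    ["take", "sallanders", "iuyce"],
))  # → ['take', 'celandine', '[...]']
```

The sketch makes the asymmetry visible: majority voting only recovers signal when at least two readers are right, so three agents who each misread a hard hand converge on flags, not on text.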