Thu Nov 6

The Vilne-Yiddish Model, a New AI Tool Developed at VU, Enables Recognition of Handwritten Yiddish Texts

Created: 06 November 2025

Following the establishment of the joint Digital Humanities (DH) Laboratory of the Faculties of History and Philology at Vilnius University, researchers have been pushing the boundaries of how historical documents can be studied using modern technologies.

One of the latest outcomes of this work is the Vilne-Yiddish model – a tool for handwritten text recognition (HTR) in the Yiddish language, developed by Dr Sergii Gurbych, a Postdoctoral Fellow at the Center for the Study of East European Jewry, Faculty of History.

The project represents a major step toward making Jewish historical materials accessible through AI. The most recent version of the Vilne-Yiddish model is already openly available in Dr S. Gurbych’s GitHub repository and, together with the full dataset, will be uploaded to Zenodo by the end of his project in February 2026.

Reading What Was Once Unreadable

Dr S. Gurbych explains that while printed Yiddish can already be recognised fairly accurately with existing tools, handwritten texts remain a challenge due to their variety.

“There are many different handwriting styles,” he notes. “They differ by period, region, and even by social background. Currently, scholars working with Yiddish texts manually transcribe dozens of pages from autobiographies, diaries, and letters – a process that is both time-consuming and labour-intensive.

“The use of an automated recognition model can significantly accelerate this work by reducing the time required to process each page. While post-recognition manual correction remains necessary due to inevitable errors, the overall effort per page is nonetheless substantially reduced.”

Rediscovering Interwar Jewish Voices

The materials used to train the Vilne-Yiddish model come from autobiographies written in the 1930s and sent to YIVO – the Yidisher Visnshaftlekher Institut – from across Eastern and Central Europe.

Most of these manuscripts, dated between 1933 and 1939, were recently rediscovered in the archives of the National Library of Lithuania and had never been digitised before. Others were obtained from the YIVO online collections digitised through the Edward Blank YIVO Vilna Online Collections Project.

[Image: screenshot showing how the model works within the eScriptorium interface]

Using these handwritten sources, Dr S. Gurbych created a dataset – a collection of image-text pairs that allowed the model to “learn” the structure of Yiddish handwriting.

“The result,” he says, “is a model that achieves around 95 per cent accuracy – roughly one error per twenty characters. That is quite high for handwritten materials, especially considering the diversity of the scripts.”
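For readers who want to see what "one error per twenty characters" means in practice, here is a minimal sketch of how a character error rate can be computed. It assumes that the quoted accuracy is character-level accuracy, that is, one minus the edit distance between the model output and a manual transcription, divided by the length of the manual transcription. The article does not state the exact metric used, and the sample strings are invented English placeholders, not project data.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Placeholder strings (assumption): a 20-character "manual transcription"
# and a "model output" that gets exactly one character wrong.
reference = "twenty characters ok"
hypothesis = "twenty charactors ok"

cer = character_error_rate(reference, hypothesis)
print(f"character error rate: {cer:.0%}")
print(f"character accuracy:   {1 - cer:.0%}")
```

With one wrong character in a 20-character line, the error rate is 1/20, which corresponds to the 95 per cent figure quoted above.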

Training the Machines to Read

Like any model, Vilne-Yiddish performs best on handwriting styles similar to those it has already seen during training. “The more a new handwriting differs from the samples in the dataset, the higher the error rate,” Dr S. Gurbych explains.

“To build a universal model, we would need hundreds of different handwriting samples – ideally, dozens of pages per style – which requires immense computational resources and time.”

To overcome this challenge, he proposes an alternative approach: fine-tuning.

“If a researcher has a base model and access to the original dataset, they can fine-tune it using just a dozen pages of the handwriting they are studying,” he says. “This way, the model learns that specific handwriting with high accuracy – and it takes much less effort and computing power than training a model from scratch.”

Open Access as a Core Principle

This approach depends on open access. “Both the model and the training dataset need to be freely available,” stresses Dr S. Gurbych. “That’s exactly what this project provides.

“While most Hebrew HTR models are closed, the Vilne-Yiddish model and its dataset are both open-access. Anyone can use them, modify them, and build upon them.”

He mentions that the only comparable open project so far has been BiblIA, a dataset developed at the University of Lausanne for medieval Hebrew manuscripts under the direction of Professor Daniel Stökl Ben Ezra. It includes over 200 pages of Sephardic, Ashkenazic, and Italian scripts, and both the dataset and model are available online.

“Now we have something similar for Yiddish – specifically, interwar Yiddish manuscripts. This will help historians and linguists analyse handwritten sources that were previously too complex to process automatically,” says Dr S. Gurbych.

A Step Toward Broader Access

Dr S. Gurbych notes that although several Yiddish HTR models have been developed to date, they are not open-access and were trained on manuscripts of a different type and historical period.

One example is the DYBBUK model, developed under the supervision of Israeli scholar Dr Sinai Rusinek, which was trained on handwritten Yiddish theatre plays from the late 19th and early 20th centuries. The restricted access to such models prevents other researchers from fine-tuning new models based on them.

“I hope this will contribute to advancing archival research on the history of Jews in Eastern and Central Europe,” he concludes. “Ultimately, Digital Humanities isn’t just about digitisation or data analysis – it’s about expanding access to culture and making the voices of the past legible again.”

Archivists and librarians will be able to convert already-digitised manuscripts from “images only” into fully searchable text. Once transcribed, documents can be indexed, enriched with markup and tags, and published online in a form that supports keyword search and further automated processing. Collections that were previously visible only as scanned pages become accessible sources of structured information.
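As a rough illustration of how transcribed pages become searchable, the sketch below builds a small inverted index that maps each word to the pages containing it. It is not part of the Vilne-Yiddish project: the page identifiers and sample texts are invented English placeholders, and the function names `tokenize`, `build_index` and `search` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def _strip_punctuation(word: str) -> str:
    """Remove punctuation characters from both ends of a word."""
    start, end = 0, len(word)
    while start < end and unicodedata.category(word[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(word[end - 1]).startswith("P"):
        end -= 1
    return word[start:end]


def tokenize(text: str) -> list[str]:
    """Split on whitespace, trim punctuation, and normalise case.

    Splitting on whitespace (rather than on letter classes) keeps words
    that contain combining marks in one piece.
    """
    words = unicodedata.normalize("NFC", text).split()
    tokens = (_strip_punctuation(w).casefold() for w in words)
    return [t for t in tokens if t]


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map every token to the set of page IDs in which it occurs."""
    index: dict[str, set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return the sorted IDs of pages containing all words of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


# Invented placeholder transcriptions (assumption, not project data).
pages = {
    "page_001": "A letter sent from Vilne in 1934.",
    "page_002": "An autobiography written in Warsaw, 1936.",
    "page_003": "Notes about school life in Vilne.",
}

index = build_index(pages)
print(search(index, "Vilne"))
print(search(index, "Vilne letter"))
```

A production catalogue would normally rely on a full-text search engine instead, but the principle is the same: once a page exists as text rather than as an image, every word in it can be looked up.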

Opening Doors for Research and Learning

For researchers, large sets of newly recognised texts open the door to modern analytical methods. Tools such as Named Entity Recognition (NER) will allow systematic extraction of place names, addresses, and personal names from handwritten sources.

Instead of manual page-by-page reading, scholars will be able to explore patterns across entire corpora of documents, generating new historical insights.

For the broader public, the model removes a major barrier: knowing Yiddish is no longer required to access the content of handwritten sources. Anyone can copy the recognised text and use an online translation service to understand a document. Letters, autobiographies, and diaries that remained unread for decades will become discoverable to descendants and communities seeking to reconnect with their past.

Educators and students can also incorporate these newly readable manuscripts into teaching and university projects. Working directly with real archival sources supports active learning and broadens engagement with Jewish cultural heritage.

Funded by the European Union and supported by the NextGenerationEU program “New Generation Lithuania.”