A workflow for HTR-postprocessing, labeling and classifying diachronic and regional variation in pre-modern Slavic texts

2024

Piroska Lendvai, Maarten van Gompel, Anna Jouravel, Elena Renje, Uwe Reichel, Achim Rabus, and Eckhart Arnold

We describe ongoing work on a workflow for classifying diachronic and regional language variation in pre-modern Slavic texts. The data were obtained partly via handwritten text recognition (HTR) on medieval manuscripts and printings and partly by manual transcription. Our goal is a workflow for such historical language data that covers HTR postprocessing, annotation, and classification of the digitized texts. We test and adapt existing language resources to fit the pipeline with low-barrier tooling that is accessible to humanists with limited experience in research data infrastructures, computational analysis, or advanced methods of natural language processing (NLP). The workflow starts with the creation of ground truth (GT) data for diagnosing and correcting HTR errors via string metrics and data-driven methods. We then show classification results on GT and on HTR data, using transfer learning on sentence-level text snippets. Next, we report on our token-level data labeling efforts. For each step of the workflow, we describe current limitations and our corresponding work in progress.
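To illustrate what diagnosing HTR errors with a string metric can look like, here is a minimal sketch that computes the Levenshtein distance and the character error rate (CER) between a GT line and an HTR line. The abstract does not name the specific metrics used, so the choice of Levenshtein distance and CER is an assumption; the function names and the NFC normalization step are likewise illustrative and not taken from the publication.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # drop char_a
                current[j - 1] + 1,                    # add char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the ground truth line.

    Assumption: both strings are NFC-normalized first, so that precomposed
    and decomposed forms of the same character compare as equal.
    """
    gt = unicodedata.normalize("NFC", ground_truth)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not gt:
        raise ValueError("ground truth must not be empty")
    return levenshtein(gt, hyp) / len(gt)
```

For example, `character_error_rate("слово", "слова")` returns `0.2`: one substitution over five ground truth characters. Lengths are counted in Unicode code points, so combining diacritics that have no precomposed form still count as separate characters.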

A scientific publication by audEERING GmbH.