So, how have y'all been the last *checks notes* few years? Yeah, I know, we live in interesting times... How about instead of focusing on that I show you what I have been up to since 2015 or so. Let's start with some of the projects I have been working on that you might find interesting.
Named after Ḥunayn ibn Isḥāq (807-873), a physician and prolific translator of classical philosophical and scientific literature, this ERC-supported project collected all texts of classical science that were translated into both Syriac and Arabic. The translations were then re-edited and aligned at the level of semantic and syntactic units that... Well, they are not quite sentences, but we tried to keep them as small as possible. The text is also tokenized, with links to dictionaries and corpora, and in some cases we also provided aligned text of translations into modern languages. Although I did contribute to the editions, my main job was processing the data and building the interface. I know, I know, it now needs some updates, especially when it comes to the aforementioned links to various dictionaries, chief among them the Glossarium Graeco-Arabicum, which is thankfully now back online, completely rebuilt. This project is a wonderful resource not just for those interested in the philosophical and scientific exchange between the East and the West, but also for those learning or studying any of the languages involved.
Despite its tagline "The Syriac Thesaurus" (it's a user-friendliness thing), this is an electronic corpus, the only one worthy of the name for Syriac, which contains ~25 million words and represents roughly 95% of all literature in Syriac. It is largely based on printed editions, although a few manuscripts snuck in. Simtho (a Syriac word meaning "treasure", pronounced according to West Syriac conventions) is the product of thousands of hours of work by hundreds of people assembling metadata, scanning books and checking OCR'd texts. All of that was done without any major grants or other financial support from research agencies or governments. Simtho is the largest project run at Beth Mardutho led by the indomitable George Kiraz. My job is being the last link in the chain, i.e. setting up and managing the entire processing pipeline, as well as the server(s) and all the software on them. This includes installing and customizing (for George has many ideas) NoSketch Engine, on which the whole corpus runs. In addition to that, I have been doing some language modelling and annotation, more on which perhaps later. One by-product of the work on Simtho is a set of OCR models for the recognition of Syriac printed text. These models are trained using the open source Kraken platform and are available on Zenodo.
Zoroastrian Middle-Persian Corpus and Dictionary
This DFG-supported ongoing project seeks to collect, annotate and analyze all available Zoroastrian texts written in Middle Persian to create a searchable corpus (in transcription) and finally an updated dictionary of Middle Persian. I was largely responsible for data processing, conversion and import, so none of what you see online is my work. The web application is still very much a work in progress, but once finished, it will be a one-stop shop for all your Zoroastrian Middle Persian needs, including manuscript images and comprehensive lexical resources.