DICTA: The Israel Center for Text Analysis
Avi Shmidman
Senior Lecturer, Bar-Ilan University; Senior Researcher, DICTA
The Dicta Center, founded by Professor Moshe Koppel, is dedicated to bridging the gap between cutting-edge research in Natural Language Processing (NLP) on the one hand, and academic Jewish studies research on the other. Although many Jewish studies projects stand to benefit from advances in NLP, a serious obstacle often stands in the way: the overwhelming majority of NLP algorithms are designed for Indo-European texts, and do not address the unique challenges of Hebrew texts.
Hebrew words admit of an ambiguity far beyond that of Indo-European words. Because vowels are generally omitted, a single Hebrew word can often be parsed as any of several different lexemes. This is compounded by the rampant use of grammatical prefixes, whereby conjunctions, prepositions and the like are prepended to the following word. The effect of these prefixes is doubly problematic: on the one hand, the resulting prefixed word can often plausibly be read as a single non-prefixed word; on the other hand, the basic building blocks of the sentence – the conjunctions, prepositions and other function words which so clearly demarcate the parts of the sentence in Indo-European languages – end up completely obscured. Furthermore, Hebrew texts do not use capital letters to differentiate proper names from common words. Finally, medieval Hebrew texts are all the more ambiguous, due to their inconsistent and unpredictable orthography, their rampant use of ambiguous abbreviations, and their frequent code-switching between Hebrew and Aramaic.
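To make the prefix problem concrete, the following minimal Python sketch enumerates the candidate prefix/base segmentations of a single unvocalized word. It is an illustration only, not Dicta's disambiguation algorithm; the prefix inventory and the example word are deliberately simplified.

```python
# Toy illustration of prefix ambiguity in unvocalized Hebrew.
# This is NOT Dicta's algorithm; it merely enumerates the candidate
# prefix/base splits that a real disambiguation system must choose among.

PREFIXES = set("ובלכמהש")  # common single-letter grammatical prefixes (simplified)

def candidate_splits(word, min_base_len=2):
    """Return all ways to peel grammatical prefixes off an unvocalized word."""
    splits = [("", word)]  # the reading with no prefix at all
    for i in range(1, len(word) - min_base_len + 1):
        prefix, base = word[:i], word[i:]
        if all(ch in PREFIXES for ch in prefix):
            splits.append((prefix, base))
        else:
            break  # prefixes are contiguous at the start of the word
    return splits

# "ושמדבר" could be read, e.g., as ו+שמדבר, וש+מדבר, or ו+ש+מ+דבר
for prefix, base in candidate_splits("ושמדבר"):
    print(f"prefix: {prefix or '-':<4} base: {base}")
```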
For all of these reasons, proven NLP algorithms all too often fall flat when applied to Jewish studies texts. It is against this backdrop that Dicta emerged. At Dicta, we have focused on solving these core challenges of historical Hebrew texts, so that advanced AI algorithms can be effectively applied to them. Upon this foundation, we have developed a suite of NLP-based research tools for historical Hebrew texts, freely available at: https://dicta.org.il/
Among our tools:
- A Scriptural Allusion Finder tool, which identifies Scriptural quotations and allusions within any Hebrew text. Results can also be downloaded as a Microsoft Word document, with a set of footnotes citing each relevant verse.
- A Synopsis tool designed for automatic alignment of multiple recensions of a given Hebrew text. The system can process sets of textual witnesses each containing hundreds of thousands of words. (A toy illustration of such word-level alignment appears after this list.)
- Advanced search tools to query the Bible and Talmud according to the way the words would be written and pronounced today, regardless of the orthographic or morphological oddities in the ancient texts.
- Tools for automatic classification and segmentation of historical Hebrew texts. The tools provide comprehensive data regarding the features which most clearly distinguish sections of text from one another.
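As a rough intuition for the word-level alignment performed by the Synopsis tool, the following toy sketch aligns two short witnesses using Python's standard difflib. This is an illustration only, not Dicta's large-scale multi-witness algorithm, and the second witness, with its substitution of השם for אלהים, is contrived for the example.

```python
# Toy pairwise alignment of two textual witnesses, word by word.
# Dicta's Synopsis tool aligns many witnesses of far greater length;
# this sketch only illustrates the basic idea with Python's difflib.
from difflib import SequenceMatcher

witness_a = "בראשית ברא אלהים את השמים ואת הארץ".split()
witness_b = "בראשית ברא השם את השמים ואת הארץ".split()

matcher = SequenceMatcher(a=witness_a, b=witness_b, autojunk=False)
for tag, a0, a1, b0, b1 in matcher.get_opcodes():
    print(f"{tag:>8}: {' '.join(witness_a[a0:a1]) or '-'}  |  "
          f"{' '.join(witness_b[b0:b1]) or '-'}")
```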
All of these tools are built upon our core algorithms, which address the ambiguity of Hebrew words and abbreviations. Users can also access these core algorithms directly: our Nakdan tool (https://nakdanpro.dicta.org.il/) predicts the full vocalization of every word in a given historical Hebrew text, and our Abbreviation Expander tool (https://abbreviation.dicta.org.il/) predicts how each abbreviation within the text should be expanded. These tools use recurrent neural networks to predict the most likely interpretation of each item given the surrounding context. They also allow users to efficiently review and correct the output as needed: alternate possibilities are presented in order of probability, and a single keystroke selects the most likely alternative. Scholars can use these tools to prepare texts for publication, or for use in educational materials. Additionally, these tools provide a key advantage for scholars planning big-data experiments across large Hebrew corpora. Instead of running experiments on the corpora as is, with all their inherent ambiguity, scholars can first preprocess the corpora with our core tools, applying abbreviation expansion, prefix segmentation, word disambiguation, and morphological tagging.
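For scholars planning such a preprocessing pipeline, its general shape might resemble the following Python sketch. The endpoint URL, request fields, and response structure shown here are assumptions made up for illustration, not Dicta's documented API; the sketch only shows where a vocalization/disambiguation step would sit in a corpus workflow.

```python
# Hypothetical sketch of preprocessing a Hebrew corpus through a
# vocalization/disambiguation web service before a big-data experiment.
# The endpoint, payload fields, and response format below are ASSUMPTIONS
# for illustration only; they are not documented Dicta API parameters.
import requests

NAKDAN_URL = "https://nakdan.example.org/api"  # placeholder endpoint, not a real Dicta URL

def vocalize(text: str) -> str:
    """Send raw Hebrew text to a (hypothetical) vocalization endpoint
    and return the top-ranked vocalized reading for each word."""
    payload = {"task": "nakdan", "data": text, "genre": "rabbinic"}  # assumed schema
    response = requests.post(NAKDAN_URL, json=payload, timeout=60)
    response.raise_for_status()
    # Assume the service returns one record per word, each holding a list
    # of candidate readings ordered by probability.
    words = response.json()
    return " ".join(w["options"][0] for w in words)

if __name__ == "__main__":
    print(vocalize("ודבר המלך הגיע אל כל עם ומדינה"))
```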
It should be noted that because these tools were built for scholars and designed by scholars, every effort has been made to ensure that the specialized needs of academic Jewish studies scholars are met. Thus, for instance, for scholars working on critical editions, the Nakdan provides an option to preserve matres lectionis precisely as they appear in the source text. Similarly, the Nakdan properly handles texts containing editorial sigla, even mid-word. The Nakdan can also handle irregular medieval spellings, such as second-person past-tense forms with an extra “heh” at the end, or cases of a silent yod following a prepositional prefix; likewise, it can handle irregular medieval morphologies, such as the use of a “kaf” prefix before a past-tense verb, or the retention of the definite article following a prepositional prefix.
A set of configuration options allows users to instruct the Nakdan to adopt historical vocalization norms often used in academic publications. For instance, users can choose to vocalize the word “ma” with a patach and a subsequent dagesh, rather than with the standardized qamatz of modern Hebrew; similarly, users can choose to omit the dagesh following the definite article when it is applied to a participle beginning with mem+shwa. Finally, when texts incorporate Biblical or Talmudic phrases, those phrases are automatically vocalized according to their canonical vocalization.
Dicta’s tools and algorithms are under active development, and we often integrate new features and changes based upon user feedback. To send in your ideas and suggestions, write to us at: dicta@dicta.org.il