Israeli students' AI model rescues lost words of antiquity
TEL AVIV: In a groundbreaking development for historical researchers, students at Ben-Gurion University of the Negev have harnessed the power of artificial intelligence to restore illegible letters and words in ancient Hebrew and Aramaic inscriptions.
Every year, archaeologists unearth a wealth of ancient texts written in Hebrew and Aramaic across the Near East. These inscriptions are invaluable for understanding the region's rich cultural and historical heritage. However, many of these texts have suffered damage over time, making it difficult for scholars to decipher them. Natural disasters, political conflicts, and the ravages of time have all taken their toll on these ancient artifacts.
But BGU's approach may revolutionize epigraphy, the study of identifying, classifying, and interpreting inscriptions on ancient artifacts such as coins, monuments, statues, and buildings, as well as writing on papyrus, parchment, and scrolls.
"This breakthrough has the potential to revolutionize the field of epigraphy," said Professor Mark Last, who supervised the students' project.
"Not only can we assist historians in reconstructing ancient texts more accurately, but I also believe that this model can be adapted to other morphologically rich ancient languages."
Traditionally, epigraphists have relied on manual methods to reconstruct the missing parts of damaged inscriptions. Those methods are time-consuming and prone to error.
Students from the university's Department of Software and Information Systems Engineering who took on the project approached the challenge as an "extended masked language modeling task." In masked language modeling, a technique commonly used to pre-train large language models, parts of a text are hidden and the model learns to predict them from the surrounding context. In the extended version of the task, the damaged content can comprise single characters, character n-grams (partial words), single complete words, or multi-word n-grams.
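To make the four kinds of damage concrete, here is a minimal sketch of how a training example for such a task might be built by hiding part of an intact sentence. It is an illustration only, not the students' code: the `?` placeholder, the function names `mask_span` and `mask_words`, and the English sample sentence are all assumptions made for readability.

```python
MASK = "?"  # hypothetical placeholder standing in for one illegible character


def mask_span(text: str, start: int, length: int) -> tuple[str, str]:
    """Hide `length` characters beginning at `start`.

    Returns (damaged_text, hidden_text). Spaces inside the span stay
    visible so that word boundaries are preserved in this toy version.
    """
    if start < 0 or length < 1 or start + length > len(text):
        raise ValueError("span must lie inside the text")
    hidden = text[start:start + length]
    damaged_part = "".join(ch if ch == " " else MASK for ch in hidden)
    return text[:start] + damaged_part + text[start + length:], hidden


def mask_words(text: str, first_word: int, n_words: int) -> tuple[str, str]:
    """Hide `n_words` whole words, counting words from zero."""
    words = text.split(" ")
    if first_word < 0 or n_words < 1 or first_word + n_words > len(words):
        raise ValueError("word range must lie inside the text")
    # Character offset of the first hidden word (each earlier word plus one space).
    start = sum(len(w) + 1 for w in words[:first_word])
    length = len(" ".join(words[first_word:first_word + n_words]))
    return mask_span(text, start, length)


sentence = "in the beginning god created"

print(mask_span(sentence, 7, 1))   # single character
# ('in the ?eginning god created', 'b')
print(mask_span(sentence, 7, 3))   # character n-gram (partial word)
# ('in the ???inning god created', 'beg')
print(mask_words(sentence, 2, 1))  # single complete word
# ('in the ????????? god created', 'beginning')
print(mask_words(sentence, 2, 2))  # multi-word n-gram
# ('in the ????????? ??? created', 'beginning god')
```

A model trained on pairs like these would see the damaged text as input and be asked to recover the hidden text.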
Led by Last, undergraduate students Niv Fono, Harel Moshayof, Eldar Karol, and Itay Asraf applied a masked language modeling approach to corrupted inscriptions in Hebrew and Aramaic. This involved training the system on a dataset comprising 22,144 sentences from the Old Testament and testing it on an additional 536 sentences, achieving notable success.
By employing an ensemble of word and character prediction models, they were able to achieve high accuracy in restoring damaged text.
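The article does not describe how the word and character models were combined. As a rough illustration of what an ensemble can look like, the sketch below takes candidate restorations scored by two hypothetical models and ranks them by a weighted average. The function name `combine_predictions`, the `word_weight` parameter, and all scores are invented for this example and are not taken from Embible.

```python
def combine_predictions(
    word_scores: dict[str, float],
    char_scores: dict[str, float],
    word_weight: float = 0.5,
) -> list[tuple[str, float]]:
    """Rank candidate restorations by a weighted average of two models' scores.

    A candidate proposed by only one model gets a score of 0.0 from the other.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    if not 0.0 <= word_weight <= 1.0:
        raise ValueError("word_weight must be between 0 and 1")
    candidates = set(word_scores) | set(char_scores)
    combined = {
        c: word_weight * word_scores.get(c, 0.0)
        + (1.0 - word_weight) * char_scores.get(c, 0.0)
        for c in candidates
    }
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Invented scores for a gap such as "in the ????????? god created".
word_model = {"beginning": 0.6, "begetting": 0.3}
char_model = {"beginning": 0.5, "beckoning": 0.4}

for candidate, score in combine_predictions(word_model, char_model):
    print(f"{candidate}: {score:.2f}")
```

In this toy run the candidate that both models support ranks first, which is the basic appeal of an ensemble: agreement between independent predictors counts for more than a single model's guess.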
Their model, dubbed "Embible," was presented to the European Chapter of the Association for Computational Linguistics at its meeting in March.
"We can help historians who have devoted their lives to recreating these ancient texts as accurately as possible," said Last. "Furthermore, I believe the model can be extended to cover other morphologically rich ancient languages."