Researchers at IIT Kharagpur Develop Digital Infrastructure for Efficient Processing of Sanskrit Texts to Make the Language Accessible
*Srijata Saha Sahoo
The Astadhyayi (eight chapters), composed by Panini in the 6th–5th century BCE, is still considered a rich commentatorial literature, though the origin of the language is traced back to the 2nd millennium BCE, when the Rig Veda was written down after being carried for centuries through oral tradition and the preservation of verbal knowledge in the guru-disciple relationship.
After many years of stagnation, there has been renewed interest in Sanskrit since the announcement of the National Education Policy (NEP) 2020. It is also worth mentioning that many English words have their origin in Sanskrit, such as ‘Path’ from ‘Patha’, ‘Man’ from ‘Manu’ and ‘Door’ from ‘Dwar’.
Academic institutions at both the school and higher-education levels are adopting numerous approaches to improve the reach of the language through training programmes, research and outreach initiatives. While digital resources have improved the accessibility and use of world languages as well as regional languages, Sanskrit presents unique challenges for automated computational processing.
Apart from the sheer volume and diversity, both stylistic and chronological, found in Sanskrit texts, the linguistic peculiarities of the language pose several challenges in making these works accessible to the world.
To address these challenges, researchers at IIT Kharagpur led by Dr. Pawan Goyal have developed a digital infrastructure for the efficient processing of Sanskrit texts by combining state-of-the-art machine learning techniques with traditional linguistic knowledge from Sanskrit. The proposed framework is based on energy-based models and enables relevant linguistic information to be encoded as constraints. In the words of Dr. Goyal, “Processing of Sanskrit texts poses several challenges owing to the high lexical productivity of the words, free word order in poetry, euphonic assimilation of sounds at the word boundaries and phonemic orthography followed in writing. Keeping these in mind, we proposed a generic graph-based framework that takes advantage of the free word order nature of the language. Further, we made use of linguistic insights from the traditional Sanskrit grammar for learning the feature function and applying the relevant constraints.” He further added, “Our proposed framework substantially reduced the training data requirements to as low as 10%, as compared to that of the neural state-of-the-art models. In all the Sanskrit-related tasks discussed in the work, we either achieved state-of-the-art results or ours is the only data-driven solution for those tasks.”
This work has been accepted for publication in the Computational Linguistics journal published by the MIT Press. It was carried out by research scholar Dr. Amrith Krishna, currently a post-doc at the University of Cambridge, under the supervision of Dr. Pawan Goyal. The paper currently addresses the tasks of word segmentation, morphological parsing, dependency parsing and poetry-to-prose conversion of Sanskrit text. The team is now actively collaborating with several external research groups to extend the application of the proposed system to automatic speech recognition and question answering in Sanskrit.
Works in Sanskrit, numbering more than 30 million extant manuscripts, include extensive epics; subtle and intricate philosophical, mathematical and scientific treatises; and rich literary, poetic and dramatic texts. The proposed AI-based system, used in conjunction with interactive tools such as the Sanskrit Heritage reader, may help users analyse these manuscripts more easily through word-by-word analysis and translation, identification of the relations between words, poetry-to-prose conversion, search and question answering.
Let us hope that from now onwards Sanskrit becomes more easily accessible to its connoisseurs.