Bridging the Gap: Towards Linguistic Resource Development for the Low-Resource Lambani Language
Abstract
Language technology development is crucial for many downstream applications such as machine translation and language understanding. The lack of linguistic resources makes such development challenging for under-resourced languages. This paper aims to develop linguistic tools for Lambani, an under-resourced tribal language of India, through corpora creation, annotation, and transfer learning from a contact language. Based on the annotated corpora, we develop the Lambani language tagset, investigate various methods for building a Part-of-Speech (POS) tagger, and create a morphology dictionary for Lambani. A total of eight tags from the BIS tagset are found to be present in the Lambani language. The experimental results show that the statistical approach with GMM-HMM (Gaussian Mixture Model - Hidden Markov Model) achieved a POS tagging accuracy of 96% despite the limited dataset of 6,893 sentences. This success in a low-resource setting highlights the potential of GMM-HMM for overcoming the scarcity of annotated data in under-resourced languages. The experiments not only demonstrate the effectiveness of the proposed methods for low-resource language processing but also shed light on their applications and open new directions for research in language revitalization and the development of digital tools for zero-resource languages.
@inproceedings{dasare2023bridging,
  title={Bridging the Gap: Towards Linguistic Resource Development for the Low-Resource Lambani Language},
  author={Dasare, Ashwini and Chowdhury, Amartya Roy and Menon, Aditya Srinivas and Anand, Konjengbam and Deepak, KT and Prasanna, SRM},
  booktitle={International Conference on Speech and Computer},
  pages={127--139},
  year={2023},
  organization={Springer}
}