Computing Perplexity Values for Under-resourced Languages using n-gram and Deep Learning Approaches

BAYANG SOULOUKNA Jules Paulin, DAYANG Paul, KOLYANG, WADOUFEY Abbel

Pages - 36 - 47 | Revised - 30-09-2022 | Published - 31-10-2022

Published in International Journal of Computational Linguistics (IJCL)

Volume - 13 Issue - 3 | Publication Date - October 2022

KEYWORDS

Language Model, Tpuri, n-grams, Neural Network, Long Short-term Memory, Multilayer Perceptron, Perplexity.

ABSTRACT

Interaction between computers and human language through natural language processing requires both a good model of the language and a large amount of data. For under-resourced languages, the lack of text resources makes it challenging to devise a good model. To address this issue, this paper focuses on collecting data to build a language model adapted to poorly endowed languages. We first describe the concept of under-resourced languages and the difficulties related to their digital processing. To illustrate our model, we collect text data in Tpuri, an African language spoken in Cameroon and Chad, drawing on diverse sources such as existing printed documents. Our dataset contains 1,640,128 words and 108,553 sentences. With the collected dataset, two main modelling approaches (n-gram and recurrent neural network) have been evaluated. Perplexity values were computed to judge how good a language model is given the characteristics of under-resourced languages. For the statistical n-gram language model, we obtained a perplexity of 420.01 for bigrams and 270.45 for trigrams. Relying on linear interpolation with weights [0.2, 0.2, 0.4, 0.2], a best perplexity of 56.74 was obtained. We also obtained a best perplexity of 47.21 with Laplace smoothing using 4-grams, with a smoothing parameter of 0.03. Implementing a recurrent neural network model using the multilayer perceptron (long short-term memory), we obtain a perplexity value of 77.18, which is to be considered a better result.
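The perplexity figures in the abstract can be made concrete with a minimal sketch. It is not the authors' implementation: it assumes a toy English corpus in place of the Tpuri dataset, uses bigrams rather than 4-grams, and interpolates only two components rather than four. All function names, the corpus, and the weights shown are hypothetical. Perplexity is computed as the exponential of the average negative log-probability per predicted token, once with additive (Laplace-style) smoothing and once with linear interpolation.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the paper's Tpuri data is not reproduced here.
train = ["the cat sat", "the dog sat", "the cat ran"]
test = ["the dog ran"]

def tokenize(sentence):
    # Pad each sentence with start and end markers.
    return ["<s>"] + sentence.split() + ["</s>"]

unigrams, bigrams = Counter(), Counter()
for sentence in train:
    tokens = tokenize(sentence)
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab = sorted(unigrams)       # includes the padding markers, for simplicity
total = sum(unigrams.values())

def addk_prob(prev, word, k=0.03):
    # Additive smoothing: add k to every bigram count.
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * len(vocab))

def interp_prob(prev, word, weights=(0.4, 0.6)):
    # Linear interpolation of an add-one unigram estimate and a raw bigram
    # estimate; the weights are assumed values and must sum to 1.
    lam_uni, lam_bi = weights
    p_uni = (unigrams[word] + 1) / (total + len(vocab))
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam_uni * p_uni + lam_bi * p_bi

def perplexity(sentences, prob_fn):
    # exp of the average negative log-probability per predicted token.
    log_sum, count = 0.0, 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        for prev, word in zip(tokens, tokens[1:]):
            log_sum += math.log(prob_fn(prev, word))
            count += 1
    return math.exp(-log_sum / count)

ppl_addk = perplexity(test, addk_prob)
ppl_interp = perplexity(test, interp_prob)
print(f"add-k perplexity:        {ppl_addk:.2f}")
print(f"interpolated perplexity: {ppl_interp:.2f}")
```

A lower perplexity means the model assigns higher probability to the held-out text. A model that spreads probability uniformly over V choices has perplexity exactly V, which gives a useful reference point when reading the values reported above.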

MANUSCRIPT AUTHORS

Mr. BAYANG SOULOUKNA Jules Paulin

Faculty of Science/Department of Mathematics and Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Maroua - Cameroon

paulinbayang@gmail.com

Mr. DAYANG Paul

Faculty of Science/Department of Mathematics and Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Ngaoundéré - Cameroon

Mr. KOLYANG

Higher Teachers' Training College/Department of Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Maroua - Cameroon

Mr. WADOUFEY Abbel

Faculty of Science/Department of Mathematics and Computer Science, National Institute of Cartography, Cameroon, The University of Ngaoundéré - Cameroon
