Computing Perplexity Values for Under-resourced Languages using n-gram and Deep Learning Approaches
BAYANG SOULOUKNA Jules Paulin, DAYANG Paul, KOLYANG, WADOUFEY Abbel
Pages - 36 - 47 | Revised - 30-09-2022 | Published - 31-10-2022
KEYWORDS
Language Model, Tpuri, n-grams, Neural Network, Long Short-term Memory, Multilayer Perceptron, Perplexity.
ABSTRACT
Natural language processing, the interaction between computers and human language, requires
both a very good model of the language and a large amount of data. For under-resourced
languages, however, the lack of text resources makes it challenging to devise a good model
adapted to minority languages. To cope with this issue, this paper focuses on the collection
of data for the construction of a language model adapted to poorly endowed languages. First,
we describe the concept of under-resourced languages and the difficulties related to the
digital processing of those languages. To illustrate our model, we collected text data for
Tpuri, an African language spoken in Cameroon and Chad. For the collection, we used diverse
sources such as existing printed documents. Our dataset contains 1640128 words and 108553
sentences. With the collected dataset, two main language modeling approaches (n-gram and
recurrent neural network) were evaluated. Perplexity values were computed in order to judge
how good each language model is given the characteristics of under-resourced languages. For
the statistical n-gram language model, we obtained a perplexity value of 420.01 for bigrams
and 270.45 for trigrams. Relying on linear interpolation with weights xs = [0.2, 0.2, 0.4, 0.2],
a best perplexity value of 56.74 could be determined. We also obtained a best perplexity of
47.21 with Laplace smoothing using 4-grams, when x has a value of 0.03. Implementing a
recurrent neural network model using the multilayer perceptron (long short-term memory), we
obtained a perplexity value of 77.18, which is to be considered a better result.
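As an illustration of the quantities named in the abstract, the following Python sketch shows one way perplexity could be computed for an additively smoothed n-gram model and for a linear interpolation of n-gram orders. It is not the authors' implementation. The names `count_ngrams`, `AddKNGramModel` and `perplexity`, the toy corpus, the `<unk>` handling, and the choice to interpolate smoothed (rather than unsmoothed) estimates are all assumptions made for this example. It also assumes that the four interpolation weights quoted in the abstract run from unigram to 4-gram and that the value 0.03 is the additive smoothing constant.

```python
import math
from collections import Counter

BOS, EOS, UNK = "<s>", "</s>", "<unk>"  # hypothetical marker tokens


def count_ngrams(sentences, n):
    """Count n-grams and their (n-1)-token contexts over padded sentences."""
    grams, contexts = Counter(), Counter()
    for sent in sentences:
        padded = [BOS] * (n - 1) + list(sent) + [EOS]
        for i in range(n - 1, len(padded)):
            ctx = tuple(padded[i - n + 1:i])
            grams[ctx + (padded[i],)] += 1
            contexts[ctx] += 1
    return grams, contexts


class AddKNGramModel:
    """n-gram model with additive (Laplace-style) smoothing, orders 1..max_n."""

    def __init__(self, sentences, max_n, k):
        self.k = k
        self.vocab = {w for s in sentences for w in s} | {EOS, UNK}
        self.tables = {n: count_ngrams(sentences, n) for n in range(1, max_n + 1)}

    def prob(self, word, context, n):
        # Keep only the last n-1 context tokens for an order-n estimate.
        grams, contexts = self.tables[n]
        ctx = tuple(context)[len(context) - (n - 1):]
        return (grams[ctx + (word,)] + self.k) / (contexts[ctx] + self.k * len(self.vocab))

    def interpolated_prob(self, word, context, weights):
        # weights[0] applies to unigrams, weights[1] to bigrams, and so on (assumed order).
        return sum(lam * self.prob(word, context, n)
                   for n, lam in enumerate(weights, start=1))


def perplexity(sentences, vocab, order, prob_fn):
    """exp of the average negative log-probability per predicted token."""
    log_sum, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = [w if w in vocab else UNK for w in sent]
        padded = [BOS] * (order - 1) + tokens + [EOS]
        for i in range(order - 1, len(padded)):
            context = tuple(padded[i - order + 1:i])
            log_sum += math.log(prob_fn(padded[i], context))
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)


# Toy placeholder corpus (not Tpuri data).
train = [["a", "b", "c"], ["a", "b", "d"], ["b", "c", "a"]]
test = [["a", "b", "c"]]

model = AddKNGramModel(train, max_n=4, k=0.03)
weights = [0.2, 0.2, 0.4, 0.2]

ppl_4gram = perplexity(test, model.vocab, 4, lambda w, c: model.prob(w, c, 4))
ppl_interp = perplexity(test, model.vocab, 4,
                        lambda w, c: model.interpolated_prob(w, c, weights))
print("4-gram add-k perplexity:", ppl_4gram)
print("interpolated perplexity:", ppl_interp)
```

Lower perplexity means the model assigns higher probability to the held-out text. Values from this toy corpus are not comparable to the figures reported in the abstract.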
Mr. BAYANG SOULOUKNA Jules Paulin
Faculty of Science/Department of Mathematics and Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Maroua - Cameroon
paulinbayang@gmail.com
Mr. DAYANG Paul
Faculty of Science/Department of Mathematics and Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Ngaoundéré - Cameroon
Mr. KOLYANG
Higher Teachers' Training College/Department of Computer Science, Laboratoire de Recherche en Informatique (LARI), The University of Maroua - Cameroon
Mr. WADOUFEY Abbel
Faculty of Science/Department of Mathematics and Computer Science, National Institute of Cartography, Cameroon, The University of Ngaoundéré - Cameroon