Compression-Based Parts-of-Speech Tagger for The Arabic Language
Ibrahim S. Alkhazi, William J. Teahan
Pages - 1 - 15 | Revised - 31-03-2019 | Published - 30-04-2019
KEYWORDS
Natural Language Processing, Arabic Part-of-Speech Tagger, Hidden Markov Model, Statistical Language Model.
ABSTRACT
This paper explores the use of compression-based models to train a Part-of-Speech (POS) tagger for the Arabic language. The newly developed tagger is based on the Prediction by Partial Matching (PPM) compression system, which has already been employed successfully in several NLP tasks. Several models were trained for the new tagger: the first models were trained using silver-standard data from two different Arabic POS taggers, and the second model utilised the BAAC corpus, a 50K-term manually annotated Modern Standard Arabic (MSA) corpus, on which the PPM tagger achieved an accuracy of 93.07%. In addition, tag-based models were used to evaluate the performance of the new tagger by first tagging different Classical Arabic and Modern Standard Arabic corpora and then compressing the text using tag-based compression models. The results show that the silver-standard models reduced the quality of the tag-based compression by an average of 0.43%, whereas the gold-standard model increased it by an average of 4.61% when used to tag Modern Standard Arabic text.
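The idea of tagging by compression can be illustrated with a minimal sketch: among all candidate tag sequences for a sentence, choose the one under which the tagged text has the shortest code length in bits. The sketch below is not the authors' PPM tagger. It assumes a simple tag-bigram model with add-one smoothing as a stand-in for PPM's escape mechanism, and the function names (`train`, `bits`, `tag_sentence`), the tag labels, and the transliterated toy sentences are all hypothetical.

```python
import math
from collections import Counter


def train(tagged_sentences):
    """Count tag-bigram and word-given-tag events from (word, tag) sentences."""
    trans, emit = Counter(), Counter()          # event counts
    trans_ctx, emit_ctx = Counter(), Counter()  # context totals
    tags, vocab = set(), set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo tag
        for word, tag in sentence:
            trans[(prev, tag)] += 1
            trans_ctx[prev] += 1
            emit[(tag, word)] += 1
            emit_ctx[tag] += 1
            tags.add(tag)
            vocab.add(word)
            prev = tag
    return trans, trans_ctx, emit, emit_ctx, sorted(tags), vocab


def bits(count, context_total, alphabet_size):
    """Code length in bits under add-one smoothing (assumption: not PPM)."""
    return -math.log2((count + 1) / (context_total + alphabet_size))


def tag_sentence(words, model):
    """Return the tag sequence that minimises the total code length."""
    trans, trans_ctx, emit, emit_ctx, tags, vocab = model
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    # best[tag] = (bits so far, tag path) for the cheapest path ending in tag
    best = {"<s>": (0.0, [])}
    for word in words:
        step = {}
        for t in tags:
            word_cost = bits(emit[(t, word)], emit_ctx[t], vocab_size)
            step[t] = min(
                (cost + bits(trans[(p, t)], trans_ctx[p], len(tags)) + word_cost,
                 path + [t])
                for p, (cost, path) in best.items()
            )
        best = step
    return min(best.values())[1]


# Hypothetical toy training data in Buckwalter-style transliteration.
toy_corpus = [
    [("ktb", "VERB"), ("Alwld", "NOUN"), ("Aldrs", "NOUN")],
    [("qrA", "VERB"), ("Alwld", "NOUN"), ("AlktAb", "NOUN")],
    [("fy", "PREP"), ("Albyt", "NOUN")],
]
model = train(toy_corpus)
print(tag_sentence(["qrA", "Alwld", "Aldrs"], model))
```

A real PPM-based tagger would replace `bits` with code lengths from an adaptive, variable-order model; the search for the cheapest tag sequence stays the same in spirit.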
Mr. Ibrahim S. Alkhazi
College of Computers & Information Technology
Tabuk University
Tabuk, Saudi Arabia
i.alkhazi@ut.edu.sa
Dr. William J. Teahan
School of Computer Science and Electronic Engineering
Bangor University
United Kingdom