Automatic Detection of Stop Words for Texts in Uzbek Language

Khabibulla Madatov, Shukurla Bekchanov, Jernej Vičič

Abstract


Stop words are important in information retrieval and text analysis. This study aimed to automatically analyze and detect stop words in Uzbek-language texts. Because few methods exist for automatically finding stop words in Uzbek texts, we analyzed a newly prepared corpus. Uzbek belongs to the family of agglutinative languages. As with all agglutinative languages, detecting stop words in Uzbek texts is more complex than in inflected languages. In inflected languages, auxiliary words, articles, and prepositions can be included in the stop-word group, whereas in agglutinative languages the meanings of such words are hidden in the text. Therefore, the stop-word detection methods known for inflected languages cannot all be applied directly to agglutinative languages.
In this work, the “School corpus”, which contains 731,156 Uzbek words, was investigated. The bigram method of analysis was applied to the corpus. We propose a collocation method for detecting stop words in the corpus, and a method for automatically detecting stop words in Uzbek texts. The collocation method is shown to be 6 times better than the bigram method.
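
As a rough illustration of what a bigram analysis of a token stream can look like, the sketch below counts adjacent word pairs and ranks words by how many distinct neighbours they have. It is not the authors' procedure: the function names `bigram_counts` and `rank_by_distinct_neighbors`, the ranking heuristic, and the nine-word toy sentence are all assumptions made for illustration, and the abstract does not specify how either the bigram or the collocation method scores words.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))


def rank_by_distinct_neighbors(tokens):
    """Rank words by the number of distinct words they appear next to.

    Hypothetical heuristic: a word that co-occurs with many different
    neighbours is treated as a stop-word candidate.
    """
    neighbors = {}
    for left, right in bigram_counts(tokens):
        neighbors.setdefault(left, set()).add(right)
        neighbors.setdefault(right, set()).add(left)
    # Most neighbours first; ties broken alphabetically for determinism.
    return sorted(neighbors, key=lambda w: (-len(neighbors[w]), w))


# Toy input (an assumption, not taken from the School corpus).
tokens = "men va sen kitob va daftar maktab va uy".split()
counts = bigram_counts(tokens)
ranking = rank_by_distinct_neighbors(tokens)
print(ranking[0])
```

On this toy input the conjunction "va" ("and") comes out on top because it sits next to six different words. A real corpus would need tokenization and normalization suited to Uzbek morphology before any such count is meaningful.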







DOI: https://doi.org/10.31449/inf.v47i2.3788

This work is licensed under a Creative Commons Attribution 3.0 License.