Contextual Embedding Comparison for Out-of-vocabulary Handling in Indonesian POS Tagging
DOI: https://doi.org/10.31449/inf.v49i22.11204
This work is licensed under a Creative Commons Attribution 3.0 License.