A novel term weighting scheme for imbalanced text classification

Tanapon Tantisripreecha; Nuanwan Soonthornphisaj

doi:10.31449/inf.v46i2.3523

About The Authors

Tanapon Tantisripreecha
Department of Mathematics, Faculty of Science, Mahidol University
Thailand

Nuanwan Soonthornphisaj
Department of Computer Science, Faculty of Science, Kasetsart University
Thailand

Abstract

High-dimensional features are the main problem in the text domain. When the classes are also imbalanced, classifier performance degrades further, and oversampling rarely yields an improvement in this setting. In this paper, we propose a new term weighting scheme that combines term frequency with an average of the inverse document frequency factor. We denote the scheme TFmeanIDF. The method shows high potential for imbalanced, high-dimensional text data and requires neither feature selection nor oversampling. Extensive comparisons on 7 datasets validate the advantages of TFmeanIDF in terms of the F1 score obtained with widely used base classifiers such as logistic regression and Support Vector Machines. The F1 score of the minority class is higher than that obtained with baseline term weighting schemes. Using TFmeanIDF as the term weighting scheme gives promising results for both logistic regression and Support Vector Machines.
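The abstract does not state the TFmeanIDF formula, so the following is only a minimal sketch of one plausible reading: a smoothed IDF is computed separately within each class, the per-class values are averaged, and the result is multiplied by the raw term frequency. The function name `tf_mean_idf`, the smoothing, and the choice to average over classes are all assumptions made for illustration, not the authors' definition.

```python
import math
from collections import Counter


def tf_mean_idf(docs, labels):
    """Illustrative TF x mean-IDF weighting (assumed form, not the paper's formula).

    docs   -- list of token lists, one per document
    labels -- list of class labels, aligned with docs
    Returns (weights, mean_idf): one {term: weight} dict per document,
    and the {term: mean IDF} table used to build them.
    """
    classes = sorted(set(labels))
    n_docs = Counter(labels)  # number of documents in each class

    # Document frequency of every term, counted separately per class.
    df = {c: Counter() for c in classes}
    for tokens, y in zip(docs, labels):
        df[y].update(set(tokens))

    vocab = set().union(*(df[c].keys() for c in classes))

    mean_idf = {}
    for term in vocab:
        # Assumed smoothed IDF within class c: log((1 + N_c) / (1 + df_c)) + 1
        idfs = [
            math.log((1 + n_docs[c]) / (1 + df[c][term])) + 1
            for c in classes
        ]
        # Unweighted mean: each class counts equally, whatever its size.
        mean_idf[term] = sum(idfs) / len(idfs)

    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency in this document
        weights.append({t: tf[t] * mean_idf[t] for t in tf})
    return weights, mean_idf


# Toy imbalanced corpus: one minority document, three majority documents.
docs = [
    ["spam", "offer"],
    ["ham", "offer"],
    ["ham", "meeting"],
    ["ham", "meeting", "offer"],
]
labels = ["minority", "majority", "majority", "majority"]
weights, mean_idf = tf_mean_idf(docs, labels)
```

Under this assumed form, averaging over classes rather than over all documents keeps a term that occurs only in the small class from being treated as merely rare in the pooled corpus. Whether this matches the behaviour reported in the paper would need to be checked against the full text.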

This work is licensed under a Creative Commons Attribution 3.0 License.
