Categorization of Event Clusters from Twitter Using Term Weighting Schemes

Surender Singh Samant, NL Bhanu Murthy, Aruna Malapati

Abstract


A real-world event is commonly represented on Twitter as a collection of repetitive and noisy text messages posted by different users. Term weighting is a popular pre-processing step for text classification, especially when the dataset is small. In this paper, we propose a new term weighting scheme and a modification to an existing one, and compare them with many state-of-the-art methods using three popular classifiers. We create a labelled Twitter dataset of events for exhaustive cross-validation experiments and use another Twitter event dataset for cross-corpus tests. The proposed schemes are among the best performers in many experiments, and the proposed modification significantly improves the performance of the original scheme. We also create two majority-voting-based classifiers that further improve on the F1-scores of the best individual schemes.

DOI: https://doi.org/10.31449/inf.v45i3.3063

This work is licensed under a Creative Commons Attribution 3.0 License.