Discriminating Between Closely Related Languages on Twitter

Nikola Ljubešić, Denis Kranjčić

Abstract

In this paper we tackle the problem of discriminating Twitter users by the language they tweet in, taking
into account four very similar South Slavic languages: Bosnian, Croatian, Montenegrin and Serbian. We
apply a supervised machine learning approach, annotating a subset of 500 users from an existing
Twitter collection by the language they primarily tweet in. We show that by using a simple bag-of-words
model, univariate feature selection, the 320 strongest features and a standard classifier, we reach a user
classification accuracy of 98%. Annotating the whole Twitter collection of 63,160 users with the best-performing
classifier and visualizing it on a map via tweet geo-information, we produce a Twitter language
map which clearly depicts the robustness of the classifier.
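
The pipeline named in the abstract (bag-of-words, univariate feature selection, a fixed number of top features, a standard classifier) can be illustrated with a minimal sketch. The abstract does not say which selection statistic or classifier was used, so the chi-square score and the multinomial Naive Bayes classifier below are assumptions chosen for illustration. The six toy "users", the two labels and k=4 are likewise invented and are not the paper's data or settings.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count lowercased whitespace tokens (one bag per user)."""
    return Counter(text.lower().split())

def chi2_scores(docs, labels):
    """Chi-square score of each word's presence/absence against the label."""
    n = len(docs)
    class_sizes = Counter(labels)
    present = {}  # word -> Counter(label -> number of users using the word)
    for doc, label in zip(docs, labels):
        for word in doc:
            present.setdefault(word, Counter())[label] += 1
    scores = {}
    for word, by_class in present.items():
        n_with = sum(by_class.values())
        score = 0.0
        for label, size in class_sizes.items():
            # Two cells per class: users with the word, users without it.
            for observed, row_total in ((by_class[label], n_with),
                                        (size - by_class[label], n - n_with)):
                expected = row_total * size / n
                if expected > 0:
                    score += (observed - expected) ** 2 / expected
        scores[word] = score
    return scores

def select_features(scores, k):
    """Keep the k highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:k])

def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes restricted to the selected vocabulary."""
    counts = {label: Counter() for label in set(labels)}
    for doc, label in zip(docs, labels):
        for word, c in doc.items():
            if word in vocab:
                counts[label][word] += c
    priors = {l: math.log(c / len(labels)) for l, c in Counter(labels).items()}
    return counts, priors, vocab

def predict(model, doc):
    """Return the label with the highest Laplace-smoothed log-probability."""
    counts, priors, vocab = model
    best_label, best_score = None, -math.inf
    for label in sorted(counts):
        total = sum(counts[label].values())
        score = priors[label]
        for word, c in doc.items():
            if word in vocab:
                score += c * math.log((counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, invented training data: (concatenated tweets of one user, label).
users = [
    ("tko je kupio kruh ovaj tjedan", "hr"),
    ("kruh je svjež cijeli tjedan", "hr"),
    ("tko zna gdje je kruh", "hr"),
    ("ko je kupio hleb ove nedelje", "sr"),
    ("hleb je svež cele nedelje", "sr"),
    ("ko zna gde je hleb", "sr"),
]
docs = [bag_of_words(text) for text, _ in users]
labels = [label for _, label in users]

selected = select_features(chi2_scores(docs, labels), k=4)  # the paper keeps 320
model = train_nb(docs, labels, selected)
print(predict(model, bag_of_words("gdje je kruh")))
```

Words shared by both classes, such as "je", score zero under this statistic and are dropped, which is the intended effect of univariate selection: only words whose presence correlates with one label survive into the classifier.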

This work is licensed under a Creative Commons Attribution 3.0 License.

Informatica is financially supported by the Slovenian Research Agency through the Call for co-financing of scientific periodical publications.
