Discriminating Between Closely Related Languages on Twitter
Abstract
In this paper we tackle the problem of discriminating Twitter users by the language they tweet in,
focusing on four very similar South Slavic languages: Bosnian, Croatian, Montenegrin and Serbian. We
take a supervised machine learning approach, annotating a subset of 500 users from an existing
Twitter collection with the language each user primarily tweets in. We show that a simple bag-of-words
model, univariate feature selection retaining the 320 strongest features, and a standard classifier reach
a user classification accuracy of 98%. Annotating the whole collection of 63,160 users with the
best-performing classifier and plotting the result on a map via tweet geo-information, we produce a
Twitter language map that clearly illustrates the robustness of the classifier.
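
The pipeline named in the abstract (bag-of-words, univariate feature selection, a standard classifier) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say which univariate test or which classifier was used, so the chi-squared test on word presence and the multinomial naive Bayes classifier below are assumptions. The six "users" are invented toy texts labeled `hr` and `sr`, and all function names are hypothetical. The paper keeps the 320 strongest features; the toy example keeps 6.

```python
import math
from collections import Counter

def tokenize(text):
    # Bag-of-words: lowercase, split on whitespace, ignore word order.
    return text.lower().split()

def chi2_scores(docs, labels):
    # Univariate selection (assumed test): chi-squared of word presence
    # against each class in a 2x2 table, keeping the maximum over classes.
    n = len(docs)
    class_counts = Counter(labels)
    doc_freq, joint = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        for w in set(tokens):
            doc_freq[w] += 1
            joint[(w, label)] += 1
    scores = {}
    for w, df in doc_freq.items():
        best = 0.0
        for c, nc in class_counts.items():
            a = joint[(w, c)]      # word present, class c
            b = df - a             # word present, other classes
            c_ = nc - a            # word absent, class c
            d = n - df - c_        # word absent, other classes
            denom = (a + b) * (c_ + d) * (a + c_) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c_) ** 2 / denom)
        scores[w] = best
    return scores

def select_features(scores, k):
    # Keep the k highest-scoring words; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

def train_nb(docs, labels, vocab):
    # Multinomial naive Bayes with add-one smoothing (assumed classifier).
    vocab = set(vocab)
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for tokens, label in zip(docs, labels):
        word_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)
        prior = math.log(class_counts[c] / len(labels))
        loglik = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
        model[c] = (prior, loglik)
    return model

def predict(model, tokens):
    # Words outside the selected vocabulary are ignored.
    def score(c):
        prior, loglik = model[c]
        return prior + sum(loglik[t] for t in tokens if t in loglik)
    return max(sorted(model), key=score)

# Toy data: one concatenated text per user (invented, not from the paper).
users = [
    ("tko je kupio kruh i mlijeko", "hr"),
    ("kruh i mlijeko su na stolu", "hr"),
    ("tko voli mlijeko", "hr"),
    ("ko je kupio hleb i mleko", "sr"),
    ("hleb i mleko su na stolu", "sr"),
    ("ko voli mleko", "sr"),
]
docs = [tokenize(text) for text, _ in users]
labels = [label for _, label in users]

selected = select_features(chi2_scores(docs, labels), k=6)
model = train_nb(docs, labels, selected)

print(selected)
print(predict(model, tokenize("tko ima kruh")))
print(predict(model, tokenize("ko jede hleb")))
```

In this toy setting, words shared by both classes (such as "je" or "i") score zero and are dropped, so only the discriminating word pairs reach the classifier. That is the intuition behind keeping a small set of strong features.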
This work is licensed under a Creative Commons Attribution 3.0 License.