An ensemble model for sentiment classification on code-mixed data in Dravidian Languages

EasyChair Preprint 7266
9 pages • Date: December 27, 2021

Abstract

The Dravidian languages Tamil, Kannada, Malayalam and Telugu are spoken by over 220 million people but are vastly under-resourced for natural language processing tasks. Code-switching and code-mixing have been on the rise, with multilingual speakers choosing to express their opinions in their mother tongue along with English, both in written text and in speech. Challenges arise in sentiment analysis of code-switched Dravidian languages because of under-resourced corpora and the randomness with which the languages are interspersed. This paper applied an ensemble sentiment classification strategy based on majority voting, using 13 different classification models, to the Dravidian code-mixed languages dataset provided in FIRE 2021. The code-mixed dataset contained YouTube comments with an average word count per comment of less than 7. The key conclusion from our experiments was that the ensemble of multiple classifiers outperformed the other approaches for sentiment classification. Our results show that weighted F1-scores of 0.59, 0.65 and 0.60 on Kannada, Malayalam and Tamil code-switched data, respectively, can be achieved with traditional machine learning algorithms through an ensemble of multiple classifiers.

Keyphrases: Dravidian, Kanglish, Manglish, Tanglish, code-mixing, code-switching, sentiment classification
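
The majority-voting strategy named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `majority_vote`, the label strings, and the tie-breaking rule (the earliest model in the list wins a tie) are all assumptions made for illustration, and three toy models stand in for the 13 classifiers used in the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-model label predictions by hard (majority) voting.

    predictions[m][i] is the label that model m assigned to comment i.
    Ties are broken in favour of the earliest model in the list
    (an assumption; the abstract does not state a tie-breaking rule).
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_comments = len(predictions[0])
    if any(len(p) != n_comments for p in predictions):
        raise ValueError("all models must predict the same number of comments")

    voted = []
    # zip(*predictions) yields, for each comment, the votes of all models.
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # First label (in model order) that reaches the highest vote count.
        winner = next(label for label in votes if counts[label] == top)
        voted.append(winner)
    return voted


# Hypothetical predictions from three models on three comments.
model_outputs = [
    ["positive", "negative", "neutral"],
    ["positive", "positive", "neutral"],
    ["negative", "negative", "positive"],
]
ensemble_labels = majority_vote(model_outputs)
```

With these toy inputs each comment receives the label chosen by two of the three models. In practice a library implementation such as scikit-learn's `VotingClassifier` with `voting="hard"` provides the same kind of combination for fitted estimators.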