Ensemble of Multilingual Language Models with Pseudo Labeling for Offence Detection in Dravidian Languages

Debapriya Tula

Bengaluru, Karnataka


Project status: Published/In Market

oneAPI, Artificial Intelligence

Intel Technologies
DevCloud

Overview / Usage

With the advent of social media, we have seen a proliferation of data and public discourse. Unfortunately, this includes offensive content as well. The problem is exacerbated by the sheer number of languages spoken on these platforms and the many other modalities used for sharing offensive content (images, GIFs, videos and more). In this paper, we propose a multilingual ensemble-based model that can identify offensive content targeted against an individual (or group) in low-resource Dravidian languages. Our model is able to handle code-mixed data as well as instances where the script used is mixed (for instance, Tamil and Latin). Our solution ranked first on the Malayalam dataset and 4th and 5th for Tamil and Kannada, respectively.

Methodology / Approach

We build a soft-voting ensemble of three deep learning models, viz.:

  1. DistilmBERT, a distilled version of multilingual BERT (mBERT) trained on 104 languages;

  2. IndicBERT, an ALBERT-based model trained exclusively on Indian languages;

  3. ULMFiT, a transfer-learning technique built on the AWD-LSTM network.
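Soft voting combines the three models by averaging their predicted class probabilities and taking the argmax of the average. The sketch below illustrates the idea with NumPy; the function name `soft_vote` and the toy probability matrices are illustrative assumptions, not taken from our repository.

```python
import numpy as np

def soft_vote(prob_matrices, weights=None):
    """Average class-probability matrices from several models.

    prob_matrices: list of (n_samples, n_classes) arrays of softmax outputs.
    weights: optional per-model weights; uniform averaging if None.
    """
    probs = np.stack(prob_matrices)           # (n_models, n_samples, n_classes)
    avg = np.average(probs, axis=0, weights=weights)
    return avg.argmax(axis=1)                 # predicted class per sample

# Toy softmax outputs from three hypothetical models (2 samples, 2 classes)
p_distilmbert = np.array([[0.6, 0.4], [0.3, 0.7]])
p_indicbert   = np.array([[0.7, 0.3], [0.4, 0.6]])
p_ulmfit      = np.array([[0.4, 0.6], [0.1, 0.9]])

preds = soft_vote([p_distilmbert, p_indicbert, p_ulmfit])
# averaged probs: [[0.567, 0.433], [0.267, 0.733]] -> classes [0, 1]
```

Because soft voting averages probabilities rather than discrete labels, a model that is confidently right can outweigh two models that are marginally wrong, which tends to make it more robust than hard (majority) voting.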

Technologies Used

Natural Language Processing, PyTorch, Transformers

Documents and Presentations

Repository

https://github.com/Debapriya-Tula/EACL2021-DravidianTask-Bitions