Authorship Clustering in Newspaper Comments

In this project, authorship verification, clustering and identification tasks will be applied to comments on German online newspapers.

Artificial Intelligence

Groups
Student Developers for AI

Overview / Usage

This project will apply author verification, clustering and identification to a
dataset that combines opinion comments pulled from major German online newspaper sites with
comments on the social media accounts of these newspapers. The main research question
is whether comments written on different platforms under different pseudonyms and
accounts can be reliably linked to the same anonymous user.
These links can be used to cluster comments and tweets by authorship.
The clusters then serve as input for further analyses, such as behavioural or time-dependent
analyses, which can be used to identify social bots (Ferrara et al. 2014).

In the literature, author clustering is composed of many author verification tasks that generate
authorship links between documents. Each authorship link is labelled with a confidence
score indicating how likely the two documents are to share an author. Converted into a
distance (high confidence meaning small distance), these scores can be used by clustering
algorithms to find clusters of documents most likely to be written by the same author
(Stamatatos et al. 2016).
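The step from authorship links to clusters can be sketched as follows. This is a minimal illustration, not the project's method: it assumes pairwise confidence scores in [0, 1] are already available, and it uses a simple threshold plus connected components, which is equivalent to single-linkage clustering cut at distance 1 - threshold. The function name `cluster_by_links`, the threshold value and the example scores are hypothetical.

```python
def cluster_by_links(n_docs, links, threshold=0.5):
    """Group documents by joining pairs whose confidence >= threshold.

    links maps (i, j) document-index pairs to a same-author confidence
    in [0, 1]. Joining all pairs above the threshold and taking connected
    components equals single-linkage clustering cut at 1 - threshold.
    """
    parent = list(range(n_docs))  # union-find forest, one root per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), confidence in links.items():
        if confidence >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for doc in range(n_docs):
        clusters.setdefault(find(doc), []).append(doc)
    return sorted(clusters.values())


# Hypothetical confidence scores for four comments (illustration only).
links = {
    (0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.3,
    (1, 2): 0.1, (1, 3): 0.2, (2, 3): 0.8,
}
clusters = cluster_by_links(4, links, threshold=0.5)
print(clusters)
```

Single linkage tends to chain clusters together through a few strong links, so other linkage criteria or clustering algorithms may well be more suitable in practice.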

To generate the authorship links, different features can be used to measure similarities
or differences between texts. The literature refers to these as stylometric features. Stamatatos (2008)
divides them into five groups: lexical, character, syntactic, semantic and application-specific.
Each group contains many individual features, so feature selection and combination
may be needed before they can be used. After feature selection and
extraction, statistical models are used to compute the confidence score. Stamatatos (2008)
refers to these models as attribution methods and identifies two kinds:
profile-based models, which combine all texts by an author to compute a profile
of that author, and instance-based models, which treat each text individually. In my case
I will use instance-based models, as I do not have texts that are known with certainty to be
written by a given author. Based on Stamatatos (2008), the main question can be divided into
sub-questions as follows:

Q1 Which feature groups and particular features are suitable for comments from
online sites, given that these texts are often very short and contain many abbreviations?

Q2 Which feature selection and extraction methods can be used for the identified
features?

Q3 Which statistical and machine learning methods compute the confidence score
most accurately?

Q4 Which performance measures can be used to evaluate the computed models?
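To make the instance-based setting behind Q1 to Q3 concrete, here is a minimal sketch of one possible pipeline: each comment is represented on its own by character n-gram counts (one feature from the character group), and the cosine similarity of two such vectors serves as a crude link score. The choice of trigrams, the cosine measure, the function names and the example comments are all assumptions for illustration; which features and models actually suit short comments is exactly what the questions above leave open.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text shorter than n has no n-grams
    return dot / (norm_a * norm_b)


def link_score(text_a, text_b, n=3):
    """Instance-based score: each comment is treated individually."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))


# Hypothetical comments, for illustration only.
same = link_score("imo das ist doch quatsch!!", "imo das ist quatsch!!")
diff = link_score("imo das ist doch quatsch!!",
                  "Sehr geehrte Redaktion, vielen Dank.")
print(round(same, 3), round(diff, 3))
```

A raw cosine value is not a calibrated probability; a trained verification model would normally map such similarities to a confidence score.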

The project contributes to the field of author identification by improving methodologies for
reliably linking authors to their texts. In particular, it tries to sharpen these methods for use in
online media, with a large number of possible authors and very short texts per author. In
addition, it demonstrates the use of author identification for distinguishing humans
from machines in a real-world problem.
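For Q4, one candidate family of measures is the extrinsic clustering metrics compared by Amigó et al. (2009), among them BCubed precision and recall. The sketch below assumes that a ground-truth author label is available for every document in a non-empty evaluation set; the function name `bcubed` and the toy labels are hypothetical, and the project may settle on different measures.

```python
def bcubed(predicted, truth):
    """Return BCubed (precision, recall, F1) for a clustering.

    predicted maps each document id to a cluster label,
    truth maps each document id to its true author label.
    Assumes at least one document.
    """
    docs = list(truth)
    precision = recall = 0.0
    for d in docs:
        same_cluster = [e for e in docs if predicted[e] == predicted[d]]
        same_author = [e for e in docs if truth[e] == truth[d]]
        # Documents sharing both the cluster and the author of d (d included).
        correct = sum(1 for e in same_cluster if truth[e] == truth[d])
        precision += correct / len(same_cluster)
        recall += correct / len(same_author)
    precision /= len(docs)
    recall /= len(docs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two authors, one comment placed in the wrong cluster.
truth = {0: "a", 1: "a", 2: "b", 3: "b"}
predicted = {0: 0, 1: 0, 2: 0, 3: 1}
p, r, f = bcubed(predicted, truth)
print(round(p, 3), round(r, 3), round(f, 3))
```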

Amigó, E., Gonzalo, J., Artiles, J., and Verdejo, F. 2009. “A comparison of extrinsic clustering
evaluation metrics based on formal constraints,” Information Retrieval (12:4),
pp. 461–486.

Bagnall, D. 2016. “Authorship clustering using multi-headed recurrent neural networks,”
in Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum,
Évora, Portugal, 5-8 September, 2016.

Ferrara, E., Varol, O., Davis, C., Menczer, F., and Flammini, A. 2014. “The Rise of Social
Bots,” Communications of the ACM (59:7), pp. 96–104.

Kocher, M. 2016. “UniNE at CLEF 2016: Author Clustering,” in Working Notes of
CLEF 2016 - Conference and Labs of the Evaluation forum, Évora, Portugal, 5-8.

Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval.

Pan.webis.de 2016. “PAN @ CLEF 2016,” .

Stamatatos, E. 2008. “A Survey of Modern Authorship Attribution Methods,” Journal of
the American Society for Information Science and Technology (60:3), pp. 433–643.

Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B.,
and Potthast, M. 2016. “Clustering by Authorship Within and Across Documents,”
in Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Évora,
Portugal, 5-8 September, 2016, pp. 691–715.
