Authorship Clustering in Newspaper Comments


Stephan Volkeri


Münster, Nordrhein-Westfalen

In this project, authorship verification, clustering, and identification tasks will be applied to German online newspaper comments.

Artificial Intelligence


Description

This project applies author verification, clustering, and identification to a dataset that combines opinion comments pulled from major German online newspaper sites with comments posted on those sites' social media accounts. The main research question is whether anonymous users can be reliably linked across platforms through their comments, even when they use different pseudonyms and accounts. Such links can be used to cluster comments and tweets by authorship, and the resulting clusters serve as input for further analyses, such as behavioural or time-dependent analyses for identifying social bots (Ferrara et al. 2014).

In the literature, author clustering is composed of many author verification tasks that generate authorship links between documents. Each link is labelled with a confidence score indicating the likelihood that the linked documents were written by the same author. These confidence scores can then serve as distance measures for clustering algorithms, which find clusters of documents most likely to share an author (Stamatatos et al. 2016).
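As a minimal sketch of this clustering step, the snippet below assumes pairwise confidence scores have already been produced by the verification stage (the matrix here is invented for illustration), converts them into distances, and feeds them to hierarchical clustering with a precomputed metric:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Illustrative pairwise confidence scores (likelihood that two
# documents share an author); real values would come from the
# verification models described above.
confidence = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])

# Turn similarities into distances; the diagonal must be zero for
# squareform to accept the matrix as a valid distance matrix.
distance = 1.0 - confidence
np.fill_diagonal(distance, 0.0)

# Average-linkage hierarchical clustering on the condensed matrix,
# cut at an (assumed) distance threshold of 0.5.
Z = linkage(squareform(distance), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # documents 0/1 and 2/3 fall into two separate clusters
```

The threshold 0.5 is an assumption; in practice it would be tuned on held-out data, and the choice of linkage method is itself one of the open design decisions.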

To generate the authorship links, different features can be used to measure similarities or differences; in the literature these are referred to as stylometric features. Stamatatos (2008) divides them into five groups: lexical, character, syntactic, semantic, and application-specific. Each group contains several different features and instances, so feature selection and combination may be needed before they can be used. After feature selection and extraction, statistical models are used to compute the confidence score. Stamatatos (2008) refers to these models as attribution methods and distinguishes two kinds: profile-based models, which combine all texts by an author into a single author profile, and instance-based models, which treat each text individually. In my case, I will use instance-based models, as I have no texts that are certainly written by a known author. Based on Stamatatos (2008), the main question can be divided into the following sub-questions:

Q1 Which feature groups and particular features are suitable for comments from online sites, given that these texts are often very short and contain many abbreviations?

Q2 Which feature selection and extraction methods can be used for the identified features?

Q3 Which statistical and machine learning methods compute the confidence score most accurately?

Q4 Which performance measures can be used to evaluate the computed models?
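As an illustration of Q1 and Q2, one widely used candidate from the character feature group is the character n-gram profile. The sketch below compares tf-idf-weighted character 3-grams with cosine similarity; the comments and the choice of n=3 are illustrative assumptions, not results of this project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example comments; the real input would be the scraped
# newspaper and social media comments.
comments = [
    "Das seh ich genauso, die Politik muss endlich handeln!",
    "Seh ich auch so, da muss die Politik endlich mal handeln.",
    "A comment in a completely different style and language.",
]

# Character 3-grams capture sub-word style and tend to be robust
# to the short, abbreviation-heavy texts mentioned in Q1.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(comments)

# Pairwise cosine similarities between the comment profiles; these
# could feed into the confidence scores discussed above.
sim = cosine_similarity(X)
print(sim.round(2))
```

The two stylistically similar German comments share many 3-grams and score higher with each other than with the third comment, which hints at why character features are a reasonable starting point for Q1.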

The project contributes to the field of author identification by improving methodologies for reliably linking authors to their texts. In particular, it aims to sharpen these methods for use in online media, where there are many possible authors and very little text per author. In addition, it demonstrates the use of author identification for distinguishing humans from machines in a real-world problem.

Amigó, E., Gonzalo, J., Artiles, J., and Verdejo, F. 2009. “A comparison of extrinsic clustering evaluation metrics based on formal constraints,” Information Retrieval (12:4), pp. 461–486.

Bagnall, D. 2016. “Authorship clustering using multi-headed recurrent neural networks,” in Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Évora, Portugal, 5-8 September, 2016.

Ferrara, E., Varol, O., Davis, C., Menczer, F., and Flammini, A. 2014. “The Rise of Social Bots,” Communications of the ACM (59:7), pp. 96–104.

Kocher, M. 2016. “UniNE at CLEF 2016: Author Clustering,” in Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Évora, Portugal, 5-8 September, 2016.

Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval, Cambridge University Press.

Pan.webis.de 2016. “PAN @ CLEF 2016.”

Stamatatos, E. 2008. “A Survey of Modern Authorship Attribution Methods,” Journal of the American Society for Information Science and Technology (60:3), pp. 538–556.

Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., and Potthast, M. 2016. “Clustering by Authorship Within and Across Documents,” in Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, Évora, Portugal, 5-8 September, 2016, pp. 691–715.

Links

Project Website

