Auto-generation of tags from arXiv article titles

Sayak Paul

Kolkata, West Bengal

arXiv is a powerhouse of modern research, especially in the field of machine learning. Many of the seminal papers we read today are openly hosted on arXiv. This project demonstrates a method for auto-generating the tags that might be related to an arXiv article.

Project status: Under Development

Artificial Intelligence

Intel Technologies
Intel Python

Overview / Usage

The main objective of the project is to automatically generate the tags that might be associated with an arXiv article. The tags here denote categories such as cs.LG, cs.AI, stat.ML and so on.

An article can carry more than one tag, which makes this a multi-label text classification problem: given an arXiv article, which tags is it most likely to map to? The metric typically used here is **categorical accuracy**: a prediction counts as correct only if every category the article is tagged with is predicted correctly.
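To make the strictness of this metric concrete, here is a minimal sketch in plain Python. The function name `exact_match_accuracy` and the example tags are illustrative assumptions, not code or data from the project. It scores a prediction as correct only when the predicted tag set equals the true tag set, which is the all-or-nothing criterion described above (scikit-learn calls this subset accuracy, and its `accuracy_score` computes it for multi-label indicator input).

```python
from typing import List, Set


def exact_match_accuracy(true_tags: List[Set[str]],
                         predicted_tags: List[Set[str]]) -> float:
    """Fraction of articles whose predicted tag set equals the true tag set."""
    if len(true_tags) != len(predicted_tags):
        raise ValueError("true_tags and predicted_tags must have the same length")
    if not true_tags:
        raise ValueError("need at least one article to score")
    # A partially correct prediction (one tag missing or extra) earns no credit.
    hits = sum(1 for truth, pred in zip(true_tags, predicted_tags)
               if set(truth) == set(pred))
    return hits / len(true_tags)


# Hypothetical tags for three articles (not taken from the dataset).
y_true = [{"cs.LG", "stat.ML"}, {"cs.AI"}, {"cs.LG", "cs.AI"}]
y_pred = [{"cs.LG", "stat.ML"}, {"cs.AI"}, {"cs.LG"}]

score = exact_match_accuracy(y_true, y_pred)
print(score)  # two of the three articles match exactly
```

The third article shows why the metric is demanding: predicting cs.LG alone when the article is tagged cs.LG and cs.AI counts as a miss.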

The dataset I used is available here: https://www.kaggle.com/neelshah18/arxivdataset

Methodology / Approach

The main problem I tackled in this project was data preparation: how could I effectively represent the multiple labels (tags) attached to an article title? The answer was label binarization. I followed good software engineering practices while defining the machine learning models. I started with simple models over n-gram features, such as a basic fully connected network, and from there moved on to more advanced models.
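The two data preparation ideas mentioned here can be sketched in plain Python. This is an illustration under stated assumptions, not the project's actual code: the helper names (`fit_tag_index`, `binarize`, `word_ngrams`), the whitespace tokenizer, and the example tags and title are all hypothetical. Label binarization turns each article's tag list into a fixed-length 0/1 vector with one column per known tag; scikit-learn's `MultiLabelBinarizer` offers the same transformation. The n-gram helper shows the kind of features a simple fully connected network could consume.

```python
from typing import Dict, List, Sequence


def fit_tag_index(tag_lists: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign each distinct tag a column index, in sorted order."""
    vocabulary = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: column for column, tag in enumerate(vocabulary)}


def binarize(tag_lists: Sequence[Sequence[str]],
             tag_index: Dict[str, int]) -> List[List[int]]:
    """Encode each tag list as a 0/1 row with one column per known tag."""
    rows = []
    for tags in tag_lists:
        row = [0] * len(tag_index)
        for tag in tags:
            row[tag_index[tag]] = 1
        rows.append(row)
    return rows


def word_ngrams(title: str, max_n: int = 2) -> List[str]:
    """Return lowercase word n-grams of a title for n = 1..max_n."""
    tokens = title.lower().split()  # naive whitespace tokenizer (assumption)
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Hypothetical tags for three articles (not taken from the dataset).
article_tags = [["cs.LG", "stat.ML"], ["cs.AI"], ["cs.AI", "cs.LG"]]
tag_index = fit_tag_index(article_tags)
targets = binarize(article_tags, tag_index)

for tags, row in zip(article_tags, targets):
    print(tags, "->", row)

print(word_ngrams("Deep Learning for Text"))
```

With a binarized target matrix like `targets`, a network can use one sigmoid output per tag, so each tag is predicted independently of the others.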

I will open-source the project very soon and will also write a blog post detailing every major decision I made along the way.

Technologies Used

  • Python 3.7

  • Jupyter Notebooks

  • Facet

  • Scikit-Learn

  • Keras

Repository

https://github.com/sayakpaul/Generating-categories-from-arXiv-paper-titles/
