Beyond a bag of words

Visitors to blog sites can use the application to get an automatic summary of a blog post and decide from that summary whether the full post is worth reading. This saves the user the time otherwise spent on lengthy, redundant, or irrelevant material.

Project status: Under Development

Artificial Intelligence

Intel Technologies
AI DevCloud / Xeon

Overview / Usage

Search engines keep getting better at retrieving precise information from the World Wide Web (WWW), yet we still search through large numbers of documents every day to keep up. The blogosphere is currently a popular source of information, with personal and business blogs written to be read by others. Blogs are used by millions, from individuals to businesses. According to a study by Hutama et al., around 25% of the total time spent online is devoted to reading blogs. Studies also show that posts are getting longer every year.

The hard truth is that many of these articles and blogs repeat information or contain material that is not relevant to the user. It is difficult to pick out the important content of a blog without reading it, and by then the user has already spent the time. Today, the struggle is not to get access to information but to save time in getting to the most useful material.

What if one could get accurate and succinct summaries of the key points of every blog or article? The goal of this project is to provide automatic text summarization that serves up coherent, succinct highlights and helps users stay informed in a fraction of the time. Text summarization is a tough challenge, especially for longer texts such as news articles and blog posts. Our system uses Natural Language Processing (NLP) and Machine Learning to extract parts of a document and automatically create a summary. Instead of requiring the user to read a lengthy article, the proposed system uses a “deep reinforced” model to provide a quick takeaway in an easy-to-read summary.

Methodology / Approach

The project follows a client-server architecture: a client-facing web portal accepts the input blog text, and a server processes that text into a summary. When the user clicks the summary button on the web page, the application automatically scans the blog content on the page.
The server receives the scanned text and preprocesses it. It splits the input text into sentences (segmentation) and into words (tokenization). Features are then extracted for each sentence and word. Once the sentence features are ready, classification is performed using previously prepared training data. Sentences classified as containing meaningful content are selected as summary sentences and returned to the client for display in the browser.
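
The server-side steps described above (segmentation, tokenization, feature extraction, classification, selection) can be illustrated with a minimal sketch. It is not the project's implementation. It uses only the Python standard library: regular expressions stand in for NLTK's sentence and word tokenizers, and a simple threshold rule stands in for the classifier trained on prepared data. The function names, the stopword list, and the two features (average word frequency and sentence length) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = frozenset(
    "a an and are as at be by for from has have in is it its of on or that "
    "the this to was were will with".split()
)


def split_sentences(text):
    """Segmentation: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Tokenization: lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def sentence_features(tokens, freqs):
    """Feature extraction: average document frequency of content words, and length."""
    content = [t for t in tokens if t not in STOPWORDS]
    avg_freq = sum(freqs[t] for t in content) / len(content) if content else 0.0
    return {"avg_freq": avg_freq, "length": len(tokens)}


def is_summary_sentence(features, freq_threshold, min_length=4):
    """Stand-in for the trained classifier: a hand-written threshold rule."""
    return features["length"] >= min_length and features["avg_freq"] >= freq_threshold


def summarize(text):
    """Run the full pipeline and return the selected sentences in original order."""
    sentences = split_sentences(text)
    tokenized = [tokenize(s) for s in sentences]
    # Word frequencies over the whole document, ignoring stopwords.
    freqs = Counter(t for toks in tokenized for t in toks if t not in STOPWORDS)
    feats = [sentence_features(toks, freqs) for toks in tokenized]
    if not feats:
        return []
    # Keep sentences whose content words are at least as frequent as the document average.
    threshold = sum(f["avg_freq"] for f in feats) / len(feats)
    return [s for s, f in zip(sentences, feats) if is_summary_sentence(f, threshold)]


if __name__ == "__main__":
    blog = (
        "Blogs are long. Summaries save readers time. "
        "Good summaries help readers pick blogs. Hello there."
    )
    for sentence in summarize(blog):
        print(sentence)
```

In a deployment along the lines described, the server would call something like `summarize` on the scanned blog text and return the selected sentences to the browser. Replacing `is_summary_sentence` with a model trained on labeled sentences would bring the sketch closer to the classification step the project describes.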

Technologies Used

Python, Node.js, NLTK (Natural Language Toolkit) library, Intel Xeon Phi cluster, Intel DevCloud.
