Event Detection Using Bit Locality Sensitive Hashing & DBSCAN clustering on Twitter Data

Segun sodimu

Ogun State

Locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items, such as tweets, documents, or images, into the same "buckets" with high probability. We used the concepts of LSH to detect trending tweets in Twitter data.

Project status: Published/In Market

Artificial Intelligence

Intel Technologies
Intel Integrated Graphics

Code Samples [1]

Overview / Usage

Bit locality-sensitive hashing combined with DBSCAN is an unsupervised machine learning approach that takes in Twitter data and clusters the tweets to detect trending topics.

The goal of using LSH is to group similar tweets into the same buckets. Candidate pairs are those that hash to the same bucket at least once.

Using MinHash signatures computed with different numbers of permutations as the input data makes it easier to perform nearest-neighbour search and to group similar tweets.

Methodology / Approach

Data Collection

We start by downloading Twitter data and saving it to a CSV file.

Preprocessing

We apply several preprocessing steps to the dataset:

  1. Remove non-alphanumeric characters.
  2. Convert all words to lower case.
  3. Eliminate punctuation.
  4. Replace numbers with their word equivalents.
  5. Remove stop words.
  6. Perform lemmatization.
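
The six steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the project's actual code: the stop-word list is a tiny sample, numbers are spelled digit by digit because the source does not say how they are converted, and `naive_lemma` is a crude stand-in for a real lemmatizer (for example one backed by a dictionary such as WordNet). All function names here are hypothetical.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "at", "to", "of", "and", "for"}

# Digit-by-digit spelling is an assumption; the source does not specify
# how numbers are turned into words.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_number(token):
    """Spell a run of digits one digit at a time, e.g. '42' -> 'four two'."""
    return " ".join(DIGIT_WORDS[int(d)] for d in token)


def naive_lemma(word):
    """Crude stand-in for lemmatization: strips a trailing plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(tweet):
    """Return the cleaned token list for one tweet."""
    text = tweet.lower()                                   # step 2
    text = re.sub(r"[^a-z0-9\s]", " ", text)               # steps 1 and 3
    text = re.sub(r"\d+",
                  lambda m: " " + spell_number(m.group()) + " ",
                  text)                                    # step 4
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # step 5
    return [naive_lemma(t) for t in tokens]                # step 6
```

With these definitions, `preprocess("The 2 goals in Abuja!!!")` returns `['two', 'goal', 'abuja']`.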

Feature Representation

TFIDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

We used TFIDF to convert each preprocessed tweet from human-readable text into a sparse vector.
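
To make the weighting concrete, here is a minimal sketch that uses the textbook formula, tf(t, d) × log(N / df(t)), and stores each tweet as a sparse `{term: weight}` dictionary. The exact TFIDF variant used in the project is not stated, so the formula and the name `tfidf_vectors` are assumptions; library implementations often add smoothing and normalization on top of this.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Three already-preprocessed toy tweets.
docs = [
    ["goal", "arsenal", "win"],
    ["arsenal", "win", "match"],
    ["rain", "lagos", "today"],
]
vectors = tfidf_vectors(docs)
```

In this toy corpus "goal" appears in only one tweet while "arsenal" appears in two, so "goal" receives the higher weight in the first vector.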

MinHash Hashing

MinHash is a technique for quickly estimating how similar two sets are. We use MinHash here to transform each tweet into a hashed signature, using four different permutation counts (64, 128, 256, and 512) for the hash function.
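
As an illustration of the idea, the sketch below builds a MinHash signature from a tweet's token set and estimates Jaccard similarity as the fraction of matching signature positions. It is a standard-library approximation, assuming simulated permutations of the form (a·h + b) mod p and non-empty token lists; the project itself may rely on a library, and every name here is hypothetical.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus


def token_hash(token):
    """Stable 32-bit integer hash of a token."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def make_hash_params(num_perm, seed=1):
    """One (a, b) pair per simulated permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_perm)]


def minhash_signature(tokens, params):
    """Minimum hashed value of the token set under each permutation."""
    hashes = [token_hash(t) for t in set(tokens)]
    return [min((a * h + b) % PRIME for h in hashes) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two tweets agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


tweet_a = ["arsenal", "goal", "win", "match"]
tweet_b = ["arsenal", "goal", "win", "derby"]

# The four signature lengths mentioned in the text.
estimates = {}
for num_perm in (64, 128, 256, 512):
    params = make_hash_params(num_perm)
    estimates[num_perm] = estimate_jaccard(
        minhash_signature(tweet_a, params),
        minhash_signature(tweet_b, params),
    )
```

The true Jaccard similarity of the two toy tweets is 3/5; longer signatures tend to give estimates closer to that value, at the cost of more computation per tweet.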

Locality Sensitive Hashing

Locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability.

  • We passed the hashed dataset to an LSH algorithm to group similar tweets into the same buckets.
  • We filtered the buckets to remove noise.
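
One common way to bucket MinHash signatures is the banding technique: each signature is split into bands, and two tweets become a candidate pair if any band matches exactly. The source does not say which LSH scheme or noise filter the project uses, so the sketch below is an assumption throughout, including the rule that a bucket holding a single tweet counts as noise. The signatures are hand-written toy values.

```python
from collections import defaultdict
from itertools import combinations


def lsh_buckets(signatures, bands):
    """Split each signature into equal bands and bucket ids by band content."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(doc_id)
    return buckets


def candidate_pairs(buckets):
    """Pairs of ids that share at least one bucket."""
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy signatures of length 6 (hypothetical values, not real MinHash output).
signatures = {
    "t1": [3, 7, 1, 9, 4, 2],
    "t2": [3, 7, 5, 8, 4, 2],
    "t3": [6, 0, 5, 1, 7, 7],
}
buckets = lsh_buckets(signatures, bands=3)

# Assumed noise filter: drop buckets that contain a single tweet.
dense_buckets = {key: ids for key, ids in buckets.items() if len(ids) >= 2}
pairs = candidate_pairs(dense_buckets)
```

Here `t1` and `t2` agree on the first and third bands, so `pairs` contains only `("t1", "t2")`, while `t3` never shares a bucket and is filtered out.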

Dimensionality Reduction with PCA

Because the input data is high-dimensional, we use PCA to reduce its dimensionality so that the subsequent clustering step runs faster and consumes less memory.
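
A minimal sketch of PCA via singular value decomposition follows. It assumes NumPy is available and that the data fits in memory as a dense array; the function name `pca_reduce` is hypothetical, and the number of components the project keeps is not stated in the source.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Toy data: three points lying on one line in 2-D, reduced to 1-D.
X = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
Z = pca_reduce(X, n_components=1)
```

Because the toy points lie on a single line, one component preserves all of their variance; real tweet vectors would lose some information in exchange for the smaller representation.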

DBSCAN Clustering

Density-based spatial clustering of applications with noise (DBSCAN) is a well-known data clustering algorithm that is commonly used in data mining and machine learning.

  • We passed the result of the LSH step to a DBSCAN clustering algorithm to group similar results together automatically.
  • We removed noise from the clusters.
  • We selected the largest cluster and extracted its most frequently occurring word.
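
The three steps above can be sketched as follows. This is a compact, quadratic-time DBSCAN written with the standard library for illustration only; a real pipeline would more likely call a library implementation. The `eps` and `min_samples` values, the toy points, and the function names are all assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or NOISE (-1)."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # Includes point i itself, as in the usual DBSCAN definition.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


def top_word(tweets, labels):
    """Most frequent word in the largest non-noise cluster."""
    sizes = Counter(label for label in labels if label != NOISE)
    biggest = sizes.most_common(1)[0][0]
    words = Counter(
        word
        for tokens, label in zip(tweets, labels) if label == biggest
        for word in tokens
    )
    return words.most_common(1)[0][0]


# Toy 2-D points standing in for reduced tweet vectors.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
tweets = [["arsenal", "goal"], ["arsenal", "win"], ["arsenal", "match"],
          ["rain", "lagos"], ["rain", "today"], ["traffic"]]

labels = dbscan(points, eps=0.5, min_samples=2)
trending = top_word(tweets, labels)
```

On this toy input the labels are `[0, 0, 0, 1, 1, -1]`: the isolated last point is marked as noise, the first cluster is the largest, and `trending` is `"arsenal"`.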

Technologies Used

Python3

Density-based spatial clustering of applications with noise (DBSCAN)

MinHash

Locality Sensitive Hashing

Repository

https://github.com/princesegzy01/bitLsh-TrendingTopic
