Phishing Website Detection

Gangesh Basker

Hosur, Tamil Nadu


Project status: Published/In Market

oneAPI, Artificial Intelligence

Intel Technologies
oneAPI


Overview / Usage

Much of what we do is managed online in today's increasingly technological world, and this surge in online engagement has been accompanied by a sharp rise in cybercrime. Phishing is among the most harmful of these attacks: it is a critical security problem that causes significant losses for businesses and customers. Phishing attempts are becoming more common because identification techniques and protective measures remain inadequate.

Internet fraud is now widespread, and some of the websites we access try to acquire our personal information or data illegitimately. Because there are so many websites, the risk of having a device compromised or of being scammed is substantial.

Phishing websites are known for criminally acquiring passwords, private information, credit card details, locations, and other sensitive data. Often a website appears to replicate the original site but turns out to be a phishing site.

This creates a pressing need for an effective solution that detects whether the websites we access are safe to use and reports the result in a timely way.

This project presents an enhanced approach for phishing site detection using machine learning. Six ML algorithms are compared: Decision Tree, Random Forest, Multilayer Perceptrons, XGBoost, Autoencoder Neural Networks, and Support Vector Machines. The proposed methodology is a multi-layered framework that combines website content analysis, URL-based features, and behavioural analysis to identify potential phishing sites accurately.

Methodology / Approach

1️⃣ Pre-install all the required libraries :

  1. TensorFlow
  2. NumPy
  3. pandas
  4. scikit-learn

2️⃣ Understand the dataset :

The phishing URLs are collected from the open-source platform PhishTank.

This service provides a set of phishing URLs in multiple formats, such as CSV and JSON, that is updated hourly.

From this dataset, 5000 random phishing URLs are collected to train the machine learning models.

The legitimate URLs are obtained from the open datasets of the University of New Brunswick. This dataset contains benign, spam, phishing, malware, and defacement URLs. Of these types, only the benign URL dataset is used in this project.

From this dataset, 5000 random legitimate URLs are collected to train the ML models.
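The sampling and labelling described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `build_balanced_dataset`, the column names `url` and `label`, and the tiny synthetic URL pools are all assumptions made for the example.

```python
import pandas as pd


def build_balanced_dataset(phishing_urls, legit_urls, n_per_class=5000, seed=42):
    """Sample n_per_class URLs from each source and attach a 0/1 label."""
    phish = pd.DataFrame({"url": phishing_urls}).sample(n=n_per_class, random_state=seed)
    legit = pd.DataFrame({"url": legit_urls}).sample(n=n_per_class, random_state=seed)
    phish["label"] = 1  # phishing
    legit["label"] = 0  # legitimate
    # Concatenate, then shuffle so the two classes are interleaved.
    combined = pd.concat([phish, legit], ignore_index=True)
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)


# Tiny synthetic stand-ins for the real PhishTank and benign URL lists.
phishing_pool = [f"http://phish{i}.example/login" for i in range(20)]
legit_pool = [f"https://site{i}.example/" for i in range(20)]

dataset = build_balanced_dataset(phishing_pool, legit_pool, n_per_class=5)
print(dataset["label"].value_counts().to_dict())
```

With the real data, `n_per_class` would stay at its default of 5000, giving the 10,000-URL dataset used in the later steps.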

3️⃣ Feature Extraction :

The following categories of features are extracted from the URL data:

  1. Address Bar-based Features:
  • In this category, 9 features are extracted.
  2. Domain-based Features:
  • In this category, 4 features are extracted.
  3. HTML & JavaScript-based Features:
  • In this category, 4 features are extracted.

Altogether, 17 features (9 + 4 + 4) are extracted for each URL in the 10,000-URL dataset and stored in the '5.urldata.csv' file in the Data Files folder.
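To give a sense of what address bar-based features can look like, here is a small sketch using only the Python standard library. The page does not list the project's nine features, so the seven below (and the names `address_bar_features`, `host_is_ip`, and `SHORTENERS`) are hypothetical examples of the category, not the project's actual feature set.

```python
import ipaddress
from urllib.parse import urlparse

# Illustrative, non-exhaustive list of URL-shortening hosts (an assumption).
SHORTENERS = {"bit.ly", "goo.gl", "tinyurl.com", "t.co", "ow.ly"}


def host_is_ip(host):
    """Return True if the host is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def address_bar_features(url):
    """Compute a few simple features from the URL string alone."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    after_scheme = url.split("://", 1)[-1]
    return {
        "have_ip": int(host_is_ip(host)),           # IP address used as host
        "have_at": int("@" in url),                 # '@' anywhere in the URL
        "url_length": len(url),                     # raw character count
        "url_depth": len([s for s in parsed.path.split("/") if s]),
        "redirection": int("//" in after_scheme),   # '//' beyond the scheme
        "tiny_url": int(host in SHORTENERS),        # known shortening service
        "prefix_suffix": int("-" in host),          # hyphen in the host name
    }


suspicious = address_bar_features("http://192.168.0.1/login//x@y")
ordinary = address_bar_features("https://secure-bank.example.com/a/b")
print(suspicious)
print(ordinary)
```

Each URL becomes one row of numeric values, which is the shape a CSV such as '5.urldata.csv' would hold once the domain-based and HTML/JavaScript-based features are added.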

4️⃣ Build and train the model :

Before training, the data is split 80-20 into 8000 training samples and 2000 testing samples. Because every URL in the dataset carries a label, this is a supervised machine-learning task, and specifically a classification problem: each input URL is classified as phishing (1) or legitimate (0).
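An 80-20 split of this kind can be sketched with scikit-learn's `train_test_split`. The feature matrix here is random synthetic data standing in for the 17 extracted features, and both `random_state=12` and `stratify=y` are assumptions for the example; the page does not say whether the project's split was stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10_000, 17))  # synthetic stand-in for 17 features
y = np.repeat([0, 1], 5_000)               # 5000 legitimate, 5000 phishing

# test_size=0.2 yields the 8000 / 2000 split; stratify keeps classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12, stratify=y
)
print(X_train.shape, X_test.shape)
```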

The supervised machine learning (classification) models trained on the dataset in this project are:

  1. Decision Tree
  2. Random Forest
  3. Multilayer Perceptrons
  4. XGBoost
  5. Autoencoder Neural Network
  6. Support Vector Machines
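A comparison across several classifiers can be run as a simple loop that records training and testing accuracy for each. The sketch below covers only the four models available directly in scikit-learn; XGBoost and the autoencoder are left out to keep the example dependency-free. The data comes from `make_classification`, and every hyperparameter shown is an assumption rather than a setting taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with 17 features (stand-in only).
X, y = make_classification(n_samples=1000, n_features=17, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(max_depth=5, random_state=0),
    "Multilayer Perceptron": MLPClassifier(max_iter=500, random_state=0),
    "Support Vector Machine": SVC(kernel="rbf"),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (
        accuracy_score(y_train, model.predict(X_train)),  # training accuracy
        accuracy_score(y_test, model.predict(X_test)),    # testing accuracy
    )

# Print models ordered by testing accuracy, best first.
for name, (train_acc, test_acc) in sorted(
    results.items(), key=lambda kv: kv[1][1], reverse=True
):
    print(f"{name:24s} train={train_acc:.3f} test={test_acc:.3f}")
```

Reporting both numbers side by side makes overfitting visible: a model whose training accuracy is far above its testing accuracy is a weaker candidate than the raw training score suggests.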

5️⃣ Save the model :

Save the trained model and calculate its training and testing accuracy.

6️⃣ Testing and Training accuracy :

After 50 epochs, the XGBoost model reached 86.7% training accuracy and 85.8% testing accuracy. Of the models compared above, the XGBoost classifier performed best, so it is saved to the file 'XGBoostClassifier.pickle.dat'.
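Saving a trained classifier with `pickle` and loading it back might look like the sketch below. To avoid depending on the xgboost package, scikit-learn's `GradientBoostingClassifier` is swapped in as a stand-in for the XGBoost model; the synthetic data, the temporary directory, and the file name `model.pickle.dat` are likewise assumptions for the example.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Train a small stand-in model on synthetic data.
X, y = make_classification(n_samples=500, n_features=17, random_state=0)
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Serialize the fitted model to disk.
path = os.path.join(tempfile.mkdtemp(), "model.pickle.dat")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# Load it back and confirm it behaves identically.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

same = bool((restored.predict(X) == model.predict(X)).all())
print("round-trip predictions match:", same)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.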

Technologies Used

  1. TensorFlow

  2. NumPy

  3. pandas

  4. scikit-learn

  5. Jupyter Notebook

  6. Flutter

Documents and Presentations

Repository

https://github.com/gangeshbaskerr/Phishing-Website-Detection
