Water Quality Prediction (Intel OneAPI Online Hackathon)

Vinay Vaishnav

Jodhpur, Rajasthan

Using Intel's AI Analytics Toolkit, we set out to predict freshwater quality and achieved an F1 score of 0.81786 with a TabNet ensemble model, while addressing data challenges and optimizing performance for sustainable water assessment.

Project status: Published/In Market

oneAPI, Artificial Intelligence

Intel Technologies
DevCloud, oneAPI, Intel Python, Intel CPU, Intel Integrated Graphics

Overview / Usage

Freshwater is a precious and limited resource, making up only about 3% of the Earth's total water volume. Access to clean and safe water is essential for both human life and the environment. With pollution and other factors increasingly degrading water quality, it is crucial to predict whether water is suitable for consumption. This project predicts water quality from a large dataset of features measured on water samples. The goal is to determine whether a given water sample is sustainable or not, addressing a critical concern for society.

Methodology / Approach

Data Analysis and Preprocessing:

  • The dataset consisted of over 59 lakh (5.9 million) data points and 22 feature columns, including parameters such as pH level, chemical contents, color, and turbidity.
  • Initial analysis revealed a high degree of missing values, requiring data cleaning and imputation.
  • Missing values were handled in one of three ways: dropping rows where only a negligible share of values was missing, filling gaps with the feature's mean value, or removing entries based on the feature's distribution.
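The first two strategies above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the function name `impute_missing`, the `drop_threshold` parameter, and the toy column values are assumptions, and the distribution-based removal step is left out because the source does not describe its rule.

```python
import numpy as np
import pandas as pd


def impute_missing(df: pd.DataFrame, drop_threshold: float = 0.01) -> pd.DataFrame:
    """Drop rows for sparsely-missing columns, then mean-fill numeric columns.

    drop_threshold is a hypothetical cut-off: columns whose missing share is
    at or below it are treated as "insignificantly" missing.
    """
    missing_frac = df.isna().mean()  # share of missing values per column
    sparse_cols = missing_frac[(missing_frac > 0) & (missing_frac <= drop_threshold)].index
    cleaned = df.copy()
    if len(sparse_cols) > 0:
        # Few rows are affected, so dropping them loses little data.
        cleaned = cleaned.dropna(subset=list(sparse_cols))
    # Remaining numeric gaps are filled with the column mean.
    numeric_cols = cleaned.select_dtypes(include="number").columns
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())
    return cleaned


# Toy data with made-up values, only to exercise the function.
demo = pd.DataFrame({
    "pH": [7.0, 6.5, np.nan, 8.0],
    "Turbidity": [1.0, np.nan, 3.0, np.nan],
})
result = impute_missing(demo, drop_threshold=0.25)
print(result)
```

In practice the means would be computed on the training split only and reused for validation and test data, so that no information leaks across splits.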

Technological Aspect: Intel AI Analytics Toolkit:

  • Given the dataset's size, the project utilized the Intel AI Analytics Toolkit, a collection of frameworks and tools designed to optimize data science and analytics on Intel architectures.
  • Key packages from the toolkit included the Intel Extension for Scikit-learn, Intel-optimized XGBoost, oneDNN (oneAPI Deep Neural Network Library), and the Intel Extension for PyTorch.
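As an illustration of how the scikit-learn acceleration is typically switched on, the sketch below uses `patch_sklearn()` from the `sklearnex` package (the Intel Extension for Scikit-learn). The helper name `enable_intel_sklearn` and the fallback behaviour are assumptions made for this example; the source does not show how the project enabled the extension.

```python
def enable_intel_sklearn() -> bool:
    """Try to enable the Intel Extension for Scikit-learn.

    Returns True if the extension was found and patched in, False if it is
    not installed (stock scikit-learn is then used unchanged).
    """
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        return False
    # Replaces supported scikit-learn estimators with accelerated versions.
    patch_sklearn()
    return True


# Call this before importing scikit-learn estimators, so that the imports
# that follow resolve to the patched implementations.
accelerated = enable_intel_sklearn()
print("Intel Extension for Scikit-learn enabled:", accelerated)
```

Because the patch is applied at import time, existing model code can stay as it is; only this call needs to be added at the top of the script or notebook.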

Machine Learning Models:

  • The dataset was divided into training, validation, and test sets (70:10:20 ratio).
  • A sequential approach to model selection was followed:
    • Logistic Regression: Provided a baseline understanding of feature influence.
    • Decision Tree: Captured non-linear relationships and identified important features.
    • XGBoost: Known for high predictive accuracy, handling missing data well.
    • Multilayer Perceptron (MLP): Utilized neural networks for intricate pattern learning.
    • TabNet: Combined neural networks with attention mechanisms for interpretability.
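The 70:10:20 split mentioned above can be sketched as a shuffled index split with NumPy. The function name `split_indices`, the seed, and the row count are assumptions for illustration; the source does not say whether the project's split was random, stratified, or done with a library helper.

```python
import numpy as np


def split_indices(n_rows: int, ratios=(0.7, 0.1, 0.2), seed: int = 0):
    """Return shuffled, non-overlapping train/validation/test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = round(ratios[0] * n_rows)
    n_val = round(ratios[1] * n_rows)
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The validation set is what each candidate model (from Logistic Regression through TabNet) would be compared on, with the test set held back for the final score.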

Evaluation Metrics:

F1 Score, the harmonic mean of precision and recall, was chosen as the primary evaluation metric due to its suitability for binary classification, its ability to balance precision and recall, and its robustness against high-dimensional data challenges.
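For reference, the metric can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below uses only the standard library; the function name `f1_score_binary` and the sample labels are made up for illustration and are not project data.

```python
def f1_score_binary(y_true, y_pred) -> float:
    """F1 for binary labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
f1 = f1_score_binary(y_true, y_pred)
print(f1)  # precision = recall = 0.75, so F1 = 0.75
```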

Results:

  • TabNet emerged as the top-performing model, achieving the highest F1 Score.
  • To further enhance performance, multiple TabNet models were ensembled, which improved predictive accuracy and generalization and produced the reported F1 score of 0.81786.
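One common way to ensemble several classifiers is soft voting, where the predicted probabilities of the individual models are averaged and then thresholded. The source does not state how the TabNet models were combined, so the sketch below is an assumption; the function name `soft_vote`, the 0.5 threshold, and the probability values are illustrative only.

```python
import numpy as np


def soft_vote(prob_matrix, threshold: float = 0.5):
    """Average positive-class probabilities across models, then threshold.

    prob_matrix has shape (n_models, n_samples).
    """
    probs = np.asarray(prob_matrix, dtype=float)
    mean_prob = probs.mean(axis=0)  # one averaged probability per sample
    labels = (mean_prob >= threshold).astype(int)
    return mean_prob, labels


# Made-up outputs of three models for three samples.
model_probs = [
    [0.90, 0.40, 0.20],
    [0.80, 0.60, 0.10],
    [0.70, 0.55, 0.30],
]
mean_prob, labels = soft_vote(model_probs)
print(mean_prob, labels)
```

Averaging tends to smooth out the errors of individual models, which is one plausible reason an ensemble generalizes better than any single member.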

Technologies Used

  • Python programming language for coding and data analysis.
  • Intel AI Analytics Toolkit, leveraging:
    • The Intel Extension for Scikit-learn and Intel-optimized XGBoost for faster execution on multi-core processors.
    • oneDNN to enhance PyTorch's speed.
  • Various machine learning libraries like scikit-learn, XGBoost, PyTorch, and TabNet.

Repository

https://github.com/VinayVaishnav/Water_Quality_Prediction
