DPFree Data Generation

Obtain a differentially private dataset mirroring the original while safeguarding individual privacy.

Project status: Under Development

oneAPI, Artificial Intelligence

Intel Technologies
oneAPI, Intel CPU, Intel Python

Overview / Usage

Data is the lifeblood of modern artificial intelligence. Getting the right data is both the most important and the most challenging part of building powerful AI. Collecting quality data from the real world is complicated, expensive and time-consuming. This is where synthetic data comes in.

Synthetic data is information that's artificially manufactured rather than generated by real-world events. Synthetic data is created algorithmically, and it is used as a stand-in for test datasets of production or operational data, to validate mathematical models and, increasingly, to train machine learning models.

The project runs on Intel hardware (an Intel i5 processor in a 2015 MacBook Pro), with scope for boosting performance using Intel software such as Intel® Optimization for TensorFlow*, Intel® Optimization for PyTorch*, and Intel® Extension for Scikit-learn*, among others.

Differential privacy (DP) is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. A differentially private synthetic dataset looks like the original dataset - it has the same schema and attempts to maintain properties of the original dataset (e.g., correlations between attributes) - but it provides a provable privacy guarantee for individuals in the original dataset.
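To make the idea of a provable guarantee concrete, here is a minimal sketch of the classic Laplace mechanism applied to a counting query. It is not taken from the project; the function names, the `epsilon` parameter, and the sample ages are illustrative assumptions. A smaller `epsilon` means more noise and stronger privacy.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def dp_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return an epsilon-differentially private count of matching records.

    Hypothetical helper for illustration only. A counting query has
    sensitivity 1 (adding or removing one person changes the count by at
    most 1), so Laplace noise with scale 1/epsilon is sufficient.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Example: a noisy count of patients older than 60 (made-up ages).
ages = [34, 71, 65, 22, 80, 59]
noisy_count = dp_count(ages, lambda age: age > 60, epsilon=1.0, rng=random.Random(0))
print("noisy count:", noisy_count)
```

Each released answer differs from the true count by random noise, so no single individual's presence can be confidently inferred from it. Generating a whole synthetic dataset applies the same principle to the training of a generative model rather than to one query.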

Using the Synthetic Observation Generation with Differential Privacy using GANs (SOGDPG) solution, we obtain shareable synthetic observations whose privacy risk is provably bounded.

Medical records, among the most sensitive data in any industry, can be shared freely when generated using SOGDPG, because all the private information is removed and replaced with indistinguishable, synthetically generated observations.

In India, the Covid-19 pandemic disrupted the nation with its unprecedented severity. Multiple attempts to build software solutions hit a roadblock because data was unavailable, owing to inefficient logistics, the sensitivity of the data, and a workforce prioritised for handling the pandemic. SOGDPG could address this: the data already available could be used to generate multiple synthetic, indistinguishable datasets for medical software solutions to work with.

Methodology / Approach

The project runs on an Intel i5 processor, with scope for boosting performance using Intel® Optimization for TensorFlow*, Intel® Optimization for PyTorch*, and Intel® Extension for Scikit-learn*.

Using the Synthetic Observation Generation with Differential Privacy using GANs (SOGDPG) solution, we obtain a differentially private synthetic dataset that has the same schema as the original and attempts to maintain its properties (e.g., correlations between attributes), while providing a provable privacy guarantee for individuals in the original dataset.
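The page does not say how the GAN is made differentially private. One common approach, assumed here purely for illustration, is DP-SGD: clip each example's gradient to a fixed norm, sum the clipped gradients, add Gaussian noise, and average before updating the discriminator. The sketch below shows only that aggregation step on plain Python lists; `max_norm`, `noise_multiplier`, and both function names are hypothetical.

```python
import math
import random


def clip_by_l2_norm(grad, max_norm: float):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, max_norm: float,
                        noise_multiplier: float, rng: random.Random):
    """Clip per-example gradients, add Gaussian noise to their sum, then average.

    Assumed DP-SGD-style step, not the project's confirmed method. The
    noise standard deviation is noise_multiplier * max_norm, matching the
    sensitivity of the clipped sum to any single example.
    """
    clipped = [clip_by_l2_norm(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [value / len(clipped) for value in noisy]


# Example: two per-example gradients for a two-parameter discriminator.
grads = [[3.0, 4.0], [0.3, 0.4]]
private_grad = dp_average_gradient(grads, max_norm=1.0,
                                   noise_multiplier=1.1, rng=random.Random(0))
print("noisy averaged gradient:", private_grad)
```

Clipping bounds how much any one record can influence an update, and the noise masks what influence remains. In such a setup the generator only ever learns through the privately trained discriminator, which is what would let the synthetic samples inherit the guarantee.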

Technologies Used

Intel® AI Analytics Toolkit

Intel machine learning frameworks: scikit-learn, TensorFlow

Intel® Distribution for Python* with highly optimized scikit-learn*

Intel® Extension for TensorFlow*
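As a rough illustration of how the scikit-learn acceleration listed above is usually switched on, the sketch below calls `patch_sklearn()` from the `sklearnex` package (Intel® Extension for Scikit-learn*) when it is installed and falls back to stock scikit-learn otherwise. Whether the project enables it this way is an assumption, and the `enable_intel_sklearn` helper name is hypothetical.

```python
def enable_intel_sklearn() -> bool:
    """Try to enable Intel Extension for Scikit-learn; return True on success.

    Hypothetical helper. patch_sklearn() swaps supported scikit-learn
    estimators for optimized versions, so it should be called before any
    scikit-learn estimators are imported.
    """
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        # Package not installed: continue with unpatched scikit-learn.
        return False
    patch_sklearn()
    return True


accelerated = enable_intel_sklearn()
print("Intel Extension for Scikit-learn active:", accelerated)
```

The guarded import keeps the same script runnable on machines without the Intel packages, which is convenient when a pipeline is developed on a laptop and later moved to optimized hardware.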

Documents and Presentations

Repository

https://github.com/sayashraaj/oneapi
