Multiple Overlapping Window Method for Learning Patterns from Big Data
Brijraj Singh
Roorkee, Uttarakhand
It becomes tedious to train a general model on a huge number of samples in a memory-restricted environment. This project addresses that issue and makes kernel-based methods usable on Big Data.
Project status: Published/In Market
Overview / Usage
The problem of missing values in data acquisition is becoming critical due to both hardware failure and human error. Radial Basis Function (RBF) based interpolation, which predicts missing values of the dependent variable through surface fitting, has been a viable solution for a long time. However, the method builds one equation in ‘N’ weight variables for each of the ‘N’ sampled data points, so the inherent computations demand a large primary memory when ‘N’ grows; on a memory-restricted setup it can therefore suffer from memory overflow. In this paper, we propose a novel data-decomposition-based, RBF-enabled surface fitting approach that builds and aggregates multiple small models in an overlapping manner, which works well even with a small primary memory and with minimal loss of generality. We also introduce two hyperparameters, window size and overlapping index, to tune the bias-variance tradeoff. The proposed approach is applied to ten real-world datasets of multiple dimensions, and the results are competitive with a single trained model and with Kernel Ridge Regression. We therefore believe this approach will rejuvenate RBF based surface fitting and help it attain better performance in the Big Data world too.
Methodology / Approach
The Multiple Overlapping Window Method (MOWM) follows a divide-and-conquer approach and operates on a small data chunk at a time. Each chunk is referred to as a window, and the number of data points in a window determines its size; the spatial extent of a window therefore depends on the local data density, so two windows of the same window size may differ spatially. In a preliminary stage, the method sorts the data points in increasing order of their Euclidean distance. The first ω*N data points, where ω is the windowing index and N is the total number of data points, form the first window, and a model is prepared for it. The window is then slid by the sliding index and another model is prepared; this process is repeated until all data points are covered. The models collected in this way are aggregated to produce the global model. Window size and sliding index are taken as hyperparameters and are tuned for each dataset: both are data-dependent and are chosen on the basis of affordability and the variance in the data (Theorem 1). A bigger data window generalizes better and so provides a better estimate, but resource limitations require keeping it small, under an affordable loss of generalization. The general behavior of the method can be seen in Figure 2 (refer paper) and the analytical aspect in Figure 3 (refer paper), where the initial window has size ω, variance v1, and mean μ1; after sliding by l1 and l2, the new windows have variances v2, v3 and means μ2, μ3. The method is divided into four phases: Phase-A covers pre-processing, Phase-B covers the construction of the data windows and their conversion to an RDD for parallelism, Phase-C covers model creation, and Phase-D tests the outcome.
A pictorial view of the proposed method can be seen in Figure 2, which shows the functioning of the algorithm using one or two independent variables and one dependent variable.
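The windowed fitting described above can be sketched in plain NumPy (omitting the Spark/RDD parallelism of Phase-B). This is a hedged sketch, not the paper's implementation: the reference point for the Euclidean-distance sort (the data centroid) and the aggregation rule (routing each query to the model whose window lies nearest) are assumptions made here for illustration.

```python
import numpy as np

def fit_rbf(X, y, eps=1.0):
    # Small per-window multiquadric RBF fit: one weight per window point.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.linalg.solve(np.sqrt(1.0 + (eps * d) ** 2), y)

def predict_rbf(Xw, w, Xq, eps=1.0):
    d = np.linalg.norm(Xq[:, None, :] - Xw[None, :, :], axis=-1)
    return np.sqrt(1.0 + (eps * d) ** 2) @ w

def mowm_fit(X, y, win=60, slide=30):
    # Phase A: sort points by Euclidean distance (from the centroid, assumed here).
    order = np.argsort(np.linalg.norm(X - X.mean(axis=0), axis=1))
    X, y = X[order], y[order]
    # Phases B/C: slide an overlapping window (slide < win) and fit one
    # small model per window, so only win x win matrices are ever built.
    models, start = [], 0
    while True:
        stop = min(start + win, len(X))
        models.append((X[start:stop], fit_rbf(X[start:stop], y[start:stop])))
        if stop == len(X):
            break
        start += slide
    return models

def mowm_predict(models, Xq):
    # Phase D (one plausible aggregation): answer each query with the model
    # whose window contains the point nearest to that query.
    dists = np.stack([
        np.linalg.norm(Xq[:, None, :] - Xw[None, :, :], axis=-1).min(axis=1)
        for Xw, _ in models
    ])                                  # shape: (num_models, num_queries)
    nearest = dists.argmin(axis=0)
    out = np.empty(len(Xq))
    for i, m in enumerate(nearest):
        Xw, w = models[m]
        out[i] = predict_rbf(Xw, w, Xq[i:i + 1])[0]
    return out

# Example: 200 samples of z = x0 * x1, fitted in overlapping windows of 60.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] * X[:, 1]
models = mowm_fit(X, y, win=60, slide=30)
pred = mowm_predict(models, X)
```

Because every training point falls inside at least one window and the nearest-window rule selects a model that interpolates it, predictions at the training points again reproduce `y`, while no matrix larger than win × win is ever solved. In the actual project the per-window fits are distributed as an RDD under pyspark rather than run in a Python loop.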
Technologies Used
Programming Language : Python
Big-data Architecture : Spark (pyspark)
Kernel function : Radial Basis Function with Multiquadric kernel