Predicting MOOC Dropout Rate



Project status: Concept

Artificial Intelligence

Overview / Usage

This project is designed based on the setting of KDD Cup 2015.
The competition participants need to predict whether a user will drop a course within the next 10 days based on his or her prior activities. If a user U leaves no records for course C in the log during the next 10 days, it is defined as a dropout from course C.
The given data includes course information, the event log, enrollments, and the actual dropout labels. After extracting 73 features per enrollment from the given data, the highest accuracy on the testing set, 0.8802315, was attained using the Extreme Gradient Boosting (XGBoost) Classifier.
In this report, I explain the philosophy behind feature extraction, the selection of the classifiers, and the ways their performance was improved to achieve this result.

Methodology / Approach

To extract features, there are several perspectives to consider (a short sketch of the count and duration features follows the list):

  1. Event log counts in different domains
    a. Time of occurrence, e.g. “Monday” and “Hour 12”
    b. Event type, e.g. “access” and “discussion”
    c. Event source, e.g. “browser” and “server”
  2. Duration
    a. Between 2 consecutive event logs
    b. Between the first and the last event
    c. Between the last event and the end of the course
  3. Course dropout rate
  4. Derived features
    a. Trend slope
    b. Polynomial features
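The count and duration features (perspectives 1 and 2) can be sketched roughly as below. This is a minimal illustration only: it assumes the event log is loaded from a file such as log_train.csv with columns enrollment_id, time, source and event, which may differ slightly from the actual KDD Cup 2015 files, and the real feature set is much larger (73 features in total).

```python
import pandas as pd

# Minimal sketch of the count-based features (perspective 1).
# Assumed file and column names: log_train.csv with enrollment_id, time, source, event.
log = pd.read_csv("log_train.csv", parse_dates=["time"])

log["weekday"] = log["time"].dt.day_name()   # e.g. "Monday"
log["hour"] = log["time"].dt.hour            # e.g. 12

# Count event logs per enrollment in each domain, then join them into
# one feature table indexed by enrollment_id.
counts = [
    pd.crosstab(log["enrollment_id"], log["weekday"]).add_prefix("weekday_"),
    pd.crosstab(log["enrollment_id"], log["hour"]).add_prefix("hour_"),
    pd.crosstab(log["enrollment_id"], log["event"]).add_prefix("event_"),
    pd.crosstab(log["enrollment_id"], log["source"]).add_prefix("source_"),
]
features = pd.concat(counts, axis=1).fillna(0)

# One duration feature (perspective 2): days between the first and the
# last event of each enrollment.
features["active_span_days"] = log.groupby("enrollment_id")["time"].agg(
    lambda t: (t.max() - t.min()).days
)
```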

Technologies Used

Random Forest
One of the ensemble methods used is Random Forest.
Random Forest Classifier is an ensemble algorithm that follows the bagging technique, in which the results of multiple models are combined to obtain a more generalized result. It randomly selects a subset of features to decide the best split at each node of each decision tree. It is used to compare performance with the other bagging classifier.
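A minimal sketch of how such a Random Forest baseline could be evaluated; the hyperparameter values and the placeholder data are illustrative assumptions, not the settings actually used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data so the snippet runs standalone; in the project, X holds
# the 73 extracted features per enrollment and y the dropout labels.
rng = np.random.default_rng(0)
X = rng.random((1000, 73))
y = rng.integers(0, 2, 1000)

# max_features controls the random subset of features considered at each
# split, which is what decorrelates the individual trees.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                            n_jobs=-1, random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
print("Random Forest CV accuracy: %.4f +/- %.4f" % (scores.mean(), scores.std()))
```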
Boosting
Boosting is a sequential process in which each subsequent model attempts to correct the errors of the previous models. At each iteration a new model is fitted, and the final prediction is a weighted combination of all the models. As a result, boosting combines a number of weak learners into a strong learner, based on the observation that an individual model may not perform well on the entire dataset but can work well on some part of it. Each added model boosts the performance of the ensemble.
In this project, Gradient Boosting Classifier and Extreme Gradient Boosting Classifier are used.
Extreme Gradient Boosting Classifier is similar to Gradient Boosting Classifier but has a few advantages (a configuration sketch follows the list):

  1. Regularization to help reduce overfitting
  2. Performs much faster
  3. Extra randomization parameter is used to reduce correlation between the trees, resulting in better performance
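As an illustration of those points, a rough configuration sketch for the Extreme Gradient Boosting Classifier is shown below; the hyperparameter values and placeholder data are assumptions for demonstration, not the tuned values that produced the reported accuracy.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 73 extracted features and labels.
rng = np.random.default_rng(0)
X = rng.random((1000, 73))
y = rng.integers(0, 2, 1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# reg_lambda adds L2 regularization (advantage 1), and subsample /
# colsample_bytree provide the extra randomization (advantage 3).
xgb = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    n_jobs=-1,
)
xgb.fit(X_tr, y_tr)
print("Held-out accuracy:", xgb.score(X_te, y_te))
```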
Neural Network
A Multilayer Perceptron Classifier is also tried out to compare its performance with the other classifiers.
Bagging Based on the Best Classifier
Since bagging on top of a base classifier is time-consuming, I planned to apply a Bagging Classifier on top of the best classifier only.
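A sketch of what such a bagging ensemble could look like, assuming XGBoost turns out to be the best single classifier; the parameter values and placeholder data are illustrative only.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from xgboost import XGBClassifier

# Placeholder features and labels (73 features per enrollment in the project).
rng = np.random.default_rng(0)
X = rng.random((1000, 73))
y = rng.integers(0, 2, 1000)

# Bagging on top of the (assumed) best single classifier. In scikit-learn
# versions before 1.2 the first argument is named base_estimator instead.
bagged = BaggingClassifier(
    estimator=XGBClassifier(n_estimators=200, learning_rate=0.05, n_jobs=1),
    n_estimators=10,
    max_samples=0.8,
    n_jobs=-1,
    random_state=42,
)
bagged.fit(X, y)
```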
Selected Classifiers
Six classifiers are adopted for prediction. They are listed below with their libraries:
  1. Random Forest Classifier (sklearn.ensemble.RandomForestClassifier)
  2. Stochastic Gradient Descent Classifier (sklearn.linear_model.SGDClassifier)
  3. Gradient Boosting Classifier (sklearn.ensemble.GradientBoostingClassifier)
  4. Multilayer Perceptron Classifier (sklearn.neural_network.MLPClassifier)
  5. Extreme Gradient Boosting Classifier (xgboost)
  6. Bagging Classifier (sklearn.ensemble.BaggingClassifier)
All the classifiers are trained on Intel® AI DevCloud.
Most of the classifiers are tuned using scikit-learn's GridSearchCV, with 5-fold cross-validation.
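A rough sketch of that tuning setup is shown below; the parameter grid and placeholder data are illustrative examples, not the exact grid searched in the project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder features and labels standing in for the extracted data.
rng = np.random.default_rng(0)
X = rng.random((1000, 73))
y = rng.integers(0, 2, 1000)

# 5-fold cross-validated grid search over an illustrative parameter grid.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```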