UCSD/FICO 2010 data mining contest (1st place)


Ensemble Selection over a large library of WEKA classifiers.

Base models (ranked by performance on my test set)
Bagging+Boosting with RandomTree (minNumInstances per leaf needs to be between 1500 and 3500, with 3000+ iterations)
RandomSubspace + LinearRegression (SubSpaceSize: 50% - 90%, change random seed)
Stacking with fast learners, such as trees and NaiveBayes
AdditiveRegression with REPTree
RBFNetwork (large number of basis functions per class)
SVM (RBF with different gammas)
Many others, about 500 classifiers in total
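
As a rough illustration of how Ensemble Selection picks models from such a library, here is a minimal sketch of greedy forward selection with replacement. It is a simplified version and not the code used in the contest: the names `ensemble_selection`, `neg_mse`, and `library` are hypothetical, the metric is negative mean squared error chosen only for illustration, and the loop stops as soon as no model improves the averaged prediction.

```python
def neg_mse(labels, scores):
    """Illustrative metric (an assumption): negative mean squared error, higher is better."""
    return -sum((y - s) ** 2 for y, s in zip(labels, scores)) / len(labels)


def ensemble_selection(library, labels, metric=neg_mse, max_rounds=50):
    """Greedy forward selection with replacement.

    library: dict mapping a model name to its predictions on a hold-out set.
    Returns the list of selected model names (repeats allowed) and the best score.
    """
    chosen = []
    running_sum = [0.0] * len(labels)
    best_score = float("-inf")
    for _ in range(max_rounds):
        round_best = None
        for name, preds in library.items():
            # Score the ensemble average if this model were added once more.
            size = len(chosen) + 1
            averaged = [(s + p) / size for s, p in zip(running_sum, preds)]
            score = metric(labels, averaged)
            if round_best is None or score > round_best[0]:
                round_best = (score, name)
        # Stop when no candidate strictly improves the current ensemble.
        if round_best is None or round_best[0] <= best_score:
            break
        best_score, name = round_best
        chosen.append(name)
        running_sum = [s + p for s, p in zip(running_sum, library[name])]
    return chosen, best_score
```

Calling `ensemble_selection(library, labels)` with hold-out predictions from each base model returns the selected names; the final prediction is the plain average of the selected models, counting repeats.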

Two papers on Ensemble Selection that helped a lot:
Caruana, Rich, et al. "Ensemble selection from libraries of models." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
Caruana, Rich, Art Munson, and Alexandru Niculescu-Mizil. "Getting the most out of ensemble selection." Proceedings of the Sixth International Conference on Data Mining (ICDM '06). IEEE, 2006, pp. 828-833.

UCSD/FICO 2009 data mining contest (1st place)


Here is the procedure, including the methods we used, that achieved a lift score of 3.566 on the 'hard' problem.

WEKA 3.7 was employed as our modeling and programming tool.

Data preprocessing

We transformed the following nominal/string attributes into numeric attributes.

State1 (replaced by the number of occurrences of each nominal value across the training and testing sets, which is equivalent to re-weighting the states)

zip1 (replaced by the number of occurrences of each nominal value across the training and testing sets, which is equivalent to re-weighting the zip codes)

cusAttr1 (replaced by the number of occurrences of each nominal value across the training and testing sets; this effectively produced a new attribute, which we assume is the number of logins by some kind of 'ID' number)

cusAttr2 (replaced by the number of occurrences of each nominal value across the training and testing sets; this effectively produced a new attribute, which we assume is the number of logins by email)
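
The count transformation can be sketched in a few lines. This is only an illustration under the assumption that each value is replaced by its frequency over the training and testing sets combined; `count_encode` is a hypothetical helper name, not part of WEKA.

```python
from collections import Counter


def count_encode(train_values, test_values):
    """Replace each nominal value by its frequency over train and test combined."""
    counts = Counter(train_values)
    counts.update(test_values)
    train_encoded = [counts[v] for v in train_values]
    test_encoded = [counts[v] for v in test_values]
    return train_encoded, test_encoded


# Example with made-up state codes.
train_enc, test_enc = count_encode(["CA", "NY", "CA"], ["CA", "TX"])
```

The same function would be applied once per attribute (State1, zip1, cusAttr1, cusAttr2).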

Subsampling and Oversampling

We applied two sampling methods to the data set. Firstly, we used the SpreadSubsample sampling technique (WEKA implementation) with the following parameters:

weka.filters.supervised.instance.SpreadSubsample -M 2.0 -X 0.0 -S 1 -W

W -- Whether instance weights will be adjusted to maintain total weight per class.

M -- The maximum class distribution spread. (0 = no maximum spread, 1 = uniform distribution, 10 = allow at most a 10:1 ratio between the classes).

X -- The maximum count for any class value (0 = unlimited).

S -- Sets the random number seed for subsampling.

This parameter set resulted in a much smaller data set with only 7962 instances (2654 positives and 5308 negatives).
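
To make the effect of the spread parameter concrete, here is a minimal sketch of spread subsampling. It is not WEKA's implementation: `spread_subsample` is a hypothetical name, and the weight adjustment controlled by W is not modeled. Every class is capped at `max_spread` times the size of the smallest class.

```python
import random


def spread_subsample(rows, labels, max_spread=2.0, seed=1):
    """Randomly drop instances so no class exceeds max_spread times the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    smallest = min(len(idx) for idx in by_class.values())
    cap = int(max_spread * smallest)
    keep = []
    for idx in by_class.values():
        if len(idx) > cap:
            idx = rng.sample(idx, cap)  # subsample the over-represented class
        keep.extend(idx)
    keep.sort()  # preserve the original instance order
    return [rows[i] for i in keep], [labels[i] for i in keep]
```

With 2654 positives and a spread of 2.0, the cap on negatives is 2 x 2654 = 5308, which matches the counts reported above.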

Secondly, we applied the Synthetic Minority Oversampling TEchnique (SMOTE, WEKA implementation) to the new data set. The parameters we used are:

weka.filters.supervised.instance.SMOTE -C 0 -K 15 -P 100.0 -S 1

C -- The index of the class value to which SMOTE should be applied. Use a value of 0 to auto-detect the non-empty minority class.

K -- The number of nearest neighbors to use.

P -- The percentage of SMOTE instances to create.

S -- The seed used for random sampling.

This gave us a data set of 10616 instances, consisting of 5308 positives and 5308 negatives; 2654 of the positives were generated by SMOTE in the oversampling stage above. This is the data set (about 10% of the original data set) that we used to model the 'hard' problem.
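
The idea behind SMOTE can be sketched as follows. This is a simplified illustration, not WEKA's implementation: `smote` is a hypothetical function, it assumes purely numeric attributes and at least two minority instances, and it uses a brute-force neighbor search. Each synthetic instance is placed at a random point on the segment between a minority instance and one of its k nearest minority neighbors.

```python
import random


def smote(minority, k=15, percentage=100.0, seed=1):
    """Create synthetic minority instances by interpolating toward near neighbors."""
    rng = random.Random(seed)
    n_new = int(len(minority) * percentage / 100.0)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for j in range(n_new):
        base_index = j % len(minority)
        base = minority[base_index]
        others = [p for i, p in enumerate(minority) if i != base_index]
        # k nearest neighbors of the base point by squared Euclidean distance.
        neighbors = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic
```

With a percentage of 100, the number of synthetic instances equals the number of minority instances, so 2654 positives become 5308, consistent with the figures above.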


The RandomForest classifier implemented in WEKA 3.7 was used to build the model. A lift score of 3.566 was achieved with the following parameters:

weka.classifiers.trees.RandomForest -I 4750 -K 1 -S 1 (1)

Here K is the number of attributes used when constructing a random tree, I is the number of trees, and S is the random seed.


The following considerations are ranked by how much each one improved the lift score.

1 Data transformation in the preprocessing stage contributed the most improvement. (A lift score of 3.4+ could be achieved using J48, the WEKA decision tree implementation, on nothing but the data produced by the preprocessing.)

2 How much data to use. We believe that 10% of the data with a 1:1 class ratio is just a local optimum for our configuration. A better lift value might be achieved by examining more setups.

3 Whether instance weighting should be used in the SpreadSubsample stage.

4 The number of trees for the RandomForest classifier. We found that different numbers of trees, such as 4800 and 4775, could give the same lift score of 3.566. We recorded that with only 10 trees, and all other parameters the same as in (1), the lift score was about 3.4+.