Blog:2013-01-20: An Online Predictive Modeling Service request


This week, I was extremely annoyed at how inefficient the task of predictive modeling remains. This time around I needed to train a term labeler given ~100k labeled terms. The first hour involved coding and testing a simple script to featurize strings, based on some top-of-head features. After that came several modeling attempts. The first attempt was with the rpart R package. The experience reconfirmed that R is not user friendly. Fortunately, I had some sample code from a month earlier. Even after ten hours, though, it still had not returned a model, nor did it after I sampled the data down to 100 records. :-/

I then opted to use an SVM system, starting with SVMlight until I noticed that it is not licensed for use in a commercial setting, and so moved to LIBSVM. Drat, these systems require that categorical features be binarized. I started writing a simple Perl script to do this, but then decided to search for one that already does it. However, before I got there a colleague pointed me to the BigML service. So, someone else has finally created an online predictive modeling service. Sure enough, it created a decision tree. Surprisingly and unhappily, it does not yet support other classification models, such as SVMs, kNN, naive Bayes, or logistic regression. Nor does it produce a learning curve representation. The decision tree also, curiously, avoided the use of any of the categorical features.
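For readers who have not hit the binarization requirement before, here is a minimal sketch of what such a script does, in Python rather than the Perl I had started on. It is not the script I wrote. The function names and the example feature names (`suffix`, `shape`) are made up for illustration. Each distinct (column, value) pair gets its own 0/1 feature, and each record is written in the sparse `label index:value` text format that LIBSVM reads, with indices in ascending order.

```python
def build_index(rows, categorical_cols):
    """Assign a 1-based feature index to every (column, value) pair seen."""
    index = {}
    for row in rows:
        for col in categorical_cols:
            key = (col, row[col])
            if key not in index:
                index[key] = len(index) + 1
    return index


def to_libsvm_line(label, row, categorical_cols, index):
    """Render one record as a sparse LIBSVM-format line.

    Only the features that are "on" (value 1) are written; pairs that
    were never seen while building the index are silently skipped.
    """
    ids = sorted(
        index[(col, row[col])]
        for col in categorical_cols
        if (col, row[col]) in index
    )
    return " ".join([str(label)] + [f"{i}:1" for i in ids])
```

With two hypothetical records, `{"suffix": "ing", "shape": "lower"}` and `{"suffix": "ed", "shape": "lower"}`, the index holds three binary features, and the second record with label -1 is written as `-1 2:1 3:1`.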

This experience validated for me the need for a full-featured online predictive modeling service. Too bad that I did not persevere to keep the PredictionWorks predictive modeling service running after 2001. Knowing what I know now, such a service should ideally also enable a data miner to replicate the results in their favorite data mining package. It would, for example, report the steps required in R to create a similar model. Ideally the user interface could also guide the user on additional pre-processing steps. For example, it could have told me that the categorical features were likely too large to be used by a decision tree... Further, the online service could enable the creation of a new predictive modeling language, which I will leave for another day.
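As a rough illustration of the kind of pre-processing hint I have in mind, the sketch below counts the distinct values in each categorical column and flags the columns that exceed a limit. It assumes that "too large" means too many distinct values. The function name and the default threshold of 32 are arbitrary choices for the example, not something any particular tool enforces.

```python
from collections import defaultdict


def high_cardinality_columns(rows, categorical_cols, max_levels=32):
    """Return {column: distinct_value_count} for columns above max_levels.

    max_levels is an assumed, illustrative threshold; a real service
    would pick it based on the learner it is about to run.
    """
    seen = defaultdict(set)
    for row in rows:
        for col in categorical_cols:
            seen[col].add(row[col])
    return {
        col: len(values)
        for col, values in seen.items()
        if len(values) > max_levels
    }
```

A service could run a check like this before training and suggest binning, hashing, or dropping the flagged columns instead of quietly ignoring them.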

Updated Beliefs
