Greg Park


Struck by Kaggle

December 11, 2011

Recently, updating this blog got a lot harder than I anticipated. I moved to New York, started a new job, got a dog, and discovered Kaggle. What’s Kaggle? From the site:

Kaggle is an innovative solution for statistical/analytics outsourcing. We are the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems - the world’s best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.

Competing on Kaggle is easy: create an account, join a competition, and upload predictions as a .csv file. Feedback on submissions is instant, and a public leaderboard is constantly updated as participants submit. The available competitions provide a nice range of problems to tackle, including predicting insurance claims, mapping dark matter, and modeling Wikipedia edits.

Before I ever heard of Kaggle, back in early 2010, I participated in my first modeling/prediction contest: the now-defunct Analytics X competition. The goal was to predict the spatial distribution of homicides in Philadelphia. I spent many hours working on a single submission for that competition, and I remember being horrified by how poorly it ranked. Participants were ranked on a leaderboard by predictive accuracy, and I was abysmally low. My precious multilevel model was a total flop, and based on the crazy high accuracies on the rest of the leaderboard, my fellow competitors were using some kind of voodoo. Dismayed, I stopped working on the Analytics X competition and focused on finishing graduate school.

My Analytics X experience continued to haunt me. What happened? According to my statistics textbooks, I did everything right! So, I started poking around for clues to my (and my model’s) failure. I began following discussions at r/MachineLearning and Cross Validated, picked up Berk’s gentle Statistical Learning from a Regression Perspective (highly recommended for people coming from social sciences!), and struggled through a lot of The Elements of Statistical Learning. At some point, I stumbled upon a great presentation by Kaggle’s Jeremy Howard called Getting in shape for the sport of data science. Things started to come together. It wasn’t voodoo after all. Armed with a few new tricks, I was determined to give the prediction game another go.

Since then, I’ve entered two Kaggle competitions. In Photo Quality Prediction (which just ended), I built models to predict how people would rate photos. Right now, I’m in the middle of Don’t Get Kicked!, where I’m learning how to predict whether an auctioned car will turn out to be a lemon. It turns out that this is really fun. I’m very far from even placing in a competition, but I’m learning a lot and faring much better after adopting some techniques outside of the standard social science toolbox.

What’s really struck me about predictive modeling (or machine learning/statistical learning/data science) is the change in perspective from how I learned statistical analysis. Focusing on accurate predictions forces you to approach statistical modeling in a very different way, and this can be enlightening, even if hypothesis testing is your day job.

Take the Don’t Get Kicked! contest for example. If you handed me a ton of data about a car auction and asked me to predict the lemons, my first reaction (based on my social science training) would be to completely overthink the problem. I’d probably come up with a complex theory about which variables would be useful, dismiss a lot of variables as useless, hypothesize about interactions, and then fit the appropriate generalized linear model. And it would fail.

Now I know not to sweat the small stuff at first. Keep an open mind—don’t dismiss anything. Clean it up, but keep everything. Start simple and dump everything into a regularized regression model. Let the model learn what’s important. How’d it perform? Where do you make errors? Ok, now think a little more about the problem and how the data might be generated. Take a good look at the data. Don’t get too attached to any given model. Be critical of your approach and be prepared to trash it if you hit a wall.
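
To make that concrete, here is a minimal sketch of the “dump everything in and let regularization sort it out” starting point, written in Python with scikit-learn. The file name, the `IsBadBuy` target column, and the preprocessing choices are all hypothetical stand-ins for a Kaggle-style training set, not the contest’s actual setup.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical Kaggle-style training file: one row per auctioned car,
# with a binary "IsBadBuy" flag as the target (assumed schema).
train = pd.read_csv("training.csv")
y = train["IsBadBuy"]

# Keep everything: one-hot encode the categoricals (including missing levels)
# and fill remaining gaps rather than dropping variables up front.
X = pd.get_dummies(train.drop(columns=["IsBadBuy"]), dummy_na=True)
X = X.fillna(X.median())

# Start simple: an L2-regularized logistic regression over all columns,
# letting the penalty shrink whatever turns out not to matter.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l2", C=1.0, max_iter=1000))
model.fit(X, y)

# Look at where the model is most confident a car is a lemon,
# as a starting point for inspecting errors and rethinking features.
train["predicted_prob"] = model.predict_proba(X)[:, 1]
print(train.sort_values("predicted_prob", ascending=False).head())
```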

Many Kaggle competitors submit 50+ entries throughout a given contest. I’ve already submitted about 25 entries to the current competition, making small changes each time (edit: and have burned myself in the process!). Through these iterations, I feel like I’m getting a much better sense of what is happening in a dataset—how variables are related and how to model their interactions—than if I were just testing a model’s goodness-of-fit or whether a certain coefficient was different from zero. I think this is counterintuitive to those who assume that the world of machine learning and statistical prediction is all black boxes without much concern for the underlying processes that generated the data.
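
One habit that helps avoid burning submissions (and overfitting to the public leaderboard) is to score each small change locally with cross-validation before uploading anything. A minimal sketch, continuing the hypothetical `model`, `X`, and `y` from the snippet above:

```python
from sklearn.model_selection import cross_val_score

# Estimate each tweak locally before spending a submission: 5-fold
# cross-validated AUC gives a rough read on whether the change helped.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.4f} (+/- {scores.std():.4f})")
```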

My experience with Kaggle has left me wondering why predictive accuracy isn’t more important in social science. Sure, good prediction does not equal a good theory, but shouldn’t a good theory result in good predictions? Yet I can’t remember the last time I saw any kind of cross-validation in a paper. If there were an academic psychology journal that hosted Kaggle-style competitions between different theoretical camps, I’d read it every month. I’d even pay for it! Until then, I’ll continue reading No Free Hunch.

Postscript (2/15/15): About a year after writing this, I took a new job as a postdoc in an interdisciplinary project, working alongside some truly awesome people who actually do machine learning for a living. Long story short, we eventually published a couple papers with a very strong emphasis on predictive accuracy in mainstream psychology journals (one in JPSP and one in Psychological Science). When I originally wrote this blog post, I was pretty frustrated with the cultural divide between social science and predictive modeling (similar to the “two cultures” described by Leo Breiman here). Fast forward just a few years, and now we are seeing things like cross-validation and regularization in psych journals. Whoa! Things are changing.