The value of Kaggle to Data Scientists

by Guido Tapia

in software-engineering

November 5, 2014

Kaggle is an interesting company. It gives companies a way to access Data Scientists in a competition-like format and at a very low cost. The value proposition of Kaggle for companies is very clear; this article focuses on the flip side of that equation. What value does Kaggle give to the data scientists?

This blog post contains my own personal opinions; however, I have tried to be as balanced as possible and to present both the benefits and the disadvantages of the platform.

Disadvantages of Kaggle to the Data Scientist

Cheapens the Skill

Most Kaggle competitions award prizes in the order of 5k – 50k. A few competitions get a lot of media attention because they offer much higher prizes, but these are very rare. Given the slim chances of winning a Kaggle competition and the prize money involved, the monetary returns of Kaggle are negligible. In effect, highly educated and skilled people are providing an extremely valuable service to commercial organisations for free. This reeks of software development’s Open Source commercialisation strategy, which aims to destroy competitors by providing a free product and charging for other services (like support). In this case Kaggle’s competitors are the Data Scientists themselves, as they could be consulting for organisations directly instead of going through Kaggle. This is an interesting argument that could be the subject of its own blog post, so let’s put it aside for now.
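To put a rough number on the "negligible returns" point, here is a back-of-the-envelope sketch. The prize range (5k to 50k) is the one quoted above and the 40-200 hours of effort is my own guess from later in this post; the 1% chance of winning is purely a hypothetical assumption for illustration, not a measured figure.

```python
def expected_hourly_return(prize, win_probability, hours):
    """Expected prize money per hour of effort for a single competition."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    if not 0.0 <= win_probability <= 1.0:
        raise ValueError("win_probability must be between 0 and 1")
    return prize * win_probability / hours


# ASSUMPTION: a 1% chance of winning, chosen only for illustration.
ASSUMED_WIN_PROBABILITY = 0.01

# Most favourable combination: largest typical prize, fewest hours.
best_case = expected_hourly_return(50_000, ASSUMED_WIN_PROBABILITY, 40)
# Least favourable combination: smallest typical prize, most hours.
worst_case = expected_hourly_return(5_000, ASSUMED_WIN_PROBABILITY, 200)

print(f"best case:  {best_case:.2f} per hour")
print(f"worst case: {worst_case:.2f} per hour")
```

Under that assumed win probability, the expected return ranges from well under one unit of prize currency per hour to low double digits, which is a long way from consulting rates. Plug in your own estimate of your chances to see how the picture changes.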

Kaggle Encourages Costly Models over Fast Models

When competing, participants have no incentive to create robust, fast, bug-proof models. The only incentive is to produce the most accurate predictive model possible, disregarding all other factors. This is far removed from the real world, where accuracy is regularly traded away, mainly for the sake of performance.
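The difference in incentives can be sketched in a few lines. Everything in this example is hypothetical: the `Candidate` class, the model names, the accuracy and latency numbers and the 50 ms budget are made up to illustrate the trade-off, not taken from any real competition or project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float    # validation accuracy, higher is better
    latency_ms: float  # time taken per prediction


def pick_for_leaderboard(candidates):
    """Competition logic: accuracy is the only thing that counts."""
    return max(candidates, key=lambda c: c.accuracy)


def pick_for_production(candidates, latency_budget_ms):
    """Real-world logic: most accurate model that is still fast enough."""
    eligible = [c for c in candidates if c.latency_ms <= latency_budget_ms]
    if not eligible:
        raise ValueError("no candidate meets the latency budget")
    return max(eligible, key=lambda c: c.accuracy)


# HYPOTHETICAL numbers, for illustration only.
candidates = [
    Candidate("large_ensemble", accuracy=0.912, latency_ms=850.0),
    Candidate("single_tree_model", accuracy=0.905, latency_ms=12.0),
    Candidate("linear_model", accuracy=0.880, latency_ms=1.0),
]

print(pick_for_leaderboard(candidates).name)
print(pick_for_production(candidates, latency_budget_ms=50.0).name)
```

With these made-up numbers the leaderboard rewards the slow ensemble, while a production system with a latency budget would give up a little accuracy for a model that is many times faster.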

Kaggle Does not Teach Data Prep, Data Source Merging, Communicating with the Business, etc.

Competitions on Kaggle go straight for the last 5-10% of the job. They assume that all data sources have been merged, that management has decided on the best question to ask of the data, and that IT has prepared and cleaned the data. This again is very different from real-life projects and could be giving some Data Scientists, especially inexperienced ones, a false view of the industry.

Kaggle Competitions Take too Long to Compete In

I would guess that most top 50 competitors put around 40-200 hours into a single competition. This is a huge amount of time, so what does the competitor get out of it?

Benefits of Kaggle to the Data Scientist

Opportunities for Learning

Kaggle is the best source on the internet at the moment for budding data scientists to learn and hone their craft. I am confident in this statement, having seen many people start out on simple competitions and slowly progress over time into highly skilled data scientists. This great blog post from Triskelion demonstrates this clearly. This benefit cannot be overstated: data science is hard! You need to practice, and this is the place to do it.

Opportunities to Discuss and Ask Questions of Other Data Scientists

The Kaggle forums are a great place to ask questions and expand your knowledge. I regularly check these forums even if I’m not competing as they are a treasure trove of wonderful ideas and supportive and highly skilled individuals.

The Final 10%

The final 10% of a Data Science project is the machine learning / predictive analysis modelling. The other 90% consists of administrative, managerial, communications, development and business analysis tasks. These tasks are very important but, in all honesty, an experienced manager already has these skills. The technical skills needed in this 90% are also easy to find, as most experienced developers can merge data sources and clean datasets. It is in the final 10% that a great data scientist pays for himself. This is where Kaggle competitions sharpen your skills, which is exactly where you want to be focusing your training.

Try out Data Science

Something that you quickly learn from any Predictive Analytics project is the monotony of data science. It can be extremely painful and is definitely not suitable for everyone. Many skilled developers have the maths, stats and other skills for Data Science, but they may lack the patience and pedantry required to be successful in the field. Kaggle gives you the chance to try out the career; I’m sure many have decided it’s just not for them after a competition or two.

Promotional Opportunities

I am doubtful about how much value Kaggle actually provides in terms of promotion for the individual. I personally have never been approached for a project because of my Kaggle Master status or my Kaggle ranking. I have mentioned that I am a Kaggle Master at some meetings, but this generally gets ignored, mainly because most people outside the Data Science field do not know what Kaggle is. However, there may be some value there, and I’m sure that the top 10-20 Kagglers must get some promotional value from the platform.

TL;DR (Summary)

Kaggle may cheapen the data science skillset somewhat, providing huge business benefits at very low cost while paying most data scientists nothing. However, I like it and will continue to compete on my weekends and evenings, as I have found that the learning opportunities Kaggle provides are second to none.