If you are interested in machine learning and live on planet Earth, you have probably seen a lot of news articles about machine learning algorithms predicting the outcomes of the FIFA World Cup. Most of them were about big companies (from Google to Reuters) implementing complex systems (mostly deep learners) based on huge data sets, which is almost the best scenario for this kind of work. They have a big chunk of data and the computing power to process it with large neural networks. And the results were impressive.
But not all machine learning happens in that kind of scenario. Sometimes you have very sparse data, and your “datacenter” is just your laptop, right there in your lap. Those constraints will surely limit what can be done, but they shouldn’t rule out the possibility of using machine learning for your own benefit, or the benefit of others.
This article will show you a very simple example of that.
(Links to the actual code and data are at the end.)
The specific scenario was this:
I may know nothing about football or cooking, but I know a little bit about programming and machine learning, so I quickly devised a plan to escape this dire situation. Not to win the contest, but just to be sure I wouldn’t lose and have to cook.
It should be possible to predict the outcomes of the matches using simple machine learning algorithms and some basic public data about past matches, with a reasonable level of accuracy, or at least better than the other football-illiterate family members.
Wikipedia has entries for every World Cup since the beginning of time (1930), detailing each match played and its outcome. Using this I was able to build a dataset with this structure:
(year, team1, team2, outcome)
(year, team1, team2, outcome)
(year, team1, team2, outcome)
...
But most of the simpler algorithms can’t be fed team names. We must transform the names into numbers, and specifically numbers from which some kind of knowledge can be obtained: numbers that could be somewhat related to the match outcomes.
So, based on the full history of matches, I calculated some basic stats for each team, like number of matches played, % of matches won, number of cups won, average podium score, etc. Then I replaced “team1” and “team2” with the stats of both teams, obtaining something like this:
(year, team1_stat1, team1_stat2, ..., team2_stat1, team2_stat2, ..., outcome)
(year, team1_stat1, team1_stat2, ..., team2_stat1, team2_stat2, ..., outcome)
(year, team1_stat1, team1_stat2, ..., team2_stat1, team2_stat2, ..., outcome)
...
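To make the transformation concrete, here is a minimal sketch of how team names could be swapped for per-team stats. It is not the code from the actual predictor: the function names, the outcome encoding (1 if team1 won, 2 if team2 won, 0 for a tie), and the choice of only two stats (matches played, fraction won) are assumptions made for illustration, and the five matches are just a tiny sample.

```python
from collections import defaultdict

# Tiny illustrative sample in the (year, team1, team2, outcome) layout.
# Assumption: outcome is 1 if team1 won, 2 if team2 won, 0 for a tie.
matches = [
    (1930, "Uruguay", "Argentina", 1),
    (1950, "Uruguay", "Brazil", 1),
    (1978, "Argentina", "Netherlands", 1),
    (1990, "Argentina", "Brazil", 1),
    (2002, "Brazil", "Germany", 1),
]


def team_stats(matches):
    """Return {team: (matches_played, fraction_won)} over the full history."""
    played = defaultdict(int)
    won = defaultdict(int)
    for _year, team1, team2, outcome in matches:
        played[team1] += 1
        played[team2] += 1
        if outcome == 1:
            won[team1] += 1
        elif outcome == 2:
            won[team2] += 1
    return {team: (played[team], won[team] / played[team]) for team in played}


def to_feature_rows(matches, stats):
    """Replace each team name with that team's stats, keeping year and outcome."""
    rows = []
    for year, team1, team2, outcome in matches:
        rows.append((year, *stats[team1], *stats[team2], outcome))
    return rows


stats = team_stats(matches)
rows = to_feature_rows(matches, stats)
```

Each row now holds only numbers, so it can be fed to a simple learner. More stats (cups won, podium score) would just add more columns per team.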
Finally, I scaled the features (otherwise the year was over-influencing the results) and split the data into training and test sets, so I could watch for overfitting to the training set by keeping an eye on test set performance.
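A rough sketch of those two steps follows. The article does not say which scaling method or split ratio was used, so min-max scaling to [0, 1] and a 25% held-out test set are assumptions here, as are the function names and the sample feature rows.

```python
import random


def min_max_scale(rows):
    """Scale every column to [0, 1] so large values such as the year do not dominate."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column carries no information, so map it to 0.0.
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and hold out a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical feature rows: (year, team1 win fraction, team2 win fraction).
features = [(1930, 0.5, 0.2), (1970, 0.7, 0.4), (2010, 0.6, 0.9), (1990, 0.3, 0.3)]
scaled = min_max_scale(features)
train, test = train_test_split(scaled)
```

Without scaling, a year such as 1990 is thousands of times larger than a win fraction such as 0.3, which is how the year ends up over-influencing the results.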
Once I had the datasets ready, it was just a matter of creating a multilayer perceptron (the most common neural network) using the PyBrain library, and then training it with those sets.
I have some criticisms of PyBrain’s API (especially the data set bits), but it’s fairly easy to use. And a common laptop easily handles computations of this complexity; we don’t need a datacenter to run this training.
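To show how little machinery a multilayer perceptron needs, here is a dependency-free sketch of one. It is not PyBrain code and not the network from the actual predictor (with PyBrain you would typically reach for `buildNetwork`, `SupervisedDataSet` and `BackpropTrainer` instead). The class name `TinyMLP`, the single hidden layer of 3 sigmoid units, the learning rate, the epoch count and the toy samples are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """One hidden layer, sigmoid units, trained by stochastic gradient descent."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each hidden unit has one weight per input plus a trailing bias weight.
        self.hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        # The single output unit has one weight per hidden unit plus a bias.
        self.output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, inputs):
        hidden_out = [sigmoid(sum(w * x for w, x in zip(unit, inputs)) + unit[-1])
                      for unit in self.hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.output, hidden_out))
                      + self.output[-1])
        return hidden_out, out

    def predict(self, inputs):
        """Estimated probability that team1 wins."""
        return self.forward(inputs)[1]

    def train_step(self, inputs, target, rate):
        hidden_out, out = self.forward(inputs)
        # Gradient of the cross-entropy loss w.r.t. the output pre-activation.
        delta_out = out - target
        # Backpropagate to the hidden units before the output weights change.
        delta_hidden = [delta_out * self.output[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden_out)]
        for j, h in enumerate(hidden_out):
            self.output[j] -= rate * delta_out * h
        self.output[-1] -= rate * delta_out
        for j, unit in enumerate(self.hidden):
            for i, x in enumerate(inputs):
                unit[i] -= rate * delta_hidden[j] * x
            unit[-1] -= rate * delta_hidden[j]

    def fit(self, samples, epochs=2000, rate=0.5):
        for _ in range(epochs):
            for inputs, target in samples:
                self.train_step(inputs, target, rate)


def accuracy(net, samples):
    """Fraction of samples where the thresholded prediction matches the target."""
    hits = sum((net.predict(x) >= 0.5) == (y == 1) for x, y in samples)
    return hits / len(samples)


# Toy samples: ([team1 win fraction, team2 win fraction], 1 if team1 won else 0).
samples = [([0.8, 0.2], 1), ([0.7, 0.3], 1), ([0.6, 0.1], 1), ([0.9, 0.5], 1),
           ([0.2, 0.8], 0), ([0.3, 0.7], 0), ([0.1, 0.6], 0), ([0.5, 0.9], 0)]
net = TinyMLP(n_inputs=2, n_hidden=3)
net.fit(samples)
train_accuracy = accuracy(net, samples)
```

A network this small trains in well under a second in pure Python, which gives a feel for why a laptop is enough for this kind of problem.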
I had to make some compromises to get better results. For example, I decided to only learn and predict winner-loser matches, and not ties. By doing this I knew I would fail at predicting a small percentage of matches (the ties), but I got far better performance on the winner-loser matches, which yielded better overall prediction performance.
I experimented with several different network structures, activation functions, and feature sets, until I got an acceptable level of prediction accuracy: 75% of winner-loser matches predicted correctly (on both the training and test sets).
75% may sound impressive to some people, and very poor to others, so let me clarify two things:
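The trade-off can be sketched in a few lines. The outcome encoding (0 for a tie), the function names, and the 20% tie fraction in the example are assumptions for illustration, not figures from the actual dataset.

```python
def drop_ties(matches):
    """Keep only matches with a winner (assumes outcome 0 marks a tie)."""
    return [match for match in matches if match[3] != 0]


def overall_accuracy(decisive_accuracy, tie_fraction):
    """Accuracy over all matches when every tie is counted as a miss."""
    return decisive_accuracy * (1.0 - tie_fraction)


# Hypothetical rows in the (year, team1, team2, outcome) layout.
sample = [
    (2010, "A", "B", 1),
    (2010, "C", "D", 0),
    (2010, "E", "F", 2),
]
decisive = drop_ties(sample)

# Hypothetical numbers: 75% right on decisive matches, 20% of matches tied.
estimate = overall_accuracy(0.75, 0.20)
```

Every tie is a guaranteed miss, so the overall score is the decisive-match accuracy scaled down by the share of matches that have a winner.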
So 75% is more than halfway from the worst possible (50%) to the best possible (somewhere below 100%).
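The arithmetic behind that claim, as a small sketch. The 50% baseline is what random guessing scores on two-outcome matches; the 90% ceiling is a purely hypothetical number picked for illustration, since the true ceiling is unknown.

```python
def progress_toward_ceiling(accuracy, baseline, ceiling):
    """Fraction of the gap between baseline and ceiling that the model closes."""
    return (accuracy - baseline) / (ceiling - baseline)


# Against a perfect predictor, 75% is exactly halfway from 50% to 100%.
against_perfect = progress_toward_ceiling(0.75, 0.5, 1.0)

# Against a hypothetical 90% ceiling, the same 75% covers more of the gap.
against_ceiling = progress_toward_ceiling(0.75, 0.5, 0.9)
```

Any ceiling below 100% pushes the result past the halfway mark, which is the point being made.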
A few hours before the first World Cup match, I was informed that my predictions were incomplete: they lacked the goal counts! (I didn’t know I had to provide them; the rules weren’t clear enough.)
I had a classifier, but predicting goals is a task for a regressor. So I went for a quick solution: I calculated the most common result and used it for every match: 2 goals for the winner, 1 goal for the loser.
And so the World Cup began, and my neural network predicted the outcomes of the matches, in an epic quest to free me from the punishment of cooking. Remember, I only needed it to be better than the other family members who, like me, didn’t know a lot about football. There were some football fans participating too, and I didn’t expect to win against them.
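That shortcut could look something like the sketch below. The list of past scores is made up for illustration, and the function names are hypothetical; only the idea (pick the most frequent scoreline and attach it to whichever team the classifier picks) comes from the description above.

```python
from collections import Counter

# Hypothetical (winner_goals, loser_goals) pairs from past decisive matches.
past_scores = [(2, 1), (1, 0), (2, 1), (3, 1), (2, 0), (2, 1), (1, 0)]


def most_common_score(scores):
    """Return the scoreline that appears most often."""
    return Counter(scores).most_common(1)[0][0]


default_score = most_common_score(past_scores)


def predict_with_goals(team1, team2, team1_wins, score=default_score):
    """Attach the default scoreline to the classifier's winner/loser call."""
    winner_goals, loser_goals = score
    if team1_wins:
        return {team1: winner_goals, team2: loser_goals}
    return {team2: winner_goals, team1: loser_goals}


prediction = predict_with_goals("Brazil", "Croatia", team1_wins=True)
```

It is crude, but it turns a winner/loser classifier into something that outputs goal counts without training a second model.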
Simultaneously, just for research purposes, I registered and uploaded the predictions to an online contest (http://el-ega.com.ar), in which about 250 other people were participating (many of them football fans, I presumed).
The outcome was a surprise: I didn’t just avoid losing, I won the family contest (first place) and also the online contest. That simple, small neural network, trained with easily obtainable data from Wikipedia, outperformed more than 250 people.
As you can see, you don’t need huge amounts of data, nor giant data centers, to use machine learning to solve real problems with acceptable prediction performance. In this article the example was about football matches, but you could be using these techniques to solve work-related problems every day, maybe outperforming human experts, with tools that are within everybody’s reach.
Of course you are able to do more when you have the datacenter and big data, but you don’t need to wait for that to start changing the world with the beautiful combination of maths and programming that machine learning is.
So, tell us your story: which small problems have you solved?
If you want to hack on it or learn from it, the full code of the World Cup predictor is here:
Want to read more? Follow us on Twitter @machinalis