Industrial Machine Learning with Feature Forge

Your intuitions to scikit-learn in just 2 steps

Say you have a big pile of data and a set of intuitions about how to classify it. Say you are a Python developer. FeatureForge is an open source tool that bridges the gap between those intuitions and actual machine learning with scikit-learn, without boilerplate, and along the way it gives you a set of utilities that increase the robustness of your solution.

All you have to do is

Step 1:
Translate your intuitions into features
Step 2:
Evaluate them in the scikit-learn way

and how exactly those two things can be done with this tool is what I want to show here.

Feature Definition:

Your solution will strongly depend on your features, so you'd better make sure you have solid features instead of just features. So, how can you be sure that each feature is behaving as expected? In our experience, reasonable and promising features are sometimes discarded not because they provide no value, but because they were buggy, and those bugs made them look valueless. You probably won't be consuming your feature functions' output directly; in the best case you'll be looking at some plots and aggregations of the feature results, and nothing more. FeatureForge gives you a set of utilities that help you avoid the usual mistakes when writing features. With little effort you can improve your features from this:

def image_height(img):
    return img.get("heigth", 0) # yes, it's misspelled

def picture_posts_ratio(user):
    return float(user['pictures_nr']) / user['posts_nr']

to this:

from featureforge.feature import input_schema, output_schema

@input_schema({"heigth": int})
def image_height(img):
    return img.get("heigth", 0)

@input_schema({'posts_nr': int, 'pictures_nr': int})
@output_schema(float, lambda i: i >= 0 and i <= 1)
def picture_posts_ratio(user):
    return float(user['pictures_nr']) / user['posts_nr']

Both are (almost) real features with real bugs that we wrote, and in both cases the input/output schemas made it easy to identify what was going on.

  • With image_height, a typo in the word height meant we were going to end up with a feature that was constantly zero. The input schema let us see that not a single data point had the expected input, and that is how we discovered the bug.
  • With picture_posts_ratio we took for granted that each picture counted as a post, which we later realized wasn't the case; as a result, some users had a picture ratio higher than one, which was not what we wanted (see the short sketch below).
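
To make the second case concrete, here is a plain-Python sketch (the sample user is made up) showing how the original formula produces a value that the predicate we pass to @output_schema rejects:

# A hypothetical user whose pictures were not counted as posts.
user = {"pictures_nr": 6, "posts_nr": 4}

ratio = float(user['pictures_nr']) / user['posts_nr']  # the original, buggy formula

def output_ok(i):  # the predicate used in @output_schema above
    return i >= 0 and i <= 1

print(ratio)             # 1.5 -- not a ratio in [0, 1]
print(output_ok(ratio))  # False: the output schema flags this data point immediately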

Another nice thing about features written like this is that the provided BaseFeatureFixture mixin makes it easy to write unit tests, like this:

import unittest
from featureforge.validate import BaseFeatureFixture, APPROX, EQ
from featureforge.feature import make_feature

class TestPicturesRatio(unittest.TestCase, BaseFeatureFixture):
    feature = make_feature(picture_posts_ratio)
    fixtures = dict(
        test_simple=({"posts_nr": 4, "pictures_nr": 1},
                     APPROX, 0.2),
        test_no_pics=({"posts_nr": 3, "pictures_nr": 0},
                      EQ, 0.0),
        test_nothing=({"posts_nr": 0, "pictures_nr": 0},
                      EQ, 0.0),
    )

Which will:

  • ensure that the feature produces the expected output for each given input, and then, for free,
  • ensure that the given fixtures satisfy the input schema,
  • ensure that the generated outputs satisfy the output schema, and
  • ensure that when the feature is stressed with random valid inputs, it always returns values that satisfy the output schema (a minimal way to run all of this is sketched below).
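
Since TestPicturesRatio is a plain unittest.TestCase, nothing special is needed to run these checks; a minimal sketch, assuming the feature and the test class above live in the same module:

if __name__ == "__main__":
    # The standard unittest runner picks up the checks generated by the fixture.
    unittest.main()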

After fixing the bugs and testing, both features ended up like this:

from schema import And  # schema predicates; FeatureForge's validation builds on the "schema" package

@input_schema({"height": int})
def image_height(img):
    return img.get("height", 0)

@input_schema({'posts_nr': And(int, lambda i: i >= 0),
               'pictures_nr': And(int, lambda i: i >= 0)})
@output_schema(float, lambda i: i >= 0 and i <= 1)
def picture_posts_ratio(user):
    total = user['posts_nr'] + user['pictures_nr']
    if not total:
        return 0.0
    return float(user['pictures_nr']) / total
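
As a quick sanity check against the fixtures above, the fixed formula now counts each picture as part of the total:

user = {"posts_nr": 4, "pictures_nr": 1}        # the test_simple fixture
total = user['posts_nr'] + user['pictures_nr']  # 5
print(float(user['pictures_nr']) / total)       # 0.2, matching the APPROX expectation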

You can read more detailed descriptions, use cases and examples here: Feature Definition.

Feature Evaluation:

Given your features, the FeatureForge Vectorizer allows you to easily evaluate them against your data and deliver the result to scikit-learn, like this:

from featureforge.vectorizer import Vectorizer
from sklearn.pipeline import Pipeline

v = Vectorizer([some_feature, some_other_feature])
# ... build other steps, like classifiers/regressors ...
p = Pipeline([("vectorizer", v), ("step2", step2), ("step3", step3)])
p.fit(data)  # or p.fit(data, target) if your last step is supervised
# From here on you can use p's methods, depending on what you built.
# See the scikit-learn documentation for examples.
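
To make the placeholder steps concrete, here is a minimal sketch that uses a scikit-learn classifier as the final step (the classifier choice and the train_points/train_labels/test_points names are just illustrative):

from featureforge.vectorizer import Vectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ("vectorizer", Vectorizer([image_height, picture_posts_ratio])),
    ("classifier", DecisionTreeClassifier()),
])
# train_points is a list of data-point dicts, train_labels their target classes
pipeline.fit(train_points, train_labels)
predictions = pipeline.predict(test_points)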

The output of the Vectorizer is a numpy matrix, ready to be used as input for a machine learning algorithm, with one row per data point and columns for the features.
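
For instance, a minimal sketch outside of a Pipeline (assuming the Vectorizer exposes the usual fit/transform transformer interface, and using made-up data points):

v = Vectorizer([picture_posts_ratio])
points = [{"posts_nr": 4, "pictures_nr": 1},
          {"posts_nr": 3, "pictures_nr": 0}]
v.fit(points)
X = v.transform(points)
print(X.shape)  # one row per data point; a scalar feature contributes a single column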

Some other nice things about the FeatureForge Vectorizer:

  • It’s smart enough to work with both scalar and enumerated features
  • It works smoothly with consumable datasets (e.g. iterators that can only be traversed once)
  • It can be set to run in Tolerant Mode, sacrificing some performance so that you can still get experiment results even when some of your data is invalid or some of your features are faulty.

You can read more detailed descriptions, use cases and examples here: Vectorization.

Because of all this, we are proud to announce the first release of Feature Forge. We are already using it in our projects, but we are sure there is plenty of room to grow, and that's why we invite you to collaborate and submit your ideas. It's free software, and it's hosted on GitHub.
