Naming Conventions in Predictive Analytics and Machine Learning

In this article I discuss the importance of naming conventions in ML projects. By naming conventions I mainly mean labelling the features in a data set descriptively, so that each column name says what kind of feature it holds. The reason for doing this is speed of experimentation.

Naming Conventions

  • Categorical columns start with ‘c_’
  • Continuous (numerical) columns start with ‘n_’
  • Binary columns start with ‘b_’
  • Date columns start with ‘d_’
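
As a rough illustration of why the prefixes are useful, the sketch below groups the columns of a data set by the kind their prefix implies. It assumes the data lives in a pandas DataFrame; the `PREFIXES` mapping, the `columns_by_kind` helper, and the example column names are hypothetical and only mirror the list above.

```python
import pandas as pd

# Hypothetical mapping that mirrors the four conventions listed above.
PREFIXES = {"c_": "categorical", "n_": "numerical", "b_": "binary", "d_": "date"}


def columns_by_kind(df):
    """Group column names by the kind implied by their two-character prefix."""
    groups = {kind: [] for kind in PREFIXES.values()}
    for col in df.columns:
        kind = PREFIXES.get(col[:2])
        if kind is not None:  # columns without a known prefix are ignored
            groups[kind].append(col)
    return groups


# Made-up example data, one column per convention.
df = pd.DataFrame({
    "c_city": ["a", "b"],
    "n_age": [31.0, 45.0],
    "b_member": [1, 0],
    "d_signup": pd.to_datetime(["2014-01-01", "2014-02-01"]),
})
kinds = columns_by_kind(df)
```

With a lookup like this, any transformation can pick out the columns it applies to without a hand-maintained list.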

Examples of Benefits

Once your dataset is labelled clearly with these conventions, experimenting with features becomes very fast.

import functools
from sklearn.linear_model import LogisticRegression

# do_cv, one_hot_encode, contrasts, bin and engineer are this project's own helpers.
# Note that this bin shadows Python's built-in bin().
cv = functools.partial(do_cv, LogisticRegression(), n_folds=10, n_samples=10000)
cv(one_hot_encode(X), y) # One-hot encode all categorical features
cv(contrasts(X), y) # Do simple contrast coding on all categorical features
cv(bin(X, n_bins=100), y) # Split all continuous features into 100 bins
X = engineer(X, 'c_1(:)c_2') # Create a new categorical feature that combines two others
X = engineer(X, 'n_1(*)n_2') # Create a combination of two numericals (by multiplication)
X = engineer(X, 'n_1(lg)') # Create a log of feature 'n_1'
X = engineer(X, '(^2)') # Create a square feature for each numerical feature
X = engineer(X, '(lg)') # Create a log feature for each numerical feature
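
The article does not show how `engineer` is implemented. The following is only a minimal sketch of how such a helper could work, assuming X is a pandas DataFrame. The spec grammar is inferred from the five examples above, and the names of the generated columns (`_x_`, `_lg`, `_sq` suffixes) are hypothetical choices that keep the `n_` and `c_` prefixes intact.

```python
import re

import numpy as np
import pandas as pd

# Assumed grammar: optional left column, an operator in parentheses, optional right column.
SPEC = re.compile(r"^(?P<left>\w*)\((?P<op>[^()]+)\)(?P<right>\w*)$")


def engineer(X, spec):
    """Return a copy of X with the feature(s) described by spec added."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"unrecognised spec: {spec!r}")
    left, op, right = match["left"], match["op"], match["right"]
    X = X.copy()
    if op in ("*", ":"):
        if not left or not right:
            raise ValueError(f"operator {op!r} needs two column names")
        if op == "*":
            # Product of two numerical columns; the new name still starts with 'n_'.
            X[f"{left}_x_{right}"] = X[left] * X[right]
        else:
            # Combined categorical; the new name still starts with 'c_'.
            X[f"{left}_{right}"] = X[left].astype(str) + ":" + X[right].astype(str)
    elif op in ("lg", "^2"):
        # With no column named, apply the operator to every numerical column.
        targets = [left] if left else [c for c in X.columns if c.startswith("n_")]
        for col in targets:
            if op == "lg":
                X[f"{col}_lg"] = np.log(X[col])
            else:
                X[f"{col}_sq"] = X[col] ** 2
    else:
        raise ValueError(f"unknown operator: {op!r}")
    return X


# Made-up example data.
X = pd.DataFrame({"n_1": [1.0, 2.0], "n_2": [3.0, 4.0],
                  "c_1": ["a", "b"], "c_2": ["x", "y"]})
product = engineer(X, "n_1(*)n_2")
combined = engineer(X, "c_1(:)c_2")
squared = engineer(X, "(^2)")
```

Because the generated columns keep their prefixes, the output of one `engineer` call can be fed straight into the next.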

In a real world example this would look something like:

# remove and combinations are this project's own helpers (not itertools.combinations).
X = remove(X, dates=True)
for n1, n2 in combinations(X, group_size=2, numericals=True):
    X = engineer(X, n1 + '(*)' + n2)
for c1, c2 in combinations(X, group_size=2, categoricals=True):
    X = engineer(X, c1 + '(:)' + c2)
X = engineer(X, '(^2)')
X = engineer(X, '(lg)')
cv(X, y)
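
The `combinations` helper used above is not defined in the article. One way it might be written, assuming X is a pandas DataFrame and that the keyword flags simply select columns by prefix, is sketched here. The signature is copied from the calls above; everything else is an assumption.

```python
from itertools import combinations as iter_combinations

import pandas as pd


def combinations(X, group_size=2, numericals=False, categoricals=False):
    """Return all groups of group_size column names of the requested kinds."""
    prefixes = []
    if numericals:
        prefixes.append("n_")
    if categoricals:
        prefixes.append("c_")
    # str.startswith accepts a tuple; an empty tuple matches nothing.
    selected = [col for col in X.columns if col.startswith(tuple(prefixes))]
    # A list is returned so the pairs are fixed before the caller adds new columns.
    return list(iter_combinations(selected, group_size))


# Made-up example data.
X = pd.DataFrame({"n_1": [1.0], "n_2": [2.0], "n_3": [3.0], "c_1": ["a"]})
numeric_pairs = combinations(X, group_size=2, numericals=True)
categorical_pairs = combinations(X, group_size=2, categoricals=True)
```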

Summary

The DSL that results from a good naming convention leads to very clear code that relates directly to the data munging operations being done.  Another benefit is that once your ‘one_hot_encode’ method is written and tested, you can trust it in future projects (as long as they use the same naming conventions).
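
To make the reuse point concrete, here is a minimal sketch of what a prefix-driven `one_hot_encode` could look like. It assumes pandas and uses `pandas.get_dummies`; the article's own implementation is not shown, so this is an illustration rather than the original method.

```python
import pandas as pd


def one_hot_encode(X):
    """One-hot encode every column whose name starts with 'c_'."""
    categorical = [col for col in X.columns if col.startswith("c_")]
    # get_dummies replaces each listed column with one indicator column per value.
    return pd.get_dummies(X, columns=categorical)


# Made-up example data.
X = pd.DataFrame({"c_color": ["red", "blue", "red"], "n_size": [1.0, 2.0, 3.0]})
encoded = one_hot_encode(X)
```

The function never names a specific column, which is why it can be carried unchanged into any project that follows the same conventions.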

Aug 26, 2014
