PredictBench successfully predicts product classifications for one of the world’s largest ecommerce and FMCG companies

As any large FMCG (CPG) company knows, classifying products correctly is critical to good analytics capabilities. It is also clear to any global organisation that this is a surprisingly difficult task. Most regions use different classification schemes, and combining them on a global scale is non-trivial. The task is so difficult that many organisations simply ignore it, missing out on the potential insights of a global view of product sales.

The Otto Group is a German ecommerce company that sells a tremendous volume of goods, and it recently released a dataset to address this exact issue. Given details of over 200,000 products, the data scientist's job was to correctly distinguish between Otto's main product categories.

We used PredictBench to tackle this job and achieved excellent classification accuracy. In fact, the PredictBench team came within 0.021 points of the optimal solution, no mean feat, beating around 3,500 teams from around the world. Our final position in this challenge was 16th out of 3,514.

On this project we teamed up with American data scientist Walter Reade who brought invaluable experience and knowledge to the PredictBench team.

Working together with Walter, we were able to put together an ensemble of hundreds of models, including linear models, neural nets, deep convolutional nets, tree-based ensembles (random forests and gradient boosted trees) and many others. The huge scale of the final solution shows how complex this problem was and the skills that were required to achieve such strong results.
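To give a flavour of the general idea (this is an illustrative sketch, not the actual competition solution), ensembling often comes down to averaging the class probabilities produced by the individual models, weighted by how much each model is trusted:

```python
# A minimal sketch of one common ensembling approach: weighted averaging of
# each model's predicted class probabilities. The model names and weights
# below are purely illustrative.

def blend_predictions(predictions, weights):
    """Average per-class probabilities across models, weighted by `weights`."""
    total = sum(weights)
    n_classes = len(predictions[0])
    blended = [0.0] * n_classes
    for probs, w in zip(predictions, weights):
        for i, p in enumerate(probs):
            blended[i] += w * p / total
    return blended

# Three hypothetical models' probability estimates for one product (3 classes):
model_probs = [
    [0.7, 0.2, 0.1],   # e.g. a gradient boosted tree
    [0.6, 0.3, 0.1],   # e.g. a neural net
    [0.5, 0.3, 0.2],   # e.g. a linear model
]
blended = blend_predictions(model_probs, weights=[3, 2, 1])
# the blended probabilities still sum to 1 and favour the consensus class
```

Real competition ensembles layer much more on top (stacking, rank averaging, per-class weights), but weighted blending of probabilities is the basic building block.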

Working with the Otto data and teaming up with Walter was a great boon to the PredictBench team, and we hope to repeat the experience in the near future.

Machine Learning for the FMCG Industry

An examination of the potential benefits of using modern Machine Learning technologies in the FMCG vertical

Executive Summary

The unique characteristics of the FMCG industry make it an ideal candidate for Machine Learning and associated technologies. These characteristics include very large volumes of transactions and data, and a large number of data sources that influence projections. They mean that traditional analytics technologies struggle with the volume and complexity of the data, which is exactly where Machine Learning is best suited. Most segments of the industry are candidates for optimisation, including improving the effectiveness of marketing campaigns, increasing the performance of the sales team, optimising the supply chain and streamlining manufacturing. The FMCG industry has been relatively slow to adopt these cutting-edge technologies, which gives an early entrant an opportunity to strongly outperform its competitors.



FMCG Introduction

FMCG (Fast Moving Consumer Goods) refers to organisations that sell products in large quantities. These products are usually inexpensive, sold in large volumes and often have a short shelf life. Profits on individual items are very small, so large volumes are required for a viable business. These characteristics present many challenges and also many opportunities.

This paper investigates these challenges and opportunities in detail and focuses on the use of Machine Learning technologies to optimise processes to increase profits for FMCG companies.

Machine Learning Introduction

The following list should serve as a refresher when thinking about Machine Learning vs traditional analytics and business intelligence:

1. Unstructured Data

Modern Big Data technologies and advanced machine learning algorithms can analyse data in any format, such as images, videos, text, emails, social media messages, server logs, etc., whereas traditional analytics can only analyse structured data in databases.

2. Combine Data

Modern technologies allow us to quickly merge datasets and form rich data collections that combine internal company data with external public data sets. This allows the data scientist to enrich sales and marketing data with, for instance, government socio-demographic statistics. Traditional analytics is usually performed on data silos, and when data sets are combined this is usually done at huge expense by building data warehouses, which still usually hold only internal company data.

3. Future vs Past

Machine Learning is often called predictive analytics, as one of its major use cases is predicting the future. Advanced machine learning algorithms ingest all your data and find patterns that can then be used to make accurate inferences about the future. These predictions are qualified with an accuracy metric, so management can make intelligent decisions based on them. Traditional analytics rarely tries to infer future events, dealing only with explaining and visualising past events.

4. Answers vs Reports

Using the predictive power of machine learning, management can start asking smart questions from their data. Questions such as:

  • What is the optimal marketing campaign to increase market awareness for product X
  • How many of product Y should we produce to reduce oversupply next winter season
  • Which sales rep should we assign to our new customer to maximise potential profit

This is very different from existing business intelligence suites which usually deliver dry reports or charts which are very often misinterpreted.

5. Speed of delivery

Traditional analytics / business intelligence implementations can take years to complete. They are intrusively integrated into an organisation's IT and as such move very slowly. Modern machine learning technologies allow management to get answers from their data quickly and efficiently. A simple question can be answered in weeks, not years.

6. Machine analysis vs human interpretation

Machine Learning uses advanced computer algorithms to analyse very large quantities of data. This analysis is done impartially, free from the biases that are common in much manual analysis. The outputs from these algorithms are also easy to interpret and leave very little room for misrepresentation, making them objective and quantifiable tools for decision making.

Machine Learning in FMCG

The FMCG (Fast Moving Consumer Goods) industry is an ideal target for Predictive Analytics and Machine Learning. Several attributes of the industry make this so:

  • The massive volumes involved
  • Access to good quality sales data
  • Short shelf life
  • Current forecasting techniques are relatively inaccurate
  • Current marketing strategies are less than optimal
  • Current manufacturing practices are less than ideal
  • Current supply chain strategies are less than optimal
  • Consumer numbers are very large

We now explore each of these attributes in detail.

1. Large volumes / access to good quality sales data

The number of sales transactions available to modern FMCG organisations is huge. This data can usually be purchased from retailers and is of very high quality. It forms the backbone of any predictive model, as increasing sales should always be the primary objective of any predictive project. Most large FMCG companies also have very good systems in place that record data at every stage of a product's lifecycle, from manufacturing to delivery to marketing and sales. These systems usually contain very high quality data and require very little cleansing to be valuable.

Given the enormous volumes of transactions generated by FMCG companies, this data is usually very hard to analyse manually, as it overwhelms all but the bravest analysts. Currently many organisations have not gone beyond basic analysis at a highly aggregated level, for instance sales for the week or sales for a store. And where they do drill deeper into the data, this is usually done by senior analysts with years of experience (and biases), at huge cost.

2. Short shelf life

FMCG products usually have a short shelf life, meaning that the costs of oversupply and over-manufacture can be significant. Given also the large volumes of products, any optimisation of the oversupply (or undersupply) problem can result in a very large ROI. The over/under supply problem is again a perfect candidate for machine learning technologies.

3. Sales and marketing

If your goal is to increase sales then accurate sales forecasting is critical. With an accurate forecasting model you can create simulations that allow managers to do quality “what if” analysis. Currently sales forecasting is inaccurate and senior management lack confidence in the numbers. The ability to merge many data sources (sales, marketing, digital, demographics, weather, etc.) greatly improves the quality of sales forecasts compared with traditional predictions, which are typically made on isolated and aggregated sales figures. Once the sales data is merged with the marketing data we can also start making very accurate marketing predictions. Questions like:

  • Which product should we promote this month
  • What type of campaign will be most profitable for this product
  • What consumer segment should we target
  • How can we get value from our social media data and use current consumer sentiment to create timely marketing campaigns

4. Manufacturing and supply chain

Most large FMCG companies have mature ERP systems that hold a wealth of hidden value in their data. This data can be used to create models that can answer several critical questions.

  • How can we guarantee on time delivery
  • How can we shorten the time to manufacture a product
  • How can we increase the yield for a product
  • How can we minimise product returns / complaints


PredictBench is a product that enables you to get the most value from your data. It is quick and efficient and does not need to involve your IT department. You do not have to understand reporting, statistics or any other data analysis techniques. You simply tell us what questions you want answered and, using the latest Machine Learning technologies, we give you those answers. If you are interested in learning more please feel free to contact me.


Founded in 2002, PicNet has been a leading provider of IT services and solutions to Australian businesses. PicNet helps organisations use technology to increase productivity, reduce costs, minimise risks and grow strategically.

PredictBench set to go global

The official announcement of the Elevate 61 participants was released today. We are very proud to be included on this list. Our latest offering, PredictBench, has been recognised as innovative and exciting enough for Advance and KPMG to help us take it to the US!

This means we will be extremely busy in the coming weeks/months traveling to the US, meeting and presenting PredictBench to companies and potential partners.

Over the next few months PicNet will be showcasing PredictBench in Los Angeles, San Francisco and New York as well as in all major Australian cities.

This is a wonderful opportunity that will help companies around the world take advantage of our PredictBench solution that we have worked very hard to build and are extremely proud of.

What is PredictBench

PredictBench is a solution that helps organisations predict future business events with confidence, based on their own historical data and other influencing factors. It allows organisations to answer questions such as:

  • What marketing campaign will give me the greatest return on investment
  • How much of a certain product to produce to reduce oversupply whilst guaranteeing no undersupply
  • How can we measure the risk a customer represents

In the past these technologies have only been available to Silicon Valley research start-ups or corporate giants.  We bring this technology to all corporations and government entities in an affordable and efficient solution that aims to deliver real value for money.

For more information please visit the PredictBench page, watch the short video or download the flyer.

The Value of Data – A short guide to learn how to maximise the value of your data.

This is an excerpt from the white paper available here.

Over the last few years the data industry has been shaken to its core. We have new names, roles, technologies, products coming out on a daily basis. The term “Big Data” has been overused so much that it may be losing some of its meaning.

I meet people on a regular basis, and the message I receive over and over again is that it is overwhelming. I will try to address this concern in this paper.

The Purpose of Data

The sole purpose of data in an organisation is to support business decisions. Data can do this in several ways.

  • By communicating information about the past
  • By communicating information about the future
  • By recommending actions for future success

The first of these has long been addressed by traditional reporting and business intelligence tools so I will not spend too much time here. What I really want to address is the final 2 points:

  • Communicating information about the future
  • Recommending actions for future success

The Future

There are several ways that data can help us peek into the future of our organisation. The first and most traditional is the statistician: the professional who can look at the data and give you an inference about the future based on the data available. A statistician who is highly knowledgeable about the business can use their domain expertise and data experience to give extremely valuable insights into the future.

The second way of getting real future value from data is to use Predictive Analytics. Predictive Analytics is also known as Advanced Analytics, Prescriptive Analytics and Machine Learning but I suggest we just call it Predictive Analytics as it clearly summarises the objective of the technology.

Predictive Analytics

Predictive Analytics is about value. It is how you convert your data into real insights about the future. It is not a product, it is not a platform, and it is not Hadoop or any other vendor name. Predictive Analytics is solely the science of predicting the future from historical data.

Predictive Analytics is also not a person. This is an important point: Joe the statistician cannot handle a 2GB Excel file, and if you ask him to bring Facebook, Twitter and web-traffic data into his inferences he will probably have a nervous breakdown.

There is only so much data a human head can manage. Computers do not have this problem; they can handle vast amounts of data from a wide variety of sources. They also do not bring biases to the analysis, which has been a problem in the past.

How to “Predictive Analytics”

Until recently, implementing a Predictive Analytics project has been within the reach of only large and sophisticated companies. However, with the emergence of affordable cloud computing, in-memory analysis and sophisticated modelling tools, combined with the skills of computer programmers, data scientists and analysts, Predictive Analytics is now affordable, as a service, to most medium-sized enterprises.

The solution can be procured as a service: on/off, pay as you go, and only when you need it. No longer is a huge capital investment required; instead, understanding the need and the challenge, developing proofs of concept and analysing the outputs provide an effective and affordable introduction to the benefits of Predictive Analytics.

Predict What

Predictive Analytics can predict numerous variables supported by your historical data. The following are some examples:

  • Potential success of a marketing campaign
  • How best to segment customers
  • What marketing mediums have the best ROI
  • When will my machine fail
  • Why did my machine fail

As long as we have recorded our actions in the past, and at a later date recorded the results, we can learn from that data and make predictions about the future.

Predictions can be real time or on a weekly/monthly/quarterly basis; it all depends on your needs.
Getting Started

There are several ways to get started. You can recruit your very own Data Scientist: not an easy task considering the high specialisation of these professionals, but it is what a lot of companies are doing.

You could also use a service provider. A good service provider will have a team of IT and data people including Data Scientists that have experience in doing these types of projects for customers in your industry.

At PicNet we always recommend that our customers start with a proof of concept. Some of these Predictive Analytics projects can take a long time to implement and integrate back into the processes of the organisation so it’s always best to take small bites and see if there is value to be had. We usually go for a 2-4 week project which usually goes something like this:

  • We get potential questions about the future that management would like answered
  • We audit all data and its quality in the organisation
  • We prepare and clean this data
  • We get external data if it can help the prediction (census, web traffic, Facebook, Twitter, weather, etc.)
  • We build a simple predictive model, perhaps on a subset of the data
  • We provide a report with predictions for the next period (2 months for instance)

This report can then be used by the business to test the value and accuracy of the predictions. If the business approves, we then push the system into production. This may mean weekly, daily or real-time predictions, as the business requires. The length of these production implementations varies greatly depending on the systems that need to be integrated with, among many other factors.
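To make the "simple predictive model" step concrete: a first POC model can be as humble as a moving-average baseline. It gives the business a number to react to and a benchmark that any richer model must beat. The sales figures below are invented for illustration:

```python
# A moving-average baseline forecast: the simplest credible "predictive model"
# for a proof of concept. Each future period is forecast as the mean of the
# last `window` observed (or previously forecast) values.

def moving_average_forecast(history, window=3, periods=2):
    """Forecast the next `periods` values as rolling means of the last `window`."""
    series = list(history)
    forecasts = []
    for _ in range(periods):
        forecast = sum(series[-window:]) / window
        forecasts.append(forecast)
        series.append(forecast)   # feed the forecast back for multi-step ahead
    return forecasts

# Invented weekly sales figures for one product:
weekly_sales = [120, 135, 128, 140, 150, 145]
print(moving_average_forecast(weekly_sales, window=3, periods=2))
```

A real POC model would fold in the external data mentioned above (census, weather, web traffic), but even this baseline gives the "report with predictions for the next period" that the business can sanity-check.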


As a manager charged with curating your organisation's data resources, you should forget vendors, platforms and NoSQL databases; forget about the 3 Vs (or is it 4?) of big data. Your only concern should be the one V that matters: Value. And if you want to get value from your data, then consider Predictive Analytics.

For any additional information please feel free to contact me on my details below.


Vowpal Wabbit for Windows and Binary for Win x64

Getting VW working on Windows is a real pain. Even though I had the whole environment set up as described in the readme, it still took me a good couple of hours to build.

So with absolutely no guarantee or support options here is my built version of vw.exe version 7.7.0.  This was built on a Windows 7 x64 box and I have only tested on this one box so use at your own risk!!

If you were after the executable only then there is no need to continue reading, the rest is about python.

So I started playing around with vw.exe and quickly realised that the command line is a terrible place to experiment with a machine learning algorithm. So I started looking for Python wrappers and found this, which is a nice wrapper but does not work on Windows. So I hacked it up a little (with no permission, sorry Joseph Reisinger) and have a Windows-friendly version with updated command line options here.

So how do you use the python wrapper?

First we need to convert your data into VW input format. I use my pandas extensions helper method: _df_to_vw

You will be able to turn this into a generic converter very easily; in fact, there are already plenty around.
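To make the format concrete, here is a tiny standalone sketch of what such a converter produces. The function name `to_vw_line` and its `namespace` argument are my own illustration (not the wrapper's API), and it works on plain dicts rather than a pandas DataFrame:

```python
# Render one example in the VW input format: "label |namespace name:value ..."
# VW feature names must not contain spaces, colons or pipes, so we sanitise them.

def to_vw_line(label, features, namespace='f'):
    """Convert a label and a {name: numeric value} dict to one VW-format line."""
    parts = []
    for name, value in features.items():
        clean = str(name).replace(' ', '_').replace(':', '_').replace('|', '_')
        parts.append('%s:%g' % (clean, value))
    return '%s |%s %s' % (label, namespace, ' '.join(parts))

line = to_vw_line(1, {'price': 9.99, 'qty': 3})
# e.g. "1 |f price:9.99 qty:3"
```

A DataFrame version is just this applied row by row (e.g. over `df.itertuples()`), writing each line to the training file.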

So now you have your files converted, let’s use the classifier:

# training_vw_file / testing_vw_file are open file streams on the VW-format files
training_lines = training_vw_file.readlines()
testing_lines = testing_vw_file.readlines()

The VowpalWabbitClassifier is fully scikit-learn compatible, so you can use it in your cross validations, grid searches, etc. with ease. Have a look at the code to see all the options it supports, and if any are missing please fork and submit back to me.
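As a rough illustration of what "scikit-learn compatible" means: the wrapper exposes the standard estimator contract (fit / predict / get_params / set_params) so tools like cross_val_score and GridSearchCV can drive it. The toy estimator below is my own stand-in showing that contract; the real VowpalWabbitClassifier obviously shells out to vw.exe rather than using this trivial logic:

```python
# A minimal estimator implementing the scikit-learn contract. Any class with
# these four methods can be plugged into cross-validation and grid search.

class MajorityClassifier:
    def __init__(self, default=0):
        self.default = default

    def get_params(self, deep=True):
        return {'default': self.default}

    def set_params(self, **params):
        for k, v in params.items():
            setattr(self, k, v)
        return self

    def fit(self, X, y):
        # remember the most frequent label seen during training
        self.majority_ = max(set(y), key=list(y).count) if len(y) else self.default
        return self

    def predict(self, X):
        return [self.majority_] * len(X)

clf = MajorityClassifier().fit([[0], [1], [2]], [1, 1, 0])
print(clf.predict([[5], [6]]))  # -> [1, 1]
```

Because the VW wrapper follows this same contract, `cross_val_score(VowpalWabbitClassifier(...), X, y)` works just as it would for any built-in scikit-learn model.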

The value of Kaggle to Data Scientists

Kaggle is an interesting company. It gives companies a way to access data scientists in a competition-like format at very low cost. The value proposition of Kaggle for companies is very clear; this article will focus on the flip side of the equation: what value does Kaggle give to the data scientists?

This blog post contains my own personal opinions, however I have tried to be as balanced in my views as possible and I have tried to present all benefits and disadvantages to this platform.

Disadvantages of Kaggle to the Data Scientist

Cheapens the Skill

Most Kaggle competitions award prizes in the order of $5k–$50k. A few competitions attract a lot of media attention because they have much higher prizes, but these are very rare. Given the chances of winning a Kaggle competition and the prize money involved, the monetary returns of Kaggle are negligible. We therefore have highly educated and skilled people providing an extremely valuable service to commercial organisations for free. This reeks of software development's Open Source commercialisation strategy, which aims to destroy competitors by providing a free product and charging for other services (like support). In this case Kaggle's competitors are the data scientists themselves, as they could be consulting for organisations directly instead of going through Kaggle. This is an interesting argument that could be the subject of its own blog post, so let's put it aside for now.

Kaggle Encourages Costly Models over Fast Models

When competing, the participants have no incentive to create robust, fast, bug proof models.  The only incentive is to get the best possible predictive model disregarding all other factors.  This is very far removed from the real world where accuracy compromises are made regularly mainly for performance purposes.

Kaggle Does not Teach Data Prep, Data Source Merging, Communicating with the Business, etc.

Competitions on Kaggle go straight for the last 5-10% of the job. They assume all data sources have been merged, management has decided on the best question to ask of the data, and IT has prepared and cleaned the data. This again is very different from real-life projects and could be giving some data scientists, especially inexperienced ones, a false view of the industry.

Kaggle Competitions Take too Long to Compete In

I would guess that most top 50 competitors in any competition would put in around 40-200 hours in a single competition.  This is a huge amount of time, so what does the competitor get out of it?

Benefits of Kaggle to the Data Scientist

Opportunities for Learning

Kaggle is the best place on the internet at the moment for budding data scientists to learn and hone their craft. I am confident in the accuracy of this statement, having seen many people start out on simple competitions and slowly progress over time into highly skilled data scientists. This great blog post from Triskelion demonstrates it clearly. This benefit cannot be overstated: data science is hard! You need to practice, and this is the place to do it.

Opportunities to Discuss and ask Questions of Other Data Scientists

The Kaggle forums are a great place to ask questions and expand your knowledge.  I regularly check these forums even if I’m not competing as they are a treasure trove of wonderful ideas and supportive and highly skilled individuals.

The Final 10%

The final 10% of a Data Science project is the machine learning / predictive modelling. The other 90% is administrative, managerial, communications, development and business analysis tasks. These tasks are very important, but in all honesty any experienced manager has these skills. The technical skills needed in that 90% are also easily available, as most experienced developers can merge data sources and clean datasets. It is the final 10% where a great data scientist pays for himself. This is where Kaggle competitions sharpen your skills: exactly where you want to be focusing your training.

Try out Data Science

Something you quickly learn from any Predictive Analytics project is the monotony of data science. It can be extremely painful and is definitely not for everyone. Many skilled developers have the maths, stats and other skills for Data Science but lack the patience and pedantry required to be successful in the field. Kaggle gives you the chance to try out the career; I'm sure many have decided it's just not for them after a competition or two.

Promotional Opportunities

I am sceptical of how much promotional value Kaggle actually provides to the individual. I personally have never been approached for a project because of my Kaggle Master status or my Kaggle ranking. I have brought up the fact that I am a Kaggle Master in some meetings, but this generally gets ignored, mainly because most people outside the Data Science field do not know what Kaggle is. However, there may be some value there, and I'm sure the top 10-20 Kagglers get some promotional value from the platform.

TL;DR (Summary)

Kaggle may cheapen the data science skillset somewhat, providing huge business benefits at very low cost and zero pay to data scientists. However, I like it and will continue to compete on my weekends/evenings as I have found the learning opportunities Kaggle provides are second to none.

Property Market Predictions – PicNet Predictive Analytics

This post explores options for applying Machine Learning techniques to the Australian residential property market, with the objective of producing insights useful for buyers, sellers and the industry. With access to good data it is possible to predict sale/auction prices by home, street, suburb, municipality, etc. We could also predict the number of registered bidders at auctions or parties through open houses.

Data Availability

The success of a Predictive Analytics and Machine Learning project depends totally on the data available and its applicability to the problem at hand.  A careful analysis of available data is required before any work can begin in this space.  But some potential data sources that could be brought together for a predictive model include:

  • RPData: RPData contains ownership, property features, land size, sales history information.  This data is generally considered to be of decent quality and can be relied on
  • Real estate marketing strategy: The strategy used in a property campaign will greatly affect the outcome of a sale/auction. However, access to this data will be difficult and it may need to be omitted. Perhaps two models could be built: one for participating real estate agencies (that provide this data) and another where this data is unknown.
  • Area demographics information available from census data will also affect predictions.
  • Commercial properties in the area will also affect the outcome of a property campaign.  This effect can be both positive and negative depending on the type of commercial (i.e. café vs factory) and quantity.  RPData has some of this data but the quality of this may not be great.  Local councils have commercial property data but this will be very hard to access.  The best source for this data may simply be ABN registration details which is not good quality (many ABNs do not have a corresponding business) but it may serve the purpose of showing volumes and type of businesses registered in the area.
  • Web traffic logs: Interest in a property can be measured by analysing the web traffic activity for a property on property websites. Details such as number of visits, time on page, bounce rates, etc. could provide real insight into the volume and sentiment of potential buyers.
  • Weather forecasts:  The impact of the weather on open houses, auctions, etc. could be real and this data should also be included in any predictive model.
  • Crime statistics are available through various government web properties. This data should also be included in the predictive model.
  • School location and performance data should also be included, as local schools can affect property prices.
  • Public transport location and frequency in the area also affects property prices.  This data is available from various public transport online properties.
  • Exchange rates and global economic statistics will also affect property prices.  The state of the economy in our local region especially in China and South East Asia will affect prices in certain property markets in Australia.  Exchange rates data is easily available; data for each relevant country may need to be sourced on a country by country basis.
  • The state of the local economy will also affect property prices. This data can be sourced from the RBA web site and perhaps other sources.
  • Social media could be a source of sentiment data showing shortage of property in an area, interest in properties in an area and general sentiment for an area. Social media data could also show movements in and out of an area. This data can be bought from Facebook and other social media providers.
  • Number of registered bidders and parties at open houses: Real estate agencies have this information, which could be very valuable in many predictions. However, it could be hard to obtain.
  • Google Trends.  A great tool to analyse interest in a suburb, property, etc.

Given the wealth of data available in this space I believe that a very accurate predictive model can be built.


Possible Predictive Models

Sales/Auction Price

The holy grail of property market prediction is “how much will it go for?” While a general trend can be identified, the final sale amount will depend on who is there on the day and how much they want the property. However, average figures for an area will be highly predictable, as volume eventually overrides the confounding noise of an individual's effect on a sale/auction.

Auction Day Bidders

If it is possible to get past numbers of registered bidders from real estate agencies, then predicting future numbers of bidders would also be highly accurate. The same approach could be applied to predicting the number of people at open houses.

Area Predictions

More general predictions at the street/suburb/municipality level would also be possible. Once data is aggregated like this, predictions are generally much more accurate, but they of course lose granularity, which may devalue the prediction.


Proof of Concept

A potential project to gauge the effectiveness of a predictive model would be something like this:

Find the current benchmark

Find current property predictors and use their accuracy as a benchmark. These benchmarks will be used to compare the accuracy of this project with what is currently available. From my initial research, current predictions are usually either very general (suburb, city or state level) or of low accuracy.

Initial data

Depending on the interested stakeholders and their access to good data this step could range from easy to very difficult.  Whatever access to data we get at this step could mean the success or failure of the proof of concept.

P.O.C. implementation

Implement a simple model with the current available data and make predictions for the next period.  These predictions will be used to measure against existing benchmarks.

Iteratively add more data

If the proof of concept shows that we have the potential to make real and accurate property market predictions then we can start investing in getting more data.  Talking to real estate web property owners, real estates, councils, etc.  We would then add each new data source to the model measuring its impact on the prediction accuracy.

Potential Business Opportunities

Once the model is proven and hard numbers can back its predictive power, several businesses could be developed to take advantage of this information:

Real Estate Agencies

Real estate agencies are always looking for accurate ways to predict the price of their properties.  This system could supplement their trained agents in predicting property prices and developing appropriate marketing campaigns.

Investors/Home Buyers

A service for the public to accurately predict the price of a residential home would be invaluable to the individual.  This service, however, could create a feedback loop into the model, driving people towards or away from properties.


Marketing Companies

If marketing strategies can be compared for effectiveness using these models then marketers can use this data to charge for advertising space with provable effectiveness.


Insurance Companies

Insurance companies would be very interested in volumes of sales, prices of assets, etc.


Banks

Predicting loan volumes and areas of potential growth for their loans would be very valuable for banks.  This would help plan future loan amounts and marketing opportunities.


Retailers and Suppliers

Demand planning for a future period would be dramatically improved with access to accurate predictions for a given sales period.

Small Business Owners

Many small businesses offer services to new home owners in an area.  These businesses could use future volume predictions for demand planning and marketing campaign planning and they could also use value predictions to identify customers in the correct financial demographics for their services.

PicNet and Predictive Analytics

PicNet is ideally positioned to work with partners on this and many other Predictive Analytics projects having both the skills and tools required to build these sophisticated data environments and predictive models.  Guido Tapia, PicNet’s manager of Software and Data has 20 years of Software and Machine Learning experience which increases the chances of success dramatically.

If you are interested in Machine Learning or anything else mentioned in this article please feel free to contact Guido Tapia directly.

Naming Conventions in Predictive Analytics and Machine Learning

In this article I am going to discuss the importance of naming conventions in ML projects. What do I mean by naming conventions?  I mainly mean using descriptive ways of labelling features in a data set.  What is the reason for this?  Speed of experimentation.

Naming Conventions

  • Categorical columns start with ‘c_’
  • Continuous (numerical) columns start with ‘n_’
  • Binary columns start with ‘b_’
  • Date columns start with ‘d_’

Examples of Benefits

Once your dataset is labelled clearly with these conventions, experimenting with features becomes very fast.

cv = functools.partial(do_cv, LogisticRegression(), n_folds=10, n_samples=10000)
cv(one_hot_encode(X), y)  # One hot encode all categorical features
cv(contrasts(X), y)       # Do simple contrast coding on all categorical features
cv(bin(X, n_bins=100), y) # Split all continuous features into 100 bins
X = engineer(X, 'c_1(:)c_2') # Create a new categorical feature that combines 2 others
X = engineer(X, 'n_1(*)n_2') # Create a combination of 2 numericals (by multiplication)
X = engineer(X, 'n_1(lg)')   # Create a log of feature 'n_1'
X = engineer(X, '(^2)')      # Create a square feature for each numerical feature
X = engineer(X, '(lg)')      # Create a log feature for each numerical feature

In a real world example this would look something like:

X = remove(X, dates=True)
for n1, n2 in combinations(X, group_size=2, numericals=True): X = engineer(X, n1 + '(*)' + n2)
for c1, c2 in combinations(X, group_size=2, categoricals=True): X = engineer(X, c1 + '(:)' + c2)
X = engineer(X, '(^2)')
X = engineer(X, '(lg)')
cv(X, y)


The resulting DSL from using good naming conventions leads to very clear code that relates directly to the data munging operations being performed.  Another benefit is that once your ‘one_hot_encode’ method is written and tested, you can trust it for future projects (as long as they use the same naming conventions).

Using private partial classes to hide implementation details of an interface. Workaround for package level protection in C#

I miss very few things from the Java language, but one gem I really do miss is the package-private accessibility modifier.  It was so useful: your IDE colour-coded your package-private classes so you knew they were not part of the public API.  You could skim the files in a package (namespace) and see exactly what you needed to look at, ignoring all low-level implementation details.

Unfortunately this does not exist in C#; the closest C# gets is the internal modifier.  I personally really dislike this modifier, as I think it has contributed to the nightmare of 100-200 project solutions that are so common amongst some .Net shops.

This pattern is an alternative.  I think it’s a fairly common one, but recently during a code review I explained it to someone who appreciated it, so I thought I’d write it up.

Often, C# developers will do this kind of encapsulation using nested private classes. I have a big problem with this as it leads to those 2-3k line files which are unintelligible. So why not just make those nested classes private partials? Let’s see how this would work.

Let’s assume we have a namespace Clown whose responsibility is to create clowns for customers (i.e. like a clown booking service for kids parties). The customer basically fills in the details that their clown should have and then books a clown for their party.

The details are specified using an instance of ClownSpecifications:

public class ClownSpecifications {
  public bool KidFriendly { get; set; }
  public bool Fun { get; set; }
  public bool Scary { get; set; }
  public bool Creepy { get; set; }
}

The clown itself is simply an implementation of the IClown interface. This interface is the only thing the user ever sees.

public interface IClown {
  void DoYourThing();
}

And then we need a clown factory that builds clowns based on the provided specifications:

public partial class ClownFactory {
  public IClown CreateClown(ClownSpecifications specs) {
    if (specs.Creepy && specs.Scary) { return new ThatClownFromStephenKingsBook(); }
    if (specs.Creepy) { return new TheJoker(); }
    if (specs.KidFriendly && specs.Fun) { return new Bobo(); }
    if (specs.Fun) { return new RudeClown(); }
    return new GenericBoringClown();
  }

  private partial class ThatClownFromStephenKingsBook {}
  private partial class TheJoker {}
  private partial class Bobo {}
  private partial class RudeClown {}
  private partial class GenericBoringClown {}
}

A few things to notice here. The first is that the ClownFactory itself needs to be marked partial:

public partial class ClownFactory

This is required simply because there is no way to create top level private partial classes.

Secondly, the implementation classes are defined in a super minimalistic fashion:

private partial class ThatClownFromStephenKingsBook {}

They don’t even implement the IClown interface in this definition.

So now an implementation of IClown looks like this:

public partial class ClownFactory {
  private partial class ThatClownFromStephenKingsBook : IClown {
    public void DoYourThing() {
      // ...
    }
  }
}

That’s it, this is actually working code.  And the great thing about it is that you can now easily tell that the public API of the namespace is just IClown, ClownSpecifications and ClownFactory.  To clean this up even more you could create a new directory called impl and hide the implementations there.  I personally do not do this, as Resharper then starts yelling at me about mismatching namespaces.

Solid Principles: Part Two – Open Closed Principle

‘Software entities (classes, modules, functions, etc.) should be open for extension but closed for modification.’ [R. Martin]

The OCP is a set of strategies based on inheritance and polymorphism that aims to make code more extensible with fewer side effects when extending. The way side effects are controlled is by adding functionality to the system without modifying any existing code.
The key to the OCP is programming to abstractions, i.e. interfaces and base classes.  But not just programming to abstractions: doing it well, avoiding the common pitfalls that violate the OCP.
An example is in order. Let’s assume we have a Library stock management system. The system has several types of Stock items. These could be: Magazines, Periodicals, Books, DVDs, CDs, etc. Each of these items may have their own checkout rules and we may program this as follows:

class OrderProcess:
  def checkout_item(item):
    var due_date
    switch (item.type)
      case ITEM_TYPE_MAGAZINE:
        due_date = now.AddWeeks(2)
      case ITEM_TYPE_DVD:
        due_date = now.AddWeeks(1)
      case ITEM_TYPE_BOOK:
        due_date = now.AddWeeks(6)

This is a naive implementation because every time a new stock item type is added we will need to add a case to this statement (and to any other switch statement that switches on item.type).  A better implementation, one that respects the OCP, would be:

interface ItemType:
  def get_due_date(from)
  def checkout()

And we could have implementations like this:

class MagazineItemType implements ItemType:
  def get_due_date(from):
    return from.AddWeeks(2)

  def checkout():
    var due = this.get_due_date(now)
    // Any other checkout processes applicable to Magazines

class DVDItemType implements ItemType:
  def get_due_date(from):
    return from.AddWeeks(1)

  def checkout():
    var due = this.get_due_date(now)
    // Any other checkout processes applicable to DVDs

class OrderProcess:
  def checkout_item(ItemType item):
    item.checkout()

So now, if for any reason we need to add a new Item Type, say Blu-ray, we can just create a new class and the system will handle it without any modification to existing code.
The reason the system now ‘magically’ works with a new Item Type (Blu-ray) without modifying any code is that the high level functions of the system do not know anything about Magazines, DVDs, etc. They simply know about the abstraction which is the ItemType interface.
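The same design can be sketched as runnable Python (the class names here are illustrative, matching the pseudocode above rather than any real codebase):

```python
from abc import ABC, abstractmethod
from datetime import datetime, timedelta

class ItemType(ABC):
    """The abstraction that high level code programs against."""
    @abstractmethod
    def get_due_date(self, from_date):
        ...

    def checkout(self, from_date):
        due = self.get_due_date(from_date)
        # ... any other checkout processing ...
        return due

class Magazine(ItemType):
    def get_due_date(self, from_date):
        return from_date + timedelta(weeks=2)

class DVD(ItemType):
    def get_due_date(self, from_date):
        return from_date + timedelta(weeks=1)

class OrderProcess:
    """High level code: knows only the ItemType abstraction, no concrete types."""
    def checkout_item(self, item, from_date):
        return item.checkout(from_date)

# Adding Blu-ray later requires no change to OrderProcess or any existing class:
class BluRay(ItemType):
    def get_due_date(self, from_date):
        return from_date + timedelta(weeks=1)
```

OrderProcess.checkout_item works unchanged for BluRay, which is exactly the ‘magic’ described above: the system is extended by adding code, not by modifying it.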
As we’ve seen, switch statements or long if chains can be a smell that you’re violating the OCP. Other signs include having code like this in your system:

  def checkout_item(item):
    if (item.type is ITEM_TYPE_BLU_RAY)
      throw error('Blu-ray cannot be checked out')

Having high level functions that are even remotely aware of concrete types is an indication of future heartache.  It is also important to note that even low level types should not know about each other; for instance, a Magazine should not know about DVDs.  These relationships are also violations of the OCP.
It is important to note that all of these techniques are ways to manage the complexity of source code.  Inheritance and polymorphism are themselves complex tools, so, as always, you need to be judicious in your use of inheritance hierarchies.  Adhering to the OCP matters most when you think part of a system is likely to change.  For instance, in the example above it is perfectly reasonable to expect new Stock Item Types in the future, so creating a hierarchy of these types is a good idea.  Other areas of the system which are not likely to change can violate the OCP.  The point is not to be religious about any technique, but if you are going to ignore an OCP violation (or any other principle) do so consciously.