About Guido Tapia

Over the last two years Guido has been involved in building two predictive analytics libraries, one for the .Net platform and one for Python. These libraries, together with Guido's machine learning experience, have placed PicNet at the forefront of predictive analytics services in Australia.

For the last 10 years Guido has been the Software and Data Manager at PicNet, and in that time he has delivered hundreds of successful software and data projects. An experienced architect and all-round 'software guy', Guido has been responsible for giving PicNet its 'quality software provider' reputation.

Prior to PicNet, Guido worked in the gaming industry on advanced graphics engines, sound (digital signal processing) engines, AI players and other great technologies.


PredictBench successfully predicts product classifications for one of the world’s largest ecommerce and FMCG companies

As any large FMCG (CPG) company is aware, classifying products correctly is critical to good analytics capabilities. It is also clear to any global organisation that this is a surprisingly difficult task. Most regions use different classifications, and combining them on a global scale is non-trivial. It is such a difficult task that many organisations simply ignore it and miss out on the potential insights from a global view of product sales.

The Otto Group is a German ecommerce company that sells a tremendous volume of goods, and it recently released a dataset to address this exact issue. Given details of over 200,000 products, the data scientist's job was to correctly distinguish between Otto's main product categories.

We used PredictBench to tackle this challenge and achieved excellent classification accuracy. The PredictBench team came within 0.021 points of the top solution, which was no mean feat, beating around 3,500 teams from around the world. Our final position in this challenge was 16th (out of 3,514).

On this project we teamed up with American data scientist Walter Reade who brought invaluable experience and knowledge to the PredictBench team.

Working together with Walter we were able to put together an ensemble of hundreds of models, including linear models, neural nets, deep convolutional nets, tree-based ensembles (random forests and gradient boosted trees) and many others. The scale of the final solution shows how complex this problem was and the skills that were required to achieve such a result.

Working with the Otto data and teaming up with Walter was a great boon to the PredictBench team, and we hope to repeat the experience in the near future.

Machine Learning for the FMCG Industry

A detailed look at the potential benefits of using modern Machine Learning technologies in the FMCG vertical

Executive Summary

The unique characteristics of the FMCG industry make it an ideal candidate for Machine Learning and associated technologies. These characteristics include very large volumes of transactions and data, and a large number of data sources that influence projections. They mean that traditional analytics technologies struggle with the volume and complexity of the data, which is exactly where Machine Learning is best suited. Most areas of the industry are candidates for optimisation, including improving the effectiveness of marketing campaigns, increasing the performance of the sales team, optimising the supply chain and streamlining manufacturing. The FMCG industry has been relatively slow to adopt these cutting-edge technologies, which gives an early entrant an opportunity to strongly outperform its competitors.


Introduction

FMCG Introduction

FMCG (Fast Moving Consumer Goods) refers to organisations that sell products in large quantities. These products are usually inexpensive, sold in large volumes and often have a short shelf life. Profits on individual items are very small, so large volumes are required for a viable business. These characteristics present many challenges and also many opportunities.

This paper investigates these challenges and opportunities in detail and focuses on the use of Machine Learning technologies to optimise processes to increase profits for FMCG companies.

Machine Learning Introduction

The following list should serve as a refresher when thinking about Machine Learning vs traditional analytics and business intelligence:

1. Unstructured Data

Modern Big Data technologies and advanced machine learning algorithms can analyse data in any format, such as images, videos, text, emails, social media messages, server logs, etc., whereas traditional analytics can only analyse structured data in databases.

2. Combine Data

Modern technologies allow us to quickly merge datasets and form rich data collections that combine internal company data with external public data sets. This allows the data scientist to enrich sales and marketing data with, for instance, government socio-demographic statistics. Traditional analytics is usually performed on data silos, and when data sets are combined it is usually at huge expense by building data warehouses, which still tend to contain only internal company data.

3. Future vs Past

Machine Learning is often called predictive analytics, as one of its major use cases is to predict the future. Advanced machine learning algorithms ingest all your data and find patterns that can then be used to make accurate inferences about the future. These predictions are qualified with an accuracy metric, so management can make intelligent decisions based on them. Traditional analytics rarely tries to infer future events and deals only with explaining and visualising past events.

4. Answers vs Reports

Using the predictive power of machine learning, management can start asking smart questions from their data. Questions such as:

  • What is the optimal marketing campaign to increase market awareness for product X
  • How many of product Y should we produce to reduce oversupply next winter season
  • Which sales rep should manage our new customer to maximise potential profit

This is very different from existing business intelligence suites which usually deliver dry reports or charts which are very often misinterpreted.

5. Speed of delivery

Traditional analytics / business intelligence implementations can take years to complete. They are intrusively integrated into an organisation's IT and as such move very slowly. Modern machine learning technologies allow management to get answers from their data quickly and efficiently; a simple question can be answered in weeks, not years.

6. Machine analysis vs human interpretation

Machine Learning uses advanced computer algorithms to analyse unlimited quantities of data. This analysis is done impartially, free from the biases that are common in manual analyses. The outputs from these algorithms are also easy to interpret and leave little room for misrepresentation, making them objective and quantifiable tools for decision making.

Machine Learning in FMCG

The FMCG (Fast Moving Consumer Goods) industry is an ideal target for Predictive Analytics and Machine Learning. There are several unique attributes of the industry that make this so:

  • The massive volumes involved
  • Access to good quality sales data
  • Short shelf life
  • Current forecasting techniques are relatively inaccurate
  • Current marketing strategies are less than optimal
  • Current manufacturing practices are less than ideal
  • Current supply chain strategies are less than optimal
  • Consumer numbers are very large

We now explore each of these attributes in detail.

1. Large volumes / access to good quality sales data

The number of sale transactions available to modern FMCG organisations is huge. This data can usually be purchased from retailers and is of very high quality. It forms the backbone of any predictive model, as increasing sales should always be the primary objective of any predictive project. Most large FMCG companies also have very good systems in place that record data at every stage of a product's lifecycle, from manufacturing to delivery to marketing and sales. These systems usually hold very high quality data and require very little cleansing to be valuable.

Given the enormous volumes of transactions generated by FMCG companies, this data is usually very hard to analyse manually; it overwhelms even the most capable analysts. Currently many organisations have not gone beyond basic analysis at a highly aggregated level, for instance sales for the week or sales for a store. And where they do drill deeper into the data, this is usually done by senior analysts with years of experience (and biases) at a huge cost.

2. Short shelf life

FMCG products usually have a short shelf life, meaning that the costs of oversupply and over-manufacture can be significant. Given the large volumes of products, any improvement to the oversupply (or undersupply) problem can result in a very large ROI. The over/under supply problem is again a perfect candidate for machine learning technologies.

3. Sales and marketing

If your goal is to increase sales then accurate sales forecasting is critical. With an accurate forecasting model you can create simulations that allow managers to do quality “what if” analysis. Currently sales forecasting is inaccurate and senior management lacks confidence in the numbers. The ability to merge many data sources (sales, marketing, digital, demographics, weather, etc.) greatly improves the quality of sales forecasts compared to traditional predictions, which are usually made on isolated and aggregated sales figures. Once the sales data is merged with the marketing data we can also start making very accurate marketing predictions, answering questions like:

  • Which product should we promote this month
  • What type of campaign will be most profitable for this product
  • What consumer segment should we target
  • How can we get value from our social media data and use current consumer sentiment to create timely marketing campaigns

4. Manufacturing and supply chain

Most large FMCG companies have extensive ERP systems that hold a wealth of hidden value in their data. This data can be used to create models that answer several critical questions.

  • How can we guarantee on time delivery
  • How can we shorten the time to manufacture a product
  • How can we increase the yield for a product
  • How can we minimise product returns / complaints

PredictBench

PredictBench is a product that enables you to get the most value from your data. It is quick and efficient and does not need to involve your IT department. You do not have to understand reporting, statistics or any form of data analysis technique: you just ask us the questions you want answered and, using the latest Machine Learning technologies, we give you those answers. If you are interested in learning more please feel free to contact me.

PicNet

Founded in 2002, PicNet has been a leading provider of IT services and solutions to Australian businesses. PicNet helps organisations use technology to increase productivity, reduce costs, minimise risks and grow strategically.

PredictBench set to go global

The official announcement of the Elevate 61 participants was released today. We are very proud to be included in this list. Our latest offering, PredictBench, has been recognised as innovative and exciting enough for Advance and KPMG to help us take it to the US!

This means we will be extremely busy in the coming weeks/months traveling to the US, meeting and presenting PredictBench to companies and potential partners.

Over the next few months PicNet will be showcasing PredictBench in Los Angeles, San Francisco and New York as well as in all major Australian cities.

This is a wonderful opportunity that will help companies around the world take advantage of our PredictBench solution, which we have worked very hard to build and are extremely proud of.

What is PredictBench

PredictBench is a solution that helps organisations predict future business events with confidence, based on their own historical data and other influencing factors. It allows organisations to answer questions such as:

  • What marketing campaign will give me the greatest return on investment
  • How much of a certain product to produce to reduce oversupply whilst guaranteeing no undersupply
  • How can we measure the risk a customer represents

In the past these technologies have only been available to Silicon Valley research start-ups or corporate giants.  We bring this technology to all corporations and government entities in an affordable and efficient solution that aims to deliver real value for money.

For more information please visit the PredictBench page, watch the short video or download the flyer.

The Value of Data – A short guide to maximising the value of your data

This is an excerpt from the white paper available here.

Over the last few years the data industry has been shaken to its core. We have new names, roles, technologies, products coming out on a daily basis. The term “Big Data” has been overused so much that it may be losing some of its meaning.

I am meeting people on a regular basis and the message I receive over and over again is that it’s overwhelming. I am going to try to address this concern in this paper.

The Purpose of Data

The sole purpose of data in an organisation is to support business decisions. Data can do this in several ways.

  • By communicating information about the past
  • By communicating information about the future
  • By recommending actions for future success

The first of these has long been addressed by traditional reporting and business intelligence tools, so I will not spend too much time on it. What I really want to address are the final two points:

  • Communicating information about the future
  • Recommending actions for future success

The Future

There are several ways that data can help us peek into the future of our organisation. The first and most traditional is the statistician: the professional who can look at the data and give you an inference about the future based on the data available. The statistician, especially one who is highly knowledgeable about the business, can use domain expertise and data experience to give extremely valuable insights into the future.

The second way of getting real future value from data is to use Predictive Analytics. Predictive Analytics is also known as Advanced Analytics, Prescriptive Analytics and Machine Learning but I suggest we just call it Predictive Analytics as it clearly summarises the objective of the technology.

Predictive Analytics

Predictive Analytics is about value. It is how you convert your data into real insights about the future. It is not a product, it is not a platform, and it is not Hadoop or any other vendor name. Predictive Analytics is solely the science of predicting the future from historical data.

Predictive Analytics is also not a person. This is an important point: Joe the statistician cannot handle a 2GB Excel file, and if you ask him to bring Facebook, Twitter and web-traffic data into his inferences he'll probably have a nervous breakdown.

There is only so much data a human head can manage. Computers do not have this problem; they can handle vast amounts of data from a wide variety of sources. They also do not bring any biases to the analysis, which has been a problem in the past.

How to “Predictive Analytics”

Until recently, implementing a Predictive Analytics project was the domain of large and sophisticated companies. With the emergence of affordable cloud computing, in-memory analysis and sophisticated modelling tools, combined with the skills of computer programmers, data scientists and analysts, Predictive Analytics is now affordable as a service for most medium-sized enterprises.

The solution can be procured as a service: on/off, pay as you go, and only when you need it. No longer is huge capital investment required; instead, understanding the need and the challenge, developing proofs of concept and analysing the outputs provide an effective and affordable introduction to the benefits of Predictive Analytics.

Predict What

Predictive Analytics can predict numerous variables supported by your historical data. The following are some examples:

  • Potential success of a marketing campaign
  • How best to segment customers
  • What marketing mediums have the best ROI
  • When will my machine fail
  • Why did my machine fail

As long as we have recorded our actions in the past, and at a later date recorded the results, we can learn from that data and make predictions about the future.

Predictions can be real time or on a weekly/monthly/quarterly basis; it all depends on your needs.

Getting Started

There are several ways to get started. You can recruit your very own Data Scientist. Not an easy task considering the high specialisation of these professionals but it is what a lot of companies are doing.

You could also use a service provider. A good service provider will have a team of IT and data people including Data Scientists that have experience in doing these types of projects for customers in your industry.

At PicNet we always recommend that our customers start with a proof of concept. Some of these Predictive Analytics projects can take a long time to implement and integrate back into the processes of the organisation, so it's always best to take small bites and see if there is value to be had. We usually go for a 2-4 week project which goes something like this:

  • We get potential questions about the future that management would like answered
  • We audit all data and its quality in the organisation
  • We prepare and clean this data
  • We get external data if it can help the prediction (census, web traffic, Facebook, Twitter, weather, etc.)
  • We build a simple predictive model, perhaps on a subset of the data
  • We provide a report with predictions for the next period (2 months for instance)

This report can then be used by the business to test the value and accuracy of the predictions. If the business approves, we then push that system into production. This will mean that you may get weekly, daily or real time predictions as the business requires. The length of these production implementations varies greatly depending on the systems that need to be integrated with and many other factors.

Summary

As a manager charged with the responsibility of curating your organisation's data resources, you should forget vendors, platforms and NoSQL databases, and forget about the 3 Vs (or is it 4?) of big data. Your only concern should be the only V that matters: Value. And if you want to get value from your data then consider Predictive Analytics.

For any additional information please feel free to contact me on my details below.


Vowpal Wabbit for Windows and Binary for Win x64

Getting VW working on Windows is a real pain. Even though I had the whole environment set up as described in the readme, it still took me a good couple of hours to build.

So, with absolutely no guarantee or support options, here is my built version of vw.exe version 7.7.0. This was built on a Windows 7 x64 box and I have only tested it on this one box, so use at your own risk!!

If you were after the executable only then there is no need to continue reading, the rest is about python.

So I started playing around with vw.exe and quickly realised that the command line is a terrible place to experiment with a machine learning algorithm. So I started looking for python wrappers and found this, which is a nice wrapper but does not work on Windows. So I hacked it up a little (with no permission, sorry Joseph Reisinger) and have a Windows-friendly version with updated command line options here.

So how do you use the python wrapper?

First we need to convert your data into VW input format. I use my pandas extensions helper method: _df_to_vw

You will be able to turn this into a generic converter very easily; in fact there are already plenty around, such as:

https://github.com/zygmuntz/phraug2
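
For reference, the VW input format is simple: each line is a label, a pipe, then space-separated feature:value pairs. Below is a minimal sketch of a converter along those lines, assuming a pandas DataFrame of numeric features plus a label column (the function name and column handling are illustrative only, not the actual _df_to_vw, and namespaces are omitted for brevity):

import pandas as pd

def df_to_vw_lines(df, label_col):
  # each line looks like: "<label> | feat1:val1 feat2:val2 ..."
  lines = []
  for _, row in df.iterrows():
    feats = ' '.join('%s:%s' % (col, row[col])
                     for col in df.columns if col != label_col)
    lines.append('%s | %s' % (row[label_col], feats))
  return lines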

So now you have your files converted, let’s use the classifier:

# training_vw_file and testing_vw_file are open file handles to the converted VW-format files
training_lines = training_vw_file.readlines()
testing_lines = testing_vw_file.readlines()
VowpalWabbitClassifier().fit(training_lines).\
  predict(testing_lines)

The VowpalWabbitClassifier is fully scikit-learn compatible, so use it in your cross validations, grid searches, etc. with ease. Have a look at the code to see all the options it supports, and if there are missing options please fork and submit back to me.

The value of Kaggle to Data Scientists

Kaggle is an interesting company. It provides companies with access to Data Scientists in a competition-like format at a very low cost. The value proposition of Kaggle for companies is very clear; this article will focus on the flip side of that equation: what value does Kaggle give to the data scientists?

This blog post contains my own personal opinions; however, I have tried to be as balanced as possible and to present both the benefits and disadvantages of the platform.

Disadvantages of Kaggle to the Data Scientist

Cheapens the Skill

Most Kaggle competitions award prizes in the order of 5k – 50k. A few competitions get a lot of media attention because they have much higher prizes, but these are rare. Given the chances of winning a Kaggle competition and the prize money involved, the monetary returns of Kaggle are negligible. Therefore we have highly educated and skilled people providing an extremely valuable service to commercial organisations for free. This reeks of software development's Open Source commercialisation strategy, which aims to undercut competitors by providing a free product and charging for other services (like support). In this case Kaggle's competitors are the Data Scientists themselves, as they could be consulting for organisations directly instead of going through Kaggle. This is an interesting argument that could be the subject of its own blog post, so let's put it aside for now.

Kaggle Encourages Costly Models over Fast Models

When competing, the participants have no incentive to create robust, fast, bug-free models. The only incentive is to get the best possible predictive model, disregarding all other factors. This is far removed from the real world, where accuracy compromises are made regularly, mainly for performance reasons.

Kaggle Does not Teach Data Prep, Data Source Merging, Communicating w/ Business, etc.

Competitions on Kaggle go straight to the last 5-10% of the job. They assume all data sources have been merged, management has decided on the best question to ask of the data, and IT has prepared and cleaned the data. This again is very different from real-life projects and could be giving some Data Scientists, especially inexperienced ones, a false view of the industry.

Kaggle Competitions Take too Long to Compete In

I would guess that most top 50 competitors in any competition put around 40-200 hours into a single competition. This is a huge amount of time, so what does the competitor get out of it?

Benefits of Kaggle to the Data Scientist

Opportunities for Learning

Kaggle is the best resource on the internet at the moment for budding data scientists to learn and hone their craft. I am confident in the accuracy of this statement, having seen many people start out on simple competitions and slowly progress over time to become highly skilled data scientists. This great blog post from Triskelion demonstrates it clearly. This benefit cannot be overstated; data science is hard!! You need to practice and this is the place to do it.

Opportunities to Discuss and ask Questions of Other Data Scientists

The Kaggle forums are a great place to ask questions and expand your knowledge.  I regularly check these forums even if I’m not competing as they are a treasure trove of wonderful ideas and supportive and highly skilled individuals.

The Final 10%

The final 10% of a Data Science project is the machine learning / predictive analysis modelling. The other 90% is administrative, managerial, communications, development and business analysis tasks. These tasks are very important, but in all honesty an experienced manager already has these skills. The technical skills needed in this 90% are also easily available, as most experienced developers can merge data sources and clean datasets. It is the final 10% where a great data scientist pays for himself, and this is where Kaggle competitions sharpen your skills: exactly where you want to be focusing your training.

Try out Data Science

Something that you quickly learn from any Predictive Analytics project is the monotony of data science. It can be extremely painful and is definitely not suitable for everyone. Many skilled developers have the maths, stats and other skills for Data Science but may lack the patience and pedantry required to be successful in the field. Kaggle gives you the chance to try out the career; I'm sure many have decided it's just not for them after a competition or two.

Promotional Opportunities

I doubt Kaggle provides much value in terms of promotion for the individual. I personally have never been approached for a project because of my Kaggle Master status or my Kaggle ranking. I have brought up the fact that I am indeed a Kaggle Master at some meetings, but this generally gets ignored, mainly because most people outside of the Data Science field do not know what Kaggle is. However, there may be some value there and I'm sure the top 10-20 kagglers get some promotional value from the platform.

TL;DR (Summary)

Kaggle may cheapen the data science skillset somewhat, providing huge business benefits at very low cost and zero pay to data scientists. However, I like it and will continue to compete on my weekends/evenings as I have found the learning opportunities Kaggle provides are second to none.

Property Market Predictions – PicNet Predictive Analytics

This post explores options for applying Machine Learning techniques to the Australian residential property market, with the objective of predicting insights that would be useful for buyers, sellers and the industry. With access to good data it is possible to predict sale/auction prices by home, street, suburb, municipality, etc. We could also predict the number of registered bidders at auctions or parties through open days.

Data Availability

The success of a Predictive Analytics and Machine Learning project depends totally on the data available and its applicability to the problem at hand.  A careful analysis of available data is required before any work can begin in this space.  But some potential data sources that could be brought together for a predictive model include:

  • RPData: contains ownership, property features, land size and sales history information. This data is generally considered to be of decent quality and can be relied on
  • Real estate marketing strategy used in a property campaign will greatly affect the outcome of a sale/auction. However, access to this data will be difficult and it may need to be omitted. Perhaps two models could be built: one with all participating real estate agencies (those providing this data) and another where this data is unknown.
  • Area demographics information available from census data will also affect predictions.
  • Commercial properties in the area will also affect the outcome of a property campaign. This effect can be both positive and negative depending on the type of commercial property (i.e. café vs factory) and quantity. RPData has some of this data but its quality may not be great. Local councils have commercial property data but this will be very hard to access. The best source may simply be ABN registration details, which are not good quality (many ABNs do not have a corresponding business) but may serve the purpose of showing the volume and type of businesses registered in the area.
  • Domain.com.au/Realestate.com.au web traffic logs:  Interest in a property can be measured by analysing the web traffic activity for a property on property websites.  Details such as number of visits, time on page, bounce rates, etc. could provide real insight into the volume and sentiment of potential buyers.
  • Weather forecasts:  The impact of the weather on open houses, auctions, etc. could be real and this data should also be included in any predictive model.
  • Crime statistics available through various government web properties such as abs.gov.au and data.nsw.gov.au.  This data should also be included in the predictive model.
  • School location and performance data available from myschool.edu.au should be included as local school can affect property prices.
  • Public transport location and frequency in the area also affects property prices.  This data is available from various public transport online properties.
  • Exchange rates and global economic statistics will also affect property prices.  The state of the economy in our local region especially in China and South East Asia will affect prices in certain property markets in Australia.  Exchange rates data is easily available; data for each relevant country may need to be sourced on a country by country basis.
  • The state of the local economy will also affect property prices. This data can be sourced from the RBA web site and perhaps other sources.
  • Social media could be a source of sentiment data showing shortage of property in an area, interest in properties in an area and general sentiment for an area.  Social media data could also show movements in and out of an area.  This data can be bought from Facebook and gnip.com.
  • Number of registered bidders, parties at open houses: real estate agencies have this information, which could be very valuable in many predictions. However, this data could be hard to obtain.
  • Google Trends.  A great tool to analyse interest in a suburb, property, etc.

Given the wealth of data available in this space I believe that a very accurate predictive model can be built.


Possible Predictive Models

Sales/Auction Price

The holy grail of property market prediction is “how much will it go for?”  Whilst a general trend can be identified, the final sale amount will depend on who is there on the day and how much they want the property. However, average figures for an area will be highly predictable, as volume eventually overrides the confounding noise of an individual's effect on a sale/auction.

Auction Day Bidders

If it is possible to get past numbers of registered bidders from real estate agencies, then predicting the future number of bidders would also be highly accurate. This could also be applied to predicting the number of people at open houses.

Area Predictions

More general predictions at the street/suburb/municipality level would also be possible. Once data is aggregated like this, predictions are generally much more accurate, but they of course lose their granularity, which may devalue the prediction.


Proof of Concept

A potential project to gauge the effectiveness of a predictive model would be something like this:

Find the current benchmark

Find current property predictors and use their accuracy as a benchmark. These benchmarks will be used to compare the accuracy of this project to what is currently available. From my initial research, existing predictions are usually very general (suburb, city or state level) or of low accuracy.

Initial data

Depending on the interested stakeholders and their access to good data, this step could range from easy to very difficult. The data we can access at this step could determine the success or failure of the proof of concept.

P.O.C. implementation

Implement a simple model with the currently available data and make predictions for the next period. These predictions will be used to measure against existing benchmarks.
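
Purely as an illustration of how small such a first-pass model can be (the file name, columns and cutoff date below are hypothetical, not from a real dataset), it might look something like this:

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# hypothetical extract of past sales with a handful of property features
sales = pd.read_csv('past_sales.csv', parse_dates=['sale_date'])
features = ['bedrooms', 'land_size', 'distance_to_cbd', 'median_suburb_income']

# train on older sales, predict the next period and compare against the benchmark
train = sales[sales.sale_date < '2015-01-01']
test = sales[sales.sale_date >= '2015-01-01']

model = GradientBoostingRegressor().fit(train[features], train.sale_price)
print(mean_absolute_error(test.sale_price, model.predict(test[features])))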

Iteratively add more data

If the proof of concept shows that we have the potential to make real and accurate property market predictions, then we can start investing in getting more data: talking to real estate web property owners, agencies, councils, etc. We would then add each new data source to the model, measuring its impact on prediction accuracy.

Potential Business Opportunities

Once the model is proven and hard numbers can back its predictive power, several businesses could be developed to take advantage of this information:

Real Estates

Real Estates are always looking for accurate ways to predict the price of their properties.  This system could supplement their trained agents in predicting property prices and developing appropriate marketing campaigns.

Investors/Home Buyers

A service for the public that accurately predicts the price of a residential home would be invaluable to the individual. This service, however, could feed back into the model, driving people towards or away from properties.

Marketeers

If marketing strategies can be compared for effectiveness using these models, then marketeers can use this data to charge for advertising space, knowing and being able to prove its effectiveness.

Insurers

Insurance companies would be very interested in volumes of sales, price of assets, etc.

Banks

Predicting loan volumes and areas of potential growth for their loans would be very valuable for banks. This would help plan future loan amounts and marketing opportunities.

Builders

Demand planning for a future period would be dramatically improved with access to accurate predictions for a given sales period.

Small Business Owners

Many small businesses offer services to new home owners in an area.  These businesses could use future volume predictions for demand planning and marketing campaign planning and they could also use value predictions to identify customers in the correct financial demographics for their services.

PicNet and Predictive Analytics

PicNet is ideally positioned to work with partners on this and many other Predictive Analytics projects, having both the skills and tools required to build these sophisticated data environments and predictive models. Guido Tapia, PicNet's manager of Software and Data, has 20 years of software and machine learning experience, which dramatically increases the chances of success.

If you are interested in Machine Learning or anything else mentioned in this article please feel free to contact Guido Tapia directly.

Fluent python interface for Machine Learning

I often say that Machine Learning is like programming in the 60s: you prepare your program, double check everything, hand your punch cards to the IBM operator, go home and wait. And just like back then, a bug in your code means a huge amount of wasted time. Sometimes these things cannot be helped; for instance, it is not uncommon to leave a feature selection wrapper running over the weekend only to find on Monday morning that you got an out of memory error sometime during the weekend. This article explains one way to reduce these errors and make your code less buggy.

Less code = less bugs

This is the only truth in software development. A bug-free system is only possible if it contains no code, so we should always aim to reduce the amount of code needed. How??

  • Use tried and tested libraries
  • Write reusable code and test this code enough to have confidence that it works
  • Only use this reusable code
  • Whenever possible test your new code
  • Write expressive code.  Make logical bugs obvious.

Libraries

All libraries are full of bugs; again, code = bugs, so this is no fault of the library. However, if a library has lots of users you can be fairly certain that most bugs you will hit have already been found and hopefully fixed. If you are pushing the boundaries of the library you will inevitably find bugs, but this is not the general case. Usually, a well-respected library should be reasonably safe to use and to trust.

Reusable Code

Most libraries you use are generic, meaning they can be used in many contexts. Depending on your job you will need something more specific. So write it: wrap your libraries in an abstraction that is specific to what you do. Do this and then TEST IT!!! Every time you find a use case that your abstraction does not support, write it and test it. Use scikit-learn's dummy datasets to create reproducible test cases that guarantee a given feature works for your use case.
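
As an example of the kind of test I mean, here is a minimal sketch using scikit-learn's dummy dataset generators. It assumes a hypothetical scale helper in your own abstraction (and a hypothetical module name) that standardises columns to zero mean; adapt it to whatever your wrapper actually does:

import pandas as pd
from sklearn.datasets import make_classification
from my_ml_wrappers import scale  # hypothetical module: your own abstraction

def test_scale_returns_zero_mean_columns():
  # reproducible dummy data, so the test always sees the same inputs
  X, _ = make_classification(n_samples=200, n_features=4, random_state=42)
  X = pd.DataFrame(X, columns=['n_1', 'n_2', 'n_3', 'n_4'])
  scaled = scale(X)
  assert scaled.shape == X.shape
  assert scaled.mean().abs().max() < 1e-6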

Try to always maintain this abstraction separate from any specific predictive project and ensure that it is project agnostic.

Fluent interfaces for ML

This article focuses on using your reusable code wisely, aiming to minimise bugs and enhance the expressiveness of the code.

Expressiveness is key to writing logically correct code. If you want all rows with a date greater than the start of this year, it is much easier to catch a logical bug in this code:

filtered = filter(data, greater_than_start_of_this_year)

Instead of this code:

filtered = filter(data, lambda row: row.date_created >=
  date(date.today().year, 1, 1))

Whilst the ‘greater_than_start_of_this_year’ function has the same functionality as the lambda expression in the second example, it differs in several important ways:

  • It is easily tested: it is a separate function, totally isolated from the context it runs in, which makes it much easier to test (see the sketch after this list).
  • It is much, MUCH easier to read and review (it is more expressive).
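
For completeness, the named predicate from the first snippet might be defined and unit tested like this (FakeRow is just a stand-in for whatever row objects you actually use):

from datetime import date

def greater_than_start_of_this_year(row):
  # true for rows created on or after 1 January of the current year
  return row.date_created >= date(date.today().year, 1, 1)

# the function is isolated from any calling context, so testing is trivial
class FakeRow(object):
  def __init__(self, created): self.date_created = created

assert greater_than_start_of_this_year(FakeRow(date.today()))
assert not greater_than_start_of_this_year(FakeRow(date(2000, 1, 1)))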

This expressiveness is sometimes described as ‘declarative’ where the non-expressive form is sometimes called ‘imperative’.  You should always strive to write declarative code as it is easier to read.

One of the best ways I have found to write declarative code is to use fluent interfaces. These interfaces were popularised by jQuery and later by .Net Linq expressions and others. A sample fluent jQuery snippet is:

$("#divid")
    .addClass("classname")
    .css("color", "blue")
    .append("Some new text");

It’s funny, but this ‘fluent’ style of programming was slammed in the late 90s as error prone; Martin Fowler identified ‘Message Chains’ as a code smell that should be remedied. However, I have found the opposite: fluent programming interfaces are easier to read, and that means fewer bugs.

How can this be applied to machine learning?  Easy, have a look at the following code:

# load training data
classifier = linear_model.LogisticRegression()
X, y = load_train_X_and_y(6e6)

# replace missing values with the mode for
#   categorical features and 0 for continuous features
X = X.missing('mode', 0)

# split the training set into categorical and
# numerical features
X_categoricals = X[X.categoricals()]
X_numericals = X[X.numericals()]

# do some feature engineering (add log and linear combinations
# for all numerical features). Scale the numerical dataset and
# append one hot encoded categorical features to this dataset.
# Then cross validate using LogisticRegression classifier and
# 1 million samples.
X_numericals.\
  engineer('lg()').\
  engineer('mult()').\
  scale().\
  append_right(X_categoricals.one_hot_encode()).\
  cross_validate(classifier, 1e6)

The comments in the above code are totally redundant; the code pretty much documents itself:

classifier = linear_model.LogisticRegression()
X, y = load_train_X_and_y(6e6)
X = X.missing('mode', 0)

X_categoricals = X[X.categoricals()]
X_numericals = X[X.numericals()]

X_numericals.\
  engineer('lg()').\
  engineer('mult()').\
  scale().\
  append_right(X_categoricals.one_hot_encode()).\
  cross_validate(classifier, 1e6)

I would then add a comment at the end of this code block, something like:

# 0.98 +/- 0.001 – took 2.5 minutes

Then commit this experiment to git.

The fact that I can trust my reusable code means I just have to review the code I write here, and given the expressiveness of the code, finding bugs is usually very straightforward.

After several experiments this is what a source file will look like. See how easy the code is to read, and how simple it is to review past experiments and think about what works and what does not.

classifier = linear_model.LogisticRegression()
X, y = load_train_X_and_y(6e6)
X = X.missing('mode', 0)

X.\
  engineer('lg()').\
  engineer('mult()').\
  scale().\
  one_hot_encode().\
  cross_validate(classifier, 1e6)
# 0.92 +/-0.0002  

X.\
  engineer('lg()').\
  scale().\
  one_hot_encode().\
  cross_validate(classifier, 1e6)
# 0.90 +/-0.001  

X.\
  engineer('lg()').\
  scale().\
  one_hot_encode().\
  cross_validate(classifier, 1e6)
# 0.86 +/-0.003

My wrapper for pandas and scikit-learn is available here and depends on the naming conventions described here. But I encourage you to write your own: you need confidence in your code, and the only way to achieve that is to write and test it yourself to your own level of comfort.
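
If you do write your own, a chainable wrapper can be surprisingly small. The sketch below is purely illustrative: the method names echo the examples above but this is not the actual library, and cross_validate here takes y rather than a sample count. The essential trick is simply that every step returns the wrapper, so calls can be chained.

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

class FluentFrame(object):
  def __init__(self, df):
    self.df = df

  def engineer(self, op):
    # only 'lg()' is implemented here: add a log feature per column
    if op == 'lg()':
      for c in list(self.df.columns):
        self.df['lg_' + c] = np.log1p(self.df[c].clip(lower=0))
    return self

  def scale(self):
    self.df = pd.DataFrame(StandardScaler().fit_transform(self.df),
                           columns=self.df.columns, index=self.df.index)
    return self

  def cross_validate(self, classifier, y, cv=5):
    return cross_val_score(classifier, self.df.values, y, cv=cv)

Usage then mirrors the experiments above, e.g. FluentFrame(X).engineer('lg()').scale().cross_validate(classifier, y).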

Naming Conventions in Predictive Analytics and Machine Learning

In this article I am going to discuss the importance of naming conventions in ML projects. What do I mean by naming conventions?  I mainly mean using descriptive ways of labelling features in a data set.  What is the reason for this?  Speed of experimentation.

Naming Conventions

  • Categorical columns start with ‘c_’
  • Continuous (numerical) columns start with ‘n_’
  • Binary columns start with ‘b_’
  • Date columns start with ‘d_’

Examples of Benefits

Once your dataset is labelled clearly with these conventions, experimenting with features becomes very fast.

cv = functools.partial(do_cv, LogisticRegression(), n_folds=10, n_samples=10000)
cv(one_hot_encode(X), y) # One hot encode all categorical features
cv(contrasts(X), y) # Do simple contrast coding on all categorical features
cv(bin(X, n_bins=100), y) # Split all continuous features into 100 bins
X = engineer(X, 'c_1(:)c_2') # Create a new categorical feature that is a combination of 2 others
X = engineer(X, 'n_1(*)n_2') # Create a combination of 2 numericals (by multiplication)
X = engineer(X, 'n_1(lg)') # Create a log of feature 'n_1'
X = engineer(X, '(^2)') # Create a square feature for each numerical feature
X = engineer(X, '(lg)') # Create a log feature for each numerical feature

In a real world example this would look something like:

X = remove(X, dates=True)
for n1, n2 in combinations(X, group_size=2, numericals=True): X = engineer(X, n1 + '(*)' + n2)
for c1, c2 in combinations(X, group_size=2, categoricals=True): X = engineer(X, c1 + '(:)' + c2)
X = engineer(X, '(^2)')
X = engineer(X, '(lg)')
cv(X, y)
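
To give a feel for how little code these conventions demand, here is a sketch of what some of the helpers might look like underneath (hypothetical implementations, not the library linked above):

import pandas as pd

def categoricals(X):
  # by convention, categorical columns are prefixed with 'c_'
  return [c for c in X.columns if c.startswith('c_')]

def numericals(X):
  # by convention, continuous columns are prefixed with 'n_'
  return [c for c in X.columns if c.startswith('n_')]

def one_hot_encode(X):
  # encode only the conventionally named categorical columns
  return pd.get_dummies(X, columns=categoricals(X))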

Summary

The resulting DSL from using good naming conventions leads to very clear code that relates directly to the data munging operations being performed. Another benefit is that once your ‘one_hot_encode’ method is written and tested, you can trust it for future projects (as long as they use the same naming conventions).

Using private partial classes to hide implementation details of an interface. Workaround for package level protection in C#

I miss very few things from the Java language, but one gem I really miss is the package-private accessibility modifier. It was so useful: your IDE colour-coded your package classes so you knew they were not part of the public API. You could skim the files in a package (namespace) and see exactly what you needed to look at, ignoring all low-level implementation details.

This unfortunately is not in C#; the closest C# gets is the internal modifier. I personally really dislike this modifier, as I think it has contributed to the nightmare that is 100-200 project solutions, which are so common amongst some .Net shops.

This pattern is an alternative. I think it's a very common alternative, but recently during a code review I explained it to someone who appreciated it, so I thought I'd write it up.

Often, C# developers will do this kind of encapsulation using nested private classes. I have a big problem with this as it leads to those 2-3k line files which are unintelligible. So why not just make those nested classes private partials? Let’s see how this would work.

Let’s assume we have a namespace Clown whose responsibility is to create clowns for customers (i.e. like a clown booking service for kids parties). The customer basically fills in the details that their clown should have and then books a clown for their party.

The details are specified using an instance of ClownSpecifications:

public class ClownSpecifications {
  public bool KidFriendly { get;set; }
  public bool Fun { get;set; }
  public bool Scary { get;set; }
  public bool Creepy { get;set; }
}

The clown itself is simply an implementation of the IClown interface. This interface is the only thing the user ever sees.

public interface IClown {
  void DoYourThing();
}

And then we need a clown factory that builds clowns based on the provided specifications:

public partial class ClownFactory
{
  public IClown CreateClown(ClownSpecifications specs) {
    if (specs.Creepy && specs.Scary) { return new ThatClownFromStephenKingsBook(); }
    if (specs.Creepy) { return new TheJoker(); }
    if (specs.KidFriendly && specs.Fun) { return new Bobo(); }
    if (specs.Fun) { return new RudeClown(); }
    return new GenericBoringClown();
  }

  private partial class ThatClownFromStephenKingsBook {}
  private partial class TheJoker {}
  private partial class Bobo {}
  private partial class RudeClown {}
  private partial class GenericBoringClown {}
}

A few things to notice here. The first is that the ClownFactory itself needs to be marked partial:

public partial class ClownFactory

This is required because each clown implementation will live in its own file, and a nested class can only be split across files if its containing class is also partial. (The implementations must be nested in the first place because there is no way to make a top-level class private.)

Secondly, the implementation classes are defined in a super minimalistic fashion:

private partial class ThatClownFromStephenKingsBook {}

They don’t even implement the IClown interface in this definition.

So now an implementation of IClown looks like this:

public partial class ClownFactory {
  private partial class ThatClownFromStephenKingsBook : IClown {
    public void DoYourThing() {
      // ...
    }
  }
}

That’s it; this is actually working code. And the great thing about it is how clean your namespace now looks.

You can now more easily tell that the public API of the namespace is IClown, ClownSpecifications and ClownFactory. To clean this up even more you could create a new directory called impl and hide the implementations there. I personally do not do this, as Resharper then starts yelling at me about mismatched namespaces.