About Guido Tapia

Over the last 2 years Guido has been involved in building 2 predictive analytics libraries, one for the .Net platform and one for Python. These libraries and Guido's machine learning experience have placed PicNet at the forefront of predictive analytics services in Australia.

For the last 10 years Guido has been the Software and Data Manager at PicNet and in that time has delivered hundreds of successful software and data projects. An experienced architect and all-round 'software guy', Guido has been responsible for giving PicNet its 'quality software provider' reputation.

Prior to PicNet, Guido was in the gaming space working on advanced graphics engines, sound (digital signal processing) engines, AI players and other great technologies.

Interesting Links:

The Value of Data – A short guide to learn how to maximise the value of your data.

This is an excerpt from the white paper available here.

Over the last few years the data industry has been shaken to its core. New names, roles, technologies and products come out on a daily basis. The term “Big Data” has been overused so much that it may be losing some of its meaning.

I meet people regularly and the message I receive over and over again is that it is all overwhelming. I am going to try to address this concern in this paper.

The Purpose of Data

The sole purpose of data in an organisation is to support business decisions. Data can do this in several ways.

  • By communicating information about the past
  • By communicating information about the future
  • By recommending actions for future success

The first of these has long been addressed by traditional reporting and business intelligence tools so I will not spend too much time here. What I really want to address is the final 2 points:

  • Communicating information about the future
  • Recommending actions for future success

The Future

There are several ways that data can help us peek into the future of our organisation. The first and most traditional is the statistician. The statistician is the professional who can look at the available data and give you an inference about the future. A statistician who is highly knowledgeable about the business can combine domain expertise and data experience to give extremely valuable insights into the future.

The second way of getting real future value from data is to use Predictive Analytics. Predictive Analytics is also known as Advanced Analytics, Prescriptive Analytics and Machine Learning but I suggest we just call it Predictive Analytics as it clearly summarises the objective of the technology.

Predictive Analytics

Predictive Analytics is about value. It is how you convert your data into real insights about the future. It is not a product, it is not a platform, and it is not Hadoop or any other vendor name. Predictive Analytics is solely the science of predicting the future from historical data.

Predictive Analytics is also not a person. This is an important point: Joe the statistician cannot handle a 2GB Excel file, and if you ask him to bring Facebook, Twitter and web-traffic data into his inferences he’ll probably have a nervous breakdown.

There is only so much data a human head can manage. Computers, however, do not have this problem; they can handle vast amounts of data from a wide variety of sources. They also do not bring personal biases to the analysis, which has been a problem in the past.

How to “Predictive Analytics”

Until recently, implementing a Predictive Analytics project was the domain of large and sophisticated companies. With the emergence of affordable cloud computing, in-memory analysis and sophisticated modelling tools, combined with the skills of computer programmers, data scientists and analysts, Predictive Analytics is now affordable as a service for most medium-sized enterprises.

The solution can be procured as a service: on/off, pay as you go, and only when you need it. Huge capital investment is no longer required; instead, understanding the need and the challenge, developing proofs of concept and analysing outputs provide an effective and affordable introduction to the benefits of Predictive Analytics.

Predict What

Predictive Analytics can predict numerous variables supported by your historical data. The following are some examples:

  • Potential success of a marketing campaign
  • How best to segment customers
  • What marketing mediums have the best ROI
  • When will my machine fail
  • Why did my machine fail

As long as we have recorded our actions in the past and, at a later date, recorded the results, we can learn from that data and make predictions about the future.

Predictions can be real time or on a weekly/monthly/quarterly basis; it all depends on your needs.

Getting Started

There are several ways to get started. You can recruit your very own Data Scientist; not an easy task considering the high specialisation of these professionals, but it is what a lot of companies are doing.

You could also use a service provider. A good service provider will have a team of IT and data people including Data Scientists that have experience in doing these types of projects for customers in your industry.

At PicNet we always recommend that our customers start with a proof of concept. Some of these Predictive Analytics projects can take a long time to implement and integrate back into the processes of the organisation, so it’s always best to take small bites and see if there is value to be had. We usually go for a 2-4 week project which goes something like this:

  • We get potential questions about the future that management would like answered
  • We audit all data and its quality in the organisation
  • We prepare and clean this data
  • We get external data if it can help the prediction (census, web traffic, Facebook, Twitter, weather, etc.)
  • We build a simple predictive model, perhaps on a subset of the data
  • We provide a report with predictions for the next period (2 months for instance)

This report can then be used by the business to test the value and accuracy of the predictions. If the business approves we then push that system into production. This will mean that you may get weekly, daily or real time predictions as the business requires. The length of these production implementations varies greatly depending on the systems that need to be integrated with and many other factors.

Summary

As a manager charged with the responsibility of curating your organisation’s data resources, you should forget vendors, platforms and NoSQL databases; forget about the 3 Vs (or is it 4?) of big data. As a manager your only concern should be the only V that matters, and that is Value. And if you want to get value from your data then consider Predictive Analytics.

For any additional information please feel free to contact me on my details below.

 

Vowpal Wabbit for Windows and Binary for Win x64

Getting VW working on Windows is a real pain.  Even though I had the whole environment set up as described in the readme, it still took me a good couple of hours to build.

So, with absolutely no guarantee or support options, here is my build of vw.exe version 7.7.0.  It was built on a Windows 7 x64 box and I have only tested it on this one box, so use at your own risk!!

If you were after the executable only then there is no need to continue reading; the rest is about Python.

So I started playing around with vw.exe and quickly realised that the command line is a terrible place to experiment with a machine learning algorithm.  I started looking for Python wrappers and found this, which is a nice wrapper but does not work on Windows.  So I hacked it up a little (with no permission, sorry Joseph Reisinger) and have a Windows-friendly version with updated command line options here.

So how do you use the python wrapper?

First we need to convert your data into VW input format.  I use my pandas extensions helper method: _df_to_vw

You will be able to turn this into a generic converter very easily; in fact there are already plenty around, such as:

https://github.com/zygmuntz/phraug2
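
For illustration, a minimal sketch of such a converter (not my _df_to_vw, just a bare-bones example that assumes numeric features and a single 'f' namespace) could look like this:

# bare-bones sketch: numeric features + label -> VW input lines
# each line has the form: "<label> |<namespace> <feature>:<value> ..."
import pandas as pd

def df_to_vw_lines(X, y):
  lines = []
  for (_, row), label in zip(X.iterrows(), y):
    feats = ' '.join('%s:%s' % (col, val) for col, val in row.items() if val != 0)
    lines.append('%s |f %s' % (label, feats))
  return lines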

So now you have your files converted, let’s use the classifier:

# training_vw_file and testing_vw_file are open file handles to the converted VW files
training_lines = training_vw_file.readlines()
testing_lines = testing_vw_file.readlines()
VowpalWabbitClassifier().fit(training_lines).\
  predict(testing_lines)

The VowpalWabbitClassifier is fully scikit-learn compatible, so use it in your cross validations, grid searches, etc. with ease.  Just have a look at the code to see all the options it supports, and if there are missing options please fork and submit back to me.

The value of Kaggle to Data Scientists

Kaggle is an interesting company.  It provides companies with a way to access Data Scientists in a competition-like format at a very low cost.  The value proposition of Kaggle for companies is very clear; this article will focus on the flip side of that equation: what value does Kaggle give to the data scientists?

This blog post contains my own personal opinions; however, I have tried to be as balanced as possible and to present both the benefits and disadvantages of the platform.

Disadvantages of Kaggle to the Data Scientist

Cheapens the Skill

Most Kaggle competitions award prizes in the order of 5k – 50k.  There are a few competitions that get a lot of media attention as they have much higher prizes, however these are very rare.  Given the chances of winning a Kaggle competition and the prize money involved, the monetary returns are negligible.  Therefore we have highly educated and skilled people providing an extremely valuable service to commercial organisations for free.  This reeks of software development’s Open Source commercialisation strategy that aims to destroy competitors by providing a free product and charging for other services (like support).  In this case Kaggle’s competitors are the Data Scientists themselves, as they could be consulting for organisations directly instead of going through Kaggle.  This is an interesting argument that could be the subject of its own blog post so let’s put it aside for now.

Kaggle Encourages Costly Models over Fast Models

When competing, participants have no incentive to create robust, fast, bug-free models.  The only incentive is to get the best possible predictive model, disregarding all other factors.  This is very far removed from the real world, where accuracy compromises are made regularly, mainly for performance reasons.

Kaggle Does not Teach Data Prep, Data Source Merging, Communicating w/ Business, etc.

Competitions on Kaggle go straight for the last 5-10% of the job.  They assume all data sources have been merged, management has decided on the best question to ask the data, and IT has prepared and cleaned the data.  This again is very different from real life projects and could be giving some Data Scientists, especially inexperienced ones, a false view of the industry.

Kaggle Competitions Take too Long to Compete In

I would guess that most top 50 competitors in any competition would put in around 40-200 hours in a single competition.  This is a huge amount of time, so what does the competitor get out of it?

Benefits of Kaggle to the Data Scientist

Opportunities for Learning

Kaggle is the best source on the internet at the moment for budding data scientists to learn and hone their craft.  I am confident in the accuracy of this statement, having seen many people start out on simple competitions and slowly progress over time to become highly skilled data scientists.  This great blog post from Triskelion demonstrates this clearly.  This benefit cannot be overstated; data science is hard!! You need to practice and this is the place to do it.

Opportunities to Discuss and ask Questions of Other Data Scientists

The Kaggle forums are a great place to ask questions and expand your knowledge.  I regularly check these forums even if I’m not competing as they are a treasure trove of wonderful ideas and supportive and highly skilled individuals.

The Final 10%

The final 10% of a Data Science project is the machine learning / predictive analytics modelling.  The other 90% is administrative, managerial, communications, development and business analysis tasks.  These tasks are very important, but in all honesty an experienced manager has these skills.  The technical skills needed in this 90% are also easily available, as most experienced developers can merge data sources and clean datasets.  It is the final 10% where a great data scientist pays for himself.  This is where Kaggle competitions sharpen your skills, exactly where you want to be focusing your training.

Try out Data Science

Something that you quickly learn from any Predictive Analytics project is the monotony of data science.  It can be extremely painful and is definitely not suitable for everyone.  Many skilled developers have the maths, stats and other skills for Data Science but they may lack the patience and pedantry required to be successful in the field.  Kaggle gives you the chance to try out the career; I’m sure many have decided it’s just not for them after a competition or two.

Promotional Opportunities

I doubt Kaggle provides much value in terms of promotion for the individual.  I personally have never been approached for a project because of my Kaggle Master status or my Kaggle ranking.  I have brought up the fact that I am indeed a Kaggle Master at some meetings, but this generally gets ignored, mainly because most people outside of the Data Science field do not know what Kaggle is.  However, there may be some value there, and I’m sure the top 10-20 Kagglers must get some promotional value from the platform.

TL;DR (Summary)

Kaggle may cheapen the data science skillset somewhat, providing huge business benefits at very low cost and zero pay to data scientists. However, I like it and will continue to compete on my weekends/evenings as I have found the learning opportunities Kaggle provides are second to none.

Property Market Predictions – PicNet Predictive Analytics

This post explores options for applying Machine Learning techniques to the Australian residential property market, with the objective of producing predictions that would be useful for buyers, sellers and the industry.  With access to good data it is possible to predict sale/auction prices by home, street, suburb, municipality, etc.  We could also predict the number of registered bidders at auctions or the number of parties through on open days.

Data Availability

The success of a Predictive Analytics and Machine Learning project depends totally on the data available and its applicability to the problem at hand.  A careful analysis of available data is required before any work can begin in this space.  Some potential data sources that could be brought together for a predictive model include:

  • RPData: RPData contains ownership, property features, land size and sales history information.  This data is generally considered to be of decent quality and can be relied on.
  • The real estate marketing strategy used in a property campaign will greatly affect the outcome of a sale/auction.  However, access to this data will be difficult and it may need to be omitted.  Perhaps 2 models could be built: one for participating real estate agencies (that are providing this data) and another where this data is unknown.
  • Area demographics information available from census data will also affect predictions.
  • Commercial properties in the area will also affect the outcome of a property campaign.  This effect can be both positive and negative depending on the type of commercial property (i.e. café vs factory) and quantity.  RPData has some of this data but the quality may not be great.  Local councils have commercial property data but this will be very hard to access.  The best source for this data may simply be ABN registration details, which are not of good quality (many ABNs do not have a corresponding business) but may serve the purpose of showing the volumes and types of businesses registered in the area.
  • Domain.com.au/Realestate.com.au web traffic logs:  Interest in a property can be measured by analysing the web traffic activity for a property on property websites.  Details such as number of visits, time on page, bounce rates, etc. could provide real insight into the volume and sentiment of potential buyers.
  • Weather forecasts:  The impact of the weather on open houses, auctions, etc. could be real and this data should also be included in any predictive model.
  • Crime statistics available through various government web properties such as abs.gov.au and data.nsw.gov.au.  This data should also be included in the predictive model.
  • School location and performance data available from myschool.edu.au should be included as local school can affect property prices.
  • Public transport location and frequency in the area also affects property prices.  This data is available from various public transport online properties.
  • Exchange rates and global economic statistics will also affect property prices.  The state of the economy in our local region especially in China and South East Asia will affect prices in certain property markets in Australia.  Exchange rates data is easily available; data for each relevant country may need to be sourced on a country by country basis.
  • The state of the local economy will also affect property prices. This data can be sourced from the RBA web site and perhaps other sources.
  • Social media could be a source of sentiment data showing shortage of property in an area, interest in properties in an area and general sentiment for an area.  Social media data could also show movements in and out of an area.  This data can be bought from Facebook and gnip.com.
  • Number of registered bidders and parties at open houses:  Real estate agencies have this information, which could be very valuable in many predictions.  However, this data could be hard to obtain.
  • Google Trends:  A great tool to analyse interest in a suburb, property, etc.

Given the wealth of data available in this space I believe that a very accurate predictive model can be built.

 

Possible Predictive Models

Sales/Auction Price

The holy grail of property market prediction is “how much will it go for?”  Whilst a general trend can be identified, the sale amount at the end of the day will depend on who is there on the day and how much they want the property.  However, average figures for an area will be highly predictable, as volume eventually overrides the confounding noise of an individual’s effect on a sale/auction.

Auction Day Bidders

If it is possible to get past numbers of registered bidders from real estate agencies then predicting future numbers of bidders could also be highly accurate.  This could also be applied to predicting the number of people at open houses.

Area Predictions

More general predictions at the street/suburb/municipality level would also be possible.  Once data begins to be aggregated like this, predictions are generally much more accurate, but they of course lose their granularity, which may devalue the prediction.

 

Proof of Concept

A potential project to gauge the effectiveness of a predictive model would be something like this:

Find the current benchmark

Find current property predictors and use their accuracy as a benchmark.  These benchmarks will be used to compare the accuracy of this project to what is currently available.  From my initial research these predictions are usually either very general (suburb, city or state level) or of low accuracy.

Initial data

Depending on the interested stakeholders and their access to good data, this step could range from easy to very difficult.  The data we can access at this step could mean the success or failure of the proof of concept.

P.O.C. implementation

Implement a simple model with the current available data and make predictions for the next period.  These predictions will be used to measure against existing benchmarks.

Iteratively add more data

If the proof of concept shows that we have the potential to make real and accurate property market predictions then we can start investing in getting more data: talking to real estate web property owners, real estate agencies, councils, etc.  We would then add each new data source to the model, measuring its impact on prediction accuracy.

Potential Business Opportunities

Once the model is proven and hard numbers can back its predictive power, several businesses could be developed to take advantage of this information:

Real Estates

Real estate agencies are always looking for accurate ways to predict the price of their properties.  This system could supplement their trained agents in predicting property prices and developing appropriate marketing campaigns.

Investors/Home Buyers

A service for the public to accurately predict the price of a residential home would be invaluable to the individual.  This service, however, could create a negative feedback loop with the model, driving people to or from properties.

Marketeers

If marketing strategies can be compared for effectiveness using these models then marketeers can use this data to charge for advertising space, knowing and being able to prove its effectiveness.

Insurers

Insurance companies would be very interested in volumes of sales, price of assets, etc.

Banks

Predicting loan volumes and areas of potential growth for their loans would be very valuable for banks. This would help plan future loan amounts and marketing opportunities.

Builders

Demand planning for a future period would be dramatically improved with access to accurate predictions for a given sales period.

Small Business Owners

Many small businesses offer services to new home owners in an area.  These businesses could use future volume predictions for demand planning and marketing campaign planning and they could also use value predictions to identify customers in the correct financial demographics for their services.

PicNet and Predictive Analytics

PicNet is ideally positioned to work with partners on this and many other Predictive Analytics projects, having both the skills and tools required to build these sophisticated data environments and predictive models.  Guido Tapia, PicNet’s manager of Software and Data, has 20 years of software and machine learning experience, which increases the chances of success dramatically.

If you are interested in Machine Learning or anything else mentioned in this article please feel free to contact Guido Tapia directly.

Fluent python interface for Machine Learning

I often say that Machine Learning is like programming in the 60s: you prepare your program, double check everything, hand in your punch cards to the IBM operator, go home and wait.  And just like back then, a bug in your code means a huge amount of wasted time.  Sometimes these things cannot be helped; for instance, it is not uncommon to leave a feature selection wrapper running over the weekend only to find on Monday morning that you got an out of memory error sometime during the weekend.  This article explains one way to reduce these errors and make your code less buggy.

Less code = less bugs

This is the only truth in software development.  A bug-free system is only possible if it contains no code, so we should always aim to reduce the amount of code needed.  How?

  • Use tried and tested libraries
  • Write reusable code and test this code enough to have confidence that it works
  • Only use this reusable code
  • Whenever possible test your new code
  • Write expressive code.  Make logical bugs obvious.

Libraries

All libraries are full of bugs; again, code = bugs, so this is no fault of the library.  However, if a library has lots of users you can be fairly certain that most bugs you will hit have been found and hopefully fixed.  If you are pushing the boundaries of the library you will inevitably find new bugs, but this is not the general case.  Usually, a well-respected library should be reasonably safe to use and to trust.

Reusable Code

Most libraries you use are generic, meaning that they can be used in many contexts.  Depending on your job you will need something more specific.  So write it: wrap your libraries in an abstraction that is specific to what you do.  Do this and then TEST IT!!!  Every time you find a use-case that your abstraction does not support, write it and test it.  Use scikit-learn’s dummy datasets to create reproducible test cases that guarantee a certain feature works for your given use case.

Try to always maintain this abstraction separate from any specific predictive project and ensure that it is project agnostic.
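
For example, a tiny reusable helper and a reproducible test for it, built on scikit-learn’s dummy dataset generator, might look like this (the names here are illustrative only, not part of my wrapper):

# a tiny reusable 'scale' helper and a reproducible test for it
import numpy as np
from sklearn.datasets import make_classification

def scale(X):
  # standardise each column to zero mean and unit variance
  return (X - X.mean(axis=0)) / X.std(axis=0)

def test_scale_is_zero_mean_unit_variance():
  X, _ = make_classification(n_samples=200, n_features=4, random_state=42)
  scaled = scale(X)
  assert np.allclose(scaled.mean(axis=0), 0)
  assert np.allclose(scaled.std(axis=0), 1)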

Fluent interfaces for ML

This article focuses on using your reusable code wisely, aiming to minimise bugs and enhance the expressiveness of the code.

Expressiveness is key to writing logically correct code.  If you want all rows with a date greater than the start of this year, it is much easier to catch a logical bug in this code:

filtered = filter(data, greater_than_start_of_this_year)

Instead of this code:

filtered = filter(data, lambda row: row.date_created >=
  date(date.today().year, 1, 1))

Whilst the ‘greater_than_start_of_this_year’ function has the same functionality as the lambda expression in the second example, it differs in several important ways:

  • It is easily tested:  it is a separate function, totally isolated from the context it runs in, which makes testing much easier.
  • It is much, MUCH easier to read and review (it is more expressive).

This expressive style is sometimes described as ‘declarative’, whereas the non-expressive form is sometimes called ‘imperative’.  You should always strive to write declarative code as it is easier to read.
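
To make the testability point concrete, the named predicate (and a trivial test for it) might look something like this; the row structure is assumed from the lambda above:

# the named predicate from the first example, isolated and trivially testable
from datetime import date

def greater_than_start_of_this_year(row):
  return row.date_created >= date(date.today().year, 1, 1)

def test_greater_than_start_of_this_year():
  class FakeRow: date_created = date(date.today().year, 6, 1)
  assert greater_than_start_of_this_year(FakeRow())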

One of the best ways I have found to write declarative code is to use fluent interfaces.  These interfaces were popularised by jQuery and then by .Net LINQ expressions and others.  A sample fluent jQuery snippet is:

$("#divid")
    .addClass("classname")
    .css("color", "blue")
    .append("Some new text");

It’s funny, but this ‘fluent’ style of programming was slammed in the late 90s as error prone; Martin Fowler identified ‘Message Chains’ as a code smell that should be remedied.  However, I have found the opposite: fluent programming interfaces are easier to read, and this means fewer bugs.

How can this be applied to machine learning?  Easy, have a look at the following code:

# load training data
classifier = linear_model.LogisticRegression()
X, y = load_train_X_and_y(6e6)

# replace missing values with the mode for
#   categorical features and 0 for continuous features
X = X.missing('mode', 0)

# split the training set into categorical and
# numerical features
X_categoricals = X[X.categoricals()]
X_numericals = X[X.numericals()]

# do some feature engineering (add log transforms and multiplicative
# combinations for all numerical features). Scale the numerical dataset
# and append one-hot encoded categorical features to this dataset.
# Then cross validate using a LogisticRegression classifier and
# 1 million samples.
X_numericals.\
  engineer('lg()').\
  engineer('mult()').\
  scale().\
  append_right(X_categoricals.one_hot_encode()).\
  cross_validate(classifier, 1e6)

The comments in the above code are totally redundant; the code pretty much documents itself.  See:

classifier = linear_model.LogisticRegression()
X, y = load_train_X_and_y(6e6)
X = X.missing('mode', 0)

X_categoricals = X[X.categoricals()]
X_numericals = X[X.numericals()]

X_numericals.\
  engineer('lg()').\
  engineer('mult()').\
  scale().\
  append_right(X_categoricals.one_hot_encode()).\
  cross_validate(classifier, 1e6)

I would then add a comment at the end of this code block, something like:

# 0.98 +/- 0.001 – took 2.5 minutes

Then commit this experiment to git.

The fact that I can trust my reusable code means I just have to review the code I write here, and given the expressiveness of the code, finding bugs is usually very straightforward.

After several experiments this is what a source file will look like.  See how easy the code is to read.  See how simple it is to review past experiments and think about what works and does not work.

classifier = linear_model.LogisticRegression()
X, y = load_train_X_and_y(6e6)
X = X.missing('mode', 0)

X.\
  engineer('lg()').\
  engineer('mult()').\
  scale().\
  one_hot_encode().\
  cross_validate(classifier, 1e6)
# 0.92 +/-0.0002  

X.\
  engineer('lg()').\
  scale().\
  one_hot_encode().\
  cross_validate(classifier, 1e6)
# 0.90 +/-0.001  

X.\
  engineer('lg()').\
  scale().\
  one_hot_encode().\
  cross_validate(classifier, 1e6)
# 0.86 +/-0.003

My wrapper for pandas and scikit-learn is available here and depends on naming conventions described here.  But I encourage you to write your own.  You need confidence in your code and the only way to achieve that is to write it and test it yourself to your own level of comfort.
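
If you do write your own, a minimal sketch of the fluent pattern (the names here are illustrative only, not the wrapper linked above) could be as simple as wrapping a DataFrame and returning a wrapper from every operation:

# minimal fluent wrapper sketch: every operation returns a new wrapper so calls
# can be chained, e.g. FluentFrame(X).scale().one_hot_encode().cross_validate(clf, y)
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score

class FluentFrame:
  def __init__(self, df):
    self.df = df

  def scale(self):
    # standardise the numeric columns, leave the rest untouched
    out = self.df.copy()
    num = out.select_dtypes(include=[np.number]).columns
    out[num] = (out[num] - out[num].mean()) / out[num].std()
    return FluentFrame(out)

  def one_hot_encode(self):
    return FluentFrame(pd.get_dummies(self.df))

  def cross_validate(self, classifier, y, folds=5):
    return cross_val_score(classifier, self.df, y, cv=folds).mean()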

Naming Conventions in Predictive Analytics and Machine Learning

In this article I am going to discuss the importance of naming conventions in ML projects. What do I mean by naming conventions?  I mainly mean using descriptive ways of labelling features in a data set.  What is the reason for this?  Speed of experimentation.

Naming Conventions

  • Categorical columns start with ‘c_’
  • Continuous (numerical) columns start with ‘n_’
  • Binary columns start with ‘b_’
  • Date columns start with ‘d_’

Examples of Benefits

Once your dataset is labelled clearly with these conventions, experimenting with features becomes very fast.

cv = functools.partial(do_cv, LogisticRegression(), n_folds=10, n_samples=10000)
cv(one_hot_encode(X), y) # One hot encode all categorical features
cv(contrasts(X), y) # Do simple contrast coding on all categorical features
cv(bin(X, n_bins=100), y) # Split all continuous features into 100 bins
X = engineer(X, 'c_1(:)c_2') # Create a new categorical feature that is a combination of 2 others
X = engineer(X, 'n_1(*)n_2') # Create a combination of 2 numericals (by multiplication)
X = engineer(X, 'n_1(lg)') # Create a log of feature 'n_1'
X = engineer(X, '(^2)') # Create a square feature for each numerical feature
X = engineer(X, '(lg)') # Create a log feature for each numerical feature

In a real world example this would look something like:

X = remove(X, dates=True)
for n1, n2 in combinations(X, group_size=2, numericals=True): X = engineer(X, n1 + '(*)' + n2)
for c1, c2 in combinations(X, group_size=2, categoricals=True): X = engineer(X, c1 + '(:)' + c2)
X = engineer(X, '(^2)')
X = engineer(X, '(lg)')
cv(X, y)
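
To illustrate why this works, the convention-driven helpers themselves can stay tiny and completely generic.  A minimal sketch (illustrative names only, not my actual library) could be:

# generic helpers driven purely by the column-name prefixes
import pandas as pd

def categoricals(X): return [c for c in X.columns if c.startswith('c_')]
def numericals(X): return [c for c in X.columns if c.startswith('n_')]

def one_hot_encode(X):
  # encode every 'c_' column, leave everything else untouched
  return pd.get_dummies(X, columns=categoricals(X))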

Summary

The DSL that results from using good naming conventions leads to very clear code that relates directly to the data munging operations being done.  Another benefit is that once your ‘one_hot_encode’ method is written and tested, you can trust it for future projects (as long as they use the same naming conventions).

Using private partial classes to hide implementation details of an interface. Workaround for package level protection in C#

I miss very few things from the Java language, but one gem I really miss is the package-private accessibility modifier. This was so useful: your IDE colour-coded package-private classes so you knew they were not part of the public API. You could skim-read the files in a package (namespace) and see exactly what you needed to look at, ignoring all low-level implementation details.

This unfortunately is not in C#; the closest C# gets is the internal modifier. I personally really dislike this modifier, as I think it has contributed to the nightmare of 100-200 project solutions which are so common amongst some .Net shops.

This pattern is an alternative. I think it’s a very common alternative, but recently during a code review I explained it to someone who appreciated it, so I thought I’d write it up.

Often, C# developers will do this kind of encapsulation using nested private classes. I have a big problem with this as it leads to those 2-3k line files which are unintelligible. So why not just make those nested classes private partials? Let’s see how this would work.

Let’s assume we have a namespace Clown whose responsibility is to create clowns for customers (i.e. like a clown booking service for kids parties). The customer basically fills in the details that their clown should have and then books a clown for their party.

The details are specified using an instance of ClownSpecifications:

public class ClownSpecifications {
  public bool KidFriendly { get;set; }
  public bool Fun { get;set; }
  public bool Scary { get;set; }
  public bool Creepy { get;set; }
}

The clown itself is simply an implementation of the IClown interface. This interface is the only thing the user ever sees.

public interface IClown {
  void DoYourThing();
}

And then we need a clown factory that builds clowns based on the provided specifications:

public partial class ClownFactory
{
  public IClown CreateClown(ClownSpecifications specs) {
    if (specs.Creepy && specs.Scary) { return new ThatClownFromStephenKingsBook(); }
    if (specs.Creepy) { return new TheJoker(); }
    if (specs.KidFriendly && specs.Fun) { return new Bobo(); }
    if (specs.Fun) { return new RudeClown(); }
    return new GenericBoringClown();
  }

  private partial class ThatClownFromStephenKingsBook {}
  private partial class TheJoker {}
  private partial class Bobo {}
  private partial class RudeClown {}
  private partial class GenericBoringClown {}
}

A few things to notice here. The first is that the ClownFactory itself needs to be marked partial:

public partial class ClownFactory

This is required simply because there is no way to create top level private partial classes.

Secondly, the implementation classes are defined in a super minimalistic fashion:

private partial class ThatClownFromStephenKingsBook {}

They don’t even implement the IClown interface in this definition.

So now an implementation of IClown looks like this:

public partial class ClownFactory {
  private partial class ThatClownFromStephenKingsBook : IClown {
    public void DoYourThing() {
      // ...
    }
  }
}

That’s it; this is actually working code. And the great thing about it is how clean your namespace now looks.

You can now more easily tell that the public API of the namespace is IClown, ClownSpecifications and ClownFactory. To clean this up even more you could create a new directory called impl and hide the implementations there. I personally do not do this as then Resharper starts yelling at me about mismatched namespaces.

Solid Principles: Part Two – Open Closed Principle

‘Software entities (classes, modules, functions, etc.) should be open for extension but closed for modification.’ [R. Martin]

The OCP is a set of strategies based on inheritance and polymorphism that aims to make code more extensible with fewer side effects when extending. The way side effects are controlled is by adding functionality to the system without modifying any existing code.
The key to the OCP is programming to abstractions, i.e. interfaces and base classes. But not only programming to abstractions; doing it well, without falling into the common pitfalls that violate the OCP.
An example is in order. Let’s assume we have a Library stock management system. The system has several types of Stock items. These could be: Magazines, Periodicals, Books, DVDs, CDs, etc. Each of these items may have its own checkout rules and we might program this as follows:

class OrderProcess:
  def checkout_item(item):
    var due_date
    switch (item.type)
      case ITEM_TYPE_MAGAZINE:
        due_date = now.AddWeeks(2)
      case ITEM_TYPE_DVD:
        due_date = now.AddWeeks(1)
      case ITEM_TYPE_BOOK:
        due_date = now.AddWeeks(6)
      ...

This is a naive implementation because every time a new stock item type is added we will need to add a case to this statement (and to any other switch statements that switch on item.type). A better implementation, one that respects the OCP, would be:

interface ItemType:
  def get_due_date(from)
  def checkout()

And we could have implementations like this:

class MagazineItemType implements ItemType:
  def get_due_date(from):
    return from.AddWeeks(2)

  def checkout():
    var due = this.get_due_date(now)
    // Any other checkout processes applicable to Magazines

class DVDItemType implements ItemType:
  def get_due_date(from):
    return from.AddWeeks(1)

  def checkout():
    var due = this.get_due_date(now)
    // Any other checkout processes applicable to DVDs

class OrderProcess:
  def checkout_item(ItemType item):
    item.checkout()

So now, if for any reason we need to add a new Item Type, say Blu-ray, we can just create a new class and the system will be able to handle it without modifying any existing code.
The reason the system now ‘magically’ works with a new Item Type (Blu-ray) without modifying any code is that the high level functions of the system do not know anything about Magazines, DVDs, etc. They simply know about the abstraction, which is the ItemType interface.
As we’ve seen, switch statements or long if chains can be a smell that you’re violating the OCP. Other signs include having code like this in your system:

  def checkout_item(item):
    if (item.type is ITEM_TYPE_BLU_RAY)
      throw error ('Blu-ray cannot be checked out')
    item.checkout()

Having high level functions that are even remotely aware of concrete types is an indication of future heartache. It is also important to note that even low level types should not know about each other. For instance, a Magazine should not know about DVDs. These relationships are also violations of the OCP.
It is important to note that all of these techniques are ways to manage the complexity of source code. Inheritance and polymorphism are themselves complex tools, so as always, you need to be judicious in your use of inheritance hierarchies. It is important to adhere to the OCP when you think part of a system is likely to change. For instance, in the example above, it is perfectly reasonable for there to be new Stock Item Types in the future, so creating a hierarchy of these types is a good idea. Other areas of the system which are not likely to change can violate the OCP. The point is not to be religious about any technique, but if you are going to ignore an OCP violation (or any other guideline), do so consciously.

Solid Principles: Part One

Over the coming weeks I plan to do a bit of a study on the SOLID principles.  SOLID stands for:

  • Single Responsibility
  • Open-Closed
  • Liskov Substitution
  • Interface Segregation
  • Dependency Inversion

The term was coined by Robert Martin [http://cleancoder.posterous.com/].

The five principles, if used judiciously, should result in code that is easier to maintain, is highly decoupled and allows specific implementation details to be changed without (or with less) friction.

Like every principle/guideline in software development, the SOLID principles need to be understood but not applied blindly.  It is very easy to over-architect a solution by being too dogmatic about the use of any guideline.  You do, however, need to be aware when a violation of the SOLID principles occurs and make that decision based on its context and merits.

Single Responsibility Principle – SOLID Principles

Robert Martin describes the Single Responsibility Principle (SRP) as: “A class should have only one reason to change” (1).  I think the best way to get our heads around this concept is to view some code.  So let’s consider the following example, a business rules object that defines how jobs are handled in an issue tracking system.

class JobHandler(db, query_engine, email_sender):
  this.db = db
  this.query_engine = query_engine
  this.email_sender = email_sender

  def add_job(job):
    this.db.add(job)

  def delete_job(job):
    this.db.delete(job)

  def update_job(job):
    this.db.update(job)

  def email_user_about_job(job):
    this.email_sender.send(job.get_html_details(), job.user.email)

  def find_all_jobs_assigned_to(user):
    return this.query_engine.run("select all jobs assigned to: ", user)

  def find_all_completed_jobs(user):
    return this.query_engine.run("select all jobs with status: ", "completed")

So, what is the jobs handler doing?

  • Doing basic CRUD operations on the jobs (add/delete/update).  We could also assume that we would do validation in these methods.
  • Doing queries on jobs.  These could potentially get very complex if we add pagination support, etc.
  • Doing workflow functions, such as email users.

Let’s critically review this code.  What can we see?

  • There are 3 dependencies (db, query_engine and email_sender)
  • There is low cohesion (http://en.wikipedia.org/wiki/Cohesion_(computer_science)), which is the ‘smell’ that Robert Martin was trying to address with this principle.  Basically, low cohesion means that we have dependencies that are only used by part of a class.  Low cohesion is usually an indication that a class is doing too much (or violates the Single Responsibility Principle).
  • Names like Handler, Controller, Manager, Oracle and Deity are all indications that you have a class that could be too loosely defined and which in turn may have too many responsibilities.
  • If we wanted a unit test to test the workflow of the system we would also need to instantiate a db and a query_engine dependency.  This adds friction to our tests and usually results in poor test coverage.

I think it’s clear that the above object has 3 obvious responsibilities:

  • Performing validation and CRUD like operations on a job
  • Performing complex queries on jobs
  • Managing workflows as they relate to jobs

So perhaps a better design would be something like:

class JobRepository(db):
  this.db = db

  def add_job(job):
    this.db.add(job);

  def update_job(job):
    this.db.update(job);

  def delete_job(job):
    this.db.delete(job);

class JobFinder(query_engine):
  this.query_engine = query_engine

  def find_all_jobs_assigned_to(user):
    return this.query_engine.run("select all jobs assigned to: ", user)

  def find_all_completed_jobs(user):
    return this.query_engine.run("select all jobs with status: ", "completed")

class JobWorkFlow(email_sender):
  this.email_sender = email_sender

  def email_user_about_job(job):
    this.email_sender.send(job.get_html_details(), job.user.email)

So let’s critically analyse this code.

  • We can see we have increased the number of classes to 3.  This arguably increases the complexity of the system as it adds modules that need to be understood.
  • We can see that each class is highly cohesive and very small and focused.  This is a good thing.
  • We can see that any unit test only has a single dependency to initialise or mock to test a class.  This will encourage developers to keep the test quality up to a good standard.
  • If we place these 3 classes in a well-named namespace such as ‘jobs’ it could in fact ease the complexity of the system (contradicting the first item in this list), as we could just browse the file names, without even opening them, to know exactly what each class does.

Conclusion

Conclusion? Well there really is no conclusion.  It is important to realise that this is a trivial example whose responsibilities were obvious.  Many times separating concerns is not as easy and decoupling these concerns may be very difficult.

In the example above I would comfortably say that the refactored code is better than the original code, but this may not be the case with a real world example. Now, when you see a class that has low cohesion, too much responsibility, too many reasons to change, too many dependencies, etc., you can recognise this as a smell and a violation of the SRP.  You can then make an educated decision as to whether refactoring the code will result in better, cleaner, more maintainable code.

On the other hand, refactoring is a hard process, and the more you do it the easier it becomes, so do not be scared to take a little bit of time to refactor something like this.  You will find that the case for not fixing SRP violations becomes less and less compelling.

A faster, better sql server and sql azure log appender for log4net

After much effort trying to get the default DatabaseAppender working in log4net, I decided to write my own.  So, with the help of one of my alpha geeks (tnx Chinsu), we created this awesome (it’s awesome because it uses batch inserts and actually works on Azure) database appender for log4net.

https://gist.github.com/965366

Use at your own risk: I did have to modify the code slightly to remove an internal dependency and I did not test it in prod after the modification.  However, the modification was very minor and should not cause any issues.

Also remember to schedule a service that deletes your old log files.

Guido Tapia