72.66% accuracy on latest test

I rewrote the pruning algorithm yesterday to use a method that’s slightly less accurate, but much quicker. After an exhaustive test that I ran all night on a data set of 356 sales/fails from my main business, it ended up giving a correct judgement 72.66% of the time.

Example

Let’s say you get 200 leads a week, but only have time to follow up on 100 of them before the next week starts and the next 200 come in.

Let’s say your usual sale rate is 50%.

That means that every week, if you follow the leads one by one, you will make 50 sales.

But what if you could order them so that the more likely sales were first?

Because the SalePredict engine has roughly 73% accuracy, about 73 of the 100 actual sales hidden in those 200 leads should end up in the top half of the sorted list, which means you will make an average of 73 sales a week.

That’s nearly a 50% increase in sales (73 sales instead of 50).
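To make the arithmetic explicit, here is a quick back-of-the-envelope version in Python. The key assumption (the same one made above) is that the engine’s accuracy translates directly into the share of the real sales that land in the top half of the sorted list:

  # 200 leads a week, time to follow up on 100 of them, usual sale rate of 50%
  leads_per_week = 200
  follow_ups_per_week = 100
  base_sale_rate = 0.50
  engine_accuracy = 0.73  # the 72.66% test result, rounded

  # Unordered: you work through the leads as they arrive,
  # so half of the 100 you reach turn into sales.
  unordered_sales = follow_ups_per_week * base_sale_rate     # 50

  # Ordered by the engine: assume accuracy = share of the real sales
  # that the engine ranks into the top 100.
  ordered_sales = follow_ups_per_week * engine_accuracy      # 73

  print(unordered_sales, ordered_sales)
  print("increase: %.0f%%" % ((ordered_sales / unordered_sales - 1) * 100))  # 46%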

Technical changes

Two days ago, I wrote a pruning algorithm that was able to take a network with hundreds of inputs, and gradually prune it down to just the inputs that made the biggest difference in success.

However, the algorithm took about 6 hours to run against 350 inputs.

This is because for every prune, it would train the network using a full set of inputs minus the one being tested for. That’s 350 neural net trainings for the first round, 349 for the next, 348 for the next. It was a lot!

I realised I could short-cut this by making an assumption – that the most useless input of any particular network was the one that had the lowest standard deviation in its connection weights to other neurons.

This meant that instead of (350+1)*(350/2)-(18+1)*(18/2) = 61254 network trainings, I just needed to do 350-18 = 332. That’s about 185 times quicker, and the saving gets even bigger the more inputs that need to be pruned.
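Here’s a minimal sketch of that short-cut, assuming the connection weights from each input into the first hidden layer are available as a numpy array (the helper names are mine for illustration, not FANN’s):

  import numpy as np

  def least_useful_input(weights):
      # weights: one row per input neuron, one column per connection
      # from that input into the first hidden layer.
      # Assumption: the input whose outgoing weights have the lowest
      # standard deviation is contributing the least, so prune it.
      return int(weights.std(axis=1).argmin())

  def prune(weights, ngrams, keep=18):
      # Repeatedly drop the flattest input until only `keep` remain.
      # That is 350-18 = 332 prune steps (with one training each)
      # instead of 61254 leave-one-out trainings.
      ngrams = list(ngrams)
      while len(ngrams) > keep:
          worst = least_useful_input(weights)
          weights = np.delete(weights, worst, axis=0)
          del ngrams[worst]
          # (in the real system the network is retrained here, once per
          #  prune, so the weights reflect the smaller input set)
      return ngrams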

Tonight, I’ll run the test again – I think I can get it even more accurate.

At the moment, the test assumes that if a website has a calculated value of more than 50%, it is a sale, and if lower than 50%, it is a fail. That’s not right, though – if it’s around 50%, then we should really say that there is simply not enough information available to state either way.

So I’ll run the test again, but this time record the calculated values of each website in a spreadsheet, so I can find out what “band” of values produce the best accuracy in the test.

I mean, of the websites the system says are >90% likely to be a sale, exactly how accurate are those predictions, compared to the websites the system says are 80-90%?

I don’t know, and I want to know.
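For the record, here is roughly what that banding check will look like once the calculated values are recorded, assuming the results end up in a CSV with (hypothetical) calculated_value and actual columns; the band edges are just a first guess:

  import csv
  from collections import defaultdict

  bands = [(0.9, 1.01), (0.8, 0.9), (0.7, 0.8), (0.6, 0.7), (0.5, 0.6),
           (0.4, 0.5), (0.3, 0.4), (0.2, 0.3), (0.1, 0.2), (0.0, 0.1)]
  totals, correct = defaultdict(int), defaultdict(int)

  with open("test_results.csv") as f:
      for row in csv.DictReader(f):
          value = float(row["calculated_value"])       # 0.0 - 1.0
          predicted = "sale" if value >= 0.5 else "fail"
          for lo, hi in bands:
              if lo <= value < hi:
                  totals[(lo, hi)] += 1
                  if predicted == row["actual"]:        # "sale" or "fail"
                      correct[(lo, hi)] += 1
                  break

  for lo, hi in bands:
      if totals[(lo, hi)]:
          pct = 100.0 * correct[(lo, hi)] / totals[(lo, hi)]
          print("%.1f-%.1f: %d sites, %.1f%% correct" % (lo, hi, totals[(lo, hi)], pct))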

Pruning networks for better results and less “overfitting”

A neural network, which is what SalePredict.com is based on, reads information from a number of inputs, processes the information through some mathematical filters, and spits out one or more results. Specifically, this is true of feed-forward networks, which is what we use here.

If we have a lot of inputs, then we need to have a huge amount of training data, in order to avoid a problem called “overfitting”.

Overfitting happens when the network has so many inputs, relative to the amount of training data, that it learns to “memorise” the answers, instead of learning to generalise about the inputs and come up with reasonable intuitions about the answers.

As an example, let’s say we have four inputs (hair colour, wears a dress, wears trousers, has long hair) and we train a network on whether a subject is male or female.

We know ourselves, through repeated exposure, that dresses are generally worn only by females, and that long hair usually indicates a female as well. So we could make a reasonably accurate prediction that a person wearing a dress who has long hair is female.

If we only have a small amount of data, though, we might come up with other reasonable conclusions.

Example:

Sex   Brown Hair   Long Hair   Wearing Trousers   Wearing a Dress
M     No           No          Yes                No
F     Yes          Yes         No                 Yes
F     Yes          No          No                 Yes
M     Yes          Yes         Yes                No

Using the table above, it is reasonable to say that if a person does not have brown hair, they are male. Or that if they have short hair that is brown, they are female.

We know this is too specific to just the information we have. If we use that training to try to classify a fifth person, we can’t be sure at all that it will be accurate.

We call this “overfitting”. The network has learned to memorise specific patterns in inputs that are frankly off-topic.

If we were to drop the Brown Hair and Long Hair inputs altogether, we would be left with two columns that are completely on-topic, and we could be fairly sure the network would be correct when we look at a fifth person, because the patterns learned are straightforward and, in this data set, they correlate exactly with the answer.

The network in SalePredict.com is much more difficult to train. It doesn’t know any English, and definitely doesn’t know the difference between a flower shop and a website selling fire and safety inspections.

By using just the top “n” n-grams found across all of your websites, you run the same risk of overfitting as was shown in the above example.

A very common n-gram in websites is the phrase “more details”. If the data set is not pruned carefully, then the network will train itself to recognise how “more details” fits into the pattern it’s trying to learn, when in fact, it’s just noise.

So, how do we tell the network what n-grams are noise, and what are actually useful for pattern recognition?

There are a number of research papers available on “pruning” in neural networks, but every one that I’ve found is about reducing complexity within the network. What we need is a way to decide which inputs are not needed in the first place before the signals get into the network.

In general, the solution is:

  1. while the number of inputs is greater than the rounded-up square root of the number of training points,
    1. for each n-gram input:
      1. create a new network that does not have that ngram
      2. train against that new network
    2. whichever of those test sessions had the best overall result (lowest mean square error), remove its associated n-gram completely from the main network
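As a rough sketch in code, that general solution looks something like this; train_and_score is a stand-in for a full training run that returns the mean square error, not a real FANN call:

  import math

  def prune_inputs(ngrams, training_data, train_and_score):
      # Leave-one-out pruning. train_and_score(inputs, data) is assumed to
      # build a fresh network on just those inputs, train it, and return
      # the mean square error (lower is better).
      target = math.ceil(math.sqrt(len(training_data)))
      ngrams = list(ngrams)
      while len(ngrams) > target:
          best_error, drop = None, None
          for candidate in ngrams:
              remaining = [n for n in ngrams if n != candidate]
              error = train_and_score(remaining, training_data)
              if best_error is None or error < best_error:
                  best_error, drop = error, candidate
          # the n-gram whose removal hurt least (best MSE without it) goes
          ngrams.remove(drop)
      return ngrams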

This takes a long time to run.

If you have 100 data points in your training data, the system needs to whittle the inputs down to 10 (the square root of 100), which means it needs to run 100 training sessions to get from 100 inputs to 99, 99 training sessions to get to 98, and so on, all the way down to 10. That’s (100+1)*(100/2)-(10+1)*(10/2) = 4995 training sessions.

So, training in this way takes roughly 5000 times as much time as just doing it the short and nasty way (a single training run on the top n-grams).

But is it better?

I will run some tests tonight to see what the difference in accuracy is.

Improvements

I had hoped to finish writing a new “pruning” algorithm for the neural network, but had to finish my son’s Halloween costume instead.

I did get time to reduce the reliance on writing files though, which should speed up analysis of websites.

I initially set the neural network to have 100 inputs. This worked well for a while, but I realised that this was leading towards “overfitting”, which happens when a network learns to memorise the input patterns, instead of learning to generalise from them.

In human terms, it’s like the difference between “that car’s engine has a specific frequency I have recorded as fast”, and “that car sounds fast”. One of them relies on exact data, and the other is more general.

To help with this, I’ve changed the input rules. The number of inputs in the network is now equal to the square root of the number of websites in the training set.

That means that if there are 50 websites in Sales and 50 in Fails, then the neural network will train itself based on only 10 n-gram inputs.
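A rough sketch of the current input selection, assuming each website has already been reduced to a list of its n-grams, and reading “top n-grams” as the ones that appear on the most training websites (the function name is illustrative, not the real code):

  import math
  from collections import Counter

  def choose_inputs(sale_sites, fail_sites):
      # sale_sites / fail_sites: lists of websites, each already reduced
      # to a list of n-grams.
      all_sites = sale_sites + fail_sites
      n = round(math.sqrt(len(all_sites)))   # 100 websites -> 10 inputs
      counts = Counter()
      for ngrams in all_sites:
          counts.update(set(ngrams))         # count each n-gram once per site
      return [ngram for ngram, _ in counts.most_common(n)]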

The next stage is to write an algorithm which can “prune” the list of n-grams that are used as inputs.

You see, if you just use the top 10 n-grams (as we’re doing at the moment), then the system learns to generalise based on word patterns that are generic across all websites. For example, navigation menu entries.

Instead, the n-grams being tested against should be words that are more contextual in nature.

FANN (the artificial neural network library I’m using) doesn’t have an automated pruning mechanism, so I need to write one myself. Hopefully I have time tonight!

How it will work:

  1. start with n n-gram inputs, where n = number of websites in training set
  2. while n > square root of number of websites
    1. for each n-gram, train the network with all n n-grams as inputs except this one
    2. whichever n-gram had the least impact on the accuracy of the network after training, drop it from the list
    3. reduce n by 1

This method will obviously take longer to train than just using a straight-up list of the top n n-grams, but it will also end up being much more accurate.

Sorry for the technical jargon. It’s 6:52am and I’ve been up a few hours already.

About SalePredict

Let’s say you have a lead list of 200 prospective clients. You need to talk to each of them to try to make a sale. It would be very useful if you had a tool which could sort them in order of how likely they are to commit to a sale.

Earlier this year, I realised that it was possible to predict with high accuracy which companies could be brought on board with my company, simply by analysing the language used on their websites and comparing it with the language used by those already signed up with us.

For example, let’s say you are in the Field Service Management Software (FSM) industry as I am, and someone gives you a list of 100 local company websites.

You can make an educated guess at what companies you should approach, based on the wording used on the websites.

For example, FSM clients tend to talk about specifications, certification, standards, and tend to not talk about prices for products, or music, or knitting.

But think for a moment: when you make that educated guess, exactly why would you choose one company over another, if their websites are the only information you have so far?

The answer is that successful sales generally are made to companies that are similar to those you’ve already sold to.

That’s exactly what the SalePredict engine does!

You give it a list of your successful sales, to teach it what a successful sale’s website looks like.

You then give it a list of unsuccessful sales (fails), so it knows what kind of language is used by companies that are not suitable for your work.

Then you give it that list of 200 websites.

It will go through them all and assign them scores based on how similar they are to your successes (and how dissimilar they are to your fails).

It then orders the prospects so that you know which ones to call first.

Let’s talk it out with actual numbers.

Let’s say you have a list of 200 companies, and your usual hit/miss rate is 50%, so you can expect that of those 200 companies, you will make about 100 sales.

Let’s say it takes 2 hours of work per company before you know if it’s a sale or a fail. You need to research each of the companies, and then talk to them.

Well, with your usual method, you will need to slog your way through them all. It will take 10 weeks, and you will make 100 sales.

Now, let’s say you use the SalePredict engine. Based on my own tests, it is 66% accurate even with just 41 data points, and gets more accurate as you feed it more data.

This means that it will order those 200 prospects so that 66% of the actual sales will be in the top 100 of the list.

You can now spend half the time (5 weeks) ringing up 66 sales, then get another 200 leads, run them through the engine, and make another 66 sales in the other 5 weeks.

So, using the straightforward, by-the-numbers method, you make 100 sales. But by using SalePredict, you make 132 sales.
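The same comparison as a quick calculation, using the figures above. The key assumption, again, is that 66% accuracy means 66% of the real sales land in the top half of the sorted list:

  hours_per_company = 2
  hours_per_week = 40
  companies = 200
  hit_rate = 0.50
  engine_accuracy = 0.66

  # Plain slog: work through all 200 companies.
  slog_weeks = companies * hours_per_company / hours_per_week        # 10 weeks
  slog_sales = companies * hit_rate                                  # 100 sales

  # With SalePredict: work only the top half of each sorted list,
  # then pull in a fresh batch of 200 leads and repeat.
  batch_weeks = (companies / 2) * hours_per_company / hours_per_week # 5 weeks
  batch_sales = companies * hit_rate * engine_accuracy               # 66 sales
  batches = slog_weeks / batch_weeks                                 # 2 batches
  predict_sales = batches * batch_sales                              # 132 sales

  print(slog_sales, predict_sales)   # 100.0 vs 132.0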

Which is better? 100 sales or 132 sales? Hmm…