# How to catch a fraudster – using 'top cop' Benford and the power of maths

Yes, we've been hit over the head enough times with the phrase "big data" to be aware of its presence, even though we've been up to our armpits in streams of huge unstructured datasets for years. Those of you who are analysts or data scientists will have already picked up a set of tools that help you find hidden information …

This topic is closed for new posts.
1. #### Benford...

Isn't this also know as Zipf's law - usually applied to the relative usages of words?

1. #### Re: Benford...

You might be thinking of the use of PKZip to compare many works, X, from a known author- say all of a student's past essays - with a piece of text of dubious authenticity, Y.

You Zip X, and then Zip X+Y. If the second Zip file is a fair larger than the first, you might be dealing with a case of plagiarism.

1. #### Re: Benford...

No, Zipf's law really is a thing - it's to do with the power law distributions you find in various real-world phenomena, one of the best-known of which is indeed the frequency with which particular words appear in any given body of text in a natural language.

1. #### Re: Benford...

I was more curious what this has to do with unstructured data. An order amount would be stored in normal RDBMS and most of those support calculated indexes so it would be fairly easy to query. An even better solution would be to maintain a separate table with these values in it.

2. #### Re: Benford...

@tomsk

I came across another example of the Zipf Law.

We were encoding digital maps, taking a square block of pixels and replacing it with a code number. This gave us a useful compression for those particular maps.

The code numbers followed a Zipf distribution.

1. #### Re: Benford...

Zipf' Law is a generalization of Benford's (and older if I recall --- think Zipf based his work on Pareto / the Pareto Principle).

Anyway, the whole "River" explanation seemed a bit overly long to me; a comparison closer to home so to speak is with the snaking "rivers" we live at: streets.

So, why are more people living at house numbers beginning with 6 than with 7? Because every street with a no. 7 has a no.6, with a number 70 has a number 60--69, with a number 700 has a number 600--699; but conversely there are some streets that have 60-something numbers (maybe not all!) but stop before reaching 79; and those that do reach 79 may end between 601 and 799. [In the UK the set of 1000+ numbered houses is negligible.] There is more streets that reach the 100s than the 200s, so

A street with houses 1--199 has more than 50% people living in a number starting on 1, about 5% starting on 2, 3, .. 9. A street with houses 1--299 has over 33% in 1.., over 33% in 2.., and the rest equally split over the remaining 7. And so forth... There's more streets ending in the 300s than in the 400s, more ending in the 200s than the 300s, etc.

As with rivers, the causes of street length (and house number density) are myriad unrelated factors, mostly geographical, political, etc... hence Zipf applies.

2. #### More!

Statistics wasn't as fun or as interesting as this when I was in college.

1. #### Yes please Re: More!

When I was taught statistics it was from the dull and dry perspective of "work out how likely something is to happen". It turns out that in real life, a much more interesting question is "here's what actually happened, work out how likely it is that it was random". Not only is that great for detecting fraud, it's also a basic tool of science; for example the discovery of the Higgs Boson basically boils down to "here's the decay events we recorded, how likely is it that we'd see that distribution of events without a Higgs Boson in the mix? Vanishingly small? OK, that's the discovery."

1. #### excellent

Beautifully timed. Read this last night, and today replaced a maths teacher (as an extra he wassupervising Naplan) and was teaching year 12 probablity.

Had me intrigued so I looked up the other methods

3. #### Amazed!

More of this sort of thing!

4. #### Nice one

Edumacational.

5. #### How to deal with limits?

I was trying to apply Benford's law to a payment system but I was struggling with limits. Each person working with the system has an individual limit, e.g. someone can enter payments up to £500, other persons up to £750 or £12,000 etc. And what's more, the payments as such may have limits too, e.g. for some sort of business the payments may usually go up to £2,500, for others to £9,000 etc.

I did identify several abnormalities according to Benford but for now I gave up as it didn't seem feasible to dig into those by hand. Someone in here got an idea how to deal with that?

Btw, good start and looking forward to reading the rest of this series.

1. #### Re: How to deal with limits?

The thing about Benford's is that it applies to numbers that are logarithmically distributed, i.e. most real-world numbers. If you got a sheet of logarithmic graph paper and threw darts at it at random you'd get most darts landing in a "number begins with a 1" section, because those sections are wider than the others. Benford's distribution. It works no matter what units (scaling factor) you use because log distributions are scale invariant.

So here's my idea.... if you need to check using limits, get out your sheet of log graph paper again and mark off the limits of the data range. Scale the data to fit the paper if necessary. Then plot your data points. If the data is Benfordly-distributed, the points will appear to be randomly spread. If it isn't, the points will be clustered in the narrow-line sections of the graph and out of the wider-line sections.

Beer, in case it doesn't work.

2. #### Re: How to deal with limits?

> I was trying to apply Benford's law to a payment system but

> I was struggling with limits.

Normalize each person's data to his individual limit?

1. #### Re: How to deal with limits?

Normalize each person's data to his individual limit?

Normalising the data by scaling the numbers by a fixed factor, however, makes the data fit better into a Benford-type distribution, diluting any effect you may be looking for.

I'd suggest that the limit is irrelevant, unless you are expecting to always be reducing the value of payments to the value of the limit.

3. #### Re: How to deal with limits?

Read the link from the article to the the Journal of Accounting, and it talks about checking that situation.

1. #### Re: How to deal with limits?

Well, it seems like there's no easy solution - I'll have to invest more, probably much more time and get the hands dirty...

1. #### Re: How to deal with limits? @ Evil Auditor

Given I'm not an accountant, statistican or an auditor, evil or otherwise.

Anyway, if your looking at a set of transtions between 2500.00 and 5000,00 say. The smallist unit that cna change is 1 cent/penny, i.e. 2500.00 next value is 2500.01.

So if you then multiple by 100, to get to whoel number.

Next to look at the conecpt of limits, I thought that if muiltpling is OK (i.e. does not dsiruput the use of Benford's law) then why not divide. As the lowest limit 250000, then use that, but then I thought more and that would leave us with everything in the range of 1 and 2, not very helpful.

So why not subtraction, you can't use the lower limit as potentional you get Zero, for some transactions.

So howabout subtracting 249999 from each trancation. This would leave everything in the range of 1 upto 250001.

By the time you then only look at the leading digit would/should fit into Benford's Law?

1. #### Re: How to deal with limits? @ Evil Auditor

Bob Wheeler, are you looking for a job at Eurostat? I'm pretty sure they do this kind of things on a regular basis. ;-) Seriously though, I understand a bit of statistics but not enough to know what the implications of your suggestion would be. I have the feeling that it might mess up with the numbers but going to discuss with the maths genius that is hiding somewhere in the basement of my office building...

1. #### Re: How to deal with limits? @ Evil Auditor

I'm now not so sure about the subtraction side of things and how that might effect the numbers.

While you just moving the trend line to the left, what happens when it crosses the boundary and it moves from the '1' column to the left, and therefor falles into the '9' column.

I like basements, nice and snug, not to many visitors........

Nurse, Is it time for my meds yet?

1. #### Re: How to deal with limits? @ Evil Auditor

I had some similar thoughts, but what do I know. Unfortunately, I couldn't find the maths guy yet - it's Friday and on Fridays he's never been seen near the office.

6. Fascinating, but I would still have expected 9 to have a higher incidence than 8 if the numbers are about money as ,e.g., retailers price products at 9.99 rather than 10.00

1. Those retailers prices are not natural.

2. "Fascinating, but I would still have expected 9 to have a higher incidence than 8 if the numbers are about money as ,e.g., retailers price products at 9.99 rather than 10.00"

It doesn't actually work that way. Lots of products are priced at 9.99, but not as many as cost £8.xx and there aren't as many of those as things that cost £7.xx and so on.

3. > I would still have expected 9 to have a higher incidence than 8 if the numbers are about money

It only apples to the incidence of the first digit.

4. If you buy two items priced at 5.99, 6.99, 7.99, 8.99 or 9.99, the total will still start with a 1.

5. OK, so products are often sold for x9.99 I suspect 9.99 is not more common than 1.99 or 19.99 And what numbers come up in the final bill if you buy more than one item and VAT is added?

6. Interesting responses, thank you, and they demonstrate even further how my intuition about approximate answers can be badly flawed...

@ Patrick R "Those retailers prices are not natural"

- Very good, on both levels :-) Yes, a bias has been introduced so they are "chosen numbers" rather than "random numbers" or "numbers that occur in nature"

@ Terry Barnes "It doesn't actually work that way..."

- Possibly, but I suspect retailers will also overprice a cheaper item up to the £10 "mental barrier" to compensate for those they have to underprice. (Ignoring competition etc.)

@ YAAC "It only apples to the incidence of the first digit."

- Yes, I wasn't counting the "insignificant" digits, just using them to demonstrate my theory about retailer pricing

@ Loyal Commenter "If you buy two items priced..."

- I notice you chose your set of numbers carefully, but yep, a "transform" breaks an even distribution. Transforms are back into the realm of pure maths, so we can get away from these damn statistics....

@ Cynic_999 "I suspect 9.99 is not more common than 1.99..."

OK, that's hardly rigorous and the results were closer than I expected! Interestingly, a googlefight between 9 and 8 (or any other 2 digits) seem to to follow Benford's Law.

As above, I appreciate that transforms break even distributions, but a couple of counter-points: In the UK, retail prices almost always include VAT (to avoid arguments at the till, if it isn't by law). And how about big-ticket items? If you're buying a budget car, you expect that £9,999 price tag to include everything. If you want a "nice" car, £19.999 is just the base model and you are prepared to pay for extras.

1. @ Loyal Commenter "If you buy two items priced..."

- I notice you chose your set of numbers carefully, but yep, a "transform" breaks an even distribution. Transforms are back into the realm of pure maths, so we can get away from these damn statistics....

For the sake of brevity, yes I did 'select' my numbers, to illustrate that the majority of prices end up starting with the digit 1 if doubled. A more complete example follows:

If you have items priced at x.99, and your values of x are evenly distributed (i.e. non-random, as in prices in a shop), these are the prices you get if you buy a given number of them (up to nine; as we are working in base ten, going further will give the same results; proof of this is left as an exercise for the reader...)

.99 1.98 2.97 3.96 4.95 5.94 6.93 7.92 8.91

1.99 3.98 5.97 7.96 9.95 11.94 13.93 15.92 17.91

2.99 5.98 8.97 11.96 14.95 17.94 20.93 23.92 27.91

3.99 7.98 11.97 15.96 19.95 23.94 27.93 31.92 35.91

4.99 9.98 14.97 19.96 24.95 29.94 34.93 39.92 44.91

5.99 11.98 17.97 23.96 29.95 35.94 41.93 47.92 53.91

6.99 13.98 20.97 27.96 34.95 41.94 48.93 55.92 62.91

7.99 15.98 23.97 31.96 39.95 47.94 55.93 63.92 71.91

8.99 17.98 26.97 35.96 44.95 53.94 62.93 71.92 80.91

9.99 19.98 29.97 39.96 49.95 59.94 69.93 79.92 89.91

If you then count the incidence of each first digit, you get:

19 : 15 : 12 : 9 : 8 : 5 : 6 : 4: 3

Which is an almost perfect Benford distribution. Voila!

7. I am still digesting the concept of a Readers Digest digest. Good article, again please.

1. >I am still digesting the concept of a Readers Digest digest.

1. #### re: I only read the summary

I only read the Condensed summary.

1. #### Re: re: I only read the summary

Err... I didn't?

1. #### Re: re: I only read the summary

I only have the Executive Summary read to me.

Naaaahhhhh Nah Nah

8. #### Scale independence

At this point you might be wondering if this is to do with the units in which you choose to measure, but no, this phenomenon is unit-independent.

It's exactly because it is unit- or more accurately scale-independent that it works. I am a little surprised not to have seen the word "logarithm" in the article.

Scale issues make me a little sceptical of its value for fraud detection: I would imagine that credit limits and common denominations of payments would obscure the distribution.

1. #### Re: Scale independence

Surprisingly not. It's a pretty standard test applied where fraud might be an issue - business expenses, bank transactions etc...

Have a look at your bank statement for last month. Check out how many numbers start with '1' compared to all the others.

2. #### Re: Scale independence

Julian Havel explains Benford's Law in Ch. 14 "It's a Logarithmic World" of his excellent book "Gamma, exploring Euler's constant".

BTW, Benford's Law also holds for binary numbers: expect the number of zeros to be double the number of ones. Try it.

1. #### Re: Scale independence, Binary numbers

Er, sorry, no. The numbers analysed do not include leading zeros, so binary numbers all begin with one. But yes, that is what Benford's Law would predict, even if it is a trivial case :-)

3. #### Re: Scale independence

@Dr Paul Taylor: I am a little surprised not to have seen the word "logarithm" in the article.

It is actually there if you look closely. ;-) [Hint: in the description of where Frank Benford started from.]

9. #### Scale

In other words, you can have a much bigger scale of values when beginning with 1 than with 9.

Example [1 to 1.999] and [9 to 9.999] ...feel the restriction?

10. #### The R link is daft

R the programming language is very good for statistics hacking, but the "for example R" link points to an article on IBM's System R which was a relational database. Both data related, but not the same thing.

We use Benford's Law in some assignments. I'll point students towards this nice informal introduction. Looking forward to the rest of the articles

12. #### House numbers

Has anyone done any work on the distribution of digits of house numbers?

Assume that each house number in a street is made up from individual numbers, so 13 has a 1 and a 3 screwed next to each other on its gatepost.

Since streets aren't usually long enough for house numbers to go all the way up to 99 (or further!) there is going to be a bias of the distribution of the first digit towards 1 (since we don't usually number houses as 01, 02, etc) and the distribution of the second digit towards 0, since some streets will miss out on 25, 26, 27, 28 and 29, say.

What will the distribution be when multiple streets in a town are combined together? Knowing how many metal plates with zeros, ones, twos, etc, to produce must be well-known to the manufacturer of house numbers!

Not quite a Benford's Law distribution, though...

1. #### Re: House numbers

A common US practice is to number outward, 100 per block, from a central point. In Washington, DC, that is the Capitol; in Denver it is (I think) Broadway and Exposition Sts. The applicability of Benford's law depends on how far this is carried. Within Washington, DC, the law does not apply. In Montgomery County, Maryland, which adjoins DC on the north, the numbering system continues, and for the north-south addresses Benford's law kicks in about the Beltway.

The US government has made a big push to convert the old rural route delivery services to a common pattern of the "numbering outward" type, which must be increasing the reach of Benford's law here.

1. #### Re: House numbers

A common US practice is to number outward, 100 per block, from a central point.

In many places in the US, house numbers don't increment by a constant value, either. My house number ends in 19; the next-door neighbor's house number ends in 11. I'm not sure how they were assigned (possibly something to do with the platting, since house lots here do not correspond to surveyed parcels; my lot's legal description is 78 words long).

It's also not uncommon to have house numbers up into the five digits in the US.

In Japan, on the other hand, house numbers are assigned randomly, as part of a long-running prank being played against visitors.

1. #### Re: House numbers

> I'm not sure how they were assigned

As I recall, and simplifying: allot 100 to each block, start from 1, take the length of the first façade and divide by the length of the block--that's the number of the second house, then take the length of the second façade, divide by block length and add to the second house's number: assign this to the third house and so on.

13. Brilliant Stuff. Next article on an introduction to the simplicity of complex algebraic equations? :)

14. #### Quibbles

" Below is some real data collected by Benford himself in 1938 from a copy of Reader’s Digest."

No, below is a bar chart showing we know not what. The digits of numbers found in the text? The digits of the page numbers?

1. #### Re: Quibbles

(re-)Read the sentence before the one you quoted. The graph shows the frequency distribution of the leading digit of the numbers found in the magazine.

15. #### Dear Prof.

Can I ask why the addition to the length of the river is a function of the length of the river?

Then we extend the river several times, each extension being a random per cent figure of the length of the river in question.

i.e why are we using

Xn+1 = Xn + f(Xn)

Xn+1 = Xn + f(R)

Where R is a random length factor added to the river independent to the length of the river

If I added a rock to the path of my river and it extended the river length by 5 meters it shouldn't matter if my river is 5 or 5000KM long

1. #### Re: Dear Prof.

Because the total number of disturbances in the river that extend its length will depend on the 'original' length that the river covered.

1. #### Re: Dear Prof.

I can accept that - appropriate thumbs up applied

16. #### Wow

Wow that was a great article… More please !!

17. #### Numberphile

Funny, I was only watching the Numberphile video of this the other day.

A good channel to subscribe to, if that's your bag! Brady has several good channels on other subjects too, mostly from Nottingham Uni.

18. This is amazing!

I just ran it on my bank account outbound transactions using this guide: http://www.theiia.org/intAuditor/media/files/Step-by-step_Instructions_for_Using_Benford's_Law[1].pdf

Here is the resulting chart: http://www.bouncingball.mobi/apps/files/benford.png

Knowing this, I will conduct my fraud much more efficiently in the future.

1. Knowing this, I will conduct my fraud much more efficiently in the future.

Since this is an IT site, presumably you meant "Knowing this, I have modified my automated fraud system to shape its transaction stream to fit a Benford distribution".

19. #### Probability Density??

This seems like a quick hack against probabalistic density functions. Learn from a big set of data what a normal distribution of values is. A Benford distribution may well occur in some models. Then score against that probabilistic distribution. Anomalies will show up. The advantage of probabilistic distributions is that they can be setup in multidimensional space (so our friend with limits can use the limit as a separate dimension). They also copes with all sorts of different distributions and clusters.

The downside is that they are horribly computationally intensive, but like all data mining, understanding the data first is key to developing a solution. Throwing random methods or models and hoping one "works" is a recipe for disaster. Some data will fit. Some won't. Worse, some data might appear to fit, but if you don't understand why, it could be an artifact of sampling, or just a one off. There are no magic methods.

1. #### Re: Probability Density??

Learn from a big set of data what a normal distribution of values is. A Benford distribution may well occur in some models. Then score against that probabilistic distribution.

To generalize a bit further: Train a Hidden Markov Model to predict the distribution of values, then use that model to decode incoming transactions and flag those with a high error rate.

This is equivalent to compressing with an HMM-based compressor such as PPMD (or e.g. the BWT, used in bzip2, which is basically a specialized HMM in another form) and looking to see how good or poor the compression is, with poor compression indicating an anomaly.

It's just a way of moving the entropy around to improve the signal. As Dave 126 pointed out in the second comment (albeit out of misinterpreting the first comment), compression algorithms have been used for this sort of thing for decades. (I think Shannon may even have suggested it, though I can't recall the context.)

A more-ambitious approach would be to train something like an SVM to classify a transaction stream as "normal" or not, which is mathematically a similar operation but gives you an easier way to get a binary flag ("inspect this further"), since an SVM maximizes the separation between the candidate sets in an n-dimensional hyperspace. But this is all basic machine-learning stuff that would be covered in most introductions to the subject.

What's nice about Benford's is that it's conceptually accessible. Unlike an HMM or similar, there aren't any hidden parameters and it's easy to explain how it works. That saves you from having to treat the machine as an oracle, and it helps avoid relying on a bad model (because the training data was rubbish or whatever).

20. #### From having had to deliver stuff to people in it..

.. there are streets in Belper, Derbyshire which are numbered according to when they were built.

21. #### Coincidence

Last week my daughter, in year 4 of primary school, was told to draw a graph of some statistics for her homework, e.g. colours of cars going past. I suggested she plot the frequency of leading digits in a list of populations of countries of the world. She got a fairly nice Benford distribution. My wife had a go at me for confusing the poor child. I'm waiting to hear if her class teacher was similarly confused/angry ...

You used a wrong word (or your software subbed in a wrong word):

"country credited to other sales people but I would confidentially expect"

I would think to expect in a confidential manner would be weird. But to confidently expect something is common.

1. I would think to expect in a confidential manner would be weird.

Why? Do you publish all your expectations?

23. #### Reminds me of when I (and others) cooked expenses ...

whilst working at Mohawk Data Sciences.

We service technicians had many accounts dotted all over the Metro Toronto area which gave rise to a matrix of inter-account road journeys for which we were compensated. Recording the service calls was easy, compiling expenses at the end of the week wasn't.

So we developed this chart, based upon true numbers rounded up to the nearest mile. There were always a few who rounded up a few extra miles - potentially OK for journeys from one end of the city to another.

The accountant was sharp - he could, as your example demonstrates, spot the fakes, the padded accounts! He called me in only once. My territory included God's country (Northern Ontario where bears and aurora borealis hang out). He queried my expenses. I explained that there had been a road closure and my usual motel had caught fire.

Before I left MDS, I asked him how he spotted fakes, and his explanation was similar to the example.

1. #### Re: Reminds me of when I (and others) cooked expenses ...

> He queried my expenses. I explained that there had been a road closure and my usual motel had caught fire.

Blocking the road and setting fire to your motel seems a bit excessive. Proper BOFH territory that is.

24. Maybe a version of this rule works for randomly distributed numbers. Sure, if you pick a number at random between 1 and 100 it will have a 1 in 10 chance of having a given number at the start. But what about a random number between 1 and n, where n itself can vary? Well, if n=1 then we know we begin with a 1, and if n=2 then it's 50/50 between a 1 and a 2. Doing this for n from 1 to 9 and averaging out the percentages we get the following distribution:

1: 28.2%

2: 18.3%

3: 13.3%

4: 10.0%

5: 7.5%

6: 5.5%

7: 3.8%

8: 2.4%

9: 1.1%

Looks very reminiscent of Benford...

1. Yes, same distribution, same reason: the Arabic numbering system is logarithmic.

25. #### I tried to test this, but it didn't work...

I grabbed a bag of M&M's and tried to plot the number of times each colour appeared in the bag, but I kept losing count. I'd get half way through the bag, my fingers would get sticky from the sugar, I'd try to lick them clean, & end up passed out from the chocolate high. Can someone else give it a try & post their results? I'll check back after I recover from the sugar coma. =-)p

26. That's Numberwang!

Fun.

28. #### E-Series are 1 heavy,

Renard numbers and the various E series used for electronic components, such as resistor values have many values starting with a 1, and few starting with 7, 8 or 9.

29. Fascinating! Thanks for showing another unexpected application of maths!

30. Maths is beautiful.

This topic is closed for new posts.