Benford...
Isn't this also know as Zipf's law - usually applied to the relative usages of words?
Yes, we've been hit over the head enough times with the phrase "big data" to be aware of its presence, even though we've been up to our armpits in streams of huge unstructured datasets for years. Those of you who are analysts or data scientists will have already picked up a set of tools that help you find hidden information …
You might be thinking of the use of PKZip to compare many works, X, from a known author- say all of a student's past essays - with a piece of text of dubious authenticity, Y.
You Zip X, and then Zip X+Y. If the second Zip file is a fair larger than the first, you might be dealing with a case of plagiarism.
Zipf' Law is a generalization of Benford's (and older if I recall --- think Zipf based his work on Pareto / the Pareto Principle).
Anyway, the whole "River" explanation seemed a bit overly long to me; a comparison closer to home so to speak is with the snaking "rivers" we live at: streets.
So, why are more people living at house numbers beginning with 6 than with 7? Because every street with a no. 7 has a no.6, with a number 70 has a number 60--69, with a number 700 has a number 600--699; but conversely there are some streets that have 60-something numbers (maybe not all!) but stop before reaching 79; and those that do reach 79 may end between 601 and 799. [In the UK the set of 1000+ numbered houses is negligible.] There is more streets that reach the 100s than the 200s, so
A street with houses 1--199 has more than 50% people living in a number starting on 1, about 5% starting on 2, 3, .. 9. A street with houses 1--299 has over 33% in 1.., over 33% in 2.., and the rest equally split over the remaining 7. And so forth... There's more streets ending in the 300s than in the 400s, more ending in the 200s than the 300s, etc.
As with rivers, the causes of street length (and house number density) are myriad unrelated factors, mostly geographical, political, etc... hence Zipf applies.
When I was taught statistics it was from the dull and dry perspective of "work out how likely something is to happen". It turns out that in real life, a much more interesting question is "here's what actually happened, work out how likely it is that it was random". Not only is that great for detecting fraud, it's also a basic tool of science; for example the discovery of the Higgs Boson basically boils down to "here's the decay events we recorded, how likely is it that we'd see that distribution of events without a Higgs Boson in the mix? Vanishingly small? OK, that's the discovery."
More of the same, please.
I was trying to apply Benford's law to a payment system but I was struggling with limits. Each person working with the system has an individual limit, e.g. someone can enter payments up to £500, other persons up to £750 or £12,000 etc. And what's more, the payments as such may have limits too, e.g. for some sort of business the payments may usually go up to £2,500, for others to £9,000 etc.
I did identify several abnormalities according to Benford but for now I gave up as it didn't seem feasible to dig into those by hand. Someone in here got an idea how to deal with that?
Btw, good start and looking forward to reading the rest of this series.
The thing about Benford's is that it applies to numbers that are logarithmically distributed, i.e. most real-world numbers. If you got a sheet of logarithmic graph paper and threw darts at it at random you'd get most darts landing in a "number begins with a 1" section, because those sections are wider than the others. Benford's distribution. It works no matter what units (scaling factor) you use because log distributions are scale invariant.
So here's my idea.... if you need to check using limits, get out your sheet of log graph paper again and mark off the limits of the data range. Scale the data to fit the paper if necessary. Then plot your data points. If the data is Benfordly-distributed, the points will appear to be randomly spread. If it isn't, the points will be clustered in the narrow-line sections of the graph and out of the wider-line sections.
Beer, in case it doesn't work.
Normalize each person's data to his individual limit?
Normalising the data by scaling the numbers by a fixed factor, however, makes the data fit better into a Benford-type distribution, diluting any effect you may be looking for.
I'd suggest that the limit is irrelevant, unless you are expecting to always be reducing the value of payments to the value of the limit.
I was thinking more about this last night - sad I know.
Given I'm not an accountant, statistican or an auditor, evil or otherwise.
Anyway, if your looking at a set of transtions between 2500.00 and 5000,00 say. The smallist unit that cna change is 1 cent/penny, i.e. 2500.00 next value is 2500.01.
So if you then multiple by 100, to get to whoel number.
Next to look at the conecpt of limits, I thought that if muiltpling is OK (i.e. does not dsiruput the use of Benford's law) then why not divide. As the lowest limit 250000, then use that, but then I thought more and that would leave us with everything in the range of 1 and 2, not very helpful.
So why not subtraction, you can't use the lower limit as potentional you get Zero, for some transactions.
So howabout subtracting 249999 from each trancation. This would leave everything in the range of 1 upto 250001.
By the time you then only look at the leading digit would/should fit into Benford's Law?
Bob Wheeler, are you looking for a job at Eurostat? I'm pretty sure they do this kind of things on a regular basis. ;-) Seriously though, I understand a bit of statistics but not enough to know what the implications of your suggestion would be. I have the feeling that it might mess up with the numbers but going to discuss with the maths genius that is hiding somewhere in the basement of my office building...
Umm, been thinking more about this, sad, oh so very sad.
I'm now not so sure about the subtraction side of things and how that might effect the numbers.
While you just moving the trend line to the left, what happens when it crosses the boundary and it moves from the '1' column to the left, and therefor falles into the '9' column.
I like basements, nice and snug, not to many visitors........
Nurse, Is it time for my meds yet?
"Fascinating, but I would still have expected 9 to have a higher incidence than 8 if the numbers are about money as ,e.g., retailers price products at 9.99 rather than 10.00"
It doesn't actually work that way. Lots of products are priced at 9.99, but not as many as cost £8.xx and there aren't as many of those as things that cost £7.xx and so on.
Interesting responses, thank you, and they demonstrate even further how my intuition about approximate answers can be badly flawed...
@ Patrick R "Those retailers prices are not natural"
- Very good, on both levels :-) Yes, a bias has been introduced so they are "chosen numbers" rather than "random numbers" or "numbers that occur in nature"
@ Terry Barnes "It doesn't actually work that way..."
- Possibly, but I suspect retailers will also overprice a cheaper item up to the £10 "mental barrier" to compensate for those they have to underprice. (Ignoring competition etc.)
@ YAAC "It only apples to the incidence of the first digit."
- Yes, I wasn't counting the "insignificant" digits, just using them to demonstrate my theory about retailer pricing
@ Loyal Commenter "If you buy two items priced..."
- I notice you chose your set of numbers carefully, but yep, a "transform" breaks an even distribution. Transforms are back into the realm of pure maths, so we can get away from these damn statistics....
@ Cynic_999 "I suspect 9.99 is not more common than 1.99..."
http://www.googlefight.com/index.php?lang=en_GB&word1=9.99&word2=1.99
OK, that's hardly rigorous and the results were closer than I expected! Interestingly, a googlefight between 9 and 8 (or any other 2 digits) seem to to follow Benford's Law.
As above, I appreciate that transforms break even distributions, but a couple of counter-points: In the UK, retail prices almost always include VAT (to avoid arguments at the till, if it isn't by law). And how about big-ticket items? If you're buying a budget car, you expect that £9,999 price tag to include everything. If you want a "nice" car, £19.999 is just the base model and you are prepared to pay for extras.
@ Loyal Commenter "If you buy two items priced..."
- I notice you chose your set of numbers carefully, but yep, a "transform" breaks an even distribution. Transforms are back into the realm of pure maths, so we can get away from these damn statistics....
For the sake of brevity, yes I did 'select' my numbers, to illustrate that the majority of prices end up starting with the digit 1 if doubled. A more complete example follows:
If you have items priced at x.99, and your values of x are evenly distributed (i.e. non-random, as in prices in a shop), these are the prices you get if you buy a given number of them (up to nine; as we are working in base ten, going further will give the same results; proof of this is left as an exercise for the reader...)
.99 1.98 2.97 3.96 4.95 5.94 6.93 7.92 8.91
1.99 3.98 5.97 7.96 9.95 11.94 13.93 15.92 17.91
2.99 5.98 8.97 11.96 14.95 17.94 20.93 23.92 27.91
3.99 7.98 11.97 15.96 19.95 23.94 27.93 31.92 35.91
4.99 9.98 14.97 19.96 24.95 29.94 34.93 39.92 44.91
5.99 11.98 17.97 23.96 29.95 35.94 41.93 47.92 53.91
6.99 13.98 20.97 27.96 34.95 41.94 48.93 55.92 62.91
7.99 15.98 23.97 31.96 39.95 47.94 55.93 63.92 71.91
8.99 17.98 26.97 35.96 44.95 53.94 62.93 71.92 80.91
9.99 19.98 29.97 39.96 49.95 59.94 69.93 79.92 89.91
If you then count the incidence of each first digit, you get:
19 : 15 : 12 : 9 : 8 : 5 : 6 : 4: 3
Which is an almost perfect Benford distribution. Voila!
At this point you might be wondering if this is to do with the units in which you choose to measure, but no, this phenomenon is unit-independent.
It's exactly because it is unit- or more accurately scale-independent that it works. I am a little surprised not to have seen the word "logarithm" in the article.
Scale issues make me a little sceptical of its value for fraud detection: I would imagine that credit limits and common denominations of payments would obscure the distribution.
Has anyone done any work on the distribution of digits of house numbers?
Assume that each house number in a street is made up from individual numbers, so 13 has a 1 and a 3 screwed next to each other on its gatepost.
Since streets aren't usually long enough for house numbers to go all the way up to 99 (or further!) there is going to be a bias of the distribution of the first digit towards 1 (since we don't usually number houses as 01, 02, etc) and the distribution of the second digit towards 0, since some streets will miss out on 25, 26, 27, 28 and 29, say.
What will the distribution be when multiple streets in a town are combined together? Knowing how many metal plates with zeros, ones, twos, etc, to produce must be well-known to the manufacturer of house numbers!
Not quite a Benford's Law distribution, though...
A common US practice is to number outward, 100 per block, from a central point. In Washington, DC, that is the Capitol; in Denver it is (I think) Broadway and Exposition Sts. The applicability of Benford's law depends on how far this is carried. Within Washington, DC, the law does not apply. In Montgomery County, Maryland, which adjoins DC on the north, the numbering system continues, and for the north-south addresses Benford's law kicks in about the Beltway.
The US government has made a big push to convert the old rural route delivery services to a common pattern of the "numbering outward" type, which must be increasing the reach of Benford's law here.
A common US practice is to number outward, 100 per block, from a central point.
In many places in the US, house numbers don't increment by a constant value, either. My house number ends in 19; the next-door neighbor's house number ends in 11. I'm not sure how they were assigned (possibly something to do with the platting, since house lots here do not correspond to surveyed parcels; my lot's legal description is 78 words long).
It's also not uncommon to have house numbers up into the five digits in the US.
In Japan, on the other hand, house numbers are assigned randomly, as part of a long-running prank being played against visitors.
> I'm not sure how they were assigned
As I recall, and simplifying: allot 100 to each block, start from 1, take the length of the first façade and divide by the length of the block--that's the number of the second house, then take the length of the second façade, divide by block length and add to the second house's number: assign this to the third house and so on.
Can I ask why the addition to the length of the river is a function of the length of the river?
Then we extend the river several times, each extension being a random per cent figure of the length of the river in question.
i.e why are we using
X_{n+1} = X_{n} + f(X_{n})
instead of
X_{n+1} = X_{n} + f(R)
Where R is a random length factor added to the river independent to the length of the river
If I added a rock to the path of my river and it extended the river length by 5 meters it shouldn't matter if my river is 5 or 5000KM long
This is amazing!
I just ran it on my bank account outbound transactions using this guide: http://www.theiia.org/intAuditor/media/files/Step-by-step_Instructions_for_Using_Benford's_Law[1].pdf
Here is the resulting chart: http://www.bouncingball.mobi/apps/files/benford.png
Knowing this, I will conduct my fraud much more efficiently in the future.
This seems like a quick hack against probabalistic density functions. Learn from a big set of data what a normal distribution of values is. A Benford distribution may well occur in some models. Then score against that probabilistic distribution. Anomalies will show up. The advantage of probabilistic distributions is that they can be setup in multidimensional space (so our friend with limits can use the limit as a separate dimension). They also copes with all sorts of different distributions and clusters.
The downside is that they are horribly computationally intensive, but like all data mining, understanding the data first is key to developing a solution. Throwing random methods or models and hoping one "works" is a recipe for disaster. Some data will fit. Some won't. Worse, some data might appear to fit, but if you don't understand why, it could be an artifact of sampling, or just a one off. There are no magic methods.
Learn from a big set of data what a normal distribution of values is. A Benford distribution may well occur in some models. Then score against that probabilistic distribution.
To generalize a bit further: Train a Hidden Markov Model to predict the distribution of values, then use that model to decode incoming transactions and flag those with a high error rate.
This is equivalent to compressing with an HMM-based compressor such as PPMD (or e.g. the BWT, used in bzip2, which is basically a specialized HMM in another form) and looking to see how good or poor the compression is, with poor compression indicating an anomaly.
It's just a way of moving the entropy around to improve the signal. As Dave 126 pointed out in the second comment (albeit out of misinterpreting the first comment), compression algorithms have been used for this sort of thing for decades. (I think Shannon may even have suggested it, though I can't recall the context.)
A more-ambitious approach would be to train something like an SVM to classify a transaction stream as "normal" or not, which is mathematically a similar operation but gives you an easier way to get a binary flag ("inspect this further"), since an SVM maximizes the separation between the candidate sets in an n-dimensional hyperspace. But this is all basic machine-learning stuff that would be covered in most introductions to the subject.
What's nice about Benford's is that it's conceptually accessible. Unlike an HMM or similar, there aren't any hidden parameters and it's easy to explain how it works. That saves you from having to treat the machine as an oracle, and it helps avoid relying on a bad model (because the training data was rubbish or whatever).
Last week my daughter, in year 4 of primary school, was told to draw a graph of some statistics for her homework, e.g. colours of cars going past. I suggested she plot the frequency of leading digits in a list of populations of countries of the world. She got a fairly nice Benford distribution. My wife had a go at me for confusing the poor child. I'm waiting to hear if her class teacher was similarly confused/angry ...
whilst working at Mohawk Data Sciences.
We service technicians had many accounts dotted all over the Metro Toronto area which gave rise to a matrix of inter-account road journeys for which we were compensated. Recording the service calls was easy, compiling expenses at the end of the week wasn't.
So we developed this chart, based upon true numbers rounded up to the nearest mile. There were always a few who rounded up a few extra miles - potentially OK for journeys from one end of the city to another.
The accountant was sharp - he could, as your example demonstrates, spot the fakes, the padded accounts! He called me in only once. My territory included God's country (Northern Ontario where bears and aurora borealis hang out). He queried my expenses. I explained that there had been a road closure and my usual motel had caught fire.
Before I left MDS, I asked him how he spotted fakes, and his explanation was similar to the example.
Maybe a version of this rule works for randomly distributed numbers. Sure, if you pick a number at random between 1 and 100 it will have a 1 in 10 chance of having a given number at the start. But what about a random number between 1 and n, where n itself can vary? Well, if n=1 then we know we begin with a 1, and if n=2 then it's 50/50 between a 1 and a 2. Doing this for n from 1 to 9 and averaging out the percentages we get the following distribution:
1: 28.2%
2: 18.3%
3: 13.3%
4: 10.0%
5: 7.5%
6: 5.5%
7: 3.8%
8: 2.4%
9: 1.1%
Looks very reminiscent of Benford...
I grabbed a bag of M&M's and tried to plot the number of times each colour appeared in the bag, but I kept losing count. I'd get half way through the bag, my fingers would get sticky from the sugar, I'd try to lick them clean, & end up passed out from the chocolate high. Can someone else give it a try & post their results? I'll check back after I recover from the sugar coma. =-)p