Nice one again
Had not heard of this one. Very interesting read. Looking forward to the rest of the series
You've employed Benford's Law to out fraudsters hidden in seemingly random numbers. Now what do you do if you need answers but some of your data is missing? Welcome to the German tank problem, the second in The Reg's guide to crafty techniques from the world of mathematics that can help you quickly solve niggling data problems …
You may also enjoy "Cryptonomicon" by Neal Stephenson. Part of the plot follows a slightly fictionalised/alternate-history version of the allied code breakers (Bletchley Park, etc.) as they deal with exactly this kind of problem.
And also worry about the reverse issue of trying to counteract their own possible revelations of information back the other way. e.g. having worked out how many tanks per week the enemy is building, what if you inadvertently change your behavior in such a way that the enemy can work out what you have done? Then they might start deliberately messing with the serial numbers to spoil your analysis whilst suddenly increasing production...
@ DanDanDan
Information is everywhere, but you won't see it if you don't know where to look or why you are looking. Wikipedia is full of information on almost anything... so merely having the information doesn't make anyone wiser. What matters is pooling that information in the relevant places for the right people.
Obviously The Reg could just post a link, but we came to The Reg to read The Reg, which leads me to assume people find the articles here easy to read, while the same is not necessarily true of Wikipedia's layout.
It's for this reason that every client gets their own invoice number when I (and many others I know) invoice them: AB001, where their company initials are AB. That way they don't know if I'm a) depending on them for an income, and thus they can take the piss, or b) doing lots of work on the side when they think they're getting all my time. Of course you can estimate the answers to both questions in other ways (how much you see me, for a start!) but depending on who you are and what I'm doing it can help a little, now I'm back off PAYE and back on a random selection of projects (+ a PhD).
>It's for this reason that every client gets their own invoice number
Unfortunately, if you are VAT registered, HMRC like to see a nice sequence of invoice numbers without any gaps.
Because many IT billing systems can only handle simple sequential numbering, I have used the logic of the German tank problem to estimate the number of customers various mobile phone operators have, and their net fluctuation.
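If anyone wants to play along at home, here's a minimal sketch of the standard minimum-variance unbiased estimator for the German tank problem, N ≈ m + m/k − 1 (m the largest serial seen, k the sample size), applied to purely made-up customer numbers:

```python
def estimate_total(serials):
    """German tank estimator: given k serials sampled without
    replacement from 1..N, the minimum-variance unbiased
    estimate of N is m + m/k - 1, where m = max(serials)."""
    k = len(serials)
    m = max(serials)
    return m + m / k - 1

# Hypothetical customer numbers gleaned from a handful of invoices
seen = [1042, 3944, 7114, 9577, 14003]
print(round(estimate_total(seen)))  # -> 16803
```

Run it at two points in time and the difference between the two estimates gives you the net fluctuation mentioned above.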
"Unfortunately, if you are VAT registered, HMRC like to see a nice sequence of invoice numbers without any gaps."
That's not entirely correct. Nice sequence, yes, but no problem having multiple sequences. What they don't like is gaps in any sequence.
Per-project sequences, for example, are not at all uncommon. Per-customer sequences are nothing other than that.
Had a VAT audit not too long ago, and they didn't complain at all about the fact that I had several sequences (by project/customer type).
>Had a VAT audit not too long ago, and they didn't complain at all about the fact that I had several sequences
Thanks for the prod to revisit the VAT invoice rules! Yes, the requirements are for no gaps and no duplicates.
I take it you maintain a master register/ledger, so all invoices effectively have an internal unique sequence number and a published (unique) sequence number. This certainly would help when you need to double check a VAT return.
Is this not a bit like the 'remaining stock' counter on an eBay item or in an online shop?
When I ran an online shop site, we used to set this value to only 10 or 20, then reset it once it hit zero.
As much to give the customers the feeling that stock of their item was not infinite as to fool the competitors about our turnover!
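For what it's worth, the trick is only a few lines; a toy sketch of that display logic (names and numbers are my own invention):

```python
DISPLAY_CAP = 10  # what the visible counter gets reset to, per the post above

def displayed_stock(real_stock, visible):
    """Never show more than DISPLAY_CAP units, and top the visible
    counter back up from real stock once it hits zero, so the
    listing leaks neither the true stock level nor the turnover."""
    if visible == 0 and real_stock > 0:
        visible = DISPLAY_CAP
    return min(visible, real_stock, DISPLAY_CAP)

print(displayed_stock(real_stock=500, visible=0))  # -> 10
```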
Looks to me like an ISU-122. Not technically a tank, but a self-propelled gun ... but certainly Russian.
http://en.wikipedia.org/wiki/ISU-122
(This is not the first time that I've lamented that El Reg doesn't provide larger versions of the thumbnail images used in its lists of articles.)
However, like the statistical analysis in the article, it did play a part in reducing German tank numbers, so it isn't entirely irrelevant to the subject at hand.
You are all wrong. The vehicle shown is in fact a Bolo Mk XXIV Continental Siege Engine.
Understandable mistake, there being no figure in shot to give it scale (those roadwheels are, in fact, over thirty feet, or about nine metres, in diameter), but if you know where to look you can make out the unit's name (Restartus) on the glacis underneath and to the right of the ball mantlet of the rail gun.
> This is not the first time that I've lamented that El Reg doesn't provide larger versions of the thumbnail images used in its lists of articles.
Your competitors' websites can be a valuable hunting ground.
Yes and no. Say your competitor has accidentally leaked 0.1% of their records on their homepage. If you notice that by clever manipulation of the URL you can make it reveal the other 99.9% (0.1% at a time), should you then go on to extract their entire database?
Common sense says that they have published this data, but the law commonly comes down on those who extract databases in this way - just ask weev.
That's not the same thing at all. URL manipulation is a form of hacking. It may only work in the presence of appallingly lax security (like accessing a system with a default user name and password), but it's still an attempt to access data you're not supposed to have access to. And the main point of law here that the 'free data' crowd tend to gloss over is that 'exposing' is not the same as 'publishing'.
With this method, one can demonstrate to any enquiry that all the source data was openly available (assuming you haven't been skimming numbers off other people's orders).
I get it perfectly well - the law says that when you accidentally give access to information to someone not authorized, you're not publishing the data, and when the unauthorized person accesses that data it is unauthorized access to a computer.
The law is a fucking ass. Putting something online is publishing, allowing someone access to data is authorizing them to access it. The law says that these things are not publishing nor authorization, and so the law is - obviously - wrong.
It does not matter that you did it accidentally - don't have bad processes.
It does not matter that the "someone" is an unidentified anonymous internet user - that is who you authorized to access it.
Businesses and courts don't like this because it makes their lives difficult, so instead they made the law difficult. Much better to redefine what "published" and "authorized" mean in newspeak than to properly secure your data.
Anyway, the whole point of this was not about the vagaries of URL manipulation - TFA suggests you can infer information from your competitors, and indeed you very often can.
Just be wary when you realise you can extract a great deal of information from them and think about the legal implications before you fire up a script to capture all that lovely information - it might be illegal to retrieve the information they have "published" and "authorized" you to access, for the reasons listed above.
@Tom38: totally agreed with your reasoning. If as a company you don't want people to know things, don't put them unprotected on the internet. IMO, the members of the legal profession who ignore this (and interpret the law as done by your esteemed partner in this discussion) just don't get it or, as you indicate, probably don't want to get it.
Unfortunately, they get to make the rules...
So what you're saying is, I was absolutely right, you are in breach of the law when uncovering data that you were not intended to access; that inadvertently making data available does not constitute publishing it, but because you don't happen to like the law as it stands, you're still going to downvote me and try and claim some sort of moral superiority?
Re: "the whole point of this was not about the vagaries of URL manipulation" - err - yes it is, because that's exactly and entirely what your original post, and my rebuttal, was based on. We do in fact agree that the law, as it stands, does not support what you call 'common sense', but then it's not 'common sense' just because you say it is.
The real matter here is that inferring information from published data is perfectly legal. Extracting data by unauthorised means is not.
>I was absolutely right, you are in breach of the law when uncovering data that you were not intended to access
It depends on the meaning of "were not intended to access". I find it interesting just what you can get Google to uncover through a well-constructed set of search terms: in my regular web searches on various aspects of IT, I keep encountering hoards of information which, on investigation, seem to be totally inaccessible via the normal HTML homepage. Given that Google uses 'signposts' erected by the website owner to uncover content, it puts a different spin on what is meant by "intent".
The crux of the article is that you can often extrapolate, fill-in-the-gaps, connect-the-dots, the data you actually want/need from other data someone has intentionally made public, or has exposed as part of a business flow (like the engine serial numbers). It is not about extracting that information straight from the source; URL manipulation would be akin to sending in a spy to take a peek at the weekly internal production reports from the tank factories. Sure, you can do that, but you risk your spy getting shot or your data sniffing being caught. Using maths and statistics on legally available incomplete data doesn't carry the risk of being hauled before the beak.
Ah yes, the so-called Doomsday argument. I've read that almost everyone who comes across it sees a flaw in it. Apparently, if you actually get down to putting serious study into it, you end up changing your mind about what the flaws are. In essence, there's a consensus that it's wrong but no one can agree on WHY it's wrong.
Which, frankly, is a bad sign for the future of the human race. Better have a beer now.
I've read that almost everyone who comes across it sees a flaw in it. Apparently, if you actually get down to putting serious study into it, you end up changing your mind about what the flaws are.
Perhaps you read that in Randall Munroe's What-If? It's a nice discussion of the Doomsday Argument [1], and his phrasing is similar to yours.
Which, frankly, is a bad sign for the future of the human race.
Maybe so (though I personally find myself unable to care about hypothetical long-term survival of the species), but it's a good sign for each of us as individuals, since by the same argument we're most likely not living in the End Times. So that's one fewer thing to worry about.
[1] In reference to Twitter and hypothetical web-page height, naturally.
Perhaps you read that in Randall Munroe's What-If?
I couldn't remember where I'd read it, but if it's been in What-If then that's most likely it. I make pretty regular visits there. Math, physics and logic applied to silly questions, with Randall's brand of humor... what more could a nerd want?
On the far end, for big values of E it makes sense: you're increasing your estimate by a small amount, and the bigger the sample you get, the smaller the correction you make.
But for small samples you're increasing your estimate by some factor (100% with one sample, 50% when E=2, ~33% when E=3) that does not seem very reasonable. Is there some lower bound for the number of samples?
You have a very large uncertainty when dealing with small sample sizes, so I find this entirely reasonable. Basically, if you have one sample, you just assume the one number you found is somewhere in the middle of the range. (The chances of finding one in the first or last quarter of the range are much smaller than finding one in the middle somewhere.) Thus you double the number and call it a day. You'll only get any decent sort of estimate with larger sample sizes. I'd say at least E=5 as a lower limit for any sort of "accuracy", but that doesn't make lower-sample-size guesstimates any less relevant, nor higher-sample-size estimates very accurate.
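To put rough numbers on that intuition, here's a quick Monte Carlo sketch of my own (assuming the usual m + m/k − 1 estimator and a true fleet of 1,000) showing how the scatter collapses as E grows:

```python
import random
import statistics

N = 1000        # the true, hidden number of tanks
TRIALS = 10_000

def estimate(serials):
    # German tank estimator: m + m/k - 1
    m, k = max(serials), len(serials)
    return m + m / k - 1

for e in (1, 2, 3, 5, 10, 30):
    runs = [estimate(random.sample(range(1, N + 1), e)) for _ in range(TRIALS)]
    print(f"E={e:2}: mean={statistics.mean(runs):7.1f}, "
          f"stdev={statistics.stdev(runs):6.1f}")
```

The mean sits near 1,000 even for E=1 (the estimator is unbiased), but the standard deviation shrinks roughly as 1/E, so single-sample guesses are individually wild even though they're right on average.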
"The chances of finding one in the first or last quarter of the range is much smaller than finding one in the middle somewhere"
If we're talking about stumbling upon a piece of data, why is it more likely to be from the centre rather than the tail? i.e. what, other than it occurring more frequently in nature, makes us assume a normal rather than a uniform distribution? Would the tanks not be uniformly distributed?
Just curious.
Old tanks are more likely to have been converted to scrap already. New tanks are more likely to still be at the factory, en route, or deployed to particularly strategic locations. Which means the general population is more likely to come from the middle segment. (Of course, if you start looking at tanks in those particular locations that just received a shipment of new tanks, this might skew the data.) Overall it's just a decent assumption to take the number you have and double it if you have just a single sample. Once you get 2, you get slightly more confidence, etc.
There's no "lower bound" as such, but you want to have enough samples to be confident in your estimation of N.
You're looking for information on "confidence intervals". Check out the best answer on the page below. It details how to find the confidence interval of the maximum likelihood estimators for "a" and "b", where a and b are the lower and upper bounds of the distribution.
http://stats.stackexchange.com/questions/20158/determining-sample-size-for-uniform-distribution
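To save a click, the core of it boils down to a few lines. For k draws from a uniform distribution, the sample maximum m satisfies P(m/b ≤ x) = x^k, which gives a one-sided confidence interval for the upper bound b. A sketch of my own (not code from the linked answer, and taking the lower bound as zero for simplicity):

```python
def ci_upper_bound(samples, alpha=0.05):
    """One-sided (1 - alpha) confidence interval for the upper
    bound b of a Uniform(0, b) distribution. Since
    P(max/b <= x) = x**k, we have b <= m / alpha**(1/k) with
    probability 1 - alpha, and b >= m always."""
    m, k = max(samples), len(samples)
    return m, m / alpha ** (1 / k)

lo, hi = ci_upper_bound([1042, 3944, 7114, 9577, 14003])
print(f"b lies in [{lo:.0f}, {hi:.0f}] with 95% confidence")  # hi ~ 25500
```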
1. Many companies' products' serial numbers incorporate the production year and/or month [not necessarily known to the researcher], so 1204**** may, just may, relate to April (20)12 production.
2. Honda 450: 1965-1968 serial numbers began CB450-1000001, but 1968-1969 serial numbers began CB450-3000001. Squaddies searching blown-up bits might find 1000321 & 3000198 but not know the years. If those were tank numbers, you'd estimate millions of them, and bulk orders for white flags would ensue.
In the first example, if you get a decent sample of serial numbers, the production-date incrementation becomes quite obvious.
In the second example, if you have a sampling from the CB450-1xxxxxx range and a sampling from the 3xxxxxx range, it'll become obvious they come from two different series. Finding no data at all in an intervening 2xxxxxx range means you can assume that range doesn't exist.
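That "two series" check is easy to automate crudely: sort the sample and split wherever a gap between neighbours dwarfs the typical spacing. A heuristic sketch with made-up Honda-style numbers (the gap_factor threshold is an arbitrary choice of mine):

```python
import statistics

def split_series(serials, gap_factor=10):
    """Heuristic: sort the serials and start a new series wherever
    a gap is gap_factor times the median gap between neighbours."""
    nums = sorted(serials)
    if len(nums) < 2:
        return [nums]
    gaps = [b - a for a, b in zip(nums, nums[1:])]
    cutoff = gap_factor * statistics.median(gaps)
    series = [[nums[0]]]
    for prev, cur in zip(nums, nums[1:]):
        if cur - prev > cutoff:
            series.append([])
        series[-1].append(cur)
    return series

# Made-up finds straddling the CB450 1xxxxxx and 3xxxxxx ranges
finds = [1000321, 3000198, 1000098, 3000044, 1000250, 3000120]
print(split_series(finds))
# -> [[1000098, 1000250, 1000321], [3000044, 3000120, 3000198]]
```

You'd then run the tank estimator within each detected series separately, rather than across the whole sample.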
1930s Aston Martins started with the month production started as a letter (A for Jan, B for Feb, etc.), then a single digit for the year of the decade (2 for 1932, 3 for 1933, etc.), followed by the actual chassis number for the model and a suffix letter for type. So e.g. C2/201/S is short chassis 201, built in March 1932; G7/722/L would be long chassis 722, laid down in July 1937. However, they don't seem to have built the chassis in order - so we have H7/717, C7/719 (5 months earlier), B40/720 (3 years later), F9/721 (back a year), A9/722, A7/730, B7/736 (finally moving forward again) and so on. Even if you spot the lack of repetition of 3-digit numbers, you'd still be caught out, as there were frequent jumps to the next hundred when a new model came out or a new owner bought the company. (The earlier cars, up to number 74, were much easier: S for Sports and T for Touring, with no indication of date - apart from MS1, a polished chassis built specially for motor show display. Subsequently, though, the number 273 appeared on at least 3 sets of records, but does not appear to have ever actually left the factory.)
Of course, similar maths is used in biology too - estimating species populations etc.
But an interesting twist is the knock-on effect. The component supply train (in biology, the food chain) is also part of the analysis.
You should try it on the Tube some day, spotting the "missing link" as we call it....
P.
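For the curious, the textbook biology version is mark-recapture, usually via the Lincoln-Petersen estimator: mark a batch, sample again later, and scale up by the fraction of marked individuals you see the second time. A minimal sketch:

```python
def lincoln_petersen(marked, caught, recaptured):
    """Lincoln-Petersen population estimate: if M marked animals
    mix back into a population of size N, a second catch of C
    animals should contain about C * M / N marked ones, so
    N ~ M * C / R."""
    return marked * caught / recaptured

# e.g. ring 100 birds; a later catch of 60 contains 12 ringed ones
print(lincoln_petersen(marked=100, caught=60, recaptured=12))  # -> 500.0
```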
Reminds me of France declaring Mendeleev (the periodic table guy) persona non grata 100+ years ago. He correctly guessed that their super-secret advanced smokeless gunpowder was indeed trinitrocellulose by counting the railway wagons of cotton, sulphuric acid and potash going into the plants.
He also quite correctly predicted that it would all end in tears (due to degradation of highly nitrated and non-inhibited cellulose over time). And indeed it did: http://en.wikipedia.org/wiki/French_battleship_Libert%C3%A9
Yup. Just one of many historical examples of a side-channel (aka "covert channel") attack.
When I were a lad, I read a novel in the Danny Dunn series where Danny and friends learn to heuristically interpret product codes and the like. I thought it was great fun, though the plot mostly revolves around mundane exploits like finding expired product on the shelves of a local store. Gave me a lasting appreciation of side channels.
These days they're well known in computer cryptanalysis for things like the timing and power attacks described by Kocher and others. (Kocher's original timing attack against RSA and other systems was particularly important, as it demonstrated that using these side channels to break security was feasible; blinding is now considered a requirement for timing- and power-sensitive algorithms.)
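To make that concrete, the essence of a timing side channel fits in a dozen lines: a naive byte-by-byte comparison returns sooner the earlier it hits a mismatch, so an attacker who can time responses learns a secret one prefix at a time. A toy illustration (this is the generic string-comparison leak, not Kocher's RSA attack, which exploits timing variation in modular exponentiation):

```python
import hmac

SECRET = b"s3cret-token"

def naive_check(guess):
    # Variable-time: bails at the first mismatching byte, so the
    # running time leaks how many leading bytes of the guess match.
    if len(guess) != len(SECRET):
        return False
    for a, b in zip(guess, SECRET):
        if a != b:
            return False
    return True

def safe_check(guess):
    # A constant-time comparison closes this particular side channel.
    return hmac.compare_digest(guess, SECRET)
```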
Bag tag numbers are not allocated sequentially for security reasons.
When I first went to the USA and opened a bank account with a cheque book, a colleague advised me to rip out the first 50 or so cheques. The reason: he said retailers would be suspicious of a low serial number on a cheque, indicating a... recently opened cheque account.
Yeah, I got that reaction once. I gave her the gimlet eye and said "yeah it is, what of it?" and let the awkward pause continue very uncomfortably. She finally rang it up and never made eye contact again.
Of course, I'm also antisocial enough that when ATMs were invented, I immediately thought "oh! I no longer have to deal with making inane small talk with dumbass condescending bank tellers!! thank ******ing god!"
It's a technique of inference.
Like the old CIA section that dealt with "crateology", the study of shipping containers.
Keep in mind that incrementing counts are used because they are cheap to track and to generate. ("Is SN 37256 one of ours?" Well, if the SN counter is up to 40000, probably.)
May be important, may not be
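That plausibility check is about as cheap as code gets, which is a large part of why sequential counters persist despite the leakage. A toy sketch:

```python
NEXT_SERIAL = 40000  # the current value of our serial counter

def probably_ours(serial):
    # Anything below the counter could plausibly be ours (a real
    # registry lookup would confirm); anything at or above it
    # cannot be, because we never issued it.
    return 0 < serial < NEXT_SERIAL

print(probably_ours(37256))  # True  - plausible
print(probably_ours(51234))  # False - never issued
```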
Correct if you are looking at production across the entire tank line. However, if you read the Wikipedia article mentioned higher up, you'll see that the investigation, as described, concerned one particular tank model.
Which kind of makes sense intuitively, because serial numbers would not necessarily carry across model lines.
But kudos for applying a bit of sanity-checking to a numerical claim. People often have no grasp of numbers, take things at face value and fail to see gaping holes in the most stupid claims.
Awesome article! Encore! Encore!
A friend working at Intel once (1998) told me that "tomorrow there would be a press release in all the morning newspapers", so I told him the subject of the press release. He looked a little surprised, and asked why I believed this.
So I told him that it had to be something really special for the morning newspapers to pick it up, and unless Intel was doing something completely different from what they were doing, the only thing that would make the morning newspapers would be a highly integrated x86 processor. Intel releasing a 2x-size flash memory would not be interesting to the general public, and would only make it into electronics magazines.
True enough, the next day the i386SL was released.
This used to be a big question in the mainframe age.
I maintained material numbers between 100000 and 153000, of which those below 125000 were migrated from two 20th-century systems.
Then a new application split the materials by type and gave each type its own domain - from 700000 for sales products, 800000 for raw materials, 600000 for manufactured parts, etc. - keeping the old "mixed" ones as legacy.
An external customer sees numbers from 100000 to 730000 on his shipping papers, and gets much less of an overview of what is going on.