Cell Phone Tracking
A few points worth making:
1. It's a fairly safe guess that they were doing Cell Id tracking, not latitude/longitude as some commenters have suggested. There are several reasons for this: 1) it's easy to collect as part of normal network operations (and however the network may feel about your privacy, they *do* care about the money they'd have to spend supporting this research); 2) the cost of tracking 100,000 people using any of the better positioning techniques (Angle of Arrival, Time Difference of Arrival, etc.) is prohibitively high; 3) if they collected anything more accurate then you can bet that the company would become a bit concerned about the privacy risks as well.
2. I think that Barabas is either being slightly disingenuous or rather naive when he says that some (hexadecimal?) hash is anonymous and therefore 'secure'. As someone else has pointed out, that's *pseudonymous* data, and that means much lower levels of privacy protection. There are a number of well-known vectors of attack against this class of information and if they'd done anything particularly clever to protect the anonymity of the users then you can bet they'd be touting it to the heavens. I'll be quite interested to see what would happen were I to request access to this data for research purposes.
3. The data from Rome (@Drew), on the other hand, actually is anonymous. That research uses Erlang (there was an article on this research project in the IEEE), which is a measure of traffic load that comes from classical telephone traffic engineering and long predates cellular networks. 1 Erlang sustained for an hour is one person-hour of usage, so you don't really know if it's 1 person on the phone for a *looooong* time or 60 people on the phone for a minute each (both add up to the same 60 call-minutes). The GPS data for that research came from buses and taxis, which is hardly what people should be getting up in arms about.
4. I'd be rather unpleasantly surprised to find out that the network had shared billing location with the researchers, but even without it you can deduce home and work locations simply by looking at where people spend the most time. On the theory that nearly everyone returns home on weekends, and can often be found at home on weeknights, it's not too hard to guess where 'home' is for any randomly selected user. This hardly resolves the underlying privacy issue with this data, but if you combine having to guess at someone's house using statistics with the diameter of the average network cell (100 m in downtown areas, 5 km in rural areas) then this does make it *slightly* harder to figure out who someone might be in an automated way (you'd get a lot of false positives and false negatives unless you were looking for a specific person whose habits you already knew).
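To make the home-guessing point in item 4 concrete, here is a minimal Python sketch. It assumes records of the form (pseudonym, cell ID, timestamp), which is my guess at what such a data set might look like and not a description of what the researchers actually held. The function name and the 9pm-to-6am "night" window are made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def guess_home_cells(records):
    """Guess a 'home' cell per pseudonym from (pseudonym, cell_id, timestamp) tuples.

    Assumption (hypothetical record layout): timestamp is a datetime in local time.
    Heuristic from the text: people are mostly at home on weekends and weeknights.
    """
    counts = defaultdict(Counter)
    for user, cell_id, ts in records:
        is_weekend = ts.weekday() >= 5            # Saturday=5, Sunday=6
        is_night = ts.hour >= 21 or ts.hour < 6   # arbitrary illustrative window
        if is_weekend or is_night:
            counts[user][cell_id] += 1
    # The most frequently seen cell in those periods is the 'home' guess.
    return {user: seen.most_common(1)[0][0] for user, seen in counts.items()}
```

Note that the answer is only a cell, not an address: with cells anywhere from 100 m to 5 km across, the guess narrows things down to a neighbourhood, which is exactly why automated matching would throw up so many false positives.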
It also behooves me to talk about some of the positives of this type of research and some of the permission-based issues that arise. For things like transportation planning and infrastructure provision, it helps a *lot* to understand where people are and when, as well as where they are trying to go. Most of this type of planning is based on random samples of a few hundred people, but if you could figure out that a lot of people are trying to get from A to G then you could redesign your bus routes so that an express bus did that segment directly rather than wandering through B, C, D, E, and F. You can also make guesses about travel mode by tracking the speed of a phone, so this (again) can help you to figure out how to deliver services better (maybe a bus-only lane to encourage a switch to public transit, or maybe the placement of government offices or public service announcements at locations where people are likely to encounter them at convenient times). Or in the event of a terrorist attack or natural catastrophe you could determine pretty quickly how many people were affected... For all of this type of work, if you use mobile phone data then you can start to work with samples of hundreds of thousands, so your findings and predictive models get much, much better. The challenge is to do it in a way that is provably private... I don't think that this research lives up to that standard, so interesting as it is I have some serious qualms about how it was done, and Barabas' reticence suggests that he does as well.
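The travel-mode guess mentioned above can be sketched roughly as follows. This assumes you have two successive position estimates for a phone as (latitude, longitude, seconds) tuples, for example the centres of the cells it was seen in; that input format, the function names, and the speed thresholds are all my own illustrative assumptions, not anything taken from the research.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def speed_kmh(fix_a, fix_b):
    """Average speed between two (lat, lon, unix_seconds) fixes."""
    (lat1, lon1, t1), (lat2, lon2, t2) = fix_a, fix_b
    hours = (t2 - t1) / 3600.0
    if hours <= 0:
        raise ValueError("fixes must be in increasing time order")
    return haversine_km(lat1, lon1, lat2, lon2) / hours


def guess_mode(kmh):
    """Very crude travel-mode guess; the thresholds are arbitrary placeholders."""
    if kmh < 7:
        return "walking"
    if kmh < 25:
        return "cycling or slow traffic"
    return "motor vehicle or rail"
```

With cell-level positions the distance between two fixes can easily be off by the width of a cell, so a guess like this only becomes meaningful over longer trips or when averaged over many phones.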
The problem is that securing permission from several million people is, frankly, impossible, as is managing opt-ins and opt-outs for each of them individually (at what point in the process do you filter out the opt-outs? how do you manage permissions in the first place?). So there's a debate that needs to be had around whether the public value of research using this type of data outweighs some arbitrary level of concern about privacy. If I could promise you a 1 in 1,000,000 chance of being reidentified would that be enough if I could also promise you better public transit or public services? How about 1 in 100,000? Right now permission is an all-or-nothing game on both sides of the debate, and I don't see that as very constructive because this data really could be used for *your* benefit (and not just for advertising and surveillance).
And on a final note, I'd guess that this research used the same operator (and possibly data set) as an article on social networks that appeared in the Proceedings of the National Academy of Sciences (PNAS) last year. I've heard rumours about this data set and all I can say is that I'm 99% certain it didn't come from the UK (small comfort, I'm sure).