De-identify, re-identify: Anonymised data's dirty little secret

Publishing data of all kinds offers big benefits for government, academic, and business users. Regulators demand that we make that data anonymous to deliver its benefits while protecting personal privacy. But what happens when people read between the lines? Making data anonymous is known as de-identifying it, but doing it …

  1. Anonymous Coward
    Anonymous Coward

    The article said it best.

    "(T)he data shouldn't be shared at all." Make collecting it optional & _sharing_ it illegal. No giving it away for free, no "accidentally" leaking it, no selling it, no letting someone else browse your customer records, nothing. YOU can compile it but you can't show it to anyone else.

    1. ibmalone

      Re: The article said it best.

      How does one do immensely useful public health or other medical research then? Any dataset of really useful size is generally too big to be collected by a single institution, but limiting its use to that institution freezes out a lot of talent for investigating the data. The alternative is a plethora of underpowered studies.

      Edit: in this case I'm talking about data collected from participants with informed consent and ethical approval.

      1. Doctor Syntax Silver badge

        Re: The article said it best.

        "in this case I'm talking about data collected from participants with informed consent and ethical approval."

Which is a different matter if it's still being used within the scope of the original consent, even if there are multiple research partners.

      2. Anonymous Coward
        Anonymous Coward

        Re: The article said it best.

        People who support medical research can download an app for that purpose.

    2. veti Silver badge

      Re: The article said it best.

      That would set many forms of scientific research back about 30 years. Pretty much all medical research, for instance, requires publication of exactly this sort of data. If we're no longer allowed to read that data, every country in the world will have to retest and recertify every drug or treatment or vaccine for itself.

      1. doublelayer Silver badge

        Re: The article said it best.

Rubbish. If you run a drug trial, you state at the start that you're trying a drug, here's what you think it does, you're collecting lots of information about participants' medical progress, and you're going to be publishing the data you find. The participants are told this at the start and agree to it, or they don't end up in the trial. That is informed and specific consent, something most of the other examples don't have.

  2. Anonymous Coward
    Anonymous Coward

    "device ID and location data"

    You can get the location data from the device ID.

I'd remind you that Protonmail handed over the device ID (and device type, creation date, IP address) of those French protestors. The device ID is *client* side, and they would not have built a special client app just for those protestors, so I suspect they've always been logging device IDs, and it adds to my view of Protonmail as a dodgy honeypot operation.

I'd remind you of the huge databases the spies have accumulated*, and the claim that it's not mass surveillance if they don't look. Then "it's not mass surveillance if it's a selector algorithm that looks" and then "it's not mass surveillance if a selector algorithm *continuously* looks". And I suspect they'll be big on AI now, running AI queries even they don't understand against the dataset they claim is not mass surveillance.

    I wonder how many politicians in the world are on Grindr?

    * And worse, the spies turned their spying inwards against their own nations. A particularly worrying trend when your spies trust their allied foreign agents more than their citizens.

    1. scrubber

      Google lawyers

Just to repeat an old complaint of mine: this is at least in part the fault of Google, who claimed, before NSA mass collection was revealed, that Google was not "reading" emails because they were automatically scanned by its computers and processed by algorithms, and no human saw the text of your mail. The NSA then used this claim in court, citing Google's victory as precedent.

  3. Anonymous Coward
    Anonymous Coward

At a simplistic level we had this with an employee satisfaction survey, completed anonymously. To get meaningful results we needed to know Grade, Department and Location; however, typically there would only be one manager/supervisor in a department at a given location.

Whilst I could identify some individuals in the data set, we only reported to management, and only published, analysis at a single level, i.e. Grade or Department or Location; there was no two-way analysis.
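A minimal pandas sketch of why that restriction matters (the grades, locations and answers here are invented): each one-way tally is safe, but every manager-by-location cell in the cross-tab holds exactly one person.

```python
# Hypothetical survey data: why one-way tallies are safe but a two-way
# cross-tab can isolate individuals (here, the single manager per site).
import pandas as pd

responses = pd.DataFrame({
    "grade":     ["Manager", "Engineer", "Engineer", "Engineer", "Engineer", "Manager"],
    "location":  ["Leeds",   "Leeds",    "Leeds",    "York",     "York",     "York"],
    "satisfied": [False,     True,       True,       True,       False,      True],
})

# One-way analysis: each bucket holds several people, so nobody stands out.
print(responses.groupby("grade")["satisfied"].mean())
print(responses.groupby("location")["satisfied"].mean())

# Two-way analysis: each Manager-by-location cell holds exactly one person,
# so that cell's answer is attributable to a single, nameable individual.
cells = responses.groupby(["grade", "location"]).size()
print(cells[cells == 1])  # cells of size 1 are disclosure risks
```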

    1. elsergiovolador Silver badge

I am not sure why these would be anonymous at all. If your employees are scared to speak up, then you have bigger problems.

      1. Potemkine! Silver badge

I am not sure why these would be anonymous at all.

        Do you live in the care bears world?

When criticism is made, most of the time grudges result from it, whether it is accurate or not.

        Anonymization is a way to protect free speech. That's why China wants to have everybody identified on the Net.

        1. Alumoi Silver badge

          And by China I mean every government in the world. And by everybody I mean everybody except politicians and their friends.

        2. Robert Carnegie Silver badge

          Nothing says "this is bollocks" like an office e-mail reminding you that you haven't submitted your anonymous employee satisfaction survey form yet.

        3. Snowy Silver badge
          Mushroom

          Care bear world

          Sounds all nice and all that but is it?

          All the bears are nice and good and never do anything bad, I bet they have great social scores.

If you're not good the bears give you a care bear stare and you're made to be good!!

      2. doublelayer Silver badge

        "If your employees are scared to speak up, then you have bigger problems."

        No, if they're scared to speak up, then this is your bigger problem. Which is why you often need some method for them to report. If they know of something unethical or worse going on, do you think they're going to be happy to go announce that? They're not, and they're right to be worried. The incidents of punishment against those reporting misconduct are many, so it can be useful to have some method to send the information without putting an immediate target on your head should it turn out that the person who received it wants to blame you for causing a problem.

        Even then, it's hard to do it, as there's usually a small set of people who could know the thing you're disclosing, but at least you have some protection. Forcing everybody to be identified at all times is just forcing them to stay quiet and leave as their only remedy, which harms everybody except those who create the original problem. By the way, this works the same way for complaints about a bad employee as it does for larger ethical or legal issues, just shifted down a little.

        1. elsergiovolador Silver badge

          If they know of something unethical or worse going on, do you think they're going to be happy to go announce that?

If something illegal is going on, then the worker should report it to the authorities rather than to managers, who would otherwise have a chance to cover it up.

          The incidents of punishment against those reporting misconduct are many

          Anonymity does not stop that. People above can still retaliate, just sometimes not against the people who reported something.

          so it can be useful to have some method to send the information without putting an immediate target on your head should it turn out that the person who received it wants to blame you for causing a problem.

If things are going that badly, it's better to resign than to try to fix up botched recruitment. Unless you are paid to resolve interpersonal issues, what's the point?

          which harms everybody except those who create the original problem

No, it only harms the company that made bad hiring decisions. If good people are leaving, then it should be a wake-up call that maybe HR needs to investigate. I don't think workers should be supporting rotten organisations. I have seen many instances where anonymous reporting was unsuccessful - the company, instead of resolving the problem, spent resources finding out who complained, and then other workers suffered as a result.

In all instances it would have been much better if the affected workers had simply left - it would have saved them months of stress, with no resolution anyway.

          1. MachDiamond Silver badge

            "then it should be a wake up call that maybe HR needs to investigate"

            Too many times it's HR that's the problem.

    2. General Purpose

      The funder wanted routine staff stats including sexuality. We collected them anonymously and without any chance of cross-tabulating. Even so, it was a small organisation in which people were generally quite happily open about their sexuality, so it was easy to see that someone must have ticked a box that wasn't how they usually presented themselves, and there weren't many possibilities.

    3. David Shaw

      UK telephone company

In a UK telephone company a certain group of staff, myself included, became concerned about our relative pay levels. We each anonymously wrote our take-home pay on a post-it, secretly mixed them at random in a big envelope, then the senior guy wrote them up on the whiteboard.

The raw data was a big surprise; one employee of Asian extraction went on to win his case in an employment tribunal on the basis of that "employee satisfaction survey". It wasn't a fair fight tho', and he was pretty much destroyed (career & health) by the company on the way to his moral victory and a couple of thousand quid payout, which didn't compensate for any of it. That UK telco company now no longer exists.

Oh, and later in my career I published about deanonymising metadata of HTTPS over Wi-Fi (not down to plaintext, but enough to profile streams for who/what/where/origin/political-leaning/etc) by simple AI/ML pattern matching against similar packet streams from the top 100 websites, captured that same day.
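For a flavour of how that kind of traffic fingerprinting works, here is a toy sketch (the packet-length traces are invented, and a real study would capture thousands of TLS flows per site): encrypted streams are matched purely on the shape of their packet sizes, with no decryption at all.

```python
# Toy traffic fingerprinting: classify an encrypted stream by comparing
# its packet-length histogram against histograms of known sites.
from collections import Counter

def histogram(packet_lengths, bucket=100):
    """Reduce a stream to a coarse histogram of packet sizes."""
    return Counter(length // bucket for length in packet_lengths)

def similarity(h1, h2):
    """Overlap between two histograms (0 = disjoint, 1 = identical)."""
    shared = sum(min(h1[k], h2[k]) for k in h1.keys() & h2.keys())
    return shared / max(sum(h1.values()), sum(h2.values()))

# "Training" captures for known sites (hypothetical packet lengths).
known = {
    "news-site": histogram([1500, 1500, 1500, 200, 1500, 900]),
    "webmail":   histogram([300, 250, 310, 280, 300, 260]),
}

# An "anonymous" capture sniffed off the air.
mystery = histogram([1500, 1500, 200, 1500, 850, 1500])

guess = max(known, key=lambda site: similarity(known[site], mystery))
print(guess)  # -> news-site: the traffic shape alone gives it away
```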

At least the place where I did that research has published & verified pay & conditions equality for all staff, regardless of gender, origin etc. Nice to hope that they'll continue in business for quite a long while...

  4. elsergiovolador Silver badge

    Behaviour

If you have enough data, a person can be identified from their behaviour. So you can "anonymise" all you want, but behaviour itself is unique to any given person, just like a fingerprint.

Now, for most services there is absolutely no need to store personal data. Many websites that allegedly gather tons of data are pretty much useless when it comes to user experience - take for example Amazon, where recommendations are rubbish and search prioritises Chinese garbage passing as legitimate products, or Facebook, where recommendations have little to do with what you are interested in, and by the looks of it they don't show updates from things you follow unless they pay.

I think most of these corporations that perform mass data collection use it to manipulate the consumer into buying things they don't want or need, and/or to get that person addicted.

    As far as I am concerned most data collection should be illegal.

    1. John Brown (no body) Silver badge

      Re: Behaviour

      "recommendations are rubbish"

Virgin Media cable TV service. They have absolute and complete records of everything I watch, how long I watch for, what I record, and if or when I watch those recordings. But to my recollection I have only ever watched one single programme from the "Suggested" list. Everything in the list is either something I've seen before or would never normally watch. It's no wonder that the likes of Amazon or Facebook can't get it right when they can only ever have partial data on a subject.

      1. Korev Silver badge
        Black Helicopters

        Re: Behaviour

But to my recollection I have only ever watched one single programme from the "Suggested" list.

You're assuming that list is things that you would like to watch; it's more likely to be what they want you to see. For example, if a streaming music or video site has made their own content then they might want you to view that, so they don't have to pay royalties and/or a PHB can show a return on investment.

        1. John Brown (no body) Silver badge
          Thumb Up

          Re: Behaviour

          Whatever the motivation for the suggestions, it's clearly not working in my case :-)

          1. ravenviz Silver badge
            Devil

            Re: Behaviour

            Or not that you know of!

      2. MachDiamond Silver badge

        Re: Behaviour

        "It's no wonder that the likes of Amazon or Facebook can't get it right when they can only ever have partial data on a subject."

If you fell much closer to the average, it might be uncanny how well their recommendations fit. I also find suggestions from online entities to be wide of the mark nearly all of the time. Is it that somebody who watches lots of television winds up being conditioned? If you shop on Amazon, are you being nudged into considering things you might not otherwise take an interest in?

  5. Anonymous Coward
    Anonymous Coward

    "NHS Data extracted will be pseudonymous" says Tory Government

    .......while being in bed with famous de-anonymising expert Peter Thiel and his company Palantir.

    *

    So....in addition to the threats posed by all the other "anonymised" data sets available for purchase out there, UK citizens also have to worry about the threats posed by their own government!!!

    *

Yup.......when you next hear a politician talk about "Keeping us safe"......be very, very afraid!!!! This phrase actually means "Keeping me - the politician personally - very rich".

    1. Ken Hagan Gold badge

      Re: "NHS Data extracted will be pseudonymous" says Tory Government

      The fact that they felt the need to invent a new word to describe their actions is quite telling.

  6. Doctor Syntax Silver badge

There's a fairly straightforward solution. Make re-identification of de-identified data illegal, with personal responsibility falling on someone in senior management. The newspaper editor, the marketing manager or, even better, the CEO gets a criminal record and goes to jail. And for good measure the company loses any government contracts it may have, forfeiting any outstanding payments for work done.

    The only way to deal with excess data is to make it toxic. That will give businesses second thoughts about collecting it in the first place and make them very, very careful about how they use it.

    1. Anonymous Coward
      Anonymous Coward

No....Not Really a "Straightforward Solution"....

      @Doctor_Syntax

      ........but that won't be a bit of use when HUGE datasets are exfiltrated by unknown actors. See Equifax slurp as an example. Note that the "unknown actors" might be the modern STASI (aka NSA, GCHQ, Five Eyes, etc. etc.).

      1. Doctor Syntax Silver badge

Re: No....Not Really a "Straightforward Solution"....

None of the examples in the article falls into that category. However, the concentration of data that the likes of Equifax accumulate could also be regarded in the same way. When a business holding that much data guards it so badly, there should be personal penalties for senior management. It might take a few prominent cases of CEOs or board members being jailed, but not too many. Management needs to be pushed into thinking along the lines of "This stuff could be dangerous to me".

    2. Evil Auditor Silver badge

      «Make re-identification of de-identified data illegal»

That would be a bit like shooting the messenger. Rather, make the sharing of personal data without explicit consent from the subject illegal.

    3. Mike 137 Silver badge

      "Make re-identification of de-identified data illegal"

Section 171 of the UK Data Protection Act 2018 already does this unless consent has been obtained from the Data Controller, with certain defences including testing the effectiveness of anonymisation and the public interest.

      1. Doctor Syntax Silver badge

        Re: "Make re-identification of de-identified data illegal"

And the Act also allows for personal penalties. How many of those have you heard of? Things need to move on: personal, not just corporate, jeopardy needs to become the norm.

      2. SundogUK Silver badge

        Re: "Make re-identification of de-identified data illegal"

        "...the public interest."

        Which means exactly what they want it to mean.

    4. Loyal Commenter Silver badge

      That works when it's being de-anonymised for fairly innocuous purposes, such as advertising (yes, I know advertising isn't actually that innocuous). What it does nothing about is malign purpose.

      What if someone de-anonymises that data so that they can use it to perform identity theft? They're criminals anyway, they're not going to care that it's illegal to do so.

      What about state actors? The STASI don't exist any more, but don't for a second imagine that there are no other government level organisations that are exactly the same. If you don't know what the STASI were getting up to in the second half of the 20th century, then I suggest you visit the museum that is now housed in their former headquarters. Take a look at the "bread vans" used to abduct people, and the industrial scale steamers used to open pretty much everyone's mail.

More specifically, what about foreign state-run covert organisations, like Russia's GRU operatives who go round poisoning dissidents? Do you think they are going to care that what they are doing is illegal when they de-anonymise some publicly available data to track the movements of their target?

      I think the actual answer lies in banning the commercial sale of such data sets at all, whilst retaining a fair-use policy for research, with the conditions that the researchers are responsible for the data they handle, and such research is registered and vetted in some way to prevent bad actors pretending to be researchers, or exfiltrating sensitive data from research establishments.

As noted by others here, "legitimate" uses for such data are pretty poor anyway. I don't need to see adverts for washing machines for two months after I've just bought a new washing machine, but this is the pattern that seems to be common for targeted ads - show you lots of them for variations on that one-off or infrequent purchase that you've just made. That, and "people who bought that item that you got as a Christmas present for your six-year-old niece also searched for these other Disney Princess items". And, of course, the other favourite of targeted advertising, where there are too few data points. Just bought an out-of-print book on some obscure topic? Well, the last person who bought a copy of that six months ago also bought a load of sex toys and parts for a classic car, so does this dildo interest you, sir? What about this replacement head gasket for a Mark 6 Cortina?

    5. CrackedNoggin Bronze badge

The only people who would then be prosecuted are those researching and exposing how easy it is to re-identify data. How do I know that?

  7. First Light

    Outed

    A closeted gay priest working for the predominantly hyper-conservative USCCB. That's an attractive target for miscreants.

  8. Mike 137 Silver badge

    "most practical remedies are limited in terms of really being risk-based"

The validity of the risk decision depends on whose risk it is seen to be. If the data anonymiser or user defines the risk, is it risk to themselves (getting caught out) or risk to the data subjects? If the former, it's the wrong risk to consider. If the latter, they're far from the best judge.

Bearing in mind the poor quality of most corporate risk decision making (limited investigation, fragile methods), attempting to accurately assess risk to some third party (hypothetical to you) is almost impossible, particularly in aggregate ("risk to this population"). But quite apart from any inadequacy of information or technique, potential for harm depends not only on the event but on personal (i.e. individual) circumstances, so risk to the population can be too crude a measure. The outliers may be the important instances.

The GDPR itself is pretty weak on this. Despite using the term "risk" dozens of times throughout the text, the only two categories of risk it specifies are "risk" and "high risk", and it provides no significant guidance (let alone any definition) of where the threshold between the two sits. As the party deciding on risk in that context is also the party interested in processing the data, this presents a serious failure of governance.

  9. Cuddles

    When is an ID not an ID?

    " Even though the data set had no identifying information, the newsletter found him using his device ID"

    It seems a little misleading to say that there is no identifying information in a data set containing a unique identifier for every user. That is precisely what the "I" in "ID" stands for, after all.

    1. Graham Cobb Silver badge

      Re: When is an ID not an ID?

      This is a hard problem. What we need are some experts to help define the rules.

      For example, take data from those cameras that watch traffic to help navigation apps (the commercial ones - let's put police cameras on one side for now). The data they collect includes a load of observations of the same number plate in different places over time. They then analyse that to understand how long it is taking traffic to get from (say) J1 on a motorway to J2 on the motorway at this time.

Now they realise that this data may be valuable at a later time for other purposes (such as road planning). They could just sell the data, with the full reg number (or maybe a cut-down reg number). Better, they could replace each occurrence of the same reg each day with a random number, so journeys for the same person could not be correlated across different days. Except that is probably not enough: many people make the same journeys every day, so the person who leaves my village at 8AM, stops for 10 minutes in the local Sainsburys, then drives up the M1 is probably me. And if I sometimes take a diversion to my mistress on the way, it can probably be spotted.
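A minimal sketch of that per-day tokenisation (the keys and plates here are invented): a keyed hash with a daily key keeps a plate linkable within one day but unlinkable across days.

```python
# Sketch of "random number per reg, per day": an HMAC with a daily key
# gives every plate a stable token within one day, a fresh one the next.
import hashlib
import hmac

DAILY_KEYS = {"2021-11-01": b"key-for-monday", "2021-11-02": b"key-for-tuesday"}

def daily_token(reg_plate: str, date: str) -> str:
    """Same plate + same day -> same token; different day -> unrelated token."""
    return hmac.new(DAILY_KEYS[date], reg_plate.encode(),
                    hashlib.sha256).hexdigest()[:12]

print(daily_token("AB12 CDE", "2021-11-01"))  # stable all Monday
print(daily_token("AB12 CDE", "2021-11-02"))  # unrelated on Tuesday
```

Even then, as the paragraph above argues, the journey pattern itself can still give the driver away.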

      Like most hard problems we need multiple solutions working together. People selling data must make sure it contains no identifiers (nothing that definitely links samples together - not just email addresses, names or IP addresses but device fingerprints, etc).

They also need to make sure that data that is not critical for the purpose it has been sold for is removed - for example, age may not be necessary if you are calculating footfall for new bus routes. Thirdly, reidentification must be a criminal (not just contractual) act. And I am sure there are more.

      Of course, Boris has decided the new ICO will burn all the privacy rules, instead of actually trying to fix these problems.

      1. Anonymous Coward
        Anonymous Coward

        Re: When is an ID not an ID?

        "Thirdly, reidentification must be a criminal (not just contractual) act. And I am sure more."

It already is in the UK, as mentioned by another commenter earlier - the UK DPA 2018, which came into effect on the same day as the GDPR, introduced several new criminal offences. Section 171 of the UK DPA 2018 covers this:

        "criminalises the re-identification of personal data that has been ‘de-identified’ (de-identification being a process - such as redactions - to remove/conceal personal data)"

        from https://www.cps.gov.uk/legal-guidance/data-protection-act-2018-criminal-offences

    2. Anonymous Coward
      Anonymous Coward

      Re: When is an ID not an ID?

Under GDPR a device ID is (probably) classed as identifiable information. The only thing you could argue is whether there is a 1:1 mapping of device to individual, i.e. whether it is a shared device or not.

  10. Mike 137 Silver badge

    "Control and accountability disappears when you hand it over."

    That's the big problem with GDPR Data Transfers in general. A data controller has no statutory obligation beyond verifying the legality of the recipient's privacy regime. Liability for what goes on under the radar after the transfer has happened is specifically excluded. This was made clear quite a while back in relation to tracking widgets on web sites. The site owner is responsible to data subjects only up to the point where the tracking message leaves their site. Thereafter the responsibility to data subjects passes to the recipient of the data. As a result it has emerged, at least in practice, that the transmitting party generally doesn't care what the receiving party does with the data (beyond of course the utility of the exchange to itself). This necessarily undermines the entire intent of the relevant regulations.

    1. Anonymous Coward
      Anonymous Coward

      Once again with feeling.................

      @Mike_137

      .................so.................GDPR really is a joke!!!

    2. Loyal Commenter Silver badge

      Re: "Control and accountability disappears when you hand it over."

      The two parties, in this situation, are the "data controller" and "data processor". If the controller has established that the processor has a legitimate reason for having and processing the data, that is all they need to do. The processor also has obligations, though, and that includes only using that data for the purposes that they specify. Anything that goes on "under the radar" is forbidden, and the processor is liable.

Yes, the controller is not responsible for this, but why should they be? It's the processor that is acting illegally. If they don't get found out, of course they get away with it, but guess what? That applies to every single crime. If nobody finds out you did it, you get away with it, from the most trivial crime of shoplifting a penny sweet to mass murder: if you don't get caught, nobody knows.

What GDPR does do is set out penalties for those who do get caught, and specify that data should only be processed for set purposes. It might not be perfect, and there may be the odd loophole, but it's a hell of a lot better than the US approach of "psst, wanna buy some data?"

      1. Mike 137 Silver badge

        Re: "Control and accountability disappears when you hand it over."

        "The two parties, in this situation, are the "data controller" and "data processor""

Not necessarily. If, for example, the NHS stores all our medical records, it may make them available to other organisations, not as processors on behalf of the NHS but for the purposes of those other organisations (e.g. university-based research projects). In this case the relationship may be joint controllership or sharing. Obligations on joint controllers are strictly specified, but obligations on sources of information are much less so under sharing agreements, unless that sharing takes the data outside the data protection jurisdiction or approved third countries. It is (unfortunately) assumed that every data controller within the relevant or approved jurisdictions will abide by the law.

And by the way, "If the controller has established that the processor has a legitimate reason for having and processing the data, that is all they need to do" is not strictly correct. A processor can only process on the direct instructions of a data controller, instructions that specify exactly what is to be done with what data for what purposes. So the only legitimate reason a processor has is having been specifically so instructed by a controller. That's primarily why the use of behemoth processing services (e.g. Mailchimp, Survey Monkey) is legally questionable, as their typically non-negotiable, unilaterally imposed contracts with their customers (the controllers) specify the processing, despite their officially being processors on behalf of said controllers.

        1. Anonymous Coward
          Anonymous Coward

          Re: "Control and accountability disappears when you hand it over."

          "Not necessarily. If, for example, the NHS stores all our medical records, they may make tem available to other organisations, not as processors on behalf of the NHS but for the purpose of those other organisations (e.g. university based research projects). In this case the relationship may be joint controllership or sharing."

There is not really a single entity called the NHS - it's lots of distinct orgs, both public sector bodies and businesses (e.g. GPs and dentists).

          The "NHS" doesn't store all our records, GP Practices are the sole Data Controller for their registered patients' records ("primary care"). Hospitals/Trusts store secondary care records for people they have treated. Both these types of records may be/are shared with other orgs. GPDPR for example was an attempt to create a central store for both primary and secondary care records.

I have ongoing ICO cases regarding GP Practices sharing their patients' records via a Data Sharing Agreement (DSA) that defines them, along with the other orgs involved, as "Data Controllers in Common" (not quite the same as Joint Data Controllers). Except the DSA that governs (i.e. legalises) this sharing requires all participants to be signatories to it, yet the local health body has admitted that *none* of the GP Practices involved have ever signed the DSA (it seems they've never even seen it, so couldn't even have agreed to it) - and so their participation in the sharing (and likewise for the recipients of the shared data) has had no valid lawful basis for the past 10 years! Part of my complaint regards the GP's loss of control (as sole Data Controller) over my patient records once they have been shared with the central system - even if this data is used for the intended purpose, it is retained centrally for death+10 years and may later be used for additional purposes that the GP is unaware of/did not intend.

          "A processor can only process on the direct instructions of a data controller that specify exactly what is to be done with what data for what purposes. So the only legitimate reason a processor has is having been specifically so instructed by a controller."

That forms another part of my complaint - my GP Practice (and it appears all other participating practices) *never* instructed EMIS and INPS (the companies that host practically all of the UK GPs' patient record systems) to set up/enable the automated data sharing integration with the central system. It appears that the central health service agency running the data sharing asked/told EMIS & INPS to enable data sharing integration for all the GP Practices in question - so EMIS & INPS, as Data Processors for GPs, have broken data protection law by acting without instructions from their Data Controllers (each individual GP Practice), and the GP Practices (and their DPOs) have also broken data protection law by failing to ensure they are in control of their Data Processors.

          "That's primarily why the use of behemoth processing services (e.g. Mailchimp, Survey Monkey) are legally questionable, as their typically non-negotiable unilaterally imposed contracts with their customers (the controllers) specify the processing despite their being officially processors on behalf of said controllers."

That feeds into issues like Data Minimisation and Purpose Limitation, especially for the sharing of data: if the Data Controller (i.e. the GP) has no control over which data is deemed mandatory for sharing purposes (e.g. in an API or in an interactive online system), then the Data Controller cannot exercise their responsibility for ensuring Data Minimisation, because it is someone else who decides on the mandatory criteria.

      2. Doctor Syntax Silver badge

        Re: "Control and accountability disappears when you hand it over."

        "Yes, the controller is not responsible for this, but why should they?"

Let's turn that round. Why shouldn't they be? They are the ones with whom the data subject has a relationship. They are the ones who undertook - or should have done - to handle the data carefully. They have a duty of care. Part of that includes care in their choice of who they entrusted to process the data if they didn't do so themselves. The processor is the controller's agent. The controller should be responsible for the actions of the agent.

That doesn't let the agent off the hook, of course, and, indeed, they would presumably be liable to the principal for breach of contract, but the data subject should have a very clearly identified body from whom to claim redress.

The problem with the Privacy Figleaf is that the US jurisdiction allows the contractual obligations to be overruled. If that were not the case, it seems to be accepted that the EU data subject would be able to take action against the processor in the US; that, I think, is unacceptable. The data subject should be able to take action in the jurisdiction where the original transaction occurred, and against the other party in that transaction, the data controller.

        1. Loyal Commenter Silver badge

          Re: "Control and accountability disappears when you hand it over."

          Let's turn that round. Why shouldn't they be?

          Because that would have a stifling effect on service providers.

          Say, for example, that you are running a medium sized business with a few hundred employees, and you want to get another company to handle payroll for you. Should you be held responsible for a rogue employee of that payroll company stealing that data and attempting identity theft on one of your employees? You've given the payroll company the data they need (presumably, names, bank account details, NI Numbers, salaries, etc.) and they have agreed to use those data for the purposes of administering payroll for your employees. If they've failed to secure that data adequately, despite giving assurances that they have (which is part of the role of data processor), that is their responsibility.

If this is the case, where does the shift in responsibility end? If you buy something from a supermarket, and an employee of another branch of that supermarket goes berserk and kills someone, are you to be held responsible for that? Why should there be special conditions on a business relationship, where responsibility for things outside of your direct control is transferred, just because data is involved?

On the other hand, if, as a data controller, you have failed to get a proper data processor agreement from that processor, detailing the scope and purpose of their data processing, along with assurances that it will not be used for anything outside that scope, then you have failed in your due diligence. "Barry down the pub can do your payroll, just send him a spreadsheet with all your employee deets".

          The responsibilities of, and between controller and processor are pretty well defined.

  11. mark l 2 Silver badge

Having a full UK postcode in any supposedly anonymised data is asking for trouble. Sure, if the postcode was for a highly dense urban area such as an apartment block there is a good degree of anonymity, but in more rural areas of the UK a postcode could be assigned to an area containing only one property, so it would be trivial to identify someone from that sort of data.

Not sure how it works for US ZIP codes and if they cover a wider area, since they are only numeric?

    1. Anonymous Coward
      Anonymous Coward

Even an apartment block isn't safe. Throw in a few more fields, link to another data source, and that anonymity evaporates; so no data set should contain a full UK postcode, unless it is intended to be identifiable. Outcode or sector are generally considered OK; otherwise you are looking at super output areas, which have specified population size requirements.

    2. Irony Deficient

      how it works for US ZIP codes

      There are two varieties of US ZIP codes: a five-digit version and a nine-digit version (which is the five-digit version plus a four-digit extension, separated by a figure dash). The five-digit version generally identifies an area that is associated with a particular post office (and the area tends to be larger for rural locations than for urban locations), although organizations that receive a large volume of mail can have their own unique ZIP codes (e.g. the headquarters for Walmart has one of its own). Some five-digit ZIP codes are only associated with a block of post office boxes, which is different to the ZIP code for the hosting post office. The nine-digit ZIP code can, but does not necessarily, identify a particular city block, a specific group of apartments/flats, an individual building, or a unique post office box.

      1. Loyal Commenter Silver badge

        Re: how it works for US ZIP codes

UK postal codes are very similar. Aside from a couple of peculiar exceptions (Girobank, and Santa spring to mind), they are structured as an "outward code" and an "inward code", plus the bit you don't normally see, a "delivery point suffix" (DPS), so a full UK postcode might look something like SW1 1AA 1A. The outward code corresponds to a postal area (usually a single sorting office AFAIK), the inward code narrows that to a street area, down to 10-20 properties or so (although it may be up to a couple of hundred flats, or just one house), and the DPS uniquely identifies the specific letter box. The DPS is only really used in mail sorting - those weird barcodes you sometimes see printed on a letter contain the full post code, and the Royal Mail will give bulk senders a discount if the mail they send is pre-sorted with DPS barcodes printed on them. They provide something called the "Postcode Address File" (PAF) to businesses, at vast expense, which allows every single address to be looked up and the DPS allocated. The PAF could certainly be useful to any crook trying to de-anonymise data if cross-referenced to other postcode data.

        I know far too much about this subject due to a job I once held which was largely based around cleansing and sorting data and printing and sending mailshots for various organisations.

        1. Irony Deficient

          Re: how it works for US ZIP codes

          The Royal Mail scheme sounds quite similar to that of the USPS. US postal barcodes (both types, viz “POSTNET” and “Intelligent Mail”) include an extra two digits as a “delivery point routing code”, which technically extends ZIP codes to 11 digits, but those extra two digits are never used with human-readable ZIP codes. Their primary use is for presorted bulk mail to qualify the sender for postage discounts. (A past job required me to become acquainted with the USPS Domestic Mail Manual and USPS Publication 28, Postal Addressing Standards.)

      2. disgruntled yank

        Re: how it works for US ZIP codes

        Those who know better, please correct me. But I believe that a co-worker told me that the USPS reassigns the four-digit extensions routinely.

        1. Anonymous Coward
          Anonymous Coward

          Re: how it works for US ZIP codes

          Not as far as I have ever seen. My current address has had the same extension since 5000 B.C.

    3. Will S

      Gov's Grid

The UK Government removed postcodes and true geolocation data from published crime and road traffic accident data around 5+ years ago now. They probably considered postcodes to be too sensitive. Now the locational data published is shifted to a nearby point.

      I realised this when I noticed our local supermarket was a mega-crime hotspot!

      1. Ken Moorhouse Silver badge

        Re: Now locational data published is shifted to a nearby point.

Which could be somebody's house - yours. ISTR there was someone in the US who was bombarded with people contacting them about a variety of things, and who rightly complained that this shouldn't happen. The default location should ALWAYS be in a green or blue (water) patch on the map, though this could be difficult in highly urbanised areas.

        >I realised this when I noticed our local supermarket was a mega-crime hotspot!

The nearest supermarket to where I live is/was a Pokemon meeting point. How I knew: an influx of kids who were all looking at their phones, exclaiming that they were very close to a landmark. This could have a negative impact on crime in an area, depending on the participants.

Apparently a successful deterrent for gangs hanging out around such street furniture as green telephone cabinets is to paint them pink (the cabinet, that is).

        1. Ken Hagan Gold badge

          Re: Now locational data published is shifted to a nearby point.

          " (the cabinet, that is)."

          Oh. A pity you clarified that.

    4. Giles C Silver badge

My postcode only covers 10 houses, so if someone looked up BMW owners there is only one in this postcode area - me. So that would identify me pretty quickly.

If you live in a block of flats then that does increase the pool size, but in a row of semi-detached houses there aren't many to choose from. My parents live in a road with 8 houses using the same postcode, so that reduces the pool further.

Unless companies need to post things to you, they don't need anything more than the first 4/5 characters of the postcode.

I.e. PE4 will identify part of Peterborough (Werrington); the next digit will give you the sub-area, of which there are only 3 (5, 6, 7); beyond that you shouldn't be including the rest of the postcode, especially not if the data is being anonymised.
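A rough sketch of that truncation rule (the function and level names are invented; real code should validate against the PAF or a proper postcode library):

```python
# Truncate a UK postcode before release, keeping only the outcode
# ("PE4") or sector ("PE4 5") as suggested above.
def truncate_postcode(postcode: str, level: str = "outcode") -> str:
    outcode, _, incode = postcode.strip().upper().partition(" ")
    if level == "outcode":
        return outcode                      # e.g. "PE4"
    if level == "sector" and incode:
        return f"{outcode} {incode[0]}"     # e.g. "PE4 5"
    raise ValueError(f"cannot truncate {postcode!r} to {level}")

print(truncate_postcode("PE4 5AB"))                  # PE4
print(truncate_postcode("PE4 5AB", level="sector"))  # PE4 5
```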

  12. heyrick Silver badge

    "Even though the data set had no identifying information, the newsletter found him using his device ID and location data."

There is so much wrong with this sentence. People really need to get away from the mentality of "they don't have my name therefore it is anonymous". If there is information that is unique to the person, then it is identifying information. You can be identified. Your name can be derived later. A device ID and a location? Not anonymous, not even close.

  13. General Purpose

    Avoiding disclosure - England and Wales census

    Census results for England and Wales are deliberately corrupted to avoid personal disclosure. That's because they're often cross-tabulated for very small output areas, small enough that unique people might show up. The standard statement is

    "In order to protect against disclosure of personal information from the 2011 Census, there has been swapping of records in the Census database between different geographic areas, and so some counts will be affected. In the main, the greatest effects will be at the lowest geographies, since the record swapping is targeted towards those households with unusual characteristics in small areas."

    Should such corruption be standard or required practice?
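For illustration, a toy version of the targeted swapping that statement describes (the areas, household types and fields are invented): households unique within their small output area are swapped with a household elsewhere, which is exactly why the smallest geographies are affected most.

```python
# Toy record swapping: any household that is unique within its
# (area, household_type) cell gets its area swapped with a donor record.
import random
from collections import Counter

records = [
    {"area": "A1", "household_type": "common"},
    {"area": "A1", "household_type": "common"},
    {"area": "A1", "household_type": "unusual"},   # unique within A1
    {"area": "B7", "household_type": "common"},
    {"area": "B7", "household_type": "common"},
]

def swap_unusual(records, rng=random.Random(0)):
    """Swap the geography of households with unusual characteristics."""
    counts = Counter((r["area"], r["household_type"]) for r in records)
    for record in records:
        if counts[(record["area"], record["household_type"])] == 1:
            donor = rng.choice([r for r in records if r["area"] != record["area"]])
            record["area"], donor["area"] = donor["area"], record["area"]
    return records

print(swap_unusual(records))  # the unusual household no longer sits in A1
```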

    1. MachDiamond Silver badge

      Re: Avoiding disclosure - England and Wales census

      "Should such corruption be standard or required practice?"

      The value of the data is diminished so much that the question is if it's even useful for the purpose it has been collected. Census data has been creeping towards the ever more creepy for decades. I can see the point in tallying the number of people in the area and it's helpful to know if they are under 18, over a certain age or in between. It might even be helpful to know, broadly, what sort of work they do, if any. Anything else asked should be out of bounds. Certainly governments want to know, but putting it in census questionnaires and requiring a response is Orwellian.

    2. heyrick Silver badge

      Re: Avoiding disclosure - England and Wales census

      "Census results for England and Wales are deliberately corrupted to avoid personal disclosure."

      Is that before or after Lockheed Martin snarfs a copy and passes it on to everybody?

      And yes, I know it's Leidos this time (another extra-national outfit handling sensitive data). Guess what, Leidos is basically Lockheed Martin.

      We might as well all set up Facebook accounts and post our census data there.

  14. Gallosek

    Change the way it's done

There is an excellent answer that is already used in healthcare settings around the world and resolves many of these issues.

    Don't share the data.

Instead, agree on a common data model and publish that, so people can write queries targeting the data, and only give them the answers.

    * You can charge or rate limit every time a question is asked, instead of once for access to the data.

    * You can mandate limits, such as what fields can be included in the output or limiting it to derived data (such as how many do X)

    * You can filter out specific queries that return too few records to be relevant

    * You can filter out generic queries that return too much data

    * You can limit how many tier2 data items can be requested

    * You can record what questions are asked and audit them

* The data never leaves your system, and as you can charge people for the compute time to run their questions, the questions have to be good, which limits data fishing

    One good example is OMOP https://www.ohdsi.org/data-standardization/the-common-data-model/
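To make the gateway idea concrete, here is a minimal sketch of such a query-mediation layer (not OMOP itself; the records, thresholds and function names are invented for illustration): callers submit questions, never see raw rows, every question is audited, and answers outside safe bounds are refused.

```python
# Sketch of a query gateway: aggregate answers only, with audit logging
# and refusal of results that are too narrow (disclosure) or too broad
# (data fishing). Thresholds here are arbitrary illustrations.
MIN_COUNT = 5        # refuse answers small enough to single someone out
MAX_FRACTION = 0.9   # refuse queries that effectively dump the dataset

RECORDS = [{"age": a, "condition": c}
           for a, c in [(34, "flu"), (41, "flu"), (38, "flu"),
                        (52, "flu"), (47, "flu"), (29, "rare-x")]]

audit_log = []

def count_where(predicate, question: str) -> int | None:
    """Answer an aggregate question; never return raw records."""
    audit_log.append(question)               # every question is recorded
    n = sum(1 for r in RECORDS if predicate(r))
    if n < MIN_COUNT:
        return None                          # too few rows: disclosure risk
    if n > MAX_FRACTION * len(RECORDS):
        return None                          # too broad: data fishing
    return n

print(count_where(lambda r: r["condition"] == "flu", "how many had flu?"))   # 5
print(count_where(lambda r: r["condition"] == "rare-x", "who has rare-x?"))  # None
```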

    1. Ozzard

      Re: Change the way it's done

      * You can turn off querying from organisations that break the rules.

      * You can bring down the portcullis completely if you want.

      * You can put a human between the request and the response, running the query past the Caldicott guardian in healthcare for example.

      I was the architect of one such system.

  15. Tron Silver badge

    One option.

I would like to see companies advertising the fact that they never sell or release their data. To anyone. Ever. That guarantee would have to be maintained if the company changed hands, or the data locked and placed with a trusted third party to ensure that it stayed protected.

    I would choose a supplier or service who offered this guarantee over one who did not.

    The move to distributed computing may go part way to fixing this, creating an environment where companies would not be expected to retain large datasets or sell them, with data being held on users' own systems as a default.

    1. Anonymous Coward
      Anonymous Coward

      Re: One option.

      Why would you trust a liar?

  16. Anonymous Coward
    Anonymous Coward

    This anonymous post is public.

    When your data is collected (it's always being collected) they will tell you what you want to hear, but the only safe assumption is that your data is never anonymous. Everyone collecting your data sees it as a problem that can be "fixed" by convincing users that their data is anonymous. The only slightly safe way to live is to assume that nothing you say or do is private, and that everything is being monitored.

  17. Ken Moorhouse Silver badge

    The other side of the coin...

...is organisations that cannot collect meaningful information from a subject without frightening them away, due to scepticism that, if full information is collected, it can be used against them. Examples might include reporting of crime, or someone going for an STI test. If complete information is not captured it is of little use, as someone might be a serial reporter of crime, or go to every STI clinic in London for multiple opinions, skewing the anonymised statistics collected.

    ===

I think I mentioned before that data snapshots handed out to researchers should be anonymised with Temporal Keys, in the same way that your bank issues you with a temporal key to log in to your account (using a keyfob calculator). What I mean by this is that all of the data, bar the very narrow data required for the research, is scrambled, but that scrambling is unique to that distinct request for the data. The next researcher's snapshot will be scrambled differently. So a researcher asking for one set of data one day, and another set another day (analogy: enter the 1st and 3rd digits of your passkey, then next time the 2nd and 4th digits) will be presented with two sets of data, but no way to meld the sets together without serious work correlating similar-looking patterns.

What is to stop the data custodian asking how critical the exact DOB is? If it is not critical, then a random scrambling of it could be done, say a random offset of plus or minus up to 5 days added to each DOB in the data extract.

    Researchers would only be able to request tiny percentages of the overall data to reduce correlation matching further.
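A minimal sketch of that temporal-key idea (the field names and data are invented): each extract is pseudonymised under a one-off key that is then discarded, so pseudonyms from two extracts cannot be joined, and non-critical DOBs get a small random offset.

```python
# Per-request scrambling: a fresh key per extract makes pseudonyms
# unlinkable across extracts; DOBs are jittered by a few days.
import hashlib
import hmac
import os
import random
from datetime import date, timedelta

def make_extract(rows, dob_jitter_days=5):
    """Pseudonymise one extract under a one-off key, jittering DOBs."""
    request_key = os.urandom(32)   # unique to this single request
    rng = random.Random()
    extract = []
    for row in rows:
        pseudonym = hmac.new(request_key, row["nhs_number"].encode(),
                             hashlib.sha256).hexdigest()[:16]
        jitter = timedelta(days=rng.randint(-dob_jitter_days, dob_jitter_days))
        extract.append({"id": pseudonym,
                        "dob": row["dob"] + jitter,
                        "reading": row["reading"]})
    return extract  # the key is discarded: even we can't reverse the IDs

rows = [{"nhs_number": "943 476 5919", "dob": date(1970, 3, 14), "reading": 7.1}]
print(make_extract(rows))
print(make_extract(rows))  # same person, different pseudonym each time
```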

  18. This post has been deleted by its author

    1. Anonymous Coward
      Anonymous Coward

      It's not whether you win or lose, it's how you play the game.

  19. Anonymous Coward
    Anonymous Coward

Suppose that the horse has bolted and is not coming back. Not that I want that.

As a thought experiment, it allows us to move on to the next issue: the asymmetry of access to information. If every politician's financial transactions were public knowledge, if we knew where they were, who they met with, and what they said, and if accepting that level of intrusiveness were a condition of the job, would democracy get a new extension on life?

  20. privacyfirst

    Clarification on differential privacy

This is a great article but there is some misinterpretation of how differential privacy works.

It does not focus on adding noise to the input data (like adding noise to each parameter, as suggested in the article). In itself, that would be somewhat equivalent to deleting part of the information or truncating fields (the more noise you add, the less meaningful the least significant digits become, up to the point where the value becomes useless, which is the same as deleting it).

Instead it focuses on measuring how much one individual could impact the output of the processing that is being done. The noise is designed to cover the impact of one individual on a given computation.

This is significantly different because it works in all situations. For instance, sometimes the mere existence of the record reveals information, no matter how much noise has been added to it, or how many columns have been deleted. Differential privacy even protects against that kind of inference.
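To make that concrete, here is a minimal sketch (the data and names are invented) of the Laplace mechanism for a count query: adding or removing any one person changes a count by at most 1 (the sensitivity), so noise with scale sensitivity/epsilon covers any individual's contribution to the output.

```python
# Output perturbation: noise is calibrated to how much ONE individual
# can change the result (the sensitivity), not to the input values.
import random

def laplace_noise(scale: float) -> float:
    # The difference of two exponentials is Laplace-distributed.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(records, predicate, epsilon: float = 0.5) -> float:
    """Count matching records, privatised with the Laplace mechanism."""
    sensitivity = 1  # one person changes a count by at most 1
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(sensitivity / epsilon)

people = [{"age": a} for a in (25, 31, 44, 67, 70, 38)]
print(dp_count(people, lambda p: p["age"] > 40))  # noisy answer near 3:
# whether any one person is in the data can no longer be confidently inferred
```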

  21. Snowy Silver badge
    Trollface

If the data has any use

Then with enough effort it is not anonymous.

  22. MachDiamond Silver badge

    One more data point

An anonymized database can be compared to a DB of known people to find matching identities with a certain confidence percentage. De-identified health data can be huge business, since many financial institutions and other businesses would love to know the health status of people, and that is the sort of thing they are prohibited from asking for. If they can get health data and re-connect it with specific people, that's a boon to them. While it may be impossible to correlate every set to a specific person, a high enough percentage of high-probability matches can make the effort worthwhile. It also means that large data aggregators that already have a lot of data points on people will be able to make more matches. Adding new data points is great, but there is also value in raising the confidence level of the data points they already have on a target.

I recall a medical study in the UK where many children could be identified from anonymized data because their conditions were rare and they lived in a low-density area. It was so easy to figure out who they were that there was little point in removing their names from the DB. A broad study on a certain age group and how they fare with a common cold or flu can also lead to a specific person. Again, anybody who lives in a lower population density area is going to stand out, due to the need to have more information than just age and how long they suffer from flu symptoms during a particular season. It could be ethnicity, physical handicaps, other underlying medical conditions or recent medical procedures. It might be statistically significant to know if a person has recently had their tonsils removed or an appendectomy. Perhaps it's not, but researchers will want to see information like that on each subject so their study captures enough data to uncover relationships.

    People can be unique due to a profession coupled with other innocuous data. A left-handed female farrier might be rather unique over a large area. Add in just two or three other data points and there may not be two candidates in the region or country.
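A toy sketch of that kind of linkage (all names and attributes invented): score each known identity against an "anonymised" row on quasi-identifiers and take the best match.

```python
# Toy linkage attack: match an "anonymised" row to a known-identity
# dataset by scoring agreement on quasi-identifiers.
known_people = [
    {"name": "J. Smith", "sex": "F", "job": "farrier", "handed": "left",  "region": "NW"},
    {"name": "A. Jones", "sex": "M", "job": "teacher", "handed": "right", "region": "NW"},
]

anonymised_row = {"sex": "F", "job": "farrier", "handed": "left", "region": "NW"}

QUASI_IDENTIFIERS = ("sex", "job", "handed", "region")

def match_confidence(candidate, row):
    """Fraction of quasi-identifiers on which the candidate agrees."""
    hits = sum(candidate[f] == row[f] for f in QUASI_IDENTIFIERS)
    return hits / len(QUASI_IDENTIFIERS)

best = max(known_people, key=lambda p: match_confidence(p, anonymised_row))
print(best["name"], match_confidence(best, anonymised_row))  # J. Smith 1.0
```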

  23. Errop

Anonymity in the virtual world comes first; you need to monitor it day and night, or choose Utopia Ecosystem and sleep peacefully.
