Subpoenaed PyPI says bye-bye to as much IP address data as it can

PyPI, the Python Package Index, began evaluating ways to reduce the amount of identifying information that it stores even before the US Justice Department came asking for data on suspect users. But now that the code repository has disclosed receiving three subpoenas for data on five users earlier this year, the Python …

  1. PRR Bronze badge

    THIS is a Register-worthy sub-head!!

    Python package pile prefers protecting programmer privacy

  2. katrinab Silver badge

    Salting IP addresses

    In terms of complexity, this is like brute-forcing a 7 character password made up of only lower case letters. A modern computer could go through the entire address space in basically no time at all.

    1. ExampleOne

      Re: Salting IP addresses

      That was pretty much my thought. 32 bits of meaningful information, less a bunch of IP addresses that will never be seen.

      IPv6 address hashing might be more useful, but how much of their user base are IPv6 users?

      1. Anonymous Coward

        Re: Salting IP addresses

        Isn't the salt, and the hash arbitrarily big until this isn't an issue?

        1. yetanotheraoc Silver badge

          Re: Salting IP addresses

          Why isn't the salt subject to subpoena?

        2. Yet Another Anonymous coward Silver badge

          Re: Salting IP addresses

          >Isn't the salt, and the hash arbitrarily big until this isn't an issue?

          Doesn't help. You need to know the salt to do the match, so the search space is the same; it doesn't really increase the compute needed. The salt only protects against pre-computed hashes.

        3. katrinab Silver badge

          Re: Salting IP addresses

          No. Salt guards against pre-computed hash tables.

          Take each IP address, add salt, hash. Doing that about 3.7bn times for all the actually usable IP addresses will not take long at all. It would certainly be done in well under a minute, less than a second on decent hardware.
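
The procedure described above (take each IP address, add salt, hash, compare) can be sketched as follows. This is a minimal illustration, not PyPI's actual scheme: the use of SHA-256, the way salt and address are concatenated, the `SALT` value, and the function names are all assumptions made for the example. To keep it quick to run, it searches a single /24 documentation range rather than the whole IPv4 space.

```python
import hashlib
import ipaddress

def salted_hash(ip: str, salt: bytes) -> str:
    """Hash an IP address with a salt (assumed scheme: SHA-256 of salt + ip)."""
    return hashlib.sha256(salt + ip.encode("ascii")).hexdigest()

def brute_force(target: str, salt: bytes, network: str):
    """Try every address in `network` until one hashes to `target`."""
    for addr in ipaddress.ip_network(network):
        candidate = str(addr)
        if salted_hash(candidate, salt) == target:
            return candidate
    return None  # no address in this range matches

# Hypothetical values: whoever holds the stored hash is assumed to hold the salt too.
SALT = b"hypothetical-site-wide-salt"
stored = salted_hash("203.0.113.77", SALT)

recovered = brute_force(stored, SALT, "203.0.113.0/24")
print(recovered)
```

With the salt known, the loop is just one hash per candidate address, so the salt adds nothing to the size of the search. Widening `network` to the full address space is the same loop with more iterations; a plain Python loop like this one will be much slower than dedicated hashing tools.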

    2. Anonymous Coward

      Differential privacy helps with this

      Instead of always logging a hashed IP address, their system could log only extremely vague/imprecise information until suspected abuse is detected, which it could do by counting the number of suspicious encounters until a certain threshold is reached. Only if specific abuse thresholds are reached, would the system begin logging progressively more specific hashed data. It could start with vague geolocation info of the ISP (not the user), then the vague ASN range, then the precise ASN, then vaguely related subnet blocks, then more specific subnets, then gradually adding more octets, with the data becoming only more and more precise as the need to have it actually increases in a quantifiable way. Associated data like timestamps should be kept as vague as possible too. For example, if the logs are just to protect against DoS attacks, then the log should be an aggregate count of how many times a given hashed value was encountered over a relevant timeframe, not a precise readout of every encounter with millisecond accurate times.

      With this approach, most subpoenaed data would be too vague to be abused, even if successfully brute-forced, and most legitimate users' privacy would be extremely well protected (regardless of legal disclosures) most of the time.
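
A rough sketch of the threshold idea in this comment, under stated assumptions: it covers only the "gradually adding more octets" stages (the geolocation and ASN stages, and the hashing of the logged value, are left out), and the thresholds in `PRECISION_STEPS`, the class name `ProgressiveLog`, and the choice to count suspicious events per coarsest bucket are all hypothetical. It stores aggregate counts only, with no timestamps.

```python
import ipaddress
from collections import Counter

# Hypothetical schedule: (minimum suspicious events, IPv4 prefix length to log).
PRECISION_STEPS = [(0, 8), (10, 16), (50, 24), (200, 32)]

def prefix_for(count: int) -> int:
    """Return the most specific prefix length whose threshold has been reached."""
    length = PRECISION_STEPS[0][1]
    for threshold, plen in PRECISION_STEPS:
        if count >= threshold:
            length = plen
    return length

def log_key(ip: str, count: int) -> str:
    """Truncate `ip` to the precision allowed by `count` suspicious events."""
    net = ipaddress.ip_network(f"{ip}/{prefix_for(count)}", strict=False)
    return str(net)

class ProgressiveLog:
    """Aggregate hit counts, keyed by a network prefix of varying precision."""

    def __init__(self):
        self.suspicious = Counter()  # suspicious events per coarsest bucket
        self.hits = Counter()        # aggregate hits per logged key

    def record(self, ip: str, suspicious: bool = False) -> str:
        coarse = log_key(ip, 0)
        if suspicious:
            self.suspicious[coarse] += 1
        key = log_key(ip, self.suspicious[coarse])
        self.hits[key] += 1
        return key

log = ProgressiveLog()
print(log.record("198.51.100.23"))  # ordinary request: coarsest bucket only
```

One consequence of counting per coarse bucket is that unrelated users in the same /8 share a counter, so a real design would need a fairer way to decide when precision is increased.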

  3. AcceptableName

    No subpoenas for RubyGems but I'm betting it's a different story for NPM.
