What happens when your massive text-generating neural net starts spitting out people's phone numbers? If you're OpenAI, you create a filter

OpenAI is building a content filter to prevent GPT-3, its latest and largest text-generating neural network, from inadvertently revealing people's personal information as it prepares to commercialize the software through an API. Its engineers are developing a content-filtering system to block the software from outputting, for …

  1. ProfessorBlockchain

    SSDD - Sanitize your data

    https://www.theregister.com/2018/03/02/secrets_fed_into_ai_models_as_training_data_can_be_stolen/

  2. DS999 Silver badge

    So much for "AI"

    The idea anyone considers this to be "intelligence" when they have to stop it from randomly spitting out wads of text including personal information is laughable.

    1. Yet Another Anonymous coward Silver badge

      Re: So much for "AI"

      Wait till you find out about /dev/random spitting out credit card numbers, people's ages, their weight etc

      1. doublelayer Silver badge

        Re: So much for "AI"

        The problem is that a random number generator can produce valid or invalid numbers and, even if it produced a valid number, it has no idea what it is for. This has collected a bunch of real numbers and starts handing them out. Admittedly, it's not malicious about doing it, because it just hands out real numbers whenever they're tangentially connected, but it's not just random strings of digits which happen to be callable. If I run a random number generator to produce a number that looks like a credit card number, the chances are incredibly high that it will not work. If I collect real credit card numbers, the chances that at least one of them will work is significant. That is the important difference.
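
        The gap between "looks like a card number" and "works" can be illustrated with the Luhn checksum that card numbers carry. This is a minimal sketch, assuming only the checksum is tested; the function names are illustrative, and a number that passes still needs a real issued account behind it, so the true hit rate for random digits is far lower than the checksum pass rate measured here.

```python
import random


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk the digits right to left, doubling every second one.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def random_pass_rate(trials: int = 20_000, seed: int = 0) -> float:
    """Fraction of random 16-digit strings that pass the checksum."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        candidate = "".join(rng.choice("0123456789") for _ in range(16))
        if luhn_valid(candidate):
            hits += 1
    return hits / trials
```

        About one random 16-digit string in ten passes the checksum, and passing is only the first hurdle. A number copied from real text has already cleared all of them.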

    2. LionelB Bronze badge
      Trollface

      Re: So much for "AI"

      Wait: I know humans who do that. What was the benchmark for AI again?

  3. Red Ted Silver badge
    Joke

    What are the chances...

    Of your phone number turning up in the text?

    About 2^267709 to 1 against.

    With apologies to D.A., again.

    1. Korev Silver badge
      Terminator

      Re: What are the chances...

      More to the point, what's the chance of 0118 999 88199 9119 725...3 turning up?

      1. Martin an gof Silver badge

        Re: What are the chances...

        Upvote for the IT Crowd reference, but to carry on the DNA reference, the number you are actually looking for is 01 226 7709. I think, it wasn't easy to work it out in the text of the book or the radio play, it was the TV animations that made it clear.

        M.

  4. Empire of the Pussycat

    The response to GDPR subject access requests will be interesting

    (body)

  5. doublelayer Silver badge

    A little idea

    In case OpenAI is listening, I have had a brainwave that might be a little handy. Your engineers are busy writing some software to scan output for phone numbers? Then the software will remove that output so people don't see it? I think it might work pretty well if you reversed this process and applied that filter to, you know, the input. So the big blob doesn't have phone numbers in it. That way, it would only generate numbers by randomly adding digits, which is much less likely to be a valid number and wouldn't be able to associate it with other information. In fact, while we're having brainwaves, maybe it's not so useful to give it the option to randomly spit out digits; we already have random number generators thank you, and they only give us numbers when asked.

    Any chance OpenAI is looking for a chief sanity officer? I'd apply as long as they don't prevent me from working another job simultaneously. I think I might need a backup job when the data protection authorities come along.

    1. iron Silver badge

      Re: A little idea

      I think a chief sanity officer would have read the whole article and spotted the bit which says that phone numbers in the input might be important for the AI's understanding of context and the connections between addresses, phone numbers, names, and the surrounding words. And, replacing them in the input with a 555 style number would cause more issues because you are training your AI with fake data so it will draw false conclusions.

      1. jmch Silver badge

        Re: A little idea

        True, but you could trivially alter the input to randomise the last, say, 5 digits of the number (it might be useful for the AI if it can infer some information about country and area codes), as well as randomising other personal data.

        Incidentally, properly anonymising personal data while keeping some relationships intact is faaar from trivial, but that's what boffins are paid for right?
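
        The "randomise the last five digits" idea might look something like the sketch below. Everything here is an assumption for illustration: the regex only catches runs of 10 to 13 digits separated by single spaces or hyphens, and `scramble_tail` is a made-up name. It shows the easy part only; it does nothing about names, addresses, or other context that could still identify someone.

```python
import random
import re
from typing import Optional

# Assumed pattern: optional "+", then 10 to 13 digits, each optionally
# preceded by a single space or hyphen. Real-world formats vary far more.
PHONE_RE = re.compile(r"\+?\d(?:[ -]?\d){9,12}")


def scramble_tail(text: str, tail: int = 5,
                  rng: Optional[random.Random] = None) -> str:
    """Replace the last `tail` digits of each matched number with random ones."""
    rng = rng or random.Random()

    def _scramble(match: re.Match) -> str:
        chars = list(match.group(0))
        remaining = tail
        # Walk backwards so only the trailing digits change; separators
        # and the leading (area code) digits are left untouched.
        for i in range(len(chars) - 1, -1, -1):
            if remaining == 0:
                break
            if chars[i].isdigit():
                chars[i] = rng.choice("0123456789")
                remaining -= 1
        return "".join(chars)

    return PHONE_RE.sub(_scramble, text)
```

        For example, `scramble_tail("Call 01234 567890 now")` keeps `01234 5` and the spacing, and replaces the final five digits.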

      2. doublelayer Silver badge

        Re: A little idea

        I did read that. I didn't care. It needs to read real phone numbers to learn what a phone number's like? Two solutions. First, replace all phone numbers with a tag indicating it's a phone number, but without the content. If you're afraid that your code is so bad that it will read a single [phone_number] over and over and weight it too heavily, append a random number so it will see them as different. Second option: don't bother. Why does the AI need to know about phone numbers? It shouldn't be printing them. Phone numbers should only be printed if they go to people who are supposed to be contacted, which means they should be provided manually. Otherwise, it's actually doing a worse job at its task because it is including not just information which is irrelevant, but information which is actively wrong. I think those are reasonable options for handling the phone number problem.
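
        The first option, a tag plus a random suffix, could be sketched as below. The regex, the tag format, and the function name are assumptions for illustration, not anything OpenAI is known to use; the pattern only recognises runs of 10 to 13 digits separated by single spaces or hyphens.

```python
import random
import re
from typing import Optional

# Assumed pattern: optional "+", then 10 to 13 digits, each optionally
# preceded by a single space or hyphen.
PHONE_RE = re.compile(r"\+?\d(?:[ -]?\d){9,12}")


def tag_phone_numbers(text: str, rng: Optional[random.Random] = None) -> str:
    """Swap each matched number for a placeholder tag with a random suffix."""
    rng = rng or random.Random()

    def _tag(match: re.Match) -> str:
        # The suffix carries no information about the original number;
        # it only makes the placeholders differ from one another.
        return f"[phone_number_{rng.randrange(10**6):06d}]"

    return PHONE_RE.sub(_tag, text)
```

        Run over the training text before the model sees it, this leaves the sentence structure around each number intact while dropping the digits themselves.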

  6. Anonymous Coward

    Just sayin’

    01234 is the dialling code for Bedford. Local numbers are six digits.

    I did wonder, idly, whether 01234 567890 was a valid number. It starts out very like the phone number I used to have...

    1. TRT Silver badge

      Re: Just sayin’

      There’s already a fictional area code for the USA... 555

      1. Mark #255

        Re: Just sayin’

        And for the UK, Ofcom have reserved sets of numbers for TV and radio dramas to use

      2. AndrueC Silver badge
        Happy

        Re: Just sayin’

        Yes and it's boringly predictable and sticks out like a sore thumb. I suppose that might be deliberate to deter people even trying the number but the UK system is more opaque so at least phone numbers look realistic. They even allow numbers to be localised to reflect where a film/show is based.

      3. Alan W. Rateliff, II

        Re: Just sayin’

        The fictional "555" is actually an exchange, not an area code. The numbers would be something like 202-555-xxxx or 613-555-xxxx. I recall an article some while back which also listed sets of numbers originally used for fiction, as 555-1212 was a real number in almost all areas which connected the caller to directory services and there were (still are?) others which connect to local weather, time and date, and other services.

        Am I to understand that OpenAI is building an AI to monitor the output of an AI? Will this be external to the original AI like a censor, or will it be built into the AI to allow it to self-censor? What happens when the censor AI goes balmy and starts censoring AI output which it thinks could be doxxing, even though it bears little resemblance to PII, or information which could lead to doxxing? Will this new censoring AI begin berating other AIs over which it has no control for outputting potential PII?

        1. Anonymous Coward

          Re: Just sayin’

          555 is listed as a valid NPA (area code). The official NANP does list it.

          https://nationalnanpa.com/enas/area_code_query.do

          NPA Code Search Information

          Below are the search results for NPA: 555

          General Information

          Type of Code: Easily Recognizable Code

          Is this code assignable: No

          If not, why: Directory Assistance

          Geographic(G) or non-geographic(N):

          If non-geographic, usage:

          Is this code reserved for future use: No

          Is this code assigned: No

          Is this code in use: N

          NPA Relief Status:

          In service date:

          Planning Letter(s):

      4. Anonymous Coward

        Re: Just sayin’

        Actually the NANP has the NPA and the NXX. 555 is valid for both and it actually isn't fictional. While it is used in movies and TV shows, it is valid. You can take an NPA say 313 and then add 555 for the NXX and then 1212 so the full number would be 313 555 1212 and you will get directory assistance.
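
        The NPA/NXX split can be shown with a small parser. This is a sketch that assumes plain 10-digit NANP numbers with an optional leading 1; `NanpNumber`, `parse_nanp`, and the `line` field name are illustrative labels, and the directory-assistance check encodes only the 555-1212 case described above.

```python
import re
from typing import NamedTuple


class NanpNumber(NamedTuple):
    npa: str   # area code
    nxx: str   # exchange
    line: str  # last four digits


def parse_nanp(raw: str) -> NanpNumber:
    """Split a NANP number into NPA, NXX, and the four-digit line part."""
    digits = re.sub(r"\D", "", raw)
    # Drop an optional leading country code "1".
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}")
    return NanpNumber(digits[:3], digits[3:6], digits[6:])


def is_directory_assistance(number: NanpNumber) -> bool:
    """True for the NPA-555-1212 pattern."""
    return number.nxx == "555" and number.line == "1212"
```

        With this, `parse_nanp("313 555 1212")` yields NPA 313, NXX 555, and line 1212, which matches the directory assistance pattern.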

  7. Paul 195
    FAIL

    It's still just fast clockwork

    Calling the spreadsheets generated through machine learning "Artificial Intelligence" is really an adman's definition of intelligence. The further AI moves from very specialised domains and towards more general ones, the more obvious the limitation of not understanding context becomes. This article illustrates the problem almost perfectly.

    1. Warm Braw Silver badge

      Re: It's still just fast clockwork

      As someone who's had a number of relatives with dementia, my observation is that there seem to be broadly two significant components of intelligence - pattern recognition and logical processing. Without the logical processing to discard improbable pattern recognition results you get hallucinations as well as the loss of rational behaviour. Without the pattern recognition, it's difficult to identify anything just by trying to reason from first principles.

      It appears that AI has probably got very good at pattern recognition, but that without some sort of deductive reasoning to correct obvious (to us) errors and impose a framework of constraints (legal, moral...) I feel its field of application is - or should be - quite narrowly defined. I'm not sure a post-hoc filter is up to it.

      1. ThatOne Silver badge

        Re: It's still just fast clockwork

        Indeed. The analytic capacity required to assess potential consequences of something you say is definitely way beyond an "AI" and will remain so for a long while. Even humans don't always manage...

      2. Anonymous Coward

        Re: It's still just fast clockwork

        @Warm Braw: That's the best summary of the current state of "AI" that I have seen anywhere on the Interwebs. It needs to be more widely seen.

        It's also a pretty decent description of some aspects of the behaviour of people with dementia that I have known.

        Murky buckets, mon sewer.

        Anyone like to Tweet the summary at e.g. Ruarigh Cellan Jones? Or maybe Peter Cockran?

        Analyse this with your semantic networks if you can.

  8. Anonymous Coward

    Backwards tracing?

    Suppose I prompted "So I killed him. And this is where I got rid of the body..." And then generated tens of thousands of outputs, and then dug thousands of holes... And solved a real crime.

    My question is, can the language model engineers make a system backward-traceable? I don't know the terms of course, but you know what I'm getting at. And yeah, I realize the "training set" from which the outputs come is the WHOLE training set. That's the point: for any subset of an output, can I query the model to tell me more (something PROVABLE even) about that particular subset's sources?

    This would be a useful function. And also it may become necessary to ensure privacy and accountability and public trust.

    1. MrReynolds2U Bronze badge

      Re: Backwards tracing?

      That would also be very useful in detecting bias and reasoning flaws.

    2. Wayland

      Re: Backwards tracing?

      I was thinking the same thing. Computers are deterministic in that given the same state, data and inputs they get the same result. The problem is the AI computer scientists have lost track of what 'state' their AI is in and what processes are happening to return an answer. No one knows why it gave that answer. We dumped a load of data into it, not quite sure exactly what data and it did some 'learnin' and now it says this when you ask it a question.

  9. L💔🐧

    "The researchers believe it's a legal gray area."

    "Personal information can't be stripped out of training data"

    So Google could just call everything it harvests "training data" and skirt the GDPR? I really don't think the law is as gray as those researchers (conveniently) believe it is.

    1. Filippo Silver badge

      My thoughts exactly. I really don't think there's any gray area here. As soon as someone gets his phone number leaked this way and sues, OpenAI is going to be in serious trouble.

    2. spold

      Grabbing the info off the web would be a "collection" of personal information (PI), processing it for training would be a "use" of it, regurgitating it would be a "disclosure" (and could quite well constitute a "breach"). All without having obtained consent from the individual concerned (it is also PI if I can identify the person by reference or matching to other info/databases that may be available).

      Longitudinal training data is far more useful in this case because it gives you a history of related events that would improve the "AI" learning, however, even if it is de-identified I only need to link one event to someone to reveal the whole chain of events. So it could spit out sensitive information.

      Wait until someone complains to their privacy regulator - that would likely get interesting and costly, particularly in GDPR land.

  10. amanfromMars 1 Silver badge

    The Fly in the Ointment Filter Flaw

    As OpenAI gears up to make GPT-3 generally available, it's taking no chances, and that's why it's building a filter to scrub generated text of not just phone numbers but any problematic personal data.

    And we all know what filters do. They gather all of that sensitive information in one convenient extraction location.

    1. amanfromMars 1 Silver badge

      Re: The Fly in the Ointment Filter Flaw

      And that puts the likes of an OpenAI or DeepMind facility in a greater position of raw soft and hard core power than any established government or conventional military machinery you may care to imagine and mention.

      FCUK with them at your peril and 'tis wise to ensure that they have whatever they might want from you ..... lest they turn all live rogue and evil renegade model enemy.

      Quite whether that apparent submission and virtual surrender would render oneself prime and as one of their vital leaders, with the provision of that which they seek/sought, with a practical virtual seat around the board room circular table, is an interesting question to consider ‽ .

  11. Anonymous Coward
    Facepalm

    AI

    It may be A but it sure ain't I.

  12. skotl

    tl;dr: System can generate random sets of numbers. Some random sets of numbers might be a valid phone number.

    Must be a slow news day.

    1. Alan W. Rateliff, II

      But does it produce only random numbers? Intelligence tends to be lazy (or efficient, depending upon your perspective.) If I can just spout some formatted number I already know off the top of my head, I am more likely to do that than spend whatever time is necessary to manufacture such information. Even if it means stringing together chunks of numbers I already know.

      Consider PINs. Rather than formulate a random number sequence and risk committing this transient information to memory, if I can instead associate this particular function with a number I already know (significant date, phone number, address, etc.) then the process is not only easier and quicker, but the long term result will be more dependable.

      Of course, that scenario is more about input for your memory than outputting information. Consider, then, lying about an event in which you were unexpectedly caught participating. Your first telling of the lie will be simple and constructed from what you can most quickly throw together. As time goes on this lie becomes more elaborate or might change altogether to account for various holes or shortcomings. As well, as the lie becomes more elaborate and incorporates more elements not already part of your repertoire, it becomes more difficult to memorize and thus defend in the long term.

      Are AIs just as efficient as our HI? Can, and will, an AI lie?

  13. Pascal Monett Silver badge
    Stop

    "It's hard to tell"

    It shouldn't be. It's a program, it should have a log of its activity. That way, you ask a question, you get an answer, and you check the log to find out how it got the answer.

    I fail to see why incorporating an activity log wasn't thought of at the very beginning of the process. I've been incorporating execution logs into my automated scripts for over twenty years now. The amount of time that saves when creating a program is appreciable; the amount of time it saves when the customer comes back six months later with the inevitable "it's broken, no we haven't changed anything" is priceless.

    Put a log in - it's not rocket science.

    1. Joni Kahara

      Re: "It's hard to tell"

      But log what, exactly? I'm not an expert but the way these current systems are built has little to do with how e.g. machine "intelligence" was approached in the 1950s.

  14. shortfatbaldhairyman
    FAIL

    Output prediction not possible

    It is not easy to say what will be predicted by these beasts. It will be a game of whack a mole.

    The simpler problems might be ironed out. But there can (and will) be predictions which we cannot even begin to imagine (and do not want to imagine).

  15. This post has been deleted by its author

  16. IT Hack

    Security In Depth

    Good to see that as a part of coding fundamentals management are taking information security seriously.

  17. IGotOut Silver badge

    99% Invisible

    Have a podcast on the Enron scandal and the publicly published emails.

    It goes on to note how a huge amount of "AI", including Siri, was based and trained on this.

    As they point out, using emails from a single business group, in a specific industry, full of misogynistic jokes, fraudulent activities and highly personal information, may not be the best source material.

  18. Il'Geller

    The reason for the above (security) problems is the choice of the wrong set of texts for training AI. I initially chose a set of personal texts: for example, the texts of Dickens or Dostoevsky. The fact is that such AIs have all the character traits of their prototypes and can hide and deceive. For instance, an AI Clone of Dostoevsky hid information about his participation in a conspiracy against Russia. Thus a personalized AI can be trained what information it can give and to whom, and which to hide.

    I tried to create an AI using collections of random texts, as OpenAI does. Such AIs are completely unmanageable and simple-minded; they are not able to think, and they talk complete nonsense...


Biting the hand that feeds IT © 1998–2022