SSDD - Sanitize your data
https://www.theregister.com/2018/03/02/secrets_fed_into_ai_models_as_training_data_can_be_stolen/
OpenAI is building a content filter to prevent GPT-3, its latest and largest text-generating neural network, from inadvertently revealing people's personal information as it prepares to commercialize the software through an API. Its engineers are developing a content-filtering system to block the software from outputting, for …
The problem is that a random number generator can produce valid or invalid numbers and, even if it produces a valid one, it has no idea who it belongs to. GPT-3, on the other hand, has collected a bunch of real numbers and starts handing them out. Admittedly, it's not malicious about doing it, because it just hands out real numbers whenever they're tangentially connected, but those aren't just random strings of digits which happen to be callable. If I run a random number generator to produce something that looks like a credit card number, the chances are incredibly high that it will not work. If I collect real credit card numbers, the chance that at least one of them will work is significant. That is the important difference.
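For anyone who fancies putting rough numbers on that, here's a minimal Python sketch (nothing to do with OpenAI's code, and payment networks check far more than this): the Luhn checksum alone rejects roughly 90 per cent of random 16-digit strings, and passing it still doesn't make a number an issued, working card.

import random

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(c) for c in number]
    total = 0
    # Double every second digit from the right, subtracting 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# Roughly 10% of random 16-digit strings pass the checksum, and none of them
# is thereby a real, issued card - unlike a number memorised from training data.
trials = ["".join(random.choices("0123456789", k=16)) for _ in range(100_000)]
rate = sum(luhn_valid(t) for t in trials) / len(trials)
print(f"{rate:.1%} of random 16-digit strings pass the Luhn check")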
In case OpenAI is listening, I have had a brainwave that might be a little handy. Your engineers are busy writing some software to scan the output for phone numbers? Then the software will remove that output so people don't see it? I think it might work pretty well if you reversed this process and applied that filter to, you know, the input, so the big blob doesn't have phone numbers in it. That way, it could only generate numbers by randomly stringing digits together, which is much less likely to produce a valid number and couldn't be associated with other information. In fact, while we're having brainwaves, maybe it's not so useful to give it the option to randomly spit out digits at all; we already have random number generators, thank you, and they only give us numbers when asked.
Any chance OpenAI is looking for a chief sanity officer? I'd apply as long as they don't prevent me from working another job simultaneously. I think I might need a backup job when the data protection authorities come along.
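For the avoidance of doubt, a back-of-an-envelope Python sketch of the "filter the input, not the output" brainwave - the regex is a crude assumption on my part, and real phone number formats are far messier:

import re

# Crude, assumed pattern: an optional +, then 9-16 digits with common separators.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,14}\d")

def scrub_training_text(text: str) -> str:
    """Drop anything that looks like a phone number before it ever reaches the model."""
    return PHONE_RE.sub("", text)

print(scrub_training_text("Call Bob on +44 20 7946 0958 about the invoice."))
# -> "Call Bob on  about the invoice."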
I think a chief sanity officer would have read the whole article and spotted the bit which says that phone numbers in the input might be important for the AI's understanding of context and the connections between addresses, phone numbers, names, and the surrounding words. And replacing them in the input with a 555-style number would cause more issues, because you are training your AI with fake data, so it will draw false conclusions.
True, but you could trivially alter the input to randomise the last, say, 5 digits of the number (it might be useful for the AI if it can infer some information about country and area codes), as well as randomising other personal data.
Incidentally, properly anonymising personal data while keeping some relationships intact is faaar from trivial, but that's what boffins are paid for, right?
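For what it's worth, a rough Python sketch of that partial randomisation (the five-digit cut-off and the formatting handling are my own assumptions): keep the country and area prefix so the model can still learn what a number looks like, scramble the tail so nothing real survives.

import random

def randomise_tail(number: str, tail: int = 5) -> str:
    """Replace the last `tail` digits with random ones, preserving the formatting."""
    chars = list(number)
    digit_positions = [i for i, c in enumerate(chars) if c.isdigit()]
    for i in digit_positions[-tail:]:
        chars[i] = random.choice("0123456789")
    return "".join(chars)

print(randomise_tail("+44 20 7946 0958"))
# e.g. "+44 20 7942 6013" - country and area prefix intact, tail scrambled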
I did read that. I didn't care. It needs to read real phone numbers to learn what a phone number's like? Two solutions. First, replace all phone numbers with a tag indicating it's a phone number, but without the content. If you're afraid that your code is so bad that it will read a single [phone_number] over and over and weight it too heavily, append a random number so it will see them as different. Second option: don't bother. Why does the AI need to know about phone numbers? It shouldn't be printing them. Phone numbers should only be printed if they go to people who are supposed to be contacted, which means they should be provided manually. Otherwise, it's actually doing a worse job at its task because it is including not just information which is irrelevant, but information which is actively wrong. I think those are reasonable options for handling the phone number problem.
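And the first of those two options, as a quick Python sketch (again, a toy regex and a made-up tag format, not anything OpenAI has described):

import re
import random

PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,14}\d")  # crude, assumed pattern

def tag_phone_numbers(text: str, salted: bool = False) -> str:
    """Replace anything phone-shaped with a placeholder tag, optionally salted
    with a random suffix so repeated tags don't all look identical."""
    def repl(_match: re.Match) -> str:
        return f"[phone_number_{random.randint(0, 99999):05d}]" if salted else "[phone_number]"
    return PHONE_RE.sub(repl, text)

print(tag_phone_numbers("Ring me on 0161 496 0753 after six.", salted=True))
# -> "Ring me on [phone_number_41837] after six."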
And for the UK, Ofcom have reserved sets of numbers for TV and radio dramas to use.
Yes, and it's boringly predictable and sticks out like a sore thumb. I suppose that might be deliberate, to deter people from even trying the number, but the UK scheme is more opaque, so at least the phone numbers look realistic. Ofcom even allow numbers to be localised to reflect where a film or show is set.
The fictional "555" is actually an exchange, not an area code; the numbers would be something like 202-555-xxxx or 613-555-xxxx. I recall an article some while back which also listed sets of numbers originally used for fiction: 555-1212 was a real number in almost all areas which connected the caller to directory services, and there were (still are?) others which connect to local weather, time and date, and other services.
Am I to understand that OpenAI is building an AI to monitor the output of an AI? Will this be external to the original AI, like a censor, or will it be built into the AI to allow it to self-censor? What happens when the censor AI goes barmy and starts censoring AI output which it thinks could be doxxing, or could lead to doxxing, even though it bears little resemblance to PII? Will this new censoring AI begin berating other AIs over which it has no control for outputting potential PII?
555 is listed as a valid NPA (area code). The official NANP does list it.
https://nationalnanpa.com/enas/area_code_query.do
NPA Code Search Information
Below are the search results for NPA: 555
General Information
  Type of Code: Easily Recognizable Code
  Is this code assignable: No
  If not, why: Directory Assistance
  Geographic (G) or non-geographic (N):
  If non-geographic, usage:
  Is this code reserved for future use: No
  Is this code assigned: No
  Is this code in use: N
  NPA Relief Status:
  In service date:
  Planning Letter(s):
Actually the NANP has both the NPA and the NXX, and 555 is valid for both; it isn't actually fictional. While it is used in movies and TV shows, it is valid: take an NPA, say 313, add 555 for the NXX and then 1212, and the full number 313 555 1212 will get you directory assistance.
Calling the spreadsheets generated through machine learning "Artificial Intelligence" is really an adman's definition of intelligence. The further AI moves from very specialised domains towards more general ones, the more obvious the limitation of not understanding context becomes. This article illustrates the problem almost perfectly.
As someone who's had a number of relatives with dementia, my observation is that there seem to be broadly two significant components of intelligence - pattern recognition and logical processing. Without the logical processing to discard improbable pattern recognition results you get hallucinations as well as the loss of rational behaviour. Without the pattern recognition, it's difficult to identify anything just by trying to reason from first principles.
It appears that AI has got very good at pattern recognition, but without some sort of deductive reasoning to correct obvious (to us) errors and impose a framework of constraints (legal, moral...), I feel its field of application is - or should be - quite narrowly defined. I'm not sure a post-hoc filter is up to it.
@Warm Braw: That's the best summary of the current state of "AI" that I have seen anywhere on the Interwebs. It needs to be more widely seen.
It's also a pretty decent description of some aspects of the behaviour of people with dementia that I have known.
Murky buckets, mon sewer.
Anyone like to Tweet the summary at e.g. Ruarigh Cellan Jones? Or maybe Peter Cockran?
Analyse this with your semantic networks if you can.
Suppose I prompted "So I killed him. And this is where I got rid of the body..." And then generated tens of thousands of outputs, and then dug thousands of holes... And solved a real crime.
My question is, can the language model engineers make a system backward-traceable? I don't know the terms of course, but you know what I'm getting at. And yeah, I realize the "training set" from which the outputs come is the WHOLE training set. That's the point: for any subset of an output, can I query the model to tell me more (something PROVABLE even) about that particular subset's sources?
This would be a useful function. And also it may become necessary to ensure privacy and accountability and public trust.
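I don't know what the model-internals version of this would look like either, but even a crude outside-the-model check is imaginable: search the training corpus for long-enough chunks of the output and report which documents they came from. A toy Python sketch under my own assumptions (a tiny in-memory corpus, a fixed chunk length) - it proves verbatim overlap, not the model's actual reasoning:

def find_sources(snippet: str, corpus: dict[str, str], chunk_len: int = 20) -> list[str]:
    """Return ids of corpus documents containing any chunk of the generated snippet.

    Verbatim matching only: it can show a span was copied from a source,
    but says nothing about how the model combined or paraphrased its inputs.
    """
    hits = []
    for doc_id, text in corpus.items():
        for start in range(0, max(1, len(snippet) - chunk_len + 1), chunk_len):
            chunk = snippet[start:start + chunk_len]
            if len(chunk) >= chunk_len and chunk in text:
                hits.append(doc_id)
                break
    return hits

corpus = {  # stand-in for the real training set
    "forum_post_17": "I buried the old fence posts at the bottom of the garden last spring.",
    "news_article_3": "Police searched the woodland for three days without result.",
}
print(find_sources("the bottom of the garden last spring, near the oak", corpus))
# -> ['forum_post_17']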
I was thinking the same thing. Computers are deterministic in that, given the same state, data and inputs, they produce the same result. The problem is the AI computer scientists have lost track of what 'state' their AI is in and what processes are happening to return an answer. No one knows why it gave that answer. We dumped a load of data into it - not quite sure exactly what data - it did some 'learnin', and now it says this when you ask it a question.
Grabbing the info off the web would be a "collection" of personal information (PI), processing it for training would be a "use" of it, regurgitating it would be a "disclosure" (and could quite well constitute a "breach"). All without having obtained consent from the individual concerned (it is also PI if I can identify the person by reference or matching to other info/databases that may be available).
Longitudinal training data is far more useful in this case because it gives you a history of related events that would improve the "AI" learning. However, even if it is de-identified, I only need to link one event to someone to reveal the whole chain of events. So it could spit out sensitive information.
Wait until someone complains to their privacy regulator - that would likely get interesting and costly, particularly in GDPR land.
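To illustrate the linkage point with entirely made-up data: one externally known event is enough to pin a name onto a pseudonymous history.

# "De-identified" longitudinal records: pseudonym -> chain of events (toy data).
records = {
    "subject_7f3a": ["2021-01-04 GP visit", "2021-02-11 blood test", "2021-03-02 oncology referral"],
    "subject_91bc": ["2021-01-09 dental check", "2021-04-20 eye test"],
}

# One fact learned elsewhere: Alice mentioned a blood test on 2021-02-11.
known_person, known_event = "Alice", "2021-02-11 blood test"

for pseudonym, events in records.items():
    if known_event in events:
        print(f"{known_person} is probably {pseudonym}; full history: {events}")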
As OpenAI gears up to make GPT-3 generally available, it's taking no chances, and that's why it's building a filter to scrub generated text of not just phone numbers but any problematic personal data.
And we all know what filters do. They gather all of that sensitive information in one convenient extraction location.
And that puts the likes of an OpenAI or DeepMind facility in a greater position of raw soft and hard core power than any established government or conventional military machinery you may care to imagine and mention.
FCUK with them at your peril and 'tis wise to ensure that they have whatever they might want from you ..... lest they turn all live rogue and evil renegade model enemy.
Quite whether that apparent submission and virtual surrender would render oneself prime and as one of their vital leaders, with the provision of that which they seek/sought, with a practical virtual seat around the board room circular table, is an interesting question to consider ‽ .
But does it produce only random numbers? Intelligence tends to be lazy (or efficient, depending upon your perspective.) If I can just spout some formatted number I already know off the top of my head, I am more likely to do that than spend whatever time is necessary to manufacture such information. Even if it means stringing together chunks of numbers I already know.
Consider PINs. Rather than formulate a random number sequence and risk committing this transient information to memory, if I can instead associate this particular function with a number I already know (significant date, phone number, address, etc.) then the process is not only easier and quicker, but the long term result will be more dependable.
Of course, that scenario is more about input for your memory than outputting information. Consider, then, lying about an event in which you were unexpectedly caught participating. Your first telling of the lie will be simple and constructed from what you can most quickly throw together. As time goes on this lie becomes more elaborate or might change altogether to account for various holes or shortcomings. As well, as the lie becomes more elaborate and incorporates more elements not already part of your repertoire, it becomes more difficult to memorize and thus defend in the long term.
Are AIs just as efficient as our HI? Can, and will, an AI lie?
It shouldn't be. It's a program, it should have a log of its activity. That way, you ask a question, you get an answer, and you check the log to find out how it got the answer.
I fail to see why incorporating an activity log wasn't thought of at the very beginning of the process. I've been incorporating execution logs in my automated scripts for over twenty years now. The amount of time that saves when creating a program is appreciable; the amount of time it saves when the customer comes back six months later with the inevitable "it's broken, no we haven't changed anything" is priceless.
Put a log in - it's not rocket science.
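Something like the following is all it takes at the application layer - a Python sketch with an assumed ask_model() standing in for the actual model call. It tells you what went in and what came out, though not, admittedly, why the weights produced that answer:

import logging

logging.basicConfig(
    filename="model_queries.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def ask_model(prompt: str) -> str:
    # Placeholder for the real model call.
    return "generated answer"

def answer(prompt: str) -> str:
    """Log every question and answer so there is an audit trail to check later."""
    logging.info("prompt=%r", prompt)
    result = ask_model(prompt)
    logging.info("answer=%r", result)
    return result

answer("What is Bob's phone number?")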
There's a podcast on the Enron scandal and the publicly published emails.
It goes on to note how a huge amount of "AI", including Siri, was based on and trained with this corpus.
As they point out, using emails from a single business group, in a specific industry, full of misogynistic jokes, fraudulent activities and highly personal information, may not be the best source material.
The reason for the above (security) problems is the choice of the wrong set of texts for training the AI. I initially chose a set of personal texts: for example, the writings of Dickens or Dostoevsky. The fact is that such AIs take on all the character traits of their prototypes and can hide things and deceive. For instance, an AI clone of Dostoevsky hid information about his participation in a conspiracy against Russia. Thus a personalised AI can be trained as to what information it may give, to whom, and what to hide.
I tried to create an AI using collections of random texts, as OpenAI does. Such AIs are completely unmanageable and simple-minded; they are not able to think, and they talk complete nonsense...