SSDD - Sanitize your data
https://www.theregister.com/2018/03/02/secrets_fed_into_ai_models_as_training_data_can_be_stolen/
OpenAI is building a content filter to prevent GPT-3, its latest and largest text-generating neural network, from inadvertently revealing people's personal information as it prepares to commercialize the software through an API. Its engineers are developing a content-filtering system to block the software from outputting, for …
The problem is that a random number generator can produce valid or invalid numbers and, even if it produces a valid one, it has no idea what it is for. GPT-3, on the other hand, has collected a bunch of real numbers and starts handing them out. Admittedly, it's not malicious about it; it just hands out real numbers whenever they're tangentially connected to the prompt. But these aren't random strings of digits which happen to be callable. If I run a random number generator to produce something that looks like a credit card number, the chances are incredibly high that it will not work. If I collect real credit card numbers, the chance that at least one of them will work is significant. That is the important difference.
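To put a rough number on that difference: a randomly generated 16-digit string has only about a one-in-ten chance of even passing the Luhn checksum that card numbers use, and essentially no chance of belonging to a live account, whereas digits memorised from scraped text may well be real. A purely illustrative Python sketch:

```python
# Purely illustrative: format-valid is not the same as actually issued.
import random

def luhn_ok(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number]
    # Double every second digit from the right (excluding the check digit),
    # subtracting 9 whenever the doubled value exceeds 9.
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

random.seed(0)
samples = ["".join(random.choices("0123456789", k=16)) for _ in range(10_000)]
passing = sum(luhn_ok(s) for s in samples)
print(f"{passing} of {len(samples)} random 16-digit strings pass the Luhn check")
# Roughly a tenth pass the checksum, and essentially none of those map to a
# real, active account - unlike numbers copied verbatim out of training data.
```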
In case OpenAI is listening, I have had a brainwave that might be a little handy. Your engineers are busy writing some software to scan output for phone numbers? Then the software will remove that output so people don't see it? I think it might work pretty well if you reversed this process and applied that filter to, you know, the input, so the big blob doesn't have phone numbers in it. That way, it could only generate numbers by randomly adding digits, which is much less likely to produce a valid number and couldn't be associated with other information. In fact, while we're having brainwaves, maybe it's not so useful to give it the option to randomly spit out digits at all; we already have random number generators, thank you, and they only give us numbers when asked.
Any chance OpenAI is looking for a chief sanity officer? I'd apply as long as they don't prevent me from working another job simultaneously. I think I might need a backup job when the data protection authorities come along.
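For what it's worth, the brainwave above is easy enough to sketch: scrub anything phone-number-shaped from the training text before the model ever sees it, rather than filtering the output afterwards. A rough Python illustration; the regex is deliberately crude and is only a stand-in for whatever detector OpenAI would actually trust:

```python
import re

# Matches common phone-number shapes, e.g. (415) 555-2671, 555-123-4567,
# +44 20 7946 0958. Deliberately simplistic; a production filter would need
# far more careful patterns.
PHONE_RE = re.compile(
    r"(\+?\d{1,3}[\s.-]?)?(\(\d{2,4}\)[\s.-]?|\d{2,4}[\s.-])\d{3,4}[\s.-]?\d{3,4}"
)

def scrub_phone_numbers(text: str, replacement: str = "") -> str:
    """Remove anything phone-number-shaped from training text."""
    return PHONE_RE.sub(replacement, text)

sample = "Call Dave on (415) 555-2671 or +44 20 7946 0958 after six."
print(scrub_phone_numbers(sample))
# -> "Call Dave on  or  after six."
```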
I think a chief sanity officer would have read the whole article and spotted the bit which says that phone numbers in the input might be important for the AI's understanding of context and the connections between addresses, phone numbers, names, and the surrounding words. And replacing them in the input with a 555-style number would cause more issues, because you would be training your AI with fake data, so it would draw false conclusions.
True, but you could trivially alter the input to randomise the last, say, 5 digits of the number (it might be useful for the AI if it can infer some information about country and area codes), as well as randomising other personal data.
Incidentally, properly anonymising personal data while keeping some relationships intact is faaar from trivial, but that's what boffins are paid for right?
I did read that. I didn't care. It needs to read real phone numbers to learn what a phone number's like? Two solutions. First, replace all phone numbers with a tag indicating it's a phone number, but without the content. If you're afraid that your code is so bad that it will read a single [phone_number] over and over and weight it too heavily, append a random number so it will see them as different. Second option: don't bother. Why does the AI need to know about phone numbers? It shouldn't be printing them. Phone numbers should only be printed if they go to people who are supposed to be contacted, which means they should be provided manually. Otherwise, it's actually doing a worse job at its task because it is including not just information which is irrelevant, but information which is actively wrong. I think those are reasonable options for handling the phone number problem.
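The first option is simple enough to mock up: keep the shape of the data by swapping each detected number for a placeholder token, with an optional random suffix so repeated placeholders don't all look identical. A toy sketch only; PHONE_RE here is a hypothetical stand-in for a real detector:

```python
# Toy version of the tag-and-randomise idea above; purely illustrative.
import re
import random

PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def tag_phone_numbers(text: str, randomise: bool = True) -> str:
    """Replace each phone-number-shaped string with a placeholder token."""
    def repl(match: re.Match) -> str:
        if randomise:
            # Random suffix so repeated placeholders aren't all identical.
            return f"[PHONE_NUMBER_{random.randint(0, 99999):05d}]"
        return "[PHONE_NUMBER]"
    return PHONE_RE.sub(repl, text)

print(tag_phone_numbers("Ring Alice on 202-555-0143 or Bob on (613) 555-0199."))
# e.g. "Ring Alice on [PHONE_NUMBER_04216] or Bob on [PHONE_NUMBER_88173]."
```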
And for the UK, Ofcom have reserved sets of numbers for TV and radio dramas to use.
Yes, and the US 555 convention is boringly predictable and sticks out like a sore thumb. I suppose that might be deliberate, to deter people from even trying the number, but the UK system is more opaque, so at least the phone numbers look realistic. Ofcom even allow numbers to be localised to reflect where a film or show is based.
The fictional "555" is actually an exchange, not an area code. The numbers would be something like 202-555-xxxx or 613-555-xxxx. I recall an article some while back which also listed sets of numbers originally used for fiction, as 555-1212 was a real number in almost all areas which connected the caller to directory services and there were (still are?) others which connect to local weather, time and date, and other services.
Am I to understand that OpenAI is building an AI to monitor the output of an AI? Will this be external to the original AI like a censor, or will it be built into the AI to allow it to self-censor? What happens when the censor AI goes balmy and starts censoring AI output which it thinks could be doxxing, even though it bears little resemblance to PII, or information which could lead to doxxing? Will this new censoring AI begin berating other AIs over which it has no control for outputting potential PII?
555 is listed as a valid NPA (area code). The official NANP does list it.
https://nationalnanpa.com/enas/area_code_query.do
NPA Code Search Information
Below are the search results for NPA: 555
General Information
Type of Code: Easily Recognizable Code
Is this code assignable: No
If not, why: Directory Assistance
Geographic (G) or non-geographic (N):
If non-geographic, usage:
Is this code reserved for future use: No
Is this code assigned: No
Is this code in use: N
NPA Relief Status:
In service date:
Planning Letter(s):
Actually, the NANP has both the NPA and the NXX, and 555 is valid for both, so it isn't purely fictional. While it is used in movies and TV shows, it is a working exchange: take an NPA, say 313, add 555 for the NXX and then 1212, and dialling the full number 313 555 1212 will get you directory assistance.
Calling the spreadsheets generated through machine learning "Artificial Intelligence" is really an adman's definition of intelligence. The further AI moves from very specialised domains towards more general ones, the more obvious the limitation of not understanding context becomes. This article illustrates the problem almost perfectly.
As someone who's had a number of relatives with dementia, my observation is that there seem to be broadly two significant components of intelligence - pattern recognition and logical processing. Without the logical processing to discard improbable pattern recognition results you get hallucinations as well as the loss of rational behaviour. Without the pattern recognition, it's difficult to identify anything just by trying to reason from first principles.
It appears that AI has got very good at pattern recognition, but without some sort of deductive reasoning to correct obvious (to us) errors and impose a framework of constraints (legal, moral...), I feel its field of application is - or should be - quite narrowly defined. I'm not sure a post-hoc filter is up to it.
@Warm Braw: That's the best summary of the current state of "AI" that I have seen anywhere on the Interwebs. It needs to be more widely seen.
It's also a pretty decent description of some aspects of the behaviour of people with dementia that I have known.
Murky buckets, mon sewer.
Anyone like to Tweet the summary at e.g. Ruarigh Cellan Jones? Or maybe Peter Cockran?
Analyse this with your semantic networks if you can.
Suppose I prompted "So I killed him. And this is where I got rid of the body..." And then generated tens of thousands of outputs, and then dug thousands of holes... And solved a real crime.
My question is, can the language model engineers make a system backward-traceable? I don't know the terms of course, but you know what I'm getting at. And yeah, I realize the "training set" from which the outputs come is the WHOLE training set. That's the point: for any subset of an output, can I query the model to tell me more (something PROVABLE even) about that particular subset's sources?
This would be a useful function. And also it may become necessary to ensure privacy and accountability and public trust.
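No public interface to GPT-3 offers anything like this, but the crudest conceivable version of a provenance query is a literal search over the training corpus for the output fragment in question, returning documents and offsets that can actually be checked. A hypothetical sketch, with made-up document names:

```python
# Hypothetical sketch only: GPT-3 exposes no such query. This just shows the
# kind of answer a provenance query would need to return - document IDs and
# offsets that a human can go and verify.
from typing import Dict, Iterator, Tuple

def find_sources(fragment: str, corpus: Dict[str, str]) -> Iterator[Tuple[str, int]]:
    """Yield (document_id, character_offset) for every literal occurrence
    of `fragment` in the training corpus."""
    for doc_id, text in corpus.items():
        start = text.find(fragment)
        while start != -1:
            yield doc_id, start
            start = text.find(fragment, start + 1)

# Made-up corpus for illustration.
corpus = {
    "forum_post_17": "you can reach me on 202-555-0143 any evening",
    "scraped_blog_9": "the meeting is at noon",
}
print(list(find_sources("202-555-0143", corpus)))
# -> [('forum_post_17', 20)]
```

A real system would need fuzzy matching and an index over hundreds of gigabytes of text, but the shape of the answer (something provable about a particular subset's sources) would be the same.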
I was thinking the same thing. Computers are deterministic in that, given the same state, data and inputs, they get the same result. The problem is that the AI computer scientists have lost track of what 'state' their AI is in and what processes are happening to return an answer. No one knows why it gave that answer. We dumped a load of data into it, not quite sure exactly what data, and it did some 'learnin', and now it says this when you ask it a question.
Grabbing the info off the web would be a "collection" of personal information (PI), processing it for training would be a "use" of it, regurgitating it would be a "disclosure" (and could quite well constitute a "breach"). All without having obtained consent from the individual concerned (it is also PI if I can identify the person by reference or matching to other info/databases that may be available).
Longitudinal training data is far more useful in this case because it gives you a history of related events that would improve the "AI" learning. However, even if it is de-identified, I only need to link one event to someone to reveal the whole chain of events, so it could spit out sensitive information.
Wait until someone complains to their privacy regulator - that would likely get interesting and costly, particularly in GDPR land.
As OpenAI gears up to make GPT-3 generally available, it's taking no chances, and that's why it's building a filter to scrub generated text of not just phone numbers but any problematic personal data.
And we all know what filters do. They gather all of that sensitive information in one convenient extraction location.
And that puts the likes of an OpenAI or DeepMind facility in a greater position of raw soft and hard core power than any established government or conventional military machinery you may care to imagine and mention.
FCUK with them at your peril and 'tis wise to ensure that they have whatever they might want from you ..... lest they turn all live rogue and evil renegade model enemy.
Quite whether that apparent submission and virtual surrender would render oneself prime and as one of their vital leaders, with the provision of that which they seek/sought, with a practical virtual seat around the board room circular table, is an interesting question to consider ‽ .
But does it produce only random numbers? Intelligence tends to be lazy (or efficient, depending upon your perspective.) If I can just spout some formatted number I already know off the top of my head, I am more likely to do that than spend whatever time is necessary to manufacture such information. Even if it means stringing together chunks of numbers I already know.
Consider PINs. Rather than formulate a random number sequence and risk committing this transient information to memory, if I can instead associate this particular function with a number I already know (significant date, phone number, address, etc.) then the process is not only easier and quicker, but the long term result will be more dependable.
Of course, that scenario is more about input for your memory than outputting information. Consider, then, lying about an event in which you were unexpectedly caught participating. Your first telling of the lie will be simple and constructed from what you can most quickly throw together. As time goes on this lie becomes more elaborate or might change altogether to account for various holes or shortcomings. As well, as the lie becomes more elaborate and incorporates more elements not already part of your repertoire, it becomes more difficult to memorize and thus defend in the long term.
Are AIs just as efficient as our HI? Can, and will, an AI lie?
It shouldn't be. It's a program, it should have a log of its activity. That way, you ask a question, you get an answer, and you check the log to find out how it got the answer.
I fail to see why incorporating an activity log wasn't thought of at the very beginning of the process. I've been incorporating execution logs in my automated scripts for over twenty years now. The amount of time that saves when creating a program is appreciable; the amount of time it saves when the customer comes back six months later with the inevitable "it's broken, no we haven't changed anything" is priceless.
Put a log in - it's not rocket science.
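For the narrow sense of "log" meant here, it really isn't rocket science; a minimal, hypothetical sketch follows, wrapping whatever answers the question and recording inputs and output as JSON lines. The catch, as the previous poster notes, is that such a log records what went in and what came out, not why the network produced that particular answer. `answer_question` below is only a stand-in for the real inference call.

```python
# Minimal sketch of an execution log: record every question and answer so
# there is something to check later. `answer_question` is a placeholder for
# the real model call; none of this reflects OpenAI's actual tooling.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="model_activity.log", level=logging.INFO)

def logged(fn):
    """Decorator that appends a JSON record of each call to the log file."""
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        logging.info(json.dumps({
            "time": datetime.now(timezone.utc).isoformat(),
            "function": fn.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "result": repr(result),
        }))
        return result
    return wrapper

@logged
def answer_question(prompt: str) -> str:
    return "42"  # stand-in for the actual inference call

answer_question("Where did that phone number come from?")
```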
There's a podcast on the Enron scandal and the publicly released emails.
It goes on to note how a huge amount of "AI", including Siri, was based on and trained with this corpus.
As they point out, using emails from a single business group, in a specific industry, full of misogynistic jokes, fraudulent activities and highly personal information, may not be the best source material.
The reason for the above (security) problems is the choice of the wrong set of texts for training the AI. I initially chose a set of personal texts: for example, the writings of Dickens or Dostoevsky. The fact is that such AIs take on all the character traits of their prototypes and can hide things and deceive. For instance, an AI clone of Dostoevsky hid information about his participation in a conspiracy against Russia. Thus a personalised AI can be trained in which information it may give, and to whom, and which to hide.
I also tried to create an AI using collections of random texts, as OpenAI does. Such AIs are completely unmanageable and simple-minded; they are not able to think and they talk complete nonsense...