The cops will be scrambling to get their biometrics on one of these post haste.
Then they'll be able to claim they can't delete anything as it would impact their systems.
AI systems have weird memories. The machines desperately cling onto the data they’ve been trained on, making it difficult to delete bits of it. In fact, they often have to be completely retrained from scratch with the newer, smaller dataset. That’s no good in an age where individuals can request their personal data be removed …
I doubt most AI systems would work that way. The actual trained model probably does not contain the data, just weights relevant to some of the data.
A bit like this: say a person (you) supplied data and has black hair and green eyes... you can train an AI on that, but it won't know *you* are the one data point that had green eyes and black hair, because probably hundreds or thousands of other data points match that.
Likewise, just pumping fingerprints into an AI model will tell you about fingerprints in general, not an individual's prints.
However, it is still your data, and you could revoke access to it.
Unfortunately, you generally have no way of knowing how it's been trained, even if you trained it.
If you train by, say, genetic algorithms - it's quite possible that there's a mirror of your data in the trained network and it flags "success" as "John Smith was a success and this guy's data also looks like the John Smith data which I have taken upon myself to copy inside me".
Maybe not intentionally but that's the problem - you have no idea what it's training itself on, what changes that makes to itself (could be a statistic, could be wholesale copying of the input data), or how to untrain / delete that part.
Good luck proving in a court of law that the thing *doesn't* contain John Smith's personal information, especially if you've been copying it around your entire company... whoops! So Mr John Smith's address has been visible inside the trained neural net and you've given that to millions of people as part of your amazing AI product? Oh dear!
These things really are as unreliable, untrainable, uncontrollable and as stupid as they sound.
True. But I was not arguing from the legal definition (I backed that as being a requirement to retrain the model, because it is still your data), but from a practicality one.
I've no worries about AI trained models with my data being backwards engineered, because it's rather extremely difficult. I'm more worried about the original sources still being around!
"I've no worries about AI trained models with my data being backwards engineered, because it's rather extremely difficult. I'm more worried about the original sources still being around!"
I agree, I'm not worried about the possibility that my data can be extracted from the model, and worrying about the original data existing is much more relevant.
However, I also don't want my data to be used to train these systems from a purely ethical/philosophical stance. I am deeply concerned that these systems are increasingly being used for purposes that I think are harmful to everyone, and don't want to do anything that will help in their development, including contributing data.
"These things really are as unreliable, untrainable, uncontrollable and as stupid as they sound."
So, basically, just like people.
“Deletion is difficult because most machine learning models are complex black boxes so it is not clear how a data point or a set of data point is really being used,” James Zou, an assistant professor of biomedical data science at Stanford University, told The Register.
Presumably, then, it's just like I have no idea WHY my mind seems to think that it's important that I keep the theme songs for Stingray and Fireball XL-5 in my data set. Apparently, though, it's really, REALLY important that I do and I have to assume that forcing that information OUT of storage would result in permanent damage to my personal program.
We may be backing into actual machine intelligence and won't know that we've achieved it until the AI in question intermittently starts saying "Why can't I get that <random, useless bit of data> out of my head?!!?"
Trained on dictionary definitions, a computer can accurately determine the part of speech of each word, build the correct patterns, and begin to understand you. That is, the computer can stop being a calculator and start thinking. This is your only chance to force information OUT of storage.
At the same time the quality of the dictionary definitions is incredibly important! The more accurately they explain each word of each pattern, the higher the chance that the computer will understand both its data and you, and delete what is not right.
Without the dictionary definitions you get a calculator and not AI; the computer won't know what should stay and what must go.
I don't see how any results obtained as the output of a process that cannot be fully understood would hold up in court in the first place. But I suppose it would be useful as a way to demonstrate "probable cause" for warrants and such where they'd find the evidence used to make the case. The idea of a black box being part of the legal system worries me almost as much as the untold number of fallible humans also involved.
is it still identifiable private info? And as the laws requiring such mechanisms for deletion apply to all filing systems, agnostic of the physical medium, electronic, paper or otherwise, does this also apply to people's brains? You'd have to develop some form of amnesia drug, and be sure it had no side effects. Anyway, the duty rostered lab have just finished brewing today's coffee, so I'll just pop down to get a cup... if I can remember where the kitchen is.
> is it still identifiable private info?
This was my question. To take an example, Bob is gay, and has HIV. That's pretty sensitive information.
However, the relevant trained healthcare model will simply weigh the sexuality as a factor against HIV infection. In simplistic terms, the inclusion of Bob's data may push the model's link between those two facts from 20.1 to 20.2. Bob's statistics affect the model's behaviour slightly, but there is no way you could ask the model if Bob is gay - it simply doesn't know or care.
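A rough sketch of that "20.1 to 20.2" nudge, in Python. The cohort size and counts are invented for illustration, and a real model would hold learned weights rather than a single rate, but the principle is the same: the thing you query takes the factor as input, not a name.

```python
def rate(records):
    """Share of people with the factor who also have the outcome."""
    with_factor = [r for r in records if r["factor"]]
    return sum(r["outcome"] for r in with_factor) / len(with_factor)

# Hypothetical cohort: 1000 people with the factor, 201 of them with the outcome.
cohort = [{"factor": True, "outcome": i < 201} for i in range(1000)]

# Bob's record carries no name once it is in the training set.
bob = {"factor": True, "outcome": True}

without_bob = rate(cohort)
with_bob = rate(cohort + [bob])

print(f"without Bob: {without_bob:.4f}")
print(f"with Bob:    {with_bob:.4f}")
```

The two figures differ only in the fourth decimal place, and nothing in the result says which record caused the shift.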
Obviously if the training dataset is retained, that's a completely different story - but I can't see how personally identifiable information could be gleaned about a single subject from a true "black box" model.
Ah, but it might be able to classify a face as likely to belong to a female aged 62, who most likely will have a high echelon job in UK politics, live in Sonning and be unable to dance properly to her choice in music which will likely include Abba.
Ah, but it might be able to classify a face as likely to belong to a female aged 62, who most likely will have a high echelon job in UK politics, live in Sonning and be unable to dance properly to her choice in music which will likely include Abba. .... TRT
Sounds like a Right Sponge of a Wannabe Maggie May, TRT, with Zero Immaculate Fonts for Virtual Forces from SMARTR Sources Delivering Earnest Wishes for Future Presentation, via Advanced IntelAIgent Service Servers with Ready Free Access to Critical Defence and Strategic Attack Facilities and Utilities when Refusal of Excellent Terms Provides the Hellish Torments of Unimagined Unimaginable Excess in Full Clear Sight of So Many Other Mothers, more Avenging Hawk than Clucking Hen..
Horses for courses and strike while the iron is hot are two of those Universal Dicta Phrases which everyone should know of and learn more about to discover how everything has grown so suddenly so weirdly, and how it will continue to grow Out of Command and Control until Appropriate Emergency Overwhelming Action is Warranted Absolutely Necessary by folk you never heard of before and oft from afar.
I must admit I do find that something of an alien situation that humans appear to readily accept as practically normal whenever it most certainly is not.
Do you not need to know who's pulling the strings in your private public media shows ...... those occasions of infinite travel in the times and spaces of others? If you want to know, ... Hey, surely just ask around everywhere, for where else would they think be a Safe and Secure Place to Hide and Always Blast Back to for Master BetaTest Reinforcement Training of Server Assets.
An XSSive Facility where Quality Parameters are ACTed Out In House for the True Register and First Hand Knowledge of Possible and Future Expected Performance with Virtual Forces from SMARTR Sources??
Yes, It sure is at least all of that. And now you also know. Is there anything you want to do with what you now know or are you content to leave what AI and IT can do up to Friends who might actually know what truly needs to be done, and how to simply and quickly do it.
Ask them how it is done, and even if they wanted to tell you, one may not be certified enough to listen.
There some ground breaking works going on out there in Deep Virtual Space. And there's no possible way Earth can avoid them ..... is there? Be Honest, IT'll Save Time and Encourages and Assists Great Works to Begin Wonderfully Rejuvenated.
Now we are getting into what exact attributes about the individual are encoded in the data fed to train the machine learning model. There's the boundary condition and it's exactly the same problem we have with "Big Data" and pseudo-anonymity. How easy is it to reverse the crunching?
Well yeah, I understood if someone "Bob" wants to be deleted, then that means removing all references to Bob, not deleting the data. If someone looked at the 20.2 model's data they wouldn't be able to say "Hey look, here's Bob".
Why anyone would want to run AI over a dataset of personally identi....
>Obviously if the training dataset is retained, that's a completely different story
Suspect if you applied the principles that are being used to anonymise data for marketing/advertising analytic purposes, the training set shouldn't actually contain readily identifiable real-world individuals and thus be fully GDPR compliant.
In your example there is actually no need for Bob to be personally identifiable so his name/ID should not even be part of the input.
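As a minimal sketch of keeping the name/ID out of the input, something like the following would do in Python. The field names are hypothetical, and dropping direct identifiers is only the first step: the remaining fields can still single someone out when combined, which is the pseudo-anonymity problem raised earlier in the thread.

```python
# Hypothetical field names; a real schema would differ.
DIRECT_IDENTIFIERS = {"name", "patient_id", "address"}

def strip_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the direct identifier fields."""
    return {k: v for k, v in record.items() if k not in identifiers}

bob = {"name": "Bob", "patient_id": 1017, "age_band": "40-49", "hiv_positive": True}

# Only this stripped row would ever be handed to the training process.
training_row = strip_identifiers(bob)
print(training_row)
```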
I think this is where this all should start: if it is painful to remove personal data and will deliver a ginormous fine when it doesn't happen, maybe it's better not to use data that can identify an individual.
Key is, however, that fines are at GDPR levels of pain, not the usual "give me the change in your pockets and we're even" sort of fines that are the norm in the US.
Black box algorithms do something with their data. Some of it, they keep. The question is whether it's possible to retrieve any of it. If, for example, patient name was used in the training set and given to the program, there is a good chance it has done something with that data. Maybe it identifies that people named Bob are more likely to have certain health conditions than other names. If you provided it with more information about Bob, it might be able to predict more information. Of course, a good AI developer wouldn't include patient name, as that's an invitation to pollute the data and has historically proven problematic*. But people do do it sometimes and it could therefore be a privacy risk.
*For example, an algorithm trained on medical data to determine the likelihood of a patient having cancer was given the name of the hospital where the patient was receiving treatment. The algorithm was able to determine that patients staying at hospitals with "cancer center" in them were more likely to have or develop cancer. This made the algorithm next to useless, but it also increased the accuracy rate and if we know one thing about AI companies, they like good accuracy rates.
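The hospital-name problem is easy to reproduce in miniature. Here is a toy Python sketch with made-up records, where a "model" that has learned nothing about medicine still scores respectably just by reading the hospital name:

```python
# Made-up records: (hospital name, patient has cancer)
records = [
    ("St Mary's Cancer Center", True),
    ("Riverside Cancer Center", True),
    ("Riverside Cancer Center", True),
    ("Riverside Cancer Center", False),
    ("Hilltop General", False),
    ("Hilltop General", False),
    ("Hilltop General", True),
    ("Valley Clinic", False),
]

def leaky_predict(hospital):
    """Predict cancer purely from the hospital name, ignoring the patient."""
    return "cancer center" in hospital.lower()

correct = sum(leaky_predict(hospital) == label for hospital, label in records)
accuracy = correct / len(records)
print(f"accuracy from hospital name alone: {accuracy:.2f}")
```

The accuracy looks fine on paper, yet the predictor says nothing about a patient who has not already been referred somewhere.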
To quote GDPR Article 4(2): ‘processing’ means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction;
So strictly, retaining information about persons in one's memory could be construed as processing. But practically, I believe the interpretation would be narrower - constrained by "use", insofar as information merely remembered can not of itself affect the rights and freedoms of the person remembered unless some action is taken that makes use of the information. Thus the act of making use of the remembered information would be the processing.
I suspect that in practice there won't be any easy solutions, which leads me to conclude that AIs should not be trained using data that is subject to the GDPR.
What they need is realistic, but fake, data. Obviously that would be more expensive than just lifting real-life data, but here's a thought : why not train an AI to generate bespoke fake data sets?
That's an interesting idea, but how do you ensure that said fake data is realistic enough to train on ?
If you take real data and tweak it, is it still representative of reality within acceptable margins ? What is the impact of training a statistical analysis machine on wrong data, hoping to get right answers ?
I mean, we know a mentat could do it, but we don't have mentats yet, we only have statistics and, by rule of thumb, if you put in bad numbers, you're not going to get good results however much you try to convince yourself and everyone that you're working with "AI".
Fair points. You would have to be able to analyse the output and conclude that the fake data is statistically identical to the real data. Of course, it might not need an AI to do the fake data generation - depending on how well you understand the source data population, and how good your statistical analysis skills are.
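For the simple case where you understand the source population, here is a Python sketch of the no-AI route: fit a mean and standard deviation to a real column, sample fake values from a normal distribution, then compare the summary statistics. The "real" ages are invented, the normal distribution is an assumption, and matching two moments falls well short of "statistically identical", so treat it as a starting point only.

```python
import random
import statistics

# Invented stand-in for a real column of customer ages.
real_ages = [23, 35, 31, 47, 52, 29, 41, 38, 60, 27]

mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# Seeded generator so the fake data set is reproducible.
rng = random.Random(0)
fake_ages = [rng.gauss(mu, sigma) for _ in range(5000)]

fake_mu = statistics.mean(fake_ages)
fake_sigma = statistics.stdev(fake_ages)

print(f"real: mean={mu:.1f} stdev={sigma:.1f}")
print(f"fake: mean={fake_mu:.1f} stdev={fake_sigma:.1f}")
```

Correlations between columns are the hard part, and this sketch does nothing to preserve them.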
Maybe it would all be too much, but I reckon any company that can devise a generalised fake data generator, converting a customer's real data into clean and GDPR (or whatever)-compliant data, might find it to be a very profitable venture.
How can you even be sure that there is or is not information in the AI which is relevant to an individual? And even more worryingly, how can you be certain that your fake data does not create relationships which seem to identify individuals? After all, if the data is realistic then it must mimic reality.
On another note, some 20 years ago when I was writing HMRC's personal review system they were using real data for test purposes. I thought this was a no no and so used some name generators I and a colleague had written for some simulations.
Being a bit fed up at the time I didn't use the English language names.
Some samples follow:
Brictius filius Æson
Galenus filius Artemidorus
Rogatus filius Luciferus
Antipater filius Alcinder
Isocrates filius Clophas
Bratislav Radovladov
Budig Gostomyslov
Rogovlad Vyshemirov
Chestimir Ostromirov
Perei Velislavov
Gradimir Izjaslavov
Volodimer Radimirov
Vladimir Miodragov
Samovlad Yarovidov
Sobeslav Svyatoslavov
Lyutomir Naislavov
Gorica Yasnomyslov
Sveinbiǫrn Þorvarðrsson
Gunnarr Þorfinnrsson
Ásbiǫrn Óleifrsson
Ásbiǫrn Ǫnundrsson
Hólmsteinn Oddrsson
Finnr Hrólfrsson
Ǫrnólfr Álfrsson
Oddr Geirmundrsson
Eysteinn Þorvaldrsson
Arnórr Eilífrsson
Ingialdr Steinólfrsson
Bárðr Kárisson
Watching the attempts at pronunciation was fun.
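The original generators aren't shown, so the following is only a guess at the general shape: a Python sketch that glues name stems together and adds an "-ov" patronymic, in the style of the Slavic samples. The stem lists are my own picks for illustration.

```python
import random

# Illustrative stems only; not taken from the original generator.
PREFIXES = ["Rado", "Gosto", "Vyshe", "Ostro", "Veli", "Svyato"]
SUFFIXES = ["mir", "slav", "vlad", "mysl"]

def make_stem(rng):
    """Build one two-part name such as 'Radomir'."""
    return rng.choice(PREFIXES) + rng.choice(SUFFIXES)

def make_name(rng):
    """Given name followed by a patronymic derived from a second stem."""
    return f"{make_stem(rng)} {make_stem(rng)}ov"

rng = random.Random(42)  # seeded so test fixtures are repeatable
names = [make_name(rng) for _ in range(5)]
for name in names:
    print(name)
```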
Here are some names you should have used:
Arheddis Varkenjaab and Aywellbe Fayed
Arhevbin Fayed and Bybeiev Rhibodie
Aynayda Pizaqvick and Malexa Kriest
Awul Dasfilshabeda and Nowaynayda Zheet
Makollig Jezvahted and Levdaroum DeBahzted
Steelaygot Maowenbach and Tuka Piziniztee
Which brings up an interesting question. How difficult would it be for someone to poison the training data set? I can see lots of state actors interested in essentially playing Jedi mind tricks on such AI systems. Quite literally "these are not the drones you're looking for" type of thing.
Firstly you have to have a retraining cycle for ML models anyway - if not for feature/software release reasons then just for "freshness"
Secondly a good model shouldn't need much, if anything, in the way of PII; most use cases are about acting on novel views of small-medium groups of people. Most use cases are not UK Rozzer "find the crims" types.
Thirdly - for Black Box models you would have to prove that the PII has been retained in a format that still makes it PII - which is both unlikely and virtually impossible.
A good one to troll the UK Rozzers or Govt with though...
Quite apart from the fact that to do so would presumably require explicit consent under GDPR, what is the legitimate use case for training an AI on personal data? The only one I can think of is to randomise names/addresses, and to use AI for that sounds a lot like a case of the solution looking for a problem...
What about trying to link employment and home address to car insurance premiums? Or lifestyle choices to predicted medical issues?
There are plenty of legitimate uses too. Or was it wrong for the early HIV campaigns to target the gay community? Because that's the sort of information that a model like this could produce, with massively increased accuracy.
> What about trying to link employment and home address to car insurance premiums?
Isn't this what actuarial tables are for? Why would you need a trained neural-net AI to calculate risks when it's essentially just a lookup into a big ol' database full of statistics?
But if you use actuarial tables, you don't get to say in your adverts that you're a modern 21st-century company using AI, and you probably have to pay some actuaries to update the tables for you, rather than taking your database and throwing it at a model building AI program built over one summer by a temporary dev. And you might get more precision with the neural nets because it could be more complex, but mostly that first one.
This honestly sounds like the same problem as trying to have your data deleted from an academic study. By the time it is the published paper it is so detached from your identity and yet there are so many derived numbers and conclusions that they could not remove you without redoing the entire piece of research.
So the balance of privacy and practicality here is all going to be around the correct process to anonymize the data on its way into the process. This is a pretty well understood process for academic papers; I do not really think it should be that much of an intractable problem for AI researchers or developers.
My understanding is that the algorithm doesn't contain the data, just a model that should output the correct answer when supplied with certain input and the data was merely used as part of the process of developing the model.
I understand it to be similar to using one of those "equality questionnaire" things to produce statistics that say 70% of my customers are white males, then based on that I make a system that assumes a new customer has a 70% chance of being a white male... if one of my customers revokes permission regarding their questionnaire, am I obligated to go back and modify the report I produced to say that 69.9% of my customers are white males and modify the resulting system to say that there is a 69.9% chance of the new customer being a white male?
Basically, is revoking a right to store the data the same thing as requiring you to modify conclusions that were based on summarising that data as part of a larger dataset?
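To put numbers on the questionnaire example, assume (my figure, not the poster's) exactly 1000 responses, 700 of them in the group. A quick Python check of what one revocation does to the headline figure:

```python
from fractions import Fraction

TOTAL = 1000     # assumed number of questionnaires
IN_GROUP = 700   # 70% of them

before = Fraction(IN_GROUP, TOTAL)

# Case 1: the person revoking was in the 70% group.
after_in_group = Fraction(IN_GROUP - 1, TOTAL - 1)

# Case 2: the person revoking was not in that group.
after_other = Fraction(IN_GROUP, TOTAL - 1)

for label, share in [("before", before),
                     ("revoker in group", after_in_group),
                     ("revoker not in group", after_other)]:
    print(f"{label}: {float(share) * 100:.2f}%")
```

Either way the report moves by less than a tenth of a percentage point, and correcting it requires knowing which group the revoked record was in, which means consulting the record one last time.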
The problem being that a ML model may contain a lot more than a summary of the data. It might say that a new customer with no identifying information has a 70% chance of being a white male who would like to buy apples at a confidence rate of 62%, bread at a level of 73%, and lettuce at 52%. It might also say that a customer whose address is 1234 2nd St. is 99.5% likely to be a white female aged 42 who will want to purchase potatoes at 97% confidence, oranges at 88%, and bread at 99.83%. The problem with models is that you can keep asking them questions, and while they're not always correct, they'll take any information you give them and try to answer questions based on that info. In this scenario, address would not be a necessary thing, and a good developer wouldn't provide it. A bad developer might not notice, and an evil developer might provide it so you have a black box that's hard to audit but allows access to this data. This is why GDPR has to apply to all the source data they're about to use. If it's discovered that they violated privacy rules in obtaining the data to give to their model creation process, the models that resulted might contain parts of the data.
Does it actually contain the fact that a 42 year old white female lives at 1234 2nd st, or just an algorithm that can produce a result that coincides exactly with reality for that specific example?
I have bookmarked one of the pages that was generated by the Library of Babel algorithm. None of that text is stored on the server, it just generates data based on the selected variables... (a bookmark is shorter than telling you how to navigate to the location of that specific page of text, but the page was already there at that location before I went and searched for it)
Yeah, that specific algorithm kind of demonstrates how complicated these things can get... is possession of the algorithm illegal in countries that ban certain forms of speech?
It really depends how our notional algorithm was trained. The model stores information it thinks is useful, and the developer cannot tell it to retain information or block it from storing information that is provided to it. So if address was an included field, and the algorithm was trained in such a way that it came to the conclusion, correct or not, that address was a useful feature, it would probably store relationships of that nature. Somewhat crazily, in that case, it would think it could guess at every address, including ones it's never seen before (though a good system would lower the reported confidence rate). It's not guaranteed to contain that information, but it easily could.
The major reason this could happen, outside my example of the evil programmer above, is where there are actually a bunch of patterns. Suppose that the 1200-1300 block on 2nd St. is primarily occupied by vegetarians (we're sticking with the grocers' model in this example). If the model was given this information, it could easily notice that people there are more likely to be ordering lots of vegetables and much less likely to order meat. This might convince it to keep more address information, because it was useful in that scenario. Now imagine that there is a large group of people, say a large extended family, who share the same surname and have specific recipes in common. Now the model sees a pattern where customer name is usefully connected to buying habits, and more of that data is retained. And now we have a model that stores two types of data that are next to useless, because they don't scale to the public at large, but would allow access to potentially private data. Removing that data would require retraining the model to exclude it. That's why we have to not put that data in in the first place.
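Here is a deliberately crude Python sketch of the grocer's model. It is a counting table rather than a neural net, and the rows are invented, but it shows the mechanism: once address is an input, an address seen for only one customer returns that customer's habits with full confidence.

```python
from collections import Counter, defaultdict

def train(rows):
    """Count purchases per address, plus an overall fallback count."""
    by_address = defaultdict(Counter)
    overall = Counter()
    for address, item in rows:
        by_address[address][item] += 1
        overall[item] += 1
    return by_address, overall

def predict(model, address):
    """Return the most likely purchase and its share of the matching rows."""
    by_address, overall = model
    counts = by_address.get(address, overall)  # unseen address: fall back
    item, n = counts.most_common(1)[0]
    return item, n / sum(counts.values())

# Invented purchase history: (address, item)
rows = ([("1234 2nd St", "potatoes")] * 3
        + [("10 High St", "bread")] * 4
        + [("10 High St", "apples")]
        + [("7 Mill Ln", "bread")] * 2)

model = train(rows)
print(predict(model, "1234 2nd St"))    # one household's habits
print(predict(model, "99 Nowhere Rd"))  # general population guess
```

Deleting that household from `rows` does nothing to `model`. The table has to be rebuilt, which is the retraining problem in miniature, and leaving address out of `rows` to begin with avoids it.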
What you call “model” I call "profile".
1. By “training” your data on dictionary definitions you create in each profile a huge number of truly significant patterns, which very accurately determine its semantic orientation. For this, the "training" must remove from the profile what I call "lexical noise", right at the stage of parsing (preliminary preparation) of the data. This deletion ensures that each profile can both be found and itself find only a narrow circle of other profiles.
(I wrote in my patent: "Such lexical noise is typically superfluous predicative definitions that do not explain the central themes contained within the digital textual information and, accordingly, removal of such noise often results in an improvement in the quality of the structured data." In another I wrote: "If Compatibility=100% - most likely only absolutely identical paragraphs/passages can be found. If Compatibility=0% - all paragraphs/passages that have even one same word and/or predicative definition are found.")
Then the presence or absence of some address matters only in combination with a variety of other patterns, since they will allow or not to overcome the compatibility threshold necessary for either receiving or transmitting information.
Therefore, in order for compatibility to really work, it is necessary to remove all lexical noise, which is impossible without a high-quality dictionary.
2. If you train your data on "other data" and not on the high-quality dictionary then most likely you will not be able to remove its lexical noise. Indeed, this "other data" plays the role of a dictionary, defining parts of speech and the meaning of the words of your data. That is, you must be sure that the "other data" is able to do it adequately.
And now please explain to me why anyone would spend time and a huge number of resources on the creation of a new dictionary when there is an old and proven high-quality one? Only because you don't want to pay me?
3. There is a sentence "Alice and Greg swim with joy." If a system doesn't see each word's part of speech, then the word "joy" can be taken as a noun (name) "Joy", resulting in erroneous patterns when parsing the sentence.
For instance, if the word "joy" is a noun-name, then these patterns appear:
- Alice swims
- Greg swims
- joy swims.
If the word "joy" is a common noun (the emotion), then these:
- Alice swims with joy
- Greg swims with joy.
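A toy Python illustration of the two readings. This is not the patented method, only a naive stand-in: the case-sensitive branch treats capitalised words as names, the case-insensitive branch matches against a hypothetical name list, and neither knows anything about parts of speech.

```python
# Hypothetical list of first names, stored lowercased.
KNOWN_NAMES = {"alice", "greg", "joy"}

def subjects(sentence, case_sensitive=True):
    """Pick out the words treated as names of swimmers."""
    words = sentence.rstrip(".").split()
    if case_sensitive:
        # Naive rule: capitalised means name. A capitalised
        # sentence-initial ordinary word would fool it.
        return [w for w in words if w[0].isupper()]
    # Case folded: "joy" and "Joy" become indistinguishable.
    return [w for w in words if w.lower() in KNOWN_NAMES]

sentence = "Alice and Greg swim with joy."

careful = [f"{s} swims" for s in subjects(sentence, case_sensitive=True)]
folded = [f"{s} swims" for s in subjects(sentence, case_sensitive=False)]

print(careful)
print(folded)
```

With case folded, the spurious "joy swims" pattern appears alongside the two genuine ones.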
You know what may or may not happen: some systems see the words "joy" and "Joy" as the same. Try typing "ilya geller" and "Ilya Geller" into Google.