How does it work?
It simply gives it a Brummie, Glaswegian or some other strong accent... guaranteed to defeat any AI
(listen to 'Breaking the News' from BBC Scotland... the only podcast to come with subtitles)
The thought that our gadgets are spying on us isn't a pleasant one, which is why a group of Columbia University researchers have created what they call "neural voice camouflage." This technology won't necessarily stop a human listener from understanding someone if they're snooping (you can give recordings a listen and view …
> any reasonable British accent will defeat these American speech "recognition" programs
It certainly defeats mine!... I still remember when I first arrived in the UK (30-40 years ago) and tried to order something in a pub. The landlord replied something to me in a language I'd never heard before... I don't know if he was just having fun with the foreigner, but that didn't sound anywhere related to English as I know it.
I seem to recall that a simple white noise generator (check V for Vendetta, among others), aka the go-to spy covert-conversation-protector, is largely enough to confuse a microphone while retaining the human ear's capability of listening to one's neighbour.
Has physics changed, or has that always been a red herring?
White noise is a brute force attack - to generate white noise that will mask you from a microphone but still allow conversation, the white noise needs to be directional towards the microphone or generated at an intervening surface (e.g. window pane). The camouflage technique is a much more subtle approach. It is also not too hard to remove white noise - it's what noise-cancelling headphones and the background noise cancellation on videoconferencing applications do all the time.
This is an academically interesting technique because it relies on generating more specific "anti-noise" than a brute force white noise approach. How useful it is remains to be seen - but it suggests an interesting ability to (for example) allow a phone conversation that humans can understand, but that cannot be automatically transcribed by anyone intercepting it. That's unlikely to be useful for secure conversations because it's probably easier to just end-to-end encrypt the voice channel, but it may stop your phone provider using your speech to sell you a new upgrade (or your government sending you for re-education). I can foresee it being a feature offered by privacy-focussed communication apps.
I think current AI transcribers are pretty good at listening through white noise. Not as good as humans, obviously, but I suspect that in order to defeat an AI transcriber through white noise alone, you'd have to deploy a volume that would be annoying to humans.
Indeed, this is an AI model that can defeat AI models trying to listen to you. The escalation of this naturally results in an AI model eventually figuring out the only way it can win is by eliminating humans.
A long-gone solicitor relative of mine told a group of us kids that the way to defeat eavesdroppers is with anything that rustles. To demonstrate, he had one of us sit opposite him at a small table, while the rest of us were at the other end of the room. He just casually rustled a newspaper while speaking in an even, quiet voice. We couldn't work out anything except the occasional random word.
The human brain is fucking amazing at picking data from noise.
There's an oft-cited experiment where someone reads a corrupted text to another person who corrects it in real time and when I say real time, I mean <30ms lag.
Alternatively there's another experiment which uses TDM to reduce the data in a flow of speech to c. 10% and it's still intelligible
Human speech has evolved over millions of years and has helped us not need power, speed or strength when it comes to bodies.
I remember a demo in uni where the prof started with 8-bit, 8 kHz sampled speech and sequentially removed bits from the input. Though noisy, the 1-bit version was still mostly intelligible.
Later, I worked on a voice messaging application for my employer, and we repeated the experiment with similar results. Turns out, the human speech recognition wetware looks mainly at zero crossings.
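For anyone who wants to poke at the bit-stripping idea, here is a minimal sketch. It assumes unsigned 8-bit samples with 128 as the midpoint and uses a synthetic two-tone signal as a stand-in for speech; the function names are made up for illustration. All it demonstrates is that throwing away everything but the top bit leaves the zero crossings exactly where they were. It says nothing about intelligibility by itself.

```python
import math

def make_signal(n=800, rate=8000):
    """Synthetic stand-in for speech: two sine tones as unsigned 8-bit samples."""
    out = []
    for i in range(n):
        t = i / rate
        v = 0.6 * math.sin(2 * math.pi * 200 * t) + 0.4 * math.sin(2 * math.pi * 730 * t)
        out.append(max(0, min(255, int(round(128 + 127 * v)))))
    return out

def keep_top_bits(samples, bits):
    """Keep only the `bits` most significant bits of each unsigned 8-bit sample."""
    mask = (0xFF << (8 - bits)) & 0xFF
    return [s & mask for s in samples]

def zero_crossings(samples, midpoint=128):
    """Indices where the signal crosses the midpoint of the unsigned range."""
    signs = [s >= midpoint for s in samples]
    return [i for i in range(1, len(signs)) if signs[i] != signs[i - 1]]

signal = make_signal()
one_bit = keep_top_bits(signal, 1)  # every sample is now either 0 or 128

# The crossing positions survive the reduction to a single bit.
print(zero_crossings(signal) == zero_crossings(one_bit))
```

Swapping `make_signal()` for real 8 kHz samples would be the obvious next step if you wanted to repeat the listening test.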
Also Adaptive Differential PCM (ADPCM). That has been around for a long time for compressed voice channels: the actual input is compared to the predicted input and only the difference is sent.
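A rough sketch of the "send only the difference" idea, in case it helps. This is plain DPCM with a fixed step size and the simplest possible predictor (the previous reconstructed sample); real ADPCM codecs additionally adapt the step size as they go. The names, step size and code range below are assumptions picked for illustration, not taken from any particular codec.

```python
import math

def dpcm_encode(samples, step=4, levels=7):
    """Encode each sample as a small quantised difference from the prediction.

    The prediction is the previously reconstructed sample, so the encoder
    tracks exactly what the decoder will produce.
    """
    codes = []
    predicted = 0
    for s in samples:
        diff = s - predicted
        code = max(-levels, min(levels, round(diff / step)))
        codes.append(code)
        predicted += code * step  # mirror the decoder's reconstruction
    return codes

def dpcm_decode(codes, step=4):
    """Rebuild the signal by accumulating the quantised differences."""
    out = []
    predicted = 0
    for code in codes:
        predicted += code * step
        out.append(predicted)
    return out

# Slowly varying test signal, standing in for a voice waveform.
samples = [int(100 * math.sin(2 * math.pi * i / 200)) for i in range(400)]
codes = dpcm_encode(samples)
decoded = dpcm_decode(codes)
worst = max(abs(a - b) for a, b in zip(samples, decoded))
print("largest code:", max(abs(c) for c in codes), "worst error:", worst)
```

Each code fits in 4 bits here, while the reconstruction stays within half a step of the original as long as the signal does not change faster than the codes can follow.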
So predict the output (in real time, as we did with analog electronics some, what, 40 years ago) and modify it slightly.
On a slightly different note, Shannon came up with a method to calculate the entropy of the English language. Might be another useful concept to use...
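As a starting point, the crudest version of that calculation is the single-character entropy of a text sample, sketched below. Note the assumption: this only counts individual character frequencies. Shannon's own estimates for English took longer-range context into account and came out considerably lower, so treat this as an upper-end toy figure, not his method.

```python
import math
from collections import Counter

def char_entropy(text):
    """First-order (single-character) entropy in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
print(round(char_entropy(sample), 2), "bits per character")
```

Feeding it a longer text gives a more stable number; a string of one repeated character gives zero, as you would expect.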
I would expect it works by adding noise the speech recognition software isn't expecting, but that doesn't mean it can't be modified to ignore that noise. If it doesn't impair conversations between people, how can it be a permanent roadblock to speech recognition software?
If this is widely adopted, it'll probably stop working because the snoops will modify their software to keep snooping. Alexa et al will view any speech they can't understand as a bug and send a copy of it back to home base for analysis, so using this might make it MORE likely you're snooped upon.