Little voice in the back of your mind
That is a fantastic piece of research. Now they just need to run their detection methodology over the 100 billion TikToks
Or
Switch the assistants off
Academics in the US have developed an attack dubbed NUIT, for Near-Ultrasound Inaudible Trojan, that exploits vulnerabilities in smart device microphones and voice assistants to silently and remotely access smart phones and home devices. The research team — Guenevere Chen, an associate professor at the University of Texas at …
A single R-C filter isn't good for much and these microphones are only 2 to 8 cubic mm. They're already full of MEMS hardware, MEMS electrostatic bias, a digitizer, and solder pads.
There could be a software fix if it's possible to adjust the sample rate. These digital MEMS mics take samples at whatever speed they're clocked for, so varying the clock rate of the interface would scramble sample aliasing attacks. Of course you'd have to resample the result in software without adding any new aliasing bugs. It's easy math but mistakes wouldn't be audible.
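To put rough numbers on that: a minimal sketch, assuming the attack relies on plain sampling aliasing as this comment supposes, and assuming an ideal sampler with no anti-alias filter. The function name and the 21 kHz tone are made up for illustration. It shows that the same inaudible tone lands on a different audible-band frequency each time the clock rate changes, which is the "scrambling" effect described above.

```python
def alias_frequency(tone_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a pure tone after ideal sampling (no anti-alias filter)."""
    folded = tone_hz % sample_rate_hz
    # Anything above half the sample rate folds back below it.
    return min(folded, sample_rate_hz - folded)


TONE_HZ = 21_000.0  # hypothetical near-ultrasound component, not from the research

for rate_hz in (16_000.0, 18_000.0, 20_000.0):
    alias_hz = alias_frequency(TONE_HZ, rate_hz)
    print(f"{rate_hz:.0f} Hz clock -> tone appears at {alias_hz:.0f} Hz")
```

A command crafted to alias correctly at one clock rate would come out at the wrong pitch at another, though the resampling step mentioned above would still have to be done carefully.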
Yes, 500-3500 Hz on POTS worked OK-ish for decades because you have redundancy in speech as well as (usually) context to help you resolve ambiguities. Add in background noise, and the redundancy degrades, and so does the resulting accuracy of interpretation. Hence the need for Alpha, Bravo, Charlie, Delta, Echo.......
The range above 3500 Hz is useful for resolving direction, as well as being less prone to corruption by reverberation, and so helps separate out competing sources. As for the remaining range up to 20 kHz: that matters for music, dog whistles, and mosquito tones, but there is negligible energy above 11 kHz in even high-quality recorded speech.
........ speaks a person too old to hear above 10 kHz these days.
A wadge of cotton wool over the microphone would serve as a reasonable low-pass filter for speech.
There is content in the human voice, percussion, and strings up to and past 20 kHz that lots of folks do hear. It is that airy and raspy quality you can hear from a CD or nice record that muddy FM (which stops around 15 kHz) is missing. And then there's AM, which is what is being suggested here. In short, having the ability to record up to and past 20 kHz is very desirable and produces a noticeable improvement to the sound for nearly everyone. It also helps simplify the maths, but that's a whole other ballgame.
Not a topic I have any particular knowledge about - but might a broader bandwidth allow improvements in speech recognition in noisy situations?
Nope. In most instances you want to restrict the bandwidth to improve intelligibility. Obviously there's a limit to bandwidth reduction before you get to the point of reducing intelligibility, but if you cover the "speech band" - which is that which was conveyed by the old telephone network (300 Hz to 3 kHz is normally enough, sometimes with slight emphasis around 2 kHz) - you achieve maximum intelligibility, even in noisy environments. The use of noise-cancelling microphone methods really helps: two microphones face in opposite directions, the user talks into one, and both receive the ambient noise. By phase reversing one of the mics, the common-mode signal (the ambient noise) is cancelled, and the difference (the speech) is transmitted.
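The two-microphone trick above can be sketched in a few lines. This is an idealised illustration only: it assumes the ambient noise reaches both mics with identical level and timing, which real hardware never quite manages, and the tones standing in for speech and noise are invented for the example.

```python
import math

SAMPLE_RATE_HZ = 8_000
NUM_SAMPLES = 800


def tone(freq_hz, amplitude):
    """Sine wave standing in for a real signal."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ)
            for i in range(NUM_SAMPLES)]


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


speech = tone(1_000, 1.0)  # stand-in for the talker
noise = tone(120, 0.8)     # stand-in for ambient hum, same at both mics

front_mic = [s + a for s, a in zip(speech, noise)]  # talker plus ambient
rear_mic = list(noise)                              # ambient only

# Phase-reverse the rear mic and sum, i.e. subtract it from the front mic.
output = [f - r for f, r in zip(front_mic, rear_mic)]

noise_before = rms([f - s for f, s in zip(front_mic, speech)])
noise_after = rms([o - s for o, s in zip(output, speech)])
print(f"noise before: {noise_before:.3f}, after: {noise_after:.3g}")
```

In practice the cancellation is only partial, because the noise arrives at the two capsules with slightly different delay and gain.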
The "hi-fi" bandwidth of mobile phone microphones is probably provided to enhance recording. It's trivial to filter the speech input to these "digital assistants", so it should be done!
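As a rough idea of how little code "filter the speech input" needs: a minimal sketch of a second-order low-pass using the widely published RBJ "Audio EQ Cookbook" biquad form. The 3.5 kHz cutoff, 48 kHz sample rate, and function names are assumptions for illustration, not anything a real assistant is known to use, and a filter like this only helps if the unwanted signal is still above the speech band when it reaches the software.

```python
import math


def lowpass_biquad(samples, cutoff_hz, sample_rate_hz, q=1 / math.sqrt(2)):
    """Second-order low-pass (RBJ cookbook coefficients), direct form I."""
    w0 = 2 * math.pi * cutoff_hz / sample_rate_hz
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    a0 = 1 + alpha
    b0 = (1 - cos_w0) / 2 / a0
    b1 = (1 - cos_w0) / a0
    b2 = b0
    a1 = -2 * cos_w0 / a0
    a2 = (1 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def gain_at(freq_hz, cutoff_hz=3_500, sample_rate_hz=48_000, n=4_800):
    """Measured gain of the filter for a pure tone at freq_hz."""
    tone = [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]
    filtered = lowpass_biquad(tone, cutoff_hz, sample_rate_hz)
    # Skip the first half so the start-up transient does not skew the result.
    return rms(filtered[n // 2:]) / rms(tone[n // 2:])


print(f"1 kHz gain:  {gain_at(1_000):.3f}")
print(f"20 kHz gain: {gain_at(20_000):.4f}")
```

The 1 kHz tone should pass almost untouched while the 20 kHz tone is heavily attenuated; a steeper filter would simply chain more stages.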
Because I am pretty sure the microphones are off-the-shelf hardware, not custom made per smartphone design. Therefore the manufacturer would choose to design a product that offers the most versatility to designers and the most sales options to potential buyers. The single mic design can be used in everything from voice recorders to smartphones to devices ready to record music, all without incurring the costs of additional SKUs. This allows tremendous volume production with lower inventory costs, which is pretty high on all these device makers' requirement lists.
So, yeah. Industrialized capitalism.
You need to control the Nyquist frequency set by the digital sample interval: the frequency above which higher analog frequencies start to alias into successively lower digital frequencies (think of wagon wheels seeming to go backwards in movies). Sampling fast enough to allow 20 kHz means that interference from somewhat higher frequencies aliases to frequencies that still land near or above 20 kHz, and so remains out of audible range.
They have demonstrated that they can activate a lot of voice assistants, but all but one of them is going to talk to the user while executing the malicious commands. That gives the user a chance to hear that something is going on. More importantly, for most of the interaction the user can simply shout "no" to cancel it, because most of the steps, such as authorizing a transaction or confirming a lock, ask a yes-or-no question, and the local voice will be more easily detected than the ultrasound.
The only one they can activate without making a loud sound is Siri, but that one will pose some extra problems. Unlike some others which listen for anyone saying their wake word, Siri is activated by pressing a button or by a specific voice. Activating the voice wake word requires the user to train the phone to recognize their voice specifically, and it then doesn't generally activate on someone else's voice. If you have a friend with an iPhone, try it and see if theirs turns on. This means that an attacker can't just create a single track to activate Siri on any device, and if they don't already have a recording of the victim saying the wake word, they can only hope to activate with other samples. This might provide some insulation to practical use of the attack.
"Siri is activated by pressing a button or by a specific voice"
If there's a way to turn off the "specific voice" component, so ONLY pressing the button would activate the assistant, that would pretty well stop these kinds of attacks. Bonus points if the microphone doesn't get turned on unless the button is being pushed, i.e. the "assistant" only listens when the button says to.
"If there's a way to turn off the "specific voice" component, so ONLY pressing the button would activate the assistant, that would pretty well stop these kinds of attacks."
There is, and if you don't train it on your voice, that's the default.
"Bonus points if the microphone doesn't get turned on unless the button is being pushed, i.e. the "assistant" only listens when the button says to."
Yes, it has that. Because it's on a phone, the microphone is still connected, but if you don't have the voice activation turned on, Siri won't be processing any input from the mic.
"but all but one of them is going to talk to the user"
I don't have a connected garage door. Plenty of people do: they use their IOT system to open the door for tradesmen and deliveries when they are not at home
Couple that with a music system left turned on or a beamed ultrasound attack, and you've got a potential problem.
"Plenty of people do: they use their IOT system to open the door for tradesmen and deliveries when they are not at home"
I do it the old fashioned way by hiding a key outside and telling them where it is. After they've been and done, I retrieve the key. If they don't return the key, they don't get paid. The downside is they could have the physical key duplicated, but if they want to return and nick some things later, they'd be better off breaking in since the lack of a forced entry would put them under suspicion. If I didn't happen to be available to pick up the phone to send a code to let them in when they deigned to show up, they'd just leave and bill me for the visit. At least I won't have hundreds in the tech that's required to do it the electronic way.
That's alright, because you require authentication to make these devices do anything on your local network or with your local devices, right?
I mean, you can't just say "Do This Stupid Thing" in any voice and have it immediately carry out that command, right?
You know, where "This Stupid Thing" could include "make unwanted phone calls and money transfers, disable alarm systems, or unlock doors". I mean, you put all those interfaces behind passwords and authentication and two-factor and confirmation that the requestor is the authorised user of the system, right?
You don't just let someone turn off your alarm system by having a random stranger say "Turn off alarm system", right? That would just be terminally stupid, I think we agree.
What happens if you nuke Siri and her pals from MS, Google, and Amazon, from orbit? The first thing I do with a new iDevice is turn Siri off, and, where possible, delete the mouthy bitch. MS seems to have abandoned Cortana, and in any case I nuke that even more mouthy bitch on sight. And I don’t have any of Google’s or Amazon’s mouthy bitches, and never will. If the ‘voice assistant’ is turned off or deleted, it can’t be attacked, right?
I recently had to install Windows 10 for the first time, and as a guy that's played HALO literally since it came out, having Cortana suddenly speak up during the install WAS CREEPY AS FUCK.
It was about as unnerving as hearing a voice mail from my dead grandmother or something on that order.
Made my skin crawl.
I couldn't hit "turn that crap OFF NOW" fast enough.
Could you play "unlock door" to an ultrasonic transducer stuck onto a window so that people in the room wouldn't know the door was unlocked?
Seems a bit risky, because I guess even a normal speaker set up to vibrate the glass and transmit sound would make a door very vulnerable.
... when my plants start screaming?
Apple has already patented an iDog, which is a device in the shape of a classic dog (with rounded corners, of course) that detects the attack and alerts the owner with iBarks. So far they haven't managed to solve the issue of Siri ordering tonnes of dog food with each complaint from the puppy.
It doesn't have to be silent to work - people sleep. That's the best (for the crooks) time to buzz open the front door, actually. Near me recently, somebody heard a crashing sound in the middle of the night, but went back to sleep. In the morning they discovered their garage door ripped off its hinges and things missing.
Analog transducers aren't linear. With MEMS, it can be even more odd. While the microphone's response isn't flat up to 30 kHz or so, it can have peaks above the audible range, and adding a low-pass filter in some of these circuits isn't an option. The effect could be one of sub-harmonic distortion, so you tickle the mic above audible and it outputs a signal well below that. I had to do a deep dive into inertial measurement units (IMUs) some years ago and got familiar with MEMS issues at a very basic level.
Encoding the signals into a YouTube video isn't going to work. The playback device would need to be able to reproduce the sounds and no consumer audio speakers I've ever come across are better than 10dB down at 20kHz on their way to nothing. The target would need to have a rather expensive audiophile system or a professional audio system with TAD Beryllium HF drivers. The source would also need to have no high-pass or low-pass filtering, which is very commonly fitted to prevent overloading by out-of-band signals. Many modern amplifiers have low-pass filters, as they can have enough bandwidth to transmit if they aren't 'slowed down'.
"no consumer audio speakers I've ever come across are better than 10dB down at 20kHz on their way to nothing"
You've seen only crappy stuff then, "not-speakers" so to say. Beepers or buzzers can't, I can admit that.
Of course it depends on what you mean by 'consumer speakers', but typically anything you can buy from a shop is, by definition, 'consumer' stuff. Like my speakers, and these are ±2 dB from 20 Hz to 20 kHz. Nothing high end, just hifi.
https://www.whathifi.com/best-buys/hi-fi/best-hi-fi-speakers
.... and every one of them reaches up to 20kHz ... cheapest with a massive price tag of £250. For a speaker that's not a lot.