Re: Nonsense comments
Enjoyed the Mitchell and Webb reference :D so you get a +1 from me, even though I don't fully agree with your stance. Regarding your quotes, my point stands: the two parts you quoted aren't mutually exclusive. From "holding" I assume you're implying the data is sent to Cisco's servers and analysed there, and I suspect that's a big part of your concern. Certainly possible, but actually really expensive when you can instead get end-users to pay for that processing power, bandwidth, and storage (probably in-memory, so let's say memory). There's a big push towards edge processing precisely to avoid the need for cloud compute: having the ML run on the customer's hardware and send only a summary of the findings to the cloud significantly reduces operational costs.

So on that note, the question becomes: is it a big no-no for Cisco (or whoever) to be collecting that telemetry? Without any further details, I think it's fair to say: perhaps. To answer that question we need to know what the telemetry actually contains. You can certainly be opposed to telemetry in general on privacy grounds - that's entirely reasonable, and I don't like it much either - but it's also pervasive and standard practice, so it's difficult to get overly offended about specific instances when collecting and sending basic usage logs (which is all "telemetry" means) is essentially ubiquitous.
To make sure I wasn't barking up the wrong tree with my rather opinionated post, I had a look at the actual paper. You can read the PDF yourself: the authors believe the telemetry contained the min, mean, and max audio levels collected over a one-minute period, which relates to the automatic gain control (the way the VCA automatically adjusts its volume). For an analogy, this is like taking an entire month of rainfall data and reporting only the min, mean, and max rainfall across the whole period. There simply isn't enough data to say how much it rained on a given day, and equivalently there isn't enough data in that telemetry to reconstruct a conversation or tell what is in the room. From it you can roughly determine how loud the environment is, perhaps spot that something loud happened from one minute to the next (which is potentially correlated with the user being present), and tell whether you might be in an anechoic chamber.

So if we think about this again from a privacy perspective, what we have is audio-derived data which is insufficient to reconstruct any dialogue or to determine what is in the environment. The adversary (Cisco, or someone breaking into their systems) could basically determine that you are being very quiet, very loud, or normal volume. I might care about that if I were worried someone could tell whether I'm home or out of the house, but who connects to a video conference and then leaves the house? If you're connected to the conference, THAT is the signal that you're home; trying to infer it from coarse-grained audio levels is nonsense.
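To put a rough number on how lossy that aggregation is, here's a quick Python sketch. The sample rate, frame size, and synthetic audio are my assumptions (the paper doesn't specify any of this); the point is just how many samples collapse into three values:

```python
import numpy as np

# Assumed parameters: 16 kHz mono audio, 20 ms frames for the level meter.
SAMPLE_RATE = 16_000
FRAME = 320  # 20 ms at 16 kHz

rng = np.random.default_rng(42)
audio = rng.normal(0.0, 0.1, SAMPLE_RATE * 60)  # one minute of stand-in audio

# Per-frame RMS level, roughly the quantity an AGC would track.
frames = audio.reshape(-1, FRAME)
levels = np.sqrt((frames ** 2).mean(axis=1))

# The whole minute reduces to just three numbers.
telemetry = (levels.min(), levels.mean(), levels.max())
print(audio.size, "samples ->", len(telemetry), "values")  # 960000 -> 3
```

Those three floats are what leaves the device; every ordering, timing, and spectral detail needed to reconstruct speech is gone before anything is sent.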
TL;DR: they aren't sending your conversation, they are sending a value on a scale from "it was very quiet in the last minute" to "something was loud in the last minute". If there were enough data in that to reconstruct a conversation, think how many MP3s we'd have been able to fit on a floppy disk! Unfortunately it isn't so.
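For fun, the floppy-disk joke survives a back-of-the-envelope check (the bitrate and float size here are assumed figures, not from the paper):

```python
# A 128 kbps MP3 vs three 4-byte floats of telemetry per minute.
mp3_bytes_per_min = 128_000 / 8 * 60   # 960,000 bytes of actual audio
telemetry_bytes_per_min = 3 * 4        # 12 bytes of min/mean/max

ratio = mp3_bytes_per_min / telemetry_bytes_per_min
print(f"the audio is {ratio:,.0f}x larger than the telemetry")  # 80,000x
```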
Happy to discuss further.