"need to have access to their target's voice"
With all the audio data available from all sorts of sources on the Internet, that doesn't seem to be much of a barrier if you're seeking to spoof the voice of anyone who is well known. Celebrities, politicians, major CEOs: all of them have their voices on publicly available sources somewhere.
Now, if you're targeting someone for specific reasons and that person is not a social media aficionado, things get a lot more complicated, especially if you do not know the person socially. You're going to have to find a way to meet the person, get the person talking and put your mobile phone down to record the conversation. That will mean cleaning up the recording afterwards, which is never an easy task.
So, the basic question really is: just how well recorded does the target's voice need to be? Will a few dozen seconds in the street suffice, or do you need a few minutes of sound booth recording?