Triangulation with two or more transducers for locating keystrokes was put forward years ago... sorry, I haven't got a link, but it is much the same principle that cartographers and radio and radar engineers have used for years.
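For anyone who wants to play with the idea, here is a rough sketch of the two-transducer case in Python. It is not taken from any published attack. The sensor positions, key positions and wave speed are all made-up figures for illustration, and `predicted_tdoa` and `locate_key` are names I have invented. It just picks the key whose expected difference in arrival time best matches what was measured.

```python
import math

# Hypothetical layout: positions in metres on the desk surface.
SENSORS = [(0.0, 0.0), (0.40, 0.0)]
KEYS = {"Q": (0.05, 0.10), "T": (0.13, 0.10), "P": (0.29, 0.10)}

# Assumed propagation speed in m/s; real surfaces vary widely.
WAVE_SPEED = 500.0


def predicted_tdoa(pos):
    """Arrival time at sensor 0 minus arrival time at sensor 1, in seconds."""
    d0 = math.dist(pos, SENSORS[0])
    d1 = math.dist(pos, SENSORS[1])
    return (d0 - d1) / WAVE_SPEED


def locate_key(measured_tdoa):
    """Return the key whose predicted time difference is closest to the measurement."""
    return min(KEYS, key=lambda k: abs(predicted_tdoa(KEYS[k]) - measured_tdoa))


# Simulate a keystroke on "T" and recover it from the time difference alone.
print(locate_key(predicted_tdoa(KEYS["T"])))
```

With only two sensors a single time difference pins the source to a curve (a hyperbola) rather than a point, so keys lying near the same curve would be confused with one another. A third transducer, or more, is what narrows it down to a position.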
I can only imagine that this lad's solution requires the 'training' it does because it relies on different relative X Y Z values being detected at a single point, and that these vary depending upon the surface being used. With calibration, calibration, calibration, you wouldn't even need to develop a model of how the vibration passes through the medium.
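To show what I mean by calibrating instead of modelling, here is a minimal sketch in Python. I am guessing at the general shape of such a scheme, not describing the lad's actual method. The `calibrate` and `classify` functions and the sample readings are invented for illustration: record a few X Y Z readings per key, average them, then match a new reading to the nearest average.

```python
import math
from collections import defaultdict


def calibrate(samples):
    """Average the (x, y, z) readings recorded for each key during calibration."""
    grouped = defaultdict(list)
    for key, reading in samples:
        grouped[key].append(reading)
    return {
        key: tuple(sum(axis) / len(axis) for axis in zip(*readings))
        for key, readings in grouped.items()
    }


def classify(templates, reading):
    """Return the key whose averaged reading is nearest to the new one."""
    return min(templates, key=lambda k: math.dist(templates[k], reading))


# Made-up calibration data: (key, (x, y, z)) pairs.
samples = [
    ("a", (1.0, 0.0, 0.0)),
    ("a", (0.8, 0.2, 0.0)),
    ("s", (0.0, 1.0, 0.0)),
]
templates = calibrate(samples)
print(classify(templates, (0.9, 0.1, 0.1)))
```

Nothing in there knows anything about how vibration travels through the desk, which is the point. It also means the table has to be rebuilt whenever the surface or the phone's position changes.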
Another way of doing something similar might be to use two earbuds as transducers, if any phones allow for stereo-in. If not, then even using the mono headset mike*-in in addition to the phone's accelerometer and built-in microphones (most phones have two internal mikes, for noise cancellation) might drastically reduce the 'training time'.
Extra points awarded for using several phones, communicating with each other sonically (see Reg article yesterday!), to give more locations and accuracy.
*though presumably a bit of Blu-Tack might be required to stop it moving around the desk : D