I get this completely
Some of you have posted that you don't get why they switched from MIDI to raw audio waveforms.
I happen to think that's exactly the right thing to do, unless you're going to teach the neural network to actually play real instruments.
Here's my thinking. There have been plenty of times when I've seen sheet music for a piece and, if I play it "exactly" as written, it sounds awful. That is effectively what the MIDI-based AIs are doing: they're following the patterns of what's written in sheet-music form.
However, what separates a good performance from a bad one is the gap between what's written on the sheet music, which is only a guide, and the actual performance.
A performance comes alive when the player modifies the tempo, or changes the pitch, attack or vibrato on a given note. They may also play a note slightly sharp or flat, or add dynamic effects.
I mentioned the attack of a note. Henry Purcell's compositions, for example, call for a trumpet to be played full, with round notes; the pieces I'm thinking of are a little pompous and confident.
Brahms's Lullaby, however, needs a soft approach. The individual notes should be rounded, giving a much less aggressive attack. The tempo can be bent more freely, and there's also room to apply a wider dynamic range.
A MIDI file simply won't contain all this detail, but by training an AI on raw audio there is an opportunity for these parts to be learned.
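
For anyone who wants to see the difference in concrete terms, here's a minimal sketch (assuming Python with the mido and numpy libraries, purely as an illustration, not how any particular system is built): a MIDI note boils down to a few numbers, while the raw audio of the same note is tens of thousands of samples that actually carry the attack, vibrato and intonation.

```python
# Hypothetical comparison: what a MIDI note encodes vs. the raw audio of that note.
import mido
import numpy as np

# In MIDI, one second of middle C is just a couple of messages:
# which key, how hard it was struck, and when it starts and stops.
note_on = mido.Message('note_on', note=60, velocity=64, time=0)
note_off = mido.Message('note_off', note=60, velocity=64, time=480)
print(note_on)   # note_on channel=0 note=60 velocity=64 time=0

# The raw audio of that same one-second note at 44.1 kHz is 44,100 samples,
# and the waveform itself carries the attack, vibrato and intonation
# that the MIDI messages leave unspecified.
sample_rate = 44100
t = np.linspace(0, 1.0, sample_rate, endpoint=False)
freq = 261.63                                        # middle C in Hz
vibrato = 0.05 * np.sin(2 * np.pi * 5 * t)           # gentle 5 Hz pitch wobble
envelope = np.minimum(1.0, 20 * t) * np.exp(-2 * t)  # soft attack, natural decay
waveform = envelope * np.sin(2 * np.pi * freq * t + vibrato)
print(waveform.shape)                                # (44100,) samples vs. two messages
```

All of the expressive choices live in that envelope and vibrato, and none of them survive in the two MIDI messages.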