GPGPU in Java is very difficult
While not specifically to do with GPUs, I did some research a while back into getting a Java VM running on a PS3 so that it could run code on the SPUs as well as on the main CPU (the PPU). Besides the approach of adding new keywords to the language (as mentioned in the article), I found two projects that got some way towards the goal. Both aimed to take unmodified Java code and run it on an asymmetric CPU setup (i.e., the PS3's Cell). The first (*) was based on the CACAO JIT compiler and hooked into the function call mechanism so that each method got executed on a separate core. I don't think the author got very far, but he went over a lot of options and documented the design decisions very well. The second (**) was based on the Jikes Research VM and used thread creation as the point at which to migrate control to a new core. That project got a lot further, but I don't think they ever publicly released the code (although an email to the authors would probably get you access). Again, the papers they produced give really good descriptions of the approach taken.
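To make the second approach a bit more concrete: the appeal of migrating at thread creation is that ordinary threaded Java like the sketch below would need no changes. This is plain Java that runs on any JVM, and nothing in it comes from either project; the class and method names are made up for illustration. The only point is that Thread.start() is a natural place where a VM of that kind could, in principle, decide which core type a piece of work lands on.

```java
// Plain, unmodified Java. Nothing here is specific to CACAO, Jikes RVM or the Cell;
// ThreadOffloadSketch and parallelSum are hypothetical names used for illustration.
final class ThreadOffloadSketch {

    // Sums data[from..to) and stores the result in its own slot of the results array.
    static final class Worker extends Thread {
        private final long[] data;
        private final int from;
        private final int to;
        private final long[] results;
        private final int slot;

        Worker(long[] data, int from, int to, long[] results, int slot) {
            this.data = data;
            this.from = from;
            this.to = to;
            this.results = results;
            this.slot = slot;
        }

        @Override
        public void run() {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            results[slot] = sum;
        }
    }

    // Splits the array across `workers` threads (workers must be >= 1).
    static long parallelSum(long[] data, int workers) {
        long[] results = new long[workers];
        Thread[] threads = new Thread[workers];
        int chunk = (data.length + workers - 1) / workers;
        for (int w = 0; w < workers; w++) {
            int from = Math.min(w * chunk, data.length);
            int to = Math.min(from + chunk, data.length);
            threads[w] = new Worker(data, from, to, results, w);
            // Assumption for illustration: a thread-migrating VM could choose
            // the target core at this call. A stock JVM just starts a thread.
            threads[w].start();
        }
        long total = 0;
        for (int w = 0; w < workers; w++) {
            try {
                threads[w].join(); // join() makes the worker's write visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for worker", e);
            }
            total += results[w];
        }
        return total;
    }

    public static void main(String[] args) {
        long[] data = new long[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(parallelSum(data, 4)); // prints 500500
    }
}
```

Hooking method calls instead (the first approach) would put the decision point at every invocation rather than at start(), which is a much finer grain and presumably a lot more traffic between cores.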
Where I'm going with this is that it's hard enough to target JIT code generation at an asymmetric setup like the Cell processor. It's much more difficult when you try to target graphics hardware. OpenCL is a nice step in the right direction for GPGPU work, but graphics hardware design (except maybe for the really high-end stuff; I'm not sure) still tends to be geared towards fixed ideas of the execution pipeline (e.g., shader models, emphasis on textures and matrices). There are generally fairly high penalties for things like branching, sending data back out to the CPU (outside of the frame buffer mechanisms), context switching and inter-core communication (again, if it lies outside the standard shader/pipeline model). I would love to see GPU cores, and the interconnects between them and the CPU, moving more towards the Cell model, but OpenCL notwithstanding I don't think we're making much progress in that direction. Likewise, I'd love for these guys to succeed, though I think it's going to be a long, hard climb.
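On the branching penalty, here is a toy cost model in Java rather than real GPU code. It assumes a group of lanes that executes in lockstep, so the group pays for every side of a branch that at least one lane takes. The group size of 32 and the costs are made-up numbers, not measurements from any particular hardware, and DivergenceSketch and groupCost are hypothetical names.

```java
// Toy model only: an assumption-laden illustration, not a benchmark.
final class DivergenceSketch {

    // A lockstep group pays for each branch side that any of its lanes takes.
    static int groupCost(boolean[] takesThen, int thenCost, int elseCost) {
        boolean anyThen = false;
        boolean anyElse = false;
        for (boolean t : takesThen) {
            if (t) {
                anyThen = true;
            } else {
                anyElse = true;
            }
        }
        return (anyThen ? thenCost : 0) + (anyElse ? elseCost : 0);
    }

    public static void main(String[] args) {
        boolean[] uniform = new boolean[32]; // every lane takes the else side
        boolean[] mixed = new boolean[32];
        mixed[0] = true;                     // one lane takes the then side

        System.out.println(groupCost(uniform, 100, 10)); // prints 10
        System.out.println(groupCost(mixed, 100, 10));   // prints 110
    }
}
```

In this model a single lane going the other way makes the whole group pay for both paths, which is the sort of thing a JIT written with CPUs in mind never has to worry about.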
* http://vergiss-blackjack.de/diploma-thesis_georg-sorst_java-on-cell.pdf
** http://research.microsoft.com/pubs/132351/herajvm_oopsla.pdf