"With Webmap, we can do this 33 per cent faster on the same hardware."
Never mind, Microsoft will port it to .NET when they buy Yahoo and make it slower again.
Sometime after the New Year, Yahoo! flipped the switch on what it calls the world’s largest Hadoop application, using the much-hyped open-source grid computing platform to tackle a task no smaller than the web itself. Known as Yahoo! Search Webmap, this Hadoopified mega-app provides the world’s second most popular search …
I would like to see another great search engine like Google. I used Yahoo! for years until Google became faster and more accurate than Yahoo!
There's one question that remains unclear to me after reading this article:
Does speeding up the indexing of pages also speed up the search results? I think not. And does it mean there are better search results with the use of this grid? It only seems to be a matter of more efficient CPU use.
Broadly speaking, GFS, Nutch, Hadoop et al. are all designed to improve processing rather than results. They're all about distributed systems on commodity hardware, so they scale horizontally rather than vertically and improve the way that processors are used for those queries. JAQL does sound like the complementary query language for the system, but I suppose the implication is that it has to remember which queries 'work' by scoring them and remembering them, making them part of the information that's being processed in the first place...
Getting more CPU work out of a given set of hardware certainly reduces costs and electrical power usage. But increasing the total CPU work available can also enable new levels of data analysis and algorithms that were unfeasible without the horsepower for them. A decade ago I helped a major airline upgrade from an SGI server to a cluster of IBM SP2 supercomputers and massive storage arrays - and they began using data an entire order of magnitude larger and more complete in their load forecasting calculations. They did rewrite the software (they had to parallelize it explicitly as they moved from SMP to MPP, and change the calculations to use more granular data), but the notion of what they were doing didn't change - just the data and the ability to sift through more of it. Apparently they got considerably better results once they tuned the new forecasting system's parameters.
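For anyone wondering what "more granular data" looks like in practice, here is a toy Python sketch. The bookings, flight numbers and field layout are all invented for illustration; I can't reproduce the airline's actual data or calculations. It only shows the difference between a coarse daily total and a finer per-flight view of the same records.

```python
from collections import defaultdict

# Invented sample records: (day, flight, seats_booked).
bookings = [
    ("Mon", "F100", 120),
    ("Mon", "F200", 40),
    ("Tue", "F100", 115),
    ("Tue", "F200", 95),
]

# Coarse view: one total per day, which hides differences between flights.
per_day = defaultdict(int)
for day, flight, seats in bookings:
    per_day[day] += seats

# Granular view: one figure per (day, flight), so each flight can be
# looked at on its own.
per_flight = {(day, flight): seats for day, flight, seats in bookings}

print(dict(per_day))
print(per_flight[("Mon", "F200")])
```

The granular view holds more values for the same period, which is roughly why finer-grained data needs more horsepower to sift through.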
Okay, so, I've not actually read the article (I might do in a minute) but I thought that there was a rule that stated all items with "Yahoo!" in the title had to have an exclamation mark after every word in the headline.....
coat, retrieved, donned, and left the building.
"had to parallelize it explicitly as they moved from SMP to MPP, and change the calculations to use more granular data"
WTF are you on about? I'm sorry, but after twenty years in the computer industry along with a degree in computer science, I still have no idea what you mean there. Could you rephrase it so us mere mortals have a clue?
MPP = Massively Parallel Processing
With an SMP box, there is a single OS image that schedules applications across the processors. If you write threaded code, then most SMP implementations will schedule threads on separate processors without you having to write the code to explicitly take into account the fact that there are multiple processors.
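To make the SMP case concrete, here is a minimal Python sketch. The names `count_words` and `documents` are made up for illustration. The code only says "run these in threads"; it never names a processor, and placement is left to the OS scheduler. One caveat: in standard CPython the global interpreter lock stops threads from running Python bytecode in parallel, so treat this as a picture of the programming model rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical workload: count the words in each document.
documents = ["the quick brown fox", "jumps over", "the lazy dog"]

def count_words(text):
    return len(text.split())

# All threads live in one process under one OS image and share memory.
# Nothing here says which processor runs which thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(count_words, documents))

total_words = sum(counts)
print(total_words)
```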
With MPP, there are multiple OS images in the cluster, and you have to write to an API that will allow different units of work to be placed on different systems. This means you have to make the application much more aware of the shape of the cluster. This also means that if not written carefully, you may not get better performance by adding additional nodes into the cluster.
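As a rough picture of what "making the application aware of the shape of the cluster" means, here is a toy Python model. Everything runs in a single process, and the `Node` class, `partition` and `run_job` are invented for illustration; a real MPP program would use a message-passing API such as MPI instead. The point is only that the application itself splits the data, sends each piece to a node, and combines the partial results.

```python
# Toy model of an MPP job: each "node" has private memory and only sees
# what is explicitly sent to it. This is a single-process simulation,
# not a real cluster.

class Node:
    def __init__(self, rank):
        self.rank = rank
        self.inbox = []  # private to this node, nothing is shared

    def receive(self, chunk):
        self.inbox.extend(chunk)

    def compute(self):
        # Each node works only on its own slice of the data.
        return sum(self.inbox)


def partition(data, n_nodes):
    """Split data into n_nodes contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_nodes)
    chunks, start = [], 0
    for rank in range(n_nodes):
        end = start + size + (1 if rank < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def run_job(data, n_nodes):
    nodes = [Node(rank) for rank in range(n_nodes)]
    # The application decides which node gets which data.
    for node, chunk in zip(nodes, partition(data, n_nodes)):
        node.receive(chunk)
    # Partial results are collected and combined explicitly.
    partials = [node.compute() for node in nodes]
    return partials, sum(partials)


partials, total = run_job(list(range(1, 11)), n_nodes=3)
print(partials, total)
```

Note that `partition(list(range(4)), 8)` hands four of the eight nodes an empty chunk: if the work does not divide well, extra nodes just sit idle, which is one way adding nodes fails to help.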
Unfortunately, too many IBM SP/2 implementations were not really parallel processing clusters, more like LAN-in-a-can systems (goodness, where did I dredge that term up from).
But what Google does is a quantum leap up from what SP/2s were capable of, and is much more like Mare Nostrum and Blue Gene/L.
Thanks for the explanation. I'd got the SMP and MPP bit, I've just never heard of granulising data. It left me totally baffled (as you probably guessed).
AC: You'll be pleased to know I've since googled it, and it's just a posh name for creating subsets or something like that... I think? ...Possibly?
Nurse, Nurse! Where's my Spectrum?
I'm afraid my adventures in Concurrent Processors stopped at the Transputer and Occam, something I never got the hang of because of my habit of using TAB instead of spaces. Boy, would that confuse things :)
So your explanation was gratefully received.
Mind, I still use the command line to compile stuff, Old dog, new tricks and all that.