BIG DATA wizards: LEARN from CERN, not the F500

Ian Michael Gumby
Boffin

Re: Huh?

Yeah... like I said, the author really doesn't know jack.

He's confused.

The latest generation of tools does its processing in memory rather than on disk.

Tools like Solr (in-memory indexing), Spark, and now Tachyon keep more data in memory rather than reading it from disk. This should reduce the time it takes to work with the data.
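
To make "using memory" concrete, here is a minimal Spark sketch in Scala: mark an RDD as cached, pay the disk read once, and let every later pass run against RAM. The path and names are placeholders I made up, not anything from the article.

    import org.apache.spark.{SparkConf, SparkContext}

    object CacheSketch {
      def main(args: Array[String]): Unit = {
        // local[*] just makes the demo runnable on one box
        val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // First action reads from disk/HDFS; the path is a placeholder.
        val logs = sc.textFile("hdfs:///data/access_logs")

        // Mark for in-memory storage. Nothing is materialized yet --
        // Spark stays lazy until an action runs.
        val errors = logs.filter(_.contains("ERROR")).cache()

        // Pass 1 pays the disk-read cost and populates the cache.
        println(errors.count())

        // Pass 2 runs against the cached partitions in RAM.
        println(errors.filter(_.contains("timeout")).count())

        sc.stop()
      }
    }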

However, at the same time the data has to persist to disk. Even in Spark, the data resides in RDDs, which are local to the process. The distributed file system makes the data available to any and all nodes, yet most of the time with Hadoop's Map/Reduce the data sits on local disks on the same nodes where the processing occurs. So you're pushing the code to the data, not the other way around. Code is at least one or two orders of magnitude smaller in size, so you will get better results than trying to push the data.*

*YMMV, it depends on what is being processed and the time it takes to process the data. If the time it takes to process a record of data is >> the time it takes to push that record across the network, then you would be better off not using M/R, because you will create hot nodes in the cluster.
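
On "pushing the code to the data", here is a rough Scala/Spark sketch of what actually moves: the scheduler asks HDFS where each block's replicas live and prefers launching the task on one of those nodes, so only the serialized closure crosses the network. Again, the path is hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalitySketch {
      def main(args: Array[String]): Unit = {
        // On a real cluster the master comes from spark-submit.
        val sc = new SparkContext(new SparkConf().setAppName("locality-sketch"))

        // Each HDFS block becomes one partition; the scheduler prefers
        // the nodes holding that block's replicas (NODE_LOCAL).
        val records = sc.textFile("hdfs:///data/events") // placeholder path

        // Only this small function -- the serialized closure -- ships to
        // the executors; the gigabytes of block data stay put.
        val fieldCounts = records.map(_.split(',').length)

        println(fieldCounts.reduce(_ + _))
        sc.stop()
      }
    }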
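
And a back-of-envelope version of the footnote's rule of thumb, with made-up numbers: if per-record CPU time dwarfs per-record transfer time, locality buys you little, and M/R-style placement can pile work onto a few hot nodes. The 10x cutoff below is an arbitrary illustration, not a real threshold.

    object ShipCodeOrData {
      // Locality-based scheduling (M/R style) pays off when moving a record
      // would cost about as much as, or more than, processing it.
      def preferLocality(processSecsPerRecord: Double,
                         transferSecsPerRecord: Double): Boolean =
        processSecsPerRecord <= transferSecsPerRecord * 10 // hypothetical cutoff

      def main(args: Array[String]): Unit = {
        // A 1 KB record over a ~1 Gb/s link takes roughly 8 microseconds to ship.
        println(preferLocality(0.5, 8e-6))  // false: CPU-bound, locality won't help
        println(preferLocality(1e-6, 8e-6)) // true: I/O-bound, keep code near data
      }
    }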
