BIG DATA wizards: LEARN from CERN, not the F500

Big data has a problem: it is being abused. One of the biggest misconceptions is that big data is about archiving everything forever, buying the biggest, cheapest storage pool, and building a giant proverbial barn of hay in which to hunt for needles. Vendor marketing has encouraged this. Consider marketing that advises that you …

  1. James 51

    We need 'CERN does it better' t-shirts.

  2. et tu, brute?
    Thumb Up

    Creative solution of the year!

    "When looking for a needle in a haystack, it's easier to just burn all the hay."

    But you would need a blower as well to get rid of the ashes...

    1. Terje

      Re: Creative solution of the year!

      Or to stay in with what CERN would do, just pass one very very big spare magnet above the ashes and pick up the needle with that!

      1. et tu, brute?

        Re: Creative solution of the year!

        "Or to stay in with what CERN would do, just pass one very very big spare magnet above the ashes and pick up the needle with that!"

        Would that still require burning the hay though? Or could we find the needle and spare the hay using the magnet?

        I mean, horses also have to eat... and overdone food is never nice!

    2. Sproing

      Re: Creative solution of the year!

      Magnet. They've got a few lying around ...

  3. Anonymous Coward
    Anonymous Coward

    "99.99 per cent of the sensor stream data produced is discarded"

    So starting with 500EB/day, that leaves "only" 50PB per day to store...

    1. Lost_Signal

      Re: "99.99 per cent of the sensor stream data produced is discarded"

      Good catch!

      Looks like some 9s got dropped in the shuffle (and, looking around, many secondary sources had it wrong too).

      Let's try this again, shall we?

      L1 filtering: 40 MHz to ~60-65 kHz (so ~0.16% of data retained).

      L2 filtering: 65 kHz to 6 kHz (so ~10% of data retained).

      L3 filtering: 5-6 kHz to 500-600 Hz (so ~10% of data retained). Overall, ~99.9985% of the data is filtered out (well before further passes are made at the data). (Does that math check out?)

      As much as I love arguing about percentages, the point is that a lot of companies are hoarding the inverse (keeping 99.99% rather than 0.01%, or even 0.0015%), something they can "afford" to do at small scale when they don't try to do anything useful with the data, but which will not scale as volumes grow.
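That cascade can be sketched as code, using the approximate stage rates above (the rates are the commenter's figures, not official ones):

```python
# Approximate trigger stage rates (name, events_in_hz, events_out_hz),
# taken from the figures discussed above.
stages = [
    ("L1", 40_000_000, 65_000),
    ("L2", 65_000, 6_000),
    ("L3", 6_000, 600),
]

# Chain the retention fractions: each stage keeps out/in of its input.
retained = 1.0
for name, rate_in, rate_out in stages:
    retained *= rate_out / rate_in

print(f"retained: {retained:.4%}")      # ~0.0015% of the raw stream
print(f"filtered: {1 - retained:.4%}")  # ~99.9985%
```

Note that the overall fraction collapses to final rate over initial rate (600 Hz / 40 MHz), which is a handy sanity check on the per-stage percentages.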

  4. Ian Michael Gumby

    It would help if the author knew what he was talking about.

    As someone who's been involved in using 'Big Data' tech at F500 companies for over five years, and as someone who's helped set the strategy at a couple of those companies, I can say that the author doesn't know jack.

    You are comparing apples to oranges when you try to talk about sensor data in the same way you talk about transactional data. Big difference. With sensor data: is it discrete or continuous? If it's continuous, you may have long periods of unchanging readings being stored; that data could be tossed, letting you focus on the discrete events that occur when the sensor's input actually changes.
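The "toss the unchanging readings" idea is essentially a deadband filter. A minimal sketch (the function name and threshold are hypothetical, for illustration only):

```python
def deadband_filter(readings, threshold=0.5):
    """Keep a continuous sensor reading only when it moves more than
    `threshold` away from the last value we stored; long flat stretches
    collapse to a single stored point."""
    stored = []
    last = None
    for t, value in readings:
        if last is None or abs(value - last) > threshold:
            stored.append((t, value))
            last = value
    return stored

# Flat stretches of readings collapse; only the changes survive.
samples = [(0, 20.0), (1, 20.1), (2, 20.0), (3, 25.0), (4, 25.1), (5, 20.0)]
print(deadband_filter(samples))  # [(0, 20.0), (3, 25.0), (5, 20.0)]
```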

    The reason vendors are saying to store everything is that many who are new to Big Data don't know what data is relevant and what data is not, or what data may become relevant when you combine it with other data sets.

    Currently the F500 tend to store things in a relational model and when items don't fit, they get dropped. By going to a semi-structured or unstructured system, you can retain more attributes which may hold value. In terms of purging data or moving to cold storage... there are other factors like regulatory and business use cases that determine what to do.

    In terms of the vendors, they are not going to tell a business what to do or how to do it. (Of course they'll spin reality to let them sell their version of Hadoop/Big Data and what tool is best.) They are going to say, when in doubt, save everything. Hardware is relatively cheap and getting cheaper in terms of cost per TB.

    Sorry, but the author should look inwards and think more about the problem than trying to base a recommendation on something he read on the internet.

    Not all problems are equal, so why should their solutions be the same?

    I guess Mr. Nicholson should add more tools to his tool belt. His hammer vision makes everything look like nails and even with nails, you will want to think about different types of hammers. ;-P

    1. StuCom

      Re: It would help if the author knew what he was talking about.

      I think perhaps you should go back and re-read the article. Mr Nicholson makes some great points. You're right that transactions aren't the same as sensor data; I'd question the merit of moving from a relational model for all but the most specialist areas. Where's the payoff? Just what is being stored? If it was being dropped before (presumably because something was wrong with it?), is it really needed?

      1. Ian Michael Gumby

        @StuCom Re: It would help if the author knew what he was talking about.

        Naw, sorry, any sort of 'points' are lost in the blubbering noise.

        In the enterprise, there is always going to be a need for transactional systems, so you'll need the RDBMS. WRT Hadoop, Hive uses an RDBMS to manage schema data, as does HCatalog. Then there's Ambari (Hortonworks) and Ranger for security, while Sentry (Cloudera) apparently does not use one.

        But where relational modelling falls apart is that it's an inflexible schema that is set at the beginning. Hadoop's tools have a 'late binding' schema at run time. (Sorry for the lack of a better description: schemas are enforced when you run the job, not when you load the data into the files.)
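That 'late binding' (schema-on-read) idea can be sketched roughly like this; the record fields and helper are hypothetical, just to show the shape of it:

```python
import json

# Raw semi-structured records are stored as-is. A schema-on-write load
# would have rejected or truncated the missing/extra fields up front.
raw = [
    '{"id": 1, "temp": 20.5, "site": "A"}',
    '{"id": 2, "temp": 21.0}',                          # missing "site"
    '{"id": 3, "temp": 19.8, "site": "B", "rh": 40}',   # extra attribute
]

def read_with_schema(lines, schema):
    """Bind a schema at read time: pick only the fields this job cares
    about, defaulting any a record lacks. A later job can bind a
    different schema to the same raw files."""
    for line in lines:
        rec = json.loads(line)
        yield {field: rec.get(field, default) for field, default in schema.items()}

job_schema = {"id": None, "temp": None, "site": "unknown"}
for row in read_with_schema(raw, job_schema):
    print(row)
```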

        There's more, but you should get the idea.

        The author really doesn't know much about Hadoop and the other tools in the ecosystem so for him to make a suggestion to look towards CERN is a bit of a joke. No offense to CERN because they have done some really good work there and they do know what they are doing.

        My point is that CERN and the F500 are two different beasts.

      2. Lost_Signal

        Re: It would help if the author knew what he was talking about.

        Correct. This short article wasn't meant to settle the NoSQL vs SQL argument for good (or even address it!); it was to argue that reducing and filtering data on ingestion is a good tool that everyone should take more seriously. As always, your mileage may vary.

  5. Christopher Lane

    I wonder...

    ...if you can walk all the way around the "loop", and if you can, does it have subterranean passport control at the four points it crosses the French-Swiss border?

    1. Chemist

      Re: I wonder...

      "subterranean passport control at the four points it crosses the French-Swiss border?"

      Schengen, in a word. Haven't shown a passport at the Swiss border with anywhere for years.

      1. Yes Me Silver badge

        Re: I wonder...

        You wouldn't want to try it while the machine was running. Not if you wanted a long and cancer-free life. Also, you can visit from where you are right now.

    2. Uffish

      Re: I wonder...

      Can you imagine the customs problems with all those hadrons crossing and re-crossing the borders?

  6. Anonymous Coward
    Anonymous Coward


    Huh?

    "If data is processed on disk then this processing is occurring over high throughput, low latency connections." There's no case where disk constitutes a higher throughput, lower latency connection than in-memory processing. I'm a little lost as to what this sentence means in this article.

    1. Ian Michael Gumby

      Re: Huh?

      Yeah... like I said the author really doesn't know Jack.

      He's confused.

      The latest generation of tools uses memory rather than disk for its processing.

      Tools like Solr (in-memory indexing), Spark, and now Tachyon use more memory rather than reading from disk. This should reduce the time it takes to work with the data.

      However, at the same time the data has to persist to disk. Even in Spark the data resides in RDDs, which are local to the process. The distributed file system makes the data available to any and all nodes, yet most of the time with Hadoop's Map/Reduce the data resides on local disks where the processing occurs. So you're pushing the code to the data and not the other way around. Code is at least one or two orders of magnitude smaller in size, so you will get better results than trying to push the data. *

      *YMMV: it depends on what is being processed and the time it takes to process it. If the time it takes to process a record of data is >> the time it takes to push the data across the network, then you would be better off not using M/R, because you will create hot nodes in the cluster.

    2. Anonymous Coward
      Anonymous Coward

      Re: Huh?

      I took it to mean that they apply a "filter" at source (maybe it never even emits certain data) and only save what they know they need, as opposed to saving all the data and then filtering it. To go back to the logging analogy, it would be like not logging everything (which generates a huge amount of noise) and only logging changes / new / unusual events (obviously depending on your needs).
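That "filter at source" reading can be sketched as a generator pipeline; the sensor feed and the "interesting" predicate here are made up for illustration:

```python
import random

def sensor_stream(n):
    """Stand-in for a raw sensor feed (hypothetical)."""
    for _ in range(n):
        yield random.gauss(20.0, 1.0)

def ingest(stream, interesting):
    """Filter at the source: uninteresting readings are dropped before
    anything is persisted, rather than stored and filtered later."""
    for reading in stream:
        if interesting(reading):
            yield reading

# Only keep the unusual events: readings far from the expected value.
kept = list(ingest(sensor_stream(10_000), lambda r: abs(r - 20.0) > 3.0))
print(f"kept {len(kept)} of 10000 readings")
```

The point is the same as the trigger discussion above: the predicate runs at ingestion, so the storage layer only ever sees the tiny fraction of events that matter.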


Biting the hand that feeds IT © 1998–2022