Isn't that like any project
Replace the words Hadoop and Spark with any project you've ever done?
Your attempt at putting Hadoop or Spark to work probably won't work, and you'll be partly to blame for thinking they are magic. That's the gist of a talk delivered by Gartner research director Nick Heudecker at the firm's Sydney Data & Analytics Summit 2017. Heudecker opened with the grim prediction that 70 per cent of Hadoop …
Not really. Hadoop in general and Spark in particular are massively over-hyped and misunderstood. People think they can just load all their data into HDFS and run their existing workloads on it, because the vendors tell them they can, but the vendors don't say that doing so is likely to be massively inefficient and error-prone (in both reliability and data quality). You need to do a huge amount of work to make Hadoop replace something existing, because it is unlike anything existing.
Contrast that with other major projects, an on-site CRM to Salesforce migration for example, where the target capabilities are relatively well known, are similar to existing business processes, have a stable release cycle and don't tend to break on new releases.
I am not convinced that running Hadoop/Spark in the cloud makes a great deal of sense in the general case, particularly when you are running on shared infrastructure.
1) Non-trivial parallel workloads are much more sensitive to latency, and in particular to latency spikes, which is one area where clouds running over shared infrastructure are at a significant disadvantage.
2) A lot of Hadoop/Spark apps operate on huge datasets (most apps I've worked with in recent years process multiple TB of data in a run), which you have to transfer into the cloud first... In some cases a run will generate a lot of data too (again, multiple TB), and you as a customer have to pay to transfer that data to and from the cloud datacenters, which also adds latency and cost over and above on-prem.
I don't know about other providers, but AWS generally charges for download but not upload. I use this fact (along with the pay-as-you-go charging on storage) to encourage more efficient ETL and reporting. If teams only store what they need to, and only spit out/download what they need to see as an end result, it's cheaper. This is in contrast to an on-prem EDW (for example), where some central project has bought/delivered the warehouse, and individual business projects don't care about efficiencies because they aren't paying for that big row of Teradata kit etc.
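To make that incentive concrete, here's a back-of-the-envelope sketch of the asymmetry. The per-GB rates and the `transfer_cost` helper are illustrative assumptions for this sketch, not current AWS list prices:

```python
# Rough illustration of asymmetric cloud transfer pricing.
# Rates below are assumptions for the sketch, not real AWS list prices.
INGRESS_PER_GB = 0.00   # uploads into the cloud are typically free
EGRESS_PER_GB = 0.09    # downloads out are charged per GB (illustrative)

def transfer_cost(upload_gb: float, download_gb: float) -> float:
    """Cost of moving data in and out of the cloud for one job run."""
    return upload_gb * INGRESS_PER_GB + download_gb * EGRESS_PER_GB

# A job that ingests 5 TB but only pulls back a 2 GB summary is far
# cheaper than one that downloads a multi-TB result set on-prem.
print(transfer_cost(5000, 2))      # summary only
print(transfer_cost(5000, 3000))   # full result set pulled back
```

The asymmetry is the whole point: storing and reporting only what you need is directly visible on the bill, which an on-prem EDW never makes visible to individual projects.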
...a sack full of spanners. Sure, it can do some amazing work, but it requires a lot of self-assembly, and you need to have the right sort of problem that the tools will actually fit.
If your data is big enough, and you have the means to act on the results, you can indeed deliver real value. I was lucky enough to work on a project that returned significant revenue (I offered to work on a percentage basis, but they seemed to think that was unreasonable... sigh).
On the other hand, if you've got the infrastructure there, and guys smart enough to bolt it all together (which seems to more or less be a prerequisite at the moment), you can roll your own stuff out of the wider 'big data' ecosystem that's likely to be a far better fit than trying to put in a screw with the Spark hammer. We replaced clunky Hadoop batch processing with Kafka + microservices + Cassandra and delivered the same results but in near real time and with much less reliance on 'magic coordination'. Despite the hype, Spark was not a good enough fit to justify mangling the problem just to be able to use it.
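The streaming-versus-batch trade described above boils down to maintaining a running aggregate as events arrive, rather than recomputing over the whole log on a schedule. A toy Python sketch of just that idea, with a `Queue` standing in for Kafka and a dict standing in for a Cassandra table (both are stand-ins, not the real clients):

```python
from queue import Queue

# Toy sketch of the streaming design point: keep a running aggregate per
# key so results are available as events arrive, instead of waiting for a
# nightly batch job to re-scan the full log.
def run_stream(events: Queue, totals: dict) -> None:
    while not events.empty():
        key, value = events.get()          # consume one event
        totals[key] = totals.get(key, 0) + value  # upsert the aggregate

events = Queue()
for e in [("page_a", 1), ("page_b", 3), ("page_a", 2)]:
    events.put(e)

totals = {}
run_stream(events, totals)
print(totals)  # same totals a batch job over the full log would produce
```

The aggregates converge to exactly what the batch job computed, just continuously, which is why the swap delivered "the same results but in near real time".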
"Hadoop, for example, is very good at doing extract, transform and load operations at speed, but its SQL-handling features aren't stellar. It also chokes on machine learning or other advanced analytics tasks because it is storage-centric."
Is it opposites day? Hadoop's got half a dozen world-class SQL engines specialised for different tasks, ranging from the solid-as-a-rock-but-slow-as-shit Hive all the way through to stonking fast analytical engines in Impala and Greenplum. Meanwhile on the ML side Spark is the definitive distributed processing framework, designed specifically for the iterative processing that dominates ML use cases.
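On the "storage-centric" point: ML training typically re-scans the same dataset once per iteration, which is exactly what Spark's in-memory persistence (`.cache()`) is built for and what disk-bound MapReduce does badly. A plain-Python sketch of the iteration pattern, fitting y = w·x by gradient descent (the toy data and learning rate are made up for illustration):

```python
# Why ML rewards an in-memory engine: the same dataset is scanned once per
# iteration. In Spark you'd call .cache() on the dataset to keep it in RAM;
# MapReduce re-reads it from HDFS on every pass.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy points on y = 2x

w = 0.0     # initial weight
lr = 0.01   # learning rate (illustrative)
for _ in range(200):             # each iteration scans the full dataset
    grad = sum(2 * x * (w * x - y) for x, y in data)
    w -= lr * grad

print(round(w, 3))  # converges towards the true slope, 2.0
```

Two hundred full scans of a multi-TB dataset is cheap if it sits in memory and brutal if every pass goes back to disk, which is the whole reason Spark displaced MapReduce for this class of workload.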
What absolute twaddle.
Hadoop chokes on SQL because it's not relational.
You talk about 'world class' SQL engines, which they don't have. World class was Informix XPS, which was supposedly folded into their IDS product. But you don't use a distributed database for OLTP.
Hint: Try holding a distributed lock.
Big Data isn't relational. Do you not remember Stonebraker's rant when Google released their papers? It was classic.
Spark is Spark. There are good things and bad things about Spark. And if you're in the Spark camp, that means you're not a big fan of Flink. (There are some of us who are neutral, work with both and understand the merits of both.)
Again, I have to post Anon. Too many people in this space know who I am, and some even know my alias that I use here.
SQL != OLTP. We're effectively in OLAP world here, not a transactional one, so what have locks got to do with anything in a world where you're working with terabytes of append-only data?
And no, I'm more than happy with Flink. I think it makes some short-sighted architectural choices (almost no one needs <10ms latency on streams), but it's got a wonderfully well thought out API.
I don't know which is worse. The story or the comments here.
As someone who has been working in Big Data for a long time... longer than most, I can tell you that on the one hand Heudecker is right that your projects are likely to fail. That's pretty much the one thing he did get right, and it doesn't take a rocket scientist to know why. But Heudecker doesn't know Spark or Hadoop. It's clear that he's never really worked in depth with either tool.
The reason most projects fail is that those implementing the design don't really understand Hadoop or Spark and still think in terms of relational modeling. There are too many people who claim to be experts, and even some who are open source Apache committers don't completely understand what they are doing.
Big Data is easy if you know what you're doing and enterprises hire competent staff.
Too many people take the week-long class on Hadoop, Spark, etc. and think that they know what they are doing.
Some here get it. You have to know how to use the tool in order to be successful with it. You can't be a 10 week wonder and expect to get things done right. You need to have developers who are 'old school'.
Posted Anon because I really know more than I can talk about.
Could not agree more - so many reasons for failure but articles like this just add foobar to the mix.
The main reasons I have seen for failure:
1.) Lack of qualified use cases:
a.) Big data is not GBs; it's 10s of TBs or more
b.) Lack of identified business pain (what are we trying to solve)
c.) Lack of metrics (what are we trying to accomplish and how do we measure it)
2.) Expecting "magic" DW offload... ignoring the fact that existing DW's have decades of procedural work to analyse and identify candidates to migrate.
3.) Lack of an understanding that an offload will likely not entail moving everything across - it's usually a cost rationalisation exercise
4.) SIs dropping the ball. I'm sorry, but I've seen this far too often: recently an SI hired a Hadoop administrator who knew neither Linux nor Hadoop... and also wasn't prepared to send him on any training. A good SI would tick all three boxes.
5.) Ignoring that existing DW's had their own wobble to maturity
6.) A last one: SI's bringing their own pseudo Hadoop technologies into the mix... watering down many of the value props Hadoop (and associated technologies) bring.
The counterpoint to this article is that I have seen some very, very successful customers, but ironically they will often be customers who don't want to advertise their success.
Also the AWS comment... I'm stunned:
1.) AWS is good for transient use cases. Period. Spin up, read S3, do work, write S3, spin down. Long-running clusters aren't suited to it due to the lack of any kind of upgrade path for EMR clusters.
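That spin-up/run/tear-down pattern maps onto EMR's `--auto-terminate` flag. A hedged sketch of the shape of the call; the cluster sizing, release label, bucket paths and jar name are all illustrative placeholders, not recommendations:

```shell
# Transient EMR cluster: run one Spark step, then tear itself down.
# All S3 paths, names and versions below are illustrative placeholders.
aws emr create-cluster \
  --name "nightly-etl" \
  --release-label emr-5.36.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-type m5.xlarge --instance-count 3 \
  --log-uri s3://my-log-bucket/emr/ \
  --steps Type=Spark,Name=etl,ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,com.example.ETL,s3://my-jar-bucket/etl.jar] \
  --auto-terminate
```

Because nothing outlives the run, there's no cluster to upgrade or patch, which neatly sidesteps the EMR upgrade-path complaint above.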
2.) AWS may provide the "latest" products, but doesn't perform certifications with existing BI tools, which is how most BI teams will integrate with Hadoop
3.) AWS has zero committers on either Hadoop or Spark. Good luck getting fixes in. They ride on the efforts of the major Hadoop distros.
4.) Many clients deploy their own Hadoop on AWS, on plain EC2, precisely for these reasons.
5.) S3 may be a much better cost alternative to instance storage, but sheesh will it be slow as dogs' balls. Have seen clients do an about-face when this materialises. Back to cost vs value.
6.) AWS dedicated Hadoop support... good luck with that.
In summary: this article is terrible. It confuses the issue of why Hadoop fails and gives people a fluff direction to look at AWS without any real meat behind it except "they baseline quicker", without discussing why this might not be a good idea.
Also posting anonymously because this industry is small.
The Hadoop ecosystem, especially MapReduce and now Spark, has always had an Achilles heel: it has consistently ignored operational aspects (https://goo.gl/QGpRWe). TCO (cost of ownership) and TTV (time to value) have been impacted. Too many failures; add to that the propensity to try to do everything for everybody. The SQL issues are a consequence of that.
Demanding high competency is effectively admitting failure. It has been over 10 years. Big data is neither operationalized nor productized today. Cloud helps somewhat, but the software has to hold up on its own. Big data is hard as it is; the last thing it needs is over-promise and under-delivery.
At datatorrent.com, we are very operationally focused, and have baked a lot of these ideas into Apex. Do try it out, especially for ingestion and ETL.