The worry is that some idiotic companies in the IT sector are pushing BigData and MapReduce for enterprise applications such as payroll or HR.
BigData and MapReduce are perfect for search engines but useless for applications such as CRM and ERP.
You may not realize it, but data is far and away the most critical element in any computer system. Data is all-important. It’s the center of the universe. A managing director at JPMorgan Chase was quoted as calling data “the lifeblood of the company.” A major tech conference held recently (with data as its primary focus) …
"If a company that’s been around for years suddenly argues that it needs Big Data techniques to run its business, it must mean that either [...] or it's been hobbling along forever with systems that don’t quite work. Either of those claims would be hard to believe."
The second is all too believable, and is keeping me in a job right now ...
It isn't what you've got, it's how you use it.
I talked to a recruitment consultant a while ago who pointed out that all the recruitment companies have gone "big data." That is, they do word frequency analysis on CVs and just search on a big pile of "stuff" and take the top CVs on the list.
So now you have to keep repeating keywords, add abbreviations in brackets and that sort of thing to make sure your CV ends up on page 1 of the search results.
They have replaced personal knowledge and relationships with a technical solution which will inevitably lead to poorer quality but greater quantity of words in people's CVs. I'd be surprised if people weren't already using white-on-white text to bump their CV's visibility to the search engine.
By destructuring the data they've increased their storage costs, removed information from the system, and now they have to keep tweaking the systems to stop them being gamed. Sending a slightly irrelevant advert to someone is one thing, but making business decisions about personnel suitability based on this stuff is dangerous. The reason we have structure is that it organises data into easily understood information. A word-cloud from a comment box might be fine for an initial analysis of what people are talking about, but it doesn't tell you what they are saying - the data is there, the information has been removed.
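For what it's worth, a minimal sketch of the kind of keyword-frequency ranking being described (hypothetical Python, not any recruiter's actual system) shows exactly why keyword stuffing works:

    # Naive keyword-frequency scoring of CVs -- a sketch of the approach
    # described above, not any vendor's real system.
    import re
    from collections import Counter

    def score(cv_text, keywords):
        # Tokenise crudely and count how often each search keyword appears.
        words = Counter(re.findall(r"[a-z0-9+#]+", cv_text.lower()))
        return sum(words[k.lower()] for k in keywords)

    cvs = {
        "candidate_a": "Java developer. Java, Java (J2EE), more Java.",
        "candidate_b": "Led a team building large-scale Java services for ten years.",
    }

    ranked = sorted(cvs, key=lambda name: score(cvs[name], ["java"]), reverse=True)
    print(ranked)  # candidate_a comes out on top simply by repeating the keyword

White-on-white text, abbreviations in brackets and plain repetition all inflate that count without adding any information.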
"Is it true that this kind of data can’t be managed easily, quickly and without painful pre-processing using a relational database, the designated dinosaur of the Big Data crowd? Possibly."
In practice the last thing you want is a relational database.
Single record per item (aka Model 204) is much easier to work with and much faster. If data comes from multiple sources, it won't be consistent and almost any field chosen for an index won't be present for some of the data.
If you have to fix that first in order to load into a relational database, I'll be done before you get started.
Yes, it takes more space, but disk is very cheap.
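To illustrate the single-record-per-item point (a rough sketch with made-up field names, in Python rather than Model 204): each record carries whatever fields its source happened to supply, and queries simply skip what's missing, instead of cleaning everything up front to fit a fixed schema.

    # One self-describing record per item, pulled from inconsistent sources.
    # Fields a relational schema would insist on may simply be absent.
    records = [
        {"id": 1, "name": "Acme",    "country": "UK", "revenue": 1200000},
        {"id": 2, "name": "Globex",  "revenue": 950000},      # no country field
        {"id": 3, "name": "Initech", "country": "US"},        # no revenue field
    ]

    # Query without any pre-cleaning: records lacking a field are just skipped.
    uk_records = [r for r in records if r.get("country") == "UK"]
    total_revenue = sum(r.get("revenue", 0) for r in records)

    print(uk_records, total_revenue)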
Personally I prefer awk and C to Python and C++, but to each his own.
Vendors such as Oracle have had tools for years to suck in raw data. At a place I worked, a research group used a big distributed relational database to do their preliminary data selection, then went to town on the resulting small data set of 50GB or so. Other groups used locally brewed shell scripts to extract data from arbitrary logs and load it into another distributed relational database. One of my tasks required pulling all the data from a relational database and reformatting it to a new standard. I did it with the unix shell itself rather than the usual shell tools; I could have used awk, but it would have been a bit slower. I'd suggest that choosing data-munging tools comes down to what the data is, where it comes from, and whatever the coder thinks is appropriate.
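That sort of dump-and-reformat job needs nothing exotic either; here is a rough sketch with hypothetical column names (Python standing in for the shell version):

    # Reformat a flat dump exported from a relational database into a new
    # layout. Column names here are made up for illustration.
    import csv

    with open("dump.csv", newline="") as src, \
         open("reformatted.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["account", "date", "amount_pence"])
        writer.writeheader()
        for row in reader:
            writer.writerow({
                "account": row["acct_no"].strip(),
                "date": row["txn_date"],                 # assumed already ISO format
                "amount_pence": int(round(float(row["amount"]) * 100)),
            })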
As for large data mining and speed, check out how many databases pass the WalMart Test.
Given the increasing availability and decreasing price of CPU, RAM and disk, what is big these days anyway?
Bless you! I also work for a "Big Data Will Change Everything" company.
I told my boss I was ready for it. As soon as he can tell me why I need to analyze 400,000,000 'duck face' photo posts on Facebook in order to better calculate Order Cycle Time ... I'll get right on it.
Oddly enough he hasn't gotten back to me yet!
"One article I read recently on the subject of textual analysis (looking for patterns across all books by a given author, for example) gave the impression that this was something that was never done before, because it couldn’t be."
This is an interesting remark, and there is some truth in it as far as my own experience goes. It often has more to do with the never-ending need for buzz and "innovation" bullsh~ bingo than with anything else. But there is still one major development: web-enabled "pipelining" and processing of texts pulled from various resources at will, with cloud services creating the ability to chain tools together and make selections from a staggering amount of open data.
At the level of algorithms there might be less innovation happening and yet new possibilities and demands might still arise because of new types of questions being asked combined with the greater computing power available to the average online researcher. But this is indeed not a typical "big data" issue at all, and is perhaps more similar to developments in online data availability in general. That all definitely sounds rather less exciting to the people who decide on the funding, though.
John, the obvious question: why? Young Zactly is right. Big Data is a fashion/buzzword. Handling petabytes of data is being done where there is a definable problem, fine. But across the web? Across a research WAN, probably. If you mean using APIs accessible to browsers, fine, but not new. Text analysis has been going on for millennia. I vaguely remember articles on computerised text analysis from the 1970s.
"At the level of algorithms there might be less innovation happening and yet new possibilities and demands might still arise because of new types of questions being asked combined with the greater computing power available to the average online researcher."
There isn't "less innovation happening" in textual-analysis algorithms. There's a huge amount of research being done in this area, and there has been for decades, and the state of the art has moved forward tremendously since the 1970s.
Really, why do people post stuff like this (or make claims about it in articles) without doing even the most trivial research?
And yes, greater computing resources have also made a large difference. Most notably, it's now possible, even convenient, to process large corpora on personal computers. (The straightforward, largely un-optimized English-text parser I threw together in Java a couple of years ago averaged about a thousand sentences a second, on the WSJ corpus, on one of my laptops.) That's opened this sort of work up to many more researchers - they don't have to compete for time on large computer installations, they have convenient access to the data, etc. But the algorithms have most certainly been advancing as well.
While Mandl makes some good points in the article, this particular bit was naive and misleading.
Every client I've ever worked for has used one or both of the phrases "We handle a huge amount of data" and "I bet you've never seen it this bad". Almost without exception they're processing a very normal amount of data, sometimes in very inefficient ways. The science behind MapReduce is far more important than that specific technology - often there are equally useful techniques that are better suited to a client's needs, however much they might want to install Hadoop etc.
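To make that concrete: the map/reduce idea is just a pattern, and for the "very normal amount of data" most clients actually have, it fits in a few lines of ordinary Python with no cluster in sight (word counting used here purely as the stock example):

    # The map/reduce pattern in-process, without Hadoop. The same shape scales
    # out to a cluster only when the data genuinely demands it.
    from collections import defaultdict
    from itertools import chain

    documents = ["big data is big", "data about data"]

    # map: emit (word, 1) pairs from each document
    mapped = chain.from_iterable(((w, 1) for w in doc.split()) for doc in documents)

    # shuffle + reduce: group by key and sum the counts
    counts = defaultdict(int)
    for word, n in mapped:
        counts[word] += n

    print(dict(counts))  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}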
Remember when a business would plan and cost its infrastructure for storing data, and compare that cost to the value of the data? Because to do that you'd have had to submit a plan of what you were going to do with that data BEFORE anyone could blow the cash on a petabyte silo.
You'd think in a time of recession, business cases would be produced and perused with logic and rigour. But no, let's ritually slaughter a few fatted calves to the God of buzzwords and base our business strategy on the lay of the entrails. Now do those entrails look a bit 'horsey' to you..?