Big data...
...is apparently defined by volume (how much), variety (what types) and velocity (how fast), or some combination of all three.
The term is in vogue because the likes of Google, Yahoo and Facebook have introduced the world to new analytic paradigms based on the MapReduce framework, open-source software (Linux, Hadoop etc.), commodity hardware and the notion of 'NoSQL'...and also because the IT industry needs its buzzwords du jour. At the moment it's the turn of 'big data' and 'cloud'.
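For anyone who hasn't met MapReduce, a toy word count in Python shows the shape of the paradigm - this is my own illustrative sketch, not Hadoop's actual API:

from collections import defaultdict

def map_phase(docs):
    # Each mapper emits (word, 1) pairs independently - trivially parallel.
    for doc in docs:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    # The framework groups pairs by key; each reducer sums its group.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(reduce_phase(map_phase(["big data big hype", "big iron"])))
# {'big': 3, 'data': 1, 'hype': 1, 'iron': 1}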
In theory, 'big data' as done by the likes of Google is all about unstructured data. In reality, there's a lot of structured data still out there, and I'd argue that all data has some structure anyway, so 'semi-structured' may be a better term.
eBay has a multi-petabyte, 256-node Teradata system chock-full of structured data, in addition to a large Hadoop stack for web analytics, so there's clearly life in the old structured dog yet.
There's nothing new in 'doing analytics' - a lot of companies have regarded analytics as a competitive differentiator for a long time. There are companies out there, even in the lil' ol' UK, that have been using Teradata, which only does analytics, since the 1980s. I started my career at one of them.
For the typical mid-market company, if there is such a thing, all we ever tend to see is SQL Server on top of SAN/NAS. It's cheap, feature-rich, easy to tame and works OK until data volumes grow beyond a few hundred GB or so. The pain threshold obviously depends on the hardware, DBA/developer skill, schema and application complexity.
All SMP-based databases suffer the same scaling issues, hence Microsoft's attempt to build an MPP version of SQL Server (Madison/PDW), Oracle's Exadata and HP's NeoView.
IBM in the BI mid-market is not something we see very often. Netezza Skimmer has never been sold as a production system, as far as I know - IBM's own website describes it as for 'test and development'. A proprietary IBM blade-based system running Postgres on Linux is hardly a good fit for the Windows/SQL Server/SAN/NAS/COTS hardware crowd.
Having said that, we did deploy a pre-IBM Netezza system as far back as 2003 for a small telco with only 100,000 customers, but they did have several billion rows of data and complex queries to support.
@Wonderbar1 - Teradata's competitive advantage rests on several capabilities...performance, scalability, resilience, functionality, maturity, support, third-party tool integration (e.g. in-database SAS), ease of use, applications and data models, to name a few. It's a true 'full service' offering.
Teradata is the only database built from day one (in the 1980s) to support parallel query execution using an MPP architecture across an arbitrary number of SMP nodes, all acting in tandem as a single coherent system. That is very, very hard to do - ask Microsoft, Oracle, HP or IBM.
Overall, Teradata 'just works'. All those big-name users can't be wrong.
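To make the shared-nothing idea concrete, here's a minimal Python sketch: rows are spread across nodes by hashing a key, so a full-table scan splits into N independent scans that all run at once. This is my simplification of the principle, not Teradata's actual hashing algorithm:

import hashlib

NODES = 20  # SMP nodes in the hypothetical system

def node_for(key):
    # Hash the distribution key to pick a node; every node applies the
    # same function, so any row can be located without a master node.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NODES

shards = {n: 0 for n in range(NODES)}
for customer_id in range(100_000):
    shards[node_for(customer_id)] += 1

# Rows land evenly, so each node scans roughly 1/20th of the table.
print(min(shards.values()), max(shards.values()))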
The Teradata secret sauce, for me, is the scalable 'bynet' inter-node interconnect. It handles data shipping between SMP nodes in support of join/aggregation/sort processing, it's resilient, it 'just works', and it also performs merge processing for final results preparation.
Other MPP systems typically rely on a dumb bit-pipe interconnect whose bandwidth doesn't scale as nodes are added. Even worse, those that ship intermediate results to a single node for final aggregation/sort/merge processing can hardly claim to be linearly scalable. Some Exadata clusters run tens of TBs of RAM on the master node to address this issue.
Teradata's bynet has processing capability of its own, so final merge operations execute in parallel within the interconnect fabric without landing intermediate results in any single place for collation. Cool, eh?
See here for more info: http://it.toolbox.com/wiki/index.php/BYNET
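To illustrate the difference, here's a toy Python merge tree - sorted per-node results combined pairwise, so no single node ever holds all the intermediate results. The real bynet does this in the interconnect fabric itself; this is just my sketch of the principle:

from heapq import merge

# Each node returns its slice of the answer already sorted locally.
node_results = [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]

def tree_merge(streams):
    # Merge streams pairwise at each level, like stages in a merge tree.
    while len(streams) > 1:
        paired = []
        for i in range(0, len(streams), 2):
            rest = streams[i + 1] if i + 1 < len(streams) else []
            paired.append(list(merge(streams[i], rest)))
        streams = paired
    return streams[0]

print(tree_merge(node_results))  # [1, 2, 3, ..., 12]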
Teradata consists of OEM'd Dell servers running SUSE Linux and dedicated storage from LSI or EMC. Teradata was historically regarded, quite rightly, as 'reassuringly expensive', but the launch of the new line of Teradata 'appliances' a few years ago has made it price-competitive with the likes of Netezza, eroding Netezza's disruptive pricing model. Competition is a healthy thing, etc.
Appliance adoption has been a key feature of Teradata's strong performance over the last few years, as reported several times on El Reg.
Have you ever run an Oracle query across a 20-node system with hundreds of virtual processors all working together? I did the equivalent on Teradata a few minutes ago - a 250m-row count(*) in under 1 second, with no caching, no metadata, no indexes, no tuning, no partitions and no concern for what else was running.
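That speed isn't magic - a count(*) decomposes perfectly across a shared-nothing system: each node counts its local shard, and only one integer per node crosses the interconnect. A toy Python illustration of the principle (not Teradata's internals):

# 20 nodes, 12.5m rows each - a stand-in for the 250m-row table.
shards = [range(12_500_000)] * 20

# In reality all 20 counts run simultaneously, one per node; only the
# 20 partial counts are shipped for the final sum.
partials = [len(shard) for shard in shards]
print(sum(partials))  # 250000000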
I can't remember when I last submitted a query to Teradata that either didn't finish or caused the system to barf. That happens a lot on Oracle/SQL Server.
The last project I worked on was a 20TB Teradata system supporting a very wide range of applications, including real-time loading of web data and several tables of over a billion rows each. Total downtime for the year, including planned maintenance, is measured in single-digit hours.
“But I could do all that with X, Y and Z”, we often hear. Off you go then. If you can get it to work, and that’s a big ‘if’, your boss won’t bet the farm on it. That’s another reason the likes of Teradata win business – it’s a safe bet for the decision makers.
Back to work…