AWS CEO Adam Selipsky promises 'Zero ETL' future in re:Invent keynote

AWS CEO Adam Selipsky promised a "zero ETL future" in his re:Invent keynote on Tuesday in Las Vegas, introducing new integration between the Redshift data warehouse service and the Aurora relational database service. Zero-ETL, now in preview, lets users link Aurora data from multiple clusters to a Redshift instance, to achieve …

  1. JassMan
    Trollface

    A point of philosophy

    ...causing AWS watcher and cloud economist Corey Quinn to declare that "to say I'm disappointed by this keynote is a significant understatement."

    That's the problem of living in a society where everything is overhyped. Far better to live in a country full of whelm. Being underwhelmed is much better for your soul than being disappointed.

  2. elsergiovolador Silver badge

    Pointless

    Towards the end of today's keynote Selipsky referenced the "just walk out" technology which lets shoppers walk into a grocery store and walk out with their purchases without the inconvenience of checkout tills and how a palm recognition service called Amazon One is reducing the need for standing in line.

    Reminds me of some inventions to reduce tape noise and increase dynamic range when CD was just around the corner making these solutions obsolete shortly after they came to market.

    Or when a mate of mine spent two years developing an app to remind you of shows that are going to air on telly, so you could organise your time to watch them. Then Netflix came...

    That being said, given that AWS is a WEF partner, this technology will be crucial to implement e.g. zoned cities where only people with certain social credit score can enter and they will be recognised even if they cut out their chip.

    1. Anonymous Coward

      Re: Pointless

      What new tech do you have in mind which would obsolete the "just walk out" tech?

      Is it open source products by Amazon that consumers could carry out with no need to calculate payment, so that customers could then resell them for a profit?

      1. elsergiovolador Silver badge

        Re: Pointless

        Most likely you will have an app and the products will be coming to you, so there will be no need to go to the shop at all.

        Alternatively you'll have some form of click and collect, where you will only be coming to pick up stuff already collected for you.

        That direction is in line with Great Reset as the engagement will be fully traceable. In the future government will be able to e.g. block you from buying sugary products or alcohol and these apps will either have allowlists or SKUs will be matched against restrictions on your CBDC.

    2. jmch Silver badge
      Facepalm

      Re: Pointless

      "Towards the end of today's keynote Selipsky referenced the "just walk out" technology which lets shoppers walk into a grocery store and walk out with their purchases without the inconvenience of checkout tills"

      I can't remember which UK supermarket it was, but I'm pretty sure one of them was trialling RFID chips in everything linked to a customer card that allowed exactly this - that was at least 15 years ago!

  3. This post has been deleted by its author

  4. elregidente

    I am an Amazon Redshift specialist, and I have Views about all this.

    Bona fides (and a bit of self-publicity) : I maintain a web-site, where I publish white papers of investigations into Redshift, and maintain ongoing monitoring of Redshift across regions; https://www.amazonredshiftresearchproject.org

    I may be wrong, but I think I know more about Redshift than anyone outside the RS dev teams. I've spent the last five years investigating Redshift, full-time.

    Redshift is basically a vehicle for sorting, which is to say, for having sorted tables, rather than unsorted tables.

    It is this method, sorting, which when and only when correctly used, allows timely SQL on Big Data.

    You also get the cluster (as opposed to a single node), but that's a secondary method for improving performance - it doesn't do anywhere near as much as correctly operated sorting, there are sharp limits to the cluster size, a few behaviours actually slow down as cluster size grows (commits, for example), and it costs a lot of money.

    There are two key problems with sorting.

    First, it makes data design challenging. When you define your tables, you also define their sorting order, and only queries which are appropriate to the sorting orders you have will execute in a timely manner; so when you make your data design, and so pick your sorting orders, you are defining the set of queries which can execute in a timely manner and also the set which *cannot*.

    Your job is to make it so all the existing queries, and the known near-future queries, and the general medium term future queries, are going to be in the set of queries which *can* execute in a timely manner. (In the end, after enough time, there will be enough change that your data design must be re-worked.)

    This issue, getting the data design right, is a *complete* kicker. It's usually challenging, and it's an art, not a science, and - critically - it's not enough for the *devs* to know how to get this right. Once the design has been made, it must also be *queried correctly*, which means the *USERS* also have to know all about sorting and how to operate it correctly; if they issue queries which are inappropriate to the sorting orders in the data design, pow, it's game over - you are *not* going to get timely SQL, and the cluster will grind to a halt.
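The link between sort order and "which queries can be timely" can be illustrated with a toy model. This is only a sketch, not Redshift's implementation: it assumes rows are packed into fixed-size blocks in sort-key order, that each block carries min/max values per column, and that a query can skip any block whose range excludes the value it wants. The column names `event_day` and `user_id`, the block size, and the data are all made up for illustration.

```python
from dataclasses import dataclass

BLOCK_ROWS = 1000  # hypothetical rows per block, chosen only for the demo


@dataclass
class Block:
    rows: list      # (event_day, user_id) tuples
    min_day: int
    max_day: int
    min_user: int
    max_user: int


def build_blocks(rows):
    """Sort rows on event_day (the assumed sort key), pack them into blocks,
    and record per-block min/max for both columns."""
    ordered = sorted(rows, key=lambda r: r[0])
    blocks = []
    for i in range(0, len(ordered), BLOCK_ROWS):
        chunk = ordered[i:i + BLOCK_ROWS]
        days = [r[0] for r in chunk]
        users = [r[1] for r in chunk]
        blocks.append(Block(chunk, min(days), max(days), min(users), max(users)))
    return blocks


def blocks_read_for_day(blocks, day):
    """Blocks that cannot be skipped when filtering on the sort key."""
    return sum(1 for b in blocks if b.min_day <= day <= b.max_day)


def blocks_read_for_user(blocks, user):
    """Blocks that cannot be skipped when filtering on a non-sort column."""
    return sum(1 for b in blocks if b.min_user <= user <= b.max_user)


# 100,000 synthetic rows: 100 days, user ids scattered across every day.
rows = [(i % 100, (i * 7919) % 100_000) for i in range(100_000)]
blocks = build_blocks(rows)

print("total blocks:", len(blocks))
print("blocks read, filter on event_day:", blocks_read_for_day(blocks, 42))
print("blocks read, filter on user_id:", blocks_read_for_user(blocks, 50_000))
```

In this toy data the filter on the sort key touches a single block, while the filter on the other column cannot rule out any block, because every block spans nearly the full range of user ids. The same table is fast for one query and a full scan for the other, which is the sense in which choosing the sort order chooses the set of timely queries.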

    So Redshift is a knowledge-intensive database, for both the devs and the users; it's not enough to know SQL. You need to know SQL, and Redshift, and that's problematic, because AWS to my eye publish no meaningful information about Redshift.

    Because operating sorting correctly imposes a range of constraints and restrictions upon its use, Redshift is a quite narrow use-case database; it is NOT, absolutely not, a general purpose database, in any way, shape or form.

    The second problem is VACUUM; which is to say, data in Redshift is either sorted, or unsorted. New data almost always is unsorted, and it has to be sorted, by the VACUUM command. However, you can only have *one* VACUUM command running at a time, *PER CLUSTER*. Not per table, not per database, but *per cluster*. So you have a budget of 24 hours of VACUUM time per day; that's it.

    Redshift - like all sorted databases - faces a producer-consumer scenario. New incoming data is producing unsorted blocks (all data in RS is stored in 1 MB blocks - it's the atomic unit of disk I/O); VACUUM consumes them. When the rate at which new unsorted blocks are produced exceeds the rate at which those blocks are consumed, it's game over. Your cluster will then degenerate into an unsorted state, which is to say, sorting will be being operated incorrectly, and Redshift operated incorrectly is *always* the wrong choice - there are better choices in that scenario.
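The producer-consumer argument is simple arithmetic, and a small simulation makes the tipping point visible. This is a sketch under stated assumptions: the function name and both rates are hypothetical, and the single per-cluster VACUUM is modelled as one fixed hourly ceiling on how many unsorted blocks can be sorted.

```python
def unsorted_backlog(produced_per_hour, vacuumed_per_hour, hours):
    """Return the number of unsorted blocks outstanding after each hour.

    Both rates are made-up inputs. Because only one VACUUM runs at a time
    per cluster, vacuumed_per_hour acts as a cluster-wide ceiling.
    """
    backlog = 0
    history = []
    for _ in range(hours):
        # New unsorted blocks arrive; VACUUM sorts what it can, never below zero.
        backlog = max(0, backlog + produced_per_hour - vacuumed_per_hour)
        history.append(backlog)
    return history


keeping_up = unsorted_backlog(produced_per_hour=800, vacuumed_per_hour=1000, hours=24)
falling_behind = unsorted_backlog(produced_per_hour=1200, vacuumed_per_hour=1000, hours=24)

print("backlog after 24h, producing 800/h:", keeping_up[-1])
print("backlog after 24h, producing 1200/h:", falling_behind[-1])
```

With production below the VACUUM ceiling the backlog stays at zero; once production exceeds it, the backlog grows every hour and never recovers without reducing the load or adding capacity.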

    I am quite sure this new real-time data-feed will produce unsorted blocks, and I am certain it will be gratuitously used by uninformed end-users (which is all of them, as AWS to my eye publish no meaningful information about Redshift at all), and it will I suspect consume a significant part of the cluster's capacity to consume unsorted blocks.

    There's no free lunch here.

    Redshift for the last however many years to my eye has had almost entirely *non*-Big-Data capable functionality added to it. I suspect this is more of the same.

    I would add, as a warning, I consider AWS, as far as Redshift is concerned, to have a culture of secrecy, and to relentlessly hype Redshift, and to deliberately obfuscate *all* weaknesses. I consider the docs worthless - you read them and come out the other end with no clue what Redshift is for - and that the TAMs say "yes" to everything you ask them. Finally, I think RS Support are terrible; I think they have a lot of facts, but no *understanding*. My experiences with them, and the experiences I hear from other admins, are of just the most superficial responses and obvious lack of technical comprehension - but clients who are not aware of this are misled by the belief that they are talking to people who know what they're doing (and given how much they cost for enterprise support, they ought to be).

    The upshot of all this is that I see a lot of companies moving to Snowflake. AWS have only themselves to blame. In my view, AWS need to publish meaningful documentation, so clients *can* learn how to use Redshift correctly, and then have Redshift only used by people who actually have use cases which are valid for Redshift, and move all other users to more appropriate database types (Postgres or clustered Postgres, or clustered unsorted row-store, such as Exasol, which is a product AWS do not offer).

    1. Charlie Clark Silver badge

      Re: I am an Amazon Redshift specialist, and I have Views about all this.

      Thanks for the details. Please forgive me if I'm way off but this sounds like this is a denormalised data store optimised for a particular set of queries? ie. one of the things RDBMS were designed specifically to avoid because of the lack of flexibility and problems associated with implementation-specific queries?

      1. 0x80004005

        Re: I am an Amazon Redshift specialist, and I have Views about all this.

        Grandparent is an excellent and informative post.

        So many of these whizz-bang products were invented because the people who make them don't know how to use a relational database properly.

        How does this feature set sound:

        - The same data, sliced, diced, sorted, denormalised in multiple different ways on disk

        - Immediate update of all of these slices upon insert/update/delete of source data; no need to rebuild from zero

        - Store each slice of data on a different storage array if you wish

        - Bulk amendment of multiple records will only recalculate changed results

        And... I'm describing SQL Server indexed views, partitions and the MERGE statement, features which have been there for many many years.

        Admittedly these were restricted to the Enterprise Edition a while back, but that's gone now and you can do this on any version, even Express.

        1. elregidente

          Re: I am an Amazon Redshift specialist, and I have Views about all this.

          Thank you! You're very kind to say so :-)

          Regarding your observations about SQL Server, the problem is indexes do not scale to Big Data. Too much disk I/O. You can't load Big Data in a timely manner while updating indexes. Sorry =-) no simple way out.

          There are five basic types of database, and they are orthogonal. When you come to design a database, you're faced with a series of choices, where you must choose one option or the other - you necessarily cannot have both (can't be short and tall at the same time, as it were).

          So there's an almost infinite number of possible types of database - but in practice, there are five, because those five are orthogonal and each is the option set which is better than every other option set, except for the other four; so you have key-value, you have map-reduce, you have relational (which has four sub-types, sorted/unsorted, row-store/column-store, but one of them makes no sense (sorted row-store), so no one has ever made it), giving a total of five.

          The options are in the end ultimately defined by the properties of computer hardware. Processors are fast, memory is slow, disk seeks are mind-numbingly slow.

          So coming back to indexes, if you want to handle Big Data with SQL, you can't have indexes, because indexes do too much disk seeking; you're restricted to sequential disk I/O, which means sorting, and all the constraints and restrictions it brings.

      2. elregidente

        Re: I am an Amazon Redshift specialist, and I have Views about all this.

        In a sense, yes, but I would say there is more flexibility than *that*, and also you're getting SQL, which has a lot of functionality, and provides strong typing.

        The more skilled the data designer, and the more fortunate they are with the data they must design for, the wider the range of queries the data design can handle.

        I usually find I can come up with something nice - and it's worth it because of the staggering efficiency of sorting.

        Writing this now I've realized I missed out one important matter, in the original post; so, with unsorted relational, basically speaking, the time taken to retrieve a row from a table depends on the number of rows in the table.

        So as you increase the number of rows, even though you're still only taking *a* row, the time taken to get that row becomes longer and longer.

        This is why unsorted doesn't scale for Big Data.

        Sorting, when and only when operated correctly, provides the property that the time taken to retrieve a row is *independent* of the number of rows in the table.

        This is why sorted databases (when correctly operated) *do* scale for Big Data.

        To put it more broadly, and in terms of queries, with an unsorted database, the time taken by the query depends on the size of the tables; with a sorted database, the time taken depends on the number of rows the query needs to read to do the work it has to do.

        So a big fat query which has to read every row is going to run at the same speed on unsorted as sorted; but a nice slim query, which reads say 100 rows, will run slowly on unsorted, but very, very quickly on sorted, because on sorted that query really will read only the 100 rows.

        (I'm speaking broadly here - there are technical details to consider - but this is the essence of the difference, and is correct and truthful for reasoning about these systems and comparing them.)
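The difference between "time depends on table size" and "time depends on rows the query needs" can be sketched by counting rows examined for the same slim query against tables of different sizes. This is an illustrative model only, with assumptions: a plain Python list stands in for a table, an unsorted table is scanned in full, and a sorted table is entered with a binary search (the `bisect` module) so that only the contiguous run of matching rows is read. The small cost of locating the start of that run is ignored here.

```python
import random
from bisect import bisect_left, bisect_right


def rows_read_unsorted(table, lo, hi):
    """With no ordering, every row must be examined to find the matches."""
    examined = 0
    matches = 0
    for key in table:
        examined += 1
        if lo <= key <= hi:
            matches += 1
    return examined, matches


def rows_read_sorted(table, lo, hi):
    """With the table sorted on the key, locate the first match and read
    only the contiguous run of matching rows."""
    start = bisect_left(table, lo)
    end = bisect_right(table, hi)
    return end - start, end - start


results = {}
for n in (10_000, 100_000):
    sorted_table = list(range(n))
    unsorted_table = sorted_table[:]
    random.Random(0).shuffle(unsorted_table)  # fixed seed, arbitrary order
    results[n] = (
        rows_read_unsorted(unsorted_table, 5_000, 5_099)[0],
        rows_read_sorted(sorted_table, 5_000, 5_099)[0],
    )
    print(f"{n} rows: unsorted examined {results[n][0]}, sorted examined {results[n][1]}")
```

The query wants the same 100 rows in both cases. The unsorted scan examines ten times as many rows when the table is ten times larger, while the sorted read stays at 100.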

    2. jmch Silver badge
      Boffin

      Re: I am an Amazon Redshift specialist, and I have Views about all this.

      Very interesting detailed breakdown, thanks.

      More generally, the problem you describe also applies to other database / data warehouse architectures. The two main points of having a DW are to avoid querying an operational database multiple times for the same data, and to give people querying the data a more intuitive view of it than the one that is optimised for the operational application.

      Strictly speaking one could do away with ETL completely if one were willing to take the (pretty large) performance hit (both on the operational and reporting sides), but you would simply need to replace an ETL+reporting team with far larger operations and reporting teams - so you haven't really gained anything. Real-time analytic reporting, whatever anyone in the business might tell you, is still a tiny requirement dreamed up by some middle manager. Board-level Execs just want to see monthly or quarterly figures, most everyone else is extremely happy to get next-day daily data as long as it's accurate and clearly presented (which is already enough of a huge challenge.)

      1. elregidente

        Re: I am an Amazon Redshift specialist, and I have Views about all this.

        My pleasure.

        Regarding your observation about ETL : I remember from a long time ago the comment that almost all computing is an exercise in caching :-)

    3. isme

      Re: I am an Amazon Redshift specialist, and I have Views about all this.

      Just when I was thinking of cancelling my subscription (Dabbsy etc.) you share your Redshift research and make my day.

      It's posts like these that make me come back to read the Register.

      Thanks.

      Also: "Zero ETL"? Zero dirty data? Zero mismatch between operational and analytical schemas? There's a first time for everything I guess.

      1. elregidente

        Re: I am an Amazon Redshift specialist, and I have Views about all this.

        Thank you so much!

        It's a pleasure to share the knowledge :-)

    4. Joseba4242

      Re: I am an Amazon Redshift specialist, and I have Views about all this.

      Thank you very much, great information that you don't find in many other places.

      How do you compare this with BigQuery?

  5. xyz Silver badge

    Hoo boy...

    The missive brokers are out today...

    Short answer is bloke from aws says buy into our monolith of (micro) bits or stick with your monolith of not (micro) bits.

    I doubt there is anyone on here (apart from the ex aws perp) who needs to think on the aws scale, so don't let some aws floggo honcho bully you into insecurity.

    IT is basically a set of tech fashion trends, so be brave and trust your instincts and needs, and not what some invested suit tries to sell you.

    I'll shut up now.
