The Great Graph Database Debate: Relational can't do everything

Welcome back to the latest Register Debate in which writers discuss technology topics, and you the reader choose the winning argument. The format is simple: we propose a motion, the arguments for the motion run on Monday and Wednesday, and the arguments against run today and Thursday. During the week you can cast your vote on …

  1. Tom7

    The crux of our disagreement is simply with the claim that some future "well-architected" relational database engine could render the use of today's useful, existing, in-production graph databases unnecessary

    Not a great way to start: misquoting (apparently deliberately) the other side. No one has so far mentioned some future "well-architected relational database engine", but rather well-architected databases (i.e. schemas) within existing relational database engines.

    This, and a big pile of thinly veiled Neo4j sales-speak, appear to be about the level this contribution to the debate is operating on.

    1. stiine Silver badge
      Facepalm

      He's also never heard of indexes, to say nothing of multiple indexes. Perhaps it's because a graph database is solely indexes.

  2. elsergiovolador Silver badge

    Knock knock

    *knock knock*

    Someone: who is there?

    Salesman: would you like to talk about our saviour graph database?

    Someone: sure, would you like to come in?

    Salesman: *confused* I'm not sure, I've never made it this far.

  3. Charlie Clark Silver badge

    YAQL is not the solution

    Much as I dislike SQL, the presence of SPARQL, GQL, etc., which try to provide something like it, shows the need for a query language. But all the attempts stem from building something for a different storage engine rather than building a better query language, and that points to a greater flaw in the approaches: confusing the model with the implementation. There is nothing in the relational model that requires data to be stored in row tuples, so any conclusions derived from that assumption are spurious.

    In fact, rather than spending even more on licences, I think a lot of problems could be solved by ensuring that people dealing with data are properly supported by RDBMS staff, so that models provide the data they should, as fast as is necessary and without breaking the bank. The persistent fad for map/reduce (denormalise and then parallelise) because hardware is so cheap is just leading to massive bills for stuff running on clouds.

    1. Steve Channell
      Facepalm

      Re: YAQL is not the solution

      Back in the day… Chris Date argued that the relational model should focus on domains rather than columns in tables, but IBM liked the idea that rows were easier to migrate from ISAM records… SQL != relational… it doesn't have to be that way…

      The assertion "relational can't do everything" is nonsense, because a graph triple is a tuple and a tuple is a row. The real difference is that Neo4j adds properties to graph nodes, which are segments in a network/hierarchical model and related tables in a SQL relational model.
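
      To make that concrete, here is a minimal sketch of a graph stored as plain rows (table and column names are invented purely for illustration, not taken from the article or any post):

        -- A graph of triples modelled as an edge table: one row per edge.
        CREATE TABLE edges (
            subject   TEXT NOT NULL,
            predicate TEXT NOT NULL,
            object    TEXT NOT NULL,
            PRIMARY KEY (subject, predicate, object)
        );

        -- Neo4j-style node properties land in a related table.
        CREATE TABLE node_properties (
            node  TEXT NOT NULL,
            prop  TEXT NOT NULL,
            value TEXT,
            PRIMARY KEY (node, prop)
        );

        -- A one-hop traversal is then an ordinary indexed lookup.
        SELECT object FROM edges
        WHERE subject = 'alice' AND predicate = 'knows';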

      The performance difference is not primarily graph properties, but time series: if you want to plot the daily balance of a mortgage account (for linear regression to close), a time-series database will infer daily balances for all the days that don't go up with interest or down with repayments – relational and graph databases require joins to a calendar.
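
      For reference, the calendar join in question might look something like this (a sketch in SQLite dialect; the balances table and account id are invented for illustration):

        -- Generate a calendar with a recursive CTE, then carry the last
        -- known balance forward over days with no interest or repayment.
        WITH RECURSIVE calendar(day) AS (
            SELECT DATE('2024-01-01')
            UNION ALL
            SELECT DATE(day, '+1 day') FROM calendar
            WHERE day < DATE('2024-12-31')
        )
        SELECT c.day,
               (SELECT b.balance
                FROM balances b
                WHERE b.account_id = 42
                  AND b.as_of <= c.day
                ORDER BY b.as_of DESC
                LIMIT 1) AS daily_balance
        FROM calendar c;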

      There are lots of cases where normalised relational models are inefficient, but rather than pouring your data into Neo4j's Java database, you can instead use a column store or an xml/json/blob column in an existing db and take advantage of parallel query engines honed over decades in C/C++ databases.

      1. Charlie Clark Silver badge
        Thumb Up

        Re: YAQL is not the solution

        Yes, I suspect if we were able to use something closer to relational algebra, we'd have fewer problems and faster databases… I'd certainly love it for a wire protocol between client and server.

        I've always found the process of normalising data to be very helpful in understanding the actual task in hand. No problem with dumping associated metadata in a "blob" column for the occasional ad hoc discovery query: it's extremely cheap and can help you get started.

  4. Anonymous Coward
    Stop

    Use-case

    Consider these 2 (very common in the real world) use-cases:

    1. People using relational databases with all the complexity of MySQL, Postgres, MariaDB etc.... for tiny little web applications like Wordpress or Magento.

    2. Companies harvesting a metric fuckton of your data to do nothing except track / advertise shit to you. These tend to go more with the graph database option over relational.

    The thing I'm struggling with, though, is that for all the complexity afforded by these database platforms, the actual tangible output and use case of the systems is... boring.

    If we look at some other apps and services, the choice of database is quite moot when you consider how worthwhile the end application tends to be. It would be a bit like over-engineering a house to the point where it could withstand a war zone, whilst ignoring the fact that nobody actually wants to live there.

    TL;DR: it doesn't really matter which you pick, because your application for that database is incredibly simple. The database platform used isn't the issue.

    1. Tim99 Silver badge

      Re: Use-case

      For (1) - If you really need a small relational database, SQLite is your friend. Whilst it is "a single user database", a web server with multiple clients is "a single user". With WAL (write-ahead logging) I've just prototyped something with multiple related tables, running a multiuser test at ~120 added rows per second on a 2GB Raspberry Pi 4. Eventually it got to >5 million rows. I was pleasantly surprised to find that the sqlite3 command-line shell could successfully back it up into another SQLite data file whilst it was running at close to that rate…
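
      For anyone wanting to try the same, the WAL part is a one-line pragma and the live backup is a shell dot-command. A sketch, not the actual project code:

        -- Switch the database to write-ahead logging so readers don't
        -- block the single writer (the setting persists in the file).
        PRAGMA journal_mode = WAL;

        -- Live backup from the sqlite3 command-line shell while the
        -- test is still inserting:
        --   sqlite3 live.db ".backup backup.db"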

      1. elsergiovolador Silver badge

        Re: Use-case

        For (1) I don't think you even need a database at all. You can just use the filesystem and construct filenames as if you were using something like DynamoDB and call it a day.

        1. Tim99 Silver badge

          Re: Use-case

          I haven't tried it... Appropriate Uses For SQLite (sqlite.org) - Replacement for ad hoc disk files

          Many programs use fopen(), fread(), and fwrite() to create and manage files of data in home-grown formats. SQLite works particularly well as a replacement for these ad hoc data files. Contrary to intuition, SQLite can be faster than the filesystem for reading and writing content to disk.
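
          A minimal sketch of the idea, assuming a simple key-value layout (the table and names are invented for illustration):

            -- One SQLite table standing in for a directory of ad hoc files.
            CREATE TABLE IF NOT EXISTS kv (
                path  TEXT PRIMARY KEY,   -- what would have been the filename
                value BLOB                -- what would have been the file body
            );

            -- An upsert replaces the fopen()/fwrite() pair...
            INSERT INTO kv (path, value) VALUES ('posts/2024/hello.txt', x'00')
            ON CONFLICT(path) DO UPDATE SET value = excluded.value;

            -- ...and a lookup replaces fopen()/fread().
            SELECT value FROM kv WHERE path = 'posts/2024/hello.txt';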

          1. elsergiovolador Silver badge

            Re: Use-case

            For a blog site it doesn't matter whether it is faster. It makes no difference whether a post loads in 1ms or 1.2ms.

        2. Anonymous Coward
          WTF?

          Re: Use-case

          @elsergiovolador

          For (1) I don't think you even need a database at all. You can just use the filesystem and construct filenames as if you were using something like DynamoDB and call it a day.

          Except all of the code used to read/write data is written in SQL and needs features of SQL such as joining tables of related data.

          If you were going to go down this avenue you'd pretty much be doing something completely different. You definitely couldn't swap out a relational database to use files and then expect everything else to just work. It's not even a comparison.

          1. elsergiovolador Silver badge

            Re: Use-case

            Why not? After all, the filesystem becomes just a key-value store and the data would be denormalised, so there is no need to maintain relations, especially in something as simple as a blog page.

    2. nyquist212

      Re: Use-case

      Those 2 use cases are typical of the trivial. Consider healthcare data and the terrible state of Electronic Medical Record Systems (and their underlying RDBMS). Your health care record is a directed graph. The tabular schema necessary to hold this (and thousands of other patients) results in massive normalization into tens of thousands of tables. Relational is not a "good" fit for this use case.

      1. James Anderson Silver badge

        Re: Use-case

        Hard to beat the ancient and venerable MUMPS database for medical records.

        One of the best ever databases with the worst ever language/API.

  5. Groo The Wanderer

    The problem I've found in the industry over the years is companies with a "standard database" policy, or lead designers who have a "preferred" database that they always use.

    It is best to keep an open mind with databases, and use the provider that is best suited to the task at hand. There is no one database model that "does it all" easily for all use cases; each of the main architectures (relational, graph, object store, and document) has advantages for certain types of data access. Trying to shoehorn out-of-band data into the wrong database engine is not only painful and inefficient, it is prone to bugs and breakage.

    I'd rather go to the hassle of setting up two-phase commits so I could use the right database for each type of data to be managed than try to stuff it all in one general-purpose engine.
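
    For the curious: at the SQL level the participant side of a two-phase commit looks like this in PostgreSQL (the table is invented for illustration; the coordinator logic lives in the application or a transaction manager):

      BEGIN;
      INSERT INTO orders (id, item) VALUES (1, 'widget');
      PREPARE TRANSACTION 'tx_orders_1';  -- phase one: vote to commit

      -- later, once every participating database has prepared:
      COMMIT PREPARED 'tx_orders_1';      -- phase two
      -- or, if any participant failed to prepare:
      -- ROLLBACK PREPARED 'tx_orders_1';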

  6. yetanotheraoc Silver badge

    I don't understand the motion

    In particular I am struggling with this bit: "... for most of the same use cases."

    I don't have any experience with Neo4j or the like. I do know quite a lot about a domain-specific graph which has been implemented over and over in different languages as a tight, special-purpose graph db. Interestingly, all of those are optimized for select. There was one developer who wanted fast inserts, so he redid it in an off-the-shelf relational db, but with any large dataset the select performance is so bad that nobody uses that "solution". It turns out insert was the developer's use case, but select is the customer's use case.

    I'm picturing a Venn diagram where you have things that should only be done relational, things that should only be done graph, and "same use cases" is the overlap. My problem is I don't know how many applications belong to that overlap, and the two debaters so far haven't addressed it.

    Pavlo wrote on Monday, "it is straightforward to model a graph (using SQL) as a collection of tables", and this seems to me to be saying *all* applications belong to the overlap area. I reject this idea on its face. Webber wrote today, "for graph use cases common in the wild, native storage engines often have very significant run-time benefits", and this is talking about the graph area instead of the overlap area. Not what I wanted to hear about.

    So I'm struggling.

    1. The Velveteen Hangnail

      Re: I don't understand the motion

      IMO the entire pro-graph side is outright disingenuous. The Venn diagram between RDBMSes and non-relational databases is basically a big fat circle for RDBMS, with several small, slightly overlapping dots representing the other database types. The ranges of use cases are NOT equal.

      RDBMSes won the database wars because they provide the most flexibility in dealing with data. If you don't understand what you need from a database, you choose relational. Period. End of discussion. Unless you are doing something VERY specific, an RDBMS can tackle almost every use case you throw at it. The other database types may handle their specific use cases very well, but at the cost of absolutely sucking at everything else.

      I haven't done much work with graph databases myself, but this whole rigmarole the graph database crowd is peddling smells like the EXACT same nonsense that the NoSQL crowd peddled a few years ago. Oh oh oh, RDBMSes suck! Use this new shiny thing instead! MongoDB is webscale! NoSQL got its speed by sacrificing _everything_ that made databases manageable, extensible, and robust, forcing you to front-load all that processing into custom applications instead. I saw it over and over again.

      And when you finally dug through the weeds and investigated the REAL reason NoSQL was used? Because RDBMSes are "outdated" and "not cool". Because SQL is "too hard". I would have laughed if I hadn't been so shocked by the utter idiocy of it all. And now look: NOBODY is talking about NoSQL anymore. But we're repeating the exact same nonsense again, except with graph databases.

      If you want to use a non-relational database, then go nuts. Just, for the love of $DEITY, analyze what you're going to do with it first, and confirm that what you want aligns with the strengths of said database type. And when I say analyze, I mean look beyond the first 5 minutes where you're writing your hello-world app.

      If you can't be sure that your edgy hipster choice of database will support your needs 5 years from now, then don't do it. Or... do. There are plenty of consulting companies who are happy to take your money to remediate the shitshow your code base has become cause you were more worried about being cool than getting the job done.

      1. yetanotheraoc Silver badge

        Re: I don't understand the motion

        Even though RDBMS is what I know best and is my go-to (well, that or a flat file with a script), this next bit bothers me on so many levels:

        "If you don't understand what you need from a database, you choose relational. Period. End of discussion."

        A. If you don't understand what you need from a database then you better find out before doing anything.

        B. If you still don't understand what you need from a database, then NoSQL seems to me a much better choice. That way you don't spend any time up front pulling out columns you won't need. And you won't wake up one day to realize your ETL since the get-go has been throwing away that weird "dispensable" data element that just happens to hold outsized business value.

        C. Strongly defaulting to RDBMS guarantees, statistically speaking, that some of the time that choice was wrong. Maybe only one percent of the time should we choose graph instead of relational. But if that one percent of the time graph gets us 2x performance, that's a huge gain. Also, one percent of the universe of databases is a really big slice of pie.

        D. Strongly defaulting to RDBMS might mean we are so heavily invested in it that it will seem right even when it is wrong. This last one is why I'm most interested in the cases where relational and graph could both be valid solutions. Seeing which is "better" there is more instructive overall for the general case.

        E. I like learning new things, for their own sake, even if my previous knowledge has been good enough so far.

        1. Slipoch

          Re: I don't understand the motion

          Great answer, particularly on the risks of an RDBMS-first approach. So many projects I have had to fix where it was obvious they didn't know what they were doing and created massive ongoing issues simply because of SQL.

      2. Slipoch

        Re: I don't understand the motion

        The real reason NoSQL was used, and is still used, in speed-critical and reliability-critical applications such as air traffic control? Speed. Indexing in SQL for speed on highly related data often takes a lot of custom faffing around (I know, we did it for more than 10 years); in a good NoSQL db this is built in. Sometimes you may need to indicate that a property on a field requires indexing, but often the out-of-the-box solution (VelocityDB and Realm) is fast enough. Backups can be done within the engine, within the filesystem, or through separate programmability.

        When a full SQL Server instance takes 10 hours to insert 1 million of the same records that VelocityDB takes 12s to insert on the same system, then there is a big bloody problem with using SQL for any kind of large-scale input/output.

        Another reason NoSQL is often preferred by developers: no separate install. Using most SQL systems means having to install the SQL server separately (or have an online one running). The SQL systems (like SQLite) that can be packaged as part of an application are large and very slow. Most NoSQL packages can be included in your codebase to run when the application runs.

        Complex models - In NoSQL I can have a DB model that contains a list of non-DB models, each of which in turn references a different DB model type in a relationship. I cannot do that in SQL and have the queries work on it; the non-DB model inside the DB model would itself have to be its own table, even if the lookups would never query that table's data in a solitary context.

        Native models - Allows pulling data out in a native format used by the code, whether it is C#, C++, or even front-end scripting languages (although if you are after speed, scripting kinda makes that a bit pointless).

        Entity Framework does NOT do this by default: it requires console commands to be run for each migration or model change, and careful planning to avoid breaking the model -> db or db -> model relationship. The likelihood of a user requiring SQL-native output is pretty low; usually the data goes through an application first, and this removes the conversion step that SQL requires to take flat-file data and turn it into nested object models.

        Easier migration - Handle your migration in code: either leave a new column null, add a default value, or work out the correct value of the new column from other fields on the old data.

        The arguments against anything new on behalf of SQL remind me of the pro-Pick arguments back in the '80s, and if you've ever had to use Pick you will understand how wrong those were on a fundamental level: the complaints about not being able to store multiple pieces of data in one field missed the fundamental difference that SQL didn't rely on you already knowing what datatype the field had to provide for the info. The issue I see is that you approach this like it is MSSQL or Oracle, when the way the data is organised is fundamentally different and allows for different things.

        So let's take a look at your arguments briefly:

        Manageable - NoSQL generally doesn't require much management; migration data that needs to change is put into code. Sharding and other facets like geo-replication can be handled externally or manually in many ways, and the speed increase allows for greater flexibility and a LOT more control in this regard, as you are not limited by the tooling within the db itself. SQL has the advantage that you can mix and match data from different sources more easily, rather than having hardcoded relationships, but you can do this with a graph db too. NoSQL is built around the models being input and output, so it cannot do this without setting models up for it (although a lot do allow either SQL or SQL-like queries on the data).

        Extensible - Not sure what you mean by this; most good NoSQL dbs allow for more extensibility, in that they allow extensions to the core functionality of the db at a code level, something SQL doesn't allow except at the surface level, via stored procedures and programmability at the SQL level (not at the core code level). Some of them do not have things like stored procedures or views because the way the db works makes them redundant; in other cases you can do something similar to graph calls within a NoSQL db to get a view.

        Reliable - Anyone using Azure MS SQL instances will know what it is like when an update to the db buggers the connection up. Or simply the price of running an instance on the equivalent of a computer from 10 years ago.

        Running MSSQL yourself on your own server, it is finicky about what other services are on there and will often fail if the server is not set up in a specific manner. Running on local systems is terrible, as AV programs can and will interfere, and staying up to date on client computers (when used in applications) is not easy, as there are breaking changes between versions that disallow downgrading.

        Auto-backups are great, but in NoSQL they only require minor changes to code to organise, or you can set up your OS to do them.

        All the above being said, if you have a stable system then SQL will be good.

        The big difference is that now you simply organise your code the way the program/application requires; no more DB manager required unless you are running in a multi-instance, web-accessible way (and then it would be a server admin, not a db admin).

        The one downside seems to be scaling out with a ton of data, but VelocityDB and Realm both handle this quite well (which is why Mongo bought the cloud version of Realm to run its backend).

        All of the above being said, the only time I can see a graph db being useful over both NoSQL and SQL is when you want to use it for neural networks and other association-based things. I have never come across graph query calls being more useful than SQL or NoSQL calls. (I guess maybe if you have a dynamic method of calling the data and need highly variable amounts of data for those calls?)

        The above is probably poorly worded; I don't have time for anything more polished, so sorry.

        1. Tim99 Silver badge

          Re: I don't understand the motion

          When a full SQL Server instance takes 10 hours to insert 1 million of the same records that VelocityDB takes 12s to insert on the same system, then there is a big bloody problem with using SQL for any kind of large-scale input/output.

          I'm not sure what was wrong with your SQL Server data/database, but experience tells me that "it is not typical".
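
          A common culprit in cases like this, assuming nothing else is broken, is issuing each INSERT in its own autocommit transaction, which forces a log flush per row. Batching changes the picture entirely (table and columns invented for illustration):

            -- One transaction per row: a log flush per row.
            -- INSERT INTO t (id, payload) VALUES (...);  -- x 1,000,000

            -- Wrapping the batch in a single transaction is often orders
            -- of magnitude faster:
            BEGIN;
            INSERT INTO t (id, payload) VALUES (1, 'a');
            -- ... remaining rows ...
            COMMIT;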

          As an aside, I've just completed a feasibility study for a small pro-bono project using the lightweight SQLite database on a desktop with a web front end. A loop generated test data and inserted it into an indexed table. I got 2.3 million records in an hour; each record was updated 6 times, and a trigger inserted a record from each update into another indexed relationship table, giving a total of 16 million records an hour. Each row was retrieved three times, its values inserted into strings and then cleared. So a simple test app using one core of an old i3 iMac manipulated and inserted a million records in <4 minutes. I've just confirmed this by inserting 50,000 records from each of two clients (i.e. 700,000 records in total, counting the update and trigger rows) in 2.7 minutes. The database size was 0.64GB. Unlike my simple desktop system, a SQL Server will normally use multiple cores and probably has a fast hardware RAID 10 setup...
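
          For anyone curious, the trigger mechanism described can be sketched like this in SQLite (made-up table names, not the actual project code):

            -- Main table plus a history table filled by an AFTER UPDATE
            -- trigger, one history row per update.
            CREATE TABLE readings (
                id    INTEGER PRIMARY KEY,
                value REAL
            );

            CREATE TABLE readings_history (
                reading_id INTEGER,
                value      REAL,
                changed_at TEXT DEFAULT CURRENT_TIMESTAMP
            );

            CREATE TRIGGER log_update AFTER UPDATE ON readings
            BEGIN
                INSERT INTO readings_history (reading_id, value)
                VALUES (NEW.id, NEW.value);
            END;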

          Obviously my results are not a million records in 12 seconds, but are perhaps indicative?

          1. Groo The Wanderer

            Re: I don't understand the motion

            What your results are indicative of is the effectiveness of database caching when all the IO is focused on ONE TABLE. :(

            1. Tim99 Silver badge

              Re: I don't understand the motion

              Good thought; but including all the inserts, updates and selects, it was three tables.

              Disclaimer: I'm old and the memory may be going, but I go back to building stuff with DEC Rdb and Oracle V4...

            2. werdsmith Silver badge

              Re: I don't understand the motion

              I would expect it to be faster across multiple tables than a single table.

        2. Charlie Clark Silver badge
          Stop

          Re: I don't understand the motion

          The above is probably poorly worded I don't have time for anything else, so sorry.

          You should have saved your time for something more constructive. The post is full of specious half-truths related to a couple of implementation-specific events.

          Consistency always comes at a cost. NoSQL systems routinely fail because they are prepared to sacrifice consistency and robustness for speed. You can do the same on any RDBMS by simply disabling all the checks, and then you can watch your application go up in smoke.

      3. Charlie Clark Silver badge
        Pint

        Re: I don't understand the motion

        Have one of these, you deserve it.

      4. James Anderson Silver badge

        Re: I don't understand the motion

        While I am a big fan of relational, and would wholeheartedly endorse the previous poster's recommendation of SQLite for almost everything, there are a few edge cases where relational falls short.

        Anything recursive, such as a Bill of Materials, just ends up a complete mess if implemented with an RDBMS.
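
        For reference, the standard SQL answer here is a recursive CTE over a self-referencing table, and it is exactly the kind of thing that gets messy as it grows. A minimal sketch with invented names:

          -- Bill of Materials: parts referencing their parent assembly,
          -- exploded with a recursive CTE.
          CREATE TABLE bom (
              part_id   INTEGER,
              parent_id INTEGER,   -- NULL for top-level assemblies
              qty       INTEGER
          );

          WITH RECURSIVE explode(part_id, qty, depth) AS (
              SELECT part_id, qty, 1 FROM bom WHERE parent_id IS NULL
              UNION ALL
              SELECT b.part_id, b.qty * e.qty, e.depth + 1
              FROM bom b JOIN explode e ON b.parent_id = e.part_id
          )
          SELECT part_id, qty, depth FROM explode;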

        Using an RDBMS to create a customisable product is just horrible (here's looking at you, SAP).

  7. toomanylogins

    If RDBMSes are so good, why is WordPress such a mess?

    I develop a static site generator and we have implemented importing from WordPress and Joomla! via MySQL. Importing from Joomla! was straightforward: it's mostly normalised, and we were easily able to cross-reference and import the data. WordPress, on the other hand, is a mess. They have everything stuffed into text fields as serialised data, which can't be efficient. Furthermore, it seems every developer of a WordPress extension creates their own format.
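
    For anyone who hasn't looked inside a WordPress database: the serialised data lives in plain text columns such as wp_options.option_value, so the inner fields are opaque to the database (the example row below is illustrative):

      -- A wp_options row holds a PHP-serialised array as plain text:
      SELECT option_name, option_value
      FROM wp_options
      WHERE option_name = 'widget_text';
      -- option_value: a:2:{i:1;a:1:{s:5:"title";s:7:"Sidebar";}s:12:"_multiwidget";i:1;}
      -- No index or join can reach "title"; the only tool is a LIKE scan.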

    It seems to me the one weakness of an RDBMS is where customisation by third parties is required. Having said that, I don't know how this is handled in a graph database.

  8. This post has been deleted by its author

  9. Mo64
    Holmes

    I followed the link for Migrations and was expecting to find ALTER TABLE syntax, not an executable, so on that point at least +1 to relational.

  10. Allknowingroger

    Graph for the win

    Palantir and Watson use graph neural network software built on top of a graph database.

    Wikidata is a reasonable competitor.

    The real question is: what type of database is best for an LLM?

  11. spireite Silver badge
    Mushroom

    Entity Framework... literally the spawn of the devil, and a DBA's Kryptonite.
