Expert exchange
How could any barely technical Data Engineer not love this article?!
Great job, vultures! (again)
It's been a year since Databricks bought Tabular for $1 billion, livening up the sleepy world of table formats. People have been playing with it. It's captured people's imaginations for sure... The data lake company, with its origins around Apache Spark, had created the Delta Lake table format to help users bring query …
"The actual storage of the metadata, to me, is an implementation detail, and whether you store it in the file system, or you store it in a catalog, or something like a relational data store, is not as important as the APIs you use to interact with it."
Important here is the REST spec, which "behind the scenes" can keep information about metadata, where the files are.
This is back-to-front: REST is merely a communication protocol. Where you store your data is, for data analysis, far more important than the protocol. And, even if it's metadata, I see no good reason not to keep it in the database.
The Duck approach is generating a lot of enthusiasm from those looking to move away from US-based lock-in.
The problem with DuckDB's proposal is that there is no truly portable way for "the database" to handle storage of the metadata. I agree with the Apache committees - that is a non-starter. Metadata must be available in a text-only format that can be versioned along with the configuration files of the system so you have a proper recovery point. You can't easily do that with an active RDBMS. Nor do I know of any widely accepted RDBMS that can be guaranteed to handle the sheer volume of requests that a large data lake system can be presumed to be dealing with.
I just stumbled across this over the weekend.
https://www.geeksforgeeks.org/time-space-trade-off-in-algorithms/
You have to scroll pretty far down to find the payoff. The first part of the article discusses past and current solutions before it gets to the new discovery. There is also a good YouTube video that explains this better.