I’ve been doing a bunch of speeches at various conferences on the merging of the data warehouse and data lake into a single unified analytics platform. I inevitably get one question: “How is this different from a lakehouse?” There are two answers, a short one that’s glib and easy, and a longer one that really dives into things. Short answer: “They’re extremely similar architectural concepts.” The rest of this post is the long answer.

Race to the middle

Essentially, these two new architecture concepts came from a huge race to the middle. Every data lake vendor is racing to add data warehouse-like capabilities, and every data warehouse vendor is racing to add data lake-like capabilities. I did a whole article on the evolution of the modern data warehouse, so I won’t repeat myself here, but essentially, over the last decade since the data lake started grabbing headlines, the analytical databases have been adding data lake-like capabilities to counter their previous weaknesses.

The first thing they did was fix the affordable scalability problem by becoming distributed, with massively parallel processing (MPP) engines, just like the data lakes. They’ve added support for streaming data, schema-on-read for semi-structured data, and the ability to query data in formats other than their own. They also separated out data storage to utilize object storage as well as distributed file systems. I can use a database to query Parquet files in an S3 bucket on AWS now, with database-level resource efficiency and outstanding performance.

Analytical databases (like Vertica, where I work) also embedded advanced analytics capabilities like geospatial analysis, time series, and machine learning – not just algorithms, but the whole shebang from statistical analyses to model evaluation. They also added clients for frameworks like R and Python, so data scientists could do their work on whole datasets without having to sample them down to in-memory, single-node size or get someone else to productionize their work when they were done. That work has been largely contributed by the open source community.

Data warehouse vendors also took their big concurrency advantage one step further and isolated workload resources, so that streaming data ELT and data onboarding can run inside the database all the time without slowing down ad hoc BI queries or machine learning model training. That was essential to support streaming use cases, including Internet of Things (IoT) environments. That’s a unified analytics platform – a distributed data warehouse plus support for many kinds of data (structured, semi-structured, and streaming), Python and R as well as SQL clients, and in-database machine learning, geospatial, and other advanced analytics built in.

Meanwhile, as soon as data lakes got out in the world, they improved, like any software stack that has adoption. Since they were an open source stack, the data lakes had even more developers working on them. The first thing they did was replace MapReduce, which was incredibly slow and limited, with Apache Spark as the main data processing engine.
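The core idea behind MPP-style processing – partition the data, run the same operation on each partition in parallel, then merge the partial results – can be sketched in a few lines of Python. This is a toy illustration, not any vendor's engine: it uses threads within one process to stand in for the separate nodes a real MPP database or Spark cluster would use, and `mpp_sum` is a name I made up for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    # Each "node" runs the same aggregation independently on its shard.
    return sum(partition)

def mpp_sum(values, workers=4):
    """Scatter the data across workers, aggregate each shard, merge results."""
    # Split the input into roughly equal shards, one per worker.
    shards = [values[i::workers] for i in range(workers)]
    # A real engine would ship each shard to a separate node/process;
    # threads are just a single-machine stand-in here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, shards))
    # The merge (gather) step combines the per-shard partial aggregates.
    return sum(partials)

if __name__ == "__main__":
    print(mpp_sum(range(1_000_000)))  # -> 499999500000, same as sum(range(1_000_000))
```

The same scatter/gather shape applies to any aggregation that can be computed per-partition and then merged (counts, sums, min/max), which is why both MPP warehouses and Spark-style engines scale it out so effectively.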