Databricks' Delta Lake Framework

From GM-RKB

A Databricks' Delta Lake Framework is a DFS Data Storage Management Platform (an open-source storage layer that brings ACID transactions and data versioning to DFS data storage).



References

2022

  • https://databricks.com/blog/2022/06/30/open-sourcing-all-of-delta-lake.html
    • QUOTE: ... Delta Lake enables organizations to build Data Lakehouses, which enable data warehousing and machine learning directly on the data lake. But Delta Lake does not stop there. Today, it is the most comprehensive Lakehouse format used by over 7,000 organizations, processing exabytes of data per day. Beyond core functionality that enables seamlessly ingesting and consuming streaming and batch data in a reliable and performant manner, one of the most important capabilities of Delta Lake is Delta Sharing, which enables different companies to share data sets in a secure way. Delta Lake also comes with standalone readers/writers that lets any Python, Ruby, or Rust client write data directly to Delta Lake without requiring any big data engine such as Apache Spark™. Finally, Delta Lake has been optimized over time and significantly outperforms all other Lakehouse formats. Delta Lake comes with a rich set of open-source connectors, including Apache Flink, Presto, and Trino. Today, we are excited to announce our commitment to open source Delta Lake by open-sourcing all of Delta Lake, including capabilities that were hitherto only available in Databricks. We hope that this democratizes the use and adoption of data lakehouses. But before we cover that, we’d like to tell you about the history of Delta. ...

      ... They could not use data warehouses for this use case because (i) they were cost-prohibitive for the massive event data that they had, (ii) they did not support real-time streaming use cases which were essential for intrusion detection, and (iii) there was a lack of support for advanced machine learning, which is needed to detect zero-day attacks and other suspicious patterns. So building it on a data lake was the only feasible option at the time, but they were struggling with pipelines failing due to a large number of concurrent streaming and batch jobs and weren’t able to ensure transactional consistency and data accessibility for all of their data. ...

2020a

2020b

2020c

  • https://delta.io/
    • QUOTE: Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

      ... ...

      Key Features:

      • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation. Learn more at Diving into Delta Lake: Unpacking the Transaction Log.
      • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
      • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments. Learn more in Introducing Delta Lake Time Travel for Large Scale Data Lakes.
      • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
      • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
      • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption. For more information, refer to Diving Into Delta Lake: Schema Enforcement & Evolution.
      • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL. For more information, refer to Diving Into Delta Lake: Schema Enforcement & Evolution.
      • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
      • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture. For more information, refer to Announcing the Delta Lake 0.3.0 Release and Simple, Reliable Upserts and Deletes on Delta Lake Tables using Python APIs which includes code snippets for merge, update, and delete DML commands.
      • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
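      The ACID Transactions and Time Travel features above both rest on Delta Lake's ordered transaction log of commit files. The following is a minimal pure-Python sketch of that idea, not the real Delta Lake implementation (which stores Parquet data files and JSON/checkpoint commits and is specified in the Delta protocol); the `TinyDeltaLog` class and its `add`/`remove` actions are illustrative simplifications:

      ```python
      import json
      import os
      import tempfile

      class TinyDeltaLog:
          """Toy model of Delta Lake's transaction log: each commit is an
          immutable, sequentially numbered JSON file, and the table state at
          version N is the replay of commits 0..N (hence 'time travel')."""

          def __init__(self, log_dir):
              self.log_dir = log_dir
              os.makedirs(log_dir, exist_ok=True)

          def commit(self, actions):
              # Next commit version = number of existing commit files.
              version = len(os.listdir(self.log_dir))
              path = os.path.join(self.log_dir, f"{version:020d}.json")
              with open(path, "w") as f:
                  json.dump(actions, f)
              return version

          def snapshot(self, version=None):
              # Replay add/remove actions in commit order, optionally
              # stopping at an earlier version to "time travel".
              files = sorted(os.listdir(self.log_dir))
              if version is not None:
                  files = files[: version + 1]
              state = {}
              for name in files:
                  with open(os.path.join(self.log_dir, name)) as f:
                      for action in json.load(f):
                          if action["op"] == "add":
                              state[action["file"]] = action["rows"]
                          elif action["op"] == "remove":
                              state.pop(action["file"], None)
              return state

      log = TinyDeltaLog(tempfile.mkdtemp())
      v0 = log.commit([{"op": "add", "file": "part-0.parquet", "rows": 100}])
      # An "update" is modeled as removing the old file and adding a new one.
      v1 = log.commit([{"op": "remove", "file": "part-0.parquet"},
                       {"op": "add", "file": "part-1.parquet", "rows": 120}])
      print(log.snapshot())    # latest state: {'part-1.parquet': 120}
      print(log.snapshot(v0))  # time travel to v0: {'part-0.parquet': 100}
      ```

      Because each commit file is written once and never mutated, readers replaying the log always see a consistent snapshot, which is the essence of how the real transaction log provides serializable isolation across concurrent batch and streaming jobs.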

2019