00: Data Lake Vs. Data Warehouse Vs. Delta Lake

Modern data architectures often include both Data Lakes & Data Warehouses.

Q1. What questions do you need to ask when choosing a Data Warehouse over a Data Lake for your BI (i.e. Business Intelligence) reporting?

A1. The gap between a data lake & a data warehouse is narrowing with the advent of MPP (i.e. Massively Parallel Processing) systems like Delta Lake from Databricks, Snowflake, Amazon Redshift, etc. Often you will need both, and they will complement each other.

[Figure: Data Lake Vs Data Warehouse. Source: https://orzota.com/2018/02/16/enterprise-data-lake/]

Still, it pays to ask the right questions as both complement each other.

1. Who is going to consume my data?

Since the data within data lakes may not be curated and can originate from non-operational systems like IoT devices, call detail records from cell towers, etc., it isn’t a good fit for the average business analytics user. Data lakes are the playground for data scientists and expert data analysts. So, always ask: is it going to be consumed by just a few people like data scientists & analysts, or by the masses from various business units like sales, marketing, call centre staff, the CEO, etc.? Data warehouses appeal to the masses as the data is structured & processed to be orderly, with schema enforcement & ACID transactions.

2. Am I using it for orderly info or open-minded data discovery?

Data lakes offer flexible solutions with vast & varied data – both structured & unstructured – for open-minded data discovery. A Data Lake is a place where new data can first enter without any specific use case for it. It’s a place where any kind of data can reside, so it’s a great source to discover new ideas and experiment with data. This openness comes at the cost of the lack of meaningful structure that a Data Warehouse has, and a data lake can be a bit of a mess for a larger mass of business users. In Data Warehousing the data conforms to dimensions and measures, and is easily queryable & consumable with consistent & governed data models.

3. Do I need consistent results?

Data warehouses are more orderly with ACID compliance – for example, Amazon Redshift, Google BigQuery, Snowflake, Delta Lake from Databricks, etc. Delta Lake eases the ETL workload by enabling ACID transactions in a warehousing environment. Coupled with Structured Streaming, you can achieve a low-latency data warehouse.
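To make this concrete, below is a minimal PySpark sketch of Structured Streaming feeding a Delta table, where each micro-batch is an atomic commit. The paths, schema and session configuration are illustrative assumptions rather than anything prescribed here.

    # Minimal sketch: streaming ingestion into an ACID Delta table (paths are hypothetical).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("delta-streaming-sketch")
             # Assumes the delta-spark package is available on the classpath.
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # Read a stream of JSON events as they land in the data lake.
    events = (spark.readStream
              .schema("event_id STRING, amount DOUBLE, event_time TIMESTAMP")
              .json("/data/events_raw"))

    # Each micro-batch is committed atomically to the Delta transaction log,
    # so BI queries never see partially written files.
    query = (events.writeStream
             .format("delta")
             .outputMode("append")
             .option("checkpointLocation", "/data/checkpoints/events")
             .start("/data/events_delta"))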

4. Am I querying billions of records or just a few million?

Data lake based technologies can handle petabytes of data, and you can query billions of records. Modern data warehouses like Amazon Redshift, Google BigQuery, Snowflake, Delta Lake from Databricks, etc. fall into the massively parallel processing (i.e. MPP) category and can also handle billions of records.

5. Am I going to have a few reports or hundreds of them?

When you are going to have hundreds of reports, data warehouses are better suited as the data is more orderly with ACID compliance, support for the various SCD types, and conformance to dimensions and measures, making it easily queryable & consumable with consistent & governed data models.

|  | Data Lake | Data Warehouse |
|---|---|---|
| Data Structure | Raw | Processed |
| Purpose of ingesting the data | Not yet known | Currently in use |
| Consumers of the data | Data scientists and analysts | Business users & management |
| Data Granularity | Low level of detail | Summary or aggregated level |
| Types of Data | Unstructured (e.g. Word, PDF, etc.), semi-structured (e.g. XML, JSON, etc.) and structured in tabular form | Mostly structured in tabular form |
| Data Processing | Fast ingestion of new data via ELT | Time consuming when adding new data |
| Schema enforcement | No predefined schema (schema on read) | Predefined schema on write |
| Storage | Designed for low-cost storage | High-end storage like SAN |
| Accessibility | Highly accessible and quick to update | More difficult & costly to make changes; focused on easier reads. Change Data Capture (CDC) tools like IBM DataStage, Informatica or Talend are used |
| Security | Platform-level security, not as granular as row level | Comprehensive security down to column or row level |

Q2. Why do we need Delta Lake for Spark?

A2. Delta Lake for Spark addresses the following shortcomings in Spark.

#1. ACID compliance
The Spark documentation says that:

Save operations can optionally take a SaveMode, which specifies how to handle existing data if present. It is important to realise that these save modes do not utilise any locking and are NOT atomic.

Additionally, when performing an Overwrite, the data will be deleted before writing out the new data.

Overwrite = DELETE + WRITE

This can cause data consistency, isolation & durability issues such as writing only partial data or losing data altogether (i.e. if the delete succeeds & the write fails) when a runtime exception is thrown.
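As a rough PySpark illustration (the paths and example DataFrame are hypothetical), the sketch below contrasts a plain Parquet overwrite, which deletes the old files before writing the new ones, with the same overwrite against a Delta table, which is a single atomic commit:

    # Sketch contrasting a plain-file overwrite with a Delta overwrite (paths hypothetical).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "order_id")

    # Plain Parquet: mode("overwrite") deletes the existing files first, then writes
    # the new ones. No locking, not atomic -- a failure mid-write can lose the old
    # data and leave only partial new data behind.
    df.write.mode("overwrite").parquet("/data/orders_parquet")

    # Delta: the same overwrite is one atomic commit in the transaction log, and
    # readers keep seeing the previous snapshot until the new version is committed.
    df.write.format("delta").mode("overwrite").save("/data/orders_delta")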

#2. Schema enforcement

Since Spark is schema-on-read, there can be unintended consequences from writing different data types to the same column – say an age column written as type int in the first batch of files and as type decimal in another batch. You will only get an error when you read the data back.
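A minimal sketch of the difference, assuming a Delta table at a hypothetical path: the mismatched second batch is rejected at write time instead of surfacing later at read time.

    # Sketch of schema enforcement (path and column names are hypothetical).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # First batch: age is numeric.
    batch1 = spark.createDataFrame([("alice", 34)], ["name", "age"])
    batch1.write.format("delta").mode("overwrite").save("/data/people_delta")

    # Second batch: age arrives as a string. Plain Parquet would happily write it
    # and only fail when the data is read back; Delta checks the schema on write
    # and rejects the append.
    batch2 = spark.createDataFrame([("bob", "thirty")], ["name", "age"])
    try:
        batch2.write.format("delta").mode("append").save("/data/people_delta")
    except Exception as e:
        print("Write rejected by schema enforcement:", e)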

#3. Data skipping

Databases make use of secondary indexes to improve query performance; an index allows you to skip reading unwanted data. Spark does not use indexes, so when you filter data in your Spark queries it is inefficient to read all the data. Spark allows you to partition the data so that you only read the required partitions – in other words, skip partitions (see the sketch after the list below). Having too many partitions is inefficient, though, as it can cause the “small files” issue. Partitioning is only good for the following types of columns:

1) Chronological columns as in year, month, etc

2) Low cardinality ( i.e. a low number of distinct values ) columns as in country, state, city, etc.

3) Combining 2-3 low cardinality columns as in country & state.
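Here is a hedged sketch of such partitioning (path, column names and filter values are made up for illustration): the write lays the data out by year, month and country, and a query filtering on those columns reads only the matching partitions.

    # Sketch: partitioning so queries filtering on year/month/country can skip
    # irrelevant partitions (paths and columns are hypothetical).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.read.json("/data/events_raw")

    (events
     .withColumn("year", F.year(F.to_timestamp("event_time")))
     .withColumn("month", F.month(F.to_timestamp("event_time")))
     .write
     .partitionBy("year", "month", "country")   # chronological + low-cardinality columns
     .format("delta")
     .mode("overwrite")
     .save("/data/events_by_month"))

    # A filter on the partition columns reads only the matching directories:
    recent_au = (spark.read.format("delta").load("/data/events_by_month")
                 .where("year = 2021 AND month = 3 AND country = 'AU'"))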

#4. Small files issue

When you ingest data via streaming, say every 5 minutes, there is a chance that you end up with too many small (i.e. smaller than the 128 MB block size) files. Small files are inefficient because of:

1) Inefficient compression, listing & file open/close.

2) Excessive metadata to be stored on the name node.

The MPP (i.e. Massively Parallel Processing) databases like Delta Lake from Databricks, Snowflake, Redshift, BigQuery, etc. address the above shortcomings for more orderly data.
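One common remedy, sketched below under the assumption of a Delta table at a hypothetical path, is to periodically compact the small files by rewriting the table into fewer, larger files. The dataChange=false option signals to downstream streaming readers that only the file layout changed, not the data itself.

    # Sketch: compacting a Delta table's small files into fewer, larger ones
    # (path and target file count are hypothetical).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    path = "/data/events_delta"

    (spark.read.format("delta").load(path)
     .repartition(16)                      # rewrite into ~16 larger files
     .write
     .option("dataChange", "false")
     .format("delta")
     .mode("overwrite")
     .save(path))

    # On Databricks (and newer open-source Delta releases) the same can be done
    # declaratively, e.g.: OPTIMIZE delta.`/data/events_delta` ZORDER BY (event_id)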

Delta Lake is an open-source addition that brings capabilities such as ACID transactions, scalable metadata handling, unification of streaming and batch data processing, version control and indexing to a data lake. Delta Lake provides snapshot isolation for concurrent read/write operations and enables efficient inserts, updates, deletes, and rollbacks. It allows background file optimisation through compaction and Z-order partitioning to achieve better performance.
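As a small illustrative sketch (the path, version and timestamp values are assumptions), the transaction log lets you query an earlier snapshot of a table, which is what makes rollbacks and reproducible experiments possible:

    # Sketch: time travel on a Delta table (path, version and timestamp are hypothetical).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Current state of the table.
    current = spark.read.format("delta").load("/data/events_delta")

    # The same table as it was at an earlier commit (version numbers start at 0).
    version_0 = (spark.read.format("delta")
                 .option("versionAsOf", 0)
                 .load("/data/events_delta"))

    # Or as of a point in time.
    as_of_march = (spark.read.format("delta")
                   .option("timestampAsOf", "2021-03-01")
                   .load("/data/events_delta"))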

Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

  • ACID transactions for Spark with serializable isolation levels to ensure that readers never consume inconsistent data.
  • Scalable metadata handling leveraging Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease. Prior to this you relied on the Hive metastore.
  • Streaming and batch unification where Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic back-fill, interactive queries all just work out of the box.
  • Schema enforcement to prevent insertion of bad records during ingestion.
  • Data versioning enables data rollbacks, full historical commit logs, and reproducible machine learning experiments.
  • Upserts with Managed Delta Lake on Databricks (also coming soon to the open-source Delta Lake): The MERGE command allows you to efficiently upsert and delete records in your data lakes. MERGE simplifies how a number of common data pipelines can be built; the complicated multi-step processes that inefficiently rewrote entire partitions can now be replaced by simple MERGE queries. This finer-grained update capability simplifies how you build your big data pipelines for Slowly Changing Dimensions (aka SCD) by maintaining commit logs. A hedged MERGE sketch follows this list.
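Below is a hedged sketch of such an upsert using the Delta Lake Python API (the table paths and column names are hypothetical):

    # Sketch: upserting change records into a Delta dimension table with MERGE.
    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    customers = DeltaTable.forPath(spark, "/data/dim_customers")
    updates = spark.read.format("delta").load("/data/customer_updates")

    # Update matching customers in place and insert the new ones -- no need to
    # rewrite whole partitions by hand.
    (customers.alias("c")
     .merge(updates.alias("u"), "c.customer_id = u.customer_id")
     .whenMatchedUpdate(set={"email": "u.email", "updated_at": "u.updated_at"})
     .whenNotMatchedInsertAll()
     .execute())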

