Modern data architectures will have both the Data Lakes & Data Warehouses. The Data Engineers build the data pipelines for the data analysts and scientists to build business reports & models to analyse the data.
Q1. What questions do you need to ask for choosing a Data Warehouse over a Data Lake for your BI (i.e. Business Intelligence) reporting?
A1. The gap between a data lake & a data warehouse is narrowing with the advent of MPP (i.e. Massively Parallel Processing) systems like delta lake from Databricks, Snowflake, Amazon Redshift, etc. Often you will need both, and they will compliment each other.Still, it pays to ask the right questions as both compliment each other.
1. Who is going to consume my data?
Since the data within data lakes may not be curated and can originate from non operational systems like IoT devices, call detail records from cell towers, etc it isn’t a good fit for the average business analytics user. Data lakes are the playground for data scientists and expert data analysts. So, always ask – Is it going to be consumed by just
a few people
like Data Scientists & analysts or by the
from various business units like sales, marketing, call centre staff, CEO, etc. Data warehouses appeal to the masses as the data is structured & processed to be
with schema enforcement & ACID transactions.
2. Am I using it for orderly info or open-minded data discovery?
Data lakes offer flexible solutions with vast & varied data – both structured & unstructured data for open-mindeddata discovery
A Data Lake is a place where new data can first enter without any specific use case for it. It’s a place where any kind of data can reside. So it’s a great source to discover new ideas and experiment with data. Thisopenness
comes at the cost of lack of meaningful structure, which a Data warehouse has. Data lake can be a bit of a mess for a larger mass of business users. In Data Warehousing the data conform to dimensions and measures. Easily queryable & consumable with consistent & governed data models.
3. Do I need consistent results?
Data warehouses are more orderly with ACID compliance. For example, Amazon Redshift, Google BigQuery, Snowflake, Delta Lake from Databricks, etc. Delta Lake eases the ETL workload by enabling ACID transactions in a warehousing environment. Coupling this with structured streaming, you can achieve a low latency data warehouse.
4. Am I querying billions of records or just a few millions?
Data lake based technologies can handle petabytes of data, and you can query billions of records. The modern data warehouses like Amazon Redshift, Google BigQuery, Snowflake, Delta Lake from Databricks, etc fall into the massively parallel processing (i.e. MPP), and can handle billions of records.
5. Am I going to have a few reports or hundreds of them?
When you are going to have 100s of reports the data warehouses are more suited as the data is more orderly with ACID compliance, support for various SCD types, and conform to dimensions and measures. Easily queryable & consumable with consistent & governed data models.
Q2. Why do we need delta lake for Spark?
A2. Delta lake for Spark addresses the following short-comings in Spark.
#1. ACID complianceThe Spark documentation says that:
Save operations can optionally take a SaveMode, which specifies how to handle existing data if present. It is important to realise that these save modesdo not utilise any locking and are NOT atomic
Additionally, when performing an Overwrite, the data will be deleted before writing out the new data.
Overwrite = DELETE + WRITE
This can cause data consistency, isolation & durability issues such as writing only partial data, losing data (i.e. if delete succeeds & write fails), etc when a runtime exception is thrown.
#2. Schema enforcement
Since Spark is schema on read, there could be unintended consequences caused by writing different data types to the same column say an age column as in first batch of files with type int, and another batch of files with type decimal. You will only get an error when you read the data.
#3. Data skipping
Databases make use of secondary indexes to improve query performance. Index allow you to skip reading unwanted data. Spark does not use indexes. When you filter data in your Spark queries, it is inefficient to read all the data. Spark allows you to partition the data so that you can only read the required partitions. In other words skip partitions. Having too many partitions is inefficient as it can cause “small files” issue. Partitioning is only good for following types of columns:
1) Chronological columns as in year, month, etc
2) Low cardinality ( i.e. a low number of distinct values ) columns as in country, state, city, etc.
3) Combining 2-3 low cardinality ( i.e. a low number of distinct values ) columns as in country
#4. Small files issue
When you ingest data via streaming say every 5 minutes, there is a chance that you end up with too many small (i.e. smaller than the 128 MB block size) files. Small files are inefficient because
1) Inefficient compression, listing & file open/close.
2) Excessive metadata to be stored on the name node. The MPP (i.e. Massively Parallel Processing) databases like Delta Lake from Databricks, Snowflake, Redshift, Big Query, etc address the above shortcomings for a more orderly data.
Delta Lake is an open-source addition that brings the capabilities such as ACID transactions, scalable metadata handling, unification of streaming and batch data processing, version control and indexing to a data lake. Delta lake provides snapshot isolation for concurrent read/write operations and enables efficient insert, update, deletes, and rollback capabilities. It allows background file optimisation through compaction and z-order partitioning to achieve better performance.
Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
- ACID transactions for Spark with Serializable isolation levels to ensure that inconsistent data are not consumed by the readers.
- Scalable metadata handling leveraging the Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease. Prior to this you relied on Hive metastore.
- Streaming and batch unification where Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic back-fill, interactive queries all just work out of the box.
- Schema enforcement to prevent insertion of bad records during ingestion.
- Data versioning enables data rollbacks, full historical commit logs, and reproducible machine learning experiments.
- Upserts with Managed Delta Lake on Databricks (also coming soon to the open source Delta Lake): The MERGE command allows you to efficiently upsert and delete records in your data lakes. MERGE simplifies how a number of common data pipelines can be built. The complicated multi-step processes that inefficiently rewrote entire partitions can now be replaced by simple MERGE queries. This finer-grained update capability simplifies how you build your big data pipelines for Slowly Changing Dimension (aka SCD) by maintaining commit logs.