What is all the hype about becoming a (Big) Data Engineer? Data Engineers are in demand because organisations have been ramping up their investments in big-data-related projects since 2019. Why Big Data?
Confused about the various roles like Data Engineer, Technical Business Analyst, DevOps Engineer, Data Scientist, etc.? Big Data projects will often have all of the above roles complementing each other, and their skills overlap: Data Scientists need basic programming skills, Data Engineers need basic DevOps skills, and Technical Business Analysts need good SQL & scripting skills. What do data analysts, engineers & scientists do?
#1: Know your data, how it is modelled & stored
You can neither properly analyse the data nor write correct SQL without understanding data modelling concepts: facts vs dimensions, Slowly Changing Dimensions (SCD), Entity Relationship (ER) diagrams, master data, transactional data, reference data, historical data, snapshot data, delta data, metadata (e.g. ETL vs load timestamps, ETL IDs, etc.), partitioning of the data, etc.
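As a sketch of the SCD idea above, here is a minimal Type 2 update in Python: when a tracked attribute changes, the current dimension row is closed off and a new current row is opened. The customer dimension, its columns and the tracked attribute are all hypothetical.

```python
from datetime import date

# Hypothetical SCD Type 2 update: close the current record and
# open a new one when a tracked attribute (here, "city") changes.
def scd2_update(dim_rows, key, new_city, as_of):
    updated = []
    for row in dim_rows:
        if row["customer_id"] == key and row["is_current"] and row["city"] != new_city:
            # Close the existing current record.
            updated.append(dict(row, valid_to=as_of, is_current=False))
            # Open a new current record with the changed attribute.
            updated.append({
                "customer_id": key, "city": new_city,
                "valid_from": as_of, "valid_to": None, "is_current": True,
            })
        else:
            updated.append(row)
    return updated

dim = [{"customer_id": 1, "city": "Perth",
        "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True}]
dim = scd2_update(dim, key=1, new_city="Sydney", as_of=date(2021, 6, 1))
# dim now holds the closed historical row plus the new current row.
```

Full history is preserved, which is exactly what distinguishes SCD Type 2 from a simple overwrite (Type 1).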
#2: Know the data warehouse concepts
Data engineers must understand how data warehouses work. You can’t write good SQL without knowing data warehouse concepts like the grain of the data, facts (aka measures) vs dimensions, star vs snowflake schemas, change data capture, the data vault model, etc.
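The fact-vs-dimension idea can be shown with a tiny star schema, runnable via Python's built-in sqlite3 module; the table and column names below are made up for illustration.

```python
import sqlite3

# Minimal star schema: a fact table at order-line grain joined to a
# product dimension (illustrative names only).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (product_id INTEGER, sale_date TEXT, amount REAL);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO fact_sales VALUES (1, '2023-01-01', 20.0),
                              (1, '2023-01-02', 30.0),
                              (2, '2023-01-01', 50.0);
""")
# Measures live in the fact table; descriptive attributes in dimensions.
rows = con.execute("""
    SELECT p.category, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
# rows → [('Books', 50.0), ('Games', 50.0)]
```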
#3 Know your SQL
Once you know how the data is captured & stored, you must know SQL to work with it. You must have intermediate to advanced SQL skills to write efficient queries that extract, transform, analyse & profile the data across many different business scenarios.
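One intermediate-to-advanced pattern worth knowing is "latest record per key" using a window function, sketched here with Python's built-in sqlite3 (table and column names are illustrative).

```python
import sqlite3

# Pick the most recent row per account with ROW_NUMBER(),
# a very common deduplication / snapshot pattern.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE txn (account TEXT, ts TEXT, balance REAL);
INSERT INTO txn VALUES
  ('A', '2023-01-01', 100.0),
  ('A', '2023-01-03', 150.0),
  ('B', '2023-01-02', 80.0);
""")
latest = con.execute("""
    SELECT account, balance FROM (
        SELECT account, balance,
               ROW_NUMBER() OVER (PARTITION BY account ORDER BY ts DESC) AS rn
        FROM txn
    ) WHERE rn = 1
    ORDER BY account
""").fetchall()
# latest → [('A', 150.0), ('B', 80.0)]
```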
#4: Know what Big Data is
It is not just the three V’s (Volume, Velocity & Variety). There is a lot more to it: the Big Data challenges, performance & memory considerations, best practices, etc.
#5: Distributed storage & computing basics
You must know the “Distributed Computing” fundamentals. The architecture, challenges, solutions, when to use which database, transaction management, high availability, consistency, partition tolerance, CAP theorem, etc.
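A classic fundamental here is quorum sizing in replicated storage: with N replicas, a read quorum R and write quorum W are guaranteed to overlap (so a read sees the latest write) when R + W > N. A toy check, with illustrative numbers:

```python
# Quorum overlap rule for N replicas: a read of R replicas is
# guaranteed to intersect a write to W replicas when R + W > N.
def quorum_overlaps(n, r, w):
    return r + w > n

# With 3 replicas, read-2 / write-2 always overlaps:
assert quorum_overlaps(n=3, r=2, w=2)
# Read-1 / write-1 can miss the latest write (stale reads possible):
assert not quorum_overlaps(n=3, r=1, w=1)
```

This trade-off between R, W and latency is one concrete face of the consistency-vs-availability tension the CAP theorem describes.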
#6: NoSQL databases
“Not only SQL” (NoSQL) databases are used to solve modern data challenges. When to use which database? OLTP vs OLAP, the CAP theorem, read-heavy vs write-heavy workloads, usage patterns, etc.
#7: Unix skills
You must know your Unix, as the jobs that ingest & transform data are often run in a Unix environment.
#8: Version control systems like Git
Your SQL code, Unix shell scripts, pipeline code & job schedules will be stored in a code repository like Git. Many project teams will be modifying things concurrently, hence the code versions & changes must be managed properly.
#9: Regular Expressions are used everywhere
A REGular EXpression (regex) is a powerful tool for those who work with data. You can validate your data, manipulate your data & do a lot more to make your monotonous tasks easy.
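Two everyday regex tasks, validating a field and normalising messy values, sketched in Python; the patterns are deliberately simplified for illustration.

```python
import re

# Validate that a field looks like an ISO date (simplified pattern:
# it checks the shape, not calendar validity).
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def is_iso_date(value):
    return bool(DATE_RE.match(value))

# Collapse runs of whitespace, a common cleanup for free-text fields.
def collapse_whitespace(value):
    return re.sub(r"\s+", " ", value).strip()

assert is_iso_date("2023-07-01")
assert not is_iso_date("01/07/2023")
assert collapse_whitespace("  New   York ") == "New York"
```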
#10: DevOps basics
Nowadays Data Engineers must have a basic understanding of DevOps: Docker, Kubernetes, Jenkins, build tools, deployment tools, etc.
#11: Cloud computing
Data can be stored on-premises, in the cloud, or in a hybrid setup. Organisations either already have their data warehouses & lakes in the cloud or are in the process of migrating to it. For example, Databricks Lakehouse, Snowflake, Amazon Redshift & Google BigQuery are some of the MPP (i.e. Massively Parallel Processing) systems.
#12: Data Security
PII (i.e. Personally Identifiable Information), data access controls, data masking, data classification, etc. Data assets must be protected & used only for their intended purpose. There will be specialist teams responsible for data security & governance, but all data engineers must have a basic understanding.
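A small sketch of two common masking techniques: partial redaction for display, and deterministic hashing (tokenisation) so values can still be joined without exposing the raw PII. The helper names and the salt are placeholders, not a production scheme.

```python
import hashlib

# Partially redact an email for display purposes.
def mask_email(email):
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

# Deterministically hash a value so datasets can be joined on the
# token instead of the raw PII. The salt is a placeholder; real
# systems manage salts/keys via a secrets service.
def tokenise(value, salt="demo-salt"):
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

assert mask_email("alice@example.com") == "a***@example.com"
# Same input always yields the same token, so joins still work:
assert tokenise("alice@example.com") == tokenise("alice@example.com")
```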
#13: Data Governance & Metadata management
Lack of good data governance can lead to data inconsistencies & regulatory non-compliance. Without proper governance, a data lake can quickly become a data swamp. Data lineage must be properly tracked, and an extensible metadata registry is required to provide data discovery & data lineage management functions.
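At its simplest, lineage tracking records which upstream datasets feed each derived dataset, so that the full ancestry of any table can be walked on demand. A minimal sketch, with invented dataset names:

```python
# Each dataset maps to the upstream datasets it is derived from.
lineage = {
    "sales_summary": ["raw_sales", "dim_product"],  # illustrative names
    "raw_sales": [],
    "dim_product": ["raw_products"],
    "raw_products": [],
}

# Walk the graph to list every upstream ancestor of a dataset.
def upstream(dataset, graph):
    seen = []
    for parent in graph.get(dataset, []):
        if parent not in seen:
            seen.append(parent)
            seen.extend(p for p in upstream(parent, graph) if p not in seen)
    return seen

assert upstream("sales_summary", lineage) == [
    "raw_sales", "dim_product", "raw_products"
]
```

Real metadata registries (data catalogues) add ownership, classifications and column-level lineage on top of this basic graph.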
#14: Data Analytics & Science
Cleansing & preprocessing data. What do you understand by the terms personalisation, next best offer, next best action, and recommendation engines? Clickstream vs ecommerce vs geolocation analytics? What is machine learning?
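Cleansing & preprocessing is typically the first step before any of the analytics above. A tiny pass in plain Python, assuming hypothetical field names: trim whitespace, standardise case, coerce types, and drop records missing a key field.

```python
# Raw records as they might arrive from an upstream feed
# (field names are made up for illustration).
raw = [
    {"email": " Alice@Example.com ", "spend": "120.5"},
    {"email": "", "spend": "40"},            # missing key field
    {"email": "bob@example.com", "spend": "75"},
]

# Keep only records with an email; normalise text and coerce numbers.
clean = [
    {"email": r["email"].strip().lower(), "spend": float(r["spend"])}
    for r in raw
    if r["email"].strip()
]

assert len(clean) == 2
assert clean[0] == {"email": "alice@example.com", "spend": 120.5}
```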
The skills below are specific to roles such as Apache Spark developer, PySpark developer, Ab Initio developer, etc. Having said this, knowing at least one programming language like Python is a plus for a data engineer. For some Data Engineer roles, relevant programming skills are a must, depending on the toolsets used. For example, some organisations use PySpark or Spark with Scala to build data pipelines, whilst other organisations have built ingestion & transformation frameworks on top of Spark, requiring data engineers to have only strong SQL, Unix, Git, REGEX & DevOps skills. Alternatively, ETL tools like Alteryx, Ab Initio, etc. can be used to build data pipelines, again requiring solid SQL, Unix, Git, REGEX & DevOps skills.
#15: Programming Skills
Python, Java, Scala, etc. are popular programming languages for data engineering.
#16: Spark API skills
Apache Spark is a popular Big Data processing engine, with APIs available in Python, Scala & Java. Apache Spark is available in the cloud via PaaS offerings like Azure HDInsight, Amazon Web Services EMR, Google Dataproc, etc. Databricks is a unified analytics platform built on top of Apache Spark, available as a SaaS in the cloud.
#17: Data ingestion & transformation tools (aka ETL tools)
There are many ETL tools available on the market, for example Alteryx, Ab Initio, and the list goes on.