A roadmap to become a Big Data Engineer – What skills are required?

What is all the hype about becoming a (Big) Data Engineer? There is a demand for Data Engineers as organisations have been ramping up their investments in big data related projects since 2019. Why Big Data?

Big Data Engineer road map. What skills are required?

Confused about the various roles like Data Engineer, Technical Business Analyst, DevOps Engineer, Data Scientist, etc.? Big Data projects often have all of the above roles complementing each other. There will be an overlap in skills, as Data Scientists will have basic programming skills, Data Engineers will be required to have basic DevOps skills, and Technical Business Analysts will be required to have good SQL & scripting skills. What do data analysts, engineers & scientists do?

#01: SQL

Once you know how the data is captured & stored, you need SQL to work with it. You must have intermediate to advanced SQL skills to write efficient queries that extract, transform, analyse & profile the data across many different business scenarios.
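
For example, a quick profiling query might look something like the sketch below. This is a minimal example run via Spark SQL; pyspark is assumed to be installed, and the table & column names are made up for illustration.

```python
# Minimal sketch: profiling a hypothetical `orders` table with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-profiling-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, "AU", 120.50), (2, "NZ", None), (3, "AU", 75.00)],
    ["order_id", "country", "amount"],
)
orders.createOrReplaceTempView("orders")

# Profile the data: row counts, null counts & a basic aggregate per country.
spark.sql("""
    SELECT country,
           COUNT(*)                                         AS row_count,
           SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END)  AS null_amounts,
           ROUND(AVG(amount), 2)                            AS avg_amount
    FROM   orders
    GROUP  BY country
""").show()
```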

80+ SQL Interview FAQs

#02: Programming Skills

Python, Java, Scala, etc. are popular programming languages in data engineering, and Apache Spark can be programmed in any one of them.

100+ Python Interview FAQs

150+ Scala interview FAQs

300+ Core Java interview Q&As

#03: Spark API skills

Apache Spark is a popular Big Data processing engine, and its API is available in Python, Scala & Java. Apache Spark can be run on the cloud via PaaS offerings like Azure HDInsight, Amazon Web Services EMR, Google Dataproc, etc. Databricks is a unified analytics platform built on top of Apache Spark and is available as a SaaS on the cloud.
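
Below is a minimal PySpark sketch of the kind of read-transform-write job a data engineer writes day to day. It assumes pyspark is installed; the file paths & column names are hypothetical.

```python
# PySpark DataFrame API sketch: read a CSV, filter, derive a column, aggregate & write.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-api-sketch").getOrCreate()

# Read a hypothetical CSV file of trades.
trades = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/trades.csv"))

# Filter, derive a column & aggregate using the DataFrame API.
daily_totals = (trades
                .filter(F.col("status") == "SETTLED")
                .withColumn("trade_date", F.to_date("trade_timestamp"))
                .groupBy("trade_date", "currency")
                .agg(F.sum("notional").alias("total_notional"))
                .orderBy("trade_date"))

# Write the result out as parquet.
daily_totals.write.mode("overwrite").parquet("/data/output/daily_totals")
```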

80+ Spark interview Q&As

30+ Spark SQL interview Q&As

#04: Data ingestion & transformation tools (aka ETL tools)

There are many ETL tools available on the market, for example Alteryx, Ab Initio, and the list goes on.
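
Where no packaged ETL tool is used, the same extract-transform-load stages are often hand rolled in code. A rough sketch in plain Python follows; pandas (and pyarrow for parquet) are assumed to be installed, and the file paths & columns are made up.

```python
# Hand-rolled ETL sketch: extract a raw file, transform it, load the curated output.
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Extract: read the raw source file."""
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: de-duplicate, standardise & derive fields."""
    df = df.drop_duplicates(subset=["customer_id"])
    df["email"] = df["email"].str.lower().str.strip()
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    return df

def load(df: pd.DataFrame, target: str) -> None:
    """Load: write the curated data out, e.g. as parquet (needs pyarrow)."""
    df.to_parquet(target, index=False)

if __name__ == "__main__":
    # Hypothetical source & target paths.
    load(transform(extract("/data/raw/customers.csv")),
         "/data/curated/customers.parquet")
```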

Sqoop & NiFi interview Q&As

#05: Know your data, how it is modelled & stored

You can neither properly analyse the data nor write correct SQL without understanding data modelling concepts: facts vs dimensions, Slowly Changing Dimensions (i.e. SCD), Entity Relationship (i.e. ER) diagrams, master data, transactional data, reference data, historical data, snapshot data, delta data, metadata (e.g. ETL vs load timestamps, ETL ids, etc.), partitioning of the data, and so on.
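
As an illustration, a much simplified Slowly Changing Dimension Type 2 update might look like the sketch below. It assumes pyspark is installed; the dimension, its columns & the change-detection logic are deliberately simplified, and only the open (is_current) rows are shown.

```python
# Simplified SCD Type 2 sketch: close the old version of a changed row & insert the new one.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

current = spark.createDataFrame(
    [(1, "Alice", "Sydney", "2023-01-01", "9999-12-31", True)],
    ["customer_id", "name", "city", "effective_date", "end_date", "is_current"])
incoming = spark.createDataFrame(
    [(1, "Alice", "Melbourne")], ["customer_id", "name", "city"])

# Detect customers whose tracked attribute (just `city` here) has changed.
changed = (incoming.alias("n")
           .join(current.alias("o"), "customer_id")
           .filter(F.col("n.city") != F.col("o.city")))
changed_keys = changed.select("customer_id")

today = F.current_date().cast("string")

# Close the old version, insert the new version, keep unchanged rows as-is.
closed = (current.join(changed_keys, "customer_id")
          .withColumn("end_date", today)
          .withColumn("is_current", F.lit(False)))
new_rows = (changed.select("customer_id", "n.name", "n.city")
            .withColumn("effective_date", today)
            .withColumn("end_date", F.lit("9999-12-31"))
            .withColumn("is_current", F.lit(True)))
unchanged = current.join(changed_keys, "customer_id", "left_anti")

dim_customer = unchanged.unionByName(closed).unionByName(new_rows)
dim_customer.show()
```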

Data modelling interview Q&As

#06: Know the data warehouse concepts

Data engineers must understand how data warehouses work. You can’t write good SQL without knowing data warehouse concepts like the grain of the data, fact (aka measure) vs dimension, star vs snowflake schema, change data capture, the data vault model, etc.
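
For example, a star-schema query joins a fact table to its dimensions and rolls the measure up from its grain to a dimension attribute. A hedged sketch (pyspark assumed installed; tables & columns are made up):

```python
# Star-schema sketch: a fact table at the grain of one row per sale, joined to a dimension.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

spark.createDataFrame(
    [(101, 1, 20240101, 250.0), (102, 2, 20240101, 99.0)],
    ["product_id", "store_id", "date_id", "sale_amount"],
).createOrReplaceTempView("fact_sales")          # grain: one row per sale
spark.createDataFrame(
    [(101, "Laptop", "Electronics"), (102, "Desk", "Furniture")],
    ["product_id", "product_name", "category"],
).createOrReplaceTempView("dim_product")

# Roll the fact (measure) up to the grain of one row per product category.
spark.sql("""
    SELECT d.category,
           SUM(f.sale_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_product d ON f.product_id = d.product_id
    GROUP  BY d.category
""").show()
```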

Data warehouse interview Q&As

#07: Know what Big Data is

Big Data is not just the triple V’s, i.e. Volume, Velocity & Variety. There is a lot more to it: understanding the Big Data challenges, performance & memory considerations, best practices, etc.

Big Data Architectures & concepts Q&As

#08: Distributed storage & computing basics

You must know the “Distributed Computing” fundamentals: the architecture, challenges and solutions, when to use which database, transaction management, high availability, consistency, partition tolerance, the CAP theorem, etc.

Distributed Computing interview Q&As

CAP Theorem Q&As

Big Data Architectures Q&As

YARN & Zoo Keeper interview Q&As

Apache Kafka interview Q&As

#09: NoSQL databases

NoSQL (i.e. Not only SQL) databases are used to solve modern data challenges. Know when to use which database: OLTP vs OLAP, the CAP theorem, more reads vs writes, usage patterns, etc.

NoSQL interview Q&As

#10: Unix

You must know your Unix as the jobs to ingest & transform data are often run in a Unix environment.

Unix interview Q&As

#11: Version control systems like Git

Your SQL code, Unix shell scripts, pipeline code & job schedules will be stored in a code repository like Git. Many project teams will be modifying things concurrently, hence the code versions & changes must be managed properly.

Git interview Q&As

#12: Regular Expressions are used everywhere

REGular EXpressions are a powerful tool for those who work with data. You can validate your data, manipulate your data & do a lot more to make your monotonous tasks easier.
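
A couple of small examples with Python’s built-in re module; the patterns & sample records are made up for illustration.

```python
# Validate fields against expected formats & reformat data with regular expressions.
import re

records = ["2024-01-15,alice@example.com", "15/01/2024,bob[at]example"]

DATE_ISO = re.compile(r"^\d{4}-\d{2}-\d{2}$")
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")

for record in records:
    date_str, email = record.split(",")
    # Validate each field against its expected format.
    print(record,
          "date ok" if DATE_ISO.match(date_str) else "bad date",
          "email ok" if EMAIL.match(email) else "bad email")

# Manipulate data: reformat dd/mm/yyyy dates to ISO yyyy-mm-dd.
print(re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\2-\1", "15/01/2024"))
```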

REGEX interview Q&As

#13: DevOps & CI/CD Tools

Nowadays Data Engineers must have a basic understanding of DevOps: Docker, Kubernetes, Jenkins, build tools, deployment tools, etc.

DevOps interview Q&As

#14: Cloud computing

Data can be stored on infrastructure that is on premises, on the cloud, or hybrid. Organisations either already have their data warehouses & lakes on the cloud or are in the process of migrating to the cloud. For example, Databricks Lakehouse, Snowflake, Amazon Redshift, Google BigQuery, etc. are some of the MPP (i.e. Massively Parallel Processing) systems. Watch this space for:

AWS interview Q&As

PySpark on Databricks

#15: Data Security

Understand PII (i.e. Personally Identifiable Information), data access controls, data masking, data classification, etc. Data assets must be protected & used for the intended purpose only. There will be specialist teams responsible for data security & governance, but all data engineers must have a basic understanding.
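
As an illustration, a basic masking step before data is shared more widely might look like the sketch below. It assumes pyspark is installed; the columns & masking choices are hypothetical.

```python
# Minimal data-masking sketch: hash one PII column, redact another & drop a third.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("masking-sketch").getOrCreate()

customers = spark.createDataFrame(
    [("Alice Smith", "alice@example.com", "0412 345 678", "VIC")],
    ["full_name", "email", "phone", "state"])

masked = (customers
          .withColumn("email", F.sha2(F.col("email"), 256))            # pseudonymise
          .withColumn("phone", F.regexp_replace("phone", r"\d", "*"))  # redact digits
          .drop("full_name"))                                          # drop outright

masked.show(truncate=False)
```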

Data Security interview Q&As

#16: Data Governance & Metadata management

Lack of good data governance can lead to data inconsistencies & regulatory non-compliance. A data lake can quickly become a data swamp without proper data governance. The data lineage must be properly tracked, and an extensible metadata registry is required to provide data discovery & data lineage management functions.

Data Governance interview Q&As

#17: Data Analytics & Science

Data engineers often help with cleansing & preprocessing data. What do you understand by the terms personalisation, next best offer, next best action, and recommendation engines? Clickstream vs ecommerce vs geolocation analytics? What is machine learning?
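
A short cleansing & preprocessing sketch with pandas (assumed installed); the dataset & columns are made up.

```python
# Basic cleansing: de-duplicate, standardise text & fill missing values.
import pandas as pd

raw = pd.DataFrame({
    "age": [25, None, 42, 25],
    "country": ["au", "AU ", "nz", "au"],
    "spend": [100.0, 250.0, None, 100.0],
})

clean = (raw
         .drop_duplicates()
         .assign(country=lambda d: d["country"].str.strip().str.upper(),
                 age=lambda d: d["age"].fillna(d["age"].median()),
                 spend=lambda d: d["spend"].fillna(0.0)))

print(clean)
```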

Data Analytics & Science interview Q&As


A beginner or someone transitioning his/her career to Big Data needs at least one programming language like Python or Java. For some Data Engineering roles, relevant programming skills are a must depending on what toolsets are used. For example, some organisations use PySpark or Spark with Scala to build data pipelines, whilst other organisations have built ingestion & transformation frameworks on top of Spark, requiring data engineers to only have strong SQL, Unix, Git, REGEX & DevOps skills. Alternatively, ETL tools like Alteryx, Ab Initio, etc. can be used to build data pipelines, requiring solid SQL, Unix, Git, REGEX & DevOps skills. Some of these frameworks enable Technical Business Analysts with strong SQL skills to build data pipelines. More & more organisations are building their data lakes on the cloud, hence cloud computing skills are essential.

800+ Java & Big Data Interview Q&As

200+ Java & Big Data Tutorials
