Blog Archives

00: 13 Data Warehouse interview Q&As – Fact Vs Dimension, CDC, SCD, etc – part 1

Q1. What is dimensional modelling in a Data Warehouse (i.e. DWH)? A1. A dimensional model is a data structure technique optimised for Data Warehousing tools (i.e. …



00: 13 Data Warehouse interview Q&As – Fact Vs Dimension, CDC, SCD, etc – part 2

This extends Q1 to Q5 in 13 Data Warehouse interview Q&As – Fact Vs Dimension, CDC, SCD, etc – part 1. Q6. …



00: Apache Spark eco system & anatomy interview Q&As

Q01. Can you summarise the Spark eco system?
A01. Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in Java,




00: Data Lake Vs. Data Warehouse Vs. Delta Lake

Modern data architectures have both Data Lakes & Data Warehouses. Data engineers build the data pipelines that data analysts and scientists use to build business reports &




00: Q1 – Q6 Hadoop based Big Data architecture & basics interview Q&As

There are a number of technologies to ingest & run analytical queries over Big Data (i.e. large volumes of data). Big Data is used in Business Intelligence (i.e. BI) reporting,




01: Getting started with Zookeeper tutorial

Installing Zookeeper on Windows. Step 1: Download Zookeeper from http://zookeeper.apache.org/. At the time of writing, the download is zookeeper-3.4.11.tar.gz. Step 2: Using 7-Zip on Windows, unpack the gzipped tar file into a...



01: Apache Flume with JMS source (Websphere MQ) and HDFS sink

Apache Flume is used in the Hadoop ecosystem for ingesting data. In this example, let’s ingest data from Websphere MQ. Step 1: Apache Flume is config driven. …



01: Apache Hadoop HDFS Tutorial

Step 1: Download the latest version of “Apache Hadoop common” from http://apache.claz.org/hadoop using wget, curl or a browser. This tutorial uses “http://apache.claz.org/hadoop/core/hadoop-2.7.1/”.

Step 2: You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.




01: AWS Q&As on VPC, Subnets, Availability Zones, VPN, Route tables, NACLs & Security Groups

The above diagram addresses many of the questions that follow. Q1. What is a VPC in AWS? A1. A virtual private cloud (VPC) is a virtual network dedicated to your...



01: Coding “Java way in Scala” Vs “Scala way in Scala”

Example #1: Read from a list & write to a list

Java Way in Scala

Output: List(Java Programming, Scala Programming,




01: Databricks getting started – Spark, Shell, SQL


Step 1:
Sign up for the Databricks community edition – https://databricks.com/try-databricks. Fill in the details; you can leave your mobile number blank. Select “




01: Docker tutorial with Java & Maven

Pre-requisite: Docker is installed on your machine for Mac OS X (E.g. $ brew cask install docker) or Windows 10. Docker interview Q&As.

Step 1: Create a Java project “




01: Getting started with Apache Kafka on Mac tutorial

Prerequisite: This tutorial assumes that Java 8 is installed. You can check this with

If Java is not installed, you can install it on Mac with:

Note: If you are using Windows,




01: Getting started with Python on Mac OS

Python is popular in Big Data & data science projects. This tutorial outlines the basic steps to get started with Python on Mac OS.

1. Install Xcode

Xcode can be installed via the Apple App Store.




01: Installing & getting started with Apache Storm on Cloudera quickstart

Step 1: Download the latest version of Storm (e.g. apache-storm-1.1.1.tar.gz) from http://storm.apache.org/downloads.html. On the Cloudera machine it will be downloaded to the folder “/home/cloudera/Downloads”. Step 2: Create a directory named “/opt/storm” …



01: Lambda, Kappa & Delta Data Architectures Interview Q&As – Overview

Q1. What is the Lambda Architecture? A1. It is a data-processing architecture designed to handle Big Data by using both real-time streaming (e.g. …



01: Learn Hadoop API by examples in Java

These Hadoop tutorials assume that you have installed Cloudera QuickStart, which has the Hadoop eco system like HDFS, Spark, Hive, HBase, YARN, etc.

What is Hadoop &




01: Python Iterators, Generators & Decorators Tutorial

Assumes that Python3 is installed as described in Getting started with Python.

1. Iterators

Iterators don’t compute the value of each item when instantiated. They only compute it when you ask for it.
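The lazy behaviour described above can be seen in a minimal sketch. The `Squares` class and the `computed` list are hypothetical names introduced only for this illustration; they are not taken from the tutorial.

```python
computed = []  # records which items have actually been computed so far


class Squares:
    """Hypothetical iterator yielding n * n for n = 0 .. limit - 1."""

    def __init__(self, limit):
        self.limit = limit
        self.n = 0

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.n >= self.limit:
            raise StopIteration
        value = self.n * self.n   # the work happens only here, on demand
        computed.append(self.n)
        self.n += 1
        return value


it = Squares(3)
items_computed_at_creation = list(computed)  # still empty: nothing computed yet

first = next(it)   # computes only the first item
rest = list(it)    # consumes the remaining items
```

Creating `Squares(3)` does no squaring at all; each value is produced only when `next()` (or a loop, or `list()`) asks for it.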




01: Q01 – Q07 General Big Data, Data Science & Data Analytics Interview Q&As

Q01. How is Big Data used in industries?
A01. The main goal for most organisations is to enhance customer experience, and consequently increase sales. The other goals include cost reduction,




01: Scala Functional Programming basics – pure functions, referential transparency & side effects

Q1. What is a pure function?
A1. A pure function is a function where the following conditions are met:

1) The input solely determines the output.
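As a rough illustration of condition 1, here is a minimal sketch written in Python rather than Scala. The function names `add_tax` and `add_to_total` are hypothetical and chosen only for this example.

```python
def add_tax(price, rate):
    """Pure: the result depends only on the arguments passed in."""
    return price * (1 + rate)


running_total = 0  # hidden state outside the function


def add_to_total(price):
    """Not pure: the result also depends on (and changes) running_total."""
    global running_total
    running_total += price
    return running_total


# Same input, same output, every time.
pure_first = add_tax(100, 0.5)
pure_second = add_tax(100, 0.5)

# Same input, different outputs, because hidden state is involved.
impure_first = add_to_total(10)
impure_second = add_to_total(10)
```

Calling `add_tax` twice with the same arguments gives the same result, whereas `add_to_total(10)` returns a different value on each call because its output is not determined by its input alone.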




01: Spark RDD joins in Scala tutorial

This tutorial extends Setting up Spark and Scala with Maven.

Step 1: Let’s take a simple example of joining a student to department.




01: Spark tutorial- writing a file from a local file system to HDFS

This tutorial assumes that you have set up Cloudera as per the “cloudera quickstart vm tutorial installation” YouTube videos, which you can find by searching Google or YouTube. You can install it on...



01. Setting up Scala & practicing the concepts via REPL the Scala way for Java developers

Scala runs on the JVM, so Java and Scala stacks can be freely mixed. You can call Java libraries from Scala. Having said this, it is very important that you learn to write code the Scala way,




01a: Convert XML file To Sequence File – writing & reading – Local File System

Sequence files are good for saving raw data into HDFS. Sequence files are compressible and splittable. They are also useful for combining a number of smaller files into a single...



01A: Spark on Zeppelin – Docker pull from Docker hub

Pre-requisite: Docker is installed on your machine for Mac OS X (E.g. $ brew cask install docker) or Windows 10. Docker interview Q&As.

What is Apache Zeppelin?




01b: Convert XML file To Sequence File – writing & reading – Hadoop File System (i.e HDFS)

This extends Convert XML file To Sequence File – writing & reading – Local File System. Step 1: Upload “report.xml” onto HDFS, e.g. using Cloudera HUE, onto the path...



01B: Spark on Zeppelin – custom Dockerfile

Pre-requisite: Docker is installed on your machine for Mac OS X (E.g. $ brew cask install docker) or Windows 10. Docker interview Q&As.

What is Apache Zeppelin?




01B: Spark tutorial – writing to HDFS from Spark using Hadoop API

Step 1: The “pom.xml” that defines the dependencies for Spark & Hadoop APIs. Step 2: The Spark job that writes numbers 1 to 10 to 10 different files on HDFS....



02: Apache Flume with Custom classes for JMS Source & HDFS Sink

This post extends 01: Apache Flume with JMS source (Websphere MQ) and HDFS sink to write Flume customization code. We will be customizing 3 things. 1) Customized JMS Source message...



02: Apache Kafka multi-broker cluster tutorial

This extends Getting started with Apache Kafka on Mac tutorial. This assumes that the zookeeper & kafka servers are started as per the previous tutorial. List topics Create a topic...


