Q1. What is Hadoop?
A1. Hadoop is an open-source software framework for storing large amounts of data and processing/querying those data on a cluster with multiple nodes of commodity hardware (i.e. low cost hardware). In short, Hadoop consists of
1. HDFS (Hadoop Distributed File System): HDFS allows you to store huge amounts of data in a distributed and a redundant manner. For example, a 1 GB (i.e 1024 MB) text file can be split into 16 * 128MB files and stored on 8 different nodes in a Hadoop cluster. Each split can be replicated 3 times for fault tolerance so that if 1 node goes down, you have backups. HDFS is good for sequential write-once-and-read-many times type access.
2. MapReduce: A computational framework. This processes large amounts of data in a distributed and parallel manner. When you do a query on the above 1 GB file for all users with age > 18, there will be say “8 map” functions running in parallel to extract users with age > 18 within its 128MB split file, and then the “reduce” function will run to combine all the individual outputs into a single final result.
3. YARN (Yet Another Resource Nagotiator): A framework for job scheduling and cluster resource management.
4. Hadoop eco system, with 15+ frameworks & tools like Sqoop, Flume, Kafka, Pig, Hive, Spark, Impala, etc to ingest data into HDFS, to wrangle data (i.e. transform, enrich, aggregate, etc) within HDFS, and to query data from HDFS for business intelligence & analytics. Some tools like Pig & Hive are abstraction layers on top of MapReduce, whilst the other tools like Spark & Impala are improved architecture/design from MapReduce for much improved latencies to support near real-time (i.e. NRT) & real-time processing.
Q2. Why are organizations moving from traditional data warehouse tools to smarter data hubs based on Hadoop eco systems?
A2. Organizations are investing on enhancing their
Existing data infrastructure:
- predominantly using “structured data” stored in high-end & expensive hardwares
- predominantly processed as ETL batch jobs for ingesting data into RDBMS and data warehouse systems for data mining, analysis & reporting to make key business decisions.
- predominantly handle data volumes in gigabytes to terabytes
Smarter data infrastructure based on Hadoop where
- structured (e.g. RDBMS), unstructured (e.g, images, PDFs, docs ), & semi-structured (e.g. logs, XMLs) data can be stored in cheaper commodity machines in a scalable and fault tolerant manner.
- data can be ingested via batch jobs and near real time (i.e. NRT, 200ms to 2 seconds) streaming (e.g. Flume & Kafka).
- data can be queried with low latency (i.e under 100ms) capabilities with tools like Spark & Impala.
- larger data volumes in terabytes to petabytes can be stored.
which empowers organizations to make better business decisions with smarter & bigger data with more powerful tools to ingest data, to wrangle stored data (e.g. aggregate, enrich, transform, etc), and to query the wrangled data with low-latency capabilities for reporting & business intelligence.
Q3. How does a smarter & bigger data hub architectures differ from a traditional data warehouse architectures?
Traditional Enterprise Data Warehouse Architecture
Hadoop based Data Hub Architecture
Q4. What are the benefits of Hadoop based Data Hubs?
1) Improves the overall SLAs (i.e. Service Level Agreements) as the data volume & complexity grows. E.g. “Shared Nothing” architecture, parallel processing, memory intensive processing frameworks like Spark & Impala, and resource preemption in YARN’s capacity scheduler.
2) Scaling data warehouses can be expensive. Adding additional high-end hardware capacities & licensing of data warehouse tools can cost significantly more. Hadoop based solutions can not only be cheaper with commodity hardware nodes & open-source tools, but also can compliment the data warehouse solution by offloading data transformations to Hadoop tools like Spark & Impala for more efficient parallel processing of BigData. This will also free up the data warehouse resources.
3) Exploration of new avenues & leads. Hadoop can provide an exploratory sandbox for the data scientists to discover potentially valuable data from social media, log files, emails, etc that are not normally available in data warehouses.
4) Better flexibility. Often business requirements change, and this requires changes to schema & reports. Hadoop based solutions are not only flexible to handle evolving schemas, but also can handle semi-structured & unstructured data from disparate sources like social media, application log files, images, PDFs, and document files.
Q5. What are key steps in BigData solutions?
A5. Ingesting Data, Storing Data (i.e. Data Modelling), and processing data (i.e data wrangling, data transformations & querying data).
1) Ingesting Data:
Extracting data from various sources like
1. RDBMs Relational Database Management Systems like Oracle, MySQL, etc.
2. ERPs Enterprise Resource Planning (i.e. ERP) systems like SAP.
3. CRM Customer Relationships Management systems like Siebel, Salesforce, etc.
4. Social Media feeds & log files.
5. Flat files, docos, and images.
and storing them on data hub based on “Hadoop Distributed File System”, which is abbreviated as HDFS. Data can be ingested via batch jobs (e.g. running every 15 minutes, once every night, etc), streaming near-real-time (i.e 100ms to 2 minutes) and streaming in real-time (i.e. under 100ms).
One common term used in Hadoop is “Schema-On-Read“. This means unprocessed (aka raw) data can be loaded into HDFS with a structure applied at processing time based on the requirements of the processing application. This is different from “Schema-On-Write”, which is used in RDBMs where schema need to be defined before the data can be loaded.
2) Storing Data:
Data can be stored on HDFS or NoSQL databases like HBase. HDFS is optimized for sequential access & the usage pattern of “Write-Once & Read-Many”. HDFS has high read & write rates as it can parallelize I/O s to multiple drives. HBase sits on top of HDFS and stores data as key/value pairs in a columnar fashion. Columns are clubbed together as column families. HBase is suited for random read/write access. Before data can be stored into Hadoop, you need consider the following
1) Data Storage Formats: There are a number of file formats (e.g CSV, JSON, sequence, AVRO, Parquet, etc) & data compression algorithms (e.g snappy, LZO, gzip, bzip2, etc) that can be applied. Each has particular strengths. Compression algorithms like LZO and bzip2 are splittable.
2) Data Modelling: Despite the schema-less nature of Hadoop, schema design is an important consideration. This includes directory structures and schema of objects stored in HBase, Hive and Impala. Hadoop often serves as a data hub for the entire organization, and the data is intended to be shared. Hence, carefully structured & organized storage of your data is important.
3) Metadata management: Metadata related to stored data.
4) Multitenancy: As smarter data hubs host multiple users, groups, and applications. This often results in challemges relating to governance, standardization, and management.
3) Processing Data:
Hadoop’s processing framework uses the HDFS. It uses the “Shared Nothing” architecture, which in distributed systems each node is completely independent of other nodes in the system. There are no shared resources like CPU, memory, and disk storage that can become a bottle-neck. Hadoop’s processing frameworks like Spark, Pig, Hive, Impala, etc processes distinct subset of the data and there is no need to manage access to the shared data. “Sharing nothing” architectures are very
1. scalable as more nodes can be added without further contention.
2. fault tolerant as each node is independent, and there are no single points of failure, and the system can quickly recover from a failure of an individual node.
Q6. How would you go about choosing among the different file formats for storing and processing data?
A6. One of the key design decisions is regarding file formats based on the
1) Usage patterns like accessing 5 columns out of 50 columns vs accessing most of the columns.
2) Splittability to be processed in parallel.
3) Block compression saving storage space vs read/write/transfer performance
4) Schema evolution to add fields, modify fields, and rename fields.
1. CSV Files:
CSV files are common for exchanging data between Hadoop & external systems. CSVs are readable & parsable. CSVs are handy for bulk loading from databases to Hadoop or into an analytic database. When using CSV files in Hadoop never include header or footer lines. Each line of the file should contain records. CSV files limited support for schema evaluations as new fields can only be appended to the end of a record and existing fields can never be limited. CSV files do not support block compression, hence compressing a CSV file comes at a significant read performance cost.
2. JSON Files:
JSON records are different from JSON files, and each line is its own JSON record, and as JSON stores both schema & data together for each record enabling full schema evolution & splittability. JSON files do not support block level compression.
3. Sequence Files:
Sequence files store data in binary format with a similar structure to CSV files. Like CSV, Sequence files do not store meta data, hence only schema evolution is appending new fields to the end of the record. Unlike CSV files, Sequence files do support block compression. Sequence files are also splittable. Sequence files can be used to solve “small files problem” by combining smaller XML files by storing the filename as the key and the file contents as the value. Due to complexity in reading sequence files, they are more suited for in-flight (i.e. intermediate) data storage.
Note: A SequenceFile is Java centric and cannot be used cross-platform.
4. Avro Files:
are suited for long term storage with schema. Avro files store meta data with data, but also allow specification of independent schema for reading the file. This enables full schema evolution support allowing you to rename, add, and delete fields and change data types of fields by defining a new independent schema. Avro file defines the schema in JSON format, and the data will be in binary JSON format. Avro files are also splittable & support block compression. More suited in usage patterns where row level access is required. This means all the columns in the row are queried. Not suited when a row has 50+ columns and the usage pattern requires only 10 or less columns to be accessed. Parquet file format is more suited for this columnar access usage pattern.
5. Columnar Formats: e.g. RCFile, ORC
RDBMs store records in a row-oriented fashion as this is efficient for cases where many columns of a record need to be fetched. Row oriented writing is also efficient if all the column values are known at the time of writing a record to the disk. But this approach would not be efficient to fetch just 10% of the columns in a row or if all the column values are not known at the time of writing. This is where columnar files make more sense. So columnar format works well
- skipping I/O and decompression on columns that are not part of the query
- for queries that only access a small subset of columns.
- for data-warehousing-type applications where users want to aggregate certain columns over a large collection of records.
RC & ORC formats are specifically written Hive, and not general purpose as Parquet.
6. Parquet Files:
Parquet file is a columnar file like RC and ORC. Parquet files support block compression and optimized for query performance as 10 or less columns can be selected from 50+ columns records. Parquet file write performance is slower than non columnar file formats. Parquet also support limited schema evolution by allowing new columns to be added at the end. Parquet can be read and written to with Avro APIs and Avro schemas.
So, in summary favor Sequence, Avro, & Parquet file formats over the others. Sequence files for raw & intermediate storage. Avro & Parquet files for processing.
More Hadoop eco system Interview Q&A
Top 5 Member Last 30 days
- ♦ Q11-Q23: Top 50+ Core on Java OOP Interview Questions & Answers 2,344 views
- ♦ 11 Spring boot interview questions & answers 2,198 views
- 03: ♦ 5 Java multithreading scenarios interview questions answered 1,497 views
- 18 Java scenarios based interview Questions and Answers 1,089 views
- 01: ♦ 15 Ice breaker questions asked 90% of the time in Java interviews 858 views
Latest posts by Arulkumaran Kumaraswamipillai (see all)
- 10: ♥ Coding Scala Way – groupBy, mapValues & identity - March 28, 2017
- 19: Q109 – Q113 Scala ADT (Algebraic Data Types) Interview Q&As - March 25, 2017
- Part 6: Intermediate What is wrong with this Java code? - March 20, 2017
- Q16-Q24 written test questions and answers on core Java - March 4, 2017
- 12: Q92 – Q97 Hadoop file formats and how to choose - March 3, 2017