21: Q138 – Q143 AWS data lakes overview Interview Q&As

Q138. What are the basic considerations when building a data lake on an AWS platform?

1. Ingestion into S3 (i.e. Simple Storage Service) from different source systems, e.g. csv/xml/json files extracted from source systems, data streamed from source systems via message queues/topics, etc.

2. Catalog & search the ingested data by building a metadata index. For example, triggering a Lambda function via an S3 event notification to create a secondary index in DynamoDB (i.e. NoSQL database) or Elasticsearch. The primary index is the object key used to store the object in the S3 bucket.

3. Protect & secure the data in S3 via a) ACLs & S3 bucket policies, b) AWS KMS (i.e. Key Management Service), where data is encrypted immediately after it is uploaded to an S3 bucket and decrypted when downloaded by the same IAM (i.e. Identity & Access Management) role, c) VPC endpoints for S3, and d) versioning.

4. Processing & analytics, i.e. processing the data via ETL tools like AWS EMR, AWS Glue, AWS Data Pipeline, etc., and consuming it via Amazon API Gateway, Amazon Redshift, Amazon Athena, Amazon QuickSight, etc.

Source: AWS YouTube video “Building a Data Lake on AWS”.

Q139. How will you ingest data from disparate source systems into AWS S3?

Online for regular transfers

1. Manually upload via AWS console using the public internet.

2. AWS VPC endpoint for S3, accessed from EC2. The communication takes place via Amazon’s private network without touching the public internet. Once you establish the VPC endpoint, you can use the AWS CLI (e.g. aws s3 ls) or an AWS SDK to ingest data into AWS S3. The VPC and the S3 buckets must be in the same region.

3. AWS Direct Connect gives better bandwidth by using a private/leased line (i.e. a dedicated WAN link). This creates an isolated private connection without touching the public internet, giving better throughput & lower latency.

4. Amazon Kinesis Firehose to collect, process, and load streaming data with very high throughput. Amazon Kinesis Firehose was purpose-built to load streaming data into AWS. You need to create a delivery stream via the AWS console, and then publish data to the delivery stream via the AWS SDK in Java or a Bash script using the AWS CLI.

To ingest data, you can use

a) aws-cli tools with a bash script.

b) s3-bash, which is a small collection of Bash scripts that let you use Amazon’s S3 web service from a Unix, Linux or Mac OS X command shell without needing Java, Python, etc. It relies instead on curl, openssl, and the GNU core utilities.

c) Setting up Java, the Eclipse IDE, the AWS Toolkit for Eclipse, and Maven to use the AWS SDK for Java. The AWS SDK supports many other languages like Python, PHP, .NET, Node.js, Ruby, etc. If you are using Python, Boto3 is the AWS SDK for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2.

5. Vendor-specific replication tools to transfer, say, traditional Oracle or Teradata relational data into AWS S3. For example,

a. Teradata Parallel Transporter (i.e. TPT) operators and the Teradata Access Module for Amazon S3 run on an EC2 instance.

b. Attunity CloudBeam provides cloud-optimized data replication from all major on-premises sources to Amazon Web Services, Microsoft Azure and Google Cloud.

c. Oracle GoldenGate enables the continuous, real-time capture, routing, transformation, and delivery of transactional data across heterogeneous environments. You can integrate GoldenGate with Kafka/Kinesis to stream data into AWS S3.

Offline for infrequent & one time bulk transfer(s)

6. AWS Import/Export Disk for infrequent or one-time rapid data transfer. This is an offline transfer, which involves sending your own portable storage device to an AWS data center to be uploaded into AWS S3.

7. AWS Import/Export with Snowballs, which are Amazon-owned network-attached storage devices for infrequent or one-time rapid data transfers. You can transfer petabytes of data very securely. This is an offline transfer: Amazon ships you the device, you copy the data onto it using a client software, and once the data is copied you power down the device and ship it back to Amazon to be uploaded into S3.
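A minimal sketch of option c) above, ingesting a local file into S3 with Boto3. The bucket name and the `raw/<date>/<file>` key convention are hypothetical placeholders, not part of the original answer; the import is guarded so the key-building logic works even where the SDK is not installed.

```python
# Sketch: ingest a local file into S3 via Boto3 (the AWS SDK for Python).
# Bucket name and key layout are illustrative assumptions.
import datetime
import os

try:
    import boto3  # pip install boto3; requires AWS credentials configured
except ImportError:
    boto3 = None  # lets the pure key-building logic below run without the SDK


def build_object_key(local_path, prefix="raw"):
    """Build an S3 object key of the form <prefix>/<yyyy-mm-dd>/<file name>."""
    today = datetime.date.today().isoformat()
    return f"{prefix}/{today}/{os.path.basename(local_path)}"


def ingest_file(local_path, bucket="my-data-lake-bucket"):
    """Upload a file to S3; upload_file transparently uses multipart for large files."""
    if boto3 is None:
        raise RuntimeError("boto3 is not installed")
    key = build_object_key(local_path)
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)
    return key
```

In practice the same `ingest_file` call could be driven from a scheduler or wrapped in a Bash loop via the AWS CLI, as described in options a) and b).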

Q140. What are some of the considerations to improve performance when uploading large files to S3?

1. Multipart upload for PUTs

Upload the files in parallel to S3 using multipart upload. This breaks your larger objects into chunks and uploads a number of chunks in parallel. If the upload of a chunk fails, you can simply restart that chunk. You’ll be able to improve your overall upload speed by taking advantage of this parallelism. Once all parts are uploaded, S3 assembles them into a single object when the multipart upload is completed.
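The chunking described above can be sketched as follows. Boto3’s `upload_file` does this part planning automatically; the function below just illustrates the logic, with the 8 MB default part size being an arbitrary choice (S3 requires every part except the last to be at least 5 MB).

```python
# Sketch: plan the byte ranges for a multipart upload. Each planned part
# can then be uploaded in parallel and retried independently on failure.
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for all parts except the last


def plan_parts(object_size, part_size=8 * 1024 * 1024):
    """Return (part_number, start_offset, end_offset) tuples covering the object."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part size must be at least 5 MB")
    parts = []
    offset = 0
    part_number = 1  # S3 part numbers start at 1
    while offset < object_size:
        end = min(offset + part_size, object_size)
        parts.append((part_number, offset, end))
        offset = end
        part_number += 1
    return parts
```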

2. Range HTTP header for GETs

The Range HTTP header can improve GET performance by allowing the object to be retrieved in parts instead of as a whole. This also lets you recover quickly from failures, as only the part that failed to download needs to be retried.
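The ranged GETs above can be sketched as below: compute one `Range` header per part, then issue each GET in parallel (in Boto3, for example, `s3.get_object(Bucket=..., Key=..., Range=header)`). The part size here is an illustrative assumption.

```python
# Sketch: build HTTP Range header values so an object can be fetched
# in independent parts. Per RFC 7233, the end offset is inclusive.
def range_headers(object_size, part_size):
    """Yield Range header values of the form "bytes=start-end"."""
    for start in range(0, object_size, part_size):
        end = min(start + part_size, object_size) - 1
        yield f"bytes={start}-{end}"
```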

3. Add a random prefix to the key names

S3 maintains an index of object key names in each AWS region and stores the key names in alphabetical order. Object keys are stored across multiple partitions in the index, and the key name dictates which partition the key is in. Hence using a sequential prefix, such as a timestamp or an alphabetical sequence, increases the likelihood of hot spotting (i.e. overwhelming the I/O capacity by targeting the same partition). So, introduce some randomness in the key name prefixes to distribute the keys across multiple index partitions.

For example, a “random_number” component at the beginning of the key name adds the randomness. This can be generated via a hashing function to distribute the key name prefixes.
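One way to derive such a prefix, sketched below, is to hash the key name itself so the prefix is random-looking yet deterministic (the same key always maps to the same prefix). The 4-character prefix length is an illustrative assumption.

```python
# Sketch: prepend a short hash-derived prefix so sequential key names
# (e.g. timestamps) spread across S3 index partitions.
import hashlib


def prefixed_key(key_name, prefix_len=4):
    """Prepend the first few hex characters of the key name's MD5 hash."""
    digest = hashlib.md5(key_name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{key_name}"
```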

4. Build and maintain secondary Index outside of S3

Object key names are stored lexicographically in Amazon S3 indexes, making them hard to sort and manipulate, hence build a secondary index in DynamoDB (i.e. NoSQL database) to query/list metadata of the objects as opposed to performing those operations directly against S3. You can create event notifications on S3 buckets like “on object creation”, which invoke a lambda function to update the DynamoDB metadata index.
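The event-notification flow above could be sketched as the Lambda handler below. The attribute names are hypothetical; the `table` argument only needs a `put_item` method, which matches the shape of Boto3’s DynamoDB `Table` resource, so a real table (or a stub in tests) can be passed in.

```python
# Sketch: a Lambda handler that maintains a DynamoDB secondary index of
# S3 object metadata from an "on object creation" event notification.
from urllib.parse import unquote_plus


def handle_s3_event(event, table):
    """Write one metadata item per S3 record in the event."""
    items = []
    for record in event["Records"]:
        s3_info = record["s3"]
        item = {
            # S3 URL-encodes key names in event payloads; decode them
            "object_key": unquote_plus(s3_info["object"]["key"]),
            "bucket": s3_info["bucket"]["name"],
            "size": s3_info["object"]["size"],
            "event_time": record["eventTime"],
        }
        table.put_item(Item=item)
        items.append(item)
    return items
```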

Q141. What is one of the key requirements you need to consider when working with S3?
A141. You must find out the total number of requests per second at peak usage. A typical workload involves a burst of 100 PUT/LIST/DELETE requests per second and 300 GET requests per second. If the typical workload exceeds 200 PUT/LIST/DELETE requests per second and 600 GET requests per second, then you need to check with AWS support.

Q142. How do you protect S3 buckets & files from unintended overwrites & deletions?

1. Use versioning

Versioning gives you the ability to retrieve and restore deleted objects or roll back to previous versions. Versioning does not prevent bucket deletion, so the data must also be backed up across other regions.

2. Cross Region replication

Use Cross Region replication feature to backup data to a different region.

3. Enable additional security

Enable additional security by configuring a bucket to require MFA (Multi-Factor Authentication) for deletions.

4. Tracking with event notifications

Use event notifications to be notified of any PUT or DELETE request on the S3 objects.
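Points 1 and 3 above could be applied with Boto3 as sketched below. The bucket name is a placeholder, and note that MFA delete can only be enabled by the root account supplying an MFA token, so the default here only turns on versioning.

```python
# Sketch: enable versioning (and optionally MFA delete) on an S3 bucket.
try:
    import boto3  # pip install boto3; requires AWS credentials configured
except ImportError:
    boto3 = None  # lets the payload-building logic run without the SDK


def versioning_config(mfa_delete=False):
    """Build the VersioningConfiguration payload for put_bucket_versioning."""
    config = {"Status": "Enabled"}
    if mfa_delete:
        # MFA delete additionally requires the MFA parameter (serial + token)
        # and can only be set by the root account.
        config["MFADelete"] = "Enabled"
    return config


def enable_versioning(bucket="my-data-lake-bucket"):
    if boto3 is None:
        raise RuntimeError("boto3 is not installed")
    s3 = boto3.client("s3")
    s3.put_bucket_versioning(Bucket=bucket,
                             VersioningConfiguration=versioning_config())
```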

Q143. What are the differences among S3, EBS (i.e. Elastic Block Store), and EFS (i.e. Elastic File System)?
A143. Amazon S3 is an object store and is suitable for storing user files and backups in massive numbers. It is suited for the WORM (i.e. Write Once Read Many times) pattern. Say you have a 50 MB csv file in S3: if you modify 1 record in the csv, the whole file will have to be replaced, as opposed to only updating the block that has changed.

Amazon EBS & EFS were designed to provide faster storage for the users of the Amazon EC2 cloud computing service. Amazon EBS is similar to your computer’s drive, but in a virtualized environment. Say you have a 50 MB csv file in EBS that spans 20 blocks: if you modify 1 record in the csv, only the block that was modified is updated. Amazon EBS is the storage for the drives of your virtual machines. It stores data as blocks of the same size and organises them through a hierarchy similar to a traditional file system. EBS does not automatically scale, but is faster than S3 and EFS.

An Amazon EBS volume is a durable, block-level storage device that you can attach to a single EC2 instance in the same availability zone. You can use EBS volumes as primary storage for data that requires frequent updates, such as the system drive for an instance or storage for a database application. Amazon EBS provides the ability to create snapshots (backups) of any EBS volume and write a copy of the data in the volume to Amazon S3, where it is stored redundantly in multiple Availability Zones.

Amazon EFS is handy when you want to run an application with high workloads that need scalable storage and relatively fast output. Amazon EFS is automatically scalable. You can mount EFS to various AWS services and access it from various virtual machines. Amazon EFS is especially helpful for running servers, shared volumes like NAS devices, big data analytics, and other workloads that require scaling. EFS is faster than S3, but slower than EBS.
