10. Setting up & getting started with Hive

Hive is a popular SQL-based tool for processing Big Data. It stores data in, and retrieves data from, HDFS.

Prerequisite: Xcode & Hadoop are installed as outlined in “Setting up & getting started with Hadoop on Mac”.

1. Install Hive

Let’s install Hive 2.3.4 to work with Hadoop 2.7.7.

This installs version 2.3.4 of Hive in the folder /usr/local/apache-hive-2.3.4-bin/
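One way to do this, as a minimal sketch: download the binary tarball and unpack it under /usr/local. The Apache archive URL and the helper name install_hive are assumptions for illustration; the original does not say where the tarball comes from.

```shell
# Assumed download location: the Apache archive (not named in the original).
HIVE_VERSION="2.3.4"
HIVE_TARBALL="apache-hive-${HIVE_VERSION}-bin.tar.gz"
HIVE_URL="https://archive.apache.org/dist/hive/hive-${HIVE_VERSION}/${HIVE_TARBALL}"

# Hypothetical helper: download the tarball and unpack it under /usr/local.
# Writing to /usr/local usually needs sudo.
install_hive() {
  curl -fLO "$HIVE_URL" &&
    sudo tar -xzf "$HIVE_TARBALL" -C /usr/local
}

echo "Run install_hive to fetch $HIVE_URL"
```

Calling install_hive should leave the distribution in /usr/local/apache-hive-2.3.4-bin/, since that is the top-level folder name inside the tarball.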

2. Set HIVE_HOME

Set HIVE_HOME in the “.bash_profile” file in your home folder.

Activate the change by sourcing the file.
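The two lines below are a sketch of what the “.bash_profile” entry could look like, using the install folder from step 1. Adding the bin folder to PATH is an assumption that makes the hive and schematool commands available by name.

```shell
# Lines to append to ~/.bash_profile
export HIVE_HOME=/usr/local/apache-hive-2.3.4-bin
export PATH="$PATH:$HIVE_HOME/bin"

# After saving the file, reload it in the current shell with:
#   source ~/.bash_profile
```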

3. Install the MySQL Server

Hive requires an RDBMS to store its metadata. This store is called the Hive metastore. Hive can be used with the embedded Derby database, which is fine for development and unit testing but won’t scale to a production environment, because only a single user can connect to the Derby database at any given time. A better approach is to use an external JDBC-compliant database such as MySQL or Oracle.

4. Start the MySQL Server

5. Set up MySQL Server

Create a database named “metastore” and a new user named “hiveuser”, and grant permissions. You also need to run the schema upgrade scripts. The Hive version being used here is “2.3.x”.
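A minimal sketch of the database setup follows. It writes the SQL statements to a scratch file so they can be reviewed before being applied. The password 'change-me' and the 'localhost' host are placeholders chosen for illustration, not values from the original.

```shell
# Write the setup statements to a scratch file.
SQL_FILE="${TMPDIR:-/tmp}/hive_metastore_setup.sql"

cat > "$SQL_FILE" <<'SQL'
-- Database and user names come from the tutorial text;
-- the password is a placeholder and should be replaced.
CREATE DATABASE IF NOT EXISTS metastore;
CREATE USER IF NOT EXISTS 'hiveuser'@'localhost' IDENTIFIED BY 'change-me';
GRANT ALL PRIVILEGES ON metastore.* TO 'hiveuser'@'localhost';
FLUSH PRIVILEGES;
SQL

echo "Wrote $SQL_FILE"
# Apply it as the MySQL root user (prompts for the root password):
#   mysql -u root -p < "$SQL_FILE"
```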

Note: schematool is an offline command-line tool for managing the metastore. It can be used to initialize the metastore schema for the current Hive version (e.g. 2.3.x).

Before you run hive for the first time, run the following to initialize the schema:
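A sketch of the schematool call for a MySQL metastore is shown here. It assumes HIVE_HOME points at the install folder from step 1. schematool reads the connection settings from hive-site.xml and needs the MySQL driver jar, so in practice it has to run after steps 6 to 8.

```shell
# Fall back to the tutorial's install folder if HIVE_HOME is not set.
HIVE_HOME="${HIVE_HOME:-/usr/local/apache-hive-2.3.4-bin}"
SCHEMATOOL="$HIVE_HOME/bin/schematool"

if [ -x "$SCHEMATOOL" ]; then
  # Create the metastore tables in the MySQL database.
  "$SCHEMATOOL" -dbType mysql -initSchema
  # Print the schema version that is now recorded in the metastore.
  "$SCHEMATOOL" -dbType mysql -info
else
  echo "schematool not found at $SCHEMATOOL; check HIVE_HOME" >&2
fi
```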

6. Download mysql-connector-java

Download from https://dev.mysql.com/downloads/connector/j/

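Copy the “mysql-connector-java-8.0.13.jar”, which provides the driver class “com.mysql.jdbc.Driver”, onto Hive’s classpath. In Connector/J 8.0 this class name is a deprecated alias for “com.mysql.cj.jdbc.Driver”.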
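As a sketch, the copy could look like the following. The destination $HIVE_HOME/lib is an assumption (the original sentence is cut off before naming it); it is the folder Hive loads its jars from. The script expects the jar to be in the current folder.

```shell
# Fall back to the tutorial's install folder if HIVE_HOME is not set.
HIVE_HOME="${HIVE_HOME:-/usr/local/apache-hive-2.3.4-bin}"
JAR="mysql-connector-java-8.0.13.jar"

if [ -f "$JAR" ] && [ -d "$HIVE_HOME/lib" ]; then
  # Assumed destination: Hive's lib folder.
  cp "$JAR" "$HIVE_HOME/lib/"
  ls -l "$HIVE_HOME/lib/$JAR"
else
  echo "Expected $JAR in the current folder and $HIVE_HOME/lib to exist" >&2
fi
```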

7. Configure Hive – hive-env.sh
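A minimal hive-env.sh might contain the two exports below. The Hadoop folder /usr/local/hadoop-2.7.7 is a hypothetical path used only for illustration; use the folder from your own Hadoop setup. Hive ships a conf/hive-env.sh.template that can be copied to conf/hive-env.sh as a starting point.

```shell
# conf/hive-env.sh (sketch)

# Where Hadoop is installed. The default here is a hypothetical path.
export HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop-2.7.7}"

# Folder that holds hive-site.xml and the other Hive config files.
export HIVE_CONF_DIR="${HIVE_HOME:-/usr/local/apache-hive-2.3.4-bin}/conf"
```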

8. Configure Hive – hive-site.xml

Make sure the following selected lines between the <configuration> and </configuration> tags of hive-site.xml have the values as shown below:

Make sure that the following properties are set as shown below. These properties are used to connect to the external MySQL metastore database. When you start the Hive shell, it will automatically connect to the MySQL database and create the required tables in it.

Include the below configuration in conf/hive-site.xml
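The sketch below writes a candidate hive-site.xml to a scratch file so it can be reviewed and then copied into Hive’s conf folder. The property names are standard Hive settings; the host “localhost”, the password “change-me”, and the thrift URI are assumptions that should be adjusted to match your MySQL setup.

```shell
# Write a candidate hive-site.xml to a scratch file for review.
OUT="${TMPDIR:-/tmp}/hive-site.xml"

cat > "$OUT" <<'XML'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <!-- Placeholder password: replace with the one set for hiveuser. -->
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>change-me</value>
  </property>
  <property>
    <!-- Assumed address of the metastore service started in step 12. -->
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
XML

echo "Wrote $OUT"
# After reviewing it, copy it into place:
#   cp "$OUT" "$HIVE_HOME/conf/hive-site.xml"
```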

9. Configure YARN – yarn-site.xml

Make sure that the following properties are set in yarn-site.xml.

10. Start Hadoop if it is not already running

Output:

11. Verify Hive installation

Output:

12. Start the Hive metastore interface

The Hive metastore service listens on port 9083 by default. Make sure it is listening on that port.
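One possible way to start the service and check the port is sketched below. The log file location and the 10-second wait are arbitrary choices for illustration.

```shell
METASTORE_PORT=9083

if command -v hive >/dev/null 2>&1; then
  # Start the metastore service in the background, logging to a file.
  nohup hive --service metastore > "${TMPDIR:-/tmp}/hive-metastore.log" 2>&1 &
  # Give the service a moment to bind the port.
  sleep 10
  # Show the process listening on the port, if any.
  lsof -nP -iTCP:"$METASTORE_PORT" -sTCP:LISTEN ||
    echo "Nothing is listening on port $METASTORE_PORT yet"
else
  echo "hive not found on PATH; set HIVE_HOME and PATH first" >&2
fi
```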

13. Start Hive

Output:

14. Create HDFS path
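As a sketch, the commands below create Hive’s default warehouse folder, /user/hive/warehouse, together with /tmp, and make both group-writable. These paths are the Hive defaults and are an assumption here, since the original does not name the path.

```shell
# Hive's default warehouse folder (hive.metastore.warehouse.dir).
WAREHOUSE_DIR=/user/hive/warehouse

if command -v hdfs >/dev/null 2>&1; then
  # Create the folders if they do not exist yet.
  hdfs dfs -mkdir -p "$WAREHOUSE_DIR" /tmp
  # Let group members write to them.
  hdfs dfs -chmod g+w "$WAREHOUSE_DIR" /tmp
  hdfs dfs -ls /user/hive
else
  echo "hdfs not found on PATH; is Hadoop installed and on PATH?" >&2
fi
```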

15. Create a Hive table
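A minimal sketch of the table creation, run from the shell with hive -e. The table name test_table comes from the steps that follow; the columns id and name are hypothetical, chosen only for illustration.

```shell
# Hypothetical two-column layout for test_table.
CREATE_HQL="CREATE TABLE IF NOT EXISTS test_table (id INT, name STRING);"

if command -v hive >/dev/null 2>&1; then
  hive -e "$CREATE_HQL"
  # Show the columns of the new table.
  hive -e "DESCRIBE test_table;"
else
  echo "hive not found on PATH" >&2
fi
```

The same statements can also be typed directly at the hive> prompt started in step 13.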

16. Insert data into test_table
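The insert could look like the sketch below. It assumes test_table has two hypothetical columns, id INT and name STRING; the row values are made up for illustration.

```shell
# Sample rows for the assumed (id INT, name STRING) layout.
INSERT_HQL="INSERT INTO TABLE test_table VALUES (1, 'first row'), (2, 'second row');"

if command -v hive >/dev/null 2>&1; then
  hive -e "$INSERT_HQL"
else
  echo "hive not found on PATH" >&2
fi
```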

Output: runs a MapReduce job

17. SELECT from test_table
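A sketch of the query, again assuming the hypothetical columns id and name:

```shell
SELECT_HQL="SELECT id, name FROM test_table;"

if command -v hive >/dev/null 2>&1; then
  hive -e "$SELECT_HQL"
else
  echo "hive not found on PATH" >&2
fi
```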

18. Where is the underlying data stored in HDFS?
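One way to look for the files, as a sketch: list the table’s folder under the warehouse path and print its contents. The path assumes the default warehouse folder and a table created in the default database.

```shell
# Assumed location: default warehouse folder plus the table name.
TABLE_DIR=/user/hive/warehouse/test_table

if command -v hdfs >/dev/null 2>&1; then
  # List the data files Hive wrote for the table.
  hdfs dfs -ls "$TABLE_DIR"
  # Print the rows. The glob is quoted so that HDFS expands it, not the
  # local shell. With the default text format, fields are separated by
  # the Ctrl-A character.
  hdfs dfs -cat "$TABLE_DIR/*"
else
  echo "hdfs not found on PATH" >&2
fi
```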

Output:

Output:

Output:

Output:

So, basically, when you execute a SQL query, Hive runs a MapReduce job to save data to, or select data from, HDFS. Hive can also run on other execution engines such as Spark and Tez. The default engine is MapReduce.
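To see which engine a session is using, the hive.execution.engine property can be printed, as in this sketch:

```shell
# SET with a property name and no value prints the current setting.
ENGINE_HQL="SET hive.execution.engine;"

if command -v hive >/dev/null 2>&1; then
  hive -e "$ENGINE_HQL"
else
  echo "hive not found on PATH" >&2
fi
```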

We will look at more examples in the coming tutorials.
