Hive is a popular SQL-based tool to process Big Data. It stores & retrieves data to/from HDFS.
Prerequisite: Xcode & Hadoop are installed, as outlined in “Setting up & getting started with Hadoop on Mac”.
1. Install Hive
Let’s install Hive 2.3.4 with Hadoop 2.7.7.
$ wget http://www-us.apache.org/dist/hive/hive-2.3.4/apache-hive-2.3.4-bin.tar.gz
$ sudo tar xvzf apache-hive-2.3.4-bin.tar.gz -C /usr/local
This installs Hive 2.3.4 in the folder /usr/local/apache-hive-2.3.4-bin/.
2. Set HIVE_HOME
$ vi ~/.bash_profile
Add the following lines to “.bash_profile”:
export HIVE_HOME=/usr/local/apache-hive-2.3.4-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$HIVE_HOME/bin:$PATH
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib/*
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*
Activate the change.
$ source ~/.bash_profile
3. Install the MySQL Server
Hive requires an RDBMS to store its metadata; this database is called the Hive metastore. Hive can be used with the embedded Derby database, but Derby is only good for development and unit testing: it won’t scale to a production environment because only a single user can connect to the Derby database at any instant of time. The better way is to configure an external JDBC-compliant database like MySQL, Oracle, etc.
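For reference, the stock hive-default.xml configures the embedded Derby metastore with a connection URL along these lines; it is exactly this setting that we will override with MySQL values in step 8:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

Here we use MySQL instead, installed via Homebrew: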
$ brew install mysql
4. Start the MySQL Server
$ mysql.server start
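Optionally, if you’d rather have MySQL start automatically at login, Homebrew’s service manager can handle that (assuming a standard Homebrew install):

$ brew services start mysql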
5. Set up MySQL Server
$ mysql -u root
Create a database named “metastore” and a new user named “hiveuser”, and grant permissions. You also need to load the metastore schema that matches your Hive version — “2.3.x” here, hence the 2.3.0 schema script below.
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/local/apache-hive-2.3.4-bin/scripts/metastore/upgrade/mysql/hive-schema-2.3.0.mysql.sql
mysql> CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT SELECT, INSERT, UPDATE, DELETE, ALTER, CREATE, INDEX ON metastore.* TO 'hiveuser'@'localhost';
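As a quick sanity check that the schema loaded, you can list the metastore tables and query the recorded schema version. Assuming the SOURCE command above completed without errors, you should see tables such as DBS, TBLS, and VERSION:

mysql> SHOW TABLES;
mysql> SELECT SCHEMA_VERSION FROM VERSION;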
Note: schematool is an offline command line tool to manage the metastore. This tool can be used to initialize the metastore schema for the current Hive version (E.g. 2.3.x).
$ schematool -dbType mysql -info
Before you run Hive for the first time, run the following to initialize the schema. (Skip this if you already loaded the schema with the SOURCE script above — running -initSchema against an already-initialized database will fail.)
$ schematool -dbType mysql -initSchema
6. Download mysql-connector-java
Download from https://dev.mysql.com/downloads/connector/j/
$ curl -L 'http://www.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.13.tar.gz/from/http://mysql.he.net/' | tar xz
Copy the “mysql-connector-java-8.0.13.jar”, which contains the JDBC driver class “com.mysql.cj.jdbc.Driver” (the old “com.mysql.jdbc.Driver” class name is deprecated in Connector/J 8.x), into Hive’s lib folder.
$ mv mysql-connector-java-8.0.13/mysql-connector-java-8.0.13.jar /usr/local/apache-hive-2.3.4-bin/lib
$ sudo chown -R <user>:admin apache-hive-2.3.4-bin
7. Configure Hive – hive-env.sh
$ cd /usr/local/apache-hive-2.3.4-bin/conf
$ cp hive-env.sh.template hive-env.sh
8. Configure Hive – hive-site.xml
$ cd /usr/local/apache-hive-2.3.4-bin/conf
$ cp hive-default.xml.template hive-site.xml
Open the file:

$ vi /usr/local/apache-hive-2.3.4-bin/conf/hive-site.xml

Make sure that the following properties are set as shown below. These properties are used to connect to the external MySQL metastore database. When you start the Hive shell, it will automatically connect to the MySQL database and create the required tables in it.
<configuration>
  <!-- ..................... -->
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true&amp;useJDBCCompliantTimezoneShift=true&amp;useLegacyDatetimeCode=false&amp;serverTimezone=UTC</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/usr/local/apache-hive-2.3.4-bin/iotmp</value>
    <description>Location of Hive run time structured log file</description>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/usr/local/apache-hive-2.3.4-bin/iotmp</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/usr/local/apache-hive-2.3.4-bin/iotmp</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>
      Enforce metastore schema version consistency.
      True: Verify that version information stored in the metastore is compatible with the version from the Hive jars. Also disable automatic schema migration attempt. Users are required to manually migrate the schema after a Hive upgrade, which ensures proper metastore schema migration. (Default)
      False: Warn if the version information stored in the metastore doesn't match the one from the Hive jars.
    </description>
  </property>
  <!-- ..................... -->
</configuration>
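Note that several of the properties above point at an iotmp folder inside the Hive install, which does not exist out of the box. A minimal sketch to create it, with the path taken from the config above (adjust permissions to your setup):

$ mkdir -p /usr/local/apache-hive-2.3.4-bin/iotmp
$ chmod -R 775 /usr/local/apache-hive-2.3.4-bin/iotmp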
Also include the below configuration in conf/hive-site.xml. It tells Hive clients to talk to the standalone metastore service over Thrift (we start this service in step 12) instead of opening their own JDBC connections to MySQL.
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
</property>
9. Configure YARN – yarn-site.xml
Make sure that the following properties are set in yarn-site.xml.
...
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
...
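If YARN was already running when you edited yarn-site.xml, restart it so the shuffle handler is picked up — a quick sketch, assuming you run it from Hadoop’s sbin directory (e.g. /usr/local/hadoop-2.7.7/sbin):

$ ./stop-yarn.sh
$ ./start-yarn.sh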
10. Start Hadoop if not already started
$ ./start-all.sh
$ jps -lm
Output:
4580 sun.tools.jps.Jps -lm
4406 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
4167 org.apache.hadoop.hdfs.server.datanode.DataNode
4503 org.apache.hadoop.yarn.server.nodemanager.NodeManager
4073 org.apache.hadoop.hdfs.server.namenode.NameNode
4283 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
11. Verify Hive installation
$ echo $HIVE_HOME
Output:
/usr/local/apache-hive-2.3.4-bin
12. Start the Hive metastore interface
$ hive --service metastore &
The Hive metastore interface listens on port 9083 by default. Make sure it is listening:
$ netstat -an | grep 9083
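Alternatively, lsof can confirm the port is open and also show which process owns it:

$ lsof -n -i :9083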
13. Start Hive
$ hive
Output:
hive>
14. Create HDFS path
$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chown -R hiveuser /user/hive/warehouse
$ hdfs dfs -chmod -R 777 /user/hive/warehouse
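To double-check the directory, its owner, and its permissions:

$ hdfs dfs -ls /user/hive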
15. Create a Hive table
hive> CREATE TABLE IF NOT EXISTS test_table
    > (col1 int COMMENT 'Integer Column',
    > col2 string COMMENT 'String Column')
    > COMMENT 'This is test table'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 4.016 seconds
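You can verify the table definition straight away; DESCRIBE should list both columns together with the comments from the DDL above:

hive> DESCRIBE test_table;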
16. Insert data into test_table
hive> INSERT INTO default.test_table values(1, 'Some test value 1');
hive> INSERT INTO test_table values(2, 'Some test value 2');
Output: each INSERT runs a MapReduce job.
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1542456661191_0002, Tracking URL = http://MacBook-Pro-2.local:8088/proxy/application_1542456661191_0002/
Kill Command = /usr/local/hadoop-2.7.7/bin/hadoop job -kill job_1542456661191_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-11-17 23:15:50,403 Stage-1 map = 0%, reduce = 0%
2018-11-17 23:15:55,562 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1542456661191_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/test_table/.hive-staging_hive_2018-11-17_23-15-43_815_7508356764358939336-1/-ext-10000
Loading data to table default.test_table
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 HDFS Read: 4344 HDFS Write: 94 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 13.153 seconds
17. SELECT from test_table
hive> SELECT * FROM test_table;
OK
1	Some test value 1
2	Some test value 2
Time taken: 0.118 seconds, Fetched: 2 row(s)
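Notice how quickly this returns compared to the INSERTs: a plain SELECT * is served by a simple fetch task and does not launch a MapReduce job. An aggregating query, by contrast, does launch one — for example:

hive> SELECT COUNT(*) FROM test_table;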
18. Where is the underlying data stored in HDFS?
$ hdfs dfs -ls /user/hive/warehouse
Output:
drwxrwxrwx - user supergroup 0 2018-11-17 23:15 /user/hive/warehouse/test_table
$ hdfs dfs -ls /user/hive/warehouse/test_table
Output:
-rwxrwxrwx 3 user supergroup 20 2018-11-17 23:12 /user/hive/warehouse/test_table/000000_0
-rwxrwxrwx 3 user supergroup 20 2018-11-17 23:15 /user/hive/warehouse/test_table/000000_0_copy_1
$ hdfs dfs -cat /user/hive/warehouse/test_table/000000_0
Output:
1,Some test value 1
$ hdfs dfs -cat /user/hive/warehouse/test_table/000000_0_copy_1
Output:
2,Some test value 2
So, basically, when you execute a SQL query, a MapReduce job is run to save or select data to/from HDFS. Hive can also run with other execution engines like Spark & Tez; the default engine is MapReduce.
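For example, assuming Tez is installed and configured on the cluster, you could switch engines for the current session via the hive.execution.engine property (valid values are mr, tez, and spark):

hive> SET hive.execution.engine=tez;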
We will look at more examples in the coming tutorials.