Hive Connector
This document describes the Hypertable-Hive integration. Hive is data warehouse infrastructure built on top of Hadoop. The Hypertable-Hive integration allows Hive QL statements to read from and write to Hypertable via SELECT and INSERT commands.
Introduction
Hypertable integrates with Hive through its external storage handler interface. A Hive storage handler for Hypertable is included in the Hypertable jar file, which can be found in the lib/java directory of your Hypertable installation.
(NOTE: The lib/java directory is empty when Hypertable is initially installed, but gets populated the first time you start Hypertable on top of Hadoop HDFS)
The Hypertable-Hive storage handler was overhauled in Hypertable version 0.9.8.1 to make it compatible with the latest version of Hive (0.13.0 or greater) and to add new features such as support for the binary column format and the ability to create and drop external Hypertable tables through the hive command interpreter. This document describes this new version of the storage handler as it exists in Hypertable version 0.9.8.1 and later releases. It has been tested with the following Hadoop/Hive combinations:
- Cloudera CDH4 with Hive 0.13.1
- Hortonworks HDP2.1 with Hive 0.13.0
The Hypertable storage handler should work with any Hadoop 2 based distribution and Hive 0.13.0 or later.
Prerequisites
The following commands copy hypertable.jar from the Hypertable installation into the corresponding location in HDFS, where Hive can find it:

```sh
HT_JAR=`readlink -f /opt/hypertable/current/lib/java/hypertable.jar`
hdfs dfs -rm -f $HT_JAR
hdfs dfs -mkdir -p `dirname $HT_JAR`
hdfs dfs -copyFromLocal $HT_JAR `dirname $HT_JAR`
```
The above commands only need to be run once each time Hypertable is upgraded. Once the hypertable.jar file is copied to the proper location in HDFS, the Hive interpreter can be run as follows:
```sh
# Commands read from a file
hive --auxpath $HT_JAR -f <fname>

# Commands supplied on the command line
hive --auxpath $HT_JAR -e "<commands>"
```
Table Create, Load, and Query
The first step required for this example is to create a test namespace in Hypertable to parallel the test database that we'll create in Hive:
```sh
/opt/hypertable/current/bin/ht shell -e "create namespace test;"
```
The data file used in this example (kv1.txt) contains data that looks like the following:
```
...
481	val_481
457	val_457
98	val_98
282	val_282
197	val_197
...
```
Next we'll run the Hive script shown below, but before we do, some explanation is in order. The script starts by creating a test database and then creates an external, non-native table called hypertable_1, backed by a Hypertable table xyz, using the Hypertable storage handler. It then creates a native table pokes and populates it with the data from kv1.txt. Lastly, it populates hypertable_1 with an INSERT command that selects the rows from pokes where the id column equals 98.
The mapping of columns from the Hive table hypertable_1 to the Hypertable table xyz is specified by the WITH SERDEPROPERTIES ('hypertable.columns.mapping' = ':key,poke:id') clause. The comma-separated fields after the '=' character parallel the column definitions in the CREATE TABLE statement.
The first field is :key which specifies the column which will be used as the row key in the Hypertable table xyz. There must be exactly one :key mapping (compound keys are not supported). In this case, the row key maps to the first column in hypertable_1 which is id. The second field is poke:id which indicates that the column with column family poke and column qualifier "id" will map to the second column in hypertable_1 which is poke_id.
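The positional pairing described above can be sketched with a short Python snippet (the helper below is purely illustrative and not part of the storage handler):

```python
# Illustrative sketch: pair each field of a 'hypertable.columns.mapping'
# string with the Hive column it maps to -- the mapping fields and the
# CREATE TABLE column definitions are matched by position.
def pair_mapping(mapping, hive_columns):
    fields = mapping.split(",")  # no whitespace allowed between entries
    return list(zip(fields, hive_columns))

# ':key' -> row key (Hive column 'id'); 'poke:id' -> column family 'poke',
# qualifier 'id' (Hive column 'poke_id')
print(pair_mapping(":key,poke:id", ["id", "poke_id"]))
# -> [(':key', 'id'), ('poke:id', 'poke_id')]
```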
```sql
CREATE DATABASE test;
USE test;
CREATE TABLE hypertable_1(id int, poke_id string)
  STORED BY 'org.hypertable.hadoop.hive.HypertableStorageHandler'
  WITH SERDEPROPERTIES ('hypertable.columns.mapping' = ':key,poke:id')
  TBLPROPERTIES ('hypertable.table.name' = '/test/xyz');
CREATE TABLE pokes (id INT, poker STRING);
LOAD DATA LOCAL INPATH 'kv1.txt' OVERWRITE INTO TABLE pokes;
INSERT OVERWRITE TABLE hypertable_1 SELECT * FROM pokes WHERE id=98;
```
Now that the table hypertable_1 has been populated with data, we can dump its contents by running the following hive command:
```sh
hive --auxpath $HT_JAR -e "USE test; SELECT * FROM hypertable_1;" 2> /dev/null
```

which will generate the following output:

```
98	val_98
```
Now let's verify that the table xyz was indeed created in Hypertable and was populated with data:
```
$ /opt/hypertable/current/bin/ht shell

hypertable> use test;
hypertable> get listing;
xyz

hypertable> show create table xyz;

CREATE TABLE xyz (
  poke,
  ACCESS GROUP default (poke)
);

hypertable> select * from xyz;
98	poke:id	val_98
98	poke:id	val_98
```
Note that the xyz table contains two cells with row key 98. This is because there are two rows in the input file (kv1.txt) that contain "98 val_98"; Hypertable stores them as two versions of the same cell. Hive's SELECT returns only the latest version, which is why the Hive query above printed a single row (see Limitations below).
Mapping to an Existing Table
In this next example we'll create an external, non-native table in Hive called hypertable_2 that maps to an existing table in Hypertable. We'll use the Hypertable table xyz that was created in the previous example. For this mapping to work properly, the column data in the Hypertable table must match the data types specified for the Hive columns. In this example the key column, id, is declared with data type int. Hive uses string representation as the default encoding format for data types in external, non-native tables, so the contents of the corresponding Hypertable column (the row key) must contain integers rendered as ASCII strings. This is exactly how the Hypertable table xyz was populated in the previous exercise (see the next section for how to specify the binary encoding format).
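To make the string encoding concrete: the integer 98 is rendered as the two ASCII bytes "98", which is exactly what the row keys in xyz contain. A minimal Python illustration:

```python
# With Hive's default string encoding, an int column is rendered as its
# decimal ASCII representation before being written to Hypertable.
key_bytes = str(98).encode("ascii")
print(key_bytes)  # -> b'98' (two bytes, matching the xyz row key "98")
```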
```sql
USE test;
CREATE EXTERNAL TABLE hypertable_2(id int, poke_id string)
  STORED BY 'org.hypertable.hadoop.hive.HypertableStorageHandler'
  WITH SERDEPROPERTIES ('hypertable.columns.mapping' = ':key,poke:id')
  TBLPROPERTIES ('hypertable.table.name' = '/test/xyz');
```
The above Hive statements have the effect of creating a new Hive table hypertable_2 mapped to the table xyz in Hypertable. To verify that this mapping was successful, we can dump the contents of the hypertable_2 table with the following command:
```sh
hive --auxpath $HT_JAR -e "USE test; SELECT * FROM hypertable_2;" 2> /dev/null
```

which will generate the following output:

```
98	val_98
```
Binary Format Column Data
Hive uses ASCII strings for the default encoding format of column data in external, non-native tables. This can be somewhat inefficient because it requires integer-to-string conversion and occupies more space. For example, the 4-byte integer value 1000000000 requires ten bytes to store as an ASCII string, instead of four. Hive solves this problem by allowing a binary encoding format to be specified for any column. This is accomplished by appending a "#b" suffix to the column name in the column mapping clause, as illustrated in the hive script below.
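The space savings are easy to check. The sketch below assumes the binary format is a fixed-width big-endian encoding (an assumption about the #b on-disk layout; verify against your storage handler version):

```python
import struct

value = 1000000000

# Default string encoding: the integer rendered as decimal ASCII text.
ascii_encoded = str(value).encode("ascii")

# Binary encoding: a fixed-width big-endian 4-byte integer (assumed layout).
binary_encoded = struct.pack(">i", value)

print(len(ascii_encoded))   # 10 bytes as an ASCII string
print(len(binary_encoded))  # 4 bytes in binary format
```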
The data file used in this example (employee.txt) contains the following data:
```
jimmy	21	20000.0
jim	35	40000.0
james	50	60000.0
billy	25	25000.0
bill	40	45000.0
william	60	65000.0
```
```sql
USE test;
CREATE TABLE hypertable_binary(key string, age int, salary double)
  STORED BY 'org.hypertable.hadoop.hive.HypertableStorageHandler'
  WITH SERDEPROPERTIES ('hypertable.columns.mapping' = ':key,info:age#b,info:salary#b')
  TBLPROPERTIES ('hypertable.table.name' = '/test/employee');
CREATE TABLE employee (name STRING, age INT, salary DOUBLE);
LOAD DATA LOCAL INPATH 'employee.txt' OVERWRITE INTO TABLE employee;
INSERT OVERWRITE TABLE hypertable_binary SELECT * FROM employee;
```
After running the above script, we can verify that the table employee was indeed created in Hypertable and was populated with binary data:
```
$ /opt/hypertable/current/bin/ht shell

hypertable> use test;
hypertable> show tables;
employee
xyz

hypertable> show create table employee;

CREATE TABLE employee (
  info,
  ACCESS GROUP default (info)
);

hypertable> select * from employee;
bill	info:age	\0\0\0(
bill	info:salary	@??\0\0\0\0\0
billy	info:age	\0\0\0
billy	info:salary	@?j\0\0\0\0\0
james	info:age	\0\0\02
james	info:salary	@?L\0\0\0\0\0
jim	info:age	\0\0\0#
jim	info:salary	@?\0\0\0\0\0
jimmy	info:age	\0\0\0
jimmy	info:salary	@ӈ\0\0\0\0\0
william	info:age	\0\0\0<
william	info:salary	@?\0\0\0\0\0
```
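The escaped bytes in the output above are the raw fixed-width values written by the #b format; the "?" characters are just the shell's rendering of non-printable bytes. Assuming big-endian int and IEEE 754 double encoding (consistent with the dump above, where '(' is 0x28 and '@' is 0x40), bill's two cells can be decoded with Python's struct module:

```python
import struct

# bill's info:age cell: \0\0\0( is the 4-byte big-endian integer 40
# (0x28 is the ASCII code of '(')
age = struct.unpack(">i", b"\x00\x00\x00\x28")[0]

# bill's info:salary cell: the 8 bytes 40 E5 F9 00 00 00 00 00 are the
# big-endian IEEE 754 double 45000.0 ('@' is 0x40; 0xE5 and 0xF9 are
# the non-printable bytes shown as '??')
salary = struct.unpack(">d", bytes.fromhex("40e5f90000000000"))[0]

print(age, salary)  # -> 40 45000.0
```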
We can now issue Hive QL commands to select rows of interest:
```
$ hive --auxpath $HT_JAR -e "USE test; SELECT * FROM hypertable_binary WHERE age > 30; quit;" 2> /dev/null
bill	40	45000.0
james	50	60000.0
jim	35	40000.0
william	60	65000.0
```
```
$ hive --auxpath $HT_JAR -e "USE test; SELECT * FROM hypertable_binary WHERE salary < 30000.0; quit;" 2> /dev/null
billy	25	25000.0
jimmy	21	20000.0
```
Limitations
- Whitespace should not be used between entries in the hypertable.columns.mapping string, since it will be interpreted as part of the column name, which is not what you want.
- There is currently no way to access multiple versions of the same Hypertable cell; Hive can only access the latest version of a cell.
- Binary format is not supported for the :key column.
Acknowledgements
We'd like to thank John Sichi for his help and suggestions in developing the initial Hypertable Storage Handler for Hive.