Test Setup
This document describes how to set up and run the Hypertable vs. HBase performance evaluation test, which compares the performance of Hypertable 0.9.5.5 with that of HBase 0.90.4. All of the test-specific configuration and run scripts can be found in test2.tar.gz and can be examined via the links provided in the pertinent sections below. We built the test framework and checked it into the Hypertable source tree under examples/java/org/hypertable/examples/PerformanceTest/. The test framework is compiled into the hypertable-0.9.5.5-examples.jar file that ships with the Hypertable binary packages. The following source files contain the code that encapsulates the interaction with the Hypertable and HBase APIs:
Driver.java
DriverCommon.java
DriverHypertable.java
DriverHBase.java
Prerequisites
The following prerequisites must be satisfied to run this test:
- The user account from which the test is being run must have password-less sudo privilege.
- The root account on the machine from which this test will be administered must have password-less ssh access to all machines in the test cluster (a quick verification sketch follows this list).
- Hypertable and HBase must be installed on all machines in the test cluster, including the machine from which the test is being administered.
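The first two prerequisites can be checked quickly before starting. The sketch below is illustrative only; the host names are the ones used in our Capfile and should be replaced with your own cluster's hosts.

# Verify password-less sudo for the test account on the admin machine
sudo -n true && echo "password-less sudo OK"

# Verify password-less root ssh to each machine in the test cluster
# (host names taken from our Capfile; substitute your own)
for host in test01 test02 test03 test04 test05; do
  ssh -o BatchMode=yes root@$host true && echo "$host OK"
done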
Step 1. Setup and Install Hadoop
The first step is to set up and install HDFS. We ran the test using HDFS version 0.20.2 (CDH3u2) with the following configuration files:
hadoop/conf/core-site.xml
hadoop/conf/hadoop-env.sh
hadoop/conf/hdfs-site.xml
hadoop/conf/masters
hadoop/conf/slaves
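With these configuration files in place, HDFS can be formatted and started from the master. The commands below are the standard Hadoop 0.20 scripts; the assumption that Hadoop is installed under $HADOOP_HOME is ours, so adjust paths to match your installation.

# Run on the HDFS master (namenode)
cd $HADOOP_HOME
bin/hadoop namenode -format     # one-time format of the namenode
bin/start-dfs.sh                # starts the namenode and the datanodes listed in conf/slaves
bin/hadoop dfsadmin -report     # sanity check: all datanodes should report as live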
Step 2. Setup and Install HBase
Install and configure HBase version 0.90.4 (CDH3u2). We used the following configuration files in our test:
hbase/conf/regionservers
hbase/conf/hbase-env.sh
hbase/conf/hbase-site.xml [reading]
hbase/conf/hbase-site.xml [writing]
Some notable non-default configuration includes the following variables set in hbase-env.sh:
export HBASE_REGIONSERVER_OPTS="-Xmx14g -Xms14g -Xmn128m -XX:+UseParNewGC \
  -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -verbose:gc \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:$HBASE_HOME/logs/gc-$(hostname)-hbase.log"
export HBASE_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64
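Because the reading and writing variants of hbase-site.xml were swapped in between test phases, the active configuration has to be pushed to every region server and HBase restarted. A minimal sketch is shown below; the file paths and the use of the regionservers file as the host list are assumptions based on the configuration above.

# Push the appropriate hbase-site.xml to every region server, then restart HBase
for host in $(cat /usr/lib/hbase/conf/regionservers); do
  scp hbase-site.xml root@$host:/usr/lib/hbase/conf/hbase-site.xml
done
$HBASE_HOME/bin/stop-hbase.sh
$HBASE_HOME/bin/start-hbase.sh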
Step 3. Setup and Install Hypertable
Install and configure Hypertable version 0.9.5.5 (download) using the following configuration files:
hypertable/conf/hypertable.cfg [writing]
hypertable/conf/hypertable.cfg [reading]
We used the same configuration file for the write, scan, and uniform random read tests. The only non-default configuration property we modified was the range split size, which we increased to bring it more in line with the HBase configuration.
Hypertable.RangeServer.Range.SplitSize=1GB
For the Zipfian random read tests, we increased the size of the query cache to two gigabytes with the addition of the following configuration property.
Hypertable.RangeServer.QueryCache.MaxMemory=2G
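Based on the description above, the reading configuration differs from the writing configuration only by this query cache setting, so one can be derived from the other. The file paths in the sketch below are illustrative, not the ones used in our cluster.

# Derive the reading config from the writing config by enlarging the query cache
cp hypertable/conf/hypertable.cfg /tmp/hypertable-reading.cfg
echo "Hypertable.RangeServer.QueryCache.MaxMemory=2G" >> /tmp/hypertable-reading.cfg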
Step 4. Configure Capistrano "Capfile"
We use a tool called Capistrano to manage Hypertable clusters. Capistrano is a simple tool that facilitates remote task execution. It relies on ssh and reads its instructions from a file called "Capfile". We augmented the Capfile to include tasks for starting and stopping the test framework. The following Capfile was used in our test:
This Capfile can be used to run the test on a different set of machines with a different configuration. The only requirement would be to edit the variables and role definitions at the top of the file. The following listing shows the top portion of the Capfile that would need to change to launch Hypertable and the performance evaluation test on a different cluster with a different configuration.
set :source_machine, "test00"
set :install_dir, "/opt/hypertable/doug"
set :hypertable_version, "0.9.5.5"
set :default_pkg, "/tmp/hypertable-0.9.5.5-linux-x86_64.deb"
set :default_dfs, "hadoop"
set :default_config, "/home/doug/benchmark/perftest-hypertable.cfg"
set :default_additional_args, ""
set :hbase_home, "/usr/lib/hbase"
set :default_client_multiplier, 1
set :default_test_driver, "hypertable"
set :default_test_args, ""

role :source, "test00"
role :master, "test01"
role :hyperspace, "test01", "test02", "test03"
role :slave, "test04", "test05", "test06", "test07", "test08", "test09", "test10", "test11", "test12", "test13", "test14", "test15"
role :localhost, "test00"
role :thriftbroker
role :spare
role :test_client, "test00", "test01", "test02", "test03"
role :test_dispatcher, "test00"
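Once the Capfile is in place, the cluster is driven with ordinary Capistrano task invocations from the source machine. `cap -T` is a standard Capistrano command that lists the tasks defined in the Capfile; the task names below (dist, start, stop) are the conventional Hypertable Capfile tasks and may differ if the Capfile has been customized.

$ cap -T        # list all tasks defined in the Capfile
$ cap dist      # distribute the Hypertable package and configuration to all roles
$ cap start     # start Hypertable on the cluster
$ cap stop      # stop Hypertable on the cluster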
Step 5. Run Tests
The test scripts, Capfile, and default Hypertable configuration file can be installed by untarring the test2.tar.gz archive.
$ tar xzvf test2.tar.gz
bin/
bin/run-test-load-sequential.sh
bin/run-test-scan.sh
bin/run-test-load.sh
bin/run-test-read-random.sh
bin/clean-database.sh
bin/test-config.sh
Capfile
perftest-hypertable.cfg
reports/
After you have untarred this archive, modify the Capfile as described in step 4 and modify the following properties in perftest-hypertable.cfg:
HdfsBroker.fs.default.name=hdfs://test01:9000
Hyperspace.Replica.Host=test01
Hyperspace.Replica.Host=test02
Hyperspace.Replica.Host=test03
Hypertable.RangeServer.Range.SplitSize=1GB
Test Scripts
The tests can be run by hand using the following set of scripts. Some of the tests need to have configuration files adjusted prior to running. We created scripts to run each test and, in between tests, adjusted the configuration as needed. Each of these test scripts deposits its results in a summary file in the reports/ subdirectory. The test parameters are encoded in the filename of the summary report, for example:
reports/test2-hbase-random-read-zipfian-4901960784-20-1000-512clients.txt
reports/test2-hbase-sequential-write-490196078-20-1000-48clients.txt
The following section describes each test script.
bin/test-config.sh
This script is included by the other test scripts and contains definitions for three important variables that control the behavior of the tests.
let DATA_SIZE=5000000000000

# The following variable points to the english Wikipedia export file that is sampled
# for value data.  This file must be present on all test client machines
VALUE_DATA=/data/1/test/enwiki-sample.txt

# The following variable points to the file containing the cumulative mass function
# data used to generate the Zipfian distribution.  It can be generated with the
# following command:
#
# /opt/hypertable/current/bin/jrun org.hypertable.Common.DiscreteRandomGeneratorZipf \
#   --generate-cmf-file /data/1/test/cmf.dat 0 100000000
#
CMF=/data/1/test/cmf.dat
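The key-count values embedded in the example report filenames above follow directly from DATA_SIZE and the key and value sizes. For a 20-byte key and 1KB value:

# key-count = DATA_SIZE / (key-size + value-size)
echo $((5000000000000 / (20 + 1000)))   # 4901960784  (5TB dataset)
echo $((500000000000  / (20 + 1000)))   # 490196078   (0.5TB dataset)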
bin/run-test-load.sh
This script is used to perform the random write test. The system is loaded with key-count cells, where key-count is computed as DATA_SIZE / (key-size + value-size), with DATA_SIZE being the variable defined in test-config.sh. The keys are formed as an ASCII string representation of a number chosen at random from the range [0..key-count]*10.
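For reference, the script takes the system name, key size, and value size as arguments; the invocation below mirrors the one used in the write/scan driver script later in this document.

# random write test against Hypertable with 20-byte keys and 10KB values
./bin/run-test-load.sh hypertable 20 10000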
bin/run-test-scan.sh
This script is used to perform the scan test. It scans the table loaded by the run-test-load.sh script (keys in the range [0..key-count]*10); the key space is divided into segments and each segment is fed to a test client for scanning.
bin/run-test-load-sequential.sh
This script is used to load a table in preparation for the random read tests. The system is loaded sequentially with keys in the range [0..key-count), where key-count is computed as DATA_SIZE / (key-size + value-size). After running this script, each key in the range [0..key-count) will contain exactly one cell.
bin/run-test-read-random.sh [--zipfian]
This script is used to perform the random read test. If the --zipfian argument is supplied, the test clients will generate a Zipfian key distribution over the range [0..key-count) loaded by the run-test-load-sequential.sh script. To efficiently generate the Zipfian distribution, the clients load cumulative mass function data from the file specified by the CMF variable in the test-config.sh script. This file must be present on all test client machines. If the --zipfian argument is not supplied, a uniform key distribution will be generated.
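The read script takes the same system, key-size, and value-size arguments, with the optional --zipfian flag placed first; these invocations mirror the random read driver script later in this document.

# uniform random read test
./bin/run-test-read-random.sh hypertable 20 1000

# Zipfian random read test (requires the CMF file on every test client)
./bin/run-test-read-random.sh --zipfian hypertable 20 1000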
bin/clean-database.sh
This script is used to clean the database in preparation for each load test.
Random Write and Scan Tests
The random write and sequential scan tests were run four times. The amount of data written into the table was fixed at 5TB, but the value size varied from 10KB to 10 bytes, with the corresponding cell count going from 500 million to 167 billion. Prior to running these tests, we set the data set size to 5TB by setting the DATA_SIZE variable in the test-config.sh file as follows:
let DATA_SIZE=5000000000000
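The cell counts quoted above follow from this DATA_SIZE and the fixed 20-byte key:

# cell count = DATA_SIZE / (key-size + value-size), with a 20-byte key
echo $((5000000000000 / (20 + 10000)))  # ~499 million cells (10KB values)
echo $((5000000000000 / (20 + 10)))     # ~167 billion cells (10-byte values)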
We also pushed out the hbase-site.xml file and modified the perftest-hypertable.cfg file to contain the configuration properties appropriate for each system under test. The following script illustrates how we ran the tests.
#!/usr/bin/env bash

# Set SYSTEM to either "hbase" or "hypertable" (taken from the first argument)
SYSTEM=$1

let VALUE_SIZE=10000
while [ $VALUE_SIZE -ge 10 ]; do
  ./bin/run-test-load.sh $SYSTEM 20 $VALUE_SIZE
  ./bin/run-test-scan.sh $SYSTEM 20 $VALUE_SIZE
  let VALUE_SIZE=VALUE_SIZE/10
done
Random Read Tests
The random read tests were run twice, once with a 5TB dataset and again with a 0.5TB dataset, to measure the performance of each system under different RAM-to-disk ratios. In addition to varying the dataset size, we ran each test with a uniform as well as a Zipfian key distribution. The Zipfian key distribution was chosen to simulate a realistic workload. All tests were run with a fixed value size of 1KB.
We pushed out the hbase-site.xml file and modified the perftest-hypertable.cfg file to contain the configuration properties appropriate for each system under test. The following script illustrates how we ran the tests.
#!/usr/bin/env bash

# Set SYSTEM to either "hbase" or "hypertable" (taken from the first argument)
SYSTEM=$1

./bin/run-test-load-sequential.sh $SYSTEM 20 1000
./bin/run-test-read-random.sh $SYSTEM 20 1000
./bin/run-test-read-random.sh --zipfian $SYSTEM 20 1000
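For the 0.5TB run, the only change was the dataset size in test-config.sh before re-running the same driver script:

let DATA_SIZE=500000000000   # 0.5TB dataset for the second random read run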