Hadoop

Hadoop is an open source implementation of the Google File System and MapReduce parallel computation framework.  The Hadoop Distributed File System (HDFS) is the filesystem that most Hypertable deployments run on top of, as it provides all of the architectural features required to support Hypertable efficiently.  This document describes how to get Hypertable up and running on top of the Hadoop filesystem.

Prerequisites

Before you get started with the installation, there are some general system requirements that must be satisfied.  These requirements are described in the following list.

  • admin machine - You should designate one of the machines in your Hypertable cluster as the admin machine (admin1 in examples below).  This is the machine from which you will be administering the cluster.  It can be the same machine as one of the masters or can be any machine of your choosing.  The only requirement is that this machine have ssh access to all of the other machines in the cluster. 
     
  • password-less ssh - The tools used to administer Hypertable require password-less ssh login access from the admin machine to all other machines in the cluster (masters, hyperspace replicas, range servers, etc).  See Password-less SSH Login for details on how to set this up; a minimal sketch is also shown after this list. 
     
  • ssh MaxStartups - sshd on the admin machine needs to be configured to allow simultaneous connections from all of the machines in the Hypertable cluster.  The simultaneous connection limit, MaxStartups, defaults to 10.  See SSH Connection Limit for details on how to increase this limit.
     
  • firewall - The Hypertable processes use TCP and UDP to communicate with one another and with client applications.  Firewalls can block this traffic and prevent Hypertable from operating properly.  Any firewall that blocks traffic between the Hypertable machines should be disabled or the appropriate ports should be opened up to allow Hypertable communication.  See Hypertable Firewall Requirements for instructions on how to do this.
     
  • open file limit - Most operating systems have a limit on the total number of files that a process can have open at any one time.  This limit is usually set too low for Hypertable, since it can create a very large number of files.  See Open File Limit for details on how to increase this limit.
     
  • Linux Kernel Tuning - The Linux kernel exposes some configuration parameters that can be tuned to optimize the behavior of the kernel for the specific workload which the operating system will be handling.  To tune these parameters for optimum performance of the Hypertable server processes, see Linux Kernel Configuration.
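The linked guides above are the authoritative references for these prerequisites.  As a quick orientation, the following sketch shows the kind of commands and settings involved; the user name, host name, file locations, and limit values here are illustrative assumptions and should be adjusted for your environment.

# Password-less ssh: generate a key pair on admin1 and copy the public key
# to every other machine in the cluster (masters, hyperspace replicas, slaves)
$ ssh-keygen -t rsa
$ ssh-copy-id chris@slave000

# SSH connection limit: raise MaxStartups in /etc/ssh/sshd_config on the
# admin machine, then restart sshd
MaxStartups 100

# Open file limit: raise the per-user file descriptor limit in
# /etc/security/limits.conf on all machines running Hypertable processes
chris  -  nofile  65536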

Step 1 - Install HDFS

The first step in getting Hypertable up and running on top of Hadoop is to install HDFS.  Hypertable currently runs on top of the following Hadoop distributions:

  • Apache Hadoop 2
  • Cloudera CDH3
  • Cloudera CDH4
  • Cloudera CDH5
  • IBM BigInsights 3
  • Hortonworks Data Platform 2

Each RangeServer process should run on a machine that is also running an HDFS DataNode.  It's best not to run the HDFS NameNode on the same machine as a RangeServer since both of those processes tend to consume a lot of RAM.

To accommodate Bigtable-style workloads, HDFS needs to be specially configured.  The dfs.datanode.max.xcievers property, which controls the number of files that a DataNode can service concurrently, should be increased to at least 4096, and the dfs.namenode.handler.count property, which controls the number of NameNode threads available to handle RPCs, should be increased to at least 20.  This can be accomplished by adding the following lines to the conf/hdfs-site.xml file.

<property>
  <name>dfs.namenode.handler.count</name>
  <value>20</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
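Note that these settings only take effect after the HDFS daemons have been restarted.  For example, on a CDH-style installation the restart might look like the following; the service names are an assumption and vary by distribution and init system, so consult your Hadoop documentation for the exact commands.

$ sudo service hadoop-hdfs-namenode restart    # on the NameNode machine
$ sudo service hadoop-hdfs-datanode restart    # on each DataNode machine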

Once the filesystem is installed, create a /hypertable directory that is readable and writable by the user account under which Hypertable will run.  For example:

sudo -u hdfs hadoop fs -mkdir /hypertable
sudo -u hdfs hadoop fs -chmod 777 /hypertable
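To confirm that the directory was created with the expected permissions, list the filesystem root and check that the /hypertable entry appears with mode drwxrwxrwx:

sudo -u hdfs hadoop fs -ls /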

Step 2 - Install Hypertable on admin machine

The first step in installing Hypertable on a cluster is to choose a machine from which you'll administer Hypertable and install Hypertable on that machine.  In this document, we assume that the administration machine is called admin1.  Hypertable can be installed via binary packages, which can be downloaded as described on the Hypertable Download page.  The packages come bundled with nearly all of the dependent shared libraries.  The nice thing about this approach is that only a single package is needed for all flavors of Linux for each supported architecture (32-bit and 64-bit).  The only requirement is that your system be built with glibc 2.4+ (released on March 6th, 2006).  Hypertable comes with a program launch script, ht, that sets up LD_LIBRARY_PATH (or DYLD_LIBRARY_PATH) to point to the lib/ directory of the installation so that the dependent libraries can be found by the dynamic linker.

To begin the package installation, log into the admin machine (admin1) and then download the package you would like to install.  Then issue the command listed below for your operating system.

Redhat, CentOS, or SUSE Installation

$ sudo rpm -ivh --replacepkgs --nomd5 --nodeps --oldpackage package.rpm

Debian or Ubuntu Installation

$ sudo dpkg --install package.deb

Bzipped Archive Installation

sudo mkdir /opt/hypertable
tar xjvf package.tar.bz2
sudo mv hypertable-*/opt/hypertable/* /opt/hypertable/

Mac installation

Double-click the package.dmg file and follow the instructions

The Redhat, Debian, and Mac packages install Hypertable under /opt/hypertable/$VERSION by default.

Add Hypertable bin/ directory to PATH

So that you don't have to specify absolute paths when running Hypertable commands, we recommend that you add the Hypertable bin/ directory to the program search path for your shell.  If you're running the bash shell, this can be accomplished by adding the following line to your .bashrc file:

export PATH=$PATH:/opt/hypertable/current/bin

You'll need to log out and log back in to pick up the PATH change.  If you're running a shell other than bash, consult the documentation for your shell for instructions on how to modify the program search path.
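To check that the new search path has been picked up, inspect PATH from a fresh login shell.  Note that the ht program itself will only resolve once the "current" link has been created in Step 7.

$ echo $PATH | tr ':' '\n' | grep hypertable
/opt/hypertable/current/bin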

Step 3 - Edit cluster.def

Hypertable is administered with the ht_cluster cluster task automation tool.  It requires a configuration file called cluster.def which is located in the conf/ directory of the installation.  Hypertable comes with an example cluster configuration file called cluster.def-EXAMPLE.  On the admin machine (admin1 in this example), copy or rename this file to cluster.def, for example:

cd /opt/hypertable/0.9.8.5/conf
cp cluster.def-EXAMPLE cluster.def

There are some variables that are set at the top of this file that you need to modify for your particular environment. These variables are shown below.

INSTALL_PREFIX=/opt/hypertable
HYPERTABLE_VERSION=0.9.8.5
PACKAGE_FILE=/root/packages/hypertable-0.9.8.5-linux-x86_64.tar.gz
FS=hadoop
HADOOP_DISTRO=cdh4
ORIGIN_CONFIG_FILE=/root/hypertable.cfg
PROMPT_CLEAN=true

Table 1 contains a description of each of the variables.

Table 1. cluster.def Variables
Variable Description
INSTALL_PREFIX Directory on admin machine containing Hypertable installations. It is also the directory on the remote machines where the installation will get rsync'ed to.
HYPERTABLE_VERSION version of Hypertable you are deploying
PACKAGE_FILE Path to binary package file (.dmg, .rpm, or .tar.bz2) on source machine
FS File system you are running Hypertable on top of. Valid values are "local", "hadoop", "qfs", or "mapr"
HADOOP_DISTRO Hadoop distribution. Current supported values are "apache2", "cdh3", "cdh4", "cdh5", "ibmbi3", "hdp2" (See Table 2 below for description).
ORIGIN_CONFIG_FILE Location of the origin Hypertable configuration file that you plan to use.  This file should reside somewhere outside of the Hypertable installation and will act as the "master" or upstream configuration file.  Hypertable configuration changes should be made to this file and can then be distributed to the installation with the ht cluster push_config command.

Table 2 contains a description of the valid values that can be specified for the HADOOP_DISTRO variable.

Table 2. Hadoop Distribution Tags
Tag Distribution
apache2 Apache Hadoop 2
cdh3 Cloudera CDH3
cdh4 Cloudera CDH4
cdh5 Cloudera CDH5
ibmbi3 IBM BigInsights 3
hdp2 Hortonworks Data Platform 2

In addition to the above variables, you also need to specify which machines will act in the various roles.  You do that by editing the role definition statements:

role: source admin1
role: master master[00-02]
role: hyperspace hyperspace[00-02]
role: slave slave[000-199]
role: thriftbroker
role: spare

The following table describes each role.

Table 3. cluster.def Roles
Role Description
source The machine from which you will be distributing the binaries (admin1 in this example).
master The machine that will run the Hypertable master process as well as an FS broker. Ideally this machine is high quality and somewhat lightly loaded (e.g. not running a RangeServer). Typically you would have a high quality machine running the Hypertable master, a Hyperspace replica, and the HDFS NameNode
hyperspace The machines that will run Hyperspace replicas. There should be at least one machine defined for this role. The machines that take on this role should be somewhat lightly loaded (e.g. not running a RangeServer)
slave The machines that will run RangeServers. Hypertable is designed to run on a filesystem like HDFS. In fact, the system works best from a performance standpoint when the RangeServers are run on the same machines as the HDFS DataNodes. This role will also launch an FS broker and a ThriftBroker.
thriftbroker Additional machines that will be running a ThriftBroker (e.g. web servers).  NOTE: You do not have to add the master and slave machines to this role, since a ThriftBroker is automatically started on the master and each slave machine to support MapReduce. 
spare Machines that will act as standbys. They will be kept current with the latest binaries during installation and upgrades.

Step 4 - Install Hypertable on remaining machines

On the admin machine (admin1) where you installed Hypertable in step 2, make sure the HYPERTABLE_VERSION variable at the top of the cluster.def file is set to the version of Hypertable that you are installing.  Also make sure the PACKAGE_FILE variable is set to the absolute path of the package file that you are going to install.  For example, if you're installing version 0.9.8.5 and using the RPM package located at /root/packages/hypertable-0.9.8.5-linux-x86_64.rpm, set the variables as follows.

HYPERTABLE_VERSION=0.9.8.5
PACKAGE_FILE=/root/packages/hypertable-0.9.8.5-linux-x86_64.rpm

To distribute and install the binary package on all cluster machines, issue the following command.  This command will cause the package to get rsync'ed to all participating machines and installed with the appropriate package manager (rpm, dpkg, or tar) depending on the package type.

$ ht cluster install_package

If you prefer compiling the binaries from source, you can use ht cluster to distribute the binaries with rsync. On admin1 be sure Hypertable is installed in the location INSTALL_PREFIX/HYPERTABLE_VERSION as defined by the variables at the top of the cluster.def file (/opt/hypertable/0.9.8.5 in this example). Then distribute the binaries with the following command.

$ ht cluster dist

Step 5 - FHS-ize installation

See Filesystem Hierarchy Standard for an introduction to FHS. If you're running as a user other than root, first create the directories /etc/opt/hypertable and /var/opt/hypertable on all machines in the cluster and change ownership to the user account under which the binaries will be run. For example:

$ sudo ht cluster
cluster> mkdir /etc/opt/hypertable /var/opt/hypertable
cluster> chown chris:staff /etc/opt/hypertable /var/opt/hypertable

Then FHS-ize the installation with the following command:

$ ht cluster fhsize

Step 6 - Edit hypertable.cfg

The next step is to tailor the hypertable.cfg file for your deployment.  A basic hypertable.cfg file can be found in the conf/ subdirectory of your hypertable installation which can be copied and modified as needed.  The following table shows the minimum set of required and recommended properties that you need to modify.

Table 4. Required and Recommended Properties
Property Description
HdfsBroker.Hadoop.ConfDir Directory containing the Hadoop configuration files.  The Hadoop DFS broker obtains the address of the NameNode by reading the fs.defaultFS property from the core-site.xml file and obtains the default replication factor by reading the dfs.replication property from the hdfs-site.xml file.
Hyperspace.Replica.Host Hostname of Hyperspace replica
Hypertable.Cluster.Name Assigns a name to a cluster which will be displayed in the Monitoring UI and in the subject line of administrator notification messages
Hypertable.RangeServer.Monitoring.DataDirectories This property is optional, but recommended.  It contains a list of directories that are the mount points of the HDFS data node storage volumes.  By setting this property appropriately, the Hypertable monitoring system will be able to provide accurate disk usage information.

You can leave all other properties at their default values.  Hypertable is designed to adapt to the hardware on which it runs and to adjust dynamically to changes in workload, so no special configuration is needed beyond the basic properties listed in the above table.  For example, the following shows the changes we made to the hypertable.cfg file for our test cluster.

HdfsBroker.Hadoop.ConfDir=/etc/hadoop/conf

Hyperspace.Replica.Host=hyperspace00
Hyperspace.Replica.Host=hyperspace01
Hyperspace.Replica.Host=hyperspace02

Hypertable.Cluster.Name="SV Primary"

Hypertable.RangeServer.Monitoring.DataDirectories="/data/1,/data/2,/data/3,/data/4"

See hypertable.cfg

Once you've created the hypertable.cfg file for your cluster, put it on the source machine (admin1) and set the absolute pathname referenced in the ORIGIN_CONFIG_FILE variable at the top of cluster.def to point to this file (e.g. /root/hypertable.cfg). Then distribute the custom config files with the following command.

$ ht cluster push_config

If you ever need to make changes to the config file, make the changes to the origin file, re-run ht cluster push_config, and then restart Hypertable (see the Give it a try section below for instructions on starting and stopping Hypertable).
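For example, a typical configuration change cycle might look like the following sketch, using the origin config file location from the example above (your path and editor may differ):

$ vi /root/hypertable.cfg       # edit the origin ("master") configuration file
$ ht cluster push_config        # distribute the updated file to the cluster
$ ht cluster stop               # restart Hypertable to pick up the change
$ ht cluster start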

Step 7 - Set "current" link

To make the latest version of Hypertable referenceable from a well-known location, create a "current" link to point to the latest installation.  This can be accomplished with the following command.

$ ht cluster set_current

Step 8 - Install monitoring system dependencies

The Monitoring UI is written in Ruby with the Sinatra web framework and uses RRDTool to create databases for storing metric data.  Ruby 1.8.7 or greater is required.  The monitoring UI is served by the active master. The first step to getting the dependencies set up is to install Ruby, RubyGems, and RRDTool on all of the master machines:

Redhat (CentOS)

$ sudo ht cluster
cluster> with master yum -y install rrdtool ruby rubygems ruby-devel ruby-rdoc

Debian (Ubuntu)

$ sudo ht cluster
cluster> with master apt-get -y install rrdtool ruby rubygems ruby-dev rdoc

Mac OSX

Ruby is installed by default on OSX, so nothing needs to be done to install it.  For RRDTool, you can install it with either MacPorts or Homebrew.  To install it with MacPorts, issue the following command:

$ sudo ht cluster
cluster> with master port install rrdtool

To install it with Homebrew, issue the following command:

$ sudo ht cluster
cluster> with master brew install rrdtool

After successfully installing Ruby, RubyGems, and RRDTool, the next step is to verify that you have Ruby 1.8.7 or greater installed:

$ ruby --version
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]

If your version of ruby is older than 1.8.7, you'll need to consult your operating system documentation (i.e. the Internet) to figure out how to install a newer version of Ruby and RubyGems (the packages may be called ruby19, rubygems19, ...).

Once you have the correct version of Ruby and RubyGems installed, install the required gems as follows:

$ sudo ht cluster
cluster> with master gem install sinatra rack thin json titleize syck

NOTE: the syck gem is only required for ruby >= 2.0

Step 9 - Install notification script

The Hypertable Master will invoke a notification script (conf/notification-hook.sh) to inform the Hypertable administrator of certain events, such as machine failure or any problems that may have been encountered during machine failure recovery.  The script accepts two arguments, a subject string and a message body string.  The prefix of the subject line string can be examined to determine the type of notification, "NOTICE" indicating notification of an abnormal condition, and "ERROR" indicating a hard error that requires intervention.  The following is an example notification script (/opt/hypertable/current/conf/notification-hook.sh) that can be used to email notifications to a list of administrators:

#!/usr/bin/env bash

# List of recipients (space separated) to whom notification messages will be sent
recipients="root"
subject="$1"
message="$2"
echo -e "$message" | mail -s "$subject" ${recipients}

Modify the recipients variable to contain the list of recipients to whom notification messages are to be sent.  Verify that the script works properly by testing it manually:

/opt/hypertable/current/conf/notification-hook.sh "Test Message" "This is a test."

Once the script is working properly, you can distribute it to all appropriate machines with:

ht cluster push_config

Step 10 - Synchronize clocks

Hypertable cannot operate correctly unless the clocks on all machines are synchronized.  Use the Network Time Protocol (NTP) to ensure that the clocks get synchronized and remain in sync.  Run the 'date' command on all machines to make sure they are in sync.  The following ht cluster shell session shows the output of a cluster with properly synchronized clocks.

$ ht cluster
cluster> date
[hyperspace00] Wed Oct 29 09:09:17 PDT 2014
[hyperspace01] Wed Oct 29 09:09:17 PDT 2014
[hyperspace02] Wed Oct 29 09:09:17 PDT 2014
[master00] Wed Oct 29 09:09:17 PDT 2014
[master01] Wed Oct 29 09:09:17 PDT 2014
[master02] Wed Oct 29 09:09:17 PDT 2014
[slave10] Wed Oct 29 09:09:17 PDT 2014
[slave11] Wed Oct 29 09:09:17 PDT 2014
[slave12] Wed Oct 29 09:09:17 PDT 2014
[slave13] Wed Oct 29 09:09:17 PDT 2014
[slave14] Wed Oct 29 09:09:17 PDT 2014
[slave15] Wed Oct 29 09:09:17 PDT 2014
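If the clocks are not synchronized, install and enable an NTP daemon on every machine so that they stay in sync.  The following is a sketch for Redhat-style systems; the package and service names are assumptions that vary by distribution.

$ sudo ht cluster
cluster> yum -y install ntp
cluster> service ntpd start
cluster> chkconfig ntpd on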

Give it a try

Start Hypertable

To start all of the Hypertable servers:

$ ht cluster start

Alternatively, if you want to start the primary database processes and the ThriftBrokers independently, use the following two commands:

$ ht cluster start_database
$ ht cluster start_thriftbrokers

Verify that it works

Create a table.

echo "CREATE TABLE foo ( c1, c2 ); GET LISTING;" | ht shell --batch

The output of this command should look like:

foo
sys (namespace)
tmp (namespace)

Load some data.

echo "INSERT INTO foo VALUES('001', 'c1', 'very'), \
    ('000', 'c1', 'Hypertable'), ('001', 'c2', 'easy'), ('000', 'c2', 'is');" \
    | ht shell --batch

Dump the table.

echo "SELECT * FROM foo;" | ht shell --batch

The output of this command should look like:

000	c1	Hypertable
000	c2	is
001	c1	very
001	c2	easy

View Monitoring UI

The Hypertable monitoring UI is accessible on the machine that is the active master.  It takes about a minute after Hypertable comes up for the Monitoring system to gather information and start populating the metric databases.  Issue the following command to determine the IP address of the active master:

$ echo "open /hypertable/master; attrget /hypertable/master address;" | ht hyperspace --batch | cut -f1 -d:
SESSION CALLBACK
192.168.17.11

Once you know the IP address of the active master you can pull up the monitoring UI in a web browser by pointing it at the active master IP address and port 15860 (http://192.168.17.11:15860 in this example).  If you get an error message in the browser, wait a minute, then hit refresh and it should come up. 
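If you prefer to check from the command line, you can verify that the monitoring server is responding with curl, substituting the active master address determined above:

$ curl -I http://192.168.17.11:15860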

Stop Hypertable

To stop Hypertable, shutting down all servers:

$ ht cluster stop

To stop the ThriftBrokers only:

$ ht cluster stop_thriftbrokers

To stop the primary database processes (should be run after stop_thriftbrokers):

$ ht cluster stop_database

If you want to wipe your database clean, removing all namespaces and tables:

$ ht cluster destroy

What Next?

Congratulations!  Now that you have successfully installed Hypertable, we recommend that you walk through the HQL Tutorial to get familiar with using the system.

Appendix A. How to Change Hadoop Distro

Hypertable can run on most modern distributions of Hadoop.  The Hypertable packages currently have built-in support for the Apache, Cloudera, IBM BigInsights, and Hortonworks distributions.  CDH4 is the distribution that is configured by default.  To switch to the Hortonworks (HDP2) distribution, edit the HADOOP_DISTRO variable at the top of your cluster.def file:

HADOOP_DISTRO=hdp2

and then run the following commands:

$ ht cluster set_hadoop_distro
$ ht cluster push_config