The Hypertable Blog

New and Improved Secondary Indexes

07.10.2014  |  New Feature  |  0

As part of our ongoing effort to make Hypertable useful for a wide range of applications, we've greatly improved our support for secondary indexing.  These improvements, introduced in the 0.9.8.0 release, enable much more powerful queries that access rows by means other than the primary key.  As part of this effort, we've changed the query language to allow for more expressive queries and have brought it more in line with SQL.

Like the original implementation, the new secondary indexes are fully consistent and massively scalable.  They can be used to index not only column values, but also column qualifiers.  Secondary indexes can be defined in the CREATE TABLE statement with the INDEX and QUALIFIER INDEX clauses, for example:

CREATE TABLE customers (
    name,
    INDEX name,
    tag,
    QUALIFIER INDEX tag,
    info,
    INDEX info,
    QUALIFIER INDEX info
);

The ALTER TABLE command has been enhanced to allow indexes to be created and dropped on preexisting tables.  For example:

ALTER TABLE customers ADD ( QUALIFIER INDEX name ) DROP INDEX ( info );

To populate indexes added with ALTER TABLE or to convert indexes created with older versions of Hypertable into the new format, the indexes need to be rebuilt.  This can be accomplished with the new REBUILD INDICES command.  For example:

REBUILD INDICES customers;

The examples which follow in this post assume that the following table has been created and populated with products.tsv.

CREATE TABLE products (
  title,
  section,
  info,
  category,
  INDEX section,
  INDEX info,
  QUALIFIER INDEX info,
  QUALIFIER INDEX category
);

A set of scripts and data files for running all of the examples in this post can be found in the archive secondary-indexes.tgz.
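
Assuming products.tsv from that archive is in your current working directory, the table can be populated with something like the following HQL LOAD DATA INFILE statement:

LOAD DATA INFILE "products.tsv" INTO TABLE products;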

Queries against the value index

One important difference from normal table scan queries is that with secondary index queries the SELECT statement behaves like SQL SELECT: the WHERE predicate acts as a boolean selector of rows, and the columns listed after the SELECT keyword act as a projection of the selected rows. In other words, you can project an arbitrary set of columns that don't necessarily have to be referenced in the WHERE predicate.
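
For example, the following query filters on the section column but projects two columns that never appear in the predicate (output omitted here, since it depends on the loaded data):

SELECT info:author,info:publisher FROM products WHERE section = "books";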

Exact match

Select the title column of all rows whose section column contains exactly "books".

SELECT title FROM products WHERE section = "books";
0307743659	title	The Shining Mass Market Paperback
0321321928	title	C++ Common Knowledge: Essential Intermediate Programming [Paperback]
0321776402	title	C++ Primer Plus (6th Edition) (Developer's Library)

Select the title column of all rows that contain an info:actor column containing exactly "Jack Nicholson".

SELECT title FROM products WHERE info:actor = "Jack Nicholson";
B00002VWE0	title	Five Easy Pieces (1970)
B002VWNIDG	title	The Shining (1980)

Prefix match

Select the title and info:publisher columns of all rows that contain an info:publisher column whose value starts with "Addison-Wesley".

SELECT title,info:publisher FROM products WHERE info:publisher =^ 'Addison-Wesley';
SELECT title,info:publisher FROM products WHERE info:publisher =~ /^Addison-Wesley/;
0321321928	title	C++ Common Knowledge: Essential Intermediate Programming [Paperback]
0321321928	info:publisher	Addison-Wesley Professional; 1 edition (March 10, 2005)
0321776402	title	C++ Primer Plus (6th Edition) (Developer's Library)
0321776402	info:publisher	Addison-Wesley Professional; 6 edition (October 28, 2011)

Regular expression match

Select the title and info:author columns of all rows that contain an info:author column whose value starts with "Stephen " followed by either a 'P' or 'K' character.

SELECT title,info:author FROM products WHERE info:author =~ /^Stephen [PK]/;
SELECT title,info:author FROM products WHERE Regexp(info:author, '^Stephen [PK]');
0307743659	title	The Shining Mass Market Paperback
0307743659	info:author	Stephen King
0321776402	title	C++ Primer Plus (6th Edition) (Developer's Library)
0321776402	info:author	Stephen Prata

Queries against the qualifier index

Like the value index queries, the qualifier index queries have the traditional SQL select/project semantics.

Exact match

Select the title column of all rows that contain an info:studio column.

SELECT title FROM products WHERE Exists(info:studio);
B00002VWE0	title	Five Easy Pieces (1970)
B000Q66J1M	title	2001: A Space Odyssey [Blu-ray]
B002VWNIDG	title	The Shining (1980)

Regular expression match

Select the title column of all rows that contain a category column with qualifier that starts with "/Movies".

SELECT title FROM products WHERE Exists(category:/^\/Movies/);
B00002VWE0	title	Five Easy Pieces (1970)
B000Q66J1M	title	2001: A Space Odyssey [Blu-ray]
B002VWNIDG	title	The Shining (1980)

Select the title column of all rows that contain a category column with qualifier that starts with "/Books" followed at some point later with "Programming".

SELECT title FROM products WHERE Exists(category:/^\/Books.*Programming/);
0321321928	title	C++ Common Knowledge: Essential Intermediate Programming [Paperback]
0321776402	title	C++ Primer Plus (6th Edition) (Developer's Library)

Boolean operators

The boolean operators AND and OR can be used to combine column predicates. The boolean AND operator can also combine column predicates and ROW intervals.

OR operator

Select the title column of all rows that contain an info:author column that starts with "Stephen P" or contain an info:publisher column that starts with "Anchor".

SELECT title FROM products WHERE info:author =~ /^Stephen P/ OR info:publisher =^ "Anchor";
SELECT title FROM products WHERE info:author =~ /^Stephen P/ OR info:publisher =~ /^Anchor/;
0307743659	title	The Shining Mass Market Paperback
0321776402	title	C++ Primer Plus (6th Edition) (Developer's Library)

AND operator

Select the title column of all rows that contain an info:author column that starts with "Stephen " followed by either the letter 'P' or 'K' and contain an info:publisher column that starts witht "Anchor"

SELECT title FROM products
  WHERE info:author =~ /^Stephen [PK]/ AND info:publisher =^ "Anchor";
SELECT title FROM products
  WHERE info:author =~ /^Stephen [PK]/ AND info:publisher =~ /^Anchor/;
0307743659	title	The Shining Mass Market Paperback

Select the title column of all rows whose row key is greater than 'B00002VWE0' and that contain an info:actor column containing exactly "Jack Nicholson".

SELECT title FROM products WHERE ROW > 'B00002VWE0' AND info:actor = 'Jack Nicholson';
B002VWNIDG	title	The Shining (1980)

Select the title column of all rows whose row key starts with 'B' and that contain an info:actor column containing exactly "Jack Nicholson".

SELECT title FROM products WHERE ROW =^ 'B' AND info:actor = 'Jack Nicholson';
B00002VWE0	title	Five Easy Pieces (1970)
B002VWNIDG	title	The Shining (1980)

Programming examples (in all supported languages) of the above queries can be found in the Hypertable Developer Guide. Special thanks goes to Philip Gatt for providing feedback on the original implementation that led to these improvements!

Posted By:  Doug Judd




Testing of Hypertable RangeServer Failover

03.04.2013  |  Testing  |  0

As we mentioned in our previous post announcing the newly arrived RangeServer failover feature, robustness and application transparency were #1 priorities.  To achieve these objectives, we placed enormous emphasis on testing.  While this testing effort was painstaking and led to a very long development cycle, it has paid enormous dividends in quality and robustness.  The following table lists the regression tests that were added to Hypertable to guarantee correctness and ensure that the expected behavior holds true for future releases.

  1. RangeServer-failover-basic-1: Start Hypertable with two RangeServers (rs1,rs2) and then 1) Load table with data, 2) Kill rs1 and wait for recovery to complete, 3) Stop Hypertable and then restart with just rs2. Dump keys after each step and verify that the table contains the exact set of keys that were loaded. Verify that no ranges are assigned to rs1.
  2. RangeServer-failover-basic-2: Start Hypertable with three RangeServers (rs1,rs2,rs3) and then 1) Load table with data, 2) Kill rs1 and wait for recovery to complete, 3) Stop Hypertable and then restart with just rs2 and rs3. Dump keys after each of these three steps and verify that the table contains the exact set of keys that were loaded.
  3. RangeServer-failover-basic-3: Start Hypertable with five RangeServers (rs1,rs2,rs3,rs4,rs5) and then 1) Load table with data, 2) Kill rs1 and rs2 and wait for recovery to complete for both servers, 3) Stop Hypertable and then restart with just rs3, rs4, and rs5. Dump keys after each of the three steps and verify that the table contains the exact set of keys that were loaded.
  4. RangeServer-failover-basic-4: Start Hypertable with two RangeServers (rs1,rs2) and then load table with data. Kill rs1 and wait for recovery to...

read more

Hypertable has Reached a Major Milestone!

02.14.2013  |  RangeServer Failover  |  2

With the release of Hypertable version 0.9.7.0 comes support for automatic RangeServer failover.  Hypertable will now detect when a RangeServer has failed, logically remove it from the system, and automatically re-assign the ranges that it was managing to other RangeServers.  This represents a major milestone for Hypertable and allows for very large scale deployments.  We have been actively working on this feature, full-time, for 1 1/2 years.  To give you an idea of the magnitude of the change, here are the commit statistics:

  • 441 changed files
  • 17,522 line additions
  • 6,384 line deletions

The reason this feature has been so long in the making is that we set a very high standard of quality for it: under no circumstance should a RangeServer failure lead to consistency problems or data loss.  We're confident that we've achieved 100% correctness under every conceivable circumstance.  The two primary goals for the feature, robustness and application transparency, are described below.

Robustness

We designed the RangeServer failover feature to be extremely robust.  RangeServers can fail in any state (mid-split, transferring, etc.) and will be recovered properly.  The system can also withstand the loss of any RangeServer, even the ones holding the ROOT or other METADATA ranges.  To achieve this level of robustness, we added 63 regression tests that verify the correct handling of RangeServer failures in every conceivable failure scenario.  We will follow up later with a blog post describing these tests.

Application Transparency

Another important aspect of our RangeServer failover implementation is application transparency.  Aside from a transient delay in database access, RangeServer failures are...

read more

Roadmap to Hypertable 1.0

07.23.2012  |  Release Status  |  11

With the release of Hypertable version 0.9.6.0 I thought I would take some time to describe where we are in terms of the Hypertable 1.0 release and what work is remaining.  We had intended to make the next Hypertable release our beta release.  However, it’s been four months since the release of 0.9.5.6 and since the beta release is not quite ready to go, we decided to do one last alpha release and call it 0.9.6.0.  In this release we’ve put in a considerable effort to fix a number of stability issues that have affected prior releases.

0.9.6.0 Stability Improvements for HDFS deployments

The biggest source of instability for Hypertable deployments running on top of HDFS has to do with the unclean shutdown of either the Master or RangeServer.  Upon restart after this situation has occurred, the RangeServer (or Master) can fail to come up, with an error message similar to the following in its log file:

1342810317 ERROR Hypertable.RangeServer : verify_backup (/root/src/hypertable/src/cc/Hypertable/Lib/MetaLogReader.cc:131): MetaLog file '/hypertable/servers/rs12/log/rsml/0' has length 0 < backup file '/opt/hypertable/0.9.5.6/run/log_backup/rsml/rs12/0' length 11376

This problem was due to a misunderstanding on our part of the HDFS API semantics.  Whenever the Master or RangeServer writes data to any of its log files, it makes a call to FSDataOutputStream.sync() to ensure that the data makes it into the filesystem and is persistent.  However, after making this call, a call to FileStatus.getLen() does not return the correct value.  FileStatus.getLen() only returns the correct file length if the file was properly closed.  HDFS provides an alternate API, DFSClient.DFSDataInputStream.getVisibleLength(), that returns the actual length of the file regardless...

read more

Secondary Indices Have Arrived!

03.22.2012  |  New Feature  |  17

Until now, SELECT queries in Hypertable had to include a row key, row prefix or row interval specification in order to be fast. Searching for rows by specifying a cell value or a column qualifier involved a full table scan which resulted in poor performance and scaled badly because queries took longer as the dataset grew. With 0.9.5.6, we’ve implemented secondary indices that will make such SELECT queries lightning fast!

Hypertable supports two kinds of indices: a cell value index and a column qualifier index. This blog post explains what they are, how they work and how to use them.

The cell value index

Let’s look at an example of how to create those two indices.  A big telco asks us to design a table for its customer data.  Every user profile has a customer ID as the row key. But our system also wants to provide fast queries by phone number, since customers can dial in and our automated answering system can then immediately figure out who’s calling by checking the caller ID.  We therefore decide to create a secondary index on the phone number.  The following statement might be used to create this table along with a phone number index:

CREATE TABLE customers ( name, address, phone_no, INDEX phone_no );

Internally, Hypertable will now create a table customers and an index table ^customers. Every cell that is now inserted into the phone_no column family will be transformed and inserted into the index table as well. If you’re curious, you can insert some phone numbers and run SELECT * FROM “^customers”; to see how the index was updated.
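
For example (the customer ID and phone number below are made up for illustration), you could insert a cell and then scan the index table:

INSERT INTO customers VALUES ("cust-0001", "phone_no", "+1-555-0100");
SELECT * FROM "^customers";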

Not every query makes use of the...

read more

Sehrch.com: A Structured Search Engine Powered By Hypertable

03.15.2012  |  Hypertable Case Study  |  10

Sehrch.com is a structured search engine.  It provides powerful querying capabilities that enable users to quickly complete complex information retrieval tasks.  It gathers conceptual awareness from the Linked Open Data cloud, and can be used as (1) a regular search engine or (2) as a structured search engine.  In both cases conceptual awareness is used to build entity centric result sets.  Try this simple query: Pop singers less than 20 years old.

Sehrch.com gathers data from the Semantic Web in the form of RDF, crawling the Linked Open Data cloud and making requests with headers accepting RDF NTriples.  Data dumps are also obtained from various sources.  In order to store this data, we required a data store capable of storing tens of billions of triples using the least hardware while still delivering high performance.  So we conducted our own study to find the most appropriate store for this type and quantity of data.

As Semantic Web people, our initial choice would have been to use native RDF data stores, better known as triplestores.  But from our initial usage we quickly concluded that SPARQL compliant triplestores and large quantities of data do not mix well.  As a challenge, we attempted to load 1.3 billion triples (the entire DBpedia and Freebase datasets) into a dual core machine with only 3GB memory.  The furthest any of the open source triplestores (4store, TDB, Virtuoso) got in loading the datasets on the given hardware was around 80 million triples.  We were told that the only solution was more hardware.  We weren't the only ones facing significant hardware requirements when attempting to load this volume of data.  For example, in the following post a machine with 8 cores and 32GB...

read more

Welcome!

02.07.2012  |  Hypertable Unveils New Website  |  5

Welcome to the new Hypertable website.  This new website is easy to navigate and has all of the tools you'll need to learn about Hypertable and easily deploy it for your big data applications.

We have put a tremendous amount of effort into the Documentation section of the website.  There you'll find an Architectural Overview, Installation Instructions, Administrator Guides, Reference Guides and much more.  Be sure to check out the Code Examples section for working code examples written in Java, C++, PHP, Python, Perl and Ruby.

We're also very excited to announce new products and services available today:

  • UpTime Support Subscription – for round-the-clock, 7 days a week, 365 days a year "uptime" assurance for your Hypertable deployments -- with support staff located in the United States and Europe
  • Training and Certification –  taught by big data experts, Hypertable Training and Certification classes are held in Silicon Valley, USA and Munich, Germany
  • Commercial License – for organizations such as OEMs, ISVs, and VARs who distribute Hypertable with their closed source products, Hypertable Inc. offers the software under a flexible OEM commercial license

Also, please check out the Hypertable vs. HBase Performance Evaluation II.  In this in-house performance test, Hypertable demonstrates its power by easily loading 167 billion records into a 16-node test cluster, a test in which HBase failed, struggling with memory management.

If you're looking to take commercial advantage of the power of big data, then the Hypertable database platform is the right choice for you.  Our new product and service offerings will ensure that you get the most out of your Hypertable deployment, allowing you to take full advantage of Hypertable's unprecedented performance and cost-efficiency...

read more