Sehrch.com: A Structured Search Engine Powered By Hypertable
03.15.2012 | Hypertable Case Study
Sehrch.com is a structured search engine. It provides powerful querying capabilities that enable users to quickly complete complex information retrieval tasks. It gathers conceptual awareness from the Linked Open Data cloud, and can be used as (1) a regular search engine or (2) a structured search engine. In both cases, conceptual awareness is used to build entity-centric result sets. Try this simple query: Pop singers less than 20 years old.
Sehrch.com gathers data from the Semantic Web in the form of RDF, crawling the Linked Open Data cloud and making HTTP requests with Accept headers requesting RDF N-Triples. Data dumps are also obtained from various sources. To store this data, we required a data store capable of holding tens of billions of triples on minimal hardware while still delivering high performance, so we conducted our own study to find the most appropriate store for this type and quantity of data.
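The content-negotiation step of such a crawl can be sketched in a few lines of Python: dereference a Linked Data URI with an Accept header asking for N-Triples. This is a minimal illustration, not Sehrch.com's actual crawler; the DBpedia URI in the comment is just an example.

```python
import urllib.request

# MIME type requested from Linked Data servers via content negotiation
NTRIPLES_MIME = "application/n-triples"

def build_request(resource_uri: str) -> urllib.request.Request:
    """Build an HTTP GET request whose Accept header asks the server
    to return the resource's RDF description as N-Triples."""
    return urllib.request.Request(resource_uri, headers={"Accept": NTRIPLES_MIME})

def fetch_ntriples(resource_uri: str) -> str:
    """Dereference a Linked Data URI and return the N-Triples payload."""
    with urllib.request.urlopen(build_request(resource_uri)) as response:
        return response.read().decode("utf-8")

# Example (requires network access):
# triples = fetch_ntriples("http://dbpedia.org/resource/Berlin")
```

Servers that honor content negotiation will then return plain N-Triples, which is the easiest RDF serialization to stream and bulk-load.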
As Semantic Web people, our initial choice would have been a native RDF data store, better known as a triplestore. But from our initial usage we quickly concluded that SPARQL-compliant triplestores and large quantities of data do not mix well. As a challenge, we attempted to load 1.3 billion triples (the entire DBpedia and Freebase datasets) into a dual core machine with only 3GB of memory. The furthest any of the open source triplestores we tried (4store, TDB, Virtuoso) got on that hardware was around 80 million triples. We were told that the only solution was more hardware. We weren't the only ones facing significant hardware requirements when attempting to load this volume of data. For example, in the post Setting up a local DBpedia mirror with Virtuoso, a machine with 8 cores and 32GB of RAM was used to load just the English and German portions of DBpedia (approximately 300 million triples) into Virtuoso. We were attempting to load four times that much data, on a machine with only 10% of the memory!
We then discovered Hypertable. We were able to load DBpedia and Freebase (1.3 billion triples) into Hypertable in less than 24 hours on that same dual core machine (still with only 3GB of memory). To see how far Hypertable could go, we loaded the datasets three times over, in total storing close to 4 billion triples on that single node! We were shocked to discover that even with that volume of data, Hypertable could still deliver a sustained query throughput of 1,000 queries per second. Since then, we have never looked back, and we believe using Hypertable in this way could be an eye-opener for the Semantic Web community as a whole. To learn more about Sehrch.com and how Hypertable is used as the underlying storage technology, see the following case study:
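The post doesn't describe the table schema behind this bulk load, but one natural mapping for a Bigtable-style store like Hypertable is subject as row key, predicate as column, and object as cell value, flattened into the tab-separated row/column/value format that HQL's LOAD DATA INFILE accepts. The sketch below shows that conversion; the schema mapping and table name are assumptions for illustration, not Sehrch.com's actual design.

```python
def parse_ntriple(line: str):
    """Crudely split one N-Triples statement into (subject, predicate, object).
    Subjects and predicates are URIs with no embedded spaces; the object
    (a URI or a literal) is everything before the statement's trailing '.'."""
    line = line.strip()
    if line.endswith("."):
        line = line[:-1].rstrip()
    subject, predicate, obj = line.split(" ", 2)
    return subject, predicate, obj

def triples_to_tsv(ntriples_lines):
    """Yield tab-separated (row key, column, value) lines: subject as the
    row key, predicate as the column, object as the cell value."""
    for line in ntriples_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and N-Triples comments
        s, p, o = parse_ntriple(line)
        yield f"{s}\t{p}\t{o}"

# The resulting file could then be bulk-loaded with HQL, e.g.:
#   LOAD DATA INFILE "triples.tsv" INTO TABLE rdf;
```

Keying on the subject clusters all of an entity's triples in one row, which suits entity-centric lookups; answering object-to-subject queries efficiently would additionally need an inverted table with the object as the row key.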
Posted By: Az from Sehrch.com, email: az