Indexing Key-Value Stores for Large-Scale Real-time Analytics


Project Description

Introduction
Large-scale real-time analytics are promising in many modern database/web applications, ranging from social networks, mobile clouds, to large-system diagnosis. Various application-specific software is developed for large-scale real-time analytics, such as Titan for graph OLTP workloads, ElasticSearch for full-text search, and Dapper for distributed system tracing, and etc. To store and persist the big-data generated from these modern applications, various emerging key-value stores have the potential; the key-value stores, such as HBase, Cassandra, Riak, HyperTable, etc, are all designed for scalable and write-intensive data storage. However, as the front-end software requires rich data accesses, the key-value stores may fall short; existing key-value stores primarily provide key-based access methods such as Get, the value-based access methods are rarely supported, due to various challenges for indexing big data at scale.

Research proposal: Deferred index maintenance
This project proposal focuses on the problem of real-time index maintenance in write-optimized key-value stores. The problem is flexible to different index structures, either a global index (e.g. an index table) or a local index (e.g. a replica of a per-node store or an index embedded on each storage file). The real-time index maintenance requires that given a data update, the index entries obsoleted by the updates should be removed in real-time; however, locating and accessing obsolete index entries could be consuming due to the log-structured merge design. To reduce the write amplification caused by real-time indexing, we propose several approaches for lightweight index maintenance. Our first idea is to delay the obsolete index "repair" at offline hours when compactions occur for storage reorganization. Towards this aspect, we propose HIndex, a system design that closely couples an index repair with the native compaction process. Our second idea is based on the observation that offline index repair in HIndex may incurs extra overhead for query processing. To improve the indexed query performance, I further proposed RIndex that schedules the expensive index-repair operations in online hours. To minimize the performance intrusion, I proposed various adaptive scheduling strategies that are carefully designed with the awareness of current data access patterns, system loads, and unique performance characteristics in the underlying key-value stores.

System impl. and deployment
The prototype system of HIndex and RIndex is implemented based on different cloud storage infrastructures, including HBase and Cassandra. The HBase-based implementation is based on HBase's CoProcessor API. For performance study, HIndex is currently deployed on EMulab. We are currently planning to deploy HIndex on realistic public-cloud platforms, such as Amazon AWS. We have set up YCSB as a benchmark tool for HIndex. In addition, we have implemented a general framework, called Diff-index, for exploring different indexing options on key-value stores.


Papers

  1. Wei Tan, Sandeep Tata, Yuzhe Tang, Liana Fong
    Diff-Index: Differentiated Index in Distributed Log-Structured Data Stores
    EDBT 2014, [
    pdf]
     
  2. Yuzhe Tang, Arun Iyengar, Wei Tan, Ling Liu
    Lightweight Index Maintenance for Log-structured Key-Value Stores
    [
    Under submission, src]
     
  3. Yuzhe Tang, etc
    Real-time Analytics on Scalable Key-Value Stores
    [
    Under preparation]