CT-Talk with Michael Markovitch from Brown University

on "FITing-Tree: A data aware index structure"

Abstract: Index structures are one of the most important tools that DBAs leverage in order to improve the performance of analytics and transactional workloads. However, with the explosion of data that is constantly being generated in a wide variety of domains including autonomous vehicles, Internet of Things (IoT) devices, and E-commerce sites, building several indexes can often become prohibitive and consume valuable system resources. In fact, a recent study has shown that indexes created as part of the TPC-C benchmark can account for 55% of the total memory available in a state-of-the-art in-memory DBMS. This overhead consumes valuable and expensive main memory, and limits the amount of space that a database has available to store new data or process existing data.

An index is essentially a function mapping a key to a location, and FITing-Tree approximates the index function using piece-wise linear functions. This approximation allows us to reduce the size of the index by taking advantage of the underlying data distribution. However, approximation is a lossy type of compression, and therefore the approximated index may not point to the actual location. At the core of our index is a tunable error parameter which we use to bound the distance between the approximate location to the actual location. The bounded error provides a strong guarantee that the approximated location is always "close" to the actual location. This allows us to exploit the imbalance between bandwidth and latency: the time it takes to read a batch of multiple consecutive locations can be shorter than two round trips to memory (and the error represents the size of the batch). The tunable error parameter therefore allows to balance lookup performance and space consumption by through the tradeoff between bandwidth and latency. In order for FITing-Tree to be practical, the approximation must be done in a very fast and efficient manner. We provide a streaming algorithm which segments the key space to non overlapping linear segments, where every piece-wise linear function has an error guarantee, with only a single pass over the data.