The Delicate Art of Organizing Data in HBase
Here at Flurry we make extensive use of HBase, a distributed column-store database system built on top of Apache Hadoop and HDFS. For those used to relational database systems, HBase may seem quite limited in the ways you can query data. Rows in HBase can have unlimited columns, but can only be accessed by a single row key. This is kind of like declaring a primary key on a SQL table, except you can only query the table by this key. So if you wish to look up a row by more than one column, as in SQL where you would perform a select * from table where col1 = val1 and col2 = val2, you would have to read the entire table and then filter it. As you can probably tell, reading an entire table can be an extremely costly operation, especially on the data sizes that HBase usually handles. One of the solutions to this is to use a composite key, combining multiple pieces of data into a single block.
What’s a composite key?
HBase uses basic byte arrays to store its data, in both row keys and columns. However it would be tedious to try and manipulate those in our code, so we store our keys in a container class that knows how to serialize itself. For example:
Here we have two fields forming a single key with a method to serialize it to a byte array. Now any time we want to look up or save a row, we can use this class to do so.
While this allows us to instantly get a row if we know both the userId and applicationId, what if we want to look up multiple rows using just a subset of the fields in our row key? Because HBase stores rows sorted lexicographically by comparing the key byte arrays, we can perform a scan to effectively select on one of our index fields instead of requiring both. Using the above index as an example, if we wanted to get all rows with a specific userId:
Since we’ve set the start and stop rows of the Scanner, HBase can give us all the rows with the specified userId without having to read the whole table. However there is a limitation here. Since we write out the userId first and the applicationId second, our data is organized such that all rows with the same userId are adjacent to each other, and rows with the same applicationId but different userIds are not. Thus if we want to query by just the applicationId in this example, we need to scan the entire table.
There are two “gotchas” to this serialization approach that makes it work as we expect it. First, we assume here that userId is greater than 0. The binary representation of a negative 2’s Complement Integer would be lexicographically after the largest positive number. So if we intend to have negative userIds, we would need to change our serialization method to preserve ordering. Secondly, DataOutput.writeUTF specifically serializes a string by first writing a short (2 bytes) for its length, then all the characters in the string. Using this serialization method, the empty string is naturally first lexicographically. If it we serialized it using a different method, such that the empty string were not first, then our scan would stop somewhere in the next userId.
As you can see, just putting our index data in our row key is not enough. The serialization format of our components determines how well we can exploit the properties of HBase lexicographic key ordering, and the order of the fields in our serialization determines what queries we are able to make. Deciding what order to write our row key components in needs to be heuristically driven by the reads we need to perform quickly.
How to organize your key
The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation I find it easiest to think of your key like a funnel. Start with the piece of data you always need to partition on, and narrow it down to the more specific data that you don’t often need to distinguish. Using the above example, if we almost always partition your data by userId, putting it in our index first is a good idea. That way we can easily select all the applicationIds for a userId, and also select a specific applicationId for a userId when we need to. However, if we are often looking up data by the applicationId for all users, then we would probably want to put the applicationId first.
As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.
When your key gets too complex
If your queries go beyond the simple funnel model, then it’s time to consider adding another table. Those used to heavily normalized relational databases will instinctively shy away from repeating data in multiple tables, but HBase is designed to handle large amounts of data, so we need to make use of that to overcome the limitations of a single index. Adding another table that stores the same cells with a differently organized key can reduce your need to perform full table scans, which are extremely expensive time-wise, at the cost of space, which is significantly cheaper if you’re running at Hadoop scale. Using the above example, if we routinely needed to query by just userId or applicationId, we would probably want to create a second table. Assuming we still want to be able to query by both userId and applicationId, the field we use as a key for the second table would depend on the distribution of the relationship of user to application and vice versa. If a user has more applications than an application has users, then that would mean scanning a composite key of (userId, applicationId) would take longer than scanning it in (applicationId, userId) order, and vice versa.
The downside to this approach is the added complexity of ensuring that the data in both of your tables is the same. You have to ensure that whenever you write data to one table, you write it to both tables simultaneously. It helps to have all of your HBase reading and writing encapsulated so that individual producers or consumers are not accessing the HBase client code directly, and the use of multiple tables to represent a single HBase-backed entity is opaque to its users.
If you’ve got a lot of rows
Sometimes storing large parts of your data in the key can be hazardous. Scanning multiple rows is usually more costly than reading a single row, even if that single row has many columns. So if performing a scan on the first part of your composite key often returns many rows, then you might be better off reorganizing your table to convert some of the parts of your composite key to column names. The tradeoff there is that HBase reads entire rows at a time, so while you can instruct the HBase client to only return certain columns, this will not speed up your queries. In order to do that, you need to use column families.
Getting it right the first time
Finally, always take extreme care to pick the right row key structure for your current and future querying needs. While most developers are probably used to throwing out most of the code they write, and refactoring many times as necessary for evolving business requirements, changing the structure of an HBase table is very costly, and should be avoided at all costs. Unlike with most relational databases, where adding a new index is a simple operation with a mild performance hit, changing an HBase table’s structure requires a full rewrite of all the data in the table. Operating on the scale that you’d normally use Hadoop and HBase for, this can be extremely time consuming. Furthermore, if you can’t afford your table to be offline during the process, you need to handle the migration process while still allowing reads from the old and new tables, and writes to the new table, all while maintaining data integrity.
HBase or not HBase
All of this can make using HBase intimidating to first time users. Don’t be afraid! These techniques are common to most NoSQL systems which are the future of large scale data storage. Mastering this new world allows you to unlock a massively powerful system for large data sets and perform analysis never possible before. While you need to spend more time up front on your data schema, you gain the ability to easily work with Petabytes of data and tables containing hundreds of Billions of rows. That’s big data. We’ll be writing more about HBase in the future, let us know if there’s something you’d like us to cover.
Here at Flurry, we constantly working to evolve our big data handling strategies. If that sounds interesting, we are hiring engineers! Please check out http://flurry.com/jobs for more information.