In my last post on databases, “Dueling Databases: NoSQL vs. SQL”, I mainly examined relational databases and different strategies for scaling them. This provided some context for why NoSQL databases are a necessary addition to the data-management repertoire, but I didn’t have space to take a deep look at what exactly makes NoSQL databases special or how they differ from—and complement—traditional, relational databases. That’s what I will cover in this post.
So What Is NoSQL Already?
The most important fact to remember about NoSQL databases is that they are a class of non-relational databases, not a single database model. In many ways, the only thing that all NoSQL databases have in common is that they are not relational, SQL-based databases. Apart from that, NoSQL databases encompass a wide variety of database paradigms, and some of these database structures differ widely from each other. To understand why this is the case, it can be helpful to quickly look at what drove the development of NoSQL databases.
How Did We Get Here—and Where Is Here?
Set the WABAC machine[*] to 2008. Highly data-intensive companies like Google, Facebook, and Twitter were hitting the limits of what relational databases could do (see the previous post in this series, “Dueling Databases: NoSQL vs. SQL”). These companies’ database needs provided a huge impetus to the development of NoSQL databases. Amazon’s launch of Amazon Simple Storage Service (Amazon S3) two years earlier, in 2006, helped lay the groundwork for mainstream acceptance of non-relational data storage. Once businesses and organizations began to accept working with dynamically typed data, the door was open to adopting databases adapted to these new data realities.
One way of thinking about current NoSQL databases is to divide them into two broad types: key-value (think of a giant hash table) and schema-less, which is a catch-all category that includes several sub-types. Looking back at historical precedents, Amazon S3 is an example of key-value data storage: because the kind of varied data stored on Amazon S3 defied any kind of meaningful structure or schema, the data was retrieved according to its unique key. Popular key-value databases include Redis, Memcached, and Riak KV.
Key-Value Stores …
Key-value databases are powerful because of their simplicity. They are almost infinitely flexible because they don’t particularly care (or need to care) about what they store. For a key-value database, everything just needs a key with which to retrieve the data. This structure also makes key-value stores fast; they are, after all, really just hash tables. Likewise, this structure makes key-value stores easy to scale vertically (giving a single instance more memory and storage so that its hash table can hold more entries) and horizontally (splitting the table and its records across multiple instances).
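To make the hash-table picture concrete, here is a minimal sketch in Python of a key-value store split across several shards. It is an illustration only, not how Redis, Memcached, or Riak KV are implemented: the `ShardedKVStore` class, its method names, and the key names are all hypothetical, and each "instance" is simply an in-memory dictionary.

```python
import hashlib


class ShardedKVStore:
    """Toy key-value store that spreads keys across in-memory shards."""

    def __init__(self, num_shards: int = 4) -> None:
        # Each shard is a plain dict standing in for a separate instance.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key: str) -> dict:
        # Use a stable hash so that a key always maps to the same shard.
        # (Python's built-in hash() is salted per process for strings.)
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def put(self, key: str, value: object) -> None:
        self._shard_for(key)[key] = value

    def get(self, key: str, default: object = None) -> object:
        return self._shard_for(key).get(key, default)


# The store does not care what the values are, only that each has a key.
store = ShardedKVStore(num_shards=4)
store.put("user:42", {"name": "Ada"})   # a dictionary
store.put("avatar:42", b"\x89PNG")      # raw bytes
store.put("counter:visits", 1017)       # an integer
```

Calling `store.get("user:42")` returns the dictionary that was stored, and a missing key returns `None`. One design caveat: with simple modulo placement like this, changing the number of shards moves most keys to a different shard, which is why production systems often use schemes such as consistent hashing instead.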
Key-value databases are far from being panaceas, though. Not all data is best modeled solely as a key-value pair. For example, think of a social network: if each person on that network has a record in a database, the relationships between those records are far more valuable information than the data stored in the individual records themselves. Schema-less databases represent other NoSQL methodologies better suited to other types of data.
… And Everything Else
Schema-less NoSQL databases come in several different varieties: column-based, document-based, and graph-based databases.
- Columnar databases might be the NoSQL variety closest to traditional relational databases, but they come with a big twist. Rather than organizing data by row, columnar NoSQL databases organize data by column. This means that, for example, instead of individual websites being catalogued as rows in a database table, with columns for URLs and text, a columnar database would index the websites’ URLs based on individual words in the website text. In this way, column-based databases can be efficient for finding which URLs contain the word “cat” or “video,” which is precisely what motivated Google to develop its Bigtable database in support of its search engine. Prominent examples of column-based databases include Apache Cassandra, Apache HBase, and Apache Accumulo.
- Graph-based databases differ from the other types of schema-less NoSQL databases in a way that looks small but has profound consequences. The actual data in graph databases can be stored using either a key-value or document-based database. Graph databases then overlay a graph structure of nodes, edges, and properties on the stored data. This is perfect for highly connected data that needs to be quickly traversed, such as the network of friends in a social network. Widely used graph databases include Neo4j, OrientDB, and Titan.
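The graph overlay described in the last bullet can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the storage model of Neo4j, OrientDB, or Titan: the people, the `since` edge property, and the `within_hops` helper are all invented for the example. The records sit in an ordinary key-value mapping, and the edges are layered on top of the record keys.

```python
from collections import deque

# Records live in a key-value mapping (hypothetical sample data).
people = {
    "alice": {"name": "Alice"},
    "bob": {"name": "Bob"},
    "carol": {"name": "Carol"},
    "dave": {"name": "Dave"},
}

# The graph overlay: each edge links two record keys and carries properties.
edges = [
    ("alice", "bob", {"since": 2010}),
    ("bob", "carol", {"since": 2012}),
    ("carol", "dave", {"since": 2015}),
]

# Build an adjacency list so neighbors can be found without scanning edges.
adjacency = {key: set() for key in people}
for a, b, _props in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)  # friendship is treated as undirected here


def within_hops(start: str, max_hops: int) -> dict:
    """Breadth-first search mapping each reachable person to a hop count."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    del distances[start]  # the starting person is not their own friend
    return distances


friends_of_friends = within_hops("alice", 2)  # {"bob": 1, "carol": 2}
```

The traversal only follows edges, so its cost depends on how many connections are visited rather than on the total number of records, which is the property that makes this layout attractive for friend-of-a-friend queries.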
No Free Lunch—or Trade-Off-Free Data Solutions
The great divide among NoSQL databases is whether or not they are based solely on key-value stores. Key-value databases represent the most streamlined NoSQL offerings, but they are not suited to all use cases. Columnar databases can provide fast performance for workloads that revolve around inverted indexes, such as indexing websites by keyword. Document-based databases are good for semi-structured data that can furnish useful metadata to the database. And graph databases are ideally suited for use cases where the relationship between data records is of prime importance.
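For readers unfamiliar with the term, the inverted-index workload mentioned above can be sketched as follows. This is plain Python rather than the API of Bigtable, Cassandra, or any other product, and the URLs, page text, and function names are made up for illustration. Instead of mapping each URL to its text, the index maps each word to the set of URLs whose text contains it.

```python
from collections import defaultdict

# Row-oriented view: one entry per website (hypothetical sample data).
pages = {
    "example.com/cats": "funny cat video compilation",
    "example.com/dogs": "dog training video",
    "example.com/news": "local news and weather",
}


def build_inverted_index(documents: dict) -> dict:
    """Map each word to the set of URLs whose text contains that word."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in text.lower().split():
            index[word].add(url)
    return dict(index)


def urls_containing(index: dict, word: str) -> list:
    """Look up a single word; an unknown word yields an empty list."""
    return sorted(index.get(word.lower(), set()))


index = build_inverted_index(pages)
video_urls = urls_containing(index, "video")
# video_urls == ["example.com/cats", "example.com/dogs"]
```

Answering "which URLs contain this word?" becomes a single lookup instead of a scan over every page, at the cost of doing the indexing work up front when pages are added.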
As a class, NoSQL contains multitudes of divergent database solutions. The principal thing that they all have in common is that they are not relational databases, but a related common denominator is that they present similar trade-offs in comparison with relational databases. NoSQL databases of all kinds have a lot to offer in terms of simplicity and scalability, particularly when working with dynamically typed data with frequently changing schemas. But everything has a trade-off, and database design is no exception.
NoSQL databases handle data consistency, availability, and amenability to partitioning differently than relational databases. The details of those trade-offs will be the subject of the next installment of this series.