A Technology Overview of NoSQL Embedded Database Systems

This article explores the backdrop against which NoSQL embedded database systems emerged and gives a high-level technical overview of NoSQL systems. It is not a review or comparison of specific NoSQL embedded database products.

Why relational came about in the first place

This will, by necessity for an article of this size, be a gross oversimplification. E.F. Codd wanted to separate the logical database model from the physical database model. Earlier database models, e.g. the hierarchical and network models, lacked this separation and required the programmer to have intimate knowledge of the physical layout of the data in order to navigate through the database contents, following “database pointers” from parent record to child records (called member records in the network model). Consequently, these database programming interfaces were called “navigational APIs”, a term that is still in use today.

Codd created a mathematical model, the end result of which allows programmers to express “what” data is desired and allows the database management system to figure out the “how”. Typically, this is done with the SQL language, though there is nothing in the relational database model (or Codd’s famous 12 rules) that mandates SQL.
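
The difference between “how” and “what” can be illustrated with a small sketch. It is only an illustration: the records, field names (first_order, next_order) and tables are hypothetical, and SQLite from Python's standard library stands in for a relational engine.

```python
import sqlite3

# Navigational style: the program follows "pointers" from a parent record
# to its child records, so it must know how the records are linked.
customers = {1: {"name": "Ada", "first_order": 10}}
orders = {
    10: {"item": "disk", "next_order": 11},
    11: {"item": "tape", "next_order": None},
}

def items_navigational(customer_id):
    items = []
    order_id = customers[customer_id]["first_order"]
    while order_id is not None:          # walk the chain of child records
        record = orders[order_id]
        items.append(record["item"])
        order_id = record["next_order"]
    return items

# Declarative style: state what is wanted and let the engine decide how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, "disk"), (11, 1, "tape")]
)

def items_declarative(customer_id):
    rows = conn.execute(
        "SELECT item FROM orders WHERE customer_id = ? ORDER BY id",
        (customer_id,),
    )
    return [item for (item,) in rows]
```

Both functions return the same items for customer 1, but only the navigational one would have to change if the physical linkage between records changed.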

For clarity, NoSQL embedded database systems have always existed alongside relational database systems, even before the term NoSQL became popular, and before scalability became the blunt object with which to beat up RDBMS. BerkeleyDB was a very popular key-value store in the 90s and early 00s, before it was acquired by Oracle. The motivation for choosing non-SQL solutions at the time was greater simplicity, greater flexibility, and better performance by virtue of bypassing a SQL interpreter.

Enter “web scale” systems

Prior to the burgeoning popularity of the internet in the early 90s, businesses ran their systems in private data centers on IBM mainframes, on departmental minicomputers such as Hewlett Packard’s HP 3000, on various Unix-based microcomputers and, eventually, on desktop PCs. Companies generally did not grant access to their systems to external entities. To the extent that data was passed between systems, this was done through arcane batch systems like EDI (Electronic Data Interchange). Scalability within an organization was a tractable problem.

Along came Amazon-like companies (so-called Web 2.0) with their need for systems that could scale to accommodate thousands of concurrent, geographically dispersed users. No single computer system could handle the workload, which drove the need for “horizontal scalability” within and across data centers, spreading the workload across a number of computer systems. This implies the need to also distribute the database across a number of systems which, rightly or wrongly, was a perceived weakness of the RDBMSs available at the time. That weakness gave birth to NoSQL systems that offered the promise of superior scalability while sacrificing some benefits of the relational model (and eventually to NewSQL systems that promise both relational model fidelity and scalability).

NoSQL is not one thing

NoSQL encompasses many different approaches to database system design, for example key-value, graph, document and wide column systems. Some graph database vendors, notably Neo4j, don’t include themselves under the umbrella of NoSQL. Neo4j goes to some lengths to distinguish its technology from NoSQL, presenting it as vastly superior to graph capabilities grafted onto NoSQL database systems. Graph database systems don’t really solve the horizontal scalability problem, though. Such systems expect a graph to fit in a single system, so they don’t lend themselves to partitioning and distributing the partitions across systems (also known as sharding). For this reason, graph databases will be excluded from the remainder of this article.

Key-value, document and wide column stores solve the scalability problem by partitioning the database and distributing the partitions across nodes of a cluster of systems, or in some cases, by distributing the entire database across a cluster of systems. So-called NewSQL systems also take this approach.

NoSQL databases are generally denormalized. This simplifies partitioning because everything about something of interest in the database is kept together in one key-value pair, or one document. Conversely, relational database systems containing normalized data are more difficult to partition. For example, if there is a one-to-many relationship between CUSTOMER and ORDER, and a given CUSTOMER row resides on node 1 of a cluster while the ORDER rows whose foreign key references that CUSTOMER reside on a different node, it is necessary to solve the thorny problem of implementing cross-node joins. Avoiding this necessity when partitioning and distributing normalized data takes some work.
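
A minimal sketch of why denormalization eases partitioning follows. It assumes a hypothetical three-node cluster modeled as in-memory dictionaries and a simple hash-modulo placement rule; real systems use more elaborate schemes, and the names node_for, put and get are illustrative only.

```python
import hashlib

NODE_COUNT = 3  # hypothetical cluster size

def node_for(key: str) -> int:
    """Map a key to a node using a stable hash of the key."""
    # hashlib is used because Python's built-in hash() is salted per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODE_COUNT

# Each node is modeled as a plain dict standing in for a real server.
nodes = [dict() for _ in range(NODE_COUNT)]

def put(key: str, document: dict) -> None:
    nodes[node_for(key)][key] = document

def get(key: str) -> dict:
    # A single-key lookup touches exactly one node; no cross-node join is needed.
    return nodes[node_for(key)][key]

# Denormalized: the customer and its orders travel together as one document.
put("customer:42", {
    "name": "Ada",
    "orders": [{"id": 10, "item": "disk"}, {"id": 11, "item": "tape"}],
})
```

Because the orders are stored inside the customer document, reading a customer together with its orders is one lookup on one node.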

So, we can solve scalability by partitioning and distributing NoSQL databases. But distributed NoSQL systems introduce their own challenges, which are summarized by the CAP Theorem:

It is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:

Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

Enforcing consistency implies replicating data between nodes synchronously, which requires a round trip between the sending and receiving node(s) before a write can be acknowledged. This is painfully slow, so forget about scalability, and you must sacrifice partition tolerance: if the network drops the messages between nodes, a synchronous write cannot complete.
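
The cost of synchronous replication can be sketched as follows. This is a toy model under stated assumptions: the classes Replica and SyncPrimary and the reachable flag are hypothetical, and the network is simulated by a boolean rather than real messaging.

```python
class ReplicaUnreachable(Exception):
    """Raised when a write cannot be acknowledged by every replica."""

class Replica:
    def __init__(self):
        self.data = {}
        self.reachable = True   # set to False to simulate a network partition

    def apply(self, key, value):
        if not self.reachable:
            raise ReplicaUnreachable()
        self.data[key] = value

class SyncPrimary:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def put(self, key, value):
        # Refuse the write unless every replica can acknowledge it,
        # so no reader can observe a stale copy.
        if not all(r.reachable for r in self.replicas):
            raise ReplicaUnreachable(f"write of {key!r} rejected")
        for r in self.replicas:
            r.apply(key, value)   # stands in for one round trip per replica
        self.data[key] = value
```

With every replica reachable, put succeeds only after all copies are updated. Once one replica is cut off, every write is rejected, which is the behavior the paragraph above describes.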

Enforcing availability is generally accomplished by implementing eventual consistency, in other words, by giving up the consistency guarantee and allowing replication to proceed asynchronously. This means that some nodes will have stale data at any given moment but will catch up eventually.
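
Eventual consistency can be sketched in the same toy style. The class AsyncPrimary and its methods are hypothetical, and the explicit replicate call stands in for what a real system would do continuously in the background.

```python
from collections import deque

class AsyncPrimary:
    def __init__(self, replica_count):
        self.data = {}
        self.replicas = [dict() for _ in range(replica_count)]
        self.pending = deque()   # writes not yet shipped to the replicas

    def put(self, key, value):
        # Acknowledge immediately; replication happens later.
        self.data[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, index, key):
        # May return stale data (or None) until replication catches up.
        return self.replicas[index].get(key)

    def replicate(self):
        # Drain the backlog in write order so replicas converge on the
        # primary's state.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica[key] = value
```

A read from a replica between put and replicate returns the old value; after replicate runs, every replica agrees with the primary.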

Providing partition tolerance also means sacrificing consistency if the system is to keep answering requests. You can’t guarantee consistency while a node that holds a slice of the horizontally partitioned data is unreachable.

Conclusion

NoSQL embedded database systems offer the hope of greater simplicity. The fundamental operations of a key-value store, for example, are get and put. Pretty simple. NoSQL also offers the hope of greater flexibility. Generally, these are unstructured or, in other words, schema-less databases. Key-value and document stores permit you to store any opaque value in the database without defining its structure beforehand. Wide column stores allow different rows of the same table to contain different columns. This comes at the cost of greater complexity in other ways. For example, the burden shifts to the programmer to figure out if or how to pre-join/denormalize data. But that denormalization also provides the benefit of less friction when grappling with horizontal scalability.
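
The get/put interface and the schema-less nature of the values can be sketched as follows. The KeyValueStore class is a hypothetical in-memory stand-in, not the API of any particular product, and JSON is just one possible encoding for the opaque values.

```python
import json

class KeyValueStore:
    """A toy in-memory store; values are opaque bytes as far as the store knows."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes, default=None):
        return self._data.get(key, default)

store = KeyValueStore()

# No schema is declared: two values under similar keys can have different shapes.
store.put(b"customer:1", json.dumps({"name": "Ada"}).encode())
store.put(
    b"customer:2",
    json.dumps({"name": "Bob", "orders": [{"item": "disk"}]}).encode(),
)

# The application, not the store, is responsible for interpreting the bytes.
customer = json.loads(store.get(b"customer:2"))
```

The second value carries its orders pre-joined inside it, which is the denormalization decision that the store leaves to the programmer.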