I just finished my makeup exam for the database course. Posting my reading report here to share (and hopefully pass~).

“Seven Databases in Seven Weeks” was published in 2012. It introduces seven of the most popular open-source databases at the time, including a relational database (PostgreSQL), key-value databases (Riak, Redis), a column-oriented database (HBase), document-oriented databases (MongoDB, CouchDB), and a graph database (Neo4j). Except for PostgreSQL, the other six are collectively called NoSQL, meaning they do not use the relational model and do not use SQL as their query language.

The book follows the same format as “Seven Languages in Seven Weeks”: one chapter per database, each chapter divided into three sections called Day 1, Day 2, and Day 3. Unlike official database documentation, this book does not simply introduce each technology, but explores the core concepts of each one, helping readers understand the strengths and weaknesses of each database and which database to use for which kinds of requirements.

When the Relational Model Hits a Wall

For many years, relational databases represented by Oracle and MySQL have been the de facto choice, and also what we learn in database courses. However, with the development of the Web and big data, a series of problems with the relational model have become apparent:

  1. Schemas are hard to modify. The relational model requires specifying entities, attributes, and relationships in advance. In the complex and rapidly changing Web environment, we often need to add or modify attributes and relationships, and even use different attribute sets for different entities, which is not easy to achieve in the relational model.
  2. Strict consistency and durability sacrifice speed. Some Web applications only require so-called “eventual consistency,” meaning they can tolerate brief periods of inconsistency during transaction processing. In such cases, the strict consistency model of relational databases slows things down. Similarly, some databases are used purely as caches, or can tolerate small amounts of data loss; then a lot of data can be kept in memory instead of always being written to disk. Some NoSQL databases take advantage of these weaker durability requirements and complete many operations in memory, increasing throughput by an order of magnitude.
  3. The strict consistency model sacrifices availability or partition tolerance. The classic theorem for distributed databases is the “CAP theorem”: you can only have at most two of Consistency, Availability, and Partition Tolerance. Relational databases usually prioritize consistency, while under big data conditions partition tolerance becomes the primary concern. Therefore, in NoSQL systems we often need to sacrifice some degree of consistency to ensure availability and partition tolerance.

The mathematically elegant relational model has run into the wall of scalability. The NoSQL movement emerged as a response. NoSQL emphasizes customization: choosing appropriate databases and designing appropriate distributed architectures based on actual needs. From this perspective, NoSQL increases the workload of database administrators; the era when you could just buy a few “big iron” high-performance machines and sleep soundly is gone. For programmers, NoSQL systems are generally schemaless. This seems to avoid some hassles, but in reality, just like the difference between dynamic and static type systems, with fewer constraints, bugs have more places to hide.

Since PostgreSQL is a classic relational database, “Seven Databases in Seven Weeks” only covers some PostgreSQL-specific features such as full-text search and multidimensional queries in addition to the relational model; this report will omit that part.

Riak

Riak is a distributed key-value database. A Riak instance has several buckets (similar to tables in a relational database). Each bucket contains many keys (similar to rows in a relational database), and the value can be data of any type (specified by the Content-Type in the HTTP header).

The query language is HTTP REST: CRUD (Create, Read, Update, Delete) operations are implemented via HTTP GET (analogous to SELECT in SQL), POST (INSERT), PUT (UPDATE), and DELETE (DELETE). Data types and some options are also specified via HTTP headers.

Of course, as a NoSQL database, Riak does not support the defining operation of relational databases: joins between relations.

Fault Tolerance

Fault tolerance is Riak’s most important feature. Riak was inspired by Amazon’s Dynamo paper, whose key contribution lies in implementing tradeoffs among CAP (Consistency, Availability, Partition Tolerance). Riak has three parameters:

  • N: the number of replicas in the cluster, i.e., the number of nodes to which a write will eventually be replicated. Note that N can be smaller than the number of nodes in the cluster.
  • W: the number of nodes that must successfully complete the write before a response is returned. If W is less than N, a success response is returned to the client after W nodes have written successfully; the remaining N–W nodes continue replicating in the background.
  • R: the number of nodes that must respond for a read to succeed; data is read from R nodes and the newest version is chosen (how versions are compared is explained below). Why read from R nodes instead of 1? Because when W is less than N, a write may be reported successful before all replicas have received it, and a read at that moment might hit a stale replica. If R + W > N, the set of nodes read must overlap the set of nodes written, so at least one of the R responses is guaranteed to hold the newest value.

There are some classic configurations of N, R, and W:

  • W = N, R = 1: this is what relational databases do—ensure that writes are fully completed before returning to guarantee consistency, but writes are relatively slow.
  • W = 1, R = N: makes writes the fastest, at the cost of slower reads.
  • W = R = N/2 + 1: balances the latency between reads and writes.

It’s worth noting that N, R, and W can be configured differently for each key, and even R can be specified as a parameter in the read request (GET). This deep idea from the Dynamo paper allows Riak to be flexible enough to handle data with different consistency requirements and read/write frequencies.
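The overlap argument behind R + W > N can be sketched in a few lines. This is a toy model in Python, not Riak's API: with N replicas, any write set of size W and any read set of size R must intersect whenever R + W > N.

```python
# Minimal sketch of Dynamo-style quorum overlap (illustrative, not Riak's API).
# With N replicas, a write acknowledged by W nodes and a read from R nodes
# must share at least one replica whenever R + W > N.

def read_write_overlap(n, w, r):
    """Worst-case number of replicas that appear in both the write set
    (size w) and the read set (size r) out of n replicas."""
    return max(0, w + r - n)

# The classic configurations from the text, with N = 3:
assert read_write_overlap(3, 3, 1) >= 1   # W = N, R = 1: consistent reads
assert read_write_overlap(3, 1, 3) >= 1   # W = 1, R = N: fast writes
assert read_write_overlap(3, 2, 2) >= 1   # W = R = N/2 + 1: balanced
# With R + W <= N there may be no overlap, so a read can miss the newest value:
assert read_write_overlap(3, 1, 1) == 0
```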

Vector Clocks

As mentioned earlier, when R > 1 you need to decide which version among the ones read is the newest. This is a fundamental problem in distributed systems. The simplest solution is, of course, to synchronize all machine clocks precisely and just use timestamps. Google’s Spanner (OSDI 2012 Best Paper) does exactly this: it uses atomic clocks and GPS to ensure that any two machines in the world differ by at most a few milliseconds. Of course, when the English edition of “Seven Databases in Seven Weeks” was published, Spanner might not have been out yet.

Riak’s vector clocks use a technique similar to Git. Each time an update occurs, a pseudo-random tag is appended to the end of that key’s vector clock, which is equivalent to increasing the vector’s length by 1. When two different nodes synchronize, if their vectors are identical, no synchronization is needed. If one vector is a subset of the other, only one-way copying is required (similar to a fast-forward update in Git). If neither vector is a subset of the other, then when the client queries that key, two values are returned. The client can resolve the conflict itself and submit a new value and vector (similar to resolving a merge conflict in Git).

As the number of updates grows, the vector will keep expanding. Therefore Riak provides trimming options to prune old entries in the vector clock (for example, older than one day).
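The subset-based comparison described above can be sketched with plain sets. This is a toy model of the idea, not Riak's actual vector-clock encoding: each update appends a unique tag, and two clocks are ordered by the subset relation on their tags.

```python
# Toy model of the vector-clock comparison described above (not Riak's
# actual encoding): each update appends a unique tag; clocks are compared
# by the subset relation on their tag sets.

def compare(a, b):
    """Return 'equal', 'a_newer', 'b_newer', or 'conflict'."""
    sa, sb = set(a), set(b)
    if sa == sb:
        return "equal"        # identical histories: nothing to sync
    if sb < sa:
        return "a_newer"      # b's history is contained in a's: fast-forward
    if sa < sb:
        return "b_newer"
    return "conflict"         # divergent histories: client must merge

base = ["t1", "t2"]
assert compare(base, base) == "equal"
assert compare(base + ["t3"], base) == "a_newer"
assert compare(base + ["t3"], base + ["t4"]) == "conflict"
```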

MapReduce

Several databases introduced in the book provide MapReduce frameworks. MapReduce breaks a problem into two parts. First, a map function transforms a list of data into another list of data of a different type. Second, a reduce function transforms the list produced by map into one or more scalar values. This pattern allows the system to break a task into smaller component tasks and run them in parallel across a large cluster.

In Riak, MapReduce tasks can be written in JavaScript. JavaScript code used as a map function can also be stored in the database as a stored procedure so that every time you GET that key, this stored procedure is executed.
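The map/reduce pattern itself is easy to show in a single process. Here is a word count as the example task (hypothetical data; in Riak the two functions would be JavaScript and would run distributed across the cluster):

```python
# Single-process sketch of the map/reduce pattern: map turns a list of
# documents into (key, value) pairs, reduce collapses the pairs into totals.
from collections import defaultdict

def map_phase(docs):
    pairs = []
    for doc in docs:
        for word in doc.split():
            pairs.append((word, 1))    # emit one pair per word occurrence
    return pairs

def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value           # sum the counts per word
    return dict(totals)

counts = reduce_phase(map_phase(["big data", "big iron"]))
assert counts == {"big": 2, "data": 1, "iron": 1}
```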

HBase

HBase is a column-oriented database based on the ideas from Google’s BigTable, mainly used for big data problems. HBase is built on top of Hadoop. Hadoop is a scalable computing platform that provides a distributed file system and MapReduce. HBase is suitable for back-end OLAP (analytical) workloads: it is relatively slow for individual operations but excels at scanning huge data sets.

Data Structure

HBase stores data in tables. Each record is a row, each row can have several column families, and each column family is a hashmap whose entries are called cells. Compared to relational databases, this adds one more layer: in relational databases, the basic feature is that each “cell” at the intersection of a row and a column is atomic; in HBase, the data at the intersection of a row and a column family is itself a hashmap, and HBase indexes the keys within that hashmap.

This structure is very suitable for inverted indexes on text: rows are articles, there is only one column family, and the hashmap stores each word in the text as the key and the word’s frequency as the value. Once HBase has indexed this, it’s very easy to find which article a word appears in most frequently.
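The row → column family → cell structure can be sketched with nested dicts. This is illustrative only (real HBase cells also carry timestamps and are kept sorted on disk), using the inverted-index example from the text:

```python
# Sketch of HBase's nesting: table -> row -> column family -> cells,
# modeled as nested dicts (real HBase also versions cells by timestamp).
from collections import Counter

def index_article(table, row_key, text):
    # one column family ("terms") whose cells map word -> frequency
    table[row_key] = {"terms": dict(Counter(text.lower().split()))}

table = {}
index_article(table, "article-1", "to be or not to be")
index_article(table, "article-2", "to do")

# Which article uses the word "to" most often?
top = max(table, key=lambda row: table[row]["terms"].get("to", 0))
assert top == "article-1"
assert table["article-1"]["terms"]["be"] == 2
```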

Common link information on the Web also fits this structure well: rows are pages; in the hashmap, each link’s anchor text is the key and the link target is the value.

Why do we need multiple column families? This is to avoid creating multiple tables when storing different types of information. For example, an article has an author, title, body, and comments. The author and title can be stored in the same column family, but if you need inverted indexes for the body and comments, putting them in the original column family would cause confusion. So it’s better to set up three column families: basic info (author, title), content index, and comment index.

If you don’t often need to fetch entire rows, it’s recommended to place different column families into different tables so that the cluster can partition regions more effectively.

HBase has built-in support for versioning data, using timestamps accurate to the millisecond to identify data versions, so it’s well-suited for applications like wikis that need to preserve historical versions.

Data Partitioning

HBase can be operated via Hadoop’s Java API; the book uses JRuby.

HBase keeps the rows of each table sorted by row key and divides the key space into several non-overlapping regions, assigning each region to a specific region server in the cluster. A special .META. table records which region server is responsible for which region of each table.

If a region server fails, the master will reassign the regions that belonged to the failed node to other region servers. If the master fails, the next region server in line will take over its responsibilities.

Compared with Riak, HBase’s availability is not as high. With suitable parameter configuration, Riak can keep working even if only one node remains alive, whereas HBase can only survive disasters in which one or two nodes go down.

Additionally, it’s worth noting that HBase can easily scale up but is hard to scale down, so it’s like a “nail gun”: it can do big jobs, but you have to be careful not to hurt your own fingers.

MongoDB

Relational databases have powerful querying capabilities, while databases like Riak and HBase have strong distributed characteristics. MongoDB strikes a balance between the two, which is why many emerging web applications use it as their persistent store.

Data structure

MongoDB consists of several collections (similar to tables in a relational database). Each collection contains multiple documents (similar to rows in a relational database), and each document is a JSON object. Unlike relational databases, JSON objects have no concept of “schema,” and values can be nested to arbitrary depth (i.e., the value corresponding to a key in a JSON object can itself be a JSON object).

Each document contains an _id field as the unique identifier of this document. This _id is generated automatically and consists of a 4‑byte timestamp, a 3‑byte machine ID, a 2‑byte process ID, and a 3‑byte incrementing counter, ensuring the uniqueness of _id.
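The 12-byte layout described above can be assembled by hand. This is a sketch of the byte layout only (real MongoDB drivers generate ObjectIds for you, and the machine-id bytes here are a random stub):

```python
# Sketch of the 12-byte _id layout: 4-byte timestamp + 3-byte machine id
# + 2-byte process id + 3-byte counter (drivers do this for you).
import os
import struct
import time

def make_object_id(counter):
    ts = struct.pack(">I", int(time.time()))        # 4-byte timestamp
    machine = os.urandom(3)                          # 3-byte machine id (stub)
    pid = struct.pack(">H", os.getpid() & 0xFFFF)    # 2-byte process id
    count = struct.pack(">I", counter)[1:]           # low 3 bytes of counter
    return (ts + machine + pid + count).hex()

oid = make_object_id(1)
assert len(oid) == 24   # 12 bytes -> 24 hex characters
```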

Data structures that require joins between tables in relational databases can often be represented with subdocuments in MongoDB. Using the example of posts containing comments: in a relational model, you need a posts table and a comments table and then establish a foreign key; in MongoDB, you only need a collection of posts, where each post is a document, and one of its fields is a subdocument that contains all the comments for that post.

This data structure is clearly more natural than the relational model. If you need to break the tree structure and query all comments ever made by a particular author, you can create an index for that and lookups will be no slower than in a relational database. MongoDB can use B‑tree indexes, 2D indexes, or spherical geospatial indexes.

Queries

MongoDB’s native language is JavaScript. Since JavaScript has native support for JSON, data manipulation feels very natural.

MongoDB’s most basic query method is find, which can work similarly to SQL:

  • Specify conditions such as equality, greater than, less than, in a set, etc.
  • Support regular expressions (a feature SQL does not have)
  • Connect the above conditions with Boolean operators (and, or, not)
  • Filter which fields are included in the result set (equivalent to projection in relational algebra)
  • Use aggregation functions (such as count, distinct)
  • GROUP BY with user‑defined reduce functions

Of course, due to limitations of the JavaScript language, query conditions must be written as JSON objects, which is less intuitive than SQL.
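To make the condition-document style concrete, here is a toy matcher supporting a handful of operators. It mimics the shape of find's query documents but is not MongoDB code; the real server implements many more operators.

```python
# Toy matcher for find-style condition documents ($gt, $lt, $in, $regex only;
# MongoDB itself supports far more operators).
import re

def matches(doc, query):
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):                   # operator document
            for op, arg in cond.items():
                if op == "$gt" and not (value is not None and value > arg):
                    return False
                elif op == "$lt" and not (value is not None and value < arg):
                    return False
                elif op == "$in" and value not in arg:
                    return False
                elif op == "$regex" and not re.search(arg, str(value)):
                    return False
        elif value != cond:                          # plain equality
            return False
    return True

doc = {"title": "NoSQL", "views": 42}
assert matches(doc, {"views": {"$gt": 10, "$lt": 100}})
assert matches(doc, {"title": {"$regex": "^No"}})
assert not matches(doc, {"views": {"$in": [1, 2, 3]}})
```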

By appending .explain() to a query, you can find out whether indexes are used and similar information, much like the EXPLAIN statement in MySQL.

MongoDB provides atomic increment/decrement instructions, as well as atomic instructions that update and return the old value.

MongoDB also provides a JavaScript API for MapReduce.

Replication and sharding

MongoDB uses a method similar to hot standby in relational databases to increase redundancy. A MongoDB cluster automatically “elects” a primary node to handle all read and write operations, and secondary nodes automatically synchronize data from the primary. To ensure elections work properly, there should always be an odd number of nodes; otherwise, if the network partitions the cluster into two halves with equal numbers of nodes, neither half can elect a primary, causing a service interruption. (A replica set consisting of a single node is a special case: it simply elects itself.)

When the data set becomes so large that one node can no longer handle it, horizontal sharding can be done based on value ranges. A mongos server is needed as a “reverse proxy”: it acts as the single endpoint for user requests while maintaining how data is partitioned across mongod nodes and dispatching requests accordingly.
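A minimal sketch of the range-based routing a mongos front end performs: look up which shard owns the key's range and dispatch there. The shard layout here is hypothetical; real mongos reads the layout from the config servers.

```python
# Sketch of range-based shard routing (hypothetical layout; mongos gets
# the real layout from the config servers).
import bisect

# Shards own contiguous ranges of the shard key; 'boundaries' holds the
# split points: (-inf,"g") -> shard0, ["g","p") -> shard1, ["p",inf) -> shard2.
boundaries = ["g", "p"]
shards = ["shard0", "shard1", "shard2"]

def route(key):
    return shards[bisect.bisect_right(boundaries, key)]

assert route("alice") == "shard0"
assert route("mongo") == "shard1"
assert route("redis") == "shard2"
```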

Just like RAID, replication and sharding can be used simultaneously; their purposes are redundancy and performance, respectively. For example, three servers can form a replica set, two replica sets can form two shards, and with the addition of configuration servers and a mongos “reverse proxy” server, you get an eight‑server MongoDB cluster.

CouchDB

CouchDB is quite similar to MongoDB. Their two main differences are:

  1. CouchDB is more lightweight and can be embedded into an application like SQLite.
  2. CouchDB data carries version numbers.

Transactions

Neither MongoDB nor CouchDB provides “locks” or “transactions.” In MongoDB, you often rely on built‑in “atomic operations” to implement these behaviors; CouchDB’s secret to avoiding conflicts is ensuring that only the latest version of a document is modified. CouchDB assigns a field named _rev to each document (in the form of a monotonically increasing version number plus a pseudorandom string), and whenever the document changes, the _rev field is updated. For update (POST) and delete (DELETE) operations, you must provide not only _id but also _rev. If _rev does not match, the operation is rejected. A rejected operation might indicate that between the time the document was last read and the time you attempted to update or delete it, the document was modified, causing the version number to change.

When deleting a document, CouchDB does not modify the document in place; instead, it “updates” it with a new blank document.
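The _rev check is classic optimistic concurrency control, and the core of it fits in a small in-memory sketch. This is a toy store, not CouchDB's API; the rev format (counter plus pseudorandom suffix) follows the description above.

```python
# Toy optimistic-concurrency store in the style of CouchDB's _rev check:
# an update succeeds only if the caller supplies the current revision.
import uuid

class RevStore:
    def __init__(self):
        self.docs = {}

    def put(self, _id, body, _rev=None):
        current = self.docs.get(_id)
        if current is not None and current["_rev"] != _rev:
            raise ValueError("conflict: document changed since it was read")
        seq = 1 if current is None else int(current["_rev"].split("-")[0]) + 1
        new_rev = f"{seq}-{uuid.uuid4().hex[:8]}"   # counter + random part
        self.docs[_id] = {"_rev": new_rev, "body": body}
        return new_rev

store = RevStore()
rev1 = store.put("a", {"x": 1})
rev2 = store.put("a", {"x": 2}, _rev=rev1)          # ok: rev matches
try:
    store.put("a", {"x": 3}, _rev=rev1)             # stale rev: rejected
    assert False, "stale update should have been rejected"
except ValueError:
    pass
assert rev2.startswith("2-")
```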

In MongoDB, special keys start with $, while in CouchDB, special keys start with _.

In addition to accessing CouchDB via a JavaScript API, you can also use an HTTP REST API similar to Riak’s.

Views

Views in CouchDB are implemented via JavaScript functions. Typically, such a function performs a “full table scan,” executing the map function on each document to obtain a collection of map results. HBase uses a similar method to generate views. While this method is flexible, full table scans are extremely time‑consuming, so HBase provides filtering conditions within the map function, allowing documents that meet the conditions to be filtered first before being passed into the map function. CouchDB has no such filter mechanism, but it can cache map results, so each time a view is queried, the map function needs to be executed only on documents that have changed.

In addition to map‑based views, CouchDB also supports reduce.

Change push

In web applications, you often need to know whether some specific value has changed. In relational databases, this is often achieved with triggers.

In CouchDB, you can write JavaScript code to listen to a collection; whenever any write operation occurs on this collection, the listener code will be invoked. By adding filters in the listener code, you can focus on a subset of changes, such as listening only for document deletions or only for documents that match certain criteria.

CouchDB’s change push works well together with Node.js’s event‑driven mechanism, using long polling: send a listen request to CouchDB, and when an event occurs, a callback function is invoked. The callback receives chunks of data from CouchDB until all data has arrived, then parses the data and iterates through the documents, calling the map function.

Replication

CouchDB provides redundancy in a way similar to Riak: each node can independently perform reads and writes. A symptom of conflicts is that the same _id has different _rev values on two nodes. CouchDB handles this in a very straightforward and coarse way: it chooses one of the two different _rev values (of course, this selection algorithm is not random). Since CouchDB retains historical versions of documents, if a client discovers something is wrong, it can read historical versions and perform recovery.

Neo4j

Neo4j is a “graph database,” in the mathematical sense of a directed graph. Neo4j focuses on storing relationships between data. Unlike relational databases, Neo4j does not require the schema for relationships to be defined in advance, making it very convenient to modify relationships and store “special‑case” relationships.

The recommended programming language for Neo4j is Gremlin, a graph traversal domain‑specific language implemented in Groovy. It uses standard graph theory terminology. Suppose g is the graph object and alice is some vertex in the graph, then:

  • g.V represents all vertices
  • g.E represents all edges
  • g.v(0) represents vertex 0
  • g.v(0).map() lists the properties of vertex 0
  • g.V.filter{it.name=='riesling'} selects vertices that meet the given condition
  • g.v(0).outE gets the set of outgoing edges from vertex 0
  • g.v(0).outE.inV finds the set of vertices that vertex 0 points to
  • g.v(0).inE.outV is the set of vertices that point to vertex 0
  • g.v(0).inE.outV.next() gets the first vertex from the above set
  • alice.bothE('friends').bothV.except([alice]).name gets the names of all of Alice’s friends (regardless of edge direction)
  • alice.bothE('friends').bothV.except([alice]).loop(3){ it.loops<=2 }.dedup.name gets Alice’s friends‑of‑friends‑of‑friends (where loop(3) indicates looping over the preceding three steps bothE('friends').bothV.except([alice]), loops<=2 means looping twice, and dedup removes duplicates from the set)

As can be seen, the power of Gremlin lies in “pipelining” operations on sets, very similar to how jQuery operates on DOM elements.
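The friend-of-a-friend traversal above can be mimicked with a plain adjacency list. This is an illustrative toy graph (the names are hypothetical); real Gremlin pipelines run against Neo4j's stored graph.

```python
# Toy adjacency-list version of the friend-of-a-friend traversal
# (hypothetical data; real Gremlin runs inside the graph database).
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob", "eve"},
    "eve": {"dave"},
}

def hop(people):
    """One bothE('friends').bothV step: everyone adjacent to the input set."""
    return {f for p in people for f in friends[p]}

step1 = hop({"alice"}) - {"alice"}       # direct friends, excluding alice
step3 = hop(hop(step1)) - {"alice"}      # two more hops, deduplicated by set
assert step1 == {"bob", "carol"}
assert "eve" in step3                    # a friend-of-a-friend-of-a-friend
```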

Neo4j supports ACID transactions in single‑node mode. In multi‑node high‑availability mode, writes to one node are not immediately synchronized to the other nodes, so full ACID guarantees cannot be maintained.

Redis

Redis’s biggest characteristic is speed. It is essentially a key‑value database, but it also provides advanced data structures such as hash tables, lists (usable as queues or stacks), sets, and sorted sets, as well as a publish‑subscribe mechanism. Redis is generally used as a cache.

Redis uses SET to set a key, GET to retrieve a key, MSET to set multiple keys at once, and MGET to retrieve multiple keys at once. The atomic data type in Redis is string, but it can recognize integers: INCR can atomically increment an integer key.

Similar to PostgreSQL and Neo4j, Redis also supports transactions. Use MULTI to start a transaction, EXEC to commit, and DISCARD to roll back; usage is very similar to transactions in relational databases.

Redis also supports the following advanced data structures:

Hash table

Somewhat like nested documents in MongoDB or CouchDB: the value corresponding to a key is itself a key‑value store. Of course, further nesting is not supported.

List

A list contains multiple ordered values and can serve as both a queue and a stack, because you can PUSH and POP from both ends. You can use LRANGE to obtain any segment of a list, and LREM to delete matching values from a list.

The power of lists lies in blocking pop operations, which can be used to support messaging systems. When the queue is empty, BRPOP will block until an element is inserted into the queue.
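The blocking behavior can be sketched with Python's standard queue as a stand-in for a Redis list (illustrative only; BRPOP itself is a server-side command):

```python
# Thread-based sketch of a blocking pop: the consumer blocks until a
# producer inserts an element, like BRPOP on an empty Redis list.
import queue
import threading
import time

jobs = queue.Queue()

def producer():
    time.sleep(0.1)          # the consumer is already blocked waiting by now
    jobs.put("send-email")

threading.Thread(target=producer).start()
item = jobs.get(timeout=2)   # blocks, like BRPOP, until an element arrives
assert item == "send-email"
```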

Sets

The difference between sets and lists is that sets store some unordered, non-duplicate values. Redis supports operations such as intersection, union, and difference between sets.

Sorted Sets

Hash tables, lists, and sets are all built into high-level programming languages; sorted sets are not. Suppose, for example, that we need to record the number of visits to each post and keep the posts sorted with the most visited first. This can certainly be done with a relational database, but the write performance of relational databases is often unimpressive. Writing your own binary search tree or heap is of course also possible, but then you lose data persistence.

With Redis, you just need to use the ZADD command and provide two parameters—score and member—to implement this functionality. You can also use the ZRANGE command to return a certain range from the sorted sequence.

An even more powerful feature of sorted sets is that you can create something like “views”, defined as the union or intersection of several sorted sets. Each sorted set can be assigned a weight (used to multiply the score), and the minimum, maximum, or sum of the weighted scores can be used as the score of that item in the “view”. This helps to implement simple ranking systems. For example, “hot posts” can take into account a combination of page views, number of replies, and number of likes.
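An in-memory sketch of ZADD, ZRANGE, and a weighted union ties the two ideas together. This mimics the "hot posts" example above with made-up scores; it is not the Redis protocol (Redis implements the union server-side as ZUNIONSTORE).

```python
# Toy sorted set: ZADD / ZRANGE plus a weighted union, as in the
# "hot posts" view described above (in-memory sketch, not Redis itself).
def zadd(zset, score, member):
    zset[member] = score

def zrange(zset, start, stop, reverse=False):
    ordered = sorted(zset, key=lambda m: zset[m], reverse=reverse)
    return ordered[start:stop + 1]          # stop is inclusive, as in Redis

def zunion_weighted(weighted_zsets):
    """Union of (zset, weight) pairs, summing the weighted scores."""
    out = {}
    for zset, weight in weighted_zsets:
        for member, score in zset.items():
            out[member] = out.get(member, 0) + score * weight
    return out

views, likes = {}, {}
zadd(views, 100, "post:1"); zadd(views, 10, "post:2")
zadd(likes, 1, "post:1");   zadd(likes, 50, "post:2")

hot = zunion_weighted([(views, 1), (likes, 3)])         # likes count triple
assert zrange(hot, 0, 0, reverse=True) == ["post:2"]    # 10+150 > 100+3
```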

Publish-Subscribe System

Linux does not natively provide a publish-subscribe system, yet this work pattern is very common. The current approaches are either to implement a socket server or to use a messaging library such as ZeroMQ. Redis also provides an elegant solution: SUBSCRIBE to subscribe to messages, and PUBLISH to publish messages. Each published message will be pushed to all subscribers that are currently receiving messages.

The difference between the publish-subscribe system and the blocking queue pop operation mentioned earlier is that if there are multiple subscribers at the same time, the publish-subscribe system will send the message to every subscriber, while the blocking queue will send the message to any one “subscriber”.
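The contrast fits in a few lines: pub/sub copies a message to every subscriber, while a queue hands each message to exactly one consumer. This is an in-memory toy, not the Redis protocol.

```python
# Sketch of the contrast above: PUBLISH fans out to all subscribers,
# while a queue pop delivers each message to exactly one consumer.
from collections import deque

channel_subscribers = [deque(), deque()]

def publish(msg):                     # PUBLISH: copy to every subscriber
    for inbox in channel_subscribers:
        inbox.append(msg)

work_queue = deque(["job"])

def brpop():                          # queue-style pop: one consumer wins
    return work_queue.pop()

publish("news")
assert [list(inbox) for inbox in channel_subscribers] == [["news"], ["news"]]
assert brpop() == "job"
assert len(work_queue) == 0
```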

Persistence

Redis is fast because all its operations are performed in memory. Redis provides two ways to achieve persistence: an append-only operation log, and periodically flushing memory changes to disk.

Since Redis is often used as a cache server, you can also specify a TTL (time to live) for an existing key. Once it expires, Redis will automatically delete this key.

Conclusion

The six types of non-relational databases introduced earlier, when combined appropriately, can form a highly available, scalable, and high-performance storage system.

  • MongoDB / CouchDB as persistent storage for general data;
  • Redis as a cache for MongoDB / CouchDB and as a temporary store for session data;
  • Neo4j to store complex relationships between entities;
  • HBase for big data analytics (OLAP);
  • Riak for data that requires high availability.

Finally, let’s list and compare the pros and cons of the seven databases described in the book:

  Database     Advantages                                 Disadvantages
  MongoDB      Complex queries                            Embedding capability
  CouchDB      Easy to embed                              Complex queries
  Riak         High availability                          Complex queries
  Redis        Fast                                       Complex data structures
  PostgreSQL   Strict relational model                    Distributed availability
  Neo4j        Flexible representation of relationships   Big data
  HBase        Big data, integration with Hadoop          Fast queries

After reading this book, we can understand the development trends of today’s databases. Under the great wave of the Web, in order to improve distributed availability, many distinctive databases have emerged beyond the relational model. Hopefully, in future development work, we will no longer be constrained by one or two database paradigms, but will instead choose a suitable combination of databases based on the actual needs of the project.
