Database — sql, nosql

Tagged as: cloud

What is this debate on databases ?

We have many “Millions of users” talk these days with many large services serving lot of users. This takes us quickly to the thought of how to scale our service — roughly the starting point on the database scale debate.

Why is database part discussed heavily in the service scaling ? This stems from the fact that many of the services out there are not compute heavy. They look-up data and return the results to the user-side, either as web pages or via structured data APIs (JSON or XML).

To handle lot of such requests from users, we have easy ways to add more web servers, leveraging the cloud. As long as you maintain these servers stateless (no correlation with the previous request), this is fairly easy. These servers but can respond only after fetching the data from the databases — this puts the databases at the center of the scaling problem— Database becomes the bottle neck. Now, the obvious thought — why don’t we add more database servers like the way we added the web-servers ? This is but inherently hard due to the fact that database cant be stateless — in fact, what they do is to be state-ful !

Couple of options here:

  1. Keep increasing the data server capacity (network, storage etc) while more requests and data come in. This is typically called scale-up, but it is obvious that you will hit a limit at some point.
  2. Another option is scale-out : Add more servers and split the load among them. But what data to put in each of them ?

ScaleUp vs ScaleOut

Let us talk about scale-out. There are various options here:

  1. Duplicate the data on multiple servers, so we have more servers to handle load and let the request go to the least-loaded one. This is called replication, but has the inherent problem of maintaining consistency when you make changes to the data. You will have to make change in all the replicas.

  2. Another option is to split the data (based on some criteria) into multiple servers, so a given server is accessed only for a specific portion of data, effectively distributing load. This is called partitioning. This can either be horizontal, where certain records go to another server (say, all user records with user names starting in “A”). Or it can be vertical, where certain types-of-data goes to another server (say, all user-profile in one place, all user-activity in another place). Partitioning is also not easy, because business will have data-relations across records or across types of data. If the particular request can be responded to only after looking into a relation (call it JOIN in database terms), and if it happens that the data is now in different partitions, it slows down the fetch. Back to square one !

So far we just discussed the problems. My take is, it is at this juncture SQL vs NoSQL thoughts came in. How to effectively scale-out with all the above problems being solved ?

NoSQL databases typically push/prompt the developer to try reduce the relations among the various types of data (call them tables, or objects), by intentionally reducing the power of data-query (no JOINs or transactions). When a developer is designing the data schema with this limitation, then it naturally tend to avoid many inter-dependency among data.

Many (not all) of the applications can live with this limitation. This philosophy also favors de-normalized data architecture (it is ok to keep some data duplicates) — so instead of JOIN-ing from another place, it is all there in same place. What about the consistency ? — We have to live with some level of inconsistency — better called eventual consistency, where you are ok for data to be in-consistent for a limited period of time say, for few seconds. This is acceptable in many cases (say, like the number of times an article was shared — this counter need not be accurate to the last digit !).

So with some of these changes, effectively the strict data-relations reduce and now quickly we can do that infinite scale-out for large volumes of data — guaranteeing very fast look-ups. Data is partitioned and distributed on different servers. Relations among them are very limited (or very loose, I would say) — And we get the scalability.

AWS Dynamo DB, Google Datastore, MongoDB and CouchDB are some of the leaders in this space. Dynamo DB is also fully managed — you treat the database like an API and everything else is taken care of — almost zero data administration. MongoDB and CouchDB also have offerings from various vendors which reduces the data management overheads. Dynamo DB gives you that table feel, for those coming from SQL. They also have features to bridge some of the gaps of NoSQL systems  — Query, Atomic counters, Conditional Writes etc.

Can NoSQL be adopted everywhere ? I think No.

Even within a single product there will be parts where you need the traditional SQL — with rich-query, data JOINs, Transaction support etc. So pick the right database for the specific needs— be prepared to use both the schools !

Tagged as: cloud
comments powered by Disqus