Cloud computing is most often associated with scalability (see Amazon CTO Werner Vogel’s definition of scalability). One commonly held view is that you can simply move an application onto cloud based infrastructure and it will then “magically” scale on demand. The reality is that there is no free lunch. Simply throwing additional CPU cycles or storage at an application is not going to deliver linear scalability unless the application was designed to scale in such a manner.
The cloud era heralds the development of new enterprise application platforms available on demand as well as new social platforms. However this isn’t as simple as taking the current crop of relational database centric solutions and deploying them on Amazon EC2. Of course this isn’t stopping vendors from taking that approach and offering on demand versions of their products. The challenge is that these applications are not designed to scale dynamically and in a distributed manner. The result is that as traffic and usage grows there will be a continual cycle of monitoring and patches to try and keep the application performing to an acceptable level. While this will always be necessary to monitor and improve there are lessons to be learnt from some of the largest concurrent, multi-user sites that can help reduce the pain.
Consideration of cloud based scaling is clearly dependent on the nature of the application and the anticipated volume of usage. If the application for example is very read heavy and low on write transactions then replicating databases with good caching could well be sufficient. However for solutions that require massively concurrent heavy write based access to the database consideration needs to be given to architecting to achieve scalability.
Distributed database versus relational database
Relational databases are primarily designed for managing updates and transactions on a single instance. This is a problem when you need massively concurrent access with millions of users initiating write transactions. The approach taken to address this is usually clustering or sharding. But this is really attempting to patch up the problem rather than addressing it full on. That said there are many large scale examples using a relational database and applying these approaches.
Given a clean sheet and current developments what approaches can be used to address massively concurrent write heavy applications. Well there are a number of different distributed database solutions that have emerged in the last few years either based on some form of key-value distributed hash table (DHT), column-oriented store or document centric. They are often built to address precisely the issue of scaling for write heavy applications. However they should not be considered a direct replacement for a relational database. They often lack support for complex joins, foreign keys as well as reporting and aggregation – although some of these areas are beginning to be addressed. Also there is not currently an SQL or object mapping such as Active Record to cleanly and transparently access them from code, so extra development effort is required. However they should certainly be considered as part of on an overall architecture and leveraged to reduce write heavy bottlenecks in the solution.
Apache CouchDB – document centric approach built using Erlang
Cassandra – DHT variant that supports a rich data model, originally born at Facebook, now an Apache incubator project
HBase – column-oriented store similar to Google’s BigTable, uses Hadoop as a distributed map/reduce file system.
Here is a great blog from the cofounder of Last.fm on the multitude of alternatives to a traditional RDBMS for heavy distributed write based applications http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
Another blog worth reading on distributed key stores is http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores.
Stateless immutable services
One of the guiding principles for linear scalability is to have lightweight, independent, stateless operations that can be executed anywhere and run on newly deployed threads/processes/cores/machines transparently as needed in order to service an increasing number of requests. These type of services should share nothing with any other services they simply process asynchronous messages. This type of async message passing has been proven to scale in languages such as Erlang. One paradigm that is closely aligned to this approach is known as the Actor model. The actor model is all about passing immutable messages and the share-nothing philosophy. A lightweight stateless protocol such as REST is well suited to allowing these services to be accessed across the internet through HTTP.
Speaking the language of scalability
As always choice of programming language can end up being more an emotional rather than necessary decision. But it is true that it can help to pick the right tool for the job at hand. Some languages have better support for developing highly concurrent distributed and scalable applications. The characteristics to look for are languages that encourage immutable data structures and referentially transparent methods, typically being functional in nature and supporting asynchronous message passing. Two popular languages that are receiving alot of attention are Scala and Erlang. Scala runs on the JVM and was famously used to provide scalability for Twitter by implementing a message queuing solution. Erlang has it routes in embedded systems and so was optimised to run on minimal resources. It utilises processes which are much lighter and faster than even O/S threads supporting both multiple cores or multiple machines transparently. Both Scala and Erlang have good support for the Actor model again encouraging scalable independent async message driven design.
In the end there is still more learning and maturing to be done in developing the next generation of cloud based solutions and not all will need to scale to high volumes. It will be an interesting time and there is much that can be learnt from others who are already dipping their feet in this pool. A good site for keeping track of what others are doing in the whole space of scalability is http://highscalability.com/. Being aware of these changes is especially important when embarking on new projects where consideration to scale and using cloud infrastructure are factors.