Wednesday, August 12, 2009

Object Relation Mappers

An Object-Relational Mapper (ORM) is usually a library that provides a more or less automatic and transparent conversion between the in-memory object representation and relational database tables. For a while I have been pondering the question of whether to use ORMs at all, and if the answer is yes, then when, what for, and at what price. Apparently this is a pretty complex topic deserving a full-size article or even a book. This blog post provides a brief account of my major findings.

The reason why we might need an ORM stems from the fact that any persistent storage will always have its own optimization criteria, which will always be different from those of the in-memory object representation. The two most obvious reasons are address alignment and big/little-endian representation of numeric data. In order to speed up computation, numeric data in RAM will usually be aligned at addresses divisible by 4 and will have an internal representation suitable for the CPU (for example, little-endian on Intel processors). Persistent storage, on the other hand, even a flat file or Flash memory, will always strive for a compact data representation in order to speed up data access, and will use a portable byte order for numerics. Therefore the Persistence-Object Gap (I prefer this term to the narrower Object-Relation Gap) will always exist, and back-and-forth conversion needs to be accepted as an inevitable evil. Still, since the conversion code is purely technical, it should not interfere with the application's core logic and ideally should be isolated into a separate Data Access layer, whose whole responsibility should be to provide an efficient conversion between the in-memory object representation and persistent storage of whatever type is in use.
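To make the representation gap concrete, here is a minimal Python sketch (the value and the formats are arbitrary, chosen only for illustration) showing how the same integer looks in the CPU's native layout versus a portable big-endian layout:

    # The same integer packed with the CPU's native layout versus a
    # portable, big-endian layout.
    import struct
    import sys

    value = 1024
    native_bytes = struct.pack("@i", value)    # native byte order and alignment
    portable_bytes = struct.pack(">i", value)  # big-endian, no padding

    print(sys.byteorder)          # e.g. 'little' on Intel CPUs
    print(native_bytes.hex())     # '00040000' on a little-endian machine
    print(portable_bytes.hex())   # '00000400' regardless of the CPU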

The first infrastructural service required for almost any Data Access layer is so-called "serialization", namely converting back and forth between external and internal representations without dealing with input/output. Most modern languages, such as Java, Python, Ruby, and Groovy, provide built-in support for serialization using reflection. Serialization could use different external formats: binary (e.g. Java), XML, YAML, etc.
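As an illustration, here is a minimal Python sketch of serialization with no I/O involved, using the standard pickle (binary) and json (textual) modules; the Point class is purely illustrative:

    # Converting an object back and forth between its in-memory form and an
    # external representation, without touching files or sockets.
    import json
    import pickle

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

    p = Point(3, 4)

    binary_form = pickle.dumps(p)        # binary, Python-specific format
    text_form = json.dumps(p.__dict__)   # textual, portable JSON format

    restored = pickle.loads(binary_form)
    print(restored.x, restored.y)        # 3 4
    print(json.loads(text_form))         # {'x': 3, 'y': 4}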

The next service, which might be required, is the ability to make objects persistent. The most trivial solution might be to write an object to a file and read it back. A more complicated solution could be to use a simple indexed file, such as the one provided by Berkeley DB. In a sense ORMs do provide this service, but in a more complicated way (see below).
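A minimal sketch of this kind of simple persistence, using Python's standard shelve module as a stand-in for a key-indexed store in the spirit of Berkeley DB (the Project class and the file name are illustrative assumptions):

    # Simple object persistence in a key-indexed file.
    import shelve

    class Project:
        def __init__(self, name):
            self.name = name

    with shelve.open("projects.db") as store:
        store["p1"] = Project("ORM research")   # serialize and write under a key

    with shelve.open("projects.db") as store:
        print(store["p1"].name)                 # read the object back by key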

Simple object-persistence solutions normally provide neither caching, nor a transactional mechanism, nor locking. If these are required, a full-fledged Object-Oriented Database engine, such as ObjectStore, could be used.

When dealing with object collections, persistent or not, we normally need to apply some selection criteria to specify which objects we would like to access and process. Normally, individual object access is supported by some kind of dictionary container. In Java and C++ these dictionary classes are called Maps, which introduces a lot of confusion with the map algorithm popular in Functional Programming. For multi-object selection, languages like Python, Ruby and Scala support so-called list comprehensions. For example, in Python it could look like [Function(p) for p in P_list if Predicate(p)]. Java does not have such a mechanism, and filters need to be specified manually as special objects. Whatever mechanism is used for object filtering, it must be kept in mind that the whole selection process is performed solely in RAM. I am not aware of any solution that automatically translates list comprehensions into SQL queries at the moment.
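For example, a purely in-memory selection with a list comprehension might look like the following sketch (the Person class and the predicate are illustrative, not from any library):

    # In-memory object selection, in the spirit of
    # [Function(p) for p in P_list if Predicate(p)].
    class Person:
        def __init__(self, name, age):
            self.name, self.age = name, age

    p_list = [Person("Alice", 34), Person("Bob", 19), Person("Carol", 42)]

    # Predicate: age > 30; Function: extract the name. Runs entirely in RAM.
    adult_names = [p.name for p in p_list if p.age > 30]
    print(adult_names)   # ['Alice', 'Carol']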

Now, let's suppose we choose a Relational Database for persisting some or all objects. The main reason for this choice could be that, unlike Object-Oriented DBs, Relational DB technology is very mature and offers a number of well-established brands such as Oracle, MS SQL, and MySQL, to name a few. These blue-chip products take upon themselves the whole persistent-data life cycle: clustering, backup, schema management, query optimization, migration and growth, transaction integrity, and locking. Relational DBs are in general optimized for mass data entry (in a relatively fixed format - tables) and ad hoc queries. Even though this does not cover everything, it is a very large class of data-centric applications.

The first infrastructural support which might be required for Relational DB access is a Database Connection API. Usually there is a generic DB Connection API specified for a particular programming language, such as JDBC for Java, while for every specific DB engine there is a separate implementation (very often slightly extending the standard API, for example cx_Oracle in Python). This DB Connection API usually defines a basic set of functions to establish a connection, to commit/rollback changes, and to transfer data back and forth through a generic collection, normally called a RecordSet. Even at this simplest level there is a catch. RecordSets work perfectly fine for data retrieval, but what happens if we want to modify or delete a particular record? Should we do it through the RecordSet, or do we have to submit a separate UPDATE ... WHERE or DELETE ... WHERE SQL query? Too often the documentation is very ambiguous about this. The best understanding I managed to achieve is as follows: whether an update or delete could be performed through a RecordSet depends on the original query. If the query was for a single table, we may expect the resulting RecordSet to be updatable. If the RecordSet is the result of a more complex query such as a JOIN, chances are we will need to submit separate queries for UPDATE and DELETE. If only the DB Connection API is used, the Data Access layer will need to be programmed manually, with presumably a separate class per table and maybe even per query.
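For illustration, here is a minimal sketch of the same connection-API pattern using Python's DB-API 2.0 with the built-in sqlite3 driver (the article's own examples are JDBC and cx_Oracle; the table and queries below are assumptions):

    # The generic DB Connection API pattern: connect, query through a
    # cursor (the rough equivalent of a RecordSet), modify, commit.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO projects (name) VALUES (?)", ("ORM research",))

    # Retrieval goes through a cursor / result set.
    for row in conn.execute("SELECT id, name FROM projects"):
        print(row)

    # Modification goes through a separate UPDATE ... WHERE statement.
    conn.execute("UPDATE projects SET name = ? WHERE id = ?", ("ORM survey", 1))
    conn.commit()
    conn.close()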

DB Connection APIs normally accept SQL queries in the form of strings. While this gives plenty of flexibility, it could lead to some annoying and hard-to-debug mistakes. For that reason some developers prefer to use an additional SQL Query Builder API, for example SQLAlchemy for Python. Here the basic SQL elements are presented in the form of classes or functions, thus providing an ability to catch some technical mistakes at an early stage, at compilation time. In contrast to what was mentioned above with regard to object filters, the SQL Query Builder does directly translate predicates into SQL. In other words, object filters specified in the form of, say, a list comprehension will be executed at the DB client side, while logical predicates specified for a SQL Query Builder will be executed at the DB server side. This will obviously have a different impact on the overall system performance and scalability, and which is better is very hard to say.

Originally SQL databases were optimized for the so-called scale-up pattern: a powerful, expensive computer for the DB server; cheap, weak clients; a relatively slow network. That was true about 15-20 years ago. Under these circumstances, delegating as much work as possible to the DB server will lead to better results. This is what most DB experts argue to this day. However, many modern applications, especially large-scale Web applications (e.g. eBay, Google), were designed for another pattern, so-called scale-out: perform an aggressive horizontal and vertical split of the system such that DB and application servers can be deployed on cheap, commodity computers, with a lot of redundancy, and assume that the local network is fast. Under these circumstances, the less work the DB server performs, the better. We will see the same philosophy in the more sophisticated ORMs. One last comment: this scale-out pattern, even though very successful in providing the right balance between scalability, availability and throughput, does not come for free. Scaling out significantly increases the hardware footprint (the real estate required to host all these computers) and the electric power consumption for computers and cooling.
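To make the query-builder idea above concrete, here is a minimal sketch using SQLAlchemy Core (1.4+/2.0-style select(); the table definition is an illustrative assumption). The predicate passed to where() is translated into SQL and executed on the server side:

    # A SQL query builder: SQL elements as Python objects instead of strings.
    from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, select

    metadata = MetaData()
    projects = Table(
        "projects", metadata,
        Column("id", Integer, primary_key=True),
        Column("name", String),
    )

    engine = create_engine("sqlite:///:memory:")
    metadata.create_all(engine)

    # The predicate is rendered as SQL and evaluated by the database server.
    stmt = select(projects).where(projects.c.name.like("ORM%"))
    print(stmt)   # shows the generated SELECT ... WHERE projects.name LIKE :name_1

    with engine.begin() as conn:
        conn.execute(projects.insert().values(name="ORM research"))
        for row in conn.execute(stmt):
            print(row)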

DB Connection and SQL Query Builder APIs by themselves do not deal with object caching, which is usually required for any non-trivial application. Indeed, retrieving every object each time directly from the DB could be very inefficient. In order to improve overall performance, developers often opt for storing objects which have already been retrieved in a memory cache. There is a relatively large number of open-source cache solutions for Java: http://java-source.net/open-source/cache-solutions. Developing a robust cache solution is not a trivial task. Two main questions need to be addressed: when the cache is updated in order to reflect changes made in the DB by other processes, and when the changes made by this process are sent back to the DB. The most popular method is to synchronize the cache on commit/rollback of the current transaction, but for a large number of objects it will still be a suboptimal solution. For example, LinkedIn developed a very sophisticated memory bus, which keeps an in-memory representation of the LinkedIn connection graph refreshed using a special Oracle last-updated timestamp feature.
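Here is a deliberately simplified sketch of a read-through cache keyed by primary key, in the spirit of the Identity Map pattern; the load_from_db callable stands in for a real database call and is purely illustrative:

    # A read-through cache: objects already retrieved are returned from memory,
    # everything else is fetched from the database on demand.
    class ProjectCache:
        def __init__(self, load_from_db):
            self._load = load_from_db
            self._by_id = {}

        def get(self, project_id):
            # Return the already-loaded object if present; hit the DB otherwise.
            if project_id not in self._by_id:
                self._by_id[project_id] = self._load(project_id)
            return self._by_id[project_id]

        def clear(self):
            # A crude synchronization point, e.g. on commit/rollback.
            self._by_id.clear()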

So far we have assumed that SQL queries are directly embedded in the program code in the form of strings or query builder objects. If we are dealing with a complex legacy database, or even worse with a number of DB schema versions, this approach would cause a significant maintenance headache: every time there is a change in the existing DB schema, or there is a new version of the schema, we will need to make a change in the code. For compiled languages such as Java this creates a problem. An alternative solution is to extract the SQL queries into a separate configuration file, say XML, thus eliminating the need to recompile the code each time the DB schema changes. This is basically what the iBatis library for Java provides.
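A rough Python sketch of the same idea (not iBatis's actual XML mapping format; the file name, query names and JSON format are assumptions): the SQL text lives in an external configuration file, so a schema change does not force a code change:

    # Keep SQL out of the program code, in an external configuration file.
    import json
    import sqlite3

    # An external file, e.g. queries.json, might contain:
    #   {"get_project": "SELECT id, name FROM projects WHERE id = ?"}
    with open("queries.json") as f:
        QUERIES = json.load(f)

    def get_project(conn, project_id):
        # Run a named query loaded from configuration, not hard-coded here.
        return conn.execute(QUERIES["get_project"], (project_id,)).fetchone()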

The iBatis approach assumes that all SQL mapping is performed by developers manually. As was said, that could be the right solution for legacy DBs, but for new development it is still too cumbersome and introduces some subtle maintenance problems. When we develop a new system we would ideally prefer to keep the class declarations and the DB schema in sync and to generate all relevant SQL queries automatically. This is what full-fledged ORMs such as Active Record in Ruby, Elixir in Python, Active Objects in Java, and to a certain degree Hibernate in Java are doing. In their pure form, ORMs provide a declarative method to specify object fields and object relationships. In this mode ORMs generate primary keys for every object automatically. SQL queries for object retrieval, update, delete, and relationship traversal are generated automatically as well. ORMs maintain an internal cache (based on the so-called Identity Map pattern), where already retrieved objects are kept in order to improve performance. Object filtering could be performed through list comprehensions or a SQL query builder; this in my view might be a major source of confusion, but as far as I know a satisfactory automatic solution does not exist at the moment.

In their pure form, ORMs take control over DB optimization completely out of the developer's hands. As long as it works for the particular application, everything is fine. However, if and when a serious performance problem arises, one has to be prepared to spend a substantial amount of time trying to figure out which kind of query the ORM generates and why it negatively affects performance. In general, dynamic languages such as Ruby and Python do ORMs better due to their unique meta-programming capabilities. In Java, the Active Objects library made a genuine effort to achieve the same level of code clarity and convenience. Hibernate, on the other hand, uses either XML configuration files (which is in my view horrible) or Java annotations (slightly better). Almost all ORMs support one form or another of a hybrid model, where objects could be mapped onto existing tables in a more flexible way.
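For illustration, here is a minimal declarative model written with SQLAlchemy's declarative extension; the article names Active Record, Elixir, Active Objects and Hibernate, so this is just one possible rendering, and the classes themselves are assumptions:

    # Declarative ORM style: fields and relationships are declared on the
    # classes; tables, primary keys and SQL are generated by the ORM.
    from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
    from sqlalchemy.orm import Session, declarative_base, relationship

    Base = declarative_base()

    class Project(Base):
        __tablename__ = "projects"
        id = Column(Integer, primary_key=True)    # key generated automatically
        name = Column(String)
        tasks = relationship("Task", back_populates="project")

    class Task(Base):
        __tablename__ = "tasks"
        id = Column(Integer, primary_key=True)
        title = Column(String)
        project_id = Column(Integer, ForeignKey("projects.id"))
        project = relationship("Project", back_populates="tasks")

    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        project = Project(name="ORM research", tasks=[Task(title="read docs")])
        session.add(project)
        session.commit()
        # Relationship traversal; the SELECTs behind it are generated by the ORM.
        print(session.get(Project, project.id).tasks[0].title)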

The major advantage of ORMs for new development is that they save a lot of development effort and make the Data Access layer practically disappear. The specification of persistent entities and their relationships could be embedded directly in the Domain Logic layer (some people call it the Business Logic layer, but I prefer the more specific term). When used properly, this could significantly improve code clarity and development speed. This is based on the assumption that the overall performance of the target application will be satisfactory or could be fine-tuned locally. For a very large class of applications it does work this way, but we again must remember that many SQL query optimizations might not be available, at least not directly. Usually, using ORMs in this pure mode leads to re-thinking the whole domain model concept, which I see as a positive sign. One needs to remember that ORMs do optimize, but in their own way, primarily through lazy retrieval and a sophisticated cache. Also, one has to remember that although the declarative specification of entities and relationships could, and usually should, be an integral part of the Domain Model, it is NOT the whole domain model (the main disadvantage of Ruby on Rails, in my view, is that it does assume that its Model consists solely of Active Records).

Using the Domain-Driven Design book's terminology, we could state that the ORM declarative style is good for specifying entities, aggregates, values and, maybe, cross-aggregate relationships. It does not cover Services, Factories, Events and Repositories. With regard to Repositories, one needs to keep in mind that the default solution provided by an ORM, treating the entity class as a Repository, would usually be a poor choice. Indeed, statements like Project.query.all() or Project.query.filter(id = x) are too cumbersome and system-dependent. An alternative is to use ProjectRepository.GetAll() and ProjectRepository.GetProject(id) methods. The latter works better with dependency injection and better supports unit testing and mocking.
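As a sketch of that Repository alternative, assuming an ORM session and a Project entity like the ones above (the constructor parameters are illustrative, not any library's API):

    # A thin Repository that hides the ORM's query API behind
    # GetAll()/GetProject(id); the session is injected, so unit tests
    # can pass a fake or a mock instead of a real database session.
    class ProjectRepository:
        def __init__(self, session, project_class):
            self._session = session
            self._project_class = project_class

        def GetAll(self):
            return self._session.query(self._project_class).all()

        def GetProject(self, project_id):
            return self._session.get(self._project_class, project_id)

    # Usage: repo = ProjectRepository(session, Project); repo.GetProject(42)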


1 comment:

  1. As a RoR fan, I'd like to mention that there is DataMapper as an alternative to Active Record.
    http://bit.ly/2gmXQ

    And in Rails 3 the problem you mention will be improved.
