Wednesday, August 22, 2012

NoSQL or RDBMS, not again!

No matter what, I found the same question over and over again, "should I use a NoSQL solution to my <what ever problem>, help me!", but some days ago I crossed a question that I want to discuss. To avoid any problem with the author of the question I will try to rephrase it here:

"Our company XYZ created a solution ABC that is working really well on our production environment, at this moment we address thousands a records a month and 1 terabytes of information is created each month, our current architecture is <monster server here> and we're happy with our current performance, "...

If you're developer you will know at this point that the Universal developer rule should be applied here: "If it's not broken, please don't fix it".

He continues... "Our application follows the model of write once read many, because we do business intelligence to scrap out our data and get information from there, most of our queries uses 2 or 3 joins, and we don't have weird data types, just plain numbers, texts and dates, our system will get more data and more users and we're worried that our current platform will not support this new workload, that's why we want to do the move to a NoSQL system, which one do you recommend for our problem?"

Here's where the things are getting nasty, and I think I need to clarify that I'm a developer too, and I'm in love with NoSQL strategy, that's why I created djondb because I'm convinced that "One size does not fit all" applies to the RDBMS world, 4 years ago the only way to store information in an organized way was RDBMS and several people think that they should apply RDBMS to every single problem, and that's not true. But this same principle applies to the NoSQL world, take a look of Maslow's hammer law.

NoSQL is a really good solution to several data environments, but you need to ask yourself if you're aware that most of your code should change if you decide to move from one solution to other, it is like saying that you're going to recreate your application in Java because it has better performance according to <put your link here>, you need to know that this kind of changes will require a lot of work to achieve the same features you already have, so you need to evaluate if this is going to work as a benefit or you will have a lot of work to achieve the same thing.

To illustrate the SQL - NoSQL transition I will use a sample, the architecture of the sample could be very different according to the approach you decided on your application (EJBs, Spring, LINQ, etc) and that will change the way you coded your solution, but I will go with a very straightforward and well know solution.

Data Model comparison

Most of you are already aware of the differences between one model and the other, but here's a very simple sample to illustrate this.

In a ERP data model you will have the following tables:

Taken from http://www.databaseanswers.org

This is a very simple model, a customer order a product and you register the identification of the customer and product into the table "Orders", each order is related with one or multiple transactions, etc.

From this simple model it's very easy to assume some of the SQL statements you should use to store or retrieve elements from your DB, let's summarize some simple operations:

  • Create products:
    Insert into Products (product_code, product_category_cod, description, vendor_code) VALUES ('00010101', 'E_01', 'iPad', 'Tablet blah blah', 'VEN_0010')
    
  • List of products by category:
    Select product_code, description from Products where product_category_cod = 'E_01'
    
  • Show the transactions ordered by a user:
    To retrieve this you will need to execute some SQLs or you will get a bunch or repeated information, first you will need to get the order "header":
    Select o.order_id, c.Customer_name, c.Customer_last_name, c.Customer_username, o.order_date, ... from Orders o inner join Customer c on c.customer_id = o.customer_id where c.username = 'JACKTHEREAPER'
    
    Now you will retrieve the details of the order using:
    Select p.description, o.price, o.units, ... from Orders o inner join Products p on p.product_code = o.product_code where o.order_id = 'XR1210120'
    

At this point I think you got the idea, no matter if you use some ORM, persistence framework, etc at the end your application will execute these SQLs to retrieve or store information into the database.

Let's see the same data model from NoSQL perspective, one of the key decision process about using NoSQL is to denormalize your data in a way that store it and retrieve it will be align with how the application uses it, instead of forcing a row/columns structure.

Questions to be solved by your NoSQL model

What will be benefit of storing the order detail as a separate entity? is the order detail an entity? or just an aggregation values that compounds an order? An order does not have any meaning without the details and they are useless if are not tight to an order. This will mean that the order should have a structure similar to this:

{
   { Order header info }
   { Order details }
}

Now lets figure out what are we going to store at the header info, 90% of the queries you will do to retrieve the order header will include the name of the customer, and his id, it's going to be pointless to show the order in the web application without this vital information, it's going to be easier to ignore information than adding new required information, so I will store the denormalized data from the customer inside the order, like this:

{
   { 
      order_id: "XR1210120",
      customer_id: "CI_29120",
      customer_name: "Jack",
      customer_lastname: "Reaper",
      date: "2012/07/26"
   }
   { Order details }
}

Note: "Hey!, what are you doing? what is going to happen if the customer changes his name?", how many times you face this kind of problem? anyway, if the customer changed his name and he comes back in 10 years, the invoice should be printed as it was at the time of the sale, because that's the way it's stored at the IRS, or any other state department, right?

What about the order details? same approach, we could save the product name along with the order and you will have everything you need at the time of printing/showing the order to the user. What about the product price? same thing, you will need to save the price at the time of the purchase, so it will be better to keep it within the order and not in an inventory table.

Here's an order with full details:

{
    "order_header": {
        "order_id": "XR1210120",
        "customer_id": "CI_29120",
        "customer_name": "Jack",
        "customer_lastname": "Reaper",
        "date": "2012/07/26"
    },
    "Orderdetails": [
        {
            "product_code": "00010101",
            "Category_code": "E_01",
            "description": "iPad",
            "price": 450,
            "units": 1
        },
        {
            "product_code": "00010203",
            "Category_code": "E_01",
            "description": "Xbox 360",
            "price": 320,
            "units": 1
        }
    ]
}

Let's do the same operations over djondb, using the djon-shell:

  • Create products:
    insert('sampledb', 'products', { product_code: '00010101', category_code: 'E_01', name: 'iPad', description:'Tablet blah blah', vendor_code: 'VEN_0010'});
    
  • List of products by category:
    find('sampledb', 'products', '$"category_code" == "E_01"');
    
  • Show the transactions ordered by a user:

    Now, you'll only need one round trip to the database to retrieve all the information you will need to show the order:

    find('sampledb', 'orders', '$"user" == "JACKTHEREAPER"');
    
    done!, you will get something like:
    [{
        "order_header": {
            "order_id": "XR1210120",
            "customer_id": "CI_29120",
            "customer_name": "Jack",
            "customer_lastname": "Reaper",
            "date": "2012/07/26"
        },
        "Orderdetails": [
            {
                "product_code": "00010101",
                "Category_code": "E_01",
                "description": "iPad",
                "price": 450,
                "units": 1
            },
            {
                "product_code": "00010203",
                "Category_code": "E_01",
                "description": "Xbox 360",
                "price": 320,
                "units": 1
            }
        }
    },
    {
        "order_header": {
            "order_id": "NP112021",
            "customer_id": "CI_29120",
            "customer_name": "Jack",
            "customer_lastname": "Reaper",
            "date": "2012/07/28"
        },
    ....
    ]

    This result is a JSON, very simple to iterate and interact to, but the idea of this article is not to convince you to use NoSQL, but to help you understanding the difference between your RDBMS application approach and the NoSQL approach, and highlight how these approaches will lead you to change some key points in your architecture.

Now that you got the idea of how the storage is mapped from one side to the other, it's time to take a look of your code.

Code comparison

Until now, the things had been very straightforward, and it's very likely that you have some sentences executing SQLs similar to the ones I showed, but the code is something else, it's hard to know how every architect created their designs, but I will use a commonly known design pattern, known as "DAO" (Data Access Object) to illustrate my point.

The DAO pattern uses a class to retrieve the information from the database and create a Transfer Object that is going to be returned to the business layer, if you want to store information of an order it will work like this:

Note: I will do the assumption that you used jquery in your front-end and it will be able to create a full order in one step (header and dome details) and then bulk it to the server to be stored.

Note: My second assumption is that you have some json framework at java that will be able to handle json nicely.

Who?What to do?
Web ControllerCatches the data from the "request" object and puts it in a Value Object (POJO, POCO, etc)
ModelReceives the Value Object and perform validations, etc
ModelCalls a DA class to store the VO
DA ClassCreates the required SQL to store the values from the VO

The code (using java) will look something like this:

Web Controller: Catches the data from the "request" object and puts it in a Value Object (POJO, POCO, etc)

Order o = new Order();
JSONObject json = JSONParser.parse(Request.getParameter("order"));
o.setCode(json.getString("order_code"));
o.setCustomerId(json.getString("customer_id"));
o. ... (all the order header information)
for (JSONObject detail : json.getJSONArray("Orderdetails")) {
     OrderDetail d = new OrderDetail();
     d.setProductCode(detail.getString("product_code"));
     d.setUnits(detail.getInt("units"));
     ...
     o.addDetail(d);
}

Now, you will get this and put it on a Model Class (following the MVC pattern)

Model: Receives the Value Object and perform validations, etc

    Model m = new Model();
    m.setOrder(o);
    m.saveOrder();

Model: Calls a DA class to store the VO

    MyDAClass da = new MyDAClass();
    da.insert(o);

DA Class: Creates the required SQL to store the values from the VO

    StringBuffer sql = new StringBuffer();
    sql.append("INSERT INTO Order (Order_code, customer_id....) ");
    sql.append(" VALUES ('" + o.getOrderCode() + "', ...");

    Statement stmOrder = connection.createStatement();
    stmOrder.execute(sql.toString());

    // at this point you will need to do the same for each order detail:
    for (OrderDetail detail : o.getOrderDetails()) {
       StringBuffer sqlDetails = new StringBuffer();
       sql.append("INSERT INTO OrderDetail (Order_detail_code, Order_code, product_code....) ");
       sql.append(" VALUES ('" + detail.getOrderDetailCode() + "', ...");
    }       

I know, you could use a preparedStatement here and use parameters, etc... or you could use EJB or ORM to store the VO, but here's not where you will see the difference between NoSQL and RDBMS approaches.

Now, let's take a look of the same code using djondb as NoSQL repository, I will keep the same MVC and DAO concept to keep the code side by side comparable.

Same code NoSQL approach

Web Controller: Catches the data from the "request" object and puts it in a Value Object (POJO, POCO, etc) at this point you will not need to map your data to a VO, it's pointless, you could use JSON at server side and it will work really well.


JSONObject order = JSONParser.parse(Request.getParameter("order"));

Now, you will get this and put it on a Model Class (following the MVC pattern)

Model: Receives the Value Object and perform validations, etc, this does not have any change:

    Model m = new Model();
    m.setOrder(order);
    m.saveOrder();

Model: Calls a DA class to store the VO, same thing the model does not suffer major changes.

    MyDAClass da = new MyDAClass();
    da.insert(o);

DA Class: Creates the required SQL to store the values from the VO

    connectionWrapper.insert("ERP", "Orders", order);

Just one line will do the trick, same with the controller, mainly you are avoiding the overhead of doing unnecessary transformations, you will get the json and store it directly to your database, no more additional processing will be needed here.

The same coding sample could be used to "listing orders" or "showing an order", you will see that you coded a lot less using JSON as transport and persistance mechanism

Conclusion

Although I tried to put it very simple, converting an application from RDBMS to NoSQL is not a trivial task, you will need to spend time creating a new design of your application, yes... you will get a lot from this, you will be able to create new features easily, you will be able to grasp the performance of NoSQL in your application, but you should not think that just putting a NoSQL behind will do the magic and boost your performance, you actually need to start thinking on "NoSQL" data model and new ways to code your application to get most of the NoSQL world.