Wednesday, August 22, 2012

NoSQL or RDBMS, not again!

No matter what, I found the same question over and over again, "should I use a NoSQL solution to my <what ever problem>, help me!", but some days ago I crossed a question that I want to discuss. To avoid any problem with the author of the question I will try to rephrase it here:

"Our company XYZ created a solution ABC that is working really well on our production environment, at this moment we address thousands a records a month and 1 terabytes of information is created each month, our current architecture is <monster server here> and we're happy with our current performance, "...

If you're developer you will know at this point that the Universal developer rule should be applied here: "If it's not broken, please don't fix it".

He continues... "Our application follows the model of write once read many, because we do business intelligence to scrap out our data and get information from there, most of our queries uses 2 or 3 joins, and we don't have weird data types, just plain numbers, texts and dates, our system will get more data and more users and we're worried that our current platform will not support this new workload, that's why we want to do the move to a NoSQL system, which one do you recommend for our problem?"

Here's where the things are getting nasty, and I think I need to clarify that I'm a developer too, and I'm in love with NoSQL strategy, that's why I created djondb because I'm convinced that "One size does not fit all" applies to the RDBMS world, 4 years ago the only way to store information in an organized way was RDBMS and several people think that they should apply RDBMS to every single problem, and that's not true. But this same principle applies to the NoSQL world, take a look of Maslow's hammer law.

NoSQL is a really good solution to several data environments, but you need to ask yourself if you're aware that most of your code should change if you decide to move from one solution to other, it is like saying that you're going to recreate your application in Java because it has better performance according to <put your link here>, you need to know that this kind of changes will require a lot of work to achieve the same features you already have, so you need to evaluate if this is going to work as a benefit or you will have a lot of work to achieve the same thing.

To illustrate the SQL - NoSQL transition I will use a sample, the architecture of the sample could be very different according to the approach you decided on your application (EJBs, Spring, LINQ, etc) and that will change the way you coded your solution, but I will go with a very straightforward and well know solution.

Data Model comparison

Most of you are already aware of the differences between one model and the other, but here's a very simple sample to illustrate this.

In a ERP data model you will have the following tables:

Taken from http://www.databaseanswers.org

This is a very simple model, a customer order a product and you register the identification of the customer and product into the table "Orders", each order is related with one or multiple transactions, etc.

From this simple model it's very easy to assume some of the SQL statements you should use to store or retrieve elements from your DB, let's summarize some simple operations:

Create products:

Insert into Products (product_code, product_category_cod, description, vendor_code) VALUES ('00010101', 'E_01', 'iPad', 'Tablet blah blah', 'VEN_0010')

List of products by category:

Select product_code, description from Products where product_category_cod = 'E_01'

Show the transactions ordered by a user:
To retrieve this you will need to execute some SQLs or you will get a bunch or repeated information, first you will need to get the order "header":

Select o.order_id, c.Customer_name, c.Customer_last_name, c.Customer_username, o.order_date, ... from Orders o inner join Customer c on c.customer_id = o.customer_id where c.username = 'JACKTHEREAPER'

Now you will retrieve the details of the order using:

Select p.description, o.price, o.units, ... from Orders o inner join Products p on p.product_code = o.product_code where o.order_id = 'XR1210120'

At this point I think you got the idea, no matter if you use some ORM, persistence framework, etc at the end your application will execute these SQLs to retrieve or store information into the database.

Let's see the same data model from NoSQL perspective, one of the key decision process about using NoSQL is to denormalize your data in a way that store it and retrieve it will be align with how the application uses it, instead of forcing a row/columns structure.

Questions to be solved by your NoSQL model

What will be benefit of storing the order detail as a separate entity? is the order detail an entity? or just an aggregation values that compounds an order? An order does not have any meaning without the details and they are useless if are not tight to an order. This will mean that the order should have a structure similar to this:

{
   { Order header info }
   { Order details }
}

Now lets figure out what are we going to store at the header info, 90% of the queries you will do to retrieve the order header will include the name of the customer, and his id, it's going to be pointless to show the order in the web application without this vital information, it's going to be easier to ignore information than adding new required information, so I will store the denormalized data from the customer inside the order, like this:

{
   { 
      order_id: "XR1210120",
      customer_id: "CI_29120",
      customer_name: "Jack",
      customer_lastname: "Reaper",
      date: "2012/07/26"
   }
   { Order details }
}

Note: "Hey!, what are you doing? what is going to happen if the customer changes his name?", how many times you face this kind of problem? anyway, if the customer changed his name and he comes back in 10 years, the invoice should be printed as it was at the time of the sale, because that's the way it's stored at the IRS, or any other state department, right?

What about the order details? same approach, we could save the product name along with the order and you will have everything you need at the time of printing/showing the order to the user. What about the product price? same thing, you will need to save the price at the time of the purchase, so it will be better to keep it within the order and not in an inventory table.

Here's an order with full details:

{
    "order_header": {
        "order_id": "XR1210120",
        "customer_id": "CI_29120",
        "customer_name": "Jack",
        "customer_lastname": "Reaper",
        "date": "2012/07/26"
    },
    "Orderdetails": [
        {
            "product_code": "00010101",
            "Category_code": "E_01",
            "description": "iPad",
            "price": 450,
            "units": 1
        },
        {
            "product_code": "00010203",
            "Category_code": "E_01",
            "description": "Xbox 360",
            "price": 320,
            "units": 1
        }
    ]
}

Let's do the same operations over djondb, using the djon-shell:

Create products:

insert('sampledb', 'products', { product_code: '00010101', category_code: 'E_01', name: 'iPad', description:'Tablet blah blah', vendor_code: 'VEN_0010'});

List of products by category:

find('sampledb', 'products', '$"category_code" == "E_01"');

Show the transactions ordered by a user:

Now, you'll only need one round trip to the database to retrieve all the information you will need to show the order:

find('sampledb', 'orders', '$"user" == "JACKTHEREAPER"');

done!, you will get something like:

[{
    "order_header": {
        "order_id": "XR1210120",
        "customer_id": "CI_29120",
        "customer_name": "Jack",
        "customer_lastname": "Reaper",
        "date": "2012/07/26"
    },
    "Orderdetails": [
        {
            "product_code": "00010101",
            "Category_code": "E_01",
            "description": "iPad",
            "price": 450,
            "units": 1
        },
        {
            "product_code": "00010203",
            "Category_code": "E_01",
            "description": "Xbox 360",
            "price": 320,
            "units": 1
        }
    }
},
{
    "order_header": {
        "order_id": "NP112021",
        "customer_id": "CI_29120",
        "customer_name": "Jack",
        "customer_lastname": "Reaper",
        "date": "2012/07/28"
    },
....
]

This result is a JSON, very simple to iterate and interact to, but the idea of this article is not to convince you to use NoSQL, but to help you understanding the difference between your RDBMS application approach and the NoSQL approach, and highlight how these approaches will lead you to change some key points in your architecture.

Now that you got the idea of how the storage is mapped from one side to the other, it's time to take a look of your code.

Code comparison

Until now, the things had been very straightforward, and it's very likely that you have some sentences executing SQLs similar to the ones I showed, but the code is something else, it's hard to know how every architect created their designs, but I will use a commonly known design pattern, known as "DAO" (Data Access Object) to illustrate my point.

The DAO pattern uses a class to retrieve the information from the database and create a Transfer Object that is going to be returned to the business layer, if you want to store information of an order it will work like this:

Note: I will do the assumption that you used jquery in your front-end and it will be able to create a full order in one step (header and dome details) and then bulk it to the server to be stored.

Note: My second assumption is that you have some json framework at java that will be able to handle json nicely.

Who?	What to do?
Web Controller	Catches the data from the "request" object and puts it in a Value Object (POJO, POCO, etc)
Model	Receives the Value Object and perform validations, etc
Model	Calls a DA class to store the VO
DA Class	Creates the required SQL to store the values from the VO

The code (using java) will look something like this:

Web Controller: Catches the data from the "request" object and puts it in a Value Object (POJO, POCO, etc)

Order o = new Order();
JSONObject json = JSONParser.parse(Request.getParameter("order"));
o.setCode(json.getString("order_code"));
o.setCustomerId(json.getString("customer_id"));
o. ... (all the order header information)
for (JSONObject detail : json.getJSONArray("Orderdetails")) {
     OrderDetail d = new OrderDetail();
     d.setProductCode(detail.getString("product_code"));
     d.setUnits(detail.getInt("units"));
     ...
     o.addDetail(d);
}

Now, you will get this and put it on a Model Class (following the MVC pattern)

Model: Receives the Value Object and perform validations, etc

    Model m = new Model();
    m.setOrder(o);
    m.saveOrder();

Model: Calls a DA class to store the VO

    MyDAClass da = new MyDAClass();
    da.insert(o);

DA Class: Creates the required SQL to store the values from the VO

    StringBuffer sql = new StringBuffer();
    sql.append("INSERT INTO Order (Order_code, customer_id....) ");
    sql.append(" VALUES ('" + o.getOrderCode() + "', ...");

    Statement stmOrder = connection.createStatement();
    stmOrder.execute(sql.toString());

    // at this point you will need to do the same for each order detail:
    for (OrderDetail detail : o.getOrderDetails()) {
       StringBuffer sqlDetails = new StringBuffer();
       sql.append("INSERT INTO OrderDetail (Order_detail_code, Order_code, product_code....) ");
       sql.append(" VALUES ('" + detail.getOrderDetailCode() + "', ...");
    }

I know, you could use a preparedStatement here and use parameters, etc... or you could use EJB or ORM to store the VO, but here's not where you will see the difference between NoSQL and RDBMS approaches.

Now, let's take a look of the same code using djondb as NoSQL repository, I will keep the same MVC and DAO concept to keep the code side by side comparable.

Same code NoSQL approach

Web Controller: Catches the data from the "request" object and puts it in a Value Object (POJO, POCO, etc) at this point you will not need to map your data to a VO, it's pointless, you could use JSON at server side and it will work really well.


JSONObject order = JSONParser.parse(Request.getParameter("order"));

Now, you will get this and put it on a Model Class (following the MVC pattern)

Model: Receives the Value Object and perform validations, etc, this does not have any change:

    Model m = new Model();
    m.setOrder(order);
    m.saveOrder();

Model: Calls a DA class to store the VO, same thing the model does not suffer major changes.

    MyDAClass da = new MyDAClass();
    da.insert(o);

DA Class: Creates the required SQL to store the values from the VO

    connectionWrapper.insert("ERP", "Orders", order);

Just one line will do the trick, same with the controller, mainly you are avoiding the overhead of doing unnecessary transformations, you will get the json and store it directly to your database, no more additional processing will be needed here.

The same coding sample could be used to "listing orders" or "showing an order", you will see that you coded a lot less using JSON as transport and persistance mechanism

Conclusion

Although I tried to put it very simple, converting an application from RDBMS to NoSQL is not a trivial task, you will need to spend time creating a new design of your application, yes... you will get a lot from this, you will be able to create new features easily, you will be able to grasp the performance of NoSQL in your application, but you should not think that just putting a NoSQL behind will do the magic and boost your performance, you actually need to start thinking on "NoSQL" data model and new ways to code your application to get most of the NoSQL world.

Wednesday, January 11, 2012

djondb - Progress of my own NoSQL DB

The last two months has been really good, my own NoSQL Server is going like a skyrocket, and I'm close to finish the Milestone 2, here're the milestones I created 3 months ago:

Milestone 1 version 0.1

Basic features

Allow inserts
Allow updates
Allow finds by key
Shell
PHP Driver
C++ Driver

Milestone 2 version 0.1

Complete basic features

Finds by any filter
Arbitrary Index creation
Backup / Restore
Java Driver
C Driver
.NET Driver

Milestone 3 version 0.1

Sharding

Transactions
Sharding

Milestone 4 version 0.1

Nice to have

Authentication
Clustering

At this moment the db is fully functional and I'm doing a demo, which is a craiglist like site to demostrate why NoSQL is a nice technology for fast development.

What's next?

At January 26 I will be at @hubbog doing a speech showing an overview of what is NoSQL, what is it good for (and what is not), and some demos about how a NoSQL (primary djondb) improves the development cycle, and let finish your projects in no time

For the next months I will complete the Milestones 3 and 4, in the meantime I hope to get some help to develop drivers, sites around djondb, etc.

Going open source?

This is a great question, and I dont know the answer yet, I always wanted to share and give something to the world, but I'm not that big to avoid problems with someone else copying my code and using it for their own benefit, that's why I will wait until I finished all the milestones and upload the first version to the web to make this decision.

Building a protocol

Introduction

When I started to rewrite the save file algorithm for djon time tracker, I searched what will be the best way to do this, and I found a lot of people saying XML is the way to go, but everytime I tried this the files got very large and the time to read and write was huge, that's why I started to look at the binary formats.

The binary formats are simple files where you store bytes instead of chars, straighforward definition, but what is that? how do I know that a string started at some point, or I have an integer, when I read it the only think I see is HEX values? Here's where protocols are useful.

from wikipedia: "A communications protocol is a system of digital message formats and rules for exchanging those messages in or between computing systems and in telecommunications. A protocol may have a formal description."

When you write your own protocol you should define an unique way to write and read every single piece of data, then you have to follow the rules to read and write based on the protocol you defined.

Let's build our own protocol.

Let's say that you're going to save or transmit over a wire the data of a customer:

Customer

Name
Last Name
Birth Date
Salary

First, we need to define the type of each data:

Customer

Name: chars
Last Name: chars
Birth Date: date
Salary: integer

Now we define a unique set of rules to write a Customer:

Data Order: the data will follow the same order everytime (Name, Last Name, Birth Date, Salary)
Labels? if we have a fix set of data (like the example above) it's useless to name each piece of data, so we will avoid this and save space

Now we need to define how to save each type, let's start with the easy one.

Integer

In C an integer is a 2 bytes length data, that means you will have a 2 chars to store. Ex: 65000 as a salary will be FDE8 (2 chars, FD and E8 which could be translated to char:253 and 232).

Let's write some code here, and save an integer in the simplest way:


void saveInt(int a) {
    FILE* f = fopen("test.dat", "wb");
    fwrite(&a, 1, sizeof(a), f);
    fclose(f);
}

This code works well and it's very straightforward, but it has a big problem. It will write the 2 bytes (from the example above: FD and E8) in an unknown order, could be E8FD or FDE8 depending on the architecture of the machine it runs, this means that if the architecture of the machine where you're going to read the file changes you will get a very different result. That is called Little/Big Endian problem. To fix this we will ensure that the order will be the same all the time, this is done using the following code:


void writeInt(int a) {
    FILE* f = fopen("test.dat", "wb");
    unsigned char c = (a & 255);
    fwrite(&c, 1, 1, f);
    unsigned char c2= ((a >> 8) & 255);
    fwrite(&c2, 1, 1, f);
 
    fclose(f);
}

This code will ensure that the order will be same everytime, and it will not depend on the architecture of the machine it runs. Let's break down this instructions:


    unsigned char c = (a & 255);

If you have an integer of 65000 (FDF8) it will do an "and" operation with 00FF, this will "erase" the higher byte:


    FDF8
And 00FF
    ====
    00F8

The next instruction will do a similar operation, it will move the bytes from right to left and erase the higher part:


unsigned char c2= ((a >> 8) & 255)

FDF8 >> 8   = XXFD
XXFD & 00FF = 00FD

With this simple method (called Little Endian) we will ensure that the write will be always in the same order, now the read will be easy:


int readInt() {
    FILE* f = fopen("test.dat", "rb");
    unsigned char c;
    fread(&c, 1, 1, f);
    unsigned char c2;
    fread(&c2, 1, 1, f);
    
    int res = c & (c2 << 8);

    fclose(f);

    return res;
}

c2 will be FD and c will contain F8 doing the "<< 8" the FD will go up and adding will result in FDF8 (the original number)

Now that we solved the big issue, the other things are easier.

Strings

One of the main issues with strings is how to deal with the length of the string, one possible solution could be to put a fixed char at the end of the string and read until reach that character.

Strings solution 1


void writeString(char* c, int len) {
   FILE* f = fopen("test.dat" "wb");
   for (int x = 0; x < len; x++)
      fwrite(&c[x], 1, 1, f);

   char end = '*';
   fwrite(&end, 1, 1, f);
   fclose(f);
}

char* readString() {
   FILE *f = fopen("test.dat", "rb");
   char c;
   char buffer[256];
   int pos = 0;
   do {
      fread(&c, 1, 1, f);
      if (c != '*') {
          buffer[pos] = c;
          pos++;
      }
   } while (c != '*');
   buffer[pos] = '\0'; // terminated-string
   fclose(f);
   return buffer;
}

This solution works pretty well, and it can be improved using stringstreams or others, but it has a big problem, what if the original string contain the character '*' in between? change it for other char? what will be the odds that character is included too? This could be easily fixed if you write down the size of the string and then the content of the string, and you read it in the same way, first the length and then the contents.

Strings solution 2


void writeString(char* c, int len) {
   writeInt(len);
   FILE* f = fopen("test.dat" "wb");
   for (int x = 0; x < len; x++)
      fwrite(&c[x], 1, 1, f);

   char end = '*';
   fwrite(&end, 1, 1, f);
   fclose(f);
}

char* readString() {
   int len = readInt();
   FILE *f = fopen("test.dat", "rb");
   char c;
   char* result = (char*)malloc(len+1);
   for (int x = 0; x < len; x++) {
      fread(&c, 1, 1, f);
      if (c != '*') {
          result[x] = c;
      }
   };
   result[len] = '\0'; // terminated-string
   fclose(f);
   return result;
}

Solved! (off course you could change the methods to open the file, do all the operations and then close it, these were written this way to avoid complexity)

Now, the main code:


write(Customer c) {
    writeString(c.name());
    writeString(c.lastName());
    writeDate(c.birthDate()); // I will let this to the reader
    writeInt(c.salary());
}

Customer read() {
    Customer c;
    c.setName(readString());
    c.setLastName(readString());
    c.setBirthDate(readDate());
    c.setSalary(readInt());
}

This solution could be applied to network transmission, files, or anything you want. You could translate this solution to other languages.

Tuesday, January 3, 2012

Apache error 403: Forbidden

I'd run into an annoying problem that is very easy to solve, once you know the answer... as usual.

I was starting a simple project to demostrate how to create an application using djondb as a NoSQL db and wrote some simple pages emulating the famous craigslist page, but as soon as I added the "<Directory>" directive to the apache server started to get the message:


Forbidden

You don't have permission to access /demo2/temp.html on this server.

Run to google... do some searches and all of the results pointed to file permission problems, I just went to the console and run the chmod a+rwx (I know.. it's not secure, but it's a demo pc), restarted apache and... puff.. the error kept popping up, read more... did some changes... and nothing the error persisted, then I crossed to a post that was really helpful Fixing 403 Forbidden on alias directory with Apache one of the answers suggested to login using the apache user, and try to navigate to the file.

As soon as I did that I realized that my folder was: /home/cross/workspace/db/demo... etc... and I changed the permissions to the "demo" directory, but not to all the parent folders (workspace/db) and that was causing the problem. I added my private group (cross) to the user www-data (the user used to start apache) and now everything is working.

usermod -a -G cross www-data

Easy? yes... I know it is but I want to share this "enlightened knowledge" in case you crossed with the same problem. Took 1 hour of my time to solve this, wondering why on earth the test application worked fine with the PC at my office (ubuntu 10.10) and didn't at my home PC with ubuntu 11.10. (Actually I still wonder why I didnt run into this problem at my office... the path is the same)