Wednesday, January 11, 2012

djondb - Progress of my own NoSQL DB

The last two months have been really good: my own NoSQL server is taking off like a rocket, and I'm close to finishing Milestone 2. Here are the milestones I created three months ago:

Milestone 1 version 0.1

Basic features

Allow inserts
Allow updates
Allow finds by key
Shell
PHP Driver
C++ Driver

Milestone 2 version 0.1

Complete basic features

Finds by any filter
Arbitrary Index creation
Backup / Restore
Java Driver
C Driver
.NET Driver

Milestone 3 version 0.1

Sharding

Transactions
Sharding

Milestone 4 version 0.1

Nice to have

Authentication
Clustering

At this moment the db is fully functional, and I'm building a demo: a craigslist-like site to demonstrate why NoSQL is a nice technology for fast development.

What's next?

On January 26 I will be at @hubbog giving a talk with an overview of what NoSQL is, what it is good for (and what it is not), and some demos of how a NoSQL database (primarily djondb) improves the development cycle and lets you finish your projects in no time.

Over the next few months I will complete Milestones 3 and 4. In the meantime, I hope to get some help developing drivers, sites around djondb, etc.

Going open source?

This is a great question, and I don't know the answer yet. I have always wanted to share and give something to the world, but I'm not big enough to avoid problems with someone else copying my code and using it for their own benefit. That's why I will wait until I have finished all the milestones and uploaded the first version to the web before making this decision.

Building a protocol

Introduction

When I started to rewrite the save-file algorithm for djon time tracker, I looked for the best way to do it and found a lot of people saying XML is the way to go. But every time I tried it the files got very large and reading and writing took a long time, so I started to look at binary formats.

Binary formats are simply files where you store raw bytes instead of text characters. That is a straightforward definition, but what does it mean in practice? How do I know where a string starts, or that I have an integer, when all I see on reading the file is hex values? This is where protocols are useful.

From Wikipedia: "A communications protocol is a system of digital message formats and rules for exchanging those messages in or between computing systems and in telecommunications. A protocol may have a formal description."

When you write your own protocol you define a unique way to write and read every single piece of data, and then you follow those rules every time you read or write.

Let's build our own protocol.

Let's say that you're going to save or transmit over a wire the data of a customer:

Customer

Name
Last Name
Birth Date
Salary

First, we need to define the type of each piece of data:

Customer

Name: chars
Last Name: chars
Birth Date: date
Salary: integer

Now we define a unique set of rules to write a Customer:

Data Order: the data will follow the same order every time (Name, Last Name, Birth Date, Salary)
Labels? If we have a fixed set of data (like the example above) there is no need to name each piece of data, so we will skip the labels and save space

Now we need to define how to save each type, let's start with the easy one.

Integer

In C the size of an int depends on the platform (at least 2 bytes, usually 4 today). For this protocol we will store integers as 2 unsigned bytes, which covers values from 0 to 65535. Ex: 65000 as a salary will be FDE8 in hex (2 bytes, FD and E8, which are 253 and 232 in decimal).

Let's write some code here, and save an integer in the simplest way:


void saveInt(int a) {
    FILE* f = fopen("test.dat", "wb");
    // dumps the int exactly as it sits in memory: sizeof(a) bytes, in host byte order
    fwrite(&a, 1, sizeof(a), f);
    fclose(f);
}

This code works and is very straightforward, but it has a big problem. It writes the bytes in whatever order the machine keeps them in memory (and as many bytes as an int has on that machine), so the two bytes from the example above could come out as E8FD or FDE8 depending on the architecture. If the file is read on a machine with a different architecture, you will get a very different result. This is known as the little/big endian problem. To fix it we make sure the order is always the same, using the following code:


void writeInt(int a) {
    FILE* f = fopen("test.dat", "wb");
    unsigned char c = (a & 255);         // low byte goes first
    fwrite(&c, 1, 1, f);
    unsigned char c2 = ((a >> 8) & 255); // then the high byte
    fwrite(&c2, 1, 1, f);

    fclose(f);
}

This code ensures that the order is the same every time and does not depend on the architecture of the machine it runs on. Let's break down these instructions:


    unsigned char c = (a & 255);

If you have an integer of 65000 (FDE8), it does an "and" operation with 00FF, which "erases" the higher byte:


    FDE8
And 00FF
    ====
    00E8

The next instruction does a similar operation: it shifts the bits 8 positions to the right, so the higher byte moves down into the lower position, and then erases whatever is left above it (shown as XX):


unsigned char c2= ((a >> 8) & 255)

FDE8 >> 8   = XXFD
XXFD & 00FF = 00FD

With this simple method (writing the low byte first is called little endian) we ensure that the bytes are always written in the same order. Now the read is easy:


int readInt() {
    FILE* f = fopen("test.dat", "rb");
    unsigned char c;
    fread(&c, 1, 1, f);   // low byte was written first
    unsigned char c2;
    fread(&c2, 1, 1, f);  // then the high byte

    int res = c | (c2 << 8); // OR the two halves back together

    fclose(f);

    return res;
}

c will contain E8 and c2 will contain FD. The "<< 8" moves FD up into the higher byte, and combining it with c using "|" gives FDE8 (the original number).
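The same idea can be tried without touching a file. What follows is a minimal sketch, not the code djondb uses: the names put_u16_le, get_u16_le and host_is_little_endian are made up for this illustration, and it uses uint16_t so the value is exactly 2 bytes no matter how wide int is on your platform.

```c
#include <stdint.h>
#include <string.h>

/* Stores v in buf as 2 bytes, low byte first (little endian),
   regardless of how the host keeps integers in memory. */
void put_u16_le(unsigned char *buf, uint16_t v) {
    buf[0] = (unsigned char)(v & 0xFF);
    buf[1] = (unsigned char)((v >> 8) & 0xFF);
}

/* Rebuilds the value stored by put_u16_le. */
uint16_t get_u16_le(const unsigned char *buf) {
    return (uint16_t)(buf[0] | (buf[1] << 8));
}

/* Returns 1 when the host itself keeps the low byte first in memory.
   memcpy copies the raw in-memory bytes, which is what a plain
   fwrite(&v, 1, sizeof(v), f) would put in the file. */
int host_is_little_endian(void) {
    uint16_t one = 1;
    unsigned char raw[2];
    memcpy(raw, &one, 2);
    return raw[0] == 1;
}
```

For 65000, put_u16_le always produces E8 followed by FD, while the raw memory copy only matches that on a little endian host. That difference is exactly why the protocol fixes the order itself.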

Now that we solved the big issue, the other things are easier.

Strings

One of the main issues with strings is knowing where the string ends. One possible solution is to put a fixed character at the end of the string and read until you reach that character.

Strings solution 1


void writeString(char* c, int len) {
   FILE* f = fopen("test.dat", "wb");
   for (int x = 0; x < len; x++)
      fwrite(&c[x], 1, 1, f);

   char end = '*';
   fwrite(&end, 1, 1, f);
   fclose(f);
}

char* readString() {
   FILE *f = fopen("test.dat", "rb");
   char c;
   static char buffer[256]; // static, so the pointer is still valid after we return
   int pos = 0;
   // stop at the terminator, at end of file, or when the buffer is full
   while (pos < 255 && fread(&c, 1, 1, f) == 1 && c != '*') {
      buffer[pos] = c;
      pos++;
   }
   buffer[pos] = '\0'; // null-terminated string
   fclose(f);
   return buffer;
}

This solution works, and it can be improved using stringstreams or similar, but it has a big problem: what if the original string contains the character '*' somewhere in the middle? Change the terminator to another character? What are the odds that the new character shows up too? This is easily fixed by writing the size of the string first and then its contents, and reading it back the same way: first the length, then the contents.
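The collision is easy to reproduce in memory. The sketch below is only an illustration of the problem; frame_with_star and read_until_star are hypothetical helper names, not part of the code above.

```c
#include <stddef.h>
#include <string.h>

/* Copies s into out and appends the '*' terminator.
   Returns the total number of bytes stored (out must be big enough). */
size_t frame_with_star(const char *s, char *out) {
    size_t len = strlen(s);
    memcpy(out, s, len);
    out[len] = '*';
    return len + 1;
}

/* Returns how many bytes a reader accepts before it sees the first '*'. */
size_t read_until_star(const char *data, size_t n) {
    size_t i = 0;
    while (i < n && data[i] != '*') {
        i++;
    }
    return i;
}
```

If the payload is "5*3", the reader stops after the first byte and the rest of the string is lost, even though the writer did nothing wrong.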

Strings solution 2


void writeString(char* c, int len) {
   writeInt(len); // length prefix first (writeInt creates the file)
   FILE* f = fopen("test.dat", "ab"); // append, so the prefix is not overwritten
   for (int x = 0; x < len; x++)
      fwrite(&c[x], 1, 1, f);

   // no terminator needed: the reader knows the length
   fclose(f);
}

char* readString() {
   int len = readInt();
   FILE *f = fopen("test.dat", "rb");
   fseek(f, 2, SEEK_SET); // skip the 2-byte length that readInt already consumed
   char c;
   char* result = (char*)malloc(len+1); // the caller must free() this
   for (int x = 0; x < len; x++) {
      fread(&c, 1, 1, f);
      result[x] = c; // '*' is now just another character
   }
   result[len] = '\0'; // null-terminated string
   fclose(f);
   return result;
}

Solved! (Of course, you could change the methods to open the file once, do all the operations and then close it; they were written this way to avoid complexity.)
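As a rough sketch of that "open once" variant, the helpers can take an already opened FILE* so several values land one after another in the same stream. The names write_string and read_string and their return conventions are assumptions made for this example, not the original functions.

```c
#include <stdio.h>
#include <stdlib.h>

/* Writes a 2-byte little endian length, then the raw bytes of s.
   Returns 0 on success, -1 on error or if len does not fit in 2 bytes. */
int write_string(FILE *f, const char *s, size_t len) {
    unsigned char hdr[2];
    if (len > 0xFFFF) return -1;
    hdr[0] = (unsigned char)(len & 0xFF);
    hdr[1] = (unsigned char)((len >> 8) & 0xFF);
    if (fwrite(hdr, 1, 2, f) != 2) return -1;
    if (fwrite(s, 1, len, f) != len) return -1;
    return 0;
}

/* Reads one string written by write_string from the current position.
   Returns a malloc'd, null-terminated copy (the caller frees it),
   or NULL on a short read or allocation failure. */
char *read_string(FILE *f) {
    unsigned char hdr[2];
    size_t len;
    char *s;
    if (fread(hdr, 1, 2, f) != 2) return NULL;
    len = (size_t)hdr[0] | ((size_t)hdr[1] << 8);
    s = (char *)malloc(len + 1);
    if (s == NULL) return NULL;
    if (fread(s, 1, len, f) != len) {
        free(s);
        return NULL;
    }
    s[len] = '\0';
    return s;
}
```

The caller opens the file, calls write_string as many times as needed and closes it; reading works the same way, in the same order. Because the position in the file advances with every call, each read picks up where the previous one stopped.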

Now, the main code:


void write(Customer c) {
    // assumes name() and lastName() return null-terminated char*
    writeString(c.name(), strlen(c.name()));
    writeString(c.lastName(), strlen(c.lastName()));
    writeDate(c.birthDate()); // left as an exercise for the reader
    writeInt(c.salary());
}

Customer read() {
    Customer c;
    c.setName(readString());
    c.setLastName(readString());
    c.setBirthDate(readDate());
    c.setSalary(readInt());
    return c;
}

This solution could be applied to network transmission, files, or anything you want. You could translate this solution to other languages.
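For network transmission you usually build the message in a memory buffer first. Here is one possible sketch of the whole Customer record in plain C. Everything in it is an assumption made for illustration: the struct layout, the 64-byte name fields, the helper names, and especially the date encoding (2-byte year, 1-byte month, 1-byte day), which the text above leaves open.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical record type; the field sizes are assumptions. */
typedef struct {
    char name[64];
    char last_name[64];
    uint16_t birth_year;
    uint8_t birth_month;
    uint8_t birth_day;
    uint16_t salary;
} Customer;

static size_t put_u16(unsigned char *p, uint16_t v) {
    p[0] = (unsigned char)(v & 0xFF); /* low byte first */
    p[1] = (unsigned char)(v >> 8);
    return 2;
}

static uint16_t get_u16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Length-prefixed string; returns bytes written. */
static size_t put_str(unsigned char *p, const char *s) {
    size_t len = strlen(s);
    put_u16(p, (uint16_t)len);
    memcpy(p + 2, s, len);
    return 2 + len;
}

/* Returns bytes consumed, or 0 if the data is short or dst is too small. */
static size_t get_str(const unsigned char *p, size_t avail, char *dst, size_t cap) {
    size_t len;
    if (avail < 2) return 0;
    len = get_u16(p);
    if (len > avail - 2 || len >= cap) return 0;
    memcpy(dst, p + 2, len);
    dst[len] = '\0';
    return 2 + len;
}

/* Fixed order: Name, Last Name, Birth Date, Salary. No labels.
   Returns bytes written, or 0 if buf is too small. */
size_t encode_customer(const Customer *c, unsigned char *buf, size_t cap) {
    size_t need = 2 + strlen(c->name) + 2 + strlen(c->last_name) + 4 + 2;
    size_t n = 0;
    if (need > cap) return 0;
    n += put_str(buf + n, c->name);
    n += put_str(buf + n, c->last_name);
    n += put_u16(buf + n, c->birth_year);
    buf[n++] = c->birth_month;
    buf[n++] = c->birth_day;
    n += put_u16(buf + n, c->salary);
    return n;
}

/* Reads the fields back in the same order.
   Returns bytes consumed, or 0 if the data is incomplete. */
size_t decode_customer(const unsigned char *buf, size_t len, Customer *out) {
    size_t n = 0, used;
    used = get_str(buf + n, len - n, out->name, sizeof out->name);
    if (used == 0) return 0;
    n += used;
    used = get_str(buf + n, len - n, out->last_name, sizeof out->last_name);
    if (used == 0) return 0;
    n += used;
    if (len - n < 6) return 0;
    out->birth_year = get_u16(buf + n);
    out->birth_month = buf[n + 2];
    out->birth_day = buf[n + 3];
    out->salary = get_u16(buf + n + 4);
    return n + 6;
}
```

The encoded buffer can be handed to fwrite or to a socket send call unchanged. The reader only needs to know the field order and the type rules, which is all the protocol really is.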

Tuesday, January 3, 2012

Apache error 403: Forbidden

I ran into an annoying problem that is very easy to solve once you know the answer... as usual.

I was starting a simple project to demonstrate how to create an application using djondb as a NoSQL db, and wrote some simple pages emulating the famous craigslist page. But as soon as I added the "<Directory>" directive to the Apache server, I started to get this message:


Forbidden

You don't have permission to access /demo2/temp.html on this server.

I ran to Google and did some searches, and all of the results pointed to file permission problems. I went to the console and ran chmod a+rwx (I know, it's not secure, but it's a demo PC), restarted Apache and... poof, the error kept popping up. I read more, made some changes, and still the error persisted. Then I came across a really helpful post, "Fixing 403 Forbidden on alias directory with Apache"; one of the answers suggested logging in as the Apache user and trying to navigate to the file.

As soon as I did that, I realized that my folder was /home/cross/workspace/db/demo... etc., and that I had changed the permissions on the "demo" directory but not on all the parent folders (workspace/db). That was causing the problem. I added the user www-data (the user that runs Apache) to my private group (cross), and now everything is working.

usermod -a -G cross www-data

Easy? Yes, I know it is, but I want to share this "enlightened knowledge" in case you run into the same problem. It took an hour of my time to solve, wondering why on earth the test application worked fine on the PC at my office (Ubuntu 10.10) and not on my home PC with Ubuntu 11.10. (Actually, I still wonder why I didn't run into this problem at my office... the path is the same.)