NoSQL on the Rise

There is a relatively new movement called NoSQL. It involves high-performance databases that come at little to no cost. Social networks have been using NoSQL. You might be able to use it too, depending on your specific database requirements.

The NoSQL movement has different products from different vendors. Amazon created SimpleDB, which is data storage for key-value pairs. And Google created BigTable, publishing the data model behind it.

Some NoSQL databases focus on storing JSON objects. Examples of these databases are CouchDB and MongoDB. CouchDB lets you use a REST API to access the JSON data.
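Since I live mostly in the Oracle world, here is a rough sketch of what hitting a CouchDB REST API could look like from PL/SQL with UTL_HTTP. The host, database name (mydb), and document ID (doc123) are all made up for illustration; CouchDB listens on port 5984 by default.

DECLARE
   req   UTL_HTTP.REQ;
   resp  UTL_HTTP.RESP;
   line  VARCHAR2(32767);
BEGIN
   -- GET a single document from a hypothetical CouchDB database
   req  := UTL_HTTP.BEGIN_REQUEST('http://localhost:5984/mydb/doc123', 'GET');
   resp := UTL_HTTP.GET_RESPONSE(req);
   LOOP
      UTL_HTTP.READ_LINE(resp, line, TRUE);
      DBMS_OUTPUT.PUT_LINE(line);   -- the JSON document comes back in the response body
   END LOOP;
EXCEPTION
   WHEN UTL_HTTP.END_OF_BODY THEN
      UTL_HTTP.END_RESPONSE(resp);
END;
/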

These NoSQL databases often have limitations. Some impose a time limit for queries to complete. Others return only partial data sets. These databases can hold a lot of data, but they are often a poor fit for complicated queries.

Script Performance

I had a script to write to correct some data. The script was going to affect about 20k records. For each of those 20k records, one part of the script needed to mine through a table with 1B records. So I knew that a brute-force approach would never complete.

So I thought perhaps I could preprocess those 1B records. Only a few thousand of them were actually of interest. If I could just load those couple thousand relevant records into memory, I could use that data to go through my batch.

This seemed to be a good candidate for some type of collection. But what kind should I use? I don't do enough PL/SQL to be very comfortable with collections. I read around and thought maybe it could be a VARRAY. But I only had experience with PL/SQL tables. What is a developer to do?
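For what it is worth, had I gone the collection route, a plain associative array keyed by the record ID (what I keep calling a PL/SQL table) would probably have done the trick. Here is a sketch with made-up table and column names:

DECLARE
   -- associative array keyed by the ID I need to look up later
   TYPE t_lookup IS TABLE OF raw_data.some_value%TYPE INDEX BY PLS_INTEGER;
   v_lookup t_lookup;
BEGIN
   -- load only the few thousand interesting rows from the 1B record table
   FOR rec IN (SELECT id, some_value
                 FROM raw_data
                WHERE status = 'INTERESTING') LOOP
      v_lookup(rec.id) := rec.some_value;
   END LOOP;

   -- later, inside the batch over the 20k records, the lookup is just
   -- IF v_lookup.EXISTS(some_id) THEN ... v_lookup(some_id) ... END IF;
END;
/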

In the end I needed to get this script coded ASAP. So I took the simplest programmatic approach. I created a temporary table with a few thousand records, representing the important data mined from the 1B record raw table. I am hoping that SQL against this small temporary table will be very fast. We shall see when I ship the fix out to the production environment, where the volumes are huge.
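With made-up table and column names, and assuming a global temporary table so the data disappears when the session ends, the shape of it is roughly this:

-- one-time DDL: holds only the few thousand rows mined from the big table
CREATE GLOBAL TEMPORARY TABLE relevant_raw (
   id         NUMBER PRIMARY KEY,
   some_value VARCHAR2(100)
) ON COMMIT PRESERVE ROWS;

-- inside the fix script: preprocess the 1B record table once
INSERT INTO relevant_raw (id, some_value)
   SELECT id, some_value
     FROM raw_data
    WHERE status = 'INTERESTING';

-- then each correction joins to the small table instead of the huge one
UPDATE target_table t
   SET t.some_value = (SELECT r.some_value
                         FROM relevant_raw r
                        WHERE r.id = t.raw_id)
 WHERE t.needs_fix = 'Y'
   AND EXISTS (SELECT 1 FROM relevant_raw r WHERE r.id = t.raw_id);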

MongoDB

Previously I had written an overview of the NoSQL movement. Today I want to talk a little more about a specific implementation. I will cover MongoDB.

NoSQL has some niche applications. Sometimes your information does not fit into a relational database schema. MongoDB tries to provide high performance. This is in contrast to other NoSQL implementations which are geared toward high concurrency.

There are no transactions in MongoDB. You can't recover from an abrupt shutdown. MongoDB is good if you need data quickly and you access it frequently. JSON is used as the default data notation. Internally MongoDB uses BSON (binary JSON). MongoDB itself is written in C++.

Using MongoDB is very similar to using a traditional relational database. I have not played with it yet. Not sure that I will. You still need to know what the competition is using, though.

Chunking for Performance

So you have a lot of data you want to update. I am talking about the whole big table. How do you do it? Well, in the good old days you found a way to divide and conquer. Pretty much you put some limiter in the WHERE clause of the update to do a small bunch at a time. That way the whole darn table did not get locked.
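The old-school version of that, roughly sketched with made-up names, is just a loop that takes a small bite, commits, and goes back for more:

BEGIN
   LOOP
      -- the limiter: only rows still needing the fix, capped by ROWNUM
      UPDATE big_table
         SET processed_flag = 'Y'
       WHERE processed_flag = 'N'
         AND ROWNUM <= 10000;

      EXIT WHEN SQL%ROWCOUNT = 0;   -- nothing left to update
      COMMIT;                       -- release the locks before the next bite
   END LOOP;
   COMMIT;
END;
/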

The idea is similar in Oracle 11g Release 2. However, the chunking is supported by the database itself. You make use of the DBMS_PARALLEL_EXECUTE package. You choose a way to chunk up your data. That is, you divide up the data by some characteristic. The package then does the work of splitting up your DML into separate statements which operate in parallel.

The procedures in the package do their own COMMIT on their parts. You also need to have the CREATE JOB privilege, as a bunch of jobs will be scheduled on your behalf to get the job done in parallel. A common way to chunk up your data is by ROWID. However, you can choose some other attribute of your data. On our project we use a generated ID to do the chunking. And previously we had used partitions to break up the chunks. Maybe it is time to rethink that plan.
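Here is a minimal sketch of the ROWID flavor, with made-up table and column names:

BEGIN
   -- create a task and carve the table into ROWID chunks of about 10000 rows
   DBMS_PARALLEL_EXECUTE.CREATE_TASK('fix_big_table');
   DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
      task_name   => 'fix_big_table',
      table_owner => USER,
      table_name  => 'BIG_TABLE',
      by_row      => TRUE,
      chunk_size  => 10000);

   -- the DML runs once per chunk; :start_id and :end_id are bound to each chunk's ROWID range
   DBMS_PARALLEL_EXECUTE.RUN_TASK(
      task_name      => 'fix_big_table',
      sql_stmt       => 'UPDATE big_table
                            SET processed_flag = ''Y''
                          WHERE ROWID BETWEEN :start_id AND :end_id',
      language_flag  => DBMS_SQL.NATIVE,
      parallel_level => 8);   -- these are the jobs scheduled on your behalf, hence CREATE JOB

   DBMS_PARALLEL_EXECUTE.DROP_TASK('fix_big_table');
END;
/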