Tracking Weird Problems

I got word that a bug in our software was becoming high priority. Because of this problem, our users had to do a lot of extra work. This is never a good thing. And it was especially troubling since the beginning of the year is the busiest time for our client.

The users were so busy that at first I could not get any information about the problem. So I just started working on other issues. But then some important customers were saying these problems were really holding them back. So I tracked down some users and got some info from them.

Apparently some nightly jobs were not inserting records as expected. At first glance, it looked like records were being inserted. But talking as I talked with a few users, I found that not all expected records were being populated. I had a hard time tracing the actual code for this nightly job. Last year somebody made some changes and did not check them into source control. Shame on them.

I had to go back to our delivery of the changes for the code last year to get a hold of the source. Luckily I found one pattern for the missing data. An error code was being stored in a look up table. This was the start of a break through. The only problem was that the source code showed that the problems were supposed to be logged each time the error code was set. I could find no evidence of such logs in production.

Next step was to try to replicate the problem. This was harder than it seemed. The nightly job was something we were not set up to test. Somebody once had an idea that we should have a test bed. But it never got further than an idea. I had a DBA help out and try to start creating users to match production. No luck. I got bold and ran a few tests in production. This also proved to be a dead end.

By now I had to resort to an old fashioned technique. I went to trace the code and results by hand. Went through an example of some data which got an error code set. Kept building up the huge SQL statement that actually inserts the data. Then I got an Oracle error. Apparently the users had entered some SQL into the system that did GROUP BY and ORDER BY operations. The code tries to append WHERE clause data at the end. But this cannot come after the GROUP BY or ORDER BY in a SQL statement. I got the users to clean up their SQL.

In retrospect, this ordeal made me realize a few things. We need an adequate test bed. We need production to log errors diligently (still working on this). And finally the software should not allow users to enter data that will result in middle of the night errors. Time to implement some validation.

Basic Failure

I knew something was wrong when a couple different developers on the project asked me if I was good at SQL. We have full time DBAs. We have consultants from Oracle. But I still got the call.

Apparently a script was not working correctly in development. It worked fine in Production. But a developer needed to do some tests in development and things were going weird. There was a column IDENTIFIER in a table defined as VARCHAR2(5). And they tried to run a query like this:

SELECT COUNT(*)
FROM table
WHERE identifier='00350';

The problem was that the count was returning 0. However there were many records that had the specific identified the WHERE clause was comparing. This was just blowing the mind of the developer. Surprisingly when he removed the single quotes from the WHERE clause, the count came back correctly.

We tried a number of experiments to determine the source of the problem. For example, we inserted a new row with the same identified. Then the count returned 1 (not 51 as we expected). However updating all 51 records setting the identifier to the quoted value did not change the query results. Somebody thought maybe this was due to a bad index on the column. But this column was not a part of any index. We also tried creating another database table with a similarly defined column, but we could not replicate the problem with this other table.

My conclusion was that something went wrong when these 50 records were put into the table. I could not explain why. We loaded the data doing a sql*load like we always do. For now to get the fix out the developer changed the source code to not use quotes. This is just a hack until we can get a grip on this perplexing issue. Any thoughts?

Fast to Slow


I work on a complex system. We have huge amount of PL/SQL code running at night to do all sorts of tasks. One such task is performing number crunching and storing results in meta data tables. The applications consult these tables during the day to get information quickly.

Our database is a very large one. And the data keeps growing. So the number crunching job kept getting longer and longer. I asked for help from the performance engineering team at our company. Their first idea was a divide and conquer method. So I wrote some korn shell scripts that run 10 jobs in parallel, crunching portions of the data at the same time. There was not a 10 for 1 speed improvement. But it was close. We had a very happy system administration staff working at night because now our jobs finished quickly.

When we started up the system the following year, the number crunching routine slowed back down to old levels. Our chief DBA assumed that we somehow deployed the old code. I had him check the Production korn shell and PL/SQL packages. The correct code was there. So I called in the performance engineering team. They monitored the job the next night. The job finished in record time. Performance engineering said perhaps the gathering of statistics first was the key. Whatever the reason, I am glad we are back to quick speeds.
We had to implement changes to our password policies. To centralize business rules, we coded the specifics in one password verify function. This was painful to test because we have many applications that connect to the database using different technologies.

I found that those applications which used OLE DB were able to respond to errors intelligently. We threw different errors when bad passwords were chosen. This was a breeze to catch using OLE DB.

For most of the generic errors we were able to handle the situation with applications using Pro*C. However some errors contained larger error text. There were multiple ORA error numbers embedded in the text. With Pro*C, we were getting the from the SQLCA (SQL Communications Area). But the error text in this structure is limited to 70 characters. There were scenarios where we needed access to more text.

In the end we implemented a hack whereby the Pro*C code duplicated the password verification logic so we were not stuck needing more than 70 characters of error text. I seriously hope we can redesign this hack in the next go around. Perhaps we shall port all the Pro*C access to use OLE DB. Or maybe we can dig deeper to find a way to resolve this in Pro*C.

The Move to DB2

My company lost the maintenance contract with our client. Last time this happened I just jumped to the new contractor. But this time around I am going to stay with my company.

This means a new project and a new database. In my case, it will be IBM DB2. I know nothing about DB2. So it is time to hit the books. I am a little sad at leaving Oracle development. I have built up quite a library of Oracles books that I have read cover to cover.

Here are some of the titles:
  • Oracle PL/SQL Programming
  • Oracle Performance Tuning
  • Oracle SQL*Plus
  • Oracle Database Administration
  • Oracle Essentials

Luckily my company has an online books service. So I am already checking out some DB2 titles.