Tracking Weird Problems

I got word that a bug in our software was becoming high priority. Because of this problem, our users had to do a lot of extra work. This is never a good thing. And it was especially troubling since the beginning of the year is the busiest time for our client.

The users were so busy that at first I could not get any information about the problem. So I just started working on other issues. But then some important customers were saying these problems were really holding them back. So I tracked down some users and got some info from them.

Apparently some nightly jobs were not inserting records as expected. At first glance, it looked like records were being inserted. But talking as I talked with a few users, I found that not all expected records were being populated. I had a hard time tracing the actual code for this nightly job. Last year somebody made some changes and did not check them into source control. Shame on them.

I had to go back to our delivery of the changes for the code last year to get a hold of the source. Luckily I found one pattern for the missing data. An error code was being stored in a look up table. This was the start of a break through. The only problem was that the source code showed that the problems were supposed to be logged each time the error code was set. I could find no evidence of such logs in production.

Next step was to try to replicate the problem. This was harder than it seemed. The nightly job was something we were not set up to test. Somebody once had an idea that we should have a test bed. But it never got further than an idea. I had a DBA help out and try to start creating users to match production. No luck. I got bold and ran a few tests in production. This also proved to be a dead end.

By now I had to resort to an old fashioned technique. I went to trace the code and results by hand. Went through an example of some data which got an error code set. Kept building up the huge SQL statement that actually inserts the data. Then I got an Oracle error. Apparently the users had entered some SQL into the system that did GROUP BY and ORDER BY operations. The code tries to append WHERE clause data at the end. But this cannot come after the GROUP BY or ORDER BY in a SQL statement. I got the users to clean up their SQL.

In retrospect, this ordeal made me realize a few things. We need an adequate test bed. We need production to log errors diligently (still working on this). And finally the software should not allow users to enter data that will result in middle of the night errors. Time to implement some validation.