Uncompiled Report

I went to visit the customer site the other day. One of the testers there said they were having a problem with a report not printing. I was on another high priority task, so I encouraged them to submit a trouble ticket. The ticket got assigned to me. At first I could not tell what was going on. There were no events logged to the database to suggest foul play. I could not make the problem happen in a development environment. In the end, I wrote a debug version of the Oracle Report involved, logging each line of code to the database.
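
The logging helper behind a debug build like that is simple. Here is a sketch with made-up names; the autonomous transaction keeps the log rows around even when the report's own work rolls back:

CREATE OR REPLACE PROCEDURE debug_log (p_msg IN VARCHAR2) IS
  PRAGMA AUTONOMOUS_TRANSACTION;  -- commit log rows independently of the caller
BEGIN
  INSERT INTO debug_log_table (logged_at, msg)  -- hypothetical log table
  VALUES (SYSDATE, p_msg);
  COMMIT;
END debug_log;
/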

When the customer replicated the problem, I found that the first report called was generating an exception when trying to run a second worker report. On top of that, the caller report was unable to log this activity to the database due to a coding error in its exception handler. At least one mystery was solved. However, I was still unsure why the worker report was failing. I went into the customer environment and exported the data that was in use when the problem happened. Then I went back to my development environment and tried to run the worker report using Oracle Reports Builder. The result was a REP-1247 “Report contains uncompiled code”.

At first I just recompiled the report using my database login, and after that the report ran fine. So I started to concentrate on why the report got the REP-1247 in the first place. I reasoned that you should only need to compile the report once. After that you should be able to run the report without recompiling it. I traced the problem down to a line in the after parameter form trigger of the report. It was doing a SELECT FROM DUAL. That seemed harmless. However the report just would not run in Reports Builder unless I recompiled it. Our local reports expert told me to apply the latest patch to the old Reports 6i tool. That did not solve my problem either.

In the end, our reports guru said you just need to recompile the report every time you want to run it from Reports Builder. That did not seem too encouraging. It also did not feel right. This really feels like a bug. And I know the Oracle answer will be for me to upgrade to Oracle Reports 10g. However that requires some infrastructure upgrades, like a back end report server. I don’t think my client is ready for all of that just yet. I can’t be the only one having this problem. Or maybe we are few, since I am mixing the old Reports 6i with a recent Oracle 10g database. Does anybody have any advice for me?

TRIM to NULL


Our customer complained that data was sometimes not showing up in one of our applications. A developer on our team researched the problem and found that it occurred only when the data had leading spaces. His solution was to modify the retrieval Pro*C code to trim out the spaces. On the surface that sounded like an acceptable solution. So he changed the code from

SELECT NVL(text, ' ')
INTO v_text;

to

SELECT TRIM(NVL(text, ' '))
INTO v_text;

The result was that text with leading spaces was displayed without the spaces. The problem is that this change made the application throw an ORA-1405 exception in most cases where there was no text at all. This was a difficult problem to diagnose. The SQL ran fine in SQL*Plus and PL/SQL Developer.

I was the one who stepped in and debugged this problem. The way I found that this change was the culprit was by reviewing the source code change history. The text column was NULL for most records. The NVL turned it into a space. However the TRIM turned that space back into an empty string (NULL). Pro*C did not like putting such a value into the host variable. The result was the exception we were getting.

My initial fix was to back out the changes and ship our release. Then I thought maybe we could add one extra NVL around the whole clause, so the fix goes in without breaking the normal case. The problem is that it was starting to look like some ugly code. I am thinking our eventual fix will be to modify the application to handle text with leading spaces. Nevertheless this was an interesting exercise in debugging. Another developer spent the weekend trying to figure out what was going on.
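
For illustration, the extra NVL idea would look something like this (a sketch; the FROM clause and table name are placeholders the original snippets left out):

EXEC SQL SELECT NVL(TRIM(NVL(text, ' ')), ' ')
         INTO :v_text
         FROM some_table;  -- hypothetical table name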

Developer of the Year

This month’s Oracle magazine doled out a bunch of awards. One of them was the PL/SQL Developer of the Year. This year the honor went to Alex De Vergori. He is a database architect for Betfair in London, England.

Alex works on a betting engine with 250k lines of code. It is distributed among computers located around the world. It uses Oracle RAC, as well as Oracle Coherence. There is a middle tier written in Java. The front end uses Ajax. The system processes 5 million transactions a day on average. At peak times it crunches 1000 transactions a second.

I wondered what it took to become PL/SQL Developer of the Year. Alex stated that you need attention to detail. You also need to be passionate and take pride in your software. Somehow I think you might also need to be working with the latest Oracle technologies. To tell you the truth, I had not even heard of Coherence before reading about this award.

Looking back on my own PL/SQL programming career, my biggest year was when I implemented a subsystem of our loads software. It was written mostly in PL/SQL, with a little UNIX Korn shell and a SQL script or two. The production code had about 6k lines of code. However I also coded up a unit test harness that was 5k lines of code by itself.

This may not seem like an abundance of code. The hard part was determining the requirements for the coding. That took so long that the schedule only gave me 4 days to code the whole thing. Let’s just say I knocked out a lot of code that compiled. And it ran super fast. There were a number of trouble tickets written against my code. They all got resolved.

So what should I do? Nominate myself for the award next year? Maybe I should try to contribute to Oracle magazine. No. I think I will just keep posting to my blog here. In fact, perhaps I should give out my own award. What do you think about the “2008 Top Oracle PL/SQL Blogger” award? I hate to tell you this, but the winner is me.

High Priority

Our customer organization identified a problem with one of our applications. So they submitted a high priority trouble ticket. Internal testers working for our company could not duplicate the problem in their test environment. The development team lead called me asking why the customer made this a high priority problem. I told him I could think of many reasons why they would justify such a classification. However I recommended we just fix the problem and not try to fight whether it was high priority or not. Due to my recommendation, the problem got assigned to me.

I always like to do a little homework on a problem assigned to me before I call the customer. So I installed the application transmittal they were running. However I could not make the problem happen. I put in a call to the customer who opened the problem. She was not at her desk. Due to the high priority nature of the problem, I tracked her down in her test lab. She walked me through the steps she was taking. That’s when I figured the problem must be related to the data in her test database. It was missing some values.

This should have been no problem. I got busy writing a PL/SQL script to correct the data in her test database. I just needed to insert a bunch of rows into one table. At first I thought I wanted a bunch of nested loops to control the values I inserted into each of the columns. However they were character data, and I could not figure out how to loop through the char values. Shame on me. Time was ticking away on this high priority problem.

In the end, I decided to get down to business. I wrote a small nested helper procedure to do the dirty work. I hard coded some values in it. Then I hard coded a bunch of calls to the helper procedure. It was not clean, but it got the job done. I made sure to order the parameters to the procedure so that calls to it at least look good in the source code. I did use one numeric loop to control some of the calls to the local procedure. But I needed to format and convert the numbers to text before making the call.
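
The shape of the script was roughly the following (a sketch; the table and values are made up):

DECLARE
  PROCEDURE add_row (p_code IN VARCHAR2, p_label IN VARCHAR2) IS
  BEGIN
    INSERT INTO lookup_data (code, label)  -- hypothetical table
    VALUES (p_code, p_label);
  END add_row;
BEGIN
  add_row('A', 'Alpha');
  add_row('B', 'Bravo');
  -- one numeric loop, formatting the number as text for the call
  FOR i IN 1 .. 9 LOOP
    add_row('N' || TO_CHAR(i), 'Value ' || TO_CHAR(i));
  END LOOP;
  COMMIT;
END;
/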

While unit testing I found that my script generated some exceptions. I used the tried and true method of commenting everything out until I knew what the problem was. More unit testing showed that I got some of the values wrong. I fixed these. Then I also wrote some SQL for our own testers to run in order to recreate the problem in their database. I think I did a pretty good job with this trouble ticket. From the moment I got the call to work this problem, it was well under 2 hours before a solution was delivered to test.

Missing Boilerplate Text

A security audit of our system found that not all generated reports had the correct legal notices on them. So our reports developer added the boilerplate text to all reports that did not have it. The developer tested the reports using Oracle Report Builder. The text showed up fine. So the modified RDF files were checked in for our latest build to the customer. A smoke test was performed when the build was done. The developer did not find the text when the reports were run with the updated application. The developer asked me for some help. Apparently management had promised this would be delivered tomorrow. Shipping it broken was not going to be acceptable.

I usually like to know exactly what is going on. So I had the developer run the application and print out all the reports that got modified. It seemed like half of them did contain the new required text. The other half had the text missing. Then we ran those problem reports in Report Builder. Sure enough the text showed up there. I made sure we were looking at the same RDF files. We were. Then I thought perhaps it had something to do with the parameters being passed to the report. The developer checked the database (we log all parameter values). This was not the cause. I was at a loss, and had to get back to my normal job.

An hour later, the reports developer came running in. Here was the problem. The reports with the missing text get printed in landscape mode. However the Report Builder tool was displaying them in portrait orientation. In essence, the boilerplate text was located too far down on the page to be seen when it was printed out. Once this was discovered, it was a trivial task to move the text to where it would be seen on all pages, given that they were printed in landscape mode. I am sure glad that I did not have to determine the cause of this problem. Who would have thought this was the case?

Report Builder

I got assigned a trouble ticket that stated the values in a report were wrong. At first another developer thought there was something wrong with my application. I tested it out and found it writing correct values to the database. Being a thorough guy, I decided to dig into what the report was doing. We have a full time reports person here on staff. However she was nowhere to be found when I got to this problem.

To make things run fast, I always develop on a virtual machine these days. Unfortunately I still do not have all my tools installed on that machine. So I decided to install Oracle Report Builder on it. This is an old version (6i). I did not even bother to apply any patches. I found the installation process to be very painful. I was using a remote connection to the virtual desktop, and the screen kept spending a lot of time repainting. I guess they did not take remote access into account way back when they were writing this Oracle installer.

Here is a funny thing I did see once the screen stopped repainting. There was a lot of Oracle propaganda that I had to endure during the install process. I saw a message that I was going to be experiencing “High Quality Reports”. That’s a good thing. My customers don’t like low quality reports. Then I read that I would be able to explore all my data. I certainly hope that would be the case. I was also informed that I was embarking upon a solution with “Unlimited Data Formatting”.

Things got even crazier the further I got into the install process. I got one message that just said “Oracle Reports!” Apparently you cannot emphasize this product enough. Then came the last straw, a message stating this product was “integrated with Oracle’s spectrum of business intelligence tools”. OK. If they have to go this far to promote the product, I am starting to worry that it is going to turn out to be a total dud.

Luckily I have worked with this version of Oracle Report Builder before. And it has some issues. Some of them get fixed with one of the patches for the product. I found it to be better than some of the early Oracle 10g report products that seemed to be rushed out the door with a lot of bugs. After I finally installed the program, I was able to determine what the problem was. The report was looking at the wrong field from the data. This is a simple enough fix.

Database Upgrade

This year our project is finally upgrading the database to Oracle 10g. Our DBAs initially decided to go with version 10.2.0.3 for the back end. However we have found some problems during testing. The database is automatically gathering statistics at inopportune times, which is causing some performance issues. I would think you could control this with some parameter or scheduler setting.
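
If memory serves, the 10g automatic statistics collection runs as a scheduler job, so something like this (run as a suitably privileged user) should keep it from firing at inopportune times:

-- GATHER_STATS_JOB is the stock 10g job name; check DBA_SCHEDULER_JOBS first
BEGIN
  DBMS_SCHEDULER.DISABLE('SYS.GATHER_STATS_JOB');
END;
/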

There are some other problems with bulk loading of data. Our DBAs have been in contact with Oracle Corporation, and they have identified the issues as bugs. The resolution is to go to version 10.2.0.4. That sounds fine and dandy. However we now have to go through our performance testing again. I hope this patch resolves these problems. And I pray that no new problems are introduced with the patches.

You have to keep up with current versions of software. I understand this. Being a software developer myself, I know our users need upgrades all the time. However it hurts being the consumer of some products that need upgrading. There is pain involved with the upgrade itself. Part of this pain is dealing with things that are broken. You also do not want to stay on the bleeding edge. Otherwise you might bleed yourself with the results.

I would like to think that Oracle 10g is stable enough at this point. Don’t they already have an Oracle 11g out there now?

Password Reset Script

The project I work on implements a number of security rules to match our client’s policies. One of them is that you cannot reuse a password unless you have chosen a number of intermediate passwords in between. I won’t go into the exact number of different passwords you must choose before being able to reuse one. Let’s just assume it is 50 intermediate passwords before reusing one.

I have a lot of database accounts in the development and test environments. It is a challenge to remember the passwords for all of them. So I chose a password format that makes it easy for me to remember them. This plan gets foiled every time I am forced to change my password for whatever reason. At that point I cannot change my password back to the well known format. I must go through a lot of different passwords before reusing the one I know.

Since I am a developer, I figure I can write a script to help me with this problem. I loop through 50 iterations. I use an Oracle sequence to generate a unique value for each throwaway password. Then I issue an ALTER USER in the loop to change the password. At the end, I change the password back to the one I want to reuse. I bang out this code really quickly. I am pleased when the script executes to completion. However, the script fails when it tries to reuse my password. I get an ORA-28007: “the password cannot be reused” error.
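
The script looked roughly like this (a sketch; the user name, sequence, and passwords are placeholders):

DECLARE
  v_seq NUMBER;
BEGIN
  FOR i IN 1 .. 50 LOOP
    SELECT pwd_seq.NEXTVAL INTO v_seq FROM DUAL;  -- hypothetical sequence
    EXECUTE IMMEDIATE 'ALTER USER my_user IDENTIFIED BY temp_' || v_seq;
  END LOOP;
  -- now try to go back to the password I actually want
  EXECUTE IMMEDIATE 'ALTER USER my_user IDENTIFIED BY my_old_password';
END;
/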

This perplexes me for a while. I am successfully changing my password to 50 unique values, so I should be able to reuse the old one. I guess that maybe somebody has messed up the profile which restricts password reuse. No. The profile still says I can reuse a password after 50 different ones have been used. Then I see the problem. In addition to the PASSWORD_REUSE_MAX parameter that makes me choose 50 intermediate passwords, there is also a PASSWORD_REUSE_TIME parameter that prevents me from reusing the same password too soon. I had forgotten about that rule.

A true hacker does not give up so easily. I was trying to emulate a user switching the password a bunch of times. That was the wrong approach. Instead I need to switch my profile to a different one, such as the DEFAULT. Then I can bypass all the password rules. This feels wrong. But you have to do what you have to do. These are development accounts anyway.

Oracle Instant Client

Installing the Oracle client can cause a lot of headaches. As the author of the Software Maintenance blog can attest, the Oracle client is sometimes very tricky. You need the client to access an Oracle database. Historically Oracle has provided a fat client. However there is now an Instant Client that simplifies matters greatly.

The Instant Client is free. It has support for OCI, OCCI, and JDBC connectivity. There are versions of the Instant Client for Microsoft Windows, Mac, Solaris, Linux, and many other operating systems. SQL*Plus is available, but it comes packaged separately.

The great thing about the Instant Client is that you just have to unzip it. There is no Oracle Universal Installer to configure it. Just point your ORACLE_HOME to the directory where you have unzipped the files. The Instant Client is the ultimate thin client for Oracle connectivity. I might try to get our project to use this. We have our own set of issues with the fat Oracle 10g client.

Oracle Open World

I saw an advertisement in a magazine recently for Oracle Open World. It is being held in San Francisco from September 21 through 25. It has big sponsors such as Intel and HP. It also has smaller sponsors like Computer Sciences Corporation. However there was one problem with the ad. I was not exactly sure what this conference was about.

Sure it is somehow revolving around Oracle. But what does the “open” refer to? Is it open source? I doubt it. I went to the web site and found the conference covered applications, databases, middleware, and industries. This did not narrow the field down much. I also saw some information that I would be able to connect with both experts and peers at the conference. Unfortunately this is still pretty generic.

I did get the impression that this conference was not specific to Oracle developers. I really wish they had a buzz line for the conference. Something like “Oracle Open World – Running Oracle on Linux System” would be helpful.

Oracle has hopped onto the environmental trend by having a Green Program within the conference. They are going to review green components, if that makes any sense. There shall be a review of green practices in general. Their motto is to rethink, reduce, reuse, and recycle. At this point I cannot tell for sure. However I suspect this is really just a lot of lip service to capitalize on the popularity of the environmental movement.

The cost for Oracle Open World is between $2000 and $2500, depending on how early you register. I have an interest in the Oracle database. Our customer has a huge amount of data stored in one. Some of our code is PL/SQL stored procedures. I might be a good candidate to attend this conference. But the promoters have not done a good job of telling me what exactly this conference is about. So they have lost at least one customer.

Trigger Problems

I read this month’s Tom Kyte column in Oracle magazine with interest. He focused on his strong recommendation for PL/SQL developers to skip triggers. This is because triggers cause maintenance problems. They contain side effects. And people usually forget about the logic contained in triggers when reviewing system behavior. Tom believes that triggers are almost always implemented incorrectly. They make systems hard to understand.

One of Tom’s main cautions is to not perform operations in a trigger that cannot be rolled back. For example, pretty much all of the UTL_FILE package operations cannot be rolled back. So if you have them in a trigger, you are waiting for a problem to happen. Tom says that the Oracle mutating table constraint is there to protect programmers against themselves. Tom believes that a trigger which enforces referential integrity is very suspicious.
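
As I understand it, this is the kind of trigger he is warning about (a sketch with made-up names). If the INSERT later rolls back, the inserted row disappears, but the line written to the file stays:

CREATE OR REPLACE TRIGGER orders_ai
AFTER INSERT ON orders  -- hypothetical table
FOR EACH ROW
DECLARE
  v_file UTL_FILE.FILE_TYPE;
BEGIN
  -- this file write cannot be rolled back with the transaction
  v_file := UTL_FILE.FOPEN('LOG_DIR', 'orders.log', 'a');
  UTL_FILE.PUT_LINE(v_file, 'order ' || :NEW.order_id);
  UTL_FILE.FCLOSE(v_file);
END;
/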

The system I work on has a lot of triggers in it. This alone should not be reason for caution. However when a respected figure such as Tom Kyte warns you about a certain database technology, I figure we should take heed. Most of the triggers in our system perform some type of logging or auditing. However there are some triggers which propagate copies of data to tables which are not normalized. And we have other triggers which keep meta data (like counts) up to date.

Luckily all of the operations performed in our triggers can be rolled back. We do not normally fool around with autonomous transactions in our triggers. Nor do we perform file operations that also cannot be rolled back. In essence, you do not have to study the triggers to figure out what most of our system does.

SQL Developer Update

I was reading the latest Oracle magazine. Sue Harper’s column discussed version 1.5.1 of Oracle SQL Developer. She states that this version adds a number of features such as:

* Export utilities
* Generation of HTML documentation from a schema
* The ability to copy objects between schemas

As you probably already know, SQL Developer is a graphical tool from Oracle for database development. It allows you to browse database objects, run SQL, and debug PL/SQL among other things. It requires an Oracle database version of 9.2.0.1 or greater. There are versions of this application for Microsoft Windows, Linux, and the Mac. It is a free tool.

Version 1.5 of SQL Developer added a number of new features to the product:

* Store code templates for reuse
* Refactor the code
* View Java code
* Perform schema difference
* Import Excel spreadsheet to create tables
* Open trace files for performance tuning
* Run a report to monitor sessions
* Use CVS and Subversion source code control

Version 1.5.1 is a patch. It is available for Windows XP and Vista. It ships with and requires JDK version 1.5. This patch makes SQL Developer support newer versions of Sybase.

I have to say that I already use an old version of PL/SQL Developer from Allround Automations, which is a different product from Oracle’s SQL Developer. And for some tasks I actually use Oracle SQL*Plus. My first impression of Oracle’s SQL Developer was that it did not feel natural to use. However I have heard some good things on the street about the product. When I decide to upgrade to a new tool, I will definitely give this one another look, since it comes from Oracle and is free.

Bad Where Clause

A developer dropped some tables in their schema. They did this manually, outside of our application suite. The applications expect the tables to exist. The result is that our applications blow up when you select certain data. I traced this problem back to the developer who actually dropped the tables. I decided to write a script to find all occurrences of the problem. It was supposed to correct the application data based on the absence of the database table. I did a couple quick tests and all looked good. However I am getting into the practice of writing unit test scripts to fully verify my code. I was surprised to find that my script was actually deleting a lot more data than it should.

I stared at the script for a long time. And I kept running through my unit tests again and again. But the script kept eliminating all the application data, good and bad alike. This was troubling. As a software engineer, logic problems like this should be trivial for me. It did not help that we had a lot of distractions at the office. However I am trained to be able to work in any conditions and produce correct code. So I just added some restrictions to the WHERE clause in my script. Then the script did not delete all the application data. However I still did not understand what was going on. The only thing I could do was go over the code line by line until I saw the light.

This was the pseudo code for the script:

Cursor to loop through application records
    Check if table exists
    If table does not exist
        Delete application data

It does not get much simpler than this. Then I finally did spot the error. The WHERE clause in my check was like WHERE column = table.column. It should have been WHERE column = cursor.column. At that point I felt pretty stupid. I spent the whole afternoon looking for that one. I write about it here to warn you, in the rare case that you write or stumble upon such a problem in your own journey.
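
Reconstructed with made-up names, the trap looks like this. Comparing a column to itself is always true, so every row qualifies:

DECLARE
  CURSOR c_app IS
    SELECT table_name FROM app_data;  -- hypothetical application table
  v_count NUMBER;
BEGIN
  FOR rec IN c_app LOOP
    SELECT COUNT(*) INTO v_count
      FROM user_tables
     WHERE table_name = rec.table_name;
    IF v_count = 0 THEN
      -- The bug was the equivalent of "WHERE table_name = table_name",
      -- which matches every row. The cursor record must be referenced:
      DELETE FROM app_data
       WHERE table_name = rec.table_name;
    END IF;
  END LOOP;
END;
/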

Pro*C to Stored Proc

Our development team is currently coding some new features into the system. A senior developer got assigned a couple tasks to complete. I performed the design for one of these tasks. And I provided the design to the developer. I purposefully left the database technology out of the design. This was because I figured it would be easiest for the developer to use a technology they were familiar with. The task got broken up and assigned to one other developer as well. The first guy decided to use Pro*C to access the database. He figured this was the easiest for him.

The application being modified already uses Pro*C for some of its database access. So a decision to use Pro*C did not seem unusual, and I knew that this particular developer was familiar with it. However the other developer requested that the common code be put in a PL/SQL stored procedure. That way it could be written once, and the stored procedure could be called from the Pro*C function. Since this second guy was the team lead, he made the final call. The first developer expressed reservations about the decision. As a result, the coding task got reassigned to me.

Personally I love writing code in PL/SQL. And I figured it was no big deal to call a stored procedure in Pro*C. You just do something like this:

EXEC SQL EXECUTE
  BEGIN
    pkg.proc(:var);
  END;
END-EXEC;

There were other examples of this pattern in the code for the application being modified. So I coded in those hooks. I also wrote the common PL/SQL back end stored procedure. The procedure was actually quite easy to code. I knew we made the right decision when the database table being accessed got some column name changes. I just modified the stored procedure and everything worked fine.
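
The PL/SQL side is just a packaged procedure along these lines (a sketch; the table and column are made up):

CREATE OR REPLACE PACKAGE pkg IS
  PROCEDURE proc (p_value IN VARCHAR2);
END pkg;
/

CREATE OR REPLACE PACKAGE BODY pkg IS
  PROCEDURE proc (p_value IN VARCHAR2) IS
  BEGIN
    -- all table access lives here, so a column rename only touches this spot
    UPDATE app_table  -- hypothetical table
       SET some_column = p_value;
  END proc;
END pkg;
/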

Maybe the original developer just did not want to agree with our team lead. I normally do not take that stance. It is fun to write code. A good developer should be able to write the code in any language and with any technology. Those details are usually not important. Calling a packaged stored procedure from Pro*C is also easy. I think I might give the link to this blog post to the original developer. He might learn a thing or two. And I am not talking about Pro*C syntax.

Real DBAs

We have two types of DBAs on our project. There is the DBA who works exclusively on our team, but has limited admin rights. Then there are the computer center DBAs that run the show. These latter folks are the security police for our system.

There are benefits and drawbacks to this two tiered DBA setup. Right now we are suffering from some of the drawbacks. I have written about this on my Software Maintenance Blog with a post entitled Database Lockdown.

I encourage you to check it out. It is most relevant here, not only because we run an Oracle 10g database on the back end, but also because it might affect your work.

Select From Dual

I am designing some changes to the application suite my project supports. Some of the requirements call for new tables in the database schema. So I asked the DBAs to add them. Our project has some Oracle consultants who review the database change requests. One of my new tables got combined with some other needs into a single global table. The consultants had a lot of ideas on this new table and its access. That sounded good to me. One consultant wrote a PL/SQL package that managed the table, including things such as auditing.

Before I went on vacation, I wanted to make sure the development staff had what they needed to implement the code changes. So I tested the PL/SQL package written by the consultant. I logged into SQL*Plus. There were a bunch of functions exported by the PL/SQL package. I executed a SELECT FROM DUAL for one of the getter functions. It worked fine. Then I tried a SELECT FROM DUAL on one of the functions that modified the underlying database table. The function returned an error. So I pushed it back to the consultant because I thought maybe he had a bug.

The consultant came back to me with an answer. The function was trapping an ORA-14551: cannot perform a DML operation inside a query. Of course that made sense. A SELECT FROM DUAL cannot change a value in the database. Why didn’t I think of that? The way to test this stuff is to write some PL/SQL code and call the functions from there. That is how the developers are going to have to implement it anyway. The consultant also cautioned me against testing this directly from SELECTs in SQL*Plus because the functions are autonomous transactions. Apparently things get screwy when you call autonomous transactions like that.
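
In SQL*Plus terms, the difference looks like this (a sketch; the package and function names are made up):

-- Fine: a getter that only reads
SELECT global_pkg.get_value('SOME_KEY') FROM DUAL;

-- Fails with ORA-14551: the function performs DML
SELECT global_pkg.set_value('SOME_KEY', 'NEW') FROM DUAL;

-- Exercising the setter from PL/SQL instead works
DECLARE
  v_status VARCHAR2(30);
BEGIN
  v_status := global_pkg.set_value('SOME_KEY', 'NEW');
END;
/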

The bottom line is that you cannot issue a SELECT FROM DUAL on a function that modifies the database. There is a rule in Oracle that prohibits this. I have actually run into it before. However, since I did not write the implementation of this package, I did not see what was going on under the hood. Developers beware.

Recycle Bin

Next year we are moving to Oracle 10g on the back end. Currently production is using an Oracle 9i database. Our development database has already been upgraded to 10g so we can be ready for next year. Developers were curious to find a bunch of tables in our schemas with names like "BIN$TktvMTcFRA3gRAADunAMgQ==$0". However, we were unable to drop these tables. A little research revealed that this is a new Oracle 10g feature called the Recycle Bin.

I want to give a shout out to Natalka Roshak from Oracle FAQ. Her article explained the mysteries of the Recycle Bin to me. Apparently, with the right 10g option set, Oracle does not actually drop a table when you issue the DROP command. Instead it renames the table so you can restore it easily in the future. A little research shows that a dropped table does disappear from the USER_TABLES view. However it still exists in the USER_OBJECTS view with a type of "TABLE".

Our project actually wants to truly drop the table when we issue a drop. So we will have our DBA team disable the Recycle Bin option. However for now we have resorted to issuing the following command to drop a specific Recycle Bin table:

PURGE TABLE "BIN$TktvMTcFRA3gRAADunAMgQ==$0";

In my case, I wanted to get rid of all Recycle Bin tables in my schema. I had amassed 313 of them (which were taking up space). So I got rid of them all with the following command:

PURGE RECYCLEBIN;
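
For the record, the Recycle Bin can also be inspected, and a dropped table restored instead of purged:

-- See what is sitting in the recycle bin
SELECT object_name, original_name FROM recyclebin;

-- Restore a dropped table instead of purging it
FLASHBACK TABLE my_table TO BEFORE DROP;  -- hypothetical table name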

The really funny thing about this is that some experienced developers who were not familiar with this Oracle 10g feature were sure these were not real tables. This included dudes who used to work for Oracle. I do not know about anybody else, but I think it is time for my company to send me to Oracle 10g training. Don’t you?

Lost Connection

Our testing team reported that all applications would error out after 30 minutes of non-use. They got an error message stating the connection had been lost. Initially I picked up this problem and researched it. Most people immediately discounted this as a database server issue. However I had some evidence that this might not be so. This year we moved from the Oracle 8 client to the Oracle 10g client. Thread based connections seemed to behave differently in 10g. I had found some instances where you needed to do explicit connection management between threads using database contexts. If not, I recall getting a lot of connection lost error messages.

I was able to duplicate the problem easily enough. Just log in and wait 30 minutes without doing anything. Sure enough, I made the problem occur. However, after reviewing the code involved in just logging in, I determined this was not a thread coding issue. I hit Google to see if I could find any knowledge on the subject. Most of the web sites I visited recommended modifying the timeout parameter in the server’s sqlnet.ora file. Armed with this information, I forwarded the problem to the DBA Team.
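
If I remember right, the setting those articles point at is the dead connection detection parameter in sqlnet.ora, something like:

# Probe idle client connections every 10 minutes
SQLNET.EXPIRE_TIME = 10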

The next time I saw the lead DBA, I asked him about our timeout problem. He said that he had encountered the problem as well. It was happening even in SQL*Plus. I told him I had read a lot of articles about the timeout value in sqlnet.ora. However the DBA insisted that the timeout was not set on our server. Instead he believed this to be a network issue. So he forwarded the problem to the network administration team. This did not immediately resolve the problem, but that was due to the problem not receiving the priority it required. A couple of calls later from the DBA, and it seems the problems have been resolved.

What I took away from this problem is that you cannot just trust your instinct with any given problem. Nor can you entirely defer to received Oracle wisdom. You need to dig into the specifics of the problem and methodically identify the true cause. This is not to say that you should discount any hunch you have. However you need to treat the hunches as theories, and test those theories out. Agent Mulder said it best when he recommended you trust no one.

Report Server Removal

Recently I had some problems with Oracle Reports working on my machine. So I decided to uninstall and reinstall Oracle Reports Server 6i. There is no option to remove the Reports Server from Windows Add/Remove Programs. So I ran the Oracle Installer that came with Oracle Forms and Reports. It is nice that Oracle provides an installer that can uninstall its software. However I found this program a bit confusing and ineffective. The installer lists all installed software (that the installer can remove). So I needed to select the software I wanted removed. I selected all of it, assuming that this would cover all of the Reports Server. This is where my problems began.

The Oracle installer produced a number of error messages on my machine. Some messages seemed clear enough. For example, a service was running that needed to be manually shut down before Oracle could proceed with the removal. Too bad the error message did not tell me how to shut down the service. I tried going to Windows Services. That did not seem to resolve the issue. Then there were other cryptic error messages. They seemed to imply that the selected software was either uninstalled or in a bad state. That did not help me much.

When I had gone as far as I could, I was hoping the Reports Server was gone. No such luck. There was still a directory where the Reports Server had been installed. A lot of the files and subdirectories had been removed. But many more remained. I tried to delete the whole directory with Windows Explorer. This did not work. Some program or service was locking the executables in the bin directory. I resorted to Google for help. It turns out you can rename the binaries in the bin directory; you just cannot delete them. I wanted to reboot Windows in safe mode and then delete the pesky files. However I was in a hurry, so I skipped it.

Another place where I saw remnants of the Reports Server after uninstalling was the Windows registry. The data is stored under HKLM\SOFTWARE\Oracle. I continued to see an Oracle home entry for the Reports Server. Well, at least I have full control over the registry with the Windows Regedit program. The only danger is that you need to know what you are doing, or else you could make matters worse by hacking the registry. I just deleted the Oracle home registry entry (mine was HOME1). I also decided to remove ORACLE_HOMES\ORACLE1. And I figured that I should decrement the HOME_COUNTER in ALL_HOMES. I really do not recommend this level of registry hacking. But I was desperate.

Finally I got to reinstalling the Reports Server. At first I brought the software up to patch 3b, which mimics our Production configuration. But I later found I needed patch 18 for everything to work correctly. What a nightmare. I sincerely hope the newer versions of Reports Server work better during removal.

Count or Continue

At work we have a large system coded mainly in the C++ programming language. The back end is an Oracle 10g database. Users have numerous requirements for configurable reports. We implement these using Oracle Reports. Frequently the application will create temporary tables in the user’s schema. The table names are passed to the report. This allows long running reports to essentially work on snapshots of the data. It also simplifies some of the data gathering performed in the actual reports.

Each time a new report is selected, the application typically deletes the prior temporary tables. The application then recreates the temporary tables based on user selections and current data in the database. This usually works fine, since the same users normally run the same reports over and over. The only tricky part is that sometimes a user will choose a report for the first time. In that scenario, there is no existing temporary table to delete.

My instinct for this scenario is to first detect whether the table exists. This can be a query against the USER_TABLES view. If the count is not zero, the code executes SQL to drop the old table. I was surprised to find another technique used in some code I recently debugged. It always attempted to drop the old table, even if it did not exist. It preceded this code with an EXEC SQL WHENEVER SQLERROR DO CONTINUE.
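
In PL/SQL terms, the count-then-drop approach looks something like this (a sketch; the temporary table name is a placeholder):

DECLARE
  v_count NUMBER;
BEGIN
  SELECT COUNT(*) INTO v_count
    FROM user_tables
   WHERE table_name = 'TMP_REPORT_DATA';  -- hypothetical temp table
  IF v_count > 0 THEN
    EXECUTE IMMEDIATE 'DROP TABLE tmp_report_data';
  END IF;
END;
/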

Logically, the ignore errors method works just as well as the count with an optional drop of the table. I started to wonder which way was better, though. One would think the ignore errors method was better for the normal case. Only one SQL statement is issued, so I would imagine it runs faster. But maybe there is a heavy penalty for trying to drop a table that does not exist. In the end I left the code alone, ignoring errors and always trying to drop the table that might not even exist.

This whole discovery got started by a tester finding a scenario where the DROP failed and then a following CREATE TABLE failed as well. But that story is a subject for a future post.

Password Policy

Our system has stringent requirements for database passwords. One of the many rules is that old passwords cannot be reused for a very long time. This includes both a long number of days, and a large number of different passwords in between reuses. Developers and testers frequently have a lot of different accounts to conduct testing. The password policy makes it difficult to keep track of all the different passwords that have to keep changing.

Recently a tester asked a DBA to reset her passwords back to the ones she normally uses. The DBA attempted to do this, but was halted by the implementation of the security standards. It was too soon to reuse the old password. So the DBA chose a new password for the tester. Later I spoke with the DBA about this problem. We both knew that it was the database profile that tied the user to the password verify function preventing quick password reuse. I said we could just temporarily switch the user to a different database profile. Then the password rules would not be enforced, and Oracle would not recheck the password when the original database profile was restored. This seemed like the ultimate plan to skirt the database password rules. In fact, I coded up a script to do this for all my database accounts.
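
The trick itself is only a few statements per account (a sketch; the user, password, and profile names are made up):

ALTER USER my_user PROFILE default;             -- no password verify function
ALTER USER my_user IDENTIFIED BY old_password;  -- reuse is not checked now
ALTER USER my_user PROFILE secure_profile;      -- put the real profile back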

The DBA I spoke with knew about the profile swap trick. But he was hesitant to do it for the testers. He likes to do things by the book, and the trick would be violating the security policy. I told him he was right. However, we are just talking about a test environment. In reality, it is impossible to keep track of all the changing passwords for the multitude of accounts that we require. So people write down their passwords. The security rules get violated anyway. But at least the system was set up to discourage illegal password reuse.

I think the DBA has the right attitude. If you are always serious about security, you will most likely stay safe as well. Skirting security rules is a slippery slope. There are times when the urgency of the moment requires some rules to be bent. But that decision should not be taken lightly. Still, I have not thrown out my script which resets my passwords by first switching the profiles. That way I am able to remember all my well known passwords and not write them down. That is my compromise.

Tnsnames File

Our database administrators manage a number of databases and schemas for our project. There are a number of production instances, different schemas for development, and separate databases for testing. These counts get multiplied by the different environments required for multiple versions of our system. All of these machines and instances are managed through one common "tnsnames.ora" file. The DBAs put this file out on the network in a globally accessible location. All non-production users point to this file using the TNS_ADMIN environment variable. Normally this arrangement works well. However, once in a while somebody messes up the tnsnames file. The result is a lot of pain.

Just this past week a couple of developers encountered this sort of problem. We have been modifying the design used to print reports from our application suite. Previously we fed reports into an Oracle 6i Reports Server installed on the workstations. After moving to an Oracle 10g database on the back end, we detected some problems running the reports through the Reports Server. So we had a developer code up our own report server replacement using Visual Basic. This workaround seemed to run pretty well. We just needed the application to call the Visual Basic app instead of the Reports Server. But then this fix just stopped working. In fact, none of the reports seemed to be working correctly.

A couple of us developers spent a few days looking into the problem. I had been uninstalling and reinstalling Oracle software like the Reports Server and Reports Builder. None of this worked. After a while somebody realized that the tnsnames entry for reports had been removed from the tnsnames file. It was a relief for the development team, though we were still angry that our time was wasted. Luckily we install a local file on all our Production user workstations. The senior DBA decided it was finally time to lock down access to the tnsnames file. The trouble is that development needs access to the file in order to connect to all the databases we use.

So the senior DBA decided to "hide" the tnsnames file on the server. I tested a couple of applications that use the file for connectivity. So did the senior DBA. We did not notice any problems. However, one of the reports developers started seeing some weird problems in reports. She recalled our tnsnames debacle and started by researching the tnsnames file. She panicked when she could not locate the tnsnames file on the network server (remember, it had been hidden). I told her what the senior DBA had done. As of today I have not heard back whether she has resolved her reports problems. Maybe it is related to the hiding of the tnsnames file.

There must be a better way of providing access to the tnsnames file for connectivity while protecting it from accidental modification. Any ideas?

Tkprof to the Rescue

The DBA Manager asked me to give him the SQL for a certain operation in our application. Luckily I knew a lot about it. So I dug into our requirements, looked at the code, and explained what the query did. Then I broke down a simple example of which SQL statements were executed by the application. I was proud of myself.

Then I got another request. The DBA Manager wanted all the SQL for a given report. This was a bit more complicated. The application creates a temporary table. Then it calls an Oracle report which is just a driver for another Oracle report. I knew the type of information this setup retrieved. But it would take a while to extract the SQL. And right now I don't have a lot of time.

The DBA Manager got back to me and told me to run a trace and get the SQL quickly that way. This was easier said than done. I had some kind of bug in my Oracle Report Builder, such that I could not compile in the commands to enable session level tracing. And my database account did not have permission to enable session level tracing either. I got some help from a DBA and an Oracle Reports developer. So I had Oracle generate a trace for a session in which I only ran the report.
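
For reference, turning the trace on is a one liner once you have the privilege. Something like:

ALTER SESSION SET sql_trace = TRUE;

-- or, through the supplied package (works from SQL*Plus with EXEC):
EXEC DBMS_SESSION.SET_SQL_TRACE(TRUE);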

Unfortunately I did not have access to the machine that hosted the Oracle database. But a DBA hooked me up with the trace file. This was the first time I had done this. I studied up on the tkprof command, but ended up executing it with all the default options. I was amazed at how simple it was to let the database capture all the SQL. But my instinct told me some SQL from the report was missing. The DBA Manager told me I could just give him the tkprof output. He would take it from there.
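
With the defaults, the command is about as simple as it gets (the trace and output file names here are made up):

tkprof mydb_ora_12345.trc report_sql.txt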

I think I need to get a little more practice running traces. And I certainly need to get my Oracle Reports installation fixed so I can make reports changes. Perhaps I can also obtain access to the UNIX box where we run our main Oracle database. Then I could run traces on my own and not need a DBA to get me the trace file each time. I have the feeling that tkprof and I are going to get along very well.

Full Text Error Messages

When programming with Pro*C, you can retrieve up to the first 70 characters of any Oracle error messages using the SQLCA structure. However there are times when you need the full text of the error message. My project had run into this need last year. For some reason the developers could not figure out how to do this. I wish they had read the Oracle Pro*C Programmer's Guide. It explains exactly how to do this. You need to call the sqlglm() function.

I could go through how you can use the sqlglm function. However I recommend you just consult the Oracle Pro*C Programmer's Guide. Isn't it like a developer to not read the manual? There is no excuse for this. Oracle provides the documentation free of charge on the web.

This reminds me of another huge problem I encountered this year with multi threaded applications having trouble managing multiple database contexts. A lot of high powered developers thought up many theories about what our problems were. And there were a lot of guesses on how we could solve the problem. None of the guesses panned out. Luckily I read up in the Pro*C Programmer's Guide and employed the Pro*C functions to explicitly manage database contexts for threads. Problem solved. Maybe I will give an example of how I did this in the future. The Pro*C Programmer's Guide is a good start. But it lacks some detailed examples.
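
Since I am complaining about the lack of examples, here is the basic shape of an sqlglm() call, along the lines of the error handler in the guide (a sketch):

#include <stdio.h>    /* printf */
#include <stddef.h>   /* size_t */

void sql_error()
{
    char msg[512];
    size_t buf_len = sizeof(msg);
    size_t msg_len;

    /* sqlglm() fills msg with up to buf_len bytes of the full error text */
    sqlglm((unsigned char *) msg, &buf_len, &msg_len);
    printf("%.*s\n", (int) msg_len, msg);
}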

Oracle Reports with 10g

My team is working to upgrade to Oracle 10g. On the back end, we are moving from an Oracle 9i to an Oracle 10g database. And we are upgrading from the ancient Oracle 8 client to the Oracle 10g client on the workstations. So I am busy modifying our Visual C++ projects to work with the new client. So far I am attending to the easy work, like compiling our Pro*C code with the new client. But I think we are going to be chasing down a lot of problems when we do run time testing.

Currently one developer on our team is assigned the reports. We are sticking with the Oracle 6i Report Builder. We are also leaving the Oracle 6i Report Server on the workstations. Most of our reports just run with the Oracle 6i run time client. These seem to be working out OK. However, a few reports go through the Oracle Report Server on the workstation. These are giving our developer a lot of trouble when run.

Specifically, our reports developer compiles the reports against the 10g database using the Oracle 6i Report Builder. The reports run fine from Report Builder. However the Oracle Report Server complains when it tries to run the report. It always states that the report contains uncompiled PL/SQL. This has been driving the reports developer nuts. We have tried looking into patches to the Report Builder. But I think the problem is actually in the Report Server. I am not sure if Oracle is going to be any help, since these are old products.

The development team is hoping that the kinks get worked out. We want to start using more recent products from Oracle. It is no fun being stuck in the past using outdated and unsupported products in our system. I will keep you posted on our progress. We have a big test scheduled with the customer in a month and a half. Good luck to us.

TNS Problems

Our production users were getting Oracle errors that had to do with TNS problems. I passed these on to our DBA Team. However I did some research on them and thought I would share some findings.

(1) ORA-12203 "TNS: unable to connect to destination"
  • Potential cause is that the destination is not listening
  • It may also be due to network problems
  • You can issue a "lsnrctl status" at the host to see if the listener is running
  • This may be related to the entry in the tnsnames.ora file

(2) ORA-12535 "TNS: operation timed out"
  • You should try tnsping first to start the diagnosis
  • Look at the listener log on the host for more clues
  • Check the sqlnet log files on the client side
  • Run a statspack report, and check Top 5 timed events

Luckily the problem did not last long in production. The trouble ticket itself got closed out. But our DBA Team has an action item to determine why the problem happened, and prevent this from happening in the future.

Order in an Update Statement

I got the most unusual e-mail from a coworker today. She was trying to research a very difficult software bug in our system. The trouble was that she could not reproduce the problem. This happens to the best of us from time to time. However, she was now on the hunt for things in the code which might be the cause of the problem. This is a very dangerous journey for even the best of software engineers. Her analysis, and her e-mail to me, expressed concern over a SQL statement like this:

UPDATE tablename
SET column1=500, column2=column1;

Her plan was to reorder the two assignments in the SET clause. The reasoning was that maybe Oracle would sometimes assign column1 its new value of 500, and then assign column2 that same new value. Now this sounded quite preposterous to me. I told her that the switch in order would never make a material change in the results. I tried to be patient. I explained that Oracle always takes the values prior to the update, and uses those whenever an assignment references one of the columns in the table. She thought she could do some Googling to find an example where somebody ran into this problem. She had already searched our code and found places where the order in the SET clause was switched around.

I told the coworker that she could try to ship out the code change. It would make no difference in the results. And we would look a little funny if we tried to pass that off as the solution to the problem. At the very least I would have hoped one of our DBAs would protest. Heck, I determined I should protest a little louder. Nothing like a little education to make a good impression, though. So I looked up the Oracle documentation on the UPDATE statement. We are using Oracle 9i on the back end, so I found what I needed in the PL/SQL programmer's reference for Oracle 9i. It defines columns on the right hand side of such UPDATE statements as referring to the old values of the columns in the row. Now I guess she could make the argument that Oracle states this is the case but is not following through. But that claim is about as likely as saying that sometimes Oracle just forgets to follow through on the whole update. I was pleased when I got a message back thanking me for the information.
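
A quick sanity check bears this out (a sketch with a throwaway table):

CREATE TABLE t (column1 NUMBER, column2 NUMBER);
INSERT INTO t VALUES (1, 2);

UPDATE t SET column1 = 500, column2 = column1;

SELECT * FROM t;
-- column1 = 500, column2 = 1 (the OLD column1, regardless of assignment order)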

The search goes on for the source of the strange customer problem. I have already given the customer the grim projection that we shall not fix this problem any time this week. It might be time to guide this developer toward tried and true software maintenance methodologies. A fitting example is that you cannot fix a problem until you can reproduce it. The justification is that you can never know whether your fix works if you do not know how to make the problem happen in the first place. Another idea I had was that maybe it was time for a little Oracle database training in this case. I have taken a class or two on Oracle technology. But I have also read a ton of books and a lot of other information on the web. However I think my best instructor has been many years of hands-on maintenance in front of a huge Oracle database. Sometimes I overlook this experience because I take it for granted.

OWB Skills

Our project at work has a bunch of data that we load into the system on a yearly basis. In the past, we had some UNIX shell scripts and C programs which handled this load. Over time the code was ported to Oracle Warehouse Builder (OWB).

Another company has won the software maintenance contract for our project. So we are transitioning out. Our OWB developer has already left the project. We are planning to turn the OWB source code over to the new contractor. They were looking for some OWB training from us. However we were only obligated to provide them with the source code. It is not in our statement of work to provide technical training to our replacement. This is just tough luck.

If I were working for the winning contractor, I would take these OWB projects and move the code into something more common. Personally I would recommend straight Oracle PL/SQL and maybe some Korn shell scripts. You don't want to be locked into some little used tool that nobody knows how to use. This is a recipe for trouble. Heck. You could even convert the projects to a common programming language such as Java. The problem is that there is not a strong business case for this effort. In the end, the customer will just get something that does what the existing OWB project does. No net benefit other than ease of maintenance. I wonder what the future holds for this annual loading software.

Slow System

This week our users complained that the system had stopped responding. It was the second time this had happened. Our DBA Team and Performance Engineering Team had some ideas about the problem. But in the end their analysis said that a strange process ran on the server and consumed all of the CPU. This answer did not generate a lot of faith from the user community.

When the problem happened a second time, the application development team had to step up to the task. This time around the DBA Team analyzed the Oracle trace files and found that there was an ORA-00060 error "deadlock detected". But this did not make any sense to me. The deadlock error happened at the end of an hour long system halt. Oracle should be able to detect such a deadlock within seconds, or at most minutes.

Development took this idea and ran with it. We determined where in the code we execute the SQL that was reported to be in deadlock. Then we analyzed how the application can cause that line of code to be in a deadlock situation. The analysis was not too conclusive. We did some tests to try to reproduce the problem with this knowledge. But we were unsuccessful.

I scanned a database table where our application logs exceptional events. Strangely enough, I found a lot of ORA-03113 errors "end of file on communications channel". This by itself does not indicate anything. When the system is hung, anybody can kill our application and cause their session to terminate. However, these errors were all coming from the same stored procedure in the database. I inspected this stored procedure and found some very suspicious code which does the following (see the sketch after this list):
  • Declares a cursor with a SELECT ... FOR UPDATE clause
  • Loops through the cursor
  • Updates each record
  • Performs a COMMIT at the end of all the updates
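
In PL/SQL the suspicious pattern looks roughly like this (a sketch with made-up names):

CREATE OR REPLACE PROCEDURE update_all IS
  CURSOR c_rows IS
    SELECT ROWID AS rid
      FROM app_data  -- hypothetical table
       FOR UPDATE;
BEGIN
  FOR rec IN c_rows LOOP
    UPDATE app_data
       SET processed = 'Y'
     WHERE ROWID = rec.rid;
  END LOOP;
  COMMIT;  -- locks are held until this single commit at the end
END update_all;
/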

Now it appears that multiple users were all hung in this stored procedure. I assumed that the first user to open the cursor would get a lock on all the records (due to the FOR UPDATE). However, maybe Oracle only locks the records as they are fetched from the cursor one by one. If so, suppose all users were selecting the same records. Furthermore, let us assume that some users locked some of the records, while others locked the rest. This would indeed be a deadlock situation. But I would still expect Oracle to recognize this, give everybody except the first user an ORA-00060, and kill their locks.

My plan for now is to have a developer simulate the situation where multiple users all execute this stored procedure at the same time. In addition, I want there to be a lot of records to update, thereby increasing the odds that we can conflict with each other's transactions. If this turns out to be the problem, I will (1) get the database guys to configure Oracle to detect this easily, and (2) rewrite the stored procedure to get the locks one record at a time.
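
If the tests confirm the theory, the rewrite might look something like this sketch (same hypothetical names), where each record is locked, updated, and committed individually so no session ever holds a big set of locks:

DECLARE
  CURSOR c_ids IS
    SELECT id
    FROM work_table
    WHERE status = 'PENDING';  -- no FOR UPDATE on the cursor itself
BEGIN
  FOR r IN c_ids LOOP
    UPDATE work_table
    SET status = 'DONE'
    WHERE id = r.id
    AND status = 'PENDING';    -- re-check in case another session already processed it
    COMMIT;                    -- release the row lock right away
  END LOOP;
END;
/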

Subquery WHERE Clause

I was working on a trouble ticket. I traced the problem down to some SQL code that was actually in a Pro*C file. Here was the update statement:

UPDATE table1 one
SET dummy=200
WHERE one.ROWID IN
(
SELECT innerone.ROWID
FROM table1 innerone,
table2 two
WHERE innerone.id = two.id
);

Now I wanted to limit the subquery to only choose those rows that had the code column of table1 equal to 5. Therefore I added another criterion to the inner query's WHERE clause. My natural instinct was to use the alias from the inner query itself like this:

UPDATE table1 one
SET dummy=200
WHERE one.ROWID IN
(
SELECT innerone.ROWID
FROM table1 innerone,
table2 two
WHERE innerone.id = two.id
AND innerone.code = 5
);

My reasoning was that it would be faster to reference just the inner subquery tables. I did not want to reference the outer query table, as that would make it a correlated subquery. However another developer came in behind me and changed my code to reference the outer query table alias:

UPDATE table1 one
SET dummy=200
WHERE one.ROWID IN
(
SELECT innerone.ROWID
FROM table1 innerone,
table2 two
WHERE innerone.id = two.id
AND one.code = 5
);

Shouldn't these both produce the same results? This other developer is out of the office now. But I am hoping to get an explanation when he gets back. What do you think?

Perceived Performance Problems

Our software team often has difficult problems to solve. We maintain a system with a lot of users and a ton of data. Recently a certain piece of functionality in our application has gotten terribly slow. The users don't even try to perform those functions any more. It is a difficult problem because we cannot make it happen in a development environment. We tried to add some logging but it did not shed any more light on the situation. The last resort was to allow a developer to run the application in a production environment. But it seems to take an act of Congress to authorize that.

Some consultants came up with a plan. They would fly out to the user's site. And they would see what the user was experiencing. At the same time an Oracle performance consultant would run some tools on the back end database to monitor what was going on. This would identify the problem, or at least the problem area, if in fact the problem exists on the database side. We make a couple of calls to some stored procedures in a PL/SQL package. One of these calls could be hanging. I really hope this is not some weird hang-up on the workstation. Because then this consultant test will give us no new information (other than that they could not locate the problem).

Saved by Documentation

This year my team was tasked with making some complex changes in our system. These changes needed to be made deep in our PL/SQL packages that run as a part of our nightly process. Since I am the one who has been around the longest, and I am also one of the few who actually know how our nightly process works, I got assigned this task.

Our requirements team documented the needed changes as best they could. And I questioned them mercilessly until they got all the details from the customer. Then I went off and coded up a number of changes to meet the specification as I understood it. Unfortunately I made some incorrect assumptions about the values set by some other parts of the system. As soon as this code hit production, the trouble tickets started coming in.

As this was my code, I jumped in, determined what was wrong, and coded up some PL/SQL changes. Our process requires us to conduct unit testing. So I doctored up some data to cover all the test cases. There was only one problem. The code I implemented is only supposed to run on Wednesdays. And today is a Thursday. No problem. I just modified the code to run every day while I did my unit tests.

Another part of our process is to document all changes so that they can be peer reviewed. Oftentimes this seems like unnecessary overhead. But today it saved the day. As I was generating the documentation, highlighting the portions that I changed, I noticed that the code was written to run every day. Oops. Luckily I caught it before peer review, and more importantly, before we shipped this code to Production. If that had gone through, there would have been all kinds of fireworks. We do have other safeguards such as independent testing. But I am not sure they would have caught this given the rush on the fix.

I can safely say that I was saved today by our standard development process. Maybe I also need to look into other ways to test "Wednesday only" functionality without hacking up the code. What do you think?
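
For example, the day-of-week check could be isolated in a function that a unit test overrides through configuration instead of through code edits. A minimal sketch, assuming a hypothetical APP_CONFIG table:

CREATE OR REPLACE FUNCTION is_run_day RETURN BOOLEAN IS
  v_force VARCHAR2(1);
BEGIN
  -- Unit tests flip this hypothetical flag instead of editing the code.
  SELECT force_flag
  INTO v_force
  FROM app_config
  WHERE config_name = 'FORCE_WEDNESDAY_JOB';
  RETURN (v_force = 'Y');
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    -- Normal path: only run on Wednesdays.
    RETURN TO_CHAR(SYSDATE, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH') = 'WED';
END;
/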

Development Versus Run Time

Our customer had some high priority problems. The original developer working on the problem was not making enough progress. So the problem got reassigned to me. I had a late breakthrough on the problem. Apparently one of our application's metadata tables got out of sync. So I needed to write a script to fix the data. The challenge was that it was very late when I made the discovery. And the customer wanted the fix now.

Here are some more details on the problem. We have a main table with read-only data. It has a lot of records (maybe 30 or 40 million records and growing). The application makes use of a second database table containing meta data with around 10,000 records. Unfortunately there were multiple problems with the meta data. Some of the records did not belong in the table. And other records were missing.

Given such a small amount of time to write a fix, I just hacked out a brute force method. I removed all the meta data records. Then I went through the 40 million records in the source data and reconstructed the meta data table. I did some quick performance tests and they looked OK. So I am planning to ship the fix as is. The main driver of this design was the limited development time. Had I more time to work on the script, I could have built in some smarts that only added and removed a small number of records. However that would have taken much longer to code and test.
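
In outline, the brute-force script boils down to something like this (the table and column names here are made up):

DELETE FROM meta_data;

INSERT INTO meta_data (group_id, record_count)
SELECT group_id, COUNT(*)
FROM main_table
GROUP BY group_id;   -- one pass over the 40 million rows rebuilds all the meta data

COMMIT;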

This is usually the trade off in software maintenance. I hope the script runs fast enough when it gets deployed to Production.

Fixing a lot of data

We have some PL/SQL code which computes a lot of values on a daily basis. Right after the SQL*Loader run, the PL/SQL goes to work and performs the computations. Then the client applications use these computed values on the display screens. Nothing special here.

I got a trouble ticket stating that a new computed value was not being set. After checking the production data, I found that this was true. It turned out that the wrong version of the package was compiled into the database. So I spoke with the team who controls the PL/SQL package install. Sure enough, the instructions were correct in their release notes. But the wrong version was compiled in. The immediate problem could be fixed by compiling the correct version of the code. But the total fix required going back and recomputing the values for all the data loaded so far. We got the computation wrong for about 15 million records.

My first cut of a script to fix the existing data was to run a cursor through the 15 million records. Then I would execute a bunch of logic to figure out which records needed updating. Then I would perform the update for each record. After completing version 1, I worried that this script might take a long time to run. So I spent some time looking at the Production data. And I was able to determine that I could put some conditions in my cursor that would bring the total record count down to about 40 thousand. Now I was on the right path.

Unfortunately my first set of unit tests on random data showed I had some bugs. The code that retrieved the data for the computations made some wrong assumptions. So I was trying to SELECT some data that had no records. In my early days as a programmer, I would first SELECT the COUNT, then SELECT the data if the COUNT was greater than 0. But that is doing double work. So instead I coded a CURSOR which opened and fetched. If I got any data from the fetch, I did my computations. No need for a separate COUNT. I checked for the existence of the data and retrieved it, all in one step. This can speed things up when you are dealing with a lot of data.
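
Roughly, the pattern looks like this (hypothetical names again):

DECLARE
  CURSOR c_data IS
    SELECT value_col
    FROM source_table
    WHERE key_col = 'SOME_KEY';
  v_value source_table.value_col%TYPE;
BEGIN
  OPEN c_data;
  FETCH c_data INTO v_value;
  IF c_data%FOUND THEN
    -- Data exists; do the computations with v_value. No separate COUNT needed.
    NULL;
  END IF;
  CLOSE c_data;
END;
/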

40 thousand records is not huge. But it is big enough that you want to consider performance enhancing features. The only drawback to my work was that it took up a good deal of development time. Now is the busy season for my customer. So I had to wrap up this task and move on to the next problem to solve. I am glad that I forced myself to spend a little extra time to get the solution right. In fact while looking at the Production code to get some ideas for acceleration, I noticed some other related problems that I solved in the same script. I'd say I earned my pay today.

Tracking Weird Problems

I got word that a bug in our software was becoming high priority. Because of this problem, our users had to do a lot of extra work. This is never a good thing. And it was especially troubling since the beginning of the year is the busiest time for our client.

The users were so busy that at first I could not get any information about the problem. So I just started working on other issues. But then some important customers were saying these problems were really holding them back. So I tracked down some users and got some info from them.

Apparently some nightly jobs were not inserting records as expected. At first glance, it looked like records were being inserted. But as I talked with a few users, I found that not all expected records were being populated. I had a hard time tracing the actual code for this nightly job. Last year somebody made some changes and did not check them into source control. Shame on them.

I had to go back to last year's delivery of the code changes to get hold of the source. Luckily I found one pattern for the missing data. An error code was being stored in a lookup table. This was the start of a breakthrough. The only problem was that the source code showed that the problems were supposed to be logged each time the error code was set. I could find no evidence of such logs in production.

The next step was to try to replicate the problem. This was harder than it seemed. The nightly job was something we were not set up to test. Somebody once had the idea that we should have a test bed. But it never got further than an idea. I had a DBA help out by trying to create users to match production. No luck. I got bold and ran a few tests in production. That also proved to be a dead end.

By now I had to resort to an old fashioned technique: tracing the code and results by hand. I went through an example of some data which got an error code set, and kept building up the huge SQL statement that actually inserts the data. Then I got an Oracle error. Apparently the users had entered some SQL into the system that did GROUP BY and ORDER BY operations. The code tries to append WHERE clause data at the end. But a WHERE clause cannot come after the GROUP BY or ORDER BY in a SQL statement. I got the users to clean up their SQL.
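
To illustrate with made-up names, the job was effectively building SQL like the first statement below, which Oracle rejects. The WHERE clause has to come before the GROUP BY and ORDER BY:

-- Invalid: the appended WHERE lands after the GROUP BY and ORDER BY
SELECT dept, COUNT(*)
FROM employees
GROUP BY dept
ORDER BY dept
WHERE hire_date > SYSDATE - 30;

-- Valid ordering
SELECT dept, COUNT(*)
FROM employees
WHERE hire_date > SYSDATE - 30
GROUP BY dept
ORDER BY dept;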

In retrospect, this ordeal made me realize a few things. We need an adequate test bed. We need production to log errors diligently (still working on this). And finally, the software should not allow users to enter data that will result in middle-of-the-night errors. Time to implement some validation.

Basic Failure

I knew something was wrong when a couple of different developers on the project asked me if I was good at SQL. We have full time DBAs. We have consultants from Oracle. But I still got the call.

Apparently a script was not working correctly in development. It worked fine in Production. But a developer needed to do some tests in development and things were going weird. There was a column IDENTIFIER in a table defined as VARCHAR2(5). And they tried to run a query like this:

SELECT COUNT(*)
FROM table
WHERE identifier='00350';

The problem was that the count was returning 0. However there were many records that had the specific identifier the WHERE clause was comparing against. This was just blowing the mind of the developer. Surprisingly, when he removed the single quotes from the WHERE clause (forcing Oracle to implicitly convert the column to a number for the comparison), the count came back correctly.

We tried a number of experiments to determine the source of the problem. For example, we inserted a new row with the same identifier. Then the count returned 1 (not 51 as we expected). However updating all 51 records, setting the identifier to the quoted value, did not change the query results. Somebody thought maybe this was due to a bad index on the column. But this column was not part of any index. We also tried creating another database table with a similarly defined column, but we could not replicate the problem with this other table.

My conclusion was that something went wrong when these 50 records were put into the table. I could not explain why. We loaded the data with SQL*Loader like we always do. For now, to get the fix out, the developer changed the source code to not use quotes. This is just a hack until we can get a grip on this perplexing issue. Any thoughts?
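
One diagnostic I still want to try is Oracle's DUMP function, which shows the actual bytes stored. If the 50 original records contain stray non-printing characters or differ in length, this should reveal it (reusing the placeholder names from the query above):

SELECT identifier, LENGTH(identifier), DUMP(identifier)
FROM table
WHERE identifier LIKE '%350%';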

Fast to Slow

I work on a complex system. We have a huge amount of PL/SQL code running at night to do all sorts of tasks. One such task is performing number crunching and storing results in meta data tables. The applications consult these tables during the day to get information quickly.

Our database is a very large one. And the data keeps growing. So the number crunching job kept getting longer and longer. I asked for help from the performance engineering team at our company. Their first idea was a divide-and-conquer method. So I wrote some Korn shell scripts that run 10 jobs in parallel, crunching portions of the data at the same time. It was not a 10-for-1 speed improvement. But it was close. We had a very happy system administration staff working at night because now our jobs finished quickly.

When we started up the system the following year, the number crunching routine slowed back down to old levels. Our chief DBA assumed that we had somehow deployed the old code. I had him check the Production Korn shell scripts and PL/SQL packages. The correct code was there. So I called in the performance engineering team. They monitored the job the next night. The job finished in record time. Performance engineering said perhaps the gathering of statistics first was the key. Whatever the reason, I am glad we are back to quick speeds.
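
If stale statistics really were the key, gathering them right before the number crunching kicks off would be cheap insurance. A sketch, with a made-up schema and table name:

BEGIN
  -- Gather fresh optimizer statistics before the nightly crunch.
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'APP_SCHEMA',
    tabname => 'CRUNCH_SOURCE',
    cascade => TRUE);   -- include the table's indexes
END;
/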