When the problem was replicated by the customer, I found that the first report that gets called was generating an exception when trying to run a second worker report. There was an additional error in that the caller report was unable to log this activity to the database due to a coding error in the exception handler. At least one mystery was solved. However I still was unsure why the worker report was not working. I went into the customer environment, exported out the data that was used when the problem happened. Then I got back into developer. I tried to run the worker report using Oracle Reports Builder. The result was a REP-1247 “Report contains uncompiled code”.
At first I just recompiled the report using my database login. But then the report ran fine. So I started to concentrate on why the report got the REP-1247 in the first place. I reasoned that you only needed to compile the report once. After that you should be able to run the report without recompiling it. I traced the problem down to a line in the after parameter form of the report. It was doing a SELECT FROM DUAL. That seemed harmless. However it would just not work in Reports Builder unless I recompiled it. Our local reports expert told me to apply the latest patch to the old Reports 6i tool. That also did not solve my problem.
In the end, our reports guru said you just need to recompile the report every time you want to run it from Reports Developer. That did not seem too encouraging. It also did not feel right. This really feels like a bug. And I know the Oracle answer will be for me to upgrade to Oracle Reports 10g. However that requires some infrastructure upgrades like a back end report server. I don’t think my client is ready for all of that just yet. I can’t be the only one having this problem. Or maybe we are few, since I am mixing the old Reports 6i with a recent Oracle 10g database. Does anybody have any advice for me?
Our customer complained that data was sometimes not showing up in one of our applications. A developer on our team researched the problem and found that it occurred only when the data had leading spaces. His solution was to modify the retrieval Pro*C code to trim out the spaces. On the surface that sounded like an acceptable solution. So he changed the code from
SELECT NVL(text, ‘ ‘)
SELECT TRIM(NVL(text,’ ‘)
The result was that the text that had leading spaces was displayed without the spaces. The problem is that this made the application through an ORA-1405 exception for most cases where there was no text. This was a difficult problem to diagnose. The SQL ran fine in SQL*Plus and PL/SQL Developer.
I was the one who stepped in and debugged this problem. The way I found that this change was the problem was by reviewing the source code change history. The text column was NULL for most records. The NVL turned it into a space. However the TRIM caused the space to revert back to an empty string (NULL). Pro*C did not like putting such a value into the variable. The result was the exception we were getting.
My initial fix was to back out the changes and ship our release. Then I thought maybe we could add one extra NVL around the whole clause to ensure the fix goes in, but it does not break the normal case. The problem is that it was starting to look like some ugly code. I am thinking our eventual fix would be to modify the application to be able to handle text that with leading spaces. Nevertheless this was an interesting exercise in debugging. Another developer spent the weekend trying to figure out what was going on.
Alex works on a betting engine with 250k lines of code. It is distributed among computers located around the world. It uses Oracle RAC, as well as Oracle Coherence. There is a middle tier written in Java. The front end uses Ajax. The system processes 5 million transactions a day on average. At peak times it crunches 1000 transactions a second.
I wondered what it took to become PL/SQL developer of the year. Alex stated that you needed an attention to detail. You also need to be passionate and take pride in your software. Somehow I think you might also need to be working with the latest Oracle technologies. To tell you the truth, I have not even heard of Coherence before reading about this award.
Looking back on my own PL/SQL programming career, my biggest year was when I implemented a subsystem of our loads software. It was written mostly in PL/SQL, with a little UNIX Korn shell and a SQL script or two. The production code had about 6k lines of code. However I also coded up a unit test harness that was 5k lines of code by itself.
This may not seem like an abundance of code. The hard part was determining the requirements for the coding. That took so long that the schedule only gave me 4 days to code the whole thing. Let’s just say I knocked out a lot of code that compiled. And it ran super fast. There were a number of trouble tickets written against my code. They all got resolved.
So what should I do? Nominate myself for the award next year? Maybe I should try to contribute to Oracle magazine. No. I think I will just keep posting to my blog here. In fact, perhaps I should give out my own award. What do you think about the “2008 Top Oracle PL/SQL Blogger” award? I hate to tell you this, but the winner is me.
I always like to do a little homework on the problem assigned to me before I call the customer. So I installed the application transmittal they were running. However I could not make the problem happen. I put a call in with the customer that opened the problem. She was not at her desk. Due to the high priority nature of the problem, I tracked her down in her test lab. She walked me through the steps she was taking. That’s when I figured the problem must be related to the data in her test database. It was missing some values.
This should have been no problem. I got busy writing a PL/SQL script to correct the data in her test database. I just needed to insert a bunch of rows in one table. At first I thought I wanted a bunch of nested loops to control the values I inserted into each of the columns. However they were character data and I could not figure out how to look through some char values. Shame on me. Time was ticking away on this high priority problem.
In the end, I decided to get down to business. I wrote a small helper nested procedure to do the dirty work. I hard coded some values in it. Then I hard coded a bunch of calls to the helper procedure. It was not clean. But it got the job done. I made sure to order the parameters to the procedure so that calls to it at least look good in the source code. I did use one numeric loop to control some of the calls to the local procedure. But I needed to format and convert the numbers to text before making the call.
While unit testing I found that my script generated some exceptions. I used the tried and true method of commenting everything out until I knew what the problem was. More unit testing showed that I got some of the values wrongs. I fixed these. Then I also wrote some SQL for our own testers to run in order to recreate the problem in their database. I think I did a pretty good job with this trouble ticket. From the moment I got the call to work this problem,. It was well under 2 hours before a solution was delivered to test.
I usually like to know exactly what is going on. So I had the developer run the application and print out all the reports that got modified. It seemed like half of them did contain the new required text. The other half had the text missing. Then we ran those problem reports in Report Builder. Sure enough the text showed up there. I made sure we were looking at the same RDF files. We were. Then I thought perhaps it had something to do with the parameters being passed to the report. The developer checked the database (we log all parameter values). This was not the cause. I was at a loss, and had to get back to my normal job.
An hour later, the reports developer came running in. Here was the problem. The reports with the missing text get printed in landscape mode. However the Report Builder tool was displaying them in portrait orientation. In essence the boilerplate text was located too far down on the page to been seen when it was printed out. Once this was discovered, it was a trivial task to move the text where it would be seen on all pages given that they were printed in landscape mode. I am sure glad that I did not have to determine the cause of this problem. Who would have thought this was the case.
To make things run fast, I always develop on a virtual machine these days. Unfortunately I still do not have all my tools installed on that machine. So I decided to install Oracle Report Builder on it. This is an old version (6i). I did not even bother to apply any patches. I found the installation process to be very painful. I was using a remote connection to the virtual desktop. The screen kept spending a lot of time repainting. It is slow because I am remotely accessing the machine. I guess they did not take that into account way back when they were writing this Oracle installer.
Here is a funny thing I did see once the screen stopped repainting. There was a lot of Oracle propaganda that I had to endure during the install process. I saw a message that I was going to be experiencing “High Quality Reports”. That’s a good thing. My customers don’t like low quality reports. Then I read that I would able to explore all my data. I certainly hope that would be the case. I was also informed that I was embarking upon a solution with “Unlimited Data Formatting”.
Things got even more crazy the further I got into the install process. I got one message that just said “Oracle Reports!” Apparently you cannot emphasize this product enough. Then I got to the last straw that said this product was “integrated with Oracle’s spectrum of business intelligence tools”. Ok. If they have to go this far to promote the product, I am starting to worry that this is going to turn out to be a total dud of a product.
Luckily I have worked with this version of Oracle Report Builder before. And it has some issues. Some of them get fixed with one of the patches for the product. I found it to be better than some of the early Oracle 10g report products that seemed to be rushed out the door with a lot of bugs. After I finally installed the program, I was able to determine what the problem was. The report was looking at the wrong field from the data. This is a simple enough fix.
There are some other problems with bulk loading of data. Our DBAs have been in contact with Oracle Corporation. And they have identified the issues as bugs. The resolution is to go to version 10.2.0.4. That sounds find and dandy. However we now have to go through our performance testing again. I hope this patch resolves these problems. And I pray that there are no new problems introduced with the patches.
You have to keep up with current versions of software. I understand this. Being a software developer myself, I know our users need upgrades all the time. However it hurts being the consumer of some products that need upgrading. There is pain involved with the upgrade itself. Part of this pain is dealing with things that are broken. You also do not want to stay on the bleeding edge. Otherwise you might bleed yourself with the results.
I would like to think that Oracle 10g is stable enough at this point. Don’t they already have an Oracle 11g out there now?
I have a lot of database accounts in the development and test environment. It is a challenge to remember the passwords for all of them. So I choose a passwords format that makes it easy for me to remember them. This plan gets foiled every time I am forced to change my password for whatever reason. At that point I cannot change my password back to the well known format. I must have a lot of different passwords before reusing the one I know.
Since I am a developer, I figure I can write a script to help me with this problem. I loop through 50 iterations. I use a sequence to generate a random sequence. Then I ALTER USER to change the password in the loop. At the end, I change the password back to the one I want to reuse. I bang out this code really quickly. I am pleased when the script executes to completion. However, the script fails when it tries to reuse my password. I get an ORA-28007: “the password cannot be reused” error.
This perplexes me for a while. I am successfully changing my password to a bunch of unique values for 50 times. So I should be able to reuse the old one. I guess that maybe somebody has messed up the profile which restricts password reuse. No. The profile still says I can reuse after 50 different passwords are used. Then I see the problem. In addition to the PASSWORD_RESUSE_MAX parameter that makes me choose 50 intermediate passwords, there is also a PASSWORD_REUSE_TIME parameter that prevents me from reusing the same password too soon. I had forgot about that rule.
A true hacker does not give up so easily. I was trying to emulate a user switching the password a bunch of times. That was the wrong approach. Instead I need to switch my profile to a different one, such as the DEFAULT. Then I can bypass all the password rules. This feels wrong. But you have to do what you have to do. These are development accounts anyway.
The Instant Client is free. It has support for OCI, OCCI, and JDBC connectivity. There are versions of the Instant Client for Microsoft Windows, Mac, Solaris, Linux, and many other operating systems. SQL*Plus is available, but it comes packaged separately.
The great thing about the Instant Client is that you just have to unzip it. There is not Oracle Universal Installer to configure it. Just point your ORACLE_HOME to the directory where you have unzipped the files. The Instant Client is the ultimate thin client for Oracle connectivity. I might try to get our project to use this. We have our own set of issues with the fat Oracle 10g client.
Sure it is somehow revolving around Oracle. But what does the “open” refer to? Is it open source? I doubt it. I went to the web site and found the conference covered applications, databases, middleware, and industries. This did not narrow the field down much. I also saw some information that I would be able to connect with both experts and peers at the conference. Unfortunately this is still pretty generic.
I did get the impression that this conference was not specific to Oracle developers. I really wish they had a buzz line for the conference. Something like “Oracle Open World – Running Oracle on Linux System” would be helpful.
Oracle has hopped onto the environmental trend by having a Green Program within the conference. They are going to review green components, if that makes any sense. There shall be a review of green practices in general. Their motto is to rethink, reduce, reuse, and recycle. At this point I cannot tell for sure. However I suspect this is really just a lot of lip service to capitalize on the popularity of the environmental movement.
The cost for Oracle Open World is between $2000 and $2500, depending on how early you register. I have an interest in the Oracle database. Our customer has huge amount of data stored in one. Some of our code is PL/SQL stored procedures. I might be a good candidate to attend this conference. The promoters of this conference have not done a good job of telling what exactly this conference is about. So they have lost at least one customer.
One of Tom’s main cautions is to not perform operations in a trigger than cannot be rolled back. For example, pretty must all of the util_file package operations cannot be rolled back. So if you have then in a trigger, you are waiting for a problem to happen. Tom says that the Oracle mutating table constraint is there to protect the programmer against themselves. Tom believes that a trigger which enforces referential integrity is very suspicious.
The system I work on has a lot of triggers in it. This alone should not be reason for caution. However when a respected figure such as Tom Kyte warns you about a certain database technology, I figure we should take heed. Most of the triggers in our system perform some type of logging or auditing. However there are some triggers which propagate copies of data to tables which are not normalized. And we have other triggers which keep meta data (like counts) up to date.
Luckily all of the operations performed in our triggers can be rolled back. We do not normally fool around with autonomous transactions in our triggers. Nor do we perform file operations that also cannot be rolled back. In essence, you do not have to study the triggers to figure out what most of our system does.
* Export utilities
* Generates HTML documentation from schema
* Ability to copy objects between schemas
As you probably already know, SQL Developer is a graphical tool from Oracle for database development. It allows you to browse database objects, run SQL, and debug PL/SQL among other things. It requires an Oracle database version of 220.127.116.11 or greater. There are versions of this application for Microsoft Windows, Linux, and the Mac. It is a free tool.
Version 1.5 of SQL Developer added a number of new features to the product:
* Store code templates for reuse
* Refactor the code
* View Java code
* Perform schema difference
* Import Excel spreadsheet to create tables
* Open trace files for performance tuning
* Run a report to monitor sessions
* Use CVS and Subversion source code control
Version 1.5.1 is a patch. It is available for Windows XP and Vista. It ships with and requires the JDK version 1.5. This patch makes SQL Developers support newer versions of Sybase.
I have to say that I already use an old version of PL/SQL Developer from Allround Automations, which is a different product than Oracle’s PL/SQL Developer. And for some tasks I actually use Oracle SQL*Plus. My first impression of Oracle’s PL/SQL Developer was that it did not feel natural to use. However I have heard some good things on the street about the product. When I decide to upgrade to a new tool, I will definitely be giving this one another look since it comes from Oracle and is free.
I stared at the script for a long time. And I kept running through my unit tests again and again. But the script kept eliminating all the application data, good and bad alike. This was troubling. As a software engineer, logic problems like this should be trivial for me. It did not help that we had a lot of distractions at the office. However I am trained to be able to work in any conditions and produce correct code. So I just added some restrictions to the WHERE clause in my script. Then the script did not delete all the application data. However I still did not understand what was going on. The only thing I could do was go over the code line by line until I saw the light.
This was the pseudo code for the script:
Cursor to loop through app application records
Check if table exists
If table does not exist
Delete application data
It does not get much simpler than this. Then I finally did spot the error. The where clause in my check was like WHERE column = table.column. It should have been WHERE column = cursor.column. At that point I felt pretty stupid. I spent the whole afternoon looking for that one. I write about it here to warn you in the rare case that you write or stumble upon such a problem in your own journey.
The application being modified already uses Pro*C for some of its database access. So a decision to use Pro*C did not seem unusual. And I knew that this particular developer was familiar with it. However the other developer requested that the common code be put in a PL/SQL stored procedure. That way it could be written once. This stored procedure could be called from the Pro*C function. And since this guy was the team lead, he made the final call. The other developer expressed reservation about this decision. As a result, the coding task got reassigned to me.
Personally I love writing code in PL/SQL. And I figured it was no big deal to call a stored procedure in Pro*C. You just do something like this:
There were other examples of this pattern in the code for the application being modified. So I coded in those hooks. I also wrote the common PL/SQL back end stored procedure. The procedure was actually quite easy to code. I knew we made the right decision when the database table being accessed got some column name changes. I just modified the stored procedure and everything worked fine.
EXEC SQL EXECUTE
Maybe the original developer did not want to agree with our team lead. I normally do not take this stance. It is fun to write code. A good developer should be able to write the code in any language and with any technology. Those details are usually not important. Calling a stored procedure in a package from Pro*C is also easy. I think I might give the link to this blog post to the original developer. He might learn a thing or two. And I am not talking about Pro*C syntax.
There are benefits and drawbacks to this two tiered DBA setup. Right now we are suffering from some of the drawbacks. I have written about this on my Software Maintenance Blog with a post entitled Database Lockdown.
I encourage you to check it out. It is most relevant, not only because we run an Oracle 10g database on the back end, but it might affect your work.
Before I go on vacation, I wanted to make sure the development staff had what they needed to implement the code changes. So I tested out the PL/SQL package written by the consultant. I logged into SQL*Plus. There were a bunch of functions exported by the PL/SQL package. I executed a SELECT FROM DUAL for one of the getter functions. It worked fine. Then I tried to SELECT FROM DUAL one of the functions that modified the underlying database table. The function returned an error to me. So I pushed it back to the consultant because I thought maybe he had a bug.
The consultant came back to me with an answer. The function was trapping an ORA-14551: cannot perform a DML operation inside a query. Of course that made sense. The SELECT FROM DUAL cannot change a value in the database. Why didn’t I think about that. The way to test out this stuff is to write some PL/SQL code, and call the functions from there. That is how the developers are going to have to implement this. The consultant also cautioned me against testing this directly from SELECTs in SQL*Plus because the functions are autonomous transactions. Apparently things get screwy when you call autonomous transactions like that.
The bottom line is that you cannot issue a SELECT FROM DUAL on a function that modified the database. There is a rule in Oracle that prohibits this. I have actually run into this before. However since I did not write the implementation for this package, I did not see what was going on under the hood. Developers beware.
I want to give a shout out to Natalka Roshah from Oracle FAQ. Her article explained the mysteries of the Recycle Bin to me. Apparently given the right 10g option being set, Oracle shall not actually drop a table when you issue the DROP command. Instead it renames the table so you can restore it easily in the future. A little research shows that the table does disappear from the USER_TABLES view. However it still exists in the USER_OBJECTS views with a type of “TABLE”.
Our project actually wants to truly drop the table when we issue a drop. So we will have our DBA team disable the Recycle Bin option. However for now we have resorted to issuing the following command to drop a specific Recycle Bin table:
PURGE TABLE “BIN$TktvMTcFRA3gRAADunAMgQ==$0”;
In my case, I wanted to get rid of all Recycle Bin tables in my schema. I had amassed 313 of them (which were taking up space). So I got rid of them all with the following command:
The real funny thing about this is that some experienced developers who were not familiar with this Oracle 10g feature were sure these were not real tables. This included dudes who used to work for Oracle. I do not know about anybody else. But I think it is time for my company to send me to Oracle 10g training. Don’t you?
I was able to duplicate the problem easy enough. Just log in and wait 30 minutes without doing anything. Sure enough I made the problem occur. However after reviewing the code involved with just logging in, I determined this was not a thread coding issue. I hit Google search to see if I could find any knowledge on the subject. Most of the web sites I visited recommended that you modify the timeout parameter on the server’s sqlnet.ora file. Armed with this information, I forwarded the problem to the DBA Team.
The next time I saw the lead DBA, I asked him about our timeout problem. He said that he encountered the problem as well. It was happening even in SQL*Plus. I told him I read a lot of articles about the timeout value in sqlnet.ora. However the DBA insisted that the timeout was not set on our server. Instead he believed this to be a network issue. So he forwarded the problem to the network administration team. This did not immediately resolve the problem. But it was due to the problem not receiving the priority it required. A couple calls later from the DBA, and it seems the problems have been resolved.
A lot of my experience with this problem is that you cannot just trust your instinct with any given problem. Nor can you entirely defer to the knowledge gained by Oracle. You need to dig in to the specifics of the problem and methodically identify the true cause. This is not to say that you need to discount any hunch you have. However you need to identify the hunches as theories, and test out those theories. Agent Moulder said it best when he recommend you trust nobody.
The Oracle installer produced a number of error message on my machine. Some messages seemed clear enough. For example, a service was running that needed to be manually shut down before Oracle could proceed with the removal. Too bad the error message did not tell me how to shut down the service. I tried going to Windows Services. That did not seem to resolve the issue. Then there were other cryptic error messages. They seemed to imply that the selected software was either uninstalled of in a bad state. That does not help me much.
When I went as far as I could, I was hoping the Report Server was gone. No such luck. There was still a directory where the Report Server was installed. A lot of the files and subdirectories had been removed. But many more remained. I tried to delete the whole directory with Windows Explorer. This did not work. Some program or service was locking the executables in the bin directory. I resorted to Google for help. Turns out you can rename the binaries in the bin directory. But you just cannot delete them. I wanted to reboot Windows in safe mode, then delete the pesky files. However I was in a hurry so I skipped it.
Another place where I saw remnants of the Report Server after uninstalling was the Windows registry. The data is stored in HKLM\SOFTWARE\Oracle. I continued to see an Oracle home entry for the Report Server. Well at least I have full control over the registry with the Windows Regedit program. The only danger is that you need to know what you are doing or else you could make matter worse by hacking the registry. I just deleted the Oracle Home registry entry (mine was HOME1). I also decided to remove ORACLE_HOMES\ORACLE1. And I figured that I should decrement the HOME_COUNTER in ALL_HOMES. I really do not recommend this level of registry hacking. But I was desperate.
Finally I got to reinstalling the Reports Server. At first I brought the software up to patch 3b which mimics ours Production configuration. But I later found I needed patch 18 for everything to work correctly. What a nightmare. I sincerely hope the never version of Report Server work better during removal.
Each time a new reported is selected, the application shall typically delete the prior temporary tables. The application then recreates the temporary tables based on user selections and current data in the database. This usually works fine since the same users normally run the same reports over and over. The only tricky part is that sometimes a user will choose a report for the first time. And in that scenario, there is no existing temporary table to delete.
My instinct for this scenario is to first detect whether the table exists or not. This can be a query from the USER_TABLES view. Once I find that the count is not zero, the code executes SQL to drop the old table. I was surprised to find another technique used in some code I recently debugged. It always attempted to drop the old table even if it did not exist. It preceded this code with a EXEC SQL WHENEVER SQLERROR DO CONTINUE.
Logically the ignore errors method works just as well as the count with optional drop of table. I started to wonder which way was better though. One would think the ignore errors method was better for the normal case. Only one SQL statement was issued. I would imagine this to run faster. But maybe there is heavy penalty for trying to drop a table that does not exist. In the end I left the code alone to ignore errors and always try to drop the table which might not even exist.
This whole discovery got started by a tester finding a scenario where the DROP failed and then a following CREATE TABLE failing as well. But that story is a subject for a future post.
Recently a tester asked a DBA to reset her passwords back to the ones she normally uses. The DBA attempted to do this, but was halted by the implementation of the security standards. It was too soon to reuse the old password. So the DBA chose a new password for the tester. Later I spoke with the DBA about this problem. Both he and I knew that it was the database profile that tied the user to the password verify function which prevented quick password reuse. I said we could just temporarily switch the user to a different database profile. Then the password rules would not be enforced. Oracle would not recheck the password when the original database profile was restored. This seemed like the ultimate plan to skirt the database password rules. In fact I coded up a script to do this for all my database accounts.
The DBA I spoke with knew about the profile swap trick. But he was hesitant to do this for the testers. He likes to do things by the book. The trick would be violating the security policy. I told him he was right. However we are just talking about a test environment. In reality, it is impossible to keep track of all the changing password for the multitude of accounts that we require. So people write down their passwords. The security rules get violated anyway. But at least the system was set up to discourage illegal password reuse.
I think the DBA has the right attitude. If you are always serious about security, you will most likely be safe always as well. It is a slippery slope to skirt security rules sometimes. There are times when the urgency of the moment requires some rules to be bent. But this decision should not be taken lightly. However I have not thrown out my script which resets my passwords by first switching up the profiles. In that way I am able to remember all my well known passwords and not write them down. That is my compromise.
Just this past week a couple developers encountered this sort of problem. We have been modifying the design used to print reports from our application suite. Previously we fed reports into an Oracle 6i Reports Server installed on the workstations. After moving to an Oracle 10g database on the back end, we detected some problems running the reports from the Reports Server. So we had a developer code up our own Reports Server using Visual Basic. This work around seemed to run pretty well. We just needed the application to call the Visual Basic app instead of the Reports Server. But then this fix just stopped working. In fact none of the reports seemed to be working correctly.
A couple of us developers spent a few days looking into the problem. I had been uninstalling and reinstalling Oracle software like the Reports Server and Reports Builder. None of this worked. After a while somebody realized that the tnsnames entry for reports was removed from the tnsnames file. It was a relief for the development team. We still were angry that our time was wasted. Luckily we install a local file on all our Production user workstations. The senior DBA decided it was finally time to lock down access to the tnsnames file. The trouble is that development needs access to the file in order to connect to all the databases we use.
So the senior DBA decided to "hide" the tnsnames file on the server. I tested a couple applications that use the file for connectivity. So did the senior DBA. We did not notice any problems. However one of the reports developers started seeing some weird problems in reports. She recalled out tnsnames debacle and started first by trying to research the tnsnames file. She panicked when she could not locate the tnsnames files on the network server (remember it had been hidden). I told her what the senior DBA had done. As of today I have not heard back whether she has resolved her reports problems. Maybe it is related to the hiding of the tnsnames file.
There must be a better way of providing access to the tnsnames file for connectivity while protecting it from accidental modification. Any ideas?
Then I got another request. The DBA Manager wanted all the SQL for a given report. This was a bit more complicated. The application creates a temporary table. Then it calls an Oracle report which is just a driver for another Oracle report. I knew the type of information this setup retrieved. But it would take a while to extract the SQL. And right now I don't have a lot of time.
The DBA Manager got back to me and told me to run a trace and get the SQL results quickly that way. This was easier said than done. I had some type of bug in my Oracle Report Builder, such that I could not compile in the commands to enable session level tracing. And my database account did not have permission to enable session level tracing either. I got some help from a DBA and an Oracle Reports developer. So I had Oracle generate a trace for a session in which I only ran the report.
Unfortunately I did not have access to the machine that hosted the Oracle database. But a DBA hooked me up with the trace file. This was the first time I had done this. I studied up on the tkprof command, but ended up executing it with all the default options. I was amazed at how simple it was to let the database capture all the SQL. But my instinct told me some SQL from the report was missing. The DBA Manager told me I could just give him the tkprof output. He would take it from there.
I think I need to get a little more practice running traces. And I certainly need to get my Oracle Reports installation fixed so I get make reports changes. Perhaps I can also obtain access to the UNIX box where we run our main Oracle database. Then I can run traces on my own and not need a DBA to get me the trace file each time. I have the feeling that me and tkprof are going to get along very well.
I could go through how you can use the sqlglm function. However I recommend you just consult the Oracle Pro*C Programmer's Guide. Isn't it like a developer to not read the manual? There is no excuse for this. Oracle provides the documentation free of charge on the web.
This reminds me of another huge problem I encountered this year with multi threaded applications having trouble managing multiple database contexts. At lot of high power developers thought up many theories about what our problems were. And there was a lot of guesses on how we could solve the problem. None of the guesses panned out. Luckily I read up the Pro*C Programmer's Guide and employed the Pro*C functions to explicitly manage database contexts for threads. Problem solved. Maybe I will give an example of how I did this in the future. The Pro*C Programmer's Guide is a good start. But it lacks some detailed examples.
Currently one developer on our team is assigned the reports. We are sticking with the Oracle 6 Report Builder. We are also leaving the workstation with the Oracle 6 Report Server. Most of our reports just run with the Oracle 6 run time client. These seems to be working out OK. However a few reports go through the Oracle Report Server on the workstation. These are giving our developer a lot of trouble when run.
Specifically, our reports developer compiles the reports against the 10g database using the Oracle 6 report builder. The reports run fine from Oracle Report Builder. However the Oracle Report Server complains when it tries to run the report. It always states that the report contain uncompiled PL/SQL. This has been driving the reports developer nuts. We have tried looking into patches to the Report Builder. But I think the problem actually is in the Report Server. I am not sure if Oracle is going to be any help, since these are old products.
The development team is hoping that the kinks get worked out. We want to start using more recent products from Oracle. It is no fun being stuck in the past using outdated and unsupported products in our system. I will keep you posted on our progress. We have a big test scheduled with the customer in a month and a half. Good luck to us.
(1) ORA-12203 "TNS: unable to connect to destination"
- Potential cause is that the destination is not listening
- It may also be due to network problems
- You can issue a "lsnrctl status" at the host to see if the listener is running
- This may be related to the entry in the tnsname.ora file
(2) ORA-12535 "TNS: operation timed out"
- You should try tnsping first to start the diagnosis
- Look at the listener log on the host for more clues
- Check the sqlnet log files on the client side
- Run a statspack report, and check Top 5 timed events
Luckily the problem did not last long in production. The trouble ticket itself got closed out. But our DBA Team has an action item to determine why the problem happened, and prevent this from happening in the future.
SET column1=500, column2=column1;
Her plan was to reorder the two assignments in the SET clause. And the reasoning was that maybe Oracle would sometimes assign column1 to a value of 500, then assign column2 to that same value due to this UPDATE. Now this sounded quite preposterous to me. I told her that the switch in order would have no material change in the results, ever. I tried to be patient. I explained that Oracle would always take the values prior to the update, and then use those whenever an assignment was made referencing one of the columns in the table. She thought she could do some Googling on the Internet to find an example where somebody ran into this problem. She already searched our code and found places where the order in the SET clause was switched around.
I told the coworker that she could try to ship out the code change. It would make no difference in the results. And we would look a little funny if we tried to pass that off as the solution to the problem. At the very least I would have hoped one of our DBAs would have protested. Heck. I determined I should protest a little louder. Nothing like a little education to make a good impression though. So I looked up the Oracle documentation on the UPDATE statement. We are using Oracle 9i on the back end. So I found what I needed in the PL/SQL Programmer's reference for Oracle 9i. It defines columns on the right hand side in such UPDATE statements as referring to the old values of the columns in the row. Now I guess she could make the argument that Oracle states this is the case but it not following through. But that statement is probably as likely as stating that sometimes Oracle just forgets to follow through on the whole update. I was pleased when I got a message back thanking me for the information.
The search goes on for the source of the strange customer problem. I have already provided the customer with the grim projection that we shall not fix this problem any time this week. It might be time to guide this developer to tried and true software maintenance methodologies. A fitting example would be that you cannot fix a problem until you can reproduce it. The justification is that you can never know whether your fix works if you do not know how to make the problem happen in the first place. Another idea I had was that maybe it was time for a little Oracle database training in this case. I have taken a class or two on Oracle technology. But I have also read a ton of books and a lot of other information on the web. However I think my best instructor has been many years of hands-on maintenance in front of a huge Oracle database. Sometimes I overlook this experience because I take it for granted.
Another company has won the software maintenance contract for our project. So we are transitioning out. Our OWB developer has already left the project. We are planning to turn the OWB source code over to the new contractor. They were looking for some OWB training from us. However we were only obligated to provide them with the source code. It is not in our statement of work to provide technical training to our replacement. This is just tough luck.
If I were working for the winning contractor, I would take these OWB projects and move the code into something more common. Personally I would recommend straight Oracle PL/SQL and maybe some Korn shell scripts. You don't want to be locked into some little used tool that nobody knows how to use. This is a recipe for trouble. Heck. You could even convert the projects to a common programming language such as Java. The problem is that there is not a strong business case for this effort. In the end, the customer will just get something that does what the existing OWB project does. No net benefit other than ease of maintenance. I wonder what the future holds for this annual loading software.
When the problem happened a second time, the application development team had to step up to the task. This time around the DBA Team analyzed the Oracle trace files and found that there was an ORA-00060 error "deadlock detected". But this did not make any sense to me. The deadlock error happened at the end of an hour long system halt. Oracle should be able to detect such a deadlock within seconds, and at the most minutes.
Development took this idea and ran with it. We determined where in the code we execute the SQL that was reported to be in deadlock. Then we analyzed how the application can cause that line of code to be in a deadlock situation. The analysis was not too conclusive. We did some tests to try to reproduce the problem with this knowledge. But we were unsuccessful.
I scanned a database table where our application logs exceptional events. Strangely enough I found a lot of ORA-03113 errors "end of file on communications channel". This by itself does not indicate anything. When the system is hung, anybody can kill our application and cause their session to terminate. However they were coming from the same stored procedure in the database. I inspected this stored procedure and found some very suspicious code which does the following:
- Declares a cursor with a SELECT ... FOR UPDATE clause
- Loops through the cursor
- Updates each record
- Performs a COMMIT and the end of all the updates
Now it appears that multiple users were all hung on this stored procedure. I assumed that the first user to open the cursor would get a lock on all the records (due to the FOR UPDATE). However maybe Oracle only locks the records as the records are fetched from the cursor one by one. And if so, suppose all users were selecting the same records. Furthermore let us assume that some users locked some records, while others locked the rest. This would indeed be a deadlock situation. But I would still expect that Oracle could recognize this and give everybody except the first user an ORA-00060 and kill their locks.
My plan for now is to have a developer simulate the situation where multiple users all execute this stored procedure at the same time. In addition, I want there to be a lot of records to update, thereby increasing the odds that we can conflict with each other's transactions. If this turns out to be the problem, I will (1) get the database guys to configure Oracle to detect this easily, and (2) rewrite the stored procedure to get the locks one record at a time.
UPDATE table1 one
WHERE table1.ROWID IN
FROM table1 innerone,
WHERE innerone.id = two.id
Now I wanted to limit the subquery to only choose those rows that had the code column of table1 equal to 5. Therefore I added another criteria to the inner query WHERE clause. My natural instinct was to use the alias from the inner query itself like this:
UPDATE table1 one
WHERE table1.ROWID IN
FROM table1 innerone,
WHERE innerone.id = two.id
AND innerone.code = 5
My reasoning was that it would be faster to reference just the inner subquery tables. I did not want to reference the outer query table, as that would make it a correlated subquery. However another developer came in behind me and changed my code to reference the outer query table alias:
UPDATE table1 one
WHERE table1.ROWID IN
FROM table1 innerone,
WHERE innerone.id = two.id
AND one.code = 5
Shouldn't these both equate to the same results? This other developer is out of the office now. But I am hoping to get an explanation when he gets back. What do you think?
Some consultants came up with a plan. They would fly out to the user's site. And they would see what the user was experiencing. Then at the same time an Oracle performance consultant would run some tools on the back end database to monitor what was going on. This would identify the problem, or at least the problem area, if in fact the problem exists on the database side. We make a couple calls to some stored procedures in a PL/SQL package. One of these calls could be hanging. I really hope this is not some weird hang up on the workstation. Because then this consultant test will give us no new information (other than they could not locate the problem).
Our requirements team documented the needed changes as best they could. And I questioned them mercilessly until they got all the details from the customer. Then I went off and coded up a number of changes to meet the specification as I understood it. Unfortunately I made some incorrect assumptions about the values set by some other parts in the system. As soon as this code hit production, the trouble tickets started coming in.
As this was my code, I jumped in and determined what was wrong and coded up some PL/SQL code changes. Our process requires us to conduct unit testing. So I doctored up some data to cover all the test cases. There was only one problem. The code I implemented is only supposed to be run on Wednesdays. And today is a Thursday. No problem. I just modified the code to run every day while I did my unit tests.
Another part of our process is to document all changes so that they can be peer reviewed. Often times this seems like unnecessary overhead. But today it saved the day. As I was generating the documentation, highlighting the portions that I changed, I noticed that the code was written to run every day. Oops. Luckily I caught it before peer review, and more importantly, before we shipped this code to Production. If that had gone through there would have been all kinds of fireworks. We do have other safeguards such as independent testings. But I am not sure they would have caught this given the rush on the fix.
I can safely say that I was saved today by our standard development process. Maybe I also need to look into other ways to test "Wednesday only" functionality without hacking up the code. What do you think?
Here are some more details on the problem. We have a main table with read-only data. It has a lot of records (maybe 30 or 40 millions records and growing). The application makes use of a second database table containing meta data with around 10,000 records. Unfortunately there were multiple problems with the meta data. Some of the records did not belong in the table. And other records were missing.
Given such small amount of time to write a fix, I just hacked out a brute force method. I removed all the meta data records. Then I went through the 40 million records in the source data and reconstructed the meta data table. I did some quick performance tests and they looked ok. So I am planning to ship the fix as is. The main driver of this design was the limited development time. Had I more time to work on the script, I could have built in some smarts which only added and removed a small amount of records. However this would have taken much longer to code and test.
This is usually the trade off in software maintenance. I hope the script runs fast enough when it gets deployed to Production.
I got a trouble ticket stating a new computed value was not being set. After checking the production data, I found that this was true. Found out that the wrong version of the package was compiled into the database. So I spoke with the team who controls the PL/SQL package install. Sure enough the instructions were correct in thier release notes. But the wrong version was compiled in. The immediate problem could be fixed by compiling the correct version of the code. But the total fix required going back and recomputing the values for all the data loaded so far. We got the computation wrong for about 15 million records.
My first cut of a script to fix the existing data was to run a cursor through the 15 million records. Then I would execute a bunch of logic to figure out which records needed updating. Then I would perform the update for each record. After completing version 1, I worried that this script might take a long time to run. So I spent some time looking at the Production data. And I was able to determine that I could put some conditions in my cursor that would bring the total record count down to about 40 thousand. Now I was on the right path.
Unfortunately my first set of unit tests on random data showed I had some bugs. My code that retrieved the data to determine the computations made some wrong assumptions. So I was trying to SELECT some data that had no records. In my early days as a programmer, I would just first SELECT the COUNT, then SELECT the data if the COUNT was greater than 0. But this is doing double work. So instead I coded a CURSOR which opened and fetched. If I got any data from the fetch I did my computations. No need for a separate COUNT. I checked for the existence of the data, and retrieved all in one step. This can speed things up when you are dealing with a lot of data.
40 thousand records is not huge. But it is big enough that you want to consider performance enhancing features. The only drawback to my work was that it took up a good deal of development time. Now is the busy season for my customer. So I had to wrap up this task and move on to the next problem to solve. I am glad that I forced myself to spend a little extra time to get the solution right. In fact while looking at the Production code to get some ideas for acceleration, I noticed some other related problems that I solved in the same script. I'd say I earned my pay today.
The users were so busy that at first I could not get any information about the problem. So I just started working on other issues. But then some important customers were saying these problems were really holding them back. So I tracked down some users and got some info from them.
Apparently some nightly jobs were not inserting records as expected. At first glance, it looked like records were being inserted. But talking as I talked with a few users, I found that not all expected records were being populated. I had a hard time tracing the actual code for this nightly job. Last year somebody made some changes and did not check them into source control. Shame on them.
I had to go back to our delivery of the changes for the code last year to get a hold of the source. Luckily I found one pattern for the missing data. An error code was being stored in a look up table. This was the start of a break through. The only problem was that the source code showed that the problems were supposed to be logged each time the error code was set. I could find no evidence of such logs in production.
Next step was to try to replicate the problem. This was harder than it seemed. The nightly job was something we were not set up to test. Somebody once had an idea that we should have a test bed. But it never got further than an idea. I had a DBA help out and try to start creating users to match production. No luck. I got bold and ran a few tests in production. This also proved to be a dead end.
By now I had to resort to an old fashioned technique. I went to trace the code and results by hand. Went through an example of some data which got an error code set. Kept building up the huge SQL statement that actually inserts the data. Then I got an Oracle error. Apparently the users had entered some SQL into the system that did GROUP BY and ORDER BY operations. The code tries to append WHERE clause data at the end. But this cannot come after the GROUP BY or ORDER BY in a SQL statement. I got the users to clean up their SQL.
In retrospect, this ordeal made me realize a few things. We need an adequate test bed. We need production to log errors diligently (still working on this). And finally the software should not allow users to enter data that will result in middle of the night errors. Time to implement some validation.
Apparently a script was not working correctly in development. It worked fine in Production. But a developer needed to do some tests in development and things were going weird. There was a column IDENTIFIER in a table defined as VARCHAR2(5). And they tried to run a query like this:
The problem was that the count was returning 0. However there were many records that had the specific identified the WHERE clause was comparing. This was just blowing the mind of the developer. Surprisingly when he removed the single quotes from the WHERE clause, the count came back correctly.
We tried a number of experiments to determine the source of the problem. For example, we inserted a new row with the same identified. Then the count returned 1 (not 51 as we expected). However updating all 51 records setting the identifier to the quoted value did not change the query results. Somebody thought maybe this was due to a bad index on the column. But this column was not a part of any index. We also tried creating another database table with a similarly defined column, but we could not replicate the problem with this other table.
My conclusion was that something went wrong when these 50 records were put into the table. I could not explain why. We loaded the data doing a sql*load like we always do. For now to get the fix out the developer changed the source code to not use quotes. This is just a hack until we can get a grip on this perplexing issue. Any thoughts?
Our database is a very large one. And the data keeps growing. So the number crunching job kept getting longer and longer. I asked for help from the performance engineering team at our company. Their first idea was a divide and conquer method. So I wrote some korn shell scripts that run 10 jobs in parallel, crunching portions of the data at the same time. There was not a 10 for 1 speed improvement. But it was close. We had a very happy system administration staff working at night because now our jobs finished quickly.
When we started up the system the following year, the number crunching routine slowed back down to old levels. Our chief DBA assumed that we somehow deployed the old code. I had him check the Production korn shell and PL/SQL packages. The correct code was there. So I called in the performance engineering team. They monitored the job the next night. The job finished in record time. Performance engineering said perhaps the gathering of statistics first was the key. Whatever the reason, I am glad we are back to quick speeds.
I found that those applications which used OLE DB were able to respond to errors intelligently. We threw different errors when bad passwords were chosen. This was a breeze to catch using OLE DB.
For most of the generic errors we were able to handle the situation with applications using Pro*C. However some errors contained larger error text. There were multiple ORA error numbers embedded in the text. With Pro*C, we were getting the from the SQLCA (SQL Communications Area). But the error text in this structure is limited to 70 characters. There were scenarios where we needed access to more text.
In the end we implemented a hack whereby the Pro*C code duplicated the password verification logic so we were not stuck needing more than 70 characters of error text. I seriously hope we can redesign this hack in the next go around. Perhaps we shall port all the Pro*C access to use OLE DB. Or maybe we can dig deeper to find a way to resolve this in Pro*C.
This means a new project and a new database. In my case, it will be IBM DB2. I know nothing about DB2. So it is time to hit the books. I am a little sad at leaving Oracle development. I have built up quite a library of Oracles books that I have read cover to cover.
Here are some of the titles:
- Oracle PL/SQL Programming
- Oracle Performance Tuning
- Oracle SQL*Plus
- Oracle Database Administration
- Oracle Essentials
Luckily my company has an online books service. So I am already checking out some DB2 titles.