TNS Problems

Our production users were getting Oracle errors that had to do with TNS problems. I passed these on to our DBA Team. However I did some research on them and thought I would share some findings.

(1) ORA-12203 "TNS: unable to connect to destination"
  • Potential cause is that the destination is not listening
  • It may also be due to network problems
  • You can issue a "lsnrctl status" at the host to see if the listener is running
  • This may be related to the entry in the tnsname.ora file

(2) ORA-12535 "TNS: operation timed out"

  • You should try tnsping first to start the diagnosis
  • Look at the listener log on the host for more clues
  • Check the sqlnet log files on the client side
  • Run a statspack report, and check Top 5 timed events

Luckily the problem did not last long in production. The trouble ticket itself got closed out. But our DBA Team has an action item to determine why the problem happened, and prevent this from happening in the future.

Order in an Update Statement

I got the most unusual e-mail from a coworker today. She was trying to research a very difficult software bug in our system. The trouble was that she could not reproduce the problem. This happens to the best of us from time to time. However she was not on the hunt for things in the code which might be the cause of the problem. This is a very dangerous journey for even the best of software engineers. Her analysis and e-mail to me was concern over a SQL statement like this:

UPDATE tablename
SET column1=500, column2=column1;

Her plan was to reorder the two assignments in the SET clause. And the reasoning was that maybe Oracle would sometimes assign column1 to a value of 500, then assign column2 to that same value due to this UPDATE. Now this sounded quite preposterous to me. I told her that the switch in order would have no material change in the results, ever. I tried to be patient. I explained that Oracle would always take the values prior to the update, and then use those whenever an assignment was made referencing one of the columns in the table. She thought she could do some Googling on the Internet to find an example where somebody ran into this problem. She already searched our code and found places where the order in the SET clause was switched around.

I told the coworker that she could try to ship out the code change. It would make no difference in the results. And we would look a little funny if we tried to pass that off as the solution to the problem. At the very least I would have hoped one of our DBAs would have protested. Heck. I determined I should protest a little louder. Nothing like a little education to make a good impression though. So I looked up the Oracle documentation on the UPDATE statement. We are using Oracle 9i on the back end. So I found what I needed in the PL/SQL Programmer's reference for Oracle 9i. It defines columns on the right hand side in such UPDATE statements as referring to the old values of the columns in the row. Now I guess she could make the argument that Oracle states this is the case but it not following through. But that statement is probably as likely as stating that sometimes Oracle just forgets to follow through on the whole update. I was pleased when I got a message back thanking me for the information.

The search goes on for the source of the strange customer problem. I have already provided the customer with the grim projection that we shall not fix this problem any time this week. It might be time to guide this developer to tried and true software maintenance methodologies. A fitting example would be that you cannot fix a problem until you can reproduce it. The justification is that you can never know whether your fix works if you do not know how to make the problem happen in the first place. Another idea I had was that maybe it was time for a little Oracle database training in this case. I have taken a class or two on Oracle technology. But I have also read a ton of books and a lot of other information on the web. However I think my best instructor has been many years of hands-on maintenance in front of a huge Oracle database. Sometimes I overlook this experience because I take it for granted.

OWB Skills

Our project at work has a bunch of data that we load into the system on a yearly basis. In the past, we had some UNIX shell scripts and C programs which handled this load. Over time the code was ported to Oracle Warehouse Builder (OWB).

Another company has won the software maintenance contract for our project. So we are transitioning out. Our OWB developer has already left the project. We are planning to turn the OWB source code over to the new contractor. They were looking for some OWB training from us. However we were only obligated to provide them with the source code. It is not in our statement of work to provide technical training to our replacement. This is just tough luck.

If I were working for the winning contractor, I would take these OWB projects and move the code into something more common. Personally I would recommend straight Oracle PL/SQL and maybe some Korn shell scripts. You don't want to be locked into some little used tool that nobody knows how to use. This is a recipe for trouble. Heck. You could even convert the projects to a common programming language such as Java. The problem is that there is not a strong business case for this effort. In the end, the customer will just get something that does what the existing OWB project does. No net benefit other than ease of maintenance. I wonder what the future holds for this annual loading software.

Slow System

This week our users complained that the problem system had stopping responding. It was the second time that this had happened. Our DBA Team and Performance Engineering Team had some ideas about the problem. But in the end their analysis said that a strange process ran on the server that consumed all of the CPU. This answer did not generate a lot of faith from the user community.

When the problem happened a second time, the application development team had to step up to the task. This time around the DBA Team analyzed the Oracle trace files and found that there was an ORA-00060 error "deadlock detected". But this did not make any sense to me. The deadlock error happened at the end of an hour long system halt. Oracle should be able to detect such a deadlock within seconds, and at the most minutes.

Development took this idea and ran with it. We determined where in the code we execute the SQL that was reported to be in deadlock. Then we analyzed how the application can cause that line of code to be in a deadlock situation. The analysis was not too conclusive. We did some tests to try to reproduce the problem with this knowledge. But we were unsuccessful.

I scanned a database table where our application logs exceptional events. Strangely enough I found a lot of ORA-03113 errors "end of file on communications channel". This by itself does not indicate anything. When the system is hung, anybody can kill our application and cause their session to terminate. However they were coming from the same stored procedure in the database. I inspected this stored procedure and found some very suspicious code which does the following:
  • Declares a cursor with a SELECT ... FOR UPDATE clause
  • Loops through the cursor
  • Updates each record
  • Performs a COMMIT and the end of all the updates

Now it appears that multiple users were all hung on this stored procedure. I assumed that the first user to open the cursor would get a lock on all the records (due to the FOR UPDATE). However maybe Oracle only locks the records as the records are fetched from the cursor one by one. And if so, suppose all users were selecting the same records. Furthermore let us assume that some users locked some records, while others locked the rest. This would indeed be a deadlock situation. But I would still expect that Oracle could recognize this and give everybody except the first user an ORA-00060 and kill their locks.

My plan for now is to have a developer simulate the situation where multiple users all execute this stored procedure at the same time. In addition, I want there to be a lot of records to update, thereby increasing the odds that we can conflict with each other's transactions. If this turns out to be the problem, I will (1) get the database guys to configure Oracle to detect this easily, and (2) rewrite the stored procedure to get the locks one record at a time.