This week our users complained that the system had stopped responding. It was the second time that this had happened. Our
DBA Team and Performance Engineering Team had some ideas about the problem. But in the end their analysis said only that a strange process had run on the server and consumed all of the CPU. This answer did not inspire much confidence in the user community.
When the problem happened a second time, the application development team had to step up to the task. This time around the
DBA Team analyzed the Oracle trace files and found an ORA-00060 error ("deadlock detected"). But this did not make any sense to me. The deadlock error happened at the end of an hour-long system halt. Oracle should be able to detect such a deadlock within seconds, or within minutes at the most.
Development took this idea and ran with it. We determined where in the code we execute the
SQL that was reported to be in deadlock. Then we analyzed how the application could cause that line of code to end up in a deadlock. The analysis was inconclusive. We ran some tests to try to reproduce the problem with this knowledge, but we were unsuccessful.
I scanned a database table where our application logs exceptional events. Strangely enough, I found a lot of ORA-03113 errors ("end-of-file on communication channel"). By itself this does not indicate anything: when the system is hung, any user can kill our application, which terminates their session. However, these errors were all coming from the same stored procedure in the database. I inspected this stored procedure and found some very suspicious code, which does the following:
- Declares a cursor with a SELECT ... FOR UPDATE clause
- Loops through the cursor
- Updates each record
- Performs a COMMIT at the end of all the updates
Now it appears that multiple users were all hung on this stored procedure. I had assumed that the first user to open the cursor would get a lock on all the records (due to the FOR UPDATE). However, maybe Oracle only locks the records one by one as they are fetched from the cursor. If so, suppose all users were selecting the same records. Furthermore, let us assume that some users locked some of the records while others locked the rest. This would indeed be a deadlock situation. But I would still expect Oracle to recognize this, give everybody except the first user an ORA-00060, and release their locks.
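To make the hypothesis above concrete, here is a minimal Python sketch of the wait-for reasoning. It is a toy model and not Oracle's actual lock manager. The names `find_deadlock`, `holders`, and `waiting_for` are made up for illustration. It assumes that rows are locked one by one as they are fetched, and that two sessions visit the same rows in different orders. If every session fetched the rows in the same order, the second session would simply queue behind the first and no cycle could form.

```python
from typing import Dict, List, Optional


def find_deadlock(holders: Dict[str, str],
                  waiting_for: Dict[str, str]) -> Optional[List[str]]:
    """Look for a cycle in a wait-for graph.

    holders:     row id -> session that currently holds the lock on that row
    waiting_for: session -> row id the session is blocked on
    Returns the sessions that form a cycle, or None if there is no cycle.
    """
    for start in waiting_for:
        path = [start]
        current = start
        # Follow the chain "current waits for a row held by owner".
        while current in waiting_for:
            owner = holders.get(waiting_for[current])
            if owner is None:
                break  # the row is free, so the chain ends here
            if owner in path:
                return path[path.index(owner):]  # we came back around: a cycle
            path.append(owner)
            current = owner
    return None


# Hypothetical interleaving: session A fetched r1 first, session B fetched r2 first.
holders = {"r1": "A", "r2": "B"}
waiting_for = {"A": "r2", "B": "r1"}  # each now needs the row the other holds
print(find_deadlock(holders, waiting_for))

# Same fetch order for both sessions: B just waits behind A, no cycle.
print(find_deadlock({"r1": "A"}, {"B": "r1"}))
```

In this model the first call reports a cycle between the two sessions, while the second reports none, which is why the order in which each user reaches the records matters for the hypothesis.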
My plan for now is to have a developer simulate the situation where multiple users all execute this stored procedure at the same time. In addition, I want there to be a lot of records to update, thereby increasing the odds that the transactions conflict with one another. If this turns out to be the problem, I will (1) get the database guys to configure Oracle to detect this more readily, and (2) rewrite the stored procedure to get the locks one record at a time.
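Before building the real multi-user test, the idea behind the rewrite can be checked with a small model. The Python sketch below is only a toy and says nothing about how Oracle itself behaves. The function `can_deadlock` and its parameters `orders` and `hold_until_commit` are hypothetical names. It explores every interleaving of a few sessions and reports whether any of them can reach a state where nobody can move. With `hold_until_commit=True` a session keeps every lock until it has processed all its rows (my reading of the current procedure under the fetch-by-fetch assumption). With `hold_until_commit=False` a session releases each lock before asking for the next one (the proposed rewrite).

```python
from typing import Sequence, Tuple


def can_deadlock(orders: Sequence[Tuple[str, ...]], hold_until_commit: bool) -> bool:
    """Exhaustively search all interleavings for a stuck state.

    orders[i] is the order in which session i locks its rows
    (rows within one session are assumed to be distinct).
    """
    # Per-session state: (index of next row to lock, rows currently held).
    start = tuple((0, frozenset()) for _ in orders)
    seen = set()
    stack = [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        locked = set().union(*(held for _, held in state))
        moves = []
        unfinished = False
        for i, (pos, held) in enumerate(state):
            order = orders[i]
            if pos == len(order) and not held:
                continue  # this session has finished and committed
            unfinished = True
            if held and (pos == len(order) or not hold_until_commit):
                new = (pos, frozenset())            # commit: release held locks
            elif order[pos] in locked:
                continue                            # blocked on someone else's row
            else:
                new = (pos + 1, held | {order[pos]})  # lock the next row
            moves.append(state[:i] + (new,) + state[i + 1:])
        if unfinished and not moves:
            return True  # work remains but every session is blocked
        stack.extend(moves)
    return False


opposite = [("r1", "r2"), ("r2", "r1")]
print(can_deadlock(opposite, hold_until_commit=True))
print(can_deadlock(opposite, hold_until_commit=False))
```

One design consequence worth keeping in mind: releasing each lock with its own commit means the updates are no longer a single all-or-nothing transaction, so a failure partway through would leave some records updated and others not.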