We recently released a new version of our system to the customer. They needed us to fix a bug. The bug involved the application displaying too many records on some screens. The fix involved adding a new column to a table to remove the ambiguity of records. There was one final touch that was required. We needed to populate this column for all the existing records.
One of our junior developers coded up a PL/SQL script to go through all the records and determine the value for the new column. With some guidance from a senior developer, she even made the script fast by using FOR UPDATE and WHERE CURRENT OF. Good stuff. There was just one problem. The customer complained that the darn thing did not work. There was a flurry of emails trying to figure out what was wrong.
Finally I got dragged into a conference call to discuss this problem. At first a lot of people were taken aback that most of the records did not get the column populated. Finally I could not take it any more and I chimed in that we were required to not fill in all records, but only the ones where we could reasonably determined the value. The rest were to remain NULL. Another developer got ahold of the script, checked out how it handled exceptions, and searched the production database for evidence of any problems. Sure enough an exception was raised. The script caught the exception, but handled it by logging the error to a table and exiting. The shell script that called the PL/SQL script reported success and we were none the wiser.
Now we get into the blame game as to why testers and developers did not detect the problem. A DBA chimed in with some sage advice. If we are going to trap errors, then we should use DBMS_OUTPUT to pipe a message to the screen informing a DBA that the thing bombed. Then we could deal with the problem immediately. In fact, it might have been better to leave the script bomb and report the exception to the command line. Then the customer would not have been the one to discover the error.
Reproducing a Race Condition
-
We have a job at work that runs every Wednesday night. All of a sudden, it
aborted the last 2 weeks. This caused some critical data to be late. The
main ...