Unintended, but interesting consequences
It's interesting how from time to time something happens that makes sense and seems logical afterwards, but at the time it causes a bit of a surprise. Part of the fun of working with this type of software!
A few days ago we had an incident in an Oracle DW database when a developer tried to load an infinitely big file from a very large source. Yeah, you got it: big-data-ish!
Suffice to say: 180GB and still going before it hit the limits I've set for that particular tablespace.
Yes, I do use auto-extend. It's a very effective way to ensure things don't get out of control while still allowing for the odd mis-hap without any major consequences. Particularly when the storage is a SAN and any considerations about block clustering and disk use distribution are mostly irrelevant.
And no: if carefully setup it really does not impact performance that much. Yes, I know it CAN be proven in certain circumstances to cause problems. "can" != "must", last time I looked!
Anyways, the developer did the right thing in that the table was loaded without indexes, no-logging, etcetc. It just ran out of space, due to my tight control of that resource.
So the developer did the next best thing: he killed the session and waited for the rollback.
And waited...
And waited...
...waited...
You got it: one of those "never-ending" single-threaded rollbacks that Oracle goes into in some circumstances. The tablespace is locally managed and uniform size 40MB, so there were quite a few things pending for the rollback to go through.
No biggie, seen a LOT worse! But of course, it needed attention. And I didn't really want to crash the instance and do a startup parallel rollback - read about that here .
Upon investigation, I noticed that indeed the session was marked "KILLED" and upon checking v$transaction (yes, I know about v$longops: that doesn't work in this case!) it looked like we were up for another 4 hours at least of single-threaded rollback!
Pain! And of course we needed the space after the rollback returned the tablespace to nearly empty, to run other tasks of the DW.
I also noticed that the session's background process (Unix, single-thread dedicated) was still there.
Now, when a session gets killed the background process eventually goes away once the rollback is done.
In this case, it of course didn't - rollback still going.
But it wasn't clocking up any CPU time. If it was doing anything, it'd have been totally I/O bound. PMON and DBWR on the other hand were quite busy. To be expected - that is the correct behavior for these conditions.
Trouble is: PMON and DBWR do this in a single-thread fashion. Which takes FOREVER!
Now, here comes the interesting bit.
I decided based on past experience to actually kill the original background process. As in "kill -9" after getting its "pid" through a suitable v$session/v$process query. This, because in theory SMON should then take over the session and roll it back, freeing up any contention around PMON and DBWR.
Now, SMON is ALSO the process that does this if the database crashes and a startup rollback takes place. The thing is: it does it in parallel to make the startup faster, if one has FAST_START_PARALLEL_ROLLBACK set to LOW or HIGH. We have it set to LOW. My hope was that it would also do a parallel recovery here.
And indeed it did! While watching the proceedings via OEM monitors, I saw the same session number show up in a totally different colour (pink) in the Top Activity screen, this time associated with the SMON process. After clicking on the session, I saw that a number of PQ processes were actually active associated with it!
And guess what? The rollback finished in around 30 minutes. Instead of 4 hours!
Now, THAT is what I call a somewhat interesting outcome. I did not know for sure SMON would do a "startup recovery" with full parallelism. My hope was it would take over DBWR work and do a partial parallel rollback. As it turns out, it did a LOT better than that!
And that is indeed something to be happy about!
Now, before anyone transforms this into another "silver bullet" to be performed everytime a rollback of a killed session takes too long:
As such, think before taking off on tangents off this. If I get some free time, I'll try and investigate a bit more about it and how far it can be taken. But given the current workload, fat chance! So if anyone else wants to take over the checking, be my guest: there is enough detail above to construct a few test cases and it doesn't need to be 180GB! A third of that should be more than enough to already cause a few delays!
Anyways, here it is for future reference and who knows: one day it might actually help someone?
Speaking of which: our Sydney Oracle Meetup is about to restart its operations for this year. We'll be trying as hard as usual to make it interesting and fun to attend. And we need someone to replace Yury, who is leaving us for the Californian shores. So, if you like to contribute to the community and want to be part of a really active techno-geek group, STEP UP!
Catchyalata folks, and keep smiling!
A few days ago we had an incident in an Oracle DW database when a developer tried to load an infinitely big file from a very large source. Yeah, you got it: big-data-ish!
Suffice to say: 180GB and still going before it hit the limits I've set for that particular tablespace.
Yes, I do use auto-extend. It's a very effective way to ensure things don't get out of control while still allowing for the odd mis-hap without any major consequences. Particularly when the storage is a SAN and any considerations about block clustering and disk use distribution are mostly irrelevant.
And no: if carefully setup it really does not impact performance that much. Yes, I know it CAN be proven in certain circumstances to cause problems. "can" != "must", last time I looked!
Anyways, the developer did the right thing in that the table was loaded without indexes, no-logging, etcetc. It just ran out of space, due to my tight control of that resource.
So the developer did the next best thing: he killed the session and waited for the rollback.
And waited...
And waited...
...waited...
You got it: one of those "never-ending" single-threaded rollbacks that Oracle goes into in some circumstances. The tablespace is locally managed and uniform size 40MB, so there were quite a few things pending for the rollback to go through.
No biggie, seen a LOT worse! But of course, it needed attention. And I didn't really want to crash the instance and do a startup parallel rollback - read about that here .
Upon investigation, I noticed that indeed the session was marked "KILLED" and upon checking v$transaction (yes, I know about v$longops: that doesn't work in this case!) it looked like we were up for another 4 hours at least of single-threaded rollback!
Pain! And of course we needed the space after the rollback returned the tablespace to nearly empty, to run other tasks of the DW.
I also noticed that the session's background process (Unix, single-thread dedicated) was still there.
Now, when a session gets killed the background process eventually goes away once the rollback is done.
In this case, it of course didn't - rollback still going.
But it wasn't clocking up any CPU time. If it was doing anything, it'd have been totally I/O bound. PMON and DBWR on the other hand were quite busy. To be expected - that is the correct behavior for these conditions.
Trouble is: PMON and DBWR do this in a single-thread fashion. Which takes FOREVER!
Now, here comes the interesting bit.
I decided based on past experience to actually kill the original background process. As in "kill -9" after getting its "pid" through a suitable v$session/v$process query. This, because in theory SMON should then take over the session and roll it back, freeing up any contention around PMON and DBWR.
Now, SMON is ALSO the process that does this if the database crashes and a startup rollback takes place. The thing is: it does it in parallel to make the startup faster, if one has FAST_START_PARALLEL_ROLLBACK set to LOW or HIGH. We have it set to LOW. My hope was that it would also do a parallel recovery here.
And indeed it did! While watching the proceedings via OEM monitors, I saw the same session number show up in a totally different colour (pink) in the Top Activity screen, this time associated with the SMON process. After clicking on the session, I saw that a number of PQ processes were actually active associated with it!
And guess what? The rollback finished in around 30 minutes. Instead of 4 hours!
Now, THAT is what I call a somewhat interesting outcome. I did not know for sure SMON would do a "startup recovery" with full parallelism. My hope was it would take over DBWR work and do a partial parallel rollback. As it turns out, it did a LOT better than that!
And that is indeed something to be happy about!
Now, before anyone transforms this into another "silver bullet" to be performed everytime a rollback of a killed session takes too long:
- This worked for Aix 7.1 and 11.2.0.3 patched up. Don't go around trying this in your 7.3.4 old db!!!
- It was a decision taken not lightly, after consulting the manuals and checking with the developer the exact conditions of what the program had done and how and for how long.
- This database is under full recovery control with archive logging. Worst came to worst, I could just lose the tablespace, recover it online from backup and roll it forward to get things back into shape without major outages of the whole lot.
As such, think before taking off on tangents off this. If I get some free time, I'll try and investigate a bit more about it and how far it can be taken. But given the current workload, fat chance! So if anyone else wants to take over the checking, be my guest: there is enough detail above to construct a few test cases and it doesn't need to be 180GB! A third of that should be more than enough to already cause a few delays!
Anyways, here it is for future reference and who knows: one day it might actually help someone?
Speaking of which: our Sydney Oracle Meetup is about to restart its operations for this year. We'll be trying as hard as usual to make it interesting and fun to attend. And we need someone to replace Yury, who is leaving us for the Californian shores. So, if you like to contribute to the community and want to be part of a really active techno-geek group, STEP UP!
Catchyalata folks, and keep smiling!