2014/03/06

Unintended, but interesting consequences

It's interesting how from time to time something happens that makes sense and seems logical afterwards, but at the time it causes a bit of a surprise.  Part of the fun of working with this type of software!

A few days ago we had an incident in an Oracle DW database when a developer tried to load an infinitely big file from a very large source.  Yeah, you got it: big-data-ish! 

Suffice to say: 180GB and still going before it hit the limits I've set for that particular tablespace.
Yes, I do use auto-extend. It's a very effective way to ensure things don't get out of control while still allowing for the odd mis-hap without any major consequences.  Particularly when the storage is a SAN and any considerations about block clustering and disk use distribution are mostly irrelevant.

And no: if carefully setup it really does not impact performance that much.  Yes, I know it CAN be proven in certain circumstances to cause problems.  "can" != "must", last time I looked!

Anyways, the developer did the right thing in that the table was loaded without indexes, no-logging, etcetc.  It just ran out of space, due to my tight control of that resource.

So the developer did the next best thing: he killed the session and waited for the rollback.
And waited...
And waited...
...waited...

You got it: one of those "never-ending" single-threaded rollbacks that Oracle goes into in some circumstances. The tablespace is locally managed and uniform size 40MB, so there were quite a few things pending for the rollback to go through.

No biggie, seen a LOT worse!  But of course, it needed attention. And I didn't really want to crash the instance and do a startup parallel rollback - read about that here .

Upon investigation, I noticed that indeed the session was marked "KILLED" and upon checking v$transaction (yes, I know about v$longops: that doesn't work in this case!) it looked like we were up for another 4 hours at least of single-threaded rollback!

Pain!  And of course we needed the space after the rollback returned the tablespace to nearly empty, to run other tasks of the DW.

I also noticed that the session's background process (Unix, single-thread dedicated) was still there.

Now, when a session gets killed the background process eventually goes away once the rollback is done.
In this case, it of course didn't - rollback still going.

But it wasn't clocking up any CPU time.  If it was doing anything, it'd have been totally I/O bound.  PMON and DBWR on the other hand were quite busy.  To be expected - that is the correct behavior for these conditions.

Trouble is: PMON and DBWR do this in a single-thread fashion.  Which takes FOREVER!

Now, here comes the interesting bit.

I decided based on past experience to actually kill the original background process.  As in "kill -9" after getting its "pid" through a suitable v$session/v$process query.  This, because in theory SMON should then take over the session and roll it back, freeing up any contention around PMON and DBWR.

Now, SMON is ALSO the process that does this if the database crashes and a startup rollback takes place. The thing is: it does it in parallel to make the startup faster, if one has FAST_START_PARALLEL_ROLLBACK set to LOW or HIGH. We have it set to LOW. My hope was that it would also do a parallel recovery here.

And indeed it did!  While watching the proceedings via OEM monitors, I saw the same session number show up in a totally different colour (pink) in the Top Activity screen, this time associated with the SMON process.  After clicking on the session, I saw that a number of PQ processes were actually active associated with it!

And guess what?  The rollback finished in around 30 minutes.  Instead of 4 hours!

Now, THAT is what I call a somewhat interesting outcome.  I did not know for sure SMON would do a "startup recovery" with full parallelism.  My hope was it would take over DBWR work and do a partial parallel rollback.  As it turns out, it did a LOT better than that!

And that is indeed something to be happy about!

Now, before anyone transforms this into another "silver bullet" to be performed everytime a rollback of a killed session takes too long:


  1. This worked for Aix 7.1 and 11.2.0.3 patched up. Don't go around trying this in your 7.3.4 old db!!!
  2. It was a decision taken not lightly, after consulting the manuals and checking with the developer the exact conditions of what the program had done and how and for how long.
  3. This database is under full recovery control with archive logging.  Worst came to worst, I could just lose the tablespace, recover it online from backup and roll it forward to get things back into shape without major outages of the whole lot.

As such, think before taking off on tangents off this. If I get some free time, I'll try and investigate a bit more about it and how far it can be taken.  But given the current workload, fat chance!  So if anyone else wants to take over the checking, be my guest: there is enough detail above to construct a few test cases and it doesn't need to be 180GB! A third of that should be more than enough to already cause a few delays!

Anyways, here it is for future reference and who knows: one day it might actually help someone?

Speaking of which: our Sydney Oracle Meetup is about to restart its operations for this year.  We'll be trying as hard as usual to make it interesting and fun to attend.  And we need someone to replace Yury, who is leaving us for the Californian shores.  So, if you like to contribute to the community and want to be part of a really active techno-geek group, STEP UP!


Catchyalata folks, and keep smiling!

2014/03/04

Latest for the folks who have to deal with Peoplesoft

Dang, been a while since the last posts!  A lot of water under the bridge since then.

We've ditched a few people that were not really helping anything, and are now actively looking at cloud solutions, "big data" use, etcetc.

Meanwhile, there is the small detail that business as usual has to continue: it's very easy to parrot about the latest gimmick/feature/funtastic technology that will revolutionize how to boil water.
But if the monthly payroll and invoicing and purchasing fail, I can promise you a few HARD and concrete financial surprises!

Speaking of payroll...

A few years ago I made some posts here on the problem with PAYCALC runs in Peoplesoft and how hard it is to have correct CBO statistics for the multitude of scratchpad tables this software uses to store intermediate results during a pay calculation run - be it daily, weekly or monthly.  Those posts are regularly visited and comments added to them by folks experiencing the same problem.

(Fancy that!  A problem that is common to others and not caused by yours truly  - the "bad dba", according to marketeers from so many companies hellbent on selling us con-sultancy!)

This lack of statistics is mostly caused by the fact that the software truncates the tables at the end of the pay calc process. And does not calculate statistics after re-populating them with intermediate results during the next pay cycle!

With the obvious result that no matter how often and regularly a dba recalculates statistics on these tables, they cannot possibly ever be correct at the most important time!

One way around this I mentioned at the time was to actually turn on debugging for the modules that use those tables - assuming it's easy to find them, not all folks know how to do that!

The end result is Peoplesoft code will detect the debug flag and fire off a ANALYZE just after re-populating those tables.  Not perfect ( analyze if anything is a deprecated command!) but better than no stats!   At the end of the run, the output of the debug is simply thrown away.  Slow, but much better than before!

Another way was to force estimates to be taken for those tables by the CBO.  Again, not perfect.  But still better than pay calc runtimes in the tens of hours!

Enter our upgrade last year to Peoplesoft HCM and EPM 9.2!  Apart from using the "fusion" architecture for screen handling (slow as molasses...) this release introduces a CAPITAL improvement to all pay calc runs!

What Oracle has done is very simple: from 9.2 onwards, they do NOT truncate the scratchpad tables at the end of each pay calc!

So instead of guessing/analyzing stats for empty tables, the dbas can now analyze them with significant, relevant and timely data contents, do histograms if that is their penchant, etcetcetc!  With the immediate result that instead of the previous pot luck runtime for the paycalcs, now they are reliably constant in their duration and much, much faster! In simple terms: the CBO finally has SOMETHING accurate to base its "cost" on!


Man!  You mean to tell me it took Peoplesoft to become available in Oracle cloud for them to realize what was being done before was eminently STUPID and making it impossible for any dba to have a predictable paycalc execution time?


Apparently when it became their cloud responsibility, the solution was quickly found and implemented...


Of course, if they had actually LISTENED to the frequent and insistent complaints by dbas who actually deal with their software day to day for years on end rather than playing "ace" cards, things would have perhaps progressed a lot faster!... 



But let's ignore stupidity and rejoice: there is a light at the end of the tunnel and it's not incoming!

Get those upgrade juices going, folks, and try to convince your management that 9.2 is THE release of Peoplesoft to aim at.

It'll help everyone. (Including Oracle - but I don't think their marketing is smart enough to detect that...)

Catchyalata, folks. Do me a favour: smile and have fun!