DA's-R-Us: 2007/09

2007/09/27

more on peoplesoft

Following my last post, here is some more feedback on our Peoplesoft world.

As I mentioned before, we're in the middle of a major rollout of new functionality on our payroll systems. Mostly to do with finally letting folks access their own information as well as letting managers preview and better plan for annual and other leave, reviews, promotions, new employees, etcetc.

In simple words: self-service.

A not small part of all this is to ensure we can actually cover the performance and service levels expected of such a roll-out. Hence the focus on performance and improving the current process execution time.

So out came our Loadrunner software and on went some very heavy and complex load mixes of up to 200 users at low wait times, ramping up to 500 and over later on.

Highly recommended! Particularly if one wants to have any significant values out of dbms_stats.gather_system_stats()!

Like I mentioned previously, one of the major aspects of all this is to let the payroll folks be as pre-emptive as can be with pay calculations. Something not easily done due to the many problems with that function in Peoplesoft.

In the previous post I indicated where some of these problems can be found in the customer connection site: go there to find more details in various combinations of releases.

Well, I'm glad to report we have most of these under control. There are still one or two abnormal conditions that can cause blow-outs in execution time but I feel confident the approach we are following is sound and will produce a sensible result.

One of the problems we have is that with 10.2.0.2 Oracle and 8.48.10 Peopletools, we had to disable a lot of the functionality of the optimizer in order to get reliable results in our SQL. This is done by inclusion of so-called "undocumented" initialization parameters - usually the ones prefixed by an underscore "_".

Now, let me be very clear here: the ones we added, we did so BY SPECIFIC recommendation of Oracle Support! There!, before I get the usual cacophony about "thou shalt not use undocumented parameters". Hey, we were told by Oracle to USE these ones!

While I'm here, is it a good idea to use this type of parameter unless under duress? I am with the experts on this, 100% of the way: hate to have to use them.

Why? Well, first of all: there is a reason why they are mostly undocumented and "hidden". Mostly to do with the fact that they usually disable major sections of the inherent Oracle db functionality.

This can in turn have secondary effects on other pieces of functionality! And so on, the roller-coaster goes.

Second: when one upgrades to a later version or release, which of these parameters should then be taken away?

Yes, these, and many other similar problems.

But, in this case we were told by Support to use them, so we had to.

Of course, one of the side effects of the ones we use is that large tracts of the advanced CBO functionalitty of 10.2.0.2 were disabled!

With the result that our pay calcs became wildly unstable in performance. Mostly to do with the awfull SQL locked away in the pay calc Cobol program...

An example:


INSERT INTO ps_gp_rto_trgr_wrk
    (emplid, country, trgr_event_id, trgr_effdt, trgr_create_ts, cal_run_id)
    (SELECT DISTINCT a.emplid, c.country, a.trgr_event_id, a.trgr_effdt,
            a.trgr_create_ts, a.cal_run_id
         FROM ps_gp_rto_trgr a, ps_job b, ps_gp_cal_run_dtl c
         WHERE a.cal_run_id = :1
           AND b.emplid BETWEEN :2 AND :3
           AND b.emplid = a.emplid
           AND b.gp_paygroup = c.gp_paygroup
           AND (b.pay_system_flg = 'GP'
           OR  b.absence_system_cd = 'AM')
           AND b.empl_status IN ('A', 'P', 'Q', 'U')
           AND b.effseq = (SELECT max(b1.effseq)
                    FROM ps_job b1
!->                 WHERE b1.emplid BETWEEN :4 AND :5
!->                    AND b1.emplid = b.emplid
                      AND b1.empl_rcd = b.empl_rcd
                      AND b1.effdt = b.effdt)
           AND b.effdt <= c.prd_end_dt
           AND a.trgr_status = 'I'
           AND a.emplid BETWEEN :6 AND :7
           AND a.country = :8
           AND ('N' = :9
           OR  a.emplid IN (SELECT emplid
                    FROM ps_gp_grp_list_run
                    WHERE run_cntl_id = :10
                      AND oprid = :11))
           AND ('A' = :12
           OR  'S' = :13
           AND a.emplid IN (SELECT emplid
                    FROM ps_gp_iter_trgr c
                    WHERE c.cal_run_id = a.cal_run_id
                      AND c.iter_trgr_status = 'U'
                      AND c.emplid BETWEEN :14 AND :15))
           AND NOT EXISTS (SELECT 'X'
                           FROM ps_gp_rto_trg_wrk1 e
!->                          WHERE e.emplid = a.emplid
!->                          AND e.emplid BETWEEN :16 AND :17
                             AND e.trgr_event_id = a.trgr_event_id
                             AND e.trgr_effdt = a.trgr_effdt
                             AND e.trgr_create_ts = a.trgr_create_ts
                             AND e.country = c.country))

See those little "!->" flags I put at the start of those 4 lines? Yeah, that is the problem. Now, let me try to explain what is going on.

Those subqueries are what is called "co-related". IOW: for each row of the main query, the subquery will be executed and take the CURRENT and UNIQUE values of the columns from the main query and use those for its predicates. This type of query has been used in Oracle and SQL for decades now, nothing new here.

What is new is the BETWEEN predicate added by the nice Peoplesoft folks: it is completely and utterly unnecessary as the unique value for the CURRENT emplid is ALREADY provided by the main query! No need for a range scan!

Peoplesoft has an index to facilitate that subquery that looks like this:


1 ASC EMPLID
2 ASC EMPL_RCD
3 DESC SYS_NC00163$
4 DESC SYS_NC00164$

and the two DESC columns are EFFDT and EFFSEQ, of course.

All that BETWEEN predicate is doing is confusing the CBO out of its normal path. Because the column in the BETWEEN is the leading column of a concatenated index. And given that our CBO is already brain-damaged with about half a dozen "_" parameters suggested by Support, it doesn't take much for it to go South!

Indeed that was the case. We'd have this thing cached in SGA for a while and running fine, then all of a sudden the next execution would take 30 times longer to complete!

From there onwards, it'd stay bad until we did a flush of the shared_pool, thereby forcing all future statements to do a re-parse and recalculate the execution plan. Then things would go back to normal for a while, with luck. After a few hours, BANG again...

Jonathan was kind enough to help me look at this a few days ago and the pointers and ideas in his site for further investigation were most helpful. Thanks a million guv, if you happen to read this: I owe you a drink of your choice, next you're in Sydney!

As often happens, Tom Kyte had a recent and most relevant couple of entries in his blog that are also applicable here. Do yourself a favour and read both entries and the comments.

Basically, what all pointed out is that the plan as shown for the statement WITH the bind variables is different from the plan as executed WHEN the values for the bind variables are actually given.

And different again at some later stage when other sets of variable values were used.

Good old bind-variable peeking, as Gary alluded to in a comment to my last entry. Normally, not a problem at all. Unless the values being picked cause my handicapped CBO to go South with a redundant BETWEEN predicate, that is!

And indeed that was the problem here. With some combos of bind variables we got a horrible plan that took the BETWEEN predicate as the access path instead of the filter path. That basically stopped the selectivity of the index on the first column of the concatenation, instead of using all columns.

The symptom was a huge - 400000000!- bump in the BUFFER_GETS, caused by all the index range scanning taking place for every row in the main query.

Sometimes we'd be lucky enough to get the first execution to pick a good plan and it'd stay in memory until flushed out by normal load patterns. Then we'd get a bad plan on the next execution - or good, depending on which combination of bind values would cause the CBO to favour the wrong predicate to look at the index.

Turns out this SQL uses one of those WRK tables I mentioned in the previous post. So what I did to prevent this problem was really very simple: add that table to the ones that I clean the stats from and lock away from further gather_table_stats().

That way, this plan is re-examined by the dynamic sampling every single time this single SQL is executed. As it happens, the plan picked this way is invariably one that involves a full scan of the PS_GP_RTO_TRGR table, which in turn is the one that causes the BETWEEN predicate to be ignored and all the other index columns to be used for the access path.

"Overhead of the re-parsing!", I hear you complain? In the grand scheme of things it's not really that bad: about 100 execs a day. Enough to cause trouble if each execution time is 30 minutes, but not enough to cause trouble if I force a re-parse for every execution and it finishes in seconds!

And lo and behold, our pay calc is now an example of good behaviour: longest monthly run for all employees is 7 minutes, as opposed to the previous erratic executions of up to 1 hour!

Anyways, on to lighter things.

Quick peek at one of the locals:

There are two of them nesting in the pond across the park. Amazingly graceful creatures!

These little fellas always crack me up with their rowdy antics:

They are indeed gorgeous, happy birds.

Samuel at the recent NSW regionals, smoking everyone else by about 10 metres in the 4th 100M heat:

He ended up 4th in the finals. Dang, at 12 years of age he's obliterated my best ever 100m time: I feel really old now...
:-)

Catchyalata, folks!

2007/09/13

one for the peoplesoft folks...

Yeah, we use that software. The EPY - Payroll HCM - variety.

And if you do too, you know what a RPITA the pay calc COBOL program can be...

and it doesn't help when Oracle's CBO and its bugs joins forces to make it plain unbearable!

Here is an example of what can happen and a possible approach to cope with it.

First, the disclaimers:

1- you may see this if you run Peopletools 8.48 or later and Oracle 10.2.0.2. If you don't, then by all means read on but do NOT assume it will work miracles in your case!

2- if you read the following looking for cut-and-paste "recipes" instead of as an exercise in problem resolution, don't blame me if you plonk it in and it doesn't work: I DO hope it doesn't, because you fully deserve it not to! THINK and REASON, before you blindly copy!

3- no animals were hurt while following this process. Sometimes I wish I could get my hands around the neck of whoever coded the Oracle CBO. But, I digress...

Anyways, here we go.

If you have a Peoplesoft customer connection account, search for this:

solution ID 48335, "EPY: COBOL Performance Issues: Paycalc or other COBOL jobs take too long to run"

and read the contents to see the kind of stuff we Peoplesoft Oracle dbas have to cope with...

OK, so your Peoplesoft users complain "something is wrong with Oracle" because for no apparent reason their pay calc has suddenly hung and they had to cancel it. While normally it only takes a few minutes. No reason whatsoever, just plain random pattern.

Familiar song?

Well, they are right: something is indeed wrong with your database! And it's likely the optimizer stats, the CBO, or both.

More than likely it's another hit of the "temp table stats out of whack" syndrome. I referred to it in my previous blog, in the context of our DW. But it also happens with Peoplesoft.

OK, so what is this "out of whack" then?

Well, Peoplesoft in the latest releases relies on temporary "work" tables, where it keeps all sorts of intermediate results before it does a final pass. These tables have names usually terminated in the "WRK" string. In Peopletools 8.48 EPY, there are 118 of them.

They usually have 0 rows: the Peoplesoft programs religiously truncate them at the start, populate them, use them for all sorts of joins, then truncate them again at the end.

Nice!

Or is it?...

Well, in 10g you have an automatic stats gathering process that kicks in every night after 10pm and all through the weekend, don't you?

It politely looks for "stale stats" and tables with no stats and analyzes them for you. All well and good. The Peoplesoft Pay Calc processes by then will be mostly idle and the work tables will have 0 rows, and that is noted in the stats columns!

But when the COBOL programs next populate these tables, they won't be 0 rows anymore, will they? But does the CBO know about that? Of course not: them stats say 0, don't they?

And do the COBOL programs run an analyse at the end of populating the tables? Noooooooo, of course not: gotta stay platform-agnostic and analyze is "Oracle only", isn't it?

Recipe for disaster starting to take shape...

Still, not too bad. Most of the time.

But then the odd little bug in the CBO creeps in... Read on.

Not all of these tables are used by the pay calc program, of course. Only 4 are hit hard by this program:

PS_GP_CAL_TS_WRK, PS_GP_PYE_HIST_WRK,
PS_GP_PYE_RCLC_WRK and PS_GP_PYE_STAT_WRK

How do I know that? Well, that is the subject for another blog. Suffice to say I do know these are touched by various processes associated with the COBOL pay calc.

The CBO being that model of intelligence we know about, does its job of optimizing the SQL used by the COBOL programs.
And most of the time indeed it is a great job: it runs like greased lightning!

Until, that is, you hit this sort of problem(click on images to see details):

Nice, eh? A few hundred million buffer gets for a pittance of disk reads, on a statement that normally would take a fraction of a second to run...

Armed with your fresh knowledge of things 10gr2, you know that you can run an explain plan on a running statement, using the dbms_xplan.display_cursor PL/SQL standard package. All you have to do is give it the SQL-ID above and bang: it should show you the problem, shouldn't it?

Yeah, that is the theory.

Then, there is the bug...

So, let's try it:

Beg yours? This plan results in 136 million buffer gets, with source tables with only a few thousand rows including the WRK one? Hellloooo, bug?...

Well: big problem, isn't it? Something is obviously gone seriously wrong with the "infallible CBO - so good, we don't need RULE anymore"...

But the worse problem is this: the pay calc COBOL program is locked in, you can't change it. And it is run ad-hoc, by users, whenever they need to. So you can't really schedule an analyze at the right moment, can you?

Whatever you do has to be done outside the program. All the usual suggestions of "slap a RULE hint on it", or "run the stats gathering before you run the temp table SQL" or "change the optimizer_index_caching to 90 for that session" and so many others you hear about, are in this instance worth exactly jack! Aren't they?

But we must solve this somehow. Enter cunning plan: what if we force a dynamic sample of stats every time *WRK tables are used?

Well, we could. But it's not pretty: eventually, we'd be doing a LOT of dynamic sampling - 118 of them "temp" tables... - and that can have its own disadvantages!

The solution? Well, we can remove stats from just a few tables. Namely the 4 that pay calc uses. That should make dynamic sampling work on them and just them.

But how do we stop the darn 10gr2 maintenance job at night from considering these tables for stats gathering and thereby destroying our cunning plan of "no stats to force dynamic sampling"?

Enter the "lock out gather stats" feature of 10gr2! This allows you to put a flag on a table, saying to the dbms_stats package: "stop! Out of bounds! Go away!"

So, we did just that:

and bingo, our pay calc now runs as smoothly as it was intended to!

Now, if only I could get hold of whoever let such a bug through the CBO....

Catchyalata, folks!

2007/09/06

the trouble with blind faith

10g introduced the new job scheduler, presumably to replace the previous dbms_job functionality.

I wish someone had told all RDBMS developers about that, because there are still a lot of functions inside Oracle that use the previous dbms_job. But that is not the subject of this entry.

Now, one of the new pieces of functionality that makes use of the scheduler is the "maintenance window" and its automatic optimizer statistics gathering.

This can be a very convenient piece of functionality that automatically and regularly inspects the database and determines which objects need statistics gathering and then proceeds to do just that: gather them stats.

It uses the existence - or lack - of statistics for an object as one of the criteria to trigger a stats gathering. It also uses "staleness of stats" as the other criteria, defining it by default as "more than 10% of table rows has changed therefore stats must be stale".

But like any automatic process that relies on fixed boundaries, it is only as smart as the criteria used for defining these boundaries in the first place!

It is therefore up to each dba to help it along and ensure it can do the best job possible within its constraints. And I wish Oracle would make this a lot clearer in their documentation...

Now, what possibly can go wrong with a u-beaut automagic process in-line with the "we don't need no theenking dbas" mantra?

Well, let's look at this very simple and yet very common scenario:

1- the default maintenance window kicks in at 10pm every night and goes to 6am - and all day on weekends. During that window, a gather stats job runs looking for stats to update.

2- your site is a DW that has constant - 24X7 - ETL jobs running.

3- part of these jobs involves truncating work tables, filling them up with intermediate data, applying all sorts of calculations to that new data and then aggregating results into dimension tables for later processing of fact data.

Hmmmm... So what does the gather stats job do when it hits one of these intermediate tables that, by coincidence, just had a truncate done on it?

Bingo! It analyzes it, because it has changed by more than 10% and now its stats must be "stale".

Now of course, when the ETL job starts pumping data into these tables and starts the aggregation and analysis phase, the optimizer is going to come up with some very funny execution plans, based on the assumption that these tables are empty!

If you then inspect the execution plans of the statements involved, typically you will find that any joins on these tabls will all of a sudden start using "merge join cartesian" and other nice constructs that work oh-so-efficiently on fairly loaded tables...

Of course, this is where blind faith in automation must stop and analytic thought should kick-in.

Three immediate and obvious ways of addressing this problem:

1- Change your ETL processes to re-analyze the intermediate tables immediately after re-loading them with new data. This is my preferred approach since release 7 of Oracle as it puts the solution to the problem right at the source of it, meaning I don't have to figure out complex timing or scheduling to work around the problem. This is of course only possible if you have access to the ETL code.

2- Use the "stats lockout" feature of 10g whereby you can specify a lock, object by object, that will prevent auto stats gathering jobs from acting on these objects. You can override these locks with your own execution of the stats gathering procedure, tailored to each table and timed to perfection: when the tables have significant amounts of data in them. I'm starting to like this approach more and more.

3- Remove the stats from the tables all together and rely on dynamic sampling to gather sample stats at SQL runtime. The BIG disadvantage of this method of course is that the dynamic sampling has a finite overhead and is volatile: it won't survive between SQL statements. So it will have to be done for every statement that needs to access the tables. If you have a lot of these statements, you have a lot of overhead right there. I prefer to leave dynamic sampling to where I feel it might be more useful: adhoc and volatile DSS systems, rather than deterministic, scheduled ETL jobs.

Take your pick of which one solves your case best, or a combination of them if that is the case.

But never, ever believe that blind faith in an automatic process will produce ideal results in every case.

Speaking of ETL:

This is what opal miners use to sift through the large amounts of rubble they dig out of the shafts: the lot goes through the tube and is circulated in the big barrel. The heavy stuff drops to the bottom of the barrel and is removed every few minutes. The light stuff is pumped out to the big fan and the atmosphere. Put simply: a BIG vacuum cleaner! Quite "thunderdome", don't you think?