one for the peoplesoft folks...
Yeah, we use that software. The EPY - Payroll HCM - variety.
And if you do too, you know what a RPITA the pay calc COBOL program can be...
and it doesn't help when Oracle's CBO and its bugs joins forces to make it plain unbearable!
Here is an example of what can happen and a possible approach to cope with it.
First, the disclaimers:
1- you may see this if you run Peopletools 8.48 or later and Oracle 10.2.0.2. If you don't, then by all means read on but do NOT assume it will work miracles in your case!
2- if you read the following looking for cut-and-paste "recipes" instead of as an exercise in problem resolution, don't blame me if you plonk it in and it doesn't work: I DO hope it doesn't, because you fully deserve it not to! THINK and REASON, before you blindly copy!
3- no animals were hurt while following this process. Sometimes I wish I could get my hands around the neck of whoever coded the Oracle CBO. But, I digress...
Anyways, here we go.
If you have a Peoplesoft customer connection account, search for this:
solution ID 48335, "EPY: COBOL Performance Issues: Paycalc or other COBOL jobs take too long to run"
and read the contents to see the kind of stuff we Peoplesoft Oracle dbas have to cope with...
OK, so your Peoplesoft users complain "something is wrong with Oracle" because for no apparent reason their pay calc has suddenly hung and they had to cancel it. While normally it only takes a few minutes. No reason whatsoever, just plain random pattern.
Familiar song?
Well, they are right: something is indeed wrong with your database! And it's likely the optimizer stats, the CBO, or both.
More than likely it's another hit of the "temp table stats out of whack" syndrome. I referred to it in my previous blog, in the context of our DW. But it also happens with Peoplesoft.
OK, so what is this "out of whack" then?
Well, Peoplesoft in the latest releases relies on temporary "work" tables, where it keeps all sorts of intermediate results before it does a final pass. These tables have names usually terminated in the "WRK" string. In Peopletools 8.48 EPY, there are 118 of them.
They usually have 0 rows: the Peoplesoft programs religiously truncate them at the start, populate them, use them for all sorts of joins, then truncate them again at the end.
Nice!
Or is it?...
Well, in 10g you have an automatic stats gathering process that kicks in every night after 10pm and all through the weekend, don't you?
It politely looks for "stale stats" and tables with no stats and analyzes them for you. All well and good. The Peoplesoft Pay Calc processes by then will be mostly idle and the work tables will have 0 rows, and that is noted in the stats columns!
But when the COBOL programs next populate these tables, they won't be 0 rows anymore, will they? But does the CBO know about that? Of course not: them stats say 0, don't they?
And do the COBOL programs run an analyse at the end of populating the tables? Noooooooo, of course not: gotta stay platform-agnostic and analyze is "Oracle only", isn't it?
Recipe for disaster starting to take shape...
Still, not too bad. Most of the time.
But then the odd little bug in the CBO creeps in... Read on.
Not all of these tables are used by the pay calc program, of course. Only 4 are hit hard by this program:
PS_GP_CAL_TS_WRK, PS_GP_PYE_HIST_WRK,
PS_GP_PYE_RCLC_WRK and PS_GP_PYE_STAT_WRK
How do I know that? Well, that is the subject for another blog. Suffice to say I do know these are touched by various processes associated with the COBOL pay calc.
The CBO being that model of intelligence we know about, does its job of optimizing the SQL used by the COBOL programs.
And most of the time indeed it is a great job: it runs like greased lightning!
Until, that is, you hit this sort of problem(click on images to see details):
Nice, eh? A few hundred million buffer gets for a pittance of disk reads, on a statement that normally would take a fraction of a second to run...
Armed with your fresh knowledge of things 10gr2, you know that you can run an explain plan on a running statement, using the dbms_xplan.display_cursor PL/SQL standard package. All you have to do is give it the SQL-ID above and bang: it should show you the problem, shouldn't it?
Yeah, that is the theory.
Then, there is the bug...
So, let's try it:
Beg yours? This plan results in 136 million buffer gets, with source tables with only a few thousand rows including the WRK one? Hellloooo, bug?...
Well: big problem, isn't it? Something is obviously gone seriously wrong with the "infallible CBO - so good, we don't need RULE anymore"...
But the worse problem is this: the pay calc COBOL program is locked in, you can't change it. And it is run ad-hoc, by users, whenever they need to. So you can't really schedule an analyze at the right moment, can you?
Whatever you do has to be done outside the program. All the usual suggestions of "slap a RULE hint on it", or "run the stats gathering before you run the temp table SQL" or "change the optimizer_index_caching to 90 for that session" and so many others you hear about, are in this instance worth exactly jack! Aren't they?
But we must solve this somehow. Enter cunning plan: what if we force a dynamic sample of stats every time *WRK tables are used?
Well, we could. But it's not pretty: eventually, we'd be doing a LOT of dynamic sampling - 118 of them "temp" tables... - and that can have its own disadvantages!
The solution? Well, we can remove stats from just a few tables. Namely the 4 that pay calc uses. That should make dynamic sampling work on them and just them.
But how do we stop the darn 10gr2 maintenance job at night from considering these tables for stats gathering and thereby destroying our cunning plan of "no stats to force dynamic sampling"?
Enter the "lock out gather stats" feature of 10gr2! This allows you to put a flag on a table, saying to the dbms_stats package: "stop! Out of bounds! Go away!"
So, we did just that:
and bingo, our pay calc now runs as smoothly as it was intended to!
Now, if only I could get hold of whoever let such a bug through the CBO....
Catchyalata, folks!
And if you do too, you know what a RPITA the pay calc COBOL program can be...
and it doesn't help when Oracle's CBO and its bugs joins forces to make it plain unbearable!
Here is an example of what can happen and a possible approach to cope with it.
First, the disclaimers:
1- you may see this if you run Peopletools 8.48 or later and Oracle 10.2.0.2. If you don't, then by all means read on but do NOT assume it will work miracles in your case!
2- if you read the following looking for cut-and-paste "recipes" instead of as an exercise in problem resolution, don't blame me if you plonk it in and it doesn't work: I DO hope it doesn't, because you fully deserve it not to! THINK and REASON, before you blindly copy!
3- no animals were hurt while following this process. Sometimes I wish I could get my hands around the neck of whoever coded the Oracle CBO. But, I digress...
Anyways, here we go.
If you have a Peoplesoft customer connection account, search for this:
solution ID 48335, "EPY: COBOL Performance Issues: Paycalc or other COBOL jobs take too long to run"
and read the contents to see the kind of stuff we Peoplesoft Oracle dbas have to cope with...
OK, so your Peoplesoft users complain "something is wrong with Oracle" because for no apparent reason their pay calc has suddenly hung and they had to cancel it. While normally it only takes a few minutes. No reason whatsoever, just plain random pattern.
Familiar song?
Well, they are right: something is indeed wrong with your database! And it's likely the optimizer stats, the CBO, or both.
More than likely it's another hit of the "temp table stats out of whack" syndrome. I referred to it in my previous blog, in the context of our DW. But it also happens with Peoplesoft.
OK, so what is this "out of whack" then?
Well, Peoplesoft in the latest releases relies on temporary "work" tables, where it keeps all sorts of intermediate results before it does a final pass. These tables have names usually terminated in the "WRK" string. In Peopletools 8.48 EPY, there are 118 of them.
They usually have 0 rows: the Peoplesoft programs religiously truncate them at the start, populate them, use them for all sorts of joins, then truncate them again at the end.
Nice!
Or is it?...
Well, in 10g you have an automatic stats gathering process that kicks in every night after 10pm and all through the weekend, don't you?
It politely looks for "stale stats" and tables with no stats and analyzes them for you. All well and good. The Peoplesoft Pay Calc processes by then will be mostly idle and the work tables will have 0 rows, and that is noted in the stats columns!
But when the COBOL programs next populate these tables, they won't be 0 rows anymore, will they? But does the CBO know about that? Of course not: them stats say 0, don't they?
And do the COBOL programs run an analyse at the end of populating the tables? Noooooooo, of course not: gotta stay platform-agnostic and analyze is "Oracle only", isn't it?
Recipe for disaster starting to take shape...
Still, not too bad. Most of the time.
But then the odd little bug in the CBO creeps in... Read on.
Not all of these tables are used by the pay calc program, of course. Only 4 are hit hard by this program:
PS_GP_CAL_TS_WRK, PS_GP_PYE_HIST_WRK,
PS_GP_PYE_RCLC_WRK and PS_GP_PYE_STAT_WRK
How do I know that? Well, that is the subject for another blog. Suffice to say I do know these are touched by various processes associated with the COBOL pay calc.
The CBO being that model of intelligence we know about, does its job of optimizing the SQL used by the COBOL programs.
And most of the time indeed it is a great job: it runs like greased lightning!
Until, that is, you hit this sort of problem(click on images to see details):
Nice, eh? A few hundred million buffer gets for a pittance of disk reads, on a statement that normally would take a fraction of a second to run...
Armed with your fresh knowledge of things 10gr2, you know that you can run an explain plan on a running statement, using the dbms_xplan.display_cursor PL/SQL standard package. All you have to do is give it the SQL-ID above and bang: it should show you the problem, shouldn't it?
Yeah, that is the theory.
Then, there is the bug...
So, let's try it:
Beg yours? This plan results in 136 million buffer gets, with source tables with only a few thousand rows including the WRK one? Hellloooo, bug?...
Well: big problem, isn't it? Something is obviously gone seriously wrong with the "infallible CBO - so good, we don't need RULE anymore"...
But the worse problem is this: the pay calc COBOL program is locked in, you can't change it. And it is run ad-hoc, by users, whenever they need to. So you can't really schedule an analyze at the right moment, can you?
Whatever you do has to be done outside the program. All the usual suggestions of "slap a RULE hint on it", or "run the stats gathering before you run the temp table SQL" or "change the optimizer_index_caching to 90 for that session" and so many others you hear about, are in this instance worth exactly jack! Aren't they?
But we must solve this somehow. Enter cunning plan: what if we force a dynamic sample of stats every time *WRK tables are used?
Well, we could. But it's not pretty: eventually, we'd be doing a LOT of dynamic sampling - 118 of them "temp" tables... - and that can have its own disadvantages!
The solution? Well, we can remove stats from just a few tables. Namely the 4 that pay calc uses. That should make dynamic sampling work on them and just them.
But how do we stop the darn 10gr2 maintenance job at night from considering these tables for stats gathering and thereby destroying our cunning plan of "no stats to force dynamic sampling"?
Enter the "lock out gather stats" feature of 10gr2! This allows you to put a flag on a table, saying to the dbms_stats package: "stop! Out of bounds! Go away!"
So, we did just that:
and bingo, our pay calc now runs as smoothly as it was intended to!
Now, if only I could get hold of whoever let such a bug through the CBO....
Catchyalata, folks!
16 Comments:
I'm not 100% convinced that the problem is (solely) with the dodgy work table stats. What are those employee bind variables like ?
The CBO seems to think that PRS_PERSON would only have 1 row fitting in the boundaries which sounds more like a bind-variable peeking issue. Maybe it was run once for a single employee and then agains for a whole set.
I know there's an underscore parameter to turn peeking off, but don't know any other solutions.
Gary:
Very good point indeed. But recall that the WRK tables are truncated before each run. So what was there in prior run means nothing to the next run.
And if the current contents was affected by peeking, then dynamic sampling would be also affected?
But I'll definitely look that one up in detail and see if I can make observed behaviour fit the model.
Thanks for that!
Gary:
_optim_peek_user_binds=FALSE
is the one.
Still testing if it'll make any difference.
On a good note: we pumped 100 users with 15secs delay between hits via loadrunner into this thing and it held itself without the slightest problem. Wouldn't have dared guess that, three weeks ago...
Hi guys!
Sorry 2.5yrs later im still bugged by this :(
How about PS_GP_RTO_TRGR_WRK ?
also safe to exclude?
log files:
TABLE_NAME NUM_ROWS LAST_ANALYZED
------------------------------ ---------- ---------------
PS_GP_GRP_LIST_RUN 0 13-MAR-10
PS_GP_ITER_TRGR 5041 13-MAR-10
PS_GP_RTO_TRGR 459246 13-MAR-10
PS_GP_RTO_TRGR_WRK
PS_GP_RTO_TRG_WRK1 0 13-MAR-10
PS_JOB 322143 14-MAR-10
Subobject Obj. Logical
Owner Tablespace Object Name Name Type Reads %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
SYSADM PSINDEX PS_GP_RTO_TRG_WRK1 INDEX 192,399,712 57.59
SYSADM GPAPP PS_GP_RTO_TRGR TABLE 66,107,936 19.79
SYSADM PSINDEX PS_DEPT_TBL INDEX 14,803,776 4.43
SYSADM PSINDEX PS_GP_RSLT_ERN_DED INDEX 13,401,520 4.01
SYSADM HRLARGE PS_JOB TABLE 8,906,304 2.67
-------------------------------------------------------------
63 SYSADM INSERT INTO PS_GP_RTO_TRGR_WRK (EMPLID ,COUNTRY ,TRGR_EVENT_ID ,
63 SYSADM TRGR_EFFDT ,TRGR_CREATE_TS ,CAL_RUN_ID ) SELECT DISTINCT A.EMPLI
63 SYSADM D ,C.COUNTRY ,A.TRGR_EVENT_ID ,A.TRGR_EFFDT ,A.TRGR_CREATE_TS ,A
63 SYSADM .CAL_RUN_ID FROM PS_GP_RTO_TRGR A ,PS_JOB B ,PS_GP_CAL_RUN_DTL C
63 SYSADM WHERE A.CAL_RUN_ID=:1 AND B.EMPLID BETWEEN :2 AND :3 AND B.EMPL
63 SYSADM ID=A.EMPLID AND B.GP_PAYGROUP=C.GP_PAYGROUP AND (B.PAY_SYSTEM_FL
@Sean:
shouldn't be a problem, depends on your particular situation and release of Oracle.
Also, please see this later post:
http://dbasrus.blogspot.com/2009/09/quick-catch-up-on-peoplesoft.html
where another solution - involving a small change in PS code - is proposed.
Hi Noons
Appreciate the reply :)
We are on Oracle 9i actually.
3 questions
1) In the post you linked me - what does it mean by "having Peopletools" do the analyze? Any relation to this ? (http://blog.psftdba.com/2009/11/controlling-how-peoplesoft-cobol.html)
2) Since we are on 9i - adviceable for us to use this parameter ?
alter system set "_optim_peek_user_binds"=FALSE;
It was suggested by Oracle but not sure of the downstream impact
3) Do you have a list of temp/work tables used in Payroll runs so that we can exclude them from our stats collection?
Thanks a mil!
@Sean:
No worries, glad to be of help.
On your questions:
1- Yes, that post from Dave Kurz is the one we used to find the PS page and turn on stats collection only when/as needed. As opposed to doing it via the database, to everything adn always.
2- Probably wouldn't work: 9i didn't have bind variable peeking, that's 10g.
But quite frankly: the problem I see with fixing this via db parameters is they apply to everything, not just the specific case of a specific PS table. Which means you have to test the whole lot of the application to make sure the parameter is not going to blow something else out!
Much easier to just turn on an option to force PS code to do the analyze, where it's needed, and only there.
3- The tables you see in the little screenshot grab in this post are the ones we used:
PS_GP_CAL_TS_WRK,
PS_GP_PYE_HIST_WRK,
PS_GP_PYE_RCLC_WRK,
PS_GP_PYE_STAT_WRK.
This was for Peopletools 8.11 paycalc, so if you are running another version do a check that these indeed do exist with your setup.
HTH
Hey again
Clarification:
With regards to the "Update Stats" on the PS page:
What you did was
1)Turn on the trace facility of Peopletools for the module in question and ignoring the trace output. Part of the "tracing" code does an analyze on all scratchpad tables after populating them with interim results.
We removed our database stats blocking code and let Peoplesoft do the analyze during the paycalc.
2) Run Paycalc
Just by checking the box works? or are there underlying changes needed too?
From Peoplebooks:
By checking the Update Statistics box on will cause the batch process to update the statistics on four (GP_PYE_STAT_WRK, GP_PYE_HIST_WRK, GP_TLTRC_WRK, & GP_TLPTM_WRK) of the most important temporary tables while they are still populated.
The system is told to get accurate statistics for these temp tables during the batch process, when they are fully populated.
There are only 4 worktables here - it seems that Payroll uses alot of them hence just making sure we are on the right track.....
Thanks!
Yes, very much so. As I understand it the trace output is discarded, but the stats are analyzed at the right moment.
All we did was make sure that the previous pay calc terminated OK - so that all temp tables are truncated - then activate the option to do the stats.
After that, just run pay calcs normally.
The number of temp tables is huge - last time I counted there were something like 300?
But these 4 are the ones pay calc "hangs" on, so all that's needed for that process is these 4 to be analyzed at the right time.
Hi Noons.
First, You are a kind of heroe for me, because I never found an extremely clearly reason of why the Paycalc hangs up randomly.
But i trace and apply many differents parameters on init.ora for the same error and my only true solution was turn off the HASH JOIN method for sort (you can call me crazy)
It's a very dangerous decision but thats the only way that i solve this issue. We are on Oracle 11g R1 ver. 11.1.0.7, Oracle suggests turning off the OPTIMIZER_DYNAMIC_SAMPLING parameter, i do that but the issue still here.
The query on your pic, is the same issue in my HRMS 9.1 PayCalc and now, you are the only person that i can trust because this query, Oracle solves in seconds or when the database is curse, my PSTEMP going out of space.
Sounds familiar? I hope so... Please when you've a little time, i appreciate your feedback (and sorry for my english grammar but i hope you understand me)
Glad this was of help, Patrick.
Have a further look at this post:
http://dbasrus.blogspot.com/2009/09/quick-catch-up-on-peoplesoft.html
It has what I believe is a solid and definitive solution to this problem without forcing you to "brain-damage" the optimiser with extra options.
Give it a try: it's worked for us and we have not had a problem for a while now.
If you get a chance, swing by Dave Kurz's blog - there is a link to it in my post - he always has something of interest on Peoplesoft.
it is amazing.. thank you for ur posting
Hi Noons,
After we upgraded to HCM9.1/Oracle11gr2,the paycalc process started running longer than normal and I found that becoz of bind peeking,the CBO chooses bad plans in some cases based on luck i should say.Oracle support has suggested to use /*+bind_aware */ hint in all stored statements having EMPLID Between clause.Even after applying the same,a paycalc for ~100 payees ran for 37 minutes and the same ran in 5 minutes before the upgrade.Will write more details in the next comment section
Antony
Thanks for your comment, Antony. What you say worries me no end as we're about to embark in a similar upgrade ourselves. In theory it should hit around the Xmas/NY timeframe.
Man! I can see I'm going to have fun...
Have a look at some of the prior comments above in relation to a better solution I found in Dave Kurz's site. It might just be the ticket for the issue you're hitting?
Otherwise, we're both in trouble!
:)
On a more serrious note: there is a way to disable bind peeking all together, and sometimes I wonder if that wouldn't be the solution to all these problems with PS?
I mean: I like it and what it can do in some cases, but the fact remains that a LOT of performance problems are caused by it.
As a general rule, I tend to implement histograms for columns involved with these bind variables, it seems to help. But it's not a silver bullet...
Thanks sharing such wonderful page, appreciated keep updates.
I need to do another post on this subject.
With Peoplesoft 9.2 (the latest version, compatible with RDBMS 11gr2), this problem disappears completely.
What Peoplesoft/Oracle have done is stop the temp tables being truncated at the end of a pay calc. They are now truncated at the start of a paycalc!
And of course that means we can now calculate statistics on the tables at any point in time outside of an actual pay calc run and end up with a significant sample and a relevant set of statistics!
Assuming there are no significant changes in volumes of pay calculation, the stats calculated for these tables will closely match what ends up in there during the pay calcs.
Now, that is what I call a smart solution! Pity it's taken so many years and wasted computer time before someone in Oracle's development finally had a spark of normality!....
Post a Comment
<< Home