2020/01/21

Been a while...

Haven't been here for a long while.
Then again, Oracle is on the way out from our site so I'm not at all surprised!
I'll bet all those who spent so much of their time trying to get me ejected from my position and replaced by outsourcing must be really happy with the long future they created for themselves here!   😄

The 12.2 upgrades are done in almost all the left-over dbs now.  After nearly 2 years of patch, patch the patch, and patch again!
Just to give you an idea of how bad this was, the patches so far:
25415713, 28163133 (the July 2018 release update, heaps of patches here) and 28390273, applied in left-to-right order.
And a few more are needed, but so far I've been able to work around those issues without any further patching.  Still, it took 2 years to get here!
The number of "release updates" I had to go through until I found one with a working dbua was absolutely mind-blowing!

First it was RMAN not working properly at all.  Then it was logs being produced in enormous quantities into $ORACLE_BASE.   Then it was RMAN making huge logs again.
And don't get me started on how many times the dbua upgrade tool failed!...  😡


And huge issues with getting the lot to become stable enough to run anything on it!

Quite frankly, I'm not surprised Oracle is almost on the way out of most sites here in NSW and the few remaining ones are not planning to do anything else after 12.2...

Ah well, I'm sure it's because of all those "bad dba 1.0"s who are condemned to work with unstable and buggy software! And of course, my name is Father Xmas.

Speaking of which, happy 2020 to anyone who still occasionally might come here.

On the photo front not much happening.
The one below was from my daughter's marriage.
The ring-bearer was our valiant Hutch, who is now known as

"Dog-of-The-Rings"

(with profuse apologies to Sir Peter Jackson!  😂)



2017/09/18

And the idiocy continues...

Been a while since my last post.  Then again, Oracle administration has become such a small part of my work that I hardly find the motivation...

But the latest idiocy from the Larry Ellison camp is just too much!
Yeah sure: another iteration of the "let's get rid of all those DBAs" nonsense that has essentially killed the company here in NSW Australia since the days of release 9!
This one is REALLY going to work, this time around.  And I'm hereafter to be referred to as Father Xmas...

Has it even percolated through the minds of these idiots that 2 dbas taking care of 400+ dbs are NOT, have NEVER been and will NEVER be "all those DBAs"?
And that getting rid of 2 people will hardly save ANY $$$ for an IT department that counts dozens of folks?

What an idiot Lazza has turned into if he's behind this...

Anyways, the joke will continue for a little while more, I guess: it takes a long time for a very large company to become completely irrelevant.  If I were a shareholder of Oracle I'd be seriously worried by this latest "strategy".  It's clear Lazza is past his use-by date and needs to be replaced, or the company will just get worse and worse.
In a way that has been the REAL difference between Bill Gates and Lazza: the former actually saw the light and left the company before he became a dead weight. As a result, Microslop has become a lot more important and larger.

Ah well, I'm sure a lot of the "Let's have lots of beers" club will find the above insulting.
But then again I never cared much about that lot and what they think! So, good riddance to OTN&Co and the "Ace" bullshit!...

Sorry, been too busy for photos and don't have much to share at the moment.
Maybe later, if I still feel like posting anything:
we're about to embark on a major 12.1 upgrade and I'm sure the usual cohort of half-finished features and unverified patches will keep us busy as usual!

But I couldn't leave you without a little bit of whimsy I made a few weeks ago, as a result of looking at a very bad SQL in one of our "written by a Java-expert" applications:


2014/09/23

OED 12c

Yeah...    Aka: Oracle Enterprise Damager...

For those who might not know - yes indeed, a few of the so-called "cognoscenti" are not all-knowing! - we've been engaged in using grid control to monitor all our db servers for quite a while.

That's both MSSQL and Oracle RDBMS servers.  For 4 years now!

In a nutshell and to cut a very long story short:

 - we started with 10g.  The less said about that one, the better. It didn't work, period!

- then went up to 11g grid control in the hope of finding something in the product that actually worked. That was mostly it: just hope.  Slightly better for the basics but a total disaster in any advanced functionality including host and MSSQL monitoring.

- finally about a year ago, we got 12c and after the usual bunfight about where to run it and licensing issues, ended up doing so in RH under a vm.

Pete Sharman - the product manager for this product - planted enough earworms on me about how much better this particular release was. And he was absolutely RIGHT!

OEM 12c compared to what preceded is like a breath of fresh air. True "night and day" stuff!
(yes, I used the word "stuff"- why, got a problem? Tough! :) )

Far from perfect, mind you.  But sooo much better than the nonsense that GC was before, there is simply no comparison.

All this to introduce the main subject of this post.

As part of the effort to monitor and manage our dbs with this product, we have been investigating and actively using its features.

A lot of them. Both for monitoring and for managing dbs.

As in: "real life" work.  Not just "powerpointing" and "blogging" about how great it is and how it's gonna replace boiled water and all dbas while producing heaps of vapor clouds.  Like others do...

 And of course, as such, we've hit problems.  Which mostly we've tried to work around as much as possible rather than just get stumped or wait for patches that may never arrive.

Why?

You see: we actually have to deliver finished projects, measured MONTHLY by agile sprints.  Not just powerpoint presentations and travel and chinwag time.

Treating every obstacle as insurmountable is not an option in such a framework.

So, we have to think and apply past experience to resolve these problems or find workarounds.
And quickly!

One happened just a few days ago and it got us stumped.

It's easy as pie to reproduce, if you have access to OEM 12c:
  1. Create a super-admin login, and add to it a group with a chart.
  2. Logged in as that user, go to the chart display page and, using the top-right menu with the name of the login, pick option 2 to make that screen the default startup screen for that user.
  3. Now, logout and login again to confirm that indeed you end up in that screen. 
  4. After this, go to the main menus and get rid of that group altogether, then logout.
    Note that you did not change the default startup screen for that user.
  5. So now, login again and watch what happens. 
Yeah!  Blank screen, no menus anywhere!

One would think that in a well thought out - and well tested - product there would be an admin option somewhere to assign a default screen to any login, via SYSMAN.  Or at the very least to stop us from deleting something that is used somewhere else as a default!

Nup, no way. Too easy! Gotta make life hard for all those "bad dbas" out there, ain't it?

So, in came Support.  And one of the most exemplary and stellar lateral-thinking workarounds I've had the luck of witnessing!


Indeed!  A perfect example of lateral thinking by someone genuinely interested in helping resolve a problem, rather than pontificating or worse yet - patronizing others.

It's simple.  Goes like this:
  1. Login with the affected user and get the blank screen in one browser.
  2. Login to SYSMAN in ANOTHER browser (not page, another browser! I use both Firefox and Chrome, so this is dirt easy).
  3. In the second (working) browser, go to a screen that you know the first login can access.  Click on the browser's url area, copy the url to the clipboard.
  4. Then simply go back to the first browser with the blank page, paste the url and hit Enter.
And Bob's a close relative of your mother!  The screen now displays properly in the first login's browser. Then it's simply a matter of picking a suitable screen and assigning it as a new default via the now existing and operational top-right menu.

It is indeed a brilliant workaround, making full use of the web-based nature of the OEM UI.

I've left it to the support person to decide if the root cause of the problem should be passed on as a bug or as an enhancement.


To my way of thinking, it should be reported as a bug as it can leave a login non-operational, with possible dire consequences if that login is a super-admin.

But adding an option to the security section to add any given page as the default one for any given login is most definitely an enhancement. Not a bug fix.

Unable to make up my mind which way to go, I left it to the good judgement of the support person.

Here's hoping that someone with a brain somewhere in the OEM dev team actually gets this, understands its importance and finds a solution. We lost a full day of work fiddling to try and find a solution or workaround, until I decided to go straight to Support and get them to resolve the issue or provide us with one.


Still: here it is with all its gory details, hoping it might be of use for anyone at some stage.

All's well that ends well!


But indeed there is still a lot to be done to make OEM a truly reliable and usable product.

For EVERYONE.

Not just for those who pretend to know a lot about it but rely instead entirely on direct access to developers to resolve any problems. Dirt easy that way...

And that's about it for the time being, folks.
Catchyalata!

2014/03/06

Unintended, but interesting consequences

It's interesting how from time to time something happens that makes sense and seems logical afterwards, but at the time it causes a bit of a surprise.  Part of the fun of working with this type of software!

A few days ago we had an incident in an Oracle DW database when a developer tried to load an infinitely big file from a very large source.  Yeah, you got it: big-data-ish! 

Suffice it to say: 180GB and still going before it hit the limits I've set for that particular tablespace.
Yes, I do use auto-extend. It's a very effective way to ensure things don't get out of control while still allowing for the odd mishap without any major consequences.  Particularly when the storage is a SAN and any considerations about block clustering and disk-use distribution are mostly irrelevant.

And no: if carefully set up, it really does not impact performance that much.  Yes, I know it CAN be proven to cause problems in certain circumstances.  "can" != "must", last time I looked!
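For reference, the kind of tablespace definition I'm talking about looks roughly like this - a sketch only, with made-up names and sizes rather than our production values:

```sql
-- Locally managed, uniform extents, auto-extend with a hard cap so a
-- runaway load cannot eat the whole SAN (all names/sizes hypothetical).
CREATE TABLESPACE dw_scratch
  DATAFILE '/u02/oradata/DWPROD/dw_scratch01.dbf' SIZE 10G
  AUTOEXTEND ON NEXT 1G MAXSIZE 200G
  EXTENT MANAGEMENT LOCAL UNIFORM SIZE 40M;
```

The MAXSIZE is the "limits I've set" part: growth is allowed, but only up to a known ceiling.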

Anyways, the developer did the right thing in that the table was loaded without indexes, no-logging, etcetc.  It just ran out of space, due to my tight control of that resource.

So the developer did the next best thing: he killed the session and waited for the rollback.
And waited...
And waited...
...waited...

You got it: one of those "never-ending" single-threaded rollbacks that Oracle goes into in some circumstances. The tablespace is locally managed and uniform size 40MB, so there were quite a few things pending for the rollback to go through.

No biggie, I've seen a LOT worse!  But of course, it needed attention. And I didn't really want to crash the instance and do a startup parallel rollback - read about that here.

Upon investigation, I noticed that indeed the session was marked "KILLED" and upon checking v$transaction (yes, I know about v$session_longops: that doesn't work in this case!) it looked like we were up for at least another 4 hours of single-threaded rollback!
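For the curious, that check is a simple join of v$session to v$transaction - something like this sketch. Watching used_ublk/used_urec shrink at a crawl is what gives the single-threaded rollback away:

```sql
-- Rollback progress of the KILLED session: USED_UBLK/USED_UREC
-- decreasing very slowly = single-threaded rollback in action.
SELECT s.sid, s.serial#, s.status,
       t.used_ublk, t.used_urec, t.start_time
  FROM v$session s
  JOIN v$transaction t ON t.addr = s.taddr
 WHERE s.status = 'KILLED';
```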

Pain!  And of course we needed the space after the rollback returned the tablespace to nearly empty, to run other tasks of the DW.

I also noticed that the session's background process (Unix, single-thread dedicated) was still there.

Now, when a session gets killed the background process eventually goes away once the rollback is done.
In this case, it of course didn't - rollback still going.

But it wasn't clocking up any CPU time.  If it was doing anything, it'd have been totally I/O bound.  PMON and DBWR on the other hand were quite busy.  To be expected - that is the correct behavior for these conditions.

Trouble is: PMON and DBWR do this in single-threaded fashion.  Which takes FOREVER!

Now, here comes the interesting bit.

I decided based on past experience to actually kill the original background process.  As in "kill -9" after getting its "pid" through a suitable v$session/v$process query.  This, because in theory SMON should then take over the session and roll it back, freeing up any contention around PMON and DBWR.
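The query to get that pid is the usual paddr join - roughly this sketch, with your own sid/serial# plugged in:

```sql
-- OS process id (spid) of the server process behind the killed session:
SELECT p.spid
  FROM v$session s
  JOIN v$process p ON p.addr = s.paddr
 WHERE s.sid = :sid AND s.serial# = :serial;
-- then, at the OS prompt:  kill -9 <spid>
```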

Now, SMON is ALSO the process that does this if the database crashes and a startup rollback takes place. The thing is: it does it in parallel to make the startup faster, if one has FAST_START_PARALLEL_ROLLBACK set to LOW or HIGH. We have it set to LOW. My hope was that it would also do a parallel recovery here.

And indeed it did!  While watching the proceedings via OEM monitors, I saw the same session number show up in a totally different colour (pink) in the Top Activity screen, this time associated with the SMON process.  After clicking on the session, I saw that a number of PQ processes were actually active associated with it!
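If you don't have OEM handy, SMON's parallel transaction recovery can also be followed from SQL*Plus - a sketch:

```sql
-- Check the parameter (FALSE | LOW | HIGH):
SHOW PARAMETER fast_start_parallel_rollback

-- Watch the recovery progress, in undo blocks:
SELECT usn, state, undoblocksdone, undoblockstotal, pid
  FROM v$fast_start_transactions;
```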

And guess what?  The rollback finished in around 30 minutes.  Instead of 4 hours!

Now, THAT is what I call a somewhat interesting outcome.  I did not know for sure SMON would do a "startup recovery" with full parallelism.  My hope was it would take over DBWR work and do a partial parallel rollback.  As it turns out, it did a LOT better than that!

And that is indeed something to be happy about!

Now, before anyone transforms this into another "silver bullet" to be performed everytime a rollback of a killed session takes too long:


  1. This worked for Aix 7.1 and 11.2.0.3 patched up. Don't go around trying this in your old 7.3.4 db!!!
  2. It was a decision taken not lightly, after consulting the manuals and checking with the developer the exact conditions of what the program had done and how and for how long.
  3. This database is under full recovery control with archive logging.  Worst came to worst, I could just lose the tablespace, recover it online from backup and roll it forward to get things back into shape without major outages of the whole lot.

As such, think before taking off on tangents from this. If I get some free time, I'll try and investigate a bit more about it and how far it can be taken.  But given the current workload, fat chance!  So if anyone else wants to take over the checking, be my guest: there is enough detail above to construct a few test cases and it doesn't need to be 180GB! A third of that should be more than enough to cause a few delays!

Anyways, here it is for future reference and who knows: one day it might actually help someone?

Speaking of which: our Sydney Oracle Meetup is about to restart its operations for this year.  We'll be trying as hard as usual to make it interesting and fun to attend.  And we need someone to replace Yury, who is leaving us for the Californian shores.  So, if you like to contribute to the community and want to be part of a really active techno-geek group, STEP UP!


Catchyalata folks, and keep smiling!

2014/03/04

Latest for the folks who have to deal with Peoplesoft

Dang, been a while since the last posts!  A lot of water under the bridge since then.

We've ditched a few people that were not really helping anything, and are now actively looking at cloud solutions, "big data" use, etcetc.

Meanwhile, there is the small detail that business as usual has to continue: it's very easy to parrot about the latest gimmick/feature/funtastic technology that will revolutionize how to boil water.
But if the monthly payroll and invoicing and purchasing fail, I can promise you a few HARD and concrete financial surprises!

Speaking of payroll...

A few years ago I made some posts here on the problem with PAYCALC runs in Peoplesoft and how hard it is to have correct CBO statistics for the multitude of scratchpad tables this software uses to store intermediate results during a pay calculation run - be it daily, weekly or monthly.  Those posts are regularly visited and comments added to them by folks experiencing the same problem.

(Fancy that!  A problem that is common to others and not caused by yours truly  - the "bad dba", according to marketeers from so many companies hellbent on selling us con-sultancy!)

This lack of statistics is mostly caused by the fact that the software truncates the tables at the end of the pay calc process. And does not calculate statistics after re-populating them with intermediate results during the next pay cycle!

With the obvious result that no matter how often and regularly a dba recalculates statistics on these tables, they cannot possibly ever be correct at the most important time!

One way around this that I mentioned at the time was to turn on debugging for the modules that use those tables - assuming one can find them: not all folks know how to do that!

The end result is that the Peoplesoft code will detect the debug flag and fire off an ANALYZE just after re-populating those tables.  Not perfect (ANALYZE is, if anything, a deprecated command!) but better than no stats!  At the end of the run, the output of the debug is simply thrown away.  Slow, but much better than before!

Another way was to force estimates to be taken for those tables by the CBO.  Again, not perfect.  But still better than pay calc runtimes in the tens of hours!
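For the record, the "force estimates" trick amounts to deleting whatever (empty-table) stats exist and then locking them, so the CBO has no choice but to dynamically sample the tables at parse time. Something like this sketch - the table name is hypothetical, SYSADM being the usual Peoplesoft schema:

```sql
-- No stats + locked stats = CBO falls back on dynamic sampling,
-- instead of trusting stats gathered on an empty table.
EXEC DBMS_STATS.DELETE_TABLE_STATS('SYSADM', 'PS_PAYCALC_SCRATCH');
EXEC DBMS_STATS.LOCK_TABLE_STATS('SYSADM', 'PS_PAYCALC_SCRATCH');
```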

Enter our upgrade last year to Peoplesoft HCM and EPM 9.2!  Apart from using the "fusion" architecture for screen handling (slow as molasses...) this release introduces a CAPITAL improvement to all pay calc runs!

What Oracle has done is very simple: from 9.2 onwards, they do NOT truncate the scratchpad tables at the end of each pay calc!

So instead of guessing/analyzing stats for empty tables, the dbas can now analyze them with significant, relevant and timely data contents, do histograms if that is their penchant, etcetcetc!  With the immediate result that instead of the previous pot luck runtime for the paycalcs, now they are reliably constant in their duration and much, much faster! In simple terms: the CBO finally has SOMETHING accurate to base its "cost" on!
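Which means a perfectly ordinary gather, run at a sensible time, now does the job - again a sketch, with a hypothetical table name:

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'SYSADM',              -- usual Peoplesoft schema
    tabname    => 'PS_PAYCALC_SCRATCH',  -- hypothetical scratchpad table
    method_opt => 'FOR ALL COLUMNS SIZE AUTO',
    cascade    => TRUE);
END;
/
```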


Man!  You mean to tell me it took Peoplesoft becoming available in the Oracle cloud for them to realize that what was being done before was eminently STUPID, making it impossible for any dba to have a predictable paycalc execution time?


Apparently when it became their cloud responsibility, the solution was quickly found and implemented...


Of course, if they had actually LISTENED to the frequent and insistent complaints by dbas who actually deal with their software day to day for years on end rather than playing "ace" cards, things would have perhaps progressed a lot faster!... 



But let's ignore stupidity and rejoice: there is a light at the end of the tunnel and it's not incoming!

Get those upgrade juices going, folks, and try to convince your management that 9.2 is THE release of Peoplesoft to aim at.

It'll help everyone. (Including Oracle - but I don't think their marketing is smart enough to detect that...)

Catchyalata, folks. Do me a favour: smile and have fun!

2013/12/16

Back to shell scripting basics

Some quick and basic Unix and ksh stuff that crossed my path recently.

As part of the move of our Oracle dbs to our new P7+ hardware (more on that later...), I'm taking the opportunity to review and improve a few of the preventive maintenance and monitoring scripts I have floating around.

Some were written a few years ago by other dbas and have been added to by myself as needed.  Others were written from scratch by yours truly.

One of the things I do in most if not all of my ksh scripts is to add upfront a function called "alert".
This function is very short, basically just this:
function alert {
  mhmail ${MAILTO} -subject "${ORACLE_SID} online" -body "fullbackup:${*}"
  logit "email alert sent:${*}"
}
where MAILTO is set to:
export MAILTO='db_admin'
and of course db_admin is an email alias with all dba email addresses - both internal and external standby ones.


"logit" is another  internal function that just adds a message to whatever has been defined as the script's log file.  That can be a strict logfile or just simply file unit 2.

In a nutshell: whenever I want to be notified by email of a particular step or stage or error in any script, all I have to do is stick in a

alert "alert message"

and bang! a new msg turns up in my email and that of all the other dbas.
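For anyone wanting to try the pattern, here is a self-contained sketch. The mailer is stubbed out with echo so it runs anywhere (in the real scripts it's mhmail, of course), and the names and paths are illustrative only:

```shell
#!/bin/ksh
# Self-contained sketch of the alert/logit pattern (mailer stubbed).
export MAILTO='db_admin'
export ORACLE_SID='TESTDB'
LOGFILE=/tmp/alert_demo.log

function logit {
  # append a timestamped message to the script's log file
  echo "$(date '+%Y-%m-%d %H:%M:%S') ${*}" >> ${LOGFILE}
}

function alert {
  # real version: mhmail ${MAILTO} -subject "${ORACLE_SID} online" -body "fullbackup:${*}"
  echo "mail to ${MAILTO}: ${ORACLE_SID} fullbackup:${*}"
  logit "email alert sent:${*}"
}

alert "backup finished OK"
```

Any step worth an email is then just one alert "..." line away.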

OK, all nice and dandy.

Been working for ages.

But this time in one of the new P7+ servers, all of a sudden, one of the scripts stopped sending me an email with its end status!

When I tested email in that server from the command line, all was well.  It was only from INSIDE my scripts that the mhmail command was not working.

WTH?  Could this be some P7+ weirdness?

A little bit of debugging with a test script confirmed that indeed the mhmail command in the alert function was not producing anything inside ANY script, but worked perfectly otherwise.

Time for some serious thinking-cap stuff.  After pulling the debug concepts book off the shelf and doing a lot of log-scanning, I found out that the script was indeed executing an "alert" command - it was just that the command and the function weren't the same thing!

Weird and weirder...

Time to go back to basics and find out exactly what the "alert" command line was producing inside the script.  It turned out it was simply doing a "cd" to the trace directory in 11g: $ORACLE_BASE/diag/rdbms/$ORACLE_SID/$ORACLE_SID/trace!

It was then that the proverbial little light went on: as part of the review of the scripts, I had added a ksh command alias to my .profile file that did exactly that - change to the trace directory - and it was named "alert"!!!

Bingo, problem identified: 
an aliased command takes precedence over any function of the same name in a ksh shell script.

Now, that is something I had never seen before nor was I aware of that particular trait of the Korn shell - I don't know if bash will do the same? Ah well, learn until I die...
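Having since tried it: bash does do the same, once alias expansion is enabled in a script (it's on by default in ksh, but needs a shopt in bash). A minimal reproduction, with made-up messages standing in for the "cd" and the email:

```shell
#!/bin/bash
# bash needs this in scripts; ksh expands aliases by default:
shopt -s expand_aliases

# the .profile alias (stands in for: cd to the 11g trace directory)
alias alert='echo "alias ran: cd to trace dir"'

# the script function of the same name (stands in for the email alert);
# note the "function name" form - with the "name()" form the alias
# would mangle the definition itself
function alert {
  echo "function ran: send alert email"
}

alert   # the alias wins: it is expanded when the line is read
```

Drop the alias - or rename one of the two - and the function runs instead. Hence the renaming fix.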

Time to fix the root problem, once and for all:
  1. All alias names that do "cd" to somewhere are now prefixed with "go".
    or
  2. All script function names are now prefixed with  "f".
I did 1, but am thinking seriously of going with 2 as well. You know: tobesure-tobesure, just the Irish-dba in me coming to the surface! 

Dang, it's been a while since I had to do this much script debugging!  It definitely pays to stay up to date on shell script debugging tips and tricks.

 Anyways, nuff with the techo stuff!


Catchyalata, folks!