2008/08/06

stop fixing the symptom!

Long time no post!
Sorry folks: been busy with a lot at work and at home.
It's Winter outdoor sports season here so all spare time is spent ferrying kids from venue to venue.

Anyways: attended a recent Oracle seminar on high availability.

Good stuff. And for once, we didn't get another "injection" of "Larry Ellison did or said this or that or the other":

Hallelujah!!!


Interesting also to see some charts and stats on oldest versions of Oracle in use in the audience. Most were 7 and 8, some 9 and even less 10. Says a lot for the current level of IT and databases in this state...

Then again, during the seminar we heard various snipe comments from the management audience on how we should "get rid of dbas", and "don't hire experienced dbas".

Any wonder why the majority of older versions of Oracle in use are non-supported releases? With this kind of IT damagement in place, what do you expect?

I'm surprised there weren't a few dBase users around...





Still, one part of the seminar was dedicated to the Grid product. In which of course we were treated to the good old dba2.0 vs dba1.0 nonsense.

I really wish Oracle would stop this type of rubbish.

Why is it rubbish? Stay with me for a moment, then:




Scenario 1:

dba1.0 is asked to "check why db is slow". What could be more common, I hear you ask? Followed by the usual (bad) use of (bad) scripts to "fix" nothing.

Then with a flash-bang, dba2.0 walks in and proceeds to click on a (painfully slow and unresponsive) "graphical" grid screen which magically auto-tunes everything.

The only thing missing are the rose petals, the soft music and the low fog...


Well actually: it isn't a common event at all to start with! That is the whole problem!


Except for exceptional circumstances of s/w upgrades causing an hitherto unsprung optimizer bug to rear its ugly head, it's been quite a few years since I last walked into a "disaster area" as described in that scenario!


Most db installations nowadays have things well under control and situations with half a dozen bad statements nicely lined up and repeatedly executing are in fact quite rare!


By far the vast majority of performance problems I see nowadays are caused by either a runaway single statement - usually caused by a bug or a bad combination of stats - or simply plain undersized, archaic hardware configurations! These just can't be tuned out: the fix is to get better hardware, period!


In the case of the single runaway statement, it is usually currently executing and taking hours to run instead of the usual seconds.

Slap the plastic-fantastic dba2.0 grid auto-tuner onto it and it's gonna create a profile for the NEXT execution of that statement.

Hang on a minute: so what happens to the CURRENT execution which is taking hours?

Ah yes, the fantastic dba2.0 product not only hasn't got a clue what caused the problem to surface out of the blue sky in the first place, but also hasn't got a chance in hell of fixing the CURRENTLY long running statement!

Profiles do not affect the CURRENTLY excecuting disaster, they affect the NEXT execution.

Too late, I'm afraid...


See: this is the problem with "treating symptoms".


Instead of standing back and analyzing WHY did that statement's execution blow out in the first place, dba2.0 has now condemned him/herself to a lifetime of clicking on screens every time a problem shows up and ending up with a setup that is never fixed but just jumps from red light to red light!



Had dba2.0 actually ANALYZED the problem instead of "profiling" away blindly trusting technology, he/she might actually have clued-on to the simple fact that a workarea table got filled with rows and was not re-analyzed before the next execution. And of course the CBO being the predictable thing it is, it proceeded to ignore any optimization based on indexes of the supposedly "empty" table. Cartesian product-city, do I hear you yell? Right!

Most common scenario in non-bug related blow-outs.


And no chance in hell of the "intelligent" grid thing being able to fix the problem: at best it'll fix the symptom by re-analyzing the table.

Right on time for the next "truncate" of that workarea table to leave the stats again out of whack!



This is the problem with all these auto "tuning" environments: they assume the cause of a problem is the symptom. So they treat the symptom. Forever condemning themselves to fix the next iteration of the symptom.
And the next...


BAD move. INCREDIBLY bad...




Scenario 2:

dba1.0 is asked to find out why db had a "hang" last Saturday night. dba1.0 then (wrongly) proceeds to look at statspack and picks up straight away there is a blow-out in parsing. Therefore concludes - quite rightly in fact! - that somehow the application is using string-concat for parameter passing instead of bind variables.

What he fails to produce is the actual statement that caused the problem. But that would be an easy exercise of scanning the SQL area of the statspack report.


Of course dba2.0 waltzes to the AWR screens (conveniently already loaded, they take AGES to come up!) and pinpoints the problem to a given statement that indeed uses no bind variables.
After plenty of pregnant pauses waiting for the screens to refresh:
God, that thing is s-l-o-w!!!


Now, when was the last time you saw one of these applications that nicely and conveniently do not use bind variables? They were weeded out ages ago! Except for deranged twits masquerading as "duhvelopers" parading their backyard "web 2" apps, it is indeed very rare.

Even in the SQL Server arena it's getting harder and harder to find!


And of course: dba1.0 found the problem in a few seconds! Although a real dba would have first CHECKED the alert log in case there was a serious data corruption causing the hang!





Yes, it's all very nice. And not expensive at all: after shelling out for ES licences, you STILL have to pay a small fortune to get ASH/ADDM/AWR running with grid!


Oh: and you better beef-up all the hardware involved. The whole darn lot has so much overhead it'll rival your DW in resource consumption...




But, I digress...



The problem with all these you-beaut-clickety-click tuning interfaces is always the same: they are about 10 years late to the market and they still don't treat the problem nor provide any clues to it. They simply fix the symptoms.



Don't get me wrong: having a tool that can quickly analyze a bunch of SQL statements and come up with a good execution plan would have been great!

10 years ago. When I was dealing with systems that needed it, that is...



Now I - and quite a few others - run packaged applications from third party or bought-by-Oracle stables. Mostly, they are reasonably well written and obvious problems have long been ironed out.


What remains are problems caused by running them in seriously undersized hardware - nothing grid or any tuning pack can do to fix that!


Or caused by the vagaries of a stats gathering process that can't possibly match the largely random execution patterns of a complex application.


Nothing a tuning pack can do to fix it either. You have to roll-up your sleeves, identify the problem tables, clear their stats and force the CBO to dynamically analyze them from now on: only way to make sure it'll grab a good guess.

Oh, and don't forget to lock stats to stop that pesky daily auto-stats job from stuffing up things.





Now, let me clarify something here: it's not all bad! I actually like grid a lot.

In fact we're going to implement it soon. After I nearly twisted management's neck around here to get it!

It'll definitely help us to manage our Oracle and non-Oracle dbs and with a bit of custom tweaking it might even do something useful for our Wintel and Lotus Domino systems management.

It'll take a while but I need all the help I can get to manage all this paraphernalia and mish-mash of dbs and applications.



But if I hear another sales rep waltz-in with a "solution to my performance problem", I swear:
there is gonna be a murder...




All this to simply say: on the dba2.0 vs dba1.0 rigmarole quite frankly, count me firmly on the side of the
dba3.0 = dba2.0+dba1.0!


dba 3.0 professionals use grid and the tuning packs to quickly find out the CAUSE of the problem.

Then they use in-depth analysis and reasoning based on gathered statistical facts and application execution patterns to ensure they implement a fix that will stop the problem from EVER happening again!


Rather than sit all day in front of a blinking screen, clicking away.



Call me crazy if you like but I reckon I can use my time much more profitably to my employer if I'm dedicating it to capacity planning and project development tasks, rather than chasing blinking lights with a mouse: I know a place where folks who do that can be hired for a LOT less moolah...





Coming back to more joyfull things, here is a small tribute to a photographer I admire: Max Dupain.


Max used to shoot Sydney's Northern beaches and anything to do with life in that area, back in the 1930s. I'm slowly building a portfolio of my own interpretation of his views and outlooks, in modern times.




This is Narrabeen lagoon entrance taken from an unusual angle that shows how amazing it really is:






On a stormy sunset, beach sand aquires reflexions rarely seen in a normal day:





Turimetta beach. I used to fish here a lot, back in the 80s. Now I'm too old to go rock-hopping but I still visit often. Very nice memories, here:




During a Winter storm, local surf kids flock to Narrabeen rock pool to enjoy the majestic swell. I was actually quite scared while taking this one:




and this one:




but the kids were having a ball: some of them were actually body-surfing the waves, INSIDE the rock pool!

Respect!




Catchyalata, folks!