2007/01/18

keys, more keys and nothing but keys...

Yours truly and Jonathan will not agree on this one.

Read the comments section. Don't get me wrong: I fully respect his arguments and I know he's not about to do less to anyone else! We can agree to disagree on anything without any ruffled feathers.


Having been mostly on the side of the fence that deals with the consequences of design - good or bad is irrelevant - I can't possibly see what major advantages Natural keys might bring. As opposed to
Surrogate keys
.

Unless of course Oracle decides to dictate by design of their own optimizer that one WILL be better than the other: a very risky proposition, indeed!

Sure: natural keys make examples look good and understandable. They also make ad-hoc queries, in some conditions, look like they are clearer and easier to understand.

But quite frankly, the nuances and subtleties between

select emp.*
from dept, emp
where emp.deptno = dept.deptno
and deptname = 'Accounting'


and

select emp.*
from emp
where emp.deptname = 'Accounting'


are not THAT many that I'd care much for one or the other based only on these reasons: I'll just use whatever ends up being the most efficient.

Note that efficiency is NOT the same as performance! It encompasses other considerations such as efficient use of the -definitely finite - disk resources one starts off with.



First: this is the universe of RDBMS. Joins are the bread and butter of data normalization. Normalization implies one ends up with joins IF one wants to make sure data is not replicated all over the place. Which will happen if one doesn't follow normalization at least up to 3NF.


Using reduction in number of joins as an argument against surrogate keys in rdbms design is quite frankly not valid: they are part and parcel of this type of databases. Not an issue. They don't need to be reduced. In fact, normalization doesn't happen unless one doesn't end up with a design that requires joins.

There is also the very strong argument that once one allows repeating data attribute values in tables that should not contain the attribute in the first place, one opens the door to data corruption as a result of a possible coding error. And who has ever seen applications without errors? "Features" anyone?

Yeah, I know: facetious argument. But no less real. And I like to bang on the "real" anvil!


That's why, even in a natural key environment as in the above code snippet, one better use

select emp.*
from emp,dept
where emp.deptname = 'Accounting'
AND emp.deptno = dept.deptno
AND dept.deptname = emp.deptname


for the second example. Or whatever syntax the standard mandates we should use to get this.

But you get my drift.

End users being what they are, if one ever stores the "Accounting" word spelled as "Acounting" in the emp table I can virtually guarantee a logically incorrect result. Although of course a physically correct result will be produced: the database engine does not spell-check statement predicate values and returns exactly and precisely what one asks for.

By the same token: if one ever has to change the name of that department from "Accounting" to "Excel-heads" - heaven forbid! - the amount of REDO and UNDO in the system will go through the roof. Add the index entries themselves - yes: those have to change as well, not just the table data! - and you have a recipe for what dbas call "hell on Earth". With good reason.

Use surrogate keys and your change is in one table, one row. Period. Boom. Overhead? What's that?


One of the primary objectives of normalization has always been avoiding the repetition of attribute values in the final db. Why? Because that repetition, unchecked, will cause tremendous amounts of data to be stored - and changed, as noted above!


This is why one has, for example, star schemas in DW databases: keep dimension data repeated - denormalized - in a fact table and one is up for a review of the disk storage budget. Simple as that.


Exactly the same line of reasoning can be applied to DSS databases. Or even more so: here you don't have ONE single, enormous fact table. Typically, you have a dozen or so very large tables per application, with a slew of ancillary tables that store data to be used in aggregation/roll-ups as well as validation. And the results of those aggregations. It is perfectly reasonable for DSS databases to have various tables in excess of 200Mrows, nowadays. 6 years ago those would have been considered the realm of DW!


Repeat department names in these as part of using natural keys and I'll guarantee a major storage capacity blow-out at SOME stage of the life of the application.


The best example I can offer is a case a few years ago where someone asked me to offload some data from a database for offline sorting, "fully denormalized". This database made extensive use of surrogate keys. I did indeed offload the data: from 500MB in the db, it became 60GB in the flat file... Nuff said about that idea.


I've never seen one single major commercial application that uses natural keys. Not one. Jonathan - quite rightly - notes this as a wrong argument: it's not because they use meaningless keys that they are commercial - and successful. You know, the old Betamax quality argument. But that's not the issue here.

The issue is: there aren't ANY that use natural keys! Betamax at least existed here and there. Applications?

Look at SAP: none.
Peoplesoft? Not a trace.
Oracle's own Financials? Nowhere to be seen.
JDEdwards? Dream on!

Look: it's quite possible to cite an example. But it won't be a major, commercial application.


Why? Because the folks who are responsible for the design of these know what a pain it is to make sure all those instances of "Accounting" are spelt the right way. And what happens to cascading values when one of them is wrong.


The counterpart is a single instance of the value and a - hopefully! - sequence-generated identifier stored repeatedly. Got a wrong surrogate key somewhere? Sure: replace it by the correct value in the single storage location. No multiple cascading. Period. That is just such a large efficiency gain it has always won every single other argument against it in the design teams.


Sure: in terms of canonical theory of database design, a natural key may indeed make sense. Problem is that not one rdbms out there follows the canonical model. Fabian Pascal has touched on this for years, since the glorious old days of Compuserve long before the Internet existed: if commercial rdbms engines followed the strict rules of rdbms theory, indeed natural keys would be a, well, natural choice.


Fact is: they don't. So, they aren't. And that is why you won't find them anywhere near commercial products.



Of course, we still disagree on this one. And so does Howard. And Fabian. And Chris Date. Hey, I can live with that. No one has won either side of this argument.

Yet.

So maybe, just maybe, ALL of us are completely wrong?


:-)


Photo hat on:

More b&w photos in the dA site. I'm rediscovering the joys of developing my own film! Currently experimenting with Delta 400 and Microphen. Next I'll try Tri-x and Xtol. They seem to be the most popular among amateurs nowadays.




catchyalata, folks

5 Comments:

Blogger Jeffrey Kemp said...

I can see your point, and agree only insofar as it relates to the pragmatic side of things (it's impractical to avoid surrogate keys given the current DBMS's available). If the pk needs to be updateable and it is referenced by foreign keys then it makes sense to use a surrogate key for the FK, but only for performance reasons.

"there aren't ANY that use natural keys"

Only need one counter-example to falsify this statement. My first Oracle job involved working with a 1GB database that, while it had many surrogate keys, also used several natural keys. The most common example was a "history" table, e.g.:

EMP (emp_id, name, etc) pk=emp_id

EMP_HISTORY (emp_id, date_from, date_to, name, etc) pk=(emp_id,date_from)

A surrogate key for EMP_HISTORY was simply not required for the application.

Thursday, January 18, 2007 7:17:00 pm  
Blogger Noons said...

I tried to make it abundantly clear, in both my reply to Jonathan and in this blog, that I am talking about commercial designs. Not the odd in-house application or extension or the odd table here and there.


If natural keys were such a great solution, they would naturaly (pun intended!) have been extensively used by commercial designs.


The commercial environment is quite Darwinian, as a matter of fact: if something is not practical, it won't survive. Call that pragmatical or whatever, it's reality.


Having been on the design end of the world as well and with a few of them under my belt, all successful, all proven efficient and solid, I can say without any reservations that I would never, under any conditions, use natural keys by default with current rdbms engines.

Unless forced to do so. Yes, that also happens.


Notice that I said: current.


It might well change. But I don't have any hopes: the sheer amount of installed systems out there will take many, many decades to evolve even if a solution was available now.


Add to that the OO and j2ee mantra of UIDs everywhere and you got the future covered as well, no matter what might surface.


Natural keys missed the boat. Good idea, impractical with available technology. Pity. But life moves on.

Thursday, January 18, 2007 8:25:00 pm  
Blogger Don said...

This comment has been removed by the author.

Friday, January 19, 2007 3:10:00 am  
Blogger jgar the jorrible said...

I think that wikipedia article about Fabian has mutated tremendously since I last looked at it.

But it's ok, society made us do it.

Ran across http://www.webservertalk.com/archive149-2004-10-423635.html looking for commercial examples :-)

Funnily enough, I've spent this past week upgrading a commercial system that has many keys that include warehouse, because they are shutting one down and moving everything elsewhere. This of course exposes every typo, inventory bug, stuff moved without telling the computer, backorder, etc. Would it have been simpler with surrogates? Well maybe, but the broken stuff would propagate. The way it is, I can propagate things that are certain, and anything else force into a manual correction procedure. There will be a physical inventory eventually anyways, but doing it this way reduces the work there and fixes old problems much faster (including problems that would otherwise never be found, and orphaning old data problems that would just be useless extra work if found with a cleaning program).

But is a two-character representation of a warehouse (like '23') that people use to refer to it a natural key? Or is it a surrogate that people happen to use (instead of "Kalamazoopie Distribution Warehouse")? Programmatically it wouldn't have made much difference if the big character representation were used, but storage-wise it is in many places. I certainly couldn't have left the '23' and changed the description...

Saturday, January 20, 2007 10:03:00 am  
Blogger Noons said...

and then you have places like a client I worked at 4 years ago, where everyone in IT was so used to some of the surrogate key values that they used to refer to the particular row instances by the keys themselves.

So, it was accident 3567 involving a 6654 vehicle. Not an accident in the Catherine Hill Bay mine involving a Cat truck...

The IT folks refused point blank to have the surrogate keys changed in the new database so we had to load them as they were and adjust the sequences after that.

Never mind the end users didn't have a clue what those numbers were because they never saw them in any screen or report...

They come in all shapes and sizes, jgar! ;-)

Sunday, January 21, 2007 12:43:00 am  

Post a Comment

Links to this post:

Create a Link

<< Home