Probably one of the most often discussed and hotly debated topics in both data modeling and database design is normalization. Many database designers, architects, DBAs, and senior developers hold differing positions and/or beliefs on the topic. Unfortunately, they often fail to communicate effectively because of fundamental differences in terminology and understanding. The simple solution is often to take five minutes to review normalization basics – and thus get everyone on the same page. That’s what this month’s blog entry strives to do.
However, before you proceed, always ask yourself what kind of database and application you’re building – because many important design issues vary widely depending on the target system’s nature, as shown in the chart below.
|Business Focus|Operational|Operational / Tactical|Tactical|Tactical|
|---|---|---|---|---|
|End User Tools|Client Server|Client Server|Client Server| |
|Size in Gigs|10 – 200|50 – 400|50 – 400|400 – 4000|
|Data Modeling|Traditional ER|Traditional ER|N/A|Dimensional|
For the purposes of this discussion, let’s assume that you’re working on a traditional OLTP database and application. Thus you’re hoping to achieve third normal form (3NF). Let’s review what that means – and specifically, without using fancy mathematical definitions or terms that only PhDs might understand. Let’s make normalization both easy and fun 🙂
Remember the old “Perry Mason” television show? (Maybe I should have said “Matlock” so as not to date myself.) When a witness takes the stand, the clerk asks “Do you swear to tell the truth, the whole truth, and nothing but the truth – so help you God?” Well, we can use that simple little jingle to both define and remember normalization. Our catch phrase will be “… the key, the whole key, and nothing but the key – so help me Codd”. I’ll assume everyone knows who Dr. Codd was – the father of modern relational theory.
First Normal Form
There are four fundamental requirements for an entity or table to meet first normal form – all of which must hold true:
- The entity or table must have a defined and active primary key
- The primary key is the minimal set of columns that uniquely identify a record
- Each attribute or column in the table contains a single value (i.e. atomicity)
- Eliminate repeating groups (i.e. no arrays)
It’s these last two bullet items where we’ll focus (for now, we’ll just agree that our tables should have primary keys as part of good relational design). We’re looking for single atomic values that depend upon the key – hence the first item in our little jingle.
So here’s an entity to review for adherence to first normal form:
So how does CUSTOMER measure up? There is a primary key, and it’s a single numerical column – can’t get much shorter than that. And all of the columns have simple data types that appear to contain single values (i.e. no complex data types like varrays, nested tables or XML Types – where you’d have to ask some additional questions). So at first glance you might accept CUSTOMER as being in 1NF – but you very well could be wrong 😦
I actually testified as an expert witness in a court case on this first issue. See the LONG field – what is it being used for, and how? After talking to the developers and looking at some code, assume that we find out that all the customer’s orders are collected here, each order number followed by a semi-colon. That’s an array – a collection of values. That means we’re not in 1NF. And more importantly – we have a real-world performance issue.
How would you write the query to join CUSTOMER and ORDER? Since SQL cannot handle this, you’d have to programmatically process the join. So instead of the server doing the work on potentially large tables, the client code must process the join logic. Thus you have two bottlenecks – the client CPU and the network traffic to send all the records to the client to examine. Ouch!
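To see the cost concretely, here’s a small sketch (the table and column names are hypothetical, and SQLite stands in for the real server). With order numbers packed into one delimited text column, the “join” must be done row by row in client code; with a proper child table, the server does it in one statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: all order numbers packed into one LONG-style text column.
cur.execute("CREATE TABLE customer_bad (cust_id INTEGER PRIMARY KEY, orders TEXT)")
cur.execute("INSERT INTO customer_bad VALUES (1, '1001;1002;1003')")
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, total REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1001, 50.0), (1002, 75.0), (1003, 25.0)])

# The "join" happens in client code: fetch, split, then query per value.
total = 0.0
rows = cur.execute("SELECT orders FROM customer_bad WHERE cust_id = 1").fetchall()
for (packed,) in rows:
    for order_id in packed.split(";"):
        found = cur.execute("SELECT total FROM orders WHERE order_id = ?",
                            (int(order_id),)).fetchone()
        total += found[0]
print(total)  # 150.0

# Normalized: a child table lets the server do the join in one statement.
cur.execute("CREATE TABLE customer_order (cust_id INTEGER, order_id INTEGER)")
cur.executemany("INSERT INTO customer_order VALUES (?, ?)",
                [(1, 1001), (1, 1002), (1, 1003)])
row = cur.execute("""SELECT SUM(o.total)
                     FROM customer_order co
                     JOIN orders o ON o.order_id = co.order_id
                     WHERE co.cust_id = 1""").fetchone()
print(row[0])  # 150.0
```

Both approaches reach the same answer, but only the second one lets the database optimizer – rather than your client loop – do the work.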
OK – that example is a bit extreme. The real problem is with ADDRESS_LINE. In this case we have modeled how an address looks on an envelope rather than its constituent parts. In the case of an address that’s probably reasonable – but it sets a dangerous precedent. Look at the example below:
The ORDER contains a bunch of dates. The idea is to track each step of processing the order from start to finish. So every time we find a new “work in progress” step that we want to track, we need to alter the ORDER table and add an optional column. Yuck 😦
A more elegant solution that removes this hidden repeating group is to create a separate entity for work order tracking as shown here:
Now we can add new “work in progress” steps by merely adding a row of data to the ORDER_STAGES table. In fact, we can even build a maintenance screen to support this so that end users can easily make such business-oriented modifications without sending the work back to information systems. In other words, the resulting system is more flexible.
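A minimal sketch of that idea (the ORDER_STAGES column names here are assumptions, and SQLite is used for illustration): each tracking step is a row, so a brand-new step is just an INSERT instead of an ALTER TABLE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One row per tracking step, instead of one DATE column per step on ORDER.
cur.execute("""CREATE TABLE order_stages (
                   order_id   INTEGER,
                   stage_name TEXT,
                   stage_date TEXT,
                   PRIMARY KEY (order_id, stage_name))""")

# Recording steps -- including ones invented after go-live -- is plain DML,
# so an end-user maintenance screen can do it; no schema change required.
cur.executemany("INSERT INTO order_stages VALUES (?, ?, ?)",
                [(1, 'RECEIVED', '2024-01-02'),
                 (1, 'PACKED',   '2024-01-03'),
                 (1, 'SHIPPED',  '2024-01-05')])

stages = [r[0] for r in cur.execute(
    "SELECT stage_name FROM order_stages WHERE order_id = 1 ORDER BY stage_date")]
print(stages)  # ['RECEIVED', 'PACKED', 'SHIPPED']
```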
Second Normal Form
There are two fundamental requirements for an entity or table to meet second normal form – all of which must hold true:
- The entity or table must be in first normal form
- All attributes or columns must depend upon the whole key
It’s the last bullet item where we’ll focus. If the primary key is constructed upon a single attribute or column, then we are already in 2NF. But if the primary key is composed of multiple attributes or columns (i.e. a concatenated key), then we are looking for all the attributes or columns to depend upon the whole key – hence the second item in our little jingle.
So here’s an entity to review for adherence to second normal form:
The test is really simple – does each attribute or column depend upon the whole key? First, does PROD_COST depend on both PROD_NAME and VENDOR_NAME? It might be arguable that this holds. But second, does FACTORY_LOCATION depend upon anything other than VENDOR_NAME? Since the answer is most likely no, we separate this information into its own entity as shown here:
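The decomposition can be sketched like this (column names are assumptions based on the discussion above, with SQLite for illustration): vendor facts live once in their own table, keyed by the part of the key they actually depend on, and the product table keeps only what depends on the whole key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# FACTORY_LOCATION depends only on VENDOR_NAME, so it moves to VENDOR ...
cur.execute("""CREATE TABLE vendor (
                   vendor_name      TEXT PRIMARY KEY,
                   factory_location TEXT)""")
# ... and PRODUCT keeps only attributes that depend on the whole key.
cur.execute("""CREATE TABLE product (
                   prod_name   TEXT,
                   vendor_name TEXT REFERENCES vendor(vendor_name),
                   prod_cost   REAL,
                   PRIMARY KEY (prod_name, vendor_name))""")

cur.execute("INSERT INTO vendor VALUES ('Acme', 'Detroit')")
cur.executemany("INSERT INTO product VALUES (?, ?, ?)",
                [('Widget', 'Acme', 1.25), ('Gadget', 'Acme', 2.50)])

# A vendor relocating is now a single-row change, not an update to
# every product row that repeats the location.
cur.execute("UPDATE vendor SET factory_location = 'Austin' "
            "WHERE vendor_name = 'Acme'")
loc = cur.execute("SELECT factory_location FROM vendor "
                  "WHERE vendor_name = 'Acme'").fetchone()[0]
print(loc)  # Austin
```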
The major problem that I see more often than not is where people break a cardinal relational database design rule and have a single attribute or column contain multiple pieces of information. For example, they might design a column called PRODUCT_KEY as a VARCHAR2(20), where the first 10 characters are the product’s name and the second 10 are the vendor’s name. Thus in effect they have a concatenated key that violates 2NF. My point is that you cannot effectively normalize when attributes or columns don’t represent a single concept – so please don’t do it.
Third Normal Form
There are two fundamental requirements for an entity or table to meet third normal form – all of which must hold true:
- The entity or table must be in second normal form (and thus also in 1NF)
- All non-key attributes or columns must depend on nothing but the key
It’s the last bullet item where we’ll focus. Does every non-key attribute or column depend upon nothing but the key? Hence the third item in our little jingle.
So here’s an entity to review for adherence to third normal form:
Again the test is quite simple – does each non-key attribute or column depend only on PROD_NUMBER? The last two attributes or columns very clearly depend on just PROD_CATEGORY, so the entity violates 3NF. Once again the solution is to simply separate this related information into its own entity as shown here:
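The same decomposition pattern in a quick sketch (column names are assumptions, SQLite for illustration): category facts get their own table keyed by PROD_CATEGORY, and PRODUCT keeps only what depends on PROD_NUMBER alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Category facts move to their own table, keyed by PROD_CATEGORY ...
cur.execute("""CREATE TABLE prod_category (
                   prod_category TEXT PRIMARY KEY,
                   category_desc TEXT)""")
# ... so PRODUCT keeps only attributes that depend on PROD_NUMBER alone.
cur.execute("""CREATE TABLE product (
                   prod_number   INTEGER PRIMARY KEY,
                   prod_name     TEXT,
                   prod_category TEXT REFERENCES prod_category(prod_category))""")

cur.execute("INSERT INTO prod_category VALUES ('TOY', 'Toys and games')")
cur.execute("INSERT INTO product VALUES (100, 'Yo-yo', 'TOY')")

# The category description is recovered with an ordinary join.
desc = cur.execute("""SELECT c.category_desc
                      FROM product p
                      JOIN prod_category c ON c.prod_category = p.prod_category
                      WHERE p.prod_number = 100""").fetchone()[0]
print(desc)  # Toys and games
```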
Higher Normal Forms
Most shops will be well served if they can regularly identify and correct for up to 3NF. Whether to normalize further really depends on whom you ask. Database designers and data architects might like to normalize further – but then again, they’re not the ones who have to write the more complex code to manage such highly structured data. There is a clear tradeoff in code readability and maintainability when you go much higher than 3NF. It’s my experience that most developers’ skills and patience rapidly erode beyond 3NF. We live in a world of doing more, faster, and with fewer resources. So let’s not design a perfect database doomed to fail based upon academic or theoretical practice that sacrifices coding efficiency to save cheap disk space. Let’s choose a happy medium that’s effective and yet allows us to remain efficient in the implementation. And you heard that from a person with his PhD – so I guess 20 years in the trenches has forced me to practice what I preach 🙂