Probably one of the most often discussed and hotly debated topics in both data modeling and database design is that of normalization. Many database designers, architects, DBA’s and senior developers have differing positions and/or beliefs on the topic. However quite unfortunately, they often are not communicating with optimal effectiveness due to some fundamental differences in terminology and understanding. The simple solution is often to take five minutes to review normalization basics – and thus get everyone on the same page. That’s what this month’s blog entry strives to do.

However before you proceed, always ask yourself what kind of database and application you’re building – because many important design issues vary widely depending on the target system’s nature shown in the chart below.

OLTP ODS OLAP DM/DW
Business Focus Operational Operational Tactical Tactical Tactical

Strategic

End User Tools Client Server

Web

Client Server

Web

Client Server Client Server

Web

DB Technology Relational Relational Cubic Relational
Trans Count Large Medium Small Small
Trans Size Small Medium Medium Large
Trans Time Short Medium Long Long
Size in Gigs 10 – 200 50 – 400 50 – 400 400 – 4000
Normalization 3NF 3NF N/A 0NF
Data Modeling Traditional ER Traditional ER N/A Dimensional

For the purposes of this discussion, let’s assume that you’re working on traditional OLTP database and application. Thus you’re hoping to achieve third normal form. Let’s review what that means – and specifically, without using fancy mathematical definitions and/or terms that only PhD’s might understand. Let’s make normalization both easy and fun 🙂

Remember the old “Perry Mason” television show? (maybe I should have said Matlock so as not to date myself) When a witness takes the stand – the cleark asks “Do you swear to tell the truth, the whole truth, and nothing but the truth – so help you God?” Well, we can utilize that simple little jingle to both define and remember normalization. Our catch phrase will be “… the key, the whole key, and nothing but the key – so help me Codd”. I’ll assume everyone knows who Dr. Codd was – the father of modern relational theory.

First Normal Form

There are four fundamental requirements for an entity or table table to meet first normal form – all of which must hold true:

  • The entity or table must have a defined and active primary key
  • The primary key is the minimal set of columns that uniquely identify a record
  • Each attribute or column in the table contains a single value (i.e. atomicity)
  • Eliminate repeating groups (i.e. no arrays)

It’s these last two bullet items where we’ll focus (i.e. for now, we’ll just agree that our tables should have primary keys as part of good relational design). We’re looking for single atomic values that depend upon the key – hence the first item in our little jingle.

So here’s an entity to review for adherance to first normal form:

y1

So how does CUSTOMER measure up? There is a primary key, and it’s a single numerical column – can’t get much shorter than that. And all of the columns have simple data types that appear to contain single values (i.e. no complex data types like varrays, nested tables or XML Types – where you’d have to ask some additional questions). So at first glance you might accept CUSTOMER as being in 1NF – but you very well could be wrong L

I actually testified as an expert witness in a court case on this first issue. See the LONG field – what and how is that being used for? After talking to the developers and looking at some code, assume that we find out that all the customer’s orders are collected here as the order number followed by a semi-colon. That’s an array – or a collection of values. That means we’re not in 1NF. And more importantly – we have a real world performace issue.

How would you write the query to join CUSTOMER and ORDER? Since SQL can not handle this – you’d have to programmatically process the join. So instead of the server doing the work on potentially large tables, the client code must process the join logic. Thus you have two bottlenecks – the client CPU and the network traffic to send all the records to the client to examine. Ouch!

OK – that example is a bit extreme. The real problem is with ADDRESS_LINE. In this case we have modeling how an address looks on an envelope rather than its constituent parts. Now in the case of an address that’s probably reasonable – but it sets a dangerous precedent. Look at the example below:

y2

The ORDER contains a bunch of dates. The idea is to track each step of processing the order from start to finish. So everytime we find a new “work in progress” step that we want to track, we need to alter the ORDER table and add an optional column. Yuck 😦

A more elegant solution that removes this hidden repeating group is to create a separate entity for work order tracking as shown here:

y3

Now we can add new new “work in progress” steps by merely adding a row of data to the ORDER_STAGES table. In fact, we can even build a maintenance screen to support this so that end-users can easily make such business oriented modifications without sending this back to information systems. In other words, the resulting system is more flexible.

Second Normal Form

There are two fundamental requirements for an entity or table table to meet second normal form – all of which must hold true:

  • The entity or table must be in first normal form
  • All attributes or columns must depend upon the whole key

It’s the last bullet item where we’ll focus. If the primary key is constructed upon a single attribute or column, then we are already in 2NF. But if the primary key is composed of multiple attributes or columns (i.e. a concatenated key), then we are looking for all the attributes or columns to depend upon the whole key – hence the second item in our little jingle.

So here’s an entity to review for adherance to second normal form:

y4

The test is really simple – does each attribute or column depend upon the whole key? First, does PROD_COST depend on both PROD_NAME and VENDOR_NAME? It might be arguable that this holds. But second, does FACTORY_LOCATION depend upon anything other than VENDOR_NAME? Since the answer is most likely not, then we separate this information into its own entity as shown here:

y5

The only major problem that I see more often than not is where people break a cardinal relational database design rule, and have a single attribute or column contain multiple pieces of information. For example, they might design an attribute or column called PRODUCT_KEY which is a VARCHAR2(20) column where the first 10 characters are the product’s name and the second 10 are the vendor’s name. Thus in effect they have a concatenated key that violates 2NF. My point is that you cannot effectively normalize when attributes or columns don’t represent a single concept – so please don’t do it.

Third Normal Form

There are two fundamental requirements for an entity or table table to meet third normal form – all of which must hold true:

  • The entity or table must be in second normal form (and thus also in 1NF)
  • All non-key attributes or columns must depend on nothing but the key

It’s the last bullet item where we’ll focus. Does every non-key attribute or column depend upon nothing but the key – hence the third item in our little jingle.

So here’s an entity to review for adherance to third normal form:

y6

Again the test is quite simple – does each non-key attribute or column depend only on PROD_NUMBER? The last two attributes or columns very clearly depend on just the PROD_CATEGORY, so it violates 3NF. Once again the solution is to simply separate this related information into its own entity as shown here:

y7

Higher Normal Forms

Most shops will be well served if they can regularly identify and correct for up to 3NF. The primary question of whether to normalize further really depends on whom you ask. The database designers and data architects might like to normalize further, but then they don’t have to write the more complex code to manage such highly structured data. There is a clear tradeoff between code readability and maintainibility when you go much higher than 3NF. But it’s my experience that most developers’ skills and patience rapidly erode beyond 3NF. We live in a world of do more faster and with less resources. So let’s not design a perfect database doomed to fail based upon academic or theoretical practice that sacrifices coding efficiency to save cheap disk space. Let’s choose a happy medium that’s effecitve and yet allows us to remain efficient in the implementation. And you heard that from a person with his PhD – so I guess 20 years in the trenches has forced me to practice what I preach J

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

About bscalzo2

Bert Scalzo is an Oracle ACE, blogger, author, speaker and database technology consultant. His work experience includes stints as product manager for DBArtisan and Rapid SQL at IDERA and chief architect for the popular Toad family of products at Quest Software. He has three decades of Oracle® database experience and previously worked for both Oracle Education and Oracle Consulting. Bert holds several Oracle Masters certifications and his academic credentials include a BS, MS and Ph.D. in computer science, as well as an MBA. He has presented at numerous Oracle conferences and user groups, including OOW, ODTUG, IOUG, OAUG, RMOUG and many others. Bert’s areas of interest include data modeling, database benchmarking, database tuning and optimization, "star schema" data warehouses, Linux® and VMware®. He has written for Oracle Technology Network (OTN), Oracle Magazine, Oracle Informant, PC Week (eWeek), Dell Power Solutions Magazine, The LINUX Journal, LINUX.com, Oracle FAQ and Toad World. Bert has also written the following books: • Oracle DBA Guide to Data Warehousing and Star Schemas • TOAD Handbook (1st Edition) • TOAD Handbook (2nd Edition) • TOAD Pocket Reference (2nd Edition) • Database Benchmarking: Practical Methods for Oracle & SQL Server • Advanced Oracle Utilities: The Definitive Reference • Oracle on VMware: Expert Tips for Database Virtualization • Introduction to Oracle: Basic Skills for Any Oracle User • Introduction to SQL Server: Basic Skills for Any SQL Server User • Toad Unleashed • Leveraging Oracle Database 12cR2 Testing Tools • Database Benchmarking and Stress Testing (coming 2018)

Category

Uncategorized