72  Normalization

Normalization aims to make insert,updates and deletes more understandable and clear. Normalization is about how the data is stored not how the data is retrieved. In fact, normalization process divides a given table to more tables. Then to be able retrieve data in the same format, these divided tables have to be joined Codd (1990). This joining of tables may slightly slow down data access. To repeat, the normalization process improves data quality by reducing duplication and inconsistencies.

An excellent oldie but goodie article about this topic is “A simple guide to five normal forms in relational database theory” by Kent (1983).

The basic ideas of normalization (Codd (1990)) are follows:

There are 11 different types of normalization according to wikipedia page of database normalization (Wikipedia (2018)). But first three normal forms are most important. These 3 normal forms prevent insert,update and delete anomalies. In fact, when your tables are in 3rd normal form, they are considered in normalized form.

We will cover only these three normal forms since this is an introductory course.

72.1 First Normal Form

In the first normal form, all records must contain same number of fields. The first normal form forbids multi-valued attributes and groups. Every cell should contain only atomic values.

For example, in the following table: skills, separated using commas, are multi-valued attributes.

Id Skills
1 Programming, Web Design, Databases
2 Databases, Algorithms
Important

Not in 1st normal form

To be in the first normal form, a given cell should not contain complex data structures, like sets or lists.

In the following table: phone numbers are repeated for the same person. This again violates 1st normal form.

Id Name Phone
1 Atilla Phone 1
2 Atilla Phone 2
Important

Not in 1st normal form

72.1.1 Side note: atomic

There are disagreements about definition of atomic cell: Consider following table:

Id Name Address
1 Atilla Özgür Example Street, No 20, Ankara Türkiye
2 Duru Özgür Example Str, No 20, MG Germany
Tip

Is this table in 1st normal form?

Does address and name values are atomic in this table? It could be argued that:

  1. Name should be divided to first name and last name columns.
  2. Address column should be divided so that at least city and country information should be in different columns.

But where to stop? In the same way:

  • Any string value could be thought as list of characters.
  • A date value could be divided to year, month and day values.

My opinion is that: think about your application queries.

If your application needs to deal with first and last name separately, for example sort/filter according to last names, then this name column should be divided. In the same way, if your application needs to filter/order data according to city and country, then, these two information, city and country, should be in different columns. For date values, specific date SQL functions will be helpful; therefore, you may not need to divide date columns.

First normal form is actually a design guideline since relational database theory does not deal with variable number of fields Kent (1983).

72.2 Second Normal Form

Second normal form is only relevant when the key is composite key. That is when your key consists of at least two columns. This is about natural keys, even if you add synthetic key to your table, if your natural keys are composite, this situation persists.

Consider following example by Kent (1983), example table is slightly changed and example data is added for better understanding.

Equipment Warehouse Quantity Warehouse-Address
Laptop Main Warehouse 50 123 Main Street, City A
Printer Main Warehouse 20 123 Main Street, City A
Monitor Branch Warehouse 15 456 Side Street, City B
Laptop Branch Warehouse 10 456 Side Street, City B

Here, composite natural key is equipment and warehouse. But, warehouse address is a fact about warehouse not about this whole key. The problems of this design are as follows:

  1. Warehouse address repeated for every equipment
  2. If a warehouse address changed, every row in this table for this warehouse has to be changed.
  3. Since data is redundant, address information may be updated wrongly. (Update anomaly)
  4. If at a point of time, there is no equipment in a given warehouse, we loose the address information of the warehouse

To change this to 2nd normal form.

  1. First normal form should be satisfied.
  2. We need to divide this table to two tables as follows. That is this extra information is moved to its own table.
Equipment Warehouse Quantity
Laptop Main Warehouse 50
Printer Main Warehouse 20
Monitor Branch Warehouse 15
Laptop Branch Warehouse 10
Warehouse Warehouse-Address
Main Warehouse 123 Main St, Cityville
Main Warehouse 123 Main St, Cityville
Branch Warehouse 456 Branch Ave, Suburbia
Branch Warehouse 456 Branch Ave, Suburbia

72.3 Third Normal Form

Third normal form is about extra information in the columns. When a non key column contains information about another non key column, 3rd normal form is violated.

Consider following example:

Employee Department Building
John Doe Sales Main Building
Jane Smith Marketing Main Building
David Lee Engineering Building Eng
Atilla Özgür Engineering Building Eng
Sarah Jones Finance Main Building
Michael Brown HR Main Building

If each department is located in single building, then the building column is about the department not about the employee. This design has following problems:

  • The building column is repeated for every department
  • If the building changes for a department, lets say company decided to move marketing to a new building, every marketing employee row has to be changed.
  • due to redundant information, data update, insert anomalies may occur. For the same department, different rows may show different building information.
  • if a department has no employees (i.e. a newly established one), how to store building information of this department?

To convert this table to the 3rd normal form:

  1. First and second normal forms has to be satisfied
  2. divide this extra information to its own table. See following tables
Employee Department
John Doe Sales
Jane Smith Marketing
David Lee Engineering
Atilla Özgür Engineering
Sarah Jones Finance
Michael Brown HR
Department Building
Sales Main Building
Marketing Main Building
Engineering Building Eng
Engineering Building Eng
Finance Main Building
HR Main Building