72 Normalization
Normalization aims to make insert,updates and deletes more understandable and clear. Normalization is about how the data is stored not how the data is retrieved. In fact, normalization process divides a given table to more tables. Then to be able retrieve data in the same format, these divided tables have to be joined Codd (1990). This joining of tables may slightly slow down data access. To repeat, the normalization process improves data quality by reducing duplication and inconsistencies.
An excellent oldie but goodie article about this topic is “A simple guide to five normal forms in relational database theory” by Kent (1983).
The basic ideas of normalization (Codd (1990)) are follows:
- Every table should have a primary key
- Every cell (value in given row and column) in table is identified by the primary key.
- every cell contains only atomic (single) values.
There are 11 different types of normalization according to wikipedia page of database normalization (Wikipedia (2018)). But first three normal forms are most important. These 3 normal forms prevent insert,update and delete anomalies. In fact, when your tables are in 3rd normal form, they are considered in normalized form.
We will cover only these three normal forms since this is an introductory course.
72.1 First Normal Form
In the first normal form, all records must contain same number of fields. The first normal form forbids multi-valued attributes and groups. Every cell should contain only atomic values.
For example, in the following table: skills, separated using commas, are multi-valued attributes.
Id | Skills |
---|---|
1 | Programming, Web Design, Databases |
2 | Databases, Algorithms |
Not in 1st normal form
To be in the first normal form, a given cell should not contain complex data structures, like sets or lists.
In the following table: phone numbers are repeated for the same person. This again violates 1st normal form.
Id | Name | Phone |
---|---|---|
1 | Atilla | Phone 1 |
2 | Atilla | Phone 2 |
Not in 1st normal form
72.1.1 Side note: atomic
There are disagreements about definition of atomic cell: Consider following table:
Id | Name | Address |
---|---|---|
1 | Atilla Özgür | Example Street, No 20, Ankara Türkiye |
2 | Duru Özgür | Example Str, No 20, MG Germany |
Is this table in 1st normal form?
Does address and name values are atomic in this table? It could be argued that:
- Name should be divided to first name and last name columns.
- Address column should be divided so that at least city and country information should be in different columns.
But where to stop? In the same way:
- Any string value could be thought as list of characters.
- A date value could be divided to year, month and day values.
My opinion is that: think about your application queries.
If your application needs to deal with first and last name separately, for example sort/filter according to last names, then this name column should be divided. In the same way, if your application needs to filter/order data according to city and country, then, these two information, city and country, should be in different columns. For date values, specific date SQL functions will be helpful; therefore, you may not need to divide date columns.
First normal form is actually a design guideline since relational database theory does not deal with variable number of fields Kent (1983).
72.2 Second Normal Form
Second normal form is only relevant when the key is composite key. That is when your key consists of at least two columns. This is about natural keys, even if you add synthetic key to your table, if your natural keys are composite, this situation persists.
Consider following example by Kent (1983), example table is slightly changed and example data is added for better understanding.
Equipment | Warehouse | Quantity | Warehouse-Address |
---|---|---|---|
Laptop | Main Warehouse | 50 | 123 Main Street, City A |
Printer | Main Warehouse | 20 | 123 Main Street, City A |
Monitor | Branch Warehouse | 15 | 456 Side Street, City B |
Laptop | Branch Warehouse | 10 | 456 Side Street, City B |
Here, composite natural key is equipment and warehouse. But, warehouse address is a fact about warehouse not about this whole key. The problems of this design are as follows:
- Warehouse address repeated for every equipment
- If a warehouse address changed, every row in this table for this warehouse has to be changed.
- Since data is redundant, address information may be updated wrongly. (Update anomaly)
- If at a point of time, there is no equipment in a given warehouse, we loose the address information of the warehouse
To change this to 2nd normal form.
- First normal form should be satisfied.
- We need to divide this table to two tables as follows. That is this extra information is moved to its own table.
Equipment | Warehouse | Quantity |
---|---|---|
Laptop | Main Warehouse | 50 |
Printer | Main Warehouse | 20 |
Monitor | Branch Warehouse | 15 |
Laptop | Branch Warehouse | 10 |
Warehouse | Warehouse-Address |
---|---|
Main Warehouse | 123 Main St, Cityville |
Main Warehouse | 123 Main St, Cityville |
Branch Warehouse | 456 Branch Ave, Suburbia |
Branch Warehouse | 456 Branch Ave, Suburbia |
72.3 Third Normal Form
Third normal form is about extra information in the columns. When a non key column contains information about another non key column, 3rd normal form is violated.
Consider following example:
Employee | Department | Building |
---|---|---|
John Doe | Sales | Main Building |
Jane Smith | Marketing | Main Building |
David Lee | Engineering | Building Eng |
Atilla Özgür | Engineering | Building Eng |
Sarah Jones | Finance | Main Building |
Michael Brown | HR | Main Building |
If each department is located in single building, then the building column is about the department not about the employee. This design has following problems:
- The building column is repeated for every department
- If the building changes for a department, lets say company decided to move marketing to a new building, every marketing employee row has to be changed.
- due to redundant information, data update, insert anomalies may occur. For the same department, different rows may show different building information.
- if a department has no employees (i.e. a newly established one), how to store building information of this department?
To convert this table to the 3rd normal form:
- First and second normal forms has to be satisfied
- divide this extra information to its own table. See following tables
Employee | Department |
---|---|
John Doe | Sales |
Jane Smith | Marketing |
David Lee | Engineering |
Atilla Özgür | Engineering |
Sarah Jones | Finance |
Michael Brown | HR |
Department | Building |
---|---|
Sales | Main Building |
Marketing | Main Building |
Engineering | Building Eng |
Engineering | Building Eng |
Finance | Main Building |
HR | Main Building |