
Database Normalization
Author - Abdul Rahman (Bhai)
What we gonna do?
Building a database without proper normalization is like storing every item in your house on your kitchen table—messy, inefficient, and bound to cause problems. In this article, let's learn about Database Normalization and how to eliminate wasteful determinism to create clean, maintainable, and performant database schemas.
Here's your mental anchor for this entire journey: RAWD—Reduce All Wasteful Determinism. That's not "reduce all determinism"—just the wasteful kind. We'll explore what determinism means in databases and how to identify which relationships are helpful and which ones cause problems.
Why we gonna do?
The Real Cost of Bad Data Design
Picture a credit union database from the 1970s. One flat file. One table. Hundreds or even thousands of fields wide. Not because developers were incompetent, but because disk I/O in that era was on the order of a million times slower than today. A join that takes 0.1 seconds now would have taken over a day back then. Normalization wasn't physically possible at scale.
But today? We don't have that excuse. Modern hardware and databases thrive on normalized structures. Without normalization, you're dealing with:
- Data redundancy—storing the same information multiple times wastes storage and memory
- Update anomalies—changing data in one place but forgetting others creates inconsistencies
- Insert complexity—having to update multiple places for a single logical change
- Delete problems—removing data might accidentally remove related information you wanted to keep
Think about it this way: if you're storing a value in more than one place, you're creating competing sources of truth. Which one is correct when they disagree? Version control gives us a single source of truth for code—normalization does the same for data.
Understanding Determinism
Determinism just means cause and effect—one thing has a defining relationship with another. Here's a simple example: if you multiply 6 by 5, you get 30. Always. That's determinism.
Now, what do you need to store to reproduce this calculation later? Just the multiplicand (6) and the multiplier (5). You don't need to store the product (30) because it's entirely determined by the other two values. Storing 30 is wasteful—at best it's redundant, at worst it creates errors when your parameters change but you forget to update the stored result.
That's wasteful determinism in a nutshell: storing values that can be calculated or derived from other data you already have.
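To make this concrete, here's a sketch using a hypothetical OrderLines table (the table and column names are illustrative). Storing the product alongside its inputs invites drift; the computed-column syntax shown in the second table is SQL Server's, and other engines offer equivalents:

```sql
-- Wasteful: Total duplicates information already present in Price and Quantity
CREATE TABLE OrderLines_Bad (
    OrderLineId INT IDENTITY PRIMARY KEY,
    Price DECIMAL(10,2),
    Quantity INT,
    Total DECIMAL(12,2)  -- redundant: always Price * Quantity
);

-- Better: derive the value so it can never fall out of sync
CREATE TABLE OrderLines (
    OrderLineId INT IDENTITY PRIMARY KEY,
    Price DECIMAL(10,2),
    Quantity INT,
    Total AS (Price * Quantity)  -- computed column, never stored stale
);
```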
How we gonna do?
The Normalization Journey: Five Normal Forms
We're going to walk through five normal forms (plus Boyce-Codd, a refinement of the third), from the most obvious problems to increasingly subtle issues. Think of normalization as progressive refinement—each form addresses a specific type of wasteful determinism. Let's use a credit union database as our example throughout.
First Normal Form (1NF): No Repeating Groups
First Normal Form Rule: Ensure atomic attributes and a primary key.
Atomicity means indivisibility—your data shouldn't be subdivided into constituent parts. The Greek root atomos literally means "uncuttable" (a = not, tomos = cut).
Here's the most egregious violation you'll see: pivoted fields—rows of what should be a separate table pivoted into columns. Imagine a Members table like this:
CREATE TABLE Members (
    MemberId INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    AccountNumber1 VARCHAR(20),
    AccountNumber2 VARCHAR(20),
    AccountNumber3 VARCHAR(20)
);
Here's what the data looks like with this flawed design:
MemberId  FirstName  LastName  AccountNumber1  AccountNumber2  AccountNumber3
--------  ---------  --------  --------------  --------------  --------------
1001      John       Smith     CHK-12345       SAV-67890       NULL
1002      Jane       Doe       CHK-11111       SAV-22222       MMA-33333
1003      Bob        Johnson   CHK-99999       NULL            NULL
The problem? Multiplicity is baked into the schema. What happens when a customer wants a fourth account? You're stuck. Schema changes on old mainframe systems were slow, expensive, and risky—so you'd see hacky workarounds like this:
CREATE TABLE Members (
    MemberId INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    AccountNumber1 VARCHAR(20),
    AccountNumber2 VARCHAR(20),
    AccountNumber3 VARCHAR(20),
    AccountNumber4 VARCHAR(20) -- Added later, painfully
);
Or worse—storing multiple values in a single field separated by commas. This is database malpractice.
The Fix: Create a separate Accounts table with a foreign key back to Members. Now you can have 0, 1, 42, or however many accounts make sense:
CREATE TABLE Members (
    MemberId INT IDENTITY PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50)
);
CREATE TABLE Accounts (
    AccountNumber VARCHAR(20),
    MemberId INT,
    PRIMARY KEY (AccountNumber, MemberId),
    FOREIGN KEY (MemberId) REFERENCES Members(MemberId)
);
To migrate existing data, you'd use a UNION ALL query to unpivot those horizontal fields into vertical rows:
INSERT INTO Accounts (AccountNumber, MemberId)
SELECT AccountNumber1, MemberId FROM Members WHERE AccountNumber1 IS NOT NULL
UNION ALL
SELECT AccountNumber2, MemberId FROM Members WHERE AccountNumber2 IS NOT NULL
UNION ALL
SELECT AccountNumber3, MemberId FROM Members WHERE AccountNumber3 IS NOT NULL;
After normalization to 1NF, the data looks like this:
-- Members Table (cleaned up)
MemberId  FirstName  LastName
--------  ---------  --------
1001      John       Smith
1002      Jane       Doe
1003      Bob        Johnson

-- Accounts Table (new, normalized)
AccountNumber  MemberId
-------------  --------
CHK-12345      1001
SAV-67890      1001
CHK-11111      1002
SAV-22222      1002
MMA-33333      1002
CHK-99999      1003
Notice how we can now easily add a fourth, fifth, or hundredth account without modifying the schema!
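And retrieving a member's accounts—however many there are—becomes a straightforward join against the normalized tables:

```sql
-- List every account held by member 1002
SELECT M.FirstName, M.LastName, A.AccountNumber
FROM Members M
INNER JOIN Accounts A ON A.MemberId = M.MemberId
WHERE M.MemberId = 1002;
```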
Second Normal Form (2NF): Full Functional Dependency
Second Normal Form Rule: Ensure non-key attributes depend on the entire primary key, not just part of it.
This one's subtle but incredibly common. Let's say you add a BranchName field to your Accounts table:
CREATE TABLE Accounts (
    AccountNumber VARCHAR(20),
    MemberId INT,
    BranchName VARCHAR(100),
    PRIMARY KEY (AccountNumber, MemberId)
);
Here's sample data showing the 2NF violation:
AccountNumber  MemberId  BranchName
-------------  --------  ----------
CHK-12345      1001      Downtown
SAV-67890      1001      Downtown
CHK-11111      1002      Westside
SAV-22222      1002      Westside
MMA-33333      1002      Westside
Now here's the question: what drives the value of BranchName? Is it determined by the entire composite key (AccountNumber + MemberId), or just the MemberId part?
After talking to an actual human being at the credit union, you learn that branches are associated with members, not accounts. The BranchName depends only on MemberId—half of your primary key. That's a partial dependency, and it violates 2NF.
The Fix: Move BranchName to the Members table where it belongs:
-- Add BranchName to Members
ALTER TABLE Members ADD BranchName VARCHAR(100);
-- Migrate data with an UPDATE...FROM join (update the alias, not the table name,
-- so SQL Server doesn't treat Members as a second, unjoined instance)
UPDATE M
SET M.BranchName = A.BranchName
FROM Members M
INNER JOIN Accounts A ON M.MemberId = A.MemberId;
-- Clean up the redundant column
ALTER TABLE Accounts DROP COLUMN BranchName;
After fixing the 2NF violation:
-- Members Table (now includes BranchName)
MemberId  FirstName  LastName  BranchName
--------  ---------  --------  ----------
1001      John       Smith     Downtown
1002      Jane       Doe       Westside
1003      Bob        Johnson   Northside

-- Accounts Table (BranchName removed)
AccountNumber  MemberId
-------------  --------
CHK-12345      1001
SAV-67890      1001
CHK-11111      1002
SAV-22222      1002
MMA-33333      1002
CHK-99999      1003
Notice how BranchName no longer repeats for every account—it's stored once per member.
A key insight here: you can't normalize properly without understanding your business rules. First normal form can be inferred mechanically by examining the schema. Beyond that, you need domain knowledge.
Third Normal Form (3NF): No Transitive Dependencies
Third Normal Form Rule: Ensure non-key attributes don't depend on other non-key attributes.
Let's say you also moved BranchManager to the Members table along with BranchName:
CREATE TABLE Members (
    MemberId INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    BranchName VARCHAR(100),
    BranchManager VARCHAR(100)
);
Here's the data showing the 3NF violation (notice the redundancy in BranchManager):
MemberId  FirstName  LastName  BranchName  BranchManager
--------  ---------  --------  ----------  -------------
1001      John       Smith     Downtown    Alice Wong
1002      Jane       Doe       Westside    Bob Martinez
1003      Bob        Johnson   Northside   Carol Kim
1004      Sarah      Lee       Downtown    Alice Wong
1005      Mike       Chen      Westside    Bob Martinez
The problem? BranchManager is determined by BranchName, not by MemberId. This is called a transitive dependency—BranchManager relies on BranchName, which in turn relies on MemberId.
Here's a strong smell: whenever you see prefixed field names (BranchName, BranchManager, LoanType, LoanAmount), you probably need another table.
The Fix: Create a Branches table:
CREATE TABLE Branches (
    BranchId INT IDENTITY PRIMARY KEY,
    Name VARCHAR(100),
    Manager VARCHAR(100)
);
-- Add foreign key to Members
ALTER TABLE Members ADD BranchId INT;
ALTER TABLE Members ADD FOREIGN KEY (BranchId) REFERENCES Branches(BranchId);
-- Populate Branches from distinct values in Members
INSERT INTO Branches (Name, Manager)
SELECT DISTINCT BranchName, BranchManager FROM Members;
-- Update Members with the new foreign key (note: update the alias M)
UPDATE M
SET M.BranchId = B.BranchId
FROM Members M
INNER JOIN Branches B ON M.BranchName = B.Name AND M.BranchManager = B.Manager;
-- Clean up
ALTER TABLE Members DROP COLUMN BranchName;
ALTER TABLE Members DROP COLUMN BranchManager;
ALTER TABLE Members ALTER COLUMN BranchId INT NOT NULL;
After normalizing to 3NF:
-- Branches Table (new)
BranchId  Name       Manager
--------  ---------  -------------
1         Downtown   Alice Wong
2         Westside   Bob Martinez
3         Northside  Carol Kim

-- Members Table (BranchName and BranchManager replaced with BranchId)
MemberId  FirstName  LastName  BranchId
--------  ---------  --------  --------
1001      John       Smith     1
1002      Jane       Doe       2
1003      Bob        Johnson   3
1004      Sarah      Lee       1
1005      Mike       Chen      2
Now branch information is stored exactly once, and changing a branch manager requires updating only one row!
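For example, when the Downtown branch gets a new manager, the change touches exactly one row (the name below is hypothetical):

```sql
-- One row changes, and every Downtown member sees the new manager
UPDATE Branches
SET Manager = 'Dana Patel'  -- hypothetical incoming manager
WHERE BranchId = 1;
```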
Boyce-Codd Normal Form (BCNF): Every Determinant Must Be a Candidate Key
Boyce-Codd Normal Form Rule: Every determinant must be a candidate key. In simpler terms, if a field determines another field, it must be able to uniquely identify the row.
BCNF is a stricter refinement of 3NF (you'll sometimes see it called 3.5NF). Let's add a RegionalDirector field to our Branches table:
CREATE TABLE Branches (
    BranchId INT PRIMARY KEY,
    Name VARCHAR(100),
    Manager VARCHAR(100),
    RegionalDirector VARCHAR(100)
);
Here's sample data showing the BCNF violation:
BranchId  Name       Manager       RegionalDirector
--------  ---------  ------------  ----------------
1         Downtown   Alice Wong    David Smith
2         Westside   Bob Martinez  David Smith
3         Northside  Carol Kim     Emily Johnson
4         Eastside   Dan Park      Emily Johnson
5         Central    Eva Lopez     David Smith
Now imagine the data shows that each region has exactly one RegionalDirector—directors are assigned per region, not per branch, so multiple branches share the same one. The director determines the region, but the director isn't the primary key of Branches.
RegionalDirector is a determinant (it determines region information) but it's not a candidate key—it can't uniquely identify a branch row.
The Fix: Create a Regions table:
CREATE TABLE Regions (
    RegionId INT IDENTITY PRIMARY KEY,
    Director VARCHAR(100)
);
-- Populate Regions
INSERT INTO Regions (Director)
SELECT DISTINCT RegionalDirector FROM Branches;
-- Add foreign key to Branches (note: update the alias B)
ALTER TABLE Branches ADD RegionId INT;
UPDATE B
SET B.RegionId = R.RegionId
FROM Branches B
INNER JOIN Regions R ON B.RegionalDirector = R.Director;
ALTER TABLE Branches DROP COLUMN RegionalDirector;
ALTER TABLE Branches ALTER COLUMN RegionId INT NOT NULL;
ALTER TABLE Branches ADD FOREIGN KEY (RegionId) REFERENCES Regions(RegionId);
After normalizing to BCNF:
-- Regions Table (new)
RegionId  Director
--------  -------------
1         David Smith
2         Emily Johnson

-- Branches Table (RegionalDirector replaced with RegionId)
BranchId  Name       Manager       RegionId
--------  ---------  ------------  --------
1         Downtown   Alice Wong    1
2         Westside   Bob Martinez  1
3         Northside  Carol Kim     2
4         Eastside   Dan Park      2
5         Central    Eva Lopez     1
Now regional director information is stored once per region, not once per branch!
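When a report does need the director alongside each branch, you simply join back through Regions:

```sql
-- Reassemble the pre-BCNF view at query time
SELECT B.Name AS Branch, B.Manager, R.Director AS RegionalDirector
FROM Branches B
INNER JOIN Regions R ON R.RegionId = B.RegionId;
```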
Fourth Normal Form (4NF): No Independent Multi-Valued Dependencies
Fourth Normal Form Rule: Don't store independent lists in the same table.
Imagine a BranchServices table that tracks both the services offered at each branch and the staff roles present:
CREATE TABLE BranchServices (
    BranchId INT,
    Service VARCHAR(50),
    Role VARCHAR(50),
    PRIMARY KEY (BranchId, Service, Role)
);
You end up with data like this:
BranchId  Service   Role
--------  --------  -------------------
100       Checking  Teller
100       Savings   Teller
100       Mortgage  Teller
100       Checking  Mortgage Originator
100       Savings   Mortgage Originator
100       Mortgage  Mortgage Originator
See the problem? You're creating a Cartesian product. Services and Roles are independent—knowing that a branch offers mortgages doesn't tell you anything about which roles are present. But storing them together forces you to repeat every combination.
What we really want to represent:
- Branch 100 offers: Checking, Savings, Mortgage (3 facts)
- Branch 100 has: Teller, Mortgage Originator (2 facts)
But we're storing 3 × 2 = 6 rows to represent 5 independent facts!
The Fix: Split into two independent tables:
CREATE TABLE BranchServices (
    BranchId INT,
    Service VARCHAR(50),
    PRIMARY KEY (BranchId, Service),
    FOREIGN KEY (BranchId) REFERENCES Branches(BranchId)
);
CREATE TABLE BranchRoles (
    BranchId INT,
    Role VARCHAR(50),
    PRIMARY KEY (BranchId, Role),
    FOREIGN KEY (BranchId) REFERENCES Branches(BranchId)
);
After normalizing to 4NF:
-- BranchServices Table
BranchId  Service
--------  --------
100       Checking
100       Savings
100       Mortgage

-- BranchRoles Table
BranchId  Role
--------  -------------------
100       Teller
100       Mortgage Originator
Now we store exactly 5 rows for 5 independent facts. No Cartesian explosion!
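And because the two lists are independent, adding a new service no longer multiplies rows across every role:

```sql
-- Branch 100 starts offering auto loans: exactly one new row,
-- regardless of how many roles the branch has
INSERT INTO BranchServices (BranchId, Service)
VALUES (100, 'Auto Loan');
```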
Fifth Normal Form (5NF): No Join Dependencies
Fifth Normal Form Rule: If a table can be losslessly split into three or more smaller tables—tables that, when joined back together, reproduce exactly the original rows, losing no information and creating no false data—then it should be split.
Here's the key cognitive marker: if your table's primary key is a composite of all its fields, you probably have a 5NF violation.
Imagine the Services table actually represents three independent facts:
- A given role performs a particular service (Teller performs Checking)
- A given role is present at a branch (Teller works at Branch 100)
- A service is offered at a branch (Checking is available at Branch 100)
-- Instead of one table with all three facts mixed:
-- Instead of one table with all three facts mixed:
CREATE TABLE BranchServiceRole (
    BranchId INT,
    Service VARCHAR(50),
    Role VARCHAR(50),
    PRIMARY KEY (BranchId, Service, Role)
);
-- Split into three atomic two-column tables:
CREATE TABLE RoleServices (
    Role VARCHAR(50),
    Service VARCHAR(50),
    PRIMARY KEY (Role, Service)
);
CREATE TABLE BranchRoles (
    BranchId INT,
    Role VARCHAR(50),
    PRIMARY KEY (BranchId, Role)
);
CREATE TABLE BranchServices (
    BranchId INT,
    Service VARCHAR(50),
    PRIMARY KEY (BranchId, Service)
);
Before 5NF (single table with join dependency):
-- BranchServiceRole Table (violates 5NF)
BranchId  Service   Role
--------  --------  -------------------
100       Checking  Teller
100       Savings   Teller
101       Mortgage  Mortgage Originator
After 5NF (three independent tables):
-- RoleServices: What services can a role perform?
Role                 Service
-------------------  --------
Teller               Checking
Teller               Savings
Mortgage Originator  Mortgage

-- BranchRoles: What roles are at each branch?
BranchId  Role
--------  -------------------
100       Teller
101       Mortgage Originator

-- BranchServices: What services does each branch offer?
BranchId  Service
--------  --------
100       Checking
100       Savings
101       Mortgage
Now you can represent each fact independently without creating false implications. If you need to know what services Branch 100 offers, you don't need to know what roles are there—the facts are truly independent.
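If you ever do need the combined view, you can reconstruct it with a three-way join—the join dependency now lives in the query rather than in the storage:

```sql
-- Which (branch, service, role) combinations are actually possible?
SELECT BS.BranchId, BS.Service, BR.Role
FROM BranchServices BS
INNER JOIN BranchRoles BR ON BR.BranchId = BS.BranchId
INNER JOIN RoleServices RS ON RS.Role = BR.Role AND RS.Service = BS.Service;
```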
The Practical Side: Foreign Keys and Constraints
Here's something critical: normalization isn't complete without foreign keys and constraints. When you split tables, you're potentially removing information unless you formalize the relationships:
-- Always add foreign key constraints
ALTER TABLE Accounts
ADD CONSTRAINT FK_Accounts_Members
FOREIGN KEY (MemberId) REFERENCES Members(MemberId);
ALTER TABLE Members
ADD CONSTRAINT FK_Members_Branches
FOREIGN KEY (BranchId) REFERENCES Branches(BranchId);
ALTER TABLE Branches
ADD CONSTRAINT FK_Branches_Regions
FOREIGN KEY (RegionId) REFERENCES Regions(RegionId);
These constraints serve two purposes:
- Data integrity—prevent orphaned records
- Documentation—make relationships explicit and queryable
Use a database diagramming tool that automatically infers relationships from foreign keys. If you can't trace your way across all connected tables using these connectors, you've missed a foreign key.
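With the constraints in place, the database itself rejects orphans. This insert, for example, fails with a foreign key violation because no member 9999 exists (the account number is hypothetical):

```sql
-- Rejected: violates FK_Accounts_Members, since Members has no MemberId 9999
INSERT INTO Accounts (AccountNumber, MemberId)
VALUES ('CHK-55555', 9999);
```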
When NOT to Normalize: OLAP vs OLTP
We've focused on OLTP (Online Transaction Processing)—transactional databases optimized for inserts, updates, and deletes. Normalization shines here because it reduces the number of places you must modify data.
But there's another world: OLAP (Online Analytical Processing). OLAP databases are read-only (or read-mostly) and optimized for reporting and analytics. Here, denormalization makes sense.
Why? Because joins are expensive at scale. For dashboards querying billions of rows, pre-joining tables and storing redundant data in a "data warehouse" or "data mart" can improve query performance by orders of magnitude.
The key insight: OLAP data is derived from normalized OLTP data. You maintain your source of truth in normalized form, then periodically export and denormalize it for analytical workloads. Best of both worlds.
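As a sketch, a nightly job might materialize a denormalized reporting table from the normalized source. The target table name here is illustrative, and SELECT...INTO is SQL Server's syntax for creating a table from a query:

```sql
-- Pre-join once, read many times: a flattened table for the reporting side
SELECT M.MemberId, M.FirstName, M.LastName,
       B.Name AS BranchName, R.Director AS RegionalDirector,
       A.AccountNumber
INTO dbo.MemberAccountReport  -- hypothetical data mart table
FROM Members M
INNER JOIN Branches B ON B.BranchId = M.BranchId
INNER JOIN Regions R ON R.RegionId = B.RegionId
INNER JOIN Accounts A ON A.MemberId = M.MemberId;
```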
Schema on Write vs Schema on Read
Traditional relational databases enforce schema-on-write—data must conform to the schema when inserted. This ensures consistency and cleanliness.
Document-oriented databases (MongoDB, Cosmos DB) use schema-on-read—data can be any shape when written, and you deal with differences when querying. This is useful for:
- Highly heterogeneous data sources
- Rapid ingestion at massive scale (think X or Amazon)
- Data that genuinely doesn't fit a rigid schema (emails, documents)
At scale, companies use a map-reduce pattern: ingest raw data however you can (schema-on-read), then map and reduce it into normalized, transactional databases (schema-on-write) for business logic.
Real-World Best Practices
After 10+ years working with databases, here's what I've learned:
1. Don't Break Existing Applications
When normalizing a legacy database, use the strangler fig pattern. Keep the old flat file running, set up a nightly ETL to your normalized database, and migrate applications one by one. Only retire the legacy system when everything is migrated.
2. Manage Schema Migrations
Use a migration tool like Flyway or Entity Framework Migrations. Keep ordered scripts in version control, track which migrations have run in a database table, and never modify old migrations—only add new ones.
-- Example migration table
CREATE TABLE SchemaVersions (
    Version INT PRIMARY KEY,
    Description VARCHAR(255),
    AppliedOn DATETIME DEFAULT GETDATE()
);
3. Good Enough is Better Than Perfect
If you can't normalize everything in one sprint, do what you can. Getting to 2NF is a huge win even if 3NF has to wait. Normalize incrementally and stabilize between changes.
4. Watch for Null Values
Nulls are a weak smell that you might need to break data out into a second table. If half your rows have null values for several columns, consider whether those columns represent optional relationships that should be their own table.
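For example, if only some members have a mailing address on file, those columns might belong in an optional one-to-one table rather than as mostly-null columns on Members (the table and columns below are a hypothetical sketch):

```sql
-- Only members who actually have an address get a row here
CREATE TABLE MemberAddresses (
    MemberId INT PRIMARY KEY,
    Street VARCHAR(100),
    City VARCHAR(50),
    PostalCode VARCHAR(10),
    FOREIGN KEY (MemberId) REFERENCES Members(MemberId)
);
```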
Summary
Database normalization is fundamentally about reducing wasteful determinism—eliminating redundancy by ensuring data is stored in exactly one place. We've covered five normal forms, plus BCNF:
- 1NF: Atomic attributes, no repeating groups, primary key required
- 2NF: Non-key attributes depend on the entire primary key
- 3NF: Non-key attributes don't depend on other non-key attributes
- Boyce-Codd (BCNF): Every determinant must be a candidate key
- 4NF: No independent multi-valued dependencies (no Cartesian products)
- 5NF: No join dependencies (can't split into three+ tables without losing info)
Remember: normalization requires understanding your business domain. Statistical analysis can hint at problems, but you need to talk to actual humans to understand the true relationships in your data.
Finally, know when to denormalize. OLTP databases benefit from normalization; OLAP databases benefit from denormalization. Maintain your source of truth in normalized form, then derive optimized read structures as needed.
Master these concepts, and you'll build databases that are maintainable, performant, and truly serve as a single source of truth for your enterprise data.