👉🏼 Click here to Join I ❤️ .NET WhatsApp Channel to get 🔔 notified about new articles and other updates.
Database Normalization

Database Normalization

Author - Abdul Rahman (Bhai)

Database

1 Articles

Improve

Table of Contents

  1. What we gonna do?
  2. Why we gonna do?
  3. How we gonna do?
  4. Summary

What we gonna do?

Building a database without proper normalization is like storing every item in your house on your kitchen table—messy, inefficient, and bound to cause problems. In this article, let's learn about Database Normalization and how to eliminate wasteful determinism to create clean, maintainable, and performant database schemas.

Here's your mental anchor for this entire journey: RAWD—Reduce All Wasteful Determinism. That's not "reduce all determinism"—just the wasteful kind. We'll explore what determinism means in databases and how to identify which relationships are helpful and which ones cause problems.

Why we gonna do?

The Real Cost of Bad Data Design

Picture a credit union database from the 1970s. One flat file. One table. Hundreds or even thousands of fields wide. Not because developers were incompetent, but because disk I/O in that era was literally a million times slower than today. A join that takes 0.1 seconds now would have taken over a day back then. Normalization wasn't physically possible at scale.

But today? We don't have that excuse. Modern hardware and databases thrive on normalized structures. Without normalization, you're dealing with:

  • Data redundancy—storing the same information multiple times wastes storage and memory
  • Update anomalies—changing data in one place but forgetting others creates inconsistencies
  • Insert complexity—having to update multiple places for a single logical change
  • Delete problems—removing data might accidentally remove related information you wanted to keep

Think about it this way: if you're storing a value in more than one place, you're creating competing sources of truth. Which one is correct when they disagree? Version control gives us a single source of truth for code—normalization does the same for data.

Understanding Determinism

Determinism just means cause and effect—one thing has a defining relationship with another. Here's a simple example: if you multiply 6 by 5, you get 30. Always. That's determinism.

Now, what do you need to store to reproduce this calculation later? Just the multiplicand (6) and the multiplier (5). You don't need to store the product (30) because it's entirely determined by the other two values. Storing 30 is wasteful—at best it's redundant, at worst it creates errors when your parameters change but you forget to update the stored result.

That's wasteful determinism in a nutshell: storing values that can be calculated or derived from other data you already have.

How we gonna do?

The Normalization Journey: Five Normal Forms

We're going to walk through five normal forms, from the most obvious problems to increasingly subtle issues. Think of normalization as progressive refinement—each form addresses a specific type of wasteful determinism. Let's use a credit union database as our example throughout.

First Normal Form (1NF): No Repeating Groups

First Normal Form Rule: Ensure atomic attributes and a primary key.

Atomicity means indivisibility—your data shouldn't be subdivided into constituent parts. The Greek word atom literally means "no cutting" (a = not, tomos = cut).

Here's the most egregious violation you'll see: pivoted fields—rows of what should be a separate table pivoted into columns. Imagine a Members table like this:


CREATE TABLE Members (
    MemberId INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    AccountNumber1 VARCHAR(20),
    AccountNumber2 VARCHAR(20),
    AccountNumber3 VARCHAR(20)
);
            

Here's what the data looks like with this flawed design:


MemberId  FirstName  LastName  AccountNumber1  AccountNumber2  AccountNumber3
--------  ---------  --------  --------------  --------------  --------------
1001      John       Smith     CHK-12345       SAV-67890       NULL
1002      Jane       Doe       CHK-11111       SAV-22222       MMA-33333
1003      Bob        Johnson   CHK-99999       NULL            NULL
            

The problem? Multiplicity is baked into the schema. What happens when a customer wants a fourth account? You're stuck. Changing table schemas on old mainframe systems required hardware changes—so you'd see hacky workarounds like this:


CREATE TABLE Members (
    MemberId INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    AccountNumber1 VARCHAR(20),
    AccountNumber2 VARCHAR(20),
    AccountNumber3 VARCHAR(20),
    AccountNumber4 VARCHAR(20)  -- Added later, painfully
);
            

Or worse—storing multiple values in a single field separated by commas. This is database malpractice.

The Fix: Create a separate Accounts table with a foreign key back to Members. Now you can have 0, 1, 42, or however many accounts make sense:


CREATE TABLE Members (
    MemberId INT PRIMARY KEY IDENTITY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50)
);

CREATE TABLE Accounts (
    AccountNumber VARCHAR(20),
    MemberId INT,
    PRIMARY KEY (AccountNumber, MemberId),
    FOREIGN KEY (MemberId) REFERENCES Members(MemberId)
);
            

To migrate existing data, you'd use a UNION ALL query to unpivot those horizontal fields into vertical rows:


INSERT INTO Accounts (AccountNumber, MemberId)
SELECT AccountNumber1, MemberId FROM Members WHERE AccountNumber1 IS NOT NULL
UNION ALL
SELECT AccountNumber2, MemberId FROM Members WHERE AccountNumber2 IS NOT NULL
UNION ALL
SELECT AccountNumber3, MemberId FROM Members WHERE AccountNumber3 IS NOT NULL;
            

After normalization to 1NF, the data looks like this:


-- Members Table (cleaned up)
MemberId  FirstName  LastName
--------  ---------  --------
1001      John       Smith
1002      Jane       Doe
1003      Bob        Johnson

-- Accounts Table (new, normalized)
AccountNumber  MemberId
-------------  --------
CHK-12345      1001
SAV-67890      1001
CHK-11111      1002
SAV-22222      1002
MMA-33333      1002
CHK-99999      1003
            

Notice how we can now easily add a fourth, fifth, or hundredth account without modifying the schema!

Second Normal Form (2NF): Full Functional Dependency

Second Normal Form Rule: Ensure non-key attributes depend on the entire primary key, not just part of it.

This one's subtle but incredibly common. Let's say you add a BranchName field to your Accounts table:


CREATE TABLE Accounts (
    AccountNumber VARCHAR(20),
    MemberId INT,
    BranchName VARCHAR(100),
    PRIMARY KEY (AccountNumber, MemberId)
);
            

Here's sample data showing the 2NF violation:


AccountNumber  MemberId  BranchName
-------------  --------  ------------
CHK-12345      1001      Downtown
SAV-67890      1001      Downtown
CHK-11111      1002      Westside
SAV-22222      1002      Westside
MMA-33333      1002      Westside
            

Now here's the question: what drives the value of BranchName? Is it determined by the entire composite key (AccountNumber + MemberId), or just the MemberId part?

After talking to an actual human being at the credit union, you learn that branches are associated with members, not accounts. The BranchName depends only on MemberId—half of your primary key. That's a partial dependency, and it violates 2NF.

The Fix: Move BranchName to the Members table where it belongs:


-- Add BranchName to Members
ALTER TABLE Members ADD BranchName VARCHAR(100);

-- Migrate data with an UPDATE-JOIN
UPDATE Members
SET BranchName = A.BranchName
FROM Members M
INNER JOIN Accounts A ON M.MemberId = A.MemberId;

-- Clean up the redundant column
ALTER TABLE Accounts DROP COLUMN BranchName;
            

After fixing the 2NF violation:


-- Members Table (now includes BranchName)
MemberId  FirstName  LastName  BranchName
--------  ---------  --------  ------------
1001      John       Smith     Downtown
1002      Jane       Doe       Westside
1003      Bob        Johnson   Northside

-- Accounts Table (BranchName removed)
AccountNumber  MemberId
-------------  --------
CHK-12345      1001
SAV-67890      1001
CHK-11111      1002
SAV-22222      1002
MMA-33333      1002
CHK-99999      1003
            

Notice how BranchName no longer repeats for every account—it's stored once per member.

A key insight here: you can't normalize properly without understanding your business rules. First normal form can be inferred mechanically by examining the schema. Beyond that, you need domain knowledge.

Third Normal Form (3NF): No Transitive Dependencies

Third Normal Form Rule: Ensure non-key attributes don't depend on other non-key attributes.

Let's say you also moved BranchManager to the Members table along with BranchName:


CREATE TABLE Members (
    MemberId INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    BranchName VARCHAR(100),
    BranchManager VARCHAR(100)
);
            

Here's the data showing the 3NF violation (notice the redundancy in BranchManager):


MemberId  FirstName  LastName  BranchName  BranchManager
--------  ---------  --------  ----------  -------------
1001      John       Smith     Downtown    Alice Wong
1002      Jane       Doe       Westside    Bob Martinez
1003      Bob        Johnson   Northside   Carol Kim
1004      Sarah      Lee       Downtown    Alice Wong
1005      Mike       Chen      Westside    Bob Martinez
            

The problem? BranchManager is determined by BranchName, not by MemberId. This is called a transitive dependency—BranchManager relies on BranchName, which in turn relies on MemberId.

Here's a strong smell: whenever you see prefixed field names (BranchName, BranchManager, LoanType, LoanAmount), you probably need another table.

The Fix: Create a Branches table:


CREATE TABLE Branches (
    BranchId INT PRIMARY KEY IDENTITY,
    Name VARCHAR(100),
    Manager VARCHAR(100)
);

-- Add foreign key to Members
ALTER TABLE Members ADD BranchId INT;
ALTER TABLE Members ADD FOREIGN KEY (BranchId) REFERENCES Branches(BranchId);

-- Populate Branches from distinct values in Members
INSERT INTO Branches (Name, Manager)
SELECT DISTINCT BranchName, BranchManager FROM Members;

-- Update Members with the new foreign key
UPDATE Members
SET BranchId = B.BranchId
FROM Members M
INNER JOIN Branches B ON M.BranchName = B.Name AND M.BranchManager = B.Manager;

-- Clean up
ALTER TABLE Members DROP COLUMN BranchName;
ALTER TABLE Members DROP COLUMN BranchManager;
ALTER TABLE Members ALTER COLUMN BranchId INT NOT NULL;
            

After normalizing to 3NF:


-- Branches Table (new)
BranchId  Name       Manager
--------  ---------  -------------
1         Downtown   Alice Wong
2         Westside   Bob Martinez
3         Northside  Carol Kim

-- Members Table (BranchName and BranchManager replaced with BranchId)
MemberId  FirstName  LastName  BranchId
--------  ---------  --------  --------
1001      John       Smith     1
1002      Jane       Doe       2
1003      Bob        Johnson   3
1004      Sarah      Lee       1
1005      Mike       Chen      2
            

Now branch information is stored exactly once, and changing a branch manager requires updating only one row!

Boyce-Codd Normal Form (BCNF): Every Determinant Must Be a Candidate Key

Boyce-Codd Normal Form Rule: Every determinant must be a candidate key. In simpler terms, if a field determines another field, it must be able to uniquely identify the row.

BCNF is a special case of 3NF. Let's add a RegionalDirector field to our Branches table:


CREATE TABLE Branches (
    BranchId INT PRIMARY KEY,
    Name VARCHAR(100),
    Manager VARCHAR(100),
    RegionalDirector VARCHAR(100)
);
            

Here's sample data showing the BCNF violation:


BranchId  Name       Manager        RegionalDirector
--------  ---------  -------------  ----------------
1         Downtown   Alice Wong     David Smith
2         Westside   Bob Martinez   David Smith
3         Northside  Carol Kim      Emily Johnson
4         Eastside   Dan Park       Emily Johnson
5         Central    Eva Lopez      David Smith
            

Now imagine the data shows that RegionalDirector is unique per region, not per branch. Multiple branches share the same RegionalDirector. The director determines the region, but the director isn't the primary key of Branches.

RegionalDirector is a determinant (it determines region information) but it's not a candidate key—it can't uniquely identify a branch row.

The Fix: Create a Regions table:


CREATE TABLE Regions (
    RegionId INT PRIMARY KEY IDENTITY,
    Director VARCHAR(100)
);

-- Populate Regions
INSERT INTO Regions (Director)
SELECT DISTINCT RegionalDirector FROM Branches;

-- Add foreign key to Branches
ALTER TABLE Branches ADD RegionId INT;

UPDATE Branches
SET RegionId = R.RegionId
FROM Branches B
INNER JOIN Regions R ON B.RegionalDirector = R.Director;

ALTER TABLE Branches DROP COLUMN RegionalDirector;
ALTER TABLE Branches ALTER COLUMN RegionId INT NOT NULL;
ALTER TABLE Branches ADD FOREIGN KEY (RegionId) REFERENCES Regions(RegionId);
            

After normalizing to BCNF:


-- Regions Table (new)
RegionId  Director
--------  ----------------
1         David Smith
2         Emily Johnson

-- Branches Table (RegionalDirector replaced with RegionId)
BranchId  Name       Manager        RegionId
--------  ---------  -------------  --------
1         Downtown   Alice Wong     1
2         Westside   Bob Martinez   1
3         Northside  Carol Kim      2
4         Eastside   Dan Park       2
5         Central    Eva Lopez      1
            

Now regional director information is stored once per region, not once per branch!

Fourth Normal Form (4NF): No Independent Multi-Valued Dependencies

Fourth Normal Form Rule: Don't store independent lists in the same table.

Imagine a BranchServices table that tracks both the services offered at each branch and the staff roles present:


CREATE TABLE BranchServices (
    BranchId INT,
    Service VARCHAR(50),
    Role VARCHAR(50),
    PRIMARY KEY (BranchId, Service, Role)
);
            

You end up with data like this:


BranchId  Service   Role
--------  --------  -------------------
100       Checking  Teller
100       Savings   Teller
100       Mortgage  Teller
100       Checking  Mortgage Originator
100       Savings   Mortgage Originator
100       Mortgage  Mortgage Originator
            

See the problem? You're creating a Cartesian product. Services and Roles are independent—knowing that a branch offers mortgages doesn't tell you anything about which roles are present. But storing them together forces you to repeat every combination.

What we really want to represent:

  • Branch 100 offers: Checking, Savings, Mortgage (3 facts)
  • Branch 100 has: Teller, Mortgage Originator (2 facts)

But we're storing 3 × 2 = 6 rows to represent 5 independent facts!

The Fix: Split into two independent tables:


CREATE TABLE BranchServices (
    BranchId INT,
    Service VARCHAR(50),
    PRIMARY KEY (BranchId, Service),
    FOREIGN KEY (BranchId) REFERENCES Branches(BranchId)
);

CREATE TABLE BranchRoles (
    BranchId INT,
    Role VARCHAR(50),
    PRIMARY KEY (BranchId, Role),
    FOREIGN KEY (BranchId) REFERENCES Branches(BranchId)
);
            

After normalizing to 4NF:


-- BranchServices Table
BranchId  Service
--------  --------
100       Checking
100       Savings
100       Mortgage

-- BranchRoles Table
BranchId  Role
--------  -------------------
100       Teller
100       Mortgage Originator
            

Now we store exactly 5 rows for 5 independent facts. No Cartesian explosion!

Fifth Normal Form (5NF): No Join Dependencies

Fifth Normal Form Rule: A table should not be splittable into three or more smaller tables that, when joined, don't lose information or create false data.

Here's the key cognitive marker: if your table's primary key is a composite of all its fields, you probably have a 5NF violation.

Imagine the Services table actually represents three independent facts:

  • A given role performs a particular service (Teller performs Checking)
  • A given role is present at a branch (Teller works at Branch 100)
  • A service is offered at a branch (Checking is available at Branch 100)

-- Instead of one table with all three facts mixed:
CREATE TABLE BranchServiceRole (
    BranchId INT,
    Service VARCHAR(50),
    Role VARCHAR(50),
    PRIMARY KEY (BranchId, Service, Role)
);

-- Split into three atomic two-column tables:
CREATE TABLE RoleServices (
    Role VARCHAR(50),
    Service VARCHAR(50),
    PRIMARY KEY (Role, Service)
);

CREATE TABLE BranchRoles (
    BranchId INT,
    Role VARCHAR(50),
    PRIMARY KEY (BranchId, Role)
);

CREATE TABLE BranchServices (
    BranchId INT,
    Service VARCHAR(50),
    PRIMARY KEY (BranchId, Service)
);
            

Before 5NF (single table with join dependency):


-- BranchServiceRole Table (violates 5NF)
BranchId  Service   Role
--------  --------  -------------------
100       Checking  Teller
100       Savings   Teller
101       Mortgage  Mortgage Originator
            

After 5NF (three independent tables):


-- RoleServices: What services can a role perform?
Role                 Service
-------------------  --------
Teller               Checking
Teller               Savings
Mortgage Originator  Mortgage

-- BranchRoles: What roles are at each branch?
BranchId  Role
--------  -------------------
100       Teller
101       Mortgage Originator

-- BranchServices: What services does each branch offer?
BranchId  Service
--------  --------
100       Checking
100       Savings
101       Mortgage
            

Now you can represent each fact independently without creating false implications. If you need to know what services Branch 100 offers, you don't need to know what roles are there—the facts are truly independent.

The Practical Side: Foreign Keys and Constraints

Here's something critical: normalization isn't complete without foreign keys and constraints. When you split tables, you're potentially removing information unless you formalize the relationships:


-- Always add foreign key constraints
ALTER TABLE Accounts 
ADD CONSTRAINT FK_Accounts_Members 
FOREIGN KEY (MemberId) REFERENCES Members(MemberId);

ALTER TABLE Members
ADD CONSTRAINT FK_Members_Branches
FOREIGN KEY (BranchId) REFERENCES Branches(BranchId);

ALTER TABLE Branches
ADD CONSTRAINT FK_Branches_Regions
FOREIGN KEY (RegionId) REFERENCES Regions(RegionId);
            

These constraints serve two purposes:

  • Data integrity—prevent orphaned records
  • Documentation—make relationships explicit and queryable

Use a database diagramming tool that automatically infers relationships from foreign keys. If you can't trace your way across all connected tables using these connectors, you've missed a foreign key.

When NOT to Normalize: OLAP vs OLTP

We've focused on OLTP (Online Transaction Processing)—transactional databases optimized for inserts, updates, and deletes. Normalization shines here because it reduces the number of places you must modify data.

But there's another world: OLAP (Online Analytical Processing). OLAP databases are read-only (or read-mostly) and optimized for reporting and analytics. Here, denormalization makes sense.

Why? Because joins are expensive at scale. For dashboards querying billions of rows, pre-joining tables and storing redundant data in a "data warehouse" or "data mart" can improve query performance by orders of magnitude.

The key insight: OLAP data is derived from normalized OLTP data. You maintain your source of truth in normalized form, then periodically export and denormalize it for analytical workloads. Best of both worlds.

Schema on Write vs Schema on Read

Traditional relational databases enforce schema-on-write—data must conform to the schema when inserted. This ensures consistency and cleanliness.

Document-oriented databases (MongoDB, Cosmos DB) use schema-on-read—data can be any shape when written, and you deal with differences when querying. This is useful for:

  • Highly heterogeneous data sources
  • Rapid ingestion at massive scale (think X or Amazon)
  • Data that genuinely doesn't fit a rigid schema (emails, documents)

At scale, companies use a map-reduce pattern: ingest raw data however you can (schema-on-read), then map and reduce it into normalized, transactional databases (schema-on-write) for business logic.

Real-World Best Practices

After 10+ years working with databases, here's what I've learned:

1. Don't Break Existing Applications
When normalizing a legacy database, use the strangler vine pattern. Keep the old flat file running, set up a nightly ETL to your normalized database, and migrate applications one by one. Only retire the legacy system when everything is migrated.

2. Manage Schema Migrations
Use a migration tool like Flyway or Entity Framework Migrations. Keep ordered scripts in version control, track which migrations have run in a database table, and never modify old migrations—only add new ones.


-- Example migration table
CREATE TABLE SchemaVersions (
    Version INT PRIMARY KEY,
    Description VARCHAR(255),
    AppliedOn DATETIME DEFAULT GETDATE()
);
            

3. Good Enough is Better Than Perfect
If you can't normalize everything in one sprint, do what you can. Getting to 2NF is a huge win even if 3NF has to wait. Normalize incrementally and stabilize between changes.

4. Watch for Null Values
Nulls are weak smells that you might need to break out data into a second table. If half your rows have null values for several columns, consider whether those columns represent optional relationships that should be their own table.

Summary

Database normalization is fundamentally about reducing wasteful determinism—eliminating redundancy by ensuring data is stored in exactly one place. We've covered five normal forms:

  • 1NF: Atomic attributes, no repeating groups, primary key required
  • 2NF: Non-key attributes depend on the entire primary key
  • 3NF: Non-key attributes don't depend on other non-key attributes
  • Boyce-Codd (BCNF): Every determinant must be a candidate key
  • 4NF: No independent multi-valued dependencies (no Cartesian products)
  • 5NF: No join dependencies (can't split into three+ tables without losing info)

Remember: normalization requires understanding your business domain. Statistical analysis can hint at problems, but you need to talk to actual humans to understand the true relationships in your data.

Finally, know when to denormalize. OLTP databases benefit from normalization; OLAP databases benefit from denormalization. Maintain your source of truth in normalized form, then derive optimized read structures as needed.

Master these concepts, and you'll build databases that are maintainable, performant, and truly serve as a single source of truth for your enterprise data.

👉🏼 Click here to Join I ❤️ .NET WhatsApp Channel to get 🔔 notified about new articles and other updates.
  • Database
  • Database
  • Normalization
  • 1NF
  • 2NF
  • 3NF
  • Boyce-Codd
  • 4NF
  • 5NF
  • Data Modeling
  • Schema Design
  • SQL