At the heart of every software project lies the database: the foundation that determines how data is stored, accessed, and updated, whatever your definition of a database may be (relational, NoSQL, time series, etc.). Database theory provides the principles for structuring and organizing this data efficiently, and one of the core concepts in database design is Third Normal Form (3NF), a standard for normalizing a database to reduce redundancy and ensure data integrity.
3NF is achieved by making sure that:
- The database is in Second Normal Form (2NF), which in turn builds on First Normal Form (1NF):
  - 1NF ensures that the table structure is tabular, with atomic values and unique rows. In an example of documents and folders, every document and every folder is represented by a row in its own table.
  - 2NF ensures that all non-key attributes are fully dependent on the entire primary key, eliminating partial dependencies. This reduces redundancy and improves data integrity by organizing data into separate but related tables. In its simplest form, using documents and folders again: documents have their own set of attributes, folders have their own set of attributes, each lives in its own table, and a third table relates one to the other.
- All non-key attributes are non-transitively dependent on the primary key. This means that every piece of data in a table should depend directly on that table's main identifier, its primary key, and not on any other data in the table. This keeps the data organized and eliminates duplication.
Put simply, 3NF eliminates duplicate data by separating it into related tables. This reduces the space used by storing each piece of information only once and avoiding repetition.
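To make the documents-and-folders example concrete, here is a minimal sketch of what that layout might look like. I'm using SQLite through Python's built-in sqlite3 module, and the table and column names are illustrative assumptions rather than a prescribed schema:

```python
import sqlite3

# A minimal sketch of the documents/folders example in 3NF, using SQLite.
# Table and column names are illustrative assumptions, not a prescribed schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each folder is a row with atomic values and its own primary key (1NF).
    CREATE TABLE folders (
        folder_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );

    -- Each document is a row; its attributes depend only on document_id (2NF/3NF).
    CREATE TABLE documents (
        document_id INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        created_at  TEXT NOT NULL
    );

    -- A third table relates documents to folders, so neither table
    -- repeats the other's attributes.
    CREATE TABLE document_folders (
        document_id INTEGER NOT NULL REFERENCES documents(document_id),
        folder_id   INTEGER NOT NULL REFERENCES folders(folder_id),
        PRIMARY KEY (document_id, folder_id)
    );
""")
conn.close()
```

The relating table carries only the two keys, so a document can appear in several folders without either table duplicating the other's attributes.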
While 3NF and higher forms of normalization (BCNF, 4NF, and 5NF) optimize storage and maintain data consistency, they come with practical trade-offs. Joining multiple normalized tables to retrieve information can be computationally expensive and adds complexity from a development perspective. In addition, maintaining the indexes that speed up these joins requires extra storage space and management overhead.
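To see that cost concretely, here is a small self-contained continuation of the same assumed schema (trimmed down a bit): even listing documents with their folder names takes a three-table join, and the index that helps that join is one more structure to store and keep up to date on every write.

```python
import sqlite3

# Self-contained demo of the join and index trade-off, reusing a trimmed
# version of the assumed documents/folders schema from the sketch above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE folders (folder_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE documents (document_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE document_folders (
        document_id INTEGER NOT NULL REFERENCES documents(document_id),
        folder_id   INTEGER NOT NULL REFERENCES folders(folder_id),
        PRIMARY KEY (document_id, folder_id)
    );
    INSERT INTO folders VALUES (1, 'Invoices');
    INSERT INTO documents VALUES (1, 'March invoice');
    INSERT INTO document_folders VALUES (1, 1);
""")

# Even a simple "list documents with their folder names" needs a three-table join.
rows = conn.execute("""
    SELECT d.title, f.name
    FROM documents d
    JOIN document_folders df ON df.document_id = d.document_id
    JOIN folders f           ON f.folder_id    = df.folder_id
""").fetchall()
print(rows)

# An index on folder_id keeps "documents in this folder" lookups fast, but it is
# extra storage and extra maintenance work on every write to document_folders.
conn.execute("CREATE INDEX idx_document_folders_folder ON document_folders(folder_id)")
conn.close()
```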
In many applications, the frequency of different database operations varies, but insert and update statements typically occur less frequently than select statements. Optimizing the database for read operations, at least from my perspective, then becomes crucial for performance.
There are scenarios where a fully normalized database may not be practical. For example, consider an e-commerce application that frequently queries product information and category details. In a normalized database, this would require joining the products and categories tables. However, if query performance becomes a bottleneck, it may make more sense to denormalize by including category details directly in the products table, even if it means some data redundancy.
In a scenario like this, the performance gains in query operations outweigh the cost of updating a few extra records when category information changes. This approach can lead to faster applications, especially when read performance is important and you want sub-second response times.
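A rough sketch of what that denormalization could look like, again with SQLite and hypothetical table and column names: the category name is copied onto each product row, so the common read path needs no join; the trade-off is that a category rename now has to touch every product in it.

```python
import sqlite3

# Hypothetical e-commerce schema, denormalized: category_name is copied onto
# each product row so product listings need no join with categories.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (
        category_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE products (
        product_id    INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        price         REAL NOT NULL,
        category_id   INTEGER NOT NULL REFERENCES categories(category_id),
        category_name TEXT NOT NULL      -- redundant copy, kept for read speed
    );
    INSERT INTO categories VALUES (1, 'Books');
    INSERT INTO products VALUES (1, 'SQL Primer', 19.99, 1, 'Books');
""")

# Read path: no join needed, the product row already carries its category name.
print(conn.execute("SELECT name, price, category_name FROM products").fetchall())

# Write path: renaming a category now means updating every product in it,
# the extra cost accepted in exchange for cheaper reads.
conn.execute("UPDATE categories SET name = 'Literature' WHERE category_id = 1")
conn.execute("UPDATE products SET category_name = 'Literature' WHERE category_id = 1")
conn.close()
```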
A well-designed database schema is at the core of a successful software project. The process typically begins with understanding the data and relationships, followed by creating an initial schema. Once the schema is in place, it’s essential to test it against use cases and queries to ensure it meets functional requirements.
Some questions to consider during this evaluation include:
- Can I retrieve all records that meet certain criteria? For instance, can you efficiently query all products within a specific price range or all orders placed by a particular customer? (See the sketch after this list.)
- Can I insert all the data I need? Ensure that the schema supports inserting complex data structures without redundancy or loss of information.
- Can I paginate when necessary? Verify that the schema allows efficient pagination for large datasets, which is crucial for applications with extensive data (the sketch after this list shows one approach).
- Are there potential bottlenecks in joins or index maintenance? Assess if the joins between tables are optimized and if the indexes are manageable without excessive overhead. Denormalize where necessary.
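As an illustration of how I sanity-check a couple of those questions (a hypothetical products table, SQLite again for convenience), the sketch below runs a price-range query backed by an index and pages through results with keyset pagination so that deep pages stay cheap:

```python
import sqlite3

# Hypothetical products table used to sanity-check the questions above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (name, price) VALUES (?, ?)",
    [(f"item-{i}", round(i * 0.37, 2)) for i in range(1, 501)],
)

# "Can I retrieve all records that meet certain criteria?"
# An index on price keeps the range scan from reading the whole table.
conn.execute("CREATE INDEX idx_products_price ON products(price)")
in_range = conn.execute(
    "SELECT name, price FROM products WHERE price BETWEEN ? AND ? ORDER BY price",
    (10.0, 20.0),
).fetchall()

# "Can I paginate when necessary?"
# Keyset pagination: remember the last seen key instead of using large OFFSETs,
# so page 100 costs roughly the same as page 1.
def next_page(conn, last_id=0, page_size=50):
    return conn.execute(
        "SELECT product_id, name, price FROM products "
        "WHERE product_id > ? ORDER BY product_id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

page1 = next_page(conn)
page2 = next_page(conn, last_id=page1[-1][0])
conn.close()
```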
By addressing these questions and many more early in the design phase, you can create a robust database schema that balances normalization with performance considerations.
Database design is a nuanced process that requires careful consideration of both theoretical principles and practical performance needs. While normalization reduces redundancy and maintains data integrity, real-world applications often necessitate strategic denormalization to optimize read performance.
This is just one strategy for optimizing applications, but there are many others, including better separation of concerns, caching at the data level, and caching at the code level.