Sunday, April 29, 2012

SQL Server - Reseeding Identity Value

Recently we came across an issue where a table was not accepting any new records. Upon further inspection we discovered that the table had an identity key of type INT, and it had reached its limit. An INT is 4 bytes and can therefore hold a maximum value of 2,147,483,647. The table in question only had about 300,000 rows.

An identity field is an auto-generated number with a default increment of 1, so in a table where you simply insert or update records, it should take a very long time to exhaust the INT range. But let's assume you inserted 1,000 records in a table and then deleted 500 of them. Because identity values are generated incrementally, the next identity value will be 1,001, even though the table only holds 500 records. That is precisely what happened with this table. It never held more than about 2,000 rows, but the data was deleted and re-inserted throughout the day. For example, 2,000 rows were inserted in the morning; an hour later the same 2,000 rows were inserted again, replacing the first 2,000; and so on. Each cycle consumed another 2,000 identity values even though the row count stayed roughly the same. In a single day, close to 20,000 identity values were used for just 2,000 entries.

Extrapolate that over several years and you can see why the table was running out of values. There are a couple of ways to fix this problem, but the right one depends on how complex your database schema is, i.e. whether the identity key is also used as a foreign key in other table(s). Fortunately, this table was standalone, so the fix was rather simple.

A temporary solution is to find a contiguous block of unused identity values. For example, imagine your table has 1,000,000 rows but the first record's identity value is 500,000. That means values 1 through 499,999 are free, and you can reseed the table so identity values restart from 1 by running the following DBCC command...

DBCC CHECKIDENT ('YourTableName', RESEED, 0)


Once you run this command, the next record inserted into the table will get identity value 1. But, as I said, this is a temporary fix: once the identity value reaches 499,999 in the example above, the next insert will fail, because identity value 500,000 already exists (assuming the column is a primary key or has a unique constraint).
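To see whether such a contiguous free block exists, you can compare the lowest identity value in use against the row count. This is a sketch; YourIDColumn and YourTableName are placeholders for your actual names.

```sql
-- YourIDColumn is a placeholder for the actual identity column name.
SELECT MIN(YourIDColumn) AS LowestID,
       MAX(YourIDColumn) AS HighestID,
       COUNT(*)          AS TotalRows
FROM YourTableName;
```

If LowestID is large relative to TotalRows, the range below it is free and reseeding to 0 buys you time.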

But what if you can't find a contiguous block of unused identity values? If you can afford to keep the database down for a few hours, here is what you can do...

  • Disconnect users from the database (with proper notification, of course) and bring it into single-user mode.
  • Create a new table with the same schema (under a different name) and write a query to move all the records from the old table to the new one (excluding the identity column). The new table will generate fresh identity values, so the records will be inserted contiguously, starting from 1.
  • Once the move is complete, rename or drop the old table (I would rename it, at least until the dust has settled) and rename the new table to the old table's name.
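The steps above can be sketched roughly as follows. All table and column names are hypothetical, and in practice you would also recreate indexes, constraints and permissions on the new table.

```sql
-- 1. New table with the same shape; the identity column starts fresh at 1.
CREATE TABLE dbo.YourTable_New
(
    ID INT IDENTITY(1,1) PRIMARY KEY,
    SomeColumn VARCHAR(50)
);

-- 2. Move the data, excluding the old identity values.
INSERT INTO dbo.YourTable_New (SomeColumn)
SELECT SomeColumn FROM dbo.YourTable;

-- 3. Swap the names; keep the old table around until the dust settles.
EXEC sp_rename 'dbo.YourTable', 'YourTable_Old';
EXEC sp_rename 'dbo.YourTable_New', 'YourTable';
```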

But what if you can't afford to keep your database down for more than a few minutes? It gets a little tricky, but it is still doable. Here is what I would do...

  • Switch the database to single-user mode.
  • Rename the current table to something else.
  • Create a new table with the same schema and give it the old table's name.
  • Bring the database back up so users can resume work; all new records will now be added to the new table, starting from identity value 1. (The old data is still not available.)
  • Write a query to move the data from the old table to the new table. Ideally, create a scheduled task that runs when the load on the database, and especially on this table, is low, to avoid deadlocks. You can also play with query wait times or the Resource Governor, but I would recommend the following...
    • Normally I wouldn't recommend this, but since this is a background operation and the key is to lock the table for the least amount of time, I would use a FAST_FORWARD cursor over the old table and insert one record at a time into the new table. This, coupled with scheduling during low load, will prevent or at least minimize deadlocks.
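A minimal sketch of that cursor-based move, with hypothetical table and column names; a real job would also track progress so it can resume where it left off.

```sql
-- Copy rows one at a time to keep each lock as short as possible.
DECLARE @SomeColumn VARCHAR(50);

DECLARE curMove CURSOR FAST_FORWARD FOR
    SELECT SomeColumn FROM dbo.YourTable_Old;

OPEN curMove;
FETCH NEXT FROM curMove INTO @SomeColumn;

WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO dbo.YourTable (SomeColumn) VALUES (@SomeColumn);
    FETCH NEXT FROM curMove INTO @SomeColumn;
END

CLOSE curMove;
DEALLOCATE curMove;
```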
There may be other ways, but this has worked for me. Also, if you want to check the current identity value of your table, you can run the following command...

DBCC CHECKIDENT (YourTableName)

While these are workarounds, deleting and re-inserting the same data is never a good idea, and no workaround beats good database design. Don't use an identity key unless you have a particular reason for it, and if you must use one, choose an appropriate data type (SMALLINT, INT or BIGINT) based on the potential table size.

Thank you.

Thursday, April 26, 2012

SQL Server Triggers

In the previous post we talked about views; today we will review triggers and the purpose they serve. Triggers are useful, but they should be used judiciously.

A trigger is a T-SQL script or a special stored procedure that is bound to a table and executes automatically in response to a certain action on that table. For example, you can have a trigger that executes when a record is inserted, updated or deleted.
There are two types of triggers - Data Definition Language (DDL) triggers and Data Manipulation Language (DML) triggers. DML triggers execute in response to data modification, for example when you insert/update/delete a record. DDL triggers execute when you modify the data definition, for example when you add a column to a table.

DML Triggers
DML triggers can be further categorized into AFTER triggers and INSTEAD OF triggers. AFTER triggers execute after the statement they are associated with has completed.
For example, suppose you have an audit table, and any time a record is inserted in the main table, you want to copy the newly inserted record into the audit table.
CREATE TRIGGER trgAfterInsert ON [dbo].[Customers] 
FOR INSERT
AS
    -- The inserted pseudo-table can hold multiple rows (e.g. INSERT ... SELECT),
    -- so copy all of them with a single set-based statement.
    INSERT INTO Customers_Audit (CustomerID, FirstName, LastName, EmailAddress)
    SELECT CustomerID, FirstName, LastName, EmailAddress
    FROM inserted;

    PRINT 'AFTER INSERT trigger fired.'
GO

When you insert a record in the Customers table, this AFTER trigger fires and copies the newly inserted record(s) into the Customers_Audit table.

Similarly, you can create a trigger for UPDATE or DELETE.

Notice that inserted is a logical (pseudo) table holding the newly inserted rows. There is no logical "updated" table; in an UPDATE trigger you use the same inserted table to get the new values of the updated rows. For deleted rows, the logical deleted table holds the removed values. For example, I can create a trigger that fires when a record is deleted.
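For example, an AFTER UPDATE trigger can audit the new values by reading the inserted table (reusing the hypothetical Customers_Audit table from the insert example):

```sql
CREATE TRIGGER trgAfterUpdate ON [dbo].[Customers]
FOR UPDATE
AS
    -- For an UPDATE, inserted holds the new values and deleted the old ones.
    INSERT INTO Customers_Audit (CustomerID, FirstName, LastName, EmailAddress)
    SELECT CustomerID, FirstName, LastName, EmailAddress
    FROM inserted;
GO
```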

CREATE TRIGGER trgAfterDelete ON [dbo].[Customers] 
FOR DELETE
AS
    -- The deleted pseudo-table holds every row removed by the statement.
    INSERT INTO Customers_Audit (CustomerID, FirstName, LastName, EmailAddress)
    SELECT CustomerID, FirstName, LastName, EmailAddress
    FROM deleted;

    PRINT 'AFTER DELETE trigger fired.'
GO

INSTEAD OF Trigger
An INSTEAD OF trigger allows you to intercept a statement and decide, based on your own criteria, whether to carry out the action or roll it back. For example, suppose you don't want certain records deleted. You can create an INSTEAD OF trigger that intercepts the deletion, checks the condition, and then either performs the delete or rolls it back.

For example, I don't want to delete any customer whose last name is "Smith". I can create an INSTEAD OF trigger to prevent deleting such customers.
CREATE TRIGGER trgInsteadOfDelete ON [dbo].[Customers] 
INSTEAD OF DELETE
AS
    -- The delete has not happened yet; deleted holds the rows the
    -- statement attempted to remove.
    IF EXISTS (SELECT 1 FROM deleted WHERE LastName = 'Smith')
        BEGIN
            RAISERROR('Cannot Delete', 16, 1);
            ROLLBACK;
        END
    ELSE
        BEGIN
            -- Perform the delete ourselves; INSTEAD OF suppressed the original.
            DELETE FROM Customers
            WHERE CustomerID IN (SELECT CustomerID FROM deleted);
        END
GO


DDL Trigger
DDL triggers fire when the schema is modified. If I want to prevent any table modification, I can create a trigger like this...

CREATE TRIGGER TrgAlterTable
ON DATABASE 
FOR DROP_TABLE, ALTER_TABLE 
AS 
PRINT 'You must disable Trigger "TrgAlterTable" to drop or alter tables!' 
ROLLBACK ;

Unless I disable or drop this trigger, I will not be able to drop or modify a table.
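Since TrgAlterTable is a database-scoped trigger, disabling it uses DISABLE TRIGGER at the database level rather than ALTER TABLE:

```sql
-- Temporarily allow schema changes, then re-arm the trigger.
DISABLE TRIGGER TrgAlterTable ON DATABASE;
-- ... perform your ALTER/DROP here ...
ENABLE TRIGGER TrgAlterTable ON DATABASE;
```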

Enabling / Disabling a Trigger
You can disable or enable all triggers on a table, or a specific trigger, by executing the following script.
ALTER TABLE Customers {ENABLE|DISABLE} TRIGGER ALL 

To enable or disable a specific trigger, simply replace "ALL" with the trigger name.
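For example, to temporarily switch off just the insert-audit trigger created earlier:

```sql
ALTER TABLE Customers DISABLE TRIGGER trgAfterInsert;
-- ... do bulk work without auditing ...
ALTER TABLE Customers ENABLE TRIGGER trgAfterInsert;
```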

While triggers serve a useful purpose, one issue I have with them is that they are hidden; long after you have forgotten about them, troubleshooting becomes a problem because you just don't think about the triggers. That's why it is important to have some kind of messaging/logging in triggers so issues can be traced.

Thank you.



Sunday, April 22, 2012

Using SQL Views to Insert/Update/Delete Data

In the previous post we discussed SQL Server views and how you can use them to logically abstract away some of the complexity of the underlying schema. We focused mainly on selecting data through views.

Views also allow you to insert/update/delete data, but they are limited in their power when it comes to modifying data.

Think back for a second - when you insert data, you can only insert into one table at a time. If more than one table is related via referential integrity, you must insert into the primary table first and then into the subsequent tables using the key from the primary table. Similarly, when updating or deleting data, you can only update/delete one table at a time. Additionally, the user must have the proper permissions to do so.

The same restrictions apply when you use views for this. Some of the restrictions that apply to views are...
  • Insert/Update/Delete statements must reference columns from only one base table.
  • Columns in the view must relate to underlying columns in the base table directly i.e. they cannot be computed columns such as AVG, COUNT, SUM, MIN, MAX etc. 
  • The columns being modified cannot also be grouped or affected by DISTINCT, GROUP BY or HAVING clauses
    • If you think about it, it makes sense. SQL Server wouldn't know which column to modify if you have applied these clauses.
  • If you have used WITH CHECK OPTION (see previous post about this), you can't use TOP in SELECT statement of the view.
  • If you have any sub/nested queries, same restrictions apply to them as well.
  • Constraints defined on the table columns, such as NOT NULL, referential integrity, etc., also apply when modifying via views.
In the previous post we created a view that retrieves records from multiple tables. Let's see if we can use the same view to update or delete a record.

CREATE VIEW view_Orders

AS
SELECT C.FirstName As CustomerFirstName,C.LastName As CustomerLastName,
Count(O.OrderID)As TotalOrders,S.FirstName As SalesFirstName,
S.LastName As SalesLastName FROM Customers 
C INNER JOIN Orders O ON C.CustomerID=O.CustomerID
INNER JOIN SalesPerson S ON O.SalesPersonID=S.SalesPersonID
Group By C.FirstName,C.LastName,S.FirstName,S.LastName
GO

Since both the customer's and the sales person's first and last names are used in the GROUP BY clause, and the OrderID column is aggregated with the COUNT function (OrderID is an identity column, so you couldn't update it anyway, but it illustrates the point), none of the columns in this view can be updated or deleted through the view.

What if I create a view like this...

CREATE VIEW view_OrdersList

AS
SELECT C.FirstName As CustomerFirstName,C.LastName As CustomerLastName,
O.OrderID As OrderNumber,S.FirstName As SalesFirstName,
S.LastName As SalesLastName FROM Customers 
C INNER JOIN Orders O ON C.CustomerID=O.CustomerID
INNER JOIN SalesPerson S ON O.SalesPersonID=S.SalesPersonID
GO

I should be able to use this view to update pretty much any column, i.e. customer first name and last name, and sales person first and last name. I can't update OrderID because it is an identity column.

In my database, this view returns the following records

I can run the following update commands one after another to update the customer last name and sales person last name using the same view, even though they are in two different underlying tables.

UPDATE View_OrdersList SET CustomerLastName='Roberts' WHERE 
CustomerLastName='Smith'
GO
UPDATE View_OrdersList SET SalesLastName='Kimberly' WHERE 
SalesLastName='Kimber'
GO

But what about updating both records using the same update query?

UPDATE View_OrdersList SET CustomerLastName='Roberts',SalesLastName='Kimberly' 
WHERE CustomerLastName='Smith' AND SalesLastName='Kimber'
GO

This will result in an error telling you the view is not updatable because the modification affects multiple base tables.

Needless to say, this violates the rules defined above, i.e. only one base table can be updated at a time, hence the error.

I can also insert or delete records through the same view, as long as no referential integrity or other constraints are violated.

A view abstracts away some of the complexity and lets you use the same view to select/insert/update/delete, although you still have to satisfy all of the conditions above, just as you would when modifying the tables directly.

Thank you.


Saturday, April 21, 2012

SQL Server Views

So far we have been discussing database management - backup, transactions, log shipping, etc. In most shops developers are generally not concerned with these tasks; they fall under the domain of DBAs. We will now turn our attention to database development, primarily tasks in a developer's domain, although admittedly today's topic could go either way.

In this post, we will take a look at views and how you can create them to simplify complex operations. A view is a logical representation of your schema; it hides the underlying complexity of the database schema, table relationships, etc.

For example, below is a subset of tables from my database with related tables tied together via referential integrity.



Let's assume you want to run the following query...

SELECT C.FirstName,C.LastName,Count(O.OrderID)As TotalOrders,S.FirstName,S.LastName
FROM Customers C INNER JOIN Orders O ON C.CustomerID=O.CustomerID
INNER JOIN SalesPerson S ON O.SalesPersonID=S.SalesPersonID
Group By C.FirstName,C.LastName,S.FirstName,S.LastName

Even though this query joins only three tables, it is fairly complex. What if you could create a view that abstracts all of this away, so your application can simply query the view and get the same results?

CREATE VIEW view_Orders

AS
SELECT C.FirstName As CustomerFirstName,C.LastName As CustomerLastName,
Count(O.OrderID)As TotalOrders,S.FirstName As SalesFirstName,
S.LastName As SalesLastName FROM Customers 
C INNER JOIN Orders O ON C.CustomerID=O.CustomerID
INNER JOIN SalesPerson S ON O.SalesPersonID=S.SalesPersonID
Group By C.FirstName,C.LastName,S.FirstName,S.LastName
GO

Now I can simply write a query against this view, which returns the same results as the query above:

SELECT * FROM view_Orders

In many ways, views are pseudo-tables. You can open a view in design mode and see the tables used in the view and the relationships between them. For example, if I right-click on the view created above and select Design, I see something like this, showing the three tables by their alias names. Notice that the columns the view returns are checked, and the computed column (OrderID) has a function icon on the right-hand side.




While a view allows many of the clauses available with SELECT, such as GROUP BY and TOP, and functions such as COUNT, SUM, etc., you can't use an ORDER BY clause unless you also use TOP. For example, if I change my view to something like this...

CREATE VIEW view_Orders

AS
SELECT C.FirstName As CustomerFirstName,C.LastName As CustomerLastName,
Count(O.OrderID)As TotalOrders,S.FirstName As SalesFirstName,
S.LastName As SalesLastName FROM Customers C 
INNER JOIN Orders O ON C.CustomerID=O.CustomerID
INNER JOIN SalesPerson S ON O.SalesPersonID=S.SalesPersonID
Group By C.FirstName,C.LastName,S.FirstName,S.LastName 
ORDER BY C.FirstName
GO

I will get an error. But I can use the ORDER BY clause if I return the TOP x rows, like this...

CREATE VIEW view_Orders

AS
SELECT TOP 10 C.FirstName As CustomerFirstName,C.LastName As CustomerLastName,
Count(O.OrderID)As TotalOrders,S.FirstName As SalesFirstName,
S.LastName As SalesLastName FROM Customers C 
INNER JOIN Orders O ON C.CustomerID=O.CustomerID
INNER JOIN SalesPerson S ON O.SalesPersonID=S.SalesPersonID
Group By C.FirstName,C.LastName,S.FirstName,S.LastName 
ORDER BY C.FirstName
GO

If you want to order the result set returned by a view, you can always supply an ORDER BY clause in the query that selects from the view:

SELECT * FROM view_Orders ORDER BY CustomerFirstName

Views also allow you to specify certain options when you create them. The two most important ones are WITH SCHEMABINDING and WITH ENCRYPTION.

If you create a view WITH SCHEMABINDING, you will not be able to alter the schema of any table referenced by the view without first dropping the view, or altering it to remove the SCHEMABINDING option. For example, the above view can be created as


CREATE VIEW view_Orders WITH SCHEMABINDING

Also, you cannot use SCHEMABINDING unless you use two-part table names, i.e. schema_name.table_name. In my example all tables are in the dbo schema, so I will modify the above statement like this...

CREATE VIEW view_Orders WITH SCHEMABINDING

AS
SELECT TOP 10 C.FirstName As CustomerFirstName,C.LastName As CustomerLastName,
Count(O.OrderID)As TotalOrders,S.FirstName As SalesFirstName,S.LastName As SalesLastName
FROM dbo.Customers C INNER JOIN dbo.Orders O ON C.CustomerID=O.CustomerID
INNER JOIN dbo.SalesPerson S ON O.SalesPersonID=S.SalesPersonID
Group By C.FirstName,C.LastName,S.FirstName,S.LastName ORDER BY C.FirstName
GO

You can also encrypt a view by using WITH ENCRYPTION. When a view is encrypted, you cannot see it in design mode or generate a CREATE / ALTER / DROP-and-CREATE script; the only script you can generate is DROP. An encrypted view cannot be un-encrypted.

CREATE VIEW view_Orders WITH SCHEMABINDING, ENCRYPTION

AS
SELECT TOP 10 C.FirstName As CustomerFirstName,C.LastName As CustomerLastName,
Count(O.OrderID)As TotalOrders,S.FirstName As SalesFirstName,S.LastName As SalesLastName
FROM dbo.Customers C INNER JOIN dbo.Orders O ON C.CustomerID=O.CustomerID
INNER JOIN dbo.SalesPerson S ON O.SalesPersonID=S.SalesPersonID
Group By C.FirstName,C.LastName,S.FirstName,S.LastName ORDER BY C.FirstName
GO

Partitioned Views
A partitioned view is a view that selects data from multiple tables that have the same structure; the data is said to be partitioned across them, hence the term. The rows are combined using UNION ALL. The member tables can live on the same server, or even on different servers (a distributed partitioned view).
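A minimal sketch, assuming two hypothetical yearly order tables with identical structure:

```sql
CREATE VIEW view_AllOrders
AS
SELECT * FROM dbo.Orders2011
UNION ALL
SELECT * FROM dbo.Orders2012
GO
```

For SQL Server to treat this as a true (updatable) partitioned view, each member table also needs a CHECK constraint on the partitioning column so the engine knows which table owns which range of values.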

Views are powerful feature in the sense that they abstract away some of the underlying complexities and simplify working with data.

Although views are generally used for selecting data, you can also use them to insert/update data, though with limitations. In the next post we will discuss using views to insert/update data.

Thank you


Saturday, April 14, 2012

Log Shipping Configuration - Things to watch out

Recently I came across a situation where a server was configured to log ship. Everything was setup properly and the primary server had security rights to copy transaction log files to the secondary server.

Once configured, log shipping would work for a day, restoring logs to the secondary database as expected, but then it would start failing. From the error message it was clear that the log shipping chain was being broken somehow.

The person in charge of the database management had disabled differential backups erroneously thinking they were somehow breaking the log chain.

Remember that full backups and differential backups do not break the transaction log chain. The only ways to break the log chain are to switch the database to the simple recovery model (which truncates the log and breaks the chain) or to take a transaction log backup outside of the log shipping backup process.

Log shipping consists of the following main operations:

  • Back up the transaction log.
  • Copy it to the secondary (destination) server. (You may configure log shipping so the backup file is created directly on the secondary server, in which case this step isn't necessary.)
  • Restore the transaction log to the secondary database (on the secondary server).
Log shipping simply creates each transaction log backup from the last one, based on LSN, and restores must happen in the same order, i.e. the previous log must be restored first, and so on. Naturally, if an outside process takes a transaction log backup, it breaks the chain, and any log restore that skips that backup will fail. The only way to recover is to find the missing log backup or to start over.

Indeed, that was the case here. Even though there wasn't another transaction log backup job, a shrink-log job was configured to run nightly, and as part of that job it was first taking a backup and then shrinking the log file.

Transaction log backups automatically truncate the log file (although it may not happen right away), but the file itself generally won't shrink. So if you are low on disk space it is fine to shrink the log file periodically, but that job should not take its own log backup if you already have transaction log backups or log shipping configured elsewhere.
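If you suspect an out-of-band backup is breaking the chain, msdb records every backup taken, and the shrink itself needs no backup at all. The database and logical log file names below are placeholders.

```sql
-- Who has been backing up this database, and what kind of backup?
-- type: D = full, I = differential, L = transaction log.
SELECT backup_start_date, type, user_name
FROM msdb.dbo.backupset
WHERE database_name = 'YourDatabase'
ORDER BY backup_start_date DESC;

-- Shrink the log file directly; no backup required.
DBCC SHRINKFILE (YourDatabase_Log, 100);  -- target size in MB
```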

To know more about transaction log shipping and how to configure it, check out this post.

Thank you.

Tuesday, April 10, 2012

Full vs Differential vs Transaction Log Backup

SQL Server provides three backup options - Full backup, Differential backup and Transaction Logs backup. Today, we will review the three backup strategies and key differences between them.

Full Backup
A full backup is just that - a complete backup of your database at a point in time. You can restore a full backup on the same or a different SQL Server.

Differential Backup
A differential backup backs up only the changes since the last full backup. The benefit is that it is fast and takes less space, since you are only backing up the changes.

Differential backups form a chain that starts from the last full backup; every differential is taken relative to that full backup. If you take a full backup with the COPY_ONLY option, the differential chain is preserved and continues from the previous full backup; without COPY_ONLY, the chain is reset and subsequent differentials are based on the most recent full backup.
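For example, an ad-hoc full backup that leaves the differential chain intact (database name and path are placeholders):

```sql
-- Out-of-band full backup that does NOT reset the differential base.
BACKUP DATABASE YourDatabase
TO DISK = 'D:\Backups\YourDatabase_adhoc.bak'
WITH COPY_ONLY;
```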

You can restore to a point in time by restoring the full backup and then applying the most recent differential backup.

Transaction Logs
A transaction log backup contains the changes since the last transaction log backup. I have seen some confusion about whether log backups are relative to the last full backup or the last log backup. When you take a full database backup for the very first time, your transaction log backup chain starts after that full backup. Subsequent full or differential backups do not break the log chain; the next transaction log backup picks up from the last transaction log backup, not from the last full backup.

Transaction log backups only work in the Full and Bulk-Logged recovery models, and the log chain is broken by switching the recovery model to Simple, or by choosing to overwrite the existing backup set when creating a full backup media set.

If your database uses the full or bulk-logged recovery model, you must take frequent log backups; otherwise the log file won't truncate and will eventually fill up your hard drive.

It is good to combine all three backup types in your environment so your backup sets stay as current as possible, letting you restore to a point in time and minimize data loss.
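A typical schedule combines all three; for example (database name and paths are placeholders):

```sql
-- Nightly full backup.
BACKUP DATABASE YourDatabase TO DISK = 'D:\Backups\YourDatabase_full.bak';

-- Midday differential: changes since the last full backup.
BACKUP DATABASE YourDatabase TO DISK = 'D:\Backups\YourDatabase_diff.bak'
WITH DIFFERENTIAL;

-- Frequent log backups: changes since the last log backup.
BACKUP LOG YourDatabase TO DISK = 'D:\Backups\YourDatabase_log.trn';
```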

Thank you.


Saturday, April 7, 2012

Configuring SQL Server Maintenance Plan

I may not have explicitly mentioned this earlier, but the examples and content I write generally apply to SQL Server 2008. In most cases they hold for SQL Server 2005 as well, but not always. For example, in a post about compressing data during backup, I discussed the WITH COMPRESSION option; that option was only introduced in SQL Server 2008.

Today, we will review another facility available in SQL Server 2008 (also available in SQL Server 2005, though some features differ) called maintenance plans. A maintenance plan lets you set up certain maintenance tasks easily using SQL Server Management Studio. This comes in especially handy when you have multiple databases on the same server and need to set up a task, such as a nightly backup, for all of them at once.

1. Connect to your SQL instance via Management Studio and go to Management > Maintenance Plans.


2. Right-click on Maintenance Plans and select either New Maintenance Plan or Maintenance Plan Wizard. The wizard is quite powerful and walks you through selecting and configuring the available maintenance tasks. In this example, we will use the first option and set up our own plan. Select "New Maintenance Plan" and name your plan appropriately; this opens the design surface and a toolbox of available maintenance tasks.



3. Drag one or more tasks that you want in this maintenance plan onto the design surface. We will configure two tasks - a database backup and a database integrity check.

Check Database Integrity
It is often a good idea to check the database integrity immediately before or after the backup.
Drag the task onto the design surface, then right-click > Edit to select your databases. The resulting UI lets you choose one or more databases, or all databases (helpful if you will add databases in the future and don't want to keep updating the list; conversely, if you have many databases, you may want to split them across a few tasks).



Check "Ignore databases where state is not online" to perform the integrity check only when a database is online. Click OK to save the change.

From the top menu of the design surface, click the calendar icon to set up the schedule for this task. Name the schedule and configure it appropriately. Click OK to save and close the schedule window.


Configure Database Backup
Repeat the same steps: drag the database backup task onto the surface and configure it. Again, right-click the task to configure your options. Here you can choose a full, differential or transaction log backup, select one or more databases, and define where the backup files are saved. There is also an option called "Verify Backup Integrity"; this shouldn't be confused with the Check Database Integrity task we configured above.



In this example, we want the integrity check to run before the database backup. Right-click on either of the two tasks, select "Add Precedence Constraint" from the context menu, and choose the precedence you want.



That's all there is to it. Once you save the plan, SQL Server automatically creates a SQL Agent job to run it at the scheduled time.

If I had Database Mail set up, I could also add a third task, "Notify Operator Task", as the last step to send a notification email if a task fails. See my previous post to learn about configuring SQL Server to send emails.

Thank you.






Wednesday, April 4, 2012

Database Normalization - Third Normal Form

In the previous two posts we discussed First Normal Form and Second Normal Form. Today, we will discuss Third Normal Form.

Third Normal Form (3NF)
This rule breaks the table structure down a bit further. For a table to be in 3NF, it must first satisfy 1NF and 2NF. To take a 2NF table to 3NF, remove any columns that don't depend directly on the primary key - that is, columns that depend on other non-key columns (transitive dependencies).

Imagine you have a payroll application that records the total hours an employee worked, the pay rate, and the total pay for the week. The following table schema could be used...

PayRoll

  • EmployeeID
  • WorkedDate
  • HoursWorked
  • PayRateID
  • Total
Let's see if this table satisfies 1NF and 2NF. There is no repeating data within a row. The PayRoll table is related to the Employees table via EmployeeID and to PayRates via PayRateID, so it holds no redundant data; it has a primary key (EmployeeID and WorkedDate) and foreign keys (EmployeeID and PayRateID). Thus, all the conditions for 1NF and 2NF are satisfied.

What about 3NF? Is there any field here that doesn't depend on the primary key? HoursWorked is fully dependent on EmployeeID and WorkedDate - different employees can work different numbers of hours on different dates. PayRateID is fully dependent on the primary key as well.

At first it may appear that PayRateID depends only on EmployeeID and not on WorkedDate, because an employee's pay rate is the same day after day. But what if the employee works on a holiday and is paid time and a half?

What about the Total column? Total is really HoursWorked * PayRate; it is a computed value and as such depends on other columns rather than on the primary key alone. For this table to be in 3NF, this column must be removed. So where do you store it? Nowhere - it is a derived value, and you should compute it on the fly when retrieving the data.
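If you'd rather not repeat the calculation in every query, SQL Server can derive it for you with a computed column; the value is never stored independently, so it can't drift out of sync. This schema is a sketch - the pay rate is shown inline here, whereas the post keeps it in a separate PayRates table.

```sql
CREATE TABLE dbo.PayRoll
(
    EmployeeID  INT          NOT NULL,
    WorkedDate  DATE         NOT NULL,
    HoursWorked DECIMAL(5,2) NOT NULL,
    PayRate     DECIMAL(8,2) NOT NULL,
    -- Derived on the fly; never stored out of sync with its inputs.
    Total AS (HoursWorked * PayRate),
    PRIMARY KEY (EmployeeID, WorkedDate)
);
```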

As I mentioned in the first post about normalization, the normal forms go up to 6NF, but 3NF is generally the standard for most databases.

Thank you.


Monday, April 2, 2012

Database Normalization - Second Normal Form

In the previous post we discussed the concept of normalization and First Normal Form. Today, we will review Second Normal Form. You may want to review the previous post before continuing.

Second Normal Form (2NF)
2NF takes the concept a bit further. This rule states that data repeated across multiple rows of a table should be moved to a new table, and the two tables should be joined via a foreign key. Basically, the idea is to reduce redundancy by extracting repeated data and storing it elsewhere. Let's imagine you have a Customers table with the following columns

Customers
  • CustomerID
  • FirstName
  • LastName
  • Address
  • City
  • State
  • PostalCode
  • Country
Sure, you will have more than one customer from the same city, even more from the same state, and definitely more from the same country. You may well have many customers with the same postal (zip) code. As you enter customers in this table, you duplicate all this data, i.e. City, State, PostalCode, Country. This table is not in 2NF. To make it 2NF compliant, create a new table to store City, State, PostalCode and Country. Let's call this table Addresses. It has the following columns

Addresses
  • City
  • State
  • PostalCode
  • Country
What columns do you think could still be redundant here? PostalCode probably won't be, but what about City? A city can have multiple postal codes, a state definitely has multiple cities, and a country has multiple states. A better schema might be the following...

Customers
  • CustomerID
  • StreetAddress
  • CityID
CityAddress
  • CityID
  • City
  • PostalCode
  • StateID
  • CountryID
States
  • StateID
  • State
Countries
  • CountryID
  • Country

Notice that one table turned into four. It is definitely more complex, but also more flexible. You can even pre-populate States and Countries in advance.

This schema also satisfies the second requirement of 2NF, i.e. related tables should be linked by foreign keys: CityID is a foreign key in Customers, and StateID and CountryID are foreign keys in CityAddress.
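The foreign-key relationships above can be sketched in DDL like this; the data types are assumptions based on the column lists.

```sql
CREATE TABLE dbo.States    (StateID   INT PRIMARY KEY, State   VARCHAR(50));
CREATE TABLE dbo.Countries (CountryID INT PRIMARY KEY, Country VARCHAR(50));

CREATE TABLE dbo.CityAddress
(
    CityID     INT PRIMARY KEY,
    City       VARCHAR(50),
    PostalCode VARCHAR(10),
    StateID    INT REFERENCES dbo.States(StateID),
    CountryID  INT REFERENCES dbo.Countries(CountryID)
);

CREATE TABLE dbo.Customers
(
    CustomerID    INT PRIMARY KEY,
    StreetAddress VARCHAR(100),
    CityID        INT REFERENCES dbo.CityAddress(CityID)
);
```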

We will discuss 3NF in future post.

Thank you.