Tuesday 18 March 2014

Database Anomalies and how to Overcome those Anomalies:

                In our day-to-day life, we come across data stored in various formats and structure. The most common form is a flat-file structure or commonly known as "spreadsheet-like" structure. This means nothing but data stored in rows and columns with the first row representing the headers under which the relevant data is stored. However, as the amount of data i.e. the number of records and the number of fields go on increasing the format becomes difficult to manage, analyze & maintain.

 

     For example, consider the following example. Let us say that we have only 5 records of information in the entire flat-file.
Vendor ID
Vendor Name
Vendor Address
Products Supplied
V001
ABC
Park Street
Bottles
V001
ABC
Park Street
Dishes
V002
XYZ
Club Street
Spoons
V003
LMN
Forest Street
Dishes
V003
LMN
Forest Street
Spoons
     On the face of it, the data & the way it is structured looks pretty normal and very much "spreadsheet like".Let us try to analyze the data & check various scenarios.

 Updating the data: 

       Suppose if the address of one of the vendor say ABC changes to Park View Street from Park Street. In this case we would have to update the address of the vendor ABC in all the records i.e. 2 in this case. Just imagine if the database is huge, you might have to make such updates in say 5000 records. This can be a tedious task and there is a very high probability to go wrong here – i.e. to forget to correctly update the address in one of the row.

Vendor ID
Vendor Name
Vendor Address
Products Supplied
V001
ABC
Park View Street
Bottles
V001
ABC
Park Street
Dishes
V002
XYZ
Club Street
Spoons
V003
LMN
Forest Street
Dishes
V003
LMN
Forest Street
Spoons
     As you can see above, Park View Street has been updated only in the first record whereas the second record continues to hold the old street name. This is an anomaly arising out of an update operation and is known as an ‘Update Anomaly’.

 Deleting the data:

      Suppose the buying company temporarily stops procuring Spoons. In this case, if we try to delete the record containing spoon, then we also end up deleting the vendor details of XYZ even though he might be a legitimate & active vendor.

Vendor ID
Vendor Name
Vendor Address
Products Supplied
V001
ABC
Park Street
Bottles
V001
ABC
Park Street
Dishes
V002
XYZ
Club Street
Spoons
V003
LMN
Forest Street
Dishes
V003
LMN
Forest Street
Spoons
     As you can see above, all the details of Vendor XYZ are deleted on deleting the record containing spoons under the "Products Supplied" column .This anomaly arising out of delete operation is known as a ‘Delete Anomaly’.

 Inserting the data:

     Suppose the buying company finalizes a new vendor but he would start supplying only after say 6 weeks. In other words, as of now we cannot assign any "products supplied" to that new vendor. In that case, one may not be able to insert the details of the new vendor until it is know what product is he going to supply.

Vendor ID
Vendor Name
Vendor Address
Products Supplied
V001
ABC
Park  Street
Bottles
V001
ABC
Park Street
Dishes
V002
XYZ
Club Street
Spoons
V003
LMN
Forest Street
Dishes
V003
LMN
Forest Street
Spoons
V004
PQR
Main Street
????
     As you can see above, the new vendor details cannot be inserted due to the unavailable information with regards to the product that he is going to supply. This anomaly arising out of an insert operation is known as an Insert Anomaly.

       Note that all these anomalies will make your data quite error prone and would render it redundant. Also, note that the database that we are discussing here is an extremely small sample with five records & four fields. Now, just to put things in a little perspective, imagine a data with around 120,000 records and 22 fields. If we perform a number of such operations (delete, update & insertion), you can easily guess how the anomalies just discussed above will be compounded.

So, the next logical question is – What do we do now? Is there any way to avoid such anomalies?

        And the answer is Yes. There is a process to avoid these anomalies and that process is known as "Normalization". Normalization, thus can be defined as the process of decomposing i.e. reducing the data to a set of relations (a single data table reduced to multiple tables with each data table having certain relationship among each other). This is to ensure data integrity as well as eliminate anomalies.

Hence while normalizing the data, we got to follow certain rules of normalization known as Codd’s 12 rules that were proposed by a gentleman known as Mr. Edgar F. Codd.

There are various levels to which the data can be normalized. Those levels are First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), BCNF, 4NF, DKNF, 6NF.

Let us not get into the technicalities of this process. The idea here is to give you an overview of why we do what we do.

So, coming back to our example, the normalized data would look something like this:

Table 1 : Product_Table
Product ID
Products Supplied
P001
Bottles
P002
Dishes
P003
Spoons
Table 2: Vendor_Table
Vendor ID
Vendor Name
Vendor Address
V001
ABC
Park Street
V002
XYZ
Club Street
V003
LMN
Forest Street
Table 3: Transaction_Table
Transaction ID
Vendor ID
Product ID
T001
V001
P001
T002
V001
P002
T003
V002
P003
T004
V003
P002
T005
V003
P003
If you observe the above tables, you’ll see that the table 3 is linked/related to table 2 via Vendor ID whereas table 3 is linked or related to table 1 via Product ID.

Now, ProductID in case of Table 1 and VendorID in case of Table 2 are known as Primary keys. So, Primary key, as you can see, uniquely identifies each record in the table. In other words, every table should have one field that uniquely identifies each record.

Further, these primary keys when used in table 3 are then called as Foreign keys. So, foreign keys are used to link tables. Also, note that in table 3, TransactionID will be the primary key.

Thus, such a normalized database linked via keys is known as a "Relational database"

  In the above table 3, if you want to add a new transaction record say T006 with a new Vendor ID & Product ID, then you would be able to do so only if you first add a record of new Vendor in Table 2 and a record of new Product in Table 1. Else, without adding these records in Table 1 & Table 2, you wont be allowed to insert new transaction in Table 3. This is known as ensuring Referential Integrity. This also includes cases such as if vendor V003 in Table 2, has its ID changed to V004 for some reason instead of V003, then corresponding changes need to be automatically updated in Table 3. Also, in Table 1, if Spoons is discontinued & deleted from the table, then corresponding transaction of spoons needs to be deleted from Table 3. However, one needs to exercise caution while allowing automatic delete in the transaction table since for a particular business case not having a transaction involving spoon will be a requirement whereas in some other case it wont be a requirement.

 A few words on relationship between different tables:

One-to-Many – This type of relationship is said to exist if one record in Table ‘x’ is said to have multiple linked records to Table ‘y’ . Now, in our example, each record in Table 2 (Vendor_Table) has links to multiple records of Table 3 (Transaction_Table). E.g: Vendor ID V001 is linked to multiple records T001 & T002 in Table 3. This type of relationship is the most common one.

One-to-One – This type of relationships is said to exist if one record in Table ‘x’ is said to have exactly one linked record in Table ‘y’. This type of relationship is rarely used. Imagine we create table 4 (Vendor_Account) wherein we hold information such as Vendor Bank Account numbers. In that case, each record in Table 2 will have exactly one record in Table 4.

Many-to-Many – This type of relationship is said to exist if multiple records in one Table ‘x’ is said to have multiple links to records of Table ‘y’. Now, in our example, a vendor can supply multiple products whereas the same product can be supplied by multiple vendors. Hence Table1 and Table 2 exhibit Many-to-Many relationship.

I just hope that this post serves to give you a basic overview of the database concepts in a nutshell & help you understand the answer to your very first question – Why the need to have multiple Tables?.


No comments:

Post a Comment