13.2 File Organisation and Access

Summary

File organisation and access relates to the use of records, fields and files.

  • A record is a collection of related data items (possibly of different types) stored in fields and treated as a single entity for processing.

  • A field is a single data item, and many fields make up a record. Each field has a name, and one key field, called the primary key, is used to identify the record.

  • A data file is a collection of records holding the same type of information but about different objects or individuals.

Files have three important characteristics:

  1. Whether the file is permanent or temporary

  2. The way the records are organised (stored) – serial, sequential or random

  3. The method of access – sequential or direct

TL;DR

Binary file data is written using records. A record is a collection of fields, and a record either has a fixed length or uses delimiters to mark where each field or record ends. Numeric fields do not need delimiters, as their size is fixed by the data type. Strings, however, do.

When discussing these types of file, we talk about a hit rate: how many records need to be accessed in a given session. E.g. a bank's billing cycle would involve most records and so have a high hit rate, whereas a passenger booking an airline seat would involve only a few records – a low hit rate.

Files are organised and accessed using different methods:

Organisation: Serial; sequential; random

Access: Sequential; Direct

You can think in terms of data structures: sequential access is similar to traversing a linked list, while random access is like indexing an array.

Serial file organisation stores records chronologically (in the order in which they were written); such files can only be accessed sequentially.

Sequentially organised files store records ordered by the value of a key field (e.g. surname), which does not HAVE to be unique. To insert a record, a separate file is made: the original is read and copied up to the necessary insertion point (found using the key), the new record is written, and the rest of the file is copied across. When deleting, the record can simply be marked; during file maintenance, or when inserting, flagged records are skipped.

Sequential files can be accessed either sequentially (from start to end) – as would be necessary on a tape – or using random access. "Random" because, as with RAM, we can access any part without needing to start at the beginning. This requires media that allow random access, such as a hard disk.

Direct access sequential files use one or more separate file indexes to allow for random access. The index contains the specific key, which can be different from the storage key, and the record's file location. This method requires the file to be stored on a direct access medium, such as a hard disk.

A benefit over random organisation is that the record size can vary by record, and the file can be accessed sequentially (fast when a high hit rate is needed) as well as directly when a particular record is wanted.

Randomly organised files are accessed in a direct manner. The record location is determined using a hash of the key value. You cannot hash multiple fields, as you can index multiple fields with direct access sequential files, since each record's storage location is determined by the hash of a single key. Direct access is much faster than sequential access, but randomly organised files require fixed record lengths in order to calculate the storage position in bytes.

E.g. if a record contains 4 fields, we might allocate 50 B to the last name, 50 B to the first name, 8 B to the DoB and 11 B to the postcode, giving a record size of 119 B. Record n then starts at byte (n − 1) × record size, so to get to the third record we can jump directly to byte 238. If a name is only 10 characters, the remaining 40 B are filled with spaces, which can be wasteful. Hashing also needs to allow for collisions, for which either open or closed hashing can be used. In an open hashing system (closed addressing), an overflow area is allocated for storage; this is rarely used.
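
To make the byte arithmetic concrete, here is a minimal Python sketch of the fixed-length layout described above. The field widths follow the example; the file name and sample values are made up, and struct pads with NUL bytes rather than spaces (either works as padding).

    import struct

    # Fixed-length layout from the example: 50 B last name, 50 B first name,
    # 8 B date of birth (YYYYMMDD) and 11 B postcode = 119 B per record.
    RECORD_FORMAT = "50s50s8s11s"
    RECORD_SIZE = struct.calcsize(RECORD_FORMAT)   # 119

    def write_record(f, n, last, first, dob, postcode):
        """Write record n (1-based) at byte offset (n - 1) * RECORD_SIZE."""
        f.seek((n - 1) * RECORD_SIZE)
        f.write(struct.pack(RECORD_FORMAT, last.encode(), first.encode(),
                            dob.encode(), postcode.encode()))

    def read_record(f, n):
        """Jump straight to record n: e.g. n = 3 seeks to byte 238."""
        f.seek((n - 1) * RECORD_SIZE)
        fields = struct.unpack(RECORD_FORMAT, f.read(RECORD_SIZE))
        return [b.rstrip(b"\x00").decode() for b in fields]

    with open("employees.dat", "wb+") as f:
        write_record(f, 3, "Smith", "Terry", "19800101", "AB1 2CD")
        print(read_record(f, 3))   # ['Smith', 'Terry', '19800101', 'AB1 2CD']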

Randomly organised direct access files cannot be read sequentially, for reasons relating to the file structure (e.g. the records are not stored contiguously). In addition, with a high hit rate application, when updating multiple records, either the transaction list (what needs to be updated) must be stored in the same order as the records, or each record must be found through hashing. This isn't fast or efficient!


Key Terms

Master File

In most systems using flat file storage, the term master file refers to files which contain the most up-to-date information. In large systems this may be a number of files.

Transaction File

In order to keep the master file up-to-date, it is often more efficient to store a list of changes that need to be made rather than to update the master file each time a minor change is needed. This would typically be done in a transaction file.

Changes to files may include insertion, deletion and amendment. The transaction file may include the whole of a record that needs to be updated, along with an instruction as to what type of change needs to be made (insert, delete or amend).

Reference Files

Reference files are used to store data which a program uses when running but does not make changes to. Tax data is an example.

File Organisation

Files can be stored in different ways depending on how the file is used, the volume of data to be processed when the file is accessed, and the frequency with which the file will be modified.

Hit Rate

The hit rate refers to how much of the file is accessed in one operation. For example, a payroll file will have a very high hit rate when the pay is calculated and payslips are produced. It would have a low hit rate during the working month, when only a few records are being added or updated.

File Organisation Types:

See the PP for details on serial.

Types of file organisation are:

  • Serial Access File Organisation

  • Sequential Access File Organisation

  • Direct Access File Organisation

  • Index Sequential Access File Organisation

Sequential [access] File Organisation:

  1. All records are stored in an organised/sorted sequential order by a key, which is usually unique (see notes).

  2. That is, the records are arranged in the ascending or descending order of a key field.

    • In a student information system, the file would contain roll number, name, division, marks obtained in the examination.

    • In a payroll application, the records are stored with employee number as a key field.

  3. To locate a particular record in such a file organisation, we have to start searching from the beginning of the file and continue until it is found.

  4. Normally created and maintained on magnetic tapes, e.g. audio cassettes.

  5. There is no need for any storage space identification (it grows with the data set)

Advantages:

  1. Simple to understand – no need to hash or otherwise determine a location

  2. Easier to organise and maintain

  3. Economical when accessing a large number of records

Disadvantages:

  1. Entire file has to be processed

      • Sorting does not remove the need to read past other records when searching for a particular one.

  2. Transactions must be sorted in a particular sequence before processing

  3. Searching is time-consuming

    1. Random enquiries cannot be handled efficiently (indexed sequential solves this problem)

If a particular record must be found in a file and the number of records in the file is very small, then it may not be difficult to search the file from beginning to end to find the desired record (serial). For files containing large numbers of records, however, this method is inefficient. A special ordering technique is needed so that records can be retrieved more easily. For this reason, records are arranged according to a key value (often referred to as a primary key). The key is one data field chosen to identify the record. Since a key is used to locate a particular record, it is often unique; that is, no two records in a file can have the same key value, but this is not always the case. It very much depends on what key is being used.

Benefits of using Sequential over Serial

On the face of it, the only difference between serial and sequential file structure (both accessed sequentially – confusing, I know!) is that sequential files are sorted and serial files are not. The main benefit appears when there is a high hit rate on the file. Let's say that 90% of the records are going to be accessed for processing. If the file were unsorted, the program would potentially have to read through the entire file (until a match was found) for EACH record that needed processing. Knowing the records are sorted in order, the file can be read through in one pass.

On that basis, this method is far superior to random access for this type of application (though not to indexed sequential files).

In the example below, the social insurance number field is used as the key.

SIN (key)      Last Name    First Name
628-306-591    Smith        Terry
632-233-579    Wilson       Barbara
633-306-591    White        Doug
640-354-966    Smith        Heather
641-692-853    McMillan     Terry

Employee File

National Insurance numbers, for example, are excellent keys for employee records because no two people in the United Kingdom have the same number. An employee record is located by searching for the appropriate value in the National Insurance field. The key value in an inventory file could be the item number. When records are ordered according to their key values, a sequential file is formed.

Updating a sequential file involves two sets of files: the master file and transaction file. The master file is the file containing all existing records. The transaction file is the file containing the changes to be made to the master. During the updating process a third file, the new master file, is created.

Before updating begins, the old master file must be in sequential order according to the key field. The transaction file should then be sorted on the same key as the master file. The computer compares the key of the first master file record with the key from the first record in the transaction file. If the keys do not match (the master key is the lower of the two), the computer writes the master record to the new master file as is, and then reads the next record from the master file.

When a match between the master and transaction records occurs, the computer updates the new master file record. Sometimes if a transaction record has no matching master record, the computer may generate an error message. Some transactions may add a new record, some may modify a record, while others may delete an existing record. This logic is determined by the programmer(s).

With sequential processing there is no direct way to locate the matching master record for a transaction. Each time a certain master record is to be updated, the entire master file must be read and a new master file created. Since this makes updating one record in a file very time-consuming, transactions are collected over a period of time and processed in one run (see below). Therefore, batch file access must be used with sequential file design.

Direct Access File Organisation (Random Access)

This is where records are stored randomly (e.g. via hashing a key field) but are accessed directly.

  1. Also known as Random Access or relative file organisation.

  2. Records are stored on a direct access storage device (DASD), such as a magnetic disk (hard disk).

  3. For direct access, the file is viewed as a numbered sequence of blocks or records.

  4. A block number, derived from the record's key, is used to access the desired information randomly.

    • Records are a fixed size, so fields require padding to fill their allocated byte storage

  5. It allows arbitrary (random) blocks to be read or written.

  6. It is useful for immediate access to large amounts of information, and is often used when accessing large databases.

  7. This technique is called hashing.

Advantages:

  1. Immediate access to the desired records.

  2. No sorting of the records is required. Locations are generally hashed.

  3. Faster updating of several records, as the system can go directly to each one

  4. Helps in online transaction processing systems, such as online reservation systems.

Disadvantages:

  1. Data may be accidentally erased or over-written unless special precautions are taken (collisions)

  2. Less efficient as compared to sequential file organisation in the use of storage space (records are often fixed in size)

  3. Only one key is used – the value that is hashed

    • If you determine the location by CustomerID, you cannot search on surname without potentially having to access every single record, as the location was determined by a different attribute.

  4. Cannot be accessed sequentially

Direct access file design also uses the key field of the records but in a different way from sequential design. The key field provides the only way of accessing data within a direct access file design. Therefore, records are stored according to their hash value.

The data record being sought is retrieved according to its key value, so records before or after it are not read. Generally the key is converted into a memory or disk address for a particular record or file. The address is usually a number from five to seven digits that is related to the physical characteristics of the storage medium. When a file is created, this address determines where the record is written. During retrieval, the address determines where to locate the record. Another way to obtain the address of a record is to place the record keys and their corresponding addresses in a directory. During processing, the computer searches the directory to locate the address of a particular record.

Direct-access file design is much more efficient than searching an entire data file for a particular record. It is useful when information must be updated and retrieved quickly and when current information is crucial. A common application of direct-access file organisation is airline seat reservations. Current information about available flights and seats must be available at all times so that flights are not overbooked. Virtually all modern computer systems use a direct-access file mechanism when opening and running applications, including the portable and desktop computers you are using now.

In contrast to a batch-processing system, a direct-access system does not require transaction records to be grouped or sorted before they are processed. Data is submitted to the computer in the order it occurs, usually using an online access method. Direct-access storage devices (DASDs), such as magnetic disk drive units, make this type of processing possible. A particular record on a master file can be accessed directly, using its assigned address or keys, and updated without all preceding records on the file being read. Only the key to the record needs to be known. Thus, up-to-the-minute reports can be provided.

With direct-access processing, the computer can locate the record to be updated without processing all records that precede it.

Index Sequential Access File Organisation (ISAM):

  1. This file organisation is a synthesis of the above two file organisations.

  2. It combines the positive features of both the sequential and direct access file organisations.

  3. Here records are stored sequentially on a direct access device, such as a magnetic disk, by a primary key. Hence, we can access data either sequentially or randomly using the index. The index is stored in a file and read into memory when the file is opened; it provides the location of each record.

  4. It may have multiple keys. These keys can be alphanumeric

  5. The key upon which the data records are ordered is called the primary key.

  6. Other keys are called alternate keys

Advantages:

  1. Multiple keys, which can be alphanumeric, are supported

  2. Both sequential and random access is possible

    1. Accessing records is fast if the index table is properly organised

Disadvantages:

  1. More storage space is needed because of the presence of Index

    1. Less efficient in the use of storage space as compared to other file organisations

  2. It requires special software, i.e. it is expensive.

Sequential processing is suitable for applications in which the proportion of records processed in an updating run is high. However, sequential files provide slow response times and cannot adequately handle file inquiries. On the other hand, direct-access processing is inappropriate for applications like payroll, where most records are processed during a single run. When a single file must be used for both batch processing and online processing, neither direct-access nor sequential file organisation is appropriate. The same customer file that is used in a weekly batch run for preparing bills by the accounting department may be used daily by order entry personnel to record orders and check credit status. To some extent, the limitations of both types of file design can be minimised by using another approach to file organisation, indexed-sequential design.

In this structure, the records are stored sequentially on a direct-access storage device according to a primary key. A primary key is a field that will often be unique for each record on the file. In addition, secondary keys can also be established. Secondary keys are fields that are used to gain access to records on the file but may not be unique. For instance, if postcode is chosen as a secondary key, there may be several records with the same value. Records on an indexed-sequential file can be accessed randomly by using either the primary or one of the secondary keys, or the file can be read sequentially, in order according to the primary key. As with random access, indexed-sequential files must have a fixed record length.

The method used to gain access to a record on an indexed-sequential file is a little different from the method used for a direct-access file. Every record on an indexed-sequential file may not have its own unique address. Rather, several records may be grouped together and one address given for the entire group. An index table is created for all fields that are primary or secondary keys. The index table lists the value of the key (such as social security number) and the corresponding address on the direct-access storage device at which the group containing that record can be found. (The index table can either be stored at the beginning of the file or a separate file of indexes may be created.) A key given by the user is matched against the index table to get an approximate address for the required record. The computer then goes to that location on the direct-access storage device and checks records sequentially until the desired record is found. In the case of secondary keys, all records with that key may be retrieved.
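
A minimal Python sketch of that index-table lookup, using the SIN values from the earlier table; the in-memory index list, block addresses and grouping are illustrative, not a real indexed-sequential implementation.

    # The index lists the highest key in each group (block) and the block's
    # address; a key is matched against the index to find the right block,
    # which is then scanned sequentially.
    index_table = [
        ("628-306-591", 0),   # (highest key in block, block address)
        ("640-354-966", 1),
        ("999-999-999", 2),
    ]

    blocks = {
        0: [("620-111-222", "Adams"), ("628-306-591", "Smith")],
        1: [("632-233-579", "Wilson"), ("640-354-966", "Smith")],
        2: [("641-692-853", "McMillan")],
    }

    def find(key):
        for highest_key, address in index_table:    # approximate location
            if key <= highest_key:
                break
        for record_key, record in blocks[address]:  # then a sequential scan
            if record_key == key:
                return record
        return None

    print(find("632-233-579"))   # Wilson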

File Access Algorithms

It is unclear if this knowledge is required for the exam, but it is sensible to look and process it in case you are asked to discuss or code methods of access.

Serial Files

Records are stored one after another in no particular order. A suitable file organisation for transaction files.

Adding Records To A Serial File

    • Open File

    • Append Record To End Of File

Deleting A Record From A Serial File

    • Open old file for reading

    • Open new file for writing

    • Start from beginning of old file

    • Repeat

      • Read next record as Current Record

      • If current record <> record to be deleted then

      • Write record to new file

      • End If

    • Until End Of Old File

Finding A Particular Record In A Serial File

    • Open file for reading

    • Start from beginning of file

    • Repeat

      • Read next record

      • Test for a match

    • Until End Of File OR match made
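
These algorithms map directly onto basic file handling. A sketch in Python, assuming one text line per record (file names and record layout are illustrative):

    def add_record(filename, record):
        with open(filename, "a") as f:          # open file
            f.write(record + "\n")              # append record to end

    def delete_record(old_name, new_name, target):
        # Copy every record except the one to be deleted into a new file.
        with open(old_name) as old, open(new_name, "w") as new:
            for line in old:                    # repeat until end of old file
                if line.rstrip("\n") != target:
                    new.write(line)

    def find_record(filename, target):
        with open(filename) as f:               # start from beginning of file
            for line in f:                      # read next record
                if line.rstrip("\n") == target: # test for a match
                    return line
        return None                             # end of file: no match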

Sequential Files

Records are stored one after another but in key sequence, e.g. by customer number. Suitable for master files, particularly when programs routinely have to access every record (e.g. a payroll system).

Adding A Record To A Sequential File

    • Open old file for reading

    • Open new file for writing

    • Start from beginning of old file

    • Repeat

      • Read next record as Current Record

      • If current record key > key of record to be added then

      • Write new record to the new file

      • End If

      • Write current record to new file

    • Until new record inserted or End Of Old File

    • If new record not inserted then

    • Write new record to new file

    • End If

    • If not End Of Old File then

    • Repeat

      • Read next record as Current Record

      • Write current record to the new file

    • Until End Of Old File

    • End If
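
A Python sketch of this insertion, under the simplifying assumption that each record is one text line and the whole line acts as the key (the trailing copy loop from the pseudocode is handled by simply continuing to copy lines):

    def insert_record(old_name, new_name, new_record):
        inserted = False
        with open(old_name) as old, open(new_name, "w") as new:
            for line in old:                        # read next record
                current = line.rstrip("\n")
                # Write the new record just before the first larger key.
                if not inserted and current > new_record:
                    new.write(new_record + "\n")
                    inserted = True
                new.write(line)                     # copy the current record
            if not inserted:                        # larger than every key:
                new.write(new_record + "\n")        # append at the end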

Deleting A Record From A Sequential File

    • Open old file for reading

    • Open new file for writing

    • Start from beginning of old file

    • Repeat

      • Read next record as Current Record

      • If current record key <> key of record to be deleted then

      • Write record to new file

      • End If

    • Until End Of Old File

Finding A Particular Record In A Sequential File

    • Open file for reading

    • Start from beginning of file

    • Repeat

      • Read next record

      • Test for a match

    • Until record found OR key of current record > key of record wanted (doesn't exist)
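
Because the file is sorted, the search can give up as soon as it has gone past where the record would be. A sketch assuming key,rest lines sorted on the key:

    def find_sequential(filename, target_key):
        with open(filename) as f:          # start from beginning of file
            for line in f:                 # read next record
                key = line.split(",", 1)[0]
                if key == target_key:      # match
                    return line.rstrip("\n")
                if key > target_key:       # gone past it: doesn't exist
                    return None
        return None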

Updating Files

    • Updates are performed on Master Files (sequential).

    • Changes that need to be made are stored in a Transaction File (serial).

    • These changes may be

      • Insertion Of A New Record

      • Deletion Of An Existing Record

      • Amendment Of An Existing Record

Updating By Copying

The name given to this method of updating a sequential file is updating by copying. First, the transaction file must be sorted into the same order as the master file.

    • A record is read from the master file.

    • A record is read from the transaction file.

    • The keys are compared to see if an update is necessary.

    • If the record from the master has a key lower than the one from the transaction file, it needs no update and can be written to a new file.

    • Another record is read from the master file.

    • If there is a transaction with the same key, the change is made to the master record in main memory. It is kept there until there are no more transactions with the same key.

Updating By Copying - Algorithm

    • Open master file for reading

    • Open transaction file for reading

    • Open new master file for writing

    • Repeat

      • Read next transaction record

      • While master record key < transaction record key

        • Write master record to new file

        • Read next master record

      • End While

      • Update record

    • Until End Of File (Transaction)

    • While Not End Of File (master)

      • Read next master record

      • Write master record to new master file

    • End While
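
The same algorithm sketched in Python, with some simplifying assumptions: records are key,value lines, both files are already sorted on the key, and every transaction is an amendment whose key exists in the master file (insertions, deletions and unmatched keys are left out for brevity).

    def update_by_copying(master_name, transaction_name, new_master_name):
        with open(master_name) as master, \
             open(transaction_name) as transactions, \
             open(new_master_name, "w") as new_master:
            records = iter(master)
            record = next(records, "")              # first master record
            for transaction in transactions:        # read next transaction
                t_key, t_value = transaction.rstrip("\n").split(",", 1)
                # Copy master records whose keys precede the transaction key.
                while record and record.split(",", 1)[0] < t_key:
                    new_master.write(record)
                    record = next(records, "")
                # Keys match: write the amended record in place of the old.
                new_master.write(t_key + "," + t_value + "\n")
                record = next(records, "")
            while record:                           # copy the remainder
                new_master.write(record)
                record = next(records, "")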

Direct Access Files

The following terms are also used to refer to this type of file.

    • Random File

    • Hash File

    • Relative File

They all mean the same thing.

Organisation Of Random Files

Records are stored and retrieved according to where they are in the file or on the disk. In a simple example you might store customer 1's record in block 1 of the disk, customer 2's in block 2, and so on.

How Do You Work Out Where To Store It?

Hashing Algorithms

To find out which address to use for a particular record you apply a hashing algorithm to the key of the record. We call this relative file addressing.

Hashing Example

A hashing algorithm might calculate disk addresses by dividing the record key by 1000 and taking the remainder (modulus).

    • Record Number = 75481

    • 75481/1000 = 75 remainder 481

    • Disk Address = 481
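
That worked example in Python (the modulus operator does the work):

    def disk_address(record_key):
        # Keep the remainder after dividing by 1000 (modulus).
        return record_key % 1000

    print(disk_address(75481))   # 481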

Why Bother Doing That?

Imagine you have around 800 records and the record keys are 5-digit numbers. You would need 99,999 blocks of storage space to accommodate every possible record key if you used the key itself to determine where to position the record. A 3-digit disk address is far more efficient.

Smash, Bang, Wallop!

What happens if the algorithm returns the same address for two different records? This is called a synonym or a collision. When this occurs you need some way of resolving where the record is to be stored.

You could,

    • Place the record in the next available slot and adapt the program to look there if it can't find the record where it expects it to be.

    • Store it in an overflow area of the file, placing a flag next to the existing record (the flag is set to false when the file is created and changed to true if a collision occurs).

More Hash Less Smash

A good hashing algorithm...

    • Can generate all of the available addresses on the disk

    • Is quick to calculate

    • Causes relatively few collisions.

Adding Records To A File

    • Apply the hashing algorithm.

    • Store the record at the address calculated.

    • If that space is taken, use the next available space.

How A Record Is Found

    • Hashing algorithm used to find where the record is expected to be stored.

    • If a different record is found, the next space is searched until the record is found or until a blank space is found (record doesn't exist)

How Do You Delete Records?

If there were no such thing as collisions, the process would be easy – use the hashing algorithm and remove the record at that address. But deleting a record outright could leave any records that collided and were placed in the next space undetectable by the search program. Why? Because the search stops as soon as it reaches a blank space.

Deleting Records

How about not actually deleting the record? Set a flag next to it and only acknowledge its presence when looking for records that you might expect to be at that location. This is called logically deleting the record - it is still physically there.
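
Pulling the last few sections together – hashed addresses, probing to the next free slot, and logical deletion – here is a small in-memory sketch in Python; the table size and record shape are illustrative.

    TABLE_SIZE = 1000
    EMPTY, DELETED = None, "<deleted>"       # DELETED plays the "flag" role
    table = [EMPTY] * TABLE_SIZE

    def address(key):
        return key % TABLE_SIZE              # the hashing algorithm

    def add(key, record):
        slot = address(key)
        # If the home slot is taken, use the next available space.
        while table[slot] not in (EMPTY, DELETED):
            slot = (slot + 1) % TABLE_SIZE
        table[slot] = (key, record)

    def find(key):
        slot = address(key)
        # Stop at a blank slot (record doesn't exist), but step over
        # logically deleted slots: a colliding record may lie beyond them.
        while table[slot] is not EMPTY:
            if table[slot] != DELETED and table[slot][0] == key:
                return table[slot][1]
            slot = (slot + 1) % TABLE_SIZE
        return None

    def delete(key):
        slot = address(key)
        while table[slot] is not EMPTY:
            if table[slot] != DELETED and table[slot][0] == key:
                table[slot] = DELETED        # logical delete: flag it, the
                return                       # record slot stays occupied
            slot = (slot + 1) % TABLE_SIZE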

Hashing

Hashing is the process of taking a variable-length input string and returning a fixed-length output value (determined by the hash size, e.g. 128 bits). Hashing algorithms are always deterministic: the same input ALWAYS produces the same output. However, they should prevent backwards lookup, and similar strings should produce very different outputs.

e.g. "Ted was really silly" might produce 12345, but "ted was really silly" might give 4.

A hash is the output value, and the hashing algorithm is the function producing the hash.

Hashing, as covered above, can be used with random access files: a key field is passed in and the output gives the record's position in the file. As random access files have a fixed record size, this value is multiplied by the record size to get the byte position where the record will be read or written.

Hashing algorithms can be as straightforward as adding up all the ASCII values in the input string and taking the modulus of the result. The MOD operation is what constrains the output to a given range, e.g. an 8-bit hash will lie between 0 and 255.
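
For instance, a sketch of that ASCII-sum approach:

    def hash_8bit(text):
        # Sum the character codes, then force the result into 0..255.
        return sum(ord(ch) for ch in text) % 256

    print(hash_8bit("Ted was really silly"))
    print(hash_8bit("ted was really silly"))   # a different value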