File Storage Concepts
A file is a sequence of records. File organization refers to the physical layout, or structure, of record occurrences in a file. File organization determines the way records are stored and accessed.
In many cases, all records in a file are of the same record type. If every record in the file has exactly the same size (in bytes), the file is said to be made up of fixed-length records. If different records in the file have different sizes, the file is said to be made up of variable-length records. A file may have variable-length records for several reasons:
- The file records are of the same record type, but one or more fields are of varying sizes (variable-length fields).
- The file records are of the same record type, but one or more fields may have multiple values for individual records. Such a field is called a repeating field, and a group of values for the field is often called a repeating group.
- The file records are of the same record type, but one or more fields are optional.
- The file has records of different record types and hence of varying size (mixed file). This would occur if related records of different types were clustered (placed together) on disk blocks.
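The distinction between the two layouts can be sketched in code. The snippet below is illustrative only: the field names, sizes, and the `|` separator are arbitrary choices, not part of any particular file system.

```python
import struct

# Fixed-length records: every record occupies exactly the same number of
# bytes, so record i starts at byte offset i * RECORD_SIZE.
FIXED_FMT = "<10s20si"                # key (10 bytes), name (20 bytes), amount (int)
RECORD_SIZE = struct.calcsize(FIXED_FMT)

def pack_fixed(key: str, name: str, amount: int) -> bytes:
    return struct.pack(FIXED_FMT,
                       key.encode().ljust(10),
                       name.encode().ljust(20),
                       amount)

# Variable-length records: a length prefix tells the reader where each
# record ends, since sizes differ from record to record.
def pack_variable(fields: list[str]) -> bytes:
    body = "|".join(fields).encode()  # '|' as an arbitrary field separator
    return struct.pack("<H", len(body)) + body

fixed = pack_fixed("A001", "Alice", 500)
var = pack_variable(["A001", "Alice", "500", "an-optional-note"])
print(len(fixed), RECORD_SIZE)        # every fixed record has the same size
print(len(var))                       # size depends on the field contents
```

With fixed-length records the address of the n-th record is pure arithmetic; with variable-length records the reader must walk the length prefixes (or a separate index) to find a record boundary.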
The records of a file must be allocated to disk blocks because a block is the unit of data transfer between disk and memory. The division of a track (on the storage medium) into equal-sized disk blocks is set by the operating system during disk formatting. The hardware address of a block comprises a surface number, a track number and a block number. A buffer – a contiguous reserved area in main storage that holds one block – also has an address. For a read command, the block is copied from disk into the buffer, whereas for a write command the contents of the buffer are copied into the disk block. Sometimes several contiguous blocks, called a cluster, are transferred as a unit; in such cases the buffer size is adjusted to the cluster size. When the block size is larger than the record size, each block will contain numerous records, but there can also be files whose large records do not fit in one block. In the latter case the records span more than one block.
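The arithmetic behind allocating records to blocks is worth making explicit. The sketch below assumes an unspanned organization (records never cross block boundaries); the block and record sizes are illustrative numbers, not defaults of any real system.

```python
import math

def blocking_stats(block_size: int, record_size: int, num_records: int):
    """Unspanned organization: records never cross block boundaries."""
    bfr = block_size // record_size              # blocking factor: records per block
    blocks_needed = math.ceil(num_records / bfr) # blocks to hold the whole file
    wasted_per_block = block_size - bfr * record_size  # unused bytes per block
    return bfr, blocks_needed, wasted_per_block

# e.g. 4096-byte blocks, 100-byte records, 30,000 records
bfr, blocks, waste = blocking_stats(4096, 100, 30_000)
print(bfr, blocks, waste)   # 40 records/block, 750 blocks, 96 bytes unused per block
```

The wasted bytes per block are the cost of the unspanned layout; a spanned organization recovers that space at the price of records that straddle block boundaries.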
Here it is worthwhile to note the difference between the terms file organization and access method. A file organization refers to the organization of the data of a file into records, blocks and access structures; this includes the way the records and blocks are placed on the storage medium and interlinked. An access method, on the other hand, provides a group of operations – such as find, read, modify and delete – that can be applied to a file. In general, it is possible to apply several access methods to a file organization. Some access methods, though, can be applied only to files organised in certain ways. For example, we cannot apply an indexed access method to a file without an index.
Sequential Access Method (SAM)
In sequential files, the records are stored in a predefined order. Records in a sequential file are usually sorted on the primary key and physically arranged on the storage medium in order by primary key. If only sequential access is required (which is rarely the case), sequential media (magnetic tapes) are suitable and probably the most cost-effective way of processing such files. Direct access devices such as disks may be, but are not necessarily, referenced sequentially. Some types of processing are best done through sequential access, even when direct access devices are used.
Sequential access is fast and efficient when dealing with large volumes of data that need to be processed periodically. However, it requires that all new transactions be sorted into the proper sequence before processing. Also, most of the database or file may have to be searched to locate, store, or modify even a small number of data records. Thus, this method is too slow to handle applications requiring immediate updating or responses.
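The classic use of this pattern is the batch update: both the master file and the accumulated transactions are sorted on the primary key, then merged in a single sequential pass. A minimal sketch, using in-memory lists of (key, value) pairs to stand in for the two files:

```python
def sequential_update(master, transactions):
    """master, transactions: lists of (key, value) sorted by key.
    A transaction with a matching key replaces the master record;
    an unmatched transaction is treated as an insertion."""
    result, i, j = [], 0, 0
    while i < len(master) and j < len(transactions):
        if master[i][0] < transactions[j][0]:
            result.append(master[i]); i += 1          # keep master record
        elif master[i][0] > transactions[j][0]:
            result.append(transactions[j]); j += 1    # insert new record
        else:
            result.append(transactions[j]); i += 1; j += 1  # update in place
    result.extend(master[i:])
    result.extend(transactions[j:])
    return result

master = [(1, "a"), (3, "c"), (5, "e")]
txns = [(2, "b"), (3, "C")]                # one insertion, one update
print(sequential_update(master, txns))
# [(1, 'a'), (2, 'b'), (3, 'C'), (5, 'e')]
```

Each input is read exactly once, which is why the method is efficient for large periodic batches but useless for updating one record on demand.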
Sequential files are generally used for backup or transporting data to a different system. A sequential ASCII file is a popular export/import format that most database systems support.
Indexed Sequential Access Method (ISAM)
In indexed sequential files, record occurrences are sorted and stored in order by primary key on a direct access storage device. In addition, a separate table (or file) called an index is maintained on primary key values to give the physical address of each record occurrence. This approach gives (almost) direct access to record occurrences via the index table and sequential access via the way in which the records are laid out on the storage medium. The physical address of a record given by the index file is also called a pointer. The pointer or address can take many forms depending on the operating system and the database one is using.
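The idea can be sketched with a small in-memory model: a "data file" of records laid out in key order, plus an index of (key, byte-offset) pairs that serves as the pointer table. The record format and names here are invented for illustration.

```python
import bisect

# Records sorted by primary key, as they would be laid out on disk.
records = [(10, "rec-10"), (20, "rec-20"), (30, "rec-30"), (40, "rec-40")]

# Build the "data file" and the index in key order.
data, index_keys, index_offsets = bytearray(), [], []
for key, payload in records:
    index_keys.append(key)
    index_offsets.append(len(data))       # pointer: byte offset of the record
    data += payload.encode() + b"\n"

def lookup(key):
    """(Almost) direct access: binary-search the index, then one seek."""
    i = bisect.bisect_left(index_keys, key)
    if i == len(index_keys) or index_keys[i] != key:
        return None                       # key not present
    off = index_offsets[i]
    end = data.index(b"\n", off)
    return data[off:end].decode()

print(lookup(30))   # found via the index, without a sequential scan
```

Because the data itself stays sorted, a full sequential pass over `data` is still possible; the index simply adds the direct path on top.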
Today systems use virtual addresses instead of physical addresses. A virtual address could be based on an imaginary disk drive layout: the database refers to a base set of tracks and cylinders, and the computer then maps these values into actual storage locations. This arrangement is the basis for an approach known as the Virtual Storage Access Method (VSAM). Another common approach is to define a location in terms of its distance from the start of a file (a relative address). Virtual and relative addresses are preferable to physical addresses because of their portability.
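Mapping a relative address back to a storage location is simple arithmetic. A minimal sketch, assuming fixed-length records, an unspanned layout, and illustrative block and record sizes:

```python
BLOCK_SIZE = 4096
RECORD_SIZE = 128
RECORDS_PER_BLOCK = BLOCK_SIZE // RECORD_SIZE   # 32 records per block

def relative_to_location(slot: int):
    """slot: 1-based relative record number (distance from start of file).
    Returns (block number within the file, byte offset within the block)."""
    block = (slot - 1) // RECORDS_PER_BLOCK
    offset = ((slot - 1) % RECORDS_PER_BLOCK) * RECORD_SIZE
    return block, offset

print(relative_to_location(1))    # (0, 0)
print(relative_to_location(33))   # (1, 0): first record of the second block
```

The portability follows directly: the file can be moved to a different device, and only the final block-number-to-hardware-address mapping changes.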
When only a few records need to be processed quickly, the index is used to access them directly. When large numbers of records must be processed periodically, the sequential organization provided by this method is used instead.
Direct Access Method (DAM)
When using the direct access method, the record occurrences in a file do not have to be arranged in any particular sequence on the storage medium. However, the computer must keep track of the storage location of each record, using one of a variety of direct organization methods, so that data can be retrieved when needed. New transaction data do not have to be sorted, and processing that requires immediate responses or updating is easily handled.
In the direct access method, an algorithm is used to compute the address of a record. The primary key value is the input to the algorithm, and the block address of the record is the output.
To implement the approach, a portion of the storage space is reserved for the file. This space should be large enough to hold the file plus some allowance for growth.
Then an algorithm that generates the appropriate address for a given primary key is devised; it is commonly called a hashing algorithm. The process of converting primary key values into addresses is called key-to-address transformation. More than one logical record usually fits into a block, so we may think of the reserved storage area as being broken into record slots sequentially numbered from 1 to n. These sequential numbers are called relative pointers or relative addresses, because they indicate the position of a record relative to the beginning of the file. The objective of the hashing algorithm is to generate relative addresses that disperse the records throughout the reserved storage space in a random but uniform manner. The records can be retrieved very rapidly because the address is computed rather than found through table look-up via indexes stored on a disk file. A collision is said to occur if more than one record maps to the same block. Since one block usually holds several records, collisions are a problem only when the number of records mapping to a block exceeds the block's capacity. To handle this event, most direct access methods support an overflow area, which is searched sequentially.
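The whole scheme can be sketched in a few lines. The bucket count, bucket capacity, and division-remainder hash below are arbitrary illustrative choices; real systems tune all three.

```python
NUM_BUCKETS = 7          # reserved storage: 7 "blocks" (buckets)
BUCKET_CAPACITY = 2      # each block holds up to 2 records

buckets = [[] for _ in range(NUM_BUCKETS)]
overflow = []            # overflow area, searched sequentially on demand

def address(key: int) -> int:
    """Key-to-address transformation: a simple division-remainder hash."""
    return key % NUM_BUCKETS

def insert(key, value):
    b = buckets[address(key)]
    if len(b) < BUCKET_CAPACITY:
        b.append((key, value))
    else:
        overflow.append((key, value))   # block full: spill to overflow area

def find(key):
    for k, v in buckets[address(key)]:  # one computed access, no index lookup
        if k == key:
            return v
    for k, v in overflow:               # sequential search of the overflow area
        if k == key:
            return v
    return None

for k in (3, 10, 17, 24):               # all four keys hash to bucket 3
    insert(k, f"rec-{k}")
print(find(10), find(24))               # one from the bucket, one from overflow
```

Note that 3 and 10 fit in the bucket, while 17 and 24 exceed its capacity and land in the overflow area; lookups for those keys pay the extra sequential search, which is exactly why a hash that disperses keys uniformly matters.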
The hashed key approach is extremely fast, since the key's value is converted directly into a storage location and the data can usually be retrieved in a single disk access.