Lecture: Basic File Structures, Hashing, and Modern Storage Architectures
Readings: Chapter 16, Fundamentals of database systems, Seventh Edition (R. Elmasri, S. Navathe).
Placing File Records on Disk
- a database is stored as a collection of files. Each file is a sequence of records. A record is a sequence of fields
- recall a block is the unit of data transfer between disk and memory
- two ways to store records on blocks
- spanned
- larger than a single block
- pointer at end of the first block points to block containing the remainder of record
- unspanned
- records not allowed to cross block boundaries
- spanned
- a database system seeks to minimize the number of block transfers between the disk and memory
- this can be achieved by keeping as many blocks as possible in main memory
- buffer: portion of main memory available to store copies of disk blocks
- buffer manager: subsystem responsible for allocating buffer space in main memory
- blocking factor: average number of records per block per file
bfr=floor(B/R), whereBis the block size in bytes, andRis the record size in bytes
- operation on files
- retrieval operations: no change to file data
- update operations: file change by insertion, deletion, or modification
- records selected based on selection condition
Organization of Record in Files
- heap: a record can be placed anywhere in the file where there’s space
- records are placed in file in order of insertion
- inserting a new record is very efficient
- searching for a record is linear
- average blocks to access a specific record:
b/2(bis # of blocks)
- sequential: store records in sequential order, based on the value of the search key of each record
- records sorted by an ordering field. This is suitable for applications that require sequential processing of the entire file
- reading is records is very efficient (can use binary search)
- finding next record is also efficient
- deletion: use pointer chains
- insertion: locate position where the record is to be inserted (need to check if there’s free space or not). In either case, pointer chain must be updated
- need to organize file from time to time to restore sequential order
- average blocks to access a specific record:
log_2(b)(bis # of blocks)
- hashing: a hash function computed on some attribute of each record. The result specifies in which block of the file the record should be placed
- average blocks to access a specific record: 1
- collision: hash field value for inserted record hashes to address already containing a different record
- collision resolution: process of finding another position for a record when a collision occurs
- open addressing: check subsequent positions in order until an unused (empty) position is found
- multiple hashing: apply a second hash function if the first results in a collision
- chaining: various overflow locations are kept
- hashing techniques with dynamic file expansion
- extensible hashing: file performance does not degrade as file grows
- dynamic hashing: maintains tree-structured directory
- linear hashing: allows hash file to expand and shrink buckets without needing a directory