CSI-308›Storage and File Structure

Database SystemsTopic 16 of 22

Storage and File Structure

8 minread

1,286words

Intermediatelevel

Storage and File Structure in Databases

In database systems, storage and file structure refer to how data is physically stored on disk and how it is organized for efficient access, retrieval, and modification. The design of these structures plays a critical role in ensuring that the database performs efficiently, scales well, and maintains data integrity.

In this context, we will cover the following concepts:

Data Storage Basics
File Organization Techniques
Data Access Paths
Data Structures for Efficient Storage
Disk Storage and Buffer Management
Types of File Structures

1. Data Storage Basics

When a DBMS stores data, it does so in files that reside on disk. These files are managed by the operating system and the DBMS together. The DBMS abstracts the physical storage of data from users, presenting a logical view through tables, views, and indexes.

Key Concepts:

Storage Unit: The smallest unit of storage is typically called a block or page. This unit typically holds one or more rows of a table or index.
Disk: Data is stored on physical disks (HDDs or SSDs) managed by the operating system. The DBMS manages how the data is distributed across multiple disks or file systems.
File System: The underlying file system (e.g., NTFS, ext4, or ZFS) is used by the OS to handle files. The DBMS uses this file system to store data in a specific organization.

2. File Organization Techniques

File organization refers to how data is stored in files on disk. Efficient file organization is essential for optimal performance, as it impacts how quickly data can be retrieved, inserted, updated, or deleted.

Types of File Organization:

Heap File Organization: This is the simplest type of file organization, where records are inserted randomly in the file without any particular order. It is inefficient for large databases because searching and updating records requires a full table scan.
- Example: A table of students where each new student record is added at the end of the file.
Sequential File Organization: In this organization, records are stored in sorted order based on a key attribute. It allows for efficient range queries but is slow for insertions and deletions since records must be shifted to maintain order.
- Example: Storing employee records sorted by employee ID.
Hashed File Organization: This organization uses a hash function to map a key value to a specific file location. It allows fast retrieval based on the key but is inefficient for range queries, as there is no ordering.
- Example: A hash table to store user records where the key is the user ID.

3. Data Access Paths

A data access path determines how the DBMS locates and retrieves data from storage. Access paths are crucial for query performance because they dictate the efficiency of retrieving records based on specific conditions (like searching by a key or sorting by an attribute).

Common Data Access Paths:

Primary Access Path: This refers to the fastest path to access data, usually based on the primary key. For example, accessing records by their unique identifier (e.g., student ID).
Secondary Access Path: This is used for non-primary key attributes, such as searching by last name or by the date of birth in the case of a "Students" table. These access paths are typically implemented using indexes.
Clustered Access Path: This refers to when records are physically grouped or "clustered" together based on a certain criterion (e.g., grouping all records of the same department in a company in adjacent blocks). It improves the performance of range queries.

4. Data Structures for Efficient Storage

Data structures help organize and store data in a way that facilitates quick access, insertion, updating, and deletion. The choice of data structure depends on the types of queries and the size of the data.

Key Data Structures:

B-trees (Balanced Trees):
- B-trees are commonly used for indexing in relational databases. They allow fast searching, insertion, and deletion. B-trees are balanced trees, meaning the depth of the tree remains constant, ensuring predictable access times.
- B-trees store data in a sorted order, which makes range queries (finding all records between two values) efficient.
Example: An index on the BookID field of the Books table might use a B-tree to allow fast lookup of books by ID.
Hash Tables:
- A hash table is used to map keys to specific values (e.g., mapping employee ID to employee records). A hash function is used to map the key to a disk location, allowing constant-time access on average.
Bitmap Indexes:
- A bitmap index is a compact data structure used to store the existence of records that satisfy a particular condition (e.g., gender = 'Male'). Bitmap indexes are especially useful for columns with low cardinality (few distinct values).
Quadtrees / R-trees:
- These are used for spatial databases, storing multi-dimensional data (e.g., geographic coordinates, image data, or objects in 3D space). They partition the space into regions to allow efficient range queries.

5. Disk Storage and Buffer Management

Data is stored on disk, but accessing data from disk is slower than from memory. To optimize disk I/O, the DBMS uses buffer management techniques to minimize the number of disk accesses.

Buffer Pool: A buffer pool is an area of memory (RAM) where the DBMS temporarily stores data pages that are frequently accessed. When a query needs data, the DBMS first checks the buffer pool to see if the data is already loaded. If not, the data is fetched from disk and placed in the buffer.
Page Replacement: When the buffer pool is full and a new page needs to be loaded, the DBMS must choose an old page to evict. Common algorithms include LRU (Least Recently Used), FIFO (First In, First Out), and Clock.
Write-Ahead Log (WAL): This technique ensures that changes made to data are logged before they are written to disk. This provides durability and is a key part of maintaining ACID properties (Atomicity, Consistency, Isolation, Durability).

6. Types of File Structures

File structures refer to how data is physically stored and organized in files. The choice of file structure affects the efficiency of accessing, inserting, and deleting data.

Common File Structures:

Heap Files:
- Heap file is an unordered collection of records. In a heap file, new records are added at the end of the file, and there is no specific order in which records are stored. This is simple but inefficient for large databases, especially if the data needs to be searched often.
Sequential Files:
- Records in sequential files are stored in sorted order, typically based on a key. This allows efficient searching for records in a specific range but can make inserts and deletions more expensive because the file must be reorganized to maintain order.
Hashed Files:
- Hash-based file structures store data in buckets, and a hash function is used to determine the bucket in which a record is stored. This structure allows fast lookups for specific keys but is not efficient for range queries or sorting.
Clustered Files:
- In a clustered file structure, related records from different tables are stored together in the same physical block. This reduces I/O costs for certain types of queries (e.g., joining tables) by ensuring that related data is stored close together on disk.

Conclusion

The storage and file structure of a database plays a crucial role in determining its efficiency and performance. By choosing the right file organization techniques, access paths, data structures, and disk management strategies, a DBMS can optimize data retrieval, insertion, updating, and deletion. Understanding these concepts is key to designing high-performance, scalable databases that meet the needs of various applications.

Previous topic 15

Physical Database Design

Next topic 17

Indexed Files

Past Papers

Open this section to load past papers

Click on Show Past Papers to see past papers.

CSI-308›Storage and File Structure

Database SystemsTopic 16 of 22

Storage and File Structure

8 minread

1,286words

Intermediatelevel

Storage and File Structure in Databases

In this context, we will cover the following concepts:

Data Storage Basics
File Organization Techniques
Data Access Paths
Data Structures for Efficient Storage
Disk Storage and Buffer Management
Types of File Structures

1. Data Storage Basics

Key Concepts:

Storage Unit: The smallest unit of storage is typically called a block or page. This unit typically holds one or more rows of a table or index.
Disk: Data is stored on physical disks (HDDs or SSDs) managed by the operating system. The DBMS manages how the data is distributed across multiple disks or file systems.
File System: The underlying file system (e.g., NTFS, ext4, or ZFS) is used by the OS to handle files. The DBMS uses this file system to store data in a specific organization.

2. File Organization Techniques

Types of File Organization:

Heap File Organization: This is the simplest type of file organization, where records are inserted randomly in the file without any particular order. It is inefficient for large databases because searching and updating records requires a full table scan.
- Example: A table of students where each new student record is added at the end of the file.
Sequential File Organization: In this organization, records are stored in sorted order based on a key attribute. It allows for efficient range queries but is slow for insertions and deletions since records must be shifted to maintain order.
- Example: Storing employee records sorted by employee ID.
Hashed File Organization: This organization uses a hash function to map a key value to a specific file location. It allows fast retrieval based on the key but is inefficient for range queries, as there is no ordering.
- Example: A hash table to store user records where the key is the user ID.

3. Data Access Paths

Common Data Access Paths:

Primary Access Path: This refers to the fastest path to access data, usually based on the primary key. For example, accessing records by their unique identifier (e.g., student ID).
Secondary Access Path: This is used for non-primary key attributes, such as searching by last name or by the date of birth in the case of a "Students" table. These access paths are typically implemented using indexes.
Clustered Access Path: This refers to when records are physically grouped or "clustered" together based on a certain criterion (e.g., grouping all records of the same department in a company in adjacent blocks). It improves the performance of range queries.

4. Data Structures for Efficient Storage

Key Data Structures:

B-trees (Balanced Trees):
- B-trees are commonly used for indexing in relational databases. They allow fast searching, insertion, and deletion. B-trees are balanced trees, meaning the depth of the tree remains constant, ensuring predictable access times.
- B-trees store data in a sorted order, which makes range queries (finding all records between two values) efficient.
Example: An index on the BookID field of the Books table might use a B-tree to allow fast lookup of books by ID.
Hash Tables:
- A hash table is used to map keys to specific values (e.g., mapping employee ID to employee records). A hash function is used to map the key to a disk location, allowing constant-time access on average.
Bitmap Indexes:
- A bitmap index is a compact data structure used to store the existence of records that satisfy a particular condition (e.g., gender = 'Male'). Bitmap indexes are especially useful for columns with low cardinality (few distinct values).
Quadtrees / R-trees:
- These are used for spatial databases, storing multi-dimensional data (e.g., geographic coordinates, image data, or objects in 3D space). They partition the space into regions to allow efficient range queries.

5. Disk Storage and Buffer Management

Data is stored on disk, but accessing data from disk is slower than from memory. To optimize disk I/O, the DBMS uses buffer management techniques to minimize the number of disk accesses.

Buffer Pool: A buffer pool is an area of memory (RAM) where the DBMS temporarily stores data pages that are frequently accessed. When a query needs data, the DBMS first checks the buffer pool to see if the data is already loaded. If not, the data is fetched from disk and placed in the buffer.
Page Replacement: When the buffer pool is full and a new page needs to be loaded, the DBMS must choose an old page to evict. Common algorithms include LRU (Least Recently Used), FIFO (First In, First Out), and Clock.
Write-Ahead Log (WAL): This technique ensures that changes made to data are logged before they are written to disk. This provides durability and is a key part of maintaining ACID properties (Atomicity, Consistency, Isolation, Durability).

6. Types of File Structures

File structures refer to how data is physically stored and organized in files. The choice of file structure affects the efficiency of accessing, inserting, and deleting data.

Common File Structures:

Heap Files:
- Heap file is an unordered collection of records. In a heap file, new records are added at the end of the file, and there is no specific order in which records are stored. This is simple but inefficient for large databases, especially if the data needs to be searched often.
Sequential Files:
- Records in sequential files are stored in sorted order, typically based on a key. This allows efficient searching for records in a specific range but can make inserts and deletions more expensive because the file must be reorganized to maintain order.
Hashed Files:
- Hash-based file structures store data in buckets, and a hash function is used to determine the bucket in which a record is stored. This structure allows fast lookups for specific keys but is not efficient for range queries or sorting.
Clustered Files:
- In a clustered file structure, related records from different tables are stored together in the same physical block. This reduces I/O costs for certain types of queries (e.g., joining tables) by ensuring that related data is stored close together on disk.

Conclusion

Previous topic 15

Physical Database Design

Next topic 17

Indexed Files

Past Papers

Open this section to load past papers

Click on Show Past Papers to see past papers.