Skip to content

MS SQL Server – System processes

July 19, 2011

I. Write page processing

Writing a modified data page.

The I/O from an instance of the Database Engine includes logical and physical writes. A logical write occurs when data is modified in a page in the buffer cache. A physical write occurs when the page is written from the buffer cache to disk.

 The buffer cache contains pages, each of which has a reference counter and an indicator of whether the page contains data modifications not yet written to disk. The reference counter shows how frequently the page has been accessed. The buffer cache is scanned periodically by worker threads (eager writing), lazy writer and checkpoint. The reference counter on each buffer page is decremented. When the reference counter on a page is equal to zero at the end of the buffer cache scan, SQL Server checks whether the page is dirty. If so data changes are written to disk and the page is freed. If the page isn’t dirty at the end of the buffer cache scan the page is freed. Once the page is freed it is placed on the free page list. Using this method, pages that are accessed frequently remain in memory, whereas those that aren’t accessed frequently are freed up for use by new queries.

When a page is modified in the buffer cache, it is not immediately written back to disk; instead, the page is marked as dirty. This means that a page can have more than one logical write made before it is physically written to disk. For each logical write, a transaction log record is inserted in the log cache that records the modification. The log records must be written to disk before the associated dirty page is removed from the buffer cache and written to disk. SQL Server uses a technique known as write-ahead logging that prevents writing a dirty page before the associated log record is written to disk. This is essential to the correct working of the recovery manager

Write-ahead logging (WAL):

SQL Server uses a write-ahead log (WAL), which guarantees that no data modifications are written to disk before the associated log record is written to disk. This maintains the ACID properties for a transaction. For more information about transactions and ACID properties, see Transactions (Database Engine).

To understand how the write-ahead log works, it is important for you to know how modified data is written to disk. SQL Server maintains a buffer cache into which it reads data pages when data must be retrieved. Data modifications are not made directly to disk, but are made to the copy of the page in the buffer cache. The modification is not written to disk until a checkpoint occurs in the database, or the modification must be written to disk so the buffer can be used to hold a new page. Writing a modified data page from the buffer cache to disk is called flushing the page. A page modified in the cache, but not yet written to disk, is called a dirty page.

At the time a modification is made to a page in the buffer, a log record is built in the log cache that records the modification. This log record must be written to disk before the associated dirty page is flushed from the buffer cache to disk. If the dirty page is flushed before the log record is written, the dirty page creates a modification on the disk that cannot be rolled back if the server fails before the log record is written to disk. SQL Server has logic that prevents a dirty page from being flushed before the associated log record is written. Log records are written to disk when the transactions are committed.

When the buffer manager writes a page to disk, it searches for adjacent dirty pages that can be included in a single gather-write operation. Adjacent pages have consecutive page IDs and are from the same file; the pages do not have to be contiguous in memory. The search continues both forward and backward until one of the following events occurs:

  • A clean page is found.
  • 32 pages have been found.
  • A dirty page is found whose log sequence number (LSN) has not yet been flushed in the log.
  • A page is found that cannot be immediately latched.

In this way, the entire set of pages can be written to disk with a single gather-write operation.

Just before a page is written, the form of page protection specified in the database is added to the page. If torn page protection is added, the page must be latched EX(clusively) for the I/O. This is because the torn page protection modifies the page, making it unsuitable for any other thread to read. If checksum page protection is added, or the database uses no page protection, the page is latched with an UP(date) latch for the I/O. This latch prevents anyone else from modifying the page during the write, but still allows readers to use it. For more information about disk I/O page protection options, see Buffer Management.

A dirty page is written to disk in one of three ways.

  • Lazy writingThe lazy writer is a system process that keeps free buffers available by removing infrequently used pages from the buffer cache. Dirty pages are first written to disk.
  • Eager writingThe eager write process writes dirty data pages associated with nonlogged operations such as bulk insert and select into. This process allows creating and writing new pages to take place in parallel. That is, the calling operation does not have to wait until the entire operation finishes before writing the pages to disk.
  • CheckpointThe checkpoint process periodically scans the buffer cache for buffers with pages from a specified database and writes all dirty pages to disk. Checkpoints save time during a later recovery by creating a point at which all dirty pages are guaranteed to have been written to disk. The user may request a checkpoint operation by using the CHECKPOINT command, or the Database Engine may generate automatic checkpoints based on the amount of log space used and time elapsed since the last checkpoint. In addition, a checkpoint is generated when certain activities occur. For example, when a data or log file is added or removed from a database, or when the instance of SQL Server is stopped. For more information, see Checkpoints and the Active Portion of the Log.The lazy writing, eager writing, and checkpoint processes do not wait for the I/O operation to complete. They always use asynchronous (or overlapped) I/O and continue with other work, checking for I/O success later. This allows SQL Server to maximize both CPU and I/O resources for the appropriate tasks. 

If you would like to push all pages to disk and clear the buffer pool for that database first issue a checkpoint – ensuring dirty pages are pushed to disk and thus “cleaned”:

Collapse this tableExpand this table
Statement Actions performed
BEGIN TRANSACTION Written to the log cache area but there is no need to flush to stable storage because the SQL Server has not made any physical changes.
INSERT INTO tblTest 1. Data page 150 is retrieved into SQL Server data cache, if not already available.

2. The page is latched, pinned, and marked dirty, and appropriate locks are obtained.

3. An Insert Log record is built and added to the log cache.

4. A new row is added to the data page in cache and set the data page status dirty.

5. The latch is released.

6. The log records associated with the transaction or page does not need to be flushed at this point because all changes remain in volatile storage.

COMMIT TRANSACTION 1. A Commit Log record is formed and the log records associated with the transaction must be written to stable storage. The transaction is not considered committed until the log records are correctly assigned to stable storage.

2. Data page 150 remains in SQL Server data cache and is not immediately flushed to stable storage. When the log records are properly secured recovery can redo the operation if necessary.

3. Transactional locks are released.

II. Read page processing

The I/O from an instance of the SQL Server Database Engine includes logical and physical reads. A logical read occurs every time the Database Engine requests a page from the buffer cache. If the page is not currently in the buffer cache, a physical read first copies the page from disk into the cache.

The read requests generated by an instance of the Database Engine are controlled by the relational engine and optimized by the storage engine. The relational engine determines the most effective access method (such as a table scan, an index scan, or a keyed read); the access methods and buffer manager components of the storage engine determine the general pattern of reads to perform, and optimize the reads required to implement the access method. The thread executing the batch schedules the reads.


The Database Engine supports a performance optimization mechanism called read-ahead. Read-ahead anticipates the data and index pages needed to fulfill a query execution plan and brings the pages into the buffer cache before they are actually used by the query. This allows computation and I/O to overlap, taking full advantage of both the CPU and the disk.

The read-ahead mechanism allows the Database Engine to read up to 64 contiguous pages (512KB) from one file. The read is performed as a single scatter-gather read to the appropriate number of (probably non-contiguous) buffers in the buffer cache. If any of the pages in the range are already present in the buffer cache, the corresponding page from the read will be discarded when the read completes. The range of pages may also be “trimmed” from either end if the corresponding pages are already present in the cache.

There are two kinds of read-ahead: one for data pages and one for index pages.

Reading Data Pages

Table scans used to read data pages are very efficient in the Database Engine. The index allocation map (IAM) pages in a SQL Server database list the extents used by a table or index. The storage engine can read the IAM to build a sorted list of the disk addresses that must be read. This allows the storage engine to optimize its I/Os as large sequential reads that are performed in sequence, based on their location on the disk. For more information about IAM pages, see Managing Space Used by Objects.

Reading Index Pages

The storage engine reads index pages serially in key order. For example, this illustration shows a simplified representation of a set of leaf pages that contains a set of keys and the intermediate index node mapping the leaf pages. For more information about the structure of pages in an index, see Clustered Index Structures.

Intermediate index node maps to leaf pages by key

The storage engine uses the information in the intermediate index page above the leaf level to schedule serial read-aheads for the pages that contain the keys. If a request is made for all the keys from ABC to DEF, the storage engine first reads the index page above the leaf page. However, it does not just read each data page in sequence from page 504 to page 556 (the last page with keys in the specified range). Instead, the storage engine scans the intermediate index page and builds a list of the leaf pages that must be read. The storage engine then schedules all the reads in key order. The storage engine also recognizes that pages 504/505 and 527/528 are contiguous and performs a single scatter read to retrieve the adjacent pages in a single operation. When there are many pages to be retrieved in a serial operation, the storage engine schedules a block of reads at a time. When a subset of these reads is completed, the storage engine schedules an equal number of new reads until all the required reads have been scheduled.

The storage engine uses prefetching to speed base table lookups from nonclustered indexes. The leaf rows of a nonclustered index contain pointers to the data rows that contain each specific key value. As the storage engine reads through the leaf pages of the nonclustered index, it also starts scheduling asynchronous reads for the data rows whose pointers have already been retrieved. This allows the storage engine to retrieve data rows from the underlying table before it has completed the scan of the nonclustered index. Prefetching is used regardless of whether the table has a clustered index. SQL Server Enterprise uses more prefetching than other editions of SQL Server, allowing more pages to be read ahead. The level of prefetching is not configurable in any edition. For more information about nonclustered indexes, see Nonclustered Index Structures.

Advanced Scanning:

In SQL Server Enterprise, the advanced scan feature allows multiple tasks to share full table scans. If the execution plan of a Transact-SQL statement requires a scan of the data pages in a table and the Database Engine detects that the table is already being scanned for another execution plan, the Database Engine joins the second scan to the first, at the current location of the second scan. The Database Engine reads each page one time and passes the rows from each page to both execution plans. This continues until the end of the table is reached.

At that point, the first execution plan has the complete results of a scan, but the second execution plan must still retrieve the data pages that were read before it joined the in-progress scan. The scan for the second execution plan then wraps back to the first data page of the table and scans forward to where it joined the first scan. Any number of scans can be combined like this. The Database Engine will keep looping through the data pages until it has completed all the scans. This mechanism is also called “merry-go-round scanning” and demonstrates why the order of the results returned from a SELECT statement cannot be guaranteed without an ORDER BY clause.

For example, assume that you have a table with 500,000 pages. UserA executes a Transact-SQL statement that requires a scan of the table. When that scan has processed 100,000 pages, UserB executes another Transact-SQL statement that scans the same table. The Database Engine schedules one set of read requests for pages after 100,001, and passes the rows from each page back to both scans. When the scan reaches the 200,000th page, UserC executes another Transact-SQL statement that scans the same table. Starting with page 200,001, the Database Engine passes the rows from each page it reads back to all three scans. After it reads the 500,000th row, the scan for UserA is complete, and the scans for UserB and UserC wrap back and start to read the pages starting with page 1. When the Database Engine gets to page 100,000, the scan for UserB is completed. The scan for UserC then keeps going alone until it reads page 200,000. At this point, all the scans have been completed.

Without advanced scanning, each user would have to compete for buffer space and cause disk arm contention. The same pages would then be read once for each user, instead of read one time and shared by multiple users, slowing down performance and taxing resources.


From → DBA

Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: