April 24, 2013

Optimizing transaction log performance

I recently stumbled on a great article by Paul S. Randal. He summarizes well what can be done to optimize transaction log performance. You can read his full post here.

This post is my follow-up to his post. Using examples, we'll take a closer look at the transaction log and the WRITELOG bottleneck. The latter part of the post examines ways to optimize a system suffering from a slow transaction log. But first, let's recap the problem we are trying to solve.

Transaction log bottleneck

Consider an OLTP system that is doing inserts at a somewhat constant rate. For one reason or another the system needs this new data as soon as it is available, so a staging table and bulk inserts cannot be used. We need to resort to the good old INSERT statement.

Now everything would be just fine if our transaction log and data files were located on sufficiently fast storage. But even with fast storage there is a limit to how many insert operations it can perform. Especially in OLTP systems it is easy to encounter situations where the transaction log becomes the system's bottleneck.

Test setup

To simulate a transaction log bottleneck we can create a database where the log is located on the same physical disk as the data. Performance-wise this is a very bad choice, but it is a simple way to do our testing. Placing the transaction log this way is actually the default on new SQL Server setups, and it is one of the first things to check when system monitoring indicates that there might be a performance problem with the transaction log.
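Before changing anything, it's worth confirming where the data and log files actually reside. A minimal check against sys.master_files; it references OltpDatabase, the test database created by the next script, so substitute your own database name if you run it elsewhere:

USE master

-- Where do the data and log files of the test database live?
-- (OltpDatabase is the test database created by the next script.)
SELECT name, type_desc, physical_name, size * 8 / 1024 AS size_mb
FROM sys.master_files
WHERE database_id = DB_ID('OltpDatabase')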

The following script creates a test database and one target table for inserts. I have added several non-clustered indexes to better reflect the fact that we are simulating an OLTP system. Indexes like these are often one of the causes of transaction log issues.

USE master

IF DB_ID('OltpDatabase') IS NOT NULL
    DROP DATABASE OltpDatabase

GO

CREATE DATABASE OltpDatabase

GO

USE OltpDatabase
CREATE TABLE Customer
(
    CustomerId      int IDENTITY,
    LastName        varchar(25) NOT NULL,
    FirstName       varchar(20) NOT NULL,
    Gender          bit NOT NULL,
    DateOfBirth     date NOT NULL,
    StreetAddress   varchar(48) NOT NULL,
    City            varchar(30) NOT NULL,
    PostalCode      varchar(10) NOT NULL,
    CountryCode     char(2) NOT NULL,
    PhoneNumber     varchar(16) NULL,
    Email           varchar(48) NULL
)

GO

CREATE UNIQUE CLUSTERED INDEX PK_Customer ON Customer (CustomerId)

GO

CREATE INDEX IX_Customer_Name ON Customer (LastName, FirstName)

GO

CREATE INDEX IX_Customer_DateOfBirth ON Customer (DateOfBirth)

GO

CREATE INDEX IX_Customer_Location ON Customer (CountryCode, City)

GO

CREATE INDEX IX_Customer_PhoneNumber ON Customer (PhoneNumber)

GO

CREATE INDEX IX_Customer_Email ON Customer (Email)

GO

CREATE PROCEDURE InsertTestData @NumberOfRows int
AS
BEGIN
    SET NOCOUNT ON

    WHILE @NumberOfRows > 0
    BEGIN
        INSERT INTO Customer 
        (
            LastName,
            FirstName,
            Gender,
            DateOfBirth,
            StreetAddress,
            City,
            PostalCode,
            CountryCode,
            PhoneNumber,
            Email
        )
        VALUES 
        (
            'LN' + CAST(CAST((10000 * RAND()) AS int) AS varchar(max)),
            'FN' + CAST(CAST((10000 * RAND()) AS int) AS varchar(max)),
            CAST((2 * RAND()) AS bit),
            DATEADD(day, CAST((-10000 * RAND()) AS int), GETUTCDATE()),
            'SA' + CAST(CAST((10000000 * RAND()) AS int) AS varchar(max)),
            'CI' + CAST(CAST((100 * RAND()) AS int) AS varchar(max)),
            'PC' + CAST(CAST((1000 * RAND()) AS int) AS varchar(max)),
            'C' + CAST(CAST((10 * RAND()) AS int) AS varchar(max)),
            'PN' + CAST(CAST((10000000 * RAND()) AS int) AS varchar(max)),
            'EM' + CAST(CAST((10000000 * RAND()) AS int) AS varchar(max))
        )

        SET @NumberOfRows -= 1
    END
END

GO

EXEC InsertTestData 100000
To drive the test I have created a procedure that inserts rows into the Customer test table. During initialization 100 000 rows are inserted into the test table, and another 100 000 rows are inserted during testing. This test runs in just under eleven minutes on my virtualized test system. That's a long time to wait for a hundred thousand single-row inserts. Surely there is room for improvement.
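If you want to reproduce the timing yourself, a minimal sketch is to capture timestamps around the call; the exact numbers will of course vary by system:

USE OltpDatabase

-- Time a single test run; results depend heavily on the storage underneath.
DECLARE @start datetime2 = SYSDATETIME()
EXEC InsertTestData 100000
SELECT DATEDIFF(second, @start, SYSDATETIME()) AS elapsed_seconds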

Wait statistics

The sys.dm_os_wait_stats management view can be used to analyze a database's performance. This view contains SQL Server's wait statistics. If you are unfamiliar with performance monitoring using SQL Server wait stats, see this for more information.
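If you only want a quick look at a single wait type such as WRITELOG, a minimal query is enough; the full script further below gives a more complete picture:

-- Quick check of the WRITELOG wait type only.
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'WRITELOG'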

To analyze insert performance, the database must first be reinitialized so that the test results are comparable. This can be accomplished by running the same initialization script as before. Then the following script can be used to collect wait statistics for the insert test. Analyzing the results, we soon notice that the WRITELOG wait type dominates, which means that most of the time is spent waiting for the transaction log to flush to disk.

USE OltpDatabase

-- Clear Wait Stats 
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);

GO

-- Execute test
EXEC InsertTestData 100000

GO

-- Snapshot of SQL Server 2012 WaitStats. Based on work done by Glenn Berry and Paul S. Randal.
WITH waits
AS (SELECT wait_type,
    CAST(wait_time_ms / 1000. AS DECIMAL(12, 2)) AS [wait_time_s],
    CAST((wait_time_ms - signal_wait_time_ms) / 1000.0 AS DECIMAL(12, 2)) AS [resource_wait_time_s],
    CAST(signal_wait_time_ms / 1000. AS DECIMAL(12, 2)) AS [signal_wait_time_s],
    waiting_tasks_count AS [wait_count],
    CAST(100. * wait_time_ms / SUM(wait_time_ms) OVER () AS decimal(12,2)) AS [percentage],
    ROW_NUMBER() OVER (ORDER BY wait_time_ms DESC) AS [row_number]
    FROM sys.dm_os_wait_stats WITH (NOLOCK)
    WHERE wait_type NOT IN (N'CLR_SEMAPHORE', N'LAZYWRITER_SLEEP', N'RESOURCE_QUEUE',N'SLEEP_TASK',
                            N'SLEEP_SYSTEMTASK', N'SQLTRACE_BUFFER_FLUSH', N'WAITFOR', N'LOGMGR_QUEUE',
                            N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH', N'XE_TIMER_EVENT',
                            N'BROKER_TO_FLUSH', N'BROKER_TASK_STOP', N'CLR_MANUAL_EVENT', N'CLR_AUTO_EVENT',
                            N'DISPATCHER_QUEUE_SEMAPHORE' ,N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'XE_DISPATCHER_WAIT',
                            N'XE_DISPATCHER_JOIN', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', N'ONDEMAND_TASK_QUEUE',
                            N'BROKER_EVENTHANDLER', N'SLEEP_BPOOL_FLUSH', N'SLEEP_DBSTARTUP', N'DIRTY_PAGE_POLL',
                            N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',N'SP_SERVER_DIAGNOSTICS_SLEEP')),
running_waits 
AS (SELECT *,
    SUM([percentage]) OVER(ORDER BY [percentage] DESC ROWS UNBOUNDED PRECEDING) AS [running_percentage]
    FROM waits)
SELECT 
    wait_type,
    percentage,
    wait_time_s,
    resource_wait_time_s,
    signal_wait_time_s,
    wait_count,
    CAST(wait_time_s / wait_count AS DECIMAL (14, 4)) AS avg_wait_time_s,
    CAST(resource_wait_time_s / wait_count AS DECIMAL (14, 4)) AS avg_resource_wait_time_s,
    CAST(signal_wait_time_s / wait_count AS DECIMAL (14, 4)) AS avg_signal_wait_time_s
FROM running_waits
WHERE [running_percentage] - [percentage] <= 99 -- Running total percentage threshold
ORDER BY [running_percentage]
OPTION (RECOMPILE);

Virtual file statistics

At this point it's a good idea to check file statistics. Using the sys.dm_io_virtual_file_stats function we can retrieve statistics for all database files. The log file always has file_id 2, but it can also be identified by the physical name of the file. The following script first collects baseline information, then executes the insert test, and finally outputs the virtual file statistics for the test.

USE OltpDatabase

-- Based on the work done by Jimmy May and Paul S. Randal.
-- http://www.sqlskills.com/blogs/paul/how-to-examine-io-subsystem-latencies-from-within-sql-server/

-- Collect baseline information
DECLARE @original_vfs TABLE (file_id int, num_of_reads bigint, num_of_writes bigint, io_stall_read_ms bigint, io_stall_write_ms bigint, num_of_bytes_read bigint, num_of_bytes_written bigint)
INSERT @original_vfs
SELECT
    file_id,
    num_of_reads,
    num_of_writes,
    io_stall_read_ms,
    io_stall_write_ms,
    num_of_bytes_read,
    num_of_bytes_written
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL)

-- Execute test
EXEC InsertTestData 100000

-- Show changes compared to baseline
SELECT 
    diff.file_id,
    CASE WHEN num_of_reads = 0 THEN 0 ELSE io_stall_read_ms / num_of_reads END AS [avg_read_ms],
    CASE WHEN num_of_writes = 0 THEN 0 ELSE io_stall_write_ms / num_of_writes END AS [avg_write_ms],
    num_of_reads,
    num_of_bytes_read,
    CASE WHEN num_of_reads = 0 THEN 0 ELSE num_of_bytes_read / num_of_reads END AS [avg_bytes_per_read],
    io_stall_read_ms AS [total_read_waits_ms],
    num_of_writes,
    num_of_bytes_written,
    CASE WHEN num_of_reads = 0 THEN 0 ELSE num_of_bytes_written / num_of_writes END AS [avg_bytes_per_written],
    io_stall_write_ms AS [total_write_waits_ms],
    physical_name
FROM (
    SELECT
        vfs.database_id,
        vfs.file_id,
        vfs.num_of_reads - org.num_of_reads as num_of_reads,
        vfs.num_of_writes - org.num_of_writes as num_of_writes,
        vfs.io_stall_read_ms - org.io_stall_read_ms as io_stall_read_ms,
        vfs.io_stall_write_ms - org.io_stall_write_ms as io_stall_write_ms,
        vfs.num_of_bytes_read - org.num_of_bytes_read as num_of_bytes_read,
        vfs.num_of_bytes_written - org.num_of_bytes_written as num_of_bytes_written
    FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) AS vfs
    INNER JOIN @original_vfs AS org ON vfs.file_id = org.file_id) diff
INNER JOIN sys.master_files AS mf ON diff.database_id = mf.database_id AND diff.file_id = mf.file_id
ORDER BY diff.file_id
From the returned statistics we can see that transaction log writes are quite slow, 5 ms on average; typically in production you should see write latencies of 1 ms or less. We can also observe that the number of transaction log flushes in the previous test is a little over 100 000, which matches the number of insert operations performed, so the 5 ms average write latency adds up to a total of roughly 500 seconds. So what else is happening inside the transaction log during the tests?

Transaction log contents

To dwell further down the path to enlightenment we need to examine the contents of the transaction log. This we can accomplish with the undocumented fn_dblog function, which returns one row for each entry in the database's transaction log. Note that in the simple recovery model the transaction log is truncated at each checkpoint, and the function returns only the transaction log entries made after the last truncation. To clear all entries that are not related to the test, we run a manual checkpoint. To get all entries from the test, we block transaction log truncation by starting a new transaction in a second connection. This ensures that all the transaction log entries we are interested in will be returned by fn_dblog. Next we'll reinitialize the database and block transaction log truncation.

-- Block transaction log truncation.
-- Execute in second query window.
USE OltpDatabase
BEGIN TRAN
EXEC InsertTestData 1
--ROLLBACK

Now that transaction log truncation is blocked, we can run the actual test to see what's happening in the transaction log.

USE OltpDatabase

-- Force checkpoint to clear the transaction log (simple recovery model)
CHECKPOINT

-- Execute test
EXEC InsertTestData 100000

-- Display summary of transaction log contents
SELECT 
    Operation, 
    Context, 
    COUNT(*) AS RecordCount, 
    SUM([Log Record Length] + [Log Reserve]) AS TotalLength, 
    AVG([Log Record Length] + [Log Reserve]) AS AverageLength,
    SUM([Log Record Length]) AS TotalRecordLength, 
    AVG([Log Record Length]) AS AverageRecordLength,
    SUM([Log Reserve]) AS TotalRollbackReserve, 
    AVG([Log Reserve]) AS AverageRollbackReserve
FROM fn_dblog(NULL, NULL)
GROUP BY Operation, Context
UNION ALL
SELECT 
    'Total:', 
    NULL, 
    COUNT(*) AS RecordCount, 
    SUM([Log Record Length] + [Log Reserve]) AS TotalLength, 
    AVG([Log Record Length] + [Log Reserve]) AS AverageLength,
    SUM([Log Record Length]) AS TotalRecordLength, 
    AVG([Log Record Length]) AS AverageRecordLength,
    SUM([Log Reserve]) AS TotalRollbackReserve, 
    AVG([Log Reserve]) AS AverageRollbackReserve
FROM fn_dblog(NULL, NULL)
ORDER BY TotalLength DESC

-- Closer examination of page splits
SELECT Context, COUNT(*) AS NumberOfSplits, AllocUnitName
FROM fn_dblog(NULL, NULL)
WHERE Operation = 'LOP_DELETE_SPLIT'
GROUP BY Context, AllocUnitName
ORDER BY NumberOfSplits DESC 
The previous script returns only a summary of the transaction log contents. You may want to see the actual contents of the log by querying fn_dblog directly; there is a lot of information there that helps you understand how the transaction log works. For each transaction log operation type, the summary returned by the script contains the number of records, the total and average space required, the total and average record length, and the total and average space reserved for rollback.
As you can see from the results, there are approximately 100 000 LOP_INSERT_ROWS operations on clustered index pages and 500 000 on non-clustered index pages. These match the actual physical insert operations done by SQL Server. The few extra inserts are related to page splits.
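If you want to browse the raw entries rather than the summary, a minimal query against fn_dblog is enough (the TOP clause just keeps the result set manageable):

-- Browse individual transaction log records.
SELECT TOP (100)
    [Current LSN],
    Operation,
    Context,
    [Transaction ID],
    AllocUnitName,
    [Log Record Length]
FROM fn_dblog(NULL, NULL)
ORDER BY [Current LSN]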

Page splits can be identified in the fn_dblog output by the operation name LOP_DELETE_SPLIT. As you can see from the previous results, there were a few thousand page splits. Compared to the total of 1 GB of transaction log entries, the 250 kB of LOP_DELETE_SPLIT entries seems small. So why does everyone say that page splits are bad?

Although we can identify page splits by their LOP_DELETE_SPLIT entries, a page split actually consists of multiple transaction log entries. Here is a rough example of how an index leaf page is split:
  1. LOP_BEGIN_XACT - System internal transaction is started
  2. LOP_INSYSXACT/LCX_INDEX_INTERIOR - Parent index page is marked as being part of system transaction
  3. LOP_INSYSXACT/LCX_INDEX_LEAF - Original index leaf page is marked as being part of system transaction
  4. LOP_INSYSXACT/LCX_INDEX_LEAF - Next index leaf page is marked as being part of system transaction
  5. LOP_FORMAT_PAGE/LCX_INDEX_LEAF - New data page is zeroed
  6. LOP_MODIFY_ROW/LCX_PFS - New index leaf page is marked as allocated in Page Free Space (PFS)
  7. LOP_HOBT_DELTA - Page count of the table is incremented
  8. LOP_FORMAT_PAGE/LCX_INDEX_LEAF - New data page is formatted as index leaf page
  9. LOP_INSERT_ROWS/LCX_INDEX_LEAF - Second half of the rows is inserted on the newly allocated page
  10. LOP_DELETE_SPLIT/LCX_INDEX_LEAF - Second half of the rows is removed from the original index page
  11. LOP_MODIFY_HEADER/LCX_INDEX_LEAF - Next-page pointer of the original index leaf page is modified to point to the newly allocated page
  12. LOP_MODIFY_HEADER/LCX_INDEX_LEAF - Previous-page pointer of the next index leaf page is modified to point to the newly allocated page
  13. LOP_INSERT_ROWS/LCX_INDEX_INTERIOR - Split key is inserted on parent index page
  14. LOP_MIGRATE_LOCKS - New index leaf page locks are registered as part of user transaction
  15. LOP_MIGRATE_LOCKS - New index leaf page locks are registered as part of user transaction
  16. LOP_MIGRATE_LOCKS - New index leaf page locks are registered as part of user transaction
  17. LOP_INSYSXACT/LCX_INDEX_LEAF - Allocated index page is released from system transaction
  18. LOP_INSYSXACT/LCX_INDEX_LEAF - Next index leaf page is released from system transaction
  19. LOP_INSYSXACT/LCX_INDEX_LEAF - Original index leaf page released from system transaction
  20. LOP_INSYSXACT/LCX_INDEX_INTERIOR - Parent index page is released from system transaction
  21. LOP_COMMIT_XACT - System internal transaction is committed
So there is quite a lot going on, and this was only the simplest case, where neither extent allocation nor parent page splitting was necessary. Paul S. Randal examined the transaction log space usage of inserts with and without page splits in his blog post and discovered that a page split can consume more than 40 times the log space of an insert that does not cause a split.
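If you want to see this sequence in your own log, one way is to pick a single split transaction and list its records. In my tests the split system transactions show up in fn_dblog with [Transaction Name] = 'SplitPage'; treat that name as an assumption and adjust it if your build labels them differently. A sketch:

-- List the log records of one page-split system transaction.
-- Assumes split transactions are named 'SplitPage' in the fn_dblog output.
DECLARE @split_xact nvarchar(20) =
    (SELECT TOP (1) [Transaction ID]
     FROM fn_dblog(NULL, NULL)
     WHERE [Transaction Name] = 'SplitPage')

SELECT [Current LSN], Operation, Context, AllocUnitName, [Log Record Length]
FROM fn_dblog(NULL, NULL)
WHERE [Transaction ID] = @split_xact
ORDER BY [Current LSN]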

Transaction log optimization

Now that we can monitor the transaction log, we can begin investigating how to optimize its performance.

Hardware and storage level optimization

The most straightforward solution would be to spend some money on new hardware, but there are also other things you can do at the hardware and storage level to improve transaction log performance. Kimberly L. Tripp wrote a great post on this topic. She lists eight optimizations:
  • Move the transaction log to a dedicated / faster disk.
  • Defrag transaction log disk to solve external fragmentation.
  • Make sure you have only one transaction log file.
  • Use RAID 1+0 or 0+1 to store transaction log if possible.
  • Make sure that the transaction log is always available.
  • Preallocate enough space for transaction and data files.
  • Autogrow transaction log file with fixed amount e.g. 1GB.
  • Check that number of VLFs (Virtual Log Files) is reasonable.
See the details in her post here. I think that sums it up quite nicely. Just remember that WRITELOG problems are typically latency problems, not throughput problems: buying a big-dollar SAN won't solve the problem if your network is slow or contended. A quick way to check the VLF count from the last item is shown below.
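DBCC LOGINFO returns one row per virtual log file for the current database, so the row count is the VLF count. As a rough rule of thumb, a few dozen VLFs is fine; thousands indicate that the log has grown in too-small increments:

USE OltpDatabase

-- One row per virtual log file; the row count is the VLF count.
DBCC LOGINFO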

Identifying and removing unnecessary indexes

Let's now focus more on the logical side of the problem, namely database design and index choices. When a row is inserted into a table, SQL Server needs to modify every non-clustered index of the table in addition to the modifications made to the table's clustered index or heap. Inserts into a table with many indexes will be many times slower than inserts into a table without any additional indexes. The obvious way to improve insert performance is to remove unnecessary indexes from the table. Fewer indexes mean fewer physical index modifications and therefore fewer page splits.

The sys.dm_db_index_usage_stats management view can be used to identify unused and underused indexes. I won't go into the details now, but you can find more information, for example, in this post. Another thing to remember is to check whether there are duplicate or redundant indexes whose queries could be satisfied by other indexes; these are also candidates for removal. For more information see this series of posts.
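As a starting point, here is a sketch of the kind of query that finds non-clustered indexes on the test table that are written to but never read. The thresholds and interpretation are up to you, and keep in mind that usage statistics are reset when the instance restarts:

USE OltpDatabase

-- Non-clustered indexes with writes but no reads since the last instance restart.
SELECT
    OBJECT_NAME(i.[object_id]) AS table_name,
    i.name AS index_name,
    ISNULL(us.user_updates, 0) AS user_updates,
    ISNULL(us.user_seeks, 0) AS user_seeks,
    ISNULL(us.user_scans, 0) AS user_scans,
    ISNULL(us.user_lookups, 0) AS user_lookups
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
    ON us.[object_id] = i.[object_id] AND us.index_id = i.index_id AND us.database_id = DB_ID()
WHERE i.[object_id] = OBJECT_ID('dbo.Customer')
  AND i.index_id > 1 -- skip heap and clustered index
  AND ISNULL(us.user_seeks, 0) + ISNULL(us.user_scans, 0) + ISNULL(us.user_lookups, 0) = 0
ORDER BY i.name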

Rebuilding indexes

We already identified that page splits can be costly, so they should be avoided. We can reduce the number of page splits by preallocating empty space in the indexes for the rows that will be inserted. Thanks to the free space, new rows can be inserted on the leaf pages of an index without causing rows to spill onto other pages. With enough free space in the indexes we can reduce the number of page splits significantly. Of course, more free space must be made in the indexes periodically, but doing that during off hours saves resources at peak time.

Let's now reinitialize the database and insert our test rows. If we look at the indexes using the sys.dm_db_index_physical_stats function, we can see that almost all pages of the table's non-clustered indexes have become fragmented. Fragmented indexes cause increased I/O. Ideally none of the pages would be fragmented, which is almost the case for the table's clustered index because it is filled sequentially using an IDENTITY column. Even though every index page is only 65% full on average, pages are still splitting, because some pages may be almost full while others are only half full. This is called internal index fragmentation. Read more about internal and external index fragmentation and how it can affect performance in Brent Ozar's post.

USE OltpDatabase
SELECT
    OBJECT_SCHEMA_NAME(ps.OBJECT_ID) AS [Schema Name], 
    OBJECT_NAME(ps.OBJECT_ID) AS [Object Name], 
    i.name AS [Index Name],
    ps.index_id,
    ps.index_type_desc,
    ps.avg_fragmentation_in_percent, 
    ps.fragment_count,
    ps.page_count,
    ps.record_count,
    ps.avg_record_size_in_bytes,
    ps.avg_page_space_used_in_percent,
    i.fill_factor,
    i.has_filter,
    i.filter_definition
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL ,'DETAILED') AS ps
INNER JOIN sys.indexes AS i WITH (NOLOCK) ON ps.[object_id] = i.[object_id] AND ps.index_id = i.index_id
WHERE database_id = DB_ID() AND (index_level = 0) -- Only leaf level
ORDER BY 1, 2, 3 OPTION (RECOMPILE);
Internal index fragmentation can be fixed by rebuilding or reorganizing indexes. It is a good idea to do this periodically to keep your database in top shape. Remember that using a fill factor of 100 (the default of 0 also means 100) can really hurt insert performance, because every insert into a full page will cause a page split. Because we know that we have 100 000 rows and will be inserting another 100 000 rows in our tests, we can calculate that the fill factor should be less than 50%; otherwise we cannot even theoretically fit all the new rows into the empty space in the indexes. Because we insert rows at random, and some pages get more hits than others, we need some extra spare space to avoid page splits. For the next test we'll rebuild our indexes with a fill factor of 40%.

USE OltpDatabase

-- Reindex with fill factor of 40%
ALTER INDEX ALL ON dbo.Customer REBUILD WITH (FILLFACTOR = 40);
Now when we run the test again and examine the transaction log contents, we can see that the number of page splits has decreased significantly. We have also shaved roughly 5% off the number of transaction log records written and the transaction log space used.

If we now look at the wait statistics, we can see that the WRITELOG wait type still dominates, but the actual wait time is now only 90 seconds. The total execution time of the test has dropped from the original 651 seconds to 97 seconds. That's more than a sixfold improvement with minimal modifications.

The improvement can also be seen in the virtual file statistics: transaction log waits have plummeted from the original 535 seconds to 83 seconds.

Summary

While reading this post you have learned how to monitor system waits using the sys.dm_os_wait_stats management view, how to view file operation statistics, how to dig into the transaction log contents using the fn_dblog function, and how to monitor internal fragmentation of indexes. I have briefly discussed hardware and storage level optimizations as well as database design issues that can affect transaction log performance. Finally, I have shown that page splits can be reduced by allocating free space during index rebuilds. I hope these tools will be of value to you in your everyday work.

In my next post I'll show you that choosing a fill factor for an index is not a matter of magic. I'll introduce an adaptive fill factor scheme and compare its performance against various industry rule-of-thumb fill factors.

Show your support by commenting or by tweeting. I'll answer any questions as well as I can.

You can download the scripts used in this post here:

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

3 comments:

  1. Hi Tomi,

    thank you for this really great written article which demonstrates in detail the effects of page splits. That's a task I wanted to write in my blog several times :)

    Just one "idea". I think it isn't the correct way to configure the fillfactor for EVERY index but I suppose it is for demonstration only.

    Fillfactor is - from my point of view - a value which has to be determined for every index separately.

    Once again - thank you for such a great article.
    Will you allow me to use your example data for my - german - blog?

    1. Hi Uwe,

      You are absolutely right that fill factors should be considered for each index separately. Example in this post was oversimplified to keep the post a little bit shorter. I am going to explain fill factor behaviour in more detail in my next post.

      You are very welcome to use the example data of this post in your blog. Just add link back to this post. Let me know if you need more details.

  2. Tomi,

    thank you for sharing the example codings.
    As soon as I have written my blog entry (don't know when) I will inform you personally.

    For sure will I reference to your article...

    Best, Uwe
