In the last part I found that there is a new bottleneck. It seems this is related to the PAGEIOLATCH_SH and PAGEIOLATCH_EX. The exact values depend on the time slots which is measured by the ShowIOBottlenecks script. The picture shows >70 percent wait time.
To track down the latch contention wait events Microsoft provides a decent whitepaper. I used the following script and run it several times to get an idea which resources are blocked.
1 2 3 4 5 6 7 8 9 10 |
SELECT wt.session_id, wt.wait_type , er.last_wait_type AS last_wait_type , wt.wait_duration_ms , wt.blocking_session_id, wt.blocking_exec_context_id, resource_description FROM sys.dm_os_waiting_tasks wt JOIN sys.dm_exec_sessions es ON wt.session_id = es.session_id JOIN sys.dm_exec_requests er ON wt.session_id = er.session_id WHERE es.is_user_process = 1 AND wt.wait_type <> 'SLEEP_TASK' ORDER BY wt.wait_duration_ms desc |
The resource_description column returned by this script provides the resource description in the format <DatabaseID,FileID,PageID> where the name of the database associated with DatabaseID can be determined by passing the value of DatabaseID to the DB_NAME () function.
First lets find out which table this is. This can be done via inspecting the the page and retrieving the Metadata ObjectId.
1 2 3 4 |
--enable trace flag 3604 to enable console output dbcc traceon (3604) --examine the details of the page dbcc page (5,1, 240101, -1) |
The metadata objectid is 373576369. Now it is easy to retrieve the related table name.
It is the “warehouse” table.
What is the bottleneck here?
First of all this an explanation about the wait events:
PAGEIOLATCH_EX
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Exclusive mode. Long waits may indicate problems with the disk subsystem.
PAGEIOLATCH_SH
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Shared mode. Long waits may indicate problems with the disk subsystem
In our case this means a lot of inserts/updates are done when running the TPC-C workload and a task waits on a latch for this page shared or exclusive! When inspecting this page we know its the warehouse table and we created the database with 33 warehouses in the beginning.
The page size in SQL server is 8K and the 33 rows all fit just in one page (m_slotcnt =33). This means some operations can no be parallelized!!
To solve this I will change the “physical” design of this table which is still in-line with the TPC-C rules. There may be different ways to achieve this. I add a column and insert some text which forces SQL server to restructure the pages and then delete the column.
1 2 3 4 |
use tpcc; alter table dbo.warehouse add expandit NCHAR(3000); update dbo.warehouse set expandit = 'FILLWITHTEXT' ; alter table dbo.warehouse drop column expandit; |
Okay now check if the m_slotCnt is 1 which means every row is in one page.
It’s done.
When running the workload again the PAGEIOLATCH_SH and PAGELATCHIO_EX wait events are nearly gone.
Before:
- System achieved 338989 SQL Server TPM at 73685 NOPM
- System achieved 348164 SQL Server TPM at 75689 NOPM
- System achieved 336965 SQL Server TPM at 73206 NOPM
After:
- System achieved 386324 SQL Server TPM at 83941 NOPM
- System achieved 370919 SQL Server TPM at 80620 NOPM
- System achieved 366426 SQL Server TPM at 79726 NOPM
The workload increased slightly. Again I monitored that CPU is at 100% when running. At this point I could continue to tune the SQL statements as I did the last 2-3 posts. Remember IÂ started the SQL Server Performance Tuning with 20820 TPM at 4530 NOPM. This means more then 10x faster!
But the next step maybe to add some hardware. This all runs on just 2 of the 4 cores which are available as I wrote in the first part.
Go ChaosMonkey!