Category Archives: Storage Performance

8. SQL Server Performance Tuning study with HammerDB – Solve PAGEIOLATCH latch contention

In the last part I found a new bottleneck. It seems to be related to the PAGEIOLATCH_SH and PAGEIOLATCH_EX wait events. The exact values depend on the time slot measured by the ShowIOBottlenecks script. The picture shows more than 70 percent wait time.

PagelatchIO

To track down the latch contention wait events Microsoft provides a decent whitepaper. I used the following script and ran it several times to get an idea of which resources are blocked.

PagelatchSH_5.1.240101

The resource_description column returned by this script provides the resource description in the format <DatabaseID,FileID,PageID>, where the name of the database associated with DatabaseID can be determined by passing the value of DatabaseID to the DB_NAME() function.
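The whitepaper's script essentially polls sys.dm_os_waiting_tasks. A minimal sketch of that idea (not the exact script; it assumes the usual <dbid>:<fileid>:<pageid> form of resource_description):

```sql
-- Sample the tasks currently waiting on PAGEIOLATCH latches and resolve
-- the database name from the first part of resource_description.
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.resource_description,
       DB_NAME(LEFT(wt.resource_description,
                    CHARINDEX(':', wt.resource_description) - 1)) AS database_name
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.wait_type LIKE 'PAGEIOLATCH%';
```

Running this repeatedly while the workload is active shows which pages keep coming up.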

First let's find out which table this is. This can be done by inspecting the page and retrieving the Metadata ObjectId.

dbccpage

The Metadata ObjectId is 373576369. Now it is easy to retrieve the related table name.


warehouse_tablename

It is the “warehouse” table.
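Both lookup steps can be sketched in T-SQL; the page id below is a placeholder, take DatabaseID/FileID/PageID from the resource_description column:

```sql
DBCC TRACEON (3604);              -- route DBCC output to the client session
DBCC PAGE ('tpcc', 1, 240101, 3); -- <database>, <FileID>, <PageID>, print option
-- the page header contains a line like: Metadata: ObjectId = 373576369

SELECT OBJECT_NAME(373576369) AS table_name; -- resolves to 'warehouse' here
```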

What is the bottleneck here?

dbccpage

First of all, here is an explanation of the wait events:

PAGEIOLATCH_EX
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Exclusive mode. Long waits may indicate problems with the disk subsystem.

PAGEIOLATCH_SH
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Shared mode. Long waits may indicate problems with the disk subsystem.
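To put these waits in relation to the others, the cumulative numbers can be pulled from sys.dm_os_wait_stats:

```sql
-- Cumulative wait statistics since instance start (or since last cleared)
SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('PAGEIOLATCH_SH', 'PAGEIOLATCH_EX')
ORDER BY wait_time_ms DESC;
```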

In our case this means a lot of inserts/updates are performed while running the TPC-C workload, and tasks wait on a latch for this page in shared or exclusive mode. From inspecting the page we know it is the warehouse table, and we created the database with 33 warehouses in the beginning.

The page size in SQL Server is 8K, and all 33 rows fit into a single page (m_slotCnt = 33). This means some operations cannot be parallelized!

To solve this I will change the “physical” design of this table, which is still in line with the TPC-C rules. There may be different ways to achieve this: I add a column, insert some text to force SQL Server to restructure the pages, and then delete the column.

add_drop
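The add/drop shown in the screenshot can be sketched like this; the column name and the padding size are my choice, the point is that only one padded row fits per 8K page:

```sql
-- Pad each of the 33 warehouse rows so only one row fits per 8K page,
-- then remove the padding again.
ALTER TABLE warehouse ADD page_filler CHAR(7000) NULL;
UPDATE warehouse SET page_filler = 'x';  -- forces page splits: 1 row per page
ALTER TABLE warehouse DROP COLUMN page_filler;
-- note: do not rebuild the table or its clustered index afterwards,
-- a rebuild would compact the rows back onto a single page
```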

Okay, now check if m_slotCnt is 1, which means every row sits in its own page.

dbccpage_new

It’s done.

ShowIOBottleneck_solvedWarehouse

When running the workload again the PAGEIOLATCH_SH and PAGEIOLATCH_EX wait events are nearly gone.

Before:

  • System achieved 338989 SQL Server TPM at 73685 NOPM
  • System achieved 348164 SQL Server TPM at 75689 NOPM
  • System achieved 336965 SQL Server TPM at 73206 NOPM

After:

  • System achieved 386324 SQL Server TPM at 83941 NOPM
  • System achieved 370919 SQL Server TPM at 80620 NOPM
  • System achieved 366426 SQL Server TPM at 79726 NOPM

The workload increased slightly. Again I monitored that the CPU is at 100% while running. At this point I could continue to tune the SQL statements as I did in the last 2-3 posts. Remember, I started this SQL Server performance tuning with 20820 TPM at 4530 NOPM. This means more than 10x faster!

But the next step may be to add some hardware. This all runs on just 2 of the 4 available cores, as I wrote in the first part.

Go ChaosMonkey!

FIO (Flexible I/O Tester) Part9 – fio2gnuplot to visualize the output

The Linux build of “fio” ships with a tool called fio2gnuplot. This tool renders the output files of “fio” and uses gnuplot to generate nice graphics. Gnuplot is a portable, command-line driven graphing utility which is freely distributed.

Example shows distribution of IOPS with different block sizes and different Read/Write Mix:

PX600-1000-IOPS-mes3DPlt

Requirements

I am using “fio” 2.2.10, which was released on 12.09.2015.

Since 2.1.2 fio2gnuplot is part of the “fio” release. To generate the graphics you need to install gnuplot.

How to generate the log files?

There are some “fio” options to generate log files.

  • write_bw_log=<Filename>
  • write_iops_log=<Filename>
  • write_lat_log=<Filename>
  • per_job_logs=0/1 (available since 2.2.8, so not in the Windows build as of 16.09.2015)

write_bw_log generates a log file with the bandwidth details of the job, and so on for the other options. If you don’t set per_job_logs=0, one file is written per thread (numjobs=X). Most of the time this is not what you want, because you would like to generate graphics based on all threads. An issue I found is that the default patterns of fio2gnuplot (-b / -i) will not work in that case, because they search for the file endings *_bw.log and *_iops.log, but the files end with *_bw.X.log and *_iops.X.log. It should be fixed with this commit.
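A job file combining these options might look like the following sketch (device path, numjobs and runtime are examples, not the values used in this post):

```ini
; example job file - adjust filename/runtime to your environment
[global]
ioengine=libaio
direct=1
runtime=60
time_based
numjobs=4
per_job_logs=0          ; aggregate all threads into one log file per option

[fio-test]
filename=/dev/sdb
rw=randrw
bs=4k
write_bw_log=fio-test
write_iops_log=fio-test
write_lat_log=fio-test
```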

If per_job_logs=0 is set and all log file options have been set:

  • write_bw_log=fio-test
  • write_iops_log=fio-test
  • write_lat_log=fio-test

then 5 files will be generated (write_lat_log produces three files: total, submission and completion latency):

  • fio-test_bw.log
  • fio-test_iops.log
  • fio-test_lat.log
  • fio-test_slat.log
  • fio-test_clat.log

What does a log file look like?

The fourth column, 4096, is the block size in bytes (4K). The second column is the bandwidth in KB/s. I believe the first column is the elapsed time in ms. The third column indicates the direction: 0 means the row is related to reads, 1 means writes.
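Since the screenshot is hard to copy, here is a hypothetical *_bw.log line and a small awk snippet that decodes it according to the column meanings above:

```shell
# hypothetical log line: time(ms), bandwidth(KB/s), direction, block size
printf '1000, 52000, 0, 4096\n' |
  awk -F', ' '{ printf "%dms: %s KB/s (%s, %dK blocks)\n",
                $1, $2, ($3 == 0 ? "read" : "write"), $4/1024 }'
# -> 1000ms: 52000 KB/s (read, 4K blocks)
```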

Using fio2gnuplot

fio2gnuplot works in two major phases. The first phase is to generate the input files for gnuplot and do some calculations based on the data, like the average, minimum and maximum.

Starting fio2gnuplot -b will search for all bandwidth files in the local directory and generate the input files for gnuplot. The option “-i” is the default pattern for IOPS files. There is no default pattern for latency.

fio2gnuplot_phase1

The second phase is to generate the graphics. The option “-g” is used for this. By default “-g” deletes the input files for gnuplot; the option “-k” can be used to keep these files for later editing. If you want to make changes to the output you can edit the gnuplot files, like the mygraph file.

fio2gnuplot_phase2

And this is the output of fio-test_bw-2Draw.png

fio-test_bw-2Draw

Using fio2gnuplot to compare files with the default pattern -b or -i

You can copy all log files into the same directory and call fio2gnuplot with the right pattern. I make use of “-b” for bandwidth comparisons.

fio2gnuplot_compare

And this is the output of compare-result-2Dsmooth.png

compare-result-2Dsmooth

Using fio2gnuplot to compare files with a custom pattern

Sometimes the default patterns will not work. For example, there is no pattern for the latency output. For this case you can specify your own pattern with the option “-p <pattern>” and use a title. WARNING: Using the pattern “*.log” will raise an error. I fixed this, so in the future this should work.

compare-result-pattern

And this is the output of compare-lat-2Dsmooth.png

compare-lat-2Dsmooth

Go Keepass2.

 

7. SQL Server Performance Tuning study with HammerDB – Flashsoft and PX600 unleash the full power

I solved all the bottlenecks found since we started this performance tuning study. But now I can’t find any improvements that can be made without altering the schema or indexes, which is not allowed by the TPC-C rules. It is a similar situation to running a third-party application with a database you are not allowed to change. A great solution to improve the disk latency is Flash-based caching, because it is transparent to the application vendor. The advantage of Flashsoft 3.7 is that it provides a READ and WRITE cache; the write cache is the one that should help with this OLTP workload. Remember, Flashsoft can cache FC, iSCSI, NFS and local devices.

Phase 3 – Forming a hypothesis – Part 5

  • Based on observation and declaration form a hypothesis
    • Based on observation and the lessons I learned, I believe the TPM/NOPM values should increase if the disk access latency is reduced with the use of a READ/WRITE cache (Flashsoft).

Phase 4 – Define an appropriate method to test the hypothesis

  • 4.1 don’t define too complex methods
  • 4.2 choose … for testing the hypothesis
    • the right workload
      • original workload
    • the right metrics
      • In this case I concentrate only on the TPM/NOPM values.
    • some metrics as key metrics
      • TPM/NOPM
    • the right level of details
    • an efficient approach in terms of time and results
      • Installing and configuring Flashsoft will take 30 min
    • a tool you fully understand
  • 4.3 document the defined method and setup a test plan

  I will run the following test plans:

Test plan 1

Implement a READ/WRITE cache for SQL Server based on Samsung 840 basic

Start HammerDB workload

Run ShowIOBottlenecks and Resource Monitor

Stop HammerDB workload and compare this run with the baseline

Analyze the ShowIOBottlenecks

Test plan 2

If ShowIOBottlenecks still shows wait events for disk latency I will use the PX600-1000 as READ/ WRITE cache device

Start HammerDB workload

Run ShowIOBottlenecks and Resource Monitor

Stop HammerDB workload and compare this run with the baseline

Phase 5 – Testing the hypothesis – Test Plan 1+2

  • 5.1 Run the test plan
  • avoid or don’t test if other workloads are running
  • run the test at least two times

I recorded a video when running the test plan 1.

  • System achieved 338486 SQL Server TPM at 73651 NOPM
  • System achieved 314510 SQL Server TPM at 68320 NOPM
  • System achieved 313778 SQL Server TPM at 68256 NOPM

The Resource Monitor showed that there is still latency of around 10ms, and ShowIOBottlenecks showed that there are wait events for WRITELOG. So I decided to use the PX600-1000.

I recorded a video when running the test plan 2.

  • System achieved 338989 SQL Server TPM at 73685 NOPM
  • System achieved 348164 SQL Server TPM at 75689 NOPM
  • System achieved 336965 SQL Server TPM at 73206 NOPM
  •  5.2 save the results
    • All results are saved into the log files

Phase 6 – Analysis of results – Test Plan 1+2

  • 6.2 Read and interpret all metrics
    • understand all metrics
    • compare metrics to basic/advanced baseline metrics
      • TEST RESULT Flashsoft Samsung 840:
        • System achieved avg(338486, 314510, 313778) = 322258 SQL Server TPM at avg(73651, 68320, 68256) = 70076 NOPM
      • TEST RESULT Flashsoft SanDisk PX600-1000:

        • System achieved avg(338989, 348164, 336965) = 341373 SQL Server TPM at avg(73685, 75689, 73206) = 74193 NOPM
      • TEST RESULT before:
        • System achieved 161882 SQL Server TPM at 35150 NOPM
    • has sensitivity analysis been done?
      • Just an approximation. There are so many variables, even in this simple environment, that a full sensitivity analysis would take too much time. The approximation shows that as long as I don’t make changes to the environment the results should be stable.
    • concentrate on key metrics
      • Using Flashsoft with the Samsung 840 Basic or the SanDisk PX600-1000, I could nearly double the performance compared to the last run.
    •  is the result statistically correct?
      • No. The selection was only one point in time. I repeated the test a few times with a similar result, but still no.
  • 6.3 Visualize your data
  • 6.4 “Strange” results mean you need to go back to “Phase 4.2 or 1.1”
    • nothing strange
  • 6.5 Present understandable graphics for your audience
    • Done.

Phase 7 – Conclusion

Is the goal or issue well defined? If not go back to  “Phase 1.1”

  • 7.1 Form a conclusion if and how the hypothesis achieved the goal or solved the issue!
    • The hypothesis is true. I doubled the performance by making use of the Flashsoft caching solution. I found only a small difference between the Samsung and the SanDisk drive, although the SanDisk PX600-1000 should be much faster than the consumer SSD. The reason seems to be a new bottleneck I found: Page Latch waits are involved!
  • 7.2 Next Step
    • Is the hypothesis true?
      • Yes
      • if goal/issue is not achieved/solved, form a new hypothesis.
        • I will form a new hypothesis in the next post of this series where I’ll track down the Page Latch wait events and solve them.

6. SQL Server Performance Tuning study with HammerDB – Database Engine Tuning Advisor

In Part 5 an issue with the cost estimator has been solved and the HammerDB workload runs much faster. But what to tune now? Let’s give the Database Engine Tuning Advisor a chance for this performance tuning.

Phase 3 – Forming a hypothesis – Part 4

  • Based on observation and declaration form a hypothesis
    • Based on observation and the lessons I learned I believe the TPM/NOPM values should increase if we further tune SQL statements with the help of the Database Engine Tuning Advisor

Phase 4 – Define an appropriate method to test the hypothesis

  • 4.1 don’t define too complex methods
  • 4.2 choose … for testing the hypothesis
    • the right workload
      • original workload
    • the right metrics
      • In this case I concentrate only on the TPM/NOPM values.
    • some metrics as key metrics
      • TPM/NOPM
    • the right level of details
    • an efficient approach in terms of time and results
      • Adding indexes may take 1h
    • a tool you fully understand
      • The Database Engine Tuning Advisor will be used to analyze the plan cache, and this tool will provide some advice on how to introduce indexes, partitions, etc.
  • 4.3 document the defined method and setup a test plan

  I will run the following test plan and analyze the results:

Test plan 1

Run the Database Engine Tuning Advisor and analyze the plan cache (last 1000 events)

Analyze the report

Maybe create indexes / partitions/ stats. etc

Start HammerDB workload

Stop HammerDB workload and compare this run with the baseline

Phase 5 – Testing the hypothesis – Test Plan 1

  • 5.1 Run the test plan

    • avoid or don’t test if other workloads are running
    • run the test at least two times

I recorded a video when running test plan 1. As long as I stay in line with the TPC-C rules, the only Database Engine Tuning Advisor optimization left is creating new statistics.

  •  5.2 save the results
    • All results are saved into the log files

Phase 6 – Analysis of results – Test Plan 1

  • 6.2 Read and interpret all metrics
    • understand all metrics
    • compare metrics to basic/advanced baseline metrics
      • TEST RESULT NOW: System achieved 161882 SQL Server TPM at 35150 NOPM
      • TEST RESULT BEFORE: System achieved 161091 SQL Server TPM at 35013 NOPM
    • has sensitivity analysis been done?
      • Just an approximation. There are so many variables, even in this simple environment, that a full sensitivity analysis would take too much time. The approximation shows that as long as I don’t make changes myself to the environment the results should be stable.
    • concentrate on key metrics
      • so no measurable changes
    •  is the result statistically correct?
      • No. The selection was only one point in time. I repeated the test a few times with a similar result, but still no.
  • 6.3 Visualize your data
  • 6.4 “Strange” results mean you need to go back to “Phase 4.2 or 1.1”
    • nothing strange
  • 6.5 Present understandable graphics for your audience
    • Done.

Phase 7 – Conclusion

Is the goal or issue well defined? If not go back to  “Phase 1.1”

  • 7.1 Form a conclusion if and how the hypothesis achieved the goal or solved the issue!
    • The hypothesis could not really be tested. The Database Engine Tuning Advisor provided changes which are not in line with the TPC-C rules, so we could not really tune the SQL statements. BUT when there are no more options left to tune, I will introduce new indexes.
  • 7.2 Next Step
    • Is the hypothesis true?
      • Not evaluated
      • if goal/issue is not achieved/solved, form a new hypothesis.
        • I will form a new hypothesis in the next post of this series. The end of the video showed that we should give Flashsoft another try to improve the disk latency.

5. SQL Server Performance Tuning study with HammerDB – TOP 10 most costly SQL statements

In the last post we found that the bottleneck does not seem to be related to the ACCESS_METHODS_DATASET_PARENT wait events. I learned a few things! Tuning the wait events should not be the first step in tuning a database. It has long been recommended to tune a database workload starting at the application and working down to the hardware. The reason is obvious: the performance gain from better or optimized hardware is normally in a range between 5% and 100%. Tuning one SQL statement may increase the overall performance 10x, 50x or maybe 1000x.

Phase 3 – Forming a hypothesis – Part 3

  • Based on observation and declaration form a hypothesis
    • Based on the observation that the bottleneck is related to CPU and workload, I believe: “The TPM/NOPM should increase if we improve the TOP 10 most costly SQL statements, or at least the most costly of all”

Phase 4 – Define an appropriate method to test the hypothesis

  • 4.1 don’t define too complex methods
  • 4.2 choose … for testing the hypothesis
    • the right workload
      • original workload
    • the right metrics
    • some metrics as key metrics
      • total_worker_time for the most costly SQL statement will be declared as the key metric because the workload seems to be bound by CPU
    • the right level of details
    • an efficient approach in terms of time and results
      • approx. 1 h
    • a tool you fully understand
      • Pinal Dave posted a nice script which I will use to list the TOP 10 most costly SQL statements and their execution plans. I order by total_worker_time.
  • 4.3 document the defined method and setup a test plan

  I will run the following test plan and analyze the results:

Test plan 1

Run the TOP10 script.

Identify and analyze the most costly SQL statement of all (make use of the execution plan analyzer)

Tune the discovered SQL statement

Start HammerDB workload

Run the TOP10 script and confirm if the most costly SQL statement of all is improved

Stop HammerDB workload and compare this run with the baseline

Phase 5 – Testing the hypothesis – Test Plan 1

  • 5.1 Run the test plan

    • avoid or don’t test if other workloads are running
    • run the test at least two times

I run the TOP10 Script:

SQLServer2014_TOP10_before
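The screenshot was produced with Pinal Dave's script linked above; a minimal sketch of the same idea, ordering the plan cache by total_worker_time, looks like this:

```sql
-- The 10 statements with the highest total CPU time from the plan cache
SELECT TOP 10
       qs.total_worker_time,
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                     WHEN -1 THEN DATALENGTH(st.text)
                     ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```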

Surprisingly, the most costly statement here is a select statement. This is strange because the OLTP workload should spend most of its time on updates/inserts.

After a short Internet search I found this blog post, which showed that this is related to changes in the SQL Server Query Optimizer cardinality estimation process.

So, add an index or change the Query Optimizer behavior? I decided to change the database compatibility level to the pre-SQL Server 2014 legacy CE. The right way would be to add an index or to use trace flag 9481, but these changes would not stay in line with the TPC-C rules!

The execution plan for the SQL Select looks like this before the changes with a table scan of the stock table which costs a lot!:

SQL Server2 014 Execution plan before

I changed the SQL Server Query Optimizer cardinality estimation for the tpcc database to pre-SQL Server 2014 behavior. After the change, sys.dm_exec_query_stats should be flushed.

SQL_Change_compatibily-level
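The change shown in the screenshot boils down to the following (110 is the SQL Server 2012 compatibility level, which uses the legacy cardinality estimator):

```sql
ALTER DATABASE tpcc SET COMPATIBILITY_LEVEL = 110;

-- clear the plan cache so sys.dm_exec_query_stats starts fresh
DBCC FREEPROCCACHE;
```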

I started the HammerDB workload again like shown in this video.

I run the TOP10 script:

SQLServer2014_TOP10_after

As you can see there are still two phases while running: one which shows CPU at nearly 90% with low disk access, and another with high (saturated) disk access and low CPU usage.

I stopped the HammerDB workload and compared this run with the baseline.

  •  5.2 save the results
    • All results are saved into the log files

Phase 6 – Analysis of results – Test Plan 1

  • 6.2 Read and interpret all metrics
    • understand all metrics
      • sys.dm_exec_query_stats shows the most costly SQL statements. The select which had been the issue is not at the top anymore:

      • compare metrics to basic/advanced baseline metrics
        • TEST RESULT NOW: System achieved avg(149811, 160772, 172690) = 161091 SQL Server TPM at avg(32541, 34909, 37590) = 35013 NOPM
        • TEST RESULT IN THE BEGINNING: System achieved 20820 SQL Server TPM at 4530 NOPM
      • is the result statistically correct?
        • No. The selection was only one point in time. I repeated the test a few times with a similar result, but still no.
      • has sensitivity analysis been done?
        • Just an approximation. There are so many variables, even in this simple environment, that a full sensitivity analysis would take too much time. The approximation shows that as long as I don’t make changes myself to the environment the results should be stable.
      • concentrate on key metrics
        • total_worker_time for the most costly SQL statement has been reduced to 13,339,391 compared to 1,286,386,037 before. The statement is not in the TOP 10 anymore.

           
  • 6.3 Visualize your data
    • The screen-shots will do the job.
  • 6.4 “Strange” results mean you need to go back to “Phase 4.2 or 1.1”
    • nothing strange
  • 6.5 Present understandable graphics for your audience
    • Done.

Phase 7 – Conclusion

Is the goal or issue well defined? If not go back to  “Phase 1.1”

  • 7.1 Form a conclusion if and how the hypothesis achieved the goal or solved the issue!
    • The hypothesis has been proven right. The reached TPM=161091 and NOPM=35013 show that solving this bottleneck, caused by the SQL Server Query Optimizer cardinality estimation, has a big influence. The performance increased around 7x!
  • 7.2 Next Step
    • Is the hypothesis true?
      • Yes. 
      • if goal/issue is not achieved/solved, form a new hypothesis.
        • I will form a new hypothesis in the next post of this series because I am pretty sure there is much more I can tune.

4. SQL Server Performance Tuning study with HammerDB – solving ACCESS_METHODS_DATASET_PARENT

In the last post we found that the bottleneck does not seem to be related to the disk access latency, so I disabled Flashsoft 3.7 for now. The next part in this performance tuning study will show how we solve the wait events for ACCESS_METHODS_DATASET_PARENT.

Phase 3 – Forming a hypothesis – Part 2

  • Based on observation and declaration form a hypothesis
    • Based on observation and the last ShowIOBottlenecks run I believe: “The TPM/NOPM should increase if the Intra Query Parallelism wait events (ACCESS_METHODS_DATASET_PARENT) are reduced/resolved”

Phase 4 – Define an appropriate method to test the hypothesis

  • 4.1 don’t define too complex methods
  • 4.2 choose … for testing the hypothesis
    • the right workload
      • original workload
    • the right metrics
    • some metrics as key metrics
    • the right level of details
    • an efficient approach in terms of time and results
      • approx. 10 min
    • a tool you fully understand
  • 4.3 document the defined method and setup a test plan

  I will run the following test plan and analyze the results:

Test plan 1

This post by sqlskills points out that for ACCESS_METHODS_DATASET_PARENT wait events I should investigate the MAXDOP setting.

I will check the Max Degree of Parallelism setting, which should be 0 by default, and set it to a value of 2, because this server has Hyper-Threading active and 2 active cores.

Microsoft shows how to set the Max Degree of Parallelism for this instance.

I will run ShowIOBottlenecks while the HammerDB workload is running with Max Degree of Parallelism set to 2.

Phase 5 – Testing the hypothesis – Test Plan 1

  • 5.1 Run the test plan

    • avoid or don’t test if other workloads are running
    • run the test at least two times

I checked the actual Max Degree of Parallelism with this query:

name                       minimum  maximum  config_value  run_value
max degree of parallelism  0        32767    0             0

Then I set the Max Degree of Parallelism to 2:
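The statement itself is not shown here; the standard way, per the linked Microsoft article, is sp_configure:

```sql
-- 'max degree of parallelism' is an advanced option
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 2;
RECONFIGURE WITH OVERRIDE;
```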

The output has been:

I started the HammerDB workload again like shown in this video.

  •  5.2 save the results
    • All results are saved into the log files

Phase 6 – Analysis of results – Test Plan 1

  • 6.2 Read and interpret all metrics
    • understand all metrics
      • The important metrics ShowIOBottlenecks showed that, from a wait event view, there are two main phases when running:
      • This one shows that CPU and threading seem to be the bottleneck and the disk is not saturated.
        SQLServer2014_MaxDegreePar_active_CPU
      • This one shows that WRITELOG I/O seems to be the bottleneck and the CPU still has ~30% wait time.
        SQLServer2014_MaxDegreePar_active_WriteLog
  • compare metrics to basic/advanced baseline metrics
    • TEST RESULT NOW : System achieved 20227 SQL Server TPM at 4443 NOPM (Best run)
    • TEST RESULT IN THE Beginning: System achieved 20820 SQL Server TPM at 4530 NOPM
  • is the result statistically correct?
    • No. The selection was only one point in time. I repeated the test a few times with a similar result, but still no.
  • has sensitivity analysis been done?
    • Just an approximation. There are so many variables, even in this simple environment, that a full sensitivity analysis would take too much time. The approximation shows that as long as I don’t make changes myself to the environment the results should be stable.
  • concentrate on key metrics
    • The ACCESS_METHODS_DATASET_PARENT wait still arises sometimes (<=10%) but it seems to be solved for now.
  • 6.3 Visualize your data
    • The screen-shots will do the job.
  • 6.4 “Strange” results mean you need to go back to “Phase 4.2 or 1.1”
    • nothing strange
  • 6.5 Present understandable graphics for your audience
    • Done.

Phase 7 – Conclusion

Is the goal or issue well defined? If not go back to  “Phase 1.1”

  • 7.1 Form a conclusion if and how the hypothesis achieved the goal or solved the issue!
    • The hypothesis has been proven wrong. The reached TPM=20170 and NOPM=4336 show that solving the ACCESS_METHODS_DATASET_PARENT wait seems to have nearly no influence at the moment.
    • The reasons why the hypothesis is wrong:
      • It seems we solved this bottleneck, but there is another issue which we have not discovered yet.
      • The setting MAXDOP = 2 will be kept in this case because it is recommended by Microsoft.
  • 7.2 Next Step
    • Is the hypothesis true?
      • No. 
      • if goal/issue is not achieved/solved, form a new hypothesis.
        • I will form a new hypothesis in the next post of this series. And believe me. The TPM and NOPM values will increase 🙂

3. SQL Server Performance Tuning study with HammerDB – Using 8PP – Part1

In the post 8PP I described a scientific approach to testing a system. In this series I will follow 8PP and showcase how it can be used to perform SQL Server performance tuning for this HammerDB workload.

The workload I will tune is as follows:

The HammerDB TPC-C autopilot workload runs a timed driver script three times with 25 virtual users, a ramp-up time of 1 min and a runtime of 5 min. The minutes per test are set to 8 min to give SQL Server a chance to settle after each run. The average TPM has been (20998, 20178, 21283) = 20820 and the NOPM (4602, 4380, 4609) = 4530. Watch the beginning of this video to understand how I set up the workload.

Phase 1 – Observation

1.1 Understand the problem/issue

  • Talk to all responsible people if possible
    • I am responsible for the whole setup.
  • Is the problem/issue based on a real workload?
    • No! It is always important to know whether the workload is based on a real workload, because some synthetic tests are not well chosen to test a system. This is your chance to avoid spending time on senseless testing.
  • Is the evaluation technique appropriate?
    • In this case I accept that this test is synthetic and consider it appropriate for my target.

1.2 Define your universe

  • If possible isolate the system as much as you can
    • The whole test is running on one PC. I could increase the isolation by uninstalling the unused 10GbE NICs and the management NIC which I am using for RDP access. I consider that this should have nearly zero impact on my testing.
  • Make sure to write down exactly how your system/environment is built
    • Firmware, OS, driver, app versions, etc…

1.3 Define and run basic baseline tests (CPU,MEM,NET,STORAGE)

  • Define the basic tests and run them while the application is stopped
    • These tests are done to make sure that there is no major bottleneck introduced by a badly configured hardware environment.
    • I used PerformanceTest 8.0 to check CPU and MEM.
      • CPU results
        • The CPU Mark reached only 5467. Remember only 2 out of 4 cores are active. I will repeat this test when all 4 cores are online again.
        • Passmark_comparicon
      • MEM results
        • The Memory Mark reached 2661 with a latency of 28.2ns. Compared to the baseline of 29ns it is configured correctly.
        • Passmark_comparicon_Mem
    • I skipped the NET test because it should not be involved in this testing.
    • I tested the SSD storage before:
  • Document the basic baseline tests
    • All basic baseline tests are documented here. The results seem to be in line with well-known tests from the Internet.
  • Compare to older basic baseline tests if any are available
    • There are no old basic baseline tests I could use to compare.

1.4 Describe the problem/issue in detail

  • Document the symptoms of the problem/issue
    • I started the HammerDB TPC-C autopilot workload, which runs a timed driver script three times with 25 virtual users, a ramp-up time of 1 min and a runtime of 5 min. The minutes per test are set to 8 min to give SQL Server a chance to settle after each run. The average TPM has been (20998, 20178, 21283) = 20820 and the NOPM (4602, 4380, 4609) = 4530.
  • Document the system behavior (CPU,MEM,NETWORK,Storage) while the problem/issue arise
    • While running the HammerDB workload I monitored the utilization of CPU, memory and network, which showed that the CPU is at ~100% load. There was no memory pressure and no traffic on the network. The disk showed a response time of up to ~30ms and up to ~5MB/sec.
    • For sure this documentation could be more detailed, but it should be okay for now.

Phase 2 – Declaration of the end goal or issue

  • Officially declare the goal or issue
    • The goal is to increase the TPM/NOPM numbers of the HammerDB TPC-C workload with 25 virtual users, a ramp-up of 1 min and a runtime of 5 min to the highest possible number. I will resolve all bottlenecks, whether hardware, software, SQL Server options or SQL schema, as long as the change is transparent to the HammerDB driver script. This means it is okay to add an index, or to make use of NAND flash or more CPUs, but it is not allowed to change the driver script itself, and it needs to run without errors.
  • Agree with all participants on this goal or issue
    • I agree with this goal 🙂 

Phase 3 – Forming a hypothesis – Part 1

  • Based on observation and declaration form a hypothesis
    • Based on observation 1.4 I believe: “The TPM/NOPM should increase if the access to the data on the disk gets faster”

Phase 4 – Define an appropriate method to test the hypothesis

  • 4.1 don’t define too complex methods
  • 4.2 choose … for testing the hypothesis
    • the right workload
      • original workload
    • the right metrics
      • latency of database files
      • wait event time of IO in relation to the whole wait time
      • read/write throughput
      • read/write distribution
    • some metrics as key metrics
      • wait event time of IOs in relation to the whole wait time
    • the right level of details
    • an efficient approach in terms of time and results
      • approx. 2h
    • a tool you fully understand
  • 4.3 document the defined method and setup a test plan

  I will run the following test plan and analyze the results:

Test plan 1

I will run ShowIOBottlenecks while the HammerDB workload is running.

Check whether there are significant wait events for disk access in relation to the whole wait time.

Test plan 2

I will run the HammerDB workload and use the Resource Monitor to monitor disk access.

If there are significant wait events for disk access I will further check whether they are related to READ or WRITE.

If related to READ I will make use of the ioTurbine Profiler, which shows whether a read cache could help here.

Test plan 3

Depending on the findings I will introduce a read/write cache with the help of Flashsoft 3.7 to cache the data files.

Again run ShowIOBottlenecks while the HammerDB workload is running.

Document my findings on this blog.

Phase 5 – Testing the hypothesis – Test Plan 1

  • 5.1 Run the test plan

    • avoid or don’t test if other workloads are running
    • run the test at least two times
  1. I recorded a short video to demonstrate how I used the ShowIOBottlenecks script to verify whether there is an I/O bottleneck.

  • 5.2 save the results
    • I saved the results of ShowIOBottlenecks to an xlsx file.

Phase 6 – Analysis of results – Test Plan 1

  • 6.2 Read and interpret all metrics
    • understand all metrics
      • The important metrics ShowIOBottlenecks showed are:
        • around 40% Intra Query Parallelism wait events for ACCESS_METHODS_DATASET_PARENT
        • around 25% CPU & Threading (CPU)
        • around 20% of the wait time spent on write log IO
        • The latency of the data files seems high: ~30 ms
    • compare metrics to basic/advanced baseline metrics
      • There is no baseline so far
    • is the result statistically correct?
      • No. The selection was only one point in time. I repeated the test a few times with similar results, but still no.
    • has sensitivity analysis been done?
      • Just an approximation. There are so many variables, even in this simple environment, that a full sensitivity analysis would take too much time. The approximation shows that as long as I don't make changes to the environment myself, the results should be stable.
    • concentrate on key metrics
      • wait event time of IO is 20% of the whole wait time
  • 6.3 Visualize your data
    • HammerDB_initial_autopilot_run3x25_wait_distribution
  • 6.4 “Strange” results means you need to go back to “Phase 4.2 or 1.1”

    • The high Intra Query Parallelism could be considered strange. It means that another bottleneck exists which I should solve first. But I will ignore it for now.
  • 6.5 Present understandable graphics for your audience
    • Done.
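The key metric in 6.2 is simple enough to compute yourself: take the aggregated wait times and divide each category by the total. A small sketch with made-up millisecond values that roughly match the distribution above (the real values come from ShowIOBottlenecks, not from this script):

```python
# Sketch: each wait category's share of the whole wait time, the way the
# ShowIOBottlenecks output is read above. The millisecond values are
# invented to roughly match the observed distribution (40/25/20 percent).
wait_ms = {
    "ACCESS_METHODS_DATASET_PARENT": 4000,  # intra query parallelism
    "SOS_SCHEDULER_YIELD": 2500,            # CPU & threading
    "WRITELOG": 2000,                       # write log IO
    "OTHER": 1500,
}

total_ms = sum(wait_ms.values())
share = {wt: round(100.0 * ms / total_ms, 1) for wt, ms in wait_ms.items()}
print(share["WRITELOG"])  # 20.0 -> write log IO is 20% of the whole wait time
```

The same division is all the script does per wait category; only the inputs differ per sampling window.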

Phase 5 – Testing the hypothesis – Test Plan 2

  • 5.1 Run the test plan
  1. I recorded a short video to demonstrate how I used the ShowIOBottlenecks script. At minutes 2:00 and 3:35 I showed how to display the read/write distribution with the Windows Resource Monitor.

  • 5.2 save the results
    • The read throughput has been ~10,000 B/sec
    • The write throughput has been ~2,752,000 B/sec

Phase 6 – Analysis of results – Test Plan 2

  • 6.2 Read and interpret all metrics
    • understand all metrics
      • read and write B/sec
    • compare metrics to basic/advanced baseline metrics
      • There is no baseline so far
    • is the result statistically correct?
      • No. The selection was only one point in time. I repeated the test a few times with similar results, but still no.
    • has sensitivity analysis been done?
      • Just an approximation. There are so many variables, even in this simple environment, that a full sensitivity analysis would take too much time. The approximation shows that as long as I don't make changes to the environment myself, the results should be stable.
    • concentrate on key metrics
      • The read B/sec is ~2.8% of all disk B/sec. So it does not look like a read cache will help.
      • Just for showcasing the ioTurbine Profiler:
        • I started the HammerDB workload again and monitored the D:\ device. I ignored 4.2 (time) in this case, which should not be done in real environments.
  • 6.3 Visualize your data
    • read_write_distribution
    • watch the ioTurbine Profiler video!
    • https://www.youtube.com/watch?v=O_eDnnIZ6wo
  • 6.4 “Strange” results means you need to go back to “Phase 4.2 or 1.1”

    • Nothing strange. It’s an OLTP workload so there should be more writes than reads.
  • 6.5 Present understandable graphics for your audience
    • See 6.3.

Phase 5 – Testing the hypothesis – Test Plan 3

  • 5.1 Run the test plan
  • The analysis showed that faster writes could make sense and reduce the wait events for the write log, which account for up to 20% of the whole wait time. There are different options to achieve faster writes for the log:
    • Introduce a faster storage for the log file
    • Introduce a write cache for the log file
    • Tune the log file flush mechanisms and log file structures
    • Tune the application to avoid too many commits.
  • As defined, we are not able to tune the application. We could try to tune the log file itself, but the best bet should be faster storage or a write cache, because we know the D:\ is a slow HDD. In this case I will start with the write cache. The reason is that a write cache can be used transparently for a single device, a logical device or a pool, and for block devices provided via a SAN, even when using a shared file system like CSV or VMFS.
  1. I recorded a short video where I configure Flashsoft 3.7 to use the Samsung 840 Basic SSD as a read/write cache for the D:\ drive. Then I started HammerDB again, let the baseline run again and started the ShowIOBottlenecks script.
  • 5.2 save the results
    • The results are saved in the log files.

Phase 6 – Analysis of results – Test Plan 3

  • 6.2 Read and interpret all metrics
    • understand all metrics
    • compare metrics to basic/advanced baseline metrics
      • The baseline is documented in 1.4 with TPM=20820 and NOPM=4530
      • With Flashsoft 3.7 activated we reached TPM=20170 and NOPM=4336
    • is the result statistically correct?
      • More or less. The test ran 3 times in a row and the last run has been recorded.
    • has sensitivity analysis been done?
      • Just an approximation. There are so many variables, even in this simple environment, that a full sensitivity analysis would take too much time. The approximation shows that as long as I don't make changes to the environment myself, the results should be stable.
    • concentrate on key metrics
      • The key metric is the wait event time of IOs in relation to the whole wait time. The video showed that we could reduce the wait time for write logs from up to 20% to <=1%.
  • 6.3 Visualize your data
    • I skipped it because the result is simple to understand: 20% wait time for the write log dropped to <=1%.
  • 6.4 “Strange” results means you need to go back to “Phase 4.2 or 1.1”

    • nothing strange
  • 6.5 Present understandable graphics for your audience
    • Done.

Phase 7 – Conclusion

Is the goal or issue well defined? If not go back to  “Phase 1.1”

  • 7.1 Form a conclusion if and how the hypothesis achieved the goal or solved the issue!
    • The hypothesis has been proven wrong. With Flashsoft 3.7 I reached TPM=20170 and NOPM=4336, so the faster disk access seems to have no influence at the moment. The workload even ran slower than before. I believe that is related to some CPU overhead introduced by Flashsoft, but that's not proven.
    • The reasons why the hypothesis is wrong:
      • I ignored strange values at Test plan 1 –  6.4
      • I failed to understand all metrics at Test plan 1 – 6.1
        • How much time is waiting time compared to the runtime? This is a difficult topic because the wait times we see come from sys.dm_os_wait_stats and sys.dm_os_latch_stats. These values are the aggregated values of all tasks. Such a counter can only indicate that there are wait events and which one may be a problem. In this case disk access time is not the bottleneck.
        • In the video at 7:50 you can see that 41822 wait events occurred for Intra Query Parallelism and that the avg waits per ms is reported as 340. The script ShowIOBottlenecks runs for ~5000 ms, so the value should be around 8.4. I found a bug in the script, which I corrected here.
  • 7.2 Next Step
    • Is the hypothesis true?
      • No. 
      • if goal/issue is not achieved/solved, form a new hypothesis.
        • I will form a new hypothesis in the next post of this series. And believe me. The TPM and NOPM values will increase 🙂
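The bug mentioned in 7.1 can be caught with one line of arithmetic: 41822 wait events in a ~5000 ms sampling window cannot average 340 waits per ms:

```python
# Cross-check of the "avg waits per ms" value shown in the video:
# 41822 Intra Query Parallelism wait events in a ~5000 ms window.
wait_events = 41822
window_ms = 5000

avg_waits_per_ms = wait_events / window_ms
print(round(avg_waits_per_ms, 1))  # 8.4 -- not the 340 the script reported
```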

Go VLC.

 

Samsung 840 Basic- Baseline tests with FIO based on Windows 2012R2

This post shows the baseline FOB raw peak performance of the Samsung 840 Basic which is installed in Testverse with Windows 2012R2. I used “fio” (Flexible IO Tester) for this test. “fio” is my preferred tool to test SSDs.

I make use of the sample Windows “fio” job files. The specification sheet shows six metrics, but I concentrate on READ/WRITE bandwidth and random READ/WRITE 4K. I tested with the job files:

  • fio-job-04 = fio random 1M write peak BW
  • fio-job-09 = fio random 1M read peak BW
  • fio-job-010 = fio random 4K read peak IOPS
  • fio-job-011 = fio random 4K write peak IOPS

Samsung840Basic

Command line example:
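The original command line got lost in the blog migration. A call along the lines of `fio fio-job-04.ini --output=fio-job-04.log` with a job file like the following would match the “random 1M write peak BW” test (the ioengine, device path and exact option values are my assumptions, not the original sample file):

```ini
; Sketch of a "random 1M write peak BW" job (assumed, not the original file)
[global]
ioengine=windowsaio   ; native async IO on Windows
direct=1              ; bypass the cache, raw device performance
time_based=1
runtime=30
group_reporting=1

[fio-job-04]
rw=randwrite          ; random writes ...
bs=1M                 ; ... with 1M blocks = peak write bandwidth
iodepth=32
numjobs=4             ; four threads, as in the graphs below
filename=\\.\PhysicalDrive1   ; assumed target device
```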

Results:

fio-job-09 = fio random 1M read peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 531.59 MB/s.

BW09_compare-result-2Dtrend

fio-job-04 = fio random 1M write peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 134.72 MB/s.

BW04_compare-result-2Dtrend

fio-job-010 = fio random 4K read peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 100293.

IO10_compare-result-2Dtrend

fio-job-011 = fio random 4K write peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 22313.

IO11_compare-result-2Dtrend

 

SanDisk ioMemory/Fusion-io ioDrive – Baseline tests with FIO based on Windows 2012R2

This post shows the baseline FOB raw peak performance of the SanDisk PX600-1000 which is installed in Testverse with Windows 2012R2. I used “fio” (Flexible IO Tester) for this test. “fio” is the preferred tool to test SanDisk/Fusion-io ioDrive/ioMemory devices and SSDs.

I make use of the sample Windows “fio” job files. The specification sheet shows four metrics (READ/WRITE bandwidth, random READ/WRITE 4K), which I tested here with the job files:

  • fio-job-04 = fio random 1M write peak BW
  • fio-job-09 = fio random 1M read peak BW
  • fio-job-010 = fio random 4K read peak IOPS
  • fio-job-011 = fio random 4K write peak IOPS

SanDiskPX600-1000_spec

Command line example:

Results:

fio-job-09 = fio random 1M read peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 2590.2 MB/s.

BW9_compare-result-2Dtrend

fio-job-04 = fio random 1M write peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 1337.1MB/s.

BW4_compare-result-2Dtrend

fio-job-010 = fio random 4K read peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 271557.

IOPS_10_compare-result-2Dtrend

fio-job-011 = fio random 4K write peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 284429.

IOPS11_compare-result-2Dtrend

FIO (Flexible I/O Tester) Part8 – Interpret and understand the result/output

The result/output of “fio” can be overwhelming because this decent tool does a lot for you. Your job is to feed “fio” with the right options and then interpret the result/output. This post will help you to understand the result/output in detail. I know it's difficult to read, but I am limited a little bit by the WordPress design here and may improve it in the future.

The official documentation

The HOWTO provides some insights about the result/output of “fio”. I copy&paste some parts of the HOWTO and give you some more details or summarize other parts.

Output while running

Each thread is shown with one status character (Idle/Run) while fio runs:

  • P: Thread setup, but not started.
  • C: Thread created.
  • I: Thread initialized, waiting or generating necessary data.
  • p: Thread running pre-reading file(s).
  • R: Running, doing sequential reads.
  • r: Running, doing random reads.
  • W: Running, doing sequential writes.
  • w: Running, doing random writes.
  • M: Running, doing mixed sequential reads/writes.
  • m: Running, doing mixed random reads/writes.
  • F: Running, currently waiting for fsync().
  • f: Running, finishing up (writing IO logs, etc).
  • V: Running, doing verification of written data.
  • E: Thread exited, not reaped by main thread yet.
  • _: Thread reaped.
  • X: Thread reaped, exited with an error.
  • K: Thread reaped, exited due to signal.

Job overview output

This will give you an overview about the jobs and the option used. It’s useful to check the heading if you receive only the results of a run but not the command line call or job file.

Data direction output

All details for each data direction will be shown here. Most important numbers are:

  • io=
  • bw=
  • iops=
  • issued =
  • lat =

Details inside the box.

Group statistics

Disk statistics

Example: Interpret the result/output of SanDisk PX600-1000 and Samsung EVO 840

I ran the Linux raw peak performance test job-fio-11.ini, which is included in the “fio” sample Linux files, on Testverse.

This means a 4K random write test to show the peak 4K write IOPS.

I invoked the script:

Result for SanDisk PX600-1000:

Result for Samsung EVO 840:

Part 1: Job overview output

  • “g=0”
    • this job belongs to group 0 – Groups can be used to aggregate job results.
  •  “rw=randwrite”
    • the IO pattern is “random write”
  •  “bs=4K-4K/4K-4K/4K-4K”
    • the block size is 4K for all types (read, write, discard). This test only schedules writes, so the important part for this job is the 4K-4K in the middle.
  •  “ioengine=libaio”
    • the ioengine used is libaio, which means parallel writes will be scheduled. This is not a good test of file system performance because it bypasses the page cache.
  •  “iodepth=32”
    • there will be up to 32 IO units in flight against the device, so a queue of up to 32 outstanding IOs will be filled most of the time.
  • the version is “fio-2.2.9-26-g669e”
  • 4 threads will be started.
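Putting the options from this overview together, job-fio-11.ini presumably looks something like this (a sketch reconstructed from the output; the real sample file may differ):

```ini
; Reconstructed sketch of job-fio-11.ini based on the overview output above
[global]
ioengine=libaio        ; parallel async IO, bypassing the page cache (direct=1)
direct=1
rw=randwrite           ; 4K random writes
bs=4k
iodepth=32             ; up to 32 IO units in flight
iodepth_batch=16       ; submit in batches, see the "submit" line of the output
time_based=1
runtime=30
group_reporting=1

[fio-job-11]
numjobs=4              ; 4 threads will be started
filename=/dev/fioa     ; the PX600-1000 device in this example
```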

Part 2: Data direction output

The header is self-explanatory: Job name = fio-job-11 …

The detailed IO statistics for the job:

  • “write” (remember the job schedules 100% random writes)
  • “io=41119MB”
    • the number of MB transferred
  • “bw=1370.5MB/s”
    • data was written at an average speed of 1370.5 MB per second
  • “iops=350728”
    • the average number of IOs per second (4K in this case)
  • “runt=30013msec”
    • the job ran for ~30 seconds
  • “slat”, “clat”,”lat” min,max,avg,stdev
    • slat means submission latency and represents the time it took to submit the IO to the kernel for processing.
    • clat means completion latency and represents the time that passes between submission to the kernel and completion of the IO, not including submission latency.
    • lat is the best metric as it represents the whole latency an application would experience. avg slat + avg clat = ~avg lat.
    • Keep an eye on whether the numbers are usec, msec etc.! Compare the PX600-1000 to the EVO 840.
    • See the Al Tobey blog for some more details.
  • “clat percentile” gives a detailed breakdown of what percentage of IOs completed within which time frame. In this case: 99% of the IOs completed in <=1192 usec = 1.2 msec. This value is often used to ignore the few spikes when testing. The maximum clat has been 13505, which is ~39x longer than the average of 344.
  • “bw”  min, max, per,avg, stdev
    • In this case the bandwidth  has been 345090 KB/s = ~337 MB/s
  • “lat” this is like the clat part.
    • In this case 91.01% of the IOs completed between 250 usec and 500 usec. This is in line with the avg latency of 360.23 usec. Only ~8.7% of the IOs took between 750 usec and 2 ms. Together that is nearly 99.8% of all IOs.
  •  “cpu”
    • this line is dedicated to the CPU usage of the running the job
      •  “usr=5.66%”
        • this is the percentage of CPU usage of the running job at user level
        • 100% would mean that one CPU core is fully loaded, depending on whether HT is on or off
      •  “sys=12.09%”
        • this is the percentage of CPU usage of the running job at system/kernel level
      • “ctx=2783241”
        • The number of context switches this thread encountered while running
      • “majf=0” and “minf=32689”
  • “IO depths :
    • “32=116.7%… ”
      • this number shows that this job was able to keep ~32 IO units in flight
      • I am not sure why it’s >100%
      • “submit: …..16=100.0%….”
        • shows how many IOs were submitted in a single submit call. In this case the number was in the range of 8 to 16
        • This is in line with the iodepth_batch values used in the script
      • “complete: …. 16=100.0%….”
        • same like submit but for completed calls.
      • “issued: total=r=0/w=10526416/d=0”…
        • 10526416 write IOs have been issued, no reads, no discards, and none of them have been short or dropped
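The latency numbers of this run can be cross-checked with a few lines (values in usec, taken from the output discussed above; the slat average is inferred from lat minus clat, it is not in the quoted output):

```python
# Cross-check the latency relations of the run above (values in usec).
avg_lat  = 360.23               # whole latency the application would see
avg_clat = 344.0                # completion latency (average)
avg_slat = avg_lat - avg_clat   # submission latency, inferred: ~16 usec

# avg slat + avg clat = ~ avg lat, as stated above
assert abs((avg_slat + avg_clat) - avg_lat) < 0.01

# Percentiles vs. maximum: 99% of IOs completed in <= 1192 usec,
# while the slowest single IO took 13505 usec.
max_clat = 13505.0
print(round(max_clat / avg_clat, 1))  # the max is roughly 39x the average
```

This is why the 99th percentile, not the maximum, is usually quoted when judging a device.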

Part 3: Group statistics

  • WRITE
    • “io=41119MB”
      • The same amount of transferred MB as in the job statistics, because it is only one job
    • “aggrb=1370.5MB/s”
      • aggregated bandwidth of all jobs/threads for group 0
    • “minb=1370.5MB/s maxb=1370.5MB/s”
      • The minimum and maximum bandwidth a single thread saw. In this case the minimum is the same as the maximum. I don't think this is correct! Will clarify this.
    • “mint=30008msec” and “maxt=30008msec”
      • The smallest and longest runtime of the jobs. They are the same because we ran only one job.

Part 4: Disk statistics

  • “fioa: ios=71/767604”
    • 71 read IOs and 767604 write IOs on /dev/fioa
    • I am not sure where the 71 read IOs come from. I am pretty sure I didn't run anything myself in the background. Who knows?
  • “merge=0/0” number of IO merges done by the IO scheduler
    • no merges here
  • “ticks=0/259476”
    • the number of ticks we kept the drive busy. A sign that the device is saturated.
    • A tick is related to one jiffy. The next lines are only an approximation; read about the kernel timer for more details.
    • In this example I checked the value for CONFIG_HZ

      • CONFIG_HZ is set to 250 which means 1 second / 250 = 0,004s = 4ms
      • 259476 ticks * 4 ms = ~1037 s ???
      • I believe this is the cumulated wait time all IOs spent in the queue. If you increase only the iodepth, the value increases linearly.
      • 1037 s / 32 (iodepth) = ~32.4 s, a little more than the runtime of 30 s
  • “io_queue=256816”
    • the total time spent in the disk queue
  • “util=94.63%”
    • the utilization of the drive is 94.63%, which means the drive seems to be nearly saturated with this workload.
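The tick approximation in the “ticks” bullet can be written down explicitly (CONFIG_HZ=250 as checked on this kernel):

```python
# Approximate the cumulated IO queue time from the "ticks" counter,
# following the reasoning above (CONFIG_HZ=250 -> one tick = 4 ms).
config_hz = 250
tick_s = 1.0 / config_hz         # 0.004 s per tick
ticks = 259476                   # from the disk statistics above

total_queue_s = ticks * tick_s   # ~1038 s summed over all queued IOs
per_slot_s = total_queue_s / 32  # divided by the iodepth of 32

print(round(per_slot_s, 1))      # 32.4 -- a little above the 30 s runtime
```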

This should give you a good idea which parts of a result/output exist and some insight into how to interpret them. And yes, it's a tough one.

Go GIMP.