
8. SQL Server Performance Tuning study with HammerDB – Solve PAGEIOLATCH latch contention

In the last part I found that there is a new bottleneck, which seems to be related to PAGEIOLATCH_SH and PAGEIOLATCH_EX. The exact values depend on the time slot measured by the ShowIOBottlenecks script. The picture shows more than 70 percent wait time.

PagelatchIO

To track down the latch contention wait events, Microsoft provides a decent whitepaper. I used the following script and ran it several times to get an idea of which resources are blocked.

PagelatchSH_5.1.240101

The resource_description column returned by this script provides the resource description in the format <DatabaseID,FileID,PageID>, where the name of the database associated with DatabaseID can be determined by passing the value of DatabaseID to the DB_NAME() function.
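The whitepaper script itself is not reproduced here; a minimal T-SQL sketch along those lines, based on the sys.dm_os_waiting_tasks DMV (the column selection and filtering are my own), could look like this:

  -- Which sessions are currently waiting on page (IO) latches, and on which pages?
  SELECT session_id,
         wait_type,
         wait_duration_ms,
         resource_description        -- <DatabaseID, FileID, PageID>
  FROM sys.dm_os_waiting_tasks
  WHERE wait_type LIKE 'PAGEIOLATCH%'
     OR wait_type LIKE 'PAGELATCH%'
  ORDER BY wait_duration_ms DESC;

  -- Resolve the DatabaseID part of resource_description manually:
  SELECT DB_NAME(5);                 -- 5 = example DatabaseID taken from the output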

First let's find out which table this is. This can be done by inspecting the page and retrieving the Metadata ObjectId.

dbccpage

The Metadata ObjectId is 373576369. Now it is easy to retrieve the related table name.
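A sketch of both steps; the values 5 (DatabaseID), 1 (FileID) and 240101 (PageID) are taken as an example from the latch output above, so substitute your own:

  -- Send DBCC PAGE output to the client and dump the page header,
  -- which contains the Metadata: ObjectId field.
  DBCC TRACEON(3604);
  DBCC PAGE (5, 1, 240101, 0) WITH TABLERESULTS;

  -- Resolve the object id to a table name (run this in the database the page belongs to).
  SELECT OBJECT_NAME(373576369) AS table_name;   -- 'warehouse' in this case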


warehouse_tablename

It is the “warehouse” table.

What is the bottleneck here?

dbccpage

First of all, an explanation of the wait events:

PAGEIOLATCH_EX
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Exclusive mode. Long waits may indicate problems with the disk subsystem.

PAGEIOLATCH_SH
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Shared mode. Long waits may indicate problems with the disk subsystem.

In our case this means that a lot of inserts/updates happen when running the TPC-C workload, and tasks wait on a latch for this page in shared or exclusive mode. From inspecting the page we know it's the warehouse table, and we created the database with 33 warehouses in the beginning.

The page size in SQL Server is 8K, and all 33 rows fit into a single page (m_slotcnt = 33). This means some operations cannot be parallelized!

To solve this I will change the "physical" design of this table, which is still in line with the TPC-C rules. There may be different ways to achieve this. I add a column, fill it with some text which forces SQL Server to restructure the pages, and then delete the column.
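A sketch of that change; the filler column name and the CHAR(5000) size are my assumptions, chosen so that two widened rows no longer fit into one 8K page:

  -- Widen every row temporarily so that only one row fits per 8K page.
  ALTER TABLE warehouse ADD page_filler CHAR(5000) NULL;

  -- Materialize the new column in every row; this forces SQL Server to split the pages.
  UPDATE warehouse SET page_filler = REPLICATE('x', 5000);

  -- Dropping the column is a metadata-only change, so the rows stay spread
  -- over separate pages until the table is rebuilt.
  ALTER TABLE warehouse DROP COLUMN page_filler;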

add_drop

Okay, now check whether m_slotCnt is 1, which would mean every row sits in its own page.

dbccpage_new

It’s done.

ShowIOBottleneck_solvedWarehouse

When running the workload again, the PAGEIOLATCH_SH and PAGEIOLATCH_EX wait events are nearly gone.

Before:

  • System achieved 338989 SQL Server TPM at 73685 NOPM
  • System achieved 348164 SQL Server TPM at 75689 NOPM
  • System achieved 336965 SQL Server TPM at 73206 NOPM

After:

  • System achieved 386324 SQL Server TPM at 83941 NOPM
  • System achieved 370919 SQL Server TPM at 80620 NOPM
  • System achieved 366426 SQL Server TPM at 79726 NOPM

The throughput increased slightly. Again I monitored that the CPU is at 100% while the workload runs. At this point I could continue to tune the SQL statements as I did in the last 2-3 posts. Remember, I started the SQL Server performance tuning at 20820 TPM and 4530 NOPM. This means more than 10x faster!

But the next step may be to add some hardware. This all runs on just 2 of the 4 available cores, as I wrote in the first part.

Go ChaosMonkey!

FIO (Flexible I/O Tester) Part9 – fio2gnuplot to visualize the output

The Linux build of "fio" ships with a tool called fio2gnuplot. This tool renders the output files of "fio" and uses gnuplot to generate nice graphics. Gnuplot is a portable, command-line driven graphing utility which is freely distributed.

The example shows the distribution of IOPS with different block sizes and different read/write mixes:

PX600-1000-IOPS-mes3DPlt

Requirements

I am using "fio" 2.2.10, which was released on 12.09.2015.

Since 2.1.2 fio2gnuplot is part of the “fio” release. To generate the graphics you need to install gnuplot.

How to generate the log files?

There are some “fio” options to generate log files.

  • write_bw_log=<Filename>
  • write_iops_log=<Filename>
  • write_lat_log=<Filename>
  • per_job_logs=0/1 ( >2.2.8 so not for Windows build 16.09.2015)

write_bw_log generates a log file with the bandwidth details of the job, and so on for the other options. If you don't set per_job_logs=0, one file is written per thread (numjobs=X). Most of the time this is not wanted, because you would like to generate graphics based on all threads. An issue I found is that in that case the default patterns of fio2gnuplot (-b / -i) will not work, because they search for the file endings *_bw.log and *_iops.log while the files actually end with *_bw.X.log and *_iops.X.log. It should be fixed with this commit.
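As a sketch, a run that writes one aggregated log file per log type could look like this (the job name, file name and remaining parameters are placeholders, not the ones used in this post):

  # one log file per log type for ALL threads (per_job_logs=0) instead of one per thread
  fio --name=fio-test --filename=fio-testfile --size=1G --rw=randread --bs=4k \
      --numjobs=4 --runtime=60 --time_based --group_reporting \
      --write_bw_log=fio-test --write_iops_log=fio-test --write_lat_log=fio-test \
      --per_job_logs=0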

If per_job_logs=0 is set and all log file options have been set:

  • write_bw_log=fio-test
  • write_iops_log=fio-test
  • write_lat_log=fio-test

then 5 files will be generated: one for bandwidth, one for IOPS, and three for latency (lat, slat and clat).

What does a log file look like?

The 4096 in the fourth column is the block size in bytes (4K). The second column is the bandwidth in KB/s. I believe the first column is the elapsed time in ms. The third column indicates the data direction: 0 means the row is related to reads, 1 means writes.

Using fio2gnuplot

fio2gnuplot works in two major phases. The first phase generates the input files for gnuplot and does some calculations on the data, like the average, minimum and maximum.

Starting fio2gnuplot -b will search for all bandwidth log files in the local directory and generate the input files for gnuplot. The option "-i" is the default pattern for IOPS files. There is no default pattern for latency.

fio2gnuplot_phase1

The second phase generates the graphics; the option "-g" is used for this. By default "-g" deletes the gnuplot input files, and the option "-k" can be used to keep these files for later editing. If you want to make changes to the output, you can edit the gnuplot files, like the mygraph file.
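A short sketch of both phases using the options mentioned above:

  # collect all *_bw.log files, build the gnuplot input files and render the graphics;
  # -k keeps the gnuplot input files for later editing
  fio2gnuplot -b -g -k

  # the same for IOPS logs
  fio2gnuplot -i -g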

fio2gnuplot_phase2

And this is the output of fio-test_bw-2Draw.png

fio-test_bw-2Draw

Using fio2gnuplot to compare files with the default pattern -b or -i

You can copy all log files into the same directory and call fio2gnuplot with the right pattern. I make use of "-b" for bandwidth comparisons.

fio2gnuplot_compare

And this is the output of compare-result-2Dsmooth.png

compare-result-2Dsmooth

Using fio2gnuplot to compare files with a custom pattern

Sometimes the default patterns will not work. For example, there is no pattern for the latency output. In this case you can specify your own pattern with the option "-p <pattern>" and use a title. WARNING: Using the pattern "*.log" will raise an error. I fixed this, so in the future this should work.
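A sketch for the latency logs, assuming the files end in _lat.log (adjust the pattern to your own file names; the title option mentioned above is left out here):

  # no built-in pattern exists for latency logs, so pass one explicitly
  fio2gnuplot -p '*_lat*.log' -g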

compare-result-pattern

And this is the output of compare-lat-2Dsmooth.png

compare-lat-2Dsmooth

Go Keepass2.

 

Samsung 840 Basic- Baseline tests with FIO based on Windows 2012R2

This post shows the baseline FOB raw peak performance of the Samsung 840 Basic installed in Testverse, running Windows 2012R2. I used "fio" (Flexible IO Tester) for this test. "fio" is my preferred tool to test SSDs.

I make use of the sample Windows "fio" job files. The specification lists six metrics, but I concentrate on four (READ/WRITE bandwidth, random READ/WRITE 4K). I tested with the job files:

  • fio-job-04 = fio random 1M write peak BW
  • fio-job-09 = fio random 1M read peak BW
  • fio-job-010 = fio random 4K read peak IOPS
  • fio-job-011 = fio random 4K write peak IOPS

Samsung840Basic

Command line example:
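The exact call is not preserved in this copy; it was presumably something along these lines (the job file name comes from the list above, the output file name is my own):

  fio fio-job-04.ini --output=fio-job-04-result.txt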

Results:

fio-job-09 = fio random 1M read peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 531.59 MB/s.

BW09_compare-result-2Dtrend

fio-job-04 = fio random 1M write peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 134.72 MB/s.

BW04_compare-result-2Dtrend

fio-job-010 = fio random 4K read peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 100293.

IO10_compare-result-2Dtrend

fio-job-011 = fio random 4K write peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 22313.

IO11_compare-result-2Dtrend

 

SanDisk ioMemory/Fusion-io ioDrive – Baseline tests with FIO based on Windows 2012R2

This post shows the baseline FOB raw peak performance of the SanDisk PX600-1000 installed in Testverse, running Windows 2012R2. I used "fio" (Flexible IO Tester) for this test. "fio" is the preferred tool to test SanDisk/Fusion-io ioDrive/ioMemory/SSD devices.

I make use of the sample Windows "fio" job files. The specification lists four metrics (READ/WRITE bandwidth, random READ/WRITE 4K), which I tested here with the job files:

  • fio-job-04 = fio random 1M write peak BW
  • fio-job-09 = fio random 1M read peak BW
  • fio-job-010 = fio random 4K read peak IOPS
  • fio-job-011 = fio random 4K write peak IOPS

SanDiskPX600-1000_spec

Command line example:

Results:

fio-job-09 = fio random 1M read peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 2590.2MB/s

BW9_compare-result-2Dtrend

fio-job-04 = fio random 1M write peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 1337.1MB/s.

BW4_compare-result-2Dtrend

fio-job-010 = fio random 4K read peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 271557.

IOPS_10_compare-result-2Dtrend

fio-job-011 = fio random 4K write peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 284429.

IOPS11_compare-result-2Dtrend

FIO (Flexible I/O Tester) Part8 – Interpret and understand the result/output

The result/output of "fio" can be overwhelming because this decent tool does a lot for you. Your job is to feed "fio" with the right options and then interpret the result/output. This post will help you understand the output in detail. I know it's difficult to read, but I am somewhat limited by the WordPress design here and may improve it in the future.

The official documentation

The HOWTO provides some insights into the result/output of "fio". I copy & paste some parts of the HOWTO and give you more details or summarize other parts.

Output while running

fio shows a status character for each thread, covering the idle and running states:

  • P – Thread setup, but not started.
  • C – Thread created.
  • I – Thread initialized, waiting or generating necessary data.
  • p – Thread running pre-reading file(s).
  • R – Running, doing sequential reads.
  • r – Running, doing random reads.
  • W – Running, doing sequential writes.
  • w – Running, doing random writes.
  • M – Running, doing mixed sequential reads/writes.
  • m – Running, doing mixed random reads/writes.
  • F – Running, currently waiting for fsync().
  • f – Running, finishing up (writing IO logs, etc.).
  • V – Running, doing verification of written data.
  • E – Thread exited, not reaped by main thread yet.
  • _ – Thread reaped.
  • X – Thread reaped, exited with an error.
  • K – Thread reaped, exited due to signal.

Job overview output

This gives you an overview of the jobs and the options used. It's useful to check this heading if you only receive the results of a run but not the command line call or job file.

Data direction output

All details for each data direction will be shown here. Most important numbers are:

  • io=
  • bw=
  • iops=
  • issued =
  • lat =

Details inside the box.

Group statistics

Disk statistics

Example: Interpret the result/output of SanDisk PX600-1000 and Samsung EVO 840

I ran the Linux raw peak performance test job-fio-11.ini, which is included in the "fio" sample Linux files, on Testverse.

This means a 4K random write test to show the peak 4K write IOPS.

I invoked the script:

Result for SanDisk PX600-1000:

Result for Samsung EVO 840:

Part 1: Job overview output

  • “g=0”
    • this job belongs to group 0 – Groups can be used to aggregate job results.
  •  "rw=randwrite"
    • the IO pattern is “random write”
  •  “bs=4K-4K/4K-4K/4K-4K”
    • the block size is 4K for all types (read, write, discard). This test only schedules writes, so the important part for this job is the 4K-4K in the middle.
  •  “ioengine=libaio”
    • the ioengine used is libaio, which means parallel writes will be scheduled. This is not a good test for file system performance because it skips the page cache.
  •  “iodepth=32”
    • there will be up to “32 IO units” in flight against the device. So there will be a queue filled up to 32 outstanding IOs most of the time.
  • the version is “fio-2.2.9-26-g669e”
  • 4 threads will be started.
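Putting these values together, the job file presumably looked roughly like the following sketch (the original sample file may differ, for example in the exact batching, runtime and filename settings):

  [global]
  ioengine=libaio
  direct=1
  rw=randwrite
  bs=4k
  iodepth=32
  # batched submits, see the submit/complete histograms below
  iodepth_batch=16
  numjobs=4
  runtime=30
  time_based
  group_reporting
  # raw device used for the PX600 run - ALL data on it is lost
  filename=/dev/fioa

  [fio-job-11]

Note that with group_reporting set, the four threads are reported as one aggregated entry, which would also explain why minb equals maxb in the group statistics further down.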

Part 2: Data direction output

The header is self-explanatory: Job name = fio-job-11 …

The detailed IO statistics for the job:

  • “write” (remember the job schedules 100% random writes)
  • “io=41119MB”
    • number of MB transferred
  • “bw=1370.5MB/s”
    • data was written at a speed of 1370.5MB per second on average
  • “iops=350728”
    • the average number of IOs per second (4K each in this case).
  • “runt=30013msec”
    • The job ran ~30 seconds
  • “slat”, “clat”,”lat” min,max,avg,stdev
    • slat means submission latency and presents the time it took to submit this IO to the kernel for processing.
    • clat  means completion latency and presents the time that passes between submission to the kernel and when the IO is complete, not including submission latency.
    • lat is the best metric, as it represents the whole latency an application would experience. avg slat + avg clat ≈ avg lat.
    • Keep an eye on whether the numbers are in usec or msec, etc.! Compare the PX600-1000 to the EVO 840.
    • See Al Tobey blog for some more details.
  • "clat percentile" gives a detailed breakdown of what percentage of the IO completed within which time frame. In this case: 99% of the IO completed in <=1192 usec = 1.2 msec. This value is often used to ignore the few spikes when testing. The maximum clat was 13505 usec, roughly 39x the average of 344 usec.
  • “bw”  min, max, per,avg, stdev
    • In this case the bandwidth was 345090 KB/s = ~337 MB/s per thread; the four threads together give the 1370.5MB/s shown above
  • “lat” this is like the clat part.
    • In this case 91.01% of the IO completed in more than 250usec but no more than 500usec. This is in line with the avg latency of 360.23usec. Only ~8.7% of the IO took more than 750usec but no more than 2ms. Both together cover nearly 99.8% of all IO.
  •  “cpu”
    • this line is dedicated to the CPU usage of the running job
      •  “usr=5.66%”
        • this is the percentage of CPU usage of the running job at user level
        • 100% would mean that one CPU core is fully loaded, depending on whether HT is on or off
      •  “sys=12.09%”
        • this is the percentage of CPU usage of the running job at system/kernel level
      • “ctx=2783241”
        • The number of context switches this thread encountered while running
      • "majf=0" and "minf=32689" – the number of major and minor page faults
  • “IO depths :
    • “32=116.7%… ”
      • this number showed that this job was able to have ~32 IO units in flight
      • I am not sure why it’s >100%
      • “submit: …..16=100.0%….”
        • shows how many IO were submitted in a single submit call. In this case it could be in the range of 8 to 16
        • This is in line with the script which used iodepth_batch values
      • “complete: …. 16=100.0%….”
        • same like submit but for completed calls.
      • "issued: … total=r=0/w=10526416/d=0 …"
        • 10526416 write IOs have been issued, no reads, no discards, and none of them have been short or dropped

Part 3: Group statistics

  • WRITE
    • “io=41119MB”
      • The same amount of transferred MB as in the job statistics, because it's only one job
    • “aggrb=1370.5MB/s”
      • aggregated bandwidth of all jobs/threads for group 0
    • “minb=1370.5MB/s maxb=1370.5MB/s”
      • The minimum and maximum bandwidth one thread saw. In this case the minimum is the same as the maximum. I don't think this is correct! Will clarify this.
    • “mint=30008msec” and “maxt=30008msec”
      • Smallest and longest runtime of one of the jobs. The same because we ran only one job

Part 4: Disk statistics

  • “fioa: ios=71/767604”
    • 71 read IO and 767604 write IO on /dev/fioa
    • I am not sure why there are 71 read IOs. I am pretty sure I didn't run anything myself in the background. Who knows?
  • "merge=0/0" – the number of IO merges performed by the IO scheduler
    • no merges here
  • “ticks=0/259476”
    • number of ticks we kept the drive busy. A sign that the device is saturated.
    • A tick is related to one jiffy. The next lines are only an approximation; read about the kernel timer for more details.
    • In this example I checked the value for CONFIG_HZ

      • CONFIG_HZ is set to 250, which means 1 second / 250 = 0.004s = 4ms
      • 259476 ticks * 4ms = 1037s  ???
      • I believe this is the cumulated wait time all IOs spent in the queue. If you increase only the iodepth, the value increases linearly.
      • 1037s / 32 (iodepth) = ~32.4s, which is a little bit more than the runtime of 30s
  • “io_queue=256816”
    • total time spent in the disk queue
  • “util=94.63%”
    • the utilization of the drive is 94.63%, which means the drive seems to be nearly saturated with this workload.

This should give you a good idea of which parts of a result/output exist and some insight into how to interpret them. And yes, it's a tough one.

Go GIMP.

FIO (Flexible I/O Tester) Part7 – Steady State of SSD,NVMe,PCIe Flash with TKperf

The real-world performance of a flash device is shown when the Steady State is reached. In most cases these are not the performance values shown on the vendor's website.

The SNIA organization defined a specification for how to test flash devices or solid state storage. The industry would like to have a methodology to compare NAND flash devices with a scientific approach. The reason this specification is needed is that a NAND flash device's write performance heavily depends on the write history of the device. There are three write phases of a NAND flash device:

  • FOB (Fresh- Out of the Box)
  • Transition
  • Steady State

SNIA_FOB_Transition_SteadyState

FOB (Fresh Out of the Box) or Secure Erase (Sanitize?)

A device taken fresh out of the box should provide the best possible write performance for a while. Why? A flash device writes data in 4 KB pages inside of 256 KB blocks. To add additional pages to a partially filled block, the solid-state drive must erase the entire block before writing data back to it.

nand-flash-memory-pages-and-blocks

If the flash device fills up, fewer and fewer empty blocks are available. In their place are partially filled blocks. The NAND Flash device can’t just write the new data to these partially filled blocks — that would erase the existing data. Instead of a simple write operation, the NAND Flash device has to read the value of the block into its cache, modify the value with the new data, and then write it back. (Write Amplification)

Often, when you would like to test a device, some data has already been written to it. This means you can't test the FOB performance anymore. For this it is possible to "Secure Erase" the device. This feature was originally introduced to delete all data on a flash device securely, which means that all pages/blocks will be zeroed, even the blocks which are over-provisioned (not visible to the OS). But it can also be used to restore the FOB performance for a while. The vendors provide tools for this. Be careful: some vendors offer both Sanitize and Secure Erase as features, but the implementations differ, so a Secure Erase may only delete the mapping table and not the blocks themselves.

Transition

The transition is the phase between the good FOB performance and the Steady State. Most of the time the performance drops continuously over time and the write cliff appears until the Steady State is reached.

Steady State

The scientific definition for Steady State is:

Basically this means: Use a predefined write/read pattern and run this against the device until the performance of the device will be stable over time.

But should you run the test yourself? First, a little bit of math:

The pseudo code for IOPS states:

The whole block will be run up to 25 times, depending on whether Steady State has already been reached. Each run lasts 1 minute.

25 (maximum runs) * 7 (R/W mixes) * 8 (block sizes) = 1400 minutes = 23.3h

The test could run for ~24h and it will write a lot. I strongly advise against running the tests yourself unless you accept that the device loses lifetime.

There are some scripts and tools based on "fio" which can be useful to test the device.

  • a bash script by James Bowen. This script runs for ~24 hours and does not stop even when Steady State is reached
  • tkperf by Georg Schönberger which I prefer
  • block-storage git project by Jason Read, which is aimed at cloud environments. A full implementation of the PTS can't be done in cloud environments (for example: Secure Erase)

Using TKperf with SanDisk PX600-1000

TKperf is a Python script which implements the full SNIA PTS specification. With Ubuntu it's really easy to install.

After tkperf and all its dependencies were installed, I started a test. I tested a SanDisk PX600-1000 installed in Testverse. Because you can't run "hdparm" to Secure Erase the PX600 PCIe device, Georg Schönberger implemented a new option "-i fusion" which leverages the SanDisk command-line tools to Secure Erase the device. Again: Thank you @devtux_at

The following command runs all four SNIA PTS tests (IOPS, Latency, Throughput, Write Saturation). I ran with 4 jobs, an iodepth of 16, and used refill buffers to avoid compression by the device. The file test.dsc is a simple text file which describes the drive, because "hdparm" can't get info about the PX600.
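From memory, the call looked roughly like the sketch below; the flag spellings (-nj, -iod, -rfb, -dsc) are assumptions on my part, so check them against the tkperf help before running anything:

  # destroys ALL data on the device!
  tkperf ssd PX600-1000 /dev/fioa -nj 4 -iod 16 -rfb -dsc test.dsc -i fusion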

REMEMBER: ALL data will be lost, and your device loses lifetime or may even be destroyed!

Results:

tkperf generates some nice PNG files which summarize the testing. The tests reached Steady State after 4-5 rounds.

The device was formatted with a 512-byte sector size. This was needed to run all tests. Formatting the device with a 4k sector size would improve the performance for all bigger block sizes!

IOPS

PX600-1000-IOPS-mes3DPlt

Latency

This is the reason why these cards are so nice. Providing < 0.2 ms latency is great!

PX600-1000-LAT-stdyStConvPlt

Throughput

PX600-1000-TP-RW-stdyStConvPlt

Write Saturation

PX600-1000-writeSatIOPSPlt

Go tkperf!

FIO (Flexible I/O Tester) Part6 – Sequential read and readahead

In the last read tests I found that the sequential read IOPS were higher than expected. I left "invalidate" at its default, which means the file used for the test should be dropped from the page cache when the test starts. So why are the IOPS higher than the raw device performance? I found that the readahead is responsible for this.

Important: readahead only comes into play when the read goes through the page cache. This means "direct=0".

Set and get the readahead value

You can use the tool “blockdev” to show the readahead value:

or set the size with:
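For example, assuming /dev/sda is the device under test:

  blockdev --getra /dev/sda        # show the current readahead value
  blockdev --setra 256 /dev/sda    # set the readahead value to 256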

Example with different readahead values:

I set the readahead value to 128 and ran this file: readahead
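The job file itself is not included in this copy; based on the numbers below (25600 issued 4K IOs for 100MB of buffered sequential reads), it presumably looked roughly like this (file name and path are assumptions):

  [readahead]
  rw=read
  bs=4k
  size=100M
  direct=0
  ioengine=sync
  filename=testfile1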

readahead_128

“issued: total=r=25600” shows that 25600 IOPS have been issued but “sda : ios=561” shows that only 561 hit the device.So we can estimate that to read 100MB with 561 IOPS each IOPS needs to be ~182KB. This seems to be higher then 128 (read ahead value) * 512Bytes (default sector size) = 64KB. Something is wrong!  I run:

The output was 4096. Okay, the physical block size is 4096 bytes and not 512 bytes. 128 (readahead value) * 4096 bytes (physical block size) = 512KB. An average IO size of ~182KB fits within such a window, which it could not with only 64KB.

I set the readahead value to 256 and ran the same test again.

readahead_256

296 device IOs means around half the IOs of the last test.

I set the readahead value to 512 and ran the same test again.

readahead_512

Again, around half.

So this value can have an impact on the sequential read performance of your device. But most of the time sequential reads are not the bottleneck in a typical environment. Even HDDs can provide really fast sequential reads as long as fragmentation is under control.

Go leofs.

FIO (Flexible I/O Tester) Part5 – Direct I/O or buffered (page cache) or raw performance?

Before you start to run tests against devices you should know that most operating systems make use of DRAM caches for IO devices, for example the page cache in Linux, which is: "sometimes also called disk cache, is a transparent cache for the pages originating from a secondary storage device such as a hard disk drive (HDD)". Windows uses a similar but different approach, but for convenience I will use page cache as a synonym for both approaches.

Direct I/O means an IO access where you bypass the cache (page cache).

RAW performance

If we want to measure the performance of an IO device, should we use these caching techniques or avoid them? I believe it makes sense to run a baseline of your device without the influence of any file system or page cache. The page cache implementation differs between operating systems, which means results for the same device may vary a lot. File systems also introduce a huge variance when running tests. For example: if someone says that a device with NTFS reaches ~50.000 IOPS/4K, have you ever asked which version of NTFS? You should have. See Memory changes in NTFS with Windows Server 2012 compared to version 1.0.

For sure, in a real-world workload the influence of a page cache may be really important and should not be underestimated or ignored. But to measure the raw performance of a device, ignore any file system or page cache.

Page caching vs Direct I/O vs RAW performance

Linux Page Caching:

“fio” per default invalidates the buffer/page cache for the used files when it starts. I believe this is done to avoid the influence of the page cache when running. But remember, while “fio” is running, the page cache will build up and may have an influence on your test.

We need to set the option “invalidate=0” to keep the page/buffer in memory to test with the page cache.

Page cache in action on Testverse:

jobfile: cache-run
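The job file is not included in this copy; based on the numbers below (2560 issued 4K IOs for 10MB, buffered random reads with one sync thread), a plausible reconstruction looks like this (the file name is an assumption):

  [cache-run]
  rw=randread
  bs=4k
  size=10M
  ioengine=sync
  iodepth=1
  numjobs=1
  direct=0
  # keep the page cache instead of invalidating it at start
  invalidate=0
  filename=testfile1

The cache-run-direct job used further down should be the same file with direct=1.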

The first run looked like that:

fio_cache-1-run

The marked parts show that 2560 IOs of 4K have been issued, but only 1121 IOs hit the device. So 1439 IOs seem to have been answered from the page cache. Let's run it again, because the file should be in the page cache now.

fio_cache-2-run

The second run proved it. Zero IOs hit the device, meaning 100% of the IO was handled by the page cache. And WOHOOO: 853333 IOPS 🙂

How to monitor the page cache?

There is a nice tool called cachestat written by Brendan D. Gregg which is part of the perf-tools. The tool is just a script so there is no need to compile it. Just download and run 🙂

cachestat

The marked part shows a “cache-run”.

free -m can give you some details as well.

You can run this command to clear the page cache.
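As used again in the example below, the cache is cleared via drop_caches (run as root):

  sync                                # flush dirty pages first
  echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes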

Example:

free-m

I ran "free -m" to show the current state. Then I ran the "cache-run" twice with a size of 1000MB; the second run was MUCH faster. "free -m" then shows that "used" and "cached" have increased by 1000, which means testfile1 is fully cached. Then I cleared the cache with echo 3 to /proc/sys/vm/drop_caches; afterwards "free -m" shows that cached is near zero.

Windows Page Caching/System cache

This chapter is even more complicated. I will update it soon!

Direct I/O

Direct I/O means an IO access where you bypass the cache (page cache). You are able to force a direct I/O with the option “direct=1”.

jobfile: cache-run-direct

cache-run-direct

Direct I/O is the first step in measuring a storage device's raw performance. The IOPS dropped a lot compared to the second page cache run: direct 8767 IOPS vs. buffered 853333 IOPS. You may think that 8767 IOPS is too low for a SanDisk ioMemory device? Remember, this is a random read with 1 job/thread and the sync ioengine with an iodepth of 1. This means each read IO has to wait until the previous IO is completed.

RAW performance

RAW performance means you schedule workloads against the native block device without a page cache or a file system. This is how most vendors measure their own performance values and present them on their websites.

The option “filename=/dev/sdb” (Linux) or “filename=\\.\PhysicalDrive1” (Windows) uses the second device of your system.

WARNING: The data on the selected device can be lost!!!!!

Please double check that you selected the right device. Re-check any files you downloaded.

raw-run

You may have noticed that the IOPS slightly increased to 9142 compared to the direct test run.

Go tomcat.

FIO (Flexible I/O Tester) Part4 – fixed, ranges or finer grained block sizes?

Using fixed block sizes is the most common way of doing storage (device) tests. There is more than one reason for that. The obvious reason is that a lot of tools only support fixed block sizes. Another reason is that storage vendors mostly present their performance counters based on a fixed block size of 4K. But this doesn't mean it's the best way to test or measure your storage (device).

There are two common tests which you would like to perform:

  • RAW peak performance
  • Simulating production workload

Raw peak performance means a test where you try to drive your storage (device) to the same or similar values as the vendor presents in its website/documentation. This is by nature the highest possible value (peak), without the influence of page caches or file systems. Why should you do this? It is not to prove the numbers on the vendor website true or false; it is to make sure the device is running as expected and all best practices have been applied.

Simulating production workload means a test where you try to test your storage (device) against a production-like workload. But most of the time you are not able to run the production workload directly on the storage (device), so you need to simulate it with a tool like "fio", "sqlio", "swingbench"… etc. If possible, always test with real production workloads instead of simulations.

Fixed block size

You may have used the option "blocksize=4K" or "bs=4K", which is the way to specify a fixed block size of 4K. But the output shows "BS=4K-4K/4K-4K/4K-4K". So why 6 numbers instead of one?

There are three parts (/ is the separator) in the output. The first is related to the block size for reads, the second for writes, and the third for discards. 4K-4K means a block size from the range 4K to 4K, which is exactly 4K. But this gives you an idea that ranges could be used. And there is even more than ranges: block sizes can be fine-grained.

You are able to specify fixed block sizes with different values for reads, writes and discards.

Example: Let's run a job with 50% reads and 50% writes. The reads should use a fixed block size of 4K but the writes should use 64K. So we need to set the option "rw=readwrite" or "rw=rw" and the option "rwmixread=50" or "rwmixwrite=50".

file: read-4K-write-64K
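A sketch of such a job file; the size is my own choice, and fio's comma syntax sets separate read and write block sizes:

  [read-4K-write-64K]
  rw=rw
  rwmixread=50
  # 4K for reads, 64K for writes
  bs=4k,64k
  size=10M
  ioengine=sync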

read-4K-write-64K

Block size ranges

Instead of using a fixed block size you can specify a range.

Example: Let’s run a read workload with a block size in the range of 1K to 16K. This can be done by set the “blocksize_range=1K:16K” or “bsrange=1K:16K”. But how will the mix look like?

file: read_bs1K-16K

read-bs1K-16K

1400 IOs have been issued. At this point I am not sure about the real distribution. I tried a few calculations but have not been able to provide an explanation yet. The documentation states:

fio will mix the issued io block sizes. The issued io unit will always be a multiple of the minimum value given (also see bs_unaligned).

 Finer grained block sizes

Sometimes it may be useful to control how the weight for a block size within a range should be set. This can be done with the option "bssplit". What could be a reason to control the block sizes? Let's assume you know a production workload and would like to use "fio" to behave similarly inside a new VM/container to evaluate whether the production workload would run fine.

The format for this option is:
bssplit=blocksize/percentage:blocksize/percentage

for as many block sizes as needed.

Example: Let’s run a read workload with a block size 10% 1K, 80% 4k, 10% 16K. This can be done by set the option “bssplit=1K/10:4K/80:16K/10”.

file: read_bs1K10p_4K80p_16K10p
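A sketch of the job file; everything except the bssplit line is my assumption:

  [read_bs1K10p_4K80p_16K10p]
  rw=read
  bssplit=1k/10:4k/80:16k/10
  size=10M
  ioengine=sync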

read_bs1K10p_4K80p_16K10p

2029 IOs were issued. The simple math is:

  • 2029 * 0.8 (80% of the IO) * 4K = 6492.8K
  • 2029 * 0.1 (10% of the IO) * 1K = 202.9K
  • 2029 * 0.1 (10% of the IO) * 16K = 3246.4K
  • 6492.8K + 202.9K + 3246.4K = 9942.1K ≈ 10MB

Yes, there is a deviation, which means my approach is only a simple approximation of the real algorithm. Remember, you cannot issue half an IO 🙂

Go Zookeeper.

FIO (Flexible I/O Tester) Part3 – Environment variables and keywords

FIO is able to make use of environment variables and reserved keywords in job files. This can avoid some work, so that you don't need to change the job files each time you call them.

Define a variable in a job file

The documentation is pretty nice:

So, for example, this is an entry in a job file where the size should be set via an environment variable.
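The entry itself is not shown in this copy; with fio's ${VAR} substitution it would simply be:

  size=${SIZE}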

Example of using environment variables

I would like to run four different job files and use a different size for each run (1MB, 10MB, 100MB, 1G, 10G, 100G).

  • Job1 is sequential read / 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job2 is sequential write/ 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job3 is random read / 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job4 is random write / 4 KB / 1 thread / ioengine=sync / iodepth=1

I make use of the default settings under Linux. For example, I did not specify the block size, which is 4K by default, but it is best practice to define it in case a default value changes.

Files:

jobfile1: read_4KB_numjob1_sync_iodepth1

jobfile2: write_4KB_numjob1_sync_iodepth1

jobfile3: randread_4KB_numjob1_sync_iodepth1

jobfile4: randwrite_4KB_numjob1_sync_iodepth1
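As a sketch, jobfile3 might look like this (the block size stays at its 4K default, as described above):

  [randread_4KB_numjob1_sync_iodepth1]
  rw=randread
  ioengine=sync
  iodepth=1
  numjobs=1
  size=${SIZE}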

 

Run fio:

1) Run with SIZE of 1M
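The call itself is not preserved in this copy; fio accepts several job files in one invocation, so it was presumably along these lines (the inline SIZE=1M makes the variable visible to fio):

  SIZE=1M fio read_4KB_numjob1_sync_iodepth1 write_4KB_numjob1_sync_iodepth1 \
              randread_4KB_numjob1_sync_iodepth1 randwrite_4KB_numjob1_sync_iodepth1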

fio_run_4_files_with_env

Yes, this could be done in one job file, but often you are using predefined files from a vendor or a website like this one and just want to call them with your own size settings.

2) Run with SIZE of 10M

…. and so on.

Reserved keywords:

There are 3 documented reserved keywords:

  • $pagesize – the architecture page size of the running system
  • $mb_memory – megabytes of total memory in the system
  • $ncpus – number of online available CPUs

You can use them on the command line or in the job files. They are substituted with the current system values when you run the job.

Use cases?

Let's say you would like to run a job with as many threads as there are online available CPUs.

file: read-x-threads
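A sketch of the job file; the block size and size are my assumptions, the interesting line is numjobs=$ncpus:

  [read-x-threads]
  rw=read
  bs=4k
  size=10M
  ioengine=sync
  numjobs=$ncpus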

In my case (Testverse) there are 4 CPU cores with Hyper-Threading (HT) on, which means 8 threads.

read-x-threads

Go Jenkins!