Monthly Archives: August 2015

2. SQL Server Performance Tuning study with HammerDB – Setup HammerDB

In the last post of this performance tuning study the setup of SQL Server was done, and it is up and running now. The next part is to install the workload generator HammerDB. I decided to use HammerDB because it's open source and implements the TPC-C benchmark defined by the TPC (http://www.tpc.org/).

Setup HammerDB

I downloaded the version HammerDB-2.18-Win-x86-64 from the website and installed it to the C:\ drive.

The important part when working with HammerDB is to specify exactly how you set it up and how you run it.

After launching HammerDB I switched to SQL Server and chose TPC-C as the benchmark option.

DBHammer_TPC-C

Schema Build with 33 warehouses

I used 33 warehouses in this configuration. There is a reason why it is 33 and not, for example, 100; I will cover this later. The build run took a while.

HammerDB_schemaBuild

HammerDB_schemaBuildit

HammerDB is installed and ready to run synthetic workloads.

Go webmin!

1. SQL Server Performance Tuning study with HammerDB – Setup SQL Server

Based on a project with Thomas Kejser last year I started a new blog project to showcase a simple SQL Server performance tuning study. The target is to show some basic tuning options you can use to improve SQL Server performance. I will make use of the 8PP to follow a scientific approach to tuning. But a database system like SQL Server 2014 SP1 is a complex environment, and I will cover only a few topics of the full monitoring and performance tuning toolset SQL Server provides.

The Use Case

Learn SQL Server basic tuning based on the 8PP approach!

Because I don’t have a good real-world workload to tune, I will use the tool HammerDB instead to generate a TPC-C workload. The focus of this study is the tuning itself, so it’s not important which workload I use. I agree that the bottlenecks I will encounter may not represent real-world ones.

The Setup

Hardware:

I will make use of Testverse as the hardware platform.

IMPORTANT: The number of active CPU cores has been set to 2 instead of 4.

The reason for this is that I believe there will be a point in time where the CPU will be a bottleneck. Then I will have a chance to show what influence more CPU power will have.

All tests will run on this machine, so no network or other systems should be involved. Testverse is connected to the LAN and I will use RDP while running the tests. I will keep an eye on the network part and consider that it has little to no influence on the testing.

The OS drive Samsung 840 Basic is replaced by OCZ Agility 2 OCZSSD2-2AGTE120G.

The test SSD/PCIe devices will be freshly installed, so they start with FOB (Fresh Out of the Box) performance in the beginning.

Operating System:

Windows 2012 R2 Datacenter Edition with all current patches (as of 24.08.2015) is installed. Automatic updates will be deactivated for the test period. The Samsung EcoGreen F4 HDD is formatted with NTFS v3.1 with 512 bytes per sector and a 64K NTFS allocation unit size.

windows2012_disk_manager

SQL Server 2014 SP1:

SQL Server will be installed now. The instance root is set to the Samsung EcoGreen F4 HDD (D:\). One requirement is the .NET Framework 3.5 SP1, which can be installed via “Add Roles and Features”.

SQLServer2014_Net351

Feature selection

I select only the features which are really needed for this study. So basically it’s the “Database Engine Services” and the “Management Tools”. Remember, I changed the instance root to the D:\ drive.

SQLServer2014_Feature_selection

Instance Configuration

In this case I make use of the default instance.

SQLServer2014_instance

Server Configuration

I enable all SQL Server services and set the “Startup Type” to Automatic. The Agent and Browser services may be used later.

SQLServer2014_accounts

Database Engine Configuration

I decided to use Mixed Mode. Last time I used HammerDB it was difficult to go with Windows authentication mode. The local administrator group will be SQL Server administrator as well. This is not best practice, but it will work.

SQLServer2014_authentication

Let’s check with the SQL Server Management Studio if we are able to connect and run a simple select.
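A check like the following is enough; it just returns the server version, and any trivial SELECT would do:

```sql
-- trivial connectivity check
SELECT @@VERSION;
```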

SQLServer2014_simple_select

Okay, SQL Server is up and running with the default configuration. I did not make changes to the default settings of SQL Server 2014 SP1 to make sure this test can be easily reproduced.

Go HammerDB!

 

 

Samsung 840 Basic- Baseline tests with FIO based on Windows 2012R2

This post shows the baseline FOB raw peak performance of the Samsung 840 Basic installed in Testverse with Windows 2012 R2. I used “fio” (Flexible I/O Tester) for this test. “fio” is my preferred tool to test SSDs.

I make use of the sample Windows “fio” job files. The specification sheet shows six metrics, but I concentrate on four of them (READ/WRITE bandwidth, random READ/WRITE 4K). I tested with the job files:

  • fio-job-04 = fio random 1M write peak BW
  • fio-job-09 = fio random 1M read peak BW
  • fio-job-010 = fio random 4K read peak IOPS
  • fio-job-011 = fio random 4K write peak IOPS

Samsung840Basic

Command line example:
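For example, running one of the job files from the list above (the job file name is taken from that list; the exact path is up to you):

```shell
fio.exe fio-job-09.ini
```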

Results:

fio-job-09 = fio random 1M read peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 531.59 MB/s.

BW09_compare-result-2Dtrend

fio-job-04 = fio random 1M write peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 134.72 MB/s.

BW04_compare-result-2Dtrend

fio-job-010 = fio random 4K read peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 100293.

IO10_compare-result-2Dtrend

fio-job-011 = fio random 4K write peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 22313.

IO11_compare-result-2Dtrend

 

SanDisk ioMemory/Fusion-io ioDrive – Baseline tests with FIO based on Windows 2012R2

This post shows the baseline FOB raw peak performance of the SanDisk PX600-1000 installed in Testverse with Windows 2012 R2. I used “fio” (Flexible I/O Tester) for this test. “fio” is the preferred tool to test SanDisk/Fusion-io ioDrive/ioMemory/SSD devices.

I make use of the sample Windows “fio” job files. The specification shows four metrics (READ/WRITE bandwidth, random READ/WRITE 4K), which I tested here with the job files:

  • fio-job-04 = fio random 1M write peak BW
  • fio-job-09 = fio random 1M read peak BW
  • fio-job-010 = fio random 4K read peak IOPS
  • fio-job-011 = fio random 4K write peak IOPS

SanDiskPX600-1000_spec

Command line example:

Results:

fio-job-09 = fio random 1M read peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 2590.2MB/s

BW9_compare-result-2Dtrend

fio-job-04 = fio random 1M write peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 1337.1MB/s.

BW4_compare-result-2Dtrend

fio-job-010 = fio random 4K read peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 271557.

IOPS_10_compare-result-2Dtrend

fio-job-011 = fio random 4K write peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 284429.

IOPS11_compare-result-2Dtrend

FIO (Flexible I/O Tester) Part8 – Interpret and understand the result/output

The result/output of “fio” can be overwhelming because this decent tool does a lot for you. Your job is to feed “fio” with the right options and then interpret the result/output. This post will help you understand the result/output in detail. I know it’s difficult to read, but I am limited a little bit by the WordPress design here and may improve it in the future.

The official documentation

The HOWTO provides some insights into the result/output of “fio”. I copy & paste some parts of the HOWTO and give you more details or summarize other parts.

Output while running

While running, fio shows a one-character status per thread:

  • P = Thread setup, but not started.
  • C = Thread created.
  • I = Thread initialized, waiting or generating necessary data.
  • p = Thread running pre-reading file(s).
  • R = Running, doing sequential reads.
  • r = Running, doing random reads.
  • W = Running, doing sequential writes.
  • w = Running, doing random writes.
  • M = Running, doing mixed sequential reads/writes.
  • m = Running, doing mixed random reads/writes.
  • F = Running, currently waiting for fsync().
  • f = Running, finishing up (writing IO logs, etc.).
  • V = Running, doing verification of written data.
  • E = Thread exited, not reaped by main thread yet.
  • _ = Thread reaped.
  • X = Thread reaped, exited with an error.
  • K = Thread reaped, exited due to signal.

Job overview output

This gives you an overview of the jobs and the options used. It’s useful for checking the heading if you receive only the results of a run but not the command line call or job file.

Data direction output

All details for each data direction are shown here. The most important numbers are:

  • io=
  • bw=
  • iops=
  • issued =
  • lat =

Details inside the box.

Group statistics

Disk statistics

Example: Interpret the result/output of SanDisk PX600-1000 and Samsung EVO 840

I ran the Linux raw peak performance test job-fio-11.ini, which is included in the “fio” sample Linux files, on Testverse.

This means a 4K random write test to show the peak 4K write IOPS.

I invoked the script:

Result for SanDisk PX600-1000:

Result for Samsung EVO 840:

Part 1: Job overview output

  • “g=0”
    • this job belongs to group 0. Groups can be used to aggregate job results.
  • “rw=randwrite”
    • the IO pattern is “random write”
  • “bs=4K-4K/4K-4K/4K-4K”
    • the block size is 4K for all types (read, write, discard). This test only schedules writes. The important part for this job is the 4K-4K in the middle.
  • “ioengine=libaio”
    • the ioengine used is libaio, which means parallel writes will be scheduled. This is not a good test for file system performance because it skips the page cache.
  • “iodepth=32”
    • there will be up to 32 IO units in flight against the device. So there will be a queue filled with up to 32 outstanding IOs most of the time.
  • the version is “fio-2.2.9-26-g669e”
  • 4 threads will be started.
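Pieced together from the overview above, the job file looks roughly like this. This is a sketch, not the exact sample file; the runtime and the device name are assumptions:

```ini
[global]
; async IO engine, paired with direct=1, so the page cache is bypassed
ioengine=libaio
; up to 32 IOs in flight per thread
iodepth=32
rw=randwrite
bs=4k
direct=1
; 4 threads
numjobs=4
runtime=30
time_based

[fio-job-11]
filename=/dev/fioa
```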

Part 2: Data direction output

The header is self-explanatory: Job name = fio-job-11 …

The detailed IO statistics for the job:

  • “write” (remember the job schedules 100% random writes)
  • “io=41119MB”
    • number of MB transferred
  • “bw=1370.5MB/s”
    • write data at a speed of 1370.5MB per second on average
  • “iops=350728”
    • the average number of IOs per second (4K in this case)
  • “runt=30013msec”
    • the job ran for ~30 seconds
  • “slat”, “clat”, “lat” min, max, avg, stdev
    • slat means submission latency and represents the time it took to submit this IO to the kernel for processing.
    • clat means completion latency and represents the time that passes between submission to the kernel and completion of the IO, not including submission latency.
    • lat is the best metric because it represents the whole latency an application would experience. avg slat + avg clat = ~avg lat.
    • Keep an eye on whether the numbers are usec or msec etc.! Compare the PX600-1000 to the EVO 840.
    • See Al Tobey’s blog for some more details.
  • “clat percentile” shows in detail what percentage of IOs completed within which time frame. In this case: 99% of the IOs completed in <=1192 usec = 1.2 msec. This value is often used to ignore the few spikes when testing. The maximum clat has been 13505 usec, which is ~39x longer than the average of 344.
  • “bw” min, max, per, avg, stdev
    • In this case the bandwidth per thread has been 345090 KB/s = ~337 MB/s
  • “lat” works like the clat part.
    • In this case 91.01% of the IOs completed between 250usec and 500usec, which is in line with the avg latency of 360.23usec. Only ~8.7% of the IOs took between 750usec and 2ms. Both together cover nearly 99.8% of all IOs.
  • “cpu”
    • this line is dedicated to the CPU usage of the running job
      • “usr=5.66%”
        • the percentage of CPU usage of the running job at user level
        • 100% would mean one CPU core at full load, depending on whether HT is on or off
      • “sys=12.09%”
        • the percentage of CPU usage of the running job at system/kernel level
      • “ctx=2783241”
        • the number of context switches this thread encountered while running
      • “majf=0” and “minf=32689”
        • the number of major and minor page faults the thread encountered
  • “IO depths:”
    • “32=116.7%…”
      • this number shows that the job was able to keep ~32 IO units in flight
      • I am not sure why it’s >100%
    • “submit: …16=100.0%…”
      • shows how many IOs were submitted in a single submit call. In this case it could be in the range of 8 to 16
      • this is in line with the script, which used the iodepth_batch values
    • “complete: …16=100.0%…”
      • same as submit, but for completed calls.
    • “issued: total=r=0/w=10526416/d=0”
      • 10526416 write IOs have been issued, no reads, no discards, and none of them have been short or dropped

Part 3: Group statistics

  • WRITE
    • “io=41119MB”
      • the same amount of transferred MB as in the job statistics, because it’s only one job
    • “aggrb=1370.5MB/s”
      • aggregated bandwidth of all jobs/threads for group 0
    • “minb=1370.5MB/s maxb=1370.5MB/s”
      • the minimum and maximum bandwidth one thread saw. In this case the minimum is the same as the maximum. I don’t think this is correct! I will clarify this.
    • “mint=30008msec” and “maxt=30008msec”
      • smallest and longest runtime of the jobs. The same, because we ran only one job

Part 4: Disk statistics

  • “fioa: ios=71/767604”
    • 71 read IOs and 767604 write IOs on /dev/fioa
    • I am not sure why there are 71 read IOs. I am pretty sure I didn’t run anything myself in the background. Who knows?
  • “merge=0/0” = the number of IO merges from the IO scheduler
    • no merges here
  • “ticks=0/259476”
    • the number of ticks we kept the drive busy. A sign that the device is saturated.
    • A tick is related to one jiffy. The next lines are only an approximation. Read about the kernel timer for more details.
    • In this example I checked the value of CONFIG_HZ:

      • CONFIG_HZ is set to 250, which means 1 second / 250 = 0.004s = 4ms
      • 259476 ticks * 4ms = ~1037.9s ???
      • I believe this is the cumulated wait time all IOs spent in the queue. If you increase only the iodepth, the value increases linearly.
      • 1037.9s / 32 (iodepth) = ~32.4s, which is a little more than the runtime of 30s
  • “io_queue=256816”
    • total time spent in the disk queue
  • “util=94.63%”
    • the utilization of the drive is 94.63%, which means the drive is nearly saturated with this workload.
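The tick arithmetic above can be checked quickly; the numbers are the ones from the fio output and from Testverse's kernel config:

```python
# Sanity check of the tick math above (numbers taken from the fio output).
CONFIG_HZ = 250                  # kernel timer frequency found on Testverse
tick_seconds = 1 / CONFIG_HZ     # one tick = one jiffy = 4 ms
write_ticks = 259476             # from "ticks=0/259476"
iodepth = 32

total_wait = write_ticks * tick_seconds   # cumulated queue wait time of all IOs
per_slot = total_wait / iodepth           # scaled down by the iodepth

print(round(total_wait, 1))   # ~1037.9 seconds
print(round(per_slot, 1))     # ~32.4 seconds, slightly above the 30 s runtime
```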

This should give you a good idea of which parts of a result/output exist and some insight into how to interpret them. And yes, it’s a tough one.

Go GIMP.

FIO (Flexible I/O Tester) Part7 – Steady State of SSD,NVMe,PCIe Flash with TKperf

The real-world performance of a flash device only shows once the Steady State is reached. In most cases these are not the performance values shown on the vendor’s website.

The SNIA organization defined a specification for how to test flash devices or solid state storage. The industry would like to have a methodology to compare NAND Flash devices with a scientific approach. The need for this specification arises because a NAND Flash device’s write performance heavily depends on the write history of the device. There are three write phases of a NAND Flash device:

  • FOB (Fresh- Out of the Box)
  • Transition
  • Steady State

SNIA_FOB_Transition_SteadyState

FOB (Fresh- Out of the Box) or Secure Erase(Sanitize?)

A device taken fresh out of the box should provide the best possible write performance for a while. Why? A flash device writes data in 4 KB pages inside of 256 KB blocks. To add additional pages to a partially filled block, the solid-state drive must erase the entire block before writing data back to it.

nand-flash-memory-pages-and-blocks

If the flash device fills up, fewer and fewer empty blocks are available. In their place are partially filled blocks. The NAND Flash device can’t just write the new data to these partially filled blocks — that would erase the existing data. Instead of a simple write operation, the NAND Flash device has to read the value of the block into its cache, modify the value with the new data, and then write it back. (Write Amplification)

Often, when you would like to test a device, some data has already been written to it. This means you can’t test the FOB performance anymore. For this it is possible to “Secure Erase” the device. This feature was originally introduced to delete all data on a flash device securely, which means that all pages/blocks will be zeroed, even the blocks which are over-provisioned (not visible to the OS). But it can also be used to restore the FOB performance for a while. The vendors provide tools for this. Be careful: some vendors offer both Sanitize and Secure Erase as features, but the implementations differ. A Secure Erase may only delete the mapping table and not the blocks themselves.

Transition

The transition is the phase between the good performance of FOB and Steady State. Most of the time the performance drops continuously over time and the write cliff appears until the Steady State is reached.

Steady State

The scientific definition for Steady State is:

Basically this means: Use a predefined write/read pattern and run this against the device until the performance of the device will be stable over time.

But should you run the test yourself? First, a little bit of math:

The pseudo code for IOPS states:

The whole block will be run up to 25 times, depending on whether Steady State has already been reached. Each run takes 1 minute.

25 (maximum runs) * 7 (R/W mixes) * 8 (block sizes) = 1400 minutes = ~23.3h

The test could run for ~24h and it will write a lot. I strongly advise against running the tests yourself unless you accept that the device loses lifetime.
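The worst case can be sketched like this. The concrete mixes and block sizes are taken from the PTS spec; treat the exact lists as an illustration, not as the tkperf implementation:

```python
# Worst-case runtime of the SNIA PTS IOPS test, matching the math above.
rw_mixes = ["100/0", "95/5", "65/35", "50/50", "35/65", "5/95", "0/100"]  # 7 R/W mixes
block_sizes = ["1024k", "128k", "64k", "32k", "16k", "8k", "4k", "512"]   # 8 block sizes
max_rounds = 25        # the loop stops earlier once Steady State is detected
minutes_per_point = 1  # each measurement runs for 1 minute

total_minutes = max_rounds * len(rw_mixes) * len(block_sizes) * minutes_per_point
print(total_minutes)                  # 1400 minutes
print(round(total_minutes / 60, 1))   # ~23.3 hours
```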

There are some scripts and tools which can be useful to test the device which are based on “fio”.

  • a bash script by James Bowen. This script runs for ~24 hours and does not stop even when Steady State is reached
  • tkperf by Georg Schönberger, which I prefer
  • the block-storage git project by Jason Read, which is aimed at cloud environments. A full implementation of PTS can’t be done in cloud environments. (For example: Secure Erase)

Using TKperf with SanDisk PX600-1000

TKperf is a Python script which implements the full SNIA PTS specification. On Ubuntu it’s really easy to install.

After tkperf and all its dependencies are installed, I started a test. I tested a SanDisk PX600-1000 installed in Testverse. Because you can’t run “hdparm” to Secure Erase the PX600 PCIe device, Georg Schönberger implemented a new option “-i fusion” which leverages the SanDisk command-line tools to Secure Erase the device. Again: thank you @devtux_at!

The following command runs all four SNIA PTS tests (IOPS, Latency, Throughput, Write Saturation). I ran with 4 jobs, an iodepth of 16 and used refill buffers to avoid compression by the device. The file test.dsc is a simple text file which describes the drive, because “hdparm” can’t get info about the PX600.

REMEMBER: ALL data will be lost and your device loses lifetime or may even be destroyed!

Results:

tkperf generates some nice png files which summarize the testing. The tests reached Steady State after 4-5 rounds.

The device was formatted with a 512 byte sector size. This was needed to run all tests. Formatting the device with a 4K sector size would improve the performance for all bigger block sizes!

IOPS

PX600-1000-IOPS-mes3DPlt

Latency

This is the reason why these cards are so nice. Providing < 0.2 ms latency is great!

PX600-1000-LAT-stdyStConvPlt

Throughput

PX600-1000-TP-RW-stdyStConvPlt

Write Saturation

PX600-1000-writeSatIOPSPlt

Go tkperf!

FIO (Flexible I/O Tester) Part6 – Sequential read and readahead

In the last read tests I found that the sequential read IOPS were higher than expected. I left “invalidate” at its default, which means the file used for the test should be dropped from the page cache when the test starts. So why are the IOPS higher than the raw device performance? I found that the readahead is responsible for this.

Important: readahead only comes into play when the read goes through the page cache. This means “direct=0”.

Set and get the readahead value

You can use the tool “blockdev” to show the readahead value:

or set the size with:
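For example (assuming the device under test is /dev/sda):

```shell
# show the current readahead value
blockdev --getra /dev/sda

# set the readahead value to 256
blockdev --setra 256 /dev/sda

# show the physical block size in bytes
blockdev --getpbsz /dev/sda
```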

Example with different readahead values:

I set the readahead value to 128 and run this file: readahead
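The job file itself is lost, but from the numbers below (25600 reads of 4K = 100MB, buffered sequential read) it must have looked roughly like this; the file name testfile1 is an assumption:

```ini
[readahead]
; sequential read
rw=read
bs=4k
size=100m
; buffered IO, so readahead comes into play
direct=0
ioengine=sync
filename=testfile1
```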

readahead_128

“issued: total=r=25600” shows that 25600 IOs have been issued, but “sda: ios=561” shows that only 561 hit the device. So we can estimate that to read 100MB with 561 IOs, each IO needs to be ~182KB. This seems to be higher than 128 (readahead value) * 512 bytes (default sector size) = 64KB. Something is wrong! I ran:

With an output of 4096. Okay, the physical block size is 4096 bytes and not 512 bytes. 128 (readahead value) * 4096 bytes (physical block size) = 512KB. This fits the observed ~182KB much better than 64KB does.

I set the readahead value to 256 and run the same test again.

readahead_256

296 device IOs means around half the IOs of the last test.

I set the readahead value to 512 and run the same test again.

readahead_512

Again around the half.

So this value can have an impact on the sequential read performance of your device. But most of the time sequential reads are not the bottleneck in a typical environment. Even HDDs can provide really fast sequential reads as long as the fragmentation is under control.

Go leofs.

FIO (Flexible I/O Tester) Part5 – Direct I/O or buffered (page cache) or raw performance?

Before you start to run tests against devices you should know that most operating systems make use of a DRAM cache for IO devices. An example is the page cache in Linux, which “sometimes also called disk cache, is a transparent cache for the pages originating from a secondary storage device such as a hard disk drive (HDD)”. Windows uses a similar but different approach to page caching, but for convenience I will use page cache as a synonym for both approaches.

Direct I/O means an IO access where you bypass the cache (page cache).

RAW performance

If we want to measure the performance of an IO device, should we use these caching techniques or avoid them? I believe it makes sense to run a baseline of your device without the influence of any file system or page cache. The implementations of page caches differ between operating systems, which means results for the same device may vary a lot. File systems also introduce a huge variance when running tests. For example: if someone says that a device with NTFS reaches ~50.000 IOPS/4K, have you ever asked which version of NTFS? You should have. See Memory changes in NTFS with Windows Server 2012 compared to version 1.0.

In a real-world workload the influence of a page cache may be really important and should not be underestimated or ignored. But to measure the raw performance of a device, bypass any file system or page cache.

Page caching vs Direct I/O vs RAW performance

Linux Page Caching:

“fio” per default invalidates the buffer/page cache for the used files when it starts. I believe this is done to avoid the influence of the page cache when running. But remember, while “fio” is running, the page cache will build up and may have an influence on your test.

We need to set the option “invalidate=0” to keep the pages in memory and test with the page cache.

Page cache in action on Testverse:

jobfile: cache-run
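Reconstructed from the numbers below (2560 IOs of 4K = 10MB), the job looked roughly like this; the file name is an assumption:

```ini
[cache-run]
rw=randread
bs=4k
size=10m
; keep the page cache for the test file
invalidate=0
direct=0
ioengine=sync
iodepth=1
filename=testfile1
```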

The first run looked like that:

fio_cache-1-run

The marked parts show that 2560 IOs of 4K have been issued. But only 1121 IOs hit the device. So 1439 IOs seem to have been answered via the page cache. So let’s run it again, because the file should be in the page cache now.

fio_cache-2-run

The second run proved it. Zero IOs hit the device, meaning 100% of the IOs were handled by the page cache. And WOHOOO: 853333 IOPS 🙂

How to monitor the page cache?

There is a nice tool called cachestat written by Brendan D. Gregg which is part of the perf-tools. The tool is just a script so there is no need to compile it. Just download and run 🙂

cachestat

The marked part shows a “cache-run”.

free -m can give you some details as well.

You can run this command to clear the page cache.
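Under Linux this is (run as root; sync first so dirty pages are written back):

```shell
sync
echo 3 > /proc/sys/vm/drop_caches
```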

Example:

free-m

I ran “free -m” to show the actual state. Then I ran the “cache-run” twice with a size of 1000MB. The second run was MUCH faster. Then “free -m” showed that “used” and “cached” had increased by 1000. This means the testfile1 is fully cached. Then I cleared the cache with echo 3 to /proc/sys/vm/drop_caches. “free -m” shows that cached is near zero.

Windows Page Caching/System cache

This chapter is even more complicated. I will update it soon!

Direct I/O

Direct I/O means an IO access where you bypass the cache (page cache). You are able to force a direct I/O with the option “direct=1”.

jobfile: cache-run-direct
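The only difference to the cache-run job is “direct=1”; a sketch (file name again an assumption):

```ini
[cache-run-direct]
rw=randread
bs=4k
size=10m
; bypass the page cache
direct=1
ioengine=sync
iodepth=1
filename=testfile1
```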

cache-run-direct

Direct I/O is the first step in measuring a storage device’s raw performance. The IOPS dropped a lot compared to the second page cache run: direct 8767 IOPS vs. buffered 853333 IOPS. You may think 8767 IOPS is too few for a SanDisk ioMemory device? Remember, this is a random read with 1 job/thread and the sync ioengine with an iodepth of 1. This means each read IO has to wait until the previous IO is completed.

RAW performance

RAW performance means you schedule workloads against the native block device without a page cache or a file system. This is how most vendors measure their own performance values and present them on their websites.

The option “filename=/dev/sdb” (Linux) or “filename=\\.\PhysicalDrive1” (Windows) uses the second device of your system.

WARNING: The data on the selected device can be lost!!!!!

Please double check that you selected the right device. Re-check any files you downloaded.
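A sketch of such a raw run; /dev/sdb is just the example device from above, adjust it to your system:

```ini
[raw-run]
; the whole block device: data on it will be lost!
filename=/dev/sdb
rw=randread
bs=4k
direct=1
ioengine=sync
iodepth=1
runtime=30
time_based
```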

raw-run

You may have noticed that the IOPS slightly increased to 9142 compared to the direct test run.

Go tomcat.

FIO (Flexible I/O Tester) Part4 – fixed, ranges or finer grained block sizes?

Using fixed block sizes is the most common way of doing storage (device) tests. There is more than one reason for that. The obvious reason is that a lot of tools only support fixed block sizes. Another reason is that storage vendors present their performance counters mostly based on a fixed block size of 4K. But this doesn’t mean it’s the best way to test or measure your storage (device).

There are two common tests which you would like to perform:

  • RAW peak performance
  • Simulating production workload

Raw peak performance means a test where you try to achieve the same or similar values as the vendor presents on its website/documentation. This is by nature the highest possible value (peak) without the influence of page caches or file systems. Why should you do this? It is not to prove the numbers on the vendor website true or false. It is to make sure the device is running as expected and all best practices have been applied.

Simulating production workload means a test where you try to test your storage (device) against a production-like workload. But most of the time you are not able to run the production workload directly on the storage (device). So you need to simulate the production workload with a tool like “fio”, “sqlio“, “swingbench“… etc. If possible, always test with real production workloads instead of simulations.

Fixed block size

You may have used the option “blocksize=4K” or “bs=4K”, which is the way to specify a fixed block size of 4K. But the output shows “BS=4K-4K/4K-4K/4K-4K”. So why 6 numbers instead of one?

There are three parts (“/” is the separator) in the output. The first is related to the block size for reads, the second for writes and the third for discards. 4K-4K means: use a block size from the range of 4K to 4K, which is exactly 4K. But this gives you an idea that ranges could be used. And there is even more than ranges: block sizes can be finer grained.

You are able to specify a fixed block size with different values for read, write and discard.

Example: Let’s run a job with 50% reads and 50% writes. The reads should use a fixed block size of 4K but the writes should use 64K. So we need to set the option “rw=readwrite” or “rw=rw” and the option “rwmixread=50” or “rwmixwrite=50”.

file: read-4K-write-64K

read-4K-write-64K

Block size ranges

Instead of using a fixed block size you can specify a range.

Example: Let’s run a read workload with a block size in the range of 1K to 16K. This can be done by setting “blocksize_range=1K:16K” or “bsrange=1K:16K”. But how will the mix look?

file: read_bs1K-16K
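A sketch of the job file; size and file name are assumptions:

```ini
[read_bs1K-16K]
rw=read
bsrange=1K:16K
size=10m
filename=testfile1
```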

read-bs1K-16K

1400 IOs have been issued. At this point I am not sure about the real distribution. I tried a few calculations but have not been able to provide an explanation yet. The documentation states:

fio will mix the issued io block sizes. The issued io unit will always be a multiple of the minimum value given (also see bs_unaligned).

 Finer grained block sizes

Sometimes it may be useful to control the weight of each block size within a range. This can be done with the option “bssplit”. What could be a reason to control the block sizes? Let’s assume you know a production workload and would like to use “fio” to act similarly inside a new VM/container to evaluate whether the production workload would run fine there.

The format for this option is:
bssplit=blocksize/percentage:blocksize/percentage

for as many block sizes as needed.

Example: Let’s run a read workload with block sizes of 10% 1K, 80% 4K and 10% 16K. This can be done by setting the option “bssplit=1K/10:4K/80:16K/10”.

file: read_bs1K10p_4K80p_16K10p
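A sketch of the job file; size and file name are assumptions:

```ini
[read_bs1K10p_4K80p_16K10p]
rw=read
bssplit=1K/10:4K/80:16K/10
size=10m
filename=testfile1
```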

read_bs1K10p_4K80p_16K10p

2029 IOs issued. The simple math is:

  • 2029 * 0.8 (80% of IOs) * 4K = 6492.8 K
  • 2029 * 0.1 (10% of IOs) * 1K = 202.9 K
  • 2029 * 0.1 (10% of IOs) * 16K = 3246.4 K
  • 6492.8K + 202.9K + 3246.4K = 9942.1K ~ 10MB

Yes, there is a deviation, which means my approach is only a simple approximation of the real algorithm. Remember, you cannot issue half an IO 🙂
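The simple math can be re-checked like this, using the numbers from the run:

```python
# 2029 issued IOs split 10/80/10 across 1K/4K/16K should transfer
# roughly the 10 MB the job was asked to read.
issued = 2029
split = {1: 0.10, 4: 0.80, 16: 0.10}   # block size in KB -> fraction of IOs

total_kb = sum(issued * share * kb for kb, share in split.items())
print(round(total_kb, 1))   # ~9942.1 KB, close to 10 MB (10240 KB)
```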

Go Zookeeper.

FIO (Flexible I/O Tester) Part3 – Environment variables and keywords

FIO is able to make use of environment variables and reserved keywords in job files. This can avoid some work, so that you don’t need to change the job files each time you call them.

Define a variable in a job file

The documentation is pretty nice:

So for example this is an entry in the job file where the size should be set via an environment variable.
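A minimal sketch of such an entry; the variable name SIZE is my choice:

```ini
[global]
size=${SIZE}
```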

Example of using environment variables

I would like to run four different job files and use a different size for each run (1MB, 10MB, 100MB, 1G, 10G, 100G).

  • Job1 is sequential read / 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job2 is sequential write/ 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job3 is random read / 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job4 is random write / 4 KB / 1 thread / ioengine=sync / iodepth=1

I make use of the default settings under Linux. For example, I did not specify the block size, which is 4K per default, but it is best practice to define it explicitly in case a default value changes.

Files:

jobfile1: read_4KB_numjob1_sync_iodepth1

jobfile2: write_4KB_numjob1_sync_iodepth1

jobfile3: randread_4KB_numjob1_sync_iodepth1

jobfile4: randwrite_4KB_numjob1_sync_iodepth1

 

Run fio:

1) Run with SIZE of 1M
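Under Linux the call can look like this, on one line, using the job file names from above:

```shell
SIZE=1M fio read_4KB_numjob1_sync_iodepth1 write_4KB_numjob1_sync_iodepth1 randread_4KB_numjob1_sync_iodepth1 randwrite_4KB_numjob1_sync_iodepth1
```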

fio_run_4_files_with_env

Yes, this could be done in one job file, but often you are using predefined files from a vendor or a website like this one and just want to call them with your own size settings.

2) Run with SIZE of 10M

…. and so on.

Reserved keywords:

There are 3 documented reserved keywords:

  • $pagesize = the architecture page size of the running system
  • $mb_memory = megabytes of total memory in the system
  • $ncpus = number of online available CPUs

You can use them on the command line or in job files. They are substituted with the current system values when you run the job.

Use cases?

Let’s say you would like to run a job with as many threads as there are online CPUs.

file: read-x-threads
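The job file is not preserved here; a sketch that matches the description (block size, size and file name are assumptions):

```ini
[read-x-threads]
rw=read
bs=4k
size=10m
; reserved keyword: number of online CPUs
numjobs=$ncpus
filename=testfile1
```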

In my case (Testverse) there are 4 CPU cores with Hyper-Threading (HT) on, which means 8 threads.

read-x-threads

Go Jenkins!