Category Archives: fio (Flexible IO Tester)

FIO (Flexible I/O Tester) Part9 – fio2gnuplot to visualize the output

The Linux build of "fio" ships with a tool called fio2gnuplot. This tool processes the log files of "fio" and uses gnuplot to generate nice graphics. Gnuplot is a portable, command-line driven graphing utility which is freely distributed.

This example shows the distribution of IOPS for different block sizes and different read/write mixes:

PX600-1000-IOPS-mes3DPlt

Requirements

I am using "fio" 2.2.10, which was released on 12.09.2015.

Since version 2.1.2, fio2gnuplot has been part of the "fio" release. To generate the graphics you need to install gnuplot.

How to generate the log files?

There are some “fio” options to generate log files.

  • write_bw_log=<Filename>
  • write_iops_log=<Filename>
  • write_lat_log=<Filename>
  • per_job_logs=0/1 (requires fio > 2.2.8, so not available in the Windows build as of 16.09.2015)

write_bw_log generates a log file with the bandwidth details of the job; the other options work accordingly. If you don't set per_job_logs=0, then one file is written per thread (numjobs=X). Most of the time this is not what you want, because you would like to generate graphics based on all threads. An issue I found is that the default patterns of fio2gnuplot (-b / -i) will not work in that case, because they search for the file endings *_bw.log and *_iops.log, while the files actually end with *_bw.X.log and *_iops.X.log. It should be fixed with this commit.

If per_job_logs=0 is set and all log file options have been set:

  • write_bw_log=fio-test
  • write_iops_log=fio-test
  • write_lat_log=fio-test

then 5 files will be generated:
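With the job name fio-test from above, the five files should be (the exact names can differ between fio versions, so treat this as a sketch):

    fio-test_bw.log
    fio-test_iops.log
    fio-test_lat.log
    fio-test_slat.log
    fio-test_clat.log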

What does a log file look like?

The first column is the elapsed time in milliseconds. The second column is the measured value, here the bandwidth in KB/s. The third column is the data direction: 0 indicates that the row is related to reads, 1 that it is related to writes. The fourth column is the block size in bytes, here 4096 bytes (4K).
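A single line of such a bandwidth log could look like this (illustrative values, not taken from a real run):

    500, 345090, 0, 4096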

Using fio2gnuplot

fio2gnuplot works in two major phases. The first phase is to generate the input files for gnuplot and to do some calculations on the data, like the average, minimum and maximum.

Starting fio2gnuplot -b will search for all bandwidth files in the local directory and generate the input files for gnuplot. The option "-i" is the default pattern for IOPS files. There is no default pattern for latency.

fio2gnuplot_phase1

The second phase is to generate the graphics. The option "-g" can be used for this. By default "-g" deletes the input files for gnuplot. The option "-k" can be used to keep these files for later editing. If you want to make changes to the output, you are able to edit the gnuplot files like the mygraph file.
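Both phases can be combined in one call, for example for the bandwidth logs (a minimal example; -k keeps the intermediate gnuplot files):

    fio2gnuplot -b -g -k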

fio2gnuplot_phase2

And this is the output of fio-test_bw-2Draw.png

fio-test_bw-2Draw

Using fio2gnuplot to compare files with the default pattern -b or -i

You can copy all log files into the same directory and call fio2gnuplot with the right pattern. I make use of "-b" for bandwidth comparisons.

fio2gnuplot_compare

And this is the output of compare-result-2Dsmooth.png

compare-result-2Dsmooth

Using fio2gnuplot to compare files with a custom pattern

Sometimes the default patterns will not work. For example, there is no pattern for the latency output. For this case you can specify your own pattern with the option "-p <pattern>" and use a title. WARNING: Using the pattern "*.log" will raise an error. I fixed this, so in the future this should work.
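A possible call for comparing latency logs could look like this (the pattern and the title are just examples; check fio2gnuplot -h for the exact options of your version):

    fio2gnuplot -p '*_lat.log' -g -t 'Latency comparison'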

compare-result-pattern

And this is the output of compare-lat-2Dsmooth.png

compare-lat-2Dsmooth

Go Keepass2.

 

Samsung 840 Basic – Baseline tests with FIO based on Windows 2012R2

This post shows the baseline FOB raw peak performance of the Samsung 840 Basic, which is installed in Testverse with Windows 2012R2. I used "fio" (Flexible IO Tester) for this test. "fio" is my preferred tool to test SSDs.

I make use of the sample Windows "fio" job files. The specification shows six metrics, but I concentrate on four of them (READ/WRITE bandwidth, random READ/WRITE 4K). I tested with the job files:

  • fio-job-04 = fio random 1M write peak BW
  • fio-job-09 = fio random 1M read peak BW
  • fio-job-010 = fio random 4K read peak IOPS
  • fio-job-011 = fio random 4K write peak IOPS

Samsung840Basic

Command line example:
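The call could have looked like this (assuming the sample files use the .ini extension; the output file name is just an example):

    fio --output=fio-job-09-result.txt fio-job-09.ini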

Results:

fio-job-09 = fio random 1M read peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 531.59 MB/s.

BW09_compare-result-2Dtrend

fio-job-04 = fio random 1M write peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 134.72 MB/s.

BW04_compare-result-2Dtrend

fio-job-010 = fio random 4K read peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 100293.

IO10_compare-result-2Dtrend

fio-job-011 = fio random 4K write peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 22313.

IO11_compare-result-2Dtrend

 

SanDisk ioMemory/Fusion-io ioDrive – Baseline tests with FIO based on Windows 2012R2

This post shows the baseline FOB raw peak performance of the SanDisk PX600-1000, which is installed in Testverse with Windows 2012R2. I used "fio" (Flexible IO Tester) for this test. "fio" is the preferred tool to test SanDisk/Fusion-io ioDrive/ioMemory/SSD devices.

I make use of the sample Windows "fio" job files. The specification shows four metrics (READ/WRITE bandwidth, random READ/WRITE 4K), which I tested here with the job files:

  • fio-job-04 = fio random 1M write peak BW
  • fio-job-09 = fio random 1M read peak BW
  • fio-job-010 = fio random 4K read peak IOPS
  • fio-job-011 = fio random 4K write peak IOPS

SanDiskPX600-1000_spec

Command line example:

Results:

fio-job-09 = fio random 1M read peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 2590.2 MB/s.

BW9_compare-result-2Dtrend

fio-job-04 = fio random 1M write peak BW

The following graph shows the IO bandwidth for each thread. The average bandwidth for all four threads is 1337.1MB/s.

BW4_compare-result-2Dtrend

fio-job-010 = fio random 4K read peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 271557.

IOPS_10_compare-result-2Dtrend

fio-job-011 = fio random 4K write peak IOPS

The following graph shows the IOPS for each thread. The average IOPS for all four threads is 284429.

IOPS11_compare-result-2Dtrend

FIO (Flexible I/O Tester) Part8 – Interpret and understand the result/output

The result/output of "fio" can be overwhelming because this decent tool does a lot for you. Your job is to feed "fio" with the right options and then interpret the result/output. This post will help you to understand the output in detail. I know it's difficult to read, but I am limited by the WordPress design here a little bit and may improve it in the future.

The official documentation

The HOWTO provides some insights about the result/output of “fio”. I copy&paste some parts of the HOWTO and give you some more details or summarize other parts.

Output while running

Idle / Run states:

  • P: Thread setup, but not started.
  • C: Thread created.
  • I: Thread initialized, waiting or generating necessary data.
  • p: Thread running pre-reading file(s).
  • R: Running, doing sequential reads.
  • r: Running, doing random reads.
  • W: Running, doing sequential writes.
  • w: Running, doing random writes.
  • M: Running, doing mixed sequential reads/writes.
  • m: Running, doing mixed random reads/writes.
  • F: Running, currently waiting for fsync()
  • f: Running, finishing up (writing IO logs, etc)
  • V: Running, doing verification of written data.
  • E: Thread exited, not reaped by main thread yet.
  • _: Thread reaped.
  • X: Thread reaped, exited with an error.
  • K: Thread reaped, exited due to signal.

Job overview output

This will give you an overview about the jobs and the option used. It’s useful to check the heading if you receive only the results of a run but not the command line call or job file.

Data direction output

All details for each data direction will be shown here. Most important numbers are:

  • io=
  • bw=
  • iops=
  • issued =
  • lat =

Details inside the box.

Group statistics

Disk statistics

Example: Interpret the result/output of SanDisk PX600-1000 and Samsung EVO 840

I ran the Linux raw peak performance test job-fio-11.ini, which is included in the "fio" sample Linux files, on Testverse.

This means a 4K random write test to show the peak 4K write IOPS.
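Based on the options discussed below, the job file is roughly along these lines (a sketch, not a verbatim copy of job-fio-11.ini; the original additionally sets iodepth_batch options, and the filename has to be adjusted for each device):

    [fio-job-11]
    # /dev/fioa is the PX600; for the EVO 840 the filename has to be changed
    filename=/dev/fioa
    rw=randwrite
    bs=4k
    ioengine=libaio
    iodepth=32
    direct=1
    numjobs=4
    runtime=30
    time_based
    group_reporting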

I invoked the script:

Result for SanDisk PX600-1000:

Result for Samsung EVO 840:

Part 1: Job overview output

  • “g=0”
    • this job belongs to group 0 – Groups can be used to aggregate job results.
  •  "rw=randwrite"
    • the IO pattern is random write
  •  “bs=4K-4K/4K-4K/4K-4K”
    • the block size is 4K for all types (read,write,discard). This test is only scheduling  writes. The important part for this job is 4K-4K in the middle.
  •  “ioengine=libaio”
    • the used ioengine is libaio which means parallel writes will be scheduled. This is not a good test for the file system performance because it skips the page cache.
  •  “iodepth=32”
    • there will be up to “32 IO units” in flight against the device. So there will be a queue filled up to 32 outstanding IOs most of the time.
  • the version is “fio-2.2.9-26-g669e”
  • 4 threads will be started.

Part 2: Data direction output

The header is self-explanatory: Job name = fio-job-11 …

The detailed IO statistics for the job:

  • “write” (remember the job schedules 100% random writes)
  • “io=41119MB”
    • number of MB transferred
  • “bw=1370.5MB/s”
    • data was written at a speed of 1370.5 MB per second on average
  • “iops=350728”
    • the average number of IOs per second (4K IOs in this case)
  • “runt=30013msec”
    • The job ran ~30 seconds
  • “slat”, “clat”,”lat” min,max,avg,stdev
    • slat means submission latency and presents the time it took to submit this IO to the kernel for processing.
    • clat  means completion latency and presents the time that passes between submission to the kernel and when the IO is complete, not including submission latency.
    • lat is the best metric which represents the whole latency an application would experience.  The avg slat+ avg clat = ~ avg lat.
    • Keep an eye on whether the numbers are usec or msec etc.! Compare the PX600-1000 to the EVO 840.
    • See Al Tobey blog for some more details.
  • "clat percentile" gives a detailed breakdown of how much IO (in percent) completed within which time frame. In this case: 99% of the IO completed in <=1192 usec = 1.2 msec. This value is often used to ignore the few spikes when testing. The maximum clat was 13505 usec, roughly 40x longer than the average of 344 usec.
  • “bw”  min, max, per,avg, stdev
    • In this case the bandwidth  has been 345090 KB/s = ~337 MB/s
  • “lat” this is like the clat part.
    • In this case 91.01% of the IO completed between 250 usec and 500 usec. This is in line with the avg latency of 360.23 usec. Only ~8.7% of the IO took between 750 usec and 2 ms. Both together cover nearly 99.8% of all IO.
  •  “cpu”
    • this line is dedicated to the CPU usage of the running job
      •  “usr=5.66%”
        • this is the percentage of CPU usage of the running job at user level
        • 100% would mean that one CPU core is fully loaded (whether that is a physical core or a hyper-thread depends on HT being on or off)
      •  “sys=12.09%”
        • this is the percentage of CPU usage of the running job at system/kernel level
      • “ctx=2783241”
        • The number of context switches this thread encountered while running
      • "majf=0" and "minf=32689"
        • the number of major and minor page faults encountered while running
  • “IO depths :
    • “32=116.7%… ”
      • this number showed that this job was able to have ~32 IO units in flight
      • I am not sure why it’s >100%
      • “submit: …..16=100.0%….”
        • shows how many IO were submitted in a single submit call. In this case it could be in the range of 8 to 16
        • This is in line with the script which used iodepth_batch values
      • “complete: …. 16=100.0%….”
        • same like submit but for completed calls.
      • "issued: total=r=0/w=10526416/d=0 …"
        • 10526416 write IO have been issued, no reads, no discards and none of them have been short or dropped

Part 3: Group statistics

  • WRITE
    • “io=41119MB”
      • As in the job statistics, the same amount of transferred MB shows up here, because it's only one job
    • “aggrb=1370.5MB/s”
      • aggregated bandwidth of all jobs/threads for group 0
    • “minb=1370.5MB/s maxb=1370.5MB/s”
      • The minimum and maximum bandwidth one thread saw. In this case the minimum is the same as the maximum. I don't think this is correct! I will clarify this.
    • “mint=30008msec” and “maxt=30008msec”
      • Smallest and longest runtime of one of the jobs. They are the same because we ran only one job.

Part 4: Disk statistics

  • “fioa: ios=71/767604”
    • 71 read IO and 767604 write IO on /dev/fioa
    • I am not sure why there are 71 read IOs. I am pretty sure I didn't run anything myself in the background. Who knows?
  • "merge=0/0" the number of IO merges performed by the IO scheduler
    • no merges here
  • “ticks=0/259476”
    • number of ticks we kept the drive busy. A sign that the device is saturated.
    • A tick is related to one jiffy. The next lines are only an approximation; read about the kernel timer for more details.
    • In this example I checked the value for CONFIG_HZ

      • CONFIG_HZ is set to 250, which means 1 second / 250 = 0.004 s = 4 ms
      • 259476 ticks * 4ms = 1037s  ???
      • I believe this is the cumulated wait time all IO spent in the queue. If you only increase the iodepth, the value increases linearly.
      • 1037 s / 32 (iodepth) = ~32.4 s, which is a little bit more than the runtime of 30 s
  • “io_queue=256816”
    • total time spent in the disk queue
  • “util=94.63%”
    • the utilization of the drive is 94.63%, which means the drive seems to be nearly saturated with this workload

This should give you a good idea which parts of a result/output exist and some insight into how to interpret them. And yes, it's a tough one.

Go GIMP.

FIO (Flexible I/O Tester) Part7 – Steady State of SSD,NVMe,PCIe Flash with TKperf

The real world performance of a flash device is shown when the Steady State is reached. In most cases these are not the performance values which are shown on the vendor's website.

The SNIA organization defined a specification for how to test flash devices or solid state storage. The industry would like to have a methodology to compare NAND flash devices with a scientific approach. The reason why there is a need for this specification is that the write performance of a NAND flash device heavily depends on the write history of the device. There are three write phases of a NAND flash device:

  • FOB (Fresh Out of the Box)
  • Transition
  • Steady State

SNIA_FOB_Transition_SteadyState

FOB (Fresh Out of the Box) or Secure Erase (Sanitize?)

A device taken fresh out of the box should provide the best possible write performance for a while. Why? A flash device writes data in 4 KB pages inside of 256 KB blocks. To add additional pages to a partially filled block, the solid-state drive must erase the entire block before writing data back to it.

nand-flash-memory-pages-and-blocks

If the flash device fills up, fewer and fewer empty blocks are available. In their place are partially filled blocks. The NAND Flash device can’t just write the new data to these partially filled blocks — that would erase the existing data. Instead of a simple write operation, the NAND Flash device has to read the value of the block into its cache, modify the value with the new data, and then write it back. (Write Amplification)

Often when you would like to test a device, some data has already been written to it. This means you can't test the FOB performance anymore. For this it is possible to "Secure Erase" the device. This feature was originally introduced to delete all data on a flash device securely, which means that all pages/blocks will be zeroed, even the blocks which are over-provisioned (not visible to the OS). But it can also be used to optimize the performance and restore the FOB performance for a while. The vendors provide tools for this. Be careful: some vendors offer both Sanitize and Secure Erase as features, but the implementations are different. So a Secure Erase may only delete the mapping table and not the blocks themselves.

Transition

The transition is the phase between the good performance of FOB and Steady State. Most of the time the performance drops continuously over time and the write cliff appears till the Steady State is reached.

Steady State

The scientific definition for Steady State is:

Basically this means: Use a predefined write/read pattern and run this against the device until the performance of the device will be stable over time.

But should you run the test yourself? First a little bit of math:

The pseudo code for IOPS states:

The whole block will be run up to 25 times, depending on whether Steady State is already reached. Each run lasts 1 minute.

25 (maximum runs) * 7 (R/W mixes) * 8 (block sizes) = 1400 minutes = 23.3 h
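As a rough illustration of that loop (my own sketch, not the SNIA pseudo code and not the TKperf implementation; the device path and the fio options are assumptions):

    # rough sketch of the SNIA PTS IOPS loop
    # a real implementation stops early once Steady State is detected
    for round in $(seq 1 25); do
      for mix in 100 95 65 50 35 5 0; do              # read percentage of the R/W mix
        for bs in 1024k 128k 64k 32k 16k 8k 4k 512; do
          fio --name=pts-iops --filename=/dev/fioa --direct=1 --ioengine=libaio \
              --rw=randrw --rwmixread=$mix --bs=$bs --iodepth=16 --numjobs=4 \
              --runtime=60 --time_based --group_reporting
        done
      done
    done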

The test could run for ~24 h and it will write a lot. I strongly advise that you don't run the tests yourself unless you accept that the device loses lifetime.

There are some scripts and tools which can be useful to test the device which are based on “fio”.

  • a bash script by James Bowen. This script runs for ~24 hours and does not stop even if Steady State is reached
  • tkperf by Georg Schönberger which I prefer
  • block-storage git project by Jason Read, which is aimed at cloud environments. A full implementation of the PTS can't be done in cloud environments (for example: Secure Erase)

Using TKperf with SanDisk PX600-1000

TKperf is a Python script which implements the full SNIA PTS specification. With Ubuntu it's really easy to install.

After tkperf and all its dependencies were installed, I started a test. I tested a SanDisk PX600-1000 installed in Testverse. Because you can't run "hdparm" to Secure Erase the PX600 PCIe device, Georg Schönberger implemented a new option "-i fusion" which leverages the SanDisk command-line tools to Secure Erase the device. Again: Thank you @devtux_at

The following command runs all four SNIA PTS tests (IOPS, Latency, Throughput, Write Saturation). I ran with 4 jobs, an iodepth of 16 and used refill buffers to avoid compression on the device. The file test.dsc is a simple text file which describes the drive, because "hdparm" can't get infos about the PX600.

REMEMBER: ALL data will be lost and your device loses lifetime or may even be destroyed!

Results:

tkperf generates some nice PNG files which summarize the testing. The tests reached Steady State after 4-5 rounds.

The device was formatted with a 512 byte sector size. This was needed to run all tests. Formatting the device with a 4K sector size would improve the performance for all bigger block sizes!

IOPS

PX600-1000-IOPS-mes3DPlt

Latency

This is the reason why these cards are that nice. Providing < 0.2 ms latency is great!

PX600-1000-LAT-stdyStConvPlt

Throughput

PX600-1000-TP-RW-stdyStConvPlt

Write Saturation

PX600-1000-writeSatIOPSPlt

Go tkperf!

FIO (Flexible I/O Tester) Part6 – Sequential read and readahead

In the last read tests I found that the sequential read IOPS were higher than expected. I left "invalidate" at its default, which means that the file used for the test should be dropped from the page cache when the test starts. So why are the IOPS higher than the raw device performance? I found that the readahead is responsible for this.

Important: readahead only comes into play when the read is using the page cache. This means "direct=0".

Set and get the readahead value

You can use the tool “blockdev” to show the readahead value:
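For example (assuming /dev/sda is the device under test):

    blockdev --getra /dev/sda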

or set the size with:
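For example, setting it to 256 (again assuming /dev/sda):

    blockdev --setra 256 /dev/sda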

Example with different readahead values:

I set the readahead value to 128 and ran this file: readahead

readahead_128

"issued: total=r=25600" shows that 25600 IOs have been issued, but "sda : ios=561" shows that only 561 hit the device. So we can estimate that to read 100 MB with 561 IOs, each IO needs to be ~182 KB. This seems to be higher than 128 (readahead value) * 512 bytes (default sector size) = 64 KB. Something is wrong! I ran:
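Probably something like this (assuming /dev/sda; the same value can also be read from /sys/block/sda/queue/physical_block_size):

    blockdev --getpbsz /dev/sda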

With an output of 4096. Okay, the physical block size is 4096 bytes and not 512 bytes. 128 (readahead value) * 4096 bytes (physical block size) = 512 KB. That is still not an exact match, but it fits the observed ~182 KB much better than 64 KB.

I set the readahead value to 256 and ran the same test again.

readahead_256

296 IOs on the device is around half as many as in the last test.

I set the readahead value to 512 and ran the same test again.

readahead_512

Again, around half.

So this value can have an impact on the sequential read performance of your device. But most of the time sequential reads are not the bottleneck in a typical environment. Even HDDs can provide really fast sequential reads as long as the fragmentation is under control.

Go leofs.

FIO (Flexible I/O Tester) Part5 – Direct I/O or buffered (page cache) or raw performance?

Before you start to run tests against devices you should know that most operating systems make use of a DRAM cache for IO devices. One example is the page cache in Linux, which is: "sometimes also called disk cache, is a transparent cache for the pages originating from a secondary storage device such as a hard disk drive (HDD)". Windows uses a similar but different approach to page caching, but for convenience I will use page cache as a synonym for both approaches.

Direct I/O means an IO access where you bypass the cache (page cache).

RAW performance

If we want to measure the performance of an IO device, should we use these caching techniques or avoid them? I believe it makes sense to run a baseline of your device without the influence of any file system or page cache. The implementations of the page cache differ between operating systems, which means results for the same device may vary a lot. File systems also introduce a huge variance when running tests. For example: if someone says that a device with NTFS reaches ~50,000 IOPS at 4K, have you ever asked which version of NTFS? You should have. See Memory changes in NTFS with Windows Server 2012 compared to version 1.0.

For sure, in a real world workload the influence of a page cache may be really important and should not be underestimated or ignored. But to measure the raw performance of a device, ignore any file system or page cache.

Page caching vs Direct I/O vs RAW performance

Linux Page Caching:

“fio” per default invalidates the buffer/page cache for the used files when it starts. I believe this is done to avoid the influence of the page cache when running. But remember, while “fio” is running, the page cache will build up and may have an influence on your test.

We need to set the option “invalidate=0” to keep the page/buffer in memory to test with the page cache.

Page cache in action on Testverse:

jobfile: cache-run
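The job file could look roughly like this (a sketch reconstructed from the numbers below; the file name, size and engine are assumptions, invalidate=0 is the option discussed above):

    # sketch: 10M matches the 2560 x 4K IOs seen in the output
    [cache-run]
    filename=testfile1
    size=10M
    rw=randread
    bs=4k
    ioengine=sync
    iodepth=1
    invalidate=0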

The first run looked like that:

fio_cache-1-run

The marked parts show that 2560 IOs of 4K have been issued. But only 1121 IOs hit the device, so 1439 IOs seem to have been answered by the page cache. Let's run it again, because the file should be in the page cache now.

fio_cache-2-run

The second run proved it. Zero IOs hit the device, which means 100% of the IO was handled by the page cache. And WOHOOO: 853333 IOPS 🙂

How to monitor the page cache?

There is a nice tool called cachestat written by Brendan D. Gregg which is part of the perf-tools. The tool is just a script so there is no need to compile it. Just download and run 🙂

cachestat

The marked part shows a “cache-run”.

free -m can give you some details as well.

You can run this command to clear the page cache.
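As referenced in the example below, this is done by writing to drop_caches (run as root; the sync beforehand is a common addition to flush dirty pages first):

    sync
    echo 3 > /proc/sys/vm/drop_caches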

Example:

free-m

I ran "free -m" to show the actual state. Then I ran the "cache-run" twice with a size of 1000 MB. The second run was MUCH faster. Then "free -m" shows that "used" and "cached" increased by 1000. This means the testfile1 is fully cached. Then I cleared the cache with echo 3 to /proc/sys/vm/drop_caches. "free -m" shows that cached is near zero.

Windows Page Caching/System cache

This chapter is even more complicated. I will update it soon!

Direct I/O

Direct I/O means an IO access where you bypass the cache (page cache). You are able to force a direct I/O with the option “direct=1”.

jobfile: cache-run-direct

cache-run-direct

Direct I/O is the first step in measuring a storage device's raw performance. The IOPS dropped a lot compared to the second page cache run: direct 8767 IOPS vs. buffered 853333 IOPS. You may think that 8767 IOPS are too few for a SanDisk ioMemory device? Remember, this is a random read with 1 job/thread and the sync ioengine with an iodepth of 1. This means each read IO has to wait until the previous IO is completed.

RAW performance

RAW performance means you schedule workloads against the native block device without a page cache or a file system. This is how most vendors measure their own performance values and present them on their website.

The option “filename=/dev/sdb” (Linux) or “filename=\\.\PhysicalDrive1” (Windows) uses the second device of your system.

WARNING: The data on the selected device can be lost!!!!!

Please double check that you selected the right device. Re-check any files you downloaded.
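A sketch of such a raw job (the device path and the other options are assumptions; point it at a disposable test device only):

    # WARNING: writing to a raw device destroys the data on it
    [raw-run]
    filename=/dev/sdb
    rw=randread
    bs=4k
    ioengine=sync
    iodepth=1
    direct=1
    size=10M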

raw-run

You may have noticed that the IOPS slightly increased to 9142 compared to the direct test run.

Go tomcat.

FIO (Flexible I/O Tester) Part4 – fixed, ranges or finer grained block sizes?

Using fixed block sizes is the most common way of doing storage (device) tests. There is more than one reason for that. The obvious reason is that a lot of tools only support fixed block sizes. Another reason is that the storage vendors present their performance counters mostly based on a fixed block size of 4K. But this doesn't mean it's the best way to test or measure your storage (device).

There are two common tests which you would like to perform:

  • RAW peak performance
  • Simulating production workload

Raw peak performance means a test where you try to drive your storage (device) to the same or similar values as the vendor presents in his website/documentation. This is by nature the highest possible value (peak), without the influence of page caches or file systems. Why should you do this? It is not to prove the numbers on the vendor website true or false. It is to make sure the device is running as expected and all best practices have been applied.

Simulating production workload means a test where you try to test your storage (device) against a production-like workload. But most of the time you are not able to run the production workload directly on the storage (device). So you need to simulate the production workload with a tool like "fio", "sqlio", "swingbench"... etc. If possible, always test with real production workloads instead of simulations.

Fixed block size

You may have used the option "blocksize=4K" or "bs=4K", which is the way to specify a fixed block size of 4K. But the output shows "BS=4K-4K/4K-4K/4K-4K". So why 6 numbers instead of one?

There are three parts (/ is the separator) in the output. The first is related to the block size for reads, the second for writes and the third for discards. 4K-4K means use a block size from the range of 4K to 4K, which is exactly 4K. But this gives you an idea that ranges could be used. And there is even more than ranges: block sizes can be fine grained.

You are able to specify fixed block sizes with different values for read, write and discard.

Example: Let's run a job with 50% reads and 50% writes. The reads should use a fixed block size of 4K but the writes should use 64K. So we need to set the option "rw=readwrite" or "rw=rw" and the option "rwmixread=50" or "rwmixwrite=50".

file: read-4K-write-64K
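The job file could look like this (a sketch; size and ioengine are assumptions, the comma-separated bs value sets 4K for reads and 64K for writes):

    [read-4K-write-64K]
    rw=readwrite
    rwmixread=50
    bs=4k,64k
    size=10M
    ioengine=sync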

read-4K-write-64K

Block size ranges

Instead of using a fixed block size you can specify a range.

Example: Let's run a read workload with a block size in the range of 1K to 16K. This can be done by setting "blocksize_range=1K:16K" or "bsrange=1K:16K". But how will the mix look?

file: read_bs1K-16K
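Again a sketch of what the job file presumably contained (size and ioengine are assumptions):

    [read_bs1K-16K]
    rw=read
    bsrange=1k:16k
    size=10M
    ioengine=sync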

read-bs1K-16K

1400 IOs have been issued. At this point I am not sure about the real distribution. I tried a few calculations but have not been able to provide an explanation yet. The documentation states:

fio will mix the issued io block sizes. The issued io unit will always be a multiple of the minimum value given (also see bs_unaligned).

 Finer grained block sizes

Sometimes it may be useful to control the weight of each block size within a range. This can be done with the option "bssplit". What could be a reason to control the block sizes like this? Let's assume you know a production workload and would like to use "fio" to act similarly inside a new VM/container, to evaluate if the production workload would run fine.

The format for this option is:
bssplit=blocksize/percentage:blocksize/percentage

for as many block sizes as needed.

Example: Let's run a read workload with a block size mix of 10% 1K, 80% 4K and 10% 16K. This can be done by setting the option "bssplit=1K/10:4K/80:16K/10".

file: read_bs1K10p_4K80p_16K10p
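A sketch of the job file (ioengine is an assumption; the size is chosen to match the ~10 MB calculated below):

    [read_bs1K10p_4K80p_16K10p]
    rw=read
    bssplit=1k/10:4k/80:16k/10
    size=10M
    ioengine=sync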

read_bs1K10p_4K80p_16K10p

2029 IO issued. The simple math is:

  • 2029 * 0.8 (80% of IO) * 4K = 6492.8 K
  • 2029 * 0.1 (10% of IO) * 1K = 202.9 K
  • 2029 * 0.1 (10% of IO) * 16K = 3246.4 K
  • 6492.8 K + 202.9 K + 3246.4 K = 9942.1 K ~ 10 MB

Yes, there is a deviation, which means my approach is only a simple approximation of the real algorithm. Remember, you cannot issue half an IO 🙂

Go Zookeeper.

FIO (Flexible I/O Tester) Part3 – Environment variables and keywords

FIO is able to make use of environment variables and reserved keywords in job files. This can avoid some work, so that you don't need to change the job files each time you call them.

Define a variable in a job file

The documentation is pretty nice:

So for example this is an entry in the job file where the size should be set via an environment variable.
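In a job file this uses the ${VARNAME} syntax, for example:

    size=${SIZE}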

Example of using environment variables

I would like to run four different job files and use a different size for each run (1MB, 10MB, 100MB, 1G, 10G, 100G).

  • Job1 is sequential read / 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job2 is sequential write/ 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job3 is random read / 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job4 is random write / 4 KB / 1 thread / ioengine=sync / iodepth=1

I make use of the default settings under Linux. For example, I did not specify the block size, which is 4K per default, but it is best practice to define such options in case a default value changes.

Files:

jobfile1: read_4KB_numjob1_sync_iodepth1

jobfile2: write_4KB_numjob1_sync_iodepth1

jobfile3: randread_4KB_numjob1_sync_iodepth1

jobfile4: randwrite_4KB_numjob1_sync_iodepth1
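As a sketch, the first of these files could look like this (the other three differ only in the rw option):

    [read_4KB_numjob1_sync_iodepth1]
    rw=read
    bs=4k
    numjobs=1
    ioengine=sync
    iodepth=1
    size=${SIZE}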

 

Run fio:

1) Run with SIZE of 1M

fio_run_4_files_with_env
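The call could have looked like this (assuming the job files are in the current directory and named after the jobs):

    SIZE=1M fio read_4KB_numjob1_sync_iodepth1 \
                write_4KB_numjob1_sync_iodepth1 \
                randread_4KB_numjob1_sync_iodepth1 \
                randwrite_4KB_numjob1_sync_iodepth1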

Yes this could be done in one job file but often you are using predefined files from a vendor or a website like this and just want to call them with your size settings.

2) Run with SIZE of 10M

…. and so on.

Reserved keywords:

There are 3 documented reserved keywords:

  • $pagesize: The architecture page size of the running system
  • $mb_memory: Megabytes of total memory in the system
  • $ncpus: Number of online available CPUs

You can use them in the command line or in the jobs files. They are substituted with the current system values when you run the job.

Use cases?

Let's say you would like to run a job with as many threads as there are online CPUs.

file: read-x-threads
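A sketch of what the job file could look like (block size and size are assumptions):

    [read-x-threads]
    rw=read
    bs=4k
    size=10M
    numjobs=$ncpus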

In my case (Testverse) there are 4 CPU cores with Hyper-Threading (HT) on, which means 8 threads.

read-x-threads

Go Jenkins!

 

FIO (Flexible I/O Tester) Part2 – First run and defining job files

First of all, there is an official HOWTO from Jens Axboe. So why am I writing a blog series?

  1. Because for someone new to fio it may be overwhelming.
  2. There are good examples of how to use it, but hardly any real world outputs/results with a step-by-step interpretation.
  3. I want to fully understand this tool. Remember 8PP – 4.2
  4. I want to use it in my upcoming storage performance post
  5. Increase the awareness of some fio features most people don't know about (fio2gnuplot, ioengine=rbd or cpuio)

1. First run

I am working with Testverse on a Samsung 840 at /dev/sda (in my case not root) with ext4 mounted at “/840”.

fio runs jobs (workloads) you define. So let's start with the minimum parameters needed, because there are a lot.
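The minimal call could have looked like this (only a job name and a size are given; everything else uses the defaults):

    fio --name=first-run --size=1M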

first-run

So what happened?

fio ran one job called "first-run". We did not specify what that job should do, except that the job should run until 1 MB has been transferred (size=1M). But what data has been transferred, to where and how? Don't get confused. You don't need to understand the whole output (Block1-Block8) right now.

So fio used some default values in this case which can be seen in Block 1.

fio used the defaults and ran:

  1. one job which is called “first-run”
  2. This job belongs to “group 0”
  3. created a new file with 1MB file size
  4. scheduled “sequential read” against the file
  5. it read 256 times x 4KB blocks

A detailed explanation can be found here.

2. Job Files

If you don't want to type long commands in your terminal every time you call fio, I advise you to use job files instead. To avoid confusion between file names and options etc., I call the job files "jobfile1", "jobfile2"... but it is best practice to give them meaningful names like "read_4k_sync_1numjob".

Job files define the jobs fio should run.

They are structured like classic INI files. Let's write a file which runs the same job as in 1. First run.

File: jobfile1
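Such a file only needs the job name and the size; a sketch equivalent to the first run above:

    [first-run]
    size=1M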

Now let's run it:

fio_jobfile1

Easy, right?

How  to define a job file with 2 jobs which should run in order?

So let's write a file which contains 2 jobs. The first job should read 1 MB sequentially and the second should write 1 MB sequentially.

File: jobfile2
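A sketch of such a file (the stonewall option makes the second job wait until the first one has finished; without it fio would start both jobs at the same time):

    [job1-read]
    rw=read
    size=1M

    [job2-write]
    rw=write
    size=1M
    stonewall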

So job1-read will run first and then job2-write will run.

How  to define a job file with 2 jobs which should run in order with the same workload?

Now we can make use of the global section to define defaults for all jobs, as long as they don't override them in their own section.

File: jobfile3
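A sketch with the shared options moved to the [global] section; both jobs inherit the workload from there and stonewall again keeps them sequential:

    [global]
    rw=read
    size=1M
    bs=4k

    [job1-read]

    [job2-read]
    stonewall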

Go Firefox.