Category Archives: Storage Performance

FIO (Flexible I/O Tester) Part7 – Steady State of SSD,NVMe,PCIe Flash with TKperf

The real-world performance of a flash device shows only once Steady State is reached. In most cases these are not the performance values shown on the vendor's website.

SNIA defined a specification for testing flash devices (Solid State Storage), because the industry wanted a methodology for comparing NAND flash devices with a scientific approach. The specification is needed because the write performance of a NAND flash device depends heavily on its write history. A NAND flash device goes through three write phases:

  • FOB (Fresh Out of the Box)
  • Transition
  • Steady State

SNIA_FOB_Transition_SteadyState

FOB (Fresh Out of the Box) or Secure Erase (Sanitize?)

A device taken fresh out of the box should provide the best possible write performance for a while. Why? A flash device writes data in 4 KB pages inside of 256 KB blocks. To rewrite pages in a block that already holds data, the solid-state drive must erase the entire block before writing data back to it. On a fresh device the blocks are still empty, so this extra step is not needed.

nand-flash-memory-pages-and-blocks

If the flash device fills up, fewer and fewer empty blocks are available. In their place are partially filled blocks. The NAND flash device can't simply write the new data into these partially filled blocks, because that would erase the existing data. Instead of a simple write operation, the device has to read the contents of the block into its cache, modify them with the new data, and then write the whole block back. This extra work is called write amplification.
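To get a feeling for the numbers, here is a minimal Python sketch of the worst case for the page and block sizes mentioned above. The function name and the simple "rewrite the whole block" model are my own assumptions for illustration; real controllers use garbage collection and over-provisioning, so real values will differ.

```python
def worst_case_write_amplification(page_kb, block_kb, pages_updated=1):
    """Data physically written divided by data the host asked to write,
    assuming the whole block is rewritten (simplified, hypothetical model)."""
    pages_per_block = block_kb // page_kb
    if not 1 <= pages_updated <= pages_per_block:
        raise ValueError("pages_updated must be between 1 and pages_per_block")
    return block_kb / (pages_updated * page_kb)


pages_per_block = 256 // 4  # 4 KB pages inside a 256 KB block
waf_single_page = worst_case_write_amplification(page_kb=4, block_kb=256)
print(pages_per_block, waf_single_page)  # 64 64.0
```

In this model, changing a single 4 KB page costs 64 times as much physical writing as the host requested, while rewriting all 64 pages at once costs nothing extra.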

Often, when you want to test a device, some data has already been written to it, which means you can no longer test the FOB performance. In that case you can “Secure Erase” the device. This feature was originally introduced to delete all data on a flash device securely, meaning that all pages/blocks are zeroed, even the over-provisioned blocks which are not visible to the OS. But it can also be used to restore the FOB performance for a while. The vendors provide tools for this. Be careful: some vendors offer both Sanitize and Secure Erase, but the implementations differ. A Secure Erase may only delete the mapping table and not the blocks themselves.

Transition

The transition is the phase between the good performance of FOB and Steady State. Most of the time the performance drops continuously, and the write cliff shows up before Steady State is reached.

Steady State

The scientific definition for Steady State is:

Basically this means: use a predefined write/read pattern and run it against the device until the performance of the device is stable over time.
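A check for "stable over time" could look like the following Python sketch. It follows my reading of the SNIA PTS criterion: within a measurement window of 5 rounds, the spread of the values stays within 20% of their average, and the best linear fit moves by no more than 10% of the average across the window. The window size, both thresholds and the function name are assumptions that should be checked against the specification, and the sample IOPS values are made up.

```python
from statistics import mean


def is_steady_state(values, window=5, max_range=0.20, max_slope=0.10):
    """Return True if the last `window` rounds look stable (assumed criteria)."""
    if len(values) < window:
        return False
    y = values[-window:]
    x = list(range(window))
    avg = mean(y)
    x_avg = mean(x)
    # Least-squares slope of the best linear fit through the window.
    slope = sum((xi - x_avg) * (yi - avg) for xi, yi in zip(x, y)) / sum(
        (xi - x_avg) ** 2 for xi in x
    )
    range_ok = (max(y) - min(y)) <= max_range * avg
    # Excursion of the fitted line from the first to the last round.
    slope_ok = abs(slope) * (window - 1) <= max_slope * avg
    return range_ok and slope_ok


falling = [90000, 70000, 55000, 45000, 40000]  # made-up IOPS per round
stable = [40000, 39500, 40200, 39800, 40100]   # made-up IOPS per round
print(is_steady_state(falling), is_steady_state(stable))  # False True
```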

But should you run the test yourself? First, a little bit of math:

The pseudo code for IOPS states:

The whole block is run up to 25 times, depending on whether Steady State has already been reached. Each run lasts 1 minute.

25 (maximum runs) * 7 (R/W mixes) * 8 (block sizes) = 1400 minutes ≈ 23.3 h
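The same estimate as a few lines of Python, in case you want to play with the numbers. The variable names are mine; the values are the ones from the calculation above.

```python
max_rounds = 25       # maximum number of rounds
rw_mixes = 7          # R/W mixes per round
block_sizes = 8       # block sizes per R/W mix
minutes_per_run = 1   # each run lasts one minute

total_minutes = max_rounds * rw_mixes * block_sizes * minutes_per_run
total_hours = total_minutes / 60
print(total_minutes, round(total_hours, 1))  # 1400 23.3
```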

The test could run for ~24 h and it will write a lot. I strongly advise you not to run the tests yourself unless you accept that the device loses lifetime.

There are some scripts and tools based on “fio” which can be useful for testing the device:

  • a bash script by James Bowen. This script runs for ~24 hours and does not stop even if Steady State is reached
  • tkperf by Georg Schönberger which I prefer
  • block-storage git project by Jason Read which is aimed for cloud environments. A full implementation of PTS can’t be done in cloud environments. (For example: Secure Erase)

Using TKperf with SanDisk PX600-1000

TKperf is a Python script which implements the full SNIA PTS specification. With Ubuntu it's really easy to install.

Once tkperf and all its dependencies were installed, I started a test. I tested a SanDisk PX600-1000 installed in Testverse. Because you can't use “hdparm” to Secure Erase the PX600 PCIe device, Georg Schönberger implemented a new option “-i fusion” which leverages the SanDisk command-line tools to Secure Erase the device. Again: Thank you @devtux_at

The following command runs all four SNIA PTS tests (IOPS, Latency, Throughput, Write Saturation). I ran with 4 jobs and an iodepth of 16, and used refill buffers to prevent the device from compressing the data. The file test.dsc is a simple text file which describes the drive, because “hdparm” can't get information about the PX600.

REMEMBER: ALL data will be lost, and your device loses lifetime or may be destroyed!

Results:

tkperf generates some nice PNG files which summarize the testing. The tests reached steady state after 4-5 rounds.

The device was formatted with a 512-byte sector size, which was needed to run all tests. Formatting the device with a 4K sector size would improve the performance for all larger block sizes!

IOPS

PX600-1000-IOPS-mes3DPlt

Latency

This is the reason why these cards are so nice. Providing < 0.2 ms latency is great!

PX600-1000-LAT-stdyStConvPlt

Throughput

PX600-1000-TP-RW-stdyStConvPlt

Write Saturation

PX600-1000-writeSatIOPSPlt

Go tkperf!

FIO (Flexible I/O Tester) Part6 – Sequential read and readahead

In the last read tests I found that the sequential read IOPS were higher than expected. I left “invalidate” at its default, which means that the file used for the test should be dropped from the page cache when the test starts. So why are the IOPS higher than the raw device performance? I found that readahead is responsible for this.

Important: readahead only comes into play when the read goes through the page cache. This means “direct=0”.

Set and get the readahead value

You can use the tool “blockdev” to show the readahead value:

or set the size with:

Example with different readahead values:

I set the readahead value to 128 and ran this file: readahead

readahead_128

“issued: total=r=25600” shows that 25600 IOs were issued, but “sda : ios=561” shows that only 561 hit the device. So to read 100 MB with 561 IOs, each IO must be ~182 KB on average. This is higher than 128 (readahead value) * 512 bytes (default sector size) = 64 KB. Something is wrong! I ran:

With an output of 4096. Okay, the physical block size is 4096 bytes and not 512 bytes. 128 (readahead value) * 4096 bytes (physical block size) = 512 KB. As an upper bound per request, this fits the observed average of ~182 KB much better than 64 KB does.

I set the readahead value to 256 and ran the same test again.

readahead_256

296 IOs hit the device, which is around half as many as in the last test.

I set the readahead value to 512 and ran the same test again.

readahead_512

Again around the half.
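The arithmetic behind these runs can be written down as a small Python sketch. It only uses the numbers quoted above (100 MB read, 561 and 296 IOs on the device); the helper name is mine and I treat MB as MiB for simplicity.

```python
def avg_request_kib(total_mib, device_ios):
    """Average size of the requests that actually reached the device."""
    return total_mib * 1024 / device_ios


readahead_estimate_kib = 128 * 512 / 1024   # 128 * 512 bytes
avg_ra128 = avg_request_kib(100, 561)       # run with readahead 128
avg_ra256 = avg_request_kib(100, 296)       # run with readahead 256
print(readahead_estimate_kib, round(avg_ra128, 1), round(avg_ra256, 1))
# 64.0 182.5 345.9
```

Doubling the readahead value roughly doubles the average request size, which is why the number of IOs on the device is roughly halved.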

So this value can have an impact on the sequential read performance of your device. But most of the time sequential reads are not the bottleneck in a typical environment. Even HDDs can provide really fast sequential reads as long as fragmentation is under control.

Go leofs.

FIO (Flexible I/O Tester) Part5 – Direct I/O or buffered (page cache) or raw performance?

Before you start running tests against devices, you should know that most operating systems use a DRAM cache for IO devices. In Linux this is the page cache, which is: “sometimes also called disk cache, is a transparent cache for the pages originating from a secondary storage device such as a hard disk drive (HDD)”. Windows uses a similar but different approach, but for convenience I will use “page cache” as a synonym for both approaches.

Direct I/O means an IO access where you bypass the cache (page cache).

RAW performance

If we want to measure the performance of an IO device, should we use these caching techniques or avoid them? I believe it makes sense to run a baseline of your device without the influence of any file system or page cache. The implementation of the page cache is different in each OS, which means the results for the same device may vary a lot. File systems also introduce a huge variance when running tests. For example, if someone says that a device with NTFS reaches ~50,000 IOPS at 4K, have you ever asked which version of NTFS? You should have. See Memory changes in NTFS with Windows Server 2012 compared to version 1.0.

Of course, in a real-world workload the influence of the page cache may be really important and should not be underestimated or ignored. But to measure the raw performance of a device, leave out any file system and page cache.

Page caching vs Direct I/O vs RAW performance

Linux Page Caching:

By default, “fio” invalidates the buffer/page cache for the files it uses when it starts. I believe this is done to avoid the influence of the page cache during the run. But remember, while “fio” is running, the page cache will build up and may have an influence on your test.

To test with the page cache, we need to set the option “invalidate=0” so that the pages/buffers stay in memory.

Page cache in action on Testverse:

jobfile: cache-run

The first run looked like that:

fio_cache-1-run

The marked parts show that 2560 IOs of 4K were issued, but only 1121 IOs hit the device. So 1439 IOs seem to have been answered from the page cache. Let's run it again, because the file should be in the page cache now.
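As a quick sanity check, the share of IOs served from the page cache in this first run can be computed like this. It is just a small Python sketch with the two numbers from the output; the variable names are mine.

```python
issued_ios = 2560   # IOs issued by fio
device_ios = 1121   # IOs that reached the device

cache_ios = issued_ios - device_ios
hit_ratio = cache_ios / issued_ios
print(cache_ios, f"{hit_ratio:.1%}")  # 1439 56.2%
```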

fio_cache-2-run

The second run proved it. Zero IOs hit the device, which means 100% of the IOs were handled by the page cache. And WOHOOO: 853333 IOPS 🙂

How to monitor the page cache?

There is a nice tool called cachestat written by Brendan D. Gregg which is part of the perf-tools. The tool is just a script so there is no need to compile it. Just download and run 🙂

cachestat

The marked part shows a “cache-run”.

free -m can give you some details as well.

You can run this command to clear the page cache.

Example:

free-m

I ran “free -m” to show the current state. Then I ran “cache-run” twice with a size of 1000MB. The second run was MUCH faster. Afterwards “free -m” shows that “used” and “cached” have increased by 1000, which means testfile1 is fully cached. Then I cleared the cache by writing 3 to /proc/sys/vm/drop_caches. “free -m” shows that “cached” is near zero again.
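If you prefer to check the cached value from a script, a minimal Python sketch could parse the “Cached” field from the text of /proc/meminfo. The function names are mine and the sample text below is made up for illustration; it is not output from Testverse.

```python
def parse_meminfo(text):
    """Return {field: value in kB} for lines like 'Cached:  1024000 kB'."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def cached_mib(text):
    """Size of the page cache in MiB according to the 'Cached' field."""
    return parse_meminfo(text)["Cached"] // 1024


# Made-up sample in the format of /proc/meminfo.
SAMPLE = """MemTotal:       65000000 kB
MemFree:        60000000 kB
Cached:          1024000 kB
"""
print(cached_mib(SAMPLE))  # 1000
```

On a Linux system you would pass the contents of /proc/meminfo instead of the sample, once before and once after a run, and compare the two values.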

Windows Page Caching/System cache

This chapter is even more complicated. I will update it soon!

Direct I/O

Direct I/O means an IO access where you bypass the cache (page cache). You are able to force a direct I/O with the option “direct=1”.

jobfile: cache-run-direct

cache-run-direct

Direct I/O is the first step in measuring the raw performance of a storage device. The IOPS dropped a lot compared to the second page cache run: direct 8767 IOPS vs. buffered 853333 IOPS. You may think that 8767 IOPS is too low for a SanDisk ioMemory device? Remember that this is a random read with 1 job/thread and the sync engine with an iodepth of 1. This means each read IO has to wait until the previous IO has completed.
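Because only one IO is in flight at a time, the IOPS value translates directly into an average time per IO. Here is a rough Python sketch of that relation; the helper name is mine, and it ignores any overhead between two IOs.

```python
def avg_latency_us(iops, iodepth=1):
    """Average time per IO in microseconds when `iodepth` IOs are in flight."""
    return iodepth * 1_000_000 / iops


direct_us = avg_latency_us(8767)      # direct I/O run
buffered_us = avg_latency_us(853333)  # second page cache run
print(round(direct_us, 1), round(buffered_us, 2))  # 114.1 1.17
```

So roughly 114 µs per direct read is what limits this job to 8767 IOPS; more jobs or a higher iodepth would be needed to get more out of the device.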

RAW performance

RAW performance means you run workloads against the native block device without a page cache or a file system. This is how most vendors measure their own performance values and present them on their websites.

The option “filename=/dev/sdb” (Linux) or “filename=\\.\PhysicalDrive1” (Windows) uses the second device of your system.

WARNING: The data on the selected device can be lost!!!!!

Please double check that you selected the right device. Re-check any files you downloaded.

raw-run

You may have noticed that the IOPS slightly increased to 9142 compared to the direct test run.

Go tomcat.

FIO (Flexible I/O Tester) Part4 – fixed, ranges or finer grained block sizes?

Using fixed block sizes is the most common way of doing storage (device) tests. There is more than one reason for that. The obvious reason is that a lot of tools only support fixed block sizes. Another reason is that storage vendors mostly present their performance numbers based on a fixed block size of 4K. But this doesn't mean it's the best way to test or measure your storage (device).

There are two common tests which you would like to perform:

  • RAW peak performance
  • Simulating production workload

Raw peak performance means a test where you try to get your storage (device) to achieve the same or similar values as the vendor presents on its website or in its documentation. This is by nature the highest possible value (peak), without the influence of page caches or file systems. Why should you do this? Not to check whether the numbers on the vendor's website are true or false, but to make sure the device is running as expected and all best practices have been applied.

Simulating production workload means a test where you try to test your storage (device) against a production workload. But most of the time you are not able to run the production workload directly on the storage (device), so you need to simulate it with a tool like “fio”, “sqlio”, “swingbench”, etc. If possible, always test with real production workloads instead of simulations.

Fixed block size

You may have used the option “blocksize=4K” or “bs=4K”, which is the way to specify a fixed block size of 4K. But the output shows “BS=4K-4K/4K-4K/4K-4K”. So why 6 numbers instead of one?

There are three parts in the output (/ is the separator). The first is the block size for reads, the second for writes and the third for discards. 4K-4K means use a block size from the range of 4K to 4K, which is exactly 4K. But this gives you an idea that ranges could be used. And there is even more than ranges: block sizes can be fine grained.
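To make the structure of that string explicit, here is a small Python sketch which splits it into its three ranges. The function name is mine and it only handles the format shown in the output above.

```python
def parse_bs_line(bs):
    """Split 'min-max/min-max/min-max' into read, write and trim ranges."""
    parts = bs.split("/")
    if len(parts) != 3:
        raise ValueError("expected three ranges separated by '/'")
    result = {}
    for name, part in zip(("read", "write", "trim"), parts):
        low, high = part.split("-")
        result[name] = (low, high)
    return result


ranges = parse_bs_line("4K-4K/4K-4K/4K-4K")
print(ranges["read"], ranges["write"], ranges["trim"])
```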

You can specify a fixed block size with different values for read, write and discard.

Example: Let's run a job with 50% reads and 50% writes. The reads should use a fixed block size of 4K but the writes should use 64K. So we need to set the option “rw=readwrite” or “rw=rw” and the option “rwmixread=50” or “rwmixwrite=50”.

file: read-4K-write-64K

read-4K-write-64K

Block size ranges

Instead of using a fixed block size you can specify a range.

Example: Let’s run a read workload with a block size in the range of 1K to 16K. This can be done by setting “blocksize_range=1K:16K” or “bsrange=1K:16K”. But what will the mix look like?

file: read_bs1K-16K

read-bs1K-16K

1400 IOs were issued. At this point I am not sure about the real distribution. I tried a few calculations but have not been able to provide an explanation yet. The documentation states:

fio will mix the issued io block sizes. The issued io unit will always be a multiple of the minimum value given (also see bs_unaligned).

 Finer grained block sizes

Sometimes it may be useful to control the weight of each block size within a range. This can be done with the option “bssplit”. Why would you want to control the block sizes? Let's assume you know a production workload and would like “fio” to behave similarly inside a new VM/container, to evaluate whether the production workload would run fine there.

The format for this option is:
bssplit=blocksize/percentage:blocksize/percentage

for as many block sizes as needed.

Example: Let’s run a read workload with a block size mix of 10% 1K, 80% 4K and 10% 16K. This can be done by setting the option “bssplit=1K/10:4K/80:16K/10”.

file: read_bs1K10p_4K80p_16K10p

read_bs1K10p_4K80p_16K10p

2029 IOs were issued. The simple math is:

  • 2029 * 0.8 (80% of IO) * 4K = 6492.8K
  • 2029 * 0.1 (10% of IO) * 1K = 202.9K
  • 2029 * 0.1 (10% of IO) * 16K = 3246.4K
  • 6492.8K + 202.9K + 3246.4K = 9942.1K ~ 10MB

Yes, there is a deviation, which means my approach is only a simple approximation of the real algorithm. Remember, you cannot issue one half of an IO 🙂
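The same approximation as a small Python sketch. It parses the simple “blocksize/percentage” form used above and multiplies it out; the function names are mine, and it deliberately only handles sizes given in K.

```python
def parse_bssplit(spec):
    """Turn '1K/10:4K/80:16K/10' into [(size_in_K, percent), ...]."""
    entries = []
    for item in spec.split(":"):
        size, percent = item.split("/")
        if not size.upper().endswith("K"):
            raise ValueError("this sketch only handles sizes given in K")
        entries.append((int(size[:-1]), int(percent)))
    return entries


def expected_kib(spec, total_ios):
    """Approximate data volume if the IOs follow the percentages exactly."""
    weighted = sum(total_ios * percent * size for size, percent in parse_bssplit(spec))
    return weighted / 100


split = parse_bssplit("1K/10:4K/80:16K/10")
total_kib = expected_kib("1K/10:4K/80:16K/10", 2029)
print(split, total_kib)  # [(1, 10), (4, 80), (16, 10)] 9942.1
```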

Go Zookeeper.

FIO (Flexible I/O Tester) Part3 – Environment variables and keywords

FIO is able to make use of environment variables and reserved keywords in job files. This can save some work, because you don't need to change the job files each time you call them.

Define a variable in a job file

The documentation is pretty nice:

So for example this is an entry in the job file where the size should be set via an environment variable.

Example of using environment variables

I would like to run four different job files and use a different size for each run (1MB, 10MB, 100MB, 1G, 10G, 100G).

  • Job1 is sequential read / 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job2 is sequential write/ 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job3 is random read / 4 KB / 1 thread / ioengine=sync / iodepth=1
  • Job4 is random write / 4 KB / 1 thread / ioengine=sync / iodepth=1

I make use of the default settings under Linux. For example, I did not specify the block size, which is 4K by default. It is nevertheless best practice to define such values explicitly in case a default changes.

Files:

jobfile1: read_4KB_numjob1_sync_iodepth1

jobfile2: write_4KB_numjob1_sync_iodepth1

jobfile3: randread_4KB_numjob1_sync_iodepth1

jobfile4: randwrite_4KB_numjob1_sync_iodepth1

 

Run fio:

1) Run with SIZE of 1M

fio_run_4_files_with_env

Yes this could be done in one job file but often you are using predefined files from a vendor or a website like this and just want to call them with your size settings.

2) Run with SIZE of 10M

…. and so on.
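The “and so on” part can be scripted. The following Python sketch only builds the list of runs (environment plus command) and prints them; it does not start fio. It assumes the four job files above read their size from the environment variable SIZE, and the helper name is mine.

```python
import os

JOB_FILES = [
    "read_4KB_numjob1_sync_iodepth1",
    "write_4KB_numjob1_sync_iodepth1",
    "randread_4KB_numjob1_sync_iodepth1",
    "randwrite_4KB_numjob1_sync_iodepth1",
]
SIZES = ["1M", "10M", "100M", "1G", "10G", "100G"]


def build_runs(job_files, sizes):
    """Return one (environment, command) pair per size and job file."""
    runs = []
    for size in sizes:
        env = dict(os.environ, SIZE=size)  # assumption: job files use SIZE
        for job_file in job_files:
            runs.append((env, ["fio", job_file]))
    return runs


runs = build_runs(JOB_FILES, SIZES)
for env, command in runs:
    print(f"SIZE={env['SIZE']}", " ".join(command))
```

To actually execute a run you could hand each pair to subprocess.run(command, env=env, check=True).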

Reserved keywords:

There are 3 documented reserved keywords:

  • $pagesize: the architecture page size of the running system
  • $mb_memory: megabytes of total memory in the system
  • $ncpus: number of online available CPUs

You can use them on the command line or in the job files. They are substituted with the current system values when you run the job.
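If you are curious which values to expect on your machine, this Python sketch queries similar numbers from the operating system. It is only an approximation of what fio substitutes, not fio's own code; the function name is mine and the sysconf names are the ones available on Linux.

```python
import os


def fio_keyword_estimates():
    """Approximate the values behind $pagesize, $mb_memory and $ncpus."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    phys_pages = os.sysconf("SC_PHYS_PAGES")
    return {
        "pagesize": page_size,
        "mb_memory": page_size * phys_pages // (1024 * 1024),
        "ncpus": os.sysconf("SC_NPROCESSORS_ONLN"),
    }


estimates = fio_keyword_estimates()
print(estimates)
```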

Use cases?

Let's say you would like to run a job with as many threads as there are online CPUs.

file: read-x-threads

In my case (Testverse) there are 4 CPU cores with Hyper-Threading (HT) on, which means 8 threads.

read-x-threads

Go Jenkins!

 

FIO (Flexible I/O Tester) Part2 – First run and defining job files

First, there is an official HOWTO from Jens Axboe. So why am I writing a blog series?

  1. Because for someone new to fio it may be overwhelming.
  2. There are good examples of using it, but no real-world output/results with a step-by-step explanation of how to interpret them.
  3. I want to fully understand this tool. Remember 8PP – 4.2
  4. I want to use it in my upcoming storage performance post
  5. Increase the awareness of some fio features most people don’t know about (fio2gnuplot, ioengines=rbd or cpuio)

1. First run

I am working with Testverse on a Samsung 840 at /dev/sda (in my case not root) with ext4 mounted at “/840”.

fio runs jobs (workloads) you define. So let's start with the minimum parameters needed, because there are a lot.

first-run

So what happened?

fio ran one job called “first-run”. We did not specify what that job should do, except that it should run until 1 MB has been transferred (size=1M). But what data has been transferred, to where, and how? Don't get confused. You don't need to understand the whole output (Block 1 to Block 8) right now.

So fio used some default values in this case which can be seen in Block 1.

fio used the defaults and ran:

  1. one job which is called “first-run”
  2. This job belongs to “group 0”
  3. created a new file with 1MB file size
  4. scheduled “sequential read” against the file
  5. read 256 blocks of 4KB each

A detailed explanation can be found here.

2. Job Files

If you don't want to type long commands in your terminal every time you call fio, I advise you to use job files instead. To avoid confusion between file names and options, I call the job files “jobfile1”, “jobfile2”… but it is best practice to give meaningful file names like “read_4k_sync_1numjob”.

Job files define the jobs fio should run.

They are structured like classic ini files. Let's write a file which runs the same job as in “1. First run”.

File: jobfile1

Now let's run it:

fio_jobfile1

Easy, isn't it?

How to define a job file with 2 jobs which should run in order?

So let's write a file which contains 2 jobs. The first job should read sequentially and the second should write 1M sequentially.

File: jobfile2

So job1-read will run first and then job2-write will run.

How to define a job file with 2 jobs which should run in order with the same workload?

Now we can use the global section to define defaults which apply to all jobs unless they override them in their own section.

File: jobfile3
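The original listing is not reproduced here. As a stand-in, this Python sketch renders a job file of that shape: a [global] section with the shared settings and two jobs which inherit them. The section names, the option values and the use of “stonewall” (a fio option which makes a job wait for the preceding jobs to finish) are my assumptions, not necessarily what the original file contained.

```python
def render_job_file(sections):
    """Render {section: {option: value}} as ini-style job file text.
    A value of None produces a bare option such as 'stonewall'."""
    lines = []
    for name, options in sections.items():
        lines.append(f"[{name}]")
        for key, value in options.items():
            lines.append(key if value is None else f"{key}={value}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical content for jobfile3.
jobfile3 = render_job_file({
    "global": {"rw": "read", "size": "1M"},  # defaults for all jobs
    "job1": {},                              # inherits everything from [global]
    "job2": {"stonewall": None},             # waits until job1 has finished
})
print(jobfile3)
```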

Go Firefox.

FIO (Flexible I/O Tester) Part1 – Installation and compiling if needed

FIO (Flexible I/O Tester)  is a decent I/O test tool which is often used to test the performance of HDD/SSD and PCIe flash drives. But it can do much more. For example did you know that it provides an io-engine to test a CEPH rbd (RADOS block devices) without the need to use the kernel rbd driver?

I couldn’t find good documents which interpret and explain the results you receive from “fio”. Voilà, I will do it then. This tiny tool is so complex that I am planning to split the series into several parts.

But don’t forget: “fio is a synthetic testing (benchmarking) tool which in most cases doesn’t represent real world workloads”

Installation and compiling if needed (Ubuntu)

fio is developed by Jens Axboe and available at github.

These posts are based on Testverse and Ubuntu 14.04.2 but sources are available so you are able to compile it in your environment. Or the easier way is to use the binary packages available for these OSes:

1. Installing the fio binary in Ubuntu 14.04.2

sudo_install_fio

and

list the help of the command.

that’s it…. or?

shows that

fio-2.1.3 is installed. The current version available at github is 2.2.9 (30.07.2015), so let's have some fun with:

2. Compiling the newest fio version in Ubuntu 14.04.2

I am using git for the installation because I like git.

The ./configure output showed that some features use zlib-devel, so that's the reason why we install it. The packages libaio1 and libaio-dev are needed to use the ioengine libaio, which is often used to measure the raw performance of devices.

In other distributions you may need to install other packages like make, gcc, libaio etc. in advance. For Ubuntu the “build-essential” package should work.

make_fio

shows version 2.2.9-g669e

done.

Go hadoop.

 

FIO (Flexible I/O Tester) Appendix 1 – Interpreting the output/result of the “first-run”

first-run

So what happened?

fio ran one job called “first-run”. We did not specify what that job should do, except that it should run until 1 MB has been transferred (size=1M). But what data has been transferred, to where, and how?

So fio used some default values in this case which can be seen in Block 1.

Block 1

block1

  • “g=0”
    • this job belongs to group 0 – Groups can be used to aggregate job results.
  •  “rw=read”
    • the default io pattern we use is sequential read
  •  “bs=4K-4K/4K-4K/4K-4K”
    • the default (read/write/trim) blocksize will be 4K
  •  “ioengine=sync”
    • the default ioengine is synchronous, so there is no parallel (async) access
  •  “iodepth=1”
    • by default there will be no more than 1 IO unit in flight against the device

Block 2

block2

  • “Laying out IO files…. ”
    • This step creates a file if not already existing with 1MB in size with the name “first-run.0.0” in the working directory
    • This file is used for the data transfer

Block 3

block3

  • Name of the job and some infos about it like
    • “err=0”
      • no errors occurred when running this job

Block 4

block4

  • This is the IO statistic for the job.
    • “read” (remember the default for this job is sequential read)
    • “io=1024.0KB”
      • number of KB transferred from the file (1 MB)
    • “bw=341333KB/s”
      • we transferred data at a speed of ~333MB per second on average
    • “iops=85333”
      • is the average IO per second (4k in this case).
    • “runt=3msec”
      • The job ran 3 milliseconds

Actually, in this case we only scheduled 256 IOs (1MB / 4KB) to the file. This took only 3 milliseconds. So the value of 85333 only means that we could achieve this many IOs per second if we kept reading for one second.

1 s / 0.003 s = ~333 (3 ms fit ~333 times into one second); ~333 * 256 IOs (completed per 3 ms) = ~85333 IOPS

  • the rest of Block 4 shows in detail the latency distribution. For more details read Part8.
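The numbers in Block 4 can be reproduced with a few lines of Python. This is only the arithmetic from above, using the values fio reported; the variable names are mine.

```python
size_kb = 1024    # size=1M
block_kb = 4      # default block size
runtime_ms = 3    # runt=3msec

ios = size_kb // block_kb
iops = ios * 1000 / runtime_ms
bandwidth_kb_s = size_kb * 1000 / runtime_ms
print(ios, int(iops), int(bandwidth_kb_s))  # 256 85333 341333
```

Both values are extrapolations from a run of only 3 ms, which is why they look so impressive.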

Block 5

block5

  • “cpu”
    • this line shows the CPU usage of the running job
  • “usr=0.00%”
    • this is the percentage of CPU usage of the running job at user level
    • It's nearly 0. Remember, the job ran for only 3ms, so there is no impact on the CPU
    • 100% would mean that one CPU core is at 100% load, depending on whether HT is on or off
  • “sys=0.00%”
    • this is the percentage of CPU usage of the running job at system/kernel level
  • “ctx=8”
    • The number of context switches this thread encountered while running
  • “majf=0” and “minf=0”
    • The number of major and minor page faults

 Block 6

block6

  •  this block shows the distribution of IO depths over the job lifetime.
    • “IO depths :   1=100.0%… ”
      • this number shows that the job always had 1 IO unit in flight (see Block 1)
    • “submit: …..4=100.0%….”
      • shows how many IOs were submitted in a single call. In this case it could be in the range of 1 to 4
      • we know that the IO depth was always 1, so this indicates that 1 IO was submitted per call all the time
    • “complete: …. 4=100.0%….”
      • same as submit, but for complete calls.
    • “issued:….total=r=256/w=0/d=0″…
      • 256 read IOs have been issued, no writes, no discards, and none of them have been short

 Block 7

block7

  •  This is the group statistic. We ran only one job belonging to group 0
    • READ
      • “io=1024KB”
        • the same amount of data transferred as in the job statistics
      • “aggrb=341333KB/s”
        • aggregated bandwidth of all jobs/threads for group 0
      • “minb=341333KB/s maxb=341333KB/s”
        • The minimum and maximum bandwidth one thread saw. In this case the minimum is the same as the maximum because the job ran for only 3 ms
      • “mint=3msec” and “maxt=3msec”
        • Shortest and longest runtime of one of the jobs. The same because we ran only one job

 Block 8

block8

  • Disk statistics for the involved disks. But they look strange, don't they?
    • “sda: ios=0/0”
      • 0 READ IO and 0 WRITE IO on /dev/sda
    • “merge=0/0” number of IO merges performed by the IO scheduler
      • no merges here
    • “ticks=0/0”
      • number of ticks we kept the drive busy.. never
    • “io_queue=0”
      • total time spent in the disk queue
    • “util=0.00%”
      • the utilization of the drive -> nothing was done on disk?

So what we are seeing here is probably the Linux file cache/buffer (page cache) for ext4 files. It seems the blocks had already been prefetched. Linux readahead can have an influence as well.

1. Testverse – A universe goes live

For the upcoming storage performance tests for Docker, MySQL, etc. I built a decent test machine. This box will be called Testverse (Test Universe). Please no comments about the cables 🙂

testverse

But why do I use universe in the name?

I learned years ago (feels like ages) that the first step when forming sentential logic statements is to define your universe. That means a statement can be true or false depending on the environment (universe).

Example:

  • The statement I am making right now, “I am drunk.”, is not true.
  • But the same statement on a Saturday evening with friends at a bar may be true.

So it is important for performance tuning/analysis/testing to clearly define how and in which environment you are testing. If you do so, others are able to repeat the same tests or compare them to similar environments. This helps to prove your statements or to prove you wrong. Even if others prove that you are wrong, it's good, because then you are able to improve or correct your statement. Remember the 8PP 1.2.

The table is based on Ubuntu 14.04.2 server which will be the OS if not mentioned separately in other posts.

Testverse Hardware

Item | Description | Firmware | Driver | Hints
Chieftec Smart CH-09B-B | Midi Tower | - | - | -
ASUS P9X79 PRO | Socket 2011 with X79 Chipset | 4801 (4701 before 14.09.2015) | - | Intel SpeedStep On; HT - On; Intel VT - On; Intel VT-d On
Intel Core i7-3820 B | x4 Cores @3,6Ghz | - | - | OC - Off; Turbo mode - on
64GB-Kit Corsair XMS3 | 8 x 8GB Modules | - | - | DRAM at 1347Mhz
EVGA GeForce GTX 650 | cheap GPU, not really needed | - | - | -
Samsung 840 SSD 128GB | 128 GB SSD at SATA 3.1 - 6 Gb/s | DXT09B0Q | | test drive; /dev/sda; smart On
Samsung 840 SSD 128GB | 128 GB SSD at SATA 3.1 - 6 Gb/s | DXT09B0Q | | OS drive; /dev/sdb
Sandisk/Fusion-IO PX-600 1000 | PCIe Flash card | 8.9.1 | 4.2.1 | test drive; /dev/fioa
Samsung EcoGreen F4 HD204UI | at SATA 2.6 - 3 Gb/s | 1AQ10001 | | test drive; /dev/sdc
Brocade CNA 1020 | 2 x 10 Gbit/s Ethernet Adapter | 3.2.5 | bna - 3.2.23.0 | 1 x 10Gbe mainly for cluster interconnection
Intel® 82579V, 1 x Gbit/s (Onboard) | 1 x 1Gbit/s Ethernet | 0.13-4 | e1000e - 2.3.2-k | management interface
be quiet! SYSTEM POWER 7 | 600W power supply | - | - | hope 600W will be enough

 

 

8PP or “The art of performance tuning/analysis/testing”

I believe that performance tuning/analysis/testing is one of the most complex tasks in the IT world. I have read articles from well-known IT people, known for their well-founded statements, who were nevertheless proven wrong when they did performance tuning/analysis/testing.

I am planning a series of performance tuning/analysis/testing posts about storage performance with Linux file systems, Docker, MySQL, MS SQL Server and more. To avoid the common mistakes, I searched the Internet for some scientific approaches.

Andrew Pruski wrote a nice article and Raj Jain wrote a book about this topic. This presentation shows the common mistakes as well. I decided to use these approaches and hopefully provide well-founded posts.

I added some points to Andrew's approach. I will call it 8PP (8 Phases of Performance tuning/analysis/testing) from now on, because I will often reference this approach.

AND don’t forget: “Performance tuning/analysis/testing is a continuous process”. What you consider to be optimal for your workload today may not be optimal tomorrow.

I will link a real example showing the 8PP soon.

8PP – The 8 Phases of Performance tuning/analysis/testing (Draft 1.3)

Phase 1 – Observation

  • 1.1 Understand the problem/issue
    • Talk to all responsible people if possible
    • Is the problem/issue based on a real workload?
    • Is the evaluation technique appropriate?
  • 1.2 Define your universe
    • If possible isolate the system as much as you can
    • Make sure to write down exactly how your system/environment is built
      • Firmware, OS, driver, application versions, etc…
  • 1.3 Define and run basic baseline tests (CPU,MEM,NET,STORAGE)
    • Define the basic tests and run them while the application is stopped
    • Document the basic baseline tests
    • Compare to older basic baseline tests if any are available
  • 1.4 Describe the problem/issue in detail
    • Document the symptoms of the problem/issue
    • Document the system behavior (CPU,MEM,NETWORK,Storage) while the problem/issue arises

Phase 2 – Declaration of the end goal or issue

  • Officially declare the goal or issue
  • Agree with all participants on this goal or issue

Phase 3 – Forming a hypothesis

  • Based on observation and declaration form a hypothesis

Phase 4 – Define an appropriate method to test the hypothesis

  • 4.1 don’t define too complex methods
  • 4.2 choose … for testing the hypothesis
    • the right workload
    • the right metrics
    • some metrics as key metrics
    • the right level of details
    • an efficient approach in terms of time and results
    • a tool you fully understand
  • 4.3 document the defined method and set up a test plan

Phase 5 – Testing the hypothesis

  • 5.1 Run the test plan

    • avoid or don’t test if other workloads are running
    • run the test at least two times
  • 5.2 save the results

Phase 6 – Analysis of results

  • 6.2 Read and interpret all metrics
    • understand all metrics
    • compare metrics to basic/advanced baseline metrics
    • is the result statistically correct?
    • has sensitivity analysis been done?
    • concentrate on key metrics
  • 6.3 Visualize your data
  • 6.4 “Strange” results mean you need to go back to “Phase 4.2 or 1.1”
  • 6.5 Present understandable graphics for your audience

Phase 7 – Conclusion

Is the goal or issue well defined? If not go back to  “Phase 1.1”

  • 7.1 Form a conclusion if and how the hypothesis achieved the goal or solved the issue!
  • 7.2 Next Step
    • Is the hypothesis true?
      • if goal/issue is not achieved/solved, form a new hypothesis.
    • Is the hypothesis false?
      • form a new hypothesis
    • Is there a dependency to something else?
      • form a new hypothesis
    • If the goal is achieved or issue solved
      • Document everything! (You will need it in the future)

Phase 8 – Further research

  • 8.1 If needed form a new goal/issue
  • 8.2 Define and run advanced baseline tests for future analysis
  • 8.3 If possible implement a continuous approach to monitor the key metrics

The 8PP itself will change from time to time because performance tuning/analysis/testing will evolve.

Go docker Kitematic!