Unleash the power of the NTNX-AVM – daily_health_report and monthly_ncc_health

I asked myself how an admin could actually use the NTNX-AVM, so I decided to provide some real-world examples of how this powerful automation VM can be used.

USE CASE: A daily health report should run on the Nutanix cluster and be sent to a specified email address!

Let’s start with the script itself. There is no script provided by Nutanix except the Nutanix Cluster Check (ncc). It does a decent job, but because of the hundreds of tests and the amount of output it may not be the easiest thing to start with. So, based on the script provided by BMetcalf in the Nutanix Community, I developed a script called “daily_health_report.sh” for the NTNX-AVM. It is automatically installed with the NTNX-AVM starting today.

It runs the following command remotely on a CVM, which gives you a good overview of the current cluster status.
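The exact command is defined in daily_health_report.sh; as a rough, hedged sketch, the remote call looks something like this (the CVM IP and the command list are purely illustrative):

# illustrative only - the real set of commands is defined in daily_health_report.sh
ssh nutanix@<YOUR-CVM-IP> 'source /etc/profile; cluster status; ncli cluster get-params'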

Okay, we have a script, but how do we run it once a day? For this I introduced jobber on the NTNX-AVM.

Learn jobber the fast way

Connect via SSH to the NTNX-AVM and run:

jobber_list

In this case no job is known. I prepared an example which runs the script daily_health_report.sh every day at 04:00.

The easiest way to create this job is to copy the example from the source folder to a file called “.jobber” in the Nutanix home directory.
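If you prefer to write it by hand instead, a minimal sketch of such a job definition could look like this (job name, script path and field values are illustrative; the example shipped with the NTNX-AVM is the authoritative version):

cat > ~/.jobber <<'EOF'
---
- name: DailyHealthReport
  cmd: /home/nutanix/daily_health_report.sh --host=<YOUR-CVM-IP> --recipient=<RECIPIENT-EMAIL>
  time: '0 0 4'          # sec min hour -> every day at 04:00:00
  onError: Continue
  notifyOnError: false
  notifyOnFailure: true
EOF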

The last step is just to reload the jobs defined in “.jobber”.
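On the NTNX-AVM this is simply:

jobber reload     # re-reads ~/.jobber and picks up the new job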

Review the jobber list.

jobber_finished

How does the daily_health_report.sh work?

First of all, this script will not run in your environment as-is, because all parameters for daily_health_report.sh are set up for my lab environment. Okay, let's make sure it will run in your environment.

STEP1 – Enable SSH access from NTNX-AVM to the cluster CVMs

The script makes use of ssh/scp to run the commands remotely on one of the CVMs. To run a script non-interactively we need to enable password-less authentication between the NTNX-AVM and the CVMs. I wrote a script which enables password-less authentication.

This script creates a key pair and deploys the keys to the CVMs. When you run it you need to specify the cluster IP/name and the PRISM admin password.
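Roughly, the idea behind it is the following (a hedged sketch, not the helper script itself; the key name is arbitrary and the actual script may work differently):

# create a passphrase-less key pair on the NTNX-AVM
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
# register the public key with the cluster so that every CVM accepts it
ncli -s <CLUSTER-IP-OR-NAME> -u admin -p '<PRISM-ADMIN-PASSWORD>' cluster add-public-key name=NTNX-AVM file-path=$HOME/.ssh/id_rsa.pub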

A test SSH connection should now work without prompting for a password.
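For example:

# should print the CVM hostname without asking for a password
ssh nutanix@<YOUR-CVM-IP> hostname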

STEP2 – Edit the jobber file

Use an editor of your choice, such as "vi", and edit the line which starts with " cmd :  daily_h…", adjusting the parameters to your needs (a complete example follows the list below).

DO NOT use the cluster IP for --host; use the IP of one CVM.

  • --host=<YOUR-CVM-IP>
  • --recipient=<RECIPIENT-EMAIL>
  • --provider=other                  // choose "other" to send email to a local email server
  • --emailuser=<EMAIL-USER>          // email user used to authenticate via SMTP (sender)
  • --emailpass=<EMAIL-PASSWORD>      // email password used to authenticate via SMTP
  • --server=<EMAIL-SERVER-IP>        // email server IP
  • --port=<SMTP-PORT>                // email server SMTP port
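Put together, the cmd line in "~/.jobber" could end up looking like this (one line; all values, including the script path, are only examples):

cmd: /home/nutanix/daily_health_report.sh --host=192.168.178.131 --recipient=admin@example.com --provider=other --emailuser=report@example.com --emailpass='<EMAIL-PASSWORD>' --server=192.168.178.10 --port=25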

Reload the jobber file.

STEP3 – Test the job
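jobber can run a job on demand, which is handy for testing (the job name must match the name field in your ~/.jobber):

jobber test DailyHealthReport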

The output should look like this:

edit_test_jobber

Check the email account to confirm that the email was sent and received.

show_daily_email

There it is. Sorry for the German Thunderbird version, but you should get the idea of what the email looks like: an email with one attachment called "daily_health_report-<DATE>.txt".

USE CASE: Run a monthly “ncc health_checks run_all” and send the output to a specified email address!

Some Nutanix people would say: "Why don't you use ncc instead?" Good point. This post shows how to run ncc every x hours and send an email. But how do you run ncc once a month and get all ERROR/FAIL messages in the body?

For this case I created the ncc_health_report.sh script, which runs "ncc health_checks run_all" and sends an email.

STEP1 – Extend the “.jobber” file to add this job

The example which can be found on the NTNX-AVM in "~/work/src/github.com/Tfindelkind/automation/NTNX-AVM/jobber/example/monthly_ncc_health" defines a job which runs on the 1st of each month and calls ncc_health_report.sh.

Edit the ~/.jobber file and add the job text to the end of the file, BUT skip the first line "---". The file should look like this.

both_jobs
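A quick way to do that on the shell (the path is the example location quoted above):

tail -n +2 ~/work/src/github.com/Tfindelkind/automation/NTNX-AVM/jobber/example/monthly_ncc_health >> ~/.jobber     # tail -n +2 skips the first line ("---")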

Don't forget to edit the parameters as described in STEP2 of the first use case.

Reload the jobber file.

STEP3 – Test the job

WARNING!!!! This may run for a while…

The output should look like this:

ncc_report_test

And an email should be in your inbox:

ncc_report_test_email

I know the format of the body is weird because all newlines have been removed. I hope to fix this in the near future.

BTW: I used hMailServer in my lab environment. This was really the easiest mail server setup I have ever done.

Go hMailServer

 

Nutanix automation VM ( NTNX-AVM ) goes online

Since I started at Nutanix I have thought about a way to write and run scripts/tools around the Nutanix ecosystem. But there are different languages used by the community: Perl/Python/Golang/PowerShell etc. So I asked myself: "Where the heck should I install the runtime and the scripts/tools? The CVM is a bad place for this."

The answer took me a while but here we go:

Nutanix automation VM called NTNX-AVM

There is no single image which fits all needs. Instead, the NTNX-AVM is based on recipes which define the runtime/scripts/tools that will be installed. The foundation of these are the cloud images which are designed to run on cloud solutions like AWS/Azure/OpenStack. These images provide good security from scratch. Another advantage is that the images come pre-built, which means there is no other way to install them than "importing" a vendor-controlled image. This is good for maintaining the whole project.

NTNX-AVM v1, when deployed, provides golang, git, govc, java, ncli (CE edition), vSphere CLI and the automation scripts from https://github.com/Tfindelkind/automation preinstalled. So, for example, you can move a VM from container A to container B with the move_vm binary, which leverages the Nutanix REST API; this is otherwise not possible in AHV.

I introduced a job scheduler system called https://github.com/dshearer/jobber to automate tasks/jobs. The advantages are that you are able to review the history of already executed jobs and you have more control when something goes wrong.

Use cases for the NTNX-AVM

  • Back up Nutanix VMs to an NFS store like Synology/QNAP/Linux…
  • Move a VM from one container to another one
  • Do daily tasks like generating reports of specific performance counters you would like to monitor which are not covered by Prism
  • Anything which talks to the Nutanix REST API and needs to be scheduled
  • …. there will be more

Installation of NTNX-AVM on Acropolis Hypervisor (AHV)

For an easy deployment and usage I created a simple bash script which will do all the hard work.

The deployment for VMware and Hyper-V will follow. At the moment the process is more manual. I will post a “HOW-TO install”.

What you need is a Nutanix cluster based on AHV (>=4.7) and a client where you are able to run the bash script. Ubuntu, Debian, Red Hat, CentOS, Mac OS should work fine as a client. The Community Edition (CE) is the base of my development environment and is fully supported.

This is how the environment looks before the deployment: my three-node cluster based on Intel NUCs.

cluster_before_NTNX-AVM

Image_service_before_NTNX-AVM

Step-by-Step deployment of NTNX-AVM with Deploy Cloud Image script (DCI)

We start on the client system, in my case a MacBook Pro. Download the latest stable release of DCI from https://github.com/Tfindelkind/DCI/releases. In my case the version v1.0-stable is the latest build available. The "Source code (tar.gz)" will work for me.

Release v1.0-stable · Tfindelkind-DCI Google Chrome, Heute at 11.40.54

Change to the Download folder and unpack/untar the file:

Downloads — bash — 92×28 Terminal, Heute at 11.44.46

You can see there are several recipes available but let’s focus just on NTNX-AVM v1 based on CentOS7.

NTNX-AVM recipe config file

IMPORTANT: The NTNX-AVM needs an internet connection when deployed, because all tools need to be downloaded.

Now we need to edit the recipe config file of the NTNX-AVM to make sure that the IP, DNS, etc. are set up the way we need them. Use a text editor of your choice to edit the "/recipes/NTNX-AVM/v1/CentOS7/config" file.

You should edit following settings to your needs:

  • VM-NAME              The name of the VM guest OS
  • VM-IP                The fixed IP
  • VM-NET               The network of the VM
  • VM-MASK              The netmask of the network
  • VM-BC                The broadcast address of the network
  • VM-GW                The gateway
  • VM-NS                The nameserver
  • VM-USER              The username for the NTNX-AVM which will be created
  • VM-PASSWORD          The password for this user -> support for access keys will be added soon.
                         You need to escape some special characters like "/" with a "\" (backslash)
  • VCENTER_IP           IP of the vCenter, when used
  • VCENTER_USER         User of the vCenter
  • VCENTER_PASSWORD     Password of the vCenter user

This is an example file for my environment:

CentOS7 — bash — 92×28 Terminal, Heute at 11.47.18
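For illustration, such a config contains entries along these lines (a hedged sketch; the exact syntax is shown in the shipped file, and the addresses below are from my 192.168.178.x lab network):

VM-NAME=NTNX-AVM
VM-IP=192.168.178.200
VM-NET=192.168.178.0
VM-MASK=255.255.255.0
VM-BC=192.168.178.255
VM-GW=192.168.178.1
VM-NS=192.168.178.1
VM-USER=nutanix
VM-PASSWORD=nutanix\/4u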

NTNX-AVM with DHCP enabled

If you don't want to specify a fixed IP, DNS, etc., you can roll out the NTNX-AVM with DHCP. To do this, edit the "/recipes/NTNX-AVM/v1/CentOS7/meta-data.template" file and remove the network part so the file looks like this one. The "ifdown eth0" and "ifup eth0" are related to a bug with the CentOS 7 cloud image.

Deploy the NTNX-AVM to the Nutanix cluster

Now we are ready to deploy the VM to the Nutanix cluster with the dci.sh script.

We need to specify a few options to run it (a complete example invocation follows the list):

  • --recipe=NTNX-AVM            Use the pre-built NTNX-AVM recipe
  • --rv=v1                      It's the first version, so we use v1
  • --ros=CentOS7                In this case we use the CentOS7 image and not Ubuntu
  • --host=192.168.178.130       This is the cluster IP of Nutanix; a CVM IP will work too
  • --username/--password        Prism user and password
  • --vm-name                    The name of the Nutanix VM object
  • --container=prod             In my case I used the container "prod" (production)
  • --vlan=VLAN0                 The Nutanix network where the VM will be connected to
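Putting it together, an invocation could look like this (the values are from my lab, adjust them to yours):

./dci.sh --recipe=NTNX-AVM --rv=v1 --ros=CentOS7 --host=192.168.178.130 \
  --username=admin --password='<PRISM-PASSWORD>' --vm-name=NTNX-AVM \
  --container=prod --vlan=VLAN0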

The dci.sh script will do the following:

  • First it will download the CentOS cloud image. Then it will download the deploy_cloud_vm binary.
  • It will read the recipe config file and generate a cloud seed CD/DVD image. This means all configuration like IP, DNS, etc. will be saved into this CD/DVD image called "seed.iso".
  • DCI will upload the CentOS image and seed.iso to the AHV image service.
  • The NTNX-AVM VM will be created based on the CentOS image, and the seed.iso will be connected to the CD-ROM. At the first boot all settings will be applied. This is called the NoCloud deployment, based on cloud-init. It only works with cloud-init-ready images.
  • The NTNX-AVM will be powered on and all configs will be applied.
  • In the background all tools/scripts will be installed.

DCI-1.0-stable — bash — 92×28 Terminal, Heute at 12.05.51

The CentOS cloud image and the seed.iso have been uploaded to the image service.

Nutanix Web Console Google Chrome, Heute at 12.06.17

The NTNX-AVM has been created and started.

Nutanix Web Console Google Chrome, Heute at 12.06.45

Using the Nutanix Automation VM aka NTNX-AVM the first time

Connect via SSH to the NTNX-AVM IP, 192.168.178.200 in my case. First of all we need to make sure that all tools are fully installed, because this is done in the background after the first boot.

So let's check if /var/log/cloud-init-output.log shows something like:

DCI-1.0-stable — nutanix@NTNX-AVM.~ — ssh — 105×28 Terminal, Heute at 07.45.06

The NTNX-AVM is finally up, after … seconds
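A quick way to check this from your client (user and IP as configured in the recipe):

ssh nutanix@192.168.178.200 'tail -n 5 /var/log/cloud-init-output.log'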

You should reconnect via ssh once all tools/scripts are installed to make sure all environment variables will be set.

Everything is installed and we can use it.

Test the NTNX-AVM environment

Let’s connect to the Nutanix cluster with the “ncli” (nutanix command line) and show the cluster status.
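For example (cluster IP and Prism credentials are the ones from my lab, adjust them to yours):

ncli -s 192.168.178.130 -u admin -p '<PRISM-PASSWORD>' cluster status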

DCI-1.0-stable — nutanix@NTNX-AVM.~ — ssh — 105×28 Terminal, Heute at 07.54.57

That's it. The NTNX-AVM is up and running.

Today I started to implement the ntnx_backup tool, which will be able to back up/restore an AHV VM to/from an external share (NFS, SMB, ….) and will leverage jobber as the job scheduling engine.

Go Ubuntu

 

Intel NUC NUC6i7KYK – Installation of Nutanix Community Edition (CE) – Part3 – 3 node cluster creation

Now it's time to create a Nutanix cluster. But there are some default settings I would like to change before I create the cluster. This is not mandatory, but it will increase usability in the future. Just jump to the create cluster part if you want to skip that.

Changing the AHV hypervisor hostname (optional)

Use an SSH client like PuTTY or my favorite mRemoteNG to connect to the AHV (host) IP. Use the default password when connecting as the "root" user, which is "nutanix/4u". Use a text editor like vi/nano to edit the "/etc/hostname" file and change the entry to the hostname you would like to have.

change_AHV_hostname

The following table shows the hostnames I used in this setup.

DNS-Name         Type   IP
NTNX-NUC1        AHV    192.168.178.121
NTNX-NUC2        AHV    192.168.178.122
NTNX-NUC3        AHV    192.168.178.123
NTNX-NUC1-CVM    CVM    192.168.178.131
NTNX-NUC2-CVM    CVM    192.168.178.132
NTNX-NUC3-CVM    CVM    192.168.178.133

Changing the AHV hypervisor timezone (optional)

By default the timezone of the AHV hypervisor is PDT (Pacific daylight time). From a support perspective it makes sense that all logging dates use PDT, so that it is easier to analyse different log files side by side. But I would like to have the time in my timezone, which is Germany's. To change the timezone you need to point /etc/localtime to the correct file. You can find the files needed in "/usr/share/zoneinfo".

  • Make a backup of the current /etc/localtime: "mv /etc/localtime /etc/localtime.bak"
  • Make a link to the wanted timezone file: "ln -s /usr/share/zoneinfo/Europe/Berlin /etc/localtime"

change_time_zone

Changing the CVM name (optional)

This is a tricky part. I could not find a solution to change the CVM name. It seems there is no way to do this.

Changing the CVM timezone (optional)

@TimArenz reminded me that it may be easier and better to change the timezone after the cluster is created. This can be done via the Nutanix CLI (ncli).
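For example, to set the CVMs to the Berlin timezone once the cluster is up (a hedged example, run from a CVM or anywhere ncli is connected to the cluster):

ncli cluster set-timezone timezone=Europe/Berlin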

Creating the 3-node cluster

There are two ways to create a multi-node Nutanix CE cluster: via the cluster init web page or via the command line.

Cluster init web page

Connect to: http://CVMIP:2100/cluster_init.html

Enter the needed values and start the creation.

cluster_config

Cluster create via command line

We need to connect to one of the CVMs of this setup via SSH with user "nutanix" and password "nutanix/4u".

The creation is pretty simple and involves two steps: invoke the create cluster command and set the DNS server.

cluster -s CVM-IP1,CVM-IP2,CVM-IP3 create
ncli cluster add-to-name-servers servers="DNS-SERVER"
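With the CVM IPs from the table above and a DNS server of 192.168.178.1 (an assumption, adjust to your network), this becomes:

cluster -s 192.168.178.131,192.168.178.132,192.168.178.133 create
ncli cluster add-to-name-servers servers="192.168.178.1"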

cluster_create_begin

create_cluster_end

The first connect to PRISM

Open a browser and connect to one of the CVM IPs. Enter the user credentials: “admin/admin”

When logging in for the first time after the installation you will be asked to change the admin password.

change_password

The NEXT credentials which were used for the download need to be entered now. This means that Nutanix CE needs an internet connection to work. There is a grace period which should be around 30 days.

enter_next_account

Prism will be shown now and it's ready to go.

installation_done

 

Go Wireshark!

Intel NUC NUC6i7KYK – Installation of Nutanix Community Edition (CE) – Part2 – AHV installation

There are several great posts which show how to set up Nutanix CE in a home lab.

Tim Expert

TinkerTry

Gareth Chapman

XenAppBlog

Mike Sheehy

I will focus on my own setup, based on the Intel NUC6i7KYK. The setup is pretty straightforward up to the point when the onboard network comes into play. The Intel driver which is included in Nutanix CE is not the right one for the Intel NUC6i7KYK onboard network.

Overview of the Nutanix CE install process

  1. Make sure your environment meets the minimum requirements. The table shows that a minimum of two disks is needed, at least one of them an SSD. That's the reason why I used 2x SanDisk X400 M.2 2280 in my environment. Remember that NVMe drives are not working at the moment. minimum_requirements
  2. Download the Nutanix CE disk image, which will be copied to a USB flash drive. This will be the install and boot device for this environment. The USB drive should be at least 8 GB in size, but I recommend using a device as big as possible; 32 GB flash drives start at around 10€. The reason is simple: if your environment for any reason starts to write extensive logs or data to the flash drive, an 8 GB drive may end up worn out. Second, maybe the image becomes bigger in the future?
  3. Boot from the USB flash drive and start the installer with the right values (IP, DNS, …). This step installs the Controller VM (CVM), where all the Nutanix "magic" resides, to one of the SSD drives. All local disks will be directly mapped to the CVM. This means the Acropolis Hypervisor (AHV), which is KVM based, is not able to use the storage directly anymore.
  4. If chosen, a single node cluster will be created. In my case, where I will build a three node cluster, I leave this option blank.

Step-by-Step Installation of Nutanix CE based on Intel NUC6i7KYK

Download the Nutanix Community Edition. You need to register first!

NutanixCE_register_download

Download the software by scrolling down to the latest build.

nutanixCE_downloadLatestbuild

The image itself is packed as ".gz". I used the tool 7zip to unpack the file. A file like ce-2016.04.19-stable.img will be unpacked, which is ready to be copied to the USB flash drive.

7zipimage7zipIMG

Now attach the USB flash drive and download the tool called Rufus. This program enables you to "raw" copy an image like this one byte by byte to a USB flash drive. Choose the right USB flash drive, then switch to "DD Image" (dd means disk dump). The last step is to choose the img file and hit "Start".

ATTENTION !!!! Make sure to choose the right device!!!

rufuschoose_rufus_img

The copy process takes a while!

Now we need to install the Intel network driver, because the included version does not provide the right one:

Intel e1000e for Nutanix CE on Intel NUC6i7KYK

Unzip the file so you end up with a file called "e1000e.ko".

Now we need to copy the file "e1000e.ko", which is a kernel module, to the USB flash drive. But the filesystem used on the USB flash drive is ext4, which MS Windows is not able to edit by default, so we need a tool like EXT2FSD to do so.

After the installation of EXT2FSD and a reboot, start the Ext2 Volume Manager. In my case I needed to choose a drive letter manually to be able to work with the USB drive: scroll down to the right device in the bottom window, select the drive and hit the "F4" key, which should assign an unused drive letter.

EXT2Volumemanager

Copy the file "e1000e.ko" to the USB flash drive into the following directory: "/lib/modules/3.10.0-229.4.2.el7.nutanix.20150513.x86_64/kernel/drivers/net/ethernet/intel/e1000e/" and overwrite the existing file.

copy_e10000e

The USB flash drive is ready to boot on the Intel NUC6i7KYK!

Attach the USB flash drive to your Intel NUC6i7KYK and boot it. Feel free to change the boot order right now so that the Intel NUC6i7KYK will always boot from the USB flash drive.

IMG_20160626_090725 IMG_20160626_090415

Now the Intel NUC6i7KYK is ready to boot from the USB flash drive.

IMG_20160626_090853

After the boot you should see the login screen.

IMG_20160626_090912

Log in as user "root" with the password "nutanix/4u". Loading the Intel network driver works with the command "modprobe e1000e". Use "exit" to return to the login screen.

IMG_20160626_091010

The user “install” starts the installation.

IMG_20160626_091034

Choose your keyboard setting. In my case I used “de-nodeadkeys”.

IMG_20160626_091104

The following screen shows a small form. This is an example for a single node setup.

IMG_20160626_091301

You may miss the configuration for a 3 or 4 node cluster. If you would like to set up a multi-node cluster, your setup could look like this. This means that the cluster itself will be created later and we just install the environment. (Acropolis Hypervisor = host, CVM = Nutanix Controller VM)

IMG_20160626_091320

There are two IPs which need to be configured. The host IP is the IP of the hypervisor. In the case of Nutanix CE the Acropolis hypervisor will be installed, which is based on the KVM hypervisor. There are a lot of changes compared to vanilla KVM, so it is not the same. The logic of all Nutanix functions is implemented in the Controller VM. This is the reason why the OS which is installed in this VM is called NOS (Nutanix OS). NOS is based on CentOS.

IMG_20160626_091352

The installation takes a while. In the end you should see a login screen with a random hostname.

The next post will show the configuration of the cluster.

Go mRemoteNG

 

Intel NUC NUC6i7KYK – Installation of Nutanix Community Edition (CE) – Part1 – Hardware setup

As already announced in my recent post, I bought three Intel NUC NUC6i7KYK to set up my demo/showcase environment based on the Nutanix Community Edition, which is free to use. In the following weeks I will show how I set up the environment step by step and I will document the live demos I would like to show at upcoming events. This will include the OpenStack and Docker integration.

nuc_hardware_installed

It all starts with the hardware itself. The NUC Skull Canyon edition is pretty new, and this post in the Nutanix Community literally convinced me to build a lab with these boxes. I used the following hardware setup. Be aware that DDR4 and SSDs are not included when buying the Intel NUCs.

NUC-AHV

Item / Description                                                                         Firmware  Driver     Hints
Intel NUC Skull Canyon NUC6i7KYK                                                           -         -          -
Intel Core i7-6770HQ, Skylake-H, 4C/8T, 2.6 GHz (Turbo to 3.5 GHz), 14nm, 6MB L2, 45W TDP  -         -          -
32GB (2x 16384MB) Crucial CT2K16G4SFD8213 DDR4-2133 SO-DIMM CL15 Dual Kit                  -         -          -
2x SanDisk X400 M.2 2280 512 GB SATA SSD (6Gb/s)                                           -         -          -
Intel Ethernet Connection I219-LM GbE Adapter                                              -         e1000e.ko  -

Noise1-3 – HP ML 110 G6 cluster for Nutanix Community Edition

The HP ML 110 G6s are pretty old. I bought these boxes around 2012, but with 10 GbE Brocade CNA adapters and some fine SSDs they are still nice boxes to run the Nutanix Community Edition, which is free to use.

BUT be aware. There is a reason why I called the boxes Noise1, Noise2, Noise3.

IMG_20160612_142014

This is the current listing of the components I installed.

Item                  Description                          Firmware     Hints
HP ML 110 G6                                               2011.08.26   http://www8.hp.com/h20195/v2/GetPDF.aspx/c04286629.pdf
CPU                   X3430 @ 2.4 GHz
RAM                   16GB DDR3 @ 1333 MHz
Graphic               Onboard MGA G200e
LSI SAS Controller    SAS1064ET Fusion-MPT SAS
SSD Samsung           Samsung 750 EVO MZ-750250BW                       /dev/sda
SSD SanDisk           SDSSDP12 - 128 GB                                 /dev/sdb
HDD 1                 WDC WD10EZRX-00L 1TB                              /dev/sdc
HDD 2                 WDC WD10EZRX-00L 1TB                              /dev/sdd
HDD 3                 VB0250EAVER 250GB                                 /dev/sde
NIC Onboard           Broadcom NetXtreme BCM5723 - 1 GbE
Intel NIC             Intel 82541PI - 1 GbE
Brocade CNA 10 GbE                                         3.2.5
SanDisk/Fusion-IO     ioDrive2 - 1.2 TB

 

Nutanix – Upload ISO/Image to AHV from an NFS share

In addition to the post from Josh Odgers, it seems it is not well known how to upload an ISO/image directly from an NFS share to the image service. To achieve this you can leverage the "From URL" field in the PRISM interface.

The format for anonymous nfs access is:

nfs://IP-or-DNS/share/subfolders/isofilename

If user and password is required:

nfs://user:password@IP-or-DNS/share/subfolders/isofilename

Example:

Screenshot 2016-02-19 13.56.37

 

8. SQL Server Performance Tuning study with HammerDB – Solve PAGEIOLATCH latch contention

In the last part I found that there is a new bottleneck. It seems to be related to PAGEIOLATCH_SH and PAGEIOLATCH_EX. The exact values depend on the time slot which is measured by the ShowIOBottlenecks script. The picture shows >70 percent wait time.

PagelatchIO

To track down the latch contention wait events, Microsoft provides a decent whitepaper. I used the following script and ran it several times to get an idea of which resources are blocked.

PagelatchSH_5.1.240101
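The script in the screenshot comes from the whitepaper; a much simpler, hedged sketch of the same idea (assuming sqlcmd and Windows authentication on the local instance) looks like this:

sqlcmd -S localhost -E -Q "SELECT session_id, wait_type, wait_duration_ms, resource_description FROM sys.dm_os_waiting_tasks WHERE wait_type LIKE 'PAGEIOLATCH%';"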

The resource_description column returned by this script provides the resource description in the format <DatabaseID,FileID,PageID>, where the name of the database associated with DatabaseID can be determined by passing the value of DatabaseID to the DB_NAME() function.

First let's find out which table this is. This can be done by inspecting the page and retrieving the Metadata ObjectId.

dbccpage
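With the <DatabaseID,FileID,PageID> triple from resource_description the page can be dumped like this (the database name and page numbers below are placeholders; take the real ones from the resource_description output):

sqlcmd -S localhost -E -Q "DBCC TRACEON(3604); DBCC PAGE('tpcc', 1, 8248, 3);"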

The metadata objectid is 373576369. Now it is easy to retrieve the related table name.
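For example (again, the database name is a placeholder):

sqlcmd -S localhost -E -d tpcc -Q "SELECT OBJECT_NAME(373576369) AS table_name;"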


warehouse_tablename

It is the “warehouse” table.

What is the bottleneck here?

dbccpage

First of all, here is an explanation of the wait events:

PAGEIOLATCH_EX
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Exclusive mode. Long waits may indicate problems with the disk subsystem.

PAGEIOLATCH_SH
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Shared mode. Long waits may indicate problems with the disk subsystem.

In our case this means a lot of inserts/updates are done when running the TPC-C workload, and tasks wait on a latch for this page in shared or exclusive mode! When inspecting this page we know it's the warehouse table, and we created the database with 33 warehouses in the beginning.

The page size in SQL Server is 8K, and the 33 rows all fit into just one page (m_slotCnt = 33). This means some operations cannot be parallelized!!

To solve this I will change the "physical" design of this table, which is still in line with the TPC-C rules. There may be different ways to achieve this. I add a column, insert some text, which forces SQL Server to restructure the pages, and then delete the column.

add_drop
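A hedged sketch of the idea (not the exact statements from the screenshot; the database name and padding size are assumptions, and each statement runs as its own batch so the new column is visible to the UPDATE):

sqlcmd -S localhost -E -d tpcc -Q "ALTER TABLE warehouse ADD page_padding CHAR(4500) NULL;"
sqlcmd -S localhost -E -d tpcc -Q "UPDATE warehouse SET page_padding = 'x';"
sqlcmd -S localhost -E -d tpcc -Q "ALTER TABLE warehouse DROP COLUMN page_padding;"

With roughly 4.5 KB of fixed-length padding per row, two rows no longer fit into one 8K page, and after dropping the column the rows stay on their separate pages until the table is rebuilt.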

Okay, now check that m_slotCnt is 1, which means every row sits in its own page.

dbccpage_new

It’s done.

ShowIOBottleneck_solvedWarehouse

When running the workload again, the PAGEIOLATCH_SH and PAGEIOLATCH_EX wait events are nearly gone.

Before:

  • System achieved 338989 SQL Server TPM at 73685 NOPM
  • System achieved 348164 SQL Server TPM at 75689 NOPM
  • System achieved 336965 SQL Server TPM at 73206 NOPM

After:

  • System achieved 386324 SQL Server TPM at 83941 NOPM
  • System achieved 370919 SQL Server TPM at 80620 NOPM
  • System achieved 366426 SQL Server TPM at 79726 NOPM

The workload increased slightly. Again I monitored that the CPU is at 100% when running. At this point I could continue to tune the SQL statements as I did in the last 2-3 posts. Remember, I started the SQL Server performance tuning with 20820 TPM at 4530 NOPM. This means more than 10x faster!

But the next step may be to add some hardware. This all runs on just 2 of the 4 cores which are available, as I wrote in the first part.

Go ChaosMonkey!

FIO (Flexible I/O Tester) Part9 – fio2gnuplot to visualize the output

The Linux build of "fio" provides a tool called fio2gnuplot. This tool renders the output files of "fio" and uses gnuplot to generate nice graphics. Gnuplot is a portable command-line driven graphing utility which is freely distributed.

This example shows the distribution of IOPS with different block sizes and different read/write mixes:

PX600-1000-IOPS-mes3DPlt

Requirements

I am using "fio" 2.2.10, which was released on 12.09.2015.

Since 2.1.2 fio2gnuplot is part of the “fio” release. To generate the graphics you need to install gnuplot.

How to generate the log files?

There are some “fio” options to generate log files.

  • write_bw_log=<Filename>
  • write_iops_log=<Filename>
  • write_lat_log=<Filename>
  • per_job_logs=0/1 (added after 2.2.8, so not in the Windows build as of 16.09.2015)

write_bw_log generates a log file with the bandwidth details of the job, and so on. If you don't set per_job_logs=0, then there will be one file for each thread (numjobs=X). Most of the time this is not wanted, because you would like to generate graphics based on all threads. An issue I found is that the default patterns of fio2gnuplot (-b / -i) will not work, because it searches for the file endings *_bw.log and *_iops.log, but the files end with *_bw.X.log and *_iops.X.log. It should be fixed with this commit.

If per_job_logs=0 is set and all log file options have been set:

  • write_bw_log=fio-test
  • write_iops_log=fio-test
  • write_lat_log=fio-test

then 5 files will be generated:
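For example, a job that writes aggregated bandwidth, IOPS and latency logs could be started like this (device, block size, mix and runtime are illustrative):

fio --name=fio-test --filename=/dev/sdb --rw=randrw --rwmixread=70 --bs=4k \
    --ioengine=libaio --iodepth=16 --direct=1 --numjobs=4 \
    --runtime=60 --time_based --group_reporting \
    --write_bw_log=fio-test --write_iops_log=fio-test --write_lat_log=fio-test \
    --per_job_logs=0

With per_job_logs=0 this should produce the five aggregated files: fio-test_bw.log, fio-test_iops.log, fio-test_lat.log, fio-test_slat.log and fio-test_clat.log.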

What does a log file look like?

The 4096 in the fourth column is the block size in bytes (4K). The second column is the bandwidth in KB/s. I believe the first column is the elapsed time in ms. A 0 in the third column indicates that the row is related to reads; if it is related to writes, the third column is 1.

Using fio2gnuplot

fio2gnuplot works in two major phases. The first phase is to generate the input files for gnuplot and do some calculations based on the data, like the average, min and max.

Starting fio2gnuplot -b will search for all bandwidth files in the local directory and generate the input files for gnuplot. The option "-i" is the default pattern for IOPS files. There is no default pattern for latency.

fio2gnuplot_phase1

The second phase is to generate the graphics. The option "-g" can be used for this. Per default "-g" deletes the input files for gnuplot; the option "-k" can be used to keep these files for later editing. If you want to make changes to the output you are able to edit the gnuplot files, like the mygraph file.
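Both phases together boil down to something like this (run in the directory that holds the *_bw.log files):

fio2gnuplot -b -g -k     # -b: match *_bw.log files, -g: render with gnuplot, -k: keep the gnuplot input files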

fio2gnuplot_phase2

And this is the output of fio-test_bw-2Draw.png

fio-test_bw-2Draw

Using fio2gnuplot to compare files with the default pattern -b or -i

You can copy all log files into the same directory and call fio2gnuplot with the right pattern. I make use of "-b" for bandwidth comparisons.

fio2gnuplot_compare

And this is the output of compare-result-2Dsmooth.png

compare-result-2Dsmooth

Using fio2gnuplot to compare files with a custom pattern

Sometimes the default patterns will not work; for example, there is no pattern for the latency output. For this case you can specify your own pattern with the option "-p <pattern>" and use a title. WARNING: Using the pattern "*.log" will raise an error. I fixed this and in the future this should work.
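For example, to compare total latency logs (the pattern and title are only examples; I assume "-t" sets the graph title, and the pattern is quoted so the shell does not expand it):

fio2gnuplot -p '*_lat.log' -t 'Latency comparison' -g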

compare-result-pattern

And this is the output of compare-result-2Dsmooth.png

compare-lat-2Dsmooth

Go Keepass2.

 

7. SQL Server Performance Tuning study with HammerDB – Flashsoft and PX600 unleash the full power

I have solved all the bottlenecks found since we started this performance tuning study. But now I can't find any improvements which can be done without altering the schema or indexes, which is not allowed by the TPC-C rules. It is a similar situation when you run a third-party application with a database which you are not allowed to change. A great solution to improve the disk latency is caching based on flash, because it is transparent to the application vendor. The advantage of Flashsoft 3.7 is that it provides a READ and WRITE cache. The write cache is the one which should help with this OLTP workload. Remember, Flashsoft can cache FC, iSCSI, NFS and local devices.

Phase 3 – Forming a hypothesis – Part 5

  • Based on observation and declaration form a hypothesis
    • Based on observation and the lessons I learned, I believe the TPM/NOPM values should increase if the disk access latency is reduced with the use of a READ/WRITE cache (Flashsoft).

Phase 4 – Define an appropriated method to test the hypothesis

  • 4.1 don’t define too complex methods
  • 4.2 choose … for testing the hypothesis
    • the right workload
      • original workload
    • the right metrics
      • In this case I concentrate only on the TPM/NOPM values.
    • some metrics as key metrics
      • TPM/NOPM
    • the right level of details
    • an efficient approach in terms of time and results
      • Installing and configuring Flashsoft will take 30 min
    • a tool you fully understand
  • 4.3 document the defined method and setup a test plan

  I will run the following test plans:

Test plan 1

Implement a READ/WRITE cache for SQL Server based on Samsung 840 basic

Start HammerDB workload

Run ShowIOBottlenecks and Resource Monitor

Stop HammerDB workload and compare this run with the baseline

Analyze the ShowIOBottlenecks

Test plan 2

If ShowIOBottlenecks still shows wait events for disk latency, I will use the PX600-1000 as READ/WRITE cache device

Start HammerDB workload

Run ShowIOBottlenecks and Resource Monitor

Stop HammerDB workload and compare this run with the baseline

Phase 5 – Testing the hypothesis – Test Plan 1+2

  • 5.1 Run the test plan
  • avoid or don’t test if other workloads are running
  • run the test at least two times

I recorded a video when running the test plan 1.

  • System achieved 338486 SQL Server TPM at 73651 NOPM
  • System achieved 314510 SQL Server TPM at 68320 NOPM
  • System achieved 313778 SQL Server TPM at 68256 NOPM

The Resource Monitor showed that there is still latency around 10ms, and ShowIOBottlenecks shows that there are wait events for write log. So I decided to use the PX600-1000.

I recorded a video when running the test plan 2.

  • System achieved 338989 SQL Server TPM at 73685 NOPM
  • System achieved 348164 SQL Server TPM at 75689 NOPM
  • System achieved 336965 SQL Server TPM at 73206 NOPM
  •  5.2 save the results
    • All results are saved into the log files

Phase 6 – Analysis of results – Test Plan 1+2

  • 6.2 Read and interpret all metrics
    • understand all metrics
    • compare metrics to basic/advanced baseline metrics
      • TEST RESULT Flashsoft Samsung 840:
        • System achieved an average of (338486, 314510, 313778) = 322258 SQL Server TPM at (73651, 68320, 68256) = 70076 NOPM
      • TEST RESULT Flashsoft SanDisk PX600-1000:
        • System achieved an average of (338989, 348164, 336965) = 341373 SQL Server TPM at (73685, 75689, 73206) = 74193 NOPM
      • TEST RESULT before:
        • System achieved 161882 SQL Server TPM at 35150 NOPM
    • has sensitivity analysis been done?
      • Just an approximation. There are so many variables, even in this simple environment, that this would take too much time. The approximation shows that as long as I don't make changes to the environment the results should be stable.
    • concentrate on key metrics
      • While using Flashsoft with the Samsung 840 Basic or the SanDisk PX600-1000 I could nearly double the performance compared to the last run.
    •  is the result statistically correct?
      • No. The selection was only one point in time. I repeated the test a few times with a similar result, but still no.
  • 6.3 Visualize your data
  • 6.4 "Strange" results mean you need to go back to Phase 4.2 or 1.1
    • nothing strange
  • 6.5 Present understandable graphics for your audience
    • Done.

Phase 7 – Conclusion

Is the goal or issue well defined? If not go back to  “Phase 1.1”

  • 7.1 Form a conclusion if and how the hypothesis achieved the goal or solved the issue!
    • The hypothesis is true. I doubled the performance by making use of the Flashsoft caching solution. I found that there is only a small difference between the Samsung and the SanDisk drive, although the SanDisk PX600-1000 should be much faster than the consumer SSD. The reason seems to be a new bottleneck I found: page latch waits are involved!
  • 7.2 Next Step
    • Is the hypothesis true?
      • Yes
      • if goal/issue is not achieved/solved, form a new hypothesis.
        • I will form a new hypothesis in the next post of this series where I’ll track down the Page Latch wait events and solve them.