OpenBSD CARP and VRRP conflicts

It turns out that OpenBSD CARP (a failover protocol) conflicts with devices running VRRP, both in protocol number (the two share IP protocol 112) and in behavior.
We have switch and router equipment that uses VRRP to perform gateway failover. When we brought in CARP, using a CARP VHID equal to a VRRP VRID already in use caused severe packet loss; it seems the devices become confused by multicast advertisements carrying conflicting member status information.

Avoid using a CARP VHID that matches a VRRP VRID on the same Ethernet segment :)
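
One likely reason matching IDs collide is the virtual MAC address. The sketch below assumes that IPv4 VRRP derives its virtual MAC as 00:00:5e:00:01 followed by the VRID in hex, and that CARP builds its virtual MAC from the VHID the same way. The function name and the example IDs are made up for illustration.

```shell
# Sketch: print the virtual MAC that a given VRID/VHID would map to.
# Assumption: both protocols use the 00:00:5e:00:01:<id in hex> pattern.
vmac_for_id() {
    printf '00:00:5e:00:01:%02x\n' "$1"
}

# Hypothetical IDs: a VRRP group and a CARP group sharing ID 51
# would end up claiming the same MAC address.
vmac_for_id 1
vmac_for_id 51
```

If that assumption holds, two unrelated groups with the same ID fight over one MAC address, which would match the packet loss described above.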

OpenSolaris ZFS replication

I’ve had this goal for quite some time now, ever since my employer went with Sun X4540 storage systems to serve as our data storage for backup applications. The goal was to handle data replication at the ZFS file system level, removing the need for application-level awareness of the file system replication.
A couple of products from Sun seemingly accomplish this, one being ZFS via the ‘zfs send/recv’ function, and the other being the Sun Availability Suite (AVS).
From a technical standpoint, AVS is a very mature product with many features above and beyond ZFS send/recv. However, since AVS is largely file system agnostic, it poses some problems when used to replicate ZFS file systems. Namely, a ZFS resilver is a block-level operation, so AVS would see it as changed data and replicate the entire resilver over the network.
For my application, that is an undesirable situation, as we will be replicating the data off-site via a rather expensive private leased network.

This brings us to the native ZFS send/recv options. There are many resources online about how it works technically, so I would suggest reading up a bit, as I won’t explain it here.

I explored the ZFS Time Slider tools, which can execute an additional command (such as zfs send/recv via SSH) for every local snapshot you take. That worked for a while, but the suite was not designed to handle ZFS replication. When my snapshot sizes began to grow, the send/recv operation would take longer than the window before the next local snapshot was taken.
This caused the service to keep entering maintenance mode, as conflicting operations collided.

Then I found a helpful blog post by Sun engineer Constantin Gonzalez (http://blogs.sun.com/constantin/entry/zfs_replicator_script_new_edition),
where he described and made available a script that handles parts of ZFS replication, from taking the initial snapshot to sending it to a remote host over SSH. However, the same issues haunted me there: the send/recv operation would run past the scheduled window, and subsequent jobs would step on each other and cause problems.

Clearly, these tools accomplished a lot of great things, but some additional logic could be added to ensure jobs can run past their window, without risk of additional jobs trying to take snapshots while others are in progress.

Enter ZFS user properties: you can set arbitrary properties on a file system or on an individual snapshot. For example, you can “lock” a file system by setting a flag that your programs check for; if the flag exists, they quit gracefully and notify you.
Jobs running past their window will always happen, and a simple check to see if an existing job is running on your data set would avoid conflicting snapshots, failed jobs, etc.
Short of using an enterprise job scheduling program like Control-M, this functionality is simple to add to existing shell scripts!
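
As a rough sketch of such a lock (not the code from my script), the functions below use `zfs get`, `zfs set` and `zfs inherit` on a user property. The property name com.example:lock and the function names are made up for illustration; the only requirement I know of is that a user property name contains a colon.

```shell
# Sketch only: advisory lock stored in a ZFS user property.
# Assumption: "com.example:lock" is a hypothetical property name.
LOCK_PROP="com.example:lock"

# Succeeds if the dataset currently carries a lock value.
# "zfs get -H -o value" prints "-" when the property is not set.
is_locked() {
    value=$(zfs get -H -o value "$LOCK_PROP" "$1")
    [ -n "$value" ] && [ "$value" != "-" ]
}

# Take the lock, or refuse if another job already holds it.
# Check-then-set is not atomic: two jobs starting at the same
# instant could both pass the check.
acquire_lock() {
    if is_locked "$1"; then
        echo "$1 is locked by: $(zfs get -H -o value "$LOCK_PROP" "$1")" >&2
        return 1
    fi
    zfs set "$LOCK_PROP=$(hostname):$$" "$1"
}

# Drop the lock by clearing the locally set property.
release_lock() {
    zfs inherit "$LOCK_PROP" "$1"
}
```

A replication job would call acquire_lock on its dataset before snapshotting, exit quietly if that fails, and call release_lock when the send/recv finishes.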

But there’s more: why not use ZFS user properties to assign additional flags, such as marking a snapshot when all operations complete, recording that a snapshot depends on another for incremental sends, or noting that a local snapshot has been fully replicated?
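
Here is a minimal sketch of the "fully replicated" flag, again not lifted from my script. It assumes a made-up property name, com.example:replicated, and relies on `zfs list` being able to print a user property as an output column.

```shell
# Sketch only; "com.example:replicated" is a hypothetical property name.
REPL_PROP="com.example:replicated"

# Flag a snapshot once the remote side has received it.
mark_replicated() {
    zfs set "$REPL_PROP=true" "$1"
}

# Print the newest snapshot of a dataset that carries the flag
# (prints nothing if no snapshot has been replicated yet).
# -s creation sorts oldest first, so the last match wins; the
# index() test skips snapshots that belong to child datasets.
last_replicated() {
    zfs list -H -t snapshot -o name,"$REPL_PROP" -s creation -r "$1" |
        awk -F '\t' -v ds="$1" '
            index($1, ds "@") == 1 && $2 == "true" { name = $1 }
            END { if (name != "") print name }'
}
```

The snapshot printed by last_replicated is the natural base for the next incremental send, and an empty result means a full send is needed first.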

I took examples from the tools previously created (see above), and added some of those checks and flags. I’ve posted the script on my site in the hope that others will find the additions helpful, and will improve on some of the incorrect ways I’ve done things.
By no means am I great at writing scripts or programs, so if you see any bugs, or improvements you can make, please make them!

Download: replicate.ksh

Again, suggest or make any improvements, and enjoy!

FreeBSD NFS performance and OpenSolaris

While setting up and testing our Sun X4540 OpenSolaris NFS server, I noticed that our FreeBSD NFS clients were having severe performance issues while writing to the server. After a few days of digging around, I came across some ancient posts (circa 2005) on the FreeBSD-performance mailing lists describing similar problems.
Here is a brief explanation of how an NFS write can happen:
Assume we have a generic NFS server and a generic NFS client mounted over NFSv3/TCP, and that the client issues an FSYNC for every file write. The FSYNC signals the NFS server to write what it has received from the client to disk immediately. Once the data is on disk, the server acknowledges to the client that the write was successful, so the client can send more data.
What I ran into was that the FreeBSD NFS client was issuing an FSYNC for every 32KB (the default write size over TCP). Normally this isn’t too much of a problem for most NFS servers, but the issue here is the guarantee from ZFS that the on-disk state must always be consistent. ZFS treats an FSYNC literally and commits to disk on every one. Most other file systems will lie at some level and acknowledge the write before the data actually hits the disk. While that can improve performance, if you had a crash or power failure before the data was actually committed to disk, you may lose data.
For my case, lying would be a good thing, as those FSYNCs cause the performance to suffer dramatically.
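
To get a feel for the numbers, here is a small back-of-the-envelope sketch. It assumes one FSYNC per write block and uses the 552MB ISO from the NFSv3/NFSv4 comparison later in this post; the function name is made up.

```shell
# Number of synchronous commits needed for one file when every
# write block is followed by an FSYNC.
# Arguments: file size in MB, write block size in KB.
commits_per_file() {
    echo $(( ($1 * 1024 + $2 - 1) / $2 ))
}

commits_per_file 552 32   # prints 17664
```

Each of those commits has to reach stable storage before the client sends the next block, so round-trip and disk latency dominate the transfer time.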

The FreeBSD source code for the cp utility (bin/cp/utils.c) has the following section on or near line 143:


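#ifdef VM_AND_BUFFER_CACHE_SYNCHRONIZED
	/* Regular, non-empty files of at most 8MB are mapped with mmap(). */
	if (S_ISREG(fs->st_mode) && fs->st_size > 0 &&
	    fs->st_size <= 8 * 1048576) {
		if ((p = mmap(NULL, (size_t)fs->st_size, PROT_READ,
		    MAP_SHARED, from_fd, (off_t)0)) == MAP_FAILED) {
			warn("%s", entp->fts_path);
			rval = 1;
		}
		/* (excerpt ends here; the rest of the block is not shown) */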

Roughly, this means that for files of 8MB (8 * 1048576 bytes) or less, cp uses mmap() to handle the file. Larger files use the plain write() path.
Normally, mmap() can increase the speed a little when writing many small files at once, hence the choice to use it for files up to 8MB. The drawback appears to be that the mmap() path results in an FSYNC for every write block over NFS.

To get around this behavior, you can decrease the file size below which mmap() is used. In my case I set it to (8 * 8), or 64 bytes; you are unlikely to come across files that small. Now pretty much all file copy operations will use write(), which goes much faster over NFS. You may incur a 10-15% performance hit on local disk copies of small files, in exchange for an almost 100 times speed improvement over NFS. You still get the protection of an FSYNC at file close, but under some circumstances you may lose data on the NFS server if there is a crash mid-write. However, ZFS ensures that the on-disk state will always be consistent, so your file system should not have any errors.

The OpenSolaris ‘cp’ utility (actually, mv and cp are the same binary, just linked) already seems to have this logic built in:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libcmdutils/libcmdutils.h#65

As you can see, it will not use mmap() for really small files (less than 32KB), but it does use mmap() for larger ones, mapping them in chunks of at most 8MB.

Below are two comparisons on the same hardware: one with the FreeBSD cp compiled to use mmap() for small files, and the other with it compiled to use write() for all files.
The data set was 62 gigabytes of mail. Most files are smaller than 1MB, commonly about 12KB to 300KB, with a random dispersion of larger files up to 25MB.

mmap() speed:

# time cp -Rv Maildir/ /mnt/obsmtp/ei_obscan/Maildir01


real 315m5.054s
user 0m8.418s
sys 10m12.513s

Write() speed:

$ time cp -Rv Maildir/ /mnt/obsmtp/ei_obscan/Maildir02


real 199m11.364s
user 0m7.324s
sys 5m39.594s

You can see the obvious performance increase, but it could be greater if the files were larger on average, since every single file close still issues an FSYNC, which slows things down a little.
Data transfer went from bursts of less than 1MB/sec to bursts of over 100MB/sec. Sustained write speed was limited by the FSYNCs on small files.
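
To put a number on the end-to-end difference, the sketch below converts the two `real` times above into seconds and divides them. The function names are made up, and it assumes the minutes-and-seconds format that `time` printed here.

```shell
# Convert a time(1) style value such as "315m5.054s" into seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, p, "m")          # p[1] = minutes, p[2] = "5.054s"
        sub(/s$/, "", p[2])       # drop the trailing "s"
        printf "%.3f\n", p[1] * 60 + p[2]
    }'
}

# Ratio of the first elapsed time to the second.
speedup() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", a / b }'
}

speedup 315m5.054s 199m11.364s   # prints 1.58
```

So the whole copy finished roughly 1.6 times faster, even though the bursts were far quicker, which fits with the per-file FSYNC at close still holding back a data set made up mostly of small files.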

However, it appears there are other problems with FreeBSD NFS performance: the FreeBSD NFSv4 client always issues an FSYNC for every write block, regardless of whether mmap() or write() is used. If you compare the code of the FreeBSD NFSv3 and NFSv4 clients, you see that NFSv3 has more logic to handle ASYNC or FSYNC writes, and will tend to use ASYNC given the choice. The NFSv4 client has no such ASYNC code in it, from my examination (please correct me if I’m wrong).

I have done some packet snoops of the NFSv4 client, and not a single ASYNC write is ever issued, while mounting the same share via NFSv3 does issue ASYNC writes. Other NFS clients, such as the OpenSolaris NFSv4 client, use ASYNC and FSYNC appropriately, so this seems limited to FreeBSD.

I’ll have to bring this to the attention of the FreeBSD people to see if there are any ways to improve the NFSv4 client, but it seems that simply adding ASYNC writes would improve it significantly.

A 552MB ISO being copied with NFSv4 and NFSv3, showing the impact of F

552M Jan  2 18:22 7.1-RELEASE-i386-disc1.iso

NFSv4 mounted:
# mount_nfs -4T 10.0.0.19:/pdxfilu01/obsmtp /mnt/obsmtp/


$ time cp 7.1-RELEASE-i386-disc1.iso /mnt/obsmtp/1001/


real 9m39.517s
user 0m0.000s
sys 0m3.349s

And here we have NFSv3 mounted:

# mount_nfs -3T 10.0.0.19:/pdxfilu01/obsmtp /mnt/obsmtp/

$ time cp 7.1-RELEASE-i386-disc1.iso /mnt/obsmtp/1001/
real 0m12.682s
user 0m0.000s
sys 0m1.265s

OpenSolaris 2008.11 NFSv4 client:
# mount -F nfs -o vers=4 10.0.0.19:/pdxfilu01/obsmtp /mnt/obsmtp


$ time cp 7.1-RELEASE-i386-disc1.iso.1 /mnt/obsmtp/101/


real 0m10.997s
user 0m0.008s
sys 0m1.772s

Pretty clear example of how much ASYNC writes improve NFS performance!
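
For comparison as average throughput, here is a quick sketch that divides the ISO size by each `real` time above. It treats the 552M shown by the listing as 552MB, which is an approximation, and the function name is made up.

```shell
# Average throughput in MB/s for a given size (MB) and elapsed time (seconds).
throughput() {
    awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.2f\n", mb / s }'
}

throughput 552 579.517   # FreeBSD NFSv4 client, 9m39.517s
throughput 552 12.682    # FreeBSD NFSv3 client
throughput 552 10.997    # OpenSolaris NFSv4 client
```

That works out to a little under 1MB/s for the FreeBSD NFSv4 mount, against roughly 43.5MB/s and 50.2MB/s for the other two.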

Added new services

I revamped the Services page to include more information on consulting services; it now lists both home office and business services.
If you have a need for computer consulting, or you are unhappy with the work that has already been done by someone else, please e-mail or call me.
I’m starting to offer more services because, even at a time when people are having a hard time finding jobs, some computer consultants still seem to think the rules do not apply to them: they don’t show up when promised, do a poor job, or overcharge for basic services.
I offer pro-rated hourly fees, flat per-job fees, and retainers for businesses who require 24/7 access to a technician.
You will also find on this site many more write ups, links, and information in the coming months.

SNMPTT Zabbix trap handler

I’ve wondered for some time how to integrate standard SNMP traps into Zabbix, since many of our systems are Dells with OpenManage installed. OpenManage supports sending SNMP traps to a monitoring station, which receives them and then takes defined actions.
The components we will use to accomplish this are net-snmp, Zabbix, and SNMPTT. Net-SNMP provides the trap receiver daemon (snmptrapd) and, if you choose to configure it, an SNMP daemon as well for passive polling. The trap daemon listens for incoming traps sent by hosts on your network and hands them to SNMPTT, which translates each trap using the installed MIBs and sends the translated message to Zabbix using the zabbix_sender program.
This guide assumes you have configured some type of SNMP trap sending software, such as OpenManage if you use Dell systems. Other tools, such as those from HP, can accomplish the same goal, but you will need to locate the MIB files from those vendors to translate their traps. It is also assumed you have Zabbix configured and running in at least a minimal configuration.

Assuming you have configured OpenManage correctly, directing it to send SNMP traps to your monitoring station, it is time to install and configure an SNMP trap receiver.
On many systems, net-snmp is already installed. FreeBSD users: you can install the latest net-snmp from the ports tree:
cd /usr/ports/net-mgmt/net-snmp
make install clean

Next we must install SNMPTT. If you aren’t familiar with SNMPTT, it is a utility that takes SNMP trap messages from your SNMP trap daemon and turns them into human-readable text using installed MIB files and a translation table. After the translation, you can have SNMPTT pass the result to another program, in our case the zabbix_sender program (which is used to manually send values to the Zabbix server).
Again, this example installs SNMPTT from the FreeBSD ports tree:
cd /usr/ports/net-mgmt/snmptt
make install clean

By default on FreeBSD, your SNMP configuration files are stored under /usr/local/etc/snmp.
From here, you will need to define a configuration file for snmptrapd. A basic configuration file will look similar to this:

traphandle default /usr/local/sbin/snmptt
ignoreauthfailure 1
logoption f /var/log/snmpd.log
disableAuthorization yes

Make sure the path to snmptt is correct for your system. Also be sure to open the respective ports on your firewall (UDP 161 for SNMP polling, UDP 162 for incoming traps).
You can either use the installed FreeBSD snmptrapd init script, or place the following in your /etc/rc.local:
/usr/local/sbin/snmptrapd -C -On -c /usr/local/etc/snmp/snmptrapd.conf -Lf /var/log/snmptrapd.log
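
If you go the init script route instead, an /etc/rc.conf fragment along these lines should do the same job. This is a sketch that assumes the rc.d script installed by the net-snmp port reads the snmptrapd_enable and snmptrapd_flags variables; check /usr/local/etc/rc.d/snmptrapd on your system before relying on it.

```shell
# /etc/rc.conf fragment (sketch; the variable names are an assumption
# about the net-snmp port's rc.d script, the flags match the command above)
snmptrapd_enable="YES"
snmptrapd_flags="-C -On -c /usr/local/etc/snmp/snmptrapd.conf -Lf /var/log/snmptrapd.log"
```

With that in place the daemon starts at boot, and the rc.local line is not needed.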

You must now configure SNMPTT itself. The configuration file on FreeBSD is located at /usr/local/etc/snmp/snmptt.ini.
You can customize the configuration file as you desire; the key items to check are these lines:
dns_enable = 1
resolve_value_ip_addresses = 1
net_snmp_perl_enable = 1
net_snmp_perl_best_guess = 2
translate_log_trap_oid = 2
translate_value_oids = 2
mibs_environment = ALL
description_mode = 2
unknown_trap_log_enable = 1

I have found the options above to provide the best translations of MIBs; experiment as you wish. Note also that the last line of the file provides space to include additional configuration files. We will specify a new file there that contains translations of MIBs into meaningful text.
SNMP MIB files contain OIDs and descriptions for your vendor’s various SNMP traps.
Some vendors use some bizarre formats, but many vendor MIB files can be found HERE
For our example, you will download the Dell MIB packs, and unzip them to a directory of your choice.
Next we must convert the MIB files to something understandable, using the snmpttconvertmib utility.
Please read the following PAGE for specifics on converting MIBs.
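Here is an excerpt of the command you could use. The CPQ* pattern in the excerpt matches Compaq/HP MIB file names, so substitute a pattern that matches the Dell MIB files you unpacked:

for i in CPQ*
do
    /usr/local/sbin/snmpttconvertmib --in="$i" --out=snmptt.conf.dell
done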

It will create snmptt.conf.dell, with English translations of OIDs, enterprise values, and other arbitrary items that mere humans can read.

You will also need to modify the snmptt.conf.dell file to tell SNMPTT to exec a special program, zabbix_sender, which inserts the values into Zabbix so that we can alert on them.
In my example, you will place:
EXEC /usr/local/bin/zabbix_sender -v -z HostnameOfZabbixServer -p 10051 -s Default_Trapper -k snmptraps -o "$aA / $A :: $s :: $N - $Fz"

You can use Vim or nano find/replace to insert this line into every trap definition, customizing the Zabbix hostname above.
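
If you would rather script the insertion than do it in an editor, something like the sketch below may help. It assumes every trap definition in the generated file has exactly one line starting with FORMAT, and it adds the EXEC line right after it. The function name is made up, and the result goes to standard output so the original file stays untouched.

```shell
# Sketch: print a copy of an snmptt.conf file with an EXEC line added
# after every FORMAT line.
# Usage: add_exec_lines <conf file> <EXEC line>
add_exec_lines() {
    # The EXEC line is passed through the environment so that awk does
    # not interpret any backslashes in it.
    EXEC_LINE="$2" awk '
        { print }
        /^FORMAT / { print ENVIRON["EXEC_LINE"] }
    ' "$1"
}

# Example (hostname placeholder as in the EXEC line above):
# add_exec_lines snmptt.conf.dell \
#   'EXEC /usr/local/bin/zabbix_sender -v -z HostnameOfZabbixServer -p 10051 -s Default_Trapper -k snmptraps -o "$aA / $A :: $s :: $N - $Fz"' \
#   > snmptt.conf.dell.new
```

Review the new file before swapping it in, since MIBs converted from unusual formats may not follow the one FORMAT line per trap layout.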

This will execute zabbix_sender for each incoming trap that SNMPTT translates, sending it to Zabbix for a Host called Default_Trapper, under the item snmptraps.
Next, create a Host inside Zabbix named Default_Trapper without an IP address. Create a new Item for the host named snmptraps: the description can be anything, the item type must be Zabbix Trapper, and the key must be snmptraps with a type of Character.
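
Before waiting for a real trap, you can check the Zabbix side on its own by sending a value by hand. The sketch below reuses the zabbix_sender options from the EXEC line above; the function name and the default message are made up, and it assumes zabbix_sender is in your PATH.

```shell
# Sketch: push a hand-written test value to the snmptraps item.
# Usage: send_test_trap <zabbix server> [message]
send_test_trap() {
    zabbix_sender -v -z "$1" -p 10051 -s Default_Trapper -k snmptraps \
        -o "${2:-test trap from $(hostname)}"
}

# Example, with the hostname placeholder used earlier:
# send_test_trap HostnameOfZabbixServer "CRITICAL test message"
```

If the value shows up under Latest Data for Default_Trapper, the host and item are set up correctly and any remaining problems are on the snmptrapd or SNMPTT side.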

You should at this point be fully configured to receive traps from configured devices on your network.
From here, you can review the Latest Data tab in Zabbix to see what comes in from your devices, and plan your triggers to match any items you want to watch.
For example, to monitor critical tagged items from OpenManage, you can have a trigger such as:

({Default_Trapper:snmptraps.str(CRITICAL)}=1)
&({Default_Trapper:snmptraps.nodata(60)}=0)

Here is what a trap should look like:

Tue Oct 21 19:20:13 2008 StorageManagement-MIB::alertPowerSupplyFailure CRITICAL "Status Events" server.com - Storage Management Event: Power Supply Failure: Alert message ID: 2322, Power supply failure. The DC power supply is switched off., Controller 0, Connector 0, Enclosure 0, Power Supply 1

You may find that many informational, minor, and similar alerts come in over time. It is recommended that you review incoming data often to build triggers that alert on the appropriate items. Windows itself can also be configured to send SNMP traps: go to the SNMP Service properties from within services.msc, and configure the trap destination and community to match your monitoring station.

More information can be found on the Zabbix FORUMS or the Zabbix WIKI

You may contact me or leave comments below if you have any questions.