Category: XenServer
Hacking a corrupt VHD on xen in order to access innodb mysql information.
A client ran into a corrupted .vhd file for the data drive of a Xen server in a pool. We helped them restore from a backup; however, there were some items that they had not backed up properly, so our task was to see if we could somehow restore the data from their drive.
First, we had to find the raw file for the drive. To do this we looked at the Local Storage -> General tab on the XenCenter to find the UUID that will contain the failing disk.
When we tried to attach the failing disk, we got this error:
Attaching virtual disk 'xxxxxx' to VM 'xxxx' The attempt to load the VDI failed
So we know that the Xen servers / pool refuse to load the corrupted vhd, and I had to come up with another way to try to access the data.
After much research I came across a tool that was published by ‘twindb.com’ called ‘undrop tool for innodb’. The idea is that even after you drop or delete innodb files on your system, there are still markers in the file system which allow code to parse what ‘used’ to be on the system. They claimed some level of this worked for corrupted file systems.
- UnDrop Tool for innodb
The documentation was poor, and it took a long time to figure out. They claimed to have 24-hour support, so I thought I would call them and just pay them to sort out the issue, but they took a while and didn’t call back before I had sorted it out myself. All of the documentation they did have linked to their GitHub account, however the link was dead. I searched and found a couple of other people who had forked the code before twindb took it down. I suspect they now run more of a service business, helping people resolve issues, and don’t want to support the code. Since this code worked for our needs, I have forked it so that we can make it permanently available: https://github.com/matraexinc/undrop-for-innodb
First step was for me to copy the .vhd to a working directory
# cp -a 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd /tmp/restore_vhd/orig.vhd
#cd /tmp/restore_vhd/
#git clone https://github.com/matraexinc/undrop-for-innodb
#cd undrop-for-innodb
#apt-get install bison flex
#apt-get install libmysqld-dev #this was not mentioned anywhere, however an important file was quietly not compiled without it.
#make #compile the tools
#mv * ../. #move all of the compiled files into your working directory
#cd ../
#./stream_parser -f orig.vhd # here is the magic – their code goes through and finds all of the ibdata1 logs and markers and creates data you can start to work through
#mv pages-orig.vhd pages-ibdata1 #the program created an organized set of data for you, and the next programs need to find this at pages-ibdata1.
#./recover_dictionary.sh #this will need to run mysql as root and it will create a database named ‘test’ which has a listing of all of the databases, tables and indexes it found.
This is where I had to come up with a custom solution in order to process the large volume of customer databases. I used some PHP to script the following commands for each of the many databases that needed to be restored. For each database and table, you must run a command against the ‘index’ file that the previous commands created for you, so you have to loop through all of them. Start by listing the tables and their indexes:
select c.name as tablename
,a.id as indexid
from SYS_INDEXES a
join SYS_TABLES c on (a.TABLE_ID =c.ID)
This returns a list of the tables and any associated indexes. Using this list, you must generate commands which
- generate a create statement for the table you are restoring,
- generate a load infile SQL statement and its associated data file
#sys_parser -h localhost -u username -p password -d test tablenamefromsql
This generates the create statement for the table. Save it to a createtable.sql file and execute it on your database to restore your table.
#c_parser -5 -o data.load -f pages-ibdata1/FIL_PAGE_INDEX/00000017493.page -t createtable.sql
This outputs a “load data infile ‘data.load’” statement. Pipe this statement to MySQL and it will restore your data.
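The per-table loop described above can be sketched in shell rather than PHP. This is only a minimal sketch, not the script we actually used: the function names are made up for illustration, and it assumes the page files are named after the index id zero-padded to 11 digits, as in the example path above. It prints the commands instead of running them, so they can be reviewed first.

```shell
# Hypothetical helpers; these names are not part of undrop-for-innodb.

# Build the page file path for an index id, assuming the 11-digit
# zero-padded naming seen in the example path above.
page_file_for_index() {
    printf 'pages-ibdata1/FIL_PAGE_INDEX/%011d.page\n' "$1"
}

# Print (do not run) the two recovery commands for one table/index pair.
# Slashes in the table name are replaced so it can be used in a file name.
print_recover_commands() {
    rc_table=$1
    rc_indexid=$2
    rc_safe=$(printf '%s' "$rc_table" | tr '/' '_')
    echo "./sys_parser -h localhost -u username -p password -d test $rc_table > $rc_safe.create.sql"
    echo "./c_parser -5 -o $rc_safe.load -f $(page_file_for_index "$rc_indexid") -t $rc_safe.create.sql"
}

# Read "tablename<TAB>indexid" lines on stdin and print the commands
# for each pair.
print_all_recover_commands() {
    rc_tab=$(printf '\t')
    while IFS="$rc_tab" read -r rc_name rc_id; do
        [ -n "$rc_name" ] || continue
        print_recover_commands "$rc_name" "$rc_id"
    done
}
```

One way to feed it would be to save the query above with tab-separated output (one table and index id per line) and pipe that file into print_all_recover_commands, then review the printed commands before running any of them.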
I found one example where the create statement was not properly created, for table_id 754. It appears that the sys_parser code relies on indexes, and in this case the client table did not have an index (not even a primary key). This meant that no create statement was generated and the import did not continue. To work around this, I manually inserted a fake primary key on one of the columns into the dictionary database:
#insert into SYS_INDEXES set id=10000009, table_id=754, name='PRIMARY', N_FIELDS=1, Type=3, SPACE=0, PAGE_NO=400000000;
#insert into SYS_FIELDS set INDEX_ID=10000009, POS=0, COL_NAME='myprimaryfield';
Then I was able to run the sys_parser command which then created the statement.
An Idea that Did not work ….
The idea is to create a new hdd device at /dev/xvdX, create a new filesystem on it and mount it. Then, using a tool such as dd or qemu-img, overwrite the already mounted device with the contents of the vhd. While the contents are corrupted, the idea is that we will be able to explore the corrupted contents as best we can.
So the command I ran was
#qemu-img convert -p -f vpc -O raw /var/run/sr-mount/f40f93af-ae36-147b-880a-729692279845/3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd /dev/xvde
Where 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd is the name of the file / UUID that is corrupted on the xen DOM0 server and f40f93af-ae36-147b-880a-729692279845 is the UUID of the Storage / SR that it was located on
The command took a while to complete (it had to convert 50GB), but the contents of the vhd started to show up as I ran find commands on the mounted directory. During the transfer the results were sporadic, as the partition was only partially built; however, after it completed I had access to about 50% of the data.
An Idea that Did not work (2) ….
This was not good enough to get the files the client needed. I had a suspicion that the qemu-img convert command may have dropped some of the data that was still available, so I figured I would try another, somewhat similar command that actually seems to be a bit simpler.
This time I created another disk on the same local storage and found it using the xe vdi-list command on the dom0.
#xe vdi-list name-label=disk_for_copyingover
This showed me that the file for this disk was ‘fd959935-63c7-4415-bde0-e11a133a50c0.vhd’.
I found it on disk and executed a cat from the corrupted vhd file into the mounted vhd file while it was running.
cat 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd > ../8c5ecc86-9df9-fd72-b300-a40ace668c9b/fd959935-63c7-4415-bde0-e11a133a50c0.vhd
Where 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd is the name of the file / UUID that is corrupted on the xen DOM0 server, and fd959935-63c7-4415-bde0-e11a133a50c0.vhd is the name of the vdi we created to copy over.
This method completely corrupted the mounted drive, so I scrapped this method.
Next up:
Try some file partition recovery tools:
I started with testdisk (apt-get install testdisk) and ran it directly against the vhd file
testdisk 3f204a06-ba18-42ab-ad28-84ca3a73d397.vhd
Enabling Xen VM auto start for 6.2- command line
Citrix removed auto start from the easy-to-access options in XenCenter for 6.X servers.
However, you can still enable it from the command line.
First enable it on your pool
- xe pool-param-set uuid=UUID other-config:auto_poweron=true
Then run a command to get all of the VMs in your pool and generate the command that turns auto power on for each VM that is currently running.
- xe vm-list power-state=running |awk -F': ' '/uuid/ {print "xe vm-param-set uuid="$NF" other-config:auto_poweron=true;"}'
This will give you a list of commands to enable auto_poweron for each of the running VMs in your pool.
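The same list of commands can be generated from a plain list of UUIDs. The sketch below is an illustration under assumptions: the function name is made up, and it assumes the UUIDs arrive as one comma-separated string, which is the form I would expect from running xe vm-list with params=uuid and --minimal. It only prints the commands.

```shell
# Hypothetical helper: print one "xe vm-param-set" command per VM UUID.
# $1 is a comma-separated list of UUIDs. The body runs in a subshell so the
# IFS change and "set -f" do not leak into the calling shell.
gen_autostart_cmds() (
    IFS=','
    set -f
    for uuid in $1; do
        [ -n "$uuid" ] || continue
        printf 'xe vm-param-set uuid=%s other-config:auto_poweron=true\n' "$uuid"
    done
)
```

Review the printed lines, then paste them into the dom0 shell or pipe them to sh.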
Command Dump – Extending a disk on XenServer with xe
To expand the disk on a XenServer using the command line, I assume that you have backed up the data elsewhere before the expansion, as this method deletes everything on the disk to be expanded.
- dom0>xe vm-list name-label=<your vm name> # to get the UUID of the VM = VMUUID
- dom0>xe vm-shutdown uuid=<VMUUID>
- dom0>xe vbd-list params=device,empty,vdi-name-label,vdi-uuid vm-name-label=<your vm name> # to get the vdi-uuid of the disk you would like to expand = VDIUUID
- dom0>xe vdi-resize uuid=<VDIUUID> disk-size=120GB #use the size that you would like to expand to
- dom0>xe vm-start uuid=<VMUUID>
That’s it on the dom0. Now, as your VM boots up, log in via SSH and complete the changes by deleting the old partition, repartitioning and making a new filesystem. I am going to do this as though the filesystem is mounted at /data.
- domU>df /data # to get the device name =DEVICENAME
- domU>umount /dev/DEVICENAME
- domU>fdisk /dev/DEVICENAME
- [d] to delete the existing partition
- [n] to create a new partition
- [w] to write the partition table
- [q] to close fdisk if it is still open (writing normally exits)
- mkfs.ext3 /dev/DEVICENAME
- mount /data
- df /data #to see the expanded size
XenServer and XenCenter
Why do we Blog about XenServer and XenCenter?
First, a quick bit about why we chose XenServer
We are small users of the XenServer and XenCenter software, and when we were first evaluating the hypervisor, we didn’t know much at all about virtualizing servers.
At the same time as we were looking at XenServer, we were also looking into Hyper-V and VMware. Of the 3, I found the open source model that XenServer had, backed by Citrix’s large company status, to be the most appealing.
Amazon AWS was also built on the Xen hypervisor that underlies XenServer, and our experience with AWS helped us lean towards XenServer.
To add to this, the XenCenter software was very simple to use. The way that we were able to quickly create and manage pools of servers and simply connect to the console seemed to address the features we would need, without overcomplicating things like the VMware software did. And I liked the simple, fast interface.
And finally, since we don’t like to have Windows or GUI interfaces in our environment, we loved that the hypervisor is a Linux install we can log into and run ‘xe’ commands on. This makes XenServer very scriptable.
XenServer is scriptable
Looking back, the reason we have created so many blog posts about XenServer is simply that it is so easy to do. As we have run into things that we have had difficulty doing, it has been simple to document the process of figuring them out, because we have the option to simply cut and paste our command line history. This seems so much easier than creating picture snippets of a GUI based management system, and it makes it simple to turn our documentation of troubleshooting an issue into a blog post.
Solutions to Problems are easy to forget
When we find a solution to a problem, it can be very easy to implement and forget. What happens then is that we end up doing the same research a year later to find a solution to the same problem. This is one of the reasons that many of our blog posts are not polished; the posts just read like a stream-of-consciousness troubleshooting session. We are not expert article writers, we are expert Website Developers, Server Administrators and technical implementers. However, we recognized that when we solve a difficult problem, if we document that problem in a place that is easy to find (our own blog), we can easily come back to it. We simply search our own blog for it.
All of our blog topics
So really, the reasons above apply to many of our blog topics.
- The topic is easy to script or describe in text (without pictures), so we are able to cut and paste
- The solution is one that we want to be able to easily find and apply again
Examples of XenServer Blog Posts
- Script for Patching XenServer 6.5
- Changing IP Addresses on a XenServer 6.5 Pool
- Adding and Removing Local Storage from XenServer
- Automating patch installation on XenServer
- Disk write speed on XenServer – single vs mdadm vs hardware raid
- Creating a Bootable USB Install Thumb drive for XenServer
- Deleting Orphaned Disks in Citrix XenServer
- Promoting a XenServer host to pool master
- All XenServer posts
manually removing a pool slave from a pool in XenCenter
Problem: The pool master was lost or the ip address was changed. Upon bootup of one of the pool’s slaves, it came up with no management network, and no network interfaces to configure.
Resolution:
MAKE SURE YOUR VMs ARE BACKED UP!!!! LOCAL STORAGE WILL GO AWAY AFTER THIS AND WILL HAVE TO BE RE-CREATED.
Remove the slave server from XenCenter.
At the slave console’s main menu, go to “Network and Management Interface”, “Emergency Network Reset”
Log in, and walk through the steps of re-assigning your address. Go ahead and enter an address for the master when prompted.
The server will reboot.
Go to “Local Command Shell” on the main menu.
Check the state of the server:
xe host-is-in-emergency-mode
answer: true
because the server is still in emergency mode, we need to edit the pool.conf.
nano /etc/xensource/pool.conf
It will probably reference “slave” and whatever address you defined as your master.
Remove all entries and add : master
save the conf file with Ctrl + o, exit with Ctrl + x
Rename the state.db with this command.
mv /var/xapi/state.db /var/xapi/state.db-old
Exit to the main console with xsconsole.
reboot it, and you should be able to re-add it to XenCenter and your pool.
More on changing ip addresses here:
http://support.citrix.com/article/CTX123477
Adding your local storage back to the xenserver:
Once you’ve re-added your server to XenCenter, you’ll notice that your storage devices are gone. To re-add them:
On the console tab of the server you just added, you can list your devices with:
cat /proc/partitions
get your device id’s with:
ll /dev/disk/by-id
Execute the following command:
xe sr-create content-type=user device-config:device=/dev/disk/by-id/<device ID from the list from the previous command> host-uuid=<ID can be copied and pasted from the “general” tab> name-label="Give It a Name" shared=false type=lvm
If you’re trying to add the disk with the system on it, you’ll have to select the partition rather than the whole disk:
xe sr-create content-type=user device-config:device=/dev/disk/by-id/<device ID for the partition from the list from the previous command> host-uuid=<ID can be copied and pasted from the “general” tab> name-label="Give It a Name" shared=false type=lvm
This might at least allow you to get any files on that storage off to a more stable place. With a server in this condition, I would recommend reloading XenServer once you’ve taken everything that you need off of it.
Matt Long
02/24/2015
In XenCenter Console – mount DVD drive in Ubuntu 14.04
When running Ubuntu 14.04 LTS as a guest under XenServer 6.5, I was attempting to install xs-tools.iso by mounting it into the server using the drop-down box.
However, at the console I was unable to find /dev/cdrom or /dev/dvd* or /dev/sr* or anything that seemed to fit.
So I ran fdisk -l
#fdisk -l
and I found a disk I didn’t recognize
Disk /dev/xvdd: 119 MB, 119955456 bytes
255 heads, 63 sectors/track, 14 cylinders, total 234288 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/xvdd doesn't contain a valid partition table
So I mounted it and looked at the contents
#mount /dev/xvdd /mnt
#ls /mnt
dr-xr-xr-x 4 root root 2048 Jan 27 04:08 Linux
-r--r--r-- 1 root root 1180 Jan 27 04:08 README.txt
-r--r--r-- 1 root root 65 Jan 27 04:07 AUTORUN.INF
-r--r--r-- 1 root root 802816 Jan 27 04:07 citrixguestagentx64.msi
-r--r--r-- 1 root root 802816 Jan 27 04:07 citrixguestagentx86.msi
-r--r--r-- 1 root root 278528 Jan 27 04:07 citrixvssx64.msi
-r--r--r-- 1 root root 253952 Jan 27 04:07 citrixvssx86.msi
-r--r--r-- 1 root root 1925120 Jan 27 04:07 citrixxendriversx64.msi
-r--r--r-- 1 root root 1486848 Jan 27 04:07 citrixxendriversx86.msi
-r--r--r-- 1 root root 26 Jan 27 04:07 copyright.txt
-r--r--r-- 1 root root 831488 Jan 27 04:07 installwizard.msi
-r-xr-xr-x 1 root root 50449456 Jan 27 04:03 dotNetFx40_Full_x86_x64.exe
-r-xr-xr-x 1 root root 1945 Jan 27 04:03 EULA_DRIVERS
-r-xr-xr-x 1 root root 1654835 Jan 27 04:03 xenlegacy.exe
-r-xr-xr-x 1 root root 139542 Jan 27 04:03 xluninstallerfix.exe
So I found it! Now just to install the tools and reboot
#cd Linux && ./install.sh
#reboot
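The manual search through the fdisk output can be narrowed down with a small filter. This is a sketch under one assumption: that the tools ISO shows up as a disk that fdisk reports as having no valid partition table, as it did here. The function name is made up for illustration.

```shell
# Hypothetical helper: read "fdisk -l" output on stdin and print every
# device that fdisk reports as having no valid partition table.
# "doesn.t" matches the apostrophe without needing to quote it.
find_unpartitioned_disks() {
    awk '/doesn.t contain a valid partition table/ {print $2}'
}
```

For example, running fdisk -l 2>&1 | find_unpartitioned_disks on the guest above should print /dev/xvdd, assuming no other unpartitioned disks are attached.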
XenCenter – missing ‘Logs’ tab
XenCenter has moved the status of actions for each physical host and VM away from the very intuitive ‘Logs’ tab where it used to be. Here is where they moved it.
- At the bottom of the left pane there is an option called ‘Notifications’. When you click it, you are automatically shown all of the alerts (such as the status changes).
- At the top of the left pane, once you have clicked on Notifications, you will notice that it has given you three options: “Alerts”, “Updates” and “Events”.
- If you click on “Events” you will see the status of ongoing ‘Exports’ or transfers or other actions.
Script for Patching XenServer 6.5
Here’s a little script that you can run at the dom0 console to automate loading patches on a fresh installation of XenServer 6.5 up to patch XS65E005. If they add more patches, just add more lines referencing the new patch name (e.g. XS65E006, etc.), starting with the “wget” command and ending with the “rm -f *.xsupdate” command.
#!/bin/bash
wget http://downloadns.citrix.com.edgesuite.net/akdlm/10194/XS65E001.zip
unzip XS65E001.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E001.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate
wget http://downloadns.citrix.com.edgesuite.net/akdlm/10195/XS65E002.zip
unzip XS65E002.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E002.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate
wget http://downloadns.citrix.com.edgesuite.net/akdlm/10196/XS65E003.zip
unzip XS65E003.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E003.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate
wget http://downloadns.citrix.com.edgesuite.net/akdlm/10201/XS65E005.zip
unzip XS65E005.zip
xe patch-apply uuid=`xe patch-upload file-name=XS65E005.xsupdate 2>&1|tail -1|awk -F" " '{print $NF}'` host-uuid=`grep -B1 -f /etc/hostname <(xe host-list)|head -n1|awk '{print $NF}'`
rm -f *.zip
rm -f *.xsupdate
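Since every patch repeats the same five steps, the script above could also be written as a loop. The following is a sketch, not a tested replacement: PATCHES, BASE_URL, RUN_PATCHES and the function names are illustrative, the directory/patch pairs are copied from the URLs above, and the xe pipelines are the same ones used in the script. Nothing is downloaded or applied unless RUN_PATCHES=1 is set.

```shell
# Each entry is "<download directory>/<patch name>", taken from the URLs above.
PATCHES="10194/XS65E001 10195/XS65E002 10196/XS65E003 10201/XS65E005"
BASE_URL="http://downloadns.citrix.com.edgesuite.net/akdlm"

# Build the download URL for one entry.
patch_url() {
    printf '%s/%s.zip\n' "$BASE_URL" "$1"
}

# Download, upload and apply a single patch, then clean up its files.
apply_patch() {
    ap_name=${1##*/}
    wget "$(patch_url "$1")" || return 1
    unzip "$ap_name.zip" || return 1
    ap_patch_uuid=$(xe patch-upload file-name="$ap_name.xsupdate" 2>&1 | tail -1 | awk '{print $NF}')
    ap_host_uuid=$(xe host-list | grep -B1 -f /etc/hostname | head -n1 | awk '{print $NF}')
    xe patch-apply uuid="$ap_patch_uuid" host-uuid="$ap_host_uuid" || return 1
    rm -f "$ap_name.zip" "$ap_name.xsupdate"
}

# Guard so that sourcing this file does not start patching by accident.
if [ "${RUN_PATCHES:-0}" = "1" ]; then
    for entry in $PATCHES; do
        apply_patch "$entry" || exit 1
    done
fi
```

Adding a new patch then means appending one entry to PATCHES instead of copying five lines.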
Changing IP Addresses on a XenServer 6.5 Pool
To change the ip addresses on a XenServer 6.5 pool, start with the slaves, and use the following xe commands:
Remember: Slaves first, then the Master
NOTE: There is no need to change the IP from the Management Console.
Find the UUID of the Host Management PIF:
xe pif-list params=uuid,host-name-label,device,management
You will see a big list. Find the UUID for the slave that you’re working on. Use the “more” pipe if the UUID for your particular slave scrolls off the screen:
xe pif-list params=uuid,host-name-label,device,management | more
Change the IP Address on the first slave:
xe pif-reconfigure-ip uuid=<UUID of host management PIF> IP=<New IP> gateway=<GatewayIP> netmask=<Subnet Mask> DNS=<DNS Lookup IPs> mode=<dhcp,none,static>
Then:
xe-toolstack-restart
Verify the new address with ifconfig, and/or ping it from a workstation.
Point the slave to the new Master IP Address:
xe pool-emergency-reset-master master-address=NEW_IP_OF_THE_MASTER
Repeat the commands above on all slaves.
On the Master:
xe pif-list params=uuid,host-name-label,device,management
xe pif-reconfigure-ip uuid=<UUID of host management PIF> IP=<New IP> gateway=<GatewayIP> netmask=<Subnet Mask> DNS=<DNS Lookup IPs> mode=<dhcp,none,static>
xe-toolstack-restart
DO NOT run the emergency-reset-master command on the Master.
Reboot the Master, then reboot the Slaves and verify that they can find the Master.
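When several hosts need the same change, it can help to generate the reconfigure command from variables rather than editing the placeholders by hand each time. This is a minimal sketch; the function name is hypothetical, it assumes a static address (mode=static), and it only prints the command so it can be checked before it is pasted into the console.

```shell
# Hypothetical helper: print the "xe pif-reconfigure-ip" command for one host.
# Arguments: PIF UUID, new IP, gateway, netmask, DNS server(s).
build_pif_reconfigure_cmd() {
    if [ "$#" -ne 5 ]; then
        echo "usage: build_pif_reconfigure_cmd PIF_UUID IP GATEWAY NETMASK DNS" >&2
        return 1
    fi
    printf 'xe pif-reconfigure-ip uuid=%s IP=%s gateway=%s netmask=%s DNS=%s mode=static\n' \
        "$1" "$2" "$3" "$4" "$5"
}
```

The printed line is the same command shown above with the placeholders filled in; the xe-toolstack-restart and pool-emergency-reset-master steps still have to be run by hand.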
Matt Long
04/06/2015
Using MPT-Status for RAID Monitoring in a Poweredge C6100 with Perc 6
This post outlines the steps needed to get a CLI report of the conditions of your RAIDs in a Poweredge C6100 with a PERC 6/i RAID Controller.
Verify your controller type:
cat /proc/scsi/mptsas/0
ioc0: LSISAS1068E B3, FwRev=011b0000h, Ports=1, MaxQ=277
Download the following packages:
daemonize-1.5.6-1.el5.i386.rpm mpt-status-1.2.0-3.el5.centos.i386.rpm lsscsi-0.17-3.el5.i386.rpm
http://dl.nux.ro/utils/mpt-status/mpt-status-1.2.0-3.el5.centos.i386.rpm
http://dl.nux.ro/utils/mpt-status/daemonize-1.5.6-1.el5.i386.rpm
http://mirror.centos.org/centos/5/os/i386/CentOS/lsscsi-0.17-3.el5.i386.rpm
Install mpt-status:
rpm -ivh mpt-status-1.2.0-3.el5.centos.i386.rpm daemonize-1.5.6-1.el5.i386.rpm lsscsi-0.17-3.el5.i386.rpm
modprobe mptctl
echo mptctl >> /etc/modules
Verify your modules:
lsmod |grep mpt
mptctl 90739 0
mptsas 57560 4
mptscsih 39876 1 mptsas
mptbase 91081 3 mptctl,mptsas,mptscsih
scsi_transport_sas 27681 1 mptsas
scsi_mod 145658 7 mptctl,sg,libata,mptsas,mptscsih,scsi_transport_sas,sd_mod
run:
mpt-status or mpt-status -n -s
Also, you can use: lsscsi -l
This little script:
echo `mpt-status -n -s|awk '/OPTIMAL/ {print $1, "OK"}; /ONLINE/ {print $1, "OK"}; /DEGRADED/ {print $1, "FAILURE"}; /scsi/ {print $2}; /MISSING/ {print $1, "FAILURE"} '`
reports:
vol_id:0 OK phys_id:1 OK phys_id:0 OK 100% 100%
On a rebuild, it reports:
vol_id:0 FAILURE phys_id:2 OK phys_id:3 OK 75% 75%
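To turn that one-line report into something a monitoring check can act on, the line can be mapped to the standard Nagios plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). This is a sketch only: the function name is made up, and nagios-stat on our server may interpret the line differently.

```shell
# Hypothetical helper: classify the one-line report shown above.
# $1 is the report line; any FAILURE marks the whole report as CRITICAL.
raid_nagios_status() {
    case "$1" in
        *FAILURE*) echo "CRITICAL - $1"; return 2 ;;
        *OK*)      echo "OK - $1"; return 0 ;;
        *)         echo "UNKNOWN - no RAID status found"; return 3 ;;
    esac
}
```

With the two sample lines above, the first is reported as OK and the rebuild line as CRITICAL.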
Copy that script into a file called “check_raid”, and make it executable, e.g. chmod 755.
Edit nagios-statd on parcel1. At line 20, replace “sudo /customcommands/check_raid.pl -b -w1 -c1” with the filename /customcommands/check_raid (without “sudo” and without the switches).
So, from this:
commandlist['Linux'] = ("df -P","who -q | grep \"#\"","ps ax","uptime","free | awk '$1~/^Swap:/{print ($3/$2)*100}'","sudo /customcommands/check_raid.pl -b -w1 -c1")
To this:
commandlist['Linux'] = ("df -P","who -q | grep \"#\"","ps ax","uptime","free | awk '$1~/^Swap:/{print ($3/$2)*100}'","/customcommands/check_raid")
Port 1040 will need to be opened in XenServer. Edit /etc/sysconfig/iptables and insert this line:
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 1040 -j ACCEPT
Restart the firewall:
service iptables restart
Output:
Flushing firewall rules: [ OK ]
Setting chains to policy ACCEPT: filter [ OK ]
Unloading iptables modules: [ OK ]
Applying iptables firewall rules: [ OK ]
Loading additional iptables modules: ip_conntrack_netbios_n[FAILED]
NOTE: The “FAILED” error above doesn’t seem to be a problem.
Verify that port 1040 is open by checking the firewall status:
service iptables status
Output:
Table: filter
Chain INPUT (policy ACCEPT)
num target prot opt source destination
1 ACCEPT 47 -- 0.0.0.0/0 0.0.0.0/0
2 RH-Firewall-1-INPUT all -- 0.0.0.0/0 0.0.0.0/0
Chain FORWARD (policy ACCEPT)
num target prot opt source destination
1 RH-Firewall-1-INPUT all -- 0.0.0.0/0 0.0.0.0/0
Chain OUTPUT (policy ACCEPT)
num target prot opt source destination
Chain RH-Firewall-1-INPUT (2 references)
num target prot opt source destination
1 ACCEPT all -- 0.0.0.0/0 0.0.0.0/0
2 ACCEPT icmp -- 0.0.0.0/0 0.0.0.0/0 icmp type 255
3 ACCEPT esp -- 0.0.0.0/0 0.0.0.0/0
4 ACCEPT ah -- 0.0.0.0/0 0.0.0.0/0
5 ACCEPT udp -- 0.0.0.0/0 224.0.0.251 udp dpt:5353
6 ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:631
7 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:631
8 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:1040
9 ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED
10 ACCEPT udp -- 0.0.0.0/0 0.0.0.0/0 state NEW udp dpt:694
11 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:22
12 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:80
13 ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:443
14 REJECT all -- 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited
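Rather than reading the rule list by eye, the check for the port can be scripted. This is a small sketch with a made-up function name; it assumes the status output has the same layout as the listing above, with a “tcp dpt:PORT” field at the end of the ACCEPT rule.

```shell
# Hypothetical helper: succeed if the iptables status text on stdin
# contains an ACCEPT rule for the given TCP port.
# $1 = port number. The trailing "( |$)" stops 104 from matching 1040.
port_is_open() {
    grep -Eq "ACCEPT +tcp .*dpt:$1( |\$)"
}
```

For example: service iptables status | port_is_open 1040 && echo "port 1040 is open"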
Running “nagios-statd” opens port 1040 on Parcel1 and listens for commands to be initiated by nagios-stat on the nagios server.
On the nagios server, in a file called “remote.orig.cfg”, there are commands defined using “nagios-stat”. NOTE: These are from a working server and haven’t been modified to work with mpt. Some changes may need to be made. This is just an example of the interaction between Nagios server and client.
Example:
define command{
command_name check_remote_raid
command_line $USER1$/nagios-stat -w $ARG1$ -c $ARG2$ -p $ARG3$ raid $HOSTADDRESS$
}
This command defined above is used in the “services.cfg” file.
Example:
define service{
use matraex-template
host_name mtx-lilac
service_description Lilac /data Raid
check_command check_remote_raid!1!1!1040
}
The three files needed on the C6100 node are:
/customcommands/check_raid (contents below) -rwxr-xr-x
/customcommands/nagios-statd (contents below) -rwxr-xr-x
/etc/init.d/nagios-statd (contents below) -rwxr--r--
Creating the soft links:
ln -s /etc/init.d/nagios-statd /etc/rc.d/rc3.d/K01nagios-statd
ln -s /etc/init.d/nagios-statd /etc/rc.d/rc3.d/S99nagios-statd
The -s = soft, and -f if used, forces overwrite.
/rc3.d/ designates runlevel 3
So when you do this:
ls -lt /customcommands/nagios-statd /etc/init.d/nagios-statd /customcommands/check_raid /etc/rc.d/rc3.d/*nagios-statd
This is what you should see:
lrwxrwxrwx 1 root root 22 Mar 6 08:08 /etc/rc.d/rc3.d/K01nagios-statd -> ../init.d/nagios-statd
-rwxr-xr-x 1 root root 365 Mar 6 07:59 /customcommands/check_raid
lrwxrwxrwx 1 root root 22 Mar 6 07:52 /etc/rc.d/rc3.d/S99nagios-statd -> ../init.d/nagios-statd
-rwxr-xr-x 1 root root 649 Mar 6 07:51 /etc/init.d/nagios-statd
-rwxr-xr-x 1 root root 9468 Mar 5 12:05 /customcommands/nagios-statd
Script Files:
NOTE: Here’s a little fix that helped me out. I had originally pasted these scripts into a DOS/Windows editor (wordpad) and it added DOS-type returns to the file, resulting in an error:
-bash: ./nagios-statd: /bin/sh^M: bad interpreter: No such file or directory
If you encounter this, do this:
Open the file in vi
hit “:” to go into command mode
enter “set fileformat=unix”
then :wq to save and quit.
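The same fix can be applied without opening an editor, which is handy when several scripts were pasted the same way. This is a sketch with a made-up function name; it strips every carriage return from the file and writes the result back in place, so the file keeps its permissions.

```shell
# Hypothetical helper: remove DOS carriage returns from a script in place.
# $1 = path of the file to fix.
strip_cr() {
    sc_tmp=$(mktemp) || return 1
    if tr -d '\r' < "$1" > "$sc_tmp"; then
        # Write back with cat so the original file keeps its mode and owner.
        cat "$sc_tmp" > "$1"
        sc_rc=$?
    else
        sc_rc=1
    fi
    rm -f "$sc_tmp"
    return $sc_rc
}
```

For example: strip_cr /customcommands/nagios-statd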
/customcommands/check_raid:
#!/bin/bash
EXECFILE=/usr/sbin/mpt-status
if [ ! -e $EXECFILE ] ; then
    echo
    echo "Error $EXECFILE is not installed, please install before running"
    echo
    echo "Usage $0";
    echo
    exit 10
fi
echo `$EXECFILE -n -s|awk '/OPTIMAL/ {print $1, "OK"}; /ONLINE/ {print $1, "OK"}; /DEGRADED/ {print $1, "FAILURE"}; /scsi/ {print $2};
/MISSING/ {print $1, "FAILURE"} '`
/customcommands/nagios-statd:
#!/usr/bin/python
import getopt, os, sys, signal, socket, SocketServer

class Functions:
    "Contains a set of methods for gathering data from the server."
    def __init__(self):
        self.nagios_statd_version = 3.09
        # As of right now, the commands are for df, who, proc, uptime, and swap.
        # The Linux entry has a sixth command (index 5) for the RAID check.
        commandlist = {}
        commandlist['AIX'] = ("df -Ik","who | wc -l","ps ax","uptime","lsps -sl | grep -v Paging | awk '{print $2}' | cut -f1 -d%")
        commandlist['BSD/OS'] = ("df","who | wc -l","ps -ax","uptime",None)
        commandlist['CYGWIN_NT-5.0'] = ("df -P",None,"ps -s -W | awk '{printf(\"%6s%6s%3s%6s%s\\n\",$1,$2,\" S\",\" 0:00\",substr($0,22))}'",None,None)
        commandlist['CYGWIN_NT-5.1'] = commandlist['CYGWIN_NT-5.0']
        commandlist['FreeBSD'] = ("df -k","who | wc -l","ps ax","uptime","swapinfo | awk '$1!~/^Device/{print $5}'")
        commandlist['HP-UX'] = ("bdf -l","who -q | grep \"#\"","ps -el","uptime",None)
        commandlist['IRIX'] = ("df -kP","who -q | grep \"#\"","ps -e -o \"pid tty state time comm\"","/usr/bsd/uptime",None)
        commandlist['IRIX64'] = commandlist['IRIX']
        commandlist['Linux'] = ("df -P","who -q | grep \"#\"","ps ax","uptime","free | awk '$1~/^Swap:/{print ($3/$2)*100}'","/customcommands/check_raid")
        commandlist['NetBSD'] = ("df -k","who | wc -l","ps ax","uptime","swapctl -l | awk '$1!~/^Device/{print $5}'")
        commandlist['NEXTSTEP'] = ("df","who | /usr/ucb/wc -l","ps -ax","uptime",None)
        commandlist['OpenBSD'] = ("df -k","who | wc -l","ps -ax","uptime","swapctl -l | awk '$1!~/^Device/{print $5}'")
        commandlist['OSF1'] = ("df -P","who -q | grep \"#\"","ps ax","uptime",None)
        commandlist['SCO-SV'] = ("df -Bk","who -q | grep \"#\"","ps -el -o \"pid tty s time args\"","uptime",None)
        commandlist['SunOS'] = ("df -k","who -q | grep \"#\"","ps -e -o \"pid tty s time comm\"","uptime","swap -s | tr -d -s -c [:digit:][:space:] | nawk '{print ($3/($3+$4))*100}'")
        commandlist['UNIXWARE2'] = ("/usr/ucb/df","who -q | grep \"#\"","ps -el | awk '{printf(\"%6d%9s%2s%5s %s\\n\",$5,substr($0, 61, 8),$2,substr($0,69,5),substr($0,75))}'","echo `uptime`, load average: 0.00, `sar | awk '{oldidle=idle;idle=$5} END {print 100-oldidle}'`,0.00",None)
        # Now to make commandlist with the correct one for your OS.
        try:
            self.commandlist = commandlist[os.uname()[0]]
        except KeyError:
            print "Your platform isn't supported by nagios-statd - exiting."
            sys.exit(3)

    # Below are the functions that the client can call.
    def disk(self):
        return self.__run(0)

    def proc(self):
        return self.__run(2)

    def swap(self):
        return self.__run(4)

    def uptime(self):
        return self.__run(3)

    def user(self):
        return self.__run(1)

    def raid(self):
        return self.__run(5)

    def version(self):
        i = "nagios-statd " + str(self.nagios_statd_version)
        return i

    def __run(self,cmdnum):
        # Unmask SIGCHLD so popen can detect the return status (temporarily)
        signal.signal(signal.SIGCHLD, signal.SIG_DFL)
        outputfh = os.popen(self.commandlist[cmdnum])
        output = outputfh.read()
        returnvalue = outputfh.close()
        signal.signal(signal.SIGCHLD, signal.SIG_IGN)
        if (returnvalue):
            return "ERROR %s " % output
        else:
            return output
class NagiosStatd(SocketServer.StreamRequestHandler):
    "Handles connection initialization and data transfer (as daemon)"
    def handle(self):
        # Check to see if user is allowed
        if self.__notallowedhost():
            self.wfile.write(self.error)
            return 1
        if not hasattr(self,"generichandler"):
            self.generichandler = GenericHandler(self.rfile,self.wfile)
        self.generichandler.run()

    def __notallowedhost(self):
        "Compares list of allowed users to client's IP address."
        if hasattr(self.server,"allowedhosts") == 0:
            return 0
        for i in self.server.allowedhosts:
            if i == self.client_address[0]: # Address is in list
                return 0
            try: # Do an IP lookup of host in blocked list
                i_ip = socket.gethostbyname(i)
            except:
                self.error = "ERROR DNS lookup of blocked host \"%s\" failed. Denying by default." % i
                return 1
            if i_ip != i: # If address in list isn't an IP
                if socket.getfqdn(i) == socket.getfqdn(self.client_address[0]):
                    return 0
        self.error = "ERROR Client is not among hosts allowed to connect."
        return 1
class GenericHandler:
    def __init__(self,rfile=sys.stdin,wfile=sys.stdout):
        # Create functions object
        self.functions = Functions()
        self.rfile = rfile
        self.wfile = wfile

    def run(self):
        # Get the request from the client
        line = self.rfile.readline()
        line = line.strip()
        # Check for appropriate requests from client
        if len(line) == 0:
            self.wfile.write("ERROR No function requested from client.")
            return 1
        # Call the appropriate function
        try:
            output = getattr(self.functions,line)()
        except AttributeError:
            error = "ERROR Function \"" + line + "\" does not exist."
            self.wfile.write(error)
            return 1
        except TypeError:
            error = "ERROR Function \"" + line + "\" not supported on this platform."
            self.wfile.write(error)
            return 1
        # Send output
        if output.isspace():
            error = "ERROR Function \"" + line + "\" returned no information."
            self.wfile.write(error)
            return 1
        elif output == "ERROR":
            error = "ERROR Function \"" + line + "\" exited abnormally."
            self.wfile.write(error)
        else:
            for line in output:
                self.wfile.write(line)
class ReUsingServer (SocketServer.ForkingTCPServer):
allow_reuse_address = True
class Initialization:
    "Methods for interacting with user - initial code entry point."
    def __init__(self):
        self.port = 1040
        self.ip = ''
        # Run this through Functions initially, to make sure the platform is supported.
        i = Functions()
        del(i)
    def getoptions(self):
        "Parses command line"
        try:
            opts, args = getopt.getopt(sys.argv[1:], "a:b:ip:P:Vh", ["allowedhosts=","bindto=","inetd","port=","pid=","version","help"])
        except getopt.GetoptError, (msg, opt):
            print sys.argv[0] + ": " + msg
            print "Try '" + sys.argv[0] + " --help' for more information."
            sys.exit(3)
        for option,value in opts:
            if option in ("-a","--allowedhosts"):
                value = value.replace(" ","")
                self.allowedhosts = value.split(",")
            elif option in ("-b","--bindto"):
                self.ip = value
            elif option in ("-i","--inetd"):
                self.runfrominetd = 1
            elif option in ("-p","--port"):
                self.port = int(value)
            elif option in ("-P","--pid"):
                self.pidfile = value
            elif option in ("-V","--version"):
                self.version()
                sys.exit(3)
            elif option in ("-h","--help"):
                self.usage()
    def main(self):
        # Retrieve command line options
        self.getoptions()
        # Just splat to stdout if we're running under inetd
        if hasattr(self,"runfrominetd"):
            server = GenericHandler()
            server.run()
            sys.exit(0)
        # Check to see if the port is available
        try:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            s.bind((self.ip, self.port))
            s.close()
            del(s)
        except socket.error, (errno, msg):
            print "Unable to bind to port %s: %s - exiting." % (self.port, msg)
            sys.exit(2)
        # Detach from terminal
        if os.fork() == 0:
            # Make this the controlling process
            os.setsid()
            # Be polite and chdir to /
            os.chdir('/')
            # Try to close all open filehandles
            for i in range(0,256):
                try: os.close(i)
                except: pass
            # Redirect the offending filehandles
            sys.stdin = open('/dev/null','r')
            sys.stdout = open('/dev/null','w')
            sys.stderr = open('/dev/null','w')
            # Set the path
            os.environ["PATH"] = "/bin:/usr/bin:/usr/local/bin:/usr/sbin"
            # Reap children automatically
            signal.signal(signal.SIGCHLD, signal.SIG_IGN)
            # Save pid if user requested it
            if hasattr(self,"pidfile"):
                self.savepid(self.pidfile)
            # Create a forking TCP/IP server and start processing
            server = ReUsingServer((self.ip,self.port),NagiosStatd)
            if hasattr(self,"allowedhosts"):
                server.allowedhosts = self.allowedhosts
            server.serve_forever()
        # Get rid of the parent
        else:
            sys.exit(0)
    def savepid(self,file):
        try:
            fh = open(file,"w")
            fh.write(str(os.getpid()))
            fh.close()
        except:
            print "Unable to save PID file - exiting."
            sys.exit(2)
    def usage(self):
        print "Usage: " + sys.argv[0] + " [OPTION]"
        print "nagios-statd daemon - remote UNIX system monitoring tool for Nagios.\n"
        print "-a, --allowedhosts=HOSTS   Comma delimited list of IPs/hosts allowed to connect."
        print "-b, --bindto=IP            IP address for the daemon to bind to."
        print "-i, --inetd                Run from inetd."
        print "-p, --port=PORT            Port to listen on."
        print "-P, --pid=FILE             Save pid to FILE."
        print "-V, --version              Output version information and exit."
        print "-h, --help                 Print this help and exit."
        sys.exit(3)
    def version(self):
        i = Functions()
        print "nagios-statd %.2f" % i.nagios_statd_version
        print "os.uname()[0] = %s " % os.uname()[0]
        print "Written by Nick Reinking\n"
        print "Copyright (C) 2002 Nick Reinking"
        print "This is free software. There is NO warranty; not even for MERCHANTABILITY or"
        print "FITNESS FOR A PARTICULAR PURPOSE."
        print "\nNagios is a trademark of Ethan Galstad."
if __name__ == "__main__":
    # Check to see if running Python 2.x+ / needed because getfqdn() is Python 2.0+ only
    if (int(sys.version[0]) < 2):
        print "nagios-statd requires Python version 2.0 or greater."
        sys.exit(3)
    i = Initialization()
    i.main()
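The allowed-hosts check in the script above can be tried out on its own before deploying the daemon. The following is a minimal sketch of the same logic, written for Python 3 rather than the Python 2 of the original script. The function name `is_allowed` and the injectable `resolve` and `fqdn` parameters are my own assumptions for illustration; they are not part of nagios-statd.

```python
import socket


def is_allowed(client_ip, allowed_hosts,
               resolve=socket.gethostbyname, fqdn=socket.getfqdn):
    """Return True if client_ip matches an entry in allowed_hosts.

    resolve and fqdn are hypothetical hooks (not in nagios-statd) so the
    logic can be exercised without real DNS lookups.
    """
    if allowed_hosts is None:
        return True  # no -a option given: every client may connect
    for entry in allowed_hosts:
        if entry == client_ip:
            return True  # literal IP match
        try:
            entry_ip = resolve(entry)
        except OSError:
            return False  # lookup failed: deny by default, like the script
        # Entry is a host name rather than an IP: compare the FQDNs.
        if entry_ip != entry and fqdn(entry) == fqdn(client_ip):
            return True
    return False
```

Called as `is_allowed("192.0.2.10", ["192.0.2.10", "nagios.example.com"])`, it would use the real resolver; the addresses and host name here are placeholders. Note that, as in the script, a failed lookup on any entry denies the client straight away, even if a later entry in the list would have matched.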
/etc/init.d/nagios-statd:
#!/bin/sh
#
# This file should have uid root, gid sys and chmod 744
#
if [ ! -d /usr/bin ]
then # /usr not mounted
    exit
fi
killproc() { # kill the named process(es)
    pid=`/bin/ps -e |
         /bin/grep -w $1 |
         /bin/sed -e 's/^ *//' -e 's/ .*//'`
    [ "$pid" != "" ] && kill $pid
}
# Start/stop processes required for netsaint_statd server
case "$1" in
'start')
    /customcommands/nagios-statd -a <IP of Allowed Nagios Server>,<IP of Test Workstation> -p 1040
    ;;
'stop')
    killproc nagios-statd
    ;;
*)
    echo "Usage: /etc/init.d/nagios-statd { start | stop }"
    ;;
esac
Testing:
As you can see in the script file above, I’ve added the IP Address of a test workstation. This will allow me to simply telnet to a node in the C6100 and execute one of the commands defined in this section of the /customcommands/nagios-statd script:
    # Below are the functions that the client can call.
    def disk(self):
        return self.__run(0)
    def proc(self):
        return self.__run(2)
    def swap(self):
        return self.__run(4)
    def uptime(self):
        return self.__run(3)
    def user(self):
        return self.__run(1)
    def raid(self):
        return self.__run(5)
At your workstation, telnet to <Node IP Address> 1040
When connected, the screen will be blank.
Type “raid”. The screen won’t echo this.
When you hit Enter, you should see:
vol_id:0 OK phys_id:2 OK phys_id:3 OK 100% 100%
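If telnet isn't installed on the test workstation, the same check can be scripted. This is a rough sketch in Python 3, assuming the daemon behaves the way GenericHandler suggests: it reads one line, writes the reply, and closes the connection. The function name `query_statd` and the 10-second timeout are my own choices, not part of nagios-statd.

```python
import socket


def query_statd(host, command, port=1040, timeout=10.0):
    """Send one command line to a nagios-statd daemon and return its reply.

    Assumes the daemon answers a single newline-terminated command and
    then closes the connection.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # The daemon uses readline(), so the command must end with a newline.
        sock.sendall(command.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # daemon closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace").strip()
```

Something like `print(query_statd("<Node IP Address>", "raid"))` should then print the same line that telnet shows, provided the workstation's IP is in the daemon's allowed hosts list.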
Now you’re ready to move on to the Nagios configuration.
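Before wiring this into Nagios, it can help to see how a reply like the one above might be turned into a plugin result. The sketch below is Python 3 and rests on two assumptions of mine: that each `name:id STATE` pair in the reply describes one volume or physical disk, and that any state other than OK should be treated as a failure. The function name `raid_status` is hypothetical, and the trailing percentage fields are ignored because their meaning isn't covered here. The exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin return codes.

```python
import re

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios plugin exit codes


def raid_status(reply):
    """Map a raid reply string to (exit_code, message).

    Assumes each "name:id STATE" pair is one volume or disk and that any
    STATE other than "OK" means trouble.
    """
    if reply.startswith("ERROR"):
        return UNKNOWN, reply
    pairs = re.findall(r"(\w+:\d+)\s+(\S+)", reply)
    if not pairs:
        return UNKNOWN, "unrecognized reply: " + reply
    bad = [name for name, state in pairs if state != "OK"]
    if bad:
        return CRITICAL, "RAID CRITICAL - not OK: " + ", ".join(bad)
    return OK, "RAID OK - %d components OK" % len(pairs)
```

For the sample reply shown above, this returns exit code 0 with a message counting three OK components.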
Matt Long
03/06/2015