Notes on Recovering from a XenServer Pool failure
For my pool I have 8 XenServers (plot1, plot2, plot3, plot4, plot5, plot6, plot7 and plot8).
At the start of my tests, plot1 is the pool master.
If the pool master goes down, another server must take over as master.
To simulate this, I just ran ‘shut down’ on the master host.
A big problem here is that all of the slaves in the pool disabled their management interfaces, so XenCenter could no longer connect to them (something I did not expect).
So I connected to another server in the pool, plot2, via SSH and checked its state:
xe host-is-in-emergency-mode
The server said FALSE!?! It didn’t even know that the pool was in trouble? So I ran pool-list:
xe pool-list
The command took a long time, so I figured I would stop it and put a time command in front of it to find out how long it really took:
time xe pool-list
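Because xe commands can hang like this when the master is gone, it may help to bound them with a time limit. The following is only a minimal sketch: it assumes GNU coreutils `timeout` is installed on the host (not verified for every XenServer release), and `run_bounded` is a hypothetical helper name, not part of the xe CLI.

```shell
# run_bounded SECONDS COMMAND [ARGS...]
# Runs COMMAND, but gives up after SECONDS. GNU timeout exits with
# status 124 when the limit is hit, which is how a hang is detected.
run_bounded() {
    limit="$1"
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Example (on a pool member): run_bounded 10 xe pool-list
```

A status of 124 then means "the command hung", which is a more useful signal than waiting on a prompt that never returns.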
It turns out that when I shut down the pool master, I am shutting down the pool! I am not simulating an error at all. Somehow the pool master notified the slaves that it was shutting down gracefully, telling them ‘don’t worry, I will be all right.’ The commands above never returned, so I just told plot2 to take over as master to see how we could recover from this situation.
xe pool-emergency-transition-to-master
At this point the pool was restored on plot2, but we still could not connect to the management interfaces of any of the other plots in the pool. XenCenter WAS able to connect to plot2, though, and it synchronized the entire pool, showing all of the other hosts (including plot1, which was the master previously) as down.
The other hosts in the pool are still running all of their services (SSH, Apache or whatever); they just cannot communicate about the pool, so I have to ‘recover’ them back into the pool.
On the new master I run:
xe pool-recover-slaves
This brings the slaves back into the pool so they are visible within XenCenter again. plot1, the original master, is still turned off, but it is visible as turned off in XenCenter, so I right-click on it and choose Power On. It begins booting and I hold my breath to see if there are any master-master conflicts, since the shut-down host thought it was the all-powerful one when it shut down.
Once it comes up (3 minutes later) I find that plot1 gracefully fell into place as a slave. So the moral of this story:
!Don’t shut down the pool master. If you do, you will lose XenCenter access to all of the hosts in the pool, so you MUST either 1) bring it back up immediately or 2) SSH to the console of another host and run #xe pool-emergency-transition-to-master and then #xe pool-recover-slaves. This will restore your pool minus the host that was originally the master. Reconnect XenCenter to the new pool master, then use XenCenter to power on the host that was the pool master.
!Best Practice: before stopping a host that is currently the pool master, connect to another host in the pool and run #xe pool-emergency-transition-to-master and then #xe pool-recover-slaves prior to shutting down the host.
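The two recovery commands can be wrapped so the second only runs if the first succeeds. This is a minimal sketch, not an official procedure: `recover_pool` and the overridable `XE` variable are hypothetical names introduced here, and the only xe calls used are the two from these notes.

```shell
# Path to the xe CLI. Overridable, e.g. XE=echo for a dry run.
XE="${XE:-xe}"

# recover_pool: run on the host that should become the new master.
# Returns 1 if the promotion fails, 2 if recovering the slaves fails.
recover_pool() {
    echo "promoting this host to pool master" >&2
    $XE pool-emergency-transition-to-master || return 1
    echo "asking the slaves to rejoin the pool" >&2
    $XE pool-recover-slaves || return 2
}
```

Running it with `XE=echo` prints the commands instead of executing them, which is a cheap way to check the sequence before trying it on a real pool.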
Well, so now that we know shutting down the master does not simulate a failure, we will have to use another ‘onsite’ method.
!Simulation2:
On plot2 (current pool master) I disconnected the ethernet cables.
Once again the XenCenter console can no longer connect to the pool, so I have to use SSH. This time I will connect to plot3 and find out what it thinks of the pool issue.
xe host-is-in-emergency-mode
This command returns false. Somehow the host thinks everything is okay. I run xe pool-list and xe host-list, both of which never return. Come on, host, shouldn’t you recognize a failure here?
I ping the IP of the pool master and the ping fails, but xe host-is-in-emergency-mode still returns false. For some reason, this host just does not think it has a problem.
So I guess I just can’t trust xe host-is-in-emergency-mode.
Even after 2 hours, xe host-is-in-emergency-mode still returns false.
So for monitoring, I will have to come up with some other method. But the rules for how to recover are the same:
xe pool-emergency-transition-to-master
xe pool-recover-slaves
This brought the pool up again on plot3 with plot3 as the new master.
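Since xe host-is-in-emergency-mode kept returning false, a monitoring check could instead combine the two symptoms seen above: the master's IP stops answering pings, and xe pool-list hangs. The following is only a sketch under assumptions: `master_status`, `XE` and `PING` are hypothetical names, the 2 and 10 second limits are arbitrary, the ping flags are the Linux iputils ones, and GNU `timeout` is assumed to be available.

```shell
XE="${XE:-xe}"                   # xe CLI, overridable for dry runs
PING="${PING:-ping -c 1 -W 2}"   # one probe, 2 second wait

# master_status MASTER_IP
# Prints "ok", "unreachable" or "xapi-unresponsive"
# and returns 0, 1 or 2 respectively.
master_status() {
    master_ip="$1"
    # Step 1: does the master's address answer at all?
    if ! $PING "$master_ip" >/dev/null 2>&1; then
        echo "unreachable"
        return 1
    fi
    # Step 2: does a pool query answer within 10 seconds instead of hanging?
    if ! timeout 10 $XE pool-list >/dev/null 2>&1; then
        echo "xapi-unresponsive"
        return 2
    fi
    echo "ok"
}
```

Run from cron on each slave with the master's IP as the argument, anything other than "ok" would be the cue to start the recovery steps by hand.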
Now the trick is to bring plot2 back. In this case plot2 never went offline, so it is still running without the ethernet cable plugged in. When I plug it back in, I may end up with some master-master conflicts ... here goes!
After reconnecting the ethernet cable to plot2 (the old master):
– plot3 did not automatically recognize that the host was back up. In fact, XenCenter still shows it in red as though it is shut down. I right-clicked on it and told it to power on, but it didn’t do anything but wait.
– plot2 did not make any changes. It appears they both happily think they are the master.
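One way to see what each host believes about its own role is to look at its local pool configuration file. This is a sketch resting on an assumption not taken from these notes: that the file lives at /etc/xensource/pool.conf and contains either `master` or `slave:<master-ip>`. `pool_role` is a hypothetical helper name.

```shell
# pool_role [FILE]
# Prints "master", "slave of <ip>" or "unknown" based on a
# pool.conf-style file (default path is an assumption).
pool_role() {
    conf="${1:-/etc/xensource/pool.conf}"
    role=$(cat "$conf" 2>/dev/null) || { echo "unknown"; return 1; }
    case "$role" in
        master)  echo "master" ;;
        slave:*) echo "slave of ${role#slave:}" ;;
        *)       echo "unknown"; return 1 ;;
    esac
}
```

In a split like the one described here, running this over SSH on both the old and the new master would be expected to report "master" on each.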
To test how the pool reacted, I attempted to disable one of the slaves from plot2 with xe host-disable uuid=xxxxxxxx. (My thought is that plot2 is incorrectly considered down and not connected, so the disable should not be let through.)
It turns out that plot2 could not disable the host, because the host ‘could not be contacted’. This is good, because it means that none of the slaves are confused. In fact, plot3 is not confused either; it is only plot2, the master that went missing, that is confused. (I have seen that the Xen docs call this a barrier of some sort.)
I tried to connect to plot2 with XenCenter, but XenCenter smartly told me that I cannot connect because it appears that the server was created as a backup from my pool and that this dangerous operation is not allowed. (I will try to trick XenCenter into connecting by removing references to my pool from it and then trying again.)
AH! It let me! That means XenCenter is smart enough to recognize when you are attempting to connect separately to both split-brain masters of a pool, and it prevents that, but only while it still holds a reference to the pool.
To dig further into this issue, I decided to ‘break’ the pool some more by giving the two masters different definitions of the pool. On the plot2 master I used XenCenter to destroy the disconnected host plot7. XenCenter let me do this. Now when I go to reconnect, I will be attempting to pull the orphaned master, with its different definition of the pool, back into the pool.
Now the trick is to determine the best way to bring plot2, the old master, back into the current pool as a slave. We need to tell the new master to recover slaves:
xe pool-recover-slaves
That pulls plot2 back in as a slave, and GREAT, it did not use any of the pool definition from plot2. plot3 properly asserted its role as the true pool master.
I can imagine a bad scenario happening if I had told the “OLD” master to recover slaves. I imagine that either the split would have gotten much worse, or (if the barrier was really working) the pool would have told the old master that it was not possible.
Other methods that I did not use, which may have worked but were not tried (they don’t feel right):
– from the orphaned master: xe pool-join force=1 ….. server username password (I doubt this would work since it is already a member of a pool)
– from the orphaned master: xe pool-reset-master master-server= ip of new master (this one I am not sure of; it would be worth a shot if for some reason pool-recover-slaves is not working)
The thing that you NEVER want to do while a master or any other server is orphaned or down is remove the server from the pool. What can happen in this situation is that the server that is down still thinks it is in the pool when it comes back up, but the pool does not know about it. We get into a deadlock that I have only ever found one way out of. The orphaned server thinks it is in a pool, but cannot get out of the pool without connecting to the master. The master will not recognize the orphaned server, so the server can’t do anything. (The way out of this was to promote the orphaned server to master, then remove all of the hosts in the pool, then delete all of the stored resources and PBDs, and then join the pool anew. This sucked because everything on the server was destroyed, so I could have just reinstalled XenServer.)
I have heard that it is possible to reinstall XenServer without selecting the disks, but I have not attempted it:
http://support.citrix.com/article/CTX120962