GlusterFS: Re-attach a crashed, inaccessible replicated server to the other peers

This morning started off okay, until I checked my email and found an alarm ringing about one of our file servers being offline. Luckily, we have a replicated setup, so no customers were affected, but it is a little worrisome not having the safety net of two replicated servers.

The server didn't technically crash; rather, the EC2 instance became unavailable, likely due to some unforeseen problem in the Virginia datacenter. Regardless, the first step I took was to get the server accessible via SSH again.

First, I took a snapshot of the inaccessible server.
Then I stopped the inaccessible server. This was a little disconcerting because stopping the instance altogether took nearly an hour from within the AWS control panel.
Once the instance was stopped, I started it back up again.
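
I did all of this from the AWS control panel, but the same three steps can be sketched with the AWS CLI. This is only a rough outline under some assumptions: the function name, the instance ID, and the volume ID are placeholders I made up for illustration, and the AWS variable exists only so the command can be swapped out for a dry run.

```shell
#!/bin/sh
# Hypothetical helper, not what I actually ran: snapshot, stop, then start.
# AWS defaults to the real CLI; set AWS=echo to print the commands instead.
AWS="${AWS:-aws}"

recover_instance() {
    instance_id="$1"   # placeholder, e.g. the ID of the unreachable server
    volume_id="$2"     # placeholder, the EBS volume attached to it

    # 1. Snapshot the volume before touching the instance.
    $AWS ec2 create-snapshot --volume-id "$volume_id" \
        --description "pre-restart snapshot" || return 1

    # 2. Stop the instance and wait until AWS reports it as stopped.
    $AWS ec2 stop-instances --instance-ids "$instance_id" || return 1
    $AWS ec2 wait instance-stopped --instance-ids "$instance_id" || return 1

    # 3. Start it again and wait until it is running.
    $AWS ec2 start-instances --instance-ids "$instance_id" || return 1
    $AWS ec2 wait instance-running --instance-ids "$instance_id" || return 1
}
```

Keep in mind that the wait subcommands give up after a fixed number of polls, so a stop as slow as ours would probably make the second step return an error, and you would have to re-run the wait by hand.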

The Tricky(ier) part

Now that I was able to access the server again, I needed to figure out how to re-attach the gluster mount into the replicated environment. I found this URL which did the trick for us: http://gluster.org/community/documentation/index.php/Gluster_3.2:_Brick…

Since we didn't completely replace the server, we didn't have to check the UUID of the out-of-service server. We probed the other server and made sure the volume info was correct. Once we were sure everything was set up properly, we logged into one of the client machines and followed this article: http://gluster.org/community/documentation/index.php/Gluster_3.2:_Trigg…

So, in summary, we did the following to regain replication on our GlusterFS file servers:

Ran on the crashed server (server2)

sudo gluster peer probe {server1}
sudo service glusterfs-server restart
sudo gluster peer status
sudo gluster volume info
sudo service glusterfs-server restart
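
Rather than eyeballing the peer status output, a small filter can do the check. This is a minimal sketch, assuming the output contains one "State:" line per peer and that a healthy peer's line includes "(Connected)"; the function name is made up, so adjust the patterns if your Gluster version prints something different.

```shell
#!/bin/sh
# Hypothetical helper: reads `gluster peer status` output on stdin and
# succeeds only if every "State:" line reports the peer as connected.
all_peers_connected() {
    awk '
        /^State:/ { total++; if ($0 !~ /\(Connected\)/) bad++ }
        END {
            if (total == 0) { print "no peers listed"; exit 2 }
            if (bad > 0)    { print bad " of " total " peer(s) not connected"; exit 1 }
            print total " peer(s) connected"
        }'
}

# Usage (on the re-attached server):
#   sudo gluster peer status | all_peers_connected
```

The exit status is 0 when all peers are connected, 1 when at least one is not, and 2 when no peers are listed at all, which makes it easy to drop into a monitoring script.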

Ran on a client of the gluster servers

find {/my/mount/point} -noleaf -print0 | xargs --null stat > /dev/null 2>mnt-glusterfs-selfheal-1.log

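Running stat on every file makes the client look each one up, which is what prompts the replicated volume to heal it. If no errors appear in the log file you are writing to, consider your Gluster setup healed.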
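
Checking the log can be scripted too. Here is a minimal sketch, assuming the log file name from the find command above; the function name is invented for illustration. It only tests whether stderr from the stat pass was empty, so it says nothing about the health of the bricks themselves.

```shell
#!/bin/sh
# Hypothetical helper: report whether the stat pass logged any errors.
check_heal_log() {
    log="$1"
    if [ ! -e "$log" ]; then
        echo "missing: $log was not found"
        return 2
    fi
    # -s is true when the file exists and is non-empty.
    if [ -s "$log" ]; then
        echo "errors: $(wc -l < "$log" | tr -d ' ') line(s) in $log"
        return 1
    fi
    echo "clean: $log is empty"
    return 0
}

# Usage (on the client, after the find/stat pass has finished):
#   check_heal_log mnt-glusterfs-selfheal-1.log
```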
