There are quite a few guides out there on how to do what I'd consider a pretty essential part of managing a Ceph storage cluster, yet the ones I came across never took me through the entire procedure:
- On a healthy cluster, run a script to map OSDs to disk serial numbers so that you can yank out the correct drive in the event of failure 🙂
- Identify which OSD has gone down
- Safely remove the damaged drive from the cluster
- Pull out the drive
- Insert the new drive
- Add the new drive to the cluster
Let's start by creating the script which will map serial numbers to the corresponding OSD:
Create a file called osd2serial.sh with the following contents:
#!/bin/bash
# Map each Ceph OSD to its serial-numbered link under /dev/disk/by-id
DISKS=$(ceph-disk list | grep "ceph data")
old_IFS=$IFS
IFS=$'\n'   # split on newlines so each loop iteration gets one disk entry
for DISK in $DISKS; do
    DEV=$(awk '{print $1}' <<< "$DISK")      # device node, e.g. /dev/sdb1
    OSD=$(awk '{print $7}' <<< "$DISK")      # OSD name field
    DEV=$(echo "$DEV" | sed -e 's|/dev/||')  # strip the /dev/ prefix
    ID=$(ls -l /dev/disk/by-id | grep "$DEV" | awk '{print $9}' | egrep -v "wwn")
    echo "$OSD" "$ID"
done
IFS=$old_IFS
Next, add executable permissions to this file:
chmod +x osd2serial.sh
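If the awk/sed plumbing in the script looks opaque, the snippet below walks through the same parsing on a single sample line. The sample line is made up for illustration (real ceph-disk list output may differ slightly in spacing and fields):

```shell
# Illustrative "ceph-disk list" data line (not real cluster output):
LINE=' /dev/sdb1 ceph data, active, cluster ceph, osd.3, journal /dev/sdb2'
DEV=$(awk '{print $1}' <<< "$LINE")      # first field: the device node
OSD=$(awk '{print $7}' <<< "$LINE")      # seventh field: the OSD name
DEV=$(echo "$DEV" | sed -e 's|/dev/||')  # strip the /dev/ prefix
echo "$OSD $DEV"                         # -> osd.3, sdb1
```

The bare device name (sdb1) is then used to grep /dev/disk/by-id, where the kernel exposes symlinks named after each disk's serial number.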
Then run this file to output each OSD alongside its disk serial number:
./osd2serial.sh
Next, identify which OSD has gone down:
ceph osd tree
Let's assume OSD 9 has gone down.
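On a cluster with many OSDs it can be quicker to filter the tree output for entries marked down, i.e. ceph osd tree | grep down. A minimal sketch of that filter, using a made-up two-line tree excerpt in place of live output:

```shell
# Illustrative excerpt of "ceph osd tree" output (not from a real cluster);
# on a live cluster you would pipe the real command: ceph osd tree | grep down
TREE=' 9  1.00000  osd.9   down  1.00000
10  1.00000  osd.10  up    1.00000'
echo "$TREE" | grep down
```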
Next, run the following commands in order to remove OSD 9 from the cluster (the initial reweight to 0 drains OSD 9's data onto the remaining OSDs, so it is safest to wait until ceph -s shows the rebalance has finished before carrying on with the removal):
ceph osd crush reweight osd.9 0
ceph osd out osd.9
ceph osd crush remove osd.9
systemctl stop ceph-osd@9.service
ceph auth del osd.9
ceph osd rm osd.9
umount /var/lib/ceph/osd/ceph-9
Next, match the failed OSD with the correct disk serial number from the osd2serial.sh output and pull the failed disk from its drive bay.
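The removal sequence above lends itself to a small helper script. The sketch below is hypothetical (not part of Ceph): by default it only prints the commands for a given OSD id so you can review them, and executes them only when RUN=1 is set in the environment.

```shell
#!/bin/bash
# Hypothetical helper: print (or, with RUN=1, execute) the OSD removal
# sequence for a given OSD id, in the same order as the manual steps.
remove_osd() {
    local id=$1
    local cmds=(
        "ceph osd crush reweight osd.$id 0"
        "ceph osd out osd.$id"
        "ceph osd crush remove osd.$id"
        "systemctl stop ceph-osd@$id.service"
        "ceph auth del osd.$id"
        "ceph osd rm osd.$id"
        "umount /var/lib/ceph/osd/ceph-$id"
    )
    local c
    for c in "${cmds[@]}"; do
        if [ "${RUN:-0}" = "1" ]; then
            $c                  # execute for real
        else
            echo "$c"           # dry run: just show the command
        fi
    done
}

remove_osd 9
```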
Next, insert the new disk and take note of its device name by running the following command:
lsblk
Assuming the new disk shows up as /dev/sdn, run the following command:
ceph-disk prepare --zap --cluster ceph --cluster-uuid f3a876fa-36c3-457a-9883-a4194f652da0 --fs-type xfs /dev/sdn /dev/sdn2
The cluster UUID (fsid) can be printed directly with ceph fsid, or read from the output of:
ceph -s
Finally, run:
ceph-disk activate /dev/sdn1
Ceph will now rebalance, and your new OSD will be up and in the cluster. You can watch the recovery with ceph -w or by re-running ceph -s.
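If you want to block until the cluster has fully recovered (say, in an automation script), a simple polling loop works. In the sketch below the ceph call is stubbed with a shell function so the snippet is self-contained; on a real cluster, delete the stub line and the loop will poll the actual cluster status.

```shell
# Stub standing in for the real ceph CLI so this snippet runs anywhere;
# remove this function on a real cluster.
ceph() { echo "health HEALTH_OK"; }

# Poll cluster status every 10 seconds until it reports HEALTH_OK.
until ceph -s | grep -q HEALTH_OK; do
    sleep 10
done
echo "cluster healthy"
```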
