# Manual Maintenance

## Clean OSD removal

```
ceph osd safe-to-destroy osd.<ID>
ceph osd out <ID>
systemctl stop ceph-osd@<ID>
ceph osd crush remove osd.<ID>
ceph osd down <ID>
ceph auth del osd.<ID>
ceph osd rm <ID>
```

Remove the logical volumes, volume groups and physical volumes:

```
lvremove <list of volumes>
vgremove <list of volume groups>
pvremove <list of physical volumes>
```

## Clean mds removal

```
systemctl stop ceph-mds@<id>.service
rm -rf /var/lib/ceph/mds/ceph-<id>
ceph auth rm mds.<id>
```

## Clean mgr removal

```
systemctl stop ceph-mgr@<id>.service
rm -rf /var/lib/ceph/mgr/ceph-<id>
ceph auth rm mgr.<id>
```

## Clean mon removal

```
systemctl stop ceph-mon@<id>.service
ceph mon remove <id>
rm -rf /var/lib/ceph/mon/ceph-<id>
ceph auth rm mon.<id>
```

## Reattach expelled disk

When a disk is expelled and reattached it gets a different device name, so the OSD fails. To reattach it correctly a few steps must be followed.

Let's suppose the disk was originally `/dev/sdv`. After reattachment the disk becomes `/dev/sdah`, so the OSD no longer works. First identify the full SCSI path, for example by checking the `/sys/block` folder:

```
sdah -> ../devices/pci0000:80/0000:80:02.0/0000:82:00.0/host11/port-11:0/expander-11:0/port-11:0:31/end_device-11:0:31/target11:0:31/11:0:31:0/block/sdah
```

or

```
udevadm info --query=path --name=/dev/sdah
/devices/pci0000:80/0000:80:02.0/0000:82:00.0/host11/port-11:0/expander-11:0/port-11:0:31/end_device-11:0:31/target11:0:31/11:0:31:0/block/sdah
```

This is a JBOD disk; to remove it, issue this command:

```
echo 1 > /sys/block/sdah/device/delete
```

The device now disappears. Before rescanning the SCSI host you have to tweak the naming using udev rules. Create the rule file `/etc/udev/rules.d/20-disk-rename.rules` with this content:

```
KERNEL=="sd?", SUBSYSTEM=="block", DEVPATH=="*port-11:0:31/end_device-11:0:31*", NAME="sdv", RUN+="/usr/bin/logger My disk ATTR{partition}=$ATTR{partition}, DEVPATH=$devpath, ID_PATH=$ENV{ID_PATH}, ID_SERIAL=$ENV{ID_SERIAL}", GOTO="END_20_PERSISTENT_DISK"
KERNEL=="sd?*", ATTR{partition}=="1", SUBSYSTEM=="block", DEVPATH=="*port-11:0:31/end_device-11:0:31*", NAME="sdv%n", RUN+="/usr/bin/logger My partition parent=%p number=%n, ATTR{partition}=$ATTR{partition}"
LABEL="END_20_PERSISTENT_DISK"
```
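If the new rule does not seem to be picked up at rescan time, udev can be told to reload its rules first; a minimal sketch (whether this step is needed may depend on the distribution):

```
# Reload udev rules so the new 20-disk-rename.rules file is taken into account
udevadm control --reload-rules
```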
Now, if you rescan the SCSI host, the disk will be recognized again but the block device name will be forced to `/dev/sdv`:

```
echo "- - -" > /sys/class/scsi_host/host11/scan
```

Now retrieve the OSD IDs:

```
ceph-volume lvm list
```

which gives the full information:

```
====== osd.26 ======

  [block]       /dev/18-2EH802TV-HGST-HUH728080AL4200/sdv_data

      block device              /dev/18-2EH802TV-HGST-HUH728080AL4200/sdv_data
      block uuid                rMNcOq-9Isr-3LJZ-gp6P-tZmi-fcJ0-d0D0Mx
      cephx lockbox secret
      cluster fsid              959f6ec8-6e8c-4492-a396-7525a5108a8f
      cluster name              ceph
      crush device class        None
      db device                 /dev/cs-001_journal/sdv_db
      db uuid                   QaQmrJ-zdTu-UXZ4-oqt0-hXgM-emKe-fqtOaX
      encrypted                 0
      osd fsid                  aad7b25d-1182-4570-9164-5c3d3a6a61b7
      osd id                    26
      osdspec affinity
      type                      block
      vdo                       0
      wal device                /dev/cs-001_journal/sdv_wal
      wal uuid                  bjLNLd-0o3q-haDa-eFyv-ILjx-v2yk-YtaHuo
      devices                   /dev/sdv

  [db]          /dev/cs-001_journal/sdv_db

      block device              /dev/18-2EH802TV-HGST-HUH728080AL4200/sdv_data
      block uuid                rMNcOq-9Isr-3LJZ-gp6P-tZmi-fcJ0-d0D0Mx
      cephx lockbox secret
      cluster fsid              959f6ec8-6e8c-4492-a396-7525a5108a8f
      cluster name              ceph
      crush device class        None
      db device                 /dev/cs-001_journal/sdv_db
      db uuid                   QaQmrJ-zdTu-UXZ4-oqt0-hXgM-emKe-fqtOaX
      encrypted                 0
      osd fsid                  aad7b25d-1182-4570-9164-5c3d3a6a61b7
      osd id                    26
      osdspec affinity
      type                      db
      vdo                       0
      wal device                /dev/cs-001_journal/sdv_wal
      wal uuid                  bjLNLd-0o3q-haDa-eFyv-ILjx-v2yk-YtaHuo
      devices                   /dev/sdb
```

Then reactivate the OSD with `ceph-volume lvm activate`, passing the OSD id and OSD fsid (output example):

```
ceph-volume lvm activate --bluestore 26 aad7b25d-1182-4570-9164-5c3d3a6a61b7
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-26
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-26
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/18-2EH802TV-HGST-HUH728080AL4200/sdv_data --path /var/lib/ceph/osd/ceph-26 --no-mon-config
Running command: /usr/bin/ln -snf /dev/18-2EH802TV-HGST-HUH728080AL4200/sdv_data /var/lib/ceph/osd/ceph-26/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-26/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-75
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-26
Running command: /usr/bin/ln -snf /dev/cs-001_journal/sdv_db /var/lib/ceph/osd/ceph-26/block.db
Running command: /usr/bin/chown -h ceph:ceph /dev/cs-001_journal/sdv_db
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-77
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-26/block.db
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-77
Running command: /usr/bin/ln -snf /dev/cs-001_journal/sdv_wal /var/lib/ceph/osd/ceph-26/block.wal
Running command: /usr/bin/chown -h ceph:ceph /dev/cs-001_journal/sdv_wal
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-76
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-26/block.wal
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-76
Running command: /usr/bin/systemctl enable ceph-volume@lvm-26-aad7b25d-1182-4570-9164-5c3d3a6a61b7
Running command: /usr/bin/systemctl enable --runtime ceph-osd@26
Running command: /usr/bin/systemctl start ceph-osd@26
--> ceph-volume lvm activate successful for osd ID: 26
```

## OSD map tweaking

```
ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o crush_map
```

Now you can edit the `crush_map` file, recompile it and inject it into the cluster:

```
crushtool -c crush_map -o /tmp/crushmap
ceph osd setcrushmap -i /tmp/crushmap
```

## Inconsistent PGs

```
rados list-inconsistent-pg {pool}
```

## Slow ops

```
ceph daemon mon.cs-001 ops
```

Find OSD failures:

```
ceph daemon mon.cs-001 ops | grep osd_failure
"description": "osd_failure(failed timeout osd.130 [v2:131.154.128.179:6876/13353,v1:131.154.128.179:6882/13353] for 24sec e76448 v76448)",
"description": "osd_failure(failed timeout osd.166 [v2:131.154.128.199:6937/13430,v1:131.154.128.199:6959/13430] for 24sec e76448 v76448)",
"description": "osd_failure(failed timeout osd.175 [v2:131.154.128.199:6924/13274,v1:131.154.128.199:6933/13274] for 24sec e76448 v76448)",
```
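To see which OSDs are reported as failing most often, the same output can be aggregated; a minimal sketch (`mon.cs-001` is the monitor name used above):

```
# Count osd_failure reports per OSD to spot the ones reported repeatedly
ceph daemon mon.cs-001 ops \
  | grep -o 'osd_failure(failed timeout osd\.[0-9]*' \
  | awk '{print $3}' \
  | sort | uniq -c | sort -rn
```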