Recovering from pvmove failure
Note: This is a fairly rambling explanation of recent events. I assume that you have at least a passing knowledge of LVM and its terminology. This was written to prove that it is possible to recover from pvmove failing (in certain cases), due to the way it performs its operations and backs up metadata.
The background
I recently received a new hard drive, and so I excitedly partitioned and formatted it, extended my LVM VGs and merrily moved my LVs completely across to the new drive using pvmove. So far, so wonderful — I was now ready to rebuild my other two drives so that only critical things were covered by RAID and everything else was just in a large VG that spanned multiple drives. I now had three drives, all Samsung SpinPoint:
- 500GB (sdb)
- 1TB (sda)
- 1.5TB (sdc, and brand-new)
The partition setup I wanted:
- 48MiB RAID-1 across all drives, for /boot (already set up across sda and sdb)
- 4.8GiB RAID-1 across all drives, for / (already set up across sda and sdb)
- 50GiB RAID-1 across all drives, for /usr, /opt and /var
- 50GiB RAID-5 across all drives, for /home (giving 100GiB usable space)
- The rest of each drive divided into 50/100GiB partitions, to be spread among my large data VG, containing films, music, backups, and other large data.
The disaster
I moved all of the existing LVs to the new drive, which had plenty of space. Next, I sorted out the partitions for the two older drives, and initialised the RAIDs for my sys and safe VGs:
# set up RAID-1 across /dev/sda3, /dev/sdb3 and (later) the new drive
mdadm -C /dev/md2 -l1 -n3 /dev/sda3 /dev/sdb3 missing
# initialise the RAID for LVM
pvcreate /dev/md2
# set up RAID-5 across /dev/sda5, /dev/sdb5 and (later) /dev/sdc5
mdadm -C /dev/md3 -l5 -n3 /dev/sda5 /dev/sdb5 missing
pvcreate /dev/md3
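Not something from my original shell history, but a quick sanity check of the new (still degraded) arrays never hurts; something along these lines:
# confirm the arrays assembled and show their degraded state
cat /proc/mdstat
mdadm --detail /dev/md2
mdadm --detail /dev/md3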
I then expanded the VGs and started to copy data across. I began with the safe VG that contained /home, as I wanted to be sure it was indeed safe. I have all of the photos I’ve ever taken, my email backups and archives, and various private keys and other important data, as well as things that could be replaced due to duplication elsewhere (including working copies of most of my projects). Being LVM, I could move the LVs around without needing to unmount them or reboot into a LiveCD:
vgextend sys /dev/md2
vgextend safe /dev/md3
# move all data from /dev/sdc2 to other volumes in the group
pvmove /dev/sdc2
# ... disaster struck before the second pvmove
pvmove /dev/sdc1
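Incidentally, pvmove only reports progress every so often; had I wanted to keep a closer eye on it, something like this should (as I understand the options) have done the job:
# report progress more frequently (every 10 seconds)
pvmove -i 10 /dev/sdc2
# the temporary mirror LV that pvmove creates shows up here
lvs -a -o +devices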
Then, disaster struck: pvmove spat a set of I/O errors at me… but kept running! I wasn’t paying attention (as moving 40GiB of data takes quite a while) and so I had just left it to work away.
pvmove: dev_open(/dev/sda1) called while suspended
pvmove: dev_open(/dev/sda2) called while suspended
pvmove: dev_open(/dev/sda3) called while suspended
pvmove: dev_open(/dev/sda5) called while suspended
pvmove: dev_open(/dev/md2) called while suspended
pvmove: dev_open(/dev/md3) called while suspended
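Those “called while suspended” messages are LVM complaining that it was opening devices while the device-mapper devices underneath were still suspended mid-operation; had I spotted it at the time, the state could have been inspected with something like:
# each mapped device reports a State: line (ACTIVE or SUSPENDED)
dmsetup info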
It was only after it had completed and I started getting I/O and “permission denied” errors while trying to access my home directory that I first realised anything had gone wrong. After some frantic searching on the Internet, it seemed that I had to wave goodbye to all of my personal data, including those bits that weren’t backed up (though I’ve learned my lesson about backups now!).
I was determined not to give in to data loss, and so I gave myself a few recovery options:
- Create an image of the partition where the data used to reside.
- Try to recover pictures (the most critical thing for me) via foremost, which is a non-destructive operation (a rough sketch of these first two steps follows the list).
- Try to rebuild the file system using reiserfsck --rebuild-tree.
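For the record, the first two options would have looked roughly like this — the paths are placeholders of mine, and the foremost file types are just my choice:
# image the partition (here assumed to be /dev/sdc1), skipping unreadable sectors
dd if=/dev/sdc1 of=/mnt/big/sdc1.img bs=4M conv=noerror,sync
# carve pictures out of the image without touching the original
foremost -t jpg,png -i /mnt/big/sdc1.img -o /mnt/big/carved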
During my investigations and poking around, I discovered that there was another option: I could try to rebuild the VG from an old copy of its metadata.
Introduction to recovery
It turns out that many LVM utilities store a backup of the VG’s metadata before and after performing various operations, so as long as your /etc/lvm directory is fine, you have a fighting chance. (Actually, it is theoretically possible to retrieve old versions of the metadata from the LVM partition itself, but this gets much more complex and I won’t deal with it here.)
The reason for this is that pvmove works by creating a temporary target LV, mirroring data from the existing LV to the temporary target (making checkpoints along the way), and only removing the original once the mirroring has completed. This operation can then be interrupted and resumed at any time without loss of data. This also has the advantage that a full copy of the data on the original LV is still available; only the metadata explaining where it begins and ends has changed.
To access the old versions of this metadata, investigate the contents of your /etc/lvm/archive directory as root. You should find one or more files of the form volgroup_nnnnn.vg, where volgroup is the name of the relevant VG and nnnnn is a number. These files contain a backup of the volume group’s metadata, such as which partitions it resides on and where it’s located. The contents may look something like this:
contents = "Text Format Volume Group"
version = 1
description = "Created *before* executing 'vgextend safe /dev/md/3'"
creation_host = "hostname" # Linux hostname 2.6.31-gentoo-r3 #3 SMP Mon Nov 23 11:31:59 GMT 2009 i686
creation_time = 1262624964 # Mon Jan 4 17:09:24 2010
safe {
id = "zbSAI8-ExUy-PxJw-SHX7-y9PL-AL2y-kiTk8g"
seqno = 10
status = ["RESIZEABLE", "READ", "WRITE"]
flags = []
extent_size = 8192 # 4 Megabytes
max_lv = 0
max_pv = 0
physical_volumes {
pv0 {
id = "Vw3IF5-U92i-V2aL-9we1-A4E7-7j1P-4G142u"
device = "/dev/sdc1" # Hint only
status = ["ALLOCATABLE"]
flags = []
dev_size = 199993122 # 95.3642 Gigabytes
pe_start = 384
pe_count = 24413 # 95.3633 Gigabytes
}
}
logical_volumes {
home {
id = "glU5zb-l9I7-SzGg-aXx5-lvgL-on4r-Faqkes"
status = ["READ", "WRITE", "VISIBLE"]
flags = []
segment_count = 1
segment1 {
start_extent = 0
extent_count = 12799 # 49.9961 Gigabytes
type = "striped"
stripe_count = 1 # linear
stripes = [
"pv0", 0
]
}
}
}
}
I went through the backups for my safe volume group, and found the one from just before I extended the volume group onto the (now corrupt) /dev/md3, which is in fact the example given above.
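In hindsight, the quickest way to find the right archive is to look at each file’s description line, which records the command it was taken before; something like this should work:
# list the archived metadata for the VG, with descriptions
vgcfgrestore --list safe
# or just search the archive files directly
grep description /etc/lvm/archive/safe_*.vg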
Recovering the volume group
It was at this point that I rebooted and held my breath. I brought the system into runlevel 1 (single-user mode) and logged in as root. Out of curiosity I ran some SMART tests on the drives — sda was reporting no partitions, and internal testing eventually reported a persistent read error at about 80% into the drive. It looked like that drive was toast, which would explain why writing to the RAID-5 had failed.
# give the drive's internal self-test time to complete, then read the results
sleep 180
smartctl --all /dev/sda
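For completeness, the self-test itself is started with smartctl as well; something like the following (I’m assuming the short test here) is what the sleep was waiting on:
# start the drive's built-in short self-test (takes a couple of minutes)
smartctl -t short /dev/sda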
After verifying that it was a hardware issue, I decided to try to point the VG at its previous location (just /dev/sdc1), thereby getting it to forget that the home LV was on the now-inaccessible RAID-5. This could be accomplished by the magic of vgcfgrestore:
vgcfgrestore -f /etc/lvm/archive/safe_00006.vg safe
# re-scan for VGs, and "safe" should show up as inactive
vgscan
# activate the VG and any LVs inside
vgchange -ay safe
# re-scan for LVs and verify that safe/home exists
lvscan
# not taking any chances -- mount it read-only
mount -o ro /dev/safe/home /home
After the mount succeeded, I began to copy all of the data to my large partition, where it should be safe. I also copied the major bits I didn’t want to lose to a USB flash drive, just to be certain. Once that was done, I took a deep breath and rebooted again, this time back to the full system.
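The copying step itself was nothing clever; something along these lines would do, though rsync and the destination paths here are my own placeholders rather than exactly what I typed:
# copy everything off the read-only mount, preserving permissions and attributes
rsync -aAX /home/ /mnt/data/home-rescue/
# second copy of the irreplaceable bits onto the USB stick
rsync -aAX /home/photos/ /mnt/usb/photos/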
Clearing up afterwards
Fortunately, everything worked fine — it was as if nothing had ever gone wrong! All that was left now was to rebuild the broken RAID section.
I decided for safety that I would re-initialise the safe VG’s RAID as RAID-1 rather than RAID-5, so that any two of the drives could fail without losing my precious data:
# stop the old RAID-5 array
mdadm -S /dev/md3
# remove the RAID descriptors from the partitions
mdadm --zero-superblock /dev/sda5
mdadm --zero-superblock /dev/sdb5
# create as RAID-1
mdadm -C /dev/md3 -l1 -n3 /dev/sda5 /dev/sdb5 missing
What I hadn’t anticipated was that this would rebuild the RAID-1 using the data from the previous RAID-5… including the LVM descriptors and corrupt content. Once the RAID was online, I ran pvscan and received a warning that the VG metadata was inconsistent. For one heart-stopping moment, it also appeared that LVM thought my home partition was on the corrupt RAID! I quickly stopped the array and thought again.
After some further investigation, I discovered that the first 256 sectors of a partition are used by LVM to hold various information (including multiple copies of the VG metadata). All I had to do to fix this was to destroy that information:
# stop the array again
mdadm -S /dev/md3
# zero the first 256 sectors
dd if=/dev/zero of=/dev/sda5 bs=512 count=256
dd if=/dev/zero of=/dev/sdb5 bs=512 count=256
# recreate the array -- it may complain that they are already in an array
mdadm -C /dev/md3 -l1 -n3 /dev/sda5 /dev/sdb5 missing
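It’s worth double-checking that the stale label really is gone before recreating the PV; and on a system with a newer util-linux, wipefs would be a tidier way of doing the same job as the dd:
# the zeroed partitions should no longer be reported as PVs
pvscan
# alternative to the dd approach (run before recreating the array; clears known signatures only):
# wipefs -a /dev/sda5 /dev/sdb5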
Once this was done, I could run pvcreate /dev/md3 and proceed to extend the VG and pvmove data as before, without worrying that sda would give out on me and cause everything to fail again.
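After the dust settled, the usual reporting commands are a quick way to confirm everything ended up where it should:
pvs                  # PVs and the VG each belongs to
vgs                  # VGs and their free space
lvs -a -o +devices   # LVs and the PVs backing them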
I hope this helps someone!