There was an SR backend failure. status: non-zero exit

There was an SR backend failure.
status: non-zero exit
stdout:
stderr: Traceback (most recent call last):
File "/opt/xensource/sm/LVMSR", line 1985, in ?
SRCommand.run(LVHDSR, DRIVER_INFO)
File "/opt/xensource/sm/SRCommand.py", line 307, in run
sr = driver(cmd, cmd.sr_uuid)
File "/opt/xensource/sm/SR.py", line 136, in __init__
self.load(sr_uuid)
File "/opt/xensource/sm/LVMSR", line 196, in load
self._undoAllJournals()
File "/opt/xensource/sm/LVMSR", line 1046, in _undoAllJournals
self._handleInterruptedCoalesceLeaf()
File "/opt/xensource/sm/LVMSR", line 807, in _handleInterruptedCoalesceLeaf
cleanup.gc_force(self.session, self.uuid)
File "/opt/xensource/sm/cleanup.py", line 2543, in gc_force
sr.scanLocked(force)
File "/opt/xensource/sm/cleanup.py", line 1287, in scanLocked
self.scan(force)
File "/opt/xensource/sm/cleanup.py", line 2092, in scan
self._handleInterruptedCoalesceLeaf()
File "/opt/xensource/sm/cleanup.py", line 2170, in _handleInterruptedCoalesceLeaf
self._undoInterruptedCoalesceLeaf(uuid, parentUuid)
File "/opt/xensource/sm/cleanup.py", line 2213, in _undoInterruptedCoalesceLeaf
parent.deflate()
File "/opt/xensource/sm/cleanup.py", line 934, in deflate
lvhdutil.deflate(self.sr.lvmCache, self.fileName, self.getSizeVHD())
File "/opt/xensource/sm/cleanup.py", line 991, in getSizeVHD
self._loadInfoSizeVHD()
File "/opt/xensource/sm/cleanup.py", line 1002, in _loadInfoSizeVHD
self._sizeVHD = vhdutil.getSizePhys(self.path)
File "/opt/xensource/sm/vhdutil.py", line 210, in getSizePhys
ret = ioretry(cmd)
File "/opt/xensource/sm/vhdutil.py", line 94, in ioretry
errlist = [errno.EIO, errno.EAGAIN])
File "/opt/xensource/sm/util.py", line 277, in ioretry
return f()
File "/opt/xensource/sm/vhdutil.py", line 93, in <lambda>
return util.ioretry(lambda: util.pread2(cmd),
File "/opt/xensource/sm/util.py", line 178, in pread2
return pread(cmdlist, quiet = quiet)
File "/opt/xensource/sm/util.py", line 171, in pread
raise CommandException(rc, str(cmdlist), stderr.strip())
util.CommandException: 22

My VMs eventually went down. After a reboot, one of my SRs was broken and could not be repaired. Every time I tried to connect it, I got the same error message.

The root cause appears to be that the coalesce process was interrupted at some point while cleaning up deleted snapshots. This broke the SR, which then most likely started filling up the logs.

To fix the issue, we first determined which VHD was causing the failure by watching the storage manager log:
tail -f SMlog
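
If the failure has already scrolled past, searching the log can be quicker than following it live. The following is only a rough sketch: the function name find_sr_errors and the search patterns are my own guesses at what is worth looking for, not something taken from the XenServer documentation.

```shell
#!/bin/sh
# Print numbered log lines that mention an exception or the coalesce step.
# $1: path to the log file (on XenServer, SMlog normally lives under /var/log).
# The patterns are assumptions; adjust them to whatever your log contains.
find_sr_errors() {
    grep -n -i -E 'exception|coalesce' "$1"
}
```

For example, find_sr_errors /var/log/SMlog prints each matching line prefixed with its line number, which makes it easier to jump to the surrounding context.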

Then we renamed the LV (logical volume) that had the word "leaf" in its name:


lvrename /dev/VG_XenStorage-3386db8b-02d5-a76c-18b9-242fd6766643/leaf_368e2e6a-c8bd-49a3-a3f7-14bc070cb373_dd279484-4c7e-4155-9120-8023d364fb86 /dev/VG_XenStorage-3386db8b-02d5-a76c-18b9-242fd6766643/leaf_368e2e6a-c8bd-49a3-a3f7-14bc070cb373_dd27.bak
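
If you are not sure which LV to rename, something like the sketch below may help. It reads LV names on stdin and only prints the lvrename commands it would suggest, so nothing is changed until you run one yourself. The function name print_leaf_renames is hypothetical, and so are two assumptions it makes: that the LVs of interest start with "leaf_" (as in the command above), and that appending ".bak" to the full name is acceptable (I shortened the name by hand instead).

```shell
#!/bin/sh
# Print (do not run) an lvrename command for every LV whose name starts
# with "leaf_". LV names are read from stdin, one per line.
# $1: volume group name, e.g. VG_XenStorage-<sr-uuid>
print_leaf_renames() {
    vg="$1"
    # lvs pads its output with blanks, so trim each line before matching.
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
    grep '^leaf_' |
    while IFS= read -r lv; do
        printf 'lvrename /dev/%s/%s /dev/%s/%s.bak\n' "$vg" "$lv" "$vg" "$lv"
    done
}
```

A possible invocation, with VG set to the volume group name of the broken SR:

lvs --noheadings -o lv_name "$VG" | print_leaf_renames "$VG"

Review the printed commands before running any of them.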

We then repaired the SR, which let the coalesce process finish whatever it needed to do and clean up. The SR came online and my problems were solved.

This was a very strange issue. Hope this information helps someone else who gets stuck!

Thanks to this post
