July 8, 2013
Participants: Hannes Reinecke, Jan Kara, Alasdair G Kergon.
Hannes Reinecke, who is working on revamping SCSI error handling, is looking into ways to make good use of the SCSI sense code. He is also interested in the more general problem of improving SCSI error handling.
One idea is for vprintk_emit() to emit only structured data, so that
there would be no text message in dmesg, but instead a structured
message that could be processed by tools designed for this purpose.
Hannes would like a discussion of the merits of this approach and of
any alternatives.
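As a rough illustration of what a tool designed for this purpose might do, the sketch below parses one record in the format that /dev/kmsg already exports: a "priority,sequence,timestamp,flags;text" header line followed by indented KEY=value property lines. The sample record is made up, and the SENSE_KEY property in particular is a hypothetical name that no current kernel emits; SUBSYSTEM and DEVICE are the keys the kernel uses today.

```python
# Minimal sketch of a user-space consumer of structured kernel messages.
# Assumption: records follow the existing /dev/kmsg layout. The sample
# record and its SENSE_KEY property are invented for illustration.

SAMPLE = (
    "3,1201,98765432,-;sd 2:0:0:0: [sdb] medium error\n"
    " SUBSYSTEM=scsi\n"
    " DEVICE=+scsi:2:0:0:0\n"
    " SENSE_KEY=3\n"
)


def parse_kmsg_record(record):
    """Split one /dev/kmsg-style record into header fields and properties."""
    lines = record.rstrip("\n").split("\n")
    # Header and human-readable text are separated by the first ';'.
    header, _, text = lines[0].partition(";")
    fields = header.split(",")
    prio, seq, timestamp = int(fields[0]), int(fields[1]), int(fields[2])

    properties = {}
    for line in lines[1:]:
        # Property lines start with a single space: " KEY=value".
        if line.startswith(" "):
            key, _, value = line[1:].partition("=")
            properties[key] = value

    return {
        "level": prio & 7,       # syslog level is the low three bits
        "facility": prio >> 3,   # facility is the remaining high bits
        "seq": seq,
        "timestamp_us": timestamp,
        "text": text,
        "properties": properties,
    }


if __name__ == "__main__":
    print(parse_kmsg_record(SAMPLE))
```

A consumer written this way would key off the properties rather than the free-form text, which is the part of the proposal under discussion.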
Jan Kara suggested use of the existing netlink facility. Hannes replied that he had in fact tried netlink, and found the following shortcomings:
An skb_alloc() is needed for each message,
which fails in low-memory situations (although Hannes admits
that this last is a weak argument).
vprintk_emit()
.
Alasdair G Kergon expressed interest in Hannes's vprintk_emit() idea,
noting that device-mapper is under increasing pressure to “abuse”
(Alasdair's quotes) uevents to report error conditions.
Hidehiro Kawai is trying to handle user-space errors by
adding a hash value to structured printk() output.
H. Peter Anvin attested to the “warm” reception that the idea of
unique IDs received at the 2011 LKS.
FAST_FAIL bit was invented specifically
to bypass the old-style error handler, and believes that a better way forward
is to update the error handler.
Hannes is working on doing just this, and has updates that permit command
aborts to be sent from the timeout handler and that implement an overall
eh_deadline to specify a time limit on error handling, after
which a host reset is sent.
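The effect of an overall deadline can be pictured with a small model. This is only a sketch of the idea, not the kernel implementation: the step names, the durations, and the run_error_handler() function are all hypothetical, and every step is assumed to fail so that the cutoff is visible.

```python
# Toy model of an error-handling deadline. All names and timings here
# are illustrative assumptions; each recovery step is assumed to fail.

def run_error_handler(steps, eh_deadline):
    """Return the steps attempted, in order.

    steps       -- list of (name, seconds) pairs, mildest step first
    eh_deadline -- seconds allowed before going straight to host reset
    """
    elapsed = 0
    attempted = []
    for name, seconds in steps:
        # Once the deadline has passed, skip the remaining steps.
        if elapsed >= eh_deadline:
            break
        attempted.append(name)
        elapsed += seconds
    # In this model the host reset is always the final resort.
    attempted.append("host reset")
    return attempted


STEPS = [
    ("abort command", 5),
    ("LUN reset", 10),
    ("target reset", 10),
    ("bus reset", 10),
]

if __name__ == "__main__":
    print(run_error_handler(STEPS, eh_deadline=12))
    print(run_error_handler(STEPS, eh_deadline=100))
```

With a short deadline the later resets are never tried, which bounds how long I/O can be stalled behind the error handler.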
Hannes would also like to fail commands before the error handler completes
in order to avoid I/O stalls, to dispense with the now-obsolete
TARGET RESET command, and to account for the fact that BUS RESET has
no direct meaning on modern SCSI transports.
Hannes would like to define a meaningful error escalation
strategy that takes into account modern SCSI commands.
This escalation strategy should preferably terminate early
when recovery proves impossible, disabling the LUN in this case.
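One possible shape for such a strategy is sketched below, purely as an assumption for discussion. The ladder omits TARGET RESET and BUS RESET as described above, and it stops as soon as a step reports that recovery is impossible. The Outcome values, the LADDER contents, and the escalate() function are hypothetical names, not anything taken from the SCSI midlayer.

```python
# Sketch of an escalation ladder with early termination. Every name
# here is hypothetical; it models the idea, not existing kernel code.
from enum import Enum


class Outcome(Enum):
    RECOVERED = "recovered"          # the device is usable again
    FAILED = "failed"                # this step did not help; escalate
    UNRECOVERABLE = "unrecoverable"  # no further step can help


# Assumed ladder: no TARGET RESET and no BUS RESET.
LADDER = ["abort command", "LUN reset", "host reset"]


def escalate(try_step):
    """Walk the ladder; return (final LUN state, steps attempted).

    try_step -- callable taking a step name and returning an Outcome
    """
    attempted = []
    for step in LADDER:
        attempted.append(step)
        outcome = try_step(step)
        if outcome is Outcome.RECOVERED:
            return "online", attempted
        if outcome is Outcome.UNRECOVERABLE:
            # Terminate early rather than trying heavier resets.
            return "LUN disabled", attempted
    # Every step failed, so recovery is impossible.
    return "LUN disabled", attempted


if __name__ == "__main__":
    def lun_reset_is_fatal(step):
        if step == "LUN reset":
            return Outcome.UNRECOVERABLE
        return Outcome.FAILED

    print(escalate(lun_reset_is_fatal))
```

In this model the decision about whether a failure is unrecoverable is left to the caller, which is where information such as the sense code could be taken into account.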