R. E. Newman-Wolfe
, University of Florida
Last modified 1/30/95.
Error-detection and error-correction
Here we consider how to determine whether a logical unit of data has
been received correctly (after either data transmission or storage),
and if it is determined to be incorrect, how to attempt to correct it.
First we develop some terminology and mathematics needed to describe
errors and then we give several methods of error detection and correction.
We will assume that the receiver has achieved block synchronization,
which is the topic of the previous page.
1. Preliminaries
- a. Hamming distance H(a,b) = # of bit positions in which
a differs from b, i.e., the number of 1's in a XOR b.
- Hamming distance of a code is the minimum over all pairs
of distinct code words of the Hamming distance
between them, i.e.,
- H(code) = min {H(a,b) | a<>b and a, b in code}
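These two definitions can be sketched in a few lines of (modern) Python; the function names are mine:

```python
def hamming(a, b):
    """H(a,b): number of bit positions in which a differs from b,
    i.e., the number of 1's in a XOR b."""
    return bin(a ^ b).count("1")

def code_distance(code):
    """H(code): minimum Hamming distance over all pairs of
    distinct code words."""
    return min(hamming(a, b)
               for i, a in enumerate(code)
               for b in code[i + 1:])

# The even-parity code on 2 data bits has distance 2:
print(code_distance([0b000, 0b011, 0b101, 0b110]))  # -> 2
```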
- b. Forward error correction - enough redundancy is transmitted
in the code that errors can be corrected by the
receiver without retransmission
(used when retransmission is impractical, as in
distant space probes, simplex lines, broadcast media
where many stations may have to listen to unnecessary
retransmission)
- c. Backward error correction - errors are detected and
retransmission requested.
(most common, generally more efficient, but has the
flaw that a single error, repeated in every
retransmission, will mean that the message is never delivered)
- d. Distance Requirements
- n-bit error detection requires a code with Hamming distance n+1,
(i.e., a distance d code can detect d-1 bit errors).
- c-bit error correction requires a code with Hamming distance 2c+1,
(i.e., a distance d code can correct [(d-1)/2] bit errors).
- e. Residual error rate - the probability that an error will
go undetected
- Pb = uniform probability that a bit is in error (random)
- P1 = probability that a frame is received error-free
- P2 = probability that a frame is received with one or
more undetected error(s) (residual error rate)
- P3 = probability that a frame is received with one or
more detected error(s)
- Assuming that errors are independent and have constant
probability Pb,
- P1 = (1-Pb)^F, where F = bits per frame
- P2+P3 = 1-P1
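Under the independence assumption these probabilities are one-liners; a minimal sketch (the function name and the example values of Pb and F are mine):

```python
def frame_probs(Pb, F):
    """Given uniform, independent bit-error probability Pb and
    F bits per frame, return (P1, P2 + P3), where
    P1 = (1 - Pb)^F  is the probability the frame is error-free
    and 1 - P1 is the probability of one or more errors."""
    P1 = (1 - Pb) ** F
    return P1, 1 - P1
```

For example, Pb = 1e-6 and F = 8000 bits gives P1 approximately 0.992, so nearly one frame in a hundred contains at least one error.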
2. Methods of Error Detection
- a. Parity
- One redundant bit per word, the bit's value is
- Sum i=0 to m-1 of a_i (mod 2) for even parity
- (Used for asynchronous transmission)
- or Sum i=0 to m-1 of a_i + 1 (mod 2) for odd parity
- (Used for synchronous transmission)
- Parity will only detect an odd number of bits in error, so
(letting C(a,b) be the choose function),
- P2 = Sum(k=1 to W) C(W,k)(1-Pb)^[B(W-k)]
[Sum(j=2,4,...to B) C(B,j)Pb^j(1-Pb)^(B-j)]^k
- = sum of probabilities that k>0 words are bad in
a way that won't be detected, for each way of
choosing those k words from W words per frame,
and that the remaining words are not bad.
The probability that the W-k remaining words
are error-free is (1-Pb)^[B(W-k)], while the
probability that the k words all have
undetected errors is the probability that one
word has an undetected error to the kth power.
The probability that one word has an undetected
error is the sum of the probabilities that it
has a non-zero, even number, j, of errors.
This is just the number of ways to choose j
bad bits in B times the probability that they
are all bad and the rest are all good.
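The formula for P2 can be checked numerically; a sketch (names are mine), computing the inner word probability q first:

```python
from math import comb

def parity_residual(Pb, B, W):
    """P2 for a frame of W words of B bits each (parity included):
    probability that at least one word has an undetected error
    (a nonzero, even number of bad bits) and no word has a
    detected (odd) error."""
    # probability that one word has a nonzero, even number of errors
    q = sum(comb(B, j) * Pb**j * (1 - Pb) ** (B - j)
            for j in range(2, B + 1, 2))
    return sum(comb(W, k) * (1 - Pb) ** (B * (W - k)) * q**k
               for k in range(1, W + 1))
```

Summing the binomial theorem over k (including k = 0) shows this equals ((1-Pb)^B + q)^W - (1-Pb)^(BW), a handy cross-check.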
- b. LRC/VRC
- Longitudinal Redundancy Check/Vertical Redundancy Check
scheme uses not only one parity bit per word (or row of
the frame, considered now as a matrix), but also a "parity
check character", comprising the entire last row of the
matrix, with each bit in the row checking the parity of
the corresponding column. The row parity bits form the
last column and are called the VRC, while the column
parity bits form the last row and are called the LRC or
the parity check character.
LRC/VRC will fail to detect errors that occur in an
"even rectangular form", and other forms harder to describe
as long as there are an even number of errors in each column
and each row.
The LRC/VRC code has the whole matrix as a code word, and
has Hamming distance 4 (if bit i,j is inverted, then the
parity bit for both row i and the parity bit for column j
must also be inverted, thus the parity bit for the LRC itself
must be inverted, so there are no code words less than HD 4
away from any valid code word).
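A sketch of LRC/VRC in Python (function names are mine), including a demonstration that the four corners of an "even rectangle" of errors slip through:

```python
def lrc_vrc_encode(rows):
    """Append a VRC bit (even parity) to each row, then an LRC
    row holding the even-parity bit of each column (the parity
    check character; its last bit also checks the VRC column)."""
    coded = [row + [sum(row) % 2] for row in rows]      # VRC column
    lrc = [sum(col) % 2 for col in zip(*coded)]         # LRC row
    return coded + [lrc]

def lrc_vrc_check(block):
    """True iff every row and every column has even parity."""
    return (all(sum(row) % 2 == 0 for row in block) and
            all(sum(col) % 2 == 0 for col in zip(*block)))

block = lrc_vrc_encode([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
assert lrc_vrc_check(block)
block[0][0] ^= 1                       # a single error is caught
assert not lrc_vrc_check(block)
for r, c in [(0, 1), (1, 0), (1, 1)]:  # complete an "even rectangle"
    block[r][c] ^= 1
assert lrc_vrc_check(block)            # ...and it goes undetected
```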
c. CRC - Cyclic Redundancy Check
Given an m-bit message M and a predetermined (r+1)-bit
number P, generate an r-bit frame check sequence (FCS)
that, appended to M, will produce an (m+r)-bit number T
divisible by P (modulo 2).
- Note that addition and subtraction are done modulo 2,
so that they are both equivalent to bitwise XOR.
- T = 2^r M + FCS,
- where FCS is computed in this way:
divide 2^r M by P, so that
- 2^r M = QP + R, and let FCS = R.
Then
- T = 2^r M + FCS = QP + R + R = QP
and is divisible by P.
Note that the remainder has only r bits, and that
it must be padded left with leading zeros if it has
less than r bits.
This can also be viewed as polynomials with binary
coefficients, with operations again modulo 2, so
- T(X) = X^r M(X) + R(X).
- Note: P is often expressed as a polynomial.
When the received transmission, T', is divided by P, it
is assumed to be error-free if and only if it has no
remainder.
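The modulo-2 long division described above is just shift-and-XOR; a sketch (names and the small 4-bit example are mine):

```python
def mod2_remainder(dividend, divisor, nbits):
    """Remainder of modulo-2 long division: repeatedly XOR the
    divisor (of degree r) under the highest set bit of the
    dividend, which is nbits wide."""
    r = divisor.bit_length() - 1
    for shift in range(nbits - r - 1, -1, -1):
        if dividend & (1 << (shift + r)):   # leading bit set?
            dividend ^= divisor << shift    # subtract (XOR) P
    return dividend

def crc_encode(M, m, P):
    """T = 2^r * M + FCS, where FCS = remainder of 2^r * M / P."""
    r = P.bit_length() - 1
    fcs = mod2_remainder(M << r, P, m + r)
    return (M << r) | fcs

T = crc_encode(0b1101, 4, 0b1011)          # M = 1101, P = 1011 (r = 3)
assert mod2_remainder(T, 0b1011, 7) == 0   # T is divisible by P
```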
Errors that won't be caught by CRC:
- Let E be the (m+r)-bit error vector, with a 1 in
each location in which there is an error in the
received transmission, T'.
- T' = T + E.
- Then T' will pass the CRC test only if P|T', or
- T' = AP = T + E = QP + E, so
- AP - QP = (A-Q)P = E.
- Since P divides the left side, it must also divide the
right side, E. Hence, only error vectors divisible
by P will be undetected. If the MSB of P is 1 (P has
degree r), then there are exactly 2^m - 1 of these nonzero
undetectable error words, but how are they distributed?
- If first & last bits of P are 1, then
all single-bit errors are caught.
- If P(X) has a factor with at least 3 terms, then
all 2 bit errors will be caught.
- If (X+1) is a factor, then
all odd number of errors are caught
- Burst errors of length r bits or less will be caught, as
will most longer burst errors.
Check out Ross N. Williams'
A PAINLESS GUIDE TO CRC ERROR DETECTION ALGORITHMS
for a longer and nicer explanation, along with code
and some practical tips.
3. Methods of Error Correction
- a. Backward Error Correction (BEC, or Reverse Error Correction):
- Send the message, detect if there was an error, then
request that the sender retransmit the message. This
not only requires error-detection, but also duplex
communication. It is not suitable for situations in which
many stations would have to wait through a retransmission
(such as broadcast media) or when effectively simplex lines
are used (as in space probes). It also has the grave
disadvantage that a consistent error (say a line stuck at
one) will never be corrected: each time the received
transmission will have the same error, so there is no way
to infer which bit is bad (unless diagnostic procedures
are used, which usually requires intervention from a
higher layer). The same message is sent over and over
until some other error causes the received transmission
to pass the error detection test.
The great advantage of REC is that it is simple and
efficient.
- b. Forward Error Correction
- Here, the receiver corrects the error without requiring
any further information from the sender. This requires
a minimum amount of redundancy in the transmission. Not
only must an error be detected, but its location must be
determined. For one-bit error correction on an n-bit
frame, we must be able to identify one out of the n bit
positions if there is an error, or state that there is no
error. Thus, we require ceiling of log2(n+1) bits of
redundancy in the frame.
Since d-bit error correction requires a Hamming distance
of 2d+1 in the code, or that each code word have a
neighborhood of radius d around it such that no two
neighborhoods overlap, we can place a limit on the number N
of code words an n-bit frame may have for 1-bit EC:
- (n+1)N <= 2^n,
- since each code word must have a neighborhood
of size n+1 and they can't overlap.
- N <= 2^n / (n+1)
- The bound 2^n/(n+1) is an integer only if n+1 is a power of 2, so let
- n+1 = 2^k
- n = 2^k - 1.
Only for frames with 2^k-1 bits can we hope for perfection,
that the neighborhoods form a partition of the space.
- i. TMR
- One way to do forward error correction is to repeat
each bit of the message several times, and let the
receiver take a vote to determine what the correct
value was for each bit. Triple Modular Redundancy (TMR)
is a method in use for reliable computation, and it
can be used for error correction as well. It will
perform one-bit EC by using three bits to encode each
bit of the frame. Clearly it falls far short of the
efficiency bound of the previous section for frames of more than 3 bits.
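A sketch of TMR (names are mine): each bit is sent three times and the receiver takes a majority vote, so any single bad copy per triple is corrected.

```python
def tmr_encode(bits):
    """Repeat each bit three times."""
    return [b for b in bits for _ in range(3)]

def tmr_decode(coded):
    """Majority vote over each group of three received copies."""
    return [int(sum(coded[i:i + 3]) >= 2)
            for i in range(0, len(coded), 3)]

sent = tmr_encode([1, 0, 1])          # 9 bits for a 3-bit frame
sent[4] ^= 1                          # one error in the middle triple
assert tmr_decode(sent) == [1, 0, 1]  # corrected by the vote
```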
- ii. Hamming Codes
- Let n=2^k-1 be the total number of bits sent in a word.
We will use k check bits, each of which will check the
parity of (n+1)/2 (roughly half) of the bits in the word.
If we associate an n-bit vector with each check bit
so that there is a 1 in the vector iff the check bit
checks the parity of the corresponding bit in the word
and a 0 otherwise, then the n columns of the resulting
k x n matrix must be distinct and nonzero, so that each
single-bit error position produces a unique, nonzero
pattern of failed checks (i.e., all possibilities are
distinguished).
A convenient way in which to do this is to number the
bits of the word 1,2,...,2^k-1, and let the bits whose
positions are powers of 2 be the check bits. Check
bit 2^i will be the parity bit for all bits in the
word whose position has the ith bit on. For example,
bit 12 = 8 + 4 will be checked by parity bits in
positions 8 and 4.
(Note that each parity bit must check itself.)
If there is an error in the transmission, then it will
be detected and the error's position is just the sum
of the positions of the bad check bits: the syndrome
gives the location directly.
- Ex:
- bit positions 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
- in hexadecimal F E D C B A 9 8 7 6 5 4 3 2 1
- data bits 1 0 1 0 0 1 0 0 1 0 1
- parity bits 1 1 0 1
- bit 8 checks ^--^--^--^--^--^--^--^
- bit 4 checks ^--^--^--^--------------^--^--^--^
- bit 2 checks ^--^--------^--^--------^--^--------^--^
- bit 1 checks ^-----^-----^-----^-----^-----^-----^-----^
- code word 1 0 1 0 0 1 0 1 0 1 0 1 1 0 1
- error word 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
- received word 1 0 1 0 0 0 0 1 0 1 0 1 1 0 1
- bit 8 should be ^--^--^--^--^--^--^--0
- bit 4 should be ^--^--^--^--------------^--^--^--1
- bit 2 should be ^--^--------^--^--------^--^--------^--1
- bit 1 should be ^-----^-----^-----^-----^-----^-----^-----1
- bad parity bits X 8 X 2
- error location is 8 + 2 = A
- The error syndrome (i.e., the set of bad parity bits) exactly
identifies the error location by addition of the bad parity
bit locations. If the bit in error has a location with bit
i = 1, then it is checked by parity bit 2^i. If its location
has bit i = 0 then it is not checked by parity bit 2^i, so
adding the bad parity bit locations must give the bad bit's
location (even if it is a parity bit itself). Note that with
4 parity bits, we can distinguish 2^4 = 16 "locations." There
are only 15 bit locations - so what is the 16th "location?"
You may think of it as location 0 - checked by no parity bits,
and corresponding to no errors (there are exactly 16 ways in
which a 15-bit word can have zero or one error in it).
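The 15-bit example above can be reproduced with a short sketch (function names are mine); bit position p is stored at list index p for clarity:

```python
def hamming15_encode(data):
    """Place 11 data bits in the non-power-of-2 positions
    (3,5,6,7,9,...,15) of a 15-bit word, lowest position first,
    then set check bits 1, 2, 4, 8 for even parity over the
    positions whose binary expansion contains them."""
    word = [0] * 16                                   # index 1..15
    data_pos = [p for p in range(1, 16) if p & (p - 1)]
    for p, bit in zip(data_pos, data):
        word[p] = bit
    for c in (1, 2, 4, 8):
        word[c] = sum(word[p] for p in range(1, 16)
                      if p & c and p != c) % 2
    return word

def syndrome(word):
    """Sum of the bad check-bit positions: 0 means no error,
    otherwise it is the position of the (single) flipped bit."""
    return sum(c for c in (1, 2, 4, 8)
               if sum(word[p] for p in range(1, 16) if p & c) % 2)

# data bits of the example, at positions 3, 5, 6, 7, 9, ..., 15
w = hamming15_encode([1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1])
assert syndrome(w) == 0
w[10] ^= 1                 # the error word flips bit 10 (hex A)
assert syndrome(w) == 10   # syndrome = 8 + 2 points right at it
```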
- Notice that perfect Hamming codes exactly match the
lower bounds given above. They become more efficient
as the word size increases: let R be the ratio of
message bits m, to word bits n = 2^k-1.
- lim (n->inf) R = lim (n->inf) (2^k-1-k)/(2^k-1) = 1.
- However, the probability of more than one error also
increases as word size increases, so there is some
tradeoff between efficiency and efficacy.
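The efficiency claim is easy to tabulate; a one-function sketch (the name is mine):

```python
def hamming_rate(k):
    """Message-to-word ratio R = (2^k - 1 - k) / (2^k - 1)
    for a perfect Hamming code with k check bits."""
    n = 2**k - 1
    return (n - k) / n

# R climbs toward 1 as the word grows:
# k=3 -> 4/7, k=4 -> 11/15, k=5 -> 26/31, k=10 -> 1013/1023
```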
4. Burst errors
Most of the techniques described do not handle burst errors
well. LRC/VRC does reasonably well, since a burst no longer
than word length will be detected by the LRC, and CRC does
very well with most bursts (as noted in that section). In
order to improve the performance of parity or Hamming codes
in the presence of burst errors, we can buffer a frame of
words, each with its own redundant bits, and then send out
the bits of the frame column by column rather than by rows.
If a burst is shorter than the number of rows in the frame,
then each row will have one or fewer errors, and they will
all be detected (or corrected).
This is interleaving, and may be done on the bit,
byte, word, or block level. Burst errors are distributed
over the interleaved (time-division multiplexed) channels,
so that if K channels are interleaved (i.e., an interleaving
depth of K is used), then each channel suffers an error
of length roughly 1/K of the original burst error.
In this way, the LRC/VRC can detect bursts that are no longer
than the number of rows in the frame.
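Interleaving is essentially a transpose; a sketch (names are mine) showing a 2-bit burst on the line landing in two different rows, one error each:

```python
def interleave(frame):
    """Send a frame (a list of K rows) column by column."""
    return [frame[r][c]
            for c in range(len(frame[0]))
            for r in range(len(frame))]

def deinterleave(stream, K):
    """Rebuild the K rows at the receiver."""
    cols = len(stream) // K
    return [[stream[c * K + r] for c in range(cols)]
            for r in range(K)]

frame = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]]   # K = 3 rows
line = interleave(frame)
line[3] ^= 1            # a 2-bit burst on the line...
line[4] ^= 1
rows = deinterleave(line, 3)
# ...hits each row at most once, so per-row codes can handle it
assert all(sum(a != b for a, b in zip(r, f)) <= 1
           for r, f in zip(rows, frame))
```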
LRC/VRC is a form of concatenated coding, another
effective way to increase the power of error correction methods
and resistance to burst errors.
5. Tape Storage - Error Correction
Nine-track tapes (as well as other storage media) use forward
error correction methods. One common method is to use the ninth
track as a parity track, and include a powerful error detection
check sequence at the end of each block stored on the medium.
Since a common error is to have one bad track, this bad track is
likely to be detected by the check sequence at the end of the
block, and as long as only one track is bad, the parity track can
be used to correct the errorful track.
6. Concatenated Codes
When a data channel is coded at more than one level, then it is
said to be encoded using concatenated coding. Essentially the
inner code (the one closest to the transmitter)...(to be
continued.... ran out of time just now :( )
In addition to the receiver being able to detect errors and perhaps to
correct them, the sender and the receiver need to agree upon some common
control information and frame formats, along with protocols to use in
order to effect the successful transfer of a sequence of frames.
Datalink control and sliding window protocols are the topic of the next page.
This document is
copyright 1995
by Richard E. Newman-Wolfe.
Send comments to nemo@cis.ufl.edu