R. E. Newman-Wolfe
, University of Florida
Last modified 1/30/95.
Error-detection and error-correction
Here we consider how to determine whether a logical unit of data has
been received correctly (after either data transmission or storage),
and if it is determined to be incorrect, how to attempt to correct it.
First we develop some terminology and mathematics needed to describe
errors and then we give several methods of error detection and correction.
We will assume that the receiver has achieved block synchronization,
which is the topic of the previous page.
1. Preliminaries
- a. Hamming distance H(a,b) = # of bit positions in which
a differs from b, i.e., the number of 1's in a XOR b.
- Hamming distance of a code is the minimum over all pairs
of distinct code words of the Hamming distance
between them, i.e.,
- H(code) = min {H(a,b) | a<>b and a, b in code}
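These two definitions can be sketched in a few lines of (modern) Python; the function names are mine:

```python
def hamming(a, b):
    """H(a,b): number of bit positions in which a differs from b,
    i.e., the number of 1's in a XOR b."""
    return bin(a ^ b).count("1")

def code_distance(code):
    """H(code): minimum Hamming distance over all pairs of
    distinct code words."""
    return min(hamming(a, b)
               for i, a in enumerate(code)
               for b in code[i + 1:])

# The even-parity code on 2 data bits has distance 2:
print(code_distance([0b000, 0b011, 0b101, 0b110]))  # -> 2
```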
- b. Forward error correction - enough redundancy is transmitted
in the code that errors can be corrected by the
receiver without retransmission
(used when retransmission is impractical, as in
distant space probes, simplex lines, broadcast media
where many stations may have to listen to unnecessary
retransmission)
- c. Backward error correction - errors are detected and
retransmission requested.
(most common, generally more efficient, but has the
flaw that a single error, repeated in every
retransmission, will mean that the message is never delivered)
- d. Distance Requirements
- n-bit error detection requires a code with Hamming distance n+1,
(i.e., a distance d code can detect d-1 bit errors).
- c-bit error correction requires a code with Hamming distance 2c+1,
(i.e., a distance d code can correct [(d-1)/2] bit errors).
- e. Residual error rate - the probability that an error will
go undetected
- Pb = uniform probability that a bit is in error (random)
- P1 = probability that a frame is received error-free
- P2 = probability that a frame is received with one or
more undetected error(s) (residual error rate)
- P3 = probability that a frame is received with one or
more detected error(s)
- Assuming that errors are independent and have constant
probability Pb,
- P1 = (1-Pb)^F, where F = bits per frame
- P2+P3 = 1-P1
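Under the independence assumption these probabilities are one-liners; a minimal sketch (the function name and the example values of Pb and F are mine):

```python
def frame_probs(Pb, F):
    """Given uniform, independent bit-error probability Pb and
    F bits per frame, return (P1, P2 + P3), where
    P1 = (1 - Pb)^F  is the probability the frame is error-free
    and 1 - P1 is the probability of one or more errors."""
    P1 = (1 - Pb) ** F
    return P1, 1 - P1
```

For example, Pb = 1e-6 and F = 8000 bits gives P1 approximately 0.992, so nearly one frame in a hundred contains at least one error.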
2. Methods of Error Detection
- a. Parity
- One redundant bit per word, the bit's value is
- Sum i=0 to m-1 of a_i (mod 2) for even parity
- (Used for asynchronous transmission)
- or Sum i=0 to m-1 of a_i + 1 (mod 2) for odd parity
- (Used for synchronous transmission)
- Parity will only detect an odd number of bits in error, so
(letting C(a,b) be the choose function),
- P2 = Sum(k=1 to W) C(W,k)(1-Pb)^[B(W-k)]
[Sum(j=2,4,...to B) C(B,j)Pb^j(1-Pb)^(B-j)]^k
- = sum of probabilities that k>0 words are bad in
a way that won't be detected, for each way of
choosing those k words from W words per frame,
and that the remaining words are not bad.
The probability that the W-k remaining words
are error-free is (1-Pb)^[B(W-k)], while the
probability that the k words all have
undetected errors is the probability that one
word has an undetected error to the kth power.
The probability that one word has an undetected
error is the sum of the probabilities that it
has a non-zero, even number, j, of errors.
This is just the number of ways to choose j
bad bits in B times the probability that they
are all bad and the rest are all good.
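The formula for P2 can be checked numerically; a sketch (names are mine), computing the inner word probability q first:

```python
from math import comb

def parity_residual(Pb, B, W):
    """P2 for a frame of W words of B bits each (parity included):
    probability that at least one word has an undetected error
    (a nonzero, even number of bad bits) and no word has a
    detected (odd) error."""
    # probability that one word has a nonzero, even number of errors
    q = sum(comb(B, j) * Pb**j * (1 - Pb) ** (B - j)
            for j in range(2, B + 1, 2))
    return sum(comb(W, k) * (1 - Pb) ** (B * (W - k)) * q**k
               for k in range(1, W + 1))
```

Summing the binomial theorem over k (including k = 0) shows this equals ((1-Pb)^B + q)^W - (1-Pb)^(BW), a handy cross-check.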
- b. LRC/VRC
- Longitudinal Redundancy Check/Vertical Redundancy Check
scheme uses not only one parity bit per word (or row of
the frame, considered now as a matrix), but also a "parity
check character", comprising the entire last row of the
matrix, with each bit in the row checking the parity of
the corresponding column. The row parity bits form the
last column and are called the VRC, while the column
parity bits form the last row and are called the LRC or
the parity check character.
LRC/VRC will fail to detect errors that occur in an
"even rectangular form", and other forms harder to describe
as long as there are an even number of errors in each column
and each row.
The LRC/VRC code has the whole matrix as a code word, and
has Hamming distance 4 (if bit i,j is inverted, then the
parity bit for both row i and the parity bit for column j
must also be inverted, thus the parity bit for the LRC itself
must be inverted, so there are no code words less than HD 4
away from any valid code word).
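A sketch of LRC/VRC in Python (function names are mine), including a demonstration that the four corners of an "even rectangle" of errors slip through:

```python
def lrc_vrc_encode(rows):
    """Append a VRC bit (even parity) to each row, then an LRC
    row holding the even-parity bit of each column (the parity
    check character; its last bit also checks the VRC column)."""
    coded = [row + [sum(row) % 2] for row in rows]      # VRC column
    lrc = [sum(col) % 2 for col in zip(*coded)]         # LRC row
    return coded + [lrc]

def lrc_vrc_check(block):
    """True iff every row and every column has even parity."""
    return (all(sum(row) % 2 == 0 for row in block) and
            all(sum(col) % 2 == 0 for col in zip(*block)))

block = lrc_vrc_encode([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
assert lrc_vrc_check(block)
block[0][0] ^= 1                       # a single error is caught
assert not lrc_vrc_check(block)
for r, c in [(0, 1), (1, 0), (1, 1)]:  # complete an "even rectangle"
    block[r][c] ^= 1
assert lrc_vrc_check(block)            # ...and it goes undetected
```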
c. CRC - Cyclic Redundancy Check
Given an m-bit message M and a predetermined (r+1)-bit
number P, generate an r-bit frame check sequence (FCS)
that, appended to M, will produce an (m+r)-bit number T
divisible by P (modulo 2).
- Note that addition and subtraction are done modulo 2,
so that they are both equivalent to bitwise XOR.
- T = 2^r M + FCS,
- where FCS is computed in this way:
divide 2^r M by P, so that
- 2^r M = QP + R, and let FCS = R.
Then
- T = 2^r M + FCS = QP + R + R = QP
and is divisible by P.
Note that the remainder has only r bits, and that
it must be padded left with leading zeros if it has
less than r bits.
This can also be viewed as polynomials with binary
coefficients, with operations again modulo 2, so
- T(X) = X^r M(X) + R(X).
- Note: P is often expressed as a polynomial.
When the received transmission, T', is divided by P, it
is assumed to be error-free if and only if it has no
remainder.
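The modulo-2 long division described above is just shift-and-XOR; a sketch (names and the small 4-bit example are mine):

```python
def mod2_remainder(dividend, divisor, nbits):
    """Remainder of modulo-2 long division: repeatedly XOR the
    divisor (of degree r) under the highest set bit of the
    dividend, which is nbits wide."""
    r = divisor.bit_length() - 1
    for shift in range(nbits - r - 1, -1, -1):
        if dividend & (1 << (shift + r)):   # leading bit set?
            dividend ^= divisor << shift    # subtract (XOR) P
    return dividend

def crc_encode(M, m, P):
    """T = 2^r * M + FCS, where FCS = remainder of 2^r * M / P."""
    r = P.bit_length() - 1
    fcs = mod2_remainder(M << r, P, m + r)
    return (M << r) | fcs

T = crc_encode(0b1101, 4, 0b1011)          # M = 1101, P = 1011 (r = 3)
assert mod2_remainder(T, 0b1011, 7) == 0   # T is divisible by P
```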
Errors that won't be caught by CRC:
- Let E be the (m+r)-bit error vector, with a 1 in
each location in which there is an error in the
received transmission, T'.
- T' = T + E.
- Then T' will pass the CRC test only if P|T', or
- T' = AP = T + E = QP + E, so
- AP - QP = (A-Q)P = E.
- Since P divides the left side, it must also divide the
right side, E. Hence, only error vectors divisible
by P will be undetected. If the MSB of P is 1 (P has
degree r), then there are exactly 2^m - 1 of these nonzero
undetectable error words, but how are they distributed?
- If first & last bits of P are 1, then
all single-bit errors are caught.
- If P(X) has a factor with at least 3 terms, then
all 2 bit errors will be caught.
- If (X+1) is a factor, then
all odd number of errors are caught
- Burst errors of length r bits or less will be caught, as
will most longer burst errors.
Check out Ross N. Williams'
A PAINLESS GUIDE TO CRC ERROR DETECTION ALGORITHMS
for a longer and nicer explanation, along with code
and some practical tips.
3. Methods of Error Correction
- a. Backward Error Correction (BEC, or Reverse Error Correction):
- Send the message, detect if there was an error, then
request that the sender retransmit the message. This
not only requires error-detection, but also duplex
communication. It is not suitable for situations in which
many stations would have to wait through a retransmission
(such as broadcast media) or when effectively simplex lines
are used (as in space probes). It also has the grave
disadvantage that a consistent error (say a line stuck at
one) will never be corrected: each time the received
transmission will have the same error, so there is no way
to infer which bit is bad (unless diagnostic procedures
are used, which usually requires intervention from a
higher layer). The same message is sent over and over
until some other error causes the received transmission
to pass the error detection test.
The great advantage of REC is that it is simple and
efficient.
- b. Forward Error Correction
- Here, the receiver corrects the error without requiring
any further information from the sender. This requires
a minimum amount of redundancy in the transmission. Not
only must an error be detected, but its location must be
determined. For one-bit error correction on an n-bit
frame, we must be able to identify one out of the n bit
positions if there is an error, or state that there is no
error. Thus, we require ceiling of log2(n+1) bits of
redundancy in the frame.
Since d-bit error correction requires a Hamming distance
of 2d+1 in the code, or that each code word have a
neighborhood of radius d around it such that no two
neighborhoods overlap, we can place a limit on the number N
of code words an n-bit frame may have for 1-bit EC:
- (n+1)N <= 2^n,
- since each code word must have a neighborhood
of size n+1 and they can't overlap.
- N <= 2^n / (n+1)
- The bound 2^n/(n+1) is an integer only if n+1 is a power of 2, so let
- n+1 = 2^k
- n = 2^k - 1.
Only for frames with 2^k-1 bits can we hope for perfection,
that the neighborhoods form a partition of the space.
- i. TMR
- One way to do forward error correction is to repeat
each bit of the message several times, and let the
receiver take a vote to determine what the correct
value was for each bit. Triple Modular Redundancy (TMR)
is a method in use for reliable computation, and it
can be used for error correction as well. It will
perform one-bit EC by using three bits to encode each
bit of the frame. Clearly it falls far short of the
efficiency bound of the previous section for frames of more than 3 bits.
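A sketch of TMR (names are mine): each bit is sent three times and the receiver takes a majority vote, so any single bad copy per triple is corrected.

```python
def tmr_encode(bits):
    """Repeat each bit three times."""
    return [b for b in bits for _ in range(3)]

def tmr_decode(coded):
    """Majority vote over each group of three received copies."""
    return [int(sum(coded[i:i + 3]) >= 2)
            for i in range(0, len(coded), 3)]

sent = tmr_encode([1, 0, 1])          # 9 bits for a 3-bit frame
sent[4] ^= 1                          # one error in the middle triple
assert tmr_decode(sent) == [1, 0, 1]  # corrected by the vote
```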
- ii. Hamming Codes
- Let n=2^k-1 be the total number of bits sent in a word.
We will use k check bits, each of which will check the
parity of (n+1)/2 (roughly half) of the bits in the word.
If we associate an n-bit vector with each check bit
so that there is a 1 in the vector iff the check bit
checks the parity of the corresponding bit in the word
and a 0 otherwise, then the n columns of the resulting
k x n matrix must be distinct and nonzero, so that each
single-bit error position produces a unique, nonzero
pattern of failed checks (i.e., all possibilities are
distinguished).
A convenient way in which to do this is to number the
bits of the word 1,2,...,2^k-1, and let the bits whose
positions are powers of 2 be the check bits. Check
bit 2^i will be the parity bit for all bits in the
word whose position has the ith bit on. For example,
bit 12 = 8 + 4 will be checked by parity bits in
positions 8 and 4.
(Note that each parity bit must check itself.)
If there is an error in the transmission, then it will
be detected and the error's position is just the sum
of the positions of the bad check bits: the syndrome
gives the location directly.
- Ex:
- bit positions 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
- in hexadecimal F E D C B A 9 8 7 6 5 4 3 2 1
- data bits 1 0 1 0 0 1 0 0 1 0 1
- parity bits 1 1 0 1
- bit 8 checks ^--^--^--^--^--^--^--^
- bit 4 checks ^--^--^--^--------------^--^--^--^
- bit 2 checks ^--^--------^--^--------^--^--------^--^
- bit 1 checks ^-----^-----^-----^-----^-----^-----^-----^
- code word 1 0 1 0 0 1 0 1 0 1 0 1 1 0 1
- error word 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
- received word 1 0 1 0 0 0 0 1 0 1 0 1 1 0 1
- bit 8 should be ^--^--^--^--^--^--^--0
- bit 4 should be ^--^--^--^--------------^--^--^--1
- bit 2 should be ^--^--------^--^--------^--^--------^--1
- bit 1 should be ^-----^-----^-----^-----^-----^-----^-----1
- bad parity bits X 8 X 2
- error location is 8 + 2 = A
- The error syndrome (i.e., the set of bad parity bits) exactly
identifies the error location by addition of the bad parity
bit locations. If the bit in error has a location with bit
i = 1, then it is checked by parity bit 2^i. If its location
has bit i = 0 then it is not checked by parity bit 2^i, so
adding the bad parity bit locations must give the bad bit's
location (even if it is a parity bit itself). Note that with
4 parity bits, we can distinguish 2^4 = 16 "locations." There
are only 15 bit locations - so what is the 16th "location?"
You may think of it as location 0 - checked by no parity bits,
and corresponding to no errors (there are exactly 16 ways in
which a 15-bit word can have zero or one error in it).
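The 15-bit example above can be reproduced with a short sketch (function names are mine); bit position p is stored at list index p for clarity:

```python
def hamming15_encode(data):
    """Place 11 data bits in the non-power-of-2 positions
    (3,5,6,7,9,...,15) of a 15-bit word, lowest position first,
    then set check bits 1, 2, 4, 8 for even parity over the
    positions whose binary expansion contains them."""
    word = [0] * 16                                   # index 1..15
    data_pos = [p for p in range(1, 16) if p & (p - 1)]
    for p, bit in zip(data_pos, data):
        word[p] = bit
    for c in (1, 2, 4, 8):
        word[c] = sum(word[p] for p in range(1, 16)
                      if p & c and p != c) % 2
    return word

def syndrome(word):
    """Sum of the bad check-bit positions: 0 means no error,
    otherwise it is the position of the (single) flipped bit."""
    return sum(c for c in (1, 2, 4, 8)
               if sum(word[p] for p in range(1, 16) if p & c) % 2)

# data bits of the example, at positions 3, 5, 6, 7, 9, ..., 15
w = hamming15_encode([1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1])
assert syndrome(w) == 0
w[10] ^= 1                 # the error word flips bit 10 (hex A)
assert syndrome(w) == 10   # syndrome = 8 + 2 points right at it
```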
- Notice that perfect Hamming codes exactly match the
lower bounds given above. They become more efficient
as the word size increases: let R be the ratio of
message bits m, to word bits n = 2^k-1.
- lim (n->inf) R = lim (n->inf) (2^k-1-k)/(2^k-1) = 1.
- However, the probability of more than one error also
increases as word size increases, so there is some
tradeoff between efficiency and efficacy.
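The efficiency claim is easy to tabulate; a one-function sketch (the name is mine):

```python
def hamming_rate(k):
    """Message-to-word ratio R = (2^k - 1 - k) / (2^k - 1)
    for a perfect Hamming code with k check bits."""
    n = 2**k - 1
    return (n - k) / n

# R climbs toward 1 as the word grows:
# k=3 -> 4/7, k=4 -> 11/15, k=5 -> 26/31, k=10 -> 1013/1023
```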
4. Burst errors
Most of the techniques described do not handle burst errors
well. LRC/VRC does reasonably well, since a burst no longer
than word length will be detected by the LRC, and CRC does
very well with most bursts (as noted in that section). In
order to improve the performance of parity or Hamming codes
in the presence of burst errors, we can buffer a frame of
words, each with its own redundant bits, and then send out
the bits of the frame column by column rather than by rows.
If a burst is shorter than the number of rows in the frame,
then each row will have one or fewer errors, and they will
all be detected (or corrected).
This is interleaving, and may be done on the bit,
byte, word, or block level. Burst errors are distributed
over the interleaved (time-division multiplexed) channels,
so that if K channels are interleaved (i.e., an interleaving
depth of K is used), then each channel suffers an error
of length roughly 1/K of the original burst error.
In this way, the LRC/VRC can detect bursts that are no longer
than the number of rows in the frame.
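Interleaving is essentially a transpose; a sketch (names are mine) showing a 2-bit burst on the line landing in two different rows, one error each:

```python
def interleave(frame):
    """Send a frame (a list of K rows) column by column."""
    return [frame[r][c]
            for c in range(len(frame[0]))
            for r in range(len(frame))]

def deinterleave(stream, K):
    """Rebuild the K rows at the receiver."""
    cols = len(stream) // K
    return [[stream[c * K + r] for c in range(cols)]
            for r in range(K)]

frame = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]]   # K = 3 rows
line = interleave(frame)
line[3] ^= 1            # a 2-bit burst on the line...
line[4] ^= 1
rows = deinterleave(line, 3)
# ...hits each row at most once, so per-row codes can handle it
assert all(sum(a != b for a, b in zip(r, f)) <= 1
           for r, f in zip(rows, frame))
```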
LRC/VRC is a form of concatenated coding, another
effective way to increase the power of error correction methods
and resistance to burst errors.
5. Tape Storage - Error Correction
Nine-track tapes (as well as other storage media) use forward
error correction methods. One common method is to use the ninth
track as a parity track, and include a powerful error detection
check sequence at the end of each block stored on the medium.
Since a common error is to have one bad track, this bad track is
likely to be detected by the check sequence at the end of the
block, and as long as only one track is bad, the parity track can
be used to correct the errorful track.
6. Concatenated Codes
When a data channel is coded at more than one level, then it is
said to be encoded using concatenated coding. Essentially the
inner code (the one closest to the transmitter)...(to be
continued.... ran out of time just now :( )
In addition to the receiver being able to detect errors and perhaps to
correct them, the sender and the receiver need to agree upon some common
control information and frame formats, along with protocols to use in
order to effect the successful transfer of a sequence of frames.
Datalink control and sliding window protocols are the topic of the next page.
This document is
copyright 1995
by Richard E. Newman-Wolfe.
Send comments to nemo@cis.ufl.edu