In this class, we often express cryptologic theory in terms of statistical measures. Because the lecture material assumes facility with basic statistics, we present the following brief review. We begin with a summary of parameters and distributions, then discuss computational means for determining such distributions. Finally, we consider how various distributions can be manipulated to achieve certain properties, such as increased entropy or randomness.
Definition. Given a message or image a ∈ F^X, where X denotes the (finite) domain and F the set of values, the mean of a is given by

µ(a) = Σ_{x∈X} a(x) / |X| ,
and the standard deviation about the mean is given by

σ(a) = ( Σ_{x∈X} (a(x) - µ)² / |X| )^(1/2) ,
which is the square root of the variance of a, denoted by v(a) = σ².

Example. If a = (1,3,2,4), then µ(a) = (1+3+2+4)/4 = 10/4 = 2.5, since there are four elements in a. The standard deviation σ(a) is computed according to the following steps: the deviations from the mean are (-1.5, 0.5, -0.5, 1.5), so the variance is v(a) = (2.25 + 0.25 + 0.25 + 2.25)/4 = 1.25, and σ(a) = (1.25)^(1/2) ≈ 1.118.
Definition. Given a ∈ F^X, the histogram of a is the map h(a) that counts the occurrences of each value:

h(a)(f) = |{x ∈ X : a(x) = f}| , f ∈ F .
Example. If a = (A,B,R,A,C,A,D,A,B,R,A) and we write h(A) ≡ h(a)(A), then h(A) = 5, since there are 5 A's in "Abracadabra". Additionally, h(B) = 2, h(C) = 1, and so forth.
Algorithm. If a ∈ (Z_n)^X, then the histogram h(a) can be computed as:
h := 0
for each x in X do:
    { h(a(x)) := h(a(x)) + 1 } .

The resultant histogram will have n bins.
Observation. The mean and variance of a can be computed directly from the histogram:

µ(a) = Σ_{f∈F} f · h(f) / |X|

and

σ²(a) = Σ_{f∈F} (f - µ)² · h(f) / |X| .
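A minimal Python sketch of these computations, assuming the message or image a is given as a flat sequence of numeric values (the function names below are illustrative, not part of the lecture notation):

    from collections import Counter

    def histogram(a):
        """Count the occurrences of each value f in the message/image a."""
        return Counter(a)

    def mean_and_variance(a):
        """Compute the mean and variance of a from its histogram."""
        h = histogram(a)
        n = len(a)                                        # |X|
        mu = sum(f * h[f] for f in h) / n                 # mean
        var = sum((f - mu) ** 2 * h[f] for f in h) / n    # variance (sigma^2)
        return mu, var

    # Example: a = (1, 3, 2, 4) gives mean 2.5 and variance 1.25.
    print(mean_and_variance([1, 3, 2, 4]))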
Remark. It is appropriate to think of a histogram as a distribution of frequency-of-occurrence. Note that a frequency distribution h(a) is proportional to a probability distribution Pr(a), since Pr(f) = h(f) / Σh, where f ∈ domain(h) and Σh denotes the sum over all histogram bins. Alternatively, we can say that f ∈ range(a), which implies that

Pr(a) = h(a) / Σh .
The preceding summation operation can be thought of in terms of Riemann-Stieltjes integration, versus Lebesgue integration where Pr(f) = h(f) / Σh. For example, via Lebesgue integration, h(a) could be normalized to the unit interval [0,1] ⊂ R to yield the following distribution of the values of a:

Pr(a) = h(a) / Σ(h) .
These techniques for constructing a probability distribution from the image (message) histogram will be useful for attacking encryptions that disclose part or all of the plaintext histogram in the corresponding ciphertext.
Definition. The first and second moments of a histogram h are m1(h) = µ and m2(h) = σ², respectively. In general, the k-th central moment of h is given by

mk(h) = Σ_{f∈F} h(f) · (f - µ(h))^k / |X| .
Similarly, the k-th spatial moment of an image a ∈ F^X defined on a numeric domain is given by

mk(a) = Σ_{x∈X} a(x) · x^k / |X| .
For example, Figure A-1(a) illustrates the spatial mean or centroid m1,1(a) of a sickle-shaped object in an image, while Figure A-1(b) shows the mean of a histogram that could be derived from an image similar to a that has ten greylevels. In Figure A-1(b), it is only a coincidence that the mean falls within the histogram bin that is the mode, or most frequently occurring value.
Figure A-1. (a) Spatial mean (centroid) of an object in an image defined on a two-dimensional domain; (b) mean of a histogram.
For example, consider n adjacent symbols denoted by s = (s1,s2,...,sn), often called an n-gram, which have a probability of occurrence Pr(s) within a given text corpus. Such simple observations are the basis for the Hill cipher, which we will discuss later in this class, as well as for effective attacks upon a wide variety of encryptions. We thus consider several measures or parameters that can be derived from a histogram.
Observation. When conducting statistical cryptanalysis, one often needs to know when a value in plaintext or ciphertext occurs frequently. In particular, we would like to know the value that occurs most frequently, which is called the mode.
Definition. Given an image a ∈ F^X, the mode of a is the value at which the histogram h(a) attains its maximum:

mode(a) = arg max_{f∈domain(h)} h(f) .
Definition. A distribution that has one (more than one) peak or local maximum is called monomodal (multimodal).

Observation. For similar reasons, we would like to know the partition of domain(h) that divides lesser values (which, taken together, occur 50 percent of the time) from values greater than the partition. The partition value is called the median, which is computed from the cumulative histogram. The following definitions apply.
Definition. Given an image a ∈ F^X, the cumulative histogram c(a) is computed recursively from the histogram h(a) as follows. Let F be indexed by i ∈ {1,2,...,|F|} such that there are no values in F between f_{i-1} and f_i, where f_{i-1} < f_i. Then, we have

[c(a)](f_i) = [c(a)](f_{i-1}) + h(f_i) , with [c(a)](f_1) = h(f_1) .
Definition. Given an image a ∈ F^X and its cumulative histogram c(a), the median of a is the least value at which the cumulative histogram exceeds half the domain size:

median(a) = min { f ∈ domain(c) : c(f) > |X|/2 } .
Algorithm. Given an N-pixel image (or N-character message) a ∈ F^X and its histogram h ∈ N^F, the mode and median of a are computed from the cumulative distribution c ∈ N^F, as follows:

c(f_1) := h(f_1)
mode := f_1
freq := h(f_1)
median := f_1
notm := 1
if (c(f_1) > |X|/2) then { notm := 0 }
for each i from 2 to |F| do:
    { c(f_i) := c(f_{i-1}) + h(f_i)
      if (h(f_i) > freq) then
          { mode := f_i
            freq := h(f_i) }
      if (notm AND c(f_i) > |X|/2) then
          { median := f_i
            notm := 0 } } .
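The following Python sketch mirrors the algorithm above for a message given as a sequence of comparable values (the function name is illustrative):

    from collections import Counter

    def mode_and_median(a):
        """Compute the mode and median of a via its cumulative histogram."""
        h = Counter(a)
        mode, median = None, None
        cum, best = 0, -1
        for f in sorted(h):              # f_1 < f_2 < ... < f_|F|
            cum += h[f]                  # cumulative histogram c(f)
            if h[f] > best:              # track the most frequent value
                mode, best = f, h[f]
            if median is None and cum > len(a) / 2:
                median = f               # first f with c(f) > |X|/2
        return mode, median

    # Example: for "ABRACADABRA" the mode is 'A' and the median is 'B'.
    print(mode_and_median("ABRACADABRA"))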
Observation. One also needs to measure the asymmetry of a histogram, called its skewness, and its kurtosis [Greek kyrtôsis = convexity], to determine if the histogram is sharply peaked or is flat.
Definition. The skewness and kurtosis of a histogram h of a ∈ F^X are given by

skew(h) = m3(h) / σ³    and    kurt(h) = m4(h) / σ⁴ ,

respectively, where σ denotes the standard deviation derived from h.
A distribution that is sharply peaked is called leptokurtotic [Greek leptos = slender or small]. The opposite case, namely, a distribution that tends toward flatness, is called platykurtotic [Greek platys = flat]. A useful mnemonic is "`plat' = `flat'".
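As an illustration, the central moments, skewness, and kurtosis can be computed directly from a histogram; the Python sketch below (function names are illustrative) follows the definitions above:

    from collections import Counter

    def central_moment(h, n, k):
        """k-th central moment m_k of a histogram h over n = |X| samples."""
        mu = sum(f * c for f, c in h.items()) / n
        return sum(c * (f - mu) ** k for f, c in h.items()) / n

    def skew_and_kurtosis(a):
        h, n = Counter(a), len(a)
        sigma = central_moment(h, n, 2) ** 0.5       # standard deviation
        skew = central_moment(h, n, 3) / sigma ** 3
        kurt = central_moment(h, n, 4) / sigma ** 4  # > 3: leptokurtotic, < 3: platykurtotic
        return skew, kurt

    print(skew_and_kurtosis([1, 3, 2, 4, 2, 2, 3]))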
In this section, we discuss the nature and computation of various types of probability distributions (e.g., Gaussian, Poisson, Lorentzian, Gamma, Chi-squared, and Student's t) that will be useful in cryptanalysis. As an example, we present theory and algorithms for the Gaussian distribution. In Section A-3, we discuss the manipulation of such distributions as a prelude to statistical cryptanalysis.
There are many different types of probability distributions, each of which has been derived from observations of natural phenomena. Unfortunately, common probability distributions are rarely derived from first principles. That is, the laws of physics are typically not employed to construct a body of theory from which a given distribution is computed a priori. Rather, data is gathered from observations of one or more processes, and a probability distribution is fitted to measures derived from the data. Thus, in practice, there are relatively few causal models that can deterministically link a given physical process with the statistical distribution that characterizes the outcomes of that process. As a result, one often has little knowledge about why a given distribution occurs in a given situation.
For example, it is well known that the times between arrival of photons at a detector can be characterized with reasonable accuracy by a Poisson distribution. It is likewise known that the greylevels of images that depict natural scenes under spatially uniform illumination are typically Gaussian-distributed. Lettington [Let94] has shown that the greylevels of gradient images taken from selected scenes that contain naturally-occurring or manufactured objects generally conform to a Lorentzian distribution. The reasons for these behaviors are presently not apparent. As a result, there are interesting possibilities for research in causal models that underlie probability distributions.
Example. Given B = {0,1} and a process P: B² -> B², the set of all possible outcomes is B² = {(0,0),(0,1),(1,0),(1,1)}.
Definition. A probability measure on a set S is a map Pr: 2^S -> [0,1] such that the following conditions are satisfied: Pr(S) = 1, and, for all disjoint A, B ⊆ S,

Pr(A ∪ B) = Pr(A) + Pr(B) . (I)
Definition. An event is a subset of a sample space.
Example. E = {a,c} is an event in a sample space S = {a,b,c}.
Definition. Given an event E in sample space S, and another event E´ that is the complement of E, the following statements hold: (1) E ∪ E´ = S and E ∩ E´ = ∅; (2) Pr(E´) = 1 - Pr(E).
Statement 2 follows from the fact that S is the disjoint union of E and E´. Thus, from (I), we have 1 = Pr(S) = Pr(E) + Pr(E´).
Lemma. If events A, B ⊆ S are non-disjoint, then

Pr(A ∪ B) = Pr(A) + Pr(B) - Pr(A ∩ B) . (II)
Example. Given a sample space S = {a,b,c,d} with Pr(a) = Pr(b) = Pr(c) = Pr(d) = 1/4, let A = {a,b} and B = {b,c}, so that Pr(A) = Pr(B) = 0.5. Since A ∩ B = {b}, we have

Pr(A ∪ B) = Pr({a,b,c})
          = Pr(A) + Pr(B) - Pr(A ∩ B)
          = 0.5 + 0.5 - 0.25 = 1.0 - 0.25 = 0.75 .
Lemma. If events A, B ⊆ S are independent (i.e., the occurrence of one does not alter the probability of the other), then

Pr(A ∩ B) = Pr(A) · Pr(B) .
Definition. Given a sample space S with subsets A and B that have nonzero probabilities, the conditional probability of B on A (also called the probability of B given A) is given by

Pr(B|A) = Pr(A ∩ B) / Pr(A) .
Definition. Given events A, B ⊆ S, if A and B are independent, then the following statements hold: Pr(A ∩ B) = Pr(A) · Pr(B), Pr(B|A) = Pr(B), and Pr(A|B) = Pr(A).
Definition. If A and B are any events in a sample space S, then

Pr(A ∩ B) = Pr(A) · Pr(B|A) .
Note the difference between the preceding two definitions, which specify probabilities for independent versus general events.
Remark. The concept of probability assumes that all possible outcomes can be specified within a given sample space. In finite discrete systems such as cryptosystems, this constraint of a priori omniscience is usually not problematic, since all outcomes of a finite process can be enumerated in finite time, given finite resources. However, in continuous physical models, the definition of probability is somewhat tenuous since the outcomes of nontrivial infinite processes that occur in an infinite system generally cannot be determined in finite time with finite resources. In this class, practical applications of probability theory emphasize finite, discrete processes and their corresponding distributions. Thus, we address only infrequently those issues pertaining to continuous versus discrete systems.
Definition. A random variable X on a sample space S assigns a numerical value X(s) to each outcome s ∈ S. For a constant c, the probability that X equals c is

Pr(X=c) = Pr({s ∈ S : X(s) = c}) .
Symmetric statements hold for inequality relations such as less-than (<) and greater-than (>).

Definition. If X is a random variable on a sample space S with probability measure Pr, then a probability distribution f associated with X is defined as:
f(x) = Pr(X = x) .
Remark. In this class, we restrict the preceding definition to describe a probability distribution on a finite discrete set F as a map d: F -> [0,1].
Definition. A trivial process is one that has the same outcome for all inputs.
Example. Multiplication by zero is a trivial process over the real numbers, since every input yields the same outcome (zero).
Definition. A uniformly random process has a probability distribution that is a constant image.
Observation. Given a finite, discrete process a ∈ F^X where a has equiprobable values, the distribution Pr(a) can be expressed as

Pr(f) = 1/|F| , ∀ f ∈ F .
In practice, we can write the preceding statement more generally as Pr(a) = 1/|range(a)|.

Example. If a = (A,B,A,C,C,B), then F = {A,B,C} and Pr(f) = 1/|F| = 1/3 for each f ∈ F.
Remark. The preceding expression has a dual information-theoretic representation that defines the entropy (disorder) of a random process. For example, if a sample space S contains 2^m equiprobable elements, then each element of S carries m bits of entropy. That is, if d is a uniformly random process over an alphabet S, then no encoding of the symbols produced by d can use fewer than m bits per symbol, on average.
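For concreteness, the Shannon entropy of the distribution derived from a message histogram can be computed as follows (a minimal sketch; the function name is illustrative):

    from collections import Counter
    from math import log2

    def entropy_bits(a):
        """Shannon entropy, in bits per symbol, of the empirical distribution of a."""
        h = Counter(a)
        n = len(a)
        return -sum((c / n) * log2(c / n) for c in h.values())

    # A uniformly distributed 8-symbol alphabet yields 3 bits per symbol;
    # the skewed message "ABRACADABRA" yields roughly 2.04 bits per symbol.
    print(entropy_bits("ABCDEFGH"), entropy_bits("ABRACADABRA"))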
While it may seem that the preceding statement contradicts the observation that the information-theoretic content of a given distribution can be expressed in terms of one bit, there exists no such contradiction in practice. Since we will shortly provide a proof sketch to the effect that random distributions cannot occur over finite discrete sets, this is (in practice) a moot point. However, such arguments serve to highlight two issues:
Digression. Since uniformly random entities or distributions have zero information per symbol at the asymptotic limit, there have been posed several questions regarding the physical and philosophical basis for the existence of processes in the natural world that have equiprobable outcomes. We discuss this issue briefly, since it is germane to several practical topics in key generation and cryptanalysis. For example, assume (for purposes of argument) that a purpose of physical existence is the achievement of non-randomness. By this definition, a uniformly random distribution could not exist physically. Similarly, if the purpose of existence is to convey information, then uniform randomness would convey zero information at the aforementioned asymptotic limit. As a result, one can say that the aforementioned precondition for existence has been violated. Since the purpose of existence is not known in the philosophical sense, these arguments may properly be viewed as moot.
An alternative conjecture derives from physics, where the Second Law of Thermodynamics (SLT) states that global entropy (a measure of disorder) can only increase with time. For example, if the Universe were completely randomized, then its entropy would be maximized and could not increase further. This could, in principle, violate the SLT. Likewise, the SLT implies that the entropy of a local reference frame that is isolated from the rest of the Universe can only increase. (By "isolated" we mean that no energy is transferred into or out of the reference frame.) By the preceding argument, completely random, isolated local processes could (in principle) violate the SLT.
The preceding conjecture is interesting but trivial, since (a) the Universe is not everywhere random at all spatial scales and (b) we do not know how to produce random, isolated processes. At a practical level, however, it follows that questions should arise concerning the security of "random" key sequences generated using methods such as radioactive decay, quantum fluctuations in photodetectors, etc. We shall address such questions later in this section. For now, it suffices to state the following theorem.
Theorem. No nontrivial finite, discrete process can have equiprobable outcomes (i.e., have a sample space whose elements are uniformly distributed).
Remark. The formal proof could be assigned on an exam or as homework.
Question. Why study random distributions, especially if they do not exist in finite, discrete systems?
Remark. Interestingly enough, the computation of uniformly random distributions remains one of the key goals of number theory and cryptology. The use of randomness to obscure properties of a message that are useful to a cryptanalyst (i.e., an adversary) is a key problem in existing cryptographic theory and practice. Therefore, we emphasize discussion of various techniques for computing "uniformly random" distributions from nonrandom processes or data.
Later in this class we will consider the problem of semantic ciphers or steganography, in which properties of plaintext are disguised in the ciphertext so as to appear innocuous. In the absence of techniques for rigorously producing random processes, such camouflage methods may be a more operationally feasible approach to disguising plaintext than attempting to simulate randomness.
Observation. One needs random bits (or values) for several cryptographic purposes, of which the two most common are (a) generation of cryptographic keys (or passwords) and (b) concealment of values in certain protocols. Several definitions of randomness are employed in cryptology. However, there is the following basic, implementational criterion for a random source.
Assumption. Let an adversary have (i) full knowledge of an encryption site's software and hardware, (ii) the money to build and run a matching cryptosystem for exhaustive attack, and (iii) the ability to compromise the site's physical facilities (e.g., wiretapping, planting bugs, etc.). This capable adversary must not be able to predict the next bit the site's "random" generator produces, even if he knows all the bits produced at the site thus far.
Observation. Random bits are typically obtained via the following process:
Additional methods for producing random bits that yield pseudorandom output include:
The following methods are often used as bit sources, but have serious flaws since they are observable, predictable, or subject to influence by a determined adversary, especially on multiprocess machines:
The following are nearly worthless bit sources that are frequently recommended or used for purposes of convenience:
Equally or entirely worthless are the following bit sources:
Unfortunately, the bits gathered from the foregoing methods are not necessarily independent. That is, one might be able to predict a bit value with probability greater than 1/2, given all other bits. The adversary might even know entire subsequences of a bitstream, which he could obtain by eavesdropping. The key constraint is that the gathered bits contain information (entropy) that is everywhere unavailable to the adversary.
Provided that the hash function meets the required criteria and cannot be guessed by the adversary, the output of Step 3 is a set of independent, "unguessable" bits. These can be used with confidence wherever random bits are required.
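A minimal sketch of this distillation step, assuming bits have already been gathered from one or more sources; SHA-256 is used here purely as an example of a cryptographic hash, and the pool handling shown is illustrative rather than a vetted generator:

    import hashlib

    def distill(entropy_pool: bytes, n_bytes: int = 32) -> bytes:
        """Hash a pool of gathered (possibly biased) bits into output bits.

        The output is only as unpredictable as the entropy actually present
        in the pool and unknown to the adversary.
        """
        out = b""
        counter = 0
        while len(out) < n_bytes:
            out += hashlib.sha256(counter.to_bytes(4, "big") + entropy_pool).digest()
            counter += 1
        return out[:n_bytes]

    # Illustrative only: real pools must come from sources the adversary cannot observe.
    pool = b"timings:1023,988,1201;disk-turbulence:0x3f2a;audio-noise:..."
    print(distill(pool).hex())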
Remark. When implementing the foregoing method of random number generation, one preserves security by:
The Gaussian (normal) distribution with mean µ and standard deviation σ is given by

Pr(X=x) = e^(-(x - µ)²/(2σ²)) / (σ·(2π)^(1/2)) .
Algorithm. The Gaussian distribution can be computed as follows:

normal(x, mean, stdev):
    { pi := 3.14159
      den1 := 2*stdev^2
      den2 := stdev*sqrt(2*pi)
      coef := -(x - mean)^2/den1
      prob := exp(coef)/den2
      return(prob) } ,

where stdev denotes standard deviation.
The Poisson distribution with mean µ is given by

Pr(X=x) = e^(-µ) · µ^x / x! ,

where x = 0, 1, 2, ... .
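A brief Python check of the Poisson formula (illustrative only):

    from math import exp, factorial

    def poisson_pmf(x: int, mu: float) -> float:
        """Pr(X = x) for a Poisson distribution with mean mu."""
        return exp(-mu) * mu ** x / factorial(x)

    # Probabilities for x = 0..5 with mean 2; they sum to about 0.983.
    print([round(poisson_pmf(x, 2.0), 4) for x in range(6)])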
The Gamma distribution with parameters a and b is given by

Pr(X=x) = x^a · e^(-x/b) / (Γ(a+1) · b^(a+1)) ,

where x > 0.
The Lorentzian (Cauchy) distribution with half-width γ is given by

Pr(X=x) = γ / (π·(x² + γ²)) .
We next present two commonly-used sampling distributions, whose applications are discussed in Section A-3.
Given n independent random variables X_1, X_2, ..., X_n, each normally distributed with zero mean and unit variance, the statistic

χ² = Σ_{i=1}^{n} (X_i)²

follows the chi-squared distribution with n degrees of freedom, whose density is

f(χ²) = (χ²)^((n-2)/2) · e^(-χ²/2) / (2^(n/2) · Γ(n/2)) ,

where 0 < χ² < ∞ and the mean and standard deviation are given by µ = n and σ = (2n)^(1/2), respectively.
If X is a normally distributed random variable with zero mean and unit variance, and Y² follows a chi-squared distribution with n degrees of freedom, then the statistic

t = X · n^(1/2) / Y (III)

follows Student's t distribution with n degrees of freedom, whose density is

f(t) = Γ((n+1)/2) / ( (nπ)^(1/2) · Γ(n/2) · (1 + t²/n)^((n+1)/2) ) ,

where -∞ < t < ∞ and the mean and standard deviation are given by µ = 0 and σ = (n/(n-2))^(1/2), respectively.
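The two sampling densities above can be evaluated with the standard library's gamma function; the Python sketch below is illustrative, and the quoted standard deviation of the t distribution assumes n > 2:

    from math import exp, gamma, pi, sqrt

    def chi2_pdf(x: float, n: int) -> float:
        """Density of the chi-squared distribution with n degrees of freedom (x > 0)."""
        return x ** ((n - 2) / 2) * exp(-x / 2) / (2 ** (n / 2) * gamma(n / 2))

    def t_pdf(t: float, n: int) -> float:
        """Density of Student's t distribution with n degrees of freedom."""
        return gamma((n + 1) / 2) / (sqrt(n * pi) * gamma(n / 2) * (1 + t * t / n) ** ((n + 1) / 2))

    # Both densities integrate to (approximately) 1; a crude grid check:
    step = 0.01
    print(sum(chi2_pdf(i * step + step, 5) * step for i in range(5000)))
    print(sum(t_pdf(i * step - 25, 10) * step for i in range(5000)))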
We shall next discuss uses of sampling distributions.
It is occasionally useful to test whether or not there is a significant difference between two sample means. Also, we often need to test whether two variables in a given bivariate sample are associated with each other. For example, one might want to test whether the frequencies of occurrence of digrams in plaintext and ciphertext corpora are associated with each other. In this section, we examine several simple but useful tests, such as the t-test, the chi-squared test for dependence, the Phi test for association, Cramer's V-test, and Pearson correlation.
Figure A-2. (a) One-tailed distribution; (b) two-tailed distribution.
Concept. The t-test requires that we know the population sizes, means, and standard deviations of two samples. We also must know the number of tails on the distribution that best fits the two samples, and the level of statistical significance at which we want to perform the t-test. Given a table of t-distribution values (reference Equation III, above), we compute the t-score, then look it up in the table for the given number of tails and the predetermined level of significance. If the entabulated t-score is less than the computed t-score, then the difference of means is significant at the prespecified level of significance.
Algorithm. Given random variables X and Y represented by samples A1 and A2, perform the following steps:

Step 1. Compute the sizes N1 and N2, the means µ1 and µ2, and the standard deviations σ1 and σ2 of A1 and A2, respectively.

Step 2. Compute the standard error of the difference as follows:

σ_d = (σ1²/N1 + σ2²/N2)^(1/2) .
Step 3. Compute the t-score as:

t = (µ1 - µ2) / σ_d .
Step 4. Given the level of significance 0 < k < 1 and T, the number of tails on the distribution that best portrays A1 and A2, compute the effective population size

N' = N1 + N2 - 2 ,

and use (N', T, k) as table lookup constraints.

Step 5. If t > tmin, then the difference between µ1 and µ2 is significant at probability 1-k.
Example. Let samples A1 and A2 have sizes N1 = N2 = 14, means µ1 = 70 and µ2 = 60, and standard deviations σ1 = 6 and σ2 = 8, with T = 2 and k = 0.05. Then

σ_d = (6²/14 + 8²/14)^(1/2) = 2.67

and

t = (µ1 - µ2) / σ_d = (70-60)/2.67 = 3.74 .

Letting N' = N1 + N2 - 2 = 14 + 14 - 2 = 26, we consult a t-table with constraints (N', T, k) and find tmin = 2.056.
Since t = 3.74 > tmin = 2.056, the difference between µ1 and µ2 is significant at the 95-percent level.

Remark. If N1 + N2 < 30, then the t-distribution table must be employed. Otherwise, the two-tailed t-distribution approximates the normal (Gaussian) distribution, and the latter can be used to predict the minimum t-score. We recommend that one use the t-table wherever possible, to achieve accuracy.
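A Python sketch of the algorithm and the worked example above; the critical value is hard-coded from a t-table, and the names are illustrative:

    from math import sqrt

    def t_score(mu1, sigma1, n1, mu2, sigma2, n2):
        """Standard error of the difference and t-score for two sample means."""
        sigma_d = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
        return (mu1 - mu2) / sigma_d, sigma_d

    # Worked example: N1 = N2 = 14, means 70 and 60, standard deviations 6 and 8.
    t, sigma_d = t_score(70, 6, 14, 60, 8, 14)
    t_min = 2.056     # from a t-table: N' = 26 degrees of freedom, two tails, k = 0.05
    print(round(sigma_d, 2), round(t, 2), t > t_min)   # 2.67  3.74  True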
Observation. In cryptologic practice, the t-test is useful for comparing the distributions of results between two experiments. For example, one might correlate the histograms of various messages in a corpus C of ciphertext with two histograms that characterize the distribution of characters in known plaintext corpora D and E. That is, one asks if there is a significant difference between the distribution of correlation coefficients (results) from the correlation f[h(m),h(D)], m ∈ C, versus the correlation f[h(m),h(E)]. One can apply the t-test to the respective histograms of the correlation coefficients to determine whether or not the histogram means are significantly different. Although not a definitive test, such methods are useful for introductory cryptanalysis of data that have a near-normal distribution.
Mathematical Description. Let a bivariate table consist of N cells. For purposes of simplicity, we use a 2 × 2-cell bivariate histogram h(x,y), shown in Figure A-3. It follows that N = 2(2) = 4 cells.
Figure A-3. Chi-squared test: 2 × 2 bivariate histogram h.
The chi-squared score for observed frequencies f_o and expected frequencies f_e is given by

χ² = Σ_x Σ_y (f_o(x,y) - f_e(x,y))² / f_e(x,y) . (IV)
Given the degrees of freedom and the number of tails in the best-fit distribution (as discussed in Section A-3.1), one finds the χ² score S for a given level of significance 0 < k < 1 in a χ² table or from the χ² distribution (reference Section A-2). If the computed score satisfies χ² < S, then there is no statistically significant difference between the two entabulated variables (shown in Figure A-3).

Algorithm. Let X and Y denote finite, discrete random variables whose values are respectively indexed by the sets U and V. Let the histogram h : U × V -> N have N = |U| · |V| cells. The independence of X and Y is tested as follows:
Step 1. Compute the row and column sums of h:

r_u = Σ_{v∈V} h(u,v) , u ∈ U , and

c_v = Σ_{u∈U} h(u,v) , v ∈ V .

Step 2. Compute the expected frequencies:

f_e(u,v) = (r_u · c_v) / Σh .
Step 3. Apply Equation (IV) to yield the chi-squared score, as follows:

χ² = Σ_{u∈U} Σ_{v∈V} (h(u,v) - f_e(u,v))² / f_e(u,v) .
Step 4. Compute the degrees of freedom

d = (|U| - 1) · (|V| - 1)

and determine the number of tails T = 1 or T = 2.
Step 5. Given a prespecified confidence level 0 < k < 1, set k' = k/|T - 3|. Thus, k' = k/2 for a one-tailed distribution. Given a χ² table or distribution, if χ²(d,T,k') > χ²(h), then the random variables X and Y are not significantly different in the sample portrayed by h. Otherwise, the variables are significantly different.
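A minimal Python sketch of Steps 1 through 3 for a bivariate histogram stored as a nested dictionary; the critical-value lookup of Steps 4 and 5 is left to a χ² table, and both the data and the names are illustrative:

    def chi_squared_score(h):
        """Chi-squared score for a bivariate histogram h[u][v] of observed counts."""
        rows = list(h.keys())
        cols = list(next(iter(h.values())).keys())
        r = {u: sum(h[u][v] for v in cols) for u in rows}      # row sums r_u
        c = {v: sum(h[u][v] for u in rows) for v in cols}      # column sums c_v
        total = sum(r.values())                                # sum of all cells
        score = 0.0
        for u in rows:
            for v in cols:
                fe = r[u] * c[v] / total                       # expected frequency
                score += (h[u][v] - fe) ** 2 / fe
        return score, (len(rows) - 1) * (len(cols) - 1)        # score, degrees of freedom

    # Hypothetical 2 x 2 table of digram counts in two corpora.
    h = {"plaintext": {"TH": 40, "HT": 10}, "ciphertext": {"TH": 25, "HT": 25}}
    print(chi_squared_score(h))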
Example. %%%TBD%%%
Remark. There are four assumptions or constraints associated with the use of χ² testing, as follows:
χ² = Σ_{u∈U} Σ_{v∈V} (|h(u,v) - f_e(u,v)| - 0.5)² / f_e(u,v) ,

which is called the Yates correction.

The Phi (φ) measure of association is given by

φ = (χ² / N)^(1/2) ,

which assumes values in the range [-1,1]. That is, φ = 0 indicates no association between the variables, while |φ| = 1 indicates perfect association.
Remark. Unfortunately, Phi can yield a result that exceeds unity when the input table exceeds four cells (2 × 2 configuration). In such cases, we recommend the following modification to Phi, called Cramer's V-test.
V = ( χ² / (Σh · min(M-1, N-1)) )^(1/2) ,

where the bivariate histogram has M rows and N columns, and V assumes values in the interval [0,1] such that V = 0 indicates no association and V = 1 indicates perfect association.
Definition. Given a, b ∈ R^X, the correlation of a with b is given by

r = ( |X| · Σ(a·b) - Σa · Σb ) / ( ( |X| · Σa² - (Σa)² )^(1/2) · ( |X| · Σb² - (Σb)² )^(1/2) ) , (VI)

where each sum is taken over the domain X.
Note that r is also called the Pearson-r, Pearson correlation coefficient, or Pearson-r coefficient.
Example. Let a and b be as shown in Table A-1, which illustrates the first three steps involved in computing r. Step 4 involves taking the sums of the columns in Table A-1.
         Step 1        Step 2       Step 3
  i      a      b       a·b         a²      b²
 ---   ----   ----     ------     -----   -----
  1      12      9       108        144      81
  2       4      8        32         16      64
  3       3     10        30          9     100
  4       6      1         6         36       1
  5       7      4        28         49      16
  6       1      6         6          1      36
 ---   ----   ----     ------     -----   -----
       Σa=33  Σb=38    Σab=210    Σa²=255  Σb²=298
Table A-1. Initial steps in computing r(a,b).
Applying Equation (VI) to the result of Step 4, we obtain

r = (6 · 210 - 33 · 38) / ( (6 · 255 - 33²)^(1/2) · (6 · 298 - 38²)^(1/2) ) = 6 / 389.5 = 0.015 .
Remark. The magnitude of r, not the sign, indicates the strength of association between a and b. Additionally, the value r² denotes the fraction of variance shared by a and b. For example, if r = -0.385, then r² = 0.148, which indicates that 14.8 percent of the variance in a is accounted for in b, and vice versa.
Observation. It is possible to approximate the t-score associated with a given Pearson correlation coefficient r, as follows:

t = r · ( (N-2) / (1 - r²) )^(1/2) .
In the previous example, N = |X|, the number of data in a and b. Given that PPMC constitutes a one-tailed test, and given a prespecified level of significance 0 < k < 1, we can look up tmin(N-2, k, T=1) in a table of t-scores, where N-2 denotes the number of degrees of freedom. If t < tmin, then the difference between a and b is not statistically significant.

Remark. The Pearson-r coefficient is accurate for linear correlation only. If the data have significant nonlinearities, then other methods of comparison (e.g., least-squares regression, analysis of variance, or template matching) should be employed.
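A quick Python check of the computation of r and its approximate t-score for the data of Table A-1 (the function name is illustrative):

    from math import sqrt

    def pearson_r(a, b):
        """Pearson product-moment correlation coefficient of equal-length sequences."""
        n = len(a)
        sa, sb = sum(a), sum(b)
        sab = sum(x * y for x, y in zip(a, b))
        saa, sbb = sum(x * x for x in a), sum(y * y for y in b)
        return (n * sab - sa * sb) / (sqrt(n * saa - sa ** 2) * sqrt(n * sbb - sb ** 2))

    a = [12, 4, 3, 6, 7, 1]
    b = [9, 8, 10, 1, 4, 6]
    r = pearson_r(a, b)
    t = r * sqrt((len(a) - 2) / (1 - r ** 2))   # approximate t-score for the correlation
    print(round(r, 3), round(t, 3))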
For example, if c denotes the cumulative histogram of a, normalized to the unit interval [0,1], then uniform histogram modification is given by

e(f) = (em - e1) · c(f) + e1 ,

where e1 and em denote the minimum and maximum output values, respectively.
Additional modification functions are cited in [Pra78].
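A rough Python sketch of uniform histogram modification for an image or message stored as a flat list; the normalization of the cumulative histogram to [0,1], and the default output extremes e1 and em, are assumptions of this sketch:

    from collections import Counter

    def uniform_histogram_modification(a, e1=0.0, em=255.0):
        """Map each value of a through its normalized cumulative histogram."""
        h = Counter(a)
        n = len(a)
        c, cum = {}, 0
        for f in sorted(h):
            cum += h[f]
            c[f] = cum / n                      # cumulative histogram normalized to [0,1]
        return [(em - e1) * c[f] + e1 for f in a]

    # Values crowded near the low end of [0,255] spread out over the full range.
    print(uniform_histogram_modification([0, 1, 1, 2, 2, 2, 3, 200]))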
We next discuss properties of the normal distribution.
The standardized normal density is given by

f(x) = (2π)^(-1/2) · e^(-x²/2) , (VII)

and the cumulative distribution is given by

c(x) = (2π)^(-1/2) · ∫_{-∞}^{x} e^(-s²/2) ds .
Its first derivative is

f'(x) = -x · (2π)^(-1/2) · e^(-x²/2) = -x · f(x) ,

and its second derivative is

f''(x) = (x² - 1) · (2π)^(-1/2) · e^(-x²/2) = (x² - 1) · f(x) .
This concludes our discussion of basic statistics for this class. Additional statistical concepts will be defined as they are introduced in theory development.
[Dav94] Davis, D., R. Ihaka, and P. Fenstermacher. "Cryptographic randomness from air turbulence in disk drives", in Yvo G. Desmedt, Ed., Proceedings CRYPTO 94, Springer Lecture Notes in Computer Science 839:114-120 (1994).
[Let94] Lettington, A.H. and Q.H. Hong. "An interpolator for infrared images", Optical Engineering 33:725-729 (1994).
[Mau91] Maurer, U.M. "A universal statistical test for random bit generators", In A.J. Menezes and S. A. Vanstone, Eds., Proceedings CRYPTO 90, Springer Lecture Notes in Computer Science 537:409-420 (1991).
[Pra78] Pratt, W. Digital Image Processing, New York: John Wiley (1978).
Copyright © 1996 by Mark S. Schmalz, All Rights Reserved