Birthday Paradox
Created: 2023-03-29
Updated: 2023-12-15
The birthday paradox is a surprising statistical phenomenon which shows that, even in a small group, it is more likely than intuition suggests that two people share the same birthday.
In this blog post, we’ll explore the math behind the paradox and its practical applications in computer science and cryptography.
The Paradox
As a flagship example, it states that in a group of just 23 people there is a greater than 50% chance that two of them share the same birthday.
This probability may seem surprisingly high, but it is easily explained by the basic principles of probability.
The paradox is a classical example of how probability can lead to seemingly counterintuitive conclusions and is a great illustration of how probability can be used in everyday life.
Probability
The probability behind the birthday paradox is computed by first considering the probability that all the people in a group have different birthdays.
To calculate the probability that all the people in a group of n have different birthdays, we can use the formula:

p(different birthdays) = 365/365 ⋅ 364/365 ⋅ ... ⋅ (365-n+1)/365
                       = 365! / [(365-n)!⋅365ⁿ]
To find the probability that at least two people in the group have the same birthday, we take 1 minus the probability that they all have different birthdays:

p(same birthday) = 1 - p(different birthdays)
Some results:
| n  | p(n)  |
|----|-------|
| 10 | 0.117 |
| 20 | 0.411 |
| 23 | 0.507 |
| 40 | 0.891 |
| 70 | 0.999 |
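
To double check these numbers, here is a minimal Rust sketch (function names are just illustrative) that evaluates the product formula directly and reproduces the table above:

```rust
// Exact probability that at least two people in a group of `n` share a
// birthday, assuming `d` equally likely birthdays (d = 365 here).
fn p_same_birthday(n: u32, d: u32) -> f64 {
    let mut p_diff = 1.0_f64;
    for i in 0..n {
        p_diff *= (d - i) as f64 / d as f64;
    }
    1.0 - p_diff
}

fn main() {
    for n in [10, 20, 23, 40, 70] {
        println!("n = {:2}  p(n) = {:.3}", n, p_same_birthday(n, 365));
    }
}
```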
Generalized Problem
Given a year of d days, we want to determine the smallest number n such that the probability of a birthday coincidence is at least 50%.
In other words, n is the minimal integer such that:
1 - (1-1/d)⋅(1-2/d)⋅...⋅(1-(n-1)/d) ≥ 1/2

This is equivalent to:

(1-1/d)⋅(1-2/d)⋅...⋅(1-(n-1)/d) ≤ 1/2
As an approximation, ∏ (1 - i/d) ≈ 1/2 for:

n ≈ ⌈√(2·d·ln(2))⌉

This approximation is derived using the Taylor expansion of the exponential function and is valid for large d.
Proof
∏ (1 - i/d) = exp[ ln( ∏ (1 - i/d) ) ] = exp[ ∑ ln(1 - i/d) ]

where the product and the sum run over i = 1, ..., n-1.
Using the first-order Taylor expansion of ln(1 - x) around x = 0, which gives ln(1 - x) ≈ -x:

∑ ln(1 - i/d) ≈ -∑ i/d = -n·(n-1)/(2·d)
→ ∏ (1 - i/d) ≈ exp[ -n·(n-1)/(2·d) ]
Equating to the target value 1/2:

exp[ -n·(n-1)/(2·d) ] ≈ 1/2
→ -n·(n-1)/(2·d) ≈ ln(1/2) = -ln(2)
→ n² - n ≈ 2·d·ln(2)
For large enough n the linear term is negligible compared to n², so we can drop it:

n² ≈ 2·d·ln(2)
→ n ≈ √(2·d·ln(2))
∎
Note that the classical birthday problem is an instance of this problem with d = 365, which yields n = 23.
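
Below is a small Rust sketch, with illustrative function names, that compares the approximation with the exact threshold obtained by accumulating the product until it drops to 1/2 or below:

```rust
// Approximation n ≈ ⌈√(2·d·ln 2)⌉ for the 50% threshold.
fn approx_n(d: f64) -> u64 {
    (2.0 * d * 2.0_f64.ln()).sqrt().ceil() as u64
}

// Exact threshold: smallest n such that the probability that all n samples
// are distinct drops to 1/2 or below.
fn exact_n(d: u64) -> u64 {
    let mut p_diff = 1.0_f64;
    let mut n = 0u64;
    while p_diff > 0.5 {
        p_diff *= (d - n) as f64 / d as f64;
        n += 1;
    }
    n
}

fn main() {
    for d in [365u64, 1_000, 1_000_000] {
        println!("d = {:9}  exact = {:5}  approx = {:5}", d, exact_n(d), approx_n(d as f64));
    }
}
```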
Collision Probability
Given n items drawn from a set of d elements, we look for the probability p(n) that at least two of them are equal.
The generic formula is derived using the same argument given in the previous section:

p(n) = 1 - (1-1/d)⋅(1-2/d)⋅...⋅(1-(n-1)/d)
Conversely, if n(p) denotes the number of items drawn from a set of d elements to obtain a probability p that at least two of them are the same, then:

n(p) ≈ √(2·d·ln(1/(1-p)))

The proof of this formula is similar to the one given in the previous section for p = 1/2.
When applied to hash functions, this is the expected number of N-bit hashes that can be generated before getting a collision with probability p.
Surprisingly, for p = 1/2 this is not O(2ᴺ), but rather only O(√(2ᴺ)) = O(2^(N/2)).
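
A short Rust sketch of this estimate, assuming a digest space of d = 2ᴺ equally likely values; for p = 1/2 the result is roughly 2^(N/2), as stated above:

```rust
// Approximate number of N-bit digests to draw before a collision appears
// with probability p: n(p) ≈ √(2·d·ln(1/(1-p))) with d = 2^N.
fn n_for_probability(bits: u32, p: f64) -> f64 {
    let d = 2.0_f64.powi(bits as i32);
    (2.0 * d * (1.0 / (1.0 - p)).ln()).sqrt()
}

fn main() {
    for bits in [32u32, 64, 128] {
        let n = n_for_probability(bits, 0.5);
        println!("N = {:3}  n(1/2) ≈ 2^{:.1}", bits, n.log2());
    }
}
```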
There are a number of attacks on cryptographic systems which leverage this fact.
Attack
The birthday paradox can be particularly insidious if the hash is used as a primitive building block for a more complex scheme which heavily relies on its collision resistance.
For example, if an attacker discovers two messages m₁ and m₂ such that H(m₁) = H(m₂), they can submit m₁ to the victim in order to have it signed, thus obtaining a signature for H(m₁); but this is a valid signature for H(m₂) as well.
If the number of possible outputs of H is 2ᴺ, a technique to find a collision is:

- Construct k = 2^(N/2) variants of a legit message m₁.
- Use the same technique to construct k variants of a malicious message m₂.
- Construct two sets of digests: A = {H(m₁ᵢ)}, B = {H(m₂ᵢ)}.
- Search for any item in A ∩ B.
The variants can be constructed with a technique that introduces changes which are not detectable by a typical renderer, for example inserting the same number of spaces and backspaces after a word.
By choosing 100 spots in the message where such a change may or may not be applied, the attacker can construct 2^100 different variants.
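
As an illustration, here is a minimal Rust sketch of one possible variant-generation scheme, under the assumption that the invisible change is an extra space at a word boundary; the message and the number of positions are purely hypothetical:

```rust
// At each of `k` chosen positions we either leave the message untouched or
// insert an extra space, so a k-bit mask selects one of 2^k distinct byte
// strings that render almost identically.
fn variant(msg: &str, mask: u32, k: u32) -> String {
    let words: Vec<&str> = msg.split(' ').collect();
    let mut out = String::new();
    for (i, w) in words.iter().enumerate() {
        out.push_str(w);
        if i + 1 < words.len() {
            out.push(' ');
            if (i as u32) < k && (mask >> i) & 1 == 1 {
                out.push(' '); // extra space selected by bit i of the mask
            }
        }
    }
    out
}

fn main() {
    let msg = "pay one hundred dollars to alice";
    let k = 5; // this short message has 5 word boundaries -> 2^5 = 32 variants
    for mask in 0..(1u32 << k) {
        println!("{:?}", variant(msg, mask, k));
    }
}
```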
The probability of a collision obviously depends on k.
It can be proven that:

Pr(A ∩ B ≠ ∅) ≥ 1/2 if k ≥ 2^(N/2)

The number of elements in each set corresponds to the expected number of elements to draw from a set of 2ᴺ before observing a collision with probability 1/2.
As |A ∪ B| ≥ 2^(N/2), the probability of finding a collision is still ≈ 1/2 (it doesn't double). Thus, there is a good chance (≥ 1/2) of finding a collision between two elements which belong to different sets.
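
The following Rust sketch simulates the whole search with a deliberately tiny, non-cryptographic 24-bit digest (std's DefaultHasher truncated), so that the 2^(N/2) variants fit comfortably in memory; the variants are simply numbered strings rather than real renderer-invisible edits:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Toy N-bit "hash": the std DefaultHasher truncated to `bits` bits.
// It is NOT a cryptographic hash; it only illustrates the search procedure.
fn toy_hash(msg: &str, bits: u32) -> u64 {
    let mut h = DefaultHasher::new();
    msg.hash(&mut h);
    h.finish() & ((1u64 << bits) - 1)
}

fn main() {
    let bits = 24; // tiny digest so the demo runs instantly
    let k = 1u64 << (bits / 2); // 2^(N/2) variants per message

    // Set A: digests of the legit message variants, keyed by digest value.
    let mut a: HashMap<u64, String> = HashMap::new();
    for i in 0..k {
        let m1 = format!("legit message variant {i}");
        a.insert(toy_hash(&m1, bits), m1);
    }

    // Set B: variants of the malicious message; stop at the first digest
    // that also appears in A.
    for i in 0..k {
        let m2 = format!("malicious message variant {i}");
        if let Some(m1) = a.get(&toy_hash(&m2, bits)) {
            println!("collision: {m1:?} <-> {m2:?}");
            return;
        }
    }
    println!("no collision found with k = {k} variants: retry with a larger k");
}
```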
Mitigations
To reduce the risk of a birthday attack we have to choose a value for N that is sufficiently large.
Hashes like MD5 (N = 128) or SHA-1 (N = 160) are not considered secure anymore.
Another way to counter this attack is to use a MAC (a keyed hash), which binds the output also to a secret key k.
In this way an attacker cannot pre-compute an offline table of collisions, as they do not know the secret key used in the particular context.
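
As a toy illustration of the idea (not a real MAC: an actual deployment should use HMAC or a similar construction over a modern hash), the sketch below mixes a secret key into a non-cryptographic digest so that it cannot be computed, or tabulated, without the key:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy keyed hash: the digest depends on a secret key as well as on the
// message, so an attacker cannot precompute colliding digests offline.
fn toy_mac(key: &[u8], msg: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    msg.hash(&mut h);
    h.finish()
}

fn main() {
    let key = b"secret key known only to the signer";
    println!("{:016x}", toy_mac(key, b"pay one hundred dollars to alice"));
}
```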
As a final note, always keep in mind that, by the pigeonhole principle, if the hash has k possible outputs then after k+1 extractions we are going to have a collision with probability 1.
References
- Simple PoC in Rust here