# Birthday Paradox

Created:
2023-03-29

Updated:
2023-12-15

The birthday paradox is a surprising statistical phenomenon that shows how even in a small group, it’s very likely that two people share the same birthday.

In this blog post, we’ll explore the math behind the paradox and its practical applications in computer science and cryptography.

## The Paradox

As a flagship example, it states that in a group of just 23 people, there is a 50% change that two of them will share the same birthday.

This probability, which seems quite high, may come as a surprise to many, however it can be easily explained by the principles of probability.

The paradox is a classical example of how probability can lead to seemingly counterintuitive conclusions and is a great illustration of how probability can be used in everyday life.

### Probability

The probability of the Birthday Paradox is computed by considering the number
of possible pairs of people in a group and the probability that any of these
people will have **different** birthdays.

To calculate the probability that any two people in a group of `n`

will have
different birthdays, we can use the formula:

```
p(different birthdays) = 365/365 ⋅ 364/365 ⋅ ... ⋅ (365-n+1)/365
= 365! / [(365-n)!⋅365ⁿ]
```

To find the probability that at least two people in the group will have the same birthday, we can take 1 minus the probability that they all have different birthdays

```
p(same birthday) = 1 - p(different birthdays)
```

Some results:

n | p(n) |
---|---|

10 | 0.117 |

20 | 0.411 |

23 | 0.507 |

40 | 0.891 |

70 | 0.999 |

## Generalized Problem

Given a year of `d`

days we want to determine the smaller number `n`

such that
the probability of a birthday coincidence is at least `50%`

.

In other words n is the minimal integer such that:

```
1 - (1-1/d)⋅(1-2/d)..(1-(n-1)/d) ≥ 1/2
```

Which is equivalent to:

```
(1-1/d)⋅(1-2/d)..(1-(n-1)/d) ≤ 1/2
```

As an approximation, `∏ (1 - i/d) ≈ 1/2`

for:

```
n ≈ ⌈√(2·d·ln(2))⌉
```

This approximation is derived using the *Taylor* expansion of the exponential
function and is valid for large `d`

.

*Proof*

```
∏ (1 - i/d) ≈ exp[ ln( ∏ (1 - i/d) ) ] = exp[ ∑ ln(1 - i/d) ]
```

Taylor expansion for `ln(1 - x)`

around `x = 0`

which is `ln(1-x) ≈ -x`

.

```
∑ ln(1- i/d) = -∑ i/d = -n·(n-1)/(2·d)
→ ∏ (1 - i/d) ≈ exp[ -n·(n-1)/(2·d) ]
```

Equate to `1/2`

which is the target value

```
exp[ -n·(n-1)/(2·d) ] ≈ 1/2
→ -n·(n-1)/(2·d) ≈ ln(1/2) = -ln(2)
→ n²-n ≈ 2·d·ln(2)
```

For large enough `n`

we can omit the subtraction

```
→ n² ≈ 2·d·ln(2)
→ n ≈ √(2·d·ln(2))
```

∎

Note that the classical birthday problem is an instance of this problem with
`d = 365`

and gives as a result `n = 23`

.

### Collision Probability

Given `n`

items drawn from a set of `d`

elements, we look for the probability
`p(n)`

that at least two numbers are equal.

The generic formula is derived using the same argument given in the previous section:

```
p(n) = 1 - (1-1/d)⋅(1-2/d)..(1-(n-1)/d)
```

Conversely, `n(p)`

denotes the number of items drawn from a set of `d`

elements
to obtain a probability `p`

that at least two numbers are the same, then:

```
n(p) ≈ √(2·d·ln(1/(1-p)))
```

The proof of this formula is similar to the one given in the previous section
for `p = 1/2`

.

When applied to hash functions this is the expected number of `N`

-bit hashes
that can be generated before getting a collision with probability `p`

.

Surprisingly, for `p = 1/2`

this is not `O(2ᴺ)`

, but rather only `O(√(2ᴺ))`

.

There are a number of attacks to crypto systems which leverages this fact.

## Attack

The birthday paradox can be particularly insidious if the hash is used as a primitive building block for a more complex scheme which heavily relies on the collision resistance property of some hash function.

For example, if an attacker discovers two messages `m₁`

and `m₂`

such that
`H(m₁) = H(m₂)`

, he can submit `m₁`

to the victim in order to have it signed,
thus obtaining a signature for `H(m₁)`

, but this is a valid signature for
`H(m₂)`

as well.

If the number of possible outputs of `H`

is `2ᴺ`

, a technique to find a
collision is:

- Construct
`k = 2^(N/2)`

variants of a legit message`m₁`

. - Use the same technique to construct
`k`

variants of a malicious message`m₂`

. - Construct two sets of digests:
`A = {H(m₁ᵢ)}`

,`B = {H(m₂ᵢ)}`

. - Search for any item in
`A ∩ B`

.

The variants can be constructed with a technique which generates messages with changes which are not detectable from a typical renderer. For example insert the same number of spaces and backspaces after a word.

By inserting `100`

spaces around the message the attacker can construct
`2^100`

different variants.

The probability of a collision obviously depends on `k`

.

Can be proven that

```
Pr(A ∩ B ≠ ∅) ≥ 1/2 if k ≥ 2^(N/2)
```

The number of the elements in each set corresponds to the expected number
of elements to withdraw from a set of `2ᴺ`

before observing a collision with
probability `1/2`

.

As `|A ∪ B| ≥ 2^(N/2)`

then the probability of finding a collision is still `≈ 1/2`

(**it doesn’t double**). Thus, there is a good chance (`≥ 1/2`

) to find a
collision for two elements which belong to different sets.

### Mitigations

To reduce birthday attack risk we have to choose a value for `N`

that is
sufficiently large.

Hashes like MD5 (`N=128`

) or sha1 (`N=160`

) are not considered secure anymore.

Another way to counter this attack is to use a MAC (a keyed hash) which bounds
the output also to a secret key `k`

.

In this way an attacker cannot pre-compute offline a table of collisions, as they do not know the secret key used in the particular context.

As a final note, always keep in mind that for **pigeon-hole principle** if the
hash possible outputs are `k`

, then after `k+1`

extractions we’re going to have
a collision with probability `1`

.

## References

- Simple PoC in Rust here