Original post is here: eklausmeier.goip.de
I was asked by a colleague of mine how many combinations are possible for the German tax id. The German tax id is built entirely from eleven digits. Its specification is given in Pruefziffernberechnung, see Steueridentifikationsnummer in the German Wikipedia. Every person reported to live in Germany is assigned one tax id, even babies and children. 20 years after the death of a person the number can be recycled.
[more_WP_Tag] The rules for constructing a tax id are as follows:
- last digit is check digit, and can be ignored for the rest of this discussion
- first digit cannot be zero, but any digit from 1 to 9
- one digit must appear exactly two times
- the rest of the digits may only appear a single time
For example, 4895437120 and 5549267083 are valid tax ids, ignoring the 11th check digit. The first example tax id has 4 as repeating digit, the second has 5 as repeating digit. All other digits do not repeat.
Hint: As we have ten digits altogether, one of them being the repeating digit, so in effect we only have nine remaining digits to choose one of them to be the repeating digit. For example, 9x87054321, nine digits, excluding digit 6, we have nine digits to choose from, e.g., digit x=2. So, position 2 and position 9 now have both digit 2. For 98x7654321, we have only 8 possibilities, not nine, to choose our repeating digit, i.e., we can choose either the first digit (here 9), the fourth (here 7), the fifth (here 6), and so on. We cannot choose the second one (here 8), as this repeating digit was already covered in the case 9x87054321.
To deduce the number of possible combinations nine different cases have to be considered, depending on the repeating digit. First case:
- 1st digit: 9 possibilities for the first digit (1-9)
- 2nd digit: assume 2nd digit is our repeating digit, then we have 9 possibilities for this, see hint
- 3rd digit: 9 possibilities (ten digits, one was already used for the 1st digit)
- 4th digit: 8 possibilities
- 5th digit: 7 possibilities
- 6th digit: 6 possibilities
- 7th digit: 5 possibilities
- 8th digit: 4 possibilities
- 9th digit: 3 possibilities
- 10th digit: 2 possibilities
Add to this the second case:
- 1st digit: 9 possibilities for the first digit (1-9)
- 2nd digit: 9 possibilities (ten digits, one was already used for the 1st digit)
- 3rd digit: assume 3rd digit is our repeating digit, then we have 8 possibilities for this, see hint
- 4th digit: 8 possibilities
- 5th digit: 7 possibilities
- . . .
- 10th digit: 2 possibilities
and so on up to the ninth case:
- 1st digit: 9 possibilities for the first digit (1-9)
- 2nd digit: 9 possibilities (ten digits, one was already used for the 1st digit)
- . . .
- 9th digit: 2 possibilities
- 10th digit: assume 10th digit is our repeating digit, then we have 1 possibility for this, see hint
This is $$ \eqalign{ & 9*(9)98*\cdots2 + 99*(8)8\cdots2 + 998(7)\cdots2 + 9987\cdots*(1) \cr &= (998*\cdots2)(9+8+\cdots+2+1) \cr &= 146,966,400 \cr } $$ That number is strikingly low, compared to the roughly 82 million population in Germany, see German population.
The chance of getting a valid German tax id (without check digit) out of a complete random number (all digits with equal probability) is therefore $146,966,400 / 10^{10}$, i.e., 1.47%.
As the derivation of the number of combinations might be a little entangled, it is good to know, whether there is an alternative way to prove that the number above is indeed correct, or at least plausible. For this I have written a short simulation which calculates the probability that a number is indeed a German tax id, see taxidsim.c. The output looks something like this:
~/c/taxidsim -tn100000
...
*9 2 6 7 8 1 5 0 8 4 / 1110111121
*3 5 2 7 0 1 6 9 8 5 / 1111021111
*1 9 5 8 5 7 4 2 3 0 / 1111120111
*9 3 8 0 6 7 2 1 5 7 / 1111011211
Digit Occur. Perc.
0 100488 10.048800
1 100456 10.045600
2 100817 10.081700
3 100026 10.002600
4 99938 9.993800
5 99432 9.943200
6 99475 9.947500
7 99826 9.982600
8 99690 9.969000
9 99852 9.985200
tax ids: 1453 1.453000
Enlarging n on the commandline gives 1.470%, which is quite near to our theoretical value.
taxidsim.c contains a number of unrolled loops, see, e.g., Dongarra/Hinds (1979), thus helping Out-of-order execution. The trick to detect tax ids is, first to count the number of occurences of x[i], i=0(1)9 in r[i], i=0(1)9, and then to count occurences in r[] in rs[]. The test is then
x[0] != 0 && rs[0] == 1 && rs[1] == 8 && rs[2] == 1
i.e., first digit is non-zero, the number of no-occurrences is one, rs[0]=1, eight digits occur exactly one time, the double-occurence is 1, rs[2]=1.
Added 15-Jul-2018: Klaus Koch, via e-mail, pointed out the following method to get the number of combinations.
Five years later: Yes, I can confirm your results. But you can get it much easier: If you drop the restriction non-zero for the first digit, you get 90*10!/2=163.296.000.
(If you google for 163296000 you will get some hits.)
Multiply by 0.9 and you get your number.
Added 25-Jul-2018: Klaus Koch, via e-mail, added the following information:
Wie Sie vielleicht auch schon bemerkt haben, gibt es zwar viele, die über datenschutzrechliche Aspekte der Steuer-ID, aber offenbar nur äußerst wenige, die über die Anzahl möglicher Steuer-IDs nachdenken und sich dazu im Netz äußern. In der Tat teile ich Ihre Überraschung darüber, dass die Anzahl gültiger IDs so erstaunlich gering ist. ... in 2016 [wurde] beschlossen, künftig auch drei gleiche Ziffern zuzulassen, Wikipedia: https://de.wikipedia.org/wiki/Steuerliche_Identifikationsnummer.
Nachdem ich das gelesen hatte, mußte ich natürlich neu rechnen (lassen). Microsoft Excel hat die Binomialkoeffizienten, die jeder unter dem Namen n über k kennt, in der Funktion Kombinationen(n;k) versteckt, also =FAKULTÄT(10)*KOMBINATIONEN(10;A3)/FAKULTÄT(A3-1) wobei in A3 die Zahl gleicher Ziffern steht. Hier die ganze Tabelle:
gleiche Anzahl Ziffern Möglichkeiten 2 163296000 (ab 2007) 3 217728000 (ab 2016) 4 127008000 5 38102400 6 6350400 7 604800 8 32400 9 900 10 10
In der [ersten] Zeile werden Lösungen betrachtet, bei denen eine Ziffer gar nicht vorkommt, in der [zweiten] Zeile werden Lösungen betrachtet, bei denen zwei Ziffern gar nicht vorkommen, u.s.w. ... [Ich] habe meinen PC nachzählen lassen, er brauchte nur 10 Minuten für alle 10-stelligen Zahlen... Entwarnung: er kam zum gleichen Ergebnis. Wegen des Ausschlusses einer führenden Null müssen alle Resultate noch mit 0,9 multipliziert werden. Durch die Änderung in 2016 erhöht sich der Pool beträchtlich.
@Klaus Koch: Thanks for these thoughts and calculations!