Honestly I couldn't quite follow this question all the way through, but here's the idea I think you want.
Judging by your last program, you figured out, more or less, that you can express the number of valid UTF-8 byte sequences of length N as a recurrence relation. I actually think you got the numbers wrong, though; I believe it should be this:
- n(0) = 1, since there is only one UTF-8 byte sequence with zero bytes
- n(1) = 128, for obvious reasons
- n(2) = 18304:
  - 128 * 128 sequences containing two characters encoded in one byte each
  - 1920 (that's 2**11 - 128) sequences containing one character encoded in two bytes
- n(3) = 2650112:
  - 128 * 128 * 128 sequences of three characters
  - 128 * 1920 * 2 sequences of two characters (since you have a two-byte character, 1920 possibilities, and a one-byte character, 128 possibilities, in either of two orders)
  - 61440 (that's 2**16 - 1920 - 128, and then minus another 2**11 for the surrogates) sequences of one three-byte character
- n(4) = 383270912:
  - 128 * 128 * 128 * 128 sequences of four characters
  - 128 * 128 * 1920 * 3 sequences of three characters (one two-byte character and two one-byte characters, with the two-byte character occupying any of three positions)
  - 1920 * 1920 sequences of two two-byte characters
  - 128 * 61440 * 2 sequences of a one-byte and a three-byte character, in either order
  - 1048576 sequences of one four-byte character
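If you want to sanity-check the per-character counts used above (128, 1920, 61440 and 1048576), you can ask Python's own UTF-8 codec. This is only a rough sketch: it assumes that "valid" means "accepted by CPython's strict utf-8 codec", and both function names are made up for illustration.

```python
from collections import Counter
from itertools import product


def code_points_by_encoded_length() -> dict:
    """Count Unicode scalar values by how many bytes UTF-8 needs for them."""
    counts = Counter()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF-8
        counts[len(chr(cp).encode("utf-8"))] += 1
    return dict(counts)


def brute_force_count(length: int) -> int:
    """Count valid UTF-8 byte strings of the given length by trying every one.

    Only practical for very small lengths: there are 256**length candidates.
    """
    valid = 0
    for candidate in product(range(256), repeat=length):
        try:
            bytes(candidate).decode("utf-8")  # error handling is strict by default
        except UnicodeDecodeError:
            continue
        valid += 1
    return valid
```

Calling code_points_by_encoded_length() should give the four per-character counts keyed by byte length, and brute_force_count(2) should agree with n(2). Going past length 3 by brute force gets slow quickly, which is the reason for using a recurrence in the first place.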
(But my main point here is to demonstrate the technique, so if I'm wrong about these numbers, adjust accordingly.)

After that, the number of UTF-8 byte sequences of length N, if N > 4, is determined by this recurrence relation:
n(N) = 128 * n(N-1) + 1920 * n(N-2) + 61440 * n(N-3) + 1048576 * n(N - 4)
reflecting the fact that you can form a UTF-8 sequence of length N > 4 by appending a one-byte character to any sequence of length N - 1, or appending a two-byte character to any sequence of length N - 2, and so on for 3- or 4-byte characters.
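Here is a minimal iterative sketch of that recurrence. It relies on one observation of mine that the answer does not state: if you treat n(k) as 0 for negative k, the same recurrence also reproduces the listed values for N <= 4, so no hand-written base cases are needed beyond n(0) = 1. The names CHARS_BY_LENGTH and count_valid_sequences are made up for illustration.

```python
from collections import deque

# Number of characters that encode to 1, 2, 3 and 4 bytes respectively.
CHARS_BY_LENGTH = (128, 1920, 61440, 1048576)


def count_valid_sequences(length: int) -> int:
    """Return n(length), built bottom-up starting from n(0) = 1."""
    if length < 0:
        raise ValueError("length must be non-negative")
    # window holds n(k-1), n(k-2), n(k-3), n(k-4); values before n(0) count as 0
    window = deque([1, 0, 0, 0], maxlen=4)
    for _ in range(length):
        nxt = sum(c * n for c, n in zip(CHARS_BY_LENGTH, window))
        window.appendleft(nxt)  # maxlen=4 drops the oldest value automatically
    return window[0]
```

Because only the last four values are kept, this uses constant memory and has no recursion-depth limit, at the cost of recomputing from scratch on every call.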
This formula is very straightforward to implement iteratively or recursively, and you can get the desired proportion by dividing the result by 256**N (that is, 2**(8*N)), the total number of byte strings of length N. So a reasonably efficient way to compute this is to write a function that computes n(N) using iteration or recursion and then just divide. For example:

    import fractions
    import functools

    @functools.cache
    def numerator(N: int) -> int:
        # handle the base cases n(0)..n(4) first; the "..." stands for the
        # remaining three terms of the recurrence
        return 128 * numerator(N - 1) + ...

    def valid_UTF8_chance_expansion(length: int):
        return fractions.Fraction(numerator(length), 256**length)  # or convert to string if you like
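For reference, here is one way the elided parts of that example could be filled in. Treat it as a sketch under my own assumptions: the base cases are the n(0) to n(4) values listed earlier, the BASE_CASES name is made up, and the denominator is 256**length because I am assuming the proportion is taken over all byte strings of that length.

```python
import fractions
import functools

# n(0) through n(4), as listed earlier in this answer
BASE_CASES = (1, 128, 18304, 2650112, 383270912)


@functools.lru_cache(maxsize=None)  # same effect as functools.cache on Python 3.9+
def numerator(N: int) -> int:
    """Number of valid UTF-8 byte sequences of length N."""
    if N < 0:
        raise ValueError("N must be non-negative")
    if N < len(BASE_CASES):
        return BASE_CASES[N]
    return (128 * numerator(N - 1) + 1920 * numerator(N - 2)
            + 61440 * numerator(N - 3) + 1048576 * numerator(N - 4))


def valid_UTF8_chance_expansion(length: int) -> fractions.Fraction:
    """Proportion of all byte strings of this length that are valid UTF-8."""
    return fractions.Fraction(numerator(length), 256**length)
```

For example, valid_UTF8_chance_expansion(1) is Fraction(1, 2) and valid_UTF8_chance_expansion(2) is Fraction(143, 512). Since the recursion goes one level deeper per byte, a cold call with a length in the thousands can hit Python's recursion limit; an iterative loop avoids that.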
If you want something better than that, you can convert the sequence into a closed-form expression by using a characteristic polynomial. The basic idea is that you take the recurrence relation from above and replace each n(k) with the k-th power of an unknown variable x to get
x**N = 128 * x**(N-1) + 1920 * x**(N-2) + 61440 * x**(N-3) + 1048576 * x**(N-4)
then divide through by x**(N-4) and move everything to one side to get
x**4 - 128 * x**3 - 1920 * x**2 - 61440 * x - 1048576 = 0
and find the roots of that polynomial, call them r_1 through r_4. (You can use SymPy or Wolfram Alpha or any computer algebra system or symbolic equation solver.) You can then write n(N) as a linear combination of N-th powers of the roots; that is:
n(N) = c_1 * r_1**N + c_2 * r_2**N + c_3 * r_3**N + c_4 * r_4**N
for constants c_k, which can be determined by matching to initial conditions. That is, write four copies of this formula for N=1 through N=4, set them equal to the numeric values you know those to have, and you have a system of four linear equations in four unknowns, which you can solve using standard techniques. You'll get a moderately complicated formula involving some roots and maybe some complex numbers, but with a bit of manipulation you can probably rearrange it so that the roots and i's are grouped together, and then you can make a Python implementation that only uses integer arithmetic.
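To make the procedure concrete, here is a rough numerical sketch of it using only the standard library. It is not the integer-only rearrangement mentioned above: it works in floating point, so the results are approximations. Several things in it are my own assumptions rather than part of the method as described: the two real roots are found by bisection on the brackets [0, 256] and [-16, 0] (the polynomial changes sign on both), the remaining two roots come from the sum and product of all four roots, and every function name is made up.

```python
import cmath

BASE = (1, 128, 18304, 2650112, 383270912)  # n(0)..n(4) from earlier


def char_poly(x):
    return x**4 - 128 * x**3 - 1920 * x**2 - 61440 * x - 1048576


def bisect_root(lo: float, hi: float) -> float:
    """Real root in [lo, hi], assuming char_poly changes sign on that interval."""
    f_lo = char_poly(lo)
    for _ in range(200):
        mid = (lo + hi) / 2
        f_mid = char_poly(mid)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2


def find_roots():
    r1 = bisect_root(0.0, 256.0)   # the positive real root
    r2 = bisect_root(-16.0, 0.0)   # a negative real root
    # All four roots sum to 128 and multiply to -1048576, so the other two
    # are the roots of t**2 - s*t + p = 0 with:
    s = 128 - r1 - r2
    p = -1048576 / (r1 * r2)
    disc = cmath.sqrt(s * s - 4 * p)
    return [r1, r2, (s + disc) / 2, (s - disc) / 2]


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting; entries may be complex."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in reversed(range(n)):
        acc = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x


def fit_closed_form():
    """Return (roots, constants) matched to n(1)..n(4)."""
    rs = find_roots()
    matrix = [[r**k for r in rs] for k in range(1, 5)]
    cs = solve(matrix, [BASE[k] for k in range(1, 5)])
    return rs, cs


def closed_form(N: int, rs, cs) -> float:
    """Approximate n(N) as the real part of sum(c_k * r_k**N)."""
    return sum(c * r**N for c, r in zip(cs, rs)).real
```

With rs, cs = fit_closed_form(), the value closed_form(N, rs, cs) should track the recurrence closely, but it is a float, so do not expect it to be exact once n(N) grows past what a double can represent exactly. For exact answers, stick with the integer recurrence.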