Answer by David Z for How to efficiently calculate the fraction (valid UTF8 byte sequence of length N)/(total N-byte sequences)?

Honestly I couldn't quite follow this question all the way through, but here's the idea I think you want.

Judging by your last program, you figured out, more or less, that you can express the number of valid UTF-8 byte sequences of length N as a recurrence relation. I actually think you got the numbers wrong, though; I believe it should be this:

n(0) = 1 since there is only one UTF-8 byte sequence with zero bytes
n(1) = 128, since the only valid one-byte sequences are the 128 ASCII bytes 0x00 through 0x7F
n(2) = 18304:
- 128 * 128 sequences containing two characters encoded in one byte each
- 1920 (that's 2**11 - 128) sequences containing one character encoded in two bytes
n(3) = 2650112:
- 128 * 128 * 128 sequences of three characters
- 128 * 1920 * 2 sequences of two characters (since you have a two-byte character, 1920 possibilities, and a one-byte character, 128 possibilities, in either of two orders)
- 61440 (that's 2**16 - 1920 - 128 and then minus another 2**11 for the surrogates) sequences of one three-byte character
n(4) = 383270912:
- 128 * 128 * 128 * 128 sequences of four characters
- 128 * 128 * 1920 * 3 sequences of three characters (one two-byte character and two one-byte characters, with the two-byte character occupying any of three positions)
- 1920 * 1920 sequences of two two-byte characters
- 128 * 61440 * 2 sequences of a one-byte and a three-byte character, in either order
- 1048576 (that's 2**20, the code points from U+10000 through U+10FFFF) sequences of one four-byte character
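If you want to double-check the per-character counts (128, 1920, 61440, 1048576) and the small totals above, here is a rough sketch that leans on Python's own strict UTF-8 codec. The function names are made up for illustration, and the brute-force part is only practical for lengths 0 through 2, since it tries all 256**length byte strings.

```python
from collections import Counter
from itertools import product


def code_points_by_encoded_length() -> Counter:
    """Count Unicode code points by the length of their UTF-8 encoding."""
    counts = Counter()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no valid UTF-8 encoding
        counts[len(chr(cp).encode("utf-8"))] += 1
    return counts


def brute_force_count(length: int) -> int:
    """Count valid UTF-8 byte strings of the given length by trying them all."""
    valid = 0
    for candidate in product(range(256), repeat=length):
        try:
            bytes(candidate).decode("utf-8")  # strict decoding raises on invalid input
        except UnicodeDecodeError:
            continue
        valid += 1
    return valid
```

Calling code_points_by_encoded_length() should give 128, 1920, 61440 and 1048576 for lengths 1 through 4, and brute_force_count(k) for k = 0, 1, 2 should give 1, 128 and 18304, matching n(0) through n(2).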

(My main point here is to demonstrate the technique, so if I'm wrong about these numbers, adjust accordingly.) After that, the number of valid UTF-8 byte sequences of length N for N > 4 is determined by this recurrence relation:

n(N) = 128 * n(N-1) + 1920 * n(N-2) + 61440 * n(N-3) + 1048576 * n(N - 4)

reflecting the fact that you can form a UTF-8 sequence of length N > 4 by appending a one-byte character to any sequence of length N - 1, or appending a two-byte character to any sequence of length N - 2, or so on for 3- or 4-byte characters.

This formula is very straightforward to implement iteratively or recursively, and you can get the desired proportion by dividing the result by 256**N, the total number of N-byte sequences. So a reasonably efficient way to compute this is to write a function that computes n(N) using iteration or recursion and then just divide it by 256**N. For example:

    import fractions
    import functools

    @functools.cache
    def numerator(N: int) -> int:
        if N <= 4:  # initial values n(0) through n(4) listed above
            return (1, 128, 18304, 2650112, 383270912)[N]
        return 128 * numerator(N - 1) + 1920 * numerator(N - 2) + 61440 * numerator(N - 3) + 1048576 * numerator(N - 4)

    def valid_UTF8_chance_expansion(length: int):
        return fractions.Fraction(numerator(length), 256**length)  # or convert to string if you like
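A recursive version can run into Python's recursion limit for large N, so here is a minimal iterative sketch of the same recurrence. The names count_valid_utf8 and valid_utf8_fraction are hypothetical, and the code assumes the convention n(k) = 0 for k < 0, which reproduces the values n(1) through n(4) listed above. The denominator is 256**length because that is how many byte strings of that length exist.

```python
from fractions import Fraction

# Number of code points whose UTF-8 encoding is 1, 2, 3 and 4 bytes long.
CHARS_BY_LENGTH = (128, 1920, 61440, 1048576)


def count_valid_utf8(length: int) -> int:
    """Return n(length), the number of valid UTF-8 byte strings of that length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    # window holds n(k-4), n(k-3), n(k-2), n(k-1); for k = 1 that is
    # n(-3) = n(-2) = n(-1) = 0 (assumed convention) and n(0) = 1.
    window = [0, 0, 0, 1]
    for _ in range(length):
        nxt = sum(c * n for c, n in zip(CHARS_BY_LENGTH, reversed(window)))
        window = window[1:] + [nxt]
    return window[-1]


def valid_utf8_fraction(length: int) -> Fraction:
    """Exact fraction of all 256**length byte strings that are valid UTF-8."""
    return Fraction(count_valid_utf8(length), 256**length)
```

For example, valid_utf8_fraction(1) is 1/2 and valid_utf8_fraction(2) is 143/512 (that's 18304/65536 in lowest terms). Each step only touches the last four values, so the work grows linearly in N apart from the cost of big-integer arithmetic.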

If you want something better than that, you can convert the sequence into a closed-form expression by using a characteristic polynomial. The basic idea is that you take the recurrence relation from above, replace each n(k) with the kth power of an unknown variable x to get

x**N = 128 * x**(N-1) + 1920 * x**(N-2) + 61440 * x**(N-3) + 1048576 * x**(N-4)

then simplify and move everything to one side to get

x**4 - 128 * x**3 - 1920 * x**2 - 61440 * x - 1048576 = 0

and find the roots of that polynomial, call them r_1 through r_4. (You can use Sympy or Wolfram Alpha or any computer algebra system or symbolic equation solver.) You can then write n(N) as a linear combination of Nth powers of the roots; that is:

n(N) = c_1 * r_1**N + c_2 * r_2**N + c_3 * r_3**N + c_4 * r_4**N

for constants c_k, which can be determined by matching to initial conditions - that is, write four copies of this formula for N=1 through N=4, set them equal to the numeric values you know those to have, and you have a system of four linear equations in four variables which you can solve using standard techniques. You'll get a moderately complicated formula involving some roots and maybe some complex numbers, but with a bit of manipulation you can probably rearrange it in some way where you group the roots and i's together and then you can make a Python implementation that only uses integer arithmetic.
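Short of working out the full closed form, you can at least sanity-check its dominant term numerically. The sketch below is not the symbolic solution: it swaps in plain Newton's method (standard library only) to find the largest real root of the characteristic polynomial, then compares it with the ratio n(N+1)/n(N) from the recurrence, which should approach that root as N grows. The helper names and the starting point 256 are my own assumptions; 256 works because the polynomial is positive and convex from there down to its largest root.

```python
def char_poly(x: float) -> float:
    """x**4 - 128*x**3 - 1920*x**2 - 61440*x - 1048576, in Horner form."""
    return (((x - 128) * x - 1920) * x - 61440) * x - 1048576


def char_poly_derivative(x: float) -> float:
    """4*x**3 - 384*x**2 - 3840*x - 61440, in Horner form."""
    return ((4 * x - 384) * x - 3840) * x - 61440


def dominant_root(start: float = 256.0, tolerance: float = 1e-10) -> float:
    """Newton's method, started to the right of the largest real root."""
    x = start
    for _ in range(100):
        step = char_poly(x) / char_poly_derivative(x)
        x -= step
        if abs(step) < tolerance:
            break
    return x


def count_valid_utf8(length: int) -> int:
    """n(length) from the recurrence, assuming n(0) = 1 and n(k) = 0 for k < 0."""
    window = [0, 0, 0, 1]
    for _ in range(length):
        nxt = (128 * window[3] + 1920 * window[2]
               + 61440 * window[1] + 1048576 * window[0])
        window = window[1:] + [nxt]
    return window[-1]


root = dominant_root()
ratio = count_valid_utf8(41) / count_valid_utf8(40)
print(root, ratio, root / 256)
```

The root lands between 144 and 145, so for large N the fraction of valid sequences shrinks by a factor of roughly root / 256 (about 0.565) with each additional byte.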
