Channel: User David Z - Stack Overflow

Answer by David Z for How to efficiently calculate the fraction (valid UTF8 byte sequence of length N)/(total N-byte sequences)?


Honestly I couldn't quite follow this question all the way through, but here's the idea I think you want.

Judging by your last program, you figured out, more or less, that you can express the number of valid UTF-8 byte sequences of length N as a recurrence relation. I actually think you got the numbers wrong, though; I believe it should be this:

  • n(0) = 1 since there is only one UTF-8 byte sequence with zero bytes
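  • n(1) = 128 since every byte from 0x00 through 0x7F is a valid one-byte (ASCII) character and no other single byte is valid on its own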
  • n(2) = 18304:
    • 128 * 128 sequences containing two characters encoded in one byte each
    • 1920 (that's 2**11 - 128) sequences containing one character encoded in two bytes
  • n(3) = 2650112:
    • 128 * 128 * 128 sequences of three characters
    • 128 * 1920 * 2 sequences of two characters (since you have a two-byte character, 1920 possibilities, and a one-byte character, 128 possibilities, in either of two orders)
    • 61440 (that's 2**16 - 1920 - 128 and then minus another 2**11 for the surrogates) sequences of one three-byte character
  • n(4) = 383270912:
    • 128 * 128 * 128 * 128 sequences of four characters
    • 128 * 128 * 1920 * 3 sequences of three characters (one two-byte character and two one-byte characters, with the two-byte character occupying any of three positions)
    • 1920 * 1920 sequences of two two-byte characters
    • 128 * 61440 * 2 sequences of a one-byte and a three-byte character, in either order
    • 1048576 (that's 2**20, one for each code point from U+10000 through U+10FFFF) sequences of one four-byte character
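If you want to sanity-check these numbers, here is a minimal sketch that leans on Python's own strict UTF-8 codec. The function names are mine, and the brute-force check is only practical for very small N (it walks all 256**N byte strings).

```python
from collections import Counter


def count_by_encoded_length() -> Counter:
    """Count Unicode code points by the length of their UTF-8 encoding."""
    counts = Counter()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF-8
        counts[len(chr(cp).encode("utf-8"))] += 1
    return counts


def brute_force_count(n: int) -> int:
    """Count valid UTF-8 byte strings of length n by trying every one."""
    valid = 0
    for value in range(256 ** n):
        try:
            value.to_bytes(n, "big").decode("utf-8")
        except UnicodeDecodeError:
            continue
        valid += 1
    return valid
```

The first function should reproduce the per-character counts used above (128, 1920, 61440, 1048576), and the second can confirm n(1) and n(2) directly.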

(My main point here is to demonstrate the technique, so if I'm wrong about these numbers, adjust accordingly.) After that, the number of valid UTF-8 byte sequences of length N, for N > 4, is determined by this recurrence relation:

n(N) = 128 * n(N-1) + 1920 * n(N-2) + 61440 * n(N-3) + 1048576 * n(N - 4)

reflecting the fact that you can form a UTF-8 sequence of length N > 4 by appending a one-byte character to any sequence of length N - 1, or appending a two-byte character to any sequence of length N - 2, and so on for 3- and 4-byte characters.

This formula is very straightforward to implement iteratively or recursively, and you get the desired proportion by dividing the result by 256**N, the total number of N-byte sequences. For example:

    import fractions
    import functools

    @functools.cache
    def numerator(N: int) -> int:
        # the base cases n(0) through n(3) from the list above need to be handled first
        return 128 * numerator(N - 1) + ...  # remaining terms of the recurrence

    def valid_UTF8_chance_expansion(length: int):
        return fractions.Fraction(numerator(length), 256**length)  # or convert to string if you like
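For something you can run as-is, here is a minimal iterative sketch of the same idea. The names `CHARS_BY_LENGTH`, `count_valid_utf8`, and `valid_utf8_fraction` are mine, and it assumes the per-character counts listed earlier are right and that the denominator is 256**N, the number of all N-byte strings. Treating n(k) as 0 for negative k lets the one recurrence cover the base cases too.

```python
from fractions import Fraction

# Number of characters whose UTF-8 encoding is 1, 2, 3, and 4 bytes long
# (assumed from the counts listed earlier in this answer).
CHARS_BY_LENGTH = (128, 1920, 61440, 1048576)


def count_valid_utf8(length: int) -> int:
    """Return n(length), the number of valid UTF-8 byte strings of that length."""
    counts = [1]  # n(0) = 1: the empty sequence
    for n in range(1, length + 1):
        total = 0
        for k, chars in enumerate(CHARS_BY_LENGTH, start=1):
            if n - k >= 0:  # a k-byte character appended to a shorter sequence
                total += chars * counts[n - k]
        counts.append(total)
    return counts[length]


def valid_utf8_fraction(length: int) -> Fraction:
    """Exact fraction of all byte strings of this length that are valid UTF-8."""
    return Fraction(count_valid_utf8(length), 256 ** length)
```

Using `Fraction` keeps the result exact; call `float()` on it if an approximate value is enough.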

If you want something better than that, you can convert the sequence into a closed-form expression by using a characteristic polynomial. The basic idea is that you take the recurrence relation from above, replace each n(k) with the kth power of an unknown variable x to get

x**N = 128 * x**(N-1) + 1920 * x**(N-2) + 61440 * x**(N-3) + 1048576 * x**(N-4)

then simplify and move everything to one side to get

x**4 - 128 * x**3 - 1920 * x**2 - 61440 * x - 1048576 = 0

and find the roots of that polynomial, call them r_1 through r_4. (You can use Sympy or Wolfram Alpha or any computer algebra system or symbolic equation solver.) You can then write n(N) as a linear combination of Nth powers of the roots; that is:

n(N) = c_1 * r_1**N + c_2 * r_2**N + c_3 * r_3**N + c_4 * r_4**N

for constants c_k, which can be determined by matching to initial conditions. That is, write four copies of this formula for N=1 through N=4, set each one equal to the known value of n(N), and you have a system of four linear equations in four unknowns, which you can solve using standard techniques. You'll get a moderately complicated formula involving some roots and maybe some complex numbers, but with a bit of manipulation you can probably rearrange it so that the roots and i's are grouped together, and then write a Python implementation that only uses integer arithmetic.
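Here is a rough numerical sketch of that procedure, in floating point with only the standard library, so it is approximate rather than the exact integer version. Everything specific in it is my assumption: the function names, the bracketing intervals for the two real roots (picked by checking the sign of the polynomial at the endpoints), and the shortcut of getting the last two roots from the sum and product of all four roots. It also assumes the four roots are distinct.

```python
import cmath


def poly(x):
    """Characteristic polynomial of the recurrence."""
    return x**4 - 128 * x**3 - 1920 * x**2 - 61440 * x - 1048576


def bisect_root(lo, hi, steps=200):
    """Find a real root in [lo, hi]; poly must change sign on the interval."""
    f_lo = poly(lo)
    for _ in range(steps):
        mid = (lo + hi) / 2
        f_mid = poly(mid)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2


def all_roots():
    r1 = bisect_root(128.0, 256.0)  # poly(128) < 0 < poly(256)
    r2 = bisect_root(-16.0, -8.0)   # poly(-16) > 0 > poly(-8)
    # The four roots sum to 128 and multiply to -1048576, so the other two
    # are the roots of x**2 - s*x + p with:
    s = 128 - r1 - r2
    p = -1048576 / (r1 * r2)
    disc = cmath.sqrt(s * s - 4 * p)
    return [r1, r2, (s + disc) / 2, (s - disc) / 2]


def solve_linear(a, b):
    """Gaussian elimination with partial pivoting; entries may be complex."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0j] * n
    for r in reversed(range(n)):
        acc = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / m[r][r]
    return x


def closed_form_coefficients(roots):
    """Solve for c_1..c_4 by matching n(1) through n(4)."""
    known = [128, 18304, 2650112, 383270912]
    matrix = [[r**N for r in roots] for N in range(1, 5)]
    return solve_linear(matrix, known)


def closed_form(N, roots, coeffs):
    """Approximate n(N) as the sum of c_k * r_k**N (imaginary parts cancel)."""
    return sum(c * r**N for c, r in zip(coeffs, roots)).real
```

Rounding `closed_form(N, roots, coeffs)` should agree with the recurrence for small N, but floating-point error grows with N, so treat this as a check on the algebra rather than a replacement for the exact computation.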

