Alex Walter
2011-07-08 01:35:12 UTC
In testing my program I found a discrepancy between the reference
executable and my program with the Unicode character 0x110000.
The UTF8 encoding of this character is:
11110100 10010000 10000000 10000000
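
For reference, the bit pattern above can be reproduced with a small sketch. It packs the value into the 4-byte layout 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx by hand, since Python's own chr() refuses 0x110000. The function name four_byte_pattern is hypothetical and is not part of the assignment or the reference executable.

```python
def four_byte_pattern(cp):
    """Pack cp into the 4-byte UTF-8 layout without checking the Unicode range.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x1FFFFF:
        raise ValueError("value does not fit the 4-byte layout")
    raw = [
        0xF0 | (cp >> 18),           # 11110xxx: top 3 payload bits
        0x80 | ((cp >> 12) & 0x3F),  # 10xxxxxx: next 6 bits
        0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx: next 6 bits
        0x80 | (cp & 0x3F),          # 10xxxxxx: low 6 bits
    ]
    return [format(b, "08b") for b in raw]


print(" ".join(four_byte_pattern(0x110000)))
# prints: 11110100 10010000 10000000 10000000
```
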
My program detects that this is invalid at the second byte and prints
the last 3 bytes as extra characters. I believe this is correct because,
after a first byte of 11110100, no valid UTF8 character can have a 1 in
the 4th position of the second byte: that puts the value above the
allowed range (maximum 0x10FFFF) no matter what the bits after it are.
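
The second-byte argument can be sketched as a byte-by-byte check. This is only an illustration of the early-detection idea, assuming the usual well-formed UTF8 ranges in which the second byte is restricted after the first bytes E0, ED, F0 and F4. The names second_byte_range and first_invalid_index are hypothetical, and this is not necessarily how the reference executable works.

```python
def second_byte_range(lead):
    """Allowed range of the second byte for a given first byte."""
    special = {
        0xE0: (0xA0, 0xBF),  # excludes overlong 3-byte forms
        0xED: (0x80, 0x9F),  # excludes surrogates
        0xF0: (0x90, 0xBF),  # excludes overlong 4-byte forms
        0xF4: (0x80, 0x8F),  # excludes values above 0x10FFFF
    }
    return special.get(lead, (0x80, 0xBF))


def first_invalid_index(seq):
    """Return the 0-based index of the first byte that proves the
    character invalid, or -1 if the character is well formed."""
    lead = seq[0]
    if lead < 0x80:
        length = 1
    elif 0xC2 <= lead <= 0xDF:
        length = 2
    elif 0xE0 <= lead <= 0xEF:
        length = 3
    elif 0xF0 <= lead <= 0xF4:
        length = 4
    else:
        return 0  # this byte can never start a character
    for i in range(1, length):
        if i >= len(seq):
            return i  # input ended before the character was complete
        lo, hi = second_byte_range(lead) if i == 1 else (0x80, 0xBF)
        if not lo <= seq[i] <= hi:
            return i
    return -1


print(first_invalid_index(bytes([0xF4, 0x90, 0x80, 0x80])))  # prints 1
print(first_invalid_index(bytes([0xF4, 0x8F, 0xBF, 0xBF])))  # prints -1
```

Index 1 is the second byte, so under these assumptions the encoding of 0x110000 is already known to be invalid before the last two bytes are read, while the encoding of 0x10FFFF passes.
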
But the reference executable prints out all of the bytes and then
determines that the UTF8 is invalid, printing no extra bytes.
From the assignment description, I was under the assumption that we
were supposed to check whether a UTF8 character is valid as each byte is
checked, and that appears to be true for all Unicode characters less than
0x110000, so is this a problem with the reference executable?