Discussion:
A4Q1: extra characters for invalid character
SG
2011-07-02 06:54:17 UTC
When I run the current executable file for input 0xeeeeee (6 e's), I get
the following:
0xeeee : invalid Extra characters 0xee
Shouldn't it be just 0xeeeeee : invalid ?
Since there are 3 bytes, the input corresponds to the 3rd type, and the
first byte, ee, is of the form 1110xxxx. But the second byte fails the form
10xxxxxx, so it is correct that it's invalid. However, I don't understand
why the last byte is an extra character.
Did I misunderstand what determines the extra characters for an invalid
character?
Thank you
Muhammad Tauqir
2011-07-02 19:47:39 UTC
I believe you are supposed to output "invalid" as soon as you come
across an invalid character.
So when you come across the invalid character '0xee', you must output the
error there and output the rest as 'Extra characters'.

Regards,
Muhammad Tauqir
Terry Anderson
2011-07-02 21:56:01 UTC
First, just a note to everybody to avoid any confusion: if you are using
the genutf8.cc file to generate input data (which is advised), then the
input 0xeeeeee should be expressed as:

0xee, 0xee, 0xee, '\n'

The reason is that this is placed in a char array, and so each element of
the array can only hold one byte of data.

Now to answer your question. The input, when expressed in binary, is:

1110 1110 . 1110 1110 . 1110 1110 . 0000 1010

(I put the .s in only to make it easier to see where this is broken up
into individual bytes)
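
To make the byte-by-byte view concrete, here is a minimal sketch that stores the input one byte per array element and prints it in the same dotted binary layout. The helper name toBinary is hypothetical and is not part of genutf8.cc; unsigned char is used here as an assumption so that 0xee fits in an element without a narrowing problem.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of genutf8.cc: renders each byte as two
// 4-bit groups and separates consecutive bytes with " . ".
std::string toBinary(const unsigned char* bytes, std::size_t n) {
    assert(bytes != nullptr || n == 0);
    std::string out;
    for (std::size_t i = 0; i < n; ++i) {
        if (i > 0) out += " . ";
        // std::bitset<8> gives the 8 bits of one byte, most significant first.
        const std::string bits = std::bitset<8>(bytes[i]).to_string();
        out += bits.substr(0, 4) + " " + bits.substr(4, 4);
    }
    return out;
}
```

Called on an array holding 0xee, 0xee, 0xee, '\n' with a length of 4, this should return the binary string shown above, newline byte included.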

Looking at the first byte, the check bits are 1110, so we are working with
the third case of the encoding. We assume that this input is a valid
encoding until we are proven otherwise.

We then read the second byte and see that its check bits are NOT 10,
which means that the first two bytes do not correspond to a valid
encoding. This is when "0xeeee : invalid" is printed.

We then check to see if the third byte is the newline character, which it
is not. Thus the third byte counts as an "extra character" because we
have already determined that the first two bytes do not correspond to a
valid encoding. This is when we print "Extra characters 0xee".

The fourth byte is the newline character, which indicates we are at the
end of this particular input (and the newline character doesn't count as
input, so we don't print it as an extra character).
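
The steps above can be sketched roughly as follows. This is only an illustration of the walk-through, not the assignment's solution: the names Result, expectedLength, check and report are hypothetical, the treatment of an empty line or a line that ends early is an assumption, and only the layout of the invalid case is taken from this thread (the bare "valid" text is a placeholder, since the real format for valid input is not shown here). It checks only the check bits and applies no other rules the assignment may require.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical result type; these names are not from the assignment.
struct Result {
    std::size_t examined;  // bytes read before the verdict was reached
    bool valid;            // verdict for those bytes
    std::size_t extra;     // bytes between the verdict and the '\n'
};

// Total encoding length implied by the first byte's check bits,
// or 0 if the byte matches none of the four patterns.
std::size_t expectedLength(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// data must end with '\n'. Assume valid until a byte proves otherwise.
Result check(const unsigned char* data) {
    assert(data != nullptr);
    Result r = {0, true, 0};
    if (data[0] == '\n') { r.valid = false; return r; }  // assumption: empty line is invalid
    const std::size_t need = expectedLength(data[0]);
    r.examined = 1;
    if (need == 0) r.valid = false;
    for (std::size_t i = 1; r.valid && i < need; ++i) {
        if (data[i] == '\n') { r.valid = false; break; }  // assumption: line ended early
        r.examined = i + 1;
        if ((data[i] & 0xC0) != 0x80) r.valid = false;    // check bits are not 10
    }
    // Everything after the verdict, up to the newline, is an extra character.
    for (std::size_t j = r.examined; data[j] != '\n'; ++j) ++r.extra;
    return r;
}

// Formats the result in the layout quoted in this thread for the invalid case.
std::string report(const unsigned char* data) {
    const Result r = check(data);
    std::ostringstream out;
    out << std::hex << std::setfill('0') << "0x";
    for (std::size_t i = 0; i < r.examined; ++i)
        out << std::setw(2) << static_cast<int>(data[i]);
    out << " : " << (r.valid ? "valid" : "invalid");
    if (r.extra > 0) {
        out << " Extra characters 0x";
        for (std::size_t i = 0; i < r.extra; ++i)
            out << std::setw(2) << static_cast<int>(data[r.examined + i]);
    }
    return out.str();
}
```

For the input 0xee, 0xee, 0xee, '\n', check stops at the second byte, so two bytes are reported as invalid and the single remaining byte before the newline is reported as an extra character.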

Terry
--
Terry Anderson
CS 246 Instructor