> Any comment/suggestion welcome (I've cross-posted intentionally, please remove recipients if not appropriate.)
> TAG members - has the issue of dealing with symbols vs characters/codepoints come up in TAG discussion?
> Please feel free to come back again here or contact the I18N WG.
> what the Character Model has to say about this:
> I'd suggest we schedule a discussion of this issue in an upcoming call.
> Unfortunately for us, both considerations apply in the annotation use
> b) "user interaction is a primary concern" - in which case grapheme
> "code unit strings" (I presume interop with existing DOM APIs would also
> a) there are performance considerations that would predicate the use of
> clear that recommendation is to use character strings (i.e.
> The character model lays out the problems more clearly than I have.
> Thanks for this reference, Martin, and thanks for passing this to TAG,
> When transferring data, it is important that the other implementation counts offsets the same way.
> On my wishlist, I would hope that the new Annotation standard would include a normative list (SHOULD, not MUST) of string counting functions for all major programming languages and other standards like SPARQL to tackle interoperability.
> (yes, we consider moving to a W3C community group for further improvement)
> - the "definition of string" section in the NIF spec: However, NFD is not in wide use and the annotation of diacritics is probably out of scope.
> in NFD you can annotate the code point for the diacritic separately. However, if people wish to annotate diacritics independently.
> There is a problem with Unicode Normal Form (NF). Personally, I think byte offsets for text are unnecessary, simply because code points are better, i.e.
> Anyhow, I wouldn't know a single use case for using Code Units for annotation.
> It was quite difficult to work with the byte offset given that the original formats were HTML, txt, PDFs and docx.
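The point about normal forms can be made concrete with a small Python 3 sketch. It is only an illustration, not part of the NIF spec or the Character Model: the variable names are made up for this example, and it assumes offsets are counted in code points, which is what `len()` does on a Python 3 `str`.

```python
import unicodedata

# "ä" written as one precomposed code point, U+00E4.
precomposed = "\u00e4"

# The same text in the two normal forms discussed in the thread.
nfc = unicodedata.normalize("NFC", precomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# Python 3 strings are sequences of code points, so len() counts code points.
nfc_length = len(nfc)  # 1: the single code point U+00E4
nfd_length = len(nfd)  # 2: "a" followed by U+0308 COMBINING DIAERESIS

# Only in NFD does the diacritic have an offset of its own.
diacritic = nfd[1]
print(nfc_length, nfd_length, unicodedata.name(diacritic))
```

Two implementations that exchange offsets therefore need to agree on the normal form as well as on the counting unit, otherwise every offset after the first decomposed character shifts.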
> For the NLP2RDF project we converted these 30 million annotations to RDF:
> Python, len() in combination with decode(): len("ä".decode("UTF-8")) = 1. Any deviation will lead to side effects such as "ä" having the length 2:
> Regarding annotation, using code points or Character Strings is definitely the best practice.
> On the (serialized) web, UTF-8 is predominant, which is really not the question here as the choice between graphemes, code points and units is orthogonal to encoding. Maybe some DOM parsers rely on UTF-16 internally too, but still count Code Points. C/C++ has a datatype widechar using 16 bits as it is easier to allocate memory for variables. This means that you can use byte offsets easily to jump to certain positions in the text.
> While UTF-8 has a variable length of one to four bytes per code point, UTF-16 and 32 have the advantage of a fixed length. you can encode the same code point in UTF-8, UTF-16 and UTF-32 which will definitely change the number of code units and bytes needed.
> UTF-16 is the encoding of the string and is independent of code points, units and graphemes, i.e.
> From my understanding the example in is not good:
> I am a bit puzzled why is renaming Unicode Code Points (a clearly defined thing) to Character String. Here I show it in python (note the u'xxx' is a UTF-16):

That you can calculate an offset of a character is not true.
For example characters of the use 4 bytes
Like UTF-8 it can use up to 4 bytes.
Cases where UTF-16 uses 4 bytes may be 'pathological', but the assumption
UTF-16 is **not** a fixed length encoding.

> UTF-16 and 32 have the advantage of a fixed length.
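The disagreement above (code points vs. code units vs. bytes, and whether UTF-16 is fixed length) can be checked with a short Python 3 sketch. The helper name `count_units` is hypothetical and exists only for this illustration. Note that the quoted `decode()` call is Python 2 style; in Python 3 a `str` is already a sequence of code points.

```python
def count_units(text: str) -> dict:
    """Count one string in the units discussed in the thread."""
    return {
        # Python 3 str: len() counts Unicode code points.
        "code_points": len(text),
        # UTF-8 code units are single bytes.
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 code units are 2 bytes each ("-le" avoids a byte order mark).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # UTF-32 code units are 4 bytes each.
        "utf32_code_units": len(text.encode("utf-32-le")) // 4,
    }


a_umlaut = count_units("\u00e4")    # "ä", inside the Basic Multilingual Plane
g_clef = count_units("\U0001d11e")  # MUSICAL SYMBOL G CLEF, outside the BMP

print(a_umlaut)
print(g_clef)
```

For U+00E4 this gives 1 code point, 2 UTF-8 bytes and 1 UTF-16 code unit. For U+1D11E it gives 1 code point, 4 UTF-8 bytes and 2 UTF-16 code units (a surrogate pair), so only UTF-32 has a fixed length per code point.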
> While UTF-8 has a variable length of one to four bytes per code point,

The following example shows the usage of the offsetByCodePoints() method.

To: Sebastian Hellmann, Public TAG List
CC: W3C Public Annotation List, nlp2rdf

IndexOutOfBoundsException − if index is negative or larger than the length of the char sequence, or if codePointOffset is positive and the subsequence starting with index has fewer than codePointOffset code points, or if codePointOffset is negative and the subsequence before index has fewer than the absolute value of codePointOffset code points.

This method returns the index within the char sequence

Exception

codePointOffset − the offset in code points.

public static int offsetByCodePoints(CharSequence seq, int index, int codePointOffset)

Declaration

Following is the declaration for the offsetByCodePoints() method

Unpaired surrogates within the text range given by index and codePointOffset count as one code point each.

The offsetByCodePoints(CharSequence seq, int index, int codePointOffset) method returns the index within the given char sequence that is offset from the given index by codePointOffset code points.
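As a rough Python 3 analogue of what the Java method does when starting from index 0, the sketch below maps a code point offset to the matching UTF-16 code unit index. This is the conversion an annotation tool would need when talking to an API that counts UTF-16 code units. The function name `code_point_to_utf16_offset` is hypothetical, and the sketch only covers non-negative offsets from the start of the string.

```python
def code_point_to_utf16_offset(text: str, cp_offset: int) -> int:
    """Return the UTF-16 code unit index that corresponds to a code point offset.

    Hypothetical helper for illustration; it only handles offsets counted
    from the start of the string.
    """
    if not 0 <= cp_offset <= len(text):
        raise IndexError("code point offset out of range")
    # Code points above U+FFFF are stored as a surrogate pair (two code units);
    # everything else, including a lone surrogate, takes one code unit.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:cp_offset])


sample = "a\U0001d11eb"  # "a", MUSICAL SYMBOL G CLEF, "b"
offsets = [code_point_to_utf16_offset(sample, i) for i in range(len(sample) + 1)]
print(offsets)  # [0, 1, 3, 4]
```

The jump from 1 to 3 is the surrogate pair: the code point offset of "b" is 2, while its UTF-16 code unit index is 3.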