j7n wrote: ↑07 Feb 2019 14:45
Text editors/viewers that can load 'binary' data and synchronize with any text inside it exist and can be used instead.
Many of the feature requests I get (related to this summary) are from people who want to see text in their natural language. Especially when searching text, the found occurrences should closely resemble the search pattern.
I guess you can make some relaxing assumptions, such as regarding the direction of selection, but it should not be a raw view that is too close to a direct mapping to Code Units (in the sense of Unicode). I.e. UInt16 and UInt8 units, for UTF-16 and UTF-8 respectively, are too low level.
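To make the code unit distinction concrete, here is a minimal sketch (the function name is only illustrative) that counts code points and the UInt8/UInt16 code units needed for the same string:

```python
def code_unit_counts(text: str) -> dict:
    """Count code points and code units for one string."""
    return {
        "code_points": len(text),
        # UTF-8 code units are single bytes (UInt8).
        "utf8_units": len(text.encode("utf-8")),
        # UTF-16 code units are 16-bit values (UInt16), so two bytes each.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


print(code_unit_counts("\u00fc"))      # precomposed u with diaeresis
print(code_unit_counts("\U0001F600"))  # a code point outside the BMP
```

A view that maps one cell to one code unit would split both of these characters across several cells in UTF-8, and the second one in UTF-16 as well.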
j7n wrote: ↑07 Feb 2019 14:45
When it comes to characters that combine into one, such as "u" followed by a "combining diaeresis", it is more useful to see the two codes separately than a restored letter "ü" (most strings would already have Latin letters precomposed).
Many Asian languages are not readable in the uncombined form. And I wouldn't want to read an u with an Umlaut separated from it. These characters get botched up enough everywhere.
Some examples, where you can see that non-combined characters are hard to read:
https://stackoverflow.com/questions/198 ... an-letters
http://www.programminginkorean.com/prog ... n-unicode/
I misunderstood the above examples. The Hangul jamos (glyphs on the left side) are only combined to hangul syllables (right side) through normalization, not through grapheme clustering. This is because none of the jamo code points above are of GraphemeClusterBreak type L, V, or T. The examples below that show Hangul clusters use Hangul syllables of the L(ead), V(owel), or T(rail) type, and therefore get handled as a cluster.
HxD will not normalize text, as this would be too far from the low level code unit representation and would actually transform bytes/code units, instead of treating them (selecting, caret position, etc.) and showing them as a cluster. But it will handle grapheme cluster examples of Hangul, given below (and explained above, i.e., LVT syllables).
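A small sketch of why normalization is too invasive for a hex editor: it changes the stored bytes. The sample input (three conjoining jamo of type L, V and T) is my own choice for illustration, not one of the examples linked above.

```python
import unicodedata

# Conjoining jamo: lead HIEUH, vowel A, trail NIEUN (illustrative sample input).
jamo = "\u1112\u1161\u11ab"
# NFC composes the three jamo into one precomposed Hangul syllable.
composed = unicodedata.normalize("NFC", jamo)

before = jamo.encode("utf-8")
after = composed.encode("utf-8")
print(len(jamo), "code points:", before.hex(" "))
print(len(composed), "code point: ", after.hex(" "))
```

The sequence goes from three code points to one and from nine UTF-8 bytes to three, so offsets and file contents would no longer match what is on disk. Treating the unnormalized sequence as one cluster leaves the bytes untouched.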
Apparently CJK characters (sinographs/chinese characters/ideographs) are never composed of several code points: https://www.unicode.org/faq/han_cjk.html#16
TODO: this seems not to be entirely correct: https://en.wikipedia.org/wiki/Enclosed_ ... and_Months
However HxD will probably not support this.
According to GraphemeBreakProperty-8.0.0.txt there are no CJK characters that form grapheme clusters, but there are Hangul syllables that form Hangul syllable sequences (grapheme clusters). Many other Asian scripts, such as Tibetan or Devanagari, as well as Middle Eastern scripts such as Arabic or Hebrew, are listed as creating grapheme clusters.
Browsers such as Firefox, or text processing apps like WordPad/Notepad and LibreOffice, will render a grapheme cluster as one unit, but allow removing code points from the end of the code point sequence with backspace. Each press of backspace deletes one code point. Pressing the del key in front of a code point sequence will delete the entire grapheme cluster at once.
Selection and caret movement also always handles the entire grapheme cluster.
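The editing behaviour described above can be sketched roughly as follows. This is a deliberately simplified approximation: it only attaches code points of general category Mn/Mc/Me to the preceding base, and does not implement the full grapheme cluster rules (Hangul L/V/T, ZWJ sequences, and so on). All function names are hypothetical.

```python
import unicodedata


def simple_clusters(text: str) -> list:
    """Approximate grapheme clusters: a base code point plus following marks.

    Simplification: only general categories Mn/Mc/Me extend a cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # mark: attach to the previous cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


def backspace(text: str) -> str:
    """Backspace at the end of the text removes a single code point."""
    return text[:-1]


def delete_forward(text: str) -> str:
    """Del at the start of the text removes the whole first cluster."""
    return "".join(simple_clusters(text)[1:])


sample = "u\u0308a"  # "u" + combining diaeresis, then "a"
print(simple_clusters(sample))
print(backspace("u\u0308"))
print(delete_forward(sample))
```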
Notepad and WordPad are inconsistent in handling Hangul syllable sequences:
renders as one grapheme cluster in all the apps mentioned above. But
will render as individual code points in WordPad and Notepad, yet render/edit fine in Firefox or LibreOffice. It seems Uniscribe is at fault for that, and it's not clear if it can be convinced to handle this grapheme cluster properly (with the right options/API calls). DirectWrite also cannot render it properly.
Relevant link for Win10: https://en.wikipedia.org/wiki/Uniscribe ... ing_Engine
Firefox uses HarfBuzz for font shaping (yet DirectWrite/Uniscribe for rendering), which apparently works better in the last example above.
https://hg.mozilla.org/mozilla-central/ ... gfx/thebes
has the relevant code, especially
https://hg.mozilla.org/mozilla-central/ ... zzShaper.h
https://hg.mozilla.org/mozilla-central/ ... Shaper.cpp
It however does not seem to be trivial to implement or link as library, and might be overkill considering all the additional effort that would be needed, even if this wrong rendering is a bit annoying.
Interesting read about complex scripts and shaping engines: http://tiro.com/John/Universal_Shaping_ ... POLabs.pdf
Other examples of grapheme clusters are:
See BytesToValidUnicodeGraphemeClusters.dpr for details.
See also information about Korean (Hangul, which is not part of CJK):
See also: Why doesn't the Unicode Standard adopt a compositional model for encoding Han ideographs? Wouldn't that save a large number of code points?
More on grapheme clusters (what users would consider a character) and combining characters: http://unicode.org/faq/char_combmark.html
The FAQ also shows some examples of combining characters, that we can test on.
Especially relevant is "How are characters counted when measuring the length or position of a character in a string?", which also provides a good example of combining characters that are not simple diacritics:
The image shows a grapheme cluster made of the base letter "Devanagari Letter Na (U+0928)" and the combining mark "Devanagari Vowel Sign I (U+093F)", among other code points.
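To check that such a vowel sign is not a simple diacritic, one can look at the character properties. A minimal sketch (the helper name is only illustrative) that lists name, general category and canonical combining class per code point:

```python
import unicodedata


def describe(text: str) -> list:
    """Return (code point, name, general category, combining class) tuples."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch),
         unicodedata.category(ch), unicodedata.combining(ch))
        for ch in text
    ]


for row in describe("\u0928\u093f"):  # Devanagari NA + vowel sign I
    print(row)
for row in describe("u\u0308"):       # u + combining diaeresis
    print(row)
```

The vowel sign is a spacing mark (category Mc) with combining class 0, while the diaeresis is a nonspacing mark (Mn) with a non-zero combining class. A test for cluster membership therefore cannot rely on the combining class alone.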
For proper text segmentation and finding grapheme base characters (i.e., starting points), understanding grapheme clusters, and other related terms, the glossary is useful.
Information about handling regex might be relevant as well, at least for searching or handling code point sequences.
For analyzing encoding oddities, I think extending the data inspector, and eventually turning it into a structural view, would be more useful. There is already a UTF-8 Codepoint entry that explicitly reports every encoding error. More could be done along the same lines.
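As a rough illustration of reporting every encoding error explicitly (this is not how the data inspector is implemented, just a sketch with a hypothetical function name), Python's decoder already exposes the position and reason of each invalid span:

```python
def utf8_errors(data: bytes) -> list:
    """Return (start, end, reason) for each invalid span in data."""
    errors = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start/exc.end are relative to the slice, so shift them.
            errors.append((pos + exc.start, pos + exc.end, exc.reason))
            pos += exc.end  # resume right after the invalid span
    return errors


for start, end, reason in utf8_errors(b"ab\xffcd\xc3"):
    print(f"bytes {start}..{end - 1}: {reason}")
```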
You could also show a different representation when there are encoding errors, as browsers do, such as using replacement characters for byte sequences that don't result in a validly encoded string.
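A minimal sketch of the browser-like behaviour, assuming it is acceptable to show U+FFFD for anything that does not decode (the function name is hypothetical):

```python
def decode_for_display(data: bytes, encoding: str = "utf-8") -> str:
    """Decode for display only; invalid sequences become U+FFFD."""
    return data.decode(encoding, errors="replace")


shown = decode_for_display(b"ab\xffcd")
print([f"U+{ord(ch):04X}" for ch in shown])
```

This only affects what is displayed. The underlying bytes stay as they are, which is the important difference from normalizing or repairing the data.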
Lastly, it might be useful to have the ghost caret encompass/surround all the bytes that correspond to the currently highlighted glyph/insertion mark position. Similarly, when the hex column is active, the byte under the caret should result in a ghost caret in the text view, that (partially) highlights the corresponding glyph (how that partial highlighting is to be done is another question to figure out).
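The bookkeeping behind such a ghost caret could look roughly like this. It is a sketch under simplifying assumptions: the text is validly encoded, the mapping is per code point rather than per grapheme cluster, and the encoding is BOM-less (e.g. "utf-8" or "utf-16-le"). Both function names are hypothetical.

```python
def byte_spans(text: str, encoding: str = "utf-8") -> list:
    """Return (char, start, end) byte ranges, one per code point.

    Use a BOM-less encoding name, otherwise every char would get a BOM.
    """
    spans = []
    offset = 0
    for ch in text:
        size = len(ch.encode(encoding))
        spans.append((ch, offset, offset + size))
        offset += size
    return spans


def char_index_at(spans: list, byte_offset: int):
    """Find which character covers a byte offset (hex column -> text column)."""
    for index, (_, start, end) in enumerate(spans):
        if start <= byte_offset < end:
            return index
    return None


spans = byte_spans("a\u00fcb")
print(spans)                    # text column -> byte range to surround
print(char_index_at(spans, 2))  # hex column -> glyph to highlight
```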
j7n wrote: ↑07 Feb 2019 14:45
Any invisible control characters are useful to see without interpretation, such as writing direction marks. Writing direction marks (which can occur randomly) shouldn't affect how a selection with the mouse works. This allows diagnosing problems when those symbols occur in codes like filenames, which look ordinary in complete text views.
Control characters and non-printable characters will probably just be represented like now: by a replacement character such as a dot.
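A rough sketch of that substitution, using Python's own notion of "printable" as a stand-in (an assumption on my part; the exact set of characters replaced in HxD may differ):

```python
def text_column(text: str, placeholder: str = ".") -> str:
    """Replace control/format/other non-printable code points by a dot."""
    return "".join(ch if ch.isprintable() else placeholder for ch in text)


# U+200E is LEFT-TO-RIGHT MARK, a writing direction mark.
print(text_column("a\u200eb\tc"))
```

With this, a direction mark hidden in a filename shows up as a visible dot instead of silently changing the display order.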
For example, in WinHex I see a very basic Unicode. Maybe it is upgraded in later versions, but aside from glitchy graphic rendering, extra spaces
The WinHex authors certainly have an impressive list of features, but they are not a role model for HxD. I have different goals, and the Unicode support is a good example for that.
The GUI in the screenshot you showed really does not make much of an attempt to deal with any of the issues I mentioned. It also makes the text hard to read, including Latin-based scripts, which are spaced apart too widely.
To diagnose and fix encoding issues, it might be more valuable to have a separate tool, like the screenshot from Neatpad above, so you can truly and easily see (and possibly fix in case of errors) the correspondence of hexadecimal pairs to glyphs and vice versa. In this case you could work with a defined selection of bytes, which would avoid all the issues related to huge files, and wrongly assumed start-offsets within them.
If combined with a structure view, this could be an alternative view for a common text edit control. You could also easily play around with start offsets for multi-byte encodings like UTF-8 or SHIFT-JIS.
Implementing something like this as a global view for the entire file (i.e., the text column) would however result in many issues as mentioned earlier.
For the more global view (i.e. the text column) I'd like a representation that is reasonably well readable directly (such as printing combining characters as one glyph), without focusing too much on the byte/2-byte/etc. to glyph correspondence, which is not an obvious/regular table-like representation anyways. It should approximate a text editor view as much as reasonable.
Finding out the right level of interpretation for text rendering encodings, such as UTF-8 and SHIFT-JIS, so search (and showing search results) works, is still a thing to be figured out. Especially because the starting offset (of the search/string matching) will affect whether certain strings will be found or not. The solution is likely that the same search has to be performed in parallel from several starting offsets, especially if you start the search somewhere random in the file.
There might be an option to improve efficiency by searching only for one of the possible alignments/synchronization schemes, if you know for sure a certain alignment exists. But this is unlikely to be true in too many cases and would be a possible later optimization, not a first implementation.
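To illustrate why the starting offset matters, here is a minimal sketch that tries every alignment for a fixed-size code unit. It runs the alignments one after the other rather than in parallel, and the function name and parameters are hypothetical.

```python
def find_at_offsets(data: bytes, needle: str, encoding: str, unit_size: int) -> list:
    """Search for needle, decoding data from each possible alignment.

    Returns (start_offset, char_index) pairs, one per alignment that matched.
    """
    hits = []
    for start in range(unit_size):
        # Misaligned data decodes to unrelated characters or U+FFFD.
        text = data[start:].decode(encoding, errors="replace")
        index = text.find(needle)
        if index != -1:
            hits.append((start, index))
    return hits


# One stray byte in front shifts the UTF-16 text to an odd offset.
data = b"\x00" + "hello".encode("utf-16-le")
print(find_at_offsets(data, "hello", "utf-16-le", 2))  # [(1, 0)]
```

Decoding from offset 0 pairs up the wrong bytes, so the string is only found when decoding starts at offset 1. Variable-width encodings such as SHIFT-JIS need more care than this, because a single fixed unit size does not describe all possible synchronization points.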