pasting non-hex characters from clipboard

Wishlists for new functionality and features.
Harmon20
Posts: 2
Joined: 07 Jul 2024 15:34

pasting non-hex characters from clipboard

Post by Harmon20 »

When pasting data from the Windows clipboard I was getting an error that there were non-hex characters in the data.

This was proper because there were non-hex characters in the data, but I had no indication where in the 160MB data stream the characters existed or what they were. Tracking the offending characters down was a big time sink.

It would be good to receive some indication of where to look in the data. Possibly the character count at which the bad characters were encountered, or a string of the last valid characters prior to the offending characters, or what the offending character was. Maybe import the data up to the point of the offending characters and drop everything thereafter? Or allow importation but mark the bad characters and refuse any further processing, saving, or export until the problem characters are corrected. Just throwing ideas out. Any information at all would be useful in tracking down the bad characters when it is a large data set.

For reference, I was pasting in the output of a terminal window after having done a hexdump on a pcap file in a Linux filesystem. I had no way to export the file off of the machine I was interacting with due to it being a stripped-down embedded OS reached by ssh through a convoluted port forward/NAT kind of arrangement, so I dumped the file to screen and was copying it to a local (to me) file via HxD. Apparently there were some spurious characters printed to the console window in the course of the dump that went by too fast for me to see.

Ultimately I located the characters by pasting the data into Notepad and halving it: I pasted the top half into HxD, then halved the remaining data and pasted its top half into HxD, and continued until I could no longer get a good top half. Then I started the halving process from the end of the remaining data and worked my way back toward the top until I had shrunk the offending data set enough to do a quick visual scan.

Josh
Maël
Site Admin
Posts: 1461
Joined: 12 Mar 2005 14:15

Re: pasting non-hex characters from clipboard

Post by Maël »

The error was supposed to remind people that the hex column and text column automatically interpret textual clipboard data differently:
  • the hex column interprets the text as hex digit pairs, each pair representing one byte from 00 to FF (the characters themselves arrive in a text encoding, so technically there are two conversions if the clipboard's text encoding first has to be converted to the internal standard, UTF-16LE).
  • the text column converts text from the clipboard to bytes, by converting from the clipboard text encoding to the text encoding chosen in HxD.
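
To make the difference concrete, here is a minimal Python sketch of the two interpretations. It is not HxD's code: the function names are hypothetical, `windows-1252` is only an assumed stand-in for whatever text encoding is selected in HxD, and Python's `bytes.fromhex` has its own whitespace rules that may differ from HxD's #1..#32 range.

```python
def paste_into_hex_column(clipboard_text: str) -> bytes:
    # Hex column: the text is read as a written-out hex dump.
    # bytes.fromhex skips ASCII whitespace between byte pairs and
    # raises ValueError on anything that is not a hex digit.
    return bytes.fromhex(clipboard_text)


def paste_into_text_column(clipboard_text: str,
                           encoding: str = "windows-1252") -> bytes:
    # Text column: every character is encoded as-is in the chosen
    # text encoding (the encoding here is an assumption).
    return clipboard_text.encode(encoding)


text = "48 78 44"
print(paste_into_hex_column(text))   # three bytes: b'HxD'
print(paste_into_text_column(text))  # eight bytes: b'48 78 44'
```

The same eight clipboard characters yield three bytes in one column and eight in the other, which is why the hex column cannot simply accept arbitrary text.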
Instead of the current error message, the user could be asked something like this:
The text to be pasted in the hex column contains invalid hex values, and may not be convertible to meaningful data.
The allowed characters are A to F, a to f, 0 to 9, and white space (#1..#32), or as regex: [A-Fa-f0-9\x01-\x20]
Attempt to filter all invalid characters?
Listing all the non-matching characters and their locations could create a huge list, and then you'd have to decide where to put that result. A message box usually won't be able to contain all that, and you wouldn't want pasting to create files with search results (where would it be located, and a text editor opening a file suddenly would be unexpected, too) or fill a listbox with locations of invalid characters, because then you would want to display the clipboard contents and click on the results to jump to locations.

Using a text editor to find (all) invalid characters seems more practical, as then you'd have all the options to change or inspect the text as you like. You could offer a button in the message box, that would automatically launch a text editor, paste the text, then perform the search and list the results, but I am not aware of any way to automate this in a way that works across various text editors.

So using the regex you could perform a search in any text editor with regex support, and find all the invalid characters. Mentioning the regex in the message box already seems to clutter the message, though. Also, message boxes usually don't allow formatting text to highlight code or use different fonts, like I did here.

When HxD supports regex, you could have default/suggested search expressions (to find all hex values or all invalid hex values). So you would have to paste the text in the text column, and then could use search there.
Maël
Site Admin
Posts: 1461
Joined: 12 Mar 2005 14:15

Re: pasting non-hex characters from clipboard

Post by Maël »

Harmon20 wrote: 07 Jul 2024 15:56 Ultimately I located the characters by pasting the data into Notepad and halving it: I pasted the top half into HxD, then halved the remaining data and pasted its top half into HxD, and continued until I could no longer get a good top half. Then I started the halving process from the end of the remaining data and worked my way back toward the top until I had shrunk the offending data set enough to do a quick visual scan.
An easier solution for locating invalid characters might have been to use a regex search in a text editor using the following expression (negated version from my previous post):
[^A-Fa-f0-9\x01-\x20]
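
For anyone without a regex-capable editor at hand, the same search can be sketched in a few lines of Python. This is only an illustration of the expression above; the helper name `find_invalid` and the sample dump are made up, and the offsets are 0-based character indexes from the start of the text.

```python
import re

# The negated character class from the post above.
INVALID_HEX = re.compile(r"[^A-Fa-f0-9\x01-\x20]")


def find_invalid(text: str):
    """Yield (offset, character) for every character the class rejects."""
    for match in INVALID_HEX.finditer(text):
        yield match.start(), match.group()


dump = "de ad be ef\n0g 12 ~4\n"
for offset, char in find_invalid(dump):
    print(f"offset {offset}: {char!r}")
```

For a 160MB dump the list could get long, so in practice you would probably stop after the first few hits.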

Do you have a list of characters that were causing issues? Maybe I could include them in the list to filter automatically, depending on what they are.

A final thought would be to offer a clipboard preview, where you could see the clipboard contents interpreted in various ways (text column/hex column, import from source code arrays, etc.), maybe a bit like the CSV import function in spreadsheet programs. That would be substantial work, though.
Harmon20
Posts: 2
Joined: 07 Jul 2024 15:34

Re: pasting non-hex characters from clipboard

Post by Harmon20 »

Oy, you've revealed the flaw in my proposal. My laziness.

My halving loop was a quick and dirty solution to what I had hoped was a one-off problem not requiring a lot of effort. I could write a sanitization script to do the work for me but, you know, lazy.
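
Such a sanitization script can be very small. The following is a rough sketch, assuming the allowed set is exactly the one given earlier in the thread; `sanitize` is a hypothetical name, not anything HxD provides.

```python
import re

# Everything that is not a hex digit or whitespace in the #1..#32 range.
INVALID_HEX = re.compile(r"[^A-Fa-f0-9\x01-\x20]")


def sanitize(text: str) -> str:
    """Return the text with all disallowed characters removed."""
    return INVALID_HEX.sub("", text)


print(sanitize("de ad >be ef|"))
```

One caveat: blind filtering only removes characters outside the allowed set. If the debris itself contains hex digits, those digits survive and can shift every following byte, so the result still deserves a quick check.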

I thought that since HxD was aware of an offending character and was going through the trouble of popping up a message box it might be a high use/low effort feature to tell me what that character was in the body of the error. I assumed, mistakenly perhaps, that it errored out upon encountering the first bad character and didn't bother looking for all of them. I wasn't looking for the complete list of problems with the data; if I'm pasting in a pile of garbage then that's totally on me.

My suggestions were based on pure conjecture of what routines I thought might have caused this error to pop up.

Possibly a sanity check was going through the data and examining each character and upon finding a bad one terminated the search and set the error condition, in which case it seemed a simple thing to add the last character checked to the error text. "Pasting is not possible, as the text in the clipboard contains invalid hex-values. A (%c, chr_chk) was found.\nIf you want to paste text, move to the text-column."
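
As a sketch of that conjecture (pure guesswork about the routine, not HxD's actual code), a check that stops at the first bad character could hand both the character and its offset to the message. All names here are hypothetical, and the offset is a 0-based character index.

```python
from typing import Optional, Tuple

HEX_DIGITS = set("0123456789abcdefABCDEF")


def is_allowed(char: str) -> bool:
    # Hex digits plus the whitespace range #1..#32 named in the thread.
    return char in HEX_DIGITS or 1 <= ord(char) <= 32


def first_invalid(text: str) -> Optional[Tuple[int, str]]:
    """Return (offset, character) of the first bad character, or None."""
    for offset, char in enumerate(text):
        if not is_allowed(char):
            return offset, char  # stop here, as the conjecture assumes
    return None


def paste_error(text: str) -> Optional[str]:
    hit = first_invalid(text)
    if hit is None:
        return None
    offset, char = hit
    return (
        "Pasting is not possible, as the text in the clipboard contains "
        f"invalid hex-values. A {char!r} was found at offset {offset}.\n"
        "If you want to paste text, move to the text-column."
    )


print(paste_error("00 ff zz"))
```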

Or possibly the data was being streamed from the clipboard to the hex container in HxD, in which case leave the stream up to the point of the error in place and terminate the stream rather than abandoning the paste job. This would allow an arbitrary string at the end of the pasted data to be located in the source data, pointing to the problem characters.
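
A minimal sketch of that second idea, again only a guess at how it might work: keep the bytes decoded before the first bad character and drop the rest. It assumes hex digits are paired across whitespace and that a dangling half byte at the cut is discarded; `paste_prefix` is a hypothetical name.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def paste_prefix(text: str) -> bytes:
    """Decode hex digits up to the first disallowed character."""
    digits = []
    for char in text:
        if char in HEX_DIGITS:
            digits.append(char)
        elif 1 <= ord(char) <= 32:
            continue  # allowed whitespace, skipped
        else:
            break  # first bad character: stop, keep what we have
    usable = len(digits) - len(digits) % 2  # drop a dangling half byte
    return bytes.fromhex("".join(digits[:usable]))


partial = paste_prefix("de ad be zz ef")
print(partial.hex(" "))  # the tail of this is what you would search for
```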

But if it isn't a relatively simple thing to implement then I agree, not worth the effort. Make the user clean up their data.

Josh
Maël
Site Admin
Posts: 1461
Joined: 12 Mar 2005 14:15

Re: pasting non-hex characters from clipboard

Post by Maël »

It does indeed stop at the first invalid character, but since this has to be a generic solution, it won't be useful for everyone: a compiler, parser, or something like that might output the first invalid character location in a file, but there you assume the file is source code written mostly in the right format, and then you can immediately jump to the right location by clicking in the GUI and viewing the error.

Since that isn't easily possible here, as mentioned before, it seems more useful to paste the text directly into a text editor and search there, to find all issues at once. You also can't copy the location to the clipboard, since that would replace the content you are trying to paste.

If it really is just a single invalid character, that could be useful indeed, but you still would have to first edit it in a text editor. I assume the one benefit would be that you don't need to do a regex search, but could directly write the location in a Goto window and jump there.

So it would be a tradeoff: how cluttered should the message box become and how readable will the message remain with all that additional information, since I think now including a regex would be more useful for the general case.

Partially pasting (up to the invalid character) despite the error message seems too specific a use case: the text could be entirely unsuited to being interpreted as hex values, and having just one or no bytes, or just a few "random" ones, appear would seem confusing and would have to be explained to the user somehow.
Maël
Site Admin
Posts: 1461
Joined: 12 Mar 2005 14:15

Re: pasting non-hex characters from clipboard

Post by Maël »

The revised error message could look like this:
Invalid hex values in the clipboard prevent pasting into the hex column. For arbitrary text, paste into the text column, instead.

The first invalid character (%c) is at line %d, character position %d.
The allowed characters are A to F, a to f, 0 to 9, and white space (#1..#32), or as regex: [A-Fa-f0-9\x01-\x20]

Attempt to filter invalid characters and proceed?

It would still require some changes, as you need to have a yes/no-user choice (instead of a simple error/exception raised), and you need to count lines and columns (codepoints, not UTF-16 code units), but I suppose that would be doable.
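
The line and column counting could look roughly like the following Python sketch. It is an illustration under assumptions, not HxD's implementation: only `\n` is treated as a line break, both numbers are 1-based, and positions count code points because that is how Python indexes strings. The function name is hypothetical; the message template is the one proposed above.

```python
import re

INVALID_HEX = re.compile(r"[^A-Fa-f0-9\x01-\x20]")
MESSAGE = "The first invalid character (%c) is at line %d, character position %d."


def locate_first_invalid(text: str):
    """Return (character, line, column) of the first bad character, or None."""
    match = INVALID_HEX.search(text)
    if match is None:
        return None
    offset = match.start()
    line = text.count("\n", 0, offset) + 1        # line breaks before the hit
    line_start = text.rfind("\n", 0, offset) + 1  # 0 when on the first line
    column = offset - line_start + 1
    return match.group(), line, column


hit = locate_first_invalid("00 11\n22 x3\n")
if hit is not None:
    print(MESSAGE % hit)
```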

Edit: I just noticed that a lot of text editors and even IDEs don't offer a way to jump to a character (from file start), or even to a character within a line; they can only jump to a line index. If your whole text has no line breaks, just jumping to a line number would be pointless. So I wonder how useful it is to include this information at all. Also, the absolute character position seems to be sometimes 1-based and sometimes 0-based. So the more reliable "standard" would be line/character position, which are both 1-based.