Fix decoding error (issue 168)#169
Conversation
|
@christian-intra2net Thanks for contributing! Before any review though - have you run the unit tests? I believe they also need adaptation. |
|
Oops, sorry, I was not aware of unit tests. I think I found the problem, fixing it now... |
By decoding incomplete byte-input, we risk losing multi-byte characters if their byte-representation is not aligned with the end of our buffer. Fix this by concatenating bytes first, and only decode when we actually have to. In case of `Tail`` that is a little complicated, because we need to return text in-between `read()` calls. Outsource to a helper function with lots of documentation to clarify and reduce complexity. Signed-off-by: Plamen Dimitrov <plamen.dimitrov@intra2net.com>
4821842 to
2f9de5d
Compare
|
Great to see the CI passing, to clarify for everyone else - this PR fixes issue #168. |
pevogam
left a comment
There was a problem hiding this comment.
This is an initial review from me, overall this pull request fixes a really non-trivial issue!
| raw_data = os.read(expect_pipe, 1024) | ||
| if not raw_data: | ||
| return read, data | ||
| return read, data.decode(self.encoding, "ignore") |
There was a problem hiding this comment.
I assume from your comment in #168 (comment) this could be instead be turned into replace and thus provide the clarity you mentioned there. So let's see if we collect some feedback there on the original choices first and until then a "replace" setting here would rather be a requested change for additional improvement there here.
| return read, data.decode(self.encoding, "ignore") | ||
| read += len(raw_data) | ||
| data += raw_data.decode(self.encoding, "ignore") | ||
| data += raw_data |
There was a problem hiding this comment.
Definitely better not to decode raw data until the very end, I think this change improves the clarity and related better to the choice of naming.
| thread.join() | ||
|
|
||
|
|
||
| def partial_decode(input_bytes, encoding): |
There was a problem hiding this comment.
Maybe we could add a few unit tests for what behavior should be contracted with this function?
There was a problem hiding this comment.
Maybe there is also a better location for it e.g. in utils folder or something like that since the current module is purely structuring the classes in order of composition.
By decoding incomplete byte-input, we risk losing multi-byte characters, if their byte-representation is not aligned with the end of our buffer.
Fix this by concatenating bytes first, and only decode when we have to.
In case of Tail that is a little complicated, because we need to return text in-between read() calls. Outsource to a helper function with lots of documentation to clarify and reduce complexity.
Clarify error policy for decode: why not switch from "ignore" to "replace" to notify callers of decoding problems instead of hiding them?
Original author: Christian Herdtweck christian.herdtweck@intra2net.com