[Scons-dev] Merge PR #235 before release

Kenny, Jason L jason.l.kenny at intel.com
Wed May 27 09:33:29 EDT 2015


I would agree with this.

In general, operating systems today store file system data (i.e. the file system's own data, not the contents of the files) in Unicode, be it UTF-16 or UTF-8. On Linux this is not always the case: it could be Big5 or some other locale encoding. On Linux there are ways to find out what the “native” encoding is so that it can be used.
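As a rough illustration of querying the "native" encoding, here is a minimal sketch using the Python 3 standard library. The helper name `native_encodings` is hypothetical and is not part of SCons. The values returned depend on the platform and locale settings, so no particular output is assumed.

```python
import locale
import sys


def native_encodings():
    """Return the encodings Python reports for the current system.

    Hypothetical helper for illustration only; not part of SCons.
    """
    return {
        # Encoding Python uses to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding the current locale prefers for text data.
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for kind, name in native_encodings().items():
        print(kind, "->", name)
```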

I should note that the idea of converting binary to Unicode does not really exist. The point of a binary string is to hold arbitrary data (e.g. a double in its raw 64-bit form rather than the decimal value 1.2385). One can assume that the bytes are in a certain code page encoding and convert from that. And as I stated above, there are APIs to find out what the locale code page encoding is, and that can be used to convert the text to the local ANSI/OEM encoding. This is different from a binary string.
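The distinction above can be sketched in Python 3 as follows. The code pages chosen here (cp1252 as an "ANSI" code page, cp437 as an "OEM" code page) are assumptions picked only for illustration. The same byte decodes to different characters depending on which code page is assumed, while the packed double is not text under any encoding.

```python
import struct

# A double in its raw 64-bit form: 8 bytes that are data, not text.
raw_double = struct.pack("<d", 1.2385)
restored = struct.unpack("<d", raw_double)[0]

# A byte that is meant as text, but only once a code page is assumed.
text_byte = b"\xe9"
as_ansi = text_byte.decode("cp1252")  # assumed Windows "ANSI" code page
as_oem = text_byte.decode("cp437")    # assumed IBM PC "OEM" code page

if __name__ == "__main__":
    print(len(raw_double), restored)
    print(repr(as_ansi), repr(as_oem))
```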

Jason



From: Scons-dev [mailto:scons-dev-bounces at scons.org] On Behalf Of Gary Oberbrunner
Sent: Wednesday, May 27, 2015 7:43 AM
To: SCons developer list
Subject: Re: [Scons-dev] Merge PR #235 before release


On Wed, May 27, 2015 at 6:52 AM, anatoly techtonik <techtonik at gmail.com> wrote:
What I need is a bulletproof way to convert from anything to unicode. This
requires some kind of escaping to go forward and back. Some helper
methods like u2b() (unicode to binary) and b2u(). I am quite surprised that
so far I found nothing for this "simple" case.

That's because in general the encoding of the "binary" string is unknown.  Is it ASCII, UTF-8, Windows CP-1252, Shift-JIS, or something else?  You can't decode such a string to Unicode without knowing the encoding.  Check out the python-3 branch where we've been working through some of those issues.  Your u2b is "easy" if you assume you want the binary to be UTF-8 encoded, which is normally safe; this conversion is guaranteed to work.  Your b2u is not so easy.  You can't just assume UTF-8 as you might think; if the string has invalid UTF-8 bytes, it'll raise an error or generate dummy chars depending on the args you pass to str.decode().  At the very least, it'll get mangled if it's in a different encoding than you expect.
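A minimal sketch of the two helpers discussed above, assuming Python 3 and UTF-8 as the default encoding. The names `u2b` and `b2u` come from the message being replied to; the `encoding` and `errors` parameters are assumptions added for illustration, and this is not code from the python-3 branch. The `surrogateescape` error handler (PEP 383) is one possible way to get a reversible conversion for bytes that are not valid UTF-8; it keeps the bytes recoverable but does not reveal their real encoding.

```python
def u2b(text, encoding="utf-8", errors="strict"):
    """Encode a Unicode string to bytes (UTF-8 unless told otherwise)."""
    return text.encode(encoding, errors)


def b2u(data, encoding="utf-8", errors="strict"):
    """Decode bytes to a Unicode string, assuming the given encoding."""
    return data.decode(encoding, errors)


if __name__ == "__main__":
    bad = b"caf\xe9"  # valid CP-1252, but not valid UTF-8

    # Strict decoding raises an error on the invalid byte.
    try:
        b2u(bad)
    except UnicodeDecodeError as exc:
        print("strict:", exc)

    # "replace" substitutes a dummy character (U+FFFD) and loses the byte.
    print("replace:", repr(b2u(bad, errors="replace")))

    # "surrogateescape" smuggles the invalid byte through so that encoding
    # with the same handler restores the original bytes.
    escaped = b2u(bad, errors="surrogateescape")
    print("round trip ok:", u2b(escaped, errors="surrogateescape") == bad)
```

Note that a string produced with `surrogateescape` cannot be encoded with the default `strict` handler, so both directions have to agree on the error handler.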

--
Gary