[reportlab-users] Encoding UTF-8 instead of PDFDoc

Wed Mar 1 07:23:50 EST 2017

On 01/03/2017 05:05, Koki Nomura wrote:
> Hi,
>
> pdfdocEnc() in pdfdoc.py raises a UnicodeEncodeError as below when I
> process a PDF file with Unicode characters. I'm running my script on Python
> 3.6.0.
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\x00' in
> position 11: character maps to <undefined>
>
> This error disappears when I change the encoding from extpdfdoc to utf-8 in
> this block of code.
>
> if isPy3:
>     def pdfdocEnc(x):
>         return x.encode('extpdfdoc') if isinstance(x,str) else x
>
> While I don't fully understand the 'extpdfdoc' encoding, can we change this
> encoding to utf-8, since the PDF specification allows Unicode as well as
> PDFDocEncoding?
>
> Thanks,
> Koki
........
Hi Koki,

I'm not sure whether this is a good idea. The pdfdocEnc function is supposed to
accept either a bytestring or a unicode string. The output is 'supposed' to be
acceptable to PDF, and for that we would normally expect to use the pdfdoc
standard encoding. The extpdfdoc encoding just adds CR ('\r') and LF ('\n'),
identity mapped.
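
To illustrate what "identity mapped" means and why a character outside the
table produces the 'charmap' error above, here is a minimal sketch. The tables
are toy stand-ins (printable ASCII only), not reportlab's real pdfdoc or
extpdfdoc tables, and the helper names charmap_enc and can_encode are made up
for this example.

```python
import codecs

# Toy stand-in for a charmap table: printable ASCII only, identity mapped.
# ASSUMPTION: this is NOT reportlab's real pdfdoc table, just an illustration.
base_table = {cp: cp for cp in range(0x20, 0x7F)}

# "Extended" variant: the same table plus CR and LF mapped to themselves.
ext_table = dict(base_table)
ext_table.update({ord('\r'): ord('\r'), ord('\n'): ord('\n')})


def charmap_enc(text, table):
    """Encode text with a dict-based charmap; unmapped characters raise."""
    encoded, _consumed = codecs.charmap_encode(text, 'strict', table)
    return encoded


def can_encode(text, table):
    """Return True if every character of text has an entry in table."""
    try:
        charmap_enc(text, table)
        return True
    except UnicodeEncodeError:
        return False
```

With these toy tables, 'a\r\nb' encodes only with ext_table, while a string
containing '\x00' fails with both, raising the same kind of UnicodeEncodeError
("character maps to <undefined>").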

Can you give an example of where this is going wrong, i.e. what you passed to a
reportlab function to cause the problem?

PDF does allow different encodings in various places, but usually we end up
using either pdfdoc or sometimes UTF-16. I don't think PDF allows UTF-8 in many
places; names are one case, and I believe some software allows URIs to be
encoded directly as UTF-8.
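
As a rough sketch of the "pdfdoc or UTF-16" choice for PDF text strings (not
what reportlab actually does): the hypothetical helper below is deliberately
conservative. It treats only printable ASCII, where PDFDocEncoding and ASCII
agree, as safe for a single-byte encoding, and sends everything else to
UTF-16BE with the leading FE FF byte order mark that marks a PDF text string as
UTF-16.

```python
import codecs


def encode_pdf_text_string(text):
    """Hypothetical helper, not part of reportlab.

    Bytes are passed through unchanged. Printable ASCII is encoded one byte
    per character. Anything else becomes UTF-16BE with a leading BOM.
    """
    if isinstance(text, bytes):
        return text
    # ASSUMPTION: restrict the single-byte path to 0x20..0x7E, where
    # PDFDocEncoding matches ASCII; a real pdfdoc codec covers more.
    if all(0x20 <= ord(ch) <= 0x7E for ch in text):
        return text.encode('ascii')
    return codecs.BOM_UTF16_BE + text.encode('utf-16-be')
```

For example, encode_pdf_text_string('Hello') gives b'Hello', while a string
with Japanese characters comes back as bytes starting with b'\xfe\xff'.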
-- 
Robin Becker