[reportlab-users] BUGFIX: Re:    in paragraph
    Robin Becker 
    robin at reportlab.com
       
    Thu Dec  4 09:31:54 EST 2008
    
    
  
Dirk Holtwick wrote:
>> _WSC=u''.join((
>>     u'\u0009',    # HORIZONTAL TABULATION
>>     u'\u000A',    # LINE FEED
>>     u'\u000B',    # VERTICAL TABULATION
>>     u'\u000C',    # FORM FEED
>>     u'\u000D',    # CARRIAGE RETURN
>>     u'\u001C',    # FILE SEPARATOR
>>     u'\u001D',    # GROUP SEPARATOR
>>     u'\u001E',    # RECORD SEPARATOR
>>     u'\u001F',    # UNIT SEPARATOR
>>     u'\u0020',    # SPACE
>>     u'\u0085',    # NEXT LINE
>>     #u'\u00A0', # NO-BREAK SPACE
>>     u'\u1680',    # OGHAM SPACE MARK
>>     u'\u2000',    # EN QUAD
>>     u'\u2001',    # EM QUAD
>>     u'\u2002',    # EN SPACE
>>     u'\u2003',    # EM SPACE
>>     u'\u2004',    # THREE-PER-EM SPACE
>>     u'\u2005',    # FOUR-PER-EM SPACE
>>     u'\u2006',    # SIX-PER-EM SPACE
>>     u'\u2007',    # FIGURE SPACE
>>     u'\u2008',    # PUNCTUATION SPACE
>>     u'\u2009',    # THIN SPACE
>>     u'\u200A',    # HAIR SPACE
>>     u'\u200B',    # ZERO WIDTH SPACE
>>     u'\u2028',    # LINE SEPARATOR
>>     u'\u2029',    # PARAGRAPH SEPARATOR
>>     u'\u202F',    # NARROW NO-BREAK SPACE
>>     u'\u205F',    # MEDIUM MATHEMATICAL SPACE
>>     u'\u3000',    # IDEOGRAPHIC SPACE
>>     ))
>>
>> #on UTF8 branch, split and strip must be unicode-safe!
>> def split(text, delim=None):
>>     if type(text) is str: text = text.decode('utf8')
>>     if type(delim) is str: delim = delim.decode('utf8')
>>     if delim is None and u'\xa0' in text:
>>         delim = _WSC
>>     return [uword.encode('utf8') for uword in text.split(delim)]
>>
>>
>>
>> can you check this against your problem cases?
> 
> I don't think the last line will work like this. I think it should be 
> more like this:
> 
> -----------------8<---------------[cut here]
> import re
> _WSC_RE = re.compile(u"[%s]" % re.escape(_WSC))
> 
> def split(text, delim=None):
>     if type(text) is str: text = text.decode('utf8')
>     if type(delim) is str: delim = delim.decode('utf8')
>     if delim is None and u'\xa0' in text:
>         return [uword.encode('utf8') for uword in re.split(_WSC_RE, text)]
>     return [uword.encode('utf8') for uword in text.split(delim)]
> -----------------8<---------------[cut here]
> 
> This one worked fine in my version.
> 
> Dirk
you're absolutely right. I keep thinking delim is a set of chars, but it's a 
string. If the above works for you I guess it'll be fine. Perhaps we could code 
it a bit more efficiently by using _WSC_RE.split(text) instead of 
re.split(_WSC_RE, text) or for the hyper speeders
_WSC_RE_split = re.compile(u"[%s]" % re.escape(_WSC)).split
.......
        return [uword.encode('utf8') for uword in _WSC_RE_split(text)]
In fact I notice that \s doesn't match \xa0, but I am uncertain if that is 
intended or accidental.
-- 
Robin Becker
    
    
More information about the reportlab-users
mailing list