[reportlab-users] BUGFIX: Re:    in paragraph
    Dirk Holtwick 
    dirk.holtwick at gmail.com
       
    Thu Dec  4 09:46:05 EST 2008
    
    
  
> you're absolutely right. I keep thinking delim is a set of chars, but 
> it's a string. If the above works for you I guess it'll be fine. Perhaps 
> we could code it a bit more efficiently by using _WSC_RE.split(text) 
> instead of re.split(_WSC_RE, text) or for the hyper speeders
Of course :)
> _WSC_RE_split = re.compile(u"[%s]" % re.escape(_WSC)).split
> .......
>        return [uword.encode('utf8') for uword in _WSC_RE_split(text)]
> 
> 
> In fact I notice that \s doesn't match \xa0, but I am uncertain if that 
> is intended or accidental.
It depends on the settings, see Python Manual:
-----------------8<---------------[cut here]
\s
When the LOCALE and UNICODE flags are not specified, matches any 
whitespace character; this is equivalent to the set [ \t\n\r\f\v]. With 
LOCALE, it will match this set plus whatever characters are defined as 
space for the current locale. If UNICODE is set, this will match the 
characters [ \t\n\r\f\v] plus whatever is classified as space in the 
Unicode character properties database.
-----------------8<---------------[cut here]
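A minimal sketch of the flag dependence described in the excerpt above. It assumes Python 3, where str patterns are Unicode-aware by default, so re.ASCII stands in for the flag-less behaviour of the Python 2 re module discussed in this thread. The variable names are illustrative only.

```python
import re

NBSP = u"\xa0"  # no-break space

# re.ASCII limits \s to [ \t\n\r\f\v], like Python 2 without LOCALE/UNICODE.
ascii_ws = re.compile(r"\s", re.ASCII)
# re.UNICODE (the Python 3 default for str patterns) consults the Unicode
# character properties database instead.
unicode_ws = re.compile(r"\s", re.UNICODE)

ascii_matches_nbsp = ascii_ws.match(NBSP) is not None
unicode_matches_nbsp = unicode_ws.match(NBSP) is not None

print(ascii_matches_nbsp, unicode_matches_nbsp)
```

So whether \s treats \xa0 as whitespace depends entirely on which flag is in effect.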
I think having an explicit rule set, as in our code, avoids a lot of 
trouble, since in Unicode \xa0 is classified as a space, as you already mentioned:
-----------------8<---------------[cut here]
 >>> u"\xa0".isspace()
True
-----------------8<---------------[cut here]
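A rough sketch of the explicit-rule-set approach, using the precompiled split suggested in the quoted text. The contents of _WSC below are an assumption for illustration only (the real definition in the ReportLab source may differ), and split_words is a hypothetical helper name. The quoted version also encodes each word to UTF-8, which is left out here to keep the Python 3 output readable.

```python
import re

# ASSUMPTION: an illustrative whitespace set that deliberately omits u"\xa0".
# The actual _WSC in ReportLab is not shown in this thread.
_WSC = u" \t\n\r\f\v"

# Compile once and keep the bound split method, as suggested in the quote.
_WSC_RE_split = re.compile(u"[%s]" % re.escape(_WSC)).split


def split_words(text):
    """Split on the explicit whitespace set only (hypothetical helper)."""
    return _WSC_RE_split(text)


# The no-break space is not in _WSC, so "10\xa0km" stays together.
words = split_words(u"10\xa0km to go")
print(words)
```

Because the character class is spelled out, the result does not change with the LOCALE or UNICODE flags.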
Dirk
    
    