[reportlab-users] BUGFIX: Re: &nbsp; in paragraph
    Robin Becker 
    robin at reportlab.com
       
    Wed Dec  3 11:35:00 EST 2008
    
    
  
Dirk Holtwick wrote:
> Hi,
> 
> to fix the described error please modify the following function in 
> "paragraph.py":
> 
> -----------------8<---------------[cut here]
> #on UTF8 branch, split and strip must be unicode-safe!
> def split(text, delim=None):
>     if type(text) is str: text = text.decode('utf8')
>     if type(delim) is str: delim = delim.decode('utf8')
>     # This fixes the &nbsp; issue and multiple line breaks on the split page part
>     if delim is None and text == u'\xa0':
>         delim = ' '
>     return [uword.encode('utf8') for uword in text.split(delim)]
> -----------------8<---------------[cut here]
.......
I think this works in some special cases, particularly when using the &nbsp;
form. However, it still splits in the wrong place when u'\xa0' is embedded in
the string in a more normal way,
e.g. even using the above
 >>> split(u'a\nb\xa0\tbbbb')
['a', 'b', 'bbbb']
whereas we presumably don't want \xa0 to be regarded as a split point. The
problem lies with Python's unicode split, which treats a delim of None as
meaning any whitespace character. In the C code these seem to be used:
> u'\u0009',	# HORIZONTAL TABULATION
> u'\u000A',	# LINE FEED
> u'\u000B',	# VERTICAL TABULATION
> u'\u000C',	# FORM FEED
> u'\u000D',	# CARRIAGE RETURN
> u'\u001C',	# FILE SEPARATOR
> u'\u001D',	# GROUP SEPARATOR
> u'\u001E',	# RECORD SEPARATOR
> u'\u001F',	# UNIT SEPARATOR
> u'\u0020',	# SPACE
> u'\u0085',	# NEXT LINE
> u'\u00A0',	# NO-BREAK SPACE
> u'\u1680',	# OGHAM SPACE MARK
> u'\u2000',	# EN QUAD
> u'\u2001',	# EM QUAD
> u'\u2002',	# EN SPACE
> u'\u2003',	# EM SPACE
> u'\u2004',	# THREE-PER-EM SPACE
> u'\u2005',	# FOUR-PER-EM SPACE
> u'\u2006',	# SIX-PER-EM SPACE
> u'\u2007',	# FIGURE SPACE
> u'\u2008',	# PUNCTUATION SPACE
> u'\u2009',	# THIN SPACE
> u'\u200A',	# HAIR SPACE
> u'\u200B',	# ZERO WIDTH SPACE
> u'\u2028',	# LINE SEPARATOR
> u'\u2029',	# PARAGRAPH SEPARATOR
> u'\u202F',	# NARROW NO-BREAK SPACE
> u'\u205F',	# MEDIUM MATHEMATICAL SPACE
> u'\u3000',	# IDEOGRAPHIC SPACE
so I believe we can change split to a better scheme using:
_WSC=u''.join((
	u'\u0009',	# HORIZONTAL TABULATION
	u'\u000A',	# LINE FEED
	u'\u000B',	# VERTICAL TABULATION
	u'\u000C',	# FORM FEED
	u'\u000D',	# CARRIAGE RETURN
	u'\u001C',	# FILE SEPARATOR
	u'\u001D',	# GROUP SEPARATOR
	u'\u001E',	# RECORD SEPARATOR
	u'\u001F',	# UNIT SEPARATOR
	u'\u0020',	# SPACE
	u'\u0085',	# NEXT LINE
	#u'\u00A0', # NO-BREAK SPACE
	u'\u1680',	# OGHAM SPACE MARK
	u'\u2000',	# EN QUAD
	u'\u2001',	# EM QUAD
	u'\u2002',	# EN SPACE
	u'\u2003',	# EM SPACE
	u'\u2004',	# THREE-PER-EM SPACE
	u'\u2005',	# FOUR-PER-EM SPACE
	u'\u2006',	# SIX-PER-EM SPACE
	u'\u2007',	# FIGURE SPACE
	u'\u2008',	# PUNCTUATION SPACE
	u'\u2009',	# THIN SPACE
	u'\u200A',	# HAIR SPACE
	u'\u200B',	# ZERO WIDTH SPACE
	u'\u2028',	# LINE SEPARATOR
	u'\u2029',	# PARAGRAPH SEPARATOR
	u'\u202F',	# NARROW NO-BREAK SPACE
	u'\u205F',	# MEDIUM MATHEMATICAL SPACE
	u'\u3000',	# IDEOGRAPHIC SPACE
	))
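import re
# unicode.split(delim) treats a multi-character delim as one literal separator,
# so split on runs of any _WSC character with a regex character class instead
_wsc_re_split = re.compile(u'[%s]+' % re.escape(_WSC)).split
#on UTF8 branch, split and strip must be unicode-safe!
def split(text, delim=None):
	if type(text) is str: text = text.decode('utf8')
	if type(delim) is str: delim = delim.decode('utf8')
	if delim is None and u'\xa0' in text:
		# drop empty strings to match the behaviour of split(None)
		return [uword.encode('utf8') for uword in _wsc_re_split(text) if uword]
	return [uword.encode('utf8') for uword in text.split(delim)]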
can you check this against your problem cases?
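
For anyone wanting to experiment outside ReportLab, here is a minimal standalone sketch of the idea. It compares the default split, a split with the whole character set passed as the delimiter, and a regex character class. The names BREAKING_WS, _ws_run and split_keep_nbsp are hypothetical and not part of paragraph.py. The character set is only a subset of the list above, and the sketch works on text strings rather than the UTF-8 byte strings used on the UTF8 branch, so treat it as an illustration rather than a drop-in patch.

```python
import re

# Hypothetical names for illustration; a subset of the whitespace list
# above, with NO-BREAK SPACE (U+00A0) deliberately left out.
BREAKING_WS = u'\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\u1680\u2028\u2029\u3000'

# A character class matches any ONE of the characters; '+' collapses runs.
_ws_run = re.compile(u'[%s]+' % re.escape(BREAKING_WS))

def split_keep_nbsp(text):
    """Split text on breaking whitespace only, leaving U+00A0 inside words."""
    # Filter out empty strings so leading/trailing whitespace behaves
    # like the default split().
    return [word for word in _ws_run.split(text) if word]

sample = u'a\nb\xa0\tbbbb'

# Default split: every whitespace character, including U+00A0, is a break.
default_result = sample.split()            # ['a', 'b', 'bbbb']

# Passing the whole set as the delimiter searches for that exact sequence,
# which never occurs in the sample, so nothing is split.
literal_result = sample.split(BREAKING_WS)

# The regex version breaks at '\n' and '\t' but keeps '\xa0' attached to 'b'.
regex_result = split_keep_nbsp(sample)     # ['a', 'b\xa0', 'bbbb']
```

The point of the comparison is that a string delimiter is always matched as one literal separator, so "split on any of these characters" needs a character class (or an equivalent loop) rather than a longer delimiter.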
-- 
Robin Becker
    
    