Re: Is a non-break space whitespace?

Posted by Stephan Eggermont-3 on Sep 24, 2017; 1:05pm
URL: https://forum.world.st/Is-a-non-break-space-whitespace-tp4971514p4971517.html

On 24-09-17 13:53, PBKResearch wrote:

> I have a little puzzle to disturb your Sunday lunch, maybe. I have been
> scraping text data from web pages, which often comes with redundant
> space before or after. I routinely use ‘trim’ on the final string
> output, but I have found cases where there are still redundant spaces.
> Inspecting the results, I find that the characters are non-break spaces
> (codepoint 160, Unicode U+00A0). Looking at the code, String>>#trim
> depends on Character>>#isSeparator, which does not answer true for a
> non-break space. I can use trimBoth: [:char| char asInteger = 160] to
> remove the redundant spaces if I know where to expect them, so it is not
> a major problem. But the question remains: should non-break space be
> included in the list of separators in Character>>#isSeparator?

In Unicode, there are many more 'characters' that could be considered
whitespace. You are collecting data from web pages, so you have no
influence on what you'll get as input. I don't think this should be
solved in #isSeparator.
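
As a rough sketch of handling this at the call site instead, the block passed to trimBoth: (the selector used in the quoted message) can accept the usual separators plus whatever extra code points turn up in the scraped data. The variable names and the list of code points below are only illustrative assumptions, not a complete set of Unicode whitespace:

```smalltalk
"Sketch only: extraSpaces is a hypothetical, non-exhaustive list of code points."
| extraSpaces nbsp raw trimmed |
"U+00A0 no-break space, U+2007 figure space, U+202F narrow no-break space"
extraSpaces := #(160 8199 8239).
nbsp := String with: (Character value: 160).
raw := nbsp , ' hello world ' , nbsp.
"Trim ordinary separators and the extra code points from both ends."
trimmed := raw trimBoth: [:char |
	char isSeparator or: [extraSpaces includes: char asInteger]].
```

This keeps the decision about what counts as whitespace with the code that knows where the input comes from, and leaves #isSeparator untouched.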

Stephan