Re: Is a non-break space whitespace?

Posted by Ben Coman on Sep 25, 2017; 12:08pm
URL: https://forum.world.st/Is-a-non-break-space-whitespace-tp4971514p4971891.html



On Mon, Sep 25, 2017 at 3:53 PM, Richard Sargent <[hidden email]> wrote:
Ben Coman wrote
> On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <peter@.co> wrote:
>
>> Hello All
>>
>>
>>
>> I have a little puzzle to disturb your Sunday lunch, maybe. I have been
>> scraping text data from web pages, which often comes with redundant space
>> before or after. I routinely use ‘trim’ on the final string output, but I
>> have found cases where there are still redundant spaces. Inspecting the
>> results, I find that the characters are non-break spaces (codepoint 160,
>> Unicode U+00A0). Looking at the code, String>>#trim depends on
>> Character>>#isSeparator, which does not answer true for a non-break
>> space.
>> I can use trimBoth: [:char| char asInteger = 160] to remove the redundant
>> spaces if I know where to expect them, so it is not a major problem. But
>> the question remains: should non-break space be included in the list of
>> separators in Character>>#isSeparator?
>>
>>
>>
>> Peter Kenny
>>
>>
>>
>
> Off the cuff.. intuitively "by definition" a Non-Break space seems to be
> Not-a-Separator.
> If the web pages you are scraping are misusing non-break space to munge
> formatting, that is not something to be solved by modifying semantics of
> #isSeparator.
> Stef's suggestion for a selector that takes a list of separators seems
> appropriate.


Rather than off-the-cuffing anything, please honour the Unicode Character
Properties. Refer to
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among
others.

I hope it was clear I wasn't speaking from a position of authoritative knowledge.
Nice to learn something new.  Thanks for the correction.
cheers -ben
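
For anyone hitting the same problem, here is a minimal Pharo sketch of the workaround discussed above: trim with a block that accepts U+00A0 in addition to whatever Character>>#isSeparator already accepts. The variable names (nbsp, scraped, cleaned) and the sample string are made up for illustration, and it assumes String>>#trimBoth: takes a one-argument block as in Peter's message. It leaves #isSeparator itself untouched.

```smalltalk
| nbsp scraped cleaned |
"U+00A0 NO-BREAK SPACE, which Unicode lists with the White_Space property"
nbsp := Character value: 160.

"Hypothetical scraped text, padded with a non-break space and an ordinary space on each side"
scraped := (String with: nbsp with: Character space), 'hello', (String with: Character space with: nbsp).

"Trim ordinary separators and non-break spaces from both ends"
cleaned := scraped trimBoth: [ :char | char isSeparator or: [ char = nbsp ] ].
```

Keeping the extra test in the trimming block means only the scraping code changes behaviour, whichever way the #isSeparator question is settled.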