
Re: Is a non-break space whitespace?

Posted by Ben Coman on Sep 25, 2017; 7:23am
URL: https://forum.world.st/Is-a-non-break-space-whitespace-tp4971514p4971771.html



On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <[hidden email]> wrote:

Hello All

 

I have a little puzzle to disturb your Sunday lunch, maybe. I have been scraping text data from web pages, which often comes with redundant spaces before or after it. I routinely use ‘trim’ on the final string output, but I have found cases where redundant spaces still remain. Inspecting the results, I find that the characters are non-break spaces (codepoint 160, Unicode U+00A0). Looking at the code, String>>#trim depends on Character>>#isSeparator, which does not answer true for a non-break space. I can use trimBoth: [:char| char asInteger = 160] to remove the redundant spaces if I know where to expect them, so it is not a major problem. But the question remains: should non-break space be included in the list of separators in Character>>#isSeparator?

 

Peter Kenny

 


Off the cuff: intuitively, "by definition" a non-break space seems to be not-a-separator.
If the web pages you are scraping misuse non-break spaces to munge formatting, that is not something to be solved by modifying the semantics of #isSeparator.
Stef's suggestion for a selector that takes a list of separators seems appropriate.
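
As a rough illustration of that idea, here is a minimal sketch that uses only String>>#trimBoth: and Character>>#isSeparator, both mentioned above. The names extraSeparators and scraped, and the sample string, are made up for the example; this is not an existing selector, just one way the caller could supply its own list of separators without touching #isSeparator.

```smalltalk
"Sketch only: extraSeparators and scraped are hypothetical names."
| nbsp extraSeparators scraped |
nbsp := Character value: 160.	"non-break space, U+00A0"
extraSeparators := { nbsp }.	"caller-supplied list of extra separators"
scraped := (String with: nbsp), ' Hello world ', (String with: nbsp).

"Trim the usual separators plus anything in extraSeparators from both ends."
scraped trimBoth: [ :char |
	char isSeparator or: [ extraSeparators includes: char ] ]
```

Evaluated in a playground, this should answer 'Hello world' with the inner space intact, while plain #trim on the same string would leave the non-break spaces at both ends.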

cheers -ben