Is a non-break space whitespace?

Peter Kenny

Hello All

 

I have a little puzzle to disturb your Sunday lunch, maybe. I have been scraping text data from web pages, which often comes with redundant spaces before or after. I routinely use ‘trim’ on the final string output, but I have found cases where redundant spaces remain. Inspecting the results, I find that the characters are non-break spaces (codepoint 160, Unicode U+00A0). Looking at the code, String>>#trim depends on Character>>#isSeparator, which does not answer true for a non-break space. I can use trimBoth: [:char| char asInteger = 160] to remove the redundant spaces if I know where to expect them, so it is not a major problem. But the question remains: should non-break space be included in the list of separators in Character>>#isSeparator?

 

Peter Kenny
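
A minimal Pharo workspace sketch of the behaviour described in the post above. The variable names are illustrative, and the comments restate what the post reports rather than output checked against a particular Pharo version:

```smalltalk
| nbsp scraped trimmed |
"U+00A0 NO-BREAK SPACE, codepoint 160"
nbsp := Character value: 160.

"Simulate scraped text with a non-break space at each end"
scraped := (String with: nbsp) , 'some text' , (String with: nbsp).

"#trim relies on #isSeparator, so, as reported in the post,
 the non-break spaces are expected to survive this call"
trimmed := scraped trim.

"Trimming with a block that accepts the usual separators and
 also codepoint 160 removes them at both ends"
trimmed := scraped trimBoth: [ :char |
	char isSeparator or: [ char asInteger = 160 ] ].
```

Combining #isSeparator with the codepoint test in one block means ordinary spaces, tabs and line ends are still trimmed, which the block from the post (testing only for 160) would not do.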

 

Re: Is a non-break space whitespace?

Stephan Eggermont-3
In Unicode, there are many more 'characters' that could be considered
whitespace. You are collecting data from web pages, so you have no
influence on what you'll get as input. I don't think this should be
solved in #isSeparator.

Stephan



Re: Is a non-break space whitespace?

Stephane Ducasse-3
In reply to this post by Peter Kenny
Maybe we should have a trimWithSeparators: #() where we can specify
what we want.
As for the question about isSeparator, I do not know.


Re: Is a non-break space whitespace?

Ben Coman
In reply to this post by Peter Kenny



Off the cuff.. intuitively "by definition" a Non-Break space seems to be Not-a-Separator.  
If the web pages you are scraping are misusing non-break space to munge formatting, that is not something to be solved by modifying semantics of #isSeparator.
Stef's suggestion for a selector that takes a list of separators seems appropriate.

cheers -ben
Re: Is a non-break space whitespace?

Richard Sargent
Administrator
Ben Coman wrote
> Off the cuff.. intuitively "by definition" a Non-Break space seems to be
> Not-a-Separator.
> If the web pages you are scraping are misusing non-break space to munge
> formatting, that is not something to be solved by modifying semantics of
> #isSeparator.
> Stef's suggestion for a selector that takes a list of separators seems
> appropriate.


Rather than off-the-cuffing anything, please honour the Unicode Character
Properties. Refer to
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among
others.



--
Sent from: http://forum.world.st/Pharo-Smalltalk-Users-f1310670.html

Re: Is a non-break space whitespace?

Sven Van Caekenberghe-2

On 25 Sep 2017, at 09:53, Richard Sargent <[hidden email]> wrote:

> Rather than off-the-cuffing anything, please honour the Unicode Character
> Properties. Refer to
> https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among
> others.

The Pharo Unicode Project (http://www.smalltalkhub.com/#!/~Pharo/Unicode) offers support for these properties.


There is a performance cost though (it is a big database to load/use).

I would not immediately change #isSeparator though.


Re: Is a non-break space whitespace?

Peter Kenny

Hello

 

One way of dealing with this in a general way is to introduce a new predicate #isWhitespace for Character, maybe following the Wikipedia definition as Richard suggests, and then either (a) recode String>>#trim and friends to use #isWhitespace rather than #isSeparator or (b) introduce a new operation String>>#trimWhitespace which uses #isWhitespace.

@stef. There is already an operation which allows us to specify the characters to be trimmed, which I mentioned in my original post. We write trimBoth: aBlock, where aBlock value: char answers true if char is to be trimmed. I knew I could solve my immediate problem like this; I just wondered if there should be something more general.

 

Peter Kenny
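
The #isWhitespace idea from the post above could be sketched as follows, with a block standing in for the proposed predicate. The names whiteSpaceCodePoints and isWhitespace are hypothetical, and the hard-coded code points are the ones carrying the White_Space property in Unicode 6.3 and later; a real implementation would more likely consult a Unicode database:

```smalltalk
| whiteSpaceCodePoints isWhitespace nbsp sample |
"Code points with the Unicode White_Space property (assumed fixed list)"
whiteSpaceCodePoints := Set new.
whiteSpaceCodePoints addAll: (16r0009 to: 16r000D).
whiteSpaceCodePoints addAll: #(16r0020 16r0085 16r00A0 16r1680).
whiteSpaceCodePoints addAll: (16r2000 to: 16r200A).
whiteSpaceCodePoints addAll: #(16r2028 16r2029 16r202F 16r205F 16r3000).

"Hypothetical stand-in for a Character>>#isWhitespace predicate"
isWhitespace := [ :char | whiteSpaceCodePoints includes: char asInteger ].

"Option (b) from the post, expressed with the existing trimBoth: aBlock"
nbsp := Character value: 16r00A0.
sample := (String with: nbsp) , 'some text' , (String with: nbsp).
sample trimBoth: isWhitespace.
```

Passing the predicate to trimBoth: leaves #isSeparator and #trim untouched, so existing callers keep their current behaviour.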

 


 

Re: Is a non-break space whitespace?

Ben Coman
In reply to this post by Richard Sargent


On Mon, Sep 25, 2017 at 3:53 PM, Richard Sargent <[hidden email]> wrote:

> Rather than off-the-cuffing anything, please honour the Unicode Character
> Properties. Refer to
> https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among
> others.

I hope it was clear I wasn't speaking from a position of authoritative knowledge.. 
Nice to learn something new.  Thanks for the correction.  
cheers -ben
Re: Is a non-break space whitespace?

Stephan Eggermont-3
In reply to this post by Richard Sargent
On 25-09-17 09:53, Richard Sargent wrote:
> Rather than off-the-cuffing anything, please honour the Unicode Character
> Properties. Refer to
> https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among
> others.

That is a good idea. But it won't help you if you scrape data from the
web, as you'll find plenty of bad encoding, as well as uncertainty over
which version of which standard was used (see the Mongolian vowel separator).

Stephan


Re: Is a non-break space whitespace?

Julián Maestri-2
The non-breaking space is a whitespace character or a separator, depending on the context.
As far as I know, the only case where it's not treated as a separator is text layout, where it means "do not break the line here". E.g. "A distance of 100 meters" (with an nbsp between 100 and meters) should be rendered as either:
> A distance of 100 meters
or:
> A distance of
> 100 meters
But not:
> A distance of 100
> meters

There's an opposite character (I don't remember the code) which means <you can break here if you need to> and which is NOT whitespace.
