Hello All

I have a little puzzle to disturb your Sunday lunch, maybe. I have been scraping text data from web pages, which often comes with redundant space before or after. I routinely use ‘trim’ on the final string output, but I have found cases where there are still redundant spaces. Inspecting the results, I find that the characters are non-break spaces (codepoint 160, Unicode U+00A0). Looking at the code, String>>#trim depends on Character>>#isSeparator, which does not answer true for a non-break space.

I can use trimBoth: [:char| char asInteger = 160] to remove the redundant spaces if I know where to expect them, so it is not a major problem. But the question remains: should non-break space be included in the list of separators in Character>>#isSeparator?

Peter Kenny
On 24-09-17 13:53, PBKResearch wrote:
> But the question remains: should non-break space be included in the list of separators in Character>>#isSeparator.

In Unicode, there are many more 'characters' that could be considered whitespace. You are collecting data from web pages, so you have no influence on what you'll get as input. I don't think this should be solved in #isSeparator.

Stephan
In reply to this post by Peter Kenny
Maybe we should have a trimWithSeparators: #() where we can specify what we want. As for the question about isSeparator, I do not know.
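A minimal sketch of what such a selector could do, written as a Playground block rather than a real method. The name trimWithSeparators is hypothetical (taken from the suggestion above) and the list of characters is only an illustration; the sketch relies on the existing String>>#trimBoth: mentioned in the original post.

```smalltalk
| separators trimWithSeparators result |
"Illustrative list of characters to trim: space, tab, CR, LF and non-break space (160)."
separators := { Character space. Character tab. Character cr. Character lf. Character value: 160 }.

"Hypothetical stand-in for a trimWithSeparators: selector, built on String>>#trimBoth:."
trimWithSeparators := [ :string :chars |
	string trimBoth: [ :char | chars includes: char ] ].

"A string with a leading non-break space and ordinary spaces around the text."
result := trimWithSeparators
	value: (String with: (Character value: 160)), ' some text '
	value: separators.
result "expected to answer 'some text'"
```

The caller decides what counts as trimmable, so Character>>#isSeparator stays untouched.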
In reply to this post by Peter Kenny
On Sun, Sep 24, 2017 at 7:53 PM, PBKResearch <[hidden email]> wrote:
Off the cuff.. intuitively "by definition" a Non-Break space seems to be Not-a-Separator.

If the web pages you are scraping are misusing non-break space to munge formatting, that is not something to be solved by modifying the semantics of #isSeparator.

Stef's suggestion for a selector that takes a list of separators seems appropriate.

cheers -ben
Ben Coman wrote
> Off the cuff.. intuitively "by definition" a Non-Break space seems to be Not-a-Separator.
> If the web pages you are scraping are misusing non-break space to munge formatting, that is not something to be solved by modifying semantics of #isSeparator.
> Stef's suggestion for a selector that takes a list of separators seems appropriate.

Rather than off-the-cuffing anything, please honour the Unicode Character Properties. Refer to https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among others.
On 25 Sep 2017, at 09:53, Richard Sargent <[hidden email]> wrote:

The Pharo Unicode Project (http://www.smalltalkhub.com/#!/~Pharo/Unicode) offers support for these properties. There is a performance cost though (it is a big database to load/use). I would not immediately change #isSeparator though.
Hello

One way of dealing with this in a general way is to introduce a new predicate #isWhitespace for Character, maybe following the Wikipedia definition as Richard suggests, and then either
(a) recode String>>#trim and friends to use #isWhitespace rather than #isSeparator, or
(b) introduce a new operation String>>#trimWhitespace which uses #isWhitespace.

@stef. There is already an operation which allows us to specify the characters to be trimmed, which I mentioned in my original post. We write trimBoth: aBlock, where aBlock value: char answers true if char is to be trimmed. I knew I could solve my immediate problem like this; I just wondered if there should be something more general.

Peter Kenny

From: Pharo-users [mailto:[hidden email]] On Behalf Of Sven Van Caekenberghe

> The Pharo Unicode Project (http://www.smalltalkhub.com/#!/~Pharo/Unicode) offers support for these properties. There is a performance cost though (it is a big database to load/use). I would not immediately change #isSeparator though.
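Option (b) could look roughly like the sketch below. It is only an illustration: #isWhitespace and #trimWhitespace do not exist under these names as far as this thread shows, so they are modelled here as Playground blocks, and the code points are hard-coded from the Unicode White_Space property instead of being read from the Unicode database. That list has changed between Unicode versions, so treat it as an assumption.

```smalltalk
| isWhitespace trimWhitespace nbsp scraped result |
"Block standing in for the proposed (hypothetical) Character>>#isWhitespace.
 It checks the code points that carry the Unicode White_Space property."
isWhitespace := [ :char | | cp |
	cp := char asInteger.
	(cp between: 16r0009 and: 16r000D)
		or: [ (cp between: 16r2000 and: 16r200A)
			or: [ #(16r0020 16r0085 16r00A0 16r1680 16r2028 16r2029 16r202F 16r205F 16r3000) includes: cp ] ] ].

"Block standing in for the proposed (hypothetical) String>>#trimWhitespace,
 built on the existing String>>#trimBoth:."
trimWhitespace := [ :string | string trimBoth: isWhitespace ].

"Scraped text with non-break spaces (U+00A0) at both ends."
nbsp := String with: (Character value: 16r00A0).
scraped := nbsp, ' some text', nbsp.
result := trimWhitespace value: scraped.
result "expected to answer 'some text'"
```

Hard-coding the ranges avoids loading the full Unicode database, at the price of having to update the list by hand when the standard changes.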
In reply to this post by Richard Sargent
On Mon, Sep 25, 2017 at 3:53 PM, Richard Sargent <[hidden email]> wrote:

I hope it was clear I wasn't speaking from a position of authoritative knowledge.. Nice to learn something new. Thanks for the correction.

cheers -ben
In reply to this post by Richard Sargent
On 25-09-17 09:53, Richard Sargent wrote:
> Rather than off-the-cuffing anything, please honour the Unicode Character Properties. Refer to https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace, among others.

That is a good idea. It won't help you, though, if you scrape data from the web, as you'll find plenty of bad encoding, and it is often unclear which version of which standard was used (see the Mongolian vowel separator).

Stephan
The non-breaking space is a whitespace character or a separator depending on the context. As far as I know, the only place where it is not treated as a separator is text layout, where it means "do not break the line here". For example, "A distance of 100 meters" (with an nbsp between 100 and meters) should be rendered as either:

> A distance of 100 meters

or:

> A distance of
> 100 meters

But not:

> A distance of 100
> meters

There's an opposite character (I don't remember the code) which means <you can break here if you need to>, and which is NOT a whitespace.
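To make the layout point concrete, here is a small Playground sketch. It treats only the ordinary space as a break opportunity, which is an assumption made for illustration and not how a real text layout engine decides line breaks; it uses the existing String>>#substrings: to split the example sentence.

```smalltalk
| nbsp text chunks |
"The example sentence, with a non-break space (U+00A0) between 100 and meters."
nbsp := String with: (Character value: 16r00A0).
text := 'A distance of 100', nbsp, 'meters'.

"Split only at ordinary spaces, i.e. at the places where a line break is allowed."
chunks := text substrings: ' '.
chunks size. "expected to answer 4: 'A', 'distance', 'of' and '100<nbsp>meters'"
chunks last = ('100', nbsp, 'meters') "expected to answer true"
```

The last chunk keeps 100 and meters together, so a break can never fall between them, even though the same character may reasonably count as whitespace when trimming.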