PetitParser question parsing HTML meta tags

Paul DeBruicker
This is kind of an "I'm tired of thinking about this and not making much progress for the amount of time I'm putting in" question, but here it is:



I'm trying to parse descriptions from HTML meta elements.  I can't use Soup because there isn't a working GemStone port.  

I've got it to work with the structure:

<meta name="description" content="my description">

and

<meta name="Description" content="my description">


but I'm running into instances of:

<meta http-equiv="description" content="my description">

and

<meta http-equiv="Description" content="my description">


and am having trouble adapting my parsing code (such as it is).


The parsing code that addresses the first two cases is:



parseHtmlPageForDescription: htmlString
  | startParser endParser ppStream descParser result text lower str doubleQuoteIndex |
  lower := 'escription' asParser.
  startParser := '<meta name=' asParser , #'any' asParser , #'any' asParser.
  endParser := '>' asParser.
  ppStream := htmlString readStream asPetitStream.
  descParser := ((#'any' asParser starLazy: startParser , lower)
    , (#'any' asParser starLazy: endParser)) ==> #'second'.
  result := descParser parse: ppStream.
  text := (result
    inject: (WriteStream on: String new)
    into: [ :stream :char |
      stream nextPut: char.
      stream ])
    contents trimBoth.
  str := text copyFrom: (text findString: 'content=') + 9 to: text size.
  doubleQuoteIndex := 8 - ((str last: 7) indexOf: $").
  ^ str copyFrom: 1 to: str size - doubleQuoteIndex


I can't figure out how to change the startParser parser to accept the second idiom.  And maybe there's a better approach altogether.  Anyway.  If anyone has any ideas on different approaches I'd appreciate learning them.  


Thanks for giving it some thought

Paul
Re: PetitParser question parsing HTML meta tags

Martin McClure-2
On 03/30/2017 10:58 AM, PAUL DEBRUICKER wrote:
> I can't figure out how to change the startParser parser to accept the second idiom.  And maybe there's a better approach altogether.  Anyway.  If anyone has any ideas on different approaches I'd appreciate learning them.  

This looks like a job for the ordered choice operator. Perhaps something
very roughly like this...

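"Leaves are defined first so each variable is assigned before it is used."
quote := $" asParser.
descAnyCase := 'description' asParser / 'Description' asParser.
nameDescTag := 'name=' asParser , quote , descAnyCase.
httpDescTag := 'http-equiv=' asParser , quote , descAnyCase.
descTag := nameDescTag / httpDescTag.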

And so on.
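
To make the ordered-choice idea concrete, here is a minimal workspace sketch of one way the pieces might be assembled into a parser that answers the content value. It assumes the attribute naming the description comes before content, that values are double-quoted, and that the PetitParser port in use provides caseInsensitive, negate, flatten and matchesSkipIn:. The variable names (quote, content, metaTag) are illustrative only and are not from the thread.

```smalltalk
| quote descAnyCase nameDescTag httpDescTag descTag content metaTag |
"Assumption: attribute values are always wrapped in double quotes."
quote := $" asParser.
"Accept description in any letter case, not only the two spellings seen so far."
descAnyCase := 'description' asParser caseInsensitive.
nameDescTag := 'name=' asParser , quote , descAnyCase , quote.
httpDescTag := 'http-equiv=' asParser , quote , descAnyCase , quote.
"Ordered choice: try the name form first, then the http-equiv form."
descTag := nameDescTag / httpDescTag.
"Everything between the quotes, flattened to a String."
content := (quote , quote negate star flatten , quote) ==> #second.
"Assumption: content follows the name or http-equiv attribute directly."
metaTag := ('<meta' asParser , #space asParser plus , descTag
    , #space asParser plus , 'content=' asParser , content) ==> #last.
"Scan the whole page and collect the content of every matching meta tag."
metaTag matchesSkipIn: '<html><meta http-equiv="Description" content="my description"></html>'
```

Because the parser is only tried at each position of the input, there is no need for the starLazy: prefix or the index arithmetic on the result. Tags with the attributes in the opposite order, or with single-quoted values, would need further alternatives.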

HTH.

Regards,

-Martin

Re: PetitParser question parsing HTML meta tags

monty-3
In reply to this post by Paul DeBruicker
You could use XMLHTMLParser from STHub PharoExtras/XMLParserHTML (supported on Pharo, Squeak, and GS):

descriptions := OrderedCollection new.
(XMLHTMLParser parseURL: aURL)
        allElementsNamed: 'meta'
        do: [:each |
                ((each attributeAt: 'name') asLowercase = 'description'
                        or: [(each attributeAt: 'http-equiv') asLowercase = 'description'])
                        ifTrue: [descriptions addLast: (each attributeAt: 'content')]].

It accepts messy HTML and produces an XML DOM tree from it.
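
Since the original method receives the page as a string rather than a URL, the same loop can probably be run on an in-memory string. The sketch below assumes the class-side parse: message, which the XMLParser DOM parsers normally provide, is available in the installed version; the sample HTML and the variable names are made up for illustration.

```smalltalk
| htmlString document descriptions |
"Hypothetical sample input; in practice this would be the fetched page."
htmlString := '<html><head><meta http-equiv="Description" content="my description"></head></html>'.
descriptions := OrderedCollection new.
"Assumption: parse: accepts a String and answers the DOM document."
document := XMLHTMLParser parse: htmlString.
document
    allElementsNamed: 'meta'
    do: [:each |
        ((each attributeAt: 'name') asLowercase = 'description'
            or: [(each attributeAt: 'http-equiv') asLowercase = 'description'])
            ifTrue: [descriptions addLast: (each attributeAt: 'content')]].
descriptions
```

This keeps the lookup independent of attribute order and quoting style, which the hand-written parser has to handle explicitly.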

> Sent: Thursday, March 30, 2017 at 1:58 PM
> From: "PAUL DEBRUICKER" <[hidden email]>
> To: "Any question about pharo is welcome" <[hidden email]>
> Subject: [Pharo-users] PetitParser question parsing HTML meta tags

Re: PetitParser question parsing HTML meta tags

monty-3
In reply to this post by Paul DeBruicker
XMLParserHTML is the fastest HTML parser on Pharo, Squeak, and GS. It has DOM and SAX parsers and works with other libs such as PharoExtras/XPath and PharoExtras/XMLParserStAX.

Element and attribute names are normalized to lowercase. Printing XML DOM trees back as HTML is complicated by browsers not recognizing XML-style self-closing tags ending with "/>" for some elements (like "script"), so use #printedWithoutSelfClosingTags, #printWithoutSelfClosingTagsOn:, or #printWithoutSelfClosingTagsToFileNamed: instead.
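
As a rough illustration of the printing advice, the following sketch round-trips a small page through the DOM and prints it back without self-closing tags. It assumes the class-side parse: message accepts a String and that the printing message is understood by the resulting document; the sample markup is invented.

```smalltalk
| document html |
"Hypothetical input with an empty script element."
document := XMLHTMLParser parse: '<html><head><script src="app.js"></script></head><body></body></html>'.
"Print with explicit end tags instead of the XML-style /> form."
html := document printedWithoutSelfClosingTags.
html
```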

Re: PetitParser question parsing HTML meta tags

Paul DeBruicker
In reply to this post by monty-3
Thanks.  I really appreciate everyone's help on this.  Was at a high level of frustration the other day.  






