PetitParser question parsing HTML meta tags


Hartmut Krasemann-2
It's easier this way (still in the form of a script):

parseHtmlPageForDescription: htmlString
  | parser endParser metaParser descriptionParser contentParser res1Parser res2Parser quoteParser nonQuoteParser |
  metaParser := '<meta name="' asParser.
  "the next line extends the parser to understand http-equiv"
  metaParser := metaParser | '<meta http-equiv="' asParser.
  "quoteParser is defined but not used below"
  quoteParser := $" asParser.
  "an attribute value is any run of characters up to the next double quote"
  nonQuoteParser := PPPredicateObjectParser anyExceptAnyOf: '"'.
  descriptionParser := nonQuoteParser star token.
  res1Parser := descriptionParser.
  res2Parser := descriptionParser.
  contentParser := '" content="' asParser trim.
  endParser := '">' asParser.
  "end makes the parse succeed only if the whole string is a single meta tag;
   the answer is an Array with the attribute value and the content value"
  parser := (metaParser, res1Parser, contentParser, res2Parser, endParser) end
           ==> [:nodes | Array with: (nodes at: 2) inputValue with: (nodes at: 4) inputValue].
  ^parser parse: htmlString.
 

"self parseHtmlPageForDescription:  self htmlString1
self parseHtmlPageForDescription:  self htmlString2
self parseHtmlPageForDescription:  self htmlString3   "

with
htmlString1
  ^'<meta name="  Description" content="my description">'
etc.
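
Because the parser above ends with `end`, it only accepts a string that consists of exactly one meta tag. To pull the description out of a whole page, one possible variation is sketched below as a workspace snippet. It is only a sketch under some assumptions: the variable names and the sample page string are made up for illustration, the attribute value is assumed to be "description" in any letter case, and extra whitespace, single quotes, or a different attribute order are not handled.

```smalltalk
"Sketch only: names and the sample input are illustrative assumptions."
| attribute value tag parser result |
"ordered choice: try name= first, then http-equiv="
attribute := '<meta name="' asParser / '<meta http-equiv="' asParser.
"the content value is everything up to the next double quote, answered as a String"
value := (PPPredicateObjectParser anyExceptAnyOf: '"') star flatten.
"caseInsensitive accepts both description and Description"
tag := attribute , 'description' asParser caseInsensitive , '" content="' asParser , value , '">' asParser.
"skip any characters lazily until the tag matches, then keep element 4 of the tag (the content value)"
parser := (#any asParser starLazy: tag) , tag
    ==> [ :nodes | (nodes at: 2) at: 4 ].
result := parser parse: '<html><head><meta http-equiv="Description" content="my description"></head></html>'.
"answer nil when no matching tag is found, otherwise the content string"
result isPetitFailure ifTrue: [ nil ] ifFalse: [ result ]
```

Here there is no `end`, so whatever follows the first matching tag is simply left unparsed.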

You may want to read http://www.lukas-renggli.ch/blog/petitparser-1

good luck
Hartmut
This is kind of an "I'm tired of thinking about this and not making much progress for the amount of time I'm putting in" question, but here it is:



I'm trying to parse descriptions from HTML meta elements.  I can't use Soup because there isn't a working GemStone port.  

I've got it to work with the structure:

<meta name="description" content="my description">

and 

<meta name="Description" content="my description">


but I'm running into instances of: 

<meta http-equiv="description" content="my description">

and

<meta http-equiv="Description" content="my description">


and am having trouble adapting my parsing code (such as it is). 


The parsing code that addresses the first two cases is:



parseHtmlPageForDescription: htmlString
  | startParser endParser ppStream descParser result text lower str doubleQuoteIndex |
  lower := 'escription' asParser.
  startParser := '<meta name=' asParser , #'any' asParser , #'any' asParser.
  endParser := '>' asParser.
  ppStream := htmlString readStream asPetitStream.
  descParser := ((#'any' asParser starLazy: startParser , lower)
    , (#'any' asParser starLazy: endParser)) ==> #'second'.
  result := descParser parse: ppStream.
  text := (result
    inject: (WriteStream on: String new)
    into: [ :stream :char | 
      stream nextPut: char.
      stream ])
    contents trimBoth.
  str := text copyFrom: (text findString: 'content=') + 9 to: text size.
  doubleQuoteIndex := 8 - ((str last: 7) indexOf: $").
  ^ str copyFrom: 1 to: str size - doubleQuoteIndex


I can't figure out how to change the startParser parser to accept the second idiom.  And maybe there's a better approach altogether.  Anyway.  If anyone has any ideas on different approaches I'd appreciate learning them.  


Thanks for giving it some thought

Paul

