It's easier this way (still in the form of a script):

    parseHtmlPageForDescription: htmlString
        | parser endParser metaParser descriptionParser contentParser res1Parser res2Parser quoteParser nonQuoteParser |
        metaParser := '<meta name="' asParser.
        "the next line extends the parser to understand http-equiv"
        metaParser := metaParser | '<meta http-equiv="' asParser.
        quoteParser := $" asParser.
        nonQuoteParser := PPPredicateObjectParser anyExceptAnyOf: '"'.
        descriptionParser := nonQuoteParser star token.
        res1Parser := descriptionParser.
        res2Parser := descriptionParser.
        contentParser := '" content="' asParser trim.
        endParser := '">' asParser.
        parser := (metaParser, res1Parser, contentParser, res2Parser, endParser) end
            ==> [:nodes | Array with: (nodes at: 2) inputValue with: (nodes at: 4) inputValue].
        ^parser parse: htmlString.

        "self parseHtmlPageForDescription: self htmlString1
         self parseHtmlPageForDescription: self htmlString2
         self parseHtmlPageForDescription: self htmlString3
        "

with

    htmlString1
        ^'<meta name=" Description" content="my description">'

etc..

you may want to read http://www.lukas-renggli.ch/blog/petitparser-1

good luck
Hartmut

This is kind of a "I'm tired of thinking about this and not making much progress for the amount of time I'm putting in question" but here it is:

I'm trying to parse descriptions from HTML meta elements. I can't use Soup because there isn't a working GemStone port.

I've got it to work with the structure:

    <meta name="description" content="my description">

and

    <meta name="Description" content="my description">

but I'm running into instances of:

    <meta http-equiv="description" content="my description">

and

    <meta http-equiv="Description" content="my description">

and am having trouble adapting my parsing code (such as it is).

The parsing code that addresses the first two cases is:

    parseHtmlPageForDescription: htmlString
        | startParser endParser ppStream descParser result text lower str doubleQuoteIndex |
        lower := 'escription' asParser.
        startParser := '<meta name=' asParser , #'any' asParser , #'any' asParser.
        endParser := '>' asParser.
        ppStream := htmlString readStream asPetitStream.
        descParser := ((#'any' asParser starLazy: startParser , lower) , (#'any' asParser starLazy: endParser)) ==> #'second'.
        result := descParser parse: ppStream.
        text := (result
            inject: (WriteStream on: String new)
            into: [ :stream :char |
                stream nextPut: char.
                stream ]) contents trimBoth.
        str := text copyFrom: (text findString: 'content=') + 9 to: text size.
        doubleQuoteIndex := 8 - ((str last: 7) indexOf: $").
        ^ str copyFrom: 1 to: str size - doubleQuoteIndex

I can't figure out how to change the startParser parser to accept the second idiom. And maybe there's a better approach altogether.

Anyway. If anyone has any ideas on different approaches I'd appreciate learning them.

Thanks for giving it some thought

Paul
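
For reference, here is a minimal workspace sketch of how the two approaches in this thread might be combined: the name/http-equiv alternative from the reply, plus the lazy skipping from the original code so that the tag can sit anywhere in a page. It is untested against GemStone and assumes a PetitParser version that provides `/` (ordered choice), `negate`, `flatten` and `starLazy:`. The variable names (`attrStart`, `value`, `tag`, `finder`) and the sample input are made up for illustration.

```smalltalk
| attrStart value tag finder |
"Ordered choice: try name= first, then fall back to http-equiv="
attrStart := '<meta name="' asParser / '<meta http-equiv="' asParser.

"Any run of characters up to the next double quote, answered as a String"
value := $" asParser negate star flatten.

"One complete meta tag; answers an Array with the attribute value and the content"
tag := attrStart , value , '" content="' asParser trim , value , '">' asParser
    ==> [ :nodes | Array with: (nodes at: 2) with: (nodes at: 4) ].

"Lazily skip whatever precedes the first matching tag, then keep only the tag result"
finder := (#any asParser starLazy: tag) , tag ==> #second.

finder parse: '<html><head><meta http-equiv="Description" content="my description"></head>'
```

This sketch stops at the first meta tag of either form, so the caller would still need to check that the first element of the answer is some spelling of "description" before using the second. Tags with extra attributes, single quotes, or a different attribute order are not handled.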
--
Hartmut Krasemann