XMLParser changes

CyrilFerlicot

XMLParser changes

Hi,

Last week a new stable version of XMLParser was released and some tests
broke in some tools. I think there is a regression in this version.

Snippet:

(XMLDOMParser parse: '<?xml version="1.0" encoding="UTF-8"?>

<Clones>
<ClonedFragment cloneName="test">
<Member fileName="ProgramA"/>
<Member fileName="ProgramB"/>
</ClonedFragment>
<ClonedFragment cloneName="test2">
<Member fileName="ProgramA"/>
<Member fileName="ProgramB"/>
</ClonedFragment>
</Clones>') elements first nodes

With release 2.7.4 we get 2 nodes, but with release 2.7.6 we get 5
nodes: the 2 previous ones plus 3 String nodes that contain only whitespace.

I think this is not what we expect. Correct me if I am wrong :)

--
Cyril Ferlicot

http://www.synectique.eu

165 Avenue Bretagne
Lille 59000 France

Jan Vrany

Re: XMLParser changes

Not sure whether this was intentional, but strictly speaking,
the new behavior is correct.
See Extensible Markup Language (XML) 1.0 (Fifth Edition),
Section 2.10 White Space Handling:

"An XML processor must always pass all characters in a document that
are not markup through to the application. A validating XML processor
must also inform the application which of these characters constitute
white space appearing in element content."

Jan

NorbertHartl

Re: XMLParser changes

In reply to this post by CyrilFerlicot

> On 25.04.2016 at 12:14, Cyril Ferlicot Delbecque <[hidden email]> wrote:
>
> With release 2.7.4 we get 2 nodes, but with release 2.7.6 we get 5
> nodes: the 2 previous ones plus 3 String nodes that contain only whitespace.
>
> I think this is not what we expect. Correct me if I am wrong :)

It is right. There are no characters to be skipped. These must all resolve to text nodes containing the whitespace characters. The behaviour of the new version is correct. So if you indent XML documents you might change the structure. The property that should give control over that behaviour is called PreserveWhiteSpace.

Norbert

CyrilFerlicot

Re: XMLParser changes

On 25/04/2016 12:30, Norbert Hartl wrote:

> It is right. There are no characters to be skipped. These must all resolve to text nodes containing the whitespace characters. The behaviour of the new version is correct. So if you indent XML documents you might change the structure. The property that should give control over that behaviour is called PreserveWhiteSpace.
>
> Norbert
>

OK, thank you.
Now I need to find out how to disable it. :)
I tried:

((XMLDOMParser on: '<?xml version="1.0" encoding="UTF-8"?>

<Clones>
<ClonedFragment cloneName="test">
<Member fileName="ProgramA"/>
<Member fileName="ProgramB"/>
</ClonedFragment>
<ClonedFragment cloneName="test2">
<Member fileName="ProgramA"/>
<Member fileName="ProgramB"/>
</ClonedFragment>
</Clones>') preservesIgnorableWhitespace: false; parseDocument) elements
first nodes

but I still get the whitespace nodes.

I have not done much XML parsing. Does anyone know an easy way to remove
all the whitespace nodes from a document?


monty-3

Re: XMLParser changes

In reply to this post by CyrilFerlicot

It's not a regression (see the changelog or the new comment in #ignorableWhitespace:). This is the correct behavior required by the spec, is long-overdue, and is also the way other parsers like libxml2 behave.

From Sec. 2.10:
"An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content."

The "white space appearing in element content" is the only whitespace the spec treats as ignorable, but the definition of "element content" requires validation and a DTD with ELEMENT declarations that restrict its content to elements.

From sec. 3.2.1:
"An element type has element content when elements of that type MUST contain only child elements (no character data), optionally separated by white space."

so these are of type element content:
<!ELEMENT a (b,c,d)>
<!ELEMENT a (b|c)>
<!ELEMENT a (b+,c*)>

but these aren't:
<!ELEMENT a (#PCDATA|b|c)*>
<!ELEMENT a ANY>
<!ELEMENT a EMPTY>

And if the parser is non-validating (or in our case, if you disable it) or there's no DTD, then all whitespace must be assumed to be non-ignorable.

monty-3

Re: XMLParser changes

In reply to this post by CyrilFerlicot

All that preservesIgnorableWhitespace: does now is preserve whitespace as string nodes in elements that have element content, as indicated by a DTD with ELEMENT declarations that restrict their content to elements. That's why toggling the setting had no effect when you parsed a document with no DTD and no ELEMENT declarations (I will adapt the #ignorableWhitespace: comment for #preservesIgnorableWhitespace: to explain this better).
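
To illustrate the point above, here is a minimal sketch rather than a confirmed recipe. It reuses only the calls already shown in this thread (XMLDOMParser on:, preservesIgnorableWhitespace:, parseDocument) on the original sample document, with an internal DTD added so that Clones and ClonedFragment have element content. The DTD and the variable names are assumptions made for illustration.

```smalltalk
| xml stripped preserved |
"Sample document from this thread, plus a hypothetical internal DTD
 that restricts Clones and ClonedFragment to child elements only."
xml := '<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Clones [
<!ELEMENT Clones (ClonedFragment*)>
<!ELEMENT ClonedFragment (Member*)>
<!ATTLIST ClonedFragment cloneName CDATA #REQUIRED>
<!ELEMENT Member EMPTY>
<!ATTLIST Member fileName CDATA #REQUIRED>
]>
<Clones>
<ClonedFragment cloneName="test">
<Member fileName="ProgramA"/>
<Member fileName="ProgramB"/>
</ClonedFragment>
<ClonedFragment cloneName="test2">
<Member fileName="ProgramA"/>
<Member fileName="ProgramB"/>
</ClonedFragment>
</Clones>'.

"Same calls as earlier in the thread; only the flag differs."
stripped := ((XMLDOMParser on: xml)
	preservesIgnorableWhitespace: false;
	parseDocument) elements first nodes.
preserved := ((XMLDOMParser on: xml)
	preservesIgnorableWhitespace: true;
	parseDocument) elements first nodes.

"Going by the explanation above, stripped should hold only the
 ClonedFragment elements, while preserved should also hold the
 whitespace string nodes between them. Inspect both to compare."
stripped size.
preserved size.
```

Without the DOCTYPE block, both parses should return the same nodes, which matches what was reported earlier in the thread.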

stepharo

Re: XMLParser changes

Hi guys

I did not understand where are the spaces?

Can you show me?

Stef

NorbertHartl

Re: XMLParser changes

Stef,

if you have

<a>foo</a>

and then you indent it

<a>
foo
</a>

you insert a newline and spaces or tabs after <a>, as well as after foo. In this case the whitespace between <a> and foo is valid but ignorable.

Norbert

CyrilFerlicot

Re: XMLParser changes

In reply to this post by stepharo

On 27/04/2016 09:12, stepharo wrote:
> Hi guys
>
>
> I did not understand where are the spaces?
>

The whitespace consists of the line breaks and tabs (indentation) used
for pretty-printing.

> Can you show me?
>
> Stef
>


Peter Uhnak

Re: XMLParser changes

I re-read this thread three times and I am confused.

Is there a way to just ignore the whitespace?

The XML contains doctype (<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">) so maybe that's why it's broken?

But if I understand correctly, the _correct_ way to solve this is not to ignore the whitespace on parse time, but on processing time, correct?

Peter

CyrilFerlicot

Re: XMLParser changes

On 09/05/2016 17:15, Peter Uhnák wrote:

> I re-read this thread three times and I am confused.
> Is there a way to just ignore the whitespace?
>
> The XML contains doctype (<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
> "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">) so maybe that's why
> it's broken?
>
> But if I understand correctly, the _correct_ way to solve this is not to
> ignore the whitespace on parse time, but on processing time, correct?
>
> Peter

Hi,

What I did to ignore whitespace was to add a DTD in the XML:

'<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE System [
<!ELEMENT System (Clones*)>
<!ATTLIST System Name CDATA #REQUIRED>
<!ATTLIST System Sliding-Window-Size CDATA #IMPLIED>
<!ATTLIST System Min-Nbr-Cloned-Locations CDATA #REQUIRED>
<!ATTLIST System Min-Fragment-Size CDATA #REQUIRED>
<!ATTLIST System Code CDATA #REQUIRED>
<!ATTLIST System Cleaner CDATA #REQUIRED>
<!ELEMENT Clones (ClonedFragment*)>
<!ELEMENT ClonedFragment (Member*)>
<!ATTLIST ClonedFragment cloneName CDATA #REQUIRED>
<!ATTLIST ClonedFragment cloneHash CDATA #REQUIRED>
<!ELEMENT Member EMPTY>
<!ATTLIST Member fileName CDATA #REQUIRED>
<!ATTLIST Member startLine CDATA #REQUIRED>
<!ATTLIST Member endLine CDATA #REQUIRED>
]>

<System Name="Mantis" Code="0" Sliding-Window-Size="4"
Min-Nbr-Cloned-Locations="2" Min-Fragment-Size="10"
Cleaner="DPDummyCleaner">
<Clones>
<ClonedFragment cloneName="modified name"
cloneHash="1784404950000220151828">
<Member fileName="ProgramA" startLine="2" endLine="5" />
<Member fileName="ProgramB" startLine="2" endLine="5" />
</ClonedFragment>
<ClonedFragment cloneName="Code Fragment 2"
cloneHash="2440719450000130430812">
<Member fileName="ProgramA" startLine="1" endLine="4" />
<Member fileName="ProgramB" startLine="1" endLine="4" />
</ClonedFragment>
</Clones>
</System>'

With this DTD, the parser knows that I do not want the whitespace nodes.
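
For the processing-time alternative raised in the question, a minimal sketch, assuming you only care about child elements: ask for elements instead of nodes. Both selectors already appear in the snippets in this thread; the variable names are made up for illustration.

```smalltalk
| xml clones |
"Original sample document from this thread, with no DTD."
xml := '<?xml version="1.0" encoding="UTF-8"?>
<Clones>
<ClonedFragment cloneName="test">
<Member fileName="ProgramA"/>
<Member fileName="ProgramB"/>
</ClonedFragment>
<ClonedFragment cloneName="test2">
<Member fileName="ProgramA"/>
<Member fileName="ProgramB"/>
</ClonedFragment>
</Clones>'.

clones := (XMLDOMParser parse: xml) elements first.

"nodes returns every child, including whitespace string nodes;
 elements returns only the child elements."
clones nodes size.
clones elements size.
```

This leaves the parsed document untouched and skips the whitespace only where the tree is walked, so it does not depend on a DTD being present.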
