How to get rid of empty XML nodes?

Stephane Ducasse-3
Hi

we are manipulating an XML document and I would like to get rid of the
spurious empty string nodes.
We saw that the GT panes do it with a check like this:

(aNodeWithElements isStringNode
    and: [ aNodeWithElements isEmpty
        or: [ aNodeWithElements isWhitespace ] ])

Is there a way not to produce empty nodes?
Is there a simple way not to have to handle them?

Right now, each time we deal with a node we have to check.

Stef


Re: How to get rid of empty XML nodes?

Stephane Ducasse-3
We tried:

| parser doc visitor |
parser := XMLDOMParser new
    on: self xmlContents;
    preservesIgnorableWhitespace: false.
doc := parser parseDocument.

but we still have the empty nodes around.

Stef


On Tue, Dec 5, 2017 at 2:29 PM, Stephane Ducasse
<[hidden email]> wrote:

> [...]


Re: How to get rid of empty XML nodes?

monty-3
In reply to this post by Stephane Ducasse-3
By "empty XML nodes," do you mean whitespace-only string nodes? Those are included because all in-element whitespace is assumed significant by the spec: https://www.w3.org/TR/xml/#sec-white-space

The exception is if the element is declared in the DTD as only having element children ("element content"): https://www.w3.org/TR/xml/#dt-elemcontent

For example, if you declare an element like this:

<!ELEMENT one (two,three*,four?)>

Any whitespace around a "two," "three," or "four" element child of a "one" element is insignificant and ignored (unless #preservesIgnorableWhitespace: is true). Other parsers, like LibXML2 and Xerces, behave the same way.
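
To make this concrete, here is a minimal sketch. It assumes the XMLDOMParser API used elsewhere in this thread, plus the class-side #on: constructor and XMLDocument>>#root. The sample document and its internal DTD subset are made up for illustration, and the comments describe the behaviour explained above rather than verified output:

```smalltalk
| xml doc |
"Hypothetical sample: the internal DTD subset declares <one> with element content only."
xml := '<!DOCTYPE one [
  <!ELEMENT one (two,three*,four?)>
  <!ELEMENT two EMPTY>
  <!ELEMENT three EMPTY>
  <!ELEMENT four EMPTY>
]>
<one>
  <two/>
  <three/>
</one>'.

doc := (XMLDOMParser on: xml)
    preservesIgnorableWhitespace: false;
    parseDocument.

"With the declaration above, the line breaks and indentation inside <one> are
ignorable, so its child nodes should be just the <two/> and <three/> elements."
doc root nodes do: [ :each | Transcript show: each printString; cr ]
```

Without the DOCTYPE, the same document would keep the whitespace-only string nodes, which is the situation described in the first post.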

I'll see if I can come up with some easier way to deal with this, like an optional parser setting, new enumeration methods, or maybe a tree transformation.

> Sent: Tuesday, December 05, 2017 at 8:29 AM
> From: "Stephane Ducasse" <[hidden email]>
> To: "Pharo Development List" <[hidden email]>
> Subject: [Pharo-dev] How to get rid of empty XML nodes?
>
> [...]


Re: How to get rid of empty XML nodes?

Stephane Ducasse-3
Hi monty


On Fri, Dec 8, 2017 at 9:03 AM, monty <[hidden email]> wrote:
> By "empty XML nodes," do you mean whitespace-only string nodes?

Yes

> Those are included because all in-element whitespace is assumed significant by the spec: https://www.w3.org/TR/xml/#sec-white-space

I know. There was a discussion a while ago. I just lost a couple of
hours understanding that :(

But this is a super super super annoying practice.
We had to test each node to see whether it is an empty node, so it makes
everything a lot more complex without real justification,
besides the fact that these standardizers probably never implemented
any real cases.
This standard is really out of touch with reality from that perspective.

> The exception is if the element is declared in the DTD as only having element children ("element content"): https://www.w3.org/TR/xml/#dt-elemcontent

Well, the XML files that I had (I did not choose XML; I would
have preferred JSON :) ) had no DTD :(

So at the end of the day, this wonderful standard puts all the stress
and burden on people.

>
> For example, if you declare an element like this:
>
> <!ELEMENT one (two,three*,four?)>
>
> Any whitespace around a "two," "three," or "four" element child of a "one" element is insignificant and ignored (unless #preservesIgnorableWhitespace: is true). Other parsers, like LibXML2 and Xerces, behave the same way.
>
> I'll see if I can come up with some easier way to deal with this, like an optional parser setting, new enumeration methods, or maybe a tree transformation.

It would be A HUGE PLUS!!!!!!!!!!!!!!!!!!


Because the reality is that people have XML files with just nodes and no
empty nodes, and they are forced to ...
Let me know, because I could try.

I was showing Pharo learners how to use Pharo to import code, and
this was a big drag.

Stef


I tried to set some values in the parser but it did not work.
BTW I saw that the configuration logic forces one to write the following

| parser doc visitor |
parser := XMLDOMParser new
    on: self xmlContents;
    preservesIgnorableWhitespace: true.

and not

| parser doc visitor |
parser := XMLDOMParser new
    preservesIgnorableWhitespace: true;
    on: self xmlContents.



Re: How to get rid of empty XML nodes?

NorbertHartl


> On 08.12.2017 at 14:21, Stephane Ducasse <[hidden email]> wrote:
>
> Hi monty
>
>
>> On Fri, Dec 8, 2017 at 9:03 AM, monty <[hidden email]> wrote:
>> By "empty XML nodes," do you mean whitespace-only string nodes?
>
> Yes
>
>> Those are included because all in-element whitespace is assumed significant by the spec: https://www.w3.org/TR/xml/#sec-white-space
>
> I know. There was a discussion a while ago. I just lost a couple of
> hours understanding that :(
>
> But this is a super super super annoying practice.
> We had to test each node to see whether it is an empty node, so it makes
> everything a lot more complex without real justification,
> besides the fact that these standardizers probably never implemented
> any real cases.
> This standard is really out of touch with reality from that perspective.

Are you sure you are not oversimplifying things? XML would be even more complex if these cases were in the standard. It is not easy to decide whether whitespace is significant or not.
Where does the whitespace in your case come from? Most probably from the XML being pretty-printed, which inserts whitespace into the serialized text. So why not just stop pretty-printing? Then your problem is gone.
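
A small sketch of that point, assuming the class-side convenience method XMLDOMParser class>>#parse: and XMLDocument>>#root. Both sample documents are made up and have no DTD:

```smalltalk
| pretty compact |
"Same content, serialized with and without indentation."
pretty := XMLDOMParser parse: '<one>
  <two/>
</one>'.
compact := XMLDOMParser parse: '<one><two/></one>'.

"Count the whitespace-only string nodes directly under the root element,
using the same test as in the first post."
{ pretty. compact } do: [ :doc |
    | blanks |
    blanks := doc root nodes select: [ :each |
        each isStringNode and: [ each isEmpty or: [ each isWhitespace ] ] ].
    Transcript show: blanks size printString; cr ]
```

Given the behaviour monty describes, the indented version should report the whitespace nodes around <two/> and the compact one should report none.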

Norbert



Re: How to get rid of empty XML nodes?

Stephane Ducasse-3
Norbert

Should I tell the tool generating the XML that it is an idiot? Even
that I cannot do. It is a tool I do not control.
I have no control over what I get.
Now, why can't we specify that it does not matter whether people add a
line return or not?
Why can't I be in charge of deciding? I take the risk of the
interpretation, but right now
the "standard" does not help me at all. It just tells me that this is good for me.

In the past I implemented "standards" like XMI, only to find that there
were bugs in the spec.

In the end, each time I visit a node I have to check:

visitNodeWithElements: aNodeWithElements
    | currentNode |
    currentNode := OkStubNode new.
    self cleanNode: aNodeWithElements.
    aNodeWithElements hasChildren
        ifTrue: [ | tokenNode |
            self cleanNode: aNodeWithElements nodes first.
            tokenNode := self visitElement: aNodeWithElements nodes first.
            self assert: tokenNode isToken.
            currentNode addChild: tokenNode.
            aNodeWithElements nodes allButFirst
                do: [ :each | currentNode addChild: (self visitNodeWithElements: each) ] ].
    ^ currentNode

And I do not like to modify a structure while I'm visiting it.


cleanNode: aNodeWithElements
    aNodeWithElements removeNodes: (aNodeWithElements nodes select:
        [ :e | e isStringNode and: [ e isEmpty or: [ e isWhitespace ] ] ])
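
One way to avoid cleaning during the visit would be to strip the whole tree once, right after parsing. A minimal sketch, assuming XMLNodeWithElements>>#elements answers the element children as a separate collection; #removeWhitespaceNodesFrom: is a hypothetical helper name, not part of XMLParser:

```smalltalk
removeWhitespaceNodesFrom: aNodeWithElements
    "Hypothetical helper: remove whitespace-only string nodes from the given
    node and, recursively, from all of its descendant elements."
    | blanks |
    "Collect first, then remove, so the child list is not changed while it is enumerated."
    blanks := aNodeWithElements nodes select: [ :each |
        each isStringNode and: [ each isEmpty or: [ each isWhitespace ] ] ].
    aNodeWithElements removeNodes: blanks.
    "Recurse into the remaining element children."
    aNodeWithElements elements do: [ :each |
        self removeWhitespaceNodesFrom: each ]
```

It would be called once on the parsed document's root (self removeWhitespaceNodesFrom: doc root) before the visitor starts, so the visitor itself no longer modifies anything. Note that it also drops whitespace separating inline elements in mixed content, so it only suits data-style documents.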

So I understand why people are moving away from XML.

Stef

On Fri, Dec 8, 2017 at 4:02 PM, Norbert Hartl <[hidden email]> wrote:

> [...]

Attachment: Original-java.xml (16K)

Re: How to get rid of empty XML nodes?

NorbertHartl
Sure, it can get quite annoying. It would be good to have a switch to prevent the creation of whitespace-only nodes at parse time.

Norbert

> On 10.12.2017 at 08:42, Stephane Ducasse <[hidden email]> wrote:
> [...]