how to deal with string position in relation to cr/crlf

how to deal with string position in relation to cr/crlf

Tudor Girba
Hi,

I have a small problem related to file line endings and storing the token information of PetitParser.

Sometimes, we parse the sources on Windows and then manipulate the model on Linux or Mac. In this context, if I store the token positions in a string, I encounter problems because CR and LF are considered characters, but the line endings can vary.

('abc', Character cr asString , 'd') findString: 'd'.
==> 5

('abc', Character cr asString , Character lf asString, 'd') findString: 'd'.
==> 6


How would you approach this problem?

Cheers,
Doru


--
www.tudorgirba.com

"Every thing has its own flow."

Re: how to deal with string position in relation to cr/crlf

Toon Verwaest-2

What is the problem exactly? Do you somewhere rely on exact position in the complete string?

What you could do is just keep the line + column number, since that stays fixed. Just increment the line number at each occurrence of the system's newline sequence, and set the column back to 0.
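
A minimal sketch of the line + column idea, written in Python rather than Smalltalk purely for illustration. The function name `line_and_column` is hypothetical, offsets are 0-based (Smalltalk's `findString:` is 1-based), and CR, LF and CRLF are each assumed to count as a single line break.

```python
import re
from typing import Tuple

# CRLF is listed first so that it is consumed as one line break, not two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def line_and_column(text: str, offset: int) -> Tuple[int, int]:
    """Return the 0-based (line, column) of a 0-based character offset."""
    line = 0
    line_start = 0
    for match in _LINE_BREAK.finditer(text):
        if match.end() > offset:
            # This break ends after the offset, so the offset is on the current line.
            break
        line += 1
        line_start = match.end()
    return line, offset - line_start


# The 'd' from the two strings in the first post lands on the same
# (line, column) pair although its raw offset differs.
print(line_and_column("abc\rd", 4))
print(line_and_column("abc\r\nd", 5))
```

Under these assumptions both calls report line 1, column 0, which is why the pair stays stable across platforms while the raw offset does not.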

On Apr 28, 2011 10:48 AM, "Tudor Girba" <[hidden email]> wrote:
> How would you approach this problem?

Re: [Moose-dev] Re: how to deal with string position in relation to cr/crlf

Tudor Girba
Hi,

On 28 Apr 2011, at 10:55, Toon Verwaest wrote:

> What is the problem exactly? Do you somewhere rely on exact position in the complete string?

Yes, I need to rely on this position afterwards, for example, for relating model pieces to the source code. The question is how to retrieve and how to store this information.

> What you could do is just keep the line + column number, since that stays fixed. Just increment the line number at each occurrence of the system's newline sequence, and set the column back to 0.

Indeed. The problem is that the token of PetitParser only knows the character position from the stream. This would mean that we would have to modify the tracking of the position with extra information.

Is there no other option?

Cheers,
Doru



> _______________________________________________
> Moose-dev mailing list
> [hidden email]
> https://www.iam.unibe.ch/mailman/listinfo/moose-dev

--
www.tudorgirba.com

"We are all great at making mistakes."

Re: [Moose-dev] Re: how to deal with string position in relation to cr/crlf

Toon Verwaest-2

> Indeed. The problem is that the token of PetitParser only knows the character position from the stream. This would mean that we would have to modify the tracking of the position with extra information.
>
> Is there no other option?
If what you are doing is relating it back to the original source code
... isn't the original source code stored in one specific format: \r, \n
or \r\n? Or do you use models that are parsed once and map them back to
different versions of the same files on different platforms? In that
case you could always convert the input file into the format you want.

To me, however, it makes most sense to keep the line +
column count if you are going to keep anything yourself anyway. You do
not need to rely on what PetitParser already knows; you can keep this
data yourself. PetitParser needs the character position since that is
where it is parsing. The line + column is metadata that you need, not
PetitParser.
To implement this you again just need to keep track of all the newlines
you see. Every time you see a newline, you update your newline count AND
record the position where the newline occurred. This way you
also have the column count (the actual position minus the position where
the last newline occurred).
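
A sketch of that bookkeeping, in Python rather than Smalltalk for illustration. The class `LineIndex` and its method names are hypothetical, not part of PetitParser. It records where each line starts (the position just after each newline), treats CR, LF and CRLF as one line break each, and uses 0-based offsets.

```python
import bisect
import re
from typing import List, Tuple


class LineIndex:
    """Remembers where every line starts so offsets can be mapped to
    (line, column) and back again."""

    def __init__(self, text: str) -> None:
        # Line 0 starts at offset 0; every later line starts right after a break.
        self.line_starts: List[int] = [0]
        for match in re.finditer(r"\r\n|\r|\n", text):
            self.line_starts.append(match.end())

    def to_line_column(self, offset: int) -> Tuple[int, int]:
        # Find the last line start that is <= offset.
        line = bisect.bisect_right(self.line_starts, offset) - 1
        return line, offset - self.line_starts[line]

    def to_offset(self, line: int, column: int) -> int:
        return self.line_starts[line] + column


# Parse on Windows (CRLF), then relocate the token in a CR-only copy.
windows_index = LineIndex("abc\r\nd")
cr_index = LineIndex("abc\rd")
line, column = windows_index.to_line_column(5)
print(cr_index.to_offset(line, column))
```

The model would store only the (line, column) pair; an index like this one is rebuilt for whichever copy of the file is at hand.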

Another option I see is to always parse in a \r or \n file format by
first converting the input. Then, when you show a position, you have to
check whether the file actually uses \r or \n, or rather \r\n. If it uses
\r or \n, you just give back the number as is. Otherwise you walk over
the file to find out where all the newlines occur. From this you can
build an array that tells you how many characters to add for each
range of positions.

For example, the newlines sit at [0, 10, 15, 17, 20] in the \r\n file if
they occur at [0, 9, 13, 14, 16] in the converted one (each earlier
newline adds 1 char, since we map from a 1-sized newline to a 2-sized
newline). Now you can translate a position by looking for the highest
newline position lower than it. For example, if you were looking at
position 15, this maps onto 14, which has index 4, so you have to
do + 4 -> the real position is 19. This is just a binary search in the
array of newlines for each position, so it's O(number of tokens *
log(number of newlines in file)) to translate the model
to become platform-dependent.
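
The lookup above can be sketched as follows, in Python for illustration. The function names are hypothetical, offsets are 0-based, and the converted text is assumed to use a single "\n" per line break.

```python
import bisect
from typing import List


def newline_positions(normalized: str) -> List[int]:
    """Offsets of every 1-character newline in the converted text (sorted)."""
    return [i for i, ch in enumerate(normalized) if ch == "\n"]


def to_crlf_offset(newlines: List[int], position: int) -> int:
    """Map an offset in the converted text to the offset in the CRLF file.

    Every newline strictly before the position is one character longer in
    the CRLF file, so the shift equals the number of such newlines.
    """
    return position + bisect.bisect_left(newlines, position)


# The example from the post: newlines at [0, 9, 13, 14, 16].
newlines = [0, 9, 13, 14, 16]
print(to_crlf_offset(newlines, 15))                 # 4 newlines precede 15
print([to_crlf_offset(newlines, p) for p in newlines])
```

With the numbers from the post, position 15 maps to 19 and the newline positions themselves map to [0, 10, 15, 17, 20]. Each lookup is one binary search.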


The last option is to just store both position formats in your model
directly, and figure out which file format you are mapping back
onto. This is O(1) but requires double the data for position numbers (no
biggie, I suppose); but it does require your parser to keep track of the
position info itself again. The previous option avoids that.
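
A sketch of storing both offsets per token, in Python for illustration. `TokenPosition`, `record_positions` and `offset_for` are hypothetical names. Detecting the target format by looking for "\r\n" assumes a file never mixes conventions. The recording loop is deliberately naive; a real parser would keep running counters instead.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TokenPosition:
    one_char_offset: int  # offset when every line break is 1 character (CR or LF)
    two_char_offset: int  # offset when every line break is CRLF

    def offset_for(self, text: str) -> int:
        """Pick the stored offset that matches the text's line endings."""
        return self.two_char_offset if "\r\n" in text else self.one_char_offset


def record_positions(source: str, token_starts: List[int]) -> List[TokenPosition]:
    """Compute both offsets for each token start in `source` (any convention)."""
    breaks = [(m.start(), len(m.group())) for m in re.finditer(r"\r\n|\r|\n", source)]
    positions = []
    for start in token_starts:
        preceding = [length for pos, length in breaks if pos < start]
        # Remove the extra character of every CRLF that precedes the token ...
        one_char = start - sum(length - 1 for length in preceding)
        # ... then add one character per preceding break for the CRLF form.
        positions.append(TokenPosition(one_char, one_char + len(preceding)))
    return positions


position = record_positions("abc\r\nd", [5])[0]
print(position.offset_for("abc\r\nd"), position.offset_for("abc\rd"))
```

Once recorded, choosing the right offset is a constant-time field access, at the cost of one extra integer per token.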

Hope this helps to make some sort of a decision :)

Toon


Re: how to deal with string position in relation to cr/crlf

abergel
In reply to this post by Tudor Girba
I experienced a similar problem when I worked on CAnalyzer. What I did was simply remove all LF characters before doing anything else.
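
A small sketch of that normalization, in Python for illustration, with hypothetical function names. Removing LF turns CRLF into CR but would join the lines of a file that uses bare LF, so the second function (an assumption that goes beyond what is described above) maps every convention to CR instead.

```python
def strip_lf(text: str) -> str:
    """Remove every LF; CRLF becomes CR, and CR-only text is unchanged."""
    return text.replace("\n", "")


def normalize_to_cr(text: str) -> str:
    """Map CRLF and bare LF to a single CR so offsets agree everywhere."""
    return text.replace("\r\n", "\r").replace("\n", "\r")


# After normalization the 'd' sits at the same 0-based offset in all variants.
for source in ("abc\rd", "abc\r\nd", "abc\nd"):
    print(normalize_to_cr(source).index("d"))
```

Positions stored against the normalized text are only meaningful if the text is normalized the same way before every later lookup.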

Alexandre



On 28 Apr 2011, at 04:47, Tudor Girba <[hidden email]> wrote:

> How would you approach this problem?

Re: how to deal with string position in relation to cr/crlf

Nicolas Cellier
I would say you'll have problems if and only if you mix pointers
- on transformed streams
- and on OS-native line-end transparent streams...
You'll have to arrange to either:
- always use a transformed stream
- or always use a native line-end transparent stream.

New utilities in Pharo and Squeak (#linesDo: etc.) work with any CR,
LF, or CRLF convention, even mixed conventions in a single file, at an
acceptable runtime cost.
That may help...
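
To illustrate what convention-agnostic line iteration involves, here is a sketch in Python. It is not the Pharo or Squeak implementation; `iter_lines` is a hypothetical name and offsets are 0-based. It yields each line together with its start offset in the untransformed text, so positions still refer to the native stream.

```python
import re
from typing import Iterator, Tuple

# CRLF first, so that it is treated as a single line break.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def iter_lines(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, line) pairs for CR, LF, CRLF or mixed endings."""
    start = 0
    for match in _LINE_BREAK.finditer(text):
        yield start, text[start:match.start()]
        start = match.end()
    if start < len(text):
        # Last line without a trailing line break.
        yield start, text[start:]


# One string mixing all three conventions.
for offset, line in iter_lines("a\r\nb\rc\nd"):
    print(offset, line)
```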

Nicolas

2011/4/28 Alexandre Bergel <[hidden email]>:

> I experienced a similar problem when I worked on CAnalyzer. What I did is simply to remove all lf characters before doing anything.
>
> Alexandre