Hi,
I have a small problem related to file line endings and storing the token information of PetitParser.

Sometimes, we parse the sources on Windows and then manipulate the model on Linux or Mac. In this context, if I store the token positions in a string, I encounter problems because CR and LF are considered characters, but the line endings can vary.

('abc', Character cr asString , 'd') findString: 'd'.
==> 5

('abc', Character cr asString , Character lf asString, 'd') findString: 'd'.
==> 6

How would you approach this problem?

Cheers,
Doru

--
www.tudorgirba.com

"Every thing has its own flow."

_______________________________________________
Moose-dev mailing list
[hidden email]
https://www.iam.unibe.ch/mailman/listinfo/moose-dev
What is the problem exactly? Do you rely somewhere on the exact position in the complete string?

What you could do is just keep the line + column number, since that stays fixed. Just increment the line number on the system's newline sequence, and set the column back to 0.

On Apr 28, 2011 10:48 AM, "Tudor Girba" <[hidden email]> wrote:
> I have a small problem related to file line endings and storing the token information of PetitParser.
> [...]
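
The line + column idea can be sketched in a Pharo workspace roughly as follows. This is only a minimal sketch under assumptions, not PetitParser API: `lineAndColumn` is a hypothetical block that takes a source string and a 1-based character position (as answered by `findString:`), treats CR, LF and CR LF each as a single line break, and answers the 1-based line and the 0-based column as an association.

```smalltalk
"Hypothetical helper, not part of PetitParser."
| lineAndColumn crSource crLfSource |
lineAndColumn := [:source :position |
	| line column previous |
	line := 1.
	column := 0.
	previous := nil.
	"Walk over every character that precedes the position."
	1 to: position - 1 do: [:index |
		| char |
		char := source at: index.
		"An LF directly after a CR belongs to the same line break: skip it."
		(char = Character lf and: [previous = Character cr]) ifFalse: [
			(char = Character cr or: [char = Character lf])
				ifTrue: [line := line + 1. column := 0]
				ifFalse: [column := column + 1]].
		previous := char].
	line -> column].

crSource := 'abc', Character cr asString, 'd'.
crLfSource := 'abc', Character cr asString, Character lf asString, 'd'.

lineAndColumn value: crSource value: (crSource findString: 'd').
"line 2, column 0"
lineAndColumn value: crLfSource value: (crLfSource findString: 'd').
"line 2, column 0, although the character positions are 5 and 6"
```

Both variants of the string answer the same line and column for `'d'`, which is why this pair can be stored independently of the platform's line endings.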
Hi,
On 28 Apr 2011, at 10:55, Toon Verwaest wrote:

> What is the problem exactly? Do you somewhere rely on exact position in the complete string?

Yes, I need to rely on this position afterwards, for example, for relating model pieces to the source code. The question is how to retrieve and how to store this information.

> What you could do is just keep line + column number since that stays fixed. Just increment newline on the systems newline sequence, and set column back to 0.

Indeed. The problem is that the token of PetitParser only knows the character position from the stream. This would mean that we would have to modify the tracking of the position with extra information.

Is there no other option?

Cheers,
Doru

--
www.tudorgirba.com

"We are all great at making mistakes."
> Indeed. The problem is that the token of PetitParser only knows the character position from the stream. This would mean that we would have to modify the tracking of the position with extra information.
>
> Is there no other option?

If what you are doing is relating it back to the original source code, isn't the original source code stored in one specific format: \r, \n or \r\n? Or do you use models that were parsed once to map back to different versions of the same files on different platforms? In that case you could always convert the input file into the format you want.

To me, however, it makes the most sense to keep the line + column count if you are going to keep anything yourself anyway. You do not need to rely on what PetitParser already knows; you can keep this data yourself. PetitParser needs the character position since that is where it is parsing. The line + column is metadata that you need, not PetitParser. To implement this you just need to keep track of all the newlines you see. Every time you see a newline, you update your line count AND record the position where the newline occurred. This gives you the column (the actual position minus the position of the last newline).

Another option I see is to always parse a \r or \n version of the file by converting it first. Then, when you show a position, you have to check whether the file on disk uses \r or \n, or rather \r\n. If it uses \r or \n, you just give back the number as is. Otherwise you walk over the file to find out where all the newlines occur. From this you can build an array that tells you how many characters to add for each range of positions. For example, newlines at [0, 10, 15, 17, 20] in the \r\n file correspond to [0, 9, 13, 14, 16] in the converted text (each newline shifts back by one character for every newline before it, since we map from 2-character newlines to 1-character ones). Now you can translate a position by looking for the highest entry lower than the position. For example, if you are looking at position 15, this maps onto 14, which has index 4, so you add 4: the real position is 19. This is just a binary search in the array of newlines for each position, so it is O(number of tokens * log(number of newlines in file)) to translate the model to platform-specific positions.

The last option is to store both position formats in your model directly and figure out which file format you are mapping back onto. The lookup is O(1) but requires double the data for position numbers (no big deal, I suppose), and it does require your parser to keep track of the position info itself again. The previous option avoids that.

Hope this helps to make some sort of a decision :)

Toon
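
The translation-table option can be sketched in a Pharo workspace with the numbers from the example above. This is a rough sketch under assumptions: `newlinePositions` and `toCrLfPosition` are hypothetical names, positions are 0-based as in the example, and the array holds the newline positions of the converted (single-character newline) text. The binary search counts the newlines strictly before a position, and that count is the number of characters to add.

```smalltalk
"Hypothetical names; positions are 0-based, as in the example above."
| newlinePositions toCrLfPosition |
newlinePositions := #(0 9 13 14 16).

toCrLfPosition := [:position |
	| low high middle |
	"Invariant: entries 1..low are < position, entries above high are >= position."
	low := 0.
	high := newlinePositions size.
	[low < high] whileTrue: [
		middle := (low + high + 1) // 2.
		(newlinePositions at: middle) < position
			ifTrue: [low := middle]
			ifFalse: [high := middle - 1]].
	"low is now the number of newlines before the position."
	position + low].

toCrLfPosition value: 15. "19"
toCrLfPosition value: 14. "17"
toCrLfPosition value: 0. "0"
```

Each lookup takes a logarithmic number of steps in the number of newlines, so translating all token positions stays cheap even for large files.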
I experienced a similar problem when I worked on CAnalyzer. What I did was simply remove all LF characters before doing anything.
Alexandre

On 28 Apr 2011, at 04:47, Tudor Girba <[hidden email]> wrote:
> I have a small problem related to file line endings and storing the token information of PetitParser.
> [...]
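
Normalizing the source before parsing could look roughly like this in a Pharo workspace. It is a sketch only, not the CAnalyzer code: the first part drops every LF as described above, and the second part is an assumed variant that rewrites CR LF and lone LF to CR, so that sources using only LF keep their line breaks. All variable names are made up for the illustration.

```smalltalk
"Illustrative names only; this is not taken from CAnalyzer."
| cr lf crLfSource withoutLf normalized |
cr := Character cr asString.
lf := Character lf asString.
crLfSource := 'abc', cr, lf, 'd'.

"Dropping every LF, as described above."
withoutLf := crLfSource copyWithout: Character lf.
withoutLf findString: 'd'. "5, the same as for the CR-only string"

"Assumed variant: map CR LF and lone LF to CR instead of deleting LF."
normalized := (crLfSource copyReplaceAll: cr, lf with: cr) copyReplaceAll: lf with: cr.
normalized findString: 'd'. "5"
```

Positions computed on the normalized string are stable across platforms, but they only match the file on disk if that file uses single-character line endings.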