BTW,

Doing a web search on +Rope +Unicode, I found that Mozilla is developing a programming language called Rust which uses Ropes with packed UTF-8 strings.

The internal documentation suggests heavy users of strings use ropes instead.

Note:
http://static.rust-lang.org/doc/0.5/std/rope.html

FYI,
-KenD

_______________________________________________
Cuis mailing list
[hidden email]
http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org
-KenD
|
Interesting observation, Ken

This may be considered a confirmation to move on with the implementation of Ropes.

According to http://static.rust-lang.org/doc/0.5/std/rope.html

"Ropes are a high-level representation of text that offers much better performance than strings for common operations, and generally reduce memory allocations and copies, while only entailing a small degradation of less common operations."

.....
"In addition, the tree structure of ropes makes them suitable as a form of index to speed-up access to Unicode characters by index in long chunks of text."

And the basic string type in Rust contains UTF-8 encoded characters
http://dl.rust-lang.org/doc/0.3/tutorial.html (in version 0.3)

Should the Rust language Ropes API
http://static.rust-lang.org/doc/0.5/std/rope.html#type-rope
be taken as a model for the Cuis implementation?

So far there are 11 methods in the Cuis Ropes implementation

Rope selectors
a Set(#asString #, #stringRepresentation #first #doesNotUnderstand: #last #asText #printOn: #copyReplaceFrom:to:with: #printString #asRope)

Interesting candidates from the Rust language Rope API are

Function append_char - Add one char to the end of the rope
Function prepend_char - Add one char to the beginning of the rope
Function append_str - Add one string to the end of the rope

Function char_at - The character at position pos
Function char_len - The number of character in the rope
Function cmp - Compare two ropes by Unicode lexicographical order.
Function eq - Returns true if both ropes have the same content (regardless of their structure), false otherwise
Function ge - # Arguments
Function gt - # Arguments
Function iter_chars - Loop through a rope, char by char, until the end.

I would say a high priority is

a) Rope construction (i.e. appending and prepending instances of Character and String)
b) streaming over a Rope. This is a typical operation when you deal with a large text file.
c) finding a subrope

And of course performance tests with random data to see where it starts to be more efficient to deal with Ropes than Strings.

--Hannes
|
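For readers who have not met ropes before, here is a minimal sketch in Python of the operations listed in the message above. It is neither the Cuis code nor the Rust library: the function names are borrowed from the Rust list purely for illustration, the classes `Flat` and `Concat` are hypothetical stand-ins for FlatRope and ConcatRope, and positions are 0-based as in Python (Smalltalk's #at: is 1-based).

```python
class Flat:
    """Leaf node: wraps an ordinary string (hypothetical stand-in for FlatRope)."""

    def __init__(self, text):
        self.text = text
        self.length = len(text)


class Concat:
    """Inner node: joins two ropes without copying any characters."""

    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length  # cached at construction


def append_str(rope, text):
    """Return a new rope with `text` added at the end."""
    return Concat(rope, Flat(text))


def prepend_char(rope, ch):
    """Return a new rope with the single character `ch` added at the front."""
    return Concat(Flat(ch), rope)


def char_len(rope):
    """Number of characters (code points) in the rope."""
    return rope.length


def char_at(rope, pos):
    """Character at 0-based position `pos`, found by walking down the tree."""
    if not 0 <= pos < rope.length:
        raise IndexError(pos)
    while isinstance(rope, Concat):
        if pos < rope.left.length:
            rope = rope.left
        else:
            pos -= rope.left.length
            rope = rope.right
    return rope.text[pos]


def iter_chars(rope):
    """Yield every character from left to right, using an explicit stack."""
    stack = [rope]
    while stack:
        node = stack.pop()
        if isinstance(node, Concat):
            stack.append(node.right)  # pushed first so the left side is visited first
            stack.append(node.left)
        else:
            yield from node.text
```

Appending or prepending only allocates one new node, which is where the savings over copying whole strings would be expected to come from; whether that pays off in practice is exactly what the proposed performance tests would have to show.
|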
On 2/16/13, H. Hirzel <[hidden email]> wrote:
> ...
> So far there are 11 methods in the Cuis Ropes implementation
>
> Rope selectors
> a Set(#asString #, #stringRepresentation #first #doesNotUnderstand: #last #asText #printOn: #copyReplaceFrom:to:with: #printString #asRope)

Actually, I realize that the subclasses of Rope (FlatRope, ConcatRope and SubRope) have

#at:
#at:put:
#size

> ...
|
In reply to this post by Hannes Hirzel
On Sat, 16 Feb 2013 18:00:19 +0000
"H. Hirzel" <[hidden email]> wrote:

> I think I need a bit more instructions how the class Rope [1] is
> supposed to be used.

Basically, it should look like a String until you look into the details.

I did add Character>>asRope and String>>asRope and #, to make things easier.

> Should the Rust language Ropes API
> http://static.rust-lang.org/doc/0.5/std/rope.html#type-rope
> be taken as a model for the Cuis implementation?

I have been looking at the Rust code. I find it a bit verbose.

Given the breadth of the current String API, I don't see the need to add much.

Note that in general we have most of the Rust functions already:

char_at ==> #at:
char_len ==> #size
eq ==> #=
iter_chars ==> #do:
...

> I would say a high priority is
>
> a) Rope construction (i.e. appending and prepending instances of
> Character and String)
> b) streaming over a Rope. This is a typical operation when you deal
> with a large text file.
> c) finding a subrope

Agreed. I need to fill out the rather large string protocol/API.

As another example, Cedar/Mesa used "lazy" ropes for editing large text files. My recollection is that Cedar used ropes only and everywhere -- no separate string type.

> I actually would expect the class Rope to have
> Rope class >> fromString:

Good call. I'll add it.

The point of this first code was as proof of concept. I wanted to get enough experience to see if this was worth our time to carry further. Lazy ropes and other specializations are easy to add.

Like Juan, I was pleasantly surprised to see that using a rope in the text editor was darn easy.

Plenty of details yet, but the first indicators are positive.

Next steps are to go from prototype/proof-of-concept to use-in-place-of-strings.

This will take me a while. I still have much to learn about paragraph, parsing, and the text editors -- not to mention Smalltalk best practices.

Thanks for taking a serious look at this,
-KenD
--
Ken [dot] Dickey [at] whidbey [dot] com
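As an illustration of the eq ==> #= line in the mapping above, here is a small Python sketch of equality and ordering that depend only on content, not on how a rope happens to be split. The representation is an assumption made for brevity (a rope is either a plain string or a pair of ropes), the names `chars`, `rope_cmp` and `rope_eq` are hypothetical, and the ordering is by numeric code point only, with no language or locale rules.

```python
from itertools import zip_longest


def chars(rope):
    """Yield the characters of a rope: either a str leaf or a (left, right) pair."""
    if isinstance(rope, str):
        yield from rope
    else:
        left, right = rope
        yield from chars(left)
        yield from chars(right)


def rope_cmp(a, b):
    """Return -1, 0 or 1, comparing two ropes character by character.

    Characters are ordered by their numeric code point; the tree shape of
    either rope plays no part in the result.
    """
    for x, y in zip_longest(chars(a), chars(b)):
        if x is None:
            return -1  # a ran out first: a is a proper prefix of b
        if y is None:
            return 1   # b ran out first
        if x != y:
            return -1 if ord(x) < ord(y) else 1
    return 0


def rope_eq(a, b):
    """True when both ropes hold the same characters in the same order."""
    return rope_cmp(a, b) == 0
```

Because the comparison streams over both ropes, neither side has to be flattened into one string first. Note that pure code-point order puts "Zebra" before "apple", since uppercase Latin letters have smaller code points than lowercase ones.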
-KenD
|
Hello Ken

On 2/16/13, Ken Dickey <[hidden email]> wrote:
> On Sat, 16 Feb 2013 18:00:19 +0000
> "H. Hirzel" <[hidden email]> wrote:
>
>> I think I need a bit more instructions how the class Rope [1] is
>> supposed to be used.
>
> Basically, it should look like a String until you look into the details.
>
> I did add Character>>asRope and String>>asRope and #, to make things
> easier.

This is easier indeed. Thank you.

>> Should the Rust language Ropes API
>> http://static.rust-lang.org/doc/0.5/std/rope.html#type-rope
>> be taken as a model for the Cuis implementation?
>
> I have been looking at the Rust code. I find it a bit verbose.
>
> Given the breadth of the current String API, I don't see the need to add
> much.
>
> Note that in general we have most of the Rust functions already:
>
> char_at ==> #at:
> char_len ==> #size
> eq ==> #=
> iter_chars ==> #do:
> ...

This made me use #do: in one of the performance tests, see the performance test thread. I consider iterating through all the chars in a Rope a typical operation I need.

>> I would say a high priority is
>>
>> a) Rope construction (i.e. appending and prepending instances of
>> Character and String)

OK for now.

>> b) streaming over a Rope. This is a typical operation when you deal
>> with a large text file.

OK for now.

>> c) finding a subrope

This is what I am interested in next. The general String find. And maybe #splitBy:

BTW could you provide the links to Java, Python and C implementations of Ropes so that I can have a look at them as well?

> Agreed. I need to fill out the rather large string protocol/API.
>
> As another example, Cedar/Mesa used "lazy" ropes for editing large text
> files. My recollection is that Cedar used ropes only and everywhere -- no
> separate string type.
>
>> I actually would expect the class Rope to have
>> Rope class >> fromString:
>
> Good call. I'll add it.

Thank you, this makes using Ropes effortless. I used it in the performance tests. I only need to know about class Rope for getting started.

Maybe we should complete the API the subclasses provide in the class Rope as it is done in String in Squeak (String is superclass of ByteString and WideString in Squeak).

> The point of this first code was as proof of concept. I wanted to get
> enough experience to see if this was worth our time to carry further. Lazy
> ropes and other specializations are easy to add.

What is "lazy" in lazy Ropes?

> Like Juan, I was pleasantly surprised to see that using a rope in the text
> editor was darn easy.

Yes.

> Plenty of details yet, but the first indicators are positive.
>
> Next steps are to go from prototype/proof-of-concept to
> use-in-place-of-strings.

Yes.

> This will take me a while. I still have much to learn about paragraph,
> parsing, and the text editors.

With paragraph parsing you mean finding where the paragraph boundaries are? It might be worthwhile considering using paragraph boundaries as end points for Ropes?

> -- not to mention Smalltalk best practices.

Sure but people here are willing to help with their knowledge.

> Thanks for taking a serious look at this,

You are very welcome.

Regards
--Hannes
|
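On the "general String find" asked for in the message above, a rough Python sketch of searching a rope by streaming over its leaves, so that a match may straddle a leaf boundary and nothing is flattened into one big string. This is an assumption about how such a search could look, not code from the Cuis package: `find_in_leaves` is a hypothetical name, the rope is represented simply as the sequence of leaf strings a left-to-right traversal would yield, and the scan is the naive one rather than anything clever like KMP.

```python
from collections import deque


def find_in_leaves(leaves, needle):
    """Return the 0-based index of the first occurrence of `needle`, or -1.

    `leaves` is any iterable of strings, in the order a left-to-right walk
    of the rope would visit them. Only the last len(needle) characters are
    kept in memory at any time.
    """
    if not needle:
        return 0
    window = deque(maxlen=len(needle))  # sliding window over the character stream
    seen = 0                            # number of characters consumed so far
    for leaf in leaves:
        for ch in leaf:
            window.append(ch)
            seen += 1
            if len(window) == len(needle) and "".join(window) == needle:
                return seen - len(needle)
    return -1
```

For example, searching the leaves "Hello, Ro", "pe wor", "ld" for "Rope" finds the match at index 7 even though it is split across the first two leaves. A #splitBy: could be built on the same streaming loop.
|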
On Mon, 18 Feb 2013 10:07:22 +0000
"H. Hirzel" <[hidden email]> wrote:

> >> c) finding a subrope
>
> This is what I am interested in next. The general String find. And
> maybe #splitBy:

Right. Unicode has character classes, so word/token breaks, whitespace, case-ing, and especially sorting are typically more complex and expensive to implement. E.g. you really have to know both the language and locale to sort properly.

My strategy at this point is to go for simple comparison and sorting using the numeric code-point values (#<) and do the more complex comparisons (#comesBefore:) after the codePoint basics are in place and tested.

One challenge is to have good names which are descriptive of the operation but make the distinction between Unicode and Numeric.

> BTW could you provide the links to Java, Python and C implementations
> of Ropes so that I can have a look at it as well?

I did that in these emails and Wikipedia has references, but no reason not to gather them together in the documentation.

> >> I actually would expect the class Rope to have
> >> Rope class >> fromString:
> >
> > Good call. I'll add it.
>
> Thank you, this makes using Ropes effortless. I used it in the
> performance tests. I only need to know about class Rope for getting
> started.

Wow! Thanks for doing performance tests! Good information!

> Maybe we should complete the API the subclasses provide in the class Rope
> as it is done in String in Squeak (String is superclass of ByteString
> and WideString in Squeak).

Ah. You are catching me out. I have been consciously NOT looking at the Squeak implementation until I did a "code sketch" of a Ropes implementation for Unicode.

I tend to "think out loud" by writing code. I confuse myself easily and the more abstract the concept, the better it is to do concrete implementations (IMHO).

I just created
https://github.com/KenDickey/Cuis-Unicode

Note the read-me states "Not yet ready to see the light of day" !!

No real Unicode scanning yet, just sketching out the dispatch mechanics.

Note that I have changed names (Rope -> UniString, ConcatRope -> UniSplice, etc.)

The plan is to qualify when scanning/inputting characters to see when it makes sense to use 8/16/32 bit charBlocks to hold codePoints. I.e. if more than XXX % of characters are 32 bit, then use the 32-bit storage class (WordArray) for a chunk of codePoints. We need to measure to figure out what XXX is.

> > The point of this first code was as proof of concept. I wanted to get
> > enough experience to see if this was worth our time to carry further. Lazy
> > ropes and other specializations are easy to add.
>
> What is "lazy" in lazy Ropes?

If I edit a large file, I don't really want to read the entire file into memory. I can read in a chunk and concatenate this FlatRope with a LazyRope which is essentially a promise to get things as needed. In practice, #at: and friends are closures or methods which will read in more of the file as required. This is called "lazy", as in I am too lazy to read the file all at once. I'll do it automagically on demand.

Think "virtual rope" like "virtual file system". You don't have to have everything in memory. So "distributed", "transactions", ... plenty of fun here! 8^)

> > This will take me a while. I still have much to learn about paragraph,
> > parsing, and the text editors.
>
> With paragraph parsing you mean finding where the paragraph boundaries
> are? It might be worthwhile considering using paragraph boundaries as
> end points for Ropes?

Paragraph breaks are another complex concept, particularly in mixing left-to-right and right-to-left word orderings (say a newspaper review in Arabic, Hebrew, and French).

I think paragraph breaks must be orthogonal to storage. You don't store paragraphs as "disk blocks". You make index structures.

Hey, this is getting way too long and I have a bunch of my life that does not fit here.

More later.

Thanks again for everything!! I'll try and get the "pulls" in later today.

-KenD
--
Ken [dot] Dickey [at] whidbey [dot] com
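To make the 8/16/32 bit charBlock plan above a little more concrete, a hedged Python sketch of the measurement it calls for. None of this comes from the Cuis-Unicode repository: the function names are hypothetical, and treating everything up to 16rFF as fitting in 8 bits is an assumption. The sketch only reports the facts a chunking policy would need; what the XXX threshold should be is left open, as in the message.

```python
def bits_needed(ch):
    """Smallest of 8, 16 or 32 bits that can hold the code point of `ch`."""
    code_point = ord(ch)
    if code_point <= 0xFF:
        return 8
    if code_point <= 0xFFFF:
        return 16
    return 32


def narrowest_width(chunk):
    """Width needed to store every character of `chunk` in one uniform block."""
    return max((bits_needed(ch) for ch in chunk), default=8)


def share_needing(chunk, bits):
    """Fraction of characters in `chunk` whose narrowest fit is exactly `bits`."""
    if not chunk:
        return 0.0
    return sum(bits_needed(ch) == bits for ch in chunk) / len(chunk)
```

Running `share_needing(text, 32)` over representative texts is one way to gather the numbers needed to pick the threshold.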
-KenD
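The "lazy" rope explanation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the planned Cuis LazyRope: the class name is reused for readability, `load_tail` is a hypothetical zero-argument callable standing in for "read the rest of the file", and positions are 0-based.

```python
class LazyRope:
    """An in-memory head chunk followed by a tail that is fetched on demand."""

    def __init__(self, head, load_tail):
        self.head = head              # text already read into memory
        self._load_tail = load_tail   # the promise: called at most once, when needed
        self._tail = None             # filled in on first access past the head

    @property
    def tail_loaded(self):
        """True once the tail has actually been fetched."""
        return self._tail is not None

    def char_at(self, pos):
        """Character at 0-based `pos`; loads the tail only if `pos` lies in it."""
        if pos < 0:
            raise IndexError(pos)
        if pos < len(self.head):
            return self.head[pos]
        if self._tail is None:
            self._tail = self._load_tail()
        return self._tail[pos - len(self.head)]
```

In an editor, `load_tail` would seek into the file and read the next chunk; accesses that stay inside the head never touch the file at all. A fuller version would return another lazy rope for the tail, so that a large file is pulled in piece by piece.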
|
Ah!
I should mention that good Unicode support is big.

I am trying for a package separate from Core Cuis. This means that UniChar does not inherit from Character and UniString does not inherit from String. [They would share Traits in Pharo].

Full Unicode does not fit well in a current-generation cell phone.

--
Ken [dot] Dickey [at] whidbey [dot] com
-KenD
|
In reply to this post by KenDickey
Hi Folks,
On 2/18/2013 2:53 PM, Ken Dickey wrote:
> ...
> Paragraph breaks are another complex concept, particularly in mixing left-to-right and right-to-left word orderings (say a newspaper review in Arabic, Hebrew, and French).
>
> I think paragraph breaks must be orthogonal to storage. You don't store paragraphs as "disk blocks". You make index structures.

In addition, most users won't need Strings but Texts, i.e. including attributes. The Pharo folks did a new implementation of Text that uses something very similar to Ropes. They call them 'extents'. This would be nice to have.

> Hey, this is getting way too long and I have a bunch of my life that does not fit here.

:) I can understand you. Cuis and Morphic 3 started this way!

> More later.
>
> Thanks again for everything!! I'll try and get the "pulls" in later today.
> -KenD

Cheers,
Juan Vuletich
|
In reply to this post by KenDickey
Hi Ken,
At https://github.com/KenDickey/Cuis-Unicode you say there's no support for display yet. My as yet unpublished Morphic 3 prototype can render ISO 8859-15 using the vector graphics engine with decent visual quality (on par with Mac & Win). If things go well, I should be able to include a TrueType or OpenType interpreter, meaning that some support for displaying any Unicode glyph is not that far off (maybe a year from now). This would be FreeType functionality.

As you know, proper paragraph composition in any script is another (harder) problem for the future (Pango functionality).

Cheers,
Juan Vuletich
|
Juan,
I was mainly referring to size. The gnu-unifont TrueType font which covers Unicode codePoints through 16rFFFF (i.e. less than full Unicode) weighs in at 16 MegaBytes. And that is just the TrueType font info with none of the support code.

An interesting alternative, by the way, is CDL (Character Description Language) which carries information for rendering CJK (> 84,000 character glyphs) in only 1.4 MB. Rendering is basically painting the strokes which make up the characters. I think this would be really, really cool in Morphic 3!!!

See:
http://www.wenlin.com/cdl/

This could possibly fit in a new-gen cell phone.

$0.02,
-KenD
--
Ken [dot] Dickey [at] whidbey [dot] com
-KenD
|
In reply to this post by Juan Vuletich-4
On 2/22/2013 3:57 PM, Juan Vuletich wrote:
> ...
> In addition, most users won't need Strings but Texts, i.e. including
> attributes. The Pharo folks did a new implementation of Text that uses
> something very similar to Ropes. They call them 'extents'. This would
> be nice to have.
>
> ...

Apologies. My memory failed. It is TxTextModel, and is discussed (together with other stuff like paragraph layout and morph drawing) here:
http://forum.world.st/TxTextMorph-based-on-new-text-model-td4663219.html

In any case, a new implementation of Strings suggests doing a new implementation of Text...

Cheers,
Juan Vuletich
|
In reply to this post by KenDickey
Hi Ken,
On 2/22/2013 9:59 PM, Ken Dickey wrote:
> Juan,
>
> I was mainly referring to size. The gnu-unifont TrueType font which covers Unicode codePoints through 16rFFFF (i.e. less than full Unicode) weighs in at 16 MegaBytes. And that is just the TrueType font info with none of the support code.

I see.

> An interesting alternative, by the way, is CDL (Character Description Language) which carries information for rendering CJK (> 84,000 character glyphs) in only 1.4 MB. Rendering is basically painting the strokes which make up the characters.
> I think this would be really, really cool in Morphic 3!!!
>
> See:
> http://www.wenlin.com/cdl/
>
> This could possibly fit in a new-gen cell phone.

I wasn't aware of this. Thanks. They say it is easy to transform into SVG. That means it is easy to render in M3.

Cheers,
Juan Vuletich
|
In reply to this post by Juan Vuletich-4
On Sat, 23 Feb 2013 00:56:50 -0300
Juan Vuletich <[hidden email]> wrote:

> In any case, a new implementation of Strings suggests doing a new
> implementation of Text...

OK. But first, back to Ropes.

-KenD
-KenD
|
Hello
I found this interesting page about how other languages handle or do not handle Unicode
http://rosettacode.org/wiki/Character_codes

Regards
Hannes
|