Hello,
it looks like it's unsafe to switch between ascii/binary mode in
SocketStream, because it resets its buffers upon the switch:

  binary
      "Tell the SocketStream to send data as ByteArrays instead of
      Strings. Default is ascii."
      binary := true.
      self resetBuffers

Dynamic mode switching is useful: for instance, I want to read the
HTTP headers first, which is preferable to do in ascii mode, but the
content that follows may be binary, which is obviously preferable to
read in binary mode, to avoid extra conversions.

Since SocketStream caches the data it reads, it should convert the
buffers instead of resetting them, and so avoid losing the data.

What do you think is an appropriate solution to this?

--
Best regards,
Igor Stasenko AKA sig.
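To illustrate the hazard, here is a minimal sketch (hypothetical host;
it assumes part of the response body is already sitting in the read
buffer when the switch happens):

  | stream headers |
  stream := SocketStream openConnectionToHostNamed: 'example.com' port: 80.
  stream nextPutAll: 'GET / HTTP/1.1', String crlf,
      'Host: example.com', String crlf, String crlf.
  stream flush.
  "read the headers in ascii mode"
  headers := stream upToAll: String crlf, String crlf.
  "switching now calls resetBuffers, silently dropping any body
  bytes that were already buffered"
  stream binary.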
On 3/15/2010 7:44 PM, Igor Stasenko wrote:
> Since SocketStream caches the data it reads, it should convert the
> buffers instead of resetting them, and so avoid losing the data.
>
> What do you think is an appropriate solution to this?

Don't "reset" the buffers; simply convert them to the proper
ascii/binary format.

Cheers,
  - Andreas
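A minimal sketch of that approach, assuming Squeak's SocketStream with
its inBuffer/outBuffer instance variables (the guard and conversion
details are illustrative, not tested code):

  binary
      "Tell the SocketStream to send data as ByteArrays instead of
      Strings. Convert the cached buffers instead of resetting them."
      binary ifTrue: [^self].
      binary := true.
      inBuffer ifNotNil: [inBuffer := inBuffer asByteArray].
      outBuffer ifNotNil: [outBuffer := outBuffer asByteArray]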
On 16 March 2010 04:46, Andreas Raab <[hidden email]> wrote:
> On 3/15/2010 7:44 PM, Igor Stasenko wrote:
>>
>> Since SocketStream caches the data it reads, it should convert the
>> buffers instead of resetting them, and so avoid losing the data.
>>
>> What do you think is an appropriate solution to this?
>
> Don't "reset" the buffers; simply convert them to the proper
> ascii/binary format.
>

There could be an alternative approach:
- keep the buffers in a single (binary) format and convert the output
  depending on the mode.

The choice is when to pay the conversion price:
- each time you read something
- each time you switch the mode

If the input is a mix of ascii/binary content, converting the cache on
every mode switch will be very inefficient. Take HTTP
'transfer-encoding: chunked' as an example: the content may be binary
data, but once it is chunked, the input becomes a mix of binary data,
hexadecimal ascii chunk sizes, and crlf's.

So it requires more thorough analysis than just saying 'convert it' :)

> Cheers,
>  - Andreas

--
Best regards,
Igor Stasenko AKA sig.
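A sketch of what convert-on-read could look like, assuming a stream
that keeps its cache as a ByteArray (the inBuffer/lastRead names are
illustrative, not necessarily SocketStream's exact internals):

  next
      "Answer the next element, converting from the binary cache
      only when the stream is in ascii mode."
      | byte |
      byte := inBuffer at: (lastRead := lastRead + 1).
      ^binary
          ifTrue: [byte]
          ifFalse: [Character value: byte]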
On 3/15/2010 8:14 PM, Igor Stasenko wrote:
> There could be an alternative approach:
> - keep the buffers in a single (binary) format and convert the output
>   depending on the mode.
>
> The choice is when to pay the conversion price:
> - each time you read something
> - each time you switch the mode
>
> If the input is a mix of ascii/binary content, converting the cache on
> every mode switch will be very inefficient. Take HTTP
> 'transfer-encoding: chunked' as an example: the content may be binary
> data, but once it is chunked, the input becomes a mix of binary data,
> hexadecimal ascii chunk sizes, and crlf's.
>
> So it requires more thorough analysis than just saying 'convert it' :)

I don't think it's all that complicated :-)

First, you'd slow down all current use cases and introduce a lot of
potential bugs if you added conversion upon access. You would also
break any extension methods (the next:into: methods were originally
extensions on SocketStream before I added them to trunk). Given all of
that, changing SocketStream in that way seems highly questionable.

The specific use case of chunked encoding is interesting too, since
the motivation for adding the next:into: family of methods came from
reading chunked encoding :-) As a consequence, the fastest way to read
chunked encoding in Squeak today is the following:

  buffer := ByteArray new. "or: ByteString new"
  [firstLine := socketStream nextLine.
   chunkSize := ('16r', firstLine asUppercase) asNumber. "icky but works"
   chunkSize = 0] whileFalse: [
      buffer size < chunkSize
          ifTrue: [buffer := buffer class new: chunkSize].
      buffer := socketStream next: chunkSize into: buffer startingAt: 1.
      outStream next: chunkSize putAll: buffer.
      socketStream skip: 2. "CRLF"
  ].
  socketStream skip: 2. "CRLF"

There is no conversion needed between ascii/binary since the
next:into: code accepts both strings and byte arrays. At the end of
the day, switching between ascii and binary is a bit of a convenience
function, which means that you probably shouldn't be writing
high-performance code that depends on constantly switching between the
two (I think that's a fair tradeoff). The next:into: family was
specifically provided for high-performance situations, by providing a
pre-allocated buffer and avoiding the allocation overhead.

Cheers,
  - Andreas
On Tue, 16 Mar 2010, Igor Stasenko wrote:
> On 16 March 2010 04:46, Andreas Raab <[hidden email]> wrote:
>> [...]
>> Don't "reset" the buffers; simply convert them to the proper
>> ascii/binary format.
>
> There could be an alternative approach:
> - keep the buffers in a single (binary) format and convert the output
>   depending on the mode.

This is exactly what we were doing in our own SocketStream-like class.
For a general-purpose SocketStream this might add some extra
complexity to the implementation, but it would also allow us to use
the stream primitives.

There's a third option if you want to optimize for rapid ascii/binary
mode changes: store both the binary and the ascii buffers and
fill/copy them in a lazy way.

Levente
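A rough sketch of that lazy dual-buffer idea (the binaryBuffer,
asciiBuffer and receiveData names are hypothetical, not SocketStream's
actual protocol):

  asciiBuffer
      "Derive the ascii view of the canonical binary cache on demand."
      ^asciiBuffer ifNil: [asciiBuffer := binaryBuffer asString]

  receiveData
      "Append newly arrived bytes to the binary cache and invalidate
      the derived ascii view, so it is rebuilt only when next used."
      binaryBuffer := binaryBuffer , socket receiveData asByteArray.
      asciiBuffer := nil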
On 16 March 2010 05:51, Andreas Raab <[hidden email]> wrote:
> On 3/15/2010 8:14 PM, Igor Stasenko wrote:
>> [...]
>
> [...] At the end of the day, switching between ascii and binary is a
> bit of a convenience function, which means that you probably
> shouldn't be writing high-performance code that depends on constantly
> switching between the two (I think that's a fair tradeoff). The
> next:into: family was specifically provided for high-performance
> situations, by providing a pre-allocated buffer and avoiding the
> allocation overhead.

Yes, #next:into: is convenient if you know the content size from the
start, or if you want to read everything into memory at once. But the
strategy you show above doesn't work well for all cases. For
persistent streams, used for exchanging data between peers, there is
no notion of 'read everything up to the end'; it is usually 'read what
is currently available', because the peers exchange data in real time
and you can't predict what will follow the last input.

My current intent is to make a fast reader which uses a socket as its
backend, and which:

- reads/parses the HTTP headers,
- handles chunked transfer encoding,
- handles utf8 content encoding,
- and only then feeds a consumer, which is a JSON parser that parses
  the input character by character and, like many other parsers,
  obviously has no use for #next:into:, using #peek and #next all the
  way.

The idea is to parse the data as it becomes available, instead of
reading everything up to the end and only then starting to parse. You
could ask why this is more effective. Because of network latency: a
client, instead of simply waiting for the next data packet to arrive,
can spend this time more productively by parsing the input that is
already available (besides, it will spend this time anyway, so why
waste it?). This means the results of parsing become available
earlier, compared to a scheme where you start parsing only when all
the data has arrived. Also, the bigger the content, the bigger the
win, not only in speed but also in memory consumption.

So I tend to look for designs where the socket stream is focused on
streaming the data and doesn't assume that its consumer prefers the
buffered approach (#next:into:) over the non-buffered one (#next).

> Cheers,
>  - Andreas

--
Best regards,
Igor Stasenko AKA sig.
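For illustration, a sketch of the consumer pattern described above,
assuming only that the stream answers #peek, #next and #atEnd
(blocking until data arrives), with a hypothetical scanTokenFrom: as
the parser's entry point:

  scanTokenFrom: aStream
      "Accumulate one alphabetic token character by character;
      parsing overlaps with network arrival because the stream only
      blocks when its buffer is empty."
      | token |
      token := WriteStream on: String new.
      [aStream atEnd not and: [aStream peek isLetter]]
          whileTrue: [token nextPut: aStream next].
      ^token contents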
2010/3/16 Igor Stasenko <[hidden email]>:
> [...]
>
> So I tend to look for designs where the socket stream is focused on
> streaming the data and doesn't assume that its consumer prefers the
> buffered approach (#next:into:) over the non-buffered one (#next).

Then you should really consider looking at the VW-XTream transforming:
stuff. The idea is to have parallel processing (pipelines). Of course,
we cannot have true parallelism yet in Smalltalk, but at least the
first stage can work with a non-blocking Squeak socket.

Nicolas
On 16 March 2010 10:57, Nicolas Cellier
<[hidden email]> wrote:
>> [...]
>>
>> So I tend to look for designs where the socket stream is focused on
>> streaming the data and doesn't assume that its consumer prefers the
>> buffered approach (#next:into:) over the non-buffered one (#next).
>
> Then you should really consider looking at the VW-XTream
> transforming: stuff. The idea is to have parallel processing
> (pipelines).

Err. Pipes are not parallel processing; they are sequential: the
output of one pipe is the input of the next. And sure thing, this is
how I think good streams should work. Too bad I have to use what we
have in Squeak/Pharo, or do everything from scratch.

> Of course, we cannot have true parallelism yet in Smalltalk, but at
> least the first stage can work with a non-blocking Squeak socket.
>
> Nicolas

--
Best regards,
Igor Stasenko AKA sig.
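A minimal sketch of such a pipeline stage (a hypothetical class, not
the XTream API): each stage wraps its upstream stream and transforms
elements on demand, so stages compose by nesting one inside another.

  Object subclass: #TransformStream
      instanceVariableNames: 'source transform'
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Sketch-Streams'

  TransformStream class >> on: aStream transform: aBlock
      ^self new setSource: aStream transform: aBlock

  TransformStream >> setSource: aStream transform: aBlock
      source := aStream.
      transform := aBlock

  TransformStream >> atEnd
      ^source atEnd

  TransformStream >> next
      "Pull one element from the upstream stage and transform it."
      ^transform value: source next

For example, a stage that upcases characters arriving from a socket
stream could be built as:

  TransformStream on: socketStream transform: [:each | each asUppercase].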
2010/3/16 Igor Stasenko <[hidden email]>:
>> [...]
>>
>> Then you should really consider looking at the VW-XTream
>> transforming: stuff. The idea is to have parallel processing
>> (pipelines).
>
> Err. Pipes are not parallel processing; they are sequential: the
> output of one pipe is the input of the next.

Well, Mr Ford understood that before us ;) Having only one object
working while the others are resting is not the most efficient way to
process a sequential stream.

Nicolas