Hello,
it looks like it's unsafe to switch between ascii/binary mode in
SocketStream, because it resets its buffers upon the switch:

  binary
      "Tell the SocketStream to send data as ByteArrays instead of
      Strings. Default is ascii."
      binary := true.
      self resetBuffers

Dynamic mode switching is useful: for instance, I want to read the
HTTP headers first, which is preferable to do in ascii mode, but the
content that follows may be binary, which is obviously preferable to
read in binary mode, to avoid extra conversions.

Since SocketStream caches the data it reads, it should convert the
buffers instead of resetting them, and so avoid losing the data.

What do you think is an appropriate solution to this?

--
Best regards,
Igor Stasenko AKA sig.
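To illustrate the hazard, here is a minimal sketch (hypothetical host;
it assumes part of the response body is already sitting in the read
buffer when the switch happens):

  | stream headers |
  stream := SocketStream openConnectionToHostNamed: 'example.com' port: 80.
  stream nextPutAll: 'GET / HTTP/1.1', String crlf,
      'Host: example.com', String crlf, String crlf.
  stream flush.
  "read the headers in ascii mode"
  headers := stream upToAll: String crlf, String crlf.
  "switching now calls resetBuffers, silently dropping any body
  bytes that were already buffered"
  stream binary.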
On 3/15/2010 7:44 PM, Igor Stasenko wrote:
> Since SocketStream caches the data it reads, it should convert the
> buffers instead of resetting them, and so avoid losing the data.
>
> What do you think is an appropriate solution to this?

Don't "reset" the buffers; simply convert them to the proper
ascii/binary format.

Cheers,
  - Andreas
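A minimal sketch of that approach, assuming Squeak's SocketStream with
its inBuffer/outBuffer instance variables (the guard and conversion
details are illustrative, not tested code):

  binary
      "Tell the SocketStream to send data as ByteArrays instead of
      Strings. Convert the cached buffers instead of resetting them."
      binary ifTrue: [^self].
      binary := true.
      inBuffer ifNotNil: [inBuffer := inBuffer asByteArray].
      outBuffer ifNotNil: [outBuffer := outBuffer asByteArray]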
On 16 March 2010 04:46, Andreas Raab <[hidden email]> wrote:
> On 3/15/2010 7:44 PM, Igor Stasenko wrote:
>>
>> Since SocketStream caches the data it reads, it should convert the
>> buffers instead of resetting them, and so avoid losing the data.
>>
>> What do you think is an appropriate solution to this?
>
> Don't "reset" the buffers; simply convert them to the proper
> ascii/binary format.
>

There could be an alternative approach:
- keep the buffers in a single (binary) format and convert the output
  depending on the mode.

The choice is when to pay the conversion price:
- each time you read something
- each time you switch the mode

If the input is a mix of ascii/binary content, converting the cache on
every mode switch will be very inefficient. Take HTTP
'transfer-encoding: chunked' as an example: the content may be binary
data, but once it is chunked, the input becomes a mix of binary data,
hexadecimal ascii chunk sizes, and crlf's.

So it requires more thorough analysis than just saying 'convert it' :)

> Cheers,
>  - Andreas

--
Best regards,
Igor Stasenko AKA sig.
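A sketch of what convert-on-read could look like, assuming a stream
that keeps its cache as a ByteArray (the inBuffer/lastRead names are
illustrative, not necessarily SocketStream's exact internals):

  next
      "Answer the next element, converting from the binary cache
      only when the stream is in ascii mode."
      | byte |
      byte := inBuffer at: (lastRead := lastRead + 1).
      ^binary
          ifTrue: [byte]
          ifFalse: [Character value: byte]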
On 3/15/2010 8:14 PM, Igor Stasenko wrote:
> There could be an alternative approach:
> - keep the buffers in a single (binary) format and convert the output
>   depending on the mode.
>
> The choice is when to pay the conversion price:
> - each time you read something
> - each time you switch the mode
>
> If the input is a mix of ascii/binary content, converting the cache on
> every mode switch will be very inefficient. Take HTTP
> 'transfer-encoding: chunked' as an example: the content may be binary
> data, but once it is chunked, the input becomes a mix of binary data,
> hexadecimal ascii chunk sizes, and crlf's.
>
> So it requires more thorough analysis than just saying 'convert it' :)

I don't think it's all that complicated :-)

First, you'd slow down all current use cases and introduce a lot of
potential bugs if you added conversion upon access. You would also
break any extension methods (the next:into: methods were originally
extensions on SocketStream before I added them to trunk). Given all of
that, changing SocketStream in that way seems highly questionable.

The specific use case of chunked encoding is interesting too, since
the motivation for adding the next:into: family of methods came from
reading chunked encoding :-) As a consequence, the fastest way to read
chunked encoding in Squeak today is the following:

  buffer := ByteArray new. "or: ByteString new"
  [firstLine := socketStream nextLine.
   chunkSize := ('16r', firstLine asUppercase) asNumber. "icky but works"
   chunkSize = 0] whileFalse: [
      buffer size < chunkSize
          ifTrue: [buffer := buffer class new: chunkSize].
      buffer := socketStream next: chunkSize into: buffer startingAt: 1.
      outStream next: chunkSize putAll: buffer.
      socketStream skip: 2. "CRLF"
  ].
  socketStream skip: 2. "CRLF"

There is no conversion needed between ascii/binary since the
next:into: code accepts both strings and byte arrays. At the end of
the day, switching between ascii and binary is a bit of a convenience
function, which means that you probably shouldn't be writing
high-performance code that depends on constantly switching between the
two (I think that's a fair tradeoff). The next:into: family was
specifically provided for high-performance situations, by providing a
pre-allocated buffer and avoiding the allocation overhead.

Cheers,
  - Andreas
On Tue, 16 Mar 2010, Igor Stasenko wrote:
> On 16 March 2010 04:46, Andreas Raab <[hidden email]> wrote:
>> [...]
>> Don't "reset" the buffers; simply convert them to the proper
>> ascii/binary format.
>
> There could be an alternative approach:
> - keep the buffers in a single (binary) format and convert the output
>   depending on the mode.

This is exactly what we were doing in our own SocketStream-like class.
For a general-purpose SocketStream this might add some extra
complexity to the implementation, but it would also allow us to use
the stream primitives.

There's a third option if you want to optimize for rapid ascii/binary
mode changes: store both the binary and the ascii buffers and
fill/copy them in a lazy way.

Levente
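A rough sketch of that lazy dual-buffer idea (the binaryBuffer,
asciiBuffer and receiveData names are hypothetical, not SocketStream's
actual protocol):

  asciiBuffer
      "Derive the ascii view of the canonical binary cache on demand."
      ^asciiBuffer ifNil: [asciiBuffer := binaryBuffer asString]

  receiveData
      "Append newly arrived bytes to the binary cache and invalidate
      the derived ascii view, so it is rebuilt only when next used."
      binaryBuffer := binaryBuffer , socket receiveData asByteArray.
      asciiBuffer := nil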
On 16 March 2010 05:51, Andreas Raab <[hidden email]> wrote:
> On 3/15/2010 8:14 PM, Igor Stasenko wrote:
>> [...]
>
> [...] At the end of the day, switching between ascii and binary is a
> bit of a convenience function, which means that you probably
> shouldn't be writing high-performance code that depends on constantly
> switching between the two (I think that's a fair tradeoff). The
> next:into: family was specifically provided for high-performance
> situations, by providing a pre-allocated buffer and avoiding the
> allocation overhead.

Yes, #next:into: is convenient if you know the content size from the
start, or if you want to read everything into memory at once. But the
strategy you show above doesn't work well for all cases. For
persistent streams, used for exchanging data between peers, there is
no notion of 'read everything up to the end'; it is usually 'read what
is currently available', because the peers exchange data in real time
and you can't predict what will follow the last input.

My current intent is to make a fast reader which uses a socket as its
backend, and which:

- reads/parses the HTTP headers,
- handles chunked transfer encoding,
- handles utf8 content encoding,
- and only then feeds a consumer, which is a JSON parser that parses
  the input character by character and, like many other parsers,
  obviously has no use for #next:into:, using #peek and #next all the
  way.

The idea is to parse the data as it becomes available, instead of
reading everything up to the end and only then starting to parse. You
could ask why this is more effective. Because of network latency: a
client, instead of simply waiting for the next data packet to arrive,
can spend this time more productively by parsing the input that is
already available (besides, it will spend this time anyway, so why
waste it?). This means the results of parsing become available
earlier, compared to a scheme where you start parsing only when all
the data has arrived. Also, the bigger the content, the bigger the
win, not only in speed but also in memory consumption.

So I tend to look for designs where the socket stream is focused on
streaming the data and doesn't assume that its consumer prefers the
buffered approach (#next:into:) over the non-buffered one (#next).

> Cheers,
>  - Andreas

--
Best regards,
Igor Stasenko AKA sig.
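For illustration, a sketch of the consumer pattern described above,
assuming only that the stream answers #peek, #next and #atEnd
(blocking until data arrives), with a hypothetical scanTokenFrom: as
the parser's entry point:

  scanTokenFrom: aStream
      "Accumulate one alphabetic token character by character;
      parsing overlaps with network arrival because the stream only
      blocks when its buffer is empty."
      | token |
      token := WriteStream on: String new.
      [aStream atEnd not and: [aStream peek isLetter]]
          whileTrue: [token nextPut: aStream next].
      ^token contents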
2010/3/16 Igor Stasenko <[hidden email]>:
> [...]
>
> So I tend to look for designs where the socket stream is focused on
> streaming the data and doesn't assume that its consumer prefers the
> buffered approach (#next:into:) over the non-buffered one (#next).

Then you should really consider looking at the VW-XTream transforming:
stuff. The idea is to have parallel processing (pipelines). Of course,
we cannot have true parallelism yet in Smalltalk, but at least the
first stage can work with a non-blocking Squeak socket.

Nicolas
On 16 March 2010 10:57, Nicolas Cellier
<[hidden email]> wrote:
>> [...]
>>
>> So I tend to look for designs where the socket stream is focused on
>> streaming the data and doesn't assume that its consumer prefers the
>> buffered approach (#next:into:) over the non-buffered one (#next).
>
> Then you should really consider looking at the VW-XTream
> transforming: stuff. The idea is to have parallel processing
> (pipelines).

Err. Pipes are not parallel processing; they are sequential: the
output of one pipe is the input of the next. And sure thing, this is
how I think good streams should work. Too bad I have to use what we
have in Squeak/Pharo, or do everything from scratch.

> Of course, we cannot have true parallelism yet in Smalltalk, but at
> least the first stage can work with a non-blocking Squeak socket.
>
> Nicolas

--
Best regards,
Igor Stasenko AKA sig.
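A minimal sketch of such a pipeline stage (a hypothetical class, not
the XTream API): each stage wraps its upstream stream and transforms
elements on demand, so stages compose by nesting one inside another.

  Object subclass: #TransformStream
      instanceVariableNames: 'source transform'
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Sketch-Streams'

  TransformStream class >> on: aStream transform: aBlock
      ^self new setSource: aStream transform: aBlock

  TransformStream >> setSource: aStream transform: aBlock
      source := aStream.
      transform := aBlock

  TransformStream >> atEnd
      ^source atEnd

  TransformStream >> next
      "Pull one element from the upstream stage and transform it."
      ^transform value: source next

For example, a stage that upcases characters arriving from a socket
stream could be built as:

  TransformStream on: socketStream transform: [:each | each asUppercase].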
2010/3/16 Igor Stasenko <[hidden email]>:
>> [...]
>>
>> Then you should really consider looking at the VW-XTream
>> transforming: stuff. The idea is to have parallel processing
>> (pipelines).
>
> Err. Pipes are not parallel processing; they are sequential: the
> output of one pipe is the input of the next.

Well, Mr Ford understood that before us ;) Having only one object
working while the others are resting is not the most efficient way to
process a sequential stream.

Nicolas