Design problem: of External Memory, Strings, and Unicode


Chris Uppal-3
Hi,

I'm trying to make something work, but I'm not sure how to do it, or even
whether it's at all possible.  I know what I /want/ it to do, but I don't know
enough about external memory (etc) to know how close I can get to my aim.

I'll have to start with a bit of background.  I'm stalled on several projects
for lack of /real/ Unicode handling in Dolphin, so I decided to take a detour
and put something together.  (The UnicodeString class is no use whatsoever for
my purposes, indeed I think that "UnicodeString" is a misnomer -- it should be
called something like WideString since that's what it is (even incomplete as it
currently is)).

What I want is the ability to handle Unicode data in all of the defined
encodings (at least: UTF-8, UTF-16, UTF-32, and the truly, mind-bendingly,
weird encoding that Sun have defined for communicating with the Java VM).
Naturally I want the resulting objects to be as String-like as possible (I need
{Read/Write}Streams too, but I know how to handle that).  So what I've
currently got is a bunch of classes corresponding to each of the encodings
(UTF8String, UTF16String, etc).  Each object consists of a ByteArray (or
similar), plus a few housekeeping fields (sizeInBytes, sizeInCharacters, ...).
The encoding used to map between logical Unicode characters and the actual
bytes in the binary data is determined by the class of the object.  All that
works (as far as I've got, anyway).  One small relevant complexity is that I
want to be able to support implicitly null-terminated strings (following the
pattern of Dolphin Strings) but don't want to /force/ that, so I have to keep a
record of how many bytes are part of the explicit string, as distinct from the
size of the ByteArray itself.  That may help with other stuff too...
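
A minimal sketch of the layout described above, written in Python rather than
Dolphin Smalltalk purely for illustration.  The class name UTF8Buffer and its
methods are hypothetical; only the idea of a byte buffer plus sizeInBytes and
sizeInCharacters housekeeping comes from the post.  It assumes the optional
terminator is a single zero byte kept outside the explicit string.

```python
class UTF8Buffer:
    """Hypothetical sketch: UTF-8 text in a byte buffer that may be longer than the text."""

    def __init__(self, text, terminated=True):
        encoded = text.encode("utf-8")
        # Reserve one extra zero byte only when an implicit terminator is wanted.
        self.buffer = bytearray(encoded) + (b"\x00" if terminated else b"")
        # Bytes that belong to the explicit string (terminator excluded).
        self.size_in_bytes = len(encoded)
        # Every byte that is not a continuation byte (10xxxxxx) starts a code point.
        self.size_in_characters = sum(1 for b in encoded if (b & 0xC0) != 0x80)

    def text(self):
        # Decode only the explicit part of the buffer.
        return bytes(self.buffer[:self.size_in_bytes]).decode("utf-8")


s = UTF8Buffer("na\u00efve \u20ac")
print(s.size_in_bytes, s.size_in_characters, len(s.buffer))
```

Keeping size_in_bytes separate from the buffer length is what lets the same
object work with or without the trailing zero byte.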

Anyway, that's the background.  Now what I'm looking at is how best to use
these things for external interfacing.  I can see how to pass the things out to
external code that expects a byte buffer in <whatever> encoding, but how do I
handle the reverse ?   Say that an external function returns a pointer to a
null-terminated "string" in UTF-8 format.  The easy, but unsatisfactory, way to
call it would be to declare the method (to Dolphin) as returning void* or
similar, and then leave it up to the custom wrapping code to create a
UTF8String which either wrapped the corresponding ExternalAddress (or would it
be an LPVOID -- I've never understood the difference ?) or which wrapped a
ByteArray created by copying the external bytes.

What I really want is to be able to handle such strings more transparently,
much as the Dolphin VM does for 8-bit "native" strings.  I'd like to be able to
declare the external method as something like:
    <stdcall: UTF8String someFunction ...args...>
and have the VM automatically create an instance of UTF8String which wraps the
external address.  I don't know if that's possible (I don't want to make my
strings inherit from ExternalStructure) -- it may even be that it would "just
work" with my code more-or-less as it is at present (the UTF8String's "bytes"
instvar is at index 1, which I suspect may be necessary).  Alternatively, I
would be happy if the VM could be persuaded to copy the external bytes into a
(null-terminated) ByteArray and wrap a UTF8String around that (sort of like how
I think it handles Strings).  I suspect that would require special VM magic,
though.
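
The copy-on-return behaviour wished for above can be sketched with Python's
ctypes module (not Dolphin; the helper name utf8_from_address is hypothetical).
The sketch assumes the external code hands back the address of a
zero-terminated UTF-8 buffer, and it simulates that buffer locally instead of
calling a real library.

```python
import ctypes


def utf8_from_address(address):
    """Copy bytes from address up to (not including) the first zero byte, then decode."""
    raw = ctypes.string_at(address)  # no size given: reads up to the terminator
    return raw.decode("utf-8")


# Stand-in for memory owned by external code; create_string_buffer appends a zero byte.
external = ctypes.create_string_buffer("gr\u00fc\u00df dich".encode("utf-8"))
address = ctypes.addressof(external)

copied = utf8_from_address(address)
print(copied)
```

Because the bytes are copied, the result stays valid even if the external
buffer is later changed or freed, which is the main attraction of copying over
wrapping the address.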

(I'm assuming null-terminated "strings" because if the API is defined to take,
say, a void** and a size_t* as parameters that it fills in with a pointer to
the buffer and its size, then there's no way that the VM can handle that
automatically for me -- I think...).

Ideally, I would like my string objects to be able to wrap either ByteArrays or
external addresses, but I could live with a split, so that UTF8String (and all
the others) existed as both UTF8String (a subclass of AbstractUnicodeString,
under SequenceableCollection, with a complete Collections+Unicode
implementation, but which could /only/ wrap ByteArrays) plus a cut-down
ExternalUTF8String (a subclass of ExternalStructure, with a limited protocol,
but which can wrap ExternalAddress/LPVOIDs).  The actual encoding/decoding
logic is split out into separate objects anyway, so having internal and
external flavours of each kind of string wouldn't involve too much code
duplication.

I hope all that made some kind of sense.  Anyone got any ideas, suggestions, or
more information on how the automatic creation of object-wrappers works ?

Thanks for reading.

    -- chris



Re: Design problem: of External Memory, Strings, and Unicode

Blair McGlashan-3
"Chris Uppal" <[hidden email]> wrote in message
news:[hidden email]...

> ....
> What I want is the ability to handle Unicode data in all of the defined
> encodings (at least: UTF-8, UTF-16, UTF-32, and the truly, mind-bendingly,
> weird encoding that Sun have defined for communicating with the Java VM).
> Naturally I want the resulting objects to be as String-like as possible
> ...
>...  One small relevant complexity is that I
> want to be able to support implicitly null-terminated strings (following
> the pattern of Dolphin Strings) but don't want to /force/ that, so I have
> to keep a record of how many bytes are part of the explicit string, as
> distinct from the size of the ByteArray itself.  That may help with other
> stuff too...
>

Why bother allowing both cases? Dolphin itself doesn't rely on strings being
null-terminated, but it always null-terminates its own strings at creation
time by allocating an extra character which is implicitly initialized to zero
as a result of the normal memory initialization performed by the object
memory. The extra space is not included in the size reported for the object.
This behaviour is controlled by a behaviour bit that specifies whether
instances are null-terminated or not - it's only relevant to byte objects
though. Why not take advantage of that for your UTF strings?

> Anyway, that's the background.  Now what I'm looking at is how best to use
> these things for external interfacing.

Well first off I would be careful about conflating the internal and
external. I suggest you consider providing a separate class to represent
externally created objects should you need them. If we were building this
into the base system we would first build UTF strings for internal
manipulation, so those would go in the collection hierarchy and probably
involve some refactoring of String itself. External interface usage would be
a secondary consideration, and most likely require some additional classes
in the ExternalStructure hierarchy.

>...I can see how to pass the things out to
> external code that expects a byte buffer in <whatever> encoding, but how
> do I handle the reverse ?  Say that an external function returns a pointer
> to a null-terminated "string" in UTF-8 format.  The easy, but
> unsatisfactory, way to call it would be to declare the method (to Dolphin)
> as returning void* or similar, and then leave it up to the custom wrapping
> code to create a UTF8String which either wrapped the corresponding
> ExternalAddress (or would it be an LPVOID -- I've never understood the
> difference ?) or which wrapped a ByteArray created by copying the external
> bytes.

Firstly let's cover LPVOID vs ExternalAddress - it may help. The purpose of
LPVOID is to represent situations where you need a reference to an address
(i.e. a double-indirection). It lives in the ExternalStructure hierarchy,
which is able to represent both a "value" instance and a "reference"
instance. Value instances hold an internally allocated ByteArray. Reference
instances hold an ExternalAddress instance which points at the data, which
has usually been allocated from some external heap. ExternalAddress has a
special behaviour bit set to indicate to the VM that it is an "indirection"
object, so that it is implicitly indirected in certain primitives. Value
instances of LPVOID are not useful - it is always used with reference
instances to represent a pointer to a pointer.

Actually LPVOID is needed in very few cases - typically only in callback
situations or sometimes when doubly-indirected pointers are embedded in
arrays or structures. The majority of the time ExternalAddress instances are
used.
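
As a rough analogy only (Python's ctypes, not Dolphin's ExternalStructure
machinery), the value/reference distinction and the extra level of indirection
can be sketched as follows.  The Point structure and all variable names are
hypothetical, and the "external" memory is simulated by a local object.

```python
import ctypes


class Point(ctypes.Structure):
    _fields_ = [("x", ctypes.c_int32), ("y", ctypes.c_int32)]


# Stand-in for a structure living in memory owned by external code.
external = Point(1, 2)
address = ctypes.addressof(external)

# "Reference" flavour: no copy; reads and writes go through to the external memory.
by_reference = Point.from_address(address)
by_reference.x = 10

# "Value" flavour: a private copy of the bytes, independent of the external memory.
by_value = Point.from_buffer_copy(bytes(external))
by_value.y = 99

# A pointer-sized cell that itself holds the address: one extra level of indirection.
cell = ctypes.c_void_p(address)
via_cell = Point.from_address(cell.value)

print(external.x, external.y, by_value.x, by_value.y, via_cell.x)
```

The write through by_reference is visible in external, while the write to
by_value is not; the cell plays the part of a pointer to a pointer.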

>
> What I really want is to be able to handle such strings more
> transparently, much as the Dolphin VM does for 8-bit "native" strings.
> I'd like to be able to declare the external method as something like:
>    <stdcall: UTF8String someFunction ...args...>
> and have the VM automatically create an instance of UTF8String which wraps
> the external address.  I don't know if that's possible (I don't want to
> make my strings inherit from ExternalStructure) -- it may even be that it
> would "just work" with my code more-or-less as it is at present (the
> UTF8String's "bytes" instvar is at index 1, which I suspect may be
> necessary).

It is possible, and it probably will just work. The classes do not have to
be ExternalStructures, but they have to be shaped like them. The VM has
fairly flexible capabilities for creating return values, and for creating
objects passed to callbacks (which amounts to the same thing). You can
return a "structure" by value by declaring it as in your example - of course
the VM has to know how large the object is. It gets this information by
accessing the byte size information held in some extra behaviour bits. If
you browse the ExternalStructure hierarchy you will be able to find where
this gets set. The VM will then create the declared object type, and an
instance of ByteArray of the byte size stored in the class. It stores this
byte array in the first instance variable of the structure object. It also
copies the data, either from the stack, or from registers, depending on the
size of the structure and the calling convention, into the ByteArray. The VM
can also create byte objects directly to represent structure values, which
it will do if the structure class in the declaration is a byte class and
again has the byte size encoded in the behaviour bits. GUID is an example of
such a class in the image.
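
The return-by-value steps described above (look up the byte size recorded for
the class, allocate that many bytes, copy the data in, store it in the first
slot) can be imitated in Python with ctypes.  This is only a sketch of the
idea, not of the VM: the Guid class, its BYTE_SIZE attribute, and
return_by_value are hypothetical names, and BYTE_SIZE merely stands in for the
size held in the behaviour bits.

```python
import ctypes
import uuid


class Guid:
    BYTE_SIZE = 16  # hypothetical stand-in for the size recorded against the class

    def __init__(self, data):
        self.bytes = data  # first (and only) field holds the copied bytes


def return_by_value(cls, address):
    """Create an instance of cls and copy cls.BYTE_SIZE bytes from address into it."""
    return cls(ctypes.string_at(address, cls.BYTE_SIZE))


# Stand-in for a 16-byte GUID sitting in memory owned by external code.
source = uuid.UUID("12345678-1234-5678-1234-567812345678")
external = (ctypes.c_ubyte * 16)(*source.bytes_le)

g = return_by_value(Guid, ctypes.addressof(external))
print(uuid.UUID(bytes_le=g.bytes))
```

The point of the fixed size is that the copy needs no terminator and no length
argument, which is why this route suits structures but not variable-length
strings.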


>...Alternatively, I
> would be happy if the VM could be persuaded to copy the external bytes
> into a (null-terminated) ByteArray and wrap a UTF8String around that (sort
> of like how I think it handles Strings).  I suspect that would require
> special VM magic, though.
>
> (I'm assuming null-terminated "strings" because if the API is defined to
> take, say, a void** and a size_t* as parameters that it fills in with a
> pointer to the buffer and its size, then there's no way that the VM can
> handle that automatically for me -- I think...).

Correct. To marshal such cases automatically at the VM level there would
need to be more information in the declarations - i.e. it would need to be
more like IDL where direction is specified and also the relationship between
the size parameter and the buffer.

>
> Ideally, I would like my string objects to be able to wrap either
> ByteArrays or external addresses, but I could live with a split, so that
> UTF8String (and all the others) existed as both UTF8String (a subclass of
> AbstractUnicodeString, under SequenceableCollection, with a complete
> Collections+Unicode implementation, but which could /only/ wrap
> ByteArrays) plus a cut-down ExternalUTF8String (a subclass of
> ExternalStructure, with a limited protocol, but which can wrap
> ExternalAddress/LPVOIDs).  The actual encoding/decoding logic is split out
> into separate objects anyway, so having internal and external flavours of
> each kind of string wouldn't involve too much code duplication.
>

As I say, I would recommend separate classes along the lines of the design
you suggest, although for efficiency reasons I think you can and should
avoid the indirection to a ByteArray by using byte classes directly. Of
course if that makes the implementation more complex or confusing it can be
left for a later exercise. Generally speaking you need to explicitly marshal
externally allocated data anyway at some point, so it can get confusing if
you try to do it all in one class.

> I hope all that made some kind of sense.  Anyone got any ideas,
> suggestions, or more information on how the automatic creation of
> object-wrappers works ?
>
> Thanks for reading.
>

Hope this helps

Regards

Blair



Re: Design problem: of External Memory, Strings, and Unicode

Chris Uppal-3
Blair,

> Hope this helps

It does indeed.  Many thanks for the explanations and suggestions; I can -- I
think -- see where I'm going with this now.

    -- chris