
The .changes file should be bound to a single image


Chris Muller-3

Re: [Pharo-dev] [squeak-dev] The .changes file should be bound to a single image

>> In practice, this is not an issue that either Chris or I have noticed,
>> probably because we are not doing software development (saving method
>> changes) at the same time that we are running RemoteTask and similar.
>> But I can certainly see how it might be a problem if, for example, I
>> had a bunch of students running the same image from a network shared
>> folder.
>
> Maybe it's time to consider a fundamental change in how method-sources
> are referred to.
> Taking inspiration from git... A content addressable key-value file
> store might solve concurrent access. Each CompiledMethod gets written
> to a file named for the hash of its contents, which is the only
> reference the Image gets to a method's source. Each such file would

It sounds like a lot of files.. so how would I move an image to
another computer? I gotta know which files go with which image?

Plus, it doesn't really solve the fundamental problem of two images
writing to the same file. Multiple images could still change the same
method to the same contents at the same time. You may have made the
problem less-likely, except for when you have your first
hash-collision of *different* sources (it COULD happen), in which case
it wouldn't even require the changes to occur at the same time.

I guess it would also lose the order-sequence of the change log too...
unless you were to try to use the underlying filesystem's timestamps
on each file but... it wouldn't work after I've copied all the files
via scp, because they all get new timestamps...

Might be better to teach the class, who are learning about Smalltalk
anyway, about the nature of the changes file..?

Chris Muller-3

Re: [Pharo-dev] [squeak-dev] The .changes file should be bound to a single image

Another thought...

Upon launching of the image, start a temporary changes file,
[image-name]-[some UUID].changes.

Upon image save, append the temp changes file to the main changes
file, but in an atomic way (first do the append as a new unique
filename, then rename it to the original changes file name).

Hmm, but then we would have to check two changes files when accessing sources..
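
A minimal sketch of the per-session file plus atomic merge described above, written in Python rather than Smalltalk so it can be run on its own. The function names `start_session_log` and `merge_on_save` are hypothetical, and the sketch assumes the main and temporary files live on the same filesystem, since that is what makes the final rename atomic.

```python
import os
import shutil
import tempfile
import uuid


def start_session_log(image_name: str, directory: str) -> str:
    """Create an empty per-session changes file and return its path."""
    path = os.path.join(directory, f"{image_name}-{uuid.uuid4()}.changes")
    # Mode "x" fails if the file already exists, so two sessions never share one.
    with open(path, "x", encoding="utf-8"):
        pass
    return path


def merge_on_save(main_path: str, session_path: str) -> None:
    """Append the session log to the main changes file via write-new-then-rename."""
    directory = os.path.dirname(os.path.abspath(main_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".changes.tmp")
    try:
        with os.fdopen(fd, "wb") as out:
            # Copy the existing main file (if any), then the session's chunks.
            if os.path.exists(main_path):
                with open(main_path, "rb") as main:
                    shutil.copyfileobj(main, out)
            with open(session_path, "rb") as session:
                shutil.copyfileobj(session, out)
            out.flush()
            os.fsync(out.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, main_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # Empty the session file so the next save only appends new chunks.
    open(session_path, "wb").close()
```

Note that the rename only protects readers from half-written files. If two images run `merge_on_save` at the same moment, each copies the old main file and the later rename wins, so one session's chunks would be lost unless the merge is also serialised by a lock. Copying the whole changes file on every save is also costly for large files.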

On Thu, Jun 30, 2016 at 3:10 PM, Chris Muller <[hidden email]> wrote:

>>> In practice, this is not an issue that either Chris or I have noticed,
>>> probably because we are not doing software development (saving method
>>> changes) at the same time that we are running RemoteTask and similar.
>>> But I can certainly see how it might be a problem if, for example, I
>>> had a bunch of students running the same image from a network shared
>>> folder.
>>
>> Maybe it's time to consider a fundamental change in how method-sources
>> are referred to.
>> Taking inspiration from git... A content addressable key-value file
>> store might solve concurrent access. Each CompiledMethod gets written
>> to a file named for the hash of its contents, which is the only
>> reference the Image gets to a method's source. Each such file would
>
> It sounds like a lot of files.. so how would I move an image to
> another computer? I gotta know which files go with which image?
>
> Plus, it doesn't really solve the fundamental problem of two images
> writing to the same file. Multiple images could still change the same
> method to the same contents at the same time. You may have made the
> problem less-likely, except for when you have your first
> hash-collision of *different* sources (it COULD happen), in which case
> it wouldn't even require the changes to occur at the same time.
>
> I guess it would also lose the order-sequence of the change log too...
> unless you were to try to use the underlying filesystem's timestamps
> on each file but... it wouldn't work after I've copied all the files
> via scp, because they all get new timestamps...
>
> Might be better to teach the class, who are learning about Smalltalk
> anyway, about the nature of the changes file..?

John Pfersich-2

Re: [Pharo-dev] [squeak-dev] The .changes file should be bound to a single image

Sounds like a better idea to me, but I don't think it would solve the problem of multiple images almost simultaneously attempting to update themselves (as in a classroom).

> On Jun 30, 2016, at 13:31, Chris Muller <[hidden email]> wrote:
>
> Another thought...
>
> Upon launching of the image, start a temporary changes file,
> [image-name]-[some UUID].changes.
>
> Upon image save, append the temp changes file to the main changes
> file, but in an atomic way (first do the append as a new unique
> filename, then rename it to the original changes file name).
>
> Hmm, but then we would have to check two changes files when accessing sources..
>
> On Thu, Jun 30, 2016 at 3:10 PM, Chris Muller <[hidden email]> wrote:
>>>> In practice, this is not an issue that either Chris or I have noticed,
>>>> probably because we are not doing software development (saving method
>>>> changes) at the same time that we are running RemoteTask and similar.
>>>> But I can certainly see how it might be a problem if, for example, I
>>>> had a bunch of students running the same image from a network shared
>>>> folder.
>>>
>>> Maybe it's time to consider a fundamental change in how method-sources
>>> are referred to.
>>> Taking inspiration from git... A content addressable key-value file
>>> store might solve concurrent access. Each CompiledMethod gets written
>>> to a file named for the hash of its contents, which is the only
>>> reference the Image gets to a method's source. Each such file would
>>
>> It sounds like a lot of files.. so how would I move an image to
>> another computer? I gotta know which files go with which image?
>>
>> Plus, it doesn't really solve the fundamental problem of two images
>> writing to the same file. Multiple images could still change the same
>> method to the same contents at the same time. You may have made the
>> problem less-likely, except for when you have your first
>> hash-collision of *different* sources (it COULD happen), in which case
>> it wouldn't even require the changes to occur at the same time.
>>
>> I guess it would also lose the order-sequence of the change log too...
>> unless you were to try to use the underlying filesystem's timestamps
>> on each file but... it wouldn't work after I've copied all the files
>> via scp, because they all get new timestamps...
>>
>> Might be better to teach the class, who are learning about Smalltalk
>> anyway, about the nature of the changes file..?
>

Ben Coman

Re: [Pharo-dev] [squeak-dev] The .changes file should be bound to a single image

In reply to this post by Chris Muller-3

On Fri, Jul 1, 2016 at 4:10 AM, Chris Muller <[hidden email]> wrote:

Yes, that would be a sticking point. You couldn't just grab any saved
Image file off disk. The image would first need to generate an archive
transfer file. The exception would be if these methods were automatically
pushed through to a private web service; then, presuming pervasive web
access, that sleeping Image would pull down its sources wherever it
boots back up (which, even if that would be cool, is not the problem of
the original post).

>
> Plus, it doesn't really solve the fundamental problem of two images
> writing to the same file. Multiple images could still change the same
> method to the same contents at the same time.

The hash-named-file would never be written to twice. It's a fixed
point in space-time ;)
A second image with the same hash would write the *same* contents, so
there is no need to write.
If the hash-named-file exists, do nothing. To handle any race
condition between checking file existence and writing to it, the first
image could take an exclusive write lock.

> You may have made the
> problem less-likely, except for when you have your first
> hash-collision of *different* sources (it COULD happen),

Some equivalent things...

* Pick a random atom from the volume of the moon, then another random
pick gets the same atom.
http://stackoverflow.com/a/23253149

* Win the national lottery 11 times in a row
http://stackoverflow.com/a/29146396

* Your chances of winning the Powerball lottery are far better than
finding a hash collision. After all, lotteries often have actual
winners. The probability of a hash collision is more like a lottery
that has been running since prehistoric times and has never had a
winner and will probably not have a winner for billions of years.
http://ericsink.com/vcbe/html/cryptographic_hashes.html

> in which case it wouldn't even require the changes to occur at the same time.

When the second Image finds the hash-named-file already exists,
it could check the contents and flag an error if they don't match,
so at least it's not a silent error. The same when integrating
different repositories.
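
The write-once rule and the mismatch check can be sketched together as follows. This is an illustration in Python, not Squeak code: `store_source`, `load_source` and `SourceCollision` are hypothetical names, SHA-256 is an assumed choice of hash, and exclusive file creation (`O_EXCL`) stands in for the exclusive write lock mentioned above.

```python
import hashlib
import os


class SourceCollision(Exception):
    """Raised when a hash-named file holds different bytes than expected."""


def store_source(store_dir: str, source: str) -> str:
    """Store source under the hex digest of its contents and return that key."""
    data = source.encode("utf-8")
    key = hashlib.sha256(data).hexdigest()
    path = os.path.join(store_dir, key)
    # O_EXCL makes creation fail if the file exists, so only one writer wins.
    # O_BINARY only exists on Windows; elsewhere it contributes nothing.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL | getattr(os, "O_BINARY", 0)
    try:
        fd = os.open(path, flags, 0o644)
    except FileExistsError:
        # Already stored: verify instead of rewriting, so a mismatch is loud.
        with open(path, "rb") as existing:
            if existing.read() != data:
                raise SourceCollision(key)
        return key
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return key


def load_source(store_dir: str, key: str) -> str:
    """Return the source text stored under key."""
    with open(os.path.join(store_dir, key), "rb") as f:
        return f.read().decode("utf-8")
```

One limitation of this sketch: a second image that checks the file while the first is still writing it would see a partial file and report a false mismatch. Writing to a temporary name and renaming into place would avoid that, and the behaviour of exclusive creation on network filesystems may vary.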

>
> I guess it would also lose the order-sequence of the change log too...
> unless you were to try to use the underlying filesystem's timestamps
> on each file but... it wouldn't work after I've copied all the files
> via scp, because they all get new timestamps...

Good point. This would complicate changes-replay for a crashed image.
Although this case is only important "now" and could be handled by a
"/tmp/${username}.${last-image-save-checkpoint-id}" file that records
the order of commits for a session, that would be checked for on Image
startup - which is similar to what you already suggested...

> Upon launching of the image, start a temporary changes file,
> [image-name]-[some UUID].changes.
>
> Upon image save, append the temp changes file to the main changes
> file, but in an atomic way (first do the append as a new unique
> filename, then rename it to the original changes file name).
>

Good idea. This would eliminate the need for my idea here. You'd
need some way to match the UUID with the Image being opened, so I
guess the UUID would need to be stored in the saved Image and be constant
for the session, and be updated each save of the Image. The temporary
changes filename could include username to distinguish between users.
If the same user opens an Image twice, there would be two files and
upon recovering from a crash the user would be presented a choice
between the two files.

>
> Might be better to teach the class, who are learning about Smalltalk
> anyway, about the nature of the changes file..?

This seemed more of a classroom system administration issue. Actually
in that case, maybe the network executable startup script just copied
both image and changes file to the user's personal area?
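
As a rough illustration of that startup step, here is a Python sketch. The helper name `private_copy` and the file layout are assumptions; a real classroom setup would more likely be a shell or batch script that then launches the VM on the copied image.

```python
import os
import shutil


def private_copy(shared_image: str, user_dir: str) -> str:
    """Copy an image and its .changes file into user_dir; return the new image path."""
    base, _ = os.path.splitext(shared_image)
    os.makedirs(user_dir, exist_ok=True)
    copies = []
    for src in (base + ".image", base + ".changes"):
        dst = os.path.join(user_dir, os.path.basename(src))
        # Do not overwrite: a student's earlier work must survive a relaunch.
        if not os.path.exists(dst):
            shutil.copy2(src, dst)
        copies.append(dst)
    return copies[0]
```

Each student then works on a private image/changes pair, so the shared changes file is only ever read.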

cheers -ben

Eliot Miranda-2

Re: [Pharo-dev] [squeak-dev] The .changes file should be bound to a single image

In reply to this post by Ben Coman

Ben,

> On Jun 29, 2016, at 9:48 PM, Ben Coman <[hidden email]> wrote:
>
>> On Thu, Jun 30, 2016 at 7:07 AM, David T. Lewis <[hidden email]> wrote:
>>> On Wed, Jun 29, 2016 at 02:00:19PM -0400, David T. Lewis wrote:
>>> Let's not solve the wrong problem folks. I only looked at this for 10
>>> minutes this morning, and I think (but I am not sure) that the issue
>>> affects the case of saving the image, and that the normal writing of
>>> changes is fine.
>>
>> I am wrong.
>>
>> I spent some more time with this, and it is clear that two images saving
>> chunks to the same changes file will result in corrupted change records
>> in the changes file. It is not just an issue related to the image save
>> as I suggested above.
>>
>> In practice, this is not an issue that either Chris or I have noticed,
>> probably because we are not doing software development (saving method
>> changes) at the same time that we are running RemoteTask and similar.
>> But I can certainly see how it might be a problem if, for example, I
>> had a bunch of students running the same image from a network shared
>> folder.
>
> Maybe it's time to consider a fundamental change in how method-sources
> are referred to.

The changes file is not merely the repository for the sources of newly minted methods. It is also a log file, a crash recovery mechanism. It is simple. It works. You propose something horribly complex to solve a problem that a) doesn't affect very many people, b) is easy to work around and c) is feasible to fix with a well-known approach. It doesn't wash for me.

> Taking inspiration from git... A content addressable key-value file
> store might solve concurrent access. Each CompiledMethod gets written
> to a file named for the hash of its contents, which is the only
> reference the Image gets to a method's source. Each such file would
> *only* need be written once and thereafter could be read
> simultaneously by multiple Images. Anyone on the network wanting to
> store the same source would see the file already exists and have
> nothing to do.
> Perhaps having many individual files implies abysmal performance,
>
> Or maybe something similar to Mercurial's revlog format [1] could be
> used, one file per class.
>
> The thing about the Image *only* referring to a method's source by its
> content hash would seem to give great flexibility in backends to
> locate/store that source. Possibly...
> * stored as individual files as above
> * bundled in a zip file in random order
> * a school could configure a database server in Image provided to students
> * hashes could be thrown at a service on the Internet
> * cached locally with a key-value database like LMDB [2]
> * remote replication to multiple internet backup locations
> * in an emergency you could throw bundle of hashes as a query to the
> mail list and get an adhoc response of individual files.
> * Inter-Smalltalk image communication
>
> Pharo has a stated goal to get rid of the changes file. Changing to
> content-hash-addressable method-source seems a logical step along
> that road. Even if the Squeak community doesn't want to go so far as
> eliminating the .changes file, can they see value in changing method
> source references to be content-hashes rather than indexes into a
> particular file?
>
> [1] http://blog.prasoonshukla.com/mercurial-vs-git-scaling
> [2] https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database
>
>
> Just having a poke at this, it seems a new form of
> CompiledMethodTrailer may need to be defined, being invoked from
> CompiledMethod>>sourceCode. CompiledMethodTrailer>>sourceCode would
> find the source code based on a content-hash held by the
> CompiledMethod. If found, the call to #getSourceFromFile that
> accesses the .changes file will be bypassed, and could remain as a
> backup.
>
> cheers -ben
>
>>
>> Dave
>>
>>
>>>
>>> Max was running on Pharo, which may or may not be handling changes the
>>> same way. I think he may be seeing a different problem from the one I
>>> confirmed.
>>>
>>> So a bit more testing and verification would be in order. I can't look at
>>> it now though.
>>>
>>> Dave
>>>
>>>>
>>>>> On 29-06-2016, at 10:35 AM, Eliot Miranda <[hidden email]>
>>>>> wrote:
>>>> {snip much rant}
>>>>
>>>>> The most obvious place where this is an issue is where two images are
>>>>> using the same changes file and think they're appending. Image A seeks
>>>>> to the end of the file, 'writes' stuff. Image B near-simultaneously
>>>>> does the same. Eventually each process gets around to pushing data to
>>>>> hardware. Oops! And let's not dwell too much on the problems possible
>>>>> if either process causes a truncation of the file. Oh, wait, I think we
>>>>> actually had a problem with that some years ago.
>>>>>
>>>>> The thing is that this problem bites even if we have a unitary primitive
>>>>> that both positions and writes if that primitive is written above a
>>>>> substrate that, as unix and stdio streams do, separates positioning from
>>>>> writing. The primitive is neat but it simply drives the problem further
>>>>> underground.
>>>>
>>>>
>>>> Oh absolutely - we only have real control over a small part of it. It
>>>> would probably be worth making use of that where we can.
>>>>
>>>>>
>>>>> A more robust solution might be to position, write, reposition, read,
>>>>> and compare, shortening on corruption, and retrying, using exponential
>>>>> back-off like ethernet packet transmission. Most of the time this adds
>>>>> only the overhead of reading what's written.
>>>>
>>>> Yes, for anything we want reliable that's probably a good way. A limit
>>>> on the number of retries would probably be smart to stop infinite
>>>> recursion. Imagine the fun of an error causing infinite retries of writing
>>>> an error log about an infinite recursion. On an infinitely large Beowulf
>>>> cluster!
>>>>
>>>> It's all yet another example of where software meeting reality leads to
>>>> nightmares.
>>>>
>>>>
>>>> tim
>>>> --
>>>> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
>>>> If it was easy, the hardware people would take care of it.
>
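
The "position, write, reposition, read, compare" scheme quoted above, together with the suggested retry limit, might look roughly like this. It is a sketch in Python rather than a VM primitive; `append_verified` and its default retry count and delay are assumptions chosen for illustration.

```python
import os
import time


def append_verified(path: str, chunk: bytes,
                    max_retries: int = 5, base_delay: float = 0.01) -> int:
    """Append chunk, read it back, and retry with back-off if it was clobbered.

    Returns the file offset at which the chunk was stored.
    """
    open(path, "ab").close()  # make sure the file exists before opening r+b
    for attempt in range(max_retries):
        with open(path, "r+b") as f:
            offset = f.seek(0, os.SEEK_END)   # position
            f.write(chunk)                    # write
            f.flush()
            os.fsync(f.fileno())
            f.seek(offset)                    # reposition
            if f.read(len(chunk)) == chunk:   # read and compare
                return offset
            f.truncate(offset)                # shorten on corruption
        # Exponential back-off: wait 1x, 2x, 4x, ... the base delay.
        time.sleep(base_delay * (2 ** attempt))
    raise OSError(f"could not append to {path} after {max_retries} attempts")
```

The check narrows the window but does not close it: another image can still overwrite the chunk after the comparison succeeds, and truncating on a mismatch may discard bytes the other image has just written. The bounded loop addresses the concern about retrying forever.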

John Pfersich-2

Re: [Pharo-dev] [squeak-dev] The .changes file should be bound to a single image

In reply to this post by Ben Coman

>>
>> Might be better to teach the class, who are learning about Smalltalk
>> anyway, about the nature of the changes file..?
>
> This seemed more of a classroom system administration issue. Actually
> in that case, maybe the network executable startup script just copied
> both image and changes file to the user's personal area?
>
> cheers -ben
>

This is the best idea of all...

Max Leske

Re: [Pharo-dev] [squeak-dev] The .changes file should be bound to a single image

In reply to this post by Max Leske

It’s nice to see the enthusiasm (both pro and con) on this issue. I just want to clarify that it has nothing to do with a classroom setting, where the changes file is being shared or copied so students have access. I have run into the corrupted .changes file problem myself a couple of times, mainly for two reasons:

a) I’ve done a lot of work but need to check something against code that wasn’t modified (and no, checking package changes in Monticello wouldn’t help in the case I’m thinking of. Imagine for example a huge refactoring across multiple packages). So I open a second copy of the image. I keep both images open because it’s convenient but at some point I accidentally make a change in the wrong image. Now I’m screwed.
b) I forgot that I already had the image running (e.g. minimised). I start a fresh copy and work on it until I realise that some of my method sources are broken. Again: screwed.

Another thing I want to mention is that the semantics of flush depend on the operating / file system (I have experienced this first hand between Linux (ext4) and OS X (HFS+)). Just because you’ve flushed your buffer doesn’t mean that the contents have actually been written to the file. So while it may be true that there is a #flush missing somewhere I would not expect that adding the #flush will solve the problem entirely (which is one reason for proposing a locking mechanism in the first place).
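
To make the two points concrete, here is a small Python sketch of a lock that binds a changes file to one running image, plus an append that goes one step beyond a plain flush. `ChangesLock` and `durable_append` are hypothetical names, and the lock-file convention (`<changes file>.lock`) is an assumption, not something Squeak or Pharo does today.

```python
import os


class ChangesLock:
    """Lock-file guard allowing one image at a time per .changes file."""

    def __init__(self, changes_path: str):
        self.lock_path = changes_path + ".lock"
        self.fd = None

    def acquire(self) -> bool:
        """Return True if this process now owns the changes file."""
        try:
            # Exclusive creation fails if another image already holds the lock.
            self.fd = os.open(self.lock_path,
                              os.O_WRONLY | os.O_CREAT | os.O_EXCL)
        except FileExistsError:
            return False
        os.write(self.fd, str(os.getpid()).encode("ascii"))
        return True

    def release(self) -> None:
        if self.fd is not None:
            os.close(self.fd)
            os.remove(self.lock_path)
            self.fd = None


def durable_append(path: str, chunk: bytes) -> None:
    """Append chunk, then push it past both the process and the OS buffers."""
    with open(path, "ab") as f:
        f.write(chunk)
        f.flush()             # process buffer -> operating system
        os.fsync(f.fileno())  # operating system -> storage device (best effort)
```

A second image that gets `False` from `acquire` could open the changes file read-only or warn the user. A crashed image leaves a stale lock file behind, so a real implementation would need a way to detect that, for example by checking the recorded process id. Even `os.fsync` is platform dependent: on macOS it does not force the drive to empty its own cache, which fits the observation that flushing behaves differently across systems.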

Cheers,
Max