The Inbox: Monticello-mva.667.mcz

The Inbox: Monticello-mva.667.mcz

commits-2
A new version of Monticello was added to project The Inbox:
http://source.squeak.org/inbox/Monticello-mva.667.mcz

==================== Summary ====================

Name: Monticello-mva.667
Author: mva
Time: 6 April 2017, 8:41:09.494386 pm
UUID: 0075cba1-70ff-4e10-9be6-0c01f85dc85a
Ancestors: Monticello-eem.666

New style diffy version (*.mcd): Prune ancestors version infos in the info of the base of the diff when writing a diffy version. Graft them back from the diff's base version when reading a diffy version unless base info already has ancestors (old-style diffy version with complete version history info) in which case leave them alone.

=============== Diff against Monticello-eem.666 ===============

Item was added:
+ ----- Method: MCMcdReader>>loadVersionInfo (in category 'loading') -----
+ loadVersionInfo
+ | baseInfo |
+ super loadVersionInfo.
+ baseInfo := self baseInfo.
+ info graftAncestorsTo: baseInfo from:
+ (MCRepositoryGroup default versionWithInfo: baseInfo) info!

Item was added:
+ ----- Method: MCMcdWriter>>writeVersion: (in category 'visiting') -----
+ writeVersion: aVersion
+ self writeFormat.
+ self writePackage: aVersion package.
+ self writeVersionInfo:
+ (aVersion info veryDeepCopy
+ pruneAncestorsFrom: aVersion baseInfo).
+ self writeDefinitions: aVersion.
+ aVersion dependencies do: [:ea | self writeVersionDependency: ea]!

Item was added:
+ ----- Method: MCVersionInfo>>ancestors: (in category 'accessing') -----
+ ancestors: anObject
+ ancestors := anObject!

Item was added:
+ ----- Method: MCVersionInfo>>graftAncestorsTo:from: (in category 'copying') -----
+ graftAncestorsTo: aBaseVersionInfo from: aVersionInfo
+ (self allAncestors select: [:e | e = aBaseVersionInfo])
+ do: [:e | e ancestors isEmpty ifTrue: [e ancestors: aVersionInfo ancestors]]!

Item was added:
+ ----- Method: MCVersionInfo>>pruneAncestorsFrom: (in category 'copying') -----
+ pruneAncestorsFrom: aBaseVersionInfo
+ (self allAncestors select: [:e | e = aBaseVersionInfo])
+ do: [:e | e ancestors: #()]
+
+ !


Re: The Inbox: Monticello-mva.667.mcz

Tobias Pape
Hi,


> On 06.04.2017, at 21:07, [hidden email] wrote:
>
> A new version of Monticello was added to project The Inbox:
> http://source.squeak.org/inbox/Monticello-mva.667.mcz
>
> ==================== Summary ====================
>
> Name: Monticello-mva.667
> Author: mva
> Time: 6 April 2017, 8:41:09.494386 pm
> UUID: 0075cba1-70ff-4e10-9be6-0c01f85dc85a
> Ancestors: Monticello-eem.666
>
> New style diffy version (*.mcd): Prune ancestors version infos in the info of the base of the diff when writing a diffy version. Graft them back from the diff's base version when reading a diffy version unless base info already has ancestors (old-style diffy version with complete version history info) in which case leave them alone.

Might the submitter want to explain their idea?
:)

Best regards
        -Tobias


Re: The Inbox: Monticello-mva.667.mcz

Milan Vavra
Hi Tobias,

Thanks for asking.

So what is this Monticello-mva.667.mcz good for?

The idea is...
... let's make mcd files smaller.
As small as they can be.

Only containing information relevant to the diff.

No more. No less.

And they can be. A lot smaller.

An order of magnitude smaller compared with the in-trunk version.
Think 8K instead of 80K for an mcd with one-line changes.

If you have followed the instructions in
http://lists.squeakfoundation.org/pipermail/squeak-dev/2017-April/194029.html
to get a current Squeak 6.0 alpha, you will have seen files like
Collections-eem.743(ul.742).mcd in your package-cache directory.

If you look at their sizes, you will notice that they are much smaller than
regular mcz files.

For example:
A standard snapshot mcz, Collections-ul.742.mcz, is 485K.
A diff mcd, Collections-eem.743(ul.742).mcd, is only 84K.

How do you create such files?

Select a version in a Repository Browser, click 'Diff', and select the
version against which the diff should be made; you get a 'diffy version'.
If you now click 'Copy' and copy it to a different directory repository,
an mcd file will be stored there, not an mcz file.

Or, if you yellow-click a directory repository in the Monticello Browser and
select 'store diffs', then whenever you select a version in a Repository
Browser and click 'Copy' to copy it to that directory repository, the version
will be stored there as an mcd file.


But what if I told you that that Collections-eem.743(ul.742).mcd could have
been even smaller? A lot smaller. An order of magnitude smaller.
Not 84K. Only 4.9K.
With no loss of information.

I have that converted version sitting on my disk right now.

It was written out with this modification to Monticello:
http://forum.world.st/The-Inbox-Monticello-mva-667-mcz-tt4941466.html
http://source.squeak.org/inbox/Monticello-mva.667.mcz

How could it be so small?

Well, by trimming the information stored in the 'version' file in the mcd
zip archive.

This information grows over time as new versions are added and commit
comments written. And if not trimmed, it will gradually take up a significant
portion of the file's size. Especially for small changes. One-liners. And
there's no real need to store it all in each mcd file.


No information is lost, because the information that is trimmed is
readily available in the Monticello version against which the diff mcd was
made.

So the trick is to trim on writing. And attach back from base version on
reading.
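
The trim-on-write, re-attach-on-read idea can be sketched outside Squeak as a small Python model. This is only an illustration under stated assumptions, not the Monticello code: `VersionInfo`, `all_ancestors`, `prune_ancestors_from` and `graft_ancestors_to` are toy stand-ins loosely mirroring the selectors in the diff, versions are matched by name rather than by UUID, and `Collections-xyz.741` is a made-up older ancestor.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class VersionInfo:
    """Toy stand-in for MCVersionInfo: a name plus its ancestor infos."""
    name: str
    ancestors: list = field(default_factory=list)


def all_ancestors(info):
    """Every ancestor info reachable from `info`, each object visited once."""
    seen, stack, found = set(), list(info.ancestors), []
    while stack:
        ancestor = stack.pop()
        if id(ancestor) not in seen:
            seen.add(id(ancestor))
            found.append(ancestor)
            stack.extend(ancestor.ancestors)
    return found


def prune_ancestors_from(info, base_name):
    """Write side: empty the ancestor list wherever the base info occurs."""
    for ancestor in all_ancestors(info):
        if ancestor.name == base_name:
            ancestor.ancestors = []
    return info


def graft_ancestors_to(info, base_name, full_base):
    """Read side: re-attach the base's ancestors, but only where none exist."""
    for ancestor in all_ancestors(info):
        if ancestor.name == base_name and not ancestor.ancestors:
            ancestor.ancestors = full_base.ancestors
    return info


older = VersionInfo("Collections-xyz.741")      # hypothetical older version
base = VersionInfo("Collections-ul.742", [older])
new = VersionInfo("Collections-eem.743", [base])

# Prune a deep copy, so the info held in the running system stays intact.
on_disk = prune_ancestors_from(copy.deepcopy(new), base.name)
stored_names = [a.name for a in all_ancestors(on_disk)]

# Reading: the full base info supplies what was trimmed.
restored = graft_ancestors_to(on_disk, base.name, base)
restored_names = [a.name for a in all_ancestors(restored)]
```

In this model `stored_names` holds only the base, while `restored_names` again reaches the older ancestor. The "only where none exist" check is what leaves an old-style mcd with full history untouched.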

Disk space and network bandwidth are saved.

The version's history appears the same as before, when all that redundant
information was saved in the mcd archive.

So you can write out much smaller mcd files with this modification.

Read them back in and the system will not know the difference.

And if you have old mcd files sitting around with full version info history
they are read in as before with no surprises.

Does that make sense, or does some part need a better explanation?

Best Regards,

Milan Vavra


Re: The Inbox: Monticello-mva.667.mcz

Chris Muller-3
I haven't looked at it, but would like to ask if you've tested when
you have multiple .mcd's in succession?  Like, if you have,

    Kernel-cmm.100.mcz
    Kernel-cmm.101.mcd
    Kernel-cmm.102.mcd

Does 102 need to have ancestry at least back to 101 (or, 100?) still stored?


On Fri, Apr 7, 2017 at 6:38 AM, Milan Vavra via Squeak-dev
<[hidden email]> wrote:

> So the trick is to trim on writing. And attach back from base version on
> reading.


Re: The Inbox: Monticello-mva.667.mcz

Milan Vavra
Chris Muller wrote:
> I haven't looked at it, but would like to ask if you've tested when
> you have multiple .mcd's in succession?  Like, if you have,
>
>     Kernel-cmm.100.mcz
>     Kernel-cmm.101.mcd
>     Kernel-cmm.102.mcd
>
> Does 102 need to have ancestry at least back to 101 (or, 100?) still stored?
>
Assuming we have
  Kernel-cmm.100.mcz
  Kernel-cmm.101(cmm.100).mcd
  Kernel-cmm.102(cmm.101).mcd
then yes, 102 needs to have ancestry going back to 101. But no further.
No need to go beyond 101. The ancestry of 101 itself (100 and everything
before it) can be trimmed.
So when writing
  Kernel-cmm.102(cmm.101).mcd
the ancestry of Kernel-cmm.101 can be trimmed. It is trimmed in the file we
are saving, not in the system; that's why we need a #veryDeepCopy of the
ancestry before we trim it.

And when reading
  Kernel-cmm.102(cmm.101).mcd
the ancestry of Kernel-cmm.101 can be re-attached so that Kernel-cmm.102's
version info looks the same as it did before the trimming, when we were
writing Kernel-cmm.102(cmm.101).mcd.
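
The chain case can be walked through with a toy Python model. It is a sketch under assumptions, not the Monticello implementation: the `repo` dictionary, `Info`, `read` and `history` are invented for illustration, `Kernel-cmm.99` is a hypothetical older ancestor, and the base is assumed to be a direct ancestor of the diffy version.

```python
from dataclasses import dataclass, field


@dataclass
class Info:
    """Toy version info: a name and a list of ancestor infos."""
    name: str
    ancestors: list = field(default_factory=list)


def history(info):
    """Names along the first-parent ancestry line, newest first."""
    names = [info.name]
    while info.ancestors:
        info = info.ancestors[0]
        names.append(info.name)
    return names


# Toy repository: file name -> (stored info, base name or None).
# Each .mcd stores its base with an emptied ancestor list.
repo = {
    "Kernel-cmm.100.mcz": (
        Info("Kernel-cmm.100", [Info("Kernel-cmm.99")]), None),
    "Kernel-cmm.101(cmm.100).mcd": (
        Info("Kernel-cmm.101", [Info("Kernel-cmm.100")]), "Kernel-cmm.100"),
    "Kernel-cmm.102(cmm.101).mcd": (
        Info("Kernel-cmm.102", [Info("Kernel-cmm.101")]), "Kernel-cmm.101"),
}


def read(version_name):
    """Load a version's info, restoring trimmed ancestry from its base first."""
    stored, base_name = next(
        entry for entry in repo.values() if entry[0].name == version_name
    )
    if base_name is None:
        return stored                   # full snapshot: nothing to restore
    full_base = read(base_name)         # the base may itself be a diffy version
    for ancestor in stored.ancestors:
        if ancestor.name == base_name and not ancestor.ancestors:
            ancestor.ancestors = full_base.ancestors
    return stored


full_history = history(read("Kernel-cmm.102"))
```

Here reading 102 first restores 101, which in turn needs 100.mcz, so `full_history` runs from 102 back to the oldest version even though each mcd stored only one level of ancestry.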

Best Regards,

Milan Vavra

Re: The Inbox: Monticello-mva.667.mcz

Chris Muller-3
On Fri, Apr 7, 2017 at 2:13 PM, Milan Vavra via Squeak-dev
<[hidden email]> wrote:

> Chris Muller wrote:
>> I haven't looked at it, but would like to ask if you've tested when
>> you have multiple .mcd's in succession?  Like, if you have,
>>
>>     Kernel-cmm.100.mcz
>>     Kernel-cmm.101.mcd
>>     Kernel-cmm.102.mcd
>>
>> Does 102 need to have ancestry at least back to 101 (or, 100?) still
>> stored?
>>
> Assuming we have
>   Kernel-cmm.100.mcz
>   Kernel-cmm.101(cmm.100).mcd
>   Kernel-cmm.102(cmm.101).mcd
> then yes, 102 needs to have ancestry going back to 101. But no further.
> No need to go beyond 101. Ancestry from 101 onward can be trimmed.
> So when writing
>   Kernel-cmm.102(cmm.101).mcd
> ancestry of Kernel-cmm.101 can be trimmed. In the file we are saving, not
> in the system, that's why we need a #veryDeepCopy of the ancestry before
> we trim it.

So it reduces redundancy and disk utilization, with the trade-off
being that it must re-open the original .mcz in order to get that
ancestry back into memory.

That read should be done eagerly, otherwise the system would interpret
the empty ancestry as simply no ancestry.

> And when reading
>   Kernel-cmm.102(cmm.101).mcd
> ancestry of Kernel-cmm.101 can be re-attached so that Kernel-cmm.102's
> version info looks the same as it did when we were writing the
> Kernel-cmm.102(cmm.101).mcd before the trimming.

You said, "can be", but I think it should do it eagerly to avoid
unintended consequences.  If we don't open the original .mcz eagerly,
then I think we would need to terminate the ancestry with some kind of
"reference stub" instead of an empty Array.


Re: The Inbox: Monticello-mva.667.mcz

Milan Vavra
Chris Muller wrote:
>So it reduces redundancy and disk utilization, with the trade-off
>being that it must re-open the original .mcz in order to get that
>ancestry back into memory.

That is correct. See below.

>
>That read should be done eagerly, otherwise the system would interpret
>the empty ancestry as simply no ancestry.
>

Good point.

The original mcz is being opened. Albeit indirectly and behind the
scenes.

The code
MCRepositoryGroup default versionWithInfo: baseInfo
basically does that.

What it does is ask the system: 'in any of the repositories known to you,
look for a version with this UUID and return it to me'.

We then attach its ancestors to our newly read version info at points where
the base version info is referenced.
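
As a rough picture of that lookup, here is a small Python sketch. It assumes nothing about how MCRepositoryGroup is really implemented: `Repository`, `RepositoryGroup` and `version_with_id` are hypothetical names, and the toy simply asks each known repository in turn for a version with the given UUID.

```python
import uuid


class Repository:
    """Toy repository: maps version UUIDs to version objects."""

    def __init__(self, name):
        self.name = name
        self.versions = {}

    def store(self, version_id, version):
        self.versions[version_id] = version

    def version_with_id(self, version_id):
        return self.versions.get(version_id)


class RepositoryGroup:
    """Toy group: the first repository that knows the UUID wins."""

    def __init__(self, repositories):
        self.repositories = list(repositories)

    def version_with_id(self, version_id):
        for repository in self.repositories:
            version = repository.version_with_id(version_id)
            if version is not None:
                return version
        return None   # in this toy the caller must cope with a missing base


base_id = uuid.uuid4()                  # hypothetical UUID of the base version
cache = Repository("package-cache")
trunk = Repository("trunk")
trunk.store(base_id, "Kernel-cmm.101")

group = RepositoryGroup([cache, trunk])
found = group.version_with_id(base_id)
missing = group.version_with_id(uuid.uuid4())
```

The sketch also makes one consequence visible: if no known repository holds the base version, there is nothing to graft from, so the reader has to decide what to do in that case.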

>> And when reading
>>   Kernel-cmm.102(cmm.101).mcd
>> ancestry of Kernel-cmm.101 can be re-attached so that Kernel-cmm.102's
>> version info looks the same as it did when we were writing the
>> Kernel-cmm.102(cmm.101).mcd before the trimming.
>
>You said, "can be", but I think it should do it eagerly to avoid
>unintended consequences.  If we don't open the original .mcz eagerly,
>then I think we would need to terminate the ancestry with some kind of
>"reference stub" instead of an empty Array.
>

I said "can be" but what I really meant is "is being re-attached".


Best Regards,

Milan Vavra

Re: The Inbox: Monticello-mva.667.mcz

Milan Vavra
In reply to this post by Chris Muller-3
Chris Muller wrote:
> I haven't looked at it, but would like to ask if you've tested when
> you have multiple .mcd's in succession?  Like, if you have,

Yes I have tested multiple .mcd's in succession.

That is where this modification really shines.

The size of each successive mcd is proportional only to the amount of
changes it contains. One-liners are just a few KB. Each time. No matter how
many versions came before them.

Especially with big packages with a lot of previous versions whose version
information would normally be saved in the 'version' file of the mcd
archives. Like the Kernel.
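
A back-of-the-envelope Python sketch shows why the untrimmed 'version' member keeps growing while the trimmed one does not. The serialization format, the package name `Pkg-dev` and the commit comment are all made up for illustration; the real 'version' file looks different, and only the growth pattern is the point.

```python
def serialize(entries):
    """Hypothetical flat serialization: one record per version."""
    return "".join(f"(name {name!r} message {msg!r})" for name, msg in entries)


def version_file_sizes(n_versions, comment="Fix a one-line bug in the parser."):
    """Return (full, pruned) sizes in characters for a linear history."""
    entries = [(f"Pkg-dev.{i}", comment) for i in range(1, n_versions + 1)]
    full = len(serialize(entries))          # every ancestor's record
    pruned = len(serialize(entries[-2:]))   # only the new version and its base
    return full, pruned


full_10, pruned_10 = version_file_sizes(10)
full_1000, pruned_1000 = version_file_sizes(1000)
```

With these assumptions the full size grows roughly linearly with the number of versions, while the pruned size stays about the same no matter how long the history is.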

An mcz snapshot must include the complete version information.
An mcd is a different story.

The mcd files, being what they are, store only the code that has been changed
against their base version.

They store a patch.

A patch needs its base version to exist to be able to reconstruct the
snapshot it represents.

This modification just adjusts the 'version' information to match that
behavior, so that only version information that has been modified since the
base version is stored in an mcd file.


Best Regards,

Milan Vavra


Re: The Inbox: Monticello-mva.667.mcz

Bert Freudenberg
On Fri, Apr 7, 2017 at 10:38 PM, Milan Vavra via Squeak-dev <[hidden email]> wrote:

> An mcz snapshot must include the complete version information.
> An mcd is a different story.
>
> The mcd files being what they are store only the code that has been changed
> against its base version.

Awesome! I haven't tried it, but this sounds exactly how it should have been from the beginning. Thank you!

I wonder what we should do about the MCDs auto-generated by the source.squeak server. When we update the server code to produce these new diffy versions, then an older image won't get the correct history. Basically we would need to detect if the image requesting the MCD does have the history restoration code or not. Maybe it should send a little argument in the URL? E.g. http://source.squeak.org/trunk/Foo-abc.123(120).mcd?prunehistory but I'm not sure if that would throw off an older source server ...

- Bert - 
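
The URL-argument idea could look roughly like this on the server side. This Python sketch is purely hypothetical: `wants_pruned_history` is an invented helper, the `prunehistory` flag is only the example name from the message above, and it does not answer the open question of how an older server would react to the extra query string.

```python
from urllib.parse import parse_qs, urlsplit


def wants_pruned_history(url):
    """True if the requesting image signalled that it can restore history."""
    query = urlsplit(url).query
    # keep_blank_values keeps a bare flag such as "?prunehistory".
    return "prunehistory" in parse_qs(query, keep_blank_values=True)


new_image = wants_pruned_history(
    "http://source.squeak.org/trunk/Foo-abc.123(120).mcd?prunehistory")
old_image = wants_pruned_history(
    "http://source.squeak.org/trunk/Foo-abc.123(120).mcd")
```

A server following this sketch would send the small new-style mcd only when the flag is present, and the full-history mcd otherwise.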




Re: The Inbox: Monticello-mva.667.mcz

cbc


On Tue, Apr 11, 2017 at 7:28 AM, Bert Freudenberg <[hidden email]> wrote:
> On Fri, Apr 7, 2017 at 10:38 PM, Milan Vavra via Squeak-dev <[hidden email]> wrote:
>
>> An mcz snapshot must include the complete version information.
>> An mcd is a different story.
>>
>> The mcd files being what they are store only the code that has been changed
>> against its base version.
>
> Awesome! I haven't tried it, but this sounds exactly how it should have been from the beginning. Thank you!
>
> I wonder what we should do about the MCDs auto-generated by the source.squeak server.

Does the source.squeak server generate the MCD on each request, or does it cache and/or save the generated MCDs so it doesn't have to regenerate them the next time?  The latter seems dangerous in this context.

> When we update the server code to produce these new diffy versions, then an older image won't get the correct history. Basically we would need to detect if the image requesting the MCD does have the history restoration code or not. Maybe it should send a little argument in the URL? E.g. http://source.squeak.org/trunk/Foo-abc.123(120).mcd?prunehistory but I'm not sure if that would throw off an older source server ...
>
> - Bert -


Re: The Inbox: Monticello-mva.667.mcz

Milan Vavra
Bert Freudenberg wrote:
>Milan Vavra wrote:
>>An mcz snapshot must include the complete version information.
>>An mcd is a different story.
>>
>>The mcd files being what they are store only the code that has been changed
>>against its base version.
>
>Awesome! I haven't tried it, but this sounds exactly how it should have been from the beginning. Thank you!

Glad to hear that.

The surprisingly big mcds have been a personal pet peeve of mine for quite
some time. I would really like this 'history trimming and restoration code'
to become part of Squeak so that people can use mcds to store their work and
not waste a whole lot of disk space.

And if this became part of the update process, the amount of data one
needs to download to update to the current alpha would be down to a
trickle.

Best Regards,

Milan Vavra

Re: The Inbox: Monticello-mva.667.mcz

Milan Vavra
In reply to this post by cbc
Chris Cunningham wrote:
>Bert Freudenberg wrote:
>>Milan Vavra wrote:
>>>
>>>An mcz snapshot must include the complete version information.
>>>An mcd is a different story.
>>>
>>>The mcd files being what they are store only the code that has been changed
>>>against its base version.
>>
>>Awesome! I haven't tried it, but this sounds exactly how it should have been from the
>>beginning. Thank you!
>>
>>I wonder what we should do about the MCDs auto-generated by the source.squeak server.
>>
>Does source.squeak server generate the MCD on each request, or does it cache and/or save the
>MCD's generated so it doesn't have to the next time?  The latter seems dangerous in this context.


There is a danger of reading an mcd without history restoration code in
place and so losing the history beyond base info at that moment.

This could be avoided by replacing the pruned ancestors in the written-out
'version' member of the mcd zip archive with a string like

'To use this mcd you need Monticello with history restoration code'.

The history restoration code could be modified to look for this string and
to remove it so that the version info reading/restoring can continue as
before.

The current in-trunk mcd reading code (read: any Monticello without history
restoration support) will choke on this and open a debugger.

When you poke around in the variables in the debugger, you will see the
string ('To use this mcd you need Monticello with history restoration
code') in the tokens instance variable at some level. This should get your
attention.

Even better, if you don't care about having only partial version info
history this time (just read that fine thing, thank you very much!), you
can replace the array
#('To use this mcd you need Monticello with history restoration code')
with an empty array
{}
in the debugger, Restart and Proceed and the version will be read in with
partial history.
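
The marker idea can be modelled in a few lines of Python. This is a sketch under assumptions rather than the Monticello reader: version infos are plain dictionaries, `write_info`, `old_read` and `new_read` are invented names, `Foo-abc.119` is a hypothetical ancestor, and a `TypeError` stands in for the debugger that an old reader would open.

```python
MARKER = "To use this mcd you need Monticello with history restoration code"


def write_info(info, base_name):
    """Serialize an info, putting the marker where the base's ancestors were."""
    if info["name"] == base_name:
        ancestors = [MARKER]
    else:
        ancestors = [write_info(a, base_name) for a in info["ancestors"]]
    return {"name": info["name"], "ancestors": ancestors}


def old_read(data):
    """Reader without restoration support: every ancestor must be a record."""
    return {"name": data["name"],
            "ancestors": [old_read(a) for a in data["ancestors"]]}


def new_read(data, full_base):
    """Reader with restoration support: swap the marker for the real ancestry."""
    if data["ancestors"] == [MARKER]:
        return {"name": data["name"], "ancestors": full_base["ancestors"]}
    return {"name": data["name"],
            "ancestors": [new_read(a, full_base) for a in data["ancestors"]]}


base = {"name": "Foo-abc.120",
        "ancestors": [{"name": "Foo-abc.119", "ancestors": []}]}  # hypothetical
new = {"name": "Foo-abc.123", "ancestors": [base]}

on_disk = write_info(new, "Foo-abc.120")
restored = new_read(on_disk, base)

try:
    old_read(on_disk)
    old_reader_failed = False
except TypeError:               # the marker string is not a record
    old_reader_failed = True
```

In this model the new reader gets the complete history back, while the old reader fails loudly on the marker instead of silently accepting a truncated history.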

Best Regards,

Milan Vavra