Compact representation of source code history

Stephan Eggermont-3
The current format we use for storing source code is not optimal for
archival and analysis purposes. Each mcz stores all source code of its
package version, which makes archival and analysis difficult and slow.
Today I've experimented with an archive format that combines many mczs
and should be able to reconstruct all individual ones.

I defined an MCProject, representing a project repository. The
definitions are stored in an OCLiteralSet.

Object subclass: #MCProject
        instanceVariableNames: 'location infos definitions repository'
        classVariableNames: ''
        package: 'MonticelloProjects'

For each filename found in the repository I load the MCVersion and its
snapshot.

MCProject>>read
        | filenames |
        repository := MCHttpRepository location: location user: '' password: ''.
        filenames := repository readableFileNames.
        filenames do: [ :each | self read: each ]
       
MCProject>>read: aFileName
        "Needs a rate limiter!!!"
        |mcVersion|
        mcVersion := repository loadNotCachedVersionFromFileNamed: aFileName.
        mcVersion snapshot.
        self parse: mcVersion.
        repository flushCache
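
A minimal sketch of the rate limiting that the comment above asks for, written as Playground code rather than as a change to MCProject. It only shows the pattern of pausing between consecutive requests. The file names and the 500 ms interval are made up for illustration, and the `fetched add:` line stands in for the real `self read: each`.

```smalltalk
"Sketch only: pause between consecutive requests.
 The file names and the 500 ms interval are hypothetical."
| names fetched |
names := #('Foo-Author.1.mcz' 'Foo-Author.2.mcz' 'Bar-Author.1.mcz').
fetched := OrderedCollection new.
names
        do: [ :each |
                "placeholder for: self read: each"
                fetched add: each ]
        separatedBy: [
                "runs between two elements, not after the last one"
                (Delay forMilliseconds: 500) wait ].
fetched
```

Using `do:separatedBy:` keeps the wait out of the per-file method and avoids a pointless pause after the last file. A fixed delay is the simplest option; whether it is enough to keep the server happy is something that would have to be tried.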



For each unique package in those MCVersions, I add an MCPackageInfo,
defined as

Object subclass: #MCPackageInfo
        instanceVariableNames: 'packageName packageVersions'
        classVariableNames: ''
        package: 'MonticelloProjects'

MCProject>>parse: aVersion
        |info|
        info := self ensureInfo: aVersion package.
        info addVersion: aVersion in: self

An MCPackageVersion then stores the version info and references to the
unique definitions held by the project, eliminating the duplicates.

MCPackageInfo>>addVersion: aMcVersion in: aProject
        |packageVersion|
        packageVersion := MCPackageVersion new
                info: aMcVersion info;
                yourself.
        self packageVersions add: packageVersion.
        aMcVersion snapshot definitions do: [ :aDefinition |
                        packageVersion definitions add:
                                (aProject definitions add: aDefinition) ].
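
The nested `add:` above relies on the project's collection answering the element it already holds when an equal one is added. A plain Set answers its argument from `add:`, so the following Playground sketch shows the same interning idea with a Dictionary and `at:ifAbsentPut:` instead. The strings stand in for MCDefinitions, and the names `canonical` and `intern` are made up for this example.

```smalltalk
"Sketch only: answer one canonical instance for all equal objects.
 Strings stand in for MCDefinitions; canonical and intern are
 hypothetical names."
| canonical intern first second |
canonical := Dictionary new.
intern := [ :anObject |
        "lookup uses #= and #hash; the stored instance is answered"
        canonical at: anObject ifAbsentPut: [ anObject ] ].
first := intern value: 'foo ^ 42' copy.
second := intern value: 'foo ^ 42' copy.
"first and second are equal copies on the way in,
 and the very same object on the way out"
first == second
```

Every later occurrence of an equal definition then points at the first instance, which is what lets a serializer write each definition only once.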

As long as #= and #hash are correctly defined for all MCDefinitions,
this should make it possible to eliminate all duplicate definitions and
still have a full history. On my Documentation repo this already saves
a factor of 7, when saving this compressed as a Fuel file. On large
repositories with a high change rate (Roassal2?) I expect the saving to
be significantly higher. There are several other normalizations that can
reduce the size further:
- make recategorization explicit
- normalize MCVersionInfo data: explicit author, compact timestamp.

I'd be interested in further ideas for this, and situations where this
approach wouldn't work.

Stephan



Re: Compact representation of source code history

Stephan Eggermont-3
For the part of the Roassal2 repo that I could read before getting a 400
response:

19,665 unique definitions in the following # of versions
(later versions of Roassal2 have 6.5K definitions)

3 ProfilerCPP
2 Roassal2EventCollector
2 Roassal2Spec
81 Trachel
19 ConfigurationOfRoassal2
1 Glamour-Roassal2-presentations
7 Roassal2GT
105 VersionOfRoassal2
273 Roassal2

compressed into 18.7 MB

Stephan





Re: Compact representation of source code history

Marcus Denker-4

> On 14 Dec 2015, at 17:18, Stephan Eggermont <[hidden email]> wrote:
>
> For the part of the Roassal2 repo that I could read before getting a 400 response:
>
> 19,665 unique definitions in the following # of versions
> (later versions of Roassal2 have 6.5K definitions)
>
> 3 ProfilerCPP
> 2 Roassal2EventCollector
> 2 Roassal2Spec
> 81 Trachel
> 19 ConfigurationOfRoassal2
> 1 Glamour-Roassal2-presentations
> 7 Roassal2GT
> 105 VersionOfRoassal2
> 273 Roassal2
>
> compressed into 18.7 MB
>

Nice!

        Marcus



Re: Compact representation of source code history

Stephan Eggermont-3
In reply to this post by Stephan Eggermont-3
Pharo/Pharo50 on SmalltalkHub contains

2502 package versions (the separate mczs)
in 389 different packages
containing 96390 unique MCDefinitions (nearly all MCMethodDefinition)

That whole repo can be represented in a 221.9 MB Fuel file, compressed
as tar.gz: 80.6 MB

I wonder if I might be able to fit all SmalltalkHub source, including
history, in RAM on a modern PC

Stephan



Re: Compact representation of source code history

Stephan Eggermont-3
On 15-12-15 21:35, Stephan Eggermont wrote:
> Pharo/Pharo50 on smalltalkhub contains
>
> 2502 package versions (the separate mczs)

Oops, that should be 3512, of which 10 gave me problems

Stephan