The current format we use for storing source code is not optimal for
archival and analysis purposes. Each mcz stores the complete source code
of its package version. That makes archival and analysis difficult and slow.
Today I've experimented with an archive format that combines many mcz's
and should be able to reconstruct all individual ones.

I defined a MCProject, representing a project repository.
The definitions are stored in an OCLiteralSet.

    Object subclass: #MCProject
        instanceVariableNames: 'location infos definitions repository'
        classVariableNames: ''
        package: 'MonticelloProjects'

For each filename found in the repository I load the MCVersion and
its snapshot.

    MCProject>>read
        | filenames |
        repository := MCHttpRepository location: location user: '' password: ''.
        filenames := repository readableFileNames.
        filenames do: [ :each | self read: each ]

    MCProject>>read: aFileName
        "Needs a rate limiter!!!"
        | mcVersion |
        mcVersion := repository loadNotCachedVersionFromFileNamed: aFileName.
        mcVersion snapshot.
        self parse: mcVersion.
        repository flushCache

For each unique package in those MCVersions, I add a MCPackageInfo,
defined as

    Object subclass: #MCPackageInfo
        instanceVariableNames: 'packageName packageVersions'
        classVariableNames: ''
        package: 'MonticelloProjects'

    MCProject>>parse: aVersion
        | info |
        info := self ensureInfo: aVersion package.
        info addVersion: aVersion in: self

A MCPackageVersion then stores the info and, for each of its definitions,
the unique instance that is stored in the project, eliminating the duplicates.

    MCPackageInfo>>addVersion: aMcVersion in: aProject
        | packageVersion |
        packageVersion := MCPackageVersion new
            info: aMcVersion info;
            yourself.
        self packageVersions add: packageVersion.
        aMcVersion snapshot definitions do: [ :aDefinition |
            packageVersion definitions add: (aProject definitions add: aDefinition) ].

As long as #= and #hash are correctly defined for all MCDefinitions,
this should make it possible to eliminate all duplicate definitions
and still have a full history. On my Documentation repo this already saves
a factor 7, when saving this compressed as a Fuel file.
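The interning step in addVersion:in: can be tried in a playground with the sketch below. It is only an illustration under assumptions: the names canonical and intern are hypothetical and not part of MonticelloProjects, a plain Dictionary stands in for the OCLiteralSet, and strings stand in for MCDefinitions. A Dictionary is used here because the standard Set>>add: answers its argument rather than the equal element already stored, so sharing instances depends on the collection answering the stored element.

```smalltalk
"Hypothetical sketch: 'canonical' and 'intern' are illustrative names only."
| canonical intern first second |
canonical := Dictionary new.

"Answer the stored instance equal to aDefinition, storing it on first sight."
intern := [ :aDefinition |
    canonical at: aDefinition ifAbsentPut: [ aDefinition ] ].

"Two equal but distinct objects stand in for one definition read from two mczs."
first := intern value: 'foo ^ 42' copy.
second := intern value: 'foo ^ 42' copy.

first == second.   "both variables reference the single stored instance"
canonical size.    "one entry, since the lookup relies on #= and #hash"
```

Sharing one instance per definition is what should keep duplicates out of the saved file, since a serializer that tracks object identity only needs to write a shared object once.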
On large repositories with a high change rate (Roassal2?) the compression
will be significantly higher.

There are several other normalizations that can reduce the size further:
- make recategorization explicit
- normalize MCVersionInfo data: explicit author, compact timestamp.

I'd be interested in further ideas for this, and situations where this
approach wouldn't work.

Stephan
For the part of the Roassal2 repo that I could read before getting a 400
response:

19,665 unique definitions in the following # of versions
(later versions of Roassal2 have 6.5K definitions)

  3 ProfilerCPP
  2 Roassal2EventCollector
  2 Roassal2Spec
 81 Trachel
 19 ConfigurationOfRoassal2
  1 Glamour-Roassal2-presentations
  7 Roassal2GT
105 VersionOfRoassal2
273 Roassal2

compressed into 18.7 MB

Stephan
> On 14 Dec 2015, at 17:18, Stephan Eggermont <[hidden email]> wrote:
>
> For the part of the Roassal2 repo that I could read before getting a 400 response:
>
> 19,665 unique definitions in the following # of versions
> (later versions of Roassal2 have 6.5K definitions)
>
>   3 ProfilerCPP
>   2 Roassal2EventCollector
>   2 Roassal2Spec
>  81 Trachel
>  19 ConfigurationOfRoassal2
>   1 Glamour-Roassal2-presentations
>   7 Roassal2GT
> 105 VersionOfRoassal2
> 273 Roassal2
>
> compressed into 18.7 MB
>

Nice!

Marcus
Pharo/Pharo50 on SmalltalkHub contains

2502 package versions (the separate mczs)
in 389 different packages
containing 96390 unique MCDefinitions (nearly all MCMethodDefinition)

That whole repo can be represented in a 221.9 MB Fuel file,
compressed as tar.gz: 80.6 MB

I wonder if I might be able to fit all SmalltalkHub source, including
history, in RAM on a modern PC.

Stephan
On 15-12-15 21:35, Stephan Eggermont wrote:
> Pharo/Pharo50 on smalltalkhub contains
>
> 2502 package versions (the separate mczs)

Oops, 3512, of which I had issues with 10.

Stephan