At PharoDays I was painfully reminded that SSDs perform really badly with lots of small files. The Bloc tutorial used a GitHub FileTree repository, and that has a lot of files: the whole folder is 116 MB in 16K files. Copying that amount of data should not be noticeable, taking about a third of a second. With it spread over so many files, it took more than half a minute, a hundred times as long.
That is too much overhead. How can we improve the file format in a way that keeps the cross-platform exchange advantages and a reasonable way to view diffs and propose small changes using the GitHub web tools?

Cuis uses a different format with git. How does that compare? What is used in Squeak?

Stephan
There is the same kind of issue with Hadoop, where the block size is 128 MB, so lots of small files cause the same problem. It is solved by HAR files (Hadoop Archives) that contain the files. The Hadoop filesystem is usually able to access the HAR contents fairly transparently from userland. But I guess that libgit will not be too cooperative. We can also look into how to mount such an archive as a filesystem.

Phil
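For reference, a rough sketch of the usual HAR workflow; the paths and archive name here are made up for illustration, and options may vary between Hadoop versions:

$ hadoop archive -archiveName bloc.har -p /user/demo tutorial /user/demo/archives
$ hdfs dfs -ls har:///user/demo/archives/bloc.har/tutorial   # browse the packed files in place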
In reply to this post by Stephan Eggermont-3
Cuis uses a standard chunk file format[1] with a file per package.
The fact that no two Smalltalk dialects share a common class creation protocol makes simply using the chunk file format a non-starter. Squeak uses FileTree.

As part of his work on Cypress 2.0, Martin McClure is planning to support a file-per-class disk format in addition to the file-per-method format, and possibly a file-per-package format. I'm not sure whether Martin is at the point where he is ready to share his plans, but this is a problem that is being worked on, and when Martin is ready for feedback he'll publish his spec.

Dale

[1] https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev/blob/master/Packages/Assessments.pck.st
In reply to this post by Stephan Eggermont-3
On 21/05/2017 at 17:25, Stephan Eggermont wrote:
> That is too much overhead. How can we improve the file format in a way that keeps the cross-platform exchange advantages and a reasonable way to view diffs and propose small changes using the GitHub web tools?

Write longer methods ;)

As soon as you start packing multiple methods together in a file, the diff context view of all the tools except the Smalltalk ones becomes problematic, because it no longer respects the "method" context around the changes, forcing you to mentally rebuild the context of the diff. I've done that when tracking down changes in git between two .st packages, and it really becomes a problem: a method change plus the addition of a new method completely messes up the diff in terms of understanding what has really changed. And the diff does not carry enough context for you to know which method you are in...

> Cuis uses a different format with git. How does that compare? What is used in Squeak?

Cuis: it is just the old .st package format. Cuis does not handle anything vcs-related; you work in a git client. You can set dependencies between packages.

I don't know for Squeak.

Thierry
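One partial workaround at the plain-git level, not mentioned in the thread but perhaps worth sketching: git diff drivers let you define a custom hunk-header pattern, so diffs of bundled .st files could at least carry the nearest chunk header (class/category line) as context. The regex below is only a first guess and would need tuning against the real chunk format:

$ echo '*.st     diff=smalltalk' >> .gitattributes
$ echo '*.pck.st diff=smalltalk' >> .gitattributes
# teach git what a "function" header looks like so hunk headers show it as context
$ git config diff.smalltalk.xfuncname '^![A-Za-z].* methodsFor:.*$'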
In reply to this post by Stephan Eggermont-3
On Sun, May 21, 2017 at 11:25 PM, Stephan Eggermont <[hidden email]> wrote:
> At the PharoDays I was painfully reminded that SSDs perform really badly when using small files.

For comparison, pulling the whole of pharo-core across the network and unpacking...

$ time git clone --depth=1 https://github.com/pharo-
real 0m44.306s
user 0m5.384s
sys 0m5.908s

Count and size of files:

$ find pharo-core/ | wc -l
152547 [files]
$ du -sh pharo-core
639MB

The smaller size of the archive without compression was a surprise, i.e. two-thirds empty space in filesystem blocks...

$ time tar cf pharo-core.tar pharo-core/
real 0m8.216s
user 0m2.284s
sys 0m3.776s
$ ls -lh pharo-core.tar
212MB

File by file copy to a USB stick. Wow! Yes, that's over an hour.

$ time cp -R pharo-core /media/ben/USBSTICK/pharo-core
real 68m49.168s
user 0m1.852s
sys 0m39.768s

Many forums indicated the following should be faster, but it was not...

$ time ((tar cf - pharo-core) | (tar xf - -C /media/ben/USBSTICK/))
real 77m47.896s
user 0m6.360s
sys 0m38.900s

Saving to an archive on the USB stick is much quicker...

$ time tar cf /media/ben/USBSTICK/pharo-core.tar pharo-core
real 1m35.723s
user 0m3.188s
sys 0m5.972s

Saving to a compressed archive on the USB stick is much, much quicker...

$ time tar zcf /media/ben/USBSTICK/pharo-core.tgz pharo-core
real 0m11.778s
user 0m7.908s
sys 0m3.604s

Saving to a compressed archive on the USB stick, then unpacking that to the local drive...

$ time (tar zcf /media/ben/USBSTICK/pharo-core.tgz pharo-core \
  && tar zxf /media/ben/USBSTICK/pharo-core.tgz -C restore/)
real 0m49.934s
user 0m10.552s
sys 0m9.192s

So 38 seconds to unpack, which is very similar to the original git clone.

But I guess your primary drive is an SSD? So you don't get to bypass the issue using archives? What result do you get cloning pharo-core like my first test case?

cheers -ben
In reply to this post by Dale Henrichs-3
On Sunday 21 May 2017 11:34 PM, Dale Henrichs wrote:
> As part of his work on Cypress 2.0 Martin McClure is planning on
> supporting a file per class disk format in addition to the file per
> method format and possibly a file per package format.
>
> I'm not sure whether Martin is at the point where he is ready to share
> his plans, but this is a problem that is being worked on and when Martin
> is ready for feedback he'll publish his spec.

Is it possible to build a file responder right into Pharo and expose packages through a WebDAV, FUSE or sshfs service? Then the contents could go directly from RAM (Pharo) to RAM (repo server or git) without going through the slow disk filesystem.

Regards .. Subbu
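For the sshfs flavour of this idea, the host-side mount is a one-liner; the host and path names below are hypothetical, and the Pharo side would still need to expose a directory over SFTP:

$ mkdir -p ~/mnt/pharo-packages
$ sshfs dev@buildhost:/srv/pharo/packages ~/mnt/pharo-packages   # browse remote packages as local files
$ fusermount -u ~/mnt/pharo-packages                             # unmount (Linux; 'umount' on macOS)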
In reply to this post by Ben Coman
On 22/05/17 04:38, Ben Coman wrote:
> So this was copying files at the OS level? Which OS?

MacOS.

> The smaller size of the archive without compression was a surprise, i.e.
> two-thirds empty space in filesystem blocks...

Why is that a surprise? Most methods are a lot smaller than 4K.

> File by file copy to a USB stick. Wow! Yes, that's over an hour.

I used a Samsung T3, which is somewhat faster than most sticks.

> But I guess your primary drive is an SSD? So you don't get to bypass the
> issue using archives?
> What result do you get cloning pharo-core like my first test case?

real 0m40.270s
user 0m6.174s
sys 0m18.002s

on a slow wifi/dsl on a mid-2014 MBP13, 2.6 GHz i5

Stephan
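Regarding the 4K point above, the block-padding overhead can be seen directly with GNU du (the flag is the GNU coreutils one; macOS du differs):

$ du -sh pharo-core                    # on-disk usage, rounded up to filesystem blocks
$ du -sh --apparent-size pharo-core    # sum of the actual file sizes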
On Tue, May 23, 2017 at 1:33 PM, Stephan Eggermont <[hidden email]> wrote:
> Why is that a surprise? Most methods are a lot smaller than 4K.

Let's say minor surprise. It just wasn't at the forefront of my thoughts to expect this.

> real 0m40.270s
> user 0m6.174s
> sys 0m18.002s
> on a slow wifi/dsl on a mid-2014 MBP13, 2.6 GHz i5

About the same as mine then. Hopefully that means Iceberg won't suffer much. Do your "tar" results match mine as well?

cheers -ben
In reply to this post by Stephan Eggermont-3
Stephan,
Out of curiosity: why are you copying the git repo?

In my normal usage, I clone a git repository on disk and then share that repository amongst all of the images/stones on that machine ... The upshot is that I almost never copy the repo. I use push to publish changes to the common git repo and pull to update the local disk copy, and those commands update only the files that have changed ...

That isn't to say that I NEVER copy a git repo, but it is not something that I do frequently ...

Dale
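When a working tree does have to be copied wholesale, an incremental tool gives much the same "only what changed" behaviour described above for push/pull; a sketch reusing the hypothetical USB-stick path from earlier in the thread:

$ rsync -a --delete pharo-core/ /media/ben/USBSTICK/pharo-core/   # first run copies everything, later runs only the changed files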
In reply to this post by K K Subbu
Hi Subbu,
> On May 21, 2017, at 11:26 PM, K K Subbu <[hidden email]> wrote:
>
> Is it possible to build a file responder right into Pharo and expose packages through a WebDAV, FUSE or sshfs service? Then the contents could go directly from RAM (Pharo) to RAM (repo server or git) without going through the slow disk filesystem.

Yes, it's possible and really interesting for lots of reasons. Not sure why no one is exploring this avenue.
Hi,
> > Is it possible to build a file responder right into Pharo and
> > expose packages through a WebDAV, FUSE or sshfs service? Then the
> > contents could go directly from RAM (Pharo) to RAM (repo server or
> > git) without going through the slow disk filesystem.
>
> Yes, it's possible and really interesting for lots of reasons. Not
> sure why no one is exploring this avenue.

I did explore this path with FUSE ages ago (~10 years, IIRC) and, if my memory is not failing me, it was possible but turned out to be useless in practice. The problem was that the call to FUSE entered the kernel and sat there until there was a FS request, at which point it called back into my Smalltalk. This meant Smalltalk was unresponsive most of the time. Besides, one needed superuser privileges and all that stuff. Things may be different these days.

Another 10 years or so before me, Claus did exactly the same thing using NFSv3, being able to "mount" a running Smalltalk image. Last time I tried, it still worked. It was never used for anything practical, as far as I can recall.

Jan
In reply to this post by Dale Henrichs-3
On 23/05/17 15:50, Dale Henrichs wrote:
> Out of curiosity: why are you copying the git repo?

We were following a tutorial. The first participant downloaded the environment and I assumed copying it would be faster than downloading through a shared wifi network.

Stephan
On 05/23/2017 03:38 PM, Stephan Eggermont wrote:
> We were following a tutorial. The first participant downloaded the environment
> and I assumed copying it would be faster than downloading through a shared
> wifi network.

Ah, okay ... now that we know what we know, I guess a zip file of the directory structure would have been a better solution ...

Dale
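Besides a plain zip, git itself can pack a repository, history included, into a single file for offline sharing; a sketch assuming the pharo-core repository name from earlier in the thread:

$ cd pharo-core && git bundle create ../pharo-core.bundle --all   # one file holding all refs and history
$ git clone pharo-core.bundle pharo-core                          # on the receiving machine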
In reply to this post by Jan Vrany
On Tuesday 23 May 2017 08:37 PM, Jan Vrany wrote:
> I did explore this path with FUSE ages ago (~10 years, IIRC) and,
> if my memory is not failing me, it was possible but turned out to be
> useless in practice. The problem was that the call to FUSE entered
> the kernel and sat there until there was a FS request, at which point
> it called back into my Smalltalk. This meant Smalltalk was unresponsive
> most of the time. Besides, one needed superuser privileges and all
> that stuff. Things may be different these days.

Yes, FUSE has evolved. Here is a recent paper on it:

https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

The main advantage of FUSE is that responders don't need root permissions to mount archives, unlike NFS. There are over 100 of them, like iso, zip, vfat, cram, smb, etc. I use fuse-zip regularly to skip extracting lots of small files onto disks/SSDs. WebDAV or sshfs is a little more complex but is multi-platform, as they work over network sockets instead of device I/O.

The important point here is not the protocol but the live mapping of Pharo packages/sources into the host filesystem namespace. Live mapping will let us leverage existing tools for devops instead of having to build separate clients for each version control system right into Pharo. The performance boost we get when objects pass from RAM to RAM without going through disk blocks is a bonus.

Regards .. Subbu
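For reference, the fuse-zip workflow mentioned above looks roughly like this; the archive and mount-point names are made up and options may differ between versions:

$ mkdir -p ~/mnt/bloc
$ fuse-zip bloc-tutorial.zip ~/mnt/bloc   # browse the zip as a directory, no extraction to disk
$ fusermount -u ~/mnt/bloc                # unmount when done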
On Wed, 2017-05-24 at 11:46 +0530, K K Subbu wrote:
> Yes, FUSE has evolved. Here is a recent paper on it:
> https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf

It did, indeed. Thanks!

I had a wee look at the newest source, and it looks like one would still have to implement the FUSE event loop [1] oneself. It seems the functions needed are not part of the public FUSE API, so one may need to step onto the thin ice of using an internal API which may not be visible from outside. Still, seems doable.

Jan

[1]: https://github.com/libfuse/libfuse/blob/master/lib/fuse_loop.c#L19