Faster directory enumeration?

bpi
Dear Squeakers,

I want to count files with a certain extension in a folder recursively. Here is the code I use:

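| dir count runtime |
count := 0.
dir := FileDirectory on: '/Users/bernhard/Library/Mail/V4/D77E3582-7EBE-4B5A-BFE0-E30BF6AE995F/Smalltalk.mbox/Squeak.mbox'.
runtime := Time millisecondsToRun: [
        "The block receives the path of directory entries leading to each node; the last element is the node itself."
        dir directoryTreeDo: [:each |
                (each last name endsWith: '.emlx') ifTrue: [count := count + 1]]].
{count. runtime} "=> #(289747 66109), i.e. the file count and the runtime in milliseconds"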

As you can see, it finds 289,747 files and takes about 66 seconds. Is there any faster way to do this given the current VM primitives?

The reason I ask is that the equivalent Python code takes between 1.5 and 6 seconds. :-/

#!/usr/local/bin/python3
import os
import time

path = '/Users/bernhard/Library/Mail/V4/D77E3582-7EBE-4B5A-BFE0-E30BF6AE995F/Smalltalk.mbox/Squeak.mbox'

print(path)

start = time.time()
emlx = 0
for dirpath, dirnames, filenames in os.walk(path):
    for filename in filenames:
        if filename.endswith('.emlx'):
            emlx += 1

runtime = time.time() - start

print(emlx, runtime)

It seems to have to do with an optimized os.scandir() function, described here: https://www.python.org/dev/peps/pep-0471/
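For illustration, here is a minimal sketch of the same count written directly against os.scandir(), which os.walk() has used internally since Python 3.5. The function name count_files_with_suffix is made up for this example, and error handling is kept to a minimum. As far as I understand PEP 471, the gain comes from each directory entry already carrying its file type in most cases, so no separate stat() call per file is needed.

```python
import os


def count_files_with_suffix(root, suffix):
    """Count non-directory entries below root whose names end with suffix.

    Hypothetical helper for illustration only.
    """
    count = 0
    pending = [root]  # directories still to be scanned
    while pending:
        directory = pending.pop()
        try:
            entries = os.scandir(directory)
        except OSError:
            continue  # unreadable directory: skip it, as os.walk() does by default
        with entries:
            for entry in entries:
                # is_dir() normally answers from the data returned by the
                # directory listing itself, without an extra stat() call.
                if entry.is_dir():
                    # Like os.walk(), do not descend into symlinked directories.
                    if not entry.is_symlink():
                        pending.append(entry.path)
                elif entry.name.endswith(suffix):
                    count += 1
    return count
```

Calling count_files_with_suffix(path, '.emlx') should give the same number as the os.walk() loop above, since it classifies entries the same way.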

Cheers,
Bernhard



Re: Faster directory enumeration?

David T. Lewis
It is probably far too bit-rotted to be of any use now, but here is what I
came up with 15 years ago to improve this:

  http://wiki.squeak.org/squeak/2274

I did briefly look at this again a couple of years ago, and put the
updates on SqueakSource. But I think I found that the directory primitives
are nowhere near as big a win now as they were 15 years ago. Nevertheless
it may still be of some interest.

Dave

Re: Faster directory enumeration?

bpi
Hi Dave,

Thanks for the answer. I guess I would need to build the latest version of the plugin myself, right? (I am on macOS Sierra.)

I was able to load DirectoryPlugin. However, VMConstruction-Plugins-DirectoryPlugin requires InterpreterPlugin to be available.

Bernhard

> On 17.10.2016 at 19:56, David T. Lewis <[hidden email]> wrote:
>
> It is probably far too bit-rotted to be of any use now, but here is what I
> came up with 15 years ago to improve this:
>
>  http://wiki.squeak.org/squeak/2274
>
> I did briefly look at this again a couple of years ago, and put the
> updates on SqueakSource. But I think I found that the directory primitives
> are nowhere near as big a win now as they were 15 years ago. Nevertheless
> it may still be of some interest.
>
> Dave


Re: Faster directory enumeration?

David T. Lewis
Hi Bernhard,

InterpreterPlugin is part of the VMMaker package, so you would need to be
working in an image with VMMaker loaded (maybe one of the prepared images
from Eliot's site).

I should have checked my own notes before replying - I cannot explain the
reason for this, but it seems that the readdir() primitives no longer
provided any performance benefit when I tested them a couple of years ago.

Here is what I wrote in the summary on
http://www.squeaksource.com/DirectoryPlugin:

Performance characteristics have changed significantly since Squeak circa
2003. The readdir() primitives no longer provide any benefit, but the file
testing primitives still yield a couple orders of magnitude improvement
for some functions.


So ... I guess that some additional profiling would be in order.

Dave


Re: Faster directory enumeration?

Eliot Miranda-2


On Mon, Oct 17, 2016 at 1:17 PM, David T. Lewis <[hidden email]> wrote:
> Hi Bernhard,
>
> InterpreterPlugin is part of the VMMaker package, so you would need to be
> working in an image with VMMaker loaded (maybe one of the prepared images
> from Eliot's site).

There aren't any. There is a script in the image subdirectory of http://www.github.com/opensmalltalk/vm which builds one; see image/buildspurtrunkvmmakerimage.sh

> I should have checked my own notes before replying - I cannot explain the
> reason for this, but it seems that the readdir() primitives no longer
> provided any performance benefit when I tested them a couple of years ago.
>
> Here is what I wrote in the summary on
> http://www.squeaksource.com/DirectoryPlugin:
>
> Performance characteristics have changed significantly since Squeak circa
> 2003. The readdir() primitives no longer provide any benefit, but the file
> testing primitives still yield a couple orders of magnitude improvement
> for some functions.
>
> So ... I guess that some additional profiling would be in order.
>
> Dave


--
_,,,^..^,,,_
best, Eliot


Re: Faster directory enumeration?

Levente Uzonyi
The whole image-side code starting from #directoryTreeDo: could use some
optimization, but that would only make it at most 1.5x faster.
If I were you, I'd use OSProcess and execute this:

  find directory -name '*.emlx'

It's not that nice, but it shouldn't take more than a second to find the
files.
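The same idea of delegating the traversal to find can be sketched in Python, purely to illustrate the approach outside of Squeak; this is not the OSProcess API. The helper name count_with_find is invented for this example, and the -type f and -print0 options are additions of mine that are not part of the command above: the first restricts matches to regular files, the second keeps file names containing newlines from being counted twice.

```python
import subprocess


def count_with_find(root, pattern):
    """Count regular files below root matching pattern by running find(1).

    Hypothetical helper; assumes a find binary that supports -print0
    (GNU and BSD/macOS find both do).
    """
    result = subprocess.run(
        # Arguments are passed as a list, so no shell expands the pattern.
        ["find", root, "-type", "f", "-name", pattern, "-print0"],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if find reports a failure
    )
    # Each matching path is terminated by one NUL byte.
    return result.stdout.count(b"\0")
```

For the folder from the original question, the call would be count_with_find(path, '*.emlx'), matching the .emlx extension being counted there.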

Levente

On Mon, 17 Oct 2016, Bernhard Pieber wrote:

> Dear Squeakers,
>
> I want to count files with a certain extension in a folder recursively. Here is the code I use:
>
> | dir count runtime |
> count := 0.
> dir := FileDirectory on: '/Users/bernhard/Library/Mail/V4/D77E3582-7EBE-4B5A-BFE0-E30BF6AE995F/Smalltalk.mbox/Squeak.mbox'.
> runtime := Time millisecondsToRun: [
> dir directoryTreeDo: [:each |
> (each last name endsWith: '.emlx') ifTrue: [count := count + 1]]].
> {count. runtime}. #(289747 66109)
>
> As you can see it finds 289.747 files and it takes about 66 seconds. Is there any faster way to do this given the current VM primitives?