The Inbox: Files-cmm.125.mcz

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

The Inbox: Files-cmm.125.mcz

commits-2
A new version of Files was added to project The Inbox:
http://source.squeak.org/inbox/Files-cmm.125.mcz

==================== Summary ====================

Name: Files-cmm.125
Author: cmm
Time: 27 June 2013, 7:24:33.37 pm
UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
Ancestors: Files-fbs.124

FileDirectory>>statsForDirectoryTree: is a prime example of FileDirectory living up to its poor reputation.  It does one useless thing, and does it inefficiently.  It has no senders, deservedly.
        Tree-walking operations like gathering stats can be written in apps more generally using directoryTreeDo:.

=============== Diff against Files-fbs.124 ===============

Item was changed:
+ ----- Method: FileDirectory>>directoryTreeDo:entries: (in category 'private') -----
- ----- Method: FileDirectory>>directoryTreeDo:entries: (in category 'enumeration') -----
  directoryTreeDo: oneArgBlock entries: entriesCollection
  "Value oneArgBlock with the path (an OrderedCollection of FileDirectory's) to each DirectoryEntry and the DirectoryEntry itself."
  self entries do:
  [ : each |
  entriesCollection add: each.
  oneArgBlock value: entriesCollection.
  each isDirectory ifTrue:
  [ | subdir |
  subdir := each asFileDirectory.
  subdir
  directoryTreeDo: oneArgBlock
  entries: entriesCollection ].
  entriesCollection removeLast ]!

Item was removed:
- ----- Method: FileDirectory>>statsForDirectoryTree: (in category 'enumeration') -----
- statsForDirectoryTree: rootedPathName
- "Return the size statistics for the entire directory tree starting at the given root. The result is a three element array of the form: (<number of folders><number of files><total bytes in all files>). This method also serves as an example of how recursively enumerate a directory tree."
- "FileDirectory default statsForDirectoryTree: '\smalltalk'"
-
- | dirs files bytes todo entries p |
- dirs := files := bytes := 0.
- todo := OrderedCollection with: rootedPathName.
- [todo isEmpty] whileFalse: [
- p := todo removeFirst.
- entries := self directoryContentsFor: p.
- entries do: [:entry |
- entry isDirectory
- ifTrue: [
- todo addLast: p , self pathNameDelimiter asString , entry name.
- dirs := dirs + 1]
- ifFalse: [
- files := files + 1.
- bytes := bytes + entry fileSize]]].
- ^ Array with: dirs with: files with: bytes
- !


Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Levente Uzonyi-2
On Fri, 28 Jun 2013, [hidden email] wrote:

> A new version of Files was added to project The Inbox:
> http://source.squeak.org/inbox/Files-cmm.125.mcz
>
> ==================== Summary ====================
>
> Name: Files-cmm.125
> Author: cmm
> Time: 27 June 2013, 7:24:33.37 pm
> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
> Ancestors: Files-fbs.124
>
> FileDirectory>>statsForDirectoryTree: is a prime example of FileDirectory living up to its poor reputation.  It does one useless thing, and does it inefficiently.  It has no senders, deservedly.
> Tree-walking operations like gathering stats can be written in apps more generally using directoryTreeDo:.

I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is
used by #directoryTreeDo:entries:), which is somewhat better than DFS,
because of the way FilePlugin works: you have to iterate over the contents
of the whole directory if you want to it to be efficient.
If you use DFS, then the contents of all parent directories up to the
root will have to be stored in memory during the iteration, while if you
use BFS, then only the sibling directories (without their contents) or the
siblings's parents will be in memory.

I made two "new" implementations of #statsForDirectoryTree:. This one
optimizes memory usage for BFS:

statsForDirectoryTree2: rootedPathName

  | dirs files bytes todo |
  dirs := files := bytes := 0.
  todo := OrderedCollection with: rootedPathName.
  [ todo isEmpty ] whileFalse: [
  | p |
  p := todo removeFirst.
  self directoryContentsFor: p do: [ :entry |
  entry isDirectory
  ifTrue: [
  todo addLast: p , self pathNameDelimiter asString , entry name.
  dirs := dirs + 1]
  ifFalse: [
  files := files + 1.
  bytes := bytes + entry fileSize ] ] ].
  ^{ dirs. files. bytes }

This one uses #directoryTreeDo::

statsForDirectoryTree3: rootedPathName

  | dirs files bytes |
  dirs := files := bytes := 0.
  (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
  | entry |
  entry := path last.
  entry isDirectory
  ifTrue: [ dirs := dirs + 1 ]
  ifFalse: [
  files := files + 1.
  bytes := bytes + entry fileSize ] ].
  ^{ dirs. files. bytes }

Let's see the numbers:

"Make sure the disk won't be accessed"
FileDirectory default statsForDirectoryTree: '/usr/'.
FileDirectory default statsForDirectoryTree: '/usr/'.
"Run the benchmark"
#(statsForDirectoryTree: statsForDirectoryTree2: statsForDirectoryTree3:) collect: [ :each |
  each -> ((1 to: 5) collect: [ :run |
  Smalltalk garbageCollect.
  [ FileDirectory default perform: each with: '/usr/' ] timeToRun ]) ].

{
  #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
  #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
  #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }

It's a bit noisy (i was too lazy to exit other programs), but the one
using #directoryTreeDo: seems to be clearly slower than the existing
algorithm using BFS.


Levente

>
> =============== Diff against Files-fbs.124 ===============
>
> Item was changed:
> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category 'private') -----
> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category 'enumeration') -----
>  directoryTreeDo: oneArgBlock entries: entriesCollection
>   "Value oneArgBlock with the path (an OrderedCollection of FileDirectory's) to each DirectoryEntry and the DirectoryEntry itself."
>   self entries do:
>   [ : each |
>   entriesCollection add: each.
>   oneArgBlock value: entriesCollection.
>   each isDirectory ifTrue:
>   [ | subdir |
>   subdir := each asFileDirectory.
>   subdir
>   directoryTreeDo: oneArgBlock
>   entries: entriesCollection ].
>   entriesCollection removeLast ]!
>
> Item was removed:
> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category 'enumeration') -----
> - statsForDirectoryTree: rootedPathName
> - "Return the size statistics for the entire directory tree starting at the given root. The result is a three element array of the form: (<number of folders><number of files><total bytes in all files>). This method also serves as an example of how recursively enumerate a directory tree."
> - "FileDirectory default statsForDirectoryTree: '\smalltalk'"
> -
> - | dirs files bytes todo entries p |
> - dirs := files := bytes := 0.
> - todo := OrderedCollection with: rootedPathName.
> - [todo isEmpty] whileFalse: [
> - p := todo removeFirst.
> - entries := self directoryContentsFor: p.
> - entries do: [:entry |
> - entry isDirectory
> - ifTrue: [
> - todo addLast: p , self pathNameDelimiter asString , entry name.
> - dirs := dirs + 1]
> - ifFalse: [
> - files := files + 1.
> - bytes := bytes + entry fileSize]]].
> - ^ Array with: dirs with: files with: bytes
> - !
>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Hannes Hirzel
So, what do you propose Levente? Maybe just an update of the comment
of FileDirectory which explains this issue and gives your example
statistics code?

--Hannes

On 6/29/13, Levente Uzonyi <[hidden email]> wrote:

> On Fri, 28 Jun 2013, [hidden email] wrote:
>
>> A new version of Files was added to project The Inbox:
>> http://source.squeak.org/inbox/Files-cmm.125.mcz
>>
>> ==================== Summary ====================
>>
>> Name: Files-cmm.125
>> Author: cmm
>> Time: 27 June 2013, 7:24:33.37 pm
>> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
>> Ancestors: Files-fbs.124
>>
>> FileDirectory>>statsForDirectoryTree: is a prime example of FileDirectory
>> living up to its poor reputation.  It does one useless thing, and does it
>> inefficiently.  It has no senders, deservedly.
>> Tree-walking operations like gathering stats can be written in apps more
>> generally using directoryTreeDo:.
>
> I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is
> used by #directoryTreeDo:entries:), which is somewhat better than DFS,
> because of the way FilePlugin works: you have to iterate over the contents
> of the whole directory if you want to it to be efficient.
> If you use DFS, then the contents of all parent directories up to the
> root will have to be stored in memory during the iteration, while if you
> use BFS, then only the sibling directories (without their contents) or the
> siblings's parents will be in memory.
>
> I made two "new" implementations of #statsForDirectoryTree:. This one
> optimizes memory usage for BFS:
>
> statsForDirectoryTree2: rootedPathName
>
>   | dirs files bytes todo |
>   dirs := files := bytes := 0.
>   todo := OrderedCollection with: rootedPathName.
>   [ todo isEmpty ] whileFalse: [
>   | p |
>   p := todo removeFirst.
>   self directoryContentsFor: p do: [ :entry |
>   entry isDirectory
>   ifTrue: [
>   todo addLast: p , self pathNameDelimiter asString , entry name.
>   dirs := dirs + 1]
>   ifFalse: [
>   files := files + 1.
>   bytes := bytes + entry fileSize ] ] ].
>   ^{ dirs. files. bytes }
>
> This one uses #directoryTreeDo::
>
> statsForDirectoryTree3: rootedPathName
>
>   | dirs files bytes |
>   dirs := files := bytes := 0.
>   (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
>   | entry |
>   entry := path last.
>   entry isDirectory
>   ifTrue: [ dirs := dirs + 1 ]
>   ifFalse: [
>   files := files + 1.
>   bytes := bytes + entry fileSize ] ].
>   ^{ dirs. files. bytes }
>
> Let's see the numbers:
>
> "Make sure the disk won't be accessed"
> FileDirectory default statsForDirectoryTree: '/usr/'.
> FileDirectory default statsForDirectoryTree: '/usr/'.
> "Run the benchmark"
> #(statsForDirectoryTree: statsForDirectoryTree2: statsForDirectoryTree3:)
> collect: [ :each |
>   each -> ((1 to: 5) collect: [ :run |
>   Smalltalk garbageCollect.
>   [ FileDirectory default perform: each with: '/usr/' ] timeToRun ]) ].
>
> {
>   #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
>   #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
>   #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }
>
> It's a bit noisy (i was too lazy to exit other programs), but the one
> using #directoryTreeDo: seems to be clearly slower than the existing
> algorithm using BFS.
>
>
> Levente
>
>>
>> =============== Diff against Files-fbs.124 ===============
>>
>> Item was changed:
>> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>> 'private') -----
>> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>> 'enumeration') -----
>>  directoryTreeDo: oneArgBlock entries: entriesCollection
>>   "Value oneArgBlock with the path (an OrderedCollection of
>> FileDirectory's) to each DirectoryEntry and the DirectoryEntry itself."
>>   self entries do:
>>   [ : each |
>>   entriesCollection add: each.
>>   oneArgBlock value: entriesCollection.
>>   each isDirectory ifTrue:
>>   [ | subdir |
>>   subdir := each asFileDirectory.
>>   subdir
>>   directoryTreeDo: oneArgBlock
>>   entries: entriesCollection ].
>>   entriesCollection removeLast ]!
>>
>> Item was removed:
>> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category
>> 'enumeration') -----
>> - statsForDirectoryTree: rootedPathName
>> - "Return the size statistics for the entire directory tree starting at
>> the given root. The result is a three element array of the form: (<number
>> of folders><number of files><total bytes in all files>). This method also
>> serves as an example of how recursively enumerate a directory tree."
>> - "FileDirectory default statsForDirectoryTree: '\smalltalk'"
>> -
>> - | dirs files bytes todo entries p |
>> - dirs := files := bytes := 0.
>> - todo := OrderedCollection with: rootedPathName.
>> - [todo isEmpty] whileFalse: [
>> - p := todo removeFirst.
>> - entries := self directoryContentsFor: p.
>> - entries do: [:entry |
>> - entry isDirectory
>> - ifTrue: [
>> - todo addLast: p , self pathNameDelimiter asString , entry name.
>> - dirs := dirs + 1]
>> - ifFalse: [
>> - files := files + 1.
>> - bytes := bytes + entry fileSize]]].
>> - ^ Array with: dirs with: files with: bytes
>> - !
>>
>>
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Chris Muller-3
In reply to this post by Levente Uzonyi-2
Hi Levente, I do not know what you mean by DFS and BFS.  My claim that
statsForDirectoryTree: "does one useless thing, and does it
inefficiently" is commentary about the _design_, not the
implementation, because it forces me to enumerate an entire directory
even if I don't want to.  For example, what if I want to exclude a
".temp" directory's stats from its containing directory's stats?  To
do that, I'd have to run #statsForDirectoryTree: twice.  That's not
going to be faster..

On Sat, Jun 29, 2013 at 5:15 PM, Levente Uzonyi <[hidden email]> wrote:

> On Fri, 28 Jun 2013, [hidden email] wrote:
>
>> A new version of Files was added to project The Inbox:
>> http://source.squeak.org/inbox/Files-cmm.125.mcz
>>
>> ==================== Summary ====================
>>
>> Name: Files-cmm.125
>> Author: cmm
>> Time: 27 June 2013, 7:24:33.37 pm
>> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
>> Ancestors: Files-fbs.124
>>
>> FileDirectory>>statsForDirectoryTree: is a prime example of FileDirectory
>> living up to its poor reputation.  It does one useless thing, and does it
>> inefficiently.  It has no senders, deservedly.
>>         Tree-walking operations like gathering stats can be written in
>> apps more generally using directoryTreeDo:.
>
>
> I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is used
> by #directoryTreeDo:entries:), which is somewhat better than DFS, because of
> the way FilePlugin works: you have to iterate over the contents of the whole
> directory if you want to it to be efficient.
> If you use DFS, then the contents of all parent directories up to the root
> will have to be stored in memory during the iteration, while if you use BFS,
> then only the sibling directories (without their contents) or the siblings's
> parents will be in memory.
>
> I made two "new" implementations of #statsForDirectoryTree:. This one
> optimizes memory usage for BFS:
>
> statsForDirectoryTree2: rootedPathName
>
>         | dirs files bytes todo |
>
>         dirs := files := bytes := 0.
>         todo := OrderedCollection with: rootedPathName.
>         [ todo isEmpty ] whileFalse: [
>                 | p |
>                 p := todo removeFirst.
>                 self directoryContentsFor: p do: [ :entry |
>                         entry isDirectory
>                                 ifTrue: [
>
>                                         todo addLast: p , self
> pathNameDelimiter asString , entry name.
>                                         dirs := dirs + 1]
>                                 ifFalse: [
>
>                                         files := files + 1.
>                                         bytes := bytes + entry fileSize ] ]
> ].
>         ^{ dirs. files. bytes }
>
> This one uses #directoryTreeDo::
>
> statsForDirectoryTree3: rootedPathName
>
>         | dirs files bytes |
>
>         dirs := files := bytes := 0.
>         (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
>                 | entry |
>                 entry := path last.
>                 entry isDirectory
>                         ifTrue: [ dirs := dirs + 1 ]
>                         ifFalse: [
>
>                                 files := files + 1.
>                                 bytes := bytes + entry fileSize ] ].
>         ^{ dirs. files. bytes }
>
> Let's see the numbers:
>
> "Make sure the disk won't be accessed"
> FileDirectory default statsForDirectoryTree: '/usr/'.
> FileDirectory default statsForDirectoryTree: '/usr/'.
> "Run the benchmark"
> #(statsForDirectoryTree: statsForDirectoryTree2: statsForDirectoryTree3:)
> collect: [ :each |
>         each -> ((1 to: 5) collect: [ :run |
>                 Smalltalk garbageCollect.
>                 [ FileDirectory default perform: each with: '/usr/' ]
> timeToRun ]) ].
>
> {
>         #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
>         #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
>         #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }
>
> It's a bit noisy (i was too lazy to exit other programs), but the one using
> #directoryTreeDo: seems to be clearly slower than the existing algorithm
> using BFS.
>
>
> Levente
>
>
>>
>> =============== Diff against Files-fbs.124 ===============
>>
>> Item was changed:
>> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>> 'private') -----
>> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>> 'enumeration') -----
>>  directoryTreeDo: oneArgBlock entries: entriesCollection
>>         "Value oneArgBlock with the path (an OrderedCollection of
>> FileDirectory's) to each DirectoryEntry and the DirectoryEntry itself."
>>         self entries do:
>>                 [ : each |
>>                 entriesCollection add: each.
>>                 oneArgBlock value: entriesCollection.
>>                 each isDirectory ifTrue:
>>                         [ | subdir |
>>                         subdir := each asFileDirectory.
>>                         subdir
>>                                 directoryTreeDo: oneArgBlock
>>                                 entries: entriesCollection ].
>>                 entriesCollection removeLast ]!
>>
>> Item was removed:
>> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category
>> 'enumeration') -----
>> - statsForDirectoryTree: rootedPathName
>> -       "Return the size statistics for the entire directory tree starting
>> at the given root. The result is a three element array of the form: (<number
>> of folders><number of files><total bytes in all files>). This method also
>> serves as an example of how recursively enumerate a directory tree."
>> -       "FileDirectory default statsForDirectoryTree: '\smalltalk'"
>> -
>> -       | dirs files bytes todo entries p |
>> -       dirs := files := bytes := 0.
>> -       todo := OrderedCollection with: rootedPathName.
>> -       [todo isEmpty] whileFalse: [
>> -               p := todo removeFirst.
>> -               entries := self directoryContentsFor: p.
>> -               entries do: [:entry |
>> -                       entry isDirectory
>> -                               ifTrue: [
>> -                                       todo addLast: p , self
>> pathNameDelimiter asString , entry name.
>> -                                       dirs := dirs + 1]
>> -                               ifFalse: [
>> -                                       files := files + 1.
>> -                                       bytes := bytes + entry
>> fileSize]]].
>> -       ^ Array with: dirs with: files with: bytes
>> - !
>>
>>
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Frank Shearar-3
On 30 June 2013 18:19, Chris Muller <[hidden email]> wrote:
> Hi Levente, I do not know what you mean by DFS and BFS.

Depth-first versus breadth-first search.

frank

>  My claim that
> statsForDirectoryTree: "does one useless thing, and does it
> inefficiently" is commentary about the _design_, not the
> implementation, because it forces me to enumerate an entire directory
> even if I don't want to.  For example, what if I want to exclude a
> ".temp" directory's stats from its containing directory's stats?  To
> do that, I'd have to run #statsForDirectoryTree: twice.  That's not
> going to be faster..
>
> On Sat, Jun 29, 2013 at 5:15 PM, Levente Uzonyi <[hidden email]> wrote:
>> On Fri, 28 Jun 2013, [hidden email] wrote:
>>
>>> A new version of Files was added to project The Inbox:
>>> http://source.squeak.org/inbox/Files-cmm.125.mcz
>>>
>>> ==================== Summary ====================
>>>
>>> Name: Files-cmm.125
>>> Author: cmm
>>> Time: 27 June 2013, 7:24:33.37 pm
>>> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
>>> Ancestors: Files-fbs.124
>>>
>>> FileDirectory>>statsForDirectoryTree: is a prime example of FileDirectory
>>> living up to its poor reputation.  It does one useless thing, and does it
>>> inefficiently.  It has no senders, deservedly.
>>>         Tree-walking operations like gathering stats can be written in
>>> apps more generally using directoryTreeDo:.
>>
>>
>> I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is used
>> by #directoryTreeDo:entries:), which is somewhat better than DFS, because of
>> the way FilePlugin works: you have to iterate over the contents of the whole
>> directory if you want to it to be efficient.
>> If you use DFS, then the contents of all parent directories up to the root
>> will have to be stored in memory during the iteration, while if you use BFS,
>> then only the sibling directories (without their contents) or the siblings's
>> parents will be in memory.
>>
>> I made two "new" implementations of #statsForDirectoryTree:. This one
>> optimizes memory usage for BFS:
>>
>> statsForDirectoryTree2: rootedPathName
>>
>>         | dirs files bytes todo |
>>
>>         dirs := files := bytes := 0.
>>         todo := OrderedCollection with: rootedPathName.
>>         [ todo isEmpty ] whileFalse: [
>>                 | p |
>>                 p := todo removeFirst.
>>                 self directoryContentsFor: p do: [ :entry |
>>                         entry isDirectory
>>                                 ifTrue: [
>>
>>                                         todo addLast: p , self
>> pathNameDelimiter asString , entry name.
>>                                         dirs := dirs + 1]
>>                                 ifFalse: [
>>
>>                                         files := files + 1.
>>                                         bytes := bytes + entry fileSize ] ]
>> ].
>>         ^{ dirs. files. bytes }
>>
>> This one uses #directoryTreeDo::
>>
>> statsForDirectoryTree3: rootedPathName
>>
>>         | dirs files bytes |
>>
>>         dirs := files := bytes := 0.
>>         (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
>>                 | entry |
>>                 entry := path last.
>>                 entry isDirectory
>>                         ifTrue: [ dirs := dirs + 1 ]
>>                         ifFalse: [
>>
>>                                 files := files + 1.
>>                                 bytes := bytes + entry fileSize ] ].
>>         ^{ dirs. files. bytes }
>>
>> Let's see the numbers:
>>
>> "Make sure the disk won't be accessed"
>> FileDirectory default statsForDirectoryTree: '/usr/'.
>> FileDirectory default statsForDirectoryTree: '/usr/'.
>> "Run the benchmark"
>> #(statsForDirectoryTree: statsForDirectoryTree2: statsForDirectoryTree3:)
>> collect: [ :each |
>>         each -> ((1 to: 5) collect: [ :run |
>>                 Smalltalk garbageCollect.
>>                 [ FileDirectory default perform: each with: '/usr/' ]
>> timeToRun ]) ].
>>
>> {
>>         #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
>>         #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
>>         #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }
>>
>> It's a bit noisy (i was too lazy to exit other programs), but the one using
>> #directoryTreeDo: seems to be clearly slower than the existing algorithm
>> using BFS.
>>
>>
>> Levente
>>
>>
>>>
>>> =============== Diff against Files-fbs.124 ===============
>>>
>>> Item was changed:
>>> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>> 'private') -----
>>> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>> 'enumeration') -----
>>>  directoryTreeDo: oneArgBlock entries: entriesCollection
>>>         "Value oneArgBlock with the path (an OrderedCollection of
>>> FileDirectory's) to each DirectoryEntry and the DirectoryEntry itself."
>>>         self entries do:
>>>                 [ : each |
>>>                 entriesCollection add: each.
>>>                 oneArgBlock value: entriesCollection.
>>>                 each isDirectory ifTrue:
>>>                         [ | subdir |
>>>                         subdir := each asFileDirectory.
>>>                         subdir
>>>                                 directoryTreeDo: oneArgBlock
>>>                                 entries: entriesCollection ].
>>>                 entriesCollection removeLast ]!
>>>
>>> Item was removed:
>>> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category
>>> 'enumeration') -----
>>> - statsForDirectoryTree: rootedPathName
>>> -       "Return the size statistics for the entire directory tree starting
>>> at the given root. The result is a three element array of the form: (<number
>>> of folders><number of files><total bytes in all files>). This method also
>>> serves as an example of how recursively enumerate a directory tree."
>>> -       "FileDirectory default statsForDirectoryTree: '\smalltalk'"
>>> -
>>> -       | dirs files bytes todo entries p |
>>> -       dirs := files := bytes := 0.
>>> -       todo := OrderedCollection with: rootedPathName.
>>> -       [todo isEmpty] whileFalse: [
>>> -               p := todo removeFirst.
>>> -               entries := self directoryContentsFor: p.
>>> -               entries do: [:entry |
>>> -                       entry isDirectory
>>> -                               ifTrue: [
>>> -                                       todo addLast: p , self
>>> pathNameDelimiter asString , entry name.
>>> -                                       dirs := dirs + 1]
>>> -                               ifFalse: [
>>> -                                       files := files + 1.
>>> -                                       bytes := bytes + entry
>>> fileSize]]].
>>> -       ^ Array with: dirs with: files with: bytes
>>> - !
>>>
>>>
>>>
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Chris Muller-4
Ah, thank you!  Now I understand Levente's comments.

Levente, that is an excellent point about directoryTreeDo: traversing
depth-first.  In fact that makes me want to change it to be
breadth-first..  I'm trying to think about all the use-cases it serves
and whether that would make a difference; since we have the path,
maybe not..

On Sun, Jun 30, 2013 at 1:29 PM, Frank Shearar <[hidden email]> wrote:

> On 30 June 2013 18:19, Chris Muller <[hidden email]> wrote:
>> Hi Levente, I do not know what you mean by DFS and BFS.
>
> Depth-first versus breadth-first search.
>
> frank
>
>>  My claim that
>> statsForDirectoryTree: "does one useless thing, and does it
>> inefficiently" is commentary about the _design_, not the
>> implementation, because it forces me to enumerate an entire directory
>> even if I don't want to.  For example, what if I want to exclude a
>> ".temp" directory's stats from its containing directory's stats?  To
>> do that, I'd have to run #statsForDirectoryTree: twice.  That's not
>> going to be faster..
>>
>> On Sat, Jun 29, 2013 at 5:15 PM, Levente Uzonyi <[hidden email]> wrote:
>>> On Fri, 28 Jun 2013, [hidden email] wrote:
>>>
>>>> A new version of Files was added to project The Inbox:
>>>> http://source.squeak.org/inbox/Files-cmm.125.mcz
>>>>
>>>> ==================== Summary ====================
>>>>
>>>> Name: Files-cmm.125
>>>> Author: cmm
>>>> Time: 27 June 2013, 7:24:33.37 pm
>>>> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
>>>> Ancestors: Files-fbs.124
>>>>
>>>> FileDirectory>>statsForDirectoryTree: is a prime example of FileDirectory
>>>> living up to its poor reputation.  It does one useless thing, and does it
>>>> inefficiently.  It has no senders, deservedly.
>>>>         Tree-walking operations like gathering stats can be written in
>>>> apps more generally using directoryTreeDo:.
>>>
>>>
>>> I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is used
>>> by #directoryTreeDo:entries:), which is somewhat better than DFS, because of
>>> the way FilePlugin works: you have to iterate over the contents of the whole
>>> directory if you want to it to be efficient.
>>> If you use DFS, then the contents of all parent directories up to the root
>>> will have to be stored in memory during the iteration, while if you use BFS,
>>> then only the sibling directories (without their contents) or the siblings's
>>> parents will be in memory.
>>>
>>> I made two "new" implementations of #statsForDirectoryTree:. This one
>>> optimizes memory usage for BFS:
>>>
>>> statsForDirectoryTree2: rootedPathName
>>>
>>>         | dirs files bytes todo |
>>>
>>>         dirs := files := bytes := 0.
>>>         todo := OrderedCollection with: rootedPathName.
>>>         [ todo isEmpty ] whileFalse: [
>>>                 | p |
>>>                 p := todo removeFirst.
>>>                 self directoryContentsFor: p do: [ :entry |
>>>                         entry isDirectory
>>>                                 ifTrue: [
>>>
>>>                                         todo addLast: p , self
>>> pathNameDelimiter asString , entry name.
>>>                                         dirs := dirs + 1]
>>>                                 ifFalse: [
>>>
>>>                                         files := files + 1.
>>>                                         bytes := bytes + entry fileSize ] ]
>>> ].
>>>         ^{ dirs. files. bytes }
>>>
>>> This one uses #directoryTreeDo::
>>>
>>> statsForDirectoryTree3: rootedPathName
>>>
>>>         | dirs files bytes |
>>>
>>>         dirs := files := bytes := 0.
>>>         (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
>>>                 | entry |
>>>                 entry := path last.
>>>                 entry isDirectory
>>>                         ifTrue: [ dirs := dirs + 1 ]
>>>                         ifFalse: [
>>>
>>>                                 files := files + 1.
>>>                                 bytes := bytes + entry fileSize ] ].
>>>         ^{ dirs. files. bytes }
>>>
>>> Let's see the numbers:
>>>
>>> "Make sure the disk won't be accessed"
>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>> "Run the benchmark"
>>> #(statsForDirectoryTree: statsForDirectoryTree2: statsForDirectoryTree3:)
>>> collect: [ :each |
>>>         each -> ((1 to: 5) collect: [ :run |
>>>                 Smalltalk garbageCollect.
>>>                 [ FileDirectory default perform: each with: '/usr/' ]
>>> timeToRun ]) ].
>>>
>>> {
>>>         #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
>>>         #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
>>>         #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }
>>>
>>> It's a bit noisy (i was too lazy to exit other programs), but the one using
>>> #directoryTreeDo: seems to be clearly slower than the existing algorithm
>>> using BFS.
>>>
>>>
>>> Levente
>>>
>>>
>>>>
>>>> =============== Diff against Files-fbs.124 ===============
>>>>
>>>> Item was changed:
>>>> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>> 'private') -----
>>>> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>> 'enumeration') -----
>>>>  directoryTreeDo: oneArgBlock entries: entriesCollection
>>>>         "Value oneArgBlock with the path (an OrderedCollection of
>>>> FileDirectory's) to each DirectoryEntry and the DirectoryEntry itself."
>>>>         self entries do:
>>>>                 [ : each |
>>>>                 entriesCollection add: each.
>>>>                 oneArgBlock value: entriesCollection.
>>>>                 each isDirectory ifTrue:
>>>>                         [ | subdir |
>>>>                         subdir := each asFileDirectory.
>>>>                         subdir
>>>>                                 directoryTreeDo: oneArgBlock
>>>>                                 entries: entriesCollection ].
>>>>                 entriesCollection removeLast ]!
>>>>
>>>> Item was removed:
>>>> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category
>>>> 'enumeration') -----
>>>> - statsForDirectoryTree: rootedPathName
>>>> -       "Return the size statistics for the entire directory tree starting
>>>> at the given root. The result is a three element array of the form: (<number
>>>> of folders><number of files><total bytes in all files>). This method also
>>>> serves as an example of how recursively enumerate a directory tree."
>>>> -       "FileDirectory default statsForDirectoryTree: '\smalltalk'"
>>>> -
>>>> -       | dirs files bytes todo entries p |
>>>> -       dirs := files := bytes := 0.
>>>> -       todo := OrderedCollection with: rootedPathName.
>>>> -       [todo isEmpty] whileFalse: [
>>>> -               p := todo removeFirst.
>>>> -               entries := self directoryContentsFor: p.
>>>> -               entries do: [:entry |
>>>> -                       entry isDirectory
>>>> -                               ifTrue: [
>>>> -                                       todo addLast: p , self
>>>> pathNameDelimiter asString , entry name.
>>>> -                                       dirs := dirs + 1]
>>>> -                               ifFalse: [
>>>> -                                       files := files + 1.
>>>> -                                       bytes := bytes + entry
>>>> fileSize]]].
>>>> -       ^ Array with: dirs with: files with: bytes
>>>> - !
>>>>
>>>>
>>>>
>>>
>>

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Levente Uzonyi-2
On Sun, 30 Jun 2013, Chris Muller wrote:

> Ah, thank you!  Now I understand Levente's comments.
>
> Levente, that is an excellent point about directoryTreeDo: traversing
> depth-first.  In fact that makes me want to change it to be
> breadth-first..  I'm trying to think about all the use-cases it serves
> and whether that would make a difference; since we have the path,
> maybe not..

The only techinal hurdle in changing the algorithm is the argument of
argument the block. Currently it is an OrderedCollection of DirectoryEntry
objects from the root to the current DirectoryEntry object. With DFS
this is naturally available, but with BFS it's not.

So the question is, what should we do?

1) keep the existing OrderedCollection argument

I uploaded Files-ul.125 to the Inbox, which changes the implementation of
#directoryTreeDo: this way. As you can see, it's not an easy to understand
algorithm. :)

2) use some kind of linked list-like data structure instead.

Here's an example implementation for this:

  | queue argument |
  argument := { self directoryEntry. nil }.
  oneArgBlock value: argument.
  queue := OrderedCollection with: argument copy.
  [ queue isEmpty ] whileFalse: [
  | parent |
  parent := queue removeFirst.
  argument at: 2 put: parent.
  parent first asFileDirectory entriesDo: [ :entry |
  argument at: 1 put: entry.
  oneArgBlock value: argument.
  entry isDirectory ifTrue: [ queue addLast: argument copy ] ] ]

3) only pass the current DirectoryEntry to the block and let the user
decide if he needs the whole path (though it costs a lot to recreate it)

Implementation:

  | queue |
  queue := OrderedCollection with: self directoryEntry.
  oneArgBlock value: queue first.
  [ queue isEmpty ] whileFalse: [
  queue removeFirst asFileDirectory entriesDo: [ :entry |
  oneArgBlock value: entry.
  entry isDirectory ifTrue: [ queue addLast: entry ] ] ]

Both examples at 2) and 3) require the #entriesDo: method, which is
available in Files-ul.125 from the Inbox.

If you profile any of this code on a larger directory tree, you'll see
that there's plenty of room for improvement in FileDirectory.


Levente

>
> On Sun, Jun 30, 2013 at 1:29 PM, Frank Shearar <[hidden email]> wrote:
>> On 30 June 2013 18:19, Chris Muller <[hidden email]> wrote:
>>> Hi Levente, I do not know what you mean by DFS and BFS.
>>
>> Depth-first versus breadth-first search.
>>
>> frank
>>
>>>  My claim that
>>> statsForDirectoryTree: "does one useless thing, and does it
>>> inefficiently" is commentary about the _design_, not the
>>> implementation, because it forces me to enumerate an entire directory
>>> even if I don't want to.  For example, what if I want to exclude a
>>> ".temp" directory's stats from its containing directory's stats?  To
>>> do that, I'd have to run #statsForDirectoryTree: twice.  That's not
>>> going to be faster..
>>>
>>> On Sat, Jun 29, 2013 at 5:15 PM, Levente Uzonyi <[hidden email]> wrote:
>>>> On Fri, 28 Jun 2013, [hidden email] wrote:
>>>>
>>>>> A new version of Files was added to project The Inbox:
>>>>> http://source.squeak.org/inbox/Files-cmm.125.mcz
>>>>>
>>>>> ==================== Summary ====================
>>>>>
>>>>> Name: Files-cmm.125
>>>>> Author: cmm
>>>>> Time: 27 June 2013, 7:24:33.37 pm
>>>>> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
>>>>> Ancestors: Files-fbs.124
>>>>>
>>>>> FileDirectory>>statsForDirectoryTree: is a prime example of FileDirectory
>>>>> living up to its poor reputation.  It does one useless thing, and does it
>>>>> inefficiently.  It has no senders, deservedly.
>>>>>         Tree-walking operations like gathering stats can be written in
>>>>> apps more generally using directoryTreeDo:.
>>>>
>>>>
>>>> I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is used
>>>> by #directoryTreeDo:entries:), which is somewhat better than DFS, because of
>>>> the way FilePlugin works: you have to iterate over the contents of the whole
>>>> directory if you want to it to be efficient.
>>>> If you use DFS, then the contents of all parent directories up to the root
>>>> will have to be stored in memory during the iteration, while if you use BFS,
>>>> then only the sibling directories (without their contents) or the siblings's
>>>> parents will be in memory.
>>>>
>>>> I made two "new" implementations of #statsForDirectoryTree:. This one
>>>> optimizes memory usage for BFS:
>>>>
>>>> statsForDirectoryTree2: rootedPathName
>>>>
>>>>         | dirs files bytes todo |
>>>>
>>>>         dirs := files := bytes := 0.
>>>>         todo := OrderedCollection with: rootedPathName.
>>>>         [ todo isEmpty ] whileFalse: [
>>>>                 | p |
>>>>                 p := todo removeFirst.
>>>>                 self directoryContentsFor: p do: [ :entry |
>>>>                         entry isDirectory
>>>>                                 ifTrue: [
>>>>
>>>>                                         todo addLast: p , self
>>>> pathNameDelimiter asString , entry name.
>>>>                                         dirs := dirs + 1]
>>>>                                 ifFalse: [
>>>>
>>>>                                         files := files + 1.
>>>>                                         bytes := bytes + entry fileSize ] ]
>>>> ].
>>>>         ^{ dirs. files. bytes }
>>>>
>>>> This one uses #directoryTreeDo::
>>>>
>>>> statsForDirectoryTree3: rootedPathName
>>>>
>>>>         | dirs files bytes |
>>>>
>>>>         dirs := files := bytes := 0.
>>>>         (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
>>>>                 | entry |
>>>>                 entry := path last.
>>>>                 entry isDirectory
>>>>                         ifTrue: [ dirs := dirs + 1 ]
>>>>                         ifFalse: [
>>>>
>>>>                                 files := files + 1.
>>>>                                 bytes := bytes + entry fileSize ] ].
>>>>         ^{ dirs. files. bytes }
>>>>
>>>> Let's see the numbers:
>>>>
>>>> "Make sure the disk won't be accessed"
>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>> "Run the benchmark"
>>>> #(statsForDirectoryTree: statsForDirectoryTree2: statsForDirectoryTree3:)
>>>> collect: [ :each |
>>>>         each -> ((1 to: 5) collect: [ :run |
>>>>                 Smalltalk garbageCollect.
>>>>                 [ FileDirectory default perform: each with: '/usr/' ]
>>>> timeToRun ]) ].
>>>>
>>>> {
>>>>         #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
>>>>         #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
>>>>         #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }
>>>>
>>>> It's a bit noisy (i was too lazy to exit other programs), but the one using
>>>> #directoryTreeDo: seems to be clearly slower than the existing algorithm
>>>> using BFS.
>>>>
>>>>
>>>> Levente
>>>>
>>>>
>>>>>
>>>>> =============== Diff against Files-fbs.124 ===============
>>>>>
>>>>> Item was changed:
>>>>> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>> 'private') -----
>>>>> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>> 'enumeration') -----
>>>>>  directoryTreeDo: oneArgBlock entries: entriesCollection
>>>>>         "Value oneArgBlock with the path (an OrderedCollection of
>>>>> FileDirectory's) to each DirectoryEntry and the DirectoryEntry itself."
>>>>>         self entries do:
>>>>>                 [ : each |
>>>>>                 entriesCollection add: each.
>>>>>                 oneArgBlock value: entriesCollection.
>>>>>                 each isDirectory ifTrue:
>>>>>                         [ | subdir |
>>>>>                         subdir := each asFileDirectory.
>>>>>                         subdir
>>>>>                                 directoryTreeDo: oneArgBlock
>>>>>                                 entries: entriesCollection ].
>>>>>                 entriesCollection removeLast ]!
>>>>>
>>>>> Item was removed:
>>>>> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category
>>>>> 'enumeration') -----
>>>>> - statsForDirectoryTree: rootedPathName
>>>>> -       "Return the size statistics for the entire directory tree starting
>>>>> at the given root. The result is a three element array of the form: (<number
>>>>> of folders><number of files><total bytes in all files>). This method also
>>>>> serves as an example of how recursively enumerate a directory tree."
>>>>> -       "FileDirectory default statsForDirectoryTree: '\smalltalk'"
>>>>> -
>>>>> -       | dirs files bytes todo entries p |
>>>>> -       dirs := files := bytes := 0.
>>>>> -       todo := OrderedCollection with: rootedPathName.
>>>>> -       [todo isEmpty] whileFalse: [
>>>>> -               p := todo removeFirst.
>>>>> -               entries := self directoryContentsFor: p.
>>>>> -               entries do: [:entry |
>>>>> -                       entry isDirectory
>>>>> -                               ifTrue: [
>>>>> -                                       todo addLast: p , self
>>>>> pathNameDelimiter asString , entry name.
>>>>> -                                       dirs := dirs + 1]
>>>>> -                               ifFalse: [
>>>>> -                                       files := files + 1.
>>>>> -                                       bytes := bytes + entry
>>>>> fileSize]]].
>>>>> -       ^ Array with: dirs with: files with: bytes
>>>>> - !
>>>>>
>>>>>
>>>>>
>>>>
>>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Levente Uzonyi-2
In reply to this post by Hannes Hirzel
On Sun, 30 Jun 2013, H. Hirzel wrote:

> So, what do you propose Levente? Maybe just an update of the comment
> of FileDirectory which explains this issue and gives your example
> statistics code?

Definitely not, but I'm not sure if it's worth fixing FileDirectory. Even
though I like its simplicity (only a few classes, single class entry
point), it currently has some bottlenecks, noise (messy code) and a
design flaw (DirectoryEntry <-> FileDirectory).


Levente

>
> --Hannes
>
> On 6/29/13, Levente Uzonyi <[hidden email]> wrote:
>> On Fri, 28 Jun 2013, [hidden email] wrote:
>>
>>> A new version of Files was added to project The Inbox:
>>> http://source.squeak.org/inbox/Files-cmm.125.mcz
>>>
>>> ==================== Summary ====================
>>>
>>> Name: Files-cmm.125
>>> Author: cmm
>>> Time: 27 June 2013, 7:24:33.37 pm
>>> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
>>> Ancestors: Files-fbs.124
>>>
>>> FileDirectory>>statsForDirectoryTree: is a prime example of FileDirectory
>>> living up to its poor reputation.  It does one useless thing, and does it
>>> inefficiently.  It has no senders, deservedly.
>>> Tree-walking operations like gathering stats can be written in apps more
>>> generally using directoryTreeDo:.
>>
>> I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is
>> used by #directoryTreeDo:entries:), which is somewhat better than DFS,
>> because of the way FilePlugin works: you have to iterate over the contents
>> of the whole directory if you want to it to be efficient.
>> If you use DFS, then the contents of all parent directories up to the
>> root will have to be stored in memory during the iteration, while if you
>> use BFS, then only the sibling directories (without their contents) or the
>> siblings's parents will be in memory.
>>
>> I made two "new" implementations of #statsForDirectoryTree:. This one
>> optimizes memory usage for BFS:
>>
>> statsForDirectoryTree2: rootedPathName
>>
>>   | dirs files bytes todo |
>>   dirs := files := bytes := 0.
>>   todo := OrderedCollection with: rootedPathName.
>>   [ todo isEmpty ] whileFalse: [
>>   | p |
>>   p := todo removeFirst.
>>   self directoryContentsFor: p do: [ :entry |
>>   entry isDirectory
>>   ifTrue: [
>>   todo addLast: p , self pathNameDelimiter asString , entry name.
>>   dirs := dirs + 1]
>>   ifFalse: [
>>   files := files + 1.
>>   bytes := bytes + entry fileSize ] ] ].
>>   ^{ dirs. files. bytes }
>>
>> This one uses #directoryTreeDo::
>>
>> statsForDirectoryTree3: rootedPathName
>>
>>   | dirs files bytes |
>>   dirs := files := bytes := 0.
>>   (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
>>   | entry |
>>   entry := path last.
>>   entry isDirectory
>>   ifTrue: [ dirs := dirs + 1 ]
>>   ifFalse: [
>>   files := files + 1.
>>   bytes := bytes + entry fileSize ] ].
>>   ^{ dirs. files. bytes }
>>
>> Let's see the numbers:
>>
>> "Make sure the disk won't be accessed"
>> FileDirectory default statsForDirectoryTree: '/usr/'.
>> FileDirectory default statsForDirectoryTree: '/usr/'.
>> "Run the benchmark"
>> #(statsForDirectoryTree: statsForDirectoryTree2: statsForDirectoryTree3:)
>> collect: [ :each |
>>   each -> ((1 to: 5) collect: [ :run |
>>   Smalltalk garbageCollect.
>>   [ FileDirectory default perform: each with: '/usr/' ] timeToRun ]) ].
>>
>> {
>>   #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
>>   #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
>>   #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }
>>
>> It's a bit noisy (i was too lazy to exit other programs), but the one
>> using #directoryTreeDo: seems to be clearly slower than the existing
>> algorithm using BFS.
>>
>>
>> Levente
>>
>>>
>>> =============== Diff against Files-fbs.124 ===============
>>>
>>> Item was changed:
>>> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>> 'private') -----
>>> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>> 'enumeration') -----
>>>  directoryTreeDo: oneArgBlock entries: entriesCollection
>>>   "Value oneArgBlock with the path (an OrderedCollection of
>>> FileDirectory's) to each DirectoryEntry and the DirectoryEntry itself."
>>>   self entries do:
>>>   [ : each |
>>>   entriesCollection add: each.
>>>   oneArgBlock value: entriesCollection.
>>>   each isDirectory ifTrue:
>>>   [ | subdir |
>>>   subdir := each asFileDirectory.
>>>   subdir
>>>   directoryTreeDo: oneArgBlock
>>>   entries: entriesCollection ].
>>>   entriesCollection removeLast ]!
>>>
>>> Item was removed:
>>> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category
>>> 'enumeration') -----
>>> - statsForDirectoryTree: rootedPathName
>>> - "Return the size statistics for the entire directory tree starting at
>>> the given root. The result is a three element array of the form: (<number
>>> of folders><number of files><total bytes in all files>). This method also
>>> serves as an example of how recursively enumerate a directory tree."
>>> - "FileDirectory default statsForDirectoryTree: '\smalltalk'"
>>> -
>>> - | dirs files bytes todo entries p |
>>> - dirs := files := bytes := 0.
>>> - todo := OrderedCollection with: rootedPathName.
>>> - [todo isEmpty] whileFalse: [
>>> - p := todo removeFirst.
>>> - entries := self directoryContentsFor: p.
>>> - entries do: [:entry |
>>> - entry isDirectory
>>> - ifTrue: [
>>> - todo addLast: p , self pathNameDelimiter asString , entry name.
>>> - dirs := dirs + 1]
>>> - ifFalse: [
>>> - files := files + 1.
>>> - bytes := bytes + entry fileSize]]].
>>> - ^ Array with: dirs with: files with: bytes
>>> - !
>>>
>>>
>>>
>>
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Chris Muller-3
In reply to this post by Levente Uzonyi-2
Good discussion thanks Levente.  You know, after more thinking I had
trouble understanding what is the characteristic of the FilePlugin
primitives that leads to better performance in a BFS?

The only use-case I can think of would be where the doBlock of
directoryTreeDo: could be short-circuited by a non-local return, such
as a detect operation.  But in a stats operation, it seems like the
same requests to FilePlugin are made in each case.  That is, every
subdirectory gets read just once via the #entries primitive in either
case BFS or DFS..

But if the whole sub-tree is going to be accessed it seems like a
wash, and indeed I'm getting pretty close times between my and your
version of directoryTreeDo::

|dir| dir:= FileDirectory default containingDirectory containingDirectory.
[dir directoryTreeDo: [ : e |  ]] timeToRun

  (BFS) "444"
  (DFS) "428"

I still believe a directoryTreeDetect: operation has potential for
less exercise of FilePlugin under BFS, but did I miss something for a
full tree-traversal?

Thanks.



On Wed, Jul 3, 2013 at 12:48 AM, Levente Uzonyi <[hidden email]> wrote:

> On Sun, 30 Jun 2013, Chris Muller wrote:
>
>> Ah, thank you!  Now I understand Levente's comments.
>>
>> Levente, that is an excellent point about directoryTreeDo: traversing
>> depth-first.  In fact that makes me want to change it to be
>> breadth-first..  I'm trying to think about all the use-cases it serves
>> and whether that would make a difference; since we have the path,
>> maybe not..
>
>
> The only techinal hurdle in changing the algorithm is the argument of
> argument the block. Currently it is an OrderedCollection of DirectoryEntry
> objects from the root to the current DirectoryEntry object. With DFS this is
> naturally available, but with BFS it's not.
>
> So the question is, what should we do?
>
> 1) keep the existing OrderedCollection argument
>
> I uploaded Files-ul.125 to the Inbox, which changes the implementation of
> #directoryTreeDo: this way. As you can see, it's not an easy to understand
> algorithm. :)
>
> 2) use some kind of linked list-like data structure instead.
>
> Here's an example implementation for this:
>
>         | queue argument |
>         argument := { self directoryEntry. nil }.
>         oneArgBlock value: argument.
>         queue := OrderedCollection with: argument copy.
>         [ queue isEmpty ] whileFalse: [
>                 | parent |
>                 parent := queue removeFirst.
>                 argument at: 2 put: parent.
>                 parent first asFileDirectory entriesDo: [ :entry |
>                         argument at: 1 put: entry.
>                         oneArgBlock value: argument.
>                         entry isDirectory ifTrue: [ queue addLast: argument
> copy ] ] ]
>
> 3) only pass the current DirectoryEntry to the block and let the user decide
> if he needs the whole path (though it costs a lot to recreate it)
>
> Implementation:
>
>         | queue |
>         queue := OrderedCollection with: self directoryEntry.
>         oneArgBlock value: queue first.
>         [ queue isEmpty ] whileFalse: [
>                 queue removeFirst asFileDirectory entriesDo: [ :entry |
>                         oneArgBlock value: entry.
>                         entry isDirectory ifTrue: [ queue addLast: entry ] ]
> ]
>
> Both examples at 2) and 3) require the #entriesDo: method, which is
> available in Files-ul.125 from the Inbox.
>
> If you profile any of this code on a larger directory tree, you'll see that
> there's plenty of room for improvement in FileDirectory.
>
>
> Levente
>
>
>>
>> On Sun, Jun 30, 2013 at 1:29 PM, Frank Shearar <[hidden email]>
>> wrote:
>>>
>>> On 30 June 2013 18:19, Chris Muller <[hidden email]> wrote:
>>>>
>>>> Hi Levente, I do not know what you mean by DFS and BFS.
>>>
>>>
>>> Depth-first versus breadth-first search.
>>>
>>> frank
>>>
>>>>  My claim that
>>>> statsForDirectoryTree: "does one useless thing, and does it
>>>> inefficiently" is commentary about the _design_, not the
>>>> implementation, because it forces me to enumerate an entire directory
>>>> even if I don't want to.  For example, what if I want to exclude a
>>>> ".temp" directory's stats from its containing directory's stats?  To
>>>> do that, I'd have to run #statsForDirectoryTree: twice.  That's not
>>>> going to be faster..
>>>>
>>>> On Sat, Jun 29, 2013 at 5:15 PM, Levente Uzonyi <[hidden email]> wrote:
>>>>>
>>>>> On Fri, 28 Jun 2013, [hidden email] wrote:
>>>>>
>>>>>> A new version of Files was added to project The Inbox:
>>>>>> http://source.squeak.org/inbox/Files-cmm.125.mcz
>>>>>>
>>>>>> ==================== Summary ====================
>>>>>>
>>>>>> Name: Files-cmm.125
>>>>>> Author: cmm
>>>>>> Time: 27 June 2013, 7:24:33.37 pm
>>>>>> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
>>>>>> Ancestors: Files-fbs.124
>>>>>>
>>>>>> FileDirectory>>statsForDirectoryTree: is a prime example of
>>>>>> FileDirectory
>>>>>> living up to its poor reputation.  It does one useless thing, and does
>>>>>> it
>>>>>> inefficiently.  It has no senders, deservedly.
>>>>>>         Tree-walking operations like gathering stats can be written in
>>>>>> apps more generally using directoryTreeDo:.
>>>>>
>>>>>
>>>>>
>>>>> I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is
>>>>> used
>>>>> by #directoryTreeDo:entries:), which is somewhat better than DFS,
>>>>> because of
>>>>> the way FilePlugin works: you have to iterate over the contents of the
>>>>> whole
>>>>> directory if you want to it to be efficient.
>>>>> If you use DFS, then the contents of all parent directories up to the
>>>>> root
>>>>> will have to be stored in memory during the iteration, while if you use
>>>>> BFS,
>>>>> then only the sibling directories (without their contents) or the
>>>>> siblings's
>>>>> parents will be in memory.
>>>>>
>>>>> I made two "new" implementations of #statsForDirectoryTree:. This one
>>>>> optimizes memory usage for BFS:
>>>>>
>>>>> statsForDirectoryTree2: rootedPathName
>>>>>
>>>>>         | dirs files bytes todo |
>>>>>
>>>>>         dirs := files := bytes := 0.
>>>>>         todo := OrderedCollection with: rootedPathName.
>>>>>         [ todo isEmpty ] whileFalse: [
>>>>>                 | p |
>>>>>                 p := todo removeFirst.
>>>>>                 self directoryContentsFor: p do: [ :entry |
>>>>>                         entry isDirectory
>>>>>                                 ifTrue: [
>>>>>
>>>>>                                         todo addLast: p , self
>>>>> pathNameDelimiter asString , entry name.
>>>>>                                         dirs := dirs + 1]
>>>>>                                 ifFalse: [
>>>>>
>>>>>                                         files := files + 1.
>>>>>                                         bytes := bytes + entry fileSize
>>>>> ] ]
>>>>> ].
>>>>>         ^{ dirs. files. bytes }
>>>>>
>>>>> This one uses #directoryTreeDo::
>>>>>
>>>>> statsForDirectoryTree3: rootedPathName
>>>>>
>>>>>         | dirs files bytes |
>>>>>
>>>>>         dirs := files := bytes := 0.
>>>>>         (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
>>>>>                 | entry |
>>>>>                 entry := path last.
>>>>>                 entry isDirectory
>>>>>                         ifTrue: [ dirs := dirs + 1 ]
>>>>>                         ifFalse: [
>>>>>
>>>>>                                 files := files + 1.
>>>>>                                 bytes := bytes + entry fileSize ] ].
>>>>>         ^{ dirs. files. bytes }
>>>>>
>>>>> Let's see the numbers:
>>>>>
>>>>> "Make sure the disk won't be accessed"
>>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>>> "Run the benchmark"
>>>>> #(statsForDirectoryTree: statsForDirectoryTree2:
>>>>> statsForDirectoryTree3:)
>>>>> collect: [ :each |
>>>>>         each -> ((1 to: 5) collect: [ :run |
>>>>>                 Smalltalk garbageCollect.
>>>>>                 [ FileDirectory default perform: each with: '/usr/' ]
>>>>> timeToRun ]) ].
>>>>>
>>>>> {
>>>>>         #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
>>>>>         #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
>>>>>         #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }
>>>>>
>>>>> It's a bit noisy (i was too lazy to exit other programs), but the one
>>>>> using
>>>>> #directoryTreeDo: seems to be clearly slower than the existing
>>>>> algorithm
>>>>> using BFS.
>>>>>
>>>>>
>>>>> Levente
>>>>>
>>>>>
>>>>>>
>>>>>> =============== Diff against Files-fbs.124 ===============
>>>>>>
>>>>>> Item was changed:
>>>>>> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>>> 'private') -----
>>>>>> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>>> 'enumeration') -----
>>>>>>  directoryTreeDo: oneArgBlock entries: entriesCollection
>>>>>>         "Value oneArgBlock with the path (an OrderedCollection of
>>>>>> FileDirectory's) to each DirectoryEntry and the DirectoryEntry
>>>>>> itself."
>>>>>>         self entries do:
>>>>>>                 [ : each |
>>>>>>                 entriesCollection add: each.
>>>>>>                 oneArgBlock value: entriesCollection.
>>>>>>                 each isDirectory ifTrue:
>>>>>>                         [ | subdir |
>>>>>>                         subdir := each asFileDirectory.
>>>>>>                         subdir
>>>>>>                                 directoryTreeDo: oneArgBlock
>>>>>>                                 entries: entriesCollection ].
>>>>>>                 entriesCollection removeLast ]!
>>>>>>
>>>>>> Item was removed:
>>>>>> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category
>>>>>> 'enumeration') -----
>>>>>> - statsForDirectoryTree: rootedPathName
>>>>>> -       "Return the size statistics for the entire directory tree
>>>>>> starting
>>>>>> at the given root. The result is a three element array of the form:
>>>>>> (<number
>>>>>> of folders><number of files><total bytes in all files>). This method
>>>>>> also
>>>>>> serves as an example of how recursively enumerate a directory tree."
>>>>>> -       "FileDirectory default statsForDirectoryTree: '\smalltalk'"
>>>>>> -
>>>>>> -       | dirs files bytes todo entries p |
>>>>>> -       dirs := files := bytes := 0.
>>>>>> -       todo := OrderedCollection with: rootedPathName.
>>>>>> -       [todo isEmpty] whileFalse: [
>>>>>> -               p := todo removeFirst.
>>>>>> -               entries := self directoryContentsFor: p.
>>>>>> -               entries do: [:entry |
>>>>>> -                       entry isDirectory
>>>>>> -                               ifTrue: [
>>>>>> -                                       todo addLast: p , self
>>>>>> pathNameDelimiter asString , entry name.
>>>>>> -                                       dirs := dirs + 1]
>>>>>> -                               ifFalse: [
>>>>>> -                                       files := files + 1.
>>>>>> -                                       bytes := bytes + entry
>>>>>> fileSize]]].
>>>>>> -       ^ Array with: dirs with: files with: bytes
>>>>>> - !
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

timrowledge

On 03-07-2013, at 11:13 AM, Chris Muller <[hidden email]> wrote:

> Good discussion thanks Levente.  You know, after more thinking I had
> trouble understanding what is the characteristic of the FilePlugin
> primitives that leads to better performance in a BFS?

Just take a look at the computational gymnastics undertaken in the lowest level of FilePlugin code; both in the image and the actual platform code. For all platforms it is a nightmare of untangling, mangling and re-tangling strange values that bear little relationship to much useful. As an example, look at what happens when  dir_Lookup is used in unix. Eeek!

I suggest that the platform handling is done at far too low level. Different platforms do things rather differently and rather than the current pattern of squeezing all of them into a straitjacket - where dir_Lookup is an example - we ought to have a higher level fanout. Keeping to unix as an example, consider what happens in a modest sized directory when running FileDirectory>entries (and we should temporarily ignore the ridiculous uses that gets put to); #directoryContentsFor:do: repeatedly calls the primitive that then uses dir_lookup. By the time you get to looking up entry 42 in the directory you would have already done around 860 readdir calls if there wasn't a sneaky optimisation that keeps the last opened dir stream around to see if it is the next needed  one. When doing a BFS, that is generally true, thus some benefit accrues.

If we had image level code that understood how readdir works and took advantage of it through a suitable prim we might gain some interesting advantages.Likewise for the specific apis on other platforms; RISC OS for example can look directly at entry(i) but also has to do terrible things to handle the idiotic edge case of passing an empty string argument to mean 'list all the roots'. Take a look at dir_LookupRoot in platforms/RiscOS/plugins/FilePlugin/sqRPCDirectory.c and reel in shock!

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
"How many Ringworld Engineers does it take to change a lightbulb?" "Thirty. Hey, moving suns around isn't easy..."



Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Chris Muller-3
Whew, that's a lot of adjectives!

I vaguely understand by looking at #directoryContentsFor:do:.  I guess
I had thought #entries was the primitive itself, but I think you're
saying calling #primLookupEntryIn:index: is inefficient to be calling
again and again for just one entry..  Too bad.

On Wed, Jul 3, 2013 at 2:13 PM, tim Rowledge <[hidden email]> wrote:

>
> On 03-07-2013, at 11:13 AM, Chris Muller <[hidden email]> wrote:
>
>> Good discussion thanks Levente.  You know, after more thinking I had
>> trouble understanding what is the characteristic of the FilePlugin
>> primitives that leads to better performance in a BFS?
>
> Just take a look at the computational gymnastics undertaken in the lowest level of FilePlugin code; both in the image and the actual platform code. For all platforms it is a nightmare of untangling, mangling and re-tangling strange values that bear little relationship to much useful. As an example, look at what happens when  dir_Lookup is used in unix. Eeek!
>
> I suggest that the platform handling is done at far too low level. Different platforms do things rather differently and rather than the current pattern of squeezing all of them into a straitjacket - where dir_Lookup is an example - we ought to have a higher level fanout. Keeping to unix as an example, consider what happens in a modest sized directory when running FileDirectory>entries (and we should temporarily ignore the ridiculous uses that gets put to); #directoryContentsFor:do: repeatedly calls the primitive that then uses dir_lookup. By the time you get to looking up entry 42 in the directory you would have already done around 860 readdir calls if there wasn't a sneaky optimisation that keeps the last opened dir stream around to see if it is the next needed  one. When doing a BFS, that is generally true, thus some benefit accrues.
>
> If we had image level code that understood how readdir works and took advantage of it through a suitable prim we might gain some interesting advantages.Likewise for the specific apis on other platforms; RISC OS for example can look directly at entry(i) but also has to do terrible things to handle the idiotic edge case of passing an empty string argument to mean 'list all the roots'. Take a look at dir_LookupRoot in platforms/RiscOS/plugins/FilePlugin/sqRPCDirectory.c and reel in shock!
>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> "How many Ringworld Engineers does it take to change a lightbulb?" "Thirty. Hey, moving suns around isn't easy..."
>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

timrowledge

On 03-07-2013, at 12:38 PM, Chris Muller <[hidden email]> wrote:

> Whew, that's a lot of adjectives!
Indeed, a scrumptiously over-indulgent miscellany of amply audacious adjectives.

>
> I vaguely understand by looking at #directoryContentsFor:do:.  I guess
> I had thought #entries was the primitive itself, but I think you're
> saying calling #primLookupEntryIn:index: is inefficient to be calling
> again and again for just one entry..  Too bad.

Take a look at the appalling usages of #entries (which is the only real user of directoryContentsFor:). We even have code that enumerates the entire directory in order to find out if it is empty and can be deleted! Insanity! We find out if a file exists by enumerating everything using the primitive and then enumerating the resultant collection. Aaaargh! I think my brain just segfaulted.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: LOW: Launch on Warning



Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Levente Uzonyi-2
In reply to this post by Chris Muller-3
On Wed, 3 Jul 2013, Chris Muller wrote:

> Good discussion thanks Levente.  You know, after more thinking I had
> trouble understanding what is the characteristic of the FilePlugin
> primitives that leads to better performance in a BFS?

With the current DFS implementation (you use FileDirectory >> #entries
from FileDirectory #directoryTreeDo:entries: which creates an Array with
a DirectoryEntry object for all files in that directory) you keep all
DirectoryEntry instances in the memory up to the root during the
iteration.
With BFS you don't have to do that, because once you start processing a
directory, you won't start another before you finish the current one.
You could avoid storing the entries in memory with DFS too - by using
FileDirectory >> #entriesDo: from Files-ul.125, but that will potentially
expose the bottleneck present in the FilePlugin. Here's a
small demonstration:

path := '/usr/bin'.
"Single process using the FilePlugin"
[ (FileDirectory on: path) directoryTreeDo: [ :each | Processor yield ] ] timeToRun.
"1208"

"2 processes using the FilePlugin"
[
  (FileDirectory on: path) directoryTreeDo: [ :each |
  Processor yield ] ] fork.
[ (FileDirectory on: path) directoryTreeDo: [ :each |
  Processor yield ] ] timeToRun.
"56716"

What's happening?
Opening a directory and getting the nth element of it is not cheap, but
that's what #primLookupEntryIn:index: offers. To speed up things,
FilePlugin stores the last opened directory and the index of the last
accessed file in it. If you ask for the next file from the same directory,
then it'll reuse the stored directory, and everything will seem fast. But
if you open another directory, or ask for any file other than the next one
from the same directory, then it'll be much slower.

For example asking for the files from a directory, but in reversed order
will take O(n*n) time, where n is the number of files in the directory.

>
> The only use-case I can think of would be where the doBlock of
> directoryTreeDo: could be short-circuited by a non-local return, such

I don't see how this is related to BFS vs DFS, or my suggestion to change
the argument of directoryTreeDo:.

> as a detect operation.  But in a stats operation, it seems like the
> same requests to FilePlugin are made in each case.  That is, every
> subdirectory gets read just once via the #entries primitive in either
> case BFS or DFS..

Indeed, the difference is in the memory usage.

>
> But if the whole sub-tree is going to be accessed it seems like a
> wash, and indeed I'm getting pretty close times between my and your
> version of directoryTreeDo::
>
> |dir| dir:= FileDirectory default containingDirectory containingDirectory.
> [dir directoryTreeDo: [ : e |  ]] timeToRun
>
>  (BFS) "444"
>  (DFS) "428"

Did you running the benchmark on /usr/ from
http://lists.squeakfoundation.org/pipermail/squeak-dev/2013-June/171827.html 
?


Levente

>
> I still believe a directoryTreeDetect: operation has potential for
> less exercise of FilePlugin under BFS, but did I miss something for a
> full tree-traversal?
>
> Thanks.
>
>
>
> On Wed, Jul 3, 2013 at 12:48 AM, Levente Uzonyi <[hidden email]> wrote:
>> On Sun, 30 Jun 2013, Chris Muller wrote:
>>
>>> Ah, thank you!  Now I understand Levente's comments.
>>>
>>> Levente, that is an excellent point about directoryTreeDo: traversing
>>> depth-first.  In fact that makes me want to change it to be
>>> breadth-first..  I'm trying to think about all the use-cases it serves
>>> and whether that would make a difference; since we have the path,
>>> maybe not..
>>
>>
>> The only techinal hurdle in changing the algorithm is the argument of
>> argument the block. Currently it is an OrderedCollection of DirectoryEntry
>> objects from the root to the current DirectoryEntry object. With DFS this is
>> naturally available, but with BFS it's not.
>>
>> So the question is, what should we do?
>>
>> 1) keep the existing OrderedCollection argument
>>
>> I uploaded Files-ul.125 to the Inbox, which changes the implementation of
>> #directoryTreeDo: this way. As you can see, it's not an easy to understand
>> algorithm. :)
>>
>> 2) use some kind of linked list-like data structure instead.
>>
>> Here's an example implementation for this:
>>
>>         | queue argument |
>>         argument := { self directoryEntry. nil }.
>>         oneArgBlock value: argument.
>>         queue := OrderedCollection with: argument copy.
>>         [ queue isEmpty ] whileFalse: [
>>                 | parent |
>>                 parent := queue removeFirst.
>>                 argument at: 2 put: parent.
>>                 parent first asFileDirectory entriesDo: [ :entry |
>>                         argument at: 1 put: entry.
>>                         oneArgBlock value: argument.
>>                         entry isDirectory ifTrue: [ queue addLast: argument
>> copy ] ] ]
>>
>> 3) only pass the current DirectoryEntry to the block and let the user decide
>> if he needs the whole path (though it costs a lot to recreate it)
>>
>> Implementation:
>>
>>         | queue |
>>         queue := OrderedCollection with: self directoryEntry.
>>         oneArgBlock value: queue first.
>>         [ queue isEmpty ] whileFalse: [
>>                 queue removeFirst asFileDirectory entriesDo: [ :entry |
>>                         oneArgBlock value: entry.
>>                         entry isDirectory ifTrue: [ queue addLast: entry ] ]
>> ]
>>
>> Both examples at 2) and 3) require the #entriesDo: method, which is
>> available in Files-ul.125 from the Inbox.
>>
>> If you profile any of this code on a larger directory tree, you'll see that
>> there's plenty of room for improvement in FileDirectory.
>>
>>
>> Levente
>>
>>
>>>
>>> On Sun, Jun 30, 2013 at 1:29 PM, Frank Shearar <[hidden email]>
>>> wrote:
>>>>
>>>> On 30 June 2013 18:19, Chris Muller <[hidden email]> wrote:
>>>>>
>>>>> Hi Levente, I do not know what you mean by DFS and BFS.
>>>>
>>>>
>>>> Depth-first versus breadth-first search.
>>>>
>>>> frank
>>>>
>>>>>  My claim that
>>>>> statsForDirectoryTree: "does one useless thing, and does it
>>>>> inefficiently" is commentary about the _design_, not the
>>>>> implementation, because it forces me to enumerate an entire directory
>>>>> even if I don't want to.  For example, what if I want to exclude a
>>>>> ".temp" directory's stats from its containing directory's stats?  To
>>>>> do that, I'd have to run #statsForDirectoryTree: twice.  That's not
>>>>> going to be faster..
>>>>>
>>>>> On Sat, Jun 29, 2013 at 5:15 PM, Levente Uzonyi <[hidden email]> wrote:
>>>>>>
>>>>>> On Fri, 28 Jun 2013, [hidden email] wrote:
>>>>>>
>>>>>>> A new version of Files was added to project The Inbox:
>>>>>>> http://source.squeak.org/inbox/Files-cmm.125.mcz
>>>>>>>
>>>>>>> ==================== Summary ====================
>>>>>>>
>>>>>>> Name: Files-cmm.125
>>>>>>> Author: cmm
>>>>>>> Time: 27 June 2013, 7:24:33.37 pm
>>>>>>> UUID: 0b00b2c4-3430-442b-a924-72040ff43d73
>>>>>>> Ancestors: Files-fbs.124
>>>>>>>
>>>>>>> FileDirectory>>statsForDirectoryTree: is a prime example of
>>>>>>> FileDirectory
>>>>>>> living up to its poor reputation.  It does one useless thing, and does
>>>>>>> it
>>>>>>> inefficiently.  It has no senders, deservedly.
>>>>>>>         Tree-walking operations like gathering stats can be written in
>>>>>>> apps more generally using directoryTreeDo:.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I wouldn't say it's inefficient. It uses BFS (instead of DFS, which is
>>>>>> used
>>>>>> by #directoryTreeDo:entries:), which is somewhat better than DFS,
>>>>>> because of
>>>>>> the way FilePlugin works: you have to iterate over the contents of the
>>>>>> whole
>>>>>> directory if you want to it to be efficient.
>>>>>> If you use DFS, then the contents of all parent directories up to the
>>>>>> root
>>>>>> will have to be stored in memory during the iteration, while if you use
>>>>>> BFS,
>>>>>> then only the sibling directories (without their contents) or the
>>>>>> siblings's
>>>>>> parents will be in memory.
>>>>>>
>>>>>> I made two "new" implementations of #statsForDirectoryTree:. This one
>>>>>> optimizes memory usage for BFS:
>>>>>>
>>>>>> statsForDirectoryTree2: rootedPathName
>>>>>>
>>>>>>         | dirs files bytes todo |
>>>>>>
>>>>>>         dirs := files := bytes := 0.
>>>>>>         todo := OrderedCollection with: rootedPathName.
>>>>>>         [ todo isEmpty ] whileFalse: [
>>>>>>                 | p |
>>>>>>                 p := todo removeFirst.
>>>>>>                 self directoryContentsFor: p do: [ :entry |
>>>>>>                         entry isDirectory
>>>>>>                                 ifTrue: [
>>>>>>
>>>>>>                                         todo addLast: p , self
>>>>>> pathNameDelimiter asString , entry name.
>>>>>>                                         dirs := dirs + 1]
>>>>>>                                 ifFalse: [
>>>>>>
>>>>>>                                         files := files + 1.
>>>>>>                                         bytes := bytes + entry fileSize
>>>>>> ] ]
>>>>>> ].
>>>>>>         ^{ dirs. files. bytes }
>>>>>>
>>>>>> This one uses #directoryTreeDo::
>>>>>>
>>>>>> statsForDirectoryTree3: rootedPathName
>>>>>>
>>>>>>         | dirs files bytes |
>>>>>>
>>>>>>         dirs := files := bytes := 0.
>>>>>>         (FileDirectory on: rootedPathName) directoryTreeDo: [ :path |
>>>>>>                 | entry |
>>>>>>                 entry := path last.
>>>>>>                 entry isDirectory
>>>>>>                         ifTrue: [ dirs := dirs + 1 ]
>>>>>>                         ifFalse: [
>>>>>>
>>>>>>                                 files := files + 1.
>>>>>>                                 bytes := bytes + entry fileSize ] ].
>>>>>>         ^{ dirs. files. bytes }
>>>>>>
>>>>>> Let's see the numbers:
>>>>>>
>>>>>> "Make sure the disk won't be accessed"
>>>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>>>> FileDirectory default statsForDirectoryTree: '/usr/'.
>>>>>> "Run the benchmark"
>>>>>> #(statsForDirectoryTree: statsForDirectoryTree2:
>>>>>> statsForDirectoryTree3:)
>>>>>> collect: [ :each |
>>>>>>         each -> ((1 to: 5) collect: [ :run |
>>>>>>                 Smalltalk garbageCollect.
>>>>>>                 [ FileDirectory default perform: each with: '/usr/' ]
>>>>>> timeToRun ]) ].
>>>>>>
>>>>>> {
>>>>>>         #statsForDirectoryTree:->#(6340 6300 6430 6316 6228) .
>>>>>>         #statsForDirectoryTree2:->#(5918 6072 6122 6296 6306) .
>>>>>>         #statsForDirectoryTree3:->#(8576 8374 8470 8506 8228) }
>>>>>>
>>>>>> It's a bit noisy (i was too lazy to exit other programs), but the one
>>>>>> using
>>>>>> #directoryTreeDo: seems to be clearly slower than the existing
>>>>>> algorithm
>>>>>> using BFS.
>>>>>>
>>>>>>
>>>>>> Levente
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> =============== Diff against Files-fbs.124 ===============
>>>>>>>
>>>>>>> Item was changed:
>>>>>>> + ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>>>> 'private') -----
>>>>>>> - ----- Method: FileDirectory>>directoryTreeDo:entries: (in category
>>>>>>> 'enumeration') -----
>>>>>>>  directoryTreeDo: oneArgBlock entries: entriesCollection
>>>>>>>         "Value oneArgBlock with the path (an OrderedCollection of
>>>>>>> FileDirectory's) to each DirectoryEntry and the DirectoryEntry
>>>>>>> itself."
>>>>>>>         self entries do:
>>>>>>>                 [ : each |
>>>>>>>                 entriesCollection add: each.
>>>>>>>                 oneArgBlock value: entriesCollection.
>>>>>>>                 each isDirectory ifTrue:
>>>>>>>                         [ | subdir |
>>>>>>>                         subdir := each asFileDirectory.
>>>>>>>                         subdir
>>>>>>>                                 directoryTreeDo: oneArgBlock
>>>>>>>                                 entries: entriesCollection ].
>>>>>>>                 entriesCollection removeLast ]!
>>>>>>>
>>>>>>> Item was removed:
>>>>>>> - ----- Method: FileDirectory>>statsForDirectoryTree: (in category
>>>>>>> 'enumeration') -----
>>>>>>> - statsForDirectoryTree: rootedPathName
>>>>>>> -       "Return the size statistics for the entire directory tree
>>>>>>> starting
>>>>>>> at the given root. The result is a three element array of the form:
>>>>>>> (<number
>>>>>>> of folders><number of files><total bytes in all files>). This method
>>>>>>> also
>>>>>>> serves as an example of how recursively enumerate a directory tree."
>>>>>>> -       "FileDirectory default statsForDirectoryTree: '\smalltalk'"
>>>>>>> -
>>>>>>> -       | dirs files bytes todo entries p |
>>>>>>> -       dirs := files := bytes := 0.
>>>>>>> -       todo := OrderedCollection with: rootedPathName.
>>>>>>> -       [todo isEmpty] whileFalse: [
>>>>>>> -               p := todo removeFirst.
>>>>>>> -               entries := self directoryContentsFor: p.
>>>>>>> -               entries do: [:entry |
>>>>>>> -                       entry isDirectory
>>>>>>> -                               ifTrue: [
>>>>>>> -                                       todo addLast: p , self
>>>>>>> pathNameDelimiter asString , entry name.
>>>>>>> -                                       dirs := dirs + 1]
>>>>>>> -                               ifFalse: [
>>>>>>> -                                       files := files + 1.
>>>>>>> -                                       bytes := bytes + entry
>>>>>>> fileSize]]].
>>>>>>> -       ^ Array with: dirs with: files with: bytes
>>>>>>> - !
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Levente Uzonyi-2
In reply to this post by timrowledge
On Wed, 3 Jul 2013, tim Rowledge wrote:

>
> On 03-07-2013, at 12:38 PM, Chris Muller <[hidden email]> wrote:
>
>> Whew, that's a lot of adjectives!
> Indeed, a scrumptiously over-indulgent miscellany of amply audacious adjectives.
>
>>
>> I vaguely understand by looking at #directoryContentsFor:do:.  I guess
>> I had thought #entries was the primitive itself, but I think you're
>> saying calling #primLookupEntryIn:index: is inefficient to be calling
>> again and again for just one entry..  Too bad.
>
> Take a look at the appalling usages of #entries (which is the only real user of directoryContentsFor:). We even have code that enumerates the entire directory in order to find out if it is empty and can be deleted! Insanity! We find out if a file exists by enumerating everything using the primitive and then enumerating the resultant collection. Aaaargh! I think my brain just segfaulted.

Sometimes it's better if you don't look under the hood. :)


Levente

>
>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Strange OpCodes: LOW: Launch on Warning
>
>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

David T. Lewis
In reply to this post by timrowledge
On Wed, Jul 03, 2013 at 12:13:36PM -0700, tim Rowledge wrote:

>
> On 03-07-2013, at 11:13 AM, Chris Muller <[hidden email]> wrote:
>
> > Good discussion thanks Levente.  You know, after more thinking I had
> > trouble understanding what is the characteristic of the FilePlugin
> > primitives that leads to better performance in a BFS?
>
> Just take a look at the computational gymnastics undertaken in the lowest level of FilePlugin code; both in the image and the actual platform code. For all platforms it is a nightmare of untangling, mangling and re-tangling strange values that bear little relationship to much useful. As an example, look at what happens when  dir_Lookup is used in unix. Eeek!
>
> I suggest that the platform handling is done at far too low level. Different platforms do things rather differently and rather than the current pattern of squeezing all of them into a straitjacket - where dir_Lookup is an example - we ought to have a higher level fanout. Keeping to unix as an example, consider what happens in a modest sized directory when running FileDirectory>entries (and we should temporarily ignore the ridiculous uses that gets put to); #directoryContentsFor:do: repeatedly calls the primitive that then uses dir_lookup. By the time you get to looking up entry 42 in the directory you would have already done around 860 readdir calls if there wasn't a sneaky optimisation that keeps the last opened dir stream around to see if it is the next needed  one. When doing a BFS, that is generally true, thus some benefit accrues.
>
> If we had image level code that understood how readdir works and took advantage of it through a suitable prim we might gain some interesting advantages.Likewise for the specific apis on other platforms; RISC OS for example can look directly at entry(i) but also has to do terrible things to handle the idiotic edge case of passing an empty string argument to mean 'list all the roots'. Take a look at dir_LookupRoot in platforms/RiscOS/plugins/FilePlugin/sqRPCDirectory.c and reel in shock!
>

About 10 years ago I did a plugin to provide directory access using
Posix directory streams for Unix and Windows VMs:

  http://wiki.squeak.org/squeak/2274

I wonder if we could generalize the interface in such a way that it would
still make sense for non-posix platforms?

At the time, I was claiming some fairly significant performance improvements.
I don't know what it would look like today, but given that this is an
improvement in the primitives, the relative performance for a Cog VM might
be even better.

Here is what I was seeing on my ancient PC at the time:

  Performance improvement on my system, running Linux:
 
      * FileDirectory>>directoryContentsFor: is 7.3 times faster
      * FileDirectory>>entryAt:ifAbsent: is 148 times faster
      * FileDirectory>>fileAndDirectoryNames is 5.2 times faster
      * FileDirectory>>fileExists: is 121 times faster
      * FileDirectory>>directoryExists: is 104 times faster
      * FileDirectory>>fileOrDirectoryExists: is 32 times faster
      * StandardFileStream>>isAFileNamed: is 1445 times faster
 
 
  Performance improvement on my system, running Windows:
 
      * FileDirectory>>directoryContentsFor: (N/A)
      * FileDirectory>>entryAt:ifAbsent: (N/A)
      * FileDirectory>>fileAndDirectoryNames 1.37 times faster
      * FileDirectory>>fileExists: 86 times faster
      * FileDirectory>>directoryExists: 80 times faster
      * FileDirectory>>fileOrDirectoryExists: 90 times faster
      * StandardFileStream>>isAFileNamed: 497 times faster

Dave
 

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Chris Muller-4
In reply to this post by Levente Uzonyi-2
>> Good discussion thanks Levente.  You know, after more thinking I had
>> trouble understanding what is the characteristic of the FilePlugin
>> primitives that leads to better performance in a BFS?
>
> With the current DFS implementation (you use FileDirectory >> #entries from
> FileDirectory #directoryTreeDo:entries: which creates an Array with a
> DirectoryEntry object for all files in that directory) you keep all
> DirectoryEntry instances in the memory up to the root during the iteration.
> With BFS you don't have to do that, because once you start processing a
> directory, you won't start another before you finish the current one.
> You could avoid storing the entries in memory with DFS too - by using
> FileDirectory >> #entriesDo: from Files-ul.125, but that will potentially
> expose the bottleneck present in the FilePlugin. Here's a small
> demonstration:
>
> path := '/usr/bin'.
> "Single process using the FilePlugin"
> [ (FileDirectory on: path) directoryTreeDo: [ :each | Processor yield ] ]
> timeToRun.
> "1208"
>
> "2 processes using the FilePlugin"
> [
>         (FileDirectory on: path) directoryTreeDo: [ :each |
>                 Processor yield ] ] fork.
> [ (FileDirectory on: path) directoryTreeDo: [ :each |
>                 Processor yield ] ] timeToRun.
> "56716"

Amazing.

> What's happening?
> Opening a directory and getting the nth element of it is not cheap, but
> that's what #primLookupEntryIn:index: offers. To speed up things, FilePlugin
> stores the last opened directory and the index of the last accessed file in
> it. If you ask for the next file from the same directory, then it'll reuse
> the stored directory, and everything will seem fast. But if you open another
> directory, or ask for any file other than the next one from the same
> directory, then it'll be much slower.
>
> For example asking for the files from a directory, but in reversed order
> will take O(n*n) time, where n is the number of files in the directory.

I didn't follow your BFS version of directoryTreeDo:, but I did bench
it.  Given the above, shouldn't it have performed much faster than the
DFS version?

>> The only use-case I can think of would be where the doBlock of
>> directoryTreeDo: could be short-circuited by a non-local return, such
>
> I don't see how this is related to BFS vs DFS, or my suggestion to change
> the argument of directoryTreeDo:.
>
>> as a detect operation.  But in a stats operation, it seems like the
>> same requests to FilePlugin are made in each case.  That is, every
>> subdirectory gets read just once via the #entries primitive in either
>> case BFS or DFS..
>
> Indeed, the difference is in the memory usage.

Agreed -- but I was just trying to relate BFS vs. DFS to usage of
FilePlugin.  It sounds like by DFS asking for other than the very next
file, it's tripping up the inefficiency of FilePlugin, is that right?

>> But if the whole sub-tree is going to be accessed it seems like a
>> wash, and indeed I'm getting pretty close times between my and your
>> version of directoryTreeDo::
>>
>> |dir| dir:= FileDirectory default containingDirectory containingDirectory.
>> [dir directoryTreeDo: [ : e |  ]] timeToRun
>>
>>  (BFS) "444"
>>  (DFS) "428"
>
> Did you running the benchmark on /usr/ from
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2013-June/171827.html
> ?

Yes, and I got similar results as you, but that is a benchmark for the
stats operation only, not directoryTreeDo:.  I think that stats
operation which answers a 3-element Array of Array's is too-specific
to be useful.  I'm interested in directoryTreeDo: to be fast and
efficient so I can use it for stats if I want, or other things.  But I
did not get a significant difference in execution performance between
your BFS and my DFS versions.  Although we know the DFS will use more
memory, I'm puzzled why I did not get a big boost by your BFS version
of it given the inefficiency of FilePlugin..

Reply | Threaded
Open this post in threaded view
|

Re: The Inbox: Files-cmm.125.mcz

Levente Uzonyi-2
On Thu, 4 Jul 2013, Chris Muller wrote:

>>> Good discussion thanks Levente.  You know, after more thinking I had
>>> trouble understanding what is the characteristic of the FilePlugin
>>> primitives that leads to better performance in a BFS?
>>
>> With the current DFS implementation (you use FileDirectory >> #entries from
>> FileDirectory #directoryTreeDo:entries: which creates an Array with a
>> DirectoryEntry object for all files in that directory) you keep all
>> DirectoryEntry instances in the memory up to the root during the iteration.
>> With BFS you don't have to do that, because once you start processing a
>> directory, you won't start another before you finish the current one.
>> You could avoid storing the entries in memory with DFS too - by using
>> FileDirectory >> #entriesDo: from Files-ul.125, but that will potentially
>> expose the bottleneck present in the FilePlugin. Here's a small
>> demonstration:
>>
>> path := '/usr/bin'.
>> "Single process using the FilePlugin"
>> [ (FileDirectory on: path) directoryTreeDo: [ :each | Processor yield ] ]
>> timeToRun.
>> "1208"
>>
>> "2 processes using the FilePlugin"
>> [
>>         (FileDirectory on: path) directoryTreeDo: [ :each |
>>                 Processor yield ] ] fork.
>> [ (FileDirectory on: path) directoryTreeDo: [ :each |
>>                 Processor yield ] ] timeToRun.
>> "56716"
>
> Amazing.
>
>> What's happening?
>> Opening a directory and getting the nth element of it is not cheap, but
>> that's what #primLookupEntryIn:index: offers. To speed up things, FilePlugin
>> stores the last opened directory and the index of the last accessed file in
>> it. If you ask for the next file from the same directory, then it'll reuse
>> the stored directory, and everything will seem fast. But if you open another
>> directory, or ask for any file other than the next one from the same
>> directory, then it'll be much slower.
>>
>> For example asking for the files from a directory, but in reversed order
>> will take O(n*n) time, where n is the number of files in the directory.
>
> I didn't follow your BFS version of directoryTreeDo:, but I did bench
> it.  Given the above, shouldn't it have performed much faster than the
> DFS version?

No, because it uses #entries, which fetches all the files from the
directory.

>
>>> The only use-case I can think of would be where the doBlock of
>>> directoryTreeDo: could be short-circuited by a non-local return, such
>>
>> I don't see how this is related to BFS vs DFS, or my suggestion to change
>> the argument of directoryTreeDo:.
>>
>>> as a detect operation.  But in a stats operation, it seems like the
>>> same requests to FilePlugin are made in each case.  That is, every
>>> subdirectory gets read just once via the #entries primitive in either
>>> case BFS or DFS..
>>
>> Indeed, the difference is in the memory usage.
>
> Agreed -- but I was just trying to relate BFS vs. DFS to usage of
> FilePlugin.  It sounds like by DFS asking for other than the very next
> file, it's tripping up the inefficiency of FilePlugin, is that right?

Exactly.

>
>>> But if the whole sub-tree is going to be accessed it seems like a
>>> wash, and indeed I'm getting pretty close times between my and your
>>> version of directoryTreeDo::
>>>
>>> |dir| dir:= FileDirectory default containingDirectory containingDirectory.
>>> [dir directoryTreeDo: [ : e |  ]] timeToRun
>>>
>>>  (BFS) "444"
>>>  (DFS) "428"
>>
>> Did you running the benchmark on /usr/ from
>> http://lists.squeakfoundation.org/pipermail/squeak-dev/2013-June/171827.html
>> ?
>
> Yes, and I got similar results as you, but that is a benchmark for the
> stats operation only, not directoryTreeDo:.  I think that stats
> operation which answers a 3-element Array of Array's is too-specific
> to be useful.  I'm interested in directoryTreeDo: to be fast and
> efficient so I can use it for stats if I want, or other things.  But I
> did not get a significant difference in execution performance between
> your BFS and my DFS versions.  Although we know the DFS will use more
> memory, I'm puzzled why I did not get a big boost by your BFS version
> of it given the inefficiency of FilePlugin..
>

Because of the use of #entries. It fetches all the files in the current
directory, so it avoids the bottleneck of FilePlugin.


Levente