etarsoinl:

Tobias Pape
Hi all

I was curious about the relative distribution of characters in Squeak Code.
I sampled the source code[1] and drew a histogram (attached).
Here are my results:

- The most frequent (printable) characters are, in order:

etarsoinl:

  and, in more detail, the 90 most frequent characters:

etarsoinl:cdfumhpg.ybwSv"=1CT'x][0F)(k2ANPI|M^B4O7D6R3598#EL-,zWVjU;H+q/>*<G@KX${}YQZJ\~?!

- This is quite close to actual English:

etaonishrlducmwyfgpbvkjxqz

- Observations:
  - The most frequent punctuation character is : and . follows a long way behind.
  - Cascading is comparatively rare. We have more blocks and equality/identity comparisons than ;
  - Blocks are more common than parentheses and literal arrays
  - You cannot spell ifTrue or ifFalse with the 20 most common characters
  - ifTrue: is far more common than ifFalse:
  - The most frequent uppercase character is S. I have no conjecture here, though.
- Comparison:
  - Here's C, sampling the Linux kernel:
    
et_risancodlupfm,);(*0hvgb-E=x>ITRSACkNL.P1O/wD2My"{}UF&3GB4q86HV5:<X#[]+zK7W9Y|%\!jQZ'

    - under_score_case vs. camelCase is rather obvious.
    - (not displayed, but tab and newline are among the 6 most frequent characters!)
    - Punctuation starts much earlier.
    - The beginning differs a lot, the ending not so much.
    - 0 is far more important than 1
    - : is unimportant

  - Here's Ruby, sampling Rails:

etsaonridl_cupmh.f:,"gb')(=y#vw/kq>ATx0<1R[]@S{}CE|2?-zjDMIPN+BO\F3L5!HU%&4*98GW6;YV7J`X

    - The underscore shows up, but not as much as in C.
    - The : is (like in Smalltalk) more important
    - Uppercase is less common than in both C and Smalltalk.
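
Some of the Squeak observations above can be re-checked directly against the posted 90-character ranking. The following is only a workspace sketch: the string is copied from the list above (with the single quote doubled, as Smalltalk string literals require), and the names top90, top20, rank and lowercase are made up for illustration. It only looks at ranks, not at the underlying counts.

```smalltalk
"Workspace sketch: re-checks some observations against the posted ranking."
| top90 top20 rank lowercase |
top90 := 'etarsoinl:cdfumhpg.ybwSv"=1CT''x][0F)(k2ANPI|M^B4O7D6R3598#EL-,zWVjU;H+q/>*<G@KX${}YQZJ\~?!'.
top20 := top90 first: 20.

"Rank of a character in the posted list (1 = most frequent, 0 = not listed)."
rank := [:c | top90 indexOf: c].

"Characters of each selector that are missing from the top 20."
Transcript show: ('ifTrue' reject: [:c | top20 includes: c]); cr.    "T"
Transcript show: ('ifFalse' reject: [:c | top20 includes: c]); cr.   "F"

"First uppercase character in the ranking."
Transcript show: (top90 detect: [:c | c isUppercase]) asString; cr.  "S"

"Block brackets and = rank ahead of the cascade semicolon."
Transcript show: ((rank value: $]) < (rank value: $;)) printString; cr.  "true"
Transcript show: ((rank value: $=) < (rank value: $;)) printString; cr.  "true"

"Lowercase letters only, for comparison with the English ordering."
lowercase := top90 select: [:c | c isLowercase].
Transcript show: lowercase; cr.  "etarsoinlcdfumhpgybwvxkzjq"
```

So it is only the capital T and F that keep ifTrue and ifFalse out of the top 20; all their lowercase letters are in there.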


Have fun!

Best regards
-Tobias




[1]: 
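" Uses the new HistogramMorph "
| characterFrequency |
" Reuse cached read-only source file handles while reading every method's source. "
CurrentReadOnlySourceFiles cacheDuring: [
        " Keep only methods whose collection literals sum to fewer than 1500 elements,
          then count every non-separator character of their source. "
        characterFrequency := ((CompiledMethod allInstances select:
                [:method | (method allLiterals detectSum:
                        [:lit | lit isCollection ifFalse: [0] ifTrue: [lit size]]) < 1500])
                gather: [:method | method getSource
                        reject: [:c | c isSeparator]]) asBag].

" Draw the histogram; characters up to code point 32 are labelled with their printString. "
(HistogramMorph on: characterFrequency)
        labelBlock: [:c | c codePoint > 32 ifTrue: [c asString] ifFalse: [c printString]];
        openInWorld.

" The 90 most frequent characters as one string. "
((characterFrequency sortedCounts collect: [:ea | ea value]) first: 90) join.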
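
To see what the Bag and sortedCounts steps in the snippet above do without walking the whole image, the same pipeline can be tried on a single string. This is a minimal sketch on a made-up sample (the string and the names sample and frequency are assumptions for illustration, not part of the measurement); evaluate the last two expressions with "print it".

```smalltalk
"Workspace sketch on a made-up sample string (not part of the original measurement)."
| sample frequency |
sample := 'x > 0 ifTrue: [Transcript show: ''positive''] ifFalse: [Transcript show: ''other'']'.

"Drop whitespace and count what remains, as in the snippet above."
frequency := (sample reject: [:c | c isSeparator]) asBag.

"sortedCounts answers count -> element associations, highest count first."
frequency sortedCounts first: 3.

"Individual counts can be read straight from the bag."
frequency occurrencesOf: $:.   "4"
```

Because each association has the count as its key, the collect: [:ea | ea value] step in the original is what turns the sorted associations back into plain characters.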




Re: etarsoinl:

Karl Ramberg
Cool

Best,
Karl

On Wed, Jun 22, 2016 at 10:40 AM, Tobias Pape <[hidden email]> wrote:


Re: etarsoinl:

marcel.taeumel
In reply to this post by Tobias Pape
etarsoinl.png (9K) <http://forum.world.st/attachment/4902372/0/etarsoinl.png>
:)

"Do you think the author might be interested in rewriting his work to cut it down? If you cut out all the 'O's, you might lose six pages there."
http://www.dailymotion.com/video/x4n10h_mr-mann-bookshop_fun

Best,
Marcel