Mine-able ideas?

Mine-able ideas?

Frank Shearar-3
http://blog.datomic.com/2012/10/codeq.html

Executive summary:
* Git gives version control over files
* Clojure code typically has lots of functions or other chunks of code
in one file
* This means you can't ask for the version of a single unit of code
* Static analyses over the files as they vary through time, dumped
into a database, yield interesting stuff

What they're calling "codeqs" ("code quantum") filetree folks would
call a file, because filetree already splits everything (I think?)
into bits, and versions everything at the "codeq" level by virtue of
storing each bit in its own file: class definition, comment, method
definition, etc.

So we have most of this stuff already - I couldn't live
without my in-image method versions - but I'm wondering if anyone else
can spot anything worth copying?

frank


Re: Mine-able ideas?

Colin Putney-3



On Wed, Jan 2, 2013 at 4:18 PM, Frank Shearar <[hidden email]> wrote:
> http://blog.datomic.com/2012/10/codeq.html
>
> Executive summary:
> * Git gives version control over files
> * Clojure code typically has lots of functions or other chunks of code
> in one file
> * This means you can't ask for the version of a single unit of code
> * Static analyses over the files as they vary through time, dumped
> into a database, yields interesting stuff
>
> What they're calling "codeqs" ("code quantum") filetree folks would
> call a file, because filetree already splits everything (I think?)
> into bits, and versions everything at the "codeq" level by virtue of
> storing each bit in its own file: class definition, comment, method
> definition, etc.
>
> So we already have most of this stuff already - I couldn't live
> without my in-image method versions - but I'm wondering if anyone else
> can spot anything worth copying?

Nah. They're basically figuring out how to extract the semantic changes from git, since git just treats the source code as opaque text. That gets them to what Monticello has now. I guess there's a bit of "imagine what you could do then!" that's unspecified.

Which is not to say that it's a bad idea. I'd love to create a huge database of, say, the update stream going back to the beginning, or the entire contents of squeaksource. But... then what?

Things that spring to mind immediately:

- universal senders and implementors
- metrics like message sends per method or methods per class
- detection of package dependencies
- analysis of how long-lived packages change over time
- analysis of contribution and collaboration between coders

and so on.
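
To make the first two items concrete, here is a minimal sketch of what such a database could look like. It assumes the method definitions have already been extracted into (class, selector, sent selectors) rows; the table layout, the function names and the sample rows are all invented for illustration and do not describe any existing Squeak tool.

```python
import sqlite3

# Hypothetical sample data: (class, selector, selectors sent by that method).
METHODS = [
    ("MCVersionInfo", "asDictionary", ["at:put:", "asString", "ancestors", "collect:"]),
    ("MCVersionInfo", "ancestors", []),
    ("MCVersionInfo", "name", []),
    ("MCWorkingAncestry", "ancestors", ["ifNil:"]),
]


def build_db(methods):
    """Load the extracted method rows into an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE method (class TEXT, selector TEXT)")
    db.execute("CREATE TABLE send (class TEXT, selector TEXT, sent TEXT)")
    for cls, selector, sends in methods:
        db.execute("INSERT INTO method VALUES (?, ?)", (cls, selector))
        db.executemany(
            "INSERT INTO send VALUES (?, ?, ?)",
            [(cls, selector, sent) for sent in sends],
        )
    return db


def implementors(db, selector):
    """Classes that define a method with the given selector."""
    rows = db.execute(
        "SELECT class FROM method WHERE selector = ? ORDER BY class", (selector,)
    )
    return [row[0] for row in rows]


def senders(db, selector):
    """Methods (as 'Class>>selector') that send the given selector."""
    rows = db.execute(
        "SELECT DISTINCT class, selector FROM send WHERE sent = ? "
        "ORDER BY class, selector",
        (selector,),
    )
    return [f"{cls}>>{sel}" for cls, sel in rows]


def methods_per_class(db):
    """Simple metric: number of methods defined in each class."""
    return dict(db.execute("SELECT class, COUNT(*) FROM method GROUP BY class"))
```

Calling `build_db(METHODS)` and then `implementors(db, "ancestors")` or `senders(db, "ancestors")` answers the "universal senders and implementors" question across everything loaded, not just the code in one image.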

But, what good is it? Might be interesting, maybe there are some research papers to be written, but would it do us any good as a community? Would there be useful tools that came out of it? Would it be worth the effort? Hard to say.

Colin



Re: Mine-able ideas?

Frank Shearar-3
On 2 January 2013 22:17, Colin Putney <[hidden email]> wrote:

>
>
>
> On Wed, Jan 2, 2013 at 4:18 PM, Frank Shearar <[hidden email]>
> wrote:
>>
>> http://blog.datomic.com/2012/10/codeq.html
>>
>> Executive summary:
>> * Git gives version control over files
>> * Clojure code typically has lots of functions or other chunks of code
>> in one file
>> * This means you can't ask for the version of a single unit of code
>> * Static analyses over the files as they vary through time, dumped
>> into a database, yields interesting stuff
>>
>> What they're calling "codeqs" ("code quantum") filetree folks would
>> call a file, because filetree already splits everything (I think?)
>> into bits, and versions everything at the "codeq" level by virtue of
>> storing each bit in its own file: class definition, comment, method
>> definition, etc.
>>
>> So we already have most of this stuff already - I couldn't live
>> without my in-image method versions - but I'm wondering if anyone else
>> can spot anything worth copying?
>
>
> Nah. They're basically figuring out how to extract the semantic changes from
> git, since git just treats the source code as opaque text. That gets them to
> what Monticello has now. I guess there's a bit of "imagine what you could do
> then!" that's unspecified.

That was pretty much what I was thinking. And filetree preserves this
fine-grained "code quantum"-sized version control.

The only advantage I still see of lots-of-stuff-inna-file is that you
can very quickly hop around a bunch of code. Our tools just don't work
that way. They _could_. No one's ever hurt enough to display code
in this fashion. It's easy enough: what's not so easy is to make that
big blob of text efficiently editable such that you still keep track
of the, for example, individual methods. (I'll leave aside the lack of
syntax around method definition. That's not a big problem.) For
instance: parse the entire file, find the method definitions, update
the image by compiling them. (Handwave around the imperative hacks one
could do.)
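
A rough sketch of that parse-and-recompile loop, written in Python only so it can run outside an image. It assumes the blob uses the usual fileOut chunk layout (a `!Class methodsFor: '...'!` header, method bodies ending in `!`, a blank chunk closing the group, `!!` as an escaped bang). All function names are made up for the example, and "compiling" is reduced to reporting which methods changed between two versions of the blob.

```python
import re

# Matches a header chunk such as: Point class methodsFor: 'instance creation'
HEADER = re.compile(r"^(\w+(?: class)?) methodsFor: '")


def read_chunks(text):
    """Split text on single '!' terminators; '!!' stands for a literal '!'."""
    chunks, current, i = [], [], 0
    while i < len(text):
        if text[i] == "!":
            if text[i + 1 : i + 2] == "!":
                current.append("!")
                i += 2
                continue
            chunks.append("".join(current))
            current = []
        else:
            current.append(text[i])
        i += 1
    return chunks


def selector_of(source):
    """Derive the selector from the first line of a method's source."""
    first_line = source.strip().splitlines()[0]
    keywords = re.findall(r"\w+:", first_line)
    return "".join(keywords) if keywords else first_line.split()[0]


def parse_methods(text):
    """Map (class, selector) to method source for every method in the blob."""
    methods, current_class = {}, None
    for chunk in read_chunks(text):
        body = chunk.strip()
        if current_class is None:
            match = HEADER.match(body)
            if match:
                current_class = match.group(1)
        elif body == "":
            current_class = None  # blank chunk closes the methodsFor: group
        else:
            methods[(current_class, selector_of(body))] = body
    return methods


def changed_methods(old_text, new_text):
    """Methods that would need recompiling after the blob was edited."""
    old, new = parse_methods(old_text), parse_methods(new_text)
    return sorted(key for key in new if old.get(key) != new[key])


BLOB = """!Point methodsFor: 'arithmetic'!
+ other
    ^ (x + other x) @ (y + other y)!
scaleBy: factor
    ^ (x * factor) @ (y * factor)! !
!Point class methodsFor: 'instance creation'!
x: ax y: ay
    ^ self new setX: ax setY: ay! !
"""
```

The sketch ignores removed methods, class definitions and comments in the first line of a method, which is part of the "not so easy" bit: the hard work is keeping edits in the big text blob mapped back to individual methods.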

> Which is not to say that it's a bad idea. I'd love to create a huge database
> of, say, the update stream going back to the beginning, or the entire
> contents of squeaksource. But... then what?
>
> Things that spring to mind immediately:
>
> - universal senders and implementors
> - metrics like message sends per method or methods per class
> - detection of package dependencies

This would be a massive win. I took a bash a while ago at extending
DependencyBrowser to work over one's package-cache to do this. I
didn't get terribly far, probably largely due to my being pretty ignorant
about just about everything I needed to know. I have the
half-completed work lying around. Maybe I should publish it somewhere!

frank

> - analysis of how long-lived packages change over time
> - analysis of contribution and collaboration between coders
>
> and so on.
>
> But, what good is it? Might be interesting, maybe there's some research
> papers to be written, but would it do us any good as a community? Would
> there be useful tools that came out of it? Would it be worth the effort?
> Hard to say.
>
> Colin
>
>
>


Re: Mine-able ideas?

timrowledge
In reply to this post by Colin Putney-3

On 02-01-2013, at 2:17 PM, Colin Putney <[hidden email]> wrote:

>
> Which is not to say that it's a bad idea. I'd love to create a huge database of, say, the update stream going back to the beginning, or the entire contents of squeaksource. But... then what?

Well, if only the sensible compiled method format had been adopted so that source references could be proper objects rather than hacked-up numbers, then you could have source kept in a proper database. Like, say, dabble. Or a dabble-ish thing that kept recent-ish stuff local and could refer back to a server for ancient history. Find all versions of a method back to the beginning of time. Find out about classes being renamed or deleted.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Do files get embarrassed when they get unzipped?




Re: Mine-able ideas?

Bob Arning-2
In reply to this post by Frank Shearar-3
IIRC, that was one of the features of the Whisker Browser back in the day.

http://wiki.squeak.org/squeak/1993


The goal of the Whisker Browser (a.k.a. Stacking Browser) is to provide a simple and intuitive way to view the contents of multiple classes and multiple methods simultaneously, while using screen real estate efficiently and not requiring a lot of window moving/resizing. It does this by introducing the concept of subpane stacking. ...


Cheers,
Bob

On 1/2/13 5:32 PM, Frank Shearar wrote:
> The only advantage I still see of lots-of-stuff-inna-file is that you
> can very quickly hop around a bunch of code. Our tools just don't work
> that way. They _could_. Noone's just ever hurt enough to display code
> in this fashion. It's easy enough: what's not so easy is to make that
> big blob of text efficiently editable such that you still keep track
> of the, for example, individual methods.




Re: Mine-able ideas?

David T. Lewis
In reply to this post by timrowledge
On Wed, Jan 02, 2013 at 02:42:25PM -0800, tim Rowledge wrote:
>
> On 02-01-2013, at 2:17 PM, Colin Putney <[hidden email]> wrote:
>
> >
> > Which is not to say that it's a bad idea. I'd love to create a huge database of, say, the update stream going back to the beginning, or the entire contents of squeaksource. But... then what?
>
> Well, if only the sensible compiled method format had been adopted so that source references could be proper objects rather than hacked-up numbers, then you could have source kept in a proper database. Like, say, dabble. Or a dabble-ish thing that kept recent-ish stuff local and could refer back to a server for ancient history. Find all versions of a method back to the beginning of time. Find out about classes being renamed or deleted.
>

Do you have a reference to the sensible compiled method format? I think
I recall some discussions on that topic, but I don't recall when or by
whom.

But really, what are we missing? We have CompiledMethodTrailer that appears to
provide an infinitely extensible mechanism for inventing new kinds of source
pointers.  And we have an abstract SourceFileArray which, if its class comment
is to be believed, is intended to encourage someone to actually go out and do
exactly what you describe:

  "This class is an abstract superclass for source code access mechanisms.
  It defines the messages that need to be understood by those subclasses
  that store and retrieve source chunks on files, over the network or in
  databases. The first concrete subclass, StandardSourceFileArray, supports
  access to the traditional sources and changes files. Other subclasses
  might implement multiple source files for different applications, or
  access to a network source server."

We already have one new subclass (ExpandedSourceFileArray) that was used
to eliminate the old size limit on changes files. There is nothing stopping
someone from coming up with other implementations that delegate to
databases or to something on the internet.

As far as I can see, the only thing that is missing is for somebody to
actually go do it.
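
As an illustration of the shape such a subclass could take, here is a small Python analogue: an abstract "source store" with two interchangeable backends, one in memory and one backed by SQLite. The class and method names (`SourceStore`, `append`, `source_at`) are hypothetical and are not the actual SourceFileArray protocol; the point is only that callers hold an opaque pointer and never care where the text lives.

```python
import sqlite3
from abc import ABC, abstractmethod


class SourceStore(ABC):
    """Hypothetical abstract interface: store source text, get it back by pointer."""

    @abstractmethod
    def append(self, source: str) -> int:
        """Store one source chunk and return a pointer to it."""

    @abstractmethod
    def source_at(self, pointer: int) -> str:
        """Return the source chunk that the pointer refers to."""


class InMemorySourceStore(SourceStore):
    """Simplest backend: pointers are indexes into a list."""

    def __init__(self):
        self.chunks = []

    def append(self, source):
        self.chunks.append(source)
        return len(self.chunks) - 1

    def source_at(self, pointer):
        return self.chunks[pointer]


class SqliteSourceStore(SourceStore):
    """Database backend: pointers are row ids in a 'chunk' table."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS chunk "
            "(id INTEGER PRIMARY KEY, source TEXT NOT NULL)"
        )

    def append(self, source):
        cursor = self.db.execute("INSERT INTO chunk (source) VALUES (?)", (source,))
        self.db.commit()
        return cursor.lastrowid

    def source_at(self, pointer):
        row = self.db.execute(
            "SELECT source FROM chunk WHERE id = ?", (pointer,)
        ).fetchone()
        if row is None:
            raise KeyError(pointer)
        return row[0]
```

A network-backed store would be a third subclass with the same two methods, which is roughly the extension point the class comment above describes.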

Dave



Re: Mine-able ideas?

timrowledge

On 02-01-2013, at 4:15 PM, "David T. Lewis" <[hidden email]> wrote:
>
> Do you have a reference to the sensible compiled method format? I think
> I recall some discussions on that topic, but I don't recall when or by
> whom.

I was thinking of the now-ancient 'NewCompiledMethod', going back to about 1997. The last I heard on the subject was about 5 years ago.
But..
>
> But really, what are we missing? We have CompiledMethodTrailer that appears to
> provide an infinitely extensible mechanism for inventing new kinds of source
> pointers.

… it reads as if that might provide the same result. Namely having the source pointer for each method be a proper oop, with all the obvious advantages over a weirdly encrypted 24-bit number hidden within some bytes at the end of a byte array.


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Document the code?  Why do you think they call it "code?"




Re: Mine-able ideas?

David T. Lewis
On Wed, Jan 02, 2013 at 05:26:25PM -0800, tim Rowledge wrote:
>
> On 02-01-2013, at 4:15 PM, "David T. Lewis" <[hidden email]> wrote:
> >
> > Do you have a reference to the sensible compiled method format? I think
> > I recall some discussions on that topic, but I don't recall when or by
> > whom.
>
> I was thinking of the now-ancient 'NewCompiledMethod', going back to about 1997. The last I heard on the subject was about 5 years ago.

Ah, right, it's all coming back to me now. Thanks.

> But..
> >
> > But really, what are we missing? We have CompiledMethodTrailer that appears to
> > provide an infinitely extensible mechanism for inventing new kinds of source
> > pointers.
>
> … it reads as if that might provide the same result. Namely having the source pointer for each method be a proper oop, with all the obvious advantages over a weirdly encrypted 24bit number hidden within some bytes at the end of a byte array
>

In principle, I think yes. Igor Stasenko created the CompiledMethodTrailer,
which has provided a really nice way to keep the existing formats working while
allowing all sorts of extensions. I'm not sure if he had in mind to implement
source pointers as first class objects, but it seems like it would be a
straightforward extension.

Cross posting to pharo in order to lure Igor back into the discussion ;-)

Dave



Re: Mine-able ideas?

Chris Muller-3
In reply to this post by timrowledge
> numbers, then you could have source kept in a proper database. Like, say, dabble. Or a dabble-ish thing that kept recent-ish stuff local and
> could refer back to a server for ancient history. Find all versions of a method back to the beginning of time...

The all-method-history thing has been available for a while now via Magma.

   http://wiki.squeak.org/squeak/5603


Re: Mine-able ideas?

David T. Lewis
On Thu, Jan 03, 2013 at 10:27:00AM -0600, Chris Muller wrote:
> > numbers, then you could have source kept in a proper database. Like, say, dabble. Or a dabble-ish thing that kept recent-ish stuff local and
> > could refer back to a server for ancient history. Find all versions of a method back to the beginning of time...
>
> The all-method-history thing has been available for a while now via Magma.
>
>    http://wiki.squeak.org/squeak/5603

Now that sounds like the *right* way to do it :)

Dave



Re: Mine-able ideas?

Hannes Hirzel
Dave,

I like this idea too of having earlier versions of methods in a
separate database and not in the image.

You recently reminded us how to calculate the space that certain types
of objects use in the image. I have forgotten the command again and
cannot easily find it in the mail history.

How do you calculate the space used by earlier method versions in the image?

--Hannes

On 1/3/13, David T. Lewis <[hidden email]> wrote:

> On Thu, Jan 03, 2013 at 10:27:00AM -0600, Chris Muller wrote:
>> > numbers, then you could have source kept in a proper database. Like,
>> > say, dabble. Or a dabble-ish thing that kept recent-ish stuff local and
>> > could refer back to a server for ancient history. Find all versions of a
>> > method back to the beginning of time...
>>
>> The all-method-history thing has been available for a while now via
>> Magma.
>>
>>    http://wiki.squeak.org/squeak/5603
>
> Now that sounds like the *right* way to do it :)
>
> Dave
>
>
>


Re: Mine-able ideas?

Colin Putney-3



On Thu, Jan 3, 2013 at 2:49 PM, H. Hirzel <[hidden email]> wrote:
 
> How do you calculate the space used by earlier method versions in the image?

None, actually. All source code is stored in the .sources or .changes file.

 Colin



Re: Mine-able ideas?

timrowledge

On 03-01-2013, at 11:52 AM, Colin Putney <[hidden email]> wrote:
>
> On Thu, Jan 3, 2013 at 2:49 PM, H. Hirzel <[hidden email]> wrote:
>  
> How do you calculate the space used by earlier method versions in the image?
>
> None, actually. All source code is stored in the .source or .changes file.

I know source gets stored in the files but long ago it was the case that method version objects were kept in the image and they held on to a *lot* of crap. Did that get changed? IIRC it was part of a never completed attempt to have some sort of namespacey-effect using projects.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Been playing with the pharmacy section again.




Re: Mine-able ideas?

Hannes Hirzel
Yes, I had in mind what Tim mentions.

--Hannes

On 1/3/13, tim Rowledge <[hidden email]> wrote:

>
> On 03-01-2013, at 11:52 AM, Colin Putney <[hidden email]> wrote:
>>
>> On Thu, Jan 3, 2013 at 2:49 PM, H. Hirzel <[hidden email]>
>> wrote:
>>
>> How do you calculate the space used by earlier method versions in the
>> image?
>>
>> None, actually. All source code is stored in the .source or .changes
>> file.
>
> I know source gets stored in the files but long ago it was the case that
> method version objects were kept in the image and they held on to a *lot* of
> crap. Did that get changed? IIRC it was part of a never completed attempt to
> have some sort of namespacey-effect using projects.
>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Useful random insult:- Been playing with the pharmacy section again.
>
>
>
>


Re: Mine-able ideas?

Igor Stasenko
In reply to this post by timrowledge
On 3 January 2013 02:26, tim Rowledge <[hidden email]> wrote:

>
> On 02-01-2013, at 4:15 PM, "David T. Lewis" <[hidden email]> wrote:
>>
>> Do you have a reference to the sensible compiled method format? I think
>> I recall some discussions on that topic, but I don't recall when or by
>> whom.
>
> I was thinking of the now-ancient 'NewCompiledMethod', going back to about 1997. The last I heard on the subject was about 5 years ago.
> But..
>>
>> But really, what are we missing? We have CompiledMethodTrailer that appears to
>> provide an infinitely extensible mechanism for inventing new kinds of source
>> pointers.
>
> … it reads as if that might provide the same result. Namely having the source pointer for each method be a proper oop, with all the obvious advantages over a weirdly encrypted 24bit number hidden within some bytes at the end of a byte array
>

+1
Likewise, the bytecode could be held in one oop, so that a compiled
method would not need a separate object format, just a contract that
its first ivar is the bytecode.


>
> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Document the code?  Why do you think they call it "code?"


--
Best regards,
Igor Stasenko.


Re: Mine-able ideas?

Igor Stasenko
In reply to this post by David T. Lewis
On 3 January 2013 02:55, David T. Lewis <[hidden email]> wrote:

> On Wed, Jan 02, 2013 at 05:26:25PM -0800, tim Rowledge wrote:
>>
>> On 02-01-2013, at 4:15 PM, "David T. Lewis" <[hidden email]> wrote:
>> >
>> > Do you have a reference to the sensible compiled method format? I think
>> > I recall some discussions on that topic, but I don't recall when or by
>> > whom.
>>
>> I was thinking of the now-ancient 'NewCompiledMethod', going back to about 1997. The last I heard on the subject was about 5 years ago.
>
> Ah, right, it's all coming back to me now. Thanks.
>
>> But..
>> >
>> > But really, what are we missing? We have CompiledMethodTrailer that appears to
>> > provide an infinitely extensible mechanism for inventing new kinds of source
>> > pointers.
>>
>> … it reads as if that might provide the same result. Namely having the source pointer for each method be a proper oop, with all the obvious advantages over a weirdly encrypted 24bit number hidden within some bytes at the end of a byte array
>>
>
> In principle, I think yes. Igor Stasenko created the CompiledMethodTrailer,
> which has provided a really nice way to keep the existing formats working while
> allowing all sorts of extensions. I'm not sure if he had in mind to implement
> source pointers as first class objects, but it seems like it would be a
> straightforward extension.
>
Sure thing, I wanted to take it further.
I have written about it multiple times.
But we need someone who will flesh the idea out :)

> Cross posting to pharo in order to lure Igor back into the discussion ;-)
>
> Dave

--
Best regards,
Igor Stasenko.


Re: Mine-able ideas?

Chris Muller-3
In reply to this post by David T. Lewis
It's a good way to have local improvement and it's pure objects, but it
doesn't really help us as a community because 1) it doesn't allow
external tools to interface to it and 2)  Magma does not support
authentication or authorization.


On Thu, Jan 3, 2013 at 11:46 AM, David T. Lewis <[hidden email]> wrote:

> On Thu, Jan 03, 2013 at 10:27:00AM -0600, Chris Muller wrote:
>> > numbers, then you could have source kept in a proper database. Like, say, dabble. Or a dabble-ish thing that kept recent-ish stuff local and
>> > could refer back to a server for ancient history. Find all versions of a method back to the beginning of time...
>>
>> The all-method-history thing has been available for a while now via Magma.
>>
>>    http://wiki.squeak.org/squeak/5603
>
> Now that sounds like the *right* way to do it :)
>
> Dave
>
>


Re: Mine-able ideas?

Bert Freudenberg
In reply to this post by Colin Putney-3

On 02.01.2013, at 23:17, Colin Putney <[hidden email]> wrote:




> On Wed, Jan 2, 2013 at 4:18 PM, Frank Shearar <[hidden email]> wrote:
>> http://blog.datomic.com/2012/10/codeq.html
>>
>> Executive summary:
>> * Git gives version control over files
>> * Clojure code typically has lots of functions or other chunks of code
>> in one file
>> * This means you can't ask for the version of a single unit of code
>> * Static analyses over the files as they vary through time, dumped
>> into a database, yields interesting stuff
>>
>> What they're calling "codeqs" ("code quantum") filetree folks would
>> call a file, because filetree already splits everything (I think?)
>> into bits, and versions everything at the "codeq" level by virtue of
>> storing each bit in its own file: class definition, comment, method
>> definition, etc.
>>
>> So we already have most of this stuff already - I couldn't live
>> without my in-image method versions - but I'm wondering if anyone else
>> can spot anything worth copying?
>
> Nah. They're basically figuring out how to extract the semantic changes from git, since git just treats the source code as opaque text. That gets them to what Monticello has now. I guess there's a bit of "imagine what you could do then!" that's unspecified.
>
> Which is not to say that it's a bad idea. I'd love to create a huge database of, say, the update stream going back to the beginning, or the entire contents of squeaksource. But... then what?
>
> Things that spring to mind immediately:
>
> - universal senders and implementors
> - metrics like message sends per method or methods per class
> - detection of package dependencies
> - analysis of how long-lived packages change over time
> - analysis of contribution and collaboration between coders
>
> and so on.
>
> But, what good is it? Might be interesting, maybe there's some research papers to be written, but would it do us any good as a community? Would there be useful tools that came out of it? Would it be worth the effort? Hard to say.
>
> Colin


Wasn't this one of the goals Dale had for SqueakSource3?

- Bert -





Re: Mine-able ideas?

Frank Shearar-3
In reply to this post by Colin Putney-3
On 2 January 2013 22:17, Colin Putney <[hidden email]> wrote:

>
>
>
> On Wed, Jan 2, 2013 at 4:18 PM, Frank Shearar <[hidden email]>
> wrote:
>>
>> http://blog.datomic.com/2012/10/codeq.html
>>
>> Executive summary:
>> * Git gives version control over files
>> * Clojure code typically has lots of functions or other chunks of code
>> in one file
>> * This means you can't ask for the version of a single unit of code
>> * Static analyses over the files as they vary through time, dumped
>> into a database, yields interesting stuff
>>
>> What they're calling "codeqs" ("code quantum") filetree folks would
>> call a file, because filetree already splits everything (I think?)
>> into bits, and versions everything at the "codeq" level by virtue of
>> storing each bit in its own file: class definition, comment, method
>> definition, etc.
>>
>> So we already have most of this stuff already - I couldn't live
>> without my in-image method versions - but I'm wondering if anyone else
>> can spot anything worth copying?
>
>
> Nah. They're basically figuring out how to extract the semantic changes from
> git, since git just treats the source code as opaque text. That gets them to
> what Monticello has now. I guess there's a bit of "imagine what you could do
> then!" that's unspecified.
>
> Which is not to say that it's a bad idea. I'd love to create a huge database
> of, say, the update stream going back to the beginning, or the entire
> contents of squeaksource. But... then what?
>
> Things that spring to mind immediately:
>
> - universal senders and implementors
> - metrics like message sends per method or methods per class
> - detection of package dependencies
> - analysis of how long-lived packages change over time
> - analysis of contribution and collaboration between coders
>
> and so on.
>
> But, what good is it? Might be interesting, maybe there's some research
> papers to be written, but would it do us any good as a community? Would
> there be useful tools that came out of it? Would it be worth the effort?
> Hard to say.

I eventually remembered the paper I'd recently read:
http://scg.unibe.ch/archive/papers/Rob12aAPIDeprecations.pdf "How Do
Developers React to API Deprecation? The Case of a Smalltalk
Ecosystem" looks at how APIs change and how developers react to same,
and it mines SS for its data.

Hopefully, research papers _would_ benefit the community. (In other
words, they'd hopefully be research papers into things that were
useful, or that enabled useful things.)

frank

> Colin
>
>
>


Re: Mine-able ideas?

Colin Putney-3
In reply to this post by timrowledge



On Thu, Jan 3, 2013 at 2:58 PM, tim Rowledge <[hidden email]> wrote:
 
> I know source gets stored in the files but long ago it was the case that method version objects were kept in the image and they held on to a *lot* of crap. Did that get changed? IIRC it was part of a never completed attempt to have some sort of namespacey-effect using projects.

That must have been before my time. These days, all versions are stored on disk. Each chunk has the source pointer for the previous version in it, and the tools walk back through the changes/source files collecting all the versions in the chain.

Here's an example:

!MCVersionInfo methodsFor: 'converting' stamp: 'bf 4/18/2010 23:25' prior: 23175569!
asDictionary
	^ Dictionary new
		at: #name put: name;
		at: #id put: id asString;
		at: #message put: message;
		at: #date put: date;
		at: #time put: time;
		at: #author put: author;
		at: #ancestors put: (self ancestors collect: [:a | a asDictionary]);
		yourself! !

That "prior" parameter points to the previous version.
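
The chain walk itself can be sketched in a few lines. This is a simplified model, not how the image does it: it treats the "prior" value as a plain character offset to the start of the previous header within a single log string, whereas a real source pointer also has to say which file it refers to. The helper names (`append_version`, `version_chain`) are invented for the example.

```python
import re

# Header of one method version; the "prior:" part is absent on the first version.
HEADER = re.compile(
    r"!(.+?) methodsFor: '(.*?)' stamp: '(.*?)'(?: prior: (\d+))?!\n"
)


def append_version(log, cls, stamp, source, prior=None):
    """Append one method version; return (new_log, offset_of_its_header)."""
    offset = len(log)
    prior_part = f" prior: {prior}" if prior is not None else ""
    entry = f"!{cls} methodsFor: 'demo' stamp: '{stamp}'{prior_part}!\n{source}! !\n"
    return log + entry, offset


def version_chain(log, pointer):
    """Follow 'prior:' links from the newest version back to the oldest."""
    chain = []
    while pointer is not None:
        match = HEADER.match(log, pointer)
        if match is None:
            raise ValueError(f"no method header at offset {pointer}")
        _cls, _category, stamp, prior = match.groups()
        end = log.index("! !", match.end())  # end of this method's source
        chain.append((stamp, log[match.end():end]))
        pointer = int(prior) if prior is not None else None
    return chain
```

Starting from the newest version's offset, `version_chain` returns (stamp, source) pairs newest first, which is essentially the list a versions browser shows.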

Colin