[Long] Public repository scalability, Store improvements needed?

Joachim Geidel
Hi all,

I am a bit concerned about the accessibility of the public Store
repository. I don't mean database size or response times at the
technical level, but the sheer number of bundles and packages and their
impact on the usability of the repository.

I perceive the repository as one of the propaganda and communication
tools for the Smalltalk community, addressing the concern that there are
not enough Smalltalk components available. Therefore, it should be as
easily accessible (in any sense of the word you can come up with) as
possible. On second thought, response time matters for accessibility, too.

A first attempt at handling this is the index web page at
http://www.cincomsmalltalk.com/publicRepository/ which lists only
top level components with a blessing level of at least "Development" and
having a pundle comment. However, this has the drawback that many useful
but uncommented pundles sink into oblivion. Also, it only shows the
pundle names on the page. There is no quick way of browsing through
comments etc. There is also the RSS feed at
http://www.cincomsmalltalk.com/store/store.xml which shows the 19(?)
latest versions published.

But I tend to browse the repository using the usual Store and StoreGlorp
tools, looking for useful components from time to time. With increasing
size of the repository, this becomes more and more tedious. I think that
it must be overwhelming and frustrating for someone looking at it for
the first time.

While I don't have solutions for the problem, here are some thoughts.

Better tool support needed
--------------------------

Before looking at the tools: Of course, it would help a lot if authors
of contributions would always comment their pundles, write
comprehensible blessing comments, and include remarks about enclosing
bundles, prerequisites, dependencies, links to web sites with
documentation etc. in the comments - it's not just an issue of the Store
tools. Tools can't add information which is not present.

IMHO, the "published items" list of the Store tools does not scale to
the size of the public repository or of a repository of an active
multi-team development organization (I've seen this elsewhere, too).
There are several problems with that list:

- If you look at a package or bundle, you can't see if it is contained
in a bundle. Using a hierarchical list could help, but it would be
inaccurate and sometimes even misleading, as there are pundles which are
part of more than one bundle. Also, being contained in a bundle does not
mean that the package is not useful as a standalone component.

- For projects with more than one bundle, the "top level" bundles are
not easily identifiable. If the bundle comment does not explicitly state
what is needed for using the project's code, you have to use
trial-and-error to find out. The Chronos class library alone has 29
bundles! Fortunately, their names all start with "Chronos". But how do I
detect that "CodeFoo" and "MetaDevelopment" are part of "Moose"? So,
which bundles do I have to load? Which ones do I have to replicate into
my local repository? In most cases, this is not documented in the bundle
comments, and there is no tool support at all for this purpose.

- This gets worse when bundles or packages are renamed. Examples:
* Cairo -> CairoGraphicsX -> CairoGraphics
* ExtraRBForSUnitToo -> SUnitToo(ls)
(@Travis: This is just a coincidence, and I am not opposed to renaming
pundles. It's the inability of the tools to show what happened which
bothers me.)

- When replicating components into my local repository, it's often a
trial-and-error game until I have found all of the prerequisites. There
is no "replicate with prerequisites". Also, a visualization tool for
showing dependents and/or prerequisites graphically for a pundle
version, similar to the versioning graph in RBStoreExtensions, would be
helpful. (Yeah, I know, someone is certainly going to tell me "if you
need it, why don't you write it yourself".)
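
To make the "replicate with prerequisites" idea concrete, here is a minimal sketch in Python rather than Smalltalk. It assumes the prerequisites of each pundle are available as a plain dictionary; the function name and the pundle names are hypothetical, and nothing here is Store's actual API. It walks the prerequisites depth first and returns them in an order in which they could be replicated.

```python
from typing import Dict, List


def replication_order(root: str, prereqs: Dict[str, List[str]]) -> List[str]:
    """Return root plus all transitive prerequisites, prerequisites first."""
    order: List[str] = []
    visiting = set()  # pundles on the current path, used to detect cycles
    done = set()      # pundles already placed in the result

    def visit(name: str) -> None:
        if name in done:
            return
        if name in visiting:
            raise ValueError(f"prerequisite cycle involving {name!r}")
        visiting.add(name)
        for dep in prereqs.get(name, []):
            visit(dep)
        visiting.discard(name)
        done.add(name)
        order.append(name)

    visit(root)
    return order


# Hypothetical prerequisite data, for illustration only.
example = {
    "MyApp": ["MyAppCore", "MyTestTools"],
    "MyAppCore": ["MyBase"],
    "MyTestTools": ["MyBase"],
}
print(replication_order("MyApp", example))
# ['MyBase', 'MyAppCore', 'MyTestTools', 'MyApp']
```

The same traversal would also give the node list for a graphical view of prerequisites.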

Structuring the contents
------------------------

There are different kinds of components in the repository:

- Previews of components of VisualWorks. These become obsolete as soon
as they are integrated in a product release. It would be nice to have
some kind of compatibility filter in the tools - don't show me things
which I can't load yet into VW 7.4.1, and don't show me versions which
are already obsolete. Examples: SmalltalkDoc, MQInterface, the
Webservices pundles.

- Community contributions for the VisualWorks product like bug fixes and
small enhancements (e.g. the SYSBUG-* packages). They are
usually very small grained packages. Especially for bug fixes, it is
often not clear for which VW releases they are intended, and if they
have been integrated into a VW release. Example: MouseWheelX11 - it's
from 2002, but do you still need it in VW 7.4.1? Without browsing
through all the VW release notes from 2002 to 2006, you won't be able to
tell. Could these packages be annotated in some way when they are
integrated in VW?

- Development tools for VisualWorks. Examples: The various refactoring
browser enhancements, or test tools like BRITE, Dakar Testing,
SUnitToo(ls), SUnitDebugExtensions, SSpec. Some of
the contributed tools are not in the repository, but on the VW CD
(RB_Tabs, GHStoreEnhance). Some absolutely indispensable ones are only
in the repository (RBStoreExtensions), but not on the CD. Others are
both on the CD and in the repository - but which one should I use
(AutoComplete)? Sometimes, the version on the CD is outdated, sometimes
the latest versions in the repository will work only with an upcoming
not yet published release of VW. Sometimes, the parcel version numbers on
the CD are in sync with the VW release number, sometimes they lag
behind, and sometimes the parcel version number, the VW release number,
and the package version in the public repository are completely
unrelated. Okay, it's not as bad as it sounds, but I wouldn't want to be
a newbie browsing all that.

- Open source class libraries based on VW, e.g. Chronos, AIDA/Web,
Seaside, Glorp, Magritte, etc.

- "Simple little things" like SimpleLittleThings, the SYSEXT-* packages,
OnceUponATime, CraftedMemoryPolicy, DateField, and many more.

Would it be possible to somehow annotate these different kinds of
components such that they can be discerned / filtered in the Store tools?

Oh, and what about starting a web browser when clicking on URLs in
comments? Does SmalltalkDoc have this feature?

Garbage, and "invisible" pundles
--------------------------------

I have the impression that there are several classroom projects
who are using the Cincom public repository for development. There are
new packages with very general names like "GUI", "Gui", "Utilities", or
"Algorithms", packages with comments like "This is my attempt to test
out store", bundles with names like "CS2340", "T123SG" without any
comments, "PBEC" without comment, but blessing comments like "M4
submission 3", and packages like "Graph" without any content and package
or blessing comment. And what are "BandGreeks" and "MortalWombatEcode"
supposed to be?

While I am quite happy that there are universities teaching Smalltalk, I
think that the public repository is not the right place for source code
management of classroom projects or ongoing development activity of
projects which have not yet reached a certain level of maturity
(something like "public beta"). Just imagine two universities having
half a dozen teams work on a project twice a year - a couple of years,
and you won't be able to find the interesting components among all of
their leftovers.

On the other hand, there are some interesting components like
"Softwarenaut" which are unfortunately undocumented, and don't show up
on the repository contents page - I usually don't even consider them
when looking at what I might replicate into my own development
environment. (For Softwarenaut, see
http://www.inf.unisi.ch/phd/lungu/research/softwarenaut/ - it seems to
be a very interesting software reengineering tool, but the version in
the public repository is not the most recent one.)

Would a usage policy make sense, and could it be enforced? Do we need a
"repository police"? Or could techniques from Web 2.0 like tagging
mechanisms be adopted for identifying useful components, such that
garbage would automatically be pushed to the bottom of the list,
avoiding the need of a usage policy? Would this help pulling
useful but forgotten pundles out of the shadows?

Best regards,
Joachim Geidel


Re: [Long] Public repository scalability, Store improvements needed?

Travis Griggs-3
Joachim,

You make many good observations. I've interspersed mine betwixt.

On Mar 4, 2007, at 3:12, Joachim Geidel wrote:

Hi all,

I am a bit concerned about the accessibility of the public Store
repository. I don't mean database size or response times at the
technical level, but the sheer number of bundles and packages and their
impact on the usability of the repository.

I perceive the repository as one of the propaganda and communication
tools for the Smalltalk community, addressing the concern that there are
not enough Smalltalk components available. Therefore, it should be as
easily accessible (in any sense of the word you can come up with) as
possible. On second thought, response time matters for accessibility, too.

Yes. I agree. Better, faster, easier, more accessible. These are all things I think we should continue to chisel away at. To play the complementary side of this coin... I think the Open Repository has been a phenomenal success. I am to this day ever so grateful to Pete Hatch and James Robertson and anyone else who assisted them in persevering to set it up and have since maintained it. That we're in a position to ponder how to make the problem that's created more tenable is a Really Good Thing (tm).

As for direct response time... I too note that it seems to have degraded in performance. But on the flip side... it still runs faster than doing "ports" or debian installs for me. So I'm glad we're on par with, if not better than, some of the other big repository efforts.

<snip>

But I tend to browse the repository using the usual Store and StoreGlorp
tools, looking for useful components from time to time. With increasing
size of the repository, this becomes more and more tedious. I think that
it must be overwhelming and frustrating for someone looking at it for
the first time.

I agree again. And again am glad this is a problem we have. It's like grumbling about having to pay the government too much in taxes because you're making lots more money than you used to. :)

This is not a unique problem by any means. This is a network age problem in general. The feeling of "where is what I need; why isn't what I want to find instantly at my fingertips" is a problem I have as I load software on my Mac, when I go hunting for tools for Windows, when I look for tools in debian repositories, or heaven forbid... at SourceForge. It behooves us to make observations about how those resources deal with similar problems. To emulate where we fall short. To innovate when we can. To accept the sheer size of the problem. That is... that there probably is no "Silver Bullet" that makes all of this better... or one of those communities with vaster resources would be doing it already.

While I don't have solutions for the problem, here are some thoughts.

Better tool support needed

I knew it would come to this :)

Before looking at the tools: Of course, it would help a lot if authors
of contributions would always comment their pundles, write
comprehensible blessing comments, and include remarks about enclosing
bundles, prerequisites, dependencies, links to web sites with
documentation etc. in the comments - it's not just an issue of the Store
tools. Tools can't add information which is not present.

With 7.5, there's the little "warning" icons. I find that these have encouraged me to do more in the way of commenting packages I maintain. Am I doing enough... probably not. But I'm doing a lot more for me. It's my hope that others, at least some others, will react similarly.

Little simple things we can do to improve the situation further. Detect the presence of URL looking strings in comments and provide quicker accessibility to them. The overview view of a package should not only be its comment, but the prerequisites it needs, and more directly... what loading this package is going to suck into your own image. We need to help people do a better job of inputting and maintaining prerequisites. I think there's some open source tool that could/should be integrated for that. :)
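
A minimal sketch of the URL detection part, in Python for illustration only. The pattern is a deliberately simple assumption (http/https only, trailing punctuation trimmed), not a full URL grammar, and the function name is made up.

```python
import re
from typing import List

# Simplified pattern: scheme, then anything up to whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def find_urls(comment: str) -> List[str]:
    """Return URL-looking strings in a comment, without trailing punctuation."""
    urls = []
    for match in URL_PATTERN.finditer(comment):
        # Sentence punctuation right after a URL is almost never part of it.
        urls.append(match.group(0).rstrip(".,;:!?)"))
    return urls


comment = ("Documentation is at http://www.cincomsmalltalk.com/publicRepository/. "
           "(See also http://www.openskills.org/).")
print(find_urls(comment))
```

A tool could then turn each hit into a clickable link; in Python the standard library's `webbrowser.open(url)` would do the launching. Trimming a closing parenthesis is a trade-off: it fixes the common "(see URL)" case but would truncate a URL that really ends in one.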

I'll leave a segue regarding bundles till the bottom.

IMHO, the "published items" list of the Store tools does not scale to
the size of the public repository or of a repository of an active
multi-team development organization (I've seen this elsewhere, too).
There are several problems with that list:

- If you look at a package or bundle, you can't see if it is contained
in a bundle. Using a hierarchical list could help, but it would be
inaccurate and sometimes even misleading, as there are pundles which are
part of more than one bundle. Also, being contained in a bundle does not
mean that the package is not useful as a standalone component.

Sitting on hands regarding bundles. As far as prereqs, I think how to make that process more visible is straightforward. Tools like synaptic and the like do exactly this.

- For projects with more than one bundle, the "top level" bundles are
not easily identifiable. If the bundle comment does not explicitly state
what is needed for using the project's code, you have to use
trial-and-error to find out. The Chronos class library alone has 29
bundles! Fortunately, their names all start with "Chronos". But how do I
detect that "CodeFoo" and "MetaDevelopment" are part of "Moose"? So,
which bundles do I have to load? Which ones do I have to replicate into
my local repository? In most cases, this is not documented in the bundle
comments, and there is no tool support at all for this purpose.

- This gets worse when bundles or packages are renamed. Examples:
* Cairo -> CairoGraphicsX -> CairoGraphics
* ExtraRBForSUnitToo -> SUnitToo(ls)
(@Travis: This is just a coincidence, and I am not opposed to renaming
pundles. It's the inability of the tools to show what happened which
bothers me.)

I certainly have a bad habit of renaming packages (though no one can accuse me of having a bad habit of renaming bundles). In the second case, I've done exactly what is done in every other project maintaining system. How many times have you found a link for a piece of software, followed it, only to find a link that says "this project has been renamed/moved"? I note that ExtraRBForSUnitToo (the latest version) is an empty package. If you load it, it will prereq load what it was renamed to. Furthermore, its package and publish comment both state:

"This package has been replaced by SUnitToo(ls). Please load that instead."

I don't know what else to do. I'm open to ideas. I don't know how many others are guilty of package renames. But I'm at least doing my best to follow what I see as similar approaches in other repository systems. This all sounds kind of defensive. It's not meant to. I'm actually really glad you mentioned it, because we can at least explore the similarities with how other systems deal with renames.

The case of the 3 Cairo packages is actually a bit more entertaining. Cairo was an initial implementation published by Holger Kleinsorgen. It had been stale for more than a year before I published the other two packages. At the time, I contacted him, and he was cool with me starting another one. His initial experiments hadn't paid off, in part due to the lack of stability at the time with the cairo library on windows. Not knowing what to do with that package, and not wanting to just completely remove everything and replace it with the approach I had taken, simply so the name could stay the same, I chose to use a different name. CairoGraphicsX was a "false" start of mine. I don't know if it's in the OR or not, but there's actually a _4th_ Cairo package out there which Boris Popov did an initial spike for called Chartz (Bundle and multiple "category" packages). Like the case with Holger, Boris and I had chatted quite a bit both in person and on IRC about the situation and felt that starting a project, and hijacking (copying code) from his defunct efforts was appropriate.

So this is again a great example imo. It's like looking for bit torrent clients. There are tons of them. Some of them are renames. Or dead projects, from which other projects are born.

One piece that seems to be obviously missing here is the ability to remove packages from the OpenRepository. This seems to be a problem I'm running into a lot lately with various facets of Smalltalk. Removing stuff to make more of the stuff that remains. It would be nice to be able to easily remove packages from the OR. Currently, you have to use a tool that's a little out of the way. And you have to be a database administrator. That leads to logistical issues with deciding who can remove a package. Is it the original publisher? What if others have "taken over" stewardship of the package and have maintained it themselves? One semi-solution might be to add a blessing level or special comment, or basically some piece of easily accessible metadata, which can be used to mark the package as EOL (End Of Life). Tools could be taught to filter out these packages by default. One could even try to infer this information from existing data such as last published time in conjunction with download activity.
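
A sketch of that inference, in Python for illustration. It assumes a last-download date is available per package, which the repository may not record today; the record type, the two-year threshold, and the package names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class PackageStats:
    name: str
    last_published: date
    last_downloaded: Optional[date]  # None if never downloaded (assumed to be tracked)
    marked_eol: bool = False         # explicit EOL marker set by a maintainer


def looks_end_of_life(stats: PackageStats, today: date,
                      max_idle: timedelta = timedelta(days=730)) -> bool:
    """Explicit marker wins; otherwise require both publishing and downloads to be idle."""
    if stats.marked_eol:
        return True
    publish_idle = today - stats.last_published > max_idle
    download_idle = (stats.last_downloaded is None
                     or today - stats.last_downloaded > max_idle)
    return publish_idle and download_idle


today = date(2007, 3, 12)
stale = PackageStats("StaleThing", date(2004, 1, 1), None)
old_but_used = PackageStats("OldButUsed", date(2003, 6, 1), date(2007, 2, 1))
print(looks_end_of_life(stale, today))         # True
print(looks_end_of_life(old_but_used, today))  # False
```

Requiring both signals keeps a finished but still popular package visible; a tool would only hide what this returns True for by default, never delete it.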

- When replicating components into my local repository, it's often a
trial-and-error game until I have found all of the prerequisites. There
is no "replicate with prerequisites". Also, a visualization tool for
showing dependents and/or prerequisites graphically for a pundle
version, similar to the versioning graph in RBStoreExtensions, would be
helpful. (Yeah, I know, someone is certainly going to tell me "if you
need it, why don't you write it yourself".)

I'm not going to tell you that. It is a possible solution. But there is nothing wrong with identifying a need. You're just not allowed to whine about its absence unless you contribute to its effort. :)

Structuring the contents
------------------------

There are different kinds of components in the repository:

- Previews of components of VisualWorks. These become obsolete as soon
as they are integrated in a product release. It would be nice to have
some kind of compatibility filter in the tools - don't show me things
which I can't load yet into VW 7.4.1, and don't show me versions which
are already obsolete. Examples: SmalltalkDoc, MQInterface, the
Webservices pundles.

- Community contributions for the VisualWorks product like bug fixes and
small enhancements (e.g. the SYSBUG-* packages). They are
usually very small grained packages. Especially for bug fixes, it is
often not clear for which VW releases they are intended, and if they
have been integrated into a VW release. Example: MouseWheelX11 - it's
from 2002, but do you still need it in VW 7.4.1? Without browsing
through all the VW release notes from 2002 to 2006, you won't be able to
tell. Could these packages be annotated in some way when they are
integrated in VW?

- Development tools for VisualWorks. Examples: The various refactoring
browser enhancements, or test tools like BRITE, Dakar Testing,
SUnitToo(ls), SUnitDebugExtensions, SSpec. Some of
the contributed tools are not in the repository, but on the VW CD
(RB_Tabs, GHStoreEnhance). Some absolutely indispensable ones are only
in the repository (RBStoreExtensions), but not on the CD. Others are
both on the CD and in the repository - but which one should I use
(AutoComplete)? Sometimes, the version on the CD is outdated, sometimes
the latest versions in the repository will work only with an upcoming
not yet published release of VW. Sometimes, the parcel version numbers on
the CD are in sync with the VW release number, sometimes they lag
behind, and sometimes the parcel version number, the VW release number,
and the package version in the public repository are completely
unrelated. Okay, it's not as bad as it sounds, but I wouldn't want to be
a newbie browsing all that.

- Open source class libraries based on VW, e.g. Chronos, AIDA/Web,
Seaside, Glorp, Magritte, etc.

- "Simple little things" like SimpleLittleThings, the SYSEXT-* packages,
OnceUponATime, CraftedMemoryPolicy, DateField, and many more.

Would it be possible to somehow annotate these different kinds of
components such that they can be discerned / filtered in the Store tools?

Originally, Debian had this kind of categorization. Not the same dimensions. But nevertheless, different buckets that supposedly different types of packages went in. For all I know, it still does. But I've never observed anyone using them. They just don't add enough to the picture, and the first and second and third time you get burned by having it filter out something whose classification you disagreed with, you quit paying attention. I'm almost to that point with the "categories" offered when I go to the "Get Mac OS X" software site. Is that educational or entertainment? Does SourceForge try to solve this problem? I honestly don't know. I know I laugh at iTunes (CDDB indirectly I guess) and its attempts to genre'ize my music for me.

I wonder if the solution isn't to somehow make this kind of annotation more interactive. To allow multiple people to indicate what they used a given package for. This would be a fuzzier model, on what seems to be a fuzzy classification in the first place.
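
One way to picture that fuzzier model, sketched in Python: every user can attach free-form labels to a package, and a search ranks packages by how many people used that label. The class, its methods, and the votes below are hypothetical; only the package names come from this thread.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class UsageTags:
    """Counts how often each label was attached to each package."""

    def __init__(self) -> None:
        self._votes: Dict[str, Counter] = defaultdict(Counter)

    def tag(self, package: str, label: str) -> None:
        self._votes[package][label.lower()] += 1

    def search(self, label: str) -> List[Tuple[str, int]]:
        """Packages carrying the label, most votes first, ties by name."""
        label = label.lower()
        hits = [(pkg, votes[label])
                for pkg, votes in self._votes.items() if votes[label] > 0]
        return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Made-up votes, for illustration only.
tags = UsageTags()
tags.tag("Seaside", "web")
tags.tag("Seaside", "Web")
tags.tag("AIDA/Web", "web")
tags.tag("Chronos", "dates")
print(tags.search("web"))  # [('Seaside', 2), ('AIDA/Web', 1)]
```

Nothing is ever filtered out by a single classifier here; a package nobody tagged simply never ranks, which is the "pushed to the bottom" effect asked about above.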

Oh, and what about starting a web browser when clicking on URLs in
comments? Does SmalltalkDoc have this feature?

Not SmalltalkDoc itself per se. But it has been recognized as a good idea.

Garbage, and "invisible" pundles
--------------------------------

I have the impression that there are several classroom projects
who are using the Cincom public repository for development. There are
new packages with very general names like "GUI", "Gui", "Utilities", or
"Algorithms", packages with comments like "This is my attempt to test
out store", bundles with names like "CS2340", "T123SG" without any
comments, "PBEC" without comment, but blessing comments like "M4
submission 3", and packages like "Graph" without any content and package
or blessing comment. And what are "BandGreeks" and "MortalWombatEcode"
supposed to be?

While I am quite happy that there are universities teaching Smalltalk, I
think that the public repository is not the right place for source code
management of classroom projects or ongoing development activity of
projects which have not yet reached a certain level of maturity
(something like "public beta"). Just imagine two universities having
half a dozen teams work on a project twice a year - a couple of years,
and you won't be able to find the interesting components among all of
their leftovers.

On the other hand, there are some interesting components like
"Softwarenaut" which are unfortunately undocumented, and don't show up
on the repository contents page - I usually don't even consider them
when looking at what I might replicate into my own development
environment. (For Softwarenaut, see
http://www.inf.unisi.ch/phd/lungu/research/softwarenaut/ - it seems to
be a very interesting software reengineering tool, but the version in
the public repository is not the most recent one.)

Would a usage policy make sense, and could it be enforced? Do we need a
"repository police"? Or could techniques from Web 2.0 like tagging
mechanisms be adopted for identifying useful components, such that
garbage would automatically be pushed to the bottom of the list,
avoiding the need of a usage policy? Would this help pulling
useful but forgotten pundles out of the shadows?

Yes, I think different ways of filtering/viewing the contents is the way to go. When I go to a SourceForge project, the pieces of data that guide me as I decide whether it is worth following up are:
* what's the developer team look like (how many people)
* what's the rate of news/when was the last news
* when was the last published version
* how many downloads, and when was the last one

Having this kind of information would be a good thing. Things like "most popular 10 packages" by different attributes would be cool.
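
A "top N by attribute" view is cheap once the numbers exist. The Python sketch below assumes per-project figures like the ones listed above are available; the record type, field names, and sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List


@dataclass
class ProjectInfo:
    name: str
    developers: int
    last_published: date
    downloads: int


def top_n(projects: Iterable[ProjectInfo], key: Callable[[ProjectInfo], object],
          n: int = 10) -> List[str]:
    """Names of the n projects with the largest key value."""
    return [p.name for p in sorted(projects, key=key, reverse=True)[:n]]


# Made-up figures, for illustration only.
projects = [
    ProjectInfo("Alpha", 3, date(2007, 2, 20), 120),
    ProjectInfo("Beta", 1, date(2005, 7, 1), 900),
    ProjectInfo("Gamma", 5, date(2007, 3, 1), 40),
]
print(top_n(projects, key=lambda p: p.downloads, n=2))       # ['Beta', 'Alpha']
print(top_n(projects, key=lambda p: p.last_published, n=2))  # ['Gamma', 'Alpha']
```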

To close, something that others have that we don't. And vice versa.

One thing I note in other projects that are publicly available is more control points for project governance. If it's SourceForge, you have approved developers for submitting changes. Even in projects managed without SourceForge, there's always some sort of semi-formal governance. We run our projects like Ralph Johnson once described, like public community bicycles, where anyone can pick it up, ride it where they want, and then leave it for the next person. I'm not saying this is bad. And I tend to resist "rules", especially where our community is so small that it doesn't need further impediments to getting stuff done. I'm simply observing the difference. Others can judge it one way or the other.

The thing we have that other community repository systems don't have is that segue promised earlier: Bundles. I think/hope that most people know of my disdain for these. I expect most to tune out and be annoyed with me for using the opportunity to "beat that old horse" again. I'm pessimistic and skeptical that we'll ever be able to fix this. From an observational POV, I challenge anyone to show me any other cms or repository system of note that has anything akin to bundles. Even other Smalltalks have chosen not to follow our lead in this regard. So we must at least accept that the problems you cite above which have to do with Bundles are of our own making. And therefore, we won't be able to find much in the way of ideas amongst other systems.

From a Tools standpoint, I have to say that having to constantly "consider the implication" of bundles with everything we do is a constant cost and liability we pay. It makes us go slower. It means that we have more bugs. There is no software that is quicker to write than that which we don't have to write in the first place. So the question becomes: is the cost worth it? Do they give significant value versus other repository systems to warrant the added complexity and cost? I know from experience that things like Prerequisite computation are just hopelessly befuddled by the ambiguities presented by having Bundles in the system.

Sometimes I believe that Bundles are a hint that Smalltalk is a dead language. I am reasonably confident that the people who put Smalltalk together would never have put something like Bundles together. It adds complexity instead of simplicity. It's like adding Multiple Inheritance to the language. At times a useful feature. But when considered in the light of what it does for your tools and how that single decision is something you have to constantly pay for at every point in the future, it is usually considered not such a good idea after all. Or the idea of embedding a non-messaging language inside of an otherwise OO messaging one. You pay and pay and pay for these kinds of things. Your system has to deal with not just one concept, it has to deal with two, AND it has to deal with the interplay between the two: 1 + 1 > 2. I'll happily/dutifully continue to try to improve VW tools and improve the user experience. I'll also continue to present the case that Bundles are not a common idiom, and that they have a cost.

--
Travis Griggs
Objologist
"There are a thousand hacking at the branches of evil to one who is striking at the root" - Henry David Thoreau



Re: [Long] Public repository scalability, Store improvements needed?

Bruce Badger
On 12/03/07, Travis Griggs <[hidden email]> wrote:
> On Mar 4, 2007, at 3:12, Joachim Geidel wrote:

>> - This gets worse when bundles or packages are renamed. Examples:
>> * Cairo -> CairoGraphicsX -> CairoGraphics
>> * ExtraRBForSUnitToo -> SUnitToo(ls)

> I certainly have a bad habit of renaming packages

I think that trending towards better names is a Good Thing.

The problem for things in Store is that the name of a thing (e.g.
method, class, package, parcel) is (in practice at least) the primary
key of that thing.

I hope that the next Store schema separates the identity of the things
it holds from the names those things may have from time to time such
that we can follow version histories across name changes and
successfully load a new version of a thing even if its name has
changed (barring in-image name clashes etc).

Well, I can dream ...
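
To illustrate what I mean, here is a small sketch in Python. It is not the Store schema, just an assumed model in which every pundle gets a surrogate identity and each published version records the name it carried at the time. The class and method names are made up, and the version strings are placeholders.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Pundle:
    identity: int  # stable key, never changes
    versions: List[Tuple[str, str]] = field(default_factory=list)  # (name, version)

    @property
    def current_name(self) -> str:
        return self.versions[-1][0]


class Repository:
    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._by_id: Dict[int, Pundle] = {}

    def create(self, name: str, version: str) -> int:
        pundle = Pundle(next(self._ids), [(name, version)])
        self._by_id[pundle.identity] = pundle
        return pundle.identity

    def publish(self, identity: int, name: str, version: str) -> None:
        """Publish a new version; the name may differ from earlier versions."""
        self._by_id[identity].versions.append((name, version))

    def find_by_name(self, name: str) -> Optional[Pundle]:
        """Find a pundle by any name it has ever carried."""
        for pundle in self._by_id.values():
            if any(n == name for n, _ in pundle.versions):
                return pundle
        return None


repo = Repository()
ident = repo.create("ExtraRBForSUnitToo", "1")  # placeholder version strings
repo.publish(ident, "SUnitToo(ls)", "2")
found = repo.find_by_name("ExtraRBForSUnitToo")
print(found.current_name)  # SUnitToo(ls)
print(found.versions)      # [('ExtraRBForSUnitToo', '1'), ('SUnitToo(ls)', '2')]
```

With the name demoted to an attribute of a version, a lookup under the old name lands on the same history, and no empty forwarding package is needed.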

--
Make the most of your skills - with OpenSkills
http://www.openskills.org/


Re: [Long] Public repository scalability, Store improvements needed?

Alan Knight-2
In reply to this post by Travis Griggs-3
At 01:12 PM 3/12/2007, Travis Griggs wrote:
One piece that seems to be obviously missing here is the ability to remove packages from the OpenRepository. This seems to be a problem I'm running into a lot lately with various facets of Smalltalk. Removing stuff to make more of the stuff that remains. It would be nice to be able to easily remove packages from the OR. Currently, you have to use a tool that's a little out of the way. And you have to be a database administrator. That leads to logistical issues with deciding who can remove a package. Is it the original publisher? What if others have "taken over" stewardship of the package and have maintained it themselves? One semi-solution might be to add a blessing level or special comment, or basically some piece of easily accessible metadata, which can be used to mark the package as EOL (End Of Life). Tools could be taught to filter out these packages by default. One could even try to infer this information from existing data such as last published time in conjunction with download activity.

To respond to just one of these points with a quick answer...

StoreForGlorp actually adds an #Obsolete blessing level for precisely this purpose, and the package that generates the view at http://www.cincomsmalltalk.com/publicStore will ignore things for which the most recent (or maybe it's any) version has this blessing. That could (should) be integrated into base Store, and other things could make use of it as well.
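
The filtering rule itself is simple. Here is a sketch in Python (not the actual StoreForGlorp code) that covers both readings mentioned above, "most recent version" and "any version". The data layout, the function names, and the pundle names are assumptions for illustration.

```python
from typing import Dict, List

OBSOLETE = "Obsolete"


def is_hidden(blessings: List[str], policy: str = "latest") -> bool:
    """blessings holds the blessing level of each version, oldest first."""
    if not blessings:
        return False
    if policy == "latest":
        return blessings[-1] == OBSOLETE
    if policy == "any":
        return OBSOLETE in blessings
    raise ValueError(f"unknown policy: {policy!r}")


def visible_pundles(repo: Dict[str, List[str]], policy: str = "latest") -> List[str]:
    return sorted(name for name, blessings in repo.items()
                  if not is_hidden(blessings, policy))


# Hypothetical repository contents, for illustration only.
repo = {
    "ActiveThing": ["Development", "Development"],
    "RetiredThing": ["Development", OBSOLETE],
    "RevivedThing": [OBSOLETE, "Development"],
}
print(visible_pundles(repo, "latest"))  # ['ActiveThing', 'RevivedThing']
print(visible_pundles(repo, "any"))     # ['ActiveThing']
```

The "latest" policy lets a pundle come back to life when someone publishes a new, non-obsolete version; the "any" policy does not.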

--
Alan Knight [|], Cincom Smalltalk Development

"The Static Typing Philosophy: Make it fast. Make it right. Make it run." - Niall Ross