At times access to source.squeak.org becomes slower, as has been the
case today. I can see in the logs that various web-crawlers are the
likely culprit. Having the information there accessible via search
engines is a wonderful thing but I have to suspect that the Seaside
session IDs eliminate this option. (Of course when URLs like
http://source.squeak.org/trunk.html are found on other sites they then
become indexed.)

Unless I'm mistaken about this, and I would appreciate any guidance, it
seems like we need to add a robots.txt to the site which guides or
simply asks crawlers to stay away. Thoughts? I'm no SEO expert.

Ken
On 28.04.2010, at 21:07, Ken Causey wrote:
>
> At times access to source.squeak.org becomes slower, as has been the
> case today. I can see in the logs that various web-crawlers are the
> likely culprit. Having the information there accessible via search
> engines is a wonderful thing but I have to suspect that the Seaside
> session IDs eliminate this option. (Of course when URLs like
> http://source.squeak.org/trunk.html are found on other sites they then
> become indexed.)

Which URLs are the bots accessing?

> Unless I'm mistaken about this, and I would appreciate any guidance, it
> seems like we need to add a robots.txt to the site which guides or
> simply asks crawlers to stay away. Thoughts? I'm no SEO expert.

We do have a robots.txt:
http://source.squeak.org/robots.txt

- Bert -
> -------- Original Message --------
> Subject: Re: [squeak-dev] SqueakSource indexability (aka should we just
> ask crawlers to desist?)
> From: Bert Freudenberg <[hidden email]>
> Date: Wed, April 28, 2010 2:59 pm
> To: The general-purpose Squeak developers list
> <[hidden email]>
>
> On 28.04.2010, at 21:07, Ken Causey wrote:
>>
>> At times access to source.squeak.org becomes slower, as has been the
>> case today. I can see in the logs that various web-crawlers are the
>> likely culprit. Having the information there accessible via search
>> engines is a wonderful thing but I have to suspect that the Seaside
>> session IDs eliminate this option. (Of course when URLs like
>> http://source.squeak.org/trunk.html are found on other sites they then
>> become indexed.)
>
> Which URLs are the bots accessing?

Well, without detailed analysis it seems to be everything. Feel free to
look at ~squeaksource/apachelogs/.

>> Unless I'm mistaken about this, and I would appreciate any guidance, it
>> seems like we need to add a robots.txt to the site which guides or
>> simply asks crawlers to stay away. Thoughts? I'm no SEO expert.
>
> We do have a robots.txt:
> http://source.squeak.org/robots.txt

Aha. Well, I know little about this subject. But if this means what I
think it means, it seems that the crawlers are ignoring it.

> - Bert -
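[For anyone who wants to see which agents dominate those logs: a few
lines of Python over the Apache access log will tally them. The file
name and the combined log format below are assumptions for
illustration, not the actual squeaksource setup.

    import re
    from collections import Counter

    # Matches the tail of an Apache "combined" log line:
    # "request" status bytes "referer" "user-agent"
    TAIL = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"')

    hits = Counter()
    with open("access.log") as log:          # hypothetical file name
        for line in log:
            match = TAIL.search(line)
            if match:
                hits[match.group("agent")] += 1

    # The ten busiest user agents, crawlers included.
    for agent, count in hits.most_common(10):
        print(f"{count:8d}  {agent}")

Sorting by request count usually makes the crawlers obvious, since they
identify themselves in the user-agent string.]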
On Wed, 28 Apr 2010, Ken Causey wrote:
> At times access to source.squeak.org becomes slower, as has been the
> case today. I can see in the logs that various web-crawlers are the
> likely culprit. Having the information there accessible via search
> engines is a wonderful thing but I have to suspect that the Seaside
> session IDs eliminate this option. (Of course when URLs like
> http://source.squeak.org/trunk.html are found on other sites they then
> become indexed.)

See http://code.google.com/p/seaside/issues/detail?id=262 . I had two
solutions for the problem in Seaside 2.8. One was using a linked
hashtable to manage the sessions, resulting in O(1) session
creation/access time, but it broke the almost never used feature that
every session can have a distinct timeout value. To solve that problem
I replaced the linked hashtable with a heap, which gave O(log(n))
creation/access time, but this time I was told to implement it in
Seaside 2.9 using the new plugin system. The above solutions can't be
implemented as a plugin, so we got nowhere.

> Unless I'm mistaken about this, and I would appreciate any guidance, it
> seems like we need to add a robots.txt to the site which guides or
> simply asks crawlers to stay away. Thoughts? I'm no SEO expert.

This should do it:

User-agent: *
Disallow: /

Levente
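[To make the data-structure idea concrete: a linked hashtable keeps
sessions in last-access order, so expiry only ever looks at the front
of the queue. Below is a minimal sketch in Python -- Seaside itself is
Smalltalk, and every name here is hypothetical, not the actual 2.8
code -- assuming the single shared timeout that Levente mentions.

    import time
    from collections import OrderedDict

    class SessionRegistry:
        """Sessions in last-access order; the stalest is always at the front."""

        def __init__(self, timeout=600):
            self.timeout = timeout          # one shared timeout for all sessions
            self.sessions = OrderedDict()   # session id -> (session, last access)

        def touch(self, sid, session=None):
            """Create or refresh a session in O(1) by relinking it at the back."""
            self.expire_stale()
            if sid in self.sessions:
                session, _ = self.sessions.pop(sid)   # unlink from current spot
            self.sessions[sid] = (session, time.monotonic())
            return session

        def expire_stale(self):
            """Drop expired sessions from the front; stop at the first live one."""
            now = time.monotonic()
            while self.sessions:
                sid, (_, last) = next(iter(self.sessions.items()))
                if now - last < self.timeout:
                    break
                del self.sessions[sid]

Each session is popped at most once, so expiry amortizes to O(1) per
touch. Give each session its own timeout and the front entry is no
longer guaranteed to expire first, which is why the heap variant with
O(log(n)) operations was needed.]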
On 28.04.2010, at 22:08, Ken Causey wrote:
>
>> -------- Original Message --------
>> Subject: Re: [squeak-dev] SqueakSource indexability (aka should we just
>> ask crawlers to desist?)
>> From: Bert Freudenberg <[hidden email]>
>> Date: Wed, April 28, 2010 2:59 pm
>> To: The general-purpose Squeak developers list
>> <[hidden email]>
>>
>> On 28.04.2010, at 21:07, Ken Causey wrote:
>>>
>>> At times access to source.squeak.org becomes slower, as has been the
>>> case today. I can see in the logs that various web-crawlers are the
>>> likely culprit. Having the information there accessible via search
>>> engines is a wonderful thing but I have to suspect that the Seaside
>>> session IDs eliminate this option. (Of course when URLs like
>>> http://source.squeak.org/trunk.html are found on other sites they then
>>> become indexed.)
>>
>> Which URLs are the bots accessing?
>
> Well, without detailed analysis it seems to be everything. Feel free to
> look at ~squeaksource/apachelogs/.
>
>>> Unless I'm mistaken about this, and I would appreciate any guidance, it
>>> seems like we need to add a robots.txt to the site which guides or
>>> simply asks crawlers to stay away. Thoughts? I'm no SEO expert.
>>
>> We do have a robots.txt:
>> http://source.squeak.org/robots.txt
>
> Aha. Well, I know little about this subject. But if this means what I
> think it means, it seems that the crawlers are ignoring it.

I just read up on it. Glob patterns are *not* allowed; the single
asterisk in the user-agent line is a special character, not a pattern
match. We used

User-agent: *
Disallow: /@*

But it should be

User-agent: *
Disallow: /@

I'm going to fix that; let's see how it works out.

- Bert -
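[A quick way to sanity-check the corrected rule is Python's standard
urllib.robotparser, which implements the same prefix matching as the
original robots.txt spec. The session URL below is made up for
illustration.

    from urllib import robotparser

    # Feed the corrected rules directly rather than fetching the live file.
    rules = """
    User-agent: *
    Disallow: /@
    """.splitlines()

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # Seaside session URLs (paths starting with /@) are blocked...
    print(rp.can_fetch("Googlebot", "http://source.squeak.org/@aSessionId"))  # False
    # ...while the stable project pages remain crawlable.
    print(rp.can_fetch("Googlebot", "http://source.squeak.org/trunk.html"))   # True

Some major crawlers do treat * in a path as a wildcard extension, so
the old /@* line may not have been ignored by all of them, but the
plain prefix form is the only one every spec-compliant bot
understands.]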
> -------- Original Message --------
> Subject: Re: [squeak-dev] SqueakSource indexability (aka should we just
> ask crawlers to desist?)
> From: Bert Freudenberg <[hidden email]>
> Date: Wed, April 28, 2010 3:31 pm
> To: The general-purpose Squeak developers list
> <[hidden email]>
>
> I just read up on it. Glob patterns are *not* allowed; the single
> asterisk in the user-agent line is a special character, not a pattern
> match. We used
>
> User-agent: *
> Disallow: /@*
>
> But it should be
>
> User-agent: *
> Disallow: /@
>
> I'm going to fix that; let's see how it works out.
>
> - Bert -

Thank you very much Bert, again...

Ken
On Wed, 28 Apr 2010, Bert Freudenberg wrote:
> On 28.04.2010, at 22:08, Ken Causey wrote:
>>
>>> -------- Original Message --------
>>> Subject: Re: [squeak-dev] SqueakSource indexability (aka should we just
>>> ask crawlers to desist?)
>>> From: Bert Freudenberg <[hidden email]>
>>> Date: Wed, April 28, 2010 2:59 pm
>>> To: The general-purpose Squeak developers list
>>> <[hidden email]>
>>>
>>> On 28.04.2010, at 21:07, Ken Causey wrote:
>>>>
>>>> At times access to source.squeak.org becomes slower, as has been the
>>>> case today. I can see in the logs that various web-crawlers are the
>>>> likely culprit. Having the information there accessible via search
>>>> engines is a wonderful thing but I have to suspect that the Seaside
>>>> session IDs eliminate this option. (Of course when URLs like
>>>> http://source.squeak.org/trunk.html are found on other sites they then
>>>> become indexed.)
>>>
>>> Which URLs are the bots accessing?
>>
>> Well, without detailed analysis it seems to be everything. Feel free to
>> look at ~squeaksource/apachelogs/.
>>
>>>> Unless I'm mistaken about this, and I would appreciate any guidance, it
>>>> seems like we need to add a robots.txt to the site which guides or
>>>> simply asks crawlers to stay away. Thoughts? I'm no SEO expert.
>>>
>>> We do have a robots.txt:
>>> http://source.squeak.org/robots.txt
>>
>> Aha. Well, I know little about this subject. But if this means what I
>> think it means, it seems that the crawlers are ignoring it.
>
> I just read up on it. Glob patterns are *not* allowed; the single
> asterisk in the user-agent line is a special character, not a pattern
> match. We used
>
> User-agent: *
> Disallow: /@*
>
> But it should be
>
> User-agent: *
> Disallow: /@

Just realized that links generated by Seaside begin with @. Tricky. :)

Levente

> I'm going to fix that; let's see how it works out.
>
> - Bert -