SqueakSource indexability (aka should we just ask crawlers to desist?)


SqueakSource indexability (aka should we just ask crawlers to desist?)

Ken Causey-3
At times access to source.squeak.org becomes slower, as has been the
case today.  I can see in the logs that various web-crawlers are the
likely culprit.  Having the information there accessible via search
engines is a wonderful thing, but I have to suspect that the Seaside
session IDs eliminate this option.  (Of course when URLs like
http://source.squeak.org/trunk.html are found on other sites they then
become indexed.)

Unless I'm mistaken about this, and I would appreciate any guidance, it
seems like we need to add a robots.txt to the site which guides or
simply asks crawlers to stay away.  Thoughts?  I'm no SEO expert.

Ken



Re: SqueakSource indexability (aka should we just ask crawlers to desist?)

Bert Freudenberg
On 28.04.2010, at 21:07, Ken Causey wrote:
>
> At times access to source.squeak.org becomes slower, as has been the
> case today.  I can see in the logs that various web-crawlers are the
> likely culprit.  Having the information there accessible via search
> engines is a wonderful thing, but I have to suspect that the Seaside
> session IDs eliminate this option.  (Of course when URLs like
> http://source.squeak.org/trunk.html are found on other sites they then
> become indexed.)

Which URLs are the bots accessing?

> Unless I'm mistaken about this, and I would appreciate any guidance, it
> seems like we need to add a robots.txt to the site which guides or
> simply asks crawlers to stay away.  Thoughts?  I'm no SEO expert.

We do have a robots.txt:
http://source.squeak.org/robots.txt

- Bert -




RE: SqueakSource indexability (aka should we just ask crawlers to desist?)

Ken Causey-3
In reply to this post by Ken Causey-3
> -------- Original Message --------
> Subject: Re: [squeak-dev] SqueakSource indexability (aka should we just
> ask crawlers to desist?)
> From: Bert Freudenberg <[hidden email]>
> Date: Wed, April 28, 2010 2:59 pm
> To: The general-purpose Squeak developers list
> <[hidden email]>
>
>
> On 28.04.2010, at 21:07, Ken Causey wrote:
> >
> > At times access to source.squeak.org becomes slower, as has been the
> > case today.  I can see in the logs that various web-crawlers are the
> > likely culprit.  Having the information there accessible via search
> > engines is a wonderful thing, but I have to suspect that the Seaside
> > session IDs eliminate this option.  (Of course when URLs like
> > http://source.squeak.org/trunk.html are found on other sites they then
> > become indexed.)
>
> Which URLs are the bots accessing?

Well, without detailed analysis it seems to be everything.  Feel free to
look at ~squeaksource/apachelogs/.

>
> > Unless I'm mistaken about this, and I would appreciate any guidance, it
> > seems like we need to add a robots.txt to the site which guides or
> > simply asks crawlers to stay away.  Thoughts?  I'm no SEO expert.
>
> We do have a robots.txt:
> http://source.squeak.org/robots.txt

Aha.  Well, I know little about this subject.  But if this means what I
think it means, the crawlers are ignoring it.

>
> - Bert -



Re: SqueakSource indexability (aka should we just ask crawlers to desist?)

Levente Uzonyi-2
In reply to this post by Ken Causey-3
On Wed, 28 Apr 2010, Ken Causey wrote:

> At times access to source.squeak.org becomes slower, as has been the
> case today.  I can see in the logs that various web-crawlers are the
> likely culprit.  Having the information there accessible via search
> engines is a wonderful thing, but I have to suspect that the Seaside
> session IDs eliminate this option.  (Of course when URLs like
> http://source.squeak.org/trunk.html are found on other sites they then
> become indexed.)

See http://code.google.com/p/seaside/issues/detail?id=262 . I had two
solutions for the problem in Seaside 2.8. One used a linked hashtable
to manage the sessions, giving O(1) session creation/access time, but
it broke the rarely used feature that every session can have a distinct
timeout value.
To solve that, I replaced the linked hashtable with a heap, which gave
O(log n) creation/access time, but by then I was told to implement it
in Seaside 2.9 using the new plugin system. The above solutions can't
be implemented as plugins, so we got nowhere.
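[Editor's illustration] The linked-hashtable idea can be sketched in Python (an illustration of the approach only, not Seaside's actual code; the class and method names are made up). Keeping sessions in a last-access-ordered table means the front entry is always the next to expire, so expiry pops from the front in O(1) amortized time. The catch is visible in the code: there is one timeout shared by the whole store, which is exactly the per-session-timeout feature this scheme breaks.

```python
import time
from collections import OrderedDict


class SessionStore:
    """Sketch of an O(1) session registry (illustrative, not Seaside's code).

    Sessions are kept in last-access order, so the entry at the front is
    always the one that expires first -- expiry never has to scan the table.
    """

    def __init__(self, timeout=600):
        self.timeout = timeout         # ONE timeout shared by all sessions
        self.sessions = OrderedDict()  # session key -> last-access timestamp

    def create(self, key):
        self._expire_old()
        self.sessions[key] = time.time()

    def touch(self, key):
        # Refresh a session: re-stamp it and move it to the back so the
        # table stays sorted by last access.
        self.sessions[key] = time.time()
        self.sessions.move_to_end(key)

    def _expire_old(self):
        # Pop expired sessions from the front; stop at the first live one.
        now = time.time()
        while self.sessions:
            key, stamp = next(iter(self.sessions.items()))
            if now - stamp < self.timeout:
                break
            del self.sessions[key]
```

A heap-based variant would let every session carry its own deadline at O(log n) per operation, which is the trade-off described above.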

>
> Unless I'm mistaken about this, and I would appreciate any guidance, it
> seems like we need to add a robots.txt to the site which guides or
> simply asks crawlers to stay away.  Thoughts?  I'm no SEO expert.

This should do it:
User-agent: *
Disallow: /
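[Editor's illustration] That two-line file can be sanity-checked with Python's standard-library robots.txt parser (shown here purely as an outside illustration; SqueakSource itself is not involved):

```python
from urllib.robotparser import RobotFileParser

# The blanket rules suggested above: every path disallowed for every agent.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Any compliant crawler would now skip the whole site.
print(rp.can_fetch("Googlebot", "http://source.squeak.org/trunk.html"))  # False
```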


Levente

>
> Ken


Re: SqueakSource indexability (aka should we just ask crawlers to desist?)

Bert Freudenberg
In reply to this post by Ken Causey-3
On 28.04.2010, at 22:08, Ken Causey wrote:

>
>> [...]
>>
>> We do have a robots.txt:
>> http://source.squeak.org/robots.txt
>
> Aha.  Well, I know little about this subject.  But if this means what I
> think it means it seems that the crawlers are ignoring it.

I just read up on it. Glob patterns are *not* allowed; the single asterisk in the user-agent line is a special character, not a pattern match. We used

User-agent: *
Disallow: /@*

But it should be

User-agent: *
Disallow: /@
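[Editor's illustration] The difference between the two files can be demonstrated with Python's `urllib.robotparser`, which implements the plain prefix matching of the original robots.txt convention (the session-style URL below is a made-up example, assuming Seaside links that start with @):

```python
from urllib.robotparser import RobotFileParser

old = RobotFileParser()
old.parse(["User-agent: *", "Disallow: /@*"])  # '*' is literal in a path

new = RobotFileParser()
new.parse(["User-agent: *", "Disallow: /@"])   # plain prefix match

url = "http://source.squeak.org/@a1b2c3"       # hypothetical session link
old.can_fetch("Googlebot", url)  # True  -- no real path starts with '/@*'
new.can_fetch("Googlebot", url)  # False -- everything under '/@' is blocked

# Ordinary pages stay crawlable either way:
new.can_fetch("Googlebot", "http://source.squeak.org/trunk.html")  # True
```

Some large crawlers do honor `*` wildcards as a later extension, but the original convention, which this parser follows, is prefix-only, so the plain `/@` rule is the portable choice.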

I'm going to fix that, let's see how it works out.

- Bert -




RE: SqueakSource indexability (aka should we just ask crawlers to desist?)

Ken Causey-3
In reply to this post by Ken Causey-3
> -------- Original Message --------
> Subject: Re: [squeak-dev] SqueakSource indexability (aka should we just
> ask crawlers to desist?)
> From: Bert Freudenberg <[hidden email]>
> Date: Wed, April 28, 2010 3:31 pm
> To: The general-purpose Squeak developers list
> <[hidden email]>

> I just read up on it. Glob patterns are *not* allowed; the single asterisk in the user-agent line is a special character, not a pattern match. We used
>
> User-agent: *
> Disallow: /@*
>
> But it should be
>
> User-agent: *
> Disallow: /@
>
> I'm going to fix that, let's see how it works out.
>
> - Bert -

Thank you very much Bert, again...

Ken



Re: SqueakSource indexability (aka should we just ask crawlers to desist?)

Levente Uzonyi-2
In reply to this post by Bert Freudenberg
On Wed, 28 Apr 2010, Bert Freudenberg wrote:

> On 28.04.2010, at 22:08, Ken Causey wrote:
>> [...]
>
> I just read up on it. Glob patterns are *not* allowed; the single asterisk in the user-agent line is a special character, not a pattern match. We used
>
> User-agent: *
> Disallow: /@*
>
> But it should be
>
> User-agent: *
> Disallow: /@

Just realized that links generated by Seaside begin with @. Tricky. :)


Levente

>
> I'm going to fix that, let's see how it works out.
>
> - Bert -