Efficient date parsing

Stan Shepherd
I have a large file with dates in the format '2001-11-04'. As I couldn't find a method to change to date, I'm doing:
      date := self dateFrom: (dateString  subStrings: '-').

dateFrom: anArray
        "anArray should be like #('2005' '09' '06')"
        ^Date year: anArray first asInteger month: anArray second asInteger day: anArray third asInteger

This works, but profiling shows half the work in loading the file is in dateFrom:

Is there a more efficient way? Is there a standard method I've missed?

Also, I tried to import the .cs (changeset) of a date parser by Goran, posted here:
http://www.nabble.com/Parsing-dates%21-td10078517.html#a10078517

When I unpack the file, then click 'install' in the file list, I get a syntax error. How should I be loading this file?

Thanks,    ...Stan
RE: Efficient date parsing

Ramon Leon-5

> I have a large file with dates in the format '2001-11-04'. As
> I couldn't find a method to change to date, I'm doing:
>       date := self dateFrom: (dateString  subStrings: '-').
>
> dateFrom: anArray
>         "anArray should be like #('2005' '09' '06')"
>         ^Date year: anArray first asInteger month: anArray second asInteger day: anArray third asInteger
>
> This works, but profiling shows half the work in loading the
> file is in dateFrom:
>
> Is there a more efficient way? Is there a standard method I've missed?

On my Date class, class side, I use the following extension methods for
parsing dates.

fromString: aString format: aFormat
        aFormat = #dmy
                ifTrue: [^ self readEuro: aString readStream ].
        aFormat = #iso8601
                ifTrue: [^ self readISO: aString readStream].
        ^ self fromString: aString

readEuro: aStream
        "Read a Date in euro format dd-mm-yyyy"
        | day month year |
        aStream skipSeparators.
        day := Integer readFrom: aStream.
        [aStream peek isDigit]
                whileFalse: [aStream skip: 1].
        month := Integer readFrom: aStream.
        [aStream peek isDigit]
                whileFalse: [aStream skip: 1].
        year := Integer readFrom: aStream.
        ^ self
                newDay: day
                month: month
                year: year

readISO: aStream
        "Read a Date in ISO-8601 format yyyy-mm-dd"
        | day month year |
        aStream skipSeparators.
        year := Integer readFrom: aStream.
        [aStream peek isDigit]
                whileFalse: [aStream skip: 1].
        month := Integer readFrom: aStream.
        [aStream peek isDigit]
                whileFalse: [aStream skip: 1].
        day := Integer readFrom: aStream.
        ^ self
                newDay: day
                month: month
                year: year
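
A short workspace sketch of how these methods might be called, assuming the three class-side extensions above have been added to Date exactly as posted. The variable names are illustrative only.

```smalltalk
"Workspace sketch; relies on the fromString:format:, readEuro: and readISO:
 extensions shown above being installed on the class side of Date."
| isoDate euroDate |
isoDate := Date fromString: '2001-11-04' format: #iso8601.
euroDate := Date fromString: '04-11-2001' format: #dmy.
"Both strings name 4 November 2001, so the two Dates should compare equal."
Transcript show: (isoDate = euroDate) printString; cr.
```

Any other format symbol falls through to the stock Date fromString:, so the result in that case depends on the parser that ships with the image.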

Ramon Leon
http://onsmalltalk.com

_______________________________________________
Beginners mailing list
[hidden email]
http://lists.squeakfoundation.org/mailman/listinfo/beginners
Re: Efficient date parsing

Randal L. Schwartz
In reply to this post by Stan Shepherd
>>>>> "stan" == stan shepherd <[hidden email]> writes:

stan> Is there a more efficient way? Is there a standard method I've missed?

One thing you should ensure is that you're not converting the same date twice.
Typical logs have thousands of entries all with the same date, and I've seen
far too many naive solutions that keep converting, over and over again,
'2008-02-15' into Feb 15 2008.  Really no point.

Implement a simple cache:

   ^dateCache at: dateString ifAbsent:
      [dateCache at: dateString put: (Date from: dateString)].

something like that.
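
Since the snippet is offered only as "something like that", here is one way it could be fleshed out in a workspace. This is a sketch under assumptions: it swaps the Date from: call for the year:month:day: conversion from Stan's original post, and dateCache, parseDate and lookup are illustrative names, not part of any library.

```smalltalk
"Workspace sketch of a string-to-Date cache."
| dateCache parseDate lookup firstLookup secondLookup |
dateCache := Dictionary new.
"Stan's conversion from the original post, wrapped in a block."
parseDate := [:dateString | | parts |
    parts := dateString subStrings: '-'.
    Date
        year: parts first asInteger
        month: parts second asInteger
        day: parts third asInteger].
"Parse only when the string has not been seen before."
lookup := [:dateString |
    dateCache
        at: dateString
        ifAbsent: [dateCache at: dateString put: (parseDate value: dateString)]].
firstLookup := lookup value: '2008-02-15'.    "parsed and stored"
secondLookup := lookup value: '2008-02-15'.   "answered from the cache"
```

The second lookup answers the very same Date object as the first, which is where the saving comes from when a log repeats one date thousands of times.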

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<[hidden email]> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
RE: Efficient date parsing

Ramon Leon-5
>
> Implement a simple cache:
>
>    ^dateCache at: dateString ifAbsent:
>       [dateCache at: dateString put: (Date from: dateString)].
>
> something like that.

Or more idiomatic...

^dateCache at: dateString ifAbsentPut: [Date from: dateString].

Re: Efficient date parsing

Randal L. Schwartz
>>>>> "Ramon" == Ramon Leon <[hidden email]> writes:

Ramon> Or more idiomatic...

Ramon> ^dateCache at: dateString ifAbsentPut: [Date from: dateString].

Yeah, shortly after I posted that, I remembered that. :)

Also, for the beginners: you need to initialize dateCache to a Dictionary.
The normal way to do that is to have an instance-side method called
#initialize, like this:

        initialize
                super initialize. "NEVER NEVER leave this out"
                dateCache := Dictionary new.

When you save this, Squeak will complain that it doesn't know dateCache
("I dunno dateCache"), and you should choose to make it an instance variable.

RE: Efficient date parsing

Ramon Leon-5
>
> Yeah, shortly after I posted that, I remembered that. :)
>
> Also, for the beginners. you need to initialize dateCache to
> a Dictionary.
> the normal way to do that is to have an instance side method called
> #initialize:
>
>         initialize
>                 super initialize. "NEVER NEVER leave this out"
>                 dateCache := Dictionary new.
>
> When you save this, squeak will ask "I dunno dateCache", and
> you should make it an instance variable.

Unless you're doing a class side #initialize, in which case you don't want
to call super initialize.
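
The difference matters because a class-side #initialize is normally run once, when the class is loaded or when it is sent explicitly, and sending super initialize from there can re-run initialization code inherited from the superclass's class side, which is usually not what you want. A minimal sketch, assuming a hypothetical class DateLoader that keeps the cache in a class variable named DateCache (both names are made up for illustration):

```smalltalk
"Class side of the hypothetical DateLoader; DateCache is assumed to be
 declared as a class variable of that class."
initialize
    "Typically run when the class is filed in, or by evaluating
     'DateLoader initialize'. Deliberately no 'super initialize' here."
    DateCache := Dictionary new
```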

Re: Efficient date parsing

Bert Freudenberg
On 03.04.2008, at 07:49, Ramon Leon wrote:

> Unless you're doing a class side #initialize, in which case you don't
> want to call super initialize.

Indeed, that's one reason why I prefer lazy initialization over  
#initialize:

dateFrom: aString
        dateCache ifNil: [dateCache := Dictionary new].
        ^dateCache at: aString ifAbsentPut: [Date from: aString].

(although the canonical way is to have the dateCache init code in a  
#dateCache accessor, then always use "self dateCache").
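
A minimal sketch of that accessor style, assuming a hypothetical loader class with an instance variable named dateCache. Date from: is kept as written in the posts above; substitute whatever conversion you actually use.

```smalltalk
"Instance side of a hypothetical loader class."
dateCache
    "Answer the cache, creating it on first use."
    ^dateCache ifNil: [dateCache := Dictionary new]

dateFrom: aString
    "All access goes through the accessor, never the variable directly."
    ^self dateCache at: aString ifAbsentPut: [Date from: aString]
```

With this layout the nil check lives in exactly one place, and every other method can send self dateCache without caring whether the cache exists yet.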

One major plus of lazy initialization is that this supports code  
upgrades of a running system, where #initialize is usually not run  
again.

Also, if this indeed is for log data with many identical dates in a  
row you might flush the cache from time to time, or indeed even only  
check if the next date is the same as the previous ... knowledge of  
your domain beats any general optimization ;)
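
The "same as the previous date" idea can be sketched in a workspace as follows. This is only an illustration: it reuses the year:month:day: conversion from Stan's original post, and lastString, lastDate and convert are made-up names.

```smalltalk
"Workspace sketch: remember only the most recent string and its Date."
| lastString lastDate convert |
convert := [:dateString |
    dateString = lastString ifFalse: [
        | parts |
        parts := dateString subStrings: '-'.
        lastDate := Date
            year: parts first asInteger
            month: parts second asInteger
            day: parts third asInteger.
        lastString := dateString].
    lastDate].
convert value: '2008-02-15'.    "parses"
convert value: '2008-02-15'.    "reuses the previous Date"
convert value: '2008-02-16'.    "parses again"
```

Compared with a Dictionary this needs no flushing and uses constant memory, but it only pays off when identical dates really do arrive in runs.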

- Bert -


Re: Efficient date parsing

Stan Shepherd
In reply to this post by Randal L. Schwartz
Thanks Randal and Ramon. That's about three times quicker; most of the improvement seems to be in the caching.
...Stan
Re: Efficient date parsing

Bert Freudenberg
On 03.04.2008, at 09:20, stan shepherd wrote:
>
> Thanks Randal and Ramon. That's about three times quicker; most of the
> improvement seems to be in the caching.
> ...Stan


Also, do profile your code.
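
For a quick measurement before reaching for the full profiler, something along these lines can be evaluated in a workspace. It is a sketch under assumptions: the test data is made up, and the conversion is the one from Stan's original post.

```smalltalk
"Workspace sketch: time 10000 uncached conversions of the same string."
| dates elapsed |
dates := (1 to: 10000) collect: [:i | '2008-02-15'].
elapsed := Time millisecondsToRun: [
    dates do: [:each | | parts |
        parts := each subStrings: '-'.
        Date
            year: parts first asInteger
            month: parts second asInteger
            day: parts third asInteger]].
Transcript show: 'uncached: ', elapsed printString, ' ms'; cr.
"For a per-method breakdown, wrap the same block in MessageTally spyOn: [...]"
```

Running the same loop against a cached version gives a direct before-and-after comparison on your own data.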

- Bert -


Re: Efficient date parsing

Göran Krampe
Hi!

I tried downloading my original attachment and the problem is that it
was/is double-gzipped - not sure why. Anyway, attaching once more the
ChangeSet. It installs fine in 3.10beta7159 at least.

You install it by decompressing it and then using the "install" button in
the file list, or by just using "filein" on the compressed file directly
(not sure why there is no install button in that case, but anyway).

I also include the method comment below to show you what it gives you:

readFrom: inputStream pattern: pattern
        "Read a Date from the stream based on the pattern which can include the tokens:

                y = A year with 1-n digits
                yy = A year with 2 digits
                yyyy = A year with 4 digits
                m = A month with 1-n digits
                mm = A month with 2 digits
                d = A day with 1-n digits
                dd = A day with 2 digits

        ...and any other Strings inbetween. Representing $y, $m and $d is done using
        \y, \m and \d and slash itself with \\. Simple example patterns:

                'yyyy-mm-dd'
                'yyyymmdd'
                'yy.mm.dd'
                'ymd'

        A year given using only two decimals is considered to be >2000."
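
Going by that comment, a call for the format in the original question might look like the sketch below. It assumes the changeset is installed and that readFrom:pattern: lives on the class side of Date; the thread shows only the selector and its comment, so the receiver is a guess.

```smalltalk
"Assumes DateReadFromPattern.cs is installed and defines readFrom:pattern:
 on the class side of Date (an assumption, not confirmed in the thread)."
| date |
date := Date readFrom: '2001-11-04' readStream pattern: 'yyyy-mm-dd'.
```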



regards, Göran

Attachment: DateReadFromPattern.cs.gz (1K)
Re: Efficient date parsing

Stan Shepherd
goran-14 wrote:
> Hi!
>
> I tried downloading my original attachment and the problem is that it
> was/is double-gzipped - not sure why.

I've seen the same thing with a few snippets I've downloaded.

goran-14 wrote:
> I also include the method comment below to show you what it gives you:

Thanks for that.

...Stan