I've been involved in a long and sometimes weird discussion on the RaspberryPi forums about Fibonacci implementations in different languages using assorted algorithms. I'd never really thought about alternatives to recursive approaches before but it turns out there is a massively faster way. Clearly, this is not to replace the benchFib code since that has a specific use, but if anyone with an interest in numerics wants a faster way to generate fib(8577) for example, this is how -
Integer >> fastDoublingFib
	"derived from https://www.nayuki.io/page/fast-fibonacci-algorithms"
	"(1 to: 20) collect: [:i | i fastDoublingFib]"
	"testing a quite large one - "
	"8577 fastDoublingFib= 13724780954457889052017147206983806244049002655849289934662148526555403650297300643609932653364630032094175733360509578504423049114004867523161091564824674600978308740378989479162611031273424686573759784740689516901416473328655422390895971263265867635819510949686102571388751758998017349379498918930220900180038324615253426530740473855269056304380498630108126734047701362218655030270360608989081398866489746698916626888016696892691933237089180504631788915650757757515944644732966345269202761471025651071790297611079540881155092137592980230998234868586211881093892491570520577408961869977892273540956424095750855208529072246641778982103984467921868950012668004047986803017482248992968482737462668300684879633714025755790485860328854796518843956263863014632532331145699658530054942590047273403691531821918862996422405159427262092477196755988981309029424760342802374213122162727840557722145891090413688461745240415668189577836068480363407847582529735341950500636735281963089675493707159434777756081146452522323681782226760627277553296721358921412115264845467834979154061137421532609247762981818564578019888974692581079593575783553856910367568474613323528337733872069223030834774749130478360574004172522316484339530942110067893000847800932306298725285623628731149337468217751734165148732164148285915275115006479682658150442259002271790547596033006363411193653337536041106069912826015502035140618407668385378737477702597473151509972754111640855347958033314453349633268543893894677097708945041254623018915871109789412793709229204261914803477697183287924195770678873001065036313926288444791424871512110658175954743584548831946767673488152740675550518235698898217693311515366329280005757014637854214769152690638778904780724293185353992279724740604674926819294787586671833537117545443846365508358918882"
	| a b c |
	a := 0.
	b := 1.
	self highBit to: 1 by: -1 do: [:i | | d e |
		"doubling step: with a = F(k) and b = F(k+1), compute F(2k) = F(k) * (2*F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2"
		d := (b bitShift: 1) - a * a.
		e := a squared + b squared.
		a := d.
		b := e.
		(self bitAt: i) = 1 ifTrue: [
			c := a + b.
			a := b.
			b := c]].
	^a

I bet it could be sped up more, but a simple spyOn run just allocated all the primitive time to delay waiting. Didn't we improve that some time ago?

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: BYEBYE: Store in Write-Only Storage
On Tue, 16 Apr 2019, tim Rowledge wrote:
> I've been involved in a long and sometimes weird discussion on the RaspberryPi forums about Fibonacci implementations in different languages using assorted algorithms. I'd never really thought about alternatives to recursive approaches before but it turns out there is a massively faster way. Clearly, this is not to replace the benchFib code since that has a specific use, but if anyone with an interest in numerics wants a faster way to generate fib(8577) for example, this is how -
>
> Integer >> fastDoublingFib
> [snip]
>
> I bet it could be sped up more, but a simple spyOn run just allocated all the primitive time to delay waiting. Didn't we improve that some time ago?

That's some pretty nifty code. Other than the typical tricks, you can't make it any quicker for larger receivers.
What are these typical tricks?
- inlining methods
- decreasing the number of temporaries needed
- reordering temporaries so that the more frequently used ones use registers
- declare all temporaries at the method level so that the compiler can skip a few bytecodes setting their values to nil in each loop

But as the receiver gets larger, the LargeInteger primitives will dominate the run time more and more, so the tricks will have negligible effect on performance.

And yes, we do have a better profiler: AndreasSystemProfiler. It hasn't been integrated into Squeak yet for some reason, but it's available on SqueakMap.
Levente

> tim
> --
> tim Rowledge; [hidden email]; http://www.rowledge.org/tim
> Strange OpCodes: BYEBYE: Store in Write-Only Storage
> On 2019-04-17, at 4:26 AM, Levente Uzonyi <[hidden email]> wrote:
>
> On Tue, 16 Apr 2019, tim Rowledge wrote:
>
>> I've been involved in a long and sometimes weird discussion on the RaspberryPi forums about Fibonacci implementations in different languages using assorted algorithms. I'd never really thought about alternatives to recursive approaches before but it turns out there is a massively faster way. Clearly, this is not to replace the benchFib code since that has a specific use, but if anyone with an interest in numerics wants a faster way to generate fib(8577) for example, this is how -
>>
>> Integer >> fastDoublingFib [snip]
>
> That's some pretty nifty code. Other than the typical tricks, you can't make it any quicker for larger receivers.
> What are these typical tricks?
> - inlining methods
> - decreasing the number of temporaries needed
> - reordering temporaries so that the more frequently used ones use registers
> - declare all temporaries at the method level so that the compiler can skip a few bytecodes setting their values to nil in each loop
>
> But as the receiver gets larger, the LargeInteger primitives will dominate the run time more and more, so the tricks will have negligible effect on performance.

A spyOn of the amusing fib(4784969) (find the first fib which has more than a million digits, apparently) showed 99.8% in primitives, within the limits of the spyOn code.

There is also "Karatsuba fast multiplication" on that same site, which is apparently faster for biiiiiig numbers. There is even example code in Python and Java. Who feels like trying it out?

I'll note that a C implementation of the fast-doubling-fib-with-karatsuba code is on github at https://github.com/ZiCog/fibo_4784969/tree/master/c and declares a runtime of 63 seconds on a Pi 3. I got 72 on Squeak on my Pi 3 (without Karatsuba multiply). So pretty good so far.

> And yes, we do have a better profiler: AndreasSystemProfiler. It hasn't been integrated into Squeak yet for some reason, but it's available on SqueakMap.

Perhaps I'll try it out just for fun.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Wise as the world is flat.
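For anyone who does want to try the Karatsuba idea: the textbook version is only a dozen lines of Smalltalk. This is just a sketch, assuming non-negative operands and an arbitrary cutoff; it is neither Nicolas's published code mentioned below nor the github C version:

Integer >> karatsubaTimes: anInteger
	"Sketch of textbook Karatsuba for non-negative integers: split both operands
	 at m bytes and do three recursive multiplies instead of four.
	 The 64-byte cutoff is a guess; below it the ordinary #* primitive wins."
	| m shift xLow xHigh yLow yHigh low high mid |
	(self digitLength min: anInteger digitLength) <= 64
		ifTrue: [^self * anInteger].
	m := (self digitLength max: anInteger digitLength) // 2.
	shift := m * 8.
	xHigh := self bitShift: shift negated.
	xLow := self bitAnd: (1 bitShift: shift) - 1.
	yHigh := anInteger bitShift: shift negated.
	yLow := anInteger bitAnd: (1 bitShift: shift) - 1.
	low := xLow karatsubaTimes: yLow.
	high := xHigh karatsubaTimes: yHigh.
	mid := ((xLow + xHigh) karatsubaTimes: (yLow + yHigh)) - low - high.
	^(high bitShift: 2 * shift) + (mid bitShift: shift) + low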
Hi Tim,
I have had a Karatsuba implementation published in Squeak for years; it's quite easy. Andres Valloud provided one in VW too, and probably in Cuis as well...

On Wed, 17 Apr 2019 at 20:56, tim Rowledge <[hidden email]> wrote:
Let me see... For example here in 2011:

MCHttpRepository
	location: 'http://www.squeaksource.com/FactorialContest'
	user: ''
	password: ''.

I use Karatsuba a lot for ArbitraryPrecisionFloat... But there is an even more efficient algorithm for monstrous integers, Schönhage-Strassen, which uses a kind of Fourier transform, but it is harder to program... And I think that I recently heard that an algorithm approaching the O(log N) barrier has been published...

On Wed, 17 Apr 2019 at 21:28, Nicolas Cellier <[hidden email]> wrote:
On Wed, 17 Apr 2019 at 21:38, Nicolas Cellier <[hidden email]> wrote:
O(N log N) of course...
On Wed, 17 Apr 2019 at 21:39, Nicolas Cellier <[hidden email]> wrote:
In reply to this post by Nicolas Cellier
> On 2019-04-17, at 12:38 PM, Nicolas Cellier <[hidden email]> wrote:
>
> Let me see...
> For example here in 2011:
>
> MCHttpRepository
>	location: 'http://www.squeaksource.com/FactorialContest'
>	user: ''
>	password: ''.

Nice, nice! That does fib(4784969) in 14 seconds on my Pi3, or 5 times faster. Wow. And, er, 4 times faster than the C code version on that github repo :-)

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
"How many Slavers does it take to change a lightbulb?"
"Dunno. How susceptible are lightbulbs to telepathy?"
And just because it's fun, here's a small cs file that can be run from a commandline (as in `squeak my.image bigFib.cs`) to write the time taken to stdout -
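The attachment itself isn't reproduced in this archive, but a minimal chunk-format script along these lines would do the job. It assumes the fastDoublingFib method from earlier in the thread is already in the image, and the snapshot-and-quit at the end is optional:

| ms |
ms := [4784969 fastDoublingFib] timeToRun.
FileStream stdout nextPutAll: 'fib(4784969) computed in ', ms printString, ' ms'; cr.
Smalltalk snapshot: false andQuit: true.!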
tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Science adjusts its views based on what is observed. Faith denies observation so belief can be preserved

bigFib.cs.zip (4K) Download Attachment
In reply to this post by timrowledge
I'm being told that the .cs file I posted doesn't work on a Windows VM because of an apparent inability to write to stdout on Windows - is this correct? *should* it work and it just got broken?
tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: KFP: Kindle Fire in Printer
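A quick way to check on a Windows box, assuming the image was started from a command prompt so there is a console to write to, is a workspace one-liner; nothing appearing on the console suggests the stdout stream really is broken:

FileStream stdout nextPutAll: 'hello from Squeak'; cr; flush.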
Hi Tim,
I've posted some enhancements with a 3-way Toom-Cook multiplication algorithm (it is a generalization of Karatsuba, but we divide into n parts instead of 2; here, 3 parts for Toom-3).
Use fastMultiply: instead of karatsubaTimes:
Maybe I'll post more enhancements if time permits...

On Fri, 19 Apr 2019 at 22:41, tim Rowledge <[hidden email]> wrote:
> I'm being told that the .cs file I posted doesn't work on a Windows VM because of an apparent inability to write to stdout on Windows - is this correct? *should* it work and it just got broken?
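For context on why the three-way split wins: Karatsuba replaces one full-size multiply with 3 half-size ones, so it runs in O(n^(log2 3)), roughly O(n^1.585); Toom-3 replaces it with 5 third-size multiplies, giving O(n^(log3 5)), roughly O(n^1.465). The evaluation and interpolation bookkeeping is heavier, which is why both only pay off above some operand size.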
> On 2019-04-24, at 1:29 PM, Nicolas Cellier <[hidden email]> wrote:
>
> Hi Tim,
> I've posted some enhancements with a 3-way Toom-Cook multiplication algorithm
> (it is a generalization of karatsuba, but we divide in n parts instead of 2, here 3 parts for Toom-3)

Wow. That's nearly twice as fast for the fib(4784969) - 8.4 sec. Amazing. So about 10x over the original.

Now *printing* the number is rather slower and if you have any magic code to improve that it would be interesting.

I would *strongly* support putting the algorithm improvements into trunk. Very little code for colossal speedup and a really interesting exemplar of advanced algorithms.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Strange OpCodes: ESBD: Erase System and Burn Documentation
In reply to this post by timrowledge
Several actual things (as opposed to some really stupid whinging from a couple of people on the Pi forums) have come up in pursuit of faster Fibonacci finding.
First up, still a question -

> On 2019-04-19, at 1:41 PM, tim Rowledge <[hidden email]> wrote:
>
> I'm being told that the .cs file I posted doesn't work on a Windows VM because of an apparent inability to write to stdout on Windows - is this correct? *should* it work and it just got broken?

I looked at the VM code and it appears to be there to make stdout etc for Windows work. I still can't use it on Win10. Then again, just what *does* work properly on Win10? I see lots of complaints about pretty much everything.

Running stuff from a commandline.
Conceptually it ought to be practical to do -
`squeak my.image myCode.cs`
... and it sort of is. The good news is that the readDocumentAtStartup related code mostly gets things right and even handles the file(s) being zipped or on an http server somewhere. Cool!

However, it seems that the stdout related initialisation in FileStream gets run *after* the readDocumentAtStartup code - which means any code wanting to write to stdout is out of luck unless the file handles just happen to be already correct from a previous start/save cycle. Clearly this is unlikely when starting a clean image, especially from a freshly downloaded all-in-one package.

Looking at FileStream class initialise I see
	Smalltalk
		addToStartUpList: self after: SecurityManager; "the intent being before: AutoStart"
and yet inspecting the startup list shows that FileStream comes well *after* AutoStart (which in a somewhat convoluted way makes ProjectLauncher actually do the startup document reading) and so we got things messed up sometime.

Also, it would be nice for there to be an accessible way to have the system read the file and execute it and never open the main window. I have whined about this before to usually uninterested ears, but the main window should not open until some code actually pushes pixels to the glass. That's largely a simple (ish) VM change, to the best of my knowledge. It's always been that way on RISC OS, so it clearly can work. There is, of course, also the possibility that some class in the StartUpList would trigger this before we get to ProjectLauncher>>#startUpAfterLogin. VisualWorks can do this, so we should be able to.

Default file directories, particularly with all-in-one
This is always a tricky one to make consistent; we always know what we *want* it to be but working it out can be fun. The general algorithm we use is 'the directory the image was in' but most OSs have their own idea of what a current working directory is and users often want to work with that instead. I got trapped by this in the fastfib stuff because the all-in-one package has the image somewhere that is most definitely not the CWD. Unless you do actually cd to the image's directory. Trying to explain this to people that are doing their best to make Smalltalk look bad is ... tricky.

As an example, consider using the all-in-one after cd-ing to the Squeak-x.x-All-in-One directory as suggested. Run the vm via the squeak.sh on the image in Contents/Resources - works OK. Now try including the myCode.cs from above; if it is in the cwd it will not be found because the default dir is actually seen as Squeak-x.x-All-in-One/Contents/Resources and you have a puzzled user. Even more fun happens if myCode.cs is 'correctly' referred to with an absolute path but tries to write an output file (because we couldn't write to stdout under Win10, ibid.)
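One way a startup script can sidestep that ambiguity is to resolve its files against the image directory explicitly instead of relying on the default directory. A sketch using standard FileDirectory protocol (the output file name is just an example):

| imageDir out |
imageDir := FileDirectory on: (FileDirectory dirPathFor: Smalltalk imageName).
out := imageDir forceNewFileNamed: 'fibTime.txt'.	"lands next to the image, not in the shell's cwd"
out nextPutAll: 'timing output goes here'; cr.
out close.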
Screen-centred dialogues and the self-centred progressbar
Turns out that an annoyingly common case is for a dialogue to try to pop up - to ask for what to do about some error? - whilst the progressbar is running. At the same centre of the screen. Where the progressbar insists on being on top and hiding the dialogue. Preventing anyone from seeing that there is a dialogue to respond to. Irritating all and sundry.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Thinks E=MC^2 is a rap star.
In reply to this post by timrowledge
On Fri, Apr 26, 2019 at 4:02 AM tim Rowledge <[hidden email]> wrote:
+1
On Fri, 26 Apr 2019 at 13:55, Fabio Niephaus <[hidden email]> wrote:
The changes are in the inbox now; they should have associated tests.

For the printing, the cost is dominated by division, and division cost is dominated by multiplication. We could opt for a divide-and-conquer algorithm there too, or a mixed Barrett/Newton-Raphson division, which is the most efficient known algorithm. See for example http://bioinfo.ict.ac.cn/~dbu/AlgorithmCourses/Lectures/Lec5-Fast-Division-Hasselstrom2003.pdf

Another interesting point of view is that most of these divide-and-conquer algorithms lend themselves to easy parallelisation... if we could spawn image clones above a certain recursion level and communicate the results via sockets. The printing, though, would require sharing the target String memory, or we could end up with an O(n*log(n)) reconstruction cost to gather the pieces.
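To make the divide-and-conquer printing idea concrete, here is a minimal sketch; the selector name and thresholds are made up for illustration (this is not the inbox code) and it assumes a non-negative receiver:

Integer >> printDecimalOn: aStream
	"Split around 10^k so both halves are converted recursively; with a
	 subquadratic multiply/divide this beats digit-by-digit conversion."
	| estimatedDigits k divisor high low lowDigits |
	self digitLength <= 16 ifTrue: [^aStream nextPutAll: self printString].
	estimatedDigits := self highBit * 1233 // 4096 + 1.	"1233/4096 ~= log10(2)"
	k := estimatedDigits // 2.
	divisor := 10 raisedTo: k.
	high := self // divisor.
	low := self - (high * divisor).
	high printDecimalOn: aStream.
	lowDigits := low printString.
	k - lowDigits size timesRepeat: [aStream nextPut: $0].	"zero-pad the low half"
	aStream nextPutAll: lowDigits

Used as, say, String streamContents: [:s | 4784969 fastDoublingFib printDecimalOn: s]. The real win only appears once the 10 raisedTo: k and the division themselves use the fast multiplication.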
In reply to this post by timrowledge
> On 2019-04-25, at 7:40 PM, tim Rowledge <[hidden email]> wrote:
>
> Looking at FileStream class initialise I see
>	Smalltalk
>		addToStartUpList: self after: SecurityManager; "the intent being before: AutoStart"
> and yet inspecting the startup list shows that FileStream comes well *after* AutoStart (which in a somewhat convoluted way makes ProjectLauncher actually do the startup document reading) and so we got things messed up sometime.

It seems all we need is to run FileStream initialise to fix this, at least in a current image.

I'd guess that making sure the order of things in the startup list is correct is something the ReleaseManager ought to be able to do. I would have guessed that it might use SmalltalkImage class>>#initializeStartUpList but can't see a connection in the code right now. Mind you, that method has a fairly big bug in that it ignores the ordering desires of all the classes involved, which may explain the original problem.

We can't simplistically run through the list of classes currently in the startup list and send #initialize to them since any number of them might do much more stuff in those methods, including stuff that could be destructive. We would need to separate out the startup list related code and only run that. So conceptually,
a) in all senders of #addToStartUpList* and related methods, split out the startup list part to an agreed method such as #addToStartUpSequence that handles the ordering requirements, if any. A default null version in Object would be nice.
b) to rebuild the startup list, take a copy of the current list and send each item #addToStartUpSequence
c) some classes may need to be checked out for where they need to be in the ordering - I see that ProcessorScheduler is mentioned particularly in SmalltalkImage class>>#initializeStartUpList and yet its #initialize makes no reference to #addToStartUpList*
d) use the rebuilt method from ReleaseBuilder

Ideas?

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Lights are on but nobody's home.
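To make (a) concrete for the FileStream case, using the declaration quoted above (#addToStartUpSequence is just the name proposed here, nothing that exists yet):

FileStream class >> addToStartUpSequence
	"Only the startup-list registration, split out of the class initialization,
	 so the list can be rebuilt without re-running anything else."
	Smalltalk addToStartUpList: self after: SecurityManager	"the intent being before: AutoStart"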
In reply to this post by Nicolas Cellier
On Fri, 26 Apr 2019 at 18:10, Nicolas Cellier <[hidden email]> wrote:
Just ask, it's in the inbox now :)
> On 2019-04-27, at 1:43 AM, Nicolas Cellier <[hidden email]> wrote:
>
> Just ask, it's in the inbox now :)

Good grief. That's some interesting code.

BUT - it isn't in the inbox so far as I can see; the newest Kernel mcz is http://source.squeak.org/inbox/Kernel-nice.1216.mcz ... oh, wait, I see you've put an even newer version in trunk. Phew, I thought you might have been suffering from uploading problems like my NuScratch stuff.

I merged (because I had my fastfib method already) your Kernel-nice.1220 and tried out [4784969 fastDoublingFib asString] timeToRun. It crashed because of no Interval>>digitShiftSum: method, which needs to go into Collections I guess.

Previously: [4784969 fastDoublingFib asString] timeToRun -> 136833 ms (on my Pi3+)
Now: [4784969 fastDoublingFib asString] timeToRun -> 21060 ms

Call it 6.5x faster. Assuming we are all confident about the correctness I'd urge making that the default multiply for large ints. Unless somebody is going to improve the plugin to assist in making this even faster?

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- Talks to plants on their own level.
On Sat, 27 Apr 2019 at 20:26, tim Rowledge <[hidden email]> wrote:
Oups! It's in Collections; I forgot to commit it!

> Previously: [4784969 fastDoublingFib asString] timeToRun -> 136833 ms (on my Pi3+)

> Assuming we are all confident about the correctness I'd urge making that the default multiply for large ints. Unless somebody is going to improve the plugin to assist in making this even faster?

The problem is that fastMultiply: just adds overhead for every moderately sized LargeInteger, which is the most common case, so I hesitate to make it the default.

Implementing divide and conquer in the primitives is also not a pleasant path: divide and conquer often trades space for speed, so it means that we must massively allocate new objects in primitives, which is a difficult task. Or reuse some preallocated memory, which gives a horrible taste to these beautiful and simple algorithms. Also it hides those algorithms from our view, makes debugging not a pleasure, and makes improvement a less enjoyable hobby.

What we could do instead is to make some primitive fail when digitLength is above a certain threshold. That threshold would be a parameter controlled from the image, and a default value of 0 at VM startup could mean no primitive failure, making the VM friendly to those images that are not Karatsuba-aware...
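For reference, the image-side version of that threshold test is just a guard along these lines (a sketch with a made-up cutoff, not the actual inbox code):

Integer >> fastMultiply: anInteger
	"Route to divide-and-conquer only when both operands are big enough to
	 amortise its overhead; otherwise let the ordinary primitive-backed #* run.
	 The 512-byte cutoff is a placeholder and would need measuring."
	(self digitLength min: anInteger digitLength) <= 512
		ifTrue: [^self * anInteger].
	^self karatsubaTimes: anInteger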
On Sat, 27 Apr 2019 at 22:49, Nicolas Cellier <[hidden email]> wrote:
Hmm, after more thought and some experiments, I retract what I said. The overhead in Squeak is very marginal, at least on the 64-bit VM, so we can make #fastMultiply: the default. The last time I tried was in VW, where the arrangement of primitives is a bit different.

Only fools never change their minds ;)