10-18-2023 04:17 AM
@altenbach wrote:
@Yamaeda wrote:
I tried with switching to U64, but that was 2.5x slower!Yes, that has always been my experience. (but I did try here too, adjusting the base much higher, of course). It seems that 32bit integer operation are especially efficient. Not sure that it has to be this way...
Or is it LV not actually adding 2 64 bit numbers under the hood, but doing some 2x32 bit shenanigans? There's no reason adding 64 bits should be slower, right? Unless it down on assembly level can do a QADD (quick add) and integrate the number of a 32 bit integer in the instruction and a 64 bit requires a memory call.
(On the old Motorola 68000 you could include numbers up to 7 in an QADD command, else you'd have to use ADD where it was faster with a constant than a memory address)
10-18-2023 04:35 AM - edited 10-18-2023 04:46 AM
@altenbach wrote:
[...] It seems that 32bit integer operation are especially efficient. Not sure that it has to be this way...
I guess the compiler is making use of SIMD instructions available on the processor - multiple integers are loaded into the processor registers, then the same arithmetic operation is done to them in parallel, giving a speedup of factor 2 between 32 and 64 bit. Depending on the instruction set and processor architecture, 32 bit integer SIMD might be the preferred mode (e.g. having more 32bit registers available, ...) or the additional .5 comes from the overhead of having twice as many load/store operations per arithmetic operation.
Checking the disassembly of the generated code could shed some light on this, but one would still need to double check the processors documentation for the specific timing details. Quite an involved task I think.
10-18-2023 06:27 AM
@altenbach a écrit :
Here's a quick draft how you could use Base 1G instead of base 10 to calculate the N'th Fibonacci number.
See what speed you get (on my very (very!!) old laptop, its about 40s for 1M.)
... and yes, there are probably much faster ways to do all that..... 😄
Hi
Not much faster, but just a little bit.
a) if I am getting the last SR not the previous one (drop one addition)
b) to avoid checking i=0 in the final loop, I remove and convert the first digit in the last array.
With these additions I get, calculating 1M point, from 20.8 sec to 19.3sec.
It seems I get the same values (just that you cannot calculate a number smaller than 3 🙂 )
Regards
10-18-2023 12:05 PM
My code was designed to give the correct result for all positive inputs at the cost of at most 2 extra iteration (typically a tiny fraction of a percent!). I doubt that the differences are really significant. Also the formatting was timed and not considered important to further improve. It can be sped up using very different techniques (not shown).
One change you can make is remove the Q&R inside the case structure and just append a "1" in the default case. My original "carry" code was designed for the general case where the IQ can be >1.
Note that even switching from U32to I32 ~doubles (!!!) the elapsed time. Same as replacing the inner Q&R with explicit code as you originally did.