09-22-2020 02:43 PM
@GerdW wrote:
Loopless:
Have you counted the allocation dots? Probably significantly more memory overhead with all these separate intermediary arrays (2D, 1D). Not sure if a few ms are worth it...
09-22-2020 02:45 PM
09-22-2020 03:04 PM - edited 09-22-2020 03:29 PM
Curiously, interlacing is faster than my typecasting (~19 ms, or about 8 ms when the loop is parallelized).
I am sure we can squeeze a little bit more out of it, but going from 5 minutes to 10ms is quite good, IMHO ;))
The first estimate of a 1000x improvement was low. We got about 30000x (5 minutes is 300,000 ms, versus roughly 10 ms now)! 😮
09-22-2020 03:19 PM - edited 09-22-2020 03:22 PM
09-22-2020 03:27 PM - edited 09-22-2020 03:36 PM
@crossrulz wrote:
Preallocating the array and not reshaping seems to help quite a bit.
It surprised me that the concatenating tunnel is about 20x slower, even though the compiler could probably figure out exactly what to do based on the build array with two scalars inside the loop.
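For anyone following along without LabVIEW open, here is a rough text-language analogy of the two approaches being compared. It is only a sketch: Python lists are not LabVIEW arrays, the function names are made up for illustration, and it will not reproduce the timings quoted in this thread. The first function sizes the output once and writes each pair in place (like preallocating and using Replace Array Subset); the second builds a two-element chunk per iteration and lets the output grow (loosely like Build Array feeding a concatenating tunnel).

```python
def interleave_preallocated(a, b):
    """Allocate the full output once, then write each pair in place."""
    n = min(len(a), len(b))
    out = [0] * (2 * n)          # single allocation, sized up front
    for i in range(n):
        out[2 * i] = a[i]        # even slots take values from a
        out[2 * i + 1] = b[i]    # odd slots take values from b
    return out


def interleave_concatenating(a, b):
    """Build a two-element chunk per iteration and append it to the output."""
    out = []                     # starts empty and grows as the loop runs
    for x, y in zip(a, b):
        out.extend([x, y])       # hypothetical stand-in for the concatenating tunnel
    return out


if __name__ == "__main__":
    a = [1, 2, 3]
    b = [10, 20, 30]
    print(interleave_preallocated(a, b))
    print(interleave_concatenating(a, b))
```

Both functions return the same interleaved array; the only difference is whether the output is sized once or grown during the loop, which is the distinction being benchmarked above.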
Here's what I probably would do in the end:
09-22-2020 03:44 PM