05-16-2024 03:33 PM
I'm aware that physical disk access is sequential, but the OS does a pretty good job at making it multi-threaded, at least in appearance. You can do a disk benchmark with any tool (e.g. CrystalDiskMark) and see that for small data chunks (4 KB or so) multi-threaded random access is significantly faster than single-threaded because of how the OS handles the I/O and caching. This scenario would benefit from multiple simultaneous file references; for large chunks (megabyte size or more) it shouldn't make any difference, and the semaphore solution should be the easiest way to go.
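Just to illustrate the semaphore idea in plain code, here's a rough C sketch (Win32; the file name, chunk size and chunk layout are made up for the example). A binary semaphore plays the role of LabVIEW's Semaphore VIs, serializing access to the single shared file reference:

#include <windows.h>
#include <process.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define N_THREADS         4
#define CHUNK_SIZE        (1024 * 1024)   /* 1 MB chunks */
#define CHUNKS_PER_THREAD 64

static FILE  *shared_file;   /* the single shared file reference */
static HANDLE file_sem;      /* binary semaphore guarding it */

static unsigned __stdcall worker(void *arg)
{
    int id = (int)(intptr_t)arg;
    char *buf = malloc(CHUNK_SIZE);
    if (buf == NULL) return 1;
    for (int i = 0; i < CHUNKS_PER_THREAD; i++) {
        long long offset = ((long long)id * CHUNKS_PER_THREAD + i) * CHUNK_SIZE;
        WaitForSingleObject(file_sem, INFINITE);    /* acquire semaphore */
        _fseeki64(shared_file, offset, SEEK_SET);   /* 64-bit seek (MSVC CRT) */
        fread(buf, 1, CHUNK_SIZE, shared_file);
        ReleaseSemaphore(file_sem, 1, NULL);        /* release semaphore */
        /* ...process buf outside the lock... */
    }
    free(buf);
    return 0;
}

int main(void)
{
    HANDLE threads[N_THREADS];
    shared_file = fopen("testfile.bin", "rb");      /* hypothetical test file */
    if (shared_file == NULL) return 1;
    file_sem = CreateSemaphore(NULL, 1, 1, NULL);   /* count 1 = binary */
    for (int i = 0; i < N_THREADS; i++)
        threads[i] = (HANDLE)_beginthreadex(NULL, 0, worker, (void *)(intptr_t)i, 0, NULL);
    WaitForMultipleObjects(N_THREADS, threads, TRUE, INFINITE);
    for (int i = 0; i < N_THREADS; i++)
        CloseHandle(threads[i]);
    CloseHandle(file_sem);
    fclose(shared_file);
    return 0;
}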
05-16-2024 04:50 PM - edited 05-16-2024 05:06 PM
I really don't know what happens at low level on the disk side.
Below is how I benchmarked it:
The results seem to say that parallel access is faster. Maybe I missed something?
Although I'm a total newbie to this, my curiosity is aroused. So I will follow this thread with interest. 👀
EDIT: changing the number and size of chunks makes the results vary A LOT. Always in favor of parallel from what I have seen, but not by that much in some cases.
05-16-2024 07:10 PM - edited 05-16-2024 07:18 PM
Interesting, and also pretty much what I expected (though I thought the performance gain would be lower for 1 MB chunks). I assume the ratio increases in favor of parallel much more for KB-sized chunks. Even for larger chunks you get a multi-threaded improvement because the OS does some read-ahead and caching.
In any case, your parallel FOR loop might mess up the thread assignment to the correct file reference because you assume sequential partitioning of the iterations. Did you verify that the read sequence is correct? You can generate a test file with (e.g.) ascending numbers and check the order after reading in parallel.
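For example, a quick C sketch of that test (the file name is hypothetical; in LabVIEW the equivalent would be writing an ascending U32 ramp with Write to Binary File and checking the assembled array after the parallel read):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define N_VALUES (1u << 20)   /* ~4 MB pattern file */

/* write an ascending U32 counter pattern to disk */
static int write_pattern(const char *path)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    for (uint32_t i = 0; i < N_VALUES; i++)
        fwrite(&i, sizeof i, 1, f);
    fclose(f);
    return 0;
}

/* after the (parallel) read, every value must equal its index */
static int check_pattern(const uint32_t *data)
{
    for (uint32_t i = 0; i < N_VALUES; i++)
        if (data[i] != i) {
            printf("order broken at index %u\n", (unsigned)i);
            return -1;
        }
    printf("sequence OK\n");
    return 0;
}

int main(void)
{
    if (write_pattern("pattern.bin") != 0) return 1;
    uint32_t *data = malloc(N_VALUES * sizeof *data);
    if (data == NULL) return 1;
    /* stand-in for the parallel read: a plain sequential fread */
    FILE *f = fopen("pattern.bin", "rb");
    if (f == NULL) { free(data); return 1; }
    fread(data, sizeof *data, N_VALUES, f);
    fclose(f);
    int bad = check_pattern(data);
    free(data);
    return bad ? 1 : 0;
}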
05-17-2024 03:30 AM
That is a good remark!
With auto-partitioning enabled for the parallel iterations, I got 1 out of 5 executions with the data shuffled.
I could fix it by specifying the chunk size via the (C) terminal, with a constant equal to "1".
I am currently running a test to check the consistency of the results. No errors have been encountered so far.
05-17-2024 10:15 AM
Yeah, chunk size 1 should force a deterministic sequential distribution of the iterations over the threads, but the auto-partitioning is more efficient or performant, at least as claimed by NI...
05-18-2024 01:01 AM
@Novgorod wrote:
I'm aware that physical disk access is sequential, but the OS does a pretty good job at making it multi-threaded, at least in appearance. You can do a disk benchmark with any tool (e.g. CrystalDiskMark) and see that for small data chunks (4 KB or so) multi-threaded random access is significantly faster than single-threaded because of how the OS handles the I/O and caching. This scenario would benefit from multiple simultaneous file references; for large chunks (megabyte size or more) it shouldn't make any difference, and the semaphore solution should be the easiest way to go.
You're perfectly right. Moreover, for disk I/O you also have queue depth, but also the controller's maximum bandwidth. For example, on my old laptop I get something around 500 MB/s when reading with block sizes of 1 MB or above; I'll never reach read performance above this limit. When reading with smaller block sizes, like 4K, the performance drops, and I can get better performance with multiple threads.
In your particular case, you have already stored a large file that needs to be processed. If your multithreaded processing algorithm can process the data at the disk I/O speed, then it makes sense to get the maximum possible data throughput from the drive, but this is possible with large I/O block sizes only.
@PinguX wrote:
I really don't know what happens at low level on the disk side.
Below is how I benchmarked it:
The results seem to say that parallel access is faster. Maybe I missed something?
Obviously parallel is faster, because you're working with a 2K (or 4K) block size only (not sure if you have singles or doubles).
If you increase the block size, the difference will get smaller in theory; from a certain block size on, you will get almost the same performance. Theoretically you can keep parallel reading and choose an "optimal" block size.
During benchmarking you must be 100% sure that you are not using a cached file, otherwise you are measuring memory and API performance, not the disk. Benchmarking software typically uses some kind of direct low-level access, but in LabVIEW I think it is sufficient to test with a file that is several times larger than the available RAM.
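As a sketch, creating such a file could look like this (the sizes are assumptions; make TOTAL a few times your RAM):

#include <stdio.h>
#include <stdlib.h>

#define CHUNK (8 * 1024 * 1024)             /* write in 8 MB blocks */
#define TOTAL (32LL * 1024 * 1024 * 1024)   /* 32 GB, ~2x a 16 GB RAM machine */

int main(void)
{
    char *buf = malloc(CHUNK);
    if (buf == NULL) return 1;
    /* non-trivial fill, so that compression tricks in the stack don't help */
    for (size_t i = 0; i < (size_t)CHUNK; i++)
        buf[i] = (char)(i * 2654435761u);
    FILE *f = fopen("LargeFile.bin", "wb");  /* name just for the example */
    if (f == NULL) { free(buf); return 1; }
    for (long long written = 0; written < TOTAL; written += CHUNK)
        fwrite(buf, 1, CHUNK, f);
    fclose(f);
    free(buf);
    return 0;
}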
05-19-2024 06:26 AM
I had a little bit of time at the weekend to make a simple benchmark.
To avoid using a cached file, I've created a test file at RAM size (I have 16 GB):
It was also interesting for me to see how the write cache works, as well as Samsung Rapid Mode.
Anyway, on the Samsung 750 EVO in the i7-3520M the average write speed is 189 MB/s, and on the 870 EVO @ i7-7700 it's 416 MB/s (but write performance is out of scope for this topic).
Now about reading. I read the whole 16 GB file with different buffer sizes (from 512 bytes to 8 MB) and thread counts (1-4, because both test CPUs have 4 cores):
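In plain C terms, the parallel read scheme is roughly the following (just a sketch of the idea, not the actual LabVIEW code; Win32 threads, each thread with its own file handle reading a contiguous slice):

#include <windows.h>
#include <process.h>
#include <stdio.h>
#include <stdlib.h>

#define FILE_PATH   "C:/Users/Andrey/Desktop/16GB/LargeFile.bin"
#define FILE_SIZE   (16LL * 1024 * 1024 * 1024)
#define MAX_THREADS 4

typedef struct {
    long long start, length;   /* byte slice assigned to this thread */
    size_t block_size;         /* read buffer size under test */
} slice_t;

static unsigned __stdcall read_slice(void *arg)
{
    slice_t *s = (slice_t *)arg;
    FILE *f = fopen(FILE_PATH, "rb");       /* private handle per thread */
    if (f == NULL) return 1;
    _fseeki64(f, s->start, SEEK_SET);       /* 64-bit seek (MSVC CRT) */
    char *buf = malloc(s->block_size);
    if (buf == NULL) { fclose(f); return 1; }
    long long remaining = s->length;
    while (remaining > 0) {
        size_t want = remaining < (long long)s->block_size
                      ? (size_t)remaining : s->block_size;
        size_t got = fread(buf, 1, want, f);
        if (got == 0) break;
        remaining -= (long long)got;
    }
    free(buf);
    fclose(f);
    return 0;
}

int main(void)
{
    int n_threads = 4;                 /* 1..4 in the test */
    size_t block_size = 1024 * 1024;   /* 512 B .. 8 MB in the test */
    slice_t s[MAX_THREADS];
    HANDLE  t[MAX_THREADS];
    for (int i = 0; i < n_threads; i++) {   /* assumes FILE_SIZE divides evenly */
        s[i].start = FILE_SIZE / n_threads * i;
        s[i].length = FILE_SIZE / n_threads;
        s[i].block_size = block_size;
        t[i] = (HANDLE)_beginthreadex(NULL, 0, read_slice, &s[i], 0, NULL);
    }
    WaitForMultipleObjects(n_threads, t, TRUE, INFINITE);
    for (int i = 0; i < n_threads; i++)
        CloseHandle(t[i]);
    return 0;
}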
The results (MB/s):
Laptop: Samsung 750 EVO @ i7-3520M:
Desktop: Samsung 870 EVO @ i7-7700:
Performance over 500 MB/s is marked green.
As you can see, with a small read buffer like 512 bytes you lose a lot. If you would like to read at full SSD speed, then a buffer of around 1 MB is essential. And yes, multiple threads improve it a little bit. On your hardware the situation and the optimal buffer size/thread count could be different; it just needs to be benchmarked for your particular case.
05-25-2024 04:55 PM - edited 05-25-2024 04:55 PM
Nice, very interesting results. I would've assumed there would be more performance gain from multi-threading for smaller chunk/buffer sizes, but your results look inconclusive for small chunks, while there's a clear speed-up for bigger chunks, which I didn't expect (rather the opposite, based on the usual HDD/SSD benchmark tools). Maybe LabVIEW's reading routines don't play nicely with how Windows handles simultaneous/multi-threaded file access.
It also looks a bit suspicious that you don't get the full controller throughput with 8 MB chunks and 1 thread. Does the single-threaded speed actually saturate below the controller's true limit? What if you make the chunk size huge (100+ MB), which is basically completely sequential reading? Can you compare it with the single-threaded performance in CrystalDiskMark or the like?
05-26-2024 01:26 PM
@Novgorod wrote:
... Can you compare it with the single-threaded performance in CrystalDiskMark or the like?
CrystalDiskMark will show these huge read numbers on my PC for the Samsung 870 EVO connected to SATA 6 Gb/s, which in theory cannot deliver more than 600 MB/s:
This is because of the Samsung SSD with Magician Rapid Mode enabled (which is just an additional RAM cache). Of course, such numbers have nothing to do with real read performance. I could turn off this mode just for benchmarking, but instead I prefer to always create my own benchmark based on the particular use case. And since LabVIEW is more or less a 'black box', it probably makes sense to create an initial benchmark like this
clock_t start_time = clock();
while (!feof(file)) {
size_t bytes_read = fread(buffer, 1, block_size, file);
total_bytes_read += bytes_read; ++total_blocks_read;
}
clock_t end_time = clock();
based on old-school fread().
Full source code:
#include <ansi_c.h> /* LabWindows/CVI umbrella header; stdio.h/stdlib.h/time.h elsewhere */
#define FILE_PATH "C:/Users/Andrey/Desktop/16GB/LargeFile.bin"
#define BUFFER_SIZE 8388608 // 8 MB buffer (large enough for every tested block size)
#define MB (1024.0 * 1024.0)
void read_benchmark(const char* file_path, size_t block_size) {
    FILE* file = fopen(file_path, "rb");
    if (file == NULL) {
        printf("Error opening file: %s\n", file_path);
        return;
    }
    char* buffer = (char*)malloc(BUFFER_SIZE);
    if (buffer == NULL) {
        printf("Error allocating memory\n");
        fclose(file);
        return;
    }
    size_t total_bytes_read = 0, total_blocks_read = 0;
    clock_t start_time = clock();
    /* feof() only turns true after a read has hit EOF, so the loop does one
       extra zero-byte fread(); that's why the block counts below are one
       higher than file size / block size. Harmless for the timing. */
    while (!feof(file)) {
        size_t bytes_read = fread(buffer, 1, block_size, file);
        total_bytes_read += bytes_read; ++total_blocks_read;
    }
    clock_t end_time = clock();
    double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
    printf("\nBlock size: %zu bytes; ", block_size);
    printf("Total blocks read: %zu; ", total_blocks_read);
    printf("Total bytes read: %zu\n", total_bytes_read);
    printf("Elapsed time: %.2f seconds - ", elapsed_time);
    printf("Read speed: %.2f MB/s\n", (double)total_bytes_read / MB / elapsed_time);
    free(buffer);
    fclose(file);
}
int main() {
    /* block sizes from 512 bytes to 8 MB, in x4 steps */
    size_t block_sizes[] = {512, 2048, 8192, 32768, 131072, 524288, 2097152, 8388608};
    int num_block_sizes = sizeof(block_sizes) / sizeof(block_sizes[0]);
    for (int i = 0; i < num_block_sizes; i++) {
        read_benchmark(FILE_PATH, block_sizes[i]);
    }
    return 0;
}
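(The code above builds as-is in LabWindows/CVI; on another compiler, ansi_c.h would be replaced by stdio.h, stdlib.h and time.h. One caveat: clock() measures wall time on Windows but CPU time on POSIX systems, so for an I/O benchmark a different timer would be needed on other platforms.)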
Difference between 512 bytes and 8 MB buffer:
Block size: 512 bytes; Total blocks read: 33554433; Total bytes read: 17179869184
Elapsed time: 88.57 seconds - Read speed: 184.99 MB/s
Block size: 8388608 bytes; Total blocks read: 2049; Total bytes read: 17179869184
Elapsed time: 32.95 seconds - Read speed: 497.27 MB/s
Full results:
Block size: 512 bytes; Total blocks read: 33554433; Total bytes read: 17179869184
Elapsed time: 88.57 seconds - Read speed: 184.99 MB/s
Block size: 2048 bytes; Total blocks read: 8388609; Total bytes read: 17179869184
Elapsed time: 40.56 seconds - Read speed: 403.94 MB/s
Block size: 8192 bytes; Total blocks read: 2097153; Total bytes read: 17179869184
Elapsed time: 40.16 seconds - Read speed: 407.99 MB/s
Block size: 32768 bytes; Total blocks read: 524289; Total bytes read: 17179869184
Elapsed time: 38.63 seconds - Read speed: 424.08 MB/s
Block size: 131072 bytes; Total blocks read: 131073; Total bytes read: 17179869184
Elapsed time: 35.30 seconds - Read speed: 464.10 MB/s
Block size: 524288 bytes; Total blocks read: 32769; Total bytes read: 17179869184
Elapsed time: 33.59 seconds - Read speed: 487.74 MB/s
Block size: 2097152 bytes; Total blocks read: 8193; Total bytes read: 17179869184
Elapsed time: 33.55 seconds - Read speed: 488.36 MB/s
Block size: 8388608 bytes; Total blocks read: 2049; Total bytes read: 17179869184
Elapsed time: 32.95 seconds - Read speed: 497.27 MB/s
So, it matches the LabVIEW results more or less. The advantage of a C-based benchmark is that almost everything is in your hands (but I believe LabVIEW under the hood also uses just the normal standard file I/O calls to read the files).
05-26-2024 05:20 PM
Samsung Magician can make a RAM disk for you? Never knew...
The default CrystalDiskMark settings should be 1 GiB chunks and 5 iterations. Your smaller values probably used only the RAM cache, so they don't count :)... Anyhow, I happen to have a Samsung 870 EVO as well (but the 4 TB model), and that's my result without any RAM disk trickery:
The rest of my hardware should be beefy enough not to bottleneck anything (13900K and so on). That's what I would expect from SATA 600, and single-threaded sequential performance is identical to multi-threaded within the error margins. In LabVIEW this looks very similar:
I think the read/write speed in CrystalDiskMark is in MB/s, not MiB/s; then it's relatively consistent with my LabVIEW single-threaded test. I get the same median speed in LabVIEW for a large range of buffer sizes up to 8 MiB, and the performance drops with larger buffers (which makes no sense to me, but it's LabVIEW)...
In any case, I'm wondering why your C code performance is ~10% below the true maximum throughput (which you do get in LabVIEW in multi-threaded mode). Maybe it's worth checking again with CrystalDiskMark without the RAM disk shenanigans 😉 ...