LabVIEW


Simultaneous file read access in parallel for loops

Something to keep in mind is if the order from the file matters. There's nothing in the parfor loop to ensure each instance gets every other chunk. If one of them gets stalled for some reason, the other might grab a few in a row. The kind of problem that may be fine during initial testing but start to cause oddities on a system that gets more heavily loaded over time.

~ G stands for Fun ~
Helping pave the path to long-term living and thriving in space.
Message 21 of 27

@Novgorod wrote:

Samsung Magician can make a RAM disk for you? Never knew...

...

In any case, I'm wondering why your C code performance is ~10% below the true maximum throughput (which you get in Labview in multi-threaded mode). Maybe it's worth checking again with Crystaldiskmark without the RAM disk shenanigans 😉 ...


Well, you will never get maximum throughput, but we are close enough. Let's do one more experiment.

 

First of all, CrystalDiskMark will not create a "RAM" disk. On my home PC I have a Samsung SSD, where "Rapid" mode is available: Samsung SSD Rapid Mode Whitepaper. I keep this enabled with the understanding that it is mostly marketing, but with this "placebo" I have the feeling that my PC is slightly more responsive when the mode is ON.

 

Now back to our topic — test equipment:

Dell Precision 5820 Workstation with Intel Xeon W-2245 CPU @ 3.90 GHz and 32 GB RAM. SSD: KXG60ZNV1T02 NVMe KIOXIA 1TB, NVM Express 1.3. According to the datasheet, the stated maximum sequential throughput is 3180 MB/s (3033 MiB/s) read and 2960 MB/s (2823 MiB/s) write, with around 355K IOPS max.

 

CrystalDiskMark Benchmark (Read only):

image-20240527100711932.png

(Default settings for an NVMe SSD were used; the file size was increased to 32 GiB because I have 32 GB of RAM and don't want to hit a cached file; 4 passes.)

Now I need another benchmark with a varying buffer size; I will use the ATTO benchmark.

This is the single thread test:

b1.png

And 4 threads (assuming Queue Depth corresponds to the number of threads):

b4.png

So, we are at 2.9 GB/s max.

The takeaway is that a larger buffer delivers a huge performance gain, and multithreading adds a little more on top.

There are lots of benchmark tools available. For example, a test with AIDA (here I can choose block sizes between 4K and 8 MB, and measured 130-150 MB/s vs. 2-2.5 GB/s):

Spoiler
image-20240527103612891.png

I won't bother you with more benchmarks; now I'll use CVI with the same block sizes as in the ATTO benchmark.

Single thread source code under spoiler.

Spoiler
//==============================================================================
//
// Title:		SingleThreadBench
// Purpose:		Single threaded Read File benchmark.
//
// Created on:	27.05.2024 at 06:41:26 by AD.
// Copyright:	2024,AD. All Rights Reserved.
//
//==============================================================================

#include <ansi_c.h>

#define FILE_PATH "C:/Desktop/LargeFile.bin"
#define BUFFER_SIZE 8388608 // 8 MB buffer
#define MB (1024.0 * 1024.0)

void initial_read();

void read_benchmark(const char* file_path, size_t block_size) {
    FILE* file = fopen(file_path, "rb");
    if (file == NULL) {
        printf("Error opening file: %s\n", file_path);
        return;
    }

    char* buffer = (char*)malloc(BUFFER_SIZE);
    if (buffer == NULL) {
        printf("Error allocating memory\n");
        fclose(file);
        return;
    }

    size_t total_bytes_read = 0, total_blocks_read = 0;
    clock_t start_time = clock();

    while (!feof(file)) {
        size_t bytes_read = fread(buffer, 1, block_size, file);
        total_bytes_read += bytes_read;
		if(bytes_read) ++total_blocks_read;
    }

    clock_t end_time = clock();
    double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;

    printf("\nBlock size: %zu bytes; ", block_size);
    printf("Total blocks read: %zu; ", total_blocks_read);
    printf("Total bytes read: %zu\n", total_bytes_read);
    printf("Elapsed time: %.2f seconds - ", elapsed_time);
    printf("Read speed: %.2f MB/s; ", (double)total_bytes_read / MB / elapsed_time);
    printf("I/O speed: %.2f IO/s\n", (double)total_blocks_read / elapsed_time);


    free(buffer);
    fclose(file);
}

int main() {
	
	initial_read();	
    size_t block_sizes[15] = {0}; //512 Bytes ... 8 MB
	for (int i = 0, b = 512; i < 15; i++, b*=2) block_sizes[i] = b;
    int num_block_sizes = sizeof(block_sizes) / sizeof(block_sizes[0]);

    for (int i = 0; i < num_block_sizes; i++) {
        read_benchmark(FILE_PATH, block_sizes[i]);
    }

    return 0;
}

void initial_read()
{
	//Initial read to get file cache empty
	FILE* file = fopen(FILE_PATH, "rb");
    if (!file) {
        printf("Error opening file: %s\n", FILE_PATH);
        return;
    }

	char* buffer = (char*)malloc(BUFFER_SIZE);
    if (!buffer) {
        printf("Error allocating memory\n");
        fclose(file);
        return;
    }
	int cnt=0;
    printf("Initial read...\n");	
	while (!feof(file)) {
		fread(buffer, 1, BUFFER_SIZE, file); 
		if(cnt++==100) {printf("."); cnt=0;};
	}
    printf("\nInitial read completed, starting bench...\n\n");	
	
    free(buffer);
    fclose(file);
}

And the results

Spoiler
Block size: 512 bytes; Total blocks read: 67108864; Total bytes read: 34359738368
Elapsed time: 185.27 seconds - Read speed: 176.86 MB/s; I/O speed: 362216.10 IO/s

Block size: 1024 bytes; Total blocks read: 33554432; Total bytes read: 34359738368
Elapsed time: 182.99 seconds - Read speed: 179.07 MB/s; I/O speed: 183370.58 IO/s

Block size: 2048 bytes; Total blocks read: 16777216; Total bytes read: 34359738368
Elapsed time: 48.64 seconds - Read speed: 673.64 MB/s; I/O speed: 344905.04 IO/s

Block size: 4096 bytes; Total blocks read: 8388608; Total bytes read: 34359738368
Elapsed time: 36.22 seconds - Read speed: 904.67 MB/s; I/O speed: 231595.15 IO/s

Block size: 8192 bytes; Total blocks read: 4194304; Total bytes read: 34359738368
Elapsed time: 37.09 seconds - Read speed: 883.50 MB/s; I/O speed: 113087.55 IO/s

Block size: 16384 bytes; Total blocks read: 2097152; Total bytes read: 34359738368
Elapsed time: 35.68 seconds - Read speed: 918.46 MB/s; I/O speed: 58781.62 IO/s

Block size: 32768 bytes; Total blocks read: 1048576; Total bytes read: 34359738368
Elapsed time: 29.70 seconds - Read speed: 1103.30 MB/s; I/O speed: 35305.59 IO/s

Block size: 65536 bytes; Total blocks read: 524288; Total bytes read: 34359738368
Elapsed time: 29.91 seconds - Read speed: 1095.63 MB/s; I/O speed: 17530.03 IO/s

Block size: 131072 bytes; Total blocks read: 262144; Total bytes read: 34359738368
Elapsed time: 23.14 seconds - Read speed: 1416.32 MB/s; I/O speed: 11330.57 IO/s

Block size: 262144 bytes; Total blocks read: 131072; Total bytes read: 34359738368
Elapsed time: 20.69 seconds - Read speed: 1584.07 MB/s; I/O speed: 6336.27 IO/s

Block size: 524288 bytes; Total blocks read: 65536; Total bytes read: 34359738368
Elapsed time: 18.84 seconds - Read speed: 1739.37 MB/s; I/O speed: 3478.74 IO/s

Block size: 1048576 bytes; Total blocks read: 32768; Total bytes read: 34359738368
Elapsed time: 18.30 seconds - Read speed: 1790.60 MB/s; I/O speed: 1790.60 IO/s

Block size: 2097152 bytes; Total blocks read: 16384; Total bytes read: 34359738368
Elapsed time: 16.36 seconds - Read speed: 2002.57 MB/s; I/O speed: 1001.28 IO/s

Block size: 4194304 bytes; Total blocks read: 8192; Total bytes read: 34359738368
Elapsed time: 16.44 seconds - Read speed: 1992.82 MB/s; I/O speed: 498.21 IO/s

Block size: 8388608 bytes; Total blocks read: 4096; Total bytes read: 34359738368
Elapsed time: 18.85 seconds - Read speed: 1738.45 MB/s; I/O speed: 217.31 IO/s

Now I'll do the same with multithreading. I won't use OpenMP; instead, I'll launch the threads "manually".

So, the reader thread will look like this:

int CVICALLBACK ThreadFunction (void *functionData)
{
	pstructThread td = (pstructThread) functionData;
    size_t total_bytes_read = 0, total_blocks_read = 0;
    fpos_t newFilePosition;
	newFilePosition._offset = td->Id * td->block_size;

    while (!feof(td->file)) {
		fsetpos(td->file, &newFilePosition);
		size_t bytes_read = fread(td->buffer, 1, td->block_size, td->file);
        total_bytes_read += bytes_read;
		if(bytes_read) ++total_blocks_read;
		newFilePosition._offset = newFilePosition._offset + N_THREADS * td->block_size;
    }

	td->total_blocks_read = total_blocks_read;
	td->total_bytes_read = total_bytes_read;
	return(0);
}

The advantage of managing your own threads is that everything stays under your control.

Full source under the spoiler as well.

Spoiler
//==============================================================================
//
// Title:		MultiThreadBench
// Purpose:		Multi-threaded Read File benchmark.
//
// Created on:	27.05.2024 at 12:42:19 by .
// Copyright:	2024, . All Rights Reserved.
//
//==============================================================================

#include <utility.h>
#include <ansi_c.h>

#define FILE_PATH "C:/Desktop/LargeFile.bin"
#define BUFFER_SIZE 8388608 // 8 MB buffer
#define MB (1024.0 * 1024.0)

typedef struct {
    int Id;
	FILE* file;
	size_t block_size;
	char* buffer;
	size_t total_bytes_read;
	size_t total_blocks_read;
} structThread, *pstructThread;

#define N_THREADS 4
int CVICALLBACK ThreadFunction (void *functionData);
void initial_read();

int main (int argc, char *argv[])
{
	int functionId[N_THREADS] = {0};
	structThread tData[N_THREADS] = {{0}}; // zero-initialize all thread records

	initial_read();

	//---------------------------------------
	// Benchmark
	//
    size_t block_sizes[15] = {0}; //512 Bytes ... 8 MB
	for (int i = 0, b = 512; i < 15; i++, b*=2) block_sizes[i] = b;

    int num_block_sizes = sizeof(block_sizes) / sizeof(block_sizes[0]);
	
    for (int i = 0; i < num_block_sizes; i++) {

		for (int j = 0; j < N_THREADS; j++){
			tData[j].file = fopen(FILE_PATH, "rb");
	    	if (tData[j].file == NULL) {
	        	printf("Error opening file: %s\n", FILE_PATH);
	    	}

	    	tData[j].buffer = (char*)malloc(BUFFER_SIZE);
	    	if (tData[j].buffer == NULL) {
	        	printf("Error allocating memory\n");
	        	fclose(tData[j].file);
	    	}
			tData[j].Id = j;
			tData[j].block_size = block_sizes[i];
			CmtScheduleThreadPoolFunction (DEFAULT_THREAD_POOL_HANDLE, ThreadFunction, &tData[j], &functionId[j]);
		}

		size_t total_bytes_read = 0, total_blocks_read = 0;
		clock_t start_time = clock();
	
		for (int j = 0; j < N_THREADS; j++)
			CmtWaitForThreadPoolFunctionCompletion (DEFAULT_THREAD_POOL_HANDLE, functionId[j], 0);

		clock_t end_time = clock();
	    double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
		
		for (int j = 0; j < N_THREADS; j++){
			total_bytes_read += tData[j].total_bytes_read;
			total_blocks_read += tData[j].total_blocks_read;
		}
	    printf("\nBlock size: %zu bytes; ", block_sizes[i]);
	    printf("Total blocks read: %zu; ", total_blocks_read);
	    printf("Total bytes read: %zu\n", total_bytes_read);
	    printf("Elapsed time: %.2f seconds - ", elapsed_time);
	    printf("Read speed: %.2f MB/s; ", (double)total_bytes_read / MB / elapsed_time);
	    printf("I/O speed: %.2f IO/s\n", (double)total_blocks_read / elapsed_time);

		for (int j = 0; j < N_THREADS; j++){
			CmtReleaseThreadPoolFunctionID (DEFAULT_THREAD_POOL_HANDLE, functionId[j]);
		    free(tData[j].buffer);
	    	fclose(tData[j].file);
		}

	}
	return 0;
}

int CVICALLBACK ThreadFunction (void *functionData)
{
	pstructThread td = (pstructThread) functionData;
    size_t total_bytes_read = 0, total_blocks_read = 0;
    fpos_t newFilePosition;
	newFilePosition._offset = td->Id * td->block_size;

    while (!feof(td->file)) {
		fsetpos(td->file, &newFilePosition);
		size_t bytes_read = fread(td->buffer, 1, td->block_size, td->file);
        total_bytes_read += bytes_read;
		if(bytes_read) ++total_blocks_read;
		newFilePosition._offset = newFilePosition._offset + N_THREADS * td->block_size;
    }

	td->total_blocks_read = total_blocks_read;
	td->total_bytes_read = total_bytes_read;
	return(0);
}

void initial_read()
{
	//Initial read to get file cache empty
	FILE* file = fopen(FILE_PATH, "rb");
    if (!file) {
        printf("Error opening file: %s\n", FILE_PATH);
        return;
    }

	char* buffer = (char*)malloc(BUFFER_SIZE);
    if (!buffer) {
        printf("Error allocating memory\n");
        fclose(file);
        return;
    }
	int cnt=0;
    printf("Initial read...\n");	
	while (!feof(file)) {
		fread(buffer, 1, BUFFER_SIZE, file); 
		if(cnt++==100) {printf("."); cnt=0;};
	}
    printf("\nInitial read completed, starting bench...\n\n");	
	
    free(buffer);
    fclose(file);
}

The results:

Spoiler
Block size: 512 bytes; Total blocks read: 67108864; Total bytes read: 34359738368
Elapsed time: 575.18 seconds - Read speed: 56.97 MB/s; I/O speed: 116674.14 IO/s

Block size: 1024 bytes; Total blocks read: 33554432; Total bytes read: 34359738368
Elapsed time: 573.07 seconds - Read speed: 57.18 MB/s; I/O speed: 58551.76 IO/s

Block size: 2048 bytes; Total blocks read: 16777216; Total bytes read: 34359738368
Elapsed time: 307.27 seconds - Read speed: 106.64 MB/s; I/O speed: 54600.71 IO/s

Block size: 4096 bytes; Total blocks read: 8388608; Total bytes read: 34359738368
Elapsed time: 192.45 seconds - Read speed: 170.27 MB/s; I/O speed: 43588.28 IO/s

Block size: 8192 bytes; Total blocks read: 4194304; Total bytes read: 34359738368
Elapsed time: 103.49 seconds - Read speed: 316.62 MB/s; I/O speed: 40527.03 IO/s

Block size: 16384 bytes; Total blocks read: 2097152; Total bytes read: 34359738368
Elapsed time: 59.10 seconds - Read speed: 554.49 MB/s; I/O speed: 35487.21 IO/s

Block size: 32768 bytes; Total blocks read: 1048576; Total bytes read: 34359738368
Elapsed time: 36.80 seconds - Read speed: 890.56 MB/s; I/O speed: 28497.79 IO/s

Block size: 65536 bytes; Total blocks read: 524288; Total bytes read: 34359738368
Elapsed time: 25.07 seconds - Read speed: 1307.22 MB/s; I/O speed: 20915.47 IO/s

Block size: 131072 bytes; Total blocks read: 262144; Total bytes read: 34359738368
Elapsed time: 15.31 seconds - Read speed: 2140.30 MB/s; I/O speed: 17122.40 IO/s

Block size: 262144 bytes; Total blocks read: 131072; Total bytes read: 34359738368
Elapsed time: 11.99 seconds - Read speed: 2732.49 MB/s; I/O speed: 10929.95 IO/s

Block size: 524288 bytes; Total blocks read: 65536; Total bytes read: 34359738368
Elapsed time: 12.18 seconds - Read speed: 2690.31 MB/s; I/O speed: 5380.62 IO/s

Block size: 1048576 bytes; Total blocks read: 32768; Total bytes read: 34359738368
Elapsed time: 12.45 seconds - Read speed: 2632.60 MB/s; I/O speed: 2632.60 IO/s

Block size: 2097152 bytes; Total blocks read: 16384; Total bytes read: 34359738368
Elapsed time: 12.27 seconds - Read speed: 2670.36 MB/s; I/O speed: 1335.18 IO/s

Block size: 4194304 bytes; Total blocks read: 8192; Total bytes read: 34359738368
Elapsed time: 12.15 seconds - Read speed: 2696.29 MB/s; I/O speed: 674.07 IO/s

Block size: 8388608 bytes; Total blocks read: 4096; Total bytes read: 34359738368
Elapsed time: 11.46 seconds - Read speed: 2858.84 MB/s; I/O speed: 357.35 IO/s

So, now for a single thread I get around 2 GB/s, and for 4 threads up to 2.8 GB/s.

Now I'll perform the same benchmarks in LabVIEW (the code from this topic above, just slightly modified to produce the same output as the command-line tool).

Single thread:

Spoiler
Block size: 512 bytes; Total blocks read: 67108864; Total bytes read: 34359738368
Elapsed time: 316,646242 seconds - Read speed: 103,484569 MB/s; I/O speed: 211936,397828 IO/s

Block size: 1024 bytes; Total blocks read: 33554432; Total bytes read: 34359738368
Elapsed time: 162,758298 seconds - Read speed: 201,329213 MB/s; I/O speed: 206161,114459 IO/s

Block size: 2048 bytes; Total blocks read: 16777216; Total bytes read: 34359738368
Elapsed time: 83,952761 seconds - Read speed: 390,314739 MB/s; I/O speed: 199841,146380 IO/s

Block size: 4096 bytes; Total blocks read: 8388608; Total bytes read: 34359738368
Elapsed time: 45,860620 seconds - Read speed: 714,512794 MB/s; I/O speed: 182915,275171 IO/s

Block size: 8192 bytes; Total blocks read: 4194304; Total bytes read: 34359738368
Elapsed time: 36,531225 seconds - Read speed: 896,986081 MB/s; I/O speed: 114814,218410 IO/s

Block size: 16384 bytes; Total blocks read: 2097152; Total bytes read: 34359738368
Elapsed time: 36,671083 seconds - Read speed: 893,565111 MB/s; I/O speed: 57188,167115 IO/s

Block size: 32768 bytes; Total blocks read: 1048576; Total bytes read: 34359738368
Elapsed time: 31,198882 seconds - Read speed: 1050,294046 MB/s; I/O speed: 33609,409465 IO/s

Block size: 65536 bytes; Total blocks read: 524288; Total bytes read: 34359738368
Elapsed time: 27,402790 seconds - Read speed: 1195,790665 MB/s; I/O speed: 19132,650641 IO/s

Block size: 131072 bytes; Total blocks read: 262144; Total bytes read: 34359738368
Elapsed time: 22,627553 seconds - Read speed: 1448,146009 MB/s; I/O speed: 11585,168075 IO/s

Block size: 262144 bytes; Total blocks read: 131072; Total bytes read: 34359738368
Elapsed time: 20,141965 seconds - Read speed: 1626,852229 MB/s; I/O speed: 6507,408915 IO/s

Block size: 524288 bytes; Total blocks read: 65536; Total bytes read: 34359738368
Elapsed time: 18,715838 seconds - Read speed: 1750,816645 MB/s; I/O speed: 3501,633291 IO/s

Block size: 1048576 bytes; Total blocks read: 32768; Total bytes read: 34359738368
Elapsed time: 19,089178 seconds - Read speed: 1716,574736 MB/s; I/O speed: 1716,574736 IO/s

Block size: 2097152 bytes; Total blocks read: 16384; Total bytes read: 34359738368
Elapsed time: 18,618085 seconds - Read speed: 1760,009171 MB/s; I/O speed: 880,004586 IO/s

Block size: 4194304 bytes; Total blocks read: 8192; Total bytes read: 34359738368
Elapsed time: 18,082172 seconds - Read speed: 1812,171659 MB/s; I/O speed: 453,042915 IO/s

Block size: 8388608 bytes; Total blocks read: 4096; Total bytes read: 34359738368
Elapsed time: 18,231529 seconds - Read speed: 1797,325940 MB/s; I/O speed: 224,665743 IO/s

And 4 Threads

Spoiler
Block size: 512 bytes; Total blocks read: 67108864; Total bytes read: 34359738368
Elapsed time: 710,286003 seconds - Read speed: 46,133529 MB/s; I/O speed: 94481,467563 IO/s

Block size: 1024 bytes; Total blocks read: 33554432; Total bytes read: 34359738368
Elapsed time: 442,839673 seconds - Read speed: 73,995177 MB/s; I/O speed: 75771,061348 IO/s

Block size: 2048 bytes; Total blocks read: 16777216; Total bytes read: 34359738368
Elapsed time: 339,909001 seconds - Read speed: 96,402272 MB/s; I/O speed: 49357,963324 IO/s

Block size: 4096 bytes; Total blocks read: 8388608; Total bytes read: 34359738368
Elapsed time: 205,209592 seconds - Read speed: 159,680645 MB/s; I/O speed: 40878,245166 IO/s

Block size: 8192 bytes; Total blocks read: 4194304; Total bytes read: 34359738368
Elapsed time: 115,966125 seconds - Read speed: 282,565275 MB/s; I/O speed: 36168,355261 IO/s

Block size: 16384 bytes; Total blocks read: 2097152; Total bytes read: 34359738368
Elapsed time: 63,981812 seconds - Read speed: 512,145548 MB/s; I/O speed: 32777,315057 IO/s

Block size: 32768 bytes; Total blocks read: 1048576; Total bytes read: 34359738368
Elapsed time: 36,407557 seconds - Read speed: 900,032932 MB/s; I/O speed: 28801,053821 IO/s

Block size: 65536 bytes; Total blocks read: 524288; Total bytes read: 34359738368
Elapsed time: 23,949037 seconds - Read speed: 1368,238720 MB/s; I/O speed: 21891,819517 IO/s

Block size: 131072 bytes; Total blocks read: 262144; Total bytes read: 34359738368
Elapsed time: 17,703998 seconds - Read speed: 1850,881335 MB/s; I/O speed: 14807,050676 IO/s

Block size: 262144 bytes; Total blocks read: 131072; Total bytes read: 34359738368
Elapsed time: 13,299071 seconds - Read speed: 2463,931466 MB/s; I/O speed: 9855,725865 IO/s

Block size: 524288 bytes; Total blocks read: 65536; Total bytes read: 34359738368
Elapsed time: 11,778760 seconds - Read speed: 2781,956793 MB/s; I/O speed: 5563,913585 IO/s

Block size: 1048576 bytes; Total blocks read: 32768; Total bytes read: 34359738368
Elapsed time: 10,879002 seconds - Read speed: 3012,040967 MB/s; I/O speed: 3012,040967 IO/s

Block size: 2097152 bytes; Total blocks read: 16384; Total bytes read: 34359738368
Elapsed time: 11,068786 seconds - Read speed: 2960,396948 MB/s; I/O speed: 1480,198474 IO/s

Block size: 4194304 bytes; Total blocks read: 8192; Total bytes read: 34359738368
Elapsed time: 10,351174 seconds - Read speed: 3165,631291 MB/s; I/O speed: 791,407823 IO/s

Block size: 8388608 bytes; Total blocks read: 4096; Total bytes read: 34359738368
Elapsed time: 10,509927 seconds - Read speed: 3117,814382 MB/s; I/O speed: 389,726798 IO/s

Now let's put everything into a single table:

Screenshot 2024-05-27 17.21.51.png

As you can see, real life is slightly different from a synthetic benchmark: the single thread is faster for small block sizes (because of the IOPS throughput limit), from about 32K/64K they are the same, and at 8 MB the multi-threaded version is faster. And yes, LabVIEW is 10% faster than CVI at an 8 MB buffer.

I think the real reason is the SSD cache (a 1 TB SSD may have nearly 100 GB of cache; unfortunately I don't have the spec for this particular model) and how it is utilized in the multi-threaded test: the sequential execution, even on a RAM-sized file, causes dependent iterations. In theory, between the tests we should read a file larger than the SSD's cache (but I don't have that much time).

 

But one magic experiment can be quickly performed.

I will modify my thread function so that instead of "interleaved" reading, each thread reads one quarter of the file sequentially:

  

int CVICALLBACK ThreadFunction (void *functionData)
{
	pstructThread td = (pstructThread) functionData;
    size_t total_bytes_read = 0, total_blocks_read = 0;
    fpos_t newFilePosition;
	//newFilePosition._offset = td->Id * td->block_size;
	newFilePosition._offset = td->Id * (34359738368/N_THREADS);
	fsetpos(td->file, &newFilePosition);
		
    while (total_bytes_read < 34359738368/N_THREADS) {
		size_t bytes_read = fread(td->buffer, 1, td->block_size, td->file);
        total_bytes_read += bytes_read;
		if(bytes_read) ++total_blocks_read;
    }

	td->total_blocks_read = total_blocks_read;
	td->total_bytes_read = total_bytes_read;
	return(0);
}

I'm too lazy to get file size programmatically, sorry about that.

In general, only the chunk partitioning is different:

Screenshot 2024-05-27 17.40.31.png

And now you will not believe me: I see 4 GB/s at a 2 MB buffer:

Spoiler
Block size: 512 bytes; Total blocks read: 67108864; Total bytes read: 34359738368
Elapsed time: 172.07 seconds - Read speed: 190.43 MB/s; I/O speed: 390004.56 IO/s

Block size: 1024 bytes; Total blocks read: 33554432; Total bytes read: 34359738368
Elapsed time: 170.29 seconds - Read speed: 192.43 MB/s; I/O speed: 197044.04 IO/s

Block size: 2048 bytes; Total blocks read: 16777216; Total bytes read: 34359738368
Elapsed time: 44.40 seconds - Read speed: 738.00 MB/s; I/O speed: 377856.71 IO/s

Block size: 4096 bytes; Total blocks read: 8388608; Total bytes read: 34359738368
Elapsed time: 24.22 seconds - Read speed: 1353.04 MB/s; I/O speed: 346379.06 IO/s

Block size: 8192 bytes; Total blocks read: 4194304; Total bytes read: 34359738368
Elapsed time: 18.84 seconds - Read speed: 1739.28 MB/s; I/O speed: 222627.60 IO/s

Block size: 16384 bytes; Total blocks read: 2097152; Total bytes read: 34359738368
Elapsed time: 15.37 seconds - Read speed: 2132.64 MB/s; I/O speed: 136488.90 IO/s

Block size: 32768 bytes; Total blocks read: 1048576; Total bytes read: 34359738368
Elapsed time: 12.74 seconds - Read speed: 2572.46 MB/s; I/O speed: 82318.73 IO/s

Block size: 65536 bytes; Total blocks read: 524288; Total bytes read: 34359738368
Elapsed time: 11.90 seconds - Read speed: 2753.84 MB/s; I/O speed: 44061.52 IO/s

Block size: 131072 bytes; Total blocks read: 262144; Total bytes read: 34359738368
Elapsed time: 9.70 seconds - Read speed: 3379.54 MB/s; I/O speed: 27036.30 IO/s

Block size: 262144 bytes; Total blocks read: 131072; Total bytes read: 34359738368
Elapsed time: 8.93 seconds - Read speed: 3670.25 MB/s; I/O speed: 14681.00 IO/s

Block size: 524288 bytes; Total blocks read: 65536; Total bytes read: 34359738368
Elapsed time: 8.69 seconds - Read speed: 3771.64 MB/s; I/O speed: 7543.28 IO/s

Block size: 1048576 bytes; Total blocks read: 32768; Total bytes read: 34359738368
Elapsed time: 8.30 seconds - Read speed: 3947.48 MB/s; I/O speed: 3947.48 IO/s

Block size: 2097152 bytes; Total blocks read: 16384; Total bytes read: 34359738368
Elapsed time: 8.17 seconds - Read speed: 4012.24 MB/s; I/O speed: 2006.12 IO/s

Block size: 4194304 bytes; Total blocks read: 8192; Total bytes read: 34359738368
Elapsed time: 8.69 seconds - Read speed: 3772.07 MB/s; I/O speed: 943.02 IO/s

Block size: 8388608 bytes; Total blocks read: 4096; Total bytes read: 34359738368
Elapsed time: 9.85 seconds - Read speed: 3325.35 MB/s; I/O speed: 415.67 IO/s

But this is unreal, of course. If I read another 32 GB file between the tests, everything goes back to normal.

 

And the last experiment: I'll create just a 1 GB file, which fully fits into the file cache; then the benchmark looks like this:

Spoiler
Block size: 512 bytes; Total blocks read: 2097152; Total bytes read: 1073741824
Elapsed time: 7,989532 seconds - Read speed: 128,167707 MB/s; I/O speed: 262487,464848 IO/s

Block size: 1024 bytes; Total blocks read: 1048576; Total bytes read: 1073741824
Elapsed time: 3,993922 seconds - Read speed: 256,389571 MB/s; I/O speed: 262542,920841 IO/s

Block size: 2048 bytes; Total blocks read: 524288; Total bytes read: 1073741824
Elapsed time: 2,079740 seconds - Read speed: 492,369286 MB/s; I/O speed: 252093,074336 IO/s

Block size: 4096 bytes; Total blocks read: 262144; Total bytes read: 1073741824
Elapsed time: 1,123425 seconds - Read speed: 911,497995 MB/s; I/O speed: 233343,486804 IO/s

Block size: 8192 bytes; Total blocks read: 131072; Total bytes read: 1073741824
Elapsed time: 0,617197 seconds - Read speed: 1659,114509 MB/s; I/O speed: 212366,657177 IO/s

Block size: 16384 bytes; Total blocks read: 65536; Total bytes read: 1073741824
Elapsed time: 0,350030 seconds - Read speed: 2925,460189 MB/s; I/O speed: 187229,452070 IO/s

Block size: 32768 bytes; Total blocks read: 32768; Total bytes read: 1073741824
Elapsed time: 0,221656 seconds - Read speed: 4619,764924 MB/s; I/O speed: 147832,477579 IO/s

Block size: 65536 bytes; Total blocks read: 16384; Total bytes read: 1073741824
Elapsed time: 0,174452 seconds - Read speed: 5869,812825 MB/s; I/O speed: 93917,005202 IO/s

Block size: 131072 bytes; Total blocks read: 8192; Total bytes read: 1073741824
Elapsed time: 0,141538 seconds - Read speed: 7234,801089 MB/s; I/O speed: 57878,408712 IO/s

Block size: 262144 bytes; Total blocks read: 4096; Total bytes read: 1073741824
Elapsed time: 0,127907 seconds - Read speed: 8005,848022 MB/s; I/O speed: 32023,392087 IO/s

Block size: 524288 bytes; Total blocks read: 2048; Total bytes read: 1073741824
Elapsed time: 0,116038 seconds - Read speed: 8824,687753 MB/s; I/O speed: 17649,375507 IO/s

Block size: 1048576 bytes; Total blocks read: 1024; Total bytes read: 1073741824
Elapsed time: 0,261621 seconds - Read speed: 3914,063375 MB/s; I/O speed: 3914,063375 IO/s

Block size: 2097152 bytes; Total blocks read: 512; Total bytes read: 1073741824
Elapsed time: 0,281289 seconds - Read speed: 3640,380206 MB/s; I/O speed: 1820,190103 IO/s

Block size: 4194304 bytes; Total blocks read: 256; Total bytes read: 1073741824
Elapsed time: 0,295443 seconds - Read speed: 3465,986286 MB/s; I/O speed: 866,496572 IO/s

Block size: 8388608 bytes; Total blocks read: 128; Total bytes read: 1073741824
Elapsed time: 0,266796 seconds - Read speed: 3838,137064 MB/s; I/O speed: 479,767133 IO/s

So, I have 8 GB/s now, because the file was read completely from the RAM cache. By the way, even in this situation there is a huge performance penalty with small buffer sizes, especially on the first read, when the file wasn't cached yet.

 

So, in general, multi-threaded reading has some benefits, but in parallel-processing scenarios the reading is usually not the bottleneck; do it in whatever way is most convenient from the architecture point of view.

 

Message 22 of 27

Regarding the Samsung trickery, it's basically a RAM drive as I said - it creates and manages it dynamically and transparently through a filter driver, but the result is the same. Windows itself also does a lot of caching (a huge read cache, depending on your RAM size, and a smaller write cache), so "fair" benchmarks have to bypass this cache. For "real-world" tests you'd have to use files that haven't been read for a long time. Even if you make a file bigger than your RAM size, it may be partially cached. It's better to look at the read graph because you'll see exactly when it switches from the cache to the actual drive - all examples are single-threaded (!) sequential reads with LabVIEW (same 870 EVO):

cache.png

 

Also, interestingly, I can roughly reproduce your results for reading a fully cached file with the speed dropping to ~4GB/s for chunk sizes of ~1MiB and above and much faster speeds for smaller chunks (why though!?):

ultra.png

 

In any case, this topic isn't supposed to be about cached reads - they are kind of unpredictable and super fast anyway, so multi-threading optimization is a moot point there. So it's better to limit the benchmarking to the clearly un-cached case.

 

The real-world application would be (for example) going through a huge binary measurement file beyond any practical RAM caching (tens of GB or more) which contains a large number of entries or data blocks, which can be all processed independently and therefore in parallel. For large enough chunk sizes, sequential reading (with a semaphore in a parallel FOR loop) should be close to maximum performance, and maybe making 2 explicit read threads should fully optimize the available throughput. For small chunk sizes, the easiest way to optimize throughput seems to be using multiple file references and forcing a deterministic distribution of the parallel iterations over the CPU threads (1 wired to the C terminal) to make sure each thread uses the correct file reference. The question is what's the performance penalty of setting C to 1 compared to automatic iteration partitioning...

Message 23 of 27

@Novgorod wrote:

....

In any case, this topic isn't supposed to be about cached reads - they are kind of unpredictable and super fast anyway, so multi-threading optimization is a moot point there. So it's better to limit the benchmarking to the clearly un-cached case.

...


Fully agree, and what I've learned from this topic is that we have not only the OS cache, but also the SSD cache, so it is essential to read other "large enough" files between the tests to push the cached data out of the cache. Why we observe different behaviour with small chunks could be an effect of different architectures and hardware, but that is out of scope.

By the way, during the experiments I've seen some sporadic "end of file" errors from the parallelized loop, so if I were to do this in parallel, I probably would not use a single parallelized loop, but would create pure parallel loops side by side, like this:

fl.png

This will be much easier to handle. If scalability for a different number of threads is required, it can be rewritten using VI Server and dynamically called VIs.

 

 

Message 24 of 27

@Andrey_Dmitriev wrote:

...I'm too lazy to get file size programmatically, sorry about that.

Ummmm.

 

"Lazy" is *definitely* not the right word to describe you after this highly detailed exploration and discussion!  Thanks to both of you engaging so thoroughly!

 

 

-Kevin P

 

 

 

ALERT! LabVIEW's subscription-only policy came to an end (finally!). Unfortunately, pricing favors the captured and committed over new adopters -- so tread carefully.
Message 25 of 27

@Andrey_Dmitriev wrote:


Fully agree, and what I've learned from this topic is that we have not only the OS cache, but also the SSD cache, so it is essential to read other "large enough" files between the tests to push the cached data out of the cache

 


SSD cache shouldn't matter for reads, only for writes (some small percentage of the flash is dynamically used as SLC cache to achieve the GB/s write speeds and the written data is then shoveled into the dense but slow MLC/TLC flash by the controller in the background). For reads, the file gets already cached by the OS in the much larger and faster system RAM, the SSD's DRAM cache (if it has one) is just a small buffer and SSDs definitely don't move read files into SLC cache for faster access because this costs write cycles, and flash reads are much faster than writes anyway.

 

Also interesting point about end of file errors with parallel FOR loops (I assume C is set to 1?) - that would mean the distribution of iterations over the threads is not deterministic, even for C = 1, and the only way to ensure determinism would be with multiple loop structures like in your example. All this could be much more elgant if NI would let us access the thread number/ID assigned to the current parallelized iteration inside the loop...

Message 26 of 27

@Novgorod wrote:

@Andrey_Dmitriev wrote:


Fully agree, and what I've learned from this topic is that we have not only the OS cache, but also the SSD cache, so it is essential to read other "large enough" files between the tests to push the cached data out of the cache

 


SSD cache shouldn't matter for reads, only for writes (some small percentage of the flash is dynamically used as SLC cache to achieve the GB/s write speeds and the written data is then shoveled into the dense but slow MLC/TLC flash by the controller in the background). For reads, the file gets already cached by the OS in the much larger and faster system RAM, the SSD's DRAM cache (if it has one) is just a small buffer and SSDs definitely don't move read files into SLC cache for faster access because this costs write cycles, and flash reads are much faster than writes anyway.


It seems you're right. I re-measured the 4-thread C code with individual time tracking for each thread, and yes, the last thread, which reads the last quarter of the file, is always significantly faster, nearly 4 times:

THREAD 3 Elapsed time: 2.52 seconds
THREAD 1 Elapsed time: 8.56 seconds
THREAD 0 Elapsed time: 8.64 seconds
THREAD 2 Elapsed time: 8.65 seconds

Block size: 2097152 bytes; Total blocks read: 16384; Total bytes read: 34359738368
Elapsed time: 8.67 seconds - Read speed: 3778.16 MB/s; I/O speed: 1889.08 IO/s

which resulted in an overall speed well over the theoretical limit. It's just because, after the previous read, the rest of the file remains in the OS cache, and the last thread gets its data from RAM instead of the SSD. Demystified now.

 

Message 27 of 27