Unicode to UTF-8 conversion

V_T_S · ‎11-16-2017

Hello,

Has anyone ever done a conversion of Unicode characters (0000 - FFFF) to UTF-8? Are there any built-in modules which can be used for this conversion?

http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=1024

There is an ASCII-text-to-UTF-8 module, but it does not work for Unicode characters.

-VS

Bob_Schor · ‎11-16-2017

The safe answer is "Not Yet", based on a search of LabVIEW Help and a Web Search. Part of the problem is that Unicode is U16, while ASCII (which LabVIEW basically uses) is U8. You could write your own Translator/Mapper, but it would have to deal with the Many-to-One problem (i.e. you'd need to decide which 256 Unicode characters you wanted to display as which ASCII character).
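To make the mapping problem concrete, here is a minimal Python sketch (Python is used purely for illustration; it is not LabVIEW code). It assumes Windows-1252 as the 8-bit target code page, which is an assumption and not something stated in this thread. Any character without a slot in that code page has to be replaced or dropped.

```python
# Squeeze Unicode text into one 8-bit code page.
# cp1252 is an assumed example code page, chosen only for illustration.
text = "A\u03a9\u20ac"  # 'A', Greek capital omega, euro sign

# errors="replace" substitutes '?' for anything the code page cannot represent.
encoded = text.encode("cp1252", errors="replace")
round_trip = encoded.decode("cp1252")

print(encoded)     # the omega has no cp1252 slot and becomes '?'
print(round_trip)  # the original omega cannot be recovered
```

The euro sign survives because cp1252 happens to define it; the omega does not, so the conversion is lossy.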

Bob Schor

P.S. -- If I'm wrong about this, I have every confidence that other readers of this Forum will point out my error ...

paul_cardinale · ‎11-16-2017

The conversion would be simple. Why don't you write a VI for it and share it.
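As a rough idea of what such a VI would have to do, here is a minimal Python sketch of the bit manipulation for the range asked about (0000 - FFFF). The function name `bmp_code_point_to_utf8` is hypothetical, and the code is only an illustration of the encoding rules, not an existing LabVIEW function.

```python
def bmp_code_point_to_utf8(cp: int) -> bytes:
    """Encode one code point in the range 0x0000-0xFFFF as UTF-8 bytes."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only code points 0x0000-0xFFFF are handled here")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not characters and have no UTF-8 form")
    if cp < 0x80:
        # 7 bits: 0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


print(bmp_code_point_to_utf8(0x20AC).hex(" "))  # euro sign
```

The same masks and shifts should carry over to a block diagram; only the three branches above are needed as long as the input stays within 0000 - FFFF.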

"If you weren't supposed to push it, it wouldn't be a button."

rolfk · ‎11-16-2017

Actually ASCII is a VERY LIMITED subset of Unicode. And no, Unicode is not UTF-16, but UTF-16 LE (little endian) is the Unicode encoding used on Windows. There are also UTF-8 and UTF-32, and the 16-bit and 32-bit encodings each come in an LE and a BE variant.
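The difference between these encodings is easiest to see on a single character. The following Python sketch (illustration only, using Python's standard codec names) encodes the euro sign, U+20AC, in each of them:

```python
ch = "\u20ac"  # euro sign, code point U+20AC

# One code point, five different byte sequences.
encodings = {
    name: ch.encode(name)
    for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
}

for name, data in encodings.items():
    print(f"{name:<10} {data.hex(' ')}")
```

The code point stays the same in every case; only the byte layout changes, and LE and BE differ just in byte order.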

Windows NT uses UTF-16 LE everywhere internally, translating it to 8-bit MBCS on demand for applications that don't use Unicode, such as LabVIEW. LabVIEW therefore does not really use ASCII but 8-bit MBCS. For most Windows code pages this means an extended ASCII code page, with the lower 128 character codes mapped to the standard 7-bit ASCII characters and the upper 128 code points mapped to code-page-specific characters. But Asian and Arabic code pages can define more than 256 characters, and then a single character suddenly consists of multiple bytes even in LabVIEW.
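A short Python sketch can show both effects. The code pages used here (cp1252, cp1251 and cp932) are examples picked for illustration and are not taken from this thread.

```python
raw = b"\xe9"  # one byte from the upper half of an 8-bit code page

# The same byte is a different character depending on the active code page.
western = raw.decode("cp1252")   # Western European code page
cyrillic = raw.decode("cp1251")  # Cyrillic code page
print(western, cyrillic)

# In a multi-byte code page such as cp932 (Japanese), one character
# can occupy two bytes.
hiragana_a = "\u3042"
sjis = hiragana_a.encode("cp932")
print(len(hiragana_a), "character ->", len(sjis), "bytes:", sjis.hex(" "))
```

This is why a byte count and a character count cannot be assumed to match for strings in a multi-byte code page.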

Linux nowadays mostly uses UTF-32 internally (LE or BE, depending on the endianness of the CPU), but with most user systems running on x86/64 or ARM this is usually LE as well.

On the user level Linux uses UTF-8, which is in fact also an MBCS encoding where a single code point can consist of 1 to 4 bytes. The first 128 characters in the ASCII table map exactly to the first 128 characters in the Unicode standard.
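Both properties can be checked with a few lines of Python (illustration only; the sample characters are arbitrary picks, one for each UTF-8 length):

```python
# One sample character per UTF-8 length: 'A', e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [len(s.encode("utf-8")) for s in samples]
print(lengths)

# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
ascii_text = "LabVIEW"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```

So any pure 7-bit ASCII string is already valid UTF-8 without conversion.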

So if your LabVIEW were running on a modern Linux system, you would theoretically already be using UTF-8.

On Windows, to get from UTF-16 LE to UTF-8 one simply needs to call the function WideCharToMultiByte() with the first parameter (the code page) set to CP_UTF8 (65001) instead of CP_ACP (0).

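And be very careful to allocate a large enough buffer for the returned UTF-8 string. Sizing it at 4 bytes per UTF-16 code unit in the incoming string is always enough; the actual worst case is 3 bytes per code unit, because a 4-byte UTF-8 sequence only arises from a surrogate pair, which already occupies two code units.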
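To get a feel for the buffer sizes involved, the following Python sketch compares the number of UTF-16 code units with the number of resulting UTF-8 bytes. It is a portable stand-in for the conversion, not a call to the Win32 function, and the helper name `utf16le_to_utf8` is hypothetical.

```python
def utf16le_to_utf8(utf16le: bytes) -> bytes:
    """Portable stand-in: decode UTF-16 LE bytes and re-encode them as UTF-8."""
    return utf16le.decode("utf-16-le").encode("utf-8")


rows = []
for text in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    wide = text.encode("utf-16-le")  # what a Windows wide string would hold
    units = len(wide) // 2           # number of 16-bit code units
    utf8 = utf16le_to_utf8(wide)
    rows.append((units, len(utf8)))
    print(f"U+{ord(text):04X}: {units} UTF-16 unit(s) -> {len(utf8)} UTF-8 byte(s)")
```

On Windows itself, WideCharToMultiByte() can also be called with an output buffer size of 0, in which case it returns the number of bytes required, so the buffer does not have to be guessed.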

Rolf Kalbermatter
My Blog

Bob_Schor · ‎11-16-2017

Thanks, RolfK, for the clear and concise response!

Bob Schor

V_T_S · ‎11-16-2017

Thank you for the detailed description, RolfK!!
