Solved VS2022, how can I properly read Unicode from a file into a string? So far everything converts to ANSI


sdowney717

Works great only for English text
Code:
        Dim fileReader As System.IO.StreamReader
        fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak)
        Dim stringReader As String
        stringReader = fileReader.ReadToEnd()

or

Code:
   Dim content As String = IO.File.ReadAllText(FilenameToBreak)

Imagine a text file with a line like this. All Swedish characters turn into ????
00170 2200061 4500500004000000520005200040901001600092 aHur mår du?Hur är läget?FörlåtHejdå aHur är läget?Hur mår du?FörlåtHur mår du?Förlåt a85843322572
 

My Computer

System One

  • OS
    windows 11
    Computer type
    PC/Desktop
    Manufacturer/Model
    some kind of old ASUS MB
    CPU
    old AMD B95
    Motherboard
    ASUS
    Memory
    8gb
    Hard Drives
    ssd WD 500 gb

Their Unicode example does not work here

I thought I should be able to read a text file containing English, Swedish, and Chinese characters, and that it would not be hard to do.
So far I have found no examples of how to do it.
I saw some mention of setting code pages for a specific language, but that is no good if a file has chars from many different languages.
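
One way to avoid picking a code page up front is to try strict UTF-8 first and only fall back to a single-byte encoding when the bytes are not valid UTF-8. The following is a minimal sketch of that idea, not a definitive implementation. The module name, the helper name ReadTextUtf8OrFallback, and its parameters are made up for illustration, and the choice of fallback encoding is still a guess the caller has to make.

```vb
Imports System.IO
Imports System.Text

Module StrictUtf8Sketch
    ' Hypothetical helper: decode the file as strict UTF-8 and, if the bytes
    ' are not valid UTF-8, decode them with the caller's fallback instead.
    Function ReadTextUtf8OrFallback(filePath As String, fallback As Encoding) As String
        Dim bytes As Byte() = File.ReadAllBytes(filePath)

        ' throwOnInvalidBytes:=True makes invalid UTF-8 raise an exception
        ' instead of silently turning into replacement characters.
        Dim strictUtf8 As New UTF8Encoding(encoderShouldEmitUTF8Identifier:=False,
                                           throwOnInvalidBytes:=True)
        Try
            ' GetString keeps a leading BOM as U+FEFF, so trim it off.
            Return strictUtf8.GetString(bytes).TrimStart(ChrW(&HFEFF))
        Catch ex As DecoderFallbackException
            ' Not valid UTF-8: assume the single-byte fallback.
            Return fallback.GetString(bytes)
        End Try
    End Function
End Module
```

A call might look like ReadTextUtf8OrFallback(FilenameToBreak, System.Text.Encoding.GetEncoding(1252)). On .NET Core and .NET 5 or later, code page 1252 is only available after registering CodePagesEncodingProvider; on .NET Framework it is built in. Pure ASCII files decode the same either way.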
 

This works, but I bet only for code page 1252 languages.
That is bad if your text file has, say, English, Swedish, and Chinese characters.
How is this done?
You can't assume a file is Swedish or Indonesian or Arabic or a mix of languages.
How can you assume anything about the text in a text file?

Code:
 Dim content As String = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.GetEncoding(1252))

Code:
" ChrW(31) & "aHur mår du?Hur är läget?FörlåtHejdå" & ChrW(30) & "  " & ChrW(31) & "aHur är läget?Hur mår du?FörlåtHur mår du?Förlåt" & ChrW(30) & "  " & ChrW(31) & "a85843322572" & ChrW(30) & ChrW(29)
 

I may have resolved it this way

Maybe I got it with this?
This looks good, so I will mark it resolved.



Code:
        Dim fileReader As System.IO.StreamReader
        fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak, System.Text.Encoding.UTF8)
        Dim stringReader As String
        stringReader = fileReader.ReadToEnd()

Code:
?stringreader
"""aHur mår du?Hur är läget?FörlåtHejdå  圖圖3解弓月金難手中大""
 

I would suggest in future postings, to write "How can I...Visual Basic or VB" instead of VS2022. At this point, your IDE doesn't matter.

The problem is the same in multiple programming languages. You need to inspect the header bytes to see if any possible encoding is defined, and then pivot based on the headers. Presuming it's UTF8 is arbitrary and will get you into trouble.

Here's someone's solution in VB:
 

My Computer

System One

  • OS
    Windows 7
Also adding this single line seems to be working now.
Who knows if it will keep working

Code:
Dim content As String = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF8)
 

I would suggest in future postings, to write "How can I...Visual Basic or VB" instead of VS2022. At this point, your IDE doesn't matter.

The problem is the same in multiple programming languages. You need to inspect the header bytes to see if any possible encoding is defined, and then pivot based on the headers. Presuming it's UTF8 is arbitrary and will get you into trouble.

Here's someone's solution in VB:
Been wondering about it as well.

I can easily check for a BOM, as I did that in vb6.
I don't have the experience with Unicode files to know if today this is still relevant, or if Windows just knows what to do.

Would be nice to have a site where you can download various types of Unicode files to test.
 

For example, I just created an English text file in Notepad and ran it through my VS2022 subroutine.
It worked and gave back perfect English text, nothing strange. So specifying UTF8 encoding had no effect on the English output. That is expected, since plain ASCII text is byte-for-byte the same in UTF-8 and ANSI.

I read the file, whose name is in the variable FilenameToBreak, with UTF8 specified:
Dim content As String = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF8)

gave this result
?content
"A test of an english language file written in notepad" & vbCrLf

So it looks like today's Windows handles it.
 

So it looks like today's Windows handles it.

That would depend on the default encoding on each installation of Windows. Not guaranteed to work on non-English installations.

System.IO.StreamReader has a parameter to detect BOM.

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first four bytes of the stream. It automatically recognizes UTF-8, little-endian UTF-16, big-endian UTF-16, little-endian UTF-32, and big-endian UTF-32 text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
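
The parameter described above can be sketched as follows. This is only an illustration: the module name, the helper name ReadWithBomDetection, and its parameters are assumptions, not anything from the thread. The fallback encoding is used only when the file has no byte order mark.

```vb
Imports System.IO
Imports System.Text

Module BomDetectSketch
    ' Hypothetical helper: reads the whole file, letting StreamReader switch
    ' encodings if it finds a BOM, and reports which encoding it ended up using.
    Function ReadWithBomDetection(filePath As String,
                                  fallback As Encoding,
                                  ByRef used As Encoding) As String
        Using reader As New StreamReader(filePath, fallback,
                                         detectEncodingFromByteOrderMarks:=True)
            Dim text As String = reader.ReadToEnd()
            ' CurrentEncoding is only reliable after the first read.
            used = reader.CurrentEncoding
            Return text
        End Using
    End Function
End Module
```

With a fallback of System.Text.Encoding.UTF8, a UTF-16 file that starts with a BOM is still read correctly, and a file with no BOM is read as UTF-8. A UTF-16 file without a BOM would not be detected this way.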
 

My Computers

System One System Two

  • OS
    Windows 11 Pro 23H2 [rev. 3593]
    Computer type
    PC/Desktop
    Manufacturer/Model
    Intel NUC12WSHi7
    CPU
    12th Gen Intel Core i7-1260P, 2100 MHz
    Motherboard
    NUC12WSBi7
    Memory
    64 GB
    Graphics Card(s)
    Intel Iris Xe
    Sound Card
    built-in Realtek HD audio
    Monitor(s) Displays
    Dell U3219Q
    Screen Resolution
    3840x2160 @ 60Hz
    Hard Drives
    Samsung SSD 990 PRO 1TB
    Keyboard
    CODE 104-Key Mechanical Keyboard with Cherry MX Clears
  • Operating System
    Linux Mint 21.2 (Cinnamon)
    Computer type
    PC/Desktop
    Manufacturer/Model
    Intel NUC8i5BEH
    CPU
    Intel Core i5-8259U CPU @ 2.30GHz
    Memory
    32 GB
    Graphics card(s)
    Iris Plus 655
    Keyboard
    CODE 104-Key Mechanical Keyboard - Cherry MX Clear

For sure, UTF-8 is the most common. Up to you whether you want to assume it or not.
 

That would depend on the default encoding on each installation of Windows. Not guaranteed to work on non-English installations.

System.IO.StreamReader has a parameter to detect BOM.
How would you detect the encoding using that parameter for this file opening line, before you open it?

fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak, System.Text.Encoding.UTF8)
 

How would you detect the encoding using that parameter for this file opening line, before you open it?

fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak, System.Text.Encoding.UTF8)
Found this: it can't accurately know the encoding until the first read is done, which makes sense.
One way I could see of doing it is a binary read of the first few bytes at the beginning of the file and, depending on what they are, opening the file with UTF8 or UTF16.
StreamReader defaults to UTF-8 encoding unless specified otherwise, instead of defaulting to the ANSI code page for the current system. UTF-8 handles Unicode characters correctly and provides consistent results on localized versions of the operating system. If you get the current character encoding using the CurrentEncoding property, the value is not reliable until after the first Read method, since encoding auto detection is not done until the first call to a Read method.
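
The binary-read idea mentioned above could look roughly like this. It is a sketch under assumptions: the module name and the helper name EncodingFromBom are made up for illustration, and it only recognizes the five standard byte order marks. A file with no BOM returns Nothing, and the caller still has to decide what to assume then.

```vb
Imports System.IO
Imports System.Text

Module BomSniffSketch
    ' Hypothetical helper: returns the encoding named by the file's BOM,
    ' or Nothing when the file does not start with a known BOM.
    Function EncodingFromBom(filePath As String) As Encoding
        Dim head(3) As Byte   ' first four bytes of the file
        Dim n As Integer
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read)
            n = fs.Read(head, 0, head.Length)
        End Using

        ' Check UTF-32 LE before UTF-16 LE: both start with FF FE.
        If n >= 4 AndAlso head(0) = &HFF AndAlso head(1) = &HFE AndAlso
           head(2) = 0 AndAlso head(3) = 0 Then Return Encoding.UTF32
        If n >= 4 AndAlso head(0) = 0 AndAlso head(1) = 0 AndAlso
           head(2) = &HFE AndAlso head(3) = &HFF Then
            Return New UTF32Encoding(bigEndian:=True, byteOrderMark:=True)
        End If
        If n >= 3 AndAlso head(0) = &HEF AndAlso head(1) = &HBB AndAlso
           head(2) = &HBF Then Return Encoding.UTF8
        If n >= 2 AndAlso head(0) = &HFF AndAlso head(1) = &HFE Then Return Encoding.Unicode
        If n >= 2 AndAlso head(0) = &HFE AndAlso head(1) = &HFF Then Return Encoding.BigEndianUnicode
        Return Nothing
    End Function
End Module
```

The result could then be passed to the existing open call, for example My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak, If(EncodingFromBom(FilenameToBreak), System.Text.Encoding.UTF8)), where UTF8 is only the assumed default for files without a BOM.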
 

The guy gives a comprehensive response on the BOM, and says all the detection methods can still fail. The answer is that it's not as easy as you think. The BOM may also not exist in a Unicode file, if you believe what he says.
I will keep looking into this, but at least they have moved on to a standard of UTF8 for everything.
I am not opening just any old Unicode text file; my program is opening a MARC 21 file, which could be Unicode.
The MARC examples I have seen so far are UTF8 or plain ANSI.
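
For MARC 21 specifically, the record leader carries its own hint: leader position 09 (character coding scheme) is a blank for MARC-8 and the letter a for UCS/Unicode. A rough sketch of checking it is below. The module name and the helper name MarcDeclaresUnicode are hypothetical, and the sketch assumes the file starts directly with the 24-byte leader of the first record, with no BOM in front of it.

```vb
Imports System.IO

Module MarcLeaderSketch
    ' Hypothetical helper: True when Leader/09 of the first record is "a",
    ' which MARC 21 defines as UCS/Unicode. A blank there means MARC-8.
    ' Assumes no BOM precedes the leader.
    Function MarcDeclaresUnicode(filePath As String) As Boolean
        Dim leader(23) As Byte   ' the MARC leader is 24 bytes long
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read)
            If fs.Read(leader, 0, leader.Length) < leader.Length Then Return False
        End Using
        Return leader(9) = AscW("a"c)
    End Function
End Module
```

Note that a blank in that position means MARC-8, which is not the same thing as Windows code page 1252, so it only tells you the file is not declared as Unicode.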


This is a way I did this in vb6, and it was a unicode file, but being vb6, all I cared about was ANSI. Anyway it worked.
The forum keeps turning part of one line into a yellow emoticon no matter what I do. The line should read
Set ts = fso.OpenTextFile(FilenameToBreak)

Dim fso As New FileSystemObject
Dim ts As TextStream
Set ts = fso.OpenTextFile(FilenameToBreak)


Filenum2 = FreeFile
Open FileNameToCreate For Binary As #Filenum2 'new Marc file

'read first LDRData line for first record
'Line Input #Filenum1, LDRData

LDRData = ts.ReadLine
'get rid of the UTF-8 BOM, which shows up as "ï»¿" when the file is read as ANSI
If Left(LDRData, 3) = "ï»¿" Then LDRData = Replace(LDRData, "ï»¿", "")
'get rid of the UTF-16 BOM (it is 2 characters, so compare with Left(LDRData, 2))
If Left(LDRData, 2) = "þÿ" Or Left(LDRData, 2) = "ÿþ" Then LDRData = Mid(LDRData, 3)

LDRData = Trim(Mid(LDRData, 6))
'check for stray chr(30) in data and remove it, wont be any
LDRData = Replace(LDRData, Chr(30), "")
 

I didn’t read the linked post, but yes, the BOM is optional. You don’t want it when writing, for example, to an HTTP stream.
 

I didn’t read the linked post, but yes, the BOM is optional. You don’t want it when writing, for example, to an HTTP stream.
I found a code snippet that was part of a class module. It writes a file with a given Unicode encoding, then tells you which encoding was used to make the file.

I put it into a sub just to see what it does, and changed Console to Debug to see the output.

Is this reading the BOM, I suppose?
And what type of Unicode is New UnicodeEncoding()?
Is that UTF-16?
I saw options for that and UTF8, UTF7, UTF32.


Code:
  'Requires Imports System.IO and Imports System.Text at the top of the file.
  Private Sub tester()
      Dim path As String = "c:\temp\MyTest.txt"
      Try
          If File.Exists(path) Then
              File.Delete(path)
          End If

          'Write the test file with an explicit encoding.
          'New UnicodeEncoding() is UTF-16 little-endian and writes a BOM;
          'New UTF8Encoding() is UTF-8 and writes no BOM.
          ' Dim sw As StreamWriter = New StreamWriter(path, False, New UnicodeEncoding())
          Dim sw As StreamWriter = New StreamWriter(path, False, New UTF8Encoding())

          sw.WriteLine("This")
          sw.WriteLine("is some text")
          sw.WriteLine("to test")
          sw.WriteLine("Reading")
          sw.Close()

          'True = detect the encoding from the byte order mark, if there is one.
          Dim sr As StreamReader = New StreamReader(path, True)

          Do While sr.Peek() >= 0
              'Read each character once; calling Read() twice per pass
              'would skip every other character.
              Dim ch As Char = Convert.ToChar(sr.Read())
              Console.Write(ch)
              Debug.Write(ch)
          Loop

          'Test for the encoding after reading, or at least
          'after the first read.

          Debug.Print("The encoding used was {0}.", sr.CurrentEncoding)
          'Console.WriteLine()

          sr.Close()
      Catch e As Exception
          'Console.WriteLine("The process failed: {0}", e.ToString())
          Debug.Print("The process failed: {0}", e.ToString())
      End Try
  End Sub
 

Trying to understand what this is doing here, I put a countchar step into the loop, no matter how many loops, it just always says


The encoding used was System.Text.UTF8Encoding.

This on a file I made several years ago, with no unicode anything in my mind.
Is it just saying all text files are UTF8 encoding??

Code:
    Private Sub tester(FileName As String)
        Try
            'True = detect the encoding from the byte order mark, if there is one.
            Dim sr As New StreamReader(FileName, True)
            Dim Countchars As Integer
            Do While sr.Peek() >= 0
                'Read() consumes a character; Peek() alone never advances the reader.
                sr.Read()
                Countchars += 1
                If Countchars > 1000 Then Exit Do
            Loop
            Debug.WriteLine(" ")

            'Test for the encoding after reading, or at least
            'after the first read.

            Debug.Print("The encoding used was {0}.", sr.CurrentEncoding)
            Debug.WriteLine(" ")

            sr.Close()
        Catch e As Exception

            Debug.Print("The process failed: {0}", e.ToString())
            Debug.WriteLine(" ")
        End Try
    End Sub
 

The encoding used was System.Text.UTF32Encoding.

OK, I changed this code line for the MyTest file and looked at the file in Notepad. It does look very widely spaced, so it is UTF-32.

Dim sw As New StreamWriter(path, False, New UTF32Encoding())

I capped it at 10 loops. With only Peek() in the loop and no Read(), it never stops, because Peek() does not advance the reader.

It seems to report English text files that are not specifically UTF-8 as UTF-8.
But maybe this could be used to autodetect a file type and then read a Unicode file into a string with the appropriate setting

fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak, System.Text.Encoding.UTF8)

or

fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak, System.Text.Encoding.BigEndianUnicode)

Result in Notepad
Code:
 T h i s
 
 i s   s o m e   t e x t
 
 t o   t e s t
 
 R e a d i n g
 

So what is a file created with 'UnicodeEncoding'? Is it a BigEndianUnicode file?
I don't see any other Unicode option for creating a file in StreamWriter than these 4.

Code:
file created as          >  the reported result is

New UTF7Encoding()       > The encoding used was System.Text.UTF8Encoding.
New UTF8Encoding()       > The encoding used was System.Text.UTF8Encoding.
New UnicodeEncoding()    > The encoding used was System.Text.UnicodeEncoding.
New UTF32Encoding()      > The encoding used was System.Text.UTF32Encoding.
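
One way to see what each of these encodings actually writes at the start of a file is to print its preamble (the BOM bytes) with Encoding.GetPreamble. The sketch below is only an illustration; the module name and the helper name PreambleHex are made up. UTF7Encoding is left out because it is marked obsolete in .NET 5 and later.

```vb
Imports System
Imports System.Text

Module PreambleSketch
    ' Hypothetical helper: formats an encoding's BOM bytes as hex,
    ' or "(none)" when the encoding writes no BOM.
    Function PreambleHex(enc As Encoding) As String
        Dim bom As Byte() = enc.GetPreamble()
        If bom.Length = 0 Then Return "(none)"
        Return BitConverter.ToString(bom)
    End Function

    Sub Main()
        Console.WriteLine("New UTF8Encoding()        " & PreambleHex(New UTF8Encoding()))       ' (none)
        Console.WriteLine("Encoding.UTF8             " & PreambleHex(Encoding.UTF8))            ' EF-BB-BF
        Console.WriteLine("New UnicodeEncoding()     " & PreambleHex(New UnicodeEncoding()))    ' FF-FE
        Console.WriteLine("Encoding.BigEndianUnicode " & PreambleHex(Encoding.BigEndianUnicode)) ' FE-FF
        Console.WriteLine("New UTF32Encoding()       " & PreambleHex(New UTF32Encoding()))      ' FF-FE-00-00
    End Sub
End Module
```

So New UnicodeEncoding() is UTF-16 little-endian, not big-endian. New UTF8Encoding() writes no BOM at all, which is consistent with the table above: with nothing to detect, the reader simply reports its UTF-8 default.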
 
