Solved VS2022, how can I read properly unicode from a file into a string? So far everything converts to ansi

sdowney717 · May 18, 2024

Works great only for English text

Code:

        Dim fileReader As System.IO.StreamReader
        fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak)
        Dim stringReader As String
        stringReader = fileReader.ReadToEnd()

or

Code:

   Dim content As string = IO.File.ReadAllText(FilenameToBreak)

Imagine a text file with a line like this. All the Swedish characters turn into ????
00170 2200061 4500500004000000520005200040901001600092 aHur mår du?Hur är läget?FörlåtHejdå aHur är läget?Hur mår du?FörlåtHur mår du?Förlåt a85843322572

sdowney717 · May 18, 2024

How to: Read From Text Files - Visual Basic

Learn more about: How to: Read From Text Files in Visual Basic

learn.microsoft.com

Their Unicode example does not work here

I thought it should be possible to read a text file containing English, Swedish, and Chinese characters, and that it would not be hard to do.
So far I have found no examples of how to do it.
I saw some mention of setting a code page for a specific language, but that is no good if a file has characters from many different languages.

sdowney717 · May 18, 2024

This works, but I bet only for code page 1252 languages.
That is no good if your text file has, say, English, Swedish, and Chinese characters.
How is this done?
You can't assume a file is Swedish or Indonesian or Arabic or a mix of languages.
How can you assume anything about the text in a text file?

Code:

 Dim content As String = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.GetEncoding(1252))

Code:

" ChrW(31) & "aHur mår du?Hur är läget?FörlåtHejdå" & ChrW(30) & "  " & ChrW(31) & "aHur är läget?Hur mår du?FörlåtHur mår du?Förlåt" & ChrW(30) & "  " & ChrW(31) & "a85843322572" & ChrW(30) & ChrW(29)

sdowney717 · May 18, 2024

I may have resolved it this way

Maybe I got it with this?
This looks good, so I will mark it resolved.

Code:

        Dim fileReader As System.IO.StreamReader
        fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak, System.Text.Encoding.UTF8)
        Dim stringReader As String
        stringReader = fileReader.ReadToEnd()

Code:

?stringreader
"""aHur mår du?Hur är läget?FörlåtHejdå  圖圖3解弓月金難手中大""

garlin · May 18, 2024

I would suggest in future postings, to write "How can I...Visual Basic or VB" instead of VS2022. At this point, your IDE doesn't matter.

The problem is the same in multiple programming languages. You need to inspect the header bytes to see if any possible encoding is defined, and then pivot based on the headers. Presuming it's UTF8 is arbitrary and will get you into trouble.

Here's someone's solution in VB:

[RESOLVED] [2005] How to determine file encoding?-VBForums

I want to read a user supplied text file, display it in a richtextbox and then write it back out after some changes. I can easily read the file and display it, regardless of whether it is ANSI, Unicode, UTF etc. But when I write it back out, I need to know which encoding to use. So somehow...

www.vbforums.com

sdowney717 · May 18, 2024

Also adding this single line seems to be working now.
Who knows if it will keep working

Code:

Dim content As String = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF8)

sdowney717 · May 18, 2024

garlin said:
I would suggest in future postings, to write "How can I...Visual Basic or VB" instead of VS2022. At this point, your IDE doesn't matter.

The problem is the same in multiple programming languages. You need to inspect the header bytes to see if any possible encoding is defined, and then pivot based on the headers. Presuming it's UTF8 is arbitrary and will get you into trouble.

Been wondering about it as well.

I can easily check for a BOM, as I did that in vb6.
I don't have the experience with Unicode files to know if today this is still relevant, or if Windows just knows what to do.

Would be nice to have a site where you can download various types of Unicode files to test.

sdowney717 · May 18, 2024

For example, I just created an English text file in Notepad and ran it through the VS2022 subroutine.
It worked and gave back perfect English text, nothing strange. So telling it to use UTF-8 encoding had no effect on the English output.

I read the file, whose name is in the variable FilenameToBreak, using this line with UTF8:

Dim content As String = IO.File.ReadAllText(FilenameToBreak, System.Text.Encoding.UTF8)

gave this result

?content
"A test of an english language file written in notepad" & vbCrLf

So it looks like today's Windows handles it.

pseymour · May 18, 2024

sdowney717 said:
So it looks like today's Windows handles it.

That would depend on the default encoding on each installation of Windows. Not guaranteed to work on non-English installations.

System.IO.StreamReader has a parameter to detect BOM.

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first four bytes of the stream. It automatically recognizes UTF-8, little-endian UTF-16, big-endian UTF-16, little-endian UTF-32, and big-endian UTF-32 text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
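
As a rough illustration of the parameter described above, the sketch below opens a file with a caller-supplied fallback encoding and lets StreamReader switch encodings if it finds a byte order mark. The helper name ReadAllTextDetectingBom and the module name are made up for this example; only the StreamReader(String, Encoding, Boolean) constructor comes from the framework.

```vb
Imports System
Imports System.IO
Imports System.Text

Module BomAwareRead
    ' Hypothetical helper (not from the thread): reads a whole file, honouring a
    ' BOM if one is present and using the fallback encoding otherwise.
    Function ReadAllTextDetectingBom(filePath As String, fallback As Encoding) As String
        ' Third argument True = detectEncodingFromByteOrderMarks.
        Using reader As New StreamReader(filePath, fallback, True)
            Return reader.ReadToEnd()
        End Using
    End Function

    Sub Main()
        ' Write a UTF-16 file (Encoding.Unicode emits a BOM), then read it back
        ' while naming UTF-8 as the fallback. The BOM should win.
        Dim tempPath As String = Path.GetTempFileName()
        File.WriteAllText(tempPath, "Hur mår du? 圖", Encoding.Unicode)
        Console.WriteLine(ReadAllTextDetectingBom(tempPath, Encoding.UTF8))
        File.Delete(tempPath)
    End Sub
End Module
```

The fallback only matters for files without a BOM, so choosing it is still a guess about the data.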

sdowney717 · May 18, 2024

Good reading discussion of Unicode today
UTF8 is the preferred standard.

What is the difference between UTF-8 and ISO-8859-1?

stackoverflow.com

How many characters can UTF-8 encode?

If UTF-8 is 8 bits, does it not mean that there can be only maximum of 256 different characters? The first 128 code points are the same as in ASCII. But it says UTF-8 can support up to million of

stackoverflow.com

pseymour · May 18, 2024

For sure, UTF-8 is the most common. Up to you whether you want to assume it or not.

sdowney717 · May 18, 2024

pseymour said:
That would depend on the default encoding on each installation of Windows. Not guaranteed to work on non-English installations.

System.IO.StreamReader has a parameter to detect BOM.

How would you detect the encoding using that parameter for this file opening line, before you open it?

fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak, System.Text.Encoding.UTF8)

sdowney717 · May 18, 2024

sdowney717 said:
How would you detect the encoding using that parameter for this file opening line, before you open it?

fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak, System.Text.Encoding.UTF8)

Found this: it can't accurately know until the first read is done, which makes sense.
I could see one way of doing it: read the first few bytes of the file in binary mode and, depending on what they are, open the file as UTF-8 or UTF-16.

StreamReader Class (System.IO)

Implements a TextReader that reads characters from a byte stream in a particular encoding.

learn.microsoft.com

StreamReader defaults to UTF-8 encoding unless specified otherwise, instead of defaulting to the ANSI code page for the current system. UTF-8 handles Unicode characters correctly and provides consistent results on localized versions of the operating system. If you get the current character encoding using the CurrentEncoding property, the value is not reliable until after the first Read method, since encoding auto detection is not done until the first call to a Read method.
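
The idea of reading the first few bytes before opening the file could look something like the sketch below. DetectBomEncoding is a hypothetical name, and the function only recognizes the standard byte order marks; it returns Nothing for a file without one, so the caller still has to pick an encoding in that case.

```vb
Imports System
Imports System.IO
Imports System.Text

Module BomSniffer
    ' Hypothetical helper: returns the encoding implied by a byte order mark,
    ' or Nothing when the file does not start with one.
    Function DetectBomEncoding(filePath As String) As Encoding
        Dim bom(3) As Byte
        Dim count As Integer
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read)
            count = fs.Read(bom, 0, bom.Length)
        End Using

        ' UTF-32 LE must be tested before UTF-16 LE: both start with FF FE.
        If count >= 4 AndAlso bom(0) = &HFF AndAlso bom(1) = &HFE AndAlso bom(2) = 0 AndAlso bom(3) = 0 Then
            Return Encoding.UTF32
        End If
        If count >= 4 AndAlso bom(0) = 0 AndAlso bom(1) = 0 AndAlso bom(2) = &HFE AndAlso bom(3) = &HFF Then
            Return New UTF32Encoding(True, True) ' big-endian UTF-32
        End If
        If count >= 3 AndAlso bom(0) = &HEF AndAlso bom(1) = &HBB AndAlso bom(2) = &HBF Then
            Return Encoding.UTF8
        End If
        If count >= 2 AndAlso bom(0) = &HFF AndAlso bom(1) = &HFE Then Return Encoding.Unicode
        If count >= 2 AndAlso bom(0) = &HFE AndAlso bom(1) = &HFF Then Return Encoding.BigEndianUnicode
        Return Nothing
    End Function

    Sub Main()
        Dim tempPath As String = Path.GetTempFileName()
        File.WriteAllText(tempPath, "test", Encoding.BigEndianUnicode)
        Dim detected As Encoding = DetectBomEncoding(tempPath)
        Console.WriteLine(If(detected Is Nothing, "no BOM", detected.WebName))
        File.Delete(tempPath)
    End Sub
End Module
```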

sdowney717 · May 18, 2024

This guy gives a comprehensive response on the BOM, and all the detection methods can still fail. The answer is that it's not as easy as you think. The BOM may also not exist in a Unicode file, if you believe what he says.
I will keep looking into this, but at least they have moved on to UTF-8 as a standard for everything.
I am not opening just any old Unicode text file; my program is opening a MARC 21 file, which could be Unicode.
The MARC examples I have seen are so far UTF8 or plain ANSI.

Determine TextFile Encoding?

I need to determine if a text file's content is equal to one of these text encodings: System.Text.Encoding.ASCII System.Text.Encoding.BigEndianUnicode ' UTF-L 16 System.Text.Encoding.Default ' ANSI

stackoverflow.com
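
For files with no BOM, one common heuristic (not something proposed in the thread) is to try decoding the bytes as strict UTF-8 and fall back to a single-byte encoding only if that fails. The sketch below assumes that approach; DecodeUtf8OrFallback is a made-up name and the fallback encoding is the caller's guess. It can still be wrong, since some non-UTF-8 byte sequences happen to be valid UTF-8.

```vb
Imports System
Imports System.Text

Module StrictUtf8Reader
    ' Hypothetical helper: decode as UTF-8 if the bytes are valid UTF-8,
    ' otherwise decode with the supplied fallback encoding.
    Function DecodeUtf8OrFallback(bytes As Byte(), fallback As Encoding) As String
        ' Second argument True = throw on invalid bytes instead of substituting U+FFFD.
        Dim strictUtf8 As New UTF8Encoding(False, True)
        Try
            ' Note: GetString does not strip a leading BOM; a U+FEFF character
            ' stays in the result if the bytes start with EF BB BF.
            Return strictUtf8.GetString(bytes)
        Catch ex As DecoderFallbackException
            Return fallback.GetString(bytes)
        End Try
    End Function

    Sub Main()
        ' "Förlåt" stored as single bytes (ISO-8859-1 / code page 1252 style).
        Dim ansiBytes As Byte() = {&H46, &HF6, &H72, &H6C, &HE5, &H74}
        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")
        Console.WriteLine(DecodeUtf8OrFallback(ansiBytes, latin1))

        ' The same word as real UTF-8 takes the first branch.
        Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes("Förlåt")
        Console.WriteLine(DecodeUtf8OrFallback(utf8Bytes, latin1))
    End Sub
End Module
```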

This is how I did it in VB6. It was a Unicode file, but being VB6, all I cared about was ANSI. Anyway, it worked.

Dim fso As New FileSystemObject
Dim ts As TextStream
Set ts = fso.OpenTextFile(FilenameToBreak)

Filenum2 = FreeFile
Open FileNameToCreate For Binary As #Filenum2 'new Marc file

'read first LDRData line for first record
'Line Input #Filenum1, LDRData

LDRData = ts.ReadLine
'strip the UTF-8 BOM, which shows up as 'ï»¿' when the file is read as ANSI
If Left(LDRData, 3) = "ï»¿" Then LDRData = Mid(LDRData, 4)
'strip the UTF-16 BOM (two characters, so compare only the first two)
If Left(LDRData, 2) = "þÿ" Or Left(LDRData, 2) = "ÿþ" Then LDRData = Mid(LDRData, 3)

LDRData = Trim(Mid(LDRData, 6))
'check for stray chr(30) in data and remove it, wont be any
LDRData = Replace(LDRData, Chr(30), "")

pseymour · May 18, 2024

I didn’t read the linked post, but yes, the BOM is optional. You don’t want it when writing, for example, to an HTTP stream.

sdowney717 · May 19, 2024

pseymour said:
I didn’t read the linked post, but yes, the BOM is optional. You don’t want it when writing, for example, to an HTTP stream.

I found a code snippet, originally part of a class module, that writes a file with a given Unicode encoding and then reports the encoding used to make the file.

I put it into a sub just to see what it does, and changed Console to Debug to see the output.

I suppose this is reading the BOM?
And what type of Unicode is New UnicodeEncoding()?
Is that UTF-16?
I saw options for that and UTF8, UTF7, UTF32.

Code:

  ' Requires Imports System.IO and Imports System.Text at the top of the file.
  Private Sub tester()
      Dim path As String = "c:\temp\MyTest.txt"
      Try
          If File.Exists(path) Then
              File.Delete(path)
          End If

          ' Swap in another encoding here to compare results.
          ' Note: New UTF8Encoding() writes no BOM; New UnicodeEncoding() does.
          ' Dim sw As StreamWriter = New StreamWriter(path, False, New UnicodeEncoding())
          Dim sw As StreamWriter = New StreamWriter(path, False, New UTF8Encoding())

          sw.WriteLine("This")
          sw.WriteLine("is some text")
          sw.WriteLine("to test")
          sw.WriteLine("Reading")
          sw.Close()

          ' True = detect the encoding from a byte order mark, if one is present.
          Dim sr As StreamReader = New StreamReader(path, True)

          Do While sr.Peek() >= 0
              ' Read each character once; calling Read() twice per pass skips characters.
              Dim ch As Char = Convert.ToChar(sr.Read())
              Console.Write(ch)
              Debug.Write(ch)
          Loop

          'Test for the encoding after reading, or at least
          'after the first read.

          Debug.Print("The encoding used was {0}.", sr.CurrentEncoding)
          'Console.WriteLine()

          sr.Close()
      Catch e As Exception
          'Console.WriteLine("The process failed: {0}", e.ToString())
          Debug.Print("The process failed: {0}", e.ToString())
      End Try
  End Sub

sdowney717 · May 19, 2024

Trying to understand what this is doing, I put a Countchars step into the loop. No matter how many loops it runs, it always says

The encoding used was System.Text.UTF8Encoding.

This is on a file I made several years ago, with nothing Unicode in mind.
Is it just saying all text files are UTF-8 encoded?

Code:

    Private Sub tester(FileName As String)
        Try
            ' True = detect the encoding from a byte order mark, if one is present.
            Dim sr As New StreamReader(FileName, True)
            Dim Countchars As Integer
            Do While sr.Peek() >= 0
                ' Peek() does not advance the reader, so Read() must be called
                ' or the loop never reaches the end of the file.
                sr.Read()
                Countchars += 1
                If Countchars > 1000 Then Exit Do
            Loop
            Debug.WriteLine(" ")

            'Test for the encoding after reading, or at least
            'after the first read.

            Debug.Print("The encoding used was {0}.", sr.CurrentEncoding)
            Debug.WriteLine(" ")

            sr.Close()
        Catch e As Exception

            Debug.Print("The process failed: {0}", e.ToString())
            Debug.WriteLine(" ")
        End Try
    End Sub

sdowney717 · May 19, 2024

The encoding used was System.Text.UTF32Encoding.

OK, I changed this line of code for the MyTest file and looked at the file in Notepad. It does look very widely spaced, so it is UTF-32.

Dim sw As New StreamWriter(path, False, New UTF32Encoding())

I set it to do 10 loops. With the Read call commented out, the loop never stops on its own, because Peek does not advance through the file.

It seems to report English text files that were not specifically saved as UTF-8 as UTF-8.
But maybe this could be used to autodetect a file's encoding and then read a Unicode file into a string with the appropriate setting:

fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak, System.Text.Encoding.UTF8)

or

fileReader = My.Computer.FileSystem.OpenTextFileReader(FilenameToBreak, System.Text.Encoding.BigEndianUnicode)

Result in Notepad

Code:

 T h i s
 
 i s   s o m e   t e x t
 
 t o   t e s t
 
 R e a d i n g

sdowney717 · May 19, 2024

So what is a file created with 'UnicodeEncoding'? Is it a BigEndianUnicode file?
I don't see any other Unicode options for creating a file with StreamWriter than these 4.

Code:

file created as           >  the reported result is

New UTF7Encoding())       > The encoding used was System.Text.UTF8Encoding.
New UTF8Encoding())       > The encoding used was System.Text.UTF8Encoding.
New UnicodeEncoding())    > The encoding used was System.Text.UnicodeEncoding.
New UTF32Encoding())      > The encoding used was System.Text.UTF32Encoding.
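
One way to make sense of the table above is to look at the byte order mark each encoding object would write. The sketch below prints the preamble bytes via Encoding.GetPreamble; the module and function names are made up for illustration. As far as the framework's defaults go, New UnicodeEncoding() is UTF-16 little-endian with a BOM (not big-endian), and New UTF8Encoding() writes no BOM at all, so a StreamReader reporting UTF8Encoding for such a file is most likely just reporting its own default rather than something it detected. UTF-7 is left out because the class is marked obsolete in recent .NET versions.

```vb
Imports System
Imports System.Text

Module PreambleDemo
    ' Hypothetical helper: the BOM an encoding would emit, as hex, or "" if none.
    Function PreambleHex(enc As Encoding) As String
        Return BitConverter.ToString(enc.GetPreamble())
    End Function

    Sub Main()
        Console.WriteLine("New UTF8Encoding():        [" & PreambleHex(New UTF8Encoding()) & "]")
        Console.WriteLine("Encoding.UTF8:             [" & PreambleHex(Encoding.UTF8) & "]")
        Console.WriteLine("New UnicodeEncoding():     [" & PreambleHex(New UnicodeEncoding()) & "]")
        Console.WriteLine("Encoding.BigEndianUnicode: [" & PreambleHex(Encoding.BigEndianUnicode) & "]")
        Console.WriteLine("New UTF32Encoding():       [" & PreambleHex(New UTF32Encoding()) & "]")
    End Sub
End Module
```

An empty pair of brackets means no BOM is written, which is the case where detection has nothing to go on.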