Q: Reading encoded ISO-8859-1 file with .Net ( No Answer,   1 Comment )
Question  
Subject: Reading encoded ISO-8859-1 file with .Net
Category: Computers > Programming
Asked by: yoknows-ga
List Price: $25.00
Posted: 25 May 2005 12:04 PDT
Expires: 24 Jun 2005 12:04 PDT
Question ID: 525528
Question:
Coding with .Net, why is the apostrophe character (’) not read
properly using the encoding "iso-8859-1"?  Note: The regular
apostrophe (') is fine.

Background:
Using vb.net, I am trying to properly read an HTML file.  Since many
files have encoded characters, I have to determine the character
encoding of the file before I read it. To determine the encoding, I
read the charset value in the <meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1"> (If this "charset=" is
missing from the file, then I guess the encoding type).
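
The charset lookup described above could be sketched roughly as follows. This is only an illustration: the helper name FindDeclaredCharset and the regular expression are assumptions, not the code actually used, and it presumes the start of the file has already been read into a string (the meta tag itself is plain ASCII, so any ASCII-compatible encoding will do for that first pass).

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharsetSniffer
    ' Hypothetical helper: return the charset named in a <meta> tag,
    ' or Nothing when the page does not declare one.
    Function FindDeclaredCharset(ByVal html As String) As String
        Dim m As Match = Regex.Match(html, _
            "<meta[^>]*charset\s*=\s*[""']?([A-Za-z0-9_.:\-]+)", _
            RegexOptions.IgnoreCase)
        If m.Success Then
            Return m.Groups(1).Value
        End If
        Return Nothing
    End Function

    Sub Main()
        Dim sample As String = _
            "<meta http-equiv=""Content-Type"" content=""text/html; charset=iso-8859-1"">"
        Console.WriteLine(FindDeclaredCharset(sample))   ' prints iso-8859-1
    End Sub
End Module
```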

My problem is:
I am finding many files that have charset=iso-8859-1, and when I use
the following code to read in the files, an apostrophe character (’)
is not read properly.  "Regular" apostrophes (') are read fine.  The
offending apostrophe is read in as the wrong character.  Here is my
code (altered a bit for this question):

    ' Requires "Imports System.IO" and "Imports System.Text" at the top of the file
    Public Function getEncodedString(ByVal pFname As String) As String
        Dim encodingString As String = ""
        ' Encoding taken from the page's declared charset
        Dim _detectedEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
        Dim filstream As FileStream = Nothing
        Dim stream_reader As StreamReader = Nothing
        Try
            filstream = New FileStream(pFname, FileMode.Open, FileAccess.Read)     'read from file
            ' True: a byte-order mark, if present, overrides _detectedEncoding
            stream_reader = New StreamReader(filstream, _detectedEncoding, True)
            encodingString = stream_reader.ReadToEnd()
        Catch ex As Exception
            'do something
        Finally
            ' clean up; closing the reader also closes the underlying file stream
            If Not stream_reader Is Nothing Then
                stream_reader.Close()
            ElseIf Not filstream Is Nothing Then
                filstream.Close()
            End If
        End Try
        Return encodingString
    End Function

A sample "offending" webpage is
http://www.wsiquicknetsolutions.com/services.asp?sec=1
In that page, the words with apostrophes: you’ve, doesn’t, and
parent’s are not properly read.  However, when you browse that page in
parent?s are not properly read.  However, when you browse that page in
Firefox, it looks fine.  And Firefox determines the encoding as
ISO-8859-1.

Environment:
O/S: Windows 2000 SP 4
.Net Framework: .NET Framework 1.1 (version 1.1.4322.573)
Answer  
There is no answer at this time.

Comments  
Subject: Re: Reading encoded ISO-8859-1 file with .Net
From: pianoboy77-ga on 29 May 2005 18:06 PDT
 
Hi,
 The ISO-8859-1 character set does not actually contain the "curly"
apostrophe character. Windows-1252 matches ISO-8859-1 everywhere except
bytes 0x80-0x9F: ISO-8859-1 leaves that range to control codes, while
Windows-1252 uses most of it for extended characters such as the curly
apostrophes (smart quotes), the "em-dash", etc. The curly apostrophe,
for example, is byte 0x92 in Windows-1252. These extended characters
are so popular that many people forget they are not actually part of
ISO-8859-1.
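
To make the difference concrete, here is a small sketch that decodes the same bytes with both encodings. It assumes the .NET Framework, where code page 1252 is available through Encoding.GetEncoding without extra setup; the module name and sample bytes are made up for illustration.

```vbnet
Imports System
Imports System.Text

Module CurlyApostropheDemo
    Sub Main()
        ' The bytes of "you've" as a Windows-1252 page would store it,
        ' with the curly apostrophe as the single byte &H92.
        Dim raw As Byte() = {&H79, &H6F, &H75, &H92, &H76, &H65}

        Dim latin1 As Encoding = Encoding.GetEncoding("iso-8859-1")
        Dim cp1252 As Encoding = Encoding.GetEncoding(1252)

        Dim asLatin1 As String = latin1.GetString(raw)
        Dim asCp1252 As String = cp1252.GetString(raw)

        ' ISO-8859-1 maps &H92 to the control code U+0092
        Console.WriteLine("ISO-8859-1:   U+{0:X4}", AscW(asLatin1.Chars(3)))
        ' Windows-1252 maps &H92 to U+2019, the right single quotation mark
        Console.WriteLine("Windows-1252: U+{0:X4}", AscW(asCp1252.Chars(3)))
    End Sub
End Module
```

The first line should print U+0092 and the second U+2019, which is why the apostrophe comes out as the wrong character when the file is read as ISO-8859-1.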

Please refer to the following links for more info on the ISO-8859-1 /
Windows-1252 differences:

http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
http://www.cs.tut.fi/~jkorpela/chars.html#win

As you can see, technically, the pages you're loading that contain
these characters are not actually ISO-8859-1 compliant. The example web
page you provided (and others like it) is being identified as
ISO-8859-1 most likely because of the meta tag in the page. If you view
the page source (in IE, right-click anywhere in the page, click "View
Source"), you'll see at the top of the file this tag:
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

So the page is telling the browser that it is ISO-8859-1, when in fact it's not.

Since this could be a common problem, I would suggest that whenever
you encounter iso-8859-1, you use windows-1252 instead. I know this
seems like a hack, but windows-1252 decodes every byte the same way as
iso-8859-1 except 0x80-0x9F, which iso-8859-1 reserves for control
codes that rarely appear in HTML, so this should work without causing
any new issues.
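
A minimal sketch of that substitution, assuming the .NET Framework (where code page 1252 needs no extra setup). The helper name ResolveEncoding is hypothetical. Web browsers generally apply the same substitution to pages labelled iso-8859-1, which would explain why the sample page looks fine in Firefox.

```vbnet
Imports System
Imports System.Globalization
Imports System.Text

Module CharsetFallback
    ' Hypothetical helper: choose the Encoding to decode with, given the
    ' charset name declared by the page.
    Function ResolveEncoding(ByVal declaredCharset As String) As Encoding
        Dim name As String = declaredCharset.Trim()
        ' Case-insensitive, culture-independent comparison
        If String.Compare(name, "iso-8859-1", True, CultureInfo.InvariantCulture) = 0 Then
            ' Assumption: pages labelled iso-8859-1 were really written as windows-1252
            Return Encoding.GetEncoding(1252)
        End If
        ' Any other name is passed through; GetEncoding throws if it is not supported
        Return Encoding.GetEncoding(name)
    End Function

    Sub Main()
        Dim enc As Encoding = ResolveEncoding("ISO-8859-1")
        Console.WriteLine(enc.CodePage)   ' prints 1252
    End Sub
End Module
```

In a function like getEncodedString, the result of ResolveEncoding would be passed to the StreamReader constructor in place of the encoding looked up directly from the charset name.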

Hope this helps!
