Subject: Reading encoded ISO-8859-1 file with .Net
Category: Computers > Programming
Asked by: yoknows-ga
List Price: $25.00
Posted: 25 May 2005 12:04 PDT
Expires: 24 Jun 2005 12:04 PDT
Question ID: 525528
Question:
Coding with .Net, why is the apostrophe character (’) not read
properly using the encoding "iso-8859-1"? Note: The regular
apostrophe (') is fine.
Background:
Using vb.net, I am trying to properly read an HTML file. Since many
files have encoded characters, I have to determine the character
encoding of the file before I read it. To determine the encoding, I
read the charset value in the <meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1"> (If this "charset=" is
missing from the file, then I guess the encoding type).
My problem is:
I am finding many files that have charset=iso-8859-1, and when I use
the following code to read in the files, an apostrophe character (’)
is not read properly. "Regular" apostrophes (') are read fine. The
offending apostrophe is read in as the wrong character. Here is my
code (altered a bit for this question):
Imports System.IO
Imports System.Text

Public Function getEncodedString(ByVal pFname As String) As String
    Dim filstream As New FileStream(pFname, FileMode.Open) 'read from file
    Dim encodingString As String = ""
    Dim _detectedEncoding As Encoding
    ' encoding name taken from the charset value in the page's meta tag
    _detectedEncoding = Encoding.GetEncoding("iso-8859-1")
    Dim stream_reader As StreamReader = Nothing
    Try
        ' True = a byte order mark, if present, overrides the encoding passed in
        stream_reader = New StreamReader(filstream, _detectedEncoding, True)
        encodingString = stream_reader.ReadToEnd()
    Catch ex As Exception
        'do something
    Finally
        ' clean up; the reader is Nothing if the constructor failed
        If Not stream_reader Is Nothing Then stream_reader.Close()
    End Try
    Return encodingString
End Function
A sample "offending" webpage is
http://www.wsiquicknetsolutions.com/services.asp?sec=1
In that page, the words with apostrophes: you’ve, doesn’t, and
parent?s are not properly read. However, when you browse that page in
Firefox, it looks fine. And Firefox determines the encoding as
ISO-8859-1.
Environment:
O/S: Windows 2000 SP 4
.Net Framework: .NET Framework 1.1 (version 1.1.4322.573)
There is no answer at this time.
Subject: Re: Reading encoded ISO-8859-1 file with .Net
From: pianoboy77-ga on 29 May 2005 18:06 PDT
Hi,

The ISO-8859-1 character set does not actually contain the "curly" apostrophe character. Windows-1252 is a superset of ISO-8859-1 that defines some extended characters such as the curly apostrophes (smart quotes), the "em-dash", etc. It places them in the byte range 0x80-0x9F, which ISO-8859-1 reserves for control characters; the curly apostrophe, for example, is byte 0x92. These extended characters are so popular that many people forget they are not actually part of ISO-8859-1. Please refer to the following links for more info on the ISO-8859-1 / Windows-1252 differences:

http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
http://www.cs.tut.fi/~jkorpela/chars.html#win

As you can see, technically, the pages you're loading that contain these characters are not actually ISO-8859-1 compliant. The example web page you provided (and others like it) is most likely being identified as ISO-8859-1 because of the meta tag in the page. If you view the page source (in IE, right-click anywhere in the page, click "View Source"), you'll see this tag at the top of the file:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

So the page is telling the browser that it is ISO-8859-1, when in fact it's not. Browsers generally read such pages as windows-1252 anyway, which would explain why the page looks fine in Firefox.

Since this could be a common problem, I would suggest that whenever you encounter iso-8859-1, you use windows-1252 instead. I know this seems like a hack, but since windows-1252 differs from iso-8859-1 only in that one byte range, where it defines a few more printable characters, this should work without causing any new issues.

Hope this helps!
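The substitution suggested above could look roughly like the following. This is only a minimal sketch, assuming .NET Framework (as in the question's environment), where the "windows-1252" encoding is available by name. The module and the helpers ResolveEncoding and ReadWithCharset are hypothetical names made up for illustration; they are not part of the asker's code or of the framework.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module EncodingSketch

    ' Hypothetical helper: maps a declared charset label to the Encoding to read with.
    ' Assumption: every page labelled iso-8859-1 is treated as windows-1252.
    Public Function ResolveEncoding(ByVal charsetLabel As String) As Encoding
        Dim label As String = charsetLabel.Trim().ToLower()
        If label = "iso-8859-1" Then
            Return Encoding.GetEncoding("windows-1252")
        End If
        Return Encoding.GetEncoding(label)
    End Function

    ' Hypothetical helper: reads the whole file using the resolved encoding.
    Public Function ReadWithCharset(ByVal pFname As String, ByVal charsetLabel As String) As String
        ' True = a byte order mark, if present, overrides the encoding passed in
        Dim reader As New StreamReader(pFname, ResolveEncoding(charsetLabel), True)
        Try
            Return reader.ReadToEnd()
        Finally
            reader.Close()
        End Try
    End Function

    Sub Main()
        ' The bytes for "you", then 0x92, then "ve"
        Dim raw() As Byte = {&H79, &H6F, &H75, &H92, &H76, &H65}
        Dim decoded As String = ResolveEncoding("ISO-8859-1").GetString(raw)
        ' Compares the fourth character with U+2019 (right single quotation mark)
        Console.WriteLine(AscW(decoded.Chars(3)) = &H2019)
    End Sub

End Module
```

Keeping the mapping in one place means the file-reading code does not change. If other mislabelled charsets turn up, they can be handled in ResolveEncoding as well.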