Subject: Reading encoded ISO-8859-1 file with .Net
Category: Computers > Programming
Asked by: yoknows-ga
List Price: $25.00
Posted: 25 May 2005 12:04 PDT
Expires: 24 Jun 2005 12:04 PDT
Question ID: 525528

Question: Coding with .Net, why is the curly apostrophe character (’) not read properly using the encoding "iso-8859-1"? Note: The regular apostrophe (') is fine.

Background: Using vb.net, I am trying to properly read an HTML file. Since many files have encoded characters, I have to determine the character encoding of the file before I read it. To determine the encoding, I read the charset value in this tag:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

(If this "charset=" is missing from the file, then I guess the encoding type.)

My problem is: I am finding many files that have charset=iso-8859-1, and when I use the following code to read in the files, the curly apostrophe character (’) is not read properly. "Regular" apostrophes (') are read fine. The offending apostrophe is read in as the wrong character.

Here is my code (altered a bit for this question):

    Imports System
    Imports System.IO
    Imports System.Text

    Public Function getEncodedString(ByVal pFname As String) As String
        Dim filstream As New IO.FileStream(pFname, FileMode.Open) 'read from file
        Dim bufstream As New IO.BufferedStream(filstream) 'buffered wrapper (not used below)
        Dim encodingString As String = ""
        Dim _detectedEncoding As Encoding
        _detectedEncoding = Encoding.GetEncoding("iso-8859-1")
        filstream.Seek(0, SeekOrigin.Begin)
        Dim stream_reader As IO.StreamReader = Nothing
        Try
            stream_reader = New IO.StreamReader(filstream, _detectedEncoding, True)
            encodingString = stream_reader.ReadToEnd()
        Catch ex As Exception
            'do something
        Finally
            'clean up; the reader is still Nothing if its constructor threw
            If Not stream_reader Is Nothing Then
                stream_reader.Close()
            Else
                filstream.Close()
            End If
        End Try
        Return encodingString
    End Function

A sample "offending" webpage is http://www.wsiquicknetsolutions.com/services.asp?sec=1

In that page, the words with apostrophes (you’ve, doesn’t, and parent’s) are not properly read. However, when you browse that page in Firefox, it looks fine, and Firefox determines the encoding as ISO-8859-1.

Environment:
O/S: Windows 2000 SP 4
.Net Framework: .NET Framework 1.1 (version 1.1.4322.573)

There is no answer at this time.

Subject: Re: Reading encoded ISO-8859-1 file with .Net
From: pianoboy77-ga on 29 May 2005 18:06 PDT

Hi,

The ISO-8859-1 character set does not actually contain the "curly" apostrophe character. Windows-1252 is usually described as a superset of ISO-8859-1: the two agree on every byte except the range 0x80-0x9F, where ISO-8859-1 has no printable characters (only control codes) and Windows-1252 defines extended characters such as the curly apostrophes (smart quotes), the "em-dash", etc. The curly apostrophe is byte 0x92 in Windows-1252. These extended characters are so popular that many people forget they are not actually part of ISO-8859-1.

Please refer to the following links for more info on the ISO-8859-1 / Windows-1252 differences:

http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
http://www.cs.tut.fi/~jkorpela/chars.html#win

As you can see, technically, the pages you're loading that contain these characters are not actually ISO-8859-1 compliant. The reason the example web page you provided (and others) is being identified as ISO-8859-1 is likely due to the meta tags in the page. If you view the page source (in IE, right-click anywhere in the page and click "View Source"), you'll see this tag at the top of the file:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

So the page is telling the browser that it is ISO-8859-1, when in fact it's not. Browsers tend to be lenient about this and display the Windows-1252 characters anyway, which would explain why the page looks fine in Firefox, while a strict ISO-8859-1 decoder turns byte 0x92 into an invisible control character.

Since this could be a common problem, I would suggest that whenever you encounter iso-8859-1, you just use windows-1252 instead. I know this seems like a hack, but since windows-1252 only differs from iso-8859-1 by defining printable characters where iso-8859-1 has control codes, this should hopefully work without causing any new issues.

Hope this helps!
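
For illustration, here is a minimal sketch of the substitution suggested above, assuming the .NET Framework, where both encodings are built in (on newer .NET Core-based runtimes, code page 1252 may need an extra encoding provider registered first). The names Latin1Fallback, ResolveHtmlEncoding and ReadHtmlFile are hypothetical helpers made up for this example, not part of the original code. Comparing code pages rather than charset strings is one possible design choice: 28591 is the code page .NET reports for ISO-8859-1, so aliases of that charset resolve to the same value.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module Latin1Fallback

    ' Hypothetical helper: resolve the charset declared by the page, but
    ' substitute Windows-1252 (code page 1252) whenever the declared name
    ' resolves to ISO-8859-1 (code page 28591).
    Public Function ResolveHtmlEncoding(ByVal charsetName As String) As Encoding
        Dim declared As Encoding = Encoding.GetEncoding(charsetName)
        If declared.CodePage = 28591 Then
            Return Encoding.GetEncoding(1252)
        End If
        Return declared
    End Function

    ' Hypothetical helper: read a whole file with the resolved encoding.
    Public Function ReadHtmlFile(ByVal fileName As String, ByVal charsetName As String) As String
        Dim reader As New StreamReader(fileName, ResolveHtmlEncoding(charsetName))
        Try
            Return reader.ReadToEnd()
        Finally
            reader.Close() ' also closes the underlying file
        End Try
    End Function

    Sub Main()
        ' The bytes of "you've" with &H92 in place of the apostrophe.
        Dim bytes() As Byte = {&H79, &H6F, &H75, &H92, &H76, &H65}

        Dim asLatin1 As String = Encoding.GetEncoding("iso-8859-1").GetString(bytes)
        Dim asCp1252 As String = ResolveHtmlEncoding("iso-8859-1").GetString(bytes)

        ' Print the code point of the fourth character under each decoding.
        Console.WriteLine(Convert.ToInt32(asLatin1.Chars(3)).ToString("X4")) ' 0092 (control code)
        Console.WriteLine(Convert.ToInt32(asCp1252.Chars(3)).ToString("X4")) ' 2019 (curly apostrophe)
    End Sub

End Module
```

In the question's function, this would amount to passing the charset read from the meta tag through ResolveHtmlEncoding before constructing the StreamReader. Pages that really are pure ISO-8859-1 should be unaffected, since the two encodings only disagree on bytes 0x80-0x9F.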