Q: ASCII/Unicode in Word / VBA ( Answered 5 out of 5 stars,   0 Comments )
Question  
Subject: ASCII/Unicode in Word / VBA
Category: Computers > Programming
Asked by: viseu-ga
List Price: $5.00
Posted: 28 Feb 2003 15:08 PST
Expires: 30 Mar 2003 15:08 PST
Question ID: 168532
The specific resolution to this question is less important to me than
understanding the concepts involved.  I am working with VBA in Word to
convert extended characters in DOS files that show up as gibberish. 
Currently I do this with a long series of replace-alls based on
trial-and-error: for example, I figured out that Chr(131) is an a with
a circumflex; Chr(132) is an a with an umlaut; Chr(133) is an a with
an accent grave.  Word uses different values, which I believe are
called Unicode.

My program works fine, but I wonder if there's a more systematic way
to do this.  And could someone give me an explanation of this
incompatibility?

Request for Question Clarification by mathtalk-ga on 28 Feb 2003 16:54 PST
Hi, viseu-ga:

Have you looked at the Character Map applet (under Accessories/System
Tools in XP)?  It shows the "upper ASCII" character map for the
various fonts installed on your system.  What code goes with what
character varies with the font.

There is a historical reason for this incompatibility.  I'm not sure
what would best answer your question, but I'd be happy to post a brief
history of ASCII vs. Unicode as an answer.

regards, mathtalk

Request for Question Clarification by j_philipp-ga on 01 Mar 2003 02:12 PST
Hello Viseu,

Since I do not have Microsoft Word on this computer, could you tell me
if the following macro does the job?

Code Page Converter v2.0 macros for MS Word
http://www.hermessoft.com/newproject/cpc/cpc.html
"The file cp_conv.dot contains some macros for processing and
converting of text from ASCII to Unicode text format"

As for the ASCII vs. Unicode incompatibility, please see the
following:

ASCII vs. Unicode
http://www.devincook.com/goldparser/doc/about/about-unicode.htm
"One of the most common storage formats is ASCII (American Standard
Code for Information Interchange). ASCII is 7-bit, giving a total of
128 possible characters.  In most cases, it is extended to 8-bit,
giving a full 256. Older systems, and some current ones as well, use
ASCII as the primary format for storing text. These include: DOS,
Mac-OS, Windows 95 (not the NT/2000 series).

Unfortunately, given the rather small number of possible characters in
ASCII, it is unable to represent any languages besides English and
other western languages.

The solution to this dilemma is Unicode. In Unicode, each logical
character is represented by a 32-bit integer, giving a total 65535
characters that can represented. This allows Unicode to represent all
the ASCII codes, the characters of other languages, and leaves plenty
of room for expansion."

Thanks!

Search terms:
microsoft word "ASCII to Unicode" conversion

Clarification of Question by viseu-ga on 01 Mar 2003 16:32 PST
Hi, both these comments are a good start.  J_Philipp's quotation about
the ASCII/Unicode difference is helpful, but the macro he referred me
to seems to require payment to be downloaded, and even if I did so I'm
sure the code is unviewable; so I wouldn't really learn how they do
it.  With mathtalk's answer, Character Map would have helped while I
was figuring out the correspondences one-by-one, but I'm interested in
knowing if there's a more elegant way to map ASCII to Unicode than
that (some VBA object maybe?).  Thanks.

Request for Question Clarification by mathtalk-ga on 01 Mar 2003 22:20 PST
Hi, viseu-ga:

I wasn't sure which way you meant the initial comment -- that the
specific resolution was not as important as understanding the concepts
involved.

I think I can point you to a couple of "homespun" solutions more
elegant than using multiple replace-all string manipulations.  However
it would be expeditious if you mention what version of Word you are
working with.

Also, at the price offered ($5), I would point you in the direction of
how to do the coding yourself, not provide you with fully tested
working code.  If this approach is agreeable to you, please let me
know.

Some guidelines on pricing for Google Answers are given here:

http://answers.google.com/answers/pricing.html 
 
regards, mathtalk

Clarification of Question by viseu-ga on 03 Mar 2003 10:33 PST
That's fine, mathtalk.  What surprises me is that given how sweeping
the transition from extended ASCII to Unicode was, Microsoft didn't
build some object into VBA to map the former to the latter.  I thought
someone would answer with a link to a brief history of this transition
along with implementable steps Microsoft took to support it.  But if
there aren't any, homespun ideas (I use MS Word 2002 with VB 6.3)
would be helpful.  Thanks.
Answer  
Subject: Re: ASCII/Unicode in Word / VBA
Answered By: mathtalk-ga on 18 Mar 2003 09:19 PST
Rated:5 out of 5 stars
 
Hi, viseu-ga:

It's a generally useful tactic, when trying to develop a piece of VBA
code, to try Record Macro to get a code snippet that at least
correctly does something close to what is wanted.

First I used TextPad 4.6 to create a sample "ANSI" text document with
some special (upper ASCII) characters, taken as it happens from a
Google Answers thread (answered by Scriptor-GA) here:

[Translate Song into German]
http://answers.google.com/answers/main?cmd=threadview&id=173434

Krieg! Ha! Paßt auf! 
Was hat er Gutes? 
Absolut rein gar nichts! Hört mir zu. 
 
Ah, ich hasse den Krieg, 
Weil ganz alleine der Tod nur siegt. 
Krieg heißt Tränen, und er trifft die Mütter hart, 
Denn ihre Söhne, die sind tot, vergessen und verscharrt!

Then I recorded this macro, which correctly opens the file (macro
slightly edited for formatting purposes):

Sub myOpen()
'
' myOpen Macro for Word 2002
' Macro recorded 3/18/2003 by mathtalk-ga
'
  Documents.Open FileName:="WordASCII.txt", _
    ConfirmConversions:=False, ReadOnly:=False, _
    AddToRecentFiles:=False, PasswordDocument:="", _
    PasswordTemplate:="", Revert:=False, _
    WritePasswordDocument:="", WritePasswordTemplate:="", _
    Format:=wdOpenFormatAuto, Encoding:=1252
End Sub

That final "Encoding" parameter, which is supported in Word 2000 and
2002 but not in Word 97, works in combination with the "Format"
parameter to control how text files are converted:

[Word 2002 Documents.Open]
(click on bolded "Documents" to reveal the syntax)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vbawd10/html/womthOpen.asp

[Word 2000 Documents.Open]
(click on bolded "Documents" to reveal the syntax)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/off2000/html/womthopen.asp

[Word 97 Documents.Open]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/office97/html/output/F1/D4/S5ABE9.asp?frame=true

The mystery value 1252 shown above has a "coder friendly" equivalent,
the constant msoEncodingWestern. The particular value was apparently
chosen to match the Windows Standard code page, ANSI 1252 (see History
below for more on the "code page" concept).  This was Microsoft's
"improvement" on the ISO Western Latin(1) extension of ASCII known as
ISO-8859-1.  For details of their minor differences, see this
comparison by George Hernandez:

[ANSI]
http://www.georgehernandez.com/xComputers/CharacterSets/ANSI.htm

For a list of all the MsoEncoding values in Office VBA, see here:

[Encoding Property]
(click on bolded "MsoEncoding" to reveal the list)
http://msdn.microsoft.com/library/en-us/vbawd10/html/woproEncoding.asp

These same enumeration constants are used in other related contexts. 
For example:

[ReloadAs Method]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vbawd10/html/womthReloadAs.asp

Strangely the "Encoding" parameter was not symmetrically added to the
Save method, as discussed here:

[Ask Dr.International #5: Word Macro Recording Misses Encoding]
(first Q&A item listed)
http://www.microsoft.com/globaldev/DrIntl/columns/005/default.mspx

Instead the way to control how Word encodes text documents during
saves is to set the SaveEncoding document property:

[SaveEncoding]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vbawd10/html/woproSaveEncoding.asp
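
As a rough illustration of the SaveEncoding approach just described,
the sketch below sets the property and then saves the active document
as plain text.  It is untested; the macro name and output file name
are placeholders, and the pairing of SaveEncoding with wdFormatText
(rather than wdFormatEncodedText) is an assumption here, so check the
result in a text editor before relying on it.

```vba
Sub mySaveAsUtf8()
    ' Sketch only: "WordUTF8.txt" is a placeholder file name.
    With ActiveDocument
        ' Tell Word which encoding to use when it writes text.
        .SaveEncoding = msoEncodingUTF8
        ' Assumption: a plain-text save picks up SaveEncoding.
        .SaveAs FileName:="WordUTF8.txt", FileFormat:=wdFormatText
    End With
End Sub
```

Any other MsoEncoding constant could be substituted for
msoEncodingUTF8 in the same place.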

For the sake of completeness here's the list of possible values for
the "Format" parameter:

wdOpenFormatAllWord 
wdOpenFormatAuto [Default]
wdOpenFormatDocument 
wdOpenFormatEncodedText 
wdOpenFormatRTF 
wdOpenFormatTemplate 
wdOpenFormatText 
wdOpenFormatUnicodeText 
wdOpenFormatWebPages 

[Word 2002 Documents.Open]
(click on bolded "Documents" to reveal the syntax)
(click on bolded "WdOpenFormat" to reveal the list)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vbawd10/html/womthOpen.asp


History
=======

Bearing in mind your desire for a conceptual understanding, let's stop
and ask what exactly it means for a text document to be in "ANSI"
format.  Historically, ASCII (American Standard Code for Information
Interchange) addressed only a set of 7-bit signals between computers
and "teletype" terminals (even if they were video display terminals,
or "glass TTYs", that emulated the original "hardcopy" teletypes).  As
dial-up modems became the norm for terminal-computer communications,
replacing hardwired connections, the 7-bit character signals were
"embedded" in 8-bit groups.  The eighth bit was then available for
additional information, such as "error detection" (e.g. requiring
even or odd parity for each 8-bit group).

By the time that "personal" computers were blessed by IBM's entry into
the marketplace, there were two sorts of uses for what had come to be
called the "upper ASCII" characters, treated as individual values on
independent footing from their original "lower ASCII" 7-bit
correspondences.  One of these uses was as graphical characters,
exemplified in the IBM "PC DOS" operating system as a set of primarily
line-drawing symbols (vertical, horizontal, corners, double lines,
etc.)

The other use was for displaying "foreign" (from an English alphabetic
perspective) characters.  The PC-DOS character set includes, for
example, a certain number of vowels with diacritical marks and a
handful of Greek alphabet and mathematical symbols, though hardly
sufficient for serious applications.

The ASCII character set was eventually incorporated into an
"international" (ISO) standard as ISO-646-US-ASCII:

http://www.ascii-table.org/

In order to support "localization" of IBM PC's into a number of
European countries, IBM developed what were termed "country code
pages".  What this involved, in its primitive formulation, was loading
of customized fonts (from disk) at "boot time" based on settings in
the ubiquitous CONFIG.SYS file.  Applications (such as word
processors), however, would need to be written to take cognizance of
these "code page" settings, and packages such as WordPerfect did this
with greater or lesser fidelity.

But now we had a classic "tower of Babel" situation, in which simple
text files would display differently, depending on settings external
to the text files themselves.  Several approaches were proposed to
remedy this, eventually converging on the Unicode Standard (kept in
step with ISO 10646, the Universal Character Set or UCS):

http://www.unicode.org/

which aims to simultaneously represent all character sets, even
"large" ones like Chinese characters.  In order to do this the 256
possibilities allowed by 8 bits are obviously insufficient.  Hence one
often sees the phrase "wide character" in connection with Unicode
implementations, although these are not synonyms.

A key to understanding the Unicode standard is to appreciate the
difference between the abstract assignment of a number (a "code
point") to every character, the first 65,536 of which form the BMP
(Basic Multilingual Plane), and the "encodings" of those numbers in
"storage" mappings like UTF-7, UTF-8, UTF-16, and UTF-32.  These
designations in essence describe the number of bits in the basic unit
used to store characters, with the first two encodings (UTF-7 and
UTF-8) providing substantial backward compatibility with older 7-bit
ASCII text files.

The Unicode Standard continues to evolve and to incorporate new
"alphabets".


Other Links of Interest
=======================

For a good discussion on Microsoft's compatibility aims with Word and
Unicode:

[Taking Advantage of Unicode Support]
http://www.microsoft.com/office/ork/xp/three/intd02.htm

A little known quirk of how VBA handles passing strings into DLL's is
"implicit" conversion from Unicode to ANSI:

[Anatomy of a Declare Statement]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odeopg/html/deovranatomyofdeclarestatement.asp

Sometimes, of course, one wishes to pass Unicode strings & needs to
bypass this conversion:

[Working Around VBA String Conversion from Unicode to ANSI for DLLs]
http://www.mvps.org/vb/index2.html?tips/varptr.htm
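
The pointer-passing technique from that last link can be combined
with the Win32 function MultiByteToWideChar to convert bytes from a
DOS code page in one call, with no replace-alls.  What follows is a
sketch for 32-bit VBA 6.x, not tested code: DecodeBytes is a
hypothetical helper name, code page 437 is an assumption about the
source files, and 64-bit versions of Office would need PtrSafe and
LongPtr in the Declare statement.

```vba
Option Explicit

' Win32 API: converts bytes in a given code page to UTF-16.
' Pointer arguments are declared As Long and passed with
' VarPtr/StrPtr, so VBA does no implicit ANSI conversion.
Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long

' Hypothetical helper: decode a non-empty byte array that was
' written in a single-byte code page (e.g. 437 for PC-DOS).
Function DecodeBytes(ByRef raw() As Byte, _
                     ByVal codePage As Long) As String
    Dim n As Long, written As Long, buf As String
    n = UBound(raw) - LBound(raw) + 1
    ' One UTF-16 unit per input byte is enough room for a
    ' single-byte code page.
    buf = String$(n, vbNullChar)
    written = MultiByteToWideChar(codePage, 0, _
        VarPtr(raw(LBound(raw))), n, StrPtr(buf), n)
    ' The API returns the number of characters it wrote.
    DecodeBytes = Left$(buf, written)
End Function
```

A possible use, taking the three codes from the original question:

```vba
Sub TryDecode()
    Dim raw(0 To 2) As Byte
    raw(0) = 131: raw(1) = 132: raw(2) = 133
    Debug.Print DecodeBytes(raw, 437)
End Sub
```

Letting the operating system do the mapping avoids maintaining a
hand-built table, at the cost of tying the code to Windows.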


Search Strategies
=================

  recording a macro in Word 2002
  consulting Office/Word VBA help files
  searching MSDN Library (online and offline)

Keywords:

  MsoEncoding WdOpenFormat 
  Unicode
  ASCII
  ANSI 1252

Request for Answer Clarification by viseu-ga on 21 Mar 2003 13:40 PST
This is a really generous answer with lots of information and I'm
willing to rate it a 5 for the History section alone...  But the macro
suggested didn't work for me.  If you have any patience left, I can
send you a sample file; if not, I'll still rate the answer a 5 for the
information given.

Clarification of Answer by mathtalk-ga on 21 Mar 2003 14:42 PST
Due to the arm's length nature of Google Answers' Terms of Service, we
cannot exchange emails.  However you could post the file (as a text
file) someplace on the Internet and post the URL here.  I could
download it and tell you exactly what the story is.  (Are you familiar
with FTP?)

In principle an ANSI text file could be opened in a simple text editor
(like TextPad, mentioned previously) and pasted into a "request for
clarification" here.  The Google Answers interface wraps text lines
after 70 characters, so there would probably be a number of line-break
changes.  However it might well be enough to give me an idea what is
happening. (How big are these files?)

Do I have the story right?  You have an ANSI (8-bit) text file.  You
open it manually or programmatically with Word 2002 and it displays
incorrectly.  If you open it with Word '97, it displays correctly. 
Therefore you suspect this is a Unicode issue?

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 21 Mar 2003 17:33 PST
Ah, hold the phone (so to speak)...

I looked back at the original question, and the sample codes given
there:

Chr(131) = â
Chr(132) = ä
Chr(133) = à

are not from the ANSI (Latin-1) code page but rather from the "PC-DOS"
code page with the line-drawing characters (these three values match
code page 437).  Let me double check the Word documentation and see
if there isn't an "encoding" that handles this translation, i.e. a
value other than 1252.

-- mathtalk

Clarification of Answer by mathtalk-ga on 21 Mar 2003 18:40 PST
Hi, viseu-ga:

I haven't tested it yet, but I suspect we need to replace:

Encoding:=1252

which is equivalent to msoEncodingWestern, with:

Encoding:=20127

which is equivalent to msoEncodingUSASCII.

regards, mathtalk

Request for Answer Clarification by viseu-ga on 24 Mar 2003 12:28 PST
Hmmmmm, I tried the new encoding value and the extended characters do
appear differently for the first time.  Unfortunately they look like
boxes and blanks, though.  Any other values I might try?  Thanks.

Clarification of Answer by mathtalk-ga on 24 Mar 2003 16:39 PST
That's a bit strange.  Let's do this, just for sport.  Select the
entire document (Ctrl-A or use menu Edit > Select All) and change the
font to Lucida Sans Unicode.

One variable here is the lack of Unicode support for many "font
families".  My recollection is that the Unicode support in Windows is
the best with Arial Unicode MS.  However this font system is not
installed by default, although it is supplied with Word 2000/2002. 
The Help files for Word give you detailed steps for using Microsoft
Office Setup to install "International Support" and the "Universal
Font", so called because Arial Unicode MS has characters for the
entire Unicode 2.0 standard.

Another change I'd make to the VBA code snippet is to specify
wdOpenFormatText rather than wdOpenFormatAuto.
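
Combining that change with an OEM code page value might look like the
following untested sketch.  It assumes the file was written under DOS
code page 437 and uses the MsoEncoding constant
msoEncodingOEMUnitedStates (437); files from a Western European DOS
setup may instead need msoEncodingOEMMultilingualLatinI (850).  The
macro name and file name are placeholders.

```vba
Sub myOpenDos()
    ' Sketch only: "DosText.txt" is a placeholder file name.
    Documents.Open FileName:="DosText.txt", _
        ConfirmConversions:=False, ReadOnly:=False, _
        AddToRecentFiles:=False, _
        Format:=wdOpenFormatText, _
        Encoding:=msoEncodingOEMUnitedStates   ' assumed: DOS code page 437
End Sub
```

If the accented letters still come out wrong, trying the 850 constant
in place of the 437 one is the obvious next experiment.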

It's starting to look like we'll need to have you post your sample
text file on the Web someplace.  As I mentioned earlier, the Terms of
Service for Google Answers customers and researchers forbid the
exchange of email addresses.  If you don't feel up to creating a
(possibly free) domain site and uploading your text file there, you
could email Google at Answers-Support@google.com and explain the
situation to them.  Refer to the question ID and paste its URL in your
message.  They might or might not be willing to make an exception in
this case.

regards, mathtalk-ga
viseu-ga rated this answer:5 out of 5 stars and gave an additional tip of: $8.00
I haven't had a chance to try this out yet but thank you for all your
help & all the information.
