Google Answers: Using Databases Containing Non-English Symbols

Question

Subject: Using Databases Containing Non-English Symbols
Category: Computers > Software
Asked by: catullus13-ga
List Price: $10.00

Posted: 14 Jun 2002 13:16 PDT
Expires: 14 Jul 2002 13:16 PDT
Question ID: 26046

Our company has been given a database that contains a variety of
non-English characters from different European languages, such as
Spanish, French, German, Scandinavian, etc. These non-English
characters include such things as umlauts, tildes, circumflexes, and
accent marks. This database will eventually be connected to a website,
where users will do a search on the data, and the results then
displayed on screen. How can we ensure that this database will be
usable and that we can display these characters properly, with their
original symbols intact?

Request for Question Clarification by wengland-ga on 14 Jun 2002 14:25 PDT
How is the data stored in the database?  Text?  Binary?  

In a nutshell, just make sure the character set for the page has all
the symbols you'll want.

Request for Question Clarification by mother-ga on 14 Jun 2002 15:45 PDT

Hello,

Will the strings that have these special characters be available for
searching? Your solution will depend on whether you need to do some
translation going both in AND out of the database, or just out.

I would be the most comfortable with a solution that leaves your data
intact (don't convert characters in the database). Rather, a script or
function on the web server side should do that (but that's just my
personal take on it). Which brings me to my point, what scripting
environment will you be using, and what is your database platform?
This information (and ideally a couple of rows of data from your
database) will help a researcher answer your question more completely.

Good luck!
mother-ga

Answer

Subject: Re: Using Databases Containing Non-English Symbols
Answered By: hedgie-ga on 16 Jun 2002 20:30 PDT
Rated: 5 out of 5 stars

The comments, particularly the last one, by rtsaito-ga, provide useful
information. What I will add are links which will lead to a gentle
intro into the arcane topic of character encoding.

The correct general term you need is indeed Unicode. It is a character
set which can represent all natural languages, and some unnatural ones
too :-). That, however, may be overkill in your case, since all the
languages you mentioned are Western European languages, which can be
expressed in ISO 8859-1, also called Latin-1 (and often, loosely,
"extended ASCII"). Its 256 code positions are listed here, and include
quite a few characters with diacritics, or accents:
http://www.bbsinc.com/iso8859.html

Here is the list of other ISO sets, covering all of Europe, and more:
http://czyborra.com/charsets/iso8859.html
http://www.htmlhelp.com/reference/charset/
So, the first step is to decide which languages are to be included and
then to read an introductory article, such as
http://www.hedgehog-review.com/accent/index.html#LDB

The article explains the use of the LDB database, which will allow you
to determine how many ISO sets you need to cover the languages of
interest. It also provides technical references and an example of
using the headers mentioned in the comment.
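
As a rough stand-in for that kind of lookup (this is not the LDB database itself, only a sketch using Python's built-in codecs), one could test which ISO 8859 sets are able to encode a sample of the data. The function name and the candidate list are assumptions chosen for illustration:

```python
# Candidate sets to try; extend the list to match the languages you expect.
ISO_SETS = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-15"]

def sets_covering(text, candidates=ISO_SETS):
    """Return the candidate character sets that can encode every character in text."""
    covering = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this set
        covering.append(name)
    return covering

print(sets_covering("Málaga, Besançon, München, Tromsø"))  # ['iso-8859-1', 'iso-8859-15']
print(sets_covering("Łódź"))                               # ['iso-8859-2']
print(sets_covering("Señor Łódź"))                         # []
```

An empty result, as in the last line, suggests that no single set in the list is enough and that Unicode is the safer choice.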

Good references on multilingual web servers come from bilingual
Canada, such as
http://vancouver-webpages.com/multilingual/howto.htm

and also from multilingual Europe:
http://www.hum.uva.nl/~ewn/
http://www.jca.apc.org/aworc/search/

You should talk to the database vendor, if your data came with
specific database software, since large vendors have solutions already
available:
http://download-west.oracle.com/otndoc/oracle9i/901_doc/server.901/a90236/toc.htm
http://www.sybase.com.au/press/releases/Unicode.html
http://www-106.ibm.com/developerworks/unicode/library/unicode-db-process/?dwzone=unicode

This is an active area of research and a specialized field, distinct
from issues of localization:
http://www.ee.umd.edu/medlab/filter/sss/papers/

Open standards and Unix solutions differ somewhat from the Microsoft
proprietary solutions:
http://www.microsoft.com/mind/0100/internat/internat.asp

Using an outside specialist may be the most economical solution. The
Open Directory lists DB consultants:
http://dmoz.org/Computers/Consultants/Databases/

Please do ask for clarification, giving more details, if I did not
cover relevant aspects of your query.

Thank you for using Google Answers.

Hedgie.

catullus13-ga rated this answer: 5 out of 5 stars

All of these responses provide excellent, useful information. Right
now, this question was anticipatory, since we are still unsure of the
format in which the database will be provided, but using these
answers, we will be better able to deal with it, no matter what.
Thanks, all.

Comments

Subject: Re: Using Databases Containing Non-English Symbols
From: phreaddy-ga on 14 Jun 2002 14:31 PDT

I don't have the technical answers (and that'll vary widely depending
on your configurations), but here's what you need in terms of
functionality. Your search function should be set up to find the
proper information even if it's incorrectly asked for. That is,
requests for a word with or without the accents (or perhaps with the
wrong accents) should all return the intended words.
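
One way to get that behaviour, sketched here in Python under the assumption that the search runs in application code rather than inside the database, is to compare accent-free, case-folded keys on both sides. The function names and sample rows are made up for illustration:

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Return a lowercase, accent-free key to compare on."""
    # NFD splits a letter such as "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn), keeping the base letters.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.casefold()

def search(rows, query):
    """Return the rows containing the query, ignoring accents and case."""
    key = fold_accents(query)
    return [row for row in rows if key in fold_accents(row)]

rows = ["Málaga", "Malmö", "Besançon", "München"]
print(search(rows, "malaga"))   # ['Málaga']
print(search(rows, "MÜNCHEN"))  # ['München']
```

Letters such as "ø" and "æ" are not base letters plus an accent, so this folding leaves them unchanged; Scandinavian data may need an extra mapping table on top.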

Also, you can make sure accents are properly displayed on the major
browsers (nothing works on every browser) by finding the proper HTML
coding for those accented characters. Any web coder worth his/her
salary will have the proper reference materials to find this.

Subject: Re: Using Databases Containing Non-English Symbols
From: andypavlo-ga on 14 Jun 2002 14:34 PDT

If you really are worried about things being displayed properly, write
a simple script (I would use PHP but that's just me) that takes any
character above the ASCII range (code 128 and up) and converts it into
its HTML numeric character reference:

Á becomes &#193;

That way any modern browser should have no problem figuring out what
to display! The drawback is that one character now becomes six, so
your database will get a lot bigger if there are a lot of these
characters.

Here's an HTML ascii table:
http://www.efn.org/~gjb/asciidec.html

Here's the PHP language function to convert all characters to HTML
entities all at once:
http://www.php.net/manual/en/function.htmlentities.php
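
For comparison, the same idea can be sketched in Python (not the PHP function linked above; the helper name is an assumption). It escapes the HTML metacharacters first and then turns every non-ASCII character into a numeric reference:

```python
import html

def to_numeric_refs(text: str) -> str:
    """Escape HTML metacharacters, then replace non-ASCII characters with &#NNN; references."""
    escaped = html.escape(text)  # & < > and quotes become named entities
    # xmlcharrefreplace rewrites each character ASCII cannot hold as &#<code point>;
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_numeric_refs("Ángel & Müller"))  # &#193;ngel &amp; M&#252;ller
print(len(to_numeric_refs("Á")))          # 6
```

Doing this when the page is generated, rather than before storing, keeps the database text searchable and avoids the growth in size.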

Subject: Re: Using Databases Containing Non-English Symbols
From: rtsaito-ga on 15 Jun 2002 13:57 PDT

Hi,

The solution depends on which database you are using. Is it ORACLE ?
MS-SQL? MYSQL?

Take a look at UTF-8 for storing the text. I am used to working with
Oracle and Java, and there are no problems storing "strange"
characters from European languages.
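
A minimal round-trip sketch, using Python's bundled SQLite as a stand-in for whichever database you end up with (this is not Oracle-specific; the table and column names are invented for the example):

```python
import sqlite3

# An in-memory SQLite database; SQLite stores TEXT as UTF-8 by default.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT)")

names = ["Málaga", "Besançon", "München", "Tromsø"]
conn.executemany("INSERT INTO places (name) VALUES (?)", [(n,) for n in names])

# The text comes back exactly as it went in.
fetched = [row[0] for row in conn.execute("SELECT name FROM places ORDER BY rowid")]
print(fetched == names)  # True

# Looking at the stored bytes shows the two-byte UTF-8 form of "á".
raw = conn.execute("SELECT CAST(name AS BLOB) FROM places WHERE rowid = 1").fetchone()[0]
print(raw)  # b'M\xc3\xa1laga'
```

With other database products the character set is usually a setting on the database, table, or connection, so check the vendor documentation rather than relying on this sketch.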

Subject: Re: Using Databases Containing Non-English Symbols
From: paradiddler-ga on 16 Jun 2002 09:31 PDT

You should make sure the web page displaying the data from the
database uses the same character set (also known as codepage) as the
database itself uses. This can easily be accomplished with an HTML
meta tag like this:

<META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=iso-8859-5">

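The character set instructs the browser to display the text using the
correct symbols; replace iso-8859-5 (the Cyrillic set) with whichever
set your database actually uses. HTTP nominally defaults to ISO-8859-1,
which covers most Western European languages, but browsers may guess
differently, so it is safer to declare the character set explicitly.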
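
To see why the declared character set has to match the data, here is a small Python sketch (the function name and sample text are assumptions) that builds a page in one character set and then reads the bytes back both ways:

```python
def build_page(text: str, charset: str) -> bytes:
    """Build a tiny page whose META tag names the charset used to encode it."""
    # Note: text is inserted as-is; real code would also HTML-escape it.
    html = (
        '<META HTTP-EQUIV="Content-type" '
        f'CONTENT="text/html; charset={charset}">\n'
        f"<p>{text}</p>\n"
    )
    return html.encode(charset)

page = build_page("Señor Müller", "iso-8859-1")

# A browser honouring the declared charset recovers the original text.
print(page.decode("iso-8859-1"))

# Reading the same bytes as Cyrillic (iso-8859-5) swaps in the wrong letters.
print(page.decode("iso-8859-5"))
```

The bytes are identical in both cases; only the interpretation changes, which is exactly what the META tag controls.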

If the database contains texts written in such a multitude of
languages that one single-byte character set (i.e. a character set
with at most 256 characters) does not have all the needed symbols, the
database should be coded in UNICODE. UNICODE is able to encode all
current writing symbols worldwide; in its common 16-bit form (UTF-16)
most characters take two bytes each. That is double the size of a
single-byte representation, so the resulting web pages are usually
about twice the size and therefore require more bandwidth.

If most of the characters are standard US-ASCII characters, the
UNICODE data could be encoded as UTF-8, which is a variable-byte
character encoding. That is, one symbol can be coded with between one
and four bytes, with all US-ASCII characters being single-byte
characters. The UTF-8 coding is especially good for use as a web page
"character set" (i.e. you specify UTF-8 as the character set in the
META tag, although it is actually an encoding of the UNICODE character
set), because it allows the web designer to make the HTML pages with
standard US-ASCII characters, and then just embed the data from the
database as UTF-8 encoded strings.
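
The size trade-off is easy to check. This Python sketch (the helper name and sample strings are only illustrative) counts the bytes each encoding needs:

```python
def encoded_sizes(text: str):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    utf8 = len(text.encode("utf-8"))
    # "utf-16-le" leaves out the 2-byte byte-order mark that "utf-16" would prepend.
    utf16 = len(text.encode("utf-16-le"))
    return utf8, utf16

for sample in ["database", "Besançon", "éàüñ"]:
    utf8, utf16 = encoded_sizes(sample)
    print(f"{sample}: {len(sample)} chars, {utf8} bytes UTF-8, {utf16} bytes UTF-16")

# database: 8 chars, 8 bytes UTF-8, 16 bytes UTF-16
# Besançon: 8 chars, 9 bytes UTF-8, 16 bytes UTF-16
# éàüñ: 4 chars, 8 bytes UTF-8, 8 bytes UTF-16
```

For mostly-ASCII Western European text, UTF-8 stays close to one byte per character, while UTF-16 is a flat two bytes for each of these characters.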
