Subject: Using Databases Containing Non-English Symbols
Category: Computers > Software
Asked by: catullus13-ga
List Price: $10.00
Posted: 14 Jun 2002 13:16 PDT
Expires: 14 Jul 2002 13:16 PDT
Question ID: 26046
Our company has been given a database that contains a variety of non-English characters from different European languages, such as Spanish, French, German, and the Scandinavian languages. These non-English characters include such things as umlauts, tildes, circumflexes, and accent marks. This database will eventually be connected to a website, where users will search the data and have the results displayed on screen. How can we ensure that this database will be usable and that we can display these characters properly, with their original symbols intact?
Subject: Re: Using Databases Containing Non-English Symbols
Answered By: hedgie-ga on 16 Jun 2002 20:30 PDT
The comments, particularly the last one by rtsaito-ga, provide useful information. What I will add are links that lead to a gentle introduction to the arcane topic of character encoding.

The correct general term you need is indeed Unicode. It is a character set which can represent all natural languages, and some unnatural ones too :-). That may be overkill in your case, however, since all the languages you mentioned are Western European languages, which can be expressed in extended ASCII, also called Latin-1. The full list of its 256 code points, including quite a few characters with diacritics (accents), is here:
http://www.bbsinc.com/iso8859.html

Here are lists of the other ISO 8859 sets, covering all of Europe and more:
http://czyborra.com/charsets/iso8859.html
http://www.htmlhelp.com/reference/charset/

So the first step is to decide which languages are to be included, and then to read an introductory article such as:
http://www.hedgehog-review.com/accent/index.html#LDB
The article explains the use of the LDB database, which will let you determine how many ISO sets you need to cover the languages of interest, provides technical references, and gives an example of using the headers mentioned in the comments.

Good references on multilingual web servers come from bilingual Canada, such as
http://vancouver-webpages.com/multilingual/howto.htm
and also from multilingual Europe:
http://www.hum.uva.nl/~ewn/
http://www.jca.apc.org/aworc/search/

If your data came with specific database software, you should talk to the database vendor, since the large vendors already have solutions available:
http://download-west.oracle.com/otndoc/oracle9i/901_doc/server.901/a90236/toc.htm
http://www.sybase.com.au/press/releases/Unicode.html
http://www-106.ibm.com/developerworks/unicode/library/unicode-db-process/?dwzone=unicode

This is an active area of research and a specialized field, distinct from the issues of localization:
http://www.ee.umd.edu/medlab/filter/sss/papers/

Open standards and Unix solutions differ somewhat from Microsoft's proprietary solutions:
http://www.microsoft.com/mind/0100/internat/internat.asp

Using an outside specialist may be the most economical solution. The Open Directory lists database consultants:
http://dmoz.org/Computers/Consultants/Databases/

Please do ask for clarification, giving more details, if I did not cover the relevant aspects of your query. Thank you for using Google Answers.

Hedgie.
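As a concrete illustration of that first step, here is a minimal PHP sketch of a check for whether a string can be stored in Latin-1 alone. It assumes the delivered text arrives as UTF-8 and that the mbstring extension is available; the function name is just illustrative.

<?php
// Sketch: does this UTF-8 string survive a round trip through Latin-1?
// If yes, ISO-8859-1 is enough; if not, another ISO 8859 set or
// Unicode (UTF-8) storage is needed. Assumes the mbstring extension.
function fitsInLatin1($utf8Text)
{
    // Convert to Latin-1 and back; characters outside Latin-1 are
    // replaced with '?' and so will not match on the way back.
    $latin1    = mb_convert_encoding($utf8Text, 'ISO-8859-1', 'UTF-8');
    $roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    return $roundTrip === $utf8Text;
}

var_dump(fitsInLatin1("Müller, Gómez, façade"));   // true  (all Latin-1)
var_dump(fitsInLatin1("Łódź"));                    // false (needs ISO-8859-2 or Unicode)
?>

Running such a check over the delivered data tells you quickly whether Latin-1 suffices or whether you need to plan for Unicode from the start.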
catullus13-ga rated this answer:
All of these responses provide excellent, useful information. Right now this question was anticipatory, since we are still unsure of the format in which the database will be provided, but using these answers we will be better able to deal with it, no matter what. Thanks, all.
Subject: Re: Using Databases Containing Non-English Symbols
From: phreaddy-ga on 14 Jun 2002 14:31 PDT
I don't have the technical answers (and they'll vary widely depending on your configuration), but here's what you need in terms of functionality. Your search function should be set up to find the right records even if the query is typed incorrectly. That is, requests for a word with or without its accents (or with the wrong accents) should all return the intended words. Also, you can make sure accents are displayed properly on the major browsers (nothing works on every browser) by using the proper HTML codes for those accented characters. Any web coder worth his or her salary will have the reference materials to find these.
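One common way to get that "with or without accents" behavior is to store an accent-stripped, lowercased copy of each searchable field and compare it against the same normalization of the user's query. A rough PHP sketch, assuming UTF-8 data and the intl extension (the function name is illustrative):

<?php
// Sketch: fold "Gómez", "GÓMEZ" and "Gomez" onto the same search key.
// Assumes UTF-8 input and PHP's intl extension (Normalizer class).
function searchKey($text)
{
    // Decompose accented letters into base letter + combining mark...
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    // ...then drop the combining marks (Unicode category Mn) and lowercase.
    $stripped = preg_replace('/\p{Mn}/u', '', $decomposed);
    return mb_strtolower($stripped, 'UTF-8');
}

// Both the stored column and the user's query go through the same function,
// so "facade" finds "façade" and "Gomez" finds "Gómez".
var_dump(searchKey('Façade') === searchKey('facade'));   // true
?>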
Subject: Re: Using Databases Containing Non-English Symbols
From: andypavlo-ga on 14 Jun 2002 14:34 PDT
If you really are worried about things being displayed properly, write a simple script (I would use PHP, but that's just me) that takes any character beyond ASCII code 128 and converts it into its HTML entity equivalent: for example, Á becomes &Aacute; (or the numeric form &#193;). That way any modern browser should have no problem figuring out what to display! The drawback is that one character now becomes several characters, so your database will get a lot bigger if there are many of them. Here's an HTML ASCII table: http://www.efn.org/~gjb/asciidec.html Here's the PHP function that converts all applicable characters to HTML entities at once: http://www.php.net/manual/en/function.htmlentities.php
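For reference, the conversion described here is essentially a one-liner with the function linked above. A small sketch; the third argument names the encoding of the input string, which is an assumption you would adjust to whatever the database actually delivers:

<?php
// Sketch: turn accented characters into HTML entities before output.
// The encoding argument ('UTF-8' here) must match your actual data.
$text = "Über façade";                              // example data
echo htmlentities($text, ENT_QUOTES, 'UTF-8');      // &Uuml;ber fa&ccedil;ade
?>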
Subject: Re: Using Databases Containing Non-English Symbols
From: rtsaito-ga on 15 Jun 2002 13:57 PDT
Hi. The solution depends on which database you are using. Is it Oracle? MS SQL Server? MySQL? Take a look at UTF-8 for storing the text. I usually work with Oracle and Java, and there are no problems storing "strange" characters from European languages.
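If the database turns out to be MySQL, for instance, the same idea looks roughly like the sketch below from PHP. This is only an illustration: the host, credentials, table and column names are placeholders, and the mysqli extension is assumed.

<?php
// Sketch: store and read back UTF-8 text over a UTF-8 connection.
// Connection details and schema names are made-up placeholders.
$db = new mysqli('localhost', 'user', 'password', 'catalog');
$db->set_charset('utf8');          // make the client connection speak UTF-8

$db->query("CREATE TABLE IF NOT EXISTS products (
                name VARCHAR(200) CHARACTER SET utf8
            )");

$stmt = $db->prepare('INSERT INTO products (name) VALUES (?)');
$name = 'Crème brûlée';
$stmt->bind_param('s', $name);
$stmt->execute();

$result = $db->query('SELECT name FROM products');
while ($row = $result->fetch_assoc()) {
    echo $row['name'], "\n";       // accents come back intact
}
?>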
Subject: Re: Using Databases Containing Non-English Symbols
From: paradiddler-ga on 16 Jun 2002 09:31 PDT
You should make sure the web page displaying the data from the database uses the same character set (also known as a codepage) as the database itself. This can easily be accomplished with an HTML meta tag like this:

<META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=iso-8859-5">

The character set instructs the browser to display the text using the correct symbols. The default character set according to the HTML standard is ISO-8859-1, which covers most Western European languages.

If the database contains text written in such a multitude of languages that one single-byte character set (i.e. a character set with 256 code points) does not have all the needed symbols, the database should be encoded in Unicode. Unicode uses 16-bit characters and is therefore able to encode all current writing symbols worldwide. The 16-bit characters of course take up twice the space, so the resulting web pages are usually twice the size and therefore require more bandwidth.

If most of the characters are standard US-ASCII characters, the Unicode data can be encoded as UTF-8, which is a variable-length character encoding: one symbol is coded with between one and four bytes, with all US-ASCII characters remaining single bytes. UTF-8 is especially good to use as a web page "character set" (i.e. you specify UTF-8 as the character set in the META tag, although it is actually an encoding of the Unicode character set), because it allows the web designer to build the HTML pages with standard US-ASCII characters and then simply embed the data from the database as UTF-8 encoded strings.
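Putting this advice together, a page script might look something like the sketch below (UTF-8 variant; the $rows array stands in for whatever your database query actually returns):

<?php
// Sketch: declare the charset in both the HTTP header and the META tag,
// then escape the database values before embedding them in the page.
// $rows is a placeholder for real query results.
header('Content-Type: text/html; charset=utf-8');

$rows = array('Søren Kierkegaard', 'José Ortega y Gasset', 'Günter Grass');

echo "<html><head>\n";
echo "<meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\">\n";
echo "</head><body><ul>\n";
foreach ($rows as $name) {
    // htmlspecialchars keeps the raw UTF-8 bytes but escapes <, >, & and quotes
    echo '<li>', htmlspecialchars($name, ENT_QUOTES, 'UTF-8'), "</li>\n";
}
echo "</ul></body></html>\n";
?>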