Google Answers: Collecting Japanese data in PHP

Q: Collecting Japanese data in PHP ( No Answer, 2 Comments )

Question

Subject: Collecting Japanese data in PHP
Category: Computers > Internet
Asked by: shane43-ga
List Price: $20.00

Posted: 20 Sep 2005 22:10 PDT
Expires: 20 Oct 2005 22:10 PDT
Question ID: 570419

I'm trying to set up a PHP form that will collect data from Japanese
users. More specifically, I want to target only users who speak
Hiragana or Katakana, so I need to check all user input to see
whether it falls under either category.
 
I have no experience working with foreign characters, so I found a
script at phpclasses.org
(http://promoxy.mirrors.phpclasses.org/browse/package/1425) which has
two useful methods for checking whether user data is Japanese
[isHiragana(), isKatakana()], but it is not giving results that match
my test cases.

I'm guessing that the script is saved with an incorrect encoding, or
our server doesn't support foreign characters, or our PHP build
doesn't support them. What needs to be set up on the server in order
to handle the Japanese language? If this does not solve our problem,
then perhaps I can pay more to have you diagnose my specific set of
scripts.

Thanks!

Answer

There is no answer at this time.

Comments

Subject: Re: Collecting Japanese data in PHP
From: thinkcomp-ga on 21 Sep 2005 21:01 PDT

On the PHP side, you should make sure that the mbstring extension is
enabled. If you build PHP from source on Linux, that means passing the
--enable-mbstring flag to the "configure" command, so that your
command looks something like:

./configure --enable-mbstring

You can also check whether it's currently enabled by calling the
phpinfo() function in any PHP script and looking for an mbstring
section in the output.
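
As a lighter-weight alternative to reading through phpinfo() output, the check can be sketched in a few lines. This is only a minimal sketch: hasMbstring() is a hypothetical helper name, while extension_loaded() and function_exists() are standard PHP functions.

```php
<?php
// Sketch: report whether the mbstring extension is available at runtime.
// hasMbstring() is a hypothetical helper name, not part of PHP itself.
function hasMbstring(): bool
{
    return extension_loaded('mbstring') && function_exists('mb_strlen');
}

echo hasMbstring()
    ? "mbstring is enabled\n"
    : "mbstring is missing; rebuild PHP with --enable-mbstring or enable the extension\n";
```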

You may also want to check that your character encoding is correct for
your page. The three legacy Japanese character encodings are EUC-JP,
ISO-2022-JP, and Shift_JIS; UTF-8 can also represent Japanese text.
You can find more information on character encodings at:

http://lfw.org/text/jp-www.html
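
One way to keep the encoding question manageable is to validate each submitted value against the encoding the form page was served in and then convert it to a single internal encoding. The sketch below assumes UTF-8 as that internal encoding and assumes mbstring is enabled; toUtf8() and its parameter names are hypothetical, while mb_check_encoding() and mb_convert_encoding() are real mbstring functions.

```php
<?php
// Hypothetical helper: normalise submitted text to UTF-8.
// $declaredEncoding is the charset the form page was served in
// (for example via a Content-Type header); that is an assumption here.
function toUtf8(string $input, string $declaredEncoding = 'UTF-8'): ?string
{
    // Reject byte sequences that are not valid in the declared encoding.
    if (!mb_check_encoding($input, $declaredEncoding)) {
        return null;
    }
    // Convert to UTF-8 so the rest of the script handles one encoding only.
    return mb_convert_encoding($input, 'UTF-8', $declaredEncoding);
}

// "\x82\xA0" is the Shift_JIS byte sequence for the hiragana "a".
$fromSjis = toUtf8("\x82\xA0", 'SJIS');
```

Returning null for invalid input lets the caller tell "wrong encoding" apart from "valid but empty" text.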

If you are using a database, you should make sure that your table
supports the character encoding you are using. MySQL, for example,
uses Latin-1 by default.
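
For the database side, a rough sketch of what "supporting the encoding" can look like with MySQL and PDO follows. Everything named here is an assumption for illustration: the helper mysqlUtf8Dsn(), the table name form_entries, and the choice of utf8mb4 (available in MySQL 5.5.3 and later) instead of one of the legacy Japanese encodings.

```php
<?php
// Hypothetical helper: build a PDO DSN that asks MySQL to talk UTF-8
// (utf8mb4) on the connection. Host and database names are placeholders.
function mysqlUtf8Dsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Hypothetical table definition with an explicit character set, so the
// server default does not apply to this table.
const CREATE_ENTRIES_TABLE_SQL = <<<'SQL'
CREATE TABLE form_entries (
    id INT AUTO_INCREMENT PRIMARY KEY,
    entry_text TEXT NOT NULL
) DEFAULT CHARSET=utf8mb4
SQL;

// Usage (needs a running MySQL server, so it is left commented out):
// $pdo = new PDO(mysqlUtf8Dsn('localhost', 'survey'), $user, $password);
// $pdo->exec(CREATE_ENTRIES_TABLE_SQL);
```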

Subject: Re: Collecting Japanese data in PHP
From: eirikr_utlendi-ga on 22 Sep 2005 10:32 PDT

@shane43 -- 

I'm woefully ignorant of PHP, but let me clear up some things for you
about the Japanese language.  Hiragana and katakana are two of the
four scripts commonly used in Japanese, so no one "speaks" either of
these.  :)  The other two scripts are called kanji (lit. "Chinese
characters") and romaji (lit. "Roman characters", i.e. the Latin
alphabet).  Have a look here
(http://en.wikipedia.org/wiki/Japanese_writing_system) for a sample of
what Japanese looks like.

So most commonly written Japanese will include kanji and hiragana. 
Any foreign words or words with very complicated kanji will generally
be rendered in katakana.  *However*, so long as you have the proper
encoding (Shift_JIS, EUC-JP, ISO-2022-JP, UTF-8, etc), kanji and both
kana systems will all appear as multi-byte characters (typically two
bytes each in Shift_JIS and EUC-JP, three in UTF-8).  Consequently,
there's no need to look specifically for either form of kana; just
look for multi-byte characters.
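
The byte-length idea above can be sketched in PHP by comparing the byte count with the character count. This assumes the input is already UTF-8 and that mbstring is enabled; containsMultibyte() is a hypothetical helper name. Note that this only shows the text is not plain ASCII: accented Latin letters are also multi-byte in UTF-8, so the check alone does not prove the text is Japanese.

```php
<?php
// Hypothetical helper: true when a UTF-8 string holds at least one
// character that takes more than one byte.
function containsMultibyte(string $utf8Text): bool
{
    // strlen() counts bytes; mb_strlen() counts characters.
    return strlen($utf8Text) !== mb_strlen($utf8Text, 'UTF-8');
}
```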

This might be why your test cases are failing -- almost no Japanese
text is written entirely in either hiragana or katakana, unless it's a
children's book for beginning readers, or somebody doing something
strange for effect, a bit like how the poet e.e. cummings used only
lower case (poet Miyazawa Kenji wrote a whole piece entirely in
katakana, google for "ame ni mo makezu" if you're interested).
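
To illustrate the difference between "entirely kana" and "contains some kana", here is a minimal sketch using PCRE with the /u modifier. It assumes UTF-8 input and uses the Unicode Hiragana (U+3041 to U+309F) and Katakana (U+30A0 to U+30FF) block ranges; half-width katakana are not covered. The function names are hypothetical and this is not the method used by the phpclasses.org script.

```php
<?php
// Assumed ranges: Unicode Hiragana and Katakana blocks (UTF-8 input).
const HIRAGANA_CLASS = '[\x{3041}-\x{309F}]';
const KATAKANA_CLASS = '[\x{30A0}-\x{30FF}]';

// True only when every character is in the Hiragana block.
function isAllHiragana(string $utf8Text): bool
{
    return preg_match('/\A' . HIRAGANA_CLASS . '+\z/u', $utf8Text) === 1;
}

// True only when every character is in the Katakana block.
function isAllKatakana(string $utf8Text): bool
{
    return preg_match('/\A' . KATAKANA_CLASS . '+\z/u', $utf8Text) === 1;
}

// True when at least one hiragana or katakana character appears anywhere.
function containsKana(string $utf8Text): bool
{
    return preg_match('/' . HIRAGANA_CLASS . '|' . KATAKANA_CLASS . '/u', $utf8Text) === 1;
}
```

With ordinary mixed text such as kanji plus hiragana, the two "all" checks return false while containsKana() returns true, which may be closer to what a form validator wants.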

There might also be some full-width (double-byte) Latin characters,
which might make things complicated for you.  MS Word and
OpenOffice.org, for instance, both include a built-in UI command to
switch full-width Latin to single-byte; PHP might have something
similar.
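
PHP's mbstring extension does provide such a conversion: mb_convert_kana() with the 'a' option converts full-width letters and digits to their single-byte forms. A minimal sketch, assuming UTF-8 input and an enabled mbstring extension; toHalfWidthAlnum() is a hypothetical wrapper name.

```php
<?php
// Hypothetical wrapper: fold full-width Latin letters and digits to
// their ordinary single-byte forms. Kana and kanji are left untouched
// because only the 'a' (alphanumeric) option is requested.
function toHalfWidthAlnum(string $utf8Text): string
{
    return mb_convert_kana($utf8Text, 'a', 'UTF-8');
}
```

Running this before any script or byte-length check keeps full-width Latin input from being mistaken for Japanese text.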

If you're looking through lots of mixed multilingual text to find just
the Japanese, the hiragana character "no" (looks like a 6 turned
clockwise by 90 degrees -- have a look here
http://en.wikipedia.org/wiki/Hiragana for the character and its
Unicode encoding) is probably the single most common character used in
Japanese, as it marks the possessive, and it is also *only* used in
Japanese, so you won't catch any other languages by mistake.
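
The "look for no" heuristic can be sketched as a simple substring search for the hiragana character の (U+306E). This assumes UTF-8 input and a UTF-8 source file; containsHiraganaNo() is a hypothetical name, and the check is only a rough signal, since a short Japanese string may not contain the character at all.

```php
<?php
// Hypothetical helper: true when the text contains the hiragana "no".
function containsHiraganaNo(string $utf8Text): bool
{
    return mb_strpos($utf8Text, 'の', 0, 'UTF-8') !== false;
}
```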

HTH,

        Eirikr
