You may be able to check content type headers and meta encoding tags to
determine the charset.
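
As a rough illustration of that first suggestion, here is a minimal Python sketch that pulls the declared charset out of a Content-Type header value and out of a meta tag in the page source. The function names are made up for this example, and the regular expression is a simplification that assumes reasonably well-formed markup. The declared charset is only a hint, since pages can mislabel themselves.

```python
import re
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value (lowercased), or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

# Simplified pattern: matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""",
    re.IGNORECASE,
)

def charset_from_meta(html_bytes):
    """Return the charset named in a meta tag near the top of the page, or None."""
    match = META_RE.search(html_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None
```

If the header and the meta tag disagree, it is probably safest to treat both as unreliable and fall back to inspecting the bytes themselves.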
Also, if mb_detect_encoding is failing, you could try a quick script in
another language such as Perl to fill in this gap.
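
A quick script of that kind might look like the Python sketch below. It is not a real detector, just trial decoding: it tries a few common Japanese encodings in strict mode and returns the first one that decodes cleanly. The function name and the candidate order are my own assumptions, and short strings can be ambiguous (some Shift_JIS byte sequences are also valid EUC-JP), so treat the result as a guess.

```python
def guess_japanese_encoding(data):
    """Guess the encoding of a byte string by strict trial decoding.

    Returns (codec_name, decoded_text), or (None, None) if nothing fits.
    """
    candidates = []
    # ISO-2022-JP is 7-bit and switches character sets with ESC sequences,
    # so it has to be tried before plain ASCII.
    if b"\x1b" in data:
        candidates.append("iso2022_jp")
    # Assumed order: UTF-8 is the strictest check, cp932 (Shift_JIS) the loosest.
    candidates += ["ascii", "utf-8", "euc_jp", "cp932"]

    for codec in candidates:
        try:
            return codec, data.decode(codec)
        except UnicodeDecodeError:
            continue
    return None, None
```

Once the encoding is guessed, the text can be converted to UTF-8 on the way in, which is the same job mb_convert_encoding would do on the PHP side.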
I did some work with Asian language processing a few years ago. If you
are only dealing with Japanese for the moment, there are ways to
algorithmically detect the charset of Japanese text based on the byte
A quick search turned up this Python module which might do the job:
This Perl module also has some emoji support:
On Fri, 01 Jun 2007 22:14:15 +0900, Erick Papadakis <erick.papa_at_gmail.com>
> Seeing as how this list is aflutter with tech savvy folk, I hope
> someone can shed some light on this problem.
> We're developing something in Japanese that needs input from a
> because the text comes from client side using a bookmarklet. (If it
> could be a regular POST or GET, then there'd be no issues).
> My problem is that Japan seems to have had a devil of a time getting
> to standardize its character sets! Some big sites like isize.com use
> Shift_JIS, while others such as Goo or Mixi use EUC-JP, while several
> of the more modern ones (such as blogs) use UTF-8.
> When we capture the TITLE (document.title) from these websites, and
> then "rawurldecode" the received text in PHP, the string comes up
> jumbled. If we knew the standard character set before hand, we could
> have used the right mb_convert_encoding and such, but this is now an
> but that doesn't work either -- I wonder if that's a deprecated
> element of the document object?
> Would appreciate any insight into how you have solved the issue of
> different incoming text into programs. The PHP function
> "mb_detect_encoding" is totally useless. Given a string, it always
> seems to return UTF-8.
> Many thanks in advance!
Ubit Europe B.V.
Mobile: +81 80 5505 7932
Tel: +31 20 408 1481
Fax: +31 84 711 5404
Received on Sat Jun 2 06:36:29 2007