Copyright © 2016 JoungKyun.Kim All rights reserved.
Notice
This project has moved to GitHub.
Abstract
Determines the charset of input data using the Mozilla Universal Charset Detector C/C++ library.
This PHP extension is a PHP frontend for libchardet.
libchardet is based on the Mozilla Universal Charset Detector C/C++ library and detects the character set used to encode data.
Because this module is a C binding, it is much faster than chardet packages written in PHP.
The mod_chardet extension supports three charset detection methods. The methods and the libraries they require are as follows:
- libchardet - Mozilla Universal Charset Detector C/C++ library
- ICU - IBM International Components for Unicode
- python-chardet - Mozilla Universal Charset Detector in pure Python
For CJKV (Chinese, Japanese, Korean, Vietnamese) languages, MUCD (Mozilla Universal
Charset Detector) is recommended because it gives the best results. For single-byte
languages, MUCD and ICU both work well.
The python-chardet mode also uses MUCD, but its call performance is very poor. This
mode exists only for testing, so it is not enabled unless you pass the corresponding
configure option.
For more information, see the Reference document.
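The mode preference described above can be sketched in PHP. This is a minimal illustration, not part of mod_chardet: `pick_chardet_mode()` and `chardet_const()` are hypothetical helper names. It assumes only what the sample scripts show, namely that a `CHARDET_MOZ`, `CHARDET_ICU`, or `CHARDET_PY` constant with the value -1 marks an unsupported mode. It requires PHP 7.1 or later and also runs when the extension is not loaded.

```php
<?php
// Hypothetical helpers (not part of mod_chardet).

/**
 * Return the name of the first supported mode, or null if none is supported.
 * $modes maps a constant name to its value; -1 means "not supported".
 */
function pick_chardet_mode(array $modes): ?string
{
    foreach ($modes as $name => $value) {
        if ($value !== -1) {
            return $name;
        }
    }
    return null;
}

/** Read a constant if it is defined; treat a missing constant as unsupported. */
function chardet_const(string $name): int
{
    return defined($name) ? (int) constant($name) : -1;
}

// Preference order follows the text: MUCD first, then ICU, then the
// python-chardet mode, which exists only for testing.
$modes = array(
    'CHARDET_MOZ' => chardet_const('CHARDET_MOZ'),
    'CHARDET_ICU' => chardet_const('CHARDET_ICU'),
    'CHARDET_PY'  => chardet_const('CHARDET_PY'),
);

$mode = pick_chardet_mode($modes);
echo $mode === null ? "no detection mode available\n" : "using $mode\n";
```

The chosen name can be turned back into the mode argument with `constant($mode)` before calling the detection API.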
Repository https://github.com/OOPS-ORG-PHP/mod_chardet
Download
This download page is deprecated. Use https://github.com/OOPS-ORG-PHP/mod_chardet/releases instead.
Versions 0.x and 1.x are functionally identical.
Because the module structure changed in PHP 7, this module is split into two branches:
the 0.x branch for PHP 4/5 and the 1.x branch for PHP 7.
* PHP 7
mod_chardet-1.0.2.tar.bz2 - 2016-05-12
mod_chardet-1.0.1.tar.bz2 - 2016-01-06
mod_chardet-1.0.0.tar.bz2 - 2015-12-28
* PHP 4/5
mod_chardet-0.0.5.tar.bz2 - 2016-05-12
mod_chardet-0.0.4.tar.bz2 - 2012-11-12
mod_chardet-0.0.3.tar.bz2 - 2009-10-05
mod_chardet-0.0.2.tar.bz2 - 2009-02-24
If you download with wget, do not use wget's default user agent; set a different one with the -U option.
Samples
See also the sample script in the Repository.
* OOP style
<?php
$strings = array (
    '안녕하세요 abc는 영어고요, 가나다는 한글 입니다.',
    '안녕',
    '안녕하세요',
    '조금더 길게 적어 봅니다. 어느 정도가 필요할까요? 오호라.. 점점 길어지네',
);

try {
    $chardet = new CHARDET ();

    $i = 0;
    foreach ( $strings as $s ) {
        #
        # proto object chardet_detect (stream handle, string[, mode])
        #     stream handle : return value of chardet_open () API
        #     string        : string whose character set is to be detected
        #     mode          : optional
        #                     CHARDET_MOZ is the default if it is supported;
        #                     otherwise CHARDET_ICU is the default.
        #
        #                     CHARDET_MOZ : libchardet library result
        #                     CHARDET_ICU : icu library result
        #                     CHARDET_PY  : python-chardet result
        #
        #                     A CHARDET_(MOZ|ICU|PY) constant with the value -1
        #                     means that mode is not supported.
        #
        # return value type :
        #
        #     stdClass {
        #         encoding   : detected charset name
        #         langs      : charset language name (CHARDET_ICU mode only)
        #         confidence : detection confidence
        #         status     : error code (0 means no error)
        #     }
        #
        if ( CHARDET_MOZ != -1 )
            $moz = $chardet->detect ($s);
        if ( CHARDET_ICU != -1 )
            $icu = $chardet->detect ($s, CHARDET_ICU);
        if ( CHARDET_PY != -1 )
            $py = $chardet->detect ($s, CHARDET_PY);

        echo "$s\n";
        if ( CHARDET_MOZ != -1 )
            printf ("MOZ : Encoding -> %-12s, Confidence -> %3d, Status -> %d\n", $moz->encoding, $moz->confidence, $moz->status);
        if ( CHARDET_ICU != -1 )
            printf ("ICU : Encoding -> %-12s, Confidence -> %3d, Status -> %d\n", $icu->encoding, $icu->confidence, $icu->status);
        if ( CHARDET_PY != -1 )
            printf ("PY : Encoding -> %-12s, Confidence -> %3d, Status -> %d\n", $py->encoding, $py->confidence, $py->status);
        echo "\n";

        $i++;
    }

    $chardet->close ();
} catch ( ChardetException $e ) {
    fprintf (STDERR, "%s\n", $e->getMessage ());
    $err = preg_split ('/\r?\n/', $e->getTraceAsString ());
    print_r ($err);
}
* Function mode
<?php
$strings = array (
    '안녕하세요 abc는 영어고요, 가나다는 한글 입니다.',
    '안녕',
    '안녕하세요',
    '조금더 길게 적어 봅니다. 어느 정도가 필요할까요? 오호라.. 점점 길어지네',
);

try {
    $fp = chardet_open ();

    $i = 0;
    foreach ( $strings as $s ) {
        if ( CHARDET_MOZ != -1 )
            $moz = chardet_detect ($fp, $s);
        if ( CHARDET_ICU != -1 )
            $icu = chardet_detect ($fp, $s, CHARDET_ICU);
        if ( CHARDET_PY != -1 )
            $py = chardet_detect ($fp, $s, CHARDET_PY);

        echo "$s\n";
        if ( CHARDET_MOZ != -1 )
            printf ("MOZ : Encoding -> %-12s, Confidence -> %3d, Status -> %d\n", $moz->encoding, $moz->confidence, $moz->status);
        if ( CHARDET_ICU != -1 )
            printf ("ICU : Encoding -> %-12s, Confidence -> %3d, Status -> %d\n", $icu->encoding, $icu->confidence, $icu->status);
        if ( CHARDET_PY != -1 )
            printf ("PY : Encoding -> %-12s, Confidence -> %3d, Status -> %d\n", $py->encoding, $py->confidence, $py->status);
        echo "\n";

        $i++;
    }

    chardet_close ($fp);
} catch ( ChardetException $e ) {
    fprintf (STDERR, "%s\n", $e->getMessage ());
    $err = preg_split ('/\r?\n/', $e->getTraceAsString ());
    print_r ($err);
}
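Both samples print each result without looking at its status field. As a minimal sketch, the hypothetical helper below (`format_chardet_result()` is not part of mod_chardet) checks `status` before formatting, relying only on the `encoding`, `confidence`, and `status` fields listed in the sample comment. The object passed to it here is a hand-built stand-in so the snippet runs without the extension; it is not real detector output.

```php
<?php
// Hypothetical helper (not part of mod_chardet): format one detection result,
// or report the error code when status is non-zero (0 means no error).
function format_chardet_result(string $label, $result): string
{
    if ((int) $result->status !== 0) {
        return sprintf("%s : detection failed, Status -> %d", $label, $result->status);
    }
    return sprintf(
        "%s : Encoding -> %-12s, Confidence -> %3d",
        $label,
        $result->encoding,
        $result->confidence
    );
}

// Stand-in object with made-up values, for illustration only.
$fake = (object) array('encoding' => 'UTF-8', 'confidence' => 99, 'status' => 0);
echo format_chardet_result('MOZ', $fake), "\n";
```

In the samples above, the same call would take the return value of `$chardet->detect ($s)` or `chardet_detect ($fp, $s)` in place of the stand-in object.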
Copyright & License
Copyright (c) 2016 JoungKyun.Kim <http://oops.org> All rights reserved.
This program is licensed under MPL 1.1 or LGPL 2.1.