perl, unicode/utf8, cgi.pm, apache, mod_perl and mysql

"$gray_hairs++"

Last update 2008-03-06

Here is what I have found so far that I had to do or change to get perl and Unicode to play nicely in my LAMP stack. (LAMP in this case means Linux/Apache/Perl/MySQL and none of the other lamps.)

To start, here are the versions of the different components that I use when experimenting with this stuff:

Fedora Core 8
Perl 5.8.8
Apache 2.2.6
mod_perl 2.0.3
CGI.pm 3.15
DBI 1.58
DBD::mysql 4.005
MySQL 5.0.45

1. The plain perl/CGI script case

NOTE: This information also applies to when you run perl scripts normally from the shell (not through CGI).

1.1. First, perl needs to be invoked with the "-C" switch, so you need to add that to your hash-bang sequence starting every script like this:

#!/usr/bin/perl -C

1.2. Perl also needs the system environment variable LANG to be set, so that it can automatically detect that all I/O should be handled as unicode.

If it is not inherited by Apache from your system (which it is not in Fedora Core 8), this is probably most easily done by adding the following statement to your apache config file httpd.conf (assuming it is UTF-8 you're after, that is):

SetEnv LANG en_US.UTF-8

This is preferably done in the global context of your httpd.conf to enable it in all virtual hosts etc (assuming that is what you want).
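
To check whether the setting actually reaches your scripts, a small diagnostic script can help. The following is only a sketch: the helper name unicode_report is made up for this example, while ${^UNICODE} (the flags set by -C) and PerlIO::get_layers are built into perl 5.8.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the settings that decide whether perl
# treats I/O as UTF-8 in the current environment.
sub unicode_report {
    my %report = (
        LANG    => defined $ENV{LANG} ? $ENV{LANG} : '(not set)',
        UNICODE => ${^UNICODE},                             # bitmask set by -C / PERL_UNICODE
        STDOUT  => join(',', PerlIO::get_layers(*STDOUT)),  # layers active on STDOUT
    );
    return \%report;
}

my $report = unicode_report();
print "Content-type: text/plain\n\n";
print "$_: $report->{$_}\n" for sort keys %$report;
```

Run it once from the shell and once through Apache and compare the two outputs. If LANG shows up as "(not set)" under Apache, the SetEnv line above is not being applied.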

1.3. Then we need to do something about CGI.pm, since it does not automatically decode utf8 CGI parameters (like when you use $cgi->param("paramname")).

The newer versions of CGI.pm (~3.30 something) have some utf8 decoding capabilities through the -utf8 switch (see resources #2), but setting it globally may corrupt file uploads, which should be handled as binary without any decoding. Besides, the fedora distribution I'm using (FC8) does not ship with that version yet anyway (and I like using the stuff shipped with the distro, to avoid having to patch things manually to the greatest extent possible).

Thus I use a wrapper for CGI.pm that someone published at perlmonks.org called 'as_utf8', which also takes care of the file-upload case and does not perform any conversion there.

See resources #3 for the wrapper script. (Just copy the text into a file and put it somewhere in your @INC, for instance /usr/lib/perl5/site_perl/CGI/as_utf8.pm)

To use it, just replace the normal "use CGI;" with a "use CGI::as_utf8;" in your scripts and it should transparently decode your $cgi->param("paramname") for you (without screwing up file-uploads etc).
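
To give an idea of what such a wrapper has to do, here is a minimal sketch of the decoding step. It is not the code from the perlmonks wrapper; the helper name decode_param_value is made up, and it assumes that upload values arrive as references (filehandle objects) while ordinary values are plain byte strings.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode one raw CGI value from UTF-8 bytes to
# perl characters. References (assumed to be upload filehandles) and
# already decoded strings are passed through untouched.
sub decode_param_value {
    my ($value) = @_;
    return $value if !defined $value || ref $value;
    return $value if utf8::is_utf8($value);
    return Encode::decode('UTF-8', $value);
}

my $raw     = "\xc3\xa5\xc3\xa4\xc3\xb6";   # the bytes a browser sends for "åäö"
my $decoded = decode_param_value($raw);
printf "bytes: %d, characters: %d\n", length $raw, length $decoded;
```

The point of keeping this inside a wrapper module is that the scripts themselves stay free of any explicit decoding calls.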

1.4. Specifying charset in your HTML output 

To have your browser client understand your UTF8 encoded text and display it correctly, you must somehow make it aware of this fact. This can be done in several ways.

Either specify it in your apache config file using the following statement in the global context:
AddDefaultCharset UTF-8
Note that this of course will affect ALL directories and all virtual servers, which might not always be what you want.

To specify it on a per-script basis, you can put it in the HTTP response header of your CGI script's reply, replacing your normal content-type with the following:
print "Content-type: text/html; charset=utf-8\n\n";

You can also specify it in the HTML-code using a meta-tag:
<META http-equiv="Content-Type"
content="text/html; charset=utf-8"/>

Or in XHTML, using the following XML declaration at the very top of the document:
<?xml version="1.0" encoding="utf-8" ?>

Personally, I prefer using the HTTP-response header in my CGI script.
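
As a sketch of the header variant in a complete (if tiny) script: the helper name utf8_html_header is made up for this example, and the binmode call is only there so that the body is really sent as UTF-8 even if the script is started without -C.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the HTTP header block for a UTF-8 encoded HTML page.
sub utf8_html_header {
    return "Content-type: text/html; charset=utf-8\n\n";
}

# Make sure the characters printed below are encoded as UTF-8 on output,
# so that the body matches the charset declared in the header.
binmode(STDOUT, ':utf8');

print utf8_html_header();
print "<html><body><p>Kalle \x{e5}\x{e4}\x{f6}</p></body></html>\n";
```

Whatever method you choose, the declared charset and the bytes actually sent must agree, otherwise the browser will show doodles.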

1.5. Finally, you also need to add the 'use utf8;' statement to all scripts, to be able to use utf-8 strings in your perl-scripts.

When this is specified, you can safely use utf8 encoded characters in your perl-script (assuming you have a utf8 enabled text editor, of course) like normal and they will be treated as a proper utf8 string internally by perl:

my $s = "Kalle åäöÅÄÖ";

If you don't specify "use utf8;", things might at first appear to work anyway, because perl treats each byte of the utf8 encoded multi-byte characters as a separate character. That holds right up until you start using split, substr, regexes and the like.

Illustrative example on this:

#!/usr/bin/perl -C

use utf8;

my $s = "abcåäö";
print "string: ".$s."\n";
print "split/join: ".join( ",", split( "", $s))."\n";
$s =~ s/(.)/$1.*/g;
print "regexed: ".$s."\n";

This should give the output:

string: abcåäö
split/join: a,b,c,å,ä,ö
regexed: a.*b.*c.*å.*ä.*ö.*

But if you don't specify the "use utf8;", the output will look like this:

string: abcåäö
split/join: a,b,c,Ã,¥,Ã,¤,Ã,¶
regexed: a.*b.*c.*Ã.*¥.*Ã.*¤.*Ã.*¶.*

(Note the commas between the doodles: each doodle is one of the individual bytes that the unicode characters are made up of.)
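
The same difference can be shown without relying on the encoding of the script file. This is a small sketch that writes the UTF-8 bytes out explicitly and uses the built-in utf8::decode to turn them into characters; the helper name lengths is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the length of a string counted as bytes
# and counted as characters after UTF-8 decoding.
sub lengths {
    my ($bytes) = @_;
    my $chars = $bytes;
    utf8::decode($chars);    # built-in: decode UTF-8 bytes in place
    return (length $bytes, length $chars);
}

# The UTF-8 bytes of "abcåäö": three ASCII bytes plus three two-byte characters.
my ($as_bytes, $as_chars) = lengths("abc\xc3\xa5\xc3\xa4\xc3\xb6");
print "undecoded: $as_bytes units\n";   # 9
print "decoded:   $as_chars units\n";   # 6
```

split, substr and regexes all work on these units, which is why they cut the characters in half when the string has not been decoded.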

2. The mod_perl case

2.1. First you need to enable the -C perl switch here also, which is done by specifying the PerlSwitches in the configuration file for apache where you have your mod_perl configuration parameters. On my FC8 box, it is found under: /etc/httpd/conf.d/perl.conf .

Just add:

PerlSwitches -C

somewhere in there. You must specify it in the global context and hence ALL your mod_perl scripts will be affected on all virtual hosts etc (to be considered if applied in a production context).

2.2. binmode( STDOUT, ':utf8');

In some situations, to get things to display nicely, I've discovered you must specify, using binmode, that your STDOUT (where all your prints go) is in utf8. However, this is not always true, and it probably has something to do with versions of mod_perl, but I haven't been able to work that relationship out yet. What I have is that I need it on my FC8 machine, but not on my FC6 machine.

If you have scripts that you want to run both as normal CGI scripts and under mod_perl, you can add something like this to your script, somewhere early in the beginning of it:

binmode( STDOUT, ':utf8') if( $ENV{'MOD_PERL'});

Otherwise if you're only going to run it under mod_perl, you can skip the "if"-part...

Also consider the next clause on IO-layers.

DISCLAIMER: I have not really figured out why the use of binmode is necessary in this case. I have also read in several places that people state that it is wrong to do this, that it should not even be possible etc, but in my setup it works, and I have not found any other way to make the output display correctly without it. If anyone knows, please let me know.

3. Opening files in scripts and in modules / Perl IO-layers

To be able to read files and have perl decode the input correctly, you need to specify which "IO-layer" to use when opening the file. This can either be done for each filehandle by using the:
binmode( FILEHANDLE, ':utf8');
or by specifying on a more general level by using the open-pragma in the beginning of your script like in this example:
use open IO => ':locale';

You can specify other combinations of the above pragma, but with the one above, all file I/O will default to the encoding of your locale, which is utf8 when LANG is set as in 1.2. Note that, according to the documentation of the pragma, STDIN/STDOUT/STDERR are only affected if you also add the ":std" subpragma. The ":locale" means that perl considers the locale environment variables ($ENV{LANG} in most cases) when setting the default IO-layer, which also should make it compatible with non utf8 systems.

If you use the open IO pragma, note that also binary files will be opened with utf8-decoding, which is probably not what you want, and you need to use binmode on that filehandle explicitly to open it as binary:
binmode( BINARYFILE, ':raw');

To read more about this pragma, see resources #6. 
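
If you prefer not to depend on any default at all, the layer can also be given directly in the open call. The following is a sketch that writes a small temporary file and reads it back both as text and as binary; the helper name first_line is made up, and it uses ':encoding(UTF-8)', the stricter relative of ':utf8'.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 text file as raw bytes so there is something to read.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "Kalle \xc3\xa5\xc3\xa4\xc3\xb6\n";
close $tmp or die "close failed: $!";

# Hypothetical helper: read the first line of a file through the given layer.
sub first_line {
    my ($file, $layer) = @_;
    open(my $fh, "<$layer", $file) or die "cannot open $file: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $text   = first_line($path, ':encoding(UTF-8)');   # decoded to characters
my $binary = first_line($path, ':raw');               # left as plain bytes
printf "text: %d characters, raw: %d bytes\n", length $text, length $binary;
```

This should print "text: 9 characters, raw: 12 bytes", since each of the three extended characters takes two bytes on disk.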

3.1. Opening files in modules

An important thing, that did not strike me as logical at first, is that if you are opening files in modules, the default IO layer as specified with the use open-pragma is not inherited by the module from the script that uses it. So you will also need to add the pragma to all modules that you use, if they open files and read text. Otherwise the text read from a file in a module will not be decoded properly. Adding the use open IO => ':locale' to the module should do the trick.

After thinking about it, it makes sense that modules do not inherit this property, since the module may be opening files using another IO-layer, such as raw. Also, pragmas in general are lexically scoped, that is, local to the file or block they appear in, so I suppose it would be strange if this pragma was not. However, it makes life yet a little more difficult at times.
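
The scoping can be demonstrated within a single file, using a block to play the role of the script and the code after it to play the role of the module. This is only a sketch: the sub names read_inside and read_outside are made up, and it assumes perl is started without -C and without PERL_UNICODE set, so that the default outside the block is plain bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "\xc3\xa5\xc3\xa4\xc3\xb6";    # the UTF-8 bytes of "åäö"
close $tmp or die "close failed: $!";

{
    # The pragma only applies to open() calls compiled inside this block.
    use open IN => ':utf8';

    sub read_inside {
        my ($file) = @_;
        open(my $fh, '<', $file) or die "cannot open $file: $!";
        local $/;                 # slurp the whole file
        my $data = <$fh>;
        close $fh;
        return $data;
    }
}

# Identical code, but compiled outside the scope of the pragma.
sub read_outside {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "cannot open $file: $!";
    local $/;
    my $data = <$fh>;
    close $fh;
    return $data;
}

printf "inside: %d, outside: %d\n",
    length read_inside($path), length read_outside($path);
```

Under the stated assumptions the first number is the character count and the second the byte count, even though both subs open the very same file in the very same way.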

4. DBI/MySQL

4.1. First of all, you must specify that all your tables and columns use the utf8 encoding when you create them. You can also change this afterwards, and I think mysql recodes the existing data in them automatically (better check this first).

4.2. The normal DBI does not seem to be unicode aware yet, unfortunately. However recent versions of DBD::mysql (>3.0007) have a work-around for this, and that is to set the mysql_enable_utf8 option right after connecting to the database.

That is done by poking into the dbh-object and setting the parameter like this right after the DBI->connect:

my $dbh = DBI->connect( <parameters>);
$dbh->{'mysql_enable_utf8'} = 1;

After doing this, the "SET NAMES utf8" SQL statement should not be necessary according to the documentation, but in my experience, it may still be needed for the version of DBD::mysql I'm using. On top of this, unfortunately, the manual page for my DBD version states the following about the "mysql_enable_utf8" parameter:

"This option is experimental and may change in future versions."

Which makes me a bit worried for future compatibility, but for now it can be used I hope.

As I checked on a FC6 box, that parameter does not seem to be included in the 3.0007 release of DBD::mysql, so you need a newer version for it to work.
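
An alternative to poking into the handle afterwards is to pass the option in the attribute hash of the connect call. As far as I can tell, later versions of the DBD::mysql manual say that the option only takes full effect when given at connect time, and that "SET NAMES utf8" is needed if it is switched on afterwards, which might explain the behaviour described above; check the manual page of your own version. The sketch below only builds the arguments, so it runs without a database. The helper name utf8_connect_args and the database name, host, user and password are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for DBI->connect with
# utf8 handling switched on from the start.
sub utf8_connect_args {
    my ($database, $host, $user, $password) = @_;
    my $dsn  = "DBI:mysql:database=$database;host=$host";
    my %attr = (
        RaiseError        => 1,   # die on errors instead of returning undef
        mysql_enable_utf8 => 1,   # DBD::mysql specific attribute
    );
    return ($dsn, $user, $password, \%attr);
}

my @args = utf8_connect_args('test', 'localhost', 'someuser', 'secret');
print "DSN: $args[0]\n";

# With DBI and DBD::mysql installed and a server running, the handle
# would then be created like this:
#   require DBI;
#   my $dbh = DBI->connect(@args);
```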

4.3. Another wrapper: UTF8DBI
An alternative is to use another short wrapper script that I've found, called UTF8DBI.pm. There seem to exist several versions of this, the later ones claiming to solve bugs in the first one, but the one I use successfully can be found under resources #4.

It's quite simple to use and should work rather transparently. Just replace your normal "use DBI" with "use UTF8DBI", and instead of calling DBI->connect to get a new database handle, call UTF8DBI->connect, as seen in the example in 4.5.

4.4. Notes about the UTF8DBI-wrapper

1. selectrow_array does not seem to be safe to use, it sometimes fails to correctly decode UTF8 strings. Use selectrow_hashref instead.

2. $dbh->quote does not always seem to be safe either. I have to check a bit further whether this really is the case or not, but in the meantime, use the "?" placeholder functionality instead. Example:
$dbh->do( "INSERT INTO stuff SET id=?, name=?", undef, $id, $name);

4.5. You must also specify the character set when you connect to your MySQL database. This is done by issuing an SQL statement right after you have connected to your database, like so:

my $dbh = UTF8DBI->connect( <parameters>);
$dbh->do( "SET NAMES utf8");

4.6. To verify that the right encoding is used in the database, I recommend installing the MySQL GUI tool "MySQL Administrator" and peeking into your tables to see if your extended characters look right. Of course, you can also do this using the console utility and a "SELECT * FROM blah", but that output might be screwed up by other parameters, such as terminal emulation, character set in the terminal window, LANG environment variable, my.cnf etc. If you use the GUI tool, you can be pretty sure that what you see is what there actually is.

5. ALL SET FOR UNICODE/UTF-8 !?

This should, I hope, be all you need to do to get proper unicode/utf-8 support to work using Linux / Apache / Perl / mod_perl / MySQL.

Unfortunately, I suspect there are a lot of other cases to cover, different versions of perl modules, different versions of Apache and mod_perl etc, so even if this works for me, it will probably not work for everybody.

6. Other general considerations / recommendations

6.1. Encode::encode/decode

To start with, you should avoid using Encode::encode/decode/from_to to the greatest possible extent in your scripts. This will only lead to great confusion later. You may think you have gotten everything to work, but then a week later, you only need to add a little more functionality to your work and suddenly everything falls apart and doodles appear on your web pages.

It is better to try to get the modules you use to natively support UTF8, and before they do, use some wrapper that mends them in the meantime and keep your scripts clean of any utf8 specific code.

6.2. Perl versions

Everyone seems to be saying: use perl >=5.8 whenever working with unicode/utf8. Personally, I haven't used 5.6 or lower for quite a while and have no experience with this. I'm just passing on what other people say. 5.6 is rather old by now, so it might be wise to upgrade anyway.

RESOURCES

  1. Discussion around DBD::mysql and utf8 and also about using the UTF8DBI.pm wrapper module:
    http://www.simplicidade.org/notes/archives/2005/12/utf8_and_dbdmys.html
  2. Documentation on latest version of CGI.pm:
    http://search.cpan.org/dist/CGI.pm/CGI.pm
  3. Wrapper script for CGI.pm:
    http://www.perlmonks.org/?node_id=651574
  4. UTF8DBI.pm that seems to be the most correct one:
    http://perl7.ru/lib/UTF8DBI.pm
  5. Perl manual page on runtime switches and the -C switch:
    http://perldoc.perl.org/perlrun.html
  6. Perl open IO pragma:
    http://perldoc.perl.org/open.html
  7. Generic FAQ on unicode/utf8:
    http://www.cl.cam.ac.uk/~mgk25/unicode.html
  8. Perl unicode FAQ:
    http://search.cpan.org/~rgarcia/perl-5.10.0/pod/perlunifaq.pod

Copyright C.Bingel 2008