Unicode Support on CentOS 5.2 with PHP and PCRE
Yesterday, I talked about how to get the most out of running regular
expressions in PHP. I had to dig deep into PHP's regular expression
syntax because I needed to write some regular expressions that deal
with Unicode characters.
After much reading, I believed that I knew everything that I needed.
I started writing some regex strings and testing the code.
Unfortunately, every time I ran a test with a string that contained
Unicode characters, the match failed. When I removed the Unicode
characters from the string and tested again, it would work. I was
baffled.
Finding the Problem
I had the Unicode escape sequences (‘\X’, ‘\pL’, etc.) inside of a
character class, such as ‘[\X-]’, since I was creating a regex to test
for domains. To narrow things down, I wrote a really simple rule,
‘/^\X$/’, and tested it against a single Unicode character.
Amazingly, having the ‘\X’ outside of the square brackets changed
everything, as I now received the following very concerning warning:
PHP Warning: preg_match(): Compilation failed:
support for \P, \p, and \X has not been compiled at offset 2 in
wp-content/plugins/dnsyogi/testunicode.php on line 4
Since PHP uses the PCRE engine to run regular expressions, I started
to dig into it. I found out that I could query PCRE directly. I ended
up with something very similar:
$ pcregrep '/\X*/u' character.txt
pcregrep: Error in command-line regex at offset 2: support for \P, \p, and \X has not been compiled
It looked like the error was coming from PCRE itself. I searched
around for a while thinking that I could simply install a new package
using yum. I hoped to find something like pcre-utf8, pcre-unicode,
php-pcre-unicode, or something to make it simple and quick to add this
support since I much prefer using package management tools rather than
compiling and installing from source.
Unfortunately, no such package exists. This support has to be
enabled when PCRE is compiled, and my CentOS repository only has
packages built without it. After much digging around, I found that
this isn’t necessarily CentOS’s fault, as this package has carried
over from the RHEL (Red Hat Enterprise Linux) side of things.
A great way of checking to see if this is an issue on your system is by running the following:
$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
UTF-8 support
No Unicode properties support
Newline character is LF
Internal link size = 2
POSIX malloc threshold = 10
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack
This is the output that I received. Notice the “UTF-8 support” and
the “No Unicode properties support” lines. This means that PCRE was
compiled with the “--enable-utf8” configure option, which allows PCRE to
recognize and work with UTF-8 encoded strings. However, it wasn’t
compiled with the “--enable-unicode-properties” configure option, which
works in conjunction with the --enable-utf8 option to add support for the
‘\p’, ‘\P’, and ‘\X’ escape sequences.
This seems to have been an oversight when the rpm file was first put together. Fortunately, there is a way to fix it.
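If you want to script this check rather than read the output by eye, something like the following could work. It is only a sketch: the function name has_unicode_properties is made up for illustration, and it assumes the wording of the pcretest -C output matches what is shown above.

```shell
# Hypothetical helper: reads `pcretest -C` output on stdin and prints
# "yes", "no", or "unknown" depending on the Unicode properties line.
has_unicode_properties() {
    awk '
        # A build without support prints "No Unicode properties support".
        /No Unicode properties support/ { found = "no" }
        # A build with support prints the same line without the leading "No".
        /^[[:space:]]*Unicode properties support/ { found = "yes" }
        END { print (found == "" ? "unknown" : found) }
    '
}

# Usage on a real system (requires pcretest to be installed):
#   pcretest -C | has_unicode_properties
```

The second pattern is anchored to the start of the line so that the “No Unicode properties support” line is never mistaken for a positive result.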
Fixing the Problem
Since I’m sure that many of you are like me and would rather not
manually compile and install software outside of the package management
system, the solution is to update the rpm to have the option that it
needs and install it.
I had never done this before. Fortunately, I found a very helpful guide that lays out this process very nicely: How to patch and rebuild an RPM package.
I have provided the new rpm file that I have built at the bottom of
this post. If you don’t care about all this jibber-jabber, you can skip
down there and grab the file. However, if you would like to learn how
to address this issue yourself or have a system that my file will not
support, please read on to see how I rebuilt the rpm with the new
option.
Rebuilding the rpm
- The first thing I did was set up my ~/.rpmmacros file and ~/src/rpm folder structure as detailed in the Setup section of the guide that I’m following. I’ll simply refer you over there as it doesn’t need repeating here.
- I needed to grab the source rpm for the current version of PCRE on
my platform. I’m on CentOS 5.2 with version 6.6 of PCRE. I found the
matching source rpm file (pcre-6.6-2.el5_1.7.src.rpm) here.
- I then installed the source rpm in order to gain access to its files:
$ rpm -ivh pcre-6.6-2.el5_1.7.src.rpm
This put the necessary files into my ~/src/rpm/SOURCES and ~/src/rpm/SPECS folders.
- I opened up the ~/src/rpm/SPECS/pcre.spec file and found the following line:
%configure --enable-utf8
I changed it to include the Unicode properties option:
%configure --enable-utf8 --enable-unicode-properties
I then saved and closed the file.
- This is the only change that I needed to make, so it was time to
build the new rpm file. I simply ran the following to build
it:
$ rpmbuild -ba ~/src/rpm/SPECS/pcre.spec
Toward the end of the large amount of output, I received the following:
Wrote: ~/src/rpm/SRPMS/pcre-6.6-2.7.src.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-6.6-2.7.x86_64.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-devel-6.6-2.7.x86_64.rpm
Wrote: ~/src/rpm/RPMS/x86_64/pcre-debuginfo-6.6-2.7.x86_64.rpm
This tells me exactly where I can find my new source rpm and rpm files.
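The spec file edit in the steps above can also be scripted instead of done by hand in an editor. The following is a minimal sketch, not the way I actually did it: the function name enable_unicode_properties is hypothetical, and it assumes the spec file contains a line starting with "%configure --enable-utf8" exactly as shown above.

```shell
# Hypothetical helper: appends --enable-unicode-properties to the
# %configure line of a spec file, unless the option is already present.
enable_unicode_properties() {
    spec=$1
    if grep -q -- '--enable-unicode-properties' "$spec"; then
        return 0    # already patched, leave the file alone
    fi
    tmp=$(mktemp) || return 1
    # Only touch lines that start with "%configure".
    sed '/^%configure/ s/--enable-utf8/--enable-utf8 --enable-unicode-properties/' "$spec" > "$tmp" \
        && cat "$tmp" > "$spec"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage, followed by the same rebuild command as above:
#   enable_unicode_properties ~/src/rpm/SPECS/pcre.spec
#   rpmbuild -ba ~/src/rpm/SPECS/pcre.spec
```

Writing to a temporary file and copying it back avoids relying on an in-place option for sed, and running the function twice leaves the file unchanged the second time.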
Updated rpm File for CentOS 5.2 64-bit
If you are running a 64-bit version of CentOS 5.2, the following
file should work for you. If you have a different architecture, Linux
distro, or encounter any errors when trying to install this file, then
you should follow the instructions above to build an rpm that is
suitable for your distribution.
pcre-6.6-2.7.x86_64.rpm – PCRE 6.6 for CentOS 5.2 64-bit
Thanks to Robin for providing a 32-bit version:
pcre-6.6-2.7.i386.rpm
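If you are not sure which of the two files applies to your machine, a small lookup like this could help. It is just a sketch: the function name pcre_rpm_for_arch is made up, and treating every 32-bit x86 value of uname -m as a match for the i386 package is my assumption.

```shell
# Hypothetical helper: maps an architecture string (as printed by
# `uname -m`) to the matching rpm file name from this post.
pcre_rpm_for_arch() {
    case "$1" in
        x86_64)
            echo "pcre-6.6-2.7.x86_64.rpm" ;;
        i386|i486|i586|i686)
            # Assumption: 32-bit x86 machines take the i386 package.
            echo "pcre-6.6-2.7.i386.rpm" ;;
        *)
            echo "no prebuilt rpm for $1; rebuild from the source rpm" >&2
            return 1 ;;
    esac
}

# Usage:
#   pcre_rpm_for_arch "$(uname -m)"
```

For any other architecture the function prints a message to stderr and returns a non-zero status, which is the cue to follow the rebuild instructions above.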
Installing New rpm
Now that I have my new rpm file, I just need to install it. Since I
already have a pcre package installed, I need to tell the rpm command
to update rather than install. The following command does this for me:
# rpm -Uvh ~/src/rpm/RPMS/x86_64/pcre-6.6-2.7.x86_64.rpm
Notice that I need to be root to run this command.
Finally, to verify that everything worked, I ran the pcretest program again:
$ pcretest -C
PCRE version 6.6 06-Feb-2006
Compiled with
UTF-8 support
Unicode properties support
Newline character is LF
Internal link size = 2
POSIX malloc threshold = 10
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack
Looks good.
Finally, time to move on with life.
Tags: CentOS, PCRE, PHP, regular expressions, Unicode
Cited Source: Unicode Support PHP CentOS