While doing some updates to the blog, such as adding author information and other useful information to hopefully help with legitimatizing things, I spent about two hours trying to figure out why a shortcode in the theme that I use (Suffusion) would not parse arguments. The tag suffusion-the-author would display author name, but if I added an argument to make it display the description or a link instead of just the name, it would always fail and revert to the author name only. Some quick googling told me that this didn't seem to be a problem anyone else had, and after four major versions for the theme, I'm pretty sure someone would've noticed if a basic feature were entirely nonfunctional. After lots of debugging, I discovered that the problem was happening in Wordpress's code, inside the definition of shortcode_parse_atts(), which parses the attributes matched in the shortcode into an array for use inside the callback. For me, instead, it killed all of the tag arguments and returned an empty string. In the end I found that the issue was a regular expression that appears to be designed to replace certain types of unicode spaces with normal spaces to simplify the argument parsing regular expression that comes after

$text = preg_replace("/[\x{00a0}\x{200b}]+/u", " ", $text);

On my system, this would simply set $text to an empty string, every time. Initially, I wanted to just comment it out, because I'm not expecting to see any of these characters inside tags, but that bothered me, because this has been a part of the Wordpress code for a very long time, and nobody else appears to have problems with it. Finally, I concluded that this must mean that unicode support was not enabled, so I began searching through yum. Bad news, mbstring and pcre were all already installed, and php -i told me that php was built with multibyte string support enabled, as well as pcre regex support. I tested my pcre libraries and they claimed to support unicode as well.

In the end, I solved the problem by updating php and pcre to the latest version, which was an update from php-5.3.13 to php-5.3.24 in my case, and pcre was updated from 7.x to 8.21. It appears that there was some kind of incompatibility between php 5.3 and the 7.x branch of pcre (if you build php 5.3 from source, it will target pcre 8.x), which prevented me from using unicode despite support for it being present everywhere. So, if you have trouble getting your php installation to handle unicode inside regular expressions, and you're running on Amazon Web Services, make sure you update to the latest versions of php and pcre, as there appear to be some issues with older packages.