Ruby, cron, UTF-8, and the locale environment variables
I’ve just spent an hour or so debugging a problem with one of our internal Web services at work, and I thought I’d share the details, in case anyone else comes across it.
The problem was with a Ruby script running on an Ubuntu 12.04 server. The script runs under cron and periodically requests data from a JSON Web service and adds it to an Apache Solr data store. The program was written a few years ago, and had been running fine until just recently when it started failing to update Solr. The log file showed the program was exiting because of the following exception:
/path/to/lib/ruby/2.0.0/json/common.rb:155:in `encode': "\xC2" on US-ASCII (Encoding::InvalidByteSequenceError)
    from /path/to/lib/ruby/2.0.0/json/common.rb:155:in `initialize'
    from /path/to/lib/ruby/2.0.0/json/common.rb:155:in `new'
    from /path/to/lib/ruby/2.0.0/json/common.rb:155:in `parse'
The code in question looked like this:
response = RestClient.get(url)
data = JSON.parse(response.body)
post_to_solr(data)
and the exception was coming from the JSON.parse call.
The exception indicated a character encoding problem, so I took a look at the data coming from the Web service to see if it was valid JSON, and to investigate the 0xC2 value that JSON.parse was complaining about. This turned out to be the first byte of the sequence C2 A3, which in UTF-8 encodes the character POUND SIGN (U+00A3). The question then was why the JSON parser should think the encoding was US-ASCII, when the Web service was returning UTF-8.
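To make the failure concrete, here is a minimal sketch that reproduces it without the Web service. The constant and method names are my own, chosen for illustration; the only thing taken from the real system is the byte pair C2 A3.

```ruby
# The two bytes from the response, held as raw binary data.
POUND_BYTES = "\xC2\xA3".b

# Read as UTF-8, the pair is one valid character: U+00A3 POUND SIGN.
def pound_sign_as_utf8?
  text = POUND_BYTES.dup.force_encoding(Encoding::UTF_8)
  text.valid_encoding? && text == "\u00A3"
end

# Labelled as US-ASCII, the same bytes are invalid (ASCII stops at 0x7F),
# so transcoding to UTF-8 fails on the first byte, 0xC2.
def transcode_error_from_ascii
  POUND_BYTES.dup.force_encoding(Encoding::US_ASCII).encode(Encoding::UTF_8)
  nil
rescue Encoding::InvalidByteSequenceError => e
  e
end

puts pound_sign_as_utf8?
error = transcode_error_from_ascii
puts "#{error.class}: #{error.message}" if error
```

The bytes themselves are fine; what matters is the encoding label attached to the string when it reaches the parser.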
I tried running the script manually to see if I could reproduce the error, but when run this way the script worked fine. So the problem wasn’t with the script itself, but with something in its execution environment.
The locale program can be used to get information on the locale settings in use. When run from the shell it gave this output:
LANG=en_GB.UTF-8
LANGUAGE=en_GB:en
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=
but when run under cron it output this instead:
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
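Ruby takes its default external encoding from the locale at startup, which is a plausible route for the US-ASCII label seen in the exception (I have not traced exactly where RestClient or the JSON library picks it up). The sketch below starts a child Ruby process under a chosen locale and reports the encoding it selects. The method name default_external_under is my own, and it sets LC_ALL rather than LANG simply because LC_ALL overrides the other variables.

```ruby
require "open3"
require "rbconfig"

# Run a child Ruby under the given locale and return the name of the
# encoding it chooses for external data.
def default_external_under(locale)
  # A nil value unsets the variable in the child, so inherited options
  # such as -E in RUBYOPT cannot mask the effect of the locale.
  env = { "LC_ALL" => locale, "RUBYOPT" => nil }
  out, _status = Open3.capture2(env, RbConfig.ruby, "-e",
                                "print Encoding.default_external.name")
  out
end

puts default_external_under("C")            # the POSIX locale, as under cron
puts default_external_under("en_GB.UTF-8")  # only meaningful if this locale is installed
```

On Linux the first call should report US-ASCII. The second depends on the named locale being installed on the machine; if it is not, the child process falls back to the same default as the POSIX locale.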
Next I ran the script again, but with the LANG environment variable cleared:
$ LANG="" ./import-script
Sure enough, this reported the same error. Fixing the problem was simple: it was enough to set just LANG in the crontab file:
LANG="en_GB.UTF-8"
0 * * * * /path/to/import-script
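A complementary approach, which I did not need here, is to make the script itself independent of the locale by labelling the response body as UTF-8 before parsing it. This is only a sketch: parse_utf8_json is a hypothetical helper, not part of RestClient or the JSON library, and it assumes the service always returns UTF-8.

```ruby
require "json"

# Hypothetical helper: parse a response body as JSON, treating its bytes as
# UTF-8 regardless of how the string happens to be labelled.
def parse_utf8_json(body)
  # dup so the caller's string keeps its original encoding label
  text = body.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "response body is not valid UTF-8" unless text.valid_encoding?
  JSON.parse(text)
end
```

In the script above it would replace the direct call, as in data = parse_utf8_json(response.body). Note that force_encoding only changes the label and leaves the bytes alone, which is why the validity check is there.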
Problem solved, and users of this Web service are happy again :-)