I’ve just spent an hour or so debugging a problem with one of our internal Web services at work, and I thought I’d share the details, in case anyone else comes across it.

The problem was with a Ruby script running on an Ubuntu 12.04 server. The script runs under cron and periodically requests data from a JSON Web service and adds it to an Apache Solr data store. The program was written a few years ago, and had been running fine until just recently when it started failing to update Solr. The log file showed the program was exiting because of the following exception:

/path/to/lib/ruby/2.0.0/json/common.rb:155:in `encode': "\xC2"
  on US-ASCII (Encoding::InvalidByteSequenceError)
  from /path/to/lib/ruby/2.0.0/json/common.rb:155:in `initialize'
  from /path/to/lib/ruby/2.0.0/json/common.rb:155:in `new'
  from /path/to/lib/ruby/2.0.0/json/common.rb:155:in `parse'

The code in question looked like this:

response = RestClient.get(url)
data = JSON.parse(response.body)
post_to_solr(data)

and the exception was coming from the JSON.parse line.

The exception indicated a character encoding problem, so I took a look at the data coming from the Web service to see if it was valid JSON, and to investigate the 0xC2 value that JSON.parse was complaining about. This turned out to be the first byte of the sequence C2 A3, which in UTF-8 encodes the character POUND SIGN (U+00A3). The question then was why the JSON parser thought the encoding was US-ASCII when the Web service was returning UTF-8.
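To make the byte-level picture concrete, here is a small sketch (not taken from the import script) showing that the same two bytes are a valid character when read as UTF-8 but cannot be transcoded when they are labelled US-ASCII. The variable names are only for illustration.

```ruby
# The two bytes seen in the response, held as raw binary data.
bytes = "\xC2\xA3".b

# Read as UTF-8, the pair is one valid character: U+00A3 POUND SIGN.
as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
pound_codepoints = as_utf8.codepoints

# Labelled as US-ASCII, the same bytes are invalid, so transcoding fails.
as_ascii = bytes.dup.force_encoding(Encoding::US_ASCII)
transcode_error = begin
  as_ascii.encode(Encoding::UTF_8)
  nil
rescue Encoding::InvalidByteSequenceError => e
  e
end

puts "valid as UTF-8:    #{as_utf8.valid_encoding?}"
puts "code points:       #{pound_codepoints.map { |cp| format('U+%04X', cp) }.join(' ')}"
puts "valid as US-ASCII: #{as_ascii.valid_encoding?}"
puts "transcoding error: #{transcode_error.class}"
```

This is the same class of exception that appears in the log, which suggests the response body was being treated as US-ASCII rather than UTF-8 by the time it reached the parser.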

I tried running the script manually to see if I could reproduce the error, but when run this way the script worked fine. So the problem wasn’t with the script itself, but with something in its execution environment.

The locale program can be used to get information on the locale settings in use. When run from the shell it gave this output:

LANG=en_GB.UTF-8
LANGUAGE=en_GB:en
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

but when run under cron it output this instead:

LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

The locale documentation and the POSIX specification have the details on these variables.
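Ruby takes its default external encoding from these locale settings, which is the link between the cron environment and the exception. The sketch below starts a child Ruby process with the locale variables removed and asks it which encoding it chose. The helper name default_external_under is made up for this example, and the exact result depends on the platform; on a typical Linux system a POSIX locale gives US-ASCII.

```ruby
require "rbconfig"

# Hypothetical helper: start a child Ruby with the given environment changes
# and return the name of the default external encoding it picks up.
def default_external_under(env_changes)
  script = "print Encoding.default_external.name"
  IO.popen(env_changes, [RbConfig.ruby, "-e", script], &:read)
end

# A nil value removes the variable from the child's environment,
# approximating the empty locale seen under cron.
cron_like = { "LANG" => nil, "LC_ALL" => nil, "LC_CTYPE" => nil }

puts "this process: #{Encoding.default_external.name}"
puts "cron-like:    #{default_external_under(cron_like)}"
```

Running this from an interactive shell with a UTF-8 locale should show two different encodings, one for each environment.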

Next I ran the script again, but with the LANG environment variable cleared:

$ LANG="" ./import-script

Sure enough, this reported the same error. Fixing the problem was simple: it was enough to set just LANG in the crontab file:

LANG="en_GB.UTF-8"
0 * * * *       /path/to/import-script
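A complementary safeguard, which is not what was done here, would be to make the script independent of the locale by labelling the response body as UTF-8 before parsing it. The sketch below assumes the service always returns UTF-8 and that the body arrives tagged with the process's default external encoding; parse_utf8_json is a hypothetical helper, not part of RestClient or the json library.

```ruby
require "json"

# Hypothetical helper: treat the raw bytes as UTF-8, whatever encoding
# label they arrived with, and check them before handing them to the parser.
def parse_utf8_json(body)
  text = body.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "response body is not valid UTF-8" unless text.valid_encoding?
  JSON.parse(text)
end

# Simulate a body that was labelled US-ASCII, as in the cron environment.
raw = ('{"price": "' + "\xC2\xA3" + '10"}').force_encoding(Encoding::US_ASCII)

data = parse_utf8_json(raw)
puts data["price"].encoding
```

Note that force_encoding only changes the label on the string and does not convert any bytes, so this is only appropriate when the service really does send UTF-8.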

Problem solved, and users of this Web service are happy again :-)