Character encoding of files from S3 buckets.
Don’t ask why, but recently I had to read obscure files in what I believe is COBOL
copybook format. These files were either encoded in EBCDIC 037 (aka »CP037«) or in
ASCII/UTF-8, depending on the system that wrote them. If you have read
»The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)«,
you might already get the idea that simply reading the data as a UTF-8 String,
checking which encoding it is, and then re-encoding it in a different format can
result in surprises.
Usually this is a non-issue, because you can just read a file from disk as an
InputStream, wrap it in a PushbackInputStream, and peek at the first bytes to
determine the encoding.
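A minimal sketch of that peek-and-unread approach (the class name, the digit heuristic, and the fallback to CP037 are my assumptions, mirroring the copybook case described below):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingPeek {

    // Read the first byte, then push it back so callers still see the full stream.
    // ASCII/UTF-8 digits are 0x30..0x39; EBCDIC (CP037) digits are 0xF0..0xF9.
    static Charset sniffCharset(PushbackInputStream in) throws IOException {
        int first = in.read();
        if (first == -1) {
            throw new IOException("empty stream");
        }
        in.unread(first); // undo the read: the stream is intact again
        if (Character.isDigit((char) first)) {
            return StandardCharsets.UTF_8;
        }
        return Charset.forName("cp037");
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "123 HELLO".getBytes(StandardCharsets.UTF_8);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(utf8));
        Charset cs = sniffCharset(in);
        System.out.println(cs);
        // The pushed-back first byte is still there:
        System.out.println(new String(in.readAllBytes(), cs));
    }
}
```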
But what if you get your data from an S3 bucket or similar? In this case I came to
the conclusion: get the data as an InputStream, or get and keep it as a byte[]
array until you convert it. In my case the copybooks always start with a number, so
I checked whether the first byte decodes to a digit:
public static String decodeContent(byte[] data) {
    if (data == null) {
        return null;
    }
    if (data.length == 0) {
        return "";
    }
    // we expect a digit as first character
    String probe = new String(new byte[]{data[0]}, StandardCharsets.UTF_8);
    if (Character.isDigit(probe.charAt(0))) {
        return new String(data, StandardCharsets.UTF_8);
    }
    // so no digit, is it perhaps EBCDIC 037?
    probe = new String(new byte[]{data[0]}, Charset.forName("cp037"));
    if (Character.isDigit(probe.charAt(0))) {
        return new String(data, Charset.forName("cp037"));
    }
    throw new RuntimeException("something went wrong");
}
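For completeness, a quick check of the heuristic against both encodings: the digit »1« is 0x31 in UTF-8 but 0xF1 in CP037, and 0xF1 is not valid UTF-8, so the first probe falls through to the EBCDIC branch. The wrapper class below is hypothetical; decodeContent is copied from above so the snippet compiles on its own.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // copy of decodeContent from the article, for a standalone demo
    public static String decodeContent(byte[] data) {
        if (data == null) {
            return null;
        }
        if (data.length == 0) {
            return "";
        }
        // we expect a digit as first character
        String probe = new String(new byte[]{data[0]}, StandardCharsets.UTF_8);
        if (Character.isDigit(probe.charAt(0))) {
            return new String(data, StandardCharsets.UTF_8);
        }
        // so no digit, is it perhaps EBCDIC 037?
        probe = new String(new byte[]{data[0]}, Charset.forName("cp037"));
        if (Character.isDigit(probe.charAt(0))) {
            return new String(data, Charset.forName("cp037"));
        }
        throw new RuntimeException("something went wrong");
    }

    public static void main(String[] args) {
        // the same text, encoded two ways
        byte[] utf8 = "1A".getBytes(StandardCharsets.UTF_8);     // 0x31 0x41
        byte[] ebcdic = "1A".getBytes(Charset.forName("cp037")); // 0xF1 0xC1
        System.out.println(decodeContent(utf8));   // both decode back to "1A"
        System.out.println(decodeContent(ebcdic));
    }
}
```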