Character encoding of files from S3 buckets.
Don’t ask why, but recently I had to read obscure files in what I believe is COBOL
copybook format. These files were either encoded in EBCDIC 037 (aka »CP037«) or in
ASCII/UTF-8, depending on the system that wrote them. If you have read
»The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)«,
you might already get the idea that simply reading the data as a UTF-8 String,
checking which encoding it is, and then re-encoding it in a different format can
result in surprises.
Usually this is a non-issue, because you can just read a file from disk as an
InputStream, wrap it in a PushbackInputStream, and peek at the first bytes to
determine the encoding.
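A minimal sketch of that peek-and-unread approach (the class name, the digit heuristic, and the fallback to CP037 are my assumptions, mirroring the copybook case described below):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingPeek {

    // Read the first byte, then push it back so callers still see the full stream.
    // ASCII/UTF-8 digits are 0x30..0x39; EBCDIC (CP037) digits are 0xF0..0xF9.
    static Charset sniffCharset(PushbackInputStream in) throws IOException {
        int first = in.read();
        if (first == -1) {
            throw new IOException("empty stream");
        }
        in.unread(first); // undo the read: the stream is intact again
        if (Character.isDigit((char) first)) {
            return StandardCharsets.UTF_8;
        }
        return Charset.forName("cp037");
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "123 HELLO".getBytes(StandardCharsets.UTF_8);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(utf8));
        Charset cs = sniffCharset(in);
        System.out.println(cs);
        // The pushed-back first byte is still there:
        System.out.println(new String(in.readAllBytes(), cs));
    }
}
```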
But what if you get your data from an S3 bucket or similar? In this case I came to
the conclusion: get the data as an InputStream, or get and keep it as a byte[]
array until you convert it. In my case the copybooks always start with a number, so
I checked whether the first byte decodes to a digit:
public static String decodeContent(byte[] data) {
    if (data == null) {
        return null;
    }
    if (data.length == 0) {
        return "";
    }
    // we expect a digit as first character
    String probe = new String(new byte[]{data[0]}, StandardCharsets.UTF_8);
    if (Character.isDigit(probe.charAt(0))) {
        return new String(data, StandardCharsets.UTF_8);
    }
    // so no digit, is it perhaps EBCDIC 037?
    probe = new String(new byte[]{data[0]}, Charset.forName("cp037"));
    if (Character.isDigit(probe.charAt(0))) {
        return new String(data, Charset.forName("cp037"));
    }
    throw new RuntimeException("something went wrong");
}
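For completeness, a quick check of the heuristic against both encodings: the digit »1« is 0x31 in UTF-8 but 0xF1 in CP037, and 0xF1 is not valid UTF-8, so the first probe falls through to the EBCDIC branch. The wrapper class below is hypothetical; decodeContent is copied from above so the snippet compiles on its own.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // copy of decodeContent from the article, for a standalone demo
    public static String decodeContent(byte[] data) {
        if (data == null) {
            return null;
        }
        if (data.length == 0) {
            return "";
        }
        // we expect a digit as first character
        String probe = new String(new byte[]{data[0]}, StandardCharsets.UTF_8);
        if (Character.isDigit(probe.charAt(0))) {
            return new String(data, StandardCharsets.UTF_8);
        }
        // so no digit, is it perhaps EBCDIC 037?
        probe = new String(new byte[]{data[0]}, Charset.forName("cp037"));
        if (Character.isDigit(probe.charAt(0))) {
            return new String(data, Charset.forName("cp037"));
        }
        throw new RuntimeException("something went wrong");
    }

    public static void main(String[] args) {
        // the same text, encoded two ways
        byte[] utf8 = "1A".getBytes(StandardCharsets.UTF_8);     // 0x31 0x41
        byte[] ebcdic = "1A".getBytes(Charset.forName("cp037")); // 0xF1 0xC1
        System.out.println(decodeContent(utf8));   // both decode back to "1A"
        System.out.println(decodeContent(ebcdic));
    }
}
```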