Python IO and Serialisation Snippets


numpy, pathlib and combined json encoders

Being able to serialize arbitrary objects to JSON is very handy, not only for exchanging data but also for logging and debugging. However, for unknown object types you’ll get a

TypeError: Object of type ... is not JSON serializable

when trying to serialize them. This error indicates that the json.JSONEncoder needs some help to figure out what to do. Luckily, you can supply your own handler for these otherwise unsupported types.

I’ve seen solutions on StackOverflow that either create a new encoder by inheriting from json.JSONEncoder or override its default method.

Both solutions work fine, but by accident I found a more Pythonic approach in the official code (no clue why the inheritance rather than the functional approach made it into the official docs). You can supply a custom encoding function (aka default) to the dump and dumps functions of the json module.

I picked numpy and Path from pathlib as examples.

Encoding numpy arrays

To deal with numpy arrays, convert them to lists before handing them to the JSONEncoder.

import numpy as np
import json


def encode_np_array(obj):
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")

Supply the function as the default argument to e.g. json.dumps:

data = {"some_key": np.array([1, 2, 3])}
encoded = json.dumps(data, default=encode_np_array)
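
Note that this is a one-way conversion: when you load the JSON again, the value comes back as a plain Python list, and you have to rebuild the array yourself if you need one. A minimal round-trip sketch:

decoded = json.loads(encoded)
print(decoded)  # {'some_key': [1, 2, 3]}
restored = np.array(decoded["some_key"])  # reconstruct the ndarray if needed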

Encoding Path objects

Analogous to the numpy example, you can also serialize Path objects:

from pathlib import Path


def encode_path(obj):
    if isinstance(obj, Path):
        return str(obj)
    raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")

Here too, supply the function as the default argument to e.g. json.dumps:

data = {"some_key": Path("/usr/bin/python3")}
encoded = json.dumps(data, default=encode_path)

Combine encoders

It gets a bit trickier when you want to combine several encoders. I chose to create a higher-order function (in my case: a function that returns a function). combine_encoders takes any number of encoder functions and returns a new function that tries them one after the other in a for loop.

def combine_encoders(*encs):
    def combined(obj):
        for enc in encs:
            try:
                return enc(obj)
            except TypeError:
                pass
        raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")

    return combined

Using this construct is straightforward:

data = {"path": Path("/usr/bin/python3"), "numpy": np.array([1, 2, 3])}
combined_encoders = combine_encoders(encode_np_array, encode_path)
encoded = json.dumps(data, default=combined_encoders)
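
Any type that none of the supplied encoders can handle still fails loudly, since the combined function re-raises the TypeError at the end. For illustration (a set is just an arbitrary non-serializable example):

json.dumps({"tags": {"a", "b"}}, default=combined_encoders)
# TypeError: Object of type set is not JSON serializable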

Parse YAML dict, keeping duplicates

from collections import Counter
import yaml


def parse_preserving_duplicates(src):
    class PreserveDuplicatesLoader(yaml.loader.Loader):
        pass

    def map_constructor(loader, node, deep=False):
        # Build all keys and values from the mapping node's (key, value) node pairs.
        keys = [loader.construct_object(key_node, deep=deep) for key_node, _ in node.value]
        vals = [loader.construct_object(value_node, deep=deep) for _, value_node in node.value]
        key_count = Counter(keys)
        data = {}
        for key, val in zip(keys, vals):
            if key_count[key] > 1:
                # Duplicated key: collect all of its values in a list.
                if key not in data:
                    data[key] = []
                data[key].append(val)
            else:
                data[key] = val
        return data

    PreserveDuplicatesLoader.add_constructor(
        yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, map_constructor
    )
    return yaml.load(src, PreserveDuplicatesLoader)
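
PyYAML's default loaders keep only the last value of a duplicated key, so the custom constructor above collects the values of duplicated keys into lists instead. A minimal usage sketch (the keys and values are made up for illustration):

doc = """
host: alpha
host: beta
port: 8080
"""

print(parse_preserving_duplicates(doc))
# {'host': ['alpha', 'beta'], 'port': 8080}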

Dedent text

from textwrap import dedent
import yaml

data = """\
       - Hesperiidae
       - Papilionidae
       - Apatelodidae
       - Epiplemidae
       """

yaml.safe_load(dedent(data))
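
With the common leading whitespace stripped by dedent, safe_load sees a plain block sequence and returns the four names as a Python list:

['Hesperiidae', 'Papilionidae', 'Apatelodidae', 'Epiplemidae']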

See also