The main bottleneck was the PyUnicode_READ() macro.
This change brings decoding almost back to pre-PEP 393 performance, when decoding needed only one pass.
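For readers unfamiliar with the macro: PyUnicode_READ(kind, data, index) has to look at the string's kind on every read, so a loop built on it pays a dispatch per character. The sketch below is a standalone illustration, not the actual patch. The names sketch_kind, sketch_read, count_non_ascii_generic and count_non_ascii_specialized are hypothetical stand-ins that only mimic the shape of the macro, and hoisting the dispatch out of the loop is one common remedy, assumed here rather than taken from the change itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND
   and PyUnicode_4BYTE_KIND; each value is the character width in bytes. */
enum sketch_kind { SKETCH_1BYTE = 1, SKETCH_2BYTE = 2, SKETCH_4BYTE = 4 };

/* Same shape as PyUnicode_READ(kind, data, index): the kind is tested
   on every single read. */
static inline uint32_t
sketch_read(enum sketch_kind kind, const void *data, size_t i)
{
    assert(kind == SKETCH_1BYTE || kind == SKETCH_2BYTE ||
           kind == SKETCH_4BYTE);
    if (kind == SKETCH_1BYTE)
        return ((const uint8_t *)data)[i];
    if (kind == SKETCH_2BYTE)
        return ((const uint16_t *)data)[i];
    return ((const uint32_t *)data)[i];
}

/* Generic loop: one kind dispatch per character. */
size_t
count_non_ascii_generic(enum sketch_kind kind, const void *data, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += sketch_read(kind, data, i) > 0x7F;
    return n;
}

/* Specialized loops: one kind dispatch per string, then a typed loop. */
size_t
count_non_ascii_specialized(enum sketch_kind kind, const void *data,
                            size_t len)
{
    size_t n = 0;
    switch (kind) {
    case SKETCH_1BYTE: {
        const uint8_t *p = data;
        for (size_t i = 0; i < len; i++)
            n += p[i] > 0x7F;
        break;
    }
    case SKETCH_2BYTE: {
        const uint16_t *p = data;
        for (size_t i = 0; i < len; i++)
            n += p[i] > 0x7F;
        break;
    }
    case SKETCH_4BYTE: {
        const uint32_t *p = data;
        for (size_t i = 0; i < len; i++)
            n += p[i] > 0x7F;
        break;
    }
    }
    return n;
}
```

Both functions return the same count; only the placement of the kind test differs. An optimizing compiler may hoist the test out of the generic loop by itself, so any real gain has to be measured rather than assumed.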