Investigating Total Store Ordering on the ARM M1

The new Apple Silicon ARM processors have an interesting hardware feature for their x86 emulation: Total Store Ordering. ARM processors normally give weaker guarantees about the order in which one core's stores (writes) to memory become visible to other cores, and provide extra instructions (barriers) to enforce an order where it is needed. Intel x86, on the other hand, guarantees a total order on all store instructions. This makes emulating x86 on ARM harder and slower, because every store that could possibly depend on total ordering has to be synchronized explicitly. Apple therefore integrated total store ordering directly into its ARM processors, which greatly simplifies and accelerates x86 emulation (Rosetta 2).
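
The difference between the two models can be illustrated with a classic message-passing test. The following Rust code is a minimal sketch, not taken from the project itself: the function names message_passing and count_stale are hypothetical, and the value 42 is arbitrary. A writer thread stores data and then sets a flag; a reader thread waits for the flag and then reads the data. With release/acquire ordering on the flag, the reader is guaranteed to see the data. With relaxed ordering, weakly ordered hardware (and the compiler) may reorder the accesses, so the reader could in principle see stale data.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// One message-passing round (hypothetical helper for illustration).
/// The writer stores `data`, then sets `flag` with `store_flag`;
/// the reader spins on `flag` with `load_flag`, then loads `data`.
/// Returns the value of `data` that the reader observed.
fn message_passing(store_flag: Ordering, load_flag: Ordering) -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let flag = Arc::new(AtomicBool::new(false));

    let (d, f) = (Arc::clone(&data), Arc::clone(&flag));
    let writer = thread::spawn(move || {
        d.store(42, Ordering::Relaxed); // payload
        f.store(true, store_flag);      // "payload is ready" signal
    });

    let reader = thread::spawn(move || {
        while !flag.load(load_flag) {
            std::hint::spin_loop();
        }
        data.load(Ordering::Relaxed)
    });

    writer.join().unwrap();
    reader.join().unwrap()
}

/// Counts the rounds in which the reader saw a stale payload.
fn count_stale(rounds: usize, store_flag: Ordering, load_flag: Ordering) -> usize {
    (0..rounds)
        .filter(|_| message_passing(store_flag, load_flag) != 42)
        .count()
}

fn main() {
    // Release/acquire: the language guarantees that no round is stale.
    let synced = count_stale(1000, Ordering::Release, Ordering::Acquire);
    println!("stale reads with release/acquire: {}", synced);

    // Relaxed: stale reads are permitted, though not guaranteed to show up.
    let relaxed = count_stale(1000, Ordering::Relaxed, Ordering::Relaxed);
    println!("stale reads with relaxed: {}", relaxed);
}
```

On x86, the hardware keeps the two stores in order even without extra instructions, whereas on ARM the release/acquire variant needs dedicated ordering instructions. A stale read in the relaxed variant is rare in practice, and spawning fresh threads for every round makes it rarer still, so this sketch illustrates the guarantee rather than serving as a reliable detector.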

This topic focuses on the performance impact of Total Store Ordering (TSO). We ask how expensive these ordering guarantees really are, and whether ARM made the right choice in giving weaker guarantees for normal stores, which enables more optimizations in the majority of cases. To answer this, we need good single- and multi-core benchmarks, executed with TSO enabled and disabled.