Back in June, y-Cruncher developer Alexander Yee criticized Intel's decision to remove AVX-512 (Advanced Vector Extensions 512) from its 12th Gen (Alder Lake) and newer CPUs, calling it "a huge step back". Yee found that the Zen 4-based Ryzen 7000 series, which added AVX-512 support after Intel did, was now performing up to 31% faster with the latest version of the benchmark. While AMD does not yet support all AVX-512 instructions, it will very likely do so in the future.
Intel appears to have been well aware of this, and today the company debuted AVX10, a new ISA that brings more robust AVX-512 support to future Intel CPUs. The major highlight of AVX10 is that it extends this support to Efficiency cores (E-cores) as well; until now, it was limited to Performance cores (P-cores).
Intel explains:
The converged version of the Intel AVX10 vector ISA will include Intel AVX-512 vector instructions with an AVX512VL feature flag, a maximum vector register length of 256 bits, as well as eight 32-bit mask registers and new versions of 256-bit instructions supporting embedded rounding. This converged version will be supported on both P-cores and E-cores.
While the converged version is limited to a maximum 256-bit vector length, Intel AVX10 itself is not limited to 256 bits, and optional 512-bit vector use is possible on supporting P-cores.
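To make the split between the converged 256-bit baseline and the optional 512-bit mode more concrete, here is a minimal Python sketch of how software might interpret the AVX10 enumeration data a CPU reports. The bit positions used (CPUID leaf 0x24, register EBX, bits 0 to 7 for the version, bit 17 for 256-bit and bit 18 for 512-bit vectors) are assumptions based on Intel's AVX10 documentation and should be checked against the current specification. The function name, the class name, and the sample register values are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass

# Assumed layout of CPUID leaf 0x24, sub-leaf 0, register EBX.
# Verify these positions against Intel's current AVX10 specification.
AVX10_VERSION_MASK = 0xFF    # EBX[7:0]: AVX10 version number
AVX10_VEC256_BIT = 1 << 17   # EBX[17]: 256-bit vectors supported
AVX10_VEC512_BIT = 1 << 18   # EBX[18]: 512-bit vectors supported


@dataclass(frozen=True)
class Avx10Info:
    version: int
    max_vector_bits: int


def decode_avx10_ebx(ebx: int) -> Avx10Info:
    """Decode a raw EBX value into an AVX10 version and maximum vector length."""
    version = ebx & AVX10_VERSION_MASK
    if ebx & AVX10_VEC512_BIT:
        max_bits = 512
    elif ebx & AVX10_VEC256_BIT:
        max_bits = 256
    else:
        # Assumption for this sketch: fall back to 128-bit when neither flag is set.
        max_bits = 128
    return Avx10Info(version, max_bits)


# Hypothetical register values, not read from real hardware.
e_core = decode_avx10_ebx(0x00030001)  # version 1, 256-bit flag set
p_core = decode_avx10_ebx(0x00070001)  # version 1, 256-bit and 512-bit flags set

print(e_core)
print(p_core)
```

In this model, code that only needs the converged feature set can run wherever `max_vector_bits` is at least 256, while a 512-bit code path would be selected only on cores that report the larger length.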
Intel also states (downloadable PDF) that this approach will reduce the burden on developers, and that multi-threaded performance should increase because E-cores will now be able to contribute to such AVX workloads:
In addition to the previously stated usability benefits, several additional performance-based benefits of Intel AVX10 include:
- Intel AVX2-compiled applications, re-compiled to Intel AVX10, should realize performance gains without the need for additional software tuning.
- Intel AVX2 applications sensitive to vector register pressure will gain the most performance due to the 16 additional vector registers and new instructions.
- Highly-threaded vectorizable applications are likely to achieve higher aggregate throughput when running on E-core-based Intel Xeon processors or on Intel® products with performance hybrid architecture.
Existing Intel AVX-512 applications, many of them already using maximum 256-bit vectors, should see the same performance when compiled to Intel AVX10/256 at iso-vector length. For applications that can leverage greater vector lengths, Intel AVX10/512 will be supported on Intel P-cores, continuing to deliver the best-in-class performance for AI, scientific, and other high-performance codes.
Intel also highlighted the main features of AVX10 versions 1 and 2 (AVX10.1 and 10.2) in a chart:
Aside from AVX10, Intel has also debuted APX, or Advanced Performance Extensions, which doubles the number of general-purpose registers (GPRs) from 16 to 32 (R0 to R31, up from R0 to R15). Intel states that APX-compiled code contains 10% fewer loads and more than 20% fewer stores than the same code compiled for an Intel 64 baseline. Technical jargon aside, this basically means it's better:
Intel® APX doubles the number of general-purpose registers (GPRs) from 16 to 32. This allows the compiler to keep more values in registers; as a result, APX-compiled code contains 10% fewer loads and more than 20% fewer stores than the same code compiled for an Intel® 64 baseline. Register accesses are not only faster, but they also consume significantly less dynamic power than complex load and store operations.
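The link between register count and memory traffic can be illustrated with a toy model. The Python sketch below is not how a real compiler allocates registers and does not reproduce Intel's 10% and 20% figures; it simply assumes a register file with least-recently-used eviction, counts a load whenever a value has to be brought in from memory, and counts a store whenever a value has to be spilled to make room. The function name and the workload of 24 live values are hypothetical.

```python
from collections import OrderedDict


def count_spill_traffic(accesses, num_registers):
    """Return (loads, stores) for a toy register file with LRU eviction."""
    registers = OrderedDict()  # value name -> None, ordered oldest to newest
    loads = 0
    stores = 0
    for name in accesses:
        if name in registers:
            registers.move_to_end(name)    # already in a register: no memory traffic
            continue
        if len(registers) == num_registers:
            registers.popitem(last=False)  # evict the least recently used value
            stores += 1                    # spilling it costs a store
        registers[name] = None
        loads += 1                         # bringing the value in costs a load
    return loads, stores


# Hypothetical workload: a loop body that touches 24 live values, run 10 times.
workload = [f"v{i}" for i in range(24)] * 10

baseline = count_spill_traffic(workload, 16)  # 16 GPRs, as in Intel 64
apx = count_spill_traffic(workload, 32)       # 32 GPRs, as with APX

print("16 registers (loads, stores):", baseline)
print("32 registers (loads, stores):", apx)
```

The workload is deliberately chosen so that the live values fit in 32 registers but not in 16, which exaggerates the effect; real code sees a much smaller, though still measurable, reduction in loads and stores.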
While we are on the topic of Intel64, the company recently proposed a 64-bit-only x86S architecture and it is currently looking for community feedback.