perf: Speed up s8q NHWC max pooling#1292
Conversation
gunes-arm
left a comment
There was a problem hiding this comment.
Nice work Pablo. I only have doc-related suggestions.
(1)
I suggest we remove the following section from the commit description as it's highly input configuration dependent:
For the tested diff-qinfo cases this improves steady-state latency by:
- A64: 4.7% to 5.3%
- SVE2: 0.8% to 3.9%
- SME2: 3.9% to 8.6%
(2) MR title and commit title should be the same. I'd say the following is fine; note the capital start:
perf: Speed up s8q NHWC max pooling
Refactor the A64, SVE and SME QASYMM8_SIGNED differing-qinfo NHWC MAX pooling kernels to reduce the cost of the requantized path.
The old code did:
- add output offset
- explicit clamp to [-128, 127] with smax / smin
- pack down with several uzp1 shuffles
- The new code does:
- add output offset
- saturating narrow directly with sqxtn / sqxtn2 from s32 -> s16 -> s8
Why this helps:
- sqxtn already performs the signed saturation, so the explicit clamp is redundant.
- It also lets us remove a chunk of shuffle/packing work.
- So the requantized epilogue gets shorter: fewer instructions, less register traffic, less packing overhead.
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Change-Id: I47a378aeda61de86f2d6784393ccb4a08984706c
|
Looks good to me. |
9272e82 to
f78b12c
Compare
Thanks. All addressed in latest patchset. |
|
Can you also remove from the description as well. |
Done. |
Refactor the A64, SVE and SME QASYMM8_SIGNED differing-qinfo NHWC MAX pooling kernels to reduce the cost of the requantized path.
The old code did:
The new code does:
Why this helps:
Change-Id: I47a378aeda61de86f2d6784393ccb4a08984706c