Skip to content

adding F-contig layout support + other cleanups#108

Open
SwayamInSync wants to merge 3 commits into
numpy:mainfrom
SwayamInSync:fix-matmul-layout-89
Open

adding F-contig layout support + other cleanups#108
SwayamInSync wants to merge 3 commits into
numpy:mainfrom
SwayamInSync:fix-matmul-layout-89

Conversation

@SwayamInSync
Copy link
Copy Markdown
Member

@SwayamInSync SwayamInSync commented Jun 5, 2026

closes #89

This PR improves the BLAS integration very parallel to the NumPy's own logic.

  • Mirrors NumPy's own @TYPE@_matmul dispatch. NumPy hands the inner loop raw strides and expects the loop itself to dispatch, so we do the same instead of assuming row-major.
  • Ports NumPy's is_blasable2d check: an operand is BLAS-able iff its fast axis is contiguous and the slow axis is a valid leading dimension; each operand is tested in both orientations (C- and F-blasable).
  • BLAS-able operands are passed to QBLAS zero-copy with the matching transpose flag ('N'/'T') and lda derived from strides, so F-contiguous/transposed inputs need no copy (QBLAS already supports transa/transb correctly).
  • Non-BLAS-able operands (fully strided, negative strides) and non-row-major output are gathered into contiguous temps and scattered back, exactly NumPy's copy fallback.
  • Oversized (dim > INT_MAX) or empty cores fall back to the naive strided loop, matching NumPy's too_big_for_blas guard.

I think this is a good base to integrate dot operations efficiently in follow-up PRs

@SwayamInSync
Copy link
Copy Markdown
Member Author

Ahh the fix to run older-cpu workflows is done in #104 , once that will merge we can update here.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes incorrect matmul results for non-row-major (notably F-contiguous and transposed-view) QuadPrecision inputs by making the QBLAS dispatch stride-aware (mirroring NumPy’s layout/BLAS-ability checks) and expanding test coverage to exercise the full layout matrix and copy-fallback paths.

Changes:

  • Update the matmul strided loop to detect BLAS-able 2D operands in either orientation and pass the appropriate transa/transb + lda/ldb/ldc, copying only when required.
  • Remove the prior xfails for Fortran-order/transposed batched cases and add new tests covering layout combinations, output layout, empty-core behavior, and copy-fallback scenarios.
  • Adjust QBLAS interface edge handling for empty GEMV/GEMM calls.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
tests/test_dot.py Removes xfails for issue #89 and adds comprehensive layout/copy-path tests for matmul.
src/csrc/umath/matmul.cpp Implements stride-derived BLAS dispatch (incl. transpose flags) plus gather/scatter copy fallback and size/empty guards.
src/csrc/quadblas_interface.cpp Tweaks empty-dimension behavior for qblas_gemv and qblas_gemm.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/csrc/quadblas_interface.cpp
Comment thread src/csrc/umath/matmul.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] matmul produces incorrect results for F-contiguous / non-row-major inputs

2 participants