GH-49907: [Python] Implement FixedShapeTensorType.to_pandas_dtype#50145

Open
aboderinsamuel wants to merge 2 commits into apache:main from aboderinsamuel:gh-49907-fixed-shape-tensor-to-pandas-dtype

@aboderinsamuel aboderinsamuel commented Jun 10, 2026

Rationale for this change

FixedShapeTensorType.to_pandas_dtype() inherited the base DataType
implementation, which raises NotImplementedError for every extension type.
This contradicted the documented public API, and it also blocked
Table.to_pandas(split_blocks=True) for fixed-shape-tensor columns: with no
pandas dtype available, the split-blocks path emitted an extension (py_array)
block with no matching dtype and crashed with KeyError in
_reconstruct_block.

(See the discussion in #49907 and the related #33134 on how extension arrays
should convert to pandas in general.)

What changes are included in this PR?

Implement to_pandas_dtype() on FixedShapeTensorType to return
pandas.ArrowDtype(self). ArrowDtype is a pandas ExtensionDtype that
implements __from_arrow__, which is exactly what the pandas_compat
extension-block path requires to build the column — so no conversion code
needed to change.

On pandas < 1.5 (no ArrowDtype), the method falls back to raising
NotImplementedError, leaving behavior on older pandas unchanged.

Are these changes tested?

Yes. test_tensor_type_to_pandas in
python/pyarrow/tests/test_extension_type.py covers to_pandas_dtype(),
Array.to_pandas(), and Table.to_pandas() with and without split_blocks. The
test is parametrized over value types (int8, float32, float64) and shapes,
including a permutation. It is gated to pandas >= 2.1.0, matching the
existing pd.ArrowDtype extension-block tests (GH-35821).

Are there any user-facing changes?

Yes:

  • FixedShapeTensorType.to_pandas_dtype() now returns pandas.ArrowDtype(...)
    instead of raising NotImplementedError (pandas >= 1.5).
  • Consequently, Array.to_pandas() / Table.to_pandas() on a
    fixed-shape-tensor column now yield an ArrowDtype-backed column instead of
    an object-dtype column of flattened ndarrays (pandas >= 2.1). Code
    relying on the previous object dtype will observe this change.

Copilot AI review requested due to automatic review settings June 10, 2026 08:07
@aboderinsamuel aboderinsamuel requested a review from rok as a code owner June 10, 2026 08:07

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds pandas dtype interoperability for PyArrow’s FixedShapeTensorType so tensor columns can round-trip through to_pandas/Table.to_pandas(split_blocks=True) using pandas’ ArrowDtype.

Changes:

  • Implement FixedShapeTensorType.to_pandas_dtype() returning pandas.ArrowDtype(self) when available.
  • Add pandas-marked tests covering to_pandas_dtype(), Array.to_pandas(), and Table.to_pandas() with/without split_blocks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File and description:

  • python/pyarrow/types.pxi: Adds to_pandas_dtype() for FixedShapeTensorType using pandas ArrowDtype.
  • python/pyarrow/tests/test_extension_type.py: Adds regression test ensuring tensors map to pd.ArrowDtype and work with split_blocks.

Comment thread python/pyarrow/types.pxi
Comment thread python/pyarrow/types.pxi Outdated
@github-actions

⚠️ GitHub issue #49907 has been automatically assigned in GitHub to PR creator.

@aboderinsamuel

aboderinsamuel commented Jun 10, 2026

Author

@AlenkaF, in #49907 you noted that to_pandas_dtype may not belong on the canonical extension types, and linked the extension-array to_pandas fallback discussion in #33134. This PR takes the concrete route of returning pd.ArrowDtype(self), since that's what lets Table.to_pandas(split_blocks=True) build the extension block without further plumbing. But I'm happy to align with whatever direction you prefer, including pursuing the #33134 fallback instead, if you'd rather not expand the to_pandas_dtype surface on extension types.

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Jun 10, 2026
@AlenkaF

AlenkaF commented Jun 11, 2026

Member

Hm, this might actually be a nice solution.

Three questions before we continue:

  1. What about other canonical extension types, should they have the same behavior?
  2. If we change the behavior of the to_pandas_dtype in the DataType subclasses we are also changing the behavior for to_pandas. What would be the implications?
  3. Would it be worth cleaning up the docstrings for BaseExtensionType and its subclasses? They currently inherit from DataType without any mention of the extension type behavior.

These are not meant as suggestions for this PR but to trigger a more general discussion right from the start. If we proceed this way, we can make follow-up PRs at any time.

@aboderinsamuel

aboderinsamuel commented Jun 11, 2026

Author

Thanks @AlenkaF! Happy to dig in, agreed these are better hashed out up front. My take, keeping this PR scoped to FixedShapeTensor and treating the rest as follow-ups:

  1. Other canonical extension types
    I think the same idea fits them, but the right pandas dtype isn't always ArrowDtype. pd.ArrowDtype(self) is a good generic default (faithful, round-trips, no extra plumbing), yet some types have a more natural mapping worth weighing case by case, e.g. bool8 → a boolean dtype, while json/uuid/opaque could stay ArrowDtype. So I'd lean toward implementing it per canonical type (defaulting to ArrowDtype, overriding where a native dtype is clearly better) rather than one blanket implementation on BaseExtensionType. That keeps each behavior change small and reviewable, and avoids silently changing to_pandas for user-defined extension types.

  2. Implications for to_pandas
    Returning a dtype with __from_arrow__ does change to_pandas/Table.to_pandas for that type: instead of the current fallback (storage → object/numpy; e.g. tensors become an object column of flattened ndarrays), you get a faithful extension-typed column. What I see:

    • ✅ round-trips: to_pandas → from_pandas now preserves the extension type (object dtype loses it).
    • ✅ split_blocks=True works instead of raising KeyError (the original bug).
    • ⚠️ user-facing change: anyone relying on the old object/numpy output now gets ArrowDtype, needs a changelog note.
    • types_mapper still takes precedence (it's checked before to_pandas_dtype in both _get_extension_dtypes and _array_like_to_pandas), so explicit user mappings are unaffected.
    • Only engages on pandas ≥ 2.1 (reliable ArrowDtype extension blocks, GH-35821); older pandas keeps today's fallback. Per-type rollout + the version gate keep the blast radius controlled.

  3. Docstrings
    Agreed, worth doing. BaseExtensionType and its subclasses currently inherit to_pandas_dtype (and friends) from DataType with no mention of extension behavior. I'd be glad to take a follow-up documenting the to_pandas_dtype/to_pandas behavior on BaseExtensionType and per-type specifics.
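
To make the precedence described in point 2 concrete, here is a library-free sketch of the lookup order. Everything in it is hypothetical: `resolve_pandas_dtype`, `FakeTensorType` and `FakeLegacyType` are stand-ins, and treating NotImplementedError as "no pandas dtype" is an assumption about how the conversion code falls back, not a copy of it:

```python
def resolve_pandas_dtype(arrow_type, types_mapper=None):
    """Hypothetical helper: choose the pandas dtype for one Arrow type."""
    # 1. An explicit user mapping wins whenever it returns something.
    if types_mapper is not None:
        mapped = types_mapper(arrow_type)
        if mapped is not None:
            return mapped
    # 2. Otherwise ask the type itself.
    try:
        return arrow_type.to_pandas_dtype()
    except NotImplementedError:
        # 3. No dtype available: the caller converts the storage instead.
        return None


class FakeTensorType:
    """Stand-in for a type that implements to_pandas_dtype()."""

    def to_pandas_dtype(self):
        return "ArrowDtype(tensor)"


class FakeLegacyType:
    """Stand-in for a type that still raises NotImplementedError."""

    def to_pandas_dtype(self):
        raise NotImplementedError


tensor = FakeTensorType()
from_type = resolve_pandas_dtype(tensor)
from_mapper = resolve_pandas_dtype(tensor, types_mapper=lambda t: "custom")
fallback = resolve_pandas_dtype(FakeLegacyType())
```

Under these assumptions, an explicit types_mapper result overrides the type's own dtype, and a type without a dtype keeps today's storage-based fallback.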

If it helps, I can open a tracking issue capturing this broader direction (canonical-type mappings + docstring cleanup) so it isn't buried here, and keep this PR focused on FixedShapeTensor. What do you think?

3 participants