GH-26685: [Python][C++] Trim buffers when pickling a sliced array by kirilklein · Pull Request #50159 · apache/arrow

kirilklein · 2026-06-11T15:01:08Z

Rationale

Pickling a sliced array currently serializes the array's entire parent
buffers rather than just the sliced range, because Array.__reduce__ wraps the
raw ArrayData buffers as-is. For a one-element slice of a large array this
serializes the whole parent (megabytes for a few bytes of data), which is a
long-standing pain point for multiprocessing / Dask / Ray (issue open since
2020). The IPC writer already truncates sliced buffers; pickling did not.

What this does

Adds arrow::internal::TrimArrayDataBuffers(ArrayData, MemoryPool*)
(cpp/src/arrow/array/util.{h,cc}): when the array has a non-zero offset
(i.e. it is a slice sharing a parent's buffers) it compacts the array to the
referenced range via Concatenate({MakeArray(data)}), which already handles
every nested / variable-length / dictionary type correctly; otherwise it
returns the input unchanged. Array.__reduce__ calls it before reducing, so a
sliced array pickles only its referenced bytes. ChunkedArray / RecordBatch /
Table inherit the fix since they reduce through their arrays.

Pickle protocol-5 out-of-band buffers keep working (this is why the prior
IPC-based attempt, #37683, was rejected): we still reduce real Buffer
objects, just trimmed ones. Crucially, unsliced arrays are returned
untouched, so their protocol-5 pickling stays zero-copy
(test_array_pickle_protocol5 keeps passing). The guard is offset != 0
rather than a buffer-size comparison precisely because allocator padding makes
an unsliced array's referenced size differ from its total buffer size — a
size-based guard would needlessly copy (and break zero-copy for) unsliced
arrays.

Known limitation: a zero-offset head slice (arr.slice(0, k) of a large
array) is not trimmed, since it cannot be distinguished from an unsliced array
by offset alone without per-type buffer-size logic. This is no worse than the
status quo (such slices already pickle the full buffers); the common case of
slices with a non-zero offset is fixed. A follow-up could trim these too via a
proper per-type buffer-truncation utility.

Design note for reviewers

Earlier discussion favored refactoring the IPC writer's per-type truncation
into a shared zero-copy visitor (Option 1). That refactor stalled for years on
nested / dictionary handling. This PR instead reuses Concatenate, which
copies only the (small) compacted slice — the same bytes pickle would serialize
anyway — for a much smaller, lower-risk change. Happy to evolve this toward a
zero-copy SliceBuffer-based utility if preferred.

Benchmarks

Local source build, arrays of 2,000,000 elements, pickling a 10-element slice:

type	slice pickle before	after	shrink
int64	16,000,141	200	80,001×
string	32,889,061	247	133,154×
large_string	40,889,071	297	137,674×
list<int64>	40,000,247	405	98,766×
struct	38,889,227	419	92,814×
dictionary	8,001,235	1,236	6,473×
bool	250,140	121	2,067×

Containers inherit the fix: a sliced ChunkedArray / RecordBatch / Table built on
a 2M-row array pickles to ~250–340 bytes (was tens of MB).

Tests

Adds regression tests in python/pyarrow/tests/test_array.py covering slice
pickle size + round-trip across primitive / bool / string / list types,
protocol-5 out-of-band buffers, and the unsliced-array regression case. The
existing test_array_pickle_protocol5 (zero-copy guarantee) continues to pass.

🤖 Generated with Claude Code

GitHub Issue: [Python] Pickling a sliced array serializes all the buffers #26685

Pickling a sliced array previously serialized the array's entire parent buffers instead of just the referenced slice. Add arrow::internal::TrimArrayDataBuffers, which compacts a sliced array (offset != 0) to its referenced range via Concatenate, and call it from Array.__reduce__ so pickling only serializes referenced bytes. Unsliced arrays are returned untouched, preserving zero-copy / protocol-5 out-of-band pickling. ChunkedArray/RecordBatch/Table inherit the fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pitrou · 2026-06-11T16:08:55Z

@jorisvandenbossche Would you like to take a look?

kirilklein requested review from AlenkaF, pitrou, raulcd and rok as code owners June 11, 2026 15:01

github-actions Bot added Component: C++ Component: Python awaiting review Awaiting review labels Jun 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-26685: [Python][C++] Trim buffers when pickling a sliced array#50159

GH-26685: [Python][C++] Trim buffers when pickling a sliced array#50159
kirilklein wants to merge 1 commit into
apache:mainfrom
kirilklein:gh-26685

kirilklein commented Jun 11, 2026 •

edited by github-actions Bot

Loading

Uh oh!

pitrou commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kirilklein commented Jun 11, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale

What this does

Design note for reviewers

Benchmarks

Tests

Uh oh!

pitrou commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kirilklein commented Jun 11, 2026 •

edited by github-actions Bot

Loading

pitrou commented Jun 11, 2026 •

edited

Loading