GH-26685: [Python][C++] Trim buffers when pickling a sliced array #50159

Open
kirilklein wants to merge 1 commit into apache:main from kirilklein:gh-26685

kirilklein commented Jun 11, 2026

Rationale

Pickling a sliced array currently serializes the array's entire parent
buffers rather than just the sliced range, because Array.__reduce__ wraps the
raw ArrayData buffers as-is. For a one-element slice of a large array this
serializes the whole parent (megabytes for a few bytes of data), which is a
long-standing pain point for multiprocessing / Dask / Ray (issue open since
2020). The IPC writer already truncates sliced buffers; pickling did not.

What this does

Adds arrow::internal::TrimArrayDataBuffers(ArrayData, MemoryPool*)
(cpp/src/arrow/array/util.{h,cc}): when the array has a non-zero offset
(i.e. it is a slice sharing a parent's buffers) it compacts the array to the
referenced range via Concatenate({MakeArray(data)}), which already handles
every nested / variable-length / dictionary type correctly; otherwise it
returns the input unchanged. Array.__reduce__ calls it before reducing, so a
sliced array pickles only its referenced bytes. ChunkedArray / RecordBatch /
Table inherit the fix since they reduce through their arrays.

Pickle protocol-5 out-of-band buffers keep working (this is why the prior
IPC-based attempt, #37683, was rejected): we still reduce real Buffer
objects, just trimmed ones. Crucially, unsliced arrays are returned
untouched, so their protocol-5 pickling stays zero-copy
(test_array_pickle_protocol5 keeps passing). The guard is offset != 0
rather than a buffer-size comparison precisely because allocator padding makes
an unsliced array's referenced size differ from its total buffer size — a
size-based guard would needlessly copy (and break zero-copy for) unsliced
arrays.
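
To make the out-of-band argument concrete, here is a toy model that uses only the standard library. It is not pyarrow's implementation: `trimmed_payload` and `dumps_oob` are hypothetical names, and a plain `bytearray` stands in for an Arrow buffer. It shows that a trimmed copy can still travel as a real protocol-5 out-of-band buffer, while an untouched buffer is handed over without copying.

```python
import pickle


def trimmed_payload(buf, offset, length):
    """Copy only the referenced bytes when offset != 0; otherwise return buf as-is."""
    if offset != 0:
        return bytearray(memoryview(buf)[offset:offset + length])
    return buf


def dumps_oob(buf, offset, length):
    """Pickle (buffer, length) with protocol 5, collecting out-of-band buffers."""
    oob = []
    payload = pickle.PickleBuffer(trimmed_payload(buf, offset, length))
    # list.append returns None, which tells pickle to keep the buffer out-of-band.
    data = pickle.dumps((payload, length), protocol=5, buffer_callback=oob.append)
    return data, oob


parent = bytearray(range(256)) * 4000  # 1,024,000 bytes

# Sliced case: only the 10 referenced bytes end up in the out-of-band buffer.
data, oob = dumps_oob(parent, 1_023_990, 10)
restored_buf, n = pickle.loads(data, buffers=oob)
print(oob[0].raw().nbytes, bytes(memoryview(restored_buf)[:n]))

# Unsliced case: the out-of-band buffer is a view of the parent, not a copy.
data2, oob2 = dumps_oob(parent, 0, len(parent))
parent[0] = 7
print(oob2[0].raw()[0])  # reflects the write to parent made after dumps
```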

Known limitation: a zero-offset head slice (arr.slice(0, k) of a large
array) is not trimmed, since it cannot be distinguished from an unsliced array
by offset alone without per-type buffer-size logic. This is no worse than the
status quo (such slices already pickle the full buffers); the common case of
slices with a non-zero offset is fixed. A follow-up could trim these too via a
proper per-type buffer-truncation utility.

Design note for reviewers

Earlier discussion favored refactoring the IPC writer's per-type truncation
into a shared zero-copy visitor (Option 1). That refactor stalled for years on
nested / dictionary handling. This PR instead reuses Concatenate, which
copies only the (small) compacted slice — the same bytes pickle would serialize
anyway — for a much smaller, lower-risk change. Happy to evolve this toward a
zero-copy SliceBuffer-based utility if preferred.

Benchmarks

Local source build, arrays of 2,000,000 elements, pickling a 10-element slice:

| type | slice pickle before (bytes) | after (bytes) | shrink |
|---|---|---|---|
| int64 | 16,000,141 | 200 | 80,001× |
| string | 32,889,061 | 247 | 133,154× |
| large_string | 40,889,071 | 297 | 137,674× |
| list<int64> | 40,000,247 | 405 | 98,766× |
| struct | 38,889,227 | 419 | 92,814× |
| dictionary | 8,001,235 | 1,236 | 6,473× |
| bool | 250,140 | 121 | 2,067× |

Containers inherit the fix: a sliced ChunkedArray / RecordBatch / Table built on
a 2M-row array pickles to ~250–340 bytes (was tens of MB).

Tests

Adds regression tests in python/pyarrow/tests/test_array.py covering slice
pickle size + round-trip across primitive / bool / string / list types,
protocol-5 out-of-band buffers, and the unsliced-array regression case. The
existing test_array_pickle_protocol5 (zero-copy guarantee) continues to pass.

🤖 Generated with Claude Code

Pickling a sliced array previously serialized the array's entire parent
buffers instead of just the referenced slice. Add
arrow::internal::TrimArrayDataBuffers, which compacts a sliced array
(offset != 0) to its referenced range via Concatenate, and call it from
Array.__reduce__ so pickling only serializes referenced bytes. Unsliced
arrays are returned untouched, preserving zero-copy / protocol-5
out-of-band pickling. ChunkedArray/RecordBatch/Table inherit the fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pitrou commented Jun 11, 2026

Member

@jorisvandenbossche Would you like to take a look?
