GH-26685: [Python][C++] Trim buffers when pickling a sliced array#50159
Open
kirilklein wants to merge 1 commit into
Open
GH-26685: [Python][C++] Trim buffers when pickling a sliced array#50159kirilklein wants to merge 1 commit into
kirilklein wants to merge 1 commit into
Conversation
Pickling a sliced array previously serialized the array's entire parent buffers instead of just the referenced slice. Add arrow::internal::TrimArrayDataBuffers, which compacts a sliced array (offset != 0) to its referenced range via Concatenate, and call it from Array.__reduce__ so pickling only serializes referenced bytes. Unsliced arrays are returned untouched, preserving zero-copy / protocol-5 out-of-band pickling. ChunkedArray/RecordBatch/Table inherit the fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Member
|
@jorisvandenbossche Would you like to take a look? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale
Pickling a sliced array currently serializes the array's entire parent
buffers rather than just the sliced range, because
Array.__reduce__wraps theraw
ArrayDatabuffers as-is. For a one-element slice of a large array thisserializes the whole parent (megabytes for a few bytes of data), which is a
long-standing pain point for multiprocessing / Dask / Ray (issue open since
2020). The IPC writer already truncates sliced buffers; pickling did not.
What this does
Adds
arrow::internal::TrimArrayDataBuffers(ArrayData, MemoryPool*)(
cpp/src/arrow/array/util.{h,cc}): when the array has a non-zerooffset(i.e. it is a slice sharing a parent's buffers) it compacts the array to the
referenced range via
Concatenate({MakeArray(data)}), which already handlesevery nested / variable-length / dictionary type correctly; otherwise it
returns the input unchanged.
Array.__reduce__calls it before reducing, so asliced array pickles only its referenced bytes. ChunkedArray / RecordBatch /
Table inherit the fix since they reduce through their arrays.
Pickle protocol-5 out-of-band buffers keep working (this is why the prior
IPC-based attempt, #37683, was rejected): we still reduce real
Bufferobjects, just trimmed ones. Crucially, unsliced arrays are returned
untouched, so their protocol-5 pickling stays zero-copy
(
test_array_pickle_protocol5keeps passing). The guard isoffset != 0rather than a buffer-size comparison precisely because allocator padding makes
an unsliced array's referenced size differ from its total buffer size — a
size-based guard would needlessly copy (and break zero-copy for) unsliced
arrays.
Known limitation: a zero-offset head slice (
arr.slice(0, k)of a largearray) is not trimmed, since it cannot be distinguished from an unsliced array
by
offsetalone without per-type buffer-size logic. This is no worse than thestatus quo (such slices already pickle the full buffers); the common case of
slices with a non-zero offset is fixed. A follow-up could trim these too via a
proper per-type buffer-truncation utility.
Design note for reviewers
Earlier discussion favored refactoring the IPC writer's per-type truncation
into a shared zero-copy visitor (Option 1). That refactor stalled for years on
nested / dictionary handling. This PR instead reuses
Concatenate, whichcopies only the (small) compacted slice — the same bytes pickle would serialize
anyway — for a much smaller, lower-risk change. Happy to evolve this toward a
zero-copy
SliceBuffer-based utility if preferred.Benchmarks
Local source build, arrays of 2,000,000 elements, pickling a 10-element slice:
Containers inherit the fix: a sliced ChunkedArray / RecordBatch / Table built on
a 2M-row array pickles to ~250–340 bytes (was tens of MB).
Tests
Adds regression tests in
python/pyarrow/tests/test_array.pycovering slicepickle size + round-trip across primitive / bool / string / list types,
protocol-5 out-of-band buffers, and the unsliced-array regression case. The
existing
test_array_pickle_protocol5(zero-copy guarantee) continues to pass.🤖 Generated with Claude Code