fix: materialize ConstantColumnVector on Comet's serialize/export paths#4532

Open
schenksj wants to merge 1 commit into
apache:main from
schenksj:fix/materialize-constant-column-vector

@schenksj

Which issue does this PR close?

Closes #4527.

Rationale for this change

Spark wraps file-source partition columns and other per-batch constants in ConstantColumnVector. When such a batch reaches Comet's serialization path (Utils.getBatchFieldVectors, used by broadcast/shuffle) or the FFI export path (NativeUtil.exportBatch), it was rejected with:

Comet execution only takes Arrow Arrays, but got ...ConstantColumnVector

This is a standalone fix. It surfaced while working on the Delta Lake contrib integration, where the OPTIMIZE / deletion-vector rewrite paths pull constants through a Comet operator, so prioritizing it helps that effort. It applies equally to any plan that routes a constant column through a Comet operator.

What changes are included in this PR?

  • ConstantColumnVectors.materialize (in the org.apache.spark.sql.comet.execution.arrow package) builds a fresh Arrow FieldVector holding the constant repeated numRows times. It reuses the existing per-type ArrowFieldWriters instead of a hand-rolled per-type switch, so it covers every type -- scalars, decimal, timestamps, and complex struct/array/map -- and stays in sync with Spark's type handling.
  • Utils.materializeConstantColumnVector exposes it to the serialization path.
  • New match arms in Utils.getBatchFieldVectors and NativeUtil.exportBatch materialize a ConstantColumnVector instead of throwing. The existing CometVector path is untouched.
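
To illustrate the shape of the change, here is a minimal sketch in plain Java. It does not use Spark's ConstantColumnVector, Arrow's FieldVector, or the ArrowFieldWriters; Column, ConstantColumn, DenseColumn, materialize, and toDense are hypothetical stand-ins invented for this example. It only shows the idea of expanding a constant into numRows slots and adding a dispatch arm for it instead of throwing.

```java
import java.util.Arrays;

// Hypothetical stand-ins; these are NOT Spark's ColumnVector or Arrow's FieldVector.
final class ConstantMaterializeSketch {

    /** Marker for any column in a batch. */
    interface Column {}

    /** One value (possibly null) that logically repeats for every row. */
    static final class ConstantColumn implements Column {
        final Object value;
        ConstantColumn(Object value) { this.value = value; }
    }

    /** A column that physically stores one slot per row. */
    static final class DenseColumn implements Column {
        final Object[] values;
        DenseColumn(Object[] values) { this.values = values; }
    }

    /** Builds a fresh dense column holding the constant repeated numRows times. */
    static DenseColumn materialize(ConstantColumn c, int numRows) {
        Object[] out = new Object[numRows];
        Arrays.fill(out, c.value); // a null constant yields an all-null column
        return new DenseColumn(out);
    }

    /** Dense columns pass through, constants are materialized, anything else is rejected. */
    static DenseColumn toDense(Column c, int numRows) {
        if (c instanceof DenseColumn) {
            return (DenseColumn) c;
        } else if (c instanceof ConstantColumn) {
            return materialize((ConstantColumn) c, numRows);
        }
        throw new IllegalArgumentException("unsupported column: " + c.getClass().getName());
    }

    public static void main(String[] args) {
        DenseColumn d = toDense(new ConstantColumn("2024-01-01"), 3);
        System.out.println(Arrays.toString(d.values)); // [2024-01-01, 2024-01-01, 2024-01-01]
    }
}
```

In this sketch the dense pass-through branch is left unchanged and the constant case is a new branch beside it, which is the same structure the bullets above describe for the CometVector path and the new ConstantColumnVector arms.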

How are these changes tested?

A new test in UtilsSuite round-trips a batch containing one ConstantColumnVector that holds a value and one that holds null through serializeBatches / decodeBatches, and asserts that the materialized values (and nulls) survive. The test fails on main with the "only takes Arrow Arrays" exception and passes with this change. UtilsSuite (3/3) and CometExecSuite (126/0) pass. The FFI exportBatch arm shares the same materializeConstantColumnVector helper.
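
The round-trip property the test checks can be sketched without Comet at all. The example below is an assumption-laden stand-in: materialize, encode, and decode are hypothetical helpers written for this illustration using java.io streams, not Comet's serializeBatches / decodeBatches or Arrow IPC. It shows only that a materialized constant column, including a null one, comes back unchanged after being written to bytes and read again.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical stand-in for the round-trip check; not Comet's serializeBatches/decodeBatches.
final class RoundTripSketch {

    /** Expands a constant (possibly null) into one slot per row. */
    static String[] materialize(String constant, int numRows) {
        String[] out = new String[numRows];
        Arrays.fill(out, constant);
        return out;
    }

    /** Writes the row count, then a null flag and (if non-null) the value for each row. */
    static byte[] encode(String[] column) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(column.length);
            for (String v : column) {
                out.writeBoolean(v == null);
                if (v != null) out.writeUTF(v);
            }
        }
        return bytes.toByteArray();
    }

    /** Reads back exactly what encode wrote. */
    static String[] decode(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String[] column = new String[in.readInt()];
            for (int i = 0; i < column.length; i++) {
                column[i] = in.readBoolean() ? null : in.readUTF();
            }
            return column;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Arrays.toString(decode(encode(materialize("part=7", 4)))));
        System.out.println(Arrays.toString(decode(encode(materialize(null, 4)))));
    }
}
```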

Spark wraps file-source partition columns and other per-batch constants in
ConstantColumnVector. When such a batch reaches Comet's serialization path
(Utils.getBatchFieldVectors, used by broadcast/shuffle) or FFI export path
(NativeUtil.exportBatch), it was rejected with "Comet execution only takes
Arrow Arrays".

Materialize the constant into a fresh Arrow FieldVector (the constant repeated
numRows times) inline. The materializer reuses the existing per-type
ArrowFieldWriters, so it covers every type -- scalars, decimal, timestamps, and
complex struct/array/map -- and stays in sync with Spark's type handling.

Adds ConstantColumnVectors.materialize (arrow package) +
Utils.materializeConstantColumnVector, with new match arms in
getBatchFieldVectors and exportBatch.

Closes apache#4527

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

ConstantColumnVector inputs fail Comet export with "Comet execution only takes Arrow Arrays"
