Skip to content

Backport checkpoint restore arg layout handling to 12.9.x#2145

Open
kkraus14 wants to merge 1 commit into
NVIDIA:12.9.xfrom
kkraus14:codex/backport-checkpoint-restoreargs-12.9.x
Open

Backport checkpoint restore arg layout handling to 12.9.x#2145
kkraus14 wants to merge 1 commit into
NVIDIA:12.9.xfrom
kkraus14:codex/backport-checkpoint-restoreargs-12.9.x

Conversation

@kkraus14
Copy link
Copy Markdown
Collaborator

@kkraus14 kkraus14 commented May 27, 2026

Backport of the CUDA checkpoint restore argument layout handling from #2144 to the 12.9.x branch.

This keeps CUcheckpointRestoreArgs rendering version-flexible across the checkpoint restore layouts:

  • CUDA 12.9: reserved remains cuuint64_t[8]
  • CUDA 13.1/13.2: gpuPairs, gpuPairsCount, reserved as char[44], and reserved1
  • CUDA 13.3: gpuPairs, gpuPairsCount, and reserved as char[52]

Unlike main, the CUDA 12.9 headers do not define CUcheckpointGpuPair, so this backport keeps the 12.9 restore-args layout reserved-only while preserving the conditional template support needed for newer checkpoint restore layouts.

Validation:

  • pre-commit hooks passed during commit
  • rendered checkpoint template sections for 12.9, 13.2, and 13.3 shapes

@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented May 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.bindings Everything related to the cuda.bindings module label May 27, 2026
@kkraus14 kkraus14 added bug Something isn't working P0 High priority - Must do! labels May 27, 2026
@kkraus14 kkraus14 self-assigned this May 27, 2026
@kkraus14 kkraus14 force-pushed the codex/backport-checkpoint-restoreargs-12.9.x branch from 5df5e59 to b34a70f Compare May 27, 2026 21:40
@kkraus14 kkraus14 force-pushed the codex/backport-checkpoint-restoreargs-12.9.x branch 2 times, most recently from 2dd69d8 to 440be94 Compare May 28, 2026 02:13
@kkraus14
Copy link
Copy Markdown
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the codex/backport-checkpoint-restoreargs-12.9.x branch 2 times, most recently from c924a85 to 648c67f Compare May 28, 2026 05:14
@kkraus14 kkraus14 force-pushed the codex/backport-checkpoint-restoreargs-12.9.x branch from 648c67f to 0292ca7 Compare May 28, 2026 19:05
@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented May 28, 2026

I edited the PR description:

-The 12.9.x branch did not already have the CUcheckpointGpuPair generated blocks that main had before #2144, so this backport includes those conditional template blocks as well.
+Unlike main, the CUDA 12.9 headers do not define `CUcheckpointGpuPair`, so this backport keeps the 12.9 restore-args layout reserved-only while preserving the conditional template support needed for newer checkpoint restore layouts.

@kkraus14 kkraus14 force-pushed the codex/backport-checkpoint-restoreargs-12.9.x branch from 0292ca7 to 6884723 Compare May 28, 2026 20:04
@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented May 28, 2026

My current understanding of this PR (@kkraus14 , please correct me if this is incorrect):

This is primarily a maintenance/alignment backport, not a functional fix for the normal CUDA 12.9.x build/test path.

For CUDA 12.9 headers, CUcheckpointRestoreArgs is already the reserved-only layout:

typedef struct CUcheckpointRestoreArgs_st {
    cuuint64_t reserved[8];
} CUcheckpointRestoreArgs;

and CUDA 12.9 does not define CUcheckpointGpuPair. So I would not expect existing 12.9.x CI, when generating/building against CUDA 12.9 headers, to fail without this PR.

What this PR does provide is keeping the checkpoint restore-args generation logic on 12.9.x aligned with the more robust handling from main/#2144:

  • carry parsed struct field type/array-length metadata into the templates,
  • render the restore-args reserved field based on the parsed header shape,
  • keep the CUDA 12.9 layout reserved-only,
  • preserve conditional support for newer checkpoint restore layouts where the headers actually define GPU remapping fields/types.

So I think the value is future maintenance and reduced semantic drift between the branch copies, rather than fixing a currently observed CUDA 12.9 behavior. Keith, please correct me if I’m missing a functional 12.9.x scenario that this backport is intended to cover.

Comment on lines +23290 to +23308
{{if struct_field_types.get('CUcheckpointRestoreArgs_st.reserved') == 'char'}}
@property
def reserved(self):
return PyBytes_FromStringAndSize(self._pvt_ptr[0].reserved, {{struct_field_array_lengths['CUcheckpointRestoreArgs_st.reserved']}})
@reserved.setter
def reserved(self, reserved):
if len(reserved) != {{struct_field_array_lengths['CUcheckpointRestoreArgs_st.reserved']}}:
raise ValueError("reserved length must be {{struct_field_array_lengths['CUcheckpointRestoreArgs_st.reserved']}}, is " + str(len(reserved)))
if CHAR_MIN == 0:
for i, b in enumerate(reserved):
if b < 0 and b > -129:
b = b + 256
self._pvt_ptr[0].reserved[i] = b
else:
for i, b in enumerate(reserved):
if b > 127 and b < 256:
b = b - 256
self._pvt_ptr[0].reserved[i] = b
{{endif}}
Copy link
Copy Markdown
Collaborator Author

@kkraus14 kkraus14 May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is potentially concerning as this is an API breaking change, though this reserved keyword is just reserved for future usage and is always supposed to be zeroed currently. Should we special case this to always return the list[cuuint64_t] to maintain the previous behavior?

Comment on lines 2313 to +2319
cdef struct CUcheckpointRestoreArgs_st:
cuuint64_t reserved[8]
{{if struct_field_types.get('CUcheckpointRestoreArgs_st.reserved') == 'char'}}
char reserved[{{struct_field_array_lengths['CUcheckpointRestoreArgs_st.reserved']}}]
{{endif}}
{{if struct_field_types.get('CUcheckpointRestoreArgs_st.reserved') == 'cuuint64_t'}}
cuuint64_t reserved[{{struct_field_array_lengths['CUcheckpointRestoreArgs_st.reserved']}}]
{{endif}}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern as in here: #2144 (comment)

# SPDX-License-Identifier: LicenseRef-NVIDIA-SOFTWARE-LICENSE

# This code was automatically generated with version 12.9.0, generator version 49a8141. Do not modify it directly.
# This code was automatically generated with version 12.9.0, generator version 0.3.1.dev1711+g875fec45. Do not modify it directly.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why this wasn't updated in the previous refresh...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cuda.bindings Everything related to the cuda.bindings module P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants