SM100 (Blackwell) Instructions

163 base instructions, 625 total variants

ACQBULK

Wait for Bulk Release Status Warp State

ACQSHMINIT

Wait for Shared Memory Initialization Release Status Warp State

AL2P

Unknown op

unknown

ALD

Unknown op

unknown

ATOM

Atomic Operation on Generic Memory

ATOMG

Atomic Operation on Global Memory

ATOMS

Atomic Operation on Shared Memory

B2R

Move Barrier To Register

BMSK

Bitfield Mask

BREV

Bit Reverse

CCTL

Cache Control

CGAERRBAR

CGA Error Barrier

CREDUX

Coupled Reduction of a Vector Register into a Uniform Register

CS2R

Move Special Register to Register

CSMTEST

Unknown op

unknown

DADD

FP64 Add

DFMA

FP64 Fused Multiply Add

DMMA

Matrix Multiply and Accumulate

DMUL

FP64 Multiply

DSETP

FP64 Compare And Set Predicate

ELECT

Elect a Leader Thread

ENDCOLLECTIVE

Reset the MCOLLECTIVE mask

ERRBAR

Error Barrier

F2FP

Unknown op

unknown

F2I

Floating Point To Integer Conversion

F2IP

FP32 Down-Convert to Integer and Pack

FADD

FP32 Add

FADD2

FP32 Add

FCHK

Floating-point Range Check

FFMA

FP32 Fused Multiply and Add

FFMA2

FP32 Fused Multiply and Add

FHADD

FP32 Addition

FHFMA

FP32 Fused Multiply and Add

FLO

Find Leading One

FMNMX

FP32 Minimum/Maximum

FMNMX3

3-Input Floating-point Minimum / Maximum

FMUL

FP32 Multiply

FMUL2

FP32 Multiply

FOOTPRINT

Unknown op

unknown

FRND

Round To Integer

FSEL

Floating Point Select

FSET

FP32 Compare And Set

FSETP

FP32 Compare And Set Predicate

FSWZADD

FP32 Swizzle Add

GETLMEMBASE

Get Local Memory Base Address

HADD2

FP16 Add

HFMA2

FP16 Fused Multiply Add

HMMA

Matrix Multiply and Accumulate

HMNMX2

FP16 Minimum / Maximum

HMUL2

FP16 Multiply

HSET2

FP16 Compare And Set

HSETP2

FP16 Compare And Set Predicate

I2F

Integer To Floating Point Conversion

I2FP

Integer to FP32 Convert and Pack

I2I

Integer To Integer Conversion

I2IP

Integer To Integer Conversion and Packing

IABS

Integer Absolute Value

IADD

Integer Addition

IADD3

3-input Integer Addition

IDP

Integer Dot Product and Accumulate

IMAD

Integer Multiply And Add

IMMA

Integer Matrix Multiply and Accumulate

IMNMX

Integer Minimum/Maximum

IMUL

Integer Multiply

IPA

Unknown op

unknown

ISBERD

Unknown op

unknown

ISETP

Integer Compare And Set Predicate

LD

Load from Generic Memory

LDC

Load Constant

LDG

Load from Global Memory

LDGDEPBAR

Global Load Dependency Barrier

LDGSTS

Asynchronous Global to Shared Memcopy

LDL

Load within Local Memory Window

LDS

Load within Shared Memory Window

LDSM

Load Matrix from Shared Memory with Element Size Expansion

LDTRAM

Unknown op

unknown

LEA

Load Effective Address

LEPC

Load Effective PC

LOP3

Logic Operation

MATCH

Match Register Values Across Thread Group

MOV

Move

MOVM

Move Matrix with Transposition or Expansion

MUFU

FP32 Multi Function Operation

NOP

No Operation

OUT

Unknown op

unknown

P2R

Move Predicate Register To Register

PIXLD

Unknown op

unknown

PLOP3

Predicate Logic Operation

PMTRIG

Performance Monitor Trigger

POPC

Population Count

PREEXIT

Dependent Task Launch Hint

PRMT

Permute Register Pair

QADD4

Unknown op

unknown

QFMA4

Unknown op

unknown

QMUL4

Unknown op

unknown

QSPC

Query Space

R2P

Move Register To Predicate Register

R2UR

Move from Vector Register to a Uniform Register

REDAS

Asynchronous Reduction on Distributed Shared Memory With Explicit Synchronization

REDUX

Reduction of a Vector Register into a Uniform Register

RPCMOV

PC Register Move

S2R

Move Special Register to Register

S2UR

Move Special Register to Uniform Register

SEL

Select Source with Predicate

SETCTAID

Set CTA ID

SGXT

Sign Extend

SHF

Funnel Shift

SHFL

Warp Wide Register Shuffle

STAS

Asynchronous Store to Distributed Shared Memory With Explicit Synchronization

SUATOM

Atomic Op on Surface Memory

SULD

Surface Load

SUQUERY

Unknown op

unknown

SYNCS

Sync Unit

TEX

Texture Fetch

TLD

Texture Load

TLD4

Texture Load 4

TMML

Texture MipMap Level

TXD

Texture Fetch With Derivatives

TXQ

Texture Query

UBLKCP

Bulk Data Copy

UBLKPF

Bulk Data Prefetch

UBLKRED

Bulk Data Copy from Shared Memory with Reduction

UBMSK

Uniform Bitfield Mask

UBREV

Uniform Bit Reverse

UCGABAR_ARV

CGA Barrier Synchronization

UCGABAR_WAIT

CGA Barrier Synchronization

UCLEA

Load Effective Address for a Constant

UF2FP

Uniform FP32 Down-convert and Pack

UFLO

Uniform Find Leading One

UIADD3

Uniform Integer Addition

UIMAD

Uniform Integer Multiplication

UISETP

Uniform Integer Compare and Set Uniform Predicate

ULEA

Uniform Load Effective Address

ULEPC

Uniform Load Effective PC

ULOP3

Uniform Logic Operation

UMOV

Uniform Move

UP2UR

Uniform Predicate to Uniform Register

UPLOP3

Uniform Predicate Logic Operation

UPOPC

Uniform Population Count

UPRMT

Uniform Byte Permute

UR2UP

Uniform Register to Uniform Predicate

USEL

Uniform Select

USETMAXREG

Release, Deallocate and Allocate Registers

USETSHMSZ

Unknown op

unknown

USGXT

Uniform Sign Extend

USHF

Uniform Funnel Shift

UTCATOMSWS

Perform Atomic operation on SW State Register

UTMACCTL

TMA Cache Control

UTMACMDFLUSH

TMA Command Flush

UTMALDG

Tensor Load from Global to Shared Memory

UTMAPF

Tensor Prefetch

UTMAREDG

Tensor Store from Shared to Global Memory with Reduction

UTMASTG

Tensor Store from Shared to Global Memory

UVIRTCOUNT

Virtual Resource Management

VABSDIFF

Absolute Difference

VABSDIFF4

Absolute Difference

VHMNMX

SIMD FP16 3-Input Minimum / Maximum

VIADD

SIMD Integer Addition

VIADDMNMX

SIMD Integer Addition and Fused Min/Max Comparison

VIMNMX

SIMD Integer Minimum / Maximum

VIMNMX3

SIMD Integer 3-Input Minimum / Maximum

VOTE

Vote Across SIMT Thread Group

VOTEU

Voting across SIMD Thread Group with Results in Uniform Destination

Unfound Instructions

Our fuzzer has not found these 96 instructions. If you have a cubin that contains any of these instructions and would like to contribute it, message us at collab@sf-tensor.com
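One way to check whether a cubin you already have contains any of these mnemonics is to disassemble it with `cuobjdump -sass` and scan the listing. The sketch below assumes the usual listing shape (an address comment such as `/*0010*/`, an optional guard predicate, then the mnemonic with dot-separated modifiers). The function names, the sample name set, and the sample listing are hypothetical and only for illustration. The set holds a handful of names copied from this section, not all 96.

```python
import re

# Assumed shape of one disassembly line from `cuobjdump -sass`:
#   /*0010*/   @!P0 IMAD.MOV.U32 R1, RZ, RZ, 0x4 ;
SASS_LINE = re.compile(
    r"^\s*/\*[0-9a-fA-F]+\*/\s+"  # address comment, e.g. /*0010*/
    r"(?:@!?\w+\s+)?"             # optional guard predicate, e.g. @!P0
    r"([A-Z][A-Z0-9_]*)"          # base mnemonic; capture stops before any '.'
)


def mnemonics_in_sass(sass_text):
    """Return the set of base mnemonics that appear in a SASS listing."""
    found = set()
    for line in sass_text.splitlines():
        match = SASS_LINE.match(line)
        if match:
            found.add(match.group(1))
    return found


def wanted_in_sass(sass_text, wanted):
    """Return, sorted, the mnemonics from `wanted` present in the listing."""
    return sorted(mnemonics_in_sass(sass_text) & set(wanted))


# Hypothetical subset of the unfound list, for illustration only.
UNFOUND_SAMPLE = {"BAR", "BRA", "EXIT", "STG", "WARPSYNC", "UTCBAR"}

# Hand-written listing in the assumed format, not real compiler output.
EXAMPLE_SASS = """\
        /*0000*/                   MOV R1, c[0x0][0x28] ;
        /*0010*/              @!P0 BRA 0x50 ;
        /*0020*/                   IMAD.MOV.U32 R2, RZ, RZ, 0x4 ;
        /*0030*/                   STG.E [R2.64], R0 ;
        /*0040*/                   EXIT ;
"""

hits = wanted_in_sass(EXAMPLE_SASS, UNFOUND_SAMPLE)
print(hits)
```

In practice the listing text would come from running `cuobjdump -sass` on the cubin and capturing its standard output. Because the scan keeps only the part of the mnemonic before the first dot, a dotted entry such as UIADD3.64 would show up as UIADD3 and needs a separate check.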

BAR

Barrier Synchronization

unfound

BMOV

Move Convergence Barrier State

unfound

BPT

BreakPoint/Trap

unfound

BRA

Relative Branch

unfound

BREAK

Break out of the Specified Convergence Barrier

unfound

BRX

Relative Branch Indirect

unfound

BRXU

Relative Branch with Uniform Register Based Offset

unfound

BSSY

Barrier Set Convergence Synchronization Point

unfound

BSYNC

Synchronize Threads on a Convergence Barrier

unfound

CALL

Call Function

unfound

CCTLL

Cache Control

unfound

CCTLT

Texture Cache Control

unfound

CS2UR

Load a Value from Constant Memory into a Uniform Register

unfound

DEPBAR

Dependency Barrier

unfound

EXIT

Exit Program

unfound

F2F

Floating Point To Floating Point Conversion

unfound

FADD32I

FP32 Add

unfound

FENCE

Memory Visibility Guarantee for Shared or Global Memory

unfound

FFMA32I

FP32 Fused Multiply and Add

unfound

FMUL32I

FP32 Multiply

unfound

HADD2_32I

FP16 Add

unfound

HFMA2_32I

FP16 Fused Multiply Add

unfound

HMUL2_32I

FP16 Multiply

unfound

IADD32I

Integer Addition

unfound

IDP4A

Integer Dot Product and Accumulate

unfound

IMUL32I

Integer Multiply

unfound

ISCADD

Scaled Integer Addition

unfound

ISCADD32I

Scaled Integer Addition

unfound

JMP

Absolute Jump

unfound

JMX

Absolute Jump Indirect

unfound

JMXU

Absolute Jump with Uniform Register Based Offset

unfound

KILL

Kill Thread

unfound

LDCU

Load a Value from Constant Memory into a Uniform Register

unfound

LDGMC

Reducing Load

unfound

LDT

Load Matrix from Tensor Memory to Register File

unfound

LDTM

Load Matrix from Tensor Memory to Register File

unfound

LOP

Logic Operation

unfound

LOP32I

Logic Operation

unfound

MEMBAR

Memory Barrier

unfound

MOV32I

Move

unfound

NANOSLEEP

Suspend Execution

unfound

OMMA

FP4 Matrix Multiply and Accumulate Across a Warp

unfound

PSETP

Combine Predicates and Set Predicate

unfound

QMMA

FP8 Matrix Multiply and Accumulate Across a Warp

unfound

REDG

Reduction Operation on Generic Memory

unfound

RET

Return From Subroutine

unfound

SETLMEMBASE

Set Local Memory Base Address

unfound

SHL

Shift Left

unfound

SHR

Shift Right

unfound

ST

Store to Generic Memory

unfound

STG

Store to Global Memory

unfound

STL

Store to Local Memory

unfound

STS

Store to Shared Memory

unfound

STSM

Store Matrix to Shared Memory

unfound

STT

Store Matrix to Tensor Memory from Register File

unfound

STTM

Store Matrix to Tensor Memory from Register File

unfound

SURED

Reduction Op on Surface Memory

unfound

SUST

Surface Store

unfound

UF2F

Uniform Float-to-Float Conversion

unfound

UF2I

Uniform Float-to-Integer Conversion

unfound

UF2IP

Uniform FP32 Down-Convert to Integer and Pack

unfound

UFADD

Uniform FP32 Addition

unfound

UFFMA

Uniform FP32 Fused Multiply-Add

unfound

UFMNMX

Uniform Floating-point Minimum / Maximum

unfound

UFMUL

Uniform FP32 Multiply

unfound

UFRND

Uniform Round to Integer

unfound

UFSEL

Uniform Floating-Point Select

unfound

UFSET

Uniform Floating-Point Compare and Set

unfound

UFSETP

Uniform Floating-Point Compare and Set Predicate

unfound

UGETNEXTWORKID

Uniform Get Next Work ID

unfound

UI2F

Uniform Integer to Float Conversion

unfound

UI2FP

Uniform Integer to FP32 Convert and Pack

unfound

UI2I

Uniform Saturating Integer-to-Integer Conversion

unfound

UI2IP

Uniform Dual Saturating Integer-to-Integer Conversion and Packing

unfound

UIABS

Uniform Integer Absolute Value

unfound

UIADD3.64

Uniform Integer Addition

unfound

UIMNMX

Uniform Integer Minimum / Maximum

unfound

ULOP

Uniform Logic Operation

unfound

ULOP32I

Uniform Logic Operation

unfound

UMEMSETS

Initialize Shared Memory

unfound

UPSETP

Uniform Predicate Logic Operation

unfound

UREDGR

Uniform Reduction on Global Memory with Release

unfound

USHL

Uniform Left Shift

unfound

USHR

Uniform Right Shift

unfound

USTGR

Uniform Store to Global Memory with Release

unfound

UTCBAR

Tensor Core Barrier

unfound

UTCCP

Asynchronous Data Copy from Shared Memory to Tensor Memory

unfound

UTCHMMA

Uniform Matrix Multiply and Accumulate

unfound

UTCIMMA

Uniform Matrix Multiply and Accumulate

unfound

UTCOMMA

Uniform Matrix Multiply and Accumulate

unfound

UTCQMMA

Uniform Matrix Multiply and Accumulate

unfound

UTCSHIFT

Shift elements in Tensor Memory

unfound

UVIADD

Uniform SIMD Integer Addition

unfound

UVIMNMX

Uniform SIMD Integer Minimum / Maximum

unfound

WARPSYNC

Synchronize Threads in Warp

unfound

YIELD

Yield Control

unfound