SM100 (B200) Instructions
163 base instructions, 625 total variants (the number in parentheses after each opcode is its variant count)
ACQBULK(1)
Wait for Bulk Release Status Warp State
ACQSHMINIT(1)
Wait for Shared Memory Initialization Release Status Warp State
AL2P(1)
Unknown Op
ALD(7)
Unknown Op
ATOM(10)
Atomic Operation on Generic Memory
ATOMG(10)
Atomic Operation on Global Memory
ATOMS(8)
Atomic Operation on Shared Memory
B2R(3)
Move Barrier To Register
BMSK(3)
Bitfield Mask
BREV(3)
Bit Reverse
CCTL(12)
Cache Control
CGAERRBAR(1)
CGA Error Barrier
CREDUX(1)
Coupled Reduction of a Vector Register into a Uniform Register
CS2R(1)
Move Special Register to Register
CSMTEST(1)
Unknown Op
DADD(3)
FP64 Add
DFMA(5)
FP64 Fused Multiply Add
DMMA(2)
Matrix Multiply and Accumulate
DMUL(3)
FP64 Multiply
DSETP(3)
FP64 Compare And Set Predicate
ELECT(2)
Elect a Leader Thread
ENDCOLLECTIVE(1)
Reset the MCOLLECTIVE mask
ERRBAR(1)
Error Barrier
F2FP(15)
Unknown Op
F2I(2)
Floating Point To Integer Conversion
F2IP(5)
FP32 Down-Convert to Integer and Pack
FADD(3)
FP32 Add
FADD2(3)
FP32 Add
FCHK(3)
Floating-point Range Check
FFMA(5)
FP32 Fused Multiply and Add
FFMA2(5)
FP32 Fused Multiply and Add
FHADD(2)
FP32 Addition
FHFMA(3)
FP32 Fused Multiply and Add
FLO(3)
Find Leading One
FMNMX(6)
FP32 Minimum/Maximum
FMNMX3(3)
3-Input Floating-point Minimum / Maximum
FMUL(3)
FP32 Multiply
FMUL2(3)
FP32 Multiply
FOOTPRINT(4)
Unknown Op
FRND(3)
Round To Integer
FSEL(3)
Floating Point Select
FSET(3)
FP32 Compare And Set
FSETP(3)
FP32 Compare And Set Predicate
FSWZADD(1)
FP32 Swizzle Add
GETLMEMBASE(1)
Get Local Memory Base Address
HADD2(4)
FP16 Add
HFMA2(10)
FP16 Fused Multiply Add
HMMA(4)
Matrix Multiply and Accumulate
HMNMX2(6)
FP16 Minimum / Maximum
HMUL2(3)
FP16 Multiply
HSET2(3)
FP16 Compare And Set
HSETP2(3)
FP16 Compare And Set Predicate
I2F(3)
Integer To Floating Point Conversion
I2FP(5)
Integer to FP32 Convert and Pack
I2I(3)
Integer To Integer Conversion
I2IP(3)
Integer To Integer Conversion and Packing
IABS(3)
Integer Absolute Value
IADD(6)
Integer Addition
IADD3(6)
3-input Integer Addition
IDP(2)
Integer Dot Product and Accumulate
IMAD(26)
Integer Multiply And Add
IMMA(4)
Integer Matrix Multiply and Accumulate
IMNMX(3)
Integer Minimum/Maximum
IMUL(9)
Integer Multiply
IPA(9)
Unknown Op
ISBERD(2)
Unknown Op
ISETP(6)
Integer Compare And Set Predicate
LD(8)
Load from Generic Memory
LDC(4)
Load Constant
LDG(12)
Load from Global Memory
LDGDEPBAR(1)
Global Load Dependency Barrier
LDGSTS(4)
Asynchronous Global to Shared Memcopy
LDL(6)
Load within Local Memory Window
LDS(4)
Load within Shared Memory Window
LDSM(4)
Load Matrix from Shared Memory with Element Size Expansion
LDTRAM(2)
Unknown Op
LEA(14)
Load Effective Address
LEPC(2)
Load Effective PC
LOP3(3)
Logic Operation
MATCH(2)
Match Register Values Across Thread Group
MOV(4)
Move
MOVM(1)
Move Matrix with Transposition or Expansion
MUFU(3)
FP32 Multi Function Operation
NOP(1)
No Operation
OUT(4)
Unknown Op
P2R(3)
Move Predicate Register To Register
PIXLD(1)
Unknown Op
PLOP3(8)
Predicate Logic Operation
PMTRIG(1)
Performance Monitor Trigger
POPC(3)
Population Count
PREEXIT(1)
Dependent Task Launch Hint
PRMT(5)
Permute Register Pair
QADD4(3)
Unknown Op
QFMA4(5)
Unknown Op
QMUL4(3)
Unknown Op
QSPC(4)
Query Space
R2P(3)
Move Register To Predicate Register
R2UR(1)
Move from Vector Register to a Uniform Register
REDAS(2)
Asynchronous Reduction on Distributed Shared Memory With Explicit Synchronization
REDUX(1)
Reduction of a Vector Register into a Uniform Register
RPCMOV(6)
PC Register Move
S2R(1)
Move Special Register to Register
S2UR(1)
Move Special Register to Uniform Register
SEL(3)
Select Source with Predicate
SETCTAID(1)
Set CTA ID
SGXT(3)
Sign Extend
SHF(5)
Funnel Shift
SHFL(4)
Warp Wide Register Shuffle
STAS(2)
Asynchronous Store to Distributed Shared Memory With Explicit Synchronization
SUATOM(4)
Atomic Op on Surface Memory
SULD(4)
Surface Load
SUQUERY(2)
Unknown Op
SYNCS(16)
Sync Unit
TEX(6)
Texture Fetch
TLD(6)
Texture Load
TLD4(6)
Texture Load 4
TMML(6)
Texture MipMap Level
TXD(6)
Texture Fetch With Derivatives
TXQ(2)
Texture Query
UBLKCP(2)
Bulk Data Copy
UBLKPF(2)
Bulk Data Prefetch
UBLKRED(2)
Bulk Data Copy from Shared Memory with Reduction
UBMSK(2)
Uniform Bitfield Mask
UBREV(2)
Uniform Bit Reverse
UCGABAR_ARV(1)
CGA Barrier Synchronization
UCGABAR_WAIT(1)
CGA Barrier Synchronization
UCLEA(2)
Load Effective Address for a Constant
UF2FP(4)
Uniform FP32 Down-Convert and Pack
UFLO(2)
Uniform Find Leading One
UIADD3(8)
Uniform Integer Addition
UIMAD(10)
Uniform Integer Multiplication
UISETP(4)
Uniform Integer Compare and Set Uniform Predicate
ULEA(10)
Uniform Load Effective Address
ULEPC(2)
Uniform Load Effective PC
ULOP3(2)
Uniform Logic Operation
UMOV(2)
Uniform Move
UP2UR(2)
Uniform Predicate to Uniform Register
UPLOP3(4)
Uniform Predicate Logic Operation
UPOPC(2)
Uniform Population Count
UPRMT(2)
Uniform Byte Permute
UR2UP(2)
Uniform Register to Uniform Predicate
USEL(2)
Uniform Select
USETMAXREG(1)
Release, Deallocate and Allocate Registers
USETSHMSZ(3)
Unknown Op
USGXT(2)
Uniform Sign Extend
USHF(3)
Uniform Funnel Shift
UTCATOMSWS(3)
Perform Atomic Operation on SW State Register
UTMACCTL(2)
TMA Cache Control
UTMACMDFLUSH(1)
TMA Command Flush
UTMALDG(4)
Tensor Load from Global to Shared Memory
UTMAPF(4)
Tensor Prefetch
UTMAREDG(2)
Tensor Store from Shared to Global Memory with Reduction
UTMASTG(2)
Tensor Store from Shared to Global Memory
UVIRTCOUNT(2)
Virtual Resource Management
VABSDIFF(5)
Absolute Difference
VABSDIFF4(5)
Absolute Difference
VHMNMX(3)
SIMD FP16 3-Input Minimum / Maximum
VIADD(3)
SIMD Integer Addition
VIADDMNMX(5)
SIMD Integer Addition and Fused Min/Max Comparison
VIMNMX(3)
SIMD Integer Minimum / Maximum
VIMNMX3(3)
SIMD Integer 3-Input Minimum / Maximum
VOTE(1)
Vote Across SIMT Thread Group
VOTEU(1)
Voting across SIMD Thread Group with Results in Uniform Destination
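The two totals in the header can be recomputed from the list above: each found instruction is one line of the form OPCODE(N), and the variant total is the sum of the N values. The following is a minimal Python sketch that assumes exactly that one-entry-per-line layout; tally is a hypothetical helper name, not part of any existing tool.

```python
import re

# A catalog entry line looks like "IMAD(26)": the opcode, then its
# variant count in parentheses. None of the description lines in this
# catalog (e.g. "FP32 Add") end in "(digits)", so they never match.
ENTRY = re.compile(r"^([A-Z][A-Z0-9_.]*)\((\d+)\)$")


def tally(catalog_text: str) -> tuple[int, int]:
    """Return (number of base instructions, total number of variants)."""
    base_count = 0
    variant_total = 0
    for line in catalog_text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            base_count += 1
            variant_total += int(match.group(2))
    return base_count, variant_total


excerpt = """ACQBULK(1)
Wait for Bulk Release Status Warp State
F2FP(15)
Unknown Op
IMAD(26)
Integer Multiply And Add
"""
print(tally(excerpt))  # (3, 42)
```

Run over the full found-instruction list, the same function should reproduce the header's 163 base instructions and 625 variants.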
Unfound Instructions
Our fuzzer has not yet found the following 96 instructions. If you have a cubin that contains any of them and would like to contribute it, message us at collab@sf-tensor.com.
BAR
Barrier Synchronization
BMOV
Move Convergence Barrier State
BPT
BreakPoint/Trap
BRA
Relative Branch
BREAK
Break out of the Specified Convergence Barrier
BRX
Relative Branch Indirect
BRXU
Relative Branch with Uniform Register Based Offset
BSSY
Barrier Set Convergence Synchronization Point
BSYNC
Synchronize Threads on a Convergence Barrier
CALL
Call Function
CCTLL
Cache Control
CCTLT
Texture Cache Control
CS2UR
Load a Value from Constant Memory into a Uniform Register
DEPBAR
Dependency Barrier
EXIT
Exit Program
F2F
Floating Point To Floating Point Conversion
FADD32I
FP32 Add
FENCE
Memory Visibility Guarantee for Shared or Global Memory
FFMA32I
FP32 Fused Multiply and Add
FMUL32I
FP32 Multiply
HADD2_32I
FP16 Add
HFMA2_32I
FP16 Fused Multiply Add
HMUL2_32I
FP16 Multiply
IADD32I
Integer Addition
IDP4A
Integer Dot Product and Accumulate
IMUL32I
Integer Multiply
ISCADD
Scaled Integer Addition
ISCADD32I
Scaled Integer Addition
JMP
Absolute Jump
JMX
Absolute Jump Indirect
JMXU
Absolute Jump with Uniform Register Based Offset
KILL
Kill Thread
LDCU
Load a Value from Constant Memory into a Uniform Register
LDGMC
Reducing Load
LDT
Load Matrix from Tensor Memory to Register File
LDTM
Load Matrix from Tensor Memory to Register File
LOP
Logic Operation
LOP32I
Logic Operation
MEMBAR
Memory Barrier
MOV32I
Move
NANOSLEEP
Suspend Execution
OMMA
FP4 Matrix Multiply and Accumulate Across a Warp
PSETP
Combine Predicates and Set Predicate
QMMA
FP8 Matrix Multiply and Accumulate Across a Warp
REDG
Reduction Operation on Generic Memory
RET
Return From Subroutine
SETLMEMBASE
Set Local Memory Base Address
SHL
Shift Left
SHR
Shift Right
ST
Store to Generic Memory
STG
Store to Global Memory
STL
Store to Local Memory
STS
Store to Shared Memory
STSM
Store Matrix to Shared Memory
STT
Store Matrix to Tensor Memory from Register File
STTM
Store Matrix to Tensor Memory from Register File
SURED
Reduction Op on Surface Memory
SUST
Surface Store
UF2F
Uniform Float-to-Float Conversion
UF2I
Uniform Float-to-Integer Conversion
UF2IP
Uniform FP32 Down-Convert to Integer and Pack
UFADD
Uniform FP32 Addition
UFFMA
Uniform FP32 Fused Multiply-Add
UFMNMX
Uniform Floating-point Minimum / Maximum
UFMUL
Uniform FP32 Multiply
UFRND
Uniform Round to Integer
UFSEL
Uniform Floating-Point Select
UFSET
Uniform Floating-Point Compare and Set
UFSETP
Uniform Floating-Point Compare and Set Predicate
UGETNEXTWORKID
Uniform Get Next Work ID
UI2F
Uniform Integer to Float Conversion
UI2FP
Uniform Integer to FP32 Convert and Pack
UI2I
Uniform Saturating Integer-to-Integer Conversion
UI2IP
Uniform Dual Saturating Integer-to-Integer Conversion and Packing
UIABS
Uniform Integer Absolute Value
UIADD3.64
Uniform Integer Addition
UIMNMX
Uniform Integer Minimum / Maximum
ULOP
Uniform Logic Operation
ULOP32I
Uniform Logic Operation
UMEMSETS
Initialize Shared Memory
UPSETP
Uniform Predicate Logic Operation
UREDGR
Uniform Reduction on Global Memory with Release
USHL
Uniform Left Shift
USHR
Uniform Right Shift
USTGR
Uniform Store to Global Memory with Release
UTCBAR
Tensor Core Barrier
UTCCP
Asynchronous Data Copy from Shared Memory to Tensor Memory
UTCHMMA
Uniform Matrix Multiply and Accumulate
UTCIMMA
Uniform Matrix Multiply and Accumulate
UTCOMMA
Uniform Matrix Multiply and Accumulate
UTCQMMA
Uniform Matrix Multiply and Accumulate
UTCSHIFT
Shift Elements in Tensor Memory
UVIADD
Uniform SIMD Integer Addition
UVIMNMX
Uniform SIMD Integer Minimum / Maximum
WARPSYNC
Synchronize Threads in Warp
YIELD
Yield Control
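Before contributing a cubin, one way to check whether it contains any of the instructions listed above is to disassemble it (for example with cuobjdump -sass kernel.cubin or nvdisasm kernel.cubin from the CUDA toolkit; kernel.cubin is a placeholder name) and scan the listing. The Python sketch below assumes the usual SASS listing layout, where each instruction follows an address comment such as /*0010*/ and may carry a guard predicate such as @!P0. find_opcodes and the sample listing are hypothetical illustrations, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed layout: an instruction follows an address comment such as
# /*0010*/, optionally preceded by "{" and a guard predicate such as
# @P0, @!PT or @UP1. Group 1 captures the full mnemonic including its
# dot-separated modifiers, e.g. "BAR.SYNC.DEFER_BLOCKING". The trailing
# encoding comment "/* 0x... */" has a space after "/*", so it is skipped.
SASS_INSN = re.compile(
    r"/\*[0-9a-fA-F]{4,}\*/\s+(?:\{\s*)?(?:@!?U?P\w+\s+)?"
    r"([A-Z][A-Z0-9_]*(?:\.[A-Z0-9_]+)*)"
)


def find_opcodes(sass_text: str, targets: list[str]) -> dict[str, int]:
    """Count how often each target opcode occurs in a SASS listing.

    A target matches a mnemonic that equals it or extends it with
    modifiers, so "BAR" matches "BAR.SYNC" while "UIADD3.64" does not
    match a plain "UIADD3" and "ST" does not match "STG".
    """
    counts: Counter[str] = Counter()
    for line in sass_text.splitlines():
        match = SASS_INSN.search(line)
        if not match:
            continue
        mnemonic = match.group(1)
        for target in targets:
            if mnemonic == target or mnemonic.startswith(target + "."):
                counts[target] += 1
    return dict(counts)


sample = """
        /*0000*/                   LDC R1, c[0x0][0x28] ;
        /*0010*/                   BAR.SYNC.DEFER_BLOCKING 0x0 ;
        /*0020*/              @!P0 EXIT ;
        /*0030*/                   UIADD3 UR4, UPT, UPT, UR4, 0x1, URZ ;
        /*0040*/                   NANOSLEEP 0x20 ;
        /*0050*/                   EXIT ;
"""
print(find_opcodes(sample, ["BAR", "EXIT", "NANOSLEEP", "UIADD3.64", "YIELD"]))
```

The targets would be the opcode names from the list above. Matching on the mnemonic prefix up to a dot keeps modifier-heavy forms like BAR.SYNC.DEFER_BLOCKING while avoiding false hits between related opcodes such as LD and LDG.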