Part 4 - Module 7: Reading Production Assembly
Difficulty: Intermediate-Advanced
Estimated reading time: ~30 minutes | Exercises: ~2-3 hours
Table of Contents
A Reading Methodology
Guided Walkthroughs
- Walkthrough - Uniswap V3 FullMath
- Walkthrough - Solady sqrt() and log2()
- Walkthrough - Solady ERC20 Transfer
- Other Precompiles
- Build Exercise: AssemblyAuditor
A Reading Methodology
Modules 1-6 gave you the pieces: how memory is laid out (M2), how storage slots are computed (M3), how dispatch works (M4), how external calls are built (M5), and what optimization tricks look like (M6). Each module included a "How to Study" section for its specific pattern type. M7 pulls those together into one unified approach, then puts it to work on real code you haven't seen analyzed yet.
The goal isn't to memorize these specific codebases. It's to build the confidence to open any assembly-heavy contract and understand what it does, whether you're reviewing a PR, auditing a protocol, or studying a new library.
Concept: The Systematic Approach
Why this matters: Production assembly can be 200+ lines of dense Yul with no comments. Without a systematic approach, you'll stare at mstore and sload instructions and lose the thread. With one, you can break any assembly block into understandable pieces.
The 5-step method:
Step 1 - Identify the pattern type. Before reading any opcodes, ask: what kind of assembly is this? The answer tells you which mental model to reach for.
| Pattern Type | Signals | Reach For |
|---|---|---|
| Memory-heavy | mstore, mload, keccak256, FMP manipulation | Memory layout diagram (M2) |
| Storage-heavy | sload, sstore, shr/shl on stored values | Storage slot computation, packing diagrams (M3) |
| Dispatch-heavy | calldataload(0), selector comparison, JUMP | Selector matching, dispatch pattern (M4) |
| Call-heavy | call, staticcall, delegatecall, return data handling | Call lifecycle, return value checks (M5) |
| Optimization-heavy | returndatasize() as zero, branchless patterns, scratch space | Solady playbook tricks (M6) |
Most production assembly combines 2-3 of these. A Solady safeTransfer is call-heavy + memory-heavy + optimization-heavy. An Aave storage getter is storage-heavy + optimization-heavy. Identifying the dominant pattern narrows your focus.
Step 2 - Read the interface first. Look at the function signature, NatSpec, and return types before reading any assembly. Understanding what goes in and what comes out gives you the frame.
// Before diving into the assembly, you already know:
// - Input: a token address, a recipient address, and a uint256 amount
// - Output: nothing (void), but may revert
// - Side effects: must modify token balances
function safeTransfer(address token, address to, uint256 amount) internal {
    assembly {
        // ... 30 lines of assembly become much less intimidating
        // when you already know what they're trying to accomplish
    }
}
Step 3 - Draw the data layout. Based on the pattern type from Step 1, sketch the relevant layout:
- Memory-heavy: Draw a memory map: what's at 0x00, 0x20, 0x40 (the FMP), and beyond. Track every mstore and mload.
- Storage-heavy: Use forge inspect ContractName storageLayout to see which variables live at which slots. For mappings, compute the slot with keccak256(abi.encode(key, baseSlot)).
- Calldata-heavy: Map out the ABI encoding: selector at bytes 0-3, first arg at bytes 4-35, etc.
Example memory map for a safeTransfer assembly block:
Offset  Content             Purpose
------  ------------------  --------------------------
0x00    selector (4 bytes)  transfer(address,uint256)
0x04    recipient address   argument 1
0x24    amount              argument 2
0x00    return value        overwritten by call output
This map is your reference as you trace through the opcodes. Every mstore(0x04, to) now means "write the recipient into the calldata layout."
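A minimal sketch of how the map above could translate into code. It is an illustration, not any library's actual implementation: the library and function names are hypothetical, and it saves and restores the free memory pointer because the amount word written at 0x24 runs through byte 0x43 and therefore overlaps the pointer stored at 0x40.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// Hypothetical illustration of the memory map above; not production code.
library SafeTransferSketch {
    function safeTransferSketch(address token, address to, uint256 amount) internal {
        assembly {
            // The amount word at 0x24 ends at 0x43, so it overlaps the free
            // memory pointer at 0x40. Save it now and restore it after the call.
            let fmp := mload(0x40)

            mstore(0x00, shl(224, 0xa9059cbb)) // 0x00..0x03: transfer(address,uint256) selector
            mstore(0x04, shr(96, shl(96, to))) // 0x04..0x23: recipient, upper bits cleared
            mstore(0x24, amount)               // 0x24..0x43: amount

            // Input: 0x44 bytes starting at 0x00. Up to 32 bytes of output land at 0x00.
            let ok := call(gas(), token, 0, 0x00, 0x44, 0x00, 0x20)
            // Accept either "no return data" (USDT-style) or a returned `true`.
            ok := and(ok, or(iszero(returndatasize()), eq(mload(0x00), 1)))

            mstore(0x40, fmp)                  // restore the free memory pointer
            if iszero(ok) { revert(0, 0) }
        }
    }
}
```

The sketch does not check that token has deployed code. A call to an address without code succeeds with no return data, so a real implementation would likely want that check as well.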
Step 4 - Trace one execution path. Don't try to understand every branch at once. Pick the happy path (the most common execution) and follow it opcode by opcode. Mark values on your data layout as you go.
For a safeTransfer, the happy path is: encode calldata -> call() succeeds -> return data is true -> done. Only after understanding this path should you look at error handling, edge cases, and fallbacks.
Step 5 - Identify the tricks. Now that you understand what the code does, ask why it does it that way. This is where M6's playbook comes in:
- Why returndatasize() instead of push 0? -> Free zero trick (M6)
- Why xor + mul instead of an if statement? -> Branchless pattern (M6)
- Why writing at 0x00 instead of allocating from FMP? -> Scratch space / dirty memory (M6)
- Why revert(0x1c, 0x04) instead of revert(0x00, 0x04)? -> Selector-only revert trick (M2)
The tricks are the style layer on top of the logic layer. Separate them mentally: first understand the logic, then appreciate the optimizations.
Quick reference: the "How to Study" sections from M1-M6:
| Module | Reading Strategy | Best For |
|---|---|---|
| M1 | evm.codes, Remix debugger, forge inspect, Dedaub | Raw bytecode, opcode-level analysis |
| M2 | Draw memory layout, track FMP, follow calldata flow | SafeTransferLib, ABI encoding, error handling |
| M3 | forge inspect storageLayout, trace mapping formulas, draw packing diagrams | Aave ReserveData, bitmap configs, proxy slots |
| M4 | cast disassemble, count selectors, trace one call end-to-end | ERC20 dispatch, proxy forwarding, Huff contracts |
| M5 | Start with simplest function, compare implementations | SafeTransferLib, Proxy.sol, Multicall |
| M6 | Build up from simple patterns (min -> abs -> mulDiv) | FixedPointMathLib, branchless math |
DeFi Pattern Connection
Where systematic reading matters most:
- Audit reviews: Security firms read every assembly block in scope. A systematic approach prevents missing subtle bugs hidden in dense Yul.
- Protocol integration: Before integrating with a protocol (calling their contracts from yours), you need to understand their assembly-level behavior: what reverts look like, what return data to expect, what gas they consume.
- Incident response: When an exploit happens, the first step is reading the vulnerable assembly to understand the attack vector. Speed matters; methodology beats staring.
Concept: The Audit Lens
Why this matters: Reading assembly to understand it is Step 1. Reading assembly to find bugs is Step 2, and it's what gets you hired at audit firms and security-focused protocol teams.
Here's the checklist. Each item is a specific thing to look for when reviewing assembly with security in mind:
1. Unchecked call return values
The call() opcode returns 0 on failure, 1 on success. If the return value is pop()'d or ignored, a failed external call is silently swallowed. This is one of the most common assembly bugs.
// BUG: ignoring whether the call succeeded
pop(call(gas(), token, 0, 0x00, 0x44, 0x00, 0x20))
// CORRECT: check and revert
if iszero(call(gas(), token, 0, 0x00, 0x44, 0x00, 0x20)) { revert(0, 0) }
See: M5 - The Call Lifecycle
2. Missing return data validation
Even when call() returns 1 (didn't revert), the called function might return false. Tokens like USDT return nothing; others return a bool. The and(success, or(iszero(returndatasize()), eq(mload(0x00), 1))) pattern from SafeTransferLib handles both.
See: M5 - The SafeERC20 Pattern
3. Dirty memory corruption
Writing past the free memory pointer is safe only if no Solidity code allocates memory afterward. If assembly writes at mload(0x40) without advancing the FMP, and the function continues in Solidity (e.g., creating a dynamic array), the new allocation overwrites the assembly's data.
See: M6 - Memory Tricks
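To make the failure mode concrete, here is a small sketch (hypothetical contract and function names) that performs the same assembly write twice: once leaving the free memory pointer where it was, once advancing it. What the buggy variant reads back depends on what the compiler's allocator writes, so treat the comments as expectations rather than guarantees.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// Hypothetical example contrasting a write that skips the FMP update
/// with one that reserves the word it wrote.
contract DirtyMemorySketch {
    function unsafeCache(uint256 v) external pure returns (uint256 readBack) {
        uint256 ptr;
        assembly {
            ptr := mload(0x40)
            mstore(ptr, v) // BUG: the FMP still points at this word
        }
        // Solidity allocates from the FMP, so this array starts at `ptr`
        // and its length word is expected to overwrite `v`.
        uint256[] memory arr = new uint256[](1);
        arr[0] = 0;
        assembly { readBack := mload(ptr) }
    }

    function safeCache(uint256 v) external pure returns (uint256 readBack) {
        uint256 ptr;
        assembly {
            ptr := mload(0x40)
            mstore(ptr, v)
            mstore(0x40, add(ptr, 0x20)) // reserve the word before leaving assembly
        }
        uint256[] memory arr = new uint256[](1); // now allocated after `ptr`
        arr[0] = 0;
        assembly { readBack := mload(ptr) }
    }
}
```

In an audit, the question to ask of every mload(0x40) is whether a matching mstore(0x40, ...) follows before control returns to Solidity.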
4. Off-by-one in shift amounts
When reading packed storage, shr(128, data) and shr(127, data) produce very different results. A single-bit error in the shift amount reads the wrong field, and the values might look plausible, making the bug hard to catch without edge-case testing.
See: M3 - Storage Packing
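A short sketch of why the wrong shift can look plausible. The layout (two 128-bit fields in one word) and all names are assumptions made for illustration.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// Hypothetical packed layout: bits 0..127 = lower field, bits 128..255 = upper field.
library PackingSketch {
    function unpack(uint256 packed) internal pure returns (uint256 lower, uint256 upper) {
        assembly {
            let mask := sub(shl(128, 1), 1) // 2^128 - 1
            lower := and(packed, mask)
            upper := shr(128, packed)
        }
    }

    /// Same read with the shift off by one bit.
    function unpackUpperOffByOne(uint256 packed) internal pure returns (uint256 upper) {
        assembly {
            upper := shr(127, packed) // BUG: drags bit 127 of the lower field along
        }
    }
}
```

For packed = (7 << 128) | 5, unpack returns (5, 7) while the off-by-one version returns 14: a small, believable number that is simply wrong. Test inputs with the top bit of the lower field set expose the bug even more clearly.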
5. Incorrect ABI encoding lengths
The call(gas, addr, value, inputOffset, inputSize, outputOffset, outputSize) opcode requires exact byte counts. An inputSize of 0x44 (68 bytes) is correct for transfer(address,uint256): selector (4) + address (32) + uint256 (32). Using 0x40 (64 bytes) silently drops the last 4 bytes of the final argument.
6. Returndata confusion after calls
Using returndatasize() as a free zero push only works before any external call. After a call, returndatasize() reflects the callee's return data. Code that uses returndatasize() as zero after a call that returned data will silently misbehave.
See: M6 - Free Zero Tricks
7. Missing overflow checks in unchecked arithmetic
Assembly arithmetic is always unchecked. add(a, b) wraps silently on overflow. Any arithmetic on user-supplied values needs explicit overflow checking: lt(result, a) after an addition, or div(mul(x, y), x) != y (for nonzero x) after a multiplication.
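The two checks can be sketched as follows. The library and function names are hypothetical, and the bare revert(0, 0) stands in for whatever error a real codebase would use.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// Hypothetical helpers showing explicit overflow checks in assembly.
library CheckedMathSketch {
    function checkedAdd(uint256 a, uint256 b) internal pure returns (uint256 r) {
        assembly {
            r := add(a, b)
            // If the sum wrapped around, it is smaller than either operand.
            if lt(r, a) { revert(0, 0) }
        }
    }

    function checkedMul(uint256 a, uint256 b) internal pure returns (uint256 r) {
        assembly {
            r := mul(a, b)
            // No overflow when a == 0, or when dividing the product by a gives b back.
            if iszero(or(iszero(a), eq(div(r, a), b))) { revert(0, 0) }
        }
    }
}
```

Note that div(x, 0) returns 0 in the EVM rather than reverting, which is why the a == 0 case is handled explicitly instead of relying on the division.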
8. Reentrancy through unprotected callbacks
Assembly-level external calls (call, delegatecall) transfer execution to untrusted code. If storage state hasn't been updated before the call, the classic reentrancy vector applies. Assembly doesn't have Solidity's modifier sugar, so the check-effects-interactions pattern must be followed manually.
9. Gas griefing via unbounded returndata
A malicious callee can return megabytes of data, forcing the caller to pay for memory expansion. The returndatasize() value after an untrusted call should be bounded before any returndatacopy.
See: M5 - The Returnbomb Attack
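One way to bound the copy, sketched under the assumption that the caller picks a maximum number of bytes it is willing to pay for. The names are hypothetical and the function is not taken from any particular library.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// Hypothetical helper: call `target` but copy at most `maxCopy` bytes of return data.
library BoundedCallSketch {
    function boundedCall(address target, bytes memory data, uint256 maxCopy)
        internal
        returns (bool ok, bytes memory ret)
    {
        assembly {
            // Output size 0: the call itself copies nothing into memory.
            ok := call(gas(), target, 0, add(data, 0x20), mload(data), 0, 0)

            // Clamp the amount we are willing to copy.
            let len := returndatasize()
            if gt(len, maxCopy) { len := maxCopy }

            ret := mload(0x40)
            mstore(ret, len)                       // length word of the bytes array
            returndatacopy(add(ret, 0x20), 0, len) // copy only the bounded prefix
            // Advance the FMP past the length word and data, rounded up to 32 bytes.
            mstore(0x40, and(add(add(ret, 0x3f), len), not(0x1f)))
        }
    }
}
```

This bounds memory expansion only. The callee can still burn the gas that was forwarded to it, so an untrusted call may also need an explicit gas limit.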
Job Market Context
"What do you look for when auditing inline assembly?"
Answer
- Good answer: "Unchecked return values, missing overflow checks, and dirty memory assumptions."
- Great answer: "I use a checklist: return value checks on all external calls, return data validation for non-standard tokens, shift amount correctness for packed storage, memory safety when Solidity code follows the assembly block, and gas griefing vectors from unbounded returndata. I trace one execution path first to understand the logic, then check each branch against these patterns."
Build Exercise: AssemblyReader
Workspace: AssemblyReader.sol | Tests
The challenge: Three fully-implemented assembly functions with no comments. Your job is to read each one, understand what it does, and prove your understanding by writing a pure Solidity equivalent that produces the same output.
What you'll practice:
- Reading packed storage access (M3 skills)
- Reading branchless math patterns (M6 skills)
- Reading custom calldata decoding (M2 skills)
3 TODOs: implement solveA(), solveB(), and solveC() in Solidity. Tests compare your output against the assembly version for various inputs including edge cases.
Goal: Build the habit of translating assembly back to Solidity. If you can write a Solidity function that matches the assembly output for all inputs, you truly understand the assembly.
Run:
FOUNDRY_PROFILE=part4 forge test --match-path "test/part4/module7/exercise1-assembly-reader/*"
Key Takeaways: A Reading Methodology
After this section, you should be able to:
- Apply a 5-step reading methodology to any assembly block: identify the pattern type, read the interface, draw the data layout, trace one path, identify the tricks
- Choose the right "How to Study" strategy from M1-M6 based on the assembly pattern (memory-heavy, storage-heavy, dispatch-heavy, call-heavy, optimization-heavy)
- Scan assembly for the 9 most common bug classes: unchecked return values, missing return data validation, dirty memory, shift off-by-ones, encoding length errors, returndata confusion, unchecked overflow, reentrancy, and gas griefing
Check your understanding
- 5-step reading methodology: (1) Identify the pattern type (memory-heavy, storage-heavy, call-heavy, etc.) to pick the right mental model. (2) Read the interface (signature, NatSpec, return types) before any opcodes. (3) Draw the data layout (memory, storage, or calldata). (4) Trace one execution path end-to-end. (5) Identify the optimization tricks used (free zero via returndatasize(), branchless patterns, scratch space).
- Module-specific study strategies: Each M1-M6 module has a "How to Study" section tuned to its pattern type: M2 for memory layouts, M3 for storage slot computation, M4 for dispatch tracing, M5 for call lifecycle, M6 for opcode tricks. Choose the strategy that matches the dominant pattern in the assembly you're reading.
- 9 common assembly bug classes: The most critical are unchecked call return values (silent failure), missing return data validation (non-standard tokens like USDT), and dirty memory corruption (writing past FMP without advancing it when Solidity code follows). Each maps to a specific module's content and has a known defensive pattern.
Guided Walkthroughs
The methodology from the previous section is abstract until you see it in action. This section applies all 5 steps to three production codebases: Uniswap V3's FullMath, Solady's FixedPointMathLib, and Solady's ERC20. Each walkthrough demonstrates the approach, not just the code.
After these walkthroughs, you'll have seen the methodology applied to arithmetic-heavy, algorithm-heavy, and application-heavy assembly. The exercises then ask you to apply it yourself.
Walkthrough: Uniswap V3 FullMath
Why this file: FullMath.mulDiv has been mentioned across M1, M2, and M6 but never fully walked through. It's one of the most referenced pieces of DeFi assembly: every protocol that computes a * b / denominator without intermediate overflow either uses it directly or reimplements the same trick.
Source: Uniswap V3 FullMath.sol
Step 1 - Identify the pattern: Arithmetic-heavy. The assembly uses mul, mulmod, div, sub, lt, with no sload, no call, and no mstore beyond local variables. This is pure computation on the stack.
Step 2 β Read the interface:
function mulDiv(uint256 a, uint256 b, uint256 denominator) internal pure returns (uint256 result)
Computes (a * b) / denominator with full 512-bit precision on the intermediate product. No overflow, no precision loss, deterministic rounding (down). This is the foundation of fee calculations, price conversions, and liquidity math.
Step 3 - Draw the data layout: No memory or storage; everything lives on the stack. The key data structure is a 512-bit number represented as two uint256 variables:
+--------------------------------+--------------------------------+
| prod1 (high)                   | prod0 (low)                    |
| upper 256 bits                 | lower 256 bits                 |
+--------------------------------+--------------------------------+
512-bit product = a × b
Step 4 - Trace the happy path:
The core trick is computing a * b as a 512-bit number:
assembly {
    // mulmod gives (a * b) mod (2^256 - 1), NOT mod 2^256
    // mul gives (a * b) mod 2^256, the lower 256 bits
    let mm := mulmod(a, b, not(0)) // mm = (a*b) mod (2^256 - 1)
    prod0 := mul(a, b)             // prod0 = (a*b) mod 2^256
    // The difference tells us the upper 256 bits
    prod1 := sub(sub(mm, prod0), lt(mm, prod0))
}
Why this works β the two-mod trick:
The EVM has two multiplication opcodes that keep different remainders:
- mul(a, b) computes a × b mod 2^256, the standard wraparound. This gives us prod0, the lower 256 bits.
- mulmod(a, b, not(0)) computes a × b mod (2^256 - 1). Because 2^256 leaves remainder 1 when divided by 2^256 - 1, this remainder is congruent to prod0 + prod1. Modulo 2^256 - 1, it therefore differs from prod0 by exactly prod1 (the carry/overflow) when the product exceeds 256 bits.
The subtraction sub(mm, prod0) gives us a value related to prod1, and the lt(mm, prod0) handles the borrow when mm < prod0. The result: prod1 contains the upper 256 bits of the full product.
Think of it like this: if you multiply two 3-digit numbers and only keep the last 3 digits, you lose the carry. But if you also keep the remainder after dividing by 999, the difference between those two remainders is the carry. Same principle, at 256-bit scale.
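The 3-digit analogy can be checked directly. The sketch below (hypothetical names, decimal scale only) keeps the remainder mod 1000 as the "low digits" and recovers the carry from the remainder mod 999, mirroring the roles of mul and mulmod(a, b, not(0)).

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// Hypothetical decimal-scale model of the two-modulus trick.
library TwoModSketch {
    /// Splits a * b into low (last three digits) and high (the carry),
    /// using only the remainders mod 1000 and mod 999.
    function mulBase1000(uint256 a, uint256 b) internal pure returns (uint256 low, uint256 high) {
        require(a < 1000 && b < 1000, "3-digit inputs only");
        low = mulmod(a, b, 1000);       // plays the role of mul(a, b)
        uint256 mm = mulmod(a, b, 999); // plays the role of mulmod(a, b, not(0))
        // 1000 leaves remainder 1 mod 999, so mm is congruent to low + high.
        // high is at most 998, so it is fully determined by (mm - low) mod 999.
        high = (mm + 999 - (low % 999)) % 999;
    }
}
```

For 123 × 456 = 56088, the remainders are 88 (mod 1000) and 144 (mod 999), and 144 - 88 = 56 is the carry. The 256-bit version replaces the explicit modular subtraction with sub plus the lt borrow flag.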
The fast path β when the product fits in 256 bits:
if (prod1 == 0) {
require(denominator > 0);
assembly {
result := div(prod0, denominator)
}
return result;
}
If prod1 is zero, the entire product fits in prod0, so standard division works. This handles the majority of real-world cases (small numbers, reasonable fee rates).
The 512-bit division path handles the case where prod1 > 0. It first requires denominator > prod1, which guarantees that the result fits in 256 bits. It then subtracts the remainder mulmod(a, b, denominator) from the 512-bit product so that the product becomes exactly divisible by the denominator, and divides out the largest power of two from both the denominator and the product, leaving an odd denominator. An odd number has a modular multiplicative inverse modulo 2^256: a number inv such that denominator * inv ≡ 1 (mod 2^256). Multiplying the reduced product by this inverse gives the exact quotient. The inverse is found with a Newton-style iteration (the same convergence idea as sqrt()), starting from a seed that is correct to 4 bits and doubling the number of correct bits each step. The key insight: 512-bit division is possible entirely in 256-bit EVM arithmetic, and FullMath does it with a fixed sequence of operations and no loops.
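The inverse computation on its own can be sketched like this. It follows the seed-and-double idea described above; the library and function names are hypothetical, and the input is assumed to be odd already (the power-of-two factor having been divided out beforehand).

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// Hypothetical standalone version of the modular inverse step.
library InverseSketch {
    /// Returns inv such that d * inv == 1 (mod 2^256). Requires d to be odd.
    function inverseMod2Pow256(uint256 d) internal pure returns (uint256 inv) {
        require((d & 1) == 1, "d must be odd");
        unchecked {
            // Seed: correct to 4 bits, i.e. d * inv == 1 (mod 2^4). `^` is XOR here.
            inv = (3 * d) ^ 2;
            // Each step doubles the number of correct low bits.
            inv *= 2 - d * inv; // mod 2^8
            inv *= 2 - d * inv; // mod 2^16
            inv *= 2 - d * inv; // mod 2^32
            inv *= 2 - d * inv; // mod 2^64
            inv *= 2 - d * inv; // mod 2^128
            inv *= 2 - d * inv; // mod 2^256
        }
    }
}
```

The unchecked block matters: every multiplication and subtraction is meant to wrap modulo 2^256, which is exactly the arithmetic the inverse is defined in.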
Step 5 - Identify the tricks:
- not(0) instead of type(uint256).max: saves bytecode (a short push plus NOT, versus a 33-byte PUSH32)
- lt(mm, prod0) as a borrow flag: branchless subtraction with carry
- Modular inverse computation: number theory, not branchless tricks. This is algorithm design, not gas optimization
- No memory allocation: everything on the stack. Pure stack manipulation keeps gas minimal
DeFi Pattern Connection
mulDiv appears everywhere precise token math is needed:
- AMM price calculations: amountOut = reserveOut * amountIn / (reserveIn + amountIn), but with full precision
- Fee computation: fee = amount * feeRate / 1e6, where rounding matters when millions of dollars flow through
- Vault share conversion: shares = assets * totalShares / totalAssets, the ERC-4626 core calculation
- Liquidity math: Uniswap V3's concentrated liquidity formulas call mulDiv repeatedly during a swap
Solady's FixedPointMathLib.mulDiv is a refined version of the same algorithm with additional gas optimizations and branchless patterns layered on top.
Walkthrough: Solady sqrt() and log2()
Why this section: M6 explicitly deferred sqrt() and log2() to M7: "Don't get stuck on the bit-manipulation in sqrt() and log2() - come back to them after M7." Time to deliver on that promise.
Source: Solady FixedPointMathLib.sol
Both functions use the same core technique: binary search by bit-shifting. Instead of looping, they test progressively smaller bit ranges to narrow in on the answer, all branchless, all in constant gas.
sqrt() - Integer Square Root
Step 1 - Identify the pattern: Arithmetic-heavy, optimization-heavy. No storage, no calls. Uses shr, shl, lt, add, div, the signature tools of bit-level binary search.
Step 2 - Read the interface:
function sqrt(uint256 x) internal pure returns (uint256 z)
Returns floor(sqrt(x)), the largest integer whose square is less than or equal to x.
Step 3 - Data layout: Pure stack. The key variables are x (input), z (running result), and intermediate comparison values.
Step 4 - Trace the algorithm (using x = 625, expected result = 25):
The function works in two phases:
Phase 1 - Bit-length estimation (binary search for the initial guess):
assembly {
    z := 181 // starting constant (chosen for convergence properties)
    // Is x >= 2^136 (i.e. x > 0xff..ff with 34 hex digits)? If yes, r = 128
    let r := shl(7, lt(0xffffffffffffffffffffffffffffffffff, x))
    // r = 128 if x >= 2^136, else 0
    r := or(r, shl(6, lt(0xffffffffffffffffff, shr(r, x))))
    // After shifting x right by r bits, is it still >= 2^72?
    // If yes, add 64 to r
    r := or(r, shl(5, lt(0xffffffffff, shr(r, x))))
    // Continue halving: add 32 if the remaining value is >= 2^40
    r := or(r, shl(4, lt(0xffffff, shr(r, x))))
    // Add 16 if the remaining value is >= 2^24
    z := shl(shr(1, r), z)
    // Scale z by 2^(r/2): initial approximation of sqrt(x)
    // (Solady's source refines z once more at this point before Phase 2;
    // that step is left out of this simplified walkthrough.)
}
Each line asks: "Is the number bigger than this threshold?" If yes, it adds a power of 2 to the bit-length estimate r, and the next test works on x shifted down by r bits. After 4 tests, r is a coarse estimate (a multiple of 16) of the bit-length of x, and z is scaled to be a rough initial guess for sqrt(x).
With x = 625: 625 is below every threshold, so each lt returns 0. r stays 0, and z remains 181.
Phase 2 - Newton-Raphson refinement (7 fixed iterations):
assembly {
z := shr(1, add(z, div(x, z))) // iteration 1
z := shr(1, add(z, div(x, z))) // iteration 2
z := shr(1, add(z, div(x, z))) // iteration 3
z := shr(1, add(z, div(x, z))) // iteration 4
z := shr(1, add(z, div(x, z))) // iteration 5
z := shr(1, add(z, div(x, z))) // iteration 6
z := shr(1, add(z, div(x, z))) // iteration 7
}
Each line computes z = (z + x/z) / 2, the Newton-Raphson formula for square roots. This converges quadratically (roughly doubling the number of correct bits each step), so 7 iterations are enough for any 256-bit input given a reasonable initial guess.
With x = 625, z starts at 181 (all divisions are integer divisions):
- After iteration 1: (181 + 625/181) / 2 = (181 + 3) / 2 = 92
- After iteration 2: (92 + 625/92) / 2 = (92 + 6) / 2 = 49
- After iteration 3: (49 + 625/49) / 2 = (49 + 12) / 2 = 30
- After iteration 4: (30 + 625/30) / 2 = (30 + 20) / 2 = 25
- Iterations 5-7: (25 + 625/25) / 2 = (25 + 25) / 2 = 25, converged.
Phase 3 - Branchless final adjustment:
assembly {
    z := sub(z, lt(div(x, z), z))
}
Newton-Raphson can overshoot by 1. This subtracts 1 from z if x/z < z (meaning z*z > x). The lt() returns 0 or 1, so no branch is needed.
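When reading an unrolled routine like this, it helps to have an obviously correct reference to compare against. The sketch below is a plain loop-based Babylonian square root, not Solady's code, and the names are hypothetical. It is the kind of "pure Solidity equivalent" a fuzz test could check the assembly version against.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// Hypothetical loop-based reference for floor(sqrt(x)).
library SqrtReferenceSketch {
    function sqrtReference(uint256 x) internal pure returns (uint256 z) {
        if (x > 3) {
            z = x;
            uint256 y = x / 2 + 1; // first guess, always >= sqrt(x) for x > 3
            // Iterate while the estimate keeps decreasing.
            while (y < z) {
                z = y;
                y = (x / y + y) / 2; // same Newton-Raphson step as the unrolled version
            }
        } else if (x != 0) {
            z = 1; // sqrt of 1, 2, 3 rounds down to 1
        }
        // x == 0 leaves z at its default value of 0
    }
}
```

The loop costs more gas and its iteration count depends on the input, which is exactly what the unrolled, fixed-count version avoids.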
Step 5 - Tricks spotted:
- No loops: fixed iteration count means constant gas cost
- Branchless binary search: shl(N, lt(threshold, x)) adds 2^N without JUMPI
- Branchless final correction: sub(z, lt(...)) instead of if/else
- Unrolled Newton-Raphson: 7 copies of the same line. Looks repetitive, but eliminates loop overhead (JUMPI + JUMPDEST + counter management per iteration)
log2() - Integer Base-2 Logarithm
The same binary search pattern, taken further. Where sqrt() uses 4 comparison levels, log2() uses 8, one for each power of 2 from 128 down to 1:
assembly {
// Start with the largest possible contribution: 128
r := shl(7, lt(0xffffffffffffffffffffffffffffffff, x))
// Each subsequent line tests the next power, working on the shifted value
r := or(r, shl(6, lt(0xffffffffffffffff, shr(r, x))))
r := or(r, shl(5, lt(0xffffffff, shr(r, x))))
r := or(r, shl(4, lt(0xffff, shr(r, x))))
r := or(r, shl(3, lt(0xff, shr(r, x))))
r := or(r, shl(2, lt(0xf, shr(r, x))))
r := or(r, shl(1, lt(0x3, shr(r, x))))
r := or(r, lt(0x1, shr(r, x)))
}
How to read each line (take line 3 as an example):
r := or(r, shl(5, lt(0xffffffff, shr(r, x))))
- shr(r, x): shift x right by the bits we've already accounted for
- lt(0xffffffff, ...): is the remaining value > 2^32 - 1? Returns 0 or 1
- shl(5, ...): if yes, the contribution is 2^5 = 32
- or(r, ...): add this contribution to the running total
After all 8 lines, r contains floor(log2(x)) for x > 0. No loops, no branches, constant gas.
Example: log2(256) = log2(2^8) = 8.
- Line 1: 256 > 2^128 - 1? No -> +0. r = 0
- Line 2: 256 > 2^64 - 1? No -> +0. r = 0
- Line 3: 256 > 2^32 - 1? No -> +0. r = 0
- Line 4: 256 > 2^16 - 1? No -> +0. r = 0
- Line 5: 256 > 255 (0xff)? Yes -> +8. r = 8
- Line 6: shr(8, 256) = 1. 1 > 15? No -> +0. r = 8
- Line 7: 1 > 3? No -> +0. r = 8
- Line 8: 1 > 1? No -> +0. r = 8 (done)
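For comparison, here is the loop the eight unrolled lines replace. It is a hypothetical reference implementation for cross-checking, not library code, and it chooses to revert on zero (an assumption; the unrolled snippet above simply yields 0 for x = 0).

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// Hypothetical loop-based reference for floor(log2(x)).
library Log2ReferenceSketch {
    function log2Reference(uint256 x) internal pure returns (uint256 r) {
        require(x != 0, "log2(0) is undefined");
        // Count how many times x can be halved before reaching 1.
        while (x > 1) {
            x >>= 1;
            r++;
        }
    }
}
```

The loop runs up to 255 times for large inputs, while the unrolled binary search always performs exactly 8 comparisons.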
Job Market Context
"How does Solady implement sqrt()?"
Answer
- Good answer: "Binary search for the initial guess using bit-shifting, then Newton-Raphson refinement."
- Great answer: "It uses a branchless binary search that tests 4 thresholds to estimate the bit-length, scales an initial constant by 2^(bitLength/2), then runs exactly 7 unrolled Newton-Raphson iterations. A branchless final adjustment handles off-by-one. The whole thing runs in constant gas: no loops, no JUMPI."
Walkthrough: Solady ERC20 Transfer
Why this file: M4 showed a dispatch snippet from Solady's ERC20. But the transfer flow itself (balance lookup, underflow check, storage update, event emission) ties together M3 (storage), M5 (events), and M6 (optimization tricks) in one function. It's the complete picture.
Source: Solady ERC20.sol
Step 1 - Identify the pattern: Storage-heavy + optimization-heavy. The function reads and writes balance slots, emits an event, and uses scratch space throughout. No external calls.
Step 2 - Read the interface:
function transfer(address to, uint256 amount) public virtual returns (bool)
Transfers amount tokens from msg.sender to to. Reverts on insufficient balance. Emits Transfer(from, to, amount). Returns true.
Step 3 - Draw the data layout:
Storage: balances are stored in a mapping. Each address maps to a unique storage slot computed via:
balanceSlot(owner) = keccak256(owner (20 bytes) . low 12 bytes of the BALANCE_SLOT_SEED word)
Memory (scratch space, no FMP allocation):
Offset      Content                     Purpose
----------  --------------------------  --------------------------------
0x0c-0x1f   owner address (20 bytes)    } hashed together to
0x20-0x2b   low 12 bytes of seed word   } compute the balance slot
0x20-0x3f   amount                      event data (overwrites the seed)
The same memory region is reused for different purposes at different points in the function. This is the dirty memory pattern from M6. It is safe here because 0x00-0x3f is scratch space, and because the function ends with an assembly return and never allocates Solidity memory.
Step 4 - Trace the transfer flow:
assembly {
    // 1. Compute sender's balance slot
    mstore(0x0c, _BALANCE_SLOT_SEED)
    mstore(0x00, caller())
    let fromBalanceSlot := keccak256(0x0c, 0x20)
    let fromBalance := sload(fromBalanceSlot)
    // 2. Check sufficient balance
    if gt(amount, fromBalance) {
        mstore(0x00, 0xf4d678b8) // InsufficientBalance selector
        revert(0x1c, 0x04)       // revert with just the selector
    }
    // 3. Update sender balance
    sstore(fromBalanceSlot, sub(fromBalance, amount))
    // 4. Compute receiver's balance slot (reuses scratch space; the seed is still in place)
    mstore(0x00, to)
    let toBalanceSlot := keccak256(0x0c, 0x20)
    // 5. Update receiver balance
    sstore(toBalanceSlot, add(sload(toBalanceSlot), amount))
    // 6. Emit Transfer event
    mstore(0x20, amount)
    log3(
        0x20,                      // data offset (amount)
        0x20,                      // data size (32 bytes)
        _TRANSFER_EVENT_SIGNATURE, // topic 0: event signature
        caller(),                  // topic 1: from
        shr(96, shl(96, to))       // topic 2: to (cleaned)
    )
    // 7. Return true
    mstore(0x00, 1)
    return(0x00, 0x20)
}
The slot computation trick (step 1): mstore(0x0c, _BALANCE_SLOT_SEED) writes the 32-byte seed word at bytes 12-43, so the low 12 bytes of the seed land at bytes 32-43. mstore(0x00, caller()) then overwrites bytes 0-31 with the right-aligned address, which puts the 20-byte address at bytes 12-31 and leaves bytes 32-43 untouched. keccak256(0x0c, 0x20) hashes 32 bytes starting at byte 12: the 20-byte address followed by the low 12 bytes of the seed. Together these 32 bytes form a unique key for the balance mapping. This is a compact mapping slot computation using overlapping memory writes, and the order matters: the seed is written first because the address write overlaps its upper bytes.
The error trick (step 2): revert(0x1c, 0x04), not revert(0x00, 0x04). The selector was written at offset 0x00 as a full 32-byte word (left-padded with zeros). The actual 4-byte selector sits at bytes 28-31 (0x1c-0x1f). Reverting from 0x1c with length 4 sends exactly the selector. This is the same pattern covered in M2.
The address cleaning trick (step 6): shr(96, shl(96, to)) shifts left 96 bits (pushing the top 96 bits out of the word) then shifts right 96 bits (moving the address back). This masks the address to exactly 20 bytes, discarding any dirty upper bits. The log3 opcode takes each topic as a full 32-byte word, so the upper bits must be clean.
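A quick way to convince yourself of what a scratch-space slot computation actually hashes is to rebuild the same 32 bytes in plain Solidity. The sketch below assumes the seed word is written at offset 0x0c before the address is written at 0x00, and uses 0x87a211a2 as an illustrative 4-byte seed; the library and function names are hypothetical.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// Hypothetical cross-check of an overlapping-write slot computation.
library BalanceSlotSketch {
    // Illustrative 4-byte seed; an assumption made for this example.
    uint256 private constant SEED = 0x87a211a2;

    function slotViaScratch(address owner) internal pure returns (bytes32 slot) {
        assembly {
            mstore(0x0c, SEED)  // seed word occupies 0x0c..0x2b, seed bytes at 0x28..0x2b
            mstore(0x00, owner) // address lands at 0x0c..0x1f, 0x20..0x2b untouched
            slot := keccak256(0x0c, 0x20)
        }
    }

    function slotViaAbi(address owner) internal pure returns (bytes32) {
        // Same 32 bytes, spelled out: 20-byte address, 8 zero bytes, 4-byte seed.
        return keccak256(abi.encodePacked(owner, uint64(0), uint32(SEED)));
    }
}
```

If the two functions agree for arbitrary addresses, the memory map is right. If they disagree, either the offsets or the write order in the map is wrong, which is a useful check before trusting any hand-drawn layout.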
Step 5 - Tricks spotted:
| Line | Trick | From |
|---|---|---|
| keccak256(0x0c, 0x20) | Overlapping scratch space writes | M2, M3 |
| revert(0x1c, 0x04) | Selector-only revert from 32-byte word | M2 |
| No mload(0x40) anywhere | Entire function uses scratch space only | M6 |
| return(0x00, 0x20) at the end | Manual return bypasses Solidity ABI encoding | M4 |
| shr(96, shl(96, to)) | Address cleaning / masking | M1 |
DeFi Pattern Connection
This exact transfer pattern (with minor variations) appears in every Solady token: ERC20, ERC721, ERC1155. The slot computation and event emission techniques are the same; only the storage layout and event signatures change. Once you can read one Solady token transfer, you can read them all.
Brief: Other Precompiles
M5 covered ecrecover (precompile at address 0x01) in depth and noted: "Module 7 covers reading production code that uses these precompiles." Here's the landscape beyond ecrecover.
Precompiled contracts are EVM built-ins at addresses 0x01 through 0x0a. They perform computationally expensive operations in native code rather than EVM bytecode. All are called with staticcall:
// General pattern:
let success := staticcall(gas(), PRECOMPILE_ADDR, inputPtr, inputLen, outputPtr, outputLen)
| Address | Name | What It Does | Where in DeFi |
|---|---|---|---|
| 0x01 | ecrecover | Recover signer from ECDSA signature | Permit, EIP-712 (covered in M5) |
| 0x02 | SHA-256 | SHA-256 hash | Bitcoin SPV proofs, cross-chain bridges |
| 0x03 | RIPEMD-160 | RIPEMD-160 hash | Bitcoin address derivation in bridges |
| 0x04 | Identity | Copies input to output (memory copy) | Used internally by the compiler for bytes copying |
| 0x05 | ModExp | Modular exponentiation (base^exp mod modulus) | RSA verification, some ZK schemes |
| 0x06 | ecAdd | BN256 curve point addition | ZK proof verification |
| 0x07 | ecMul | BN256 curve scalar multiplication | ZK proof verification |
| 0x08 | ecPairing | BN256 pairing check | ZK proof verification (Tornado Cash, ZK rollups) |
| 0x09 | Blake2 | BLAKE2b compression | Zcash interoperability |
| 0x0a | Point evaluation | KZG point evaluation (EIP-4844) | Blob verification for L2 rollups |
When you'll encounter them:
- 0x02-0x03 (SHA-256, RIPEMD-160): Cross-chain bridges that verify Bitcoin transactions. You'll see staticcall(gas(), 2, ...) in Bitcoin relay contracts.
- 0x05 (ModExp): Rare in DeFi. Shows up in specialized cryptographic operations. The input encoding is complex: three length-prefixed values.
- 0x06-0x08 (BN256): ZK proof verification. Tornado Cash's verifier calls ecPairing to verify Groth16 proofs. ZK rollup verifier contracts (zkSync, Polygon zkEVM) use all three. The calling convention involves packed point coordinates.
- 0x0a (Point evaluation): Post-Dencun. Used by L2 contracts to verify blob data. You'll see this in rollup settlement contracts.
You don't need to memorize the input formats; they're well-documented in the EVM precompiles reference. The important thing is to recognize a precompile call when you see one: any staticcall to an address between 0x01 and 0x0a is a precompile, not a contract.
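As a recognition aid, here is what one of the simpler precompile calls looks like in full. The sketch calls the SHA-256 precompile at address 0x02 directly and can be compared against Solidity's built-in sha256; the library and function names are hypothetical.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// Hypothetical wrapper around the SHA-256 precompile (address 0x02).
library PrecompileSketch {
    function sha256ViaPrecompile(bytes memory data) internal view returns (bytes32 digest) {
        assembly {
            // staticcall(gas, addr, inOffset, inSize, outOffset, outSize)
            // Input: the raw bytes (skipping the 32-byte length word).
            // Output: 32 bytes written to scratch space at 0x00.
            if iszero(staticcall(gas(), 0x02, add(data, 0x20), mload(data), 0x00, 0x20)) {
                revert(0, 0)
            }
            digest := mload(0x00)
        }
    }
}
```

The tell-tale signs are the small literal address and the fixed output size. The other precompiles follow the same shape with different input encodings.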
Build Exercise: AssemblyAuditor
Workspace: AssemblyAuditor.sol | Tests
The challenge: Three assembly functions, each containing a subtle bug from the audit checklist. Your job is to find each bug and implement the fixed version.
What you'll practice:
- Spotting unchecked call return values (audit item #1)
- Catching off-by-one errors in bit shifts (audit item #4)
- Identifying dirty memory / FMP corruption (audit item #3)
3 TODOs: implement fixedApprove(), fixedUnpack(), and fixedCache(). Tests verify your fixes produce correct behavior. Bonus: the tests also demonstrate the bugs, so read the buggy function tests to see exactly how each vulnerability manifests.
Goal: Train your audit instincts. After this exercise, you should be able to spot these three bug classes (unchecked return values, shift off-by-ones, dirty memory) on sight in any assembly review.
Run:
FOUNDRY_PROFILE=part4 forge test --match-path "test/part4/module7/exercise2-assembly-auditor/*"
Key Takeaways: Guided Walkthroughs
After this section, you should be able to:
- Walk through FullMath's 512-bit multiplication trick and explain why two different mod operations recover the upper 256 bits
- Trace Solady's binary search pattern for sqrt() and log2(): branchless bit-shifting that replaces loops with unrolled comparisons
- Read a complete Solady ERC20 transfer and identify each trick: scratch space slot computation, selector-only revert, address cleaning, manual return
- Recognize precompile calls (staticcall to addresses 0x01-0x0a) and know which DeFi patterns use which precompiles
Check your understanding
- FullMath 512-bit multiplication: mul(a, b) computes (a * b) mod 2^256, giving the lower 256 bits (prod0). mulmod(a, b, not(0)) computes (a * b) mod (2^256 - 1), a slightly different remainder. The difference between these two values, with a borrow correction, recovers the upper 256 bits (prod1). Together they represent the full 512-bit product, enabling mulDiv without intermediate overflow, which is critical for fixed-point math in AMMs and vaults.
- Solady binary search for sqrt/log2: Instead of a loop, Solady unrolls the binary search into fixed steps, each using a branchless bit-shift: compare against a threshold, shift the result, repeat. This eliminates loop overhead (JUMPI, JUMPDEST, counter management). sqrt uses 4 threshold tests followed by 7 unrolled Newton-Raphson iterations and a final adjustment; log2 uses 8 threshold tests.
- Solady ERC20 transfer tricks: Uses scratch space (0x00-0x3f) instead of allocating memory for slot computation, reverts with a bare 4-byte selector on failure (no string errors), cleans addresses with shr(96, shl(96, to)), and manually writes return data, all avoiding compiler overhead.
- Precompile calls: staticcall to addresses 0x01-0x0a invokes EVM precompiles. DeFi uses ecrecover (0x01) for permit signatures, SHA-256 (0x02) for Bitcoin SPV proofs, modexp (0x05) for RSA verification, and the bn128 curve precompiles (0x06-0x08) for ZK proof verification.
Resources
Production Code (read alongside the walkthroughs):
- Uniswap V3 FullMath.sol: 512-bit mulDiv
- Solady FixedPointMathLib.sol: sqrt, log2, mulDiv, and more
- Solady ERC20.sol: full assembly token implementation
Reading Tools:
- evm.codes: Opcode reference with gas costs and stack effects
- Dedaub: Decompiler for deployed contracts
- forge inspect ContractName asm: View compiler-generated assembly
- cast disassemble: Disassemble raw bytecode
Precompile Reference:
- EVM precompiles (evm.codes): Input/output formats for all precompiles
- EIP-4844: Point evaluation precompile specification