# HPML Homework 1 - Results

**Machine:** AMD EPYC 7742 64-Core Processor (A100 GPU node)

---

## C1: Dot Product — Simple Loop (dp1.c)

**Compilation:** `gcc -O3 -Wall -o dp1 dp1.c`

```
N: 1000000     <T>: 0.000903 sec  B: 8.857 GB/sec   F: 2214177926.839 FLOP/sec
N: 300000000   <T>: 0.281009 sec  B: 8.541 GB/sec   F: 2135164559.020 FLOP/sec
```

- N=1M: 1000 reps, mean of last 500. Result = 1000000.0 ✓
- N=300M: 20 reps, mean of last 10. Result = 16777216.0 (= 2^24, float32 saturation)
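
The 2^24 saturation can be reproduced without the C code. A quick numpy sketch, using `np.cumsum` because it accumulates sequentially in the array's dtype (unlike `np.sum`), mimics C1's scalar `float` accumulator:

```python
import numpy as np

# np.cumsum accumulates sequentially in the input dtype (float32 here),
# mimicking C1's scalar float accumulator R.
ones = np.ones(17_000_000, dtype=np.float32)
print(np.cumsum(ones)[-1])  # 16777216.0 — stuck at 2**24, not 17000000.0
```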

---

## C2: Dot Product — 4x Unrolled Loop (c2.c)

**Compilation:** `gcc -O3 -Wall -o c2 c2.c`

```
N: 1000000     <T>: 0.000298 sec  B: 26.848 GB/sec  F: 6711888147.297 FLOP/sec
N: 300000000   <T>: 0.185558 sec  B: 12.934 GB/sec  F: 3233487597.925 FLOP/sec
```

- ~3x faster than C1 for N=1M, ~1.5x for N=300M
- N=300M result = 67108864.0 (= 2^26 = 4×2^24; the unrolled loop adds 4.0 per iteration, so the accumulator saturates at 2^26 rather than 2^24; see Q4)

---

## C3: Dot Product — CBLAS sdot (c3.c)

**Compilation:** `gcc -O3 -Wall -o c3 c3.c -lcblas`

```
N: 1000000     <T>: 0.000358 sec  B: 22.347 GB/sec  F: 5586648669.464 FLOP/sec
N: 300000000   <T>: 0.202802 sec  B: 11.834 GB/sec  F: 2958556254.055 FLOP/sec
```

- Optimized BLAS library; ~2.5x faster than C1 for N=1M
- N=300M result = 67108864.0 (= 2^26; the BLAS sdot presumably uses a similarly unrolled float32 accumulation)
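
For reference, the same routine can be called from Python through SciPy's low-level BLAS bindings (a sketch; assumes SciPy is installed and links a BLAS providing `sdot`):

```python
import numpy as np
from scipy.linalg import blas

# Single-precision dot product via the linked BLAS, as in c3.c.
x = np.ones(1_000_000, dtype=np.float32)
print(blas.sdot(x, x))  # 1000000.0 — matches the C3 result for N=1M
```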

---

## C4: Dot Product — Python Simple Loop (c4.py)

```
N: 1000000     <T>: 0.369145 sec  B: 0.022 GB/sec   F: 5417919.953 FLOP/sec
N: 300000000   <T>: 112.550329 sec  B: 0.021 GB/sec  F: 5330948.412 FLOP/sec
```

- ~400x slower than C1 simple loop — Python interpreter overhead dominates
- N=1M result = 1000000.0 ✓ (Python float is 64-bit double, no precision loss)
- N=300M result = 300000000.0 ✓ (Python float is double, accumulator doesn't saturate)
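
The measured kernel presumably has this shape (a sketch of c4.py's loop, not the exact submission). Note that mixing a Python `float` accumulator with numpy float32 scalars promotes the sum to 64 bits, which is why the result stays exact:

```python
import numpy as np

def dp(N, A, B):
    # Simple Python-level loop: one bytecode dispatch, type check, and
    # scalar box/unbox per element — hence the ~400x slowdown vs C1.
    R = 0.0
    for j in range(N):
        R += A[j] * B[j]   # promoted to a 64-bit accumulator
    return R

A = np.ones(1000, dtype=np.float32)
print(dp(1000, A, A))  # 1000.0
```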

---

## C5: Dot Product — numpy.dot (c5.py)

```
N: 1000000     <T>: 0.000126 sec  B: 63.707 GB/sec  F: 15926824129.559 FLOP/sec
N: 300000000   <T>: 0.150702 sec  B: 15.925 GB/sec  F: 3981362884.911 FLOP/sec
```

- Fastest overall! numpy.dot uses optimized BLAS + SIMD internally
- N=300M result = 300000000.0 ✓ (numpy likely uses pairwise summation / double accumulator)
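
numpy's exact result can be made plausible with `np.sum`, which documents pairwise summation (a sketch; `np.dot` itself defers to the linked BLAS, so its internal accumulation strategy is inferred, not guaranteed):

```python
import numpy as np

# Pairwise summation: every partial sum stays exactly representable in
# float32, so the total does not get stuck at 2**24 the way a
# sequential float32 accumulator would.
x = np.ones(30_000_000, dtype=np.float32)
print(np.sum(x))  # 30000000.0
```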

---

## Performance Summary Table

| Variant | N=1M Time (s) | N=1M GFLOP/s | N=1M GB/s | N=300M Time (s) | N=300M GFLOP/s | N=300M GB/s |
|---------|---------------|-------------|-----------|-----------------|---------------|-------------|
| C1: Simple loop | 0.000903 | 2.21 | 8.86 | 0.281009 | 2.14 | 8.54 |
| C2: Unrolled (4x) | 0.000298 | 6.71 | 26.85 | 0.185558 | 3.23 | 12.93 |
| C3: CBLAS sdot | 0.000358 | 5.59 | 22.35 | 0.202802 | 2.96 | 11.83 |
| C4: Python loop | 0.369145 | 0.005 | 0.02 | 112.550329 | 0.005 | 0.02 |
| C5: numpy.dot | 0.000126 | 15.93 | 63.71 | 0.150702 | 3.98 | 15.93 |

### Key Observations:
- **C5 (numpy.dot) is fastest** — dispatches to BLAS + SIMD, even beats hand-written C
- **C2 (unrolled) > C3 (CBLAS) > C1 (simple)** — unrolling helps ILP; this CBLAS links the reference implementation (not MKL), so it trails the hand-unrolled loop
- **C4 (Python loop) is ~400x slower than C1** — interpreter overhead per element kills performance
- **Arithmetic Intensity**: dot product = 2N FLOPs / 8N bytes = 0.25 FLOP/byte (memory-bound)
- **Float precision**: C1 saturates at 2^24, C2/C3 at 2^26, Python/numpy get exact results (double accumulator)

---

## Q1: Why Use Only the Second Half of Measurements? (5 pts)

Discarding the first half of repetitions eliminates **warmup / cold-start effects** that bias timing measurements:

1. **Cold caches**: On the first invocations, data must be fetched from main memory (DRAM). Subsequent runs benefit from data already in L1/L2/L3 cache, giving faster and more stable times.
2. **TLB and page faults**: The first access to allocated memory may trigger page faults (OS maps virtual → physical pages). After the first iteration, page table entries are populated.
3. **Branch prediction**: The CPU's branch predictor learns the loop pattern after a few iterations, reducing misprediction penalties.
4. **Instruction cache**: The function code gets loaded into I-cache on first call; later calls hit the cache.
5. **Frequency scaling**: Modern CPUs may ramp up clock frequency (turbo boost) after detecting sustained load.

By computing the mean over only the **second half**, we measure **steady-state performance** — the true throughput of the algorithm once all transient effects have settled. This is analogous to "burn-in" in Monte Carlo sampling, where early samples from a non-stationary distribution are discarded.
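
A minimal sketch of this measurement discipline (a hypothetical harness, not the assignment's exact code):

```python
import time

def steady_state_mean(fn, reps):
    """Time fn() reps times and average only the second half of the
    timings, discarding warmup transients (cold caches, page faults,
    branch-predictor training, frequency ramp-up)."""
    times = []
    for _ in range(reps):
        t0 = time.monotonic()
        fn()
        times.append(time.monotonic() - t0)
    second_half = times[reps // 2:]
    return sum(second_half) / len(second_half)
```

Usage: e.g. `steady_state_mean(lambda: sum(range(10**5)), 20)` returns the mean of the last 10 of 20 timings.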

---

## Q2: Roofline Model (15 pts)

**Parameters:** Peak compute = 200 GFLOP/s, Peak memory BW = 30 GB/s

**Ridge point:** AI = 200 / 30 = 6.67 FLOP/byte

**Dot product arithmetic intensity:**
- FLOPs = 2N (N multiplications + N additions)
- Bytes = 2 × N × 4 = 8N (read two float32 arrays)
- **AI = 2N / 8N = 0.25 FLOP/byte**

Since AI = 0.25 < ridge point = 6.67, the dot product is **memory-bandwidth bound**. The theoretical attainable performance at this AI is: 30 GB/s × 0.25 FLOP/byte = **7.5 GFLOP/s**.
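
In formula form (a sketch using the machine parameters above):

```python
def attainable_gflops(ai, peak_gflops=200.0, peak_bw_gbs=30.0):
    """Roofline model: attainable performance is capped by the lower of
    the compute roof and the bandwidth roof at intensity ai (FLOP/byte)."""
    return min(peak_gflops, peak_bw_gbs * ai)

print(attainable_gflops(0.25))  # 7.5 — the dot product's bandwidth ceiling
```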

### Roofline Plot

![Roofline Model](roofline.png)

### Data points on the roofline (all at AI = 0.25 FLOP/byte):

| Variant | N=1M (GFLOP/s) | N=300M (GFLOP/s) |
|---------|----------------|-------------------|
| C1: Simple loop | 2.21 | 2.14 |
| C2: Unrolled 4x | 6.71 | 3.23 |
| C3: CBLAS sdot | 5.59 | 2.96 |
| C4: Python loop | 0.005 | 0.005 |
| C5: numpy.dot | 15.93 | 3.98 |

**Observations:**
- All N=300M points fall **below the 7.5 GFLOP/s bandwidth ceiling** — properly memory-bound as expected.
- N=1M points for C2, C3, C5 **exceed the main memory bandwidth ceiling** because the 8MB working set (2×1M×4B) fits in L2/L3 cache, achieving much higher effective bandwidth.
- C4 (Python loop) is far below the roofline — it is **bound by interpreter overhead**, not memory bandwidth: the bottleneck is Python bytecode execution, not data transfer.

---

## Q3: Performance Differences for N=300M (C & Python variants) (5 pts)

Using **C1 simple loop (N=300M) as baseline** at 2.14 GFLOP/s:

| Variant | GFLOP/s | Speedup vs C1 | Explanation |
|---------|---------|---------------|-------------|
| C1: Simple loop | 2.14 | 1.0× (baseline) | Sequential accumulation, compiler auto-vectorization limited by loop-carried dependency on R |
| C2: Unrolled 4x | 3.23 | **1.5×** | Manual unrolling reduces loop overhead and breaks the dependency chain — 4 independent multiply-adds per iteration enable instruction-level parallelism (ILP). Compiler can pipeline operations better. |
| C3: CBLAS sdot | 2.96 | **1.4×** | The reference BLAS uses optimized loops and possibly SIMD. With an optimized BLAS such as Intel MKL (AVX-512 kernels), this would be significantly faster; here we link the reference implementation, hence the modest speedup. |
| C4: Python loop | 0.005 | **0.0025× (400× slower)** | Each iteration incurs massive Python interpreter overhead: bytecode dispatch, dynamic type checking, numpy scalar boxing/unboxing, and dictionary lookups. The actual FP multiply+add is negligible compared to interpreter overhead (~200ns/element vs ~1ns for C). |
| C5: numpy.dot | 3.98 | **1.9×** | numpy.dot dispatches to optimized BLAS which uses SIMD vectorization (SSE/AVX), loop unrolling, and cache-aware memory access patterns. Eliminates Python per-element overhead entirely — single C function call processes all N elements. |

**Key insight:** For large N (memory-bound regime), all optimized C variants (C2, C3, C5) are within ~2× of each other because they're all limited by **memory bandwidth** (~8–16 GB/s). The Python loop (C4) is uniquely terrible because it's bottlenecked on interpreter overhead, not memory or compute.

---

## Q4: Dot Product Results vs. Analytical (5 pts)

**Analytical result:** dot(ones(N), ones(N)) = 1×1 + 1×1 + ... + 1×1 = **N**

| Variant | N=1M Result | Expected | N=300M Result | Expected | Correct? |
|---------|-------------|----------|---------------|----------|----------|
| C1: Simple loop | 1,000,000 | 1,000,000 | **16,777,216** | 300,000,000 | ✗ |
| C2: Unrolled 4x | 1,000,000 | 1,000,000 | **67,108,864** | 300,000,000 | ✗ |
| C3: CBLAS sdot | 1,000,000 | 1,000,000 | **67,108,864** | 300,000,000 | ✗ |
| C4: Python loop | 1,000,000 | 1,000,000 | 300,000,000 | 300,000,000 | ✓ |
| C5: numpy.dot | 1,000,000 | 1,000,000 | 300,000,000 | 300,000,000 | ✓ |

### Explanation:

**Floating-point operations are not exact.** IEEE 754 float32 has 23 explicit mantissa bits + 1 implicit = **24 bits of precision**.

**C1 (result = 16,777,216 = 2²⁴):** The accumulator `R` is a `float`. When R reaches 2²⁴ = 16,777,216, the unit in the last place (ULP) is 2. Adding 1.0 to 2²⁴ should give 16,777,217, but this requires 25 bits of precision. IEEE 754 round-to-nearest-even rounds it back to 16,777,216. So `R` gets permanently "stuck" — all subsequent additions of 1.0 are silently dropped.

**C2/C3 (result = 67,108,864 = 2²⁶):** The unrolled loop evaluates `1*1 + 1*1 + 1*1 + 1*1 = 4.0` per iteration, then adds 4.0 to R. Adding 4.0 stays exact at larger values than adding 1.0 does: increments of 4.0 remain representable as long as the ULP is ≤ 4, i.e. until R reaches 2²⁶. At R = 2²⁶ the ULP is 8, and adding 4.0 lands exactly on the rounding boundary (4 = ULP/2), so round-to-nearest-even keeps R at 2²⁶. Thus the unrolled version accumulates 4× more before saturating: 4 × 2²⁴ = 2²⁶.
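
Both saturation points can be checked directly with numpy float32 scalars:

```python
import numpy as np

f32 = np.float32
# C1: at 2**24 the ULP is 2, so adding 1.0 is rounded away.
print(f32(2**24) + f32(1.0) == f32(2**24))   # True — accumulator stuck
# C2/C3: adding 4.0 stays exact while the ULP is <= 4 ...
print(f32(2**26 - 4) + f32(4.0))             # 67108864.0 — last exact step
# ... but at 2**26 (ULP = 8) the 4.0 is a tie and rounds back to even.
print(f32(2**26) + f32(4.0) == f32(2**26))   # True — stuck at 2**26
```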

**C4 Python (correct):** Python's `float` is **64-bit double** (53-bit mantissa). 2⁵³ ≈ 9×10¹⁵ >> 300M, so no precision loss occurs.

**C5 numpy.dot (correct):** numpy internally uses **pairwise summation** or a **double-precision accumulator**, avoiding the float32 accumulation error.

**Lesson:** Naive sequential float32 accumulation is dangerous for large N. Solutions include: using double-precision accumulators, Kahan compensated summation, or pairwise/recursive summation (as numpy does).
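
A sketch of Kahan compensated summation in float32 (an assumed illustration, not code from the assignment):

```python
import numpy as np

def kahan_sum_f32(values):
    """Kahan compensated summation: c recaptures the low-order bits
    that plain float32 addition would round away."""
    s = np.float32(0.0)
    c = np.float32(0.0)
    for x in values:
        y = np.float32(x - c)
        t = np.float32(s + y)
        c = np.float32(np.float32(t - s) - y)
        s = t
    return s

vals = [np.float32(2**24)] + [np.float32(1.0)] * 8
print(kahan_sum_f32(vals))           # 16777224.0 — the eight 1.0s survive
print(sum(vals[1:], start=vals[0]))  # 16777216.0 — naive float32 loses them
```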
