# HPML Homework 1 - Results

**Machine:** AMD EPYC 7742 64-Core Processor (A100 GPU node)

---

## C1: Dot Product — Simple Loop (dp1.c)

**Compilation:** `gcc -O3 -Wall -o dp1 dp1.c`

```
N: 1000000     <T>: 0.000903 sec  B: 8.857 GB/sec   F: 2214177926.839 FLOP/sec
N: 300000000   <T>: 0.281009 sec  B: 8.541 GB/sec   F: 2135164559.020 FLOP/sec
```

- N=1M: 1000 reps, mean of last 500. Result = 1000000.0 ✓
- N=300M: 20 reps, mean of last 10. Result = 16777216.0 (= 2^24, float32 saturation)
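
The 2^24 saturation can be reproduced without the C code. A quick numpy sketch, using `np.cumsum` because it accumulates sequentially in the array's dtype (unlike `np.sum`), mimics C1's scalar `float` accumulator:

```python
import numpy as np

# np.cumsum accumulates sequentially in the input dtype (float32 here),
# mimicking C1's scalar float accumulator R.
ones = np.ones(17_000_000, dtype=np.float32)
print(np.cumsum(ones)[-1])  # 16777216.0 — stuck at 2**24, not 17000000.0
```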

---

## C2: Dot Product — 4x Unrolled Loop (c2.c)

**Compilation:** `gcc -O3 -Wall -o c2 c2.c`

```
N: 1000000     <T>: 0.000298 sec  B: 26.848 GB/sec  F: 6711888147.297 FLOP/sec
N: 300000000   <T>: 0.185558 sec  B: 12.934 GB/sec  F: 3233487597.925 FLOP/sec
```

- ~3x faster than C1 for N=1M, ~1.5x for N=300M
- N=300M result = 67108864.0 (= 2^26 = 4×2^24; the unrolled loop adds 4.0 per iteration, so the accumulator saturates at 2^26 rather than 2^24; see Q4)

---

## C3: Dot Product — CBLAS sdot (c3.c)

**Compilation:** `gcc -O3 -Wall -o c3 c3.c -lcblas`

```
N: 1000000     <T>: 0.000358 sec  B: 22.347 GB/sec  F: 5586648669.464 FLOP/sec
N: 300000000   <T>: 0.202802 sec  B: 11.834 GB/sec  F: 2958556254.055 FLOP/sec
```

- Optimized BLAS library; ~2.5x faster than C1 for N=1M
- N=300M result = 67108864.0 (= 2^26; the BLAS sdot presumably uses a similarly unrolled float32 accumulation)
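
For reference, the same routine can be called from Python through SciPy's low-level BLAS bindings (a sketch; assumes SciPy is installed and links a BLAS providing `sdot`):

```python
import numpy as np
from scipy.linalg import blas

# Single-precision dot product via the linked BLAS, as in c3.c.
x = np.ones(1_000_000, dtype=np.float32)
print(blas.sdot(x, x))  # 1000000.0 — matches the C3 result for N=1M
```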

---

## C4: Dot Product — Python Simple Loop (c4.py)

```
N: 1000000     <T>: 0.369145 sec  B: 0.022 GB/sec   F: 5417919.953 FLOP/sec
N: 300000000   <T>: 112.550329 sec  B: 0.021 GB/sec  F: 5330948.412 FLOP/sec
```

- ~400x slower than C1 simple loop — Python interpreter overhead dominates
- N=1M result = 1000000.0 ✓ (Python float is 64-bit double, no precision loss)
- N=300M result = 300000000.0 ✓ (Python float is double, accumulator doesn't saturate)
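
The measured kernel presumably has this shape (a sketch of c4.py's loop, not the exact submission). Note that mixing a Python `float` accumulator with numpy float32 scalars promotes the sum to 64 bits, which is why the result stays exact:

```python
import numpy as np

def dp(N, A, B):
    # Simple Python-level loop: one bytecode dispatch, type check, and
    # scalar box/unbox per element — hence the ~400x slowdown vs C1.
    R = 0.0
    for j in range(N):
        R += A[j] * B[j]   # promoted to a 64-bit accumulator
    return R

A = np.ones(1000, dtype=np.float32)
print(dp(1000, A, A))  # 1000.0
```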

---

## C5: Dot Product — numpy.dot (c5.py)

```
N: 1000000     <T>: 0.000126 sec  B: 63.707 GB/sec  F: 15926824129.559 FLOP/sec
N: 300000000   <T>: 0.150702 sec  B: 15.925 GB/sec  F: 3981362884.911 FLOP/sec
```

- Fastest overall! numpy.dot uses optimized BLAS + SIMD internally
- N=300M result = 300000000.0 ✓ (numpy likely uses pairwise summation / double accumulator)
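
numpy's exact result can be made plausible with `np.sum`, which documents pairwise summation (a sketch; `np.dot` itself defers to the linked BLAS, so its internal accumulation strategy is inferred, not guaranteed):

```python
import numpy as np

# Pairwise summation: every partial sum stays exactly representable in
# float32, so the total does not get stuck at 2**24 the way a
# sequential float32 accumulator would.
x = np.ones(30_000_000, dtype=np.float32)
print(np.sum(x))  # 30000000.0
```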

---

## Performance Summary Table

| Variant | N=1M Time (s) | N=1M GFLOP/s | N=1M GB/s | N=300M Time (s) | N=300M GFLOP/s | N=300M GB/s |
|---------|---------------|-------------|-----------|-----------------|---------------|-------------|
| C1: Simple loop | 0.000903 | 2.21 | 8.86 | 0.281009 | 2.14 | 8.54 |
| C2: Unrolled (4x) | 0.000298 | 6.71 | 26.85 | 0.185558 | 3.23 | 12.93 |
| C3: CBLAS sdot | 0.000358 | 5.59 | 22.35 | 0.202802 | 2.96 | 11.83 |
| C4: Python loop | 0.369145 | 0.005 | 0.02 | 112.550329 | 0.005 | 0.02 |
| C5: numpy.dot | 0.000126 | 15.93 | 63.71 | 0.150702 | 3.98 | 15.93 |

### Key Observations:
- **C5 (numpy.dot) is fastest** — dispatches to BLAS + SIMD, even beats hand-written C
- **C2 (unrolled) > C3 (CBLAS) > C1 (simple)** — unrolling helps ILP; this CBLAS links the reference implementation (not MKL), so it trails the hand-unrolled loop
- **C4 (Python loop) is ~400x slower than C1** — interpreter overhead per element kills performance
- **Arithmetic Intensity**: dot product = 2N FLOPs / 8N bytes = 0.25 FLOP/byte (memory-bound)
- **Float precision**: C1 saturates at 2^24, C2/C3 at 2^26, Python/numpy get exact results (double accumulator)

---

## Q1: Why Use Only the Second Half of Measurements? (5 pts)

Discarding the first half of repetitions eliminates **warmup / cold-start effects** that bias timing measurements:

1. **Cold caches**: On the first invocations, data must be fetched from main memory (DRAM). Subsequent runs benefit from data already in L1/L2/L3 cache, giving faster and more stable times.
2. **TLB and page faults**: The first access to allocated memory may trigger page faults (OS maps virtual → physical pages). After the first iteration, page table entries are populated.
3. **Branch prediction**: The CPU's branch predictor learns the loop pattern after a few iterations, reducing misprediction penalties.
4. **Instruction cache**: The function code gets loaded into I-cache on first call; later calls hit the cache.
5. **Frequency scaling**: Modern CPUs may ramp up clock frequency (turbo boost) after detecting sustained load.

By computing the mean over only the **second half**, we measure **steady-state performance** — the true throughput of the algorithm once all transient effects have settled. This is analogous to "burn-in" in Monte Carlo sampling, where early samples from a non-stationary distribution are discarded.
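
A minimal sketch of this measurement discipline (a hypothetical harness, not the assignment's exact code):

```python
import time

def steady_state_mean(fn, reps):
    """Time fn() reps times and average only the second half of the
    timings, discarding warmup transients (cold caches, page faults,
    branch-predictor training, frequency ramp-up)."""
    times = []
    for _ in range(reps):
        t0 = time.monotonic()
        fn()
        times.append(time.monotonic() - t0)
    second_half = times[reps // 2:]
    return sum(second_half) / len(second_half)
```

Usage: e.g. `steady_state_mean(lambda: sum(range(10**5)), 20)` returns the mean of the last 10 of 20 timings.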

---

## Q2: Roofline Model (15 pts)

**Parameters:** Peak compute = 200 GFLOP/s, Peak memory BW = 30 GB/s

**Ridge point:** AI = 200 / 30 = 6.67 FLOP/byte

**Dot product arithmetic intensity:**
- FLOPs = 2N (N multiplications + N additions)
- Bytes = 2 × N × 4 = 8N (read two float32 arrays)
- **AI = 2N / 8N = 0.25 FLOP/byte**

Since AI = 0.25 < ridge point = 6.67, the dot product is **memory-bandwidth bound**. The theoretical attainable performance at this AI is: 30 GB/s × 0.25 FLOP/byte = **7.5 GFLOP/s**.
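
In formula form (a sketch using the machine parameters above):

```python
def attainable_gflops(ai, peak_gflops=200.0, peak_bw_gbs=30.0):
    """Roofline model: attainable performance is capped by the lower of
    the compute roof and the bandwidth roof at intensity ai (FLOP/byte)."""
    return min(peak_gflops, peak_bw_gbs * ai)

print(attainable_gflops(0.25))  # 7.5 — the dot product's bandwidth ceiling
```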

### Roofline Plot

![Roofline Model](roofline.png)

### Data points on the roofline (all at AI = 0.25 FLOP/byte):

| Variant | N=1M (GFLOP/s) | N=300M (GFLOP/s) |
|---------|----------------|-------------------|
| C1: Simple loop | 2.21 | 2.14 |
| C2: Unrolled 4x | 6.71 | 3.23 |
| C3: CBLAS sdot | 5.59 | 2.96 |
| C4: Python loop | 0.005 | 0.005 |
| C5: numpy.dot | 15.93 | 3.98 |

**Observations:**
- All N=300M points fall **below the 7.5 GFLOP/s bandwidth ceiling** — properly memory-bound as expected.
- N=1M points for C2, C3, C5 **exceed the main memory bandwidth ceiling** because the 8MB working set (2×1M×4B) fits in L2/L3 cache, achieving much higher effective bandwidth.
- C4 (Python loop) is far below the roofline — it is **bound by interpreter overhead**, not memory bandwidth: the bottleneck is Python bytecode execution, not data transfer.

---

## Q3: Performance Differences for N=300M (C & Python variants) (5 pts)

Using **C1 simple loop (N=300M) as baseline** at 2.14 GFLOP/s:

| Variant | GFLOP/s | Speedup vs C1 | Explanation |
|---------|---------|---------------|-------------|
| C1: Simple loop | 2.14 | 1.0× (baseline) | Sequential accumulation, compiler auto-vectorization limited by loop-carried dependency on R |
| C2: Unrolled 4x | 3.23 | **1.5×** | Manual unrolling reduces loop overhead and breaks the dependency chain — 4 independent multiply-adds per iteration enable instruction-level parallelism (ILP). Compiler can pipeline operations better. |
| C3: CBLAS sdot | 2.96 | **1.4×** | The reference BLAS uses optimized loops and possibly SIMD. With an optimized BLAS such as Intel MKL (AVX-512 kernels), this would be significantly faster; here we link the reference implementation, hence the modest speedup. |
| C4: Python loop | 0.005 | **0.0025× (400× slower)** | Each iteration incurs massive Python interpreter overhead: bytecode dispatch, dynamic type checking, numpy scalar boxing/unboxing, and dictionary lookups. The actual FP multiply+add is negligible compared to interpreter overhead (~200ns/element vs ~1ns for C). |
| C5: numpy.dot | 3.98 | **1.9×** | numpy.dot dispatches to optimized BLAS which uses SIMD vectorization (SSE/AVX), loop unrolling, and cache-aware memory access patterns. Eliminates Python per-element overhead entirely — single C function call processes all N elements. |

**Key insight:** For large N (memory-bound regime), all optimized C variants (C2, C3, C5) are within ~2× of each other because they're all limited by **memory bandwidth** (~8–16 GB/s). The Python loop (C4) is uniquely terrible because it's bottlenecked on interpreter overhead, not memory or compute.

---

## Q4: Dot Product Results vs. Analytical (5 pts)

**Analytical result:** dot(ones(N), ones(N)) = 1×1 + 1×1 + ... + 1×1 = **N**

| Variant | N=1M Result | Expected | N=300M Result | Expected | Correct? |
|---------|-------------|----------|---------------|----------|----------|
| C1: Simple loop | 1,000,000 | 1,000,000 | **16,777,216** | 300,000,000 | ✗ |
| C2: Unrolled 4x | 1,000,000 | 1,000,000 | **67,108,864** | 300,000,000 | ✗ |
| C3: CBLAS sdot | 1,000,000 | 1,000,000 | **67,108,864** | 300,000,000 | ✗ |
| C4: Python loop | 1,000,000 | 1,000,000 | 300,000,000 | 300,000,000 | ✓ |
| C5: numpy.dot | 1,000,000 | 1,000,000 | 300,000,000 | 300,000,000 | ✓ |

### Explanation:

**Floating-point operations are not exact.** IEEE 754 float32 has 23 explicit mantissa bits + 1 implicit = **24 bits of precision**.

**C1 (result = 16,777,216 = 2²⁴):** The accumulator `R` is a `float`. When R reaches 2²⁴ = 16,777,216, the unit in the last place (ULP) is 2. Adding 1.0 to 2²⁴ should give 16,777,217, but this requires 25 bits of precision. IEEE 754 round-to-nearest-even rounds it back to 16,777,216. So `R` gets permanently "stuck" — all subsequent additions of 1.0 are silently dropped.

**C2/C3 (result = 67,108,864 = 2²⁶):** The unrolled loop evaluates `1*1 + 1*1 + 1*1 + 1*1 = 4.0` per iteration, then adds 4.0 to R. Adding 4.0 stays exact at larger values than adding 1.0 does: increments of 4.0 remain representable as long as the ULP is ≤ 4, i.e. until R reaches 2²⁶. At R = 2²⁶ the ULP is 8, and adding 4.0 lands exactly on the rounding boundary (4 = ULP/2), so round-to-nearest-even keeps R at 2²⁶. Thus the unrolled version accumulates 4× more before saturating: 4 × 2²⁴ = 2²⁶.
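
Both saturation points can be checked directly with numpy float32 scalars:

```python
import numpy as np

f32 = np.float32
# C1: at 2**24 the ULP is 2, so adding 1.0 is rounded away.
print(f32(2**24) + f32(1.0) == f32(2**24))   # True — accumulator stuck
# C2/C3: adding 4.0 stays exact while the ULP is <= 4 ...
print(f32(2**26 - 4) + f32(4.0))             # 67108864.0 — last exact step
# ... but at 2**26 (ULP = 8) the 4.0 is a tie and rounds back to even.
print(f32(2**26) + f32(4.0) == f32(2**26))   # True — stuck at 2**26
```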

**C4 Python (correct):** Python's `float` is **64-bit double** (53-bit mantissa). 2⁵³ ≈ 9×10¹⁵ >> 300M, so no precision loss occurs.

**C5 numpy.dot (correct):** numpy internally uses **pairwise summation** or a **double-precision accumulator**, avoiding the float32 accumulation error.

**Lesson:** Naive sequential float32 accumulation is dangerous for large N. Solutions include: using double-precision accumulators, Kahan compensated summation, or pairwise/recursive summation (as numpy does).
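
A sketch of Kahan compensated summation in float32 (an assumed illustration, not code from the assignment):

```python
import numpy as np

def kahan_sum_f32(values):
    """Kahan compensated summation: c recaptures the low-order bits
    that plain float32 addition would round away."""
    s = np.float32(0.0)
    c = np.float32(0.0)
    for x in values:
        y = np.float32(x - c)
        t = np.float32(s + y)
        c = np.float32(np.float32(t - s) - y)
        s = t
    return s

vals = [np.float32(2**24)] + [np.float32(1.0)] * 8
print(kahan_sum_f32(vals))           # 16777224.0 — the eight 1.0s survive
print(sum(vals[1:], start=vals[0]))  # 16777216.0 — naive float32 loses them
```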
