# HPML Homework 1 - Results

**Machine:** AMD EPYC 7742 64-Core Processor (A100 GPU node)

---

## C1: Dot Product — Simple Loop (dp1.c)

**Compilation:** `gcc -O3 -Wall -o dp1 dp1.c`

```
N: 1000000     <T>: 0.000903 sec  B: 8.857 GB/sec   F: 2214177926.839 FLOP/sec
N: 300000000   <T>: 0.281009 sec  B: 8.541 GB/sec   F: 2135164559.020 FLOP/sec
```

- N=1M: 1000 reps, mean of last 500. Result = 1000000.0 ✓
- N=300M: 20 reps, mean of last 10. Result = 16777216.0 (= 2^24, float32 saturation: once the accumulator reaches 2^24, adding 1.0 rounds back to the same value, since the spacing between adjacent floats there is 2.0)

---

## C2: Dot Product — 4x Unrolled Loop (c2.c)

**Compilation:** `gcc -O3 -Wall -o c2 c2.c`

```
N: 1000000     <T>: 0.000298 sec  B: 26.848 GB/sec  F: 6711888147.297 FLOP/sec
N: 300000000   <T>: 0.185558 sec  B: 12.934 GB/sec  F: 3233487597.925 FLOP/sec
```

- ~3x faster than C1 for N=1M, ~1.5x for N=300M
- N=300M result = 67108864.0 (= 2^26 = 4×2^24, four partial sums each saturate at 2^24)
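A sketch of the four-partial-sum structure described above (an assumed shape; c2.c itself is not shown here). Each accumulator carries every fourth product, so with all-ones inputs each one saturates independently at 2^24:

```c
/* Assumed shape of the 4x-unrolled kernel (c2.c is not reproduced here).
   Four independent accumulators expose instruction-level parallelism;
   with all-ones inputs each one saturates separately at 2^24. */
float dpunroll(long N, const float *pA, const float *pB) {
    float r0 = 0.0f, r1 = 0.0f, r2 = 0.0f, r3 = 0.0f;
    long j;
    for (j = 0; j + 3 < N; j += 4) {
        r0 += pA[j]     * pB[j];
        r1 += pA[j + 1] * pB[j + 1];
        r2 += pA[j + 2] * pB[j + 2];
        r3 += pA[j + 3] * pB[j + 3];
    }
    for (; j < N; j++)            /* tail when N is not a multiple of 4 */
        r0 += pA[j] * pB[j];
    return r0 + r1 + r2 + r3;
}
```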

---

## C3: Dot Product — CBLAS sdot (c3.c)

**Compilation:** `gcc -O3 -Wall -o c3 c3.c -lcblas`

```
N: 1000000     <T>: 0.000358 sec  B: 22.347 GB/sec  F: 5586648669.464 FLOP/sec
N: 300000000   <T>: 0.202802 sec  B: 11.834 GB/sec  F: 2958556254.055 FLOP/sec
```

- BLAS library routine; ~2.5x faster than C1 for N=1M, though slightly slower than the hand-unrolled C2
- N=300M result = 67108864.0 (= 2^26, suggesting the BLAS kernel internally uses similar multi-accumulator unrolling)

---

## C4: Dot Product — Python Simple Loop (c4.py)

```
N: 1000000     <T>: 0.369145 sec  B: 0.022 GB/sec   F: 5417919.953 FLOP/sec
N: 300000000   <T>: 112.550329 sec  B: 0.021 GB/sec  F: 5330948.412 FLOP/sec
```

- ~400x slower than C1 simple loop — Python interpreter overhead dominates
- N=1M result = 1000000.0 ✓ (Python float is 64-bit double, no precision loss)
- N=300M result = 300000000.0 ✓ (Python float is double, accumulator doesn't saturate)

---

## C5: Dot Product — numpy.dot (c5.py)

```
N: 1000000     <T>: 0.000126 sec  B: 63.707 GB/sec  F: 15926824129.559 FLOP/sec
N: 300000000   <T>: 0.150702 sec  B: 15.925 GB/sec  F: 3981362884.911 FLOP/sec
```

- Fastest overall! numpy.dot uses optimized BLAS + SIMD internally
- N=300M result = 300000000.0 ✓ (numpy likely uses pairwise summation / double accumulator)

---

## Performance Summary Table

| Variant | N=1M Time (s) | N=1M GFLOP/s | N=1M GB/s | N=300M Time (s) | N=300M GFLOP/s | N=300M GB/s |
|---------|---------------|-------------|-----------|-----------------|---------------|-------------|
| C1: Simple loop | 0.000903 | 2.21 | 8.86 | 0.281009 | 2.14 | 8.54 |
| C2: Unrolled (4x) | 0.000298 | 6.71 | 26.85 | 0.185558 | 3.23 | 12.93 |
| C3: CBLAS sdot | 0.000358 | 5.59 | 22.35 | 0.202802 | 2.96 | 11.83 |
| C4: Python loop | 0.369145 | 0.005 | 0.02 | 112.550329 | 0.005 | 0.02 |
| C5: numpy.dot | 0.000126 | 15.93 | 63.71 | 0.150702 | 3.98 | 15.93 |

### Key Observations:
- **C5 (numpy.dot) is fastest** — dispatches to BLAS + SIMD, even beats hand-written C
- **C2 (unrolled) > C3 (CBLAS) > C1 (simple)** — unrolling exposes instruction-level parallelism; the CBLAS build is likely linked against reference BLAS (not an optimized library like MKL or OpenBLAS), so the hand-unrolled loop edges it out
- **C4 (Python loop) is ~400x slower than C1** — interpreter overhead per element kills performance
- **Arithmetic Intensity**: dot product = 2N FLOPs / 8N bytes = 0.25 FLOP/byte (memory-bound)
- **Float precision**: C1's single float accumulator saturates at 2^24; C2/C3's partial sums give 4×2^24 = 2^26; the Python loop (double accumulator) and numpy (likely pairwise summation) stay exact

