# HPML Homework 1 - Results

**Machine:** AMD EPYC 7742 64-Core Processor (A100 GPU node)

---

## C1: Dot Product — Simple Loop (dp1.c)

**Compilation:** `gcc -O3 -Wall -o dp1 dp1.c`

```
N: 1000000     <T>: 0.000903 sec  B: 8.857 GB/sec   F: 2214177926.839 FLOP/sec
N: 300000000   <T>: 0.281009 sec  B: 8.541 GB/sec   F: 2135164559.020 FLOP/sec
```

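- N=1M: 1000 reps, time averaged over the last 500. Result = 1000000.0 ✓
- N=300M: 20 reps, time averaged over the last 10. Result = 16777216.0 (= 2^24). The float32 accumulator stops growing there: 2^24 + 1 is not representable in float32 and rounds back to 2^24, so every further `+ 1.0` is lost.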
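
The stall at 2^24 can be reproduced without running 300M iterations. The sketch below is a minimal illustration, not code from dp1.c: it emulates float32 addition in pure Python by rounding through `struct` (the helper names `to_f32` and `f32_add` are made up for this example) and starts the accumulator just below 2^24.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def f32_add(a: float, b: float) -> float:
    """Emulate a float32 add: compute the sum in double, then round to float32."""
    return to_f32(a + b)

LIMIT = 2.0 ** 24  # 16777216.0

# Start two steps below the limit and keep adding 1.0, as the dot-product
# loop does when both vectors hold 1.0 (an assumption about the inputs).
acc = to_f32(LIMIT - 2.0)
history = []
for _ in range(5):
    acc = f32_add(acc, 1.0)
    history.append(acc)

print(history)
```

The accumulator climbs to 16777216.0 and then stays there, which matches the N=300M result above.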

---

## C2: Dot Product — 4x Unrolled Loop (c2.c)

**Compilation:** `gcc -O3 -Wall -o c2 c2.c`

```
N: 1000000     <T>: 0.000298 sec  B: 26.848 GB/sec  F: 6711888147.297 FLOP/sec
N: 300000000   <T>: 0.185558 sec  B: 12.934 GB/sec  F: 3233487597.925 FLOP/sec
```

- ~3x faster than C1 for N=1M, ~1.5x for N=300M
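- N=300M result = 67108864.0 (= 2^26 = 4×2^24). Each of the four partial sums receives 75M terms and stalls at 2^24, so the total is closer to the exact 300000000 than C1 but still wrong.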
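
The partial-sum argument can be written as a small model. This sketch assumes both input vectors are all 1.0 (inferred from the N=1M result equalling N), that each partial sum is a float32 that stops growing at 2^24, and that combining the partial sums at the end is exact. The function name `predicted_sum` and the lane counts are illustrative and not taken from c2.c.

```python
F32_LIMIT = 2 ** 24  # a float32 running sum of 1.0s stops growing here

def predicted_sum(n: int, lanes: int) -> int:
    """Predict an all-ones dot product computed with `lanes` float32 accumulators.

    Elements are dealt round-robin to the lanes; each lane counts up to at
    most F32_LIMIT, and the lane totals are then added exactly.
    """
    base, extra = divmod(n, lanes)
    total = 0
    for lane in range(lanes):
        count = base + (1 if lane < extra else 0)  # elements seen by this lane
        total += min(count, F32_LIMIT)
    return total

for lanes in (1, 4, 32):
    print(lanes, predicted_sum(300_000_000, lanes))
```

With one lane the model gives 16777216 (the C1 result) and with four lanes 67108864 (the C2 result). A hypothetical kernel with 32 lanes would stay exact at N=300M, since each lane only reaches 9375000.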

---

## C3: Dot Product — CBLAS sdot (c3.c)

**Compilation:** `gcc -O3 -Wall -o c3 c3.c -lcblas`

```
N: 1000000     <T>: 0.000358 sec  B: 22.347 GB/sec  F: 5586648669.464 FLOP/sec
N: 300000000   <T>: 0.202802 sec  B: 11.834 GB/sec  F: 2958556254.055 FLOP/sec
```

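- Optimized BLAS library; ~2.5x faster than C1 for N=1M and ~1.4x for N=300M, slightly slower than the hand-unrolled C2 in both cases
- N=300M result = 67108864.0 (= 2^26), the same value as C2, which suggests the linked `sdot` also accumulates in four float32 partial sums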

---

## C4: Dot Product — Python Simple Loop (c4.py)

```
N: 1000000     <T>: 0.369145 sec  B: 0.022 GB/sec   F: 5417919.953 FLOP/sec
N: 300000000   <T>: (running...)
```

- ~410x slower than the C1 simple loop at N=1M (0.369145 sec vs 0.000903 sec); Python interpreter overhead dominates
- N=1M result = 1000000.0 ✓ (exact: 10^6 is well below 2^24, and a Python float accumulator is a 64-bit double in any case)
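
For reference, a minimal sketch of an interpreted dot-product loop with the "mean of the second half of the repetitions" timing used in this report. It is not the actual c4.py: it uses plain Python lists to stay dependency-free (the real script may use NumPy float32 arrays), and the names `dot_loop`, `mean_second_half` and `benchmark` are hypothetical.

```python
import time

def dot_loop(a, b):
    """Dot product with an explicit loop: one multiply and one add per element."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def mean_second_half(times):
    """Average only the second half of the timings; the first half is warm-up."""
    tail = times[len(times) // 2:]
    return sum(tail) / len(tail)

def benchmark(n, reps):
    """Time `reps` runs of dot_loop on two all-ones vectors of length n."""
    a = [1.0] * n
    b = [1.0] * n
    times = []
    result = 0.0
    for _ in range(reps):
        start = time.perf_counter()
        result = dot_loop(a, b)
        times.append(time.perf_counter() - start)
    return result, mean_second_half(times)

if __name__ == "__main__":
    result, mean_t = benchmark(n=100_000, reps=10)
    print(f"result={result}  <T>={mean_t:.6f} sec")
```

Every iteration here goes through the interpreter's bytecode dispatch and boxed float objects, which is the overhead the ~410x gap reflects.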

---

## C5: Dot Product — numpy.dot (c5.py)

```
N: 1000000     <T>: 0.000126 sec  B: 63.707 GB/sec  F: 15926824129.559 FLOP/sec
N: 300000000   <T>: 0.150702 sec  B: 15.925 GB/sec  F: 3981362884.911 FLOP/sec
```

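- Fastest overall: ~7x faster than C1 for N=1M and ~1.9x for N=300M. numpy.dot uses optimized BLAS + SIMD internally
- N=300M result = 300000000.0 ✓ (exact). For float32 arrays numpy.dot normally dispatches to the BLAS `sdot` routine, so the exact result most likely means that kernel spreads the sum over enough SIMD partial sums that none reaches 2^24, or accumulates in higher precision. Pairwise summation is what `np.sum` uses, not `np.dot`.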

---

## Performance Summary Table

| Variant | N=1M Time (sec) | N=1M GFLOP/s | N=300M Time (sec) | N=300M GFLOP/s |
|---------|-----------------|--------------|-------------------|----------------|
| C1: Simple loop | 0.000903 | 2.21 | 0.281009 | 2.14 |
| C2: Unrolled (4x) | 0.000298 | 6.71 | 0.185558 | 3.23 |
| C3: CBLAS sdot | 0.000358 | 5.59 | 0.202802 | 2.96 |
| C4: Python loop | 0.369145 | 0.0054 | (running) | — |
| C5: numpy.dot | 0.000126 | 15.93 | 0.150702 | 3.98 |
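
The bandwidth and FLOP/s figures in this report are consistent with counting two float32 reads and two floating-point operations per element, with 1 GB = 10^9 bytes. The sketch below recomputes them from N and the mean time under those assumptions; `dot_metrics` is an illustrative helper, not part of the submitted programs.

```python
def dot_metrics(n: int, seconds: float, bytes_per_elem: int = 4):
    """Return (GB/s, GFLOP/s) for an n-element dot product taking `seconds`."""
    bytes_moved = 2 * n * bytes_per_elem  # two input arrays, each read once
    flops = 2 * n                         # one multiply and one add per element
    return bytes_moved / seconds / 1e9, flops / seconds / 1e9

rows = [
    ("C1", 300_000_000, 0.281009),
    ("C2", 300_000_000, 0.185558),
    ("C5", 300_000_000, 0.150702),
]
for name, n, t in rows:
    gbs, gflops = dot_metrics(n, t)
    print(f"{name}: {gbs:.3f} GB/sec  {gflops:.2f} GFLOP/s")
```

Small differences from the reported N=1M values are expected, because the printed times are rounded to six decimals.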

