================================================================
LAB 2 — FULL OUTPUT LOG
Thu Mar 5 11:57:33 AM UTC 2026
================================================================

================================================================
PART A: C1+C2 — Training with Timing
================================================================
Device: cuda
ResNet-18 params: 11,173,962
============================================================
C1/C2: Training — 5 epochs
optimizer=sgd workers=4 bn=on device=cuda
============================================================
Ep   Loss     Acc%     Data(s)   Train(s)   Total(s)
 1   1.8894   31.12%      0.65      13.41      15.77
 2   1.3499   50.39%      1.01      10.92      13.18
 3   1.0827   61.14%      0.87      11.63      13.88
 4   0.8919   68.36%      0.61      12.90      15.23
 5   0.7512   73.83%      0.65      12.94      15.28
[Q3] Trainable params : 11,173,962
[Q3] Params w/ grads  : 11,173,962
[Q3] Optimizer states : 62

================================================================
PART A: C3 — I/O Worker Sweep
================================================================
Device: cuda
ResNet-18 params: 11,173,962
============================================================
C3: I/O Optimization — Worker Sweep (5 epochs each)
============================================================
Workers   AvgData(s)   AvgTrain(s)   AvgTotal(s)
      0       13.396        12.060        27.086
      4        0.797        11.638        13.928
      8        0.852        11.728        14.123
     12        1.276        12.501        15.560
     16        1.367        12.548        15.719
/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py:624: UserWarning: This DataLoader will create 20 worker processes in total. Our suggested max number of worker in current system is 16, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  if max_num_worker_suggest is None:
     20        1.201        12.477        15.488
C3.2 => Best num_workers = 4 (avg total: 13.928s, avg data: 0.797s)
Plot saved -> c3_workers.png
=> Use --num_workers 4 for subsequent experiments

================================================================
PART A: C4 — GPU vs CPU
================================================================
Device: cuda
ResNet-18 params: 11,173,962
============================================================
C4: GPU vs CPU
============================================================
[CPU] Training 5 epochs...
Ep 1: loss=1.9320 acc=30.70% time=579.23s
Ep 2: loss=1.4355 acc=47.51% time=431.86s
Ep 3: loss=1.1419 acc=59.25% time=432.04s
Ep 4: loss=0.9589 acc=66.19% time=429.39s
Ep 5: loss=0.8158 acc=71.12% time=467.04s
[CPU] Avg epoch time: 467.91s
[CUDA] Training 5 epochs...
Ep 1: loss=1.9007 acc=30.89% time=7.07s
Ep 2: loss=1.4480 acc=46.87% time=6.49s
Ep 3: loss=1.2061 acc=56.13% time=6.50s
Ep 4: loss=0.9932 acc=64.87% time=6.56s
Ep 5: loss=0.8537 acc=69.84% time=6.52s
[CUDA] Avg epoch time: 6.63s
============================================================
C4 Summary: GPU vs CPU (workers=4)
============================================================
CPU avg: 467.91s/epoch
GPU avg: 6.63s/epoch
GPU speedup: 70.6x

================================================================
PART A: C5 — Optimizer Comparison
================================================================
Device: cuda
ResNet-18 params: 11,173,962
============================================================
C5: Optimizer Comparison
============================================================
[SGD]
Ep   Loss     Acc%     Train(s)   Total(s)
 1   1.9082   31.52%       5.80      6.82
 2   1.4078   48.34%       5.37      6.42
 3   1.1673   57.93%       5.37      6.40
 4   0.9827   65.60%       5.38      6.38
 5   0.8566   69.93%       5.38      6.39
[SGD_NESTEROV]
Ep   Loss     Acc%     Train(s)   Total(s)
 1   1.9148   31.59%       5.42      6.46
 2   1.3372   51.10%       5.41      6.44
 3   1.0420   62.62%       5.42      6.51
 4   0.8773   69.03%       5.43      6.50
 5   0.7533   73.69%       5.42      6.47
[ADAM]
Ep   Loss     Acc%     Train(s)   Total(s)
 1   2.2094   20.82%       5.55      6.62
 2   1.8813   27.24%       5.52      6.61
 3   1.8430   28.43%       5.52      6.60
 4   1.8157   30.26%       5.52      6.59
 5   1.8078   30.47%       5.54      6.60
============================================================
C5 Summary: Optimizer Comparison (workers=4)
============================================================
Optimizer      AvgLoss   AvgAcc%   AvgTrain(s)
sgd             1.2645    54.66%          5.46
sgd_nesterov    1.1849    57.61%          5.42
adam            1.9114    27.44%          5.53

================================================================
PART A: C6 — Without Batch Norm
================================================================
Device: cuda
ResNet-18 params: 11,173,962
============================================================
C6: Without Batch Norm — 5 epochs (SGD, workers=4)
============================================================
Ep   Loss     Acc%     Train(s)   Total(s)
 1   1.9335   26.70%       4.63      5.62
 2   1.5540   42.90%       4.21      5.28
 3   1.3556   51.21%       4.22      5.26
 4   1.1640   58.84%       4.21      5.18
 5   1.0096   64.68%       4.21      5.19
C6 Summary => avg loss: 1.4034, avg acc: 48.87%

================================================================
PART B: C7–C10 — TorchScript
================================================================
Device: cuda
Training 5 epochs before scripting...
Ep 1: loss=1.8304 acc=32.97%
Ep 2: loss=1.3699 acc=49.58%
Ep 3: loss=1.0921 acc=60.89%
Ep 4: loss=0.9062 acc=67.85%
Ep 5: loss=0.7824 acc=72.45%
C7: Scripted model saved -> resnet18_scripted.pt
C7: Save/load verification — max diff: 0.0e+00
============================================================
C8: TorchScript Model Graph
============================================================
graph(%self.1 : __torch__.lab2.ResNet,
      %x.1 : Tensor):
  %42 : int = prim::Constant[value=-1]()
  %13 : Function = prim::Constant[name="relu"]()
  %12 : bool = prim::Constant[value=0]() # :0:0
  %41 : int = prim::Constant[value=1]() # /home/ubuntu/hpml_nyu/lab2.py:80:29
  %bn1.1 : __torch__.torch.nn.modules.batchnorm.BatchNorm2d = prim::GetAttr[name="bn1"](%self.1)
  %conv1.1 : __torch__.torch.nn.modules.conv.Conv2d = prim::GetAttr[name="conv1"](%self.1)
  %9 : Tensor = prim::CallMethod[name="forward"](%conv1.1, %x.1) # /home/ubuntu/hpml_nyu/lab2.py:74:28
  %10 : Tensor = prim::CallMethod[name="forward"](%bn1.1, %9) # /home/ubuntu/hpml_nyu/lab2.py:74:19
  %x0.1 : Tensor = prim::CallFunction(%13, %10, %12) # :0:0
  %layer1.1 : __torch__.torch.nn.modules.container.___torch_mangle_1.Sequential = prim::GetAttr[name="layer1"](%self.1)
  %x1.1 : Tensor = prim::CallMethod[name="forward"](%layer1.1, %x0.1) # /home/ubuntu/hpml_nyu/lab2.py:75:12
  %layer2.1 : __torch__.torch.nn.modules.container.___torch_mangle_9.Sequential = prim::GetAttr[name="layer2"](%self.1)
  %x2.1 : Tensor = prim::CallMethod[name="forward"](%layer2.1, %x1.1) # /home/ubuntu/hpml_nyu/lab2.py:76:12
  %layer3.1 : __torch__.torch.nn.modules.container.___torch_mangle_17.Sequential = prim::GetAttr[name="layer3"](%self.1)
  %x3.1 : Tensor = prim::CallMethod[name="forward"](%layer3.1, %x2.1) # /home/ubuntu/hpml_nyu/lab2.py:77:12
  %layer4.1 : __torch__.torch.nn.modules.container.___torch_mangle_25.Sequential = prim::GetAttr[name="layer4"](%self.1)
  %x4.1 : Tensor = prim::CallMethod[name="forward"](%layer4.1, %x3.1) # /home/ubuntu/hpml_nyu/lab2.py:78:12
  %avgpool.1 : __torch__.torch.nn.modules.pooling.AdaptiveAvgPool2d = prim::GetAttr[name="avgpool"](%self.1)
  %x5.1 : Tensor = prim::CallMethod[name="forward"](%avgpool.1, %x4.1) # /home/ubuntu/hpml_nyu/lab2.py:79:12
  %x6.1 : Tensor = aten::flatten(%x5.1, %41, %42) # /home/ubuntu/hpml_nyu/lab2.py:80:12
  %fc.1 : __torch__.torch.nn.modules.linear.Linear = prim::GetAttr[name="fc"](%self.1)
  %48 : Tensor = prim::CallMethod[name="forward"](%fc.1, %x6.1) # /home/ubuntu/hpml_nyu/lab2.py:81:15
  return (%48)
============================================================
C9: Test Set Accuracy
============================================================
PyTorch model:     70.94%
TorchScript model: 70.94%
============================================================
C10: Latency Comparison (single image, ms)
============================================================
              CPU (ms)   GPU (ms)
PyTorch          12.17       2.03
TorchScript      11.22       1.26
CPU speedup: 1.08x
CUDA speedup: 1.61x

================================================================
DONE — Thu Mar 5 12:50:13 PM UTC 2026
================================================================
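NOTE (C1/C3 methodology sketch): the Data(s)/Train(s) split above can be measured by timing the DataLoader fetch separately from the forward/backward/step work inside each epoch. This is a minimal sketch, not the lab's actual script — it uses a tiny synthetic dataset and linear model as stand-ins (the log's runs used ResNet-18), with `num_workers=0` so it runs anywhere; the C3 sweep simply reruns this with different `num_workers` values.

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Tiny synthetic stand-ins so the sketch is self-contained.
ds = TensorDataset(torch.randn(256, 3, 8, 8), torch.randint(0, 10, (256,)))
loader = DataLoader(ds, batch_size=32, num_workers=0)  # C3 varies num_workers
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def run_epoch(model, loader):
    data_s = train_s = 0.0
    epoch_start = t0 = time.perf_counter()
    for x, y in loader:                 # time spent in this fetch = data time
        t1 = time.perf_counter()
        data_s += t1 - t0
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        # on GPU, call torch.cuda.synchronize() here before reading the clock
        t0 = time.perf_counter()
        train_s += t0 - t1
    total_s = time.perf_counter() - epoch_start
    return data_s, train_s, total_s

data_s, train_s, total_s = run_epoch(model, loader)
print(f"Data(s)={data_s:.3f} Train(s)={train_s:.3f} Total(s)={total_s:.3f}")
```

Timing the iterator this way explains why `num_workers=0` shows Data(s) comparable to Train(s): all decoding happens synchronously in the fetch, whereas with 4 workers it overlaps with compute and Data(s) collapses to under a second.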
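NOTE (C10 methodology sketch): single-image latency numbers like those above are only meaningful with warm-up iterations (to absorb JIT/cuDNN/allocator one-time costs) and, on GPU, `torch.cuda.synchronize()` around the timed region, since CUDA kernels launch asynchronously. A minimal sketch under those assumptions, demoed on CPU with a small stand-in model:

```python
import time
import torch
import torch.nn as nn

def measure_latency_ms(model, x, iters=50, warmup=5):
    """Median forward-pass latency in milliseconds for a single input."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):           # warm-up runs are not timed
            model(x)
        times = []
        for _ in range(iters):
            if x.is_cuda:
                torch.cuda.synchronize()   # flush pending async GPU work
            t0 = time.perf_counter()
            model(x)
            if x.is_cuda:
                torch.cuda.synchronize()   # wait for the kernel to finish
            times.append((time.perf_counter() - t0) * 1000.0)
    times.sort()
    return times[len(times) // 2]          # median is robust to outliers

# CPU demo; the log compared the eager and TorchScript ResNet-18 this way.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.randn(1, 3, 32, 32)
print(f"latency: {measure_latency_ms(model, x):.3f} ms")
```

Running the same function on the eager and `torch.jit.load`-ed models, on CPU and CUDA inputs, yields the four cells of the C10 table.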