# CANN/mat-chem-sim-pred PID Tuning Operator Benchmark

mat-chem-sim-pred targets the industrial domain and focuses on two core scenarios, computational simulation and prediction. It builds a domain computing layer for the process industry, driven by both mechanism and data, to advance the application of AI for Science in materials chemistry. Project address: https://gitcode.com/cann/mat-chem-sim-pred

# PidFopdtBatchRolloutScore Benchmark Report

This document records the measured CPU/NPU behavior of `PidFopdtBatchRolloutScore`.

## Environment

- NPU host: `node202`
- Device: `Ascend910B3`, device id `0`
- CANN: `/usr/local/Ascend/ascend-toolkit/latest`
- CPU baseline: benchmark program in multi-thread mode, 64 threads
- Build: `-DCMAKE_BUILD_TYPE=Release -DSOC_VERSION=Ascend910B3 -DRUN_MODE=npu`

## Correctness

The NPU output matches the CPU reference to within the errors tabulated below and is **bit-identical** to the output of the original 256-wide kernel. The candidate-axis SIMD lane width does not change the numerics (each tile is independent), so widening it leaves `max_abs_err` and `best_idx_diff_count` exactly as they were with the original 256-wide kernel.

Representative verified cases (B=128, S=1024, tile=C):

| candidates | max_abs_err | best_idx_diff_count | note |
| --- | --- | --- | --- |
| 1024 | 1.1e-4 | 0 | exact |
| 4096 | (tie) | 1 | a single argmin tie (two candidates with equal score); score rel-err 4.5e-3 |
| 16384 | 4.2e-4 | 1 | same pre-existing argmin tie |

The `best_idx_diff_count` of 1 at large C is a genuine argmin tie that is present in the original 256-wide kernel as well; it is not introduced by the optimization.

## Measured Result

node202 / Ascend910B3, B=128, sim_steps=1024, candidate_tile=C. Kernel time is the median of repeated runs. NPU kernel ms is stable; the CPU-64 baseline fluctuates on the shared node, so the speedup is given as the observed range.

| candidates | CPU-64 ms | NPU kernel ms | NPU kernel vs CPU-64 |
| --- | --- | --- | --- |
| 1024 | ~34 | 7.66 | ~4.4x |
| 4096 | ~135-172 | 25.42 | ~5.3-6.8x |
| 16384 | ~489 | 96.3 | ~5.1x |

These are the shipped numbers after both optimizations below (wider lane + fused inner loop).

## Optimization 1 - lane width (kLane 256 -> 768)

The rollout inner loop is a serial time recurrence (`y[k+1]` depends on `y[k]`), so the per-timestep chain of vector ops cannot be pipelined across steps.
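The serial dependency can be made concrete with a minimal plain-Python sketch of a batched PID-on-FOPDT rollout. It is an illustration only, not the kernel's code: the forward-Euler discretisation, the PID form, the default plant parameters, and the names `Gains` and `rollout_metrics` are all assumptions. The time loop has to run step by step because each `y[k+1]` needs `y[k]`; only the inner candidate loop is independent, and that is the axis a SIMD lane can cover.

```python
from collections import namedtuple

# Hypothetical container for one PID candidate (not a name from the operator).
Gains = namedtuple("Gains", "kp ki kd")


def rollout_metrics(gains, K=1.0, tau=5.0, delay_steps=3, target=1.0,
                    dt=0.1, steps=1024, ring_slots=32):
    """Roll out every PID candidate against one FOPDT plant.

    Assumed model (forward Euler): y[k+1] = y[k] + dt/tau * (K*u[k-d] - y[k]).
    Returns per-candidate lists (iae, ise, control_energy).
    """
    assert 0 <= delay_steps < ring_slots, "delay must fit in the ring"
    n = len(gains)
    y = [0.0] * n
    integ = [0.0] * n
    err = [target - yi for yi in y]      # e[0]
    prev_err = list(err)                 # avoids a derivative kick at k = 0
    ring = [[0.0] * n for _ in range(ring_slots)]  # delayed control history
    iae = [0.0] * n
    ise = [0.0] * n
    energy = [0.0] * n

    for k in range(steps):               # serial: step k+1 needs step k
        write = ring[k % ring_slots]
        read = ring[(k - delay_steps) % ring_slots]
        for c in range(n):               # independent: the SIMD candidate axis
            g = gains[c]
            integ[c] += err[c] * dt
            deriv = (err[c] - prev_err[c]) / dt
            u = g.kp * err[c] + g.ki * integ[c] + g.kd * deriv
            write[c] = u                 # store u[k] before reading u[k-d]
            y[c] += dt / tau * (K * read[c] - y[c])
            prev_err[c] = err[c]
            err[c] = target - y[c]       # e[k+1], reused at the next step
            iae[c] += abs(err[c]) * dt
            ise[c] += err[c] * err[c] * dt
            energy[c] += u * u * dt
    return iae, ise, energy


# Candidate 0 never acts; candidate 1 is a mild PI controller.
candidates = [Gains(0.0, 0.0, 0.0), Gains(1.0, 0.3, 0.0)]
iae, ise, energy = rollout_metrics(candidates)
best = min(range(len(candidates)), key=iae.__getitem__)  # first index wins ties
print(best, iae)
```

Ranking by IAE alone is a stand-in here; the operator's actual score definition is not given in this report.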
With a narrow SIMD lane each vector instruction processes few candidates (256 floats ≈ 4 compute cycles) yet still pays a fixed issue/latency cost of ~10-20 cycles, so the loop is **latency-bound**, not throughput-bound. Widening the candidate-axis lane amortises that fixed latency over more candidates per instruction (fewer instructions for the same work), turning the kernel throughput-bound and filling the vector unit.

`kLane=768` is the largest lane that keeps the full 8 state vectors + 17-block scratch + the 32-slot delay ring (delay spec 0..31) + I/O queues within the 192 KB UB budget.

## Optimization 2 - inner-loop instruction reduction

The rollout inner loop issued ~37 vector ops per timestep. Two structural changes cut that to ~32 without changing the result:

- the response error `e[k+1] = target - y[k+1]` is reused as the next step's error, dropping the redundant top-of-loop `target - y` recompute (saves 2 ops/step);
- the pure metric accumulators that do not feed back into the dynamics (`IAE`, `ISE`, `control_energy`) use the fused multiply-accumulate `Axpy` instead of a separate multiply + add (saves 3 ops/step).

The integral and the full state recurrence keep their explicit ops, so the simulated trajectory is unchanged; on this hardware `Axpy` matches the separate multiply + add bit-for-bit, so the whole result stays bit-identical to the original 256-wide kernel.

## Combined before/after

NPU kernel ms, same inputs, bit-identical output across all stages:

| candidates | kLane=256 (orig) | + wider lane (768) | + fused inner loop | total speedup |
| --- | --- | --- | --- | --- |
| 1024 | 14.13 | 8.60 | 7.66 | 1.84x |
| 4096 | 56.23 | 28.57 | 25.42 | 2.21x |
| 16384 | 224.6 | 108.5 | 96.3 | 2.33x |

## Interpretation

After both optimizations the operator is **competitive on a single card**: the NPU kernel is roughly 4-7x faster than the 64-thread CPU baseline at the candidate counts that dominate the tuning sweep, with results bit-identical to the original 256-wide kernel.
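As a rough cross-check of the two optimizations, the figures quoted in this report can be put into a toy cost model. This is a back-of-the-envelope sketch, not a profile: it assumes every vector op in a step is fully serialised and pays the whole fixed latency, it takes 64 floats per compute cycle (256 floats in about 4 cycles), and the function names are hypothetical.

```python
FLOATS_PER_CYCLE = 256 / 4   # from "256 floats ~ 4 compute cycles"


def cycles_per_candidate(lane, ops_per_step, fixed_latency):
    """Toy model: each op costs a fixed latency plus lane-proportional compute."""
    per_op = fixed_latency + lane / FLOATS_PER_CYCLE
    return ops_per_step * per_op / lane


def predicted_speedup(fixed_latency):
    """Return (gain from widening the lane, gain from dropping 5 ops/step)."""
    before = cycles_per_candidate(256, 37, fixed_latency)
    widened = cycles_per_candidate(768, 37, fixed_latency)
    fused = cycles_per_candidate(768, 32, fixed_latency)
    return before / widened, widened / fused


# Measured NPU kernel ms from the before/after table: (orig, wider lane, fused).
measured = {
    1024: (14.13, 8.60, 7.66),
    4096: (56.23, 28.57, 25.42),
    16384: (224.6, 108.5, 96.3),
}

for latency in (10, 20):     # the ~10-20 cycle range quoted in the report
    lane_gain, fuse_gain = predicted_speedup(latency)
    print(f"model, latency={latency}: lane x{lane_gain:.2f}, fusion x{fuse_gain:.2f}")
for c, (orig, wide, fused) in measured.items():
    print(f"measured, C={c}: lane x{orig / wide:.2f}, fusion x{wide / fused:.2f}")
```

Under these assumptions the model gives a lane gain of about 1.9x to 2.25x and a fusion gain of about 1.16x, against measured gains of 1.64x to 2.07x and about 1.12x to 1.13x. The measurements are in line with the latency-bound explanation without matching the toy model exactly.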
This is the current performance baseline for the FOPDT rollout operator.

The speedup comes from a layout change (wider candidate-axis SIMD) plus a lower instruction count per timestep, not from any accuracy trade-off: the time-stepping recurrence and the score definition are unchanged.

## Remaining headroom (not applied)

- The settling-time test is still ~6 ops/step; a cheaper branch-free reduction could trim a little more.
- `kLane=1024` reaches 22.95 ms (C=4096) / 91.31 ms (C=16384) but requires shrinking the delay ring (spec 0..31 -> 0..19) to fit UB; it is usable only when the max delay is ≤ 19.
- Cross-batch flattening (filling the lane from the next loop's candidates when one loop has fewer candidates than the lane) for the small-C regime; needs per-element plant params.
- Multi-card data parallelism scales absolute time linearly with card count (a hardware gain, not a single-card algorithmic speedup).
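The UB constraint behind the lane choices can be sanity-checked with simple arithmetic. The sketch below assumes, and this report does not confirm, that each of the 8 state vectors, each of the 17 scratch blocks, and each delay-ring slot is one lane-wide float32 buffer. The I/O queue footprint is not stated, so the code only reports the bytes left over for it. `ub_bytes` is a hypothetical helper.

```python
UB_BYTES = 192 * 1024   # 192 KB UB budget quoted in the report


def ub_bytes(lane, ring_slots, state_vecs=8, scratch_blocks=17, bytes_per_elem=4):
    """Bytes used if every buffer is one lane-wide float32 vector (assumption)."""
    return lane * bytes_per_elem * (state_vecs + scratch_blocks + ring_slots)


for lane, ring_slots in ((768, 32), (1024, 32), (1024, 20)):
    used = ub_bytes(lane, ring_slots)
    print(f"kLane={lane}, ring={ring_slots}: used={used} B, "
          f"left for I/O queues={UB_BYTES - used} B")
```

With these assumptions, `kLane=768` with the 32-slot ring uses 175,104 of 196,608 bytes, `kLane=1024` with the same ring would need 233,472 bytes and overflow, and `kLane=1024` with a 20-slot ring (delays 0..19) uses 184,320 bytes. That is consistent with the trade-off described for the wider lane.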