Matsubara Takashi, Miyatake Yuto, Yaguchi Takaharu
IEEE Trans Neural Netw Learn Syst. 2024 Aug;35(8):10526-10538. doi: 10.1109/TNNLS.2023.3242345. Epub 2024 Aug 5.
The combination of neural networks and numerical integration can provide highly accurate models of continuous-time dynamical systems and probabilistic distributions. However, if a neural network is used n times during numerical integration, the whole computation graph can be considered as a network n times deeper than the original. The backpropagation algorithm consumes memory in proportion to the number of uses times the network size, causing practical difficulties. This is true even if a checkpointing scheme divides the computation graph into subgraphs. Alternatively, the adjoint method obtains a gradient by a numerical integration backward in time; although this method consumes memory only for a single use of the network, the computational cost of suppressing numerical errors is high. The symplectic adjoint method proposed in this study, an adjoint method solved by a symplectic integrator, obtains the exact gradient (up to rounding error) with memory proportional to the number of uses plus the network size. The theoretical analysis shows that it consumes much less memory than the naive backpropagation algorithm and checkpointing schemes. The experiments verify the theory, and they also demonstrate that the symplectic adjoint method is faster than the adjoint method and is more robust to rounding errors.
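To make the memory trade-off concrete, the following is a minimal toy sketch (not the authors' implementation) of a discrete adjoint pass for a linear system dz/dt = Az integrated with forward Euler. The forward pass uses the dynamics n times, but the backward pass stores only the current adjoint vector rather than all n intermediate states; because the adjoint here reverses the same discretization, the gradient is exact up to rounding error. The matrix A, step size h, and loss are arbitrary choices for illustration.

```python
import numpy as np

# Toy linear dynamics dz/dt = A z, integrated with forward Euler (n uses).
A = np.array([[0.0, 1.0], [-1.0, -0.1]])
h, n = 0.01, 100

def forward(z0):
    z = z0.copy()
    for _ in range(n):
        z = z + h * (A @ z)  # z_{k+1} = (I + h A) z_k
    return z

z0 = np.array([1.0, 0.0])
loss = forward(z0).sum()  # illustrative scalar loss L = sum(z_T)

# Adjoint pass: a_k = a_{k+1} (I + h A); only one adjoint vector is stored,
# so memory stays at a single "network" use instead of n uses.
a = np.ones(2)              # a_T = dL/dz_T
for _ in range(n):
    a = a + h * (a @ A)     # a_{k+1} (I + h A)
grad_z0 = a                 # dL/dz_0

# Sanity check against central finite differences.
eps = 1e-6
fd = np.array([(forward(z0 + eps * e).sum() - forward(z0 - eps * e).sum())
               / (2 * eps) for e in np.eye(2)])
print(np.allclose(grad_z0, fd, atol=1e-6))
```

In this discrete toy case the adjoint recursion exactly reverses the forward map, so it matches finite differences; for a general nonlinear model solved with an off-the-shelf integrator, the continuous adjoint accumulates numerical error, which is the gap the symplectic adjoint method addresses.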