Building an MLP, as we did in Part 6, is not the same as getting it to train well. Training depends on the optimizer, regularization, initialization, and learning rate. When these line up, deep networks converge. When they don't, the network may fail to learn at all. This part catalogs those techniques, with formulas, paper citations, and code in one place.

0. Learning Objectives

- Compare and write the update rules for SGD, Momentum, Nesterov, and Adam.
- Explain how Dropout, BatchNorm, and LayerNorm work and where they belong in a model.
- Derive the variance formulas for Xavier and He initialization and match them to activations.
- Implement step, cosine, and warmup learning-rate schedules in PyTorch.
- Explain why gradient clipping is effectively required for RNNs and Transformers.
- Diagnose the most common training failures (NaN, plateau, overfitting) and apply first-line fixes.

1. Key Summary

SGD: \(w \leftarro...