FitNets: Hints for Thin Deep Nets

07 Apr 2019 | Deep learning

Original paper: https://arxiv.org/abs/1412.6550

Authors: Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, Yoshua Bengio

Abstract

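Depth tends to improve network performance, but it also makes gradient-based training harder because deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach obtains small, fast-to-execute models by training a student to mimic the soft outputs of a larger teacher network or an ensemble. This paper extends that idea so that a student that is deeper and thinner than the teacher can be trained, using not only the teacher's outputs but also its intermediate representations as hints that improve the student's training and final performance. Because the student's intermediate hidden layer is generally smaller than the teacher's, additional parameters are introduced to map the student hidden layer to a prediction of the teacher hidden layer. This makes it possible to train deeper students that generalize better or run faster, a trade-off controlled by the chosen student capacity. For example, on CIFAR-10 a deep student network with almost 10.4 times fewer parameters outperforms a larger, state-of-the-art teacher network.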

Conclusion

The paper proposes a new training method that compresses a wide and deep network into a thin and deeper one, using intermediate-level hints from the teacher's hidden layers to guide the student's training process. With these hints, very deep student models with fewer parameters can be trained, and they generalize better and/or run faster than their teachers. The paper shows empirically that giving hints to the inner layers of a thin, deep network using the teacher's hidden state generalizes better than training the network with classification targets alone. Experiments on benchmark datasets show that a deep network with low capacity can extract feature representations comparable to, or even better than, those of networks with more than 10 times as many parameters. The hint-based training suggests that more efforts should be devoted to explore new training strategies to leverage the power of deep networks.

Paper contents

The paper uses two neural networks, a teacher and a student, and calls the student nets FitNets. The student is deeper and thinner than the teacher. The teacher is already reasonably deep and accurate, but its parameter count makes it computationally expensive, so the paper defines a student with a smaller model capacity to address this. On top of that, the student ends up performing better than the teacher, which is the paper's main contribution. The paper also examines experimentally how much depth matters when the available capacity is fixed. That said, the attention transfer paper (https://arxiv.org/abs/1612.03928) shows that accuracy does not improve linearly with depth.
Knowledge distillation (KD), which the paper refers to throughout, is a training method in which a wide and deep teacher net is trained first and the student is then trained to mimic the teacher's output. When the true label is a one-hot vector such as [0, 1, …, 0, 0], the network output is usually a probability distribution over the classes, for example [0.12, 0.82, …, 0.01, 0.03]. In KD the student is trained to match this output rather than only the hard true label. However, although KD gives good results for students of comparable depth, it reportedly still suffers from difficult optimization as the student gets deeper.
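The KD objective described above can be sketched in a few lines. This is a minimal illustration, assuming the form used in the FitNets paper (a cross-entropy with the true label plus a weighted cross-entropy between temperature-softened teacher and student outputs); the logits, the temperature of 3.0, and the weight `lam` are made-up values, not settings from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with an optional temperature; higher values give softer outputs."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy H(target, pred) between two distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def kd_loss(student_logits, teacher_logits, true_label, temperature=3.0, lam=0.5):
    """Hard-label loss plus a soft-target loss on temperature-softened outputs."""
    hard = cross_entropy(true_label, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return hard + lam * soft

# Illustrative values only (hypothetical logits for a 3-class problem).
teacher_logits = [1.0, 4.0, 0.5]
true_label = [0.0, 1.0, 0.0]
good_student = [0.8, 3.5, 0.3]   # agrees with the teacher
bad_student = [3.5, 0.8, 0.3]    # disagrees with the teacher

loss_good = kd_loss(good_student, teacher_logits, true_label)
loss_bad = kd_loss(bad_student, teacher_logits, true_label)
print(loss_good, loss_bad)
```

A student whose logits agree with the teacher receives the smaller loss, which is the behaviour the soft targets are meant to encourage.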
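This paper proposes a method that addresses the optimization problem and improves accuracy at the same time. It is called hint-based training (HT), and the main idea is to train the network to match not only the true labels and the teacher's output but also the teacher's intermediate hidden layers (hints). Providing such hints gives the student a better initial position in parameter space from which to search for a good minimum (an optimum, as opposed to a saddle point). As a result, the model generalizes better.
Applying HT while making the network thinner and deeper sharply reduces the total number of parameters and the number of multiplications needed at inference, yielding a model that is roughly 10 times more efficient. In other words, once a teacher network has been trained well, it can be used to produce a student network that is much lighter and faster and even more accurate.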

Figure 1. Training the student network with hints

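In Figure 1(a), the student is trained so that the two layers drawn in bold become similar. But because the student is thinner than the teacher, the two layers have different sizes and cannot be compared directly; that is, there is a dimension mismatch. This is what $W_{r}$, drawn in blue in Figure 1(b), is for. A regressor is added on top of the student's guided layer to expand its dimensions so that the result can be matched to the teacher's hint, and $W_{r}$ is the weight of this regressor. $W_{r}$ is trained together with the student.
As an aside, attention transfer takes this one step further: rather than the true labels, outputs, or hidden layers, the student is trained to mimic the teacher's attention maps (https://arxiv.org/abs/1612.03928).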
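To make the role of the regressor $W_{r}$ concrete, here is a toy sketch of the hint loss. It assumes a fully connected regressor and fixed feature vectors purely for illustration; the paper trains the student's guided layers jointly with the regressor, and the vectors, learning rate, and step count below are invented values.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def hint_loss(teacher_hint, student_guided, W_r):
    """0.5 * squared distance between the teacher hint and the regressed student feature."""
    predicted = matvec(W_r, student_guided)
    return 0.5 * sum((t - p) ** 2 for t, p in zip(teacher_hint, predicted))

def hint_loss_grad(teacher_hint, student_guided, W_r):
    """Gradient of the hint loss with respect to W_r."""
    predicted = matvec(W_r, student_guided)
    return [[(p - t) * x for x in student_guided]
            for t, p in zip(teacher_hint, predicted)]

# Hypothetical features: a 4-dim teacher hint and a thinner 2-dim student guided layer.
teacher_hint = [1.0, -2.0, 0.5, 3.0]
student_guided = [0.5, -1.0]
W_r = [[0.0, 0.0] for _ in teacher_hint]  # regressor maps 2 dims up to 4 dims

initial_loss = hint_loss(teacher_hint, student_guided, W_r)

# Plain gradient descent on W_r only (the student features are held fixed here).
learning_rate = 0.1
for _ in range(200):
    grad = hint_loss_grad(teacher_hint, student_guided, W_r)
    W_r = [[w - learning_rate * g for w, g in zip(w_row, g_row)]
           for w_row, g_row in zip(W_r, grad)]

final_loss = hint_loss(teacher_hint, student_guided, W_r)
print(initial_loss, final_loss)
```

The point of the sketch is only that the regressor lets a 2-dimensional student feature be compared with a 4-dimensional teacher feature; a convolutional regressor would play the same role with fewer parameters.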

Training the model

Training takes as input the pre-trained teacher parameters $W_{T}$ and the randomly initialized student (FitNet) parameters $W_{S}$. $h$ and $g$ are the indices of the teacher's hint layer and the student's guided layer, respectively. $W_{Hint}$ denotes the teacher's parameters up to the hint layer $h$, $W_{Guide}$ the FitNet's parameters up to the guided layer $g$, and $W_{r}$ the parameters of the regressor. The first stage pre-trains the student network up to the guided layer, based on the prediction error of the teacher's hint layer (line 4 of the paper's algorithm). The second stage is KD (knowledge distillation) training of the whole network (line 6).
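Written out, the two stages correspond to the following two objectives. The notation here is partly an assumption for illustration: $u_{h}$ and $v_{g}$ stand for the teacher and student networks up to the hint and guided layers, $r$ for the regressor, $H$ for cross-entropy, $P_{T}^{\tau}$ and $P_{S}^{\tau}$ for teacher and student outputs softened with temperature $\tau$, and $\lambda$ for a weighting factor; none of these symbols are defined in this post.

$$
W_{Guide}^{*} = \arg\min_{W_{Guide},\, W_{r}} \; \frac{1}{2} \left\| u_{h}(x; W_{Hint}) - r\big(v_{g}(x; W_{Guide}); W_{r}\big) \right\|^{2}
$$

$$
W_{S}^{*} = \arg\min_{W_{S}} \; H(y_{true}, P_{S}) + \lambda \, H(P_{T}^{\tau}, P_{S}^{\tau})
$$

The first problem is solved in stage 1, and its solution initializes the lower layers of the student before the second problem is solved over all of $W_{S}$ in stage 2.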

Performance

The paper evaluates the method on image classification with CNNs, running experiments on CIFAR-10, CIFAR-100, SVHN, MNIST, and AFLW. The student (FitNet) achieves better accuracy than the teacher, and despite having far fewer parameters it is competitive with state-of-the-art methods.
Mimic ensemble: Ba, J. and Caruana, R. Do deep nets really need to be deep? In NIPS, pp. 2654–2662. 2014.

The experiments additionally compare networks of increasing depth under a fixed computational budget. Specifically, the number of available operations is fixed at 30M or 107M, and models trained with plain back-propagation (BP), knowledge distillation, and hint training are compared as the number of layers is varied over 3, 5, 7, and 9. With plain BP, training reportedly fails once the network has more than 5 layers. Knowledge distillation does somewhat better than BP, but in the 30M setting it still struggles as the number of layers grows. The hint training proposed in the paper trains successfully in all of these settings and also achieves better accuracy.

The papers about network generalization

05 Apr 2019 | generalization

As of April 2019

Papers

Entropy-sgd: Biasing gradient descent into wide valleys, 2017 (Cited by 146)
On large-batch training for deep learning: Generalization gap and sharp minima, 2016 (Cited by 411)
Understanding deep learning requires rethinking generalization, 2017 (Cited by 789)
Visualizing the Loss Landscape of Neural Nets, 2017 (Cited by 89)

Awesome Knowledge Distillation papers

05 Apr 2019 | Knowledge distillation

Awesome Knowledge Distillation

Reference: https://github.com/dkozlov/awesome-knowledge-distillation
As of January 2019

Web pages to study

https://blog.lunit.io/2018/03/22/distilling-the-knowledge-in-a-neural-network-nips-2014-workshop/
http://seoulai.com/presentations/knowledge-distillation.pdf
https://medium.com/neural-machines/knowledge-distillation-dc241d7c2322

Papers

Combining labeled and unlabeled data with co-training, A. Blum, T. Mitchell, 1998
Model Compression, Rich Caruana, 2006
Dark knowledge, Geoffrey Hinton, Oriol Vinyals & Jeff Dean, 2014
Learning with Pseudo-Ensembles, Philip Bachman, Ouais Alsharif, Doina Precup, 2014
Distilling the Knowledge in a Neural Network, Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015
Cross Modal Distillation for Supervision Transfer, Saurabh Gupta, Judy Hoffman, Jitendra Malik, 2015
Heterogeneous Knowledge Transfer in Video Emotion Recognition, Attribution and Summarization, Baohan Xu, Yanwei Fu, Yu-Gang Jiang, Boyang Li, Leonid Sigal, 2015
Distilling Model Knowledge, George Papamakarios, 2015
Unifying distillation and privileged information, David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, Vladimir Vapnik, 2015
Learning Using Privileged Information: Similarity Control and Knowledge Transfer, Vladimir Vapnik, Rauf Izmailov, 2015
Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks, Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, Ananthram Swami, 2016
Do deep convolutional nets really need to be deep and convolutional?, Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Rich Caruana, Abdelrahman Mohamed, Matthai Philipose, Matt Richardson, 2016
Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer, Sergey Zagoruyko, Nikos Komodakis, 2016
FitNets: Hints for Thin Deep Nets, Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, Yoshua Bengio, 2015
Deep Model Compression: Distilling Knowledge from Noisy Teachers, Bharat Bhusan Sau, Vineeth N. Balasubramanian, 2016
Knowledge Distillation for Small-footprint Highway Networks, Liang Lu, Michelle Guo, Steve Renals, 2016
Sequence-Level Knowledge Distillation, deeplearning-papernotes, Yoon Kim, Alexander M. Rush, 2016
MobileID: Face Model Compression by Distilling Knowledge from Neurons, Ping Luo, Zhenyao Zhu, Ziwei Liu, Xiaogang Wang and Xiaoou Tang, 2016
Recurrent Neural Network Training with Dark Knowledge Transfer, Zhiyuan Tang, Dong Wang, Zhiyong Zhang, 2016
Adapting Models to Signal Degradation using Distillation, Jong-Chyi Su, Subhransu Maji, 2016
Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, Antti Tarvainen, Harri Valpola, 2017
Data-Free Knowledge Distillation For Deep Neural Networks, Raphael Gontijo Lopes, Stefano Fenu, 2017
Like What You Like: Knowledge Distill via Neuron Selectivity Transfer, Zehao Huang, Naiyan Wang, 2017
Learning Loss for Knowledge Distillation with Conditional Adversarial Networks, Zheng Xu, Yen-Chang Hsu, Jiawei Huang, 2017
DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang, 2017
Knowledge Projection for Deep Neural Networks, Zhi Zhang, Guanghan Ning, Zhihai He, 2017
Moonshine: Distilling with Cheap Convolutions, Elliot J. Crowley, Gavin Gray, Amos Storkey, 2017
Local Affine Approximators for Improving Knowledge Transfer, Suraj Srinivas and Francois Fleuret, 2017
Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model, Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra, 2017
Learning Efficient Object Detection Models with Knowledge Distillation, Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, Manmohan Chandraker, 2017
Model Distillation with Knowledge Transfer from Face Classification to Alignment and Verification, Chong Wang, Xipeng Lan and Yangang Zhang, 2017
Learning Transferable Architectures for Scalable Image Recognition, Barret Zoph, Vijay Vasudevan, Jonathon Shlens, Quoc V. Le, 2017
Revisiting knowledge transfer for training object class detectors, Jasper Uijlings, Stefan Popov, Vittorio Ferrari, 2017
A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning, Junho Yim, Donggyu Joo, Jihoon Bae, Junmo Kim, 2017
Rocket Launching: A Universal and Efficient Framework for Training Well-performing Light Net, Zihao Liu, Qi Liu, Tao Liu, Yanzhi Wang, Wujie Wen, 2017
Data Distillation: Towards Omni-Supervised Learning, Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, Kaiming He, 2017
Interpreting Deep Classifiers by Visual Distillation of Dark Knowledge, Kai Xu, Dae Hoon Park, Chang Yi, Charles Sutton, 2018
Efficient Neural Architecture Search via Parameters Sharing, Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, Jeff Dean, 2018
Transparent Model Distillation, Sarah Tan, Rich Caruana, Giles Hooker, Albert Gordo, 2018
Defensive Collaborative Multi-task Training - Defending against Adversarial Attack towards Deep Neural Networks, Derek Wang, Chaoran Li, Sheng Wen, Yang Xiang, Wanlei Zhou, Surya Nepal, 2018
Deep Co-Training for Semi-Supervised Image Recognition, Siyuan Qiao, Wei Shen, Zhishuai Zhang, Bo Wang, Alan Yuille, 2018
Feature Distillation: DNN-Oriented JPEG Compression Against Adversarial Examples, Zihao Liu, Qi Liu, Tao Liu, Yanzhi Wang, Wujie Wen, 2018
Multimodal Recurrent Neural Networks with Information Transfer Layers for Indoor Scene Labeling, Abrar H. Abdulnabi, Bing Shuai, Zhen Zuo, Lap-Pui Chau, Gang Wang, 2018
Born Again Neural Networks, Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, Anima Anandkumar, 2018
YASENN: Explaining Neural Networks via Partitioning Activation Sequences, Yaroslav Zharov, Denis Korzhenkov, Pavel Shvechikov, Alexander Tuzhilin, 2018
Knowledge Distillation with Adversarial Samples Supporting Decision Boundary, Byeongho Heo, Minsik Lee, Sangdoo Yun, Jin Young Choi, 2018
Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons, Byeongho Heo, Minsik Lee, Sangdoo Yun, Jin Young Choi, 2018

Videos

Dark knowledge, Geoffrey Hinton, 2014
Model Compression, Rich Caruana, 2016
Implementations

MXNet

Bayesian Dark Knowledge

PyTorch

Lua

Example for teacher/student-based learning

Torch

Theano

Lasagne + Theano

Experiments-with-Distilling-Knowledge

Tensorflow

Caffe

Keras

Deep Mutual Learning

04 Apr 2019 | Deep learning

Original paper: https://arxiv.org/pdf/1706.00384.pdf

Authors: Ying Zhang, Tao Xiang, Timothy M. Hospedales, Huchuan Lu

Abstract

Model distillation is an effective and widely used technique for transferring knowledge from a teacher to a student network. The typical application is to transfer from a powerful large network or ensemble to a small network, in order to meet low-memory or fast-execution requirements. This paper proposes a deep mutual learning (DML) strategy in which, rather than a one-way transfer from a static pre-defined teacher to a student, an ensemble of students learns collaboratively and the students teach each other throughout the training process. Experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on CIFAR-100 recognition and the Market-1501 person re-identification benchmark. Surprisingly, no prior powerful teacher network is necessary: mutual learning among a collection of simple student networks works, and it outperforms distillation from a more powerful yet static teacher.

1. Introduction

Deep neural networks achieve state-of-the-art performance in many areas, but they are usually very deep or wide and therefore have a large number of parameters [6, 25]. This makes them slow or memory-hungry, which limits their use in resource-constrained environments. Building fast models has therefore become an active research topic. Approaches to obtaining small yet accurate models include frugal architecture design [8], model compression [2], pruning [13], binarisation [18], and model distillation [7].
Distillation-based model compression relates to the observation that small networks often have the same representation capacity as large networks [3, 2]. Compared with large networks, however, they are simply harder to train into the right parameters that realise the desired function. That is, the limitation seems to lie in the difficulty of optimisation rather than in network size [2]. To train a small network better, distillation starts with a powerful (deep and/or wide) teacher network or ensemble, and then trains a smaller student network to mimic the teacher [7, 2, 16, 3]. Mimicking the teacher's class probabilities [7] and/or feature representation [2, 19] conveys additional information beyond the conventional supervised learning target. The optimisation problem of learning to mimic the teacher turns out to be easier than learning the target function directly, so a much smaller student can match or even outperform the larger teacher [19].
The paper proposes a method that is related to model distillation but different, which it calls mutual learning. Distillation starts with a powerful, large, pre-trained teacher network and performs a one-way knowledge transfer to a small untrained student. Mutual learning instead starts with a pool of untrained students that learn simultaneously to solve the task together. Specifically, each student is trained with two losses: a conventional supervised learning loss, and a mimicry loss that aligns each student's class posterior with the class probabilities of the other students. Students trained in this peer-teaching scenario perform significantly better than when they are trained alone in a conventional supervised learning scenario. Moreover, student networks trained this way outperform students trained by conventional distillation from a larger pre-trained teacher. Furthermore, although the conventional understanding of distillation requires a teacher larger and more powerful than the intended student, mutual learning among several large networks in many cases also improves performance over independent learning.
It is perhaps not obvious why the proposed method should work at all. Where does the additional knowledge come from when the learning process starts with nothing but small, untrained student networks? Why does training converge to a good solution instead of being held back by 'the blind lead the blind'? Some intuition can be gained from the following. Each student is primarily directed by a conventional supervised learning loss, so its performance generally increases and the students cannot drift arbitrarily as a group. With supervised learning, all networks soon predict the same (true) labels for each training instance. But because each network starts from a different initial condition, their estimates of the probabilities of the next most likely classes vary. It is these secondary quantities that provide the extra information in distillation [7] as well as in mutual learning. In mutual learning, the student cohort effectively pools its collective estimate of the next most likely classes. Finding out, and matching, the other most likely classes for each training instance according to their peers increases each student's posterior entropy [4, 17], which helps the students converge to a more robust (flatter) minimum that generalises better to test data. This is related to recent work on the robustness of high posterior entropy solutions (network parameter settings) in deep learning [4, 17], but with a much more informed choice of alternatives than blind entropy regularisation.
Overall, mutual learning provides a simple but effective way to improve the generalisation ability of a network by training it collaboratively with a cohort of other networks. Compared with distillation from a pre-trained static large network, collaborative learning among small peers achieves better performance. In addition, the paper observes the following.
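- (1) Performance increases with the number of networks in the cohort. (By using small networks, more of them fit on a single GPU, which makes mutual learning efficient.)
- (2) It applies to a variety of network architectures, and to heterogeneous cohorts consisting of mixed large and small networks.
- (3) Even large networks mutually trained in a cohort improve over independent training.
- Finally, although the paper focuses on obtaining a single effective network, the entire cohort can also be used as a highly effective ensemble model.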

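The distillation-based approach to model compression was proposed long ago [3], but it has recently regained attention thanks to [7], which gave a more intuitive explanation of why it works. Initially, a common application was to distil the function approximated by a powerful model or ensemble teacher into a single neural network student [3, 7]. Later the idea was applied to distil powerful, easy-to-train large networks into small but harder-to-train networks that can even outperform their teacher [19] (FitNets). More recently, distillation has been connected more systematically to information learning theory [15] and SVM+ [22], in which the teacher passes selected information to the student. In contrast, the authors dispense with the teacher altogether and let an ensemble of students teach each other through mutual distillation.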
Other related ideas include Dual Learning [5] where two cross-lingual translation models teach each other interactively. But this only applies in this special translation problems where an unconditional within-language model is available to be used to evaluate the quality of the predictions, and ultimately provides the supervision that drives the learning process. In contrast, our mutual learning approach applies to general classification problems. While conventional wisdom about ensembles prioritises diversity [12], our mutual learning approach reduces diversity in the sense that all students become somewhat more similar by learning to mimic each other. However, our goal is not necessarily to produce a diverse ensemble, but to enable networks to find robust solutions that generalise well to testing data, which would otherwise be hard to find through conventional supervised learning.

2. Deep Mutual Learning

Figure 1 illustrates how DML is applied with two networks.

2.1 Formulation

See the paper for the detailed equations.

Roughly, each network $\Theta_{1}$ and $\Theta_{2}$ produces a softmax prediction, $p_{1}$ and $p_{2}$, and is trained with a cross-entropy loss against the true labels.
To improve the test-time performance of $\Theta_{1}$, another peer network $\Theta_{2}$ is used: $\Theta_{2}$ provides its training experience in the form of its posterior probability $p_{2}$. The Kullback-Leibler divergence (KLD) is used to quantify how well the two predictions $p_{1}$ and $p_{2}$ match.
In this way, each network learns both to predict the true label of each training instance and to match the probability estimates of its peer.
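The two-network formulation above can be sketched as follows. This is a minimal illustration, assuming each network's loss is its cross-entropy plus the KL divergence from its peer's prediction to its own, with equal weighting; the logits are made-up values and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(label_index, probs):
    """Supervised loss for a single example with an integer class label."""
    return -math.log(probs[label_index])

def kl_divergence(p, q):
    """KL(p || q): how far q is from the target distribution p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dml_losses(logits_1, logits_2, label_index):
    """Loss for each of two peers: supervised loss plus mimicry loss."""
    p1, p2 = softmax(logits_1), softmax(logits_2)
    loss_1 = cross_entropy(label_index, p1) + kl_divergence(p2, p1)  # peer p2 teaches net 1
    loss_2 = cross_entropy(label_index, p2) + kl_divergence(p1, p2)  # peer p1 teaches net 2
    return loss_1, loss_2

# Illustrative logits for one training example whose true class is 1.
logits_1 = [2.0, 0.0, -1.0]
logits_2 = [-1.0, 0.5, 1.0]
loss_1, loss_2 = dml_losses(logits_1, logits_2, label_index=1)

# When both peers agree exactly, the mimicry term vanishes.
same_1, same_2 = dml_losses(logits_1, logits_1, label_index=1)
print(loss_1, loss_2, same_1)
```

Each loss is never smaller than the plain cross-entropy, since the KL term is non-negative and only reaches zero when the two predictions coincide.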

2.2 Optimization

The optimization procedure is summarized in Algorithm 1 of the paper.
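As a rough picture of the alternating updates, here is a toy sketch in which each "network" is nothing more than a vector of logits for a single training example, so the parameters are the logits themselves. That simplification, the learning rate, and the starting values are assumptions for illustration; the update order (update network 1, refresh its prediction, then update network 2) follows my reading of the paper's algorithm.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dml_gradient(p_self, p_peer, one_hot):
    """Gradient w.r.t. own logits of CE(y, p_self) + KL(p_peer || p_self), peer held fixed."""
    return [(ps - y) + (ps - pp) for ps, pp, y in zip(p_self, p_peer, one_hot)]

one_hot = [0.0, 1.0, 0.0]   # true class is 1
z1 = [2.0, 0.0, -1.0]       # toy "parameters" of network 1
z2 = [-1.0, 0.5, 1.0]       # toy "parameters" of network 2
learning_rate = 0.5

for _ in range(500):
    p1, p2 = softmax(z1), softmax(z2)
    # Update network 1 using the peer's current prediction as a fixed target.
    g1 = dml_gradient(p1, p2, one_hot)
    z1 = [z - learning_rate * g for z, g in zip(z1, g1)]
    # Refresh network 1's prediction before updating network 2.
    p1 = softmax(z1)
    g2 = dml_gradient(p2, p1, one_hot)
    z2 = [z - learning_rate * g for z, g in zip(z2, g2)]

final_p1, final_p2 = softmax(z1), softmax(z2)
print(final_p1, final_p2)
```

Although both toy networks start out favouring the wrong class, the supervised term pulls them toward the true label while the mimicry term pulls them toward each other.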

2.3 Extension to Larger Student Cohorts

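See the paper for the detailed equations.
The proposed DML extends naturally to cohorts of more than two students.
- The networks $\Theta_{1}$ and $\Theta_{2}$ above are simply extended to K networks in total (K ≥ 2).
Optimization for more than two networks is a straightforward extension of Algorithm 1.
With more than two networks, an alternative strategy is to treat the ensemble of the other K-1 networks as a single teacher: their predictions are averaged into $p_{avg}$, which is then used as the target when computing the KLD.
Section 3.6 shows that this DML strategy with a single ensemble teacher, denoted DML_e, performs worse than DML with K-1 separate teachers. The reason is that the model averaging step used to build the teacher ensemble (averaging the predictions, as above) makes the teacher's posterior probabilities more peaked at the true class, which reduces the posterior entropy over all classes.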
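The two mimicry targets for a cohort of K students can be put side by side in code. This is a sketch under the assumption that the peer version averages the KL divergences from each of the other K-1 students, while the ensemble version (DML_e) takes a single KL divergence from their averaged prediction $p_{avg}$; the three probability vectors are invented for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def peer_mimicry_loss(k, probs):
    """Student k is taught by each other student individually (K-1 teachers)."""
    others = [p for l, p in enumerate(probs) if l != k]
    return sum(kl_divergence(p_l, probs[k]) for p_l in others) / len(others)

def ensemble_mimicry_loss(k, probs):
    """Student k is taught by one ensemble teacher: the average of the others."""
    others = [p for l, p in enumerate(probs) if l != k]
    p_avg = [sum(column) / len(others) for column in zip(*others)]
    return kl_divergence(p_avg, probs[k])

# Hypothetical predictions of K = 3 students for one example (true class 0).
cohort = [
    [0.70, 0.20, 0.10],
    [0.60, 0.10, 0.30],
    [0.80, 0.15, 0.05],
]

peer_losses = [peer_mimicry_loss(k, cohort) for k in range(len(cohort))]
ensemble_losses = [ensemble_mimicry_loss(k, cohort) for k in range(len(cohort))]
print(peer_losses)
print(ensemble_losses)
```

Because KL divergence is convex in its first argument, the ensemble-target loss is never larger than the averaged peer loss for the same predictions, so the ensemble teacher gives each student a weaker mimicry signal on the points where its peers disagree.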

3. Experiment

3.1 Datasets and Settings

Datasets

Two datasets are used in our experiments. The CIFAR-100 [11] dataset consists of 32×32 color images drawn from 100 classes, which are split into 50,000 train and 10,000 test images. The Top-1 classification accuracy is reported. The Market-1501 [27] dataset is widely used in the person re-identification problem which aims to associate people across different non-overlapping camera views. It contains 32,668 images of 1,501 identities captured from six camera views, with 751 identities for training and 750 identities for testing. As per state of the art approaches to this problem [28], we train the network for 751-way classification and use the resulting feature of the last pooling layer as a representation for nearest neighbour matching at testing. This is a more challenging dataset than CIFAR-100 because the task is instance recognition thus more fine-grained, and the dataset is smaller with more classes. For evaluation, the standard Cumulative Matching Characteristic (CMC) Rank-k accuracy and mean average precision (mAP) metrics [27] are used.

Implementation Details

We implement all networks and training procedures in TensorFlow [1] and conduct all experiments on an NVIDIA GeForce GTX 1080 GPU. For CIFAR-100, we follow the experimental settings of [25]. Specifically, we use SGD with Nesterov momentum and set the initial learning rate to 0.1, momentum to 0.9 and mini-batch size to 64. The learning rate dropped by 0.1 every 60 epochs and we train for 200 epochs. The data augmentation includes horizontal flips and random crops from image padded by 4 pixels on each side, filling missing pixels with reflections of original image. For Market-1501, we use Adam optimiser [10], with learning rate lr = 0.0002, β1 = 0.5, β2 = 0.999 and a mini-batch size of 16. We train all the models for 100,000 iterations. We also report results with and without pre-training on ImageNet.
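The CIFAR-100 learning-rate schedule described above is easy to write down. This sketch assumes that "dropped by 0.1" means the rate is multiplied by 0.1 at each 60-epoch boundary; the function name is hypothetical.

```python
def cifar_learning_rate(epoch, base_lr=0.1, drop_factor=0.1, drop_every=60):
    """Step schedule: multiply the rate by drop_factor every drop_every epochs."""
    return base_lr * drop_factor ** (epoch // drop_every)

# Learning rate for each of the 200 training epochs (epochs counted from 0).
schedule = [cifar_learning_rate(epoch) for epoch in range(200)]
print(schedule[0], schedule[60], schedule[120], schedule[199])
```

Under this reading the rate steps through 0.1, 0.01, 0.001, and 0.0001 over the 200 epochs.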

Model Size

The networks used in our experiments includes compact networks of typical student size: Resnet-32 [6] and MobileNet [8]; as well as large networks of typical teacher size: InceptionV1 [21] and Wide ResNet WRN-28-10 [25]. Table 1 compares the number of parameters of all the networks on CIFAR-100.

3.2 Results on CIFAR-100

Table 2 reports the Top-1 accuracy on CIFAR-100 for two-network DML cohorts with various architectures. The following observations can be made from the table.
- (1) All combinations of ResNet-32, MobileNet, and WRN-28-10 improve when trained in a cohort compared with independent training, as indicated by the all-positive values in the "DML-Independent" columns.
- (2) The lower-capacity networks, ResNet-32 and MobileNet, generally benefit more from DML.
- (3) Although WRN-28-10 is a much larger model than MobileNet or ResNet-32, it still benefits from being trained together with a smaller peer (MobileNet or ResNet-32).
- (4) Training a cohort of large networks such as WRN-28-10 is still beneficial compared with training them independently.
- Thus, in contrast to the conventional wisdom of model distillation, a large pre-trained teacher is not essential for obtaining an improvement, and multiple large networks also benefit from the proposed distillation-like process.

3.3 Results on Market-1501

This is a person re-identification task; see the paper for details.

3.4 Comparison with Distillation

Because the method is closely related to model distillation, it is compared against distillation [7]. Table 4 compares DML with model distillation in which a pre-trained teacher net (Net 1) provides fixed posterior targets to the student net (Net 2). As expected, conventional distillation from a powerful pre-trained teacher does improve the student over independent learning (1 distills 2 versus Net 2 Independent in Table 4). However, the results show that a pre-trained teacher is not necessary: comparing "1 distills 2" with "DML Net 2", training the two networks together with DML gives a clear improvement over distillation. This implies that, in the mutual learning process, the network that would play the teacher role becomes better than a pre-trained teacher by learning from its interactions with the a-priori untrained student. Finally, we note that on Market1501 training two compact MobileNets together provides a similar boost over independent learning compared to mutual learning with InceptionV1 and MobileNet: Peer teaching of small networks can be highly effective. In contrast, using the same network as teacher in model distillation actually makes the student worse than independent learning (the last row 1 distills 2 (45.16) vs. Net 2 Independent (46.07)).

3.5 DML with Larger Student Cohorts

All the experiments so far used a cohort of two students. This experiment investigates how DML scales with the number of students. Figure 2(a) shows the results of DML training on Market-1501 as the cohort size increases, using MobileNet students. The figure shows that the mAP of the average single network improves as the number of networks in the DML cohort increases, compared with independent training. This demonstrates that the students' generalisation ability is enhanced as they learn together with more and more peers. From the standard deviations we can also see that the results get more and more stable with increasing number of networks in DML.
- Note: the paper does not state precisely how the number of networks is defined for the independently trained models in this experiment; they do not appear to be an ensemble.
A common technique when training multiple networks is to combine them into an ensemble and make a combined prediction. Figure 2(b) uses the same models as (a), but instead of averaging the individual networks' results, it predicts with the ensemble (matching based on the concatenated features of all members). As expected, the ensemble prediction outperforms the individual network predictions (Fig. 2(b) vs. (a)). Moreover, the ensemble prediction also benefits from training the networks as a cohort (Fig. 2(b), DML ensemble vs. Independent ensemble). The ensemble is the usual way of boosting performance by training several networks, and the gains in Fig. 2 show that DML improves it further at minimal extra cost, making it a generally useful technique.

3.6 How and Why does DML Work?

This section tries to explain why DML works. Recent papers on the theme of "Why Deep Nets Generalise" [4, 26, 9] offer the following insights. While there are often many solutions (deep network parameter settings) that generate zero train error, some of these generalise better than others due to being in wide valleys rather than narrow crevices [4, 9], so that small perturbations do not change the prediction efficacy drastically; and that deep networks are better than might be expected at finding these good solutions [26], but that the tendency towards finding robust minima can be enhanced by biasing deep nets towards solutions with higher posterior entropy [4, 17].

Better Quality Solutions with More Robust Minima

Guided by the insights above, the authors make several observations about the DML process. First, in their applications the networks fit the training data perfectly: training accuracy reaches 100% and the classification loss becomes minimal (Fig. 3(a)). Yet, as shown earlier, DML performs better on test data. So rather than helping to find a deeper minimum of the training loss, DML appears to help find a wider, more robust minimum that generalises better to test data. Inspired by [4, 9], the authors run a simple test of the robustness of the minima discovered with MobileNet on Market-1501. For the DML model and the independently trained model, they compare the training loss before and after adding independent Gaussian noise with variable standard deviation $\rho$ to each model's parameters. We see that the depths of the two minima were the same (Fig. 3(a)), but after adding this perturbation the training loss of the independent model jumps up while the loss of the DML model increases much less. This suggests that the DML model has found a much wider minimum, which is expected to generalise better [4, 17].
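The perturbation test can be mimicked on a toy problem. In the sketch below, two quadratic bowls stand in for the independently trained model (sharp minimum) and the DML model (wide minimum); the curvatures, the noise level `rho`, and the parameter count are arbitrary assumptions, and nothing here involves MobileNet or Market-1501.

```python
import random

def quadratic_loss(params, curvature):
    """Toy training loss with its minimum (value 0) at the origin."""
    return 0.5 * curvature * sum(p * p for p in params)

def mean_perturbed_loss(params, curvature, rho, rng, trials=200):
    """Average loss after adding independent Gaussian noise N(0, rho^2) to every parameter."""
    total = 0.0
    for _ in range(trials):
        noisy = [p + rng.gauss(0.0, rho) for p in params]
        total += quadratic_loss(noisy, curvature)
    return total / trials

minimum = [0.0] * 10                     # both toy models sit exactly at their minimum
sharp_curvature, wide_curvature = 50.0, 1.0
rho = 0.1

# Same depth before the perturbation ...
loss_sharp_before = quadratic_loss(minimum, sharp_curvature)
loss_wide_before = quadratic_loss(minimum, wide_curvature)

# ... but very different losses after it (same seed, so both see the same noise).
loss_sharp_after = mean_perturbed_loss(minimum, sharp_curvature, rho, random.Random(0))
loss_wide_after = mean_perturbed_loss(minimum, wide_curvature, rho, random.Random(0))
print(loss_sharp_after, loss_wide_after)
```

Both minima have the same depth, yet the same parameter noise raises the loss far more in the sharp bowl, which is the signature the authors look for when comparing the independent and DML models.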

How a Better Minima is Found

How does DML help find such a wider minimum? When asking each network to match its peers probability estimates, mismatches where a given network predicts zero and its teacher/peer predicts non-zero are heavily penalised. Therefore the overall effect of DML is that, where each network independently would put a small mass on a small set of secondary probabilities, all networks in the DML tend to aggregate their prediction of secondary probabilities, and both (i) put more mass on the secondary probabilities altogether, and (ii) place non-zero mass on more distinct secondary probabilities. Figure 3(c) illustrates this effect by comparing the probabilities assigned to the top-5 highest-ranked classes by a ResNet-32 trained with DML on CIFAR-100 and by an independently trained ResNet-32. For each training sample, the top 5 classes are ranked according to the posterior probabilities produced by the model (Class 1 being the true class and Class 2 the second most probable class, so on and so forth). Here we can see that the assignment of mass to probabilities below the Top-1 decays much quicker for Independent than DML learning. This can be quantified by the entropy values averaged over all training samples of the DML trained model and the independently trained model being 1.7099 and 0.2602 respectively. Thus our method has connection to entropy regularisation-based approaches [4, 17] to finding wide minima, but by mutual probability matching on 'reasonable' alternatives, rather than a blind high-entropy preference.
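The entropy comparison above boils down to a simple computation. The following sketch uses two invented 5-class posteriors, one sharply peaked and one with more mass on the secondary classes; they are illustrative stand-ins, not values from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(posteriors):
    """Entropy averaged over a list of per-sample posteriors."""
    return sum(entropy(p) for p in posteriors) / len(posteriors)

# Hypothetical posteriors; class 0 is the true class in both cases.
peaked_posterior = [0.97, 0.01, 0.01, 0.005, 0.005]   # mass decays quickly below the Top-1
spread_posterior = [0.60, 0.20, 0.10, 0.06, 0.04]     # more mass on secondary classes

peaked_entropy = entropy(peaked_posterior)
spread_entropy = entropy(spread_posterior)
average_entropy = mean_entropy([peaked_posterior, spread_posterior])
print(peaked_entropy, spread_entropy, average_entropy)
```

Both posteriors agree on the Top-1 class, so accuracy alone cannot tell them apart; the entropy is what captures how much mass sits on the secondary classes.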

DML with Ensemble Teacher

In our DML strategy, each student is taught by all other students in the cohort individually, regardless how many students are in the cohort (Eq. (10)). In Sec. 2.3, an alternative DML strategy is discussed, by which each student is asked to match the predictions of the ensemble of all other students in the cohort (Eq. (11)). One might reasonably expect this approach to be better. As the ensemble prediction is better than individual predictions, it should provide a cleaner and stronger teaching signal – more like conventional distillation. In practice the results of ensemble rather than peer teaching are worse (see Fig. 4 (a)). By analysing the teaching signal of the ensemble in comparison to peer teaching, the ensemble target is much more sharply peaked on the true label than the peer targets, resulting in larger prediction entropy value for DML than DML_e (see Fig. 4 (b)). Thus while the noise-averaging property of ensembling is effective for making a correct prediction, it is actually detrimental to providing a teaching signal where the secondary class probabilities are the salient cue in the signal and having high-entropy posterior leads to more robust solutions to model training.

Conclusion

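The paper proposes a simple and generally applicable approach to improving the performance of deep neural networks by training them in a cohort with peers and mutual distillation. With this approach, compact networks can be obtained that perform better than those distilled from a strong but static (pre-trained) teacher. One application of deep mutual learning (DML) is to obtain compact, fast, and effective networks. The paper also shows that the approach can improve the performance of large, powerful networks, and that a network cohort trained in this manner can be combined as an ensemble to improve performance further.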

A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning

03 Apr 2019 | Deep learning

Original paper: http://openaccess.thecvf.com/content_cvpr_2017/papers/Yim_A_Gift_From_CVPR_2017_paper.pdf

Authors: Junho Yim, Donggyu Joo, Jihoon Bae, Junmo Kim (KAIST)

Abstract

This paper proposes a new method for knowledge transfer.
- Knowledge transfer means distilling the knowledge of a pre-trained deep neural network (DNN) and transferring it to another DNN.
Since a DNN maps from the input space to the output space through many sequentially stacked layers, the authors define the distilled knowledge to be transferred in terms of the flow between layers, which is computed as the inner product between the features of two layers.
The paper compares a student DNN trained with the proposed method against a DNN of the same size trained without a teacher network. When the distilled knowledge, expressed as the flow between two layers, is transferred to the student DNN, the resulting model differs from the baseline in three important phenomena.
- (1): A student DNN that receives the distilled knowledge is optimized much faster than the original model.
- (2): The student DNN trained with the proposed method outperforms the original DNN.
- (3): The student DNN can learn the distilled knowledge from a teacher DNN trained on a different task, and it outperforms the same DNN trained from scratch.

1. Introduction

DNNs have been actively studied in recent years and achieve state-of-the-art performance on a variety of tasks, including computer vision [8, 23] and NLP [1, 19]. Recently, several studies on knowledge transfer have been carried out [11, 20]. Hinton et al. [11] first proposed the concept of knowledge distillation (KD), in which a softened version of the teacher's output is used in a teacher-student framework. Although KD training improved accuracy on several datasets, it had difficulty optimizing very deep networks. To address this optimization problem, Romero et al. [20] proposed hint-based training, which uses a hint layer of the pre-trained teacher and a guided layer of the student. Thanks to hint-based training, the deep student network achieved better accuracy with fewer parameters than the original wide teacher network.
The performance of knowledge transfer is very sensitive to how the distilled knowledge is defined. Distilled knowledge can be extracted from various features of a pre-trained DNN. Considering that a real teacher teaches a student the flow of how to solve a problem, the paper defines high-level distilled knowledge as the flow of solving a problem. Because a DNN uses many layers sequentially to map from the input space to the output space, the flow of solving a problem can be defined as the relationship between the features of two layers.
Gatys et al. [6] used the Gramian matrix to represent the texture information of an input image. Because the Gramian matrix is generated by computing inner products of feature vectors, it can capture the directionality between features, which can be thought of as texture information. Similarly, the authors represent the flow of solving a problem with a Gramian matrix consisting of the inner products between the features of two layers. The key difference is that the proposed method computes the Gramian matrix across layers, whereas the Gramian matrix in [6] computes inner products only between features within a single layer. Figure 1 shows a concept diagram of the proposed way of transferring distilled knowledge. The feature maps extracted from two layers are used to generate the flow of solution procedure (FSP) matrix. During training, the student DNN is trained so that its FSP matrix becomes similar to that of the teacher DNN.
- In other words, whereas the earlier method [6] computes the Gramian matrix over the features of a single layer, the proposed method computes it between the features of two different layers, one closer to the input and one closer to the output, and calls the result the FSP matrix. The student DNN is then trained so that its FSP matrix resembles the teacher's. Figure 1 makes this easier to grasp intuitively.

The paper verifies the usefulness of the proposed distilled knowledge on three tasks.
The first is fast optimization. A DNN that understands the flow of solving a problem provides good initial weights for the main task, so it can be trained faster than an ordinary DNN. Fast optimization is a very useful technique. Many papers have studied fast optimization through advanced learning rate scheduling [13, 27, 4], and many others have studied how to find good initial weights [5, 9, 18, 20]. Because the proposed approach is based on an initial weight method, it is compared with other initial weight methods. The authors compare the number of training iterations and the performance of their scheme against various other techniques.
The second task is improving the performance of a shallow network (small network) with few parameters. Because the small network learns the distilled knowledge from the teacher network, training the student network (small network) with that distilled knowledge helps performance more than training it alone. The authors compare the performance of the original network with networks trained using various knowledge transfer techniques.
The third task is transfer learning. Even when a new task only has a small dataset available, applying transfer learning from a deep, heavy DNN pretrained on a huge dataset can give good performance with the small dataset alone. Because the proposed method can transfer the distilled knowledge to a small DNN, the small network can reach an accuracy similar to that of the large DNN normally used in transfer learning.
The paper makes the following contributions.
1. It proposes a new technique for knowledge distillation.
2. The method is useful for fast optimization.
3. Using the proposed distilled knowledge to find initial weights can improve the performance of a small network.
4. Even when the student DNN is trained on a different task from the teacher DNN, the proposed distilled knowledge can improve the performance of the student DNN.

Knowledge Transfer

In computer vision tasks, deep networks with many parameters usually perform well, and most architectures have grown deeper to improve performance. AlexNet [16], which started the deep learning era, had only 5 conv layers, whereas the more recent GoogLeNet [23] has 22 conv layers and ResNet [8] has 152 conv layers.
Deep networks with many parameters require heavy computation for both training and testing. For this reason they are hard to deploy on ordinary computing platforms such as mobile devices, and many studies have tried to shrink the network while keeping its performance. A common way to do this is to distill the information of a trained deep network and transfer it to a small network. Hinton's method [11] described a model compression approach based on dark knowledge. It uses the softened final output of the teacher network to train a small student network. Through this teaching process the small network receives, in compressed form, information about how the large network learned the given task. Romero's method [20] uses the final output and, in addition, the intermediate hidden layer values of the teacher network to train the student network, and showed that using this intermediate-layer information improves the performance of a deep and thin student network. Net2Net [3] also uses a teacher-student network system, applying a function-preserving transform to initialize the parameters of the student network from the parameters of the teacher network.

Fast Optimization

A deep CNN takes a relatively long time to find a good local optimum or the global optimum. Small datasets such as MNIST [17] or CIFAR10 [15] are usually easy to train on, but on a huge dataset such as ILSVRC [21] a big network can take several weeks to train. Fast optimization has therefore become an important area of recent research. The main approaches either find good initial weights or reach the optimal point with techniques other than plain SGD.
Early on, Gaussian noise initialization with zero mean and unit variance was very widely used, and Xavier initialization [7] was also used extensively. However, these simple initialization methods perform poorly when training deep networks. Several newer methods such as [18, 22, 14] tried to solve this with a more mathematical approach. With a good initialization, training starts from a suitable starting point and the parameters can converge quickly to the global optimum.
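To make the difference between these initialization schemes concrete, here is a minimal sketch of the standard deviations they use. It assumes the usual textbook forms: Xavier initialization [7] with variance 2 / (fan_in + fan_out), and MSRA initialization [9] (used later in the experiments) with variance 2 / fan_in. The helper names are hypothetical and not taken from the paper.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    # Xavier initialization [7]: variance 2 / (fan_in + fan_out).
    return math.sqrt(2.0 / (fan_in + fan_out))


def msra_std(fan_in):
    # MSRA initialization [9]: variance 2 / fan_in, derived for ReLU units.
    return math.sqrt(2.0 / fan_in)


def gaussian_weights(fan_in, fan_out, std, rng):
    # Zero-mean Gaussian weights for a fan_in x fan_out layer.
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(0)
weights = gaussian_weights(256, 128, msra_std(256), rng)
```

Unlike a fixed unit variance, both schemes scale the standard deviation with the layer width, which is what keeps activations from blowing up or vanishing as depth grows.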
Optimization algorithms have also evolved along with deep learning. The SGD algorithm has conventionally been the default choice, but it has trouble escaping from the many saddle points. Several algorithms such as [13, 27, 4] were proposed to address this. They help the optimizer escape saddle points and reach the global optimum faster.

Transfer Learning

Transfer learning is a simple technique that reuses the parameters of a network pretrained on one task for a new task. Typically the input-side layers that perform feature extraction are copied from the pretrained network, either frozen (no parameter updates) or fine-tuned, while the classifier on top (the fc layers) is randomly initialized for the new task and trained with a slow learning rate. Fine-tuning often performs better than training from scratch, because the pretrained model is already good at processing the information. For example, papers such as [19, 28, 1, 2] took models pretrained on the ILSVRC dataset and fine-tuned them on tasks such as VQA [1] or CUB200 [25] to improve performance. ImageNet-pretrained models are also used as initial values for many other tasks such as detection and segmentation, because the ILSVRC dataset helps generalization a great deal. The proposed approach likewise applies this fine-tuning technique on top of its good initialization method.

3. Method

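The key questions of the proposed method are how to define the important information of the teacher DNN and how to transfer that information to another DNN. This section explains the main ideas of the paper in four parts. Section 3.1 describes the distilled knowledge used in this work. Section 3.2 gives the mathematical expression of the proposed distilled knowledge. Building on this carefully designed distilled knowledge, section 3.3 describes the loss term. Finally, section 3.4 describes the full training procedure of the student DNN.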

3.1. Proposed Distilled Knowledge

A DNN generates features layer by layer. Higher-layer features are closer to the features that are useful for the main task. If we view the input of a DNN as the question and the output as the answer, the features generated in the middle of the DNN can be seen as intermediate results of the solution process. Based on this idea, the knowledge transfer technique proposed by Romero [20] trains the student DNN to simply mimic the intermediate results of the teacher DNN. In a DNN, however, there are many ways to solve the problem of producing the output from the input. From this point of view, mimicking the features generated by the teacher DNN can be a hard constraint for the student DNN.
With people, the teacher explains the solution process for a problem and the student learns the overall flow of the solution procedure. The student DNN does not necessarily have to learn the intermediate output for a specific question, but it can learn how to reach a solution when a certain type of question is given. In this sense the authors believe that demonstrating the solution process for a problem gives better generalization than teaching the intermediate outputs, and the proposed method is built on this belief.

3.2. Mathematical Expression of the Distilled Knowledge

The flow of the solution procedure is defined by the relationship between two intermediate results. In a DNN, this relationship can be considered mathematically as the direction between the features of two layers. The authors design the FSP matrix to represent the flow of the solution process. The FSP matrix $G\in {\mathbb{R}}^{m\times n}$ is generated from the features of two layers. The feature map produced by one of the selected layers is $F^{1}\in {\mathbb{R}}^{h\times w\times m}$, where $h$, $w$, and $m$ are the height, the width, and the number of channels. The feature map produced by the other selected layer is $F^{2}\in {\mathbb{R}}^{h\times w\times n}$. The FSP matrix $G\in {\mathbb{R}}^{m\times n}$ is then computed as follows.

$G_{i,j}(x;W)=\sum_{s=1}^{h}\sum_{t=1}^{w}\frac{F_{s, t, i}^{1}(x;W)\times F_{s, t, j}^{2}(x;W)}{h\times w}$ , (1)

Here $x$ and $W$ are the input image and the weights of the DNN. The paper prepares residual networks with 8, 26, and 32 layers trained on the CIFAR-10 dataset. In a residual network trained on CIFAR-10 there are three points where the spatial size changes. As shown in Figure 2, the paper selects several points at which to generate the FSP matrix.
- The paper selects three locations.
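Equation (1) can be sketched directly in code. The block below is a minimal illustration, not the authors' implementation: it uses plain nested Python lists so it stays dependency-free, indexes feature maps as `f[s][t][channel]`, and the function name `fsp_matrix` is a hypothetical choice.

```python
def fsp_matrix(f1, f2):
    """FSP matrix G (m x n) of equation (1).

    f1 has shape h x w x m and f2 has shape h x w x n; both must share
    the same spatial size h x w.
    """
    h, w = len(f1), len(f1[0])
    if len(f2) != h or len(f2[0]) != w:
        raise ValueError("both feature maps must share the same h x w")
    m, n = len(f1[0][0]), len(f2[0][0])
    g = [[0.0] * n for _ in range(m)]
    for s in range(h):
        for t in range(w):
            for i in range(m):
                for j in range(n):
                    # Inner product over spatial positions (s, t).
                    g[i][j] += f1[s][t][i] * f2[s][t][j]
    # Normalize by the number of spatial positions.
    return [[value / (h * w) for value in row] for row in g]


# Toy example: a 1 x 2 spatial grid, m = 2 input channels, n = 1 output channel.
f1 = [[[1.0, 2.0], [3.0, 4.0]]]
f2 = [[[1.0], [2.0]]]
G = fsp_matrix(f1, f2)  # [[3.5], [5.0]]
```

Note that the size of $G$ depends only on the channel counts $m$ and $n$, not on the spatial size, which is why teacher and student FSP matrices can be compared as long as their channel counts match.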

3.3. Loss for the FSP Matrix

To help the student network (that is, to improve its performance), the authors transfer the distilled knowledge from the teacher network. As explained above, the proposed method expresses the distilled knowledge as FSP matrices, which contain information about the flow of the solution procedure. Suppose the teacher network generates $n$ FSP matrices $G_{i}^{T},\; i=1,\; \ldots\; ,\; n$ and the student network generates $n$ FSP matrices $G_{i}^{S},\; i=1,\; \ldots\; ,\; n$. This work only considers pairs of FSP matrices, one from the teacher and one from the student, that have the same spatial size. The authors use the squared L2 norm as the cost function for each pair of FSP matrices. The cost function for the task of transferring the distilled knowledge is as follows.

$L_{FSP}(W_{t}, W_{s})=\frac{1}{N}\sum_{x}\sum_{i=1}^{n}\lambda_{i}\times \parallel (G_{i}^{T}(x;W_{t})-G_{i}^{S}(x;W_{s})) \parallel_{2}^{2}$ , (2)

Here $\lambda_{i}$ and $N$ are the weight of each loss term and the number of data points. The paper assumes that all loss terms are equally important, so the same $\lambda_{i}$ value is used in all experiments.
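Equation (2) can be written out as a short function. This is a minimal sketch under two assumptions that are mine rather than the paper's: the squared L2 norm of a matrix is taken as the sum of squared entries, and the FSP matrices are passed in precomputed as nested lists. The names `squared_l2` and `fsp_loss` are hypothetical.

```python
def squared_l2(a, b):
    # Sum of squared entry-wise differences between two matrices.
    return sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def fsp_loss(teacher_fsp, student_fsp, lambdas):
    """L_FSP of equation (2).

    teacher_fsp[x][i] and student_fsp[x][i] are the i-th FSP matrices
    (G_i^T and G_i^S) for data point x; lambdas[i] weights the i-th pair.
    """
    n_points = len(teacher_fsp)
    total = 0.0
    for g_t_list, g_s_list in zip(teacher_fsp, student_fsp):
        for lam, g_t, g_s in zip(lambdas, g_t_list, g_s_list):
            total += lam * squared_l2(g_t, g_s)
    return total / n_points


# Two data points, one FSP pair each, equal weights as in the paper.
teacher = [[[[1.0, 0.0]]], [[[0.0, 2.0]]]]
student = [[[[0.0, 0.0]]], [[[0.0, 0.0]]]]
loss = fsp_loss(teacher, student, lambdas=[1.0])  # (1 + 4) / 2 = 2.5
```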

3.4. Learning Procedure

The proposed transfer method uses the distilled knowledge generated by the teacher network. To make clear what the teacher network is, the paper defines two conditions. First, the teacher network must be pretrained on some dataset. This dataset may be the same as or different from the one the student network will be trained on. In the transfer learning case, the teacher network is trained on a different dataset from the student network. Second, the teacher network may be deeper or shallower than the student network. In the paper, however, the teacher network is either as deep as the student network or deeper.
The learning procedure is split into two training stages. First, the loss function $L_{FSP}$ is minimized so that the FSP matrices of the student network match those of the teacher network. After the first stage, the student network is trained in the second stage with the loss of the main task. Because this work uses a classification task to verify the effectiveness of the proposed method, the softmax cross entropy loss $L_{ori}$ is used as the main task loss. The learning procedure is described in Algorithm 1 below.
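The two-stage structure can be sketched as two consecutive optimization loops, where the weights produced by stage 1 become the initial weights of stage 2. The code below is only a toy illustration: a single scalar weight stands in for the student network, plain gradient descent stands in for the actual optimizer, and the two quadratic losses are made-up stand-ins for $L_{FSP}$ and $L_{ori}$.

```python
def train_two_stage(w, grad_fsp, grad_ori, lr1, lr2, steps1, steps2):
    """Stage 1 minimizes the FSP loss, stage 2 minimizes the main task loss."""
    # Stage 1: make the student's FSP matrices match the teacher's.
    for _ in range(steps1):
        w = w - lr1 * grad_fsp(w)
    w_after_stage1 = w
    # Stage 2: train on the main task, starting from the stage-1 weights.
    for _ in range(steps2):
        w = w - lr2 * grad_ori(w)
    return w_after_stage1, w


# Toy losses (assumptions for illustration only):
#   L_FSP(w) = (w - 2)^2 and L_ori(w) = (w - 3)^2.
def toy_grad_fsp(w):
    return 2.0 * (w - 2.0)


def toy_grad_ori(w):
    return 2.0 * (w - 3.0)


w_init, w_final = train_two_stage(0.0, toy_grad_fsp, toy_grad_ori, 0.1, 0.1, 100, 100)
```

The point of the sketch is the ordering: the two losses are not summed, so $L_{FSP}$ acts purely as an initialization method for the main task.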

4. Experiment

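The paper runs three experiments to verify the effectiveness of the proposed method. In all experimental settings a deep residual network [8] is used as the base architecture. Interestingly, a deep residual network can form an ensemble structure because of its shortcut connections [24]. In addition, shortcut connections make it possible to train deeper networks. For these two reasons many studies apply residual networks to a variety of tasks. Figure 2 shows the base structure of the deep residual network used in the experiments. The network consists of several sections, within each of which zero padding keeps the feature maps at the same spatial size. For example, the deep residual network in Figure 2 is divided into three sections. Although there is no particular restriction on which two layers are chosen to build the FSP matrix, the paper uses the first and last layers of a section. Also, because the FSP matrix is generated from the features of two layers with the same spatial size, the experiments use a max pooling layer to equalize the sizes when the two features have different spatial sizes.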
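The max pooling step mentioned above can be sketched as follows. This is an illustrative version on nested lists indexed as `feature_map[s][t][channel]`; the window size of 2 with a matching stride is an assumption, since the text does not state the pooling parameters.

```python
def max_pool(feature_map, size=2):
    """Non-overlapping max pooling over an h x w x c feature map."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [
        [
            [
                # Maximum over the size x size window, per channel.
                max(feature_map[s + ds][t + dt][k]
                    for ds in range(size) for dt in range(size))
                for k in range(c)
            ]
            for t in range(0, w - size + 1, size)
        ]
        for s in range(0, h - size + 1, size)
    ]


# A 2 x 2 map with one channel pools down to a 1 x 1 map.
pooled = max_pool([[[1.0], [5.0]], [[3.0], [2.0]]])  # [[[5.0]]]
```

After pooling the larger feature map down to the size of the smaller one, both maps share the same $h\times w$ and equation (1) can be applied.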
The paper evaluates the proposed knowledge transfer technique on three representative tasks. Section 4.1 shows that a student network that has learned the flow of the solution procedure trains faster on the task than an ordinary model. Section 4.2 shows that the FSP matrices generated by the teacher network let the student network outperform the same model trained on its own. In these experiments the teacher and student networks are trained on the same task with the same dataset. Section 4.3 applies these ideas to a transfer learning task.
In all experiments the proposed model is compared with the existing knowledge transfer model FitNet [20]. In the first stage of FitNet, hint-based training is implemented by setting the hint and guided layers to the middle layer of each DNN and minimizing the L2 loss between the outputs of the two layers for 35,000 iterations. The learning rate starts at 1e-4 and is changed to 1e-5 after 25,000 iterations. For a fair accuracy comparison, the second stage of FitNet uses the same number of iterations and the same learning rate. In this stage the softening factor tau is set to 3, and the lambda of the KD loss function decreases linearly from 4 to 1.
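The two KD settings in the FitNet baseline, the softening factor tau and the linearly decaying lambda, can be sketched as below. This is an illustration rather than the baseline's actual code: the function names are hypothetical, and decaying lambda over the whole second stage is an assumption, since the text only gives the start and end values.

```python
import math


def softened_softmax(logits, tau):
    """Softmax of logits / tau; a larger tau gives a softer distribution."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_lambda(iteration, total_iterations, start=4.0, end=1.0):
    """Linear decay of the KD loss weight from start to end."""
    fraction = min(max(iteration / total_iterations, 0.0), 1.0)
    return start + (end - start) * fraction


logits = [2.0, 1.0, 0.0]
hard = softened_softmax(logits, tau=1.0)
soft = softened_softmax(logits, tau=3.0)  # flatter than `hard`
```

With tau = 3 the teacher's output puts more probability mass on the non-maximal classes, which is the extra signal KD passes to the student.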

4.1. Fast optimization

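Recent DNNs keep getting deeper to improve performance, so training takes several days [26, 8]. Moreover, even though training a DNN already takes a long time, many papers use ensemble models to improve on the performance of a single DNN [23]. An ensemble of n DNNs takes n times as long to train. For these reasons fast optimization techniques have recently become important.
First, the teacher DNN is prepared by training it with the normal training procedure. The teacher DNN is then used to train the student network as described in section 3.4. The paper creates several student networks from a single teacher network. The final goal of the proposed fast optimization is to train an ensemble of student networks that matches the performance of the teacher network in less training time than the normal training procedure.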

4.1.1 CIFAR-10

The CIFAR-10 dataset [15] contains 50 000 training images with 5000 images per class and 10 000 test images with 1000 images per class. The CIFAR-10 dataset comprises 32 × 32 pixel RGB images with 10 classes. However, we padded 4 pixels on each side to make the image size 40×40 pixels. Randomly cropped 32 × 32 pixel images were used for training, and the original 32×32 pixel images were used for testing.
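The pad-and-crop augmentation described above can be sketched as follows. This is an illustration on a nested-list image whose pixels are RGB tuples; filling the padding with zeros is an assumption (the text only says the image is padded by 4 pixels on each side), and the function name is hypothetical.

```python
import random


def pad_and_random_crop(image, pad=4, crop=32, fill=(0, 0, 0), rng=random):
    """Pad an H x W image on every side, then cut a random crop x crop patch."""
    h, w = len(image), len(image[0])
    padded_w = w + 2 * pad
    padded = [[fill] * padded_w for _ in range(pad)]
    padded += [[fill] * pad + list(row) + [fill] * pad for row in image]
    padded += [[fill] * padded_w for _ in range(pad)]
    # randint is inclusive on both ends, so every valid offset can be drawn.
    top = rng.randint(0, h + 2 * pad - crop)
    left = rng.randint(0, padded_w - crop)
    return [row[left:left + crop] for row in padded[top:top + crop]]


image = [[(1, 1, 1)] * 32 for _ in range(32)]
patch = pad_and_random_crop(image, rng=random.Random(0))
```

For a 32 × 32 input the padded image is 40 × 40, so the crop offset ranges over 0 to 8 in each direction, which shifts the original content by up to 4 pixels either way.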
The experiment uses a 26-layer residual network as the teacher DNN, which reaches 92% accuracy on CIFAR-10 [8]. The same architecture is used for the student DNN. For the teacher network the learning rate starts at 0.1 and is lowered to 0.01 and 0.001 at 32,000 and 48,000 iterations, with training ending at 64,000 iterations. A weight decay of 0.0001 and a momentum of 0.9 are used, together with MSRA initialization [9] and BN [12].
The student network has the same architecture as the teacher, and, as in Algorithm 1, the teacher network is used in stage 1 to obtain the student's initial weights. In stage 1, learning rates of 0.001, 0.0001, and 0.00001 are used until 11,000, 16,000, and 21,000 iterations, respectively, with a weight decay of 0.0001 and a momentum of 0.9. The student DNN is then trained with the normal procedure, starting from the weights obtained at the end of stage 1. Note that the paper trains several student networks in stage 2, all starting from the same stage-1 weights. Because the weights learned in stage 1 are copied into many student networks as initial weights, stage 1 is an efficient way to initialize many student networks. One drawback of sharing the same initial weights across all student networks is that the networks can end up more correlated with each other than if each had been initialized independently.
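The learning rate schedules above are piecewise-constant, which can be written as a small lookup. This is a minimal sketch with a hypothetical function name; the boundaries and rates are the ones quoted in the text, and returning 0 after the last boundary simply marks the end of training.

```python
def step_learning_rate(iteration, boundaries, rates, final_rate=0.0):
    """Return rates[k] while iteration < boundaries[k], else final_rate."""
    for boundary, rate in zip(boundaries, rates):
        if iteration < boundary:
            return rate
    return final_rate


# Teacher: 0.1 -> 0.01 -> 0.001, switching at 32k and 48k, ending at 64k.
TEACHER = ((32000, 48000, 64000), (0.1, 0.01, 0.001))
# Student stage 1: 0.001 -> 0.0001 -> 0.00001 until 11k, 16k, 21k.
STUDENT_STAGE1 = ((11000, 16000, 21000), (0.001, 0.0001, 0.00001))

lr_teacher_start = step_learning_rate(0, *TEACHER)            # 0.1
lr_teacher_mid = step_learning_rate(32000, *TEACHER)          # 0.01
lr_student_end = step_learning_rate(20999, *STUDENT_STAGE1)   # 0.00001
```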

Figure 3 shows the test accuracy and the training loss over the whole training time. The student network is optimized faster than the teacher network and enters the saturation region about three times faster. Because the teacher network was initialized with the state-of-the-art MSRA initialization technique rather than a naive initialization method, the authors conclude that the FSP matrix provides good distilled knowledge that helps initialize the weights of the student network.
To verify fast optimization, the student network is trained in stage 2 for only about one third of the original number of iterations. In stage 2, we used learning rates of 0.1, 0.01, and 0.001 until 11 000, 16 000, and 21 000 iterations, which are less than one-third the original number of iterations. The results in Table 1 show that, with the proposed method, one third of the original iterations is enough for training. Even with fewer iterations, the student network trained with the proposed method outperforms both the teacher network and FitNet.
The paper also runs FitNet in two variants, one applying 3 losses at 3 intermediate layers and one applying 1 loss at the middle layer. As Table 1 shows, the 1-loss variant performs better than the 3-loss variant.

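The proposed method lets the whole network be decomposed into a few modules, and the behavior of each module is captured by an FSP matrix. If the FSP matrix of a student module is similar to that of the teacher network, the student module behaves similarly to the corresponding module in the teacher network. In addition, each module can be trained independently from the correlation between its own input and output, even if the other modules are not yet fully trained. In contrast, the upper modules of the three-loss FitNet, which are trained by matching only the module outputs without considering the input-output relationship, train inefficiently until the lower modules of the student network are trained well enough for the inputs to those upper modules to be meaningful. This explains why the one-loss FitNet performs better than the three-loss variant. In the three-loss FitNet the network has four modules, and the second and third modules are hard to train from intermediate results. FSP is also less restrictive than FitNet. If the student and teacher networks have the same intermediate feature maps, they have the same FSP matrices. The converse is not true: even with identical FSP matrices, the feature maps can differ.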
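The last point, that equal FSP matrices do not imply equal feature maps, is easy to check on a toy case. The sketch below is my own illustration, not an example from the paper: swapping two spatial positions in both feature maps at once changes the maps but leaves every sum in equation (1) unchanged. The compact `fsp_matrix` helper is a hypothetical implementation of that equation on nested lists.

```python
def fsp_matrix(f1, f2):
    # Equation (1) on nested lists indexed as f[s][t][channel].
    h, w = len(f1), len(f1[0])
    m, n = len(f1[0][0]), len(f2[0][0])
    return [
        [sum(f1[s][t][i] * f2[s][t][j] for s in range(h) for t in range(w)) / (h * w)
         for j in range(n)]
        for i in range(m)
    ]


# Original feature maps on a 1 x 2 spatial grid.
f1 = [[[1.0, 2.0], [3.0, 4.0]]]
f2 = [[[1.0], [2.0]]]

# Swap the two spatial positions in both maps simultaneously.
f1_swapped = [[f1[0][1], f1[0][0]]]
f2_swapped = [[f2[0][1], f2[0][0]]]

same_fsp = fsp_matrix(f1, f2) == fsp_matrix(f1_swapped, f2_swapped)  # True
same_features = f1 == f1_swapped                                     # False
```

Matching FSP matrices therefore constrains how two layers relate to each other without pinning down the individual activations, which is the sense in which FSP is a softer constraint than FitNet's feature matching.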
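Because the teacher and student networks have the same architecture, the information of one network can also be passed to the other by copying the weights directly. The paper compares weight copying with knowledge transfer. To do so, the authors train three copies of the teacher network for an additional 21k iterations, which is equivalent to copying the weights from a single teacher network and continuing training from there. As Table 1 shows, this performs worse than student *. For Teacher ‡ in Table 1, each individual network is slightly better than the original teacher, but the ensemble performs poorly. FSP is therefore less restrictive than copying the weights directly and gives better diversity and ensemble performance.
Moreover, the ensemble of student networks trained with fewer iterations performs similarly to the ensemble of teacher networks, whereas FitNet does not. Although the student ensemble comes close to the teacher ensemble, the gain from ensembling is smaller for the students (92.14→93.26) than for the teachers (91.75→93.48). This is because the student networks share initial weights and are therefore more strongly correlated.
The paper develops a very simple but effective way to train less correlated student networks from the same single teacher network. The idea is that one can generate several FSP matrices that are essentially the same but clearly different. Giving the student networks different FSP matrices instead of a shared one reduces the correlation between them. The FSP matrix is generated from the features of two selected layers. Note that the feature channels of the teacher network can be permuted to obtain an equivalent teacher network that works in essentially the same way. In other words, the rows or columns of the FSP matrix can be shuffled without affecting the transfer of the distilled knowledge. The different FSP matrices obtained by shuffling rows and columns can be used in stage 1 to produce multiple student networks with different initial weights. This lowers the correlation between the student networks after stage 2 and gives a better ensemble. As Table 1 shows, the ensemble of student networks trained with randomly shuffled FSP matrices outperforms the ensemble of teacher networks, even with fewer iterations.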
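The equivalence between shuffling feature channels and shuffling rows of the FSP matrix can be checked directly. The sketch below is an illustration with made-up numbers and a hypothetical `fsp_matrix` helper implementing equation (1): permuting the channels of the first feature map permutes the rows of $G$ in the same way (permuting the channels of the second map would permute the columns).

```python
def fsp_matrix(f1, f2):
    # Equation (1) on nested lists indexed as f[s][t][channel].
    h, w = len(f1), len(f1[0])
    m, n = len(f1[0][0]), len(f2[0][0])
    return [
        [sum(f1[s][t][i] * f2[s][t][j] for s in range(h) for t in range(w)) / (h * w)
         for j in range(n)]
        for i in range(m)
    ]


# Toy teacher features: 1 x 2 spatial grid, 3 and 2 channels.
f1 = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
f2 = [[[1.0, 0.5], [2.0, 1.5]]]

perm = [2, 0, 1]  # new channel k takes old channel perm[k]
f1_shuffled = [[[pixel[k] for k in perm] for pixel in row] for row in f1]

G = fsp_matrix(f1, f2)
G_shuffled = fsp_matrix(f1_shuffled, f2)
rows_permuted = G_shuffled == [G[k] for k in perm]  # True
```

Each permutation therefore yields a different target matrix that still describes the same teacher, which is what lets several students start stage 2 from different initial weights.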
In terms of wall-clock training time rather than iteration count, the original model trains at 16s/100iter, while the proposed model trains at 35s/100iter in stage 1. In total, training three teacher DNNs with the original method takes 8.6 hours, and training three student DNNs with the proposed method takes 4.84 hours, so the proposed method is 1.78 times faster. With a more efficient implementation (for example, storing the FSP matrices and reusing them instead of recomputing them every time, which took 19s/100iter), student * and student *† could be trained 2.18 times and 1.39 times faster, respectively.

4.1.2 CIFAR-100

The CIFAR-100 dataset uses 50 000 training images with 500 images per class and 10 000 test images with 100 images per class. The CIFAR-100 dataset contains 32 × 32 pixel RGB images with 100 classes. Because there are few images per class across the 100 classes, a 32-layer residual network is used, with four times as many channels as the model described in section 4.1.1.
To create a different experimental condition, the augmentation methods used for CIFAR-10 are not applied. The teacher and student networks use the same parameters as in section 4.1.1. The only difference was that we used learning rates of 0.001, 0.0001, and 0.00001 until 16 000, 24 000, and 32 000 iterations, respectively, in stage 1.

Table 2 shows the recognition rates under the different settings. The second column from the right shows the results of the ensemble of three DNNs. Comparing the 32-layer residual network, which averages 64.15% accuracy, with the same architecture trained for one third of the iterations, which averages 61.32%, shows that the number of iterations matters a great deal for performance. Nevertheless, even though the student network uses fewer training iterations, the student network trained with the distilled knowledge from the teacher network performs on par with the original teacher network.
The paper also compares the proposed method with FitNet. In Table 2, the student network trained with FitNet outperforms the teacher network with fewer iterations. For the three-network ensembles, however, the teacher networks trained with fewer iterations (Teacher* Ensemble) and the student networks trained with FitNet (FitNet* Ensemble) reach similar accuracy (67.2 and 67.6), so there is no large difference between them. Table 2 thus shows that, in terms of both performance and number of iterations, the proposed method is considerably more efficient than the existing FitNet method.

4.2. Performance improvement for the small DNN

Many recent studies use deep neural networks with large numbers of parameters to improve performance. For example, [10] uses a residual network with more than 1000 layers, and Wide-ResNet [26] increases the width of the network. This increases the amount of computation and calls for high-end systems. It also requires many training iterations. Research on improving the performance of small DNNs is therefore very important.
The paper runs experiments to check whether the proposed method works for DNNs of different sizes. The goal is to improve the performance of a small student network using the distilled knowledge of a deep teacher network. To stress the point again, the small network is a shallow network with fewer weights. As shown in Figure 2, the teacher is deeper than the student. The student network is built by reducing the number of residual modules in the teacher DNN, so the student DNN has fewer parameters than the teacher DNN.
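The relation between the number of residual modules and the layer counts quoted in this post (8, 14, 26, and 32) can be sketched with a small calculation. It assumes the CIFAR-style residual network design of [8]: three sections with the same number of residual modules each, two conv layers per module, plus the first conv layer and the final fc layer. Whether the paper's networks follow exactly this layout is an assumption; the function name is hypothetical.

```python
def cifar_resnet_depth(modules_per_section, sections=3, convs_per_module=2):
    """Layer count of a CIFAR-style residual network: 6n + 2 for n modules."""
    # +2 accounts for the first conv layer and the final fc layer.
    return sections * modules_per_section * convs_per_module + 2


depths = {n: cifar_resnet_depth(n) for n in (1, 2, 4, 5)}
# {1: 8, 2: 14, 4: 26, 5: 32}
```

Under this assumption, going from the 26-layer teacher to the 8-layer student means dropping from four residual modules per section to one, while the channel counts, and therefore the FSP matrix sizes, stay the same.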
The training procedure is the same as described in section 4.1. Because the student DNN and the teacher DNN have the same number of channels, the FSP matrices they produce have the same size. Minimizing the distance between the FSP matrices of the student DNN and the teacher DNN gives good initial weights for the student network, which is then trained on the main task (classification) starting from those initial weights.

4.2.1 CIFAR-10

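The experiment uses a 26-layer residual network as the teacher DNN and an 8-layer residual network as the student DNN. The parameter settings and the training procedure are the same as in section 4.1.1, except for the number of training iterations in stage 2: the student DNN is trained for the same number of iterations as the teacher DNN.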

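For a fair comparison, a student DNN trained end-to-end is also prepared. As Table 3 shows, the model trained with the proposed distilled knowledge performs better than the student DNN trained in the normal way. This means that the distilled knowledge produced by the teacher DNN is useful information for the shallow student DNN. The paper concludes that the proposed method is more useful than the existing method.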

4.2.2 CIFAR-100

The network minimization ability is also demonstrated on the CIFAR-100 dataset. Under conditions similar to section 4.1.2, a 32-layer and a 14-layer residual network are used as the teacher and the student, respectively. All experiments in this section use 64k iterations.

Table 4 shows the recognition rates under the different settings. Because no augmentation is applied, the teacher DNN reaches 64% accuracy, and the student DNN trained in the normal way reaches only 58.65%. Remarkably, with the proposed method the performance of the student DNN improves to a level very close to that of the teacher DNN. Existing knowledge distillation methods such as FitNet also improve performance, but the comparison clearly shows that the proposed knowledge distillation method gives the larger gain.

Transfer Learning

This section describes an application of the proposed method. The teacher DNN and the student DNN can be trained on the same task, but they can also be trained on different tasks. To show this, the proposed method is applied to a transfer learning task. Transfer learning has been widely used when only a dataset too small to learn useful features is available. In such cases a DNN pretrained on the very large ImageNet dataset is usually used. However, pretrained DNNs are mostly very large and store many weights, which means that a high-end device is needed to get good performance on the small dataset. If the distilled knowledge can be transferred to a small DNN, this becomes an efficient answer to that problem.

First, a 34-layer residual DNN [8] trained on the ImageNet dataset is prepared. For the task with a small number of images, the CUB200-2011 dataset [25] is used. CUB200-2011 contains 11,788 images of 200 bird subcategories. With so few images per class, it is hard to train a well-performing network on this dataset alone. Table 5 shows that even a relatively deep 34-layer residual DNN performs very poorly when trained from scratch on this dataset only.
For the shallow DNN, a 20-layer residual DNN is used. The 34-layer residual DNN consists of four parts, each of which produces features of the same spatial size within the part. The four parts contain 3, 4, 6, and 3 residual modules, respectively. The 20-layer student DNN contains 2, 2, 3, and 2 residual modules in its four parts. For all settings, we used learning rates of 0.1, 0.01, and 0.001 up to 10 000, 20 000, and 30 000 iterations. Fine-tuning usually uses a small learning rate, but starting from a learning rate of 0.1 performed better than starting from 0.001, so the experiments use the setting above.

Because the proposed method transfers the FSP matrices of the teacher DNN to the student DNN, stage 1 uses learning rates of 0.1, 0.01, and 0.001 until 11k, 16k, and 21k iterations, respectively. In this stage an FSP matrix is extracted from each part of each DNN. As Table 5 shows, fine-tuning with the proposed method brings the student DNN to a high level of performance close to that of the teacher DNN. Considering that the student DNN is about 1.7 times shallower than the teacher DNN, the proposed method is a very efficient way to transfer knowledge even when the teacher and the student have different tasks.
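The layer counts and the 1.7× depth ratio follow from the residual module counts given above. The sketch below assumes the usual basic-block layout of the 34-layer residual network in [8]: two conv layers per residual module, plus the first conv layer and the final fc layer. The function name is hypothetical.

```python
def resnet_depth(modules_per_part, convs_per_module=2):
    """Layer count from the number of residual modules in each part."""
    # +2 accounts for the first conv layer and the final fc layer.
    return convs_per_module * sum(modules_per_part) + 2


teacher_depth = resnet_depth([3, 4, 6, 3])   # 34
student_depth = resnet_depth([2, 2, 3, 2])   # 20
depth_ratio = teacher_depth / student_depth  # 1.7
```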

Conclusion

The paper proposes a new approach to generating distilled knowledge from a DNN. By defining the distilled knowledge as the flow of the solving procedure, computed with the proposed FSP matrix, the proposed method outperforms other SOTA knowledge transfer methods. The paper verifies the effectiveness of the proposed method in three important respects. The proposed method optimizes a DNN faster (faster training) and brings it to a higher level of performance. It is also applicable to transfer learning tasks.


Seongkyun Han's blog
