๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
Deep Learning/[๋…ผ๋ฌธ] Paper Review

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

by ์ œ๋ฃฝ 2023. 7. 5.

 

 

 

๐Ÿ’ก
<๋ฒˆ์—ญ>
  1. Inception-v1
  1. Inception-v2

 

  3. Inception-v3
  • Inception-v3 is the model obtained by adding the techniques described above to the Inception-v2 structure one by one, measuring the performance of each, and then applying all of them together for the best result
  • Inception-v3 is Inception-v2 with BN-auxiliary + RMSProp + Label Smoothing + Factorized 7x7 all applied

<Reference for Inception-v2 and v3>

[๋…ผ๋ฌธ ์ฝ๊ธฐ] Inception-v3(2015) ๋ฆฌ๋ทฐ, Rethinking the Inception Architecture for Computer Vision
์ด๋ฒˆ์— ์ฝ์–ด๋ณผ ๋…ผ๋ฌธ์€ Rethinking the Inception Architecture for Computer Vision ์ž…๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” Inception-v2์™€ Inception-v3์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ, ๋ชจ๋ธ ํฌ๊ธฐ๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๋ฉด ์ •ํ™•๋„์™€ ์—ฐ์‚ฐ๋Ÿ‰์ด ์ฆ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ๋“ค์–ด, ResNet์€ skip connection์„ ํ™œ์šฉํ•ด์„œ ๋ชจ๋ธ์˜ ๊นŠ์ด๋ฅผ ์ฆ๊ฐ€์‹œ์ผœ ์„ฑ๋Šฅ์„ ๋Œ์–ด์˜ฌ๋ ธ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๊นŠ์–ด์ง„ ๋งŒํผ ์—ฐ์‚ฐ๋Ÿ‰์ด ๋งŽ์•„์ ธ ํ•™์Šตํ•˜๋Š”๋ฐ์— ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฝ๋‹ˆ๋‹ค. ์ด์ฒ˜๋Ÿผ ๋ชจ๋ธ ํฌ๊ธฐ๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๋ฉด ์—ฐ์‚ฐ๋Ÿ‰์ด ์ฆ๊ฐ€ํ•˜๊ฒŒ ๋˜๋Š”๋ฐ, ์ด๋Š” mobile์ด๋‚˜ ์ œํ•œ๋œ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ํ™œ์šฉํ•ด์•ผ ํ• ๋•Œ, ๋‹จ์ ์œผ๋กœ ์ž‘์šฉํ•ฉ๋‹ˆ๋‹ค. ์ €์ž๋Š” convolution ๋ถ„ํ•ด๋ฅผ ํ™œ์šฉํ•ด์„œ ์—ฐ์‚ฐ๋Ÿ‰์ด ์ตœ์†Œํ™” ๋˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๋ชจ๋ธ์˜ ํฌ๊ธฐ๋ฅผ ํ‚ค์šฐ๋Š”๋ฐ ์ง‘์ค‘ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ..
https://deep-learning-study.tistory.com/517
0. Abstract

Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there is any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4, we achieve 3.08% top-5 error on the test set of the ImageNet classification (CLS) challenge.

1. Introduction

Since the 2012 ImageNet competition [11] winning entry by Krizhevsky et al. [8], their network "AlexNet" has been successfully applied to a larger variety of computer vision tasks, for example to object detection [4], segmentation [10], human pose estimation [17], video classification [7], object tracking [18], and super-resolution [3]. These examples are but a few of all the applications to which deep convolutional networks have been very successfully applied ever since. In this work we study the combination of the two most recent ideas: residual connections introduced by He et al. in [5] and the latest revised version of the Inception architecture [15]. In [5], it is argued that residual connections are of inherent importance for training very deep architectures. Since Inception networks tend to be very deep, it is natural to replace the filter concatenation stage of the Inception architecture with residual connections. This would allow Inception to reap all the benefits of the residual approach while retaining its computational efficiency. Besides a straightforward integration, we have also studied whether Inception itself can be made more efficient by making it deeper and wider. For that purpose, we designed a new version named Inception-v4 which has a more uniform, simplified architecture and more Inception modules than Inception-v3. Historically, Inception-v3 had inherited a lot of the baggage of the earlier incarnations. The technical constraints chiefly came from the need for partitioning the model for distributed training using DistBelief [2]. Now, after migrating our training setup to TensorFlow [1], these constraints have been lifted, which allowed us to simplify the architecture significantly. The details of that simplified architecture are described in Section 3.

In this report, we will compare the two pure Inception variants, Inception-v3 and v4, with similarly expensive hybrid Inception-ResNet versions. Admittedly, those models were picked in a somewhat ad hoc manner, with the main constraint being that the parameters and computational complexity of the models should be somewhat similar to the cost of the non-residual models. In fact we have tested bigger and wider Inception-ResNet variants and they performed very similarly on the ImageNet classification challenge [11] dataset. The last experiment reported here is an evaluation of an ensemble of all the best performing models presented here. As it was apparent that both Inception-v4 and Inception-ResNet-v2 performed similarly well, exceeding state-of-the-art single-frame performance on the ImageNet validation dataset, we wanted to see how a combination of those pushes the state of the art on this well studied dataset. Surprisingly, we found that gains on the single-frame performance do not translate into similarly large gains on ensembled performance. Nonetheless, it still allows us to report 3.1% top-5 error on the validation set with four models ensembled, setting a new state of the art, to our best knowledge. In the last section, we study some of the classification failures and conclude that the ensemble still has not reached the label noise of the annotations on this dataset and there is still room for improvement for the predictions.

2. Related Work

Convolutional networks have become popular in large scale image recognition tasks after Krizhevsky et al. [8]. Some of the next important milestones were Network-in-Network [9] by Lin et al., VGGNet [12] by Simonyan et al. and GoogLeNet (Inception-v1) [14] by Szegedy et al. Residual connections were introduced by He et al. in [5], in which they give convincing theoretical and practical evidence for the advantages of utilizing additive merging of signals both for image recognition, and especially for object detection. The authors argue that residual connections are inherently necessary for training very deep convolutional models. Our findings do not seem to support this view, at least for image recognition. However it might require more measurement points with deeper architectures to understand the true extent of beneficial aspects offered by residual connections. In the experimental section we demonstrate that it is not very difficult to train competitive very deep networks without utilizing residual connections. However the use of residual connections seems to improve the training speed greatly, which is alone a great argument for their use. The Inception deep convolutional architecture was introduced in [14] and was called GoogLeNet or Inception-v1 in our exposition. Later the Inception architecture was refined in various ways, first by the introduction of batch normalization [6] (Inception-v2) by Ioffe et al. Later the architecture was improved by additional factorization ideas in the third iteration [15], which will be referred to as Inception-v3 in this report.

3. Architectural Choices

3.1. Pure Inception blocks

Our older Inception models used to be trained in a partitioned manner, where each replica was partitioned into multiple sub-networks in order to be able to fit the whole model in memory. However, the Inception architecture is highly tunable, meaning that there are a lot of possible changes to the number of filters in the various layers that do not affect the quality of the fully trained network. In order to optimize the training speed, we used to tune the layer sizes carefully in order to balance the computation between the various model sub-networks. In contrast, with the introduction of TensorFlow our most recent models can be trained without partitioning the replicas. This is enabled in part by recent optimizations of memory used by backpropagation, achieved by carefully considering what tensors are needed for gradient computation and structuring the computation to reduce the number of such tensors. Historically, we have been relatively conservative about changing the architectural choices and restricted our experiments to varying isolated network components while keeping the rest of the network stable. Not simplifying earlier choices resulted in networks that looked more complicated than they needed to be. In our newer experiments, for Inception-v4 we decided to shed this unnecessary baggage and made uniform choices for the Inception blocks for each grid size. Please refer to Figure 9 for the large scale structure of the Inception-v4 network and Figures 3, 4, 5, 6, 7 and 8 for the detailed structure of its components. All the convolutions not marked with "V" in the figures are same-padded, meaning that their output grid matches the size of their input. Convolutions marked with "V" are valid-padded, meaning that the input patch of each unit is fully contained in the previous layer and the grid size of the output activation map is reduced accordingly.

3.2. Residual Inception Blocks

For the residual versions of the Inception networks, we use cheaper Inception blocks than the original Inception. Each Inception block is followed by a filter-expansion layer (1 × 1 convolution without activation) which is used for scaling up the dimensionality of the filter bank before the addition to match the depth of the input. This is needed to compensate for the dimensionality reduction induced by the Inception block. We tried several versions of the residual version of Inception. Only two of them are detailed here. The first one, "Inception-ResNet-v1", roughly matches the computational cost of Inception-v3, while "Inception-ResNet-v2" matches the raw cost of the newly introduced Inception-v4 network. See Figure 15 for the large scale structure of both variants. (However, the step time of Inception-v4 proved to be significantly slower in practice, probably due to the larger number of layers.) Another small technical difference between our residual and non-residual Inception variants is that in the case of Inception-ResNet, we used batch normalization only on top of the traditional layers, but not on top of the summations. It is reasonable to expect that a thorough use of batch normalization should be advantageous, but we wanted to keep each model replica trainable on a single GPU. It turned out that the memory footprint of layers with large activation size was consuming a disproportionate amount of GPU memory. By omitting the batch normalization on top of those layers, we were able to increase the overall number of Inception blocks substantially. We hope that with better utilization of computing resources, making this trade-off will become unnecessary.

3.3. Scaling of the Residuals

Also we found that if the number of filters exceeded 1000, the residual variants started to exhibit instabilities and the network just "died" early in the training, meaning that the last layer before the average pooling started to produce only zeros after a few tens of thousands of iterations. This could not be prevented, neither by lowering the learning rate, nor by adding an extra batch normalization to this layer. We found that scaling down the residuals before adding them to the previous layer activation seemed to stabilize the training. In general we picked some scaling factors between 0.1 and 0.3 to scale the residuals before they are added to the accumulated layer activations (cf. Figure 20). A similar instability was observed by He et al. in [5] in the case of very deep residual networks, and they suggested a two-phase training where the first "warm-up" phase is done with a very low learning rate, followed by a second phase with a high learning rate. We found that if the number of filters is very high, then even a very low (0.00001) learning rate is not sufficient to cope with the instabilities, and training with a high learning rate has a chance to destroy its effects. We found it much more reliable to just scale the residuals. Even where the scaling was not strictly necessary, it never seemed to harm the final accuracy, but it helped to stabilize the training.
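A minimal sketch of the residual scaling described above, assuming a Keras-style functional graph; the factor 0.2 is just one arbitrary choice inside the 0.1 to 0.3 range reported in the paper.

```python
from tensorflow.keras import layers

def scaled_residual_add(shortcut, residual, scale=0.2):
    """Scale down the residual branch before adding it to the shortcut path."""
    scaled = layers.Lambda(lambda t: t * scale)(residual)
    return layers.Add()([shortcut, scaled])
```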

4. Training Methodology

We have trained our networks with stochastic gradient descent, utilizing the TensorFlow [1] distributed machine learning system using 20 replicas, each running on an NVidia Kepler GPU. Our earlier experiments used momentum [13] with a decay of 0.9, while our best models were achieved using RMSProp [16] with a decay of 0.9 and ε = 1.0. We used a learning rate of 0.045, decayed every two epochs using an exponential rate of 0.94. Model evaluations are performed using a running average of the parameters computed over time.
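As a rough illustration, the hyperparameters above map onto a TensorFlow/Keras optimizer roughly as follows; steps_per_epoch is an assumed placeholder that depends on the dataset and batch size, and the parameter running average used for evaluation is not shown.

```python
import tensorflow as tf

steps_per_epoch = 10000  # assumed placeholder: the real value depends on batch size

# Learning rate 0.045, decayed every two epochs with an exponential rate of 0.94.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.045,
    decay_steps=2 * steps_per_epoch,
    decay_rate=0.94,
    staircase=True,
)

# RMSProp with decay (rho) 0.9 and epsilon 1.0, as in the paper's best runs.
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=lr_schedule, rho=0.9, epsilon=1.0,
)
```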


๐Ÿ’ก
<๋ฆฌ๋ทฐ>

0. Abstract/Intro

  • ๊ธฐ์กด ์ธ์…‰์…˜ ๋ชจ๋ธ์— residual connections์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋น ๋ฅธ ํ•™์Šต์ด ๊ฐ€๋Šฅ
  • residual connections ํ•œ Inception v4๊ฐ€ ๊ธฐ์กด Inception๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์กฐ๊ธˆ ๋” ๋‚˜์Œ

 

  • Inception-v4๋Š” Inception ์‹ ๊ฒฝ๋ง์„ ์ข€ ๋” ํšจ๊ณผ์ ์œผ๋กœ ๋„“๊ณ  ๊นŠ๊ฒŒ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ๊ณ ์•ˆ๋จ
  • Inception-v3๋ณด๋‹ค ๋‹จ์ˆœํ•˜๊ณ  ํš์ผํ™”๋œ ๊ตฌ์กฐ์™€ ๋” ๋งŽ์€ Inception module์„ ์‚ฌ์šฉ
  • Inception-ResNet์€ Inception-v4์— residual connection์„ ๊ฒฐํ•ฉํ•œ ๊ฒƒ → ํ•™์Šต ์†๋„๊ฐ€ ๋นจ๋ผ์ง

 

  • inception-resnet-v1์€ inception-v3์™€ ์—ฐ์‚ฐ๋Ÿ‰์ด ๋น„์Šทํ•จ
  • inception-resnet-v2๋Š” inception-v4์™€ ์—ฐ์ƒ๋Ÿ‰์ด ๋น„์Šทํ•จ

 

⇒ ํฌ๊ฒŒ Inception v4, Inception-ResNet(inception v4 + ResNet) ๋ชจ๋ธ์„ ์„ค๋ช…ํ•จ

Inception-v4 vs Inception-ResNet

1. Inception-v4 architecture

1-1. Stem

  • After Conv, MaxPool, etc. layers, padding is needed to keep the spatial dimensions aligned
  • Layers marked with "V" use no padding (valid padding)
  • Layers without a "V" use zero padding (same padding, so the input and output sizes match); see the size calculation sketched after the link below
Padding : Valid Padding / Full Padding / Same Padding
https://ardino-lab.com/padding-valid-padding-full-padding-same-padding/
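The effect of the "V" (valid) marking can be checked with a small size calculation; a minimal sketch, assuming the usual convolution output-size formulas and the 299x299 input used by Inception-v4:

```python
import math

def conv_output_size(n, kernel, stride, padding):
    """Spatial output size of a convolution for 'same' vs 'valid' padding."""
    if padding == 'same':       # unmarked convolutions: output grid matches input (at stride 1)
        return math.ceil(n / stride)
    if padding == 'valid':      # "V"-marked convolutions: no padding, grid shrinks
        return (n - kernel) // stride + 1
    raise ValueError(padding)

# A 3x3 convolution with stride 2 on a 299x299 input:
print(conv_output_size(299, 3, 2, 'valid'))  # 149
print(conv_output_size(299, 3, 2, 'same'))   # 150
```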

1-2. Inception-A

1-3. Inception-B

1-4. Inception-C

1-5. Reduction-A

1-6. Reduction-B

2. Inception + ResNet

  • Inception network์™€ residual block์„ ๊ฒฐํ•ฉํ•œ Inception-ResNet
  • Inception-ResNet์€ v1๋ฒ„์ „๊ณผ v2๋ฒ„์ „์ด ์กด์žฌ
  • v1์™€ v2๋Š” ์ „์ฒด ๊ตฌ์กฐ๋Š” ๊ฐ™์ง€๋งŒ, ๊ฐ ๋ชจ๋“ˆ์— ์ฐจ์ด์ ์ด ์กด์žฌ
  • stem์™€ ๊ฐ inception module์—์„œ ์‚ฌ์šฉํ•˜๋Š” filter์ˆ˜๊ฐ€ ๋‹ค๋ฆ„

2-1. Stem

  • V2์˜ ๊ฒฝ์šฐ, ๊ธฐ์กด inception-v4์—์„œ ์‚ฌ์šฉํ•˜๋Š” stem์„ ํ™œ์šฉ

2-2. Inception-ResNet-A

  • ๋งˆ์ง€๋ง‰ 1x1 conv์—์„œ ์ฑ„๋„์ˆ˜๊ฐ€ ๋‹ค๋ฆ„
  • ๋งˆ์ง€๋ง‰ 1x1 conv์—์„œ ์ˆซ์ž ๋’ค์— "Linear"๋Š” ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋ฅผ ๊ฑฐ์น˜์ง€ ์•Š์•˜๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•จ

2-3. Inception-ResNet-B

2-4. Inception-ResNet-C

2-5. Reduction-A

  • v1, v2 ๋‘˜ ๋‹ค ๊ฐ™์€ reduction-A ๋ชจ๋“ˆ ์‚ฌ์šฉ

2-6. Reduction-B

 

3. Results

(1) inception-v3์™€ inception-resnet-v1 ํ•™์Šต ๊ณก์„  ๋น„๊ต

  • inception-v3์™€ inception-resnet-v1์˜ ์—ฐ์‚ฐ๋Ÿ‰์ด ๋น„์Šทํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๋‘˜์„ ๋น„๊ต

(2) inception-v4์™€ inception-resnet-v2 ํ•™์Šต ๊ณก์„  ๋น„๊ต

  • inception-v4์™€ inception-resnet-v2์˜ ์—ฐ์‚ฐ๋Ÿ‰์ด ๋น„์Šทํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‘˜์„ ๋น„๊ต

(3) ์„ฑ๋Šฅ ๋น„๊ต

๊ฒฐ๋ก : Inception-ResNet-v2๊ฐ€ ์ œ์ผ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ž„

 


