๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
Deep Learning/[๋…ผ๋ฌธ] Paper Review

YOLOv4: Optimal Speed and Accuracy of Object Detection

by ์ œ๋ฃฝ 2023. 7. 9.

 

๐Ÿ’ก
<๋ฒˆ์—ญ>
0. Abstract

CNN์˜ ์ •ํ™•๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๋‹ค์–‘ํ•œ ๊ธฐ๋Šฅ์ด ๋งŽ์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ๋Šฅ๋“ค์˜ ์กฐํ•ฉ์„ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์‹ค์ œ๋กœ ํ…Œ์ŠคํŠธํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ์ด๋ก ์ ์œผ๋กœ ์ •๋‹นํ™”ํ•˜๋Š” ๊ฒƒ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ผ๋ถ€ ๊ธฐ๋Šฅ์€ ํŠน์ • ๋ชจ๋ธ์ด๋‚˜ ๋ฌธ์ œ์—๋งŒ ์ ์šฉ๋˜๊ฑฐ๋‚˜ ์†Œ๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์—๋งŒ ์ ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ฐฐ์น˜ ์ •๊ทœํ™”(batch-normalization)์™€ ์ž”์ฐจ ์—ฐ๊ฒฐ(residual-connections)๊ณผ ๊ฐ™์€ ๊ธฐ๋Šฅ์€ ๋Œ€๋ถ€๋ถ„์˜ ๋ชจ๋ธ, ์ž‘์—… ๋ฐ ๋ฐ์ดํ„ฐ์…‹์— ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ €ํฌ๋Š” ๊ฐ€์ค‘ ์ž”์ฐจ ์—ฐ๊ฒฐ(Weighted-Residual-Connections, WRC), ํฌ๋กœ์Šค ์Šคํ…Œ์ด์ง€ ๋ถ€๋ถ„ ์—ฐ๊ฒฐ(Cross-Stage-Partial-connections, CSP), ํฌ๋กœ์Šค ๋ฏธ๋‹ˆ ๋ฐฐ์น˜ ์ •๊ทœํ™”(Cross mini-Batch Normalization, CmBN), ์ž๊ฐ€ ์ ๋Œ€์  ํ›ˆ๋ จ(Self-adversarial-training, SAT) ๋ฐ Mish ํ™œ์„ฑํ™”(Mish-activation)์™€ ๊ฐ™์€ ๋ฒ”์šฉ์ ์ธ ๊ธฐ๋Šฅ๋“ค์ด ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค. ์ €ํฌ๋Š” WRC, CSP, CmBN, SAT, Mish ํ™œ์„ฑํ™”, Mosaic ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•(Mosaic data augmentation), CmBN, DropBlock ์ •๊ทœํ™”(DropBlock regularization) ๋ฐ CIoU ์†์‹ค(CIoU loss)๊ณผ ๊ฐ™์€ ์ƒˆ๋กœ์šด ๊ธฐ๋Šฅ๋“ค์„ ์‚ฌ์šฉํ•˜๊ณ  ์ด๋ฅผ ์กฐํ•ฉํ•˜์—ฌ ์ตœ์ฒจ๋‹จ ๊ฒฐ๊ณผ๋ฅผ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. MS COCO ๋ฐ์ดํ„ฐ์…‹์—์„œ ์‹ค์‹œ๊ฐ„ ์†๋„์ธ ์•ฝ 65 FPS์—์„œ 43.5% AP(65.7% AP50)์˜ ๊ฒฐ๊ณผ๋ฅผ ์–ป์—ˆ์Šต๋‹ˆ๋‹ค. ์†Œ์Šค ์ฝ”๋“œ๋Š” ์•„๋ž˜์—์„œ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

1. Introduction

Figure 1: ์ œ์•ˆ๋œ YOLOv4์™€ ๋‹ค๋ฅธ ์ตœ์ฒจ๋‹จ ๊ฐ์ฒด ํƒ์ง€๊ธฐ์˜ ๋น„๊ต. YOLOv4๋Š” ๋น„์Šทํ•œ ์„ฑ๋Šฅ์„ ๊ฐ€์ง„ EfficientDet๋ณด๋‹ค 2๋ฐฐ ๋” ๋น ๋ฅด๊ฒŒ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค. YOLOv3์˜ AP์™€ FPS๋ฅผ ๊ฐ๊ฐ 10%์™€ 12% ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.

์„œ๋ก  ๋Œ€๋ถ€๋ถ„์˜ CNN ๊ธฐ๋ฐ˜ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋Š” ์ฃผ๋กœ ์ถ”์ฒœ ์‹œ์Šคํ…œ์—๋งŒ ์ ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๋„์‹œ ๋น„๋””์˜ค ์นด๋ฉ”๋ผ๋ฅผ ํ†ตํ•ด ๋ฌด๋ฃŒ ์ฃผ์ฐจ ๊ณต๊ฐ„์„ ํƒ์ƒ‰ํ•˜๋Š” ๊ฒฝ์šฐ, ๋Š๋ฆฌ๊ณ  ์ •ํ™•ํ•œ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋ฉฐ, ์ž๋™์ฐจ ์ถฉ๋Œ ๊ฒฝ๊ณ ๋Š” ๋น ๋ฅด์ง€๋งŒ ์ •ํ™•ํ•˜์ง€ ์•Š์€ ๋ชจ๋ธ๊ณผ ๊ด€๋ จ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์‹ค์‹œ๊ฐ„ ๊ฐ์ฒด ํƒ์ง€๊ธฐ์˜ ์ •ํ™•๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ด์œผ๋กœ์จ, ํžŒํŠธ ์ƒ์„ฑ ์ถ”์ฒœ ์‹œ์Šคํ…œ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋…๋ฆฝ์ ์ธ ํ”„๋กœ์„ธ์Šค ๊ด€๋ฆฌ์™€ ์ธ๊ฐ„์˜ ์ž…๋ ฅ์„ ์ค„์ผ ์ˆ˜ ์žˆ๋Š” ์šฉ๋„๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ธฐ์กด์˜ ๊ทธ๋ž˜ํ”ฝ ์ฒ˜๋ฆฌ ์žฅ์น˜(GPU)์—์„œ ์‹ค์‹œ๊ฐ„ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋ฅผ ์‹คํ–‰ํ•จ์œผ๋กœ์จ, ์ €๋ ดํ•œ ๊ฐ€๊ฒฉ์— ๋Œ€๊ทœ๋ชจ ์‚ฌ์šฉ์ด ๊ฐ€๋Šฅํ•ด์ง‘๋‹ˆ๋‹ค. ๊ฐ€์žฅ ์ •ํ™•ํ•œ ์ตœ์‹  ์‹ ๊ฒฝ๋ง์€ ์‹ค์‹œ๊ฐ„์œผ๋กœ ์ž‘๋™ํ•˜์ง€ ์•Š์œผ๋ฉฐ, ๋Œ€๊ทœ๋ชจ ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ํฌ๊ธฐ๋กœ ํ•™์Šต์„ ์œ„ํ•ด ๋งŽ์€ ์ˆ˜์˜ GPU๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ๊ธฐ์กด GPU์—์„œ ์‹ค์‹œ๊ฐ„์œผ๋กœ ์ž‘๋™ํ•˜๋Š” CNN์„ ์ƒ์„ฑํ•˜๊ณ , ํ•™์Šต์—๋Š” ๋‹จ ํ•˜๋‚˜์˜ ๊ธฐ์กด GPU๋งŒ ํ•„์š”ํ•œ ๋ฐฉ์‹์œผ๋กœ ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์— ๋Œ€์ฒ˜ํ•ฉ๋‹ˆ๋‹ค.

์ด ์ž‘์—…์˜ ์ฃผ์š” ๋ชฉํ‘œ๋Š” ์ƒ์‚ฐ ์‹œ์Šคํ…œ์—์„œ ๊ฐ์ฒด ํƒ์ง€๊ธฐ์˜ ๋น ๋ฅธ ์šด์˜ ์†๋„์™€ ๋ณ‘๋ ฌ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ์ตœ์ ํ™”์ž…๋‹ˆ๋‹ค. ์ด๋ก ์  ์ง€ํ‘œ์ธ ๊ณ„์‚ฐ๋Ÿ‰ (BFLOP)๋ณด๋‹ค๋Š” ๋‚ฎ์€ ๊ณ„์‚ฐ๋Ÿ‰์„ ๊ฐ–๋Š” ๊ฒƒ๋ณด๋‹ค๋Š”, ์„ค๊ณ„๋œ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๊ฐ€ ์‰ฝ๊ฒŒ ํ›ˆ๋ จ ๋ฐ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์„ ํฌ๋งํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์ผ๋ฐ˜์ ์ธ GPU๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ›ˆ๋ จ ๋ฐ ํ…Œ์ŠคํŠธํ•˜๋Š” ์‚ฌ์šฉ์ž๋Š” Figure 1์— ๋‚˜์™€ ์žˆ๋Š” YOLOv4 ๊ฒฐ๊ณผ์™€ ๊ฐ™์ด ์‹ค์‹œ๊ฐ„์œผ๋กœ ๊ณ ํ’ˆ์งˆ์ด๊ณ  ์„ค๋“๋ ฅ์žˆ๋Š” ๊ฐ์ฒด ํƒ์ง€ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ๊ธฐ์—ฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์š”์•ฝ๋ฉ๋‹ˆ๋‹ค:

  1. ์šฐ๋ฆฌ๋Š” ํšจ์œจ์ ์ด๊ณ  ๊ฐ•๋ ฅํ•œ ๊ฐ์ฒด ํƒ์ง€ ๋ชจ๋ธ์„ ๊ฐœ๋ฐœํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ์€ ๋ˆ„๊ตฌ๋‚˜ 1080 Ti ๋˜๋Š” 2080 Ti GPU๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋น ๋ฅด๊ณ  ์ •ํ™•ํ•œ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋ฅผ ํ›ˆ๋ จํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.
  1. ์šฐ๋ฆฌ๋Š” ๊ฐ์ฒด ํƒ์ง€๊ธฐ ํ›ˆ๋ จ ์ค‘ ์ตœ์‹  Bag-of-Freebies์™€ Bag-of-Specials ๊ธฐ๋ฒ•์˜ ์˜ํ–ฅ์„ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค.
  1. ์šฐ๋ฆฌ๋Š” ์ตœ์‹  ๊ธฐ๋ฒ•๋“ค์„ ์ˆ˜์ •ํ•˜์—ฌ ๋‹จ์ผ GPU ํ›ˆ๋ จ์— ๋” ํšจ์œจ์ ์ด๊ณ  ์ ํ•ฉํ•˜๋„๋ก ๋งŒ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค. ์ด์—๋Š” CBN [89], PAN [49], SAM [85] ๋“ฑ์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.
2. Related Work

2.1 ๊ฐ์ฒด ํƒ์ง€ ๋ชจ๋ธ ํ˜„๋Œ€์ ์ธ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ๋‘ ๋ถ€๋ถ„์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. ํ•˜๋‚˜๋Š” ImageNet์—์„œ ์‚ฌ์ „ ํ›ˆ๋ จ๋œ ๋ฐฑ๋ณธ(backbone)์ด๊ณ , ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” ๊ฐ์ฒด์˜ ํด๋ž˜์Šค์™€ ๊ฒฝ๊ณ„ ์ƒ์ž๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜๋Š” ํ—ค๋“œ(head)์ž…๋‹ˆ๋‹ค. GPU ํ”Œ๋žซํผ์—์„œ ์‹คํ–‰๋˜๋Š” ๊ฐ์ฒด ํƒ์ง€๊ธฐ์˜ ๋ฐฑ๋ณธ์€ VGG [68], ResNet [26], ResNeXt [86], DenseNet [30] ๋“ฑ์ด ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. CPU ํ”Œ๋žซํผ์—์„œ ์‹คํ–‰๋˜๋Š” ๊ฐ์ฒด ํƒ์ง€๊ธฐ์˜ ๋ฐฑ๋ณธ์€ SqueezeNet [31], MobileNet [28, 66, 27, 74], ShuffleNet [97, 53] ๋“ฑ์ด ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ—ค๋“œ ๋ถ€๋ถ„์€ ์ผ๋ฐ˜์ ์œผ๋กœ ํ•œ ๋‹จ๊ณ„(object detector)์™€ ๋‘ ๋‹จ๊ณ„(object detector)๋กœ ๋ถ„๋ฅ˜๋ฉ๋‹ˆ๋‹ค. ๊ฐ€์žฅ ๋Œ€ํ‘œ์ ์ธ ๋‘ ๋‹จ๊ณ„ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋Š” R-CNN [19] ์‹œ๋ฆฌ์ฆˆ๋กœ, fast R-CNN [18], faster R-CNN [64], R-FCN [9], Libra R-CNN [58] ๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, ๋‘ ๋‹จ๊ณ„ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋ฅผ ์•ต์ปค ์—†๋Š” ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋กœ ๋งŒ๋“ค ์ˆ˜๋„ ์žˆ์œผ๋ฉฐ, ์ด์—๋Š” RepPoints [87] ๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ํ•œ ๋‹จ๊ณ„ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋กœ๋Š” YOLO [61, 62, 63], SSD [50], RetinaNet [45] ๋“ฑ์ด ๊ฐ€์žฅ ๋Œ€ํ‘œ์ ์ธ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ์ตœ๊ทผ ๋ช‡ ๋…„๊ฐ„ ์•ต์ปค ์—†๋Š” ํ•œ ๋‹จ๊ณ„ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋„ ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์œ ํ˜•์˜ ํƒ์ง€๊ธฐ์—๋Š” CenterNet [13], CornerNet [37, 38], FCOS [78] ๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

์ตœ๊ทผ ๋ช‡ ๋…„๊ฐ„ ๊ฐœ๋ฐœ๋œ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋Š” ๋ฐฑ๋ณธ๊ณผ ํ—ค๋“œ ์‚ฌ์ด์— ๋ช‡ ๊ฐœ์˜ ๋ ˆ์ด์–ด๋ฅผ ์‚ฝ์ž…ํ•˜๋ฉฐ, ์ด๋Ÿฌํ•œ ๋ ˆ์ด์–ด๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ๋‹ค๋ฅธ ๋‹จ๊ณ„์—์„œ ํŠน์ง• ๋งต์„ ์ˆ˜์ง‘ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ๊ฐ์ฒด ํƒ์ง€๊ธฐ์˜ "๋„ฅ(neck)"์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ผ๋ฐ˜์ ์œผ๋กœ ๋„ฅ์€ ์—ฌ๋Ÿฌ ๊ฐœ์˜ Bottom-up ๊ฒฝ๋กœ์™€ Top-down ๊ฒฝ๋กœ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ๊ฐ–์ถ˜ ๋„คํŠธ์›Œํฌ์—๋Š” Feature Pyramid Network (FPN) [44], Path Aggregation Network (PAN) [49], BiFPN [77], NAS-FPN [17] ๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

์œ„์—์„œ ์–ธ๊ธ‰๋œ ๋ชจ๋ธ๋“ค ์™ธ์—๋„, ์ผ๋ถ€ ์—ฐ๊ตฌ์ž๋“ค์€ ๊ฐ์ฒด ํƒ์ง€๋ฅผ ์œ„ํ•ด ์ƒˆ๋กœ์šด ๋ฐฑ๋ณธ(DetNet [43], DetNAS [7])์ด๋‚˜ ์ „์ฒด ๋ชจ๋ธ(SpineNet [12], HitDetector [20])์„ ์ง์ ‘ ๊ตฌ์ถ•ํ•˜๋Š” ๋ฐ ์ค‘์ ์„ ๋‘๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์š”์•ฝํ•˜์ž๋ฉด, ์ผ๋ฐ˜์ ์ธ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์—ฌ๋Ÿฌ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค:

  1. ๋ฐฑ๋ณธ (Backbone): ๋ฐฑ๋ณธ์€ ๊ฐ์ฒด ํƒ์ง€๊ธฐ์˜ ํŠน์ง• ์ถ”์ถœ ๊ตฌ์„ฑ ์š”์†Œ๋กœ, ์ผ๋ฐ˜์ ์œผ๋กœ ImageNet๊ณผ ๊ฐ™์€ ๋Œ€๊ทœ๋ชจ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ ์ž‘์—…์—์„œ ์‚ฌ์ „ ํ›ˆ๋ จ๋ฉ๋‹ˆ๋‹ค. VGG, ResNet, ResNeXt, DenseNet, SqueezeNet, MobileNet, ShuffleNet ๋“ฑ์˜ ๊ธฐ์กด ๋ฐฑ๋ณธ ๋„คํŠธ์›Œํฌ์ผ ์ˆ˜๋„ ์žˆ๊ณ , DetNet์ด๋‚˜ SpineNet๊ณผ ๊ฐ™์ด ์ƒˆ๋กญ๊ฒŒ ์„ค๊ณ„๋œ ๋ฐฑ๋ณธ์ผ ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.
  1. ๋„ฅ (Neck): ๋„ฅ์€ ๋ฐฑ๋ณธ๊ณผ ํ—ค๋“œ ์‚ฌ์ด์— ์‚ฝ์ž…๋˜๋Š” ์ค‘๊ฐ„ ๊ณ„์ธต์„ ๋งํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ณ„์ธต์€ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค์ผ€์ผ์ด๋‚˜ ์ถ”์ƒํ™” ์ˆ˜์ค€์—์„œ ์ถ”์ถœ๋œ ํŠน์ง•์„ ๊ฒฐํ•ฉํ•˜์—ฌ ํƒ์ง€ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค. FPN(Feature Pyramid Network), PAN(Path Aggregation Network), BiFPN, NAS-FPN๊ณผ ๊ฐ™์€ ๋„ฅ ๊ตฌ์กฐ์˜ ์˜ˆ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.
  1. ํ—ค๋“œ (Head): ํ—ค๋“œ๋Š” ๊ฐ์ฒด์˜ ํด๋ž˜์Šค์™€ ๊ฒฝ๊ณ„ ์ƒ์ž๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ์—ญํ• ์„ ํ•˜๋Š” ๋ถ€๋ถ„์ž…๋‹ˆ๋‹ค. ๋ฐฑ๋ณธ์—์„œ ์ถ”์ถœํ•œ ํŠน์ง•์„ ๋„ฅ์—์„œ ์ฒ˜๋ฆฌํ•œ ํ›„, ํ—ค๋“œ์—์„œ ๋ถ„๋ฅ˜ ๋ฐ ํšŒ๊ท€ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ์ฒด ํƒ์ง€ ํ—ค๋“œ์—๋Š” ์ฃผ๋กœ ์›๋‹จ๊ณ„(detector)์™€ ์ด๋‹จ๊ณ„(detector)์˜ ๋‘ ๊ฐ€์ง€ ์œ ํ˜•์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์›๋‹จ๊ณ„ ํƒ์ง€๊ธฐ(YOLO, SSD, RetinaNet)์™€ ์ด๋‹จ๊ณ„ ํƒ์ง€๊ธฐ(R-CNN ์‹œ๋ฆฌ์ฆˆ)๊ฐ€ ๋Œ€ํ‘œ์ ์ธ ์˜ˆ์ž…๋‹ˆ๋‹ค. ๊ฐ ์œ ํ˜•์€ ์ž์ฒด์ ์ธ ํŠน์ง•๊ณผ ์žฅ๋‹จ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

2.2. Bag of freebies

์ •ํ™•๋„ ํ–ฅ์ƒ์„ ์œ„ํ•œ ์ถ”๊ฐ€ ๋ชจ๋“ˆ ๊ธฐ๋ฒ• ๋ฐ ํ›„์ฒ˜๋ฆฌ ๋ฐฉ

์œ„์—์„œ ์–ธ๊ธ‰ํ•œ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• ๋ฐฉ๋ฒ•์€ ๋ชจ๋‘ ํ”ฝ์…€ ๋‹จ์œ„์˜ ์กฐ์ •์ด๋ฉฐ, ์กฐ์ •๋œ ์˜์—ญ์—์„œ ์›๋ž˜์˜ ํ”ฝ์…€ ์ •๋ณด๋ฅผ ๋ชจ๋‘ ๋ณด์กดํ•ฉ๋‹ˆ๋‹ค.

๋˜ํ•œ, ์ผ๋ถ€ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•์„ ์—ฐ๊ตฌํ•˜๋Š” ์—ฐ๊ตฌ์ž๋“ค์€ ๊ฐ์ฒด ๊ฐ€๋ฆผ ํ˜„์ƒ์„ ๋ชจ์‚ฌํ•˜๋Š” ๋ฐ ์ค‘์ ์„ ๋‘๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋“ค์€ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜์™€ ๊ฐ์ฒด ํƒ์ง€์—์„œ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ์–ป์—ˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, random erase [100]์™€ CutOut [11]๋Š” ์ด๋ฏธ์ง€์—์„œ ์‚ฌ๊ฐํ˜• ์˜์—ญ์„ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒํ•˜๊ณ  ๋ฌด์ž‘์œ„ ๋˜๋Š” ๋ณด์™„ ๊ฐ’์œผ๋กœ ์ฑ„์šฐ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. hide-and-seek [69]์™€ grid mask [6]์˜ ๊ฒฝ์šฐ, ์ด๋ฏธ์ง€์—์„œ ๋ฌด์ž‘์œ„๋กœ ๋˜๋Š” ๊ท ์ผํ•˜๊ฒŒ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์‚ฌ๊ฐํ˜• ์˜์—ญ์„ ์„ ํƒํ•˜๊ณ  ์ด๋ฅผ ๋ชจ๋‘ 0์œผ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. ํŠน์ง• ๋งต์— ์œ ์‚ฌํ•œ ๊ฐœ๋…์„ ์ ์šฉํ•œ๋‹ค๋ฉด DropOut [71], DropConnect [80], DropBlock [16] ๋ฐฉ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, ์ผ๋ถ€ ์—ฐ๊ตฌ์ž๋“ค์€ ์—ฌ๋Ÿฌ ์ด๋ฏธ์ง€๋ฅผ ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ–ˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, MixUp [92]๋Š” ๋‘ ๊ฐœ์˜ ์ด๋ฏธ์ง€๋ฅผ ์„œ๋กœ ๋‹ค๋ฅธ ๊ณ„์ˆ˜ ๋น„์œจ๋กœ ๊ณฑํ•˜๊ณ  ๊ฒน์ณ์„œ ์กฐ์ •๋œ ๋ ˆ์ด๋ธ”์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. CutMix [91]๋Š” ์ž˜๋ฆฐ ์ด๋ฏธ์ง€๋ฅผ ๋‹ค๋ฅธ ์ด๋ฏธ์ง€์˜ ์ง์‚ฌ๊ฐํ˜• ์˜์—ญ์œผ๋กœ ๋ฎ์–ด์”Œ์šฐ๊ณ  ํ˜ผํ•ฉ ์˜์—ญ์˜ ํฌ๊ธฐ์— ๋”ฐ๋ผ ๋ ˆ์ด๋ธ”์„ ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ ๋ฐฉ๋ฒ•๋“ค ์™ธ์—๋„, ์Šคํƒ€์ผ ์ „์ด GAN [15]๋„ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•์— ์‚ฌ์šฉ๋˜๋ฉฐ, ์ด๋Ÿฌํ•œ ์‚ฌ์šฉ๋ฒ•์€ CNN์ด ํ•™์Šตํ•œ ์งˆ๊ฐ ํŽธํ–ฅ์„ ํšจ๊ณผ์ ์œผ๋กœ ์ค„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์œ„์—์„œ ์ œ์•ˆ๋œ ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ, ๋‹ค๋ฅธ ์ผ๋ถ€ bag of freebies ๋ฐฉ๋ฒ•๋“ค์€ ๋ฐ์ดํ„ฐ์…‹์˜ ์˜๋ฏธ ๋ถ„ํฌ์— ํŽธํ–ฅ์ด ์žˆ์„ ์ˆ˜ ์žˆ๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ํŠนํ™”๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ์˜๋ฏธ ๋ถ„ํฌ ํŽธํ–ฅ ๋ฌธ์ œ๋ฅผ ๋‹ค๋ฃฐ ๋•Œ ๋งค์šฐ ์ค‘์š”ํ•œ ๋ฌธ์ œ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ํด๋ž˜์Šค ๊ฐ„์— ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๊ฐ€ ์žˆ๋Š”๋ฐ, ์ด ๋ฌธ์ œ๋Š” ์ผ๋ถ€ ์—ฐ๊ตฌ์ž๋“ค์— ์˜ํ•ด ํ•ด๊ฒฐ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, two-stage ๊ฐ์ฒด ํƒ์ง€๊ธฐ์—์„œ๋Š” ์–ด๋ ค์šด ๋ถ€์ • ์˜ˆ์ œ ๋งˆ์ด๋‹ [72]์ด๋‚˜ ์˜จ๋ผ์ธ ์–ด๋ ค์šด ์˜ˆ์ œ ๋งˆ์ด๋‹ [67]์„ ํ†ตํ•ด ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ์˜ˆ์ œ ๋งˆ์ด๋‹ ๋ฐฉ๋ฒ•์€ one-stage ๊ฐ์ฒด ํƒ์ง€๊ธฐ์—๋Š” ์ ์šฉํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ์™œ๋ƒํ•˜๋ฉด ์ด๋Ÿฌํ•œ ํƒ์ง€๊ธฐ๋Š” ๋ฐ€์ง‘ ์˜ˆ์ธก ์•„ํ‚คํ…์ฒ˜์— ์†ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ Lin et al. [45]์€ ๋‹ค์–‘ํ•œ ํด๋ž˜์Šค ๊ฐ„์˜ ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๋ฅผ ๋‹ค๋ฃจ๊ธฐ ์œ„ํ•ด ํฌ์ปฌ ๋กœ์Šค๋ฅผ ์ œ์•ˆํ–ˆ์Šต๋‹ˆ๋‹ค. ๋˜ ๋‹ค๋ฅธ ๋งค์šฐ ์ค‘์š”ํ•œ ๋ฌธ์ œ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์นดํ…Œ๊ณ ๋ฆฌ ๊ฐ„์˜ ์—ฐ๊ด€์„ฑ ์ •๋„๋ฅผ ์›-ํ•ซ ํ•˜๋“œ ํ‘œํ˜„์œผ๋กœ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ์ด ์–ด๋ ต๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ํ‘œํ˜„ ์ฒด๊ณ„๋Š” ๋ผ๋ฒจ๋ง์„ ์‹คํ–‰ํ•  ๋•Œ ์ž์ฃผ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. [73]์—์„œ ์ œ์•ˆ๋œ ๋ผ๋ฒจ ์Šค๋ฌด๋”ฉ์€ ํ›ˆ๋ จ์„ ์œ„ํ•ด ํ•˜๋“œ ๋ผ๋ฒจ์„ ์†Œํ”„ํŠธ ๋ผ๋ฒจ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๋ชจ๋ธ์„ ๋” ๊ฒฌ๊ณ ํ•˜๊ฒŒ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ๋” ์ข‹์€ ์†Œํ”„ํŠธ ๋ผ๋ฒจ์„ ์–ป๊ธฐ ์œ„ํ•ด, Islam et al. [33]์€ ์ง€์‹ ์ฆ๋ฅ˜์˜ ๊ฐœ๋…์„ ๋„์ž…ํ•˜์—ฌ ๋ผ๋ฒจ ๋ฆฌํŒŒ์ธ๋จผํŠธ ๋„คํŠธ์›Œํฌ๋ฅผ ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค.

๋งˆ์ง€๋ง‰์œผ๋กœ ์†Œ๊ฐœํ•  bag of freebies๋Š” ๋ฐ”์šด๋”ฉ ๋ฐ•์Šค(BBox) ํšŒ๊ท€์˜ ๋ชฉ์  ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค. ์ „ํ†ต์ ์ธ ๊ฐ์ฒด ํƒ์ง€๊ธฐ๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ BBox์˜ ์ค‘์‹ฌ์  ์ขŒํ‘œ์™€ ๋†’์ด, ๋„ˆ๋น„, ์ฆ‰ {xcenter, ycenter, w, h} ๋˜๋Š” ์ขŒ์ƒ๋‹จ ์ ๊ณผ ์šฐํ•˜๋‹จ ์ , ์ฆ‰ {xtop lef t, ytop lef t, xbottom right, ybottom right}์— ๋Œ€ํ•ด ํ‰๊ท  ์ œ๊ณฑ ์˜ค์ฐจ(Mean Square Error, MSE)๋ฅผ ์ง์ ‘์ ์œผ๋กœ ํšŒ๊ท€์— ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์•ต์ปค ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•์˜ ๊ฒฝ์šฐ, ํ•ด๋‹น ์˜คํ”„์…‹์„ ์ถ”์ •ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, {xcenter of f set, ycenter of f set, wof f set, hof f set} ๋ฐ {xtop lef t of f set, ytop lef t of f set, xbottom right of f set, ybottom right of f set}์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ BBox์˜ ๊ฐ ์ ์˜ ์ขŒํ‘œ ๊ฐ’์„ ์ง์ ‘์ ์œผ๋กœ ์ถ”์ •ํ•˜๋Š” ๊ฒƒ์€ ์ด๋Ÿฌํ•œ ์ ๋“ค์„ ๋…๋ฆฝ ๋ณ€์ˆ˜๋กœ ์ทจ๊ธ‰ํ•˜๋Š” ๊ฒƒ์ด๋ฉฐ, ์‚ฌ์‹ค ๊ฐ์ฒด ์ž์ฒด์˜ ์™„์ „์„ฑ์„ ๊ณ ๋ คํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

์ด ๋ฌธ์ œ๋ฅผ ๋” ์ž˜ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ตœ๊ทผ ์ผ๋ถ€ ์—ฐ๊ตฌ์ž๋“ค์€ IoU loss [90]๋ฅผ ์ œ์•ˆํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ์˜ˆ์ธก๋œ BBox ์˜์—ญ๊ณผ ์‹ค์ œ BBox ์˜์—ญ์˜ ์ปค๋ฒ„๋ฆฌ์ง€๋ฅผ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค. IoU loss ๊ณ„์‚ฐ ๊ณผ์ •์—์„œ๋Š” ์˜ˆ์ธก๋œ BBox์™€ ์‹ค์ œ ๊ฐ’๊ณผ์˜ IoU๋ฅผ ์‹คํ–‰ํ•˜์—ฌ BBox์˜ ๋„ค ๊ฐœ์˜ ์ขŒํ‘œ๋ฅผ ๊ณ„์‚ฐํ•œ ๋‹ค์Œ ์ƒ์„ฑ๋œ ๊ฒฐ๊ณผ๋ฅผ ์—ฐ๊ฒฐํ•ฉ๋‹ˆ๋‹ค. IoU๋Š” ์Šค์ผ€์ผ ๋ถˆ๋ณ€ ํ‘œํ˜„์ด๋ฏ€๋กœ ์ „ํ†ต์ ์ธ ๋ฐฉ๋ฒ•์ด {x, y, w, h}์˜ l1 ๋˜๋Š” l2 ์†์‹ค์„ ๊ณ„์‚ฐํ•  ๋•Œ ์Šค์ผ€์ผ๊ณผ ํ•จ๊ป˜ ์†์‹ค์ด ์ฆ๊ฐ€ํ•˜๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ตœ๊ทผ ์ผ๋ถ€ ์—ฐ๊ตฌ์ž๋“ค์€ IoU loss๋ฅผ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด ๊ณ„์† ๋…ธ๋ ฅํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, GIoU loss [65]๋Š” ์ปค๋ฒ„๋ฆฌ์ง€ ์˜์—ญ ์™ธ์—๋„ ๊ฐ์ฒด์˜ ํ˜•ํƒœ์™€ ๋ฐฉํ–ฅ์„ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ์ธก๋œ BBox์™€ ์‹ค์ œ BBox๋ฅผ ๋™์‹œ์— ์ปค๋ฒ„ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ€์žฅ ์ž‘์€ ์˜์—ญ์˜ BBox๋ฅผ ์ฐพ์•„ ์ด๋ฅผ ๋ถ„๋ชจ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ๊ธฐ์กด์˜ IoU loss์—์„œ ์‚ฌ์šฉ๋˜๋˜ ๋ถ„๋ชจ๋ฅผ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. DIoU loss [99]๋Š” ๊ฐ์ฒด์˜ ์ค‘์‹ฌ์ ๊นŒ์ง€์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์ถ”๊ฐ€๋กœ ๊ณ ๋ คํ•˜๋ฉฐ, CIoU loss [99]๋Š” ๊ฒน์น˜๋Š” ์˜์—ญ, ์ค‘์‹ฌ์  ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ, ์ข…ํšก๋น„๋ฅผ ๋™์‹œ์— ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค. CIoU๋Š” BBox ํšŒ๊ท€ ๋ฌธ์ œ์—์„œ ๋” ๋‚˜์€ ์ˆ˜๋ ด ์†๋„์™€ ์ •ํ™•๋„๋ฅผ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

2.3. ํŠน์ˆ˜ ๊ธฐ๋ฒ•(Bag of specials)

์ •ํ™•๋„ ํ–ฅ์ƒ์„ ์œ„ํ•œ ์ถ”๊ฐ€ ๋ชจ๋“ˆ ๊ธฐ๋ฒ• ๋ฐ ํ›„์ฒ˜๋ฆฌ ๋ฐฉ๋ฒ•

์ถ”๋ก  ๋น„์šฉ์„ ํฌ๊ฒŒ ์ฆ๊ฐ€์‹œํ‚ค์ง€ ์•Š์œผ๋ฉด์„œ ๊ฐ์ฒด ํƒ์ง€์˜ ์ •ํ™•๋„๋ฅผ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ํ”Œ๋Ÿฌ๊ทธ์ธ ๋ชจ๋“ˆ ๋ฐ ํ›„์ฒ˜๋ฆฌ ๋ฐฉ๋ฒ•๋“ค์„ "ํŠน์ˆ˜ ๊ธฐ๋ฒ•(Bag of specials)"์ด๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ์ด๋Ÿฌํ•œ ํ”Œ๋Ÿฌ๊ทธ์ธ ๋ชจ๋“ˆ์€ ๋ชจ๋ธ ๋‚ด์—์„œ ํŠน์ • ์†์„ฑ์„ ๊ฐ•ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋˜๋ฉฐ, ์ด๋Š” ์ˆ˜์šฉ ์˜์—ญ ํ™•์žฅ, ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜ ๋„์ž…, ํŠน์ง• ํ†ตํ•ฉ ๋Šฅ๋ ฅ ๊ฐ•ํ™” ๋“ฑ์„ ํฌํ•จํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ›„์ฒ˜๋ฆฌ๋Š” ๋ชจ๋ธ ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ํ•„ํ„ฐ๋งํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

๋ฐ˜์‘ ์˜์—ญ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์ผ๋ฐ˜์ ์ธ ๋ชจ๋“ˆ๋กœ๋Š” SPP (Spatial Pyramid Pooling) [25], ASPP (Atrous Spatial Pyramid Pooling) [5], ๊ทธ๋ฆฌ๊ณ  RFB (Receptive Field Block) [47] ๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

SPP ๋ชจ๋“ˆ์€ Spatial Pyramid Matching (SPM) [39]์—์„œ ๊ธฐ์›ํ•œ ๊ฒƒ์œผ๋กœ, SPM์˜ ์›๋ž˜ ๋ฐฉ๋ฒ•์€ ํŠน์„ฑ ๋งต์„ d × d ํฌ๊ธฐ์˜ ๋™์ผํ•œ ๋ธ”๋ก์œผ๋กœ ๋‚˜๋ˆˆ ๋‹ค์Œ ๊ณต๊ฐ„ ํ”ผ๋ผ๋ฏธ๋“œ๋ฅผ ํ˜•์„ฑํ•˜๊ณ  ๋‹จ์–ด ๊ฐ€๋ฐฉ ํŠน์ง•์„ ์ถ”์ถœํ•˜๋Š” ๊ฒƒ์ด์—ˆ์Šต๋‹ˆ๋‹ค. SPP๋Š” SPM์„ CNN์— ํ†ตํ•ฉํ•˜๊ณ  ๋‹จ์–ด ๊ฐ€๋ฐฉ ์—ฐ์‚ฐ ๋Œ€์‹  ์ตœ๋Œ€ ํ’€๋ง ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. He ๋“ฑ์ด ์ œ์•ˆํ•œ SPP ๋ชจ๋“ˆ์€ 1์ฐจ์› ํŠน์„ฑ ๋ฒกํ„ฐ๋ฅผ ์ถœ๋ ฅํ•˜๋ฏ€๋กœ Fully Convolutional Network (FCN)์— ์ ์šฉํ•˜๊ธฐ ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ YOLOv3 [63]์˜ ์„ค๊ณ„์—์„œ Redmon๊ณผ Farhadi๋Š” SPP ๋ชจ๋“ˆ์„ ํ–ฅ์ƒ์‹œ์ผœ ์ปค๋„ ํฌ๊ธฐ k × k (์—ฌ๊ธฐ์„œ k = {1, 5, 9, 13})์™€ ์ŠคํŠธ๋ผ์ด๋“œ 1์ธ ์ตœ๋Œ€ ํ’€๋ง ์ถœ๋ ฅ์˜ ์—ฐ๊ฒฐ๋กœ ๋ณ€๊ฒฝํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ์„ค๊ณ„๋ฅผ ํ†ตํ•ด ์ƒ๋Œ€์ ์œผ๋กœ ํฐ k × k ์ตœ๋Œ€ ํ’€๋ง์ด ๋ฐฑ๋ณธ ํŠน์„ฑ์˜ ๋ฐ˜์‘ ์˜์—ญ์„ ํšจ๊ณผ์ ์œผ๋กœ ์ฆ๊ฐ€์‹œํ‚ต๋‹ˆ๋‹ค. ๊ฐœ์„ ๋œ SPP ๋ชจ๋“ˆ์„ ์ถ”๊ฐ€ํ•œ ํ›„์—๋Š” YOLOv3-608์—์„œ MS COCO ๊ฐ์ฒด ๊ฐ์ง€ ์ž‘์—…์˜ AP50๊ฐ€ 0.5% ์ถ”๊ฐ€ ๊ณ„์‚ฐ ๋น„์šฉ์œผ๋กœ 2.7% ํ–ฅ์ƒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

ASPP ๋ชจ๋“ˆ [5]๊ณผ ๊ฐœ์„ ๋œ SPP ๋ชจ๋“ˆ ๊ฐ„์˜ ์ž‘๋™ ์ฐจ์ด๋Š” ์ฃผ๋กœ ์›๋ž˜์˜ k × k ์ปค๋„ ํฌ๊ธฐ, ์ŠคํŠธ๋ผ์ด๋“œ 1์˜ ์ตœ๋Œ€ ํ’€๋ง์—์„œ ์—ฌ๋Ÿฌ 3 × 3 ์ปค๋„ ํฌ๊ธฐ, ํ™•์žฅ ๋น„์œจ k, ์ŠคํŠธ๋ผ์ด๋“œ 1์˜ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜ ์—ฐ์‚ฐ์œผ๋กœ ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค. RFB ๋ชจ๋“ˆ์€ ASPP๋ณด๋‹ค ๋” ํฌ๊ด„์ ์ธ ๊ณต๊ฐ„ ์ปค๋ฒ„๋ฆฌ์ง€๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด k × k ์ปค๋„์˜ ์—ฌ๋Ÿฌ ํ™•์žฅ ์ปจ๋ณผ๋ฃจ์…˜์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. RFB [47]์€ MS COCO์—์„œ SSD์˜ AP50๋ฅผ 5.7% ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์ถ”๊ฐ€๋กœ ์ถ”๋ก  ์‹œ๊ฐ„์ด 7%๋งŒ ์†Œ์š”๋ฉ๋‹ˆ๋‹ค.

๊ฐ์ฒด ๊ฐ์ง€์—์„œ ์ž์ฃผ ์‚ฌ์šฉ๋˜๋Š” ์–ดํ…์…˜ ๋ชจ๋“ˆ์€ ์ฃผ๋กœ ์ฑ„๋„๋ณ„ ์–ดํ…์…˜(Channel-wise Attention)๊ณผ ์ ๋ณ„ ์–ดํ…์…˜(Point-wise Attention)์œผ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค. ์ด ๋‘ ์–ดํ…์…˜ ๋ชจ๋ธ์˜ ๋Œ€ํ‘œ์ ์ธ ์˜ˆ๋Š” Squeeze-and-Excitation (SE) [29]๊ณผ Spatial Attention Module (SAM) [85]์ž…๋‹ˆ๋‹ค. SE ๋ชจ๋“ˆ์€ ImageNet ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ ์ž‘์—…์—์„œ ResNet50์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์œผ๋ฉฐ ๊ณ„์‚ฐ ๋น„์šฉ์„ 2%๋งŒ ์ฆ๊ฐ€์‹œํ‚ด์œผ๋กœ์จ 1% top-1 ์ •ํ™•๋„๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ GPU์—์„œ๋Š” ๋ณดํ†ต ์ถ”๋ก  ์‹œ๊ฐ„์ด ์•ฝ 10% ์ฆ๊ฐ€ํ•˜๋ฏ€๋กœ ๋ชจ๋ฐ”์ผ ๊ธฐ๊ธฐ์—์„œ ์‚ฌ์šฉํ•˜๊ธฐ์— ๋” ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด, SAM์€ ์ถ”๊ฐ€์ ์ธ ๊ณ„์‚ฐ ๋น„์šฉ์œผ๋กœ 0.1%๋งŒ ํ•„์š”ํ•˜๋ฉฐ ImageNet ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ ์ž‘์—…์—์„œ ResNet50-SE์˜ 0.5% top-1 ์ •ํ™•๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฌด์—‡๋ณด๋‹ค๋„, ์ด๋Š” GPU์—์„œ ์ถ”๋ก  ์†๋„์— ์ „ํ˜€ ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

ํŠน์„ฑ ํ†ตํ•ฉ ์ธก๋ฉด์—์„œ, ์ดˆ๊ธฐ์˜ ๋ฐฉ๋ฒ•์€ ์Šคํ‚ต ์—ฐ๊ฒฐ(Skip Connection) [51]์ด๋‚˜ ํ•˜์ดํผ-์ปฌ๋Ÿผ(Hyper-column) [22]์„ ์‚ฌ์šฉํ•˜์—ฌ ์ €์ˆ˜์ค€์˜ ๋ฌผ๋ฆฌ์  ํŠน์„ฑ์„ ๊ณ ์ˆ˜์ค€์˜ ์˜๋ฏธ๋ก ์  ํŠน์„ฑ์— ํ†ตํ•ฉํ•˜๋Š” ๊ฒƒ์ด์—ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ FPN๊ณผ ๊ฐ™์€ ๋‹ค์ค‘ ์Šค์ผ€์ผ ์˜ˆ์ธก ๋ฐฉ๋ฒ•์ด ์ธ๊ธฐ๋ฅผ ์–ป์œผ๋ฉด์„œ, ๋‹ค์–‘ํ•œ ํ”ผ๋ผ๋ฏธ๋“œ ํŠน์„ฑ์„ ํ†ตํ•ฉํ•˜๊ธฐ ์œ„ํ•œ ๊ฐ€๋ฒผ์šด ๋ชจ๋“ˆ๋“ค์ด ์ œ์•ˆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ชจ๋“ˆ๋“ค์—๋Š” SFAM [98], ASFF [48], ๊ทธ๋ฆฌ๊ณ  BiFPN [77] ๋“ฑ์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค. SFAM์˜ ์ฃผ์š” ์•„์ด๋””์–ด๋Š” SE ๋ชจ๋“ˆ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์ค‘ ์Šค์ผ€์ผ๋กœ ์—ฐ๊ฒฐ๋œ ํŠน์„ฑ ๋งต์— ์ฑ„๋„๋ณ„ ๋ ˆ๋ฒจ ๊ฐ€์ค‘์น˜๋ฅผ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ASFF๋Š” ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ ๋ณ„ ๋ ˆ๋ฒจ ๊ฐ€์ค‘์น˜๋ฅผ ์ ์šฉํ•˜๊ณ  ๋‹ค๋ฅธ ์Šค์ผ€์ผ์˜ ํŠน์„ฑ ๋งต์„ ๋”ํ•ฉ๋‹ˆ๋‹ค. BiFPN์€ ๋‹ค์ค‘ ์ž…๋ ฅ ๊ฐ€์ค‘ ์ž”์ฐจ ์—ฐ๊ฒฐ์„ ์ œ์•ˆํ•˜์—ฌ ์Šค์ผ€์ผ๋ณ„ ๋ ˆ๋ฒจ ๊ฐ€์ค‘์น˜๋ฅผ ์ ์šฉํ•˜๊ณ  ๋‹ค๋ฅธ ์Šค์ผ€์ผ์˜ ํŠน์„ฑ ๋งต์„ ๋”ํ•ฉ๋‹ˆ๋‹ค.

๋”ฅ๋Ÿฌ๋‹ ์—ฐ๊ตฌ์—์„œ๋Š” ์ผ๋ถ€ ์—ฐ๊ตฌ์ž๋“ค์ด ์ข‹์€ ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋ฅผ ํƒ์ƒ‰ํ•˜๋Š” ๋ฐ ์ค‘์ ์„ ๋‘๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ข‹์€ ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋Š” ๊ธฐ์šธ๊ธฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ „ํŒŒํ•  ์ˆ˜ ์žˆ์œผ๋ฉด์„œ๋„ ์ถ”๊ฐ€์ ์ธ ๊ณ„์‚ฐ ๋น„์šฉ์„ ์ตœ์†Œํ™”ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. 2010๋…„์— Nair์™€ Hinton [56]์€ ๊ธฐ์กด์˜ tanh๋‚˜ sigmoid์™€ ๊ฐ™์€ ํ™œ์„ฑํ™” ํ•จ์ˆ˜์—์„œ ์ž์ฃผ ๋ฐœ์ƒํ•˜๋Š” ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค ๋ฌธ์ œ๋ฅผ ์‹ค์งˆ์ ์œผ๋กœ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ReLU๋ฅผ ์ œ์•ˆํ–ˆ์Šต๋‹ˆ๋‹ค. ์ดํ›„ LReLU [54], PReLU [24], ReLU6 [28], SELU [35], Swish [59], hard-Swish [27], Mish [55] ๋“ฑ์˜ ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋“ค์ด ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ œ์•ˆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

LReLU์™€ PReLU๋Š” ReLU์˜ ์ถœ๋ ฅ์ด ์Œ์ˆ˜์ผ ๋•Œ ๊ธฐ์šธ๊ธฐ๊ฐ€ 0์ด ๋˜๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๊ณ ์•ˆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ReLU6์™€ hard-Swish๋Š” ์–‘์žํ™” ๋„คํŠธ์›Œํฌ๋ฅผ ์œ„ํ•ด ํŠน๋ณ„ํžˆ ์„ค๊ณ„๋œ ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค. SELU ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋Š” ์‹ ๊ฒฝ๋ง์˜ ์ž๊ธฐ ์ •๊ทœํ™”๋ฅผ ์œ„ํ•ด ์ œ์•ˆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. Swish์™€ Mish๋Š” ์—ฐ์†์ ์œผ๋กœ ๋ฏธ๋ถ„ ๊ฐ€๋Šฅํ•œ ํ™œ์„ฑํ™” ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค.

๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ฐ์ฒด ๊ฒ€์ถœ์—์„œ ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” ํ›„์ฒ˜๋ฆฌ ๋ฐฉ๋ฒ•์€ NMS(๋น„์ตœ๋Œ€ ์–ต์ œ)์ž…๋‹ˆ๋‹ค. NMS๋Š” ๋™์ผํ•œ ๊ฐ์ฒด๋ฅผ ์ž˜๋ชป ์˜ˆ์ธกํ•˜๋Š” BBox๋ฅผ ๊ฑธ๋Ÿฌ๋‚ด๊ณ  ๋†’์€ ์‘๋‹ต์„ ๊ฐ€์ง„ ํ›„๋ณด BBox๋งŒ์„ ๋ณด์กดํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. NMS๊ฐ€ ๊ฐœ์„ ์„ ์‹œ๋„ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ์ตœ์ ํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•๊ณผ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค. NMS๊ฐ€ ์ฒ˜์Œ ์ œ์•ˆํ•œ ๋ฐฉ๋ฒ•์€ ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š์•˜์œผ๋ฏ€๋กœ Girshick ๋“ฑ [19]์€ R-CNN์˜ ๋ถ„๋ฅ˜ ์‹ ๋ขฐ ์ ์ˆ˜๋ฅผ ์ฐธ์กฐ๋กœ ์ถ”๊ฐ€ํ•˜์—ฌ ์‹ ๋ขฐ ์ ์ˆ˜์˜ ์ˆœ์„œ์— ๋”ฐ๋ผ ๋†’์€ ์ ์ˆ˜๋ถ€ํ„ฐ ๋‚ฎ์€ ์ ์ˆ˜๊นŒ์ง€ ํƒ์š•์ ์ธ NMS๋ฅผ ์ˆ˜ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค. Soft NMS [1]์˜ ๊ฒฝ์šฐ, ๊ฐ์ฒด์˜ ๊ฐ€๋ฆผ ํ˜„์ƒ์ด IoU ์ ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ ํƒ์š•์ ์ธ NMS์—์„œ ์‹ ๋ขฐ ์ ์ˆ˜์˜ ์ €ํ•˜๋ฅผ ์ดˆ๋ž˜ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฌธ์ œ๋ฅผ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค. DIoU NMS [99] ๊ฐœ๋ฐœ์ž๋Š” Soft NMS๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ BBox ์Šคํฌ๋ฆฌ๋‹ ๊ณผ์ •์— ์ค‘์‹ฌ์  ๊ฑฐ๋ฆฌ ์ •๋ณด๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๊ฐœ์„ ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์–ธ๊ธ‰ํ•  ๋งŒํ•œ ์ ์€ ์œ„์˜ ํ›„์ฒ˜๋ฆฌ ๋ฐฉ๋ฒ•๋“ค์€ ์บก์ฒ˜๋œ ์ด๋ฏธ์ง€ ํŠน์„ฑ์„ ์ง์ ‘ ์ฐธ์กฐํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ, ํ›„์†์ ์ธ ์•ต์ปค-ํ”„๋ฆฌ(Anchor-Free) ๋ฐฉ๋ฒ•์˜ ๊ฐœ๋ฐœ์—์„œ๋Š” ํ›„์ฒ˜๋ฆฌ๊ฐ€ ๋” ์ด์ƒ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

3. Methodology

์‹ ๊ฒฝ๋ง์˜ ๊ธฐ๋ณธ ๋ชฉํ‘œ๋Š” ์ƒ์‚ฐ ์‹œ์Šคํ…œ์—์„œ์˜ ๋น ๋ฅธ ์šด์˜ ์†๋„์™€ ๋ณ‘๋ ฌ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ์ตœ์ ํ™”์ด๋ฉฐ, ์ด๋ก ์ ์ธ ๊ณ„์‚ฐ๋Ÿ‰ (BFLOP)๋ณด๋‹ค๋Š” ์‹ค์ œ์ ์ธ ์ง€ํ‘œ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์‹ค์‹œ๊ฐ„ ์‹ ๊ฒฝ๋ง์˜ ๋‘ ๊ฐ€์ง€ ์˜ต์…˜์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค: • GPU๋ฅผ ์œ„ํ•ด ํ•ฉ์„ฑ๊ณฑ ๋ ˆ์ด์–ด์—์„œ ์†Œ์ˆ˜์˜ ๊ทธ๋ฃน (1-8)์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค: CSPResNeXt50 / CSPDarknet53 • VPU๋ฅผ ์œ„ํ•ด ๊ทธ๋ฃนํ™”๋œ ํ•ฉ์„ฑ๊ณฑ์„ ์‚ฌ์šฉํ•˜์ง€๋งŒ Squeeze-and-Excitation (SE) ๋ธ”๋ก์€ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ ๋‹ค์Œ ๋ชจ๋ธ๋“ค์ด ์ด์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค: EfficientNet-lite / MixNet [76] / GhostNet [21] / MobileNetV3

3.1. ์•„ํ‚คํ…์ฒ˜ ์„ ํƒ ์šฐ๋ฆฌ์˜ ๋ชฉํ‘œ๋Š” ์ž…๋ ฅ ์‹ ๊ฒฝ๋ง ํ•ด์ƒ๋„, ํ•ฉ์„ฑ๊ณฑ ๋ ˆ์ด์–ด ์ˆ˜, ๋งค๊ฐœ ๋ณ€์ˆ˜ ์ˆ˜ (filter size2 * filters * channel / groups) ๋ฐ ๋ ˆ์ด์–ด ์ถœ๋ ฅ ์ˆ˜ (filters) ์‚ฌ์ด์—์„œ ์ตœ์ ์˜ ๊ท ํ˜•์„ ์ฐพ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์šฐ๋ฆฌ์˜ ๋‹ค์–‘ํ•œ ์—ฐ๊ตฌ ๊ฒฐ๊ณผ๋Š” ILSVRC2012 (ImageNet) ๋ฐ์ดํ„ฐ์…‹ [10]์—์„œ ๊ฐ์ฒด ๋ถ„๋ฅ˜ ์ธก๋ฉด์—์„œ CSPResNext50์ด CSPDarknet53๋ณด๋‹ค ํ›จ์”ฌ ์šฐ์ˆ˜ํ•จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ฐ˜๋Œ€๋กœ, MS COCO ๋ฐ์ดํ„ฐ์…‹ [46]์—์„œ ๊ฐ์ฒด ๊ฐ์ง€ ์ธก๋ฉด์—์„œ๋Š” CSPDarknet53์ด CSPResNext50๋ณด๋‹ค ์šฐ์ˆ˜ํ•ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ ๋ชฉํ‘œ๋Š” ์ˆ˜์šฉ ์˜์—ญ์„ ํ™•์žฅํ•˜๊ธฐ ์œ„ํ•ด ์ถ”๊ฐ€์ ์ธ ๋ธ”๋ก์„ ์„ ํƒํ•˜๊ณ , ๋‹ค๋ฅธ ๋ฐฑ๋ณธ ๋ ˆ๋ฒจ์˜ ๋งค๊ฐœ ๋ณ€์ˆ˜ ์ง‘๊ณ„๋ฅผ ์œ„ํ•œ ์ตœ์ƒ์˜ ๋ฐฉ๋ฒ•์„ ์„ ํƒํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด FPN, PAN, ASFF, BiFPN ๋“ฑ์ž…๋‹ˆ๋‹ค.

๋ถ„๋ฅ˜์— ์ตœ์ ์ธ ์ฐธ์กฐ ๋ชจ๋ธ์ด ํ•ญ์ƒ ๊ฐ์ง€์— ์ตœ์ ์ด์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค. ๋ถ„๋ฅ˜๊ธฐ์™€๋Š” ๋‹ฌ๋ฆฌ ๊ฐ์ง€๊ธฐ๋Š” ๋‹ค์Œ์„ ํ•„์š”๋กœ ํ•ฉ๋‹ˆ๋‹ค: • ๋” ๋†’์€ ์ž…๋ ฅ ๋„คํŠธ์›Œํฌ ํฌ๊ธฐ (ํ•ด์ƒ๋„) - ์ž‘์€ ํฌ๊ธฐ์˜ ๋‹ค์ค‘ ๊ฐ์ฒด๋ฅผ ๊ฐ์ง€ํ•˜๊ธฐ ์œ„ํ•ด • ๋” ๋งŽ์€ ๋ ˆ์ด์–ด - ์ฆ๊ฐ€๋œ ์ž…๋ ฅ ๋„คํŠธ์›Œํฌ ํฌ๊ธฐ๋ฅผ ์ปค๋ฒ„ํ•˜๊ธฐ ์œ„ํ•œ ๋” ํฐ ์ˆ˜์šฉ ์˜์—ญ • ๋” ๋งŽ์€ ๋งค๊ฐœ ๋ณ€์ˆ˜ - ํ•˜๋‚˜์˜ ์ด๋ฏธ์ง€์—์„œ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ์—ฌ๋Ÿฌ ๊ฐ์ฒด๋ฅผ ๊ฐ์ง€ํ•˜๊ธฐ ์œ„ํ•œ ๋ชจ๋ธ์˜ ์šฉ๋Ÿ‰ ์ฆ๊ฐ€

๊ฐ€์„ค์ ์œผ๋กœ ๋งํ•˜์ž๋ฉด, ์ˆ˜์šฉ ์˜์—ญ ํฌ๊ธฐ๊ฐ€ ๋” ํฐ ๋ชจ๋ธ(๋” ๋งŽ์€ 3 × 3 ํ•ฉ์„ฑ๊ณฑ ๋ ˆ์ด์–ด๋ฅผ ๊ฐ€์ง„)๊ณผ ๋” ๋งŽ์€ ๋งค๊ฐœ ๋ณ€์ˆ˜๋ฅผ ๊ฐ€์ง„ ๋ชจ๋ธ์ด ๋ฐฑ๋ณธ์œผ๋กœ ์„ ํƒ๋˜์–ด์•ผ ํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ‘œ 1์€ CSPResNeXt50, CSPDarknet53 ๋ฐ EfficientNet B3์˜ ์ •๋ณด๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. CSPResNext50์€ 16๊ฐœ์˜ 3 × 3 ํ•ฉ์„ฑ๊ณฑ ๋ ˆ์ด์–ด, 425 × 425 ์ˆ˜์šฉ ์˜์—ญ ๋ฐ 20.6M ๋งค๊ฐœ ๋ณ€์ˆ˜๋งŒ์„ ํฌํ•จํ•˜๋ฉฐ, CSPDarknet53์€ 29๊ฐœ์˜ 3 × 3 ํ•ฉ์„ฑ๊ณฑ ๋ ˆ์ด์–ด, 725 × 725 ์ˆ˜์šฉ ์˜์—ญ ๋ฐ 27.6M ๋งค๊ฐœ ๋ณ€์ˆ˜๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ์ด๋ก ์ ์ธ ๊ทผ๊ฑฐ์™€ ์šฐ๋ฆฌ์˜ ๋‹ค์–‘ํ•œ ์‹คํ—˜๋“ค๊ณผ ํ•จ๊ป˜, CSPDarknet53 ์‹ ๊ฒฝ๋ง์ด ๊ฐ์ง€๊ธฐ์˜ ๋ฐฑ๋ณธ์œผ๋กœ์„œ ๋‘ ๋ชจ๋ธ ์ค‘์—์„œ ์ตœ์ ์˜ ๋ชจ๋ธ์ž„์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ์ˆ˜์šฉ ์˜์—ญ์˜ ์˜ํ–ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์š”์•ฝ๋ฉ๋‹ˆ๋‹ค: • ๊ฐ์ฒด ํฌ๊ธฐ๊นŒ์ง€ - ์ „์ฒด ๊ฐ์ฒด๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. • ๋„คํŠธ์›Œํฌ ํฌ๊ธฐ๊นŒ์ง€ - ๊ฐ์ฒด ์ฃผ๋ณ€์˜ ์ปจํ…์ŠคํŠธ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. • ๋„คํŠธ์›Œํฌ ํฌ๊ธฐ๋ฅผ ์ดˆ๊ณผ - ์ด๋ฏธ์ง€ ์ ๊ณผ ์ตœ์ข… ํ™œ์„ฑํ™” ์‚ฌ์ด์˜ ์—ฐ๊ฒฐ ์ˆ˜๋ฅผ ์ฆ๊ฐ€์‹œํ‚ต๋‹ˆ๋‹ค.

์šฐ๋ฆฌ๋Š” CSPDarknet53์— SPP ๋ธ”๋ก์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์ˆ˜์šฉ ์˜์—ญ์„ ํฌ๊ฒŒ ์ฆ๊ฐ€์‹œํ‚ค๊ณ  ๊ฐ€์žฅ ์ค‘์š”ํ•œ ์ปจํ…์ŠคํŠธ ํŠน์ง•์„ ๋ถ„๋ฆฌํ•˜์—ฌ ๋„คํŠธ์›Œํฌ ์ž‘๋™ ์†๋„๋ฅผ ๊ฑฐ์˜ ๊ฐ์†Œ์‹œํ‚ค์ง€ ์•Š๋Š”๋‹ค๋Š” ์ด์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” YOLOv3์—์„œ ์‚ฌ์šฉ๋œ FPN ๋Œ€์‹ ์— ๋‹ค๋ฅธ ๋ฐฑ๋ณธ ๋ ˆ๋ฒจ๋กœ๋ถ€ํ„ฐ์˜ ๋งค๊ฐœ ๋ณ€์ˆ˜ ํ†ตํ•ฉ ๋ฐฉ๋ฒ•์ธ PANet์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

๋งˆ์ง€๋ง‰์œผ๋กœ, ์šฐ๋ฆฌ๋Š” YOLOv4์˜ ์•„ํ‚คํ…์ฒ˜๋กœ CSPDarknet53 ๋ฐฑ๋ณธ, SPP ์ถ”๊ฐ€ ๋ชจ๋“ˆ, PANet ๊ฒฝ๋กœ ์ง‘๊ณ„ ๋„ฅ, YOLOv3 (์•ต์ปค ๊ธฐ๋ฐ˜) ํ—ค๋“œ๋ฅผ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.

์•ž์œผ๋กœ ์šฐ๋ฆฌ๋Š” ๊ฐ์ง€๊ธฐ๋ฅผ ์œ„ํ•œ Bag of Freebies (BoF)์˜ ๋‚ด์šฉ์„ ํฌ๊ฒŒ ํ™•์žฅํ•  ๊ณ„ํš์ด๋ฉฐ, ์ด๋ก ์ ์œผ๋กœ ์ผ๋ถ€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ  ๊ฐ์ง€๊ธฐ์˜ ์ •ํ™•๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋„๋ก ๊ฐ ํŠน์ง•์˜ ์˜ํ–ฅ์„ ์‹คํ—˜์ ์œผ๋กœ ์ˆœ์ฐจ์ ์œผ๋กœ ํ™•์ธํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์šฐ๋ฆฌ๋Š” Cross-GPU Batch Normalization (CGBN ๋˜๋Š” SyncBN) ๋˜๋Š” ๊ณ ๊ฐ€์˜ ์ „์šฉ ์žฅ์น˜๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ด๋กœ์จ ๋ˆ„๊ตฌ๋‚˜ ์ผ๋ฐ˜์ ์ธ ๊ทธ๋ž˜ํ”ฝ ํ”„๋กœ์„ธ์„œ์ธ GTX 1080Ti ๋˜๋Š” RTX 2080Ti์—์„œ ์šฐ๋ฆฌ์˜ ์ตœ์ฒจ๋‹จ ๊ฒฐ๊ณผ๋ฅผ ์žฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

3.2. Selection of BoF and BoS

๊ฐ์ง€๊ธฐ ํ•™์Šต์„ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด CNN์€ ์ผ๋ฐ˜์ ์œผ๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

  • ํ™œ์„ฑํ™” ํ•จ์ˆ˜: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, ๋˜๋Š” Mish
  • Bounding box ํšŒ๊ท€ ์†์‹ค: MSE, IoU, GIoU, CIoU, DIoU
  • ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•: CutOut, MixUp, CutMix
  • ์ •๊ทœํ™” ๋ฐฉ๋ฒ•: DropOut, DropPath, Spatial DropOut, ๋˜๋Š” DropBlock
  • ๋„คํŠธ์›Œํฌ ํ™œ์„ฑํ™” ์ •๊ทœํ™”: Batch Normalization (BN), Cross-GPU Batch Normalization (CGBN ๋˜๋Š” SyncBN), Filter Response Normalization (FRN), ๋˜๋Š” Cross-Iteration Batch Normalization (CBN)
  • ์Šคํ‚ต-์ปค๋„ฅ์…˜: ์ž”์ฐจ ์—ฐ๊ฒฐ, ๊ฐ€์ค‘ ์ž”์ฐจ ์—ฐ๊ฒฐ, ๋‹ค์ค‘ ์ž…๋ ฅ ๊ฐ€์ค‘ ์ž”์ฐจ ์—ฐ๊ฒฐ, ๋˜๋Š” Cross stage ๋ถ€๋ถ„ ์—ฐ๊ฒฐ (CSP)

ํ•™์Šต ํ™œ์„ฑํ™” ํ•จ์ˆ˜์˜ ๊ฒฝ์šฐ, PReLU์™€ SELU๋Š” ํ•™์Šต์ด ๋” ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์— ํ›„๋ณด ๋ชฉ๋ก์—์„œ ์ œ์™ธํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ReLU6์€ ์–‘์žํ™” ๋„คํŠธ์›Œํฌ๋ฅผ ์œ„ํ•ด ํŠน๋ณ„ํžˆ ์„ค๊ณ„๋œ ํ•จ์ˆ˜์ด๋ฏ€๋กœ ์ œ์™ธํ•ฉ๋‹ˆ๋‹ค. ์ •๊ทœํ™” ๋ฐฉ๋ฒ•์˜ ์„ ํƒ์—์„œ๋Š” DropBlock์ด ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•๋“ค๊ณผ ์ž์„ธํžˆ ๋น„๊ต๋˜์—ˆ๊ณ , ๊ทธ ๋ฐฉ๋ฒ•์ด ๋งŽ์€ ์„ฑ๊ณผ๋ฅผ ์–ป์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์šฐ๋ฆฌ๋Š” ์ฃผ์ €ํ•˜์ง€ ์•Š๊ณ  DropBlock์„ ์ •๊ทœํ™” ๋ฐฉ๋ฒ•์œผ๋กœ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. ์ •๊ทœํ™” ๋ฐฉ๋ฒ•์˜ ์„ ํƒ์—์„œ๋Š” ํ•˜๋‚˜์˜ GPU๋งŒ ์‚ฌ์šฉํ•˜๋Š” ํ•™์Šต ์ „๋žต์— ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์— syncBN์€ ๊ณ ๋ ค๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

 

3.3. Additional improvements

๋‹จ์ผ GPU์—์„œ ํ•™์Šต์— ๋” ์ ํ•ฉํ•œ ๋””์ž์ธ ๊ฐ์ง€๊ธฐ๋ฅผ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถ”๊ฐ€์ ์ธ ์„ค๊ณ„์™€ ๊ฐœ์„ ์„ ์ˆ˜ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค:

  • Mosaic ๋ฐ Self-Adversarial Training (SAT)๊ณผ ๊ฐ™์€ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• ๋ฐฉ๋ฒ•์„ ๋„์ž…ํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ์œ ์ „ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ ์šฉํ•˜๋ฉด์„œ ์ตœ์ ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์„ ํƒํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ๊ธฐ์กด์˜ ์ผ๋ถ€ ๋ฐฉ๋ฒ•์„ ์ˆ˜์ •ํ•˜์—ฌ ํšจ์œจ์ ์ธ ํ•™์Šต๊ณผ ๊ฐ์ง€๋ฅผ ์œ„ํ•ด ์ ํ•ฉํ•˜๋„๋ก ๊ฐœ์„ ํ–ˆ์Šต๋‹ˆ๋‹ค - ์ˆ˜์ •๋œ SAM, ์ˆ˜์ •๋œ PAN ๋ฐ Cross mini-Batch Normalization (CmBN).

Mosaic์€ 4๊ฐœ์˜ ํ•™์Šต ์ด๋ฏธ์ง€๋ฅผ ํ˜ผํ•ฉํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• ๋ฐฉ๋ฒ•์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ 4๊ฐœ์˜ ๋‹ค๋ฅธ ๋ฌธ๋งฅ์ด ํ˜ผํ•ฉ๋˜๋ฉฐ, CutMix๋Š” ์ž…๋ ฅ ์ด๋ฏธ์ง€๋ฅผ 2๊ฐœ๋งŒ ํ˜ผํ•ฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์ •์ƒ์ ์ธ ๋ฌธ๋งฅ ๋ฐ–์˜ ๊ฐ์ฒด๋ฅผ ๊ฐ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, ๋ฐฐ์น˜ ์ •๊ทœํ™”๋Š” ๊ฐ ๋ ˆ์ด์–ด์—์„œ 4๊ฐœ์˜ ๋‹ค๋ฅธ ์ด๋ฏธ์ง€๋กœ๋ถ€ํ„ฐ ํ™œ์„ฑํ™” ํ†ต๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋Œ€์šฉ๋Ÿ‰ ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ํฌ๊ธฐ์˜ ํ•„์š”์„ฑ์„ ํฌ๊ฒŒ ์ค„์ž…๋‹ˆ๋‹ค.

Self-Adversarial Training (SAT) is also a new data augmentation technique that operates in 2 forward-backward stages. In the first stage, the neural network alters the original image instead of the network weights. In this way the neural network executes an adversarial attack on itself, altering the original image to create the deception that the desired object is not in the image. In the second stage, the neural network is trained to detect objects on this modified image in the normal way.

CmBN์€ Figure 4์— ๋‚˜์™€ ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ์ˆ˜์ •๋œ CBN(Cross mini-Batch Normalization)์˜ ๋ฒ„์ „์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์ด๋Š” ๋‹จ์ผ ๋ฐฐ์น˜ ๋‚ด์˜ ๋ฏธ๋‹ˆ ๋ฐฐ์น˜ ๊ฐ„์—๋งŒ ํ†ต๊ณ„๋ฅผ ์ˆ˜์ง‘ํ•ฉ๋‹ˆ๋‹ค.

์šฐ๋ฆฌ๋Š” SAM์„ ๊ณต๊ฐ„ ๋ฐฉ์‹ ์ฃผ์˜(Spatial-wise Attention)์—์„œ ์ ๋ณ„ ๋ฐฉ์‹ ์ฃผ์˜(Pointwise Attention)๋กœ ์ˆ˜์ •ํ•˜๊ณ , PAN์˜ shortcut ์—ฐ๊ฒฐ์„ Figure 5์™€ Figure 6์— ๋‚˜์™€ ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ์—ฐ๊ฒฐ(concatenation)๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

3.4. YOLOv4

YOLOv4๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ตฌ์„ฑ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค: • Backbone: CSPDarknet53 [81] • Neck: SPP [25], PAN [49] • Head: YOLOv3 [63]

YOLOv4๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์š”์†Œ๋“ค์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค: • Backbone์— ๋Œ€ํ•œ Bag of Freebies (BoF): CutMix์™€ Mosaic ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•, DropBlock ์ •๊ทœํ™”, ํด๋ž˜์Šค ๋ผ๋ฒจ ์Šค๋ฌด๋”ฉ • Backbone์— ๋Œ€ํ•œ Bag of Specials (BoS): Mish ํ™œ์„ฑํ™” ํ•จ์ˆ˜, Cross-stage partial connections (CSP), Multiinput weighted residual connections (MiWRC) • Detector์— ๋Œ€ํ•œ Bag of Freebies (BoF): CIoU-loss, CmBN, DropBlock ์ •๊ทœํ™”, Mosaic ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•, Self-Adversarial Training, ๊ทธ๋ฆฌ๋“œ ๋ฏผ๊ฐ๋„ ์ œ๊ฑฐ, ๋‹จ์ผ ground truth์— ๋Œ€ํ•œ ์—ฌ๋Ÿฌ ์•ต์ปค ์‚ฌ์šฉ, ์ฝ”์‚ฌ์ธ ์•ค๋‹๋ง ์Šค์ผ€์ค„๋Ÿฌ [52], ์ตœ์ ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ, ๋ฌด์ž‘์œ„ ํ›ˆ๋ จ ํ˜•ํƒœ • Detector์— ๋Œ€ํ•œ Bag of Specials (BoS): Mish ํ™œ์„ฑํ™” ํ•จ์ˆ˜, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS

์ฆ‰, YOLOv4๋Š” CSPDarknet53๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๊ณ , SPP์™€ PAN์„ ์ด์šฉํ•œ ๋„ฅ(neck) ๊ตฌ์กฐ, ๊ทธ๋ฆฌ๊ณ  YOLOv3๋ฅผ ์ด์šฉํ•œ ํ—ค๋“œ(head)๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. BoF์™€ BoS๋ฅผ ํ†ตํ•ด backbone๊ณผ detector์˜ ๋‹ค์–‘ํ•œ ๊ฐœ์„  ๊ธฐ๋ฒ•๊ณผ ์š”์†Œ๋“ค์„ ์ ์šฉํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

 

4. Experiments

์šฐ๋ฆฌ๋Š” ๋‹ค์–‘ํ•œ ํ›ˆ๋ จ ๊ฐœ์„  ๊ธฐ์ˆ ์ด ImageNet (ILSVRC 2012 val) ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋ถ„๋ฅ˜๊ธฐ์˜ ์ •ํ™•๋„์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์„ ํ…Œ์ŠคํŠธํ•˜์˜€์œผ๋ฉฐ, ๊ทธ ๋‹ค์Œ MS COCO (test-dev 2017) ๋ฐ์ดํ„ฐ์…‹์—์„œ ํƒ์ง€๊ธฐ์˜ ์ •ํ™•๋„์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์„ ํ…Œ์ŠคํŠธํ•˜์˜€์Šต๋‹ˆ๋‹ค.

๐Ÿ’ก
<๋ฆฌ๋ทฐ>
1. Intro

<YOLOv4์—์„œ ์ ์šฉํ•œ ๊ธฐ๋ฒ•>

  1. WRC (Weighted-Residual-Connections)
  2. CSP (Cross-Stage-Partial-Connections)
  3. CmBN (Cross mini-Batch Normalization)
  4. SAT (Self-Adversarial-Training)
  5. Mish Activation
  6. Mosaic Data Augmentation
  7. DropBlock Regularization
  8. CIoU Loss
  • The latest neural networks achieve high accuracy but suffer from low FPS (not real-time) and excessively large mini-batch sizes, which means many GPUs are needed for training.
๐Ÿ’ก
๊ฒฐ๋ก : 1๊ฐœ์˜ GPU๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ผ๋ฐ˜์ ์ธ ํ•™์Šต ํ™˜๊ฒฝ์—์„œ BOF, BOS ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜์—ฌ ํšจ์œจ์ ์ด๊ณ  ๊ฐ•๋ ฅํ•œ Object Detection์„ ๋งŒ๋“ฌ
  • YOLOv4 achieves AP performance comparable to EfficientDet while running at twice the FPS.
  • Compared with YOLOv3, AP improves by 10% and FPS by 12%.
2. Related Work
2.1 Object detection models
  • ํ˜„๋Œ€์˜ Detector์˜ ๊ฒฝ์šฐ pre-trained๋œ 1. Backbone๊ณผ, class์™€ bounding box๋ฅผ ์˜ˆ์ธกํ•˜๋Š” 2. head๋กœ ๊ตฌ์„ฑ๋จ
  • ์ตœ๊ทผ ๋ช‡ ๋…„ ๊ฐ„ ๊ฐœ๋ฐœ๋œ Detector๋Š” Backbone๊ณผ Head ์‚ฌ์ด์— ๋ช‡ ๊ฐœ์˜ layer๋ฅผ ์‚ฝ์ž…ํ•จ์œผ๋กœ์จ ์ผ๋ฐ˜์ ์œผ๋กœ ๋‹ค๋ฅธ ๋‹จ๊ณ„์—์„œ feature map์„ ์ˆ˜์ง‘ํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋จ. ์ด๋ฅผ ‘Neck’์ด๋ผ๊ณ  ํ‘œํ˜„

1. Backbone

  • detector๊ฐ€ GPU or CPU์— ๋”ฐ๋ผ backbone์„ ๊ตฌ๋ถ„ํ•จ

1.1 GPU

→ VGG [68], ResNet [26], ResNeXt [86], DenseNet [30]

1.2 CPU

→ SqueezeNet [31], MobileNet [28, 66, 27, 74], ShuffleNet [97, 53]


2. Head

  • head ๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ One-stage์™€ Two-stage๋กœ ๋‚˜๋‰จ
2.1 Two-stage
  • two-stage detector๋Š” ๋‹ค์‹œ anchor-based detector์™€ anchor-free detector๋กœ ๋‚˜๋‰˜์–ด์ง

2.1.1 Anchor-based Two-stage detector

→ R-CNN ์‹œ๋ฆฌ์ฆˆ๋กœ, fast R-CNN, faster R-CNN, R-FCN, Libra R-CNN ๋“ฑ์ด ์กด์žฌ

  • Fast R-CNN: feeds a single image through a CNN to obtain RoIs (Regions of Interest), then passes fixed-size feature vectors from RoI pooling to fc layers, improving training and inference speed over R-CNN
  • Faster R-CNN: adds an RPN (Region Proposal Network) to minimize the bottleneck in the pipeline, improving training/inference speed and accuracy over Fast R-CNN
  • R-FCN (Region-based Fully Convolutional Network): a fully convolutional network based on regions in the image, i.e., on the positional information of objects

    → RoI๋ผ๋ฆฌ ์—ฐ์‚ฐ์„ ๊ณต์œ ํ•˜๋ฉฐ ๊ฐ์ฒด์˜ ์œ„์น˜ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๋Š” Position sensitive score map & RoI pooling์„ ๋„์ž…ํ•˜์—ฌ Translation invariance dilemma ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•œ ๋ชจ๋ธ

  • Libra R-CNN: a model that considers the balance between the classification and regression losses
    • It introduces IoU-based sampling, a Balanced Feature Pyramid, and Balanced L1 loss to mitigate the imbalance problems that arise when training object detection models

2.1.2 Anchor-free Two-stage detector

→ RepPoints๊ฐ€ ๋Œ€ํ‘œ์ ์ž„

  • RepPoints: uses deformable convolution to place points around an object's outline, then performs anchor-free object detection based on the resulting reppoints
2.2 One-stage

2.2.1 Anchor-based One-stage detector

→ YOLO, SSD, and RetinaNet are representative

  • YOLO: performs classification and localization simultaneously in a single unified network, achieving fast inference
  • SSD (Single Shot multibox Detector): uses multi-scale feature maps and defines default boxes of various scales and ratios to recognize objects of various sizes and aspect ratios, achieving both high accuracy and fast inference
    • It aims to fundamentally improve the trade-off in which 2-stage models sacrifice speed for accuracy and 1-stage models sacrifice accuracy for speed
    • It is built by attaching extra convolutional layers (Extra Feature Layers, an auxiliary network) to a pre-trained convolutional network called the Base Network
     
    https://skyil.tistory.com/202
  • RetinaNet: Focal Loss๋ฅผ ๋„์ž…ํ•˜์—ฌ object detection task์—์„œ ๋ฐœ์ƒํ•˜๋Š” class imbalance ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•œ ๋ชจ๋ธ

2.1.4 Anchor-free One-stage detector

  • CornerNet: detects objects via two keypoints
    • CornerNet recognizes two corners, top-left and bottom-right, to produce a bounding box
    • it excels at finding object boundaries, but lacks global information about what lies inside them, so it sometimes cannot tell how far along a boundary the object extends and produces incorrect bounding boxes

    (source: https://velog.io/@to2915ny/CenterNet)
  • CenterNet: improves on CornerNet's weaknesses by using keypoint triplets via center pooling and cascade corner pooling
    • cascade corner pooling first finds the maximum value along the boundary, as before, then follows that location inward to find an internal maximum value
    • combining these maxima lets the corners capture both the bounding box and the information inside the object, yielding more accurate bounding boxes

  • FCOS: predicts the distances from a center point to the bounding-box edges without anchors, and uses center-ness to normalize the center distance and remove low-quality boxes
    • center-ness measures how far a pixel deviates from the center of its bounding box
    • can be reused in combination with other FCN-solvable tasks such as semantic segmentation
    • using no anchors or proposals reduces the number of parameters to tune, making training simpler

    (source: https://talktato.tistory.com/26)
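The center-ness score above follows directly from the four edge distances. A minimal sketch (the function name `centerness` is ours, not from the FCOS code):

```python
import math

def centerness(l, t, r, b):
    # FCOS center-ness from the distances (left, top, right, bottom)
    # between a pixel and the four edges of its bounding box:
    # sqrt( min(l,r)/max(l,r) * min(t,b)/max(t,b) )
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(centerness(5, 5, 5, 5))  # 1.0 — a pixel at the exact box center
```

Pixels far from the center get a score close to 0, so their boxes are suppressed after multiplication with the classification score.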

3. Neck

  • ์ผ๋ฐ˜์ ์œผ๋กœ ๋„ฅ์€ ์—ฌ๋Ÿฌ ๊ฐœ์˜ Bottom-up ๊ฒฝ๋กœ์™€ Top-down ๊ฒฝ๋กœ๋กœ ๊ตฌ์„ฑ๋จ

→ Feature Pyramid Network (FPN) , Path Aggregation Network (PAN) , BiFPN , NAS-FPN ๋“ฑ์ด ์กด์žฌ

  • FPN (Feature Pyramid Network)
    • contributes to better detection performance by meaningfully combining low-level and high-level feature maps through top-down and bottom-up pathways with lateral connections

    → upsamples feature maps 2x at a time using nearest-neighbor upsampling

    → applies a 1x1 conv to the bottom-up pathway's feature map to reduce its channels, then simply adds it to the top-down pathway's feature map

    (source: https://deep-learning-study.tistory.com/491)
  • PAN (Path Aggregation Network) - PANet
    • improves localization performance in object detection by effectively delivering low-level feature information to high-level features through bottom-up path augmentation
    • a model for instance segmentation built on Mask R-CNN
    • proceeds through the stages Path Augmentation -> Pooling -> Fusion

※ Adding the bottom-up path can be understood as simply running the bottom-up pass once more; bottom-up path augmentation mixes high-level and low-level information evenly → afterwards, pooling is applied, followed by RoI extraction or segmentation (method d or e in the figure), depending on the task

(source: https://memesoo99.tistory.com/70)
    • → In the original design (a), features at each level pass through the ResNet-50 stages, so precise information is lost along the way (FPN's shortcoming) → to improve this, PANet constructs simple N2-N5 levels and performs the bottom-up pass once more
  • NAS-FPN (Neural Architecture Search-FPN)
    • the optimal FPN structure obtained by searching for an efficient feature pyramid with NAS
      → applies AutoML's Neural Architecture Search to the FPN structure, letting a neural network design the connections instead of a human
    • → however, NAS-FPN takes a long time to search and its structure is complex
  • Fully-connected FPN
    • a feature pyramid that fully connects and fuses information across feature maps of different levels
  • BiFPN
    • an FPN structure that adds edges between feature maps of the same scale so that more features can be fused

โ€ป ์ผ๋ถ€ ์—ฐ๊ตฌ์ž๋“ค์€ ๊ฐ์ฒด ํƒ์ง€๋ฅผ ์œ„ํ•ด ์ƒˆ๋กœ์šด ๋ฐฑ๋ณธ(DetNet [43], DetNAS [7])์ด๋‚˜ ์ „์ฒด ๋ชจ๋ธ(SpineNet [12], HitDetector [20])์„ ์ง์ ‘ ๊ตฌ์ถ•ํ•˜๋Š” ๋ฐ ํž˜์“ฐ๊ณ  ์žˆ๋‹ค๊ณ  ํ•จ

2.2 Bag of freebies
  • Training methods that add only training cost while leaving inference time unchanged are called Bag of Freebies (BoF)
  1. Data augmentation
  2. Semantic distribution bias
  3. Objective function of bounding box regression
2.2.1 Data augmentation
  1. Random erase, CutOut
    • randomly select a rectangular region of the image and fill it with random or complementary values
  2. Hide-and-seek, GridMask
    • randomly or uniformly select multiple rectangular regions of the image and replace them with zeros; applied to features, the same idea appears as DropOut, DropConnect, DropBlock, etc.
  3. MixUp
    • multiply two images by different coefficients, overlay them, and use labels adjusted accordingly
  4. CutMix
    • paste a cropped image over a rectangular region of another image and adjust the labels according to the size of the mixed region
  5. GAN (style-transfer GANs used as augmentation)
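MixUp and CutMix differ only in how the pixels are combined; both mix the labels by the blending coefficient or the pasted area. A minimal NumPy sketch (function names and the fixed seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(img_a, img_b, label_a, label_b, lam):
    # blend two whole images with coefficient lam; labels mix the same way
    img = lam * img_a + (1 - lam) * img_b
    label = lam * label_a + (1 - lam) * label_b
    return img, label

def cutmix(img_a, img_b, label_a, label_b, lam):
    # paste a rectangular patch of img_b onto img_a; label weight follows area
    h, w = img_a.shape[:2]
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    y, x = rng.integers(0, h - cut_h + 1), rng.integers(0, w - cut_w + 1)
    img = img_a.copy()
    img[y:y + cut_h, x:x + cut_w] = img_b[y:y + cut_h, x:x + cut_w]
    area = (cut_h * cut_w) / (h * w)
    label = (1 - area) * label_a + area * label_b
    return img, label

img_a, img_b = np.zeros((8, 8)), np.ones((8, 8))
mixed, label = mixup(img_a, img_b, np.array([1., 0.]), np.array([0., 1.]), 0.5)
print(label)  # [0.5 0.5]
```

Note the square root in CutMix: the pasted patch then covers a fraction (1 - lam) of the image area, matching the label weights.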
2.2.2 Semantic distribution bias
  • Methods for addressing semantic distribution bias in the data
  1. Two-stage
    • Hard negative example mining: adds false-positive samples the model misjudged back into training, making the model more robust and reducing false-positive errors
    • OHEM (Online Hard Example Mining): forward-passes all RoIs, computes their losses, and backpropagates only through the high-loss RoIs, a bootstrapping method that improves both training speed and performance
  2. One-stage
    • Focal Loss
  3. One-hot hard representation

➡️ A one-hot hard representation cannot express the degree of relationship between different categories; this issue mainly arises during labeling

  • Label smoothing: assigns soft labels instead of hard 0/1 labels, regularizing the model, improving generalization, and helping with recalibration
  • Label refinement network: predicts coarse labels at different resolutions, then progressively refines them into finer labels
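Label smoothing from the list above is a one-liner: replace the hard one-hot target with a mixture of the one-hot vector and a uniform distribution. A sketch (ε = 0.1 is a typical choice, not a value fixed by the paper):

```python
def smooth_labels(one_hot, eps=0.1):
    # spread eps of the probability mass uniformly over all k classes:
    # y_smooth = y * (1 - eps) + eps / k
    k = len(one_hot)
    return [v * (1 - eps) + eps / k for v in one_hot]

smoothed = smooth_labels([0, 0, 1, 0])  # hard 1 -> ~0.925, hard 0 -> ~0.025
```

The target still sums to 1, but the model is no longer pushed toward infinitely confident logits, which is the regularizing effect described above.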
2.2.3 Objective function of bounding box regression
  1. Anchor-based method
    • MSE (Mean Squared Error): a loss function that performs regression directly on the bounding-box coordinates
  2. IoU-based method
    • IoU: intersection / union
    • GIoU: also uses C, the smallest box enclosing both boxes
    • DIoU: considers the distance between box centers in addition to IoU
    • CIoU: adds the aspect ratio of the two boxes on top of the DIoU loss, enabling faster convergence when the boxes do not overlap

    (source: https://silhyeonha-git.tistory.com/3)
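The IoU-based terms above build on each other: GIoU adds an enclosing-box penalty, DIoU a normalized center distance, and CIoU an aspect-ratio penalty. A sketch of the metrics for axis-aligned boxes (each loss is 1 minus the corresponding value):

```python
import math

def box_iou(a, b):
    # boxes are (x1, y1, x2, y2); IoU = intersection / union
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter)

def giou(a, b):
    # subtract the fraction of the smallest enclosing box C not covered by the union
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - iw * ih
    return box_iou(a, b) - (cw * ch - union) / (cw * ch)

def diou(a, b):
    # penalize the squared center distance, normalized by C's squared diagonal
    cxa, cya = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cxb, cyb = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    rho2 = (cxa - cxb) ** 2 + (cya - cyb) ** 2
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    return box_iou(a, b) - rho2 / (cw ** 2 + ch ** 2)

def ciou(a, b):
    # add an aspect-ratio consistency term v, weighted by alpha
    i = box_iou(a, b)
    v = (4 / math.pi ** 2) * (math.atan((a[2] - a[0]) / (a[3] - a[1]))
                              - math.atan((b[2] - b[0]) / (b[3] - b[1]))) ** 2
    alpha = v / (1 - i + v) if (1 - i + v) > 0 else 0.0
    return diou(a, b) - alpha * v

print(box_iou((0, 0, 2, 2), (1, 0, 3, 2)))  # overlap 2, union 6 -> 0.333...
```

Unlike plain IoU, GIoU/DIoU/CIoU stay informative (negative) for disjoint boxes, which is why their losses converge faster when boxes do not overlap.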
2.3 Bag of specials
  • Additional modules and post-processing methods used to improve accuracy
  1. Enlarging the receptive field
  2. Introducing attention mechanisms
  3. Strengthening the ability to integrate features
  4. Activation functions
  5. Post-processing: approaches that screen the model's predictions
2.3.1 Enhance receptive field
  • Methods for enlarging the receptive field over the feature maps obtained from the backbone network
  • 1. SPP (Spatial Pyramid Pooling): same idea as the SPP additional block, but improved to enlarge the feature map's receptive field by performing max pooling with kernels of various sizes ({1, 5, 9, 13})
  • 2. ASPP (Atrous Spatial Pyramid Pooling)
  • 3. RFB (Receptive Field Block): unlike ASPP, uses different kernel sizes per branch
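The modified SPP above keeps the spatial size by using stride-1 max pooling with "same" padding for each kernel and concatenating along the channel axis. A NumPy sketch (the 32-channel 13x13 map is an arbitrary toy shape):

```python
import numpy as np

def maxpool_same(x, k):
    # stride-1 max pooling with 'same' padding; x has shape (C, H, W)
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    c, h, w = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def spp_block(x, kernels=(1, 5, 9, 13)):
    # concatenate the multi-scale max-pooled maps along the channel axis
    return np.concatenate([maxpool_same(x, k) for k in kernels], axis=0)

feat = np.random.default_rng(0).random((32, 13, 13))
out = spp_block(feat)
print(out.shape)  # (128, 13, 13)
```

The 1x1 branch is the identity, so the original features are preserved alongside the enlarged-receptive-field ones.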
2.3.2 Attention module
  • 1. Channel-wise attention
    • SE (Squeeze-and-Excitation): applies Global Average Pooling to the input feature map, feeds the result through fc layers to estimate per-channel importance, then multiplies it back into the original feature map to recalibrate each channel
  • 2. Point-wise attention
    • SAM (Spatial Attention Module):
    • applies channel-axis MaxPool and AvgPool to F', the product of the channel attention map and the input feature map, and concatenates the two resulting 1xHxW maps F_avg and F_max
    • a 7x7 conv is then applied to produce the spatial attention map
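The SAM computation above (channel-axis average and max pooling, concatenation, then a 7x7 conv and sigmoid) can be sketched in NumPy; since the learned 7x7 conv needs trained weights we do not have, the sketch substitutes a simple mean over the two pooled maps — an assumption for illustration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sam_sketch(f):
    # f: (C, H, W) feature map
    f_avg = f.mean(axis=0, keepdims=True)             # (1, H, W) channel-avg pool
    f_max = f.max(axis=0, keepdims=True)              # (1, H, W) channel-max pool
    stacked = np.concatenate([f_avg, f_max], axis=0)  # (2, H, W), as in SAM
    # stand-in for the learned 7x7 conv: average the two pooled maps
    att = sigmoid(stacked.mean(axis=0, keepdims=True))  # (1, H, W) attention map
    return f * att  # broadcast the spatial attention over all channels

feat = np.ones((4, 3, 3))
out = sam_sketch(feat)
print(out.shape)  # (4, 3, 3)
```

The output keeps the input shape; only the relative weighting of spatial positions changes.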
2.3.3 Feature integration
  • Research into lighter-weight alternatives to FPN for integrating feature maps
  1. SFAM: uses SE modules to apply channel-wise level weights to multi-scale concatenated feature maps
  2. ASFF: uses softmax to apply point-wise level weights and then sums feature maps of different scales
  3. BiFPN: proposes multi-input weighted residual connections to apply scale-wise level weights and sum feature maps of different scales
2.3.4 Activation Function
  • LReLU (Leaky ReLU)
  • PReLU (Parametric ReLU): makes the slope for negative inputs a learnable parameter
  • ReLU6: caps ReLU at a maximum value of 6, enabling efficient optimization
  • SELU (Scaled Exponential Linear Unit): has a self-normalizing effect that prevents exploding and vanishing gradients
  • Swish: the input multiplied by its sigmoid; performs well when training deep layers
  • hard-Swish: replaces the expensive sigmoid inside Swish, which is costly on embedded devices, with a cheap piecewise-linear approximation
  • Mish: has no upper bound, so saturation from capping does not occur, and allows slightly negative values so gradients flow well

(source: https://herbwood.tistory.com/24)
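The last three activations in the list differ only by small formulas; a quick sketch for scalar inputs:

```python
import math

def swish(x):
    # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def hard_swish(x):
    # cheap piecewise-linear stand-in for Swish on embedded devices
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0

def mish(x):
    # x * tanh(softplus(x)): no upper bound, slightly negative for x < 0
    return x * math.tanh(math.log1p(math.exp(x)))

print(mish(0.0), swish(0.0), hard_swish(0.0))  # 0.0 0.0 0.0
```

For large positive x all three behave like the identity; the differences show up around and below zero, where Mish keeps a small negative response instead of clamping to 0.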
2.3.5 Post-processing method
  • NMS: removes redundant bounding boxes that predict the same object
  1. Greedy NMS: the most common form; starting from the box with the highest confidence score, removes boxes whose IoU with it exceeds a threshold
  2. Soft-NMS: instead of removing boxes whose IoU with the top-scoring box exceeds the threshold, decays their confidence scores, preventing a drop in detection performance
  3. DIoU-NMS: adds a DIoU penalty term to the NMS criterion, improving detection of overlapping objects
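Greedy NMS from the list above, sketched in pure Python (Soft-NMS would decay the scores of overlapping boxes instead of dropping them):

```python
def greedy_nms(boxes, scores, iou_thresh=0.5):
    # keep the highest-scoring box, drop boxes overlapping it above the threshold
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 0.81, so it is suppressed; the distant third box survives.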

3. Methodology
3.2 Selection of BoF and BoS
  • Activation function: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish
  • Bounding box regression loss: MSE, IoU, GIoU, CIoU, DIoU
  • Data augmentation: CutOut, MixUp, CutMix
  • Regularization method: DropOut, DropPath, Spatial DropOut, or DropBlock
  • Normalization of the network activations by their mean and variance: Batch Normalization (BN), Cross-GPU Batch Normalization (CGBN or SyncBN), Filter Response Normalization (FRN), or Cross-Iteration Batch Normalization (CBN)
  • Skip-connections: residual connections, weighted residual connections, multi-input weighted residual connections, or Cross-Stage Partial connections (CSP)
3.3 Additional improvements

1. Mosaic

  • 4๊ฐœ์˜ ํ•™์Šต ์ด๋ฏธ์ง€๋ฅผ ํ˜ผํ•ฉํ•˜๋Š” ์ƒˆ๋กœ์šด data augmentation ๋ฐฉ๋ฒ•
  • ๋Œ€์šฉ๋Ÿ‰ ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ํฌ๊ธฐ์˜ ํ•„์š”์„ฑ์„ ํฌ๊ฒŒ ์ค„์ž„ (batch size=4๋ฅผ ํ•˜๋‚˜์˜ ์ด๋ฏธ์ง€๋กœ ๋ณด์—ฌ์ฃผ๋Š” ํšจ๊ณผ๋ฅผ ์ง€๋‹˜)
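A minimal sketch of the mosaic idea — tiling four images into the quadrants of one canvas (real implementations also randomize the crossover point and remap the bounding boxes, which this sketch omits):

```python
import numpy as np

def mosaic(imgs, size):
    # place four images into the quadrants of one size x size canvas
    assert len(imgs) == 4
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=imgs[0].dtype)
    coords = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(imgs, coords):
        canvas[y:y + half, x:x + half] = img[:half, :half]
    return canvas

# four flat-colored toy images, so each quadrant is recognizable
tiles = [np.full((100, 100, 3), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
out = mosaic(tiles, 200)
print(out.shape)  # (200, 200, 3)
```

Because each training sample now contains contexts from four images, batch statistics per image are richer, which is how mosaic reduces the need for large mini-batches.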

2. SAT (Self-Adversarial Training)

  • A data augmentation method carried out in two stages, one forward and one backward pass
  • In the first stage, the original image is perturbed with an adversarial attack that deceives the network into seeing no object in the image
  • In the second stage, the model is trained on the perturbed image

3. Minor Modifications

  • CmBN์€ CBN์„ ๋ณ€ํ˜•์‹œํ‚จ ๋ฒ„์ „์œผ๋กœ, Cross mini-Batch Normalization์„ ์˜๋ฏธํ•จ
  • ํ•˜๋‚˜์˜ batch์—์„œ mini-batch ์‚ฌ์ด์˜ batch statistics๋ฅผ ์ˆ˜์ง‘ํ•จ
  • ๊ธฐ์กด์˜ ๋ฐฐ์น˜์‚ฌ์ด์ฆˆ๋ฅผ ๋˜ ๋‚˜๋ˆ ์„œ ์ง„ํ–‰ํ•œ ๊ฒƒ์„ ์˜๋ฏธ
  • ๊ธฐ์กด ๋ฐฐ์น˜์‚ฌ์ด์ฆˆ์—์„œ mini batch ๋‹จ์œ„๋กœ ๋‚˜๋ˆˆ ํ›„, batch normalization์„ ์ถ”๊ฐ€๋กœ ๋˜ ์ˆ˜ํ–‰

4. Modified Existing Methods

  1. SAM
    • modified to operate point-wise instead of spatial-wise
  2. PAN
    • the shortcut addition of the original design is replaced with concatenation

 

3.4 YOLOv4

  • Backbone: CSPDarknet53 [81]
  • Neck: SPP [25], PAN [49]
  • Head: YOLOv3 [63]

 

  • Bag of Freebies (BoF) for backbone
    • CutMix and Mosaic data augmentation
    • Dropblock regularization
    • Class label smoothing
  • Bag of Specials (BoS) for backbone
    • Mish activation
    • Cross-stage partial connections(CSP)
    • Multi-input weighted residual connections(MiWRC)
  • Bag of Freebies (BoF) for detector
    • CIoU-loss
    • CmBN
    • DropBlock regularization
    • Mosaic data augmentation
    • Self-Adversarial Training
    • Eliminate grid sensitivity
    • Using multiple anchors for a single ground truth
    • Cosine annealing scheduler
    • Optimal hyperparameters
    • Random training shapes
  • Bag of Specials (BoS) for detector
    • Mish activation
    • SPP-block
    • SAM-block
    • PAN path-aggregation block
    • DIoU-NMS
4. Experiments
  1. The effect of label smoothing and data augmentation when training the classifier
  2. The effect of different features when training the detector
  3. The effect of different backbones and pretrained weights when training the detector
  4. The effect of mini-batch size when training the detector

 

๐Ÿ’ก
Reference (๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ์—†์—ˆ์œผ๋ฉด ํž˜๋“ค์—ˆ์„ ๊ฒƒ ๊ฐ™์•„์š” ๐Ÿฅฒ)
 
https://herbwood.tistory.com/24
https://ropiens.tistory.com/33
