YOLO: You Only Look Once: Unified, Real-Time Object Detection

by 제룽 2023. 7. 6.
1. Intro

What is Object Detection?

  • object classification: given a single image, decide what it shows (e.g. whether it is a dog or a cat)
  • object localization: decide where the dog is located within a single image → output: x,y,w,h
  • object detection: find each of the different objects within a single image ex) DPM, R-CNN
one-stage vs two-stage detector
  • one stage: performs localization + classification at the same time. ex) after the image passes through the conv layers, each grid cell produces a classification result and a bounding box regression, which together give the final result
  • two stage: performs localization → classification sequentially. ex) R-CNN: candidate boxes are first extracted through region proposal, and the final result is then obtained from classification and bounding box regression on those candidates (DPM is likewise a multi-step pipeline, although it uses sliding windows rather than region proposals)
  1. DPM (deformable parts models): a detection model that runs a classifier over the entire image in a sliding-window fashion
  2. R-CNN: generates candidate bounding boxes, then applies classification & regression → non-maximum suppression (duplicate removal)

⇒ These pipelines are slow and hard to optimize, because each component has to be trained separately.

※ YOLO: predicts the classes and locations of objects by looking at the image only once

2. Overall architecture

 

  • 24 conv layers + 2 FC layers
  1. An image is fed in as the input
  2. The features produced by the conv layers are used for everything that follows
  3. The output has 7x7 = 49 grid cells, and each cell predicts 2 bounding boxes
  4. Each box is described by 5 values (x,y,w,h,c)
  5. That makes 10 values for the 2 boxes of a cell
  6. The remaining 20 values are the conditional class probabilities for the 20 classes, so each cell holds 30 values and the full output is 7x7x30
  7. The confidence score of each box is then multiplied by the conditional class probabilities
  8. Since each of the 49 grid cells produces 2 bounding boxes, there are 98 bounding boxes in total
  9. In other words, we get 98 class-specific confidence score vectors, each with 20 entries (one per class)
  10. Non-maximum suppression (duplicate removal) is applied to these 98 score vectors class by class, which gives the final class and bounding box location of each object
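The last steps of the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the layout of the 30 values per cell (2 boxes of 5 values first, then the 20 class probabilities) and the name `class_specific_scores` are choices made here, and the "network output" is random numbers standing in for a real prediction.

```python
import random

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

def class_specific_scores(cell):
    """cell: 30 floats, assumed to be B * (x, y, w, h, c) followed by C class probabilities.
    Returns B lists of C scores: box confidence * conditional class probability."""
    class_probs = cell[B * 5:]
    scores = []
    for b in range(B):
        confidence = cell[b * 5 + 4]  # the c value of box b
        scores.append([confidence * p for p in class_probs])
    return scores

# Hypothetical 7x7x30 output: random values, not a real prediction.
random.seed(0)
output = [[[random.random() for _ in range(B * 5 + C)] for _ in range(S)]
          for _ in range(S)]

all_scores = [s for row in output for cell in row
              for s in class_specific_scores(cell)]
print(len(all_scores), len(all_scores[0]))  # 98 20
```

Each of the 98 entries is one box's score vector over the 20 classes, which is what non-maximum suppression then works on.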
3. Unified Detection
  • The feature map obtained from the image is used to predict the bboxes and to compute the probabilities of all classes (the paper describes these as happening simultaneously)
  • The paper uses a 7x7 grid
  • In the example here the image is divided into a 4x4 grid, and each grid cell predicts 2 bboxes
  • Each box carries 5 values: x,y,w,h,c
  • x,y: center of the box, expressed relative to the bounds of its grid cell / w,h: width and height of the box (strictly speaking, divided by the W,H of the original image so that they fall between 0 and 1) / c: confidence score, Pr(Object) x IOU between the predicted box and the ground truth
  • Conditional class probability Pr(Class_i | Object): the probability that the object in the grid cell belongs to the i-th class, given that an object is present (20 classes) ex) cat: 0.88, dog: 0.01, bird: 0.005
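To make the box parameterisation concrete, here is a small sketch of how a ground-truth box given in pixels could be mapped to a grid cell and to normalised values. The function name `encode_box` and the 448x448 example image are assumptions for illustration; the rule it follows is that x,y are offsets inside the cell and w,h are divided by the image size.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a box in pixels (center cx, cy; size w, h) to
    (row, col, x, y, w_norm, h_norm) for an S x S grid."""
    gx = cx / img_w * S          # center in grid units
    gy = cy / img_h * S
    col = min(int(gx), S - 1)    # cell that contains the center
    row = min(int(gy), S - 1)
    x = gx - col                 # offset inside the cell, in [0, 1]
    y = gy - row
    return row, col, x, y, w / img_w, h / img_h

# A 112x56 box centered at (224, 112) in a 448x448 image.
print(encode_box(224, 112, 112, 56, 448, 448))
# (1, 3, 0.5, 0.75, 0.25, 0.125)
```

The cell at row 1, column 3 is the one responsible for this object, because the center of the box falls inside it.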
4. Network Design - GoogLeNet
  • 24 conv + 2 fc layers in total, inspired by GoogLeNet
  • 20 conv layers: pretrained layers are used (pretrained on ImageNet classification)
  • 4 conv + 2 fc → fine-tuned for detection (PASCAL VOC)
  • 1x1 reduction layers (reduce the amount of computation)
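Why a 1x1 reduction layer saves computation can be checked with a quick weight count. The channel sizes below (512 → 256 → 512) are illustrative assumptions rather than a specific layer of the network, and biases are ignored.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

c_in, c_mid, c_out = 512, 256, 512  # assumed channel counts

# A 3x3 conv applied directly on 512 channels.
direct = conv_weights(c_in, c_out, 3)

# A 1x1 conv that first reduces to 256 channels, followed by the 3x3 conv.
reduced = conv_weights(c_in, c_mid, 1) + conv_weights(c_mid, c_out, 3)

print(direct, reduced)  # 2359296 1310720
```

With these numbers the reduced version needs a little over half the weights, and the multiply-adds per output position shrink by the same ratio.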
5. Training
  • Among the predicted bounding boxes, the one with the largest IOU with the ground-truth bounding box that encloses the actual object is selected
  • In YOLO, only the single bbox with the highest IOU is used during training

ex) yellow box + blue box → boxes obtained from the conv layers

→ of yellow and blue, pick the one that overlaps more with the ground-truth (red) box

→ here the yellow box is selected ⇒ 1 box chosen

  • Since the yellow box was selected, its indicator value is set to 1 (it is responsible for the object)
  • The blue box is set to 0 (no coordinate loss is propagated for it in the loss function)
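The selection in the example above can be sketched as follows. The coordinates of the red, yellow and blue boxes are made up for illustration, boxes are assumed to be given as corner coordinates (x1, y1, x2, y2), and `responsible_mask` is a hypothetical helper name.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def responsible_mask(pred_boxes, gt_box):
    """1 for the predicted box with the highest IOU, 0 for the others."""
    ious = [iou(p, gt_box) for p in pred_boxes]
    best = max(range(len(ious)), key=ious.__getitem__)
    return [1 if i == best else 0 for i in range(len(pred_boxes))]

ground_truth = (10, 10, 110, 110)   # red box (made-up coordinates)
yellow = (20, 15, 115, 105)
blue = (60, 60, 160, 160)

print(responsible_mask([yellow, blue], ground_truth))  # [1, 0]
```

Here the yellow box has an IOU of about 0.78 with the ground truth and the blue box about 0.14, so yellow gets the value 1.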
6. Loss Function(train)
  • Sum-squared error is used (the squared errors are summed rather than averaged, so it is not a mean squared error)

(1) For predictor bounding box j of grid cell i where an object exists, compute the loss on x and y

(2) For predictor bounding box j of grid cell i where an object exists, compute the loss on w and h (taken on their square roots, so that the same error matters more for a small box than for a large one)

(3) For predictor bounding box j of grid cell i where an object exists, compute the loss on the confidence score

(4) For bounding box j of grid cell i where no object exists, compute the loss on the confidence score

(5) For grid cell i where an object exists, compute the loss on the conditional class probabilities

  • λcoord : balancing parameter between the loss on the coordinates (x,y,w,h) and the other losses (5 in the paper)
  • λnoobj : balancing parameter between boxes that contain an object and boxes that do not (0.5 in the paper) → decides which loss is reflected more strongly (i.e. a weight)
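For reference, the five terms (1) to (5) can be written as one expression, following the notation of the YOLO paper. Hatted symbols are the targets, the indicator 1_{ij}^{obj} is 1 when predictor j of cell i is responsible for an object, 1_{ij}^{noobj} is its complement, and 1_{i}^{obj} is 1 when an object appears in cell i.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
      \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The five lines correspond to terms (1) to (5) in order.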

※ The coordinate error is learned only for the box chosen as the predictor, and the class error only when an object exists in the grid cell

  • If a grid cell contains no object, its confidence score target is 0. In practice most grid cells end up being trained toward confidence score = 0 → this makes the model unbalanced
  • To improve this, the weight of the loss on the bounding box coordinates of boxes that contain an object is increased, and the weight of the confidence loss of boxes that contain no object is decreased
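A minimal sketch of how these weights enter the loss of a single grid cell is given below. It assumes the weights 5 and 0.5 reported in the YOLO paper, non-negative w and h, and a hypothetical function name `cell_loss`; the comments (1) to (5) refer to the loss terms listed earlier. It is not the original training code.

```python
import math

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5  # weights reported in the YOLO paper

def cell_loss(pred_boxes, pred_probs, target_box, target_probs, responsible):
    """Sum-squared loss of one grid cell.
    pred_boxes: list of (x, y, w, h, c) with w, h >= 0.
    target_box: (x, y, w, h, c), or None if the cell contains no object.
    responsible: index of the predictor with the highest IOU (unused for empty cells)."""
    loss = 0.0
    for j, (x, y, w, h, c) in enumerate(pred_boxes):
        if target_box is not None and j == responsible:
            tx, ty, tw, th, tc = target_box
            loss += LAMBDA_COORD * ((x - tx) ** 2 + (y - ty) ** 2)        # (1)
            loss += LAMBDA_COORD * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                                    + (math.sqrt(h) - math.sqrt(th)) ** 2)  # (2)
            loss += (c - tc) ** 2                                         # (3)
        else:
            loss += LAMBDA_NOOBJ * c ** 2                                 # (4), target is 0
    if target_box is not None:
        loss += sum((p - t) ** 2 for p, t in zip(pred_probs, target_probs))  # (5)
    return loss

# Empty cell: only the down-weighted confidence term contributes.
empty = cell_loss([(0.5, 0.5, 0.2, 0.2, 0.4), (0.5, 0.5, 0.3, 0.3, 0.2)],
                  [0.5, 0.5], None, None, None)
print(round(empty, 4))  # 0.1
```

Without the 0.5 factor the empty cell above would contribute 0.2 instead of 0.1, which shows how λnoobj keeps the many empty cells from dominating the gradient.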
7. Inference Stage(test)

※ class-specific confidence score

  • The confidence score of a box is multiplied by the 20 class probabilities, and the resulting values are attached to the bbox
  • With 7x7 = 49 grid cells and 2 boxes per cell, 98 bboxes are produced in total (during training only the one responsible box per cell is used, i.e. up to 49)
  • bbox1: the values for bbox1 (x,y,w,h,c) plus its class-specific confidence scores together make up bbox1

⇒ This gives 98 boxes per image, which is far too many, so NMS is applied (at test time).

⇒ For each object, only the highest-scoring bbox is kept, and the other boxes that overlap it with a high IOU are removed as duplicates.
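The duplicate removal can be sketched as a greedy non-maximum suppression for one class. The IOU threshold of 0.5, the corner-coordinate box format (x1, y1, x2, y2) and the example boxes are assumptions for illustration; in practice the procedure is run once per class on the class-specific confidence scores.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: returns the indices of the boxes that are kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)           # highest remaining score
        keep.append(best)
        # drop every remaining box that overlaps the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(10, 10, 110, 110), (12, 14, 112, 108), (200, 200, 260, 260)]
scores = [0.9, 0.75, 0.6]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IOU of about 0.90 and is removed as a duplicate, while the third box covers a different region and survives.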

8. Comparison to other Systems (vs YOLO)
  1. DPM
  2. R-CNN
  3. Deep MultiBox
  4. OverFeat
  5. MultiGrasp
9. Experiments
  • YOLO shows a higher mAP than other real-time object detection systems
  • Fast YOLO is the fastest of the compared detectors
  • Compared to Fast R-CNN, YOLO produces far fewer false positives on background (low background error) ⇒ i.e. it less often claims that something is there when the background is empty
  • Combining Fast R-CNN with YOLO raised mAP by a further 3.2% ⇒ the combination worked even better!
  • When detecting objects in artwork, the other models degrade noticeably compared to YOLO
10. Outro
  • Limitations of YOLO
  1. Each grid cell can predict only one class, so several small objects packed closely together are not predicted properly.
  2. The shape of the bounding boxes is learned only from the training data, so bboxes with new or unusual aspect ratios are not predicted accurately. ex) if only 1:1, 1:2 and 2:1 boxes were seen during training, a 3:1 box at test time tends to be missed
  3. Bounding boxes are predicted from a feature map that has already passed through several layers, so localization is sometimes somewhat inaccurate. (R-CNN, in contrast, starts from region proposals)
  4. Detection performance on small objects is low. ex) when the object and its bbox are large, a slightly larger position error still leaves most of the box overlapping the object, so the IOU stays high; when the object is small the bbox is small too, and even a slight position error sharply reduces the overlap, so the IOU drops quickly
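The last limitation can be checked with a small calculation: the same 5-pixel shift is applied to a large and to a small square box, and the IOU with the unshifted box is compared. The box sizes and the shift are made-up numbers for illustration, and boxes are assumed to be given as (x1, y1, x2, y2).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def shifted_iou(size, shift):
    """IOU between a size x size box and the same box moved `shift` pixels to the right."""
    box = (0, 0, size, size)
    moved = (shift, 0, size + shift, size)
    return iou(box, moved)

print(round(shifted_iou(100, 5), 3))  # 0.905  (large box)
print(round(shifted_iou(10, 5), 3))   # 0.333  (small box)
```

The same localization error barely affects the 100x100 box but cuts the IOU of the 10x10 box to a third.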