
Faster R-CNN

by 제룽 2023. 7. 6.

0. R-CNN
  1. Apply the Selective Search algorithm to the input image to extract about 2000 bounding boxes (region proposals).
  2. Warp (resize) each extracted bounding box and feed it into a CNN.
  3. Use a fine-tuned, pre-trained CNN to extract a 4096-dimensional feature vector from each bounding box.
  4. Classify each extracted feature vector with an SVM.
  5. Apply bounding box regression to adjust the box positions.
  6. Run non-maximum suppression (a sketch of NMS follows this list).
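
To make step 6 concrete, here is a minimal greedy NMS sketch in Python/NumPy. It illustrates the standard algorithm rather than R-CNN's exact code; the [x1, y1, x2, y2] box format and the 0.5 IoU threshold are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidence values.
    Returns the indices of the boxes that are kept."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]                        # keep the current best box
        keep.append(i)
        # IoU of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep
```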

⇒ Problems with this model:

1) It is very slow.

2) Because the CNN input must be a fixed size, the warped images get distorted.

0. SPPNet
  • A model introduced to fix R-CNN's weaknesses ⇒ it reduces up to 2000 CNN forward passes to a single one ⇒ much faster.
  • ⇒ The input image is not resized ⇒ no distortion.
  1. Selective Search
    1. Run selective search on the input image.
    2. Select about 2000 candidate regions that are likely to contain objects.
  2. CNN
    1. Feed the whole input image into the CNN once (the repeated conv + pooling stages).
    2. At the end of the CNN, the last pooling layer is replaced with SPP pooling.
  3. SPP Pooling
    1. Take the 2000 regions extracted earlier (projected onto the feature map).
    2. Apply max pooling over 4x4, 2x2, and 1x1 grids.
    3. Concatenate the results into a single 1-D vector.
    4. This yields 21 bins in total, i.e. a fixed-size vector. (A sketch follows this list.)
  4. Feed this vector into the FC layers (the layers that hold the weights).
  5. Pass through the FC layers once more, then use an SVM to decide whether each vector contains an object (classification).
  6. Additionally, run a bounding box regressor to shift each box onto the object, then select the final bounding boxes with non-maximum suppression.
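
A minimal sketch of the spatial pyramid pooling idea above, assuming a PyTorch feature-map slice for one region; pooling at 4x4, 2x2 and 1x1 gives 16 + 4 + 1 = 21 bins per channel, so the output length is fixed regardless of the region size.

```python
import torch
import torch.nn.functional as F

def spp_pool(region_feat, levels=(4, 2, 1)):
    """region_feat: (C, H, W) slice of the conv feature map for one region proposal.
    Max-pools it on 4x4, 2x2 and 1x1 grids and concatenates the results,
    giving a fixed C * (16 + 4 + 1) = C * 21 dimensional vector."""
    pooled = []
    for k in levels:
        # adaptive_max_pool2d divides the region into a k x k grid and max-pools each cell
        p = F.adaptive_max_pool2d(region_feat.unsqueeze(0), output_size=(k, k))
        pooled.append(p.flatten(start_dim=1))   # (1, C*k*k)
    return torch.cat(pooled, dim=1)             # (1, C*21)

# Two regions of different sizes still give vectors of the same length.
feat_a = torch.randn(512, 13, 9)   # hypothetical cropped feature-map regions
feat_b = torch.randn(512, 6, 20)
print(spp_pool(feat_a).shape, spp_pool(feat_b).shape)  # both (1, 512*21)
```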

→ Problems with this model:

1) It still has a three-stage pipeline (selective search, CNN, SVM).

2) The 4x4, 2x2, 1x1 spatial bins increase the risk of overfitting.

3) It requires a lot of storage.

0. Fast R-CNN
  • Fast R-CNN was introduced to address these issues.

  1. Selective Search
    1. Run selective search on the input image.
    2. Select up to (e.g.) 2000 candidate regions likely to contain objects.
    3. Extract the RoI regions ⇒ not all 2000 regions are used; this is called hierarchical sampling. e.g. with 2 input images and a mini-batch of 128 RoIs, only 64 regions per image are kept as candidates.
    4. ⇒ Only the RoIs divided among the images of one mini-batch are used.
  2. CNN
    1. Feed the whole input image into the CNN once (the repeated conv + pooling stages).
    2. At the end of the CNN, the last pooling layer is replaced with RoI pooling.
  3. RoI Pooling ⇒ a simplified version of SPP pooling ⇒ uses max pooling to extract a 7x7 feature map per region and turn it into a fixed-size vector.
    It accepts the approximation described in the notes below and proceeds anyway (to save time). (A sketch of RoI pooling appears in section 3-3.)
    ** Note) Because the RoI candidate regions come in all sizes, a non-square n x m region may end up with sub-cells of slightly different sizes when it is divided into the 7x7 grid.
    ⇒ Where SPP used three pooling grids (4x4, 2x2, 1x1), RoI pooling uses only a single grid.
  4. FC Layers
    1. Pass through an FC layer once, then split into two branches.
    2. Each branch goes through further FC layers and performs 1) classification and 2) bounding box regression.
  5. Loss Function
    1. Fast R-CNN mixes the classification loss and the bbox regression loss into one combined (multi-task) loss. (A written-out form of this loss follows this list.)
    2. Backpropagation uses this combined loss. ** classification: the loss between the softmax probabilities and the ground-truth class.

      ⇒ Smooth L1 has the property of driving the loss toward 0 quickly (fast convergence).

    3. ** The localization loss compares the predicted x, y, w, h values with the ground truth and passes the difference through the smooth L1 function.
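
For reference, the combined loss described above can be written out; this is the multi-task loss from the Fast R-CNN paper, where u is the true class, p the softmax output, v the ground-truth box targets, t^u the predicted box offsets, and [u >= 1] is 1 only for non-background classes:

```latex
L(p, u, t^{u}, v) = L_{cls}(p, u) + \lambda\,[u \ge 1]\, L_{loc}(t^{u}, v),
\qquad
L_{loc}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^{u}_{i} - v_{i}),
\qquad
L_{cls}(p, u) = -\log p_{u}
```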

→ Problems with this model:

It performs better than R-CNN and SPPNet, but looking at the test-time breakdown, region proposal with selective search still takes up a large share of the time (it is slow).


1. Intro
  • That is why Faster R-CNN was introduced.
  • The idea is to learn region proposals on the GPU as well, so an RPN (Region Proposal Network) is added.
2. Overall Architecture
  1. Feed the image into a pre-trained VGG.
  2. Extract the feature map.
  3. Run a 3x3 conv with padding=1 over the extracted feature map (with 256 or 512 channels).
  4. Split into two branches and apply a 1x1 conv to each (think of it as channel compression).
  5. Run classification (this is where the anchor concept comes in).
    1. e.g. HxWx2k anchor scores are produced (2: object vs. not-object).
  6. Run bounding box regression (also anchor-based). (A shape sketch of these two branches appears at the end of this section.)
    1. HxWx4k outputs are produced (4: offsets for the x, y, w, h coordinates).

⇒ Apply non-maximum suppression (removes overlapping regions among the many candidates → saves time).

⇒ In total, about 2000 RoIs are used (at training time).

⇒ At test time, the candidates are reduced further to only the top N.

  7. Steps 5 and 6 narrow down the candidates.
  8. RoI pooling is applied to these remaining regions.
    1. The candidate regions are projected onto the feature map from VGG.
    2. Only the part of the feature map corresponding to each candidate RoI is max-pooled (over a 3x3 grid) into a fixed-size vector.
  9. Each vector is fed into the FC layers.
  10. Then classification (via softmax) and bounding box regression are run to produce the predictions.
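
As a sanity check on the shapes above, here is a minimal, hypothetical PyTorch sketch (random weights) of the RPN head on top of a 50x50x512 VGG feature map: a 3x3 conv with padding=1 keeps the spatial size, and the two 1x1 branches output 2k = 18 objectness channels and 4k = 36 box-offset channels for k = 9 anchors per position.

```python
import torch
import torch.nn as nn

k = 9                                    # anchors per feature-map position
feat = torch.randn(1, 512, 50, 50)       # stand-in for the VGG-16 conv feature map

rpn_conv = nn.Conv2d(512, 512, kernel_size=3, padding=1)   # 3x3, padding=1: size preserved
cls_head = nn.Conv2d(512, 2 * k, kernel_size=1)            # object / not-object per anchor
reg_head = nn.Conv2d(512, 4 * k, kernel_size=1)            # x, y, w, h offsets per anchor

h = torch.relu(rpn_conv(feat))
cls_out = cls_head(h)    # (1, 18, 50, 50)  -> H x W x 2k scores
reg_out = reg_head(h)    # (1, 36, 50, 50)  -> H x W x 4k offsets
print(cls_out.shape, reg_out.shape, 50 * 50 * k)   # 22500 anchors in total
```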
3. Training
3-1. pre-trained VGG-16
  • Feed an 800x800x3 original image into the pre-trained VGG-16 model to obtain a 50x50x512 feature map.
3-2-1. Anchor generation layer
  • Provides the targets used for training (the target values in the mini-batch loss):
    1. the object / not-object label for each anchor
    2. the ground truth bounding box ⇒ used as the regression label
    ※ Each mini-batch therefore contains, for every anchor the RPN predicts, a label saying whether that anchor actually contains an object and the ground truth bounding box used to regress that anchor.
Anchor?
  • Means "anchor", as in a ship's anchor.
  • An anchor keeps a ship in place and marks where the ship is.
  • Likewise, an anchor box acts as a reference for where an object is likely to be.
  • A sliding window is run over the feature map.

⇒ This helps the network detect objects of various shapes and sizes at each grid position.

  • The number of grid cells equals the original image size multiplied by the sub-sampling ratio (each cell gives one center point).
  • Each grid cell generates 9 anchor boxes.
  • So, in the example above, 50x50x9 anchor boxes are generated in total.
  1. Apply the sub-sampling ratio to the original image, then generate the anchor boxes.
  2. Since 800 → 50, the ratio is 1/16.
  3. That is, 800x(1/16) x 800x(1/16) x 9 ⇒ 50x50x9 candidate anchor boxes are generated. (A small generation sketch follows this list.)
  4. Among the sampled anchors, positive and negative anchors are balanced at (up to) a 1:1 ratio when computing the mini-batch loss.
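
A minimal sketch of the anchor generation described above, assuming the paper's default three scales (128, 256, 512) and three aspect ratios (1:1, 1:2, 2:1) plus the 1/16 sub-sampling ratio of the 800x800 example:

```python
import numpy as np

stride = 16                         # sub-sampling ratio 1/16: one anchor centre per 16 px
scales = [128, 256, 512]            # anchor box sizes (paper defaults)
ratios = [0.5, 1.0, 2.0]            # height/width aspect ratios

# 9 base anchor shapes (w, h) around a single centre, area preserved per scale
base = [(s * np.sqrt(1.0 / r), s * np.sqrt(r)) for s in scales for r in ratios]

# centres of the 50x50 grid cells on the 800x800 image
cx, cy = np.meshgrid(np.arange(50) * stride + stride // 2,
                     np.arange(50) * stride + stride // 2)
centres = np.stack([cx.ravel(), cy.ravel()], axis=1)        # (2500, 2)

anchors = np.array([[x - w / 2, y - h / 2, x + w / 2, y + h / 2]
                    for x, y in centres for w, h in base])  # [x1, y1, x2, y2]
print(anchors.shape)   # (22500, 4): 50 * 50 * 9 candidate anchor boxes
```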
3-2-2. RPN
  • Uses the feature map produced by VGG.
  • A 3x3 conv with zero padding (so the spatial size is preserved) sets the channel count to 256 or 512.
  • Then classification and regression are run to pick candidate regions.
  1. classification: 1) adjust the channel count with a 1x1 conv (information compression), 2) e.g. an 8x8x(2x9) output is produced ⇒ the channel count is 2x9 = 18, 3) so 8x8x18 candidate anchor scores are produced.
  2. bounding box regression: 1) adjust the channel count with a 1x1 conv (information compression), 2) an 8x8x(4x9) output is produced ⇒ the channel count is 4x9 = 36 (4 = the x, y, w, h offsets).

⇒ Remove overlapping regions with non-maximum suppression.

⇒ Here, anchors whose IoU with a ground truth box is 0.7 or higher are set as positive anchors and those below the threshold as negative anchors; NMS is applied, and boxes that cross the image boundary are also discarded → about 2000 regions remain.

※ Once trained this way, the RPN generates anchors that are likely to contain objects in the input image.

⇒ These candidate regions are then passed on to the RoI pooling stage of training.

3-3. ROI pooling
  1. Project the candidate anchor boxes onto the feature map produced by VGG.
  2. Run RoI pooling (divide each region into a 3x3 grid and max-pool each cell).
  3. This produces a fixed-size vector for every region. (A sketch follows this list.)
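
A minimal RoI pooling sketch for the step above, assuming a PyTorch feature map, the 1/16 sub-sampling ratio, and a pooling grid whose size is a parameter (this post describes a 3x3 grid here; Fast R-CNN itself used 7x7):

```python
import torch
import torch.nn.functional as F

def roi_pool(feat, rois, out_size=3, spatial_scale=1.0 / 16):
    """feat: (C, H, W) backbone feature map.
    rois: (N, 4) proposals as [x1, y1, x2, y2] in original-image coordinates.
    Each proposal is projected onto the feature map and max-pooled on an
    out_size x out_size grid, giving one fixed-size vector per proposal."""
    outputs = []
    for x1, y1, x2, y2 in (rois * spatial_scale).round().long():
        region = feat[:, y1:y2 + 1, x1:x2 + 1]                    # crop the RoI
        pooled = F.adaptive_max_pool2d(region.unsqueeze(0),       # grid max pooling
                                       output_size=(out_size, out_size))
        outputs.append(pooled.flatten(start_dim=1))               # (1, C*out_size^2)
    return torch.cat(outputs, dim=0)                              # (N, C*out_size^2)

feat = torch.randn(512, 50, 50)                       # 50x50x512 VGG feature map
rois = torch.tensor([[48., 64., 320., 400.],          # hypothetical proposals (image coords)
                     [100., 120., 700., 500.]])
print(roi_pool(feat, rois).shape)                     # (2, 512*9)
```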
3-4. Classification/Boundary Boxes Regression
  1. Split the fixed-size vectors into two branches and feed them into FC layers.
  2. Run classification through a softmax.
  3. Run bbox regression.
  4. Train the network.
3-5. Loss function(RPN)
  • The ground truth labels here come from the anchor boxes generated on the original image above (3-2-1).
  • i = the index of an anchor.
  • pi = the predicted probability (from the classification branch) that anchor i contains an object.
  • ti = the vector of box adjustment values from the bounding box regression branch.
  • pi*, ti* = the corresponding ground truth values.
  • Classification uses a log loss.
  • The regression loss uses the smooth L1 function. (The full RPN loss is written out below.)
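
Putting the terms above together, the RPN loss from the paper is:

```latex
L(\{p_i\}, \{t_i\}) =
\frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^{*})
+ \lambda \, \frac{1}{N_{reg}} \sum_i p_i^{*} \, L_{reg}(t_i, t_i^{*})
```

Here L_cls is the log loss over object / not-object, L_reg is the smooth L1 loss, and the p_i* factor means the regression term is only active for positive anchors.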

※ smooth L1 (Huber loss)

⇒ L1 loss: has a point where it is not differentiable (absolute error).

⇒ L2 loss: squares the error, so outliers get squared too ⇒ vulnerable to outliers (squared error).

Smooth L1 makes up for the weaknesses of both.
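
Concretely, the smooth L1 function used here is quadratic for small errors and linear for large ones, so it is differentiable everywhere and less sensitive to outliers than L2:

```latex
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^{2} & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```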

3-6. Sharing Features for RPN and Fast R-CNN
3-6-1. Alternating training
  • The training scheme used in the paper.
  1. Train the RPN using the pre-trained VGG.
  2. Feed the proposals from step 1 into a Fast R-CNN model and train again in the order image - VGG - Fast R-CNN. ⇒ In this case the VGG feature extractor ends up being trained twice in total (up to this point the VGG conv weights are not shared).
  3. Share the VGG feature map obtained from steps 1 and 2. ⇒ Using the shared VGG feature map, run the RPN once more, then feed the candidate regions into RoI pooling.
  4. Train on the candidate regions fed into RoI pooling and run Fast R-CNN once more.
3-6-2. Approximate joint training
  • RPN and Fast R-CNN are merged and trained together.
  • During back propagation, the RPN loss and the Fast R-CNN loss are added and propagated together. ⇒ However, the derivative with respect to the proposal box coordinates is ignored ⇒ which is presumably why it is called "approximate". (It is not entirely clear to me why this derivative is ignored, given that the box regression output of the RPN is used downstream.)
  • ⇒ Training time is reduced.
3-6-3. Non-approximate joint training
  • The RoI pooling layer in the Fast R-CNN part takes both the conv features and the bounding boxes as input ⇒ the box coordinates are included in the gradient as well ⇒ this requires an RoI pooling layer that is differentiable with respect to the box coordinates.
  • ⇒ In other words, the term that was ignored in approximate joint training is now included.
3-7. 4-Step Alternating Training
  1. Train the RPN using the pre-trained VGG → only the last two (RPN-specific) parts are kept.
  2. Feed the proposals from step 1 into a Fast R-CNN model and train again in the order image - VGG - Fast R-CNN. ⇒ Here the VGG feature extractor is trained twice in total (up to this point the VGG conv weights are not shared) → the last two parts are retrained.
  3. Share the VGG feature map obtained from steps 1 and 2. ⇒ Using the shared VGG feature map, run the RPN once more, then feed the candidate regions into RoI pooling.
  4. Train on the candidate regions fed into RoI pooling and run Fast R-CNN once more. (A pseudocode outline of these four steps follows this list.)
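
A compact, schematic outline of the four steps, assuming PyTorch modules; `train_rpn`, `train_fast_rcnn`, and the `proposals_from` argument are hypothetical placeholders for the two training loops, not a real API.

```python
import copy

def four_step_alternating_training(backbone, rpn_head, detector_head,
                                   train_rpn, train_fast_rcnn):
    # Step 1: train the RPN, fine-tuning the ImageNet-pretrained backbone as well.
    train_rpn(backbone, rpn_head)

    # Step 2: train Fast R-CNN on the step-1 proposals with a separate copy of
    # the pretrained backbone, so the conv features are not shared yet.
    det_backbone = copy.deepcopy(backbone)
    train_fast_rcnn(det_backbone, detector_head,
                    proposals_from=(backbone, rpn_head))

    # Step 3: adopt the detector's backbone as the shared one, freeze it, and
    # retrain only the RPN-specific layers on top of it.
    for p in det_backbone.parameters():
        p.requires_grad_(False)
    train_rpn(det_backbone, rpn_head)

    # Step 4: keep the shared backbone frozen and fine-tune only the Fast R-CNN
    # head using the proposals from the updated RPN.
    train_fast_rcnn(det_backbone, detector_head,
                    proposals_from=(det_backbone, rpn_head))
    return det_backbone, rpn_head, detector_head
```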
4. Test
4-1. pre-trained VGG-16
  • Feed an 800x800x3 original image into the pre-trained VGG-16 model to obtain a 50x50x512 feature map.
4-2. RPN
  • Uses the feature map produced by VGG.
  • A 3x3 conv with zero padding (so the spatial size is preserved) sets the channel count to 256 or 512.
  • Then classification and regression are run to pick candidate regions.
  1. classification: 1) adjust the channel count with a 1x1 conv (information compression), 2) e.g. an 8x8x(2x9) output is produced ⇒ the channel count is 2x9 = 18, 3) so 8x8x18 candidate anchor scores are produced.
  2. bounding box regression: 1) adjust the channel count with a 1x1 conv (information compression), 2) an 8x8x(4x9) output is produced ⇒ the channel count is 4x9 = 36 (4 = the x, y, w, h offsets).

⇒ Remove overlapping regions with non-maximum suppression.

⇒ Here, anchors with IoU of 0.7 or higher are set as positive anchors and those below the threshold as negative anchors, and NMS is applied. Boxes that cross the image boundary are clipped so that they lie entirely inside the image (think of it as adjusting the coordinate values; a small clipping sketch follows).
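
A tiny sketch of the clipping step, assuming boxes as [x1, y1, x2, y2] NumPy rows and an 800x800 image:

```python
import numpy as np

def clip_boxes(boxes, img_w=800, img_h=800):
    """Clamp box coordinates so every box lies inside the image boundary."""
    boxes = boxes.copy()
    boxes[:, [0, 2]] = np.clip(boxes[:, [0, 2]], 0, img_w - 1)   # x1, x2
    boxes[:, [1, 3]] = np.clip(boxes[:, [1, 3]], 0, img_h - 1)   # y1, y2
    return boxes

print(clip_boxes(np.array([[-20.0, 30.0, 850.0, 790.0]])))  # -> [[0., 30., 799., 790.]]
```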

Afterwards, the candidates are reduced to the top N regions, which are used as the RoI input (this step did not exist during training).

※ Once trained this way, the RPN generates anchors that are likely to contain objects in the input image.

⇒ These candidate regions are then passed on to RoI pooling.

4-3. ROI Pooling
  1. Project the candidate anchor boxes onto the feature map produced by VGG.
  2. Run RoI pooling (divide each region into a 3x3 grid and max-pool each cell).
  3. This produces a fixed-size vector for every region.
4-4. Classification/Boundary Boxes Regression
  1. Split the fixed-size vectors into two branches and feed them into FC layers.
  2. Run classification through a softmax.
  3. Run bbox regression.
  4. Output the final predictions.

5. Experiments
  1. R-CNN vs Fast R-CNN vs Faster R-CNN
  • Faster R-CNN unifies the whole pipeline into a single network.
  • In other words, it can be called end-to-end (in R-CNN, selective search runs on the CPU and only the later stages use the GPU). ⇒ The difference comes down to whether the GPU is used from the very beginning or not.
  • Performance improves considerably.
  2. Selective Search → RPN
  • The speed difference when selective search is replaced with the RPN.
  • With selective search, 1.5s of the total 1.8s is spent on proposals → slow!

Other experiments

  1. How the initially defined anchor boxes (their scales and aspect ratios) end up, on average, after training.
  • Performance comparison when different numbers of anchor scales and aspect ratios are used.
  • The differences turned out to be quite small (comparing 3 scales with 1 aspect ratio against 3 scales with 3 aspect ratios).
  • 3 scales and 3 aspect ratios were chosen for the sake of flexibility.
  2. Performance comparison for different values of the lambda in the loss function.
  • There was little difference ⇒ the influence of lambda is not large!
  • + In addition: what does lambda mean?

    ※ The role of lambda: a hyperparameter that balances the relative weight of the two losses.

    ex) lambda = 10 ⇒ the weights of Lcls and Lreg are set to 1:10.

    ※ In Faster R-CNN most anchor boxes correspond to background, so this rests on the idea that deciding whether an object is present matters more than the bounding box regression. Accordingly, setting lambda to 10 is what makes the two losses contribute roughly equally.

  3. Faster R-CNN vs Others
  • Faster R-CNN achieved a higher mAP with a shorter GPU time.
6. Outro
  • Region proposal is done with an RPN on the GPU instead of CPU-based selective search. ⇒ Both speed and accuracy improve.

7. Reference

 
