
DeepLab V2: Semantic Image Segmentation with Convolutional Nets, Atrous Convolution and Fully Connected CRFs

by 제룽 2023. 7. 5.
  • Deep-learning CNNs have proven effective on most image-processing problems. They worked quite well on classification and object detection, and when applied to segmentation they performed well there too.
  • However, existing networks are structured for classification, so a number of papers appeared on adapting them to segmentation. DeepLab is one of them.
  • In short: how can the limitations of CNNs be overcome?
1. Intro
  • Deep Convolutional Neural Networks (DCNNs) have driven strong performance across computer vision, including image classification and object detection.
  • This is because DCNNs are trained end-to-end and have built-in invariance.
  • ※ end-to-end: the whole system, from input to output, is built as a single module and optimized jointly.
  • ※ built-in invariance: e.g. CNNs, language models
    • It means the model is automatically invariant to certain changes in the input.
    • A CNN recognizes features in an image regardless of where they appear. Thanks to this built-in invariance, a CNN keeps working under local changes such as shifts and, to a degree, rotation and scale.
  • However, invariance is said to hurt dense prediction tasks such as semantic segmentation.
  • ※ Invariance helps in tasks like classification, but a dense prediction task such as semantic segmentation has to localize the transformed object precisely, so too much invariance makes an accurate segmentation hard to obtain.
  • e.g. in a face recognition system, the benefit of invariance is that a face is still recognized when it is tilted, poorly lit, turned in a different direction, or altered by a hat or glasses.
  • So applying DCNNs to semantic image segmentation poses three challenges:
  1. Reduced feature resolution
  2. Existence of objects at multiple scales
  3. Reduced localization accuracy due to DCNN invariance

 

  1. Reduced feature resolution
  • Caused by repeated max-pooling and downsampling (striding).
  • As the input passes through successive layers, the feature map becomes very small.
  • To address this, the downsampling in the last few max-pooling layers of the DCNN is removed, and the filters in the following layers are upsampled instead by inserting 'holes' (zeros) between their non-zero filter taps.
  • Convolution with these upsampled filters is called atrous convolution.
  • Atrous convolution enlarges the filter's field of view (receptive field) without increasing the number of parameters or the amount of computation.

 

  2. Existence of objects at multiple scales
  • Inspired by spatial pyramid pooling (parallel processing).
  • The earlier approach rescaled the original image to several sizes and aggregated the resulting feature maps → higher computational cost.
  • Instead, parallel atrous convolutional layers with different sampling rates are applied.
  • This captures objects and image context at multiple scales ⇒ called ASPP (Atrous Spatial Pyramid Pooling).

 

  3. Reduced localization accuracy due to DCNN invariance
  • For classification, robust features have to be extracted through many conv + pooling stages to gain spatial invariance (a classifier should not be affected by such changes), so the network focuses on global rather than detailed information.
  • Semantic segmentation, by contrast, needs dense pixel-level prediction. When a segmentation network is built on a classification network, the feature maps shrink and fine detail is lost (boundaries become ambiguous). A CRF is therefore introduced to recover fine details.
  • ※ CRF: the downsampling of the last two pooling layers is removed and atrous convolution is used instead; on top of that, a CRF is applied as a post-processing step to improve the accuracy of the pixel-level predictions.

 

→ A model that offers three things: speed, accuracy, and simplicity

2. Related Work
  • Earlier semantic segmentation systems reportedly combined hand-crafted features with flat classifiers such as boosting, random forests, or SVMs.
  • Deep learning first succeeded in image classification and then began to carry over to semantic segmentation.
  • → Since the task involves both segmentation and classification, the key question is how to combine the two.

→ DCNN-based approaches to semantic segmentation fall into three groups

  1. A cascade of bottom-up segmentation followed by DCNN-based region classification
  • Bottom-up here means producing several segments within an object and merging them.
  • The drawback is that an error made in the segmentation stage cannot be recovered later.

 

  2. Combining the DCNN features used for image labeling with an independently obtained segmentation
  • Image labeling is performed by the DCNN, and the result is coupled with a segmentation obtained separately.
  • Methods in this group use skip layers, region proposals, and similar techniques.
  • However, the segmentation algorithm is decoupled from the DCNN classification results, so it can end up ignoring them (the paper describes this as the risk of committing to "premature decisions").

 

  3. Using the DCNN to produce category-level pixel labels directly, dispensing with a separate segmentation step
  • Instead of solving a separate segmentation problem, the DCNN directly predicts, for every pixel in the image, the category that pixel belongs to.
  • One way to read this is that feature extraction and classification happen together (feature extraction and boundary detection at the same time?).
  • This paper follows this approach.
  • ※ category-level pixel label: indicates which class each pixel in the image belongs to, e.g. classifying each pixel as person, car, tree, and so on.
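As a small illustration of category-level pixel labels, the sketch below turns per-pixel class scores into a label map by taking the highest-scoring class at each pixel. The class list and the scores are made up for the example; they are not taken from the paper.

```python
def pixel_labels(scores):
    """scores[y][x] is a list of per-class scores; return the best class index per pixel."""
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in scores
    ]

CLASSES = ["person", "car", "tree"]   # hypothetical class list
scores = [                            # a 2x2 image with 3 class scores per pixel
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.2, 0.6], [0.5, 0.3, 0.2]],
]
label_map = pixel_labels(scores)
names = [[CLASSES[c] for c in row] for row in label_map]
```

Here `label_map` holds one class index per pixel, which is the kind of output the third group of methods produces directly.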
3. Methods
3.1 Atrous Convolution for Dense Feature Extraction and Field-of-View Enlargement
  • Conventional networks use pooling and striding so that each unit covers a wider area of the input, but this reduces spatial resolution, and restoring the shrunken feature map afterwards gives poor results.
  • This paper proposes using atrous convolution to extract information over a large area without downsampling.

※ (a) shows sparse feature extraction with r=1; (b) shows dense feature extraction with r=2.

※ In (b), r=2 inserts one zero between adjacent filter values, and because the stride stays at 1 the receptive field grows.

  • Atrous convolution uses dilated filters, so it can see a wide receptive field without the information loss that pooling causes. This addresses the lack of detail in conventional CNNs.
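The idea can be sketched in one dimension, where atrous convolution computes y[i] = sum over k of x[i + r*k] * w[k]. This is a minimal illustration with made-up input and filter values, evaluated only at positions where the filter fits entirely inside the input; it is not the paper's implementation.

```python
def atrous_conv1d(x, w, rate=1):
    """y[i] = sum_k x[i + rate*k] * w[k], computed only where the filter fits."""
    span = rate * (len(w) - 1)            # distance between the first and last tap
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]

x = [1, 2, 3, 4, 5, 6, 7, 8]              # toy input signal
w = [1, 0, -1]                            # the same 3 weights are used in both cases
dense = atrous_conv1d(x, w, rate=1)       # taps span 3 input samples
holes = atrous_conv1d(x, w, rate=2)       # taps span 5 input samples, still 3 weights
```

With rate=2 the filter looks at inputs two samples apart, so it covers a wider stretch of the signal while the number of weights and multiplications per output stays the same.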
  • Explanation of the algorithm on the path marked by the red line: suppose we want to double the spatial density of the feature responses of a VGG-16 or ResNet-101 network. First, find the last pooling or convolutional layer that reduces resolution.
  • Then set its stride to 1 so the feature map does not shrink, and replace every subsequent convolutional layer with an atrous convolutional layer with r = 2. Applying this all the way through the network to compute features at full resolution is too costly.
  • So bilinear interpolation is used to keep the cost down: atrous convolution increases the density of the computed feature maps by a factor of 4, and fast bilinear interpolation then upsamples by a further factor of 8 to recover the original image resolution.

→ Maintaining high resolution with atrous convolution plus bilinear interpolation is called the hybrid approach
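The arithmetic behind the hybrid approach can be sketched as follows. The five stride-2 stages are an assumed VGG-16-style layout, and `convert_to_atrous` and `output_stride` are hypothetical helper names used only for this illustration: each removed stride-s step multiplies the atrous rate of the layers after it by s.

```python
def convert_to_atrous(strides, keep_first):
    """Keep the first `keep_first` downsampling steps and set later strides to 1.

    Returns (new_strides, rates), where rates[i] is the atrous rate to use
    for the convolutions that follow stage i.
    """
    new_strides, rates = [], []
    rate, seen = 1, 0
    for s in strides:
        if s > 1:
            seen += 1
            if seen > keep_first:
                rate *= s      # a removed stride-s step dilates later filters by s
                s = 1
        new_strides.append(s)
        rates.append(rate)
    return new_strides, rates

def output_stride(strides):
    """Total downsampling factor between the input image and the feature map."""
    total = 1
    for s in strides:
        total *= s
    return total

vgg_like = [2, 2, 2, 2, 2]                       # assumed layout: five stride-2 stages
new_strides, rates = convert_to_atrous(vgg_like, keep_first=3)
density_gain = output_stride(vgg_like) // output_stride(new_strides)
bilinear_factor = output_stride(new_strides)     # remaining upsampling to full size
```

Under these assumptions the feature map goes from 1/32 to 1/8 of the input size (4 times denser), and the remaining factor of 8 is what bilinear interpolation has to cover.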


  • Atrous convolution lets us arbitrarily enlarge the field of view of filters at any DCNN layer. State-of-the-art DCNNs typically use small convolutional kernels such as 3×3 to keep computation and the number of parameters low.
  • An atrous convolution with rate r has r-1 zeros between consecutive filter values.
  • That is, it enlarges a k×k filter to an effective kernel size of ke = k + (k-1)(r-1) without increasing computation or the number of parameters. e.g. for a 3x3 filter with r=2, 3 + 2*1 = 5, so think of it as having a 5x5 receptive field.
  • This gives the benefit of a larger field of view without adding computation or parameters.
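The effective kernel size formula is easy to check numerically. The rates below are picked only for illustration.

```python
def effective_kernel_size(k, r):
    """Effective size of a k-tap filter with atrous rate r: k + (k - 1) * (r - 1)."""
    return k + (k - 1) * (r - 1)

# A 3-tap filter at a few example rates; the number of weights stays 3 throughout.
sizes = {r: effective_kernel_size(3, r) for r in (1, 2, 4, 12)}
```

For r=2 this reproduces the 5x5 example above, and the size grows linearly with the rate while the parameter count is unchanged.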
※ Two ways to implement atrous convolution
  1. Set a rate on the kernel (implicitly inserting holes into the filter), which is equivalent to sampling the input feature map sparsely (the method described above).
  2. Subsample the input feature map by a factor equal to the rate r, deinterlacing it into r^2 reduced-resolution maps, one for each of the r×r possible shifts. → Standard convolution is applied to these intermediate feature maps, and the results are reinterlaced back to the original image resolution.
    • ※ Interlacing is a scheme that improves picture quality on displays such as televisions without raising analog bandwidth, by drawing the odd and even lines alternately.
    • Roughly speaking, the input is subsampled in an alternating pattern.
    • → Reducing atrous convolution to regular convolution this way makes it possible to use off-the-shelf, highly optimized convolution routines.
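The equivalence of the two implementations can be sketched in one dimension, where deinterlacing yields r subsampled signals (r^2 maps in the 2-D case). The input and filter values are arbitrary, and the function names are hypothetical.

```python
def conv1d(x, w):
    """Standard 'valid' convolution (correlation form): y[i] = sum_k x[i + k] * w[k]."""
    return [sum(x[i + k] * w[k] for k in range(len(w)))
            for i in range(len(x) - len(w) + 1)]

def atrous_direct(x, w, rate):
    """Method 1: sample the input sparsely, y[i] = sum_k x[i + rate*k] * w[k]."""
    span = rate * (len(w) - 1)
    return [sum(x[i + rate * k] * w[k] for k in range(len(w)))
            for i in range(len(x) - span)]

def atrous_deinterlaced(x, w, rate):
    """Method 2: deinterlace, apply standard convolution, then reinterlace."""
    out = [None] * (len(x) - rate * (len(w) - 1))
    for shift in range(rate):
        sub = x[shift::rate]                  # one reduced-resolution signal per shift
        for j, value in enumerate(conv1d(sub, w)):
            out[shift + rate * j] = value     # write results back in interleaved order
    return out

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
w = [1, 2, -1]
same_r2 = atrous_direct(x, w, 2) == atrous_deinterlaced(x, w, 2)
same_r3 = atrous_direct(x, w, 3) == atrous_deinterlaced(x, w, 3)
```

Both flags come out true: the second method only ever calls the ordinary `conv1d`, which is why it can reuse an existing optimized convolution routine.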

 

3.2 Multiscale Image Representations using Atrous Spatial Pyramid Pooling
  • To capture features over a range of spatial extents efficiently, DeepLab V2 uses four atrous filters (with different rates) to compute each pixel of the feature map.
  • The outputs of the individual convolution branches are summed to produce the final result.
  • Performance improves over using a single atrous filter, at the cost of more computation.
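A toy one-dimensional version of ASPP, under stated assumptions: four parallel 3-tap atrous branches whose outputs are summed. The weights are placeholders rather than learned filters, and the rates (1, 2, 3, 4) are scaled down for a short signal; the paper's large ASPP setting uses rates 6, 12, 18, 24 on 2-D feature maps.

```python
def atrous_same(x, w, rate):
    """3-tap atrous filter with zero padding, so the output keeps the input length."""
    n = len(x)

    def at(i):
        return x[i] if 0 <= i < n else 0

    return [w[0] * at(i - rate) + w[1] * at(i) + w[2] * at(i + rate)
            for i in range(n)]

def aspp(x, filters, rates):
    """Run one atrous branch per rate in parallel and sum the branch outputs."""
    branches = [atrous_same(x, w, r) for w, r in zip(filters, rates)]
    return [sum(vals) for vals in zip(*branches)]

x = [0, 0, 0, 0, 1, 0, 0, 0, 0]          # a single impulse in the middle
filters = [[1, 1, 1]] * 4                 # placeholder weights, not learned ones
rates = (1, 2, 3, 4)                      # toy rates for a 9-sample signal
fused = aspp(x, filters, rates)
```

The impulse is picked up at a different distance by each branch, so the fused response covers the whole signal: every position sees the same input through several fields of view at once.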

 

3.3 Structured Prediction with Fully-Connected Conditional Random Fields for Accurate Boundary Recovery (CRF)
  • The pipeline reduces resolution, upsamples with bilinear interpolation, and then refines the result with a fully connected CRF for higher accuracy.

  • Adding more max-pooling layers helps classification, but it also raises invariance, so localization suffers: object boundaries are not delineated well (as explained above).
  • DeepLab uses a fully connected CRF to improve localization accuracy.
  • The fully connected CRF model ⇒ uses an energy function made up of a unary term and a pairwise term.
  • x → the label assignment for the pixels
  • First term (unary): θi(xi) = -log P(xi), where P(xi) is the label assignment probability at pixel i computed by the DCNN.
  • Second term (pairwise): defined over every pair of image pixels (a fully connected graph), in a form that still allows efficient inference.
  • μ(xi, xj) = 1 when xi≠xj and 0 otherwise, so, as in the Potts model, only pairs of pixels with different labels are penalized.
  • pi, pj: pixel positions
  • Ii, Ij: pixel color values (intensity)
  • First Gaussian kernel: depends on both position and color, encouraging pixels that are close together and similar in color to take the same label.
  • Second Gaussian kernel: depends only on pixel position; it accounts for spatial proximity when enforcing smoothness (removing noise).
  • σα, σβ, σγ: control the scale of the Gaussian kernels. → Because the pairwise term of the fully connected CRF is built from Gaussian kernels, probabilistic inference can be carried out efficiently.
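For reference, the energy function described by these terms can be written out as below. This follows the form given in the DeepLab paper, with p the pixel position, I the pixel color vector, and w1, w2 the weights of the two Gaussian kernels.

```latex
E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j),
\qquad \theta_i(x_i) = -\log P(x_i)

\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \left[
  w_1 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2}
                   -\frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2} \right)
  + w_2 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2} \right)
\right]
```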
※ A Gaussian kernel is a filter (kernel) generated from the Gaussian distribution. The Gaussian distribution, also known as the normal distribution, arises frequently in natural phenomena and has a symmetric bell shape about its center. Filters built from it are widely used in image and signal processing for smoothing, blurring, and denoising. The size and standard deviation of a Gaussian filter affect both its behavior and its computational cost: in general, a larger filter smooths more strongly but costs more to compute.
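The pairwise term of the CRF (a label-compatibility indicator times the sum of two Gaussian kernels) can be sketched for a single pair of pixels as follows. The function name and all numeric values in the example are illustrative assumptions, not settings taken from the paper's experiments.

```python
import math

def pairwise_potential(xi, xj, pi, pj, Ii, Ij, w1, w2, s_alpha, s_beta, s_gamma):
    """Penalty for giving pixels i and j the labels xi and xj."""
    if xi == xj:                       # mu(xi, xj) = 0: identical labels cost nothing
        return 0.0
    d_pos = sum((a - b) ** 2 for a, b in zip(pi, pj))   # squared position distance
    d_col = sum((a - b) ** 2 for a, b in zip(Ii, Ij))   # squared color distance
    appearance = w1 * math.exp(-d_pos / (2 * s_alpha ** 2) - d_col / (2 * s_beta ** 2))
    smoothness = w2 * math.exp(-d_pos / (2 * s_gamma ** 2))
    return appearance + smoothness

params = dict(w1=4, w2=3, s_alpha=50, s_beta=5, s_gamma=3)   # example values only
near_similar = pairwise_potential(0, 1, (10, 10), (11, 10), (200, 30, 30), (198, 32, 30), **params)
far_different = pairwise_potential(0, 1, (10, 10), (60, 80), (200, 30, 30), (20, 180, 90), **params)
same_label = pairwise_potential(1, 1, (10, 10), (11, 10), (200, 30, 30), (198, 32, 30), **params)
```

Two nearby pixels with similar colors pay a large penalty for carrying different labels, while distant, differently colored pixels pay almost none, which is how the CRF sharpens boundaries along color edges.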

 

4. Experimental Results
  • DeepLab fine-tunes ImageNet-pretrained VGG-16 / ResNet-101 networks for semantic segmentation.
  • Loss function: cross-entropy loss
  • All labels are weighted equally (unlabeled pixels are excluded).
  • Optimization: standard stochastic gradient descent (SGD)
  • Evaluated on: PASCAL VOC 2012, PASCAL-Context, PASCAL-Person-Part, Cityscapes
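A minimal sketch of the loss described above: cross-entropy terms are summed over positions with equal weight, and unlabeled positions are skipped. The marker value 255 for unlabeled pixels is an assumption (a common convention), and the probabilities are made up.

```python
import math

IGNORE = 255   # assumed marker for unlabeled pixels; not specified in the notes above

def segmentation_loss(probs, labels, ignore=IGNORE):
    """Sum of per-position cross-entropy terms, skipping unlabeled positions."""
    total = 0.0
    for p, y in zip(probs, labels):
        if y == ignore:
            continue                   # unlabeled pixels contribute nothing
        total += -math.log(p[y])       # every labeled position has weight 1
    return total

probs = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]   # predicted class probabilities per position
labels = [0, 1, IGNORE]                           # the third position is unlabeled
loss = segmentation_loss(probs, labels)
```

Only the first two positions contribute here, so the loss equals -log(0.5) - log(0.75).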
4.1 PASCAL VOC 2012
  • The PASCAL VOC 2012 dataset has 20 object classes and one background class.
  • It contains 1,464 training, 1,449 validation, and 1,456 test images with pixel-level labels.
  • With the extra annotations used to augment the dataset, there are 10,582 training images.
  • Performance is measured by IoU averaged over the 21 classes.
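The metric can be sketched on flattened label arrays as below. Skipping classes that appear in neither prediction nor ground truth, and using 255 as the unlabeled marker, are assumptions of this sketch rather than details of the official benchmark code.

```python
def mean_iou(pred, truth, num_classes, ignore=255):
    """Mean over classes of intersection / union, computed on flat label lists."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, truth):
        if t == ignore:
            continue                   # unlabeled pixels are not scored
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1              # false positive for the predicted class
            union[t] += 1              # false negative for the true class
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, truth, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, giving a mean IoU of 0.5.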

  • The fc6 layer of VGG-16 is converted to an atrous convolution, using the structure shown below.
  • The VGG-16 model pretrained on ImageNet is used.

  • batch size : 20
  • lr : 0.001 (multiplying the learning rate by 0.1 every 2000 iterations)
  • momentum : 0.9
  • weight decay : 0.0005
  • w2 and σγ are fixed to 3; the best values of w1, σα, and σβ are searched for on validation images.
  • On the test set, DeepLab-LargeFOV reaches 70.4 mean IOU.
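The learning-rate setting listed above (0.001, multiplied by 0.1 every 2000 iterations) corresponds to a simple step schedule. A minimal sketch, with `step_lr` as a hypothetical helper name:

```python
def step_lr(iteration, base_lr=0.001, gamma=0.1, step=2000):
    """Learning rate after multiplying by `gamma` once every `step` iterations."""
    return base_lr * gamma ** (iteration // step)

schedule = [step_lr(it) for it in (0, 1999, 2000, 4500)]
```

The rate stays at 0.001 for the first 2000 iterations, drops to about 1e-4 for the next 2000, and so on.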
4.2 PASCAL Context
  • PASCAL-Context provides semantic labels for both objects and background. It has one background category and 59 classes; the best model described above was applied here too and achieved a state-of-the-art result.
4.3 PASCAL-Person-Part
  • PASCAL-Person-Part divides the human body into head, torso, upper/lower arms, and upper/lower legs, plus one background class. DeepLab achieved the best result here as well.
4.4 Cityscapes
  • A dataset with 19 semantic labels for urban scenes (grouped into 7 super categories: ground, construction, object, nature, sky, human, vehicle). DeepLab again achieved the best performance.
4.5 Failure Modes
  • However, DeepLab has limitations too. For thin, hard-to-segment objects such as bicycles and chairs (see below), it failed to capture the object region properly even with the CRF applied.

 

 
