๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
Deep Learning/[๋…ผ๋ฌธ] Paper Review

DINO: Emerging Properties in Self-Supervised Vision Transformers (2021)

by ์ œ๋ฃฝ 2023. 8. 10.

 

<๊ธฐ๋ณธ ์šฉ์–ด>

Self-Supervised Learning

https://brunch.co.kr/@b047a588c11b462/45

: ๋น„์ง€๋„ ํ•™์Šต ๋ฐฉ์‹์˜ ์ผ์ข…์œผ๋กœ์„œ ๋ผ๋ฒจ๋ง๋˜์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ์…‹์„ ํ™œ์šฉํ•˜์—ฌ ์ธ๊ณต์ง€๋Šฅ์ด ์Šค์Šค๋กœ ๋ถ„๋ฅ˜์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋„๋ก ํ•จ

: ์Šค์Šค๋กœ ํƒœ์Šคํฌ๋ฅผ ์„ค์ •ํ•˜์—ฌ ๋ชจ๋ธ์„ ํ•™์Šตํ•œ๋‹ค๋Š” ์ ์—์„œ ๊ธฐ์กด์˜ ๋น„์ง€๋„ ํ•™์Šต ๋ฐฉ์‹๊ณผ ์ฐจ์ด๊ฐ€ ์กด์žฌํ•˜๋ฉฐ, ์ธํ„ฐ๋„ท์ƒ ํฌ๋กค๋ง์„ ํ†ตํ•ด ์ˆ˜์ง‘ํ•  ์ˆ˜ ์žˆ๋Š” ํ…์ŠคํŠธ, ์ด๋ฏธ์ง€, ๋น„๋””์˜ค ๋“ฑ ๋‹ค์–‘ํ•œ ์ข…๋ฅ˜์˜ ๋ฐ์ดํ„ฐ์…‹์„ ํ™œ์šฉํ•  ์ˆ˜๋„ ์žˆ์Œ

: ๋ชจ๋ธ์ด ํ™•์žฅ๋˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋Œ€๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ•„์š”๋กœ ํ•˜์ง€๋งŒ, ๋ผ๋ฒจ๋ง๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์ง€์†์ ์œผ๋กœ ํ™•๋ณดํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋งŽ์€ ๋น„์šฉ์ด ์š”๊ตฌ๋œ๋‹ค๋Š” ๋‹จ์ ์ด ์กด์žฌ

: ์ž๊ธฐ ์ง€๋„ ํ•™์Šต์€ ๋ผ๋ฒจ๋ง๋˜์ง€ ์•Š์€ ํ•™์Šต ๋ฐ์ดํ„ฐ๋งŒ ํ™•๋ณดํ•˜๋”๋ผ๋„ ๋ชจ๋ธ์˜ ๊ทœ๋ชจ๋ฅผ ์ฆ๊ฐ€์‹œํ‚ฌ ์ˆ˜ ์žˆ์œผ๋ฉฐ ์ด์— ๋”ฐ๋ผ ์ •ํ™•๋„ ์—ญ์‹œ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค๋Š” ์žฅ์ ์ด ์กด์žฌ

 

<์ง€๋„ํ•™์Šต์˜ ๋‘ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•>

1) Self Prediction

: ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ ์ƒ˜ํ”Œ ๋‚ด์—์„œ, ๋ฐ์ดํ„ฐ์˜ ์ผ๋ถ€๋ฅผ ํ™œ์šฉํ•ด ๋‚˜๋จธ์ง€ ๋ถ€๋ถ„์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ๋ฒ•

2) Contrastive Learning

: ๋ฐ์ดํ„ฐ ์ƒ˜ํ”Œ๋“ค ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ํƒœ์Šคํฌ๋ฅผ ์ˆ˜ํ–‰ํ•˜๋ฉฐ, ์œ ์‚ฌํ•œ ์ƒ˜ํ”Œ๋“ค ๊ฐ„ ๊ฑฐ๋ฆฌ๋ฅผ ๊ฐ€๊น๊ฒŒ ํ•˜๊ณ  ์œ ์‚ฌํ•˜์ง€ ์•Š์€ ์ƒ˜ํ”Œ๋“ค ๊ฐ„ ๊ฑฐ๋ฆฌ๋Š” ๋ฉ€๊ฒŒ ํ•˜๋Š” ๊ฒƒ.

: ์œ ์‚ฌ ์—ฌ๋ถ€์˜ ๊ธฐ์ค€์ด ๋˜๋Š” ๋ฐ์ดํ„ฐ์…‹์„ anchor๋ผ๊ณ  ํ•จ

: anchor(ํšŒ์ƒ‰)์™€ ์œ ์‚ฌํ•œ ์ƒ˜ํ”Œ์„ positive point(์‹œ๋ฐ”)๋กœ, anchor์™€ positive pair๋ฅผ ์ด๋ฃธ.

: ๋ฐ˜๋Œ€๋กœ anchor์™€ ์œ ์‚ฌํ•˜์ง€ ์•Š์€ ์ƒ˜ํ”Œ์€ negative sample(ํ˜ธ๋žญ์ด)๋กœ์จ anchor๊ณผ negative pair๋ฅผ ์ด๋ฃธ

: The goal of contrastive learning is to extract the information shared across different views

: e.g., given a cat image and a noised version of it, only the cat content common to both images is treated as the learning target, while the background and the noise are ignored during training


: contrastive learning์˜ ์„ฑ๋Šฅ์—๋Š” positive sample๊ณผ negative sample์˜ ์„ ์ •๋ฐฉ์‹์ด ํฐ ์˜ํ–ฅ์„ ๋ฏธ์นจ.

: Positive pair๋Š” augmentation ๊ธฐ๋ฒ•๋“ค์„ ํ™œ์šฉํ•˜์—ฌ ์›๋ณธ์„ ๋ณ€ํ˜•์‹œํ‚ค๊ฑฐ๋‚˜, ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์ƒ์ดํ•œ ๊ด€์ ์„ ์ทจํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์„ ์ •๋จ.

: ํ•œ ๋ฐ์ดํ„ฐ์…‹ ๋‚ด์—์„œ anchor์ด ์•„๋‹Œ ์ƒ˜ํ”Œ๋“ค์€ negative pair๋กœ ์—ฌ๊ฒจ์ง€๋ฉฐ negative sample ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์„์ˆ˜๋ก ํšจ๊ณผ์ ์œผ๋กœ representation collapse๋ฅผ ๋ฐฉ์ง€ํ•  ์ˆ˜ ์žˆ์Œ.

("negative sample" ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์„์ˆ˜๋ก, ๋ชจ๋ธ์€ ๋‹ค์–‘ํ•œ ํด๋ž˜์Šค ๊ฐ„์˜ ์ฐจ์ด๋ฅผ ์ดํ•ดํ•˜๊ณ  ๊ตฌ๋ถ„ํ•˜๋Š” ๋Šฅ๋ ฅ์ด ํ–ฅ์ƒ)
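To make the contrastive idea concrete, here is a minimal NumPy sketch of an InfoNCE-style loss over one anchor, one positive, and several negatives. This is illustrative only, not the code of any particular paper; all names, dimensions, and the temperature value are made up:

```python
import numpy as np

def unit(v):
    """L2-normalize a vector."""
    return v / np.linalg.norm(v)

def info_nce_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss for one anchor: pull the positive
    close, push the negatives away. Inputs are unit feature vectors."""
    sims = np.array([anchor @ positive] + [anchor @ n for n in negatives])
    logits = sims / tau
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                  # the positive is "class 0"

rng = np.random.default_rng(0)
a = unit(rng.normal(size=8))                  # anchor view
pos = unit(a + 0.1 * rng.normal(size=8))      # augmented view of the same image
negs = [unit(rng.normal(size=8)) for _ in range(16)]  # other images
loss = info_nce_loss(a, pos, negs)
```

Note how the loss is just a softmax cross-entropy where the positive must win against every negative, which is why a larger pool of negatives gives a stronger training signal.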

 

โžก๏ธ ํ•˜์ง€๋งŒ negative sample์„ ์‚ฌ์šฉํ•  ๋•Œ, ๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ ํฌ๊ธฐ๋‚˜ ์ฆ๊ฐ•๊ธฐ๋ฒ• ์„ ํƒ ๋“ฑ์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ ํŽธ์ฐจ๊ฐ€ ํฌ๊ฒŒ ๋‚˜ํƒ€๋‚˜๊ธฐ ๋•Œ๋ฌธ์— ํ•™์Šต ์‹œ ๊ณ ๋ คํ•ด์•ผํ•  ์ ์ด ๋งŽ๋‹ค๋Š” ๋ฌธ์ œ์ ์ด ์กด์žฌ

BYOL: Bootstrap Your Own Latent

: ๊ธฐ์กด negative sample์„ ์‚ฌ์šฉํ•  ๋•Œ, ๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ ๊ณ ๋ ค๋‚˜, ์ฆ๊ฐ• ๊ธฐ๋ฒ•์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ ํŽธ์ฐจ๊ฐ€ ํฌ๊ธฐ ๋•Œ๋ฌธ์— ๊ณจ์นซ๋ฉ์–ด๋ฆฌ์—ˆ์Œ.

: ๋”ฐ๋ผ์„œ, BYOL์€ positive sample ๋งŒ์„ ์‚ฌ์šฉํ–ˆ๋‹ค๊ณ  ํ•จ ⇒ ํ•œ ์ด๋ฏธ์ง€์—์„œ augmentation ํ•œ ๋‘ ๊ฐœ์˜ ์ด๋ฏธ์ง€๋ฅผ input์œผ๋กœ ํ™œ์šฉ(ํšŒ์ƒ‰์ด anchor(target), ์ƒ‰์กฐ ์ด๋ฏธ์ง€๊ฐ€ positive sample(online))

: BYOLO์˜ ๊ตฌ์กฐ๋Š” Online network์™€ Target network๋กœ ๊ตฌ์„ฑ

: Online network๋กœ predictor๊นŒ์ง€ ์˜ˆ์ธกํ•ด์„œ anchor(target network)์˜ predict์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์ž„

: ์ด ๋•Œ, L2 loss๋ฅผ ํ†ตํ•ด์„œ online network gradient๋ฅผ ์—…๋ฐ์ดํŠธ ํ•ด์ฃผ๊ณ , ์—…๋ฐ์ดํŠธํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’๊ณผ, ๊ธฐ์กด target network์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’์„ ํ™œ์šฉํ•ด์„œ ์ด๋™ํ‰๊ท  ํ•ด์„œ target ์„ ์—…๋ฐ์ดํŠธ ํ•จ

โ€ป L2 loss๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ ๋Š” negative sample์ด ์—†๊ธฐ ๋•Œ๋ฌธ

BYOL์˜ momentum encoder ์ด๋™ ํ‰๊ท 

โ€ป ์—ฌ๊ธฐ์„œ momentum encoder๋Š” online network์—์„œ ๋‚˜์˜จ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’๊ณผ target network ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์ด๋™ํ‰๊ท ํ•ด์„œ target๊ฐ’์„ ์—…๋ฐ์ดํŠธ ํ•ด์ฃผ๋Š” ๋ฐฉ์‹์„ ์˜๋ฏธํ•˜๊ณ , ์—…๋ฐ์ดํŠธ ๊ทœ์น™์ด ๊ธฐ์กด momentum์ฒ˜๋Ÿผ ์ฒ˜์Œ์—๋Š” 0.99๋กœ ๊ฐ€์ค‘์น˜๋ฅผ ์คฌ๋‹ค๊ฐ€ ์ ์  ์ค„์ด๋Š” ๋ฐฉ์‹์œผ๋กœ ์—…๋ฐ์ดํŠธํ•ด์„œ ๋ถ™์—ฌ์ง„ ๋ง ๊ฐ™์Œ.

์ถ”๊ฐ€ ๊ธฐ๋ณธ ์šฉ์–ด
ViT

: ViT (Vision Transformer) is a deep learning model that applies the Transformer architecture to image processing: the image is split into fixed-size patches and features are extracted with the Transformer's self-attention mechanism

Self Distillation

: ๋ ˆ์ด๋ธ”์ด ์—†๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํ•™์Šต๋œ ๋ชจ๋ธ์ธ ํ•™์ƒ ๋ชจ๋ธ๊ณผ, ํ•ด๋‹น ๋ชจ๋ธ์˜ ์ง€์‹์„ ์ „๋‹ฌํ•ด์ฃผ๋Š” ๋ชจ๋ธ์ธ ์„ ์ƒ๋‹˜ ๋ชจ๋ธ ์‚ฌ์ด์˜ ์œ ์‚ฌ์„ฑ์„ ์ตœ์ ํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ, ์ง€์‹ ์ „๋‹ฌ์„ ํ†ตํ•ด ํ•™์ƒ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ธฐ๋ฒ•

 


 

๐ŸŒป DINO์š”์•ฝ

: Self-distillation with no labels

: Apply SSL (self-supervised learning) to ViT (Vision Transformer)!

: DINO์˜ ๊ฒฝ์šฐ, ๊ฐ์ฒด์—๋งŒ attention map์ด ํ™œ์„ฑํ™” ๋˜์–ด์žˆ์Œ

⇒ ๋ฐฐ๊ฒฝ ์ •๋ณด์—๋Š” ๋œ ์˜์กดํ•œ๋‹ค ๋ผ๊ณ  ๋งํ•  ์ˆ˜ ์žˆ์Œ ↔๏ธ classification์˜ ๊ฒฝ์šฐ, ๋ฐฐ๊ฒฝ ์ •๋ณด๋„ ํ™œ์šฉ

<์ฃผ์š” ํŠน์ง•>

(1) Cross-entropy loss

(2) multi-crop

(3) mean teacher

(4) centering, sharpening

 


 

0. ABSTRACT

: ์ด ๋…ผ๋ฌธ์€ Vision Transformer(ViT)์— ๋Œ€ํ•ด Self-Supervised learning)์ด conv์™€ ๋น„๊ตํ•ด์„œ ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ์ œ๊ณตํ•˜๋Š”์ง€์— ๋Œ€ํ•ด ์˜๋ฌธ์„ ์ œ๊ธฐ

ํŠน์ง•

1) DINO๋Š” image์˜ sementic segmentaion์— ๋Œ€ํ•œ ๋ช…์‹œ์ ์ธ ์ •๋ณด๋ฅผ ํฌํ•จํ•จ

2) ์šฐ์ˆ˜ํ•œ K-NN ๋ถ„๋ฅ˜๊ธฐ

3) Momentum encoder, multi-crop, ์ž‘์€ ํŒจ์น˜ ์‚ฌ์šฉ์˜ ์ค‘์š”์„ฑ ๊ฐ•์กฐ

 


 

1. INTRODUCTION

: Transformer๋Š” ์ตœ๊ทผ์— ์ปจ๋ณผ๋ฃจ์…˜ ์‹ ๊ฒฝ๋ง(convnets) ๋Œ€์•ˆ์œผ๋กœ ๋“ฑ์žฅ

: ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋ชจ๋ธ์„ ์ง€๋„ํ•™์Šต์˜ ์‚ฌ์šฉ์œผ๋กœ ์„ค๋ช…๋  ์ˆ˜ ์žˆ๋Š”์ง€์— ๋Œ€ํ•ด ์˜๋ฌธ์„ ์ œ๊ธฐํ•จ

: ๊ธฐ์กด NLP์—์„œ Transformer์˜ ์„ฑ๊ณต์˜ ์ฃผ์š” ๊ตฌ์„ฑ ์ค‘ ํ•˜๋‚˜๊ฐ€ BERT ๋ฐ GPT์˜ ์ž๊ธฐ์ง€๋„ ํ•™์Šต์˜ ์‚ฌ์šฉ์ด์—ˆ์Œ

: ์ด๋ฅผ ๋™๊ธฐ ์‚ผ์•„ ViT์— SSL์„ ์ ์šฉํ•จ

 


 

2. RELATED WORK
2-1) Self-supervised learning.

: Instance classification is a kind of self-supervised learning that treats each image as its own class and trains the model to discriminate between them

: Data augmentation is mainly used to help tell the images apart

↔️ However, this approach has the problem that it scales poorly as the number of images grows

 

: ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ตœ๊ทผ ์—ฐ๊ตฌ๋“ค์€ instance๋ฅผ ๊ตฌ๋ถ„ํ•˜๋Š” ๊ฒƒ ๋Œ€์‹  ์ด๋ฏธ์ง€ ๊ฐ„์˜ ์œ ์‚ฌ์„ฑ์„ ํ•™์Šตํ•ด์„œ ๋น„์ง€๋„ ํŠน์ง•์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์คŒ (์ด๋ฏธ์ง€๋“ค์˜ ํŠน์„ฑ์„ ์ถ”์ถœํ•˜๊ณ  ์ด ํŠน์„ฑ๋“ค์„ ์„œ๋กœ ๋งค์นญํ•ด์„œ ํ•™์Šต ⇒ ๋” ๋งŽ์€ ์ด๋ฏธ์ง€๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š”๋ฐ ์œ ๋ฆฌํ•จ)

: ๊ทธ ์ค‘์—์„œ๋„ BYOL์ด๋ผ๋Š” ๋ฐฉ๋ฒ•์€ momentum encoder๋ผ๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•ด์„œ ์ด๋ฏธ์ง€๋“ค์˜ ํŠน์„ฑ์„ ๋งค์นญํ•˜๊ณ , ํ•™์Šตํ•จ

: BYOL์€ ์ž๊ธฐ ์ง€๋„ ํ•™์Šต์˜ ๊ฐ€๋Šฅ์„ฑ์„ ๋ณด์—ฌ์คŒ + ๋ ˆ์ด๋ธ”์ด ์—†๋Š” ์ƒํƒœ์—์„œ๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑ

โžก๏ธ ๋”ฐ๋ผ์„œ, BYOL์„ ์˜๊ฐ์œผ๋กœ + ๊ต์‚ฌ/ํ•™์ƒ architecture์„ ์‚ฌ์šฉํ•จ

 

2-2) Self-training and knowledge distillation.

: ์ž‘์€ ๋„คํŠธ์›Œํฌ๋ฅผ ํ›ˆ๋ จํ•˜์—ฌ ํฐ ๋„คํŠธ์›Œํฌ์˜ ์ถœ๋ ฅ์„ ๋ชจ๋ฐฉํ•˜์—ฌ ๋ชจ๋ธ์„ ์••์ถ•ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ

 


 

3. APPROACH

3-1) SSL with Knowledge Distillation

: Knowledge distillation

➡️ A training paradigm in which a student network gθs is trained to match the output of a teacher network gθt

(In standard distillation, gθt is a pretrained, fixed network; the student gθs is the network whose weights are updated so that its output matches gθt's)

: θs and θt are the parameters (the model weights) of the student and teacher networks, respectively

: Given an image x, both networks output K-dimensional probability distributions Ps and Pt

: The distribution P is obtained by normalizing the output of the network g with a softmax: Ps(x)^(i) = exp(gθs(x)^(i) / τs) / Σ_k exp(gθs(x)^(k) / τs)

: τs &gt; 0 and τt &gt; 0 are temperature parameters that control how peaked the output distributions are (explained later)

: Given a fixed teacher network gθt, the student parameters θs are learned by minimizing the cross-entropy between the two distributions: min_θs H(Pt(x), Ps(x))

๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์†์‹ค

โ€ป H(a,b)=−alogb

: Ps(a)์™€ Pt(a)๋Š” ๊ฐ๊ฐ student์™€ teacher ๋„คํŠธ์›Œํฌ๊ฐ€ ์ถœ๋ ฅํ•œ ๋™์ผํ•œ ํด๋ž˜์Šค a์— ๋Œ€ํ•œ ํ™•๋ฅ 

: student ๋„คํŠธ์›Œํฌ๋Š” teacher ๋„คํŠธ์›Œํฌ์˜ ์ถœ๋ ฅ ๋ถ„ํฌ์™€ ์œ ์‚ฌํ•œ ์ถœ๋ ฅ ๋ถ„ํฌ๋ฅผ ์ƒ์„ฑํ•˜๋„๋ก ์œ ๋„๋จ( ํ•™์ƒ์ด ์„ ์ƒ๋‹˜์ฒ˜๋Ÿผ ๋˜๋„๋ก ํ•™์Šต๋จ )

 

๐Ÿ‘ป self-supervised learning์— ์ ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•

: multi-crop์„ ์‚ฌ์šฉํ•ด์„œ ์ฃผ์–ด์ง„ ํ•˜๋‚˜์˜ ์ด๋ฏธ์ง€์—์„œ ๋‘ ๊ฐ€์ง€์˜ view๋กœ ๊ตฌ์„ฑํ•จ

  • ์ „์ฒด ์ด๋ฏธ์ง€๋ฅผ ๋‹ด๊ณ  ์žˆ๋Š” ๊ธ€๋กœ๋ฒŒ ๋ทฐ(global view)
    • 224x224 ํ•ด์ƒ๋„์˜ 2๊ฐœ์˜ ์ „์—ญ ๋ทฐ์™€ ์›๋ณธ ์ด๋ฏธ์ง€์˜ ํฐ ์˜์—ญ (์˜ˆ: 50% ์ด์ƒ)์„ ํฌํ•จ
  • ์ด๋ฏธ์ง€์˜ ์ž‘์€ ์ง€์—ญ๋งŒ์„ ํฌํ•จํ•˜๋Š” ๋กœ์ปฌ ๋ทฐ(local view)
    • 96x96 ํ•ด์ƒ๋„์˜ ์—ฌ๋Ÿฌ ๋กœ์ปฌ ๋ทฐ์™€ ์›๋ณธ ์ด๋ฏธ์ง€์˜ ์ž‘์€ ์˜์—ญ (์˜ˆ: 50% ๋ฏธ๋งŒ)์„ ํฌํ•จํ•˜๋Š” ๋ฐฉ์‹์„ ์‚ฌ์šฉ

1) local + global views (fed to the student as input)

2) global views only (fed to the teacher as input)

➡️ This encourages "local-to-global" correspondence (the relation between local views and the global views is learned with the global views as the reference)

: Training minimizes the cross-entropy loss summed over every pair of a teacher global view x and a different student view x′: Σ_{x ∈ {x₁ᵍ, x₂ᵍ}} Σ_{x′ ∈ V, x′ ≠ x} H(Pt(x), Ps(x′))

๋‘ ๋„คํŠธ์›Œํฌ ๋ชจ๋‘(์„ ์ƒ, ํ•™์ƒ) ๋‹ค ๊ฐ™์€ architecture์„ ๊ฐ€์ง€์ง€๋งŒ, ์„œ๋กœ ๋‹ค๋ฅธ ํŒŒ๋ผ๋ฏธํ„ฐ θs์™€ θt ๋ฅผ ๊ฐ€์ง

: θs๋Š” SGD(ํ™•๋ฅ ์  ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•)์„ ์‚ฌ์šฉํ•ด์„œ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์—…๋ฐ์ดํŠธํ•จ (์„ ์ƒ์€ ์ด๋ฏธ ํ•™์Šต๋˜์–ด์žˆ์Œ-์•„๋ž˜ ์ฐธ๊ณ )


Teacher network

: Knowledge distillation๊ณผ ๋‹ค๋ฅด๊ฒŒ ์‚ฌ์ „ ์ง€์‹์œผ๋กœ teacher network gθt๋ฅผ ๊ฐ–๊ธฐ ์•Š๊ธฐ์—, teacher network๋ฅผ student network์˜ ์ด์ „ iteration์œผ๋กœ ๊ตฌ์ถ•ํ•˜์˜€์Œ

: <Freeze> Teacher Network๋Š” ํ•œ epoch ๋™์•ˆ ๋™๊ฒฐ(freeze)๋จ. ์ด๋Š” ํ•™์ƒ ๋„คํŠธ์›Œํฌ๋ฅผ ํ›ˆ๋ จํ•˜๋Š” ๋™์•ˆ teacher network์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์—…๋ฐ์ดํŠธํ•˜์ง€ ์•Š๊ณ  ๊ณ ์ •ํ•˜์—ฌ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•จ

 

๐Ÿ‘ป teacher network ์–ด๋–ค๊ฑธ๋กœ?

: <ํ•™์ƒ ๊ฐ€์ค‘์น˜๋ฅผ teacher ๊ฐ€์ค‘์น˜๋กœ > student์˜ ๊ฐ€์ค‘์น˜๋ฅผ teacher์˜ ๊ฐ€์ค‘์น˜๋กœ ์ง์ ‘ ๋ณต์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ์‹œ๋„๋˜์—ˆ์œผ๋‚˜, ์ˆ˜๋ ดํ•˜์ง€ ์•Š์•„์„œ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ์–ป์ง€ ๋ชปํ–ˆ๋‹ค๊ณ  ํ•จ

: <momentum encoder> Student์˜ ๊ฐ€์ค‘์น˜์— exponential moving average (EMA)๋ฅผ ์‚ฌ์šฉํ•˜๋Š” momentum encoder๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค๊ณ .

⇒ ์—…๋ฐ์ดํŠธ ๊ทœ์น™: θt ← λθt + (1 - λ)θs

: λ๋Š” ํ›ˆ๋ จ ์ค‘์— 0.996์—์„œ 1๋กœ ์ฆ๊ฐ€ํ•˜๋Š” ์ฝ”์‚ฌ์ธ ์Šค์ผ€์ค„์„ ๋”ฐ๋ฆ„

: ์›๋ž˜๋Š” momentum encoder๊ฐ€ contrastive learning์—์„œ ์‚ฌ์šฉ๋˜์—ˆ์œผ๋‚˜, DINO๋Š” ํ(์ž˜๋ชฐ๋ผ)๋‚˜ contrastive loss๊ฐ€ ์—†๊ธฐ์— momentum encoder๊ฐ€ mean teacher(๋‘ ๋ชจ๋ธ์˜ ์ถœ๋ ฅ์„ ํ‰๊ท ํ•ด์„œ ์ƒˆ๋กœ์šด ๋ชฉํ‘œ ํŠน์ง•์„ ์ƒ์„ฑํ•˜๊ณ , ํ•™์ƒ ๋ชจ๋ธ์„ ์ด ์ƒˆ๋กœ์šด ๋ชฉํ‘œ ํŠน์ง•์œผ๋กœ ํ›ˆ๋ จ ⇒ ํ•™์ƒ์˜ ์„ ์ƒ ํŠน์ง•์„ ๋” ํ•™์Šต ์ž˜ํ•˜๊ฒŒ๋”)์˜ ์—ญํ• ์„ ํ•จ

: ํ•™์Šต ์ค‘์—๋Š” teacher๊ฐ€ student๋ณด๋‹ค ๋” ์„ฑ๋Šฅ์ด ์ข‹์œผ๋ฉฐ, teacher๊ฐ€ target feature๋“ค์„ ๊ณ ํ’ˆ์งˆ๋กœ ์ œ๊ณตํ•˜์—ฌ student์˜ ํ•™์Šต์„ guideํ•จ


Network architecture

: ๋ชจ๋ธ(g)๋Š” ViT๋‚˜ ResNet backbone f (ViT [19] ๋˜๋Š” ResNet [34])์™€ projection head h๋กœ ๊ตฌ์„ฑ๋จ

: (g=hโˆ˜f) Projection head๋Š” layer 3๊ฐœ์˜ MLP, L2์ •๊ทœํ™”, ๊ฐ€์ค‘์น˜๊ฐ€ ์ •๊ทœํ™”๋œ FC layer๋กœ ๊ตฌ์„ฑ

: ViT ์•„ํ‚คํ…์ฒ˜๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ batch ์ •๊ทœํ™”(BN)๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Œ

: ์ „์ฒด architecture์— BN์ด ์—†์Œ


Avoiding collapse

โ€ป collapse: ๋ชจ๋ธ์ด ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ์ ์ ˆํ•˜๊ฒŒ ๋‹ค์–‘ํ•œ ๋ฐฉ์‹์œผ๋กœ ์ธ์ฝ”๋”ฉํ•˜์ง€ ์•Š๊ณ , ๊ฐ ์ž…๋ ฅ์— ๋Œ€ํ•ด ์ผ์ •ํ•œ ํŠน์ •ํ•œ ๊ฐ’์œผ๋กœ ์ˆ˜๋ ดํ•˜์—ฌ ์ •๋ณด๋ฅผ ์žƒ์–ด๋ฒ„๋ฆฌ๋Š” ํ˜„์ƒ (ํŠน์ • ์ฐจ์›์ด ์ง€๋‚˜์น˜๊ฒŒ ์šฐ์„ธํ•ด์ ธ์„œ, ๋ชจ๋ธ์ด ๊ทธ ์ฐจ์›์— ๋Œ€ํ•œ ์ •๋ณด๋งŒ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฒฐ๊ณผ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ์ƒํƒœ๋ฅผ ๋งํ•จ)

: self-supervised ๋ฐฉ๋ฒ•์ด contrastive loss, clustering constraints, predictor, BN ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ collapse๋ฅผ ํ”ผํ•˜๋ ค๊ณ  ํ•จ

⇒ DINO๋Š” momentum teacher output์„ ์ •๋ ฌํ•˜๊ณ , centering ๋ฐ sharpening์œผ๋กœ ํ•ด๊ฒฐํ•จ

1) Centering

: ์›์ ์„ ๊ธฐ์ค€์œผ๋กœ ์ค‘์‹ฌํ™”(ํŠน์ง•๋“ค์˜ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•˜์—ฌ ํ•ด๋‹น ๊ฐ’์„ ํŠน์ง•๋“ค์—์„œ ๋นผ๋Š” ๋ฐฉ)

: ์–ด๋–ค ํŠน์ • ์ฐจ์›์ด ๋‹ค๋ฅธ ์ฐจ์›์— ๋น„ํ•ด ์ง€๋‚˜์น˜๊ฒŒ ์šฐ์„ธํ•ด์ง€๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€

: ์ฆ‰, ํŠน์ • ์ฐจ์›์ด ์ง€๋‚˜์น˜๊ฒŒ ํฐ ๊ฐ’์„ ๊ฐ€์ง€์ง€ ์•Š๋„๋ก ๋ณด์ •ํ•˜๋Š” ์—ญํ• ์„ ํ•จ

: centering์ด ์ ์šฉ๋˜๋ฉด ๋ชจ๋ธ์˜ ํŠน์ง•๋“ค์ด ๊ท ์ผํ•œ ๋ถ„ํฌ๋กœ ๋ณ€ํ™˜๋จ. ์ฆ‰, ๋ชจ๋ธ์˜ ์ถœ๋ ฅ ํŠน์ง•๋“ค์ด ๋ชจ๋‘ ๋น„์Šทํ•œ ๊ฐ’์œผ๋กœ ์ˆ˜๋ ดํ•˜๊ฒŒ ๋˜๋Š”๋ฐ, ์ด๋กœ ์ธํ•ด ๋ชจ๋“  ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ ๊ฑฐ์˜ ๋™์ผํ•œ ํŠน์ง•์œผ๋กœ ์‚ฌ์ƒ๋˜๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜๊ฒŒ ๋จ

2) Sharpening

: ํŠน์ง•๋“ค์˜ ๋ถ„ํฌ๋ฅผ ์กฐ์ ˆํ•ด์„œ, ๋” ๋šœ๋ ทํ•˜๊ณ  ์„ ๋ช…ํ•œ ๋ถ„ํฌ๋ฅผ ์–ป๋Š” ๊ฒƒ์„ ์˜๋ฏธ.

: Temperature parameter (τ)๋ผ๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์กฐ์ ˆํ•˜์—ฌ softmax ํ•จ์ˆ˜์˜ ์„ ๋ช…๋„๋ฅผ ์กฐ์ ˆ.

: Temperature ๊ฐ’์ด ๋‚ฎ์„์ˆ˜๋ก softmax ํ•จ์ˆ˜์˜ ๋ถ„ํฌ๊ฐ€ ๋” sharpํ•ด์ง€๊ณ , ๊ฐ’์ด ๋†’์„์ˆ˜๋ก ๋” ๊ท ์ผํ•œ ๋ถ„ํฌ๊ฐ€ ๋จ

⇒ ๋ถ•๊ดด๋ฅผ ๋ฐฉ์ง€ํ•จ. But, centering์„ ํ•จ์œผ๋กœ์จ ์•ˆ์ •์„ฑ์€ ์–ป์ง€๋งŒ, batch์— ๋Œ€ํ•œ ์˜์กด์„ฑ์ด ์ค„์–ด๋“ ๋‹ค๊ณ  ํ•จ (centering์—์„œ์˜ ํ‰๊ท ๊ฐ’ ์‚ฌ์šฉ์€ ํ•ด๋‹น ๋ฐฐ์น˜์˜ ํŠน์ง•๋“ค์— ๋Œ€ํ•œ ํ†ต๊ณ„ ์ •๋ณด์ด๊ธฐ ๋•Œ๋ฌธ์—, ๋‹ค๋ฅธ ๋ฐฐ์น˜์— ๋Œ€ํ•ด์„œ๋Š” ์„ฑ๋Šฅ์ด ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ๋‹ค๋Š” ๋ง์ž„)

⇒ 1์ฐจ ๋ฐฐ์น˜ ํ†ต๊ณ„์—๋งŒ ์˜์กดํ•œ๋‹ค๊ณ .

 

: ๊ฒฐ๊ตญ, centering๊ณผ sharpening์˜ ์—ญํ• ์€ teacher์— bias ํ•ญ์„ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™์€ ๋œป์ž„.

: c๋Š” EMA๋กœ ์—…๋ฐ์ดํŠธ ๋จ. batch size๊ฐ€ ๋‹ค๋ฅด๋”๋ผ๋„ ์ž˜ ์ž‘๋™ํ•œ๋‹ค๊ณ  ํ•จ

: m>0 ์€ ์ด๋™ํ‰๊ท ์„ ๊ณ„์‚ฐํ•˜๋Š”๋ฐ ์‚ฌ์šฉํ•˜๋Š” ๋น„์œจ ํŒŒ๋ผ๋ฏธํ„ฐ (ํ˜„์žฌ ๋ฐ์ดํ„ฐ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•˜๊ฒŒ ๋ฐ›์„๊ฑด์ง€์— ๋Œ€ํ•œ ๊ฐ€์ค‘์น˜ ํŒŒ๋ผ๋ฏธํ„ฐ - ๋žŒ๋‹ค๋ž‘ ๊ฐ™์€ ์—ญํ• ์ธ ๋“ฏ)

: B๋Š” batch size

3-2) Implementation and evaluation protocols

Vision Transformer

: ViT ์•„ํ‚คํ…์ฒ˜๋Š” ํ•ด์ƒ๋„ N × N์˜ ๊ฒน์น˜์ง€ ์•Š๋Š” ์ด๋ฏธ์ง€ patch grid๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์Œ

: N = 16 (" /16 ") ๋˜๋Š” N = 8 (" /8 ")์„ ์‚ฌ์šฉ

: ํŒจ์น˜๋“ค์€ ์„ ํ˜• ๋ ˆ์ด์–ด๋ฅผ ํ†ตํ•ด ์ž„๋ฒ ๋”ฉ ์ง‘ํ•ฉ์œผ๋กœ ๋ณ€ํ™˜

 

: ์ „์ฒด ์ •๋ณด๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋„๋ก ํ† ํฐ ํ•˜๋‚˜๋ฅผ ์ถ”๊ฐ€ํ•จ + ์ถœ๋ ฅ์— projection head h๋ฅผ ์—ฐ๊ฒฐ

: ์ด ํ† ํฐ์€ ์–ด๋– ํ•œ ๋ ˆ์ด๋ธ”์ด๋‚˜ supervision์— ์—ฐ๊ฒฐ๋˜์ง€๋Š” ์•Š์ง€๋งŒ ๊ธฐ์กด ์—ฐ๊ตฌ๋“ค๊ณผ์˜ ์ผ๊ด€์„ฑ์„ ์œ„ํ•ด ํด๋ž˜์Šค ํ† ํฐ [CLS-Special Classificaiton token]์ด๋ผ ๋ถ€๋ฆ„ (์ฒซ๋ฒˆ์งธ ์‹œ์ž‘ ํ† ํฐ)

: ํŒจ์น˜ ํ† ํฐ๊ณผ [CLS] ํ† ํฐ์€ pre-norm layer normalization์„ ๊ฐ€์ง„ ํ‘œ์ค€ Transformer network์— ์ž…๋ ฅ๋จ

: Transformer๋Š” self-attention๊ณผ feed-forward layer์˜ ์‹œํ€€์Šค์ด๋ฉฐ skip connection ์‚ฌ์šฉ

: Self-attention layer๋Š” attention mechanism์œผ๋กœ ๋‹ค๋ฅธ ํ† ํฐ ํ‘œํ˜„์„ ๋ณด๊ณ  ๊ฐ ํ† ํฐ ํ‘œํ˜„๋“ค์„ ์—…๋ฐ์ดํŠธ

 

Implementation details

: ๋ฐ์ดํ„ฐ์…‹: ImageNet ๋ฐ์ดํ„ฐ์…‹์— ๋ ˆ์ด๋ธ” ์—†์ด ์‚ฌ์ „ ํ•™์Šต

: batch size 1024, adamw optimizer, 16 GPUs

: learning rate๋Š” ์ฒ˜์Œ 10 epoch๋งŒ 0.005×batchsize/256๊นŒ์ง€ warm up ํ›„ cosine schedule๋กœ decay

: weight decay: cosine schedule๋กœ 0.04์—์„œ 0.4

: τs=0.1, τt๋Š” 0.04์—์„œ 0.07๋กœ ์ดˆ๋ฐ˜ 30 epoch๋™์•ˆ linear-warmup

: BYOL์˜ data augmentation (color jittering, Gaussian blur and solarization)๊ณผ multi-crop์„ ์‚ฌ์šฉ

 

Evaluation protocols
  • omitted

 

 

 

<์ฐธ๊ณ >

http://dmqm.korea.ac.kr/activity/seminar/310

https://kyujinpy.tistory.com/44

https://kimjy99.github.io/๋…ผ๋ฌธ๋ฆฌ๋ทฐ/dino/

