
mixup: Beyond Empirical Risk Minimization

by 제룽 2023. 8. 3.

 

 

What is mixup?

: Beyond Empirical Risk Minimization - going "beyond empirical risk minimization"? What on earth does that mean?

: mixup ⇒ a data augmentation technique

: It linearly combines two data points to create a new sample

: Put very simply, if we train and predict the usual way, overfitting is bound to happen.

: Why? Because the model learns by looking only at the training data, it naturally ends up biased toward the training dataset.

: In other words, it overfits. As a result, the model is inevitably fragile as soon as it is applied to a dataset whose distribution is even slightly different (out of distribution).

: So, instead of training only on the training dataset, let's also train on the distribution in the vicinity of the training dataset and build a more general model! That is the main point of mixup.

In short: let's additionally create data that is not part of the given dataset but is not very different from it. That is the whole paper.

 

 

1. ERM(empirical risk minimization)

※ empirical: based on experience

Minimize the risk (the difference, i.e. the error, between predicted and actual values) that is expected based on what has been experienced (the training data)

 

: Supervised learning is the process of finding the relationship between input data (X) and the corresponding target data (Y). Here the dataset follows the joint distribution P(X, Y)

: After defining a loss function on the difference between the prediction f(x) and the actual target y, our goal is to minimize the average of that loss over the data distribution P (this is just the loss function we work with every day)

⇒ The paper calls this the expected risk

expected risk
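Written out, with ℓ denoting the loss function (the symbol ℓ is an assumption here; the notes above only name f, P and the expected risk), the expected risk takes the form:

```latex
\[
R(f) = \int \ell\bigl(f(x), y\bigr)\, \mathrm{d}P(x, y)
\]
```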

: However, we do not know the true data distribution P, so we use data that can approximate it (the training data D)

: When training a model, a distribution is formed from the training dataset ⇒ empirical distribution

: That is, the population is approximated by the available data (D). The more data there is (the larger the sample), the more likely it is to approximate the population (P) well (it gets closer to the population)

empirical distribution
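As a sketch in the same notation, with n training pairs (x_i, y_i) and δ(x = x_i, y = y_i) a Dirac mass centered on the i-th pair (n and the index i are notation assumed here), the empirical distribution can be written as:

```latex
\[
P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i,\; y = y_i)
\]
```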

: Replacing P(x, y) inside R(f) with the empirical distribution and evaluating the integral gives the final expression below (the empirical expected risk)

expected risk under the empirical distribution
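Assuming a loss ℓ and n training pairs (x_i, y_i) (both symbols are notation assumed here), plugging the empirical distribution P_δ into the expected risk collapses the integral into a plain average over the training set:

```latex
\[
R_\delta(f) = \int \ell\bigl(f(x), y\bigr)\, \mathrm{d}P_\delta(x, y)
            = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr)
\]
```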

: Minimizing Rδ(f) is what completes the definition of ERM
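A minimal sketch of what Rδ(f) evaluates, assuming a squared-error loss and a tiny toy dataset (both are illustrative choices, not taken from the paper; `empirical_risk` is a hypothetical helper name). The empirical risk is simply the mean per-sample loss over the training pairs:

```python
def squared_loss(pred, target):
    """Per-sample squared error (an assumed loss, chosen only for illustration)."""
    return (pred - target) ** 2


def empirical_risk(f, xs, ys, loss=squared_loss):
    """Average loss of predictor f over the n training pairs (x_i, y_i)."""
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)


# Toy data generated by y = 2x, so the predictor 2x fits it exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x for x in xs]

risk_good = empirical_risk(lambda x: 2.0 * x, xs, ys)  # matches the data
risk_bad = empirical_risk(lambda x: x, xs, ys)         # underestimates every target
```

ERM picks the f with the smallest such value, which says nothing about how f behaves between or away from the training points.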

 

➡️ When training a model with many parameters, as in deep learning, the model can end up simply memorizing the whole empirical distribution (the training data) (the same story again)

➡️ This is why VRM was proposed

 

2. VRM(vicinal risk minimization)

※ vicinal: neighboring, nearby

: The vicinal distribution is the distribution of other data pairs (x', y') that lie close to the data pairs (x, y) we already have

: Minimizing the expected risk over data sampled from this distribution means we are working with the empirical vicinal risk.

vicinal distribution
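In formula form, assuming n training pairs and writing ν(x̃, ỹ | x_i, y_i) for the vicinity distribution around the i-th pair (ν, n and i are notation assumed here):

```latex
\[
P_\nu(\tilde{x}, \tilde{y}) = \frac{1}{n} \sum_{i=1}^{n} \nu(\tilde{x}, \tilde{y} \mid x_i, y_i)
\]
```

One familiar choice of ν is a Gaussian centered on x_i with the label kept fixed, which amounts to augmenting the inputs with additive Gaussian noise.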

: In the end, since we have modeled Pν(x~, y~), which covers the distribution of the training data, we can compute a new expected risk
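As a sketch, if m virtual pairs (x̃_i, ỹ_i) are sampled from Pν (m and ℓ are symbols assumed here for the number of sampled pairs and the loss), the new risk is the average loss over those virtual pairs:

```latex
\[
R_\nu(f) = \frac{1}{m} \sum_{i=1}^{m} \ell\bigl(f(\tilde{x}_i), \tilde{y}_i\bigr)
\]
```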

: The generalized form of the vicinal distribution

generic vicinal distribution
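For mixup specifically, the vicinity of a training pair (x_i, y_i) is built from convex combinations with the other training pairs (x_j, y_j). A sketch of that distribution, assuming λ is the mixing coefficient in [0, 1] and E_λ the expectation over it:

```latex
\[
\mu(\tilde{x}, \tilde{y} \mid x_i, y_i)
  = \frac{1}{n} \sum_{j=1}^{n} \mathbb{E}_{\lambda}
    \Bigl[ \delta\bigl(\tilde{x} = \lambda x_i + (1 - \lambda) x_j,\;
                       \tilde{y} = \lambda y_i + (1 - \lambda) y_j \bigr) \Bigr]
\]
```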

 

: Adjusting the value of lambda determines the data distribution

lambda follows a Beta(a, a) distribution
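To get a feel for how the Beta parameter shapes lambda, here is a small standard-library sketch. The values 0.2 and 5.0 and the helper names are illustrative assumptions. Small values of a push lambda toward 0 or 1 (almost no mixing, close to plain ERM), while large values concentrate it near 0.5 (even blends):

```python
import random


def sample_lambdas(a, n, rng):
    """Draw n mixing coefficients lambda ~ Beta(a, a); each lies in [0, 1]."""
    return [rng.betavariate(a, a) for _ in range(n)]


def mean_blend(lams):
    """Mean of min(lambda, 1 - lambda): 0 means no mixing, 0.5 means an even blend."""
    return sum(min(lam, 1.0 - lam) for lam in lams) / len(lams)


rng = random.Random(0)
small_a = sample_lambdas(0.2, 10_000, rng)  # U-shaped: mass near 0 and 1
large_a = sample_lambdas(5.0, 10_000, rng)  # bell-shaped: mass near 0.5

blend_small = mean_blend(small_a)
blend_large = mean_blend(large_a)
```

Comparing `blend_small` with `blend_large` shows the effect directly: the larger parameter produces samples that are, on average, much closer to an even mix of the two inputs.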

 

: In the end, doing data augmentation according to the formula below is all there is to mixup
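The formula in question mixes two training pairs as x~ = λ·x_i + (1 − λ)·x_j and y~ = λ·y_i + (1 − λ)·y_j, with λ drawn from Beta(a, a). A minimal sketch follows, assuming one-hot labels, a single λ per batch, and partners chosen by shuffling the batch. These are simplifying choices for illustration, and `mixup_batch` is a hypothetical helper name. Plain Python lists keep it dependency-free; in practice the same arithmetic would be applied to tensors.

```python
import random


def mixup_batch(xs, ys_onehot, a, rng):
    """Mix each sample with a randomly chosen partner from the same batch.

    xs:        list of feature vectors (lists of floats)
    ys_onehot: list of one-hot label vectors
    a:         Beta(a, a) parameter, must be > 0
    """
    lam = rng.betavariate(a, a)       # one lambda shared by the whole batch
    perm = list(range(len(xs)))
    rng.shuffle(perm)                 # perm[i] is the partner index j for sample i

    def mix(u, v):
        # Element-wise convex combination lam * u + (1 - lam) * v.
        return [lam * p + (1.0 - lam) * q for p, q in zip(u, v)]

    x_mix = [mix(xs[i], xs[j]) for i, j in enumerate(perm)]
    y_mix = [mix(ys_onehot[i], ys_onehot[j]) for i, j in enumerate(perm)]
    return x_mix, y_mix, lam


rng = random.Random(0)
xs = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]   # 4 samples, 2 features
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # one-hot labels, 2 classes
x_mix, y_mix, lam = mixup_batch(xs, ys, a=1.0, rng=rng)
```

The mixed labels are soft (each row still sums to 1), so the loss has to accept probability-valued targets. With cross-entropy this is equivalent to weighting the two original-label losses by λ and 1 − λ.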

 

 

What does mixup actually do?

: mixup gives a smoother, and therefore more useful, estimate of uncertainty

ex) green: class 0, orange: class 1 ⇒ the blue shading shows the probability that the class is 1 given the data point x

: With ERM, the decision boundary between the two classes is sharply separated

: With mixup, the blue gets gradually darker the closer a region is to class 1 (the transition is smoothed out)

➡️ Uncertainty can be estimated smoothly

➡️ This can be read as mixup overfitting less than ERM

➡️ Whereas ERM fits the training data too sensitively, mixup makes the decision boundary smoother and so tends to generalize better to new data. ⇒ Lower chance of overfitting

※ Less sensitive to noise

: A smooth decision boundary does not change much when the distances between data points are perturbed slightly. This makes the model less sensitive to noise and outliers.

: In contrast, a sharp decision boundary is fit tightly to the training data, so it is easily affected by noise as well.

➡️ This gives more general predictions on new data and a more stable, robust model

 

 

 

Prediction/Gradient

: In (a), the model trained with mixup shows better prediction performance

: In (b), the gradient norm is smaller for mixup. This suggests more stable training
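One way to make panel (a) concrete: predictions are evaluated at points x = λ·x_i + (1 − λ)·x_j between training samples, and a prediction counts as a miss when the predicted class is neither y_i nor y_j. The sketch below assumes that definition. The nearest-centroid `predict` is a hypothetical stand-in for a trained model, used only so the example runs, and `miss_rate` is likewise an illustrative name.

```python
import random


def miss_rate(predict, xs, ys, lam, rng):
    """Fraction of in-between points whose predicted class matches neither endpoint."""
    perm = list(range(len(xs)))
    rng.shuffle(perm)                                  # partner j for each sample i
    misses = 0
    for i, j in enumerate(perm):
        x_between = lam * xs[i] + (1.0 - lam) * xs[j]  # point on the segment x_j -> x_i
        pred = predict(x_between)
        if pred != ys[i] and pred != ys[j]:
            misses += 1
    return misses / len(xs)


# Hypothetical 1-D classifier: assign x to the nearest of three class centroids.
CENTROIDS = [0.0, 5.0, 10.0]


def predict(x):
    return min(range(len(CENTROIDS)), key=lambda c: abs(x - CENTROIDS[c]))


rng = random.Random(0)
xs = [0.0, 0.1, 10.0, 9.9]   # only classes 0 and 2 appear in the data
ys = [0, 0, 2, 2]

rate_end = miss_rate(predict, xs, ys, lam=1.0, rng=rng)  # evaluated at the training points
rate_mid = miss_rate(predict, xs, ys, lam=0.5, rng=rng)  # evaluated at midpoints
```

In this toy setup a midpoint between a class-0 and a class-2 sample lands near 5.0 and is assigned class 1, which is exactly the kind of in-between error the miss count is meant to expose.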

 

 

 

 

EXPERIMENTS

 

[3.1 IMAGENET CLASSIFICATION][3.2 CIFAR10 AND CIFAR100]
[3.4 MEMORIZATION OF CORRUPTED LABELS][3.5 ROBUSTNESS TO ADVERSARIAL EXAMPLES][3.6 TABULAR DATA]
[3.7 STABILIZATION OF GENERATIVE ADVERSARIAL NETWORKS][3.8 ABLATION STUDIES]

 

 


 

<References>

https://everyday-image-processing.tistory.com/145

https://rroundtable.notion.site/mixup-467e0a5d4d284e05a5879007b9d1b97f

https://techy8855.tistory.com/19

https://medium.com/swlh/paper-mixup-beyond-empirical-risk-minimization-image-classification

 


 

 
