
CycleGAN

by 제룽 2023. 7. 5.
<Translation>
0. Abstract

Figure 1: Given any two unordered image collections X and Y, our algorithm learns to automatically "translate" an image from one into the other and vice versa: (left) Monet paintings and landscape photos from Flickr; (center) zebras and horses from ImageNet; (right) summer and winter Yosemite photos from Flickr. Example application (bottom): using a collection of paintings of famous artists, our method learns to render natural photographs into the respective styles.

κ·Έλ¦Ό 1: μž„μ˜μ˜ 두 개의 이미지 집합 X와 Yκ°€ μ£Όμ–΄μ‘Œμ„ λ•Œ, 우리의 μ•Œκ³ λ¦¬μ¦˜μ€ ν•œ 이미지λ₯Ό λ‹€λ₯Έ μ΄λ―Έμ§€λ‘œ "λ³€ν™˜"ν•˜κ³  κ·Έ λ°˜λŒ€λ‘œ μˆ˜ν–‰ν•˜λŠ” 방법을 μžλ™μœΌλ‘œ ν•™μŠ΅ν•©λ‹ˆλ‹€. (μ™Όμͺ½) Monet의 κ·Έλ¦Όκ³Ό Flickr의 풍경 사진; (κ°€μš΄λ°) ImageNet의 얼룩말과 말; (였λ₯Έμͺ½) Flickr의 여름과 겨울 μš”μ„Έλ―Έν‹° μ‚¬μ§„μž…λ‹ˆλ‹€. 예제 μ‘μš© (μ•„λž˜): 유λͺ…ν•œ μ˜ˆμˆ κ°€λ“€μ˜ κ·Έλ¦Ό μ»¬λ ‰μ…˜μ„ μ‚¬μš©ν•˜μ—¬ 우리의 방법은 μžμ—° 사진을 ν•΄λ‹Ή μŠ€νƒ€μΌλ‘œ λ Œλ”λ§ν•˜λŠ” 방법을 ν•™μŠ΅ν•©λ‹ˆλ‹€.

 

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F : Y → X and introduce a cycle consistency loss to enforce F(G(X)) ≈ X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.

이미지 κ°„ λ³€ν™˜μ€ μž…λ ₯ 이미지와 좜λ ₯ 이미지 μ‚¬μ΄μ˜ 맀핑을 ν•™μŠ΅ν•˜λŠ” λΉ„μ „ 및 κ·Έλž˜ν”½μŠ€ 문제의 ν•œ μœ ν˜•μž…λ‹ˆλ‹€. 일치된 이미지 쌍의 ν•™μŠ΅ μ„ΈνŠΈλ₯Ό μ‚¬μš©ν•˜μ—¬ μž…λ ₯ 이미지와 좜λ ₯ 이미지 κ°„μ˜ 맀핑을 ν•™μŠ΅ν•˜λŠ” 것이 λͺ©ν‘œμž…λ‹ˆλ‹€. κ·ΈλŸ¬λ‚˜ λ§Žμ€ μž‘μ—…μ—μ„œλŠ” 쌍으둜 이루어진 ν›ˆλ ¨ 데이터λ₯Ό μ‚¬μš©ν•  수 μ—†μŠ΅λ‹ˆλ‹€. μš°λ¦¬λŠ” 짝지어진 μ˜ˆμ œκ°€ 없을 λ•Œ μ†ŒμŠ€ 도메인 Xμ—μ„œ λŒ€μƒ 도메인 Y둜 이미지λ₯Ό λ³€ν™˜ν•˜λŠ” 방법을 μ œμ‹œν•©λ‹ˆλ‹€. 우리의 λͺ©ν‘œλŠ” G: X → YλΌλŠ” 맀핑을 ν•™μŠ΅ν•˜λŠ” 것인데, μ΄λ•Œ G(X)의 이미지 뢄포가 μ λŒ€μ  손싀을 μ‚¬μš©ν•˜μ—¬ 뢄포 Y와 ꡬ별할 수 없도둝 ν•©λ‹ˆλ‹€. μ΄λŸ¬ν•œ 맀핑은 맀우 λΆˆμΆ©λΆ„ν•œ μ œμ•½μ„ 가지고 μžˆμœΌλ―€λ‘œ μ—­ 맀핑 F: Y → X와 ν•¨κ»˜ κ²°ν•©ν•˜κ³  F(G(X)) ≈ X (λ°˜λŒ€μ˜ κ²½μš°λ„ λ§ˆμ°¬κ°€μ§€)λ₯Ό κ°•μ œν•˜κΈ° μœ„ν•΄ 사이클 일관성 손싀을 λ„μž…ν•©λ‹ˆλ‹€. 짝지어진 ν›ˆλ ¨ 데이터가 μ—†λŠ” μ—¬λŸ¬ μž‘μ—…μ— λŒ€ν•œ 질적인 κ²°κ³Όλ₯Ό μ œμ‹œν•˜λ©°, μ»¬λ ‰μ…˜ μŠ€νƒ€μΌ λ³€ν™˜, 객체 λ³€ν˜•, κ³„μ ˆ λ³€ν™˜, 사진 κ°œμ„  등을 ν¬ν•¨ν•©λ‹ˆλ‹€. κΈ°μ‘΄ λ°©λ²•κ³Όμ˜ 양적 λΉ„κ΅λŠ” 우리의 μ ‘κ·Ό λ°©μ‹μ˜ μš°μˆ˜μ„±μ„ μž…μ¦ν•©λ‹ˆλ‹€.

1. Introduction

Figure 2: Paired training data (left) consists of training examples {x_i, y_i}_{i=1}^N, where the correspondence between x_i and y_i exists [22]. We instead consider unpaired training data (right), consisting of a source set {x_i}_{i=1}^N (x_i ∈ X) and a target set {y_j}_{j=1}^M (y_j ∈ Y), with no information provided as to which x_i matches which y_j.

What did Claude Monet see as he placed his easel by the bank of the Seine near Argenteuil on a lovely spring day in 1873 (Figure 1, top-left)? A color photograph, had it been invented, may have documented a crisp blue sky and a glassy river reflecting it. Monet conveyed his impression of this same scene through wispy brush strokes and a bright palette.

What if Monet had happened upon the little harbor in Cassis on a cool summer evening (Figure 1, bottom-left)? A brief stroll through a gallery of Monet paintings makes it possible to imagine how he would have rendered the scene: perhaps in pastel shades, with abrupt dabs of paint, and a somewhat flattened dynamic range. We can imagine all this despite never having seen a side by side example of a Monet painting next to a photo of the scene he painted. Instead, we have knowledge of the set of Monet paintings and of the set of landscape photographs. We can reason about the stylistic differences between these two sets, and thereby imagine what a scene might look like if we were to "translate" it from one set into the other.


In this paper, we present a method that can learn to do the same: capturing special characteristics of one image collection and figuring out how these characteristics could be translated into the other image collection, all in the absence of any paired training examples. This problem can be more broadly described as image-to-image translation [22], converting an image from one representation of a given scene, x, to another, y, e.g., grayscale to color, image to semantic labels, edge-map to photograph. Years of research in computer vision, image processing, computational photography, and graphics have produced powerful translation systems in the supervised setting, where example image pairs {x_i, y_i}_{i=1}^N are available (Figure 2, left), e.g., [11, 19, 22, 23, 28, 33, 45, 56, 58, 62]. However, obtaining paired training data can be difficult and expensive. For example, only a couple of datasets exist for tasks like semantic segmentation (e.g., [4]), and they are relatively small. Obtaining input-output pairs for graphics tasks like artistic stylization can be even more difficult since the desired output is highly complex, typically requiring artistic authoring. For many tasks, like object transfiguration (e.g., zebra↔horse, Figure 1 top-middle), the desired output is not even well-defined.

이 λ…Όλ¬Έμ—μ„œλŠ” μ–΄λ–€ 이미지 λͺ¨μŒμ˜ νŠΉμ§•μ„ ν¬μ°©ν•˜κ³  μ΄λŸ¬ν•œ νŠΉμ§•μ„ λ‹€λ₯Έ 이미지 λͺ¨μŒμœΌλ‘œ μ–΄λ–»κ²Œ λ³€ν™˜ν•  수 μžˆλŠ”μ§€ ν•™μŠ΅ν•˜λŠ” 방법을 μ œμ•ˆν•©λ‹ˆλ‹€

. μ΄λŠ” μ–΄λ– ν•œ λŒ€μ‘λ˜λŠ” ν›ˆλ ¨ μ˜ˆμ‹œλ„ μ—†λŠ” μƒν™©μ—μ„œ μ΄λ£¨μ–΄μ§€λŠ” κ²ƒμž…λ‹ˆλ‹€. 이 λ¬Έμ œλŠ” 이미지 κ°„ λ³€ν™˜, 예λ₯Ό λ“€μ–΄ κ·Έλ ˆμ΄μŠ€μΌ€μΌμ—μ„œ 컬러둜, μ΄λ―Έμ§€μ—μ„œ μ‹œλ§¨ν‹± λ ˆμ΄λΈ”λ‘œ, 엣지 λ§΅μ—μ„œ μ‚¬μ§„μœΌλ‘œμ˜ λ³€ν™˜ λ“±μœΌλ‘œ 더 λ„“κ²Œ μ„€λͺ…될 수 μžˆμŠ΅λ‹ˆλ‹€. 컴퓨터 λΉ„μ „, 이미지 처리, 계산 μ‚¬μ§„μˆ  및 κ·Έλž˜ν”½μŠ€ λΆ„μ•Όμ—μ„œ μˆ˜λ…„κ°„μ˜ μ—°κ΅¬λ‘œλŠ” μ˜ˆμ‹œ 이미지 쌍 {xi, yi} N i=1 이 μ œκ³΅λ˜λŠ” 지도 ν•™μŠ΅ ν™˜κ²½μ—μ„œ κ°•λ ₯ν•œ λ³€ν™˜ μ‹œμŠ€ν…œμ„ κ°œλ°œν–ˆμŠ΅λ‹ˆλ‹€(Figure 2, μ™Όμͺ½), 예λ₯Ό λ“€λ©΄ [11, 19, 22, 23, 28, 33, 45, 56, 58, 62] 등이 μžˆμŠ΅λ‹ˆλ‹€.

κ·ΈλŸ¬λ‚˜ λŒ€μ‘λ˜λŠ” ν›ˆλ ¨ 데이터λ₯Ό μ–»λŠ” 것은 μ–΄λ ΅κ³  λΉ„μš©μ΄ 많이 λ“€ 수 μžˆμŠ΅λ‹ˆλ‹€. 예λ₯Ό λ“€μ–΄ μ‹œλ§¨ν‹± λΆ„ν• κ³Ό 같은 μž‘μ—…μ—λŠ” λͺ‡ 개의 λ°μ΄ν„°μ…‹λ§Œ μ‘΄μž¬ν•˜λ©° μƒλŒ€μ μœΌλ‘œ μž‘μŠ΅λ‹ˆλ‹€([4] λ“±). 예술적인 μŠ€νƒ€μΌν™”μ™€ 같은 κ·Έλž˜ν”½ μž‘μ—…μ— λŒ€ν•΄ μž…λ ₯-좜λ ₯ μŒμ„ μ–»λŠ” 것은 λ”μš± μ–΄λ €μšΈ 수 μžˆμŠ΅λ‹ˆλ‹€. μ›ν•˜λŠ” 좜λ ₯은 맀우 λ³΅μž‘ν•˜λ©° 일반적으둜 예술적 μ €μž‘μ„ ν•„μš”λ‘œ ν•©λ‹ˆλ‹€. 얼룩말 ↔ 말과 같은 객체 λ³€ν˜•κ³Ό 같은 λ§Žμ€ μž‘μ—…μ—μ„œλŠ” μ›ν•˜λŠ” 좜λ ₯이 λͺ…ν™•ν•˜κ²Œ μ •μ˜λ˜μ§€ μ•Šμ„ μˆ˜λ„ μžˆμŠ΅λ‹ˆλ‹€.

We therefore seek an algorithm that can learn to translate between domains without paired input-output examples (Figure 2, right). We assume there is some underlying relationship between the domains – for example, that they are two different renderings of the same underlying scene – and seek to learn that relationship. Although we lack supervision in the form of paired examples, we can exploit supervision at the level of sets: we are given one set of images in domain X and a different set in domain Y. We may train a mapping G : X → Y such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y by an adversary trained to classify ŷ apart from y. In theory, this objective can induce an output distribution over ŷ that matches the empirical distribution p_data(y) (in general, this requires G to be stochastic) [16]. The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y. However, such a translation does not guarantee that an individual input x and output y are paired up in a meaningful way – there are infinitely many mappings G that will induce the same distribution over ŷ. Moreover, in practice, we have found it difficult to optimize the adversarial objective in isolation: standard procedures often lead to the well-known problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress [15].

λ”°λΌμ„œ, μš°λ¦¬λŠ” μž…λ ₯-좜λ ₯ 쌍의 μ˜ˆμ‹œ 없이 도메인 간에 λ²ˆμ—­μ„ ν•™μŠ΅ν•  수 μžˆλŠ” μ•Œκ³ λ¦¬μ¦˜μ„ μ°Ύκ³  μžˆμŠ΅λ‹ˆλ‹€

(Figure 2, 였λ₯Έμͺ½). μš°λ¦¬λŠ” 도메인 간에

μ–΄λ–€ κΈ°μ € 신경을 가진 관계가 μžˆλ‹€κ³  κ°€μ •

ν•˜κ³  κ·Έ 관계λ₯Ό ν•™μŠ΅ν•˜λ €κ³  ν•©λ‹ˆλ‹€. μ˜ˆμ‹œ 쌍 ν˜•νƒœλ‘œλŠ” 지도 ν•™μŠ΅μ˜ 감독이 λΆ€μ‘±ν•˜μ§€λ§Œ, μš°λ¦¬λŠ” 집합 μˆ˜μ€€μ—μ„œμ˜ 감독을 ν™œμš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€: μš°λ¦¬μ—κ²ŒλŠ” 도메인 X의 이미지 집합과 도메인 Y의 λ‹€λ₯Έ 이미지 집합이 μ£Όμ–΄μ§‘λ‹ˆλ‹€. μš°λ¦¬λŠ” 맀핑 G: X → Yλ₯Ό ν•™μŠ΅μ‹œμΌœμ„œ 좜λ ₯ yˆ = G(x), x ∈ Xκ°€ y ∈ Y 이미지와 μ°¨λ³„ν™”λ˜μ§€ μ•Šλ„λ‘ ν•  수 μžˆμŠ΅λ‹ˆλ‹€. 이λ₯Ό μœ„ν•΄ yˆμ™€ yλ₯Ό λΆ„λ₯˜ν•˜λŠ” 데에 ν›ˆλ ¨λœ μ λŒ€μ μΈ λͺ¨λΈμ„ μ‚¬μš©ν•©λ‹ˆλ‹€. 이둠적으둜, 이 λͺ©ν‘œλŠ” yˆμ— λŒ€ν•œ 좜λ ₯ 뢄포λ₯Ό κ²½ν—˜μ  뢄포 pdata(y)와 μΌμΉ˜μ‹œν‚¬ 수 μžˆμŠ΅λ‹ˆλ‹€ (일반적으둜 이λ₯Ό μœ„ν•΄μ„œλŠ” Gκ°€ ν™•λ₯ μ μ΄μ–΄μ•Ό 함) [16]. λ”°λΌμ„œ 졜적의 GλŠ” 도메인 Xλ₯Ό 도메인 Yˆλ‘œ λ²ˆμ—­ν•˜λ©°, Y와 λ™μΌν•˜κ²Œ λΆ„ν¬λ©λ‹ˆλ‹€.

κ·ΈλŸ¬λ‚˜ μ΄λŸ¬ν•œ λ²ˆμ—­μ€ κ°œλ³„μ μΈ μž…λ ₯ x와 좜λ ₯ yκ°€ 의미 μžˆλŠ” λ°©μ‹μœΌλ‘œ λŒ€μ‘λœλ‹€λŠ” 것을 보μž₯ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€ - yˆμ— λŒ€ν•΄ λ™μΌν•œ 뢄포λ₯Ό μœ λ„ν•˜λŠ” λ¬΄ν•œνžˆ λ§Žμ€ 맀핑 Gκ°€ μ‘΄μž¬ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

κ²Œλ‹€κ°€ μ‹€μ œλ‘œλŠ” μ λŒ€μ  λͺ©μ μ„ λ…λ¦½μ μœΌλ‘œ μ΅œμ ν™”ν•˜κΈ°κ°€ μ–΄λ €μš΄ κ²ƒμœΌλ‘œ 판λͺ…λ˜μ—ˆμŠ΅λ‹ˆλ‹€. ν‘œμ€€ μ ˆμ°¨λŠ” μ’…μ’… λͺ¨λ“œ 뢕괴라고 μ•Œλ €μ§„ λ¬Έμ œμ— 이λ₯΄κ²Œ λ˜λŠ”λ°,

μ΄λŠ” λͺ¨λ“  μž…λ ₯ 이미지가 λ™μΌν•œ 좜λ ₯ μ΄λ―Έμ§€λ‘œ λ§€ν•‘λ˜κ³  μ΅œμ ν™”κ°€ μ§„μ „λ˜μ§€ μ•ŠλŠ” λ¬Έμ œμž…λ‹ˆλ‹€ [15].

These issues call for adding more structure to our objective. Therefore, we exploit the property that translation should be "cycle consistent", in the sense that if we translate, e.g., a sentence from English to French, and then translate it back from French to English, we should arrive back at the original sentence [3]. Mathematically, if we have a translator G : X → Y and another translator F : Y → X, then G and F should be inverses of each other, and both mappings should be bijections. We apply this structural assumption by training both the mapping G and F simultaneously, and adding a cycle consistency loss [64] that encourages F(G(x)) ≈ x and G(F(y)) ≈ y. Combining this loss with adversarial losses on domains X and Y yields our full objective for unpaired image-to-image translation. We apply our method to a wide range of applications, including collection style transfer, object transfiguration, season transfer and photo enhancement. We also compare against previous approaches that rely either on hand-defined factorizations of style and content, or on shared embedding functions, and show that our method outperforms these baselines. We provide both PyTorch and Torch implementations. Check out more results at our website.

2. Related Work

Figure 3: (a) Our model contains two mapping functions G : X → Y and F : Y → X, and associated adversarial discriminators DY and DX. DY encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for DX and F. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again we should arrive at where we started: (b) forward cycle-consistency loss: x → G(x) → F(G(x)) ≈ x, and (c) backward cycle-consistency loss: y → F(y) → G(F(y)) ≈ y.

Generative Adversarial Networks (GANs)

[16, 63] have achieved impressive results in image generation [6, 39], image editing [66], and representation learning [39, 43, 37]. Recent methods adopt the same idea for conditional image generation applications, such as text2image [41], image inpainting [38], and future prediction [36], as well as to other domains like videos [54] and 3D data [57]. The key to GANs’ success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real photos. This loss is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize. We adopt an adversarial loss to learn the mapping such that the translated images cannot be distinguished from images in the target domain.

생성적 μ λŒ€ 신경망(GANs)은 이미지 생성, 이미지 νŽΈμ§‘ 및 ν‘œν˜„ ν•™μŠ΅κ³Ό 같은 이미지 κ΄€λ ¨ μž‘μ—…μ—μ„œ 인상적인 κ²°κ³Όλ₯Ό λ‹¬μ„±ν–ˆμŠ΅λ‹ˆλ‹€ [6, 39, 66]. 졜근의 방법듀은 이 아이디어λ₯Ό ν…μŠ€νŠΈμ—μ„œ μ΄λ―Έμ§€λ‘œμ˜ 쑰건뢀 이미지 생성, 이미지 보정, 미래 예츑과 같은 μ‘μš© ν”„λ‘œκ·Έλž¨λΏλ§Œ μ•„λ‹ˆλΌ λΉ„λ””μ˜€ [54] 및 3D 데이터 [57]와 같은 λ‹€λ₯Έ 도메인에도 λ™μΌν•˜κ²Œ μ μš©ν•©λ‹ˆλ‹€. GAN의 μ„±κ³΅μ˜ 핡심은 μƒμ„±λœ 이미지가 μ›μΉ™μ μœΌλ‘œ μ‹€μ œ 사진과 ꡬ별할 수 없도둝 ν•˜λŠ” μ λŒ€μ  μ†μ‹€μ˜ κ°œλ…μž…λ‹ˆλ‹€. 이 손싀은 이미지 생성 μž‘μ—…μ— λŒ€ν•΄ 맀우 κ°•λ ₯ν•˜λ©°, 이것이 컴퓨터 κ·Έλž˜ν”½μŠ€μ˜ μ£Όμš” λͺ©ν‘œμž…λ‹ˆλ‹€. μš°λ¦¬λŠ” μ λŒ€μ  손싀을 μ±„νƒν•˜μ—¬ λ³€ν™˜λœ 이미지가 λŒ€μƒ λ„λ©”μΈμ˜ 이미지와 ꡬ별할 수 없도둝 맀핑을 ν•™μŠ΅ν•©λ‹ˆλ‹€. Image-to-Image Translation

The idea of image-to-image translation goes back at least to Hertzmann et al.'s Image Analogies [19], who employ a non-parametric texture model [10] on a single input-output training image pair. More recent approaches use a dataset of input-output examples to learn a parametric translation function using CNNs (e.g., [33]). Our approach builds on the "pix2pix" framework of Isola et al. [22], which uses a conditional generative adversarial network [16] to learn a mapping from input to output images. Similar ideas have been applied to various tasks such as generating photographs from sketches [44] or from attribute and semantic layouts [25]. However, unlike the above prior work, we learn the mapping without paired training examples.

Unpaired Image-to-Image Translation

Several other methods also tackle the unpaired setting, where the goal is to relate two data domains: X and Y. Rosales et al. [42] propose a Bayesian framework that includes a prior based on a patch-based Markov random field computed from a source image and a likelihood term obtained from multiple style images. More recently, CoGAN [32] and cross-modal scene networks [1] use a weight-sharing strategy to learn a common representation across domains. Concurrent to our method, Liu et al. [31] extends the above framework with a combination of variational autoencoders [27] and generative adversarial networks [16]. Another line of concurrent work [46, 49, 2] encourages the input and output to share specific "content" features even though they may differ in "style". These methods also use adversarial networks, with additional terms to enforce the output to be close to the input in a predefined metric space, such as class label space [2], image pixel space [46], and image feature space [49]. Unlike the above approaches, our formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do we assume that the input and output have to lie in the same low-dimensional embedding space. This makes our method a general-purpose solution for many vision and graphics tasks. We directly compare against several prior and contemporary approaches in Section 5.1.

Cycle Consistency

The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [24, 48]. In the language domain, verifying and improving translations via "back translation and reconciliation" is a technique used by human translators [3] (including, humorously, by Mark Twain [51]), as well as by machines [17]. More recently, higher-order cycle consistency has been used in structure from motion [61], 3D shape matching [21], cosegmentation [55], dense semantic alignment [65, 64], and depth estimation [14]. Of these, Zhou et al. [64] and Godard et al. [14] are most similar to our work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training. In this work, we are introducing a similar loss to push G and F to be consistent with each other. Concurrent with our work, in these same proceedings, Yi et al. [59] independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation [17].

Neural Style Transfer

[13, 23, 52, 12] is another way to perform image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another image (typically a painting) based on matching the Gram matrix statistics of pre-trained deep features. Our primary focus, on the other hand, is learning the mapping between two image collections, rather than between two specific images, by trying to capture correspondences between higher-level appearance structures. Therefore, our method can be applied to other tasks, such as painting → photo, object transfiguration, etc. where single sample transfer methods do not perform well. We compare these two methods in Section 5.2.

3. Formulation

Figure 4: The input images x, output images G(x) and the reconstructed images F(G(x)) from various experiments. From top to bottom: photo↔Cezanne, horses↔zebras, winter→summer Yosemite, aerial photos↔Google maps.

Our goal is to learn mapping functions between two domains X and Y given training samples {x_i}_{i=1}^N where x_i ∈ X and {y_j}_{j=1}^M where y_j ∈ Y. We denote the data distribution as x ∼ p_data(x) and y ∼ p_data(y). As illustrated in Figure 3 (a), our model includes two mappings G : X → Y and F : Y → X. In addition, we introduce two adversarial discriminators DX and DY, where DX aims to distinguish between images {x} and translated images {F(y)}; in the same way, DY aims to discriminate between {y} and {G(x)}. Our objective contains two types of terms: adversarial losses [16] for matching the distribution of generated images to the data distribution in the target domain; and cycle consistency losses to prevent the learned mappings G and F from contradicting each other.

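The formulation only assumes two independent pools of images, with no correspondence between them. Purely as an illustration (not part of the paper), an unpaired two-domain dataset could be organized roughly as follows; the directory layout and the transform are placeholder assumptions:

```python
import os
import random

from PIL import Image
from torch.utils.data import Dataset

class UnpairedImageDataset(Dataset):
    """Yields one image from domain X and one from domain Y with no pairing between them."""
    def __init__(self, dir_x, dir_y, transform=None):
        self.files_x = sorted(os.path.join(dir_x, f) for f in os.listdir(dir_x))
        self.files_y = sorted(os.path.join(dir_y, f) for f in os.listdir(dir_y))
        self.transform = transform

    def __len__(self):
        # One epoch is defined by the larger of the two (unequally sized) collections.
        return max(len(self.files_x), len(self.files_y))

    def __getitem__(self, i):
        x = Image.open(self.files_x[i % len(self.files_x)]).convert("RGB")
        # The y sample is drawn independently of x: there is no paired ground truth.
        y = Image.open(random.choice(self.files_y)).convert("RGB")
        if self.transform is not None:
            x, y = self.transform(x), self.transform(y)
        return x, y
```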

3.1. Adversarial Loss

We apply adversarial losses [16] to both mapping functions. For the mapping function G : X → Y and its discriminator DY, we express the objective as: L_GAN(G, DY, X, Y) = E_{y∼p_data(y)}[log DY(y)] + E_{x∼p_data(x)}[log(1 − DY(G(x)))],   (1) where G tries to generate images G(x) that look similar to images from domain Y, while DY aims to distinguish between translated samples G(x) and real samples y. G aims to minimize this objective against an adversary D that tries to maximize it, i.e., min_G max_DY L_GAN(G, DY, X, Y). We introduce a similar adversarial loss for the mapping function F : Y → X and its discriminator DX as well: i.e., min_F max_DX L_GAN(F, DX, Y, X).

3.1. μ λŒ€μ  손싀

μš°λ¦¬λŠ” 두 개의 맀핑 ν•¨μˆ˜μ— μ λŒ€μ  손싀 [16]을 μ μš©ν•©λ‹ˆλ‹€. 맀핑 ν•¨μˆ˜ G : X → Y와 κ·Έ νŒλ³„μž DY에 λŒ€ν•΄, μš°λ¦¬λŠ” λ‹€μŒκ³Ό 같이 λͺ©μ  ν•¨μˆ˜λ₯Ό ν‘œν˜„ν•©λ‹ˆλ‹€: LGAN(G, DY , X, Y ) = Ey∼pdata(y) [log DY (y)] + Ex∼pdata(x) [log(1 − DY (G(x))], (1)

μ—¬κΈ°μ„œ GλŠ” 도메인 Y의 이미지와 μœ μ‚¬ν•œ 이미지 G(x)λ₯Ό μƒμ„±ν•˜λ €κ³  ν•˜λ©°, DYλŠ” λ²ˆμ—­λœ μƒ˜ν”Œ G(x)κ³Ό μ‹€μ œ μƒ˜ν”Œ yλ₯Ό κ΅¬λ³„ν•˜λ €κ³  ν•©λ‹ˆλ‹€. GλŠ” μ λŒ€μ μΈ μƒλŒ€μΈ Dκ°€ 이 λͺ©μ  ν•¨μˆ˜λ₯Ό μ΅œλŒ€ν™”ν•˜λ €κ³  ν•  λ•Œ 이λ₯Ό μ΅œμ†Œν™”ν•˜λ €κ³  ν•©λ‹ˆλ‹€. 즉, minG maxDY LGAN(G, DY , X, Y)λ₯Ό λͺ©ν‘œλ‘œ ν•©λ‹ˆλ‹€. μš°λ¦¬λŠ” 맀핑 ν•¨μˆ˜ F : Y → X와 κ·Έ νŒλ³„μž DX에 λŒ€ν•΄μ„œλ„ μœ μ‚¬ν•œ μ λŒ€μ  손싀을 λ„μž…ν•©λ‹ˆλ‹€: minF maxDX LGAN(F, DX, Y, X).
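To make Equation 1 concrete, here is a minimal PyTorch-style sketch of the two sides of the adversarial game. This is an illustration only, not the authors' released implementation; DY is assumed to output raw logits, and fake_y stands for G(x):

```python
import torch
import torch.nn.functional as F_nn  # aliased so it does not clash with the mapping F

def discriminator_nll_loss(DY, real_y, fake_y):
    # DY maximizes log DY(y) + log(1 - DY(G(x))); equivalently it minimizes this BCE.
    pred_real = DY(real_y)
    pred_fake = DY(fake_y.detach())  # detach: do not backpropagate into the generator here
    loss_real = F_nn.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
    loss_fake = F_nn.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake))
    return loss_real + loss_fake

def generator_nll_loss(DY, fake_y):
    # G tries to make DY classify its outputs as real.
    pred_fake = DY(fake_y)
    return F_nn.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
```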

3.2. Cycle Consistency Loss

Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as target domains Y and X respectively (strictly speaking, this requires G and F to be stochastic functions) [15]. However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, adversarial losses alone cannot guarantee that the learned function can map an individual input x_i to a desired output y_i. To further reduce the space of possible mapping functions, we argue that the learned mapping functions should be cycle-consistent: as shown in Figure 3 (b), for each image x from domain X, the image translation cycle should be able to bring x back to the original image, i.e., x → G(x) → F(G(x)) ≈ x. We call this forward cycle consistency. Similarly, as illustrated in Figure 3 (c), for each image y from domain Y, G and F should also satisfy backward cycle consistency: y → F(y) → G(F(y)) ≈ y. We incentivize this behavior using a cycle consistency loss: L_cyc(G, F) = E_{x∼p_data(x)}[‖F(G(x)) − x‖_1] + E_{y∼p_data(y)}[‖G(F(y)) − y‖_1].   (2)


In preliminary experiments, we also tried replacing the L1 norm in this loss with an adversarial loss between F(G(x)) and x, and between G(F(y)) and y, but did not observe improved performance.

The behavior induced by the cycle consistency loss can be observed in Figure 4: the reconstructed images F(G(x)) end up matching closely to the input images x.

μ˜ˆλΉ„ μ‹€ν—˜μ—μ„œλŠ” 이 μ†μ‹€μ˜ L1 노름을 F(G(x))와 x, G(F(y))와 y κ°„μ˜ μ λŒ€μ  μ†μ‹€λ‘œ λŒ€μ²΄ν•΄ λ³΄μ•˜μ§€λ§Œ κ°œμ„ λœ μ„±λŠ₯을 κ΄€μ°°ν•˜μ§€ λͺ»ν–ˆμŠ΅λ‹ˆλ‹€.

μˆœν™˜ 일관성 손싀에 μ˜ν•΄ μœ λ„λ˜λŠ” λ™μž‘μ€ Figure 4μ—μ„œ κ΄€μ°°ν•  수 μžˆμŠ΅λ‹ˆλ‹€: μž¬κ΅¬μ„±λœ 이미지 F(G(x))λŠ” μž…λ ₯ 이미지 x와 κ·Όμ ‘ν•˜κ²Œ μΌμΉ˜ν•©λ‹ˆλ‹€.

3.3. Full Objective

Our full objective is: L(G, F, DX, DY) = L_GAN(G, DY, X, Y) + L_GAN(F, DX, Y, X) + λ L_cyc(G, F),   (3) where λ controls the relative importance of the two objectives. We aim to solve: G*, F* = arg min_{G,F} max_{DX,DY} L(G, F, DX, DY).   (4) Notice that our model can be viewed as training two "autoencoders" [20]: we learn one autoencoder F ∘ G : X → X jointly with another G ∘ F : Y → Y. However, these autoencoders each have special internal structures: they map an image to itself via an intermediate representation that is a translation of the image into another domain. Such a setup can also be seen as a special case of "adversarial autoencoders" [34], which use an adversarial loss to train the bottleneck layer of an autoencoder to match an arbitrary target distribution. In our case, the target distribution for the X → X autoencoder is that of the domain Y.

μ €ν¬μ˜ 전체 λͺ©μ μ€ λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€: L(G, F, DX, DY ) = LGAN(G, DY , X, Y ) + LGAN(F, DX, Y, X) + λLcyc(G, F), μ—¬κΈ°μ„œ λλŠ” 두 λͺ©μ μ˜ μƒλŒ€μ  μ€‘μš”λ„λ₯Ό μ‘°μ ˆν•©λ‹ˆλ‹€. μ €ν¬λŠ” λ‹€μŒμ„ ν•΄κ²°ν•˜κΈ° μœ„ν•΄ λ…Έλ ₯ν•©λ‹ˆλ‹€: G ∗ , F∗ = arg min G,F max Dx,DY L(G, F, DX, DY ). (4) 저희 λͺ¨λΈμ€ 두 개의 "μžλ™ 인코더" [20]λ₯Ό ν•™μŠ΅ν•˜λŠ” κ²ƒμœΌλ‘œ λ³Ό 수 μžˆμŠ΅λ‹ˆλ‹€: ν•˜λ‚˜λŠ” F β—¦ G : X → X μžλ™ 인코더이고, λ‹€λ₯Έ ν•˜λ‚˜λŠ” Gβ—¦F : Y → Y μžλ™ μΈμ½”λ”μž…λ‹ˆλ‹€. κ·ΈλŸ¬λ‚˜ μ΄λŸ¬ν•œ μžλ™ μΈμ½”λ”λŠ” 각각 νŠΉλ³„ν•œ λ‚΄λΆ€ ꡬ쑰λ₯Ό 가지고 μžˆμŠ΅λ‹ˆλ‹€: 이미지λ₯Ό 쀑간 ν‘œν˜„μœΌλ‘œ λ³€ν™˜ν•˜μ—¬ μžμ‹  μžμ‹ μ—κ²Œ λ§€ν•‘ν•©λ‹ˆλ‹€. μ΄λŸ¬ν•œ 섀정은 "μ λŒ€μ  μžλ™ 인코더" [34]의 νŠΉμˆ˜ν•œ κ²½μš°λ‘œλ„ λ³Ό 수 μžˆμŠ΅λ‹ˆλ‹€. 이 경우, X → X μžλ™ μΈμ½”λ”μ˜ λŒ€μƒ λΆ„ν¬λŠ” 도메인 Y의 뢄포와 μΌμΉ˜ν•˜λ„λ‘ μžλ™ μΈμ½”λ”μ˜ 병λͺ© 계측을 μ λŒ€μ  μ†μ‹€λ‘œ ν•™μŠ΅ν•©λ‹ˆλ‹€.

In Section 5.1.4, we compare our method against ablations of the full objective, including the adversarial loss LGAN alone and the cycle consistency loss Lcyc alone, and empirically show that both objectives play critical roles in arriving at high-quality results. We also evaluate our method with only cycle loss in one direction and show that a single cycle is not sufficient to regularize the training for this under-constrained problem.

5.1.4μ ˆμ—μ„œλŠ” 우리의 방법을 전체 λͺ©μ κ³Ό λΉ„κ΅ν•˜μ—¬ κ²€ν† ν•˜λ©°, λ‹¨λ…μœΌλ‘œ μ‚¬μš©λ˜λŠ” μ λŒ€μ  손싀 LGANκ³Ό μˆœν™˜ 일관성 손싀 Lcycλ₯Ό ν¬ν•¨ν•œ 좔둠에 λŒ€ν•΄ λΉ„κ΅ν•©λ‹ˆλ‹€. μš°λ¦¬λŠ” μ‹€ν—˜μ μœΌλ‘œ 이 두 가지 λͺ©μ μ΄ λͺ¨λ‘ κ³ ν™”μ§ˆ 결과에 μ€‘μš”ν•œ 역할을 ν•˜λŠ” 것을 μž…μ¦ν•©λ‹ˆλ‹€. λ˜ν•œ 단일 λ°©ν–₯의 μˆœν™˜ μ†μ‹€λ§Œ μ‚¬μš©ν•˜μ—¬ 우리의 방법을 ν‰κ°€ν•˜κ³ , μ΄λ ‡κ²Œ ν•˜λ©΄ 이런 λ―Έμ •μ˜ λ¬Έμ œμ— λŒ€ν•œ ν›ˆλ ¨μ„ κ·œμ œν•˜κΈ°μ—λŠ” 단일 μˆœν™˜λ§ŒμœΌλ‘œλŠ” μΆ©λΆ„ν•˜μ§€ μ•ŠμŒμ„ λ³΄μ—¬μ€λ‹ˆλ‹€.

4. Implementation

4.1 Network Architecture

We adopt the architecture for our generative networks from Johnson et al. [23] who have shown impressive results for neural style transfer and superresolution. This network contains three convolutions, several residual blocks [18], two fractionally-strided convolutions with stride 1/2, and one convolution that maps features to RGB. We use 6 blocks for 128 × 128 images and 9 blocks for 256 × 256 and higher-resolution training images. Similar to Johnson et al. [23], we use instance normalization [53]. For the discriminator networks we use 70 × 70 PatchGANs [22, 30, 29], which aim to classify whether 70 × 70 overlapping image patches are real or fake. Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator and can work on arbitrarily-sized images in a fully convolutional fashion [22].
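For illustration, a PatchGAN-style discriminator along these lines could be sketched as below. The exact channel widths and layer count are assumptions in the spirit of the description above, not the paper's appendix specification:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """A rough 70x70-PatchGAN-style discriminator sketch (illustrative, not the released model)."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()

        def block(c_in, c_out, stride, norm=True):
            layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=stride, padding=1)]
            if norm:
                layers.append(nn.InstanceNorm2d(c_out))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.model = nn.Sequential(
            *block(in_channels, base, stride=2, norm=False),
            *block(base, base * 2, stride=2),
            *block(base * 2, base * 4, stride=2),
            *block(base * 4, base * 8, stride=1),
            # One prediction per overlapping patch rather than one per image.
            nn.Conv2d(base * 8, 1, kernel_size=4, stride=1, padding=1),
        )

    def forward(self, x):
        return self.model(x)  # shape (N, 1, H', W'): a grid of real/fake scores
```

Because the output is a grid of patch scores rather than a single scalar, the same network can be applied fully convolutionally to images of arbitrary size, as the paragraph above notes.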

4.2 Training details

We apply two techniques from recent works to stabilize our model training procedure. First, for L_GAN (Equation 1), we replace the negative log likelihood objective by a least-squares loss [35]. This loss is more stable during training and generates higher quality results. In particular, for a GAN loss L_GAN(G, D, X, Y), we train the G to minimize E_{x∼p_data(x)}[(D(G(x)) − 1)^2] and train the D to minimize E_{y∼p_data(y)}[(D(y) − 1)^2] + E_{x∼p_data(x)}[D(G(x))^2]. Second, to reduce model oscillation [15], we follow Shrivastava et al.'s strategy [46] and update the discriminators using a history of generated images rather than the ones produced by the latest generators. We keep an image buffer that stores the 50 previously created images. For all the experiments, we set λ = 10 in Equation 3. We use the Adam solver [26] with a batch size of 1. All networks were trained from scratch with a learning rate of 0.0002. We keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs. Please see the appendix (Section 7) for more details about the datasets, architectures, and training procedures.

λ‘˜μ§Έλ‘œ, λͺ¨λΈ 진동을 쀄이기 μœ„ν•΄ Shrivastava λ“±μ˜ μ „λž΅ [46]을 따라 νŒλ³„μžλ₯Ό μ΅œμ‹  μƒμ„±μžκ°€ μƒμ„±ν•œ 이미지 λŒ€μ‹  이전에 μƒμ„±λœ μ΄λ―Έμ§€μ˜ 기둝을 μ‚¬μš©ν•˜μ—¬ μ—…λ°μ΄νŠΈν•©λ‹ˆλ‹€. μš°λ¦¬λŠ” 50개의 이전에 μƒμ„±λœ 이미지λ₯Ό μ €μž₯ν•˜λŠ” 이미지 버퍼λ₯Ό μœ μ§€ν•©λ‹ˆλ‹€. λͺ¨λ“  μ‹€ν—˜μ—μ„œ 식 3μ—μ„œ λ = 10으둜 μ„€μ •ν•©λ‹ˆλ‹€. 배치 크기가 1인 Adam 솔버 [26]λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€. λͺ¨λ“  λ„€νŠΈμ›Œν¬λŠ” ν•™μŠ΅λ₯  0.0002둜 μ²˜μŒλΆ€ν„° ν›ˆλ ¨λ˜μ—ˆμŠ΅λ‹ˆλ‹€. 처음 100 epoch에 λŒ€ν•΄ λ™μΌν•œ ν•™μŠ΅λ₯ μ„ μœ μ§€ν•˜κ³  λ‹€μŒ 100 epoch λ™μ•ˆ ν•™μŠ΅λ₯ μ„ μ„ ν˜•μ μœΌλ‘œ κ°μ†Œμ‹œμΌ°μŠ΅λ‹ˆλ‹€. 데이터셋, μ•„ν‚€ν…μ²˜ 및 ν›ˆλ ¨ μ ˆμ°¨μ— λŒ€ν•œ μžμ„Έν•œ λ‚΄μš©μ€ 뢀둝 (μ„Ήμ…˜ 7)λ₯Ό μ°Έμ‘°ν•΄μ£Όμ‹­μ‹œμ˜€.

5. Results

We first compare our approach against recent methods for unpaired image-to-image translation on paired datasets where ground truth input-output pairs are available for evaluation. We then study the importance of both the adversarial loss and the cycle consistency loss and compare our full method against several variants. Finally, we demonstrate the generality of our algorithm on a wide range of applications where paired data does not exist. For brevity, we refer to our method as CycleGAN. The PyTorch and Torch code, models, and full results can be found at our website.


5.1. Evaluation

Using the same evaluation datasets and metrics as "pix2pix" [22], we compare our method against several baselines both qualitatively and quantitatively. The tasks include semantic labels↔photo on the Cityscapes dataset [4], and map↔aerial photo on data scraped from Google Maps. We also perform an ablation study on the full loss function.


5.1.1 Evaluation Metrics

AMT perceptual studies: On the map↔aerial photo task, we run "real vs fake" perceptual studies on Amazon Mechanical Turk (AMT) to assess the realism of our outputs. We follow the same perceptual study protocol from Isola et al. [22], except we only gather data from 25 participants per algorithm we tested. Participants were shown a sequence of pairs of images, one a real photo or map and one fake (generated by our algorithm or a baseline), and asked to click on the image they thought was real. The first 10 trials of each session were practice and feedback was given as to whether the participant's response was correct or incorrect. The remaining 40 trials were used to assess the rate at which each algorithm fooled participants. Each session only tested a single algorithm, and participants were only allowed to complete a single session. The numbers we report here are not directly comparable to those in [22] as our ground truth images were processed slightly differently and the participant pool we tested may be differently distributed from those tested in [22] (due to running the experiment at a different date and time). Therefore, our numbers should only be used to compare our current method against the baselines (which were run under identical conditions), rather than against [22].


FCN score: Although perceptual studies may be the gold standard for assessing graphical realism, we also seek an automatic quantitative measure that does not require human experiments. For this, we adopt the "FCN score" from [22], and use it to evaluate the Cityscapes labels→photo task. The FCN metric evaluates how interpretable the generated photos are according to an off-the-shelf semantic segmentation algorithm (the fully-convolutional network, FCN, from [33]). The FCN predicts a label map for a generated photo. This label map can then be compared against the input ground truth labels using standard semantic segmentation metrics described below. The intuition is that if we generate a photo from a label map of "car on the road", then we have succeeded if the FCN applied to the generated photo detects "car on the road".

Semantic segmentation metrics: To evaluate the performance of photo→labels, we use the standard metrics from the Cityscapes benchmark [4], including per-pixel accuracy, per-class accuracy, and mean class Intersection-Over-Union (Class IOU) [4].

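All three segmentation metrics can be derived from a confusion matrix between predicted and ground-truth label maps. Here is a small illustrative sketch (not the official Cityscapes evaluation code); the inputs are assumed to be integer label arrays of the same shape:

```python
import numpy as np

def segmentation_scores(pred, gt, n_classes):
    """Per-pixel accuracy, per-class accuracy, and mean class IoU for integer label maps."""
    pred, gt = pred.flatten(), gt.flatten()
    # Confusion matrix: rows are ground-truth classes, columns are predicted classes.
    conf = np.bincount(gt * n_classes + pred, minlength=n_classes ** 2).reshape(n_classes, n_classes)
    tp = np.diag(conf)
    per_pixel_acc = tp.sum() / conf.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        per_class_acc = np.nanmean(tp / conf.sum(axis=1))
        class_iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    return per_pixel_acc, per_class_acc, np.nanmean(class_iou)
```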


<Paper Review>
1. Intro
※ Problems with the plain GAN loss in the unpaired-dataset setting
  • A GAN generator produces a single output image for a given input.
  • With unpaired data, x and y are not matched, so when some input image comes in, we do not know which ground-truth output image should be produced for it.
  • In other words, without a matched y we are simply trying to move the characteristics of the input x into domain Y. If the generator only has to produce some output that looks like the target domain, it can end up emitting the very same output no matter which input comes in.

This means the y produced from an input x may not be a meaningful y that actually corresponds to that x.

⇒ Put differently, even though every x carries different information, the network can return one identical output for all of them, and in doing so it throws away the information in x.

⇒ An additional constraint is needed to prevent this. ⇒ use of the cycle consistency loss

 

2. Related Work
  • CycleGAN trains so that G(x) can be reconstructed back into the original image x.
  • That is, the content of the original image is preserved and only the domain-related characteristics are changed (this fixes the limitation raised in the intro) → previously the information about x was lost; to prevent this, the model is trained to reconstruct, i.e., preserve, the information in x!
  • You can think of it as building two GANs, G and F ⇒ two generators and two discriminators.
  • G and F are inverses of each other.
  • Dx, Dy ⇒ play the role of judging whether a given image looks like a plausible image of domain X (resp. Y) or not.
  • Goal: F(G(x)) ~ x, G(F(y)) ~ y (trained so that the original image can be recovered).
  • Original image ⇒ image in domain Y (e.g., translated into a zebra) ⇒ back to the original image.
3. Formulation
  • L(GAN): the term that makes the generated image look plausible as an image of the target domain (the same form as the conditional GAN objective) ⇒ G: so that x can be turned into y.
  • L(cyc): the term that makes it possible to come back to the original image ⇒ x → return to the original (forward) ⇒ y → return to the original (backward).
  • That is, G(F(y)): the difference between the reconstructed y image and the real y.
  • That is, F(G(x)): the difference between the reconstructed image and the real input x.
4. Implementation

4.1 Network Architecture

  • Uses residual blocks + instance normalization (a small sketch of such a block follows this list).
  • Uses a discriminator that judges real vs. fake per patch within the image (PatchGAN) ⇒ advantage: prediction is done on N×N patches, and since the patch size N is much smaller than the full image, the discriminator has fewer parameters and a faster running time.
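For reference, one way such a residual block with instance normalization might look; this is an illustrative sketch of a Johnson-style block, with reflection padding assumed:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with instance normalization and an identity skip connection."""
    def __init__(self, channels=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # skip connection: the block only learns a residual

# The full generator stacks downsampling convolutions, 9 such blocks (for 256x256 inputs),
# two fractionally-strided convolutions back to full resolution, and a final RGB convolution.
```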

4.2 Training details

  • Least-squares loss: an MSE-based loss is used instead of the usual cross-entropy-based loss ⇒ training becomes more stable, and images closer to the real image distribution can be generated (a short sketch follows this list).
  • Replay buffer: the discriminators are updated using a history of previously generated images ⇒ the last 50 images produced by the generators are stored and reused for the discriminator updates.
  • ⇒ This is meant to reduce model oscillation and allows more stable training.
5. Experiments/ Limitations
  • The extra term L_identity that appears later is a loss used to keep the model from turning, e.g., a rising sun into a setting sun during CycleGAN training.
  • In other words, it is a loss added to preserve the color composition of the input in the output (a small sketch follows this list).
  • Performance is poor for cases such as a person riding a zebra, because no such images exist in the training data.
  • The shape of objects cannot be changed.

 
