School of Software Engineering
Faculty of Engineering
The University of New South Wales

Asymmetric Learned Similarity Search

by Ben Rohald

Thesis submitted as a requirement for the degree of Bachelor of Engineering in Software Engineering
Submitted: Nov 2019
Student ID: z5019999
Supervisor: Wei Wang

Abstract

The recent explosion in accessible digital information has led to a demand for fast similarity search, with applications including text comparison, recommendation engines, computer vision and even fraud detection [1]. The often high dimensionality of the data means that simple linear searches have become insufficient, giving rise to fields of research dedicated to solving the problem of search. The goal is to define an indexing procedure on high dimensional data such that locality is preserved in the indices, which can then be used for reduced-cost search. The majority of modern literature focuses on datasets with symmetric class labels, which permit several relaxed conditions. However, in many cases, assumptions of symmetry cannot be made. In this paper, we explore how asymmetry affects the learning process of modern hashing techniques. We propose a novel learning method and use it as a vehicle to reveal many of the subtle challenges associated with asymmetry. In doing so, we exhaustively demonstrate the model's inability to solve problems of this class. We identify the issues that prevent learning and subsequently suggest a secondary framework, one that utilises mutual information as a learning objective, to combat them. More generally, this paper serves as an in-depth analysis of modern approaches to tackling high dimensional similarity search.

Acknowledgements

My supervisor Wei for his guidance in times of fear and confusion. To my friends Jared and Adam, the MATLAB and emotional support you provided made the world of difference.

Abbreviations

CDF   Cumulative Distribution Function
KLSH  Kernelized Locality Sensitive Hashing
KNN   K-Nearest-Neighbours
LSH   Locality Sensitive Hashing
PCA   Principal Component Analysis
SVM   Support Vector Machine
LTH   Learning to Hash
MI    Mutual Information

Contents

Abstract
Acknowledgements
Abbreviations
Contents
Introduction
  1.1 Motivation
  1.2 Document Structure
Background
  2.1 Kernels
  2.2 Data-Independent Hashing Methods
    2.2.1 Locality Sensitive Hashing (LSH)
      2.2.1.1 Model Definition
      2.2.1.2 Query Processing
      2.2.1.3 Parameter Selection
    2.2.2 KLSH
  2.3 Data-Dependent Hashing Methods
    2.3.1 Adaptive Hashing
    2.3.2 FaceNet
    2.3.3 Deep Hashing
    2.3.4 Simultaneous Feature Learning
    2.3.5 Mutual Information
  2.4 Quantization as a Distance Metric
Class Asymmetry
  3.1 Definition
  3.2 Background
  3.3 Interclass vs Intraclass Variation
  3.4 Intelligent & Unselfish Batching
Stage 1
  4.1 Overview
  4.2 Model Architecture
    4.2.1 Data Preparation
    4.2.2 Network
    4.2.3 Loss & Batching
    4.2.4 Performance Measures
  4.3 Model Collapse
    4.3.1 Effects
    4.3.2 Regularization
  4.4 Conclusions
Stage 2
  5.1 Overview
  5.2 Model Architecture
    5.2.1 Network
    5.2.2 Loss
  5.3 Summary
Conclusion
  6.1 Conclusion
  6.2 Future Work
References

Chapter 1
Introduction

The digital information we consume continues to grow in quality and quantity, and as a result, so do the challenges associated with search. HD images, video, audio and text corpora are all examples of such data sources, and with big data becoming increasingly popular, there is no reason to believe this expansion will slow down. The search problem is generally referred to as the K-Nearest-Neighbour (KNN) problem, which is defined as follows: given a set of data points P ⊂ ℝ^d and a query point q ∈ ℝ^d, we wish to find the K nearest points in P to q under some distance metric d(x_1, x_2). This problem is particularly prominent in computer vision, as it forms the basis for facial recognition, verification and clustering.
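Although the thesis itself contains no code, the following minimal NumPy sketch makes the baseline concrete: it is the exhaustive linear scan that the next paragraph argues is infeasible at scale. The array sizes are placeholders chosen only to mirror the dimensionality of the audio dataset used later.

```python
import numpy as np

def knn_linear_scan(P: np.ndarray, q: np.ndarray, K: int) -> np.ndarray:
    """Return indices of the K nearest points in P to q under the l2 metric.

    P: (n, d) array of data points, q: (d,) query vector.
    Every query costs O(n * d), which is exactly what indexing methods aim to avoid.
    """
    dists = np.linalg.norm(P - q, axis=1)   # distance from q to every point
    return np.argsort(dists)[:K]            # indices of the K smallest distances

# toy usage with synthetic data
rng = np.random.default_rng(0)
P = rng.normal(size=(50_000, 192))
q = rng.normal(size=192)
print(knn_linear_scan(P, q, K=5))
```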
The naive approach of a linear scan through the entire dataset is no longer feasible, and more robust methods are required. Rather, it is desirable to define an indexing procedure on high dimensional data such that locality is preserved in the indices, which can then be used to reduce the cost of search. This paper explores seminal ideas relating to this topic, including foundational research papers and contemporary techniques.

1.1 Motivation

The inspiration for this thesis comes from [2], which suggests that all data structures can be replaced with other types of models, including deep learning models. More specifically, the authors suggest that the structures themselves should be self-indexing based on patterns in the data. Examples from the paper include B-Trees and Hash Maps, which are data-agnostic (like most commonly used data structures), placing the onus on the programmer to select a data structure based on their interpretation of the data's distribution. By extracting information from the data, the expectation is that the procedure can index the data points more effectively than data-agnostic methods. The authors note that a B-Tree effectively learns the CDF of the data it is indexing when considering the retrieval of a memory address from a key in a logical paging, continuous memory setting (Figure 1). This implies that other forms of models, such as regressions or neural networks, could do the same.

Figure 1: B-Tree position vs key graph approximating a CDF [2]

This idea can easily be extended to the case of high dimensional similarity search, wherein we aim to define a procedure for indexing data points by vectors of significantly lower dimension. In other words, we would like to index each data point p_i ∈ P ⊂ ℝ^d by a point z_i ∈ ℝ^m where m ≪ d. If locality is preserved in the indices, then search can be conducted on the indices themselves with reduced complexity, at the cost of some precomputation overhead. This could be further improved by applying a method similar to [2], in which the structure of the data is used to influence the indexing.

1.2 Document Structure

Chapter 2 provides an analysis of existing literature in the field, beginning with influential ideas that form a foundation for more advanced techniques before moving on to the state of the art. Certain papers are reviewed for their novel viewpoints and serve as inspiration for unconventional techniques. A complete problem is formulated after examining class asymmetry in more detail throughout Chapter 3. Finally, Chapters 4 and 5 outline the practical implementations and analyses of two attempted learning frameworks, referred to as Stage 1 and Stage 2 respectively. The inability of the preliminary model to learn in Stage 1 reveals many of the subtleties of an asymmetric class setting. These conclusive negative results yield an in-depth identification of numerous issues that Stage 2 aims to resolve.

Chapter 2
Background

A hash function is one which maps data from an arbitrary dimension to a fixed dimension. Broadly speaking, hashing methods can be divided into data-independent methods (2.2) and data-dependent methods (2.3). Data-independent methods include LSH [3, 4] and the later extended KLSH [5], among others [6]. Data-dependent methods include PCA, SVM and, more recently, LTH [7].
Learned methods are further split into supervised, unsupervised and semi-supervised. As will be demonstrated, some techniques allow labels to be computed on the fly, enabling unsupervised tasks to become supervised. Learned methods can also be segmented into online and batch learning. Although earlier research focused mainly on batch learning, this is often unrealistic in practice, as the distribution of the data may change or the whole dataset may not be available. As such, modern research has shifted toward online learning.

The remainder of this section is an exploration of existing literature that aims to solve the KNN problem with hashing techniques. In many cases, the methods that follow do not place an explicit emphasis on asymmetric class labelling. Despite this, it is crucial to understand how to solve the KNN problem in a broader sense before focusing more narrowly on the asymmetric case.

2.1 Kernels

A short preamble is provided on the notion of kernels, as they are a key component referenced throughout the remainder of this thesis. Consider a mapping φ: ℝ^d → ℝ^m that maps vectors in ℝ^d to a feature space ℝ^m. A kernel is a function that corresponds to the inner product of two vectors in this feature space, κ(x, y) = ⟨φ(x), φ(y)⟩. Kernel functions allow the inner product within a feature space to be computed without knowing what the space or φ is. The process of avoiding the explicit mapping is known as "the kernel trick".

A useful application of kernels is best illustrated through the following example, which forms the foundation of SVM. Consider trying to define a boundary between the yellow and purple points in Figure 2. Clearly, the data is not linearly separable. However, if these points were mapped to a higher dimensional space in which they were linearly separable, a decision boundary hyperplane could be obtained, as seen in Figure 3. Finally, the hyperplane could be mapped back into the original space to obtain a non-linear boundary, as seen in Figure 4.

Figure 2: Linearly inseparable dataset [8]. Figure 3: Decision boundary in feature space [8]. Figure 4: Decision boundary mapped to original space [8]

The process of finding the boundary in the feature space depends only on the dot product of support vectors, the points which lie closest to the max-margin hyperplane [9]. Therefore, the kernel method can be used to skip the explicit mapping and calculate the boundary directly. Kernels are also considered to be a measure of similarity. Intuitively, this justifies their inclusion in locality preserving hashing methods. In addition, they are able to capture basic non-linear relationships.

2.2 Data-Independent Hashing Methods

2.2.1 Locality Sensitive Hashing (LSH)

LSH was introduced as an alternative to the slow yet deterministic methods that preceded it, such as PCA [10]. It trades determinism for probabilistic success with improved speed. The intuition behind LSH is to randomly project data points into scalar buckets and use collections of these buckets as indices. Similar points should be hashed to the same buckets according to a probability distribution that is in some way related to the distance between the points in the original space, thereby preserving locality. The first approximate high dimensional similarity search with sub-linear dependence on data size was introduced in [11].
It is locality preserving in the sense that the probability of collision is higher for objects that are close together in the original space, although it is restricted to the binary Hamming space. [4] addressed this restriction by extending LSH to Euclidean space. Using p-stable distributions, they were able to generalize the proof to any ℓ_p norm for p ∈ (0, 2]. The results of the paper formed the foundation of modern LSH techniques that are widely used today. More specifically, they provide a mathematical justification for a family of locality preserving hash functions which can be used in Euclidean space. The fundamental result was a proof that the probability of index collision scales monotonically with the ℓ_p norm of the distance between data points in the original space. Using the results from [4], a formal definition of LSH is now provided.

2.2.1.1 Model Definition

Let the dimensionality of the dataset be d. A set of hash functions is randomly selected from the following LSH function family:

    h(x; a, b, w) = ⌊ (aᵀx + b) / w ⌋,   where a_j ~ N(0, 1) for all j ∈ [1, d] and b ~ U[0, w]    (1)

where w is a user specified parameter. The LSH index consists of l hash tables, each of which uses m hash functions. An object x is indexed by the concatenation of m hash values, denoted by

    g_i(x) ≜ (h_{i,1}(x), h_{i,2}(x), …, h_{i,m}(x))    (2)

where each h_{i,j} is randomly selected from the family of functions defined in (1). For an object o, the i-th hash table indexes the key-value pair (g_i(o), o). Intuitively, h can be seen as a projection into scalar buckets of width w. The offset b is added to avoid correlation between components, and the floor function is used for quantization. The Gaussian distribution is used as it is p-stable in ℓ_2 and is therefore guaranteed to retain locality. Similarly, the Cauchy distribution could be used in ℓ_1.

2.2.1.2 Query Processing

The following description of the query process assumes the single nearest neighbour is desired. Given a query q, the first step is to generate a candidate set from all l hash tables. In each hash table H_i, g_i(q) is used as the key to retrieve all the values (i.e. data points), defined by

    cand(q) = { x | ∃ i ∈ [1, l] : g_i(x) = g_i(q) }.

The distance between q and every point in cand(q) is computed, and the point with the minimum distance is returned. A success (failure) is the case where q's true nearest neighbour is (not) returned. Once pre-processing is complete, queries can be performed in constant time.
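As a concrete illustration of the index just defined, the sketch below builds l hash tables from the family in (1) and answers a query by re-ranking the union of the colliding buckets. The parameter values and table layout are illustrative choices made for this example, not an implementation taken from the literature.

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Minimal p-stable LSH index: l tables, each keyed by m concatenated hashes."""
    def __init__(self, d, m=8, l=10, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(l, m, d))        # projection vectors a ~ N(0, 1)
        self.B = rng.uniform(0, w, size=(l, m))    # offsets b ~ U[0, w]
        self.w = w
        self.tables = [defaultdict(list) for _ in range(l)]

    def _keys(self, x):
        # g_i(x) = (h_{i,1}(x), ..., h_{i,m}(x)) with h = floor((a.x + b) / w)
        return [tuple(np.floor((A_i @ x + b_i) / self.w).astype(int))
                for A_i, b_i in zip(self.A, self.B)]

    def add(self, idx, x):
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, X, q):
        # cand(q): union of colliding buckets, followed by an exact re-rank
        cand = {j for table, key in zip(self.tables, self._keys(q))
                for j in table.get(key, [])}
        if not cand:
            return None
        cand = np.fromiter(cand, dtype=int)
        return cand[np.argmin(np.linalg.norm(X[cand] - q, axis=1))]

# toy usage
rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 32))
index = LSHIndex(d=32)
for i, x in enumerate(X):
    index.add(i, x)
print(index.query(X, X[42] + 0.01 * rng.normal(size=32)))   # very likely returns 42
```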
2.2.1.3 Parameter Selection

If parameters are chosen according to [4, 11], LSH guarantees to return the true nearest neighbour with constant probability. However, this is usually considered impractical due to the large index size and query processing time, which is mainly driven by the large value of l. In addition, there are a number of user specified parameters that affect the performance of the model, namely m, l and w. As proposed in [3], a statistical model can be used to accurately predict the average search quality and latency based on a small sample of the dataset. Their method greatly decreases memory complexity and achieves quality predictions within 5% of the true value. The authors of [3] also define a procedure for automatic parameter tuning, thereby eliminating the issue of user specified parameters. Despite achieving high recall (the percentage of true KNN found), their method requires the empirical analysis of two distributions:

1. d: the distance between two arbitrary data points.
2. d_k: the distance between an arbitrary data point and its k-th nearest neighbour.

They found that there is no universal family of distributions that fits every possible dataset, meaning that the method is not truly generalizable while remaining data-agnostic.

2.2.2 KLSH

A serious shortcoming of LSH was that it supported only linear operations. Kernelized LSH [5] was introduced to generalize LSH to arbitrary kernel functions. This was of particular benefit to the computer vision community, as many successful vision results rely on kernel functions. In this setting, the measure of similarity moves from ℓ_p distance to the kernel function, with the nearest neighbour of q satisfying argmax_i κ(x_i, q) for a given kernel function κ. Importantly, they define the family of hash functions as

    h(x) = sign(wᵀ k(x) − w_0)    (3)

where k(x) = [κ(x_1, x), …, κ(x_p, x)]ᵀ for a set of points x_1, …, x_p sampled uniformly from the dataset. In order to maximize entropy (∑_{i=1}^{n} h(x_i) = 0), the bias term should be set to the median of the projections, which can be approximated by the mean

    w_0 = (1/n) ∑_{i=1}^{n} wᵀ k(x_i)    (4)

2.3 Data-Dependent Hashing Methods

Both the strength and the weakness of LSH lie in its agnosticism toward the data's distribution. Other methods have since been developed which attempt to eliminate the stochastic aspect of LSH whilst retaining locality preservation and ensuring low time complexity. Enter LTH, a set of data-dependent hashing approaches which aim to learn hash functions from the data. These methods are relatively new given the recent popularization of machine learning, yet they have set a performance benchmark that surpasses that of data-independent methods. LTH attempts to eliminate the parameter selection component of LSH by developing machine learning models that can obtain optimal values. It should be noted that much of the work remains in the Hamming space due to storage and comparison efficiency, despite the work from [4].

2.3.1 Adaptive Hashing

Building upon the hash functions from [5], [12] introduces a technique for supervised learning with kernels. They use the equivalence between Hamming distance and inner products to derive a least-squares style objective function which can be minimized using gradient descent. [13] combines the ideas of online hashing and machine learning, although it does not account for kernels in that it retains exclusively linear operations. Importantly, it introduced the idea of a hinge-like loss. Joining the ideas from [12, 13], [14] introduces a superior online learning technique that is comparable in accuracy to state of the art batch learning methods whilst remaining orders of magnitude faster. As a result of its importance, the key workings of [14] are now outlined.

The learning process involves taking pairs of samples x_1, x_2 and their similarity s_12 ∈ {−1, 1} and using them to iteratively adjust the parameters of the hash functions. The similarity matrix can be defined arbitrarily, either from labels or from a metric defined on the data space. Following [12], they use hash functions of the same form as equation (3).
Combining the terms, the family of hash functions is defined as

    f(x) = sign(wᵀ k̄(x)),   where k̄(x) = k(x) − (1/n) ∑_{i=1}^{n} k(x_i)    (5)

The binary code of x, F(x), is then defined as

    F(x) = sign(Wᵀ k̄(x)) = [f_1(x), …, f_b(x)]    (6)

where W = [w_1, …, w_b] ∈ ℝ^{p×b} and b is the dimension of the binary codes. The squared error loss defined in [12] is used, as it is attractive for gradient computations. The loss between two binary codes is defined as

    ℓ(F(x_i), F(x_j); W) = (F(x_i)ᵀ F(x_j) − b·s_ij)²    (7)

Using equation (7) they derive a gradient descent optimization method for learning the parameters of W. At this point it is key to identify that the least squares error alone is insufficient for training, as it may penalise examples where the Hamming distance is not optimal even though a perfect retrieval is possible. An example of this can be seen in Figure 5, in which an erroneous update is performed on f_2 based on the pair of points being considered.

Figure 5: Erroneous correction despite perfect retrieval [14]

Mathematically, this can be interpreted as a nonzero gradient when there is no need for an update. The authors of [14] therefore suggest an additional step before performing gradient descent, in which they determine which hash functions need to be updated, and by how much. The approach depends on the hinge-like loss function from [13], defined as

    ℓ_h(F(x_i), F(x_j)) = { max(0, d_H − (1 − α)b),  if s_ij = 1
                            max(0, αb − d_H),        if s_ij = −1    (8)

where d_H is the Hamming distance between F(x_i) and F(x_j), and α is a user defined parameter that determines the permitted margin of error. ⌈ℓ_h⌉ can be interpreted as the number of bits, and consequently the number of hash functions, that need to be corrected. The ⌈ℓ_h⌉ hash functions with the most erroneous mappings are then selected to be updated in the gradient descent step, whilst the others are left unmodified.

An investigation of [14]'s codebase made it clear that the RBF kernel was used. Noting that kernel choice can influence results, potential optimization methods become apparent. A key area of potential improvement is the uniformly sampled points used in (3).

2.3.2 FaceNet

Using learned methods, the authors of [15] were able to achieve superior facial recognition accuracy. Their success can be attributed to triplet loss and triplet selection, which are now explained. The goal is to find an embedding f(x) such that, for any given image of a particular person x_i^a (anchor), the distance to all other images of the same person x_i^p (positive) is less than the distance to any image of a different person x_i^n (negative). Thus, the following loss function is minimized:

    L = ∑_{i=1}^{N} [ ‖f(x_i^a) − f(x_i^p)‖_2² − ‖f(x_i^a) − f(x_i^n)‖_2² + α ]_+    (9)

where α is an enforced margin between positive and negative pairs.

Figure 6: Triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity [15]

The objective of the training process can be seen in Figure 6. Generating all possible triplets is not desirable, as many of them would not contribute to training and would delay model convergence. The method of triplet selection therefore becomes crucial to the process; a short sketch of the loss itself is given below before turning to selection.
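The sketch below spells out the triplet loss in (9) as a straightforward batch computation over precomputed embeddings. It is an illustration of the formula rather than FaceNet's implementation, and the margin value and batch shapes are placeholders.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Mean of [ ||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + alpha ]_+ over a batch.

    anchor, positive, negative: (N, m) arrays of already-embedded points.
    alpha: enforced margin between positive and negative pairs.
    """
    pos_d2 = np.sum((anchor - positive) ** 2, axis=1)   # squared distances to positives
    neg_d2 = np.sum((anchor - negative) ** 2, axis=1)   # squared distances to negatives
    return np.mean(np.maximum(0.0, pos_d2 - neg_d2 + alpha))

# toy usage: positives drawn near their anchors, negatives drawn elsewhere
rng = np.random.default_rng(0)
a = rng.normal(size=(64, 128))
p = a + 0.05 * rng.normal(size=a.shape)
n = rng.normal(size=a.shape)
print(triplet_loss(a, p, n))    # small, since most triplets already satisfy the margin
```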
For a given anchor x_i^a, [15] suggests selecting a hard positive x_i^p = argmax_{x_i^p} ‖f(x_i^a) − f(x_i^p)‖_2² and a hard negative x_i^n = argmin_{x_i^n} ‖f(x_i^a) − f(x_i^n)‖_2². Triplets are generated online within a minibatch due to computational complexity. A variety of enhancements can be made; however, the primary idea remains as described above. Although the tasks in [15] are specific to the domain of facial recognition and the Euclidean space, the intuition of their process still holds. The potential to incorporate some variant of the triplet loss into the learning process of [14] seems very promising. A generalization of [15] would be to consider input triplets of the form (x, x⁺, x⁻) where x is "more similar" to x⁺ than to x⁻.

2.3.3 Deep Hashing

Many of the learned methods reviewed so far have involved shallow architectures. However, deep multilayer networks have seen great success in the recent past. Deeper architectures aim to capture non-linear relationships by learning multiple projection matrices and using non-linear activation functions. Kernel methods already attempt to capture non-linear relationships; however, they suffer from scalability issues and remain relatively simple. [16] suggests an intuitive unsupervised, multilayer, fully connected network which takes in feature vectors and outputs a binarised representation vector. Non-linear activation functions are applied at each intermediate node, with the final output passed through a sign function for quantization. The optimization problem involves minimising the quantization loss between the output vectors and the original real valued vectors, maximising the variance of the output vectors, maximising the independence of the output vectors, and regularization terms. They then extend their algorithm to a supervised setting in which class labels are provided. Here they incorporate a triplet-loss-like approach in which two pairs of data are included in each training step; these are known as "siamese pairs". The output of the m-th layer for an input x_i is defined as

    h_i^m = s(W^m h_i^{m−1} + c^m)    (10)

where W^m and c^m are the weight matrix and bias of the m-th layer respectively, and s is an activation function.

2.3.4 Simultaneous Feature Learning

All of the techniques evaluated so far assume that the inputs are feature vectors. For the sake of simplicity, the extraction process from the underlying data source has thus far been ignored; however, [17] suggests that this should not be the case. The authors propose that, within a model, the feature extraction process may not be optimally compatible with the binarization, and that the two should be combined into one model. This would require a different model for each data type; however, it should be considered for total model optimization and is included for completeness.

2.3.5 Mutual Information

The authors of [14] later shifted their attention toward incorporating the information-theoretic quantity mutual information into LTH methods [18, 19]. Mutual information is briefly defined before examining their findings. The Kullback-Leibler (KL) divergence is a quantitative measure of the similarity between two distributions. It is defined in terms of entropy, meaning that it can be interpreted as the information gain of using one distribution over another. A KL divergence of 0 means the two distributions are identical.
More formally, the KL divergence is defined as

    D_KL(P ‖ Q) = − ∑_{x∈X} P(x) log_2 ( Q(x) / P(x) )    (11)

The mutual information between two variables is defined as the KL divergence between their joint distribution and the product of their marginals. Following (11), it can be interpreted as a quantitative measure of whether two distributions should be treated as independent.

    I(X; Y) = D_KL( P_(X,Y) ‖ P_X ⊗ P_Y )    (12)

[18] uses mutual information to address the problem of erroneously updating hash functions at every step in an online setting. The authors show that mutual information can be used to quantify a hash function's improvement, which can act as an indicator for updates based on whether significant improvements have been made. This allows mutual information to be included as an optimization step to further enhance results. Subsequently, they show that mutual information can act as an update criterion which can be used to train the model.

They further extend this work in [19] to an offline setting. This paper forms the foundation for Stage 2 and is briefly described for this reason. [19]'s primary contribution is the suggestion of mutual information as a learning objective. Here, an anchor point x̂ and a set of positive (x ∈ N_x̂) and negative (x ∉ N_x̂) points with respect to the anchor are selected. They are all embedded into the binary Hamming space, and the Hamming distance (d_Φ) between each point and the anchor x̂ is computed. Figure 7 shows a histogram of these Hamming distances separated by membership indicator. If the mapping is able to perfectly embed all positive points closer to the anchor than all negative points, then there is no overlap between the two histograms. Conveniently, the mutual information between the Hamming distances and the membership indicators serves as a quantitative measure of the overlap.

Figure 7: Relationship between Hamming distance histogram overlap and mutual information

The remainder of the paper provides relaxed gradient derivations for the mutual information, given that it is a discrete function. It also contains a generalization from the single anchor case to a minibatch setting. These are discussed in more detail in Stage 2.

2.4 Quantization as a Distance Metric

Unlike any of the previous papers, [20] suggests that a significant performance improvement can be achieved by considering the quantization step. Most LTH methods use Hamming ranking as their querying method, which relies on the Hamming distance as a distance measure. The authors argue that quantization distance (QD) serves as a much better distance metric and, as a result, provide a framework for fast querying known as "generate-to-probe quantization distance ranking" (GQR) that relies on QD. A comparison of QD and Hamming distance can be seen in Figure 8. When considering the probing order of buckets for query point q_1, Hamming distance is unable to distinguish between the buckets (1,0) and (0,1), whereas QD can identify that (1,0) should be probed first.

Figure 8: Quantization distance vs Hamming distance [20]

The general approach is to dynamically generate buckets to probe in sorted order of their quantization distance from the query point. The model agnostic results suggest that this should be the preferred querying method.
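To make the distinction concrete, the following toy example ranks buckets for a query under both measures for a two-bit code: Hamming distance counts flipped bits, while quantization distance weights each flipped bit by how far the corresponding projection sits from its threshold. The projection values and bucket set are invented for the illustration and are not taken from [20].

```python
import numpy as np
from itertools import product

def bucket_rankings(v, buckets):
    """Rank candidate buckets for a query whose raw projections are v (before sign)."""
    q_bits = (v > 0).astype(int)
    rows = []
    for c in buckets:
        flipped = np.array(c) != q_bits
        rows.append((c, int(flipped.sum()), float(np.abs(v)[flipped].sum())))
    by_hamming = sorted(rows, key=lambda r: r[1])   # count of differing bits
    by_qd = sorted(rows, key=lambda r: r[2])        # cost of moving projections across thresholds
    return by_hamming, by_qd

v = np.array([0.9, -0.05])                # bit 0 is confident, bit 1 is barely negative
buckets = list(product((0, 1), repeat=2))
hamming, qd = bucket_rankings(v, buckets)
print([r[0] for r in hamming])   # (0,0) and (1,1) tie at Hamming distance 1 from the query bucket (1,0)
print([r[0] for r in qd])        # QD breaks the tie: (1,1) is probed before (0,0), flipping bit 1 is cheap
```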
Chapter 3
Class Asymmetry

3.1 Definition

In a supervised learning setting, data is accompanied by metadata that serves to aid the learning process. In the context of nearest neighbour retrieval and much of the literature examined, these labels or classes usually group similar data points together into clusters. Take, for example, the now-famous MNIST dataset, in which each data point (an image of a handwritten digit) comes with a label indicating which digit the image contains. This kind of labelling is particularly prominent in image datasets, since there is a natural classification system based on what, or who, the image contains. It also provides a number of attractive properties. For example, distinct class labelling provides an inherent separability in the data, as seen in Figure 9.

Figure 9: MNIST dataset embedded into ℝ² [11]

In a search-based setting, and in particular in siamese or triplet-based learning models, this labelling system clearly identifies whether objects should be considered positive or negative. Objects of the same class as the anchor are considered positive, whilst all others are considered negative. The symmetry stems from the fact that all items within a class are positive with respect to each other. In other words, it guarantees that, for example, any image of a 5 could serve as a positive point for any anchor image that is also a 5, and that the reverse is true: the anchor and the positive points could be swapped and the pair or triplet would remain valid. Clearly, the labels are symmetric.

But there are many cases in which symmetry of this kind does not exist. Datasets that do not exhibit such distinct class divisions, or that do not come with labels, may need to use techniques such as distance-based metrics to derive labels. A simple example could specify that a point is positive with respect to an anchor if it is the anchor's single nearest neighbour. Figure 10 demonstrates the asymmetry that arises from a distance-based metric such as this one, since Y is positive for X, but X is not positive for Y.

Figure 10: Distance based asymmetry

Metric-based labels of this kind offer numerous benefits:

• They can be computed dynamically, requiring limited, if any, precomputation.
• They eliminate the need for human intervention. This is particularly attractive given the size of datasets and the cost of manual labour.
• They facilitate the adaptation of unsupervised environments to supervised ones.

However, an obvious hindrance is that the true structure of the data may not be adequately captured.

3.2 Background

It is somewhat by accident that the niche of asymmetric labelling is now the main focus of this thesis. In fact, a significant portion of work had been done before identifying the issues associated with asymmetric labelling that are outlined in the next sections. The dataset used was a collection of audio files from [21] that did not come with any class labels. Consequently, a generalization of the aforementioned approach was employed, in which a threshold was applied to the nearest neighbours by declaring that the KNN of each point are positive, and all others are negative. Cluster sampling determined that there were no clear groupings in this dataset that were of any value. Therefore, while arbitrary groupings could be defined, this would yield suboptimal results.
Rather, it became clear that a learning scheme that could accommodate asymmetric labels was desirable.

3.3 Interclass vs Intraclass Variation

It is interesting to note that learning to hash problems can be divided into modelling interclass variation and intraclass variation separately. The effect this has on learning depends on the particular problem, the number of classes and the number of items in each class. Taking facial recognition as an example, where there are a large number of classes compared to the number of data points in each class, there is a much greater emphasis on interclass variation. The prominence of this is identified in [22], wherein the objective function is divided into two terms that represent these variances. It is trivial to see that no such distinction can be made in cases where clustering is not applicable or desired. In fact, the asymmetric setting resolves into modelling either interclass variation exclusively, with clusters of size 1, or intraclass variation exclusively, with a single cluster that contains all elements. The resulting consequences are explored in more detail at the end of Stage 1.

3.4 Intelligent & Unselfish Batching

It is routine in machine learning to train a model in a minibatch setting by backpropagating over the average gradients of a batch of training examples. Intelligent batch generation can greatly improve the quality of a model, as certain training examples contribute more heavily to learning, resulting in faster convergence. Part of intelligent batch generation is making sure that batches are balanced across different class labels, which amounts to ensuring there are a similar number of items from each class. This is a trivial exercise when the dataset has symmetric labels: simply choose approximately BATCH_SIZE/N_CLASSES items from each class without replacement. A considerable number of techniques from the reviewed literature rely on balanced batch generation to work effectively [15, 16, 19, 22]. Based on the reduction outlined at the end of the previous section, it is evident that no equivalent exists in the asymmetric setting.

Related to the idea of balanced batching is a property we call "unselfish batching". Its importance is best illustrated through an example. Consider using a triplet loss objective function and generating all valid triplets from a balanced batch to use as training examples. In a symmetric setting this would involve many "similar" triplets, wherein two triplets contain the same data points but the anchor and positive points are swapped. These similar triplets are guaranteed to be valid by the assumed property of symmetry. Therefore, any item included in a batch acts both as the focus of learning (as an anchor) and as support for the learning of other points (as a non-anchor member of a triplet). This property is what we define as unselfishness. Asymmetry makes unselfish batches very challenging to attain, since it is unlikely that the data points included in a batch to support other anchors can also form valid triplets as anchors themselves. The asymmetric case of Figure 11 demonstrates this situation, as y will not be the focus of learning if the second triplet is invalid. This effectively amounts to having large batches in which only a few items truly contribute to learning, and is made significantly worse by large class imbalances between positive and negative set sizes.

Figure 11: Symmetry vs Unselfish Batching
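Before moving on to Stage 1, the following sketch quantifies the asymmetry described in this chapter by counting, on synthetic data, how many KNN-derived positive relations hold in only one direction. The data, dimensionality and value of K are arbitrary placeholders chosen for illustration.

```python
import numpy as np

def knn_indices(X, K):
    """Row i holds the indices of the K nearest neighbours of X[i] (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :K]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
K = 5
nn = knn_indices(X, K)

# positive[i, j] is True exactly when j is one of i's KNN, i.e. a valid positive for anchor i
positive = np.zeros((len(X), len(X)), dtype=bool)
rows = np.repeat(np.arange(len(X)), K)
positive[rows, nn.ravel()] = True

asymmetric = positive & ~positive.T     # j is positive for i, but i is not positive for j
print(asymmetric.sum(), "of", positive.sum(), "positive pairs are one-directional")
```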
Chapter 4
Stage 1

The following section outlines our first attempt at forming a generalizable learning framework for solving the KNN search problem. The preliminary model that follows was defined before the complexities of asymmetric labelling had been well understood. In fact, the model's inability to learn was the catalyst that led to the discovery of these complexities. The final conclusion of Stage 1 was that the model was incapable of learning an embedding that retained locality in an asymmetric setting. Despite this, a full account of the approach is included for completeness.

4.1 Overview

The goal of this stage was to develop a model that could solve the KNN problem using relatively simple statistical methods. This involves developing an embedding scheme that can project data into a new space while retaining locality. Intuitively, the model should learn to pull the KNN of each point toward it in the embedding space while pushing all other points away. We focus primarily on the offline setting, reserving the online setting as a future enhancement. Following common practice, the data is embedded into the binary Hamming space.

4.2 Model Architecture

4.2.1 Data Preparation

The audio dataset from [21] contains roughly 0.05 million audio vectors of dimension 192. The dataset comes split into two sets of sizes 53,000 and 200 respectively. The first was further split into a training set (90%) and a validation set (10%), with the second used as a test set. The data was then normalized to have a mean of 0 and a standard deviation of 1, the benefits of which are explored later. The true KNN of each point in the test and validation sets were computed to serve as synthetic class labels.

4.2.2 Network

The model architecture was inspired by [14], wherein the hash functions uniformly sample points from the dataset to use as "anchors" within RBF kernels. Intuitively, we aimed to optimise this approach by relaxing the restriction that these anchor points need to exist in the dataset. The resulting model is a shallow network that learns anchor points in the original space and measures whether data points fall within learnable distances of these anchors. It can be interpreted as a feedforward network with the input layer in ℝ^192 fully connected to a hidden layer in which each node outputs a single bit. More specifically, each node i in the hidden layer is parametrized by an anchor vector a_i ∈ ℝ^192 and a distance scalar b_i. The output of node i is o_i(x) = sign(‖a_i − x‖_2 − b_i), which can be seen as a binarized measure of whether x is within distance b_i of the anchor a_i. Because of its lack of continuity, the sign function is approximated with the tanh function and a binarization threshold of 0. We call this anchor dependent model an AnchorNet.

The normalization of the dataset in the pre-processing stage enables trivial initialization of the anchor points from a multivariate normal distribution. The initialization of the biases is slightly more complex, as learning will not occur unless there is sufficient variety in the output of the model. As such, each bias b_i is initialized to the mean distance between its associated anchor a_i and all items in the training set. An alternative, more intelligent strategy is to perform kmeans++ clustering with the number of clusters equal to the number of desired anchor points. The anchors then resolve to the centroids of the clusters, and the biases become the mean distance between each centroid and the items in its cluster. The effect of outliers in a cluster could be mitigated by using a percentile measure rather than the mean for bias initialization.
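A minimal sketch of the AnchorNet embedding described above follows, with the tanh relaxation used during training and the hard threshold used for indexing. It shows only the simple normal-distribution initialization, and all sizes are placeholders rather than the configuration used in our experiments.

```python
import numpy as np

class AnchorNet:
    """Shallow embedding: one bit per learnable (anchor, radius) pair."""
    def __init__(self, n_bits, dim, X_train, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(n_bits, dim))   # anchors a_i, valid because X is standardized
        # each b_i is initialized to the mean distance from a_i to the training set
        self.b = np.array([np.linalg.norm(X_train - a, axis=1).mean() for a in self.A])

    def forward(self, X):
        """Relaxed output in (-1, 1): tanh(||a_i - x|| - b_i) per bit."""
        dists = np.linalg.norm(X[:, None, :] - self.A[None, :, :], axis=2)   # (N, n_bits)
        return np.tanh(dists - self.b)

    def codes(self, X):
        """Hard binary codes used for indexing (threshold the relaxation at 0)."""
        return (self.forward(X) > 0).astype(np.int8)

# toy usage on standardized data
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 192))
net = AnchorNet(n_bits=32, dim=192, X_train=X)
print(net.codes(X[:3]))
```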
4.2.3 Loss & Batching

The loss function used to train the model was a triplet loss of the form of equation (9). For a given anchor point x, a positive point x⁺ is one that lies within x's KNN, and a negative point x⁻ is one that does not. Since the value of K is likely to be small, there will be a large disparity between the cardinalities of the positive and negative sets for an anchor. Attempts were made to mimic the advanced triplet selection methods from [15], but it became apparent that this was not possible due to the difficulty of generating intelligent batches, as outlined in 3.4 Intelligent & Unselfish Batching. A simpler approach was adopted, in which we iterate over a batch and generate one triplet per data point, with that point as the anchor.

A positive and a negative point are subsequently required for each triplet. Given that the value of K is likely to be small, a positive item can be uniformly sampled from the anchor's true KNN. Approaching negative selection in a similar manner, a collection of items should be generated from which a negative item can be uniformly sampled. The large size of the candidate set for this collection requires intelligent selection methods to avoid hindering the learning process. It was our original belief that the negative points contributing most to learning would be those closest to an anchor (close negatives) and those farthest from an anchor (far negatives). Experimentation determined that far negatives have a hindering effect on the learning process, as evidenced in Figure 12.

Figure 12: Effect of close/far negatives on recall

The problem of negative selection is therefore reduced to the sampling of close negatives. One option imposes a strong condition that uniformly samples an element from the N closest negatives, where N is a user specified hyperparameter. A more generalizable approach is to relax this constraint into something resembling the negative sampling approach from word2vec. In this setting, the probability of choosing a negative point is inversely proportional to its distance ranking from the anchor. An appropriate function for this purpose is the sigmoid

    y = k / (1 + e^{a(x − b)})

which is parametrized by height (k), gradient (a) and location (b) parameters.

Figure 13: PDF of sigmoid negative selection with k, a, b = 1, 0.5, 5
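The sketch below shows one way such a rank-weighted sampler could look, using the sigmoid above with the Figure 13 parameters as an unnormalized weight over distance ranks. The normalization and default values are assumptions made for the illustration, not the tuned settings used in our experiments.

```python
import numpy as np

def sample_close_negative(ranked_ids, knn_ids, rng, k=1.0, a=0.5, b=5.0):
    """Sample one negative for an anchor, favouring small distance ranks.

    ranked_ids: ids of all other points, sorted by distance to the anchor.
    knn_ids: the anchor's true KNN (positives, so excluded from sampling).
    The candidate at rank x receives unnormalized weight k / (1 + exp(a * (x - b))).
    """
    excluded = set(int(i) for i in knn_ids)
    candidates = np.array([int(i) for i in ranked_ids if int(i) not in excluded])
    ranks = np.arange(len(candidates))
    weights = k / (1.0 + np.exp(a * (ranks - b)))
    return rng.choice(candidates, p=weights / weights.sum())

# toy usage: ids 1..100 sorted by distance to some anchor, the first 5 being its true KNN
rng = np.random.default_rng(0)
ranked = np.arange(1, 101)
print(sample_close_negative(ranked, knn_ids=ranked[:5], rng=rng))
```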
4.2.4 Performance Measures

Recall was used to measure model performance, defined as the ratio of true positives to the sum of true positives and false negatives. In other words, it evaluates the percentage of true KNN returned in a candidate set. It should be noted that accuracy and recall are equal in this context. The querying method used to generate a candidate set was GQR from [20]. It depends only on a single parameter representing the number of candidates to collect. In practice, increasing this value leads to higher recall at the cost of slower computation. This leads to the performance metric "K@C", which represents the recall of the K nearest neighbours from a candidate set of size C. Our choices of K and C were 5 and 25 respectively.

A small modification to GQR was needed for the following reason. Consider the case where the output dimension of the model, b, is large (>100) and the entire training set is quantized into a relatively small number of buckets compared to the number of available buckets (2^b). The GQR probing method ensures that, if left to run for long enough, every bucket will eventually be probed. It is possible for isolated points to be embedded into isolated buckets. If this happens, the GQR method will continue to probe a large number of buckets, many of which are empty. Given that b is very large and the number of buckets grows exponentially with it, the number of buckets needing to be probed could be enormous. As a result, we imposed an upper limit of 1000 probed buckets before stopping.

4.3 Model Collapse

4.3.1 Effects

Using the techniques outlined above we were able to achieve 5@25 values of up to 90% when embedding into ℝ^128. Upon further investigation it became clear that the model was collapsing to such a large degree that these results were rendered invalid. A feature of the GQR querying algorithm is that, after selecting a bucket to probe, all elements in that bucket are sorted by their distance to the query point in the original space. In the most extreme case, a model could project every item in the dataset into the same bucket and attain 100% recall. A trade-off becomes evident between the number of items in each bucket and the query time required to sort the items within a bucket. In either case, large bucket cardinalities demonstrate that true learning has not taken place. We found that a large number of items were being projected into a very small number of buckets, necessitating an investigation into regularization.

Figure 14: Number of items in each bucket before training (kmeans++ initialization, left) vs after (right)

4.3.2 Regularization

Our original hypothesis was that model collapse could be avoided with regularization. The goal is to add a penalty term to the loss function that motivates the model to project items into distinct buckets. Several possible solutions are now outlined.

The simplest approach is a common regularization term which enforces that the anchors are orthogonal to each other: ‖AᵀA − I‖²_F, where A is the matrix of anchor vectors. However, this restriction on the anchors may prevent the model from learning the true distribution of the data by pushing the anchors away from optimal positions.

A more subtle suite of tactics involves placing restrictions on the output of the model rather than on the anchors themselves. As mentioned earlier, the loss function from [22] is split into learning the interclass and intraclass variations separately. The latter term enforces that the pairwise distances of points in the original space are in some way proportional to the pairwise distances of those same points in the embedding space:

    (1/2) ∑_{i,j} dist(f(x_i), f(x_j)) · s_{i,j}    (13)

Here, s_{i,j} is large when x_i and x_j are similar, and small when they are not. One drawback of this approach is that it is computationally expensive.

Alternatively, one could look at the distribution of the model's output and inspect its properties. It is well known that the uniform distribution has maximal entropy among all distributions over a fixed support. Thus, including a measure of the entropy of the model's output in the loss would promote a uniform distribution across buckets.
Similarly, one could investigate the skewness of this same distribution, since entropy and skewness are closely related.

4.4 Conclusions

Despite having numerous approaches to solving the issue of model collapse, none were implemented. The challenges associated with asymmetric labels had been identified by this stage, and with them came the realization that the Stage 1 model was incapable of learning a suitable embedding. The reasons for this are threefold:

1. The embedding process is too weak.
2. Triplet loss is not suitable for the asymmetric setting.
3. The loss function is only indirectly related to recall.

The problem with AnchorNet relates back to 3.3 Interclass vs Intraclass Variation, regardless of whether one interprets the objective as modelling exclusively interclass or intraclass variation. In essence, its structure only allows for distinguishing interclass variation between clusters of items defined by an anchor's inclusion boundary. It cannot characterize the variation between clusters of size 1 that are data points themselves, nor can it model intraclass variation at all for points within a cluster. Figure 15 demonstrates how the model has no way to differentiate between points that fall within the same cluster; since the initialization is done with kmeans++, the clusters will not overlap.

Figure 15: Inability of AnchorNet to model intraclass variation. The anchor, positive and negative will all map to the same bucket.

The triplet loss also poses a problem, since it attempts to enforce a fixed margin between positive and negative points and does not take into consideration the local density around an anchor. A fixed margin only makes sense in a setting where there are distinct class divisions and the margin is used to drive clusters apart. Therefore, even with a more suitable model structure, the triplet loss function is too weak for problems of this class. This was confirmed by replacing AnchorNet with a traditional weight matrix, common to many research papers, of the following form:

    f(x) = sign(Wᵀ φ(x) − w_0)    (14)

Even with this new embedding scheme, the model could not learn to a satisfactory standard.

A final issue is that the true performance metric, recall, is distinct from the loss function. At no stage during training does the model take recall into account, and even if it did, recall computation relies on a number of non-differentiable operations, making backpropagation on these results infeasible. In many cases we observed a decreasing loss value paired with static recall, due to a clear inconsistency between the loss function and retrieval performance.

Chapter 5
Stage 2

Armed with evidence for the inappropriateness of the existing approach and a more comprehensive understanding of the task at hand, we hoped to provide an alternative solution within the remaining timeframe of the project. Unfortunately, the limited timespan prevented the formulation of an end-to-end solution. In this section we present what we believe to be a more suitable framework for solving asymmetric similarity search. Despite a lack of empirical evidence, we present an exhaustive argument for its suitability.

5.1 Overview

The main motivation for Stage 2 comes from [19]. As already mentioned, this paper uses the information-theoretic quantity mutual information (MI) as a learning objective.
The authors discuss issues associated with affinity matching (correlating pairwise embedding distance to similarity in the original space) and local ranking (enforcing fixed margins between the positives and negatives of triplets). They also note that the loss functions in these methods are only indirectly related to retrieval performance, and that MI as a learning objective mitigates these issues. These concerns exactly match two of the three issues raised in 4.4 Conclusions. While [23] attempts to maximize KNN recall directly, the inherent relationship between MI and retrieval performance demonstrates its suitability as a performance metric. Similarly, MI is parameter free, meaning that there is no uniform margin applied to all points.

5.2 Model Architecture

5.2.1 Network

The specifics of the embedding process in [19] are not considered important. Rather, it is assumed that an arbitrary mapping Φ is applied to the data before it is passed to the sign function. It seems safe to assume that an embedding similar to (14) would suffice. Such an embedding is considered a suitable resolution to the model weakness outlined in 4.4 Conclusions. The sign function is approximated by a variant of the sigmoid function, leading to the embedding scheme

    Φ̂(x) = 2σ(γ Φ(x)) − 1 ∈ (−1, 1)^b    (15)

where γ is a steepness parameter.

5.2.2 Loss

Let the distribution of the embedded Hamming distances between an anchor x̂ and some set of points be denoted by D_{x̂,Φ}, and let C_{x̂} be the membership indicators (inclusion/exclusion in the anchor's KNN) of that set. As already mentioned in 2.3.5 Mutual Information, the MI between D_{x̂,Φ} and C_{x̂} quantifies the overlap between the Hamming distance distributions when separated by membership indicator. It is desirable to maximize this value, since it measures the degree to which D_{x̂,Φ} provides information about membership. Therefore, the objective function to be minimized is:

    − I(D_{x̂,Φ}; C_{x̂}) = H(C_{x̂} | D_{x̂,Φ}) − H(C_{x̂})    (16)

The derivations of MI thus far have depended on a single anchor. [19] outlines how to compute mutual information efficiently across a batch; however, this is predicated on intelligent batch generation. Although slower, the online setting should still provide a suitable learning environment. This process would involve considering each item in the dataset as an anchor and sampling some set of points to generate D_{x̂,Φ} and C_{x̂} for that anchor. The sampling methods here need not differ from those discussed in 4.2.3 Loss & Batching.

The discrete nature of mutual information requires the continuous approximation of several functions. Namely, the Hamming distance must be replaced by

    d̂_Φ(x, x̂) = ( b − Φ̂(x)ᵀ Φ̂(x̂) ) / 2    (17)

and the indicator function 1[d = v] must be replaced by the triangular kernel

    δ(d, v) = max{ 0, 1 − |d − v| / Δ }    (18)

for a bin width Δ. Once this is done, a differentiable approximation of MI can be computed for backpropagation.
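To make the relaxation concrete, the following sketch evaluates the single-anchor objective of (16) using the relaxed distance (17) and the triangular kernel (18), via the equivalent decomposition I(D; C) = H(D) − H(D | C). It is written in plain NumPy for clarity rather than inside an autodiff framework, and the unit bin spacing and toy inputs are assumptions for the example, not settings prescribed by [19].

```python
import numpy as np

def soft_histogram(dists, centers, delta=1.0):
    """Triangular-kernel histogram: each distance votes for nearby bin centers."""
    votes = np.maximum(0.0, 1.0 - np.abs(dists[:, None] - centers[None, :]) / delta)
    return votes.sum(axis=0)

def neg_mutual_information(code_anchor, codes_others, is_positive, n_bits, delta=1.0):
    """-I(D; C) for a single anchor, using relaxed codes in (-1, 1)."""
    d = (n_bits - codes_others @ code_anchor) / 2.0     # relaxed Hamming distances, eq. (17)
    centers = np.arange(n_bits + 1, dtype=float)        # one bin per possible distance
    h_pos = soft_histogram(d[is_positive], centers, delta)
    h_neg = soft_histogram(d[~is_positive], centers, delta)
    p_pos, p_neg = is_positive.mean(), 1.0 - is_positive.mean()
    p_d = (h_pos + h_neg) / (h_pos + h_neg).sum()        # P(D)
    p_d_pos = h_pos / max(h_pos.sum(), 1e-12)            # P(D | C = 1)
    p_d_neg = h_neg / max(h_neg.sum(), 1e-12)            # P(D | C = 0)

    def ent(p):
        p = p[p > 1e-12]
        return -(p * np.log2(p)).sum()

    mi = ent(p_d) - (p_pos * ent(p_d_pos) + p_neg * ent(p_d_neg))
    return -mi

# toy usage: positives share the anchor's sign pattern more often than negatives do
rng = np.random.default_rng(0)
anchor = np.tanh(rng.normal(size=32))
pos = np.tanh(anchor + 0.3 * rng.normal(size=(5, 32)))
neg = np.tanh(rng.normal(size=(50, 32)))
codes = np.vstack([pos, neg])
labels = np.array([True] * 5 + [False] * 50)
print(neg_mutual_information(anchor, codes, labels, n_bits=32))   # negative MI in bits; lower is better
```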
5.3 Summary

It is clear that, with some small modifications, MI as an objective function solves many of the issues faced in Stage 1. Despite lacking empirical evidence, we strongly believe that it constitutes a preliminary solution to the challenge of asymmetric similarity search.

Chapter 6
Conclusion

6.1 Conclusion

This thesis focuses on adapting modern hashing techniques to be suitable in an asymmetric label setting. Our main contributions include:

1. A comprehensive analysis of existing literature in the field, spanning various components of the machine learning lifecycle.
2. A detailed evaluation of the specific challenges that asymmetric class labels pose. The effects are demonstrated by empirical observation and further studied in a theoretical context.
3. An argument that mutual information as a learning objective solves many of the issues presented by asymmetry, together with the modifications to [19] necessary to do so.

As learning techniques approach performance benchmarks that exceed human capability, the logical progression in this field of study is to focus on generalization and optimization. Accommodating asymmetry serves as an intelligent avenue through which to dispel the common prerequisite of highly tailored datasets, which often results in poor generalizability. In addition, eliminating the need for human intervention moves us closer toward the ultimate goal of end-to-end model optimization. Despite being in its infancy, the study of asymmetry illuminates numerous inadequacies in modern similarity search techniques, presenting a unique opportunity for innovation.

6.2 Future Work

There is still a great deal of work to be done on the study of asymmetry. This paper intentionally focused on a simple ranking based metric for illustrative purposes, but there are numerous class derivation techniques that should be considered. The brevity of this thesis leaves much to be understood about the effects of asymmetry on the learning process. This involves evaluating a broader collection of existing methods to determine their compatibility with asymmetric datasets, and understanding the effects of asymmetry in more depth. Finally, there remains the task of proposing novel methods to accommodate asymmetry.

References

1. Yun Ni, K.C., Joseph Bradley. Detecting Abuse at Scale: Locality Sensitive Hashing at Uber Engineering. 2017. Available from: https://eng.uber.com/lsh/.
2. Kraska, T., et al. The Case for Learned Index Structures. 2018. ACM Press.
3. Dong, W., et al. Modeling LSH for performance tuning. 2008. ACM Press.
4. Datar, M., et al. Locality-sensitive hashing scheme based on p-stable distributions. 2004. ACM Press.
5. Brian Kulis, K.G. Kernelized Locality-Sensitive Hashing for Scalable Image Search. 2009.
6. Yair Weiss, A.T., Rob Fergus. Spectral Hashing. 2009.
7. Moran, S. Learning To Hash. Available from: https://learning2hash.github.io/.
8. Sicotte, X.B. Kernels and Feature maps: Theory and intuition. 2018. Available from: https://xavierbourretsicotte.github.io/Kernel_feature_map.html.
9. Ng, A. Support Vector Machines, in CS229 Lecture notes.
10. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. 1901.
11. Aristides Gionis, P.I., Rajeev Motwani. Similarity Search in High Dimensions via Hashing. Proceedings of the 25th International Conference on Very Large Data Bases, 1999: p. 518-529.
12. Wei Liu, J.W., Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. 2012.
13. Long-Kai Huang, Q.Y., Wei-Shi Zeng. Online Hashing. 2013.
14. Fatih Cakir, S.S. Adaptive Hashing for Fast Similarity Search. 2015.
15. Schroff, F., D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. 2015. IEEE.
16. Venice Erin Liong, J.L., Gang Wang, Pierre Moulin, Jie Zhou. Deep Hashing for Compact Binary Codes Learning. 2015.
17. Hanjiang Lai, Y.P., Ye Liu, and Shuicheng Yan. Simultaneous Feature Learning and Hash Coding with Deep Neural Networks. 2015.
18. Fatih Cakir, K.H., Sarah Adel Bargal, Stan Sclaroff. MIHash: Online Hashing with Mutual Information. 2017.
19. Fatih Cakir, K.H., Sarah Adel Bargal, Stan Sclaroff. Hashing with Mutual Information. CoRR, 2018.
20. Jinfeng Li, X.Y., Jian Zhang, An Xu, James Cheng, Jie Liu, Kelvin K. W. Ng, Ti-chung Cheng. A General and Efficient Querying Method for Learning to Hash. Proceedings of the 2018 International Conference on Management of Data, 2018.
21. Wang, W. NNS Benchmark: Evaluating Approximate Nearest Neighbor Search Algorithms in High Dimensional Euclidean Space. 2017. Available from: https://github.com/DBWangGroupUNSW/nns_benchmark.
22. Ruimao Zhang, L.L., Rui Zhang, Wangmeng Zuo, Lei Zhang. Bit-Scalable Deep Hashing with Regularized Similarity Learning for Image Retrieval and Person Re-identification. CoRR, 2015. abs/1508.04535.
23. Kun Ding, C.H., Bin Fan, Chunhong Pan. kNN Hashing with Factorized Neighborhood Representation. IEEE, 2015.