School of Software Engineering
Faculty of Engineering
The University of New South Wales

Asymmetric Learned Similarity Search

by Ben Rohald

Thesis submitted as a requirement for the degree of Bachelor of Engineering in Software Engineering
Submitted: Nov 2019
Student ID: z5019999
Supervisor: Wei Wang

Abstract

The recent explosion in accessible digital information has led to a demand for fast similarity search, with applications including text comparison, recommendation engines, computer vision and even fraud detection [1]. The often high dimensionality of the data means that simple linear searches have become insufficient, giving rise to fields of research dedicated to solving the problem of search. The goal is to define an indexing procedure on high dimensional data such that locality is preserved in the indices, which can then be used for reduced-cost search. The majority of modern literature focuses on datasets with symmetric class labels, which permit several relaxed conditions. However, in many cases, assumptions of symmetry cannot be made. In this paper, we explore how asymmetry affects the learning process of modern hashing techniques. We propose a novel learning method and use it as a vehicle to reveal many of the subtle challenges associated with asymmetry. In doing so, we exhaustively demonstrate the model's inability to solve problems of this class. We identify the issues that prevent learning and subsequently suggest a secondary framework, one that utilises mutual information as a learning objective, to combat them. More generally, this paper serves as an in-depth analysis of modern approaches to tackling high dimensional similarity search.

Acknowledgements

My supervisor Wei for his guidance in times of fear and confusion. To my friends Jared and Adam, the MATLAB and emotional support you provided made the world of difference.

Abbreviations

CDF   Cumulative Distribution Function
KLSH  Kernelized Locality Sensitive Hashing
KNN   K-Nearest-Neighbours
LSH   Locality Sensitive Hashing
PCA   Principal Component Analysis
SVM   Support Vector Machine
LTH   Learning to Hash
MI    Mutual Information

Contents

Abstract
Acknowledgements
Abbreviations
Contents
Introduction
  1.1 Motivation
  1.2 Document Structure
Background
  2.1 Kernels
  2.2 Data-Independent Hashing Methods
    2.2.1 Locality Sensitive Hashing (LSH)
      2.2.1.1 Model Definition
      2.2.1.2 Query Processing
      2.2.1.3 Parameter Selection
    2.2.2 KLSH
  2.3 Data-Dependent Hashing Methods
    2.3.1 Adaptive Hashing
    2.3.2 FaceNet
    2.3.3 Deep Hashing
    2.3.4 Simultaneous Feature Learning
    2.3.5 Mutual Information
  2.4 Quantization as a Distance Metric
Class Asymmetry
  3.1 Definition
  3.2 Background
  3.3 Interclass vs Intraclass Variation
  3.4 Intelligent & Unselfish Batching
Stage 1
  4.1 Overview
  4.2 Model Architecture
    4.2.1 Data Preparation
    4.2.2 Network
    4.2.3 Loss & Batching
    4.2.4 Performance Measures
  4.3 Model Collapse
    4.3.1 Effects
    4.3.2 Regularization
  4.4 Conclusions
Stage 2
  5.1 Overview
  5.2 Model Architecture
    5.2.1 Network
    5.2.2 Loss
  5.3 Summary
Conclusion
  6.1 Conclusion
  6.2 Future Work
References

Chapter 1
Introduction

The digital information we consume continues to grow in quality and quantity, and as a result, so do the challenges associated with search. HD images, video, audio and text corpora are all examples of such data sources, and with big data becoming increasingly popular, there is no reason to believe this expansion will slow down. The search problem is generally referred to as the K-Nearest-Neighbour (KNN) problem, which is defined as follows: given a set of data points P ⊂ ℝ^d and a query point q ∈ ℝ^d, we wish to find the K nearest points in P to q under some distance metric d(x_1, x_2). This problem is particularly prominent in computer vision, as it forms the basis for facial recognition, verification and clustering.
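Although the thesis itself contains no code, the following minimal NumPy sketch makes the baseline concrete: it is the exhaustive linear scan that the next paragraph argues is infeasible at scale. The array sizes are placeholders chosen only to mirror the dimensionality of the audio dataset used later.

```python
import numpy as np

def knn_linear_scan(P: np.ndarray, q: np.ndarray, K: int) -> np.ndarray:
    """Return indices of the K nearest points in P to q under the l2 metric.

    P: (n, d) array of data points, q: (d,) query vector.
    Every query costs O(n * d), which is exactly what indexing methods aim to avoid.
    """
    dists = np.linalg.norm(P - q, axis=1)   # distance from q to every point
    return np.argsort(dists)[:K]            # indices of the K smallest distances

# toy usage with synthetic data
rng = np.random.default_rng(0)
P = rng.normal(size=(50_000, 192))
q = rng.normal(size=192)
print(knn_linear_scan(P, q, K=5))
```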
The naive approach of a linear scan through the entire dataset is no longer feasible, and more robust methods are required. Rather, it is desirable to define an indexing procedure on high dimensional data such that locality is preserved in the indices, which can then be used to reduce the cost of search. This paper explores seminal ideas relating to this topic, including foundational research papers and contemporary techniques.

1.1 Motivation

The inspiration for this thesis comes from [2], which suggests that all data structures can be replaced with other types of models, including deep learning models. More specifically, the authors suggest that the structures themselves should be self-indexing based on patterns in the data. Examples from the paper include B-Trees and Hash Maps, which are data-agnostic (like most commonly used data structures), placing the onus on the programmer to select a data structure based on their interpretation of the data's distribution. By extracting information from the data, the expectation is that the procedure can index the data points more effectively than data-agnostic methods. The authors note that a B-Tree effectively learns the CDF of the data it is indexing when considering the retrieval of a memory address from a key in a logical paging, continuous memory setting (Figure 1). This implies that other forms of models, such as regressions or neural networks, could do the same.

Figure 1: B-Tree position vs key graph approximating a CDF [2]

This idea can easily be extended to the case of high dimensional similarity search, wherein we aim to define a procedure for indexing data points by vectors of significantly lower dimension. In other words, we would like to index each data point p_i ∈ P ⊂ ℝ^d by a point z_i ∈ ℝ^m where m ≪ d. If locality is preserved in the indices, then search can be conducted on the indices themselves with reduced complexity, at the cost of some precomputation overhead. This could be further improved by applying a method similar to [2], in which the structure of the data is used to influence the indexing.

1.2 Document Structure

Chapter 2 provides an analysis of existing literature in the field, beginning with influential ideas that form a foundation for more advanced techniques before moving on to the state of the art. Certain papers are reviewed for their novel viewpoints and serve as inspiration for unconventional techniques. A complete problem is formulated after examining class asymmetry in more detail throughout Chapter 3. Finally, Chapters 4 and 5 outline the practical implementations and analyses of two attempted learning frameworks, referred to as Stage 1 and Stage 2 respectively. The inability of the preliminary model to learn in Stage 1 reveals many of the subtleties of an asymmetric class setting. These conclusive negative results yield an in-depth identification of numerous issues that Stage 2 aims to resolve.

Chapter 2
Background

A hash function is one which maps data from an arbitrary dimension to a fixed dimension. Broadly speaking, hashing methods can be divided into data-independent methods (2.2) and data-dependent methods (2.3). Data-independent methods include LSH [3, 4] and the later extended KLSH [5], among others [6]. Data-dependent methods include PCA, SVM and, more recently, LTH [7].
Learned methods are further split into supervised, unsupervised and semi-supervised. As will be demonstrated, some techniques allow labels to be computed on the fly, enabling unsupervised tasks to become supervised. Learned methods can also be segmented into online and batch learning. Although earlier research focused mainly on batch learning, this is often unrealistic in practice, as the distribution of the data may change or the whole dataset may not be available. As such, modern research has shifted toward online learning.

The remainder of this section is an exploration of existing literature that aims to solve the KNN problem with hashing techniques. In many cases, the methods that follow do not place an explicit emphasis on asymmetric class labelling. Despite this, it is crucial to understand how to solve the KNN problem in a broader sense before focusing more narrowly on the asymmetric case.

2.1 Kernels

A short preamble is provided on the notion of kernels, as they are a key component referenced throughout the remainder of this thesis. Consider a mapping φ: ℝ^d → ℝ^m that maps vectors in ℝ^d to a feature space ℝ^m. A kernel is a function that corresponds to the inner product of two vectors in this feature space, κ(x, y) = ⟨φ(x), φ(y)⟩. Kernel functions allow the inner product within a feature space to be computed without knowing what the space or φ is. The process of avoiding the explicit mapping is known as "the kernel trick".

A useful application of kernels is best illustrated through the following example, which forms the foundation of SVM. Consider trying to define a boundary between the yellow and purple points in Figure 2. Clearly, the data is not linearly separable. However, if these points were mapped to a higher dimensional space in which they were linearly separable, a decision boundary hyperplane could be obtained, as seen in Figure 3. Finally, the hyperplane could be mapped back into the original space to obtain a non-linear boundary, as seen in Figure 4.

Figure 2: Linearly inseparable dataset [8]. Figure 3: Decision boundary in feature space [8]. Figure 4: Decision boundary mapped to original space [8]

The process of finding the boundary in the feature space depends only on the dot product of support vectors, the points which lie closest to the max-margin hyperplane [9]. Therefore, the kernel method can be used to skip the explicit mapping and calculate the boundary directly. Kernels are also considered to be a measure of similarity. Intuitively, this justifies their inclusion in locality preserving hashing methods. In addition, they are able to capture basic non-linear relationships.

2.2 Data-Independent Hashing Methods

2.2.1 Locality Sensitive Hashing (LSH)

LSH was introduced as an alternative to the slow yet deterministic methods that preceded it, such as PCA [10]. It trades determinism for probabilistic success with improved speed. The intuition behind LSH is to randomly project data points into scalar buckets and use collections of these buckets as indices. Similar points should be hashed to the same buckets according to a probability distribution that is in some way related to the distance between the points in the original space, thereby preserving locality. The first approximate high dimensional similarity search with sub-linear dependence on data size was introduced in [11].
It is locality preserving in the sense that the probability of collision is higher for objects that are close together in the original space, although it is restricted to the binary Hamming space. [4] addressed this restriction by extending LSH to Euclidean space. Using p-stable distributions, they were able to generalize the proof to any ℓ_p norm for p ∈ (0, 2]. The results of the paper formed the foundation of modern LSH techniques that are widely used today. More specifically, they provide a mathematical justification for a family of locality preserving hash functions which can be used in Euclidean space. The fundamental result was a proof that the probability of index collision scales monotonically with the ℓ_p norm of the distance between data points in the original space. Using the results from [4], a formal definition of LSH is now provided.

2.2.1.1 Model Definition

Let the dimensionality of the dataset be d. A set of hash functions is randomly selected from the following LSH function family:

    h(x; a, b, w) = ⌊ (aᵀx + b) / w ⌋,   where a_j ~ N(0, 1) for all j ∈ [1, d] and b ~ U[0, w]    (1)

where w is a user specified parameter. The LSH index consists of l hash tables, each of which uses m hash functions. An object x is indexed by the concatenation of m hash values, denoted by

    g_i(x) ≜ (h_{i,1}(x), h_{i,2}(x), …, h_{i,m}(x))    (2)

where each h_{i,j} is randomly selected from the family of functions defined in (1). For an object o, the i-th hash table indexes the key-value pair (g_i(o), o). Intuitively, h can be seen as a projection into scalar buckets of width w. The offset b is added to avoid correlation between components, and the floor function is used for quantization. The Gaussian distribution is used as it is p-stable in ℓ_2 and is therefore guaranteed to retain locality. Similarly, the Cauchy distribution could be used in ℓ_1.

2.2.1.2 Query Processing

The following description of the query process assumes the single nearest neighbour is desired. Given a query q, the first step is to generate a candidate set from all l hash tables. In each hash table H_i, g_i(q) is used as the key to retrieve all the values (i.e. data points), defined by

    cand(q) = { x | ∃ i ∈ [1, l] : g_i(x) = g_i(q) }.

The distance between q and every point in cand(q) is computed, and the point with the minimum distance is returned. A success (failure) is the case where q's true nearest neighbour is (not) returned. Once pre-processing is complete, queries can be performed in constant time.
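As a concrete illustration of the index just defined, the sketch below builds l hash tables from the family in (1) and answers a query by re-ranking the union of the colliding buckets. The parameter values and table layout are illustrative choices made for this example, not an implementation taken from the literature.

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Minimal p-stable LSH index: l tables, each keyed by m concatenated hashes."""
    def __init__(self, d, m=8, l=10, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(l, m, d))        # projection vectors a ~ N(0, 1)
        self.B = rng.uniform(0, w, size=(l, m))    # offsets b ~ U[0, w]
        self.w = w
        self.tables = [defaultdict(list) for _ in range(l)]

    def _keys(self, x):
        # g_i(x) = (h_{i,1}(x), ..., h_{i,m}(x)) with h = floor((a.x + b) / w)
        return [tuple(np.floor((A_i @ x + b_i) / self.w).astype(int))
                for A_i, b_i in zip(self.A, self.B)]

    def add(self, idx, x):
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, X, q):
        # cand(q): union of colliding buckets, followed by an exact re-rank
        cand = {j for table, key in zip(self.tables, self._keys(q))
                for j in table.get(key, [])}
        if not cand:
            return None
        cand = np.fromiter(cand, dtype=int)
        return cand[np.argmin(np.linalg.norm(X[cand] - q, axis=1))]

# toy usage
rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 32))
index = LSHIndex(d=32)
for i, x in enumerate(X):
    index.add(i, x)
print(index.query(X, X[42] + 0.01 * rng.normal(size=32)))   # very likely returns 42
```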
2.2.1.3 Parameter Selection

If parameters are chosen according to [4, 11], LSH guarantees to return the true nearest neighbour with constant probability. However, this is usually considered impractical due to the large index size and query processing time, which is mainly driven by the large value of l. In addition, there are a number of user specified parameters that affect the performance of the model, namely m, l and w. As proposed in [3], a statistical model can be used to accurately predict the average search quality and latency based on a small sample of the dataset. Their method greatly decreases memory complexity and achieves quality predictions within 5% of the true value. The authors of [3] also define a procedure for automatic parameter tuning, thereby eliminating the issue of user specified parameters. Despite achieving high recall (the percentage of true KNN found), their method requires the empirical analysis of two distributions:

1. d: the distance between two arbitrary data points.
2. d_k: the distance between an arbitrary data point and its k-th nearest neighbour.

They found that there is no universal family of distributions that fits every possible dataset, meaning that the method is not truly generalizable while remaining data-agnostic.

2.2.2 KLSH

A serious shortcoming of LSH was that it supported only linear operations. Kernelized LSH [5] was introduced to generalize LSH to arbitrary kernel functions. This was of particular benefit to the computer vision community, as many successful vision results rely on kernel functions. In this setting, the measure of similarity moves from ℓ_p distance to the kernel function, with the nearest neighbour of q satisfying argmax_i κ(x_i, q) for a given kernel function κ. Importantly, they define the family of hash functions as

    h(x) = sign(wᵀ k(x) − w_0)    (3)

where k(x) = [κ(x_1, x), …, κ(x_p, x)]ᵀ for a set of points x_1, …, x_p sampled uniformly from the dataset. In order to maximize entropy (∑_{i=1}^{n} h(x_i) = 0), the bias term should be set to the median of the projections, which can be approximated by the mean

    w_0 = (1/n) ∑_{i=1}^{n} wᵀ k(x_i)    (4)

2.3 Data-Dependent Hashing Methods

Both the strength and the weakness of LSH lie in its agnosticism toward the data's distribution. Other methods have since been developed which attempt to eliminate the stochastic aspect of LSH whilst retaining locality preservation and ensuring low time complexity. Enter LTH, a set of data-dependent hashing approaches which aim to learn hash functions from the data. These methods are relatively new given the recent popularization of machine learning, yet they have set a performance benchmark that surpasses that of data-independent methods. LTH attempts to eliminate the parameter selection component of LSH by developing machine learning models that can obtain optimal values. It should be noted that much of the work remains in the Hamming space due to storage and comparison efficiency, despite the work from [4].

2.3.1 Adaptive Hashing

Building upon the hash functions from [5], [12] introduces a technique for supervised learning with kernels. They use the equivalence between Hamming distance and inner products to derive a least-squares style objective function which can be minimized using gradient descent. [13] combines the ideas of online hashing and machine learning, although it does not account for kernels in that it retains exclusively linear operations. Importantly, it introduced the idea of a hinge-like loss. Joining the ideas from [12, 13], [14] introduces a superior online learning technique that is comparable in accuracy to state of the art batch learning methods whilst remaining orders of magnitude faster. As a result of its importance, the key workings of [14] are now outlined.

The learning process involves taking pairs of samples x_1, x_2 and their similarity s_12 ∈ {−1, 1} and using them to iteratively adjust the parameters of the hash functions. The similarity matrix can be defined arbitrarily, either from labels or from a metric defined on the data space. Following [12], they use hash functions of the same form as equation (3).
Combining the terms, the family of hash functions is defined as

    f(x) = sign(wᵀ k̄(x)),   where k̄(x) = k(x) − (1/n) ∑_{i=1}^{n} k(x_i)    (5)

The binary code of x, F(x), is then defined as

    F(x) = sign(Wᵀ k̄(x)) = [f_1(x), …, f_b(x)]    (6)

where W = [w_1, …, w_b] ∈ ℝ^{p×b} and b is the dimension of the binary codes. The squared error loss defined in [12] is used, as it is attractive for gradient computations. The loss between two binary codes is defined as

    ℓ(F(x_i), F(x_j); W) = (F(x_i)ᵀ F(x_j) − b·s_ij)²    (7)

Using equation (7) they derive a gradient descent optimization method for learning the parameters of W. At this point it is key to identify that the least squares error alone is insufficient for training, as it may penalise examples where the Hamming distance is not optimal even though a perfect retrieval is possible. An example of this can be seen in Figure 5, in which an erroneous update is performed on f_2 based on the pair of points being considered.

Figure 5: Erroneous correction despite perfect retrieval [14]

Mathematically, this can be interpreted as a nonzero gradient when there is no need for an update. The authors of [14] therefore suggest an additional step before performing gradient descent, in which they determine which hash functions need to be updated, and by how much. The approach depends on the hinge-like loss function from [13], defined as

    ℓ_h(F(x_i), F(x_j)) = { max(0, d_H − (1 − α)b),  if s_ij = 1
                            max(0, αb − d_H),        if s_ij = −1    (8)

where d_H is the Hamming distance between F(x_i) and F(x_j), and α is a user defined parameter that determines the permitted margin of error. ⌈ℓ_h⌉ can be interpreted as the number of bits, and consequently the number of hash functions, that need to be corrected. The ⌈ℓ_h⌉ hash functions with the most erroneous mappings are then selected to be updated in the gradient descent step, whilst the others are left unmodified.

An investigation of [14]'s codebase made it clear that the RBF kernel was used. Noting that kernel choice can influence results, potential optimization methods become apparent. A key area of potential improvement is the uniformly sampled points used in (3).

2.3.2 FaceNet

Using learned methods, the authors of [15] were able to achieve superior facial recognition accuracy. Their success can be attributed to triplet loss and triplet selection, which are now explained. The goal is to find an embedding f(x) such that, for any given image of a particular person x_i^a (anchor), the distance to all other images of the same person x_i^p (positive) is less than the distance to any image of a different person x_i^n (negative). Thus, the following loss function is minimized:

    L = ∑_{i=1}^{N} [ ‖f(x_i^a) − f(x_i^p)‖_2² − ‖f(x_i^a) − f(x_i^n)‖_2² + α ]_+    (9)

where α is an enforced margin between positive and negative pairs.

Figure 6: Triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity [15]

The objective of the training process can be seen in Figure 6. Generating all possible triplets is not desirable, as many of them would not contribute to training and would delay model convergence. The method of triplet selection therefore becomes crucial to the process; a short sketch of the loss itself is given below before turning to selection.
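The sketch below spells out the triplet loss in (9) as a straightforward batch computation over precomputed embeddings. It is an illustration of the formula rather than FaceNet's implementation, and the margin value and batch shapes are placeholders.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Mean of [ ||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + alpha ]_+ over a batch.

    anchor, positive, negative: (N, m) arrays of already-embedded points.
    alpha: enforced margin between positive and negative pairs.
    """
    pos_d2 = np.sum((anchor - positive) ** 2, axis=1)   # squared distances to positives
    neg_d2 = np.sum((anchor - negative) ** 2, axis=1)   # squared distances to negatives
    return np.mean(np.maximum(0.0, pos_d2 - neg_d2 + alpha))

# toy usage: positives drawn near their anchors, negatives drawn elsewhere
rng = np.random.default_rng(0)
a = rng.normal(size=(64, 128))
p = a + 0.05 * rng.normal(size=a.shape)
n = rng.normal(size=a.shape)
print(triplet_loss(a, p, n))    # small, since most triplets already satisfy the margin
```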
For a given anchor x_i^a, [15] suggests selecting a hard positive x_i^p = argmax_{x_i^p} ‖f(x_i^a) − f(x_i^p)‖_2² and a hard negative x_i^n = argmin_{x_i^n} ‖f(x_i^a) − f(x_i^n)‖_2². Triplets are generated online within a minibatch due to computational complexity. A variety of enhancements can be made; however, the primary idea remains as described above. Although the tasks in [15] are specific to the domain of facial recognition and the Euclidean space, the intuition of their process still holds. The potential to incorporate some variant of the triplet loss into the learning process of [14] seems very promising. A generalization of [15] would be to consider input triplets of the form (x, x⁺, x⁻) where x is "more similar" to x⁺ than to x⁻.

2.3.3 Deep Hashing

Many of the learned methods reviewed so far have involved shallow architectures. However, deep multilayer networks have seen great success in the recent past. Deeper architectures aim to capture non-linear relationships by learning multiple projection matrices and using non-linear activation functions. Kernel methods already attempt to capture non-linear relationships; however, they suffer from scalability issues and remain relatively simple. [16] suggests an intuitive unsupervised, multilayer, fully connected network which takes in feature vectors and outputs a binarised representation vector. Non-linear activation functions are applied at each intermediate node, with the final output passed through a sign function for quantization. The optimization problem involves minimising the quantization loss between the output vectors and the original real valued vectors, maximising the variance of the output vectors, maximising the independence of the output vectors, and regularization terms. They then extend their algorithm to a supervised setting in which class labels are provided. Here they incorporate a triplet-loss-like approach in which two pairs of data are included in each training step; these are known as "siamese pairs". The output of the m-th layer for an input x_i is defined as

    h_i^m = s(W^m h_i^{m−1} + c^m)    (10)

where W^m and c^m are the weight matrix and bias of the m-th layer respectively, and s is an activation function.

2.3.4 Simultaneous Feature Learning

All of the techniques evaluated so far assume that the inputs are feature vectors. For the sake of simplicity, the extraction process from the underlying data source has thus far been ignored; however, [17] suggests that this should not be the case. The authors propose that, within a model, the feature extraction process may not be optimally compatible with the binarization, and that the two should be combined into one model. This would require a different model for each data type; however, it should be considered for total model optimization and is included for completeness.

2.3.5 Mutual Information

The authors of [14] later shifted their attention toward incorporating the information-theoretic quantity mutual information into LTH methods [18, 19]. Mutual information is briefly defined before examining their findings. The Kullback-Leibler (KL) divergence is a quantitative measure of the similarity between two distributions. It is defined in terms of entropy, meaning that it can be interpreted as the information gain of using one distribution over another. A KL divergence of 0 means the two distributions are identical.
More formally, the KL divergence is defined as

    D_KL(P ‖ Q) = − ∑_{x∈X} P(x) log_2 ( Q(x) / P(x) )    (11)

The mutual information between two variables is defined as the KL divergence between their joint distribution and the product of their marginals. Following (11), it can be interpreted as a quantitative measure of whether two distributions should be treated as independent.

    I(X; Y) = D_KL( P_(X,Y) ‖ P_X ⊗ P_Y )    (12)

[18] uses mutual information to address the problem of erroneously updating hash functions at every step in an online setting. The authors show that mutual information can be used to quantify a hash function's improvement, which can act as an indicator for updates based on whether significant improvements have been made. This allows mutual information to be included as an optimization step to further enhance results. Subsequently, they show that mutual information can act as an update criterion which can be used to train the model.

They further extend this work in [19] to an offline setting. This paper forms the foundation for Stage 2 and is briefly described for this reason. [19]'s primary contribution is the suggestion of mutual information as a learning objective. Here, an anchor point x̂ and a set of positive (x ∈ N_x̂) and negative (x ∉ N_x̂) points with respect to the anchor are selected. They are all embedded into the binary Hamming space, and the Hamming distance (d_Φ) between each point and the anchor x̂ is computed. Figure 7 shows a histogram of these Hamming distances separated by membership indicator. If the mapping is able to perfectly embed all positive points closer to the anchor than all negative points, then there is no overlap between the two histograms. Conveniently, the mutual information between the Hamming distances and the membership indicators serves as a quantitative measure of the overlap.

Figure 7: Relationship between Hamming distance histogram overlap and mutual information

The remainder of the paper provides relaxed gradient derivations for the mutual information, given that it is a discrete function. It also contains a generalization from the single anchor case to a minibatch setting. These are discussed in more detail in Stage 2.

2.4 Quantization as a Distance Metric

Unlike any of the previous papers, [20] suggests that a significant performance improvement can be achieved by considering the quantization step. Most LTH methods use Hamming ranking as their querying method, which relies on the Hamming distance as a distance measure. The authors argue that quantization distance (QD) serves as a much better distance metric and, as a result, provide a framework for fast querying known as "generate-to-probe quantization distance ranking" (GQR) that relies on QD. A comparison of QD and Hamming distance can be seen in Figure 8. When considering the probing order of buckets for query point q_1, Hamming distance is unable to distinguish between the buckets (1,0) and (0,1), whereas QD can identify that (1,0) should be probed first.

Figure 8: Quantization distance vs Hamming distance [20]

The general approach is to dynamically generate buckets to probe in sorted order of their quantization distance from the query point. The model agnostic results suggest that this should be the preferred querying method.
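To make the distinction concrete, the following toy example ranks buckets for a query under both measures for a two-bit code: Hamming distance counts flipped bits, while quantization distance weights each flipped bit by how far the corresponding projection sits from its threshold. The projection values and bucket set are invented for the illustration and are not taken from [20].

```python
import numpy as np
from itertools import product

def bucket_rankings(v, buckets):
    """Rank candidate buckets for a query whose raw projections are v (before sign)."""
    q_bits = (v > 0).astype(int)
    rows = []
    for c in buckets:
        flipped = np.array(c) != q_bits
        rows.append((c, int(flipped.sum()), float(np.abs(v)[flipped].sum())))
    by_hamming = sorted(rows, key=lambda r: r[1])   # count of differing bits
    by_qd = sorted(rows, key=lambda r: r[2])        # cost of moving projections across thresholds
    return by_hamming, by_qd

v = np.array([0.9, -0.05])                # bit 0 is confident, bit 1 is barely negative
buckets = list(product((0, 1), repeat=2))
hamming, qd = bucket_rankings(v, buckets)
print([r[0] for r in hamming])   # (0,0) and (1,1) tie at Hamming distance 1 from the query bucket (1,0)
print([r[0] for r in qd])        # QD breaks the tie: (1,1) is probed before (0,0), flipping bit 1 is cheap
```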
Chapter 3
Class Asymmetry

3.1 Definition

In a supervised learning setting, data is accompanied by metadata that serves to aid the learning process. In the context of nearest neighbour retrieval and much of the literature examined, these labels or classes usually group similar data points together into clusters. Take, for example, the now-famous MNIST dataset, in which each data point (an image of a handwritten digit) comes with a label indicating which digit the image contains. This kind of labelling is particularly prominent in image datasets, since there is a natural classification system based on what, or who, the image contains. It also provides a number of attractive properties. For example, distinct class labelling provides an inherent separability in the data, as seen in Figure 9.

Figure 9: MNIST dataset embedded into ℝ² [11]

In a search-based setting, and in particular in siamese or triplet-based learning models, this labelling system clearly identifies whether objects should be considered positive or negative. Objects of the same class as the anchor are considered positive, whilst all others are considered negative. The symmetry stems from the fact that all items within a class are positive with respect to each other. In other words, it guarantees that, for example, any image of a 5 could serve as a positive point for any anchor image that is also a 5, and that the reverse is true: the anchor and the positive points could be swapped and the pair or triplet would remain valid. Clearly, the labels are symmetric.

But there are many cases in which symmetry of this kind does not exist. Datasets that do not exhibit such distinct class divisions, or that do not come with labels, may need to use techniques such as distance-based metrics to derive labels. A simple example could specify that a point is positive with respect to an anchor if it is the anchor's single nearest neighbour. Figure 10 demonstrates the asymmetry that arises from a distance-based metric such as this one, since Y is positive for X, but X is not positive for Y.

Figure 10: Distance based asymmetry

Metric-based labels of this kind offer numerous benefits:

• They can be computed dynamically, requiring limited, if any, precomputation.
• They eliminate the need for human intervention. This is particularly attractive given the size of datasets and the cost of manual labour.
• They facilitate the adaptation of unsupervised environments to supervised ones.

However, an obvious hindrance is that the true structure of the data may not be adequately captured.

3.2 Background

It is somewhat by accident that the niche of asymmetric labelling is now the main focus of this thesis. In fact, a significant portion of work had been done before identifying the issues associated with asymmetric labelling that are outlined in the next sections. The dataset used was a collection of audio files from [21] that did not come with any class labels. Consequently, a generalization of the aforementioned approach was employed, in which a threshold was applied to the nearest neighbours by declaring that the KNN of each point are positive, and all others are negative. Cluster sampling determined that there were no clear groupings in this dataset that were of any value. Therefore, while arbitrary groupings could be defined, this would yield suboptimal results.
Rather, it became clear that a learning scheme that could accommodate asymmetric labels was desirable.

3.3 Interclass vs Intraclass Variation

It is interesting to note that learning to hash problems can be divided into modelling interclass variation and intraclass variation separately. The effect this has on learning depends on the particular problem, the number of classes and the number of items in each class. Taking facial recognition as an example, where there are a large number of classes compared to the number of data points in each class, there is a much greater emphasis on interclass variation. The prominence of this is identified in [22], wherein the objective function is divided into two terms that represent these variances. It is trivial to see that no such distinction can be made in cases where clustering is not applicable or desired. In fact, the asymmetric setting resolves into modelling either interclass variation exclusively, with clusters of size 1, or intraclass variation exclusively, with a single cluster that contains all elements. The resulting consequences are explored in more detail at the end of Stage 1.

3.4 Intelligent & Unselfish Batching

It is routine in machine learning to train a model in a minibatch setting by backpropagating over the average gradients of a batch of training examples. Intelligent batch generation can greatly improve the quality of a model, as certain training examples contribute more heavily to learning, resulting in faster convergence. Part of intelligent batch generation is making sure that batches are balanced across different class labels, which amounts to ensuring there are a similar number of items from each class. This is a trivial exercise when the dataset has symmetric labels: simply choose approximately BATCH_SIZE/N_CLASSES items from each class without replacement. A considerable number of techniques from the reviewed literature rely on balanced batch generation to work effectively [15, 16, 19, 22]. Based on the reduction outlined at the end of the previous section, it is evident that no equivalent exists in the asymmetric setting.

Related to the idea of balanced batching is a property we call "unselfish batching". Its importance is best illustrated through an example. Consider using a triplet loss objective function and generating all valid triplets from a balanced batch to use as training examples. In a symmetric setting this would involve many "similar" triplets, wherein two triplets contain the same data points but the anchor and positive points are swapped. These similar triplets are guaranteed to be valid by the assumed property of symmetry. Therefore, any item included in a batch acts both as the focus of learning (as an anchor) and as support for the learning of other points (as a non-anchor member of a triplet). This property is what we define as unselfishness. Asymmetry makes unselfish batches very challenging to attain, since it is unlikely that the data points included in a batch to support other anchors can also form valid triplets as anchors themselves. The asymmetric case of Figure 11 demonstrates this situation, as y will not be the focus of learning if the second triplet is invalid. This effectively amounts to having large batches in which only a few items truly contribute to learning, and is made significantly worse by large class imbalances between positive and negative set sizes.

Figure 11: Symmetry vs Unselfish Batching
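Before moving on to Stage 1, the following sketch quantifies the asymmetry described in this chapter by counting, on synthetic data, how many KNN-derived positive relations hold in only one direction. The data, dimensionality and value of K are arbitrary placeholders chosen for illustration.

```python
import numpy as np

def knn_indices(X, K):
    """Row i holds the indices of the K nearest neighbours of X[i] (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :K]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
K = 5
nn = knn_indices(X, K)

# positive[i, j] is True exactly when j is one of i's KNN, i.e. a valid positive for anchor i
positive = np.zeros((len(X), len(X)), dtype=bool)
rows = np.repeat(np.arange(len(X)), K)
positive[rows, nn.ravel()] = True

asymmetric = positive & ~positive.T     # j is positive for i, but i is not positive for j
print(asymmetric.sum(), "of", positive.sum(), "positive pairs are one-directional")
```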
Chapter 4
Stage 1

The following section outlines our first attempt at forming a generalizable learning framework for solving the KNN search problem. The preliminary model that follows was defined before the complexities of asymmetric labelling had been well understood. In fact, the model's inability to learn was the catalyst that led to the discovery of these complexities. The final conclusion of Stage 1 was that the model was incapable of learning an embedding that retained locality in an asymmetric setting. Despite this, a full account of the approach is included for completeness.

4.1 Overview

The goal of this stage was to develop a model that could solve the KNN problem using relatively simple statistical methods. This involves developing an embedding scheme that can project data into a new space while retaining locality. Intuitively, the model should learn to pull the KNN of each point toward it in the embedding space while pushing all other points away. We focus primarily on the offline setting, reserving the online setting as a future enhancement. Following common practice, the data is embedded into the binary Hamming space.

4.2 Model Architecture

4.2.1 Data Preparation

The audio dataset from [21] contains roughly 0.05 million audio vectors of dimension 192. The dataset comes split into two sets of sizes 53,000 and 200 respectively. The first was further split into a training set (90%) and a validation set (10%), with the second used as a test set. The data was then normalized to have a mean of 0 and a standard deviation of 1, the benefits of which are explored later. The true KNN of each point in the test and validation sets were computed to serve as synthetic class labels.

4.2.2 Network

The model architecture was inspired by [14], wherein the hash functions uniformly sample points from the dataset to use as "anchors" within RBF kernels. Intuitively, we aimed to optimise this approach by relaxing the restriction that these anchor points need to exist in the dataset. The resulting model is a shallow network that learns anchor points in the original space and measures whether data points fall within learnable distances of these anchors. It can be interpreted as a feedforward network with the input layer in ℝ^192 fully connected to a hidden layer in which each node outputs a single bit. More specifically, each node i in the hidden layer is parametrized by an anchor vector a_i ∈ ℝ^192 and a distance scalar b_i. The output of node i is o_i(x) = sign(‖a_i − x‖_2 − b_i), which can be seen as a binarized measure of whether x is within distance b_i of the anchor a_i. Because of its lack of continuity, the sign function is approximated with the tanh function and a binarization threshold of 0. We call this anchor dependent model an AnchorNet.

The normalization of the dataset in the pre-processing stage enables trivial initialization of the anchor points from a multivariate normal distribution. The initialization of the biases is slightly more complex, as learning will not occur unless there is sufficient variety in the output of the model. As such, each bias b_i is initialized to the mean distance between its associated anchor a_i and all items in the training set. An alternative, more intelligent strategy is to perform kmeans++ clustering with the number of clusters equal to the number of desired anchor points. The anchors then resolve to the centroids of the clusters, and the biases become the mean distance between each centroid and the items in its cluster. The effect of outliers in a cluster could be mitigated by using a percentile measure rather than the mean for bias initialization.
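A minimal sketch of the AnchorNet embedding described above follows, with the tanh relaxation used during training and the hard threshold used for indexing. It shows only the simple normal-distribution initialization, and all sizes are placeholders rather than the configuration used in our experiments.

```python
import numpy as np

class AnchorNet:
    """Shallow embedding: one bit per learnable (anchor, radius) pair."""
    def __init__(self, n_bits, dim, X_train, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(n_bits, dim))   # anchors a_i, valid because X is standardized
        # each b_i is initialized to the mean distance from a_i to the training set
        self.b = np.array([np.linalg.norm(X_train - a, axis=1).mean() for a in self.A])

    def forward(self, X):
        """Relaxed output in (-1, 1): tanh(||a_i - x|| - b_i) per bit."""
        dists = np.linalg.norm(X[:, None, :] - self.A[None, :, :], axis=2)   # (N, n_bits)
        return np.tanh(dists - self.b)

    def codes(self, X):
        """Hard binary codes used for indexing (threshold the relaxation at 0)."""
        return (self.forward(X) > 0).astype(np.int8)

# toy usage on standardized data
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 192))
net = AnchorNet(n_bits=32, dim=192, X_train=X)
print(net.codes(X[:3]))
```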
4.2.3 Loss & Batching

The loss function used to train the model was a triplet loss of the form of equation (9). For a given anchor point x, a positive point x⁺ is one that lies within x's KNN, and a negative point x⁻ is one that does not. Since the value of K is likely to be small, there will be a large disparity between the cardinalities of the positive and negative sets for an anchor. Attempts were made to mimic the advanced triplet selection methods from [15], but it became apparent that this was not possible due to the difficulty of generating intelligent batches, as outlined in 3.4 Intelligent & Unselfish Batching. A simpler approach was adopted, in which we iterate over a batch and generate one triplet per data point, with that point as the anchor.

A positive and a negative point are subsequently required for each triplet. Given that the value of K is likely to be small, a positive item can be uniformly sampled from the anchor's true KNN. Approaching negative selection in a similar manner, a collection of items should be generated from which a negative item can be uniformly sampled. The large size of the candidate set for this collection requires intelligent selection methods to avoid hindering the learning process. It was our original belief that the negative points contributing most to learning would be those closest to an anchor (close negatives) and those farthest from an anchor (far negatives). Experimentation determined that far negatives have a hindering effect on the learning process, as evidenced in Figure 12.

Figure 12: Effect of close/far negatives on recall

The problem of negative selection is therefore reduced to the sampling of close negatives. One option imposes a strong condition that uniformly samples an element from the N closest negatives, where N is a user specified hyperparameter. A more generalizable approach is to relax this constraint into something resembling the negative sampling approach from word2vec. In this setting, the probability of choosing a negative point is inversely proportional to its distance ranking from the anchor. An appropriate function for this purpose is the sigmoid

    y = k / (1 + e^{a(x − b)})

which is parametrized by height (k), gradient (a) and location (b) parameters.

Figure 13: PDF of sigmoid negative selection with k, a, b = 1, 0.5, 5
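The sketch below shows one way such a rank-weighted sampler could look, using the sigmoid above with the Figure 13 parameters as an unnormalized weight over distance ranks. The normalization and default values are assumptions made for the illustration, not the tuned settings used in our experiments.

```python
import numpy as np

def sample_close_negative(ranked_ids, knn_ids, rng, k=1.0, a=0.5, b=5.0):
    """Sample one negative for an anchor, favouring small distance ranks.

    ranked_ids: ids of all other points, sorted by distance to the anchor.
    knn_ids: the anchor's true KNN (positives, so excluded from sampling).
    The candidate at rank x receives unnormalized weight k / (1 + exp(a * (x - b))).
    """
    excluded = set(int(i) for i in knn_ids)
    candidates = np.array([int(i) for i in ranked_ids if int(i) not in excluded])
    ranks = np.arange(len(candidates))
    weights = k / (1.0 + np.exp(a * (ranks - b)))
    return rng.choice(candidates, p=weights / weights.sum())

# toy usage: ids 1..100 sorted by distance to some anchor, the first 5 being its true KNN
rng = np.random.default_rng(0)
ranked = np.arange(1, 101)
print(sample_close_negative(ranked, knn_ids=ranked[:5], rng=rng))
```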
4.2.4 Performance Measures

Recall was used to measure model performance, defined as the ratio of true positives to the sum of true positives and false negatives. In other words, it evaluates the percentage of true KNN returned in a candidate set. It should be noted that accuracy and recall are equal in this context. The querying method used to generate a candidate set was GQR from [20]. It depends only on a single parameter representing the number of candidates to collect. In practice, increasing this value leads to higher recall at the cost of slower computation. This leads to the performance metric "K@C", which represents the recall of the K nearest neighbours from a candidate set of size C. Our choices of K and C were 5 and 25 respectively.

A small modification to GQR was needed for the following reason. Consider the case where the output dimension of the model, b, is large (>100) and the entire training set is quantized into a relatively small number of buckets compared to the number of available buckets (2^b). The GQR probing method ensures that, if left to run for long enough, every bucket will eventually be probed. It is possible for isolated points to be embedded into isolated buckets. If this happens, the GQR method will continue to probe a large number of buckets, many of which are empty. Given that b is very large and the number of buckets grows exponentially with it, the number of buckets needing to be probed could be enormous. As a result, we imposed an upper limit of 1000 probed buckets before stopping.

4.3 Model Collapse

4.3.1 Effects

Using the techniques outlined above we were able to achieve 5@25 values of up to 90% when embedding into ℝ^128. Upon further investigation it became clear that the model was collapsing to such a large degree that these results were rendered invalid. A feature of the GQR querying algorithm is that, after selecting a bucket to probe, all elements in that bucket are sorted by their distance to the query point in the original space. In the most extreme case, a model could project every item in the dataset into the same bucket and attain 100% recall. A trade-off becomes evident between the number of items in each bucket and the query time required to sort the items within a bucket. In either case, large bucket cardinalities demonstrate that true learning has not taken place. We found that a large number of items were being projected into a very small number of buckets, necessitating an investigation into regularization.

Figure 14: Number of items in each bucket before training (kmeans++ initialization, left) vs after (right)

4.3.2 Regularization

Our original hypothesis was that model collapse could be avoided with regularization. The goal is to add a penalty term to the loss function that motivates the model to project items into distinct buckets. Several possible solutions are now outlined.

The simplest approach is a common regularization term which enforces that the anchors are orthogonal to each other: ‖AᵀA − I‖²_F, where A is the matrix of anchor vectors. However, this restriction on the anchors may prevent the model from learning the true distribution of the data by pushing the anchors away from optimal positions.

A more subtle suite of tactics involves placing restrictions on the output of the model rather than on the anchors themselves. As mentioned earlier, the loss function from [22] is split into learning the interclass and intraclass variations separately. The latter term enforces that the pairwise distances of points in the original space are in some way proportional to the pairwise distances of those same points in the embedding space:

    (1/2) ∑_{i,j} dist(f(x_i), f(x_j)) · s_{i,j}    (13)

Here, s_{i,j} is large when x_i and x_j are similar, and small when they are not. One drawback of this approach is that it is computationally expensive.

Alternatively, one could look at the distribution of the model's output and inspect its properties. It is well known that the uniform distribution has maximal entropy among all distributions over a fixed support. Thus, including a measure of the entropy of the model's output in the loss would promote a uniform distribution across buckets.
Similarly, one could investigate the skewness of this same distribution, since entropy and skewness are closely related.

4.4 Conclusions

Despite having numerous approaches to solving the issue of model collapse, none were implemented. The challenges associated with asymmetric labels had been identified by this stage, and with them came the realization that the Stage 1 model was incapable of learning a suitable embedding. The reasons for this are threefold:

1. The embedding process is too weak.
2. Triplet loss is not suitable for the asymmetric setting.
3. The loss function is only indirectly related to recall.

The problem with AnchorNet relates back to 3.3 Interclass vs Intraclass Variation, regardless of whether one interprets the objective as modelling exclusively interclass or intraclass variation. In essence, its structure only allows for distinguishing interclass variation between clusters of items defined by an anchor's inclusion boundary. It cannot characterize the variation between clusters of size 1 that are data points themselves, nor can it model intraclass variation at all for points within a cluster. Figure 15 demonstrates how the model has no way to differentiate between points that fall within the same cluster; since the initialization is done with kmeans++, the clusters will not overlap.

Figure 15: Inability of AnchorNet to model intraclass variation. The anchor, positive and negative will all map to the same bucket.

The triplet loss also poses a problem, since it attempts to enforce a fixed margin between positive and negative points and does not take into consideration the local density around an anchor. A fixed margin only makes sense in a setting where there are distinct class divisions and the margin is used to drive clusters apart. Therefore, even with a more suitable model structure, the triplet loss function is too weak for problems of this class. This was confirmed by replacing AnchorNet with a traditional weight matrix, common to many research papers, of the following form:

    f(x) = sign(Wᵀ φ(x) − w_0)    (14)

Even with this new embedding scheme, the model could not learn to a satisfactory standard.

A final issue is that the true performance metric, recall, is distinct from the loss function. At no stage during training does the model take recall into account, and even if it did, recall computation relies on a number of non-differentiable operations, making backpropagation on these results infeasible. In many cases we observed a decreasing loss value paired with static recall, due to a clear inconsistency between the loss function and retrieval performance.

Chapter 5
Stage 2

Armed with evidence for the inappropriateness of the existing approach and a more comprehensive understanding of the task at hand, we hoped to provide an alternative solution within the remaining timeframe of the project. Unfortunately, the limited timespan prevented the formulation of an end-to-end solution. In this section we present what we believe to be a more suitable framework for solving asymmetric similarity search. Despite a lack of empirical evidence, we present an exhaustive argument for its suitability.

5.1 Overview

The main motivation for Stage 2 comes from [19]. As already mentioned, this paper uses the information-theoretic quantity mutual information (MI) as a learning objective.
The authors discuss issues associated with affinity matching (correlating pairwise embedding distance to similarity in the original space) and local ranking (enforcing fixed margins between the positives and negatives of triplets). They also note that the loss functions in these methods are only indirectly related to retrieval performance, and that MI as a learning objective mitigates these issues. These concerns exactly match two of the three issues raised in 4.4 Conclusions. While [23] attempts to maximize KNN recall directly, the inherent relationship between MI and retrieval performance demonstrates its suitability as a performance metric. Similarly, MI is parameter free, meaning that there is no uniform margin applied to all points.

5.2 Model Architecture

5.2.1 Network

The specifics of the embedding process in [19] are not considered important. Rather, it is assumed that an arbitrary mapping Φ is applied to the data before it is passed to the sign function. It seems safe to assume that an embedding similar to (14) would suffice. Such an embedding is considered a suitable resolution to the model weakness outlined in 4.4 Conclusions. The sign function is approximated by a variant of the sigmoid function, leading to the embedding scheme

    Φ̂(x) = 2σ(γ Φ(x)) − 1 ∈ (−1, 1)^b    (15)

where γ is a steepness parameter.

5.2.2 Loss

Let the distribution of the embedded Hamming distances between an anchor x̂ and some set of points be denoted by D_{x̂,Φ}, and let C_{x̂} be the membership indicators (inclusion/exclusion in the anchor's KNN) of that set. As already mentioned in 2.3.5 Mutual Information, the MI between D_{x̂,Φ} and C_{x̂} quantifies the overlap between the Hamming distance distributions when separated by membership indicator. It is desirable to maximize this value, since it measures the degree to which D_{x̂,Φ} provides information about membership. Therefore, the objective function to be minimized is:

    − I(D_{x̂,Φ}; C_{x̂}) = H(C_{x̂} | D_{x̂,Φ}) − H(C_{x̂})    (16)

The derivations of MI thus far have depended on a single anchor. [19] outlines how to compute mutual information efficiently across a batch; however, this is predicated on intelligent batch generation. Although slower, the online setting should still provide a suitable learning environment. This process would involve considering each item in the dataset as an anchor and sampling some set of points to generate D_{x̂,Φ} and C_{x̂} for that anchor. The sampling methods here need not differ from those discussed in 4.2.3 Loss & Batching.

The discrete nature of mutual information requires the continuous approximation of several functions. Namely, the Hamming distance must be replaced by

    d̂_Φ(x, x̂) = ( b − Φ̂(x)ᵀ Φ̂(x̂) ) / 2    (17)

and the indicator function 1[d = v] must be replaced by the triangular kernel

    δ(d, v) = max{ 0, 1 − |d − v| / Δ }    (18)

for a bin width Δ. Once this is done, a differentiable approximation of MI can be computed for backpropagation.
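To make the relaxation concrete, the following sketch evaluates the single-anchor objective of (16) using the relaxed distance (17) and the triangular kernel (18), via the equivalent decomposition I(D; C) = H(D) − H(D | C). It is written in plain NumPy for clarity rather than inside an autodiff framework, and the unit bin spacing and toy inputs are assumptions for the example, not settings prescribed by [19].

```python
import numpy as np

def soft_histogram(dists, centers, delta=1.0):
    """Triangular-kernel histogram: each distance votes for nearby bin centers."""
    votes = np.maximum(0.0, 1.0 - np.abs(dists[:, None] - centers[None, :]) / delta)
    return votes.sum(axis=0)

def neg_mutual_information(code_anchor, codes_others, is_positive, n_bits, delta=1.0):
    """-I(D; C) for a single anchor, using relaxed codes in (-1, 1)."""
    d = (n_bits - codes_others @ code_anchor) / 2.0     # relaxed Hamming distances, eq. (17)
    centers = np.arange(n_bits + 1, dtype=float)        # one bin per possible distance
    h_pos = soft_histogram(d[is_positive], centers, delta)
    h_neg = soft_histogram(d[~is_positive], centers, delta)
    p_pos, p_neg = is_positive.mean(), 1.0 - is_positive.mean()
    p_d = (h_pos + h_neg) / (h_pos + h_neg).sum()        # P(D)
    p_d_pos = h_pos / max(h_pos.sum(), 1e-12)            # P(D | C = 1)
    p_d_neg = h_neg / max(h_neg.sum(), 1e-12)            # P(D | C = 0)

    def ent(p):
        p = p[p > 1e-12]
        return -(p * np.log2(p)).sum()

    mi = ent(p_d) - (p_pos * ent(p_d_pos) + p_neg * ent(p_d_neg))
    return -mi

# toy usage: positives share the anchor's sign pattern more often than negatives do
rng = np.random.default_rng(0)
anchor = np.tanh(rng.normal(size=32))
pos = np.tanh(anchor + 0.3 * rng.normal(size=(5, 32)))
neg = np.tanh(rng.normal(size=(50, 32)))
codes = np.vstack([pos, neg])
labels = np.array([True] * 5 + [False] * 50)
print(neg_mutual_information(anchor, codes, labels, n_bits=32))   # negative MI in bits; lower is better
```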
5.3 Summary

It is clear that, with some small modifications, MI as an objective function solves many of the issues faced in Stage 1. Despite lacking empirical evidence, we strongly believe that it constitutes a preliminary solution to the challenge of asymmetric similarity search.

Chapter 6
Conclusion

6.1 Conclusion

This thesis focuses on adapting modern hashing techniques to be suitable in an asymmetric label setting. Our main contributions include:

1. A comprehensive analysis of existing literature in the field, spanning various components of the machine learning lifecycle.
2. A detailed evaluation of the specific challenges that asymmetric class labels pose. The effects are demonstrated by empirical observation and further studied in a theoretical context.
3. An argument that mutual information as a learning objective solves many of the issues presented by asymmetry, together with the modifications to [19] necessary to do so.

As learning techniques approach performance benchmarks that exceed human capability, the logical progression in this field of study is to focus on generalization and optimization. Accommodating asymmetry serves as an intelligent avenue through which to dispel the common prerequisite of highly tailored datasets, which often results in poor generalizability. In addition, eliminating the need for human intervention moves us closer toward the ultimate goal of end-to-end model optimization. Despite being in its infancy, the study of asymmetry illuminates numerous inadequacies in modern similarity search techniques, presenting a unique opportunity for innovation.

6.2 Future Work

There is still a great deal of work to be done on the study of asymmetry. This paper intentionally focused on a simple ranking based metric for illustrative purposes, but there are numerous class derivation techniques that should be considered. The brevity of this thesis leaves much to be understood about the effects of asymmetry on the learning process. This involves evaluating a broader collection of existing methods to determine their compatibility with asymmetric datasets, and understanding the effects of asymmetry in more depth. Finally, there remains the task of proposing novel methods to accommodate asymmetry.

References

1. Yun Ni, K.C., Joseph Bradley. Detecting Abuse at Scale: Locality Sensitive Hashing at Uber Engineering. 2017. Available from: https://eng.uber.com/lsh/.
2. Kraska, T., et al. The Case for Learned Index Structures. 2018. ACM Press.
3. Dong, W., et al. Modeling LSH for performance tuning. 2008. ACM Press.
4. Datar, M., et al. Locality-sensitive hashing scheme based on p-stable distributions. 2004. ACM Press.
5. Brian Kulis, K.G. Kernelized Locality-Sensitive Hashing for Scalable Image Search. 2009.
6. Yair Weiss, A.T., Rob Fergus. Spectral Hashing. 2009.
7. Moran, S. Learning To Hash. Available from: https://learning2hash.github.io/.
8. Sicotte, X.B. Kernels and Feature maps: Theory and intuition. 2018. Available from: https://xavierbourretsicotte.github.io/Kernel_feature_map.html.
9. Ng, A. Support Vector Machines, in CS229 Lecture notes.
10. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. 1901.
11. Aristides Gionis, P.I., Rajeev Motwani. Similarity Search in High Dimensions via Hashing. Proceedings of the 25th International Conference on Very Large Data Bases, 1999: p. 518-529.
12. Wei Liu, J.W., Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang. Supervised Hashing with Kernels. 2012.
13. Long-Kai Huang, Q.Y., Wei-Shi Zeng. Online Hashing. 2013.
14. Fatih Cakir, S.S. Adaptive Hashing for Fast Similarity Search. 2015.
15. Schroff, F., D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. 2015. IEEE.
16. Venice Erin Liong, J.L., Gang Wang, Pierre Moulin, Jie Zhou. Deep Hashing for Compact Binary Codes Learning. 2015.
17. Hanjiang Lai, Y.P., Ye Liu, and Shuicheng Yan. Simultaneous Feature Learning and Hash Coding with Deep Neural Networks. 2015.
18. Fatih Cakir, K.H., Sarah Adel Bargal, Stan Sclaroff. MIHash: Online Hashing with Mutual Information. 2017.
19. Fatih Cakir, K.H., Sarah Adel Bargal, Stan Sclaroff. Hashing with Mutual Information. CoRR, 2018.
20. Jinfeng Li, X.Y., Jian Zhang, An Xu, James Cheng, Jie Liu, Kelvin K. W. Ng, Ti-chung Cheng. A General and Efficient Querying Method for Learning to Hash. Proceedings of the 2018 International Conference on Management of Data, 2018.
21. Wang, W. NNS Benchmark: Evaluating Approximate Nearest Neighbor Search Algorithms in High Dimensional Euclidean Space. 2017. Available from: https://github.com/DBWangGroupUNSW/nns_benchmark.
22. Ruimao Zhang, L.L., Rui Zhang, Wangmeng Zuo, Lei Zhang. Bit-Scalable Deep Hashing with Regularized Similarity Learning for Image Retrieval and Person Re-identification. CoRR, 2015. abs/1508.04535.
23. Kun Ding, C.H., Bin Fan, Chunhong Pan. kNN Hashing with Factorized Neighborhood Representation. IEEE, 2015.