Multiclass Classification Using Compositional Data

Seyed Amir Ali Hashemi 1 and Majeed Mohammadi 2

Abstract — Multiclass classification problems have gained increasing popularity, particularly in deep learning research. Models for this type of problem are typically trained using a combination of Softmax with Cross-Entropy loss. We challenge this common setup by recognizing the underlying compositional nature of the Softmax output and by proposing two different transformers: a centred-log-ratio transformer on the labels that eliminates the need for Softmax, and a radial transformation that yields an alternative loss function to cross-entropy. We discuss the mathematical properties of these transformers and implement them in a state-of-the-art model using real datasets. Our results demonstrate that these transformers can achieve higher accuracy with complex model architectures and yield improved outcomes when dealing with a larger number of labels. To the best of our knowledge, this work is the first to show in practice that the radial transformation makes an extensive library of similarity metrics for compositional data available for machine learning problems. The source code of our implementations can be found at https://github.com/aahashemi/Pytorch-Multiclass-Classification-Using-CoDa/

I. INTRODUCTION

Multiclass classification is fundamental to a large number of real-world machine learning applications that demand the ability to automatically distinguish between thousands of different classes. Common applications include problems with categorical outputs, spanning image classification, where a model has to choose the correct category label for a given image; reinforcement learning, where the agent has to choose the optimal action; and recommendation systems, where the model should recommend the most suitable option out of many others. The growing list of applications motivates an in-depth exploration of multiclass classification algorithms. Despite their extensive use, however, a precise understanding of the statistical properties and behavior of classification algorithms is still missing, and one important question remains: are there any alternative approaches to the Softmax layer and Cross-Entropy loss for multiclass classification?

Softmax and Cross-Entropy are commonly used together in such problems and have become the prevailing convention among researchers in this field. The Softmax function transforms the output of a model into a probability distribution across all classes, while Cross-Entropy is employed as a loss function to assess the similarity between the predicted probability distribution and the true labels. The output of Softmax, by definition, exhibits the properties of compositional data (CoDa), which refers to a collection of non-negative values that sum to a constant, usually 1 or 100%. Their special character derives from the fact that the values depend on the particular choice of parts constituting the total, where the total itself is usually of no interest. Consequently, the role of the Softmax function lies in mapping the model's output from the real number space (R space) to the simplex

\Delta^K = \{\, x \in \mathbb{R}_+^K : \sum_{k=1}^{K} x_k = 1 \,\}    (1)

Recognizing this fundamental aspect unlocks possibilities for conducting research in this field. In this paper, we seek to further this area of research by proposing two distinct transformers. The first transformer, named centered-log-ratio (CLR), aims to bypass the Softmax layer.
Instead of mapping the model's output from the R space into the simplex, we retain it in the R space and bring the labels (one-hot encoded) from the simplex into the R space to compute the loss. With this method, we expect not only to achieve faster model training times by removing the Softmax layer, but also to potentially outperform the Softmax-based setup. The second transformer, named radial transformation, differs from CLR in the sense that it does not aim to replace the Softmax layer, but rather utilizes the data on the simplex to compute the loss value in a different way. The resulting loss function (radial loss) serves as an alternative to the cross-entropy loss. The transformation occurs within the loss function by taking the Softmax output and the one-hot encoded labels and mapping them onto the arc of the unit sphere in N-dimensional space. This enables us to calculate the cosine similarity between the two vectors and maximize it (equivalently, minimize the angle between them) throughout the training process. Further details will be discussed in the upcoming sections. To summarize, our contributions are as follows:

• We propose the CLR transformer, applied as a pre-processing step on the labels.
• We propose the radial loss, a new loss function that replaces cross-entropy on top of Softmax.
• We apply both the radial loss and the CLR transformer to train multiclass classifiers, specifically image and text classifiers, on custom and benchmark datasets. We then compare their performance against conventional models that use Softmax and cross-entropy loss.

II. BACKGROUND

A. Centred Log Ratio Transformation

In 1982, Aitchison introduced the log-ratio framework, whereby compositional data is sent from the simplex onto the real number space, where common statistical operations can then be applied. The reason is that the linearity assumptions underlying these operations hold in real space but not on the simplex; therefore, ordinary arithmetic operations cannot be applied directly to compositional data, and geometric operations, although more limited, must be employed instead. There are several possible log-ratio transformations, the simplest being the additive log-ratio (ALR). In ALR, each component is taken relative to a reference component, and the mapping is then applied using the natural logarithm. A drawback of the ALR transformation, however, is that the choice of reference can significantly affect any downstream analysis. To address this issue, a common workaround is the centered log-ratio (CLR) transformation, which we will be utilizing. With CLR, there is no need to choose a reference part; instead, each of the K parts is referred to the geometric mean of all the parts:

\mathrm{CLR}(x) = \left[ \log\frac{x_1}{g(x)},\; \log\frac{x_2}{g(x)},\; \ldots,\; \log\frac{x_K}{g(x)} \right]    (2)

where g(x) = \left( \prod_{k=1}^{K} x_k \right)^{1/K} is the geometric mean of the parts. The K CLR components sum to zero and are therefore linearly dependent: any one of them can be computed from the other K - 1, while any subset of K - 1 components is linearly independent. In matrix-vector notation, the set of CLRs can be written as

\mathrm{CLR}(x) = \begin{bmatrix} 1 - \tfrac{1}{K} & -\tfrac{1}{K} & \cdots & -\tfrac{1}{K} \\ -\tfrac{1}{K} & 1 - \tfrac{1}{K} & \cdots & -\tfrac{1}{K} \\ \vdots & \vdots & \ddots & \vdots \\ -\tfrac{1}{K} & -\tfrac{1}{K} & \cdots & 1 - \tfrac{1}{K} \end{bmatrix} \begin{bmatrix} \log x_1 \\ \log x_2 \\ \vdots \\ \log x_K \end{bmatrix}    (3)
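As an illustration, the CLR transformation of Eq. (2) can be sketched in a few lines of PyTorch (the function name and the example composition are only illustrative; the exact implementation is available in our repository):

```python
import torch

def clr(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Centred log-ratio transform of a strictly positive composition (Eq. 2).

    Each part is divided by the geometric mean of all parts and the natural
    logarithm is applied; equivalently, the mean of the logs is subtracted.
    """
    log_x = torch.log(x)
    log_gmean = log_x.mean(dim=dim, keepdim=True)  # log of the geometric mean
    return log_x - log_gmean

# A composition with K = 3 strictly positive parts summing to one.
x = torch.tensor([0.2, 0.3, 0.5])
z = clr(x)
print(z)        # CLR coordinates in real space
print(z.sum())  # ~0, illustrating the linear dependence of the K components
```

Note that the transform is only defined for strictly positive parts, which motivates the zero-handling discussed next.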
However, log-ratio methods, including CLR, are not readily applicable to data with many zeros, because the logarithm and ratio computations in the transforms do not allow zero values. In fact, log-ratio transformations are constrained to handle data only on the open simplex. Researchers have suggested a workaround of shifting the data slightly so that they all fit into the open simplex. This is achieved by substituting zeros with small positive values (epsilon) and, if desired, re-normalizing the parts to sum to one. Several approaches have been proposed for this purpose, such as Martín-Fernández et al. (2011; 2012), Rasmussen et al. (2020), and Lubbe et al. (2021). For the purposes of this paper, we employed the most basic technique, which checks whether a value is zero and adds an epsilon to it if so. How we determined the appropriate epsilon value is discussed in more detail in the Design section.

B. Radial transformation

Recall that compositional data carry only relative information that does not depend on scale. This means that the ratio information is also carried by the radial vectors on which compositions lie. From this viewpoint, the representation of compositional data on the simplex can be viewed as a combination of straight lines and nonnegative radial vectors. It follows that there are different ways to represent compositional data, depending on the intersection of these vectors with a chosen shape. For instance, we can use hyperspheres or hypercubes, resulting in hyperspherical or hypercubical representations of compositional data, respectively. In Figure 1, the blue dots (the intersections of the circle with the radial vectors) equivalently represent the corresponding compositional data on the simplex.

Fig. 1: Comparison of the compositional vectors on the simplex and the unit circle. The points lie on the same dashed lines, suggesting that the relative ratios are preserved.

By applying the radial transformation to compositional data, we obtain transformed data that are unconstrained and can be analyzed using traditional similarity metrics, such as Euclidean distance or cosine similarity. These metrics can quantify the similarity or dissimilarity between two compositions based on their transformed values. We define the radial transformation \psi : \Delta^d \to S^d_{\geq 0} by

\psi(x) = \frac{x}{\|x\|_2} \quad \text{for all } x \in \Delta^d    (4)

where S^d_{\geq 0} denotes the nonnegative part of the unit sphere S^d. In this paper, we utilize the radial transformation to propose a custom loss function, named radial loss, which calculates the cosine similarity (a measure of the angle between two vectors, regardless of their length) between the one-hot encoded labels and the output of the Softmax layer after both have undergone the radial mapping:

\cos(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}    (5)

III. RESEARCH HYPOTHESIS AND QUESTIONS

Most of the current research on deep-learning classification problems entails using Softmax along with cross-entropy loss in various model architectures. Building on this practice, this paper explores the following questions. Firstly, can the removal of the Softmax layer and the transformation of the labels from the simplex onto the real number space using the CLR transformation improve the accuracy and decrease the training time of the model? Since this transformation can be applied during the data preprocessing phase and involves the removal of the Softmax layer, it is anticipated that model training will be faster. Secondly, can the radial loss, a custom loss function that transforms the output of the Softmax layer and the one-hot encoded labels onto the arc of an N-dimensional sphere and computes the cosine similarity between the two vectors, serve as a better-performing alternative to the commonly used cross-entropy loss?
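To make the second question concrete, the sketch below shows one way the radial loss can be written in PyTorch: both vectors are projected onto the unit sphere with the radial map of Eq. (4), and one minus their cosine similarity (Eq. 5) is minimized, which is equivalent to minimizing the angle between them. This is a minimal formulation; the exact implementation is available in our repository.

```python
import torch
import torch.nn.functional as F

def radial(x: torch.Tensor, dim: int = -1, eps: float = 1e-12) -> torch.Tensor:
    """Radial map psi of Eq. (4): project a composition onto the
    nonnegative part of the unit hypersphere."""
    return x / x.norm(p=2, dim=dim, keepdim=True).clamp_min(eps)

def radial_loss(logits: torch.Tensor, one_hot: torch.Tensor) -> torch.Tensor:
    """One minus the cosine similarity (Eq. 5) between the radially mapped
    Softmax output and the radially mapped one-hot labels."""
    p = radial(F.softmax(logits, dim=-1))
    t = radial(one_hot.float())
    cos = (p * t).sum(dim=-1)  # dot product of two unit-length vectors
    return (1.0 - cos).mean()

# Example: a batch of 2 samples with K = 4 classes.
logits = torch.randn(2, 4, requires_grad=True)
labels = F.one_hot(torch.tensor([1, 3]), num_classes=4)
loss = radial_loss(logits, labels)
loss.backward()
```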
IV. DESIGN

Since our contribution challenges Softmax and proposes an alternative to the Cross-Entropy loss, we can test our hypotheses on any model architecture that employs these two components. We chose to do so on two different classifiers: one for images and one for text.

The architecture of choice for the image classifier is a state-of-the-art model called Residual Network (ResNet). This CNN-based network addresses the issue of vanishing/exploding gradients, which is prevalent in deep plain architectures, by utilizing skip connections between residual blocks. A skip connection feeds the activations of one layer to later layers, effectively bypassing the layers in between; this forms a residual block. ResNet adopts a 34-layer plain network architecture, initially inspired by VGG-19, to which the shortcut connections are added, transforming the architecture into a residual network. Figure 2 illustrates an overview of the architecture.

Fig. 2: The ResNet architecture

For the text classifier, a straightforward model composed of only two layers is chosen: an EmbeddingBag layer and a linear layer for classification. The EmbeddingBag layer computes the mean value of a bag of embeddings and passes it to a linear layer, whose number of output nodes equals the number of labels.

Fig. 3: The architecture for the image classifier

The implementation of the centered log ratio, mentioned in the Background section, requires the selection of the hyperparameter epsilon. This value is used to shift the zero values, enabling the computation of the geometric mean and the subsequent application of the natural logarithm. The choice of the epsilon value can significantly influence the outcome; therefore, we conducted a grid search to find the optimal epsilon. The process involved training a compact CNN model on the Fashion-MNIST dataset with different values of epsilon for 10 epochs each and comparing the model's validation loss. Higher values of epsilon appear to result in lower validation loss. It is worth noting that the intention is to keep epsilon small and not to exceed 1, because the transformation is applied to the one-hot encoded labels, which contain only zeros and a single one; shifting the zeros to a value greater than one would make the model's prediction always appear correct, which is not desired.
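As an illustration of the two-layer text classifier described above, a minimal PyTorch sketch is given below (the vocabulary size and embedding dimension are illustrative; the exact implementation is available in our repository):

```python
import torch
from torch import nn

class TextClassifier(nn.Module):
    """EmbeddingBag (mean of the token embeddings) followed by a single
    linear layer whose output size equals the number of labels."""

    def __init__(self, vocab_size: int, embed_dim: int, num_classes: int):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        return self.fc(self.embedding(tokens, offsets))

# Example: two documents packed into one flat token tensor.
model = TextClassifier(vocab_size=20000, embed_dim=8, num_classes=4)
tokens = torch.tensor([5, 21, 7, 100, 3])  # all token ids, concatenated
offsets = torch.tensor([0, 3])             # start index of each document
logits = model(tokens, offsets)            # shape (2, 4), one row per document
```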
V. THE DATASETS

The experiments have been run on two benchmark and one custom multiclass classification datasets. The benchmark datasets are the MNIST hand-written digit dataset and the AG News dataset; the former is used for the image classifier, while the latter is used for the text classifier. Furthermore, we have developed our own computer vision dataset, similar to Fashion-MNIST but slightly more advanced. The specifics of these datasets are discussed in the following subsections.

A. MNIST Dataset

The MNIST database is a large collection of handwritten digits which serves as a standard dataset in computer vision and deep learning. It comprises a training set with 60,000 examples and a test set with 10,000 examples, both covering the 10 classes (0-9). The dataset is a subset of the larger NIST Special Database 3 and Special Database 1, which consist of monochrome images of handwritten digits written by Census Bureau employees and high school students. The images were centered in a 28x28 frame by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.

B. AG News Dataset

AG is a collection of more than 1 million news articles. They have been gathered from more than 2,000 news sources by ComeToMyHead and have become a benchmark in many research projects. The dataset consists of 4 classes named "World", "Sports", "Business", and "Sci/Tech". Each class contains 30,000 training samples and 1,900 testing samples, giving 120,000 training samples and 7,600 testing samples in total.

C. Custom Fashion Dataset

The custom fashion dataset used in this project is similar to Fashion-MNIST with a small but significant twist: many of the images contain human models wearing the corresponding clothes. This addition makes the dataset more realistic and enables the solving of modern computer vision challenges. The images were obtained by scraping www.zalando.com using a script, which extracted the URLs of images from 10 different categories: Jacket, Pants, Jeans, Shorts, T-shirt, Pullover, Bag, Cap, Sandal, and Skirt. The URLs were then converted into 28x28-pixel images using another script. Each pixel has a value between 0 and 11. In total, there are 52,000 samples, of which 40,000 were used for training, 5,000 for validation, and 5,000 for testing.

Fig. 4: A random sample of the custom fashion dataset along with the corresponding labels

VI. EXPERIMENTAL SETUP

The experimental setup consists of an image classifier and a text classifier, each trained on their respective datasets using three different configurations: Softmax layer + Cross-Entropy loss, CLR transformer + Mean Square loss, and Softmax layer + Radial loss. The accuracy of each classifier is then compared across the three configurations. The settings and hyperparameters are kept identical across all training experiments to avoid confounding variables; please refer to Table I for further details.

(Hyper)parameter           ResNet Image Classifier                       Text Classifier
Input feature dimension    (1, 28, 28)                                   (338, 8)
Batch size                 64                                            64
Learning rate              0.005                                         0.005
Optimization algorithm     SGD (weight decay = 0.005, momentum = 0.9)    SGD (weight decay = 0.005, momentum = 0.9)
Epochs                     100                                           100
Epsilon for CLR            0.04                                          0.04

TABLE I: (Hyper)parameters in the experimental setup

In the data preprocessing phase, the image classification datasets were normalized by adjusting the features to have zero mean and unit variance. Fortunately, the input dimensions of the MNIST dataset and our custom dataset were the same, eliminating the need for any architectural modifications to the ResNet image classifier. The AG News dataset, however, underwent a different preprocessing pipeline: first, a vocabulary was constructed from the raw training dataset; then, the text pipeline converted each text string into a list of integers based on the lookup table defined in the vocabulary. Additionally, all labels across all datasets were one-hot encoded, and a copy was created to which the centered log ratio was applied for the experiments with the CLR configuration. An overview of the dataset statistics is shown in Table II.

Dataset           Description   #Labels   #Train    #Test
MNIST             Image         10        60,000    10,000
AG News           Text          4         30,000    1,900
Custom Fashion    Image         10        40,000    5,000

TABLE II: Statistics for the three multiclass classification datasets
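To make the three configurations concrete, the sketch below shows how each one turns the raw model outputs (logits) and the one-hot targets into a loss value. It is a simplification under the assumptions stated in the comments; the exact training code is available in our repository.

```python
import torch
import torch.nn.functional as F

EPSILON = 0.04  # epsilon for CLR, selected by the grid search in Section IV

def clr(x: torch.Tensor) -> torch.Tensor:
    # Centred log-ratio transform (Eq. 2).
    log_x = torch.log(x)
    return log_x - log_x.mean(dim=-1, keepdim=True)

def loss_softmax_cross_entropy(logits, y_onehot):
    # Configuration 1: Softmax layer + Cross-Entropy loss.
    # F.cross_entropy applies log-softmax internally and expects class indices.
    return F.cross_entropy(logits, y_onehot.argmax(dim=-1))

def loss_clr_mse(logits, y_onehot):
    # Configuration 2: CLR transformer + Mean Square loss.
    # Zeros in the one-hot targets are replaced by EPSILON, the shifted targets
    # are CLR-transformed, and the raw logits are regressed onto them, so no
    # Softmax layer is used.
    shifted = y_onehot.float().clamp_min(EPSILON)
    return F.mse_loss(logits, clr(shifted))

def loss_softmax_radial(logits, y_onehot):
    # Configuration 3: Softmax layer + Radial loss.
    # Cosine similarity is scale invariant, so it equals the cosine of the
    # radially mapped vectors of Eq. (4); one minus the cosine is minimized.
    cos = F.cosine_similarity(F.softmax(logits, dim=-1), y_onehot.float(), dim=-1)
    return (1.0 - cos).mean()

# Example on a batch of 2 samples with 10 classes.
logits = torch.randn(2, 10)
y = F.one_hot(torch.tensor([3, 7]), num_classes=10)
print(loss_softmax_cross_entropy(logits, y),
      loss_clr_mse(logits, y),
      loss_softmax_radial(logits, y))
```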
VII. EVALUATION

The results of the experiments are presented in Table III. Overall, the performance of the three configurations is very similar, with a slight advantage observed for both the Softmax layer + Radial loss and the CLR transformer + Mean Square loss over the widely used Softmax layer + Cross-Entropy loss. Our two transformers achieved the highest performance in all experiments, except in the text classifier, where the radial loss performed slightly worse than Softmax + Cross-Entropy loss. In particular, both transformers seem to be more suitable for problems with a larger number of labels. Regarding the model training time, the results are very close; we cannot conclusively state that the CLR transformer leads to faster training, as the observed difference is only a couple of seconds.

Classifier / Dataset               Softmax + Cross Entropy loss   Softmax + Radial loss   CLR + Mean Square loss
ResNet image classifier, MNIST     97.91 / 2085                   98.74 / 2091            99.27 / 2090
ResNet image classifier, Custom    79.4 / 3126                    80.3 / 3135             80.1 / 3129
Text classifier, AG News           90.9 / 1001                    90.4 / 1048             91.0 / 1034

TABLE III: Test accuracy (left) and model training time in seconds (right) for the three configurations on the datasets.

The confusion matrices obtained by evaluating the trained models on the test sets are shown in Figure 5. Note that these matrices have been normalized to make them comparable, as the number of test samples varies across the datasets. The y-axis represents the true labels, while the x-axis represents the predicted labels. For the custom dataset, the models demonstrated generally high accuracy, as indicated by the darker diagonal elements. However, all three models struggled to distinguish between Jacket and Pullover, a difficulty that can be attributed to the close similarity between these two classes. In addition, the dataset contains images of human models wearing the clothes, which adds another layer of complexity to the task. On the other hand, the models performed exceptionally well on the MNIST dataset, achieving an accuracy of 99.27 percent with the CLR transformer + Mean Square loss. Similarly, the accuracy on the AG News dataset was very high despite the naive and simple architecture of the text classifier.

Fig. 5: The normalized confusion matrices for the three configurations on the datasets
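For reference, row-normalized confusion matrices of the kind shown in Figure 5 can be computed as in the sketch below (shown with scikit-learn for brevity; our repository may compute them differently).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def normalized_confusion(y_true, y_pred, num_classes):
    """Row-normalized confusion matrix: entry (i, j) is the fraction of test
    samples with true label i that were predicted as label j."""
    return confusion_matrix(y_true, y_pred,
                            labels=np.arange(num_classes),
                            normalize="true")

# Toy example with 3 classes.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
print(normalized_confusion(y_true, y_pred, num_classes=3))
```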
VIII. DISCUSSION AND FUTURE WORKS

In this work, we have demonstrated the feasibility of employing alternatives to the commonly used Softmax layer and cross-entropy loss combination for training a multiclass classification model. We have introduced two transformers, namely CLR and Radial, which play distinct roles: CLR serves as a data preprocessing transformer, while Radial functions as a loss function transformer. We have conducted experiments using these transformers against Softmax and cross-entropy across multiple classifiers and datasets. The results indicate that both of our transformers exhibit improved performance, particularly when dealing with a larger number of labels, without any notable difference in model training time.

Questions remain regarding the mathematical justification for applying CLR to data with a high proportion of zeros. Log-ratio approaches introduce geometric distortions near the boundary of the simplex because the logarithm of a value y diverges to ±∞ as y approaches 0 or ∞. The problem arises when the dataset (here, the one-hot encoded labels) is concentrated on the boundary, as the Aitchison geometry (the geometry of CLR-transformed data) views such data as diverging to infinity. Consequently, replacing zeros in the Aitchison geometry is akin to moving points from infinity to a finite position. As a result, the configuration of log-ratio transformed data depends critically on the method used for zero replacement. Given that there are countless ways of replacing zeros, it may not be possible to find an appropriate representation of the data using this approach. Furthermore, the interpretation of the data changes with the zero-replacement method, which renders the results of statistical analysis unreliable. These problems become more pronounced when there are more zeros or when the dimensionality of the data is higher.

The advantage of the Radial transformation over CLR is that it does not require zero replacement and preserves the separation of compositions without distortion. More importantly, it also paves the way for other classes of kernels (similarity metrics), such as the Kullback-Leibler divergence, the Jensen-Shannon divergence, or the Mahalanobis distance, to be applied after projecting compositions onto the nonnegative part of a hypersphere. Incorporating these kernels within the framework of the Radial transformation can unlock new possibilities for studying and understanding compositional data and, subsequently, for designing new loss functions for multiclass classification.

IX. SOURCE CODE

The source code for the PyTorch implementation of the experiments for all three datasets is available at https://github.com/aahashemi/Pytorch-Multiclass-Classification-Using-CoDa/

REFERENCES