Slice-and-Pack: Tailoring Deep Models for Customized Requirements

Anonymous Author(s)

Abstract

The learnware paradigm proposed by Zhou [1] aims to establish a learnware market such that users can build their own models by reusing appropriate existing models in the market without starting from scratch. It is often the case that a single model is insufficient to fully satisfy a user's requirement, while offering multiple models leads to significantly higher costs for users alongside increased hardware resource demands. To address this challenge, this paper proposes the "Slice-and-Pack" (S&P) framework, which empowers the market to provide users with only the required model fragments without having to offer the entire functions of all involved models. Our framework first slices a set of models into small fragments and subsequently packs selected fragments according to the user's specific requirement. In the slicing stage, we extract units layer by layer and progressively connect them to create numerous fragments. In the packing stage, an encoder-decoder mechanism is employed to assemble these fragments. Notably, these processes are conducted under data-limited constraints due to privacy concerns. Extensive experiments validate the effectiveness of our proposed framework.

1 Introduction

Machine learning has achieved great success in various domains [2-6]. However, building a satisfactory learning model from scratch is often time-consuming and requires significant expertise, and personalized customization is often necessary for specific situations. Moreover, privacy concerns often limit access to raw data, making it even more challenging to train well-behaved models from scratch.
To address these issues, Zhou [1] proposed the learnware paradigm, aiming to establish a learnware market such that users can build their own models by reusing appropriate existing models in the market without starting from scratch. A learnware is a pre-trained model associated with a specification describing the model's specialty and utility. Developers can submit trained models to the market with a few data points describing their tasks, and the market will assign a specification upon accepting a submitted model. When a user wants to tackle a learning task, the market can recommend helpful learnwares whose specifications match the user's requirement.

Submitted to 37th Conference on Neural Information Processing Systems (NeurIPS 2023). Do not distribute.

However, in many cases it is hard to find a single model that fully meets the user's requirement, especially during the early development stages of the learnware market. In real-world scenarios, the abilities desired by a user might be spread across several models. Offering all these models to the user would compel them to purchase numerous models, each potentially bundled with unnecessary abilities, and would also require more hardware resources for deployment. For instance, consider community clinics that want to use patient records for diagnosis. Due to constraints in equipment and treatment capabilities, these clinics are primarily tasked with addressing common diseases. The market may contain several existing well-behaved models from specialist hospitals trained on private data. However, these models are designed to diagnose as many diseases as possible in a specific
field, including rare ones, rather than focusing on common diseases. Models equipped with more capabilities typically demand more computational power for inference, leading to the need for more robust devices and higher acquisition costs. As revealed in [7], commercial models with 5-10B parameters roughly cost around $1,000, while stronger models with over 170B parameters can cost up to $10,000. Furthermore, utilizing these stronger models often necessitates much more computational resources, which could impose a further financial burden on community clinics.

Figure 1: Comparison of the two frameworks. Our approach can provide more compact models with only the necessary functions for users, resulting in cost savings and lower device requirements.

To address this issue, we propose the "Slice-and-Pack" (S&P) framework, which empowers the market to provide users with only the required model fragments, eliminating the need to offer the entire functions of all involved models. These fragments can be combined in various ways to satisfy different users' specific requirements. As illustrated in Figure 1, to meet the clinic's requirements, we can slice models into fragments for different diseases and combine the fragments for common diseases. This framework offers superior flexibility and convenience compared to using entire models, minimizing the cost of redundant abilities while providing easy plug-and-play functionality. When user requirements change, they can easily discard outdated fragments and purchase new ones as needed.
Furthermore, the framework can also greatly benefit market development: since the market does not need to consider all possible combinations of functions, its effective capacity matches that of a market several times its size. The property of slicing models once and reusing fragments multiple times also reduces repetitive workload for the market.

To achieve the above desired properties, our framework works in two stages: slicing and packing. In the slicing stage, we extract fragments from the original model, each corresponding to a subfunction. We first identify units important to the subfunction layer by layer and then progressively connect each layer's selected units. This process yields individual fragments that perform specific functions. In the packing stage, we use an encoder-decoder mechanism to pack the previously stored fragments: we take the fragments identified in the slicing stage and assemble them into a new model tailored to the user's specific requirement. All these processes utilize only the data provided by developers for constructing the specification. We conduct extensive experiments to evaluate the accuracy of the resulting combined fragments and to measure the equivalent expansion rate of the market using our framework; the empirical results clearly demonstrate its effectiveness.

In the following, we summarize the contributions of the paper.

• We introduce the "Slice-and-Pack" (S&P) framework, a novel method that expands the market by first slicing models into multiple fragments that perform specific functions, and then packing the selected fragments to meet new requirements in data-limited environments.

• Our empirical evaluations demonstrate that the S&P framework can generate highly accurate packed models and expand the market's capacity by many times.
Moreover, the minimal time required for packing ensures a plug-and-play feature, making it readily available to users.

Organization. The rest of the paper is organized as follows: we first provide an overview of related work in Section 2. Next, we introduce the Slice-and-Pack framework in Section 3. Section 4 presents the experimental results. Finally, we conclude the paper in Section 5. Due to page limitations, the pseudocode and additional experiments are included in the appendix.

2 Related Work

The learnware paradigm [1, 8] presents a promising framework in which a vast number of models are submitted by developers working on various tasks, without the availability of their original training data. This poses significant challenges for users in identifying and reusing helpful models in the market. The specification is the core component of the learnware paradigm for achieving this goal. Numerous attempts have been made to create simplified prototype frameworks. One such approach uses the reduced kernel mean embedding (RKME) as a specification, as demonstrated by Wu et al. [9]. Another solution, proposed by Tan et al. [10], enables learning models from heterogeneous feature spaces by generating the RKME specification on a unified subspace.

Transfer learning [11] and domain adaptation [12, 13] are techniques that aim to transfer knowledge from a source domain to a target domain. Typically, these techniques assume that raw data from one domain [14-16] or multiple domains [17-19] are accessible when training the target model. However, in the learnware paradigm, the raw source data is not available when training the target model, making these techniques inapplicable. Hypothesis transfer learning [20, 21] and model reuse [22, 23] attempt to exploit pre-trained models to handle the learner's current job.
These techniques assume that the given pre-trained models are always helpful for the current job [24, 25]. However, this assumption differs from our problem, in which the abilities needed by users may be distributed across several models, making it difficult to find any single model that is helpful on its own. Multi-party learning [26, 27] aims to unite local data to solve the same or similar jobs in privacy-preserving ways, rather than reusing existing pre-trained models.

Model decomposition is a technique used to increase parameter efficiency in deep learning models by breaking parameter-heavy layers into multiple lightweight ones, using techniques such as low-rank decomposition [28] and weight decomposition [29-31]. These methods aim to generate smaller models with comparable performance for distributed runtime environments, not to create smaller models with different functionality.

3 The Slice-and-Pack Framework

In this section, we present the details of our Slice-and-Pack (S&P) framework, which is composed of two parts, Slicing and Packing, as illustrated in Figure 2. We begin by formulating the problem and then explain each of the two parts. The algorithm is shown in Appendix A.

3.1 Problem Formulation

There are $K_m$ models developed by different developers in the learnware market, denoted by $\{\mathcal{M}^{(i)}\}_{i=1}^{K_m}$. Each model $\mathcal{M}^{(i)}$ has a function set representing the abilities the model possesses. In the following, we focus on classification, where one function is the ability to identify one class. Let $K_c^{(i)}$ be the number of classes of the $i$-th model $\mathcal{M}^{(i)}$, $\mathcal{C}^{(i)} = \{c_j^{(i)}\}_{j=1}^{K_c^{(i)}}$ be the classes of model $\mathcal{M}^{(i)}$, and $\mathcal{C} = \bigcup_{i=1}^{K_m} \mathcal{C}^{(i)}$ be the set of all classes in the model set.
Let $K_d^{(i)}$ be the number of uploaded samples of model $\mathcal{M}^{(i)}$, $\mathcal{D}^{(i)} = \{x_j, y_j\}_{j=1}^{K_d^{(i)}}$ be these few samples, and $\mathcal{D}_{c_j}^{(i)}$ be the samples of class $c_j$. Given a set of models $\{\mathcal{M}^{(i)}\}_{i=1}^{K_m}$ and a few samples $\{\mathcal{D}^{(i)}\}_{i=1}^{K_m}$, Slice-and-Pack aims to obtain a set of fragments that allows for the rapid and easy construction of the smallest possible model for any subset of classes $\mathcal{C}' \subset \mathcal{C}$. This can be summarized in two key points:

(i) The constructed model should be small and comply with the requirements.
(ii) The process of constructing the model should be easy and flexible.

Let $\mathcal{F}_c$ be the fragment of class $c$. Our goal can be formalized as optimizing the following objective:

$$\min_{\mathcal{F}_c} \; \mathbb{E}_{c \in \mathcal{C}' \subset \mathcal{C},\, (x,y) \sim \mathcal{P}_{\mathcal{C}'}} \mathcal{L}(\mathcal{J}_{\mathcal{C}'}(x), y) + \lambda \sum_{c \in \mathcal{C}'} \Omega(\mathcal{F}_c) \quad \text{s.t.} \quad \mathcal{J}_{\mathcal{C}'} = \arg\min_{\mathcal{J}_{\mathcal{C}'}} \mathbb{E}_{(x,y) \sim \mathcal{P}_{\mathcal{C}'}} \mathcal{L}(\mathcal{J}_{\mathcal{C}'}(x), y), \tag{1}$$

where $\mathcal{P}_{\mathcal{C}'}$ is the distribution of samples of the classes in $\mathcal{C}'$, $\mathcal{J}_{\mathcal{C}'}(x) = \mathrm{Union}(\mathcal{F}_{c_1}, \ldots, \mathcal{F}_{c_{K_{c'}}})$ is the method of packing the required fragments, $\lambda$ is a weight parameter, and $\Omega(\mathcal{F}_c)$ is the size of fragment $\mathcal{F}_c$. The entire process should be conducted in a data-limited environment, as required by the learnware paradigm.

Figure 2: The overview of Slice-and-Pack.

3.2 Slicing the Network

In this subsection, we write $\mathcal{M}, \mathcal{D}$ as shorthand for $\mathcal{M}^{(i)}, \mathcal{D}^{(i)}$. The slicing method aims to obtain the fragment $\mathcal{F}_c$ of class $c$ from the network model $\mathcal{M}$.
We regard $\mathcal{M}$ as a series of layers $m_i$, each composed of a linear layer and some nonlinear layers, which can be expressed as:

$$m_i(x) = \sigma(\mathrm{layer}(x; W_i, b)), \tag{2}$$

where $\sigma$ represents the nonlinear layers, such as Sigmoid or Maxpooling, $\mathrm{layer}$ represents a linear operation, such as a convolution or a fully-connected layer, and $W_i$ and $b$ are the parameters of the layer. Almost all network layers can be seen as a collection of units: for a convolution layer the unit is a filter, and for a linear layer the unit is a neuron.

We adopt a dual-stage strategy to extract the fragment $\mathcal{F}_c$ of class $c$ from the network model $\mathcal{M}$. In the first stage, we select units important to class $c$ layer by layer. For each layer $m_i$, we use techniques named Function Pooling and Function Adaptation to get $f_i$, as shown in Figure 3. In the second stage, each $f_i$ is progressively combined, and finally we obtain the fragment $\mathcal{F}_c$ of class $c$.

Function Pooling. In the Slice-and-Pack setting, it is necessary to identify the units that are important to a specific class $c$ rather than to the original model $\mathcal{M}$ as a whole. While some methods use parameter weights as a regularization or criterion to sparsify or prune the model, high-value parameters are essential to the overall model rather than to a specific class. To address this, we use the activation of class $c$'s samples at the $i$-th layer to identify important units: we first use this activation as a regularization, and then select units based on it.

Let $\mathcal{X}_i^c$ represent the output of layer $m_i$ with $\mathcal{D}_c$ as input, $\mathcal{M}'$ be a model whose parameters are identical to those of $\mathcal{M}$, and $u_i$ be the number of units in $m_i$. To encourage more units' outputs to approach zero, we use the $\ell_{2,1}$-norm [32] of $\mathcal{X}_i^c$:

$$\|\mathcal{X}_i^c\|_{2,1} = \frac{1}{|\mathcal{X}_i^c|} \sum_{x \in \mathcal{X}_i^c} \sum_{j=1}^{u_i} \|x_j\|_2, \tag{3}$$

where $x_j$ is the output of the $j$-th unit.
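To make Eq. (3) concrete, the regularizer can be computed in a few lines. The following is a minimal numpy sketch under our own assumption about the activation layout, a (batch, units, features) array; it is not the paper's released code:

```python
import numpy as np

def l21_activation_norm(acts: np.ndarray) -> float:
    """l2,1-norm of activations as in Eq. (3).

    acts: shape (batch, units, features) -- outputs of layer m_i on
    class-c samples, with each unit's output flattened to one row.
    """
    per_unit = np.linalg.norm(acts, axis=-1)   # ||x_j||_2 per sample, per unit
    return float(per_unit.sum(axis=1).mean())  # sum over units, average over samples
```

Minimizing this quantity drives entire rows (units) toward zero, which is what allows the subsequent selection step to discard whole units rather than individual weights.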
We fine-tune the model with the Pooling loss, containing the $\ell_{2,1}$-norm of $\mathcal{X}_i^c$ and the $\ell_2$ distance between $\mathcal{M}$ and $\mathcal{M}'$:

$$\mathcal{L}_i^P = \frac{1}{K_d} \sum_{x \in \mathcal{D}} \|\mathcal{M}'(x) - \mathcal{M}(x)\|_2^2 + \alpha \|\mathcal{X}_i^c\|_{2,1}, \quad \text{(pooling loss)} \tag{4}$$

where $\alpha$ is a hyperparameter. Using this loss function, we obtain a sparser layer $m_i'$ from $\mathcal{M}'$.

Figure 3: The process of factorizing layer $m_i$. In Function Pooling, we concentrate the functionality of class $c$ into fewer units and then select units to get $f_i$. In Function Adaptation, we use the Adaptation loss to adapt $f_i$ to the change of structure.

We then select units by either threshold or rate. Let $\mathcal{U}_i$ be the set of selected unit indices:

$$\mathcal{U}_i = \{ j \mid |(\mathcal{X}_i^c)_j| > \Phi(1-\beta) \} \quad \text{(by threshold)} \tag{5}$$
$$\mathcal{U}_i = \{ j \mid j \in \text{Top-}\beta(\{|(\mathcal{X}_i^c)_j|\}_{j=1}^{u_i}) \} \quad \text{(by rate)} \tag{6}$$

where $\beta$ is a hyperparameter in the range from 0 to 1, and $\Phi$ is the inverse of the cumulative distribution function of a truncated normal distribution. We use $\Phi$ to transform $\beta$ into a value whose cumulative distribution probability is $\beta$; we adopt the truncated normal distribution because the nonlinear layers usually contain BatchNorm and ReLU. Finally, we obtain the new layer $f_i$ from $m_i'$ for class $c$.

Function Adaptation. After selecting the units, we need to adapt the layer $f_i$ to account for changes in the structure of the neural network.
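The two selection rules of Eqs. (5) and (6) can be sketched as follows. This is an illustrative simplification: in the threshold variant we replace the truncated-normal inverse CDF $\Phi$ with an empirical quantile of the unit scores, and all function names are our own:

```python
import numpy as np

def select_by_rate(unit_scores: np.ndarray, beta: float) -> list:
    """Eq. (6): keep the top-beta fraction of units by activation magnitude."""
    k = max(1, int(round(beta * len(unit_scores))))
    order = np.argsort(-np.abs(unit_scores))  # descending by |score|
    return sorted(order[:k].tolist())

def select_by_threshold(unit_scores: np.ndarray, beta: float) -> list:
    """Eq. (5), simplified: keep units whose score exceeds the (1 - beta)
    quantile (the paper derives the cutoff from a truncated normal)."""
    thresh = np.quantile(np.abs(unit_scores), 1.0 - beta)
    return [j for j, s in enumerate(np.abs(unit_scores)) if s > thresh]
```

Selection by rate fixes the number of kept units in every layer, while selection by threshold lets that number vary per layer; this is why the two variants produce fragments of different sizes in Section 4.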
However, there are two challenges that we need to address. First, the output dimension of the $i$-th layer changes, which means that we need to drop some parameters from the $(i+1)$-th layer. However, simply removing the parameters of the $(i+1)$-th layer without considering the layer's context is wasteful, especially when dealing with limited data. Second, the semantic space of the layer $f_i$ differs from that of the original layer $m_i$, because $f_i$ focuses only on a specific class while $m_i$ serves all classes. As a result, we cannot use distance as a loss in the same way as previous approaches [33, 34], which assume that the semantic spaces of the old and new models are the same.

To solve these problems, we propose the Repair operation and the Patch layer. As shown in Figure 3, the Repair operation fills the hole created by the removal of units with features from the Patch layer. We use $r_i^p$ to denote the layer $f_i$ followed by a Repair operation with Patch layer $p$. We can then define the model in which the $i$-th layer uses the Repair operation with patch layer $p$ as:

$$\mathcal{M}_i^p = m_L \circ \cdots \circ m_{i+1} \circ r_i^p \circ m_{i-1} \circ \cdots \circ m_1. \tag{7}$$

To address the first problem, we create a linear layer $l_i$ and let $l_i \circ f_i$ be the patch layer, denoted as $r_i^{l_i \circ f_i}$; $l_i$ is a 1x1 convolution for a convolutional layer and a fully-connected layer for a linear layer. Our basic assumption is that the selected units contain more information, so we can simulate the outputs of the dropped units from the outputs of the remaining units and fill in the missing values. In this way, we do not drop any parameters of the $(i+1)$-th layer when handling the $i$-th layer. Moreover, $l_i$ can be fused into the next linear layer, so the number of parameters is the same as directly dropping parameters of the next layer [24]. The proof is provided in Appendix B.
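A minimal numpy sketch of the Repair operation with the $l_i$ patch follows; shapes and names are our illustrative assumptions (in the paper, $l_i$ is a trained 1x1 convolution or linear layer, not an arbitrary matrix):

```python
import numpy as np

def repair(kept_out: np.ndarray, keep_idx: list, drop_idx: list,
           W_patch: np.ndarray) -> np.ndarray:
    """Rebuild a full-width layer output: positions of dropped units are
    filled with a linear function of the kept units' outputs (patch l_i).

    kept_out: (batch, n_keep) outputs of the selected units f_i(x).
    W_patch:  (n_keep, n_drop) parameters of l_i (learned in practice).
    Returns:  (batch, n_keep + n_drop) laid out in the original unit
              order, so the (i+1)-th layer consumes it unchanged.
    """
    full = np.zeros((kept_out.shape[0], len(keep_idx) + len(drop_idx)))
    full[:, keep_idx] = kept_out            # real outputs of selected units
    full[:, drop_idx] = kept_out @ W_patch  # simulated outputs of dropped units
    return full
```

Because the fill is linear, it can later be folded into the weights of the next linear layer, as the paper notes when fusing all $l_i$ into $\mathcal{F}_c$.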
To address the second problem, we define $\bar{f}_i$ as the layer made up of the units not in $\mathcal{U}_i$ and use it as the patch layer. We assume that different classes overlap because they may require some common basic abilities, as features of lower levels of abstraction emerge as we move from the back to the front of deep neural networks [35, 36]. For example, in a DNN classifier for cars, horses, and dogs, information about edges is captured first, followed by information about corners and contours, and finally information about object parts. Therefore, it is very likely that these three classes share common sub-functions, such as the function of detecting corners. Although it is unclear where the overlap between the required class $c$ and the other classes lies, we can put $\bar{f}_i$ back into the model and update the whole model using the distance loss over all samples. By doing so, when a sub-function in $\bar{f}_i$ is also a sub-function of another class, $\bar{f}_i$ will be updated. Thus, we use $\bar{f}_i$ as the patch layer, and we can minimize the distance between $\mathcal{M}_i^{\bar{f}_i}$ and $\mathcal{M}$ using $r_i^{\bar{f}_i}$.

In summary, the Adaptation loss $\mathcal{L}_i^A$ can be expressed as:

$$\mathcal{L}_i^A = \sum_{(x,y) \in \mathcal{D}} \left[ \ell_c(h \circ \mathcal{M}_i^{l_i \circ f_i}(x), y) + \gamma \cdot \|\mathcal{M}_i^{\bar{f}_i}(x) - \mathcal{M}(x)\|_2^2 \right], \quad \text{(adaptation loss)} \tag{8}$$

where $\ell_c$ is the cross-entropy, $\gamma$ is a hyperparameter, and $h$ is a classifier. All classes other than $c$ are treated as a single class. To address the resulting data imbalance, we apply a weighting scheme that places higher emphasis on class $c$: specifically, class $c$ is assigned a weight twice the aggregate weight of the remaining classes.

Combining $f_i$ layer by layer. Now that a series of extracted layers $f_i$ has been obtained, we start to construct $\mathcal{F}_c$. However, each $f_i$ was previously trained separately, so here we combine them one by one from front to end.
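The per-layer Adaptation loss of Eq. (8) combines a weighted binary (class $c$ vs. rest) cross-entropy with a distance term that keeps the patched model close to the original. A numpy sketch under our own naming, with the paper's 2:1 weighting passed in as `class_weights`:

```python
import numpy as np

def adaptation_loss(logits, y, feat_patched, feat_orig, gamma, class_weights):
    """Sketch of Eq. (8); all names are illustrative.

    logits:        (batch, 2) scores for "class c" vs. "all other classes".
    y:             (batch,) labels in {0, 1}, 1 meaning class c.
    feat_patched:  output of the model patched with unselected units, i.e.
                   M_i^{f-bar}(x); feat_orig is the original M(x).
    class_weights: per-class weights; the paper gives class c twice the
                   aggregate weight of the remaining classes.
    """
    z = logits - logits.max(axis=1, keepdims=True)            # stable log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -(class_weights[y] * log_p[np.arange(len(y)), y]).sum()
    dist = ((feat_patched - feat_orig) ** 2).sum()            # stay close to M
    return ce + gamma * dist
```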
We denote $\widetilde{\mathcal{M}}_i^p$ and the loss function $\widetilde{\mathcal{L}}_i^A$ as

$$\widetilde{\mathcal{M}}_i^p = m_L \circ \cdots \circ m_{i+1} \circ r_i^{p_i} \circ r_{i-1}^{p_{i-1}} \circ \cdots \circ r_1^{p_1}, \tag{9}$$

$$\widetilde{\mathcal{L}}_i^A = \sum_{(x,y) \in \mathcal{D}} \left[ \ell_c(h \circ \widetilde{\mathcal{M}}_i^{l \circ f}(x), y) + \gamma \cdot \|\widetilde{\mathcal{M}}_i^{\bar{f}}(x) - \mathcal{M}(x)\|_2^2 \right]. \tag{10}$$

By optimizing the loss functions from $\widetilde{\mathcal{L}}_1^A$ to $\widetilde{\mathcal{L}}_L^A$, we obtain the optimized model $\widetilde{\mathcal{M}}_L^{l \circ f}$. Because each $l_i$ is a linear layer and the Repair operation is also linear, we can fuse all $l_i$ into the adjacent linear layers and obtain $\mathcal{F}_c$.

3.3 Packing Fragments

The packing method aims to construct a new model from the fragments $\{\mathcal{F}_c\}_{c \in \mathcal{C}'}$ so that the new model has all functions in $\mathcal{C}'$. The challenge here is how to effectively combine the features generated by these fragments. One possible approach is to simply concatenate the features, but this results in a very long feature vector that is difficult to work with, especially when there are many fragments.

To obtain tight features, we minimize the reconstruction error for each fragment. Let $e$ be the fusion layer used to combine the features, and let $\{d_c\}_{c \in \mathcal{C}'}$ be the decoders used to restore the output of each fragment $\mathcal{F}_c$. The final model $\mathcal{U}_{\mathcal{C}'}$ can be expressed as:

$$\mathcal{U}_{\mathcal{C}'}(x) = e(\mathrm{concat}(\mathcal{F}_{c_1}(x), \ldots, \mathcal{F}_{c_{|\mathcal{C}'|}}(x))). \tag{11}$$

To ensure that the output of $\mathcal{U}_{\mathcal{C}'}$ contains all the information from the fragments, we minimize the distance between $\mathcal{F}_c$ and $d_c \circ \mathcal{U}_{\mathcal{C}'}$. To do this, we define the loss function as:

$$\mathcal{L}^U = \sum_{c \in \mathcal{C}'} \sum_{(x,y) \in \mathcal{D}_c} \left[ \ell_c(h_U \circ \mathcal{U}_{\mathcal{C}'}(x), y) + \delta \cdot \|d_c \circ \mathcal{U}_{\mathcal{C}'}(x) - \mathcal{F}_c(x)\|_2^2 \right], \tag{12}$$

in which $h_U$ is the classifier for the packed model and $\delta$ is a hyperparameter.

In this process, we only update the parameters of $e$, $\{d_c\}_{c \in \mathcal{C}'}$, and $h_U$, while leaving the fragments $\{\mathcal{F}_c\}_{c \in \mathcal{C}'}$ unchanged. This makes it easy to plug in or remove fragments as needed to meet new requirements, and makes the fragments highly reusable.
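Equations (11) and (12) can be sketched as follows. This is a numpy sketch, not the paper's code: `W_e`, `W_h`, and `decoders` stand in for the fusion layer $e$, classifier $h_U$, and decoders $d_c$, and in training only these would be updated while the fragment features stay frozen:

```python
import numpy as np

def pack_forward(frag_feats, W_e):
    """Eq. (11): fuse the concatenated fragment features with layer e."""
    return np.concatenate(frag_feats, axis=-1) @ W_e

def packing_loss(frag_feats, y, W_e, decoders, W_h, delta):
    """Eq. (12): cross-entropy of the packed classifier plus per-fragment
    reconstruction error between d_c(U(x)) and F_c(x)."""
    u = pack_forward(frag_feats, W_e)               # fused feature U_{C'}(x)
    logits = u @ W_h
    z = logits - logits.max(axis=1, keepdims=True)  # stable log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_p[np.arange(len(y)), y].sum()
    recon = sum((((u @ d) - f) ** 2).sum() for d, f in zip(decoders, frag_feats))
    return ce + delta * recon
```

The reconstruction term is what keeps the fused feature "tight" without losing fragment information: if every fragment's output can be decoded back from $\mathcal{U}_{\mathcal{C}'}(x)$, the short fused vector is a lossless-enough summary of the long concatenation.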
Moreover, due to the limited number of parameters in $e$, $\{d_c\}_{c \in \mathcal{C}'}$, and $h_U$, the workload required to train these components is very low, resulting in increased efficiency in practice.

4 Experiments

We conducted extensive experiments on various datasets for image classification and sentiment analysis. Our code was implemented in PyTorch and executed on an NVIDIA A100 40GB PCIe GPU with an AMD EPYC 7H12 64-Core Processor. The code will be made publicly available, and the experimental details are provided in Appendix C.

4.1 Experimental Settings

Table 1: The setting of tasks

| Dataset  | Origin models                          | Task | Target models                                                      |
|----------|----------------------------------------|------|--------------------------------------------------------------------|
| CIFAR10  | 1. {0,1,2}  2. {3,4,5}  3. {6,7,8}     | T1   | 1. {0,3,6}  2. {1,4,7}  3. {2,5,8}                                 |
| CIFAR100 | 1. {0-19}  ...  5. {80-99}             | T2   | 1. {0,20,...,80}  2. {5,25,...,85}  3. {10,30,...,90}  4. {15,35,...,95} |
|          | 1. {0,5,...,95}  ...  5. {4,9,...,99}  | T3   | 1. {0-4}  2. {25-29}  3. {50-54}  4. {75-79}                       |
|          | 1. {0-9}  ...  10. {90-99}             | T4   | 1. {0,10,...,40}  ...  10. {9,19,...,49}                           |
|          |                                        | T5   | 1. {0,10,...,90}  ...  10. {9,19,...,99}                           |
| TREC     | 1. {0,1,2}  2. {3,4,5}                 | T6   | 1. {0,3}  2. {1,4}  3. {2,5}                                       |
| SST-5    | 1. {0,1,2}  2. {0,3,4}                 | T7   | 1. {1,3}  2. {2,4}                                                 |

Datasets, Model and Task. We conducted a series of experiments on four datasets: CIFAR10, CIFAR100 [37], TREC [38], and SST-5 [39]. CIFAR10 and CIFAR100 consist of 50,000 training images and 10,000 test images, with CIFAR10 containing 10 classes and CIFAR100 containing 20 superclasses of 5 classes each. The TREC Question Classification dataset contains 5,500 training sentences and another 500 test sentences across 6 classes. The SST-5 dataset consists of 8,544 training sentences and another 2,210 test sentences across 5 classes. We use VGG16 [40] and ResNet34 [41] as original models on CIFAR-10 and CIFAR-100, and a CNN [42] and an RNN as original models on TREC and SST-5.
We design multiple tasks on each dataset, as shown in Table 1. Each task involves constructing target models based on the original models, with each task containing multiple target models with similar settings. We evaluate performance based on the average performance of the generated target models for each task. Our method slices the models into multiple fragments, one per class, and then packs the selected fragments; the other methods obtain the target model directly. For each original model, we randomly sample $k$ samples per class from the corresponding dataset with seed 0.

4.2 Experimental Results

Performance on Image data. We begin by presenting the results of our method on image data, comparing it with the finetuning method and CA-MKD [43]. The finetuning method fine-tunes one of the original models using data of the target task. CA-MKD is a multi-teacher knowledge distillation

Table 2: The performance on image datasets. S&P (R) denotes selecting units by rate, and S&P (T) denotes selecting units by threshold. The percentage value in parentheses in the parameter columns indicates the parameter count relative to the ensemble method. (failed) means that the method cannot run in the setting due to the lack of sufficient samples.
| Task | Method   | VGG16 Param (×10⁶) | k=5          | k=10         | k=20         | ResNet34 Param (×10⁶) | k=5          | k=10         | k=20         |
|------|----------|--------------------|--------------|--------------|--------------|------------------------|--------------|--------------|--------------|
| T1   | CA-MKD   | 14.7 (33.3%)       | (failed)     | (failed)     | (failed)     | 18.9 (33.3%)           | (failed)     | (failed)     | (failed)     |
|      | Finetune | 14.7 (33.3%)       | 65.47 ± 6.41 | 66.88 ± 4.47 | 66.99 ± 2.33 | 18.9 (33.3%)           | 67.29 ± 4.69 | 70.04 ± 6.74 | 71.56 ± 4.80 |
|      | S&P (R)  | 11.5 (26.1%)       | 71.13 ± 2.55 | 74.90 ± 6.56 | 74.47 ± 7.52 | 19.0 (33.5%)           | 75.43 ± 4.91 | 78.49 ± 8.15 | 80.43 ± 6.01 |
|      | S&P (T)  | 7.4 (16.8%)        | 62.20 ± 4.00 | 73.23 ± 5.38 | 76.37 ± 5.49 | 25.7 (45.3%)           | 73.57 ± 2.25 | 76.89 ± 6.39 | 80.31 ± 6.43 |
| T2   | CA-MKD   | 14.7 (20.0%)       | (failed)     | (failed)     | 63.10 ± 10.2 | 18.9 (20.0%)           | (failed)     | (failed)     | 53.35 ± 10.5 |
|      | Finetune | 14.7 (20.0%)       | 61.10 ± 8.02 | 64.00 ± 9.52 | 68.35 ± 8.21 | 18.9 (20.0%)           | 65.20 ± 9.69 | 71.40 ± 8.64 | 73.3 ± 9.60  |
|      | S&P (R)  | 19.2 (26.1%)       | 61.95 ± 12.9 | 68.65 ± 11.0 | 72.90 ± 7.79 | 31.7 (33.5%)           | 67.10 ± 13.7 | 72.55 ± 9.26 | 75.85 ± 9.92 |
|      | S&P (T)  | 27.4 (37.2%)       | 68.40 ± 12.0 | 78.10 ± 6.08 | 81.40 ± 4.15 | 58.6 (62.0%)           | 77.00 ± 10.6 | 78.90 ± 10.2 | 84.50 ± 7.92 |
| T3   | CA-MKD   | 14.7 (20.0%)       | (failed)     | (failed)     | 62.42 ± 7.37 | 18.9 (20.0%)           | (failed)     | (failed)     | 50.41 ± 10.5 |
|      | Finetune | 14.7 (20.0%)       | 60.55 ± 4.26 | 62.05 ± 3.62 | 67.00 ± 5.31 | 18.9 (20.0%)           | 64.55 ± 7.17 | 68.15 ± 5.59 | 73.55 ± 6.44 |
|      | S&P (R)  | 19.2 (26.1%)       | 64.80 ± 8.32 | 70.05 ± 8.33 | 72.15 ± 6.70 | 31.7 (33.5%)           | 72.80 ± 5.66 | 72.20 ± 5.38 | 79.40 ± 5.26 |
|      | S&P (T)  | 27.0 (36.7%)       | 71.35 ± 6.43 | 77.80 ± 7.41 | 79.55 ± 4.27 | 59.8 (63.3%)           | 75.75 ± 8.01 | 81.20 ± 4.45 | 84.65 ± 3.92 |
| T4   | CA-MKD   | 14.7 (20.0%)       | (failed)     | (failed)     | 59.08 ± 8.68 | 18.9 (20.0%)           | (failed)     | (failed)     | 47.74 ± 11.3 |
|      | Finetune | 14.7 (20.0%)       | 50.34 ± 5.93 | 54.36 ± 5.42 | 58.72 ± 5.39 | 18.9 (20.0%)           | 53.90 ± 5.63 | 59.04 ± 4.42 | 63.32 ± 5.47 |
|      | S&P (R)  | 19.2 (26.1%)       | 56.40 ± 7.23 | 63.74 ± 7.96 | 69.86 ± 7.03 | 31.7 (33.5%)           | 66.80 ± 9.56 | 69.15 ± 7.52 | 77.30 ± 6.17 |
|      | S&P (T)  | 14.8 (20.1%)       | 60.62 ± 5.98 | 65.96 ± 6.47 | 73.72 ± 5.38 | 53.1 (56.2%)           | 70.10 ± 6.15 | 74.90 ± 5.73 | 79.46 ± 4.12 |
| T5   | CA-MKD   | 14.7 (10.0%)       | (failed)     | 42.22 ± 5.19 | 50.26 ± 4.83 | 18.9 (10.0%)           | (failed)     | 34.01 ± 5.88 | 40.97 ± 6.11 |
|      | Finetune | 14.7 (10.0%)       | 34.54 ± 4.20 | 37.93 ± 4.41 | 42.11 ± 3.44 | 18.9 (10.0%)           | 40.04 ± 4.95 | 45.38 ± 5.07 | 50.81 ± 4.44 |
|      | S&P (R)  | 38.4 (26.1%)       | 45.59 ± 5.83 | 54.63 ± 6.27 | 59.22 ± 5.41 | 63.4 (33.5%)           | 50.67 ± 4.99 | 57.96 ± 5.80 | 63.40 ± 4.68 |
|      | S&P (T)  | 27.7 (18.8%)       | 48.31 ± 5.58 | 56.35 ± 5.32 | 65.03 ± 4.75 | 101.8 (53.9%)          | 59.67 ± 6.03 | 66.10 ± 4.89 | 71.73 ± 4.25 |

Figure 4: Comparison of parameter amount and accuracy.

Table 5: Results on text datasets

| Task      | Method   | Param (×10⁴) | k=5   | k=10  |
|-----------|----------|--------------|-------|-------|
| T6 (RNN)  | Ensemble | 16.0         | 80.38 | 83.47 |
|           | Finetune | 8.00         | 64.53 | 69.14 |
|           | S&P      | 6.95         | 79.79 | 81.56 |
| T7 (CNN)  | Ensemble | 72.0         | 69.11 | 73.32 |
|           | Finetune | 36.0         | 50.49 | 52.21 |
|           | S&P      | 35.3         | 67.73 | 74.02 |

Figure 6: Equivalent expansion rate of the market. The market size can be significantly magnified when considering an acceptable error rate of 0.2.

method that adaptively assigns sample-wise reliability to each teacher prediction with the help of ground-truth labels. As shown in Table 2, our method achieves better accuracy with a small number of parameters, especially S&P (T). While S&P (R) has relatively lower accuracy, its parameter count is smaller; by keeping more parameters, S&P (R) can achieve better accuracy, as detailed in the Appendix.
It is worth noting that the standard deviation of many results is higher than usual because we compute it across the different target models within a task, rather than across different seeds as in typical experimental settings. CA-MKD requires a substantial amount of data to obtain the corresponding models and cannot produce results when the number of samples is small. We also compare our results with the ensemble method, which integrates the results of all original models, as shown in Figure 4. Although the ensemble method achieves nearly 5 points higher accuracy than S&P, the resulting model is significantly larger, with 2 to 10 times more parameters than the other methods. Furthermore, for users, the ensemble method means they must purchase all models, resulting in higher costs compared to other methods. Experimental results of packing fragments with different architectures or from different datasets are shown in Appendix D.

Performance on Textual data. We evaluate the performance of our approach on the TREC and SST-5 text datasets for tasks $T_6$ and $T_7$. Table 5 presents the accuracy of our method in comparison with the ensemble and finetune methods. The ensemble method concatenates the features of the original models and trains a classifier on the concatenated feature, while the finetune method fine-tunes one of the original models. Our method achieves results comparable to the ensemble method and even outperforms it on $T_7$ when $k = 10$. The results of the finetune method show that the provided samples are insufficient to support training a new model.

Equivalent expansion rate of the Market. We now show the ability of our framework to expand markets. To do so, we randomly generate 100 different combinations of 5 classes from five different original models, where each class is drawn from a different model. These original models are the same as those used in $T_3$.
We plot the relation between the equivalent expansion rate of the market and the Market Acceptable Error for $k = 20$ in Figure 6, where the Market Acceptable Error is a standard set by market managers: only models whose error is lower than the Market Acceptable Error can be sold in the market. Our results demonstrate that our framework can effectively expand the market. When we set the market acceptable error to 0.2, the market is amplified to about 200,000 times its original size. Similarly, when we set the market acceptable error to 0.3, the market is expanded to about 600,000 times its original size, a significant improvement. Additionally, we observe that the area under the curve increases as $k$ increases, which is in line with our intuition.

Figure 7: The Intersection over Union (IoU) of the selected unit sets $\mathcal{U}_i$ between fragments varies significantly with the fragment's class. When using S&P with a threshold, we observe relatively high IoU values, as it keeps more parameters in the front layers.

Figure 8: The time spent per class as the number of user-required models increases on $T_1$.
Our method spends little time packing models, enabling a plug-and-play property.

Difference between fragments. Next, we investigate the differences between each pair of fragments from the same original model. Figure 7 shows the Intersection over Union (IoU) of each layer's selected unit set U_i across fragments in the setting of task T3. We compute the average IoU over all layers for each pair of fragments in the task's target. Here, c1, c2, ..., c5 represent the classes of the original model. Our results show that fragments of different classes differ significantly in their selection of units. Compared with selecting units by rate, selecting units by threshold yields a higher IoU. This is because it tends to maintain more parameters in the front layers and drop more parameters in the last layers, resulting in IoU values of nearly 1 for the front layers. This observation aligns with the intuition that classes share more common subfunctions in the shallow layers of the network. More details and per-layer IoU figures are provided in Appendix E.

Running time of Slice-and-Pack. We report the running time of our method on T1 using VGG16 and ResNet34 as original models. Figure 8 displays the time spent per class as the number of models required by users increases. When users need new models, our method requires only a little time to pack the required fragments compared with the time to slice models into fragments, so the slopes of the lines in the figure are small. Moreover, because the time needed to pack fragments is small, our method has a plug-and-play property: users can add or remove fragments at any time.

5 Conclusion and Future Work

In this study, we developed a novel framework called Slice-and-Pack.
This framework involves slicing existing models into fragments and packing those fragments that provide the abilities required by the user. This approach allows users to purchase models with only the required fragments, without bundled functions, at a lower cost. Additionally, it provides a commercially viable solution for markets and enables them to offer a wider variety of models, demonstrating the capabilities of a larger market. Our experiments have shown that our framework can obtain highly accurate packed models with minimal time spent, particularly on packing, which enables plug-and-play functionality. Furthermore, our framework expands the size of the market, enabling it to provide more kinds of models. For future work, we plan to explore a more flexible strategy for slicing models. Additionally, we are interested in studying how to leverage common functions or classes shared between original models.

References

[1] Zhi-Hua Zhou. Learnware: on the future of machine learning. Frontiers of Computer Science, 2016.
[2] Yoshua Bengio, Yann LeCun, and Geoffrey Hinton. Deep learning for AI. Communications of the ACM, 2021.
[3] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
[4] OpenAI. GPT-4 technical report, 2023.
[5] Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 2022.
[6] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. 2021.
[7] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
[8] Zhi-Hua Zhou and Zhi-Hao Tan. Learnware: Small models do big. CoRR, 2022.
[9] Xi-Zhu Wu, Wenkai Xu, Song Liu, and Zhi-Hua Zhou. Model reuse with reduced kernel mean embedding specification. 2021.
[10] Peng Tan, Zhi-Hao Tan, Yuan Jiang, and Zhi-Hua Zhou. Towards enabling learnware to handle heterogeneous feature spaces. 2022.
[11] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. 2010.
[12] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. 2006.
[13] Yunyun Wang, Chao Wang, Hui Xue, and Songcan Chen. Self-corrected unsupervised domain adaptation. Frontiers of Computer Science, 2022.
[14] Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. 2006.
[15] Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. 2010.
[16] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. 2013.
[17] Yang Shu, Zhi Kou, Zhangjie Cao, Jianmin Wang, and Mingsheng Long. Zoo-tuning: Adaptive transfer from a zoo of models. In ICML, 2021.
[18] Dang Nguyen, Khai Nguyen, Nhat Ho, Dinh Phung, and Hung Bui. Model fusion of heterogeneous neural networks via cross-layer alignment. arXiv preprint arXiv:2110.15538, 2021.
[19] Xingyi Yang, Daquan Zhou, Songhua Liu, Jingwen Ye, and Xinchao Wang. Deep model reassembly. 2022.
[20] Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. 2013.
[21] Ilja Kuzborskij and Francesco Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 2017.
[22] Yao-Xiang Ding and Zhi-Hua Zhou. Boosting-based reliable model reuse. In Asian Conference on Machine Learning, 2020.
[23] Peng