i Preface

Welcome to Volume 11, Number 2 of the International Journal of Design, Analysis and Tools for Integrated Circuits and Systems (IJDATICS). This volume comprises selected research papers from the International Conference on Recent Advancements in Computing in Artificial Intelligence, Internet of Things and Computer Engineering Technology (CICET), October 24-26, 2022, Taipei, Taiwan. CICET 2022 is hosted by Tamkang University amid pleasant surroundings in Taipei, a delightful city for the conference and for traveling around. CICET 2022 serves as a communication platform for researchers and practitioners, from both academia and industry, in the areas of Computing in Artificial Intelligence (AI), Internet of Things (IoT), Integrated Circuits and Systems, and Computer Engineering Technology. The main aim of CICET 2022 is to bring together software/hardware engineering researchers, computer scientists, practitioners and people from industry and business to exchange theories, ideas, techniques and experiences related to all aspects of CICET. Recent progress in Deep Learning (DL) has unleashed some of the promise of AI, moving it from the realm of toy applications to a powerful tool that can be leveraged across a wide range of industries. In recognition of this, CICET 2022 has selected AI and Machine Learning (ML) as this year's central theme. The Program Committee of CICET 2022 consists of more than 150 experts in the related fields of CICET, from both academia and industry.
CICET 2022 is organized by Tamkang University, Taipei, Taiwan, and co-organized by the AI University Research Centre (AI-URC) and the Research Institute of Big Data Analytics (RIBDA), Xi'an Jiaotong-Liverpool University, China, and supported by: Swinburne University of Technology Sarawak Campus, Malaysia; Taiwanese Association for Artificial Intelligence, Taiwan; Trcuteco, Belgium; International Journal of Design, Analysis and Tools for Integrated Circuits and Systems, International DATICS Research Group. The CICET 2022 Technical Program includes 1 invited speaker and 30 oral presentations. We are indebted to all of the authors and speakers for their contributions to CICET 2022. On behalf of the program committee, we would like to welcome the delegates and their guests to CICET 2022. We hope that the delegates and guests will enjoy the conference.

Professor Ka Lok Man, Xi'an Jiaotong-Liverpool University, China
Professor Young B. Park, Dankook University, Korea
Chairs of CICET 2022

ii Table of Contents
Vol. 11, No. 2, December 2022
_____________________________________________________________________________________
Preface ………………………………………………………………………………....... i
Table of Contents ……………………………………………………………………….. ii
_____________________________________________________________________________________
1. Shawn Ang, Law Kim Young, Zhi Qi, Zahid Akhtar, Kamran Siddique, Ka Lok Man and Jie Zhang, Attribute Based Encryption in Cloud Computing, Xiamen University Malaysia, Malaysia 1
2. Rou Lee, Zhi Qi, Zahid Akhtar, Kamran Siddique, Ka Lok Man and Jie Zhang, Automation in Cloud Migration: An Effective Study, Xiamen University Malaysia, Malaysia 7
3. Abubakar Ya'u Muhammad, Bashir D. Bala, Shamsuddeen Yusuf and Najib Hamisu Umar, A Compact-Size and Geometrically Simple Dual Band Antenna for ISM and WLAN Application, KUST Wudil, Nigeria 13
4. S. Usman, I. Abdullahi, K. G. Ibrahim, N. I. Yusuf, H. B. Yusuf and B. G.
Agaie, Multiple Linear Regression Using Cholesky Decomposition in Studying Crime Rate in Jigawa State, Nigeria, Federal University Dutse, Nigeria 17
5. Yuechun Wang, Ka Lok Man, Danny Hughes and Jie Zhang, Design and Development of Trusted Real-Time Execution Environment, Xi'an Jiaotong-Liverpool University, China 22
6. Ailyn Kency Lam Cham Kee, Yuechun Wang, Yuxuan Zhao, Jie Zhang, Erick Purwanto, Tomas Krilavicius and Ka Lok Man, Machine Learning in Healthcare: the Prediction of Diabetes Risk by ML Classification Models, Vytautas Magnus University, Kaunas, Lithuania 25
7. Syu-Jhih Jhang, Chih-Yung Chang, Shih-Jung Wu and Chia-Ling Ho, BUAS: Joint Bottom-Up Article Selection for Quick Article Similarity Identification Based on NLP, Tamkang University, Taiwan 33
8. Yuan-Lin Liang, Chih-Yung Chang and Kuo-Chung Yu, CE-SQL: A Single-Table Chinese Text-to-SQL Generation with BERT-Based Slot Filling Method, Tamkang University, Taiwan 37
9. Chien-Chang Chen, Cheng-Shian Lin, Yen-Ting Chen, Wen-Her Chen, Chien-Hua Chen, and I-Cheng Chen, Player Pair Evaluation in Rowing, Tamkang University, Taiwan 41
10. Sunusi Bala Abdullahi, Zakariyya Abdullahi Bature, Auwal Muhammad, Multimodal Biometric Recognition Network Based on Spatial-Temporal Fingerprint and Finger Vein (STMFPFV-Net) Features, King Mongkut's University of Technology Thonburi, Thailand 43

INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR INTEGRATED CIRCUITS AND SYSTEMS, VOL. 11, NO. 2, DECEMBER 2022

Attribute Based Encryption in Cloud Computing

Shawn Ang1, Law Kim Young1, Zhi Qi1, Zahid Akhtar2, Kamran Siddique1,*, Ka Lok Man3, Jie Zhang3

Abstract— Cloud computing is a computing paradigm that provides various services and computing resources to users through Cloud Service Providers over the Internet. However, storing data in an unsecured cloud may lead to security issues such as privacy violations and data leakage. Therefore, encryption schemes need to be implemented in clouds to provide a secure environment for users. One such cryptographic scheme is Attribute Based Encryption (ABE), which provides privacy and access control in the cloud and can be implemented in a Trusted Real-Time Execution Environment to achieve stronger security. This paper first outlines the various existing encryption techniques, which can be categorized into symmetric and asymmetric algorithms. It then comprehensively explores key-policy and ciphertext-policy attribute based encryption. Furthermore, this paper examines the various ABE schemes and compares them in terms of their access structures, key advantages and disadvantages. Lastly, this paper discusses the applications of key-policy and ciphertext-policy ABE.

Index Terms— Cloud, Encryption techniques, Attribute Based Encryption, Key-Policy Attribute Based Encryption, Ciphertext-Policy Attribute Based Encryption.

I. INTRODUCTION

Cloud computing offers many advantages and benefits, among them low maintenance, scalability, cost saving, and accessibility from anywhere. Despite these benefits, a number of organizations and companies still hesitate to move into cloud computing, and specifically to store big data in it, mostly because of lingering security and privacy issues surrounding the cloud [1]. The main functionality of the cloud is storing and managing data from any part of the world, regardless of the user's location and device. Big data is also stored in it for management and analysis purposes, and big data processing and analysis is fairly simple because cloud providers often support those requirements. Nevertheless, the main issues in the cloud are access control and data privacy, as keeping one's data in the cloud means that the data is held by a Cloud Service Provider (CSP) which, being a third party, may not be trustworthy [1]. The cloud service provider has more or less direct access to the personal data residing in its cloud facilities and may disclose or share it with prohibited users for profit. To make sure the stored information is secure, data should always be encrypted before being stored in the cloud. Even so, the stored data may still be reachable by all users; access control should therefore also be implemented, with access restricted and classified according to features such as a user's position and rights in the company hierarchy. Thus, there are two main things to consider when storing information and data in the cloud: data privacy and user access control.

In the first part of this paper, basic and traditional encryption techniques such as symmetric and asymmetric cryptography are introduced and elaborated to give a general idea of what cryptography is. The later part of the paper focuses on Attribute Based Encryption, which is a subset of asymmetric key encryption.

II. RELATED WORK

Symmetric Key Algorithm. It is one of the most fundamental techniques used in the cryptographic community. It uses the same key for both encryption and decryption of information. Further explanation is given in the later part of this paper.

Asymmetric Key Algorithm. Asymmetric cryptography is an evolution of the symmetric key algorithm. It uses two different keys for the encryption and decryption of information. It is elaborated in the later part of the paper.

Key Policy Attribute Based Encryption (KPABE). It is a further development of the asymmetric key approach that provides additional security for the encryption of data. Further information regarding the method is given in the following parts of the paper.

Ciphertext Policy Attribute Based Encryption (CPABE). It is a refinement of KPABE in which the encryptor has control over the data he/she encrypts [5].
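The distinction drawn above between symmetric and asymmetric algorithms can be illustrated with a toy sketch. This is not secure cryptography: the XOR cipher and the tiny RSA parameters below are purely illustrative stand-ins for the real algorithms discussed later in the paper.

```python
# Toy illustration of the two families above. NOT secure cryptography.
from itertools import cycle

# Symmetric: the SAME key both encrypts and decrypts (XOR stream-cipher toy).
def xor_cipher(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

ct = xor_cipher(b"secret data", b"key")
assert xor_cipher(ct, b"key") == b"secret data"   # same key reverses it

# Asymmetric: DIFFERENT keys (textbook RSA with tiny illustrative primes).
p, q = 61, 53
n = p * q                           # public modulus
e = 17                              # public exponent
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent (Python 3.8+)
m = 42
c = pow(m, e, n)                    # anyone can encrypt with (e, n)
assert pow(c, d, n) == m            # only the holder of d can decrypt
```

The round trips show the key asymmetry: the XOR toy needs the identical key on both sides, while the RSA toy encrypts with the public pair (e, n) and decrypts with the private exponent d.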
Manuscript received Sept. 27, 2022.
1 Department of Information and Communication Technology, Xiamen University Malaysia, Sepang 43900, Malaysia
2 Department of Network and Computer Security, State University of New York Polytechnic Institute, Utica, 13502, USA
3 School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou, 215123, China.
*Corresponding author (email: kamran.siddique@xmu.edu.my).

III. ENCRYPTION TECHNIQUES

A. SYMMETRIC KEY ALGORITHMS

A symmetric key algorithm is an encryption technique that requires only a single key to perform both encryption and decryption. The secret key can be anything, ranging from a number or a word to a random string. This secret key is used to encrypt and decrypt the data, rendering it unreadable to those who do not have it. Sender and receiver must exchange the secret key to be used in the decryption process. There are two types of symmetric key algorithms: block ciphers and stream ciphers. In block ciphers, a key of predetermined length is applied to a fixed-size plaintext block and outputs a block of ciphertext of the same length as the plaintext block. In stream ciphers, one bit is encrypted at a time with the corresponding keystream bit to obtain the ciphertext stream. Some common symmetric key algorithms are the Advanced Encryption Standard (AES), the Blowfish algorithm, and the Data Encryption Standard (DES).

1) ADVANCED ENCRYPTION STANDARD

The Advanced Encryption Standard (AES) is a symmetric block cipher encryption algorithm. The National Institute of Standards and Technology (NIST) introduced AES when a replacement was needed for the Data Encryption Standard, which was becoming increasingly vulnerable. AES comprises three block ciphers: AES-128, AES-192, and AES-256. Data is encrypted and decrypted in blocks of 128 bits, but with three different cryptographic key sizes: 128 bits, 192 bits, and 256 bits. The ciphertext is obtained by processing the plaintext through rounds of steps that include table substitution, transposition of data rows, and mixing of columns. The 128-bit key uses 10 rounds, the 192-bit key 12 rounds, and the 256-bit key 14 rounds.

2) DATA ENCRYPTION STANDARD

The Data Encryption Standard (DES) is a symmetric block cipher introduced by NIST. The DES algorithm applies the concept of the Feistel cipher, which is a multi-round cipher. It takes a 64-bit plaintext block as input and processes it through a 16-round Feistel structure to produce a 64-bit ciphertext. A 56-bit secret key is used in both the encryption and decryption processes. The encryption process consists of two permutation boxes, known as the initial and final permutations, around the 16 Feistel rounds. The DES algorithm applies a different 48-bit round key in each round, generated by the key-generation algorithm [10].

B. ASYMMETRIC KEY ALGORITHMS

Asymmetric key cryptosystems use a different concept from symmetric cryptography, which uses the same key for both encryption and decryption. In asymmetric cryptography, non-identical keys are used for encryption and decryption respectively. In this family of algorithms, each receiver owns a decryption key of its own, known as the private key, while the public key functions as the encryption key. Typically, this type of cryptographic system requires a trusted third party to declare formally that a particular public key is the property of a certain entity only.

1) DIFFIE-HELLMAN KEY EXCHANGE

It was one of the first public-key protocols, initially proposed by Ralph Merkle but named after Whitfield Diffie and Martin Hellman. It is a technique to exchange cryptographic keys securely over a public channel, and it was used for public key exchange in the early days of public-key cryptography. With this protocol, a sender and the respective receiver can establish a common secret key over a channel that is not secure, even when the two parties have no prior knowledge of each other. The key created can later be used to encrypt subsequent communications.

The process is often explained with colors. It starts with the two parties agreeing on a random starting color that can be publicly known; in this case, blue is chosen. Each individual also picks a secret color that only they know, for instance yellow and green. Alice and Bob then each combine the mutually shared color with their respective secret colors, producing a green mixture and a cyan mixture. They proceed to exchange these mixed colors publicly. Lastly, each mixes the received color with their own private color, producing a light-cyan mixture that is identical for both parties. Even if an attacker eavesdrops on the exchange, he obtains only the publicly known color, blue, and the publicly exchanged mixtures. It is computationally infeasible for the eavesdropper to discover the final color; doing so would be computationally expensive and cannot be accomplished within any reasonable time limit.

2) RSA CRYPTOSYSTEM

It was named after the three cryptographers who invented it: Ron Rivest, Adi Shamir and Len Adleman. It is one of the first and oldest asymmetric cryptosystems ever created, and yet it remains one of the most widely employed cryptosystems to this day. Two keys are present in the process of encryption and decryption, known as the public key and the private key. The public key functions as the element that encrypts messages and data, and it is known to everyone. Given an encrypted message, the receiver can decrypt it only with the corresponding private key. This asymmetric cryptography is founded on the hardness of factoring the product of two large prime numbers. For verification, the server handling the data transfer implements key authentication: it signs the message with its private key, producing what is known as a digital signature, which is then passed back to the client. The client verifies it by checking it against the server's known public key.

IV. ATTRIBUTE BASED ENCRYPTION

In this section, the fundamental concept of Attribute Based Encryption (ABE) and its algorithms are discussed. ABE was first proposed by Sahai et al. [1]. It uses one-to-many algorithms to protect information stored in the cloud. In this form of encryption, information and data are encrypted based on a particular set of attributes. The three main parties involved are the Data Owner, the Data User, and the Authority. First, the Authority produces a public key, which is sent to the owner of the data for the purpose of encryption; at the same time, it also generates a master secret key. This master key is then used to produce each user's secret key based on that user's attributes. The Data Owner encrypts the data using the public key together with its attributes, and the data is then stored in the cloud.
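The ABE workflow just described — the Authority generating a public key and a master secret key, deriving user keys from attributes, and the Data Owner encrypting under a set of attributes — can be sketched as a toy model. There is no real cryptography here; all names are illustrative, and decryption simply applies the scheme's threshold rule that at least d attribute components must match.

```python
# Toy model of the ABE roles described above: no real cryptography,
# just the workflow and the "at least d matching attributes" rule.
# All names are illustrative, not part of any real ABE library.
from dataclasses import dataclass

@dataclass
class Ciphertext:
    attrs: frozenset      # attributes the data was encrypted under
    payload: str          # stand-in for the encrypted data

class Authority:
    def __init__(self, d: int):
        self.d = d                    # threshold: minimum matching attributes
        self.public_key = "PK"        # stand-in for the public parameters
        self.master_secret = "MSK"    # kept by the Authority only

    def keygen(self, user_attrs: set) -> frozenset:
        # A user's secret key is derived from the master secret + attributes.
        return frozenset(user_attrs)

def encrypt(public_key: str, attrs: set, message: str) -> Ciphertext:
    return Ciphertext(frozenset(attrs), f"enc({message})")

def decrypt(ct: Ciphertext, user_key: frozenset, d: int):
    # Decryption succeeds only if at least d attributes match.
    if len(ct.attrs & user_key) >= d:
        return ct.payload
    return None

auth = Authority(d=2)
ct = encrypt(auth.public_key, {"finance", "manager", "hq"}, "payroll.xlsx")
alice = auth.keygen({"finance", "manager"})   # 2 matches: can decrypt
bob = auth.keygen({"intern"})                 # 0 matches: cannot
assert decrypt(ct, alice, auth.d) is not None
assert decrypt(ct, bob, auth.d) is None
```

Adding a new user, as in the text, amounts to the Authority running keygen again with a different attribute set; the stored ciphertexts are untouched.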
To decrypt and access the information, the Data User must possess the private key issued by the Authority. The decryption is only possible when at least d components of the attributes in the data match the components in the secret key. When new users are added to the system, a new secret key with different attributes is generated by the Authority. Fig. 1 displays the ABE scheme's architecture. ABE is generally divided into Key Policy Attribute Based Encryption (KPABE) and Ciphertext Policy Attribute Based Encryption (CPABE).

Fig. 1. ABE scheme's architecture

A. KEY POLICY ATTRIBUTE BASED ENCRYPTION

In this section, a few different KPABE schemes are discussed and elaborated. In 2006, KPABE, a cryptosystem for sharing encrypted data, was introduced by Goyal et al. [2]. In this encryption method, a set of attributes is embedded in the ciphertext when it is formed, and the user key carries a policy in the form of an access structure. The message can only be decrypted when the access structure is satisfied by the user's attributes. A KPABE scheme is specified by the four algorithms stated below.

Setup: A security parameter is taken as input to the setup algorithm. A public parameter PK and a master key MSK are output. These elements are known only to the private key generator (PKG). A description of the session key space K is included in the public parameters.

Encryption: A message M, a set of attributes a, and the public parameters PK are taken as input, and a ciphertext c is output.

Key Generation: The master key MSK, the public parameters PK and an access structure A are taken as input, and a decryption key DA is produced.

Decryption: This algorithm takes the encrypted ciphertext c as input. The output is the message M if the attribute set a satisfies the access structure.

KPABE uses a tree-based access structure. The leaf nodes represent the attributes and the non-leaf nodes are threshold gates, represented in the form [x,y], where x is the threshold value and y is the number of attributes under the gate. An OR gate is represented by the threshold [1,y], while an AND gate is seen as [y,y]. Only a user who satisfies the root node can decrypt and retrieve the data; otherwise, decryption is impossible. Fig. 2 displays the threshold-gate tree structure for the access policy X1 AND (X2 OR X3), where X1, X2 and X3 are taken as attributes: the gate over X2 and X3 is displayed as [1,2], representing the OR gate, while the AND gate is represented as [2,2].

Fig. 2. The threshold gate tree structure

All of the KPABE schemes mentioned so far use a monotonic access structure, meaning that negative attributes do not exist in the access policy. However, a non-monotonic access structure was introduced by Ostrovsky et al. [3]. That structure includes both positive and negative attributes, i.e., NOT is supported between attributes. Fig. 3 displays a mind map of the basic concept of a non-monotonic access structure; in that figure, X1, X2 and X3 represent the attributes of a non-monotonic tree structure for the access policy X1 AND X2 NOT X3.

Fig. 3. The mind map for a non-monotonic access structure

B. CIPHERTEXT POLICY ATTRIBUTE BASED ENCRYPTION

Attribute Based Encryption was refined by Goyal, Pandey, Sahai, and Waters in their subsequent works. They formulated two complementary forms of ABE: the Key-Policy ABE and the Ciphertext-Policy ABE [8]. KPABE has been discussed in the previous section of this paper.
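The [x,y] threshold-gate representation described above (OR = [1,y], AND = [y,y]) can be evaluated with a short recursive function. This is a minimal illustrative sketch of the access-structure logic only, not any scheme's actual cryptographic decryption.

```python
# Minimal sketch of evaluating a threshold-gate access tree: a gate [x,y]
# over y children is satisfied when at least x of them are satisfied.
# OR = [1,y], AND = [y,y]. Illustrative only; not cryptographic decryption.

def satisfied(node, user_attrs: set) -> bool:
    if isinstance(node, str):                 # leaf: a single attribute
        return node in user_attrs
    threshold, children = node                # gate: (x, [child, ...])
    return sum(satisfied(c, user_attrs) for c in children) >= threshold

# Access policy X1 AND (X2 OR X3): an AND gate [2,2] whose children are
# the attribute X1 and an OR gate [1,2] over X2 and X3 (as in Fig. 2).
policy = (2, ["X1", (1, ["X2", "X3"])])

assert satisfied(policy, {"X1", "X3"})        # X1 plus X3 satisfies the root
assert not satisfied(policy, {"X2", "X3"})    # X1 is missing, so it fails
```

Only attribute sets that satisfy the root gate would be able to decrypt, mirroring the rule stated in the text.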
In this section, CPABE is discussed along with a few of its different schemes. The first CPABE scheme was introduced by Bethencourt et al., where the ciphertext is linked with access structures and the users' private keys with attributes. A CPABE scheme mainly consists of four algorithms: Setup, Encrypt, Key Generation, and Decrypt. A fifth algorithm called Delegate can also be added to the scheme.

Setup: The setup algorithm takes the security parameter as input and outputs the public parameters, a Public Key (PK) and a Master Secret Key (MK).

Encrypt: The encryption algorithm takes the PK and an access structure to encrypt a message. A ciphertext is then produced such that only a person whose attributes satisfy the access structure can perform the decryption.

Key Generation: The key generation algorithm generates the secret keys, using the MK and a set of attributes associated with the key as input.

Decrypt: The decryption algorithm decrypts the ciphertext, which embeds the access policy, using the private key for a set of attributes. The ciphertext is decrypted back to the original message when the set of attributes satisfies the access structure.

Delegate: Delegation takes the secret key for a set of attributes as input and outputs a new key for a subset of those attributes [7][9].

The work of Bethencourt et al. [7] was then improved by Cheung et al. in terms of its security proof; it was proved secure under the Decisional Bilinear Diffie-Hellman (DBDH) assumption. The access structure consists of an AND gate over positive and negative attributes, using a don't-care condition to identify attributes that are not in the AND gate. However, in this scheme the ciphertext and key sizes grow with the number of attributes. Goyal et al. [7] later introduced an improved scheme which uses a bounded access tree, with its security proved under the DBDH assumption. However, users might be forced to use an access tree of lower depth than needed, as the depth has to be specified in the setup stage.

The schemes above generate ciphertexts whose size is linear in the number of attributes. In recent years, Li et al. [7] introduced an improved version of CPABE by implementing a new access structure using the Ordered Binary Decision Diagram (OBDD). This scheme has a non-monotonic access structure, which supports AND, OR, as well as NOT among the attributes. The input consists of Boolean variables X1, X2, …, Xn, where each variable represents an attribute. The leaf node value of the tree determines access, with the traversal starting from the root node down to a leaf node according to the values of the attributes. Each subtree tests another attribute and has two further subtrees, until a leaf node is reached. Fig. 4 shows the representation of the access policy X1 V X2 using a binary decision diagram, and Fig. 5 shows its reduced version. The access tree can be reduced in two ways: by identifying and omitting redundant variables, or by identifying identical subtrees and letting them be shared.

Fig. 4. Access policy X1 V X2 using binary decision

Fig. 5. Reduced version of Fig. 4

V. COMPARISON

The ABE schemes mentioned above are compared in terms of their access structures, advantages and disadvantages.

1) Scheme: Sahai et al. [1]
The access structure of this scheme is monotonic. It is categorized under KPABE. Its advantages compared to the other schemes are that it offers fine-grained access control and that it uses one-to-many public key encryption. Its disadvantages are that the computational cost of the scheme is very high and that the threshold policy it uses is not very expressive.

2) Scheme: Goyal et al. [2]
The access structure of this scheme is also monotonic, and the scheme is likewise categorized under Key Policy Attribute Based Encryption. Its advantage is that the user private key is defined over a tree access structure, and the computational complexity is greatly improved compared to Sahai's scheme. However, its disadvantage is that the scheme does not allow negative constraints.

3) Scheme: Ostrovsky et al. [3]
This scheme uses a non-monotonic access structure and is also categorized under KPABE. Its advantage is that the access structure can include negative attributes. Its disadvantage, however, is that it incurs more computational overhead.

4) Scheme: Bethencourt et al. [7]
This scheme uses the same monotonic access structure but is categorized under Ciphertext Policy Attribute Based Encryption. Its advantage is better performance on messages protected by access structures. Its disadvantage is that its proven security is weaker than that of the other models.

5) Scheme: Goyal et al. [7]
This scheme uses a monotonic access structure and is a Ciphertext Policy Attribute Based Encryption scheme. Its advantage is that it uses bounded tree access, which supports various access formulas. However, its access tree depth is bounded and must be specified in the setup phase.

6) Scheme: Li et al. [7]
This Ciphertext Policy Attribute Based Encryption scheme introduced a non-monotonic access structure. Its advantage is improved efficiency and performance, as it uses the ordered binary decision access structure. However, this scheme does not support revocation.

VI. APPLICATION OF ABE

In this section, the applications of the two complementary forms of ABE, KPABE and CPABE [7], are discussed.

A. KPABE

The encryption of data in KPABE takes a set of descriptive attributes, with the secret key associated with an access structure, making it suitable for the following applications:

1) FORENSIC APPLICATION
Information and evidence are stored associated with a set of descriptive attributes, such as ID, date and time, name, and a short description. The information can then be accessed only by an authorized analyst.

2) NETWORK AUDIT LOG APPLICATION
The attributes associated with the logs stored in a network audit include IP address, username, date and time, etc. An authorized admin can retrieve the needed records based on these attributes.

In both of these applications, the encryption is done based on descriptive attributes. KPABE is unsuitable for applications where the owner needs to have some control over the data.

B. CPABE

Encryption of data in CPABE embeds an access structure in the ciphertext, while the secret key is associated with a set of attributes. Some applications of CPABE include:

1) COLLABORATIVE PROJECT DEVELOPMENT
CPABE allows only the individuals involved in a project to gain access to its data.

2) VIDEO SURVEILLANCE SYSTEM
Due to privacy issues, access to the video data has to be restricted to authorized personnel such as the security officer or the supervisor.

3) DATA ACCESS IN ORGANIZATIONS
An organization needs to limit access to certain data so that only authorized people or higher-ups can view it. For instance, an employee from a department can only view data from their own department and is restricted from other data, whereas a manager may be able to view data from all departments.

4) UNIVERSITY ACADEMIC AFFAIRS MANAGEMENT
Contents shared within a course can only be accessed by students who have enrolled in that particular course. After the end of the course, the students are restricted from accessing those materials using the delegation algorithm of CPABE.

CPABE is not suitable for applications that require scalability, as it does not support it.

VII. CONCLUSIONS

Cloud computing has undoubtedly engraved itself in the digital world. A great deal of information and data is stored and processed using cloud facilities and the services they provide [6]. Without the cloud, many organizations and corporations would go out of business or have to rework their whole business operating model. All the data transferred back and forth through the cloud needs to be protected and encrypted to ensure the confidentiality, integrity and authority of the data itself; encryption techniques are used for this very purpose. The basic idea and types of encryption are introduced in the first section. The following sections explore the topic of the paper itself, attribute-based encryption (ABE) for cloud computing, at greater length. ABE is further categorized into KPABE and CPABE. From the paper it can be seen that CPABE performs better than KPABE, as it gives full control over the data to the data owner. A thorough comparison is made of all the encryption schemes mentioned in the paper, and their applications are also discussed and elaborated. These encryption techniques continue to improve, and at a greater speed, with the advances of technology, which is proving to be a good thing. By implementing these encryption techniques in the Trusted Real-Time Execution Environment, very strong and practical security can be achieved in the cloud.

ACKNOWLEDGEMENT

This work was supported by the Xi'an Jiaotong-Liverpool University (XJTLU) AI University Research Centre, Jiangsu (Provincial) Data Science and Cognitive Computational Engineering Research Centre at XJTLU under Grant XJTLU-REF-21-01-002.
ACKNOWLEDGEMENT
This work was supported by the Xi'an Jiaotong-Liverpool University (XJTLU) AI University Research Centre and the Jiangsu (Provincial) Data Science and Cognitive Computational Engineering Research Centre at XJTLU under Grant XJTLU-REF-21-01-002.

REFERENCES
[1] A. Sahai, B. Waters. (2005). Fuzzy identity-based encryption. In: Theory and Applications of Cryptographic Techniques, Springer Berlin Heidelberg, pp. 457-473. https://doi.org/10.1007/11426639_27
[2] V. Goyal, O. Pandey, A. Sahai, B. Waters. (2006). Attribute-based encryption for fine-grained access control of encrypted data. In: Proceedings of the 13th ACM Conference on Computer and Communications Security, ACM, pp. 89-98. https://doi.org/10.1145/1180405.1180418
[3] R. Ostrovsky, A. Sahai, B. Waters. (2007). Attribute-based encryption with non-monotonic access structures. In: ACM Conference on Computer and Communications Security, ACM, pp. 195-203. https://doi.org/10.1145/1315245.1315270
[4] Wang, C., & Liu, Y. (2009). A Secure and Efficient Key-Policy Attribute Based Key Encryption Scheme. 2009 First International Conference on Information Science and Engineering. doi:10.1109/icise.2009.157
[5] Suryawan, G. T., Linawati, & Andika, S. (2019). Ciphertext-Policy Attribute Based Encryption Performance on Notebook Device. 2019 International Symposium on Electronics and Smart Devices (ISESD). doi:10.1109/isesd.2019.8909472
[6] Lumb, I., Choi, E., Rimal, B. P. (2009). A taxonomy and survey of cloud computing systems. In: International Conference on Networked Computing and Advanced Information Management, IEEE, pp. 44-51. http://dx.doi.org/10.1109/NCM.2009.218
[7] P, P. K., P, S. K., & P.J.A., A. (2018). Attribute based encryption in cloud computing: A survey, gap analysis, and future directions. Journal of Network and Computer Applications, 108, 37-52. doi:10.1016/j.jnca.2018.02.009
[8] Waters, B. (2011). Ciphertext-Policy Attribute-Based Encryption: An Expressive, Efficient, and Provably Secure Realization. Public Key Cryptography – PKC 2011, Lecture Notes in Computer Science, pp. 53-70. doi:10.1007/978-3-642-19379-8_4
[9] Bethencourt, J., Sahai, A., Waters, B. (2007). Ciphertext-Policy Attribute-Based Encryption. IEEE Symposium on Security and Privacy (SP '07), Berkeley, CA, pp. 321-334. doi:10.1109/SP.2007.11
[10] Chatterjee, R., Roy, S. (2017). Cryptography in Cloud Computing: A Basic Approach to Ensure Security in Cloud. International Journal of Engineering Science and Computing, 7(5), 11818-11821.

INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR INTEGRATED CIRCUITS AND SYSTEMS, VOL. 11, NO. 2, DECEMBER 2022

Automation in Cloud Migration: An Effective Study
Rou Lee 1, Zhi Qi 1, Zahid Akhtar 2, Kamran Siddique 1,*, Ka Lok Man 3, Jie Zhang 3
1 Xiamen University Malaysia, Sepang, Malaysia; 2 State University of New York Polytechnic Institute, New York, USA; 3 Xi'an Jiaotong-Liverpool University, Suzhou, China
Manuscript received Sept. 25, 2022.

Abstract— When conducting a cloud migration in a Trusted Real-Time Execution Environment, it is important to follow certain standards and constraints while modifying them according to the needs of the migration. Migration frameworks were introduced to provide such standards and the steps to be followed for a migration. However, current migration frameworks no longer fully satisfy the needs of cloud migration: resources are increasing, and current migration frameworks do not proficiently support the demand. Thus, researchers are developing more automated cloud migration frameworks that help in reducing the cost, time and manpower, and in increasing the efficiency of conducting cloud migration. This paper concisely addresses cloud migration and the stages of conducting it, and introduces various automated cloud migration frameworks along with a detailed analysis.

Index Terms— Cloud Migration, Automation, Migration Framework

I. INTRODUCTION
There is a risk that has been revolving around the topic of cloud computing in recent years: cloud lock-in [1]. Cloud lock-in is a risk where consumers are unable to migrate to another cloud environment after adapting to one cloud environment [2]. This risk is introduced by the demands of consumers who wish to move from one cloud environment to another, because the services provided by another cloud provider are better than those of the current provider. In order to gain more customers, cloud providers offer better services such as storage, SMTP support, and price. Thus, an increasing number of enterprises aim to pursue better services by migrating to a better platform. Besides migration from one cloud to another, there also exists the demand of migrating from on-premise to cloud.

To cater to this problem and demand, migration frameworks have been introduced. With these frameworks, consumers are able to perform a migration by adhering to the standards and procedures in the framework. However, with the advancement of technology, current migration frameworks no longer satisfy the needs of consumers. There is an increment in the number of applications, resources, and data undergoing migration [3], which shows that current migration strategies no longer support the migration in terms of cost and performance.

In 2016, the very first automated migration framework was introduced: a framework that works based on artificial intelligence (AI). AI-based frameworks are introduced to maximize performance, maximize results and reduce the cost needed to perform a migration efficiently. Cloud migration can be categorized into three approaches [2], which include:
1. Re-host: migration that involves moving the resources without modifying any code.
2. Re-platform: migration that involves moving the resources with a little effort in upgrading to cater to the cloud infrastructure.
3. Re-factor: refactoring the code to cater to the functionality and the resources of the cloud infrastructure.

Furthermore, cloud migration can be categorized into four stages, which include discovery, planning, migration, and quality assurance [4]. In all these stages, different algorithms and frameworks are used to maximize the positive effect of cloud migration. Before the introduction of artificial intelligence, traditional frameworks were used to guide the migration processes. However, challenges arise while making use of these frameworks. The challenges include the difficulty of coping with the increasing amount of migrating resources, the increasing cost of migration, the performance of the resources during migration, and the service interruption during migration that leads to an unwanted migration [5].

In this paper, we introduce various automated migration frameworks and perform a detailed analysis of them to allow readers to have a better overview of automated cloud migration.
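The three approaches above differ mainly in how much the workload is changed on the way to the cloud. A minimal sketch of this classification (the decision rule and names are our own illustration, not part of any cited framework):

```python
# Illustrative mapping of a workload to one of the three migration
# approaches described above. The decision rule is a simplification
# for illustration only, not a cited framework's logic.
from enum import Enum

class Approach(Enum):
    RE_HOST = "re-host"          # move as-is, no code changes
    RE_PLATFORM = "re-platform"  # minor upgrades to fit the cloud infrastructure
    RE_FACTOR = "re-factor"      # rewrite code for cloud-native functionality

def choose_approach(needs_code_changes: bool, needs_platform_upgrade: bool) -> Approach:
    if needs_code_changes:
        return Approach.RE_FACTOR
    if needs_platform_upgrade:
        return Approach.RE_PLATFORM
    return Approach.RE_HOST

print(choose_approach(False, False).value)  # re-host
print(choose_approach(False, True).value)   # re-platform
print(choose_approach(True, True).value)    # re-factor
```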
The rest of the paper is organized as follows: Section I is the introduction; Section II is the literature review; Section III discusses the stages of cloud migration; Section IV depicts automated cloud migration in the different stages; Section V describes the challenges and future work; Section VI is the conclusion.

II. RELATED WORK
Narantuya et al. [2] proposed strategies that solve the problem of cloud lock-in. They also proposed a mechanism that reduces the downtime of resources while they are being moved. The study successfully developed a framework that enables multiple virtual machines (VMs) to migrate from one cloud to another based on network traffic dependencies. However, this framework does not support migration between public clouds, such as Google Cloud and Amazon.

Lin et al. [6] studied a framework and tool to automate migration planning. In their studies, they highlight the importance of automating the migration and propose solutions for migration planning to cope with a large number of migration resources. However, in this study the researchers focus on the migration of servers; they intend to focus on business application transformation in the future.

Maja and Hwang [4] depicted AI-based migration techniques that help in proposing the migration plan, and the challenges faced in the planning processes when using such techniques.

Al-Kiswany et al. [7] discussed automation in cloud migration from one cloud provider to another by utilising virtual images. It involves grouping servers by similarity and migrating the resources according to those groups. The researchers increase the efficiency of cloud migration by migrating the resources as VM images.

Bai et al. [8] depicted the importance of developing a wave plan for migration. The researchers also introduce a Kullback-Leibler (KL) divergence-based method that allows systematic and efficient discovery of relationships between servers, or between servers and applications. This discovery is useful in migration planning.

Beserra et al. [9] introduced a decision tool to serve as a step-by-step guide for legacy applications migrating to a cloud platform. This decision tool, namely CloudStep, is able to analyze potential organizational and technical constraints of a migration and evaluate them to come up with a proper solution.

Menzel et al. [17] discussed the migration of web applications to public clouds using a genetic-based algorithm and the CloudGenius framework. However, the authors did not address re-evaluating the migration progress.

III. STAGES OF CLOUD MIGRATION
Figure 1. High-level description of the stages in cloud migration: Discovery (identify resources; analyze compatibility with the target environment), Planning (determine migration tasks; categorize tasks according to types of resources; formulate a timeline), Migration (execution of the plan; pre-configuration of related settings; backup), Quality Assurance (making sure the resources are functional in the target environment; post-configuration).

The resources involved in cloud migration are the data of the company and the services utilized by the company. The techniques and resources involved in each stage require detailed analysis and are up to the judgement of the expert. For a cloud migration to be carried out successfully, it involves the following stages: discovery, planning, migration and quality assurance [4]. The processes of migration are depicted in Figure 1.

Stage 1 of cloud migration is discovery. The tasks in this stage involve the identification of the resources required to be migrated, the analysis of the environment that is suitable for the identified resources, and the analysis of the compatibility of the resources with the target environment. For example, Company A identifies that the data of the company should be migrated, and also evaluates the environment that is suitable for the data to be kept. If the environment of the targeted cloud is not suitable for the data, then identification of the changes needed to adapt to the new environment is also required.

Stage 2 of cloud migration is planning. In this stage, the migration tasks are scheduled. Similar resources and workloads are separated into different entities to ease the process of migration. In this stage, the scope of the migration should be confirmed with the client to avoid any miscommunication. The plan devised in this stage should be followed during the later stages of the migration.

Stage 3 of cloud migration is migration, which involves the execution of the plan. Before the execution of the devised plan, the timeline, the task distribution and the resources have already been identified and confirmed. Pre-configuration is done beforehand to set up the migration environment, prepare the network of the target environment and configure the IPsec and VLAN needed for the migration. In this stage, backup is also conducted.

Stage 4 of cloud migration involves the quality assurance of the migrated resources and the environment. The quality assurance team is involved in this step to ensure the migrated resources are functional and operational in the target environment. Any bug found in this stage will be discovered by the quality assurance team and fixed by the migration team.

In all the stages mentioned, a large number of people is required to analyze and execute the cloud migration. Given the amount of migration happening nowadays, it is deduced that there is a set of standard procedures that can be utilized to carry out the migration. With automation, the expert no longer has to follow up on each step to analyze and configure the related settings manually; all these tasks and jobs can be done through automation in cloud migration.

IV. AUTOMATION IN CLOUD MIGRATION
Driven by the aim of improving the performance of cloud migration, automation in cloud migration was introduced. Researchers have shown that automation in cloud migration will increase the efficiency and reduce the workload of performing a cloud migration [10]. To reduce the cost and time of conducting an effective migration and to increase its speed, researchers have come up with various automation tools and algorithms. In the following subsections, the tools and algorithms that have been introduced to perform automated cloud migration are depicted.

A. Automation in Cloud Migration Stage I: Discovery
The main task of the discovery stage is to categorize servers according to their functionality and attributes, in order to formulate the wave plan that will be executed during the execution stage [4]. In this stage, the best practice is to group the servers according to their properties and conventions of communication, then further categorize the servers into smaller groups using a weighting process. The server grouping is then validated against the expected results provided by a domain expert, such as the cloud migration architect.

Figure 2. ALDM Framework

To carry out the task mentioned, researchers have overlaid algorithms and tools onto an existing framework, namely the Analytics for Logical Dependency Mapping (ALDM) discovery framework, shown in Figure 2 [11]. The discovery starts by identifying information about the servers by utilizing the front-end tools and back-end tools.
The front-end components are made of a lightweight discovery kit, which works as the middleware between the migrator and the migration process. The back-end of this implementation is composed of the data processing engine that is run to analyze and process the data that is gathered. During data processing, the processed data is grouped into two main categories, static and dynamic. Static data depicts the basic information of the server, consisting of the operating system (OS), IP address and information about related hardware, whereas dynamic data depicts detailed information such as the I/O, ports, memory, CPU and every related dependency between the servers. All the discovery is automated and converted into XML files for further analysis.

The real challenge of the discovery stage is to conduct the server grouping by utilizing the information collected. To complete this challenge, researchers utilized the tools introduced for the ALDM framework to discover the similarity of the information. This process is automated by using the tools and a similarity matrix. Weights are assigned to each server automatically by calculation, and the final result of this process is stored. This automated discovery achieves high accuracy, and the proposed method is highly encouraged to be utilized.
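The similarity-driven grouping described above can be sketched as follows; the attributes, the Jaccard similarity measure and the threshold are our own illustrative choices, not the actual ALDM tooling:

```python
# Illustrative discovery-stage grouping: build pairwise similarities over
# discovered server attributes/dependencies, then merge servers whose
# similarity exceeds a threshold. Measure and threshold are illustrative.
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Similarity of two attribute sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 1.0

def group_servers(servers: dict, threshold: float = 0.5):
    """servers: name -> set of discovered attributes/dependencies."""
    groups = {name: {name} for name in servers}        # start with singletons
    for s1, s2 in combinations(servers, 2):
        if jaccard(servers[s1], servers[s2]) >= threshold:
            merged = groups[s1] | groups[s2]           # merge the two groups
            for member in merged:
                groups[member] = merged
    return {frozenset(g) for g in groups.values()}

servers = {
    "web1": {"linux", "nginx", "db1"},
    "web2": {"linux", "nginx", "db1"},
    "db1":  {"linux", "postgres"},
}
print(group_servers(servers))  # web1/web2 grouped together; db1 alone
```

A production tool would weight each attribute (as in the weighting process described above) rather than treating them uniformly; the uniform Jaccard measure keeps the sketch short.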
cloud migration. The main goal of automation in this stage is to Researchers have proved that automation in cloud migration reduce the number of human resources needed for planning. will increase the efficiency and reduce the workload in 9 INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR INTERGRATED CIRCUITS AND SYSTEMS, VOL. 11, NO. 2, DECEMBER 2022 Figure 4. An architectural diagram of CRPT Cloud Readiness Planning Tool (CRPT), which shown in continuous analysis in the system, server or resources migrated Figure 4 is a competent framework that automate the planning for detecting and ensuring the quality of migration. All instance process. This tool utilized AI and conducted supervised in the cloud will interact with the UI functionality provided by learning to constantly improve the feasibility of the plan. The AQUASI to perform quality assurance [12]. high-level processes for automated planning include In Figure 3, it shows the AQUASI system, which consists of classification of migration types: re-host, re-platform, re-factor the edge devices and backbone service. The diagram shows the and automation of planning processes. relationship between the devices and the services and depict the Classification of migration types is automated by providing data and sources in each entity. the information such as CPU, CPU capacity and memory that are discovered during stage 1. The classification is done by V. CHALLENGES AND FUTURE WORK adhering to an active learning paradigm with the help of a transition expert. Automation in migration has brought tons of benefits for After the classification of types of the migration, an migration. AI is a field that continually undergoing automated planning is carried out. The process is carried out improvement and commitment. Automation in migration with the usage of two main generators, Problem Language planning can be reached out to other types of migration such as Generator and Planning Domain Language Generator [16]. 
C. Automation in Cloud Migration Stage III: Migration
The migration process is automated in nature: by following the plan formulated in stage 2, the migration can be executed successfully.

D. Automation in Cloud Migration Stage IV: Quality Assurance
The quality assurance stage is of utmost importance in determining the success of the migration. It also serves as the procedure to detect any problems that arise after migration. AQUASI, a framework built on knowledge in quality assurance, serves as the medium that ensures and assures the result of the migration. This framework provides continuous analysis of the system, servers or resources migrated, for detecting and ensuring the quality of the migration. All instances in the cloud interact with the UI functionality provided by AQUASI to perform quality assurance [12].

Figure 3. AQUASI system

Figure 3 shows the AQUASI system, which consists of the edge devices and the backbone service. The diagram shows the relationship between the devices and the services and depicts the data and sources in each entity.

V. CHALLENGES AND FUTURE WORK
Automation in migration has brought many benefits. AI is a field that is continually undergoing improvement and commitment. Automation in migration planning can be extended to other types of migration, such as live migration and virtual migration. Researchers could identify better or more suitable frameworks to reduce the manpower needed for migration.

The challenges that have been addressed for automation in cloud migration continually drive this field towards better results. The challenges addressed include the heterogeneity of the source environment and the target environment [13], which makes cloud migration challenging. Besides, other challenges include unexpected events and actions that may directly or indirectly affect the migration process. For example, researchers have identified an automated planning process for migrations with a large number of servers [14], but there are many more cases which require effort from researchers due to differences in limitations and resources.

Also, security is a major issue of cloud computing, and future study may include the evaluation of frameworks for ensuring security in automated cloud migration [20].
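A continuous post-migration check of the kind the quality-assurance stage performs can be sketched as a simple comparison of expected against observed state; the checks, servers and configuration fields here are hypothetical, not AQUASI's actual interface:

```python
# Illustrative quality-assurance sweep after a migration: compare the
# expected state of each migrated resource against what is observed in
# the target environment. Servers and config fields are hypothetical.
def qa_report(expected: dict, observed: dict) -> dict:
    """Return per-server status: 'ok', 'mismatch', or 'missing'."""
    report = {}
    for server, config in expected.items():
        if server not in observed:
            report[server] = "missing"
        elif observed[server] != config:
            report[server] = "mismatch"   # hand off to the migration team to fix
        else:
            report[server] = "ok"
    return report

expected = {"web1": {"port": 443}, "db1": {"port": 5432}}
observed = {"web1": {"port": 443}, "db1": {"port": 5433}}
print(qa_report(expected, observed))  # {'web1': 'ok', 'db1': 'mismatch'}
```

Run periodically, this is the "continuous analysis" pattern: anything reported as `mismatch` or `missing` is routed back to the migration team, mirroring the hand-off between the quality-assurance and migration teams described in stage 4.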
VI. CONCLUSION
In this paper, we discussed cloud migration, the steps of conducting a cloud migration, and some of the automation in cloud migration. We depicted various frameworks that can be utilized in the various stages of cloud migration.

Cloud migration in a Trusted Real-Time Execution Environment is the center of attention of the technology community, as it provides space for improvement, especially over current and existing frameworks. Automated cloud migration has led to a new era in which manpower is no longer the deciding factor in cloud migration. With the deduced implementation, migration has become easier than ever. However, there are certain areas that have not yet been explored for automated cloud migration. AI is also a new field in technology; by constantly studying and learning, AI may develop or come up with better plans in the future. As enterprises and technology evolve, more research is needed to address the problems and issues in cloud migration. As a suggestion, future research may focus on two directions, i.e., the effectiveness of automation, and security.

ACKNOWLEDGMENT
The work was conducted in part using the resources of Xiamen University Malaysia. The authors thank Xiamen University Malaysia for the opportunity to complete this work. This work was also supported by the Xi'an Jiaotong-Liverpool University (XJTLU) AI University Research Centre and the Jiangsu (Provincial) Data Science and Cognitive Computational Engineering Research Centre at XJTLU under Grant XJTLU-REF-21-01-002.

REFERENCES
[1] Opara-Martins, J., Sahandi, R., & Tian, F. Critical analysis of vendor lock-in and its impact on cloud computing migration: a business perspective. Journal of Cloud Computing, 5(1), 1-18. doi:10.1186/s13677-016-0054-z
[2] Narantuya, J., Zang, H., & Lim, H. (2017). Automated cloud migration based on network traffic dependencies. IEEE Conference on Network Softwarization (NetSoft) (pp. 1-4). Bologna, Italy: IEEE. doi:10.1109/NETSOFT.2017.8004235
[3] Cisco. (2018). Cisco Global Cloud Index: Forecast and Methodology, 2016–2021 (Report Number: 1513879861264127) [White Paper]. Retrieved Oct 27, 2019, from https://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.html
[4] Maja, V., & Hwang, J. (2016). Cloud migration using automated planning. 2016 IEEE/IFIP Network Operations and Management Symposium (NOMS 2016), 96-103. doi:10.1109/NOMS.2016.7502801
[5] Suleman, A. (2018). The best cloud migration path: lift and shift, replatform or refactor? Retrieved Oct 27, 2019, from https://www.forbes.com/sites/forbestechcouncil/2018/03/23/the-best-cloud-migration-path-lift-and-shift-replatform-or-refactor/#37300c744f51
[6] Lin, C., Sun, H., Hwang, J., Maja, V., & John, R. (2019). Cloud Readiness Planning Tool (CRPT): an AI-based framework to automate migration planning. 2019 IEEE 12th International Conference on Cloud Computing (CLOUD) (pp. 58-62). Milan, Italy: IEEE. doi:10.1109/CLOUD.2019.00021
[7] Al-Kiswany, S., Subhraveti, D., Sarkar, P., & Ripeanu, M. (2011). VMFlock: Virtual machine co-migration for the cloud. Proceedings of the 20th International Symposium on High Performance Distributed Computing, 159-170. doi:10.1145/1996130.1996153
[8] Bai, K., Ge, N., Jamjoom, H., Jan, E.-E., Renganarayana, L., & Zhang, X. (2013). What to discover before migrating to the cloud. 2013 IFIP/IEEE International Symposium on Integrated Network Management (IM 2013), 320-327.
[9] Beserra, P. V., Camara, A., Ximenes, R., Albuquerque, A. B., & Mendonça, N. C. (2012). Cloudstep: A step-by-step decision process to support legacy application migration to the cloud. 2012 IEEE 6th International Workshop on the Maintenance and Evolution of Service-Oriented and Cloud-Based Systems (MESOCA), 7-16.
[10] Linthicum, D. S. (2016). Moving to Autonomous and Self-Migrating Containers for Cloud Applications. IEEE Cloud Computing, 3(6), 6-9. doi:10.1109/MCC.2016.122
[11] Nidd, M., Bai, K., Hwang, J., Vukovic, M., & Tacci, M. (2015). Automated business application discovery. 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM 2015): Short Paper, 794-797.
[12] Kornmayer, H., & Salama, A. (2017). AQUASI - an automated quality assurance application platform for SMEs in handcraft industries. 2017 IEEE 1st International Conference on Cognitive Computing, 1, 80-87. doi:10.1109/IEEE.ICCC.2017.18
[13] Varghese, B., & Buyya, R. (2018). Next generation cloud computing: New trends and research directions. Future Generation Computer Systems, 79, 849-861. doi:10.1016/j.future.2017.09.020
[14] Hajjat, M., Sun, X., Sung, Y.-W. E., Maltz, D., Rao, S., Sripanidkulchai, K., & Tawarmalani, M. (2010). Cloudward bound: Planning for beneficial migration of enterprise applications to the cloud. SIGCOMM Comput. Commun. Rev., 40(4), 243-254.
[15] Zhang, J., Renganarayana, L., Zhang, X., Ge, N., Bala, V., Xu, T., & Zhou, Y. (2014). Encore: Exploiting system environment and correlation information for misconfiguration detection. Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14), ACM, 687-700.
[16] McDermott, D., Ghallab, M., Howe, A., Knoblock, C., Ram, A., Veloso, M., Weld, D., & Wilkins, D. (1998). PDDL - the planning domain definition language. The International Conference on Artificial Intelligence Planning Systems (AIPS-98) Planning Competition Language Specifications.
[17] Menzel, M., Ranjan, R., Wang, L., Khan, S. U., & Chen, J. (2015). CloudGenius: A hybrid decision support method for automating the migration of web application clusters to public clouds. IEEE Transactions on Computers, 64(5), 1336-1348.
doi:10.1109/TC.2014.2317188
[18] Mann, V., Vishnoi, A., Iyer, A., & Bhattacharya, P. (2012). VMPatrol: Dynamic and automated QoS for virtual machine migrations. 2012 8th International Conference on Network and Service Management (CNSM) and 2012 Workshop on Systems Virtualization Management (SVM) (pp. 174-178). Las Vegas, NV, USA: IEEE. Retrieved Nov 3, 2019, from https://ieeexplore.ieee.org/document/6380009
[19] Kolb, S., Lenhard, J., & Wirtz, G. (2015). Application migration effort in the cloud - the case of cloud platforms. 2015 IEEE 8th International Conference on Cloud Computing (pp. 41-48). doi:10.1109/CLOUD.2015.16
[20] Alkhalil, A., & Sahandi, R. (2013). Migration to cloud computing - the impact on IT management and security. International Workshop on Cloud Computing and Information Security (CCIS 2013) (pp. 196-200). doi:10.2991/ccis-13.2013.46

INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR INTEGRATED CIRCUITS AND SYSTEMS, VOL. 11, NO. 2, DECEMBER 2022

A Compact-Size and Geometrically Simple Dual-band Antenna for ISM and WLAN Application
1 Abubakar Ya'u Muhammad, 1 Bashir D. Bala, 1 Shamsuddeen Yusuf and 2 Najib Hamisu Umar
1 Department of Electrical Engineering, Kano University of Science and Technology, Kano, Nigeria
2 National Agency for Science and Engineering Infrastructure (NASENI), Abuja, Nigeria
abubakarymuhammd93@gmail.com, bdbala2@gmail.com, sywaliyyi@gmail.com, umarnajibhamisu@gmail.com

Abstract— A compact and geometrically simple patch antenna is proposed for ISM and WLAN applications. The proposed antenna contains multiple stub insertions to improve performance. The performance of the antenna is improved in terms of bandwidth, return loss and shift in frequency. The dual-band antenna offers operational bands at 2.45GHz (IEEE 802.11b/g) and 5.25GHz (IEEE 802.11a). The proposed dual-band antenna is embedded over substrate material FR4 with an overall size of 33mm × 18mm × 0.7mm. Moreover, a comparison of the proposed antenna with the state-of-the-art is performed to show the potential of the proposed work.

Keywords— Compact antenna, WLAN, ISM, 5G

I. INTRODUCTION
Dual-band antennas operating over the ISM and WLAN bands have seen a number of advancements for use in 5G applications, and a major revision is observed in designing antennas for these frequencies [1]. Compact, geometrically simple, wideband and high-gain antennas are required to operate at a high data rate and to serve multiple users at an instant of time [2]. In the lower frequency band, the ISM and WLAN bands get attention in recent research due to their wide application in 5G communication systems. The aforementioned frequency bands have large applications in Wi-Fi, Bluetooth, GPS, ON-body as well as OFF-body applications, and indoor and outdoor applications [3 – 4]. For this reason, a number of antennas have been designed and studied in the literature to operate on the 2.45GHz ISM band and the 5.2GHz WLAN band [5 – 11].

The rectangular microstrip patch antenna reported in [5] contains an air gap to obtain a high gain of 6.9dBi. The antenna offers high gain but has the setback of a large size of 80mm × 60mm and a narrow band of 0.06GHz. A planner antenna with a compact size is reported in [6] for WLAN applications. Besides its compact size, it also offers a moderate gain of 3.32dBi at the resonance frequency of 5.2GHz. The demerit of this design is its complex geometrical configuration.

Another compact antenna operating over the dual band of 2.45/5.2GHz is reported in [7]. The reported work has a compact size of 11mm × 6.5mm × 0.8mm and offers bandwidths ranging from 2.2 – 2.6GHz at 2.4GHz and 5.2 – 5.35GHz at 5.3GHz. The work gives no information about the gain, which is an important parameter for an antenna. In [8], an antenna operating over 2.4/5.2GHz is reported with a size of 18mm × 27mm × 0.8mm. The antenna has the advantages of a compact size and dual-band operation, but the disadvantage of narrow bandwidths of 0.08GHz at 2.4GHz and 0.25GHz at 5.2GHz.

A metamaterial-loaded antenna operating over dual bands of 2.45GHz and 5.8GHz is given in [9]. The antenna has a large size of 44.4mm × 44.4mm × 1.6mm and offers narrow bandwidths ranging from 2.3 – 2.5GHz and 5.6 – 5.9GHz. The antenna offers high gains of 4.88dBi and 4.7dBi at 2.45GHz and 5.8GHz, respectively. The high gain is obtained by loading an AMC (artificial magnetic conductor).

In [10], a dual-band antenna for Wi-Fi and WLAN applications is reported. The antenna offers high gains of 4.8dBi and 5.7dBi at resonance frequencies of 2.4GHz and 5.2GHz. Although the antenna has a simple geometry, it has a large size of 100mm × 100mm × 0.8mm. Another high-gain antenna is reported in [11]. The antenna offers high gains of 4.1dBi and 6.2dBi but has a large size of 74mm × 27mm × 1.7mm.

In Table 1, the comparison of the proposed work with already published work is given. The proposed antenna is compared with the literature in terms of overall size, operating frequency, operational bandwidth, peak gain and design approach.

Table 1: Comparison of proposed work with state-of-the-art work
Ref       | Dimension (mm × mm × mm) | Resonance Frequency (GHz) | Bandwidth (GHz)        | Gain (dBi) | Design Methodology
[5]       | 80 × 60 × 0.78           | 2.36                      | 2.34 – 2.4             | 6.9        | Rectangular MPSA with air gap
[6]       | 14.4 × 14 × 1.6          | 5.2                       | 5.1 – 5.3              | 3.32       | Planner antenna
[7]       | 11 × 6.5 × 0.8           | 2.3/5.3                   | 2.2 – 2.6, 5.2 – 5.35  | -          | T- and F-shaped element loaded planner antenna
[8]       | 18 × 27 × 0.8            | 2.4/5.2                   | 2.4 – 2.48, 5.15 – 5.35 | 4.1, 1.4  | Shorted microstrip patch antenna
[9]       | 44.4 × 44.4 × 1.6        | 2.45/5.8                  | 2.3 – 2.53, 5.62 – 5.92 | 4.88, 4.7 | Metasurface loaded antenna
[10]      | 100 × 100 × 0.8          | 2.4/5.2                   | 2.4 – 2.6, 4.95 – 5.3  | 4.8, 5.7   | Wide slot planner antenna
[11]      | 74 × 27 × 1.7            | 2.4/5.2                   | 2.34 – 2.5, 5.06 – 5.91 | 4.1, 6.2  | Trapped feeding patch antenna with U-slot
This Work | 33 × 18 × 0.7            | 2.45/5.25                 | 2.05 – 2.95, 4.86 – 5.85 | 4.2, 5.6 | Stub loaded rectangular patch antenna

From the above literature review and discussion, it is clear that there is still a research gap to design an antenna having a compact size, simple geometry, wide bandwidth, high gain and low profile to operate over the ISM and WLAN applications. For this purpose, in this paper, an antenna having a compact size and simple geometry, as well as offering wide bandwidth and high gain, is proposed for ISM and WLAN for future 5G applications.

The paper is divided into four sections. In section II, the antenna design methodology and the parametric analysis of an important parameter are discussed. The results of the proposed dual-band antenna are discussed in section III. The proposed work is concluded in section IV, along with the references.

II. ANTENNA DESIGN AND METHODOLOGY
A. Proposed Antenna Geometry
The geometrical configuration of the proposed dual-band antenna for ISM and WLAN is given in Fig. 1. The proposed antenna contains a circular patch along with microstrip feedlines, which are loaded with various stubs to improve the performance of the antenna. The antenna is designed on the top side of lossy substrate material FR4 with relative permittivity, loss tangent and thickness of 4.4, 0.02 and 0.7mm, respectively. The proposed dual-band antenna has an overall size of L1 × W1 × H = 33mm × 18mm × 0.7mm.

Fig. 1. Geometry of the proposed dual-band antenna: (a) front view, (b) side view. Labeled dimensions: L1–L5, W1–W3, LF, WF, R0 and H.

The optimized parameters of the proposed antenna are given as follows: L1 = 33; W1 = 18; L2 = 10; L3 = 3; L4 = 2; L5 = 2; W2 = 16; W3 = 8; WF = 2; LF = 12; R0 = 4; H = 0.7 (units in mm). The proposed dual-band antenna is designed, and its various parameters are analyzed, using the Electromagnetic (EM) software tool High-Frequency Structural Simulator (HFSSv9) with appropriate boundary conditions.

B. Antenna Designing Steps
Various design steps are followed to obtain the proposed dual-band antenna operating over 2.45GHz and 5.25GHz for ISM and WLAN applications. These design steps involve stub insertions, which result in return-loss improvement, frequency shifting and wideband operation. In the first step, a circular patch antenna was designed for 2.45GHz. The radius of the circle is R0 = 4mm, and this antenna offers a single band at 2.5GHz with a return loss of -9.75dB. The effective radius of the circular patch is obtained from the following equation [12]:

    Reff = R √(1 + (2H / (π εr R)) (ln(πR / 2H) + 1.7726))      (1)

In equation (1), Reff is the effective radius, R is the radius of the circle, H is the height of the substrate and εr is the relative permittivity of the substrate material used. The effective radius is then utilized for the resonant frequency in equation (2) below:

    F = (1.8412 × c) / (4π Reff √εr)      (2)

In equation (2), 'c' is the speed of light, equal to 3 × 10^8 m/s, and 'F' represents the operating frequency.

In the second stage, a rectangular stub is loaded above the circular patch, as shown in Fig 2(a). This step causes the antenna to resonate on another band along 6GHz. The return loss is also improved due to this step, as given in Fig 2(b).
The frequency of 2.45GHz, the proposed antenna offers an Omni- antenna offers return losses of -15dB at 2.5GHz and -12dB at directional radiation pattern in the principal E-plane (ϕ = 0°), 6GHz. In the third step, a rectangular stub is placed above the and a bi-directional radiation pattern in the principal H-plane existing circular and rectangular patch antenna. Due to this (ϕ = 90°). On other hand, for 5.25GHz, the proposed antenna step, the return loss improved from -14dB to -19dB at 2.5GHz offers butterfly shaped radiation pattern in the principal E- plane (ϕ = 0°) and a bi-directional radiation pattern in and -10dB to -18dB at 6GHz. the principal H-plane (ϕ = 90°). The shape of the radiation pattern of the proposed antenna given in Fig .4 is due to In the final phase, a T-shaped stub is placed, which not only multiple slots etching and stub insertion into the radiator. improve the return loss but also stable the bandwidth and shift the operating frequency towards lower bands. The resultant C. B. Gain and Efficiency antenna operates at 2.45 GHz and 5.25 GHz with return losses Fig. 5 represents the gain and radiation efficiency versus of -25dB. The antenna offer bandwidth of 2.05 – 2.95GHz frequency of the proposed dual-band antenna operating over with resonant frequency of 2.45GHz and 4.86 – 5.85GHz ISM and WLAN applications. It can be observed that the with resonant frequency of 5.25GHz. proposed antenna offers a gain of > 4dBi in the operational band of 2.05 – 2.95GHz with a peak gain value of 4.35 dBi at 2.45GHz. On the other hand, the antenna offers a gain of> 4.5dBi at an operational band of 4.86 – 5.85GHz, which a peak gain of 4.75dBi at 5.25GHz. Moreover, Fig. 5 also depicts the radiation efficiency of the proposed dual-band antenna. The antenna offers radiation efficiency of > 92% for ISM operational bandwidth maximum value of 97% at 2.45GHz. In the case of WLAN, the band antenna offers radiation efficiency > 94% with a maximum value of 95% at 5.25GHz. 
Step 1 Step 2 Step 3 Prop. From the above discussions, the results of the proposed Antenna antenna (in form of an S-parameter representing bandwidth (a) and return loss, radiation property, gain, radiation, and efficiency) and comparison between the proposed work with 0 literature, the proposed antenna can be considered as a strong -5 and good design for future compact devices operating over ISM and WLAN frequency bands. -10 0 |S11| (dB) -15 -5 -20 -10 -25 Step 1 Step 3 |S11| (dB) Step 2 Prop. Ant -15 -30 1.0 2.0 3.0 4.0 5.0 6.0 7.0 Frequency (GHz) -20 (b) Fig. 2. (a) Various stages to design proposed dual band antenna (b) Impact -25 on various design steps on S11 plot -30 1.0 2.0 3.0 4.0 5.0 6.0 7.0 III. RESULTS AND DISCUSSIONS Frequency (GHz) A. S-Parameter Fig. 3. S-Parameter of proposed dual band antenna operating on 2.45GHz The scattering parameter of the proposed dual-band and 5.25GHz for ISM and WLAN applications antenna is depicted in Fig. 3. It can be observed that the antenna offers dual bands of 2.45GHz and 5.25GHz allocated for ISM (IEEE 802.11b/g) and WLAN (IEEE 802.11a), respectively. The proposed work offers wide bandwidth of 900MHz at 2.45GHz ranging from 2.05 – 2.95GHz and 15 INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR INTERGRATED CIRCUITS AND SYSTEMS, VOL. 11, NO. 2, DECEMBER 2022 gain and radiation efficiency. The proposed dual bands offer 6 0 900MHz at 2.45GHz and 1100MHz at 5.25GHz. The gain 330 30 0 operates at a high gain of 4.2dBi and 5.9dBi at 2.45GHz and 300 60 5.25GHz, respectively. The multiple stubs are inserted into the -6 proposed circular patch antenna in order to obtain the dual -12 bands with wide bandwidth and high gain. The resonance 270 90 frequency of the proposed antenna covers globally dedicated -12 spectrums for ISM and WLAN bands. 
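As a quick numerical cross-check of design equations (1) and (2), the closed-form effective radius and resonant frequency can be evaluated for the stated dimensions (R0 = 4 mm, H = 0.7 mm, relative permittivity 4.4). This is an illustrative sketch only, using the constants exactly as printed above (including the 4*pi factor in equation (2)); the closed-form value applies to an isolated circular patch, so it differs from the 2.5 GHz reported for the full simulated structure with its feedline and stubs.

```python
import math

def effective_radius(R, H, eps_r):
    """Effective radius of a circular patch, equation (1)."""
    return R * math.sqrt(1 + (2 * H) / (math.pi * eps_r * R)
                         * (math.log(math.pi * R / (2 * H)) + 1.7726))

def resonant_frequency(R_eff, eps_r, c=3e8):
    """Resonant frequency from the effective radius, equation (2) as printed."""
    return 1.8412 * c / (4 * math.pi * R_eff * math.sqrt(eps_r))

R0, H, eps_r = 4e-3, 0.7e-3, 4.4           # dimensions from Section II.A
R_eff = effective_radius(R0, H, eps_r)      # ~4.20 mm
F = resonant_frequency(R_eff, eps_r)        # ~5.0 GHz for the isolated patch
print(R_eff * 1e3, F / 1e9)
```

With these constants the isolated 4 mm patch evaluates to roughly 5 GHz, whereas the full simulated structure of Section II.B resonates near 2.5 GHz, so the closed-form value should be read only as a starting estimate for the HFSS optimization.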
Fig. 4. Radiation pattern of the proposed antenna at (a) 2.45 GHz and (b) 5.25 GHz

Fig. 5. Gain and radiation efficiency of the proposed dual band antenna for WLAN applications

IV. CONCLUSION
In this paper, an antenna operating at dual bands of 2.45 GHz and 5.25 GHz is presented. The proposed dual-band antenna has a compact size, simple geometry, wide bandwidth, and high gain and radiation efficiency. Moreover, the results and the comparison of the proposed work with the state of the art show that the proposed dual band antenna is a good candidate for future 5G devices operating on ISM and WLAN applications.

REFERENCES
[1] Hussain, Musa, and Nabigha Nadeem. "A Co-Planer Waveguide Feed Dual Band Antenna with Frequency Reconfigurability for WLAN and WiMax Systems." 2019 International Conference on Electrical, Communication, and Computer Engineering (ICECCE). IEEE, 2019.
[2] Hussain, Musa, et al. "On-Demand Frequency Reconfigurable Flexible Antenna for 5G sub-6-GHz and ISM Band Applications." WITS 2020. Springer, Singapore, 2022. 1085-1092.
[3] Awan, Wahaj Abbas, et al. "The design of a wideband antenna with notching characteristics for small devices using a genetic algorithm." Mathematics 9.17 (2021): 2113.
[4] El Hadri, Doae, Alia Zakriti, and Asmaa Zugari. "Reconfigurable Antenna for Wi-Fi and 5G Applications." Procedia Manufacturing 46 (2020): 793-799.
[5] Al Kharusi, K. W. S., et al. "Gain enhancement of rectangular microstrip patch antenna using air gap at 2.4 GHz." International Journal of Nanoelectronics and Materials 13 (2020): 211-224.
[6] Swetha, A., and K. Rama Naidu. "Miniaturised planar antenna with enhanced gain characteristics for 5.2 GHz WLAN application." International Journal of Electronics 108.12 (2021): 2137-2154.
[7] Nayak, Peshal B., et al. "A novel compact dual-band antenna design for WLAN applications." arXiv preprint arXiv:2106.13232 (2021).
[8] Tung, Hao-Chun, and Kin-Lu Wong. "A shorted microstrip antenna for 2.4/5.2 GHz dual-band operation." Microwave and Optical Technology Letters 30.6 (2001): 401-402.
[9] Ahmad, Sarosh, et al. "A metasurface-based single-layered compact AMC-backed dual-band antenna for off-body IoT devices." IEEE Access 9 (2021): 159598-159615.
[10] SS, Yatish Pachigolla, and Surajit Kundu. "Dual band printed wide-slot antenna for Wi-Fi and WLAN applications." 2020 URSI Regional Conference on Radio Science (URSI-RCRS). IEEE, 2020.
[11] Gong, Qing, et al. "Dual-band horizontally/dual-polarized antennas for WiFi/WLAN/ISM applications." Microwave and Optical Technology Letters 62.3 (2020): 1398-1408.
[12] Hussain, Musa, et al. "Design and Characterization of Compact Broadband Antenna and Its MIMO Configuration for 28 GHz 5G Applications." Electronics 11.4 (2022): 523.

Multiple Linear Regression Using Cholesky Decomposition in Studying Crime Rate in Jigawa State, Nigeria

S. Usman1, I. Abdullahi*1, K. G. Ibrahim1, N. I. Yusuf1, H. B. Yusuf2 and B. G. Agaie1

Abstract— Multiple linear regression using Cholesky Decomposition is used in studying the crime rate in Jigawa State, where the data were collected from the Jigawa State Police Command, Nigeria. The collected data were prepared and transformed into matrix form using the Least Square Method, and then the covariance matrix was obtained.
The solution to the problem was then reached by employing the Cholesky Decomposition Method. The solution shows that offences against person (X1), offences against property (X2), other offences not in X1 and X2 (X3), and offences against local act (X4) have a positive influence on the dependent variable Y, the total crime rate, indicating that the crime rate in Jigawa State is on the decrease, as presented in the solution obtained from the data.

Index Terms— Multiple Linear Regression, Cholesky Decomposition Method, Least Square Method, Covariance Matrix.

I. INTRODUCTION
Modeling of real-life problems in the field of mathematics is very important because it helps in representing and describing a situation in mathematical formulas or symbols for easy understanding. Multiple linear regression is a well-known mathematical model that is usually employed in solving problems in areas like biology, chemistry, economics, engineering, physics, the social sciences and other real-life domains. This model is effective in showing the effect of, or the linear relationship between, a dependent variable and two or more independent variables.

The well-known British anthropologist Sir Francis Galton (1822-1911) seems to have been the first to introduce the word "regression", in his study on heredity. He found that, on average, the heights of children do not tend toward their parents' heights but rather toward the average, as compared to the parents, and he termed this "regression to mediocrity in hereditary stature."

[1] utilized the multiple regression technique to examine whether various measures of public policy have a significant effect in reducing serious crimes and in suppressing the upward trends of crime rates in the 50 largest United States cities in recent years.

[2] describe an approach of multiple regression analysis in a crime pattern warehouse for decision support. With multiple statistical methods, they develop a decision support system based on real data warehouses of socio-economic and crime indicators. In conclusion, multiple regression models allow gaining new insights into the structure of the problem and developing strategies for crime prevention measures.

[3] used multiple regression to study performance indicators in the ceramic industry, with the dependent variable being the size of earnings, while the independent variables consist of self-financing capacity, return on equity, level of technical capability, personnel costs per employee, flexibility, adaptability and the reactivity of companies in the ceramic industry.

[4] applied multiple linear regression analysis to measure the effect of student learning values (measurement and evaluation, educational psychology, program development, guidance and counseling techniques) on KPSS exam scores (civil service selection).

[5] used a multiple linear regression model and a data fit of the population and the number of criminal cases, and of the population and the years, to construct an iteration model which establishes the relationship between the population and the key attributes. It was observed that, as the population changes, the key behavior affects the crime rate.

[6] used multiple regression to determine the effect of personality on work stress, and it was suggested that all the personality dimensions show a significant correlation with job stress, and that two of the dimensions (neuroticism and lie) behaved as predicted.

[7] propose to predict violent crime using regression and to optimize the distribution of police officers through an Integer Linear Programming formulation, taking into account the previous predictions. Although some of the optimization data are synthetic, they propose it as a possible approach for the problem. Experiments showed that Random Forest performs best among the evaluated learners after applying the SmoteR algorithm to cope with extreme values. The most severe violent crime rates were predicted for southern states, in accordance with state reports; accordingly, these were the states with more police officers assigned during optimization.

[8] propose a type of Mixed Effects Regression Model, that is, a Hierarchical Linear Model, to study crime rate. They derive the estimators of the proposed model and discuss its asymptotic properties. To test the practicability of the proposed model, they estimate a crime equation using a panel data set of the provinces in Kenya for the period 1992 to 2012, employing the REML estimator. The empirical results suggest that poverty rate, unemployment rate, probability of arrest, population density and police rate are correlated with all typologies of crime rate considered. The results further suggest that crime rate is better explained at the provincial level than at the country level.

[9] studied a detailed model that allows for two distinct criminal types associated with major and minor crime, and also examined a stochastic variant of the model that represents more realistically the "generation" of new criminals. Numerical solutions of the model were also presented and compared with actual crime data for the Greater Manchester area.

[10] investigated the patterns of student involvement, the level of satisfaction and the acculturation of American Indian college students to determine if a relationship existed between these processes. This study gathered data from 139 students between the ages of 18 and 54 who self-identify as American Indian. Data were gathered in the spring semester of 2016 using two instruments, the College Student Experience Questionnaire (CSEQ) and the Native American Acculturation Scale (NAAS), that were combined in an online survey. The data analysis used descriptive statistics, with a T-test (independent/group), an Analysis of Variance (ANOVA), a multiple regression and a Pearson product-moment correlation coefficient to measure the relationships between the independent and dependent variables of demographics, acculturation, satisfaction, and participation in college activities (academic, non-academic, cultural programs and support services).

[11] evaluated the effect of the number of homicides recorded from 2006-2016 on the influx of domestic and foreign (American and Canadian) visitors to a destination on the Mexican Pacific coast, using econometric techniques such as multiple linear regression. Some of the results establish a relationship between homicides and the level of tourism. Similarly, the statistical evidence shows that the number of homicides has a moderate influence on travel by foreign visitors to this destination, but not on their actual stay there.

[12] constructed a Multiple Linear Regression model using Cholesky Decomposition, with an application to the numerical simulation of a real case studying the influence of five independent variables on a dependent variable using data of 30 samples.

Based on the above-mentioned literature, an investigation applying multiple linear regression using the Cholesky Decomposition Method has not been carried out on the crime rate in Jigawa State, which forms the basis of this paper.

(1 Department of Mathematics, Federal University Dutse (FUD), PMB 7156, Jigawa State, Nigeria. 2 Department of Mathematics, Nigerian Army University Biu, PMB 1500, Borno State, Nigeria (email: iabdullahi94@gmail.com, Ibrahim.abdullahi@fud.edu.ng).)

II. METHODOLOGY
In this work, Multiple Linear Regression and the Cholesky Decomposition Method were employed, as discussed below.

2.1 Multiple Linear Regression
Multiple linear regression analysis measures the linear effect/relationship between two or more independent variables (X_1, X_2, ..., X_p) and the dependent variable (Y). A multiple linear regression model can be presented in the form of the following general equation:

Y_j = \beta_0 + \beta_1 X_{1j} + \beta_2 X_{2j} + ... + \beta_p X_{pj} + \varepsilon_j,  j = 1, 2, ..., n    (1)

Equation (1) can be written in matrix form as follows:

[Y_1; Y_2; ...; Y_n] = [1 X_{11} X_{21} ... X_{p1}; 1 X_{12} X_{22} ... X_{p2}; ...; 1 X_{1n} X_{2n} ... X_{pn}] [\beta_0; \beta_1; ...; \beta_p] + [\varepsilon_1; \varepsilon_2; ...; \varepsilon_n]    (2)

where Y is the dependent variable column vector, X is the independent variable matrix, \beta is the regression coefficient estimator column vector, and \varepsilon is the residual/error column vector.

Using the least-squares method, the regression coefficients are obtained by minimizing the residual sum of squares, so that the normal equations are obtained as follows:

n\beta_0 + \beta_1 \sum X_{1j} + \beta_2 \sum X_{2j} + ... + \beta_p \sum X_{pj} = \sum Y_j,
\beta_0 \sum X_{1j} + \beta_1 \sum X_{1j}^2 + \beta_2 \sum X_{1j}X_{2j} + ... + \beta_p \sum X_{1j}X_{pj} = \sum X_{1j}Y_j,
\beta_0 \sum X_{2j} + \beta_1 \sum X_{1j}X_{2j} + \beta_2 \sum X_{2j}^2 + ... + \beta_p \sum X_{2j}X_{pj} = \sum X_{2j}Y_j,
...
\beta_0 \sum X_{pj} + \beta_1 \sum X_{1j}X_{pj} + \beta_2 \sum X_{2j}X_{pj} + ... + \beta_p \sum X_{pj}^2 = \sum X_{pj}Y_j.    (3)

Equation (3) can be written in matrix notation as follows:

[n, \sum X_{1j}, \sum X_{2j}, ..., \sum X_{pj};
 \sum X_{1j}, \sum X_{1j}^2, \sum X_{1j}X_{2j}, ..., \sum X_{1j}X_{pj};
 \sum X_{2j}, \sum X_{1j}X_{2j}, \sum X_{2j}^2, ..., \sum X_{2j}X_{pj};
 ...;
 \sum X_{pj}, \sum X_{1j}X_{pj}, \sum X_{2j}X_{pj}, ..., \sum X_{pj}^2] [\beta_0; \beta_1; \beta_2; ...; \beta_p] = [\sum Y_j; \sum X_{1j}Y_j; \sum X_{2j}Y_j; ...; \sum X_{pj}Y_j]    (4)

or, compactly, \Sigma_X \beta = \Sigma_Y. The matrix \Sigma_X is called the covariance matrix.

2.2 Cholesky Decomposition
Cholesky Decomposition is a special version of LU decomposition that is designed to handle symmetric matrices more efficiently. If A is a symmetric and positive definite matrix, a_{ij} = a_{ji}, then A can be written as

A = L L^T    (5)

where L is a lower triangular matrix.
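The least-squares setup in equations (2)-(4) can be sketched as follows: build a design matrix with an intercept column, form the covariance matrix Sigma_X and right-hand side Sigma_Y, and solve the normal equations. The data here are hypothetical, chosen only to illustrate the construction; they are not the Jigawa State figures.

```python
import numpy as np

# Hypothetical illustrative data: n = 4 observations, p = 2 predictors.
X_raw = np.array([[2.0, 1.0],
                  [1.0, 3.0],
                  [4.0, 2.0],
                  [3.0, 5.0]])
y = np.array([7.0, 9.0, 12.0, 16.0])

# Design matrix with an intercept column, as in equation (2).
M = np.column_stack([np.ones(len(y)), X_raw])

# Normal equations (4): (M^T M) beta = M^T y.
Sigma_X = M.T @ M          # the "covariance matrix" Sigma_X
Sigma_Y = M.T @ y

beta = np.linalg.solve(Sigma_X, Sigma_Y)
print(beta)                # coefficients beta0, beta1, beta2
```

In the paper, the same system Sigma_X beta = Sigma_Y is solved not by a generic solver but by factoring Sigma_X with the Cholesky decomposition described next.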
The lower triangular factor L has the form

L = [l_{11} 0 ... 0; l_{21} l_{22} ... 0; ...; l_{n1} l_{n2} ... l_{nn}]    (6)

Using the Cholesky decomposition, the elements of L are evaluated as follows:

l_{kk} = \sqrt{a_{kk} - \sum_{j=1}^{k-1} l_{kj}^2},   l_{ki} = \frac{1}{l_{ii}} \left( a_{ki} - \sum_{j=1}^{i-1} l_{ij} l_{kj} \right)    (7)

where the first subscript is the row index and the second is the column index, with k = 1, 2, ..., n and i = 1, 2, ..., k - 1.

The steps to solve Ab = c, where A is symmetric and positive definite, using the Cholesky Decomposition are given as follows: the decomposition of A gives A = LL^T; then the solution b is obtained by (a) forward substitution, where the solution d is obtained using Ld = c, and (b) back substitution, where the solution b is obtained using L^T b = d. The Cholesky Decomposition was then used to solve \Sigma_X \beta = \Sigma_Y.

III. DISCUSSION OF RESULTS
3.1 Crime statistics record from 2018 to 2020 collected from the Jigawa State Police Command

YEAR | X1  | X2  | X3 | X4  | Y
2018 | 317 | 329 | 50 | 95  | 791
2019 | 233 | 176 | 50 | 136 | 588
2020 | 272 | 203 | 45 | 233 | 741

Here X1 is offences against person, e.g. murder/culpable homicide, suicide, child stealing, manslaughter, attempted murder/homicide, attempted suicide, grievous harm/wounding, assaults, terrorism, rape and indecent assaults, kidnapping, unnatural offences, other offences, etc. X2 is offences against property, e.g. armed robbery, demanding with menaces, theft and other stealing, burglary, house/store breaking, vandalism (pipeline), cheating, forgery, receiving stolen property, unlawful possession, arson/mischief by fire, other offences, etc. X3 is other offences not in X1 and X2, e.g. gambling, bribery and corruption, escaping from lawful custody, breach of public peace/unrest, coining offences, fake currency notes, other offences, etc. X4 is offences against local acts, e.g. narcotics, offences against dog acts, offences against child rights acts, anti-human trafficking, offences against firearms acts, fatal motor accidents, other incidents/disasters (flood, air and fire), etc., and Y is the total cases.

Solution
Following the procedure laid down above, the data in augmented matrix form [1 X1 X2 X3 X4 | Y] are

[1 317 329 50 95 | 791; 1 233 176 50 136 | 588; 1 272 203 45 233 | 741]

Applying the Least Square Method to Table 2, the covariance matrix equation \Sigma_X \beta = \Sigma_Y is obtained:

[3, 822, 708, 145, 464;
 822, 228762, 200517, 39740, 125179;
 708, 200517, 180426, 34385, 102490;
 145, 39740, 34385, 4625, 22035;
 464, 125179, 102490, 22035, 81810] [\beta_0; \beta_1; \beta_2; \beta_3; \beta_4] = [2120; 589303; 514150; 102295; 327766]

Applying the Cholesky decomposition formulas (7) to the matrix above gives

L = [1.7320, 0, 0, 0, 0;
     16.5529, 478.0041, 0, 0, 0;
     15.3623, 0.9361, 424.4879, 0, 0;
     6.9522, 0.4165, 0.4361, 67.6511, 0;
     12.4365, 0.7396, 0.7535, 2.1899, 285.759]

Let L^T \beta = Z and LZ = B, where LL^T = A and A\beta = B. Solving LZ = B by forward substitution, for example

1.7320 Z_1 = 2120, so Z_1 = 1223.9831;
16.5529 (1223.9831) + 478.0041 Z_2 = 589303, so Z_2 = 1190.4553;

and continuing row by row yields

Z = [1223.9831; 1190.4553; 1399.8741; 1051.4853; 1078.9022].

Then, solving L^T \beta = Z by back substitution, with

L^T = [1.7320, 16.5529, 15.3623, 6.9522, 12.4365;
       0, 478.0041, 0.9361, 0.4165, 0.7396;
       0, 0, 424.4879, 0.4361, 0.7535;
       0, 0, 0, 67.6511, 2.1899;
       0, 0, 0, 0, 285.759],

for example

285.759 \beta_4 = 1078.9022, so \beta_4 = 3.7756;
67.6511 \beta_3 + 2.1899 (3.7756) = 1051.4853, so \beta_3 = 15.4205;

and continuing upward yields

\beta = [513.7940; 2.4662; 2.5440; 15.4205; 3.7756]

so that the fitted model is

Y = 513.7940 + 2.4662 X_1 + 2.5440 X_2 + 15.4205 X_3 + 3.7756 X_4    (8)

The regression model presented in equation (8) is the final solution to the problem, where the dependent variable Y stands for the total crime committed in the community, and X1 (offences against person), X2 (offences against property), X3 (other offences not in X1 and X2) and X4 (offences against local act) represent the different crimes committed. It was observed that all the independent variables X1, X2, X3 and X4 have positive coefficients in the obtained model (8), which implies that they all have a positive influence on the dependent variable Y; this means that there is a lower crime rate in the state, that is to say, the law enforcement agencies in the state are carrying out their duties as they are supposed to.
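The solve procedure used above (decompose per equation (7), forward-substitute, back-substitute) can be sketched as follows. This is an illustrative implementation applied to a small synthetic symmetric positive definite system, not a reproduction of the paper's figures.

```python
import numpy as np

def cholesky_solve(A, c):
    """Solve A b = c for symmetric positive definite A via A = L L^T,
    following equation (7) and the forward/back substitution steps."""
    A = np.asarray(A, dtype=float)
    n = len(c)
    L = np.zeros((n, n))
    for k in range(n):
        # Diagonal entry: l_kk = sqrt(a_kk - sum_j l_kj^2).
        L[k, k] = np.sqrt(A[k, k] - np.dot(L[k, :k], L[k, :k]))
        for i in range(k + 1, n):
            # Off-diagonal entry: l_ik = (a_ik - sum_j l_ij l_kj) / l_kk.
            L[i, k] = (A[i, k] - np.dot(L[i, :k], L[k, :k])) / L[k, k]
    d = np.zeros(n)
    for i in range(n):               # forward substitution: L d = c
        d[i] = (c[i] - np.dot(L[i, :i], d[:i])) / L[i, i]
    b = np.zeros(n)
    for i in reversed(range(n)):     # back substitution: L^T b = d
        b[i] = (d[i] - np.dot(L[i + 1:, i], b[i + 1:])) / L[i, i]
    return b

# Small synthetic normal-equation system (assumed data, for illustration).
A = np.array([[4.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 6.0]])
c = np.array([7.0, 9.0, 9.0])
beta = cholesky_solve(A, c)
print(beta)   # -> approximately [1. 1. 1.]
```

Note that the factorization requires \Sigma_X to be positive definite, which in practice means the number of observations should exceed the number of regression coefficients.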
IV. CONCLUSION
The construction of a multiple linear regression model using Cholesky Decomposition to study the crime rate in Jigawa State was considered, where the raw data were collected from the Jigawa State Police Command and analyzed. It was seen that X1, X2, X3 and X4 have a positive influence on the environment, implying less crime recorded, due to the fact that criminals are prosecuted and made to serve time in the correctional centers rather than simply being released back into the environment. Also, the result obtained shows a strong relation to the relative peace the state has been enjoying over the period considered.

V. ACKNOWLEDGMENTS
The authors wish to acknowledge the Jigawa State Police Command for making the data available for constructive analysis of the model.

REFERENCES
[1] Yong HC. A Multiple Regression Model for Measurement of Public Policy Impact on Big City Crime. Policy Sciences. 1972; 3(4): 435-455.
[2] Dale D. and Vitalija R. Multiple Regression Analysis in Crime Pattern Warehouse for Decision Support. https://www.researchgate.net/publication/221464853. 2002; 1-16.
[3] Zsuzsanna T. and Marian L. Multiple Regression Analysis of Performance Indicator in the Ceramic Industry. Procedia Economics and Finance. 2012; 3: 509-514.
[4] Uyanik GK. and Guler N. A Study on Multiple Linear Regression Analysis. Procedia Social and Behavioral Sciences. 2013; 106: 234-240.
[5] Zhongfeng G. and Jianhui LA. A Method to Improve the Rate of Case Investigation Based on Multiple Linear Regression Model. International Conference on Advanced Information and Communication Technology for Education. 2013; China: Atlantis Press; 2013. 584-587.
[6] Desa A., Yusooff F., Ibrahim N., Kabir NBA. and Rahman RMA. A Study of the Relationship and Influence of Personality on Job Stress Among Academic Administrators at a University. Procedia - Social and Behavioral Sciences. 2017; 114(2014): 355-359.
[7] Bruno C., Paula B. and Sergio P. Crime Prediction Using Regression and Resources Optimization. https://www.researchgate.net/publication/281450865. 2015; 1-13.
[8] Chris M., Joel CC. and Joseph M. Modeling Crime Rate Using a Mixed Effects Regression Model. American Journal of Theoretical and Applied Statistics. 2015; 4(6): 496-503.
[9] Lacey AA. and Michael TN. A Mathematical Model of Serious and Minor Criminal Activity. European Journal of Applied Mathematics. 2016; 1-19.
[10] Jim KK. A Multiple Regression Analysis of Factors Concerning Satisfaction, Student Involvement and Acculturation as Demonstrated by American Indian College Students. https://repository.stcloudstate.edu/hied_etds. 2017; 1-151.
[11] Martin LS. and Silvestre FG. The Effects of Crime on Tourism: A Multiple Regression Analysis. Lectures on Modelling and Simulation. 2018; 2018: 26-30.
[12] Ira S., Fiyan H. and Sri P. Multiple Linear Regression Using Cholesky Decomposition. An International Scientific Journal. 2020; 140(2020): 12-25.

Design and Development of Trusted Real-Time Execution Environment

Yuechun Wang1, Ka Lok Man2*, Danny Hughes3, Jie Zhang4

Abstract— Society is entrusting a growing range of ICT (Information and Communications Technology) systems of increasing complexity with high-value digital and physical assets, including, in the case of CPS (Cyber-Physical Systems), the health and safety of people. The fundamental design of such systems, in particular at a low level, advocates system architectures and construction principles that may be summarized as using, and therefore necessarily trusting, all components that implement the required functionality. Growing functional complexity therefore naturally translates into a higher risk of undetected vulnerabilities that may cause the system to fail due to faults or attacks. In this
If faults project, we strive to break these trust relationships from the occur, trusted software will trigger fail-operational behavior ground up, leveraging for availability and timeliness what Intel to enable the remote repair of application software. These SGX (Software Guard Extensions) enclaves provide us for confidentiality and integrity: trusted execution of critical repair operations are foreseen to occur in parallel to the real- components that may be managed by but are not susceptible to time execution of all critical code required for this fail- failures in the management layer. operational behavior, which executes with guaranteed Index Terms—Trusted real-time execution environment, cyber- noninterference. physical system, lower-criticality task II. PROPOSED RESEARCH This section briefly outlines the main aspects of our proposed I. INTRODUCTION research and presents the methodologies that are adopted in the Conventional approaches to enabling application-level proposed research. resource scheduling to begin from the assumption of a A. Aims and Objectives trustworthy OS kernel such as a microkernel or hypervisor and We aim to establish a foundational then add mechanisms for application-level resource control. rethinking/transformation of how application-level scheduling For instance, [1] proposed CPU inheritance scheduling, a can exploit the isolation afforded by software-and-hardware mechanism by which a task may donate its allocated time to Trusted Execution Environments to guarantee availability and services it invokes. [2] extends this scheme by separating the timeliness. This project tackles the following objectives (O.): scheduling and execution contexts of a thread, allowing the O1-Determine and implement the mechanisms and former to be granted and the kernel to more efficiently identify primitives that are necessary to enable this transformation. the receiving execution context. 
[3] integrates this scheme into O2-Ensure that sufficient resources are available is a a capability system that enables the delegation of limited necessary prerequisite to achieving trustworthy real-time scheduling responsibility to select application-level schedulers. execution of critical functionality. [4] proposes TCaps, a one-shot time-providing mechanism, O3-Create scheduler extensions which enable the which avoids complex revocation by implicitly invalidating a untrusted application layer to create, manage and delegate time capability if its timeslot has passed. resources to the enclaves that host critical functionality. This project explores how mechanisms of this kind can be O4-Ensure that critical CPS applications run correctly with further rationalized into simple primitives, that can be safely sufficient resources. embedded in hardware and/or software Trusted Execution B. Research Questions and Methodology Environments, enabling the development of highly secure and Four research questions (Q1,.., Q4) of the project are outlined flexible mixed-critically systems. Here, an untrusted scheduler below: is empowered to grant dedicated resources to enclaves, which enforce the fulfillment of these guarantees, even in the presence Q1: Reducing the size of a system’s critical Trusted of faults in the untrusted scheduler. Leveraging such a Computing Base (TCB), will ease testing and verification, and mechanism requires an understanding of what guarantees it thereby increases trust in isolated components. conveys to either accept the granted resource or initiate Q2: Building blocks of Trusted Computing and Trusted Authors 1,2 & 4 are with the School of Advanced Technology, Xi’an Jiaotong- Execution are available in current commodity hardware as well Liverpool University, Suzhou, Jiangsu, R.O.C. Danny Hughes is with the as in a range of research prototypes. Are these platforms readily imec-DistriNet, KU Leuven, B-3001 Leuven, Belgium. 
email: support the development of dependable real-time systems? {Ka.Man@xjtlu.edu.cn, danny.hughes@cs.kuleuven.be} 22 INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR INTERGRATED CIRCUITS AND SYSTEMS, VOL. 11, NO. 2, DECEMBER 2022 Q3: The hardware techniques developed in WP2 (see below The work plan is structured with the following 4 main tasks for details) will be translated into pure software. How can their (T.): relative security and performance be assessed? T1.1-Trust models [M1-M6]: define transitive and Q4: How can we validate all project outcomes by providing intransitive trust models for real-time computation with dedicated consideration of practical issues such as tool support untrusted managers. and evaluation through both test-beds and real-world cases? T1.2-Resource scheduling and adaptation [M6-M12]: This project is structured into four technical Work Packages create techniques to compose resource guarantees for reliable (WPs), which together answer the research questions and real-time computing in enclaves. realize the technical objectives (O1 to O4): T1.3-Programming language support for real-time WP1: Trusted Real-Time Computing – Theoretical enclaves [M12-M18]: provides language support, program Foundations. WP1 will develop the necessary theoretical abstractions, and analyses that enable the automated generation foundations for trusted and dependable real-time computing in of code that respects timing constraints. Trusted Execution Environments. We will develop models of T2.1-Resource-sharing-and-resource-availability [M6- trust and dependability, research program abstractions, policies, M12]: revisits resource sharing in the context of Trusted and protocols for ensuring the timeliness of critical Execution Environments. 
functionality placed into trusted execution environments, and investigates how untrusted components may benefit from the T2.2-Hardware-enforced-real-time-guarantees [M12- timely execution of functionality that is offloaded into these M18]: focus on hardware support that enables dedicated environments. enclaves to enforce the resource guarantees granted to other enclaves. WP2: Processor Extensions for Trusted Real-Time Computing. WP2 will address these shortcomings by extending T2.3-Integrated-architecture-for-trusted-real-time- Trusted Execution Environments-enabled processor computing [M18-M24]: creates an integrated hardware and architectures to support the programming models and software stack that combines the results of WP1 and WP2 in a requirements developed in WP1. WP2 will develop innovative dependable computing platform for embedded real-time processor designs that combine the security features of modern processing. Trusted Execution Environments with extended support for T3.1-Embedded-virtualization-framework [M1-M9]: hard and soft real-time scheduling, and that can be realized creates a framework for the development of embedded using FPGAs to facilitate prototyping in WP4. virtualization techniques. WP3: Trusted Real-Time on Legacy Systems. We aim to T3.2-Modular-security-and-Trust-features [M6-M18]: create a framework for the development of embedded implements software variants of the hardware security features virtualization techniques that increase the security feature specified in WP2. offered by a processor using the methodology developed in our previous work [6]. However, the application domain of Real- T3.3-Hardware/Software-interaction [M18-M24]: creates Time Trusted Execution Environments introduces a wide range techniques to enable seamless interaction between mixed of new challenges as described in WP2. 
Efficiently achieving deployments of hardware-and-software trust modules these goals in software demands the creation of a more flexible developed in WP2 and the outcomes of the earlier tasks of WP3. suite of systematic virtualization support that provides a generic T4.1-Tool-support [M6-M12], Test-bed-evaluation [M12- and systematic means to retrofit security techniques on existing M18] and Use-case-evaluation [M18-M24]: provide a realistic Instruction Set Architectures (ISAs). validation and evaluation of mature project outcomes. WP4: Tool Support, Test-Bed, and Use Case. WP4 envisages III. EXPECTED RESEARCH OUTCOMES AND tool support as having main pillars: (1) OS drivers connect the POTENTIAL VALUE hardware and software outcomes of the project to industry This project is expected to produce the following outcomes: standard real-time operating systems. (2) A Modified toolchain 1. A set of security abstractions that balance usability, that embodies the programming-related outcomes of WP2 and performance, and completeness to promote the adoption of WP3 and (3) DevOps (Combination of Culture Philosophies trusted execution environments by third parties. This and tools) Services which interface with the remote primarily theoretical result will be published at relevant management, crash detection, and recovery features of the conferences focusing on Middleware or Modularity, with Trusted Real-Time Execution Environments runtime to final results being published in top security journals such facilitate administration for networks of devices running the as TDSC. Trusted Real-Time Execution Environments software. 2. A software stack to support trusted real-time computing on C. Work Plan and Timeline embedded devices. This software stack will extend popular 23 INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR INTERGRATED CIRCUITS AND SYSTEMS, VOL. 11, NO. 
2, DECEMBER 2022 contemporary embedded software stacks such as ensure that the control flows of lower-criticality tasks remain Arduino/FreeRTOS/RIoT and will be made available intact and live. If faults occur, trusted software will trigger fail- under an open-source license that promotes its reuse by operational behavior to enable the remote repair of application third parties (e.g. a BSD or MIT license). Initial results are software. These repair operations are foreseen to occur in reported in relevant conferences such as parallel to the real-time execution of all critical code required ISOCC/DATICS/DigiCon/PlatCon/CICET. The final for this fail-operational behavior, which executes with results will be published in top security journals such as guaranteed noninterference. TDSC. 3. An experience report exploring the performance impact of ACKNOWLEDGEMENT applying trusted execution on embedded platforms. This This work was supported by the Xi’an Jiaotong-Liverpool report will consider technical factors such as execution University (XJTLU) AI University Research Centre, Jiangsu speed, memory footprint, and energy implications (Provincial) Data Science and Cognitive Computational alongside human factors such as development effort and Engineering Research Centre at XJTLU under Grant XJTLU- usability. Initial results will be reported in relevant REF-21-01-002. conferences such as NCA/SAC. Final results will be REFERENCES published in a top system or security journal such as [1] Ford, B. and Susarla, S. CPU inheritance scheduling. TIOT/TOSN. SIGOPS Oper. Syst. Rev., 30(SI):91-105, 1996 [2] Steinberg, U., Wolter, J., and Hartig, H. Fast component This project tackles the problem of a trusted real-time interaction for real-time systems. In ECRTS, pp. 89-97, 2005 execution by striving to break these complex and implicit trust [3] Fiasco – The L4Re Microkernel, 2020 relationships from the ground up. While this project is [4] Gadepalli, P. 
K., Gifford, R., Baier, L., Kelly, M., and foundational in nature, we will ensure that it remains grounded by the future needs of the industry through connection to use- Parmer, G. Temporal capabilities: Access control for a time. In RTSS, pp. 56-67, 2017 cases that are afforded by complementary strategic and industrial/applied projects that are currently being executed by [5] Akkermans, S., Daniels, W., Sankar R., G., Crispo, B., and our research groups on the topic of security and trust for CPS. Hughes, D. Cerberos: A resource-secure os for sharing IoT Ultimately, this project will master the complexity of devices. In EWSN, p. 96-107. Junction Publishing, 2017 developing trustworthy real-time CPS by eliminating the need [6] Ammar, M., Crispo, B., Jacobs, B., Hughes, D., and Daniels, to trust large parts of the system’s hardware/software stack. W. The security microisor: A formally verified software-based security architecture for the internet of things. Trans. On IV. CONCLUSIONS Dependable and Secure Computing, 16(5):885-901, 2019 The proposed project explores a new emerging topic in the [7] Fan Yang, Danny Hughes, Nelson Matthys, Ka Lok Man, field of Trusted Real-Time Execution Environments and falls The PnP Web Tag: A plug-and-play programming model for into the national and international research priority area: trusted connecting IoT devices to the web of things. In APCCAS, pp. data and CPS. The untrusted scheduler is authorized to grant 452-455, 2016 dedicated resources to communities even in the presence of [8] Fan Yang, Hao Wong, Danny Hughes, Ka Lok Man, faults in the untrusted scheduler. This project explores how Towards 40 Year Battery Lifetime for the Internet trusted real-time enclaves can monitor the health of lower- of Things. 
International Journal of Design, Analysis & Tools for Integrated Circuits & Systems, 8(1):84-85, 2019.

Machine Learning in Healthcare: the Prediction of Diabetes Risk by ML Classification Models
Ailyn Kency Lam Cham Kee, Yuechun Wang, Yuxuan Zhao, Jie Zhang, Erick Purwanto, Tomas Krilavicius*, Ka Lok Man*

Abstract— Due to the rise of artificial intelligence, machine learning has taken a larger role in society, and machine learning models have been applied in several industries. For instance, in the medical industry, machine learning has aided the healthcare system in treating and diagnosing patients. Machine learning models have been utilised to anticipate the probability of a patient developing chronic health diseases such as Type 2 diabetes. Furthermore, several machine learning models can also be used in predicting the status of a patient's health condition. This report focuses on evaluating the predictive performance of machine learning models. This is done by constructing the models and obtaining their level of accuracy through Python; their performance is then evaluated using the data gathered from the coding process. The aim of this study is to assess machine learning classification models in predicting the risk of developing diabetes. It also compares the machine learning models to achieve better model performance.

Index Terms— Artificial intelligence, machine learning, prediction, logistic regression, support vector machine, k-nearest neighbour, decision tree, random forest, medical diagnosis, diabetes

Tomas Krilavicius is with the Department of Informatics, Vytautas Magnus University, Kaunas, Lithuania; Ka Lok Man is with the School of Advanced Technology, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, P.R.C. email: {tomas.krilavicius@vdu.lt, Ka.Man@xjtlu.edu.cn}

INTRODUCTION
Diabetes Mellitus is a chronic health condition characterised by high blood glucose over a period of time [1]. According to the Centers for Disease Control and Prevention (CDC) [2], about 34.2 million adults in the US are unaware that they have diabetes. Over many years, diabetes can cause multiple health problems, affecting the heart, blood vessels, eyes, kidneys and nerves [3]. Therefore an early diagnosis of diabetes would be beneficial to the patient.

To simplify the diagnosis of this long-lasting disease, machine learning can be used. The goal of this project is to evaluate the risk of developing diabetes using machine learning classification models. This can be achieved using a dataset which contains information on an individual's personal lifestyle and family background. The study shows how accurate some machine learning models are. This work helps experts, as it indicates which machine learning models should be used to make their predictions.

The models used in this project are Logistic Regression, the K-Nearest Neighbour Classifier, the Support Vector Machine, the Decision Tree Classifier and the Random Forest Classifier. These models are assessed using their confusion matrices; their accuracy rate, specificity, sensitivity, error rate, precision, F-score and Matthews Correlation Coefficient are then calculated and compared.

The aim of this study is to answer the following research questions:
1. To what extent can machine learning be accurate in terms of medical diagnosis?
2. How do the machine learning classification techniques compare to each other in terms of model performance?

LITERATURE REVIEW
A. Overview of machine learning
Machine Learning (ML) is a branch of artificial intelligence and computer science that focuses on using data and different algorithms; the resulting models gradually become more accurate in predicting outcomes [4]. ML is meant to imitate the way humans learn by training itself. There are four types of machine learning: supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning. This study focuses on supervised learning algorithms. Supervised learning uses labelled datasets to train algorithms and classify data [5].

B. Logistic Regression
Logistic Regression is a model used to predict discrete and categorical values [6]. This method forms part of the supervised machine learning algorithms. It predicts a class based on one or multiple predictor variables [6]. Its outcome is usually binary; hence, there can only be two values: 0 or 1, or in this case, diseased or non-diseased. This classification model does not return the exact class of the input but instead provides an estimated probability of being part of a class. The standard logistic function is given as [1]:

y = 1 / (1 + e^(-x))
Equation 1. Standard logistic function

In this function, x is the weighted sum of the input variables and y is the output. Hence, if y is greater than 0.5, the output is 1; else the output is 0 [1].

C. Support Vector Machine
The Support Vector Machine (SVM) is another predictive model for data classification, which assigns new data elements to a labelled category [7].
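As a quick numeric illustration of Equation 1 and the 0.5 decision threshold described above (a minimal sketch; the function and variable names are illustrative, not from the paper):

```python
import math

def logistic(x):
    # Equation 1: the standard logistic function
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, threshold=0.5):
    # Output 1 (diseased) when the estimated probability exceeds 0.5, else 0
    return 1 if logistic(x) > threshold else 0

print(logistic(0.0))   # 0.5, exactly on the decision boundary
print(classify(2.0))   # 1
print(classify(-2.0))  # 0
```

A positive weighted sum pushes the probability above 0.5 and yields class 1; a negative sum yields class 0.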
Similar to the Logistic Regression model, it assumes that there are only two possible outcomes. Fig. 1 shows how Support Vector Machine classification is used to classify data [8].

Fig. 1. Support Vector Machine

The SVM constructs a hyperplane that is used as a decision boundary to divide the data points. The margin is the gap between the two lines through the nearest points of each class.

D. K-Nearest Neighbour
The K-Nearest Neighbour (KNN) model is a supervised machine learning model often used for classification problems, for example in business [1]. This model classifies data by calculating the distance between existing data points. Fig. 2 [9] illustrates how KNN works.

Fig. 2. K-Nearest Neighbour

After calculating the distances, the algorithm finds the point's closest neighbours, which vote for its label.

E. Decision Tree Classifier
The Decision Tree Classifier (DST) is a machine learning technique which is often used for solving classification problems. This model can predict a class by making a decision using prior data as a reference [10]. Fig. 3 [11] shows an example of a decision tree.

Fig. 3. Example of a Decision Tree

F. Random Forest Classifier
The Random Forest Classifier (RFC) is built from many decision trees. It constructs decision trees on various samples and takes the majority vote to classify data [12]. Fig. 4 [13] demonstrates how the Random Forest Classifier is applied to a dataset.

Fig. 4. Random Forest Classifier

G. Overview of related works
Many researchers have studied the prediction of symptoms of diabetes through different approaches.

In "Comparative approaches for classification of Diabetes Mellitus data: Machine learning paradigm" [14], the authors used the Gaussian Process (GP) classification technique. In the study, three kernels are applied: linear, polynomial and radial basis kernels. The work concluded that, according to several factors, the GP-based model with a radial basis kernel is the better classifier.

A study by A. Viloria et al. [15] implemented the Support Vector Machine algorithm to predict the development of diabetes in patients. An accuracy of 95.36% in Colombian patients and 66.25% in other ethnic groups was obtained [15].

In "Prediction of Type 2 Diabetes using Machine Learning Classification Methods" by N. Tigga et al. [1], two different datasets were used to determine the risk of diabetes. That report evaluates the risk of diabetes through prediction using machine learning algorithms, which were applied to both datasets and compared.

INDUSTRIAL RELEVANCE
Due to the need for a precise diagnosis of a patient's health condition, machine learning can be advantageous to the medical industry, where precision and efficient healthcare resource allocation are important [16]. Hence, this work could have a significant impact on the healthcare sector. The evaluation of these machine learning models would help experts make a faster and more accurate diagnosis. Although this work focuses on the prediction of the risk of developing diabetes, there has also been research on skin cancer diagnosis through medical imaging [16]. Thus, with this study, experts can also benefit from the early identification of multiple health conditions.

This report can also help other industries and organizations to acquire precise and fast identification of opportunities and risks [17]. By gathering data from individuals, the machine learning classification models can assist decision-making for businesses or the general public.

METHODOLOGY
A. Data Collection
The dataset collected is a combination of two databases obtained from the internet. The first database is from the National Institute of Diabetes and Digestive and Kidney Diseases [18]. The second dataset was gathered by Vikas Ukani [19]. The two databases were chosen for the final dataset because they have similar features. Both datasets were obtained from Kaggle, a website where individuals can find and publish datasets and build models online in a data-science environment.

B. Tools and Libraries
➢ Python. Python is a high-level, object-oriented scripting language. This project uses it to achieve its aim and objectives. It is used for machine learning processes because it is easy to understand and grants fast validation of data [20].
➢ Pandas. Pandas is a Python library used to analyse data [21]. This project uses it to read the CSV files which contain the diabetes dataset, and to clean the data by removing duplicates and handling missing data.
➢ Numpy. Numpy is another Python library, used to handle arrays [22]. In this project, it is used to replace values that are 0 or empty with a null value, NaN.
➢ Sklearn. Sklearn is a Python library that helps in constructing machine learning models [23]. This library is used throughout the project, as it allows for training and testing on the dataset. It is also used to obtain the accuracy and scores of each machine learning algorithm.
➢ Seaborn. Seaborn is a Python library used for data visualization [24], often used along with the Pandas library. This project uses it to visualize the confusion matrix.
➢ System Environment. Table 1 below shows the details of the computer configuration used for training the models.

Table 1. System environment
Operating System | Windows 10 Professional Edition
CPU | 11th Gen Intel(R) Core(TM) i7-11700
GPU | Intel(R) UHD Graphics 750
RAM | 16GB

C. Reading and preparing the dataset
Before applying the machine learning algorithms to the dataset to make predictions, the dataset should be read and prepared to achieve an accurate and unbiased class prediction. Firstly, as shown in Fig. 5, the libraries used for this work are imported.

Fig. 5. Import libraries

After this, the dataset, which is in a CSV file, is read using the Pandas library. This is shown in Fig. 6.

Fig. 6. Reading dataset

To avoid errors, the dataset must be prepared for modelling. In this study, duplicated values are removed first. Duplicated data would only slow down the process and can cause data leakage: predictive modelling could suffer data leakage due to the duplicated values [25]. Data leakage occurs when external information is used to create the model [25]. To prevent this, Fig. 7 shows the process of removing the duplicate values.

Fig. 7. Removing duplicated values

In the figure above, 'df' refers to the DataFrame. The Pandas library is used to remove all the duplicates in the dataset.

The dataset obtained from Kaggle contains medical predictors: factors that could determine whether an individual may develop diabetes. A patient's Body Mass Index (BMI), one of these predictors, cannot be empty or 0, because BMI is a measurement of body fat based on an individual's height and weight. As the BMIs of certain individuals are missing from the dataset, such data should be replaced or removed. In this study, these values are replaced.
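The cleaning steps just described (Figs. 6-7 and the zero-to-NaN replacement) can be sketched as follows. This is a minimal illustration on a toy frame; the column names are borrowed from the Pima-style dataset and are assumptions, and the paper reads its real data with pd.read_csv instead:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle CSV; the paper reads it with pd.read_csv
df = pd.DataFrame({
    "Glucose": [148, 85, 85, 0],   # 0 marks a missing measurement
    "BMI":     [33.6, 26.6, 26.6, 0.0],
    "Outcome": [1, 0, 0, 1],
})

# Fig. 7: drop exact duplicate rows to avoid slowdown and data leakage
df = df.drop_duplicates()

# Numpy step: zeros in the medical predictors become NaN,
# so they can later be filled with each column's median
predictors = ["Glucose", "BMI"]
df[predictors] = df[predictors].replace(0, np.nan)

print(len(df))                 # rows remaining after deduplication
print(df["BMI"].isna().sum())  # missing BMIs flagged for imputation
```

On this toy frame, one duplicate row is dropped and one zero BMI is flagged as missing.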
For instance, one way to handle missing data is by data in a fixed range [27]. If this process is not done, the removing them. However, in this case, the missing values will machine learning techniques tend to weigh greater values as be replaced with the median values of each variable. In this higher and smaller values as the lower values. work, the missing data is not removed in order to prevent the models from any kind of bias. Fig. 10. Feature Scaling D. Model Training Before training the dataset, the data is split into arrays of random training and testing subsets. The training dataset is used to fit in the machine learning model and the test data is used to evaluate the fit in the machine learning algorithm. From _____, Fig. 8. Replacing missing value most of the data is used for training whereas 25% of the dataset is used for testing. The training data is larger because having Fig. 8 shows how the missing values are replaced. A more data means the possibility of finding and learning function is used to find the median values of each variable. The important patterns is higher [28]. median values are used to replace missing values of medical predictors in the database. Another way to prepare the data is to ensure that all the data Fig. 11. Model Training are proportionate. The outlier is an observation which indicates data that differs from the rest of the data. They can represent The training processes for the five machine learning errors in measurement and bad data collection [26]. algorithms are very similar. Firstly, the model is created the 28 INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR INTERGRATED CIRCUITS AND SYSTEMS, VOL. 11, NO. 2, DECEMBER 2022 training data is fitted into the model. Then it can start The following shows a list of formulas [30] which are used predicting. to compare each machine learning algorithm: 1. The Accuracy Rate (ACC) is calculated using the number of all the correct predictions over the total number in the dataset. 
Fig. 12. Training with Logistic Regression
Fig. 13. Training with Support Vector Machine
Fig. 14. Training with K-Nearest Neighbour
Fig. 15. Training with Decision Tree Classifier
Fig. 16. Training with Random Forest Classifier

Fig. 12, Fig. 13, Fig. 14, Fig. 15 and Fig. 16 show how the models were trained. After importing the necessary library and preparing the data, variables are assigned to the machine learning algorithms; for instance, "lg" is assigned to the function "LogisticRegression()". Then, the ".fit()" function is used to perform the model training on the data "x_train" and "y_train". This process is applied to the other machine learning algorithms. After a model is built, the predictions are made using the test set.

The following list of formulas [30] is used to compare the machine learning algorithms:

1. The Accuracy Rate (ACC) is the number of correct predictions over the total number of samples in the dataset. The best case is an accuracy rate of 1.0, whereas the worst case is 0.0.

ACC = (TP + TN) / (TP + TN + FN + FP)
Equation 2. Accuracy Rate

2. The Specificity (SP) is the number of correct negative predictions over the total number of negatives. This is also known as the true negative rate. The specificity ranges from 0.0 to 1.0, where 1.0 is the best case and 0.0 is the worst.

SP = TN / (TN + FP)
Equation 3. Specificity

3. The Sensitivity (SN) is the number of correct positive predictions divided by the total number of positives. It is also known as the true positive rate (TPR) or recall. The best case is a sensitivity of 1.0.

SN = TP / (TP + FN)
Equation 4. Sensitivity

4. The Error Rate (ERR) is the number of wrong predictions over the total number of samples in the dataset. The error rate ranges from 0.0 to 1.0, where 0.0 is the best and 1.0 is the worst.

ERR = (FP + FN) / (TP + TN + FN + FP)
Equation 5. Error Rate

5. The Precision is the number of true positives divided by the total number of positive predictions. It is also known as the positive predictive value. A precision of 1.0 is the best case, whereas the worst is 0.0.
PREC = TP / (TP + FP)
Equation 6. Precision

6. The F-Score represents the test's accuracy using precision and sensitivity: it is the harmonic mean of the two.

F-Score = 2 * (precision * sensitivity) / (precision + sensitivity)
Equation 7. F-Score

7. The Matthews Correlation Coefficient (MCC) is a correlation coefficient computed from all the values in the confusion matrix.

MCC = (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Equation 8. Matthews Correlation Coefficient

RESULTS
Table 2 shows the confusion matrix of each machine learning algorithm after training.

Table 2. Confusion Matrix
Model | TP | FP | FN | TN
Logistic Regression | 109 | 16 | 20 | 50
Support Vector Machine | 110 | 15 | 18 | 52
K-Nearest Neighbour | 113 | 12 | 22 | 48
Decision Tree Classifier | 103 | 23 | 21 | 48
Random Forest Classifier | 115 | 11 | 17 | 52

A confusion matrix gives an overview of the prediction data on a classification problem. Using these confusion matrices, multiple measures can be computed to compare the machine learning algorithms. The confusion matrix holds the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) counts, and is read as [TP FP; FN TN] [29].
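Equations 2-8 can be checked directly against the reported confusion matrices. The sketch below applies them to the Logistic Regression entries of Table 2; the rounded results match that column of Table 3 (some entries of the printed table appear to differ by a final rounding digit):

```python
import math

# Confusion matrix reported for Logistic Regression in Table 2,
# read as [[TP, FP], [FN, TN]]
TP, FP, FN, TN = 109, 16, 20, 50

acc  = (TP + TN) / (TP + TN + FN + FP)          # Equation 2
sp   = TN / (TN + FP)                           # Equation 3
sn   = TP / (TP + FN)                           # Equation 4
err  = (FP + FN) / (TP + TN + FN + FP)          # Equation 5
prec = TP / (TP + FP)                           # Equation 6
f1   = 2 * prec * sn / (prec + sn)              # Equation 7
mcc  = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))  # Equation 8

print(round(acc, 3), round(sn, 3), round(prec, 3), round(mcc, 3))
# 0.815 0.845 0.872 0.594, as in the Logistic Regression column of Table 3
```

Swapping in any other row of Table 2 reproduces the corresponding column of Table 3 in the same way.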
The table below shows the results, where the values in bold show which classification method is the best on each measure:

Table 3. Values of different measures for the machine learning classification methods
Measure | Logistic Regression | Support Vector Machine | K-Nearest Neighbour | Decision Tree Classifier | Random Forest Classifier
Accuracy Rate (ACC) | 0.815 | 0.831 | 0.826 | 0.774 | 0.856
Specificity (SP) | 0.757 | 0.776 | 0.800 | 0.676 | 0.825
Sensitivity (SN) | 0.845 | 0.859 | 0.837 | 0.823 | 0.817
Error Rate (ERR) | 0.185 | 0.169 | 0.174 | 0.226 | 0.144
Precision (PREC) | 0.872 | 0.880 | 0.904 | 0.817 | 0.913
F-Score | 0.857 | 0.869 | 0.869 | 0.820 | 0.892
Matthews Correlation Coefficient (MCC) | 0.594 | 0.629 | 0.613 | 0.721 | 0.681

From Table 3, it can be seen that the Random Forest Classifier (RFC) is favoured, as it brings the best outcomes out of the five algorithms. The RFC has the best accuracy rate, specificity, error rate, precision and F-score. The RFC algorithm is more accurate than the other algorithms due to its high accuracy rate and low error rate. As the RFC has the highest specificity, there are few false positive results in this algorithm: very few results indicate that a patient has diabetes when in reality they do not.

The F-Score is another way to compare the accuracy of machine learning models; it can represent a more balanced observation than sensitivity, specificity or precision alone [30]. In this study, the RFC has a better F-Score than the other algorithms, an indication that the RFC is the more precise machine learning algorithm.

However, in this case, the Support Vector Machine algorithm has the highest sensitivity. This means that there is a high number of true positives: the test correctly detected about 86% (0.859 x 100) of the patients with diabetes.

Furthermore, the highest Matthews Correlation Coefficient among the algorithms is from the Decision Tree Classifier (DST), with an MCC value of 0.721, far from that of a random-guess classifier. The further an MCC value is from 0, the further the classifier is from random guessing; this algorithm is therefore adequately accurate and does not make frequent random guesses.

Although several algorithms perform better on individual measures, the Random Forest Classifier is still favoured overall, because the RFC performed better on most measures of performance.

CONCLUSIONS
One of the most important global health issues is the early identification of the risk of diabetes. This report shows how different machine learning algorithms can be used to predict the risk of developing diabetes. The five classifiers were implemented and compared using different measures. The results have shown that the Random Forest Classifier gives the highest accuracy rate, specificity, precision and F-score.

This study has reached its aim to demonstrate how accurate the machine learning algorithms are and to compare them in achieving different model performances.

FUTURE WORK AND LIMITATIONS
There are some limitations to this conclusion. For instance, the performance of the model is not the only variable that makes a good machine learning model; the time it takes to train the model is also a crucial variable that has to be considered.

Another limitation concerns the experimental results: the Support Vector Machine and the Decision Tree Classifier proved to perform better when measured against some parameters. Therefore, the conclusion that the Random Forest Classifier is the better model may not be definitive. To overcome this limitation, an extension of this work could use 10-fold cross-validation, which is used to estimate the competence of a machine learning model [31].

Furthermore, a limitation of this research is the dataset used. Although the dataset is comprised of two datasets, both came from the same country. Collecting datasets that originate from various countries could broaden the scope of this research and could potentially lead to a different conclusion.

In future work, it is also possible to improve the performance of the machine learning models by tuning the algorithms: the parameters of the machine learning models are modified in order to influence the outcome [32]. Another method to potentially obtain a better result is bagging and boosting, techniques that combine the results of weak models.
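The 10-fold cross-validation proposed above can be sketched as follows. This is a minimal illustration on synthetic data using sklearn's cross_val_score; the dataset, model settings and fold count of an actual extension would be design choices, not results from the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the diabetes dataset
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

# 10-fold cross-validation: each fold takes a turn as the held-out test set,
# giving an accuracy estimate that depends less on a single train/test split
scores = cross_val_score(RandomForestClassifier(random_state=0), x, y, cv=10)
print(len(scores), scores.mean())
```

Averaging the ten fold accuracies gives a more stable basis for comparing the five classifiers than the single 75/25 split used in this study.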
In the future, the research could also take the receiver operating characteristic (ROC) curve and the root mean squared error into consideration when comparing the machine learning models.

The result of this work can be used for future predictions of other conditions. This work still holds scope for further improvement and research, including using machine learning algorithms to predict the risk of diabetes or other diseases.

ACKNOWLEDGEMENT
This work is partially supported by the Xi'an Jiaotong-Liverpool University (XJTLU) AI University Research Centre and the Jiangsu (Provincial) Data Science and Cognitive Computational Engineering Research Centre at XJTLU, research funding XJTLU-REF-21-01-002; and by Sanda University under Grant 2022BSZX09.

REFERENCES
[1] N. Tigga and S. Garg, Prediction of Type 2 Diabetes using Machine Learning Classification Methods. Ranchi, 2019.
[2] "What is Diabetes?", Centers for Disease Control and Prevention, 2021. [Online]. Available: https://www.cdc.gov/diabetes/basics/diabetes.html. [Accessed: 12-Nov-2021].
[3] "Diabetes", Who.int, 2021. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/diabetes. [Accessed: 12-Nov-2021].
[4] E. Burns, "DEFINITION machine learning", 2021.
[5] IBM Education, "What is Supervised Learning?", Ibm.com. [Online]. Available: https://www.ibm.com/cloud/learn/supervised-learning. [Accessed: 12-Nov-2021].
[6] K. Bara, "Logistic Regression Essentials in R - Articles - STHDA", Sthda.com, 2018. [Online]. Available: http://www.sthda.com/english/articles/36-classification-methods-essentials/151-logistic-regression-essentials-in-r/. [Accessed: 14-Nov-2021].
[7] K. Bari, M. Chaouchi and T. Jung, "dummies - Learning Made Easy", Dummies.com, 2016. [Online]. Available: https://www.dummies.com/article/technology/information-technology/ai/machine-learning/how-support-vector-machine-predictive-analysis-predicts-the-future-154311. [Accessed: 14-Nov-2021].
[8] A. Navlani, "Support Vector Machines with Scikit-learn", DataCamp, 2019. [Online]. Available: https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python. [Accessed: 14-Nov-2021].
[9] A. Navlani, "KNN Classification using Scikit-learn", DataCamp, 2018. [Online]. Available: https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn. [Accessed: 18-Nov-2021].
[10] Decision Tree Algorithm, Explained - KDnuggets, 2020. [Online]. Available: https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html. [Accessed: 24-Apr-2022].
[11] Decision Trees for Classification: A Machine Learning Algorithm, 2017. [Online]. Available: https://www.xoriant.com/blog/product-engineering/decision-trees-machine-learning-algorithm.html. [Accessed: 24-Apr-2022].
[12] Random Forest | Introduction to Random Forest Algorithm, 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/. [Accessed: 24-Apr-2022].
[13] Decision Tree vs. Random Forest - Which Algorithm Should you Use?, 2020. [Online]. Available: https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/. [Accessed: 24-Apr-2022].
[14] M. Maniruzzaman et al., Comparative approaches for classification of diabetes mellitus data: Machine learning paradigm. Bangladesh, 2017.
[15] A. Viloria, Y. Herazo-Beltran, D. Cabrera and O. Bonerge Pineda, Diabetes Diagnostic Prediction Using Vector Support Machines. Warsaw, Poland, 2020.
[16] J. Lee, "Is Artificial Intelligence Better Than Human Clinicians in Predicting Patient Outcomes?", JMIR Publications, 2020. [Online]. Available: https://www.jmir.org/2020/8/e19918/. [Accessed: 18-Nov-2021].
[17] Quantilus, 2020. [Online]. Available: https://quantilus.com/why-is-machine-learning-important-and-how-will-it-impact-business/. [Accessed: 18-Nov-2021].
[18] UCI Machine Learning, "Pima Indians Diabetes Database", Kaggle.com, 2016. [Online]. Available: https://www.kaggle.com/uciml/pima-indians-diabetes-database. [Accessed: 24-Nov-2021].
[19] V. Ukani, "Diabetes Data Set", Kaggle.com, 2020. [Online]. Available: https://www.kaggle.com/vikasukani/diabetes-data-set. [Accessed: 24-Nov-2021].
[20] D. Kolasa, "Why is Python so popular in machine learning and AI? | ASPER BROTHERS", ASPER BROTHERS, 2019. [Online]. Available: https://asperbrothers.com/blog/why-python-for-machine-learning/. [Accessed: 24-Nov-2021].
[21] "Pandas Tutorial", W3schools.com. [Online]. Available: https://www.w3schools.com/python/pandas/default.asp. [Accessed: 27-Nov-2021].
[22] "Introduction to NumPy", W3schools.com. [Online]. Available: https://www.w3schools.com/python/numpy/numpy_intro.asp. [Accessed: 27-Nov-2021].
[23] "Code Faster with Line-of-Code Completions, Cloudless Processing", Kite.com, 2021. [Online]. Available: https://www.kite.com/python/docs/sklearn. [Accessed: 27-Nov-2021].
[24] K. Katari, "Seaborn: Python", Medium, 2020. [Online]. Available: https://towardsdatascience.com/seaborn-python-8563c3d0ad41. [Accessed: 01-Dec-2021].
[25] J. Brownlee, "Data Leakage in Machine Learning", Machine Learning Mastery, 2020. [Online]. Available: https://machinelearningmastery.com/data-leakage-machine-learning/.
[29] V. Draelos, "Measuring Performance: The Confusion
[Accessed: 01- Dec- 2021]. Matrix", Glass Box, 2019. [Online]. Available: [26] "How to find outliers Outliers In Machine https://glassboxmedicine.com/2019/02/17/measuring- Learning", Express Analytics, 2020. [Online]. Available: performance-the-confusion-matrix/. [Accessed: 03- Dec- https://expressanalytics.com/blog/outliers-machine- 2021]. learning/#:~:text=An%20outlier%20is%20a%20data,co [30] "The Best Metric to Measure Accuracy of Classification nsidered%20when%20collecting%20the%20data. Models - KDnuggets", KDnuggets, 2016. [Online]. [Accessed: 03- Dec- 2021]. Available: https://www.kdnuggets.com/2016/12/best- [27] "ML | Feature Scaling – Part 2 - metric-measure-accuracy-classification-models.html/2. GeeksforGeeks", GeeksforGeeks, 2021. [Online]. [Accessed: 05- Dec- 2021]. Available: https://www.geeksforgeeks.org/ml-feature- [31] J. Brownlee, "A Gentle Introduction to k-fold Cross- scaling-part-2/. [Accessed: 03- Dec- 2021]. Validation", Machine Learning Mastery, 2018. [Online]. [28] The Difference Between Training Data vs. Test Data in Available: https://machinelearningmastery.com/k-fold- Machine Learning: 2022. cross-validation/. [Accessed: 05- Dec- 2021]. https://www.obviously.ai/post/the-difference-between- [32] How to Improve Machine Learning Results: 2013. training-data-vs-test-data-in-machine- https://machinelearningmastery.com/how-to-improve- learning#:~:text=Training%20data%20is%20typically% machine-learning-results/. Accessed: 2022- 04- 24 32 INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR INTERGRATED CIRCUITS AND SYSTEMS, VOL. 11, NO. 
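The 10-fold cross-validation suggested in the discussion above can be sketched in plain Python. This is a minimal illustration, not the authors' code; `model_score` is a hypothetical callback standing in for fitting a classifier on the training indices and scoring it on the test indices.

```python
def k_fold_indices(n_samples, k=10):
    """Split sample indices into k roughly equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(model_score, n_samples, k=10):
    """Average a scoring callback over k train/test splits.

    model_score(train_idx, test_idx) is a placeholder for training a
    model on train_idx and returning its score on test_idx.
    """
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th becomes training data.
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        scores.append(model_score(train_idx, test_idx))
    return sum(scores) / k
```

In practice the same estimate is obtained with scikit-learn's `KFold` and `cross_val_score`; the sketch only makes the fold bookkeeping explicit.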
BUAS: Joint Bottom-Up Article Selection for Quick Article Similarity Identification Based on NLP

Syu-Jhih Jhang(1), Chih-Yung Chang(1), Shih-Jung Wu(1), Chia-Ling Ho(2)
810440064@gms.tku.edu.tw; cychang@mail.tku.edu.tw; wushihjung@mail.tku.edu.tw; chialingho@ntunhs.edu.tw
(1) Tamkang University, Taiwan; (2) National Taipei University of Nursing and Health Sciences

Abstract—Article similarity identification is one of the most important issues in article comparison. In the literature, some studies have proposed similarity comparison mechanisms based on Word2Vec, N-gram or BERT. However, a document usually contains a large number of words. Let the source of the comparison be a document; the goal is to compare this source document with thousands of documents in a database. It is time-consuming to compare the similarity of one target document against all documents in the database, since the existing mechanisms can only compare the similarity of two documents at a time. As a result, plagiarism comparison is very time-consuming. This paper proposes a plagiarism comparison mechanism, called BUAS, which speeds up the similarity comparison: a bag-of-words scheme is initially applied to transform each document into a document vector. Then the most similar documents can be found as candidate documents, so the target document only needs to be compared with the candidates. Performance studies confirm that the similarity calculation by BUAS outperforms existing studies in terms of precision, recall and F1 score.

Index Terms—Natural Language Processing, TF-IDF, Word2Vec, Bag of words, Document similarity

I. INTRODUCTION

Article similarity is an important research topic in the field of natural language processing. Although existing text similarity comparison systems can provide users with plagiarism comparison of sentences and paragraphs, the existing methods need to go through many time-consuming processes, which compare word similarity in a one-by-one manner between the target article and all the articles in the database. As a result, it takes a lot of time to wait for the comparison results.

This study proposes a plagiarism comparison mechanism, called BUAS, aiming to save the time of similarity comparison between one target article and all articles in the database. The proposed BUAS first extracts the important keywords of each article by TF-IDF. These keywords are treated as the article vector using the bag-of-words model [1][2]. To confirm whether two words belonging to different articles are similar, cosine similarity [3][4] is applied. Based on the article vectors, similar articles can be identified, saving the time of comparing a large number of articles. After that, the proposed mechanism further compares the sentences containing those similar keywords in the similar articles. If the sentences are also similar, the mechanism further compares the paragraphs containing the similar sentences. Finally, the proposed BUAS further compares the articles that contain particularly similar paragraphs. Different from previous studies, this study uses a bottom-up way to compare the similarity of articles, which can save a lot of comparison time.

II. RELATED WORK

The core purpose of this study is to design a comparison method that quickly finds similar articles. This section describes related studies dealing with article similarity in recent years. Technically, related studies can be roughly divided into three categories: technologies based on BERT, N-gram and Word2vec.

Study [5] used BERT to propose a plagiarism detection system. It first removed stop words from the two articles A and B, and then divided the two articles into many sentences. After that, each sentence of the two articles A and B was fed into a sentence-transformer to obtain its sentence vector. The sentences of the two articles were then compared one by one for similarity, and a threshold value was set to filter the sentences with high similarity. Finally, the sentences that satisfied the threshold were summed up and averaged to obtain the article similarity.

Study [6] proposed a part-of-speech (POS) tag N-gram plagiarism detection system. It first segmented the two compared articles, then removed the stop words and further segmented sentences from the articles. By tagging the processed words with POS, the part-of-speech structure of each sentence was obtained, and the part-of-speech structures of the two articles were compared. If they were similar, it further compared the sentence part-of-speech structure of the whole article. If the similarity of the sentences was higher than the specified value, the sentence similarity was compared through Word2vec, and finally the comparison results of similar sentences were displayed.

Study [7] proposed a similarity comparison method based on the Word2vec model. The two law articles were first segmented, and then the sentence vectors were obtained through Word2vec. After that, the sentences of the two articles were compared one by one for similarity, mainly using cosine similarity and the Word Mover's Distance. Finally, the sentences with the highest similarity were summed and divided by the total number of sentences to give the article similarity.

The abovementioned studies are based on BERT, N-gram and Word2vec. When comparing similarity, most of them perform a single comparison of sentences, paragraphs or articles. In the case that multiple articles need to be compared, it takes a long time and the comparison cannot be done quickly.
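The coarse filtering stage that BUAS performs first — turn each document into a bag-of-words vector, then keep only the candidates whose cosine similarity to the target is high — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the stop-word list and the threshold value are placeholders.

```python
import math
from collections import Counter

# Placeholder stop-word list; a real system would use a full list.
STOPWORDS = {"the", "a", "of", "and", "to", "is"}

def bow_vector(text):
    """Bag-of-words vector: word -> count, with stop words removed."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return Counter(words)

def cosine_similarity(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def candidate_articles(target, database, threshold=0.2):
    """Keep only articles whose document vector is close to the target's."""
    tv = bow_vector(target)
    return [doc for doc in database
            if cosine_similarity(tv, bow_vector(doc)) >= threshold]
```

Only the surviving candidates would then go through the finer sentence-, paragraph- and word-level comparisons described above, which is where the bottom-up scheme saves time.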
III. RESEARCH METHODS

This paper aims to propose an article similarity comparison mechanism which finds the most similar article in the database for a given new article.

Let $A = \{A^1, A^2, \dots, A^q\}$ denote the set of $q$ articles. Each article $A^i = \{b^{i,1}, b^{i,2}, \dots, b^{i,|A^i|}\}$ is composed of several paragraphs $b^{i,j}$, where $1 \le j \le |A^i|$. Each paragraph $b^{i,j} = \{c^{i,j,1}, c^{i,j,2}, \dots, c^{i,j,|b^{i,j}|}\}$ is composed of several sentences $c^{i,j,k}$, where $1 \le k \le |b^{i,j}|$. Each sentence $c^{i,j,k} = \{d^{i,j,k,1}, d^{i,j,k,2}, \dots, d^{i,j,k,|c^{i,j,k}|}\}$ is composed of several words. Let $U = \{A^1, A^2, \dots, A^q\}$ denote the union of the $q$ documents.

First of all, it is necessary to remove stop words from each paper, so that the TF-IDF keyword calculation does not include unimportant words. After the stop words are removed, the paper is segmented by word breaks and sentence breaks. Next, TF-IDF is used to obtain the keywords of the article and reorder them. The goal of the next step is to create an article vector for each article $A^i$, through which a preliminary similarity comparison can be performed. After obtaining the word vectors of all paper keywords through the trained Word2vec model, the similarity of the keywords is compared.

Given a target article $A^{target} = \{b^{t,1}, b^{t,2}, \dots, b^{t,|A^{target}|}\}$ composed of several paragraphs $b^{t,\hat{j}}$, $1 \le \hat{j} \le |A^{target}|$, let each paragraph $b^{t,\hat{j}} = \{c^{t,\hat{j},1}, c^{t,\hat{j},2}, \dots, c^{t,\hat{j},|b^{t,\hat{j}}|}\}$ be composed of several sentences $c^{t,\hat{j},\hat{k}}$, $1 \le \hat{k} \le |b^{t,\hat{j}}|$, and let each sentence $c^{t,\hat{j},\hat{k}} = \{d^{t,\hat{j},\hat{k},1}, d^{t,\hat{j},\hat{k},2}, \dots, d^{t,\hat{j},\hat{k},|c^{t,\hat{j},\hat{k}}|}\}$ be composed of several words $d^{t,\hat{j},\hat{k},\hat{q}}$, $1 \le \hat{q} \le |c^{t,\hat{j},\hat{k}}|$. This paper aims to develop an article comparison mechanism which compares $A^{target}$ with each article $A^i$ in $A$ and finds the most similar article.

Consider two words $d^{i,j,k,q} \in A^i$ and $d^{t,\hat{j},\hat{k},\hat{q}} \in A^{target}$. Let $\lambda^{i,j,k,q}_{t,\hat{j},\hat{k},\hat{q}}$ be a Boolean variable that indicates whether or not the two words are identical:

$$\lambda^{i,j,k,q}_{t,\hat{j},\hat{k},\hat{q}} = \begin{cases} 1, & d^{i,j,k,q} = d^{t,\hat{j},\hat{k},\hat{q}} \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

If the condition $\lambda^{i,j,k,q}_{t,\hat{j},\hat{k},\hat{q}} = 1$ holds, the $q$-th word in article $A^i$ is the same as the $\hat{q}$-th word in article $A^{target}$.

The next step is to compare the $k$-th sentence $c^{i,j,k}$ in article $A^i$ and the $\hat{k}$-th sentence $c^{t,\hat{j},\hat{k}}$ in $A^{target}$. Let $\lambda^{i,j,k}_{t,\hat{j},\hat{k}}$ denote the number of identical words in the two sentences:

$$\lambda^{i,j,k}_{t,\hat{j},\hat{k}} = \sum_{\hat{q}=1}^{|c^{t,\hat{j},\hat{k}}|} \sum_{q=1}^{|c^{i,j,k}|} \lambda^{i,j,k,q}_{t,\hat{j},\hat{k},\hat{q}} \quad (2)$$

If the condition $\lambda^{i,j,k}_{t,\hat{j},\hat{k}} \ge |c^{i,j,k}|$ holds, the paragraphs $b^{i,j}$ and $b^{t,\hat{j}}$ that contain $c^{i,j,k}$ and $c^{t,\hat{j},\hat{k}}$, respectively, should be further examined. Let $\lambda^{i,j}_{t,\hat{j}}$ denote the number of identical sentences in the two paragraphs:

$$\lambda^{i,j}_{t,\hat{j}} = \sum_{\hat{k}=1}^{|b^{t,\hat{j}}|} \sum_{k=1}^{|b^{i,j}|} \lambda^{i,j,k}_{t,\hat{j},\hat{k}} \quad (3)$$

If the condition $\lambda^{i,j}_{t,\hat{j}} \ge |b^{i,j}|$ holds, let $\lambda^{i}_{t}$ denote the number of identical paragraphs in the two articles $A^i$ and $A^{target}$:

$$\lambda^{i}_{t} = \sum_{\hat{j}=1}^{|A^{target}|} \sum_{j=1}^{|A^i|} \lambda^{i,j}_{t,\hat{j}} \quad (4)$$

Let $A^{like}$ be the subset of $A$. Every article $A^i \in A$ which satisfies the condition $\lambda^i_t \ge \delta$ is collected in $A^{like}$; that is, $A^{like} = \{A^i \mid \lambda^i_t \ge \delta\}$.

Let $R$ be an algorithm and let the set $A^{like}$ found by algorithm $R$ be called $A^{like}_R$. Let $\delta^i_R$ denote whether or not the article $A^i$ is considered an element of $A^{like}_R$ by applying algorithm $R$:

$$\delta^i_R = \begin{cases} 1, & A^i \in A^{like}_R \\ 0, & \text{otherwise} \end{cases} \quad (5)$$

Let $TP_i$, $TN_i$, $FP_i$ and $FN_i$ denote the True Positive, True Negative, False Positive and False Negative of the prediction result of algorithm $R$ for the article $A^i$, respectively, where $\delta^i$ indicates whether $A^i$ truly belongs to $A^{like}$. We have

$TP_i = \delta^i \times \delta^i_R$, $\; TN_i = (1-\delta^i) \times (1-\delta^i_R)$, $\; FP_i = (1-\delta^i) \times \delta^i_R$, and $FN_i = \delta^i \times (1-\delta^i_R)$.

Let $TP_R$, $TN_R$, $FP_R$ and $FN_R$ denote the True Positive, True Negative, False Positive and False Negative of the prediction results over all articles $A^i \in A$, $1 \le i \le q$:

$TP_R = \sum_{i=1}^{q} TP_i$, $\; TN_R = \sum_{i=1}^{q} TN_i$, $\; FP_R = \sum_{i=1}^{q} FP_i$, and $FN_R = \sum_{i=1}^{q} FN_i$.

Let $\mathcal{A}_R$, $\wp_R$ and $\mathcal{R}_R$ denote the Accuracy, Precision and Recall of the predictions obtained by applying algorithm $R$ to all articles $A^i \in A$. Their values can be derived by applying Exps. (6), (7) and (8), respectively:

$$\mathcal{A}_R = \frac{TP_R + TN_R}{TP_R + TN_R + FP_R + FN_R} \quad (6)$$

$$\wp_R = \frac{TP_R}{TP_R + FP_R} \quad (7)$$

$$\mathcal{R}_R = \frac{TP_R}{TP_R + FN_R} \quad (8)$$

The F1-Score, denoted by $\mathcal{F}_R$, is used to balance the weights of false positives and false negatives. Exp. (9) gives its calculation:

$$\mathcal{F}_R = \frac{2(\wp_R \times \mathcal{R}_R)}{\wp_R + \mathcal{R}_R} \quad (9)$$

For a given article $A^{target}$, let $\mathbb{R}$ denote the set of all possible mechanisms, each of which can determine the articles most similar to $A^{target}$. This paper aims to develop the best algorithm $R \in \mathbb{R}$ which minimizes the number of errors and maximizes the F1-score metric. The objective function of this paper can be expressed by Exp. (10):

$$\text{Objective:} \quad \max_{R \in \mathbb{R}} \frac{2(\wp_R \times \mathcal{R}_R)}{\wp_R + \mathcal{R}_R} \quad (10)$$

Figure 1. Precision comparison result
Figure 2. Recall comparison result
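The metric definitions in Exps. (6)-(9) can be checked with a short sketch that aggregates the per-article ground-truth indicators $\delta^i$ and the algorithm's decisions $\delta^i_R$. This is an illustrative restatement of the formulas, not code from the paper.

```python
def evaluate(truth, predicted):
    """Accuracy, precision, recall and F1 from per-article indicators.

    truth[i]     -> delta_i   (1 if article i is truly in A_like)
    predicted[i] -> delta_i_R (1 if algorithm R put article i in A_like_R)
    Mirrors TP_i = delta_i * delta_i_R etc., summed as in Exps. (6)-(9).
    """
    tp = sum(t * p for t, p in zip(truth, predicted))
    tn = sum((1 - t) * (1 - p) for t, p in zip(truth, predicted))
    fp = sum((1 - t) * p for t, p in zip(truth, predicted))
    fn = sum(t * (1 - p) for t, p in zip(truth, predicted))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```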
IV. PERFORMANCE EVALUATION

This section compares the performance of the proposed algorithm with the BERT-based and Word2Vec-based algorithms in plagiarism detection. The experimental setup is described below. The performance results depend highly on the similarity threshold and the number of keywords. Therefore, the similarity threshold is varied from 0.3 to 0.9 while the number of keywords is varied from 50 to 500. The software platform of the experiment is the Windows 10 operating system, and the development environment is Python 3.7.13. An Intel Core i9-10900K CPU is used for model training and experiments. The metrics for comparing the proposed algorithm with the existing BERT and Word2Vec approaches are the Accuracy, Precision, Recall and F1-score of the three algorithms. MATLAB is used for the graphical display of the data.

The following are the experimental results. The test data is generated by replacing part of the article content with words that have similar word vectors, which raises the difficulty of identifying the content similarity of the articles.

Figure 3. F1-score comparison result

Figures 1, 2 and 3 show the experimental results of Precision, Recall and F1-score under the varying thresholds and keyword counts. From the experimental results, it can be seen that the measured values are lower overall: because similar-word substitutions are harder to recognize, the overall scores are reduced. As shown in Figure 3, the gap with respect to BERT and Word2Vec is most significant. This study conducts a more detailed comparison through multiple comparison stages, so it can better handle the use of similar words in an article. When the similarity threshold is set to 0.5 and the number of keyword extractions is set to 300, the best similarity comparison result can be obtained.
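The experimental grid described above (thresholds 0.3-0.9 in steps of 0.1, keyword counts 50-500 in steps of 50) amounts to a small parameter sweep. A sketch follows; `run_buas` is a hypothetical callback that returns the F1-score of one experiment, standing in for the actual system.

```python
def best_setting(run_buas):
    """Return the (threshold, keyword_count) pair with the highest score.

    run_buas(threshold, keyword_count) is a placeholder for running one
    BUAS experiment and returning its F1-score.
    """
    thresholds = [round(0.3 + 0.1 * i, 1) for i in range(7)]  # 0.3 .. 0.9
    keyword_counts = range(50, 501, 50)                       # 50 .. 500
    return max(((t, k) for t in thresholds for k in keyword_counts),
               key=lambda tk: run_buas(*tk))
```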
Figure 4. Operation time required for the three methods

Figure 4 shows the operation time required for sample B, with document lengths ranging from 10,000 to 25,000 and CPU running times (seconds) ranging from 0 to 160. The reason our approach is faster than the BERT-based and Word2Vec-based methods is that the JCF comparison method is divided into several stages. In each comparison stage, only the important information in the article is used for similarity comparison instead of a one-to-one comparison, so the whole comparison process can be completed in a shorter time. Experimental results show that our method outperforms the BERT-based and Word2Vec-based methods.

V. CONCLUSION

This paper proposes an article similarity comparison mechanism. This study uses the extracted article features to compare the articles in stages from coarse to fine, and quickly locates the articles that need to be compared through the article features. This method can reduce the time spent comparing articles one by one. In the next step, the similarity restoration algorithm is used to avoid errors in the results of the article feature comparison, thereby improving the accuracy of the comparison results. Finally, the method designed in this study can improve the accuracy of similarity comparison, improve the efficiency of users when comparing article similarity, and allow users to make modifications according to the similarities shown in the comparison results.

REFERENCES

[1] Z. S. Harris, "Distributional structure," Word, vol. 10, no. 2-3, pp. 146-162, 1954.
[2] Y. Zhang, R. Jin, and Z.-H. Zhou, "Understanding bag-of-words model: a statistical framework," International Journal of Machine Learning and Cybernetics, vol. 1, no. 1, pp. 43-52, 2010.
[3] F. Rahutomo, T. Kitasuka, and M. Aritsugi, "Semantic cosine similarity," The 7th International Student Conference on Advanced Science and Technology (ICAST), vol. 4, no. 1, p. 1, 2012.
[4] P. Xia, L. Zhang, and F. Li, "Learning similarity with cosine similarity ensemble," Information Sciences, vol. 307, pp. 39-52, 2015.
[5] A. Bohra and N. Barwar, "A Deep Learning Approach for Plagiarism Detection System Using BERT," Congress on Intelligent Systems, pp. 163-174, 2022.
[6] K. Yalcin, I. Cicekli, and G. Ercan, "An external plagiarism detection system based on part-of-speech (POS) tag n-grams and word embedding," Expert Systems with Applications, vol. 197, p. 116677, 2022.
[7] C. Xia, T. He, W. Li, Z. Qin, and Z. Zou, "Similarity analysis of law articles based on Word2vec," 2019 IEEE 19th International Conference on Software Quality, Reliability and Security Companion (QRS-C), pp. 354-357, 2019.

CE-SQL: A Single-Table Chinese Text-to-SQL Generation with BERT-Based Slot Filling Method

Yuan-Lin Liang, Chih-Yung Chang, Kuo-Chung Yu
809416018@gms.tku.edu.tw; cychang@mail.tku.edu.tw; 133742@mail.tku.edu.tw
Tamkang University, Taiwan

Abstract—Information retrieval from databases is challenging for a non-SQL domain expert. In order to overcome the challenge, Natural Language-to-SQL (NL2SQL) is currently the most popular method to tackle it. NL2SQL is a task that generates an equivalent SQL query to retrieve information from the database based on a natural language question. Research related to NL2SQL has provided decent solutions, especially in English. However, there have been very few solutions for Chinese NL2SQL. This research aims to present a solution for NL2SQL in the Chinese language. This research also presents a framework based on pre-trained BERT, which consists of multiple deep-learning models that solve different tasks to extract the SQL-related keywords and values. The experimental results show the robustness and flexibility of the method in extracting the values.

Index Terms—NL2SQL, CE-SQL, Text2SQL, BERT

I. INTRODUCTION

Semantic parsing is a task that transforms natural languages into a logical expression that computers can execute. Some significant works, such as Text2Code [1], Text2Sparql [2], and NL2SQL [3], are essential tasks in semantic parsing. Natural Language to Structured Query Language (NL2SQL) is a task that parses natural languages into SQL queries. This task has many potential real-world applications, such as question answering [4], robot navigation [5], and many more. NL2SQL allows users without domain knowledge to access database information and increases the cost-effectiveness of data analysis. Therefore, NL2SQL has great potential research value.

Up until now, many researchers have studied NL2SQL with three different approaches. The first approach is rule-based matching, which parses natural language into a SQL query using a custom set of rules. The second approach is the sequence-to-sequence (Seq2Seq) method, which directly translates natural language into a SQL query. The last approach is sequence-to-set (Seq2Set), also known as the sketch-based approach, which predicts the components in specific slots of SQL queries based on keywords mentioned in the natural language sentence. Each approach has its advantages and disadvantages. The rule-based matching approach has high accuracy if the natural language sentence has specific patterns; however, it performs significantly worse when provided with a non-specific pattern sentence. Later studies mainly use the encoder-decoder framework with Long Short-Term Memory (LSTM) for such tasks. However, that framework does not have enough computation power to handle such a difficult task; therefore, the attention mechanism [15] was introduced. The Seq2Seq approach introduced in Seq2SQL [6] is an attention-based encoder-decoder framework that offers better performance than the previous framework. However, the overall accuracy is not high due to various problems. For example, the same natural language sentence may have many equivalent SQL queries with different grammar orderings, which may affect the overall accuracy. The Seq2Set approach introduced in SQLNet [3] can avoid the mentioned problems. Seq2Set offers high accuracy, but the performance is reduced when handling more complex SQL queries such as nested conditions. Due to how Seq2Set performs, many later studies also adopt this approach. For example, TypeSQL [7] uses a knowledge graph to extract the data-type information of the potential keywords in the query, improving the performance. The Coarse2Fine [8] method is slightly different: it contains two phases, where the first phase generates an intermediate sketch and the second phase produces variables for the WHERE clause based on that sketch. decaNLP [9] proposes a new multitask question-answering network (MQAN), which can learn simultaneously without task-specific modules or parameters. Wang et al. [10] propose an execution-guided decoding algorithm to validate the generated SQL by executing the candidate SQL query during the decoding phase, in order to avoid runtime errors and remove the candidate SQL queries with syntax errors.

Figure 1. Structure of the SQL sketch; the blue parts are the components that need to be filled.

Pre-trained language models such as BERT [11], ELMo [12], GPT [13], etc., play an essential role in capturing natural language representations. SQLova [14] uses the BERT model as an encoder to help capture semantic representations in the query, significantly improving accuracy.

Most works mentioned above require English natural language as the input query and are capable of achieving state-of-the-art results. This work aims to study such tasks in the Chinese language. This work adopts the sketch-based approach, which uses a deep learning neural network with a BERT pre-trained language model as an encoder and consists of many tasks divided into classifier (C-task) and extractor (E-task) groups. Each task predicts one component in the SQL sketch.

II. FRAMEWORK ARCHITECTURE

This section presents the proposed deep learning model. The model, called CE-SQL, consists of four tasks: the first two are a hybrid task, $Agg and $Op. The third and fourth tasks are
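The sketch-based (Seq2Set) idea described in Section I can be illustrated with a toy example: instead of generating a SQL string token by token, a fixed sketch is filled slot by slot. The sketch string and slot names below are illustrative only, not the actual CE-SQL sketch; the slot values stand in for the outputs of the per-task classifiers and extractors.

```python
# Hypothetical single-table SQL sketch; each {slot} would be predicted
# by a dedicated classifier (e.g. the aggregate and operator) or
# extractor (e.g. the condition value) task.
SKETCH = "SELECT {agg}({col}) FROM {table} WHERE {cond_col} {op} {value}"

def fill_sketch(slots):
    """Assemble a single-table SQL query from predicted slot values."""
    return SKETCH.format(**slots)

query = fill_sketch({
    "agg": "COUNT", "col": "name", "table": "employees",
    "cond_col": "age", "op": ">", "value": "30",
})
# query == "SELECT COUNT(name) FROM employees WHERE age > 30"
```

Because each slot is predicted independently, the equivalent-query ambiguity of token-by-token Seq2Seq generation (the same question admitting many SQL orderings) does not arise.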