Review
Robust Face Recognition Under Challenging Conditions: A Comprehensive Review of Deep Learning Methods and Challenges

Aidana Zhalgas 1, Beibut Amirgaliyev 2,* and Adil Sovet 2

1 Department of Computational and Data Science, Astana IT University, Astana 010000, Kazakhstan; aidana.zhalgas@astanait.edu.kz
2 Department of Computer Engineering, Astana IT University, Astana 010000, Kazakhstan; 242675@astanait.edu.kz
* Correspondence: beibut.amirgaliyev@astanait.edu.kz

https://doi.org/10.3390/app15179390
Academic Editor: Douglas O’Shaughnessy
Received: 11 August 2025; Revised: 19 August 2025; Accepted: 20 August 2025; Published: 27 August 2025

Abstract
This paper critically reviews face recognition models based on deep learning, with a focus on security and surveillance applications. Although existing systems perform well under controlled conditions, they remain susceptible to pose variation, occlusion, low resolution and even aging. We systematically review four state-of-the-art architectures—FaceNet, ArcFace, OpenFace and SFace—using five benchmark datasets, namely LFW, CPLFW, CALFW, AgeDB-30 and QMUL-SurvFace. Performance is evaluated using the area under the receiver operating characteristic curve (ROC-AUC), accuracy, precision and F1-score. The results show that FaceNet and ArcFace achieve the highest accuracy in well-lit, frontal settings, whereas SFace proves more robust to degraded, low-resolution surveillance images. OpenFace's performance declines across all datasets as data scale grows, exposing the limitations of earlier embedding methods. These results underscore the main contribution of this study: a comparative evaluation of the models under difficult real-life conditions and an analysis of the trade-off between generalization and specialization inherent in each model. Specifically, ArcFace and FaceNet are optimized to perform well in constrained settings, while SFace performs best in the wild. Model selection must therefore be matched carefully to the deployment context, and future studies should investigate hybrid architectures that maintain performance under fluctuating conditions.

Keywords: face recognition; deep learning; occlusion handling; face detection; facial feature extraction; masked face recognition; robustness evaluation

1. Introduction
Face recognition has recently become an essential part of modern security systems, both biometric and non-biometric, as it offers a safe and time-efficient method of identity verification. It is also widely used in fields such as security, healthcare and education, where it supports criminal investigations, patient identification and attendance monitoring.
However, changes in lighting conditions, facial expressions, head pose and background scenes tend to reduce recognition accuracy. Facial recognition technology has been significantly advanced by the introduction of deep learning, which can extract even complex visual patterns and thus exceed the performance of more traditional methods. Despite these advances, a single feature extraction method may not be sufficient in varied environments. Combining two or more models can improve recognition, but selecting the most significant features across models is a major challenge, and it affects both the accuracy and the computational efficiency of the system. Recent advancements, such as Fast-FaceNet and Siamese-based lightweight models, show that, by optimizing the network architecture and integrating MobileNet, depth-wise separable convolutions and memory access cost (MAC) balancing, face recognition can run on mobile and embedded systems with real-time performance and competitive accuracy.

The spread of video surveillance, intelligent video stream analysis platforms and sensor infrastructure is driven by the growing development of cities and the adoption of Smart City strategies. Face recognition technologies are central to such systems, providing not only security and law enforcement solutions but also civilian ones, including access control, attendance monitoring and even targeted marketing. Although many face detection and recognition solutions have already been deployed, most systems perform poorly in real time, as well as under poor lighting, unusual facial poses, low image resolution and partial occlusion. Moreover, highly accurate image processing often demands substantial computing power, making such systems unsuitable for mobile and low-powered devices [1–4]. Practical surveillance situations are even more complex because scenes are dynamic and many targets can enter them under uncontrolled circumstances. Dynamic-scene face recognition demands systems that not only recognize faces across a variety of poses, illuminations, scale variations and occlusions, but also sustain real-time performance. These problems have motivated recent work on more powerful algorithms, including attention-based mechanisms and transformer architectures that enable models to learn long-range dependencies and occlusion-invariant features. Alternative approaches exploit spatiotemporal modeling within video streams by pairing Convolutional Neural Networks (CNNs) with either Recurrent Neural Networks (RNNs) or temporal transformers, allowing systems to identify faces across frames in temporal sequences [5–8]. Moreover, to enable dynamic face recognition on edge devices, methodologies such as lightweight model optimization and knowledge distillation have been proposed, allowing faster inference with reduced resource utilization. Therefore, to provide more effective and efficient face recognition in an active urban environment, advanced recognition techniques need to be incorporated into surveillance systems.
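To make the efficiency argument above concrete, the following is a minimal PyTorch sketch of the depth-wise separable convolution block that underlies MobileNet-style lightweight recognizers such as Fast-FaceNet. It is an illustrative sketch, not code from any cited paper; the channel sizes and input resolution are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style building block: a per-channel (depthwise) convolution
    followed by a 1x1 (pointwise) convolution that mixes channels."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # groups=in_ch makes each 3x3 filter see only its own input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

# A standard 3x3 conv costs roughly in_ch*out_ch*9 multiplies per position;
# the separable pair costs in_ch*9 + in_ch*out_ch, which is where the
# mobile-friendly savings come from.
block = DepthwiseSeparableConv(64, 128, stride=2)
feats = block(torch.randn(1, 64, 112, 112))  # -> shape (1, 128, 56, 56)
```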
Beyond technical issues, special attention must be paid to questions of scalability, data privacy and the ethical use of face recognition systems, particularly in public and open spaces [9–12]. All these factors highlight the relevance of developing robust, adaptive and resource-efficient face recognition models. This work focuses on the investigation, evaluation and comparative analysis of modern face detection and recognition models under conditions close to real-time operation and in challenging environmental settings. In the experimental phase, detection was implemented through fine-tuning of state-of-the-art models on diverse datasets, taking into account variations in lighting, image quality and viewing angles. Recognition was assessed by running and testing modern embedding-based models across a range of open and specialized datasets, covering factors such as age differences, facial expressions, resolution changes and head pose variations. The paper presents experimental results, accuracy and speed metrics, and provides an analysis of model effectiveness in the context of real-world applications.

2. Literature Review
2.1. Classical Methods
Before the emergence of deep learning, face recognition systems were mainly based on statistical and appearance-based methods. Eigenfaces, Fisherfaces and Local Binary Patterns (LBP) are among the most prominent of these. These early methods formed the basis of contemporary facial recognition research and remain applicable where speed of execution and explainability matter. Principal Component Analysis (PCA) is used in the Eigenfaces [13] approach to reduce the dimensionality of facial image data, with the intention of isolating the most important features. Faces are expressed as a linear combination of a set of eigenvectors—referred to as Eigenfaces—of the covariance matrix of the training images. Although this greatly simplifies computation, it turns out to be very fragile to changes in lighting and facial expression. The Fisherfaces method [14] avoids this issue by adding Linear Discriminant Analysis (LDA) to PCA to maximize between-class variability. In contrast to Eigenfaces, where directions of maximum variance are sought without attention to class labels, Fisherfaces tries to minimize intra-class variance while maximizing inter-class variance. This offers better robustness to changes in illumination and facial appearance. A further classical descriptor, introduced by Ojala et al. (2002) [15], is LBP, which captures local spatial patterns in grayscale images. By thresholding the neighbourhood of each pixel and representing the result as a binary number, LBP computes highly discriminative histograms that are largely invariant to monotonic intensity changes. LBP is simple and efficient enough to be used in real-time face detection and recognition systems, even in resource-constrained environments.

Although deep learning-based models have replaced these classical techniques in high-performance applications, the classical techniques remain useful in various settings. They are transparent, computationally cheap and simple to implement, and can be applied in embedded systems, for initial feature extraction, or within hybrid recognition systems.
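As an illustration of the operator just described, here is a minimal NumPy sketch of the basic 8-neighbour LBP code and its histogram. It is a plain-vanilla version for clarity; the rotation-invariant and multi-scale variants from [15] are omitted, and the input image is a random stand-in.

```python
import numpy as np

def lbp_codes(gray: np.ndarray) -> np.ndarray:
    """Basic 3x3 LBP: threshold each pixel's 8 neighbours against the
    centre pixel and pack the comparison bits into one byte per pixel."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]  # centre pixels (image borders are skipped)
    # 8 neighbours, enumerated clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes += (neigh >= c).astype(np.int32) << bit
    return codes.astype(np.uint8)  # values fit in 0..255

def lbp_histogram(gray: np.ndarray) -> np.ndarray:
    """256-bin normalized histogram of LBP codes: the actual matching feature."""
    hist = np.bincount(lbp_codes(gray).ravel(), minlength=256)
    return hist / max(hist.sum(), 1)

face = (np.random.rand(112, 112) * 255).astype(np.uint8)  # stand-in image
print(lbp_histogram(face).shape)  # (256,)
```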
2.1.1. Eigenface
The paper [16] proposes a face recognition framework that combines a normalized Eigenface representation with Histogram Equalization (HE) to deal with lighting variance in imagery. The technique enhances recognition accuracy by preprocessing images with HE, normalizing pixel intensities and using PCA for feature extraction. The proposed system was compared to regular Eigenface techniques and performed better on the Yale B, Extended Yale B and AR datasets under varying degrees of illumination. Recognition efficacy was shown to improve by 13.6–222.9% over the conventional Eigenface technique. The article establishes the usefulness of normalization and contrast enhancement in robust face recognition.

The paper [17] compares FaceNet+SVM and Eigenface+SVM for student emotion recognition in a learning environment, particularly in scenarios with varying light and occlusion. With only one image per learner, FaceNet achieved 98% accuracy, compared to 84% for Eigenface under ideal conditions and 9% under poor lighting. The paper points out FaceNet's invariance to visual interference, which makes it suitable for emotion-aware learning systems. The major preprocessing steps included data augmentation and MTCNN face detection. The study makes it clear that deep learning-based recognition outperforms classical face recognition in real-life learning environments.

A detailed survey [18] examines how Eigenfaces have been applied to face detection, their mathematical basis, their application to real-world problems and the limitations of that application. The paper covers classifications of approaches, benchmark datasets and light-occlusion datasets, as well as the difficulties of scaling datasets. It also reviews hybrid and augmented PCA-based methods, with improvements proposed via preprocessing and GPUs. The review concludes that Eigenfaces suffer from inaccuracy across a wide variety of conditions, which limits their practical efficiency. A future research direction is the integration of Eigenfaces with deep learning to improve their real-time applicability and generalization.

A systematic literature review [19] presents the current advancements and issues of facial recognition technology, with special attention to system concepts, performance measures, and the societal and security applications of this technology. The review notes that deep learning and CNN-based approaches have brought about a revolution in recognition accuracy and efficiency. Meanwhile, it identifies unresolved concerns such as privacy, ethical dilemmas and algorithmic bias that require responsible and regulated application. The paper ends by identifying future research directions to enhance system reliability, eliminate bias and build user trust through ethical conduct and privacy.
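The HE-plus-PCA pipeline of [16] can be approximated in a few lines. The sketch below uses OpenCV and scikit-learn and is illustrative only; the `train_imgs` stand-in, image size and component count are placeholders, not values from the paper.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def preprocess(img_gray: np.ndarray) -> np.ndarray:
    """Histogram equalization, then zero-mean/unit-variance normalization."""
    eq = cv2.equalizeHist(img_gray)          # spread out pixel intensities
    v = eq.astype(np.float64).ravel()
    return (v - v.mean()) / (v.std() + 1e-8)

# train_imgs: aligned grayscale face crops of identical size (random stand-ins here)
train_imgs = [(np.random.rand(64, 64) * 255).astype(np.uint8) for _ in range(20)]
X = np.stack([preprocess(im) for im in train_imgs])

pca = PCA(n_components=10)          # the principal axes are the "Eigenfaces"
train_codes = pca.fit_transform(X)  # low-dimensional descriptor per image

def identify(probe: np.ndarray) -> int:
    """Nearest-neighbour match in Eigenface space; returns a training index."""
    code = pca.transform(preprocess(probe)[None, :])
    return int(np.argmin(np.linalg.norm(train_codes - code, axis=1)))
```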
The paper [20] presents a face recognition algorithm based on quaternions, which can efficiently represent RGB images through quaternion matrices. To handle the high dimensionality, the approach projects the data onto a subspace that retains the important information; additionally, a new Jacobi method is developed to solve the quaternion Hermitian eigenproblem. Experiments on a standard face recognition dataset show that the method attains comparable accuracy with a small number of Eigenfaces, thus improving execution speed and scalability. The algorithm, implemented in Julia, has a low execution time and can handle larger image dimensions.

The authors of [21] examine the use of CNNs for face recognition under constraints such as limited image data and confounding factors like lighting and facial expressions. The CNN model is contrasted with standard methods such as Eigenfaces, Fisherfaces, LBPH and MLPs, showing higher accuracy and stability. The proposed CNN-based system maintained high recognition accuracy in a classroom environment where the data were noisy and uncontrolled. The work underlines the importance of fine-tuning the CNN architecture and the potential of CNNs under constrained conditions. According to the study, CNNs are the better choice for real-life, low-quality data.

2.1.2. Fisherfaces
The paper [22] proposes a new face recognition method for uncontrolled environments, referred to as the Enhanced Local Binary Pattern (EnLBP) method. EnLBP improves on traditional LBP by dividing input images into 3×3 sub-regions, computing the arithmetic mean within each sub-region and then applying LBP encoding, which reduces dimensionality without losing the essential texture information. Face matching is carried out using cosine similarity; on the LFW-a benchmark, the method achieves better recall (61.43%) than existing LBP-based systems (56.75%). Moreover, EnLBP has far lower computational complexity, enabling real-time implementation, and shows resilience to inconsistent lighting and a wide range of facial expressions, a significant advance for the field.

The authors of [23] provide a systematic review of facial recognition technologies, detailing how such systems work and surveying advances in algorithm development, performance measures and operational use cases. The review emphasizes how deep learning, and especially convolutional neural networks (CNNs), has substantially enhanced accuracy and efficiency. At the same time, it highlights persistent obstacles such as privacy concerns, ethical questions and algorithmic discrimination, which must be addressed for responsible adoption. The paper ends by recommending future studies that would further improve system reliability, minimize bias and ensure the ethical and regulated application of facial recognition technology.
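Cosine similarity, used by EnLBP for matching and by many of the embedding-based recognizers discussed later in this review, reduces to a few lines. In the sketch below, the threshold is a placeholder that would normally be tuned on a validation set, and the descriptors are random stand-ins for LBP histograms or deep embeddings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two feature vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def same_person(desc1: np.ndarray, desc2: np.ndarray,
                threshold: float = 0.6) -> bool:
    # The threshold is dataset-dependent; 0.6 is only a placeholder.
    return cosine_similarity(desc1, desc2) >= threshold

d1, d2 = np.random.rand(256), np.random.rand(256)  # stand-in descriptors
print(same_person(d1, d2))
```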
The paper [24] introduces a novel pooling method called Robust LBP Guiding Pooling (G-RLBP) to enhance the noise robustness of CNN-based face recognition systems. The method uses the Robust Local Binary Pattern (RLBP) to estimate noise-affected pixels and assigns weights during pooling to reduce their impact on feature maps. Integrated into the first pooling layer of standard CNNs like AlexNet and ZF-5Net, G-RLBP significantly improves recognition accuracy under noisy conditions. Experimental results on the ORL and AR datasets demonstrate that the proposed method outperforms traditional pooling, particularly when images are distorted by Gaussian or salt-and-pepper noise. The approach offers a practical enhancement to CNN architectures in real-world face recognition applications where image quality is often compromised.

2.2. Deep Neural Network Models
The advent of deep learning has redefined face recognition entirely, offering previously unseen levels of accuracy and stability in practice. At the center of this change are CNNs, Siamese networks and attention-based models, each bringing different benefits in feature learning and representation. Recent studies, such as SwinFace and ViT-Face, demonstrate the capability of Transformer-based architectures to capture long-range dependencies and spatiotemporal features, thereby enhancing robustness in dynamic surveillance scenarios. These strategies underscore the need for sophisticated deep learning approaches to address substantive challenges such as motion blur, occlusion and dynamic illumination under real-world conditions.

2.2.1. CNNs
Most current face recognition systems are based on CNNs. Using hierarchically organized chains of convolutional filters, CNNs automatically discover locality-sensitive features that characterize facial structure at ever higher levels of abstraction. Groundbreaking models such as DeepFace [16], FaceNet [25] and VGGFace [26] showed that networks trained on large-scale datasets can reach near-human results on benchmark tasks. The advantage of CNNs is that they generate highly discriminative high-dimensional embeddings that are largely invariant to changes in pose, lighting and occlusion.

Ref. [26] constructed a dataset of 2.6 million images spanning 2622 identities, combining web-scraped data with human-in-the-loop validation to balance scale and purity. This provided the largest publicly available dataset of its kind at the time, surpassing WDRef, CelebFaces and LFW. End-to-end CNN models were trained for both face identification and verification using this dataset, achieving state-of-the-art results on major benchmarks of the time, including LFW and YouTube Faces (YTF), with a relatively streamlined CNN architecture. Additionally, the authors explored the balance between dataset size and label accuracy; incorporating rough filtering of noisy identities enabled scaling up without severely degrading model efficiency.

In the paper [27], the authors introduce MTCNN, a three-stage cascade of CNNs (P-Net, R-Net and O-Net) that jointly performs face detection and facial landmark localization in real time. Each stage uses increasingly complex CNNs to refine candidate regions and landmark positions. MTCNN simultaneously learns to detect faces and align them, leveraging the correlation between the tasks to improve both accuracy and robustness. MTCNN achieves state-of-the-art results in both face detection and five-point facial landmark localization on benchmark datasets like FDDB and WIDER FACE.
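Several open-source implementations of MTCNN exist. Assuming the widely used `mtcnn` package on PyPI (a reimplementation, not the original authors' release), a detection call looks as follows; the file name is a placeholder.

```python
# pip install mtcnn opencv-python
import cv2
from mtcnn import MTCNN

detector = MTCNN()  # loads the P-Net / R-Net / O-Net cascade

# The package expects RGB input, while OpenCV loads BGR, hence the conversion.
img = cv2.cvtColor(cv2.imread("group_photo.jpg"), cv2.COLOR_BGR2RGB)

for face in detector.detect_faces(img):
    x, y, w, h = face["box"]        # bounding box refined by the cascade
    conf = face["confidence"]
    eyes = face["keypoints"]["left_eye"], face["keypoints"]["right_eye"]
    # The five landmarks (eyes, nose, mouth corners) are what downstream
    # recognizers use to align the face crop before computing an embedding.
    print(f"face at ({x},{y},{w},{h}), conf={conf:.3f}, eyes={eyes}")
```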
Ref. [28] introduces lightweight models combining CNN local feature extraction with the Transformer's global context awareness. It enhances AlexNet with generative adversarial training for age-variant recognition, while its ResNet variant offers high accuracy and fast convergence via inception-residual structures.

Ref. [29] applied CLAHE (Contrast Limited Adaptive Histogram Equalization) and adaptive gamma correction for illumination normalization. The authors used MTCNN for accurate face detection and alignment across various poses and lighting conditions to stabilize feature extraction. They also fine-tuned multiple pretrained CNN backbones—VGG16, VGG19, ResNet50, ResNet101 and MobileNetV2 [30]—on targeted datasets (e.g., CASIA3D, 105PinsFace) to capture diverse facial representations.

The paper [31] reviews the significant progress made in deep learning face recognition, including CNNs, transfer learning and face feature extraction techniques. In doing so, it identifies persistent issues: fairness and bias in recognition across demographics; privacy and security risks; and vulnerability to adversarial attacks, especially spoofing.

2.2.2. Siamese Networks
A different paradigm, the Siamese network, was proposed by Chopra et al. in 2005 [32], originally for signature verification, and was subsequently adapted to face recognition. These models are composed of two twin subnetworks that share weights and process pairs of images as input. Training makes the network learn a distance measure that is minimized between embeddings of matching faces and maximized between embeddings of non-matching faces. Among the most famous implementations is FaceNet, which employs a triplet loss to enforce this similarity constraint. Siamese-based architectures work especially well in few-shot learning settings, where new identities have little labeled data to harness.

The authors of [33] address the limitations of classical face recognition techniques, especially under varying conditions such as pose, lighting and occlusion. The proposed method leverages a Siamese network architecture composed of two identical convolutional branches that learn to measure the similarity between face image pairs. The approach eliminates the need for face alignment by using multi-view and multi-illumination face samples during training, ensuring robustness across diverse scenarios. The system comprises two stages: face detection (using Haar feature-based cascades) and face recognition (using a Siamese CNN). The network is trained layer-by-layer using supervised learning and stochastic gradient descent. It is tested on a combined dataset consisting of LFW and lab-collected images, achieving an accuracy of 98.21%, which is competitive with state-of-the-art models like FaceNet and DeepID. Comparative experiments show the effectiveness of the approach, especially its efficient architecture and reduced reliance on complex preprocessing. The study concludes that Siamese CNNs are highly suitable for similarity-based biometric recognition and encourages further research in deep learning-based face recognition.

The paper [34] proposes a complete low-resolution face recognition (LRFR) system that integrates face detection, super-resolution (SR) and face recognition components. The system is specifically designed for real-world scenarios where high-resolution images are unavailable, such as surveillance cameras or low-quality webcams. The core innovation lies in employing a Siamese network within the face recognition module to support one-shot learning and unbounded identity recognition.
The paper [35] introduces a novel approach to low-resolution face recognition (LRFR) using a multi-stream CNN embedded in a Siamese architecture. The system is designed to handle facial images captured in uncontrolled conditions, common in real-world surveillance scenarios, where images suffer from blur, occlusion, pose variation and poor lighting. The proposed model features eight parallel CNN streams, each employing depth-wise separable convolutions to reduce complexity while preserving representational power. A spatial dropout layer and joint identification–verification supervisory signals are employed to improve feature learning and prevent overfitting. The network uses a contrastive loss for metric learning to ensure that similar faces are mapped closely in the embedding space. Additionally, the paper proposes a learned thresholding mechanism rather than relying on a fixed similarity threshold, which enhances classification performance by adapting the decision boundary during training.

The paper [36] addresses the problem of face recognition under degraded conditions, such as occlusions (masks, sunglasses), illumination changes, facial expressions and pose variations, which commonly reduce the accuracy of traditional systems. The authors propose a few-shot learning approach using a Siamese network built upon a pretrained Inception-v3 model to enhance face recognition performance, particularly in uncontrolled environments with limited data. A novel Siamese network architecture based on Inception-v3 is designed for multi-class face recognition in degraded conditions. The network is trained using a contrastive loss, which measures the similarity between image pairs, enabling robust face embedding even with few samples.

In the paper [37], two streamlined architectures, (i) the Simple SqueezeFaceNet (SSN) and (ii) the Channel-Split Network (CSN), are proposed to achieve identification that is efficient in both memory and computation while delivering real-time performance on embedded and mobile systems. The authors contrast traditional methods, which mainly optimize the FLOP count, with the newer measure of memory access cost (MAC), which is a more reliable predictor of inference speed. By striking a balance between MAC and FLOPs, the CSN variants (CSN-fast, CSN-faster, CSN-fastest) reach an accuracy of 0.992 on LFW at 155–180 FPS, greatly exceeding MobileFaceNet in speed with only minor accuracy trade-offs. In sum, the paper shows that efficient parameterization and MAC optimization enable compact, high-speed face recognition suitable for real-time video applications.

2.2.3. Attention-Based
Recently, attention-based models have become an influential alternative or complement to ordinary CNNs. Motivated by the success of natural language processing models, these approaches rely on self-attention mechanisms or full Transformer architectures, in which computational resources are directed to the most informative parts of the face. In particular, Vision Transformers (ViTs) and hybrid models such as SwinFace [38] and TransFace have exhibited state-of-the-art results on the many benchmarks on which they have been tried. Attention improves the model's ability to capture long-range dependencies and spatial context, which is particularly helpful when face image quality is poor or variation is large.
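The self-attention operation at the core of these Transformer-based models is compact enough to sketch directly. The following generic PyTorch snippet (not specific to SwinFace or TransFace) computes scaled dot-product attention over a sequence of facial patch embeddings; the patch grid and embedding size are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V.
    q, k, v: (batch, n_tokens, d) tensors; each token here would be the
    embedding of one facial patch."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # pairwise patch affinities
    weights = F.softmax(scores, dim=-1)          # how much each patch attends to others
    return weights @ v                           # weighted mix of patch values

# 49 patches (a 7x7 grid over a face image), 64-dim embedding each
x = torch.randn(1, 49, 64)
out = scaled_dot_product_attention(x, x, x)      # self-attention: q = k = v = x
print(out.shape)  # torch.Size([1, 49, 64])
```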
All these deep learning-based methods have reshaped the technical landscape of face recognition. Although they require considerable computational resources and large training datasets, they have succeeded thanks to their generalization ability, flexibility and high accuracy, making them the dominant paradigm in both scholarly research and commercial applications.

The paper [39] develops a multi-scale attention algorithm for recognizing occluded faces that enhances recognition in adverse conditions. Attention modules at different scales are combined so that the model can locate the visible, informative parts of the face while minimizing the influence of occluded regions. Experiments on a number of standard face recognition datasets with different combinations of occlusions show that the method yields significant improvement over baseline algorithms in recognition accuracy. The article establishes that multi-scale attention enhances robustness and flexibility, so the model can be used for face recognition in practice, where occlusions are likely to occur.

The method proposed in the paper [40] performs occluded face recognition by combining an attention mechanism with damaged feature masking to enhance robustness to partial occlusion. The damaged feature masking strategy suppresses unreliable features in covered areas, whereas the attention module allows the network to focus on visible and informative facial regions. In experiments on benchmark datasets under various occlusion settings, the proposed method outperforms traditional methods with a substantial increase in recognition accuracy. The analysis shows the power of combining attention with feature masking in real-world occluded face recognition tasks.

The paper [41] addresses the persistent challenge of face recognition in unconstrained environments where facial occlusions are common, such as in crowds, under extreme head poses, or with partial visibility due to obstacles. The authors propose a novel deep learning approach for partial face recognition that leverages an attention-based architecture built upon a truncated ResNet-50 backbone. The proposed method demonstrates that attentional re-calibration and region-specific aggregation significantly enhance partial face recognition, making it feasible to match incomplete or occluded face images effectively. This solution is especially relevant for surveillance, forensic analysis and real-world applications where complete face visibility cannot be guaranteed.

The proposed model in [42] adopts ConvNeXt-T as the backbone and incorporates the Efficient Channel Attention (ECA) mechanism. This combination enhances the extraction of discriminative features from the visible (unmasked) regions of the face while maintaining computational efficiency and avoiding unnecessary dimensionality reduction. The model achieved 99.76% accuracy on real-world masked face datasets and maintained 99.48% accuracy under challenging conditions such as poor lighting and contrast.
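For reference, ECA itself is a very small module. The following PyTorch sketch implements the standard formulation of Wang et al.'s Efficient Channel Attention; the ConvNeXt-T backbone integration used in [42] is not reproduced here, and the feature-map sizes are placeholders.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: channel reweighting via a 1D convolution
    over globally pooled channel statistics, with no dimensionality reduction."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # kernel size adapts to the channel count and must be odd
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map
        y = self.pool(x)                              # (B, C, 1, 1) channel summary
        y = self.conv(y.squeeze(-1).transpose(1, 2))  # local cross-channel interaction
        w = torch.sigmoid(y.transpose(1, 2).unsqueeze(-1))
        return x * w                                  # reweight channels

feats = torch.randn(2, 96, 56, 56)  # e.g., an early ConvNeXt-T stage
print(ECA(96)(feats).shape)         # torch.Size([2, 96, 56, 56])
```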
2.3. Deep Learning Architectures
2.3.1. FaceNet
More recently, specialized deep learning architectures have been proposed that directly target the constraints of face recognition at scale and in unconstrained environments. Some of the most notable are FaceNet, ArcFace (additive angular margin) and CosFace, along with similar models that propose new loss functions and embedding approaches to maximize feature discrimination together with intra-class compactness. A key step in this direction was the introduction of a unified framework in which a single model [25, 43, 44] embeds facial images in a low-dimensional Euclidean space. The network is optimized with a triplet loss, which requires the distance between an anchor and a positive (same identity) to be smaller, by a set margin, than the distance between the anchor and a negative (different identity); a minimal sketch of this loss is given at the end of this subsection. The method requires no intermediate classification layer, and high-performance verification, clustering and identification of faces are all possible using a single model.

Ref. [45] pushed research towards face verification in real-world conditions, capturing natural variation in pose, lighting, occlusion and expression. Over 50 methods have been evaluated on LFW, spanning descriptor-based approaches (e.g., LBP, SIFT+Fisher vectors), metric and subspace learning, and CNN-based models. High-performing systems such as FaceNet achieved 99.6% accuracy, roughly matching or surpassing human-level performance, with only a handful of residual errors often tied to labeling issues [46].

The paper [47] presents a CNN-based system for automated attendance monitoring through real-time face recognition. The system utilizes two prominent deep learning models, FaceNet and VGG-16, aiming to identify multiple faces simultaneously in live settings such as classrooms or offices. The FaceNet model is used for face embedding and identification, while a Haar cascade is applied for initial face detection. A custom dataset containing images of 32 students was created and stored on Amazon RDS, ensuring cloud-based, secure storage and fast access. The system workflow involves capturing images through a camera, detecting faces, extracting features using FaceNet, comparing them against a registered database and logging attendance automatically.

The article [48] introduces Low-FaceNet, a deep learning framework designed to improve face recognition performance in low-light conditions, where traditional models often fail due to degraded visual quality. Unlike prior approaches, Low-FaceNet integrates low-light image enhancement (LLE) and face recognition into a single unified model, enabling mutual benefits between the enhancement and recognition tasks.

The paper [49] presents Fast-FaceNet, a lightweight face recognition system that integrates MobileNet into FaceNet to minimize computational cost while increasing processing speed. Depth-wise separable convolutions and the triplet loss function are used to learn compact, discriminative feature embeddings. Experiments on LFW demonstrate that Fast-FaceNet matches the accuracy of the original FaceNet with a speedup of more than 2.5×, making it suitable for real-time and mobile face recognition applications. Overall, it offers an effective trade-off between accuracy and speed.
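The triplet loss shared by FaceNet and Fast-FaceNet can be sketched as follows in PyTorch. This illustrative version omits FaceNet's online triplet mining; the margin of 0.2 corresponds to the α reported in [25], while the batch and embedding sizes are placeholders.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.2):
    """FaceNet-style triplet loss on L2-normalized embeddings: enforce
    ||a - p||^2 + margin <= ||a - n||^2 for every (anchor, pos, neg) triplet."""
    a = F.normalize(anchor, dim=1)    # embeddings live on the unit hypersphere
    p = F.normalize(positive, dim=1)
    n = F.normalize(negative, dim=1)
    d_ap = (a - p).pow(2).sum(dim=1)  # squared distance anchor-positive
    d_an = (a - n).pow(2).sum(dim=1)  # squared distance anchor-negative
    return F.relu(d_ap - d_an + margin).mean()  # hinge: only violations contribute

# Toy usage: a batch of 16 triplets with 128-dim embeddings (FaceNet's size).
emb = lambda: torch.randn(16, 128, requires_grad=True)
loss = triplet_loss(emb(), emb(), emb())
loss.backward()
```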
Building on the embedding-based learning of FaceNet, CosFace [50] and ArcFace [51] offered improvements on Softmax-based losses that further enhanced the discriminative capacity of the learned features.

2.3.2. CosFace
CosFace adds an additive cosine margin to the Softmax loss, which enlarges the angular margin between classes. Ref. [52] introduced a novel loss function that adds an explicit angular margin in normalized hyperspherical space, directly corresponding to the geodesic distance between feature vectors and class centers. Unlike SphereFace's multiplicative margin or CosFace's cosine margin, ArcFace's additive margin has a clear, exact geometric interpretation, which enhances convergence stability and computational efficiency. It outperformed previous methods across ten face recognition benchmarks, including massive image and video datasets, improving large-scale recognition accuracy. Unlike A-Softmax (SphereFace), which uses a multiplicative angular margin and causes unstable, non-monotonic decision boundaries, ref. [53] introduced CosFace's additive margin, which is geometrically intuitive and monotonic, improving optimization stability. This formed the foundation for later margin-based losses such as ArcFace and ElasticFace.

Ref. [51] states that, unlike margin-based losses such as ArcFace or CosFace, which focus solely on enhancing the target class, RPCL introduces a reverse mechanism that suppresses the rival logit for each sample. This increases inter-class separation and reduces misclassification, especially under low-resolution conditions. The approach outperforms existing margin-based and uncertainty-aware methods on multiple low-resolution face datasets, demonstrating significantly improved robustness under severe resolution degradation.

2.3.3. ArcFace
ArcFace goes further by adding an angular margin penalty that not only keeps classes well separated but also pulls features of the same class together into a compact cluster. These changes yield better results on common face recognition benchmarks, namely the LFW, MegaFace and IJB datasets. Other architectures of note are SphereFace [54], with its multiplicative angular margin, and MagFace [55], which adds magnitude-aware learning to condition the feature representation on the quality of face images. These innovations aim to tackle practical problems such as intra-class variability, class imbalance and quality-aware recognition. The success of these architectural designs owes much to their fine-tuning of the embedding space, as the geometry of the learned features bears a direct relationship to recognition performance. When coupled with large-scale datasets and a strong CNN backbone (e.g., ResNet, MobileFaceNet), these models scale well and achieve high accuracy.

The paper [56] proposes a modified Softmax classifier that incorporates a fixed additive margin in the cosine similarity space to better separate classes. AM-Softmax was contrasted with multiplicative margin losses (SphereFace) and additive angular margin losses (later seen in ArcFace), demonstrating stable convergence and competitive accuracy while maintaining algorithmic simplicity.
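The CosFace and ArcFace margins discussed in Sections 2.3.2 and 2.3.3 differ by a single line, which the unified PyTorch sketch below makes explicit. The scale s = 64 and margin m = 0.5 are commonly reported ArcFace settings, assumed here for illustration rather than taken from any experiment in this review.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginSoftmax(nn.Module):
    """Unified sketch of margin-based Softmax losses.
    CosFace:  s * (cos(theta_y) - m)      ArcFace:  s * cos(theta_y + m)
    where theta_y is the angle between an embedding and its class weight."""
    def __init__(self, dim: int, n_classes: int, s: float = 64.0,
                 m: float = 0.5, mode: str = "arcface"):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_classes, dim))  # one centre per class
        self.s, self.m, self.mode = s, m, mode

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # cosine between normalized embeddings and normalized class centres
        cos = F.linear(F.normalize(emb), F.normalize(self.W))
        cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)          # keep acos numerically safe
        idx = torch.arange(len(labels))
        target = cos[idx, labels]
        if self.mode == "cosface":
            target_m = target - self.m                         # additive cosine margin
        else:
            target_m = torch.cos(torch.acos(target) + self.m)  # additive angular margin
        logits = cos.clone()
        logits[idx, labels] = target_m                 # penalize only the true class
        return F.cross_entropy(self.s * logits, labels)

# Toy usage with random 512-dim embeddings over 1000 identities.
loss_fn = MarginSoftmax(dim=512, n_classes=1000)
print(loss_fn(torch.randn(32, 512), torch.randint(0, 1000, (32,))))
```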
2.3.4. OpenFace
The paper [57] proposed a new technique for identifying suitable hard-negative examples during training with the triplet loss. The method utilizes sample pairs that would otherwise go to waste, thereby improving model accuracy and performance. To counter the risk of premature convergence introduced by the added hard-negative samples, it uses the Adaptive Moment Estimation (Adam) optimization algorithm. The proposed method scored 0.955 accuracy and 0.989 AUC on the LFW verification benchmark, surpassing the 0.929 accuracy and 0.973 AUC of the original OpenFace model.

The UCEU dataset [58] consists of 7395 images of 130 subjects, of whom 44 are men and 86 are women. To confirm that there is still room to increase the accuracy of face recognition in Asian face verification, the authors also apply four other face verification models, including OpenFace, ArcFace and the VGG-Face model with gender, expression and age recognition, to UCEC-Face.

The paper [59] presents OpenFace, an open-source face recognition library that attempts to close the accuracy gap between publicly available and state-of-the-art privately developed face recognition systems. As cameras are integrated into the Internet of Things (IoT), face recognition can take place in situ and improve contextual understanding. OpenFace delivers state-of-the-art performance with near-human accuracy on the LFW benchmark and on a new classification benchmark oriented towards the mobile setting. Targeting non-experts, the paper also provides a simplified overview of the deep neural network techniques applied in the system.

In the article [60], the authors developed a face recognition system based on OpenFace combined with an intelligent training technique known as S-DDL (Self-Detection, Decision, and Learning). Unlike the fixed SVM model of conventional systems, S-DDL allows real-time model updates through an incremental SVM algorithm. This adaptive method increases recognition performance for particular user groups while keeping training times reasonable. Experimental findings show the system to be promising, with high accuracy and real-time performance where