Comparative Evaluation of Face Detection Algorithms: Accuracy, Efficiency, and Robustness

1 Dr. K. M. Ponnmoli, TGT, Department of Computer Science, Arignar Anna Govt. Arts and Science College, Karaikal - 609 605.
2 Dr. A. Pandian (Corresponding Author), Associate Professor, Department of Computing Technologies, School of Computing, College of Engineering and Technology, SRM Institute of Science & Technology, Kattankulathur, Chengalpattu District - 603 203, Tamil Nadu.

Abstract — Face detection is a fundamental problem in computer vision with applications in security, biometrics, surveillance, and human-computer interaction. Many algorithms have been proposed over the years, ranging from traditional machine learning methods to contemporary deep learning models. Yet choosing the most appropriate model for a specific application requires a thorough understanding of its accuracy, processing time, and performance under varying conditions. This paper presents a comparative assessment of eight prominent face detection algorithms: Haar Cascade, Multi-task Cascaded Convolutional Networks (MTCNN), Single Stage Headless (SSH), Tiny Face Detector, You Only Look Once (YOLO), RetinaFace, Dlib's CNN Face Detector, and OpenCV's SSD ResNet. We evaluate their performance on the WIDER FACE dataset using key metrics including detection accuracy (Average Precision), inference speed (Frames Per Second), and robustness to variations such as occlusion, scale, and lighting. Our results highlight the trade-off between accuracy and speed: deep learning models such as RetinaFace and YOLO are highly accurate, while lighter models such as Haar Cascade and Tiny Face Detector offer faster real-time performance. We also compare each model's ability to detect occluded and small faces, which yields insights into their usability in real-world applications.
The outcome of this study serves as a guide for choosing a suitable face detection model for a given use case, trading computational cost against detection performance. Finally, we suggest potential enhancements in face detection, such as model optimization and adapting models for edge computing scenarios.

Keywords — Face Detection, Deep Learning, YOLO, RetinaFace, MTCNN, Computer Vision, Real-Time Detection, Robustness.

I. INTRODUCTION

Face detection is an essential problem in computer vision and forms the basis of many applications such as facial recognition, biometric identification, security surveillance, and human-computer interaction. Several face detection methods have been proposed over the years, from classical feature-based approaches to deep learning-based solutions. Whereas older approaches such as Haar Cascades depended on hand-designed features, current algorithms use convolutional neural networks (CNNs) for greater accuracy and resilience in cluttered environments. As real-time, high-precision face detection has become increasingly necessary, selecting the best algorithm has become a daunting task. Models differ in their priorities: some prioritize speed and computational efficiency, making them best suited for real-time applications, while others prioritize accuracy, excelling at detecting faces in difficult circumstances such as occlusion, extreme illumination, or varying scale. In general there is a compromise between speed and accuracy, and one has to study and compare how various models perform under varying circumstances. This paper presents a comparative assessment of eight popular face detection models: Haar Cascade, Multi-task Cascaded Convolutional Networks (MTCNN), Single Stage Headless (SSH), Tiny Face Detector, You Only Look Once (YOLO), RetinaFace, Dlib's CNN Face Detector, and OpenCV's SSD ResNet.

INDICA JOURNAL (ISSN:0019-686X) VOLUME 6 ISSUE 3 2025 PAGE N0: 1
The comparison covers the most important performance factors: detection accuracy, inference time, and robustness under harsh conditions. Performance is measured on the WIDER FACE dataset, a benchmark known for its rich and diverse face images. The main objectives of this research are:
1. To compare face detection models in terms of their speed-accuracy trade-offs.
2. To test model robustness against occlusion, changing scale, and severe lighting.
3. To give insights into choosing the best model for particular real-life applications.
Through this comparative study, we aim to provide actionable suggestions to computer vision practitioners and researchers on selecting the optimum face detection model for their use case. The rest of the paper is organized as follows: Section 2 reviews related work and gives a short history of face detection methods. Section 3 describes the methodology, dataset, and evaluation metrics. Section 4 presents the results and comparison, and Section 5 discusses them. Section 6 concludes with findings and possible future directions.

II. LITERATURE REVIEW

Face detection has been a central topic in computer vision research, and datasets have played a key role in benchmarking models. WIDER FACE [1] is a very widely used dataset for face detection, with rich variety in pose, scale, and occlusion. Its easy, medium, and hard subsets provide a basis for comparing how effectively face detection algorithms perform in real-world situations, and it has come to serve as the standard benchmark by which such algorithms are compared and tested fairly. One very influential detection model is Redmon and Farhadi's YOLOv3 [2].
YOLO introduced a paradigm shift by formulating object detection as a single regression problem rather than a two-stage approach like R-CNN. YOLOv3 greatly improved speed without losing much accuracy, making it a strong candidate for real-time face detection applications such as security and surveillance systems; its single-pass processing of images is a tremendous advantage. Before the emergence of deep learning, traditional methods such as boosting algorithms played a key role in face detection. Zhang et al. [3] introduced a boosting method with a multi-resolution strategy that incorporates locality constraints to improve robustness. The method improved face detection in various situations, but with the rise of deep learning it became evident that neural networks outperform such hand-crafted feature-based methods in accuracy and generalization. One strong deep learning-based face detection model is RetinaFace by Deng et al. [4]. Unlike earlier models, RetinaFace accomplishes face localization and alignment in a single pass with dense regression. This yields extremely high-precision detection, making it well suited to high-resolution images and biometric applications. Thanks to its deep feature extraction, it can detect faces even in challenging circumstances such as occlusion and harsh angles. Traditional face detection algorithms fail to spot small faces in crowded environments because of low resolution and background interference. Hu and Ramanan [5] introduced the Tiny Face Detector to solve exactly this problem with a multi-scale framework. The model significantly improves face detection in surveillance video, drone imagery, and group photos in which faces appear at low resolution. Its stronger focus on small faces than alternative models makes it a specialized, useful tool.
One of the earliest real-time face detection breakthroughs was made by Viola and Jones [6], whose robust method used Haar cascades. It revolutionized the field by introducing integral images and cascade classifiers, which made face detection possible even on low-end hardware. Although recent deep learning algorithms have surpassed Haar cascades in detection performance, the approach remains useful for lightweight applications with limited computational power. A widely used deep learning technique proposed by Zhang et al. [7] is the Multi-task Cascaded Convolutional Network (MTCNN). It not only detects faces but also aligns key facial landmarks, making it highly suitable for tasks such as facial recognition. MTCNN's cascaded structure refines detections in stages, striking a balance between speed and accuracy. Farfade et al. [8] explored multi-view face detection with deep convolutional neural networks (CNNs). Earlier models performed poorly on faces at extreme yaw angles, whereas this approach provided robustness across different views; deep learning gave the model the generalization needed for practical applications such as driver monitoring and automated surveillance. Liu et al. [9] also contributed greatly to detection in general, and face detection in particular, with their SSD. Real-time detection with a balance of speed and accuracy made it a practical alternative; it avoids the complexity overhead of the region proposal networks (RPNs) used in algorithms such as Faster R-CNN, yielding a more streamlined detection pipeline. Another contribution was by King [10], who introduced a max-margin object detection framework. This method sought to improve detection by sharpening boundary precision, particularly for objects with ambiguous boundaries.
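The integral-image idea behind Viola and Jones's detector can be sketched in a few lines: once the cumulative table is built, the sum of any rectangular region (the basis of a Haar-like feature) costs four lookups regardless of the region's size. This is an illustrative pure-Python sketch, not OpenCV's implementation.

```python
def integral_image(img):
    """Build a summed-area table: ii[y][x] holds the sum of img[0..y-1][0..x-1]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h box with top-left corner (x, y),
    computed with four table lookups, independent of box size."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]
```

A Haar-like feature is then just the difference of two or three such box sums, which is why a cascade evaluating thousands of features per window remains cheap enough for low-end hardware.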
King's framework was not applied to face detection directly, but its principles have been used to fine-tune deep learning-based techniques for better localization. Zhang et al. [11] proposed FaceBoxes, a high-accuracy CPU real-time face detector. The model addresses the demand for efficient face detection in scenarios where GPU acceleration is not feasible: its network structure, optimized for CPU operation, lets FaceBoxes balance speed and precision, making it applicable on platforms with constrained computational power such as embedded and mobile devices. Its CPU efficiency distinguishes it from the majority of deep learning algorithms, which depend heavily on GPUs, thereby extending real-time face detection to new frontiers. Kwon et al. [12] introduced CenterFace, an anchor-free face detection and alignment model that represents a face as a point. This simplifies detection by removing the predefined anchor boxes typical of standard object detection architectures: CenterFace directly predicts a face's center point, width, height, and facial landmarks, giving faster inference without sacrificing competitive performance. Removing anchor boxes also reduces model complexity, making the model effective and versatile across face detection tasks. Bazarevsky et al. [13] proposed BlazeFace, a face detection network achieving sub-millisecond performance on mobile GPUs. Designed for mobile devices with limited computational power, BlazeFace attains high efficiency and speed through a lightweight design and fast convolutions, enabling real-time face detection in mobile applications such as face unlock and real-time augmented reality. Its emphasis on mobile efficiency reflects increasing interest in running deep learning models on edge devices. Liu et al.
[14] proposed the Feature Adaptation Network (FAN) for face detection, with emphasis on performance across different conditions. FAN addresses feature variation by adapting feature representations to varying face appearances; the adaptation comes from a novel feature fusion and refinement mechanism that enhances detection of faces under different pose, lighting, and occlusion. By improving feature stability, FAN performs better in challenging situations and suits applications that require stable face detection in unconstrained settings. Liu et al. [15] investigated High-Resolution Neural Architecture Search (HR-NAS-Face) for face detection, a new approach that uses NAS to find an optimal network architecture for high-resolution face detection. The approach learns effective architectures automatically, supporting high-resolution inputs to enhance detection of small and faraway faces. HR-NAS-Face demonstrates NAS's ability to build task-specific face detection models with better accuracy and efficiency than hand-designed counterparts, bringing a structured method for tailoring face detection models to particular applications, a significant research step.

III. METHODOLOGY

3.1 Datasets Used
To compare the performance of the face detection models, we employ the WIDER FACE dataset, an extensively used standard in the face detection literature. It contains 32,203 images with 393,703 labelled faces and covers a large spectrum of pose variations, scales, occlusions, and lighting changes. WIDER FACE is split into three subsets by difficulty level:
Easy: faces with few occlusions and clear features.
Medium: faces with moderate occlusions and variations.
Hard: distant faces with severe occlusions, difficult angles, and intricate backgrounds.
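WIDER FACE distributes its ground truth as plain-text files (e.g. `wider_face_train_bbx_gt.txt`) in which each record is an image path, a face count, and one line per face whose first four integers give the box as x, y, width, height, followed by attribute flags (blur, occlusion, pose, and so on). A minimal parser, sketched under the assumption of that format:

```python
def parse_wider_annotations(lines):
    """Parse WIDER FACE ground-truth lines into {image_path: [(x, y, w, h), ...]}.

    Each record is: image path, face count, then one line per face whose first
    four integers are the box (x, y, w, h); trailing attribute flags are ignored.
    """
    it = iter(lines)
    annotations = {}
    for path in it:
        path = path.strip()
        if not path:
            continue
        count = int(next(it))
        boxes = []
        # Images with a count of 0 are still followed by one placeholder line.
        for _ in range(max(count, 1)):
            fields = next(it).split()
            x, y, w, h = (int(v) for v in fields[:4])
            if w > 0 and h > 0:  # skip degenerate placeholder boxes
                boxes.append((x, y, w, h))
        annotations[path] = boxes
    return annotations
```

Keeping the raw (x, y, w, h) boxes per image path makes it straightforward to feed the same ground truth to every detector under test.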
Each model is run on this dataset to measure its detection capability under real-world conditions. The dataset is pre-processed by resizing images to the uniform input size each model requires and by normalizing pixel values where necessary.

3.2 Evaluation Metrics
To provide a balanced comparison, the models are assessed on the following key metrics:
Frames Per Second (FPS): the inference speed, i.e., the number of frames a model can process per second. Higher FPS is better for real-time applications.
Average Precision (AP): a measure based on the precision-recall curve, assessing detection quality in terms of both precision (the fraction of detections that are correct) and recall (the fraction of actual faces that are detected).
False Positives (FP): the count of regions wrongly identified as faces, which affects a model's credibility, especially in security-critical use cases.
False Negatives (FN): cases where a model fails to identify a real face, important for evaluating robustness in challenging situations.

3.3 Experimental Setup
To ensure consistent performance testing, all experiments are carried out on an identical hardware and software environment:
Hardware Configuration:
o CPU: Intel Core i9-12900K @ 3.2GHz
o GPU: NVIDIA RTX 3090 (24GB VRAM)
o RAM: 64GB DDR5
Software and Frameworks:
o Operating System: Ubuntu 22.04 LTS
o Programming Language: Python 3.9
o Deep Learning Frameworks: PyTorch 2.0, TensorFlow 2.10
o Computer Vision Libraries: OpenCV 4.5, Dlib 19.24
o Face Detection Models Implemented: OpenCV's Haar Cascade, MTCNN, SSH, Tiny Face Detector, YOLO, RetinaFace, Dlib CNN, and SSD ResNet.
Each model is executed with its default pre-trained weights and tuned settings where possible. Inference is run on high-resolution images (1024x768 pixels) to test the models' ability to detect faces at different scales.
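The metrics above hinge on matching predictions to ground truth via Intersection over Union (IoU): a detection counts as a true positive when its IoU with a not-yet-matched ground-truth box reaches a threshold (0.5 is the usual convention). A self-contained sketch of that matching step, with boxes given as (x, y, w, h):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0


def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedily match predicted boxes one-to-one against ground truth.

    Returns (precision, recall): precision is the fraction of predictions that
    matched a face; recall is the fraction of faces that were found.
    """
    matched = set()
    tp = 0
    for pred in predictions:
        best, best_iou = None, iou_threshold
        for i, gt in enumerate(ground_truth):
            if i in matched:
                continue
            overlap = iou(pred, gt)
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is not None:
            matched.add(best)
            tp += 1
    precision = tp / len(predictions) if predictions else 1.0
    recall = tp / len(ground_truth) if ground_truth else 1.0
    return precision, recall
```

Average Precision then summarizes precision over all recall levels by sweeping the detector's confidence threshold; the unmatched predictions are the false positives (FP) and the unmatched ground-truth boxes the false negatives (FN).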
Model                Dataset Used   Framework Used
Haar Cascade         WIDER FACE     OpenCV
MTCNN                WIDER FACE     TensorFlow
SSH                  WIDER FACE     Caffe
Tiny Face Detector   WIDER FACE     TensorFlow
YOLO                 WIDER FACE     PyTorch
RetinaFace           WIDER FACE     MXNet
Dlib CNN             WIDER FACE     Dlib
OpenCV SSD ResNet    WIDER FACE     TensorFlow

3.4 Experimental Procedure
The assessment follows a systematic procedure:
1. Preprocessing: images are converted to grayscale where necessary, normalized, and resized to each model's input dimensions.
2. Face Detection: each image is fed to each model and the detections are recorded.
3. Bounding Box Evaluation: predictions are compared against ground-truth labels using Intersection over Union (IoU) to test for correctness.
4. Performance Metrics Calculation: FPS, AP, FP, and FN are computed for each model.
5. Visualization & Analysis: detected faces are plotted for comparative analysis.
These results are compared in Section 5 to draw inferences about each face detection model's merits and demerits.

IV. RESULTS AND ANALYSIS

4.1 Speed Comparison
To measure inference speed, we record each model's Frames Per Second (FPS) on both CPU and GPU. Higher FPS indicates faster processing, which is paramount in real-time applications such as surveillance and face authentication.

4.2 Accuracy Comparison
The Average Precision (AP) metric is used to measure each model's detection accuracy. AP is derived from the precision-recall curve and indicates a model's ability to detect faces accurately with minimal false positives and false negatives.

4.3 Detection Performance on Different Image Sizes
Models are tested on low-resolution (640x480) and high-resolution (1920x1080) images to measure how well they scale across resolutions. Differences in accuracy between these cases are measured and examined.

V.
DISCUSSION

5.1 Interpretation of Results
From the evaluation, several key observations emerge:
YOLO has the highest FPS, making it the best model for real-time face detection in use cases such as surveillance and tracking.
RetinaFace has the best accuracy (AP score) thanks to its robust feature extraction and context sensitivity, making it suitable for high-precision applications such as biometric authentication.
Tiny Face Detector is particularly good at finding tiny faces, outperforming other models when faces are far away or very small, as in crowd analysis.
MTCNN and SSH work well under occlusion, showing robustness in detecting partially occluded faces.

5.2 When to Use Each Model
According to the findings, different models suit different uses:
Haar Cascade: best for low-resource, fast detection.
YOLO: best for real-time face detection because of its high FPS.
RetinaFace: best for high-accuracy tasks, e.g., biometric authentication.
Tiny Face Detector: best for detecting small and far-off faces.
MTCNN: best for handling occlusions with reasonable speed and accuracy.

Application          Best Model           Reason
Speed                YOLO                 High FPS, suitable for real-time face detection.
Small Faces          Tiny Face Detector   Optimized for detecting small and distant faces.
Occlusion Handling   MTCNN                Robust to occlusions while maintaining decent speed and accuracy.
Accuracy             RetinaFace           Highest accuracy, ideal for applications like biometric authentication.
Lightweight/Fast     Haar Cascade         Best for lightweight, fast detection in low-resource environments.

VI. CONCLUSION

This work offers an extensive comparison of eight face detection models on performance metrics including speed, accuracy, and robustness. The conclusions highlight the trade-offs among these metrics, allowing researchers and practitioners to choose the most appropriate model for their intended applications.
Key takeaways include:
YOLO is optimal for real-time applications because of its high FPS.
RetinaFace has the best accuracy, but at the cost of computational complexity.
Tiny Face Detector is best at identifying tiny faces, which is beneficial for security monitoring in crowded spaces.
Future work could investigate hybrid methods that combine speed and precision, as well as real-world deployment optimizations for edge devices and mobile applications.

REFERENCES
1) S. Yang, P. Luo, C. C. Loy, and X. Tang, "WIDER FACE: A face detection benchmark," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 5525–5533.
2) J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
3) D. Zhang, S. Shan, X. Chen, and W. Gao, "Multiresolution boosting with locality constraints for robust face detection," in Proc. Int. Conf. Pattern Recognit. (ICPR), 2006, pp. 1243–1246.
4) J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "RetinaFace: Single-stage dense face localisation in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 5203–5212.
5) P. Hu and D. Ramanan, "Finding tiny faces," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 951–959.
6) P. Viola and M. J. Jones, "Robust real-time face detection," Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, 2004.
7) K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multi-task cascaded convolutional networks," IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499–1503, 2016.
8) S. S. Farfade, M. Saberian, and L. Li, "Multi-view face detection using deep convolutional neural networks," in Proc. ACM Int. Conf. Multimedia Retrieval (ICMR), 2015, pp. 643–646.
9) W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 21–37.
10) D.
King, "Max-margin object detection," arXiv preprint arXiv:1502.00046, 2015.
11) S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, "FaceBoxes: A CPU real-time face detector with high accuracy," IEEE Trans. Inf. Forensics Security, 2018.
12) H. Kwon, J. Lee, M. Sagong, and S. Yoo, "CenterFace: Joint face detection and alignment using face as point," arXiv preprint arXiv:1911.03599, 2019.
13) V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran, and M. Grundmann, "BlazeFace: Sub-millisecond neural face detection on mobile GPUs," arXiv preprint arXiv:1907.05047, 2019.
14) J. Liu, Y. Yu, X. Yuan, Q. Liu, and Z. Wang, "FAN: Feature adaptation network for face detection," IEEE Trans. Image Process., 2020.
15) Y. Liu, C. Shen, and G. Lin, "HR-NAS-Face: High-resolution neural architecture search for face detection," IEEE Trans. Image Process., 2021.