In webcam eye-tracking, accurate calibration is essential for reliable data in research and usability studies. The number of calibration points can influence gaze estimation precision, particularly across different screen areas. To explore this, we analyzed five calibration schemes in RealEye (5-, 13-, 21-, 39-, and 78-point) and evaluated their performance across screen regions and participant data quality. The goal is to find the optimal balance between calibration complexity and accuracy, offering valuable insights for researchers and practitioners.
During the validation task, 13 white targets appeared in a random order against a mid-grey background, starting from the center. Participants’ task was to click the center of each target while fixating on it, within a 10-second period.
Accuracy and Precision are key metrics used to evaluate how well the eye-tracking system estimates participants' gaze points:
Pixels were chosen as the unit of accuracy measurement, instead of the more commonly used visual angle, because participants' distance from the screen and their screen sizes could not be controlled, and both are needed to calculate visual angle. Using pixels therefore allowed us to maintain standard metrics that are independent of these variables, ensuring more consistent and reliable data analysis.
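To illustrate why visual angle is hard to report without controlling the setup, the sketch below converts the same pixel offset to degrees for two setups. The function name, the screen dimensions, and the viewing distances are all hypothetical values chosen for illustration; they are not taken from the study.

```python
import math

def pixels_to_visual_angle_deg(offset_px, screen_width_px, screen_width_cm, viewing_distance_cm):
    """Convert an on-screen offset in pixels to degrees of visual angle."""
    # Physical size of the offset on this particular screen.
    offset_cm = offset_px * (screen_width_cm / screen_width_px)
    # Angle subtended at the eye by an offset centred on the line of sight.
    return math.degrees(2 * math.atan(offset_cm / (2 * viewing_distance_cm)))

# The same 100 px error on two hypothetical setups (illustrative values only).
laptop = pixels_to_visual_angle_deg(100, 1920, 31.0, 50.0)
desktop = pixels_to_visual_angle_deg(100, 1920, 60.0, 70.0)
print(f"laptop: {laptop:.2f} deg, desktop: {desktop:.2f} deg")
```

The two results differ even though the pixel error is identical, so a visual-angle figure is only meaningful when screen size and viewing distance are known for each participant.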
The data was processed following the steps outlined in the diagram below.
During a viewing task, slight drifts from the intended target can occur in fixations due to factors such as eye fatigue or minor shifts in attention. By selecting only the closest fixation, the impact of this drift is minimized, allowing the focus to remain on the fixation that most accurately corresponds to the target location. The analyses were then conducted using the "closest_fixation" methodology to highlight RealEye's optimal performance.
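The closest-fixation selection described above can be sketched as follows. This is a minimal illustration, assuming each fixation is reduced to a single (x, y) centre in pixels and that distance is Euclidean; the function name and the sample coordinates are hypothetical and do not reflect RealEye's internal implementation.

```python
import math

def closest_fixation(fixations, target):
    """Return (fixation, distance_px) for the fixation nearest to the target."""
    best = min(fixations, key=lambda f: math.dist(f, target))
    return best, math.dist(best, target)

# Hypothetical fixation centres (x, y) in pixels recorded while one target was shown.
target = (960, 540)
fixations = [(1010, 600), (975, 548), (900, 480)]

fixation, error_px = closest_fixation(fixations, target)
print(fixation, round(error_px, 1))
```

Because only the nearest fixation is kept for each target, the resulting error is a best-case figure for that target rather than an average over all fixations.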
Outliers were kept in the data analysis, which may have inflated the reported error values. Consequently, the real accuracy is likely higher than the results described below suggest.
The table below summarizes the accuracy results for each calibration scheme.
As the number of calibration points increases, accuracy across the screen areas improves, as shown in the table above. The 5-point and 13-point calibrations show significant inaccuracies, particularly in peripheral areas. In contrast, the 21-, 39-, and 78-point calibrations achieve much better accuracy across both central and peripheral regions. This trend indicates that increasing the number of calibration points leads to better spatial accuracy.
The Kolmogorov-Smirnov test for normality was performed on all datasets and indicated that none followed a normal distribution. Therefore, the Kruskal-Wallis H-test was used to compare all groups, as it is non-parametric and can handle groups of different sizes without sensitivity to imbalances. Following this, Dunn's test (with Bonferroni adjusted p-values) was conducted for post-hoc pairwise comparisons.
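The group comparison and post-hoc step can be sketched in plain Python as below. This is an illustrative from-scratch version, not the code used in the analysis: it computes the Kruskal-Wallis H statistic (with tie correction) and Dunn's pairwise tests with Bonferroni-adjusted p-values, and it leaves out the normality check and the p-value for H itself (H is compared against a chi-squared distribution with k - 1 degrees of freedom). The function names and the sample errors are made up for the example. In practice, library routines such as `scipy.stats.kruskal` and `scikit_posthocs.posthoc_dunn` cover the same steps.

```python
import math
from itertools import combinations

def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes

def kruskal_dunn(groups):
    """Kruskal-Wallis H plus Dunn's pairwise z-tests with Bonferroni correction."""
    names = list(groups)
    pooled = [v for name in names for v in groups[name]]
    n_total = len(pooled)
    ranks, tie_sizes = average_ranks(pooled)

    # Mean rank per group, slicing the pooled ranks back into their groups.
    mean_rank, size, start = {}, {}, 0
    for name in names:
        size[name] = len(groups[name])
        mean_rank[name] = sum(ranks[start:start + size[name]]) / size[name]
        start += size[name]

    tie_sum = sum(t**3 - t for t in tie_sizes)
    h = 12 / (n_total * (n_total + 1)) * sum(
        size[g] * mean_rank[g] ** 2 for g in names
    ) - 3 * (n_total + 1)
    h /= 1 - tie_sum / (n_total**3 - n_total)  # tie correction

    # Dunn's test: z-score of the mean-rank difference for every pair.
    pairs = list(combinations(names, 2))
    variance_base = n_total * (n_total + 1) / 12 - tie_sum / (12 * (n_total - 1))
    adjusted_p = {}
    for a, b in pairs:
        se = math.sqrt(variance_base * (1 / size[a] + 1 / size[b]))
        z = (mean_rank[a] - mean_rank[b]) / se
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        adjusted_p[(a, b)] = min(1.0, p * len(pairs))  # Bonferroni adjustment
    return h, adjusted_p

# Made-up per-participant errors in pixels, NOT the study's data.
errors = {
    "5_CalibPoints": [210, 190, 250, 230, 205],
    "21_CalibPoints": [150, 140, 165, 155],
    "39_CalibPoints": [120, 135, 110, 125, 130, 115],
}
h, adjusted_p = kruskal_dunn(errors)
print(f"H = {h:.3f}")
for pair, p in adjusted_p.items():
    print(pair, f"adjusted p = {p:.4f}")
```

Both tests work on ranks rather than raw values, which is why they tolerate non-normal data and groups of unequal size.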
39_CalibPoints is significantly different from 5_CalibPoints but not from 13_CalibPoints or 21_CalibPoints, while still offering the best accuracy among these. This implies that 39_CalibPoints represents an optimal middle ground, offering strong performance without the potential overfitting or instability associated with very low or very high numbers of calibration points.
The analysis shows that increasing the number of calibration points leads to better accuracy in eye-tracking data. However, the choice of calibration method should be based on the specific requirements of the task at hand. The 39-point calibration offers a balanced approach, suitable for a wide range of applications, while the 21-point calibration might be a more practical option for less demanding tasks. The 78-point calibration provides the highest accuracy and consistency, making it the best choice for tasks requiring precise and reliable data across the entire screen.
The 39-point calibration therefore stands out as the best overall choice, offering an optimal balance between accuracy, complexity, and time efficiency. In a variety of eye-tracking tasks, this method consistently delivers strong performance, particularly in user experience research and behavioral studies, where both precision and practicality are crucial.