Improving RealEye Accuracy on Mobile: A Study on Calibration Point Count

Martyna Pietrzak

September 29, 2024

Eye tracking on mobile devices is becoming increasingly important as they play a central role in our daily lives, from communication to entertainment and productivity. With the rise of mobile applications and user interface design, understanding user behavior through eye tracking offers valuable insights for enhancing user experience on mobile devices. Accurate eye-tracking technology on mobile platforms can reveal how users interact with their devices, which is essential for developing intuitive and effective applications.

How to accurately calibrate participants on Mobile Devices?

To answer this question, we at RealEye conducted a study testing four calibration schemes with 13, 22, 27, and 39 points. The system's performance was evaluated across several screen regions and across participant data-quality groups. The goal was to find the optimal balance between calibration complexity and accuracy, offering valuable insights for researchers and practitioners conducting eye tracking on Mobile Devices.

Accuracy by Calibration Schemas

Accuracy and Precision are key metrics used to evaluate how well the eye-tracking system estimates participants' gaze points:

Accuracy [px] – is calculated as the Euclidean distance between the center of the target and the selected fixation.
Precision [px] – is calculated as the standard deviation of the Euclidean distances from the target center to the selected fixations.
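
As an illustration of the two definitions above, the following Python sketch computes both metrics from target/fixation pairs. The function names and sample coordinates are hypothetical. Averaging the per-target distances into one accuracy value and using the sample standard deviation are assumptions, not details confirmed by the study.

```python
import math
import statistics

def accuracy_and_precision(pairs):
    """pairs: list of (target_xy, fixation_xy) tuples in screen pixels.

    Returns (accuracy, precision), where accuracy is the mean Euclidean
    distance and precision is the standard deviation of those distances.
    Assumption: mean over targets and sample standard deviation.
    """
    distances = [math.dist(target, fixation) for target, fixation in pairs]
    accuracy = statistics.mean(distances)
    precision = statistics.stdev(distances) if len(distances) > 1 else 0.0
    return accuracy, precision

# Hypothetical coordinates for illustration only, not data from the study.
pairs = [
    ((100, 100), (103, 104)),
    ((200, 300), (206, 308)),
    ((50, 400), (50, 415)),
]
acc, prec = accuracy_and_precision(pairs)
print(f"accuracy = {acc:.2f} px, precision = {prec:.2f} px")
```

Lower values are better for both metrics: a smaller accuracy value means fixations land closer to the target, and a smaller precision value means the error is more stable.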

For Mobile Devices, where the accuracy-measuring task relied on fixations only and not on clicking the target, RealEye calculated the distance between the center of each reference target and the center of the longest fixation on that target. This approach was chosen because longer fixations are typically more stable and indicative of focused attention, which compensates for the lack of a precise reference point such as a click.
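
The selection step described above can be sketched as follows. How a fixation is attributed to a target is not specified in this article, so the `radius_px` threshold, the `Fixation` type, and the sample values are hypothetical assumptions made only for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Fixation:
    x: float            # fixation center, in screen pixels
    y: float
    duration_ms: float  # how long the fixation lasted

def longest_fixation_error(target, fixations, radius_px):
    """Distance from the target center to the longest fixation on the target.

    Assumption: a fixation counts as "on the target" when its center lies
    within radius_px of the target center. Returns None if none qualify.
    """
    on_target = [f for f in fixations
                 if math.dist(target, (f.x, f.y)) <= radius_px]
    if not on_target:
        return None
    longest = max(on_target, key=lambda f: f.duration_ms)
    return math.dist(target, (longest.x, longest.y))

# Hypothetical fixations for a target centered at (100, 100).
fixations = [
    Fixation(103, 104, 180),   # close to the target, but short
    Fixation(130, 140, 650),   # longest fixation on the target
    Fixation(400, 400, 900),   # longer still, but far from the target
]
error_px = longest_fixation_error((100, 100), fixations, radius_px=100)
```

Note that the longest fixation is not necessarily the closest one; the choice trades a small amount of distance for stability.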

Pixels were chosen as the accuracy measurement unit, instead of the more commonly used visual angle, because we had no control over participants' distance from the screen or their screen sizes, both of which are required to calculate visual angle. Using pixels therefore allowed us to maintain standard metrics that do not depend on these variables, ensuring more consistent and reliable data analysis.
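
To show why visual angle needs those two unknowns, here is a minimal sketch of the standard conversion from an on-screen size to degrees of visual angle. The function name and all input values are hypothetical; the point is only that the same pixel error maps to different angles when viewing distance or pixel density changes.

```python
import math

def px_to_visual_angle_deg(error_px, px_per_cm, viewing_distance_cm):
    """Convert an on-screen error in pixels to degrees of visual angle.

    Uses angle = 2 * atan(size / (2 * distance)), which requires the
    physical pixel density and the viewing distance to be known.
    """
    size_cm = error_px / px_per_cm
    return math.degrees(2 * math.atan(size_cm / (2 * viewing_distance_cm)))

# The same hypothetical 60 px error viewed from two different distances.
near = px_to_visual_angle_deg(60, px_per_cm=40, viewing_distance_cm=25)
far = px_to_visual_angle_deg(60, px_per_cm=40, viewing_distance_cm=40)
print(f"at 25 cm: {near:.2f} deg, at 40 cm: {far:.2f} deg")
```

Without a reliable estimate of both parameters for each participant, any angle reported would rest on a guess, which is the motivation for reporting in pixels.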

To learn more about RealEye's accuracy methodologies and findings on Computers, we invite you to read our Technology White Paper.

The table below summarizes the accuracy results for each calibration schema for Mobile Devices.

13-point Calibration:

Accuracy by Screen Area: the accuracy varies across screen areas, with the central-left regions achieving the highest accuracy (approximately 45-56 px), while the right-hand areas perform worse, particularly in the corners, where the error reaches 94 px.
Accuracy by Quality Groups: across all quality grades, the accuracy remains fairly consistent, ranging from 60.88 px for the highest-quality participants to 71.40 px for lower quality grades, indicating little improvement with increased participant quality. However, the small sample size (n=7) for "Perfect" data quality limits reliability.
Accuracy for at least Good data = 67.24 px (n=30)
Implications: This schema is moderately effective, offering reasonable accuracy in central-left areas but struggling in the corners. It may be suitable for tasks that do not require high precision.

22-point Calibration:

Accuracy by Screen Area: accuracy improves, with an average value of 55 px across the central 5 points. The lowest accuracy, 91 px, occurs in the top-right corner, which was also the weakest area in the 13-point calibration.
Accuracy by Quality Groups: the accuracy improves across all participant groups, ranging from 54.86 px (best) to 62.63 px (worst). The small differences between quality groups again suggest high calibration efficiency. As before, the small sample size (n=12) for "Perfect" data quality limits reliability.
Accuracy for at least Good data = 62.84 px (n=36)
Implications: This calibration strikes a balance between the number of calibration points and accuracy achieved. It is a good option for tasks that require moderate accuracy across the screen, with slight compromises in peripheral regions.

27-point Calibration:

Accuracy by Screen Area: further improvements in accuracy are observed, with central areas reaching 37 px. Peripheral regions continue to show lower accuracy, although still better than with the 22-point calibration. The right-hand corners had the lowest accuracy of all areas, at 87 px.
Accuracy by Quality Groups: the accuracy remains consistent across participant groups, ranging from 56.14 px to 61.82 px, with improvement seen for higher-quality participants. Important note: the sample size for data quality >= 6 is small (18 participants).
Accuracy for at least Good data = 61.82 px (n=43)
Implications: This schema offers a better balance of accuracy across the screen compared to the 13- and 22-point calibrations. While peripheral accuracy is still a challenge, it is suitable for tasks where higher precision is needed, especially in the central regions.

39-point Calibration [default]:

Accuracy by Screen Area: further improvements, with accuracy values ranging from 45 px to a maximum of 75 px. Central-left regions achieve the highest accuracy, with an average of 51 px.
Accuracy by Quality Groups: the accuracy is consistent across participant groups, with values ranging from 53.13 px to 56.60 px, indicating that this method is robust.
Accuracy for at least Good data = 55.17 px (n=39)
Implications: This schema provides the highest accuracy across the screen, with consistent performance across different participant groups. However, its diminishing returns suggest that increasing the number of calibration points beyond 27 may not yield significant improvements.

Comparative Analysis

Accuracy by Position on the Screen

As the number of calibration points increases, the accuracy across screen areas improves. With the 13-point calibration, inaccuracies are prevalent, particularly in the peripheral areas, where the lower-right region shows lower accuracy. Moving to the 22-point, 27-point, and 39-point calibrations, accuracy generally improves across both central and peripheral regions.

Accuracy by Data Quality Groups

As the number of calibration points increases, accuracy improves across all participant data-quality groups.

Post-hoc Dunn's Test on at least Good data quality

The Kolmogorov-Smirnov test for normality indicated that none of the datasets followed a normal distribution. This necessitated the Kruskal-Wallis H-test, a non-parametric method that compares distributions without assuming normality and can be applied to groups of unequal size. Following this, Dunn's test (with Bonferroni-adjusted p-values) was used for post-hoc pairwise comparisons.
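
For readers unfamiliar with the Kruskal-Wallis H-test, the sketch below computes its H statistic in plain Python so that the rank-based logic is visible. It is an illustration under stated assumptions, not the analysis code used in the study: the helper names and sample values are hypothetical, and in practice a statistics library (for example `scipy.stats.kruskal`) would also return the p-value, which is obtained from a chi-square distribution with one degree of freedom fewer than the number of groups.

```python
from collections import Counter

def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction for a list of samples."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction

# Hypothetical per-participant errors (px) for three schemas, unequal sizes.
h = kruskal_wallis_h([[61, 72, 80, 77], [55, 58, 66], [50, 52, 54]])
```

Because the test works on ranks of the pooled data rather than on the raw values, it does not require normally distributed errors and accepts groups of different sizes.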

The post-hoc Dunn's test for at least Good data quality revealed statistically significant differences in accuracy only between the 13-point and 39-point calibration schemas. No statistically significant differences were found between the other calibration schemas.

Dunn's post-hoc test with a Bonferroni correction adjusts the p-values for the number of pairwise comparisons being made, which lowers the likelihood of identifying significant differences. Consequently, pairs of calibration groups that differ when tested in isolation may not reach significance once the correction is applied across several groups, as observed in the post-hoc analysis that followed the initial Kruskal-Wallis test. To further investigate these differences, additional Mann-Whitney U tests were performed for pairwise comparisons.
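
The effect of the correction can be illustrated with a short sketch. With four schemas there are six pairwise comparisons, so a Bonferroni adjustment multiplies each raw p-value by six (capped at 1). The p-values below are invented purely to show the mechanism and are not results from this study; the function name is hypothetical as well.

```python
from itertools import combinations

def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}

schemas = [13, 22, 27, 39]
pairs = list(combinations(schemas, 2))  # 6 pairwise comparisons

# Hypothetical raw p-values, for illustration only.
raw = dict(zip(pairs, [0.030, 0.020, 0.004, 0.60, 0.25, 0.40]))
adjusted = bonferroni_adjust(raw)

alpha = 0.05
significant_raw = [pair for pair, p in raw.items() if p < alpha]
significant_adjusted = [pair for pair, p in adjusted.items() if p < alpha]
print("before correction:", significant_raw)
print("after correction: ", significant_adjusted)
```

In this made-up example three pairs fall below 0.05 before the correction and only one remains afterwards, which is the kind of shrinkage the paragraph above describes.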

The results suggest that the 13-point calibration (calibPoints_13) differs significantly from all other calibration schemas in terms of Euclidean distances. However, there are no significant differences between the other pairs (calibPoints_22, calibPoints_27, and calibPoints_39). This indicates that the 13-point calibration is distinct from the others in accuracy, measured as the distance between the target and the selected fixation.
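
A single pairwise Mann-Whitney U comparison of the kind reported above can be sketched as follows. This is a simplified version that uses the normal approximation without tie or continuity correction, so it is an assumption-laden illustration and not the exact procedure used in the study; the sample values are hypothetical. A library routine such as `scipy.stats.mannwhitneyu` would be the usual choice in practice.

```python
import math
from statistics import NormalDist

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U for sample x, approximate p-value). No tie or continuity
    correction is applied, so the p-value is rough for small samples.
    """
    # U counts the pairs where x exceeds y; ties contribute one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p

# Hypothetical per-participant errors (px) for two calibration schemas.
errors_13 = [70, 64, 81, 75, 68, 90, 73, 66]
errors_39 = [52, 58, 49, 61, 55, 47, 60, 53]
u, p = mann_whitney_u(errors_13, errors_39)
```

Each such test looks at one pair of schemas on its own, which is why it can flag differences that do not survive a correction spanning all six comparisons.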

Although the 27-point calibration does not show a statistically significant advantage over the 22-point calibration, it is preferred because of its better measured accuracy. The 27-point method provides more reference points than the 22-point calibration, which helps the system capture eye movements more effectively, especially in peripheral screen regions.

Given these findings, the 27-point calibration stands out as a balanced solution. It provides high accuracy while avoiding the added complexity associated with the 39-point method. Therefore, the 27-point calibration is an optimal choice for various applications in mobile eye tracking, effectively striking a balance between performance and usability.

Recommendations and Conclusions

Based on the results of this study, it is clear that the number of calibration points significantly affects the accuracy of eye-tracking on mobile devices. The 27-point calibration is therefore recommended as the optimal solution because it strikes a balance between accuracy and ease of calibration. It offers:

High central accuracy while improving peripheral precision compared to lower-point schemas.
Consistency across different participant groups, ensuring reliable data collection.
Efficient use of calibration points: Beyond 27 points, the marginal gains in accuracy are minimal, making the additional effort of the 39-point calibration unnecessary for most applications.