FaceEngage: Robust Estimation of Gameplay Engagement from Real-world Front-facing Videos
Doctor of Philosophy
Measuring users' cognitive engagement in interactive tasks can facilitate numerous applications that optimize user experience, ranging from eLearning to gaming and TV viewing. However, a significant challenge in estimating engagement is the lack of non-contact methods that operate robustly in unconstrained environments (e.g., home or workplace). Herein, we present FaceEngage, a non-intrusive and automatic system that relies only on facial recordings of users to estimate engagement in realistic conditions. Our contributions are three-fold. First, we show the potential of using front-facing videos as training data for building an engagement estimator for tasks requiring hand-eye coordination. We compile the FaceEngage Dataset, which contains over 700 picture-in-picture YouTube gaming video clips (i.e., with game scenes in full-screen and time-synchronized front-facing recordings of users in subwindows) recorded and uploaded by gamers in their natural environments. Second, we develop the FaceEngage system, a supervised learning-based engagement estimator that is trained to capture relevant facial features of gamers from front-facing recordings and to infer their engagement from those features. We implement two FaceEngage processing pipelines: an estimator that applies traditional classifiers to a set of facial motion features inspired by prior work, and a deep learning-enabled estimator. Despite the challenging nature of realistic videos, FaceEngage attains an engagement estimation accuracy of 83.8%. Lastly, we conduct extensive experiments and conclude that: (i) certain facial motion cues of users (e.g., blink rates, head movements) are indicative of their engagement; (ii) our deep learning-enabled FaceEngage system extracts more informative features and outperforms the models trained on facial motion features; (iii) our FaceEngage system is robust to various input video lengths and to diverse users and game genres, and it is reasonably interpretable.
Engagement Estimation; Video Analysis; Deep Neural Networks