Yuan Xu1*, Zimu Zhang1*, Xiaoxuan Ma1, Wentao Zhu2, Yu Qiao3, Yizhou Wang1
1Peking University 2Eastern Institute of Technology, Ningbo 3Shanghai Jiao Tong University
Virtual and augmented reality systems increasingly demand intelligent adaptation to user behaviors for enhanced interaction experiences. Achieving this requires accurately understanding human intentions and predicting future situated behaviors, such as gaze direction and object interactions, which is vital for creating responsive VR/AR environments and applications like personalized assistants. However, accurate behavioral prediction demands modeling the underlying cognitive processes that drive human-environment interactions. In this work, we introduce a hierarchical, intention-aware framework that models human intentions and predicts detailed situated behaviors by leveraging cognitive mechanisms. Given historical human dynamics and observations of the scene context, our framework first identifies potential interaction targets and then forecasts fine-grained future behaviors. We propose a dynamic Graph Convolutional Network (GCN) to effectively capture human-environment relationships. Extensive experiments on challenging real-world benchmarks and a live VR environment demonstrate the effectiveness of our approach, which achieves superior performance across all metrics and enables practical applications for proactive VR systems that anticipate user behaviors and adapt virtual environments accordingly.
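The sketch below is a minimal, illustrative rendering of the two-stage idea described in the abstract, not the authors' released code: a dynamic GCN over human and object nodes that first scores candidate interaction targets and then forecasts future states. The module names, feature dimensions, and the distance-based soft adjacency are assumptions made for illustration.

```python
# Minimal sketch (assumptions throughout) of an intention-aware, two-stage predictor
# built on a dynamic GCN over human and object nodes.
import torch
import torch.nn as nn
import torch.nn.functional as F


def dynamic_adjacency(positions: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Build a per-sample soft adjacency from pairwise node distances (assumed design)."""
    # positions: (B, N, 3) -> adjacency: (B, N, N)
    dist = torch.cdist(positions, positions)           # pairwise Euclidean distances
    adj = torch.exp(-dist ** 2 / (2 * sigma ** 2))     # closer nodes -> stronger edges
    deg = adj.sum(-1, keepdim=True).clamp(min=1e-6)
    return adj / deg                                   # row-normalized adjacency


class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D), adj: (B, N, N) -> aggregate neighbor features, then project
        return F.relu(self.proj(torch.bmm(adj, x)))


class IntentionAwarePredictor(nn.Module):
    """Stage 1: score which objects will be interacted with next.
    Stage 2: forecast future 3D states from the shared node features."""

    def __init__(self, feat_dim: int = 16, hidden: int = 64, horizon: int = 10):
        super().__init__()
        self.gcn1 = GCNLayer(feat_dim, hidden)
        self.gcn2 = GCNLayer(hidden, hidden)
        self.intent_head = nn.Linear(hidden, 1)          # per-node interaction score
        self.traj_head = nn.Linear(hidden, horizon * 3)  # future 3D positions per node
        self.horizon = horizon

    def forward(self, feats: torch.Tensor, positions: torch.Tensor):
        # feats: (B, N, feat_dim) node features (e.g., pose, gaze, object attributes)
        # positions: (B, N, 3) current 3D node positions
        adj = dynamic_adjacency(positions)
        h = self.gcn2(self.gcn1(feats, adj), adj)
        intent_logits = self.intent_head(h).squeeze(-1)                  # (B, N)
        future = self.traj_head(h).view(*h.shape[:2], self.horizon, 3)   # (B, N, T, 3)
        return intent_logits, future


if __name__ == "__main__":
    B, N = 2, 12  # 2 scenes, 12 human + object nodes each (toy sizes)
    model = IntentionAwarePredictor()
    logits, future = model(torch.randn(B, N, 16), torch.randn(B, N, 3))
    print(logits.shape, future.shape)  # torch.Size([2, 12]) torch.Size([2, 12, 10, 3])
```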
Real-world experiment results. We present three different perspectives: an exocentric view, an egocentric view from the VR device, and the corresponding rendering in Blender. In Blender, a yellow Lego figure represents the ground truth and a blue Lego figure represents our prediction. The color of each object's bounding box encodes the interaction probability output by our model: a color closer to green indicates a higher predicted probability that the object will be interacted with next, while a color closer to white indicates a lower probability.
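For clarity, the white-to-green color coding above can be realized with a simple linear blend; the exact colors and interpolation used for the Blender renders are assumptions in this small sketch.

```python
# Sketch of mapping a model probability to a bounding-box color (white -> green);
# the specific endpoint colors and linear blend are illustrative assumptions.
def probability_to_color(p: float) -> tuple:
    """Blend linearly from white (p=0) to green (p=1) in RGB."""
    p = max(0.0, min(1.0, p))  # clamp to [0, 1]
    white, green = (1.0, 1.0, 1.0), (0.0, 1.0, 0.0)
    return tuple((1 - p) * w + p * g for w, g in zip(white, green))


print(probability_to_color(0.1))  # nearly white: low interaction probability
print(probability_to_color(0.9))  # nearly green: high interaction probability
```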
Qualitative results of our model on the ADT dataset. The visualization includes: (top-left) an egocentric RGB view for reference, (top-right) a 3D visualization of the interaction scenario, and (bottom) a wireframe representation of the environment with the ground truth and our predictions of the human states and the interacted object states. Visual elements include human gaze direction (rays), head position and orientation (pyramids), hand positions (points), and interacted objects (bounding boxes). Red elements represent the input and ground truth, green bounding boxes represent the ground-truth object interaction trajectory, and blue elements and bounding boxes represent the corresponding predictions. White bounding boxes indicate the top-K objects selected by our interaction intention prediction module.
In the input clip, the subject's gaze sweeps over the coffee cup on the round stool as the subject bends down and reaches for it. Our model recognizes and understands the subject's intention, identifies the coffee cup as the next active object, and predicts object motion trajectories close to the ground truth.
In this example, a wide array of objects is available on the table in front of the subject. In the input clip, the subject holds a wooden spoon in the right hand while the left hand, accompanied by eye gaze, approaches a bowl on the table. From among the numerous objects, our model accurately predicts that the wooden spoon in the right hand will remain in interaction and that the black bowl will become the next active object.
In the input clip, the subject's gaze is fixed on the drink on the table, and the head and hand trajectories move toward it. These cues effectively answer the questions of 'where to look', 'where to go', and 'which object to interact with'. Based on this information, our model successfully understands the subject's intention and accurately predicts both the next active object and trajectories that closely align with the ground truth.
In this example, the subject performs the action of drinking water. Instead of merely following the motion trend of the coffee cup in the input clip, our model understands the human drinking action pattern of "pick up, then put down" and consequently predicts the coffee cup's future motion correctly.
@misc{xu2025seeingfuturepredictingsituated,
  title={Seeing My Future: Predicting Situated Interaction Behavior in Virtual Reality},
  author={Yuan Xu and Zimu Zhang and Xiaoxuan Ma and Wentao Zhu and Yu Qiao and Yizhou Wang},
  year={2025},
  eprint={2510.10742},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.10742},
}