Recently, the team of Prof. Song-Chun Zhu and Assistant Professor Yixin Zhu at the Institute for Artificial Intelligence published a paper titled "Scene Reconstruction with Functional Objects for Robot Autonomy" in IJCV 2022. The paper proposes a new scene reconstruction problem and scene representation that provide the information needed for robot autonomy and yield interactive virtual scenes, closely resembling real-world scenarios, for simulation testing. The work also develops a complete computer vision system to solve the proposed scene reconstruction problem. Experiments demonstrate the effectiveness of the proposed reconstruction approach and the potential of the scene representation for autonomous robot planning.
Perceiving the 3D environment and understanding the information it contains is a crucial aspect of human intelligence and a prerequisite for interacting with the environment. Beyond the geometry of the environment and the semantics of objects, we can also "perceive" the potential ways of interacting with the environment, which we refer to as actionable information. For example, when we see a doorknob, as in Figure 1(a), we naturally imagine the potential action of turning it and opening the door. In the scene in Figure 1(b), we can readily observe the constraint relationship (mutual support) between stacked cups and dishes, as well as the effects of different actions on their states: directly pulling out the bottom cups and dishes would topple the ones on top, whereas removing the top objects one by one allows us to retrieve the bottom ones safely. Understanding how potential actions affect a scene is the basis for performing tasks and interacting within it. Intelligent robots likewise require such perceptual capabilities to autonomously accomplish complex long-horizon planning in their environments.
Fig. 1 (a) Doorknob, (b) Stacked cups and dishes (Image source: Internet, copyright belongs to the original author)
With the development of 3D scene reconstruction and semantic mapping technologies, robots can now efficiently build 3D maps that incorporate both geometric and semantic information, for example panoptic semantic maps that include object and room-structure information, as shown in Figure 2(b). However, a significant gap remains between these traditional scene representations produced by scene reconstruction and actual autonomous robot planning. Two questions then arise: How can we construct a scene representation that is universally applicable to robot perception and planning, so as to enhance a robot's autonomous planning capability? And how can robots build such a representation of real-world environments from their own sensor inputs, such as an RGB-D camera?
In this paper [1], the researchers address this challenge by reconstructing functionally equivalent, interactive virtual scenes that preserve the actionable information of the original scenes. These virtual scenes can be used for simulated training and testing of autonomous robot planning. To accomplish this reconstruction task, the researchers introduce a scene graph representation based on supporting and proximal relations, as depicted in Figure 2(a). Each node in this scene graph represents an object or a room structure (e.g., walls, floor, roof) in the scene. The scene graph organizes the reconstructed scene and its physical constraints so that the resulting virtual scene complies with physical common sense. Moreover, it can be converted directly into a kinematic tree of the environment, which comprehensively describes the environment's kinematic states and supports forward prediction of how robot actions affect the environment, so it can be used directly in robot planning tasks. The paper also presents a complete computer vision system that realizes this reconstruction task and designs output interfaces for the reconstructed scene, so that it can be seamlessly imported into robot simulators (e.g., Gazebo) and virtual reality environments. Preliminary work on this topic [2] was previously published at ICRA 2021.
Fig. 2 (a) Contact graph, (b) Panoptic mapping, (c) Functionally equivalent, interactive 3D virtual scene
Reconstructing real scenes in virtual environments to support robot simulation is not straightforward. There are three main challenges: first, accurately reconstructing and segmenting the geometry of each object and structure in a cluttered real scene, and estimating the physical constraints between objects (e.g., supporting relations); second, replacing the reconstructed, incomplete geometry with complete, interactive counterparts (e.g., CAD models); third, integrating all of this information into a unified scene representation that serves both scene reconstruction and autonomous robot planning.
This work proposes a specialized scene graph that provides the information needed for autonomous robot planning and helps reconstruct physically plausible virtual scenes, serving as a bridge between scene reconstruction and robot interaction. On one hand, the scene graph organizes the perceived objects, room structures, and their relations in the scene, as shown in Figure 3(a). Each node represents an identified and reconstructed object or room structure in the real scene, including its geometry (e.g., reconstructed 3D mesh, 3D bounding box, extracted plane features) and semantic information (instance and semantic labels); each edge represents either a supporting relation (directed edges in Figure 3(a)) or a proximal relation (undirected edges in Figure 3(a)), encoding a physical constraint. For a supporting relation, the parent node must provide a horizontal supporting surface that stably supports the child node; for a proximal relation, the 3D geometries of adjacent nodes must not overlap. On the other hand, based on semantic and geometric similarity and on the constraints between nodes, the nodes in Figure 3(a) are replaced with complete, interactive CAD models (including articulated CAD models), yielding a virtual scene that can be used for robot simulation and interaction, as shown in Figure 3(b). This virtual scene preserves the functionality of the real scene, i.e., its actionable information, to the extent permitted by the perception system, and effectively enables simulation of interactions with objects in the real scene. Furthermore, the resulting scene graph contains a complete description of the environment's kinematics and constraint states: it can predict the short-term, quantitative effects of robot actions on kinematic states to assist motion planning, and estimate the long-term, qualitative effects of robot actions on constraint relations to support task planning.
Fig. 3 (a) Directly reconstructed scene graph, (b) Interactive scene graph after CAD model replacement
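To make this representation more concrete, below is a minimal Python sketch of such a contact graph. It is not the authors' implementation; all class and field names (SceneNode, ContactGraph, bbox, and so on) are illustrative assumptions. Nodes carry geometry and semantics, directed edges encode support, undirected edges encode proximity, and traversing the support edges alone yields the kinematic tree used in simulation.

```python
# Minimal sketch (not the paper's code) of a contact graph:
# nodes = objects / room structures, directed edges = support,
# undirected edges = proximity; support edges form a kinematic tree.
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """An object or room structure (wall, floor, ...) in the scene."""
    node_id: str
    semantic_label: str          # e.g. "table", "cup", "floor"
    bbox: tuple                  # hypothetical 3D box (cx, cy, cz, dx, dy, dz)
    mesh_path: str = ""          # reconstructed mesh or matched CAD model
    children: list = field(default_factory=list)  # nodes this node supports


@dataclass
class ContactGraph:
    root: SceneNode                                      # floor / room structure
    proximal_edges: list = field(default_factory=list)   # undirected (id_a, id_b)

    def add_support(self, parent: SceneNode, child: SceneNode) -> None:
        """Directed support edge: parent's horizontal surface supports child."""
        parent.children.append(child)

    def kinematic_tree(self, node=None, depth=0):
        """Traverse the support tree; in simulation each edge becomes a
        (fixed or articulated) joint between parent and child bodies."""
        node = node or self.root
        yield depth, node
        for child in node.children:
            yield from self.kinematic_tree(child, depth + 1)


# Usage: the floor supports a table, which supports a stacked dish and cup.
floor = SceneNode("floor", "floor", (0, 0, 0, 5, 5, 0.1))
table = SceneNode("table_1", "table", (1, 1, 0.4, 1.2, 0.8, 0.8))
dish = SceneNode("dish_1", "dish", (1, 1, 0.85, 0.2, 0.2, 0.03))
cup = SceneNode("cup_1", "cup", (1, 1, 0.93, 0.08, 0.08, 0.1))

graph = ContactGraph(root=floor)
graph.add_support(floor, table)
graph.add_support(table, dish)
graph.add_support(dish, cup)   # removing the dish would topple the cup

for depth, node in graph.kinematic_tree():
    print("  " * depth + node.semantic_label)
```

Keeping support as a tree is what lets the representation double as a kinematic description: every node has exactly one supporting parent, so a simulator can attach bodies along those edges directly.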
Fig. 4 Flowchart of the computer vision system used for the reconstruction task
To accomplish the reconstruction task above, the authors designed and implemented a multi-module computer vision system: a volumetric panoptic mapping module, shown in Figure 4(A), and a CAD model replacement module based on physical knowledge and geometry, shown in Figure 4(B). The former uses an RGB-D camera to robustly recognize, segment, and reconstruct the dense geometry of objects and room structures in complex real environments, and estimates the constraint relations between them to obtain the scene graph shown in Figure 3(a). The latter selects the most suitable CAD models from a model library based on the reconstructed objects' geometry and the identified constraint relations, and estimates their poses and scales to align them accurately with the original objects, producing the interactive scene graph shown in Figure 3(b). Figure 5 shows the authors' reconstruction of a real office scene captured with a Kinect2 camera, including the volumetric panoptic reconstruction in Figure 5(a), the reconstructed interactive virtual scene in Figure 5(b), and examples of robot interaction after importing the virtual scene into a robot simulator in Figure 5(c). Even in complex and heavily occluded real scenes, the proposed system can effectively build interactive virtual scenes. Figures 5(d-f) present some interesting examples from this experiment: in Figure 5(d), occlusion by a chair causes a single table to be reconstructed as two smaller tables; Figure 5(e) shows a well-reconstructed workspace in which all objects are replaced with CAD models resembling their appearance; in Figure 5(f), a chair occluded by the table is not identified, similar to the situation in Figure 5(d), yet the refrigerator and microwave in the scene are reconstructed and replaced with articulated CAD models capable of complex interactions.
Fig. 5 Reconstruction results in a real-world environment by using a Kinect2 camera
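As a rough illustration of what the replacement module has to do, the following simplified Python sketch ranks CAD candidates by semantic label and bounding-box similarity, then aligns the chosen model by scaling its bounding box to the reconstructed segment while keeping its bottom on the parent's supporting surface. This is a toy heuristic under stated assumptions, not the paper's actual alignment algorithm, and every function and field name here is hypothetical.

```python
# Toy CAD-replacement heuristic (assumed, not the paper's algorithm):
# pick the semantically matching model with the closest box proportions,
# scale it to the reconstructed extent, and rest it on the support surface.
import numpy as np


def bbox_similarity(extent_a, extent_b):
    """Compare normalized box aspect ratios (1.0 = identical proportions)."""
    a = np.asarray(extent_a, dtype=float)
    b = np.asarray(extent_b, dtype=float)
    return float((a / np.linalg.norm(a)) @ (b / np.linalg.norm(b)))


def select_and_align(segment, cad_library, support_height):
    """segment: dict with 'label', 'extent' (dx, dy, dz), 'center' (x, y, z).
    cad_library: list of dicts with 'label', 'extent', 'path'.
    Returns the chosen model plus the scale and placement aligning it."""
    candidates = [m for m in cad_library if m["label"] == segment["label"]]
    if not candidates:
        return None
    best = max(candidates,
               key=lambda m: bbox_similarity(m["extent"], segment["extent"]))

    # Per-axis scale so the CAD bounding box matches the reconstructed one.
    scale = np.asarray(segment["extent"]) / np.asarray(best["extent"])

    # Keep the segment's horizontal position, but enforce the support
    # constraint: the scaled model's bottom rests on the supporting surface.
    scaled_height = best["extent"][2] * scale[2]
    center = np.array([segment["center"][0], segment["center"][1],
                       support_height + scaled_height / 2.0])
    return {"model": best["path"], "scale": scale, "center": center}


# Usage with a toy two-model library and one reconstructed table segment.
library = [
    {"label": "table", "extent": (1.0, 0.6, 0.75), "path": "cad/table_a.obj"},
    {"label": "table", "extent": (2.0, 1.0, 0.75), "path": "cad/table_b.obj"},
]
segment = {"label": "table", "extent": (1.1, 0.7, 0.78), "center": (2.0, 1.5, 0.39)}
print(select_and_align(segment, library, support_height=0.0))
```

Enforcing the support constraint during placement is what keeps the replaced scene physically plausible even when the reconstructed geometry is incomplete, as in the occluded-table cases of Figures 5(d) and 5(f).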
Fig. 6 Robot task and motion planning in the reconstructed virtual scene
In the reconstructed interactive virtual scenes, aided by the kinematic chains and constraint information encoded in the scene graph, robots can perform task and motion planning [3,4]; simulation results are shown in Figure 6. In more recent related work [5], building on the scene graph representation described above, robots can perform complex task planning directly, using graph edit distance to generate actions efficiently.
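As a rough illustration of the idea behind [5], the sketch below, assuming the networkx library is available, computes the graph edit distance between a current and a goal scene graph; a planner can then prefer actions whose predicted successor graph reduces this distance. The toy scene_graph helper and the uniform edit costs are illustrative assumptions, not the planner from [5].

```python
# Minimal sketch: graph edit distance between current and goal scene graphs
# as a planning cost (illustrative only; assumes networkx is installed).
import networkx as nx


def scene_graph(support_edges):
    """Build a directed scene graph from (parent, child) support edges."""
    g = nx.DiGraph()
    for parent, child in support_edges:
        g.add_node(parent, label=parent)
        g.add_node(child, label=child)
        g.add_edge(parent, child)
    return g


# Current state: the cup is stacked on the dish, which sits on the table.
current = scene_graph([("floor", "table"), ("table", "dish"), ("dish", "cup")])
# Goal state: the cup and the dish are both placed directly on the table.
goal = scene_graph([("floor", "table"), ("table", "dish"), ("table", "cup")])

# The edit distance counts node/edge insertions, deletions, and substitutions
# needed to transform the current graph into the goal graph; here it reflects
# the single pick-and-place needed to move the cup from the dish to the table.
distance = nx.graph_edit_distance(
    current, goal, node_match=lambda a, b: a["label"] == b["label"])
print(distance)
```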
This work presents a novel scene reconstruction problem and scene graph representation that provide essential information for robot autonomous planning and offer interactive virtual scenes with functionality similar to real-world scenes for simulation testing. Furthermore, a complete computer vision system has been developed to address the proposed scene reconstruction problem. The experiments demonstrate the effectiveness of the scene reconstruction method and the potential of the scene graph representation in robot autonomous planning.
Looking ahead, we anticipate further advances in this line of work, including more robust and accurate matching between rigid and articulated CAD models and the reconstructed geometry, integrating richer potential-action information into the scene graph, and leveraging scene priors for robot planning. Scene graph reconstruction empowers autonomous planning, bringing us closer to more intelligent robots in the near future.
References:
[1] Han, Muzhi, et al. "Scene Reconstruction with Functional Objects for Robot Autonomy." International Journal of Computer Vision (IJCV), 2022.
[2] Han, Muzhi, et al. "Reconstructing Interactive 3D Scenes by Panoptic Mapping and CAD Model Alignments." 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 12199–12206.
[3] Jiao, Ziyuan, et al. "Consolidating Kinematic Models to Promote Coordinated Mobile Manipulations." 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, doi:10.1109/iros51168.2021.9636351.
[4] Jiao, Ziyuan, et al. "Efficient Task Planning for Mobile Manipulation: A Virtual Kinematic Chain Perspective." 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, pp. 8288–8294.
[5] Jiao, Ziyuan, et al. "Sequential Manipulation Planning on Scene Graph." 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022.