Advancing Spatial Intelligence from Data Representation, Learning Process, and 3D Generation
3D spatial intelligence—the capacity to perceive, reason about, and manipulate geometric data—is increasingly vital for advancing artificial intelligence across domains such as robotics, the metaverse, immersive telepresence, and autonomous systems. This tutorial provides a comprehensive overview of recent advances in the field, covering critical aspects from data representation to learning processes and 3D generation techniques.
We will begin by exploring fundamental representations that accurately and efficiently capture 3D and 4D scenes; understanding the intricacies of these representations is crucial for strong performance on spatial tasks. Next, we will delve into cross-modal and structure-aware techniques that strengthen learning on 3D data, showcasing how integrating information from multiple modalities can significantly improve model robustness and accuracy.
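To make the idea of an implicit neural representation concrete, the sketch below shows a coordinate MLP that maps a 3D point to a signed distance value, so the surface is recovered as the zero level set of the learned field. This is a minimal illustration only, assuming PyTorch; the class name, depth, and width are illustrative choices, not a specific method from the tutorial or its reading list.

```python
# A minimal sketch of an implicit neural representation (not code from the
# tutorial): an MLP f(x, y, z) -> signed distance, whose zero level set
# defines a surface. Assumes PyTorch; all names and sizes are illustrative.
import torch
import torch.nn as nn

class SDFNetwork(nn.Module):
    """Coordinate MLP mapping 3D points to signed distance values."""

    def __init__(self, hidden_dim: int = 256, num_layers: int = 8):
        super().__init__()
        layers, in_dim = [], 3  # input is an (x, y, z) query point
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.Softplus(beta=100)]
            in_dim = hidden_dim
        layers.append(nn.Linear(hidden_dim, 1))  # scalar signed distance
        self.net = nn.Sequential(*layers)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        return self.net(xyz)

if __name__ == "__main__":
    sdf = SDFNetwork()
    points = torch.rand(1024, 3) * 2 - 1   # random queries in [-1, 1]^3
    distances = sdf(points)                # shape (1024, 1)
    print(distances.shape)
```

In practice, such a network is fitted to point clouds or multi-view images under suitable regularization, and the surface is then extracted from the learned field (e.g., with marching cubes).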
Additionally, the tutorial will highlight state-of-the-art methods for generating 3D data, discussing the implications of generative models and their applications in real-world scenarios. We will illustrate how these techniques can be leveraged to create realistic environments and objects, thereby enhancing user experiences in AR and VR applications.
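As one concrete formulation from this line of work, DreamFusion-style score distillation sampling (SDS; reference 32 in the reading list below) optimizes a 3D representation with parameters $\theta$ through a differentiable renderer $\mathbf{x} = g(\theta)$, using a pretrained 2D diffusion model $\hat{\epsilon}_\phi$ as a critic on noised renderings:

\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\big(\phi, \mathbf{x} = g(\theta)\big) \;=\; \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}_\phi(\mathbf{x}_t; y, t) - \epsilon\big)\,\frac{\partial \mathbf{x}}{\partial \theta} \right],
\]

where $y$ is the text condition, $\mathbf{x}_t$ is the rendering noised to timestep $t$, and $w(t)$ is a weighting function. Notably, the Jacobian of the diffusion network itself is omitted, which is what distinguishes SDS from naively backpropagating the diffusion training loss.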
Participants will gain insights into the challenges and opportunities in advancing spatial intelligence, equipping them with the knowledge to implement these cutting-edge techniques in their own projects. By fostering a deeper understanding of the interplay between data representation, learning processes, and 3D generation, this tutorial aims to inspire innovative approaches to harnessing spatial intelligence in AI systems.
Organizer: CSIG Big Visual Data Technical Committee
Dr. Chenglin Li is currently a Full Professor with the Department of Electronic Engineering at Shanghai Jiao Tong University. His current research interests include multimedia signal processing and communications, adaptive video streaming, and theories and applications of reinforcement learning and federated learning in multimedia communication systems. He received the Microsoft Research Asia (MSRA) Fellowship in 2011, the Alexander von Humboldt Research Fellowship in 2017, and the Top Paper Award of ACM Multimedia in 2022. He is an Associate Editor of the IEEE Transactions on Signal and Information Processing over Networks, a Youth Editor of the Chinese Journal of Electronics, an Area Chair of the 2022 ACM International Conference on Multimedia, and the Secretary-General of the CSIG Big Visual Data Technical Committee. He became the PI of the Shanghai Rising-Star Program in 2020 and the PI of the Excellent Young Scientists Fund of the NSFC in 2021.
Biographies of Speakers:
Dr. Junhui Hou is an Associate Professor with the Department of Computer Science, City University of Hong Kong. His research interests include multidimensional visual computing over light-field, hyperspectral, geometry, and event data. He received the Early Career Award from the Hong Kong Research Grants Council and the Excellent Young Scientists Fund from the NSFC. He has served or is serving as an Associate Editor for IEEE TIP, TVCG, TMM, and TCSVT.
Dr. Yuan Liu is an Assistant Professor at the Hong Kong University of Science and Technology (HKUST). Prior to that, he worked at Nanyang Technological University (NTU) as a postdoctoral researcher and obtained his PhD degree from the University of Hong Kong (HKU). His research mainly concentrates on 3D vision and graphics. He currently works on topics in 3D AIGC, including 3D neural representations, 3D generative models, and 3D-aware video generation. He has published over 40 papers on these topics in top venues such as SIGGRAPH, ACM TOG, CVPR, ICCV, and ECCV.
| Title | Content | Speaker | Time |
| --- | --- | --- | --- |
| Advancing Neural Spatial Computing from Representation and Learning Process | 1. Advanced explicit representation of 3D point cloud data | Prof. Junhui Hou (侯军辉) | 100 minutes |
| 3D Modeling from Reconstruction to Generation | 1. Neural field for surface reconstruction (see the rendering equation below this table) | Prof. Yuan Liu (刘缘) | 80 minutes |
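For orientation on the neural-field surface reconstruction topic above, NeuS-style methods (reference 26 in the reading list below) render a signed distance field by alpha compositing samples along each camera ray; the discrete volume rendering equation takes the standard form

\[
\hat{C}(\mathbf{r}) \;=\; \sum_{i=1}^{n} T_i\,\alpha_i\,\mathbf{c}_i, \qquad T_i \;=\; \prod_{j=1}^{i-1} (1 - \alpha_j),
\]

where $\mathbf{c}_i$ is the color of the $i$-th sample and the opacity $\alpha_i$ is derived from the sampled signed distance values, so that photometric supervision on $\hat{C}$ propagates gradients back to the underlying geometry.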
References:
1. Q. Zhang, et al., RegGeoNet: Learning Regular Representations for Large-Scale 3D Point Clouds, IJCV, 2022.
2. Q. Zhang, et al., FlatteningNet: Deep Regular 2D Representation for 3D Point Cloud Analysis, IEEE TPAMI, 2023.
3. Y. Zeng, et al., Dynamic 3D Point Cloud Sequences as 2D Videos, IEEE TPAMI, 2024.
4. Q. Zhang, et al., Flatten Anything: Unsupervised Neural Surface Parameterization, in Proc. NeurIPS, 2024.
5. Y. Zhao, et al., FlexPara: Flexible Neural Surface Parameterization, https://arxiv.org/abs/2504.19210, 2025.
6. S. Ren, et al., GeoUDF: Surface Reconstruction from 3D Point Clouds via Geometry-guided Distance Representation, in Proc. ICCV, 2023.
7. J. Hu, et al., A Lightweight UDF Learning Framework for 3D Reconstruction based on Local Shape Functions, in Proc. CVPR, 2025.
8. Y. Yao, et al., DynoSurf: Neural Deformation-based Temporally Consistent Dynamic Surface Reconstruction, in Proc. ECCV, 2024.
9. C. Yang, et al., Monge-Ampere Regularization for Learning Arbitrary Shapes from Point Clouds, IEEE TPAMI, 2025.
10. S. Ren, et al., Neural Compression of 3D Geometry Sets, in Proc. ICCV, 2025.
11. S. Ren, et al., Shape as Line Segments: Accurate and Flexible Implicit Surface Representation, in Proc. ICLR, 2025.
12. Q. Zhang, et al., NeuroGF: A Neural Representation for Fast Geodesic Distance and Path Queries, in Proc. NeurIPS, 2023.
13. Y. Qian, et al., PUGeo-Net: A Geometry-centric Network for 3D Point Cloud Up-sampling, in Proc. ECCV, 2020.
14. A. Mao, et al., PU-Flow: A Point Cloud Upsampling Network with Normalizing Flows, IEEE TVCG, 2023.
15. Y. Qian, et al., Deep Magnification-flexible Upsampling over 3D Point Clouds, IEEE TIP, 2021.
16. Q. Zhang, et al., SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation, https://arxiv.org/abs/2503.09439, 2025.
17. Y. Yao, et al., RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects from Videos, in Proc. CVPR, 2025.
18. Q. Liu, et al., MoDGS: Dynamic Gaussian Splatting from Casually-captured Monocular Videos with Depth Priors, in Proc. ICLR, 2025.
19. Q. Zhang, et al., PointVST: Self-Supervised Pre-Training for 3D Point Clouds via View-Specific Point-to-Image Translation, IEEE TVCG, 2023.
20. Q. Zhang, et al., PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition, IEEE TMM, 2022.
21. Y. Zhang, et al., Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration, IEEE TPAMI, 2025.
22. Y. Zhang, et al., Fine-grained Image-to-LiDAR Contrastive Distillation with Visual Foundation Models, in Proc. NeurIPS, 2024.
23. Y. Zhang, et al., Is Contrastive Distillation Enough for Learning Comprehensive 3D Representation? https://arxiv.org/abs/2412.08973, 2024.
24. Y. Zhang, et al., Unleash the Potential of Image Branch for Cross-modal 3D Object Detection, in Proc. NeurIPS, 2023.
25. S. Ren, et al., DDM: A Metric for Comparing 3D Shapes using Directional Distance Fields, IEEE TPAMI, 2025.
26. Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., & Wang, W. (2021). NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In Advances in Neural Information Processing Systems (NeurIPS 2021).
27. Liu, Y., Wang, P., Lin, C., Long, X., Wang, J., Liu, L., ... & Wang, W. (2023). NeRO: Neural geometry and BRDF reconstruction of reflective objects from multiview images. ACM Transactions on Graphics (TOG), 42(4), 1-22.
28. Long, X., Lin, C., Liu, L., Liu, Y., Wang, P., Theobalt, C., ... & Wang, W. (2023). NeuralUDF: Learning unsigned distance fields for multi-view reconstruction of surfaces with arbitrary topologies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 20834-20843).
29. Huang, B., Yu, Z., Chen, A., Geiger, A., & Gao, S. (2024). 2D Gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 Conference Papers (pp. 1-11).
30. Yu, Z., Sattler, T., & Geiger, A. (2024). Gaussian opacity fields: Efficient adaptive surface reconstruction in unbounded scenes. ACM Transactions on Graphics (TOG), 43(6), 1-13.
31. Wang, J., Liu, Y., Wang, P., Lin, C., Hou, J., Li, X., ... & Wang, W. (2024). GausSurf: Geometry-guided 3D Gaussian splatting for surface reconstruction. arXiv preprint arXiv:2411.19454.
32. Poole, B., Jain, A., Barron, J. T., & Mildenhall, B. (2022). DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988.
33. Tang, J., Ren, J., Zhou, H., Liu, Z., & Zeng, G. (2023). DreamGaussian: Generative Gaussian splatting for efficient 3D content creation. arXiv preprint arXiv:2309.16653.
34. Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., & Wang, W. (2023). SyncDreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453.
35. Long, X., Guo, Y. C., Lin, C., Liu, Y., Dou, Z., Liu, L., ... & Wang, W. (2024). Wonder3D: Single image to 3D using cross-domain diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9970-9980).
36. Li, P., Liu, Y., Long, X., Zhang, F., Lin, C., Li, M., ... & Guo, Y. (2024). Era3D: High-resolution multiview diffusion using efficient row-wise attention. Advances in Neural Information Processing Systems, 37, 55975-56000.
37. Li, P., Zheng, W., Liu, Y., Yu, T., Li, Y., Qi, X., ... & Guo, Y. (2024). PSHuman: Photorealistic single-image 3D human reconstruction using cross-scale multiview diffusion and explicit remeshing. arXiv preprint arXiv:2409.10141.
38. Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., ... & Tan, H. (2023). LRM: Large reconstruction model for single image to 3D. arXiv preprint arXiv:2311.04400.
39. Xu, J., Cheng, W., Gao, Y., Wang, X., Gao, S., & Shan, Y. (2024). InstantMesh: Efficient 3D mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191.
40. Zhang, L., Wang, Z., Zhang, Q., Qiu, Q., Pang, A., Jiang, H., ... & Yu, J. (2024). CLAY: A controllable large-scale generative model for creating high-quality 3D assets. ACM Transactions on Graphics (TOG), 43(4), 1-20.
41. Li, Y., Zou, Z. X., Liu, Z., Wang, D., Liang, Y., Yu, Z., ... & Cao, Y. P. (2025). TripoSG: High-fidelity 3D shape synthesis using large-scale rectified flow models. arXiv preprint arXiv:2502.06608.
42. Lai, Z., Zhao, Y., Liu, H., Zhao, Z., Lin, Q., Shi, H., ... & Guo, C. (2025). Hunyuan3D 2.5: Towards high-fidelity 3D assets generation with ultimate details. arXiv preprint arXiv:2506.16504.
43. Gu, Z., Yan, R., Lu, J., Li, P., Dou, Z., Si, C., ... & Liu, Y. (2025). Diffusion as Shader: 3D-aware video diffusion for versatile video generation control. In ACM SIGGRAPH 2025 Conference Papers (pp. 1-12).
44. Burgert, R., Xu, Y., Xian, W., Pilarski, O., Clausen, P., He, M., ... & Yu, N. (2025). Go-with-the-Flow: Motion-controllable video diffusion models using real-time warped noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13-23).
45. Wang, Q., Luo, Y., Shi, X., Jia, X., Lu, H., Xue, T., ... & Gai, K. (2025). CineMaster: A 3D-aware and controllable framework for cinematic text-to-video generation. In ACM SIGGRAPH 2025 Conference Papers (pp. 1-10).
46. Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., & Revaud, J. (2024). DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 20697-20709).
47. Zhang, J., Herrmann, C., Hur, J., Jampani, V., Darrell, T., Cole, F., ... & Yang, M. H. (2024). MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825.
48. Lu, J., Huang, T., Li, P., Dou, Z., Lin, C., Cui, Z., ... & Liu, Y. (2025). Align3R: Aligned monocular depth estimation for dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 22820-22830).
49. Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., & Novotny, D. (2025). VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5294-5306).
50. Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., ... & He, T. (2025). $\pi^3$: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347.
51. Leroy, V., Cabon, Y., & Revaud, J. (2024). Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision (pp. 71-91). Cham: Springer Nature Switzerland.
52. Murai, R., Dexheimer, E., & Davison, A. J. (2025). MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16695-16705).
53. Wang, Q., Zhang, Y., Holynski, A., Efros, A. A., & Kanazawa, A. (2025). Continuous 3D perception model with persistent state. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10510-10522).
54. Lan, Y., Luo, Y., Hong, F., Zhou, S., Chen, H., Lyu, Z., ... & Pan, X. (2025). STream3R: Scalable sequential 3D reconstruction with causal Transformer. arXiv preprint arXiv:2508.10893.
55. Jiang, Z., Zheng, C., Laina, I., Larlus, D., & Vedaldi, A. (2025). Geo4D: Leveraging video generators for geometric 4D scene reconstruction. arXiv preprint arXiv:2504.07961.
56. Xu, T. X., Gao, X., Hu, W., Li, X., Zhang, S. H., & Shan, Y. (2025). GeometryCrafter: Consistent geometry estimation for open-world videos with diffusion priors. arXiv preprint arXiv:2504.01016.
China Society of Image and Graphics (CSIG)
Chinese Association for Artificial Intelligence (CAAI)
China Computer Federation (CCF)
Chinese Association of Automation (CAA)
Shanghai Jiao Tong University (SJTU)
Shanghai Feten Culture Promotion Company
AutoDL
East China Normal University