Xiaolong Li

I am a Senior Applied Research Scientist on the NVIDIA TAO team. My mission is to build 3D into AI foundations, with a current focus on grounding 3D-VLMs in the Embodied AI domain. Previously, I was an Applied Scientist at Amazon AGI, working on 3D vision problems and diffusion-based video generation. I obtained my Ph.D. in computer engineering at Virginia Tech, advised by Prof. A. Lynn Abbott, with research focused on deep 3D representation learning for dynamic scene understanding. I'm interested in AR/VR, Embodied AI, and robotics.

In summer 2019, I was lucky to work with Prof. Shuran Song (now Stanford University), Dr. He Wang (now Peking University), Dr. Li Yi (Google Research, now Tsinghua University), and Johnny Chung Lee (Google Brain Robotics) as a student ML researcher at Google Brain Robotics, Mountain View. In spring 2020, I did a research internship on 3D perception at MERL, mentored by Prof. Siheng Chen (now Shanghai Jiao Tong University) and Dr. Alan Sullivan (MERL). In summer 2021, I worked with Dr. Ishani Chakraborty (HoloLens), Dr. Yale Song (MSR), and Dr. Bugra Tekin (HoloLens) during a research internship. I have also worked with Prof. Yunhui Zhu (VT 3D Optics Group) on X-ray phase imaging.

News

May 16, 2023	Named as Outstanding Reviewer for CVPR 2023
Jun 27, 2022	Joined AWS AI as an applied scientist working on 3D Vision!
Sep 28, 2021	My first submission to NeurIPS 2021 was accepted; check the paper here!
May 17, 2021	Started my research internship at HoloLens, Microsoft
Sep 21, 2020	Our method ranked 3rd in the SemanticKITTI Multi-sweep Semantic Segmentation Challenge!
Mar 13, 2020	One paper accepted to CVPR 2020 as an oral presentation!

Education

PhD Student
Aug. 2016-present

Bachelor Degree
Aug. 2012-June 2016

Industry

Applied Scientist
Summer 2022-Present

Research Intern
Summer 2021

Research Intern
Spring 2020

Student Researcher
Summer 2019

Research Intern
Summer 2018

Publications

Li, Xiaolong, Mo, Jiawei, Wang, Ying, Parameshwara, Chethan, Fei, Xiaohan, Swaminathan, Ashwin, Taylor, CJ, Tu, Zhuowen, Favaro, Paolo, and Soatto, Stefano
Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

arXiv preprint arXiv:2404.18065 2024

We propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model.
Chen, Jiayi, Yan, Mi, Zhang, Jiazhao, Xu, Yinzhen, Li, Xiaolong, Weng, Yijia, Yi, Li, Song, Shuran, and Wang, He
Tracking and reconstructing hand object interactions from point cloud sequences in the wild

In Proceedings of the AAAI Conference on Artificial Intelligence 2023

We tackle the challenging task of jointly tracking hand object pose and reconstructing their shapes from depth point cloud sequences in the wild, given the initial poses at frame 0.
Parameshwara, Chethan, Achille, Alessandro, Trager, Matthew, Li, Xiaolong, Mo, Jiawei, Swaminathan, Ashwin, Taylor, CJ, Venkatraman, Dheera, Fei, Xiaohan, and Soatto, Stefano
Towards visual foundational models of physical scenes

arXiv preprint arXiv:2306.03727 2023

We describe a first step towards learning general-purpose visual representations of physical scenes.
Zhao, Yangheng, Wang, Jun, Li, Xiaolong, Hu, Yue, Zhang, Ce, Wang, Yanfeng, and Chen, Siheng
Number-adaptive prototype learning for 3D point cloud semantic segmentation

In European Conference on Computer Vision 2022

Wang, Jun, Li, Xiaolong, Sullivan, Alan, Abbott, Lynn, and Chen, Siheng
PointMotionNet: Point-wise motion learning for large-scale LiDAR point clouds sequences

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022

Li, Xiaolong, Weng, Yijia, Yi, Li, Guibas, Leonidas, Abbott, A Lynn, Song, Shuran, and Wang, He
Leveraging SE(3) Equivariance for Self-Supervised Category-Level Object Pose Estimation

NeurIPS 2021

Category-level object pose estimation aims to find 6D object poses of previously unseen object instances from known categories without access to object CAD models. To reduce the huge amount of pose annotations needed for category-level learning, we propose for the first time a self-supervised learning framework to estimate category-level 6D object pose from single 3D point clouds. During training, our method assumes no ground-truth pose annotations, no CAD models, and no multi-view supervision. The key to our method is to disentangle shape and pose through an invariant shape reconstruction module and an equivariant pose estimation module, empowered by SE(3) equivariant point cloud networks. The invariant shape reconstruction module learns to perform aligned reconstructions, yielding a category-level reference frame without using any annotations. In addition, the equivariant pose estimation module achieves category-level pose estimation accuracy that is comparable to some fully supervised methods. Extensive experiments demonstrate the effectiveness of our approach on both complete and partial depth point clouds from the ModelNet40 benchmark, and on real depth point clouds from the NOCS-REAL 275 dataset.
Li, Xiaolong, Wang, He, Yi, Li, Guibas, Leonidas J, Abbott, A Lynn, and Song, Shuran
Category-Level Articulated Object Pose Estimation

CVPR 2020

Oral Presentation (5.1%)

This paper addresses the task of category-level pose estimation for articulated objects from a single depth image. We present a novel category-level approach that correctly accommodates object instances previously unseen during training. We introduce Articulation-aware Normalized Coordinate Space Hierarchy (ANCSH), a canonical representation for different articulated objects in a given category. As the key to achieving intra-category generalization, the representation constructs a canonical object space as well as a set of canonical part spaces. The canonical object space normalizes the object orientation, scales and articulations (e.g. joint parameters and states) while each canonical part space further normalizes its part pose and scale. We develop a deep network based on PointNet++ that predicts ANCSH from a single depth point cloud, including part segmentation, normalized coordinates, and joint parameters in the canonical object space. By leveraging the canonicalized joints, we demonstrate: 1) improved performance in part pose and scale estimations using the induced kinematic constraints from joints; 2) high accuracy for joint parameter estimation in camera space.
Porwal, Prasanna, Pachade, Samiksha, Kokare, Manesh, Deshmukh, Girish .., Li, Xiaolong, and others
IDRiD: Diabetic Retinopathy – Segmentation and Grading Challenge

Medical image analysis 2020

Diabetic Retinopathy (DR) is the most common cause of avoidable vision loss, predominantly affecting the working-age population across the globe. Screening for DR, coupled with timely consultation and treatment, is a globally trusted policy to avoid vision loss. However, implementation of DR screening programs is challenging due to the scarcity of medical professionals able to screen a growing global diabetic population at risk for DR. Computer-aided disease diagnosis in retinal image analysis could provide a sustainable approach for such large-scale screening effort. The recent scientific advances in computing capacity and machine learning approaches provide an avenue for biomedical scientists to reach this goal. Aiming to advance the state-of-the-art in automatic DR diagnosis, a grand challenge on “Diabetic Retinopathy – Segmentation and Grading” was organized in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI - 2018). In this paper, we report the set-up and results of this challenge that is primarily based on Indian Diabetic Retinopathy Image Dataset (IDRiD). There were three principal sub-challenges: lesion segmentation, disease severity grading, and localization of retinal landmarks and segmentation. These multiple tasks in this challenge allow to test the generalizability of algorithms, and this is what makes it different from existing ones. It received a positive response from the scientific community with 148 submissions from 495 registrations effectively entered in this challenge. This paper outlines the challenge, its organization, the dataset used, evaluation methods and results of top-performing participating solutions. The top-performing approaches utilized a blend of clinical information, data augmentation, and an ensemble of models. These findings have the potential to enable new developments in retinal image analysis and image-based DR screening in particular.
Wu, Ziling, Li, Xiaolong, and Zhu, Yunhui
Texture orientation-resolving imaging with structure illumination

In Computational Imaging II 2017

Chen, Muhao, Gong, Chen, Li, Xiaolong, and Yu, Zongxin
Research on solving Traveling Salesman Problem based on virtual instrument technology and genetic-annealing algorithms

In 2015 Chinese Automation Congress (CAC) 2015

Services

I serve as a reviewer for JEI, TIP, ICCV 2021, ICLR 2022, CVPR 2022, CVPR 2023, ICML 2023, NeurIPS 2023, 3DV 2023, and 3DV 2024.