X-Capture: An Open-Source Portable Device for Multi-Sensory Learning

Stanford University

Abstract

Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments show that both the quantity and the sensory breadth of our data benefit pretraining and fine-tuning of multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.

Dataset

A full release of our dataset is coming soon (contact Samuel Clarke for more information). We include some samples below.

Sample objects: cat brush, computer speaker, metal box, insulated cup, storage bowl.
Each object has six collection points. Each collection point provides rgb, touch, depth, and audio data, with a target point indicated for the rgb and depth views.
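
The layout described above (objects, six collection points per object, four modalities per point) can be sketched as a small record type. This is only an illustrative sketch: the class names, the `make_placeholder_object` helper, and the file paths are hypothetical and do not reflect the format of the actual dataset release, which is not yet public.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Modalities listed on this page for each collection point.
MODALITIES = ("rgb", "depth", "touch", "audio")
# Six collection points per object: 500 objects x 6 = 3,000 points.
POINTS_PER_OBJECT = 6


@dataclass
class CollectionPoint:
    """One contact location on an object (hypothetical record type)."""
    index: int
    files: Dict[str, str]  # modality name -> hypothetical relative path

    def missing_modalities(self) -> List[str]:
        # Modalities that have no file registered for this point.
        return [m for m in MODALITIES if m not in self.files]


@dataclass
class ObjectSample:
    """All collection points recorded for one object (hypothetical)."""
    name: str
    points: List[CollectionPoint] = field(default_factory=list)

    def is_complete(self) -> bool:
        # Complete means six points, each with all four modalities.
        return len(self.points) == POINTS_PER_OBJECT and all(
            not p.missing_modalities() for p in self.points
        )


def make_placeholder_object(name: str) -> ObjectSample:
    """Build a sample with made-up file paths, for illustration only."""
    slug = name.replace(" ", "_")
    points = [
        CollectionPoint(
            index=i,
            files={m: f"{slug}/point_{i}/{m}.bin" for m in MODALITIES},
        )
        for i in range(POINTS_PER_OBJECT)
    ]
    return ObjectSample(name=name, points=points)


sample = make_placeholder_object("cat brush")
print(sample.name, len(sample.points), sample.is_complete())
```

Grouping the four modalities under a single collection point keeps the correlated rgb, depth, touch, and audio readings for one contact location together, which is the property the device is designed to capture.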

Hardware

We develop custom hardware to fit vision, touch, and audio sensing into a handheld package. The video below gives a breakdown of the device hardware and shows it in action. A full open-source release of our hardware designs is coming soon (contact Samuel Clarke for more information).

Results

X-to 2D/3D Generation
Example objects: glass tube, bottle opener, watch, key fob. An input (rgb, audio, or touch) is passed to the encoder, and the encoder output is passed to Shap-E and Stable Diffusion.

Zero-Shot Audio-Based Detection
[Interactive demo]

Video

BibTeX

@misc{clarke2025xcapture,
    title={X-Capture: An Open-Source Portable Device for Multi-Sensory Learning},
    author={Samuel Clarke and Suzannah Wistreich and Yanjie Ze and Jiajun Wu},
    year={2025},
    eprint={2504.02318},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2504.02318},
}