As personalised immersive display systems are intensely explored, plausible 3D audio corresponding to the visual content is required to provide more realistic experiences to users. It is well known that spatial audio synchronised with visual information improves the sense of immersion, but there has been only a limited amount of research on the production and reproduction of immersive audio-visual content. In this paper, we propose an end-to-end pipeline that simultaneously reconstructs the 3D geometry and acoustic properties of an environment from a pair of omni-directional panoramic images. A semantic scene reconstruction and completion method using a deep convolutional neural network is proposed to estimate the complete semantic scene geometry, so that spatial audio reproduction can be adapted to the scene. Experiments provide objective and subjective evaluations of the proposed pipeline for plausible audio-visual reproduction of real scenes.
EdgeNet360: Semantic Scene Completion from a Single 360 degree Image and Depth Map
GitLab Link
Original sound sources were recorded in an anechoic environment: a gun shot and a swept-sine signal for RIR measurement (a deconvolution sketch follows the downloads below), plus speech from the TIMIT dataset [2] and music (clarinet) from the OpenAirLib library [3] for sound rendering.
Original sound sources recorded in an anechoic chamber - Download
Original Sweep sound for RIR measurement | Original Gunshot sound for RIR measurement |
Original Music (Clarinet) sound | Original Speech sound |
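The RIRs were measured with the swept-sine technique. For reference, below is a minimal sketch of the standard exponential-sine-sweep deconvolution (Farina's method) that turns a recorded sweep into an impulse response; the function names and frequency band are illustrative, not part of the released tools:

```python
import numpy as np
from scipy.signal import fftconvolve

def ess_inverse_filter(sweep, fs, f1, f2):
    """Inverse filter for an exponential sine sweep (Farina method)."""
    T = len(sweep) / fs
    t = np.arange(len(sweep)) / fs
    R = np.log(f2 / f1)                      # log frequency ratio
    # Attenuate the time-reversed sweep by 6 dB/octave so that
    # convolving it with the recorded sweep flattens the spectrum.
    inv = sweep[::-1] * np.exp(-t * R / T)
    return inv / np.max(np.abs(inv))

def sweep_to_rir(recorded, sweep, fs, f1=20.0, f2=20000.0):
    """Deconvolve a recorded sweep into a room impulse response."""
    inv = ess_inverse_filter(sweep, fs, f1, f2)
    rir = fftconvolve(recorded, inv, mode="full")
    return rir[len(sweep) - 1:]              # keep the causal part
```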
Meeting Room is representative of a typical domestic living room environment, approximately 5.6 m × 4.3 m × 2.3 m
Data:
Input 360 camera image - Top
Input 360 camera image - Bottom
Estimated Depth Map
Recorded (Ground-Truth) room impulse responses (SOFA format)
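The RIRs follow the SOFA convention. A minimal loading sketch, assuming the sofar Python package (the file name is a placeholder):

```python
import sofar

rirs = sofar.read_sofa("meeting_room_rirs.sofa")  # placeholder path

print(rirs.Data_IR.shape)       # (measurements, receivers, samples)
print(rirs.Data_SamplingRate)   # sampling rate in Hz
print(rirs.SourcePosition)      # source position for each measurement
```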
Results:
Semantic scene reconstruction result - Download (obj format with face colour)
Audio rendering results for both Music and Speech - Download
Result samples: (volume may vary)
0. Original Music source | 1. Ground-truth rendering by RIR |
2. Rendering with Kim19 [1] | 3. Rendering with the proposed method |
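The ground-truth renderings are obtained by convolving the anechoic sources with the measured RIRs. A minimal mono sketch of that step, assuming the soundfile and scipy packages (file names are placeholders; the released renderings may use more channels):

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

source, fs = sf.read("clarinet_anechoic.wav")   # dry source (placeholder)
rir, fs_rir = sf.read("meeting_room_rir.wav")   # measured RIR (placeholder)
assert fs == fs_rir, "resample one of the signals if the rates differ"

wet = fftconvolve(source, rir, mode="full")     # reverberant rendering
wet /= np.max(np.abs(wet))                      # normalise to avoid clipping
sf.write("clarinet_meeting_room.wav", wet, fs)
```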
Kitchen is a long, narrow room with kitchen utensils, approximately 6.6 m × 3.4 m × 2.7 m
Data:
Input 360 camera image - Top
Input 360 camera image - Bottom
Estimated Depth Map
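The depth maps are estimated from the vertically displaced top/bottom panorama pair. The sketch below is not the paper's estimation method; it only illustrates the underlying triangulation geometry for a vertical-baseline equirectangular pair, with illustrative function names:

```python
import numpy as np

def row_to_elevation(v, height):
    """Map an equirectangular image row to an elevation angle in radians."""
    return (0.5 - (v + 0.5) / height) * np.pi

def equirect_range(phi_top, phi_bottom, baseline):
    """Triangulate range from the elevation angles of the same scene point
    seen by the upper and lower camera (vertical baseline in metres)."""
    d = baseline / (np.tan(phi_bottom) - np.tan(phi_top))  # horizontal distance
    return d / np.cos(phi_bottom)                          # range, bottom camera

# Example: 0.2 m baseline, matched rows at +/-5 degrees elevation
print(equirect_range(np.deg2rad(-5.0), np.deg2rad(5.0), 0.2))
```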
Results:
Semantic scene reconstruction result (obj format with face colour; a loading sketch follows below)
Audio rendering results (1. Ground-truth rendering by recorded RIR / 2. Rendering with Kim19 [1] / 3. Rendering with the proposed method)
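A minimal sketch for inspecting the reconstruction result, assuming the trimesh package; whether per-face colours survive loading depends on how they are encoded in the OBJ file (the file name is a placeholder):

```python
import trimesh

mesh = trimesh.load("kitchen_reconstruction.obj", force="mesh")  # placeholder
print(mesh.vertices.shape, mesh.faces.shape)

# Per-face semantic colours, if the loader picked them up
colors = getattr(mesh.visual, "face_colors", None)
if colors is not None:
    print("face colours:", colors.shape)   # (n_faces, 4) RGBA
```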
Listening Room is an acoustically controlled experimental room, approximately 5.6 m × 5.1 m × 2.9 m
Data:
Input 360 camera image - Top
Input 360 camera image - Bottom
Estimated Depth Map
Recorded (Ground-Truth) room impulse responses (SOFA format)
Results:
Semantic scene reconstruction result (obj format with face colour)
Audio rendering results (1. Ground-truth rendering by recorded RIR / 2. Rendering with Kim19 [1] / 3. Rendering with the proposed method)
Studio Hall is a large hall, approximately 17.1 m × 14.6 m × 6.5 m
Data:
Input 360 camera image - Top
Input 360 camera image - Bottom
Estimated Depth Map
Results:
Semantic scene reconstruction result (obj format with face colour)
Audio rendering results (1. Ground-truth rendering by recorded RIR / 2. Rendering with Kim19 [1] / 3. Rendering with the proposed method)
Usability Lab is another room similar to a typical living room, approximately 5.6 m × 5.2 m × 2.9 m
Data:
Input 360 camera image - Top
Input 360 camera image - Bottom
Estimated Depth Map
Recorded (Ground-Truth) room impulse responses (SOFA format)
Results:
Semantic scene reconstruction result (obj format with face colour)
Audio rendering results (1. Ground-truth rendering by recorded RIR / 2. Rendering with Kim19 [1] / 3. Rendering with the proposed method)
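The proposed method renders audio from the reconstructed geometry and estimated acoustic properties. As a rough sanity check against the room dimensions listed above, Sabine's formula relates room volume and surface absorption to reverberation time; the sketch below uses a single assumed mean absorption coefficient, whereas a semantic pipeline would assign coefficients per surface class:

```python
def sabine_rt60(dims, mean_absorption):
    """Sabine reverberation time T60 = 0.161 * V / (S * alpha)
    for a shoebox room with dimensions (L, W, H) in metres."""
    L, W, H = dims
    volume = L * W * H
    surface = 2 * (L * W + L * H + W * H)
    return 0.161 * volume / (surface * mean_absorption)

# Meeting Room dimensions from above; 0.3 is an assumed mean absorption
print(sabine_rt60((5.6, 4.3, 2.3), 0.3))   # ~0.32 s
```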