Abstract

As personalised immersive display systems have been intensely explored, plausible 3D audio that corresponds to the visual content is required to provide users with more realistic experiences. It is well known that spatial audio synchronised with visual information improves the sense of immersion, but research on the production and reproduction of immersive audio-visual content remains limited. In this paper, we propose an end-to-end pipeline that simultaneously reconstructs the 3D geometry and acoustic properties of an environment from a pair of omni-directional panoramic images. A semantic scene reconstruction and completion method based on a deep convolutional neural network is proposed to estimate complete semantic scene geometry, enabling spatial audio reproduction to be adapted to the scene. Experiments provide objective and subjective evaluations of the proposed pipeline for plausible audio-visual reproduction of real scenes.


Source code

EdgeNet360: Semantic Scene Completion from a Single 360 degree Image and Depth Map
GitLab Link


Dataset

- Original sound sources

Original sound sources were recorded in an anechoic environment. The Gun_shot and Swept-sine_signal recordings are used for RIR measurement (a deconvolution sketch follows the download links below); Speech from the TIMIT dataset [2] and Music (clarinet) from the OpenAirLib library [3] are used for sound rendering.

-- Original sound sources recorded in an anechoic chamber - Download

Original Sweep sound for RIR measurement
Original Gunshot sound for RIR measurement
Original Music (Clarinet) sound
Original Speech sound
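
For reference, the minimal Python sketch below shows how an RIR can be estimated from a swept-sine measurement by regularised frequency-domain deconvolution (assuming NumPy and the soundfile package). The file names are hypothetical, and the procedure actually used to produce the published RIRs may differ.

# Minimal sketch: estimate a room impulse response (RIR) from a recorded
# swept-sine signal by frequency-domain deconvolution. File names are hypothetical.
import numpy as np
import soundfile as sf

sweep, fs = sf.read("original_sweep.wav")                  # dry sweep played by the loudspeaker
recorded, fs_rec = sf.read("sweep_recorded_in_room.wav")   # the same sweep recorded in the room
assert fs == fs_rec, "sample rates must match"

# Zero-pad both signals to a common length, then divide the spectra with a
# small regularisation term to avoid blowing up where the sweep has little energy.
n = len(sweep) + len(recorded) - 1
SWEEP = np.fft.rfft(sweep, n)
REC = np.fft.rfft(recorded, n)
eps = 1e-8 * np.max(np.abs(SWEEP)) ** 2
rir = np.fft.irfft(REC * np.conj(SWEEP) / (np.abs(SWEEP) ** 2 + eps), n)

# Keep the first second of the estimate and save it.
sf.write("estimated_rir.wav", rir[:fs], fs)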

- Meeting Room (MR)

The Meeting Room is representative of a typical domestic living room environment, approximately 5.6m x 4.3m x 2.3m.



Data:
Input 360 camera image - Top
Input 360 camera image - Bottom
Estimated Depth Map
Recorded (Ground-Truth) room impulse responses (SOFA format; a loading sketch follows this list)
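
The ground-truth RIRs are distributed in SOFA format; a minimal sketch of loading them is given below, assuming the Python sofar package (one of several SOFA readers). The file name is hypothetical.

# Minimal sketch: inspect the ground-truth RIRs distributed in SOFA format.
# Assumes the sofar package (pip install sofar); the file name is hypothetical.
import sofar

sofa = sofar.read_sofa("meeting_room_rirs.sofa")

# SOFA stores impulse responses as Data.IR with shape
# (measurements M, receivers R, samples N).
rirs = sofa.Data_IR
fs = sofa.Data_SamplingRate
print(rirs.shape, fs)

# Source and listener positions for each measurement are stored alongside.
print(sofa.SourcePosition)
print(sofa.ListenerPosition)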

Results:
Semantic scene reconstruction result - Download (obj format with face colour)
Audio rendering results for both Music and Speech - Download

Result samples (playback volume may vary):

0. Original Music source
1. Ground-truth rendering by the recorded RIR (see the convolution sketch after this list)
2. Rendering with Kim19 [1]
3. Rendering with the proposed method
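
The ground-truth samples are rendered with the recorded RIRs; the sketch below illustrates the basic idea with a simple mono convolution using SciPy. The file names are hypothetical, and the spatial (e.g. binaural) rendering used for the published samples is more involved.

# Minimal sketch: auralise an anechoic source with a measured RIR by convolution.
# File names are hypothetical; this mono convolution only illustrates the idea.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, fs = sf.read("original_clarinet.wav")
rir, fs_rir = sf.read("meeting_room_rir.wav")
assert fs == fs_rir, "resample one of the signals if the rates differ"

wet = fftconvolve(dry, rir)
wet /= np.max(np.abs(wet)) + 1e-12   # normalise to avoid clipping
sf.write("clarinet_in_meeting_room.wav", wet, fs)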

- Kitchen (KT)

The Kitchen is a long and narrow room with kitchen utensils, approximately 6.6m x 3.4m x 2.7m.



Data:
Input 360 camera image - Top
Input 360 camera image - Bottom
Estimated Depth Map

Results:
Semantic scene reconstruction result (obj format with face colour)
Audio rendering results (1. Ground-truth rendering by recorded RIR / 2. Rendering with Kim19 [1] / 3. Rendering with the proposed method)

- Listening Room (LR)

The Listening Room is an acoustically controlled experimental room, approximately 5.6m x 5.1m x 2.9m.



Data:
Input 360 camera image - Top
Input 360 camera image - Bottom
Estimated Depth Map
Recorded (Ground-Truth) room impulse responses (SOFA format)

Results:
Semantic scene reconstruction result (obj format with face colour)
Audio rendering results (1. Ground-truth rendering by recorded RIR / 2. Rendering with Kim19 [1] / 3. Rendering with the proposed method)

- Studio (ST)

The Studio is a large hall, approximately 17.1m x 14.6m x 6.5m.



Data:
Input 360 camera image - Top
Input 360 camera image - Bottom
Estimated Depth Map

Results:
Semantic scene reconstruction result (obj format with face colour)
Audio rendering results (1. Ground-truth rendering by recorded RIR / 2. Rendering with Kim19 [1] / 3. Rendering with the proposed method)

- Usability Lab (UL)

The Usability Lab is another room similar to a typical living room, approximately 5.6m x 5.2m x 2.9m.



Data:
Input 360 camera image - Top
Input 360 camera image - Bottom
Estimated Depth Map
Recorded (Ground-Truth) room impulse responses (SOFA format)

Results:
Semantic scene reconstruction result (obj format with face colour)
Audio rendering results (1. Ground-truth rendering by recorded RIR / 2. Rendering with Kim19 [1] / 3. Rendering with the proposed method)


Acknowledgments

This work was supported by the EPSRC Programme Grant S3A: Future Spatial Audio for an Immersive Listener Experience at Home (EP/L000539/1) and the BBC as part of the BBC Audio Research Partnership. Details about the data underlying this work, along with the terms for data access, are available from: http://dx.doi.org/10.15126/surreydata.00812228

References

[1] H. Kim, L. Remaggi, P. J. Jackson, and A. Hilton, "Immersive spatial audio reproduction for VR/AR using room acoustic modelling from 360 images," in Proc. IEEE VR, 2019.
[2] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM," NIST Interagency, Tech. Rep., 1993.
[3] K. Brown, M. Paradis, and D. Murphy, "OpenAirLib: A JavaScript library for the acoustics of spaces," in Audio Engineering Society Convention 142, May 2017. [Online]. Available: http://www.aes.org/e-lib/browse.cfm?elib=18586