S3A Audio-Visual Scene Analysis datasets and resources

Updated: 7 June 2018
Contact people: Hansung Kim and Luca Remaggi

This is a web-page to distribute Audio-Visual Scene Analysis datasets created in the EPSRC S3A project.

This page includes data that enables the reproduction of results of the papers below, and it is indexed with DOI 10.15126/surreydata.00812228

The source of the datasets should be acknowledged in all publications in which it is used as by referencing the following paper [2] and [3] and this web-site:


Dataset

- Listening Room

The Listening Room is a controlled listening environment approximately 5.6m x 5.0m x 2.9m with loudspeakers surrounding a central listening position.


Data:
Input 360 camera image - Top Input 360 camera image - Bottom
Estimated Depth Map
Object recognition result
Reconstructed 3D geometry model with [3] (obj format)
Reconstructed 3D geometry model with [4] (obj format)
Recorded room impulse responses (sofa format)


Rendering Results - Music

Original music recorded in an anechoic chamber Music rendered with Ground Truth RIR
Music rendered with the synthesized RIR by [3] Music rendered with the synthesized RIR by [4]

Rendering Results - Speech

Original speech recorded in an anechoic chamber Speech rendered with Ground Truth RIR
Speech rendered with the synthesized RIR by [3] Speech rendered with the synthesized RIR by [4]

- Usability Lab

The Usability Lab is representative of typical domestic living room environments approximately 5.6m x 5.2m x 2.9m


Data:
Input 360 camera image - Top Input 360 camera image - Bottom
Estimated Depth Map
Object recognition result
Reconstructed 3D geometry model with [3] (obj format)
Reconstructed 3D geometry model with [4] (obj format)
Recorded room impulse responses (sofa format)

Rendering Results - Music

Original music recorded in an anechoic chamber Music rendered with Ground Truth RIR
Music rendered with the synthesized RIR by [3] Music rendered with the synthesized RIR by [4]

Rendering Results - Speech

Original speech recorded in an anechoic chamber Speech rendered with Ground Truth RIR
Speech rendered with the synthesized RIR by [3] Speech rendered with the synthesized RIR by [4]

- Meeting Room

The Meeting Room is another representative of typical domestic living room environments approximately 5.6m x 4.3m x 2.3m

Data:
Input 360 camera image - Top Input 360 camera image - Bottom
Estimated Depth Map
Object recognition result
Reconstructed 3D geometry model with [3] (obj format)
Reconstructed 3D geometry model with [4] (obj format)
Recorded room impulse responses (sofa format)

Rendering Results - Music

Original music recorded in an anechoic chamber Music rendered with Ground Truth RIR
Music rendered with the synthesized RIR by [3] Music rendered with the synthesized RIR by [4]

Rendering Results - Speech

Original speech recorded in an anechoic chamber Speech rendered with Ground Truth RIR
Speech rendered with the synthesized RIR by [3] Speech rendered with the synthesized RIR by [4]

- Studio

The Meeting Room is another representative of typical domestic living room environments approximately 5.6m x 4.3m x 2.3m

Data:
Input 360 camera image - Top Input 360 camera image - Bottom
Estimated Depth Map
Object recognition result
Reconstructed 3D geometry model with [4] (obj format)
Recorded room impulse responses (sofa format)

Rendering Results - Music

Original music recorded in an anechoic chamber Music rendered with Ground Truth RIR
Music rendered with the synthesized RIR by [3] Music rendered with the synthesized RIR by [4]

Rendering Results - Speech

Original speech recorded in an anechoic chamber Speech rendered with Ground Truth RIR
Speech rendered with the synthesized RIR by [3] Speech rendered with the synthesized RIR by [4]


Object recognition data

14 classes = 'none', 'bed', 'books', 'ceiling', 'chair', 'floor', 'furniture', 'objects', 'picture', 'sofa', 'table', 'tv', 'unknown', 'wall', 'window'



Codes and executables

Depth estimation code (require OpenCV to compile)
Room Model Reconstruction for [3] (Windows Executable and sample data)
Full Reconstruction pipeline for [4] (Windows Executable and Samples only)
Full Reconstruction pipeline for [4] (Full code included)
* Simple User-Interactive version of Room model reconstruction which doesn't require the SegNet setting will be released in May



Related publications

[1] H. Kim and A. Hilton, "Block World Reconstruction from Spherical Stereo Image Pairs," Computer Vision and Image Understanding, vol.139, pp.104-121, Oct. 2015
[2] H. Kim, T.d. Campos and A. Hilton, "Room Layout Estimation with Object and Material Attributes Information using a Spherical Camera," Proc. 3DV, Oct. 2016
[3] H. Kim, R.J. Hughes, L. Remaggi, P. JB Jackson, A. Hilton, T.J. Cox and B. Shirley, "Acoustic Room Modelling using a Spherical Camera for Reverberant Spatial Audio Objects," Proc. 142nd AES, Berlin, May 2017
[4] H. Kim, L. Remaggi, S. Fowler, P. JB Jackson, A. Hilton, "Acoustic Room Modelling using 360 Stereo Cameras," IEEE Trans. Multimedia, 2019 (Submitted)


Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
EPSRC S3A
To contact us: Dr. Hansung Kim (h.kim@surrey.ac.uk)