ACASVA

Project Goals

ACASVA will address the challenging problem of autonomous cognition at the interface of vision and language. The practical goal is to generalise an existing system for tennis video annotation so as to be capable of the autonomous annotation of novel sports. This will be accomplished via the cross-modal bootstrapping of high-level visual/linguistic structures in a manner parallelling human capabilities, the determination of which constitutes the psychological component of the project.

A series of psychological experiments will thus establish:

A) How visual grammars are employed in learning game rules.

B) Whether these grammars are modified by prior linguistic knowledge of the domain.

C) At what stage of cognitive processing formal abstraction of visual features takes place.

D) How visual grammars map onto linguistic grammars.

E) How inferred high-level linguistic concepts (e.g. knowing the rules of an unfamiliar game) influence lower-level visual learning (e.g. gaze-specification).

A key psychological aim of the project is thus to examine various competing theoretical positions using a novel experimental paradigm linking eye-movement behaviour to rule-induction. We predict that if subjects' understanding of the dynamic scene is informed by their comprehension of environmental rules (as indicated by verbal reports), then that comprehension should correlate strongly with their eye-movement behaviour. Testing this prediction will yield a performance benchmark and methodological insight into the process of cross-modal cognitive bootstrapping, both of which can guide the construction of the automated system for tennis video annotation. The final goal of the engineering aspect of the project is thus a usable system for novel sport video annotation capable of user-specified querying of an automatically annotated database of footage.

Summary of Objectives

1. Creation of an adaptive multi-level framework for autonomous bootstrapping of high- and low-level visual representations within a constrained, rule-governed environment.

2. Creation of a linguistic recognition and parsing framework for sport radio commentary.

3. The determination, via the techniques of experimental psychology, of the nature of visual and audio cognitive representation/learning within rule-based domains.

4. Implementation of an adaptive multi-level framework for cross-modal language and vision bootstrapping.

5. Construction of a usable system for sport video annotation, capable of demonstrating transferable learning to environments that differ significantly at both the high and low levels of representation, e.g. singles tennis to doubles badminton.