Understanding
Program Chair
Multimedia data types are by their very nature complex and often involve intertwined instances of different kinds of information. We can leverage this multimodal perspective to extract meaning and understanding from the world, often with valuable results.
Areas
Deep Learning for Multimedia
Multimodal/Multisensor Analysis and Description
Multimedia and Vision
Publicity Chair
TPC Members
TBD
Area: Deep Learning for Multimedia
Area Chairs
Deep Learning is an emerging field of Machine Learning that focuses on learning representations of data. It has recently found success in a variety of domains, from computer vision to speech recognition, natural language processing, web search ranking, and even online advertising. Deep Learning’s power comes from learning rich representations of data that can be tuned for the task of interest. The ability of Deep Learning methods to capture the semantics of data is, however, limited by both the complexity of the models and the intrinsic richness of the input. In particular, current methods consider only a single modality, leading to an impoverished model of the world. Yet sensory data are inherently multimodal: images are often associated with text; videos contain both visual and audio signals; text is often related to social content from public media; and so on. Exploiting cross-modal structure may yield a big leap forward in machine understanding of the world.
Learning from multimodal inputs is technically challenging because different modalities have different statistics and different kinds of representations. For instance, text is discrete and often represented by very large, sparse vectors, while images are represented by dense tensors that exhibit strong local correlations. Fortunately, Deep Learning holds the promise of learning adaptive representations from the input, potentially bridging the gap between these modalities.
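To make the contrast concrete, here is a minimal sketch (not part of the call; all dimensions, weight matrices, and function names are illustrative placeholders) of the kind of shared representation such methods aim to learn: a sparse text vector and a dense image feature vector are projected into a common embedding space, where cross-modal comparison reduces to a dot product. In a real system both projections would be learned from data rather than fixed at random.

```python
# Illustrative sketch only: project a sparse text vector and a dense image
# feature vector into a shared embedding space. Dimensions and weights are
# hypothetical; in practice both encoders would be trained end-to-end
# (e.g., with a ranking or reconstruction objective).
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 10_000      # text: very large, sparse bag-of-words vector
IMAGE_FEAT_DIM = 2_048   # image: dense feature vector (e.g., from a ConvNet)
EMBED_DIM = 128          # shared embedding space

# Randomly initialized projection matrices stand in for learned encoders.
W_text = rng.normal(scale=0.01, size=(VOCAB_SIZE, EMBED_DIM))
W_image = rng.normal(scale=0.01, size=(IMAGE_FEAT_DIM, EMBED_DIM))

def embed_text(bow: np.ndarray) -> np.ndarray:
    """Map a sparse bag-of-words vector into the shared space (L2-normalized)."""
    z = bow @ W_text
    return z / (np.linalg.norm(z) + 1e-8)

def embed_image(feat: np.ndarray) -> np.ndarray:
    """Map a dense image feature vector into the same shared space."""
    z = feat @ W_image
    return z / (np.linalg.norm(z) + 1e-8)

# Toy inputs: a text vector with a handful of active terms, one image vector.
text_bow = np.zeros(VOCAB_SIZE)
text_bow[rng.integers(0, VOCAB_SIZE, size=12)] = 1.0
image_feat = rng.normal(size=IMAGE_FEAT_DIM)

# Once both modalities live in the same space, cross-modal similarity is
# just a dot product -- the "bridge" a learned multimodal model provides.
similarity = float(embed_text(text_bow) @ embed_image(image_feat))
print(f"cross-modal similarity: {similarity:.4f}")
```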
In this area, we encourage submissions that effectively deploy Deep Learning to advance the state of the art in multimedia and related applications.
Topics of interest include, but are not restricted to:
Deep learning applications involving multiple modalities, such as images, videos, audio, text, clicks, or any other kind of (social) content and context
Deploying deep learning to learn features from multimodal inputs
Deploying deep learning to generate one modality from other modalities
Deep learning based methods that leverage multiple modalities and also account for temporal dynamics
Deploying deep learning to increase the robustness to missing modalities
Area: Multimodal/Multisensor Analysis and Description
Area Chairs
Analysis of multimedia content enables us to better understand what the content is about in order to improve its indexing, representation, and consumption for the purposes of retrieval, content creation/enhancement, and interactive applications. Research so far has mostly focused on mono-modal analysis of multimedia content, looking only at images, only at text, or only at video, while ignoring other modalities such as the text surrounding an image on a web page or the audio accompanying a video.
The goal of this area is to attract novel multimodal/multisensor analysis research that takes multiple modalities into account to analyze and better describe multimedia content. The different modalities may be temporally synchronized (e.g., video clips and corresponding audio transcripts, animations, multimedia presentations), spatially related (e.g., images embedded in text, object relationships in 3D space), or otherwise semantically connected (e.g., combined analysis of collections of videos, or sets of images created by one’s social network).
This area calls for submissions that reveal the information encoded in different modalities, combine this information in a non-trivial way, and exploit the combined information to significantly expand the current possibilities for handling and interacting with multimedia content. In addition, submitted works are expected to support effective and efficient interaction with large-scale multimedia collections and to span mobile and desktop environments in order to address the changing demands of multimedia consumers.
Topics of interest include, but are not restricted to:
Novel strategies for multimodal/multisensor analysis of multimedia data
Multimodal/multisensor feature extraction and fusion
Multimodal semantic concept detection, object recognition and segmentation
Multimodal approaches to detecting complex activities
Multimodal approaches to event analysis and modeling
Multimodal approaches to temporal or structural analysis of multimedia data
Machine learning for multimodal/multisensor analysis
Scalable processing and scalability issues in multimodal/multisensor content analysis
Advanced descriptors and similarity metrics exploiting the multimodal nature of multimedia data
Area: Multimedia and Vision
Area Chairs
Advances in computer vision, coupled with increased multicore processor performance, have made real-time vision-based multimedia systems feasible. How commodity off-the-shelf components and codecs that adapt to real-world bandwidth limits interact with the performance and stability of vision algorithms remains an important research area. Next-generation codecs need to be investigated and designed to support the needs of vision-based systems, which may be sensitive to different kinds of noise than human viewers are. Another rich avenue of research is the integration of related non-visual media, such as audio, pose estimates, and accelerometer data, into vision algorithms.
Topics include but are not limited to:
Video coding and streaming in support of vision applications
Compression tradeoffs and artifacts within vision algorithms
Integration of non-video information sources with vision
Vision-directed compression
Vision-based video quality models
Multimedia signal processing for vision
Vision-based multimedia analytics
Distributed and coordinated surveillance systems
Inter-media correspondence and geotagging
Multimedia vision applications and performance evaluation