TutorialT-02: Synthetic Data and Multimedia
Sun, 8 Oct, 09:00 - 13:00 Malaysia Time (UTC +8)
TutorialT-03: A Multi-Faceted View of Gradients in Neural Networks: Extraction, Interpretation and Applications in Image Understanding
TutorialT-07: Deep Learning-based HDR Imaging
Sun, 8 Oct, 14:00 - 18:00 Malaysia Time (UTC +8)
TutorialT-08: Present and Future of Video Compression Standards – Yes, there will be NN
TutorialT-10: Network-Empowered Scalable, Reliable and Secure Media Delivery
TutorialT-11: Edge Computing for Robot Vision: A Practical Perspective
Presented by: Nicola Conci, Niccolò Bisagno
The tutorial will take a holistic view of the ongoing research, the relevant issues, and the potential applications of synthetic data for multimedia data processing, whether used as a standalone resource or in combination with real data. In particular, the focus will be on the domain of images and videos, where the lack of representative data for specific problem categories has opened the door to relying on machine-generated content.
Image and video processing has seen rapid growth in the last decade, with remarkable improvements made possible by ever-increasing computing power as well as deep learning-based frameworks that now allow performance at and beyond human level in many applications, including detection, classification, and segmentation, to name a few.
However, the development of novel algorithms and solutions is strictly bound to the availability of a substantial amount of data, which must be representative of the task to be addressed.
In this respect, the literature has shown a rapid proliferation of datasets tackling a multitude of problems, from the simplest to the most complex. Some of them are widely adopted and currently recognized as the reference benchmarks against which all newly proposed methods must compete. As far as images are concerned, the most famous are (in order of complexity) MNIST, CIFAR-10, CIFAR-100, and ImageNet.
When dealing with videos, action/event recognition is among the earliest tasks addressed by the research community, and the most widespread and well-known datasets include the Weizmann dataset (for simple action recognition), UCF-101, UCF-Sports, and EgoKitchen, to name a few. In the surveillance domain, the CAVIAR and PETS datasets have been widely adopted and, more recently, the MOT Challenge has attracted the attention of many researchers because of the variety and diversity of contexts and situations in which detection and tracking solutions can be validated.
Still, there is an ever-growing demand for data, to which researchers respond with larger and larger datasets, at a huge cost in terms of acquisition, storage, and annotation of images and clips. Moreover, when dealing with complex problems, it is common to validate the developed algorithms across different datasets, facing inconsistencies in annotations (e.g., segmentation maps vs. bounding boxes) and the use of different standards (e.g., the number of human-skeleton joints in OpenPose and SMPL).
The use of synthetically generated data can overcome such limitations, as the generation engine can be designed to fulfill an arbitrary number of requirements at the same time. For example, the same bounding box can hold for multiple viewpoints of the same object/scene; the 3D position of the object is always known, as well as its volume, appearance, and motion features. These considerations have motivated the adoption of computer-generated content to satisfy two main requirements: (a) visual fidelity and (b) behavioral fidelity.
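To make the "annotations for free" point concrete, the sketch below shows how a synthetic engine can emit a pixel-accurate 2D bounding box directly from an object's known 3D corners. The pinhole-camera parameters and the cube are illustrative assumptions, not taken from the tutorial.

```python
# Sketch: in a synthetic engine the object's 3-D corners are known exactly,
# so a tight 2-D bounding box can be generated for any camera at no
# annotation cost. Pinhole model with illustrative intrinsics.

def project(point, f=500.0, cx=320.0, cy=240.0):
    """Project a camera-space 3-D point (x, y, z), z > 0, to pixel coords."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)

def bbox_from_corners(corners_3d):
    """Axis-aligned 2-D box (u_min, v_min, u_max, v_max) around the corners."""
    pts = [project(p) for p in corners_3d]
    us = [u for u, _ in pts]
    vs = [v for _, v in pts]
    return (min(us), min(vs), max(us), max(vs))

# A unit cube centered 5 m in front of the camera:
cube = [(x, y, 5.0 + z) for x in (-0.5, 0.5)
                        for y in (-0.5, 0.5)
                        for z in (-0.5, 0.5)]
box = bbox_from_corners(cube)
print(box)  # a box straddling the principal point (320, 240)
```

The same routine can be rerun for every viewpoint the engine renders, which is exactly why annotation consistency comes for free with synthetic data.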
Synthetic data, however, introduces a domain gap with respect to real imagery. To cope with this problem, researchers have investigated efficient solutions, including fine-tuning and domain adaptation.
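As a toy illustration of the fine-tuning idea, the sketch below pretrains a linear model on abundant "synthetic" data and then adapts it with a few "real" samples drawn from a shifted distribution. The data, learning rate, and epoch counts are all illustrative assumptions, not from the tutorial.

```python
# Minimal sketch: pretrain a 1-D linear model on plentiful "synthetic" data,
# then fine-tune on a handful of "real" samples whose distribution is shifted.

def fit(w, b, xs, ys, lr=0.01, epochs=500):
    """Plain gradient descent on mean squared error."""
    n = len(xs)
    for _ in range(epochs):
        gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Synthetic data: y = 2x (the simulator's idealized world).
syn_x = [i / 10 for i in range(100)]
syn_y = [2 * x for x in syn_x]
w, b = fit(0.0, 0.0, syn_x, syn_y)              # pretraining

# Real data: y = 2x + 1 (a constant domain shift the simulator missed).
real_x = [1.0, 2.0, 3.0]
real_y = [2 * x + 1 for x in real_x]
w, b = fit(w, b, real_x, real_y, epochs=2000)   # fine-tuning on few samples

print(w, b)  # w ≈ 2, b ≈ 1 after adaptation
```

The key point mirrors the practical recipe: most of the fitting happens on cheap synthetic data, and only a small amount of costly real data is needed to close the gap.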
The tutorial will cover a number of topics dealing with the current use of datasets in a topic-wise fashion, together with the corresponding state-of-the-art methodologies. A tentative list of the topics is reported hereafter:
- image-based datasets and complexity
- video-based datasets and complexity
- limitations and need for adaptation
- synthetic datasets, pros and cons
- complementing real data with synthetic datasets
- fine tuning, domain adaptation, unsupervised learning
The focus of the tutorial will be technical: we aim to give participants a broad view of research and important topics for developing efficient algorithms and solutions capable of combining real and synthetic data to solve complex problems.
The attendees will be provided with the presentation slides, together with a comprehensive list of papers and reports of interest, aimed, on the one hand, at acquainting attendees with the topic and, on the other, at promoting research in this interesting interdisciplinary area.
Nicola Conci is an Associate Professor at the Department of Information Engineering and Computer Science, University of Trento, where he teaches Computer Vision and Signal Processing. He received his Ph.D. in 2007 from the same university. In 2007 he was a visiting student at the Image Processing Lab at the University of California, Santa Barbara. In 2008 and 2009 he was a post-doctoral researcher in the Multimedia and Vision research group at Queen Mary University of London.
Prof. Conci has authored and co-authored more than 130 papers in peer-reviewed journals and conferences. His current research interests relate to video analysis and computer vision applications for behavioral understanding and monitoring; he coordinates a team of six Ph.D. students, one post-doc, and two junior researchers.
At the University of Trento he coordinates the M.Sc. degree in Information and Communications Engineering, is a member of the executive committee of the IECS Doctoral School, and is the department's delegate for research activities related to the Winter Olympic Games Milano-Cortina 2026.
He has served as Co-chair of several conferences, including the 1st and 2nd International Workshop on Computer Vision for Winter Sports, hosted at IEEE WACV 2022 and 2023, General Co-Chair of the International Conference on Distributed Smart Cameras 2019, General Co-Chair of the Symposium Signal Processing for Understanding Crowd Dynamics, held at IEEE AVSS 2017, and Technical Program Co-Chair of the Symposium Signal Processing for Understanding Crowd Dynamics, IEEE GlobalSip 2016.
Niccolò Bisagno received his Ph.D. in 2020 from the ICT International Doctoral School of the University of Trento, Italy, with the thesis “On simulating and predicting pedestrian trajectories in a crowd”. In 2019 he was a visiting Ph.D. student at the University of Central Florida, Orlando, USA. In 2018 he was a visiting Ph.D. student at the Alpen-Adria-Universität, Klagenfurt, Austria.
His research centers on crowd analysis, with emphasis on pedestrian trajectory prediction and crowd simulation in virtual environments. He is also interested in machine learning and computer vision, with special focus on biologically-inspired deep learning architectures and sports analysis applications.
Presented by: Ghassan AlRegib, Mohit Prabhushankar
In this tutorial, we motivate, analyze, and apply gradients of neural networks as features to understand image data. Traditionally, gradients are utilized as a computationally effective methodology to learn billions of parameters in large-scale neural networks. Recently, gradients in neural networks have also shown applicability in understanding and evaluating trained networks. For example, while gradients with respect to network parameters are used for learning image semantics, gradients with respect to input images are used to attack the network by crafting adversarial data. Similarly, gradients with respect to logits provide predictive explanations, while gradients with respect to the loss function provide contrastive explanations. We hypothesize that once a neural network is trained, it acts as a knowledge base through which different types of gradients can be used to traverse adversarial, contrastive, explanatory, and counterfactual representation spaces. Several image understanding and robustness applications, including anomaly, novelty, adversarial, and out-of-distribution image detection, as well as noise recognition experiments, among others, use multiple types of gradients as features. In this tutorial, we examine the types, visual meanings, and interpretations of gradients along with their applicability in multiple applications.
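A minimal, self-contained illustration of the central idea, namely that the same backward pass yields both parameter gradients (for learning) and input gradients (for adversarial and explanatory analysis), can be written for a single sigmoid neuron. The weights and input below are illustrative, not taken from the tutorial.

```python
import math

# Toy single-neuron "network": p = sigmoid(w*x + b), loss = -log(p) for a
# positive label. The gradients are derived by hand from the chain rule.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, x = 0.8, -0.2, 1.5
p = sigmoid(w * x + b)      # forward pass
dldz = p - 1.0              # d(-log p)/dz for the positive class

grad_w = dldz * x           # gradient w.r.t. the weight: drives learning
grad_x = dldz * w           # gradient w.r.t. the input: saliency / attack direction

# An FGSM-style one-step perturbation moves x along sign(grad_x):
eps = 0.1
x_adv = x + eps * (1 if grad_x > 0 else -1)
print(grad_w, grad_x, x_adv)
```

The same quantity `dldz`, backpropagated once, feeds both uses; which gradient one reads off simply depends on where the chain rule is stopped.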
The tutorial is composed of four major parts. Part 1 discusses the different interpretations of gradients extracted from trained neural networks with respect to input data, loss, and logits. Part 2 covers in detail a theoretical analysis of gradients. Part 3 describes the utility of gradient types in robustness applications of detection, recognition, and explanation. Newer and emerging fields like machine teaching and active learning will be discussed alongside methodologies that use gradients. Part 4 connects human visual perception with machine perception. Specifically, we discuss the expectancy-mismatch principle from neuroscience and examine it empirically with respect to gradients. Results from Image Quality Assessment and Human Visual Saliency will be discussed to demonstrate the value of gradient-based methods. The outline as well as the expected time for each part is presented below.
- Part 1: Types of gradient information in neural networks (1.5 hrs)
- In the numerator: Gradients by backpropagating logits, activations, and empirical loss.
- In the denominator: Gradients with respect to inputs, activations, and network parameters
- Confounding labels: Backpropagating the wrong classes and their effect on contrastive and counterfactual representations
- Gradients as information in neural networks
- Gradients for epistemic (network based) and aleatoric (image based) uncertainty estimation
- Gradients as distance measures in representation spaces
- Detection: Adversarial, novelty, anomaly, and out-of-distribution detection
- Recognition: Recognition under noise, domain shift, calibration, and open-set recognition
- Explanations: Predictive, contrastive, and counterfactual explanations
- Emerging applications: Active Learning, Machine Teaching
- Expectancy mismatch principle and gradients based on confounding loss functions
- Human visual saliency
- Image Quality Assessment
Ghassan AlRegib is currently the John and Marilu McCarty Chair Professor in the School of Electrical and Computer Engineering at the Georgia Institute of Technology. He received the ECE Outstanding Junior Faculty Member Award in 2008 and the 2017 Denning Faculty Award for Global Engagement. His research group, the Omni Lab for Intelligent Visual Engineering and Science (OLIVES), works on research projects related to machine learning, image and video processing, image and video understanding, seismic interpretation, machine learning for ophthalmology, and video analytics. He has participated in several service activities within the IEEE and served as the TP Co-Chair for ICIP 2020. He is an IEEE Fellow.
Mohit Prabhushankar received his Ph.D. degree in electrical engineering from the Georgia Institute of Technology (Georgia Tech), Atlanta, Georgia, USA, in 2021. He is currently a Postdoctoral Research Fellow in the School of Electrical and Computer Engineering at Georgia Tech in the Omni Lab for Intelligent Visual Engineering and Science (OLIVES). He works in the fields of image processing, machine learning, active learning, healthcare, and robust and explainable AI. He is the recipient of the Best Paper Award at ICIP 2019 and the Top Viewed Special Session Paper Award at ICIP 2020. He is also the recipient of the ECE Outstanding Graduate Teaching Award, the CSIP Research Award, and the Roger P. Webb ECE Graduate Research Excellence Award, all in 2022.
Presented by: Francesco Banterle, Alessandro Artusi
In this tutorial, we introduce how the High Dynamic Range (HDR) imaging field has evolved in this new era in which machine learning approaches have become dominant. The main reason for this success is that machine learning and deep learning have automated many tedious tasks, achieving high-quality results that outperform classic methods.
After an introduction to classic HDR imaging and its open problems, we will summarize the main approaches for merging multiple exposures, single-image reconstruction (inverse tone mapping), tone mapping, and display visualization.
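As a reference point for the classic multiple-exposure merging step, the sketch below blends differently exposed observations of a single pixel into one radiance estimate, assuming a linear camera response and a hat-shaped confidence weight. The values and exposure times are illustrative; real pipelines also recover the camera response curve first.

```python
# Minimal sketch of classic multi-exposure HDR merging for one pixel.
# Each observation is divided by its exposure time to estimate radiance,
# and the estimates are blended with a weight that distrusts pixels that
# are nearly under- or over-exposed.

def hat_weight(v):
    """Triangle weight on [0, 1]: peaks at mid-gray, zero at the extremes."""
    return max(0.0, 1.0 - abs(2.0 * v - 1.0))

def merge_pixel(values, exposures):
    """values: normalized intensities in [0, 1]; exposures: seconds."""
    num = sum(hat_weight(v) * (v / t) for v, t in zip(values, exposures))
    den = sum(hat_weight(v) for v in values)
    return num / den if den > 0 else 0.0

# The same scene radiance captured at three exposure times; the 4 s shot
# saturates (value 1.0) and therefore receives zero weight.
radiance = merge_pixel([0.125, 0.5, 1.0], [0.25, 1.0, 4.0])
print(radiance)  # 0.5: both valid exposures agree on the radiance
```

Deep-learning approaches covered in the tutorial replace or augment exactly this hand-crafted weighting, especially where motion and saturation make the classic blend fail.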
Francesco Banterle is a Researcher at the Visual Computing Laboratory at ISTI-CNR, Italy. He received a Ph.D. in Engineering from the University of Warwick in 2009. During his Ph.D. he developed inverse tone mapping, which bridges the gap between Low Dynamic Range imaging and High Dynamic Range (HDR) imaging. He holds two patents: one was sold to Dolby, and the other was transferred to goHDR and then sold. His main research fields are HDR imaging (acquisition, tone mapping, HDR video compression, and HDR monitors), augmented reality on mobile, and image-based lighting. Recently, he has been applying deep learning to imaging and HDR imaging, proposing the first deep-learning-based metrics with and without reference. He is co-author of two books on imaging. The first, "Advanced High Dynamic Range Imaging" (first edition 2011, second edition 2017), is extensively used as a reference book in the field together with its MATLAB toolbox, the HDR Toolbox. The second, "Image Content Retargeting", shows how to re-target content to different displays in terms of colors, dynamic range, and spatial resolution.
Alessandro Artusi received a Ph.D. in Computer Science from the Vienna University of Technology in 2004. He is currently the Managing Director of the DeepCamera Lab at CYENS (Cyprus), which recently joined, as a founding member, the Moving Picture, Audio and Data Coding by Artificial Intelligence (MPAI) group, a not-for-profit standards organization established in Geneva. He is currently the Cyprus representative in the ISO/IEC SC 29 imaging/video compression standardization committee, also representing Cyprus in its working groups WG 4 and WG 5. Prior to this, he was a committee member of IST/37 of the British Standards Institution (BSI), representing the UK in the JPEG and MPEG committees. For his work on the JPEG XT standard, an image compression system for HDR content, he received the prestigious BSI Award. His research interests include visual perception, image/video processing, HDR technology, objective/subjective imaging/video evaluation, deep learning, computer vision, and color science, with a particular focus on deploying the next generation of imaging/video pipelines. He is also co-author of the "Advanced High Dynamic Range Imaging" book (first edition 2011, second edition 2017), a reference book in the HDR field, and author of the "Image Content Retargeting" book, which shows how to re-target content to different displays in terms of colors, dynamic range, and spatial resolution.
Presented by: Iole Moccagatta, Yan Ye
The state of video compression standards is strong and dynamic, and more compression is coming in their future. This tutorial will start with an introduction explaining why that is, followed by two parts. In the first part, we will review the two most recent video compression standards: AV1 and VVC. Before diving deep into AV1 and VVC tools and performance, a high-level overview of block-based video coding concepts and terminology will be presented. Deployment and market adoption of these two video codec standards will be covered as well. In the second part, we will present the status of exploratory activities carried out in MPEG/ITU-T and in the Alliance for Open Media (AOM) that look into new technologies, including NN-based ones, to improve compression and enable new applications. We will close the tutorial with conclusions and takeaways. A list of references will be provided for those interested in diving deeper into the exploratory activities.
Dr. Iole Moccagatta is a Principal Engineer at Intel working on HW Multimedia IPs that are integrated on Intel platforms. Prior to Intel she held the position of Senior Video Architect at NVIDIA, and that of Science Director at IMEC, Belgium.
Dr. Moccagatta has been a very active member of MPEG, ITU-T, and JPEG, where she has represented US interests and companies and made many technical contributions. A number of those have been included in MPEG and JPEG standards. She is currently Co-chair of the MPEG/ITU-T Joint Video Experts Team (JVET) Ad-Hoc Group on H.266/VVC Conformance and Co-editor of the H.266/VVC Conformance Testing document.
Dr. Moccagatta has also been an active participant of the Alliance for Open Media (AOM) AV1 Codec WG, where she has co-authored two adopted proposals. She currently represents Intel in the AOM Board.
Dr. Moccagatta is also serving as IEEE Signal Processing Society (SPS) Regional Director-at-Large Regions 1-6, supporting and advising Chapters and their officers, providing input on how to serve and engage the SPS community in general, and the SPS industry members in particular, and using her professional network to attract new volunteers to serve in SPS subcommittees and task forces.
Dr. Moccagatta is the author or co-author of more than 30 publications, 2 book chapters, and more than 10 talks and tutorials in the field of image and video coding. She holds more than 10 patents in the same fields. For more details, see Dr. Moccagatta's professional site at http://alfiole.users.sonic.net/iole/.
Dr. Moccagatta received a Diploma of Electronic Engineering from the University of Pavia, Italy, and a PhD from the Swiss Federal Institute of Technology in Lausanne, Switzerland.
Yan Ye is currently a Senior Director at Alibaba Group U.S. and the Head of Video Technology Lab of Alibaba’s Damo Academy in Sunnyvale California. Prior to Alibaba, she held various management and technical positions at InterDigital, Dolby Laboratories, and Qualcomm.
Throughout her career, Dr. Ye has been actively involved in developing international video coding and video streaming standards in the ITU-T SG16/Q.6 Video Coding Experts Group (VCEG) and the ISO/IEC JTC 1/SC 29 Moving Picture Experts Group (MPEG). She holds various chairperson positions in international and U.S. national standards development organizations: she is currently an Associate Rapporteur of ITU-T SG16/Q.6 (since 2020), the Group Chair of the INCITS/MPEG task group (since 2020), and a focus group chair of the ISO/IEC SC 29/AG 5 MPEG Visual Quality Assessment group (since 2020). She has made many technical contributions to well-known video coding and streaming standards such as H.264/AVC, H.265/HEVC, H.266/VVC, MPEG DASH, and MPEG OMAF. She is an Editor of the VVC test model, the 360Lib algorithm description, and the scalable and screen content coding extensions of the HEVC standard. She is a prolific inventor with hundreds of granted U.S. patents and patent applications, many of which are highly cited by other researchers and inventors in the field of video coding. She is the co-author of more than 60 conference and journal papers.
Dr. Ye is currently a Distinguished Industrial Speaker of the IEEE Signal Processing Society (since 2022). She was a guest editor of the IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) special section on “the joint Call for Proposals on video compression with capability beyond HEVC” in 2020 and the TCSVT special section on “Versatile Video Coding” in 2021. She has been a program committee member of the IEEE Data Compression Conference (DCC) since 2014 and has organized the special session on “advances in video coding” at DCC for more than five years. She is a conference subcommittee co-chair of the IEEE Visual Signal Processing and Communication Technical Committee (VSPC-TC) (since 2022) and was an area chair of “multimedia standards and related research” of the IEEE International Conference on Multimedia and Expo (ICME) in 2021, the publicity chair of the IEEE Visual Communications and Image Processing (VCIP) conference in 2021, an industry chair of the IEEE Picture Coding Symposium (PCS) in 2019, an organizing committee member of ICME in 2018, and a technical program committee member of PCS in 2013 and 2019.
Dr. Ye is devoted to multimedia standards development, hardware and software video codec implementations, as well as deep learning-based video research. Her research interests include advanced video coding, processing and streaming algorithms, real-time and immersive video communications, AR/VR/MR, and deep learning-based video coding, processing, and quality assessment algorithms.
Dr. Ye received her Ph.D. degree from the University of California, San Diego, in 2002, and her B.S. and M.S. degrees from the University of Science and Technology of China in 1994 and 1997, respectively.
Presented by: Ali C. Begen
HTTP adaptive streaming is a complex technology with dynamics that need to be studied thoroughly. The experience from deployments over the last 10+ years suggests that streaming clients typically operate in an unfettered, greedy mode and are not necessarily designed to behave well in environments where other clients exist or network conditions can change dramatically. This largely stems from the fact that clients make only indirect observations at the application (HTTP) layer (and, if at all, only limited ones at the transport layer).
Typically, there are three primary camps when it comes to scaling and improving streaming systems: (i) servers control clients' behavior/actions and the network uses appropriate QoS, (ii) servers and clients cooperate with each other and/or the network, or (iii) clients stay in control and no cooperation with the servers or network is needed as long as there is enough capacity in the network (said differently, use dumb servers and networks and throw more bandwidth at the problem). Intuitively, using hints should improve streaming since it helps the clients and servers take more appropriate actions. The improvement could be in terms of better viewer experience and supporting more viewers for the given amount of network resources, or the added capability to explicitly support controlled unfairness (as opposed to bitrate fairness) based on features such as content type, viewer profile, and display characteristics.
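As a toy illustration of a greedy client versus hint-assisted operation, the sketch below selects a bitrate from a ladder using measured throughput, optionally capped by a server/network hint. The ladder values, safety margin, and cap semantics are illustrative assumptions, not taken from the SAND or CMCD/CMSD specifications.

```python
# Sketch of a throughput-based ABR rule extended with an optional
# server/network "hint" that caps the achievable rung (e.g., to enforce
# controlled unfairness across competing clients).

LADDER_KBPS = [250, 750, 1500, 3000, 6000]   # illustrative bitrate ladder

def pick_bitrate(measured_kbps, hint_cap_kbps=None, margin=0.8):
    """Highest rung not exceeding margin * throughput, honoring the cap."""
    budget = measured_kbps * margin          # leave headroom for variability
    if hint_cap_kbps is not None:
        budget = min(budget, hint_cap_kbps)  # hint overrides greedy choice
    eligible = [r for r in LADDER_KBPS if r <= budget]
    return eligible[-1] if eligible else LADDER_KBPS[0]

print(pick_bitrate(4000))                      # 3000: unfettered greedy choice
print(pick_bitrate(4000, hint_cap_kbps=1500))  # 1500: hint enforces the cap
```

Even this trivial rule shows the mechanism: the hint lets the server or network shape client behavior without taking over rate adaptation entirely.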
In this tutorial, we will examine the progress made in this area over the last several years, primarily focusing on MPEG's Server and Network Assisted DASH (SAND) and CTA's Common Media Client Data (CMCD) and Common Media Server Data (CMSD) standards. We will also describe possible application scenarios and present an open-source sample implementation that attendees can use to explore this topic further in their own practical environments.
Upon attending this tutorial, the participants will have an overview and understanding of the following topics:
- Brief review of history of streaming, key problems and innovations
- Current standards, interoperability guidelines and deployment workflows
- Focus topics
- End-to-end system modeling and analysis
- Improvements in player algorithms
- Low-latency and omnidirectional streaming extensions
- Server-client collaboration
- Open problems and research directions
The slides will be distributed electronically to the participants.
Ali C. Begen is currently a computer science professor at Ozyegin University and a technical consultant in Comcast's Advanced Technology and Standards Group. Previously, he was a research and development engineer at Cisco. Begen received his PhD in electrical and computer engineering from Georgia Tech in 2006. To date, he has received several academic and industry awards (including an Emmy® Award for Technology and Engineering) and has been granted 30+ US patents. In 2020 and 2021, he was listed among the world's most influential scientists in the subfield of networking and telecommunications. More details are at https://ali.begen.net.
Presented by: Muhammad Haroon Yousaf, Muhammad Saad Saeed, Muhammad Naeem Mumtaz Awan
This tutorial has been planned to acquaint the audience with the latest tools in edge computing for robot vision. Initially, the attendees will be introduced to the basics of robot vision and the challenges it must solve. Diving deeper, application-agnostic models will be discussed, and finally model optimization and deployment on NVIDIA Jetson devices will be presented.
This tutorial will show participants how to prepare custom datasets, train their own custom models, and deploy custom/pre-trained machine vision models on edge devices. It will also brief participants on building applications that leverage the strengths of edge computing devices.
The learning outcomes of this course are as follows:
- Enhanced knowledge about robot vision
- Problems that can be solved by combining robots and vision
- Challenges in aerial vision
- Object detection from aerial-view
- Real-world application of robotics
- Optimization of Vision Models
Introduction to Robot Vision
- Robot vision
- Need for robot vision
- Robots with vision sensors
- Autonomous cars
- Underwater ROVs
- Challenges and Application Areas
Robot Vision Models
- Introduction to real-time vision Models
- Vision in Aerial Robotics
- Object Detection from Aerial-view using Edge Computing
Model Optimization and Deployment on Edge Devices
- Optimization tools
- TensorRT
- Implementation of Optimized Model
- Jetson Nano
- Jetson Xavier
- Computer Vision Application in Robotics
- Introduction to Object Detection
- Understanding of Object Detection Based on CNN Family and YOLO
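The optimization-and-deployment step outlined above is typically driven by TensorRT's `trtexec` command-line tool on the Jetson itself. The sketch below is illustrative: the model and engine filenames are placeholders, not artifacts from the tutorial.

```shell
# Convert a trained detector, exported to ONNX, into a serialized TensorRT
# engine. "model.onnx" is a placeholder name; --fp16 enables half precision,
# which Jetson Nano/Xavier GPUs accelerate well.
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16

# At inference time the application loads the serialized engine instead of
# the original framework model, avoiding the costly optimization step on
# every run.
```

The precision flag is the usual accuracy/latency knob on Jetson-class hardware; INT8 calibration is the next step down when a representative calibration set is available.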
Muhammad Haroon Yousaf is a Professor of Computer Engineering at the University of Engineering and Technology Taxila, Pakistan. He has more than 17 years of teaching/research experience. His research interests are Image Processing, Computer Vision, and Robotics. He is also the Director of the Swarm Robotics Lab under the National Centre for Robotics and Automation, Pakistan. He has secured many funded (government and industry) research projects in his areas of interest. He has published more than 70 research papers in peer-reviewed international conferences and journals, and has supervised three Ph.D. and more than 30 M.S. theses in the domain of image processing and computer vision. He is an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology, has provided reviewing services for many prestigious peer-reviewed journals, has served on the technical program committees of many international conferences, and has been a Ph.D. examiner/evaluator for several international universities. Prof. Haroon received the Best University Teacher Award from the Higher Education Commission, Pakistan in 2014. He is a mentor to a couple of tech startups in the domain of Robotics and Computer Vision. He has served on national-level curriculum development committees in 2014, 2019, and 2021, and on national/international expert panels/boards reviewing research grants. He is a Senior Member of IEEE and a member of IEEE SPS. He was the General Chair of the IEEE SPS Seasonal School on Computer Vision Applications in Robotics (CVAR).
Muhammad Naeem Mumtaz Awan is a Research Associate in the Swarm Robotics Lab. He has more than two years of experience in Computer Vision. He received his M.S. in Electrical Engineering from NUST in 2019 and his B.S. in Electrical Engineering (Gold Medal) from Riphah International University in 2016. His research areas are Object Detection, Semantic Segmentation, Artificial Intelligence on the edge, and Computer Vision.
Muhammad Saad Saeed is a Research Associate in the Swarm Robotics Lab. He is also the Chief Technology Officer (CTO) of a Computer Vision-based startup, “BeeMantis.” Saad has more than three years of R&D experience in Deep Learning with applications in Computer Vision, Multimodal Learning, AI on the edge, and Speech and Audio Processing. He is a Professional Member of IEEE and a member of IEEE SPS.