Monday, 9 October, 11:00 - 13:00
Tuesday, 10 October, 11:00 - 13:00
Wednesday, 11 October, 11:00 - 13:00
Presented by: Muhammad Haroon Yousaf, Shah Nawaz, Muhammad Saad Saeed
Our experience of the world around us is multimodal: we see things, hear sounds, smell odors, and so on. A modality refers to a way in which the world can be sensed and experienced. In machine learning, a modality is a type of data that a model can process, such as audio, images, or text. Each modality has its own unique characteristics and properties, requiring different types of processing and analysis to extract useful information. Multimodal learning is a paradigm focused on combining multiple modalities of data, such as audio-image or image-text learning, to improve a model's performance. The idea behind multimodal learning is that different modalities provide complementary cues that can help a model make more accurate predictions or decisions. For example, a model that can process both images and text can better understand the context of an image and make more accurate predictions. Keeping in view the importance of multimodal learning, we have designed this course to acquaint participants with the latest research trends and applications in multimodal learning. This short course provides a detailed introduction to the principles and rationale of multimodal learning, and also discusses applications and research problems that participants can pursue. At the end of this course, we expect the audience to:
- Have a grasp of the basic concepts and rationale of multimodal learning – its functionality, applications, and challenges.
- Have a better understanding of its different applications and research directions, enabling them to build applications as well as carry out research in multimodal fusion, face-voice association, and image-text joint representation learning.
- Have hands-on experience with the PyTorch and TensorFlow frameworks for training multimodal networks while working on the problems of face-voice association and image-text joint representation learning.
- Have hands-on experience with various challenges and methodologies of multimodal learning.
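The complementary-cues idea behind multimodal learning can be sketched in a few lines of plain Python. This is an illustrative toy only, not course material: the feature vectors below are made up, and in practice they would come from trained unimodal networks (e.g. a CNN for images and a text encoder) built in PyTorch or TensorFlow.

```python
# Two common ways to combine modalities, shown on hypothetical toy vectors.

image_features = [0.9, 0.1, 0.4]   # hypothetical image embedding
text_features = [0.2, 0.8]         # hypothetical text embedding

# Early fusion: concatenate modality features into one joint
# representation, which a single classifier then consumes.
def early_fusion(img, txt):
    return img + txt

# Late fusion: run a separate classifier per modality, then
# combine their output scores (here, a simple average).
def late_fusion(img_score, txt_score):
    return 0.5 * (img_score + txt_score)

joint = early_fusion(image_features, text_features)
print(joint)  # [0.9, 0.1, 0.4, 0.2, 0.8]
print(late_fusion(0.7, 0.9))
```

The choice between fusing features (early) and fusing decisions (late) is one of the recurring design questions revisited throughout the course outline below.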
The course plan is given below:
- Lecture 1 - Introduction to Multimodal Learning (1.5 hours)
- Introduction and Motivation.
- Why Multimodal learning?
- Real-world Environment is Multimodal.
- Unimodal vs Multimodal learning.
- Leveraging Unimodal Networks for Multimodal learning.
- Lecture 2 - Applications and Research Trends (2 hours)
- Research Trends in Multi-modal learning.
- Technical Challenges:
- Feature Fusion.
- Joint Representation Learning.
- Feature alignment.
- Supervised vs Unsupervised Learning.
- CNN/Transformer architectures.
- Loss Functions.
- Single-Branch vs Two-Branch Networks.
- Lab 1 (30 Minutes):
- Hands-on: Leveraging Unimodal Networks for Multimodal Learning.
- Lecture 3 - Face-Voice Association (1 hour):
- Datasets and Challenges.
- VoxCeleb Datasets and Pipeline.
- Is Face-Voice Association Language-Independent? The MAV-Celeb Dataset.
- Impact of Latent Properties on Identity Verification: Gender, Nationality and Age.
- User identification/Verification.
- Face-Voice and Voice-Face matching.
- Multimodal emotion classification.
- Lab 2 (1 hour):
- Hands-on: Pipeline for the multilingual VoxCeleb dataset.
- Challenge: Automated pipeline for curation of Multilingual Audio-Visual dataset.
- Lecture 4 - Image-Text Joint Representation and Learning (1 hour):
- Datasets and Challenges
- Hateful Memes Challenge and dataset.
- Fusion strategies:
- Early, Late and Hybrid Fusion.
- Object Identification and Classification.
- Image-Text retrieval.
- Image Captioning/Description Generation.
- Lab 3 (1 hour):
- Hands-on: Single-Branch vs Two-Branch Networks for Hateful Meme Detection.
- Lecture 5 - Multimodal Networks against Missing Modalities (1 hour):
- Missing a modality:
- Modality missing during training.
- Modality missing during inference.
- Transformers vs CNNs against performance drop.
- Performance drop mitigation strategies.
- Lab 4 (45 Minutes):
- Hands-on: Multimodal Networks against Missing Modalities.
- Course Evaluation (15 minutes)
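As a small taste of the missing-modality topic in Lecture 5, one simple mitigation when a modality is absent at inference time is to fuse only the scores that are available and renormalize the fusion weights. The modality names, weights, and scores below are hypothetical; this is a sketch of the general idea, not the method taught in the labs.

```python
# Late fusion that tolerates a missing modality at inference time:
# scores from absent modalities are skipped and the remaining
# fusion weights are renormalized to sum to 1.

def robust_fusion(scores, weights):
    """scores: dict modality -> score, or None if the modality is missing;
    weights: dict modality -> fusion weight."""
    present = {m: s for m, s in scores.items() if s is not None}
    if not present:
        raise ValueError("no modality available")
    total_w = sum(weights[m] for m in present)
    return sum(weights[m] / total_w * s for m, s in present.items())

weights = {"audio": 0.3, "image": 0.7}

# Both modalities present: an ordinary weighted average.
both = robust_fusion({"audio": 0.5, "image": 1.0}, weights)

# Audio missing: the image score alone is used, with its weight
# renormalized to 1, so the output degrades gracefully.
img_only = robust_fusion({"audio": None, "image": 1.0}, weights)
print(img_only)  # 1.0
```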
Muhammad Haroon Yousaf is a Professor of Computer Engineering at the University of Engineering and Technology Taxila, Pakistan, with more than 17 years of teaching and research experience. His research interests are Image Processing, Computer Vision, and Robotics. He is also the Director of the Swarm Robotics Lab under the National Centre for Robotics and Automation, Pakistan, and has secured many government- and industry-funded research projects in his areas of interest. Prof. Haroon received the Best University Teacher Award from the Higher Education Commission, Pakistan, in 2014. He is a Senior Member of IEEE (Institute of Electrical and Electronics Engineers) and a member of IEEE SPS. He also serves as an Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology (TCSVT).
Shah Nawaz is a postdoctoral researcher at the German Electron Synchrotron (DESY), focusing on computer vision and deep learning. He received his PhD from the University of Insubria, Italy, and was previously a postdoc at IIT Genova, Italy. During his doctoral and postdoctoral work he developed various techniques for learning representations for multimodal applications, ranging from classification to cross-modal retrieval.
Muhammad Saad Saeed is working as a Research Associate in the Swarm Robotics Lab. He is also the Chief Technology Officer (CTO) of a computer-vision startup, “BeeMantis.” Saad has more than three years of R&D experience in Deep Learning, with applications in Computer Vision, Multimodal Learning, AI (Artificial Intelligence) on the Edge, and Speech and Audio Processing. He is a Professional Member of IEEE (Institute of Electrical and Electronics Engineers) and a member of IEEE SPS.
Monday, 9 October, 16:00 - 18:00
Tuesday, 10 October, 16:00 - 18:00
Wednesday, 11 October, 16:00 - 18:00
Presented by: Sonali Agarwal, Sanjay Kumar Sonbhadra, Narinder Singh Punn
With the rapid development of Artificial Intelligence (AI), biomedical image processing has witnessed remarkable progress in disease diagnosis through segmentation and classification tasks, and has become an active area of research in both the medical domain and academia. Among the applications of deep learning, classification can determine whether a disease is present in various imaging modalities, for example the ground-glass opacification (GGO) visible in the lungs on CT scans. Localization allows for the identification of normal anatomy, such as identifying the lungs in a CT scan. Furthermore, segmentation can produce precise borders around target regions, such as GGOs in CT. In medical imaging, most applications require critical examination of the target region, which traditionally relies on manual delineation by experts to gain insight into the underlying biological process and to develop diagnosis or treatment plans. The primary sections of this course introduce methods that can be utilized across various disease-diagnosis tasks, such as brain-tumor segmentation and classification, cancer detection, etc.
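Segmentation quality in this setting is commonly scored with overlap metrics such as the Dice coefficient (one of the performance metrics covered later in the outline). A minimal plain-Python version, using our own illustrative helper and toy binary masks rather than real CT data, looks like this:

```python
# Dice similarity coefficient between two flattened binary masks:
# Dice = 2 * |A intersect B| / (|A| + |B|).

def dice(pred, target):
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Convention: two empty masks count as a perfect match.
    return 2.0 * intersection / denom if denom else 1.0

# Toy 1-D "masks": 1 marks pixels inside the target region (e.g. a GGO).
prediction   = [1, 1, 0, 0, 1, 0]
ground_truth = [1, 0, 0, 0, 1, 1]

print(dice(prediction, ground_truth))  # 2*2 / (3+3) = 2/3
```

A Dice score of 1 means the predicted and expert-delineated regions coincide exactly; 0 means no overlap at all.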
Deep learning models are data hungry, whereas the expense of data acquisition and delineation, together with data-security constraints, limits the availability of biomedical data. Alongside transfer learning and data augmentation strategies, self-supervised learning is an emerging technique that aims to address this challenge efficiently. This learning scheme advances the potential of deep learning models to capture better feature representations and generate promising results.
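The contrastive branch of self-supervised learning covered later in the outline (MoCo, PIRL, SimCLR) typically optimizes an InfoNCE-style objective. In its standard SimCLR-style form (stated here for orientation, not as this course's exact formulation), the loss for an anchor embedding $z_i$ with positive partner $z_j$ and temperature $\tau$ is

```latex
\mathcal{L}_{i} = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k \neq i} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)}
```

where $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity and the sum runs over the other embeddings in the batch, pulling augmented views of the same image together while pushing different images apart.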
Although AI has seemingly limitless potential for gaining new insights from medical images, it remains inefficient in the presence of multimodal data. Existing AI solutions face significant challenges when only limited and restricted multimodal images are available for clinical settings and decision-making. Accordingly, this course also addresses this challenging area, where the availability of multimodal images requires learning and diagnosing diseases in a heterogeneous environment.
Despite promising results, most existing AI-enabled solutions (such as deep learning models) are hard to interpret and lack transparency, and hence are termed black-box models. Therefore, to expose the interpretability of the models, this course also introduces visual explanation with class activation and uncertainty maps to establish the rationale behind the models' predictions.
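For orientation, the class-activation-map visualizations mentioned above follow the standard CAM formulation (given here as background, not necessarily the exact variant used in the course): for a class $c$, the map is a weighted sum of the final convolutional feature maps $f_k$,

```latex
M_c(x, y) = \sum_k w_k^{c} \, f_k(x, y)
```

where $w_k^{c}$ is the classifier weight connecting feature channel $k$ to class $c$. Regions where $M_c$ is large are those that most influenced the prediction, which is what makes such maps useful for checking a model's rationale on medical images.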
Most AI methods have internal and external disadvantages that prevent their eventual clinical implementation. As a result, this course also examines the current state of these approaches through the lens of strengths, weaknesses, opportunities, and threats (SWOT analysis).
The outline of the short course is described below:
- Session 1: Introduction to the course
- Preliminaries of Biomedical Image Processing
- Biomedical image analysis
- Biomedical imaging
- Types of biomedical images and benchmark datasets
- Biomedical image segmentation and classification
- Motivation, significance and challenges
- Machine Learning and Deep Learning Models for Medical Image Processing
- Machine Learning
- Technological advancements
- Overview of deep learning
- Neurons, activation, pooling, convolution, etc.
- Overview of Transfer Learning
- Few-shot, One-shot and Zero-shot learning
- Deep learning image segmentation architectures
- Typical approaches for implementing deep learning segmentation architectures
- Deep learning model implementation strategies and libraries
- Deep learning models for biomedical image segmentation
- Generic framework of segmentation architectures
- Data processing (Pre-processing and post-processing techniques)
- State-of-the-art deep learning frameworks
- Performance metrics for BIS
- Recent research work
- Concluding remarks at the end of session
- Session 2: Self-Supervised Learning Towards Medical Imaging
- Learning strategies (Supervised, Unsupervised, Semi-supervised)
- Significance of pre-training (Transfer learning)
- Self-supervised learning
- Transition from supervised to self-supervised learning
- Pre-training with pretext tasks and its intuition
- Pre-training techniques
- Different types of pre-text task
- Significance of pre-text tasks on the actual task
- Discriminative self-supervised learning methods
- Similarity maximization
- Contrastive learning (MoCo, PIRL, SimCLR)
- Clustering (DeepCluster, SeLA, SwAV)
- Distillation (BYOL, SimSiam)
- Redundancy reduction (Barlow Twins)
- Multi-modal approaches
- SWOT analysis and concluding remarks at the end of the Session
- Session 3: Multimodal Learning with Images
- Multimodal learning
- Unimodal vs Multimodal
- MML applications
- Challenges - MIMICS
- Missing data, Incomplete labels (Semi-supervised), Multimodality (Data types), Interpretability, Causality, and Sequential decision making (Reinforcement learning)
- Fundamentals of multimodal learning
- Foundation of multimodal neural networks
- Recurrent neural networks
- Long Short-Term Memory models
- Multimodal Transformers
- Multimodal Memory
- Multimodal learning methods, applications and datasets - I
- Multimodal image description (MMID)
- Multimodal video description (MMVD)
- Multimodal visual question answering (MMVQA)
- Multimodality fusion
- Model free approaches
- Fusion strategies
- Input-level, intermediate-level, output-level, and hybrid
- Kernel-based fusion
- Multimodal graphical models
- Factorial HMM, Multi-view Hidden CRF
- Multi-view LSTM model
- Multimodal analytics framework
- SWOT analysis and concluding remarks at the end of the Session
- Session 4: Explainability in Biomedical Image Processing
- Introduction and Motivation
- The need for explanation
- Towards explainable AI
- Causal Reasoning
- Tradeoff: Explainability vs. Accuracy
- Recent state-of-the-art methods
- Explanation in biomedical imaging
- Explanations in different medical imaging use cases
- Evaluation Protocols and Metrics
- Re-trace > Understand > Explain; Transparency > Trust > Acceptance; Fairness > Transparency > Accountability
- Types of explanations based on:
- Feature, Training, Concept, Surrogate and Natural language processing models
- Explainable Machine Learning for Image Processing - Part I
- What is a Black Box?
- Interpretable, Explainable, and Comprehensible Models
- Open the Black Box Problems
- Interpretable Deep Learning
- From Explainability to Model Quality
- Explainable Machine Learning - Part II
- On the Role of Knowledge Graph in Explainable AI
- Knowledge Graphs
- Extending Machine Learning Systems with Knowledge Graphs
- On the Role of Reasoning in Explainable AI
- Relational Learning
- XAI Tools on Applications, Lessons Learnt and Research Challenges
- Unboxing the Black Box
- Open research questions
- Future Directions
- SWOT analysis and course concluding remarks at the end of the Session
Sonali Agarwal is working as an Associate Professor in the Information Technology Department of the Indian Institute of Information Technology (IIIT), Allahabad, India. Her main research interests are in the areas of Artificial Intelligence and Big Data. She is the head of the Big Data Analytics Lab at IIIT Allahabad, India. Recently, she has worked on developing explainable approaches to support machine learning frameworks for biomedical imaging applications. She has delivered tutorials at prestigious CORE-ranked conferences such as DSAA 2021, ICONIP 2021, DASFAA 2022, PAKDD 2022, SSCI 2022, and IEEE BigData 2022.
Sanjay Kumar Sonbhadra is presently working as an Assistant Professor in the Computer Science and Engineering Department of ITER, Shiksha ‘O’ Anusandhan, Bhubaneswar, Odisha, India. He mainly works on one-class classification, anomaly detection, dimensionality reduction, and training-sample selection techniques to handle large-scale image data processing. He received his Ph.D. in Information Technology from the Indian Institute of Information Technology Allahabad, India, where he worked as a senior member of the “Big Data Analytics Lab” during 2017-2021. He has published many articles on machine learning applications that address the recent challenges of high-dimensional datasets, with particular experience in the challenging problem of target-specific learning with limited target-class samples. He is an excellent speaker and has presented tutorials at many reputed CORE-ranked conferences such as DSAA 2021, ICONIP 2021, DASFAA 2022, PAKDD 2022, SSCI 2022, and IEEE BigData 2022.
Narinder Singh Punn is currently working as an Assistant Professor in the Department of Computer Science and Engineering at Atal Bihari Vajpayee Indian Institute of Information Technology and Management Gwalior. He received his Ph.D. in biomedical image segmentation from the Indian Institute of Information Technology Allahabad, Prayagraj, India, in 2022. Later, he worked as a Postdoctoral Fellow at the Machine Intelligence in Medical Imaging (MI2) lab, Mayo Clinic, Arizona, USA. With his keen interest in deep learning and computer vision, he has published several papers in international journals, conferences, and pre-print servers. His primary research includes Computer Vision, Biomedical Image Processing, Image Classification, Image Segmentation, and Object Detection. His tutorials have already attracted audiences at prestigious CORE-ranked conferences such as DSAA 2021, ICONIP 2021, DASFAA 2022, PAKDD 2022, SSCI 2022, and IEEE BigData 2022.