Presented by: Fei Chen, Yu Tsao
Part I: Fundamentals of objective speech intelligibility and quality assessment
Part II: Deep learning-based assessment metrics and their applications
Presented by: Laura Balzano, Qing Qu, Peng Wang, Zhihui Zhu
The Neural Collapse phenomenon has garnered significant attention in both practical and theoretical fields of deep learning, as evident from the extensive research on the topic. The presenters' own works have made key contributions to this body of research. Below is a summary of the tutorial outline. The first half focuses on the structures of representations appearing in the last layer, and the second half generalizes the study to intermediate layers.
1. Prevalence of Neural Collapse & Global Optimality
The tutorial starts with an introduction to the Neural Collapse phenomenon in the last layer and its universality in deep network training, and lays out the mathematical foundations for understanding its cause based upon the simplified unconstrained feature model (UFM). We then generalize and explain this phenomenon and its implications under data imbalance.
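One of the Neural Collapse predictions (often labeled NC2) is that last-layer class means converge to a simplex equiangular tight frame: unit-norm directions with pairwise cosine exactly -1/(K-1). As an illustrative sketch (the helper name `simplex_etf` is ours, not from the tutorial materials), the configuration can be constructed and checked directly:

```python
import numpy as np

def simplex_etf(K):
    """Columns are the K equiangular class-mean directions of a simplex ETF.

    Neural Collapse (NC2) predicts that last-layer class means converge to this
    configuration: unit-norm vectors with pairwise cosine -1/(K-1).
    """
    M = np.eye(K) - np.ones((K, K)) / K   # centered identity, rank K-1
    return M / np.linalg.norm(M, axis=0)  # normalize each column

K = 4
E = simplex_etf(K)
G = E.T @ E                               # Gram matrix of pairwise cosines
off_diag = G[~np.eye(K, dtype=bool)]
print(np.allclose(off_diag, -1.0 / (K - 1)))  # True: equiangular at -1/(K-1)
```

Measuring how close trained class means are to this Gram structure is a standard way to quantify Neural Collapse empirically.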
2. Optimization Theory of Neural Collapse
We provide a rigorous explanation of the emergence of Neural Collapse from an optimization perspective and demonstrate its impacts on algorithmic choices, drawing on recent works. Specifically, we conduct a global landscape analysis under the UFM to show that benign landscapes are prevalent across various loss functions and problem formulations. Furthermore, we demonstrate the practical algorithmic implications of Neural Collapse on training deep neural networks.
3. Progressive Data Compression & Separation Across Intermediate Layers
We open the black box of deep representation learning by introducing a law that governs how real-world deep neural networks separate data according to their class membership from the bottom layers to the top layers. We show that each layer roughly improves a certain measure of data separation by an equal multiplicative factor. We demonstrate its universality by showing its prevalence across different network architectures, datasets, and training losses.
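A minimal sketch of how such a layer-wise separation measure can be tracked. The ratio of within-class to between-class scatter used below is a simple proxy for the measures studied in this line of work (the exact functional form varies by paper); the untrained random network here only demonstrates the measurement itself, whereas the law described above concerns trained networks, where successive-layer ratios are predicted to be roughly constant:

```python
import numpy as np

def separation_fuzziness(H, y):
    """Within- over between-class scatter, trace(S_w) / trace(S_b).

    H: (n, d) feature matrix, y: (n,) integer labels. Smaller values mean
    better-separated classes.
    """
    mu = H.mean(axis=0)
    Sw = Sb = 0.0
    for c in np.unique(y):
        Hc = H[y == c]
        mc = Hc.mean(axis=0)
        Sw += ((Hc - mc) ** 2).sum()          # spread around class means
        Sb += len(Hc) * ((mc - mu) ** 2).sum()  # spread of class means
    return Sw / Sb

rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 50)
X = rng.normal(size=(150, 20))
X[:, 0] += 3.0 * y                            # classes separated along one axis

# Track the measure layer by layer through a small ReLU network.
H, fuzz = X, [separation_fuzziness(X, y)]
for _ in range(3):
    H = np.maximum(H @ rng.normal(size=(20, 20)) / np.sqrt(20), 0.0)
    fuzz.append(separation_fuzziness(H, y))
print([round(f, 3) for f in fuzz])
```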
4. Theory & Applications of Progressive Data Separation
Finally, we delve into theoretical understandings of the structures in the intermediate layers by studying the learning dynamics of gradient descent. In particular, we reveal certain parsimonious structures in the gradient dynamics, such that a certain measure of data separation exhibits layer-wise linear decay from shallow to deep layers. We conclude by demonstrating the practical implications of this phenomenon for transfer learning and the study of foundation models, leading to efficient fine-tuning methods with reduced overfitting.
Presented by: Jun Qi, Ying-Jer Kao, Samuel Yen-Chi Chen, Mohammadreza Noormandipour
Presented by: Dirk Slock, Christo K. Thomas
Part I Approximate Bayesian Techniques
Part II Generalized Linear models
Part III Bilinear models
Part IV Adaptive Kalman filtering
Presented by: Kush R. Varshney
Presented by: Yao Xie, Xiuyuan Cheng
Introduction
Mathematical background
Diffusion model and ODE flow
Neural ODE and continuous normalizing flow (CNF)
Learning of interpolating distributions
Evaluation of generative models
Applications
Open problems and discussion
Presented by: Tianyi Chen, Xiaodong Cui, Lisha Chen
Part I - Introduction and Background
Part II - Bilevel Optimization for Learning with Ordered Objectives
Part III - Multi-objective Optimization for Learning with Competing Objectives
Part IV - Applications to Automatic Speech Recognition
Part V - Open Research Directions
Presented by: Keshab K. Parhi
Engineering practical and reliable quantum computers and communication systems requires: (a) protection of quantum states from decoherence, and (b) overcoming the reliability issues due to faulty gates. The half-day tutorial will provide a detailed overview of the new developments related to quantum ECCs and fault-tolerant computing. Specific topics include: (a) Introduction to quantum gates and circuits, (b) Shor’s 9-qubit ECC and the stabilizer formalism for quantum ECCs, (c) Systematic method for construction of quantum ECC circuits, (d) Optimization of quantum ECC circuits in terms of the number of multiple-qubit gates, and (e) Nearest-neighbor compliant (NNC) quantum ECC circuits. Descriptions of the topics are listed below.
Presented by: Sam Buchanan, Yi Ma, Druv Pai, Yaodong Yu
During the past decade, machine learning and high-dimensional data analysis have experienced explosive growth, due in major part to the extensive successes of deep neural networks. Despite their numerous achievements in disparate fields such as computer vision and natural language processing, which have led to their involvement in safety-critical data processing tasks (such as autonomous driving and security applications), such deep networks have remained mostly mysterious to their end users and even their designers. For this reason, the machine learning community continually places higher emphasis on explainable and interpretable models, those whose outputs and mechanisms are understandable by their designers and even end users. The research community has recently responded to this task with vigor, having developed various methods to add interpretability to deep learning. One such approach is to design deep networks which are fully white-box ab initio, namely designed through mechanisms which give each operator in the deep network a clear purpose and function towards learning and/or transforming the data distribution. This tutorial will discuss classical and recent advances in constructing white-box deep networks from this perspective. We now present the Tutorial Outline:
- [Yi Ma] Introduction to high-dimensional data analysis (45 min): In the first part of the tutorial, we will discuss the overall objective of high-dimensional data analysis, that is, learning and transforming the data distribution towards template distributions with relevant semantic content for downstream tasks (such as linear discriminative representations (LDR), expressive mixtures of semantically-meaningful incoherent subspaces). We will discuss classical methods such as sparse coding through dictionary learning as particular instantiations of this learning paradigm when the underlying signal model is linear or sparsely generated. This part of the presentation involves an interactive Colab on sparse coding.
- [Sam Buchanan] Layer-wise construction of deep neural networks (45 min): In the second part of the tutorial, we will introduce unrolled optimization as a design principle for interpretable deep networks. As a simple special case, we will examine several unrolled optimization algorithms for sparse coding (especially LISTA and “sparseland”), and show that they exhibit striking similarities to current deep network architectures. These unrolled networks are white-box and interpretable ab initio. This part of the presentation involves an interactive Colab on simple unrolled networks.
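The unrolled-optimization idea above can be sketched concretely for sparse coding. Each ISTA iteration is a linear map followed by a soft-thresholding nonlinearity, so writing the iterations as "layers" yields a feed-forward network; in LISTA the two weight matrices below become trainable per-layer parameters, whereas here they are fixed to their classical ISTA values (a sketch, not the tutorial's own code):

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def unrolled_ista(x, A, lam=0.1, n_layers=10):
    """Each 'layer' is one ISTA step: a linear map + soft-threshold nonlinearity.

    In LISTA, W1 and W2 below are learned per layer; fixing them to the ISTA
    values recovers the classical algorithm written as a network.
    """
    L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad
    W1 = A.T / L                                   # 'input injection' weight
    W2 = np.eye(A.shape[1]) - A.T @ A / L          # 'recurrent' weight
    z = np.zeros(A.shape[1])
    for _ in range(n_layers):
        z = soft(W2 @ z + W1 @ x, lam / L)
    return z

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 100)) / np.sqrt(30)       # random dictionary
z_true = np.zeros(100)
z_true[[3, 40]] = [1.5, -2.0]                      # 2-sparse code
x = A @ z_true
z_hat = unrolled_ista(x, A, lam=0.05, n_layers=500)
# The two largest entries; recovery should place them at the true support {3, 40}.
print(sorted(np.argsort(-np.abs(z_hat))[:2].tolist()))
```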
- [Druv Pai] White-box representation learning via unrolled gradient descent (45 min): In the third part of the tutorial, we will focus on the special yet highly useful case of learning the data distribution and transforming it to an LDR. We will discuss the information theoretic and statistical principles behind such a representation, and design a loss function, called the coding rate reduction, which is optimized at such a representation. By unrolling the gradient ascent on the coding rate reduction, we will construct a deep network architecture, called the ReduNet, where each operator in the network has a mathematically precise (hence white-box and interpretable) function in the transformation of the data distribution towards an LDR. Also, the ReduNet may be constructed layer-wise in a forward-propagation manner, that is, without any back-propagation required. This part of the presentation involves an interactive Colab on the coding rate reduction.
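The coding rate reduction objective mentioned above has a compact closed form: the rate of the whole feature ensemble minus the sample-weighted average rate of each class's subset. A minimal sketch (eps and the toy data are our illustrative choices):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 logdet(I + d/(n eps^2) Z Z^T), for a d x n feature matrix Z."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * Z @ Z.T)[1]

def coding_rate_reduction(Z, y, eps=0.5):
    """Delta R = R(Z) - sum_c (n_c / n) R(Z_c): ensemble rate minus the
    weighted per-class rates. Larger when classes span incoherent subspaces."""
    n = Z.shape[1]
    Rc = sum((np.sum(y == c) / n) * coding_rate(Z[:, y == c], eps)
             for c in np.unique(y))
    return coding_rate(Z, eps) - Rc

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
t = np.abs(rng.normal(size=100))
Z_orth = np.zeros((10, 100))
Z_orth[0, y == 0] = t[y == 0]          # class 0 along e_0
Z_orth[1, y == 1] = t[y == 1]          # class 1 along e_1 (orthogonal)
Z_same = np.zeros((10, 100))
Z_same[0, :] = t                        # both classes along the same axis
dr_orth = coding_rate_reduction(Z_orth, y)
dr_same = coding_rate_reduction(Z_same, y)
print(dr_orth > dr_same)  # True: orthogonal class subspaces earn a larger reduction
```

Unrolling gradient ascent on this objective is what produces the ReduNet operators described above.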
- [Yaodong Yu] White-box transformers (45 min): In the fourth part of the tutorial, we will show that by melding the perspectives of sparse coding and rate reduction together, we can obtain sparse linear discriminative representations, encouraged by an objective which we call sparse rate reduction. By unrolling the optimization of the sparse rate reduction, and parameterizing the feature distribution at each layer, we will construct a deep network architecture, called CRATE, where each operator is again fully mathematically interpretable: each layer realizes a step of an optimization algorithm, and the whole network is a white box. The design of CRATE is very different from ReduNet, despite optimizing a similar objective, demonstrating the flexibility and pragmatism of the unrolled optimization paradigm. Moreover, the CRATE architecture is extremely similar to the transformer, and many of the layer-wise interpretations of CRATE can be used to interpret the transformer, showing that the interpretability benefits of networks derived in this way may carry over to understanding current deep architectures which are used in practice. We will highlight in particular the powerful and interpretable representation learning capability of these models for visual data by showing how segmentation maps for visual data emerge in their learned representations with no explicit additional regularization or complex training recipes.
Presented by: Baihan Lin
In recent years, reinforcement learning and bandits have transformed a wide range of real-world applications including healthcare, finance, recommendation systems, robotics and computer vision, and, last but not least, speech and language processing. While most speech and language applications of reinforcement learning algorithms are centered around improving deep network training with its flexible optimization properties, there are still many grounds to explore to utilize the benefits of reinforcement learning, such as its reward-driven adaptability, state representations, temporal structures and generalizability. In this one-session tutorial, we will overview the recent advancements of reinforcement learning and bandits and discuss how they can be employed to solve various speech and natural language processing problems with models that are interpretable and scalable, especially in emerging topics such as large language models.
First, we briefly introduce the basic concepts of reinforcement learning and bandits, as well as the major variant problem settings in this machine learning domain. Second, we translate various speech and language tasks into reinforcement learning problems and show the key challenges. Third, we introduce some reinforcement learning and bandit techniques and their variants for speech and language tasks, along with their machine learning formulations. Fourth, we present several state-of-the-art applications of reinforcement learning in different fields of speech and language. Lastly, we will discuss some open problems in reinforcement learning and bandits to show how to further develop more advanced algorithms for speech and language research in the future.
As the second iteration of this tutorial, the topic will emphasize additional coverage in new developments in large language models and deep reinforcement learning. The audience can refer to two resources after the tutorial: (1) a review paper by the author on arXiv covering many topics in this tutorial, and (2) an upcoming Springer book by the author on the same topic to be released this December, which includes more case studies, hands-on examples and additional coverage on recent advancements in large language models.
The outline of the tutorial, along with the topics and subtopics covered, is as follows:
1. Introduction (5 min)
2. A Concise Tutorial of Reinforcement Learning and Bandits (85 min)
BREAK (20 min)
3. Reinforcement Learning Formulation for Speech and Language Applications (45 min)
4. Emerging Reinforcement Learning Strategies (15 min)
5. Conclusions, Open Questions and Challenges (10 min)
Overall takeaways for our attendees:
Presented by: Petros Maragos
Tropical geometry is a relatively recent field in mathematics and computer science combining elements of algebraic geometry and polyhedral geometry. The scalar arithmetic of its analytic part pre-existed (since the 1980s) in the form of max-plus and min-plus semiring arithmetic used in finite automata, nonlinear image processing, convex analysis, nonlinear control, and idempotent mathematics.
Tropical geometry recently emerged successfully in the analysis and extension of several classes of problems and systems in both classical machine learning and deep learning. Such areas include (1) Deep Neural Networks (DNNs) with piecewise-linear (PWL) activation functions, (2) Morphological Neural Networks, (3) Neural Network Minimization, (4) Optimization (e.g. dynamic programming) and Probabilistic Dynamical Systems, and (5) Nonlinear regression with PWL functions. Areas (1), (2) and (3) have many novel elements and have recently been applied to image classification problems. Area (4) offers new perspectives on several areas of optimization. Area (5) is also novel and has many applications.
The proposed tutorial will cover the following topics:
Elements from Tropical Geometry and Max-Plus Algebra (Brief). We will first summarize introductory ideas and objects of tropical geometry, including tropical curves and surfaces and Newton polytopes. We will also provide a brief introduction to the max-plus algebra that underlies tropical geometry. This will involve scalar and vector/signal operations defined over a class of nonlinear spaces and optimal solutions of systems of max-plus equations. Tropical polynomials will be defined and related to classical polynomials through Maslov dequantization. Then, the above introductory concepts and tools will be applied to analyzing and/or providing solutions for problems in the following broad areas of machine learning.
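To make the max-plus arithmetic above concrete: in the max-plus semiring, "addition" is max and "multiplication" is +, so a tropical polynomial is a convex piecewise-linear function, and the matrix "product" replaces sum-of-products with max-of-sums. A minimal sketch (the helper names are ours):

```python
import numpy as np

def tropical_poly(coeffs, slopes, x):
    """Evaluate the max-plus polynomial p(x) = max_i (coeffs[i] + slopes[i]*x).

    In the max-plus semiring (add = max, multiply = +), this is a convex
    piecewise-linear function of x.
    """
    return np.max(coeffs[:, None] + np.outer(slopes, x), axis=0)

def maxplus_matmul(A, B):
    """Max-plus matrix product: (A @ B)[i, j] = max_k (A[i, k] + B[k, j])."""
    return np.max(A[:, :, None] + B[None, :, :], axis=1)

x = np.linspace(-2, 2, 9)
print(np.allclose(tropical_poly(np.array([0.0, 0.0]), np.array([-1.0, 1.0]), x),
                  np.abs(x)))  # True: |x| = max(-x, x) is a tropical polynomial
```

The max-plus identity matrix has 0 on the diagonal and -inf off-diagonal, mirroring how 0 and -inf play the roles of multiplicative and additive identities in this semiring.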
Neural Networks with Piecewise-linear (PWL) Activations. Tropical geometry recently emerged in the study of deep neural networks (DNNs) and variations of the perceptron operating in the max-plus semiring. Standard activation functions employed in DNNs, including the ReLU activation and its “leaky” variants, induce neural network layers which are PWL convex functions of their inputs and create a partition of space well-described by concepts from tropical geometry. We will illustrate a purely geometric approach for studying the representation power of DNNs -- measured via the concept of a network's “linear regions” -- under the lens of tropical geometry.
Morphological Neural Networks. Recently there has been a resurgence of networks whose layers operate with max-plus arithmetic (inspired by the fundamental operators of morphological image processing). Such networks enjoy several promising aspects including faster training and capability of being pruned to a large degree without severe degradation of their performance. We will present several aspects from this emerging class of neural networks from some modern perspectives by using ideas from tropical geometry and mathematical morphology. Subtopics include methods for their training and pruning resulting in sparse representations.
Neural Network Minimization. The field of tropical algebra is closely linked with the domain of neural networks with PWL activations, since their output can be described via tropical polynomials in the max-plus semiring. In this tutorial, we will briefly present methods based on approximation of the NN tropical polynomials and their Newton Polytopes via either (i) a form of approximate division of such polynomials, or (ii) the Hausdorff distance of tropical zonotopes, in order to minimize networks trained for multiclass classification problems. We will also present experimental evaluations on known datasets, which demonstrate a significant reduction in network size, while retaining adequate performance.
Approximation Using Tropical Mappings. Tropical Mappings, defined as vectors of tropical polynomials, can be used to express several interesting approximation problems in ML. We will focus on three closely related optimization problems: (a) the tropical inversion problem, where we know the tropical mapping and the output, and search for the input; (b) the tropical regression problem, where we know the input-output pairs and search for the tropical mapping; and (c) the tropical compression problem, where we know the output, and search for an input and a tropical mapping that represent the data in reduced dimensions. There are several potential applications including data compression, data visualization, recommendation systems, and reinforcement learning. We will present a unified theoretical framework, where tropical matrix factorization has a central role, a complexity analysis, and solution algorithms for this class of problems. Problem (b) will be further detailed under PWL regression (see next).
Piecewise-linear (PWL) Regression. Fitting PWL functions to data is a fundamental regression problem in multidimensional signal modeling and machine learning, since approximations with PWL functions have proven analytically and computationally very useful in many fields of science and engineering. We focus on functions that admit a convex representation as the maximum of affine functions (e.g. lines, planes), represented with max-plus tropical polynomials. This allows us to use concepts and tools from tropical geometry and max-plus algebra to optimally approximate the shape of curves and surfaces by fitting tropical polynomials to data, possibly in the presence of noise; this yields polygonal or polyhedral shape approximations. For this convex PWL regression problem we present optimal solutions w.r.t. $\ell_p$ error norms and efficient algorithms.
Presented by: Danilo Mandic, Harry Davies
The Hearables paradigm, that is, in-ear sensing of neural function and vital signs, is an emerging solution for 24/7 discreet health monitoring. The tutorial starts by introducing our own Hearables device, which is based on an earplug with embedded electrodes and optical, acoustic, mechanical and temperature sensors. We show how such a miniaturised embedded system can be used to reliably measure the Electroencephalogram (EEG), Electrocardiogram (ECG), Photoplethysmography (PPG), respiration, temperature, blood oxygen levels, and behavioural cues. Unlike standard wearables, such an inconspicuous Hearables earpiece benefits from the relatively stable position of the ear canal with respect to vital organs to operate robustly during daily activities. However, this comes at a cost of weaker signal levels and exposure to noise. This opens novel avenues of research in Machine Intelligence for eHealth, with numerous challenges and opportunities for algorithmic solutions. We describe how our hearables sensor can be used, inter alia, for the following applications:
For the Hearables to provide a paradigm shift in eHealth, they require domain-aware Machine Intelligence, to detect, estimate, and classify the notoriously weak physiological signals from the ear canal. To this end, the second part of our tutorial is focused on interpretable AI. This is achieved through a first-principles matched-filtering explanation of convolutional neural networks (CNNs), introduced by us. We next revisit the operation of CNNs and show that their key component – the convolutional layer – effectively performs matched filtering of its inputs with a set of templates (filters, kernels) of interest. This serves as a vehicle to establish a compact matched filtering perspective of the whole convolution-activation-pooling chain, which allows for a theoretically well founded and physically meaningful insight into the overall operation of CNNs. This is shown to help mitigate their interpretability and explainability issues, together with providing intuition for further developments and novel physically meaningful ways of their initialisation. Interpretable networks are pivotal in the integration of AI into medicine, by dispelling the black box nature of deep learning and allowing clinicians to make informed decisions based on network outputs. We demonstrate this in the context of Hearables by expanding on the following key findings:
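The matched-filtering view of the convolutional layer can be sketched in a few lines: a stride-1 convolutional "layer" with a single kernel is a sliding inner product with a template, so its activation is largest where the input best matches that template (an illustrative toy example, not the presenters' code):

```python
import numpy as np

rng = np.random.default_rng(1)

# A one-kernel convolutional layer is cross-correlation with a template:
# the output is largest where the input best matches the kernel, which is
# exactly the matched-filtering view of CNNs described above.
template = np.sin(np.linspace(0, 2 * np.pi, 32))   # the 'kernel'
signal = 0.2 * rng.normal(size=256)                # noise floor
signal[100:132] += template                        # embed the pattern

# 'valid' cross-correlation, i.e. a stride-1 conv layer (without kernel flip)
out = np.array([signal[i:i + 32] @ template for i in range(256 - 32 + 1)])
# The activation peaks at (or within a sample or two of) index 100,
# where the embedded pattern sits.
print(int(np.argmax(out)))
```

The ReLU-and-pooling stages that follow in a CNN then act as a detector on this matched-filter output, which is what underpins the interpretability argument above.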
Owing to their unique Collocated Sensing nature, Hearables record a rich admixture of information from several physiological variables, motion and muscle artefacts and noise. For example, even a standard Electroencephalogram (EEG) measurement contains a weak ECG and muscle artefacts, which are typically treated as bad data and are subsequently discarded. In the quest to exploit all the available information (no data is bad data), the final section of the tutorial focuses on a novel class of encoder-decoder networks which, taking advantage of the collocation of information, maximise data utility. We introduce the novel concept of a Correncoder and demonstrate its ability to learn a shared latent space between the model input and output, making it a deep-NN generalisation of partial least squares (PLS). The key topics of the final section of this tutorial are as follows:
In summary, this tutorial details how the marriage of the emerging but crucially sensing modality of Hearables and customised interpretable deep learning models can maximise the utility of wearables data for healthcare applications, with a focus on the long-term monitoring of chronic diseases. Wearable in-ear sensing for automatic screening and monitoring of disease has the potential for immense global societal impact, and for personalised healthcare out-of-clinic and in the community – the main aims of the future eHealth.
The presenters are a perfect match for the topic of this tutorial: Prof Mandic’s team are pioneers of Hearables, and the two presenters have been working together over the last several years on the links between Signal Processing, Embedded Systems and Connected Health; the presenters also hold three international patents in this area.
Tutorial Outline
The tutorial will cover both the components of the Hearables paradigm and interpretable AI solutions for 24/7 wearable sensing in the real world. The duration will be over 3 hours, with the following topics covered:
Presented by: Christos Thrampoulidis, Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi
Part I: Motivation and Overview
I.1 The Transformer Revolution:
Our tutorial begins by providing an in-depth account of the Transformer architecture and its extensive array of applications. We place special emphasis on examples most relevant to the signal-processing audience, including speech analysis, time-series forecasting, image processing, and most recently, wireless communication systems. Additionally, we introduce and review essential concepts associated with Transformers' training, such as pre-training, fine-tuning, and prompt-tuning, while also discussing the Transformers' emerging abilities, such as in-context learning and reasoning.
I.2 A Signal-Processing-Friendly Introduction to the Attention Mechanism:
We then dive into a comprehensive explanation of the Transformer block's structure. Our primary focus is on the Attention mechanism, which serves as the fundamental distinguishing feature from conventional architectures like fully connected, convolutional, and residual neural networks. To facilitate the signal-processing community's understanding, we introduce a simplified attention model that establishes an intimate connection with problems related to sparse signal recovery and matrix factorization. Using this model as a basis, we introduce critical questions regarding its capabilities in memorizing lengthy sequences, modeling long-range dependencies, and training effectively.
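As a reference point for this discussion, single-head scaled dot-product attention can be written in a few lines: each output token is a convex combination of value vectors, with data-dependent weights from a softmax over query-key similarities (this is the standard mechanism, sketched in NumPy; the toy dimensions are ours):

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention on a length-n sequence X (n x d).

    The row-stochastic weight matrix A makes each output a convex combination
    of value vectors; this data-dependent mixing is what distinguishes
    attention from fixed linear or convolutional mixing.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # (n, n) similarities
    A = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    A /= A.sum(axis=1, keepdims=True)               # row-stochastic weights
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
Y, A = attention(X, Wq, Wk, Wv)
print(np.allclose(A.sum(axis=1), 1.0))  # True: each token's weights sum to 1
```

The simplified attention model discussed in the tutorial abstracts away parts of this computation to expose its connection to sparse recovery and matrix factorization.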
Part II: Efficient Inference and Adaptation: Quadratic attention bottleneck and Parameter-efficient tuning (PET)
II.1 Kernel viewpoint, low-rank/sparse approximation, Flash-attn (system level, implementation):
Transformers struggle with long sequences due to quadratic self-attention complexity. We review recently-proposed efficient implementations aimed to tackle this challenge, while often achieving superior or comparable performance to vanilla Transformers. First, we delve into approaches that approximate quadratic-time attention using data-adaptive, sparse, or low-rank approximation schemes. Secondly, we overview the importance of system-level improvements, such as FlashAttention, where more efficient I/O awareness can greatly accelerate inference. Finally, we highlight alternatives which replace self-attention with more efficient problem-aware blocks to retain performance.
II.2 PET: Prompt-tuning, LoRA adapter (Low-rank projection):
In traditional Transformer pipelines, models undergo general pre-training followed by task-specific fine-tuning, resulting in multiple copies for each task, increasing computational and memory demands. Recent research focuses on parameter-efficient fine-tuning (PET), updating a small set of task-specific parameters, reducing memory usage, and enabling mixed-batch inference. We highlight attention mechanisms' key role in PET, discuss prompt-tuning, and explore LoRA, a PET method linked to low-rank factorization, widely studied in signal processing.
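The low-rank factorization at the heart of LoRA can be sketched directly: the pretrained weight stays frozen, and only a rank-r update B @ A is trained, so a task is stored in O(r(d_in + d_out)) parameters instead of d_out * d_in (a minimal NumPy sketch; the class name and hyperparameters are our illustrative choices):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Only A (r x d_in) and B (d_out x r) are updated during fine-tuning; B is
    zero-initialized so the adapted layer starts identical to the pretrained one.
    """
    def __init__(self, W, r=4, alpha=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.W = W                                       # frozen pretrained weight
        self.A = rng.normal(size=(r, W.shape[1])) / np.sqrt(W.shape[1])
        self.B = np.zeros((W.shape[0], r))               # zero init
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.random.default_rng(1).normal(size=(16, 32))
layer = LoRALinear(W, r=4)
x = np.ones((2, 32))
print(np.allclose(layer(x), x @ W.T))  # True: B = 0, so the adapter is a no-op at init
```

Here the rank-4 adapter holds 192 parameters versus 512 in W, which is the source of the memory and communication savings discussed above.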
II.3 Communication and Robustness gains in Federated Learning:
We discuss the use of large pretrained transformers in mobile ML settings with emphasis on federated learning. Our discussion emphasizes the ability of transformers to adapt in a communication efficient fashion via PET methods: (1) Use of large models shrinks the accuracy gaps between alternative approaches and improves heterogeneity robustness. Scaling allows clients to run more local SGD epochs which can significantly reduce the number of communication rounds. (2) PET methods, by design, enable >100× less communication in bits while potentially boosting robustness to client heterogeneity and small sample size.
BREAK I
Part III: Approximation, Optimization, and Generalization Fundamentals
III.1 Approximation and Memorization Abilities:
We discuss Transformers as sequence-to-sequence models with a fixed number of parameters, independent of sequence length. Despite parameter sharing, Transformers exhibit universal approximation capabilities for sequence-to-sequence tasks. We delve into key results regarding Transformer models' approximation abilities, examining the impact of depth versus width. We also address their memorization capacity, emphasizing the trade-off between model size and the number of memorized sequence-to-sequence patterns. Additionally, we discuss the link between Transformers and associative memories, a topic of interest within the signal processing community.
III.2 Optimization dynamics: Transformer as Support Vector Machines:
In this section, we present a fascinating emerging theory that elucidates how the attention layer learns, during training, to distinguish 'good' sequence elements (those most relevant to the prediction task) while suppressing 'bad' ones. This separation is formally framed as a convex optimization program, similar to classical support-vector machines (SVMs), but with a distinct operational interpretation that relates to the problems of low-rank and sparse signal recovery. This unique formulation allows us to engage the audience with a background in signal processing, as it highlights an implicit preference within the Transformer to promote sparsity in the selection of sequence elements—a characteristic reminiscent of traditional sparsity-selection mechanisms such as the LASSO.
III.3 Generalization dynamics:
Our discussion encompasses generalization aspects related to both the foundational pretraining phase and subsequent task performance improvements achieved through prompt tuning. To enhance our exploration, we will introduce statistical data models that extend traditional Gaussian mixture models, specifically tailored to match the operational characteristics of the Transformer. Our discussion includes an overview and a comprehensive list of references to a set of tools drawn from high-dimensional statistics and recently developed learning theories concerning the neural tangent kernel (NTK) and the deep neural network's feature learning abilities.
BREAK II
Part IV: Emerging abilities, in-context learning, reasoning
IV.1 Scaling laws and emerging abilities:
We begin the last part of the tutorial by exploring the intriguing world of scaling laws and their direct implications on the emerging abilities of Transformers. Specifically, we will delve into how these scaling laws quantitatively impact the performance, generalization, and computational characteristics of Transformers as they increase in size and complexity. Additionally, we draw connections between the scaling laws and phase transitions, a concept familiar to the signal processing audience, elucidating via examples in the literature how Transformers' behavior undergoes critical shifts as they traverse different scales.
IV.2 In-context learning (ICL): Transformers as optimization algorithms
We delve into the remarkable capability of ICL, which empowers Transformers to engage in reasoning, adaptation, and problem-solving across a wide array of machine learning tasks through the use of straightforward language prompts, closely resembling human interactions. To illustrate this intriguing phenomenon, we will provide concrete examples spanning both language-based tasks and mathematically structured, analytically tractable tasks. Furthermore, we present findings that shed light on an intriguing perspective of in-context learning: the Transformer's capacity to autonomously learn and implement gradient descent steps at each layer of its architectural hierarchy. In doing so, we establish connections to deep-unfolding techniques, which have garnered popularity in applications such as wireless communications and solving inverse problems.
IV.3 Primer on Reasoning:
The compositional nature of human language allows us to express fine-grained tasks/concepts. Recent innovations such as prompt-tuning, instruction-tuning, and various prompting algorithms are enabling the same for language models and catalyzing their ability to accomplish complex multi-step tasks such as mathematical reasoning or code generation. Here, we first introduce important prompting strategies that catalyze reasoning such as chain-of-thought, tree-of-thought, and self-evaluation. We then demonstrate how these methods boost reasoning performance as well as the model’s ability to evaluate its own output, contributing to trustworthiness. Finally, by building on the ICL discussion, we introduce mathematical formalisms that shed light on how reasoning can be framed as “acquiring useful problem solving skills” and “composing these skills to solve new problems”.
Conclusions, outlook, and open problems
We conclude the tutorial by going over a list of important and exciting open problems related to the fundamental understanding of Transformer models, while emphasizing how this research creates opportunities for enhancing architecture and improving algorithms & techniques. This will bring the audience to the very forefront of fast-paced research in this area.
Presented by: Shiwei Liu, Olga Saukh, Zhangyang (Atlas) Wang, Arijit Ukil, and Angshul Majumdar
This tutorial will provide a comprehensive overview of recent breakthroughs of sparsity in the emerging area of large language models (LLMs), showcasing progress, posing challenges, and endeavoring to provide insights into improving the affordability and knowledge of LLMs through sparsity. The outline of this tutorial is fourfold: (1) a thorough overview/categorization of sparse neural networks; (2) the latest progress of LLM compression via sparsity; (3) the caveats of sparsity in LLMs; and finally (4) the benefits of sparsity beyond model efficiency.
The detailed outline is given below:
Tutorial Introduction. Presenter: Zhangyang (Atlas) Wang.
Part 1: Overview of sparse neural networks. Presenter: Shiwei Liu.
We will first provide a brief overview and categorization of existing works on sparse neural networks. As one of the most classical concepts in machine learning, the original goal of sparsity in neural networks is to reduce inference costs. However, the research focus on sparsity has undergone a significant shift from post-training sparsity to prior-training sparsity over the past few years, owing to the latter's promise of end-to-end resource savings from training to inference. Researchers have tackled many interlinked concepts such as pruning [13], the Lottery Ticket Hypothesis [14], Sparse Training [15,16], Pruning at Initialization [17], and Mixture of Experts [18]. Because this shift of interest occurred only in the last few years, the relationships among different sparse algorithms in terms of their scopes, assumptions, and approaches remain highly intricate and sometimes ambiguous. Providing a comprehensive and precise categorization of these approaches is timely for this newly shaped research community.
Part 2: Scaling up sparsity to LLMs: latest progress. Presenter: Shiwei Liu.
In the context of gigantic LLMs, sparsity is becoming even more appealing to accelerate both training and inference. We will showcase existing attempts that address sparse LLMs, encompassing weight sparsity, activation sparsity, and memory sparsity. For example, SparseGPT [8] and Essential Sparsity [9] shed light on prominent weight sparsity in LLMs, while "Lazy Neuron" [13] and "Heavy Hitter Oracle" [10] exemplify activation sparsity and token sparsity. In particular, Essential Sparsity identifies a consistent pattern across various settings: 30%-50% of the weights in LLMs can be removed by naive one-shot magnitude pruning without any significant drop in performance. Ultimately, these observations suggest that sparsity is also an emergent property of LLMs, with great potential to improve their affordability.
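The one-shot magnitude pruning referenced above can be sketched in a few lines. The helper below is an illustrative global variant (per-layer thresholds are also common); names and shapes are our own.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """One-shot global magnitude pruning: zero the smallest-magnitude fraction."""
    flat = np.concatenate([w.ravel() for w in weights])
    k = int(sparsity * flat.size)
    threshold = np.partition(np.abs(flat), k)[k]   # k-th order statistic of |w|
    return [np.where(np.abs(w) < threshold, 0.0, w) for w in weights]

rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 64)), rng.normal(size=(64, 10))]
pruned = magnitude_prune(layers, sparsity=0.4)
achieved = sum(int((w == 0).sum()) for w in pruned) / sum(w.size for w in pruned)
```

Despite its simplicity, this is exactly the "naive" baseline whose surprising effectiveness on LLMs motivates the Essential Sparsity observations.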
Coffee Break.
Part 3: The caveat of sparsity in LLMs: What tasks are we talking about? Presenter: Zhangyang (Atlas) Wang.
While sparsity has demonstrated its success in LLMs, the evaluations commonly used in the sparse-LLM literature are often restricted to simple datasets such as GLUE, SQuAD, WikiText-2, and PTB, and/or simple one-turn questions/instructions. Such (over-)simplified evaluations may camouflage unexpected predicaments of sparse LLMs. To depict the full picture, we highlight two recent works, SMC-Bench [11] and the "Junk DNA Hypothesis", that unveil the failures of (magnitude-based) pruned LLMs on harder language tasks, indicating a strong correlation between a model's "prunability" and the difficulty of its target downstream task.
Part 4: Sparsity beyond efficiency. Presenter: Olga Saukh.
In addition to efficiency, sparsity has been found to boost many other performance aspects such as robustness, uncertainty quantification, data efficiency, multitasking and task transferability, and interpretability [19]. We will mainly focus on recent progress in understanding the relation between sparsity and robustness. The research literature spans multiple subfields, including empirical and theoretical analyses of adversarial robustness [20], regularization against overfitting, and noisy-label resilience for sparse neural networks. By outlining these different aspects, we aim to offer a deep dive into how network sparsity affects the multi-faceted utility of neural networks in different scenarios.
Part 5: Demonstration and Hands-on Experience. Presenter: Shiwei Liu.
The Expo consists of three main components. First, an implementation tutorial, presented on a typical laptop, will offer step-by-step guidance in building and training sparse neural networks from scratch. Second, a demo will showcase how to prune LLaMA-7B on a single A6000 GPU. Third, we will create and maintain user-friendly open-source implementations for sparse LLMs, ensuring participants have ongoing resources at their disposal. To encourage ongoing engagement and learning, we will make all content and materials readily accessible through the tutorial websites.
Presented by: Moe Z. Win, Andrea Conti
The availability of real-time high-accuracy location awareness is essential for current and future wireless applications, particularly those involving the Internet-of-Things and the beyond-5G ecosystem. Reliable localization and navigation of people, objects, and vehicles – Localization-of-Things (LoT) – is a critical component of a diverse set of applications including connected communities, smart environments, vehicle autonomy, asset tracking, medical services, military systems, and crowd sensing. The coming years will see the emergence of network localization and navigation in challenging environments with sub-meter accuracy and minimal infrastructure requirements.
We will discuss the limitations of traditional positioning, and move on to the key enablers for high-accuracy location awareness. Topics covered will include: fundamental bounds, cooperative algorithms for 5G and B5G standardized scenarios, and network experimentation. Fundamental bounds serve as performance benchmarks, and as a tool for network design. Cooperative algorithms are a way to achieve dramatic performance improvements compared to traditional non-cooperative positioning. To harness these benefits, system designers must consider realistic operational settings; thus, we present the performance of B5G localization in 3GPP-compliant settings. We will also present LoT enablers, including reconfigurable intelligent surfaces, which promise to provide a dramatic gain in terms of localization accuracy and system robustness in next generation networks.
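As a concrete example of a fundamental bound used as a performance benchmark, the following sketch computes a 2-D position error bound from the Fisher information of TOA range measurements. The Gaussian ranging-error model, anchor geometry, and the helper name `toa_peb` are our illustrative assumptions, not material from the tutorial.

```python
import numpy as np

def toa_peb(anchors, position, range_var):
    """Position error bound: sqrt of the CRLB trace for TOA ranging to anchors."""
    J = np.zeros((2, 2))
    for a in anchors:
        d = position - a
        u = d / np.linalg.norm(d)           # unit vector from anchor to agent
        J += np.outer(u, u) / range_var     # Fisher information contribution
    return float(np.sqrt(np.trace(np.linalg.inv(J))))

# Four anchors at the corners of a 100 m square; agent at the center.
anchors = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [100.0, 100.0]])
peb = toa_peb(anchors, np.array([50.0, 50.0]), range_var=1.0)   # meters
```

Because the bound depends only on geometry and ranging accuracy, it doubles as a network-design tool: one can sweep candidate anchor placements and compare the resulting bounds before deploying any algorithm.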
Presented by: Huck Yang, Pin-Yu Chen, Hung-yi Lee, Kai-Wei Chang, Cheng-Han Chiang
Presented by: Nir Shlezinger, Sangwoo Park, Tomer Raviv, and Osvaldo Simeone
Wireless communication technologies are subject to escalating demands for connectivity, latency, and throughput. To facilitate meeting these performance requirements, emerging technologies such as mmWave and THz communication, holographic MIMO, spectrum sharing, and RISs are currently being investigated. While these technologies may support desired performance levels, they also introduce substantial design and operating complexity. For instance, holographic MIMO hardware is likely to introduce non-linearities on transmission and reception; the presence of RISs complicates channel estimation; and classical communication models may no longer apply in novel settings such as the mmWave and THz spectrum, due to violations of far-field assumptions and lossy propagation. These considerations notably affect transceiver design.
Traditional transceiver processing design is model-based, relying on simplified channel models that may no longer be adequate to meet the requirements of next-generation wireless systems. The rise of deep learning as an enabling technology for AI has revolutionized various disciplines, including computer vision and natural language processing (NLP). The ability of deep neural networks (DNNs) to learn mappings from data has spurred growing interest in their use for transceiver design. DNN-aided transceivers have the ability to succeed where classical algorithms may fail. They can learn a detection function in scenarios that lack a well-established physics-based mathematical model, a situation known as model deficit; or when the model is too complex to give rise to tractable and efficient model-based algorithms, a situation known as algorithm deficit.
Despite their promise, several core challenges arise from the fundamental differences between wireless communications and traditional AI domains such as computer vision and NLP. The first challenge stems from the nature of the devices employed in communication systems: wireless transceivers are highly constrained in their compute and power resources, while deep learning inherently relies on powerful hardware, e.g., high-performance computing servers. A second challenge stems from the nature of the wireless communication domain itself. Communication channels are dynamic, implying that the task, dictated by the data distribution, changes over time. This makes the standard pipeline of data collection, annotation, and training highly challenging. Specifically, DNNs rely on (typically labeled) data sets to learn from underlying unknown, but stationary, data distributions. This is not the case for wireless transceivers, whose processing task depends on the time-varying channel, restricting the size of the training data set representing the task. These challenges imply that successfully applying AI to transceiver design requires deviating from conventional deep learning approaches. To this end, there is a need to develop communication-oriented AI techniques that not only achieve high performance for a given channel, but are also lightweight, interpretable, flexible, and adaptive.
In the proposed tutorial we shall present, in a pedagogic fashion, the leading approaches for designing practical and effective deep transceivers that address the specific limitations imposed by the use of data- and resource-constrained wireless devices and by the dynamic nature of the communication channel. We advocate that AI-based wireless transceiver design requires revisiting the three main pillars of AI, namely, (i) the architecture of AI models; (ii) the data used to train AI models; and (iii) the training algorithm that optimizes the AI model for generalization, i.e., to maximize performance outside the training set (either on the same distribution or on a completely new one). For each of these AI pillars, we survey candidate approaches from the recent literature. We first discuss how to design lightweight trainable architectures via model-based deep learning. This methodology hinges on the principled incorporation of model-based processing, obtained from domain knowledge on optimized communication algorithms, within AI architectures. Next, we investigate how labeled data can be obtained without impairing spectral efficiency, i.e., without increasing the pilot overhead. We show how transceivers can generate labeled data by self-supervision, aided by existing communication algorithms, and how they may further enrich data sets via data augmentation techniques tailored to such data. We then cover training algorithms designed to meet requirements in terms of efficiency, reliability, and robust adaptation of wireless communication systems, avoiding overfitting to limited training data while limiting training time. These methods include communication-specific meta-learning as well as generalized Bayesian learning and modular learning.
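To illustrate the self-supervised label generation described above, here is a deliberately simplified sketch (BPSK over AWGN with a sign-decision detector; all names and parameters are our illustrative choices): decisions from a classical detector serve as pseudo-labels for training a learned receiver, and a constellation-symmetry augmentation doubles the data set with no extra pilot overhead.

```python
import numpy as np

rng = np.random.default_rng(1)
bits = rng.integers(0, 2, size=1000)
symbols = 2.0 * bits - 1.0                        # BPSK mapping {0,1} -> {-1,+1}
received = symbols + 0.3 * rng.normal(size=1000)  # AWGN channel

# A classical detector (sign decision) supplies pseudo-labels, with zero pilot overhead.
pseudo_labels = (received > 0).astype(int)
label_accuracy = float((pseudo_labels == bits).mean())

# Augmentation tailored to the data: constellation symmetry doubles the training set.
aug_x = np.concatenate([received, -received])
aug_y = np.concatenate([pseudo_labels, 1 - pseudo_labels])
```

At reasonable SNR the pseudo-labels are almost always correct, so a DNN-aided detector can be (re-)trained on `(aug_x, aug_y)` each time the channel changes, without transmitting additional pilots.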
Tutorial outline:
Presented by: Sijia Liu, Zhangyang Wang, Tianlong Chen, Pin-Yu Chen, Mingyi Hong, Wotao Yin
Part 1: Introduction of ZO-ML
Part 2: Foundations of ZO-ML
Break
Part 3: Applications of ZO-ML
Part 4: Demo Expo
Part 5: Conclusion and Q&A
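Since the outline above is terse, the following sketch conveys the core idea behind zeroth-order machine learning: gradients are estimated from function evaluations alone, enabling optimization of black-box objectives. The two-point estimator and toy objective below are our own illustration of the technique, not code from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

def zo_gradient(f, x, mu=1e-4, n_queries=64):
    """Two-point zeroth-order gradient estimator: needs only function values."""
    g = np.zeros_like(x)
    for _ in range(n_queries):
        u = rng.normal(size=x.shape)                      # random probe direction
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_queries

f = lambda x: float(np.sum(x ** 2))   # toy black-box objective; true gradient is 2x
x = np.array([1.0, -2.0, 3.0])
for _ in range(200):
    x -= 0.05 * zo_gradient(f, x)     # ZO gradient descent drives x toward 0
```

The estimator never touches the gradient of `f`, which is why ZO methods apply to black-box settings such as adversarial attacks on query-only models or tuning of non-differentiable pipelines.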
Presented by: Ehsan Variani, Georg Heigold, Ke Wu, Michael Riley
The first part of this talk focuses on the mathematical modeling of existing neural ASR criteria. We introduce a modular framework that can express all the existing criteria, such as Cross Entropy (CE), Connectionist Temporal Classification (CTC), Recurrent Neural Network Transducer (RNN-T), Hybrid Autoregressive Transducer (HAT), and Listen, Attend and Spell (LAS). We also introduce the LAttice-based Speech Transducer library (LAST), which provides efficient implementations of these criteria and allows the user to mix and match different components to create new training criteria. A simple Colab notebook engages the audience in using LAST to implement a simple ASR model on a digit recognition task.
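As one concrete instance of the criteria above, the CTC loss can be computed with the standard forward (alpha) recursion. The NumPy sketch below is our own illustration and does not use the LAST library API; it assumes a non-empty target sequence.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss via the standard forward (alpha) recursion.

    log_probs: (T, V) per-frame log posteriors; target: non-empty label list.
    """
    ext = [blank]                     # target interleaved with blanks
    for label in target:
        ext += [label, blank]
    T, S = log_probs.shape[0], len(ext)

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                      # stay on same symbol
            if s > 0:
                cands.append(alpha[t - 1, s - 1])          # advance by one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])          # skip over a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    return float(-np.logaddexp(alpha[-1, -1], alpha[-1, -2]))

# Toy check: two frames, vocabulary {blank, 1}, uniform posteriors, target [1].
# Paths collapsing to "1": (1,1), (blank,1), (1,blank) -> probability 3/4.
uniform = np.log(np.full((2, 2), 0.5))
nll = ctc_neg_log_likelihood(uniform, [1])
```

Swapping the transition rules in the inner loop is exactly the kind of component mix-and-match the modular framework formalizes, e.g., RNN-T and HAT replace the alignment lattice and the local posteriors while keeping the same dynamic-programming skeleton.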
The second half of the talk focuses on some practical problems in ASR modeling and some principled solutions. The problems are:
For all the problems above, the audience will have a chance to use the LAST library and the Colab notebook to evaluate the effectiveness of the solutions themselves during the tutorial.
Presented by: Xing Liu, Tarig Ballal, Jose A. Lopez-Salcedo, Gonzalo Seco-Granados, Tareq Al-Naffouri
A closer look at LEO constellations and their main characteristics: orbits, geometry, velocity, coverage, etc. We will focus in particular on signaling aspects such as modulation schemes, coding techniques, channel characteristics, and receiver design. We will contrast LEO attributes with those of GNSS, highlighting potential strengths and weaknesses.
In this section, we will cover the main techniques for PNT based on LEO satellite signals. We will distinguish between two main groups of methods:
We will discuss the pros and cons of each of the two categories. For each category, we will discuss the following topics:
We will conclude this section by presenting the relevant observation models.
The latter observation models will be used in the following section to develop specific techniques and algorithms for LEO-based PNT.
Here we will provide detailed descriptions of algorithms that can be used, or that have been proposed, for LEO PNT. We will establish a connection with GNSS-based techniques. We will cover the following topics:
In this section of the tutorial, we will present results from extensive simulations to highlight various aspects of LEO PNT. We will make our simulation codes freely accessible in the public domain.
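As a flavor of such simulation code, the following sketch solves for receiver position from noiseless pseudoranges via Gauss-Newton iterations, the workhorse shared by GNSS and LEO PNT. Receiver clock bias is omitted for brevity, and the satellite geometry and values are our illustrative assumptions.

```python
import numpy as np

def solve_position(sat_pos, pseudoranges, x0, iters=20):
    """Gauss-Newton positioning from pseudoranges (clock bias omitted)."""
    x = x0.astype(float)
    for _ in range(iters):
        d = np.linalg.norm(sat_pos - x, axis=1)
        H = (x - sat_pos) / d[:, None]        # Jacobian of ranges w.r.t. position
        dx, *_ = np.linalg.lstsq(H, pseudoranges - d, rcond=None)
        x = x + dx
    return x

# Illustrative LEO-like geometry (meters); noiseless ranges for the sanity check.
sats = np.array([[7.0e6, 0.0, 0.0], [0.0, 7.0e6, 0.0],
                 [0.0, 0.0, 7.0e6], [5.0e6, 5.0e6, 2.0e6]])
truth = np.array([1.0e6, 2.0e6, 0.5e6])
ranges = np.linalg.norm(sats - truth, axis=1)
est = solve_position(sats, ranges, x0=np.zeros(3))
```

In a full simulator one would add the receiver clock term as a fourth unknown, inject ranging noise, and optionally fuse Doppler measurements, but the iterative linearize-and-solve structure stays the same.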
The final part of the tutorial will highlight the most prominent research directions and challenges that might be of interest to the community.
Tutorial Summary highlighting the takeaway messages.
We will provide an extensive list of references.
Presented by: Byung-Jun Yoon, Youngjoon Hong
Generative AI models have emerged as a groundbreaking paradigm that can generate, modify, and interpret complex data patterns, ranging from images and sounds to structured datasets. In the realm of signal processing, these models have the potential to revolutionize how we understand, process, and leverage signals. Their capabilities span from the generation of synthetic datasets to the enhancement and restoration of signals, often achieving results that traditional methods cannot match. Thus, understanding and harnessing the power of generative AI is not just an academic endeavor; it is becoming imperative for professionals and researchers who aim to stay at the forefront of the signal processing domain.
The last few years have witnessed explosive growth in the development and adoption of generative AI models. With the introduction of architectures like GANs, VAEs, and newer transformer-based models, the AI research community is regularly setting new performance benchmarks. The signal processing community has also begun to exploit these advancements. The year 2024 presents a crucial juncture where the convergence of AI and signal processing is no longer a future possibility but an ongoing reality. Thus, a tutorial on this topic is not just timely but urgently needed.
While there have been numerous tutorials and courses on generative AI in the context of computer vision or natural language processing, its application in the pure signal and data processing domain is less explored. This tutorial is unique in its comprehensive approach, combining theory, practical methods, and a range of applications specifically tailored for the signal processing community. Attendees will not only learn the core concepts but will also gain a working grasp of both the theory and application of generative AI techniques.
Generative AI provides a fresh lens through which to approach longstanding challenges in signal processing. This tutorial will introduce:
In conclusion, by bridging the gap between the advancements in generative AI and the vast potential applications in signal processing, this tutorial promises to equip attendees with knowledge and tools that can redefine the boundaries of what's possible in the field.