Model Distillation for Deep Audio Representations
Bachelor thesis investigating how large audio representation models such as BEATs can be distilled into lightweight student networks for deployment on hearable devices.
This page presents the publicly shareable summary of my bachelor thesis, which was conducted in collaboration with the Applied Artificial Intelligence (AAI) research lab. Due to confidentiality constraints, only selected aspects of the work are published here. The thesis was supervised by Dr. Simone Lionetti and reviewed by Dr. Claudio Santelli (Sonova AG).
Problem Definition
Hearable steering requires understanding of the acoustic environment. Deep representation models such as BEATs achieve strong performance on acoustic understanding tasks, making them natural candidates for this purpose, but their size makes on-device deployment infeasible. The project investigates how the BEATs audio representation model can be reduced by a factor of ten to one hundred in parameters and operations so that it can run directly on an edge device while still producing high-quality audio embeddings.
Solution Concept
The chosen solution is Knowledge Distillation (KD). The project explores the effects of the major design choices in the distillation process: selecting effective distillation targets, identifying suitable datasets, and comparing different lightweight student architectures.

Four distillation objectives are examined in this work, three of which focus on preserving relational structure within the embedding space. Distance-wise distillation matches the distance between the embeddings of two samples xᵢ and xⱼ. Angle-wise distillation uses three samples xᵢ, xⱼ, xₖ and aligns the angle formed at xⱼ. Cosine-wise distillation compares the directions of the embedding vectors of xᵢ and xⱼ, matching the angle between them. Finally, feature-wise distillation operates on the embeddings directly by minimizing the element-wise difference between teacher and student outputs.

Two datasets are considered. The first is AudioSet, the dataset on which BEATs was originally trained. The second is an internal collection that combines several publicly available audio datasets. The two were also combined to test whether mixing large general-purpose audio data with smaller specialized datasets offers any additional benefit.

Two families of vision-based encoder architectures are investigated. MicroNet targets extreme efficiency through compact building blocks and carefully designed channel interactions, achieving strong ImageNet performance with very few parameters. MobileViT combines convolutions with lightweight transformer blocks, enabling both local feature extraction and global context modeling.
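The four objectives above can be sketched in code. The following is an illustrative numpy sketch, not the thesis implementation: function names and normalization choices are assumptions, loosely following the usual formulation of relational knowledge distillation.

```python
import numpy as np

def distance_wise_loss(t, s):
    """Distance-wise: match normalized pairwise distances between
    teacher embeddings t [N, D_t] and student embeddings s [N, D_s].
    Works even when D_t != D_s, since only distances are compared."""
    def pdist(x):
        d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        return d / d[d > 0].mean()  # scale-normalize by mean distance
    return np.mean((pdist(t) - pdist(s)) ** 2)

def angle_wise_loss(t, s):
    """Angle-wise: for each triplet (x_i, x_j, x_k), match the cosine
    of the angle formed at x_j across teacher and student spaces."""
    def angles(x):
        n, out = len(x), []
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    if len({i, j, k}) < 3:
                        continue
                    e1, e2 = x[i] - x[j], x[k] - x[j]
                    out.append(e1 @ e2 /
                               (np.linalg.norm(e1) * np.linalg.norm(e2)))
        return np.array(out)
    return np.mean((angles(t) - angles(s)) ** 2)

def cosine_wise_loss(t, s):
    """Cosine-wise: match pairwise cosine similarities (directions)."""
    def cos_mat(x):
        xn = x / np.linalg.norm(x, axis=1, keepdims=True)
        return xn @ xn.T
    return np.mean((cos_mat(t) - cos_mat(s)) ** 2)

def feature_wise_loss(t, s):
    """Feature-wise: element-wise MSE; requires matching dimensions."""
    return np.mean((t - s) ** 2)
```

Note that the three relational losses compare only geometric relations between samples, so they apply directly even when the teacher and student embedding dimensions differ; only the feature-wise loss requires matching dimensions.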
Evaluation is performed on the DEAR benchmark, which is designed specifically to quantify encoder quality for hearables across a set of relevant tasks. DEAR provides eight tasks that test different aspects of audio representation, including general scene context, characteristics of speech sources, and several technical acoustic properties.
Special Challenges
Training time limited the number of experiments, requiring careful handling of data preprocessing and storage. The difference in embedding dimension between the BEATs teacher and the small student models demanded considerable engineering effort. Additional complexity arose from integrating external source code for both the student models and the evaluation workflow. Furthermore, interpreting results on the DEAR benchmark proved difficult: many tasks allow multiple plausible interpretations of what constitutes “good” performance, and the relatively small dataset size makes the scores sensitive to artifacts in the data. As a result, model scores sometimes reflect these dataset-specific effects rather than genuine acoustic understanding.
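One common way to bridge a teacher–student dimension mismatch for element-wise objectives is a learned projection that maps student embeddings into the teacher's space and is discarded after training. The sketch below is a hypothetical illustration in plain numpy; the dimensions and the `LinearProjection` helper are assumptions, not the thesis code.

```python
import numpy as np

TEACHER_DIM = 768  # assumption: a typical transformer embedding size
STUDENT_DIM = 128  # assumption: a compact student embedding size

class LinearProjection:
    """Maps student embeddings into the teacher's embedding space so an
    element-wise (feature-wise) loss can be computed. Trained jointly
    with the student, then dropped at deployment time."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
        self.b = np.zeros(out_dim)

    def __call__(self, x):
        return x @ self.W + self.b

# Toy usage: compute a feature-wise loss despite mismatched dimensions.
proj = LinearProjection(STUDENT_DIM, TEACHER_DIM)
rng = np.random.default_rng(42)
student_emb = rng.standard_normal((4, STUDENT_DIM))
teacher_emb = rng.standard_normal((4, TEACHER_DIM))
loss = float(np.mean((proj(student_emb) - teacher_emb) ** 2))
```

The relational objectives described earlier sidestep this mismatch entirely, which is one practical argument in their favor.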
Results
Across all experiments a clear trend emerges. Compact student models can match or even surpass the BEATs teacher on several DEAR tasks when trained with relational distillation. Student architectures from the MobileViT v2 family in particular often perform at or above teacher level, while members of the MicroNet family reach competitive performance with significantly fewer FLOPs. These results demonstrate that high-quality audio embeddings do not require large encoders, as long as the distilled students preserve the teacher’s geometric structure.
Outlook
The work demonstrates that strong audio representations can be distilled from large networks into small ones; several lightweight student networks already outperform the teacher on DEAR, which is promising for edge AI technology. Future research could push model size down even further, potentially supported by Neural Architecture Search to explore parameter-efficient CNN-focused spaces. Quantization is a promising next step, as results from other works indicate that low-bit students may retain much of their performance while shrinking the required memory by a factor of five; significantly reducing the memory needed to store model weights improves deployability on constrained devices. Simple evaluation methods such as kNN on time-averaged spectrograms remain powerful tools that could support scenarios where even small encoders are too costly to deploy.
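To make the last point concrete, a kNN baseline of this kind needs no learned encoder at all: each clip is reduced to its time-averaged spectrogram and classified by nearest neighbours. A minimal sketch, with synthetic data standing in for real spectrograms (all names and shapes are illustrative assumptions):

```python
import numpy as np

def time_averaged_features(spectrograms):
    """Collapse each [time, freq] spectrogram to one frequency profile
    by averaging over time: a very cheap, fixed-size feature vector."""
    return np.stack([s.mean(axis=0) for s in spectrograms])

def knn_classify(train_feats, train_labels, query_feats, k=3):
    """Plain k-nearest-neighbour classification in feature space."""
    preds = []
    labels = np.asarray(train_labels)
    for q in query_feats:
        dists = np.linalg.norm(train_feats - q, axis=1)
        nearest = np.argsort(dists)[:k]
        preds.append(int(np.argmax(np.bincount(labels[nearest]))))
    return preds

# Synthetic demo: class 0 has energy in low bands, class 1 in high bands.
rng = np.random.default_rng(0)
def make_spec(cls):
    s = rng.random((20, 16)) * 0.1      # 20 time frames, 16 freq bins
    s[:, :8] += 1.0 if cls == 0 else 0.0
    s[:, 8:] += 1.0 if cls == 1 else 0.0
    return s

train = [make_spec(c) for c in (0, 0, 0, 1, 1, 1)]
labels = [0, 0, 0, 1, 1, 1]
queries = [make_spec(0), make_spec(1)]
preds = knn_classify(time_averaged_features(train), labels,
                     time_averaged_features(queries), k=3)
```

Such a baseline is far weaker than a distilled encoder, but its near-zero compute cost makes it a useful fallback and sanity check on extremely constrained hardware.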