Analysis of audio data to measure social interaction in the treatment of autism spectrum disorder using speaker diarization and identification
Format
Thesis
Abstract
Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects communication and behavior in social environments. Common characteristics of a person with ASD include difficulty communicating or interacting with others, restricted interests paired with repetitive behaviors, and other symptoms that may affect the person's overall social life. People with ASD endure a lower quality of life due to their inability to navigate their daily social interactions. Autism is referred to as a spectrum disorder because of the variation in the type and severity of symptoms. Measuring the social interaction of a person with ASD in a clinical setting is therefore inaccurate, because the tests are subjective, time-consuming, and not naturalistic. The goal of this study is to lay the foundation for passively collecting continuous audio data from people with ASD through a voice recorder application that runs in the background of their mobile device, and to propose a methodology for understanding and analyzing the collected audio data with minimal human intervention. Speaker Diarization and Speaker Identification are the two methods explored to answer essential questions that arise when processing unlabeled audio data: who spoke when, and to whom does a given speaker label belong? Speaker Diarization is the process of partitioning an audio signal that involves multiple people into homogeneous segments, each associated with one person. It answers the question "who spoke when?" The implemented Speaker Diarization algorithm uses state-of-the-art d-vector embeddings, which are produced by neural networks trained on large datasets so that variation in speech, accent, and acoustic conditions of the audio signal can be better accounted for. Furthermore, the algorithm uses a non-parametric, connection-based clustering algorithm commonly known as spectral clustering.
The spectral clustering algorithm is applied to the previously extracted d-vector embeddings to determine the number of unique speakers and to assign each portion of the audio file to a specific cluster. For Speaker Identification, after various experiments and trials, we chose Microsoft Azure Cognitive Services because of the robust algorithms and models it makes available for identifying speakers in unlabeled audio data. The Speaker Identification API from Microsoft Azure Cognitive Services provides a state-of-the-art service for identifying human voices through RESTful API calls. A simple web interface was implemented to send audio data to the Speaker Identification API, which returns its results in JSON format. The returned data answers the question "to whom does a given speaker label belong?" The proposed methods were tested extensively on numerous audio files containing varying numbers of speakers who emulate a realistic conversational exchange. The results support our goal of digitally measuring the social interaction of people with ASD through the analysis of audio data with minimal human intervention. We were able to identify our target speaker and differentiate them from other speakers in a given audio signal, a capability that could ultimately unlock valuable insights, such as a biomarker for measuring response to treatment.
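As a rough illustration of the clustering step described in the abstract, the sketch below applies spectral clustering to a set of embeddings, estimates the number of speakers, and assigns each segment to a cluster. It is not the thesis implementation. The embeddings are synthetic stand-ins for d-vectors, and the cosine affinity, the eigengap heuristic for choosing the speaker count, the simple k-means step, and all function names are assumptions made for this example.

```python
import numpy as np

def cosine_affinity(embeddings):
    """Pairwise cosine similarity, with negative values clipped to zero."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return np.clip(unit @ unit.T, 0.0, None)

def spectral_embedding(affinity, max_speakers):
    """Pick the speaker count k by the largest eigengap of the normalized
    Laplacian and return the row-normalized first k eigenvectors."""
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    normalized = d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    laplacian = np.eye(len(affinity)) - normalized
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues ascending
    gaps = np.diff(eigvals[: max_speakers + 1])
    k = int(np.argmax(gaps)) + 1
    rows = eigvecs[:, :k]
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    return k, rows / np.maximum(norms, 1e-12)

def kmeans(points, k, iterations=20):
    """Small deterministic k-means with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(iterations):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def cluster_speakers(embeddings, max_speakers=8):
    """Return (estimated speaker count, cluster label per segment)."""
    affinity = cosine_affinity(embeddings)
    k, rows = spectral_embedding(affinity, max_speakers)
    return k, kmeans(rows, k)

# Synthetic stand-in for d-vectors: 3 speakers, 20 segments each, 16 dimensions.
rng = np.random.default_rng(0)
true_labels = np.repeat(np.arange(3), 20)
embeddings = np.eye(3, 16)[true_labels] + 0.05 * rng.normal(size=(60, 16))

num_speakers, labels = cluster_speakers(embeddings)
print(num_speakers, labels)
```

In this sketch the speaker count is not supplied in advance; it is read off the spectrum of the affinity matrix, which corresponds to the abstract's point that clustering both determines the number of unique speakers and labels each portion of the audio. Real d-vectors are far less cleanly separated than this synthetic data, so an actual pipeline would likely need affinity refinement before the eigengap becomes reliable.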
Degree
M.S.
