SHAKEEL | DEEP LEARNING | SIGNAL PROCESSING | Résumé
.01

ABOUT

PERSONAL DETAILS
Tokyo, Japan
shakeel [at] outlook.it
Hello, and welcome to my personal and academic profile.

BIO

ABOUT ME

Currently, I am working as a Scientist at Honda Research Institute Japan Co., Ltd. and pursuing research in the field of automatic speech recognition (ASR).

Previously, I completed my Ph.D. in systems and control engineering under the supervision of Prof. Kazuhiro Nakadai in the Nakadai Lab at the Department of Systems and Control Engineering, Tokyo Institute of Technology.

My area of research during my Ph.D. was audio processing and multi-modal integration for artificial intelligence based applications.

I have a background in Artificial Intelligence, holding an M.Sc. degree in "Artificial Intelligence and Robotics" from Sapienza University of Rome and a B.Sc. in "Electronics" from COMSATS University Islamabad.

I have also worked under the supervision of Prof. Arshad Saleem Bhatti, where I was involved in the development of the read-out electronics of the Inner Tracking System (ITS) of the ALICE detector at CERN. ALICE is one of the four major experiments at CERN. My main task was to optimize the communication link between the sensor board and the first read-out electronics board.

During my master's studies, I had the opportunity to carry out my master's thesis in the field of Search and Rescue Robotics at the Graduate School of Information Sciences, Tohoku University, Japan, under the supervision of Prof. Satoshi Tadokoro and Prof. Daniele Nardi of Sapienza University of Rome.

.02

RESUME

ACADEMIC AND PROFESSIONAL POSITIONS
  • 11-2022
    Present
    TOKYO, JAPAN

    Scientist

    Honda Research Institute Japan Co., Ltd.

    Research and development of novel Automatic Speech Recognition architectures in collaboration with Carnegie Mellon University.
  • 12-2021
    08-2022
    TOKYO, JAPAN

    Research Internship

    Honda Research Institute Japan Co., Ltd.

    Research and development of novel Automatic Speech Recognition architectures in collaboration with Carnegie Mellon University.
  • 09-2019
    08-2022
    TOKYO, JAPAN

RESEARCH ASSISTANT (RA)

    NAKADAI LAB - TOKYO INSTITUTE OF TECHNOLOGY

    • Programming and data organization
  • 01-2016
    03-2019
    ISLAMABAD, PAKISTAN

    LECTURER

    COMSATS UNIVERSITY ISLAMABAD (CUI)

    • Development of clustering algorithm for ALPIDE chip data on Kintex-7 in collaboration with ALICE (CERN) and CUI.
    • Involved in teaching Electronics courses at the Bachelor level.
  • 09-2015
    12-2015
    ISLAMABAD, PAKISTAN

    RESEARCH ASSOCIATE

    COMSATS UNIVERSITY ISLAMABAD (CUI)

    • Development of clustering algorithm for ALPIDE chip data on Kintex-7 in collaboration with ALICE (CERN) and CUI.
  • 02-2015
    08-2015
    ROME, ITALY

    AI & MACHINE LEARNING ENGINEER

    SMART-I

    • Design, development, and optimization of Machine Learning algorithms for real-time traffic forecasting on the roads.
  • 02-2014
    05-2014
    ROME, ITALY

    MACHINE LEARNING INTERN

    SMART-I

    • Integrated support vector machines and Kalman filter to develop real-time prediction models, reducing energy consumption by 90% for pedestrian traffic and 80% for road traffic.
  • 09-2013
    12-2013
    SENDAI, JAPAN

    VISITING TRAINEE

    HUMAN-ROBOT INFORMATICS LAB - TOHOKU UNIVERSITY

    • Environmental sensing using a millimeter wave sensor equipped on a 3D rotation table.
EDUCATION
  • 2019
    2022
    TOKYO, JAPAN

    SYSTEMS AND CONTROL - PHD

    TOKYO INSTITUTE OF TECHNOLOGY

    Thesis title: Anomaly detection and classification in multi-sensor systems using deep learning
  • 2011
    2014
    ROME, ITALY

    MASTER’S DEGREE IN ARTIFICIAL INTELLIGENCE AND ROBOTICS ENGINEERING

    SAPIENZA UNIVERSITY OF ROME

    Thesis title: 3D Measurement and Environmental Sensing using Millimeter wave (MM wave) Sensor
  • 2006
    2010
    ISLAMABAD, PAKISTAN

    BACHELOR’S DEGREE IN ELECTRONICS

    COMSATS UNIVERSITY ISLAMABAD

    Thesis title: Multi-Task Autonomous Robot Using Reconfigurable Media (FPGA) and Machine Vision Techniques
.03

HONORS AND AWARDS

HONORS AND AWARDS
  • 12-2024
    12-2024
    MACAO, CHINA

    BEST PAPER AWARD

    IEEE Spoken Language Technology Workshop (SLT) 2024

    Contextualized Automatic Speech Recognition With Dynamic Vocabulary
  • 04-2019
    09-2022
    TOKYO, JAPAN

    MEXT RESEARCH SCHOLARSHIP

    TOKYO INSTITUTE OF TECHNOLOGY

    Fully funded Ph.D. Research Fellowship (Japanese Government Scholarship)
  • 09-2013
    12-2013
    SENDAI, JAPAN

    JASSO-Student Exchange Support Program

    JASSO

    Scholarship for short-term Study in Japan
  • 09-2013
    12-2013
    ROME, ITALY

    TOHOKU UNIVERSITY EXCHANGE PROGRAMME

    SAPIENZA UNIVERSITY OF ROME

    Student Mobility Scholarship for Master’s thesis
  • 2011
    2013
    ROME, ITALY

    ERASMUS MUNDUS (EU-NICE)

    SAPIENZA UNIVERSITY OF ROME

    EU scholarship for seeking Master’s degree in AI & Robotics
.04

PUBLICATIONS

PUBLICATIONS LIST
03 Dec 2024

Contextualized Automatic Speech Recognition With Dynamic Vocabulary

Macao, China

Won the Best Paper Award at the IEEE Spoken Language Technology Workshop (SLT) 2024

Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Shinji Watanabe
Conferences Selected
About The Publication
Deep biasing (DB) enhances the performance of end-to-end automatic speech recognition (E2E-ASR) models for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary. This naive sequence decomposition produces unnatural token patterns, significantly lowering their occurrence probability. More advanced techniques address this problem by expanding the vocabulary with additional modules, including the external language model shallow fusion or rescoring. However, they result in increasing the workload due to the additional modules. This paper proposes a dynamic vocabulary where bias tokens can be added during inference. Each entry in a bias list is represented as a single token, unlike a sequence of existing subword tokens. This approach eliminates the need to learn subword dependencies within the bias phrases. This method is easily applied to various architectures because it only expands the embedding and output layers in common E2E-ASR architectures. Experimental results demonstrate that the proposed method improves the bias phrase WER on English and Japanese datasets by 3.1 – 4.9 points compared with the conventional DB method.
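
For readers curious what the vocabulary expansion could look like in practice, below is a minimal PyTorch sketch of appending per-phrase "dynamic" tokens to a toy decoder's embedding and output layers. The class, the mean-pooled phrase embeddings, and all sizes are illustrative assumptions for this profile page, not the implementation described in the paper.

```python
# Illustrative sketch (not the paper's implementation): appending per-phrase
# "dynamic" tokens to the embedding and output layers of a toy decoder.
import torch
import torch.nn as nn

class DynamicVocabDecoder(nn.Module):
    def __init__(self, static_vocab: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(static_vocab, d_model)   # static subword embeddings
        self.out_proj = nn.Linear(d_model, static_vocab)   # static output layer
        self.phrase_embs = None                             # filled at inference time

    def register_bias_phrases(self, phrase_subword_ids: list[list[int]]):
        # One embedding per bias phrase, e.g. mean-pooled over its subword embeddings.
        # (The actual method learns these representations; pooling is only a stand-in.)
        with torch.no_grad():
            self.phrase_embs = torch.stack(
                [self.embed(torch.tensor(ids)).mean(0) for ids in phrase_subword_ids]
            )

    def logits(self, hidden: torch.Tensor) -> torch.Tensor:
        static_logits = self.out_proj(hidden)                     # (B, T, V_static)
        if self.phrase_embs is None:
            return static_logits
        dynamic_logits = hidden @ self.phrase_embs.T              # (B, T, N_bias)
        return torch.cat([static_logits, dynamic_logits], dim=-1) # expanded vocabulary

# Usage: a bias list with two phrases, each given as existing subword ids.
dec = DynamicVocabDecoder(static_vocab=1000)
dec.register_bias_phrases([[17, 42, 313], [5, 908]])
print(dec.logits(torch.randn(2, 7, 256)).shape)  # torch.Size([2, 7, 1002])
```
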
01 Sep 2024

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

Kos Island, Greece

Accepted in Interspeech 2024

Muhammad Shakeel, Yifan Peng, Yui Sudo, Shinji Watanabe
Conferences Selected
About The Publication
Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediate layers may better align text tokens or audio frames with the desired objectives. Our proposed intermediate biasing loss brings more regularization and contextualization to the network. Our method outperforms a conventional contextual biasing baseline on the LibriSpeech corpus, achieving a relative improvement of 22.5% in biased word error rate (B-WER) and up to 44% compared to the non-contextual baseline with a biasing list size of 100. Moreover, employing RNN-transducer-driven joint decoding further reduces the unbiased word error rate (U-WER), resulting in a more robust network.
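
The general idea of attaching an auxiliary biasing loss to intermediate encoder layers can be sketched as below; the layer indices, the shared bias head, and the 0.3 weighting are placeholder assumptions rather than the paper's configuration.

```python
# Minimal sketch of an auxiliary loss at intermediate encoder layers
# (weights and layer choices here are illustrative, not the paper's values).
import torch
import torch.nn as nn

d_model, vocab = 256, 500
encoder_layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
                                for _ in range(6)])
bias_head = nn.Linear(d_model, vocab)   # shared head predicting bias-phrase tokens
aux_layer_ids = {2, 4}                  # intermediate layers receiving the auxiliary loss
criterion = nn.CrossEntropyLoss()

def forward_with_intermediate_bias_loss(feats, bias_targets, final_loss_fn, aux_weight=0.3):
    x, aux_losses = feats, []
    for i, layer in enumerate(encoder_layers):
        x = layer(x)
        if i in aux_layer_ids:
            logits = bias_head(x)                                    # (B, T, V)
            aux_losses.append(criterion(logits.transpose(1, 2), bias_targets))
    main_loss = final_loss_fn(x)         # stand-in for the final CTC/attention objective
    return main_loss + aux_weight * sum(aux_losses) / len(aux_losses)

# Toy usage with random tensors and a placeholder final loss.
feats = torch.randn(2, 50, d_model)
bias_targets = torch.randint(0, vocab, (2, 50))
loss = forward_with_intermediate_bias_loss(feats, bias_targets, lambda h: h.pow(2).mean())
loss.backward()
```
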
16 Aug 2024

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

Bangkok, Thailand

The 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)

Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe
Conferences
About The Publication
There has been an increasing interest in large speech models that can perform multiple speech processing tasks in a single model. Such models usually adopt the encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID). Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 25% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference. OWSM-CTC also improves the long-form ASR result with 20x speed-up. We will publicly release our codebase, pre-trained model, and training logs to promote open science in speech foundation models.
14 Apr 2024

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Seoul, Korea

2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)

Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe
Conferences Selected
About The Publication
End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes the speech frames in real-time as it is being received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Primarily, we study 1) the encoder integration of these ASR modules, followed by 2) separate decoders to make the switching mode flexible, and enhancing performance by 3) incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7% relative CERRs for non-streaming ASR within a single model compared to multiple standalone modules.
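
The snippet below sketches one common form of similarity-preserving knowledge distillation, matching batch-wise similarity matrices of two encoders' activations (here standing in for the streaming and non-streaming branches); the function name, tensor shapes, and the MSE penalty are assumptions for illustration, not the exact loss used in the paper.

```python
# Sketch of a similarity-preserving distillation term between two encoders'
# activations; tensor shapes are illustrative.
import torch
import torch.nn.functional as F

def similarity_preserving_kd(h_student: torch.Tensor, h_teacher: torch.Tensor) -> torch.Tensor:
    # h_*: (B, T, D) encoder outputs. Build batch-wise similarity matrices and
    # penalize their difference, so the two modes keep a similar example geometry.
    def gram(h):
        flat = h.reshape(h.size(0), -1)      # (B, T*D)
        g = flat @ flat.T                    # (B, B) pairwise similarities
        return F.normalize(g, p=2, dim=1)
    return F.mse_loss(gram(h_student), gram(h_teacher.detach()))

loss_kd = similarity_preserving_kd(torch.randn(4, 30, 256), torch.randn(4, 30, 256))
print(loss_kd.item())
```
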
01 Sep 2024

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Kos Island, Greece

Accepted in Interspeech 2024

Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe
Conferences
About The Publication
Recent studies have advocated for fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI’s Whisper using publicly available data and open-source toolkits. With the aim of reproducing Whisper, the previous OWSM v1 through v3 models were still based on Transformer, which might lead to inferior performance compared to other state-of-the-art speech encoders. In this work, we aim to improve the performance and efficiency of OWSM without extra training data. We present E-Branchformer based OWSM v3.1 models at two scales, i.e., 100M and 1B. The 1B model is the largest E-Branchformer based speech model that has been made publicly available. It outperforms the previous OWSM v3 in a vast majority of evaluation benchmarks, while demonstrating up to 25% faster inference speed. We publicly release the data preparation scripts, pre-trained models and training logs.
14 Apr 2024

Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

Seoul, Korea

2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, Shinji Watanabe
Conferences
About The Publication
End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to improve the contextualization performance during inference further, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that the proposed method consistently improves the word error rate and the character error rate of the target phrases in the bias list on both the Librispeech-960 (English) and our in-house (Japanese) dataset, respectively.
16 Dec 2023

Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data

Taipei, Taiwan

2023 IEEE Workshop on Automatic Speech Recognition and Understanding

Yifan Peng, Jinchuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe
Conferences
About The Publication
Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for developing such models (from data collection to training) is not publicly accessible, which makes it difficult for researchers to further improve its performance and address training-related issues such as efficiency, robustness, fairness, and bias. This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data. OWSM even supports more translation directions and can be more efficient to train. We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pretrained models and training logs to promote open science.
20 Aug 2023

Time-synchronous one-pass beam search for parallel online and offline transducers with dynamic block training

Dublin, Ireland

Accepted in Interspeech 2023

Yui Sudo, Muhammad Shakeel, Yifan Peng, Shinji Watanabe
Conferences
About The Publication
End-to-end automatic speech recognition (ASR) has become an increasingly popular area of research, with two main models being online and offline ASR. Online models aim to provide real-time transcription with minimal latency, whereas offline models wait until the end of the speech utterance before generating a transcription. In this work, we explore three techniques to maximize the performance of each model by 1) proposing a joint parallel online and offline architecture for transducers; 2) introducing dynamic block (DB) training, which allows flexible block size selection and improves the robustness for the offline mode; and, 3) proposing a novel time-synchronous one-pass beam search using the online and offline decoders to further improve the performance of the offline mode. Experimental results show that the proposed method consistently improves the character/word error rates on the CSJ and LibriSpeech datasets.
20 Aug 2023

DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models

Dublin, Ireland

Accepted in Interspeech 2023

Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe
Conferences
About The Publication
Self-supervised learning (SSL) has achieved notable success in many speech processing tasks, but the large model size and heavy computational cost hinder the deployment. Knowledge distillation trains a small student model to mimic the behavior of a large teacher model. However, the student architecture usually needs to be manually designed and will remain fixed during training, which requires prior knowledge and can lead to suboptimal performance. Inspired by recent success of task-specific structured pruning, we propose DPHuBERT, a novel task-agnostic compression method for speech SSL based on joint distillation and pruning. Experiments on SUPERB show that DPHuBERT outperforms pure distillation methods in almost all tasks. Moreover, DPHuBERT requires little training time and performs well with limited training data, making it suitable for resource-constrained applications. Our method can also be applied to various speech SSL models. Our code and models will be publicly available.
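
As a rough illustration of combining distillation with prunable structure, the sketch below pairs a layer-wise distillation term with an L1 penalty on learnable gates; the gate formulation is a simplified stand-in for the structured L0/hard-concrete pruning typically used in such methods, and all names, sizes, and weights are assumed.

```python
# Conceptual sketch: layer-wise distillation loss plus a sparsity penalty on
# learnable structure gates (a stand-in for DPHuBERT-style joint pruning).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedStudentLayer(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.ff = nn.Linear(d_model, d_model)
        self.gate = nn.Parameter(torch.ones(d_model))   # near-zero units can be pruned away

    def forward(self, x):
        return F.relu(self.ff(x)) * self.gate

def distill_and_prune_loss(student_outs, teacher_outs, gates, sparsity_weight=1e-3):
    distill = sum(F.l1_loss(s, t.detach()) for s, t in zip(student_outs, teacher_outs))
    sparsity = sum(g.abs().mean() for g in gates)        # pushes gates toward zero
    return distill + sparsity_weight * sparsity

# Toy usage: two student layers mimicking two (frozen) teacher layer outputs.
layers = nn.ModuleList([GatedStudentLayer(), GatedStudentLayer()])
x = torch.randn(2, 40, 256)
student_outs, h = [], x
for layer in layers:
    h = layer(h)
    student_outs.append(h)
teacher_outs = [torch.randn_like(o) for o in student_outs]
loss = distill_and_prune_loss(student_outs, teacher_outs, [l.gate for l in layers])
loss.backward()
```
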
20 Aug 2023

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Dublin, Ireland

Accepted in Interspeech 2023

Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe
Conferences
About The Publication
End-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and mask-predict models. There are pros and cons to each of these architectures, and thus practitioners may switch between these different models depending on application requirements. Instead of building separate models, we propose a joint modeling scheme where four different decoders (CTC, attention, RNN-T, mask-predict) share an encoder – we refer to this as 4D modeling. Additionally, we propose to 1) train 4D models using a two-stage strategy which stabilizes multitask learning and 2) decode 4D models using a novel time-synchronous one-pass beam search. We demonstrate that jointly trained 4D models improve the performances of each individual decoder. Further, we show that our joint CTC/RNN-T/attention decoding surpasses the previously proposed CTC/attention decoding.
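
The joint objective can be pictured as a weighted sum of the four decoder losses computed on one shared encoder output, as in the toy sketch below; the weights and the placeholder loss callables are illustrative only and do not reflect the paper's two-stage training settings.

```python
# Illustrative multitask combination of four decoder losses over one shared
# encoder output (weights and dummy losses are placeholders).
import torch

def four_decoder_loss(enc_out, losses: dict, weights=None):
    # losses: mapping from decoder name to a callable taking the encoder output.
    weights = weights or {"ctc": 0.3, "attention": 0.3, "rnnt": 0.3, "maskpredict": 0.1}
    return sum(weights[name] * fn(enc_out) for name, fn in losses.items())

# Toy usage with dummy per-decoder loss functions.
enc_out = torch.randn(2, 50, 256, requires_grad=True)
dummy = lambda h: h.pow(2).mean()
total = four_decoder_loss(enc_out, {"ctc": dummy, "attention": dummy,
                                    "rnnt": dummy, "maskpredict": dummy})
total.backward()
```
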
17 Jan 2023

FPGA based Power-Efficient Edge Server to Accelerate Speech Interface for Socially Assistive Robotics

Atlanta, GA, USA

2023 IEEE/SICE International Symposium on System Integration

Haris Gulzar, Muhammad Shakeel, Katsutoshi Itoyama, Kazuhiro Nakadai, Kenji Nishida, Hideharu Amano, Takeharu Eda
Conferences
About The Publication
Socially Assistive Robotics (SAR) is a sustainable solution for the growing elderly and disabled population requiring proper care and supervision. Internet of Things (IoT) and Edge Computing can leverage SAR by providing in-house computation of connected devices and offering a secure, autonomous, and power-efficient framework. In this study, we have proposed using a System-on-Chip (SoC) based device as an edge server, which provides a local speech recognition interface for connected IoT devices in the targeted area. Convolutional Neural Network (CNN) is used to detect a set of frequently used speech commands which are useful to control home appliances and interact with assistive robots. Proposed CNN achieves state-of-the-art accuracy with a meager computing budget. It delivers 96.14% accuracy with a 20X smaller number of parameters and 137X fewer Floating Point Operations (FLOPS) compared to similarly performing CNN networks. To address the challenge of latency requirement for practical applications, parallelization of CNN helped to achieve 6.67X times faster inference speed than its base implementation. Lastly, implementing CNN on SoC-based edge device achieved at least 5X and 7X reduction in net power consumption compared to GPU and CPU devices respectively.
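
To give a sense of the kind of compact keyword-spotting CNN such an edge server hosts, here is a tiny PyTorch example operating on log-mel patches; the architecture, sizes, and the 12-command output are assumptions and do not correspond to the network proposed in the paper.

```python
# A small CNN for classifying fixed-length log-mel patches into speech commands;
# purely illustrative of the setup, not the paper's network.
import torch
import torch.nn as nn

class TinyKWSNet(nn.Module):
    def __init__(self, n_commands=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_commands)

    def forward(self, x):            # x: (B, 1, n_mels, frames)
        return self.classifier(self.features(x).flatten(1))

logits = TinyKWSNet()(torch.randn(8, 1, 40, 98))
print(logits.shape)                  # torch.Size([8, 12])
```
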
18 Sep 2022

Streaming Automatic Speech Recognition with Re-blocking Processing Based on Integrated Voice Activity Detection

Incheon, Korea

Accepted in Interspeech 2022

Yui Sudo, Muhammad Shakeel, Kazuhiro Nakadai, Jiatong Shi, Shinji Watanabe
Conferences
About The Publication
This paper proposes streaming automatic speech recognition (ASR) with re-blocking processing based on integrated voice activity detection (VAD). End-to-end (E2E) ASR models are promising for practical ASR. One of the key issues in realizing such a system is the detection of voice segments to cope with streaming input. There are three challenges for speech segmentation in streaming applications: 1) the extra VAD module in addition to the ASR model increases the system complexity and the number of parameters, 2) inappropriate segmentation of speech for block-based streaming methods deteriorates the performance, 3) non-voice segments that are not discarded result in unnecessary computational costs. This paper proposes a model that integrates a VAD branch into a block processing-based streaming ASR system and a re-blocking technique to avoid inappropriate isolation of the utterances. Experiments show that the proposed method reduces the detection error rate (ER) by 25.8% on the AMI dataset with less than a 1% increase in the number of parameters. Furthermore, the proposed method shows a 7.5% relative improvement in character error rate (CER) on the CSJ dataset with a 27.3% reduction in real-time factor (RTF).
20 Feb 2022

3D Convolution Recurrent Neural Networks for Multi-Label Earthquake Magnitude Classification

TOKYO, JAPAN

MDPI

Muhammad Shakeel, Kenji Nishida, Katsutoshi Itoyama, Kazuhiro Nakadai
Journal Paper Selected
About The Publication
We examine a classification task in which signals of naturally occurring earthquakes are categorized ranging from minor to major, based on their magnitude. Generalized to a single-label classification task, most prior investigations have focused on assessing whether an earthquake’s magnitude falls into the minor or large categories. This procedure is often not practical since the tremor it generates has a wide range of variation in the neighboring regions based on the distance, depth, type of surface, and several other factors. We present an integrated 3-dimensional convolutional recurrent neural network (3D-CNN-RNN) trained to classify the seismic waveforms into multiple categories based on the problem formulation. Recent studies demonstrate using artificial intelligence-based techniques in earthquake detection and location estimation tasks with progress in collecting seismic data. However, less work has been performed in classifying the seismic signals into single or multiple categories. We leverage the use of a benchmark dataset comprising of earthquake waveforms having different magnitude and present 3D-CNN-RNN, a highly scalable neural network for multi-label classification problems. End-to-end learning has become a conventional approach in audio and image-related classification studies. However, for seismic signals classification, it has yet to be established. In this study, we propose to deploy the trained model on personal seismometers to effectively categorize earthquakes and increase the response time by leveraging the data-centric approaches. For this purpose, firstly, we transform the existing benchmark dataset into a series of multi-label examples. Secondly, we develop a novel 3D-CNN-RNN model for multi-label seismic event classification. Finally, we validate and evaluate the learned model with unseen seismic waveforms instances and report whether a specific event is associated with a particular class or not. Experimental results demonstrate the superiority and effectiveness of the proposed approach on unseen data using the multi-label classifier.
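
The multi-label formulation can be illustrated with a sigmoid output and binary cross-entropy, so that one waveform may activate several magnitude bins at once; the stand-in model, window shape, and 0.5 decision threshold below are assumptions, not the 3D-CNN-RNN itself.

```python
# Sketch of the multi-label setup: per-class sigmoid + binary cross-entropy
# instead of a single softmax (model and thresholds here are placeholders).
import torch
import torch.nn as nn

n_classes = 5                                        # e.g. magnitude bins from minor to major
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 600, 128), nn.ReLU(),
                      nn.Linear(128, n_classes))     # stand-in for the 3D-CNN-RNN
criterion = nn.BCEWithLogitsLoss()

waveforms = torch.randn(4, 3, 600)                   # 3-component seismic windows (toy shape)
targets = torch.tensor([[1, 1, 0, 0, 0],             # one example may belong to several bins
                        [0, 1, 1, 0, 0],
                        [0, 0, 0, 1, 1],
                        [1, 0, 0, 0, 0]], dtype=torch.float32)

loss = criterion(model(waveforms), targets)
preds = (torch.sigmoid(model(waveforms)) > 0.5).int()  # per-class decision, not argmax
print(loss.item(), preds.shape)
```
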
17 Jan 2023

Metric-based multimodal meta-learning for human movement identification via footstep recognition

Tokyo, JAPAN

2023 IEEE/SICE International Symposium on System Integration

Muhammad Shakeel, Katsutoshi Itoyama, Kenji Nishida, Kazuhiro Nakadai
Conferences Selected
About The Publication
We describe a novel metric-based learning approach that introduces a multimodal framework and uses deep audio and geophone encoders in siamese configuration to design an adaptable and lightweight supervised model. This framework eliminates the need for expensive data labeling procedures and learns general-purpose representations from low multisensory data obtained from omnipresent sensing systems. These sensing systems provide numerous applications and various use cases in activity recognition tasks. Here, we intend to explore the human footstep movements from indoor environments and analyze representations from a small self-collected dataset of acoustic and vibration-based sensors. The core idea is to learn plausible similarities between two sensory traits and combining representations from audio and geophone signals. We present a generalized framework to learn embeddings from temporal and spatial features extracted from audio and geophone signals. We then extract the representations in a shared space to maximize the learning of a compatibility function between acoustic and geophone features. This, in turn, can be used effectively to carry out a classification task from the learned model, as demonstrated by assigning high similarity to the pairs with a human footstep movement and lower similarity to pairs containing no footstep movement. Performance analyses show that our proposed multimodal framework achieves a 19.99% accuracy increase (in absolute terms) and avoided overfitting on the evaluation set when the training samples were increased from 200 pairs to just 500 pairs while satisfactorily learning the audio and geophone representations. Our results employ a metric-based contrastive learning approach for multi-sensor data to mitigate the impact of data scarcity and perform human movement identification with limited data size.
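
A minimal two-branch sketch of the idea, with separate audio and geophone encoders mapped into a shared space under a margin-based contrastive loss, is given below; the encoder sizes, feature dimensions, and margin are illustrative assumptions rather than the paper's configuration.

```python
# Sketch of a siamese-style setup: audio and geophone encoders in a shared
# embedding space trained with a margin-based contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchEncoder(nn.Module):
    def __init__(self, in_dim, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(z_a, z_g, label, margin=0.5):
    # label = 1: the pair contains a footstep event (pull embeddings together);
    # label = 0: no footstep (push them apart by at least `margin`).
    d = (z_a - z_g).pow(2).sum(-1)
    return (label * d + (1 - label) * F.relu(margin - d.sqrt()).pow(2)).mean()

audio_enc, geo_enc = BranchEncoder(in_dim=200), BranchEncoder(in_dim=100)
audio_feat, geo_feat = torch.randn(8, 200), torch.randn(8, 100)
labels = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(audio_enc(audio_feat), geo_enc(geo_feat), labels)
loss.backward()
```
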
21 Oct 2021

CASE: CNN Acceleration for Speech-Classification in Edge-Computing

Hempstead, NY, USA

2021 IEEE Cloud Summit (Cloud Summit)

Haris Gulzar, Muhammad Shakeel, Katsutoshi Itoyama, Kenji Nishida, Kazuhiro Nakadai
Conferences
About The Publication
High performance of Machine Learning algorithms has enabled numerous applications based upon speech interface in our daily life, but most of the frameworks use computationally expensive algorithms deployed on cloud servers as speech recognition engines. With the recent surge in the number of IoT devices, a robust and scalable solution for enabling AI applications on IoT devices is inevitable in form of edge computing. In this paper, we propose the application of System-on-Chip (SoC) powered edge computing device as accelerator for speech commands classification using Convolutional Neural Network (CNN). Different aspects affecting the CNN performance are explored and an efficient and light-weight model named as CASENet is proposed which achieves state-of-the-art performance with significantly smaller number of parameters and operations. Efficient extraction of useful features from audio signal helped to maintain high accuracy with a 6X smaller number of parameters, making CASENet the smallest CNN in comparison to similarly performing networks. Light-weight nature of the model has led to achieve 96.45% validation accuracy with a 14X smaller number of operations which makes it ideal for low-power IoT and edge devices. A CNN accelerator is designed and deployed on FPGA part of SoC equipped edge server device. The hardware accelerator helped to improve the inference latency of speech command by a 6.7X factor as compared to standard implementation. Memory, computational cost and latency are the most important metrics for selecting a model to deploy on edge computing devices, and CASENet along with the accelerator surpasses all of these requirements.
01 Apr 2021

Detecting earthquakes: a novel deep learning-based approach for effective disaster response

TOKYO - JAPAN

Applied Intelligence

Muhammad Shakeel, Katsutoshi Itoyama, Kenji Nishida, Kazuhiro Nakadai
Journal Paper Selected
About The Publication
In the present study, we present an intelligent earthquake signal detector that provides added assistance to automate traditional disaster responses. To effectively respond in a crisis scenario, additional sensors and automation are always necessary. Deep learning has achieved success in various low signal-to-noise ratio tasks, which motivated us to propose a novel 3-dimensional (3D) CNN-RNN-based earthquake detector from a demonstration paradigm to real-time implementation. Data taken from the Stanford Earthquake Dataset (STEAD) are used to train the network. After preprocessing the raw earthquake signals, features such as log-mel spectrograms are extracted. Once the model has learned spatial and temporal information from low-frequency earthquake waves, it can be employed in real time to distinguish small and large earthquakes from seismic noise with an accuracy, sensitivity, and specificity of 99.057%, 98.488%, and 99.621%, respectively. We also observe that the choice of filters in log-mel spectrogram impacts the results much more than the model complexity. Furthermore, we implement and test the model on data collected continuously over two months by a personal seismometer in the laboratory. The inference speed for a single prediction is 2.27 seconds, and the system delivers a stable detection of all 63 major earthquakes from November 2019 to December 2019 reported by the Japan Meteorological Agency.
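
For a sense of the log-mel front end, the following torchaudio sketch computes log-mel spectrograms from a seismic window; the 100 Hz sampling rate, FFT size, and mel-band count are example values chosen for illustration, not necessarily those used in the study.

```python
# Sketch of a log-mel front end for seismic windows (all parameters are
# illustrative choices, not the study's exact settings).
import torch
import torchaudio

sample_rate = 100                       # typical seismometer rate, in Hz
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=64, hop_length=16, n_mels=16)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, 60 * sample_rate)      # one 60-second single-channel window
log_mel = to_db(to_mel(waveform))                # (1, n_mels, frames)
print(log_mel.shape)
```
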
12 Jan 2021

EMC : Earthquake Magnitudes Classification on Seismic Signals via Convolutional Recurrent Networks

IWAKI, FUKUSHIMA - JAPAN

IEEE/SICE International Symposium on System Integration (SII 2021)

Muhammad Shakeel, Katsutoshi Itoyama, Kenji Nishida, Kazuhiro Nakadai
Conferences Selected
About The Publication
We propose a novel framework for reliable automatic classification of earthquake magnitudes. Using deep learning methods, we aim to classify the earthquake magnitudes into different categories. The method is based on a convolutional recurrent neural network in which a new approach for feature extraction using Log-Mel spectrogram representations is applied for seismic signals. The neural network is able to classify earthquake magnitudes from minor to major. Stanford Earthquake Dataset (STEAD) is used to train and validate the proposed method. The evaluation results demonstrate the efficacy of the proposed method in a rigorous event-independent scenario, which can reach an F-score of 67% depending upon the earthquake magnitude.
18 Oct 2015

Environmental sensing using millimeter wave sensor for extreme conditions

West Lafayette, IN - USA

2015 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR)

Muhammad Shakeel, Daniele Nardi, Kazunori Ohno, Satoshi Tadokoro
Conferences Selected
About The Publication
The aim of this work is to investigate rotational millimeter wave sensor for autonomous navigation in extreme environmental and weather conditions. The proposed research describes the development and deployment of efficient millimeter wave radar system for mobile robotics as a primary range sensor to allow for localization and mapping. In fact, in highly demanding weather conditions, conventional sensors for outdoor navigation, like ultrasound, lasers and optical sensors have known limitations, that may be at least overcome using the Millimeter Wave Radars. Specifically, our contribution focuses on the extraction of range and velocity information from the data collected by rotating the antennas of a millimeter wave radar at different speed profiles: 360° rps, 180° rps and 36° rps. Experimental validation was performed in foggy conditions to test the robustness of the proposed sensor in a realistic scenario, showing that distances can be acquired in the range of 1 to 40m, at a suitable rate.
11 Jan 2021

Assessment of a Beamforming Implementation Developed for Surface Sound Source Separation

IWAKI, FUKUSHIMA - JAPAN

IEEE/SICE International Symposium on System Integration (SII 2021)

Zhi Zhong, Muhammad Shakeel, Katsutoshi Itoyama, Kenji Nishida, Kazuhiro Nakadai
Conferences
About The Publication
This paper presents the assessment of a scan-and-sum beamformer by numerical simulations. The scan-and-sum beamformer has been proposed and analyzed theoretically, in concern with sound source separation of general wide band surface sources distributed in the azimuth angle dimension. Sound sources emitted by regions are called surface sources, and tend to have various shapes and sizes, e.g., a waterfall or an orchestra on the stage. Conventionally a sound source is modelled as a point source that is without a shape or size, hence conventional beamformers are mainly designed for point source separation. A scan-and-sum beamformer deploys a conventional beamformer as the sub-beamformer, and scans the region where a target surface source exists at an appropriate scanning density. The separated surface source is formed through a weighted summation of sub-beamformers. Implementations based on the MVDR sub-beamformer are presented under a framework that reduces overlapping calculation of inverse correlation matrices. For inverse correlation estimation, two methods are provided, one is a block-wise processing which further reduces computational cost, and the other is RLS-based inverse matrix calculation which displays strength in accuracy of estimation. A self-designed diverse dataset having various mixtures of surface sound sources is also created to carry out extensive numerical simulations for a detailed comparison between the scan-and-sum beamformer method and the conventional MVDR approach. Simulations validated the efficiency and effectiveness of the current implementation, and showed that the proposed scan-and-sum beamformer outperforms a conventional one in surface sound source separation.
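
The per-direction MVDR weights and the weighted summation over scanned directions can be sketched in a few lines of NumPy, as below; the steering vectors, uniform summation weights, and array geometry are illustrative assumptions rather than the paper's implementation.

```python
# NumPy sketch: MVDR sub-beamformer weights per scan direction, then a weighted
# sum over scanned directions as a toy stand-in for scan-and-sum beamforming.
import numpy as np

def mvdr_weights(R: np.ndarray, d: np.ndarray) -> np.ndarray:
    # w = R^{-1} d / (d^H R^{-1} d), with R the spatial correlation matrix
    # and d the steering vector toward one scanned direction.
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

n_mics, n_frames = 8, 200
X = np.random.randn(n_mics, n_frames) + 1j * np.random.randn(n_mics, n_frames)
R = X @ X.conj().T / n_frames + 1e-3 * np.eye(n_mics)      # regularized correlation matrix

scan_dirs = [np.exp(-1j * np.pi * np.arange(n_mics) * np.sin(theta))
             for theta in np.linspace(-0.2, 0.2, 5)]        # steering vectors over the scanned region
scan_weights = np.ones(len(scan_dirs)) / len(scan_dirs)     # uniform summation weights (placeholder)

outputs = [mvdr_weights(R, d).conj() @ X for d in scan_dirs]
surface_estimate = sum(w * y for w, y in zip(scan_weights, outputs))
print(surface_estimate.shape)       # (200,)
```
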
12 Dec 2020

A Multi-Access Edge Computing Solution with Distributed Sound Source Localization for IoT Networks

TOKYO - JAPAN

21st Society of Instrument and Control Engineers System Integration Division Lecture

Haris Gulzar, Muhammad Shakeel, Kenji Nishida, Katsutoshi Itoyama, Kazuhiro Nakadai
Conferences
About The Publication
In this paper we have presented a flexible edge computing approach for distributed sound source localization and tracking by utilizing a custom-built Multi-Access Edge Computing (MEC) device for Internet of Things (IoT) applications. A multichannel microphone array mounted embedded device is modelled as IoT node to record real-time sound signals, perform computation according to its resource capability and then communicate either audio signals, partially computed results or final sound source localization (SSL) results to MEC device to perform further computation. In this paper, we present a framework to deploy HARK (Honda research institute Japan, Audition for Robots with Kyoto university) to perform SSL on multi devices by exploiting the PYNQ platform of MEC device. IoT node and MEC device together are modelled to offer a light-weight and flexible environment to perform distributed SSL where size of communication payload was optimally reduced from 25KB to 0.1KB per frame, which is highly desired in IoT applications.
09 Oct 2020

Auditory Awareness with Sound Source Localization on Edge Devices for IoT Applications

TOKYO - JAPAN

International Sessions, the 38th Annual Conference of the Robotics Society of Japan

Haris Gulzar, Muhammad Shakeel, Katsutoshi Itoyama, Kenji Nishida, Kazuhiro Nakadai
Conferences
About The Publication
In this paper we propose a sound source localization solution in 3D environment with specific focus on edge computing for IoT applications. The proposed method integrates real time sound signals processing, edge computing and deployment of this model for IoT network. Sound signals processing is performed on IoT edge device and sound source localization results are transmitted to the remote server using a light weight communication protocol. The computational shift from cloud to edge device doesn’t only resolve the problem of cloud overloading and scalability but transmission of minimum information through a light weight communication protocol paves the way to further develop sustainable solution for sound signals processing with enhanced data privacy, least bandwidth utilization and improved overall latency.
.05

TEACHING & VOLUNTEER

TEACHING HISTORY
  • 09-2020
    02-2021
    TOKYO, JAPAN

    TEACHING ASSISTANT (TA)

    CYBER-PHYSICAL INNOVATION (COURSE) - TOKYO INSTITUTE OF TECHNOLOGY

  • 01-2021
    TOKYO, JAPAN

    TEACHING ASSISTANT (TA)

    IJCAI-PRICAI 2020 tutorial on HARK

    19th HARK tutorial
  • 09-2019
    02-2020
    TOKYO, JAPAN

    TEACHING ASSISTANT (TA)

    SYSTEMS AND CONTROL ENGINEERING PROJECT (COURSE) - TOKYO INSTITUTE OF TECHNOLOGY

VOLUNTEERING ACTIVITIES
  • 04-2020
    03-2021
    TOKYO, JAPAN

    VICE-PRESIDENT

    Pakistan Student Association Japan (PSAJ)

    Elected and served as Vice-President of the Pakistan Student Association Japan (PSAJ) for 2020-2021.
  • 01-2021
    01-2021
    YOKOHAMA, JAPAN

    IJCAI-PRICAI VOLUNTEER

    IJCAI-PRICAI 2020

    I was a staff volunteer for the IJCAI-PRICAI 2020 conference, held online from January 7-15, 2021.
.08

CONTACT

Drop me a line

GET IN TOUCH

You can reserve a slot for introductions or consulting services by sending an email.