publications
2024
- IJCAI’2024: End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding. Wei Zeng, Xian He, and Ye Wang. IJCAI, Jeju Island, South Korea, Aug 2024.
Piano audio-to-score transcription (A2S) is an important yet underexplored task with extensive applications in music composition, practice, and analysis. However, existing end-to-end piano A2S systems have struggled to retrieve bar-level information such as key and time signatures, and have been trained and evaluated only on synthetic data. To address these limitations, we propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that mirrors the hierarchical structure of musical scores, enabling the transcription of score information at both the bar and note levels through multi-task learning. To bridge the gap between synthetic data and recordings of human performance, we propose a two-stage training scheme: pre-training the model on synthetic audio produced by an expressive performance rendering (EPR) system, followed by fine-tuning on recordings of human performance. To preserve the voicing structure for score reconstruction, we propose a pre-processing method for **Kern scores in scenarios with an unconstrained number of voices. Experimental results demonstrate the effectiveness of our approaches, both in transcription performance on synthetic audio, where we surpass the current state of the art, and in the first experiments on recordings of human performance.
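The hierarchical decoding idea lends itself to a compact illustration. Below is a minimal PyTorch sketch of a decoder that predicts bar-level attributes (e.g., key and time signature) and conditions note-level prediction on pooled bar context; all module names, dimensions, and vocabulary sizes are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a hierarchical decoder for audio-to-score transcription,
# assuming a separate encoder produces frame-level features. Names and
# dimensions (d_model, n_bar_tokens, n_note_tokens) are hypothetical.
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, d_model=256, n_bar_tokens=64, n_note_tokens=512):
        super().__init__()
        # Bar-level decoder predicts per-bar attributes (key/time signature).
        self.bar_decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.bar_head = nn.Linear(d_model, n_bar_tokens)
        # Note-level decoder sees note embeddings concatenated with bar context.
        self.note_decoder = nn.GRU(2 * d_model, d_model, batch_first=True)
        self.note_head = nn.Linear(d_model, n_note_tokens)

    def forward(self, enc_frames, note_embeds):
        # enc_frames: (B, T, d) encoder output; note_embeds: (B, N, d).
        bar_states, _ = self.bar_decoder(enc_frames)
        bar_logits = self.bar_head(bar_states)            # bar-level task
        bar_ctx = bar_states.mean(dim=1, keepdim=True)    # pooled bar context
        note_in = torch.cat(
            [note_embeds, bar_ctx.expand(-1, note_embeds.size(1), -1)], dim=-1)
        note_states, _ = self.note_decoder(note_in)
        note_logits = self.note_head(note_states)         # note-level task
        return bar_logits, note_logits
```

Multi-task training in this sketch would simply sum cross-entropy losses over the two heads, one per level of the score hierarchy.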
- TOMM: Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing. Xiangming Gu, Longshen Ou, Wei Zeng, and 3 more authors. ACM Trans. Multimedia Comput. Commun. Appl., May 2024.
Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics, while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite their significant potential for practical application, both tasks remain nascent: transcribing lyrics and note events solely from singing audio is notoriously difficult because noise contamination, e.g., musical accompaniment, degrades both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for implementing multimodal ALT and AMT systems. Additionally, we curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which encompasses audio recordings and videos of lip movements, together with ground truth for lyrics and note events. For model construction, we adapt self-supervised learning models from the speech domain as acoustic and visual encoders to alleviate the scarcity of labeled data. We also introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems achieve state-of-the-art performance on both ALT and AMT tasks. Through single-modal experiments, we also examine the individual contribution of each modality to the multimodal system. Finally, we combine the two modalities and demonstrate the effectiveness of our proposed multimodal systems, particularly their noise robustness.
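As a rough illustration of the fusion mechanism named above, here is a minimal residual cross-attention block in PyTorch, where audio features attend over video features and a residual connection preserves the audio stream. The shapes, dimensions, and query/key assignment are assumptions, not the paper's exact design.

```python
# Minimal sketch of residual cross-attention for audio-visual fusion.
# Hypothetical shapes: audio_feats and video_feats are (B, T, d_model).
import torch.nn as nn

class ResidualCrossAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_feats, video_feats):
        # Audio queries attend over video keys/values; the residual path keeps
        # the audio stream intact when the visual signal is uninformative.
        attended, _ = self.attn(audio_feats, video_feats, video_feats)
        return self.norm(audio_feats + attended)
```

A residual path like this degrades gracefully toward the audio-only system when video contributes little, which is one plausible reading of the reported noise robustness.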
2023
- MM’2023: Elucidate Gender Fairness in Singing Voice Transcription. Xiangming Gu, Wei Zeng, and Ye Wang. In Proceedings of the 31st ACM International Conference on Multimedia, May 2023.
It is widely known that males and females typically have different vocal characteristics when singing, such as timbre and pitch, but it has never been explored whether these gender-based characteristics lead to a performance disparity in singing voice transcription (SVT), whose targets include pitch. Such a disparity could cause fairness issues and severely affect the user experience of downstream SVT applications. Motivated by this, we first demonstrate that SVT systems perform better on female voices than on male voices, a disparity observed across different models and datasets. We find that differing pitch distributions, rather than gender imbalance in the training data, drive this disparity. To address it, we propose using an attribute predictor to predict gender labels and adversarially training the SVT system to enforce gender-invariant acoustic representations. Leveraging the prior knowledge that pitch distributions may contribute to the gender bias, we further propose conditionally aligning acoustic representations between demographic groups by feeding note events to the attribute predictor. Empirical experiments on multiple benchmark SVT datasets show that our method significantly reduces gender bias (by more than 50% in some settings) with negligible degradation of overall SVT performance on both in-domain and out-of-domain singing data, thus offering a better fairness-utility trade-off.
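The adversarial setup described above can be sketched with a gradient-reversal layer: an attribute predictor tries to recover the gender label from acoustic representations (conditioned on note events), while reversed gradients push the encoder toward gender-invariant features. The sketch below is one plausible reading, not the authors' implementation; all names and dimensions are assumptions.

```python
# Sketch of conditional adversarial alignment via gradient reversal.
# GradReverse flips gradients so the encoder learns gender-invariant features
# while the attribute predictor tries to recover the gender label.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder.
        return -ctx.lam * grad_output, None

class AttributePredictor(nn.Module):
    def __init__(self, d_acoustic=256, d_note=64, n_genders=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_acoustic + d_note, 128), nn.ReLU(),
            nn.Linear(128, n_genders))

    def forward(self, acoustic, note_events, lam=1.0):
        # Conditioning on note events lets representations be aligned per
        # note/pitch group rather than globally, per the conditional variant.
        acoustic = GradReverse.apply(acoustic, lam)
        return self.mlp(torch.cat([acoustic, note_events], dim=-1))
```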
- Appl. Energy: Deep reinforcement learning towards real-world dynamic thermal management of data centers. Qingang Zhang, Wei Zeng, Qinjie Lin, and 3 more authors. Applied Energy, May 2023.
Deep reinforcement learning has been increasingly researched for dynamic thermal management in data centers. However, existing works typically evaluate algorithms on a specific task, utilizing models or data trajectories, without discussing in detail their implementation feasibility or their ability to handle diverse work scenarios. This gap limits the real-world deployment of deep reinforcement learning. To this end, this paper comprehensively evaluates the strengths and limitations of state-of-the-art algorithms through analytical and numerical studies. The analysis spans four dimensions: algorithms, tasks, system dynamics, and knowledge transfer. As an inherent property, sensitivity to algorithm settings is first evaluated in a simulated data center model. The ability to handle various tasks and the sensitivity to reward functions are subsequently studied. The trade-off between constraint satisfaction and power savings is identified through ablation experiments. Next, performance under different work scenarios is investigated, including various equipment, workload schedules, locations, and power densities. Finally, the transferability of algorithms across tasks and scenarios is evaluated. The results show that actor-critic, off-policy, and model-based algorithms outperform the others in optimality, robustness, and transferability: they reduce violations and achieve around 8.84% power savings in some scenarios compared to the default controller. However, deploying these algorithms in real-world systems remains challenging, since they are sensitive to specific hyperparameters, reward functions, and work scenarios, and constraint violations and sample efficiency remain unsatisfactory. This paper presents our well-structured investigations, new findings, and the challenges of deploying deep reinforcement learning in data centers.
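To make the constraint-versus-savings trade-off discussed above concrete, here is a hypothetical reward shaping for a DRL cooling controller: a power term rewards savings and a hinge term penalizes inlet-temperature violations. The coefficients, units, and the 27 °C limit are illustrative assumptions, not values from the paper.

```python
# Hypothetical reward for a DRL cooling controller: reward power savings
# while penalizing inlet-temperature violations with a hinge term.
def reward(power_kw: float, max_inlet_temp_c: float,
           temp_limit_c: float = 27.0,  # assumed limit, not from the paper
           power_coef: float = 1.0, violation_coef: float = 10.0) -> float:
    violation = max(0.0, max_inlet_temp_c - temp_limit_c)
    return -power_coef * power_kw - violation_coef * violation
```

Raising violation_coef shifts the learned policy toward fewer violations at the cost of power savings, mirroring the trade-off the paper probes with its ablation experiments.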