Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/99999/fk4vq4g56r
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Finkelstein, Adam | |
dc.contributor.author | Su, Jiaqi | |
dc.contributor.other | Computer Science Department | |
dc.date.accessioned | 2022-06-15T15:17:24Z | - |
dc.date.available | 2022-06-15T15:17:24Z | - |
dc.date.created | 2022-01-01 | |
dc.date.issued | 2022 | |
dc.identifier.uri | http://arks.princeton.edu/ark:/99999/fk4vq4g56r | - |
dc.description.abstract | Modern speech content such as podcasts, video narrations, and audiobooks typically requires high-quality audio to support a strong sense of presence and a pleasant listening experience. However, real-world recordings captured with consumer-grade equipment often suffer from quality degradations including noise, reverberation, equalization distortion, and loss of bandwidth. This dissertation addresses speech enhancement with a focus on improving the perceptual quality and aesthetics of recorded speech. It describes how to improve single-channel real-world consumer-grade recordings to sound like professional studio recordings -- studio-quality speech enhancement. In pursuit of this problem, we identify three challenges: objective functions misaligned with human perception, the shortcomings of commonly used audio representations (i.e., spectrogram and waveform), and the lack of available high-quality speech data for training. This dissertation presents a waveform-to-waveform deep neural network solution that consists of two steps: (1) enhancement by removing all quality degradations at limited bandwidth (i.e., a 16 kHz sample rate), and (2) bandwidth extension from 16 kHz to 48 kHz to produce a high-fidelity signal. The first, enhancement, stage relies on a perceptually motivated GAN framework that combines both waveform and spectrogram representations, and learns from simulated data covering a broad range of realistic recording scenarios. Next, the bandwidth extension stage shares a similar design with the enhancement stage, but focuses on filling in missing high-frequency detail at 48 kHz. Finally, we extend the studio-quality speech enhancement problem to a more general problem called acoustic matching, which converts recordings to match an arbitrary target acoustic environment. | |
dc.format.mimetype | application/pdf | |
dc.language.iso | en | |
dc.publisher | Princeton, NJ : Princeton University | |
dc.relation.isformatof | The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: <a href="http://catalog.princeton.edu">catalog.princeton.edu</a> | |
dc.subject | audio enhancement | |
dc.subject | generative adversarial networks | |
dc.subject | speech enhancement | |
dc.subject.classification | Computer science | |
dc.subject.classification | Artificial intelligence | |
dc.title | Studio-Quality Speech Enhancement | |
dc.type | Academic dissertations (Ph.D.) | |
pu.date.classyear | 2022 | |
pu.department | Computer Science | |
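The abstract describes an enhancement stage trained with an objective that combines waveform and spectrogram representations. As a rough illustration only (not the dissertation's actual objective, whose architecture, weights, and adversarial terms are not specified here), a combined waveform-L1 plus multi-resolution spectral-L1 loss might be sketched in NumPy as follows; the function names and resolution settings are illustrative assumptions:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via Hann-windowed frames and a real FFT."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def enhancement_loss(pred, target,
                     resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Toy combined loss: time-domain L1 plus multi-resolution
    spectral magnitude L1, averaged over the listed (n_fft, hop) pairs."""
    wave_l1 = np.mean(np.abs(pred - target))
    spec_l1 = 0.0
    for n_fft, hop in resolutions:
        sp = stft_mag(pred, n_fft, hop)
        st = stft_mag(target, n_fft, hop)
        spec_l1 += np.mean(np.abs(sp - st)) / len(resolutions)
    return wave_l1 + spec_l1
```

Losses of this multi-resolution spectral form are common in neural speech synthesis and enhancement because a single FFT size trades off time and frequency resolution; averaging several resolutions penalizes artifacts at multiple scales.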
Appears in Collections: | Computer Science |
Files in This Item:
File | Size | Format | |
---|---|---|---|
Su_princeton_0181D_14127.pdf | 4.62 MB | Adobe PDF | View/Download |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.