Chapter 1 Introduction

 
 
 
      Motivation
I am looking at an acoustical waveform, horizontal axis denoting discrete time and vertical axis denoting discrete magnitude values. What information does this signal contain, and how much of it is there? Is there anything hidden behind the waveform? Is it just two-dimensional in nature? How can I better understand why it makes this unique sound? Questions like these aroused my curiosity about timbre, which has led to the subject of this thesis: feature extraction of musical signals.

 
 
 

Feature extraction is an integral part of understanding musical instrument signals. These signals contain a wealth of information and feature extraction is a method for obtaining specific characteristics through signal processing techniques. Hence, it is partly a process of reducing the overwhelming acoustical information and focusing on specific areas that may give clues for describing the signal under investigation. In a computer system, digital signal processing techniques are used for analysis. The techniques of data analysis are divided into frequency and time domain analyses. With these techniques, numerous approaches from different angles are employed to extract salient information, ultimately to help understand timbral characteristics.

Various signal processing software systems exist for extracting specific acoustical features. However, very few systems exist that are tailored for the purpose of analysis and extraction of timbral qualities of musical instrument signals. In this thesis I have developed and implemented various algorithms for extracting salient features into one software application which can be readily used by musicians, composers, engineers or anyone interested in analyzing musical signals from a signal processing point of view.
 
 
 

It is also interesting to note that although numerous signal processing algorithms have been devised to accomplish feature extraction tasks, it is still unclear which aspects of timbre are essential and which are more or less meaningful than others. To my knowledge, there exists no theory or rule that unambiguously defines a hierarchical description of timbral features. It is my hope that this software system will give users the means to explore, investigate and experiment with audio signals, and will help answer some of the many questions regarding timbre that remain open. I also plan to continue research in timbre by adding a recognition module that would take the extracted features and recognize the sound source being analyzed.
 
 
 

Java was chosen as the implementation language for its platform independence and graphical user interface (GUI) capabilities. The Java Swing GUI toolkit was used to facilitate the interpretation of extracted features through graphical displays and parametric controls of various signal processing coefficients.
 
 
 

      Feature Extraction and Timbre
In this thesis spectral analysis is based on the Fourier transform. The theory behind the Fourier transform was first published in "Analytical Theory of Heat" by Fourier. Fourier claimed in his writing that any periodic continuous signal could be represented by the sum of an infinite number of sine and cosine waves. This elegant description of periodic signals was later exploited by the 19th-century physicist Hermann von Helmholtz (Helmholtz 1877). His view of the ear was that of a "frequency analyzer" based primarily on Fourier's mathematical theorem, Ohm's physical definition of a simple tone and the existence of a resonator in the cochlea capable of accomplishing sound analysis. According to Helmholtz's theory, the cochlea behaved like a spectral analyzer analogous to the Fourier transform. He believed that the cochlea resonated at specific locations along the basilar membrane (Carterette and Friedman 1978), each tuned to a specific frequency. Helmholtz also claimed that the spectral magnitude components, and not the phase components, were the sole factors contributing to the perception of musical tones. However, this over-generalization of the human ear performing a strict Fourier transform on the incoming sound waves was disproved by Békésy (Békésy 1943), who demonstrated the impossibility of resonators in the cochlea tuned as precisely and sharply as Helmholtz described. In fact, the hair cells along the basilar membrane (comparable to frequency bins in the Fourier transform) are stimulated in an overlapping manner. That is, a sine tone at 100 Hz will not trigger just one hair cell at precisely that frequency; rather, a group of hair cells will be excited, leading to the perception of its pitch. Furthermore, the importance of phase in perceiving musical sounds was demonstrated by Clark (Clark, Luce, Abrams, Schlossberg and Rome 1963), who clearly showed that in the absence of phase information, acoustic waveforms sounded unrealistic.
This may be partly attributed to the fact that the highly transient onset part of a signal stores a great deal of phase information. Helmholtz's theory works well in ideal situations when a signal is periodic. However, real-life sounds are only quasi-periodic and vary considerably. The significance of spectral fluctuation as well as inharmonicity (Fletcher, Blackman and Stratton 1962) and spectral fusion (McAdams 1984) have also been studied as potential features in describing musical tones.
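
The distinction between spectral magnitude and phase discussed above can be made concrete with a direct evaluation of one discrete Fourier transform bin. The following is a minimal sketch only, not the implementation used in this thesis: the class name DftSketch and the methods dftBin and cosine are hypothetical names introduced for illustration, and the frame is assumed to contain a whole number of cycles so that no spectral leakage occurs.

```java
public final class DftSketch {

    /**
     * Evaluates a single DFT bin of a real-valued frame by direct summation.
     * Returns a two-element array: {magnitude, phase in radians}.
     * (Hypothetical helper for illustration; O(n) per bin, no FFT.)
     */
    public static double[] dftBin(double[] x, int k) {
        int n = x.length;
        double re = 0.0;
        double im = 0.0;
        for (int t = 0; t < n; t++) {
            double angle = 2.0 * Math.PI * k * t / n;
            re += x[t] * Math.cos(angle);   // real part: correlation with a cosine
            im -= x[t] * Math.sin(angle);   // imaginary part: correlation with a sine
        }
        return new double[] { Math.hypot(re, im), Math.atan2(im, re) };
    }

    /** Generates n samples of a cosine with a whole number of cycles per frame. */
    public static double[] cosine(int n, int cycles, double phase) {
        double[] x = new double[n];
        for (int t = 0; t < n; t++) {
            x[t] = Math.cos(2.0 * Math.PI * cycles * t / n + phase);
        }
        return x;
    }

    public static void main(String[] args) {
        // Two cosines with identical magnitude spectra but different phase.
        double[] a = dftBin(cosine(64, 4, 0.0), 4);
        double[] b = dftBin(cosine(64, 4, Math.PI / 3), 4);
        System.out.printf("a: magnitude=%.3f phase=%.3f%n", a[0], a[1]);
        System.out.printf("b: magnitude=%.3f phase=%.3f%n", b[0], b[1]);
    }
}
```

The two test signals have the same magnitude at bin 4 and differ only in phase, which is the information a magnitude-only description of a tone discards.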

 
 
 

Although the Fourier transform has been known for quite some time, it was not widely applied by the music community until after 1965, with the introduction of the fast Fourier transform (Cooley and Tukey 1965). The advent of the FFT stimulated research in music partly because it made computing the discrete Fourier transform far less costly. One such area of research in timbre was conducted using multidimensional scaling (MDS) methods (Grey 1976). The structure of musical signals was mapped to a three-dimensional timbre space. Listeners judged the similarity or dissimilarity between sounds when salient features were changed. The three dimensions incorporated were brightness, spectral flux and attack time. Instead of natural sounds, additive synthesis methods were employed for easy control of timbral parameters in conducting the experiments. The noise content of musical signals, on the other hand, has not been investigated in as much detail as the "periodic" aspects of musical sounds (some work has been done in modeling non-periodic signals by Serra 1997). Voice coding research, however, has adopted noise analysis techniques enthusiastically, with speech divided into a periodic part and a noisy part. Linear predictive coding (LPC) has been the backbone of current and past speech analysis-by-synthesis (AbS) systems.
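
Two of the dimensions named above, brightness and spectral flux, can be illustrated with simple computations on magnitude spectra. The sketch below rests on assumptions that do not come from the text: brightness is approximated by the spectral centroid, and spectral flux by the Euclidean distance between successive magnitude spectra. Both are common choices but only one of several definitions in use. The class and method names are hypothetical and are not taken from the software described in this thesis.

```java
public final class TimbreFeatureSketch {

    /**
     * Spectral centroid in Hz: the magnitude-weighted mean of the bin
     * frequencies. Assumed here as a stand-in for "brightness".
     * magnitudes[k] is the magnitude of bin k of an fftSize-point transform.
     */
    public static double spectralCentroid(double[] magnitudes, double sampleRate, int fftSize) {
        double weighted = 0.0;
        double total = 0.0;
        for (int k = 0; k < magnitudes.length; k++) {
            double frequency = k * sampleRate / fftSize; // center frequency of bin k
            weighted += frequency * magnitudes[k];
            total += magnitudes[k];
        }
        return total == 0.0 ? 0.0 : weighted / total;    // silent frame: define as 0 Hz
    }

    /**
     * Spectral flux between two successive frames, taken here as the
     * Euclidean distance between their magnitude spectra (an assumption).
     */
    public static double spectralFlux(double[] previous, double[] current) {
        if (previous.length != current.length) {
            throw new IllegalArgumentException("spectra must have the same length");
        }
        double sum = 0.0;
        for (int k = 0; k < current.length; k++) {
            double difference = current[k] - previous[k];
            sum += difference * difference;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] frameA = { 0.0, 1.0, 0.0, 1.0, 0.0 };
        double[] frameB = { 0.0, 1.0, 0.0, 0.0, 0.0 };
        System.out.println("centroid A (Hz): " + spectralCentroid(frameA, 8000.0, 8));
        System.out.println("centroid B (Hz): " + spectralCentroid(frameB, 8000.0, 8));
        System.out.println("flux A -> B:     " + spectralFlux(frameA, frameB));
    }
}
```

Removing the upper partial from the first frame lowers the centroid, which corresponds to a duller sound, while the flux value summarizes how much the spectrum changed between the two frames.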
 
 
 

During the past decade a number of research topics in timbre have been pursued in the area of so-called Computational Auditory Scene Analysis (CASA). It may be thought of as a research area within the psychophysical disciplines that seeks to describe and explain how the listener perceives sounds. Sound, in this context, refers to a multiplexed signal, that is, an aggregate of a number of sound sources. The approach is to find the underlying reasons why we hear what we hear, and not merely to be content with the results of a computer system that finds a matching answer to a stimulus. The growth of CASA can be largely attributed to Bregman, who published his book Auditory Scene Analysis in 1990 (Bregman 1990). The book describes in detail highly intuitive and clever experiments that attempt to explain psychoacoustic phenomena and to model such features robustly. However, as is the case with most if not all psychoacoustic experiments, the stimuli or test tones used in Bregman's book are static, synthesized sine tones or simply impractical sound examples that are often only remotely related to real-life sounds. Nevertheless, a significant and impressive amount of work has been done in this field. Ellis (Ellis 1996) used a prediction-based model of the auditory system, with good results in grouping sounds such as car horns, door slams and squeals in a noisy "city street environment". He used a re-synthesis approach to assess its robustness and performance. Another example is a statistically based pattern-recognition approach (Martin 1999), in which the "listening" system classifies musical instruments as one of 25 possibilities, based on Ellis's PDCASA (Prediction Driven Computational Auditory Scene Analysis) architecture.

