Updates
- T-minus 2 weeks - Neural network topology determined. Additional signal processing work, specifically separating inputs into individual words. Collected additional speech samples and built a web interface for verifying sound classifications. Created informed consent sheets for audio sample gathering.
- T-minus 3 weeks - Chris Mi. continued to modify his signal processing
code, and looked for techniques for determining prosody and formant
frequencies. For the upcoming week, he will attempt to implement these,
and run the signal analysis using the samples Eric recorded. This will
provide inputs for the training of the neural network. Phil and Chris Mo. looked at SNNS, a neural network simulator to be used in conjunction with the outputs from Chris Mi.'s code.
- T-minus 4 weeks - Chris Mi. has resurrected MATLAB code he wrote during a graduate-level course in digital speech processing. This code will find various properties of a speech file for us, such as the fundamental frequency and intensity. The next test is to run it over a small gallery of speech samples to be sure it is consistent and can serve our needs. If so, we can begin right away.
- Our group will attempt to recognize a limited set of human emotions through the analysis of vocal patterns. This topic encompasses the artificial intelligence subfields of probabilistic reasoning over time, statistical learning methods, and sensory perception. Besides being interesting, real-time emotion detection will be useful for the robots in the AIROLAB, allowing agents to make decisions based on the emotional states of those they interact with.
- We are interested in having several different speakers (both male and female) read and record samples, delivering each one as angry, sad, frightened, or happy.
- We will need to employ signal analysis using MATLAB (and any other tools we can find) to analyze different aspects of the voice, based on fundamental frequency, duration, and sound pressure. These early ideas of how to analyze signals are based on the 2002 research of James A. Russell of BC and Jo-Anne Bachorowski of Vanderbilt. They claim that anger and joy can both produce high fundamental frequency and high amplitude with duration and sound pressure.
- Since problems have been found with people "acting" emotion in their voice, we hope to also look at genuine emotion in voice rather than only acted emotion.
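As a rough illustration of two of the features named above, fundamental frequency and intensity, here is a minimal Python sketch. It is not the group's MATLAB code: the autocorrelation method, the function names, and the 75-500 Hz search range are all assumptions chosen for the example.

```python
import math

def rms_intensity(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate fundamental frequency (Hz) from the autocorrelation peak.

    Only lags corresponding to f0_min..f0_max are searched. These bounds
    are illustrative assumptions, not values taken from the project.
    Returns None when no positive correlation is found (e.g. silence).
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

if __name__ == "__main__":
    sr = 8000
    # Synthetic 200 Hz tone standing in for a voiced speech frame.
    tone = [0.5 * math.sin(2 * math.pi * 200.0 * n / sr) for n in range(800)]
    print(estimate_f0(tone, sr), rms_intensity(tone))
```

On the synthetic tone the estimate lands on 200 Hz. Real speech would need framing, windowing, and voiced/unvoiced decisions that this sketch leaves out.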
- We will use a multilevel feedforward neural net to implement our
classification function. A robust, well-tested package exists for
simulating neural nets, known as SNNS (Stuttgart Neural Network
Simulator). We will design our network with the SNNS built-in tools
and will also train it using SNNS's built-in capabilities.
We thought about using belief nets, but decided against them because SNNS
is so widespread in the artificial intelligence and neural network
communities and has thus had extensive testing and essential features
added over the years. We also considered using Hidden Markov Models but,
for much the same reasons, elected to stay with neural networks.
We hope to be able to expand on the research done the previous semester
by the team of students who implemented a sarcasm detector.
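The T-minus 2 weeks update mentions separating inputs into words. One common way to approach that step is short-time energy thresholding, sketched below in Python. This is an assumption about how such a step could work, not the group's actual method; the frame length, threshold, and minimum gap are hypothetical values.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def segment_by_energy(samples, frame_len=80, threshold=0.01, min_gap=2):
    """Return (start, end) sample indices of stretches above the threshold.

    A segment ends once `min_gap` consecutive frames fall below the
    threshold; `end` is exclusive. All defaults are illustrative.
    """
    energies = frame_energies(samples, frame_len)
    segments, start, silent = [], None, 0
    for idx, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = idx
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= min_gap:
                # Close the segment at the first of the silent frames.
                segments.append((start * frame_len, (idx - silent + 1) * frame_len))
                start, silent = None, 0
    if start is not None:
        # Signal ended mid-segment; drop any trailing silent frames.
        segments.append((start * frame_len, (len(energies) - silent) * frame_len))
    return segments

if __name__ == "__main__":
    # Two loud bursts separated by silence, standing in for two words.
    signal = [0.0] * 160 + [0.5] * 240 + [0.0] * 240 + [0.5] * 160 + [0.0] * 80
    print(segment_by_energy(signal))
```

Energy thresholding only finds pauses, so words spoken without a gap between them would come out as a single segment.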
Intro
The goal of the Emotion Detection in Speech project is to build an agent
that can, via a neural network architecture, classify emotion in voice
samples. To do this, the investigators need numerous vocal samples.
Participation in the study consists of recording the participant's voice
saying several sentences in ways that portray various emotions.
We will use signal processing techniques to extract certain features
from speech signals, such as formant frequencies, intensity, and
fundamental frequency. These features will then be fed into a
multi-layer feed-forward neural net with backpropagation, which we will
train on a known set of voice samples with known emotions. We expect to
have two hidden layers, one of which is a "local output", the other
being just a normal hidden layer. The local output weights from each
feature indicate the recommended classification based on that feature.
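The topology described above can be sketched as a forward pass in plain Python. The project itself uses SNNS, so this is only an illustration under assumptions: the layer sizes, the sigmoid activation, the random untrained weights, and every function name here are hypothetical, and training by backpropagation is omitted.

```python
import math
import random

EMOTIONS = ["angry", "sad", "frightened", "happy"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """Fully connected sigmoid layer: one weight row and one bias per unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained placeholder values)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_network(feature_sizes, hidden_units=3, seed=0):
    """One hidden layer and one local-output layer per feature group,
    plus a shared global-output layer. Sizes are hypothetical."""
    rng = random.Random(seed)
    n = len(EMOTIONS)
    subnets = {name: (random_layer(size, hidden_units, rng),
                      random_layer(hidden_units, n, rng))
               for name, size in feature_sizes.items()}
    global_layer = random_layer(n * len(feature_sizes), n, rng)
    return subnets, global_layer

def forward(network, features):
    """Return (local outputs per feature, global output) for one sample."""
    subnets, global_layer = network
    local_outputs = {}
    for name, (hidden, local) in subnets.items():
        h = dense(features[name], *hidden)
        # Each local output has one unit per emotion for this feature alone.
        local_outputs[name] = dense(h, *local)
    merged = [v for name in subnets for v in local_outputs[name]]
    return local_outputs, dense(merged, *global_layer)

if __name__ == "__main__":
    net = build_network({"f0": 2, "intensity": 2, "formants": 3})
    sample = {"f0": [0.8, 0.1], "intensity": [0.6, 0.2], "formants": [0.3, 0.5, 0.7]}
    local, overall = forward(net, sample)
    print(EMOTIONS[overall.index(max(overall))])
```

Keeping a local output per feature means each feature's own vote can be inspected separately from the combined classification.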
Outline
1. Goal
   1. Discussion of goals.
2. Approach
   1. Data acquisition
      1. recording samples from males and females.
      2. human classification of recorded samples; toss out those with lower than ~90% agreement on a single classification.
   2. MATLAB for feature extraction
      1. fundamental frequency
      2. intensity
      3. formant frequencies
   3. Stuttgart Neural Network Simulator (SNNS) for implementation
      1. multi-level feed-forward neural net with backpropagation
      2. separate subgroups for each feature which feed into a set of "global outputs".
3. Results
   1. Discussion of results (pending results).
4. Conclusions / Future Work
   1. Extend the system to perform classification in real time, as opposed to the current offline solution. Perceived difficulties and payoffs.
   2. Expand the recognized emotion set. Perceived difficulties and payoffs.
   3. Improve recognition accuracy. Perceived difficulties and payoffs.
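The human-classification step in the outline's data acquisition item can be sketched as a simple agreement filter. The Python below is a minimal sketch that assumes "~90%" means the share of listeners choosing the most common label; the function names, file names, and ratings are hypothetical.

```python
from collections import Counter

def agreement(labels):
    """Return (most common label, fraction of listeners who chose it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

def filter_samples(ratings, threshold=0.9):
    """Keep samples whose top label reaches the agreement threshold.

    `ratings` maps a sample name to the list of labels given by listeners.
    Returns a dict from kept sample name to its agreed label.
    """
    kept = {}
    for name, labels in ratings.items():
        if not labels:
            continue  # no listener ratings yet, so nothing to judge
        label, share = agreement(labels)
        if share >= threshold:
            kept[name] = label
    return kept

if __name__ == "__main__":
    # Hypothetical listener ratings for three recordings.
    ratings = {
        "s01.wav": ["angry"] * 10,
        "s02.wav": ["happy"] * 6 + ["angry"] * 4,
        "s03.wav": ["sad"] * 9 + ["frightened"],
    }
    print(filter_samples(ratings))
```

With these example ratings, `s01.wav` and `s03.wav` are kept and `s02.wav` is discarded, since only 60% of its listeners agreed.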
Project Plan Overview
- Collect speech samples from a variety of races, genders, and moods.
- Research current emotion detection techniques.
- Create probability models for signal analysis.
- Create interface to return emotion detected.
- If time permits, implement real-time emotion detection and optimize the code to support it.
Key Players
Christopher Middendorff - will provide the bulk of the signal analysis and MATLAB expertise
Eric Albert - will work on voice sampling and testing
Phil Snowberger - will provide knowledge of linguistics, including intonation, and hack up glue tools for
format conversions, as necessary
Christopher Moretti - will aid in developing and researching the probabilistic models, including
designing the neural net's topology and training it, and in recording voice samples
References
"Digital Processing of Speech Signals". Rabiner, L.R. and Schafer, R.W.
Prentice Hall, New Jersey, 1978.
"Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition".
Jurafsky, Daniel and Martin, James H. Prentice Hall, New Jersey,
2000.
"Artificial Intelligence: A Modern Approach". 2nd ed. Russell, Stuart
and Norvig, Peter. Prentice Hall, New Jersey, 2003.