  1. Our group will attempt to recognize a limited set of human emotions through the analysis of vocal patterns. This topic encompasses the artificial intelligence subfields of probabilistic reasoning over time, statistical learning methods, and sensory perception. Besides being interesting, real-time emotion detection will be useful for the robots in the AIROLAB, allowing agents to make decisions based on the emotional states of those they interact with.
  2. We are interested in having sample sentences read and recorded by several different candidates (both male and female), each of whom will read them as angry, sad, frightened, or happy.
  3. We will use Matlab (and any other tools we can find) to analyze aspects of the voice such as fundamental frequency, duration, and sound pressure. These early ideas of how to analyze signals are based on the research of James A. Russell of BC and Jo-Anne Bachorowski of Vanderbilt in 2002. They claim that anger and joy can both produce high fundamental frequency and high amplitude with duration and sound pressure.
  4. Because problems have been found with people "acting" emotion in voice, we hope to also examine genuine emotion in voice, as opposed to acted emotion.
  5. We will use a multilevel feedforward neural net to implement our classification function. A robust, well-tested package for simulating neural nets already exists: SNNS (Stuttgart Neural Network Simulator). We will design our network with the SNNS built-in tools and train it using SNNS's built-in capabilities. We thought about using belief nets, but decided against them because SNNS is so widespread in the artificial intelligence and neural network communities and has thus been extensively tested and has gained essential features over the years. We also considered Hidden Markov Models but, for much the same reasons, elected to stay with neural networks. We hope to expand on the research done the previous semester by the team of students who implemented a sarcasm detector.
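
As a rough illustration of the measurements named in item 3, the sketch below estimates fundamental frequency from the autocorrelation peak of a single voiced frame and computes RMS amplitude as a stand-in for sound pressure. It is written in Python rather than Matlab purely for illustration. The function names, the 75-500 Hz search range, and the choice of autocorrelation are all assumptions, not the method the group has settled on.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate the fundamental frequency (Hz) of one voiced frame.

    Picks the lag with the highest normalized autocorrelation inside a
    plausible pitch range. The default range is an assumption.
    Returns 0.0 when no pitch can be found (for example, silence).
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]              # remove any DC offset
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return 0.0
    lag_min = int(sample_rate / f0_max)        # shortest period considered
    lag_max = min(int(sample_rate / f0_min), n - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        score = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def rms_intensity(frame):
    """Root-mean-square amplitude, used here as a simple intensity proxy."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


# Example: a synthetic 200 Hz tone sampled at 8 kHz (one 40 ms frame).
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(320)]
print(estimate_f0(tone, sample_rate), rms_intensity(tone))
```

For the synthetic tone the estimate should land at or very near 200 Hz. Real recordings would need framing, voicing detection, and smoothing across frames, none of which is shown here.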

Intro

The goal of the Emotion Detection in Speech project is to build an agent that can, via a neural network architecture, classify emotion in voice samples. To do this, the investigators need numerous vocal samples. Participation in the study consists of recording the participant's voice saying several sentences in ways that portray various emotions.

We will use signal processing techniques to extract certain features from speech signals, such as formant frequencies, intensity, and fundamental frequency. These features will then be fed into a multi-layer feed-forward neural net trained with backpropagation on a set of voice samples with known emotions. We expect to have two hidden layers: an ordinary hidden layer and a ``local output'' layer. The local output weights from each feature indicate the classification recommended by that feature alone.
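
The layout described above can be sketched as a forward pass in plain Python. This is only an illustration under stated assumptions: the layer ordering (features, then hidden units, then local outputs, then global outputs), the layer sizes, the logistic activation, and every name below are hypothetical, and the random weights are placeholders for whatever backpropagation training in SNNS would produce.

```python
import math
import random

EMOTIONS = ["angry", "sad", "frightened", "happy"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer with logistic activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def random_layer(n_in, n_out, rng):
    """Untrained placeholder weights; training would replace these."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(group_sizes, n_hidden=3, seed=0):
    """Create one hidden layer and one local-output layer per feature group."""
    rng = random.Random(seed)
    n_out = len(EMOTIONS)
    groups = {name: {"hidden": random_layer(size, n_hidden, rng),
                     "local": random_layer(n_hidden, n_out, rng)}
              for name, size in group_sizes.items()}
    global_layer = random_layer(n_out * len(group_sizes), n_out, rng)
    return {"groups": groups, "global": global_layer}


def classify(network, features):
    """Return (label, global scores, per-feature local outputs).

    Feature values are assumed to be normalized to roughly 0..1.
    """
    local_outputs, combined = {}, []
    for name, layers in network["groups"].items():
        hidden = dense(features[name], *layers["hidden"])
        local = dense(hidden, *layers["local"])   # this feature's own vote
        local_outputs[name] = local
        combined.extend(local)
    scores = dense(combined, *network["global"])
    best = max(range(len(EMOTIONS)), key=scores.__getitem__)
    return EMOTIONS[best], scores, local_outputs


net = build_network({"f0": 2, "intensity": 1, "formants": 3})
label, scores, local = classify(
    net, {"f0": [0.6, 0.3], "intensity": [0.8], "formants": [0.2, 0.5, 0.7]})
print(label, scores)
```

Because the weights are untrained, the printed label carries no meaning. The point is the shape of the computation: each feature group produces its own four-way recommendation before the global outputs combine them.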


Outline

1. Goal
	1. Discussion of goals.
2. Approach
	1. Data acquisition
		1. recording samples from males and females.
		2. human classification of recorded samples; discard those that fewer than ~90% of listeners classify the same way.
	2. Matlab for feature extraction
		1. fundamental frequency
		2. intensity
		3. formant frequencies
	3. Stuttgart Neural Network Simulator (SNNS) for implementation
		1. multi-level feed-forward neural net with backpropagation
		2. separate subgroups for each feature, which feed into a set of ``global outputs''.
3. Results
	1. Discussion of results (pending results).
4. Conclusions / Future Work
	1. Extend the system to perform classification in real time, as opposed to the current offline solution.  Perceived difficulties and payoffs.
	2. Expand the recognized emotion set.  Perceived difficulties and payoffs.
	3. Improve recognition accuracy.  Perceived difficulties and payoffs.
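
The sample screening step under Data acquisition in the outline above could look something like the following Python sketch. It assumes that "~90% unique classification" means at least 90% of human listeners assigned the same emotion to a recording. That reading, the function names, and the data layout are assumptions made for illustration.

```python
from collections import Counter


def agreement(labels):
    """Return (most common label, fraction of listeners who chose it)."""
    top_label, count = Counter(labels).most_common(1)[0]
    return top_label, count / len(labels)


def filter_samples(ratings, threshold=0.9):
    """Keep only samples whose listener agreement meets the threshold.

    ratings maps a sample id to the list of labels given by listeners.
    Returns a dict of sample id -> agreed label for the kept samples.
    """
    kept = {}
    for sample_id, labels in ratings.items():
        label, share = agreement(labels)
        if share >= threshold:
            kept[sample_id] = label
    return kept


# Hypothetical ratings from ten listeners per recording.
ratings = {
    "s01": ["angry"] * 9 + ["happy"],            # 90% agreement: kept
    "s02": ["sad"] * 7 + ["frightened"] * 3,     # 70% agreement: discarded
}
print(filter_samples(ratings))
```

The kept samples, paired with their agreed labels, would then serve as the training set with known emotions.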

Project Plan Overview

  1. Collect speech samples from a variety of races, genders, and moods.
  2. Research current emotion detection techniques.
  3. Create probability models for signal analysis.
  4. Create interface to return emotion detected.
  5. If time permits, implement real-time emotion detection and optimize the code to support it.

Key Players

Christopher Middendorff - will provide the bulk of the signal analysis and Matlab expertise

Eric Albert - will work on voice sampling and testing

Phil Snowberger - will provide knowledge of linguistics, including intonation, and hack up glue tools for format conversions as necessary

Christopher Moretti - will aid in developing and researching the probabilistic models, including designing the neural net's topology and training it, and in recording voice samples


References
"Digital Processing of Speech Signals".  Rabiner, L.R. and Schafer, R.W.
	Prentice Hall, New Jersey, 1978.

"Speech and Language Processing: An Introduction to Natural Language
	Processing, Computational Linguistics, and Speech Recognition".
	Jurafsky, Daniel and Martin, James H.  Prentice Hall, New Jersey,
	2000.

"Artificial Intelligence: A Modern Approach".  2nd ed. Russell, Stuart
	and Norvig, Peter.  Prentice Hall, New Jersey, 2003.