Interview Expert

We did an interview with an employee at NXP. These are the questions we asked and the answers we got:

1. What is the importance of pitch detection in speech analysis?

Speech analysis is very useful for noise suppression, especially if you have non-stationary background noise and you want to filter out the human voice. By detecting the pitch you can extract the human voice from the noisy, unclean signal. It can also be used for speaker identification, speaker detection, speaker verification, or genre/music classification.

2. How can speech analysis be used to help people?

Speech analysis can be used to filter out background noise, which can solve a lot of problems for people with hearing aids. Pitch shifting sounds from an unbearable range to a higher or lower range can also help people with a hearing impairment.

3. Where is speech analysis currently being used?

Speech analysis is, for example, used to improve the intelligibility of certain signals. When someone is talking, you could process the sound to selectively slow down the parts of a sentence that need to become more intelligible and speed up the other, less significant parts, so that in the end the speech keeps its original length. This is quite difficult to do, and pitch shifting and pitch detection are key elements of such a technology.

It can also be used to help people with a hearing impairment in a noisy environment. The speech can be slowed down so that, relative to it, the noise no longer appears stationary: there will now be parts where the speech can come through the noise, and by stretching the speech you give it more opportunities to come through.

It can also be used in mobile phones, for example for voicemail. When someone leaves his or her telephone number and speaks too fast, you may have trouble understanding the message. If you slow down the message, it becomes easier to understand.

4. How do you test real-time software?

Real-time software means it needs to run on a certain processor (most of the time DSP-like cores or processors). First of all you need an implementation, and because most processors have a simulator environment, you can simulate the behavior of the core on your PC. This means you can also process test files, which you first process on the PC to obtain output files. These output files can then be compared to a certain reference. You can think of two types of tests. The first is a functional test, where you check whether the output behaves according to the requirements of the function under test. The other is called 'bit-true testing': you first have a reference algorithm that has been evaluated via listening tests, so you know it behaves well, and you then compare the outputs of your real-time implementation and your reference algorithm bit by bit. During this comparison a range is also defined in which the outputs may vary, because they will never be exactly the same. This second way of testing is faster and easier to implement, but unfortunately it is also less robust: you might still flag small deviations that are no issue at all, or miss little issues that cause problems in the end.

5. What is more important: quality or computational load + memory (speed)?

Both. Quality requires the algorithm to perform up to a certain level, while speed requires that the job can be done with the limited computational load that is available. When you only have a tiny DSP processor available, for example, the main focus will be on getting the algorithm to fit onto the processor rather than on quality. So when resources are limited, speed tends to be the main focus, but in general both are always important. Even when processors have massive MIPS (millions of instructions per second) available, there will also be a massive amount of things to do, which requires optimized algorithms that run very efficiently and still deliver very high audio performance.

6. Which sector has the most demand for this type of audio processing (voice processing)?

Some sectors that demand voice processing are the communication sector, the medical sector and the music industry.

Speech recognition is also becoming more important in the medical sector, because in operating rooms recordings are being made more often and translated into reports automatically. These reports can then show all the events that took place during the operation. For this reason there is also a need for speaker identification, to be able to detect who said what and when.

In the music industry it can also be used by DJs to do beat matching or pitch shifting.

Pitch detection

We are researching the following algorithms: pitch shifting by delay-line modulation, pitch shifting by PSOLA, and pitch shifting by time stretching and resampling. For the last two algorithms to work well, we designed a pitch detection algorithm specifically for detecting the pitch of the human voice (and of monophonic sounds). This is because pitch shifting by PSOLA and by time stretching and resampling will only result in a high-quality pitch shift if all pitch marks (see figure) are determined accurately. Therefore it is of great importance to make sure these pitch marks are exact. A consequence of this need for pitch marks is that these two algorithms will only give good results for signals where the pitch marks can be clearly located. The need for pitch detection also means extra calculations, which will make it challenging to keep everything real-time in our final implementation on the SHARC DSP board.

[Figure]

Pitch shifting by delay-line modulation, on the other hand, does not require pitch detection of the source signal. On top of that, it can also shift polyphonic sounds, and because it does not need the extra calculations for pitch detection it will be able to run with a smaller delay than the other two algorithms, which is a great advantage.
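To make the idea concrete, here is a minimal MATLAB sketch of pitch shifting by delay-line modulation: two taps read the signal with sawtooth-modulated offsets, and the Doppler effect of the moving read pointer shifts the pitch while a cross-fade hides the jumps where an offset wraps. The sample rate, delay span and shift ratio below are assumed values for illustration, not those of our final real-time implementation.

```matlab
% Minimal sketch of pitch shifting by delay-line modulation (assumed values).
fs    = 16000;            % sample rate (assumed)
ratio = 1.25;             % pitch-shift ratio, >1 raises the pitch
D     = round(0.030*fs);  % delay-line span in samples (~30 ms, assumed)

x = sin(2*pi*150*(0:2*fs-1).'/fs);   % placeholder input; use a speech signal
y = zeros(size(x));

n    = (0:length(x)-1).';
saw1 = mod((1 - ratio)*n,       D);  % read offset of tap 1 (sawtooth)
saw2 = mod((1 - ratio)*n + D/2, D);  % tap 2, half a delay span apart
fad1 = 0.5 - 0.5*cos(2*pi*saw1/D);   % raised-cosine cross-fade gains
fad2 = 0.5 - 0.5*cos(2*pi*saw2/D);   % fad1 + fad2 = 1 at every sample

for k = 1:length(x)
    r1 = max(k - saw1(k), 1);  i1 = floor(r1);  f1 = r1 - i1;
    r2 = max(k - saw2(k), 1);  i2 = floor(r2);  f2 = r2 - i2;
    s1 = (1 - f1)*x(i1) + f1*x(min(i1 + 1, length(x)));  % linear interpolation
    s2 = (1 - f2)*x(i2) + f2*x(min(i2 + 1, length(x)));
    y(k) = fad1(k)*s1 + fad2(k)*s2;                       % cross-faded taps
end
```

Because the read pointer only sweeps through at most the 30 ms delay line, the extra latency stays small, which is exactly the advantage mentioned above.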

 

Because the quality of both pitch shifting by PSOLA and by time stretching and resampling depends on the accuracy of the pitch detection, we spent quite some time on this.

 

Like most pitch detection algorithms, ours is also based on autocorrelation, but because a speech signal can consist of voiced and unvoiced parts, we first need to determine which type we are dealing with. It would be inefficient to run the autocorrelation on an unvoiced speech signal, since unvoiced speech behaves like noise and has no distinctive period.

 

Therefore we first calculate the number of zero crossings in a part of the signal. If the number of crossings is larger than a specified limit, we conclude we are dealing with an unvoiced part; if it is smaller, we are dealing with a voiced part.

[Figure]

In an unvoiced part we simply place the pitch marks at a constant distance from each other. In a voiced part we calculate the autocorrelation (see figure), because voiced speech consists of periodic repetitions. The autocorrelation returns the period of the signal, and we can then search for a maximum in a part of the signal N samples long (N = period). This maximum then represents a pitch mark which we can use in our pitch shifting algorithm.
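The MATLAB sketch below illustrates this procedure on a single analysis frame. The frame length, zero-crossing threshold and allowed pitch range are assumptions for illustration, not our final settings, and xcorr comes from the Signal Processing Toolbox.

```matlab
% Minimal sketch of the voiced/unvoiced decision and the autocorrelation
% based period estimate described above (assumed thresholds and ranges).
fs    = 16000;
t     = (0:round(0.03*fs)-1).'/fs;      % one 30 ms analysis frame
frame = sin(2*pi*150*t);                % placeholder voiced frame (150 Hz)

% 1) Zero-crossing count decides voiced vs. unvoiced.
zc = sum(abs(diff(sign(frame))) > 0);
if zc > 0.3*length(frame)               % assumed threshold
    period = round(0.010*fs);           % unvoiced: fixed mark spacing (10 ms)
else
    % 2) Voiced: the lag of the highest autocorrelation peak between
    %    60 Hz and 400 Hz gives the period in samples.
    r      = xcorr(frame, 'coeff');
    r      = r(length(frame):end);      % keep non-negative lags only
    lo     = round(fs/400);  hi = round(fs/60);
    [~, k] = max(r(lo:hi));
    period = lo + k - 1;
end

% 3) Place a pitch mark on the local maximum of every period-long stretch.
marks = [];
for s = 1:period:length(frame)-period
    [~, m] = max(frame(s:s+period-1));
    marks(end+1) = s + m - 1;
end
```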

Linear Predictive Coding

Linear predictive coding, or LPC for short, is another algorithm that might be added to our list of algorithms we will implement on the DSP board.

When a human produces phonemes, which are the smallest contrastive units in the sound system of a language, several important things happen. First of all, a breath stream comes from the lungs. This breath stream is pushed upwards and passes through the vocal cords. Either voiced or unvoiced sounds can be produced. When a voiced sound is produced, the vocal cords open and close periodically with a certain frequency. This means that short pulses of air flow through the vocal tract at a rate equal to this frequency. Voiced phonemes are, for example, all the vowels and j, z, w, v, d, ... . For unvoiced phonemes the vocal cords do not vibrate; examples are s, f, t, p, ... .

The model that LPC uses states that speech can be modeled as a source signal that is altered, or filtered, by the so-called vocal tract cavity. The vocal tract cavity consists of the path between the vocal cords at one end and the lips and the end of the nose at the other end. By changing the characteristics of the vocal tract cavity, different phonemes can be produced. In LPC, this vocal tract cavity is more commonly referred to as 'the filter'. This filter is modeled by a set of parameters.

Three things are then needed to transfer speech signals: the aforementioned frequency, also referred to as the pitch; the source signal (noise-like for unvoiced, impulse-like for voiced); and finally the filter parameters. Note that the pitch does not have to be known to recreate unvoiced signals.

LPC allows great compression of the data stream: one no longer needs to send the entire signal, but only a few parameters. This is why LPC is used in the GSM standard. LPC predictors are also used in MPEG-4, FLAC and other lossless audio codecs. Vocoders, which are used a lot in electronic music, are another possible application of LPC.

More specifically for us, a simple change of the pitch parameter will allow us to generate a signal with a different pitch.
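A minimal MATLAB sketch of that idea is given below: estimate the vocal tract filter from a voiced frame, then excite the same filter with an impulse train at a different pitch. The LPC order, frame content and pitch values are assumptions, and lpc comes from the Signal Processing Toolbox.

```matlab
% Minimal sketch of the LPC source-filter model (assumed order and pitches).
fs    = 8000;
t     = (0:round(0.03*fs)-1).'/fs;
frame = sin(2*pi*120*t) + 0.3*sin(2*pi*240*t);   % placeholder voiced frame

p = 10;                                  % LPC order (assumed)
a = lpc(frame, p);                       % all-pole vocal tract model 1/A(z)
g = sqrt(mean(filter(a, 1, frame).^2));  % gain from the residual energy

% Re-synthesis at a new pitch: an impulse-like source at 200 Hz instead of
% the original 120 Hz, pushed through the same filter.
newPitch = 200;
src      = zeros(size(frame));
src(1:round(fs/newPitch):end) = 1;       % voiced source: impulse train
y = filter(g, a, src);                   % same vocal tract, different pitch

% For an unvoiced frame the source would simply be noise: src = randn(...).
```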

The application of real-time pitch shifting

These days many artists use real-time pitch shifting during their performances. In the metal genre, for example, artists use pitch shifting to lower the pitch of the vocals, which makes them sound rougher. On the other hand, some artists, like Chuck Berry, used pitch shifting to make their voice sound younger.

 

One might think that pitch shifting can only be used by singers, so they can raise or lower their voice to sound more feminine, masculine, younger, older, and so on. But it can also be used by other people too, for example by voice actors, radio presenters and people with hearing loss at high frequencies.

 

Formerly, when a very high voice was needed for a character (an animal, for example) in a cartoon or animation film, the desired speech first had to be recorded at a low speed. Afterwards the recording could be played back at normal speed, so that a high voice was obtained for that specific character. Obviously this takes a lot of time, which is undesirable when making a cartoon. Therefore it would be advantageous to do this pitch shift in real time: voice actors could simply speak into the microphone while their voice is raised or lowered in real time. This allows for a more intuitive manner of voice acting while speeding up the process significantly.

 

Since a radio presenter's voice has to be pleasant to listen to, real-time pitch shifting could allow more people to do this job, since it can easily modify a person's voice, making it sound optimal before broadcasting.

 

Also for people with hearing loss at high frequencies, it would be advantageous to lower the pitch of the speaker in real time, so that they are able to understand conversations better.

Speech synthesis using PSOLA

PSOLA

PSOLA is short for pitch-synchronous overlap and add. Briefly explained, the algorithm consists of the following steps (a sketch is given after the list):

  1. Start from an audio signal
  2. Take out a fragment (more commonly referred to as a window) around a pitch mark, with a length of two times the pitch period
  3. Weigh this window with a Hanning window
  4. Put these windows back together.
    a. If you put them closer : higher pitch
    b. If you put them farther away from each other : lower pitch.
    [Figure]
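A minimal MATLAB sketch of these steps is given below. It assumes the pitch marks and a constant pitch period are already known; real speech needs the per-mark periods delivered by the pitch detection, and the values used here are assumptions.

```matlab
% Minimal sketch of the PSOLA steps listed above (assumed marks and period).
fs     = 16000;
period = round(fs/150);                       % assumed constant pitch period
x      = sin(2*pi*150*(0:fs-1).'/fs);         % placeholder input signal
marks  = (period+1):period:length(x)-period;  % assumed pitch marks
ratio  = 1.25;                                % >1 raises the pitch

y   = zeros(ceil(length(x)/ratio) + 2*period, 1);
pos = marks(1);                               % first output pitch mark
for m = marks
    % Steps 2-3: two-period fragment around the mark, Hanning-weighted.
    grain = x(m-period:m+period) .* hann(2*period+1);
    % Step 4: overlap-add at the new mark spacing (closer = higher pitch).
    idx    = round(pos) + (-period:period);
    y(idx) = y(idx) + grain;
    pos    = pos + period/ratio;
end
% Note: shifting this way also shortens or lengthens the signal; to keep the
% original duration, grains are additionally repeated or skipped.
```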

It is clear that if a characteristic waveform of the various human phonemes is known, the first step can be omitted: one already knows what the audio signal will be, and it is then very easy to reconstruct a speech signal. The computational efficiency of this algorithm also makes it an interesting one. There are several useful things that can be said about speech synthesis:

  1. Telecommunication services. TTS (text to speech) makes it possible to access textual information over the telephone. Since many telephone conversations require very little interaction, this is very interesting.
  2. Aid for handicapped persons. Stephen Hawking, for example, would not be able to give the lectures he does now without text to speech. Blind people can use computers thanks to voice synthesis.
  3. Man-machine communication. In the future it will be a lot more common to talk to machines and have them talk back to you. This also requires speech synthesis.
  4. Conveying information of public importance. Announcements at airports, railway stations, town squares, ... .

Survey

I have created a website that will be used to gather data about the quality of our algorithms. It is very hard to define metrics that accurately represent this quality, because the perceived quality of an audio fragment is very subjective. Hence, it makes sense to assess the quality of our algorithms based on a (preferably large) number of listeners.

I did some research on C# and ASP.NET and created the following website: [Figure: screenshot of the survey website]

The buttons speak for themselves. For the storage of the data we will use an SQL server.

Practical use

While pitch shifting in itself might appear to be limited to musical applications, the underlying algorithms can be used in a whole range of other applications.

PSOLA, an algorithm we use to change the pitch, can also be used to synthesise speech. Text-to-speech, for example, uses this kind of technique. It can also be used in cell phones: if you, for example, pronounce an 'A', it is not necessary to keep sending the actual waveform for the 'A'. If you have the characteristic waveform once, that is enough; by using PSOLA the 'A' sound can be reproduced artificially.

Generally speaking, when dealing with audio/speech processing, it is highly likely you will encounter some form of pitch detection.

Pitch shifting algorithms

We have put forward several possible algorithms to implement pitch shifting. This gives us some leeway in the end, and it will be interesting to compare the algorithms with each other. Since pitch detection is a separate step, we can combine several pitch shifting algorithms with several pitch detection algorithms, which gives rise to even more possible setups.

So far, for pitch detection we have implemented an autocorrelation method to determine the pitch marks. For the pitch shifting we have PSOLA, delay-line modulation, and time stretching and resampling. This is all done in MATLAB and tested on speech signals. We will focus on speech signals first, and once this is fully operational we can see what the possibilities are for multi-tone signals. Speech signals are easier to manipulate, since there is only one pitch present; this is not the case when music is involved. In music it is also very confusing to talk about a pitch when dealing with drums, for example: a drum sound typically has no pitch. Some speech sounds also lack a pitch, such as the 's' and 'f'. These sounds are more noise-like than a waveform with a well-determined pitch.
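For pitch shifting by time stretching and resampling, the MATLAB sketch below uses a deliberately naive overlap-add time stretch followed by resampling; the window and hop sizes are assumptions, and a real implementation would align the overlapping frames (or use PSOLA) for better quality.

```matlab
% Minimal sketch of pitch shifting by time stretching and resampling.
% The time stretch is plain overlap-add without waveform alignment, so it is
% only meant to show the principle; window/hop sizes are assumed values.
fs    = 16000;
ratio = 1.25;                                   % >1 raises the pitch
x     = sin(2*pi*150*(0:fs-1).'/fs);            % placeholder input

% 1) Time-stretch by 'ratio': read frames with hop anaHop, write them with
%    the larger hop synHop, so the result becomes 'ratio' times longer.
win    = 1024;  anaHop = 256;                   % assumed window / hop sizes
synHop = round(anaHop*ratio);
w      = hann(win);
y      = zeros(ceil(length(x)*ratio) + win, 1);
outPos = 1;
for inPos = 1:anaHop:length(x)-win+1
    idx    = outPos:outPos+win-1;
    y(idx) = y(idx) + x(inPos:inPos+win-1).*w;
    outPos = outPos + synHop;
end
y = y*(2*synHop/win);                           % rough gain compensation

% 2) Resample by 1/ratio: the duration returns to the original length and
%    the pitch goes up by 'ratio'.
[p, q] = rat(1/ratio);
y = resample(y, p, q);
```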

Introduction

Thomas and I are making a pitch shifting algorithm to implement on a DSP board.

For this project we will design an algorithm to shift the pitch of a signal. At first we will focus mainly on speech signals. We are working on two big parts: pitch detection and pitch shifting. Some pitch shifting algorithms require a detection of the pitch before the shifting can actually happen.

The detection of the pitch of a signal is a time-consuming process and requires the most processing power in our thesis. Pitch detection is also relevant for compression: several compression schemes require the pitch to be known in order to encode and decode the signal. It is therefore interesting to research possible improvements.

The real-time implementation of this pitch shifting also presents some challenges. We have to make a calculated decision: do we want a fast algorithm with lower quality, or a slower algorithm with higher quality?