DE10244699A1 - Method for determining voice activity in a section of an audio signal by threshold-based phrase detection

Info

Publication number: DE10244699A1
Application number: DE2002144699
Authority: DE
Inventors: Diane Dr.-Ing. Hirschfeld; Thomas Richter
Original assignee: VOICE INTER CONNECT GmbH
Current assignee: VOICE INTER CONNECT GmbH
Priority date: 2002-09-24
Filing date: 2002-09-24
Publication date: 2004-04-01
Anticipated expiration: 2022-09-25
Also published as: DE10244699B4

Abstract

Energy values (EV) for a time segment of an audio signal are recorded in a histogram. From the resulting distribution of the energy values, a speech threshold (6) and a pause threshold (5) are determined. These thresholds are compared with the current energy value to decide on the phrase boundary between speech and pause, so that a 'start' label (8) for the beginning of each phrase and a 'stop' label (7) for its end are determined and stored.

Description

The invention relates to a method for determining speech activity in a signal section of an audio signal by means of threshold-based phrase detection.

The development of robust phrase boundary detectors is important for the automatic recognition of continuous speech. Such detectors are used, for example, in signal processing in the mobile communications sector to improve recognition rates and to reduce the data to the relevant information. Further fields of application are command word recognition, echo cancellation and noise suppression.

For economical use, phrase detection must meet certain requirements. Besides robust detection, these include fast adaptation to changing environmental conditions and relatively low resource demands, both in data memory and in the computation required.

Very simple and time-efficient phrase detection algorithms generally perform poorly in terms of detection reliability. The boundaries of the phrases to be detected are often located inaccurately, which can lead to data loss on the one hand and to false detections on the other. Data loss means that relevant information is disregarded, for example spoken sounds that belong to the utterance, some of which distinguish meaning, but that the phrase detector has marked as not belonging to the utterance. False detections, conversely, are signal sections marked as phrases that do not represent a spoken utterance.

At the current state of development, three performance classes of phrase boundary detectors can be identified. The first class comprises simple, energy-threshold-based detectors operating in the time domain, as described in DE 100 26 872 A1. These evaluate time signals by a threshold decision on the energy determined for a particular signal segment (window); they are therefore generally fast and can be implemented with little modeling effort. The detection rate achieved, however, depends strongly on the signal and its background noise.

The second class comprises more powerful detectors operating in the frequency domain, as described in ETSI EN 301 708 V7.1.1 (1999-12) of December 1999. These evaluate signals that have been transformed into the frequency domain and divided into frequency channels; they are therefore usually complex and computationally expensive. Higher detection reliability can be achieved because many parameters (pitch, signal-to-noise ratio, peak-to-average ratio, etc.) are used for the decision.

The third class groups together the elaborate and extensive statistical methods. By evaluating the probability density function (PDF) or by building models with the aid of an HMM (hidden Markov model), high detection reliability can be achieved at high computational cost. A more detailed description can be found in Sohn, Jongseo: "A Statistical Model-Based Voice Activity Detection", IEEE Signal Processing Letters, Vol. 6, No. 1, January 1999.

For implementing phrase boundary detectors in systems with limited resources, only detectors of the first performance class therefore come into consideration. Until now, however, these simply implemented detectors have had to be expected to offer insufficient detection reliability and insufficient adaptation to changing environmental conditions.

The object of the invention is therefore to specify a method for determining speech activity in a signal section of an audio signal that reduces the conflict between reliable detection and low computational effort and that robustly separates speech from time-varying background noise.

According to the invention, this object is achieved in that, in a first step, energy values of a time segment of the audio signal are recorded in a histogram; in a second step, a speech threshold and a pause threshold are set on the basis of the determined distribution of the energy values; and a phrase boundary decision between speech and pause is made by comparing the thresholds with the current energy value.

Proceeding in time, a signal segment (time window) is first examined and its energy determined. This short-term energy value is entered into a histogram that estimates the long-term distribution of the signal energy. For this estimated distribution, the parameters mean X̄ and variance s are determined. From these two parameters, the speech threshold ThrVoice and the pause threshold ThrPause are determined in the threshold adaptation. Using two thresholds makes the phrase boundary decision more robust against small energy fluctuations, as shown in Fig. 4.

In one embodiment of the invention, the speech threshold and the pause threshold are determined in step with the signal, either before or after a phrase boundary decision.

Determining the thresholds in step with the signal achieves robust and fast adaptation to changing environmental conditions. The signal energy is calculated on a short-term basis over the length of one time window. The spacing between two successive time windows (the window advance rate) controls the temporal resolution of the phrase boundary decision. A low advance rate, that is, a small spacing between windows, gives good resolution in the time domain.
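The windowing described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `frame_energies`, the parameters `window_len` and `hop`, and the choice of a logarithmic RMS level as the energy measure are not taken from the patent, whose own energy equation is not reproduced in this text.

```python
import math

def frame_energies(samples, window_len, hop):
    """Return one short-term energy value per time window.

    Assumption: energy is the RMS level of the window in dB.
    `hop` is the spacing between successive windows (the advance rate).
    """
    energies = []
    for start in range(0, len(samples) - window_len + 1, hop):
        window = samples[start:start + window_len]
        rms = math.sqrt(sum(x * x for x in window) / window_len)
        # Floor the RMS so that silent windows do not produce log(0).
        energies.append(20.0 * math.log10(max(rms, 1e-10)))
    return energies

# A smaller hop yields more energy values for the same signal,
# i.e. a finer temporal resolution of the later boundary decision.
signal = [1.0] * 8
coarse = frame_energies(signal, window_len=4, hop=4)
fine = frame_energies(signal, window_len=4, hop=2)
```

With the same window length, halving `hop` roughly doubles the number of decisions per unit of time, at the cost of proportionally more computation.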

Making a phrase boundary decision before the thresholds are updated, using the thresholds available from a previous pass, makes it possible to classify the energy values that occur only in the pauses. The window advance rate alone determines how much time has elapsed between the period currently under consideration and the period in which the thresholds were adapted; since it is normally relatively small, the resulting decision error is kept small.

In a further embodiment of the invention, as a result of the phrase boundary decision a label "Start" for the beginning of a phrase and a label "Stop" for the end of a phrase are determined for each phrase, and the respective labels and their associated times are stored.

The phrase boundary decision compares the current energy value with the determined thresholds and determines the state of the signal. Phrase boundary detection distinguishes two states. The first state characterizes the pause, or background noise, and marks the beginning of the region containing no phrase with the label "Stop". This state is entered when the signal energy first falls below the pause threshold and persists until it is replaced by the second state. The second state is entered when a phrase is present, that is, when the signal energy first exceeds the speech threshold. The beginning of this region is marked by a "Start" label. This region ends only when the signal energy falls below the pause threshold again.
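The two-state decision can be sketched as a small hysteresis loop. The function name `label_phrases` is hypothetical, the thresholds are held fixed for simplicity (in the method they are adapted continuously), and starting in the pause state is an assumption.

```python
def label_phrases(energies, thr_voice, thr_pause):
    """Return a label track of ("Start", index) and ("Stop", index) entries.

    A phrase starts when the energy first exceeds thr_voice and ends
    only when the energy falls below thr_pause again.
    """
    labels = []
    in_phrase = False  # assumption: the signal begins in the pause state
    for i, energy in enumerate(energies):
        if not in_phrase and energy > thr_voice:
            in_phrase = True
            labels.append(("Start", i))
        elif in_phrase and energy < thr_pause:
            in_phrase = False
            labels.append(("Stop", i))
    return labels

# The dip to 3 at index 3 lies between the two thresholds,
# so it does not end the phrase.
track = label_phrases([0, 1, 5, 3, 5, 1, 0, 5], thr_voice=4, thr_pause=2)
```

Energy values between the two thresholds never change the state, which is what makes the decision insensitive to small fluctuations.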

In one embodiment of the invention, a minimum and a maximum phrase length and a minimum pause length are specified, and a plausibility check is carried out such that labels whose associated time intervals do not conform to the phrase lengths or the pause length are eliminated from the label track.

Robust phrase boundary detection is not ensured by threshold adaptation alone. Wrong decisions in phrase boundary detection are avoided by correcting the decision. The correction is carried out once a complete phrase is available. It consists of checking the minimum pause length and the minimum and maximum expected phrase duration. The minimum pause length correction ensures that pauses detected within the audio signal, for example caused by short gaps within words, are not marked as pauses. The minimum phrase length check removes short sections marked as phrases, and the maximum phrase duration check removes long, unexpected segments.
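A minimal sketch of such a plausibility check is given below. It assumes that the label track has already been paired into (start, stop) intervals in a common time unit, and that short pauses are bridged before the phrase length limits are applied; the text does not fix this order, and the function name is hypothetical.

```python
def plausibility_check(segments, min_pause, min_phrase, max_phrase):
    """Filter (start, stop) phrase intervals given in time order.

    1. Pauses shorter than min_pause are bridged (the neighbors are merged).
    2. Phrases shorter than min_phrase or longer than max_phrase are dropped.
    """
    merged = []
    for start, stop in segments:
        if merged and start - merged[-1][1] < min_pause:
            # Gap too short to be a real pause: extend the previous phrase.
            merged[-1] = (merged[-1][0], stop)
        else:
            merged.append((start, stop))
    return [(a, b) for a, b in merged if min_phrase <= b - a <= max_phrase]

# The 2-unit gap between the first two segments is bridged; the 2-unit
# blip and the 300-unit segment are removed.
kept = plausibility_check(
    [(0, 10), (12, 30), (50, 52), (100, 400)],
    min_pause=5, min_phrase=5, max_phrase=100,
)
```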

In one embodiment of the invention, the energy values are calculated according to the equation

[equation not reproduced in the source text]

using the RMS value X of a signal section of width N.

This energy value is entered into the histogram, or into the distribution it contains, in such a way that the number of values in the histogram remains constant in the steady state. The steady state is reached when the histogram contains enough values, because only after a certain number of values does the histogram estimate the actual distribution with sufficient accuracy. So that not every fluctuation in signal energy adversely affects the energy distribution, only energy values that are not too far from the maximum of the current distribution are admitted into the histogram. This decision can be made by combining the distribution variance and the distribution mean.
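One possible way to realize such a histogram is sketched below. Several details are assumptions not stated in the text: the oldest value is discarded when a new one is admitted (first in, first out), the admission test is "within `gate` spreads of the mean", and the spread s is computed as a standard deviation so that it has the same unit as the energy values. The class and parameter names are hypothetical.

```python
from collections import deque

class EnergyHistogram:
    def __init__(self, lo, hi, n_bins, capacity, gate=3.0):
        self.lo, self.n_bins = lo, n_bins
        self.width = (hi - lo) / n_bins
        self.counts = [0] * n_bins
        self.order = deque()      # bin indices in arrival order
        self.capacity = capacity  # number of values held in steady state
        self.gate = gate

    def _bin(self, energy):
        idx = int((energy - self.lo) / self.width)
        return min(max(idx, 0), self.n_bins - 1)

    def stats(self):
        """Return (mean, spread) of the binned values, or None if empty."""
        total = sum(self.counts)
        if total == 0:
            return None
        centers = [self.lo + (i + 0.5) * self.width for i in range(self.n_bins)]
        mean = sum(c * n for c, n in zip(centers, self.counts)) / total
        var = sum(n * (c - mean) ** 2 for c, n in zip(centers, self.counts)) / total
        return mean, var ** 0.5

    def add(self, energy):
        """Admit a value if plausible; return True if it was stored."""
        if len(self.order) == self.capacity:  # steady state reached
            mean, s = self.stats()
            # One bin width of tolerance keeps the gate open when s is 0.
            if abs(energy - mean) > self.gate * s + self.width:
                return False
            self.counts[self.order.popleft()] -= 1  # drop the oldest value
        idx = self._bin(energy)
        self.counts[idx] += 1
        self.order.append(idx)
        return True
```

Before the steady state is reached every value is admitted, so the gate only takes effect once the histogram holds `capacity` values.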

In a particular embodiment of the invention, after the first step the energy values recorded in the histogram are smoothed according to the formula

[formula not reproduced in the source text]

Here the smoothed histogram entry X'(N) of the N-th histogram interval is the sum of the weighted two left and two right neighboring histogram entries X(N - 2), X(N - 1), X(N + 1) and X(N + 2) and the weighted entry X(N) itself.
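The five-point weighted sum can be sketched as below. The actual weights are not reproduced in this text, so the triangular weights (1, 2, 3, 2, 1)/9 are purely an assumption, as is treating entries beyond the histogram edges as zero.

```python
def smooth_histogram(counts, weights=(1/9, 2/9, 3/9, 2/9, 1/9)):
    """Return X'(N) = sum over k in -2..2 of w_k * X(N + k) for every bin N.

    The weights are illustrative placeholders, not the patent's values.
    """
    n = len(counts)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(range(-2, 3), weights):
            j = i + k
            if 0 <= j < n:  # neighbors outside the histogram count as zero
                acc += w * counts[j]
        smoothed.append(acc)
    return smoothed

# A single spike is spread over its two neighbors on each side.
spread = smooth_histogram([0, 0, 9, 0, 0])
```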

In a further embodiment of the invention, the pause threshold is determined, with an adaptation factor α controlling the adaptation speed and a parameter β setting the distance of the pause threshold from the mean X̄, according to the equation ThrPause' = (1 - α)ThrPause + α(X̄ + βs).

The pause threshold ThrPause, which in the phrase boundary decision is decisive for detecting the end of a phrase, is given by the above equation. In the threshold adaptation, the thresholds are determined from the distribution parameters mean X̄ and variance s. The adaptation factor α controls the sensitivity of the adaptation: if its value is close to zero the adaptation is very slow, whereas close to one it is very fast. The parameter β determines how far from the mean X̄ the pause threshold is placed.
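The effect of the adaptation factor can be illustrated with a short sketch of the pause threshold recursion. The function names and all numeric values are chosen for illustration only.

```python
def update_thr_pause(thr_pause, mean, s, alpha, beta):
    """One adaptation step: ThrPause' = (1 - alpha)*ThrPause + alpha*(mean + beta*s)."""
    return (1.0 - alpha) * thr_pause + alpha * (mean + beta * s)

def adapt(thr_pause, mean, s, alpha, beta, steps):
    """Apply the update repeatedly while mean and s stay unchanged."""
    for _ in range(steps):
        thr_pause = update_thr_pause(thr_pause, mean, s, alpha, beta)
    return thr_pause

# Both runs move toward mean + beta*s = 23, but after ten steps the
# run with alpha near zero is still far away from it.
slow = adapt(0.0, mean=20.0, s=2.0, alpha=0.05, beta=1.5, steps=10)
fast = adapt(0.0, mean=20.0, s=2.0, alpha=0.9, beta=1.5, steps=10)
```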

In a particular embodiment of the invention, the speech threshold is determined, with an adaptation factor α controlling the adaptation speed and a parameter γ setting the distance of the speech threshold from the pause threshold, according to the equation ThrVoice' = (1 - α)ThrVoice + α(ThrPause + γs).

The adaptation of the speech threshold ThrVoice is based on the calculated pause threshold ThrPause. In the above equation, α is again the adaptation factor that controls the speed of adaptation, and γ determines how large the distance between the speech threshold ThrVoice and the pause threshold ThrPause is. Linking the speech and pause thresholds to the variance s of the distribution, as in the equation, has the advantage that the distance between the two thresholds depends on the distribution of the short-term energy, which differs for changing background noise.
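A sketch of the speech threshold recursion follows. The function name and numbers are illustrative, and passing in the freshly calculated pause threshold is one reading of the text.

```python
def update_thr_voice(thr_voice, thr_pause, s, alpha, gamma):
    """One step: ThrVoice' = (1 - alpha)*ThrVoice + alpha*(ThrPause + gamma*s)."""
    return (1.0 - alpha) * thr_voice + alpha * (thr_pause + gamma * s)

# With alpha = 1 the new threshold sits exactly gamma*s above the pause
# threshold, so a wider energy distribution pushes the thresholds apart.
quiet = update_thr_voice(30.0, thr_pause=20.0, s=1.0, alpha=1.0, gamma=3.0)
noisy = update_thr_voice(30.0, thr_pause=20.0, s=4.0, alpha=1.0, gamma=3.0)
gap_quiet = quiet - 20.0
gap_noisy = noisy - 20.0
```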

In quiet environments (static case), the distribution of the short-term energy is narrow, which is expressed by a small variance. That is, the short-term energy fluctuates relatively little around its long-term mean. In noisy environments, dynamic changes in the background noise usually occur, resulting in a broad distribution of the short-term energy. In this case the variance is large, because the short-term energy fluctuates strongly around its long-term mean. A small distance between the thresholds in the static case and a large distance in the dynamic case limit false detections of the phrase boundaries.
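The dependence of the threshold distance on the variance can be made explicit with a short worked derivation. It assumes that the mean X̄ and s remain constant and that 0 < α ≤ 1; setting ThrPause' = ThrPause and ThrVoice' = ThrVoice in the two update equations then gives the values the thresholds settle at (marked with an asterisk, a notation introduced here):

```latex
\begin{aligned}
\mathrm{ThrPause}^{*} &= (1-\alpha)\,\mathrm{ThrPause}^{*} + \alpha\,(\bar{X} + \beta s)
  &&\Rightarrow\quad \mathrm{ThrPause}^{*} = \bar{X} + \beta s,\\
\mathrm{ThrVoice}^{*} &= (1-\alpha)\,\mathrm{ThrVoice}^{*} + \alpha\,(\mathrm{ThrPause}^{*} + \gamma s)
  &&\Rightarrow\quad \mathrm{ThrVoice}^{*} = \mathrm{ThrPause}^{*} + \gamma s,\\
\mathrm{ThrVoice}^{*} - \mathrm{ThrPause}^{*} &= \gamma s .
\end{aligned}
```

Under these assumptions the settled distance between the thresholds is proportional to s and independent of α, which only sets how quickly that distance is reached.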

The invention is explained in more detail below with reference to two exemplary embodiments. In the accompanying drawings:

Fig. 1 shows a first variant of the method sequence,

Fig. 2 shows a second variant of the method sequence,

Fig. 3a shows a histogram with energy values,

Fig. 3b shows a smoothed distribution and derived parameters,

Fig. 4 shows an example phrase with thresholds,

Fig. 5 shows an example of threshold adaptation to changing background noise, and

Fig. 6 shows a possible energy distribution for speech and interference signals.

The method according to the invention can be used in various fields. In speech signal processing, it can provide detection of useful signals and reliable start and end point detection for a command word recognizer. The method also enables noise suppression in which pause detection is needed for adaptation processes, the detection of speaker activity for echo cancellation, or the determination of channel utilization in telephony.

A first variant of the method sequence is shown in Fig. 1. Proceeding in time, in a first step a signal segment of an audio signal (time window) is examined and its energy values 1 are determined. These energy values 1 are entered into a distribution in the form of a histogram 2 as shown in Fig. 3a, thereby creating or updating the distribution. The energy values 1 are entered into the histogram 2, or into the distribution it contains, in such a way that the number of values in the histogram 2 remains constant in the steady state. The steady state is reached when the histogram 2 contains enough values, because only after a certain number of values in the histogram 2 can one speak of a distribution. So that not every fluctuation in signal energy adversely affects the energy distribution, only energy values 1 that are not too far from the current distribution are admitted into the histogram 2. This decision can be made by combining the distribution variance 3 and the distribution mean 4.

After the distribution has been smoothed, the method evaluates the histogram 2 and determines a mean X̄ 4 and the variance s 3, as shown in Fig. 3b. From these distribution parameters 3 and 4, the pause threshold ThrPause 5 is determined according to the rule ThrPause' = (1 - α)ThrPause + α(X̄ + βs).

Here the adaptation factor α controls the sensitivity of the adaptation. If α is close to zero, the adaptation is carried out very slowly; if α is close to one, it takes place very quickly. The parameter β influences the distance of the pause threshold 5 from the determined mean X̄ 4.

The speech threshold ThrVoice 6 is determined on the basis of the previously determined pause threshold ThrPause 5 and the variance s 3.

It is determined using the equation ThrVoice' = (1 - α)ThrVoice + α(ThrPause + γs).

Here α again sets the speed of the adaptation. The distance of the speech threshold from the pause threshold is influenced by γ.

Linking the speech threshold ThrVoice 6 to the variance s 3 of the distribution, as in the equation, has the advantage that the distance between the two thresholds 5 and 6 depends on the distribution of the short-term energy, which differs for changing background noise (see Fig. 5). In quiet environments (static case), the distribution of the short-term energy is narrow, which is expressed by a small variance 3. That is, the short-term energy fluctuates relatively little around its long-term mean. In noisy environments, dynamic changes in the background noise usually occur, resulting in a broad distribution of the short-term energy. In this case the variance 3 is large, because the short-term energy fluctuates strongly around its long-term mean. A small distance between the thresholds 5 and 6 in the static case and a large distance between them in the dynamic case limit false detections of the phrase boundaries.

A subsequent comparison of the determined thresholds 5 and 6 with the current energy value 1 yields a phrase boundary decision between speech and pause. Two states are distinguished. The first state characterizes the pause or the background noise. The beginning of this region, which contains no phrase, is marked with the label "Stop" 7. This state is entered when the signal energy first falls below the pause threshold, and it lasts until it is replaced by the second state. The second state is entered when a phrase is present, that is, when the signal energy first exceeds the speech threshold. The beginning of this region is marked with a "Start" label 8. This region ends only when the signal energy falls below the pause threshold again.
The outcome of this phrase detection is shown in Fig. 4. The figure shows an example phrase 9 with the currently determined thresholds 5 and 6 and the Start label 8 and Stop label 7 that were set.
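The two-state decision described above can be sketched as a small hysteresis loop. This is an illustration only: the function name is hypothetical, the thresholds are held fixed for brevity (in the method they are adapted continuously), and the sketch assumes that processing starts in the pause state.

```python
def label_phrases(energies, thr_pause, thr_voice):
    """Return a list of (frame_index, label) events.

    A "Start" label is set when the energy first exceeds the speech
    threshold; a "Stop" label is set when it then falls below the
    pause threshold. Energies between the thresholds keep the state.
    """
    labels = []
    in_phrase = False  # assumption: the signal begins with a pause
    for i, energy in enumerate(energies):
        if not in_phrase and energy > thr_voice:
            in_phrase = True
            labels.append((i, "Start"))
        elif in_phrase and energy < thr_pause:
            in_phrase = False
            labels.append((i, "Stop"))
    return labels
```

Because the two thresholds differ, an energy value that lies between them never changes the state, which keeps the labels from toggling on small fluctuations.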

Robust phrase boundaries are not ensured by adapting thresholds 5 and 6 alone. Wrong decisions in phrase boundary detection are avoided in a subsequent step by a plausibility check. The check is carried out once a complete phrase is available. It consists of checking the minimum pause length as well as the minimum and maximum expected phrase duration. Checking the minimum pause length ensures that pauses detected within a phrase 9 are not marked as pauses. Checking the minimum phrase length removes short sections marked as phrases. Checking the maximum phrase duration filters out long, unexpected segments. A downstream stage, for example a command word recognizer, can thus be given a nearly error-free sequence of labels with their associated time intervals.
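A minimal sketch of such a plausibility check is given below. The function name, the representation of phrases as (start, stop) pairs in milliseconds, and the order of the checks (short pauses are bridged first, then phrase durations are tested) are assumptions; the text does not prescribe them.

```python
def plausibility_check(phrases, min_pause, min_phrase, max_phrase):
    """Filter a time-ordered list of (start, stop) phrase intervals."""
    merged = []
    for start, stop in phrases:
        # A pause shorter than min_pause is treated as part of the phrase.
        if merged and start - merged[-1][1] < min_pause:
            merged[-1] = (merged[-1][0], stop)
        else:
            merged.append((start, stop))
    # Drop phrases that are implausibly short or implausibly long.
    return [(a, b) for a, b in merged if min_phrase <= b - a <= max_phrase]
```

For example, with a minimum pause of 100 ms, two phrases separated by a 50 ms gap are joined into one interval before the duration limits are applied.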

A second variant of the process flow is shown in Fig. 2. It differs from the first variant in that the phrase boundary decision is made not after the thresholds 5 and 6 are determined but before.

After the energy has been determined over the time window under consideration, a phrase boundary decision is made using the thresholds 5 and 6 available from the previous pass. The window advance rate alone determines how large the time difference is between the period currently under consideration and the period in which the thresholds were adapted. Because this difference is normally kept relatively small, the error in the decision can be kept small.

The advantage of making the phrase boundary decision before the histogram adaptation is that the decision makes it possible to select only those energy values 1 that occurred during pauses. The resulting distribution is thus purely a distribution of the background noise. Because only energy values 1 from pauses are taken into account, the distribution in histogram 2 adapts quickly to the ambient conditions and is in many cases narrow, that is, the variance 3 is small. If instead the distribution of all energies that occurred is considered, the picture shown in Fig. 6 results in many cases. Two maxima of the distribution are clearly visible. The left part of Fig. 6 is the distribution of the background noise, and the right part is the distribution of the spoken utterances.
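The order of operations in the second variant can be sketched as follows: each frame is first classified with the thresholds from the previous pass, and only frames classified as pause enter the histogram and trigger an adaptation. Everything specific here is an assumption for illustration: the function name, the bin layout (n_bins equal bins over non-negative energies up to e_max), the parameter defaults, and the use of the standard deviation of the binned values as the spread s.

```python
import math

def run_variant_two(energies, thr_pause, thr_voice,
                    alpha=0.2, beta=1.0, gamma=2.0,
                    n_bins=50, e_max=100.0):
    """Decide first, then adapt the thresholds from pause frames only."""
    counts = [0] * n_bins
    width = e_max / n_bins
    in_phrase = False
    decisions = []
    for energy in energies:
        # 1. Phrase boundary decision with the thresholds of the previous pass.
        if not in_phrase and energy > thr_voice:
            in_phrase = True
        elif in_phrase and energy < thr_pause:
            in_phrase = False
        decisions.append(in_phrase)
        if in_phrase:
            continue  # speech frames never enter the histogram
        # 2. Record the pause energy in the histogram.
        idx = min(max(int(energy / width), 0), n_bins - 1)
        counts[idx] += 1
        # 3. Mean and spread of the background-noise distribution.
        total = sum(counts)
        centers = [(k + 0.5) * width for k in range(n_bins)]
        mean = sum(c * x for c, x in zip(counts, centers)) / total
        var = sum(c * (x - mean) ** 2 for c, x in zip(counts, centers)) / total
        s = math.sqrt(var)
        # 4. Adapt both thresholds (right-hand sides use the old values).
        thr_pause, thr_voice = (
            (1 - alpha) * thr_pause + alpha * (mean + beta * s),
            (1 - alpha) * thr_voice + alpha * (thr_pause + gamma * s),
        )
    return decisions, thr_pause, thr_voice
```

Since the thresholds are frozen while a phrase is active, loud speech frames cannot widen the noise distribution, which is the effect the paragraph above attributes to this variant.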

1: energy values
2: histogram
3: distribution variance s
4: distribution mean X
5: pause threshold ThrPause
6: speech threshold ThrVoice
7: "Stop" label
8: "Start" label
9: example phrase

Claims

Method for determining the speech activity in a signal section of an audio signal by threshold-based phrase detection, characterized in that in a first step energy values (1) of a time period of the audio signal are recorded in a histogram (2), that in a second step a speech threshold (6) and a pause threshold (5) are determined on the basis of the determined distribution of the energy values (1), and that a phrase boundary decision between speech and pause is made by comparing the thresholds (5 and 6) with the current energy value.

A method according to claim 1, characterized in that the speech threshold (6) and the pause threshold (5) are determined, keeping pace with the signal curve, before or after a phrase boundary decision.

A method according to claim 1, characterized in that for each phrase a label "Start" (8) for the beginning of the phrase and a label "Stop" (7) for the end of the phrase are determined, and that the respective labels (7 and 8) and the associated times are stored.

Method according to claims 1 and 3, characterized in that a minimum and a maximum phrase length and a minimum pause length are defined and a plausibility check is carried out in such a way that labels (7 and 8) whose associated time intervals do not correspond to the phrase lengths or the pause length are eliminated from the label track.

A method according to claim 1, characterized in that the energy values (1) are calculated according to the equation

with an effective value X of a signal section of width N.

A method according to claim 1, characterized in that after the first step the energy values (1) recorded in the histogram (2) are smoothed according to the formula

A method according to claim 1, characterized in that the pause threshold (5) is determined, with an adaptation factor α controlling the speed of adaptation and a parameter β determining the distance of the pause threshold (5) from the mean X (4), according to the equation ThrPause' = (1 − α) · ThrPause + α · (X + β · s).

A method according to claim 1, characterized in that the speech threshold (6) is determined, with an adaptation factor α controlling the speed of adaptation and a parameter γ determining the distance of the speech threshold (6) from the pause threshold (5), according to the equation ThrVoice' = (1 − α) · ThrVoice + α · (ThrPause + γ · s).