CN102592593A - Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech - Google Patents


Info

Publication number
CN102592593A
CN102592593A CN2012100915251A CN201210091525A
Authority
CN
China
Prior art keywords
rank
characteristic
matrix
tensor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100915251A
Other languages
Chinese (zh)
Other versions
CN102592593B (en)
Inventor
吴强 (Wu Qiang)
刘琚 (Liu Ju)
孙建德 (Sun Jiande)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Shandong University
Priority to CN201210091525.1A (granted as CN102592593B)
Publication of CN102592593A
Application granted
Publication of CN102592593B
Expired - Fee Related


Landscapes

  • Machine Translation (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an emotional feature extraction method that considers the sparsity of multilinear groups in speech. The method comprises the following steps: considering the multiple factors contained in a speech signal, such as time, frequency, scale, and direction information; performing feature extraction by a sparse decomposition method for multilinear groups; constructing a multilinear representation of the energy spectrum of the speech signal through Gabor functions of different scales and directions; solving for the feature projection matrices by a group sparse tensor decomposition method; computing the feature projection on the frequency mode; decorrelating the features through the discrete cosine transform; and finally computing the first-order and second-order difference coefficients to obtain the emotional features of the speech. According to the invention, factors such as time, frequency, scale, and direction in a speech signal are all considered for extracting emotional features, and the feature projection is performed by a group sparse tensor decomposition method, thereby improving the accuracy of multi-class speech emotion recognition.

Description

A speech emotional feature extraction method considering multilinear group sparsity in speech
Technical field
The present invention relates to a speech emotional feature extraction method for improving speech emotion recognition performance, and belongs to the field of speech processing technology.
Background art
Speech is one of the most convenient means of daily human communication, which has led researchers to explore how to use speech as a tool for interaction between people and machines. Beyond traditional interaction modes such as speech recognition, the speaker's emotion also carries important interactive information, and the ability of a machine to automatically recognize and understand the speaker's emotion is one of the hallmarks of intelligent human-computer interaction.
Speech emotion recognition has significant value in signal processing and intelligent human-machine interaction, with many potential applications. In human-computer interaction, recognizing the speaker's emotion by computer can improve the friendliness and accuracy of a system: a distance-education system, for example, can adjust its course in time by recognizing a student's emotion, thereby improving teaching effectiveness; in telephone call centers and mobile communications, a user's emotional information can be obtained promptly to improve the quality of service; an in-vehicle system can detect through emotion recognition whether the driver is concentrating, and issue an appropriate auxiliary warning. In medicine, speech-based emotion recognition can serve as a tool to help doctors diagnose a patient's condition.
For speech emotion recognition, an important problem is how to extract effective features to represent different emotions. In traditional feature extraction, a speech signal is usually divided into frames in order to obtain approximately stationary segments. Features obtained from each frame, such as pitch and energy, are called local features; their advantage is that existing classifiers can use them to estimate the parameters of different emotional states fairly accurately, while their disadvantage is that the feature dimensionality and sample count are large, which affects the speed of feature extraction and classification. Features obtained by computing statistics over a whole utterance are called global features; they yield better classification accuracy and speed, but lose the temporal information of the speech signal and easily run into a shortage of training samples. In general, commonly used features for speech emotion recognition fall into several types: continuous acoustic features, spectral features, features based on the Teager energy operator, and so on.
According to research in psychology and prosody, the most intuitive cues to a speaker's emotion in speech are the continuous prosodic features, such as pitch, energy, and speaking rate. Corresponding global features include the mean, median, standard deviation, maximum, and minimum of pitch or energy, as well as the first and second formants.
Spectral features provide the useful frequency information in the speech signal and are another important mode of feature extraction in speech emotion recognition. Commonly used spectral features include linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP).
Speech is produced by nonlinear airflow in the vocal system. The Teager energy operator (TEO), proposed by Teager et al., is an operation that can rapidly track changes in signal energy within a glottal cycle, and is used to analyze the fine structure of speech. Under different emotional states, muscle tension affects the airflow in the vocal system; according to the findings of Bou-Ghazale et al., TEO-based features can be used to detect stress in speech.
Numerous experimental evaluations show that, for speech emotion recognition, suitable features should be chosen for each classification task: Teager-energy-based features are suited to detecting stress in the speech signal; continuous acoustic features are suited to distinguishing high-arousal emotions from low-arousal emotions; and for multi-class emotion classification tasks, spectral features are the most suitable speech representation. Combining spectral features with continuous acoustic features, or jointly analyzing multiple factors, can also improve classification accuracy.
After feature extraction and selection are complete, the other important stage is classification. Various classifiers from pattern recognition are currently used to classify speech emotional features, including hidden Markov models (HMM), Gaussian mixture models (GMM), support vector machines (SVM), linear discriminant analysis (LDA), and ensemble classifiers. The HMM is one of the most widely used recognizers in speech emotion recognition, benefiting from its broad application to speech signals; it is particularly suited to data with sequential structure, and current studies show that HMM-based emotion recognition systems can provide high classification accuracy. A Gaussian mixture model can be regarded as an HMM with a single state and is well suited to modeling multivariate distributions; Breazeal et al. applied GMM classifiers to the KISMET speech database to classify five emotion categories. Support vector machines are widely used in pattern recognition; their basic principle is to project features into a high-dimensional space via a kernel function so that the features become linearly separable. Compared with HMM and GMM, SVMs have the advantages of globally optimal training and generalization bounds that depend on the data, and many studies have obtained good classification results using SVMs as the classifier for speech emotion recognition.
As shown in Fig. 1, a traditional speech emotion recognition method based on spectral features usually adopts the following steps (a minimal baseline sketch in Python follows the list):
1) preprocess the input speech signal, including windowing, filtering, and pre-emphasis;
2) apply the short-time Fourier transform to the signal, filter with a Mel filterbank, and take the logarithm to obtain the log spectrum;
3) compute the cepstrum with the discrete cosine transform, apply weighting and cepstral mean subtraction, and compute the difference coefficients;
4) train Gaussian mixture models (GMM) to obtain a model for each emotion;
5) use the trained emotion models to recognize the test data and obtain the recognition accuracy.
For two-class emotion classification, such as negative versus neutral emotion, relatively good accuracy has already been achieved. For multi-class emotion classification, however, the imbalance of the data and the fact that only a single factor (frequency or time) is considered make the extracted features poorly discriminative, so the classification accuracy is relatively low, which limits the application of speech-based emotion recognition systems.
Summary of the invention
In traditional speech emotion recognition, feature extraction considers only a single factor, such as frequency or time, which leaves the features poorly discriminative. To address this problem, the present invention proposes a speech emotional feature extraction method that considers multilinear group sparsity in speech and, when used for speech emotion recognition, improves the accuracy of multi-class emotion recognition.
The emotional feature extraction method of the present invention, which considers multilinear group sparsity in speech, is as follows:
The multiple factors contained in a speech signal, namely time, frequency, scale, and direction information, are considered, and feature extraction is performed by a multilinear group sparse decomposition method: a multilinear representation of the speech energy spectrum is constructed through Gabor functions of different scales and directions; the feature projection matrices are solved by a group sparse tensor decomposition method; the feature projection on the frequency mode is computed; the features are decorrelated through the discrete cosine transform; and the first-order and second-order difference coefficients of the features are obtained. The method specifically comprises the following steps:
(1) Acquire a speech signal s(t) (through a device such as a microphone), transform s(t) to the time-frequency domain by the short-time Fourier transform, and obtain the time-frequency representation S(f, t) and the energy spectrum P(f, t);
(2) Convolve the energy spectrum with two-dimensional Gabor functions of different scales and directions. The Gabor function is defined as:

$g_{\bar{k}}(\bar{x}) = \frac{\|\bar{k}\|^2}{\sigma^2} e^{-\|\bar{k}\|^2 \|\bar{x}\|^2 / (2\sigma^2)} \left[ e^{j \bar{k} \cdot \bar{x}} - e^{-\sigma^2/2} \right]$,

where $\bar{x} = (t, f)$ denotes the position at frame t and frequency f in the energy spectrum P(f, t); $\bar{k} = (k_v \cos\varphi, k_v \sin\varphi)$ is the vector controlling the scale and direction of the function; j denotes the imaginary unit; $k_v = 2^{-(v+2)/2}\pi$ and $\varphi = u\pi/K$, with u indexing the direction of the function, v its scale, and K the total number of directions; $\sigma$ is a constant determining the function envelope, set to $2\pi$.
The result of the convolution of the Gabor functions with the energy spectrum P(f, t) is the multilinear representation $\bar{\mathcal{P}}_G$ of the speech signal, a 5th-order tensor whose modes represent time, frequency, direction, scale, and category, respectively; filtering the frequency mode of $\bar{\mathcal{P}}_G$ with a Mel filterbank then yields a new 5th-order tensor $\bar{\mathcal{P}}$ of size $N_1 \times N_2 \times N_3 \times N_4 \times N_5$, the length of mode i being $N_i$, $i = 1, \ldots, 5$;
(3) Apply group sparse tensor decomposition to the multilinear representation $\bar{\mathcal{P}}$ and compute the projection matrices $U^{(i)}$, $i = 1, \ldots, 5$, on the different factors for the subsequent feature projection. The following decomposition model is established:

$\bar{\mathcal{P}} \approx \bar{\Lambda} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \times_4 U^{(4)} \times_5 U^{(5)}$

where $U^{(i)}$ is the projection matrix of size $N_i \times K$ obtained from the decomposition; $\bar{\Lambda}$ is a 5th-order tensor of size $K \times K \times K \times K \times K$ whose diagonal elements are 1; and $\times_i$ denotes the mode-i product of a tensor and a matrix, defined as:

$(\bar{\mathcal{X}} \times_i A)_{n_1, \ldots, n_{i-1}, k, n_{i+1}, \ldots, n_M} = \sum_{n_i} \bar{\mathcal{X}}_{n_1, \ldots, n_M} A_{k, n_i}$

where $\bar{\mathcal{X}}$ denotes an M-th order tensor of size $N_1 \times \cdots \times N_M$, A is a matrix of size $K \times N_i$, $\bar{\mathcal{X}}_{n_1, \ldots, n_M}$ is an element of the tensor $\bar{\mathcal{X}}$, and $A_{k, n_i}$ is an element of the matrix A (a numpy sketch of this mode product follows);
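For illustration, a minimal numpy sketch of the mode-i product defined above; the helper name mode_product and the example sizes are assumptions introduced here, not part of the patent.

import numpy as np

def mode_product(X, A, i):
    """Compute X x_i A: contract mode i of tensor X (length N_i) with a
    matrix A of shape (K, N_i), so that mode i of the result has length K."""
    Xi = np.moveaxis(X, i, 0)                    # bring mode i to the front
    out = np.tensordot(A, Xi, axes=([1], [0]))   # sum over n_i
    return np.moveaxis(out, 0, i)                # restore the mode order

# Example: project the frequency mode (i = 1) of a random 5th-order tensor.
X = np.random.rand(10, 36, 4, 4, 2)   # time x frequency x direction x scale x category
A = np.random.rand(12, 36)            # K = 12 projection directions
print(mode_product(X, A, 1).shape)    # (10, 12, 4, 4, 2)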
The projection matrices $U^{(i)}$, $i = 1, \ldots, I$, are computed by the following concrete decomposition procedure, where i indexes the modes (corresponding to the different factors) and I = 5 (a numpy sketch of this update loop follows the steps):

1. Initialize $U^{(i)} \geq 0$, $i = 1, \ldots, I$, by alternating least squares or at random;

2. Normalize each column vector $u_k^{(i)}$, $i = 1, \ldots, I$, $k = 1, \ldots, K$, of the projection matrices;

3. While the error objective function

$\bar{E} = \frac{1}{2} \left\| \bar{\mathcal{P}} - \sum_{k=1}^{K} u_k^{(1)} \circ u_k^{(2)} \circ \cdots \circ u_k^{(I)} \right\|_F^2$

is greater than a given threshold, repeat the following operations:

● for i = 1 to I, update in turn:

$u_k^{(i)} \leftarrow \frac{\| u_k^{(i)} \|_F}{\gamma_k^{(i)} \| u_k^{(i)} \|_F + \lambda_k q_i} \left[ P_{(i)}^{(k)} \{u_k\}_{\odot}^{-i} \right]_+$,

where $\| \cdot \|_F$ denotes the Frobenius norm; $P_{(i)}^{(k)}$ is the mode-i matrix unfolding of the tensor $\bar{\mathcal{P}}^{(k)} = \bar{\mathcal{P}} - \sum_{j=1, j \neq k}^{K} u_j^{(1)} \circ u_j^{(2)} \circ \cdots \circ u_j^{(I)}$; $\{u_k\}_{\odot}^{-i} = u_k^{(I)} \odot \cdots \odot u_k^{(i+1)} \odot u_k^{(i-1)} \odot \cdots \odot u_k^{(1)}$, with $\odot$ the Khatri-Rao product of matrices; and $\lambda_k$ and $q_i$ are weight coefficients between 0 and 1 that regulate the sparsity of the objective function terms;

● if $i \neq 5$, $\gamma_k^{(i)} = u_k^{(I)T} u_k^{(I)}$; if $i = 5$, $\gamma_k^{(i)}$ is given by the corresponding inner product over the remaining factor;

4. When the objective function $\bar{E}$ falls below the threshold, end the loop; the projection matrices $U^{(i)}$, $i = 1, \ldots, I$, are obtained;
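The following is a minimal numpy sketch of this decomposition loop as reconstructed above; the random nonnegative initialization, the fixed lambda and q sparsity weights, the stopping threshold, and the use of the inner product of the contracted factors for gamma are illustrative assumptions rather than the patent's exact settings.

import numpy as np
from functools import reduce

def unfold(T, mode):
    """Mode-i matrix unfolding of a tensor."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def rank1(vectors):
    """Outer product u^(1) o u^(2) o ... o u^(I)."""
    return reduce(np.multiply.outer, vectors)

def group_sparse_cp(P, K, lam=0.1, q=0.1, tol=1e-3, max_iter=200):
    I = P.ndim
    U = [np.abs(np.random.rand(n, K)) for n in P.shape]   # step 1: nonnegative init
    for _ in range(max_iter):
        for k in range(K):
            # Residual tensor P^(k): all rank-1 terms except the k-th removed.
            Pk = P - sum(rank1([U[j][:, r] for j in range(I)])
                         for r in range(K) if r != k)
            for i in range(I):
                # Contraction {u_k}^(-i): product over all modes except i.
                others = rank1([U[j][:, k] for j in range(I) if j != i]).ravel()
                num = np.maximum(unfold(Pk, i) @ others, 0.0)   # [ . ]_+
                u = U[i][:, k]
                gamma = others @ others
                U[i][:, k] = (np.linalg.norm(u) * num /
                              (gamma * np.linalg.norm(u) + lam * q + 1e-12))
        E = 0.5 * np.linalg.norm(
            P - sum(rank1([U[j][:, k] for j in range(I)]) for k in range(K))) ** 2
        if E < tol:          # step 4: stop once the objective is small enough
            break
    return U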
(4) Use the projection matrix $U^{(2)}$ corresponding to the frequency domain to perform the feature projection on the multilinear representation $\bar{\mathcal{P}}$ of the speech signal:

$\bar{\mathcal{S}} = \bar{\mathcal{P}} \times_2 U_+^{(2)}$

where $U_+^{(2)}$ is the matrix formed by the nonzero elements of the pseudoinverse of the projection matrix $U^{(2)}$, and $\times_2$ denotes the mode-2 tensor-matrix product of $U_+^{(2)}$ with $\bar{\mathcal{P}}$;
(5) Fix the time mode and apply the tensor unfolding operation to the multilinear sparse representation $\bar{\mathcal{S}}$, obtaining a feature matrix $S_{(f)}$ of size $N_1 \times \hat{N}_1$, where $\hat{N}_1 = N_2 \cdot N_3 \cdot N_4 \cdot N_5$;
(6) Apply the discrete cosine transform to $S_{(f)}$ for decorrelation to obtain the speech emotional feature F, and compute the first-order and second-order difference coefficients of the feature to obtain the final emotional features (a sketch of steps (4) to (6) follows).
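A minimal numpy/scipy sketch of steps (4) to (6), reusing the mode_product helper from the sketch above; taking the positive part of the pseudoinverse as the matrix of its nonzero elements and using np.gradient for the difference coefficients are illustrative assumptions.

import numpy as np
from scipy.fftpack import dct

def extract_emotional_features(P, U2):
    """P: 5th-order tensor (time, freq, direction, scale, category);
    U2: frequency projection matrix of shape (N2, K)."""
    U2_pinv = np.linalg.pinv(U2)                   # pseudoinverse, shape (K, N2)
    U2_plus = np.where(U2_pinv > 0, U2_pinv, 0.0)  # keep its nonzero (positive) part
    S = mode_product(P, U2_plus, 1)                # step (4): frequency-mode projection
    S_f = S.reshape(S.shape[0], -1)                # step (5): unfold with time mode fixed
    F = dct(S_f, type=2, norm='ortho', axis=1)     # step (6): DCT decorrelation
    d1 = np.gradient(F, axis=0)                    # first-order difference coefficients
    d2 = np.gradient(d1, axis=0)                   # second-order difference coefficients
    return np.hstack([F, d1, d2])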
The present invention considers the time, frequency, scale, and direction factors in the speech signal for emotional feature extraction and performs the feature projection by a group sparse tensor decomposition method, thereby improving the accuracy of multi-class speech emotion recognition.
Description of drawings
Fig. 1 is a schematic block diagram of a traditional speech emotion recognition process;
Fig. 2 is a schematic diagram of the feature extraction method of the present invention;
Fig. 3 is a schematic block diagram of a speech emotion recognition process employing the present invention;
Fig. 4 is a comparison chart of experimental results for four-class speech emotion recognition.
Embodiment
As shown in Fig. 2, the speech emotion feature extraction method of the present invention based on multilinear group sparsity specifically comprises the following steps:
(1) Collect a speech signal s(t) through a device such as a microphone, transform s(t) to the time-frequency domain by the short-time Fourier transform, and obtain the time-frequency representation S(f, t) and the energy spectrum P(f, t) (a minimal sketch follows);
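A minimal scipy sketch of this step; the 8 kHz sampling rate, Hamming window, and 23 ms window with 10 ms shift mirror the experiment described below, and the helper name energy_spectrum is an assumption.

import numpy as np
from scipy.signal import stft

def energy_spectrum(s, fs=8000):
    """Return the time-frequency representation S(f, t) and the
    energy spectrum P(f, t) = |S(f, t)|^2 of the signal s(t)."""
    win = int(0.023 * fs)                         # 23 ms window
    hop = int(0.010 * fs)                         # 10 ms shift
    f, t, S = stft(s, fs=fs, window='hamming', nperseg=win, noverlap=win - hop)
    return S, np.abs(S) ** 2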
(2) Convolve the energy spectrum with two-dimensional Gabor functions of different scales and directions to obtain the multilinear representation $\bar{\mathcal{P}}_G$ of the speech signal, and then filter the frequency mode of $\bar{\mathcal{P}}_G$ with a Mel filterbank to obtain the representation $\bar{\mathcal{P}}$ (a sketch of the Gabor filterbank follows this step).

The Gabor function is defined as:

$g_{\bar{k}}(\bar{x}) = \frac{\|\bar{k}\|^2}{\sigma^2} e^{-\|\bar{k}\|^2 \|\bar{x}\|^2 / (2\sigma^2)} \left[ e^{j \bar{k} \cdot \bar{x}} - e^{-\sigma^2/2} \right]$,

where $\bar{x} = (t, f)$ denotes the position at frame t and frequency f in the energy spectrum P(f, t); $\bar{k} = (k_v \cos\varphi, k_v \sin\varphi)$ is the vector controlling the scale and direction of the function; j denotes the imaginary unit; $k_v = 2^{-(v+2)/2}\pi$ and $\varphi = u\pi/K$, with u indexing the direction of the function, v its scale, and K the total number of directions; $\sigma$ is a constant determining the function envelope, set to $2\pi$.

The result of the convolution of the Gabor functions with the energy spectrum P(f, t) is the multilinear representation $\bar{\mathcal{P}}_G$ of the speech signal, a 5th-order tensor whose modes represent time, frequency, direction, scale, and category, respectively; filtering the frequency mode of $\bar{\mathcal{P}}_G$ with the Mel filterbank then yields a new 5th-order tensor $\bar{\mathcal{P}}$ of size $N_1 \times N_2 \times N_3 \times N_4 \times N_5$, the length of mode i being $N_i$, $i = 1, \ldots, 5$;
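A minimal numpy sketch of the Gabor filterbank in this step, following the definition above with sigma = 2π and with 4 scales and 4 directions as in the experiment below; the kernel support size and the use of scipy.signal.fftconvolve are illustrative assumptions. Stacking the responses of all utterances by emotion category then supplies the fifth (category) mode of the tensor.

import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(u, v, K=4, size=15, sigma=2 * np.pi):
    """Gabor function g_k(x) for direction index u and scale index v."""
    k_v, phi = 2.0 ** (-(v + 2) / 2.0) * np.pi, u * np.pi / K
    kx, ky = k_v * np.cos(phi), k_v * np.sin(phi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    k2, x2 = kx ** 2 + ky ** 2, x ** 2 + y ** 2
    return (k2 / sigma ** 2) * np.exp(-k2 * x2 / (2 * sigma ** 2)) * \
           (np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2))

def gabor_representation(P, n_dirs=4, n_scales=4):
    """Convolve the energy spectrum P(f, t) with every Gabor kernel and stack
    the magnitudes into a (time, frequency, direction, scale) array."""
    out = np.empty(P.shape[::-1] + (n_dirs, n_scales))
    for u in range(n_dirs):
        for v in range(n_scales):
            G = fftconvolve(P, gabor_kernel(u, v, K=n_dirs), mode='same')
            out[:, :, u, v] = np.abs(G).T        # transpose to (time, frequency)
    return out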
(3) Apply group sparse tensor decomposition to the representation $\bar{\mathcal{P}}$ and compute the projection matrices $U^{(i)}$, $i = 1, \ldots, 5$, on the different factors for the feature projection. The following decomposition model is established:

$\bar{\mathcal{P}} \approx \bar{\Lambda} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \times_4 U^{(4)} \times_5 U^{(5)}$

where $U^{(i)}$ is the projection matrix of size $N_i \times K$ obtained from the decomposition; $\bar{\Lambda}$ is a 5th-order tensor of size $K \times K \times K \times K \times K$ whose diagonal elements are 1; and $\times_i$ denotes the mode-i product of a tensor and a matrix, defined as:

$(\bar{\mathcal{X}} \times_i A)_{n_1, \ldots, n_{i-1}, k, n_{i+1}, \ldots, n_M} = \sum_{n_i} \bar{\mathcal{X}}_{n_1, \ldots, n_M} A_{k, n_i}$

where $\bar{\mathcal{X}}$ denotes an M-th order tensor of size $N_1 \times \cdots \times N_M$, A is a matrix of size $K \times N_i$, $\bar{\mathcal{X}}_{n_1, \ldots, n_M}$ is an element of the tensor $\bar{\mathcal{X}}$, and $A_{k, n_i}$ is an element of the matrix A.

To compute the projection matrices $U^{(i)}$, $i = 1, \ldots, I$, with I = 5, the concrete decomposition procedure is as follows:

A) initialize $U^{(i)} \geq 0$, $i = 1, \ldots, I$, by alternating least squares or at random;

B) normalize each column vector $u_k^{(i)}$, $i = 1, \ldots, I$, $k = 1, \ldots, K$, of the projection matrices;

C) while the error objective function

$\bar{E} = \frac{1}{2} \left\| \bar{\mathcal{P}} - \sum_{k=1}^{K} u_k^{(1)} \circ u_k^{(2)} \circ \cdots \circ u_k^{(I)} \right\|_F^2$

is greater than a given threshold, repeat the following operations:

● for i = 1 to I, update in turn

$u_k^{(i)} \leftarrow \frac{\| u_k^{(i)} \|_F}{\gamma_k^{(i)} \| u_k^{(i)} \|_F + \lambda_k q_i} \left[ P_{(i)}^{(k)} \{u_k\}_{\odot}^{-i} \right]_+$,

where $\| \cdot \|_F$ denotes the Frobenius norm; $P_{(i)}^{(k)}$ is the mode-i matrix unfolding of the tensor $\bar{\mathcal{P}}^{(k)} = \bar{\mathcal{P}} - \sum_{j=1, j \neq k}^{K} u_j^{(1)} \circ u_j^{(2)} \circ \cdots \circ u_j^{(I)}$; $\{u_k\}_{\odot}^{-i} = u_k^{(I)} \odot \cdots \odot u_k^{(i+1)} \odot u_k^{(i-1)} \odot \cdots \odot u_k^{(1)}$, with $\odot$ the Khatri-Rao product of matrices; and $\lambda_k$ and $q_i$ are weight coefficients between 0 and 1 that regulate the sparsity of the objective function terms;

● if $i \neq 5$, $\gamma_k^{(i)} = u_k^{(I)T} u_k^{(I)}$; if $i = 5$, $\gamma_k^{(i)}$ is given by the corresponding inner product over the remaining factor;

D) when the objective function $\bar{E}$ falls below the threshold, end the loop; the projection matrices $U^{(i)}$, $i = 1, \ldots, I$, are obtained;
(4) Use the projection matrix $U^{(2)}$ corresponding to the frequency domain to perform the feature projection on the multilinear representation $\bar{\mathcal{P}}$ of the speech signal:

$\bar{\mathcal{S}} = \bar{\mathcal{P}} \times_2 U_+^{(2)}$

where $U_+^{(2)}$ is the matrix formed by the nonzero elements of the pseudoinverse of the projection matrix $U^{(2)}$, and $\times_2$ denotes the mode-2 tensor-matrix product of $U_+^{(2)}$ with $\bar{\mathcal{P}}$;
(5) Fix the time mode and apply the tensor unfolding operation to the multilinear sparse representation $\bar{\mathcal{S}}$, obtaining a feature matrix $S_{(f)}$ of size $N_1 \times \hat{N}_1$, where $\hat{N}_1 = N_2 \cdot N_3 \cdot N_4 \cdot N_5$;
(6) Apply the discrete cosine transform to $S_{(f)}$ for decorrelation to obtain the speech emotional feature F, and compute the first-order and second-order difference coefficients of the feature to obtain the final emotional features.
As shown in Fig. 3, the process of speech emotion recognition using the above feature extraction method comprises the following steps (a GMM modeling sketch follows the list):
1) Obtain speech signal data $s_l(t)$, $l = 1, \ldots, L$, carrying different emotion labels, with J emotion classes in total;
2) Extract the features $F_l$ of the different emotions using the feature extraction method shown in Fig. 2;
3) Model the different emotional features with Gaussian mixture models (GMM); through learning and training, obtain the emotion model $M_j$ corresponding to the j-th emotion class, $j = 1, \ldots, J$;
4) When a speech signal of unknown emotion type is to be tested, use the emotion models $M_j$, $j = 1, \ldots, J$, established with the GMM to compute the maximum a posteriori probability of each class in turn; the class with the maximum probability is the emotion recognition result of this speech signal.
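A minimal scikit-learn sketch of this training and testing procedure, with one GMM fit per emotion class on pooled feature frames and the class of highest log-likelihood returned at test time; the helper names and the 8 mixture components are illustrative assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_emotion_models(features_by_class, n_components=8):
    """Step 3): fit one GMM M_j on the pooled features of each emotion class."""
    return {label: GaussianMixture(n_components=n_components).fit(np.vstack(feats))
            for label, feats in features_by_class.items()}

def classify(feature_frames, models):
    """Step 4): score the utterance under every model; return the best class."""
    return max(models, key=lambda label: models[label].score(feature_frames))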
The effect of the present invention can be further illustrated by experiment.
The experiment tested the recognition performance of the proposed feature extraction method on the FAU Aibo dataset, recognizing 4 emotion classes (Anger, Emphatic, Neutral, Rest). The sampling rate of the speech signals in this experiment was 8 kHz; Hamming windowing was applied with a 23 ms window length and a 10 ms window shift; the energy spectrum of the signal was computed by the short-time Fourier transform; Gabor functions with 4 different scales and 4 different directions performed time-frequency convolutional filtering of the energy spectrum; a Mel filterbank of size 36 was used to compute the Mel power spectrum; the projection matrix performed the feature projection on the frequency mode; and the DCT decorrelated the features.
Fig. 4 compares the recognition performance of the proposed method with that of existing feature extraction techniques (MFCC and LFPC features). The final recognition accuracies show that the present invention effectively improves the accuracy of multi-class speech emotion recognition: by 6.1% compared with the traditional MFCC method and by 5.8% compared with the LFPC method.

Claims (2)

1. A speech emotional feature extraction method considering multilinear group sparsity in speech, characterized in that:
the multiple factors contained in a speech signal, namely time, frequency, scale, and direction information, are considered; feature extraction is performed by a multilinear group sparse decomposition method; a multilinear representation of the speech energy spectrum is constructed through Gabor functions of different scales and directions; the feature projection matrices are solved by a group sparse tensor decomposition method; the feature projection on the frequency mode is computed; the features are decorrelated through the discrete cosine transform; and the first-order and second-order difference coefficients of the features are computed; the method specifically comprises the following steps:
(1) acquiring a speech signal s(t), transforming s(t) to the time-frequency domain by the short-time Fourier transform, and obtaining the time-frequency representation S(f, t) and the energy spectrum P(f, t);
(2) convolving the energy spectrum with two-dimensional Gabor functions of different scales and directions, the Gabor function being defined as:

$g_{\bar{k}}(\bar{x}) = \frac{\|\bar{k}\|^2}{\sigma^2} e^{-\|\bar{k}\|^2 \|\bar{x}\|^2 / (2\sigma^2)} \left[ e^{j \bar{k} \cdot \bar{x}} - e^{-\sigma^2/2} \right]$,

where $\bar{x} = (t, f)$ denotes the position at frame t and frequency f in the energy spectrum P(f, t); $\bar{k} = (k_v \cos\varphi, k_v \sin\varphi)$ is the vector controlling the scale and direction of the function; j denotes the imaginary unit; $k_v = 2^{-(v+2)/2}\pi$ and $\varphi = u\pi/K$, with u indexing the direction of the function, v its scale, and K the total number of directions; $\sigma$ is a constant determining the function envelope, set to $2\pi$;
the result of the convolution of the Gabor functions with the energy spectrum P(f, t) is the multilinear representation $\bar{\mathcal{P}}_G$ of the speech signal, a 5th-order tensor whose modes represent time, frequency, direction, scale, and category, respectively; filtering the frequency mode of $\bar{\mathcal{P}}_G$ with a Mel filterbank then yields a new 5th-order tensor $\bar{\mathcal{P}}$ of size $N_1 \times N_2 \times N_3 \times N_4 \times N_5$, the length of mode i being $N_i$, $i = 1, \ldots, 5$;
(3) applying group sparse tensor decomposition to the multilinear representation $\bar{\mathcal{P}}$ and computing the projection matrices $U^{(i)}$, $i = 1, \ldots, 5$, on the different factors for the feature projection, by establishing the following decomposition model:

$\bar{\mathcal{P}} \approx \bar{\Lambda} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \times_4 U^{(4)} \times_5 U^{(5)}$

where $U^{(i)}$ is the projection matrix of size $N_i \times K$ obtained from the decomposition, $\bar{\Lambda}$ is a 5th-order tensor of size $K \times K \times K \times K \times K$ whose diagonal elements are 1, and $\times_i$ denotes the mode-i product of a tensor and a matrix, defined as:

$(\bar{\mathcal{X}} \times_i A)_{n_1, \ldots, n_{i-1}, k, n_{i+1}, \ldots, n_M} = \sum_{n_i} \bar{\mathcal{X}}_{n_1, \ldots, n_M} A_{k, n_i}$

where $\bar{\mathcal{X}}$ denotes an M-th order tensor of size $N_1 \times \cdots \times N_M$, A is a matrix of size $K \times N_i$, $\bar{\mathcal{X}}_{n_1, \ldots, n_M}$ is an element of the tensor $\bar{\mathcal{X}}$, and $A_{k, n_i}$ is an element of the matrix A;
(4) using the projection matrix $U^{(2)}$ corresponding to the frequency domain to perform the feature projection on the multilinear representation $\bar{\mathcal{P}}$ of the speech signal:

$\bar{\mathcal{S}} = \bar{\mathcal{P}} \times_2 U_+^{(2)}$

where $U_+^{(2)}$ is the matrix formed by the nonzero elements of the pseudoinverse of the projection matrix $U^{(2)}$, and $\times_2$ denotes the mode-2 tensor-matrix product of $U_+^{(2)}$ with $\bar{\mathcal{P}}$;
(5) fixing the time mode and applying the tensor unfolding operation to the multilinear sparse representation $\bar{\mathcal{S}}$, obtaining a feature matrix $S_{(f)}$ of size $N_1 \times \hat{N}_1$, where $\hat{N}_1 = N_2 \cdot N_3 \cdot N_4 \cdot N_5$;
(6) applying the discrete cosine transform to $S_{(f)}$ for decorrelation to obtain the speech emotional feature F, and computing the first-order and second-order difference coefficients of the feature to obtain the final emotional features.
2. The speech emotional feature extraction method based on multilinear group sparsity according to claim 1, characterized in that the concrete decomposition procedure for computing the projection matrices $U^{(i)}$, $i = 1, \ldots, I$, where i indexes the modes (corresponding to the different factors) and I = 5, is as follows:

1. initialize $U^{(i)} \geq 0$, $i = 1, \ldots, I$, by alternating least squares or at random;

2. normalize each column vector $u_k^{(i)}$, $i = 1, \ldots, I$, $k = 1, \ldots, K$, of the projection matrices;

3. while the error objective function

$\bar{E} = \frac{1}{2} \left\| \bar{\mathcal{P}} - \sum_{k=1}^{K} u_k^{(1)} \circ u_k^{(2)} \circ \cdots \circ u_k^{(I)} \right\|_F^2$

is greater than a given threshold, repeat the following operations:

● for i = 1 to I, update in turn:

$u_k^{(i)} \leftarrow \frac{\| u_k^{(i)} \|_F}{\gamma_k^{(i)} \| u_k^{(i)} \|_F + \lambda_k q_i} \left[ P_{(i)}^{(k)} \{u_k\}_{\odot}^{-i} \right]_+$,

where $\| \cdot \|_F$ denotes the Frobenius norm; $P_{(i)}^{(k)}$ is the mode-i matrix unfolding of the tensor $\bar{\mathcal{P}}^{(k)} = \bar{\mathcal{P}} - \sum_{j=1, j \neq k}^{K} u_j^{(1)} \circ u_j^{(2)} \circ \cdots \circ u_j^{(I)}$; $\{u_k\}_{\odot}^{-i} = u_k^{(I)} \odot \cdots \odot u_k^{(i+1)} \odot u_k^{(i-1)} \odot \cdots \odot u_k^{(1)}$, with $\odot$ the Khatri-Rao product of matrices; and $\lambda_k$ and $q_i$ are weight coefficients between 0 and 1 that regulate the sparsity of the objective function terms;

● if $i \neq 5$, $\gamma_k^{(i)} = u_k^{(I)T} u_k^{(I)}$; if $i = 5$, $\gamma_k^{(i)}$ is given by the corresponding inner product over the remaining factor;

4. when the objective function $\bar{E}$ falls below the threshold, end the loop; the projection matrices $U^{(i)}$, $i = 1, \ldots, I$, are obtained.
CN201210091525.1A 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech Expired - Fee Related CN102592593B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210091525.1A CN102592593B (en) 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210091525.1A CN102592593B (en) 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech

Publications (2)

Publication Number Publication Date
CN102592593A true CN102592593A (en) 2012-07-18
CN102592593B CN102592593B (en) 2014-01-01

Family

ID=46481134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210091525.1A Expired - Fee Related CN102592593B (en) 2012-03-31 2012-03-31 Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech

Country Status (1)

Country Link
CN (1) CN102592593B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102833918A (en) * 2012-08-30 2012-12-19 四川长虹电器股份有限公司 Emotional recognition-based intelligent illumination interactive method
CN103245376A (en) * 2013-04-10 2013-08-14 中国科学院上海微系统与信息技术研究所 Weak signal target detection method
CN103531199A (en) * 2013-10-11 2014-01-22 福州大学 Ecological sound identification method on basis of rapid sparse decomposition and deep learning
CN103531206A (en) * 2013-09-30 2014-01-22 华南理工大学 Voice affective characteristic extraction method capable of combining local information and global information
CN103825678A (en) * 2014-03-06 2014-05-28 重庆邮电大学 Three-dimensional multi-user multi-input and multi-output (3D MU-MIMO) precoding method based on Khatri-Rao product
CN105047194A (en) * 2015-07-28 2015-11-11 东南大学 Self-learning spectrogram feature extraction method for speech emotion recognition
CN107886942A * 2017-10-31 2018-04-06 东南大学 Voice signal emotion recognition method based on local punishment random spectral regression
CN109060371A * 2018-07-04 2018-12-21 深圳万发创新进出口贸易有限公司 Abnormal sound detection device for auto parts and components

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020135618A1 (en) * 2001-02-05 2002-09-26 International Business Machines Corporation System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
CN101030316A (en) * 2007-04-17 2007-09-05 北京中星微电子有限公司 Safety driving monitoring system and method for vehicle
CN101404060A (en) * 2008-11-10 2009-04-08 北京航空航天大学 Human face recognition method based on visible light and near-infrared Gabor information amalgamation
US20110034176A1 (en) * 2009-05-01 2011-02-10 Lord John D Methods and Systems for Content Processing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020135618A1 (en) * 2001-02-05 2002-09-26 International Business Machines Corporation System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
CN101030316A (en) * 2007-04-17 2007-09-05 北京中星微电子有限公司 Safety driving monitoring system and method for vehicle
CN101404060A (en) * 2008-11-10 2009-04-08 北京航空航天大学 Human face recognition method based on visible light and near-infrared Gabor information amalgamation
US20110034176A1 (en) * 2009-05-01 2011-02-10 Lord John D Methods and Systems for Content Processing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAHMANE, MOHAMED; MEUNIER, JEAN: "Continuous Emotion Recognition Using Gabor Energy Filters", 《4TH BI-ANNUAL INTERNATIONAL CONFERENCE OF THE HUMAINE ASSOCIATION ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION》, 31 December 2011 (2011-12-31) *
MORALES-PEREZ,M. ET AL: "Feature extraction of speech signals in emotion identification", 《30TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE-ENGINEERING-IN-MEDICINE-AND-BIOLOGY-SOCIETY》, 31 December 2008 (2008-12-31) *
TU, BINBIN; YU, FENGQIN: "Bimodal Emotion Recognition Based on Speech Signals and Facial Expression", 《6TH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS AND KNOWLEDGE ENGINEERING》, 31 December 2011 (2011-12-31) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102833918B (en) * 2012-08-30 2015-07-15 四川长虹电器股份有限公司 Emotional recognition-based intelligent illumination interactive method
CN102833918A (en) * 2012-08-30 2012-12-19 四川长虹电器股份有限公司 Emotional recognition-based intelligent illumination interactive method
CN103245376A (en) * 2013-04-10 2013-08-14 中国科学院上海微系统与信息技术研究所 Weak signal target detection method
CN103245376B * 2013-04-10 2016-01-20 中国科学院上海微系统与信息技术研究所 Weak signal target detection method
CN103531206A (en) * 2013-09-30 2014-01-22 华南理工大学 Voice affective characteristic extraction method capable of combining local information and global information
CN103531206B * 2013-09-30 2017-09-29 华南理工大学 Voice affective characteristic extraction method combining local and global information
CN103531199B * 2013-10-11 2016-03-09 福州大学 Ecological sound identification method based on rapid sparse decomposition and deep learning
CN103531199A (en) * 2013-10-11 2014-01-22 福州大学 Ecological sound identification method on basis of rapid sparse decomposition and deep learning
CN103825678B * 2014-03-06 2017-03-08 重庆邮电大学 Three-dimensional multi-user multi-input multi-output (3D MU-MIMO) precoding method based on the Khatri-Rao product
CN103825678A (en) * 2014-03-06 2014-05-28 重庆邮电大学 Three-dimensional multi-user multi-input and multi-output (3D MU-MIMO) precoding method based on Khatri-Rao product
CN105047194A (en) * 2015-07-28 2015-11-11 东南大学 Self-learning spectrogram feature extraction method for speech emotion recognition
CN105047194B * 2015-07-28 2018-08-28 东南大学 Self-learning spectrogram feature extraction method for speech emotion recognition
CN107886942A * 2017-10-31 2018-04-06 东南大学 Voice signal emotion recognition method based on local punishment random spectral regression
CN107886942B (en) * 2017-10-31 2021-09-28 东南大学 Voice signal emotion recognition method based on local punishment random spectral regression
CN109060371A * 2018-07-04 2018-12-21 深圳万发创新进出口贸易有限公司 Abnormal sound detection device for auto parts and components

Also Published As

Publication number Publication date
CN102592593B (en) 2014-01-01

Similar Documents

Publication Publication Date Title
CN102592593B (en) Emotional-characteristic extraction method implemented through considering sparsity of multilinear group in speech
An et al. Deep CNNs with self-attention for speaker identification
CN106057212B (en) Driving fatigue detection method based on voice personal characteristics and model adaptation
Zhang et al. Robust sound event recognition using convolutional neural networks
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
CN102142253B (en) Voice emotion identification equipment and method
CN101923855A Text-independent voiceprint identification system
Jancovic et al. Bird species recognition using unsupervised modeling of individual vocalization elements
CN103985381B Audio indexing method based on parameter fusion and optimal decision-making
CN112259106A (en) Voiceprint recognition method and device, storage medium and computer equipment
CN105895078A Speech recognition method and device for dynamically selecting speech model
CN101930735A (en) Speech emotion recognition equipment and speech emotion recognition method
CN102723079B (en) Music and chord automatic identification method based on sparse representation
CN110222841A (en) Neural network training method and device based on spacing loss function
CN104978507A (en) Intelligent well logging evaluation expert system identity authentication method based on voiceprint recognition
CN103456302A (en) Emotion speaker recognition method based on emotion GMM model weight synthesis
CN105702251A (en) Speech emotion identifying method based on Top-k enhanced audio bag-of-word model
CN101419799A Speaker identification method based on mixed t model
CN103578480B Speech emotion recognition method based on context correction in negative emotion detection
Ranjard et al. Integration over song classification replicates: Song variant analysis in the hihi
Praksah et al. Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier
CN106448660A (en) Natural language fuzzy boundary determining method with introduction of big data analysis
CN105006231A (en) Distributed large population speaker recognition method based on fuzzy clustering decision tree
Pan et al. Robust Speech Recognition by DHMM with A Codebook Trained by Genetic Algorithm.
Li et al. Feature extraction with convolutional restricted boltzmann machine for audio classification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140101

Termination date: 20170331