WO2006000103A1 - Spiking neural network and use thereof - Google Patents

Spiking neural network and use thereof Download PDF

Info

Publication number
WO2006000103A1
Authority
WO
WIPO (PCT)
Prior art keywords
neurons
recited
layer
neuron
layers
Prior art date
Application number
PCT/CA2005/001018
Other languages
French (fr)
Inventor
Jean Rouat
Ramin Pichevar
Original Assignee
Universite De Sherbrooke
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universite De Sherbrooke filed Critical Universite De Sherbrooke
Publication of WO2006000103A1 publication Critical patent/WO2006000103A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06V 10/7515 Shifting the patterns to accommodate for positional errors
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Definitions

  • the present invention relates to neural networks. More specifically, the present invention is concerned with a spiking neural network and its use in pattern recognition and in monophonic source separation.
  • Pattern recognition is an aspect of the field of artificial intelligence aiming at providing perceptions to "intelligent" systems, such as robots, programmable controllers, speech recognition systems, artificial vision systems, etc.
  • In pattern recognition, comparison criteria, similarities between shapes, and distances must be computed in order to answer questions such as: "Are these objects similar?"; "Has the system already identified this form?"; "Is this pattern different enough from the other patterns already identified by the system?"; "Is this form to be remembered?"; etc.
  • In a nutshell, pattern recognition systems must use performance and comparison criteria usually assimilated as distances.
  • The term "distance" should be construed as a probability, an error, or a score: a value that can be assimilated to a distance. This type of criterion is widely used, for example: in any rule-based expert system; in statistical Markovian systems; in second-generation (formal) neural network systems; etc.
  • Comparing N signals would require two steps: 1. compute a distance on each pair of signals; and 2. find similar signals by sorting and comparing distances.
  • Any distance between objects can be represented by: more or less similar spike timings between neurons; or a single spike issued by a neuron, resulting from a specific input sequence of spikes. The latter process is called "spike order coding", and is characterized by the existence of couples of excitatory/inhibitory neurons providing recognition of incoming spike sequences from other neurons, after the spike has been generated by the neuron.
  • Synchronization coding occurs when two groups of neurons appear spontaneously because of the plasticity of the neurons' interconnections. Thus, two neurons having similar inputs present a growth of their mutual synaptic connections, causing their outputs to be synchronous. Otherwise, when the neurons' inputs are not similar, their mutual synaptic connections decrease, causing them to be desynchronized. In fact, the inputs of two neurons spiking simultaneously are relatively correlated.
  • separation of mixed signals is an important problem with many applications in the context of audio processing. It can be used, for example, to assist a robot in segregating multiple speakers, to ease the automatic transcription of video via the audio tracks, to separate musical instruments before automatic transcription, to clean the signal before performing speech recognition, etc.
  • the ideal instrumental setup is based on the use of an array of microphones during recording to obtain many audio channels. In that situation, very good separation can be obtained between noise and the signal of interest (see [29], [33], and [50]), and experiments with great improvements have been reported in speech recognition [4], [64]. Further applications have been ported to mobile robots [66], [71], [72] and have also been developed to track multiple speakers [58].
  • the source separation process implies segregation and/or fusion (integration), usually based on correlation, statistical estimation, binding, etc. of features extracted by the analysis module.
  • Monophonic source separation systems can be seen as comprising two main stages: i) signal analysis to yield a representation suitable for the second stage; and ii) clustering with segregation.
  • bottom-up processing corresponds to primitive processing
  • top-down processing means schema-based processing [15].
  • the auditory cues proposed by Bregman [15] for simple tones are not applicable directly to complex sounds. More sophisticated cues based on different auditory maps are thus desirable.
  • Ellis [13] uses sinusoidal tracks created by the interpolation of the spectral peaks of the output of a cochlear filter bank, while Mellinger's model [41] uses partials.
  • a partial is formed if an activity on the onset maps (the beginning of an energy burst) coincides with an energy local minimum of the spectral maps.
  • Cooke [9] introduced the harmony strands, which are the counterpart of Mellinger's cues in speech.
  • the integration and segregation of streams is done using Gestalt and Bregman's heuristics.
  • Berthommier and Meyer use Amplitude Modulation maps (see [4], [42], [49] and [63]).
  • Gaillard [32] uses a more conventional approach by using the first zero crossing for the detection of pitch and harmonic structures in the frequency-time map. Brown proposes an algorithm [17] based on the mutual exclusivity Gestalt principle.
  • Hu and Wang use a pitch tracking technique [26].
  • Wang and Brown [69] use correlograms in combination with bio-inspired neural networks.
  • Grossberg [37] proposes a neural architecture that implements Bregman's rules for simple sounds. Sameti [58] uses HMMs (Hidden Markov Models), while Roweis [57] and Reyes-Gomez [53] use Factorial HMMs. Jang and Lee [53] use a technique based on the Maximum a posteriori (MAP) criterion. Another probability-based CASA is proposed by Cooke [22].
  • Irino and Patterson [30] propose an auditory representation that is synchronous to the glottis and preserves fine temporal information, which makes possible the synchronous segregation of speech.
  • Harding and Meyer [22] use a model of multi-resolution with parallel high-resolution and low-resolution representations of the auditory signal. They propose an implementation for speech recognition.
  • Nix [45] performs a binaural statistical estimation of two speech sources by an approach that integrates temporal and frequency-specific features of speech. It tracks magnitude spectra and direction on a frame-by-frame basis.
  • a well-known, striking characteristic of human perception is that the recognition of stimuli is quasi-instantaneous, even though the information propagation speed in living neurons is slow [26], [60], [61]. This implies that neural responses are conditioned by previous events and states of the neural sub-network [71]. Understanding the underlying mechanisms of perception, in combination with that of the peripheral auditory system [11], [17], [23], [73], allows designing an analysis module.
  • novelty detection facilitates autonomy. For example, it can allow robots to detect whether stimuli are new or have already been seen. When associated with conditioning, novelty detection can create autonomy of the system [15], [24].
  • Sequence classification is particularly interesting for speech. Recently Panchev and Wermter [46] have shown that synaptic plasticity can be used to perform recognition of sequences. Perrinet [78] and Thorpe [61] discuss the importance of sparse coding and rank order coding for classification of sequences.
  • Neuron assemblies (groups) of spiking neurons can be used to implement segregation and fusion (integration) of objects in an auditory image representation.
  • correlations (or distances) between signals are implemented with delay lines, products and summations.
  • comparison between signals can be made with spiking neurons without implementation of delay lines. This is achieved by presenting images to spiking neurons with dynamic synapses. Then, a spontaneous organization appears in the network with sets of neurons firing in synchrony. Neurons with the same firing phase belong to the same auditory objects.
  • Milner [43] and Malsburg [67], [68], [69] propose temporal correlation to perform binding. Milner and Malsburg have observed that synchrony is a crucial feature to bind neurons associated with similar characteristics.
  • Pattern recognition robust to noise, symmetry, homothety (size change with angle preservation), etc. has long been a challenging problem in artificial intelligence.
  • Many solutions or partial solutions to this problem have been proposed using expert systems or neural networks.
  • Normalization: in this approach, the analyzed object is normalized to a standard position and size by an internal transformation. Advantages of this approach include: i) the coordinate information (the "where" information) is retrievable at any stage of the processing; and ii) there is a minimum loss of information.
  • the disadvantage of this approach is that the network should find the object in the scene and then normalize it. This task is not as obvious as it may appear [35], [51].
  • DLM Dynamic Link Matching
  • blobs may or may not correspond to a segmented region of the visual scene, since their size is fixed in the whole simulation period and is chosen by some parameters in the dynamics of the network [35].
  • the appearance of blobs in the network has been linked, by the developers of the architecture, to the attention process present in the brain.
  • the dynamics of the neurons used in the original DLM network are not the well-known spiking neuron dynamics.
  • the behavior of neurons from the DLM is based on rate coding (average neuron activity over time) and, in its Fast Dynamic Link Matching (FDLM) form, can be shown to be equivalent to an enhanced dynamic Kohonen Map [35].
  • FDLM Fast Dynamic Link Matching
  • the above systems from the prior art are supervised / non-autonomous or include two operating modes: learning and recognition.
  • An object of the present invention is therefore to provide an improved method for monophonic sound separation.
  • Another object of the invention is to provide an improved method for image processing and/or recognition.
  • Another object of the invention is to provide an improved method for pattern recognition.
  • an Oscillatory Dynamic Link Matching algorithm which uses spiking neurons and is based on phase coding.
  • a two-layer neural network is also provided which is capable of doing motion analysis without requiring either the computing of optical flow or additional signal processing between its layers.
  • the proposed neural network can solve the correspondence problem, and at the same time, perform the segmentation of the scene, which is in accordance with the Gestalt theory of perception [21].
  • the proposed neural network based system is very useful in pattern recognition in multiple-object scenes.
  • the proposed network does normalization, segmentation, and pattern recognition at the same time. It is also self-organized.
  • a neural network system comprising: first and second layers of spiking neurons; each neuron from the first layer being configured for first internal connections to other neurons from the first layer or for external connections to neurons from the second layer to receive first extra-layer stimuli therefrom, and for receiving first external stimuli; each neuron from the second layer being configured for second internal connections to other neurons from the second layer or for the external connections to neurons from the first layer to receive second extra-layer stimuli therefrom, and for receiving second external stimuli; and at least one network activity controller connected to at least some of the neurons from each of the first and second layers for regulating the activity of the first and second layers of spiking neurons; whereby, in operation, upon receiving the first and second external stimuli, the first and second internal connections are promoted, and synchronous spiking of neurons from the first and second layers is promoted by the external connections when some of the first external stimuli are similar to some of the second external stimuli.
  • auditory-based features are integrated with an unconventional pattern recognition system, based on a network of spiking neurons with dynamical and multiplicative synapses.
  • the analysis is dynamical and extracts multiple features (and maps), while the neural network does not require any training and is autonomous.
  • a system for monophonic source separation comprising: a vocoder for receiving a sound mixture including at least one monophonic sound source; an auditory image generator coupled to the vocoder for receiving the sound mixture therefrom and for generating an auditory image representation of the sound mixture; a neural network as recited in claim 1, coupled to the auditory image generator for receiving the auditory image representation and for generating a mask in response to the auditory image representation; and a multiplier coupled to both the vocoder and the neural network for receiving the mask from the neural network and for multiplying the mask with the at least one monophonic sound mixture from the vocoder, resulting in the identification of the at least one monophonic source by muting sounds from the sound mixture not belonging to the at least one monophonic source.
  • a method for establishing correspondence between first and second images, each of the first and second images including pixels, the method comprising: providing a neural network including first and second layers of neurons; applying pixels from the first image to respective neurons of the first layer of neurons and pixels from the second image to respective neurons of the second layer of neurons; interconnecting each neuron from the first layer to each neuron of the second layer; performing a dynamic matching between the first and second layers, yielding a temporal correlation between the first and second layers; and using the temporal correlation between the first and second layers for establishing correspondence between the first and second images.
  • the method does not need explicit rules to create a separation mask: for example, mapping between the rules developed by Bregman [6] for simple sounds and the real world is difficult, and as long as the aforementioned rules are not derived and well-documented [22], expert systems are difficult to use. It does not require a time-consuming training phase prior to the separation phase, contrary to approaches based on statistics like HMMs [58], Factorial HMMs [53], or MAP [53] that usually do. And it is autonomous, as it does not use hierarchical classification.
  • a method for establishing correspondence between first and second sets of data comprising: providing a neural network including first and second layers of neurons; providing first and second image representations, including pixels, of respectively the first and second sets of data; applying the first and second image representations respectively to the first and second layers; interconnecting each neuron from the first layer to each neuron of the second layer; performing a dynamic matching between the first and second layers, yielding a temporal correlation between the first and second layers; and using the temporal correlation between the first and second layers for establishing correspondence between the first and second sets of data.
  • the building blocks of the proposed architecture are spiking neurons. These neurons are different from conventional neurons used in engineering problems in the way they convey information and perform computations. Each spiking neuron fires "spikes". Thus, the information is transmitted through either spike rate (or frequency of discharge) or the spike timing of each neuron and the relative spike timing between different neurons.
  • the present invention concerns more specifically temporal correlation (phase synchrony between neurons).
  • the synchrony among a cluster of neurons means that the external inputs are similar.
  • desynchrony between clusters of neurons means that the underlying data belong to different sources (either audio or visual).
  • the information of the proposed neural network architecture is coded in the neurons' synchronization or in the relative timing between spikes from different neurons.
  • the update of the synaptic weights is automatic, so that synapses are dynamic.
  • Characteristics of the method and system according to the present invention include: no learning or recognition phase; information coded in the synchronization or in the relative timing between spikes from different neurons; automatic synaptic-weight update; dynamic synapses; separation of sound sources in a mixture of audio sources; invariant pattern processing or recognition; no need to develop a specific neural network for each new application, which confirms the adaptability of the system; no distance computation (as in classical methods), because the information needed to perform classification does not include distances; generation of either auditory maps or adequate visual patterns, depending on the nature of the target application of the system; automatic audio channel selection, in the context of audio processing; and internal connections between neurons acting as competitive or cooperative relations.
  • Figures 1A-1C, which are labeled "Prior Art", are graphs illustrating respectively the behavior of a single third-generation neuron and a couple of neurons under different stimuli;
  • Figure 2 is a block diagram of a system for monophonic source separation according to an illustrative embodiment of the present invention, including a spiking neural network according to a first illustrative embodiment of the present invention
  • Figure 3 is a schematic view of the neural network from Figure 2;
  • Figure 4 is a flow chart illustrating a method for monophonic source separation according to an illustrative embodiment of the present invention
  • Figure 5 is a spectrogram illustrating the mixture of the utterance "Why were you all weary?" with a trill telephone noise
  • Figure 6A is a spectrogram illustrating a synthesized version of the utterance from Figure 5, "Why were you all weary?", after the separation using the method from Figure 4;
  • Figure 6B which is labeled "Prior Art", is an image illustrating a synthesized version of the utterance from Figure 5 as obtained using a system from the prior art;
  • Figure 7 is a spectrogram illustrating a synthesized version of a trill phone after the separation using the method from Figure 4;
  • Figure 8 is a spectrogram illustrating the mixture of the utterance "I willingly marry Marilyn" with a 1 kHz pure tone;
  • Figures 9A-9B are spectrograms illustrating the separation results using respectively the method from Figure 4 and the approach proposed by Wang and Brown; Figure 9B being labeled "Prior Art";
  • Figure 10 is a spectrogram illustrating the mixture of the utterance "I willingly marry Marilyn" with a siren;
  • Figure 11 is a spectrogram illustrating the separated siren obtained using the method from Figure 4;
  • Figure 12 is a spectrogram of the separated utterance from Figure 10;
  • Figure 13 is a block diagram illustrating a pattern recognition system according to an illustrative embodiment of the present invention;
  • Figure 14 is a schematic view illustrating an example of image to be processed by the system from Figure 13;
  • Figure 15 is a schematic view illustrating a neural network according to a second illustrative embodiment of the present invention.
  • Figure 16 is a flowchart of a method for establishing correspondence between first and second images according to an illustrative embodiment of the present invention
  • Figure 17 is a schematic view illustrating an affine transform T for a four-corner object
  • Figures 18A and 18B are images illustrating the activity of respectively the first and second layers of the neural map using the system from Figure 13 when two bars are presented to the neural network;
  • Figures 19A and 19B are graphs showing respectively the activity of one of the neurons associated with the vertical bar from Figure 18B in the first layer of the neural network from Figure 15 after the segmentation steps from Figure 16, and the activity associated with the background in the same layer;
  • Figures 20A and 20B are graphs showing respectively the activity of one of the neurons associated with the horizontal bar from Figure 18A in the first layer of the neural network from Figure 15 after the dynamic matching step from Figure 16, and the activity of one of the neurons associated with the vertical bar from Figure 18B in the second layer of the neural network from Figure 15 after the dynamic matching step from Figure 16;
  • Figure 21 is a graph illustrating the evolution of the thresholded activity of the network from Figure 15 through time in the segmentation phase from Figure 16 considering the images from Figures 18A-18B; each vertical rod representing a synchronized ensemble of neurons and the vertical axis representing the number of neurons in that synchronized region;
  • Figure 22 is a graph illustrating the evolution of the thresholded activity through time of the network from Figure 15, in the dynamic matching phase from Figure 16, considering the images from Figures 18A-18B;
  • Figure 23 is a graph illustrating the synchronization index of a one-object scene when the segmentation steps from Figure 16 are bypassed, the synchronization taking 85 oscillations;
  • Figure 24 is a graph illustrating the synchronization index of a one-object scene when the segmentation steps from Figure 16 precede the matching phase, the synchronization taking 155 oscillations;
  • Figure 25 is an image illustrating the synchronization phase from the method from Figure 16, binary masks being generated by assigning binary values to different oscillation phases.
  • a system 10 for monophonic source separation according to an illustrative embodiment of the present invention will now be described with reference to Figure 2.
  • the system 10 includes a spiking neural network 12 according to a first illustrative embodiment of the present invention.
  • the system 10 allows separating a plurality of different monophonic sources blended in a sound mixture 14 provided as an input to the system 10.
  • the system 10 is in the form of a bottom-up CASA system which intends to separate different sound sources. System 10 allows separating two, three, or more sound sources.
  • the left branch in Figure 2 provides analysis/synthesis of the sound source in many sub-bands or channels. This separation is achieved by a double vocoder, in the form of FIR Gammatone filter banks 24, following the psychoacoustic cochlear frequency distribution.
  • the system 10 further comprises an auditory image generator 15, including a CAM (Cochleotopic/AMtopic Map) generator 16 and a CSM (Cochleotopic/Spectrotopic Map) generator 18; the two-layered spiking neural network 12, for receiving and processing a map outputted by the auditory image generator 15 and for providing a binary mask 22 based on the neural synchrony 20 in the output of the neural network; means 26 for multiplying the binary mask 22 with the output of the FIR Gammatone synthesis filter bank 24; and an integrator 28 to sum up the channels.
  • An FIR implementation of the well-known Gammatone filter bank is used as the analysis/synthesis filter bank.
  • the use of the Gammatone filter bank allows obtaining the properties of audition as observed in the psychoacoustics field.
  • the resulting number of channels is 256 with center frequencies from 100 Hz to 3600 Hz uniformly spaced on an ERB scale (Equivalent Rectangular Bandwidth scale, a psychoacoustics critical bands scale), with a sampling rate of 8 kHz.
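As a concrete illustration of this channel layout, the short sketch below computes center frequencies uniformly spaced on an ERB-rate scale. It assumes the common Glasberg and Moore ERB-rate formula, which the patent does not spell out, so it is a hedged approximation rather than the patent's exact design.

```python
import numpy as np

def erb_rate(f_hz):
    # Glasberg & Moore (1990) ERB-rate scale, in ERB units (assumed formula)
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

def inverse_erb_rate(erb):
    # Inverse mapping from ERB units back to Hz
    return (10.0 ** (erb / 21.4) - 1.0) * 1000.0 / 4.37

def erb_center_frequencies(n_channels=256, f_low=100.0, f_high=3600.0):
    # Center frequencies uniformly spaced on the ERB-rate scale,
    # matching the 256 channels from 100 Hz to 3600 Hz described above
    erbs = np.linspace(erb_rate(f_low), erb_rate(f_high), n_channels)
    return inverse_erb_rate(erbs)

cf = erb_center_frequencies()
print(cf[0], cf[-1])  # ~100.0 Hz and ~3600.0 Hz by construction
```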
  • the actual time-varying filtering is done by the mask 22.
  • this mask 22 is obtained by grouping synchronous oscillators of the neural net; the output of the synthesis filter bank 24 is multiplied with it.
  • auditory channels belonging to interfering sound sources are muted, while channels belonging to the sound source of interest remain unaffected. This is, in some way, equivalent to labeling the cochlear channels for each time frame.
  • a value of 1 is associated with the targeted signal and a value of 0 with the interfering signal, yielding a binary mask.
  • a non-binary continuous mask can also be used.
  • the signals of the masked auditory channels are passed through the synthesis filters, whose impulse responses are time-reversed versions of the impulse responses of the corresponding analysis filters, and are then added to form the synthesized signal.
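The masking-and-resynthesis step described above can be sketched as follows. This is a minimal illustration, assuming per-channel signals and masks stored as arrays and a list of analysis impulse responses; the function name and data layout are hypothetical, not taken from the patent.

```python
import numpy as np
from scipy.signal import lfilter

def resynthesize(channels, masks, analysis_irs):
    # channels: (n_ch, n_samples) analysis filter-bank outputs
    # masks: (n_ch, n_samples) binary (or continuous) mask; 1 keeps a channel
    # analysis_irs: per-channel analysis impulse responses, whose
    # time-reversed versions serve as the synthesis filters
    out = np.zeros(channels.shape[1])
    for sig, m, h in zip(channels, masks, analysis_irs):
        out += lfilter(h[::-1], [1.0], sig * m)  # mute, re-filter, then sum
    return out
```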
  • the vocoder may take other forms including Linear Predictive Coding Vocoder, Fourier transform and any other transformation allowing analysis/synthesis.
  • the number of channels, the sampling rate and the center frequency values may differ without departing from the spirit and nature of the present invention.
  • the system 10 comprises an auditory images generator 15 to simultaneously generate two different image representations of the signals provided by the filter bank 24: • a first representation in the form of the well known amplitude modulation map, which will be referred to herein as Cochleotopic/AMtopic (CAM) map - closely related to modulation spectrograms as defined in [12] and [42]; and • a second representation in the form of the well known Cochleotopic/Spectrotopic (CSM) map that encodes the averaged spectral energies of the cochlear filter bank output.
  • CAM Cochleotopic/AMtopic
  • CSM Cochleotopic/Spectrotopic
  • the CAM map is an amplitude modulation representation of cochlear envelopes; while CSM is a cochlear energy representation (energy frequency distribution following the Gammatone filter bank structure). CAM and CSM generators can be added to follow each chosen application.
  • the generator 15 is configured so that, in operation, depending on the nature of the intruding sound (speech, music, noise, etc.) one of the maps is selected.
  • the generator 15 is programmed with the following CAM/CSM generation algorithm:
1. Down-sampling to 8000 Hz;
2. Filtering the sound source using a 256-dimensional bark-scaled cochlear filter bank ranging from 200 Hz to 3.6 kHz;
3. For CAM: extracting the envelope (AM demodulation) for channels 30-256; for the other, low-frequency channels (1-29), using the raw outputs [56]. For CSM: nothing is done in this step;
4. Computing the STFT (Short-Time Fourier Transform) using a Hamming window, which is well known in the art. Alternatively, non-overlapping adjacent windows with 4 ms or 32 ms lengths, for example, can also be used;
5.
  • while the generator 15 as illustrated allows generating only CAM or CSM maps, other maps can alternatively or additionally be generated, such as a map facilitating the separation of speakers whose glottises have similar frequencies.
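A compact sketch of steps 3 and 4 of the CAM/CSM generation algorithm is given below. It assumes the filter-bank outputs are already available as an array; the envelope is taken via the Hilbert transform, one common way to perform the AM demodulation mentioned above, though the patent does not name the demodulation method.

```python
import numpy as np
from scipy.signal import hilbert, stft

def cam_csm(channels, fs=8000, kind="CAM", win_len=256):
    # channels: (256, n_samples) cochlear filter-bank outputs.
    # CAM: envelopes (AM demodulation) of channels 30-256, raw outputs for
    # channels 1-29, followed by an STFT; CSM: STFT of the raw outputs.
    x = np.asarray(channels, dtype=float).copy()
    if kind == "CAM":
        x[29:] = np.abs(hilbert(x[29:], axis=-1))  # envelopes, channels 30-256
    # STFT of each channel with a Hamming window (step 4 above)
    _, _, X = stft(x, fs=fs, window="hamming", nperseg=win_len, axis=-1)
    return np.abs(X)
```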
  • a neural network 12 according to a first illustrative embodiment of the present invention will now be described in more detail with reference to Figure 3.
  • the neural network 12 comprises first and second layers 30 and 32 of oscillatory spiking neurons 36 and 38, respectively, and first and second global controllers 34 (only one shown).
  • the second layer 32 is one-dimensional.
  • the number of neurons 36-38 shown in Figure 3 does not correspond to the real number of neurons, which is of course greater.
  • the dynamics of oscillatory neurons 36-38 is governed by a modified version of the Van der Pol relaxation oscillator (called the Wang-Terman oscillator [69]). There is an active phase, when the neuron spikes, and a relaxation phase, when the neuron is silent. More details on oscillatory neurons are provided in [70].
  • the neural network 12 is configured so that each neuron 36 from the first layer 30 allows for internal connections 37 to other neurons 36 from the first layer 30 or for external connections 39 to neurons 38 from the second layer 32 to receive extra-layer stimuli therefrom and for receiving external stimuli from external signals.
  • Each neuron 38 from the second layer 32 likewise allows for internal connections 41 to other neurons 38 from the second layer 32 or for said external connections 39 to neurons 36 from said first layer to receive extra-layer stimuli therefrom, and for receiving second external stimuli from the second external signals.
  • the neurons 36-38 from the first and second layers 30-32 are connected to a global controller 34, which allows for synchronization and desynchronization of different regions of the network 12.
  • the global controller 34 can be substituted by any network activity regulator allowing regulation of the activity of the network.
  • the global controller 34 acts as a local inhibitor.
  • the first layer 30 performs a segmentation of the auditory map.
  • the dynamics of the neurons follow the state-space Equations below, where x_{i,j} is the membrane potential (output) of the neuron and y_{i,j} is the state for channel activation or inactivation:

dx_{i,j}/dt = 3x_{i,j} - x_{i,j}^3 + 2 - y_{i,j} + ρ + p_{i,j} + S_{i,j}   (1)

dy_{i,j}/dt = ε[γ(1 + tanh(x_{i,j}/β)) - y_{i,j}]   (2)

where ρ denotes the amplitude of a Gaussian noise, p_{i,j} the external input to neuron(i,j), and S_{i,j} the coupling from other neurons.
  • the Euler integration method is used to solve Equations 1 and 2.
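A minimal sketch of that Euler integration for a single Wang-Terman oscillator follows, using the reconstructed Equations 1 and 2 above; the parameter values are illustrative defaults, not the patent's.

```python
import numpy as np

def wang_terman_step(x, y, p, S, dt=0.05,
                     eps=0.02, gamma=6.0, beta=0.1, rho_amp=0.02):
    # One Euler step of Equations 1 and 2 (illustrative parameters)
    rho = rho_amp * np.random.randn()          # Gaussian noise term
    dx = 3.0 * x - x ** 3 + 2.0 - y + rho + p + S
    dy = eps * (gamma * (1.0 + np.tanh(x / beta)) - y)
    return x + dt * dx, y + dt * dy

# A stimulated oscillator alternates between an active (spiking) phase
# and a relaxation (silent) phase:
x, y = -1.0, 0.0
for _ in range(2000):
    x, y = wang_terman_step(x, y, p=0.8, S=0.0)
```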
  • the first layer is a partially connected network of relaxation oscillators [69]. Each neuron is connected to its four neighbors.
  • the CAM 16 (or the CSM 18) is applied to the input of the neurons 36. It has been found that the geometric interpretation of pitch (ray distance criterion) is less clear for the first 29 channels. For this reason, long-range connections 40 (only one shown) have also been established from clear (high frequency) zones to confusion (low frequency) zones. These connections exist only across the cochlear channel number axis of the CAM. This architecture can help the network 12 to better extract harmonic patterns.
  • the weight between neuron (i,j) and neuron (k,m) of the first layer is computed via the following formula:
  • p(i,j) and p(k,m) are respectively the external inputs to neuron(i,j) and neuron(k,m) ∈ N(i,j).
  • Card{N(i,j)} is a normalization factor and is equal to the cardinal number (number of elements) of the set N(i,j) containing the neighbors connected to neuron(i,j) (it can be equal to 4, 3 or 2, depending on the location of the neuron on the map, i.e. center, corner, etc.).
  • the external input values are normalized.
  • σ is equal to 1 if the global activity of the network is greater than a predefined ζ and is zero otherwise; η and ζ are constants.
  • L_{i,j}(t) is the long-range coupling, as follows:
  • the first layer 30 is designed to handle presentations of auditory maps to process continuous sliding and overlapping windows on the signal.
  • Second layer temporal correlation and multiplicative synapses
  • the second layer 32 performs temporal correlation between neurons. Each of its neurons represents a cochlear channel of the analysis/synthesis filter bank. For each presented auditory map, the second layer 32 establishes binding between neurons whose input is dominated by the same source. The external connections establish multiplicative synapses with the first layer 30.
  • the second layer is an array of 256 neurons (one for each channel) similar to those described by Equations 1 and 2. Each neuron 38 receives the weighted product of the outputs of the first layer neurons along the frequency axis of the CAM/CSM.
  • the operator used for the multiplicative synapses is defined as:
  • ⟨·⟩ is the averaging-over-a-time-window operator (the duration of the window is on the order of the discharge period).
  • the multiplication is done only for non-zero inputs (outputs of the first layer 30 in which a spike is present) [19], [47]. It is to be noted that this behavior has been observed in the integration of ITD (Interaural Time Difference) and ILD (Interaural Level Difference) information in the barn owl's auditory system [19], and in the monkey's posterior parietal lobe neurons, which show receptive fields that can be explained by a multiplication of retinal and eye or head position signals [1].
  • ITD Interaural Time Difference
  • ILD Interaural Level Difference
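The weighted product feeding a second-layer neuron, restricted to non-zero (spiking) first-layer outputs as described above, can be sketched as follows; the function and argument names are hypothetical.

```python
import numpy as np

def second_layer_input(x1, w):
    # x1: outputs of the first-layer neurons along one frequency column
    # of the CAM/CSM; w: the corresponding synaptic weights. The product
    # runs only over active (non-zero) inputs, per the text above.
    active = x1 > 0.0
    if not np.any(active):
        return 0.0  # no spike present, no multiplicative drive
    return float(np.prod(w[active] * x1[active]))
```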
  • the synaptic weights inside the second layer 32 are adjusted through the following rule:
  • is chosen to be equal to 2.
  • the "binding" of these features is done via this second layer 32.
  • the second layer 32 is an array of fully connected neurons 38 along with a global controller defined as in Equations 5 and 6.
  • the global controller desynchronizes the synchronized neurons 38 for the first and second sources by emitting inhibitory activities whenever there is an activity (spikes) in the network [69].
  • the selection strategy at the output of the second layer 32 is based on temporal correlation: neurons belonging to the same source synchronize (same spiking phase); neurons belonging to another source desynchronize (different spiking phase).
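This phase-based selection can be illustrated with a small grouping routine: channels whose neurons spike at (nearly) the same phase are assigned to the same source mask. The tolerance and data layout are assumptions for the sketch, not values from the patent.

```python
import numpy as np

def masks_from_phases(spike_times, tol=2.0):
    # spike_times: last spike time (ms) of each of the 256 channel neurons.
    # Channels spiking within `tol` of each other are grouped; one binary
    # mask per group is returned, with shape (n_groups, 256).
    order = np.argsort(spike_times)
    groups, current = [], [order[0]]
    for a, b in zip(order, order[1:]):
        if spike_times[b] - spike_times[a] <= tol:
            current.append(b)       # same phase group, same source
        else:
            groups.append(current)  # phase jump: start a new source
            current = [b]
    groups.append(current)
    masks = np.zeros((len(groups), len(spike_times)))
    for k, grp in enumerate(groups):
        masks[k, grp] = 1.0
    return masks
```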
  • a method 100 for monophonic source separation according to an illustrative embodiment of the present invention is illustrated in Figure 4.
  • the synaptic connections plasticity is a dynamic process which provides on-demand neural network topology self-modification.
  • the present invention uses this neural network self-organization feature via the synchronization/desynchronization of the cells, forming or dismantling groups of neurons.
  • the neural network 12 is independent from the signal representation used, but this representation is related to the chosen application.
  • For example, when two speakers are talking simultaneously with similar glottal fundamental frequencies, which makes the sound source separation more difficult, the instantaneous frequency is suitable in order to detect each glottis opening. In that case, it is known that a listener would have great difficulty separating the two voices.
  • Some examples [14], [15], [16] demonstrate that listeners confuse the speakers when the transmission channel affects the transmitted voice. Even if the input representation presented to the network is different, the approach stands, because the network is not affected by this change.
  • a binary mask 22 is generated from the output of the neural network 12 associating zeros and ones to different channels in order to preserve or remove each sound source.
  • the mask 22 allows attributing each of the channels to a respective source.
  • the energy is normalized.
  • although the system 10 has been described with reference to two-source mixtures, the present method and system for monophonic source separation can also be used for more than two sources. In that case, for each time frame n, the labeling of individual channels is equivalent to the use of multiple masks (one for each source).
  • Cooke's database [10] is used for evaluation purposes.
  • the following noises have been tested: 1 kHz tone, FM siren, white noise, trill telephone noise, and human speech.
  • the aforementioned noises have been added to the target utterance.
  • Each mixture is applied to the neural system 10 and the sound sources are extracted.
  • the LSD (Log Spectral Distortion) [64], [65] is used as the performance criterion, as defined below:
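A standard formulation of the log spectral distortion between the short-time spectra S(k, n) of the clean source and Ŝ(k, n) of the separated estimate, averaged over N frames of K frequency bins, is given below; the patent's exact definition may differ in windowing and normalization details.

$$\mathrm{LSD} = \frac{1}{N}\sum_{n=1}^{N}\sqrt{\frac{1}{K}\sum_{k=1}^{K}\left[20\log_{10}\frac{|S(k,n)|}{|\hat{S}(k,n)|}\right]^{2}}$$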
  • Table 1 gives the LSD performances.
  • the method 100 outperforms the other two systems for the tone-plus-utterance case, performs better than the system proposed by Wang and Brown [69] in all cases, and gives performance similar to that of the system from Hu and Wang [27] for the siren, but performs worse than the same system from Hu and Wang [27] for the white noise and the telephone ring. For the double-vowel case, results are not available for the other two approaches [27], [69].
  • SNR-like criteria such as the SNR, Segmental SNR, PEL (Percentage of Energy Loss), and PNR (Percentage of Noise Residue) are used in the literature (see for example [26], [27], [28], [35], [48], [55], and [69]) and can be used as performance scores. In what follows, spectrograms for different sounds and different approaches are given for visual comparison purposes.
  • Table 1 gives the log spectral distortion for three different methods: the method according to the present invention as described hereinabove, W-B (the method proposed by Wang and Brown [69]), and H-W (the method proposed by Hu and Wang [27]).
  • the intrusion noises are as follows: 1 kHz pure tone, FM siren, telephone ring, white noise, the male intrusion (/di/) for the French /di//da/ mixture, and the female intrusion (/da/) for the French /di//da/ mixture. Except for the last two tests, the intrusion is mixed with a sentence taken from Cooke's database.
  • Figure 5 shows the mixture of the utterance "Why were you all weary?" with the telephone trill noise (from Cooke's database).
  • the trill telephone noise (ring) is wideband, interrupted, and structured.
  • Figures 6A-6B show the separated utterance spectrograms obtained using respectively the method 100 and the one proposed by Wang and Brown [69]. As can be seen, the method 100 yields better results at higher frequencies.
  • Figure 7 shows the extracted telephone trill.
  • Figure 10 shows the mixture of the utterance "I willingly marry Marilyn" with a siren.
  • the siren is a locally narrowband, continuous, structured signal.
  • Figure 11 shows the separated siren obtained using the method 100.
  • Figure 12 shows the spectrogram of the separated utterance.
  • although criteria like the PEL and PNR, the SNR, the Segmental SNR, the LSD, etc. are used in the literature as performance criteria, they do not always reflect exactly the real perceptive performance of a given sound separation system.
  • the SNR, the PEL and the PNR ignore high-frequency information.
  • the LSD does not take into account some temporal aspects like the phase distortion. The result of the LSD for two techniques with different phases would be the same. Therefore the LSD will not detect phase distortions in the separation results.
  • The method and system for monophonic source separation have many applications, including multimedia file indexation and authentication: multimedia files on the Internet or other media must be indexed before someone can launch queries on their content ("Who is the singer?", "Which song does he sing?", etc.). The method and system of the present invention can allow separating multiple sources to ease indexation. Also, since the present invention is suitable for comparison purposes, musical file authentication can either be integrated in a file indexation system or simply used as a stand-alone application. For example, in peer-to-peer file sharing systems, files may be renamed by users who want to share them illegally.
  • Other applications include: combining an auditory mask generated from camera images with the filter bank to create audio stimuli from the visual scene; an intelligent helmet to be used, for example, in high-risk industrial areas where noise levels are high; and scene analysis for visually impaired persons, where sounds can be used to help blind people analyze visual scenes.
  • different colors, textures, and objects are associated with different sound characteristics (frequency, duration, etc.). If there are many objects in the scene, the analysis of the corresponding sound mixture will become difficult for the subject.
  • a method and system for monophonic source separation can be used to separate different visual scene objects by separating their equivalent auditory objects.
  • the system 50 includes a spiking neural network 52 according to a second illustrative embodiment of the present invention.
  • the system 50 does not include preprocessors upstream from the neural network 52, since the image to compare 54 and the reference image 56 are both provided to the neural network 52 as external inputs. More specifically, each pixel of each image 54-56 is associated with respective neurons of a respective layer 62-64 of the neural network 52, as will be described hereinbelow in more detail.
  • the synaptic weights are provided by the grey-scales or colors associated with the neurons.
  • the images 54-56 can be for example post-processed images from an external camera (not shown).
  • the source of the images 54-56 may of course vary without departing from the spirit and nature of the present invention.
  • An example of image to be inputted to the layer is illustrated in Figure 14.
  • the neural network 52 will now be described in more detail with reference to Figure 15. Since the spiking neural network 52 is similar to the neural network 12, for concision purposes only the differences will be described herein in more detail.
  • the neural network 52 includes first and second layers 62 and 64 of spiking neurons 66 and 68.
  • a neighborhood of 4 is chosen in each layer 62 or 64 for the connections.
  • Each neuron 66 in the first layer 62 is connected to all neurons 68 in the second layer 64 and vice-versa.
  • the number of neurons 66-68 shown in Figure 15 does not correspond to the real number of neurons, which is greater.
  • the neighborhood can be set to other numbers depending on the applications.
  • a network activity regulator in the form of a global controller 70, is connected to all neurons 66-68 in the first and second layers 62 and 64 as in [7].
  • the global controller 70 has bidirectional connections to all neurons 66-68 in the two layers 62-64.
  • segmentation is done in the two layers 62 and 64 independently (with no extra-layer connections), while dynamic matching is done with both intra-layer and extra-layer couplings.
  • the intra- layer and extra-layer connections are defined as follows:
  • Card{N^int(i,j)} is a normalization factor and is equal to the number of neighbors connected to neuron(i,j); it can be equal to 4, 3 or 2, depending on the location of the neuron on the map (i.e. center, corner, etc.) and on the number of active connections. For the extra-layer connections 76, Card{N^ext(i,j)} is the cardinal number and is equal to the number of neurons in the second layer 64 with an active connection to neuron(i,j) 66 in the first layer 62.
  • the normalization in Equation 14 allows the correspondence between similar pictures of different sizes. If the aim is to match objects with exactly the same size, the normalization factor is set to a constant for all neurons 66-68. The reason for this is that, with normalization, even if the size of the picture in the second layer 64 were double that of the same object in the first layer 62, the total influence on neuron(i,j) would be the same as if the pattern were of the same size.
  • the network 52 can have two different behavioral modes: segmentation and matching.
  • in the segmentation stage, there is no connection between the two layers 62 and 64.
  • the two layers 62-64 act independently (except for the influence of the global controller 70) and segment the two images 54 and 56 applied to the two layers 62 and 64, respectively.
  • the global controller 70 forces the segments on the two layers 62-64 to have different phases.
  • the two images are segmented but no two segments have the same phase (see Figures 19A-19B).
  • the results from segmentation are used to create binary masks 58 and 60 that select one object in each layer 62 and 64 in multi-object scenes.
  • a snapshot, obtained at a specific time t like the one shown in Figure 25, is used to create the binary mask m(i,j) for one of the objects, as follows:
  • x_sync can be the synchronized value that corresponds to either the cross or the rectangle in Figure 25 at time t_sync.
  • at time t_j, the mask m(i,j) is different from the one at time t_i and corresponds to a different object.
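A hedged reading of this mask construction is that m(i,j) is set to 1 wherever the neuron's activity matches the selected synchronized value x_sync, and 0 elsewhere; the tolerance below is an assumption for the sketch.

```python
import numpy as np

def binary_mask(x_snapshot, x_sync, tol=1e-3):
    # m(i,j) = 1 where neuron (i,j) is synchronized with the selected
    # object (activity equal to x_sync at the snapshot time), else 0
    return (np.abs(x_snapshot - x_sync) < tol).astype(np.uint8)
```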
  • G(t) = αH(z - θ)   (17), where H(·) is the Heaviside function;
  • dz/dt = σ - ηz   (18), where σ is equal to 1 if the global activity of the network is greater than a predefined ζ and is zero otherwise.
  • the inputs to the layers are defined by:
  • Extra-layer connections 76 (Equation 2) are established. If there are similar objects in the two layers 62-64, these extra-layer connections 76 will help them synchronize; in other words, the two segments are bound together through these extra-layer connections 76 [23]. In order to detect synchronization, double-thresholding can be used [2]. This stage may be seen as a folded version of the oscillatory texture segmentation device proposed in [70].
  • the coupling strength S_{i,j} for each layer in the matching phase is defined as follows:
  • x^ext defines action potentials from the external connections and x^int defines action potentials from said first and second internal connections.
  • Figure 16 summarizes the general steps of a method 200 for establishing correspondence between first and second images according to an illustrative embodiment of the present invention. It is to be noted that steps 208-210 are not necessarily sequential.
  • p' = Ap + t, where A is a 2×2 non-singular matrix, p ∈ R² is a point in the plane, p' is its affine transform, and t is the translation vector.
  • Affine transformation is a combination of several simple mappings such as rotation, scaling, translation, and shearing.
  • the similarity transformation is a special case of the affine transformation. It preserves length ratios and angles, while the affine transformation, in general, does not.
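To make the distinction concrete, the snippet below applies a general affine map and a similarity map to the corners of a four-corner object, as in Figure 17. The matrix and vector values are arbitrary illustrations, and p' = Ap + t is a hedged reading of the relation stated above.

```python
import numpy as np

# Arbitrary non-singular 2x2 matrix (scale + shear) and translation vector
A = np.array([[1.2, 0.3],
              [-0.1, 0.8]])
t = np.array([5.0, -2.0])

corners = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # a, b, c, d
affine = corners @ A.T + t  # p' = A p + t per corner; angles not preserved

# A similarity transform is the special case A = s * R (scaled rotation):
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
similar = corners @ (0.5 * R).T + t  # preserves angles and length ratios
```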
  • Equation 22 is equivalent to the following (neglecting the effect of intra-layer connections, since N^ext ≫ N^int):
  • N^ext ∝ AΔ(abc) + AΔ(abd)   (23), where AΔ(·) denotes the area of the triangle defined by the given corners.
  • the extra-layer connections 76 are thus independent of the affine transform that maps the model to the scene (first and second layer objects), and this result can be extended to more than 4 points.
  • the original DLM is a rate coding approximation of the ODLM (Oscillatory Dynamic Link Matching) according to the present invention.
  • Aonishi et al. [3] have shown that a canonical form of rate-coding dynamic Equations solves the matching problem in the mathematical sense.
  • the dynamics of a neuron in one of the layers of the original Dynamic Link Matcher proposed in [35] is as follows:
  • Substituting into Equation 28, Equation 28 becomes:
  • Equation 29 can be further simplified to:
  • In Equation 31, the averaged output ⟨x^out⟩ of an integrate-and-fire neuron is related to the averaged-over-time inputs of the neuron by a continuous function (sigmoidal, etc.).
  • Considering that in Equation 31 one needs ⟨x^out⟩ as a function of ⟨x^in⟩, and that Equation 32 is a set of linear Equations in the averaged weights, one obtains Equation 35.
  • k(.) is a 2-D rectangular window (in the original DLM, k(.) was chosen to be the well-known Mexican hat).
  • the DLM is an averaged-over- time approximation of the ODLM according to the present invention.
  • the network 52 can be used to solve the correspondence problem. For example, consider that, in a factory chain, someone wants to check the existence of a component on an electronic circuit board (see for example Figure 14). All this person has to do is to put an image of the component on the first layer and check for synchronization between the layers. Ideally, any change in the angle or the location of the camera, or even the zoom factor, should not influence the result.
  • One of the signal processing counterparts of the method and system from the present invention is the morphological processing. Other partial solutions such as the Fourier transform could be used to perform matching robust to translation.
  • a method and system according to the present invention does not require training or configuration according to the stimulus applied.
  • the network 52 is autonomous and flexible to not previously seen stimuli. This is in contrast with associative memory based architectures in which a stimulus must be applied and saved into memory before retrieval (as in [66] for example). It does not require any pre-configured architecture adapted to the stimulus, like in the hierarchical coding paradigm [52].
  • Figures 18A-18B show activity snapshots (instantaneous values of x(i,j)) in the two layers 62-64 after the segmentation step 206. Neurons with the same gray scale have similar phases in Figures 18A-18B. On the other hand, different segments on different layers are desynchronized (Figures 19A-19B and 20A-20B).
  • the segmentation step 206 can be bypassed and the network 52 can function directly in the matching mode. This allows speeding up the pattern recognition process.
  • Figures 23-24 illustrate the behavior of a 13x5 network when only one object is present in each layer 62-64 showing that the synchronization time for the matching-only network is shorter. It is to be noted that the matching-only approach is inefficient when there are multiple objects in the scene. In the latter-mentioned case the segmentation plus matching approach should be used.
  • the network 52 is capable of establishing correspondence between images and is robust to translation, rotation, noise and homothetic transforms.
  • although the method 200 has been described as a means to segment images, it can also be used to solve the correspondence problem, as a whole system, using a two-layered oscillatory neural network.
  • Applications of the system 50 include: electronic circuit assembly, where the system 50 can be used to verify whether or not all the electronic components are present (and in good condition) on a PCB (Printed Circuit Board); facial recognition, where the technique can be applied by comparing a given face to a database of faces, in customs houses for example; fault detection in a production chain, where the invention can be used to find manufacturing errors in an assembly chain; and teledetection, where it can be used to find objects or changes in satellite images and to assist in the automatic generation of maps.
  • many types of neurons can be implemented, including integrate-and-fire neurons, relaxation oscillators, or chaotic neurons, even if only relaxation oscillators are used in the examples detailed hereinabove.
  • the neural networks 12 and 52 are implemented using a SIMULINK™ spiking neural network simulation library, in Java, and in C++ (three different simulators can be used).
  • a neural network according to the present invention can be implemented on other platforms.
  • Neural network architecture according to the present invention is well-suited to the control of event-driven and adaptive event- driven processes and the control of robots for example. For instance, it can be used to control sensorimotor parts of robots by bio-inspired (spiking neural) networks or for fly-by-wire design of aircraft.
  • a plurality of interconnected spiking neural networks can manage and control sensors, peripherals, vision, etc.
  • Frisina, R. D., Smith, R. L., Chamberlain, S. C., 1985. Differential encoding of rapid changes in sound amplitude by second-order auditory neurons. Experimental Brain Research 60, pp. 417-422.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

Systems for audio and image processing using bio-inspired neural networks are proposed. The first system allows separating a specific sound in a mixture of audio sources. The second system allows performing visual pattern processing and recognition robust to affine transforms and noise. The neural network system comprises first and second layers of spiking neurons, each neuron being configured for respectively first and second internal connections to other neurons from the same layer or for external connections to neurons from the other layer for receiving extra-layer stimuli therefrom, and for receiving external stimuli from external signals; and global controllers connected to all neurons to allow inhibiting the neurons. In operation, upon receiving stimuli from the first and second external signals, the internal connections are promoted, and synchronous spiking of neurons from the first and second layers is promoted by the external connections when some of the stimuli from the first external signals are similar to some of the stimuli from the second external signals. There is no need to tune the neural network when changing the signal nature. Furthermore, the proposed neural network is autonomous, and there is neither a training nor a recognition phase.

Description

TITLE OF THE INVENTION
SPIKING NEURAL NETWORK AND USE THEREOF
FIELD OF THE INVENTION
The present invention relates to neural networks. More specifically, the present invention is concerned with a spiking neural network and its use in pattern recognition and in monophonic source separation.
BACKGROUND OF THE INVENTION
Pattern recognition is an aspect of the field of artificial intelligence aiming at providing perception to "intelligent" systems, such as robots, programmable controllers, speech recognition systems, artificial vision systems, etc.
In pattern recognition, comparison criteria, similarities between shapes, and distances must be computed in order to answer questions such as: "Are these objects similar?", "Has the system already identified this form?", "Is this pattern different enough from the other patterns already identified by the system?", "Is this form to be remembered?", etc. In a nutshell, pattern recognition systems must use performance and comparison criteria usually assimilated to distances. The term "distance" should be construed as a probability, an error, or a score; that is, a value that can be assimilated to a distance. This type of criterion is widely used, for example, in rule-based expert systems, in statistical Markovian systems, in second-generation (formal) neural network systems, etc.
Unfortunately, the evaluation of distances is often a significant burden. Furthermore, object comparison is usually obtained by first comparing segments of the objects, which involves distance comparison. It has been found desirable to achieve such comparison with a more global approach. For example, comparing N signals would require two steps: 1. computing a distance for each pair of signals; and 2. finding similar signals by sorting and comparing the distances.
Third-generation neural networks, including spiking neurons, pulsed neurons, etc., allow alleviating this distance burden. Indeed, a properly designed spiking neural network allows pattern comparison and similarity evaluation between different patterns without explicit score or distance computation. This is done by using temporally-organized spiking events (Figure 1A). There are different coding schemes, including:
1) Synchronization as coding: as illustrated in Figures 1B and 1C, neurons not discharging at the same time are not synchronized. Conversely, neural synchronization occurs when similar input stimuli are given to the neurons, which then discharge synchronously. This is called neuron synchronization;
2) Rank Order Coding: a neuron spikes only when a specific input sequence of spikes is received on its dendrites.
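As a toy illustration of rank order coding, a hypothetical detector can compare the arrival order of presynaptic spikes to a stored order; the function names and spike times below are illustrative only, not taken from the present document:

```python
# Toy rank-order-coding detector: the neuron "fires" only when its
# presynaptic spikes arrive in one specific stored order (hypothetical sketch).
def rank_order_detector(spike_times, preferred_order):
    """spike_times: dict mapping dendrite id -> spike arrival time.
    preferred_order: list of dendrite ids in the expected firing order."""
    observed_order = sorted(spike_times, key=spike_times.get)
    return observed_order == preferred_order  # True -> the neuron spikes

# Dendrites 'a', 'b', 'c' must fire in the order a, b, c:
print(rank_order_detector({'a': 1.0, 'b': 2.5, 'c': 3.1}, ['a', 'b', 'c']))  # True
print(rank_order_detector({'a': 2.5, 'b': 1.0, 'c': 3.1}, ['a', 'b', 'c']))  # False
```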
This transfer between conventional digital coding and spike sequence coding is efficient in terms of distance criteria creation and comparison.
To summarize, any distance between objects can be represented by: more or less similar spike timings between neurons; or a single spike issued by a neuron, resulting from a specific input sequence of spikes. The latter process is called "spike order coding", and is characterized by the existence of couples of excitatory/inhibitory neurons providing recognition of incoming spike sequences from other neurons, after the spike has been generated by the neuron.
Synchronization coding occurs when two groups of neurons appear spontaneously because of the plasticity of the neuron interconnections. Thus, two neurons having similar inputs see their mutual synaptic connections grow, causing their outputs to become synchronous. Otherwise, when the neurons' inputs are not similar, their mutual synaptic connections decrease, causing them to desynchronize. In fact, the inputs of two neurons spiking simultaneously are relatively correlated.
Source separation
On a more specific topic, the separation of mixed signals is an important problem with many applications in the context of audio processing. It can be used, for example, to assist a robot in segregating multiple speakers, to ease the automatic transcription of video via the audio tracks, to separate musical instruments before automatic transcription, to clean a signal before performing speech recognition, etc. The ideal instrumental setup is based on the use of an array of microphones during recording to obtain many audio channels. In fact, in that situation, very good separation can be obtained between noise and the signal of interest (see [29], [33], and [50]), and experiments with great improvements have been reported in speech recognition [4], [64]. Further applications have been ported to mobile robots [66], [71], [72] and have also been developed to track multiple speakers [58].
The source separation process implies segregation and/or fusion (integration), usually based on correlation, statistical estimation, binding, etc. of features extracted by the analysis module.
Conventional approaches require training, explicit estimation, supervision, entropy estimation, huge signal databases [71], the AURORA database ([10], [34]), etc. Therefore, the design and training of such systems are heavy and very costly. Moreover, in many situations, only one channel is available to the audio engineer, who still has to solve the separation problem. The automatic separation and segregation of the sources is then much more difficult.
Most proposed monophonic systems from the prior art perform reasonably well on specific signals (generally voiced speech) but fail to efficiently segregate a broad range of signals. These relatively negative results may be overcome by combining and exchanging expertise and knowledge between engineering, psychoacoustics, physiology, and computer science.
Monophonic source separation systems can be seen as comprising two main stages: i) signal analysis, to yield a representation suitable for the second stage; and ii) clustering with segregation.
With at least two interfering speakers and voiced speech, it is observed that when the pitches are different, the separation is relatively easy, as spectral representations or auditory images exhibit different regions with structures dominated by pitch. The amplitude modulation of the cochlear filter outputs (or modulation spectrograms) is then discriminative.
In situations where speakers have similar pitches, the separation is more difficult. Features, such as phase, have to be preserved by the analysis. The glottal opening time should be taken into account, or long-term information such as intonation would be required (real-time processing then becomes problematic). Using Bregman's terminology, bottom-up processing corresponds to primitive processing, and top-down processing means schema-based processing [15]. The auditory cues proposed by Bregman [15] for simple tones are not applicable directly to complex sounds. More sophisticated cues based on different auditory maps are thus desirable. For example, Ellis [13] uses sinusoidal tracks created by the interpolation of the spectral peaks of the output of a cochlear filter bank, while Mellinger's model [41] uses partials. A partial is formed if an activity on the onset maps (the beginning of an energy burst) coincides with an energy local minimum of the spectral maps. Using these assumptions, Mellinger proposed a CASA system to separate musical instruments. Cooke [9] introduced the harmony strands, which are the counterpart of Mellinger's cues in speech. The integration and segregation of streams is done using Gestalt and Bregman's heuristics. Berthommier and Meyer use Amplitude Modulation maps (see [4], [42], [49] and [63]). Gaillard [32] uses a more conventional approach by using the first zero crossing for the detection of pitch and harmonic structures in the frequency-time map. Brown proposes an algorithm [17] based on the mutual exclusivity Gestalt principle. Hu and Wang use a pitch tracking technique [26]. Wang and Brown [69] use correlograms in combination with bio-inspired neural networks. Grossberg [37] proposes a neural architecture that implements Bregman's rules for simple sounds. Sameti [58] uses HMMs (Hidden Markov Models), while Roweis [57] and Reyes-Gomez [53] use Factorial HMMs. Jang and Lee [53] use a technique based on the Maximum A Posteriori (MAP) criterion. Another probability-based CASA system is proposed by Cooke [22]. Irino and Patterson [30] propose an auditory representation that is synchronous to the glottis and preserves fine temporal information, which makes the synchronous segregation of speech possible. Harding and Meyer [22] use a multi-resolution model with parallel high-resolution and low-resolution representations of the auditory signal, and propose an implementation for speech recognition. Nix [45] performs a binaural statistical estimation of two speech sources by an approach that integrates temporal and frequency-specific features of speech; it tracks magnitude spectra and direction on a frame-by-frame basis.
A major drawback of the above-mentioned systems from the prior art is that they require training and are supervised.
Autonomous bio-inspired and spiking neural networks are an alternative to supervised systems. Examples of their implementations from the prior art in speech processing and source separation will now be briefly described.
A well-known and remarkable characteristic of human perception is that the recognition of stimuli is quasi-instantaneous, even though the speed of information propagation in living neurons is slow [26], [60], [61]. This implies that neural responses are conditioned by previous events and states of the neural sub-network [71]. Understanding the underlying mechanisms of perception, in combination with that of the peripheral auditory system [11], [17], [23], [73], allows designing an analysis module.
In the context of the mathematical formalism of spiking neurons, it has been shown that networks of spiking neurons are computationally more powerful than models based on McCulloch-Pitts neurons [39]. Information about the result of the computation is already present in the current neural network state long before the complete spatiotemporal input pattern has been received by the neural network [71]. This suggests that neural networks use the temporal order of the first spikes, yielding ultra-rapid computation [62]. Moreover, neural networks with dynamic synapses (including facilitation and depression) are equivalent to a given quadratic filter that can be approximated by a small neural system [40], [44]. It has been shown that any filter that can be characterized by a Volterra series can be approximated with a single layer of neurons. Also, spike coding in neurons is close to optimal, and plasticity under the Hebbian learning rule increases mutual information close to optimally [8], [12], [54], [94].
For unsupervised systems, novelty detection facilitates autonomy. For example, it can allow robots to detect whether stimuli are new or have already been seen. When associated with conditioning, novelty detection can create autonomy of the system [15], [24].
Sequence classification is particularly interesting for speech. Recently Panchev and Wermter [46] have shown that synaptic plasticity can be used to perform recognition of sequences. Perrinet [78] and Thorpe [61] discuss the importance of sparse coding and rank order coding for classification of sequences.
Neuron assemblies (groups) of spiking neurons can be used to implement the segregation and fusion (integration) of objects in an auditory image representation. Usually, in signal processing, correlations (or distances) between signals are implemented with delay lines, products, and summations. Comparison between signals can similarly be made with spiking neurons, but without the implementation of delay lines. This is achieved by presenting images to spiking neurons with dynamic synapses. A spontaneous organization then appears in the network, with sets of neurons firing in synchrony. Neurons with the same firing phase belong to the same auditory objects. Milner [43] and Malsburg [67], [68], [69] propose temporal correlation to perform binding. Milner and Malsburg have observed that synchrony is a crucial feature for binding neurons associated with similar characteristics. Objects belonging to the same entity are bound together in time. In other words, synchronization between different neurons and desynchronization among different regions perform the binding. To a certain extent, this property has been exploited to perform unsupervised clustering for recognition on images [5], for vowel processing with spike synchrony between cochlear channels [59], to propose pattern recognition with spiking neurons [25], and to perform cell assemblies of spiking neurons using Hebbian learning with depression [36]. Furthermore, Wang and Terman [70] have proposed an efficient and robust technique for image segmentation and studied its potential in CASA [69].
Pattern recognition
Pattern recognition robust to noise, symmetry, homothety (size change with angle preservation), etc. has long been a challenging problem in artificial intelligence. Many solutions or partial solutions to this problem have been proposed using expert systems or neural networks. In general, three different approaches are used to perform invariant pattern recognition:
- Normalization: the analyzed object is normalized to a standard position and size by an internal transformation. The advantages of this approach are that i) the coordinate information (the "where" information) is retrievable at any stage of the processing and ii) there is a minimum loss of information. The disadvantage is that the network should find the object in the scene and then normalize it, a task that is not as obvious as it may appear [35], [51].
- Invariant features: some features that are invariant to the location and the size of an object are extracted. The disadvantages of this approach are that the position of the object may be difficult to extract after recognition and that information is lost during the process. The advantage is that the technique does not require knowing where the object is and, unlike normalization, in which other techniques should be used afterwards to recognize patterns, the invariant features approach already does some pattern recognition by finding important features [18].
- Invariance learning from temporal input sequences: the assumption is that primary sensory signals, which in general code for local properties, vary quickly while the perceived environment changes slowly. If one succeeds in extracting slow features from the quickly varying sensory signal, one is likely to obtain an invariant representation of the environment [66], [72].
Based on the normalization approach, "Dynamic Link Matching" (DLM) was first proposed by Konen et al. [35]. This approach consists of two layers of neurons connected to each other through synaptic connections constrained to some normalization. The saved pattern is applied to one of the layers and the pattern to be recognized to the other. The dynamics of the neurons are chosen in such a way that "blobs" are formed randomly in the layers. If the features in these two blobs are similar enough, some weight strengthening and activity similarity will be observed between the two layers, which can be detected by correlation computation [3], [35]. These blobs may or may not correspond to a segmented region of the visual scene, since their size is fixed for the whole simulation period and is chosen by some parameters in the dynamics of the network [35]. The appearance of blobs in the network has been linked, by the developers of the architecture, to the attention process present in the brain.
The dynamics of the neurons used in the original DLM network are not the well-known spiking neuron dynamics. In fact, the behavior of neurons from the DLM is based on rate coding (average neuron activity over time) and can be shown to be equivalent to an enhanced dynamic Kohonen Map in its Fast Dynamic Link Matching (FDLM) [35].
In summary, the above systems from the prior art are supervised / non-autonomous or include two operating modes: learning and recognition.
Other systems from the prior art, including United States Patents No. 6,242,988 B1, issued on June 5, 2001, entitled "Spiking Neural Circuit" and naming Sarpeshkar, and No. 4,518,866, issued to Clymer on May 21, 1985 and entitled "Method of and Circuit for Simulating Neurons", make use of bio-inspired neural networks (or spiking neurons) including electronic circuitry to implement neurons, but do not provide any solution to spatiotemporal pattern recognition.
The following United States Patent documents describe solutions to spatiotemporal pattern recognition that do not use bio-inspired neural networks (spiking neurons); they either use conventional (non-spiking) neural networks or expert systems:
[Table of United States Patent documents not reproduced in the source text.]
OBJECTS OF THE INVENTION
An object of the present invention is therefore to provide an improved method for monophonic sound separation.
Another object of the invention is to provide an improved method for image processing and/or recognition.
Finally, another object of the invention is to provide an improved method for pattern recognition.
SUMMARY OF THE INVENTION
According to a first aspect of the present invention, there is provided an Oscillatory Dynamic Link Matching (ODLM) algorithm, which uses spiking neurons and is based on phase coding. A two-layer neural network is also provided, which is capable of performing motion analysis without requiring either the computation of optical flow or additional signal processing between its layers. The proposed neural network can solve the correspondence problem and, at the same time, perform the segmentation of the scene, which is in accordance with the Gestalt theory of perception [21]. The proposed neural-network-based system is very useful for pattern recognition in multiple-object scenes. The proposed network performs normalization, segmentation, and pattern recognition at the same time. It is also self-organized.
More specifically, in accordance with the first aspect of the present invention, there is provided a neural network system comprising: first and second layers of spiking neurons; each neuron from the first layer being configured for first internal connections to other neurons from the first layer or for external connections to neurons from the second layer to receive first extra-layer stimuli therefrom, and for receiving first external stimuli; each neuron from the second layer being configured for second internal connections to other neurons from the second layer or for the external connections to neurons from the first layer to receive second extra-layer stimuli therefrom, and for receiving second external stimuli; and at least one network activity controller connected to at least some of the neurons from each of the first and second layers for regulating the activity of the first and second layers of spiking neurons; whereby, in operation, upon receiving the first and second external stimuli, the first and second internal connections are promoted, and synchronous spiking of neurons from the first and second layers is promoted by the external connections when some of the first external stimuli are similar to some of the second external stimuli.
According to a second aspect of the present invention, auditory-based features are integrated with an unconventional pattern recognition system, based on a network of spiking neurons with dynamical and multiplicative synapses. The analysis is dynamical and extracts multiple features (and maps), while the neural network does not require any training and is autonomous.
More specifically, in accordance with the second aspect of the present invention, there is provided a system for monophonic source separation comprising: a vocoder for receiving a sound mixture including at least one monophonic sound source; an auditory image generator coupled to the vocoder for receiving the sound mixture therefrom and for generating an auditory image representation of the sound mixture; a neural network as recited in claim 1, coupled to the auditory image generator for receiving the auditory image representation and for generating a mask in response to the auditory image representation; and a multiplier coupled to both the vocoder and the neural network for receiving the mask from the neural network and for multiplying the mask with the sound mixture from the vocoder, resulting in the identification of the at least one monophonic source by muting sounds from the sound mixture not belonging to the at least one monophonic source.
In accordance with a third aspect of the present invention, there is provided a method for establishing correspondence between first and second images, each of the first and second images including pixels, the method comprising: providing a neural network including first and second layers of neurons; applying pixels from the first image to respective neurons of the first layer of neurons and pixels from the second image to respective neurons of the second layer of neurons; interconnecting each neuron from the first layer to each neuron of the second layer; performing a dynamic matching between the first and second layers, yielding a temporal correlation between the first and second layers; and using the temporal correlation between the first and second layers for establishing correspondence between the first and second images.
Compared with other non-bio-inspired systems from the prior art, a system for pattern recognition according to the present invention:
does not need explicit rules to create a separation mask (for example, the mapping between the rules developed by Bregman [6] for simple sounds and the real world is difficult; therefore, as long as the aforementioned rules are not derived and well documented [22], expert systems are difficult to use); does not require a time-consuming training phase prior to the separation phase, contrary to approaches based on statistics, like HMMs [58], Factorial HMMs [53], or MAP [53], which usually do; and is autonomous, as it does not use hierarchical classification.
Finally, in accordance with a fourth aspect of the present invention, there is provided a method for establishing correspondence between first and second sets of data, the method comprising: providing a neural network including first and second layers of neurons; providing first and second image representations, including pixels, of respectively the first and second sets of data; applying the first and second image representations respectively to the first and second layers; interconnecting each neuron from the first layer to each neuron of the second layer; performing a dynamic matching between the first and second layers, yielding a temporal correlation between the first and second layers; and using the temporal correlation between the first and second layers for establishing correspondence between the first and second sets of data.
The building blocks of the proposed architecture are spiking neurons. These neurons differ from the conventional neurons used in engineering problems in the way they convey information and perform computations. Each spiking neuron fires "spikes". The information is thus transmitted through either the spike rate (or frequency of discharge), or the spike timing of each neuron and the relative spike timing between different neurons. The present invention concerns more specifically temporal correlation (phase synchrony between neurons). Synchrony among a cluster of neurons indicates similarity of the external inputs. On the other hand, desynchrony between clusters of neurons means that the underlying data belong to different sources (either audio or visual).
The information of the proposed neural network architecture is coded in the synchronization of the neurons or in the relative timing between spikes from different neurons. The update of the synaptic weights is automatic, so that the synapses are dynamic.
Characteristics of the method and system according to the present invention include:
- no learning or recognition phase;
- information coded in the synchronization or in the relative timing between spikes from different neurons;
- automatic synaptic weight update;
- dynamic synapses;
- sound source separation in a mixture of audio sources;
- invariant pattern processing or recognition;
- no need to develop a specific neural network for each new application, which confirms the adaptability of the system to various applications;
- no distance computation (as in classical methods), because the information needed to perform classification does not include distances;
- generation of either auditory maps or adequate visual patterns, depending on the nature of the target application of the system;
- automatic audio channel selection, in the context of audio processing;
- internal connections between neurons acting as competitive or cooperative relations: competitive influence occurs when the neurons are not synchronous, while cooperative influence occurs when adjacent neurons are synchronous;
- a second layer of neurons performing a one-step computation of correlation to check synchronization, hierarchisation, and classification;
- self-evolution of the system;
- synaptic connection management; and
- no supervision of the system.
Other objects, advantages and features of the present invention will become more apparent upon reading the following non restrictive description of preferred embodiments thereof, given by way of example only with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
In the appended drawings:
Figures 1A-1C, which are labeled "Prior Art", are graphs illustrating respectively the behavior of a single third-generation neuron and of a couple of neurons under different stimuli;
Figure 2 is a block diagram of a system for monophonic source separation according to an illustrative embodiment of the present invention, including a spiking neural network according to a first illustrative embodiment of the present invention;
Figure 3 is a schematic view of the neural network from Figure 2;
Figure 4 is a flow chart illustrating a method for monophonic source separation according to an illustrative embodiment of the present invention;
Figure 5 is a spectrogram illustrating the mixture of the utterance "Why were you all weary?" with a trill telephone noise;
Figure 6A is a spectrogram illustrating a synthesized version of the utterance from Figure 5, "Why were you all weary?", after the separation using the method from Figure 4;
Figure 6B, which is labeled "Prior Art", is an image illustrating a synthesized version of the utterance from Figure 5 as obtained using a system from the prior art;
Figure 7 is a spectrogram illustrating a synthesized version of the telephone trill after the separation using the method from Figure 4;
Figure 8 is a spectrogram illustrating the utterance "I willingly marry Marilyn" with a 1 kHz pure tone;
Figures 9A and 9B are spectrograms illustrating the separation results using respectively the method from Figure 4 and the approach proposed by Wang and Brown; Figure 9B being labeled "Prior Art";
Figure 10 is a spectrogram illustrating the mixture of the utterance "I willingly marry Marilyn" with a siren;
Figure 11 is a spectrogram illustrating the separated siren obtained using the method from Figure 4;
Figure 12 is a spectrogram of the utterance from Figure 10 separated;
Figure 13 is a bloc diagram illustrating a pattern recognition system according to an illustrative embodiment of the present invention;
Figure 14 is a schematic view illustrating an example of image to be processed by the system from Figure 13;
Figure 15 is a schematic view illustrating a neural network according to a second illustrative embodiment of the present invention;
Figure 16 is a flowchart of a method for establishing correspondence between first and second images according to an illustrative embodiment of the present invention;
Figure 17 is a schematic view illustrating an affine transform T for a four-corner object;
Figures 18A and 18B are images illustrating the activity of respectively the first and second layers of the neural network using the system from Figure 13 when two bars are presented to the neural network;
Figures 19A and 19B are graphs showing respectively the activity of one of the neurons associated with the vertical bar from Figure 18B in the first layer of the neural network from Figure 15 after the segmentation steps from Figure 16, and the activity associated with the background in the same layer;
Figures 2OA and 2OB are graphs showing respectively the activity of one of the neurons associated with the horizontal bar from Figure 18A in the first layer of the neural network from Figure 15 after the dynamic matching step from Figure 16, and the activity of one of the neurons associated with the vertical bar from Figure 18B in the second layer of the neural network from Figure 15 after the dynamic matching step from Figure 16;
Figure 21 is a graph illustrating the evolution of the thresholded activity of the network from Figure 15 through time in the segmentation phase from Figure 16 considering the images from Figures 18A-18B; each vertical rod representing a synchronized ensemble of neurons and the vertical axis representing the number of neurons in that synchronized region;
Figure 22 is a graph illustrating the evolution of the thresholded activity through time of the network from Figure 15, in the dynamic matching phase from Figure 16, considering the images from Figures 18A-18B;
Figure 23 is a graph illustrating the synchronization index of a one-object scene when the segmentation steps from Figure 16 are bypassed, the synchronization taking 85 oscillations;
Figure 24 is a graph illustrating the synchronization index of a one-object scene when the segmentation steps from Figure 16 precede the matching phase, the synchronization taking 155 oscillations; and
Figure 25 is an image illustrating the synchronization phase from the method from Figure 16, binary masks being generated by assigning binary values to different oscillation phases.
DETAILED DESCRIPTION
Monophonic source separation
A system 10 for monophonic source separation according to an illustrative embodiment of the present invention will now be described with reference to Figure 2. The system 10 includes a spiking neural network 12 according to a first illustrative embodiment of the present invention.
As will become more apparent upon reading the following description, the system 10 allows separating a plurality of different monophonic sources blended in a sound mixture 14 provided as an input to the system 10.
Indeed, every day our ears receive sound mixtures composed of noises and other sounds. Nevertheless, we are able to concentrate our attention on one particular sound source at a time. For example, two people can understand each other even if they are in a noisy room. Human audition processes information in a nearly optimal manner. The system 10 is in the form of a bottom-up CASA system intended to separate different sound sources. The system 10 allows separating two, three, or more sound sources.
The left branch in Figure 2 provides analysis/synthesis of the sound source in many sub-bands or channels. This separation is achieved by a double vocoder, in the form of FIR Gammatone filter banks 24, following the psychoacoustic cochlear frequency distribution.
In addition to the vocoder 24, which filters the sound mixture 14 into frames of, for example, 256 signals, each belonging to one of the cochlear channels, the system 10 further comprises: an auditory image generator 15, including a CAM (Cochleotopic/AMtopic Map) generator 16 and a CSM (Cochleotopic/Spectrotopic Map) generator 18; the two-layered spiking neural network 12, for receiving and processing a map outputted by the auditory image generator 15 and for providing a binary mask 22 based on the neural synchrony 20 at the output of the neural network; means 26 for multiplying the binary mask 22 with the output of the FIR Gammatone synthesis filter bank 24; and an integrator 28 for summing up the channels.
Each of the components of the system 10 will now be described in more detail.
Analysis/Synthesis Filter Bank
An FIR implementation of the well-known Gammatone filter bank is used as the analysis/synthesis filter bank. The use of the Gammatone filter bank allows obtaining the properties of audition as observed in the field of psychoacoustics. The resulting number of channels is 256, with center frequencies from 100 Hz to 3600 Hz uniformly spaced on an ERB scale (Equivalent Rectangular Bandwidth scale, a psychoacoustic critical-band scale), with a sampling rate of 8 kHz.
The actual time-varying filtering is done by the mask 22. Once this mask 22 is obtained by grouping the synchronous oscillators of the neural net, the output of the synthesis filter bank 24 is multiplied with it. Thus, auditory channels belonging to interfering sound sources are muted, and channels belonging to the sound source of interest remain unaffected. This is, in some way, equivalent to labeling the cochlear channels for each time frame: a value of 1 is associated with the targeted signal and a value of 0 with the interfering signal, yielding a binary mask. Of course, a non-binary continuous mask can also be used.
Before the signals of the masked auditory channels are added to form the synthesized signal, they are passed through the synthesis filters, whose impulse responses are time-reversed versions of the impulse responses of the corresponding analysis filters.
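A minimal sketch of this masked resynthesis, assuming the analysis impulse responses are available as arrays (the function and variable names are illustrative, not from the present document):

```python
import numpy as np
from scipy.signal import lfilter

def resynthesize(channels, analysis_irs, mask):
    """channels: (n_channels, n_samples) analysis filter bank outputs;
    analysis_irs: FIR impulse responses of the analysis filters;
    mask: (n_channels,) binary labels (1 = target source, 0 = interference)."""
    out = np.zeros(channels.shape[1])
    for sig, h, keep in zip(channels, analysis_irs, mask):
        if keep:  # keep only channels attributed to the target; others are muted
            # synthesis filter = time-reversed analysis impulse response
            out += lfilter(np.asarray(h)[::-1], [1.0], sig)
    return out
```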
This non-decimated FIR analysis/synthesis filter bank was proposed in [31] and was also used, with only 20 channels, in the perceptual speech coder of [58].
Of course, the vocoder may take other forms including Linear Predictive Coding Vocoder, Fourier transform and any other transformation allowing analysis/synthesis.
Also, the number of channels, the sampling rate and the center frequency values may differ without departing from the spirit and nature of the present invention.
Analysis and auditory images generation
The system 10 comprises an auditory image generator 15 to simultaneously generate two different image representations of the signals provided by the filter bank 24:
• a first representation in the form of the well-known amplitude modulation map, which will be referred to herein as the Cochleotopic/AMtopic (CAM) map, closely related to the modulation spectrograms defined in [12] and [42]; and
• a second representation in the form of the well-known Cochleotopic/Spectrotopic (CSM) map, which encodes the averaged spectral energies of the cochlear filter bank output.
The CAM map is an amplitude modulation representation of cochlear envelopes; while CSM is a cochlear energy representation (energy frequency distribution following the Gammatone filter bank structure). CAM and CSM generators can be added to follow each chosen application.
The generator 15 is configured so that, in operation, depending on the nature of the intruding sound (speech, music, noise, etc.) one of the maps is selected.
More specifically, the generator 15 is programmed with the following CAM/CSM generation algorithm:
1. Down-sampling to 8000 Hz;
2. Filtering the sound source using a 256-dimensional Bark-scaled cochlear filter bank ranging from 200 Hz to 3.6 kHz;
3. For the CAM: extracting the envelope (AM demodulation) for channels 30-256; for the low-frequency channels (1-29), using the raw outputs [56]. For the CSM: nothing is done in this step;
4. Computing the STFT (Short-Time Fourier Transform) using a Hamming window, which is well known in the art. Alternatively, non-overlapping adjacent windows with 4 ms or 32 ms lengths, for example, can also be used;
5. In order to increase the spectro-temporal resolution of the STFT, finding the reassigned spectrum of the STFT (including applying an affine transform to the points in order to relocate the spectrum); and
6. Computing the logarithm of the magnitude of the STFT. The logarithm enhances the presence of the stronger source in a given 2D frequency bin of the CAM/CSM, since log(e1 + e2) ≈ max(log e1, log e2) unless e1 and e2 are both large and almost equal [57].
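As an illustration, the steps above can be sketched in Python; this is a minimal sketch that assumes SciPy's gammatone filter design, uses linearly spaced center frequencies in place of the Bark/ERB spacing of the text, and omits the spectrum reassignment step:

```python
import numpy as np
from scipy.signal import gammatone, lfilter, hilbert, stft

def cam_map(x, fs=8000, n_channels=256):
    """Sketch of CAM generation; x is assumed already down-sampled to
    8 kHz (step 1), and the reassignment step (step 5) is omitted."""
    cfs = np.linspace(200.0, 3600.0, n_channels)  # placeholder for Bark/ERB spacing
    maps = []
    for ch, cf in enumerate(cfs):
        b, a = gammatone(cf, 'fir', fs=fs)        # step 2: cochlear-like filtering
        y = lfilter(b, a, x)
        if ch >= 29:                              # step 3: envelope for channels 30-256
            y = np.abs(hilbert(y))                # (raw output kept for channels 1-29)
        _, _, Z = stft(y, fs=fs, window='hamming', nperseg=256)  # step 4
        maps.append(np.log(np.abs(Z) + 1e-12))    # step 6: log magnitude
    return np.stack(maps)  # (cochlear channel, frequency bin, frame)
```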
It has recently been observed that the efferent loop between the medial olivocochlear system (MOC) and the outer hair cells modifies the cochlear response in such a way that speech is enhanced relative to the background noise [34]. With the above-described generation algorithm, it is assumed that envelope detection and the selection between the CAM and the CSM, in the auditory pathway, could be associated with the change of stiffness of the hair cells combined with cochlear nucleus processing [20], [37].
Of course, the selection between the two auditory images can be done manually.
Even though the generator 15 as illustrated allows generating only CAM or CSM maps, other maps can alternatively or additionally be generated, such as a map facilitating the separation of speakers whose glottis have similar frequencies.
The Neural Network
A neural network 12 according to a first illustrative embodiment of the present invention will now be described in more detail with reference to Figure 3.
The neural network 12 comprises first and second layers 30 and 32 of oscillatory spiking neurons 36 and 38, respectively, and first and second global controllers 34 (only one shown). The second layer 32 is one-dimensional. The number of neurons 36-38 shown in Figure 3 does not correspond to the real number of neurons, which is of course greater. The dynamics of the oscillatory neurons 36-38 are governed by a modified version of the Van der Pol relaxation oscillator, called the Wang-Terman oscillator [69]. There is an active phase, when the neuron spikes, and a relaxation phase, when the neuron is silent. More details on oscillatory neurons are provided in [70].
The neural network 12 is configured so that each neuron 36 from the first layer 30 allows for internal connections 37 to other neurons 36 from the first layer 30 or for external connections 39 to neurons 38 from the second layer 32, to receive extra-layer stimuli therefrom, and for receiving external stimuli from external signals. Each neuron 38 from the second layer 32 similarly allows for internal connections 41 to other neurons 38 from the second layer 32 or for the external connections 39 to neurons 36 from the first layer 30, to receive extra-layer stimuli therefrom, and for receiving second external stimuli from the second external signals.
The neurons 36-38 from the first and second layers 30-32 are connected to a global controller 34, which allows for the synchronization and desynchronization of different regions of the network 12. As is believed to be well known in the art, the global controller 34 can be substituted by any network activity regulator allowing regulation of the activity of the network. The global controller 34 acts as a local inhibitor.
First layer: Auditory image segmentation
The first layer 30 performs a segmentation of the auditory map. The dynamics of the neurons follow the state-space equations below, where x_{i,j} is the membrane potential (output) of the neuron and y_{i,j} is the state for channel activation or inactivation:

dx_{i,j}/dt = 3x_{i,j} − x_{i,j}³ + 2 − y_{i,j} + ρ + p_{i,j} + S_{i,j}   (1)

dy_{i,j}/dt = ε[γ(1 + tanh(x_{i,j}/β)) − y_{i,j}]   (2)

where ρ denotes the amplitude of a Gaussian noise, p_{i,j} the external input to the neuron, and S_{i,j} the coupling from other neurons (connections through synaptic weights); ε, γ, and β are constants. Initial values are generated by a uniform distribution over the interval [-2,2] for x_{i,j} and over [0,8] for y_{i,j} (these values correspond to the whole dynamic range of the Equations).
The Euler integration method is used to solve Equations 1 and 2. The first layer is a partially connected network of relaxation oscillators [69]. Each neuron is connected to its four neighbors. The CAM 16 (or the CSM 18) is applied to the input of the neurons 36. It has been found that the geometric interpretation of pitch (ray distance criterion) is less clear for the first 29 channels. For this reason, long-range connections 40 (only one shown) have also been established from the clear (high-frequency) zones to the confusion (low-frequency) zones. These connections exist only across the cochlear channel number axis of the CAM. This architecture can help the network 12 better extract harmonic patterns.
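A minimal sketch of this Euler integration follows; the step size and constant values are illustrative only, as the text merely states that ε, γ and β are constants and that ρ is the amplitude of a Gaussian noise:

```python
import numpy as np

def euler_step(x, y, p, S, dt=0.01, rho=0.02, eps=0.02, gamma=4.0, beta=0.1):
    """One Euler step of Equations 1-2 for a whole layer of oscillators."""
    noise = rho * np.random.randn(*np.shape(x))       # Gaussian noise of amplitude rho
    dx = 3 * x - x ** 3 + 2 - y + noise + p + S       # Equation 1
    dy = eps * (gamma * (1 + np.tanh(x / beta)) - y)  # Equation 2
    return x + dt * dx, y + dt * dy

# Initial values are drawn as stated in the text:
x = np.random.uniform(-2, 2, size=(256, 256))
y = np.random.uniform(0, 8, size=(256, 256))
```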
The weight between neuron (i,j) and neuron (k,m) of the first layer is computed via the following formula:
w_{i,j,k,m}(t) = 0.25 / (Card{N(i,j)} · e^(λ|p(i,j;t) − p(k,m;t)|))   (3)

where p(i,j) and p(k,m) are respectively the external inputs to neuron(i,j) and neuron(k,m) ∈ N(i,j). Card{N(i,j)} is a normalization factor equal to the cardinal number (number of elements) of the set N(i,j) containing the neighbors connected to neuron(i,j); it can be equal to 4, 3 or 2 depending on the location of the neuron on the map (center, corner, etc.). The external input values are normalized. The value of λ depends on the dynamic range of the inputs and is set to λ = 1 herein. The same weight adaptation is used for the long-range clear-to-confusion zone connections (Equation 7) in the CAM processing case. The coupling S_{i,j} defined in Equation 1 is:

S_{i,j}(t) = Σ_{k,m ∈ N(i,j)} w_{i,j,k,m}(t) H(x(k,m;t)) − ηG(t) + κL_{i,j}(t)   (4)

where H(·) is the well-known Heaviside function. The dynamics of G(t), the global controller, are as follows:

G(t) = αH(z − θ)   (5)

dz/dt = σ − ξz   (6)

where σ is equal to 1 if the global activity of the network is greater than a predefined threshold ζ and is zero otherwise; α and ξ are constants.
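For illustration, Equation 3 and the global controller of Equations 5 and 6 can be sketched as follows; the step size and the constants θ, α, ξ are placeholders, not values from the text:

```python
import numpy as np

def first_layer_weight(p_ij, p_km, card_n, lam=1.0):
    """Equation 3: weight between neuron(i,j) and a neighbour (k,m),
    normalized by the cardinality of the neighbourhood N(i,j)."""
    return 0.25 / (card_n * np.exp(lam * abs(p_ij - p_km)))

def global_controller_step(z, global_activity, zeta, dt=0.01, theta=0.1,
                           alpha=1.0, xi=1.0):
    """Equations 5-6: dz/dt = sigma - xi*z with sigma = 1 when the global
    activity exceeds the threshold zeta, and G(t) = alpha*H(z - theta)."""
    sigma = 1.0 if global_activity > zeta else 0.0
    z = z + dt * (sigma - xi * z)
    G = alpha if z > theta else 0.0
    return z, G
```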
L_{i,j}(t) is the long-range coupling received by the confusion-zone neurons (j < 30) from the clear (high-frequency) zones through the long-range connections 40:

L_{i,j}(t) = Σ_{k ≥ 30} w_{i,j,i,k}(t) H(x(i,k;t))  for j < 30;  L_{i,j}(t) = 0 otherwise   (7)

κ is a binary variable (Equation 8), equal to 1 when these long-range connections are in use (the CAM processing case) and 0 otherwise.
The first layer 30 is designed to handle successive presentations of auditory maps, so as to process continuous, sliding and overlapping windows of the signal.
Second layer: temporal correlation and multiplicative synapses
The second layer 32 performs a temporal correlation between neurons. Each of its neurons represents a cochlear channel of the analysis/synthesis filter bank. For each presented auditory map, the second layer 32 establishes a binding between neurons whose input is dominated by the same source. The external connections establish multiplicative synapses with the first layer 30. The second layer is an array of 256 neurons (one for each channel) similar to those described by Equations 1 and 2. Each neuron 38 receives the weighted product of the outputs of the first-layer neurons along the frequency axis of the CAM/CSM. The weights between layer one 30 and layer two 32 are defined, for the CAM case (structured patterns), as ω₁₂(i) = α/i, where i can be related to the frequency bins of the STFT and α is a constant.
For the CSM, ω₁₂(i) = α is constant along the frequency bins, as energy bursts are looked for. Therefore, the input stimulus to neuron (j) 38 in the second layer 32 is defined as follows:

θ(j;t) = Π_i ω₁₂(i) Ξ{x(i,j;t)}   (9)
The operator Ξ is defined as:

Ξ{x(i,j;t)} = ⟨x(i,j;t)⟩ if ⟨x(i,j;t)⟩ ≠ 0, and 1 otherwise   (10)
where ⟨·⟩ is the averaging-over-a-time-window operator (the duration of the window is on the order of the discharge period). The multiplication is done only for non-zero inputs (outputs of the first layer 30 in which a spike is present) [19], [47]. It is to be noted that this behavior has been observed in the integration of ITD (Interaural Time Difference) and ILD (Interaural Level Difference) information in the barn owl's auditory system [19], and in the monkey's posterior parietal lobe neurons, which show receptive fields that can be explained by a multiplication of retinal and eye or head position signals [1].
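A minimal sketch of this multiplicative second-layer input (Equations 9 and 10), assuming the time averages have already been computed, could be:

```python
import numpy as np

def second_layer_input(x_avg, alpha=1.0, cam=True):
    """Input to second-layer neuron j: the weighted product, along the
    frequency-bin axis, of the time-averaged first-layer outputs
    <x(i, j; t)>; silent bins contribute a neutral factor of 1.
    x_avg: array of shape (n_bins, n_channels)."""
    n_bins = x_avg.shape[0]
    bins = np.arange(1, n_bins + 1)
    w = alpha / bins if cam else alpha * np.ones(n_bins)  # CAM: alpha/i, CSM: alpha
    # Multiplication is done only for non-zero inputs (spiking channels):
    factors = np.where(x_avg != 0, w[:, None] * x_avg, 1.0)
    return np.prod(factors, axis=0)  # theta(j; t) for every channel j
```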
The synaptic weights inside the second layer 32 are adjusted through a rule analogous to Equation 3:

w_{j,m}(t) = 0.25 / (Card{N(j)} · e^(μ|θ(j;t) − θ(m;t)|))   (11)

where μ is chosen to be equal to 2. The "binding" of these features is done via this second layer 32. In fact, the second layer 32 is an array of fully connected neurons 38, along with a global controller defined as in Equations 5 and 6. The global controller desynchronizes the synchronized neurons 38 of the first and second sources by emitting inhibitory activity whenever there is activity (spikes) in the network [69]. The selection strategy at the output of the second layer 32 is based on temporal correlation: neurons belonging to the same source synchronize (same spiking phase); neurons belonging to another source desynchronize (different spiking phase).
In operation of the neural network 12, upon receiving the auditory map, internal connections 37 are promoted in the first layer 30, yielding a segmentation of the auditory map. Synchronous spiking of the neurons 38 from the second layer 32, corresponding to the same sound source as represented by the auditory map, is promoted by the external connections 39 and, of course, by the internal connections 41.
A method 100 for monophonic source separation according to an illustrative embodiment of the present invention is illustrated in Figure 4.
As will now be more apparent to a person skilled in the art, two types of neural connections occur in the spiking neural network 12: for each neuron layer 30-32, the internal/lateral cell connections provide cooperation and competition between the cells (the neurons). These connections make the cells regroup when their outputs are synchronous, which allows for the fast formation of neuron groups.
Usually, the plasticity of the synaptic connections is a dynamic process which provides on-demand self-modification of the neural network topology. In particular, the present invention uses this auto-organization feature of the neural network through the synchronization/desynchronization of the cells, forming or dismantling groups of neurons.
It is to be noted that the neural network 12 is independent of the signal representation used, although this representation is related to the chosen application. For example, when two speakers are talking simultaneously with similar glottal fundamental frequencies, which makes the sound source separation more difficult, the instantaneous frequency is suitable for detecting each glottis opening. In that case, we know that a listener would have great difficulty separating the two voices. Some examples [14], [15], [16] demonstrate that listeners confuse the speakers when the transmission channel affects the transmitted voice. Even if the input representation presented to the network is different, the approach stands, because the network is not affected by this change.
Masking
Based on the phase synchronization described hereinabove, a binary mask 22 is generated from the output of the neural network 12, associating zeros and ones to the different channels in order to preserve or remove each sound source. The mask 22 allows attributing each of the channels to a respective source.
In order to have the same SPL for all the frames, the energy is normalized. Even though the system 10 has been described with reference to two-source mixtures, the present method and system for monophonic source separation can also be used for more than two sources. In that case, for each time frame n, the labeling of the individual channels is equivalent to the use of multiple masks (one for each source).
Experimental results
Cooke's database [10] is used for evaluation purposes. The following noises have been tested: 1 kHz tone, FM siren, white noise, trill telephone noise, and human speech. The aforementioned noises have been added to the target utterance. Each mixture is applied to the neural system 10 and the sound sources are extracted. The LSD (Log Spectral Distortion) is used as performance criterion [64], [65] as defined below:
LSD = (1/L) Σ_{l=0}^{L−1} [ (1/K) Σ_{k=0}^{K−1} ( 20 log₁₀ ((|I(k,l)| + ε) / (|O(k,l)| + ε)) )² ]^(1/2)   (12)

where I(k,l) and O(k,l) are the FFTs of the input I(t) and the output O(t) respectively, L is the number of frames, K is the number of frequency bins, and ε prevents extreme values (equal to 0.001 in our case).
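For illustration, Equation 12 translates directly into code; the array shapes are assumed to be (K bins, L frames):

```python
import numpy as np

def lsd(I, O, eps=0.001):
    """Log spectral distortion (Equation 12) between the FFT magnitudes
    I(k, l) and O(k, l), each given as a (K, L) array."""
    ratio = 20.0 * np.log10((np.abs(I) + eps) / (np.abs(O) + eps))
    # inner mean over the K frequency bins, square root, then mean over frames
    return float(np.mean(np.sqrt(np.mean(ratio ** 2, axis=0))))
```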
Separation performance
Table 1 gives the LSD performances. The method 100 outperforms the other two systems for the tone-plus-utterance case, performs better than the system proposed by Wang and Brown [69] in all cases, gives performance similar to that of the system from Hu and Wang [27] for the siren, but performs worse than the same system from Hu and Wang [27] for the white noise and the telephone ring. For the double-vowel case, test results are not available for the other two approaches [27], [69]. Other SNR-like criteria, such as the SNR, segmental SNR, PEL (Percentage of Energy Loss), and PNR (Percentage of Noise Residue), are used in the literature (see for example [26], [27], [28], [35], [48], [55], and [69]) and can be used as performance scores. In what follows, spectrograms for different sounds and different approaches are given for visual comparison purposes.
Table 1 (the LSD values are not reproduced in the source text)
The log spectral distortion for three different methods: the method according to the present invention as described hereinabove, W-B (the method proposed by Wang and Brown [69]), and H-W (the method proposed by Hu and Wang [27]). The intrusion noises are: 1 kHz pure tone, FM siren, telephone ring, white noise, the male intrusion (/di/) for the French /di//da/ mixture, and the female intrusion (/da/) for the French /di//da/ mixture. Except for the last two tests, the intrusion is mixed with a sentence taken from Cooke's database.
Spectrogram representations of some these examples will now be briefly described.
Figure 5 shows the mixture of the utterance "Why were you all weary?" with the telephone trill noise (from Cooke's database). The trill telephone noise (ring) is wideband, interrupted, and structured. Figures 6A-6B show the separated utterance spectrograms obtained using respectively the method 100 and the one proposed by Wang and Brown [69]. As can be seen, the method 100 yields better results in the higher frequencies. Figure 7 shows the extracted telephone trill.
Another similar experiment was conducted with the utterance "I willingly marry Marilyn" with a 1 kHz pure tone. The tone was a narrowband, continuous, and structured noise. Figure 8 shows the original utterance plus 1 kHz tone. Figures 9A-9B show the separation results using respectively the method 100 and the approach proposed by Wang and Brown [69].
Figure 10 shows the mixture of the utterance "I willingly marry Marilyn" with a siren. The siren is a locally narrowband, continuous, structured signal. Figure 11 shows the separated siren obtained using the method 100. Figure 12 shows the spectrogram of the separated utterance. Although criteria like the PEL and PNR, the SNR, the segmental SNR, the LSD, etc. are used in the literature as performance criteria, they do not always reflect exactly the real perceptive performance of a given sound separation system. For example, the SNR, the PEL and the PNR ignore high-frequency information. The LSD does not take into account some temporal aspects, like phase distortion: the LSD result for two techniques with different phases would be the same, and the LSD will therefore not detect phase distortions in the separation results.
The method and system for monophonic source separation according to the present invention have many applications, including:
- Multimedia file indexing and authentication: multimedia files on the Internet or other media must be indexed before someone can launch queries on their content ("Who is the singer?", "Which song does he sing?", etc.). The method and system of the present invention can separate multiple sources to ease indexing. Also, since the present invention is suitable for comparison purposes, musical file authentication can either be integrated in a file indexing system or simply used as a stand-alone application. For example, in peer-to-peer file sharing systems, files may be renamed by users who want to share them illegally; the method and system of the present invention provide a way to compare file contents without considering their names, therefore providing a solution to that problem;
- Recording quality enhancement, where a method and system according to the invention allow separating noise from recorded sounds;
- Speech recognition: a speech separator (sound source separator) according to the present invention can be used to increase the performance of a speech recognizer;
- Sound quality enhancement in hearing aids, allowing the removal of surrounding noise in the presence of the "cocktail party" effect. Also, combining an auditory mask generated from camera images with the filter bank can be used to create audio stimuli from the visual scene;
- Intelligent helmets to be used, for example, in high-risk industrial areas where noise levels are high; and
- Scene analysis for visually impaired persons, where sounds can be used to help blind people analyze visual scenes. In fact, different colors, textures, and objects are associated with different sound characteristics (frequency, duration, etc.). If there are many objects in the scene, the analysis of the corresponding sound mixture becomes difficult for the subject. A method and system for monophonic source separation can be used to separate different visual scene objects by separating their equivalent auditory objects.
Pattern recognition
A system 50 for pattern recognition according to an illustrative embodiment of the present invention will now be described with reference to Figure 13. The system 50 includes a spiking neural network 52 according to a second illustrative embodiment of the present invention.
It is to be noted that, contrary to the system 10, the system 50 does not include preprocessors upstream of the neural network 52, since the image to compare 54 and the reference image 56 are both provided to the neural network 52 as external inputs. More specifically, each pixel of each image 54-56 is associated with a respective neuron of a respective layer 62-64 of the neural network 52, as will be described hereinbelow in more detail. The synaptic weights are provided by the grey levels or colors associated with the neurons.
The images 54-56 can be for example post-processed images from an external camera (not shown). The source of the images 54-56 may of course vary without departing from the spirit and nature of the present invention. An example of image to be inputted to the layer is illustrated in Figure 14.
The neural network 52 will now be described in more detail with reference to Figure 15. Since the spiking neural network 52 is similar to the neural network 12 and for concision purposes only the differences will be described herein in more detail.
The neural network 52 includes first and second layers 62 and 64 of spiking neurons 66 and 68. A neighborhood of 4 is chosen in each layer 62 or 64 for the connections. Each neuron 66 in the first layer 62 is connected to all neurons 68 in the second layer 64 and vice-versa. Of course, the number of neurons 66-68 shown in Figure 15 does not correspond to the real number of neurons, which is greater. Moreover, the neighborhood can be set to other numbers depending on the applications.
A network activity regulator, in the form of a global controller 70, is connected to all neurons 66-68 in the first and second layers 62 and 64 as in [7]. The global controller 70 has bidirectional connections to all neurons 66-68 in the two layers 62-64.
In a first stage, segmentation is done in the two layers 62 and 64 independently (with no extra-layer connections), while dynamic matching is done with both intra-layer and extra-layer couplings. The intra- layer and extra-layer connections are defined as follows:
ω^int_{i,j,k,m}(t) = ω^int_max / (Card{N^int(i,j)} · e^(λ|p(i,j;t) − p(k,m;t)|))   (13)

ω^ext_{i,j,k,m}(t) = ω^ext_max / (Card{N^ext(i,j)} · e^(λ|p(i,j;t) − p(k,m;t)|))   (14)

where ω^int_{i,j,k,m}(t) are the intra-layer connections 72 and 74, ω^ext_{i,j,k,m}(t) are the extra-layer connections 76 (between the two layers 62-64), and ω^ext_max = 0.2 and ω^int_max = 0.2 are constants equal to the maximum value of the synaptic weights. Card{N^int(i,j)} is a normalization factor equal to the cardinal number (number of elements) of the set N^int(i,j) containing the neighbors connected to neuron(i,j); it can be equal to 4, 3 or 2 depending on the location of the neuron on the map (center, corner, etc.) and on the number of active connections. A connection is active when H(ω_{i,j,k,m} − 0.01) = 1, which holds for both intra-layer and extra-layer connections 72-76. Card{N^ext(i,j)} is the cardinal number for the extra-layer connections 76 and is equal to the number of neurons in the second layer 64 with an active connection to neuron(i,j) 66 in the first layer 62.
It is to be noted that the normalization in Equation 14 allows the correspondence between similar pictures of different sizes. If the aim is to match objects of exactly the same size, the normalization factor is set to a constant for all neurons 66-68. The reason is that, with normalization, even if the size of the object in the second layer 64 were double that of the same object in the first layer 62, the total influence on neuron(i,j) would be the same as if the pattern were of the same size.
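Since Equations 13 and 14 share the same form, a single sketch covers both; only the neighbourhood set, and hence the normalization factor, differs:

```python
import numpy as np

def odlm_weight(p_a, p_b, card_n, w_max=0.2, lam=1.0):
    """Equations 13-14: connection weight between two neurons with external
    inputs p_a and p_b; card_n is Card{N_int} for intra-layer connections
    (4-neighbourhood) or Card{N_ext} for extra-layer connections (number of
    active connections from the other layer)."""
    return w_max / (card_n * np.exp(lam * abs(p_a - p_b)))

# Size invariance through normalization: if the object in the second layer is
# twice as large, roughly twice as many extra-layer connections are active,
# but Card{N_ext} grows accordingly, leaving the total influence unchanged.
```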
The network 52 can have two different behavioral modes: segmentation and matching.
In the segmentation stage, there is no connection between the two layers 62 and 64. The two layers 62-64 act independently (except for the influence of the global controller 70) and segment the two images 54 and 56 applied to the two layers 62 and 64 respectively. The global controller 70 forces the segments on the two layers 62-64 to have different phases. At the end of this stage, the two images are segmented but no two segments have the same phase (see Figures 19A-19B).
At a specific time t, the results from segmentation are used to create binary masks 58 and 60 that select one object in each layer 62 and 64 in multi-object scenes. In fact, a snapshot obtained at a specific time t, like the one shown in Figure 25, is used to create the binary mask m(i,j) for one of the objects as follows:

$$m(i,j) = \begin{cases} 1 & \text{if } x(i,j;t_{sync}) = x_{sync} \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$
x_{sync} can be the synchronized value that corresponds to either the cross or the rectangle in Figure 25 at time t_{sync}.
At a given time t_i, the mask m(i,j) is different from the mask at time t_j and corresponds to a different object.
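A minimal sketch of the mask construction of Equation 15 (Python; the comparison tolerance is an illustrative assumption for floating-point activities):

    import numpy as np

    def binary_mask(snapshot, x_sync, tol=1e-3):
        """Equation 15 sketch: m(i,j) = 1 where the activity snapshot
        taken at t_sync equals the synchronized value x_sync of the
        selected object, and 0 otherwise."""
        return (np.abs(snapshot - x_sync) < tol).astype(np.uint8)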
The coupling strength S_{i,j} for each layer, as defined in Equation 13, is computed by:

$$S_{i,j}(t) = \sum_{k,m \in N^{int}(i,j)} \omega^{int}_{i,j,k,m}(t)\, H(x^{int}(k,m;t)) - \eta G(t) \qquad (16)$$

where H(·) is the Heaviside function, x^{int} defines action potentials from said first and second internal connections, G(t) is the influence of the global controller defined by the following Equations, and η is set to a value smaller than the maximum value of the synaptic weights (0.25 herein):

$$G(t) = \alpha H(z - \theta) \qquad (17)$$

$$\frac{dz}{dt} = \sigma - \xi z \qquad (18)$$

σ is equal to 1 if the global activity of the network is greater than a predefined ζ and is zero otherwise.
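The segmentation-stage coupling and the global controller of Equations 16 to 18 can be sketched as follows (Python; dt and the constant values are illustrative assumptions):

    import numpy as np

    def segmentation_coupling(w_int, x_int, G, eta=0.2):
        """Equation 16 sketch: sum of intra-layer weights gated by the
        Heaviside of the neighbors' action potentials, minus eta*G(t).
        eta is assumed smaller than the maximum synaptic weight."""
        return np.sum(w_int * (x_int > 0)) - eta * G

    def global_controller_step(z, activity, zeta, theta, alpha, xi, dt):
        """Equations 17-18 sketch (one Euler step): sigma = 1 when the
        global activity exceeds zeta; z leaks with rate xi; the output
        is G(t) = alpha * H(z - theta)."""
        sigma = 1.0 if activity > zeta else 0.0
        z = z + dt * (sigma - xi * z)        # Equation 18
        G = alpha if z > theta else 0.0      # Equation 17
        return z, G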
Dynamic Matching
In the matching phase, the inputs to the layers are defined by:
$$p(i,j;t) = m(i,j)\, p(i,j) \qquad (19)$$

that is, the binary masks created at the end of the segmentation stage gate the original images applied to the layers.
Extra-layer connections 76 (Equation 2) are established. If there are similar objects in the two layers 62-64, these extra-layer connections 76 will help them synchronize; in other words, these two segments are bound together through these extra-layer connections 76 [123]. In order to detect synchronization, double-thresholding can be used [2]. This stage may be seen as a folded oscillatory texture segmentation device like the one proposed in [70]. The coupling strength S_{i,j} for each layer in the matching phase is defined as follows:

$$S_{i,j}(t) = \sum_{k,m} \left\{ \omega^{ext}_{i,j,k,m}(t)\, H(x^{ext}(k,m;t)) + \omega^{int}_{i,j,k,m}(t)\, H(x^{int}(k,m;t)) \right\} - \eta G(t) \qquad (20)$$

where x^{ext} defines action potentials from external connections and x^{int} defines action potentials from said first and second internal connections.
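Equation 20 can be sketched in the same style (Python; a sketch only, with the same illustrative η as above):

    import numpy as np

    def matching_coupling(w_ext, x_ext, w_int, x_int, G, eta=0.2):
        """Equation 20 sketch: in the matching phase both extra-layer and
        intra-layer action potentials contribute to the coupling strength,
        minus the global-controller term eta*G(t)."""
        return (np.sum(w_ext * (x_ext > 0))
                + np.sum(w_int * (x_int > 0))
                - eta * G)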
Figure 16 summarizes the general steps of a method 200 for establishing correspondence between first and second images according to an illustrative embodiment of the present invention. It is to be noted that steps 208-210 are not necessarily sequential.
Geometrical interpretation of the ODLM
It is well known that an object can be represented by a set of points corresponding to its corners, and any affine transform is a map T: ℝ² → ℝ² of these points defined by the following matrix operation:

$$p' = A\,p + t \qquad (21)$$

where A is a 2×2 non-singular matrix, p ∈ ℝ² is a point in the plane, p′ is its affine transform, and t is the translation vector. The transform is linear if t = 0. An affine transformation is a combination of several simple mappings such as rotation, scaling, translation, and shearing. The similarity transformation is a special case of affine transformation; it preserves length ratios and angles, while the affine transformation, in general, does not.
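For example, Equation 21 with a rotation-plus-scaling matrix (a similarity transform) and a non-zero translation can be written as follows (Python; the numerical values are illustrative):

    import numpy as np

    theta = np.pi / 6                     # illustrative rotation angle
    s = 1.5                               # illustrative scale factor
    A = s * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])  # non-singular 2x2
    t = np.array([2.0, -1.0])             # illustrative translation vector

    p = np.array([1.0, 0.0])              # a corner point of the object
    p_prime = A @ p + t                   # Equation 21: p' = A p + t
    # With t = 0 the map would be linear; this A preserves angles and
    # length ratios, i.e. it is a similarity transform.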
In the following, it will be shown that the coupling S_{i,j} is independent of the affine transform used. It is well known that any object can be decomposed into its constituent triangles (three corners per triangle). It will now be considered that the set {a,b,c,d} is mapped to the set {T(a),T(b),T(c),T(d)}, and that the objects formed by these two sets of points are applied to the two layers 62 and 64 of the neural network 52. It will also be considered that points inside the triangle {a,b,c} (resp. {T(a),T(b),T(c)}) have values equal to A (corresponding to the gray-level value of the image at those points) and points inside {a,b,d} (resp. {T(a),T(b),T(d)}) have values equal to B. Considering an affine transform as illustrated in Figure 17, the area of a transformed triangle scales with the determinant of A:

$$A_{T(abc)} = |\det(A)|\, A_{abc} \qquad (22)$$
where A_{abc} is the area of the triangle {a,b,c} (expressed in number of neurons). For neuron_{i,j} belonging to {a,b,c} and neuron_{k,m} belonging to {T(a),T(b),T(c)}, the normalization factor of Equation 14 is equivalent to (neglecting the effect of intra-layer connections, since N^{ext} ≫ N^{int}):

$$N^{ext} = A_{T(abc)} + A_{T(abd)} \qquad (23)$$

Hence,

$$\omega^{ext}_{i,j,k,m}(t) = \frac{f(p(i,j;t) - p(k,m;t))}{A_{T(abc)} + A_{T(abd)}} \qquad (24)$$

with $f(x - y) = \omega^{ext}_{max}\, e^{-\lambda|x-y|}$ for all x, y.

There are A_{T(abc)} connections from the region with gray-level value A (triangle {T(a),T(b),T(c)}) and A_{T(abd)} connections from the region with gray-level value B (triangle {T(a),T(b),T(d)}) to neuron_{i,j} belonging to the triangle {a,b,c} with gray-level value A. Therefore, the external coupling for neuron_{i,j}, from all neuron_{k,m}, becomes:

$$S_{i,j}(t) = \frac{A_{T(abc)}\, f(0)\, \psi(t,\varphi_1) + A_{T(abd)}\, f(A - B)\, \psi(t,\varphi_2)}{A_{T(abc)} + A_{T(abd)}} \qquad (25)$$

with $\psi(t,\varphi) = H(x^{ext}_{\varphi}(t))$,

where ψ(t,φ₁) and ψ(t,φ₂) are respectively associated with spikes of phases φ₁ and φ₂ that appear after segmentation. After factorization and using Equation 22 — all transformed areas scale by the same factor |det(A)|, which therefore cancels out — we obtain:

$$S_{i,j}(t) = \frac{A_{abc}\, f(0)\, \psi(t,\varphi_1) + A_{abd}\, f(A - B)\, \psi(t,\varphi_2)}{A_{abc} + A_{abd}} \qquad (26)$$

which no longer depends on the transform T.
Following the above, the extra-layer coupling 76 is independent of the affine transform that maps the model to the scene (first and second layer objects), and the demonstration can be extended to more than 4 points.
It is to be noted that, if there are several objects in the scene and patterns are to be matched, the results from the segmentation phase can be used to break the scene into its constituent parts (each synchronized region corresponds to one of the objects in the scene), and the objects can be applied one by one to the network 52 until all combinations are tested. This is not possible in the averaged Dynamic Link Matching (DLM) case, where no segmentation occurs. With the ODLM according to the present invention, the successive presentations of the objects are fully automatic and there is no manual intervention.
Rate coding versus phase coding
In the following, it will be shown that the original DLM is a rate coding approximation of the ODLM (Oscillatory Dynamic Link Matching) according to the present invention. Aoinishi et al. [3] have shown that a canonical form of rate coding dynamic Equations solves the matching problem in the mathematical sense. The dynamics of a neuron in one of the layers of the original Dynamic Link Matcher proposed in [35] is as follows:

$$\frac{dx}{dt} = -\alpha x + (k * \sigma(x)) + I_x \qquad (27)$$

where k(·) is a neighborhood function, I_x is the summed value of extra-layer couplings, σ is the sigmoidal function, x is the output of the rate coded neuron, and * is the convolution operator. On the other hand, it is known in the art that the Wang-Terman oscillator can be approximated by the integrate-and-fire neuron [50]. For a single neuron 66-68 of the network 52:
$$\frac{dx^{two}}{dt} = -x^{two} + \sum_{k,m} \omega^{ext}_{i,j,k,m}\, H(x^{one}_{k,m}) + \sum_{k,m} \omega^{int}_{i,j,k,m}\, H(x^{two}_{k,m}) + H(p^{input})$$

$$x = 0 \quad \text{when } x > \text{threshold} \qquad (28)$$

where x^{two} stands for neurons 68 in layer two 64 and x^{one} stands for neurons 66 in layer one 62. It is to be noted that, as explained hereinabove, there are synaptic connections (ω^{int}) within the second layer 64 and synaptic connections 76 from the first layer 62 to the second layer 64 (ω^{ext}).
Neglecting the influence of intra-layer connections 72 and 74, Equation 28 becomes:
$$\frac{dx^{two}}{dt} = -x^{two} + \sum_{k,m} \omega^{ext}_{i,j,k,m}\, H(x^{one}_{k,m}) + H(p^{input}) \qquad (29)$$

$$x = 0 \quad \text{when } x > \text{threshold}$$
For an integrate-and-fire neuron the approximation H(x) = x holds, since the output of an integrate-and-fire neuron is either 0 or 1 (it emits spikes, or delta functions); therefore Equation 29 can be further simplified to:

$$\frac{dx^{two}}{dt} = -x^{two} + \sum_{k,m} \omega^{ext}_{i,j,k,m}\, x^{one}_{k,m} + H(p^{input}) \qquad (30)$$

$$x = 0 \quad \text{when } x > \text{threshold}$$
Averaging the two sides of Equation 30 and considering that H(p^{input}) is constant over T yields:

$$\frac{d\langle x^{two}\rangle_T}{dt} = -\langle x^{two}\rangle_T + \sum \omega^{ext}\,\langle x^{one}\rangle_T + H(p^{input}) \qquad (31)$$

where ⟨x⟩_T is the averaged version of x over a time window of length T. For simplicity, the indices are omitted in Equation 31. From [38], it is known that the averaged output ⟨x^{two}⟩ of an integrate-and-fire neuron is related to the averaged-over-time inputs of the neuron (∑ ω^{ext} x^{one}) by a continuous function (sigmoidal, etc.). Naming this function φ (note that β is a proportionality constant):

$$\langle x^{two}\rangle = \beta\,\phi\!\left(\sum \omega^{ext}\,\langle x^{one}\rangle\right) \qquad (32)$$

Considering that in Equation 31 ⟨x^{one}⟩ is needed as a function of ⟨x^{two}⟩, and that Equation 32 is a set of linear Equations in ω^{ext}:

$$\sum \omega^{ext}\,\langle x^{one}\rangle = \phi^{-1}\!\left(\frac{\langle x^{two}\rangle}{\beta}\right) \qquad (33)$$

Replacing the above result in Equation 31 yields:

$$\frac{d\langle x^{two}\rangle}{dt} = -\langle x^{two}\rangle + \sum\sum \omega^{ext}\,\sigma(\langle x^{two}\rangle) + H(p^{input}) \qquad (34)$$

where σ(x) = φ^{-1}(x) and where the indices have been omitted, again for the sake of simplicity. On the other hand:

$$\sum\sum \omega^{ext}\,\sigma(\langle x^{two}\rangle) = k(\cdot) * \sigma(\langle x^{two}\rangle) \qquad (35)$$
where * is a 2-D convolution operator. In the present case k(·) is a 2-D rectangular window (in the original DLM, k(·) was chosen to be the well-known Mexican hat).
The above demonstrates that the DLM is an averaged-over-time approximation of the ODLM according to the present invention.
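The averaging operator ⟨·⟩_T that turns the phase-coded spike trains into the rate-coded DLM variables can be sketched as follows (Python; the window length is an illustrative assumption):

    import numpy as np

    def averaged_over_time(spike_train, T=50):
        """Sketch of <x>_T from Equation 31: a moving average of a 0/1
        spike train over a window of T samples, i.e. the rate-coding
        approximation of the phase-coded signal used by the original DLM."""
        kernel = np.ones(T) / T
        return np.convolve(spike_train, kernel, mode="same")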
As stated earlier, the network 52 can be used to solve the correspondence problem. For example, consider that, in a factory chain, someone wants to check the existence of a component on an electronic circuit board (see for example Figure 14). All this person has to do is to apply an image of the component to the first layer and check for synchronization between the layers. Ideally, any change in the angle or the location of the camera, or even in the zoom factor, should not influence the result. One of the signal processing counterparts of the method and system of the present invention is morphological processing. Other partial solutions, such as the Fourier transform, could be used to perform matching robust to translation.
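The inspection scenario above can be summarized by the following sketch (Python; the `network` object and all of its methods are hypothetical placeholders for an implementation of the two-layer network 52, not an API disclosed herein):

    def component_present(component_image, board_image, network):
        """Hypothetical workflow for the PCB inspection example: apply the
        reference component to the first layer and the camera view of the
        board to the second layer, then test for inter-layer synchrony."""
        network.apply_inputs(layer_one=component_image,
                             layer_two=board_image)
        network.run_segmentation()   # may be skipped in one-object scenes
        network.run_matching()
        return network.layers_synchronized()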
A method and system according to the present invention do not require training or configuration according to the stimulus applied. The network 52 is autonomous and flexible with respect to previously unseen stimuli. This is in contrast with associative memory based architectures, in which a stimulus must be applied and saved into memory before retrieval (as in [66] for example). It does not require any pre-configured architecture adapted to the stimulus, as in the hierarchical coding paradigm [52].

Experimental results
Experiments have been conducted with a rectangular neuron map. There were 5x5 neurons in each layer 62-64. A vertical bar on a background was presented to the first layer 62. The second layer 64 received the same object transformed by an affine transformation (rotation, translation, etc.).
Figures 18A-18B show activity snapshots (instantaneous values of x(i,j)) in the two layers 62-64 after the segmentation step 206. Same-gray-scale neurons have similar phases in Figures 18A-18B. On the other hand, different segments on different layers are desynchronized (Figures 19A-19B and 20A-20B).
In the dynamic matching step 210, similar objects among different layers are synchronized, as illustrated in Figure 22. The thresholded sum (synchronization index) of the activity of all neurons,

$$\sum_{i,j} H(x(i,j;t) - \text{threshold}),$$

is shown in Figure 21 for the segmentation step 206 and in Figure 24 for the dynamic matching step 212. Since there are four different regions in the two layers 62 and 64 with different phases at the end of the segmentation step 206, four different synchronization regions can be seen in Figure 21. In the dynamic matching step 212, the similar objects (and the backgrounds) merge with each other, producing only two distinct regions. In addition, when a zero-mean Gaussian noise with variance σ² = 0.1 is added to both stimuli (SNR = 10 dB), the matching results remain unchanged.
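A sketch of such a synchronization index (Python; the exact thresholding used for the figures is not specified here, so the threshold value is an assumption):

    import numpy as np

    def synchronization_index(x, threshold=0.0):
        """Sketch of the index plotted in Figures 21 and 24: the
        thresholded sum of the instantaneous activities x(i,j;t)
        of all neurons in a layer."""
        return float(np.sum(x[x > threshold]))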
One-object scenes

If only one object is present in each layer 62-64 of the scene, then the segmentation step 206 can be bypassed and the network 52 can function directly in the matching mode. This allows speeding up the pattern recognition process.
Figures 23-24 illustrate the behavior of a 13x5 network when only one object is present in each layer 62-64, showing that the synchronization time for the matching-only network is shorter. It is to be noted that the matching-only approach is inefficient when there are multiple objects in the scene; in the latter case, the segmentation plus matching approach should be used.
It has been shown that the network 52 is capable of establishing correspondence between images and is robust to translation, rotation, noise and homothetic transforms.
Even though the method 200 has been described as a means to segment images, it can also be used to solve the correspondence problem, as a whole system, using a two-layered oscillatory neural network.
Applications of the system 50 include:
- Electronic circuit assembly, where the system 50 can be used to verify whether all the electronic components are present (or in good condition) on a PCB (Printed Circuit Board);
- Facial recognition, where the technique can be applied by comparing a given face to a database of faces, in custom houses for example;
- Fault detection in a production chain, where the invention can be used to find manufacturing errors in an assembly chain; and
- Teledetection, where it can be used to find objects or changes in satellite images; it can also be used to assist in the automatic generation of maps.
According to the present invention, many types of neurons can be implemented, including integrate-and-fire neurons, relaxation oscillators, or chaotic neurons, even though in the examples detailed hereinabove only relaxation oscillators are used.
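As an illustration of the relaxation-oscillator case, one Euler step of the state-space Equations (1) and (2) given in the claims can be sketched as follows (Python; the constants ε, γ, β and dt below are illustrative assumptions, not values fixed by the disclosure):

    import numpy as np

    def relaxation_oscillator_step(x, y, p_input, S, dt=0.01,
                                   rho=0.0, eps=0.02, gamma=4.0, beta=0.1):
        """One Euler step of the relaxation oscillator of Equations (1)-(2):
        dx/dt = 3x - x^3 + 2 - y + rho + H(p_input) + S
        dy/dt = eps * (gamma * (1 + tanh(x / beta)) - y)"""
        H = 1.0 if p_input > 0 else 0.0   # Heaviside of the external input
        dx = 3 * x - x ** 3 + 2 - y + rho + H + S
        dy = eps * (gamma * (1 + np.tanh(x / beta)) - y)
        return x + dt * dx, y + dt * dy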
The neural networks 12 and 52 have been implemented with a SIMULINK™ spiking neural network simulation library, in Java, and in C++ (three different simulators can be used). Of course, a neural network according to the present invention can be implemented on other platforms.
Other applications
The neural network architecture according to the present invention is well-suited to the control of event-driven and adaptive event-driven processes and to the control of robots, for example. For instance, it can be used to control sensorimotor parts of robots by bio-inspired (spiking neural) networks, or for fly-by-wire design of aircraft.
Many control systems including continuous time management exist (cars, production chains, robots, etc.). Because they have trouble adapting to variable situations, they are still only narrowly used for commercial purposes. Spiking neural networks according to the present invention are thus suitable for these event-driven control processes, especially because of their adaptive flexibility.
Moreover, in robotic applications, a plurality of interconnected spiking neural networks can manage and control sensors, peripherals, vision, etc.
Although the present invention has been described hereinabove by way of preferred embodiments thereof, it can be modified without departing from the spirit and nature of the subject invention, as defined in the appended claims.

REFERENCES
[1] Andersen, R., Snyder, L., Bradley, D., Xing, J. Multimodal representation of space in the posterior parietal cortex and its use in planning movements. Ann. Rev. Neurosci., 1997, 20:303.
[2] Ando, H., Morie, T., Nagata, M. and Iwata, A. A nonlinear oscillator network circuit for image segmentation with double-threshold phase detection. In ICANN 99, 1999.
[3] Aoinishi, T., Kurata, K. and Mito, T. A phase locking theory for matching common parts of two images by dynamic link matching. Biological Cybernetics, 78(4):253-264, 1998.
[4] Berthommier, F., Meyer, G. Improvement of amplitude modulation maps for f0-dependent segregation of harmonic sounds. Eurospeech'97, 1997.
[5] Bohte, S. M., Poutre, H. L., Kok, J. N. Unsupervised clustering with spiking neurons by sparse temporal coding and multilayer RBF networks. IEEE Transactions on Neural Networks 13 (2), March 2002, pp. 426-435.
[6] Bregman, A., 1990. Auditory Scene Analysis. MIT Press, 1990.
[7] Cesmeli, E. and Wang, D. Motion segmentation based on motion/brightness integration and oscillatory correlation. IEEE Trans. on Neural Networks, 2000, 11(4):935-947.
[8] Chechik, G., Tishby, N. Temporally dependent plasticity: An information theoretic account. NIPS, 2000.
[9] Cooke, M. Modelling auditory processing and organisation. Ph.D. thesis, University of Sheffield, 1991. [10] Cooke, M., 2004. http://www.dcs.shef.ac.uk/~martin/.
[11] Delgutte, B. Representation of speech-like sounds in the discharge patterns of auditory nerve fibers. JASA 68, 1980, pp. 843-857.
[12] DeWeese, M. Optimization principles for the neural code. Network: Computation in Neural Systems 7 (2), 1996, pp. 325-331.
[13] Ellis, D. Prediction-driven computational auditory scene analysis. Ph.D. thesis, MIT, 1996.
[14] Ezzaidi, H. and Rouat, J. and O'Shaughnessy, D. Combining pitch and MFCC for speaker identification systems. In A Speaker Odyssey, the Speaker Recognition Workshop, an ISCA Tutorial and Research Workshop (ITRW) on Speaker Recognition, June 18-22, 2001. Paper no. 1036.
[15] Ezzaidi, H. and Rouat, J. Pitch and MFCC dependent GMM models for speaker identification systems. In IEEE CCECE, accepted 2004. [16] Ezzaidi, H. and Rouat, J. Speech, music and songs discrimination in the context of handsets variability. In proceedings of ICSLP 2002, 16-20 September 2002.
[17] Frisina, R. D., Smith, R. L., Chamberlain, S. C. Differential encoding of rapid changes in sound amplitude by second-order auditory neurons. Experimental Brain Research 60, 1985, pp. 417-422.
[18] Fukushima, K. A neural network model for selective attention in visual pattern recognition. Biol. Cybernetics, 1986, pp. 5-15.
[19] Gabbiani, F., Krapp, H., Koch, C., Laurent, G. Multiplicative computation in a visual neuron sensitive to looming. Nature 420, 2002, pp. 320-324.
[20] Giguere, C., Woodland, P. C. A computational model of the auditory periphery for speech and hearing research. JASA, 1994.
[21] Gordon, L. E. Theories of Visual Perception. John Wiley and Sons, 1997.
[22] Harding, S., Meyer, G. Multi-resolution auditory scene analysis: Robust speech recognition using pattern-matching from a noisy signal. EUROSPEECH., September 2003, pp. 2109-2112.
[23] Hewitt, M., Meddis, R. A computer model of amplitude-modulation sensitivity of single units in the inferior colliculus. Journal of the Acoustical Society of America 95 (4), April 1994, pp. 2145-2159.
[24] Ho, T. V., Rouat, J. Novelty detection based on relaxation time of a network of integrate-and-fire neurons. Proc. of the IEEE/INNS Int. Joint Conf. on Neural Networks, Vol. 2, May 1998, pp. 1524-1529.
[25] Hopfield, J. Pattern recognition computation using action potential timing for stimulus representation. Nature 376, 1995, pp. 33-36.
[26] Hu, G., Wang, D. Monaural speech segregation based on pitch tracking and amplitude modulation. Tech. rep., Ohio State University, 2002.
[27] Hu, G., Wang, D. Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. on Neural Networks, to appear, 2004.
[28] Hu, G., Wang, D. Separation of stop consonants. ICASSP 2003.
[29] Immerseel, L. V. Een functioneel gehoormodel voor de analyse van spraak bij spraakherkenning ("A functional hearing model for the analysis of speech in speech recognition"). Ph.D. thesis (in Flemish), May 1993.
[30] Irino, T., Patterson, R. Speech segregation using event synchronous auditory vocoder. ICASSP, Vol. V, 2003, pp. 525-528.
[31] Irino, T., Unoki, M. A time-varying, analysis/synthesis auditory filterbank using the gammachirp. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP)98. Vol. 6. Seattle, Washington, May 1998, pp. 3653-3656.
[32] Kaneda, Y., Ohga, J. Adaptive microphone-array system for noise reduction. IEEE Trans. on Acoustics, Speech, and Signal Processing 34 (6), 1986, pp. 1391-1400.
[33] Karlsen, B. L., Brown, G. J., Cooke, M., Crawford, M., Green, P., Renals, S. Analysis of a Multi-Simultaneous-Speaker Corpus. L. Erlbaum, 1998.
[34] Kim, S., Frisina, D. R., Frisina, R. D. Effects of Age on Contralateral Suppression of Distortion Product Otoacoustic Emissions in Human Listeners with Normal Hearing. Audiology Neuro Otology, 2002, 7:348-357.
[35] Konen, W., Maurer, T. and Von der Malsburg, C. A fast dynamic link matching algorithm for invariant pattern recognition. Neural Networks, 1994, pp. 1019-1030.
[36] Kubin, G., Kleijn, W. B. On speech coding in a perceptual domain. ICASSP, Vol. 1, Phoenix, Arizona, March 1999, pp. 205-208.
[37] Levy, N., Horn, D., Meilijson, I., Ruppin, E. Distributed synchrony in a cell assembly of spiking neurons. Neural Networks 14 (6-7), July 2001, pp. 815-824.
[38] Liberman, M., Puria, S., Guinan, J. J. The ipsilaterally evoked olivo-cochlear reflex causes rapid adaptation of the 2f1-f2 distortion product otoacoustic emission. JASA 99, 1996, pp. 3572-3584.
[39] Maass, W. Networks of spiking neurons: The third generation of neural network models. Neural Networks 10 (9), 1997, pp. 1659-1671.
[40] Maass, W., Sontag, E. D. Neural systems as nonlinear filters. Neural Computation 12 (8), August 2000.
[41] Mellinger, K. Event formation and separation in musical sound. Ph.D. thesis, Stanford University, 1991.
[42] Meyer, G., Yang, D., Ainsworth, W. Applying a model of concurrent vowel segregation to real speech. Computational models of auditory function, 2001 , pp. 297-310.
[43] Milner, P. A model for visual shape recognition. Psychological Review 81 , 1974, pp. 521-535.
[44] Natschlager, T., Maass, W., Zador, A. Efficient temporal processing with biologically realistic dynamic synapses. Network: Computation in Neural Systems 12 (1), 2001, pp. 75-87.
[45] Nix, J., Kleinschmidt, M., Hohmann, V. Computational auditory scene analysis by using statistics of high-dimensional speech dynamics and sound source direction. EUROSPEECH, September 2003, pp. 1441-1444.
[46] Panchev, C., Wermter, S. Spiking-time-dependent synaptic plasticity: From single spikes to spike trains. Computational Neuroscience Meeting, Springer-Verlag, July 2003, pp. 494-506.
[47] Pena, J., Konishi, M. Auditory spatial receptive fields created by multiplication. Science 292, 2001, pp. 249-252.
[48] Pichevar, R. and Rouat, J. Binding of audio elements in the sound source segregation problem via a two-layered bio-inspired neural network. In IEEE CCECE 2003, Montreal, Canada.
[49] Pichevar, R. and Rouat, J. Cochleotopic/AMtopic (CAM) and Cochleotopic/Spectrotopic (CSM) map based sound source separation using relaxation oscillatory neurons. IEEE Neural Networks for Signal Processing Workshop, Toulouse, France, 2003.
[50] Pichevar, R. Speech Processing in the Presence of "Cocktail Party" Effect and its Applications in Information Technology. PhD thesis, University of Sherbrooke (to appear), 2004.
[51] Postma, E. O., Van der Herik, H. J. and Hudson, P. T. W. SCAN: A scalable neural model of covert attention. Neural Networks, 1997, 10:993-1015.
[52] Riesenhuber, M. and Poggio, T. Are cortical models really bound by the binding problem? Neuron, 1999, 24:87-93.
[53] Reyes-Gomez, M. J., Raj, B., Ellis, D. Multi-channel source separation by factorial HMMs. ICASSP 2003.
[54] Rieke, F., Warland, D., de Ruyter van Steveninck, R., Bialek, W. SPIKES Exploring the Neural Code. MIT Press, 1997.
[55] Roman, N., Wang, D., Brown, G. Speech segregation based on sound localization. JASA, 2003.
[56] Rouat, J., Liu, Y. C., Morissette, D. A pitch determination and voiced/unvoiced decision algorithm for noisy speech. Speech Comm. 21, 1997, pp. 191-207.
[57] Roweis, S. Factorial models and refiltering for speech separation and denoising. Eurospeech 2003.
[58] Sameti, H., Sheikhzadeh, H., Deng, L., Brennan, R. HMM based strategies for enhancement of speech signals embedded in nonstationary noise. IEEE Trans. on Speech and Audio Processing, 1998, pp. 445-455.
[59] Schwartz, J. L., Escudier, P. Auditory processing in a post-cochlear neural network: Vowel spectrum processing based on spike synchrony. EUROSPEECH, 1989, pp. 247-253.
[60] Tang, P., Rouat, J. Modeling neurons in the anteroventral cochlear nucleus for amplitude modulation (AM) processing: Application to speech sound. Proc. Int. Conf. on Spok. Lang. Proc., p. Th.P.2S2.2, Oct 1996.
[61] Thorpe, S., Delorme, A., Rullen, R. V. Spike-based strategies for rapid processing. Neural Networks 14 (6-7), 2001, pp. 715-725.
[62] Thorpe, S., Fize, D., Marlot, C. Speed of processing in the human visual system. Nature 381 (6582), 1996, pp. 520-522.
[63] Todd, N. An auditory cortical theory of auditory stream segregation. Network: Computation in Neural Systems 7, 1999, pp. 349-356.
[64] Valin, J.-M., Rouat, J., Michaud, F. Microphone array post-filter for separation of simultaneous non-stationary sources. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, accepted 2004.
[65] Valin, J.-M., Michaud, F., Rouat, J., Létourneau, D. Robust sound source localization using a microphone array on a mobile robot. IEEE/RSJ Int. Conf. on Intelligent Robots & Systems, Oct. 2003.
[66] Vinh Ho, T. and Rouat, J. Novelty detection based on relaxation time of a network of integrate-and-fire neurons. In IEEE Int'l Joint Conference on Neural Networks, Alaska, USA, 1998.
[67] Von der Malsburg, C. The correlation theory of brain function. Tech. Rep. Internal Report 81-2, Max-Planck Institute for Biophysical Chemistry, 1981.
[68] Von der Malsburg, C., Schneider, W. A neural cocktail-party processor. Biol. Cybern., 1986, pp. 29-40.
[69] Wang, D., Brown, G. J. Separation of speech from interfering sounds based on oscillatory correlation. IEEE Transactions on Neural Networks 10 (3), May 1999, pp. 684-697.
[70] Wang, D., Terman, D. Image segmentation based on oscillatory correlation. Neural Computation 9, 1997, pp. 805-836.
[71] Widrow, B., al. Adaptive noise cancelling: Principles and applications. Proceedings of the IEEE 63 (12), 1975.
[72] Wiskott, L. and Sejnowski, T. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 2002, pp. 715-770.
[73] Zotkin, D. N., Shamma, S. A., Ru, P., Duraiswami, R., Davis, L. S. Pitch and timbre manipulations using cortical representation of sound. ICASSP, Vol. V, 2003, pp. 517-520.

Claims

WHAT IS CLAIMED IS:
1. A neural network system comprising:
first and second layers of spiking neurons;
each neuron from said first layer being configured for first internal connections to other neurons from said first layer or for external connections to neurons from said second layer to receive first extra-layer stimuli therefrom and for receiving first external stimuli;
each neuron from said second layer being configured for second internal connections to other neurons from said second layer or for said external connections to neurons from said first layer to receive second extra-layer stimuli therefrom and for receiving second external stimuli; and
at least one network activity controller connected to at least some of said neurons from each of said first and second layers for regulating the activity of said first and second layers of spiking neurons;
whereby, in operation, upon receiving said first and second external stimuli, said first and second internal connections are promoted, and synchronous spiking from neurons from said first and second layers is promoted by said external connections when some of said first external stimuli are similar to some of said second external stimuli.
2. A system as recited in claim 1, wherein said spiking neurons are of a type selected from the group consisting of oscillatory spiking neurons, integrate-and-fire neurons, relaxation oscillatory neurons, and chaotic neurons.
3. A system as recited in claim 1 , wherein said second layer allows for temporal correlation between at least some of said neurons from said second layer of neurons.
4. A system as recited in claim 1, wherein said first layer is two-dimensional and said second layer is one-dimensional.
5. A system as recited in claim 4, wherein said at least one network activity controller is in the form of at least one global controller.
6. A system as recited in claim 5, wherein said at least one global controller includes first and second global controllers; each of said first and second global controllers being respectively connected to at least some of said neurons from respectively first and second layers.
7. A system as recited in claim 5, wherein dynamics of said neurons from said first layer follows the following state-space equations:

$$\frac{dx_{i,j}}{dt} = 3x_{i,j} - x_{i,j}^3 + 2 - y_{i,j} + \rho + H(p^{input}_{i,j}) + S_{i,j} \qquad (1)$$

$$\frac{dy_{i,j}}{dt} = \varepsilon\left[\gamma\left(1 + \tanh(x_{i,j}/\beta)\right) - y_{i,j}\right] \qquad (2)$$

wherein x_{i,j} is a membrane potential of said neuron(i,j), y_{i,j} is a state for channel activation or inactivation; ρ is the amplitude of a Gaussian noise, p^{input}_{i,j} defines said first external stimuli, S_{i,j} defines coupling of a neuron in said first layer with other neurons in said first layer, and ε, γ, and β are constants.
8. A system as recited in claim 7, wherein a weight between neuron(i,j) and neuron(k,m) of said first layer is defined by:

$$\omega_{i,j,k,m}(t) = \frac{0.25}{Card\{N(i,j)\}\, e^{\lambda|p(i,j;t) - p(k,m;t)|}} \qquad (3)$$

where p(i,j) and p(k,m) are respectively external inputs to neuron(i,j) and neuron(k,m), and Card{N(i,j)} is a normalization factor.
9. A system as recited in claim 8, wherein said Card{N(i,j)} is equal to the number of elements of a set N(i,j) containing neighbors connected to the neuron(ij).
10. A system as recited in claim 9, wherein Card{N(i,j)} is within the range from 2 to 4.
11. A system as recited in claim 7, wherein said coupling S_{i,j} is defined in Equation (1) as:

$$S_{i,j}(t) = \sum_{k,m \in N(i,j)} \omega_{i,j,k,m}(t)\, H(x(k,m;t)) - \eta G(t) + \kappa L_{i,j}(t) \qquad (4)$$

where H(·) is a Heaviside function, G(t) defines said at least one global controller, κ is a binary variable, and L_{i,j}(t) defines long range coupling.
12. A system as recited in claim 11, wherein:

$$G(t) = \alpha H(z - \theta) \qquad (5)$$

$$\frac{dz}{dt} = \sigma - \xi z \qquad (6)$$

where σ is set to 1 if the global activity of the network is greater than a predefined ζ and is set to zero otherwise; and where α and ξ are constants.
13. A system as recited in claim 1 , further comprising a camera coupled to at least one of said first and second layers of neurons to provide respective said first or second external stimuli thereto.
14. The use of a system as recited in claim 1 for an event-driven process.
15. The use of a system as recited in claim 1 in electronic circuit assembling where said system is used to verify whether electronic components are present in said electronic circuit.
16. The use of a system as recited in claim 1 in facial recognition, wherein a first image of a given face is compared to second images of faces in a database of faces.
17. The use of a system as recited in claim 1 in fault detection in a production chain.
18. The use of a system as recited in claim 1 for teledetection.
19. A system as recited in claim 5 for establishing correspondence between first and second images, wherein said first and second internal connections and said external connections between neuron(i,j) and neuron(k,m) are respectively defined by:

$$\omega^{int}_{i,j,k,m}(t) = \frac{\omega^{int}_{max}}{Card\{N^{int}(i,j) \cup N^{ext}(i,j)\}\, e^{\lambda|p(i,j;t) - p(k,m;t)|}}$$

$$\omega^{ext}_{i,j,k,m}(t) = \frac{\omega^{ext}_{max}}{Card\{N^{int}(i,j) \cup N^{ext}(i,j)\}\, e^{\lambda|p(i,j;t) - p(k,m;t)|}}$$

where Card{N^{int}(i,j) ∪ N^{ext}(i,j)} is a normalization factor equal to the number of neighbors connected to neuron(i,j) from, respectively, a same layer and the other layer; H(·) is the Heaviside function; p(i,j;t) and p(k,m;t) are respectively external stimuli to neuron(i,j) and neuron(k,m) at time t; ω^{int}_{max}, ω^{ext}_{max} and λ are constants; and G(t) is the influence of said global controller defined by the following Equations:

$$G(t) = \alpha H(z - \theta)$$

$$\frac{dz}{dt} = \sigma - \xi z$$

where σ is equal to 1 if the global activity of the network is greater than a predefined ζ and is zero otherwise.
20. A system as recited in claim 19, wherein said first and second connections are defined by a coupling strength S_{i,j} for each of said first and second layers defined as follows:

$$S_{i,j}(t) = \sum_{k,m} \left\{ \omega^{ext}_{i,j,k,m}(t)\, H(x^{ext}(k,m;t)) + \omega^{int}_{i,j,k,m}(t)\, H(x^{int}(k,m;t)) \right\} - \eta G(t)$$

where x^{ext} defines action potentials from external connections and x^{int} defines action potentials from said first and second internal connections.
21. A method for establishing correspondence between first and second images, each of the first and second images including pixels, the method comprising:
providing a neural network including first and second layers of neurons;
applying pixels from the first image to respective neurons of said first layer of neurons and pixels from the second image to respective neurons of said second layer of neurons;
interconnecting each neuron from said first layer to each neuron of said second layer;
performing a dynamic matching between said first and second layers, yielding a temporal correlation between said first and second layers; and
using said temporal correlation between said first and second layers for establishing correspondence between the first and second images.
22. A method as recited in claim 21, wherein said applying pixels from the first image to respective neurons of said first layer of neurons and pixels from the second image to respective neurons of said second layer of neurons yields in operation a segmentation of both said first and second images.
23. A method as recited in claim 22, wherein said segmentation includes: allowing first internal connections among neurons from said first layer yielding a first segmented image and second internal connections among neurons from said second layer yielding a second segmented image.
24. A method as recited in claim 21, wherein, at time t, during said segmentation, a mask is created to select an object in each said first and second layers.
25. A method as recited in claim 23, wherein said mask is a binary mask.
26. A method as recited in claim 21 , wherein in said dynamic matching, connections are established between said first and second layers.
27.A method as recited in claim 21, wherein synaptic weights on said neurons of said first layer of neurons and on said neurons of said second layer of neurons are provided by grey-scales or colours associated to said pixels from said first and second images respectively.
28. A method as recited in claim 21 , wherein at least one of said first and second images originates from a camera.
29. A method for monophonic source separation comprising:
providing a neural network including first and second layers of neurons;
providing an image representation, including pixels, of a sound mixture including at least one monophonic sound source;
applying pixels from said image representation to respective neurons of said first layer and allowing interconnections among neurons from said first layer, causing a segmentation of said image representation;
allowing interconnections between neurons from said first and second layers; and
performing a temporal correlation between neurons of said second layer, yielding at least one group of synchronized neurons;
whereby each of said at least one group of synchronized neurons belongs to a common source.
30. A method as recited in claim 29, wherein said providing an image representation of a sound mixture includes separating said sound mixture in a plurality of sub-bands.
31.A method as recited in claim 30, wherein said plurality of sub-bands correspond to cochlear channels.
32. A method as recited in claim 29, wherein said image representation is in the form of an auditory map.
33. A method as recited in claim 29, wherein said providing an image representation of a sound mixture includes generating at least one of a CAM (Cochleotopic/AMtopic Map) and a CSM (Cochleotopic/Spectrotopic Map).
34. A method as recited in claim 33, wherein said at least one of a CAM (Cochleotopic/AMtopic Map) and a CSM (Cochleotopic/Spectrotopic Map) is selected depending on said sound mixture.
35. A method as recited in claim 34, wherein said performing a temporal correlation between neurons of said second layer further comprises generating a binary mask based on said at least one group of synchronized neurons; said mask being applied to said image representation of said sound source.
36. A system for monophonic source separation comprising:
a vocoder for receiving a sound mixture including at least one monophonic sound source;
an auditory image generator coupled to said vocoder for receiving said sound mixture therefrom and for generating an auditory image representation of said sound mixture;
a neural network as recited in claim 1, coupled to said auditory image generator for receiving said auditory image representation and for generating a mask in response to said auditory image representation; and
a multiplier coupled to both said vocoder and said neural network for receiving said mask from said neural network and for multiplying said mask with said sound mixture from said vocoder, resulting in the identification of said at least one monophonic source by muting sounds from said sound mixture not belonging to said at least one monophonic source.
37. A system as recited in claim 36, wherein said mask is a binary mask.
38.A system as recited in claim 36, wherein said first layer is two-dimensional and said second layer is one-dimensional.
39. A system as recited in claim 36, wherein said vocoder further allows for filtering said sound mixture.
40. A system as recited in claim 36, wherein said vocoder is selected from the group consisting of Gammatone filter banks, a Linear Predictive Coding Vocoder and a Fourier transform.
41. A system as recited in claim 40, wherein said Gammatone filter bank is an FIR implementation thereof.
42. A system as recited in claim 36, wherein said auditory image representation includes at least one of a CAM (Cochleotopic/AMtopic Map) generator and a CSM (Cochleotopic/Spectrotopic Map) generator.
43. A system as recited in claim 42, wherein each of said neurons from said second layer of neurons receives a weighted product of said second extra-layer stimuli along a respective frequency axis of said auditory image representation, yielding a synaptic weight ω_{i,j}.
44. A system as recited in claim 43, wherein said second extra-layer stimuli to neuron(j) of said second layer of neurons are defined by:

$$p(j;t) = \prod_{i} \omega_{i,j}\, \Xi\{\langle x(i,j;t)\rangle\} \qquad (9)$$

wherein the operator Ξ is defined as:

$$\Xi\{x(i,j;t)\} = \begin{cases} 1 & \text{for } x(i,j;t) = 0 \\ x(i,j;t) & \text{elsewhere} \end{cases} \qquad (10)$$

where ⟨·⟩ is the averaging-over-a-time-window operator, and where a multiplication is done only for non-zero values among said second extra-layer stimuli.
45. A system as recited in claim 36, wherein said at least one network activity controller is in the form of at least one global controller.
46. A system as recited in claim 45, wherein dynamics of said neurons from said first layer follows the following state-space equations:

$$\frac{dx_{i,j}}{dt} = 3x_{i,j} - x_{i,j}^3 + 2 - y_{i,j} + \rho + H(p^{input}_{i,j}) + S_{i,j} \qquad (1)$$

$$\frac{dy_{i,j}}{dt} = \varepsilon\left[\gamma\left(1 + \tanh(x_{i,j}/\beta)\right) - y_{i,j}\right] \qquad (2)$$

wherein x_{i,j} is the membrane potential of said neuron(i,j), y_{i,j} is the state for channel activation or inactivation; ρ is the amplitude of a Gaussian noise, p^{input}_{i,j} is the external input to the neuron, S_{i,j} is the coupling from other neurons, and ε, γ, and β are constants.
47. A system as recited in claim 46, wherein Equations (1) and (2) are solved using the Euler method.
48. A system as recited in claim 46, wherein a weight between neuron(i,j) and neuron(k,m) of said first layer is computed via the following formula:

$$\omega_{i,j,k,m}(t) = \frac{0.25}{Card\{N(i,j)\}\, e^{\lambda|p(i,j;t) - p(k,m;t)|}}$$

where p(i,j) and p(k,m) are respectively external inputs to neuron(i,j) and neuron(k,m), λ is a constant, and Card{N(i,j)} is a normalization factor.
49. A system as recited in claim 48, wherein said Card{N(i,j)} is equal to the number of elements of a set N(i,j) containing neighbors connected to the neuron(ij).
50. A system as recited in claim 48, wherein Card{N(i,j)} is within the range from 2 to 4.
51. A system as recited in claim 46, wherein said coupling S_{i,j} is defined in Equation (1) as:

$$S_{i,j}(t) = \sum_{k,m \in N(i,j)} \omega_{i,j,k,m}(t)\, H(x(k,m;t)) - \eta G(t) + \kappa L_{i,j}(t) \qquad (4)$$

where H(·) is a Heaviside function, G(t) defines said at least one global controller, κ is a binary constant, and L_{i,j}(t) defines long range coupling.
52. A system as recited in claim 51, wherein:

$$G(t) = \alpha H(z - \theta) \qquad (5)$$

$$\frac{dz}{dt} = \sigma - \xi z \qquad (6)$$

where σ is set to 1 if the global activity of the network is greater than a predefined ζ and is zero otherwise; and where α and ξ are constants.
53. A system as recited in claim 51, wherein said auditory image representation includes at least one of a CAM (Cochleotopic/AMtopic Map) generator and a CSM (Cochleotopic/Spectrotopic Map) generator; and wherein κ is set according to which of said CAM and CSM generators is used.
54. The use of a system as recited in claim 36 for multimedia file indexation or authentication.
55. The use of a system as recited in claim 36 for recording quality enhancement.
56. The use of a system as recited in claim 36 for speech recognition.
57. The use of a system as recited in claim 36 for sound quality enhancement in a hearing aid device.
58. The use of a system as recited in claim 36 for separating different visual scene objects for a visually impaired person.
59. A method for establishing correspondence between first and second sets of data, the method comprising:
providing a neural network including first and second layers of neurons;
providing first and second image representations, including pixels, of respectively said first and second sets of data;
applying said first and second image representations respectively to said first and second layers;
interconnecting each neuron from said first layer to each neuron of said second layer;
performing a dynamic matching between said first and second layers, yielding a temporal correlation between said first and second layers; and
using said temporal correlation between said first and second layers for establishing correspondence between the first and second sets of data.
PCT/CA2005/001018 2004-06-29 2005-06-29 Spiking neural network and use thereof WO2006000103A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CA2472864 2004-06-29
CA2,472,864 2004-06-29

Publications (1)

Publication Number Publication Date
WO2006000103A1 true WO2006000103A1 (en) 2006-01-05

Family

ID=35781549

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CA2005/001018 WO2006000103A1 (en) 2004-06-29 2005-06-29 Spiking neural network and use thereof

Country Status (1)

Country Link
WO (1) WO2006000103A1 (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004088457A2 (en) * 2003-03-25 2004-10-14 Sedna Patent Services, Llc Generating audience analytics

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JAHNKE ET AL: "Simulation of Spiking Neural Networks on Different Hardware Platforms.", INSTITUTE FUR MIKROELECTRONIC., Retrieved from the Internet <URL:http://mikro.ee.tu-berlin.de/ifm/spinn/pdf/icann97a.pdf> *
PICHEVAR ET AL: "Double-vowel Segregation through Temporal Correlation: A Bio-Inspired Neural Network Paradigm.", NONLINEAR SIGNAL PROCESSING WORKSHOP., 20 May 2003 (2003-05-20), Retrieved from the Internet <URL:http://www.nolisp2005.org/cost/doc/nolisp03/006.pdf> *
ROUAT ET AL: "A bio-inspired sound source separation technique in combination with an enhanced FIR Gammatone Analysis/Synthesis Filterbank.", EUROPEAN SIGNAL PROCESSING CONFERENCE., September 2004 (2004-09-01), Retrieved from the Internet <URL:http://www.igi.tugraz.at/lehre/CI/links/pichevar_eusipco_2004.pdf> *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6866841B2 (en) 2001-08-09 2005-03-15 Epatentmanager.Com Non-endocrine disrupting cytoprotective UV radiation resistant substance
EP1964036A1 (en) * 2005-12-23 2008-09-03 Université de Sherbrooke Spatio-temporal pattern recognition using a spiking neural network and processing thereof on a portable and/or distributed computer
EP1964036A4 (en) * 2005-12-23 2010-01-13 Univ Sherbrooke Spatio-temporal pattern recognition using a spiking neural network and processing thereof on a portable and/or distributed computer
US8346692B2 (en) 2005-12-23 2013-01-01 Societe De Commercialisation Des Produits De La Recherche Appliquee-Socpra-Sciences Et Genie S.E.C. Spatio-temporal pattern recognition using a spiking neural network and processing thereof on a portable and/or distributed computer
US8626495B2 (en) 2009-08-26 2014-01-07 Oticon A/S Method of correcting errors in binary masks
US8515885B2 (en) 2010-10-29 2013-08-20 International Business Machines Corporation Neuromorphic and synaptronic spiking neural network with synaptic weights learned using simulation
US8812415B2 (en) 2010-10-29 2014-08-19 International Business Machines Corporation Neuromorphic and synaptronic spiking neural network crossbar circuits with synaptic weights learned using a one-to-one correspondence with a simulation
US9317540B2 (en) 2011-06-06 2016-04-19 Socpra Sciences Et Genie S.E.C. Method, system and aggregation engine for providing structural representations of physical entities
CN105229675B (en) * 2013-05-21 2018-02-06 高通股份有限公司 The hardware-efficient of shunt peaking is realized
CN105229675A (en) * 2013-05-21 2016-01-06 高通股份有限公司 The hardware-efficient of shunt peaking realizes
US9269045B2 (en) 2014-02-14 2016-02-23 Qualcomm Incorporated Auditory source separation in a spiking neural network
US20150235125A1 (en) * 2014-02-14 2015-08-20 Qualcomm Incorporated Auditory source separation in a spiking neural network
CN103886395A (en) * 2014-04-08 2014-06-25 河海大学 Reservoir optimal operation method based on neural network model
CN106683663A (en) * 2015-11-06 2017-05-17 三星电子株式会社 Neural network training apparatus and method, and speech recognition apparatus and method
CN110291540A (en) * 2017-02-10 2019-09-27 谷歌有限责任公司 Criticize renormalization layer
US11887004B2 (en) 2017-02-10 2024-01-30 Google Llc Batch renormalization layers
CN106991999A (en) * 2017-03-29 2017-07-28 北京小米移动软件有限公司 Audio recognition method and device
US11636318B2 (en) * 2017-12-15 2023-04-25 Intel Corporation Context-based search using spike waves in spiking neural networks
US20200272884A1 (en) * 2017-12-15 2020-08-27 Intel Corporation Context-based search using spike waves in spiking neural networks
CN112805717A (en) * 2018-09-21 2021-05-14 族谱网运营公司 Ventral-dorsal neural network: object detection by selective attention
CN112036232A (en) * 2020-07-10 2020-12-04 中科院成都信息技术股份有限公司 Image table structure identification method, system, terminal and storage medium
CN112036232B (en) * 2020-07-10 2023-07-18 中科院成都信息技术股份有限公司 Image table structure identification method, system, terminal and storage medium
US11164068B1 (en) 2020-11-13 2021-11-02 International Business Machines Corporation Feature recognition with oscillating neural network
CN112541578A (en) * 2020-12-23 2021-03-23 中国人民解放军总医院 Retina neural network model
CN112858468B (en) * 2021-01-18 2023-08-15 金陵科技学院 Rail crack quantitative estimation method of multi-fusion characteristic echo state network
CN112858468A (en) * 2021-01-18 2021-05-28 金陵科技学院 Steel rail crack quantitative estimation method of multi-fusion characteristic echo state network
CN113426109A (en) * 2021-06-24 2021-09-24 杭州悠潭科技有限公司 Method for cloning chess and card game behaviors based on factorization machine
CN113426109B (en) * 2021-06-24 2023-09-26 深圳市优智创芯科技有限公司 Method for cloning chess and card game behaviors based on factorization machine
CN113609912A (en) * 2021-07-08 2021-11-05 西华大学 Power transmission network fault diagnosis method based on multi-source information fusion
CN117314972A (en) * 2023-11-21 2023-12-29 安徽大学 Target tracking method of pulse neural network based on multi-class attention mechanism
CN117314972B (en) * 2023-11-21 2024-02-13 安徽大学 Target tracking method of pulse neural network based on multi-class attention mechanism

Similar Documents

Publication Publication Date Title
WO2006000103A1 (en) Spiking neural network and use thereof
CA2642041C (en) Spatio-temporal pattern recognition using a spiking neural network and processing thereof on a portable and/or distributed computer
CN113035227B (en) Multi-modal voice separation method and system
WO1991002324A1 (en) Adaptive network for in-band signal separation
US6038338A (en) Hybrid neural network for pattern recognition
RU2193797C2 (en) Content-addressable memory device (alternatives) and image identification method (alternatives)
Chella A cognitive architecture for music perception exploiting conceptual spaces
Telfer et al. Adaptive wavelet classification of acoustic backscatter and imagery
Barros et al. Learning auditory neural representations for emotion recognition
AU655235B2 (en) Signal processing arrangements
Sagi et al. A biologically motivated solution to the cocktail party problem
Watrous Speaker normalization and adaptation using second-order connectionist networks
Rosero et al. Sound events localization and detection using bio-inspired gammatone filters and temporal convolutional neural networks
Movellan et al. Robust sensor fusion: Analysis and application to audio visual speech recognition
Adeel Conscious multisensory integration: introducing a universal contextual field in biological and deep artificial neural networks
Bhattacharjee et al. Clean vs. overlapped speech-music detection using harmonic-percussive features and multi-task learning
Song et al. Research on Scattering Transform of Urban Sound Events Detection Based on Self-Attention Mechanism
Makhlouf et al. Evolutionary structure of hidden Markov models for audio-visual Arabic speech recognition
Pichevar et al. Monophonic sound source separation with an unsupervised network of spiking neurones
Elhilali et al. A biologically-inspired approach to the cocktail party problem
Kasabov et al. Audio-and Visual Information Processing in the Brain and Its Modelling with Evolving SNN
Betancourt et al. Portable expert system to voice and speech recognition using an open source computer hardware
CN115132221A (en) Method for separating human voice, electronic equipment and readable storage medium
Chelali et al. Audiovisual Speaker Identification Based on Lip and Speech Modalities.
Movellan et al. Bayesian robustification for audio visual fusion

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 05761674

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 05761674

Country of ref document: EP

Kind code of ref document: A1