WO2013056756A1

WO2013056756A1 - Method and apparatus for displaying visual information about participants in a teleconference

Info

Publication number: WO2013056756A1
Application number: PCT/EP2012/003034
Authority: WO
Inventors: Christos FOUSTERIS
Original assignee: Siemens Enterprise Communications Gmbh & Co. Kg
Priority date: 2011-10-18
Filing date: 2012-07-18
Publication date: 2013-04-25

Abstract

A method for displaying visual information about participants in a teleconference comprises mixing of audio signals originating from participants in the teleconference, providing an automatic identification of a participant currently speaking and displaying at least one static digital image associated with the identified participant currently speaking at least during a part of the time while this participant is speaking.

Description

METHOD AND APPARATUS FOR DISPLAYING VISUAL INFORMATION ABOUT PARTICIPANTS IN A TELECONFERENCE

BACKGROUND OF THE INVENTION

The present invention relates generally to a method, an apparatus and a system for displaying visual information about participants in a teleconference.

Teleconferencing is nowadays a preferred method of communication for employees of medium-sized and large enterprises. Various methods have been presented to meet the natural human need to see a visual representation of the participants speaking in a teleconference.

US 2006/0098085 A1 discloses a method and apparatus for managing a display during a teleconference between a primary participant and one or more secondary participants. According to this publication, a primary image corresponding to the primary participant and a subset of secondary images corresponding to the secondary participants are displayed on first and second sections of the display, respectively. By scrolling through the secondary images during the teleconference, different subsets of the secondary images may be displayed. It is an object of the present invention to improve known methods for displaying visual information about participants in a teleconference.

SUMMARY OF THE INVENTION

According to the present invention, a method is provided comprising mixing of audio signals originating from participants in the teleconference. Further, an automatic identification of a participant currently speaking is provided. At least one static digital image associated with the identified participant currently speaking is displayed at least during a part of the time while this participant is speaking.

In the present context, the term audio signal shall refer to all kinds of signals, especially digital signals, that represent audible information of any kind, especially speech signals originating from participants in a teleconference. The term teleconference shall refer to all kinds of telecommunication processes that support communication among the participants of a conference by means of telecommunication equipment, including telephones, cameras, IP phones, PC clients, mobile phones or other kinds of telecommunication terminal devices (e.g. a UCFE, Universal Communications Front End). Preferably, these terminal devices are combinations of phones and (computer) screens. Usually, at least some of the participants are at remote locations, so that they cannot communicate without using technical communication equipment.

The audio and possibly other signals, e.g. video signals, originating from various participants of the teleconference are mixed so that they can be made available to other participants. Typically, the signals originating from the participant currently speaking are distributed to other participants so that these can listen to the participant currently speaking. The mixing is preferably done by a conference bridge.

In the present context, the term conference bridge shall refer to a system configured to mix the signals, especially speech signals, originating from the participants. Such a conference bridge can, e.g., be realized in the form of an application running on a personal computer, which is frequently referred to as a media server or conference server. This server receives the signals originating from the terminal devices used by the participants and sends the mixed signals to the terminal devices.
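As an illustration of the mixing step, a conference bridge commonly sends each participant the sum of all other participants' audio (a "mix-minus"), so that speakers do not hear their own voice echoed back. The sketch below assumes one frame of 16-bit PCM samples of equal length per participant; the function name `mix_minus` and the sample format are assumptions, not taken from the source, and a real bridge may instead mix only the currently active speakers.

```python
from typing import Dict, List

# Assumed sample format: signed 16-bit PCM.
PCM_MIN, PCM_MAX = -32768, 32767


def mix_minus(frames: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """For each participant, return the sum of all *other* participants' frames.

    `frames` maps a participant ID to one frame of PCM samples; all frames
    are assumed to have the same length.
    """
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    # Sum every participant once, then subtract each listener's own signal.
    total = [sum(frame[i] for frame in frames.values()) for i in range(length)]
    return {
        pid: [max(PCM_MIN, min(PCM_MAX, total[i] - own[i])) for i in range(length)]
        for pid, own in frames.items()
    }


frames = {"alice": [1000, -2000], "bob": [500, 500], "carol": [0, 100]}
print(mix_minus(frames))
# {'alice': [500, 600], 'bob': [1000, -1900], 'carol': [1500, -1500]}
```

Computing the total once and subtracting each listener's own contribution keeps the cost linear in the number of participants instead of quadratic; the final clamp keeps loud sums inside the 16-bit range.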

The term teleconference shall also include telecommunication processes in which participants communicate by audio signals and video signals and possibly by application sharing.

In order to be able to provide the signals originating from the current speaker to other participants without manually switching between the signal sources, an automatic identification of a participant currently speaking is provided. This is also referred to as speaker recognition and is preferably done by voice recognition or by speech analysis.

The various technologies used to process and store voice prints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, vector quantization and decision trees ("Speaker recognition" in Wikipedia, The Free Encyclopedia, date of last revision: 11 May 2012 23:38 UTC, date retrieved: 28 May 2012 18:57 UTC, permanent link: http://en.wikipedia.org/w/index.php?title=Speaker_recognition&oldid=492101918). For example, a method of speaker recognition is presented in the paper "Look at Who's Talking: Voice Activity Detection by Automated Gesture Analysis" by Marco Cristiani et al. Another method is published in "Look Who's Talking: Speaker Detection Using Video and Audio Correlation" by Ross Cutler and Larry Davis, Institute for Advanced Computer Studies, University of Maryland, College Park. Both papers are readily available on the internet.
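The techniques listed above recognize whose voice is heard. A much simpler alternative, possible because the conference bridge already knows which input stream each signal arrives on, is to pick the loudest stream. The sketch below is this swapped-in technique (energy-based "who is talking" detection), not the voice-print recognition described above; the function names and the threshold value are hypothetical.

```python
import math
from typing import Dict, List, Optional


def rms(frame: List[int]) -> float:
    """Root-mean-square level of one frame of PCM samples (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def current_speaker(frames: Dict[str, List[int]], threshold: float = 500.0) -> Optional[str]:
    """Return the participant whose frame is loudest, or None if nobody exceeds `threshold`.

    `threshold` (in PCM sample units) is a hypothetical value that a real
    deployment would tune, usually together with a hold-over time so that
    short pauses do not switch the displayed image.
    """
    levels = {pid: rms(frame) for pid, frame in frames.items()}
    if not levels:
        return None
    loudest = max(levels, key=lambda pid: levels[pid])
    return loudest if levels[loudest] >= threshold else None


print(current_speaker({"alice": [1000, -1000], "bob": [20, -20]}))  # alice
print(current_speaker({"alice": [10, -10], "bob": [20, -20]}))  # None
```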

According to the invention, a static digital image associated with the identified participant currently speaking is displayed at least during a part of the time while this participant is speaking. The term static digital image shall refer to a digital image with constant content, i.e. not changing, during a certain time period. This static image may be a portrait of the current speaker or an image of a person associated with the current speaker or another kind of static digital image.

Preferably, the method according to the invention is implemented with the help of a component, preferably a UC component, residing on Front End Servers. This component preferably performs the following tasks:

1. Listen to specific UC events;

2. Create one folder per conference ID (Conf.ID) upon conference start up;

3. Copy images of participants into the Conf.ID folder;

4. Dynamically rename the picture of the active speaker and of previous speakers;

5. Generate XML and HTML pages;

6. Delete the Conf.ID folder upon conference closure.
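A minimal sketch of the folder-handling tasks (2, 3, 4 and 6) in the list above, assuming a local file system and participant images already available as files. The class name, the method names and the `role_active` / `role_previousN` naming scheme are hypothetical; instead of renaming the original pictures, step 4 is modeled as writing role-named copies so that a generated page can always reference the same file names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


class ConferenceImageFolder:
    """Hypothetical helper covering steps 2, 3, 4 and 6 of the task list."""

    def __init__(self, root: Path, conf_id: str) -> None:
        self.folder = root / conf_id  # one folder per conference ID
        self.photos: Dict[str, Path] = {}

    def start(self, sources: Dict[str, Path]) -> None:
        """Steps 2 and 3: create the Conf.ID folder and copy each participant's image."""
        self.folder.mkdir(parents=True, exist_ok=False)
        for participant, src in sources.items():
            dst = self.folder / f"{participant}{src.suffix}"
            shutil.copyfile(src, dst)
            self.photos[participant] = dst

    def set_speakers(self, speakers: List[str]) -> None:
        """Step 4: expose the active and previous speakers under fixed file names.

        `speakers` is ordered most recent first.
        """
        for old in list(self.folder.glob("role_*")):  # drop the previous assignment
            old.unlink()
        for rank, participant in enumerate(speakers):
            src = self.photos[participant]
            role = "active" if rank == 0 else f"previous{rank}"
            shutil.copyfile(src, self.folder / f"role_{role}{src.suffix}")

    def close(self) -> None:
        """Step 6: delete the Conf.ID folder upon conference closure."""
        shutil.rmtree(self.folder, ignore_errors=True)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "alice.jpg").write_bytes(b"A")
    (root / "bob.jpg").write_bytes(b"B")
    conf = ConferenceImageFolder(root / "conferences", "conf-42")
    conf.start({"alice": root / "alice.jpg", "bob": root / "bob.jpg"})
    conf.set_speakers(["bob", "alice"])
    print((conf.folder / "role_active.jpg").read_bytes())  # b'B'
    conf.close()
    print(conf.folder.exists())  # False
```

Deleting the whole folder on closure (step 6) also removes every copied picture in one operation, which matches the storage and privacy motivation given for this step.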

Unified communications (UC) is the integration of real-time communication services such as instant messaging (chat), presence information, telephony (including IP telephony), video conferencing, data sharing (including web-connected electronic whiteboards, aka IWBs or interactive whiteboards), call control and speech recognition with non-real-time communication services such as unified messaging (integrated voicemail, e-mail, SMS and fax) ("Unified communications" in Wikipedia, The Free Encyclopedia, date of last revision: 21 May 2012 19:21 UTC, date retrieved: 28 May 2012 18:39 UTC, permanent link: http://en.wikipedia.org/w/index.php?title=Unified_communications&oldid=493707199). UC is not necessarily a single product, but a set of products that provides a consistent unified user interface and user experience across multiple devices and media types. There have been attempts at creating a single-product solution; however, the most popular solution depends on multiple products.

In its broadest sense, UC can encompass all forms of communications exchanged via a TCP/IP network, including other forms of communications such as Internet Protocol Television (IPTV) and digital signage communications as they become an integrated part of the network communications deployment; these may be directed as one-to-one communications or broadcast from one to many. UC allows an individual to send a message on one medium and receive the same communication on another medium. For example, one can receive a voicemail message and choose to access it through e-mail or a cell phone. If the sender is online according to the presence information and currently accepts calls, the response can be sent immediately through text chat or a video call. Otherwise, it may be sent as a non-real-time message that can be accessed through a variety of media.

The invention will, depending on the chosen embodiment, provide various advantages:

The trend, especially in large companies, is nowadays to use video conferencing systems instead of just audio conferences. While live video meets the human need to "put a face to the voice", it does not always meet this need appropriately. Video conferences are suitable when a group of people at one location is communicating with a group of people at a different location, or when just two people communicate ("one-on-one discussions"). The merits of video conferencing in other cases, e.g. meetings of teleworkers or of large global enterprises with employees at many different locations, are questionable.

Larger groups do not easily fit on a screen. Another question is whether employees always want to be seen for an hour straight in a daily or weekly meeting while they work remotely, when they may not be dressed for the occasion, or when they want to participate only "partially" in a video call.

The invention offers a solution in between (pure) audio and (audio-)video conferences. While pure audio completely lacks pictures and video demands a continuous stream of pictures, the invention makes it possible to match a picture to the voice (of the current speaker, i.e. the person the voice belongs to) and to display a predefined picture in real time on the terminal devices of all participants.

The invention offers the possibility to solve problems associated with conferences in which participants are in their office or cubicle (e.g. in the case of teleworking), and to reduce the required bandwidth compared to video conferences. Standard mechanisms of speaker recognition or "who is talking" detection may be used to implement the invention easily. These mechanisms are well known from contemporary UC web clients, where they are used to show the name of the current speaker. The images to be displayed can be stored in photo repositories, which can be located in various places depending on the needs of the customer. The customer's corporate directory may, for instance, be used as a source for the contact and communication resources and for information about the company's employees, who may be modeled in a CMP (Common Management Portal) as UM (User Management) users.

Unified Messaging (or UM) is the integration of different electronic messaging and communications media (e-mail, SMS, fax, voicemail, video messaging, etc.) technologies into a single interface, accessible from a variety of different devices ("Unified messaging" in Wikipedia, The Free Encyclopedia, author: Wikipedia contributors, date of last revision: 10 February 2012 12:55 UTC, date retrieved: 28 May 2012 18:31 UTC, permanent link: http://en.wikipedia.org/w/index.php?title=Unified_messaging&oldid=476111606). While traditional communications systems delivered messages into several different types of stores such as voicemail systems, e-mail servers and stand-alone fax machines, with Unified Messaging all types of messages are stored in one system. Voicemail messages, for example, can be delivered directly into the user's inbox and played either through a headset or the computer's speaker. This simplifies the user's experience (only one place to check for messages) and can offer new workflow options such as appending notes or documents to forwarded voicemails.

According to a preferred embodiment of the present invention, the method further comprises displaying a temporal sequence of a plurality of static images associated with the identified participant currently speaking at least during a part of the time while this participant is speaking. Examples of such temporal sequences may be digital slide shows of pictures of the same person (e.g. portraits, pictures showing the speaker in various professional or leisure-time situations, etc.) or slide shows comprising pictures of persons associated with or related to the current speaker, e.g. colleagues, assistants, etc. These slide shows may, however, also comprise documents such as presentation slides or similar material. This embodiment of the invention makes it possible to provide non-verbal information to the listening participants during the verbal presentation or statement of the current speaker.
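One simple way to realize such a temporal sequence is to derive the slide to show from the time elapsed since the speaker started talking. The sketch below assumes a fixed advance interval and a wrap-around at the end of the list; the function name and the 5-second default are hypothetical.

```python
from typing import List


def slide_for(elapsed_s: float, slides: List[str], interval_s: float = 5.0) -> str:
    """Return the slide to show `elapsed_s` seconds after the speaker started talking.

    Slides advance every `interval_s` seconds and wrap around at the end.
    """
    if not slides:
        raise ValueError("at least one slide is required")
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    index = int(elapsed_s // interval_s) % len(slides)
    return slides[index]


for t in (0, 4.9, 5, 12, 15):
    print(t, slide_for(t, ["portrait.jpg", "team.jpg", "slide1.png"]))
# 0 portrait.jpg / 4.9 portrait.jpg / 5 team.jpg / 12 slide1.png / 15 portrait.jpg
```

Because the index depends only on elapsed time, every terminal that knows when the current speaker started shows the same slide without further synchronization messages.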

According to a preferred embodiment of the present invention, the method further comprises concurrently (i.e. simultaneously) displaying a plurality of static images associated with the identified participant currently speaking at least during a part of the time while this participant is speaking. Examples of such a plurality of static images may be combinations of portraits and texts, e.g. the name of the speaker, his or her affiliation, title, etc. This embodiment of the invention makes it possible to provide additional information to the listening participants during the verbal presentation or statement of the current speaker.
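Since the component described earlier generates HTML pages, a combined portrait-plus-text view could be rendered as a small HTML fragment. The sketch below is an assumption about how that might look: the function name, the CSS class names and the optional fields are hypothetical, and all values are HTML-escaped because names and titles come from directory data.

```python
from html import escape
from typing import Optional


def speaker_card(name: str, photo_url: str, title: Optional[str] = None,
                 affiliation: Optional[str] = None) -> str:
    """Build an HTML fragment showing the speaker's portrait together with text details."""
    parts = [
        '<div class="speaker-card">',
        f'<img src="{escape(photo_url)}" alt="{escape(name)}">',
        f'<p class="name">{escape(name)}</p>',
    ]
    if title:
        parts.append(f'<p class="title">{escape(title)}</p>')
    if affiliation:
        parts.append(f'<p class="affiliation">{escape(affiliation)}</p>')
    parts.append("</div>")
    return "\n".join(parts)


print(speaker_card("Ada Lovelace", "conf-42/role_active.jpg", title="Engineer"))
```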

According to a preferred embodiment of the present invention, the method further comprises displaying only one static digital image associated with the identified participant currently speaking at least during a part of the time while this participant is speaking.

According to a preferred embodiment of the present invention, the method further comprises, upon start-up of the teleconference, storing (or checking that there has been stored) at least one digital image associated with each participant in a storage device accessible to a teleconference system that manages the information displayed to the participants of the teleconference. Preferably, the digital images to be displayed during the conference are copied from digital files, such as corporate directories, personal web pages, etc., containing user-specific information such as Employee Name, Employee Location, Employee e-mail, Employee Picture, Employee Group, etc., and are subsequently stored on an application server. These actions are preferably implemented using LDAP (Lightweight Directory Access Protocol). According to a preferred embodiment of the present invention, these digital images are copied and stored in a digital data storage folder associated with this teleconference and created upon start-up of this teleconference.
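A minimal sketch of the directory side of this step, assuming participants are looked up by their `mail` attribute and that portraits are held in the `jpegPhoto` attribute of the inetOrgPerson schema (RFC 2798). It only builds the search filter and post-processes results; the network call is left to whichever LDAP client library is used, and the result shape (attribute dictionaries with list values) is an assumption. The function names are hypothetical.

```python
from typing import Dict, Iterable, List


def escape_filter_value(value: str) -> str:
    """Escape a value for use inside an LDAP search filter."""
    # RFC 4515 requires '*', '(', ')', '\' and NUL to be written as '\' plus two hex digits.
    return "".join("\\%02x" % ord(ch) if ch in "*()\\\0" else ch for ch in value)


def participants_filter(emails: Iterable[str]) -> str:
    """Build one OR filter that matches every conference participant by mail address."""
    return "(|" + "".join(f"(mail={escape_filter_value(e)})" for e in emails) + ")"


def photos_by_mail(entries: List[Dict[str, List]]) -> Dict[str, bytes]:
    """Map each entry's first mail value to its first jpegPhoto value.

    Entries without a mail address or without a photo are skipped.
    """
    photos: Dict[str, bytes] = {}
    for entry in entries:
        mail, photo = entry.get("mail"), entry.get("jpegPhoto")
        if mail and photo:
            photos[mail[0]] = photo[0]
    return photos


print(participants_filter(["alice@example.com", "bob(x)@example.com"]))
# (|(mail=alice@example.com)(mail=bob\28x\29@example.com))
```

Escaping matters here because participant identifiers come from outside the directory; an unescaped `*` or `)` would change the meaning of the filter.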

According to a preferred embodiment of the present invention, the digital data storage folder associated with this teleconference is deleted upon conference closure. This offers the advantage of saving storage space and of meeting several requirements of standard privacy protection policies.

According to the present invention, an apparatus is provided for displaying visual information about participants in a teleconference, the apparatus comprising a conference bridge for mixing of audio signals originating from participants in the teleconference, means for providing an automatic identification of a participant currently speaking and means for causing a communication terminal device to display at least one static digital image associated with the identified participant currently speaking at least during a part of the time while this participant is speaking.

According to the present invention, a system is provided for displaying visual information about participants in a teleconference, the system comprising at least one apparatus according to the invention and a plurality of communication terminal devices receiving communication data from the at least one apparatus.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 illustrates a preferred configuration of a system according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

According to a preferred embodiment of the present invention, a system is provided for displaying visual information about participants in a teleconference, the system comprising at least one apparatus 4, 8, 9, 10, 11 according to the invention and a plurality of communication terminal devices 1a, 1b, 2a, 2b, 3a and 3b receiving communication data from the at least one apparatus. The communication terminal devices preferably comprise phones 1a, 2a and 3a and screens 1b, 2b and 3b. The screens are preferably computer screens connected to a personal computer. The phones and the screens are preferably connected to universal communication front ends (UCFE) 5, 6 and 7, which are preferably equipped with local storage means 5a, 6a and 7a.

According to the present invention, this at least one apparatus 4, 8, 9, 10, 11 is configured for displaying visual information about participants in a teleconference, the apparatus comprising a conference bridge 8 for mixing of audio signals originating from participants in the teleconference, means 9 for providing an automatic identification of a participant currently speaking and means 4 for causing a communication terminal device to display at least one static digital image associated with the identified participant currently speaking at least during a part of the time while this participant is speaking.

Preferably, the digital images to be displayed during the conference are copied from digital files, such as corporate directories 10, personal web pages, etc., containing user-specific information 11 such as Employee Name, Employee Location, Employee e-mail, Employee Picture, Employee Group, etc., and are subsequently stored on an application server 4. These actions are preferably implemented using LDAP (Lightweight Directory Access Protocol).

Preferably, the images to be displayed are stored in photo repositories, which can be located in various places depending on the needs of the customer, e.g. in the universal communication front ends (UCFE) 5, 6, 7 or their storage devices 5a, 6a, 7a. The customer's corporate directory 10 may, for instance, be used as a source for the contact and communication resources and for information about the company's employees, who may be modeled in a CMP (Common Management Portal) as UM (User Management) users.

The following is a listing of an example XML file for the phones upon conference creation.

Preferably, it remains the same until the conference is closed. Preferably, such XML files are stored on one or several of the universal communication front ends (UCFE) 5, 6, 7 or their storage devices 5a, 6a, 7a.

XML applications are based on a client-server architecture. Comparable to web browsers and web servers in the WWW, the client, which preferably runs in the phone software, requests an XML document from the server-side program. The HTTP/HTTPS GET request sent by the client includes the phone's call number, for instance:

"137.223.238.174/serverProgram?phonenumber=4711"

The server-side program then preferably generates an XML document, which is preferably delivered to the phone over HTTP/HTTPS. In the phone, the XML document is preferably parsed and displayed on the graphic display.
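The server side of this exchange can be sketched as a function that reads the `phonenumber` parameter from the request URL and answers with an XML document listing the conference participants and their images. Because the original XML listing is not reproduced here, the element and attribute names (`Conference`, `Participant`, `name`, `image`) and the `participants` mapping are hypothetical; the HTTP layer is omitted.

```python
import xml.etree.ElementTree as ET
from typing import Dict
from urllib.parse import parse_qs, urlsplit


def handle_request(url: str, participants: Dict[str, str]) -> bytes:
    """Build the XML reply for a phone's GET request.

    `participants` maps a display name to an image URL.
    """
    query = parse_qs(urlsplit(url).query)
    phone = query.get("phonenumber", [""])[0]  # caller's number from the request
    root = ET.Element("Conference", {"phonenumber": phone})
    for name, image_url in participants.items():
        ET.SubElement(root, "Participant", {"name": name, "image": image_url})
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


reply = handle_request(
    "137.223.238.174/serverProgram?phonenumber=4711",
    {"Alice": "http://example.com/conf-42/alice.jpg"},
)
print(reply.decode("utf-8"))
```

Building the document with ElementTree rather than string concatenation guarantees that names containing characters such as `&` or `<` are escaped correctly before the phone parses them.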

The same mechanism can also be used for matching a voice to the right picture, for example with the following event:

fireEvent: {} PmEvent type=activespeaker

The BE service sends fireEvents to the FE service; in case of an active speaker, the FE service receives a fireEvent with type=activespeaker.
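A minimal sketch of how the FE side might react to such an event, assuming events arrive as text lines containing space-separated `key=value` pairs. The `participant` key that names the speaker is hypothetical (the event shown above carries only `type=activespeaker`), and `show_image` stands in for whatever updates the displayed picture.

```python
import re
from typing import Callable, Dict, Optional

_PAIR = re.compile(r"(\w+)=(\S+)")  # matches key=value tokens


def parse_event(line: str) -> Dict[str, str]:
    """Extract key=value pairs from an event line such as 'PmEvent type=activespeaker'."""
    return dict(_PAIR.findall(line))


def on_fire_event(line: str, show_image: Callable[[str], None]) -> Optional[str]:
    """Handle one fireEvent; return the active speaker if the event names one.

    The `participant` key is a hypothetical field, not shown in the source.
    """
    fields = parse_event(line)
    if fields.get("type") != "activespeaker":
        return None
    speaker = fields.get("participant")
    if speaker is not None:
        show_image(speaker)
    return speaker


shown = []
on_fire_event("fireEvent: {} PmEvent type=activespeaker participant=alice", shown.append)
print(shown)  # ['alice']
```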

* * *

Claims

1. A method for displaying visual information about participants in a teleconference comprising: a) mixing of audio signals originating from participants in the teleconference; b) providing an automatic identification of a participant currently speaking; c) displaying at least one static digital image associated with the identified participant currently speaking at least during a part of the time while this participant is speaking.

2. The method according to claim 1, wherein a temporal sequence of a plurality of static images associated with the identified participant currently speaking is displayed at least during a part of the time while this participant is speaking.

3. The method according to one of the preceding claims, wherein a plurality of static images associated with the identified participant currently speaking is displayed concurrently at least during a part of the time while this participant is speaking.

4. The method according to one of the preceding claims, wherein only one static digital image associated with the identified participant currently speaking is displayed at least during a part of the time while this participant is speaking.

5. The method according to one of the preceding claims, wherein upon start up of the teleconference at least one digital image associated with each participant is stored or is checked to have been stored in a storage device accessible to a teleconference system managing the information displayed to the participants of the teleconference.

6. The method according to claim 5, wherein these digital images are copied and stored in a digital data storage folder associated with this teleconference and created upon start up of this teleconference.

7. The method according to claim 6, wherein the digital data storage folder associated with this teleconference is deleted upon conference closure.

8. Apparatus (4, 8, 9, 10, 11) for displaying visual information about participants in a teleconference, comprising: a) a conference bridge (8) for mixing of audio signals originating from participants in the teleconference; b) means for providing an automatic identification of a participant currently speaking; c) means for causing a communication terminal device to display at least one static digital image associated with the identified participant currently speaking at least during a part of the time while this participant is speaking.

9. System for displaying visual information about participants in a teleconference, comprising: a) at least one apparatus according to claim 8; b) a plurality of communication terminal devices (1a, 1b, 2a, 2b, 3a and 3b) receiving communication data from the at least one apparatus.