Videoconferencing (VMC)¹ is an increasingly popular way of working together regardless of physical distance. Although the quality of the audio and video components of VMC has improved remarkably, VMC has yet to reach the standard of face-to-face communication in overall effectiveness and in the subjective impressions of participants. This paper aims to cover most of the important experiments on the perception of VMC and its comparison with face-to-face situations. The outcomes of these studies could serve as a basis for developing a VMC environment that is as intuitive and natural for participants as possible, that is, very similar to face-to-face communication.
This paper is divided into two sections. The first part presents basic information about different modes of human communication from the viewpoint of social psychology, together with an introduction to the experimental methods and means of data analysis used in the majority of the cited studies. The second part discusses the results of these experiments and their implications for software design. Special attention is given to audio-video interaction, the effects of camera and monitor placement, and finally the importance of supporting deixis².
Compared to other modes of human interaction, face-to-face communication conveys the greatest number of observable details³. Besides the meaning of words, we pay attention to voice modulation, speech rate, nonverbal signals given by the face and the rest of the body, and the context of the communication⁴. We use this information to better understand what the other party is presenting.
According to the authors of study , four basic needs have to be fulfilled (independently of the medium used) to achieve effective communication: (1) to make contact, (2) to allocate turns at talk, (3) to monitor understanding and audience attention, and (4) to support deixis, that is, the possibility to see and use the artifacts used during the meeting (usually the document under discussion, paper used to draw diagrams on, etc.).
The aim of VMC is to create an environment that is (for some types of communication tasks) at least comparable to classical face-to-face interaction⁵. For this reason, almost all of the experiments compare results obtained with VMC against the outcomes of face-to-face and audio-only communication. Two different views can be taken: (1) objective communication effectiveness, which always depends on the concrete definition of "effectiveness", and (2) the subjective feelings of the participants (such as fatigue or stress caused by VMC), which are usually measured by questionnaires or semi-structured interviews. The methods used most frequently in the papers known to the author are described below.
One option for objective measurement is the evaluation of task products. A significant number of papers [14, 11, 12, 3] use the Map Task, whose results are very easy to evaluate. In this task, always done in pairs, each member of the pair is given a map of the same terrain. However, only one of the maps has a route marked on it. In addition, the maps differ slightly: features present on one map may be missing on the other, and vice versa (Figure 1). The shared goal is to draw the route on the other map as well, using only the mode of communication supplied. Effectiveness is then evaluated by measuring the deviation of the drawn route from the original. The aforementioned studies did not find statistically significant differences in effectiveness (measured by route deviation as well as time needed to finish the task) between the individual modes of communication.
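The idea of scoring a Map Task by route deviation can be illustrated with a short sketch. The text above does not say how the deviation was computed in the cited studies, so the metric used here (the mean distance from each drawn point to the nearest segment of the original route) is an assumption chosen for simplicity, and the function names are hypothetical.

```python
from math import hypot

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the line segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b coincide.
        return hypot(px - ax, py - ay)
    # Projection of p onto the segment, clamped to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def mean_route_deviation(drawn, original):
    """Mean distance of the drawn route's points from the original route.

    Both routes are lists of (x, y) points in the same map units.
    This metric is an illustrative assumption, not the one used in the studies.
    """
    if not drawn or len(original) < 2:
        raise ValueError("need at least one drawn point and two original points")
    segments = list(zip(original, original[1:]))
    total = 0.0
    for p in drawn:
        total += min(point_segment_distance(p, a, b) for a, b in segments)
    return total / len(drawn)

if __name__ == "__main__":
    original = [(0, 0), (10, 0)]
    drawn = [(0, 1), (5, 1), (10, 1)]
    print(mean_route_deviation(drawn, original))
```

A lower value means the follower reproduced the route more faithfully; comparing the mean of this score across communication modes would correspond to the comparison described above.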
A second commonly used procedure is dialog analysis. The basic quantities observed and then statistically analyzed are the number of turns (changes of speaker), the number of interruptions (more than one participant speaking at the same time), the total number of words spoken, and the average length of a turn. Studies [1, 3, 10, 11, 12] assume that communication tends to be more effective (as long as the overall outcome is identical) when fewer words are spoken and turns are shorter. The picture is less consistent for interruptions. Papers [11, 12] see a greater amount of simultaneous speech as a sign of more effective distribution of information (problems and misunderstandings are clarified more quickly). Other studies  take the opposite view and describe interruptions as a sign of difficulty in turn allocation. All of the cited papers agree that the structure of communication is tightly bound to the task at hand. It is therefore hard to extrapolate any conclusions that would be valid in an arbitrary communication context.
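The dialog measures listed above can be computed from a time-stamped transcript roughly as follows. This is a minimal sketch under stated assumptions: the `Utterance` record and `dialog_metrics` function are hypothetical, a turn is taken to be an unbroken run of utterances by one speaker, and an interruption is counted whenever a new speaker starts before the previous utterance has ended. The cited studies may use finer-grained definitions.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    start: float  # seconds from the beginning of the dialog
    end: float
    text: str

def dialog_metrics(utterances):
    """Return turn count, interruption count, total words and mean turn length."""
    ordered = sorted(utterances, key=lambda u: u.start)
    words_per_turn = []
    interruptions = 0
    previous = None
    for u in ordered:
        n_words = len(u.text.split())
        if previous is None or u.speaker != previous.speaker:
            # Change of speaker: a new turn begins.
            words_per_turn.append(n_words)
            if previous is not None and u.start < previous.end:
                # The new speaker started while the other was still talking.
                interruptions += 1
        else:
            # Same speaker continues the current turn.
            words_per_turn[-1] += n_words
        previous = u
    total_words = sum(words_per_turn)
    turns = len(words_per_turn)
    return {
        "turns": turns,
        "interruptions": interruptions,
        "total_words": total_words,
        "mean_turn_length": total_words / turns if turns else 0.0,
    }

if __name__ == "__main__":
    sample = [
        Utterance("A", 0.0, 2.0, "go north past the lake"),
        Utterance("B", 2.5, 3.0, "okay"),
        Utterance("A", 3.2, 6.0, "then turn left at the mill"),
        Utterance("B", 5.5, 7.0, "which mill"),
    ]
    print(dialog_metrics(sample))
```

Because the cited papers disagree on whether interruptions indicate efficiency or difficulty, the sketch only counts them and leaves the interpretation to the analyst.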
A more complex procedure is conversational game analysis. This method, originally used for modeling communication in AI, charts the way speakers achieve their communicative goals. Conversational games theory proposes that the goals and subgoals of a conversation are achieved through dialogue units called conversational games. The term game is an analogy that captures the fact that conversational units have rules that both participants know and follow; they have a beginning and an end, and they are interactive (i.e., a game can only be accomplished through the interaction between two or more participants).
There are two functional levels of analysis within the coding system, which are related hierarchically: moves and games. Essentially, a move is a step toward achieving the goal of the game of which it forms a part. Each utterance is classified as one of 13 move types. The conversational move category assigned to an utterance (and there may be more than one move per utterance or more than one utterance per move) represents its perceived conversational function. Conversational moves are grouped into the conversational games just described . One possible division is shown in Figure 2. For more information about conversational game analysis, please consult [14, 1, 12, 3].
With the exception of study , all of the papers assume that the VMC participants are mutually remote and that each has their own VMC equipment (monitor, video camera, speakers or headphones, microphone, etc.). Although shared equipment is a very common videoconferencing design, no studies known to the author discuss or primarily focus on communication between people or teams sharing at least some part of the VMC equipment⁶. None of the papers cited here (except for ) mentions this possibility at all. Fortunately, it is plausible to assume that most of the information presented in sections 2.1 and 2.3 is valid for this kind of VMC as well⁷.
Accordingly, the majority of the studies cited in this paper focus on communication in pairs (dyads) and only rarely on larger groups⁸.
The video component is the only thing that distinguishes VMC from audio-only communication. Given the amount of data transferred between the end systems in VMC, a great number of modifications have been tested to reach a reasonable compromise between data load and the benefits gained.
Contrary to the usual assumptions, studies [4, 1, 2] have shown that the size of the video window⁹ has no effect on dialog length or structure and, surprisingly, not even on the subjective evaluation by participants. When VMC users were asked to choose the least important component of VMC in study , the majority selected resolution. Furthermore, study  shows that identifying a person from a picture or video clip is very robust to resolution. One-bit-per-pixel pictures of celebrities (encoded by the Pearson and Robinson algorithm ) were recognized nearly as often as the originals¹⁰. It is therefore plausible to assume that identification of a speaker and of basic facial expressions (smile, frown, etc.) can be achieved with modest resolution and picture size.
One of the main advantages of VMC over audio-only communication should be the possibility to perceive the nonverbal signals used in communication. According to , some facial expressions are very short (some last only 200 ms), and gestures accompanying speech are even shorter (some only 50 ms). In line with these findings, framerate¹¹ significantly influences the outcome of a communication exchange: paper  shows that recognition of unknown words against background noise is affected by the framerate of the video stream. Participants with a video stream had better results than those with audio only, and increasing the framerate increased the overall score. This trend stopped at 16.7 fps, above which the score stayed constant. The authors of the study infer that VMC using at least 16.7 fps preserves the lip-reading ability we unwittingly use in face-to-face interactions. Framerate is also very important from the subjective view of participants: study  shows that VMC users rated dialogs at 25 fps more positively than those at 12 fps (the difference in resolution was not statistically significant).
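The relation between the quoted event durations and framerate can be made concrete with a little arithmetic. The sketch below assumes an idealized model in which the camera samples the scene at evenly spaced instants and ignores exposure time, encoding and network effects; the function names are hypothetical.

```python
import math

def frame_interval_ms(fps: float) -> float:
    """Time between two consecutive frames in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps

def guaranteed_frames(event_ms: float, fps: float) -> int:
    """Minimum number of frames that fall inside an event of the given
    duration, whatever the timing of the event relative to the frames."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return math.floor(event_ms * fps / 1000.0)

if __name__ == "__main__":
    for fps in (12, 16.7, 25):
        print(
            f"{fps:>5} fps: interval {frame_interval_ms(fps):5.1f} ms, "
            f"200 ms expression >= {guaranteed_frames(200, fps)} frames, "
            f"50 ms gesture >= {guaranteed_frames(50, fps)} frames"
        )
```

Under this simplified model a 200 ms expression is covered by at least two frames at 12 fps, three at 16.7 fps and five at 25 fps, whereas a 50 ms gesture is guaranteed a frame only at 25 fps, where the frame interval is 40 ms.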
As the previous paragraph suggests, the audio and video channels are mutually linked, and paper  shows that higher quality of either the audio or the video component positively influences perception of the other. Nevertheless, audio appears to be the more important part of VMC. Several studies did not confirm the anticipated advantages of VMC over audio-only communication in terms of objectively measured effectiveness (using concrete cooperative assignments such as the Map Task)¹². The subjective appraisal by participants was slightly more in favor of VMC than audio-only, but in , for example, users considered it more important to improve the audio than the video component of VMC.
The paper  even reports better task outcomes for a half-duplex audio stream than for a full-duplex one (the video stream was the same in both conditions). The authors' explanation is based on the assumption that users of the full-duplex stream wrongly took the channel to be identical to face-to-face and therefore did not adapt their communication strategies (although the amount of information obtained from the channel was smaller). Misunderstandings thus occurred and task results were impaired. Dialogs of participants using half-duplex audio had a different structure from those in the full-duplex and face-to-face conditions and were also 35% longer (in terms of both number of words and time).
As the majority of the above-mentioned papers point out, desynchronization of the sound and video channels significantly impairs effectiveness. According to , a difference of at most 80-100 ms between the channels is still acceptable. If this cannot be achieved, it is better not to have a video channel at all. A poor framerate likewise gives results that are the same as or worse than audio-only¹³. For more information consult .
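The rule of thumb in this paragraph can be written down as a simple check that a VMC client might run before enabling video. This is only a sketch: the constant and function names are hypothetical, the skew limit takes the upper end of the quoted 80-100 ms range, and treating the footnote's example of about 4 fps as a hard cut-off is an assumption.

```python
# Upper end of the 80-100 ms range quoted in the text.
MAX_AV_SKEW_MS = 100.0
# Assumption: framerates at or below the footnote's example of about 4 fps
# are treated as too poor to be worth sending.
POOR_FRAMERATE_FPS = 4.0

def should_send_video(av_skew_ms: float, framerate_fps: float) -> bool:
    """Return True if video is likely to help rather than hurt.

    av_skew_ms is the measured offset between audio and video;
    the sign (which channel leads) is ignored in this sketch.
    """
    in_sync = abs(av_skew_ms) <= MAX_AV_SKEW_MS
    smooth_enough = framerate_fps > POOR_FRAMERATE_FPS
    return in_sync and smooth_enough

if __name__ == "__main__":
    print(should_send_video(60.0, 25.0))   # synchronized, smooth video
    print(should_send_video(150.0, 25.0))  # channels too far apart
    print(should_send_video(60.0, 4.0))    # framerate too low
```

A stricter deployment could lower `MAX_AV_SKEW_MS` to 80 ms, the conservative end of the quoted range.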
A great deal of information is passed through gaze and facial expressions during communication, including signals of attention and interest, disagreement with what is being said, and the wish to speak (instead of interrupting the speaker verbally). The concept of gaze awareness designates a state in which those communicating are well aware of each other's gaze direction (the normal situation in face-to-face communication).
VMC in its usual form¹⁴ does not match face-to-face communication in terms of gaze awareness, mainly because eye contact is impossible (the standard camera placement does not allow it). Eye contact is believed to be a very important nonverbal cue [24, 6, 11, 22], especially in larger groups, where it is used to indicate the next speaker or the addressee of current remarks [10, 22]. All of these papers also agree on other functions of gaze awareness: it helps mutual understanding between speakers and facilitates turn taking as well as the perception of other nonverbal signals.
Given the broadly accepted importance of gaze awareness for communication, several technical improvements have been tested to make VMC more similar to face-to-face situations:
The first method, applicable only to communication in dyads, is the use of videotunnels: a system of monitor and camera in which eye contact is made possible by half-silvered mirrors (Figure 3). No benefits have been demonstrated experimentally; papers [11, 12] state that VMC dialogs with videotunnels contained more words and turns while achieving the same task performance as audio-only or VMC without videotunnels. Yet both studies were very short-term (participants used the VMC equipment for a maximum of 2 hours), so the results could have been influenced by a novelty effect, and longer exposure might show striking improvements. Unfortunately, no long-term studies are known to the author.
Another method is based on an isotropic layout of the equipment: each participant has more than one monitor-camera pair (one pair for each conversation partner, Figure 4). Although this setup received a very positive response from users and in some respects came very close to face-to-face, it also has serious drawbacks. Because multiple streams are sent from and to each participant, the layout is very bandwidth demanding. In addition, n(n-1) monitor-camera pairs are needed for n participants. This method could therefore be very useful for repeated discussions among three to four users, but it becomes very costly for larger groups. A detailed discussion is given in .
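The growth of equipment and bandwidth in the isotropic layout follows directly from the n(n-1) count. The sketch below is illustrative only: the function name is hypothetical, the per-stream bitrate is an arbitrary parameter rather than a figure from the cited studies, and each camera is assumed to send one separate stream to one partner.

```python
def isotropic_requirements(participants: int, stream_kbps: float) -> dict:
    """Equipment and bandwidth needed when every participant has one
    monitor-camera pair for each of the other participants."""
    if participants < 2:
        raise ValueError("at least two participants are required")
    pairs_per_site = participants - 1
    pairs_total = participants * pairs_per_site  # n(n-1)
    return {
        "pairs_per_site": pairs_per_site,
        "pairs_total": pairs_total,
        # One outgoing stream per camera, so the stream count equals the pair count.
        "streams_total": pairs_total,
        "uplink_kbps_per_site": pairs_per_site * stream_kbps,
        "total_kbps": pairs_total * stream_kbps,
    }

if __name__ == "__main__":
    for n in (3, 4, 10):
        print(n, isotropic_requirements(n, stream_kbps=500.0))
```

For three or four participants the totals are 6 and 12 pairs, which is still manageable; for ten participants the count is already 90, which illustrates why the layout is considered costly for larger groups.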
The next two studies are presented only briefly; for more information please consult the papers themselves. The GAZE Groupware System  is a VMC environment focused on mediating large group discussions. Gaze awareness is modeled by rotating a 2D video image in an artificial 3D scene. A colored circle on a virtual table (in the same 3D scene) shows the current gaze direction of each participant. The advantages of this videoconferencing setting are yet to be tested. Study  is concerned with the ability to distinguish gaze direction in dyads by fixing precise positions for the camera and the video window on the monitor.
The last system mentioned in this section is aimed at automated (offline or online) editing of videoconferencing meetings involving a co-located group. The group is seated around a table and recorded by a number of cameras. Based on the participants' actions, the algorithm determines who the current speaker is and then chooses the optimal camera. The video stream from this camera is then sent to the remote participants¹⁵. Unfortunately, only the technical aspects of this system have been evaluated so far; formal psychological experiments are yet to be done. More information about this project can be found in .
The word deixis originates in linguistics, where it describes a phenomenon in which the meaning of words depends on the conversational context (e.g., the word "I", where the context is the current speaker). In the papers cited here the word is used in a slightly different sense, as a way to share any kind of material that the group is currently discussing (e.g., paper used to draw diagrams on, a document under discussion, a blackboard or slides). From this point of view it is exceedingly important to support deixis during VMC, and this need should not be underestimated [10, 2, 23]. In addition, in the study  the deixis component was the most important one for users when they had to choose the order in which the individual components should be downgraded (when using slower machines or lower bandwidth).
Deixis in VMC systems is usually not included in the video stream component, as it can be supplied by external programs such as WBD, WB, VNC or NTE. Nevertheless, it is possible to replace the external programs with a larger number of video cameras at some or all of the endpoints (for example, one camera aimed at the speaker and a second one at the document or drawing sheet).
Although the results of the individual studies are not mutually consistent, it is still possible to point out several findings and important ideas that are generally accepted.
It seems plausible to say that when bandwidth is limited, it is more useful to focus on the quality of the audio stream or on framerate rather than on the resolution of individual frames. The optimal framerate is between 16.7 and 25 fps, at which most of the nonverbal gestures and expressions perceived in face-to-face communication are not yet lost and can be detected by participants. As I have already noted in this paper, in my personal view using VMC with very low framerates (under 0.5 fps) could also be helpful. The addition of a video component has not produced any objectively measurable advantages in tasks oriented entirely toward information exchange, where the audio stream seems to suffice. Subjective preferences for VMC among users were found in tasks based on social interaction, where the aim is to reach a compromise or to be creative (business meetings, brainstorming, etc.).
It is important to note that the majority of the aforesaid studies are based on tasks involving only two subjects, where mutual understanding is easier to achieve. As papers [10, 17, 5, 4] show, when communicating in larger groups the advantages of VMC over audio-only are more striking (such as the possibility of visually identifying the current speaker).
The placement of the video cameras also influences the final effectiveness of VMC: the camera should be set up so that not only the face but also the rest of the body and part of the surroundings (the desk with the monitor, etc.) is included in the picture. With this arrangement it is easier for participants to distinguish individual nonverbal signals as well as the current focus of attention of the remote speaker. For repeated videoconferences in groups of three to four users, the merits of the isotropic layout can outweigh the drawbacks, as described in [22, 25]. No VMC system should underestimate the need for deixis support, either by external software or by multiple cameras at some of the workplaces.
Overall, the results of the studies (and therefore the measured effectiveness of VMC as well) are biased by the fact that participants used the VMC environments only on a short-term basis (in the majority of the studies for a maximum of 4 hours), whereas audio-only communication is very common (e.g., the widespread use of mobile phones). This disparity may explain why the observed effectiveness of VMC was lower than anticipated. A related hypothesis (partially confirmed by ) expects a striking increase in the effectiveness of VMC when it is used on a long-term basis.
This project has been kindly supported by the research intent Parallel and Distributed Systems (MŠM 0021622419) and "Psychologické a sociální charakteristiky dětí, mládeže a rodiny, vývoj osobnosti v době proměn moderní společnosti" (MŠM 0021622406).
[1] Anderson, A.H. et al. (1996) 'Impact of video-mediated communication on simulated service encounters' Interacting with Computers vol 8 no 2, 193-206
[2] Anderson, A.H., O'Malley, C. et al. (2000) 'Video data and video links in mediated communication - what do users value' International Journal of Human-Computer Studies 52, 165-187
[3] Boyle, E.A., Anderson, A.H. and Newlands, A. (1994) 'The effects of eye contact on dialogue and performance in a co-operative problem solving task' Language and Speech 37, 1-20
[4] Bruce, V. (1996) 'The role of the face in communication: implications for videophone design' Interacting with Computers vol 8 no 2, 166-176
[5] Driskell, J.E., Radtke, P.H., Salas, E. (2003) 'Virtual Teams - Effects of Technological Mediation on Team Performance' Group Dynamics: Theory, Research, and Practice Vol 7, No 4, 297-323
[6] Garau, M., Slater, M., Bee, S., Sasse, M.A. (2001) 'The Impact of Eye Gaze on Communication using Humanoid Avatars' Proceedings of the SIG-CHI conference on Human factors in computing systems, 309-316. Seattle, WA USA
[7] van der Kleij, R., Paashuis, R.M., Langefeld (Anja), J.J., Schraagen, J.M.C. (2004) 'Effects of long-term use of videocommunication technologies on the conversational process' Cognition, Technology & Work Vol 6, No 1, 57-59
[8] Maruping, L.M., Agarwal, R. (2004) 'Managing Team Interpersonal Processes Through Technology: A Task-Technology Fit Perspective' Journal of Applied Psychology Vol 89, No 6, 975-990
[9] Monk, A.F., Grayson, D.M. (2003) 'Are you looking at me - Eye contact and desktop video conferencing' ACM Transactions on Computer-Human Interaction Vol 10, No 3, 221-243
[10] Monk, A.F., Watts, L. (1998) 'Some advantages of video conferencing over high-quality audio conferencing fluency and awareness of attentional focus' International Journal of Human-Computer Studies 49, 21-58
[11] O'Malley, C., Anderson, A.H., Bruce, V. et al. (1996) 'Comparison of face-to-face and video-mediated interaction' Interacting with Computers vol 8 no 2, 177-192
[12] O'Malley, C., Doherty Sneddon, G., Anderson, A.H. et al. (1997) 'Face-to-Face and Video-Mediated Communication: A Comparison of Dialogue Structure and Task Performance' Journal of Experimental Psychology Vol 3, No 2, 105-125
[13] Pearson, D.E. and Robinson, J.A. (1985) 'Visual communication at very low data-rates' Proc. IEEE 73, 795-811
[14] Sanford, A., Anderson, A.H., Mullin, J. (2004) 'Audio Channel Constraints in Video-Mediated Communication' Interacting with Computers 16, 1069-1094
[15] Te'eni, D. (2001) 'A cognitive-affective model of organizational communication for designing IT' MIS Quarterly 25, 251-312
[16] Tegze, O. (2004) Neverbální komunikace. Praha: Computer Press
[17] Thompson, L.F., Coovert, M.D. (2003) 'Teamwork Online - The Effects of Computer Conferencing on Perceived Confusion, Satisfaction, and Postdiscussion Accuracy' Group Dynamics: Theory, Research, and Practice Vol 7, No 2, 135-151
[18] Vertegaal, R. (1999) 'The GAZE Groupware System: Mediating Joint Attention in Multiparty Communication and Collaboration' Proceedings of ACM CHI'99 Conference on Human Factors in Computing Systems, 294-301. Pittsburgh, PA USA
[19] Vitkovitch, M., Barber, P. (1994) 'Effect of Video Frame Rate on Subjects' Ability to Shadow One of Two Competing Verbal Passages' Journal of Speech and Hearing Research Vol 37, 1204-1210
[20] Vybíral, Z. (2005) Psychologie komunikace. Praha: Portál
[21] Watson, A., Sasse, M.A. (1996) 'Evaluating audio and video quality in low-cost multimedia conferencing systems' Interacting with Computers vol 8 no 3, 255-275
[22] Werkhoven, P.J., Schraagen, J.M., Punte, P.A.J. (2001) 'Seeing is believing: communication performance under isotropic teleconferencing conditions' Displays 22, 137-149
[23] Webster, J. (1998) 'Desktop videoconferencing - Experiences of complete users, wary users, and non-users' MIS Quarterly 22, 257-286
[24] Argyle, M. (1971) The Psychology of Interpersonal Behavior. Harmondsworth: Penguin Books
[25] Olson, J.S., Olson, G.M. (1995) 'What Mix of Video and Audio is Useful for Small Groups Doing Remote Real-time Design Work?' Proceedings of ACM CHI'95 Conference on Human Factors in Computing Systems, 362-368. Denver, CO
[26] Sumec, S. (2004) 'Multi Camera Automatic Video Editing' Proceedings of ICCVG 2004, 935-945. Warsaw, PL
(1) VMC is an abbreviation for Video Mediated Communication.
(2) The exact meaning will be defined later in the appropriate chapter.
(3) The concept of media richness is also often used: each mode of communication is described by the channels over which information flows during the conversation. Interesting papers comparing VMC with other modes of communication (email, phone, chat, ...) are [15, 5, 8].
(4) A very good introduction to this field is given by  or .
(5) When comparing different communication environments, one must always keep in mind that the context of communication (e.g., type of task, number of participants, chosen effectiveness metrics, etc.) can dramatically alter the results of evaluating individual characteristics.
(6) An example is communication between two or more remote teams where a monitor is shared by all team members and the whole conference room is filmed by one camera.
(7) This has been partially proven by the paper .
(8) Papers [7, 25] use triads, and part of the experiment  used communication between two pairs of participants.
(9) The smallest window was approximately 9 x 11.5 cm and the enlarged one 16.5 x 20.5 cm. The exact resolution is, unfortunately, not mentioned in the papers.
(10) Original color pictures were recognized with 100% success, photos encoded by the aforesaid algorithm with 93%.
(11) Framerate quantifies how quickly the picture changes during a video stream. For example, when you watch a television program in Europe, the picture changes approximately 25 times per second, so the framerate of the program is 25 fps.
(12) The papers concerned are [1, 12, 10, 11].
(13) For example, values around 4 fps. Nevertheless, an experimentally unconfirmed hypothesis is that below certain framerates (e.g., 0.5 fps and lower) the situation improves: the video channel is then not used as a substitute for the classical face-to-face view, but it provides information about the "long-term behavior" of the other speaker (such as whether he is looking something up in a book, giving all his attention to something not relevant to the current dialog, or listening very carefully to what is being said).
(14) That is, a setup with one or more small video windows on the monitor and a camera placed on the table or in the vicinity of the screen.
(15) The final outcome is therefore very close to watching a television discussion program: the speaker is shown most of the time and once in a while the camera view changes to show the reactions of other participants.