Establishing observer reliability in the context of dance performance assessment

Abstract
In the context of dance education, either as part of physical education or on its own, the assessment of dance performance is used as a benchmark of the students’ progress. Although reliability is a critical feature of this process, the majority of research projects are restricted to reporting reliability statistics without providing guidelines for establishing reliability through sound training of the judges. The aim of the present paper is to present the major issues and procedures of judge training and reliability in a research project assessing the performance of Greek traditional dancers. The research project was conducted with the voluntary participation of four university dance students of the Faculty of Physical Education and Sport Science, University of Athens, Greece, who were first video-recorded, and then observed and assessed by two judges during the performance of the dance “karsilamas Aise”. The methodological design included special training sessions for the judges regarding the functional definitions of the assessment categories and criteria (Hawkins, 1982), as well as the ways of observation and recording (Kazdin, 1977; Reid, 1982). The results showed that the actual training of observers whose task is to employ and apply recording techniques in dance is of major importance in the development of valid and reliable assessment instruments.
In the context of dance education, either as part of physical education or on its own, the assessment of dance performance is used as a benchmark of the students’ progress. Dance performance is regarded as the materialized image of the students’ progress, as well as the result of the applied teaching methods, and refers to their ability to understand and reproduce dance choreographies or dance segments successfully, according to predefined standards. For these reasons, many books and study guides have been published, while several other resources on assessing dance in education provide the necessary instructions. At the same time, many countries (e.g. Australia, the United Kingdom, the United States) adopt national standards for the evaluation, comparison and recording of dance performance, which concern all types of dance. These standards address schools and educational institutions, recommending what students should learn and what the expectations are depending on their developmental level (Oreck, 2007).
These are definite aims that align with the study curriculum, as well as a number of measurable results pertaining to dance technique, dance aesthetics, dance kinesiology, choreography, dance history and terminology, and dance criticism (Carter, 2004). The review of the relevant literature shows that, either nationally or independently, the most common way of assessing student performance is the use of rubrics consisting of qualitative and/or quantitative criteria for the analysis of dance movement by expert judges-observers (Office for Standards in Education – England, 2002; Alaska Department of Education, 2008; Slettum, 1998; Warburton, 2000; Oreck, Owen and Baum, 2004; Dania, 2009; Krasnow and Chatfield, 2009; Chatfield, 2009). These criteria are suggested either as guidelines for leading dance performance to the desirable educational goals or as methods for the measurement and analysis of direct observational dance data (State Collaborative on Assessment and Student Standards, USA).
A rubric is a two-dimensional table/scoring system that a) includes ‘dimensions’ (the valued skills to be scored) and ‘indicators’ (descriptions of excellent, good, fair and poor performance on each dimension) and b) allows complex performances to be evaluated quantitatively (Warburton, 2002). Regardless of the sophistication of coding systems and analytic techniques, the extent to which accurate recordings of naturally occurring events can be collected depends on the observers who make them (Reid, 1982). The actual training of observers who try to employ and apply recording techniques is of major importance. Reliability is a critical feature of this process and is measured by the degree to which two judges using the same coding procedures and viewing the same activities agree in their codings (Van der Mars, 1989).
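The rubric structure described above can be sketched in a few lines of code. This is only an illustrative sketch: the dimension names and the numeric values attached to each indicator level are hypothetical, not taken from any of the instruments cited here.

```python
# A rubric as a two-dimensional scoring table: 'dimensions' are the
# valued skills, 'indicators' describe levels of performance on each
# dimension. All names and point values below are illustrative.

INDICATORS = {"excellent": 4, "good": 3, "fair": 2, "poor": 1}

def score_rubric(ratings):
    """ratings maps each dimension to an indicator label;
    returns the total quantitative score for the performance."""
    return sum(INDICATORS[level] for level in ratings.values())

ratings = {"technique": "excellent", "use of space": "good", "musicality": "fair"}
total = score_rubric(ratings)  # 4 + 3 + 2 = 9
print(total)
```

Mapping each indicator to a number is what allows a complex performance to be summarized quantitatively, as Warburton (2002) notes.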
Research data on the use and implementation of scoring systems in dance education reveal that the majority of research projects are restricted to reporting reliability statistics without providing guidelines for establishing reliability through sound training of the judges (Looney and Heimerdinger, 1991; Krasnow, Chatfield, Barr, Jensen and Dufek, 1997; Minton and McGill, 1998; Slettum, 1998; Oreck, Owen, and Baum, 2004). According to the findings of the detailed report of Bonbright and Faber (2004) to the NDEO, titled Research Priorities for Dance Education, the majority of reliability tests in dance do not concern assessment processes or instruments, but mainly the assessment of educational curricula.
Therefore, the aim of the present paper is to present the major issues and procedures of judge training and reliability of observation in a research project assessing the performance of Greek traditional dancers. The research project was conducted with the voluntary participation of four university dance students (aged 22 years) of the Greek traditional dance major at the Faculty of Physical Education and Sport Science, University of Athens, Greece. Each one was video-recorded during the performance of a Greek traditional dance called “karsilamas Aise”. These performances were observed and assessed by two judges on the basis of a list of 35 dance performance assessment criteria grouped into categories according to the theory of Laban for the analysis of movement. Focusing on dance only at the level of dance form (structure and style) and studying it as a kind of motor activity independently of its temporal and sociocultural context, the judges’ training and reliability involved the following procedures:
- Determination of the parameters of dance performance that would be observed and measured.
- Development of the observational manual.
- Selection and training of the judges.
- Tests of the content validity.
- Recording methods and data collection.
- Tests of intra- and interobserver reliability.
According to the NDEO (National Dance Education Organization), much of the body of work on dance education conducted from 1926 to the present «...relies mainly on the use of observations, case studies, and persuasive writing from committed dance educators…» (Carter 2004, p.5), lacking validity and reliability tests. Furthermore, assessment often seems to be carried out through judgments based on the teacher’s personal experience and expertise (e.g. dance contests, dance schools, etc.). However, the process of systematic observation and assessment in dance involves more than using a scoring system to collect data on certain quantitative and/or qualitative aspects of performance. For a representational activity like dance, where the quality of performance determines the levels of accomplishment (Slettum, 1998), the development of assessment instruments that can reliably record the students’ ability to reproduce the structure and style of any given dance movement remains a challenge and a necessity. Such projects, though, ought to be preceded by a number of steps that ensure that the data to be collected will be reliable, accurate and valuable for dance researchers and educators (Van der Mars, 1989).
2.1. Determining the parameters of dance performance to be observed and measured
Any discussion of observer/judge selection and training should begin with the specification of the coding system to be used, and specifically the kind of behaviors to be observed, analyzed and assessed (Reid, 1982; Van der Mars, 1989). Obviously, these behaviors must comprise a fair representation of the behavior pattern being studied (Hawkins, 1982), which in the present case is no other than dance performance. The present research did not treat dance as an art form that must be judged only in aesthetic terms. On the contrary, dance was analyzed both as the observable result of the structural rules of its morpho-syntactic elements and properties, and as a composite form of human energy expressed through various combinations of time-space schemas. For this reason, dance performance was judged in relation to the total composition of both its structural and qualitative characteristics.
According to Moskal and Leydens (2000), in order to ensure the appropriateness of a coding list, the researcher should clarify the aim and expected results of assessment and select those criteria that will explicitly reflect the studied behavior. From this perspective, the theory of Laban for the analysis of movement, Labananalysis (Johnson Jones, 1999, p.102-103; Koutsouba, 2005, p.48), was selected as the most suitable theoretical framework for the development of the categories and criteria to be analyzed and observed.
The basic advantage of this method is that it goes beyond the static description of body transport and its parts (Maletic, 1987), putting forward the most important structural elements of movement (e.g. the laterality and the symmetry of body, the directions in relation to the vertical axis, the three dimensional shape of the moving body, the identification of movement as a time sequence with a start, a middle and an end), and recognizing the body movement as a process during which the effort and the qualities of space are constantly changing (Cohen, 1978; Freedman, 1991).
Through an extensive literature review of research projects that used the Laban theory for the analysis and assessment of the quantitative (Labanotation) and/or the qualitative (Effort/Shape) parameters of local dances or dance choreographies (i.e. Hackney, 1968; Kagan, 1978; Pforisch, 1978; Cohen, 1978; Davis, 1987; Freedman, 1991; McCoubrey, 1984; Bartenieff, Hackney, True Jones, Van Jile, & Wolz, 1984), seven assessment categories were determined. The first six were Body, Time, Space, Weight, Shape and Flow (according to the theory of Laban), and the seventh was General Impressions (GEN), meaning the judges’ general impressions of the dancers’ overall performance. Each category was broken down into assessment criteria/parameters of dance performance such as: movement duration in relation to the rhythmical schema, effortless action of the moving body in relation to time, width of space that is covered in relation to the required space of movement, accurate use of general space: orientation of movement in space, etc.
2.2. Development of the observational manual
A carefully constructed observational manual is very important for the specification of the behaviors to be observed. The observation manual includes the definitions of the code categories and criteria, instructions relative to the functional value of the assessment tool, issues regarding the ways of observation and recording, as well as issues that can affect the quality of data collection, such as courtesy or dress and the avoidance of making inferences or hypotheses (Reid, 1982; Kazdin, 1977). Definitions should be meaningful, descriptive and replicable, focusing either on the topography of behavior (i.e. its form, meaning the movements that make up this behavior) or on its function (i.e. the movement’s outcome) (Van der Mars, 1989). In the present research, functional definitions were given for each category and separate element in such a way that they could be used as directions for the understanding of the assessment criteria.
In particular, these definitions a) were structured according to the theoretical frame, b) referred to observable elements of the investigated movement behavior and included typical examples of each dance performance parameter (Reid, 1982), c) included examples of both occurrence and non-occurrence of the movement behaviors, and d) defined limits relative to the range of the judges’ accepted responses (Hawkins, 1982). According to the functional definitions, observable criteria for rating the different levels of dance performance were devised, representing different levels of dance ability. In particular, the arithmetical values 0 and 1 were set for each element, equating 1 with a successful performance consistent with the defined standard and 0 with a poor performance not consistent with the standard. The performance standard was the result of the morphological analysis (structure and style) of the selected folk dance (“karsilamas Aise”), according to which the typology of the dance was determined (Martin and Pessovar, 1961, 1963; Tyrovola, 1994, 2001). The authors kept in mind that a comprehensive and clear observation manual would facilitate the judge training sessions, increasing the possibility of reaching high levels of interobserver agreement.
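The 0/1 scoring rule described above can be expressed compactly. A minimal sketch follows; the criterion names are hypothetical placeholders, since the actual criteria belong to the authors' instrument.

```python
# Each criterion is marked 1 when the observed performance matches
# the defined standard, 0 when it does not. A category's score is
# therefore the count of successfully met criteria.
# Criterion names below are illustrative placeholders only.

def score_category(observations):
    """observations: dict of criterion -> bool (standard met?).
    Returns the category score (number of criteria scored 1)."""
    return sum(1 if met else 0 for met in observations.values())

body = {"posture": True, "laterality": True, "weight transfer": False,
        "step pattern": True, "arm coordination": True}
print(score_category(body))  # prints 4 (out of a possible 5)
```

With five criteria per category, as in the second version of the instrument, a category score ranges from 0 to 5.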
2.3. Selection and training of the judges
The selection, training and periodic retraining of the judges during all the stages of a research project are crucial factors for the accomplishment of the desired outcomes. The observers/judges that produce the most accurate and reliable data are usually bright and highly motivated people (Dancer et al., 1978). For this reason, four dance experts were selected as the research judges/observers, all of whom exhibited the following characteristics (Reid, 1982; DeVellis, 2003):
• They were all experts on the theory of Laban. Moreover, they had all been trained by the same tutor and could detect and distinguish successfully the suggested categories of movement analysis.
• They were all graduates of the Faculty of Physical Education and Sport Science, University of Athens (F.P.E.S.S), Greece, with specialization in Greek traditional dance.
• They were all active dancers and dance teachers.
• They were responsible, enthusiastic and highly motivated as far as the scientific improvement of the subject was concerned.
All four judges were given code numbers from 1 to 4 and afterwards practiced and trained thoroughly according to the principles of the methodological design. During three 3-hour sessions, they were first introduced to the aim of the research and the functional value of the assessment tool, and were informed on several issues regarding the ways of observation and recording (Reid, 1982). However, they were not given any information regarding the research hypotheses (Kazdin, 1977).
The authors wanted to ensure that the definitions of each category and criterion were learned verbatim, and for this reason the judges observed and discussed videotapes of dancers that were not in the study sample. These videotapes provided simple sequences of dance performance (Van der Mars, 1982) and were discussed thoroughly according to the guidelines of the observation manual. As the judges gained experience, the videotapes became more complex, reflecting all the parameters included in the coding system. The comparison of their recordings with those of a criterion judge was deemed an important source of feedback. During the research project, the judges practiced and trained periodically (Kazdin, 1977), were tested unexpectedly for their reliability, and were continuously reinforced and motivated (Reid, 1982), so that their performance would be maintained at high levels, avoiding possible biases.
2.4. Tests of the content and construct validity
Content validity represents the extent to which a sample of subjects, criteria or questions in a methodological design covers suitably and adequately a given section of interest (e.g. a desirable behavior) (Thomas and Nelson, 2003). According to Haynes et al. (1995, p. 248) «…during the process of developing research tools, the basic purpose of testing their content validity is to minimize possible error variance and increase the possibility of obtaining acceptable indexes of validity in future researches that would use these tools…».
For this reason, after the training session the judges were asked to note a) how accurately each distinct criterion expresses the category to which it belongs, b) how well each single element is grouped so as to represent the tested assessment categories (Fitzpatrick, 1983), and c) how specialized and accurate the phrasing of the functional definitions is (Haynes et al., 1995). Taking into consideration the judges’ opinions, remarks and comments, the authors devised the first version of the assessment instrument, which included the following categories and criteria: Body 10 assessment criteria, Time 6, Space 7, Weight/Force 3, Shape 4, Flow 5, and GEN 4. Afterwards, the judges watched video-recorded performances of the selected dance in order to discuss and make comments based on the first version (Reid, 1982; Van der Mars, 1989). Keeping in mind that its final version should be operational and easy to use (Hawkins, 1982; Van der Mars, 1989), the judges suggested that the categories should be presented on a single page and that each one should include the same number of assessment criteria (five per category).
Consequently, the authors devised the second version of the instrument, which included seven categories (Body, Time, Space, Weight/Force, Shape, Flow and GEN) with five criteria each. The maximum score that a dancer could get was 5 per category and 35 for the sum of the categories. The judges agreed that the assessment tool included the most appropriate assessment criteria and that its design was in accordance with the theory of Laban (Moskal and Leydens, 2000; Stemler, 2001). Construct validity tests the extent to which the results of a measurement procedure are in accordance with the underlying theoretical framework (Cronbach & Meehl, 1955; Thomas & Nelson, 2003; DeVellis, 2003). In the present research, construct validity was tested through the stages suggested by Cronbach & Meehl (1955):
1. At the beginning, the term “dance performance” was defined and correlated with the rules of the Laban theory framework, according to which it would be assessed.
2. The research hypotheses were clearly stated and compared with the empirical results so as to ascertain that dance performance was operationally defined.
2.5. Recording methods and data collection
Videotaping was used as the most appropriate method for the collection of the research data (Van der Mars, 1989; Thomas & Nelson, 2003). Four university dance students of the Greek traditional dance major at the Faculty of Physical Education and Sport Science, University of Athens, Greece were video-recorded during the performance of the Greek traditional dance “karsilamas Aise”. The interval method was used for recording their dance performance (Hawkins, 1982; Van der Mars, 1989), since according to Hawkins (1982) it permits the recording of many behaviors simultaneously without a decrease in the levels of interobserver agreement. In accordance with the relevant literature, two of the four judges (judge No1 and judge No2) were asked to evaluate, twice with a 36-hour interval (test and retest) and in a different order each time, the dance performance of the study sample, relying on the second version of the assessment instrument.
As a result, the two judges would observe and record 140 separate movement behaviors, which can be considered a satisfactory reliability sample (Van der Mars, 1989). The intervals of observation-recording were set at 10 sec each, meaning that each judge would observe the video for ten seconds, then pause and record the dancer’s performance during the next 10 sec. Each category was observed and recorded separately; the observation started with the Body category (1st ten seconds) and proceeded with the categories Time, Space, Weight/Force, Shape, Flow and GEN. In each interval, the observer rated the dancer’s performance criterion by criterion, recording 1 when the performance was consistent with the criterion standard and 0 when not. Each dancer was recorded separately, while no judge was allowed to discuss or compare his recordings with another. The basic aim was to check the judges’ abilities and to detect possible weaknesses of the methodological design.
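The interval schedule described above can be laid out programmatically. The sketch below assumes, as the text states, one 10-second observe interval followed by one 10-second record interval per category, taken in the fixed category order; the exact timing layout per dancer is otherwise an illustrative assumption.

```python
# Sketch of the 10-second interval-recording schedule: for each
# category in the fixed order given in the text, the judge observes
# for one interval and then records during the next interval.

CATEGORIES = ["Body", "Time", "Space", "Weight/Force", "Shape", "Flow", "GEN"]
INTERVAL = 10  # seconds

def build_schedule():
    """Return (start_sec, end_sec, phase, category) tuples for one dancer."""
    schedule, t = [], 0
    for cat in CATEGORIES:
        schedule.append((t, t + INTERVAL, "observe", cat))
        t += INTERVAL
        schedule.append((t, t + INTERVAL, "record", cat))
        t += INTERVAL
    return schedule

for start, end, phase, cat in build_schedule():
    print(f"{start:3d}-{end:3d}s  {phase:7s}  {cat}")
```

Seven categories with paired observe/record intervals yield 14 intervals (140 seconds) per dancer under these assumptions.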
2.6. Tests of intra- and interobserver reliability
According to Van der Mars (1989), the majority of methodological designs that employ systematic observation use the percentage of the observers’ agreement as an index of reliability that determines the degree of consistency between observations and recordings. Reliability relates to the consistency or the repeatability of a measurement and refers to the degree to which a tool or a measurement procedure gives the same results when repeated (Thomas and Nelson, 2003). There are two types of observer reliability/agreement:
1. Interobserver agreement, which is «…the agreement between the records of two different observers as far as the observation of a particular behavior is concerned…» (Kazdin 1977, p.141).
2. Intraobserver agreement, which is «…the agreement between the records of the same observer as far as the observation of a particular behavior at two different moments is concerned…» (Van der Mars 1989, p.54). In the present research, the percentage of both intraobserver and interobserver agreement per category and their total sum (average) was used as a measure of reliability, according to the following formula:

Agreement (%) = [Agreements / (Agreements + Disagreements)] × 100
The accepted level per category and for their total sum was set at or above 70% (Van der Mars, 1989).
The results are analytically presented in Table 1. The explanation of terms is as follows: INTRA 1 = percentage of agreement between the two scores of judge No1, INTRA 2 = percentage of agreement between the two scores of judge No2, J1J2T = percentage of agreement between the scores of judges No1 & No2 for the first scoring session (test), J1J2R = percentage of agreement between the scores of judges No1 & No2 for the second scoring session (retest), D1-D4 = dancers 1-4.

Table 1. Percentages of intra- and interobserver agreement (judges No1 and No2)
After the assessment process was completed, it turned out that the judge with code No2 demonstrated difficulty with the use of the assessment criteria. DeMaster, Reid and Twentyman (1977) report that it is possible for observers to change their coding criteria over time. Despite the fact that this judge was highly motivated, he remained so influenced by his personal preferences and scientific specialization that the percentages of his intraobserver agreement were below the accepted level for some of the assessment criteria. For this reason, the authors decided that he should be replaced by another judge, who, after a random selection, happened to be the judge with code No3.
Furthermore, due to the elements of structure and style of the “karsilamas Aise”, the phrasing of many assessment criteria seemed to confuse the judges, who were not sure about what exactly they should record and how. Consequently, a new training session was held, during which judges No1 and No3 (J1 & J3) were introduced again to the functional definitions (Reid, 1982) and trained to use the modified assessment criteria. Additionally, the authors came to the following decisions:
1. The dance performance assessment instrument should be developed on the basis of the six categories that are set by the theory of Laban (Body, Time, Space, Weight/Force, Shape and Flow).
2. The scores of each participant in these six categories would be summed in order to calculate the Total Index (TI) of dance performance. A TI of 30 should mean that the dancer’s performance is excellent.
3. The GEN category would be recorded as a separate category, not included in the TI, something that would facilitate future statistical comparisons.

Keeping in mind that the instrument’s validity should be retained and its reliability reexamined, the authors decided that a criterion judge (CJ) should also assess the participants. The recordings of the latter would be used as a point of reference for checking the two judges’ interrater agreement (Reid, 1982). For this reason, during a new six-hour training session, the two judges and the criterion judge evaluated afresh the sample participants, according to the modified version of the dance assessment instrument (Table 2). The results of the intraobserver agreement (intra 1 & intra 3) and interobserver agreement (J1J3) of the two judges, as well as their interobserver agreement with the criterion judge (J1CJ & J3CJ), are shown in Tables 3 and 4.

Table 2. Modified version of the dance performance assessment instrument
Table 3. Percentages of intraobserver agreement for Dancers D1, D2, D3, D4 (judges No1 and No3)
Table 4. Percentages of interobserver agreement for Dancers D1, D2, D3, D4 (judges No1, No3 and criterion judge CJ)
After completing this last stage of the research, the two judges as well as the criterion judge showed percentages of agreement between 80%-100% for all the categories, a fact which implied that the particular assessment instrument can be used as a valid and reliable method for the evaluation of dance performance.
4. DISCUSSION-CONCLUSIONS

The assessment of the students’ ability to reproduce a dance (irrespective of which dance it is or whether it is danced with specific steps or not) is deemed an integral part of the educational process. This can be attributed to two factors: on the one hand, the dilemma that dance instructors face in their search for methods suitable for the thorough and effective evaluation of their students’ abilities and progress, and on the other hand, the need for the continuous upgrading and specialization of dance students’ and teachers’ knowledge and technique.
The present research took up the challenge of treating assessment as an integral part of the educational process and of empirically testing the list of criteria that was suggested according to a widely accepted theoretical framework. The results show that the suggested instrument is valid and reliable and can be used (after special training) by dance teachers who wish to acknowledge explicitly and reliably all those elements that are prerequisites for a good dancer. This is extremely important considering that nowadays the number of research studies dealing with the development of dance performance assessment instruments is limited.
Such tools would certainly give dance teachers and researchers the opportunity to share, compare and study results and methods (Oreck et al., 2004; Oreck, 2007; Looney and Heimerdinger, 1991). Dance performance is undoubtedly a complex concept related to both quantitative and qualitative aspects and elements. To date, the available assessment instruments, selecting either or both of these aspects, have focused mainly on the presentation of reliability indexes.
Reliability is certainly a major issue in the process of systematic observation. However, it is important for researchers to remember that the process of its establishment, which is no other than the stage-by-stage training of the observers/judges, is a prerequisite of both reliability and validity. The credence accorded to estimates of intra- and interobserver agreement, independently of the computation method, presupposes eliminating sources of bias that can spuriously affect agreement (Kazdin, 1977).
The sound training of the observers brings to light possible human biases and/or vague phrasings of the assessment criteria, sources of artifact, and characteristics of assessment that influence the interpretation of agreement, redefining in this way the instrument’s validity and reliability. In order for the suggested assessment instruments to be used in realistic settings, and particularly in the reality of the dance classroom, they have to be brief, concise, accurately phrased and easy to use.
The present research worked thoroughly on the establishment of the instrument’s validity and reliability, ending up with the development of thirty-five theoretically supported quantitative and qualitative assessment criteria. Future research projects could further elaborate on these criteria with the aim of further simplifying the already tested valid and reliable measures. Only the systematic observation and recording of dance performance according to predefined standards can give dance instructors the opportunity to distinguish which methods are more appropriate for the improvement of instruction and the strengthening of student learning. The exploration of dancing space, movement dynamics and the elements of the moving body’s anatomy does not constitute an innovation in the field of dance. It is the integration of these elements into new methods of organizing and teaching the subject of dance (the use of dance performance assessment tools) that makes them indispensable and effective in the process of visualizing what appears to be only an inner experience.
REFERENCES

Alaska Department of Education & Early Development. (2008). The Arts Framework: Content and performance standards for Alaska students, http://www.eed.state.ak.us/tls/Frameworks/arts/2table.htm.

Bartenieff, I., Hackney, P., True, J. B., Van Jile, J., and Wolz, C. (1984). The potential of movement analysis as a research tool: A preliminary analysis. Dance Research Journal, 16 (1): 3-26.
Bonbright, J. M., and Faber R. (2004). Research priorities for dance education: A report to the nation. Bethesda, MD: National Dance Education Organization.
British Columbia Ministry of Education. (1998). Appendix D: Assessment and evaluation.
Carter, C.S. (2004). Effects of formal dance training and education on student performance, perceived wellness, and self-concept in high school students. Doctoral dissertation, University of Florida.
Chatfield, S.J. (2009). A test for evaluating proficiency in dance. Journal of Dance Medicine Science, 13(4): 108-114.
Cohen, R.L. (1978). An introduction to Labananalysis: Effort/Shape. CORD Dance Research Annual, IX: 53-58.
Cronbach L.J. and Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52: 281-302.
Dancer, D.D., Braukmann, C. J., Schumaker, J. B., Kirigin, K. A., Willner, A. G., and Wolf, M. M. (1978). The training and validation of behavior observation and description skills. Behavior Modification, 2: 113-134.

Dania, A. (2009).
Davis, M. (1987). Steps to achieving observer agreement: The LIMS Reliability Project. Movement Studies: Observer Agreement, 2: 7-19.
DeVellis, R.F. (1991). Scale development: Theory and applications (2nd ed.). Applied Social Research Method Series, Sage Publications.
Fitzpatrick, A.R. (1983). The meaning of content validity. Applied Psychological Measurement, 7(3): 3-13.
Freedman, D.C. (1991). Gender Signs: An Effort/Shape Analysis of Romanian Couple Dances. Studia Musicologica Academiae Scientiarum Hungaricae, 33: 335-345.
Hackney, P. (1968). The style and movement of Merce Cunningham: a pilot study. Effort/Shape Certification Project. Dance Notation Bureau.
Hartmann, D.P. (1982). Assessing the reliability of observational data. In: D.P. Hartmann (Ed.), Using observers to study behavior. New Directions for Methodology of Social and Behavioral Science. Jossey-Bass, San Francisco, pp. 51-65.
Hawkins, R.P. (1982). Developing a behavior code. In: D.P. Hartmann (Ed.), Using observers to study behavior. New Directions for Methodology of Social and Behavioral Science. Jossey-Bass, San Francisco, pp. 21-35.
Haynes, S.N., Richard, D.C.S. and Kubany, E.S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7 (3): 238-247.
Johnson, J.J. (1999). The choreographic notebook: choreographing process of the Kokuma dance theatre, an African-Caribbean dance company. In: Theresa Buckland (Ed.), Dance in the Field: Theory, Methods and Issues in Dance Ethnography. MacMillan Press, London and New York, pp.100-110.
Kagan, E. (1978). Towards the analysis of a score. A comparative study of Three Epitaphs by Paul Taylor and Water Study by Doris Humphrey. CORD Dance Research Annual, 1: 75-94.
Kazdin, A.E. (1977). Artifact, bias, and complexity of assessment: The ABC's of reliability. Journal of Applied Behavior Analysis, 10(1): 141-150.
Kent, R.N., Kanowich, J., O'Leary, K.D. and Cheiken, M. (1977). Observer reliability as a function of circumstances of assessment. Journal of Applied Behavior Analysis, 10(2): 317-324.
Koutsouba, M. (2005). Notation of dance movement. The passage from the pre-history to the history of dance. Athens: Propompos.
Krasnow, D.H., Chatfield, S.J., Barr, S., Jensen, J.L. and Dufek, J.S. (1997). Imagery and conditioning practices for dancers. Dance Research Journal, 29(1): 43-64.
Krasnow, D.H. and Chatfield, S.J. (2009). Development of the "performance competence evaluation measure": assessing qualitative aspects of dance performance. Journal of Dance Medicine & Science, 13(4): 101-107.
Laban, R. (1960). Mastery of movement (2nd ed.). Macdonald & Evans, London.
Laban, R. (1988). The mastery of movement. Athenaum Press Ltd, Tyne & Wear, Gateshead.
Looney, M.A. and Heimerdinger, B.M. (1991). Validity and generalizability of social dance performance ratings. Research Quarterly for Exercise and Sport, 62: 399-405.
Maletic, V. (1987). Body-Space-Expression. Mouton de Gruyter, New York.
Martin, G. and Pessovár, E. (1961). A Structural Analysis of the Hungarian Folk Dance. Acta Ethnographica, 10: 1-40.
Martin, G. and Pessovár, E. (1963). Determination of Motive Types in Dance Folklore. Acta Ethnographica, 12(3-4): 295-331.
McCoubrey, C. (1984). Effort observation in movement research: An interobserver reliability study. Msc diss., Hahnemann University, Philadelphia, U.S.A.
Minton, S. and McGill, K. (1998). A study of the relationships between teacher behaviors and student performance on a spatial kinesthetic awareness test. Dance Research Journal, 30(2): 39-52.
Moskal, B.M. and Leydens, J.A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research & Evaluation, 7(10).
NAEP. (1997). Arts education consensus project. National Assessment Governing Board U.S. Department of Education.
National Dance Education Organization. (2005). Standards for Learning and Teaching Dance in the Arts: Ages 5-18.
Office for Standards in Education (2002). Inspecting dance 11–16 with guidance on self-evaluation.
Oreck, B.A., Owen, S.V., and Baum, S.M. (2004). Validity, reliability and equity issues in an observational talent assessment process in the performing arts. Journal for the Education of the Gifted, 27(2): 32–39.
Oreck, B. (2007). To see and to share: Evaluating the dance experience in education. In: L. Bresler (Ed.), International Handbook of Research in Arts Education. Springer, pp. 341-356.
Pforsich, J. (1978). Labananalysis and dance style research: a historical survey and report of the 1976 Ohio State University research workshop. CORD Dance Research Annual, IX: 59-74.
Reid, J.B. (1982). Observer training in naturalistic research. In: D.P. Hartmann (Ed.), Using observers to study behavior. New Directions for Methodology of Social and Behavioral Science. Jossey-Bass, San Francisco, pp. 37-51.
Slettum, B.S. (1998). Validity and reliability of a folk dance performance checklist for children. Msc diss., Northern Illinois University.
Stemler, S.E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4).
Thomas, J.R. and Nelson, J.K. (2003). Research methods in physical activity. Trans. K. Karteroliotis. Athens: Paschalidis.
Tyrovola, V. (1994). [The dance “Sta tria” in Greece. Structural–Morphological and Typological Approach]. PhD diss. University of Athens, Greece.
Tyrovola, V. K. (2001). Greek folk dance. A different approach. Athens: Gutenberg.
Van der Mars, H. (1989). Observer reliability: Issues and procedures. In: P.W. Darst, D.B. Zakrajsek and V.H. Mancini (Eds.), Analyzing physical education and sport instruction. Human Kinetics, Champaign, IL., pp. 53-81.
Warburton, E.C. (2002). From talent identification to multidimensional assessment: toward new models of evaluation in dance education. Research in Dance Education, 3(2): 103-121.