Psychometric Properties of Student Evaluation of Teachers’ Performance Scale: Evidence from Debre Markos University Students’ Evaluation Dataset

Data collected from 1397 students were used for this analysis. Cronbach’s alpha values and average interitem correlation were used to study the internal consistency reliability of the scale. Composite reliability, average variance extracted, heterotrait-monotrait ratio, maximum shared variance, average shared variance, and interconstruct correlations were used to assess the construct validity of the scale. Exploratory factor analysis and confirmatory factor analysis were performed with 20 items to test the hypothesized four-dimensional construct of the teachers’ evaluation scale. We used different goodness-of-fit indices to measure the fit of the models. Results. The scale was shown to have good internal consistency and convergent validity but lacked discriminant validity. Furthermore, confirmatory factor analysis indicated that the four-factor model produced inadequate fit indices, revealing that the original factor structure of the scale changed. Conclusions. The results showed that the Student Evaluation of Teaching Effectiveness scale did not measure what it was supposed to measure. Moreover, the exploratory and confirmatory factor analysis results indicate that a two-dimensional model explains the data structure better than the four-dimensional model, which places limitations on the use of the scale.


Introduction
Reliability and validity, jointly called the "psychometric properties" of measurement scales, are the two most important and fundamental features in the evaluation of any measurement scale [1][2][3][4]. Evidence of validity and reliability is a prerequisite for assuring the integrity and quality of a measurement scale [5].
Student evaluation of teachers' effectiveness (SETE) is commonly used to measure teaching performance and accountability by various universities across the globe [6,7]. If carefully developed and systematically used, teacher evaluation is believed to have the potential to enhance teachers' professional development, thereby improving students' achievement [8]. Hence, the scales used to assess teacher performance should be accurate and exhaustive, allowing the results to provide useful information about teachers' teaching effectiveness. The effectiveness of an education system largely depends on the effectiveness of its teachers, which in turn has a large influence on student learning [9]. As a result, measuring teachers' effectiveness is an important vehicle for promoting educational quality [9][10][11], which in turn enhances the quality of graduates [12].
Currently, student evaluation of teaching effectiveness is a common practice in almost every institution of higher education globally [13]. Over the years, different SETE scales have been proposed and developed. Consequently, numerous well-designed and validated instruments are available to measure higher education teachers' teaching effectiveness [7], such as the Students' Evaluation of Teaching Effectiveness Rating Scale [14], the Student Course Experience Questionnaire [15], the questionnaire for student evaluation of teaching [16], and the Teaching Proficiency Item Pool [17]. The development of the SETE scale is an ongoing process that aims at a psychometrically sound scale for measuring teacher effectiveness in higher education, taking into account the changing characteristics of effective teaching.
The variations in the content and the number of dimensions are attributed to the absence of agreement concerning the number and nature of these dimensions, which should be based on both theory and empirical testing [7]. The difficulty of identifying the characteristics of effective teaching, which is a prerequisite for the construction of SETE scales, is another possible reason for the variation among SETE scales. Moreover, different institutions have different educational visions and policies and therefore develop SETE scales that are consistent with their preferences.
In the search for educational quality in Ethiopia, various attempts have been undertaken to generate meaningful and accurate indices of teacher effectiveness [32]. To this end, the Ministry of Science and Higher Education (formerly Ministry of Education) in Ethiopia identified four competencies of teacher effectiveness that served as the conceptual basis for this study: subject matter knowledge (core competency), professional competency, ethical competency, and time management [33]. The first two indices are related to the instructional effectiveness of teachers, whereas the next two indices are related to the teacher's personal quality. Each of these dimensions focuses on a key aspect of a teacher's professional qualification or responsibilities. As a result, it is critical to determine if the scale measures the intended competences or construct accurately and consistently [1][2][3][4].
Nevertheless, no study has been conducted on the psychometric properties and validity of the SETE scale used by Ethiopian higher education institutions, despite many faculty members questioning the validity and reliability of SETE results for many years. The scale was not evaluated by independent experts and the target population (students) to verify that the items adequately measure the domain of interest. No pretesting was done to assess the extent to which items reflect the constructs of interest. In addition, the SETE scale was not evaluated to test its dimensionality, reliability, and validity. Rather, to the researchers' best knowledge, the factor structure of the scale was constructed merely via discourse between subject matter experts. To the researchers' knowledge, drawing on their experience in the study area, no previous study has investigated the factor structure (dimensionality), internal consistency, and validity of the SETE scale. Therefore, the use of the SETE scale is claimed to have many problems concerning reliability, validity, dimensionality, and potential bias. Thus, this study was carried out with the major purpose of evaluating the reliability, validity, and underlying factor structure of the SETE scale. More specifically, the study aimed to determine whether the scale could indeed measure the unobservable construct/domain that it was supposed to measure, to check whether the scale revealed a factor structure equivalent to the one established by the experts, and to test the convergent and discriminant validity of the SETE scale.
Specifically, this research sought to answer the following questions:

Population and Context of the Research.
Debre Markos University in Ethiopia is one of the public universities founded by the Ethiopian federal government in 2007. The university is located in East Gojjam, Amhara National Regional State, 300 km northwest of the capital, Addis Ababa. Currently, the university runs 51 bachelor's, 47 master's, and 2 Ph.D. programs in regular, continuing, and distance education streams. There are more than 1556 academic staff and 1600 administrative staff in the university to serve over 30000 students and the community at large.
In Ethiopian higher education institutions, the SETE is administered at the end of the semester, before the final exams are administered and before the students know their final grades. All teachers are evaluated by the students in the same semester.

Local Context of SETE Development Process.
The Students' Evaluation of Teaching Effectiveness (SETE) scale (Table 1) is one of the three harmonized scales used to measure teachers' effectiveness. These scales were developed by the Ethiopian Ministry of Science and Higher Education. A group of subject matter experts developed the SETE scale, which comprised 20 items, and judged the dimensions to be four: subject matter knowledge, professional skills, ethical quality, and time management [34]. Of the four constructs, knowledge of the subject matter was considered the core competence. Three bodies are involved in assessing teachers' competencies: students, peers, and immediate supervisors [34]. The students' evaluation of teaching effectiveness accounted for 50% of the total evaluation, and immediate supervisors and colleagues accounted for the remaining 30% and 20%, respectively. The SETE scale items use a five-point Likert scale, of which only one alternative may be chosen. Scores range from 1 to 5, where 1 = "strongly disagree," 2 = "disagree," 3 = "neutral," 4 = "agree," and 5 = "strongly agree." The 20 items that make up the SETE scale were broadly structured to reflect two groups of teaching effectiveness constructs: 14 items for the core and professional competency constructs and the remaining six items for the ethical and time management constructs.
To ensure the relevance of the items to the general principles of teaching in higher education settings, the development of the scale went through several steps to receive feedback from different stakeholders. To do so, different focus group discussions were held with department heads, students, and college deans.

The Data and Study Participants.
This study is a secondary analysis carried out on data from the teachers' assessment survey, which is undertaken at Debre Markos University at the end of every semester to monitor the performance of teachers concerning teaching and research work activities. The following steps were followed to collect (extract) data for this study. In the first step, the teachers to be included in the sample were randomly selected. Then an Excel data abstraction tool was prepared to record and manage the teachers' assessment scores. To assist with the data abstraction and data entry process, a total of 10 data collectors, one from each sampled department, were selected. The data collection process was supervised by the quality assurance office of the university. The data were collected anonymously. The evaluation records of 1397 students were randomly selected from a population of 5257 regular students who were active in the 2018/2019 academic year. To lower costs and reduce prediction errors, multistage stratified random sampling was employed to select teachers' evaluation records. In the first step, we divided the population of teachers into homogeneous, mutually exclusive subgroups called colleges/faculties. In the second step, a sample of departments was randomly taken. In the third step, teachers were stratified by sex. Finally, a sample of teachers was randomly selected for each sex category, and their evaluation records were extracted. The probability-proportional-to-size sampling method was used for selecting teachers from each department. Accordingly, 92 (78.6%) of the teachers were male and the remaining 25 (21.4%) were female.

Data Analysis
2.4.1. Reliability and Validity of the Scale. Descriptive measures such as Cronbach's alpha and average interitem correlations were used to assess internal consistency reliability.
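As a minimal sketch of the two internal consistency measures named above, the following Python code computes Cronbach's alpha and the average interitem correlation from item-level responses. It is an illustration under stated assumptions rather than the analysis code used in this study: the function names and the small response matrix are hypothetical, and alpha is computed with the usual variance-based formula.

```python
from itertools import combinations
from statistics import mean, variance


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of respondent scores per item."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_interitem_correlation(items):
    """Mean of the correlations over all distinct item pairs."""
    pairs = list(combinations(items, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical 1-5 Likert responses: three items, five respondents.
demo_items = [
    [4, 5, 3, 4, 2],
    [4, 4, 3, 5, 2],
    [5, 5, 2, 4, 3],
]
print(round(cronbach_alpha(demo_items), 3))
print(round(average_interitem_correlation(demo_items), 3))
```

Both functions expect complete data, so missing responses would need to be handled before they are called.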
We used the CFA method to test the convergent validity, discriminant validity, and nomological validity of a measurement model [35]. Convergent validity measures the extent to which different measures of the same construct converge or strongly correlate with one another, whereas discriminant validity is the extent to which measures of different constructs diverge or minimally correlate with one another [36]. Convergent validity is assessed through composite reliability (CR) and average variance extracted (AVE). CR, which indicates the shared variance among the observed variables of a latent construct, was applied to test the degree to which the indicator variables converged and shared the proportion of variance [35]. This is calculated using

$$\mathrm{CR} = \frac{\left(\sum_{i=1}^{p} \lambda_i\right)^2}{\left(\sum_{i=1}^{p} \lambda_i\right)^2 + \sum_{i=1}^{p} \delta_i} \qquad (1)$$
where $\lambda_i$ is the completely standardized loading for the $i$th indicator, $\delta_i$ is the variance of the error term for the $i$th indicator, and $p$ is the number of indicators. Moreover, the average variance extracted (AVE) represents the average amount of variance in the indicator variables that the construct explains, relative to the overall variance of its indicators. This is similar to the explained variance in EFA, as it measures the average variance in the items that a construct manages to explain [37]. A higher AVE value indicates lower error variance. The AVE for the $j$th construct, denoted by $C_j$, is defined as

$$\mathrm{AVE}_{C_j} = \frac{\sum_{k=1}^{K_j} \lambda_{jk}^2}{\sum_{k=1}^{K_j} \lambda_{jk}^2 + \sum_{k=1}^{K_j} \theta_{jk}} \qquad (2)$$

where $\lambda_{jk}$ is the indicator loading and $\theta_{jk}$ is the error variance of the $k$th indicator ($k = 1, \ldots, K_j$) of the $j$th construct score ($C_j$), and $K_j$ is the number of indicators of the $j$th construct $C_j$. If all indicators are standardized (i.e., having a mean of 0 and a variance of 1), equation (2) simplifies to

$$\mathrm{AVE}_{C_j} = \frac{1}{K_j} \sum_{k=1}^{K_j} \lambda_{jk}^2 \qquad (3)$$
In this case, the AVE is the same as the average squared standardized loading and is equivalent to the mean value of the indicator reliabilities. Now, let $r_{ij}$ be the correlation coefficient between the construct scores of constructs $C_i$ and $C_j$. The squared interconstruct correlation $r_{ij}^2$ indicates the proportion of variance that constructs $C_i$ and $C_j$ share.
According to the Fornell-Larcker criterion [38], discriminant validity is established if the condition in equation (4) holds:

$$\mathrm{AVE}_{C_j} > \max_{i \neq j} r_{ij}^2 \quad \text{for all } j \qquad (4)$$

That is, the square root of AVE should be greater than the interconstruct correlations for all constructs. Discriminant validity can also be evaluated using the maximum shared variance (MSV) and average shared variance (ASV), which measure the maximum and the average variance shared among constructs, respectively. Both measures should be lower than the AVE for all constructs to confirm discriminant validity [39].
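The convergent and discriminant validity checks described above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: the function names, the loadings, and the interconstruct correlations are hypothetical, and the error variance of each standardized indicator is assumed to equal one minus its squared loading.

```python
def composite_reliability(loadings, error_variances):
    """Equation (1): squared sum of loadings over itself plus error variance."""
    squared_sum = sum(loadings) ** 2
    return squared_sum / (squared_sum + sum(error_variances))


def average_variance_extracted(loadings, error_variances):
    """Equation (2): sum of squared loadings over itself plus error variance."""
    sum_squares = sum(l * l for l in loadings)
    return sum_squares / (sum_squares + sum(error_variances))


def shared_variance_summary(construct, correlations):
    """Return (MSV, ASV): max and mean squared correlation with other constructs."""
    squared = [r * r for pair, r in correlations.items() if construct in pair]
    return max(squared), sum(squared) / len(squared)


def has_discriminant_validity(ave, msv, asv):
    """AVE must exceed MSV (Fornell-Larcker, equation (4)) and ASV."""
    return ave > msv and ave > asv


# Hypothetical standardized loadings for one construct, labeled "A".
loadings_a = [0.69, 0.75, 0.80]
errors_a = [1 - l * l for l in loadings_a]  # assumption: standardized indicators

# Hypothetical interconstruct correlations among constructs A, B, and C.
correlations = {("A", "B"): 0.85, ("A", "C"): 0.50, ("B", "C"): 0.40}

cr_a = composite_reliability(loadings_a, errors_a)
ave_a = average_variance_extracted(loadings_a, errors_a)
msv_a, asv_a = shared_variance_summary("A", correlations)

print(round(cr_a, 3), round(ave_a, 3), round(msv_a, 3), round(asv_a, 3))
print(has_discriminant_validity(ave_a, msv_a, asv_a))
```

With these made-up numbers, construct A would pass the convergent validity thresholds (CR above 0.7, AVE above 0.5) yet fail the Fornell-Larcker check, because its squared correlation with construct B exceeds its AVE.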

2.4.2. The Heterotrait-Monotrait Ratio Approach.
Henseler et al. [39] suggested using the heterotrait-monotrait ratio (HTMT) of correlations, which is the average of the heterotrait-heteromethod correlations (i.e., the correlations of indicators across constructs measuring different phenomena) relative to the average of the monotrait-heteromethod correlations (i.e., the correlations of indicators within the same construct). Because there are two monotrait-heteromethod submatrices, we take the geometric mean of their average correlations. Consequently, the HTMT of constructs $C_i$ and $C_j$ with $K_i$ and $K_j$ indicators can be formulated as

$$\mathrm{HTMT}_{ij} = \frac{\dfrac{1}{K_i K_j} \sum_{g=1}^{K_i} \sum_{h=1}^{K_j} r_{ig,jh}}{\left( \dfrac{2}{K_i (K_i - 1)} \sum_{g=1}^{K_i - 1} \sum_{h=g+1}^{K_i} r_{ig,ih} \cdot \dfrac{2}{K_j (K_j - 1)} \sum_{g=1}^{K_j - 1} \sum_{h=g+1}^{K_j} r_{jg,jh} \right)^{1/2}} \qquad (5)$$

where $r_{ig,jh}$ denotes the correlation between the $g$th indicator of $C_i$ and the $h$th indicator of $C_j$. The numerator in equation (5) is the average heterotrait-heteromethod correlation, and the denominator is the geometric mean of the average monotrait-heteromethod correlation of construct $C_i$ and the average monotrait-heteromethod correlation of construct $C_j$.
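The HTMT ratio described above can be computed directly from an item correlation matrix. The sketch below is illustrative only: the function name, the 4 x 4 correlation matrix, and the assignment of items to constructs are hypothetical, and each construct is assumed to have at least two indicators.

```python
from itertools import combinations, product
from math import sqrt


def htmt(corr, items_i, items_j):
    """HTMT of two constructs given an item correlation matrix.

    `corr` is a square nested list; `items_i` and `items_j` are the row/column
    indices of the indicators of each construct (at least two per construct).
    """
    def average(values):
        return sum(values) / len(values)

    # Correlations between indicators of different constructs.
    hetero = [corr[g][h] for g, h in product(items_i, items_j)]
    # Correlations between distinct indicators of the same construct.
    mono_i = [corr[g][h] for g, h in combinations(items_i, 2)]
    mono_j = [corr[g][h] for g, h in combinations(items_j, 2)]
    return average(hetero) / sqrt(average(mono_i) * average(mono_j))


# Hypothetical correlations: items 0-1 measure one construct, items 2-3 another.
demo_corr = [
    [1.00, 0.60, 0.45, 0.45],
    [0.60, 1.00, 0.45, 0.45],
    [0.45, 0.45, 1.00, 0.50],
    [0.45, 0.45, 0.50, 1.00],
]
ratio = htmt(demo_corr, [0, 1], [2, 3])
print(round(ratio, 3))
```

A ratio computed this way would then be compared with the chosen threshold, such as the 0.85 cutoff referred to later in the paper.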

Exploratory Factor Analysis.
Exploratory factor analysis (EFA) is appropriate when the goal of research is to create a measurement scale that reflects a meaningful underlying construct(s) represented in the observed variables [40]. It is a popular approach to test whether item-level discriminant validity is established by assessing cross-loadings [39]. In EFA, the challenge is determining the required number of factors to retain a sufficient amount of variance and, at the same time, to achieve a substantial reduction in dimensionality [41,42]. Several methods are available for determining the number of components or factors for EFA, but they do not always lead to the same or even similar results. Despite the importance of factor retention decisions and extensive research on methods for making retention decisions, there is no consensus on the appropriate criteria to use [43].

Confirmatory Factor Analysis.
A confirmatory factor analysis (CFA), which has wide applications in the area of scale development and construct validation [35], was used to determine the validity of the factor structure of the teaching effectiveness assessment scale used by students. CFA is a popular type of structural equation model that provides a parsimonious explanation of how observed variables are related to assumed latent variables [44]. CFA provides a more explicit framework for confirming prior notions about the factor structure of scales [45]. A structural equation model has two components. The first is a measurement model that explores the relations between a set of observed variables, also called manifest variables (items in our case), and a usually smaller set of latent variables (factors or constructs). The second is a structural model that explores the relationship between latent variables through a series of recursive and nonrecursive relationships. In this study, a four-factor measurement model was specified to test the validity and reliability of the observed indicator items measured on the knowledge of the subject matter (core competency), professional skills (competency), ethical quality, and time management constructs. Professional competency here refers to the degree to which teachers utilize their knowledge, skills, and good judgment related to their teaching activities to render tasks with acceptable quality.
Confirmatory factor analysis was carried out using the lavaan package version 0.6-7 [46] in R statistical software for Windows [47]. By examining three critical sets of results (parameter estimates, fit indices, and potential modification indices), researchers can formally test the measurement hypotheses and modify them to be more consistent with the actual structure of participants' responses to the scale.

Preliminary Data Analysis.
Prior to the analysis, we examined missing values and outliers. The missing values of the corresponding variables were imputed by median values. Figure 1 shows a graphical visualization of missing values, produced with the visdat package [48].
The figure provides the pattern and percentages of the missing value distribution. It also shows the locations of missingness in the data. According to Figure 1, 3.2% of the values in the dataset were missing and 96.8% were present. Missing data at the item level were low, except for items Core5 (6.9%), Core (9.3%), and Ethic15 (7.3%). From the figure, it is apparent that the pattern of missingness is random.
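The median imputation step mentioned above can be illustrated with a short Python sketch. It is a simplified stand-in for the actual preprocessing, which was done in R: the function names and the example column are hypothetical, and missing responses are assumed to be coded as `None`.

```python
from statistics import median


def missing_share(column):
    """Proportion of missing (None) responses in one item column."""
    return sum(value is None for value in column) / len(column)


def impute_median(column):
    """Replace each missing response with the median of the observed ones."""
    observed = [value for value in column if value is not None]
    fill = median(observed)
    return [fill if value is None else value for value in column]


# Hypothetical responses to one 1-5 Likert item, with two missing values.
item = [5, None, 4, 3, None]
print(missing_share(item))
print(impute_median(item))
```

Note that `statistics.median` averages the two middle values when the number of observed responses is even, which can yield a value such as 3.5 on a Likert item; `statistics.median_low` is one alternative if only observed categories should be imputed.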
For variables measured on an ordinal scale, neither the assumption of normality nor the continuity property is met [49]. The results presented in Table 2 show that the skewness measures are significantly negative for all items, indicating that high ratings are more common than low ratings. Kurtosis exceeds the reference value of the normal distribution (equal to 3) for the majority of competency components, suggesting heavier tails than those of the Gaussian distribution. This leptokurtic behavior confirms that the item distributions have fatter tails than the normal distribution. Because the assumption of normality was severely violated, the diagonally weighted least squares (DWLS) method, which is a robust WLS method [50], was used, as it provides more accurate parameter estimates [49][50][51].

Reliability of the SETE Scale.
Table 2 presents the means and standard deviations of the items and the internal reliability coefficients for the factors/constructs. All alpha estimates except that for time management skill are above the traditional cutoff of 0.70, revealing that three teaching competency dimensions/factors have sufficient internal consistency: subject matter knowledge (core competency) (Cronbach's alpha = 0.88), professional competency (Cronbach's alpha = 0.89), and ethical competency (Cronbach's alpha = 0.83).
Furthermore, the corrected item-total correlation ranged from 0.44 to 0.85, which exceeded the accepted cutoff of 0.40 proposed by Nunnally [52], indicating that each item was related to the corresponding components of the SETE scale.
In addition, the values of the "reliability if an item is dropped" are lower than or equal to the alpha value for all variables of the three factors, indicating that all items in all factors contribute positively to the internal consistency of the factors [53]. Table 2 also shows that the item means ranged from 3.707 to 4.719, while the standard deviations of the items ranged from 0.70 to 1.46, indicating a narrow spread around the mean. The average interitem correlation (AIIC) was computed from the interitem correlation matrix presented in Figure 2. The ideal range for the AIIC value is between 0.20 and 0.40 [54]. Piedmont (2014) claimed that an AIIC score of less than 0.20 indicates that the items are not well correlated and do not measure the same construct or factor, whereas an AIIC score greater than 0.40 suggests that the items in the same construct are redundant.
Accordingly, the average interitem correlations (AIIC) for the core competency, professional competency, ethical quality, and time management skill constructs were 0.58, 0.52, 0.59, and 0.48, respectively, suggesting that all constructs of the SETE scale contain items that measure the constructs in the same way. However, because all four values exceed 0.40, some of the items in each competency component appear to be redundant.

Validity of the SETE Scale.
The discriminant and convergent validity of the scale were tested using various techniques. The interitem correlation matrix presented in Figure 2 was used for the first visual diagnosis of the items and scale structure. The results displayed in the figure show that the items within each factor or construct were highly correlated, which supports the convergent validity of the scale. Composite reliability (CR) and average variance extracted (AVE) were used to test the extent to which the indicator variables converged and shared the proportion of variance. According to Adedeji et al. (2017), a CR of 0.7 or above is required to establish that the indicator items are reliable, and a minimum value of 0.5 is required for AVE. Furthermore, convergent validity is established when CR is higher than AVE and the AVE is higher than 0.5 [45,55]. The estimates in Tables 3 and 4 met these conditions; consequently, the convergent validity of the scale was verified. Furthermore, from the lavaan output presented in Table 5, all items appeared to be significantly associated with their respective constructs, which provides additional evidence of convergent validity. Discriminant validity concerns how well the constructs are distinct and uncorrelated. The scale faces a discriminant validity problem if the items correlate more highly with variables outside their parent factor than with the variables within their parent factor; that is, the latent factor is better explained by some other variables (from a different factor) than by its observed variables.
We used the Fornell-Larcker criterion [38], which compares the square root of the average variance extracted (AVE) with the correlations of the latent constructs, to assess discriminant validity. The interconstruct correlations among the four constructs are shown in Table 3. The strong correlations among them are evidence of their dependence on one another. Based on the estimates presented in Table 3, the square root of AVE is less than the interconstruct correlation. Furthermore, the results presented in Table 4 indicate that the maximum shared variance is greater than the average shared variance, and the average shared variance is in turn greater than the average variance extracted (i.e., MSV > AVE and ASV > AVE). Consequently, both results indicate that discriminant validity is not established. The highlighted cells in Table 4 show the HTMT ratio of correlations between pairs of constructs, calculated using equation (5), as proposed by Henseler et al. [39]. The HTMT values are above the suggested threshold of 0.85 [56], revealing a lack of discriminant validity between the reflective constructs, which supports the above finding. In conclusion, the scale faced a discriminant validity problem.

Education Research International
Discriminant validity was also checked by comparing the loading of an item across different constructs. If all items load more highly on the construct that they are measuring than on any other construct in the model, discriminant validity is met [57]. According to the EFA output presented in Table 4, considerable cross-loadings were observed. The nomological validity of the scale was checked by examining the significance of the correlations between the constructs in the model [35]. Accordingly, the 95% confidence interval for interconstruct correlations in Table 3 does not contain 1, implying a statistically significant interconstruct correlation. This shows the poor nomological validity of the SETE scale.

Table 1: Items of the SETE scale (each item is rated from 1 to 5).
Core competency
Core1: Explains the course overall objectives, prepares course outline on time, and explains the contents of the course outline
Core2: Prepares well for course delivery
Core3: Gives course reading materials and lecture notes
Core4: Notifies list of references and textbooks available in the library
Core5: Teaches depending on course nature and teaches practical sessions
Core6: Delivers the course in such a way that students understand
Professional competency
Profe11: Follows continuous assessment approach and gives feedback on continuous assessments on time
Profe12: Gives supplementary exam to low-performing students on the basis of continuous assessment result
Profe13: Gives tutorial for female students, special needs students, and low-performing students
Profe14: Prepares exams as per the course content, exams cover across the course contents, exams include various assessment modes and allocates appropriate marks for exam questions

The Factor Structure of the Scale.
In this analysis, we used twofold cross-validation (CV): the data were divided into two random samples. The first half of the dataset, with 699 observations (the training data), was used to find the possible factor structure of the SETE scale using exploratory factor analysis, and the second half, with 698 observations (the testing data), was used to verify the factor structure of the scale.
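The random split into training and testing halves can be sketched as follows. This is an illustration only: the function name and the seed are hypothetical, and the sketch works on record indices rather than on the actual evaluation records.

```python
import random


def split_halves(n_records, seed=2019):
    """Randomly split record indices 0..n_records-1 into two halves.

    The seed is an arbitrary illustrative choice that makes the split
    reproducible. With an odd number of records, the first half is one larger.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    half = (n_records + 1) // 2
    return sorted(indices[:half]), sorted(indices[half:])


training_idx, testing_idx = split_halves(1397)
print(len(training_idx), len(testing_idx))
```

With 1397 records, this gives halves of 699 and 698 observations, matching the sample sizes reported above.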

Exploratory Factor Analysis.
An exploratory factor analysis (EFA) using the varimax-rotated component method was performed on the training data to check if item grouping was consistent with the proposed theory, that is, to test the structural validity. Before conducting factor analysis, the item-to-item correlation was examined using the Kaiser-Meyer-Olkin (KMO) test and Bartlett's test of sphericity to see whether there is enough redundancy between the variables to be summarized with factors. The value of KMO was 0.96, and Bartlett's test of sphericity produced p < 0.001, indicating that the data were well suited to factor analysis. Thus, all variables could be considered for EFA [45,58].
We applied the scree plot test [59] and parallel analysis [60] to determine the number of factors to retain. The rule for scree plots is to retain the factors above the point where the curve starts to level off (inflection point) and eliminate any factor below the inflection point [61]. From the scree plot (left panel in Figure 3), the first two factors of the scale have eigenvalues greater than one. Parallel analysis offers a more objective way to assess the appropriate number of components, where factors with adjusted eigenvalues greater than one are retained [62]. Both methods suggest retaining two factors. The preliminary exploratory factor analysis (EFA) results, described in Table 6, revealed that the items are not grouped under the respective factors as theoretically defined. Hence, the factor structure of the scale is not consistent with the intention of the experts who devised the scale; that is, the EFA results did not support the theoretical factor structure. The $h^2$ column in the table represents the communality, which must be higher than 0.3. The root mean square of the residuals (RMSR = 0.05) is less than 0.1, verifying that the retained factors are appropriate for describing the correlation structure. From the results presented in Table 6, all items demonstrated high loadings, ranging from 0.48 to 0.85, implying that all items are important. Items in italics load on the second factor. The loadings on the first factor ranged from 0.48 to 0.77, while the loadings on the second factor ranged from 0.59 to 0.85. The analysis output also includes the explained variance ratio. The first factor explained 31% of the total variance, and the second factor explained 25%. Hence, the two-factor solution explains 56% of the total variance.
The analysis output includes the interfactor correlation after the explained variance ratio section. Based on our qualitative judgment of item content, the less serious nature of the cross-loadings, and the expected association of factors, we decided to keep the cross-loading items and assign each to the factor on which it showed the stronger loading and was found most relevant. However, the two items "Core5" and "Profe7" load equally on the two factors. Hence, we decided to remove them. The models analyzed were identified, which means that there were more observations than parameters to be estimated [63].
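The quantities read off the EFA output above (communalities, the share of variance explained by each factor, and items that load about equally on both factors) follow from the loading matrix by simple arithmetic. The sketch below assumes an orthogonal rotation and standardized items; the function names, the loadings, and the `min_gap` threshold for flagging cross-loadings are hypothetical and are not taken from Table 6.

```python
def communalities(loadings):
    """Sum of squared loadings per item (the h2 column)."""
    return [sum(l * l for l in row) for row in loadings]


def variance_explained(loadings):
    """Share of total variance per factor, assuming standardized items."""
    n_items = len(loadings)
    n_factors = len(loadings[0])
    return [
        sum(row[f] ** 2 for row in loadings) / n_items
        for f in range(n_factors)
    ]


def cross_loading_items(names, loadings, min_gap=0.1):
    """Items whose two largest absolute loadings differ by less than min_gap."""
    flagged = []
    for name, row in zip(names, loadings):
        top = sorted((abs(l) for l in row), reverse=True)
        if top[0] - top[1] < min_gap:
            flagged.append(name)
    return flagged


# Hypothetical two-factor loadings for three items.
item_names = ["item_a", "item_b", "item_c"]
demo_loadings = [
    [0.77, 0.20],
    [0.30, 0.85],
    [0.55, 0.52],
]
print([round(h, 3) for h in communalities(demo_loadings)])
print([round(v, 3) for v in variance_explained(demo_loadings)])
print(cross_loading_items(item_names, demo_loadings))
```

In this made-up example, only the third item would be flagged, since its loadings on the two factors are nearly equal.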
Confirmatory factor analysis was carried out to check if the number of factors (or constructs) and the indicator variables conformed to what was expected based on the theory. Multiple fit indices were used to evaluate whether the models adequately reflected the observed data. Moreover, the two models were compared to assess if they had an identical fit. We used the "Testing" data for this purpose. Figure 4 presents the path diagram of the confirmatory factor analysis for two-factor (left panel) and four-factor (right panel) models, where a single-headed arrow is used to imply a direction of the assumed causal influence, and double-headed arrows are used to represent the covariance between two latent variables (factors).
From the path diagram for the two-factor model, the measurement error ranged between 0.36 (Profe10) and 0.61 (Profe13 and Profe13). Similarly, the four-factor model produced a measurement error ranging between 0.30 (Ethic17) and 0.59 (TM19). The increase in the measurement error for the two-factor model is due to specifying fewer factors than expected [43].
For the two-factor model, it was thus deduced that the squared multiple correlation, or the amount of variance explained by the latent variable, fell within a range between 0.48 and 0.75. Similarly, all factor loadings had values equal to or greater than 0.61 (p13). The correlations between latent constructs ranged between 0.73 and 0.87. The interconstruct correlations of core competency with professional competency, ethical quality, and time management were 0.87, 0.8, and 0.7, respectively. Similarly, the interconstruct correlations of professional competency with ethical quality and time management were 0.78 and 0.87, respectively, whereas the interconstruct correlation between ethical quality and time management was 0.73. Table 5 shows that the standardized coefficients for the two-factor model are significant at the 0.001 level, implying that all items are significantly correlated with their respective constructs. Because the constructs are standardized (mean = 0, SD = 1), the coefficients are interpreted as the increase (or decrease) in the score of an item for every standard deviation increase in the factor/construct. For example, β = 0.69 for "Core1" means that for every standard deviation increase in core competency, "Core1" increases by 0.69. In addition, in the SETE scale, the "Profe12" item had the highest association with its construct (β = 1). The values in Table 7 can be interpreted similarly.
The appropriateness of the measurement model in comparison with the data was examined first. The best model should have a relative chi-square ($\chi^2$/df) value close to 1.
We also used the comparative fit index (CFI), Tucker-Lewis index (TLI), and RMSEA to measure whether the model fits the data better than a more restricted baseline model. However, the cutoff values for these indices are arbitrary, and the meaning of "good" fit and its relationship with fit indices are not well understood [64]. The absolute and comparative fit indices for the two-factor and four-factor CFA models are presented in Table 8. The comparative fit indices for the four-factor model, CFI (0.89) and TLI (0.87), are below the acceptable cutoff point of 0.90, which indicates a relatively poor fit [65]. The absolute fit indices of the four-factor model are 0.088 (RMSEA) and 0.06 (SRMR), which are considered an indication of fair fit [66]. However, for the two-factor model, the comparative fit indices, CFI (0.999) and TLI (0.999), were greater than the 0.90 threshold, indicating an improvement over the tested four-factor model in a relative sense. We also found that the SRMR (0.056) indicated a good fit (<0.06) and the RMSEA (0.008) indicated a good fit (<0.05), showing that the two-factor model fits the data well [67].
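For readers unfamiliar with how these indices relate to the chi-square statistic, the following sketch computes the relative chi-square, RMSEA, CFI, and TLI from the chi-square values of a tested model and a baseline model, using their commonly cited textbook definitions. It is only an illustration: the function name `fit_indices` is hypothetical, the input values are made up rather than taken from Table 8, and the RMSEA uses the N - 1 convention, which may differ from the software used in the study.

```python
import math


def fit_indices(chi2_m: float, df_m: int, chi2_b: float, df_b: int, n: int) -> dict:
    """Compute common fit indices from chi-square statistics.

    chi2_m, df_m: chi-square and degrees of freedom of the tested model
    chi2_b, df_b: chi-square and degrees of freedom of the baseline model
    n: sample size
    """
    relative_chi2 = chi2_m / df_m

    # RMSEA: misfit per degree of freedom, with the N - 1 convention (assumption)
    rmsea = math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))

    # CFI: one minus the ratio of model to baseline noncentrality
    d_model = max(chi2_m - df_m, 0.0)
    d_base = max(chi2_m - df_m, chi2_b - df_b, 0.0)
    cfi = 1.0 - d_model / d_base if d_base > 0 else 1.0

    # TLI: compares chi-square/df ratios (undefined if the baseline ratio equals 1)
    base_ratio = chi2_b / df_b
    tli = (base_ratio - relative_chi2) / (base_ratio - 1.0)

    return {"relative_chi2": relative_chi2, "rmsea": rmsea, "cfi": cfi, "tli": tli}


# Hypothetical inputs, chosen only to illustrate the calculation
result = fit_indices(chi2_m=250.0, df_m=100, chi2_b=3000.0, df_b=120, n=500)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Because CFI and TLI are defined relative to the baseline model, they can change when the baseline changes, whereas the RMSEA depends only on the tested model and the sample size.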
In conclusion, the two-dimensional model provided better goodness-of-fit indices than the four-factor model, implying that the two-factor model fits the data better than the four-factor model.
A test comparing the two models' ability to explain the factor structure of the scale yielded a nonsignificant p value, showing that the four-factor model did not do a better job than the two-factor model. Moreover, the AIC was 29661.43 for the two-factor model and 29677.60 for the four-factor model. Thus, the two-factor model should be preferred (smaller AIC).
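The AIC comparison amounts to selecting the model with the smallest value and reporting each model's difference from it. A minimal sketch using the two AIC values reported above follows; the variable names are illustrative only.

```python
# AIC values reported in the text for the two CFA models
aic = {"two-factor": 29661.43, "four-factor": 29677.60}

# The preferred model is the one with the smallest AIC
preferred = min(aic, key=aic.get)

# Difference of each model's AIC from the smallest AIC
delta_aic = {name: round(value - aic[preferred], 2) for name, value in aic.items()}

print(preferred)
print(delta_aic)
```

Here the four-factor model's AIC exceeds that of the two-factor model by 16.17.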

Discussion
In educational institutions, evaluating teachers' effectiveness is similar to evaluating students' learning [31]. Student evaluations of teachers' effectiveness are a current and controversial topic in higher education and research. Many stakeholders, including teachers, are doubtful of SETE's effectiveness and validity for both formative and summative purposes [7,68]. Thus, the primary goal of this study was to examine the psychometric properties of the SETE scale, the student evaluation instrument used by Ethiopian higher education institutions.
From the results, the SETE scale was shown to have good internal consistency and good convergent validity. This result complements the findings of [7,19–30,69], although the dimensionality and number of items of these scales are unrelated. However, unlike the student evaluation of higher education teachers' effectiveness scales developed by [18,19,21,22,25,31,70], the SETE scale used by Ethiopian higher education faced a validity problem. Moreover, the CFA results showed poor fit indices, revealing that the underlying four-factor structure for the SETE scale is insufficient to explain the data structure. This is because the SETE scale was developed based on experts' evaluation on theoretical grounds, which is one of the criteria to ensure content validity. However, its development should also have gone through quantitative exploration in addition to the experts' evaluation. Scale development is not a straightforward endeavor [71]. Hinkin [72] pointed out three phases of scale development to create a rigorous scale: item development (consisting of the identification of the domains, item generation, and content validity or theoretical analysis), scale development (including pretesting the items in the scale, survey administration and sample size, item reduction analysis, and extraction of factors), and scale evaluation (consisting of tests of dimensionality, reliability, and validity). According to the researchers' ample experience during the development of the SETE scale, however, its development failed to follow the procedures described by Hinkin [72].

Conclusion
The construction of valid and reliable scales requires systematic research, in which both theoretical knowledge and empirical data should play an important role. This study is the first attempt to assess the validity of the SETE scale, which is used by Ethiopian higher education institutions. The current study attempted to provide evidence of convergent validity, discriminant validity, and nomological validity of the SETE scale that Ethiopian public higher education institutions used to evaluate their teachers' performance. Accordingly, the scale lacks both discriminant and nomological validity despite its convergent validity, revealing that the SETE scale does not appear to discriminate well among the constructs it measures.
Although further research is needed to confirm these results based on multicenter data, the two-factor model with 18 items yielded a better factor structure of the SETE scale.
This is because the dimensionality of the scale was developed based on the opinion of experts only; it did not necessarily measure the important competency components of the teachers. Overall, the findings indicate that the SETE scale cannot be used to effectively assess teachers' teaching effectiveness unless further improvements are made to the scale and its development process.
This work has practical, theoretical, and policy implications for a variety of stakeholders at various levels. In practice, this research can assist higher education institutions and the Ministry of Science and Higher Education in identifying the SETE scale's psychometric gaps. It can therefore be used as a framework for improving the instrument's reliability and validity in order to clearly measure teachers' effectiveness and, in turn, to propose interventions that increase teachers' performance and motivation. The findings of this study can also be used to offer new knowledge and concepts on the assessment of teachers' performance and pedagogical competencies in higher education, especially in Ethiopia. As previously stated, no investigation of the psychometric features of the SETE scale has been done in Ethiopia.

Research Limitations and Future Directions
The results of this study should be considered in the light of the following limitations. One limitation was that although the scale is harmonized and used by all public universities, this analysis used data from a single university, so the results may not be generalizable to the remaining public universities across the country. Hence, this study emphasizes the need to obtain large amounts of data from multiple universities to further strengthen the outcomes of the study. The study also assumed that students rated their teachers with no bias or prejudice. However, it is well known from experience that students who receive higher grades in the course rate teachers more favorably, whereas low-grade achievers retaliate against their teachers in the form of low teacher ratings. Other factors such as time of evaluation, physical attractiveness of the teacher, course difficulty, age, and the teacher's personality influence student ratings [28,73]. For convenience, the current study used one dataset for both PCA and CFA; hence, further studies are needed to validate both the SETE scale framework and measures. Careful planning of the validation process should be carried out with large data to obtain stronger evidence on the findings and develop a scale that measures teaching effectiveness appropriately. Furthermore, analysis at a different point in time needs to be carried out to assess the test-retest reliability of the scale. Although the maximum number of items per scale will depend on the complexity of the variable being measured, increasing the number of items per scale improves the scale's richness to capture more information [74]. However, the "Time management" subscale has only two items, which is another limitation of this study.