Julian J. Meimban III
Orlando P. Sanchez
Randy O. Estabillo
Ronaldo C. Llana
Marianne Gail G. Armado
Arlene T. Sarmiento
How to Cite:
Meimban III, J. J., Sanchez, O. P., Estabillo, R. O., Llana, R. C., Armado, M. G. G., & Sarmiento, A. T. (2024). Binary logistic regression modeling for predicting potential offenders of discipline rules in an integrated school. NEU Likha Journal: A Refereed Journal of the New Era University School of Graduate Studies, 1(1), 1-22. https://doi.org/10.64303/n3u-Lailj24Hk0un2o-r-bLrMfpP0driaIs
Abstract
Two binary logistic regression models (Model C and Model D) were selected from four alternative models developed to determine whether a combination of gender, religion, and grade level could significantly predict potential violators of school discipline rules. Grades 7 to 12 students enrolled during the Academic Year 2022-2023 in a private Integrated School in Quezon City, Philippines composed the target population of the study. The minimum sample size was calculated using the formula suggested by Peduzzi et al. (1996), and 753 cases were used in the analysis. Assumptions on multicollinearity and outliers were satisfied. Simultaneous entry of all independent variables and a classification cutoff value of .5 were used. Models C and D surpassed the other two alternative Models (A and B) in terms of Goodness-of-Fit tests (Chi-Square Omnibus test, Cox & Snell R², Nagelkerke R², and Hosmer-Lemeshow test) and Percentage Overall Predictive Accuracy (72.5%), as compared to 65.40% and 69.30% for Model A and Model B, respectively. Both Models C and D showed 70.0% Sensitivity, 74.8% Specificity, 71.8% Positive Predictive Value, and 73.1% Negative Predictive Value. Although Models C and D addressed different research questions, both revealed that Gender and Grade Level, but not Religion, were highly significant predictors (in a positive or negative way) of violators (p < .001). Concerning Grade Level, Model C showed that Grade 7, Grade 11, and Grade 12 students had significant odds or likelihood of committing violations. Model D revealed that Grades 7, 8, 10, 11, and 12 students had significant odds (positive or negative effect) of committing offenses. Neither Model found Grade 9 students to be likely violators.
Keywords: Log-odds, Probability, Odds, Leverage, Goodness-of-Fit tests, Contrast, Reference Category, Sensitivity, Specificity, Positive Predictive Value, Negative Predictive Value
Introduction
The frequent incidence of students violating school regulations has been a lingering problem confronting school administrators in Junior and Senior High School. Offenders not only disregard the school's efforts to instill good behavior among students but also undermine the peaceful and conducive learning environment that the school is supposed to provide. Offenders also cause undue distress and interpersonal relationship problems for their peers and their parents/guardians. In most circumstances, parents/guardians of offenders are asked to come to the School to meet with the Student Discipline Office/Guidance and Counseling Office, which explains to them the disciplinary action or sanction and the ramifications of the violation their son/daughter has committed. How to mitigate violation problems in Junior and Senior High School has been an enduring concern of school administrators, as has been the case in a private Integrated School (IS) in Quezon City, Philippines.
This problem motivated the researchers to develop a mathematical model, specifically a binary logistic regression model, to estimate the probability/odds that a student will violate school discipline rules. This model can be utilized to categorize potential offenders based on their probability/odds of committing a violation so that a specific mitigation program/strategy may be crafted for a particular group of likely offenders. The study covered only the Academic Year 2022-2023. Only Junior (Grades 7 to 10) and Senior (Grades 11 and 12) High School students were included in the study.
Research Questions
Overall Research Question
Is there a combination of gender, religion, and grade level that significantly predicts whether or not a student will violate discipline rules in an Integrated School?
The probability/odds of a student committing a violation can be predicted using a binary logistic regression model, where the parameters are estimated using the Maximum Likelihood method.
ẑ = b0 + b1X1 + b2X2 + b3X3
Where:
ẑ = ln(odds) = log-odds (or logit)
b0 = constant
b1, b2, b3 = estimators of β1, β2, and β3, respectively
X1 = gender
X2 = religion
X3 = grade level
Hypothesis
Ho: βi = 0, where i = 0 to 3.
Ha: Not all βi’s are equal to zero.
The null hypothesis was tested at the .05 level of significance using the Wald statistic.
Probability and Odds
Predicted Probability = (Odds)/(1 + Odds)
Odds = e^ẑ, where e is the base of the natural logarithm (ln) and is approximately equal to 2.7182818285.
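The two conversions above can be sketched in a few lines of Python. This is only an illustration of the formulas; the function names and the logit values in the loop are hypothetical and are not taken from the study.

```python
import math

def odds_from_logit(z):
    """Odds = e**z, where z is the log-odds (logit)."""
    return math.exp(z)

def probability_from_odds(odds):
    """Predicted probability = odds / (1 + odds)."""
    return odds / (1.0 + odds)

# Hypothetical logit values, chosen only to show the direction of the mapping.
for z in (-1.0, 0.0, 1.0):
    odds = odds_from_logit(z)
    p = probability_from_odds(odds)
    print(f"logit={z:+.2f}  odds={odds:.3f}  probability={p:.3f}")
```

A logit of 0 corresponds to odds of 1 and a predicted probability of .5, which is why a cutoff of .5 on the probability scale is equivalent to a cutoff of 0 on the log-odds scale.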
Specific Research Questions
A particular binary logistic regression model addresses a specific type of research question, which, in turn, hinges on the kind of categorical variable Coding, Contrast, and Reference Category employed in the analysis. The specific research questions are addressed in the Results and Discussion.
Method
In this study, an offense/violation was considered as one falling into any of the following categories: improper/unacceptable school uniform and physical appearance; misbehavior and campus disturbances; disrespect; destruction and vandalism; dishonesty; misuse of information technology (IT); violence, harassment, indecency and immorality; gambling; drunkenness, drug or substance abuse; and the like. A student who committed multiple violations (of the same or of different discipline rules) was counted only once in the analysis.
The general procedures used in modeling binary logistic regression were as follows: (1) preliminary steps, (2) model development, and (3) model validation. Excel and SPSS v27 were utilized for the statistical procedures and analysis.
Preliminary Steps
Population and Sample Size
The population of interest was composed of Grades 7 to 12 students enrolled during Academic Year 2022-2023 in a private IS in Quezon City, Philippines. The minimum sample size was calculated using the formula suggested by Peduzzi et al. (1996). The formula, as cited in Ryan (2013), is as follows:
“n = 10k/p … with k = the number of covariates and p = the smaller of the anticipated proportion of events and nonevents in the population.” From another source, p was defined as the “smallest of the proportions of negative or positive cases in the population and k the number of covariates (the number of independent variables) …” (MedCalc, 2024).
With k = 3 (gender, religion, and grade level) and p = .04, a minimum of 750 cases was needed for the study. A case is a student, whether violator or non-violator. The value of p was estimated from a random sample of 189 violators or positive cases (Table 1). However, the minimum of 750 cases was increased to 761 to account for anticipated unusable cases (those with incomplete entries in some variables) and for influential outliers. Such cases were excluded from the analysis.
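The arithmetic behind the minimum sample size can be checked with a short sketch. The function name is hypothetical; the formula n = 10k/p and the inputs k = 3 and p = .04 are those stated above.

```python
import math

def minimum_cases(k, p):
    """Minimum sample size n = 10k / p (Peduzzi et al., 1996, as cited above).

    k: number of covariates
    p: smaller of the proportions of events and non-events in the population
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0 < p <= 0.5:
        raise ValueError("p must be in (0, 0.5]")
    # Round before taking the ceiling to guard against floating-point noise.
    return math.ceil(round(10 * k / p, 9))

n_min = minimum_cases(k=3, p=0.04)
print(n_min)  # 10 * 3 / .04 = 750
```

Because p sits in the denominator, a rarer outcome raises the required sample size quickly: halving p doubles n.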

Cases Selection
The 761 cases were retrieved from the School database of violators and non-violators. Considering the confidentiality issues associated with handling personal data/private information, the researchers held a series of meetings with the staff of the Discipline Office of the School to gather the needed number of cases for the study. Extracting the cases needed (violators and non-violators alike) was a tedious process because the demographic data of a case (student), such as grade level, gender, religion, and age, were recorded in incident-sheet reports rather than in a database format where entries are encoded systematically to ease data sorting and retrieval. In practice, each case was retrieved one by one. Moreover, the importance of random selection of cases across grade levels, gender, and religion or subgroups was explained to the staff as thoroughly as possible to ensure the generalizability of findings to the target population.
Model Development
Dependent Variable (DV) and Independent Variables (IV)
The dependent variable (DV) was labeled Group, with two categories: non-violator and violator. The independent variables (IV) were Gender, Religion, and Grade Level (Table 2).
Modeling Binary Logistic Regression
Four alternative binary logistic regression models were formulated and tested for “fit” on the sample. In formulating the four alternative models, Gender and Religion were treated as categorical variables. Only the treatment of Grade Level was varied, either as a scale/metric variable or as a categorical variable. In Model A, Grade Level was treated as a binary variable in which Grades 7 to 10 were combined and labeled as Junior while Grades 11 and 12 were combined and labeled as Senior. In Model B, Grade Level was treated as a scale/metric variable. In Models C and D, Grade Level was treated as a categorical variable. Other categorical variable settings used in each model are shown in Table 2.


Assumptions
Multicollinearity. Logistic regression modeling requires that there be no extreme collinearity between IVs. The presence of multicollinearity was assessed by examining the collinearity diagnostic output of a multiple regression between the DV and IVs. With a sample size of 753, the model had an R² of .109. Given this value, a collinearity problem may exist with the variable Religion and also with Grade Level because each of them had a Tolerance of less than (1 – R²) or .89 (Table 3). However, further examination indicated that Gender and Religion may be moderately correlated because they exhibited Variance Proportions values of .38 and .51, respectively, in Dimension 2, and .54 and .37, respectively, in Dimension 3 (Table 4). These values, however, suggest only a low to moderate relationship (Hinkle et al., 1998, as cited in Rovai et al., 2013). Grade Level appeared to have no serious collinearity problem with the other IVs.


Outliers. Logistic regression also requires no extreme outliers. Influential outliers were excluded from the analysis. ZResid and leverage were used to identify them.
Regression Method and Classification Cutoff Value
The regression method used was simultaneous entry of all independent variables, and the classification cutoff value used throughout was .5.
Results and Discussion
Cases Included in the Analysis
Of the 761 sampled cases, 753 were employed in modeling Models A, C, and D; Model B used 752 cases. In modeling Models A, C, and D, two cases were excluded from the analysis because of incomplete or missing entries. Additionally, six cases (196, 217, 218, 238, 239, and 357) were excluded because they were influential outliers. A case was considered an outlier if |ZResid| > 2.0. An outlier was declared influential if its leverage was greater than (k/sample size), where k is the number of parameters to be estimated. The sample size used was 761 and k was 8 (i.e., for Gender, Religion, Grade 7, Grade 8, Grade 9, Grade 10, Grade 11, and Grade 12). Thus, outliers with leverage greater than 8/761 or .011 were excluded. Each of the six cases excluded had a leverage of .012 (Janssens et al., 2008).
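The two-step screening rule described above (outlier by standardized residual, then influential by leverage) can be sketched as follows. The function names and the four diagnostic records are hypothetical; only the thresholds (2.0 for the absolute ZResid and k/n = 8/761 for leverage) come from the text.

```python
def leverage_cutoff(k, n):
    """Leverage threshold k / n used in this study (k parameters, n cases)."""
    return k / n

def influential_outliers(cases, k, n, z_limit=2.0):
    """Return IDs of cases with |ZResid| > z_limit AND leverage > k / n.

    cases: iterable of (case_id, zresid, leverage) tuples.
    """
    cutoff = leverage_cutoff(k, n)
    return [case_id for case_id, zresid, leverage in cases
            if abs(zresid) > z_limit and leverage > cutoff]

cutoff = leverage_cutoff(k=8, n=761)
print(f"{cutoff:.4f}")  # about .0105, reported as .011 in the text

# Hypothetical diagnostics for illustration only (not the study's data).
cases = [
    (1, 0.8, 0.004),   # neither an outlier nor high leverage
    (2, 2.4, 0.012),   # outlier with high leverage: flagged
    (3, -2.6, 0.006),  # outlier, but leverage below the cutoff
    (4, 1.1, 0.015),   # high leverage, but not an outlier
]
flagged = influential_outliers(cases, k=8, n=761)
print(flagged)
```

Only a case that fails both checks is excluded, which matches the description of the six excluded cases, each with a leverage of .012.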
Hence, only 753 cases were included in modeling Models C and D. For practical purposes, the same number of cases was used in modeling Model A. For Model B, the six influential outliers and three cases with missing entries were excluded from the analysis, leaving 752 cases for the analysis.
Sample Cross-tabulation
By gender (for Models A, C, and D), of the 753 cases used in the analysis, 368 (48.87%) were female and 385 (51.13%) were male (Table 5). By religion, 427 (56.71%) belonged to the majority group and 326 (43.29%) to the minority group. By grade level, Grade 7, Grade 8, Grade 9, and Grade 11 had 107 (14.21%), 174 (23.11%), 183 (24.30%), and 193 (25.63%) cases, respectively. On the other hand, Grade 10 and Grade 12 had 39 (5.18%) and 57 (7.57%) cases, respectively. Noticeably, these last two Grade Levels had far fewer cases than the other Grade Levels. The subgroup with the fewest cases (5) was Grade 10-Female-Minority; the subgroup with the most (95) was Grade 11-Female-Minority.

Prediction Model Selection Criteria
The principal criteria used in selecting an appropriate Model were the Goodness-of-Fit tests (Chi-Square Omnibus test, Cox & Snell R², Nagelkerke R², and Hosmer-Lemeshow test), supported by the Percentage Overall Predictive Accuracy (based on the final Classification Table) (Table 6). Of the four alternative models, Model C and Model D were the most promising. They not only met the Goodness-of-Fit tests but also registered the highest Percentage Overall Predictive Accuracy (72.50%). In comparison, Models A and B not only failed the Hosmer-Lemeshow test (p < .001) but also showed lower Percentage Overall Predictive Accuracy (69.30% for Model A and 65.40% for Model B) than Models C and D. Moreover, the Cox & Snell R² values of Models A and B were about half those of Models C and D.

Models C and D Predictive Accuracy Measures
Models C and D have the same Overall Predictive Accuracy (72.5%) (Table 7). Both Models have higher predictive accuracy for non-violators (74.8%) than for violators (70.0%).

Moreover, measures of Predictive Accuracy of Actual Outcome and Predictive Accuracy of Predicted Outcome were also the same for both Model C and Model D (Table 8). The values of these measures ranged from 70.0% (Sensitivity or True Positive Rate) to 74.8% (Specificity or True Negative Rate) (Hair et al., 2019).
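The predictive accuracy measures named above are all ratios of cells in a 2 x 2 classification table. The sketch below shows how they are computed. The cell counts are an assumption: they are illustrative values chosen to total 753 cases and to reproduce the reported percentages, and are not copied from Table 7.

```python
def classification_measures(tp, fn, tn, fp):
    """Accuracy measures from a 2 x 2 classification table.

    tp: violators predicted as violators
    fn: violators predicted as non-violators
    tn: non-violators predicted as non-violators
    fp: non-violators predicted as violators
    """
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / total,  # overall predictive accuracy
    }

# Illustrative counts (assumed, not from the source tables); they sum to 753.
measures = classification_measures(tp=252, fn=108, tn=294, fp=99)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity condition on the actual outcome, while the positive and negative predictive values condition on the predicted outcome, which is the distinction drawn between the two groups of measures in Table 8.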

Models C and D revealed more significant predictors than did Models A and B (Table 9).


Based on the above results, we suggest two binary logistic regression models for predicting the probability/odds of potential violators/offenders. These models are as follows.
Model C (for the variable Grade Level: Contrast – Deviation; Reference Category – First)
ẑ = -0.57 + 1.46(Gender) + 0.17(Religion) + 0.18(Grade 8) – 0.24(Grade 9) + 0.62(Grade 10) – 1.73(Grade 11) + 1.72(Grade 12)
Model D (for the variable Grade Level: Contrast – Indicator; Reference Category – First)
ẑ = -1.12 + 1.46(Gender) + 0.17(Religion) + 0.73(Grade 8) + 0.32(Grade 9) + 1.17(Grade 10) – 1.18(Grade 11) + 2.27(Grade 12)
Specifying a particular coding setting for the categorical variable Grade Level allowed us to address Research Question 3, which, in turn, provided us further insight into the effect of this variable in relation to the commission of violation.
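As an illustration of how the fitted equation would be applied, the sketch below scores a single student with the Model D coefficients and the .5 classification cutoff. Several details are assumptions rather than facts from the source tables: Gender is taken as 1 = male and 0 = female, Religion as 1 = minority and 0 = majority (inferred from how the coefficients are interpreted in the research questions), and the function name is hypothetical.

```python
import math

# Coefficients copied from the Model D equation; Grade 7 is the reference
# category, so its term is 0.
MODEL_D = {
    "constant": -1.12,
    "gender": 1.46,    # assumed coding: 1 = male, 0 = female
    "religion": 0.17,  # assumed coding: 1 = minority, 0 = majority
    "grade": {7: 0.0, 8: 0.73, 9: 0.32, 10: 1.17, 11: -1.18, 12: 2.27},
}

def score_student(male, minority, grade, cutoff=0.5):
    """Return (log-odds, predicted probability, predicted group)."""
    z = (MODEL_D["constant"]
         + MODEL_D["gender"] * male
         + MODEL_D["religion"] * minority
         + MODEL_D["grade"][grade])
    odds = math.exp(z)
    probability = odds / (1.0 + odds)
    # Tie handling at exactly the cutoff is an assumption of this sketch.
    group = "violator" if probability > cutoff else "non-violator"
    return z, probability, group

# A male, majority-religion Grade 12 student.
z12, p12, group12 = score_student(male=1, minority=0, grade=12)
# A female, majority-religion Grade 11 student.
z11, p11, group11 = score_student(male=0, minority=0, grade=11)
print(f"Grade 12 male:   z={z12:.2f}  p={p12:.3f}  {group12}")
print(f"Grade 11 female: z={z11:.2f}  p={p11:.3f}  {group11}")
```

Substituting the same two students into the Model C equation gives the same log-odds (for example, -0.57 + 1.46 + 1.72 = 2.61 for the Grade 12 male), which is consistent with both models reporting identical predictive accuracy.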
Model C Research Questions (RQ) (Refer to Table 10)
RQ1: Is Gender a significant predictor of committing violation?
Gender was found to be a significant predictor (p < .001). Being male, as compared with being female, highly significantly (p < .001) increased the log-odds of violating school discipline rules by 1.46 points (that is, multiplied the odds by 4.31), while holding other factors constant. The 95% C.I. for this odds ratio ranged from 3.08 to 6.05. In general, male students were more likely to violate school discipline rules than female students.
RQ2: Is Religion a significant predictor of committing violation?
Religion was not found to be a significant predictor of committing violations (p = .400). In other words, in general, students belonging to the minority religion were as likely to violate school discipline rules as students belonging to the majority religion. Although the odds of violating discipline rules for students belonging to the minority religion were 1.18 times those of students belonging to the majority religion, this apparent difference could be attributed to chance (Janssens et al., 2008).
RQ3: To what degree does the effect of each Grade Level differ from the mean effect of all the Grade Levels on the log-odds/odds (response)?
Grade 11 students had log-odds 1.73 points lower than the mean effect of all Grade Levels (an odds ratio of .18), a highly significant difference (p < .001). On the other hand, Grade 12 students had log-odds 1.72 points higher than the mean effect of all Grade Levels (an odds ratio of 5.58), also highly significant (p < .001). The 95% C.I. for the Grade 12 odds ratio ranged from 3.00 to 10.36.
The individual effects (positive or negative) of Grade Levels 8, 9, and 10 on the log-odds/odds were not statistically significant (p > .05) compared with the mean effect of all Grade Levels. However, Grade 7 showed log-odds .55 points lower than the mean effect of all Grade Levels (an odds ratio of .58), a significant difference (p = .006).
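The Grade 7 effect quoted above does not appear in the Model C equation because Grade 7 is the omitted category. Under deviation coding, the omitted category's effect equals the negative of the sum of the other category effects. The sketch below recovers it from the published Model C coefficients and then re-expresses the effects relative to Grade 7, which is what indicator coding reports. Variable names are hypothetical, and small differences from the published Model D values reflect rounding to two decimals.

```python
# Model C (deviation coding) coefficients copied from the equation above.
DEV_CONSTANT = -0.57
DEV_GRADE = {8: 0.18, 9: -0.24, 10: 0.62, 11: -1.73, 12: 1.72}

# Deviation effects sum to zero across all categories, so the omitted
# Grade 7 effect is the negative of the sum of the others.
grade7_dev = -sum(DEV_GRADE.values())
print(f"Grade 7 deviation effect: {grade7_dev:.2f}")

# Indicator coding measures each grade against Grade 7 instead of the mean.
ind_constant = DEV_CONSTANT + grade7_dev
ind_grade = {g: b - grade7_dev for g, b in DEV_GRADE.items()}
print(f"Indicator-coded constant: {ind_constant:.2f}")
for grade, b in ind_grade.items():
    print(f"Grade {grade} vs Grade 7: {b:+.2f}")
```

The recovered values agree with the Model D equation (constant -1.12; Grade 8 0.73, Grade 10 1.17, Grade 11 -1.18, Grade 12 2.27), with Grade 9 at 0.31 versus the published 0.32. This suggests the two models are the same fit expressed against different baselines.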

Model D Research Questions (RQ) (refer to Table 11)
RQ1: Is Gender a significant predictor of committing violation?
The findings generated were the same as in Model C.
RQ2: Is Religion a significant predictor of committing violation?
The finding obtained was the same as in Model C.
RQ3: To what degree does the effect of each Grade Level differ from the effect of Grade 7 (the Reference Group) on the log-odds/odds (response)?
In increasing order of odds, Grade 8 students (odds ratio of 2.07, p = .006), Grade 10 students (odds ratio of 3.23, p = .007), and Grade 12 students (odds ratio of 9.68, p < .001) individually showed significantly higher odds or likelihood of breaking School discipline rules than Grade 7 students, while holding other factors constant. The 95% C.I. of the odds ratio for Grade 12 ranged from 4.24 to 22.11. Grade 11 students, however, had highly significantly lower odds, .31 times those of Grade 7 students (p < .001).
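The odds figures in this comparison are the exponentiated Model D coefficients, each read as a multiple of the Grade 7 odds. A minimal sketch of that conversion follows; the variable names are hypothetical, and values computed from two-decimal coefficients can differ slightly from the published ones, which were presumably computed from unrounded estimates.

```python
import math

# Model D grade coefficients (log-odds relative to Grade 7), from the equation.
MODEL_D_GRADE = {8: 0.73, 9: 0.32, 10: 1.17, 11: -1.18, 12: 2.27}

# exp(b) gives the odds ratio of each grade versus Grade 7.
odds_ratios = {grade: math.exp(b) for grade, b in MODEL_D_GRADE.items()}

# List the grades from lowest to highest odds ratio.
for grade, ratio in sorted(odds_ratios.items(), key=lambda item: item[1]):
    direction = "higher" if ratio > 1 else "lower"
    print(f"Grade {grade}: odds ratio {ratio:.2f} ({direction} odds than Grade 7)")
```

An odds ratio above 1 indicates higher odds than the reference group and one below 1 indicates lower odds, so Grade 11 falls below Grade 7 while Grade 12 stands well above it.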

Model Validation
The researchers were unable to validate Model C and Model D for external generalizability because too few cases, particularly offenders, were available across subgroups, as can be surmised from the figures in Table 1. The models will be validated using a fresh data set as soon as an adequate number of cases becomes available.
References
Hair, J. F., Jr., Babin, B. J., Black, W. C., & Anderson, R. E. (2019). Multivariate data analysis (8th ed.). Cengage.
Janssens, W., et al. (2008). Marketing research with SPSS. Prentice Hall.
MedCalc Software Ltd. (2024). Logistic regression. http://www.medcalc.org/manual/logisticregression.php
Rovai, A., Baker, J. B., & Ponton, M. K. (2013). Social science research design and statistics (1st ed.). Watertree Press LLC.
Ryan, T. P. (2013). Sample size determination and power. John Wiley & Sons, Inc.

