
Predicting Student Performance Through Machine Learning


Modern educational institutions collect extensive data, and there is growing interest in using it to identify students who are falling behind and to pinpoint areas where they can improve. By applying machine learning to this data, we can help educators by flagging students at risk of failing a course. While a statistical model that can predict whether a student will score proficiently on an exam is powerful, identifying the factors behind each prediction is just as important. This article walks through our work in predicting student performance through machine learning and our findings.

Identifying Data

Given the educational data available across a state, it is first necessary to identify the attributes that serve as strong indicators of a student's success. We started with an extensive review of previous research, which gave us insight into which areas to focus on. We then conducted our own exploratory analysis to find trends and relationships within our data, using many data visualizations as well as statistical tests such as:

  • Shapiro-Wilk test – examines how closely data fits a normal distribution
  • Two-tailed t-test – tests whether two group statistics are truly different, using a significance level of 0.05
  • Mann-Whitney U test – tests whether two groups differ when the data does not follow a normal distribution; unlike the t-test, it does not assume a specific distribution
  • Correlation coefficients – measure the strength of the linear relationship between two variables
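As a minimal sketch, the tests above can be run with SciPy; the synthetic groups below are illustrative stand-ins, not our actual data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(70, 10, 200)  # e.g., exam scores of one student group
group_b = rng.normal(74, 10, 200)  # e.g., exam scores of another group

# Shapiro-Wilk: p > 0.05 suggests the data is consistent with normality
_, p_normal = stats.shapiro(group_a)

# Two-tailed t-test: are the two group means truly different?
_, p_ttest = stats.ttest_ind(group_a, group_b)

# Mann-Whitney U: non-parametric alternative when normality fails
_, p_mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Pearson correlation: linearity between two variables
r, _ = stats.pearsonr(group_a, group_b)
```

In practice, the Shapiro-Wilk result decides which of the two comparison tests applies: the t-test when both groups look normal, the Mann-Whitney U otherwise.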

We identified some of our strongest predictive variables as student demographics, past student performance, and historical standardized test scores.

Refining Educational Data for Predictions

Once we selected the variables to train a predictive model on, we had to run some data preparation steps. Raw data can have many issues, such as missing values, invalid data types, and non-standardized numerical data. We started by standardizing our numerical data because different tests are scored on different scales; without standardization, variables of larger magnitude influence a model's predictions more than those of smaller magnitude, so standardizing ensures that all variables are treated equally. Next, because some of our data is non-numerical, we encoded these character, Boolean, and string attributes using numerical representations.
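A minimal sketch of these two preparation steps with scikit-learn; the column names here are hypothetical stand-ins for our real attributes.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "math_score": [310.0, 245.0, 290.0],      # scored on one scale
    "reading_pct": [88.0, 62.0, 75.0],        # scored on another scale
    "lunch_status": ["free", "paid", "free"],  # categorical attribute
})

# Standardize numeric columns so differing scales carry equal weight
scaler = StandardScaler()
df[["math_score", "reading_pct"]] = scaler.fit_transform(
    df[["math_score", "reading_pct"]]
)

# Encode the categorical column as numeric indicator columns
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["lunch_status"]]).toarray()
```

After scaling, each numeric column has zero mean and unit variance, so no single test's scale dominates the model.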

Finally, missing data refers to samples that lack certain attributes, such as a student not having last year's standardized test score. Many machine learning models struggle to handle samples with missing data, which motivated us to impute these values. Simple imputation methods fill in the mean, median, or mode of the column, but we can do better. We used a K-Nearest Neighbors algorithm that predicts a missing value from the sample's other attributes: it finds the K students most similar to that student based on the non-missing data and infers the missing value from those K samples.
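This imputation step can be sketched with scikit-learn's `KNNImputer`; the rows below are hypothetical student records, not our dataset.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical columns: [prior test score, attendance rate, GPA]
students = np.array([
    [310.0, 0.95, 3.6],
    [305.0, 0.93, 3.5],
    [240.0, 0.70, 2.1],
    [308.0, 0.94, np.nan],  # missing GPA to be imputed
])

# Fill the missing GPA from the K = 2 most similar students
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(students)
```

Here the last student's nearest neighbors (by the non-missing columns) are the first two rows, so the imputed GPA is their average rather than the column-wide mean, which the weaker third row would drag down.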

Choosing the Right Classifier for our Data

The next step in our process was to identify the classification algorithms to apply to our dataset and select the best-performing one for predicting student performance. We chose the top-performing algorithms from our literature review of previous work applying machine learning to the education domain: XGBoost, RandomForestClassifier, neural networks, Support Vector Machines, and Logistic Regression. To evaluate each model, we set up a K-Fold Cross-Validation pipeline that splits the dataset into K groups, allowing each group to be used as the test set once and giving more stable results.
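A minimal sketch of this evaluation on synthetic data, comparing two of the candidate models by cross-validated F1-score; the data and K = 5 are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a prepared student dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# K-Fold CV: each of the 5 folds serves as the test set exactly once,
# and averaging the fold scores gives a more stable estimate
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in models.items()
}
```

The model with the highest mean F1 across folds would be carried forward, which is how the comparison above selected its winner.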

Following our experiments, we found that the RandomForestClassifier yielded the best results, with an F1-score of 83.5%. We also employed feature selection in our pipeline to remove redundant features: features are often correlated with one another, meaning no real information is gained from including both. The figure below visualizes how model performance changes as more features are included; the F1-score plateaus, and adding more attributes yields no further lift.
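The feature-count experiment can be sketched as follows, adding features one at a time and recording cross-validated F1; the synthetic data and `SelectKBest` ranking are illustrative assumptions, not our exact pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

# Synthetic data where only a few features carry real signal
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

# Score the model with the k best-ranked features, for each k
f1_by_k = {}
for k in range(1, X.shape[1] + 1):
    X_k = SelectKBest(f_classif, k=k).fit_transform(X, y)
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    f1_by_k[k] = cross_val_score(model, X_k, y, cv=5, scoring="f1").mean()
```

Plotting `f1_by_k` typically shows the plateau described above: once the informative features are included, additional ones add noise rather than lift.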

[Figure: model F1-score as the number of included features increases]

The Magic Behind the Random Forest Algorithm

The Random Forest algorithm is a popular machine learning algorithm that combines the outputs of many decision trees. A decision tree works by asking simple questions about a data sample and following different paths down the tree until a result node is reached. The Random Forest algorithm is an ensemble of decision trees whose outputs are aggregated to make a final prediction, using two techniques: bagging, where each decision tree is trained on a random subset of samples from the dataset, and feature bagging, where each tree is trained on only a subset of the available attributes. This method creates numerous weakly performing decision trees that, when combined, produce a well-performing model for predicting student performance.
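Both techniques map directly onto scikit-learn parameters; a minimal sketch on synthetic data (the dataset is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # the ensemble: 100 decision trees
    bootstrap=True,       # bagging: each tree sees a random sample of rows
    max_features="sqrt",  # feature bagging: each split considers a subset
    random_state=0,
)
forest.fit(X, y)

# The final prediction aggregates the trees' votes
prediction = forest.predict(X[:1])
```

Each individual tree in `forest.estimators_` is a weak learner trained on its own bootstrap sample; the majority vote across all 100 is what makes the ensemble strong.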

Understanding the Why in our Model

In many domains, including education, knowing a machine learning model's prediction matters less than understanding why the model made it.

Explainable AI allows us to comprehend the reasoning behind AI-driven predictions and surfaces the key indicators of a student's success. It also allows subject matter experts to analyze machine learning decisions and find inconsistencies that can be used to improve model performance.
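One simple form of explanation is a Random Forest's built-in feature importances; a minimal sketch with hypothetical feature names (our per-student explanations may use other techniques):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical attribute names for illustration only
feature_names = ["prior_score", "attendance", "gpa", "absences"]
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Rank the attributes the model leaned on most heavily
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```

A ranking like this gives a subject matter expert a starting point: if an attribute they know to be irrelevant ranks highly, that inconsistency points to a data or modeling problem worth fixing.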

While ML models may seem like black-box solutions that perform magic, explainable AI sheds some light on what is happening inside them. With this motivation, we created a pipeline that, given a student and their educational data, generates both a prediction and the primary indicators the model used to reach it, as in the figure below.

[Figure: example prediction with the primary indicators behind it]

Merging Data Science and Education

In conclusion, we have not only successfully used machine learning to predict student performance but have also showcased explainable AI to add transparency to our predictions. By leveraging the extensive educational data available to our team along with modern data science techniques, we achieved a model F1-score of 83.5%. While that performance is impressive on its own, we understand the importance of answering why the model infers its answers, and our use of explainable AI offers educators the opportunity to understand the critical factors influencing a student's success.

 

About Connor Dolan

Connor is an Associate on the Data and Analytics team.
