
As part of the Le Wagon Bootcamp, this was my final group project:

🧠 Proven Hypothesis
A student's academic performance depends not only on their academic capabilities but also on their socio-economic status.

🏫 Imagine this…

You’re a headteacher managing a school of thousands of students. How do you create an ideal environment where every student can thrive?


🎯 Project Goal

To build a machine learning model that forecasts academic outcomes using a dataset of Portuguese students.
The objective was to uncover socio-economic drivers influencing final grades (G3) and offer a predictive dashboard for education leaders.

🔗 Try the app: The Student Forecaster Model


🔍 Data Exploration

We started with a raw dataset and trimmed irrelevant or misleading variables like:

  • Home address
  • Parent jobs (too many “other” values)
  • Nursery attendance
  • G1/G2, the two earlier period grades (only the final grade, G3, was kept, as the target)

We then grouped related features:

  • Alcohol intake (weekday/weekend)
  • Family dynamics
  • Time management
  • Educational support
  • Reason for school choice
  • Parental education
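
The trimming and grouping steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`address`, `Mjob`, `Fjob`, `nursery`, `Dalc`, `Walc`) are assumed to follow the public UCI Student Performance dataset, the sample rows are made up, and `alcohol_intake` is a hypothetical name for the grouped feature.

```python
import pandas as pd

# Tiny made-up sample; column names are assumed to match the UCI dataset.
df = pd.DataFrame({
    "address": ["U", "R", "U"],
    "Mjob": ["other", "teacher", "other"],
    "Fjob": ["other", "other", "services"],
    "nursery": ["yes", "no", "yes"],
    "Dalc": [1, 2, 5],  # weekday alcohol intake on a 1-5 scale
    "Walc": [1, 4, 5],  # weekend alcohol intake on a 1-5 scale
    "G1": [12, 9, 7],
    "G2": [13, 10, 6],
    "G3": [14, 10, 5],
})

# Drop the variables judged irrelevant or misleading.
df = df.drop(columns=["address", "Mjob", "Fjob", "nursery", "G1", "G2"])

# Group weekday and weekend alcohol intake into one averaged feature
# (averaging is an assumption; other groupings are possible).
df["alcohol_intake"] = df[["Dalc", "Walc"]].mean(axis=1)
df = df.drop(columns=["Dalc", "Walc"])

print(df)
```

Averaging keeps the grouped feature on the original 1-5 scale, which makes it easy to read on a dashboard.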

Tools used:

  • Heatmaps
  • Boxplots
  • Group binning
  • Value counts
  • Correlation analysis
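
As a rough sketch of how binning, value counts and correlation analysis fit together, here is a small pandas example. The data is invented, the column names are assumed from the UCI dataset, and the pass mark of 10 out of 20 is an assumption rather than something confirmed by the project.

```python
import pandas as pd

# Invented sample with three assumed columns.
df = pd.DataFrame({
    "studytime": [1, 2, 2, 3, 4, 1],
    "failures": [2, 0, 1, 0, 0, 3],
    "G3": [6, 12, 9, 15, 18, 4],
})

# Group binning: turn the 0-20 final grade into a pass/fail label
# (pass mark of 10 is an assumption).
df["passed"] = pd.cut(df["G3"], bins=[-1, 9, 20], labels=["fail", "pass"])

# Value counts: check how balanced the two classes are.
counts = df["passed"].value_counts()

# Correlation analysis on the numeric columns.
corr = df[["studytime", "failures", "G3"]].corr()

print(counts)
print(corr.round(2))
```

The `corr` frame is what a heatmap would visualise, for example with `seaborn.heatmap(corr, annot=True)`.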

⚙️ Model Training

After feature engineering and encoding:

  • Split data into train/test sets
  • Used Gradient Boosting Classifier
  • Preprocessing: scaling + one-hot encoding
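
A minimal sketch of such a training setup with scikit-learn is shown below. It runs on synthetic data, and the feature names, split ratio and random seeds are all assumptions made for illustration; the real notebook may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the student data (feature names are assumed).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "studytime": rng.integers(1, 5, n),
    "failures": rng.integers(0, 4, n),
    "reason": rng.choice(["home", "reputation", "course", "other"], n),
})
# Made-up pass/fail label loosely tied to the numeric features.
y = ((X["studytime"] - X["failures"] + rng.normal(0, 1, n)) > 0).astype(int)

# Split into train and test sets, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["studytime", "failures"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["reason"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", GradientBoostingClassifier(random_state=42)),
])
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Wrapping the preprocessing and the classifier in one `Pipeline` means the scaler and encoder are fitted on the training split only, which avoids leaking test data into training.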

We tested multiple models for comparison:

| Model | Precision | Test vs Train | Overfitting |
| --- | --- | --- | --- |
| Logistic Regression | 0.75 | 0.72 vs 0.72 | ❌ |
| KNN | 0.76 | 0.67 vs 0.83 | ✅ |
| Random Forest | 0.81 | 0.70 vs 0.98 | ✅ |
| XGBoost | 0.82 | 0.68 vs 0.95 | ✅ |
| Gradient Boosting | 0.76 | 0.70 vs 0.78 | ✅ Slight |
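
A comparison like the one above can be produced by scoring each model on both splits and looking at the gap. The sketch below uses synthetic data and accuracy (what scikit-learn's `score` returns for classifiers) as the metric; the metric behind the "Test vs Train" column in this project is not stated, so treat that choice as an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Noisy synthetic data so that overfitting has a chance to show up.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    scores[name] = (test_acc, train_acc)
    # A large positive gap (train much higher than test) suggests overfitting.
    print(f"{name}: {test_acc:.2f} vs {train_acc:.2f} (gap {train_acc - test_acc:+.2f})")
```

A model whose train score sits far above its test score has memorised the training data more than it has learned a general pattern.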

🧪 Final Metrics (on test data)

  • 🎯 Accuracy: 0.96
  • 📊 Precision: 0.95
  • 🔁 Recall: 0.99
  • 🧮 F1 Score: 0.97

These metrics show strong performance on the binary classification task: predicting whether a student would pass or fail based on inputs.
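
For readers less familiar with these four metrics, here is a small worked example using scikit-learn on ten hand-made predictions. The labels are invented for illustration (1 = pass, 0 = fail) and are unrelated to the project's actual results.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented labels: 1 = pass, 0 = fail.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

# Accuracy: share of all predictions that are correct.
accuracy = accuracy_score(y_true, y_pred)
# Precision: of the students predicted to pass, how many actually passed.
precision = precision_score(y_true, y_pred)
# Recall: of the students who actually passed, how many were predicted to pass.
recall = recall_score(y_true, y_pred)
# F1: harmonic mean of precision and recall.
f1 = f1_score(y_true, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A recall close to 1 with slightly lower precision means the model rarely misses a passing student but occasionally predicts a pass for a student who fails.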


🎓 Insights

  • Students who had failed before are more likely to receive support and improve.
  • Mother’s education had a stronger correlation with final grades than father’s.
  • Study time and reason for school choice were significant drivers.
  • Most students want to pursue higher education, which creates a class imbalance.

🧠 Tools Used

  • Python: pandas, seaborn, scikit-learn, joblib
  • Web app: Streamlit
  • Deployment: Streamlit Cloud
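
One common way to connect training and a Streamlit app is to save the fitted model with joblib and load it when the app starts. The sketch below shows only that save/load round trip on synthetic data; the file name `model.joblib` is hypothetical, and the project's actual app code may be organised differently.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Train a small model on synthetic data.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")  # hypothetical file name
    joblib.dump(model, path)       # done once, after training
    restored = joblib.load(path)   # done by the app at start-up

# The restored model should behave exactly like the original.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Saving the whole fitted object (ideally a pipeline that includes preprocessing) keeps the app code short: it only has to load the file and call `predict` on the user's inputs.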

👉 student-forecaster.streamlit.app


Additional Resources 📎

For a detailed explanation:

  1. 📊 Data Analysis & Insights: View the presentation slides
  2. 🧠 Model Training & Evaluation: Explore the Jupyter Notebook
  3. 💻 Codebase & App: Visit the GitHub repository

💬 Final Thoughts

This project brought together all the fundamentals of data science, from data wrangling and EDA to machine learning and user deployment, to solve a real-world education problem.

Let’s build data tools that truly support educators and learners alike. 🌍📚