An analysis of Student’s performance dataset from the UCI ML Repository.
Surely there could be many great drink mixers to go with alcohol, but education certainly isn’t one of them.
The UCI Machine Learning Repository provides a data set on student performance in secondary education at two Portuguese schools in two subjects: Mathematics and Portuguese language. It includes attributes such as student grades and demographic, social, and school-related features.
One of the reasons for my interest in this particular data set was the alcohol consumption attribute. The data set records students’ workday and weekend alcohol consumption on a five-level scale from 1 to 5. Alcoholism is a problem that many people have to fight against their whole lives. The fact that secondary school students consume alcohol is a concern worth spending time on. Not only does it damage their health at a young age, but it could also lead to addiction and subsequently hamper their chances of leading a happy and successful life.
Looking at the dataset, there are three questions I am curious to answer:
1. Of those who performed poorly, how many had high alcohol consumption?
2. Do students drink more, less, or the same on weekends as they do on workdays?
3. Does parents’ education affect students’ grades?
Before answering these questions, let’s explore the datasets:
There are 395 instances in the Mathematics dataset and 649 instances in the Portuguese language dataset. Of these, 382 students appear in both datasets.
There are 33 attributes. Only 5 of these are quantitative variables; the rest are categorical. There are no null values in either dataset.
There are a few non-numeric binary columns like [‘address’, ‘famsize’, ‘Pstatus’, ‘famsup’, ‘paid’, ‘activities’, ‘nursery’, ‘higher’, ‘internet’, ‘romantic’]. We convert these columns to numeric binary columns (0 or 1) for easier analysis.
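The conversion described above could look roughly like the sketch below. It assumes the data is loaded into a pandas DataFrame; the helper name `encode_binary_columns` and the choice of which value becomes 0 or 1 (alphabetical order) are my assumptions for illustration, and the toy rows are made up rather than taken from the real data.

```python
import pandas as pd

# Binary text columns named in the article.
BINARY_COLS = ["address", "famsize", "Pstatus", "famsup", "paid",
               "activities", "nursery", "higher", "internet", "romantic"]

def encode_binary_columns(df, columns=BINARY_COLS):
    """Return a copy of df with each two-valued text column mapped to 0/1."""
    out = df.copy()
    for col in columns:
        if col not in out.columns:
            continue  # skip columns the frame does not have
        values = sorted(out[col].dropna().unique())
        if len(values) > 2:
            raise ValueError(f"{col!r} has more than two distinct values: {values}")
        # Assumption: alphabetical order decides the coding (first -> 0, second -> 1).
        mapping = {value: code for code, value in enumerate(values)}
        out[col] = out[col].map(mapping)
    return out

# Toy rows, made up for illustration (not real records).
toy = pd.DataFrame({
    "address": ["U", "R", "U"],
    "internet": ["yes", "no", "yes"],
    "age": [16, 17, 15],
})
encoded = encode_binary_columns(toy)
print(encoded)
```

With this coding, "no" becomes 0 and "yes" becomes 1 for the yes/no columns, which keeps the sign of later correlations easy to read.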
The heads (first 5 rows) of both datasets are as follows:
The datasets would require further wrangling before they could be fed into predictive models. After filtering out the irrelevant columns, we plot a heat map of the correlations for each dataset:
From the correlation matrix, we can draw the following broad conclusions:
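As a minimal sketch of the step behind the heat map, the correlation matrix can be computed with pandas before plotting. The helper name `correlation_matrix` and its `drop` parameter are hypothetical, since the article does not list which columns were treated as irrelevant, and the toy rows are made up for illustration.

```python
import pandas as pd

def correlation_matrix(df, drop=()):
    """Pearson correlations between the numeric columns of df, minus `drop`."""
    numeric = df.select_dtypes(include="number")
    # `drop` stands in for whichever columns are considered irrelevant (assumption).
    numeric = numeric.drop(columns=list(drop), errors="ignore")
    return numeric.corr()

# Toy rows, made up for illustration (not real records).
toy = pd.DataFrame({
    "Dalc": [1, 2, 3, 4],
    "Walc": [2, 4, 6, 8],
    "G3": [16, 12, 8, 4],
    "school": ["GP", "GP", "MS", "MS"],  # non-numeric, so it is excluded
})
corr = correlation_matrix(toy)
print(corr.round(2))
```

The resulting matrix can then be handed to a plotting function such as `seaborn.heatmap` to draw the figure.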
1. Both variables of alcohol consumption, Dalc and Walc, have a positive relationship with other attributes like absences, age, failures, family support, free time, going out, internet, paid, and romantic.
2. There is a negative dependency between alcohol consumption and grades, extra-curricular activities, family relations, family support, and extra educational support.
3. The attribute ‘sex’ (0: male, 1: female) also has a negative correlation with drinking, which means male students in this dataset consume more alcohol than female students.
4. The final grade variable ‘G3’ has a negative relationship with attributes of family size, family support, going out, health, and romantic relations.
Now we dig deeper into the datasets to answer the questions.
Of those who performed poorly, how many had high alcohol consumption?
To answer this question, we first need to define what we mean by “poor performance” and “high alcohol consumption”.
For performance, we use the G3 attribute of the data. Following the Erasmus grade conversion system (Erasmus is a European exchange program that enables student exchange across 31 countries), we assign grades as follows:
We define poor performance as grades 1 and 2. The table with the added Grade and Performance attributes is as follows:
For alcohol consumption, we define high alcohol consumption as a level of 3 or higher on either workdays or weekends.
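The two definitions above can be sketched in pandas as follows. This assumes a `Grade` column on the 1 to 5 scale has already been derived from G3; the flag names `HighAlc` and `PoorPerf` and the helper `flag_students` are hypothetical, and the toy rows are made up for illustration.

```python
import pandas as pd

def flag_students(df):
    """Add boolean flags for high alcohol consumption and poor performance."""
    out = df.copy()
    # High consumption: level 3 or above on workdays OR on weekends.
    out["HighAlc"] = (out["Dalc"] >= 3) | (out["Walc"] >= 3)
    # Poor performance: converted grade of 1 or 2.
    out["PoorPerf"] = out["Grade"] <= 2
    return out

# Toy rows, made up for illustration (not real records).
toy = pd.DataFrame({
    "Grade": [1, 2, 2, 4, 5],
    "Dalc": [3, 1, 1, 1, 2],
    "Walc": [4, 3, 2, 1, 2],
})
flagged = flag_students(toy)

# Share of poorly performing students who also have high consumption.
poor = flagged[flagged["PoorPerf"]]
share_high_alc = poor["HighAlc"].mean()
print(f"{share_high_alc:.1%} of poor performers have high alcohol consumption")
```

Taking the mean of a boolean column gives the proportion of True values, which is the figure the first question asks for.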
Analyzing the data for the two subjects, the distribution for the table of Mathematics students is as follows:
And for Portuguese students:
In both subjects, the pie charts above show that the majority of those who performed poorly had high alcohol consumption, and that the proportion of low-performing students falls with lower alcohol consumption. This suggests an association between high alcohol consumption and low grades, although it does not show that one causes the other.
Do students drink more, less, or the same on weekends as they do on workdays?
By analyzing the data, we can find out how many students drink more, less, or the same on weekends compared with workdays.
For this question, we merge the records from the two tables on their common attributes:
[“school”, “sex”, “age”, “address”, “famsize”, “Pstatus”, “Medu”, “Fedu”, “Mjob”, “Fjob”, “reason”, “nursery”, “internet”]
We take the set union of the two tables, treating two records as the same student when their values in the common columns are equal.
The change in alcohol consumption is calculated as Dalc (workday alcohol consumption) minus Walc (weekend alcohol consumption). Negative values indicate higher consumption on weekends.
As evident from the pie chart, a little over 50% of students drank more on weekends than on workdays, and only 1.8% drank less.
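One way to sketch the union and the workday/weekend comparison in pandas is shown below. It assumes a student's Dalc and Walc values are the same in both tables, so keeping the first matching record loses nothing. The helper names and the category labels are hypothetical, and the toy rows (which use only three key columns for brevity) are made up for illustration.

```python
import pandas as pd

# Common attributes listed in the article.
COMMON = ["school", "sex", "age", "address", "famsize", "Pstatus", "Medu",
          "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"]

def union_students(math_df, por_df, keys=COMMON):
    """Stack both tables and keep one record per distinct combination of keys."""
    combined = pd.concat([math_df, por_df], ignore_index=True)
    return combined.drop_duplicates(subset=keys, keep="first").reset_index(drop=True)

def weekend_change(df):
    """Share of students who drink more, less, or the same on weekends."""
    diff = df["Dalc"] - df["Walc"]  # negative -> more on weekends
    labels = diff.map(lambda d: "more on weekends" if d < 0
                      else "less on weekends" if d > 0 else "same")
    return labels.value_counts(normalize=True)

# Toy rows, made up for illustration (not real records).
math_toy = pd.DataFrame({"school": ["GP", "GP"], "sex": ["F", "M"],
                         "age": [16, 17], "Dalc": [1, 2], "Walc": [3, 2]})
por_toy = pd.DataFrame({"school": ["GP", "MS"], "sex": ["F", "M"],
                        "age": [16, 15], "Dalc": [1, 3], "Walc": [1 + 2, 1]})

students = union_students(math_toy, por_toy, keys=["school", "sex", "age"])
shares = weekend_change(students)
print(shares)
```

In the toy data the first student appears in both tables, so the union holds three students rather than four.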
Does parents’ education affect students’ grades?
My last question is whether there is a correlation between parents’ education and students’ grades. For this analysis, we can simply concatenate the students’ records from both datasets, because students who appear in both have a separate grade for each subject. The combined dataset has 1044 records.
We count parents as highly educated if either the father or the mother has higher education (value 4 in the Fedu and Medu fields), and we reuse the earlier criterion that a good grade is greater than or equal to 3. Higher performance (HighPerf) is shown in red and lower performance (LowPerf) in blue.
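A rough sketch of this comparison in pandas follows. It assumes both tables already carry the converted `Grade` column on the 1 to 5 scale; the helper name `parent_education_table` and the labels `HighEdu` and `LowEdu` are hypothetical, and the toy rows are made up for illustration.

```python
import pandas as pd

def parent_education_table(math_df, por_df):
    """Row-normalized table of performance by parents' education level."""
    # Concatenate rather than merge: each subject contributes its own grade.
    records = pd.concat([math_df, por_df], ignore_index=True)
    # Highly educated: either parent has value 4 (higher education).
    high_edu = (records["Medu"] == 4) | (records["Fedu"] == 4)
    edu_label = high_edu.map({True: "HighEdu", False: "LowEdu"})
    # Good grade: converted grade of 3 or above.
    perf_label = records["Grade"].ge(3).map({True: "HighPerf", False: "LowPerf"})
    return pd.crosstab(edu_label.rename("ParentEdu"),
                       perf_label.rename("Performance"),
                       normalize="index")

# Toy rows, made up for illustration (not real records).
math_toy = pd.DataFrame({"Medu": [4, 2, 1], "Fedu": [3, 4, 1], "Grade": [4, 2, 1]})
por_toy = pd.DataFrame({"Medu": [4, 2], "Fedu": [4, 2], "Grade": [5, 3]})

table = parent_education_table(math_toy, por_toy)
print(table.round(2))
```

Normalizing by row means each line of the table sums to 1, so the HighPerf column can be compared directly between the two education groups.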
The proportions are as follows:
To summarize, there is a correlation between grades and alcohol consumption, and between parents’ education and a student’s grade. Additionally, those who drink on workdays usually drink more or the same on the weekends.
Further exploration and the use of predictive models can provide deeper insight into underage alcoholism and show which attributes we could focus on to improve a student’s life, both in terms of better grades and a healthier personal life.
Based on this dataset, what do you think could be done so that students would score better?
You can get the dataset here.
The full code for this project is in the GitHub repository here.