Environment & Versions
R==4.5.2
plotly==4.12.0
sampling==2.11
tidyverse==2.0.0

Pinn Xu
February 23, 2026
This project was conducted in R v4.5.2 using the plotly (v4.12.0) and sampling (v2.11) packages, together with the tidyverse collection (v2.0.0), which includes packages such as stringr (v1.6.0), tibble (v3.3.1), and dplyr (v1.1.4).
The dataset named Student Social Media & Relationships from Kaggle was used for this analysis. It provided anonymous cross-sectional survey data from students aged 16-25 with different educational backgrounds across multiple countries, capturing key dimensions such as social media usage intensity, platform preferences, and relationship dynamics, and allowing researchers to examine how digital behaviors relate to personal and academic outcomes. Each observation represented an individual student’s response, and the sample size (N = 705) was sufficient for statistical analysis.
In terms of the variables, there was one identifier variable, Student_ID , giving each observation a unique ID. The nominal variables Gender , Country , Most_Used_Platform , and Relationship_Status had non-ordinal levels, while Academic_Level had ordinal levels and Affects_Academic_Performance was Boolean. Numerical variables included Age , Avg_Daily_Usage_Hours , Sleep_Hours_Per_Night , Mental_Health_Score , Conflicts_Over_Social_Media , and Addicted_Score , where the mental health and addicted scores were self-rated on a scale from 1 to 10.
This project aimed to investigate the complex correlations between students’ social media patterns (e.g. average usage hours, platform preference), contextual and demographic factors (e.g. academic level, country, age, gender), and student well-being and performance (e.g. average sleep hours, mental health, addiction).
The suitability of the selected dataset stemmed from its robust sample size (N = 705) and multidimensional structure with different types of variables, which provided statistical power to draw reliable insights on student behavior.
The analytical phase of this project was structured into three primary stages. Firstly, descriptive analyses were conducted to establish a basic understanding of the sample’s demographics and social media usage patterns. Secondly, the analysis examined correlations between indices of students’ digital habits and well-being, followed by an application of the Central Limit Theorem (CLT) to validate the reliability of statistical inferences. Finally, it compared and evaluated the efficiency and precision of three different sampling methods: simple random, systematic, and stratified sampling.
Regarding the visualization process, the project mainly used data wrangling methods from packages in tidyverse to manipulate data before plotting, including the pipe operator |> and functions such as group_by() and summarize() from the dplyr package; tools from the plotly package were then used to draw plots.
Furthermore, to improve the accessibility of visual storytelling, this project adopted colorblind-friendly palettes generated using the online tool developed by David Nichols.
The dataset was downloaded from Kaggle and saved to a local directory. It was then imported into R as df0 using the built-in read.csv() function with a relative file path. The structure of df0 was shown in Output 1 below.
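The import step itself was not shown in this excerpt; a minimal sketch, in which the file name data/students_social_media.csv is an assumption for illustration rather than the report's actual path, would be:

```r
# import the CSV with a relative path
# (the file name below is assumed, not taken from the report)
df0 <- read.csv("data/students_social_media.csv")
# inspect the structure of the imported data frame
str(df0)
```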
'data.frame': 705 obs. of 13 variables:
$ Student_ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Age : int 19 22 20 18 21 19 23 20 18 21 ...
$ Gender : chr "Female" "Male" "Female" "Male" ...
$ Academic_Level : chr "Undergraduate" "Graduate" "Undergraduate" "High School" ...
$ Country : chr "Bangladesh" "India" "USA" "UK" ...
$ Avg_Daily_Usage_Hours : num 5.2 2.1 6 3 4.5 7.2 1.5 5.8 4 3.3 ...
$ Most_Used_Platform : chr "Instagram" "Twitter" "TikTok" "YouTube" ...
$ Affects_Academic_Performance: chr "Yes" "No" "Yes" "No" ...
$ Sleep_Hours_Per_Night : num 6.5 7.5 5 7 6 4.5 8 6 6.5 7 ...
$ Mental_Health_Score : int 6 8 5 7 6 4 9 6 7 7 ...
$ Relationship_Status : chr "In Relationship" "Single" "Complicated" "Single" ...
$ Conflicts_Over_Social_Media : int 3 0 4 1 2 5 0 2 1 1 ...
$ Addicted_Score : int 8 3 9 4 7 9 2 8 5 4 ...
Output 1. Structure of the Dataframe
To facilitate further investigation and visualization of the data, a few processing steps were conducted in advance.
Firstly, the relevant packages were loaded, and the original dataset df0 was converted to a tibble df , leaving the original dataset unchanged. Next, the names of the categorical and numerical variables used in subsequent analyses were stored in separate vectors ( names_catvar and names_numvar ).
# attach packages
library(tidyverse)
library(plotly)
library(sampling)
# create a tibble
df <- tibble(df0) # store tibble in df, let df0 remain what it is
# store names of categorical variables to use
names_catvar <- c(
"Gender", "Country", "Most_Used_Platform", "Relationship_Status",
"Academic_Level", "Affects_Academic_Performance"
)
# store names of numerical variables to use
names_numvar <- c(
"Age", "Avg_Daily_Usage_Hours", "Sleep_Hours_Per_Night",
"Mental_Health_Score", "Conflicts_Over_Social_Media", "Addicted_Score"
)

An initial inspection for missing values was conducted using sum(is.na(df)) , and no explicit missing data were identified in the dataset (shown in Output 2).
Output 2. Number of NA Values
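The missing-value check described above amounts to a single expression:

```r
# count explicit NA values across the whole tibble;
# Output 2 reported 0 for this dataset
sum(is.na(df))
```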
For categorical variables, their numbers of categories (see Output 3) and counts of observations in each category (see Output 4) were first examined. The results indicated that most categorical variables contained a moderate number of categories, whereas country and most-used platform variables had relatively larger numbers of categories and therefore required further processing.
[1] "Gender 2"
[1] "Country 110"
[1] "Most_Used_Platform 12"
[1] "Relationship_Status 3"
[1] "Academic_Level 3"
[1] "Affects_Academic_Performance 2"
Output 3. Numbers of Unique Values in Categorical Variables
Gender
Female Male
353 352
Relationship_Status
Complicated In Relationship Single
32 289 384
Academic_Level
Graduate High School Undergraduate
325 27 353
Affects_Academic_Performance
No Yes
252 453
Output 4. Numbers of Observations of Each Value in Categorical Variables
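The code producing Outputs 3 and 4 was not shown; a sketch consistent with the printed format might be:

```r
# number of unique values in each categorical variable (Output 3)
for (v in names_catvar) {
  print(paste(v, length(unique(df[[v]]))))
}
# observation counts for the low-cardinality variables (Output 4)
for (v in c("Gender", "Relationship_Status", "Academic_Level",
            "Affects_Academic_Performance")) {
  print(table(df[v]))
}
```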
The five countries and five platforms with the most observations were thereby selected, and their names were stored in the vectors top5_countries and top5_platforms . The results were displayed in Output 5.
Country
India USA Canada Denmark France
53 40 34 27 27
Most_Used_Platform
Instagram TikTok Facebook WhatsApp Twitter
249 154 123 54 30
Output 5. Numbers of Observations in Each Country and Platform
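A plausible sketch for selecting and storing the top five levels (the report's top5_countries and top5_platforms vectors) is:

```r
# names of the five most frequent countries and platforms
top5_countries <- names(sort(table(df$Country), decreasing = TRUE))[1:5]
top5_platforms <- names(sort(table(df$Most_Used_Platform), decreasing = TRUE))[1:5]
# counts of the selected levels (Output 5)
print(sort(table(df$Country), decreasing = TRUE)[1:5])
print(sort(table(df$Most_Used_Platform), decreasing = TRUE)[1:5])
```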
In addition, the Affects_Academic_Performance variable was converted from character to logical ( “Yes” = TRUE , “No” = FALSE ) as shown in Output 6.
## Boolean Variable
# convert variable from character to logical
df$Affects_Academic_Performance <- df$Affects_Academic_Performance == "Yes"
# As the variable contains only "Yes" and "No" responses,
# no further data cleaning was needed.
# check the converted variable
str(df$Affects_Academic_Performance)

logi [1:705] TRUE FALSE TRUE FALSE TRUE TRUE ...
Output 6. Boolean Variable
Descriptive statistics for the numerical variables were obtained using the summary() function and are shown in Output 7. The results indicated that the variables fell within reasonable ranges, with no apparent extreme or abnormal values. Thus, no further preprocessing was deemed necessary at this stage.
Age
Min. :18.00
1st Qu.:19.00
Median :21.00
Mean :20.66
3rd Qu.:22.00
Max. :24.00
Avg_Daily_Usage_Hours
Min. :1.500
1st Qu.:4.100
Median :4.800
Mean :4.919
3rd Qu.:5.800
Max. :8.500
Sleep_Hours_Per_Night
Min. :3.800
1st Qu.:6.000
Median :6.900
Mean :6.869
3rd Qu.:7.700
Max. :9.600
Mental_Health_Score
Min. :4.000
1st Qu.:5.000
Median :6.000
Mean :6.227
3rd Qu.:7.000
Max. :9.000
Conflicts_Over_Social_Media
Min. :0.00
1st Qu.:2.00
Median :3.00
Mean :2.85
3rd Qu.:4.00
Max. :5.00
Addicted_Score
Min. :2.000
1st Qu.:5.000
Median :7.000
Mean :6.437
3rd Qu.:8.000
Max. :9.000
Output 7. Descriptive Summaries of Numerical Variables
Moreover, the Age variable was categorized into three groups due to its limited range and variability. As shown in Output 8, Age treated as continuous shows minimal dispersion within such a narrow interval (18-24) and limited continuity (only 7 unique values), while treating the values as single-year categories would likely produce several small and uneven subgroups. Categorizing into three groups therefore provided a more suitable structure for subsequent analysis.
To achieve this, a histogram was first used to view the age distribution and obtain suitable breaks for grouping. The grouping information was then stored in a new variable named Age_Group (see Output 9).
[1] 18 24
[1] 7
Output 8. Range of Age and Numbers of Unique Age
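The two values in Output 8 follow from two base R calls:

```r
# range of Age (Output 8, first line)
range(df$Age)
# number of distinct ages (Output 8, second line)
length(unique(df$Age))
```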
## Age
## Store histogram data
# store histogram data with 3 bins
hist_data <- hist(
df$Age,
breaks = 4,
xlab = "Age",
col = "#FFD794",
main = "Age Distribution",
plot = FALSE # do not display plot
)
# store breaks for grouping
breaks <- hist_data$breaks
## Add new variable Age_Group to df
# store grouping information into a new variable Age_Group
df <- df |>
mutate(
Age_Group = case_when(
(breaks[1] <= Age) & (Age <= breaks[2]) ~
sprintf("%d-%d", breaks[1], breaks[2]),
(breaks[2] <= Age) & (Age <= breaks[3]) ~
sprintf("%d-%d", breaks[2], breaks[3]),
(breaks[3] <= Age) & (Age <= breaks[4]) ~
sprintf("%d-%d", breaks[3], breaks[4]),
)
)
# view new variable
table(df["Age_Group"])

Age_Group
18-20 20-22 22-24
342 303 60
Output 9. Age Groups
This section demonstrated several core contextual and behavioral patterns of the sample.
Figure 1 illustrated a highly balanced gender distribution within the dataset (N = 705). The sample was almost perfectly split, with females representing 50.1% and males representing 49.9% of the participants. This balanced distribution allowed relatively unbiased gender-based comparisons of behavioral patterns.
## Gender
# pie chart
plot_ly(
as.data.frame(table(df["Gender"])),
labels = ~Gender,
values = ~Freq,
type = "pie", # pie chart
textinfo = "label+percent", # display label & percent on pie slices
# plotting area for pie chart
domain = list(
x = c(0, 1), # horizontal: full width
y = c(0, 0.85) # vertical: leave 15% margin on the top
),
# aesthetics
marker = list(
colors = c("Female" = "#FFB697", "Male" = "#8fc4de"), # filling color
line = list(color = "#222", width = 1.5) # outline
)
) |>
layout(
showlegend = FALSE, # remove legend
margin = list(l = 1, r = 1), # left & right margin
title = list(
y = 0.95, # y position of title
text = "Sample Gender Distribution"
)
)

The bar plot (Figure 2), with percentages relative to the whole sample reported in Output 10, illustrated the sample’s top five countries: India led at 7.5%, followed by the USA (5.7%) and Canada (4.8%), while Denmark and France each represented 3.8% of the total. This highlighted the geographical diversity of the participants, centered primarily in South Asian and North American regions.
## Country
# extract top 5 country data
top5_countries_data <- df |>
filter(Country %in% top5_countries)
# barplot
plot_ly(
as.data.frame(
sort(table(top5_countries_data["Country"]), decreasing = TRUE) # sort
),
x = ~Country,
y = ~Freq,
type = "bar", # barplot
# aesthetics
marker = list(
color = "#add4a0", # filling color
line = list(color = "#222", width = 1.5) # outline color & width
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Sample Country Distribution (top 5)"
),
xaxis = list(
title = "Countries",
range = c(-0.8, 4.8) # extend x axis (0 = the 1st category)
),
yaxis = list(
title = "Frequency",
range = c(0, 60) # let y axis be exhaustive
)
)
India USA Canada Denmark France
7.517730 5.673759 4.822695 3.829787 3.829787
Output 10. Percentages of Top 5 Countries Relative to the Population
The age group distribution (shown in Figure 3) was skewed toward younger participants, with the 18-20 and 20-22 cohorts being the most predominant. The 22-24 group had a markedly lower frequency, indicating that the sample primarily consisted of early-stage university students.
## Age Group
# barplot
plot_ly(
as.data.frame(
sort(table(df["Age_Group"]), decreasing = TRUE) # sort
),
x = ~Age_Group,
y = ~Freq,
type = "bar", # barplot
# aesthetics
marker = list(
color = "#FFD794", # filling color
line = list(color = "#222", width = 1.5) # outline color & width
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Sample Age Group Distribution"
),
xaxis = list(
title = "Age Groups",
range = c(-0.8, 2.8) # extend x axis (0 = the 1st category)
),
yaxis = list(
title = "Frequency",
range = c(0, 400) # let y axis be exhaustive
)
)

Figure 4 showed the frequency distribution of the five most popular social platforms. Instagram (249) was the most popular platform in the sample, followed by TikTok (154) and Facebook (123), while WhatsApp (54) and Twitter (30) showed much lower frequencies.
## Platform
# extract top 5 platform data
top5_platforms_data <- df |>
filter(Most_Used_Platform %in% top5_platforms)
# barplot
plot_ly(
as.data.frame(
sort(table(top5_platforms_data["Most_Used_Platform"]), decreasing = TRUE)
),
x = ~Most_Used_Platform,
y = ~Freq,
type = "bar", # barplot
# aesthetics
marker = list(
color = "#c3a4cf", # filling color
line = list(color = "#222", width = 1.5) # outline color & width
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Top 5 Popular Platforms"
),
xaxis = list(
title = "Platforms",
range = c(-0.8, 4.8) # extend x axis (0 = the 1st category)
),
yaxis = list(title = "Frequency")
)

Notably, after grouping by gender (shown in Figure 5), there were clear gender differences in platform preferences. Instagram and TikTok were more popular among females, with Instagram showing the larger gender gap. In contrast, Facebook and WhatsApp showed a more male-dominated trend, while Twitter showed minimal gender variation. Overall, females appeared to prefer visual platforms, whereas males favored communication-oriented ones.
## Platform ~ Gender
# obtain data
platform_gender_data <- top5_platforms_data |>
group_by(Most_Used_Platform, Gender) |>
summarize(Count = n(), .groups = 'drop') |>
arrange(desc(Count))
# barplot
plot_ly(
platform_gender_data,
x = ~factor(Most_Used_Platform, top5_platforms), # sort
y = ~Count,
type = "bar", # barplot
color = ~Gender, # map Gender to bar color
colors = c("Female" = "#FFB697", "Male" = "#8fc4de"),
# aesthetics
marker = list(
line = list(color = "#222", width = 1.5) # outline color & width
)
) |>
layout(
barmode = "group", # side-by-side bar groups
legend = list(
orientation = "h", # horizontal legend labels
xanchor = "right", yanchor = "top", # anchor points of the legend
x = 1, y = 1 # position of the legend
),
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Top 5 Popular Platforms by Gender"
),
xaxis = list(
title = "Platforms",
range = c(-0.6, 4.6) # extend x axis (0 = the 1st category)
),
yaxis = list(title = "Frequency")
)

The histogram of average media usage hours (Figure 6) followed a symmetrical distribution, peaking between 4.5 and 5 hours, suggesting a consistently high-engagement digital lifestyle across the surveyed students.
## Usage Hours
# histogram
plot_ly(
df,
x = ~Avg_Daily_Usage_Hours,
type = "histogram",
# aesthetics
marker = list(
color = "#55D2C6", # filling color
line = list(color = "#222", width = 1.5) # outline
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Average Daily Usage Hours Distribution"
),
xaxis = list(
title = "Average Daily Usage Hours",
range = c(0.8, 9.4), # let x axis be exhaustive
dtick = 1 # distance between ticks
),
yaxis = list(title = "Frequency")
)

Boxplots (Figure 7) were used to illustrate the distribution of average daily usage hours across the top 5 countries. The USA and India exhibited the highest levels of engagement, with median usage hours clearly exceeding those of the other nations. In contrast, Canada, France, and Denmark showed lower overall usage, with Denmark recording the lowest median. In addition, a notable outlier was observed in the Indian group, representing a user with exceptionally low daily engagement compared to the national trend.
## Usage Hours ~ Top 5 Countries
## Sorting
# find median usage hours of each country
countries_by_usage <- top5_countries_data |>
group_by(Country) |>
# create new variable to store medians
summarize(
Median_Usage_Hour = median(Avg_Daily_Usage_Hours, na.rm = TRUE),
.groups = "drop" # drop grouping information
) |>
# sort by median descending
arrange(desc(Median_Usage_Hour)) |>
# pull out the vector of Country variable
# (country names with median usage hours descending)
pull(Country)
## Plotting
# boxplot
plot_ly(
top5_countries_data,
# make Country variable a factor by levels of
# countries with median usage hours descending
x = ~factor(Country, countries_by_usage),
y = ~Avg_Daily_Usage_Hours,
type = "box", # boxplot
# aesthetics: boundary box
fillcolor = "#55D2C6",
line = list(color = "#222", width = 1.5),
# aesthetics: outliers
marker = list(
color = "#55D2C6", # filling color
line = list(color = "#222", width = 1.5) # outline
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Average Daily Usage Hours Distribution by Top 5 Countries"
),
xaxis = list(
title = "Top 5 Countries",
range = c(-0.6, 4.6) # extend x axis (0 = the 1st category)
),
yaxis = list(
title = "Average Daily Usage Hours",
range = c(1, 10), # let y axis be exhaustive
dtick = 1 # distance between ticks
)
)

Meanwhile, the distribution of sleep hours was also investigated. Figure 8 revealed a roughly symmetrical distribution centered around a peak of approximately 7 to 7.5 hours of sleep per night. While the majority of students fell within a healthier range of 6 to 8 hours, the distribution exhibited a broad spread, with some individuals reporting as little as 4 hours and others reaching nearly 10 hours. This variation suggested a diverse range of sleeping habits within the student population.
## Sleep Hours
# histogram
plot_ly(
df,
x = ~Sleep_Hours_Per_Night,
type = "histogram",
# aesthetics
marker = list(
color = "#FFABB6", # filling color
line = list(color = "#222", width = 1.5) # outline
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Average Sleep Hours Distribution"
),
xaxis = list(
title = "Average Sleep Hours",
range = c(2.8, 11), # let x axis be exhaustive
dtick = 1 # distance between ticks
),
yaxis = list(
title = "Frequency",
range = c(0, 60) # let y axis be exhaustive
)
)

Furthermore, looking into the distribution of average sleep hours by top 5 countries, Figure 9 showed a clear inverse trend compared to the previously analyzed usage hours distribution (Figure 7). While the USA and India presented the highest social media engagement, they recorded the lowest median sleep durations, both falling below 6 hours per night. Conversely, Denmark and France, which showed the most conservative usage patterns, enjoyed the highest median sleep hours, with Denmark leading at approximately 8.2 hours. In addition, India displayed a number of upper outliers, indicating a group of students who maintained high sleep levels despite the regional trend. This geographical variation suggested a negative correlation between digital engagement and sleep, providing a baseline for the later analysis of the relationship between these behavioral patterns.
## Sleep Hours ~ Top 5 Countries
## Sorting
# find median sleep hours of each country
countries_by_sleep <- top5_countries_data |>
group_by(Country) |>
# create new variable to store medians
summarize(
Median_Sleep_Hour = median(Sleep_Hours_Per_Night, na.rm = TRUE),
.groups = "drop" # drop grouping information
) |>
# sort by median ascending
arrange(Median_Sleep_Hour) |>
# pull out the vector of Country variable
# (country names with median sleep hours ascending)
pull(Country)
## Plotting
# boxplot
plot_ly(
top5_countries_data,
# make Country variable a factor by levels of
# countries with sleep hours ascending
x = ~factor(Country, countries_by_sleep),
y = ~Sleep_Hours_Per_Night,
type = "box", # boxplot
# aesthetics: boundary box
fillcolor = "#FFABB6",
line = list(color = "#222", width = 1.5),
# aesthetics: outliers
marker = list(
color = "#FFABB6", # filling color
line = list(color = "#222", width = 1.5) # outline
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Average Sleep Hours Distribution by Top 5 Countries"
),
xaxis = list(
title = "Top 5 Countries",
range = c(-0.6, 4.6) # extend x axis (0 = the 1st category)
),
yaxis = list(
title = "Average Sleep Hours",
range = c(3, 10), # let y axis be exhaustive
dtick = 1 # distance between ticks
)
)

The scatter plot (Figure 10) demonstrated a strong negative correlation between daily media usage and sleep duration. As usage rose from 2 to 8 hours, sleep generally dropped by about 1 hour for each additional hour of use. Most points clustered along this downward trend, though some heavy users (6+ hours) still reported around 7 hours of sleep, while moderate users (4-6 hours) ranged from 6 to 8 hours of sleep.
## Sleep Hours ~ Usage Hours
# scatter plot
plot_ly(
df,
x = ~Avg_Daily_Usage_Hours
) |>
# add scatters
add_markers(
y = ~Sleep_Hours_Per_Night,
marker = list(color = "#3B9FE4", opacity = 0.6)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Relationship between Usage Hours and Sleep Hours"
),
xaxis = list(
title = "Average Daily Usage Hours",
range = c(0.8, 9.2), # let x axis be exhaustive
dtick = 1 # distance between ticks
),
yaxis = list(
title = "Average Sleep Hours",
range = c(3, 10) # let y axis be exhaustive
)
)

To validate the trend, a Pearson’s correlation test was conducted. As shown in Output 11, the correlation coefficient was around -0.79 (p < .001), indicating that students’ average daily social media usage hours were strongly negatively correlated with their sleep hours.
Pearson's product-moment correlation
data: df$Avg_Daily_Usage_Hours and df$Sleep_Hours_Per_Night
t = -34.231, df = 703, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.8167435 -0.7611776
sample estimates:
cor
-0.7905825
Output 11. Pearson’s Correlation Test Result
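The test reported in Output 11 corresponds to a standard cor.test() call:

```r
# Pearson's product-moment correlation test (Output 11)
cor.test(
  df$Avg_Daily_Usage_Hours, df$Sleep_Hours_Per_Night,
  method = "pearson"
)
```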
Figure 11 showed a linear fit of the two variables, which further clarified the negative correlation.
## Sleep Hours ~ Usage Hours
# linear model
fit <- lm(Sleep_Hours_Per_Night ~ Avg_Daily_Usage_Hours, data = df)
# scatter plot
plot_ly(
df,
x = ~Avg_Daily_Usage_Hours
) |>
# add scatters
add_markers(
y = ~Sleep_Hours_Per_Night,
name = "Data Points",
marker = list(color = "#3B9FE4", opacity = 0.6)
) |>
# add linear fit
add_lines(
y = ~fitted(fit),
name = "Linear Fit",
line = list(color = '#D53132', width = 1.5)
) |>
layout(
legend = list(
orientation = "h", # arrangement of legend labels
xanchor = "right", yanchor = "top", # anchor point of the legend
x = 1, y = 1 # position of the legend
),
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Linear Relationship between Usage Hours and Sleep Hours"
),
xaxis = list(
title = "Average Daily Usage Hours",
range = c(0.8, 9.2), # let x axis be exhaustive
dtick = 1 # distance between ticks
),
yaxis = list(
title = "Average Sleep Hours",
range = c(3, 10) # let y axis be exhaustive
)
)

In addition, it was considered worthwhile to investigate the relationships between the other indices, such as the mental health score and the addicted score (names of the indices are shown in Output 12). Therefore, a 5×5 paired scatter plot matrix was used to display all of the pairwise trends (see Figure 12).
[1] "Avg_Daily_Usage_Hours" "Sleep_Hours_Per_Night"
[3] "Mental_Health_Score" "Conflicts_Over_Social_Media"
[5] "Addicted_Score"
Output 12. Names of Indices to Compare
## Relationship between Indices
# extract numerical variable data
num_data <- df[, names_numvar]
# paired scatter plots
plot_ly(
num_data,
type = "splom", # scatter plots matrix
# indices to compare
dimensions = list(
list(label = "Usage", values = ~Avg_Daily_Usage_Hours),
list(label = "Sleep", values = ~Sleep_Hours_Per_Night),
list(label = "Mental Health", values = ~Mental_Health_Score),
list(label = "Conflicts", values = ~Conflicts_Over_Social_Media),
list(label = "Addicted", values = ~Addicted_Score)
),
# aesthetics
marker = list(
color = "#3B9FE4",
opacity = 0.5,
size = 4
)
) |>
layout(
margin = list(t = 60), # margin top
title = list(
y = 0.95, # y position of title
text = "Relationship between Indices"
)
)

The scatter plot matrix (Figure 12) illustrated a complex web of behavioral and psychological correlations, primarily centered on the detrimental effects of excessive social media usage. Meanwhile, Output 13 showed a correlation matrix for all five indices, providing further insight into the behavioral trends.
As shown by the results, a dominant negative correlation existed between usage hours and sleep, where increased screen time consistently aligned with diminished rest. This was reinforced by a positive correlation between usage and addiction levels, which in turn was associated with higher frequencies of life conflicts. Conversely, better mental health outcomes were associated with lower addiction scores, reduced usage, and longer sleep duration. Ultimately, the data suggested a detrimental cycle in which high device dependency compromised sleep and social harmony, collectively contributing to poorer mental well-being.
Avg_Daily_Usage_Hours Sleep_Hours_Per_Night
Avg_Daily_Usage_Hours 1.00 -0.79
Sleep_Hours_Per_Night -0.79 1.00
Mental_Health_Score -0.80 0.71
Conflicts_Over_Social_Media 0.80 -0.68
Addicted_Score 0.83 -0.76
Mental_Health_Score Conflicts_Over_Social_Media
Avg_Daily_Usage_Hours -0.80 0.80
Sleep_Hours_Per_Night 0.71 -0.68
Mental_Health_Score 1.00 -0.89
Conflicts_Over_Social_Media -0.89 1.00
Addicted_Score -0.95 0.93
Addicted_Score
Avg_Daily_Usage_Hours 0.83
Sleep_Hours_Per_Night -0.76
Mental_Health_Score -0.95
Conflicts_Over_Social_Media 0.93
Addicted_Score 1.00
Output 13. Correlation Matrix of Indices
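The code computing Output 13 was not shown; the cor_matrix object referenced below could be built as follows, where names_indices is a helper name introduced here (mirroring Output 12):

```r
# indices to compare (Output 12)
names_indices <- c(
  "Avg_Daily_Usage_Hours", "Sleep_Hours_Per_Night", "Mental_Health_Score",
  "Conflicts_Over_Social_Media", "Addicted_Score"
)
# Pearson correlation matrix, rounded to two decimals (Output 13)
cor_matrix <- round(cor(df[, names_indices]), 2)
print(cor_matrix)
```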
Notably, the correlation between addicted score and mental health (r = -.95) and that between addicted score and number of conflicts over social media (r = .93) were the strongest, with both |r| > .9 (see Output 14), indicating that as addictive behaviors intensified, psychological well-being declined and interpersonal friction rose almost in lockstep.
In conclusion, the addicted score served as the most potent predictor of negative life outcomes in this dataset.
## Relationship between Indices
# find location of pairs where |r| > 0.9
loc <- which(
abs(cor_matrix) > 0.9 & cor_matrix != 1,
arr.ind = TRUE # return matrix of row and column indices
)
# print results
for(i in 1:nrow(loc)) {
row <- loc[i, 1]
col <- loc[i, 2]
cat(
rownames(cor_matrix)[row], "~", colnames(cor_matrix)[col], "\n",
"r =", cor_matrix[row, col], "\n")
}

Addicted_Score ~ Mental_Health_Score
r = -0.95
Addicted_Score ~ Conflicts_Over_Social_Media
r = 0.93
Mental_Health_Score ~ Addicted_Score
r = -0.95
Conflicts_Over_Social_Media ~ Addicted_Score
r = 0.93
Output 14. Pairs of Indices with the Strongest Correlation (|r|>.9)
This section repeatedly drew random samples (1,000 times) of different sample sizes (n = 10, 30, 50, 100) from the population dataset ( df ) and then examined the resulting distributions of sample means, aiming to assess the applicability of the Central Limit Theorem (CLT).
The Central Limit Theorem states that as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the population’s original distribution.
In the meantime, the standard deviation of the sample means (the observed standard error) is related to the population standard deviation σ and the sample size n, with the theoretical standard error defined by the formula \(SE = \frac{\sigma}{\sqrt n}\) . This formula was used in later analyses to calculate the theoretical standard error.
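As a numeric check, plugging the population SD of usage hours (σ ≈ 1.26; see Output 15) into this formula reproduces the theoretical standard errors reported later:

```r
# theoretical standard error SE = sigma / sqrt(n) for the sample sizes used
pop_sd <- sd(df$Avg_Daily_Usage_Hours, na.rm = TRUE)  # about 1.26
round(pop_sd / sqrt(c(10, 30, 50, 100)), 2)  # 0.40 0.23 0.18 0.13 (Output 15)
```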
To investigate the applicability of the CLT, a function was first defined to draw random samples with a specified sample size, number of repetitions, variable to sample, and other relevant parameters.
## CLT
## Define function to plot
# store different sample sizes
sizes <- c(10, 30, 50, 100)
# define a function to randomly sample from df with
# specified sample size and plot sample means histogram with
# specified parameters
make_hist <- function(
n, col, time = 1000, ymax = 1,
binwidth = 0.2, bincolor = "#FFA472", xdtick = 1
) {
## n: sample size
## col: column to select from df
## time: times to replicate sampling process
## ymax: maximum value of y axis in histogram
## binwidth: width of bins in histogram
## bincolor: color of bins
## xdtick: distance between ticks on x axis
# randomly sample data from df and calculate sample means
sample_means <- replicate(
time, # repetition times, default is 1000
mean(
sample(df[[col]], size = n, replace = TRUE),
na.rm = TRUE
)
)
# find population mean & standard deviation
pop_mean <- mean(df[[col]], na.rm = TRUE)
pop_sd <- sd(df[[col]], na.rm = TRUE)
# find fixed x range
xmin <- pop_mean - (2 * pop_sd)
xmax <- pop_mean + (2 * pop_sd)
# plot sample means probability distribution histogram
plot_ly(
x = sample_means,
type = "histogram",
histnorm = "probability", # probability distribution
name = paste("n =", n),
xbins = list(
start = xmin,
end = xmax,
size = binwidth # specify binwidth, default is 0.4
),
# aesthetics
marker = list(
color = bincolor, # specify bincolor, default is "#FFA472"
line = list(color = "#222", width = 1.5)
)
) |>
layout(
margin = list(t = 60), # margin top
# add annotation of sample size
annotations = list(
list(
text = paste("Sample Size:", n),
x = 0.5, # x position of annotation
y = -0.25, # y position of annotation
xanchor = "center", # anchor of annotation for x
yanchor = "bottom", # anchor of annotation for y
xref = "paper", # position relative to current subplot
yref = "paper",
showarrow = FALSE # remove arrow of annotation
)
),
# axis limits
xaxis = list(
range = c(xmin, xmax), # x axis range
dtick = xdtick # specify dtick, default is 1
),
yaxis = list(range = c(0, ymax)) # specify ymax, default is 1
)
}

In terms of the average media usage hours variable, a 2×2 histogram matrix (Figure 13) of sample mean distributions for different sample sizes (n = 10, 30, 50, 100) was plotted with identical axis scales to show the trend of the distribution shape. Meanwhile, Output 15 showed the population and sample statistics regarding usage hours.
As the results showed, as the sample size increased, the distribution of sample means became more concentrated, with smaller standard deviations. Meanwhile, all four means of the sample means were very close to the population mean, predicting it successfully, and the observed standard errors (standard deviations of the sample means) were approximately equal to the theoretical standard errors derived from the population standard deviation.
## CLT - Usage Hours
## Draw random samples and plot
# set random seed for reproducibility
set.seed(32)
# get plot information
# (iterate sizes and pass values as n, apply function)
plots <- lapply(
sizes, function(n) make_hist(
n, "Avg_Daily_Usage_Hours", bincolor = "#55D2C6", ymax = 0.6
)
)
# show subplots in 2*2 matrix
subplot(
plots,
nrows = 2, # number of rows in subplot matrix
margin = 0.08 # margin between subplots
) |>
layout(
showlegend = FALSE, # remove legend
margin = list(
t = 80, # margin top
b = 60 # margin bottom
),
title = list(
y = 0.95,
text = "Usage Hours Sample Means Probability Distributions"
)
)
## CLT - Usage Hours
## Store sample means
# create list to store sample means of usage in different sizes
sample_means_usage <- lapply(
sizes, # iterate sample sizes
function(n) { # n: sample sizes passed from sizes
replicate(
1000, # repetition times
mean(
sample(df[["Avg_Daily_Usage_Hours"]], size = n, replace = TRUE),
na.rm = TRUE)
)
}
)
## Store means & SDs of sample means
# initialize empty vector
mean_sample_means_usage <- numeric(length(sizes))
# find means of usage sample means
for (i in seq_along(sizes)) {
mean_sample_means_usage[i] <- round(
mean(sample_means_usage[[i]], na.rm = TRUE),
digits = 2 # round to 2 digits
)
}
# initialize empty vector
sd_sample_means_usage <- numeric(length(sizes))
# find sds of usage sample means
for (i in seq_along(sizes)) {
sd_sample_means_usage[i] <- round(
sd(sample_means_usage[[i]], na.rm = TRUE),
digits = 2 # round to 2 digits
)
}
## Population mean & SD
pop_mean_usage <- mean(df$Avg_Daily_Usage_Hours, na.rm = TRUE)
pop_sd_usage <- sd(df$Avg_Daily_Usage_Hours, na.rm = TRUE)
## Print results
clt_results_usage <- data.frame(
Sample_Size = sizes,
Predicted_Mean = mean_sample_means_usage,
Observed_SE = sd_sample_means_usage,
Theoretical_SE = round(pop_sd_usage / sqrt(sizes), 2)
)
print(clt_results_usage)
cat("\nPopulation Mean:", round(pop_mean_usage, 2),
"\nPopulation SD:", round(pop_sd_usage, 2), "\n"
)
 Sample_Size Predicted_Mean Observed_SE Theoretical_SE
1 10 4.92 0.40 0.40
2 30 4.92 0.23 0.23
3 50 4.93 0.18 0.18
4 100 4.92 0.12 0.13
Population Mean: 4.92
Population SD: 1.26
Output 15. Comparison of Statistics Regarding Media Usage Hours
In addition, Figure 14 displayed one of the sample mean distributions (n = 100, repeated 1000 times) with an ideal normal curve overlaid on it, calculated using the function dnorm(). The results showed a close overlap between the sample means histogram and the normal curve, in line with the CLT.
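A note on the scaling used in the overlay code below: dnorm() returns a density (the area under the curve is 1), while histnorm = "probability" makes bar heights per-bin probabilities (the heights sum to 1). Multiplying the density by the bin width (0.05) puts both on the same scale. A small sketch of the idea, using an illustrative SE of 0.12 (the observed SE for n = 100 in Output 15):

```r
# density * binwidth ≈ probability mass of one histogram bin
binwidth <- 0.05
se <- 0.12                           # illustrative SE for n = 100 (Output 15)
peak <- dnorm(0, mean = 0, sd = se)  # density at the center of the curve
peak * binwidth                      # ≈ 0.166, near the tallest bar's height
```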
## CLT - Usage Hours
## Ideal normal distribution data
x_values <- seq(
min(tail(sample_means_usage, 1)[[1]]), # the last set of means, n = 100
max(tail(sample_means_usage, 1)[[1]]),
length.out = 100 # sequence length = 100
)
y_values <- dnorm(
x_values,
mean = mean(tail(sample_means_usage, 1)[[1]]),
sd = sd(tail(sample_means_usage, 1)[[1]])
)
## Sample Mean Distribution v.s. Normal Curve
plot_ly() |>
# histogram
add_histogram(
x = tail(sample_means_usage, 1)[[1]],
type = "histogram",
histnorm = "probability",
xbins = list(size = 0.05), # binwidth
name = "Sample Means", # histogram label
# aesthetics
marker = list(
color = "#55D2C6", # filling color
line = list(color = "#222", width = 1.5), # outline
opacity = 0.6
)
) |>
# normal curve
add_lines(
x = x_values,
y = y_values * 0.05,
name = "Normal Curve", # line label
line = list(color = '#D53132', width = 2)
) |>
layout(
legend = list(
orientation = "h", # arrangement of legend labels
xanchor = "left", yanchor = "top", # anchor point of the legend
x = 0, y = 1.1 # position of the legend
),
margin = list(t = 90), # margin top
title = list(
y = 0.95, # y position of title
text = "Usage Hours Sample Means Probability Distribution"
),
xaxis = list(
title = "Sample Means of Daily Usage Hours (n = 100)",
range = c(4.38, 5.4)
),
yaxis = list(
title = "Probability",
range = c(0, 0.18)
)
)
Likewise, the same approach was applied to another numerical variable, sleep hours. The result patterns shown in Figure 15, Output 15, and Output 16 closely resembled those for media usage hours, again demonstrating the applicability of the CLT.
## CLT - Sleep Hours
# set random seed for reproducibility
set.seed(32)
# get plot information
# (iterate sizes and pass values as n, apply function)
plots <- lapply(
sizes, function(n) make_hist(
n, "Sleep_Hours_Per_Night", bincolor = "#FFABB6", ymax = 0.6
)
)
# show subplots in 2*2 matrix
subplot(
plots,
nrows = 2,
margin = 0.08
) |>
layout(
showlegend = FALSE, # remove legend
margin = list(
t = 80, # margin top
b = 60 # margin bottom
),
title = list(
y = 0.95,
text = "Sleep Hours Sample Means Probability Distributions"
)
)
## CLT - Sleep Hours
## Store sample means
# create list to store sample means of sleep in different sizes
sample_means_sleep <- lapply(
sizes, # iterate sample sizes
function(n) { # n: sample sizes passed from sizes
replicate(
1000, # repetition times
mean(
sample(df[["Sleep_Hours_Per_Night"]], size = n, replace = TRUE),
na.rm = TRUE)
)
}
)
## Store means & sds of sample means
# initialize empty vector
mean_sample_means_sleep <- numeric(length(sizes))
# find means of sleep sample means
for (i in seq_along(sizes)) {
mean_sample_means_sleep[i] <- round(
mean(sample_means_sleep[[i]], na.rm = TRUE),
digits = 2 # round to 2 digits
)
}
# initialize empty vector
sd_sample_means_sleep <- numeric(length(sizes))
# find sds of sleep sample means
for (i in seq_along(sizes)) {
sd_sample_means_sleep[i] <- round(
sd(sample_means_sleep[[i]], na.rm = TRUE),
digits = 2 # round to 2 digits
)
}
## Population mean & sd
pop_mean_sleep <- mean(df$Sleep_Hours_Per_Night, na.rm = TRUE)
pop_sd_sleep <- sd(df$Sleep_Hours_Per_Night, na.rm = TRUE)
## Print results
clt_results_sleep <- data.frame(
Sample_Size = sizes,
Predicted_Mean = mean_sample_means_sleep,
Observed_SE = sd_sample_means_sleep,
Theoretical_SE = round(pop_sd_sleep / sqrt(sizes), 2)
)
print(clt_results_sleep)
cat("\nPopulation Mean:", round(pop_mean_sleep, 2),
"\nPopulation SD:", round(pop_sd_sleep, 2), "\n"
)
 Sample_Size Predicted_Mean Observed_SE Theoretical_SE
1 10 6.87 0.36 0.36
2 30 6.87 0.21 0.21
3 50 6.86 0.16 0.16
4 100 6.87 0.11 0.11
Population Mean: 6.87
Population SD: 1.13
Output 16. Comparison of Statistics Regarding Sleep Hours
## CLT - Sleep Hours
## Ideal normal distribution data
x_values <- seq(
min(tail(sample_means_sleep, 1)[[1]]),
max(tail(sample_means_sleep, 1)[[1]]),
length.out = 100 # select 100 numbers
)
y_values <- dnorm(
x_values,
mean = mean(tail(sample_means_sleep, 1)[[1]]),
sd = sd(tail(sample_means_sleep, 1)[[1]])
)
## Sample Mean Distribution v.s. Normal Curve
plot_ly() |>
# histogram
add_histogram(
x = tail(sample_means_sleep, 1)[[1]],
type = "histogram",
histnorm = "probability",
xbins = list(size = 0.05), # binwidth
name = "Sample Means", # histogram label
# aesthetics
marker = list(
color = "#FFABB6", # filling color
line = list(color = "#222", width = 1.5), # outline
opacity = 0.6
)
) |>
# normal curve
add_lines(
x = x_values,
y = y_values * 0.05,
name = "Normal Curve", # line label
line = list(color = '#D53132', width = 2)
) |>
layout(
legend = list(
orientation = "h", # arrangement of legend labels
xanchor = "left", yanchor = "top", # anchor point of the legend
x = 0, y = 1.1 # position of the legend
),
margin = list(t = 90), # margin top
title = list(
y = 0.95, # y position of title
text = "Sleep Hours Sample Means Probability Distribution"
),
xaxis = list(title = "Sample Means of Sleep Hours (n = 100)"),
yaxis = list(title = "Probability")
)
In this section, the top 5 platform data was used as the population dataset, and three samples were drawn using different sampling methods: simple random sampling with replacement, systematic sampling with unequal probabilities, and stratified sampling without replacement. The three methods were then compared and contrasted to assess how well each reproduced the population patterns.
Outputs 17, 18, and 19 showed the results of the different samples: the proportions of the platforms, and the mean and standard deviation of media usage hours.
Simple random sampling with replacement was conducted using the function srswr(), and the sampled rows were stored in sample1.
## Simple Random Sampling with Replacement
## Sampling
# sample size & random seed
n <- 100
set.seed(32)
# simple random sampling
s <- srswr(n, nrow(top5_platforms_data))
# find rows where observations are sampled
rows <- (1:nrow(top5_platforms_data))[s!=0]
rows <- rep(rows, s[s!=0])
# extract sampled data
sample1 <- top5_platforms_data[rows, ]
## Calculating
# define function to calculate percentages
calc_perc <- function(sample_data, n = 100) {
## sample_data: sample data drawn with different methods
## n: sample size
# frequency
f <- table(
factor(
sample_data$Most_Used_Platform,
levels = top5_platforms # to ensure the same platform sequence
)
)
# percentage
return(f / n) # n: sample size
}
# percentages of platforms
perc1 <- calc_perc(sample1)
# sample mean & SD of usage hour
mean1 <- mean(sample1$Avg_Daily_Usage_Hours, na.rm = TRUE)
sd1 <- sd(sample1$Avg_Daily_Usage_Hours, na.rm = TRUE)
# print results
cat("Percentages of Platforms(%):\n"); perc1 * 100; cat("\n")
cat("Sample Mean of Usage Hour:", mean1, "\n")
cat("Sample SD of Usage Hour:", round(sd1, 3), "\n")
Percentages of Platforms(%):
Instagram TikTok Facebook WhatsApp Twitter
44 27 13 11 5
Sample Mean of Usage Hour: 5.035
Sample SD of Usage Hour: 1.262
Output 17. Results of Simple Random Sampling with Replacement
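srswr(n, N) returns a length-N vector of draw counts (how many times each population unit was selected), which is why the code above uses rep() to duplicate rows drawn more than once. A toy sketch of the mechanics, with made-up sizes (N = 6, n = 4):

```r
library(sampling)

set.seed(1)
N <- 6; n <- 4
s <- srswr(n, N)                       # draw counts per unit; sum(s) == n
rows <- rep((1:N)[s != 0], s[s != 0])  # expand to row indices, with duplicates
length(rows)                           # 4: one index per draw
```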
For systematic sampling, the inclusion probabilities based on usage hours were first calculated using the function inclusionprobabilities() and stored in pik; the sample was then drawn using UPsystematic(pik) and stored in sample2.
## Systematic Sampling with Unequal Probabilities
## Sampling
# random seed
set.seed(32)
# calculate inclusion probabilities by usage hours
pik <- inclusionprobabilities(top5_platforms_data$Avg_Daily_Usage_Hours, n)
# systematic sampling with unequal probabilities
s <- UPsystematic(pik)
# extract sampled data
sample2 <- top5_platforms_data[s != 0, ]
## Calculating
# percentages of platforms
perc2 <- calc_perc(sample2)
# sample mean & SD of usage hour
mean2 <- mean(sample2$Avg_Daily_Usage_Hours, na.rm = TRUE)
sd2 <- sd(sample2$Avg_Daily_Usage_Hours, na.rm = TRUE)
# print results
cat("Percentages of Platforms(%):\n"); perc2 * 100; cat("\n")
cat("Sample Mean of Usage Hour:", mean2, "\n")
cat("Sample SD of Usage Hour:", round(sd2, 3), "\n")
Percentages of Platforms(%):
Instagram TikTok Facebook WhatsApp Twitter
41 28 18 9 4
Sample Mean of Usage Hour: 5.332
Sample SD of Usage Hour: 1.267
Output 18. Results of Systematic Sampling with Unequal Probabilities
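inclusionprobabilities(x, n) scales the size measure x so that each unit's probability is proportional to x and the probabilities sum to n (capping any value that would exceed 1). A toy check with made-up size values:

```r
library(sampling)

set.seed(1)
x <- c(2, 4, 6, 8)                  # toy size measure (e.g. usage hours)
pik <- inclusionprobabilities(x, 2)
sum(pik)                            # 2: always sums to the sample size
unique(round(pik / x, 10))          # 0.1: pik is proportional to x
sum(UPsystematic(pik))              # 2: exactly n units selected
```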
For stratified sampling, the population data were first ordered by platform using the order() function, and the sample was drawn using the strata() function and stored in sample3, with stratum sizes allocated according to each platform's proportion.
## Stratified Sampling without Replacement
## Ordering Data
# order data by platform
order_index <- order(top5_platforms_data$Most_Used_Platform)
data <- top5_platforms_data[order_index, ] # use ordered data
# calculate proportion
prop_table <- table(data$Most_Used_Platform) / nrow(data)
# find strata sizes
sizes <- round(prop_table * n)
# check if sum of sizes = n
# sum(sizes) # [1] 100
## Stratified Sampling
# random seed
set.seed(32)
# find stratums
st <- sampling::strata(
data,
stratanames = "Most_Used_Platform",
size = sizes,
method = "srswor" # simple random sampling without replacement
)
# extract sampled data
sample3 <- sampling::getdata(data, st)
## Calculating
# percentages of platforms
perc3 <- calc_perc(sample3)
# sample mean & SD of usage hour
mean3 <- mean(sample3$Avg_Daily_Usage_Hours, na.rm = TRUE)
sd3 <- sd(sample3$Avg_Daily_Usage_Hours, na.rm = TRUE)
# print results
cat("Percentages of Platforms(%):\n"); perc3 * 100; cat("\n")
cat("Sample Mean of Usage Hour:", mean3, "\n")
cat("Sample SD of Usage Hour:", round(sd3, 3), "\n")
Percentages of Platforms(%):
Instagram TikTok Facebook WhatsApp Twitter
41 25 20 9 5
Sample Mean of Usage Hour: 4.948
Sample SD of Usage Hour: 1.134
Output 19. Stratified Sampling without Replacement
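The allocation used above is proportional: each stratum's sample size is its population share times n. Because of round(), the sizes can sum to n ± 1, which is why the code checks sum(sizes). A sketch using the population shares reported in Output 20:

```r
# proportional allocation; round() can break sum(sizes) == n, so verify it
prop  <- c(Instagram = 0.41, TikTok = 0.25, Facebook = 0.20,
           WhatsApp = 0.09, Twitter = 0.05)   # population shares (Output 20)
sizes <- round(prop * 100)
sum(sizes)                                    # 100 here, but not guaranteed
```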
Output 20 presented the proportions of platforms in the population (the top 5 platform data) and in the three samples. To facilitate comparison, the mean absolute difference between each sample's proportions and the corresponding population proportions was calculated and shown in Output 21. Furthermore, Figure 17 visualized the results in Output 21.
## Comparison of Sampling Methods
## Percentages of Platforms
# store sample percentages
sample_percs <- list(srswr = perc1, ups = perc2, stwor = perc3)
# calculate population percentages
pop_perc <- table(
factor(
top5_platforms_data$Most_Used_Platform,
levels = top5_platforms
)
) / nrow(top5_platforms_data)
# show results of percentages
cat("Population Percentages:\n")
round(pop_perc, 2); cat("\n\n")
cat("Sample Percentages:\n\n")
sample_percs
Population Percentages:
Instagram TikTok Facebook WhatsApp Twitter
0.41 0.25 0.20 0.09 0.05
Sample Percentages:
$srswr
Instagram TikTok Facebook WhatsApp Twitter
0.44 0.27 0.13 0.11 0.05
$ups
Instagram TikTok Facebook WhatsApp Twitter
0.41 0.28 0.18 0.09 0.04
$stwor
Instagram TikTok Facebook WhatsApp Twitter
0.41 0.25 0.20 0.09 0.05
Output 20. Population and Sample Percentages of Different Platforms
As shown in Output 21 and Figure 17, stratified sampling without replacement showed the smallest mean difference from the population values (0.002), indicating the strongest predictive power for platform proportions. In contrast, simple random sampling with replacement showed the lowest predictive power (a mean difference of 0.029). Stratified sampling without replacement may have performed best because it preserved the population's platform distribution within each stratum, reducing sampling variability, whereas simple random sampling left the platform proportions entirely to chance.
## Comparison of Sampling Methods
## Percentages of Platforms
# find mean differences between
# sample percentages and population percentages
mean_diffs <- sample_percs |>
# calculate differences
lapply(\(percs) abs(percs - pop_perc)) |>
# lapply: list apply, return a list
# where \() is anonymous function, same as function(percs)
# it passes values from sample_percs as percs and apply function
# to them, then returns a list
# calculate means of differences
sapply(mean, na.rm = TRUE)
# sapply: simple apply, flexible at data types of returns.
# it returns a named vector here
# show results
cat("Differences from Population Percentages:\n")
sort(round(mean_diffs, 3)); cat("\n")
Differences from Population Percentages:
stwor ups srswr
0.002 0.012 0.029
Output 21. Differences from Population Percentage by Sampling Method
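The lapply/sapply pipeline above can be traced on toy data: lapply() yields one vector of absolute differences per method, and sapply() collapses each vector to its mean, returning a named vector (the shares below are made up for illustration):

```r
pop     <- c(a = 0.5, b = 0.3, c = 0.2)              # toy population shares
samples <- list(m1 = c(a = 0.52, b = 0.28, c = 0.20),
                m2 = c(a = 0.40, b = 0.35, c = 0.25))
diffs <- samples |>
  lapply(\(p) abs(p - pop)) |>   # per-category absolute differences
  sapply(mean)                   # mean difference per method
round(diffs, 3)                  # m1 = 0.013, m2 = 0.067
```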
## Comparison of Sampling Methods
## Percentages of Platforms
# barplot of differences from population percentages
plot_ly(
x = toupper(names(mean_diffs)), # capitalized method names
y = mean_diffs,
type = "bar", # barplot
# aesthetics
marker = list(
color = "#c3a4cf", # filling color
line = list(color = "#222", width = 1.5) # outline color & width
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Differences from Population Percentages by Sampling Method"
),
xaxis = list(
title = "Sampling Methods",
type = "category", # categorical axis
categoryorder = "array", # let order be specified
categoryarray = names(sort(mean_diffs)), # specify category order
range = c(-0.5, 2.5) # x limits (3 categories indexed 0-2)
),
yaxis = list(title = "Difference")
)
In addition to predicting the proportions of categories, the ability of the three sampling methods to predict the mean and standard deviation of a numerical variable was also tested. Output 22 showed the resulting means and standard deviations, and a clearer visualization was given in Figure 18.
## Comparison of Sampling Methods
## Means & SDs of Usage Hours
# store sample means & sds
sample_means <- c(srswr = mean1, ups = mean2, stwor = mean3)
sample_sds <- c(srswr = sd1, ups = sd2, stwor = sd3)
# show results of means
cat("Population Mean:", round(pop_mean_usage, 3), "\n")
cat("Sample Means:\n"); sample_means; cat("\n")
# show results of SDs
cat("Population SD:", round(pop_sd_usage, 3), "\n")
cat("Sample SDs:\n"); round(sample_sds, 3)
Population Mean: 4.919
Sample Means:
srswr ups stwor
5.035 5.332 4.948
Population SD: 1.257
Sample SDs:
srswr ups stwor
1.262 1.267 1.134
Output 22. Population and Sample Statistics of Media Usage Hours
In Figure 18, the dots at the centers of the sticks represented the sample means, and the black whiskers spanned one sample standard deviation on either side of each mean; the population mean and ±1 population SD were denoted by dashed lines with text labels.
Combining Output 22 and Figure 18, it could be seen that, for the mean of the numerical variable, stratified sampling without replacement produced the closest estimate and thus the highest predictive power, whereas systematic sampling with unequal probabilities had the lowest.
However, in terms of standard deviation, simple random sampling with replacement had the most accurate prediction whereas stratified sampling without replacement had the least accurate prediction.
These discrepancies in predictive power likely stemmed from the inherent design of each sampling method. Stratified sampling without replacement excelled at estimating means by ensuring all subgroups were proportionally represented and reduced variance by eliminating duplicate data points, whereas systematic sampling with unequal probabilities may introduce bias if the selection interval aligns with hidden patterns in the data.
Conversely, simple random sampling with replacement provided a more accurate estimation of standard deviation possibly because the replacement mechanism allowed the sample to better reflect the true variability and noise of the underlying population, while the structured nature of stratification tended to compress internal variance, leading to an underestimation of the overall spread.
## Comparison of Sampling Methods
## Means & SDs of Usage Hours
# store methods, sample means & SDs into dataframe
plot_data <- data.frame(
Method = toupper(names(sample_means)), # capitalized method names
Mean = sample_means,
SD = sample_sds
)
# stick plot
plot_ly(
plot_data,
x = ~Method,
y = ~Mean
) |>
# add sticks
add_markers(
name = "Sample Mean", # stick label
# error stick (sample SDs)
error_y = list(
type = "data", # take the data specified in array as stick length
array = ~SD, # data to use for stick length
color = '#222' # stick color
),
# aesthetics of marker (sample means)
marker = list(size = 12, color = '#55D2C6')
) |>
layout(
margin = list(t = 60, b = 60), # margin top & bottom
title = list(
y = 0.95, # y position of title
text = "Comparison of Means & SDs across Sampling Methods"
),
yaxis = list(
title = "Usage Hours",
range = c( # y limits
pop_mean_usage - 1.5*pop_sd_usage,
pop_mean_usage + 1.5*pop_sd_usage
)
),
xaxis = list(
title = "Sampling Method",
range = c(-0.5, 3), # x limits
type = "category", # categorical axis
categoryorder = "array", # let order be specified
categoryarray = plot_data$Method # specify category order
),
# add lines of population statistics
shapes = list(
# population mean
list(
type = "line", # add line
xref = "x", # x values are relative to x-axis (0,1,2)
x0 = -0.4, x1 = 2.4, # xrange of line
y0 = pop_mean_usage, # y of line
y1 = pop_mean_usage,
line = list(
color = "#D53132",
dash = "dash", # line type
width = 2
)
),
# pop_mean + pop_sd
list(
type = "line",
xref = "x",
x0 = -0.4, x1 = 2.4,
y0 = pop_mean_usage + pop_sd_usage,
y1 = pop_mean_usage + pop_sd_usage,
line = list(color = "#FFA472", dash = "dot", width = 2)
),
# pop_mean - pop_sd
list(
type = "line",
xref = "x",
x0 = -0.4, x1 = 2.4,
y0 = pop_mean_usage - pop_sd_usage,
y1 = pop_mean_usage - pop_sd_usage,
line = list(color = "#FFA472", dash = "dot", width = 2)
)
),
# annotations for lines
annotations = list(
# population mean
list(
text = "Population Mean\n(4.919)",
x = 2.5,
y = pop_mean_usage,
xanchor = "left",
showarrow = FALSE
),
# pop_mean + pop_sd
list(
text = "+1 Population SD\n(+1.257)",
x = 2.5,
y = pop_mean_usage + pop_sd_usage,
xanchor = "left",
showarrow = FALSE
),
list(
text = "-1 Population SD\n(-1.257)",
x = 2.5,
y = pop_mean_usage - pop_sd_usage,
xanchor = "left",
showarrow = FALSE
)
)
)
---
title: 'Student Social Media Usage and Well-being Analysis'
author: 'Pinn Xu'
date: '2026-02-23'
categories: [R, EDA, Social Media, Mental Health, Visualization]
description: 'Analysis of social media habits and addiction patterns among students, with behavioral and academic insights.'
## All below for Rmd
# output:
# html_document:
# code_folding: hide
# highlight: "pygments"
# theme: "sandstone"
# toc: true
# toc_float:
# collapsed: true
---
```{r eval=FALSE, include=TRUE}
#| code-summary: 'Environment & Versions'
R==4.5.2
plotly==4.12.0
sampling==2.11
tidyverse==2.0.0
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(
fig.width = 8,
fig.height = 5,
# fig.align = 'left', # only suitable for static image
# fig.asp=0.8, # only suitable for static image
out.width = "95%", # let fig width occupy 95% of parent element's width
results='hold',
echo=TRUE,
warning=FALSE,
message=FALSE
)
```
# Introduction
## Environment and Package Version
This project was conducted in R v4.5.2 using the *plotly* (v4.12.0) and *sampling* packages, and the package collection *tidyverse* (v2.0.0), with packages such as *stringr* (v1.6.0), *tibble* (v3.3.1), and *dplyr* (v1.1.4) included.
```{r eval=FALSE, results='hold'}
# view versions
getRversion() # [1] ‘4.5.2’
packageVersion("plotly") # [1] ‘4.12.0’
packageVersion("sampling") # [1] ‘2.11’
packageVersion("tidyverse") # [1] ‘2.0.0’
packageVersion("stringr") # [1] ‘1.6.0’
packageVersion("tibble") # [1] ‘3.3.1’
packageVersion("dplyr") # [1] ‘1.1.4’
```
## Dataset Overview
The dataset named *Student Social Media & Relationships* from [Kaggle](https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships) was used for this analysis. It provided anonymous cross-sectional survey data from students aged 16-25 with different educational backgrounds across multiple countries, capturing key dimensions such as social media usage intensity, platform preferences, and relationship dynamics, and allowing researchers to examine how digital behaviors relate to personal and academic outcomes. Each observation represented an individual student's response, and the sample size (N = 705) was sufficient for statistical analysis.
In terms of the variables, there was one identifier variable `Student_ID`, giving each observation a unique ID. The nominal variables `Gender`, `Country`, `Most_Used_Platform`, and `Relationship_Status` had non-ordinal levels, while `Academic_Level` and `Affects_Academic_Performance` (Boolean) had ordinal or Boolean levels. Numerical variables included `Age`, `Avg_Daily_Usage_Hours`, `Sleep_Hours_Per_Night`, `Mental_Health_Score`, `Conflicts_Over_Social_Media`, and `Addicted_Score`, where the mental health and addicted scores were self-rated on a scale from 1 to 10.
## Goal of Analysis
This project aimed to investigate the complex correlations between students' social media patterns (e.g. average usage hours, platform preference), contextual and demographic factors (e.g. academic level, country, age, gender), and student well-being and performance (e.g. average sleep hours, mental health, addiction).
The suitability of the selected dataset stemmed from its robust sample size (N = 705) and multidimensional structure with different types of variables, which provided statistical power to draw reliable insights on student behavior.
The analytical phase of this project was structured into three primary stages. Firstly, descriptive analyses were conducted to establish a basic understanding of the sample's demographics and social media usage patterns. Secondly, the analysis examined correlations between indices of students' digital habits and well-being, followed by an application of the Central Limit Theorem (CLT) to validate the reliability of statistical inferences. Finally, it compared and evaluated the efficiency and precision of three different sampling methods: simple random, systematic, and stratified sampling.
Regarding the visualization process, the project mainly used data wrangling methods from packages in `tidyverse` to manipulate data before plotting, including the pipe operator `|>` and functions such as `group_by()` and `summarize()` from the *dplyr* package; tools from the *plotly* package were then used to draw the plots.
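The group-then-summarize pattern can be sketched on a toy tibble (the column names mirror the dataset described above; the data here is made up for illustration):

```r
library(dplyr)

# toy stand-in for the real dataset, with the same column names
toy <- tibble(
  Gender = c("Female", "Male", "Female", "Male"),
  Avg_Daily_Usage_Hours = c(5, 4, 6, NA)
)
# one summary row per group; na.rm = TRUE skips missing values
toy |>
  group_by(Gender) |>
  summarize(mean_usage = mean(Avg_Daily_Usage_Hours, na.rm = TRUE))
# Female 5.5, Male 4
```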
Furthermore, to improve the accessibility of visual storytelling, this project adopted colorblind-friendly palettes generated using the [online tool](https://davidmathlogic.com/colorblind/) developed by David Nichols.
# Data Preparation
## Importing Dataset
The dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships) and saved to a local directory. It was then imported into R as `df0` using the built-in `read.csv()` function with a relative file path. The structure of `df0` was shown in Output 1 below.
```{r}
# import dataset
df0 <- read.csv("./student-social-media-addiction.csv")
str(df0) # view dataframe structure
```
*Output 1. Structure of the Dataframe*
## Preprocessing
To facilitate further investigation and visualization of the data, a few processing steps were conducted in advance.
Firstly, the relevant packages were loaded, and the original dataset `df0` was converted to a tibble `df`, preserving the original dataset unchanged. Next, the names of the categorical and numerical variables used in subsequent analyses were stored in separate vectors (`names_catvar` and `names_numvar`).
```{r}
# attach packages
library(tidyverse)
library(plotly)
library(sampling)
# create a tibble
df <- tibble(df0) # store tibble in df, let df0 remain what it is
# store names of categorical variables to use
names_catvar <- c(
"Gender", "Country", "Most_Used_Platform", "Relationship_Status",
"Academic_Level", "Affects_Academic_Performance"
)
# store names of numerical variables to use
names_numvar <- c(
"Age", "Avg_Daily_Usage_Hours", "Sleep_Hours_Per_Night",
"Mental_Health_Score", "Conflicts_Over_Social_Media", "Addicted_Score"
)
```
## Missing Data
An initial inspection for missing values was conducted using `sum(is.na(df))`, and no explicit missing data were identified in the dataset (shown in Output 2).
```{r}
# check the number of missing values
sum(is.na(df))
```
*Output 2. Number of NA Values*
## Categorical Data
For categorical variables, their numbers of categories (see Output 3) and counts of observations in each category (see Output 4) were first examined. The results indicated that most categorical variables contained a moderate number of categories, whereas *country* and *most-used platform* variables had relatively larger numbers of categories and therefore required further processing.
```{r}
## All Categorical Variables
# check the number of categories of each categorical variable
for (name in names_catvar) {
print(paste(name, length(unique(df[[name]]))))
}
```
*Output 3. Numbers of Unique Values in Categorical Variables*
```{r}
## All Categorical Variables Except Country & Platform
# check the count of observations in each category
# (except the country and most-used platform variables)
for (name in setdiff(names_catvar, c("Country", "Most_Used_Platform"))) {
print(table(df[name]))
cat("\n")
}
```
*Output 4. Numbers of Observations of Each Value in Categorical Variables*
The five countries and five platforms with the most observations were thereby selected, and their names stored in the vectors `top5_countries` and `top5_platforms`. The results were displayed in Output 5.
```{r results='hold'}
## Country & Platform
# check counts of observation of top 5 countries
head(sort(table(df["Country"]), decreasing = TRUE), 5)
cat("\n")
# check counts of observation of top 5 platforms
head(sort(table(df["Most_Used_Platform"]), decreasing = TRUE), 5)
```
*Output 5. Numbers of Observations in Each Country and Platform*
```{r}
## Country & Platform
# store top 5 countries
top5_countries <- names(
head(sort(table(df$Country), decreasing = TRUE), 5)
)
# store top 5 platforms
top5_platforms <- names(
head(sort(table(df$Most_Used_Platform), decreasing = TRUE), 5)
)
```
In addition, the `Affects_Academic_Performance` variable was converted from character to logical ("Yes" = `TRUE`, "No" = `FALSE`) as shown in Output 6.
```{r}
## Boolean Variable
# convert variable from character to logical
df$Affects_Academic_Performance <- df$Affects_Academic_Performance == "Yes"
# As the variable contains only "Yes" and "No" responses,
# no further data cleaning was needed.
# check the converted variable
str(df$Affects_Academic_Performance)
```
*Output 6. Boolean Variable*
## Numerical Data
Descriptive statistics for the numerical variables were obtained using the `summary()` function in Output 7. The results indicated that the variables fell within reasonable ranges, with no apparent extreme or abnormal values, so no further preprocessing was deemed necessary at this stage.
```{r}
## All Numerical Variables
# view descriptive summaries of numerical variables
for (name in names_numvar) {
print(summary(df[name]))
cat("\n")
}
```
*Output 7. Descriptive Summaries of Numerical Variables*
Moreover, the `Age` variable was categorized into three groups due to its limited range and variability. As shown in Output 8, treating `Age` as continuous offered little dispersion within such a narrow interval (18-24) and limited continuity (only 7 unique values), while treating the values as single-year categories would have produced several small and uneven subgroups. Categorizing into three groups therefore provided a more suitable structure for subsequent analysis.
To achieve this, a histogram was first used to view the age distribution and obtain suitable breaks for grouping. Then, the grouping information was stored in a new variable named `Age_Group` (see Output 9).
```{r results='hold'}
## Age
# check value range
range(df$Age)
# check numbers of unique values in Age
length(unique(df$Age))
```
*Output 8. Range of Age and Numbers of Unique Age*
```{r fig.width=7, fig.height=5, results='hold'}
## Age
## Store histogram data
# store histogram data with 3 bins
hist_data <- hist(
df$Age,
breaks = 4,
xlab = "Age",
col = "#FFD794",
main = "Age Distribution",
plot = FALSE # do not display plot
)
# store breaks for grouping
breaks <- hist_data$breaks
## Add new variable Age_Group to df
# store grouping information into a new variable Age_Group
df <- df |>
mutate(
Age_Group = case_when(
(breaks[1] <= Age) & (Age <= breaks[2]) ~
sprintf("%d-%d", breaks[1], breaks[2]),
(breaks[2] <= Age) & (Age <= breaks[3]) ~
sprintf("%d-%d", breaks[2], breaks[3]),
(breaks[3] <= Age) & (Age <= breaks[4]) ~
sprintf("%d-%d", breaks[3], breaks[4]),
)
)
# view new variable
table(df["Age_Group"])
```
*Output 9. Age Groups*
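An equivalent grouping can be sketched with base R's `cut()`; the breaks `c(18, 20, 22, 24)` are assumed here to match what `hist()` produced above. With the default `right = TRUE`, a boundary age such as 20 falls into "18-20", mirroring the first-match behavior of the `case_when()` chain:

```r
# base-R alternative to the case_when() grouping above
breaks <- c(18, 20, 22, 24)             # assumed output of hist() above
age <- c(18, 19, 20, 21, 22, 23, 24)
cut(age, breaks = breaks, include.lowest = TRUE,
    labels = c("18-20", "20-22", "22-24"))
# 18-20 18-20 18-20 20-22 20-22 22-24 22-24
```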
# Descriptive Analysis
This section demonstrated several core contextual and behavioral patterns of the sample.
## Demographics
### Gender
@fig-gender-dist illustrated a highly balanced gender distribution within the dataset (N = 705). The sample was almost perfectly split, with females representing 50.1% and males representing 49.9% of the participants. This balanced distribution allowed relatively unbiased gender-based comparisons of behavioral patterns.
```{r}
#| label: fig-gender-dist
#| fig-cap: "Sample Gender Distribution"
## Gender
# pie chart
plot_ly(
as.data.frame(table(df["Gender"])),
labels = ~Gender,
values = ~Freq,
type = "pie", # pie chart
textinfo = "label+percent", # display label & percent on pie slices
# plotting area for pie chart
domain = list(
x = c(0, 1), # horizontal: full height
y = c(0, 0.85) # vertical: leave 15% margin on the top
),
# aesthetics
marker = list(
colors = c("Female" = "#FFB697", "Male" = "#8fc4de"), # filling color
line = list(color = "#222", width = 1.5) # outline
)
) |>
layout(
showlegend = FALSE, # remove legend
margin = list(l = 1, r = 1), # left & right margin
title = list(
y = 0.95, # y position of title
text = "Sample Gender Distribution"
)
)
```
### Country
The bar plot (@fig-country-dist), followed by Output 10 showing each top-5 country's percentage of the whole sample, illustrated the sample's five most represented countries: India led at 7.5%, followed by the USA (5.7%) and Canada (4.8%), while Denmark and France each represented 3.8% of the total. This highlighted the geographical diversity of participants, centered primarily in South Asian and North American regions.
```{r}
#| label: fig-country-dist
#| fig-cap: "Sample Country Distribution (top 5)"
## Country
# extract top 5 country data
top5_countries_data <- df |>
filter(Country %in% top5_countries)
# barplot
plot_ly(
as.data.frame(
sort(table(top5_countries_data["Country"]), decreasing = TRUE) # sort
),
x = ~Country,
y = ~Freq,
type = "bar", # barplot
# aesthetics
marker = list(
color = "#add4a0", # filling color
line = list(color = "#222", width = 1.5) # outline color & width
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Sample Country Distribution (top 5)"
),
xaxis = list(
title = "Countries",
range = c(-0.8, 4.8) # extend x axis (0 = the 1st category)
),
yaxis = list(
title = "Frequency",
range = c(0, 60) # let y axis be exhaustive
)
)
```
```{r}
## Country
# view percentage of observations for top 5 countries
sort(table(top5_countries_data$Country), decreasing = TRUE) / nrow(df) * 100
```
*Output 10. Percentages of Top 5 Countries Relative to the Population*
### Age Group
The age group distribution (shown in @fig-age-group-dist) was skewed toward younger participants, with the 18-20 and 20-22 cohorts being the most predominant. The 22-24 group had a markedly lower frequency, indicating that the sample primarily consisted of early-stage university students.
```{r}
#| label: fig-age-group-dist
#| fig-cap: "Sample Age Group Distribution"
## Age Group
# barplot
plot_ly(
as.data.frame(
sort(table(df["Age_Group"]), decreasing = TRUE) # sort
),
x = ~Age_Group,
y = ~Freq,
type = "bar", # barplot
# aesthetics
marker = list(
color = "#FFD794", # filling color
line = list(color = "#222", width = 1.5) # outline color & width
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Sample Age Group Distribution"
),
xaxis = list(
title = "Age Groups",
range = c(-0.8, 2.8) # extend x axis (0 = the 1st category)
),
yaxis = list(
title = "Frequency",
range = c(0, 400) # let y axis be exhaustive
)
)
```
## Platform Preference
@fig-top5-platforms showed the frequency distribution of the five most popular social platforms. Instagram (249) was the most used platform among the sample, followed by TikTok (154) and Facebook (123), while WhatsApp (54) and Twitter (30) showed much lower frequencies.
```{r}
#| label: fig-top5-platforms
#| fig-cap: "Top 5 Popular Platforms"
## Platform
# extract top 5 platform data
top5_platforms_data <- df |>
filter(Most_Used_Platform %in% top5_platforms)
# barplot
plot_ly(
as.data.frame(
sort(table(top5_platforms_data["Most_Used_Platform"]), decreasing = TRUE)
),
x = ~Most_Used_Platform,
y = ~Freq,
type = "bar", # barplot
# aesthetics
marker = list(
color = "#c3a4cf", # filling color
line = list(color = "#222", width = 1.5) # outline color & width
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Top 5 Popular Platforms"
),
xaxis = list(
title = "Platforms",
range = c(-0.8, 4.8) # extend x axis (0 = the 1st category)
),
yaxis = list(title = "Frequency")
)
```
Notably, after grouping by gender (shown in @fig-top5-platforms-gender), there were clear gender differences in platform preferences. Instagram and TikTok were more popular among females, with Instagram showing the larger gender gap. In contrast, Facebook and WhatsApp showed a more male-dominated trend. Twitter, on the other hand, showed minimal gender variation. Overall, it appeared that females preferred visual platforms whereas males favored communication-oriented ones.
```{r}
#| label: fig-top5-platforms-gender
#| fig-cap: "Top 5 Popular Platforms by Gender"
## Platform ~ Gender
# obtain data
platform_gender_data <- top5_platforms_data |>
group_by(Most_Used_Platform, Gender) |>
summarize(Count = n(), .groups = 'drop') |>
arrange(desc(Count))
# barplot
plot_ly(
platform_gender_data,
x = ~factor(Most_Used_Platform, top5_platforms), # sort
y = ~Count,
type = "bar", # barplot
color = ~Gender, # map Gender to bar color
colors = c("Female" = "#FFB697", "Male" = "#8fc4de"),
# aesthetics
marker = list(
line = list(color = "#222", width = 1.5) # outline color & width
)
) |>
layout(
barmode = "group", # side-by-side bar groups
legend = list(
orientation = "h", # horizontal legend labels
xanchor = "right", yanchor = "top", # anchor points of the legend
x = 1, y = 1 # position of the legend
),
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Top 5 Popular Platforms by Gender"
),
xaxis = list(
title = "Platforms",
range = c(-0.6, 4.6) # extend x axis (0 = the 1st category)
),
yaxis = list(title = "Frequency")
)
```
## Usage Hours
The histogram of average daily media usage hours (@fig-usage-hours-dist) followed a roughly symmetrical distribution peaking between 4.5 and 5 hours, suggesting a consistently high-engagement digital lifestyle across the surveyed students.
```{r}
#| label: fig-usage-hours-dist
#| fig-cap: "Daily Usage Hours Distribution"
## Usage Hours
# histogram
plot_ly(
df,
x = ~Avg_Daily_Usage_Hours,
type = "histogram",
# aesthetics
marker = list(
color = "#55D2C6", # filling color
line = list(color = "#222", width = 1.5) # outline
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Average Daily Usage Hours Distribution"
),
xaxis = list(
title = "Average Daily Usage Hours",
range = c(0.8, 9.4), # let x axis be exhaustive
dtick = 1 # distance between ticks
),
yaxis = list(title = "Frequency")
)
```
Boxplots (@fig-usage-hours-boxplot) illustrated the distribution of average daily usage hours across the top five countries. The USA and India exhibited the highest levels of engagement, with median usage hours clearly exceeding those of the other nations. In contrast, Canada, France, and Denmark showed lower overall usage, with Denmark recording the most conservative median. In addition, a notable outlier was observed in the Indian group, representing a user with exceptionally low daily engagement compared to the national trend.
```{r fig.width=8, fig.height=6}
#| label: fig-usage-hours-boxplot
#| fig-cap: "Daily Usage Hours Boxplot by Top 5 Countries"
## Usage Hours ~ Top 5 Countries
## Sorting
# find median usage hours of each country
countries_by_usage <- top5_countries_data |>
group_by(Country) |>
# create new variable to store medians
summarize(
Median_Usage_Hour = median(Avg_Daily_Usage_Hours, na.rm = TRUE),
.groups = "drop" # drop grouping information
) |>
# sort by median descending
arrange(desc(Median_Usage_Hour)) |>
# pull out the vector of Country variable
# (country names with median usage hours descending)
pull(Country)
## Plotting
# boxplot
plot_ly(
top5_countries_data,
# make Country variable a factor by levels of
# countries with median usage hours descending
x = ~factor(Country, countries_by_usage),
y = ~Avg_Daily_Usage_Hours,
type = "box", # boxplot
# aesthetics: boundary box
fillcolor = "#55D2C6",
line = list(color = "#222", width = 1.5),
# aesthetics: outliers
marker = list(
color = "#55D2C6", # filling color
line = list(color = "#222", width = 1.5) # outline
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Average Daily Usage Hours Distribution by Top 5 Countries"
),
xaxis = list(
title = "Top 5 Countries",
range = c(-0.6, 4.6) # extend x axis (0 = the 1st category)
),
yaxis = list(
title = "Average Daily Usage Hours",
range = c(1, 10), # let y axis be exhaustive
dtick = 1 # distance between ticks
)
)
```
## Sleep Hours
Meanwhile, the distribution of sleep hours was also investigated. @fig-sleep-hours-dist revealed a roughly symmetrical distribution centered around a peak of approximately 7 to 7.5 hours of sleep per night. While the majority of students fell within a healthier range of 6 to 8 hours, the distribution exhibited a broad spread, with some individuals reporting as little as 4 hours and others reaching nearly 10 hours. This variation suggested a diverse range of sleeping habits within the student population.
```{r}
#| label: fig-sleep-hours-dist
#| fig-cap: "Average Sleep Hours Distribution"
## Sleep Hours
# histogram
plot_ly(
df,
x = ~Sleep_Hours_Per_Night,
type = "histogram",
# aesthetics
marker = list(
color = "#FFABB6", # filling color
line = list(color = "#222", width = 1.5) # outline
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Average Sleep Hours Distribution"
),
xaxis = list(
title = "Average Sleep Hours",
range = c(2.8, 11), # let x axis be exhaustive
dtick = 1 # distance between ticks
),
yaxis = list(
title = "Frequency",
range = c(0, 60) # let y axis be exhaustive
)
)
```
Furthermore, looking into the distribution of average sleep hours by the top five countries, @fig-sleep-hours-boxplot showed a marked inverse trend relative to the previously analyzed usage hours distribution (@fig-usage-hours-boxplot). While the USA and India presented the highest social media engagement, they recorded the lowest median sleep durations, both falling below 6 hours per night. Conversely, Denmark and France, which showed the most conservative usage patterns, enjoyed the highest median sleep hours, with Denmark leading at approximately 8.2 hours. In addition, India displayed a number of upper outliers, indicating a group of students who maintained high sleep levels despite the regional trend. These geographical variations suggested a negative correlation between digital engagement and sleep, providing a baseline for the later analysis of relationships between behavioral patterns.
```{r fig.width=8, fig.height=6}
#| label: fig-sleep-hours-boxplot
#| fig-cap: "Average Sleep Hours Distribution by Top 5 Countries"
## Sleep Hours ~ Top 5 Countries
## Sorting
# find median sleep hours of each country
countries_by_sleep <- top5_countries_data |>
group_by(Country) |>
# create new variable to store medians
summarize(
Median_Sleep_Hour = median(Sleep_Hours_Per_Night, na.rm = TRUE),
.groups = "drop" # drop grouping information
) |>
# sort by median ascending
arrange(Median_Sleep_Hour) |>
# pull out the vector of Country variable
# (country names with median sleep hours ascending)
pull(Country)
## Plotting
# boxplot
plot_ly(
top5_countries_data,
# make Country variable a factor by levels of
# countries with sleep hours ascending
x = ~factor(Country, countries_by_sleep),
y = ~Sleep_Hours_Per_Night,
type = "box", # boxplot
# aesthetics: boundary box
fillcolor = "#FFABB6",
line = list(color = "#222", width = 1.5),
# aesthetics: outliers
marker = list(
color = "#FFABB6", # filling color
line = list(color = "#222", width = 1.5) # outline
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Average Sleep Hours Distribution by Top 5 Countries"
),
xaxis = list(
title = "Top 5 Countries",
range = c(-0.6, 4.6) # extend x axis (0 = the 1st category)
),
yaxis = list(
title = "Average Sleep Hours",
range = c(3, 10), # let y axis be exhaustive
dtick = 1 # distance between ticks
)
)
```
# Correlation Analysis
## Sleep Hours and Usage Hours
The scatter plot (@fig-usage-sleep-scatter) demonstrated a strong negative correlation between daily media usage and sleep duration. As usage rose from 2 to 8 hours, sleep generally dropped by about 1 hour for every additional hour of use. Most points clustered along this downward trend, though some heavy users (6+ hours) still reported around 7 hours of sleep, while moderate users (4-6 hours) mostly ranged between 6 and 8 hours of sleep.
```{r}
#| label: fig-usage-sleep-scatter
#| fig-cap: "Relationship between Usage Hours and Sleep Hours"
## Sleep Hours ~ Usage Hours
# scatter plot
plot_ly(
df,
x = ~Avg_Daily_Usage_Hours
) |>
# add scatters
add_markers(
y = ~Sleep_Hours_Per_Night,
marker = list(color = "#3B9FE4", opacity = 0.6)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Relationship between Usage Hours and Sleep Hours"
),
xaxis = list(
title = "Average Daily Usage Hours",
range = c(0.8, 9.2), # let x axis be exhaustive
dtick = 1 # distance between ticks
),
yaxis = list(
title = "Average Sleep Hours",
range = c(3, 10) # let y axis be exhaustive
)
)
```
To validate the trend, a Pearson's correlation test was conducted. As shown in the results (see Output 11), the correlation coefficient was around -0.79 (*p* \< .001), indicating a strong negative correlation between students' average daily social media usage hours and their sleep hours.
```{r results='hold'}
## Sleep Hours ~ Usage Hours
# correlation test
cor.test(x = df$Avg_Daily_Usage_Hours, y = df$Sleep_Hours_Per_Night)
```
*Output 11. Pearson's Correlation Test Result*
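For reference, the statistics reported in Output 11 could also be extracted programmatically from the `htest` object that `cor.test()` returns, rather than read off the printed output. A small sketch:

```r
# cor.test() returns an "htest" list with named components
ct <- cor.test(df$Avg_Daily_Usage_Hours, df$Sleep_Hours_Per_Night)
round(ct$estimate, 2) # Pearson's r
ct$p.value            # p-value of the test
ct$conf.int           # 95% confidence interval for r
```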
@fig-usage-sleep-linear showed a linear fit of the two variables, which further clarified the negative correlation.
```{r}
#| label: fig-usage-sleep-linear
#| fig-cap: "Linear Relationship between Usage Hours and Sleep Hours"
## Sleep Hours ~ Usage Hours
# linear model
fit <- lm(Sleep_Hours_Per_Night ~ Avg_Daily_Usage_Hours, data = df)
# scatter plot
plot_ly(
df,
x = ~Avg_Daily_Usage_Hours
) |>
# add scatters
add_markers(
y = ~Sleep_Hours_Per_Night,
name = "Data Points",
marker = list(color = "#3B9FE4", opacity = 0.6)
) |>
# add linear fit
add_lines(
y = ~fitted(fit),
name = "Linear Fit",
line = list(color = '#D53132', width = 1.5)
) |>
layout(
legend = list(
orientation = "h", # arrangement of legend labels
xanchor = "right", yanchor = "top", # anchor point of the legend
x = 1, y = 1 # position of the legend
),
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Linear Relationship between Usage Hours and Sleep Hours"
),
xaxis = list(
title = "Average Daily Usage Hours",
range = c(0.8, 9.2), # let x axis be exhaustive
dtick = 1 # distance between ticks
),
yaxis = list(
title = "Average Sleep Hours",
range = c(3, 10) # let y axis be exhaustive
)
)
```
## Other Indices
In addition, it was considered worthwhile to investigate the relationships between the other indices, such as mental health score and addicted score (names of the indices are shown in Output 12). Therefore, a 5×5 paired scatter plot matrix was used to display all of the pairwise trends (see @fig-indices-scatter).
```{r }
## Relationship between Indices
# remove Age from names_numvar
(names_numvar <- setdiff(names_numvar, "Age"))
```
*Output 12. Names of Indices to Compare*
```{r fig.width=8, fig.height=6, out.width='95%'}
#| label: fig-indices-scatter
#| fig-cap: "Relationship between Indices"
## Relationship between Indices
# extract numerical variable data
num_data <- df[, names_numvar]
# paired scatter plots
plot_ly(
num_data,
type = "splom", # scatter plots matrix
# indices to compare
dimensions = list(
list(label = "Usage", values = ~Avg_Daily_Usage_Hours),
list(label = "Sleep", values = ~Sleep_Hours_Per_Night),
list(label = "Mental Health", values = ~Mental_Health_Score),
list(label = "Conflicts", values = ~Conflicts_Over_Social_Media),
list(label = "Addicted", values = ~Addicted_Score)
),
# aesthetics
marker = list(
color = "#3B9FE4",
opacity = 0.5,
size = 4
)
) |>
layout(
margin = list(t = 60), # margin top
title = list(
y = 0.95, # y position of title
text = "Relationship between Indices"
)
)
```
The scatter plot matrix (@fig-indices-scatter) illustrated a complex web of behavioral and psychological correlations, primarily centered on excessive social media usage. Meanwhile, Output 13 showed a correlation matrix for all five indices, providing further insight into the behavioral trends.
As the results showed, a dominant negative correlation existed between usage hours and sleep, where increased screen time consistently aligned with diminished rest. This was complemented by a positive correlation between usage and addiction levels, which in turn was associated with more frequent conflicts over social media. Conversely, better mental health outcomes were associated with lower addiction scores, reduced usage, and longer sleep duration. Ultimately, the data suggested a reinforcing loop in which high device dependency compromised sleep and social harmony, collectively contributing to poorer mental well-being.
```{r}
## Relationship between Indices
# view correlation matrix (pearson correlation coefficient)
(cor_matrix <- round(cor(num_data), 2))
```
*Output 13. Correlation Matrix of Indices*
Notably, the correlation between addicted score and mental health score (*r* = -.95), and that between addicted score and the number of conflicts over social media (*r* = .93), appeared to be the strongest, both with \| *r* \| \> .9 (see Output 14), indicating that as addictive behaviors intensified, psychological well-being declined and interpersonal friction rose almost in lockstep.
In conclusion, the addicted score served as the strongest correlate of negative life outcomes in this dataset.
```{r}
## Relationship between Indices
# find location of pairs where |r| > 0.9
# (upper triangle only, so each symmetric pair is reported once)
loc <- which(
abs(cor_matrix) > 0.9 & upper.tri(cor_matrix),
arr.ind = TRUE # return matrix of row and column indices
)
# print results
for(i in 1:nrow(loc)) {
row <- loc[i, 1]
col <- loc[i, 2]
cat(
rownames(cor_matrix)[row], "~", colnames(cor_matrix)[col], "\n",
"r =", cor_matrix[row, col], "\n")
}
```
*Output 14. Pairs of Indices with the Strongest Correlation (\|r\|\>.9)*
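If desired, these strongest pairs could also be tested individually, in the same way as the usage-sleep pair in Output 11. A sketch (not run here):

```r
# significance checks for the two strongest pairs
cor.test(df$Addicted_Score, df$Mental_Health_Score)
cor.test(df$Addicted_Score, df$Conflicts_Over_Social_Media)
```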
# Application of CLT
This section repeatedly drew random samples (1000 repetitions) with different sample sizes (n = 10, 30, 50, 100) from the population dataset ( `df` ), and then examined the patterns of the sample mean distributions, aiming to assess the applicability of the Central Limit Theorem (CLT).
The Central Limit Theorem states that as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the population's original distribution.
In the meantime, the standard deviations of the sample means (observed standard errors) are related to the population standard deviation *σ* and the sample size *n*, where the standard error is defined as $SE = \frac{\sigma}{\sqrt n}$ . This formula was used in later analyses to calculate the theoretical standard error.
To investigate the applicability of the CLT, a function was first defined to draw random samples with a specified sample size, number of repetitions, variable to sample, and other relevant parameters.
```{r}
## CLT
## Define function to plot
# store different sample sizes
sizes <- c(10, 30, 50, 100)
# define a function to randomly sample from df with
# specified sample size and plot sample means histogram with
# specified parameters
make_hist <- function(
n, col, time = 1000, ymax = 1,
binwidth = 0.2, bincolor = "#FFA472", xdtick = 1
) {
## n: sample size
## col: column to select from df
## time: times to replicate sampling process
## ymax: maximum value of y axis in histogram
## binwidth: width of bins in histogram
## bincolor: color of bins
## xdtick: distance between ticks in x axis
# randomly sample data from df and calculate sample means
sample_means <- replicate(
time, # repetition times, default is 1000
mean(
sample(df[[col]], size = n, replace = TRUE),
na.rm = TRUE
)
)
# find population mean & standard deviation
pop_mean <- mean(df[[col]], na.rm = TRUE)
pop_sd <- sd(df[[col]], na.rm = TRUE)
# find fixed x range
xmin <- pop_mean - (2 * pop_sd)
xmax <- pop_mean + (2 * pop_sd)
# plot sample means probability distribution histogram
plot_ly(
x = sample_means,
type = "histogram",
histnorm = "probability", # probability distribution
name = paste("n =", n),
xbins = list(
start = xmin,
end = xmax,
size = binwidth # specify binwidth, default is 0.4
),
# aesthetics
marker = list(
color = bincolor, # specify bincolor, default is "#FFA472"
line = list(color = "#222", width = 1.5)
)
) |>
layout(
margin = list(t = 60), # margin top
# add annotation of sample size
annotations = list(
list(
text = paste("Sample Size:", n),
x = 0.5, # x position of annotation
y = -0.25, # y position of annotation
xanchor = "center", # anchor of annotation for x
yanchor = "bottom", # anchor of annotation for y
xref = "paper", # position relative to current subplot
yref = "paper",
showarrow = FALSE # remove arrow of annotation
)
),
# axis limits
xaxis = list(
range = c(xmin, xmax), # x axis range
dtick = xdtick # specify dtick, default is 1
),
yaxis = list(range = c(0, ymax)) # specify ymax, default is 1
)
}
```
## Usage Hours
In terms of the average media usage hours variable, a 2×2 histogram matrix (@fig-usage-sample-means-grid) of sample mean distributions with different sample sizes (n = 10, 30, 50, 100) was plotted with identical axis scales to reveal how the distribution shape changed. Meanwhile, Output 15 showed the population and sample statistics regarding usage hours.
As the results showed, as the sample size increased, the distribution of sample means became more concentrated, with smaller standard deviations. Meanwhile, all four means of the sample means were very close to the population mean, successfully predicting it; and the observed standard errors (standard deviations of the sample means) were approximately equal to the theoretical standard errors derived from the population standard deviation.
```{r fig.width = 8, fig.height=5}
#| label: fig-usage-sample-means-grid
#| fig-cap: "Usage Hours Sample Means Probability Distribution"
## CLT - Usage Hours
## Draw random samples and plot
# set random seed for reproducibility
set.seed(32)
# get plot information
# (iterate sizes and pass values as n, apply function)
plots <- lapply(
sizes, function(n) make_hist(
n, "Avg_Daily_Usage_Hours", bincolor = "#55D2C6", ymax = 0.6
)
)
# show subplots in 2*2 matrix
subplot(
plots,
nrows = 2, # number of rows in subplot matrix
margin = 0.08 # margin between subplots
) |>
layout(
showlegend = FALSE, # remove legend
margin = list(
t = 80, # margin top
b = 60 # margin bottom
),
title = list(
y = 0.95,
text = "Usage Hours Sample Means Probability Distributions"
)
)
```
```{r results='hold'}
## CLT - Usage Hours
## Store sample means
# create list to store sample means of usage in different sizes
sample_means_usage <- lapply(
sizes, # iterate sample sizes
function(n) { # n: sample sizes passed from sizes
replicate(
1000, # repetition times
mean(
sample(df[["Avg_Daily_Usage_Hours"]], size = n, replace = TRUE),
na.rm = TRUE)
)
}
)
## Store means & SDs of sample means
# initialize empty vector
mean_sample_means_usage <- numeric(length(sizes))
# find means of usage sample means
for (i in seq_along(sizes)) {
mean_sample_means_usage[i] <- round(
mean(sample_means_usage[[i]], na.rm = TRUE),
digits = 2 # round to 2 digits
)
}
# initialize empty vector
sd_sample_means_usage <- numeric(length(sizes))
# find sds of usage sample means
for (i in seq_along(sizes)) {
sd_sample_means_usage[i] <- round(
sd(sample_means_usage[[i]], na.rm = TRUE),
digits = 2 # round to 2 digits
)
}
## Population mean & SD
pop_mean_usage <- mean(df$Avg_Daily_Usage_Hours, na.rm = TRUE)
pop_sd_usage <- sd(df$Avg_Daily_Usage_Hours, na.rm = TRUE)
## Print results
clt_results_usage <- data.frame(
Sample_Size = sizes,
Predicted_Mean = mean_sample_means_usage,
Observed_SE = sd_sample_means_usage,
Theoretical_SE = round(pop_sd_usage / sqrt(sizes), 2)
)
print(clt_results_usage)
cat("\nPopulation Mean:", round(pop_mean_usage, 2),
"\nPopulation SD:", round(pop_sd_usage, 2), "\n"
)
```
*Output 15. Comparison of Statistics Regarding Media Usage Hours*
In addition, @fig-usage-sample-means-normal displayed one of the sample mean distributions (n = 100, repeated 1000 times) with an ideal normal curve overlaid on it, calculated using the function `dnorm()` . The results demonstrated a close overlap between the sample means histogram and the normal curve, in line with the CLT.
```{r}
#| label: fig-usage-sample-means-normal
#| fig-cap: "Usage Hours Sample Means Probability Distribution (with Normal Curve)"
## CLT - Usage Hours
## Ideal normal distribution data
x_values <- seq(
min(tail(sample_means_usage, 1)[[1]]), # the last set of means, n = 100
max(tail(sample_means_usage, 1)[[1]]),
length.out = 100 # sequence length = 100
)
y_values <- dnorm(
x_values,
mean = mean(tail(sample_means_usage, 1)[[1]]),
sd = sd(tail(sample_means_usage, 1)[[1]])
)
## Sample Mean Distribution v.s. Normal Curve
plot_ly() |>
# histogram
add_histogram(
x = tail(sample_means_usage, 1)[[1]],
type = "histogram",
histnorm = "probability",
xbins = list(size = 0.05), # binwidth
name = "Sample Means", # histogram label
# aesthetics
marker = list(
color = "#55D2C6", # filling color
line = list(color = "#222", width = 1.5), # outline
opacity = 0.6
)
) |>
# normal curve
add_lines(
x = x_values,
y = y_values * 0.05, # scale density by binwidth (0.05) to match probability histogram
name = "Normal Curve", # line label
line = list(color = '#D53132', width = 2)
) |>
layout(
legend = list(
orientation = "h", # arrangement of legend labels
xanchor = "left", yanchor = "top", # anchor point of the legend
x = 0, y = 1.1 # position of the legend
),
margin = list(t = 90), # margin top
title = list(
y = 0.95, # y position of title
text = "Usage Hours Sample Means Probability Distribution"
),
xaxis = list(
title = "Sample Means of Daily Usage Hours (n = 100)",
range = c(4.38, 5.4)
),
yaxis = list(
title = "Probability",
range = c(0, 0.18)
)
)
```
## Sleep Hours
Likewise, similar approaches were applied to another numerical variable, sleep hours. The result patterns shown in @fig-sleep-sample-means-grid, @fig-sleep-sample-means-normal, and Output 16 were closely similar to those of media usage hours (cf. Output 15), again demonstrating the applicability of the CLT.
```{r fig.width = 8, fig.height=5}
#| label: fig-sleep-sample-means-grid
#| fig-cap: "Sleep Hours Sample Means Probability Distribution"
## CLT - Sleep Hours
# set random seed for reproducibility
set.seed(32)
# get plot information
# (iterate sizes and pass values as n, apply function)
plots <- lapply(
sizes, function(n) make_hist(
n, "Sleep_Hours_Per_Night", bincolor = "#FFABB6", ymax = 0.6
)
)
# show subplots in 2*2 matrix
subplot(
plots,
nrows = 2,
margin = 0.08
) |>
layout(
showlegend = FALSE, # remove legend
margin = list(
t = 80, # margin top
b = 60 # margin bottom
),
title = list(
y = 0.95,
text = "Sleep Hours Sample Means Probability Distributions"
)
)
```
```{r results='hold'}
## CLT - Sleep Hours
## Store sample means
# create list to store sample means of sleep in different sizes
sample_means_sleep <- lapply(
sizes, # iterate sample sizes
function(n) { # n: sample sizes passed from sizes
replicate(
1000, # repetition times
mean(
sample(df[["Sleep_Hours_Per_Night"]], size = n, replace = TRUE),
na.rm = TRUE)
)
}
)
## Store means & sds of sample means
# initialize empty vector
mean_sample_means_sleep <- numeric(length(sizes))
# find means of sleep sample means
for (i in seq_along(sizes)) {
mean_sample_means_sleep[i] <- round(
mean(sample_means_sleep[[i]], na.rm = TRUE),
digits = 2 # round to 2 digits
)
}
# initialize empty vector
sd_sample_means_sleep <- numeric(length(sizes))
# find sds of sleep sample means
for (i in seq_along(sizes)) {
sd_sample_means_sleep[i] <- round(
sd(sample_means_sleep[[i]], na.rm = TRUE),
digits = 2 # round to 2 digits
)
}
## Population mean & sd
pop_mean_sleep <- mean(df$Sleep_Hours_Per_Night, na.rm = TRUE)
pop_sd_sleep <- sd(df$Sleep_Hours_Per_Night, na.rm = TRUE)
## Print results
clt_results_sleep <- data.frame(
Sample_Size = sizes,
Predicted_Mean = mean_sample_means_sleep,
Observed_SE = sd_sample_means_sleep,
Theoretical_SE = round(pop_sd_sleep / sqrt(sizes), 2)
)
print(clt_results_sleep)
cat("\nPopulation Mean:", round(pop_mean_sleep, 2),
"\nPopulation SD:", round(pop_sd_sleep, 2), "\n"
)
```
*Output 16. Comparison of Statistics Regarding Sleep Hours*
```{r}
#| label: fig-sleep-sample-means-normal
#| fig-cap: "Sleep Hours Sample Means Probability Distribution (with Normal Curve)"
## CLT - Sleep Hours
## Ideal normal distribution data
x_values <- seq(
min(tail(sample_means_sleep, 1)[[1]]),
max(tail(sample_means_sleep, 1)[[1]]),
length.out = 100 # select 100 numbers
)
y_values <- dnorm(
x_values,
mean = mean(tail(sample_means_sleep, 1)[[1]]),
sd = sd(tail(sample_means_sleep, 1)[[1]])
)
## Sample Mean Distribution v.s. Normal Curve
plot_ly() |>
# histogram
add_histogram(
x = tail(sample_means_sleep, 1)[[1]],
type = "histogram",
histnorm = "probability",
xbins = list(size = 0.05), # binwidth
name = "Sample Means", # histogram label
# aesthetics
marker = list(
color = "#FFABB6", # filling color
line = list(color = "#222", width = 1.5), # outline
opacity = 0.6
)
) |>
# normal curve
add_lines(
x = x_values,
y = y_values * 0.05, # scale density by binwidth (0.05) to match probability histogram
name = "Normal Curve", # line label
line = list(color = '#D53132', width = 2)
) |>
layout(
legend = list(
orientation = "h", # arrangement of legend labels
xanchor = "left", yanchor = "top", # anchor point of the legend
x = 0, y = 1.1 # position of the legend
),
margin = list(t = 90), # margin top
title = list(
y = 0.95, # y position of title
text = "Sleep Hours Sample Means Probability Distribution"
),
xaxis = list(title = "Sample Means of Sleep Hours (n = 100)"),
yaxis = list(title = "Probability")
)
```
# Sampling Methods
In this section, the top 5 platform data was used as the population dataset, and three samples were drawn using different sampling methods: simple random sampling with replacement, systematic sampling with unequal probabilities, and stratified sampling without replacement. These three methods were then compared and contrasted to examine differences in their ability to reproduce population patterns.
Outputs 17, 18, and 19 showed the results for the different samples, regarding the proportions of platforms and the mean and standard deviation of media usage hours.
## Simple Random Sampling
Simple random sampling (with replacement) was conducted using the function `srswr()` , and the sample was then stored in `sample1` .
```{r results='hold'}
## Simple Random Sampling with Replacement
## Sampling
# sample size & random seed
n <- 100
set.seed(32)
# simple random sampling
s <- srswr(n, nrow(top5_platforms_data))
# find rows where observations are sampled
rows <- (1:nrow(top5_platforms_data))[s!=0]
rows <- rep(rows, s[s!=0])
# extract sampled data
sample1 <- top5_platforms_data[rows, ]
## Calculating
# define function to calculate percentages
calc_perc <- function(sample_data, n = 100) {
## sample_data: sample data drawn with different methods
## n: sample size
# frequency
f <- table(
factor(
sample_data$Most_Used_Platform,
levels = top5_platforms # to ensure the same platform sequence
)
)
# percentage
return(f / n) # n: sample size
}
# percentages of platforms
perc1 <- calc_perc(sample1)
# sample mean & SD of usage hour
mean1 <- mean(sample1$Avg_Daily_Usage_Hours, na.rm = TRUE)
sd1 <- sd(sample1$Avg_Daily_Usage_Hours, na.rm = TRUE)
# print results
cat("Percentages of Platforms(%):\n"); perc1 * 100; cat("\n")
cat("Sample Mean of Usage Hour:", mean1, "\n")
cat("Sample SD of Usage Hour:", round(sd1, 3), "\n")
```
*Output 17. Results of Simple Random Sampling with Replacement*
## Systematic Sampling
For systematic sampling, the inclusion probabilities by usage hours were first calculated using the function `inclusionprobabilities()` and stored in `pik` ; the sample was then drawn using `UPsystematic(pik)` and stored in `sample2` .
```{r results='hold'}
## Systematic Sampling with Unequal Probabilities
## Sampling
# random seed
set.seed(32)
# calculate inclusion probabilities by usage hours
pik <- inclusionprobabilities(top5_platforms_data$Avg_Daily_Usage_Hours, n)
# systematic sampling with unequal probabilities
s <- UPsystematic(pik)
# extract sampled data
sample2 <- top5_platforms_data[s!= 0, ]
## Calculating
# percentages of platforms
perc2 <- calc_perc(sample2)
# sample mean & SD of usage hour
mean2 <- mean(sample2$Avg_Daily_Usage_Hours, na.rm = TRUE)
sd2 <- sd(sample2$Avg_Daily_Usage_Hours, na.rm = TRUE)
# print results
cat("Percentages of Platforms(%):\n"); perc2 * 100; cat("\n")
cat("Sample Mean of Usage Hour:", mean2, "\n")
cat("Sample SD of Usage Hour:", round(sd2, 3), "\n")
```
*Output 18. Results of Systematic Sampling with Unequal Probabilities*
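As a sanity check on `inclusionprobabilities()`: it rescales a positive size variable so that the resulting probabilities sum to the sample size (capping individual values at 1 where necessary). A toy sketch with made-up usage hours:

```r
library(sampling)

# inclusionprobabilities() scales the size variable x so that
# the resulting probabilities sum to the sample size n
x <- c(1, 2, 3, 4)                  # hypothetical usage hours
pik <- inclusionprobabilities(x, 2) # n = 2
pik                                 # 0.2 0.4 0.6 0.8
sum(pik)                            # 2, i.e. the sample size
```

Because the probabilities here are proportional to usage hours, heavier users are more likely to enter the sample, which matters later when interpreting the unweighted sample mean.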
## Stratified Sampling
For stratified sampling, the population data were first ordered by platform with the `order()` function; the sample was then drawn with the `strata()` function, using stratum sizes proportional to each platform's share, and stored in `sample3` .
```{r results='hold'}
## Stratified Sampling without Replacement
## Ordering Data
# order data by platform
order_index <- order(top5_platforms_data$Most_Used_Platform)
data <- top5_platforms_data[order_index, ] # use ordered data
# calculate proportion
prop_table <- table(data$Most_Used_Platform) / nrow(data)
# find strata sizes
sizes <- round(prop_table * n)
# check if sum of sizes = n
# sum(sizes) # [1] 100
## Stratified Sampling
# random seed
set.seed(32)
# find stratums
st <- sampling::strata(
data,
stratanames = "Most_Used_Platform",
size = sizes,
method = "srswor" # simple random sampling without replacement
)
# extract sampled data
sample3 <- sampling::getdata(data, st)
## Calculating
# percentages of platforms
perc3 <- calc_perc(sample3)
# sample mean & SD of usage hour
mean3 <- mean(sample3$Avg_Daily_Usage_Hours, na.rm = TRUE)
sd3 <- sd(sample3$Avg_Daily_Usage_Hours, na.rm = TRUE)
# print results
cat("Percentages of Platforms (%):\n"); perc3 * 100; cat("\n")
cat("Sample Mean of Usage Hour:", mean3, "\n")
cat("Sample SD of Usage Hour:", round(sd3, 3), "\n")
```
*Output 19. Stratified Sampling without Replacement*
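In the chunk above the rounded stratum sizes happened to sum exactly to 100 (the commented check), but `round()` does not guarantee this in general. A base-R sketch of the problem and one common fix, using hypothetical platform shares rather than the project's data:

```r
# Proportional allocation with round() need not sum to n;
# one common fix adds the shortfall to the largest stratum
props <- c(A = 1/3, B = 1/3, C = 1/3) # hypothetical platform shares
n <- 100
sizes <- round(props * n)             # 33 + 33 + 33 = 99, one short of n
sizes[which.max(sizes)] <- sizes[which.max(sizes)] + (n - sum(sizes))
sum(sizes)                            # now exactly 100
```

Checking `sum(sizes)` before calling `strata()` (as done above) avoids silently drawing a sample of the wrong size.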
## Comparison of Sampling Methods
### Percentages of Platforms
Output 20 presented the platform proportions in the population (the top-5-platform data) and in the three samples. To facilitate comparison, the mean absolute difference between each sample's percentages and the corresponding population percentages was calculated and shown in Output 21, and @fig-sampling-diff visualized those results.
```{r results='hold'}
## Comparison of Sampling Methods
## Percentages of Platforms
# store sample percentages
sample_percs <- list(srswr = perc1, ups = perc2, stwor = perc3)
# calculate population percentages
pop_perc <- table(
factor(
top5_platforms_data$Most_Used_Platform,
levels = top5_platforms
)
) / nrow(top5_platforms_data)
# show results of percentages
cat("Population Percentages:\n")
round(pop_perc, 2); cat("\n\n")
cat("Sample Percentages:\n\n")
sample_percs
```
*Output 20. Population and Sample Percentages of Different Platforms*
As shown in Output 21 and @fig-sampling-diff, stratified sampling without replacement deviated least from the population values (mean difference 0.002), indicating the strongest predictive power for platform proportions, whereas simple random sampling with replacement deviated most (mean difference 0.029). Stratified sampling likely performed best because it preserved the population's platform distribution by construction, reducing sampling variability, whereas simple random sampling left the platform proportions entirely to chance.
```{r results='hold'}
## Comparison of Sampling Methods
## Percentages of Platforms
# find mean differences between
# sample percentages and population percentages
mean_diffs <- sample_percs |>
# calculate differences
lapply(\(percs) abs(percs - pop_perc)) |>
# lapply: list apply, return a list
# where \() is anonymous function, same as function(percs)
# it passes values from sample_percs as percs and apply function
# to them, then returns a list
# calculate means of differences
sapply(mean, na.rm = TRUE)
# sapply: simple apply, flexible at data types of returns.
# it returns a named vector here
# show results
cat("Differences from Population Percentages:\n")
sort(round(mean_diffs, 3)); cat("\n")
```
*Output 21. Differences from Population Percentage by Sampling Method*
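For readers less familiar with the apply family, the `lapply()` to `sapply()` pattern used above can be seen on a toy list of hypothetical difference values:

```r
# lapply() always returns a list; sapply() simplifies the result,
# here to a named numeric vector of means
diffs <- list(srswr = c(0.01, 0.03), stwor = c(0.00, 0.02))
lapply(diffs, mean) # list of per-method means
sapply(diffs, mean) # named vector: srswr = 0.02, stwor = 0.01
```

The named vector returned by `sapply()` is what makes the later `sort()` and barplot steps straightforward.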
```{r}
#| label: fig-sampling-diff
#| fig-cap: "Differences from Population Percentages by Sampling Method"
## Comparison of Sampling Methods
## Percentages of Platforms
# barplot of differences from population percentages
plot_ly(
x = toupper(names(mean_diffs)), # capitalized method names
y = mean_diffs,
type = "bar", # barplot
# aesthetics
marker = list(
color = "#c3a4cf", # filling color
line = list(color = "#222", width = 1.5) # outline color & width
)
) |>
layout(
margin = list(t = 50), # margin top
title = list(
y = 0.95, # y position of title
text = "Differences from Population Percentages by Sampling Method"
),
xaxis = list(
title = "Sampling Methods",
type = "category", # categorical axis
categoryorder = "array", # let order be specified
categoryarray = names(sort(mean_diffs)), # specify category order
range = c(-0.5, 2.5) # x limits (category positions 0-2)
),
yaxis = list(title = "Difference")
)
```
### Statistics of Usage Hours
In addition to predicting category proportions, the ability of the three sampling methods to estimate the population mean and standard deviation was also examined. Output 22 showed the resulting means and standard deviations, and @fig-sampling-means-sds visualized them.
```{r results='hold'}
## Comparison of Sampling Methods
## Means & SDs of Usage Hours
# store sample means & sds
sample_means <- c(srswr = mean1, ups = mean2, stwor = mean3)
sample_sds <- c(srswr = sd1, ups = sd2, stwor = sd3)
# show results of means
cat("Population Mean:", round(pop_mean_usage, 3), "\n")
cat("Sample Means:\n"); sample_means; cat("\n")
# show results of SDs
cat("Population SD:", round(pop_sd_usage, 3), "\n")
cat("Sample SDs:\n"); round(sample_sds, 3)
```
*Output 22. Population and Sample Statistics of Media Usage Hours*
In @fig-sampling-means-sds, the dots at the centers of the sticks represented the sample means, the black whiskers spanned one sample standard deviation on either side of each mean, and the population mean and standard deviation were marked with dashed lines and text labels.
Combining Output 22 and @fig-sampling-means-sds, stratified sampling without replacement produced the sample mean closest to the population mean, indicating the highest predictive power for the mean of the numerical variable, whereas systematic sampling with unequal probabilities produced the farthest.
For the standard deviation, however, the ranking reversed: simple random sampling with replacement gave the most accurate estimate, and stratified sampling without replacement the least accurate.
These discrepancies likely stemmed from the design of each method. Stratified sampling without replacement excelled at estimating the mean because it represented every subgroup proportionally and avoided duplicate observations. Systematic sampling with unequal probabilities, by contrast, was prone to bias here: because the inclusion probabilities were proportional to usage hours themselves, heavier users were more likely to be selected, pushing the unweighted sample mean upward.
Conversely, simple random sampling with replacement estimated the standard deviation more accurately, possibly because the replacement mechanism let the sample reflect the full variability and noise of the population, while the structure imposed by stratification tended to compress within-sample variance and thus understate the overall spread.
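The variance-compression explanation is speculative, and it could be probed with a small simulation. The sketch below (base R only, simulated data rather than the project's dataset) repeatedly draws samples under both schemes from a two-stratum population and compares the average estimated SD with the population SD:

```r
# Exploratory sketch: compare SD estimates from simple random sampling
# with replacement vs. proportional stratified sampling (simulated data)
set.seed(32)
pop <- c(rnorm(500, mean = 3, sd = 1),  # stratum 1
         rnorm(500, mean = 7, sd = 1))  # stratum 2
strat_id <- rep(1:2, each = 500)
n <- 100
# 1000 repeated samples per scheme
sd_srswr <- replicate(1000, sd(sample(pop, n, replace = TRUE)))
sd_strat <- replicate(1000, {
  idx <- c(sample(which(strat_id == 1), n / 2),  # 50 from each stratum
           sample(which(strat_id == 2), n / 2))
  sd(pop[idx])
})
c(SRSWR = mean(sd_srswr), Stratified = mean(sd_strat), Population = sd(pop))
```

Whether stratification actually understates the spread depends on how the strata relate to the variable of interest; on the real data the same comparison could be run by substituting usage hours and platform labels for `pop` and `strat_id`.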
```{r fig.width=8, fig.height=6}
#| label: fig-sampling-means-sds
#| fig-cap: "Comparison of Means & SDs across Sampling Methods"
## Comparison of Sampling Methods
## Means & SDs of Usage Hours
# store methods, sample means & SDs into dataframe
plot_data <- data.frame(
Method = toupper(names(sample_means)), # capitalized method names
Mean = sample_means,
SD = sample_sds
)
# stick plot
plot_ly(
plot_data,
x = ~Method,
y = ~Mean
) |>
# add sticks
add_markers(
name = "Sample Mean", # stick label
# error stick (sample SDs)
error_y = list(
type = "data", # take the data specified in array as stick length
array = ~SD, # data to use for stick length
color = '#222' # stick color
),
# aesthetics of marker (sample means)
marker = list(size = 12, color = '#55D2C6')
) |>
layout(
margin = list(t = 60, b = 60), # margin top & bottom
title = list(
y = 0.95, # y position of title
text = "Comparison of Means & SDs across Sampling Methods"
),
yaxis = list(
title = "Usage Hours",
range = c( # y limits
pop_mean_usage - 1.5*pop_sd_usage,
pop_mean_usage + 1.5*pop_sd_usage
)
),
xaxis = list(
title = "Sampling Method",
range = c(-0.5, 3), # x limits
type = "category", # categorical axis
categoryorder = "array", # let order be specified
categoryarray = plot_data$Method # specify category order
),
# add lines of population statistics
shapes = list(
# population mean
list(
type = "line", # add line
xref = "x", # x values are relative to x-axis (0,1,2)
x0 = -0.4, x1 = 2.4, # xrange of line
y0 = pop_mean_usage, # y of line
y1 = pop_mean_usage,
line = list(
color = "#D53132",
dash = "dash", # line type
width = 2
)
),
# pop_mean + pop_sd
list(
type = "line",
xref = "x",
x0 = -0.4, x1 = 2.4,
y0 = pop_mean_usage + pop_sd_usage,
y1 = pop_mean_usage + pop_sd_usage,
line = list(color = "#FFA472", dash = "dot", width = 2)
),
# pop_mean - pop_sd
list(
type = "line",
xref = "x",
x0 = -0.4, x1 = 2.4,
y0 = pop_mean_usage - pop_sd_usage,
y1 = pop_mean_usage - pop_sd_usage,
line = list(color = "#FFA472", dash = "dot", width = 2)
)
),
# annotations for lines
annotations = list(
# population mean
list(
text = "Population Mean<br>(4.919)",
x = 2.5,
y = pop_mean_usage,
xanchor = "left",
showarrow = FALSE
),
# pop_mean + pop_sd
list(
text = "+1 Population SD<br>(+1.257)",
x = 2.5,
y = pop_mean_usage + pop_sd_usage,
xanchor = "left",
showarrow = FALSE
),
list(
text = "-1 Population SD<br>(-1.257)",
x = 2.5,
y = pop_mean_usage - pop_sd_usage,
xanchor = "left",
showarrow = FALSE
)
)
)
```