Student Social Media Usage and Well-being Analysis

EDA

Social Media

Mental Health

Visualization

Analysis of social media habits and addiction patterns among students, with behavioral and academic insights.

Author

Pinn Xu

Published

February 23, 2026

Environment & Versions

R==4.5.2
plotly==4.12.0
sampling==2.11
tidyverse==2.0.0

Introduction

Environment and Package Version

This project was conducted in R v4.5.2 using the plotly (v4.12.0) and sampling packages, and the package collection tidyverse (v2.0.0), with packages such as stringr (v1.6.0), tibble (v3.3.1), and dplyr (v1.1.4) included.

Code

# view versions
getRversion()                # [1] ‘4.5.2’
packageVersion("plotly")     # [1] ‘4.12.0’
packageVersion("sampling")   # [1] ‘2.11’
packageVersion("tidyverse")  # [1] ‘2.0.0’
packageVersion("stringr")    # [1] ‘1.6.0’
packageVersion("tibble")     # [1] ‘3.3.1’
packageVersion("dplyr")      # [1] ‘1.1.4’

Dataset Overview

The dataset named Student Social Media & Relationships from Kaggle was used for this analysis. It provided anonymous cross-sectional survey data among students aged 16-25 in different educational background across multiple countries, capturing key dimensions such as social media usage intensity, platform preferences, and relationship dynamics, allowing researchers to examine how digital behaviors relate to personal and academic outcomes. Each observation represented an individual student’s response, making the dataset suitable for statistical analysis, with sufficient sample size (N = 705).

In terms of the variables, there was one identifier variable Student_ID , giving each observation an unique ID. Other nominal variables included such as Gender , Country , Most_Used_Platform , Relationship_Status with non-ordinal levels, and Academic_Level , Affects_Academic_Performance (Boolean) with ordinal or Boolean levels. Numerical variables includes Age , Avg_Daily_Usage_Hours , Sleep_Hours_Per_Night , Mental_Health_Score , Conflicts_Over_Social_Media , and Addicted_Score , where mental health score and addicted score were self-rated levels from 1 to 10.

Goal of Analysis

This project aimed to investigate the complex correlations between students’ social media patterns (e.g. average usage hours, platform preference), contextual and demographic factors (e.g. academic level, country, age, gender), and student well-being and performance (e.g. average sleep hours, mental health, addiction).

The suitability of the selected dataset stemmed from its robust sample size (N = 705) and multidimensional structure with different types of variables, which provided statistical power to draw reliable insights on student behavior.

The analytical phase of this project was structured into three primary stages. Firstly, descriptive analyses were conducted to establish a basic understanding of the sample’s demographics and social media usage patterns. Secondly, the analysis dived into correlation between indices regarding students’ digital habits and well-being level, followed by an application of the Central Limit Theorem (CLT) to validate the reliability of statistical inferences. Finally, it compared and evaluated the efficiency and precision of three different sampling methods: simple random, systematic, and stratified sampling.

Regarding the visualization process, the project mainly used used data wrangling methods from packages in tidyverse to manipulate data before plotting, including pipe operator |> and functions such as group_by() and summarize() from the dplyr package; and then tools from the plotly package was used to draw plots.

Furthermore, to improve the accessibility of visual storytelling, this project adopted colorblind-friendly palettes generated using the online tool developed by David Nichols.

Data Preparation

Importing Dataset

The dataset was downloaded from Kaggle and saved to a local directory. It was then imported into R as df0 using the built-in read.csv() function with a relative file path. The structure of df0 was shown in Output 1 below.

Code

# import dataset
df0 <- read.csv("./student-social-media-addiction.csv")
str(df0) # view dataframe structure

'data.frame':   705 obs. of  13 variables:
 $ Student_ID                  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Age                         : int  19 22 20 18 21 19 23 20 18 21 ...
 $ Gender                      : chr  "Female" "Male" "Female" "Male" ...
 $ Academic_Level              : chr  "Undergraduate" "Graduate" "Undergraduate" "High School" ...
 $ Country                     : chr  "Bangladesh" "India" "USA" "UK" ...
 $ Avg_Daily_Usage_Hours       : num  5.2 2.1 6 3 4.5 7.2 1.5 5.8 4 3.3 ...
 $ Most_Used_Platform          : chr  "Instagram" "Twitter" "TikTok" "YouTube" ...
 $ Affects_Academic_Performance: chr  "Yes" "No" "Yes" "No" ...
 $ Sleep_Hours_Per_Night       : num  6.5 7.5 5 7 6 4.5 8 6 6.5 7 ...
 $ Mental_Health_Score         : int  6 8 5 7 6 4 9 6 7 7 ...
 $ Relationship_Status         : chr  "In Relationship" "Single" "Complicated" "Single" ...
 $ Conflicts_Over_Social_Media : int  3 0 4 1 2 5 0 2 1 1 ...
 $ Addicted_Score              : int  8 3 9 4 7 9 2 8 5 4 ...

Output 1. Structure of the Dataframe

Preprocessing

To facilitate further investigation and visualization of the data, a few processing steps were conducted in advance.

Firstly, relative packages was loaded, and the original dataset df0 was converted to a tibble as df , while preserving the original dataset unchanged. Next, the names of categorical and numerical variables that would be used in subsequent analyses were specified and stored in separate vectors ( names_catvar and names_numvar ).

Code

# attach packages
library(tidyverse)
library(plotly)
library(sampling)

# create a tibble
df <- tibble(df0) # store tibble in df, let df0 remain what it is

# store names of categorical variables to use
names_catvar <- c(
  "Gender", "Country", "Most_Used_Platform", "Relationship_Status",
  "Academic_Level", "Affects_Academic_Performance"
)

# store names of numerical variables to use
names_numvar <- c(
  "Age", "Avg_Daily_Usage_Hours", "Sleep_Hours_Per_Night", 
  "Mental_Health_Score", "Conflicts_Over_Social_Media", "Addicted_Score"
)

Missing Data

An initial inspection for missing values was conducted using sum(is.na(df)) , and no explicit missing data were identified in the dataset (shown in Output 2).

Code

# check the number of missing values
sum(is.na(df))

[1] 0

Output 2. Number of NA Values

Categorical Data

For categorical variables, their numbers of categories (see Output 3) and counts of observations in each category (see Output 4) were first examined. The results indicated that most categorical variables contained a moderate number of categories, whereas country and most-used platform variables had relatively larger numbers of categories and therefore required further processing.

Code

## All Categorical Variables

# check the number of categories of each categorical variable
for (name in names_catvar) {
  print(paste(name, length(unique(df[[name]]))))
}

[1] "Gender 2"
[1] "Country 110"
[1] "Most_Used_Platform 12"
[1] "Relationship_Status 3"
[1] "Academic_Level 3"
[1] "Affects_Academic_Performance 2"

Output 3. Numbers of Unique Values in Categorical Variables

Code

## All Categorical Variables Except Country & Platform

# check the count of observations in each category
# (except the country and most-used platform variables)
for (name in setdiff(names_catvar, c("Country", "Most_Used_Platform"))) {
  print(table(df[name]))
  cat("\n")
}

Gender
Female   Male 
   353    352 

Relationship_Status
    Complicated In Relationship          Single 
             32             289             384 

Academic_Level
     Graduate   High School Undergraduate 
          325            27           353 

Affects_Academic_Performance
 No Yes 
252 453

Output 4. Numbers of Observations of Each Value in Categorical Variables

Top five countries and platforms with the most counts of observations were thereby selected with their names stored ( vectors top5_countries and top5_platforms ). The results were displayed in Output 5.

Code

## Country & Platform

# check counts of observation of top 5 countries
head(sort(table(df["Country"]), decreasing = TRUE), 5)

cat("\n")

# check counts of observation of top 5 platforms
head(sort(table(df["Most_Used_Platform"]), decreasing = TRUE), 5)

Country
  India     USA  Canada Denmark  France 
     53      40      34      27      27 

Most_Used_Platform
Instagram    TikTok  Facebook  WhatsApp   Twitter 
      249       154       123        54        30

Output 5. Numbers of Observations in Each Country and Platform

Code

## Country & Platform

# store top 5 countries
top5_countries <- names(
  head(sort(table(df$Country), decreasing = TRUE), 5)
)

# store top 5 platforms
top5_platforms <- names(
  head(sort(table(df$Most_Used_Platform), decreasing = TRUE), 5)
)

In addition, the Affects_Academic_Performance variable was converted from character to logical ( “Yes” = TRUE , “No” = FALSE ) as shown in Output 6.

Code

## Boolean Variable

# convert variable from character to logical
df$Affects_Academic_Performance <- df$Affects_Academic_Performance == "Yes"

  # As the variable contains only "Yes" and "No" responses, 
  # no further data cleaning was needed.

# check the converted variable
str(df$Affects_Academic_Performance)

 logi [1:705] TRUE FALSE TRUE FALSE TRUE TRUE ...

Output 6. Boolean Variable

Numerical Data

Descriptive statistics for the numerical variables were obtained using the summary() function in Output 7. The results indicate that the variables fall within reasonable ranges, with no apparent extreme or abnormal values. Thus, no further preprocessing was deemed necessary at this stage.

Code

## All Numerical Variables

# view descriptive summaries of numerical variables
for (name in names_numvar) {
  print(summary(df[name]))
  cat("\n")
}

      Age       
 Min.   :18.00  
 1st Qu.:19.00  
 Median :21.00  
 Mean   :20.66  
 3rd Qu.:22.00  
 Max.   :24.00  

 Avg_Daily_Usage_Hours
 Min.   :1.500        
 1st Qu.:4.100        
 Median :4.800        
 Mean   :4.919        
 3rd Qu.:5.800        
 Max.   :8.500        

 Sleep_Hours_Per_Night
 Min.   :3.800        
 1st Qu.:6.000        
 Median :6.900        
 Mean   :6.869        
 3rd Qu.:7.700        
 Max.   :9.600        

 Mental_Health_Score
 Min.   :4.000      
 1st Qu.:5.000      
 Median :6.000      
 Mean   :6.227      
 3rd Qu.:7.000      
 Max.   :9.000      

 Conflicts_Over_Social_Media
 Min.   :0.00               
 1st Qu.:2.00               
 Median :3.00               
 Mean   :2.85               
 3rd Qu.:4.00               
 Max.   :5.00               

 Addicted_Score 
 Min.   :2.000  
 1st Qu.:5.000  
 Median :7.000  
 Mean   :6.437  
 3rd Qu.:8.000  
 Max.   :9.000

Output 7. Descriptive Summaries of Numerical Variables

Moreover, the Age variable was categorized into three groups due to its limited range and variability. As shown in Output 8, treating Age as continuous offers minimal dispersion within such a narrow interval (18-24) and limited continuity (only 7 unique numbers), while treating the values as single-year categories would possibly lead to several small and uneven subgroups. Categorizing into three groups therefore provides a more suitable structure for subsequent analysis.

To achieve this, a histogram was firstly used to view age distribution and obtain suitable breaks for grouping. Then, grouping information was stored into a new variable named Age_Group (see Output 9).

Code

## Age

# check value range
range(df$Age)

# check numbers of unique values in Age
length(unique(df$Age))

[1] 18 24
[1] 7

Output 8. Range of Age and Numbers of Unique Age

Code

## Age

## Store histogram data

# store histogram data with 3 bins
hist_data <- hist(
  df$Age,
  breaks = 4,
  xlab = "Age",
  col = "#FFD794",
  main = "Age Distribution",
  plot = FALSE  # do not display plot
)

# store breaks for grouping
breaks <- hist_data$breaks


## Add new variable Age_Group to df

# store grouping information into a new variable Age_Group
df <- df |> 
  mutate(
    Age_Group = case_when(
      (breaks[1] <= Age) & (Age <= breaks[2]) ~ 
      sprintf("%d-%d", breaks[1], breaks[2]),
      (breaks[2] <= Age) & (Age <= breaks[3]) ~ 
      sprintf("%d-%d", breaks[2], breaks[3]),
      (breaks[3] <= Age) & (Age <= breaks[4]) ~ 
      sprintf("%d-%d", breaks[3], breaks[4]),
    )
  )

# view new variable
table(df["Age_Group"])

Age_Group
18-20 20-22 22-24 
  342   303    60

Output 9. Age Groups

Descriptive Analysis

This section demonstrated several core contextual and behavioral patterns of the sample.

Demographics

Gender

Figure 1 illustrated a highly balanced gender distribution within the dataset (N = 705). The sample was almost perfectly split, with females representing 50.1% and males representing 49.9% of the participants. This balanced distribution allowed relatively unbiased gender-based comparisons of behavioral patterns.

Code

## Gender

# pie chart
plot_ly(
  as.data.frame(table(df["Gender"])),
  labels = ~Gender,
  values = ~Freq,
  type = "pie",                # pie chart
  textinfo = "label+percent",  # display label & percent on pie slices
    
  # plotting area for pie chart
  domain = list(
    x = c(0, 1),    # horizontal: full height
    y = c(0, 0.85)  # vertical: leave 15% margin on the top
  ),
  
  # aesthetics
  marker = list(
    colors = c("Female" = "#FFB697", "Male" = "#8fc4de"), # filling color
    line = list(color = "#222", width = 1.5)              # outline
  )
) |> 
  
  layout(
    showlegend = FALSE,           # remove legend
    margin = list(l = 1, r = 1),  # left & right margin
    title = list(
      y = 0.95,                   # y position of title
      text = "Sample Gender Distribution"
    )
  )

Figure 1: Sample Gender Distribution

Country

The bar plot (Figure 2) followed by Output 10 showing percentages of observations for top 5 countries with respect to the whole sample illustrated the sample’s top five countries, with India leading at 7.5%, followed by the USA (5.7%) and Canada (4.8%). Denmark and France each represent 3.8% of the total, highlighting geographical diversity of participants, centered primarily in South Asian and North American regions.

Code

## Country

# extract top 5 country data
top5_countries_data <- df |> 
  filter(Country %in% top5_countries)

# barplot
plot_ly(
  as.data.frame(
    sort(table(top5_countries_data["Country"]), decreasing = TRUE) # sort
  ),
  x = ~Country,
  y = ~Freq,
  type = "bar",  # barplot
  
  # aesthetics
  marker = list(
    color = "#add4a0",                        # filling color
    line = list(color = "#222", width = 1.5)  # outline color & width
  )
) |> 
  
  layout(
    margin = list(t = 50),  # margin top
    title = list(
      y = 0.95,             # y position of title
      text = "Sample Country Distribution (top 5)"
    ),
    xaxis = list(
      title = "Countries",
      range = c(-0.8, 4.8)  # extend x axis (0 = the 1st category)
    ),
    yaxis = list(
      title = "Frequency",
      range = c(0, 60)      # let y axis be exhaustive
    )
  )

Figure 2: Sample Country Distribution (top 5)

Code

## Country

# view percentage of observations for top 5 countries
sort(table(top5_countries_data$Country), decreasing = TRUE) / nrow(df) * 100


   India      USA   Canada  Denmark   France 
7.517730 5.673759 4.822695 3.829787 3.829787

Output 10. Percentages of Top 5 Countries Relative to the Population

Age Group

The age group distribution (shown in Figure 3) was skewed toward younger participants, with the 18-20 and 20-22 cohorts being the most predominant. The 22-24 group has a significantly lower frequency, indicating that the sample primarily consists of early-stage university students.

Code

## Age Group

# barplot
plot_ly(
  as.data.frame(
    sort(table(df["Age_Group"]), decreasing = TRUE) # sort
  ),
  x = ~Age_Group,
  y = ~Freq,
  type = "bar",  # barplot
  
  # aesthetics
  marker = list(
    color = "#FFD794",                        # filling color
    line = list(color = "#222", width = 1.5)  # outline color & width
  )
) |> 
  
  layout(
    margin = list(t = 50),  # margin top
    title = list(
      y = 0.95,             # y position of title
      text = "Sample Age Group Distribution"
    ),
    xaxis = list(
      title = "Age Groups",
      range = c(-0.8, 2.8)  # extend x axis (0 = the 1st category)
    ),
    yaxis = list(
      title = "Frequency",
      range = c(0, 400)     # let y axis be exhaustive
    )
  )

Figure 3: Sample Age Group Distribution

Platform Preference

Figure 4 showed the frequency distribution of the five most popular social platforms. It can be seen that Instagram (249) was the most popular platform among the sample, followed by TikTok (154) and Facebook (123), while WhatsApp (54) and Twitter (30) showed much lower frequencies.

Code

## Platform

# extract top 5 platform data
top5_platforms_data <- df |> 
  filter(Most_Used_Platform %in% top5_platforms)

# barplot
plot_ly(
  as.data.frame(
    sort(table(top5_platforms_data["Most_Used_Platform"]), decreasing = TRUE)
  ),
  x = ~Most_Used_Platform,
  y = ~Freq,
  type = "bar",  # barplot
  
  # aesthetics
  marker = list(
    color = "#c3a4cf",                        # filling color
    line = list(color = "#222", width = 1.5)  # outline color & width
  )
) |> 
  
  layout(
    margin = list(t = 50),  # margin top
    title = list(
      y = 0.95,             # y position of title
      text = "Top 5 Popular Platforms"
    ),
    xaxis = list(
      title = "Platforms",
      range = c(-0.8, 4.8)  # extend x axis (0 = the 1st category)
    ),
    yaxis = list(title = "Frequency")
  )

Figure 4: Top 5 Popular Platforms

Notably, after grouping by gender (shown in Figure 5), there were clear gender differences in platform preferences. Instagram and TikTok were more popular among females, with Instagram showing a larger gender gap. In contrast, Facebook and WhatsApp showed a more male-dominated trend. Twitter, on the other hand, showed a minimal gender variation. Overall, it appeared to be that females prefer visual platforms whereas males favor communication-oriented ones.

Code

## Platform ~ Gender

# obtain data
platform_gender_data <- top5_platforms_data |> 
  group_by(Most_Used_Platform, Gender) |> 
  summarize(Count = n(), .groups = 'drop') |> 
  arrange(desc(Count))

# barplot
plot_ly(
  platform_gender_data,
  x = ~factor(Most_Used_Platform, top5_platforms),  # sort
  y = ~Count,
  type = "bar",     # barplot
  color = ~Gender,  # map Gender to bar color 
  colors = c("Female" = "#FFB697", "Male" = "#8fc4de"),
  
  # aesthetics
  marker = list(
    line = list(color = "#222", width = 1.5)  # outline color & width
  )
) |> 
  
  layout(
    barmode = "group",                     # side-by-side bar groups
    legend = list(
      orientation = "h",                   # horizontal legend labels
      xanchor = "right", yanchor = "top",  # anchor points of the legend
      x = 1, y = 1                         # position of the legend
    ),
    margin = list(t = 50),                 # margin top
    title = list(
      y = 0.95,                            # y position of title
      text = "Top 5 Popular Platforms by Gender"
    ),
    xaxis = list(
      title = "Platforms",
      range = c(-0.6, 4.6)  # extend x axis (0 = the 1st category)
    ),
    yaxis = list(title = "Frequency")
  )

Figure 5: Top 5 Popular Platforms by Gender

Usage Hours

The histogram of average media usage hours (Figure 6) followed a symmetrical distribution, peaking between 4.5 and 5 hours, suggesting a consistent high-engagement digital lifestyle across the surveyed students.

Code

## Usage Hours

# histogram
plot_ly(
  df,
  x = ~Avg_Daily_Usage_Hours, 
  type = "histogram",
  
  # aesthetics
  marker = list(
    color = "#55D2C6",                        # filling color
    line = list(color = "#222", width = 1.5)  # outline
  )
) |> 
  
  layout(
    margin = list(t = 50),  # margin top
    title = list(
      y = 0.95,             # y position of title
      text = "Average Daily Usage Hours Distribution"
    ),
    xaxis = list(
      title = "Average Daily Usage Hours",
      range = c(0.8, 9.4),  # let x aixs be exhaustive
      dtick = 1             # distance between ticks
    ),
    yaxis = list(title = "Frequency")
  )

Figure 6: Daily Usage Hours Distribution

Boxplots (Figure 7) were used to illustrate the distribution of average daily usage hours across the top 5 countries. USA and India exhibit the highest levels of engagement, with median usage hours significantly exceeding those of other nations. In contrast, Canada, France, and Denmark show lower overall usage, with Denmark recording the most conservative median. In addition, a notable outlier is observed in the Indian group, representing a user with exceptionally low daily engagement compared to the national trend.

Code

## Usage Hours ~ Top 5 Countries

## Sorting

# find median usage hours of each country
countries_by_usage <- top5_countries_data |> 
  group_by(Country) |>
  
  # create new variable to store medians
  summarize(
    Median_Usage_Hour = median(Avg_Daily_Usage_Hours, na.rm = TRUE), 
    .groups = "drop"  # drop grouping information
  ) |> 
  
  # sort by median descending
  arrange(desc(Median_Usage_Hour)) |> 
  
  # pull out the vector of Country variable
  # (country names with median usage hours descending)
  pull(Country)


## Plotting

# boxplot
plot_ly(
  top5_countries_data,
  
  # make Country variable a factor by levels of 
  # countries with median usage hours descending
  x = ~factor(Country, countries_by_usage),
  y = ~Avg_Daily_Usage_Hours,
  type = "box",  # boxplot
  
  # aesthetics: boundary box
  fillcolor = "#55D2C6",
  line = list(color = "#222", width = 1.5),
  
  # aesthetics: outliers
  marker = list(
    color = "#55D2C6",                        # filling color
    line = list(color = "#222", width = 1.5)  # outline
  )
) |> 
  
  layout(
    margin = list(t = 50),  # margin top
    title = list(
      y = 0.95,             # y position of title
      text = "Average Daily Usage Hours Distribution by Top 5 Countries"
    ),
    xaxis = list(
      title = "Top 5 Countries",
      range = c(-0.6, 4.6)  # extend x axis (0 = the 1st category)
    ),
    yaxis = list(
      title = "Average Daily Usage Hours",
      range = c(1, 10),     # let y axis be exhaustive
      dtick = 1             # distance between ticks
    )
  )

Figure 7: Daily Usage Hours Boxplot by Top 5 Countries

Sleep Hours

Meanwhile, the distribution of sleep hours was also investigated. Figure 8 revealed a roughly symmetrical distribution centered around a peak of approximately 7 to 7.5 hours of sleep per night. While the majority of students fall within a healthier range of 6 to 8 hours, the distribution exhibits a broad spread, with some individuals reporting as little as 4 hours and others reaching nearly 10 hours. This variation suggests a diverse range of sleeping habits within the student population.

Code

## Sleep Hours

# histogram
plot_ly(
  df, 
  x = ~Sleep_Hours_Per_Night, 
  type = "histogram",
  
  # aesthetics
  marker = list(
    color = "#FFABB6",                        # filling color
    line = list(color = "#222", width = 1.5)  # outline
  )
) |> 
  
  layout(
    margin = list(t = 50),  # margin top
    title = list(
      y = 0.95,             # y position of title
      text = "Average Sleep Hours Distribution"
    ),
    xaxis = list(
      title = "Average Sleep Hours",
      range = c(2.8, 11),   # let x aixs be exhaustive
      dtick = 1             # distance between ticks
    ),
    yaxis = list(
      title = "Frequency",
      range = c(0, 60)      # let y axis be exhaustive
    )
  )

Figure 8: Average Sleep Hours Distribution

Furthermore, looking into the distribution of average sleep hours by top 5 countries, Figure 9 showed a significant inverse trend compared to the previously analyzed usage hours distribution (Figure 7). While the USA and India presented the highest social media engagement, they recorded the lowest median sleep duration, both falling below 6 hours per night. Conversely, Denmark and France, which showed the most conservative usage patterns, enjoyed the highest median sleep hours, with Denmark leading at approximately 8.2 hours. In addition, India displayed a high number of upper outliers, indicating a group of students who maintained high sleep levels despite regional trends. These geographical variation suggested a negative correlation between digital engagement and sleeping, providing a critical baseline to later analysis of the relationship between behavioral patterns.

Code

## Sleep Hours ~ Top 5 Countries

## Sorting

# find median sleep hours of each country
countries_by_sleep <- top5_countries_data |> 
  group_by(Country) |>
  
  # create new variable to store medians
  summarize(
    Median_Sleep_Hour = median(Sleep_Hours_Per_Night, na.rm = TRUE), 
    .groups = "drop"  # drop grouping information
  ) |> 
  
  # sort by median ascending
  arrange(Median_Sleep_Hour) |> 
  
  # pull out the vector of Country variable
  # (country names with median sleep hours ascending)
  pull(Country)


## Plotting

# boxplot
plot_ly(
  top5_countries_data,
  
  # make Country variable a factor by levels of 
  # countries with sleep hours ascending
  x = ~factor(Country, countries_by_sleep),
  y = ~Sleep_Hours_Per_Night,
  type = "box",  # boxplot
  
  # aesthetics: boundary box
  fillcolor = "#FFABB6",
  line = list(color = "#222", width = 1.5),
  
  # aesthetics: outliers
  marker = list(
    color = "#FFABB6",                        # filling color
    line = list(color = "#222", width = 1.5)  # outline
  )
) |> 
  
  layout(
    margin = list(t = 50),  # margin top
    title = list(
      y = 0.95,             # y position of title
      text = "Average Sleep Hours Distribution by Top 5 Countries"
    ),
    xaxis = list(
      title = "Top 5 Countries",
      range = c(-0.6, 4.6)  # extend x axis (0 = the 1st category)
    ),
    yaxis = list(
      title = "Average Sleep Hours",
      range = c(3, 10),     # let y axis be exhaustive
      dtick = 1             # distance between ticks
    )
  )

Figure 9: Average Sleep Hours Distribution by Top 5 Countries

Correlation Analysis

Sleep Hours and Usage Hours

The scatter plot (Figure 10) demonstrate a strong negative correlation between daily media usage and sleep duration. As usage rises from 2 to 8 hours, sleep generally drops by about 1 hour for every additional 1 hours of use. Most points cluster along this downward trend, though some heavy users (6+ hours) still report around 7 hours of sleep, while moderate users (4-6 hours) fall to 6-8 hours of sleep.

Code

## Sleep Hours ~ Usage Hours

# scatter plot
plot_ly(
  df,
  x = ~Avg_Daily_Usage_Hours
) |> 
  
  # add scatters
  add_markers(
   y = ~Sleep_Hours_Per_Night, 
   marker = list(color = "#3B9FE4", opacity = 0.6)
  ) |> 
  
  layout(
    margin = list(t = 50),  # margin top
    title = list(
      y = 0.95,             # y position of title
      text = "Relationship between Usage Hours and Sleep Hours"
    ),
    xaxis = list(
      title = "Average Daily Usage Hours",
      range = c(0.8, 9.2),  # let x axis be exhaustive
      dtick = 1             # distance between ticks
    ),
    yaxis = list(
      title = "Average Sleep Hours",
      range = c(3, 10)      # let y axis be exhaustive 
    )
  )

Figure 10: Relationship between Usage Hours and Sleep Hours

To validate the trend, a Pearson’s correlation test was conducted. As shown in the results (see Output 11), the correlation coefficient was around -0.79 (p < .001). It therefore indicated that there was a high potential of students’ average daily social media usage hours being negatively correlated with their sleep hours.

Code

## Sleep Hours ~ Usage Hours

# correlation test
cor.test(x = df$Avg_Daily_Usage_Hours, y = df$Sleep_Hours_Per_Night)


    Pearson's product-moment correlation

data:  df$Avg_Daily_Usage_Hours and df$Sleep_Hours_Per_Night
t = -34.231, df = 703, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8167435 -0.7611776
sample estimates:
       cor 
-0.7905825

Output 11. Pearson’s Correlation Test Result

Figure 11 showed a linear fit of the two variables, which further clarified the negative correlation.

Code

## Sleep Hours ~ Usage Hours

# linear model
fit <- lm(Sleep_Hours_Per_Night ~ Avg_Daily_Usage_Hours, data = df)

# scatter plot
plot_ly(
  df,
  x = ~Avg_Daily_Usage_Hours
) |> 
  
  # add scatters
  add_markers(
   y = ~Sleep_Hours_Per_Night, 
   name = "Data Points", 
   marker = list(color = "#3B9FE4", opacity = 0.6)
  ) |> 
  
  # add linear fit
  add_lines(
    y = ~fitted(fit), 
    name = "Linear Fit", 
    line = list(color = '#D53132', width = 1.5)
  ) |> 
  
  layout(
    legend = list(
      orientation = "h",                  # arrangement of legend labels
      xanchor = "right", yanchor = "top", # anchor point of the legend
      x = 1, y = 1                        # position of the legend
    ),
    margin = list(t = 50),                # margin top
    title = list(
      y = 0.95,                           # y position of title
      text = "Linear Relationship between Usage Hours and Sleep Hours"
    ),
    xaxis = list(
      title = "Average Daily Usage Hours",
      range = c(0.8, 9.2),                # let x axis be exhaustive
      dtick = 1                           # distance between ticks
    ),
    yaxis = list(
      title = "Average Sleep Hours",
      range = c(3, 10)                    # let y axis be exhaustive 
    )
  )

Figure 11: Linear Relationship between Usage Hours and Sleep Hours

Other Indices

In addition, it was also considered worthwhile investigating the relationship between other indices such as mental health score and addicted score (names of indices were shown in Output 12). Therefore, a faceted 5×5 paired scatter plot matrix was used to display all of the relationship trends (see Figure 12).

Code

## Relationship between Indices

# remove Age from names_numvar
(names_numvar <- setdiff(names_numvar, "Age"))

[1] "Avg_Daily_Usage_Hours"       "Sleep_Hours_Per_Night"      
[3] "Mental_Health_Score"         "Conflicts_Over_Social_Media"
[5] "Addicted_Score"

Output 12. Names of Indices to Compare

Code

## Relationship between Indices

# extract numerical variable data
num_data <- df[, names_numvar]

# paired scatter plots
plot_ly(
  num_data,
  type = "splom",  # scatter plots matrix
  
  # indices to compare
  dimensions = list(
    list(label = "Usage", values = ~Avg_Daily_Usage_Hours),
    list(label = "Sleep", values = ~Sleep_Hours_Per_Night),
    list(label = "Mental Health", values = ~Mental_Health_Score),
    list(label = "Conflicts", values = ~Conflicts_Over_Social_Media),
    list(label = "Addicted", values = ~Addicted_Score)
  ),
  
  # aesthetics
  marker = list(
    color = "#3B9FE4",
    opacity = 0.5,
    size = 4
  )
) |> 
  
  layout(
    margin = list(t = 60),  # margin top
    title = list(
      y = 0.95,             # y position of title
      text = "Relationship between Indices"
    )
  )

Figure 12: Relationship between Indices

The scatter plot matrix (Figure 12) illustrated a complex web of behavioral and psychological correlations, primarily centered on the detrimental effects of excessive social media usage. Meanwhile, Output 13 showed a correlation test matrix for all five indices, providing valid insights into behavioral trends.

As shown by the results, a dominant negative correlation existed between usage hours and sleep, where increased screen time consistently aligns with diminished rest. This trend was further demonstrated by a positive correlation between usage and addiction levels, which in turn led to higher frequencies of life conflicts. Conversely, better mental health outcomes were associated with lower addiction scores, reduced usage, and longer sleep duration. Ultimately, the data suggested a negative feedback loop where high device dependency compromised sleep and social harmony, collectively contributing to poorer mental well-being.

Code

## Relationship between Indices

# view correlation matrix (pearson correlation coefficient)
(cor_matrix <- round(cor(num_data), 2))

                            Avg_Daily_Usage_Hours Sleep_Hours_Per_Night
Avg_Daily_Usage_Hours                        1.00                 -0.79
Sleep_Hours_Per_Night                       -0.79                  1.00
Mental_Health_Score                         -0.80                  0.71
Conflicts_Over_Social_Media                  0.80                 -0.68
Addicted_Score                               0.83                 -0.76
                            Mental_Health_Score Conflicts_Over_Social_Media
Avg_Daily_Usage_Hours                     -0.80                        0.80
Sleep_Hours_Per_Night                      0.71                       -0.68
Mental_Health_Score                        1.00                       -0.89
Conflicts_Over_Social_Media               -0.89                        1.00
Addicted_Score                            -0.95                        0.93
                            Addicted_Score
Avg_Daily_Usage_Hours                 0.83
Sleep_Hours_Per_Night                -0.76
Mental_Health_Score                  -0.95
Conflicts_Over_Social_Media           0.93
Addicted_Score                        1.00

Output 13. Correlation Matrix of Indices

Notably, the correlation between addicted score and mental health (r = -.95), and that between addicted score and conflicts number over social media (r = .93) appeared to be the most significant with both | r | > .9 (see Output 14), indicating that as addictive behaviors intensify, psychological well-being collapses and interpersonal friction rises almost in lockstep.

In conclusion, addiction scores served as the most potent predictor of negative life outcomes in this dataset.

Code

## Relationship between Indices

# find location of pairs where |r| > 0.9
loc <- which(
  abs(cor_matrix) > 0.9 & cor_matrix != 1, 
  arr.ind = TRUE  # return matrix of row and column indices
)

# print results
for(i in 1:nrow(loc)) {
  row <- loc[i, 1]
  col <- loc[i, 2]
  cat(
    rownames(cor_matrix)[row], "~", colnames(cor_matrix)[col], "\n",
    "r =", cor_matrix[row, col], "\n")
}

Addicted_Score ~ Mental_Health_Score 
 r = -0.95 
Addicted_Score ~ Conflicts_Over_Social_Media 
 r = 0.93 
Mental_Health_Score ~ Addicted_Score 
 r = -0.95 
Conflicts_Over_Social_Media ~ Addicted_Score 
 r = 0.93

Output 14. Pairs of Indices with the Strongest Correlation (|r|>.9)

Application of CLT

This section repeatedly drew random samples (1000 times) with different sample sizes (n = 10, 30, 50, 100) from the population dataset ( df ), and then observed patterns of sample mean distributions, aiming to seek for applicability of the Central Limit Theorem (CLT).

Central Limit Theorem suggests that as sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the population’s original distribution.

In the meantime, the standard deviations of sample means (observed standard errors) are related to the population standard deviation σ the and sample size n, where the standard error is defined by a formula \(SE = \frac{\sigma}{\sqrt n}\) . This formula was used in later analyses to calculate the theoretical standard error.

To investigate the applicability of CLT, a function was firstly defined to draw random samples with specified sample size, sampling time, variable to sample, and other relative parameters.

Code

## CLT

## Define function to plot

# store different sample sizes
sizes <- c(10, 30, 50, 100)

# define a function to randomly sample from df with
# specified sample size and plot sample means histogram with 
# specified parameters

make_hist <- function(
    n, col, time = 1000, ymax = 1, 
    binwidth = 0.2, bincolor = "#FFA472", xdtick = 1
  ) {
  
  ## n: sample size
  ## col: column to select from df
  ## time: times to replicate sampling process
  ## ymax: maximum value of y axis in histogram
  ## binwidth: width of bins in histogram
  ## bincolor: color of bins
  ## xdtick: distance between ticks in x asix
  
  # randomly sample data from df and calculate sample means
  sample_means <- replicate(
    time,   # repetition times, default is 1000
    mean(
      sample(df[[col]], size = n, replace = TRUE), 
      na.rm = TRUE
    )
  )

  # find population mean & standard deviation
  pop_mean <- mean(df[[col]], na.rm = TRUE)
  pop_sd <- sd(df[[col]], na.rm = TRUE)
  
  # find fixed x range
  xmin <- pop_mean - (2 * pop_sd)
  xmax <- pop_mean + (2 * pop_sd)
  
  # plot sample means probability distribution histogram
  plot_ly(
    x = sample_means,
    type = "histogram", 
    histnorm = "probability",  # probability distribution
    name = paste("n =", n),
    xbins = list(
      start = xmin,
      end = xmax,
      size = binwidth          # specify binwidth, default is 0.4
    ),
    
    # aesthetics
    marker = list(
      color = bincolor,        # specify bincolor, default is "#FFA472"
      line = list(color = "#222", width = 1.5)
    )
  ) |> 
    
    layout(
      margin = list(t = 60),   # margin top
      
      # add annotation of sample size
      annotations = list(
        list(
          text = paste("Sample Size:", n),
          x = 0.5,             # x position of annotation
          y = -0.25,           # y position of annotation
          xanchor = "center",  # anchor of annotation for x
          yanchor = "bottom",  # anchor of annotation for y
          xref = "paper",      # position relative to current subplot
          yref = "paper",
          showarrow = FALSE    # remove arrow of annotation
        )
      ),
      
      # axis limits
      xaxis = list(
        range = c(xmin, xmax),          # x axis range
        dtick = xdtick                  # specify dtick, default is 1
      ),
      yaxis = list(range = c(0, ymax))  # specify ymax, default is 1
    )
}

Usage Hours

In terms of the average media usage hours variable, a 2×2 histogram matrix (Figure 13) of sample mean distributions with different sample sizes (n = 10, 30, 50, 100) was plotted with axis scales being the same to see the trend of distribution shape. Meanwhile, Output 15 showed the population and sample statistics regarding usage hours.

As results shown, with sample size increased, the distribution of sample means became more centralized, with smaller standard deviations. Meanwhile, all four means of sample means were nearly closed to the population mean, successfully predicting it; and the observed standard error (standard deviation of sample means) were approximate to the theoretical standard error derived using the population standard deviation.

Code

## CLT - Usage Hours

## Draw random samples and plot

# set random seed for reproducibility
set.seed(32)

# get plot information
# (iterate sizes and pass values as n, apply function)
plots <- lapply(
  sizes, function(n) make_hist(
    n, "Avg_Daily_Usage_Hours", bincolor = "#55D2C6", ymax = 0.6
  )
)

# show subplots in 2*2 matrix
subplot(
  plots,
  nrows = 2,     # number of rows in subplot matrix
  margin = 0.08  # margin between subplots
) |> 

  layout(
    showlegend = FALSE,  # remove legend
    margin = list(
      t = 80,            # margin top
      b = 60             # margin bottom
    ),
    title = list(
      y = 0.95,
      text = "Usage Hours Sample Means Probability Distributions"
    )
  )

Figure 13: Usage Hours Sample Means Probability Distribution

Code

## CLT - Usage Hours

## Store sample means

# create list to store sample means of usage in different sizes
sample_means_usage <- lapply(
  sizes,         # iterate sample sizes
  function(n) {  # n: sample sizes passed from sizes
    replicate(
      1000,      # repetition times
      mean(
        sample(df[["Avg_Daily_Usage_Hours"]], size = n, replace = TRUE), 
        na.rm = TRUE)
      )
  }
)


## Store means & SDs of sample means

# initialize empty vector
mean_sample_means_usage <- numeric(length(sizes))

# find means of usage sample means
for (i in seq_along(sizes)) {
  mean_sample_means_usage[i] <- round(
    mean(sample_means_usage[[i]], na.rm = TRUE),
    digits = 2  # round to 2 digits
  )
}

# initialize empty vector
sd_sample_means_usage <- numeric(length(sizes))

# find sds of usage sample means
for (i in seq_along(sizes)) {
  sd_sample_means_usage[i] <- round(
    sd(sample_means_usage[[i]], na.rm = TRUE),
    digits = 2  # round to 2 digits
  )
}


## Population mean & SD

pop_mean_usage <- mean(df$Avg_Daily_Usage_Hours, na.rm = TRUE)
pop_sd_usage <- sd(df$Avg_Daily_Usage_Hours, na.rm = TRUE)


## Print results

clt_results_usage <- data.frame(
  Sample_Size = sizes,
  Predicted_Mean = mean_sample_means_usage,
  Observed_SE = sd_sample_means_usage,
  Theoretical_SE = round(pop_sd_usage / sqrt(sizes), 2)
)

print(clt_results_usage)

cat("\nPopulation Mean:", round(pop_mean_usage, 2), 
    "\nPopulation SD:", round(pop_sd_usage, 2), "\n"
)

  Sample_Size Predicted_Mean Observed_SE Theoretical_SE
1          10           4.92        0.40           0.40
2          30           4.92        0.23           0.23
3          50           4.93        0.18           0.18
4         100           4.92        0.12           0.13

Population Mean: 4.92 
Population SD: 1.26

Output 15. Comparison of Statistics Regarding Media Usage Hours

In addition, Figure 14 displayed one of the sample means distributions (n = 100, repeated 1000 times) with an ideal normal curve (n = 100) over it, which was calculated using the function dnorm() . The results demonstrated a highly coincident overlap between the sample means histogram and normal curve, in line with the CLT.

Code

## CLT - Usage Hours

## Ideal normal distribution data

x_values <- seq(
  min(tail(sample_means_usage, 1)[[1]]),  # the last set of means, n = 100
  max(tail(sample_means_usage, 1)[[1]]), 
  length.out = 100                        # sequence length = 100
)

y_values <- dnorm(
  x_values, 
  mean = mean(tail(sample_means_usage, 1)[[1]]), 
  sd = sd(tail(sample_means_usage, 1)[[1]])
)


## Sample Mean Distribution v.s. Normal Curve

plot_ly() |>
  
  # histogram
  add_histogram(
    x = tail(sample_means_usage, 1)[[1]], 
    type = "histogram", 
    histnorm = "probability",
    xbins = list(size = 0.05),  # binwidth
    name = "Sample Means",      # histogram label
    
    # aesthetics
    marker = list(
      color = "#55D2C6",                         # filling color
      line = list(color = "#222", width = 1.5),  # outline
      opacity = 0.6
    )
  ) |> 
  
  # normal curve
  add_lines(
    x = x_values, 
    y = y_values * 0.05, 
    name = "Normal Curve",    # line label
    line = list(color = '#D53132', width = 2)
  ) |> 
  
  layout(
    legend = list(
      orientation = "h",                   # arrangement of legend labels
      xanchor = "left", yanchor = "top",   # anchor point of the legend
      x = 0, y = 1.1                       # position of the legend
    ),
    margin = list(t = 90),    # margin top
    title = list(
      y = 0.95,               # y position of title
      text = "Usage Hours Sample Means Probability Distribution"
    ),
    xaxis = list(
      title = "Sample Means of Daily Usage Hours (n = 100)",
      range = c(4.38, 5.4)
    ),
    yaxis = list(
      title = "Probability",
      range = c(0, 0.18)
    )
  )

Figure 14: Usage Hours Sample Means Probability Distribution (with Normal Curve)

Sleep Hours

Likewise, similar approaches were applied to another numerical variable sleep hours. The result patterns shown in Figure 15, Output 15, and Output 16 were closely similar to those of media usage hours, again demonstrating the applicability of CLT.

Code

## CLT - Sleep Hours

# set random seed for reproducibility
set.seed(32)

# get plot information
# (iterate sizes and pass values as n, apply function)
plots <- lapply(
  sizes, function(n) make_hist(
    n, "Sleep_Hours_Per_Night", bincolor = "#FFABB6", ymax = 0.6
  )
)

# show subplots in 2*2 matrix
subplot(
  plots,
  nrows = 2,
  margin = 0.08
) |> 

  layout(
    showlegend = FALSE,  # remove legend
    margin = list(
      t = 80,            # margin top
      b = 60             # margin bottom
    ),
    title = list(
      y = 0.95,
      text = "Sleep Hours Sample Means Probability Distributions"
    )
  )

Figure 15: Sleep Hours Sample Means Probability Distribution

Code

## CLT - Sleep Hours

## Store sample means

# create list to store sample means of usage in different sizes
sample_means_sleep <- lapply(
  sizes,         # iterate sample sizes
  function(n) {  # n: sample sizes passed from sizes
    replicate(
      1000,      # repetition times
      mean(
        sample(df[["Sleep_Hours_Per_Night"]], size = n, replace = TRUE), 
        na.rm = TRUE)
      )
  }
)


## Store means & sds of sample means

# initialize empty vector
mean_sample_means_sleep <- numeric(length(sizes))

# find means of usage sample means
for (i in seq_along(sizes)) {
  mean_sample_means_sleep[i] <- round(
    mean(sample_means_sleep[[i]], na.rm = TRUE),
    digits = 2  # round to 2 digits
  )
}

# initialize empty vector
sd_sample_means_sleep <- numeric(length(sizes))

# find sds of usage sample means
for (i in seq_along(sizes)) {
  sd_sample_means_sleep[i] <- round(
    sd(sample_means_sleep[[i]], na.rm = TRUE),
    digits = 2  # round to 2 digits
  )
}


## Population mean & sd

pop_mean_sleep <- mean(df$Sleep_Hours_Per_Night, na.rm = TRUE)
pop_sd_sleep <- sd(df$Sleep_Hours_Per_Night, na.rm = TRUE)


## Print results

clt_results_sleep <- data.frame(
  Sample_Size = sizes,
  Predicted_Mean = mean_sample_means_sleep,
  Observed_SE = sd_sample_means_sleep,
  Theoretical_SE = round(pop_sd_sleep / sqrt(sizes), 2)
)

print(clt_results_sleep)

cat("\nPopulation Mean:", round(pop_mean_sleep, 2), 
    "\nPopulation SD:", round(pop_sd_sleep, 2), "\n"
)

  Sample_Size Predicted_Mean Observed_SE Theoretical_SE
1          10           6.87        0.36           0.36
2          30           6.87        0.21           0.21
3          50           6.86        0.16           0.16
4         100           6.87        0.11           0.11

Population Mean: 6.87 
Population SD: 1.13

Output 16. Comparison of Statistics Regarding Sleep Hours

Code

## CLT - Sleep Hours

## Ideal normal distribution data

x_values <- seq(
  min(tail(sample_means_sleep, 1)[[1]]), 
  max(tail(sample_means_sleep, 1)[[1]]), 
  length.out = 100  # select 100 numbers
)

y_values <- dnorm(
  x_values, 
  mean = mean(tail(sample_means_sleep, 1)[[1]]), 
  sd = sd(tail(sample_means_sleep, 1)[[1]])
)


## Sample Mean Distribution v.s. Normal Curve

plot_ly() |>
  
  # histogram
  add_histogram(
    x = tail(sample_means_sleep, 1)[[1]], 
    type = "histogram", 
    histnorm = "probability",
    xbins = list(size = 0.05),  # binwidth
    name = "Sample Means",      # histogram label
    
    # aesthetics
    marker = list(
      color = "#FFABB6",                         # filling color
      line = list(color = "#222", width = 1.5),  # outline
      opacity = 0.6
    )
  ) |> 
  
  # normal curve
  add_lines(
    x = x_values, 
    y = y_values * 0.05, 
    name = "Normal Curve",    # line label
    line = list(color = '#D53132', width = 2)
  ) |> 
  
  layout(
    legend = list(
      orientation = "h",                   # arrangement of legend labels
      xanchor = "left", yanchor = "top",   # anchor point of the legend
      x = 0, y = 1.1                       # position of the legend
    ),
    margin = list(t = 90),    # margin top
    title = list(
      y = 0.95,               # y position of title
      text = "Sleep Hours Sample Means Probability Distribution"
    ),
    xaxis = list(title = "Sample Means of Sleep Hours (n = 100)"),
    yaxis = list(title = "Probability")
  )

Figure 16: Sleep Hours Sample Means Probability Distribution (with Normal Curve)

Sampling Methods

In this section, the top 5 platform data was used as population dataset, and three samples were drew using different sampling methods, including simple random with replacement, systematic with unequal probabilities, and stratified sampling without replacement. These three sampling methods were thus compared and contrasted to see the differences between their ability to predict population patterns.

Output 17, 18, and 19 showed the results of different samples, regarding the proportion of different platforms, and the mean and standard deviation of media usage hours.

Simple Random Sampling

The simple random sampling (with replacement) was conducted using the function srswr() , and then samples were store in sample1 .

Code

## Simple Random Sampling with Replacement

## Sampling

# sample size & random seed
n <- 100
set.seed(32)

# simple random sampling
s <- srswr(n, nrow(top5_platforms_data))

# find rows where observations are sampled
rows <- (1:nrow(top5_platforms_data))[s!=0]
rows <- rep(rows, s[s!=0])

# extract sampled data
sample1 <- top5_platforms_data[rows, ]


## Calculating

# define function to calculate percentages
calc_perc <- function(sample_data, n = 100) {
  
  ## sample_data: sample data drew with different methods
  ## n: sample size
  
  # frequency
  f <- table(
    factor(
      sample_data$Most_Used_Platform, 
      levels = top5_platforms  # to ensure the same platform sequence
    )
  )
  
  # percentage
  return(f / n)  # n: sample size
}

# percentages of platforms
perc1 <- calc_perc(sample1)

# sample mean & SD of usage hour
mean1 <- mean(sample1$Avg_Daily_Usage_Hours, na.rm = TRUE)
sd1 <- sd(sample1$Avg_Daily_Usage_Hours, na.rm = TRUE)

# print results
cat("Percentages of Platforms(%):\n"); perc1 * 100; cat("\n")
cat("Sample Mean of Usage Hour:", mean1, "\n")
cat("Sample SD of Usage Hour:", round(sd1, 3), "\n")

Percentages of Platforms(%):

Instagram    TikTok  Facebook  WhatsApp   Twitter 
       44        27        13        11         5 

Sample Mean of Usage Hour: 5.035 
Sample SD of Usage Hour: 1.262

Output 17. Results of Simple Random Sampling with Replacement

Systematic Sampling

For systematic sampling, the inclusion probabilities by usage hours was firstly calculated using the function inclusionprobabilities() and stored in pik , then the samples were drew using UPsystematic(pik) and stored in sample2 .

Code

## Systematic Sampling with Unequal Probabilities

## Sampling

# random seed
set.seed(32)

# calculate inclusion probabilities by usage hours
pik <- inclusionprobabilities(top5_platforms_data$Avg_Daily_Usage_Hours, n)

# systematic sampling with unequal probabilities
s <- UPsystematic(pik)

# extract sampled data
sample2 <- top5_platforms_data[s!= 0, ]


## Calculating

# percentages of platforms
perc2 <- calc_perc(sample2)

# sample mean & SD of usage hour
mean2 <- mean(sample2$Avg_Daily_Usage_Hours, na.rm = TRUE)
sd2 <- sd(sample2$Avg_Daily_Usage_Hours, na.rm = TRUE)

# print results
cat("Percentages of Platforms(%):\n"); perc2 * 100; cat("\n")
cat("Sample Mean of Usage Hour:", mean2, "\n")
cat("Sample SD of Usage Hour:", round(sd2, 3), "\n")

Percentages of Platforms(%):

Instagram    TikTok  Facebook  WhatsApp   Twitter 
       41        28        18         9         4 

Sample Mean of Usage Hour: 5.332 
Sample SD of Usage Hour: 1.267

Output 18. Results of Systematic Sampling with Unequal Probabilities

Stratified Sampling

For stratified sampling, the population data were firstly ordered by platform using order() function, and the samples were drew using strata() function and stored in sample3 , with stratum sizes calculated by each platform’s proportion.

Code

## Stratified Sampling without Replacement

## Ordering Data

# order data by platform
order_index <- order(top5_platforms_data$Most_Used_Platform)
data <- top5_platforms_data[order_index, ]  # use ordered data

# calculate proportion
prop_table <- table(data$Most_Used_Platform) / nrow(data)

# find strata sizes
sizes <- round(prop_table * n)

# check if sum of sizes = n
# sum(sizes)  # [1] 100


## Stratified Sampling

# random seed
set.seed(32)

# find stratums
st <- sampling::strata(
  data,
  stratanames = "Most_Used_Platform",
  size = sizes,
  method = "srswor"  # simple random sampling without replacement
)

# extract sampled data
sample3 <- sampling::getdata(data, st)


## Calculating

# percentages of platforms
perc3 <- calc_perc(sample3)

# sample mean & SD of usage hour
mean3 <- mean(sample3$Avg_Daily_Usage_Hours, na.rm = TRUE)
sd3 <- sd(sample3$Avg_Daily_Usage_Hours, na.rm = TRUE)

# print results
cat("Percentages of Platforms(%):\n"); perc3 * 100; cat("\n")
cat("Sample Mean of Usage Hour:", mean3, "\n")
cat("Sample SD of Usage Hour:", round(sd3, 3), "\n")

Percentages of Platforms(%):

Instagram    TikTok  Facebook  WhatsApp   Twitter 
       41        25        20         9         5 

Sample Mean of Usage Hour: 4.948 
Sample SD of Usage Hour: 1.134

Output 19. Stratified Sampling without Replacement

Comparison of Sampling Methods

Percentages of Platforms

Output 20 presented the results of proportions of platforms in the population (top 5 platform data) and three samples. To facilitate the investigation, the means of difference between each sample value with its associated population value were calculated and shown in Output 21. Furthermore, Figure 17 visualized the results in Output 21.

Code

## Comparison of Sampling Methods

## Percentages of Platforms

# store sample percentages
sample_percs <- list(srswr = perc1, ups = perc2, stwor = perc3)

# calculate population percentages
pop_perc <- table(
  factor(
    top5_platforms_data$Most_Used_Platform,
    levels = top5_platforms
  )
) / nrow(top5_platforms_data)


# show results of percentages
cat("Population Percentages:\n")
round(pop_perc, 2); cat("\n\n")
cat("Sample Percentages:\n\n")
sample_percs

Population Percentages:

Instagram    TikTok  Facebook  WhatsApp   Twitter 
     0.41      0.25      0.20      0.09      0.05 


Sample Percentages:

$srswr

Instagram    TikTok  Facebook  WhatsApp   Twitter 
     0.44      0.27      0.13      0.11      0.05 

$ups

Instagram    TikTok  Facebook  WhatsApp   Twitter 
     0.41      0.28      0.18      0.09      0.04 

$stwor

Instagram    TikTok  Facebook  WhatsApp   Twitter 
     0.41      0.25      0.20      0.09      0.05

Output 20. Population and Sample Percentages of Different Platforms

As shown in Output 21 and Figure 17, the values of stratified sampling without replacement showed the smallest difference from population values (0.002), indicating its strongest predictive power for platform proportion. In contrast, simple random sampling without replacement showed the lowest predictive power for it (mean difference was 0.029). Stratified sampling without replacement performed best may because it preserved the population’s underlying platform distribution within each stratum of platform, reducing sampling variability, whereas simple random sampling was more random and unpredictable across platform proportions.

Code

## Comparison of Sampling Methods

## Percentages of Platforms

# find mean differences between
# sample percentages and population percentages
mean_diffs <- sample_percs |> 
  
  # calculate differences
  lapply(\(percs) abs(percs - pop_perc)) |>
    # lapply: list apply, return a list
    # where \() is anonymous function, same as function(percs)
    # it passes values from sample_percs as percs and apply function
    # to them, then returns a list
  
  # calculate means of differences
  sapply(mean, na.rm = TRUE)
    # sapply: simple apply, flexible at data types of returns.
    # it returns a named vector here

# show results
cat("Differences from Population Percentages:\n")
sort(round(mean_diffs, 3)); cat("\n")

Differences from Population Percentages:
stwor   ups srswr 
0.002 0.012 0.029

Output 21. Differences from Population Percentage by Sampling Method

Code

## Comparison of Sampling Methods

## Percentages of Platforms

# barplot of differences from population percentages
plot_ly(
  x = toupper(names(mean_diffs)),  # capitalized method names
  y = mean_diffs,
  type = "bar",                    # barplot
  
  # aesthetics
  marker = list(
    color = "#c3a4cf",                        # filling color
    line = list(color = "#222", width = 1.5)  # outline color & width
  )
) |> 
  
  layout(
    margin = list(t = 50),  # margin top
    title = list(
      y = 0.95,             # y position of title
      text = "Differences from Population Percentages by Sampling Method"
    ),
    xaxis = list(
      title = "Sampling Methods",
      type = "category",                        # categorical axis
      categoryorder = "array",                  # let order be specified
      categoryarray = names(sort(mean_diffs)),  # specify category order
      range = c(2.2, 5.8)
    ),
    yaxis = list(title = "Difference")
  )

Figure 17: Differences from Population Percentages by Sampling Method

Statistics of Usage Hours

In addition to predicting the proportions of categories, the ability to predict mean and standard deviation across the three sampling methods were also tested. Output 22 showed the results of means and standard deviations, and a clearer visualization was shown in Figure 18.

Code

## Comparison of Sampling Methods

## Means & SDs of Usage Hours

# store sample means & sds
sample_means <- c(srswr = mean1, ups = mean2, stwor = mean3)
sample_sds <- c(srswr = sd1, ups = sd2, stwor = sd3)

# show results of means
cat("Population Mean:", round(pop_mean_usage, 3), "\n")
cat("Sample Means:\n"); sample_means; cat("\n")

# show results of SDs
cat("Population SD:", round(pop_sd_usage, 3), "\n")
cat("Sample SDs:\n"); round(sample_sds, 3)

Population Mean: 4.919 
Sample Means:
srswr   ups stwor 
5.035 5.332 4.948 

Population SD: 1.257 
Sample SDs:
srswr   ups stwor 
1.262 1.267 1.134

Output 22. Population and Sample Statistics of Media Usage Hours

In Figure 18, the dots at the centers of sticks represented the sample means, and the black whiskers represented the range of one sample standard deviation away from sample means, where the population mean and standard deviation were denoted using dashed lines with instruction text labels.

Combining Output 22 and Figure 18, it could be seen that for mean of numerical variable, stratified sampling without replacement showed the nearest result and the highest predictive power, whereas systematic sampling with unequal probabilities had the lowest predictive power.

However, in terms of standard deviation, simple random sampling with replacement had the most accurate prediction whereas stratified sampling without replacement had the least accurate prediction.

These discrepancies in predictive power likely stemmed from the inherent design of each sampling method. Stratified sampling without replacement excelled at estimating means by ensuring all subgroups were proportionally represented and reduced variance by eliminating duplicate data points, whereas systematic sampling with unequal probabilities may introduce bias if the selection interval aligns with hidden patterns in the data.

Conversely, simple random sampling with replacement provided a more accurate estimation of standard deviation possibly because the replacement mechanism allowed the sample to better reflect the true variability and noise of the underlying population, while the structured nature of stratification tended to compress internal variance, leading to an underestimation of the overall spread.

Code

## Comparison of Sampling Methods

## Means & SDs of Usage Hours

# store methods, sample means & SDs into dataframe
plot_data <- data.frame(
  Method = toupper(names(sample_means)),  # capitalized method names
  Mean   = sample_means,
  SD     = sample_sds
)

# stick plot
plot_ly(
  plot_data, 
  x = ~Method,
  y = ~Mean
) |> 
  
  # add sticks
  add_markers(
    name = "Sample Mean",  # stick label
    
    # error stick (sample SDs)
    error_y = list(
      type = "data",  # take the data specified in array as stick length
      array = ~SD,    # data to use for stick length
      color = '#222'  # stick color
    ),
    
    # aesthetics of marker (sample means)
    marker = list(size = 12, color = '#55D2C6')
  ) |> 

  layout(
    margin = list(t = 60, b = 60),  # margin top & bottom
    title = list(
      y = 0.95,                     # y position of title
      text = "Comparison of Means & SDs across Sampling Methods"
    ),
    yaxis = list(
      title = "Usage Hours",
      range = c(              # y limits
        pop_mean_usage - 1.5*pop_sd_usage,
        pop_mean_usage + 1.5*pop_sd_usage
      )
    ),
    xaxis = list(
      title = "Sampling Method",
      range = c(-0.5, 3),                  # x limits
      type = "category",                   # categorical axis
      categoryorder = "array",             # let order be specified
      categoryarray = plot_data$Method     # specify category order
    ),
    
    
    # add lines of population statistics
    shapes = list(
      
      # population mean
      list(
        type = "line",        # add line
        xref = "x",           # x values are relative to x-axis (0,1,2)
        x0 = -0.4, x1 = 2.4,  # xrange of line
        y0 = pop_mean_usage,  # y of line
        y1 = pop_mean_usage,  
        line = list(
          color = "#D53132", 
          dash = "dash",      # line type
          width = 2
        )
      ),
      
      # pop_mean + pop_sd
      list(
        type = "line", 
        xref = "x",
        x0 = -0.4, x1 = 2.4, 
        y0 = pop_mean_usage + pop_sd_usage, 
        y1 = pop_mean_usage + pop_sd_usage, 
        line = list(color = "#FFA472", dash = "dot", width = 2)
      ),
      
      # pop_mean - pop_sd
      list(
        type = "line", 
        xref = "x",
        x0 = -0.4, x1 = 2.4, 
        y0 = pop_mean_usage - pop_sd_usage, 
        y1 = pop_mean_usage - pop_sd_usage,
        line = list(color = "#FFA472", dash = "dot", width = 2)
      )
    ),
    
    # annotations for lines
    annotations = list(
      
      # population mean
      list(
        text = "Population Mean\n(4.919)",
        x = 2.5,
        y = pop_mean_usage,
        xanchor = "left",
        showarrow = FALSE
      ),
      
      # pop_mean + pop_sd
      list(
        text = "+1 Population SD\n(+1.257)",
        x = 2.5,
        y = pop_mean_usage + pop_sd_usage,
        xanchor = "left",
        showarrow = FALSE
      ),
      
      list(
        text = "-1 Population SD\n(-1.257)",
        x = 2.5,
        y = pop_mean_usage - pop_sd_usage,
        xanchor = "left",
        showarrow = FALSE
      )
    )
  )

Figure 18: Comparison of Means & SDs across Sampling Methods