Analysis document: Non-response bias in gambling surveys

Author

Anonymised for review

Published

February 26, 2025

Preamble

This document outlines the analysis code and associated outcomes for our study on response bias in gambling research studies. The analyses reported are for two studies, both of which are part of larger projects. The analyses included here cover all of those reported in our manuscript (plus some additional analyses not reported in the manuscript); the datasets analysed were pre-processed in separate scripts, where aggregate variables were produced to summarise gambling behaviour.

Study 1

This study is part of the overall project with the title “Understanding electronic gaming machine players”. The OSF page for this study can be found here (it is private as of the latest date this file was published).

Study 2

This study is part of the overall project with the title “Safer gambling online: Identifying and supporting at-risk consumers”. The OSF page for this study can be found here and for the overall project here (both are private as of the latest date this file was committed).

Load required packages

Set up analysis environment

Install and load the groundhog package to ensure consistency of the package versions used here:

Show set-up code
# install.packages("groundhog") # Install
library('groundhog')

# List desired packages:
packages <- c('tidyverse', # Clean, organise, and visualise data
              "readxl", #  load Excel files
              "haven", # Read stata files
              "survey", # weight study 2 data
              "weights", # Weighted statistics for study two
              "wCorr", # Weighted correlations (https://cran.r-project.org/web/packages/wCorr/wCorr.pdf)
              "ufs", # Cramver's V Effect size
              'kableExtra', # Make tables
              'formattable', #  Add visualisations to tables
              'gt', # Alternative table options
              'gtExtras', # Add colours to gt tables
              'gtsummary', # Create summary tables
              'SSVS', # Bayesian variable selection
              'effsize', # Cohen's d values
              'MBESS', # 95% CIs for above
              'car', # Compute VIFs for regression
              'performance', # Assess model performance
              'see', # Supports above package
              'fmsb', # Compute NagelkerkeR2
              'plotly', # Add interactive elements to figures
              'htmlwidgets', # Make plotly plots HTML format
              'ggstatsplot', # Produce plots with combined statistical output
              'rstantools', # required for above package
              'ggstream', # smooooth area charts
              'ggrain', # Raincloud plots
              'networkD3', # Sankey plots
              'viridis', # plot colours
              'scales', # Wrap labels in plots
              'ggtext', # also helps with plot labels
              "patchwork", # Assemble/combine plots
              'sysfonts', # Special fonts for figures
              'showtext', # Special fonts for figures
              'RColorBrewer', # Colours for figures
              'sessioninfo' # Detailed session info for reproducibility
              ) 

# Load desired package with versions specific to project start date:
# groundhog.library(packages, "2024-05-01", force.install = TRUE)
groundhog.library(packages, "2024-05-29")

# SANKEY:
# install.packages("remotes")
# remotes::install_github("davidsjoberg/ggsankey")
# library(ggsankey)


# SSVS:
# install.packages("remotes")
# remotes::install_github("sabainter/SSVS")
# library('SSVS')

Set up a standard theme for plots/data visualisations:

Show code
# Load new font for figures/graphs
font_add_google("Poppins")
showtext_auto()
# showtext_opts(dpi = 300)
showtext_auto(enable = TRUE)

# Save new theme for figures/graphs. This will determine the layout, presentation, font type and font size used in all data visualisations presented here:
plot_theme <- theme_classic() +
  theme(
    text = element_text(family = "Poppins"),
    plot.title = element_text(hjust = 0.5, size = 26),
    plot.subtitle = element_text(hjust = 0.5, size = 22),
    axis.text = element_text(size = 18),
    axis.title = element_text(size = 22),
    plot.caption = element_text(size = 16),
    legend.title = element_text(size = 18),
    legend.text = element_text(size = 16)
  )

plot_theme_no_sizing <- theme_classic() +
  theme(
    text = element_text(family = "Poppins"))

Study 1 Analysis

This study pertains to the analysis of a survey of EGM gamblers from a single venue in Australia.

Load data

Load in the master dataset containing all customer details (this step is hidden to protect privacy & anonymity).
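For transparency, a minimal sketch of what the hidden load step might look like, assuming an Excel export (the file path is a hypothetical placeholder; the data could equally arrive as a Stata file read via haven::read_dta()):

# Hypothetical sketch only: the real file path and identifiers are withheld
master_dataset_initial_study1 <- readxl::read_xlsx("data/study1_customer_master.xlsx")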

Data loaded.

People were sent the survey via different methods; let’s have a look at the number who completed surveys by method:

Show code
master_dataset_initial_study1 %>%
  dplyr::count(mode)
   mode     n
1       11122
2  link   215
3 paper     2
4    qr    50
Show code
# Isolate non-link participants for later:
non_link_customers <- master_dataset_initial_study1 %>%
  filter(mode == "paper" | mode == "qr")  %>%
as_tibble() %>%
  select(CLIENT_ID,
         mode)

Now, let’s process the data for analysis by doing the following:

  • Remove anybody who wasn’t invited to take part in the survey via email or telephone (this means removing some people who completed via other methods, but not all – some people invited via email took part via hardcopy, for example).
  • Remove anybody who didn’t pass the screening question, which asked whether they had gambled with the venue in the past 30 days.

This code contains customer ID numbers and so is hidden.

To summarise the findings of the hidden code in text: there were 4,301 individuals invited to take part in the survey. However, when the account data were provided by the organisation, 33 of these customers had no account data available for the six months preceding survey invitation and therefore should not have been invited. Five individuals from the remaining sample started the survey but stated they had not gambled in the past 30 days; these were removed from consideration as they did not meet our inclusion criteria for invitation. All analyses and survey uptake figures relate to the remaining 4,260 customers, unless “unfiltered” is specified.
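Although the actual code is hidden, here is a minimal sketch of the filtering logic. The has_account_data flag and the join key are hypothetical placeholders; master_dataset_initial_study1, invition_data, CLIENT_ID, and screen are the names used elsewhere in this document:

# Hypothetical sketch of the hidden filtering step:
master_dataset_study1 <- master_dataset_initial_study1 %>%
  semi_join(invition_data, by = "CLIENT_ID") %>% # keep only invited customers
  filter(has_account_data) %>%                   # drop the 33 with no account data
  filter(is.na(screen) | screen != "No")         # drop those who failed the 30-day screen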

Uptake figures

Let’s first look at how survey uptake varied over the window the survey was open:

Show code
master_dataset_study1 %>%
  filter(SURVEY_STATUS_START != "Not started" 
         # & is.na(START_DATE)
         ) %>%
  # select(1:10, SURVEY_STATUS_START,START_DATE) %>%
  # arrange(desc(START_DATE)) %>%
  # print(n=250)
  group_by(START_DATE) %>%
  summarise(Freq = sum(n())) %>%
  ungroup() %>%
  ggplot(aes(x = START_DATE, y = Freq)) +
    plot_theme +
  geom_bar(stat = "identity", fill = "#00798c") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "",
       x = "Date",
       y = "Number of clients") 

Show code
  # scale_y_continuous(breaks = seq(0, 1500, by = 500), limits = c(0,1500))

Now compute some figures to tell us the rates of uptake for the survey:

Original, unfiltered dataset (it’s unfiltered in the sense that it includes everyone invited, but not those who were ineligible to be invited because they don’t have any account data); so the 5 people who didn’t pass the eligibility check are still included:

Show code
master_dataset_initial_study1 %>%
    as_tibble() %>%
  semi_join(invition_data) %>%
  dplyr::count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Study 1 (unfiltered outcomes)", subtitle = NULL)
Counts for starting the survey: Study 1 (unfiltered outcomes)
Survey status n Percent
Not started 4039 94.634489
Started 229 5.365511
Show code
master_dataset_initial_study1 %>%
    as_tibble() %>%
  semi_join(invition_data) %>%
   dplyr::count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Study 1 (unfiltered outcomes)", subtitle = NULL)
Counts for completing the survey up to the PGSI: Study 1 (unfiltered outcomes)
Survey status n Percent
Complete 185 4.334583
Incomplete 44 1.030928
Not started 4039 94.634489
Show code
master_dataset_initial_study1 %>%
    as_tibble() %>%
  semi_join(invition_data) %>%
  dplyr::count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Study 1 (unfiltered outcomes)", subtitle = NULL)
Counts for completing the survey in full: Study 1 (unfiltered outcomes)
Survey status n Percent
Complete 160 3.748828
Incomplete 69 1.616682
Not started 4039 94.634489

Now compute the same uptake figures for the filtered dataset:

Show code
master_dataset_study1 %>% 
  as_tibble() %>%
  dplyr::count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Study 1", subtitle = NULL)
Counts for starting the survey: Study 1
Survey status n Percent
Not started 4039 94.812207
Started 221 5.187793
Show code
master_dataset_study1 %>% 
  as_tibble() %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Study 1", subtitle = NULL)
Counts for completing the survey up to the PGSI: Study 1
Survey status n Percent
Complete 183 4.2957746
Incomplete 38 0.8920188
Not started 4039 94.8122066
Show code
master_dataset_study1 %>% 
  as_tibble() %>%
  dplyr::count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Study 1", subtitle = NULL)
Counts for completing the survey in full: Study 1
Survey status n Percent
Complete 153 3.591549
Incomplete 68 1.596244
Not started 4039 94.812207

Survey flow

Show code
# colnames(master_dataset_study1)


# Define relevant survey questions (excluding non-survey & optional ones)

#Note that optional items have been removed here
survey_questions <- c("ipaddress", "screen", "tier",
                  "pokielink", "pokiefrequency", "pokievenuetype", "pokiemembership", "pokievenuenumber",
                  "budgetset", "budgetstrat", 
                  "betsatisfaction", "accountease", "accountuse", "accountimprove", "accountintent", 
                  "attention1", "accountsocial", "accountadtrack", "accountadsetspend", 
                  "accountadsettime", "accountadblock", "accountadcash", "accountadlink", 
                  "accountadtransfer", "accountlikesign", "accountlikepoints", "accountlikeonly", 
                  "accountlikestatus", "accountlikerequire",  "pastafford", 
                  "pastexcite", "pastback", "pastsold", "pastproblem", "attention2", "paststress", 
                  "pastcritic", "pasthousehold", "pastguilty", "skspent", "skwin", "sknetoption", 
                   "skfuturespend", "skfuturewin", "skfutureoption",  "skcomfort", 
                  "lifesatisfaction", "patience", "impulsive", "risks", "alcofreq", "alcostandard", 
                  "alcosix", "alcopokies", "techsmart", "techapple", "techbank", "sex",
                  "year", "post", "lang", "educ", "work", "marstat", "child", "income", 
                  "incomevar", "incomepredict", "westhqvenue", "westhqtier", 
                  "cbenjoy1", "cbmajor2", "cbtop3", "cbcomfort4", "cbenough5", "bc1temp", "bc2break", 
                  "bc3lazy", "bc4inap", "bc5fun", "bc6refuse", "bc7discipline", "bc8iron", "bc9pleasure", 
                  "bc10conc", "bc11goals", "bc12wrong", "bc13alter")


# Filter dataset to exclude non-starters (only include those who started the survey)
survey_data_filtered <- master_dataset_study1 %>%
  filter(!is.na(ipaddress) &
          ipaddress != "") %>%
  select(all_of(survey_questions))

# Inspect data to check we are plotting the right info:
# View(survey_data_filtered)
# print(survey_data_filtered, n=100)

# All looks good. I've ensured we are including all of the survey questions, other than IP address, which we used to tell who started the survey.

# Count non-NA responses for each question
response_counts <- survey_data_filtered %>%
  summarise(across(everything(), ~sum(!is.na(.) & . != ""), .names = "count_{.col}")) %>% # Account for completely empty responses and NA value
  pivot_longer(cols = everything(), names_to = "Question", values_to = "Respondents") %>%
  mutate(Question = gsub("count_", "", Question)) 


# Define survey section positions for text labels:
survey_sections <- data.frame(
  Section = c("Opened survey + screen", 
              "Membership details",
              "Gambling involvement", 
              "Tracking strategies",
              "Gambling satisfaction", 
              "Account-based gambling",
              "PGSI", 
              "Estimated expenditure", 
              "Life satisfaction", 
              "Personality traits", 
              "Alcohol use",
              "Technology literacy",
              "Demographics",
              "Self-control"),
  Start = c("ipaddress", 
            "tier", # Membership details
             "pokievenuetype",  #  Gambling involvement
               "budgetset",  #  Tracking strategies
               "betsatisfaction",  #  Gambling satisfaction
               "accountadsettime",  #  Perceptions of account-based systems
               "attention2",  #  PGSI
               "skfuturespend",  #  Estimated expenditure
               "lifesatisfaction",  #  Life satisfaction
               "impulsive",  #  Personality traits
               "alcosix",  #  Alcohol use
               "techapple",  #  Technology literacy
               "income",  #  Demographics
               "bc8iron"  #  Self-control
            )
)

# Define section start and end positions for lines:
section_lines <- data.frame(
  Position = c("tier",  # Start of Membership details
               "pokiefrequency",  # Start of Gambling involvement
               "budgetset",  # Start of Tracking strategies
               "betsatisfaction",  # Start of Gambling satisfaction
               "accountease",  # Start of Perceptions of account-based systems
               "pastafford",  # Start of PGSI
               "skspent",  # Start of Estimated expenditure
               "lifesatisfaction",  # Start of Life satisfaction
               "patience",  # Start of Personality traits
               "alcofreq",  # Start of Alcohol use
               "techsmart",  # Start of Technology literacy
               "sex",  # Start of Demographics
               "bc1temp",  # Start of Self-control
               "bc13alter" # End
               ))

# Convert positions to numeric for plotting
section_lines$Position <- as.numeric(factor(section_lines$Position, levels = survey_questions))

# Push the final line to the very end of the graph by changing its position:
section_lines<- section_lines %>%
  mutate(Position = case_when(Position == 88 ~89,
                              TRUE~  Position))

# Define questions in each section of the survey:
survey_sections_dot_colours <- data.frame(
  Section = c(
    rep("Opened survey + screen", 2),
    rep("Membership details", 2),
    rep("Gambling involvement", 4),
    rep("Tracking strategies", 2),
    rep("Gambling satisfaction", 1),
    rep("Account-based gambling", 18),
    rep("PGSI", 10),
    rep("Estimated expenditure", 7),
    rep("Life satisfaction", 1),
    rep("Personality traits", 3),
    rep("Alcohol use", 4),
    rep("Technology literacy", 3),
    rep("Demographics", 18),
    rep("Self-control", 13)
  ),
  Question = c(
    # Opened survey
    "ipaddress",
    "screen", # Screening Question
    
     # Membership details
     "pokielink", "tier",

    # Gambling involvement
    "pokiefrequency", "pokievenuetype", "pokiemembership", "pokievenuenumber", 
    
    # Tracking strategies
    "budgetset", "budgetstrat", 
    
    # Gambling satisfaction
    "betsatisfaction",

    # Perceptions of account-based systems
    "accountease", "accountuse", "accountimprove", "accountintent", 
    "attention1", "accountsocial", "accountadtrack", "accountadsetspend", 
    "accountadsettime", "accountadblock", "accountadcash", "accountadlink", 
    "accountadtransfer", "accountlikesign", "accountlikepoints", "accountlikeonly", 
    "accountlikestatus", "accountlikerequire", 

    # PGSI
    "pastafford", "pastexcite", "pastback", "pastsold", "pastproblem", "attention2", "paststress",
    "pastcritic", "pasthousehold", "pastguilty",

    # Estimated expenditure
    "skspent", "skwin", "sknetoption", 
     "skfuturespend", "skfuturewin", "skfutureoption", "skcomfort", 

    # Life satisfaction
    "lifesatisfaction",

    # Personality traits
    "patience", "impulsive", "risks", 

    # Alcohol use
    "alcofreq", "alcostandard", "alcosix", "alcopokies", 

    # Technology literacy
    "techsmart", "techapple", "techbank",  # Removed duplicate "alcopokies"

    # Demographics
    "sex",  "year", "post", "lang", "educ", "work", 
    "marstat", "child", "income", "incomevar", "incomepredict", "westhqvenue", 
     "westhqtier", "cbenjoy1", "cbmajor2", "cbtop3", "cbcomfort4", "cbenough5",
    
    # Self-control
    "bc1temp", "bc2break", "bc3lazy", "bc4inap", "bc5fun", "bc6refuse", "bc7discipline",
    "bc8iron", "bc9pleasure", "bc10conc", "bc11goals", "bc12wrong", "bc13alter"
  )
)



# Ensure response_counts data is structured
response_counts <- response_counts %>%
  # select(-Section) %>%
  full_join(survey_sections_dot_colours) %>%
  filter(!is.na(Respondents)) %>%
    mutate(Question = factor(Question, levels = survey_questions,
                            ordered = TRUE)) %>% # Preserve order
  # Fix order to account for people who completed later questions within the same section
  group_by(Section) %>%
  arrange(match(Question, survey_questions)) %>%  # Explicitly order by predefined question order
  mutate(Respondents = cummin(Respondents)) %>%  # Apply cumulative minimum
  ungroup()


# Update data to account for the absence of info about survey starts saved from Qualtrics (data extracted by someone else on our team):
response_counts <- response_counts %>% 
mutate(Respondents = case_when(Respondents == 221 ~ 1328,
                                TRUE ~ Respondents))

# Wrap section labels to a maximum of 25 characters
survey_sections <- survey_sections %>%
  mutate(Section = str_wrap(Section, width = 30)) 

# Create a lookup table for section labels
section_labels <- setNames(survey_sections$Section, survey_sections$Start)


# Ensure Section is an ordered factor based on survey flow
response_counts$Section <- factor(response_counts$Section, levels = unique(survey_sections$Section))


# Generate a perceptually uniform color scale using viridis
num_sections <- length(unique(response_counts$Section))
section_colors <- viridis(14, option = "mako")  # Blue-Green to Blue






# Plot
main_plot <- ggplot(response_counts, aes(x = Question, y = Respondents, group = 1, color = Section)) +
  geom_line(color = "grey", size = 0.8) +  # Line plot
  
  # Dots colored by ordered survey section (now fading from dark to light)
  geom_point(size = 1.1) +
  
  # Apply wrapped section labels in scale_color_manual()
  scale_color_manual(
    values = setNames(section_colors, levels(response_counts$Section))
  ) +

  # Add dashed vertical lines for section boundaries
  geom_vline(data = section_lines, aes(xintercept = Position - 0.5),
             color = "grey50", linetype = "dashed", size = 0.5) +
  
  # Ensure x-axis follows the predefined order
  scale_x_discrete(limits = survey_questions, labels = function(x) ifelse(x %in% names(section_labels), section_labels[x], "")) +
  
  # Add annotation arrow (straight; curvature = 0)
  geom_curve(aes(x = 36.3, y = 355, xend = 39.5, yend = 355), 
             curvature = 0, arrow = arrow(length = unit(0.03, "npc"), type = "closed"),
             color = "black", size = 0.3) +
  
  # Add text labels
  annotate("text", x = 34.5, y = 400, label = "Pre-defined\ncompletion\npoint", 
           size = 2, fontface = "plain", family = "Poppins") +
    annotate("text", x = 87.5, y = 450, label = "End of survey", 
           angle = 90, size = 2, fontface = "plain", family = "Poppins") +
  labs(
    y = "Number of Respondents\n(log-10 scale)"
  ) + 
  # Theme
  plot_theme +
  theme(
    axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1.03),
    axis.title.y = element_text(vjust = 2),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank(),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 12),
    plot.margin = margin(2, 5, 2, 15),
    legend.position = "none"
  )  


main_plot +
  labs(title = "Survey 1 flow with natural y-axis scale",
       y= "")

Show code
main_plot <- main_plot +   scale_y_log10(breaks = c(250, 500, 1000), labels = c("250", "500", "1000")) +
  labs(title = "Survey 1 flow with log-10 scale")

main_plot

Show code
main_plot <-  main_plot +   
  # Themes:
  plot_theme +
  theme(
    axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1.03),
    axis.title.y = element_text(vjust = 2),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank(),
    axis.text = element_text(size = 7),
    axis.title = element_text(size = 8),
    plot.margin = margin(2, 5, 2, 15),
    legend.position = "none"
  )  

ggsave("Figures/Study 1 survey flow.pdf", 
       plot = main_plot, 
       # device = cairo_pdf,  # Ensures smooth text rendering
       width = 16,          
       height = 8.5,         
       units = "cm",        
       dpi = 300)

Compare samples

What are the general demographic and gambling characteristics of those who did and did not take part in the survey?

Show code
master_dataset_study1 %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>% # Invert net outcome so losses are negative and vice versa
  tbl_summary(
    include = c(# Demographic details
      age_westhq, 
      gender_westhq, 
      IRSAD_SCORE,
      memberyears, 
       # Gambling
      days_since_lastplay,
      num_6sday,
      intensity_6sday,
      averagebet_mean6s,
      montly_net_outcome_6m, # 
      gdquiet6_percent,
      timedevice_sum6s
      
      ),
    type = all_continuous() ~ "continuous",
    statistic = list(all_continuous() ~ c(
      "{mean} ({sd}), {median}")),
     digits = all_continuous() ~ 2,
    by = STATUS,
    label = c(
      age_westhq ~ "Age",
      gender_westhq ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      memberyears ~ "No. of years as a member",
      days_since_lastplay ~ "No. days since last bet",
      num_6sday ~ "No. betting days",
      intensity_6sday ~ "Mean games played per active day",
      averagebet_mean6s ~ "Mean stake",
      montly_net_outcome_6m ~ "Mean net outcome per month",
      gdquiet6_percent ~ "Percentage of sessions between midnight & 6AM",
      timedevice_sum6s ~ "Total time on device (minutes)"
      )
  ) %>%
  add_overall() %>%
  add_difference(
    list(all_continuous() ~ "cohens_d",  # For continuous variables
    all_categorical() ~ "smd") 
  ) %>%
  modify_header(label = "Variable") %>% 
  as_tibble() %>%
  filter(Variable != "Unknown") %>% 
  rename("Effect_size" = `**Difference**`) %>%
   unite("Effect size", Effect_size, `**95% CI**`, sep = " [", remove = TRUE) %>%
   mutate("Effect size" = paste0(`Effect size`, "]")) %>%
  mutate(`Effect size` = case_when(`Effect size` == "NA [NA]" ~ "",
                                    TRUE ~ `Effect size`)) %>%

  gt() %>%
  tab_options(data_row.padding = px(1.5)) %>%
  tab_header(title = "Table 2.   Study 1 - Comparison of participants and non-participants: Demographic and gambling characteristics") %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables: Cramer's V [95% CIs])",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Table 2. Study 1 - Comparison of participants and non-participants: Demographic and gambling characteristics
Variable **Overall**, N = 4,260¹ **Participants**, N = 183¹ **Non-participants**, N = 4,077 Effect size²
Age 53.99 (14.80), 55.00 52.08 (13.97), 53.00 54.07 (14.84), 55.00 -0.13 [-0.28, 0.01]
Gender NA NA NA 0.13 [-0.02, 0.28]
F 2,223 (52%) 107 (58%) 2,116 (52%)
M 2,036 (48%) 76 (42%) 1,960 (48%)
IRSAD score 958.93 (77.46), 970.79 961.37 (76.20), 977.56 958.82 (77.52), 970.79 0.03 [-0.12, 0.18]
No. of years as a member 10.61 (7.85), 9.40 12.16 (9.11), 9.90 10.54 (7.78), 9.40 0.21 [0.06, 0.36]
No. days since last bet 14.61 (14.87), 10.00 17.53 (17.08), 12.00 14.48 (14.75), 9.00 0.21 [0.06, 0.35]
No. betting days 19.17 (26.01), 9.00 22.63 (25.78), 15.00 19.02 (26.01), 9.00 0.14 [-0.01, 0.29]
Mean games played per active day 1,074.20 (922.03), 841.00 1,178.75 (932.99), 1,017.70 1,069.51 (921.37), 828.80 0.12 [-0.03, 0.27]
Mean stake 3.66 (9.10), 1.43 3.26 (7.71), 1.70 3.68 (9.15), 1.41 -0.05 [-0.19, 0.10]
Mean net outcome per month 720.38 (1,497.50), 192.91 1,011.20 (1,734.67), 401.06 707.00 (1,484.57), 187.62 0.20 [0.05, 0.35]
Percentage of sessions between midnight & 6AM 26.31 (25.25), 18.39 20.98 (18.11), 18.17 26.56 (25.52), 18.39 -0.22 [-0.42, -0.02]
Total time on device (minutes) 2,555.85 (4,651.01), 811.19 2,892.40 (4,037.00), 1,631.00 2,540.74 (4,676.54), 791.18 0.08 [-0.07, 0.22]
¹ Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
² Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables: Cramér's V [95% CIs]

NOTE: the net outcome values and effect size were reverse scored, such that positive values indicated losses, and so we have inverted these values in the manuscript.

Another important thing to note about the above table is that the effect size for the categorical variable (gender) is actually a standardised mean difference value, which I don’t think makes much sense here. I’m going to compute Cramér’s V effect size here and put it into the table and the manuscript:

Show code
# Calculate Cramér's V and its confidence interval
combined_table <- master_dataset_study1 %>%
  group_by(STATUS, gender_westhq) %>%
  dplyr::count() %>%
  drop_na() %>%
  ungroup() %>%
  spread(key = gender_westhq, value = n, fill = 0) %>%
  select(-1) %>%
  as.matrix()


  cramersV(combined_table)
Cramér's V =  0.027
Show code
  CI <- confIntV(combined_table)
  CI_info<- CI$intermediate
   CI_info$fisherZ.ci %>%
  t() %>%
  as.data.frame() %>%
   mutate_if(is.numeric, round, 3) %>%
unite('Effect size CIs', 1, 2, sep = ", ") 
  Effect size CIs
1   -0.003, 0.057

Calculate odds ratio instead, as per the reviewer’s suggestion:

Show code
combined_table <- master_dataset_study1 %>%
  group_by(STATUS, gender_westhq) %>%
  dplyr::count() %>%
  drop_na() %>%
  ungroup() %>%
  spread(key = gender_westhq, value = n, fill = 0) %>%
  column_to_rownames(var="STATUS") %>% 
  as.matrix()

oddsratio(combined_table)
           Disease Nondisease Total
Exposed        107         76   183
Nonexposed    2116       1960  4076
Total         2223       2036  4259

    Odds ratio estimate and its significance probability

data:  combined_table
p-value = 0.08243
95 percent confidence interval:
 0.9657426 1.7610017
sample estimates:
[1] 1.304099

Sensitivity check comparison

How does the group comparison differ if we change the criterion for survey completion? Reviewers didn’t ask for this change specifically, but some suggestions led us to conduct these additional sensitivity checks.

For this, we will redefine participants as anyone who fully completed the survey, and non-participants as anyone who didn’t start or didn’t complete the survey.

Show code
# Check coding of variable that tells us whether somebody fully completed the survey or not
master_dataset_study1 %>%
  count(STATUS,
           SURVEY_STATUS_FULL)
# A tibble: 4 × 3
  STATUS           SURVEY_STATUS_FULL     n
  <fct>            <chr>              <int>
1 Participants     Complete             153
2 Participants     Incomplete            30
3 Non-participants Incomplete            38
4 Non-participants Not started         4039
Show code
master_dataset_study1_sensitivity_check   <- master_dataset_study1 %>%
  mutate(SURVEY_STATUS_FULL_BINARY = case_when(SURVEY_STATUS_FULL == "Incomplete" ~ "Non-participants",
                                               SURVEY_STATUS_FULL == "Not started" ~ "Non-participants",
         TRUE ~ "Participants")) %>%
 mutate(SURVEY_STATUS_FULL_BINARY = fct_relevel(SURVEY_STATUS_FULL_BINARY,
                            "Participants",
                            "Non-participants")) # fix order of presentation

# Check recoding has worked:
master_dataset_study1_sensitivity_check %>%
  count(STATUS,
           SURVEY_STATUS_FULL_BINARY)
# A tibble: 3 × 3
  STATUS           SURVEY_STATUS_FULL_BINARY     n
  <fct>            <fct>                     <int>
1 Participants     Participants                153
2 Participants     Non-participants             30
3 Non-participants Non-participants           4077
Show code
# Produce updated table
master_dataset_study1_sensitivity_check %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>% # Invert net outcome so losses are negative and vice versa
  tbl_summary(
    include = c(# Demographic details
      age_westhq, 
      gender_westhq, 
      IRSAD_SCORE,
      memberyears, 
       # Gambling
      days_since_lastplay,
      num_6sday,
      intensity_6sday,
      averagebet_mean6s,
      montly_net_outcome_6m, # 
      gdquiet6_percent,
      timedevice_sum6s
      
      ),
    type = all_continuous() ~ "continuous",
    statistic = list(all_continuous() ~ c(
      "{mean} ({sd}), {median}")),
     digits = all_continuous() ~ 2,
    by = SURVEY_STATUS_FULL_BINARY,
    label = c(
      age_westhq ~ "Age",
      gender_westhq ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      memberyears ~ "No. of years as a member",
      days_since_lastplay ~ "No. days since last bet",
      num_6sday ~ "No. betting days",
      intensity_6sday ~ "Mean games played per active day",
      averagebet_mean6s ~ "Mean stake",
      montly_net_outcome_6m ~ "Mean net outcome per month",
      gdquiet6_percent ~ "Percentage of sessions between midnight & 6AM",
      timedevice_sum6s ~ "Total time on device (minutes)"
      )
  ) %>%
  add_overall() %>%
  add_difference(
    list(all_continuous() ~ "cohens_d",  # For continuous variables
    all_categorical() ~ "smd") 
  ) %>%
  modify_header(label = "Variable") %>% 
  as_tibble() %>%
  filter(Variable != "Unknown") %>% 
  rename("Effect_size" = `**Difference**`) %>%
   unite("Effect size", Effect_size, `**95% CI**`, sep = " [", remove = TRUE) %>%
   mutate("Effect size" = paste0(`Effect size`, "]")) %>%
  mutate(`Effect size` = case_when(`Effect size` == "NA [NA]" ~ "",
                                    TRUE ~ `Effect size`)) %>%

  gt() %>%
  tab_options(data_row.padding = px(1.5))   %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables: Cramer's V [95% CIs])",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Variable **Overall**, N = 4,260¹ **Participants**, N = 153¹ **Non-participants**, N = 4,107 Effect size²
Age 53.99 (14.80), 55.00 50.59 (13.56), 51.00 54.11 (14.83), 55.00 -0.24 [-0.40, -0.08]
Gender NA NA NA 0.06 [-0.11, 0.22]
F 2,223 (52%) 84 (55%) 2,139 (52%)
M 2,036 (48%) 69 (45%) 1,967 (48%)
IRSAD score 958.93 (77.46), 970.79 965.15 (75.81), 977.56 958.69 (77.52), 970.79 0.08 [-0.08, 0.24]
No. of years as a member 10.61 (7.85), 9.40 11.77 (9.03), 9.50 10.56 (7.80), 9.40 0.15 [-0.01, 0.32]
No. days since last bet 14.61 (14.87), 10.00 16.59 (16.64), 10.00 14.54 (14.80), 10.00 0.14 [-0.02, 0.30]
No. betting days 19.17 (26.01), 9.00 22.18 (24.23), 15.00 19.06 (26.07), 9.00 0.12 [-0.04, 0.28]
Mean games played per active day 1,074.20 (922.03), 841.00 1,076.81 (875.62), 918.47 1,074.10 (923.81), 833.78 0.00 [-0.16, 0.16]
Mean stake 3.66 (9.10), 1.43 3.37 (8.38), 1.67 3.68 (9.12), 1.42 -0.03 [-0.19, 0.13]
Mean net outcome per month 720.38 (1,497.50), 192.91 907.84 (1,665.54), 328.17 713.24 (1,490.50), 190.50 0.13 [-0.03, 0.29]
Percentage of sessions between midnight & 6AM 26.31 (25.25), 18.39 21.61 (18.89), 18.17 26.49 (25.46), 18.39 -0.19 [-0.41, 0.02]
Total time on device (minutes) 2,555.85 (4,651.01), 811.19 2,747.25 (3,903.29), 1,604.52 2,548.72 (4,676.80), 796.10 0.04 [-0.12, 0.20]
¹ Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
² Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables: Cramér's V [95% CIs]
Show code
# Calculate Cramér's V and its confidence interval
combined_table <- master_dataset_study1_sensitivity_check %>%
  group_by(SURVEY_STATUS_FULL_BINARY, gender_westhq) %>%
  dplyr::count() %>%
  drop_na() %>%
  ungroup() %>%
  spread(key = gender_westhq, value = n, fill = 0) %>%
  select(-1) %>%
  as.matrix()


  cramersV(combined_table)
Cramér's V =  0.01
Show code
  CI <- confIntV(combined_table)
  CI_info<- CI$intermediate
   CI_info$fisherZ.ci %>%
  t() %>%
  as.data.frame() %>%
   mutate_if(is.numeric, round, 3) %>%
unite('Effect size CIs', 1, 2, sep = ", ") 
  Effect size CIs
1    -0.02, 0.041

The effect sizes appear to be smaller in these comparisons. Let’s check if this is because the group who started the survey but didn’t complete it have now been merged with non-responders, making the latter group more similar to responders, or because those who fully complete the survey are actually more similar to those who didn’t start it:

Show code
master_dataset_study1_sensitivity_check %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>% # Invert net outcome so losses are negative and vice versa
  tbl_summary(
    include = c(# Demographic details
      age_westhq, 
      gender_westhq, 
      IRSAD_SCORE,
      memberyears, 
       # Gambling
      days_since_lastplay,
      num_6sday,
      intensity_6sday,
      averagebet_mean6s,
      montly_net_outcome_6m, # 
      gdquiet6_percent,
      timedevice_sum6s
      
      ),
    type = all_continuous() ~ "continuous",
    statistic = list(all_continuous() ~ c(
      "{mean} ({sd}), {median}")),
     digits = all_continuous() ~ 2,
    by = SURVEY_STATUS_FULL,
    label = c(
      age_westhq ~ "Age",
      gender_westhq ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      memberyears ~ "No. of years as a member",
      days_since_lastplay ~ "No. days since last bet",
      num_6sday ~ "No. betting days",
      intensity_6sday ~ "Mean games played per active day",
      averagebet_mean6s ~ "Mean stake",
      montly_net_outcome_6m ~ "Mean net outcome per month",
      gdquiet6_percent ~ "Percentage of sessions between midnight & 6AM",
      timedevice_sum6s ~ "Total time on device (minutes)"
      )
  )%>%
  modify_header(label = "Variable") %>% 
  as_tibble() %>%
  filter(Variable != "Unknown") %>%
   gt() %>%
  tab_options(data_row.padding = px(1.5)) %>%
  tab_header(title = "Study 1 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey")
Study 1 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey
Variable **Complete**, N = 153 **Incomplete**, N = 68 **Not started**, N = 4,039
Age 50.59 (13.56), 51.00 56.29 (14.95), 59.00 54.08 (14.83), 55.00
Gender NA NA NA
F 84 (55%) 41 (60%) 2,098 (52%)
M 69 (45%) 27 (40%) 1,940 (48%)
IRSAD score 965.15 (75.81), 977.56 949.86 (75.71), 968.48 958.84 (77.55), 970.79
No. of years as a member 11.77 (9.03), 9.50 12.92 (8.95), 11.30 10.52 (7.77), 9.40
No. days since last bet 16.59 (16.64), 10.00 18.19 (17.26), 12.50 14.48 (14.75), 9.00
No. betting days 22.18 (24.23), 15.00 29.10 (30.27), 19.00 18.89 (25.96), 9.00
Mean games played per active day 1,076.81 (875.62), 918.47 1,599.03 (1,142.05), 1,305.49 1,065.27 (917.30), 827.81
Mean stake 3.37 (8.38), 1.67 2.33 (1.95), 1.64 3.70 (9.19), 1.41
Mean net outcome per month 907.84 (1,665.54), 328.17 1,471.40 (2,484.41), 789.93 700.21 (1,464.66), 181.50
Percentage of sessions between midnight & 6AM 21.61 (18.89), 18.17 27.62 (25.92), 21.11 26.47 (25.46), 18.24
Total time on device (minutes) 2,747.25 (3,903.29), 1,604.52 4,135.38 (5,111.33), 2,247.80 2,522.01 (4,665.22), 782.30

Interestingly, it seems that the group who started but didn’t complete the survey are most different, trending towards being more involved gamblers.

PGSI

Also check the PGSI scores of the survey sample for inclusion in the table later on. For this, we will remove anyone who didn’t pass the attention checks, to get the most reliable estimate:

Show code
summary_PGSI <- master_dataset_study1 %>%
  filter(attention1 == "2" & # Filter to corrective responses to attention checks
           attention2 == "Sometimes") %>%
  summarise(Mean_PGSI = mean(PGSI_TOTAL, na.rm = TRUE),
            SD_PGSI = sd(PGSI_TOTAL, na.rm = TRUE),
            Median_PGSI = median(PGSI_TOTAL, na.rm = TRUE))

mean_ps_PGSI_value <- summary_PGSI$Mean_PGSI
sd_ps_PGSI_value <- summary_PGSI$SD_PGSI
median_ps_PGSI_value <- summary_PGSI$Median_PGSI

# Can produce a sentence that summarises the outcomes:
summary_sentence_PGSI_study1 <- sprintf("The mean PGSI total score is %.2f (standard deviation = %.2f) and the median is %.2f.", 
                            mean_ps_PGSI_value, sd_ps_PGSI_value, median_ps_PGSI_value)

summary_sentence_PGSI_study1
[1] "The mean PGSI total score is 5.50 (standard deviation = 5.01) and the median is 4.00."

Composite risk scores

Now let’s produce a single composite score that represents people’s past six months gambling behaviour.

One thing we need to consider here is that some of the variables we’ve created could be biased by recent membership registration. However, looking at the data, all but 2 of the participants have been registered for at least six months, and the 2 people who had been registered for less than this had both been registered for 0.4 of a year, the equivalent of approximately five months. So, this is unlikely to affect our outcomes here.

Show code
 master_dataset_study1 %>% 
  filter(memberyears < 0.5) %>%
  dplyr::count(memberyears)
# A tibble: 1 × 2
  memberyears     n
        <dbl> <int>
1         0.4    51

Okay, now we have all of the relevant variables, let’s look at how they correlate with PGSI and harm scores using a series of bivariate correlations visualised as a heatmap:

Show code
PGSI_cor_data_study1<- master_dataset_study1 %>% 
  # colnames()
    select(PGSI_TOTAL,
           gd_distinct_6s, # Number of distinct games played (by brand)
         game_description_count6s, # Total number of sessions played
         timedevice_sum6s, # Total time on device
        timedevice_mean6s, #  Average time on device per session
        timedevice_std6s, # SD of Time on device
        turnover1_sum6s, # Total turnover
        turnover1_mean6s, # Mean turnover per session
        turnover1_std6s, # SD of Turnover
         actualwin_sum6s, # Total net outcome 
        actualwin_mean6s, # Mean net outcome per session
        actualwin_std6s, # SD of Session net outcome
        averagebet_mean6s, # Mean bet size
        averagebet_std6s, # SD of average bet size
        gamesplayed_sum6s, #  Total games played
       gamesplayed_mean6s, #  Mean games played per session
       gamesplayed_std6s, # SD of games played per session
       max_turnover6sday, # Maximum total turnover in a single day  
     max_gamesplayed6sday, # Maximum total games played in a single day 
       num_6sday, #   Number of active betting days
       intensity_6sday, # Number of games per active day
       averagebet_6mos, #   Average monthly bet freq (days)
       averageturnover_6mos, #   average monthly turnover 
       montly_net_outcome_6m) # Average monthly net outcome

ggstatsplot::ggcorrmat(PGSI_cor_data_study1, 
                                              type = "parametric",
                                              results.subtitle = FALSE,
                                              ggcorrplot.args = c(lab_size = 3)) +
  theme(axis.text.y = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(axis.text.x = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(legend.title=element_text(color = "black", size = 12, face = "plain", family = "Poppins")) +
  theme(legend.title.align = .5) +
  theme(legend.text=element_text(color = "black", size = 8, face = "plain", family = "Poppins"))

Now let’s use all of these variables to create a composite index score of risk/involvement:

Show code
master_dataset_study1_with_risk_scores <- master_dataset_study1 %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>% # Invert outcome values
   mutate(actualwin_mean6s = actualwin_mean6s*-1) %>% # Invert outcome values
  mutate(across(c(
            gd_distinct_6s, # Number of distinct games played (by brand)
         game_description_count6s, # Total number of sessions played
         timedevice_sum6s, # Total time on device
        timedevice_mean6s, #  Average time on device per session
        turnover1_sum6s, # Total turnover
        turnover1_mean6s, # Mean turnover per session
        turnover1_std6s, # SD of Turnover
         actualwin_sum6s, # Total net outcome 
        actualwin_mean6s, # Mean net outcome per session
        actualwin_std6s, # SD of Session net outcome
        averagebet_sum6s, # Total bet size
        averagebet_mean6s, # Mean bet size
        averagebet_std6s, # SD of average bet size
        gamesplayed_sum6s, #  Total games played
       gamesplayed_mean6s, #  Mean games played per session
       gamesplayed_std6s, # SD of games played per session
       max_turnover6sday, # Maximum total turnover in a single day  
     max_gamesplayed6sday, # Maximum total games played in a single day 
       num_6sday, #   Number of active betting days
       intensity_6sday, # Number of games per active day
       averagebet_6mos, #   Average monthly bet freq (days)
     montly_net_outcome_6m, # Average monthly net outcome
       averageturnover_6mos), 
                scale, # scale function computes z-scores
                .names = "z_{.col}")) %>% # Create new named columns for easy selection below
  rowwise() %>% # Group by Rows
  mutate(`Composite Risk Score` = sum(c_across(starts_with("z_")), na.rm = TRUE)) %>% # Sum across all newly created Z scores
  ungroup()


# colnames(master_dataset_study2_with_risk_scores)

Correlations

Now let’s look at how well this composite score correlates with PGSI scores in the survey sample:

Show code
master_dataset_study1_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>%
   mutate(PGSI_CATEGORY = factor(PGSI_CATEGORY, levels = c('No risk',
                                                              'Low risk',
                                                              'Moderate risk',
                                                              'High risk'))) %>%
ggplot() +
   geom_point(aes(x = PGSI_TOTAL, y = `Composite Risk Score`,
                   colour = PGSI_CATEGORY)) +
   geom_smooth(aes(x = PGSI_TOTAL, y = `Composite Risk Score`), 
               method = "lm",
               se = FALSE,
               color = "black") +
   scale_colour_manual(values = c("#2e4057",
                                  "#00798c", 
                                  "#edae49",
                                  "#d1495b")) +
   geom_hline(yintercept = 0, colour = "grey",  linetype="dashed") +
   scale_y_continuous(limits = c(-10,120)) +
   plot_theme +
  guides(color= guide_legend("PGSI category")) +
   labs(x = "PGSI total score") +
  scale_x_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 27))

Show code
cor_PGSI <- cor.test(master_dataset_study1_with_risk_scores$PGSI_TOTAL, 
                         master_dataset_study1_with_risk_scores$`Composite Risk Score`,
                         method = "spearman")

print(cor_PGSI)

    Spearman's rank correlation rho

data:  master_dataset_study1_with_risk_scores$PGSI_TOTAL and master_dataset_study1_with_risk_scores$`Composite Risk Score`
S = 846277, p-value = 0.02031
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.1714405 

Confirm how many people are involved in this calculation:

Show code
master_dataset_study1_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>% nrow()
[1] 183

T-test

Show code
# Perform Welch's t-test
t_test_composite_study1 <- t.test(`Composite Risk Score` ~ STATUS, data = master_dataset_study1_with_risk_scores, var.equal = FALSE)

t_test_composite_study1

    Welch Two Sample t-test

data:  Composite Risk Score by STATUS
t = 2.3269, df = 197.85, p-value = 0.02098
alternative hypothesis: true difference in means between group Participants and group Non-participants is not equal to 0
95 percent confidence interval:
 0.3172207 3.8424226
sample estimates:
    mean in group Participants mean in group Non-participants 
                    1.99047717                    -0.08934445 
Show code
# Calculate means and standard deviations
summary_composite_study1 <- master_dataset_study1_with_risk_scores %>%
  dplyr::group_by(STATUS) %>%
  dplyr::summarize(
    mean = mean(`Composite Risk Score`),
    sd = sd(`Composite Risk Score`),
    n = n()
  ) 

gt(summary_composite_study1)
STATUS mean sd n
Participants 1.99047717 11.84102 183
Non-participants -0.08934445 11.54773 4077
Show code
# Calculate Cohen's d with 95% confidence interval
cohen_d_result_study1 <- cohen.d(`Composite Risk Score` ~ STATUS, data = master_dataset_study1_with_risk_scores)

cohen_d_result_study1

Cohen's d

d estimate: 0.1799089 (negligible)
95 percent confidence interval:
     lower      upper 
0.03171688 0.32810096 

Quartiles

Now, let’s divide the sample into quartiles based on their composite summary score:

Show code
# Calculate quartiles of the composite risk score
quartiles_study1 <- quantile(master_dataset_study1_with_risk_scores$`Composite Risk Score`, 
                      probs = c(0, 0.25, 0.5, 0.75, 1))


  
master_dataset_study1_with_risk_scores$Quartile <- cut(master_dataset_study1_with_risk_scores$`Composite Risk Score`, 
                                    breaks = quartiles_study1,
                                    labels = c("Bottom quartile: least involved",
                                               "Second quartile",
                                               "Third quartile",
                                               "Top quartile: most involved"),
                                    include.lowest = TRUE)

Now compute the proportion of individuals from each quartile who went on to take part in the survey:

Show code
master_dataset_study1_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
  tab_header("Counts for granular survey status by composite risk score quartile", subtitle = NULL)
Counts for granular survey status by composite risk score quartile
SURVEY_STATUS_PGSI n
Bottom quartile: least involved
Complete 34
Incomplete 4
Not started 1027
Second quartile
Complete 40
Incomplete 7
Not started 1018
Third quartile
Complete 49
Incomplete 10
Not started 1006
Top quartile: most involved
Complete 60
Incomplete 17
Not started 988
Show code
master_dataset_study1_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(STATUS) %>%
  mutate('Percent of quartile' = round(n/sum(n)*100,2)) %>%
  gt() %>%
  tab_header("Counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Counts (with percentages) for broad survey status by composite risk score quartile
STATUS n Percent of quartile
Bottom quartile: least involved
Participants 34 3.19
Non-participants 1031 96.81
Second quartile
Participants 40 3.76
Non-participants 1025 96.24
Third quartile
Participants 49 4.60
Non-participants 1016 95.40
Top quartile: most involved
Participants 60 5.63
Non-participants 1005 94.37

And visualise this using a Sankey diagram:

Show code
# Create Sankey plot
desired_order_sankey <- c("Top quartile: most involved",
                          "Third quartile",
                          "Second quartile",
                          "Bottom quartile: least involved")

# We first need to create a data frame that contains the flows ("from" and "to" categories) and "intensity" values (i.e., N):
risk_to_survey_links<- master_dataset_study1_with_risk_scores %>%
    select(CLIENT_ID, SURVEY_STATUS_PGSI, Quartile) %>%
  group_by(Quartile, SURVEY_STATUS_PGSI) %>%
   summarise(
    n = n(),
    .groups = "drop"
  ) %>%
   group_by(Quartile) %>%
   mutate(
     percent = n/sum(n)*100) %>%
  select(-n) %>%
  ungroup() %>%
  mutate(Quartile =  factor(Quartile,
                              levels = desired_order_sankey)) %>%
  arrange(Quartile)  

# From these flows, we have to create a "node" data frame which provides a list of all categories (to and from) in the data frame: 
nodes <- data.frame(
  name=c(as.character(risk_to_survey_links$Quartile), 
  as.character(risk_to_survey_links$SURVEY_STATUS_PGSI)) %>% unique()
)

# For the networkD3 package, the links between categories are made using IDs and not the character names provided in the risk_to_survey_links data frame. This code assigns each "from" and "to" category a number to make the numerical connections explicit:
risk_to_survey_links$IDsource <- match(risk_to_survey_links$Quartile, nodes$name)-1 
risk_to_survey_links$IDtarget <- match(risk_to_survey_links$SURVEY_STATUS_PGSI, nodes$name)-1
 


# prepare colour scale
ColourScal ='d3.scaleOrdinal() .range(["#d1495b", 
"#edae49",
"#00798c",
"#2e4057",
"#35B779FF",
"#31688EFF",
"#31688EFF",
"#482878FF",
"#B4DE2CFF",
"#FDE725FF", 
"#26828EFF",
"#3E4A89FF",
"#440154FF",
"#6DCD59FF"
])'

# Create Sankey plot
sankey_plot1 <- sankeyNetwork(
  Links = risk_to_survey_links, 
  Nodes = nodes,
  Source = "IDsource", 
  Target = "IDtarget",
  Value = "percent", 
  NodeID = "name", 
  iterations = 0,
  sinksRight = FALSE,
  fontSize = 16, 
  fontFamily = "Poppins",
  nodeWidth = 30,
  nodePadding = 25,
  colourScale = ColourScal
)

# Custom CSS with Google Fonts link
custom_css <- "
<link href='https://fonts.googleapis.com/css2?family=Poppins:wght@400;700&display=swap' rel='stylesheet'>
<style>
  text {
    font-family: 'Poppins', sans-serif !important;
  }
</style>
"

# Save plot as a HTML file with custom CSS
html_file <- tempfile(fileext = ".html")
saveWidget(
  widget = sankey_plot1, # the custom CSS is injected into the saved HTML below
  file = html_file,
  selfcontained = TRUE
)

# Read the saved HTML and ensure the custom CSS is properly embedded
html_content <- readLines(html_file, warn = FALSE)
html_content <- gsub("</head>", paste0(custom_css, "</head>"), html_content)
writeLines(html_content, con = "sankey_plot1.html")

# remove the temporary file
file.remove(html_file)
[1] TRUE
Show code
# sankey_plot1

Logistic regression

Now let’s perform a logistic regression model to identify the strongest predictors of survey uptake.
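For orientation, here is a minimal sketch of the form that model will take once the predictor set has been chosen (the predictors shown are placeholders rather than the final SSVS-selected set, and the data frame is the SSVS-ready one prepared below):

# Sketch only: the final predictor set comes from the SSVS step below
uptake_model <- glm(STATUS ~ Age + Gender + `Number of active betting days*`,
                    data = master_dataset_study1_SSVS,
                    family = binomial)

summary(uptake_model)            # coefficients and Wald tests
car::vif(uptake_model)           # check multicollinearity
fmsb::NagelkerkeR2(uptake_model) # Nagelkerke pseudo R-squared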

SSVS

Before we can actually run the model, we need to identify a suitable suite of predictor variables to include, using Stochastic Search Variable Selection (SSVS) and the related SSVS package (note that the actual SSVS process is commented out in rendered versions of this document as it takes a long time to run).

As we want to maximise the overall predictive ability of the model, we have opted to set the parameters of the SSVS so as to have a low threshold for including/selecting relevant variables. First, we have set the prior inclusion probability to 0.9, reflecting a belief that predictors are more likely to be included than not (a prior inclusion probability of 0.5 would reflect the belief that each predictor has an equal probability of being included/excluded). Second, we have set the cut-off for marginal inclusion probability (MIP) values to 0.4. There is no standard recommendation for what this cut-off should be, but others have used 0.5. For this specific study, we did adjust this value in response to previous results, given that only 2 predictors were identified based on a lower MIP threshold, and we are trying to build a model that maximises the ability to predict survey engagement based on account data.
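For reference, a minimal sketch of the (slow, and therefore commented-out) SSVS run that these parameters feed into, based on the SSVS package documentation; inprob is the prior inclusion probability, and continuous = FALSE reflects the binary uptake outcome. Argument names may differ slightly across package versions:

# Sketch of the SSVS run, using the outcome/predictor objects defined below:
# ssvs_results <- ssvs(data = master_dataset_study1_SSVS,
#                      x = predictors,
#                      y = outcome,
#                      continuous = FALSE, # binary outcome (participation)
#                      inprob = 0.9)       # prior inclusion probability
# summary(ssvs_results, ordered = TRUE)    # inspect MIPs against the 0.4 cut-off
# plot(ssvs_results)                       # visualise marginal inclusion probabilities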

One thing we need to do before all of this is impute missing values for the IRSAD scores, as SSVS doesn’t seem to handle missing values well. To do this, we will impute the median value for each person’s state as their missing value.

There are also some SD variables which don’t have scores because people only bet once in the window; for these, we will set the value to 0. Finally, for the days since registration, we’ll set this to the median as well.

Show code
# Function to impute missing values with the median
impute_na_with_median <- function(data) {
  data %>%
    # Impute SD variables with 0
    mutate(across(contains("std6s"), ~ ifelse(is.na(.), 0, .))) %>%
    # Impute all other variables with median
    mutate(across(!contains("std6s"), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
}

# Prepare the dataset
master_dataset_study1_SSVS <- master_dataset_study1 %>%
  # Impute missing IRSAD scores
  mutate(IRSAD_SCORE = ifelse(is.na(IRSAD_SCORE), median(IRSAD_SCORE, na.rm = TRUE), IRSAD_SCORE)) %>%
  # Convert all required variables to numeric
  mutate(across(c(
    age_westhq,
    IRSAD_SCORE,
    memberyears,
    days_since_lastplay,
    gd_distinct_6s,
    game_description_count6s,
    timedevice_sum6s,
    timedevice_mean6s,
    timedevice_std6s,
    turnover1_sum6s,
    turnover1_mean6s,
    turnover1_std6s,
    actualwin_sum6s,
    actualwin_mean6s,
    actualwin_std6s,
    averagebet_mean6s,
    averagebet_std6s,
    gamesplayed_sum6s,
    gamesplayed_mean6s,
    gamesplayed_std6s,
    max_turnover6sday,
    max_gamesplayed6sday,
    num_6sday,
    intensity_6sday,
    averagebet_6mos,
    averageturnover_6mos), 
    ~ as.numeric(.))) %>%
  # Convert gender to numeric
  mutate(GENDER = case_when(
    gender_westhq == 'F' ~ 0,
    gender_westhq == 'M' ~ 1,
    TRUE ~ NA_real_)) %>%
  # Convert STATUS to numeric
  mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1)) %>%
  mutate(STATUS = as.numeric(STATUS)) %>%
  # Impute missing values with median and 0 for SD variables
  impute_na_with_median() %>%
  rename(
    'Age' = 'age_westhq',
    'Gender' = 'GENDER',
    'IRSAD score' = 'IRSAD_SCORE',
    'Days since registration' = 'memberyears',
    'Days since last bet' = 'days_since_lastplay',
    'No. of diff. game types*' = 'gd_distinct_6s',
    'Total sessions played*' = 'game_description_count6s',
    'Total time on device*' = 'timedevice_sum6s',
    'Mean time on device per session*' = 'timedevice_mean6s',
    'SD of time on device*' = 'timedevice_std6s',
    'Total turnover*' = 'turnover1_sum6s',
    'Mean turnover per session*' = 'turnover1_mean6s',
    'SD of turnover*' = 'turnover1_std6s',
    'Total net outcome*' = 'actualwin_sum6s',
    'Mean net outcome per session*' = 'actualwin_mean6s',
    'SD of session net outcome*' = 'actualwin_std6s',
    'Mean stake*' = 'averagebet_mean6s',
    'SD of stake*' = 'averagebet_std6s',
    'Total games played*' = 'gamesplayed_sum6s',
    'Mean no. games played per session*' = 'gamesplayed_mean6s',
    'SD of games played per session*' = 'gamesplayed_std6s',
    'Max turnover in a single day*' = 'max_turnover6sday',
    'Max no. games played in a single day*' = 'max_gamesplayed6sday',
    'Number of active betting days*' = 'num_6sday',
    'Mean no. games per active day*' = 'intensity_6sday',
    'Mean monthly bet freq (days)*' = 'averagebet_6mos',
    'Mean monthly turnover*' = 'averageturnover_6mos'
  ) %>%
  as.data.frame()

outcome <- 'STATUS'

predictors <- c(
  # Demographic variables:
  'Age',
  'Gender',
  'IRSAD score',
  
  # Account/wagering variables:
  'Days since registration',
  'Days since last bet',
  
  # All past six month gambling variables (As included in the composite score):
  'No. of diff. game types*',
  'Total sessions played*',
  'Total time on device*',
  'Mean time on device per session*',
  'SD of time on device*',
  'Total turnover*',
  'Mean turnover per session*',
  'SD of turnover*',
  'Total net outcome*',
  'Mean net outcome per session*',
  'SD of session net outcome*',
  'Mean stake*',
  'SD of stake*',
  'Total games played*',
  'Mean no. games played per session*',
  'SD of games played per session*',
  'Max turnover in a single day*',
  'Max no. games played in a single day*',
  'Number of active betting days*',
  'Mean no. games per active day*',
  'Mean monthly bet freq (days)*',
  'Mean monthly turnover*'
)


# Print summary statistics
print(summary(master_dataset_study1_SSVS %>% select(all_of(predictors))))
      Age            Gender        IRSAD score     Days since registration
 Min.   :18.00   Min.   :0.0000   Min.   : 780.9   Min.   : 0.40          
 1st Qu.:43.00   1st Qu.:0.0000   1st Qu.: 922.4   1st Qu.: 5.00          
 Median :55.00   Median :0.0000   Median : 970.8   Median : 9.40          
 Mean   :53.99   Mean   :0.4779   Mean   : 959.0   Mean   :10.57          
 3rd Qu.:65.00   3rd Qu.:1.0000   3rd Qu.: 997.9   3rd Qu.:14.03          
 Max.   :95.00   Max.   :1.0000   Max.   :1176.9   Max.   :59.70          
 Days since last bet No. of diff. game types* Total sessions played*
 Min.   : 0.00       Min.   :  1.00           Min.   :   1.0        
 1st Qu.: 3.00       1st Qu.:  7.00           1st Qu.:  13.0        
 Median :10.00       Median : 21.00           Median :  56.0        
 Mean   :14.61       Mean   : 35.36           Mean   : 182.5        
 3rd Qu.:22.00       3rd Qu.: 49.00           3rd Qu.: 182.0        
 Max.   :67.00       Max.   :283.00           Max.   :5500.0        
 Total time on device* Mean time on device per session* SD of time on device*
 Min.   :    0.0       Min.   :  0.000                  Min.   :  0.000      
 1st Qu.:  172.3       1st Qu.:  7.944                  1st Qu.:  6.794      
 Median :  811.2       Median : 12.840                  Median : 12.856      
 Mean   : 2555.8       Mean   : 18.019                  Mean   : 18.053      
 3rd Qu.: 2696.0       3rd Qu.: 21.377                  3rd Qu.: 22.914      
 Max.   :49777.8       Max.   :261.583                  Max.   :206.664      
 Total turnover*     Mean turnover per session* SD of turnover*  
 Min.   :      0.4   Min.   :    0.20           Min.   :    0.0  
 1st Qu.:   1909.7   1st Qu.:   87.53           1st Qu.:   78.4  
 Median :  11971.2   Median :  181.78           Median :  196.1  
 Mean   :  54122.7   Mean   :  413.79           Mean   :  479.6  
 3rd Qu.:  52419.7   3rd Qu.:  403.25           3rd Qu.:  488.4  
 Max.   :2850427.5   Max.   :29571.30           Max.   :26734.1  
 Total net outcome* Mean net outcome per session* SD of session net outcome*
 Min.   :-17910     Min.   :-650.000              Min.   :   0.0            
 1st Qu.:   110     1st Qu.:   5.664              1st Qu.:  48.4            
 Median :  1021     Median :  19.008              Median : 113.5            
 Mean   :  4121     Mean   :  35.180              Mean   : 183.6            
 3rd Qu.:  4502     3rd Qu.:  45.336              3rd Qu.: 229.7            
 Max.   :107738     Max.   :4999.980              Max.   :5164.6            
  Mean stake*        SD of stake*     Total games played*
 Min.   : 0.01771   Min.   : 0.0000   Min.   :     0     
 1st Qu.: 0.78506   1st Qu.: 0.2504   1st Qu.:  1471     
 Median : 1.42699   Median : 0.6212   Median :  7216     
 Mean   : 3.66419   Mean   : 2.2513   Mean   : 26677     
 3rd Qu.: 2.78693   3rd Qu.: 1.4415   3rd Qu.: 26613     
 Max.   :96.99538   Max.   :50.1473   Max.   :621406     
 Mean no. games played per session* SD of games played per session*
 Min.   :   1.00                    Min.   :   0.00                
 1st Qu.:  71.95                    1st Qu.:  57.39                
 Median : 118.17                    Median : 110.96                
 Mean   : 167.84                    Mean   : 164.24                
 3rd Qu.: 200.62                    3rd Qu.: 202.78                
 Max.   :3524.00                    Max.   :1748.12                
 Max turnover in a single day* Max no. games played in a single day*
 Min.   :     0.3              Min.   :    0                        
 1st Qu.:   872.0              1st Qu.:  670                        
 Median :  2904.7              Median : 1742                        
 Mean   :  6356.1              Mean   : 2264                        
 3rd Qu.:  7898.9              3rd Qu.: 3265                        
 Max.   :165993.5              Max.   :12945                        
 Number of active betting days* Mean no. games per active day*
 Min.   :  1.00                 Min.   :   0.0                
 1st Qu.:  3.00                 1st Qu.: 357.9                
 Median :  9.00                 Median : 841.0                
 Mean   : 19.17                 Mean   :1074.2                
 3rd Qu.: 24.00                 3rd Qu.:1526.2                
 Max.   :180.00                 Max.   :6526.0                
 Mean monthly bet freq (days)* Mean monthly turnover*
 Min.   : 0.1667               Min.   :     0.1      
 1st Qu.: 0.5000               1st Qu.:   318.3      
 Median : 1.5000               Median :  1995.2      
 Mean   : 3.1955               Mean   :  9020.4      
 3rd Qu.: 4.0000               3rd Qu.:  8736.6      
 Max.   :30.0000               Max.   :475071.2      
Show code
# master_dataset_study1_SSVS %>%
#   dplyr::count(Gender)


# master_dataset_study1_SSVS %>%
  # dplyr::count(STATUS)

# # # Perform SSVS analysis
# SSSV_results_Study1 <- ssvs(data = master_dataset_study1_SSVS,
#                             x = predictors, y = outcome,
#                             inprob = 0.95,
#                             continuous = FALSE,
#                             progress = TRUE)
# 
# 
# # Save file to avoid having to re-run every time
# saveRDS(SSSV_results_Study1, "SSSV_results_Study1")

# Load in:
SSSV_results_Study1 <- read_rds("SSSV_results_Study1")

Now that we have the outcomes from the variable selection process, let's present them in a table and visually in a plot:

Show code
summary(SSSV_results_Study1, 
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  as.data.frame() %>%
gt()%>%
  tab_header(md("**Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 1**"), 
             subtitle = NULL)
Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 1
Variable MIP Avg Beta Avg Nonzero Beta Lower CI (95%) Upper CI (95%)
Days since last bet 0.9924 0.3421 0.3447 0.1818 0.4982
No. of diff. game types* 0.9296 0.2820 0.3033 0.0000 0.4275
Max turnover in a single day* 0.6883 0.1669 0.2425 0.0000 0.3688
Days since registration 0.5135 0.1249 0.2432 0.0000 0.3615
Age 0.4119 -0.1062 -0.2577 -0.3740 0.0000
Number of active betting days* 0.2653 0.1062 0.4002 -57.5381 66.9408
Mean monthly bet freq (days)* 0.2631 -0.0747 -0.2840 -66.8354 57.5835
Mean monthly turnover* 0.2529 -0.1619 -0.6402 -62.8708 60.2791
Total turnover* 0.2482 0.1627 0.6557 -59.7995 63.1582
Max no. games played in a single day* 0.1089 0.0254 0.2330 0.0000 0.2560
Mean turnover per session* 0.0894 -0.0241 -0.2691 -0.2310 0.0014
SD of session net outcome* 0.0861 0.0140 0.1624 0.0000 0.1588
Mean stake* 0.0739 -0.0148 -0.1996 -0.1399 0.0000
Total time on device* 0.0732 -0.0216 -0.2953 -0.1889 0.0004
Total games played* 0.0676 -0.0144 -0.2131 -0.1512 0.0002
Gender 0.0662 -0.0091 -0.1368 -0.0891 0.0000
SD of stake* 0.0516 -0.0067 -0.1303 0.0000 0.0095
SD of turnover* 0.0430 0.0051 0.1182 0.0000 0.0000
Mean no. games played per session* 0.0305 -0.0017 -0.0574 0.0000 0.0000
Mean no. games per active day* 0.0287 -0.0027 -0.0934 0.0000 0.0000
Total net outcome* 0.0285 0.0021 0.0736 0.0000 0.0000
Total sessions played* 0.0261 -0.0023 -0.0870 0.0000 0.0000
SD of time on device* 0.0248 0.0005 0.0182 0.0000 0.0000
IRSAD score 0.0243 0.0016 0.0667 0.0000 0.0000
Mean time on device per session* 0.0235 -0.0023 -0.1001 0.0000 0.0000
SD of games played per session* 0.0211 0.0015 0.0701 0.0000 0.0000
Mean net outcome per session* 0.0130 0.0001 0.0094 0.0000 0.0000
Show code
plot(SSSV_results_Study1,
     threshold = 0.4,
     title = "") +
  theme_classic() +
   scale_y_continuous(breaks = c(0, 0.25, 0.50, 0.75, 1), labels = c(0, 0.25, 0.50, 0.75, 1)) +
  theme(text=element_text(family="Poppins"),
      axis.text.x = element_text(angle = 45, hjust = 1),
    plot.margin = margin(t = 10, r = 10, b = 10, l = 20)
      # axis.text.y = element_text(angle = 0, hjust = .5)
      ) +
  labs(title = "",
       x = "",
       y = "Inclusion Probability") +
  annotate("text", x = 18, y = 0.45, label = "MIP threshold", size =3, color = "black", fontface = "plain") +
  theme(legend.position = "none") 

Show code
ggsave("Figures/SSVS results plot Study 1.pdf",
       width = 8.5,
       height = 4,
       units = c("in")
)

Run model

Now let’s run the logistic regression model with our selection of predictor variables identified via SSVS:

Show code
Study1_logistic_reg_data<- master_dataset_study1 %>%
 mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1)) %>%
  filter(!is.na(memberyears))

# Run model
Study1_logistic_regression<- glm(STATUS ~  days_since_lastplay + 
                gd_distinct_6s +
                max_turnover6sday +
                memberyears +
                age_westhq,
      data = Study1_logistic_reg_data, family = "binomial") 

summary(Study1_logistic_regression) 

Call:
glm(formula = STATUS ~ days_since_lastplay + gd_distinct_6s + 
    max_turnover6sday + memberyears + age_westhq, family = "binomial", 
    data = Study1_logistic_reg_data)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -3.395e+00  3.239e-01 -10.481  < 2e-16 ***
days_since_lastplay  2.532e-02  4.973e-03   5.091 3.57e-07 ***
gd_distinct_6s       7.284e-03  1.715e-03   4.248 2.16e-05 ***
max_turnover6sday    2.192e-05  5.775e-06   3.796 0.000147 ***
memberyears          3.451e-02  1.012e-02   3.412 0.000646 ***
age_westhq          -1.773e-02  5.889e-03  -3.011 0.002605 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1484.9  on 4116  degrees of freedom
Residual deviance: 1426.7  on 4111  degrees of freedom
AIC: 1438.7

Number of Fisher Scoring iterations: 6

Check variance inflation factors (VIFs) for evidence of multicollinearity:

Show code
vif(Study1_logistic_regression)
days_since_lastplay      gd_distinct_6s   max_turnover6sday         memberyears 
           1.229496            1.232278            1.076284            1.317812 
         age_westhq 
           1.261586 

All VIFs are below 1.4, well under common rule-of-thumb cut-offs (e.g., 5 or 10), so there are no issues with multicollinearity.

Perform other checks of model performance:

Show code
# model1_performance_metrics <- check_model(Study1_logistic_regression)
# 
# ggsave("Figures/model1_performance_metrics.pdf",
#        width = 7,
#        height = 9,
#        units = c("in")
# )

Compute confidence intervals for the coefficient estimates:

Show code
summarisedCIs_Study1_logistic_regression <- cbind(coef(Study1_logistic_regression), confint(Study1_logistic_regression))

summarisedCIs_Study1_logistic_regression %>% as.data.frame() %>%
    rownames_to_column() %>%
  rename(Coefficient = V1) %>%
  mutate(across(c(Coefficient, `2.5 %`, `97.5 %`), ~round(., 3))) %>%
  arrange(desc(`2.5 %`)) %>%
   gt() %>%
  tab_header(md("**Regression coefficients with 95% CIs: Study 1**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Regression coefficients with 95% CIs: Study 1
Coefficient Lower CI: 2.5% Upper CI: 97.5%
days_since_lastplay 0.025 0.015 0.035
memberyears 0.035 0.014 0.054
gd_distinct_6s 0.007 0.004 0.011
max_turnover6sday 0.000 0.000 0.000
age_westhq -0.018 -0.029 -0.006
(Intercept) -3.395 -4.042 -2.772

Exponentiate the coefficients so they can be interpreted as odds ratios (e.g., the OR of about 1.026 for days since last bet below means each additional day since a customer's last bet multiplies the odds of participating by roughly 1.026):

Show code
summarisedORs_Study1_logistic_regression <- exp(cbind(OR = coef(Study1_logistic_regression), confint(Study1_logistic_regression)))

summarisedORs_Study1_logistic_regression %>% as.data.frame() %>%
  rownames_to_column() %>%
  mutate(across(c(OR, `2.5 %`, `97.5 %`), ~round(., 4))) %>%
  arrange(desc(OR)) %>%
   gt() %>%
  tab_header(md("**Odds ratios with 95% CIs: Study 2**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Odds ratios with 95% CIs: Study 1
OR Lower CI: 2.5% Upper CI: 97.5%
memberyears 1.0351 1.0145 1.0555
days_since_lastplay 1.0256 1.0155 1.0355
gd_distinct_6s 1.0073 1.0038 1.0106
max_turnover6sday 1.0000 1.0000 1.0000
age_westhq 0.9824 0.9711 0.9938
(Intercept) 0.0336 0.0176 0.0625

Calculate Nagelkerke’s R squared for the model:

Show code
NagelkerkeR2_S1<- NagelkerkeR2(Study1_logistic_regression) %>% 
  as.data.frame() %>%
  print()
     N         R2
1 4117 0.04643133
Show code
# NOTE: R2 is calculated as a proportion here, and so in the paper we have multiplied it by 100 to form a percentage value

The model accounts for approximately 4.64% of the variance in uptake/response.

Calculate overall p-value for model by comparing it with a null model:

Show code
anova(Study1_logistic_regression, 
      update(Study1_logistic_regression, ~1), # update here produces null model for comparison 
      test="Chisq")
Analysis of Deviance Table

Model 1: STATUS ~ days_since_lastplay + gd_distinct_6s + max_turnover6sday + 
    memberyears + age_westhq
Model 2: STATUS ~ 1
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1      4111     1426.7                          
2      4116     1485.0 -5  -58.294 2.736e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show code
# Set seed so that labels are consistent in the plot:
set.seed(200)

ggstatsplot::ggcoefstats(Study1_logistic_regression,
                         digits = 4,
                         stats.label.color = "#00798c",
                         sort = "ascending",
                         vline.args = list(size = 1, color = "black"),
                         # ggstatsplot.layer = FALSE,
                          exclude.intercept = TRUE, # hide the intercept
                         stats.label.args = list(size = 3, direction = "y")) +
  scale_x_continuous(name = "Regression coefficient", limits = c(-0.05, 0.05)
                     # breaks = c(-.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5)
                     ) +
  scale_y_discrete(name = "", labels = c("Age",
                                         "Max turnover\n in a single\n day",
                                         "No. of diff.\n game types",
                                         "Days since\n last bet",
                                         "Days since\n  registration"
                                         )) +
  theme(panel.grid.minor = element_blank(),
        axis.line = element_blank()) +
  theme(axis.text = element_text(color = "black", size = 9, face = "plain"))+
  theme_light(base_family = "Poppins") +
  # theme(plot.margin = unit(c(.1,.1,.3,-3), "cm")) +
  theme(axis.text.y = element_text(color="black", size=12, face="plain", family = "Poppins")) +
  theme(axis.title.x = element_text(color="black", size=12, face ="plain", vjust = 0, family = "Poppins")) 

Show code
# Set height to 7 and width to 5 in portrait mode
ggsave("Figures/Study 1 log regression plot.pdf",
       width = 5,
       height = 7,
       units = c("in")
)

Study 2 Analysis

This study pertains to the analysis of a survey of online gamblers from two sites in Australia.

Load data

Load in master dataset containing all customer details (this step is hidden to protect privacy & anonymity)

Data loaded.

How many people were invited in total:

Show code
nrow(master_dataset_study2_initial)
[1] 24829

Process the data for analysis by doing the following:

  • Remove anybody who didn't bet in the six months prior to survey participation, as this was our criterion for inclusion in the sample
  • Provide a single variable that tells us whether somebody is a participant or non-participant, depending on whether they completed the survey up to the PGSI or not
  • Recode and rename variables where required for later analysis and visualisation.
Show code
master_dataset_study2 <-  master_dataset_study2_initial %>%
  as_tibble() %>%
filter(PAS_6M_BET_FREQ_DAYS != 0) %>% # Remove anybody who didn't bet in the window of interest
  mutate(STATUS = case_when(SURVEY_STATUS_PGSI == "Incomplete" |  SURVEY_STATUS_PGSI == "Not started"~ "Non-participants", 
                            SURVEY_STATUS_PGSI == "Complete"  ~ "Participants")) %>%
mutate(STATUS = fct_relevel(STATUS,
                            "Participants",
                            "Non-participants")) %>%
  # Make numeric:
  mutate(GHM_OVER_PRIORITISE = as.numeric(GHM_OVER_PRIORITISE),
    GHM_STRAIN = as.numeric(GHM_STRAIN),
    GHM_SEVERE = as.numeric(GHM_SEVERE),
    IRSAD_SCORE = as.numeric(IRSAD_SCORE)
    ) %>% 
  # Enforce the order of factors so that they present properly in tables and graphs:
  mutate(PGSI_CATEGORY = factor(PGSI_CATEGORY, levels = c('No risk',
                                                              'Low risk',
                                                              'Moderate risk',
                                                              'High risk'))) 
  # Rename risk detection system summary variable values:
  # mutate(RE_PAS_6M_ANY_TRIGGER = case_when(RE_PAS_6M_ANY_TRIGGER == TRUE ~ "Flagged",
  #                                          RE_PAS_6M_ANY_TRIGGER == FALSE ~ "Not flagged"))

Uptake figures

Let’s first look at how survey uptake varied over the three-week window the survey was open:

Show code
master_dataset_study2_initial %>%
  as_tibble() %>%
  filter(SURVEY_STATUS_START != "Not started") %>%
  group_by(START_DATE) %>%
  summarise(Freq = sum(n())) %>%
  ungroup() %>%
  ggplot(aes(x = START_DATE, y = Freq)) +
    plot_theme +
  geom_bar(stat = "identity", fill = "#00798c") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "",
       x = "Date",
       y = "Number of clients") +
  scale_y_continuous(breaks = seq(0, 1500, by = 500), limits = c(0,1500))

Now compute some figures to tell us the rates of uptake for the survey. We will first do this for the original, unfiltered dataset and then for the refined, filtered dataset with people who didn't bet in the past six months removed.

Original dataset:

Show code
master_dataset_study2_initial %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Unfiltered dataset", subtitle = NULL)
Counts for starting the survey: Unfiltered dataset
Survey status n Percent
Not started 22684 91.360909
Started 2145 8.639091
Show code
master_dataset_study2_initial %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Unfiltered dataset", subtitle = NULL)
Counts for completing the survey up to the PGSI: Unfiltered dataset
Survey status n Percent
Complete 1959 7.889967
Incomplete 186 0.749124
Not started 22684 91.360909
Show code
master_dataset_study2_initial %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Unfiltered dataset", subtitle = NULL)
Counts for completing the survey in full: Unfiltered dataset
Survey status n Percent
Complete 1470 5.920496
Incomplete 675 2.718595
Not started 22684 91.360909

And now for the filtered dataset we will use for most of the analyses:

Show code
master_dataset_study2 %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Filtered dataset", subtitle = NULL)
Counts for starting the survey: Filtered dataset
Survey status n Percent
Not started 22651 91.360465
Started 2142 8.639535
Show code
master_dataset_study2 %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Filtered dataset", subtitle = NULL)
Counts for completing the survey up to the PGSI: Filtered dataset
Survey status n Percent
Complete 1956 7.8893236
Incomplete 186 0.7502118
Not started 22651 91.3604646
Show code
master_dataset_study2 %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Filtered dataset", subtitle = NULL)
Counts for completing the survey in full: Filtered dataset
Survey status n Percent
Complete 1467 5.916993
Incomplete 675 2.722543
Not started 22651 91.360465

We also need to figure out how many people were removed by requiring that everyone had bet in the past six months, and how this related to survey status:

Show code
master_dataset_study2_initial %>%
  as_tibble() %>%
filter(PAS_6M_BET_FREQ_DAYS == 0) %>% # keep anybody who didn't bet in the window of interest
  count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for people removed from dataset because they hadn't gambled in the past 6-months", subtitle = NULL)
Counts for people removed from dataset because they hadn't gambled in the past 6-months
Survey status n
Complete 3
Not started 33

Survey flow

How far through the survey did respondents get? The code below counts the number of non-missing responses to each question, in survey order, and plots the drop-off across survey sections:

Show code
# Define relevant survey questions (excluding non-survey & optional ones)
survey_questions <- c(
  "IPAddress", "GAM_ACCOUNTS", "GAM_ACTIVITIES", "GAM_HARM", "GAMB_SATISFACTION",
  "PGSI_Q1", "PGSI_Q2", "PGSI_Q3", "PGSI_Q4", "PGSI_Q5", "PGSI_Q6", "PGSI_Q7", "PGSI_Q8", "PGSI_Q9",
  "GHM_F1", "GHM_P1", "ATTENTION_CHECK_1", "GHM_P2", "GHM_R1", "GHM_R2", "GHM_PH1", "GHM_PH2",
  "GHM_WS1", "GHM_WS2", "GHM_L1", "LIFE_SATISFACTION", "ACT1_TRACK",
  "ACT2_1", "ACT2_2", "ACT2_3", "ACT2_4", "ACT2_5", "ACT2_6", "ACT2_7",
  "EST_PAST_SPEND", "EST_PAST_WIN", "EST_PAST_NET_RESULT", "EST_PAST_DEPOSIT", "EST_PAST_WITHDRAW",
  "EST_NEXT_SPEND", "EST_NEXT_WIN", "EST_NEXT_NET_RESULT", "EST_NEXT_ACCEPTED_RESULT",
  "EST_NEXT_DEPOSIT", "EST_NEXT_WITHDRAW",
  "SC_1", "SC_2", "SC_3", "SC_4", "SC_5", "SC_6", "SC_7", "SC_8", "SC_9", "SC_10", "SC_11",
  "ATTENTION_CHECK_2", "SC_12", "SC_13",
  "EDUCATION", "EMPLOYMENT", "MARRIED", "NO_CHILDREN", "HOUSEHOLD_INCOME",
  "INCOME_VARIABILITY", "INCOME_PREDICTABILITY",
  "CBA_FW_1", "CBA_FW_2", "CBA_FW_3", "CBA_FW_4", "CBA_FW_5"
)

# Filter dataset to exclude non-starters (only include those who started the survey)
survey_data_filtered <- master_dataset_study2 %>%
  filter(!is.na(IPAddress)) %>%
  select(all_of(survey_questions))

# Count non-NA responses for each question
response_counts <- survey_data_filtered %>%
  summarise(across(everything(), ~sum(!is.na(.)), .names = "count_{.col}")) %>%
  pivot_longer(cols = everything(), names_to = "Question", values_to = "Respondents") %>%
  mutate(Question = gsub("count_", "", Question)) %>%
  arrange(Respondents)  # Arrange in ascending order to maintain left-to-right order


# Define survey sections with their starting positions for text labels:
survey_sections <- data.frame(
  Section = c("Opened survey", "Gambling involvement", "Gambling satisfaction", "PGSI",
              "Gambling Harm Measure", "Life satisfaction", "Activity statements",
              "Estimated expenditure", "Self-Control", "Demographics"),
  Start = c("IPAddress", "GAM_ACTIVITIES", "GAMB_SATISFACTION", "PGSI_Q5",
            "GHM_R2", "LIFE_SATISFACTION", "ACT2_4",
            "EST_NEXT_SPEND", "SC_7", "INCOME_VARIABILITY")
)

# Define section start and end positions for lines:
section_boundaries <- data.frame(
  Position = c( "GAM_ACCOUNTS",  "PGSI_Q1",
               "GHM_F1", "LIFE_SATISFACTION", "ACT1_TRACK",
               "EST_PAST_SPEND", "SC_1"),
  Type = "Start"
)

section_endings <- data.frame(
  Position = c("GAMB_SATISFACTION",
               "SC_13",
               "CBA_FW_5"),
  Type = "End"
)


# Combine for plotting
section_lines <- bind_rows(section_boundaries, section_endings)

# Convert positions to numeric for plotting
section_lines$Position <- as.numeric(factor(section_lines$Position, levels = survey_questions))


section_lines <- section_lines %>% 
mutate(Position = case_when(Position == 71 ~ 71.5,
                             TRUE ~  Position))

survey_sections_dot_colours <- data.frame(
  Section = c(
    rep("Start of survey", 1),
    rep("Gambling involvement", 3),
    rep("Gambling satisfaction", 1),
    rep("PGSI", 9),
    rep("Gambling Harm Measure", 11),
    rep("Life satisfaction", 1),
    rep("Activity statements", 8),
    rep("Estimated expenditure", 11),
    rep("Self-Control", 14),
    rep("Demographics", 12)  # No "End of survey"
  ),
  Question = c(
    # Start of survey
    "IPAddress",
    
    # Gambling involvement
    "GAM_ACTIVITIES", "GAM_HARM", "GAM_ACCOUNTS",
    
    # Gambling satisfaction
    "GAMB_SATISFACTION",
    
    # PGSI
    "PGSI_Q1", "PGSI_Q2", "PGSI_Q3", "PGSI_Q4", "PGSI_Q5", 
    "PGSI_Q6", "PGSI_Q7", "PGSI_Q8", "PGSI_Q9",
    
    # Gambling Harm Measure
    "GHM_F1", "GHM_P1", "ATTENTION_CHECK_1", "GHM_P2", "GHM_R1", 
    "GHM_R2", "GHM_PH1", "GHM_PH2", "GHM_WS1", "GHM_WS2", "GHM_L1",
    
    # Life satisfaction
    "LIFE_SATISFACTION",
    
    # Activity statements
    "ACT1_TRACK", "ACT2_1", "ACT2_2", "ACT2_3", "ACT2_4", "ACT2_5", "ACT2_6","ACT2_7",
    
    # Estimated expenditure
    "EST_PAST_SPEND", "EST_PAST_WIN", "EST_PAST_NET_RESULT", "EST_PAST_DEPOSIT", 
    "EST_PAST_WITHDRAW", "EST_NEXT_SPEND", "EST_NEXT_WIN", "EST_NEXT_NET_RESULT",
    "EST_NEXT_ACCEPTED_RESULT", "EST_NEXT_DEPOSIT", "EST_NEXT_WITHDRAW",
    
    # Self-Control
    "SC_1", "SC_2", "SC_3", "SC_4", "SC_5", "SC_6", "SC_7", "SC_8", "SC_9", 
    "SC_10", "SC_11", "ATTENTION_CHECK_2", "SC_12", "SC_13",
    
    # Demographics
    "EDUCATION", "EMPLOYMENT", "MARRIED", "NO_CHILDREN", "HOUSEHOLD_INCOME",
    "INCOME_VARIABILITY", "INCOME_PREDICTABILITY", "CBA_FW_1", "CBA_FW_2", "CBA_FW_3",
    "CBA_FW_4", "CBA_FW_5"
  )
)

# Ensure response_counts data is structured
response_counts <- response_counts %>%
  mutate(Question = factor(Question, levels = survey_questions)) %>% # Preserve order
  full_join(survey_sections_dot_colours)

# Order questions by survey flow (ensuring left-to-right reading)
response_counts$Question <- factor(response_counts$Question, levels = survey_questions)

# Ensure no zero values before applying log transformation
response_counts <- response_counts %>%
  mutate(Respondents = ifelse(Respondents == 0, 1, Respondents))  # Replaces 0 with 1


# Create a lookup table for section labels
section_labels <- setNames(survey_sections$Section, survey_sections$Start)


# Ensure Section is an ordered factor based on survey flow
response_counts$Section <- factor(response_counts$Section, levels = unique(survey_sections$Section))

# # Define custom gradient: Blue-Green to Blue
# num_sections <- length(unique(response_counts$Section))
# section_colors <- rev(colorRampPalette(c(brewer.pal(9, "GnBu")[6], brewer.pal(9, "Blues")[9]))(num_sections))

# Generate a perceptually uniform color scale using viridis
num_sections <- length(unique(response_counts$Section))
section_colors <- viridis(num_sections, option = "mako")  # Blue-Green to Blue

# Ensure factor levels match
response_counts <- response_counts %>%
  mutate(Section = factor(Section, levels = survey_sections$Section))


# Plot with ordered and reversed colors
main_plot<- ggplot(response_counts, aes(x = Question, y = Respondents, group = 1, color = Section)) +
  geom_line(color = "grey", size = 0.9) +  # Line plot
  
  # Dots colored by ordered survey section (now fading from dark to light)
  geom_point(size = 1.1) +
  
  # Apply ordered and reversed color scale to sections
  # scale_color_manual(values = setNames(section_colors, levels(response_counts$Section))) +
  scale_color_manual(values = setNames(section_colors, levels(response_counts$Section))) +
  # Add dashed vertical lines for section boundaries
  geom_vline(data = section_lines, aes(xintercept = Position - 0.5),
             color = "grey50", linetype = "dashed", size = 0.5) +
  
  # Replace x-axis labels with section labels at survey start points
  scale_x_discrete(labels = function(x) ifelse(x %in% names(section_labels), section_labels[x], "")) +

  # Add curved annotation line
  geom_curve(aes(x = 17.5, y = 2420, xend = 14.5, yend = 2420), 
             curvature = 0, arrow = arrow(length = unit(0.03, "npc"), type = "closed"),
             color = "black", size = 0.3) +
  
  # Add text labels
  annotate("text", x = 19.3, y = 2550, label = "Pre-defined\ncompletion\npoint", 
           size = 2, fontface = "plain", family = "Poppins") +
    annotate("text", x = 70, y = 2350, label = "End of survey", 
           angle = 90, size = 2, fontface = "plain", family = "Poppins") +
  
  labs(
    y = "Number of Respondents\n(log-10 scale)"
  ) +
    # Themes:
  plot_theme +
  theme(axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1.03),
        axis.title.y = element_text(vjust = 3),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 14),
        plot.margin = margin(2, 5, 1, 8),
        legend.position = "none")  


main_plot +
  labs(title = "Survey 2 flow with natural y-axis scale",
       y= "")

Show code
main_plot <- main_plot +   scale_y_log10(breaks = c(2000, 3000, 4000), labels = c("2000", "3000", "4000")) +
  labs(title = "Survey 2 flow with log-10 scale")

main_plot

Show code
main_plot <-  main_plot +   
  # Themes:
  plot_theme +
  theme(axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1.03),
        axis.title.y = element_text(vjust = 3),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text = element_text(size = 7),
        axis.title = element_text(size = 8),
        plot.margin = margin(2, 5, 1, 8),
        legend.position = "none")  


ggsave("Figures/Study 2 survey flow.pdf", 
       plot = main_plot, 
       device = cairo_pdf,  # Ensures smooth text rendering
       width = 16,          
       height = 8.5,         
       units = "cm",        
       dpi = 300)           

Compare samples

What are the general demographic and gambling characteristics of those who did and did not take part in the survey?

Before we can do this, we need to create weighted variable summaries to account for the oversampling of the at-risk population:
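As a quick standalone illustration (toy data, not part of the analysis) of what the weights do: if one group is oversampled, the weights restore each group's share of the population when computing summaries:

# Toy data: the second group is oversampled 4:1; the weights undo the imbalance
toy <- data.frame(score = c(rep(1, 10), rep(5, 40)),
                  w     = c(rep(4, 10), rep(1, 40)))

mean(toy$score) # unweighted mean: 4.2

toy_design <- survey::svydesign(id = ~1, data = toy, weights = ~w)
survey::svymean(~score, toy_design) # weighted mean: 3.0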

Show code
# Create survey design:
survey_design_all <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Create survey design objects for each STATUS group:
survey_design_non_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2, STATUS == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2, STATUS == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Get sample sizes for each group
participants_n <- sum(master_dataset_study2$STATUS == "Participants")
non_participants_n <- sum(master_dataset_study2$STATUS == "Non-participants")
overall_n <- nrow(master_dataset_study2)



# [Yes, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]

# To streamline this substantially, let's develop a function to calculate a weighted summary for continuous variables and put it all together nicely for tabling later on:
# Function to calculate weighted summary for continuous variables
weighted_summary_continuous <- function(variable, design_all, design_non_participants, design_participants, participants_n, non_participants_n, overall_n) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate means and standard deviations
  mean_all <- svymean(var_formula, design_all, na.rm = TRUE)
  mean_non_ps <- svymean(var_formula, design_non_participants, na.rm = TRUE)
  mean_ps <- svymean(var_formula, design_participants, na.rm = TRUE)
  
  sd_all <- sqrt(svyvar(var_formula, design_all, na.rm = TRUE))
  sd_non_ps <- sqrt(svyvar(var_formula, design_non_participants, na.rm = TRUE))
  sd_ps <- sqrt(svyvar(var_formula, design_participants, na.rm = TRUE))
  
  # Calculate weighted medians
  median_all <- svyquantile(var_formula, design_all, 0.5, na.rm = TRUE)
  median_non_ps <- svyquantile(var_formula, design_non_participants, 0.5, na.rm = TRUE)
  median_ps <- svyquantile(var_formula, design_participants, 0.5, na.rm = TRUE)
   
  # Format for table
  summary_df <- tibble(
    Variable = variable,
    `Overall` = paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),
    `Participants` = paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),
    `Non-participants` = paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2))
  ) %>%
  # Compute effect size
  mutate(
    pooled_sd = sqrt(((participants_n - 1) * sd_ps^2 + (non_participants_n - 1) * sd_non_ps^2) /
                     (participants_n + non_participants_n - 2)),
    cohen_d = (coef(mean_ps) - coef(mean_non_ps)) / pooled_sd,
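    # SE via the standard large-sample approximation for the sampling variance of Cohen's d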
    SE = sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) + (cohen_d^2 / (2 * (participants_n + non_participants_n)))),
    CI_lower = cohen_d - 1.96 * SE,
    CI_upper = cohen_d + 1.96 * SE,
    Effect_size = paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]")
  ) %>%
  select(Variable, Overall, Participants, `Non-participants`, Effect_size)
  
  return(summary_df)
}

# Apply function to continuous variables
continuous_vars <- c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", 
                     "PAS_6M_BET_FREQ_DAYS", "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", 
                     "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME", "PAS_6M_PERCENT_UNSOCIABLE_BETS", 
                     "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY", "PAS_6M_MEAN_DEPOSIT_AMOUNT", 
                     "PAS_6M_PERCENT_DECLINED_DEPOSITS", "PAS_6M_PERCENT_CANCEL_WITHDRAWS")

all_continuous_summaries <- map_dfr(continuous_vars, 
                                    ~weighted_summary_continuous(.x, survey_design_all, 
                                                                 survey_design_non_participants, survey_design_participants, 
                                                                 participants_n, non_participants_n, overall_n))


# Categorical variables:

# Let's also develop a function to calculate and format percentages for categorical variables that we can join to the above outcomes for tabling. 

# Note: this function computes Cramér's V effect size, which is typically used for categorical variables with more than two levels, but it also works for variables with only two levels (in which case it's equivalent to the phi coefficient).

# UPDATE: Let's just compute odds ratios for the two-level variables, based on a reviewer suggestion (thanks ChatGPT for your help with this one!):

categorical_summary <- function(variable, design_all, design_non_participants, design_participants) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate proportions for overall sample
  prop_table_all <- svytable(var_formula, design_all)
  props_all <- prop.table(prop_table_all) %>%
    as.data.frame() %>%
    mutate(Overall = Freq * 100) %>%
    select(Variable = !!sym(variable), Overall)
  
  # Calculate proportions for non-participants
  prop_table_non_ps <- svytable(var_formula, design_non_participants)
  props_non_ps <- prop.table(prop_table_non_ps) %>%
    as.data.frame() %>%
    mutate('Non-participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Non-participants')
  
  # Calculate proportions for participants
  prop_table_ps <- svytable(var_formula, design_participants)
  props_ps <- prop.table(prop_table_ps) %>%
    as.data.frame() %>%
    mutate('Participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Participants')
  
  # Combine results and format
  summary_df <- props_ps %>%
    left_join(props_non_ps, by = "Variable") %>%
    left_join(props_all, by = "Variable") %>%
    mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%
    replace(is.na(.), "")

  # Determine if variable is binary (i.e., only two levels)
  unique_values <- length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))
  
  if (unique_values == 2 & variable != "GENDER") {
    # Ensure STATUS is treated as a factor with "Non-participants" as reference
    design_all$variables$STATUS <- relevel(as.factor(design_all$variables$STATUS), ref = "Non-participants")

    # Compute Odds Ratio (OR) using survey-weighted logistic regression
    logistic_model <- tryCatch(
      svyglm(as.formula(paste0(variable, " ~ STATUS")), 
             design = design_all, family = quasibinomial()),
      error = function(e) NULL
    )

    if (!is.null(logistic_model)) {
      coef_name <- grep("STATUS", names(coef(logistic_model)), value = TRUE)
      if (length(coef_name) > 0) {
        OR <- exp(coef(logistic_model)[coef_name])
        CI <- exp(confint(logistic_model)[coef_name, ])
        effect_size_text <- paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]")
      } else {
        effect_size_text <- "NA"
      }
    } else {
      effect_size_text <- "NA"
    }

  } else {
    # Compute Cramér’s V for categorical variables with more than two levels
    combined_table <- as.table(rbind(prop_table_non_ps, prop_table_ps))
    V <- cramersV(combined_table)
    V_rounded <- as.matrix(V$output) %>% t() %>% as.data.frame() %>%
      mutate(cramersV = as.numeric(cramersV)) %>%
      mutate(cramersV = round(cramersV, 2))
    
    CI <- confIntV(combined_table)
    CI_info <- CI$intermediate
    CIs <- CI_info$fisherZ.ci %>%
      t() %>%
      as.data.frame() %>%
      mutate_if(is.numeric, round, 2) %>%
      unite('Effect_size', 1, 2, sep = ", ")
    
    effect_size_text <- paste0(V_rounded, " [", CIs, "]")
  }

  # Add effect size to the table
  header_row <- data.frame(Variable = variable, Overall = "", Participants = "", 
                           `Non-participants` = "", `Effect_size` = effect_size_text)

  summary_df <- bind_rows(header_row, summary_df) %>%
    as_tibble() %>%
    select(Variable, Overall, Participants, `Non-participants`, `Effect_size`)

  return(summary_df)
}



# Apply function to categorical variables
categorical_vars <- c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")
all_categorical_summaries <- map_dfr(categorical_vars, 
                                     ~categorical_summary(.x, survey_design_all, 
                                                          
                                                          survey_design_non_participants, survey_design_participants))

all_categorical_summaries <- all_categorical_summaries %>%
# Recode for consistency across summaries:
mutate(Variable = case_when(Variable==1
                   ~ "Yes", 
                   Variable==0
                   ~ "No", 
                   Variable==TRUE
                   ~ "Yes", 
                   Variable==FALSE
                   ~ "No",
                   TRUE ~ Variable))

# Combine continuous and categorical summaries
weighted_comparison_summary_data <- bind_rows(all_continuous_summaries, all_categorical_summaries) %>%
  rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols = "Participants") %>%
  rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols = "Non-participants") %>%
  rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols = "Overall")

# Order data for table (as we have duplicate names in the Variable column, this is a pain in the butt that needs to be done by separating and rejoining the variables in the right order)
gender_summary_tabled<- weighted_comparison_summary_data[15:18,1:5]
age_summary_tabled<- weighted_comparison_summary_data[1,1:5]
summary_table_extracted_vars_missing<- weighted_comparison_summary_data %>%
  filter(Variable !="AGE",
    Variable !="GENDER",
         Variable !="Male",
         Variable !="Female",
         Variable !="Unknown"
         )

# Join everything back together and provide some nice names for our variables:
weighted_comparison_summary_data_ordered <- bind_rows(age_summary_tabled,
          gender_summary_tabled,
          summary_table_extracted_vars_missing) %>%
   mutate(Variable = dplyr::recode(Variable,
    `AGE` = "Age",
    "GENDER" = "Gender",
    "IRSAD_SCORE" = "IRSAD score",
    "DAYS_REGISTERED" = "Days since registered",
    "DAYS_SINCE_LAST_BET" = "No. days since last bet",
    "PAS_6M_BET_FREQ_DAYS" = "No. betting days",
    "PAS_6M_BET_INTENSITY" = "Mean bets per active day",
    "PAS_6M_BET_MEAN_STAKE" = "Mean stake",
    "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" = "Mean net outcome per month",
    "PAS_6M_PERCENT_UNSOCIABLE_BETS" = "Percentage of bets midnight - 6AM",
    "PAS_6M_TOTAL_DEPOSIT_FREQ" = "No. deposits",
    "PAS_6M_DEPOSIT_INTENSITY" = "Mean no. deposits per active day",
    "PAS_6M_MEAN_DEPOSIT_AMOUNT" = "Mean deposit amount",
    "PAS_6M_PERCENT_DECLINED_DEPOSITS" = "Percentage of deposits declined",
    "PAS_6M_PERCENT_CANCEL_WITHDRAWS" = "Percentage of withdrawals cancelled",
    "EVER_USED_ANY_TOOL" = "Ever used a CPT",
    "ACTIVE_DEPOSIT_LIMIT" = "Active deposit limit",
    "RE_PAS_6M_ANY_TRIGGER" = "Flagged as 'at-risk' ≥ 1 times"
  ))


# Create final table:
weighted_comparison_summary_data_ordered %>%
  rename("Effect size" = Effect_size
         ) %>%
  gt() %>%
   tab_header(title = "Table 3.   Study 2 (past 6-month bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics") %>%
  tab_options(
    data_row.padding = px(1.5)
  ) %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramer's V [95% CIs])",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Table 3. Study 2 (past 6-month bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics
Variable Overall (N = 24793)1 Participants (N = 1956)1 Non-participants (N = 22837) Effect size2
Age 38.07 (14.73), 35 43.57 (15.51), 43 37.66 (14.59), 34 0.4 [0.36, 0.45]
Gender NA 0.03 [0.03, 0.04]
Female 17.23% 13.2% 17.53% NA
Male 81.96% 85.32% 81.71% NA
Unknown 0.81% 1.47% 0.76% NA
IRSAD score 1004.88 (75.63), 999.58 1005.21 (75.52), 1000.57 1004.86 (75.64), 999.58 0 [-0.04, 0.05]
Days since registered 1131.27 (904.71), 905 1337.27 (967.47), 1168 1115.93 (898), 892 0.24 [0.2, 0.29]
No. days since last bet 49.13 (46.06), 43 27.47 (39.71), 5 50.74 (46.09), 47 -0.51 [-0.56, -0.46]
No. betting days 22.45 (31.97), 9 39.59 (41.78), 25 21.17 (30.73), 8 0.58 [0.53, 0.63]
Mean bets per active day 7.36 (12.41), 4 10.34 (17.22), 5.59 7.14 (11.95), 4 0.26 [0.21, 0.3]
Mean stake 31.48 (163.93), 10 26.35 (107.14), 9.12 31.86 (167.38), 10 -0.03 [-0.08, 0.01]
Mean net outcome per month -131.9 (1864.76), -12.5 -242.26 (1191.8), -25 -123.68 (1905.15), -12.11 -0.06 [-0.11, -0.02]
Percentage of bets midnight - 6AM 4.03 (11.54), 0 4.19 (10.3), 0 4.02 (11.63), 0 0.02 [-0.03, 0.06]
No. deposits 31.6 (111.44), 4 64.86 (175.22), 9 29.12 (104.74), 4 0.32 [0.27, 0.37]
Mean no. deposits per active day 1.03 (1.28), 0.81 1.2 (1.69), 0.8 1.02 (1.24), 0.82 0.14 [0.09, 0.19]
Mean deposit amount 77.52 (321.94), 31.82 72.83 (202.78), 37.5 77.87 (329.1), 31.21 -0.02 [-0.06, 0.03]
Percentage of deposits declined 6.59 (15.05), 0 7.08 (15.25), 0 6.55 (15.03), 0 0.03 [-0.01, 0.08]
Percentage of withdrawals cancelled 1.81 (9.25), 0 3.06 (11.82), 0 1.72 (9.02), 0 0.15 [0.1, 0.19]
Ever used a CPT NA 1.47 [1.28, 1.69]
No 86.93% 82.33% 87.27% NA
Yes 13.07% 17.67% 12.73% NA
Active deposit limit NA 1.23 [1.03, 1.47]
No 91.58% 89.96% 91.7% NA
Yes 8.42% 10.04% 8.3% NA
Flagged as 'at-risk' ≥ 1 times NA 1.91 [1.73, 2.12]
No 97.51% 95.61% 97.65% NA
Yes 2.49% 4.39% 2.35% NA
1 Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
2 Continuous variables: Cohen's d [95% confidence intervals (CIs)]; categorical variables with two levels: odds ratios [95% CIs]; categorical variables with three levels (i.e., gender): Cramér's V [95% CIs]

PGSI

Also check the PGSI scores of the survey sample for inclusion in the table later on. For this, we will remove anyone who didn't pass the attention checks, to get the most reliable estimate:

Show code
survey_design_participants_filtered <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2, STATUS == "Participants" &
                  (ATTENTION_CHECK_1 == "PASSED" | is.na(ATTENTION_CHECK_1)) &
                  (ATTENTION_CHECK_2 == "PASSED" | is.na(ATTENTION_CHECK_2))),
  weights = ~ELIGIBLE_SURVEY_SAMPLE_WEIGHTS
)

# Weighted values:
mean_ps_PGSI <- svymean(~PGSI_TOTAL, survey_design_participants_filtered, na.rm = TRUE)
sd_ps_PGSI <- sqrt(svyvar(~PGSI_TOTAL, survey_design_participants_filtered, na.rm = TRUE))
median_ps_PGSI <- svyquantile(~PGSI_TOTAL, survey_design_participants_filtered, c(0.5), na.rm = TRUE)

# Extract desired values
mean_ps_PGSI_value <- coef(mean_ps_PGSI)
sd_ps_PGSI_value <- sd_ps_PGSI[1]
median_ps_PGSI_value <- coef(median_ps_PGSI) 

# Create sentence summarising results: 
summary_sentence_PGSI_study2 <- sprintf("The mean PGSI total score is %.2f (standard deviation = %.2f) and the median is %.2f.", 
                            mean_ps_PGSI_value, sd_ps_PGSI_value, median_ps_PGSI_value)

summary_sentence_PGSI_study2
[1] "The mean PGSI total score is 3.64 (standard deviation = 4.12) and the median is 2.00."

Non-weighted comparisons

For transparency, we have also computed non-weighted comparisons between groups using the tbl_summary() function, as used to do the same in Study 1 above:

Show code
master_dataset_study2 %>%
  tbl_summary(
    include = c(# Demographic details
      AGE, 
      GENDER, 
      IRSAD_SCORE,
      DAYS_REGISTERED, 
       # Wagering
      DAYS_SINCE_LAST_BET, 
      PAS_6M_BET_FREQ_DAYS,
      PAS_6M_BET_INTENSITY,
      PAS_6M_BET_MEAN_STAKE,
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
      PAS_6M_PERCENT_UNSOCIABLE_BETS,
      # Transactions
      PAS_6M_TOTAL_DEPOSIT_FREQ,
      PAS_6M_DEPOSIT_INTENSITY,
      PAS_6M_MEAN_DEPOSIT_AMOUNT,
      PAS_6M_PERCENT_DECLINED_DEPOSITS,
      PAS_6M_PERCENT_CANCEL_WITHDRAWS,
      # Use of consumer protection tools
      EVER_USED_ANY_TOOL,
      ACTIVE_DEPOSIT_LIMIT,
      RE_PAS_6M_ANY_TRIGGER),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous2() ~ c(
      "{median} [{p25} - {p75}]",
      "{mean} ({sd})")),
    by = STATUS,
    label = c(
      AGE ~ "Age",
      GENDER ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      DAYS_REGISTERED ~ "No. of days since accounts registered",
      DAYS_SINCE_LAST_BET ~ "No. days since last bet",
      PAS_6M_BET_FREQ_DAYS ~ "No. betting days",
      PAS_6M_BET_INTENSITY ~ "Mean bets per active day",
      PAS_6M_BET_MEAN_STAKE ~ "Mean stake",
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~ "Mean net outcome per month",
      PAS_6M_PERCENT_UNSOCIABLE_BETS ~ "Percentage of bets placed between midnight & 6AM",
      PAS_6M_TOTAL_DEPOSIT_FREQ ~ "No. deposits",
      PAS_6M_DEPOSIT_INTENSITY ~ "Mean no. deposits per active betting day",
      PAS_6M_MEAN_DEPOSIT_AMOUNT ~ "Mean deposit amount",
      PAS_6M_PERCENT_DECLINED_DEPOSITS ~ "Percentage of deposits declined",
      PAS_6M_PERCENT_CANCEL_WITHDRAWS ~ "Percentage of withdrawals cancelled",
      EVER_USED_ANY_TOOL ~ "Ever used a SG",
      ACTIVE_DEPOSIT_LIMIT ~ "Active deposit limit",
      RE_PAS_6M_ANY_TRIGGER ~ "Flagged as 'at-risk' ≥ 1 times")
  ) %>%
  add_difference(
    list(all_continuous() ~ "cohens_d",  # For continuous variables
    all_categorical() ~ "smd")       # For categorical variables, you can use standardized mean difference or any other suitable method
  ) %>%
  modify_header(label = "**Variable**") %>% 
  modify_caption("**Table 3. Comparison of participants and non-participants: Demographic and gambling characteristics**") %>%
  bold_labels() %>%
  as_gt() %>%
  tab_options(data_row.padding = px(1.5))
Table 3. Comparison of participants and non-participants: Demographic and gambling characteristics
Variable Participants, N = 1,9561 Non-participants, N = 22,8371 Difference2 95% CI2,3
Age

0.33 0.28, 0.37
    Median [IQR] 40 [30 - 54] 34 [26 - 47]

    Mean (SD) 42 (15) 38 (14)

Gender

0.14 0.10, 0.19
    Female 221 (11%) 3,570 (16%)

    Male 1,701 (87%) 19,054 (83%)

    Unknown 34 (1.7%) 213 (0.9%)

IRSAD score

-0.01 -0.05, 0.04
    Median [IQR] 1,001 [954 - 1,060] 1,001 [954 - 1,061]

    Mean (SD) 1,006 (75) 1,006 (76)

    Unknown 1 191

No. of days since accounts registered

0.24 0.20, 0.29
    Median [IQR] 1,188 [494 - 2,151] 927 [377 - 1,822]

    Mean (SD) 1,380 (992) 1,155 (915)

No. days since last bet

-0.52 -0.57, -0.47
    Median [IQR] 3 [2 - 24] 31 [4 - 66]

    Mean (SD) 22 (36) 45 (46)

No. betting days

0.63 0.58, 0.68
    Median [IQR] 38 [12 - 88] 12 [3 - 40]

    Mean (SD) 54 (49) 29 (38)

Mean bets per active day

0.30 0.25, 0.34
    Median [IQR] 7 [3 - 16] 4 [2 - 10]

    Mean (SD) 14 (23) 9 (17)

Mean stake

-0.04 -0.08, 0.01
    Median [IQR] 15 [6 - 45] 13 [6 - 38]

    Mean (SD) 68 (257) 83 (419)

Mean net outcome per month

-0.07 -0.11, -0.02
    Median [IQR] -65 [-710 - -4] -17 [-134 - -1]

    Mean (SD) -870 (2,846) -529 (5,139)

Percentage of bets placed between midnight & 6AM

0.04 -0.01, 0.08
    Median [IQR] 1 [0 - 5] 0 [0 - 3]

    Mean (SD) 5 (10) 5 (12)

No. deposits

0.43 0.39, 0.48
    Median [IQR] 26 [4 - 159] 6 [2 - 35]

    Mean (SD) 162 (354) 66 (206)

Mean no. deposits per active betting day

0.32 0.27, 0.36
    Median [IQR] 1.08 [0.39 - 2.49] 1.00 [0.33 - 1.60]

    Mean (SD) 1.99 (2.81) 1.36 (1.90)

Mean deposit amount

-0.03 -0.08, 0.02
    Median [IQR] 50 [22 - 119] 42 [15 - 101]

    Mean (SD) 148 (481) 173 (859)

Percentage of deposits declined

0.04 -0.01, 0.09
    Median [IQR] 1 [0 - 8] 0 [0 - 7]

    Mean (SD) 7 (14) 7 (15)

Percentage of withdrawals cancelled

0.21 0.16, 0.25
    Median [IQR] 0 [0 - 0] 0 [0 - 0]

    Mean (SD) 7 (18) 4 (14)

Ever used a CPT 439 (22%) 3,576 (16%) 0.17 0.13, 0.22
Active deposit limit 201 (10%) 2,059 (9.0%) 0.04 0.00, 0.09
Flagged as 'at-risk' ≥ 1 times 588 (30%) 4,202 (18%) 0.27 0.23, 0.32
1 n (%)
2 Cohen’s D; Standardized Mean Difference
3 CI = Confidence Interval

Sensitivity check comparison

How does the group comparison differ if we change the criterion for survey completion? Reviewers didn’t ask for this change specifically, but some suggestions led us to conduct these additional sensitivity checks.

For this, we will redefine participants as anyone who fully completed the survey, and non-participants as anyone who either didn't start the survey or didn't complete it in full. These will be weighted comparisons.

Show code
# Check coding of variable that tells us whether somebody fully completed the survey or not
master_dataset_study2 %>%
  count(STATUS,
           SURVEY_STATUS_FULL)
# A tibble: 4 × 3
  STATUS           SURVEY_STATUS_FULL     n
  <fct>            <chr>              <int>
1 Participants     Complete            1467
2 Participants     Incomplete           489
3 Non-participants Incomplete           186
4 Non-participants Not started        22651
Show code
master_dataset_study2_sensitivity_check   <- master_dataset_study2 %>%
  mutate(SURVEY_STATUS_FULL_BINARY = case_when(SURVEY_STATUS_FULL == "Incomplete" ~ "Non-participants",
                                               SURVEY_STATUS_FULL == "Not started" ~ "Non-participants",
         TRUE ~ "Participants")) %>%
 mutate(SURVEY_STATUS_FULL_BINARY = fct_relevel(SURVEY_STATUS_FULL_BINARY,
                            "Participants",
                            "Non-participants")) # fix order of presentation

# Check recoding has worked:
master_dataset_study2_sensitivity_check %>%
  count(STATUS,
           SURVEY_STATUS_FULL_BINARY)
# A tibble: 3 × 3
  STATUS           SURVEY_STATUS_FULL_BINARY     n
  <fct>            <fct>                     <int>
1 Participants     Participants               1467
2 Participants     Non-participants            489
3 Non-participants Non-participants          22837
Show code
# Update survey designs:
survey_design_all <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_sensitivity_check,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Create survey design objects for each STATUS group:
survey_design_non_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Get sample sizes for each group
participants_n <- sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY == "Participants")
non_participants_n <- sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY == "Non-participants")
overall_n <- nrow(master_dataset_study2_sensitivity_check)




# Produce updated table
# [Again, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]

# To streamline this substantially, let's develop a function to calculate a weighted summary for continuous variables and put it all together nicely for tabling later on:
# Function to calculate weighted summary for continuous variables
weighted_summary_continuous <- function(variable, design_all, design_non_participants, design_participants, participants_n, non_participants_n, overall_n) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate means and standard deviations
  mean_all <- svymean(var_formula, design_all, na.rm = TRUE)
  mean_non_ps <- svymean(var_formula, design_non_participants, na.rm = TRUE)
  mean_ps <- svymean(var_formula, design_participants, na.rm = TRUE)
  
  sd_all <- sqrt(svyvar(var_formula, design_all, na.rm = TRUE))
  sd_non_ps <- sqrt(svyvar(var_formula, design_non_participants, na.rm = TRUE))
  sd_ps <- sqrt(svyvar(var_formula, design_participants, na.rm = TRUE))
  
  # Calculate weighted medians
  median_all <- svyquantile(var_formula, design_all, 0.5, na.rm = TRUE)
  median_non_ps <- svyquantile(var_formula, design_non_participants, 0.5, na.rm = TRUE)
  median_ps <- svyquantile(var_formula, design_participants, 0.5, na.rm = TRUE)
   
  # Format for table
  summary_df <- tibble(
    Variable = variable,
    `Overall` = paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),
    `Participants` = paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),
    `Non-participants` = paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2))
  ) %>%
  # Compute effect size
  mutate(
    pooled_sd = sqrt(((participants_n - 1) * sd_ps^2 + (non_participants_n - 1) * sd_non_ps^2) /
                     (participants_n + non_participants_n - 2)),
    cohen_d = (coef(mean_ps) - coef(mean_non_ps)) / pooled_sd,
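    # SE of d below uses the standard large-sample approximation:
    # sqrt((n1 + n2)/(n1 * n2) + d^2 / (2 * (n1 + n2)))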
    SE = sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) + (cohen_d^2 / (2 * (participants_n + non_participants_n)))),
    CI_lower = cohen_d - 1.96 * SE,
    CI_upper = cohen_d + 1.96 * SE,
    Effect_size = paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]")
  ) %>%
  select(Variable, Overall, Participants, `Non-participants`, Effect_size)
  
  return(summary_df)
}

# Apply function to continuous variables
continuous_vars <- c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", 
                     "PAS_6M_BET_FREQ_DAYS", "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", 
                     "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME", "PAS_6M_PERCENT_UNSOCIABLE_BETS", 
                     "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY", "PAS_6M_MEAN_DEPOSIT_AMOUNT", 
                     "PAS_6M_PERCENT_DECLINED_DEPOSITS", "PAS_6M_PERCENT_CANCEL_WITHDRAWS")

all_continuous_summaries <- map_dfr(continuous_vars, 
                                    ~weighted_summary_continuous(.x, survey_design_all, 
                                                                 survey_design_non_participants, survey_design_participants, 
                                                                 participants_n, non_participants_n, overall_n))


# Categorical variables:

# Let's also develop a function to calculate and format percentages for categorical variables that we can join to the above outcomes for tabling. 

# Note: this function computes Cramér's V effect size, which is typically used for categorical variables with more than two levels, but it also works for variables with only two levels (in which case it's equivalent to the phi coefficient).

# UPDATE: Let's just compute odds ratios for the two-level variables based on a reviewer suggestion (thanks ChatGPT for your help with this one!):

categorical_summary <- function(variable, design_all, design_non_participants, design_participants) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate proportions for overall sample
  prop_table_all <- svytable(var_formula, design_all)
  props_all <- prop.table(prop_table_all) %>%
    as.data.frame() %>%
    mutate(Overall = Freq * 100) %>%
    select(Variable = !!sym(variable), Overall)
  
  # Calculate proportions for non-participants
  prop_table_non_ps <- svytable(var_formula, design_non_participants)
  props_non_ps <- prop.table(prop_table_non_ps) %>%
    as.data.frame() %>%
    mutate('Non-participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Non-participants')
  
  # Calculate proportions for participants
  prop_table_ps <- svytable(var_formula, design_participants)
  props_ps <- prop.table(prop_table_ps) %>%
    as.data.frame() %>%
    mutate('Participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Participants')
  
  # Combine results and format
  summary_df <- props_ps %>%
    left_join(props_non_ps, by = "Variable") %>%
    left_join(props_all, by = "Variable") %>%
    mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%
    replace(is.na(.), "")

  # Determine if variable is binary (i.e., only two levels)
  unique_values <- length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))
  
  if (unique_values == 2 & variable != "GENDER") {
    # Ensure STATUS is treated as a factor with "Non-participants" as reference
    design_all$variables$STATUS <- relevel(as.factor(design_all$variables$STATUS), ref = "Non-participants")

    # Compute Odds Ratio (OR) using survey-weighted logistic regression
    logistic_model <- tryCatch(
      svyglm(as.formula(paste0(variable, " ~ STATUS")), 
             design = design_all, family = quasibinomial()),
      error = function(e) NULL
    )

    if (!is.null(logistic_model)) {
      coef_name <- grep("STATUS", names(coef(logistic_model)), value = TRUE)
      if (length(coef_name) > 0) {
        OR <- exp(coef(logistic_model)[coef_name])
        CI <- exp(confint(logistic_model)[coef_name, ])
        effect_size_text <- paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]")
      } else {
        effect_size_text <- "NA"
      }
    } else {
      effect_size_text <- "NA"
    }

  } else {
    # Compute Cramér’s V for categorical variables with more than two levels
    combined_table <- as.table(rbind(prop_table_non_ps, prop_table_ps))
    V <- cramersV(combined_table)
    V_rounded <- as.matrix(V$output) %>% t() %>% as.data.frame() %>%
      mutate(cramersV = as.numeric(cramersV)) %>%
      mutate(cramersV = round(cramersV, 2))
    
    CI <- confIntV(combined_table)
    CI_info <- CI$intermediate
    CIs <- CI_info$fisherZ.ci %>%
      t() %>%
      as.data.frame() %>%
      mutate_if(is.numeric, round, 2) %>%
      unite('Effect_size', 1, 2, sep = ", ")
    
    effect_size_text <- paste0(V_rounded, " [", CIs, "]")
  }

  # Add effect size to the table
  header_row <- data.frame(Variable = variable, Overall = "", Participants = "", 
                           `Non-participants` = "", `Effect_size` = effect_size_text,
                           check.names = FALSE) # Keep the "Non-participants" column name intact so it binds cleanly with summary_df

  summary_df <- bind_rows(header_row, summary_df) %>%
    as_tibble() %>%
    select(Variable, Overall, Participants, `Non-participants`, `Effect_size`)

  return(summary_df)
}



# Apply function to categorical variables
categorical_vars <- c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")
all_categorical_summaries <- map_dfr(categorical_vars, 
                                     ~categorical_summary(.x, survey_design_all, 
                                                          
                                                          survey_design_non_participants, survey_design_participants))

all_categorical_summaries <- all_categorical_summaries %>%
# Recode for consistency across summaries:
mutate(Variable = case_when(Variable==1
                   ~ "Yes", 
                   Variable==0
                   ~ "No", 
                   Variable==TRUE
                   ~ "Yes", 
                   Variable==FALSE
                   ~ "No",
                   TRUE ~ Variable))

# Combine continuous and categorical summaries
weighted_comparison_summary_data <- bind_rows(all_continuous_summaries, all_categorical_summaries) %>%
  rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols = "Participants") %>%
  rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols = "Non-participants") %>%
  rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols = "Overall")

# Order data for table (as we have duplicate names in the Variable column, this is a pain in the butt that needs to be done by separating and rejoining the variables in the right order)
gender_summary_tabled <- weighted_comparison_summary_data[15:18,1:5]
age_summary_tabled <- weighted_comparison_summary_data[1,1:5]
summary_table_extracted_vars_missing<- weighted_comparison_summary_data %>%
  filter(Variable !="AGE",
    Variable !="GENDER",
         Variable !="Male",
         Variable !="Female",
         Variable !="Unknown"
         )

# Join everything back together and provide some nice names for our variables:
weighted_comparison_summary_data_ordered <- bind_rows(age_summary_tabled,
          gender_summary_tabled,
          summary_table_extracted_vars_missing) %>%
   mutate(Variable = dplyr::recode(Variable,
    `AGE` = "Age",
    "GENDER" = "Gender",
    "IRSAD_SCORE" = "IRSAD score",
    "DAYS_REGISTERED" = "Days since registered",
    "DAYS_SINCE_LAST_BET" = "No. days since last bet",
    "PAS_6M_BET_FREQ_DAYS" = "No. betting days",
    "PAS_6M_BET_INTENSITY" = "Mean bets per active day",
    "PAS_6M_BET_MEAN_STAKE" = "Mean stake",
    "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" = "Mean net outcome per month",
    "PAS_6M_PERCENT_UNSOCIABLE_BETS" = "Percentage of bets midnight - 6AM",
    "PAS_6M_TOTAL_DEPOSIT_FREQ" = "No. deposits",
    "PAS_6M_DEPOSIT_INTENSITY" = "Mean no. deposits per active day",
    "PAS_6M_MEAN_DEPOSIT_AMOUNT" = "Mean deposit amount",
    "PAS_6M_PERCENT_DECLINED_DEPOSITS" = "Percentage of deposits declined",
    "PAS_6M_PERCENT_CANCEL_WITHDRAWS" = "Percentage of withdrawals cancelled",
    "EVER_USED_ANY_TOOL" = "Ever used a CPT",
    "ACTIVE_DEPOSIT_LIMIT" = "Active deposit limit",
    "RE_PAS_6M_ANY_TRIGGER" = "Flagged as 'at-risk' ≥ 1 times"
  ))


# Create final table:
weighted_comparison_summary_data_ordered %>%
  rename("Effect size" = Effect_size
         ) %>%
  gt() %>%
   tab_header(title = "Table 3.1   Study 2 (past 6-month bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]") %>%
  tab_options(
    data_row.padding = px(1.5)
  ) %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with three levels (i.e., gender): Cramér's V [95% CIs]",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Table 3.1 Study 2 (past 6-month bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]
Variable Overall (N = 24793)¹ Participants (N = 1467)¹ Non-participants (N = 23326) Effect size²
Age 38.07 (14.73), 35 42.82 (14.78), 41 37.8 (14.69), 34 0.34 [0.29, 0.39]
Gender 0.03 [0.03, 0.04]
Female 17.23% 12.55% 17.5% NA
Male 81.96% 85.99% 81.73% NA
Unknown 0.81% 1.46% 0.77% NA
IRSAD score 1004.88 (75.63), 999.58 1007.88 (75.44), 1003.14 1004.71 (75.64), 998.46 0.04 [-0.01, 0.09]
Days since registered 1131.27 (904.71), 905 1362.54 (967.72), 1179 1118.05 (899.21), 894 0.27 [0.22, 0.32]
No. days since last bet 49.13 (46.06), 43 27.75 (40.04), 5 50.35 (46.08), 45 -0.49 [-0.55, -0.44]
No. betting days 22.45 (31.97), 9 38.78 (41.11), 24 21.51 (31.11), 8 0.54 [0.49, 0.6]
Mean bets per active day 7.36 (12.41), 4 9.73 (16.37), 5.38 7.22 (12.14), 4 0.2 [0.15, 0.25]
Mean stake 31.48 (163.93), 10 25.43 (89.58), 8.89 31.83 (167.18), 10 -0.04 [-0.09, 0.01]
Mean net outcome per month -131.9 (1864.76), -12.5 -204.51 (1078.02), -22.71 -127.75 (1899.84), -12.33 -0.04 [-0.09, 0.01]
Percentage of bets midnight - 6AM 4.03 (11.54), 0 4.36 (10.84), 0 4.01 (11.58), 0 0.03 [-0.02, 0.08]
No. deposits 31.6 (111.44), 4 58.58 (155.16), 9 30.06 (108.21), 4 0.26 [0.2, 0.31]
Mean no. deposits per active day 1.03 (1.28), 0.81 1.13 (1.64), 0.75 1.02 (1.26), 0.82 0.08 [0.03, 0.13]
Mean deposit amount 77.52 (321.94), 31.82 72.69 (208.96), 37.5 77.8 (327.23), 31.25 -0.02 [-0.07, 0.04]
Percentage of deposits declined 6.59 (15.05), 0 6.62 (14.46), 0 6.59 (15.08), 0 0 [-0.05, 0.06]
Percentage of withdrawals cancelled 1.81 (9.25), 0 2.99 (12.02), 0 1.74 (9.06), 0 0.14 [0.08, 0.19]
Ever used a CPT 1.47 [1.28, 1.69]
No 86.93% 82.97% 87.15% NA
Yes 13.07% 17.03% 12.85% NA
Active deposit limit 1.23 [1.03, 1.47]
No 91.58% 90.39% 91.65% NA
Yes 8.42% 9.61% 8.35% NA
Flagged as 'at-risk' ≥ 1 times 1.91 [1.73, 2.12]
No 97.51% 96.25% 97.58% NA
Yes 2.49% 3.75% 2.42% NA
¹ Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
² Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with three levels (i.e., gender): Cramér's V [95% CIs]

Note that this table says that there are 1467 people who completed the survey in full. This differs from the 1470 people listed as completing the survey in Table 1 of our manuscript because some people have been removed from this total due to ineligibility.
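A quick way to confirm this count from the recoded status variable above (a sketch; the figure of 1470 comes from the manuscript, not this dataset):

# Sketch: count eligible customers who fully completed the survey
master_dataset_study2_sensitivity_check %>%
  filter(SURVEY_STATUS_FULL_BINARY == "Participants") %>%
  nrow() # 1467 = 1470 completers minus those removed as ineligible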

Show code
master_dataset_study2 %>%
  tbl_summary(
    include = c(# Demographic details
      AGE, 
      GENDER, 
      IRSAD_SCORE,
      DAYS_REGISTERED, 
       # Wagering
      DAYS_SINCE_LAST_BET, 
      PAS_6M_BET_FREQ_DAYS,
      PAS_6M_BET_INTENSITY,
      PAS_6M_BET_MEAN_STAKE,
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
      PAS_6M_PERCENT_UNSOCIABLE_BETS,
      # Transactions
      PAS_6M_TOTAL_DEPOSIT_FREQ,
      PAS_6M_DEPOSIT_INTENSITY,
      PAS_6M_MEAN_DEPOSIT_AMOUNT,
      PAS_6M_PERCENT_DECLINED_DEPOSITS,
      PAS_6M_PERCENT_CANCEL_WITHDRAWS,
      # Use of consumer protection tools
      EVER_USED_ANY_TOOL,
      ACTIVE_DEPOSIT_LIMIT,
      RE_PAS_6M_ANY_TRIGGER),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous2() ~ c(
      "{median} [{p25} - {p75}]",
      "{mean} ({sd})")),
    by = SURVEY_STATUS_FULL,
    label = c(
      AGE ~ "Age",
      GENDER ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      DAYS_REGISTERED ~ "No. of days since accounts registered",
      DAYS_SINCE_LAST_BET ~ "No. days since last bet",
      PAS_6M_BET_FREQ_DAYS ~ "No. betting days",
      PAS_6M_BET_INTENSITY ~ "Mean bets per active day",
      PAS_6M_BET_MEAN_STAKE ~ "Mean stake",
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~ "Mean net outcome per month",
      PAS_6M_PERCENT_UNSOCIABLE_BETS ~ "Percentage of bets placed between midnight & 6AM",
      PAS_6M_TOTAL_DEPOSIT_FREQ ~ "No. deposits",
      PAS_6M_DEPOSIT_INTENSITY ~ "Mean no. deposits per active betting day",
      PAS_6M_MEAN_DEPOSIT_AMOUNT ~ "Mean deposit amount",
      PAS_6M_PERCENT_DECLINED_DEPOSITS ~ "Percentage of deposits declined",
      PAS_6M_PERCENT_CANCEL_WITHDRAWS ~ "Percentage of withdrawals cancelled",
      EVER_USED_ANY_TOOL ~ "Ever used a SG",
      ACTIVE_DEPOSIT_LIMIT ~ "Active deposit limit",
      RE_PAS_6M_ANY_TRIGGER ~ "Flagged as 'at-risk' ≥ 1 times")
  ) %>%
  modify_header(label = "**Variable**") %>% 
  modify_caption("Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey") %>%
  bold_labels() 
Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn’t complete it, and those who didn’t start the survey
Variable Complete, N = 1,467¹ Incomplete, N = 675¹ Not started, N = 22,651¹
Age


    Median [IQR] 39 [30 - 52] 42 [28 - 56] 34 [26 - 47]
    Mean (SD) 42 (14) 43 (16) 38 (14)
Gender


    Female 153 (10%) 85 (13%) 3,553 (16%)
    Male 1,290 (88%) 577 (85%) 18,888 (83%)
    Unknown 24 (1.6%) 13 (1.9%) 210 (0.9%)
IRSAD score


    Median [IQR] 1,004 [955 - 1,062] 990 [949 - 1,056] 1,001 [954 - 1,061]
    Mean (SD) 1,008 (75) 1,001 (74) 1,006 (76)
    Unknown 1 1 190
No. of days since accounts registered


    Median [IQR] 1,213 [510 - 2,179] 1,100 [447 - 2,043] 927 [375 - 1,822]
    Mean (SD) 1,396 (986) 1,319 (1,005) 1,154 (915)
No. days since last bet


    Median [IQR] 4 [2 - 27] 3 [1 - 15] 32 [4 - 66]
    Mean (SD) 23 (37) 17 (31) 46 (46)
No. betting days


    Median [IQR] 36 [11 - 84] 50 [18 - 101] 12 [3 - 39]
    Mean (SD) 51 (47) 63 (51) 29 (38)
Mean bets per active day


    Median [IQR] 7 [3 - 15] 9 [4 - 20] 4 [2 - 10]
    Mean (SD) 13 (21) 16 (25) 9 (17)
Mean stake


    Median [IQR] 13 [5 - 42] 20 [7 - 60] 13 [6 - 38]
    Mean (SD) 61 (206) 90 (331) 83 (420)
Mean net outcome per month


    Median [IQR] -52 [-551 - -2] -151 [-1,199 - -10] -17 [-131 - -1]
    Mean (SD) -740 (2,656) -1,238 (3,122) -524 (5,155)
Percentage of bets placed between midnight & 6AM


    Median [IQR] 1 [0 - 5] 1 [0 - 5] 0 [0 - 3]
    Mean (SD) 5 (11) 5 (9) 5 (12)
No. deposits


    Median [IQR] 20 [4 - 136] 54 [8 - 272] 6 [1 - 34]
    Mean (SD) 142 (316) 225 (499) 64 (199)
Mean no. deposits per active betting day


    Median [IQR] 1.00 [0.33 - 2.19] 1.50 [0.67 - 3.22] 1.00 [0.33 - 1.60]
    Mean (SD) 1.85 (2.70) 2.46 (3.48) 1.35 (1.86)
Mean deposit amount


    Median [IQR] 50 [22 - 114] 65 [26 - 150] 41 [15 - 100]
    Mean (SD) 143 (503) 166 (380) 173 (862)
Percentage of deposits declined


    Median [IQR] 0 [0 - 8] 3 [0 - 10] 0 [0 - 7]
    Mean (SD) 7 (14) 9 (15) 7 (15)
Percentage of withdrawals cancelled


    Median [IQR] 0 [0 - 0] 0 [0 - 8] 0 [0 - 0]
    Mean (SD) 7 (17) 10 (21) 4 (14)
Ever used a SG 315 (21%) 174 (26%) 3,526 (16%)
Active deposit limit 147 (10%) 76 (11%) 2,037 (9.0%)
Flagged as 'at-risk' ≥ 1 times 392 (27%) 277 (41%) 4,121 (18%)
¹ n (%)

Again, it appears to be because the group who started but didn't complete the survey are more involved/intense bettors than any other group.

Composite risk scores

Now let’s produce a single composite score that represents people’s past-six-month gambling behaviour.

One thing we need to consider here is that some of the variables we’ve created could be biased by recent sign-up to the site. For example, one of the variables we created to summarise past-six-month behaviour is the total number of days somebody gambled, while another is the total amount somebody deposited. Both of these totals could be deflated for people who registered with the company more recently. By contrast, variables that reflect averages, or daily or monthly summaries, will not be affected by this.
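To make the distinction concrete, here is a minimal sketch of how a total could be converted into a sign-up-robust monthly rate (illustrative only: the months_observed helper is hypothetical, and the actual monthly/average variables were produced in our pre-processing scripts):

# Illustrative sketch only: turn a six-month total into a monthly rate so that
# recently registered customers aren't penalised by a short observation window
master_dataset_study2 %>%
  mutate(
    months_observed = pmin(6, pmax(DAYS_REGISTERED, 1) / 30.44), # hypothetical helper: months observable, capped at six
    monthly_bet_days_example = PAS_6M_BET_FREQ_DAYS / months_observed # a rate comparable across sign-up dates
  )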

Let’s first identify all the variables we engineered that summarise past-six-month behaviour, then classify them based on whether they will be subject to this bias or not:

Show code
PAS_6M_variables<- master_dataset_study2 %>% 
  select(
         contains("PAS_6M"), # Identify all relevant past six months aggregate variables
         -contains("RE_"), # Remove risk identification system variables
         -contains("INCOME") # Remove variables relating  to percentage of income which are only available for survey participants
         ) %>%
  colnames() %>%
  as.data.frame() %>%
  rename(Variables = 1) %>%
  mutate(`Biased by sign-up date` = case_when(
                     Variables == "PAS_6M_BET_FREQ_DAYS" ~ "Yes", 
                     Variables == "PAS_6M_BET_FREQ_BETS" ~ "Yes", 
                     Variables == "PAS_6M_BET_MEAN_STAKE" ~ "No", 
                     Variables == "PAS_6M_BET_SD_STAKE" ~ "No", 
                     Variables == "PAS_6M_BET_SPEND" ~ "Yes", 
                     Variables == "PAS_6M_MEAN_ODDS" ~ "Yes", 
                     Variables == "PAS_6M_N_ACTIVITIES" ~ "No", 
                     Variables == "PAS_6M_BET_INTENSITY" ~ "No", 
                     Variables == "PAS_6M_BET_WIN" ~ "Yes", 
                     Variables == "PAS_6M_NET_OUTCOME" ~ "Yes", 
                     Variables == "PAS_6M_MAX_DAY_STAKE" ~ "No", 
                     Variables == "PAS_6M_MAX_DAY_BETS" ~ "No", 
                     Variables == "PAS_6M_MEAN_DAILY_STAKE" ~ "No", 
                     Variables == "PAS_6M_SD_DAILY_STAKE" ~ "No", 
                     Variables == "PAS_6M_TOTAL_UNSOCIABLE_BETS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_UNSOCIABLE_BETS" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_BET_FREQ_DAYS" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_SPEND" ~ "No", 
                     Variables == "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" ~ "No",
                     Variables == "PAS_6M_TOTAL_DEPOSIT_AMOUNT" ~ "Yes", 
                     Variables == "PAS_6M_MEAN_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_SD_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_TOTAL_DEPOSIT_FREQ" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_DECLINED_DEPOSITS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_DECLINED_DEPOSITS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_DEPOSIT_METHODS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_FREQ" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_AMOUNT" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_CANCEL_WITHDRAWS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_CANCEL_WITHDRAWS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_METHODS" ~ "No", 
                     Variables == "PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_DEPOSIT_INTENSITY" ~ "No",
                     Variables == "PAS_6M_BET_FREQ_INCL_LEGS" ~ "Yes", # total count, so affected by sign-up date
                     Variables == "PAS_6M_MONTHLY_DEPOSIT_FREQ" ~ "No", # monthly rate, so unaffected
                     TRUE ~ Variables)) # fallback: any still-unclassified variable shows its own name

# Table variables:
PAS_6M_variables %>%
  gt()%>%
  tab_header("All past-6-month aggregate summary variables", subtitle = "With risk of bias status for each variable")
All past-6-month aggregate summary variables
With risk of bias status for each variable
Variables Biased by sign-up date
PAS_6M_BET_FREQ_DAYS Yes
PAS_6M_BET_FREQ_BETS Yes
PAS_6M_BET_FREQ_INCL_LEGS Yes
PAS_6M_BET_MEAN_STAKE No
PAS_6M_BET_SD_STAKE No
PAS_6M_BET_SPEND Yes
PAS_6M_MEAN_ODDS Yes
PAS_6M_N_ACTIVITIES No
PAS_6M_BET_INTENSITY No
PAS_6M_BET_WIN Yes
PAS_6M_NET_OUTCOME Yes
PAS_6M_MAX_DAY_STAKE No
PAS_6M_MAX_DAY_BETS No
PAS_6M_MEAN_DAILY_STAKE No
PAS_6M_SD_DAILY_STAKE No
PAS_6M_TOTAL_UNSOCIABLE_BETS Yes
PAS_6M_PERCENT_UNSOCIABLE_BETS No
PAS_6M_MONTHLY_BET_FREQ_DAYS No
PAS_6M_MONTHLY_SPEND No
PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME No
PAS_6M_TOTAL_DEPOSIT_AMOUNT Yes
PAS_6M_MEAN_DEPOSIT_AMOUNT No
PAS_6M_SD_DEPOSIT_AMOUNT No
PAS_6M_TOTAL_DEPOSIT_FREQ Yes
PAS_6M_TOTAL_DECLINED_DEPOSITS Yes
PAS_6M_PERCENT_DECLINED_DEPOSITS No
PAS_6M_TOTAL_DEPOSIT_METHODS No
PAS_6M_TOTAL_WITHDRAW_FREQ Yes
PAS_6M_TOTAL_WITHDRAW_AMOUNT Yes
PAS_6M_TOTAL_CANCEL_WITHDRAWS Yes
PAS_6M_PERCENT_CANCEL_WITHDRAWS No
PAS_6M_TOTAL_WITHDRAW_METHODS No
PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT No
PAS_6M_MONTHLY_DEPOSIT_AMOUNT No
PAS_6M_MONTHLY_DEPOSIT_FREQ No
PAS_6M_DEPOSIT_INTENSITY No

Okay, now that we have all of the relevant variables, let’s look at how they correlate with PGSI and harm scores using a series of bivariate correlations visualised as a heatmap. These results are unweighted:

Show code
PGSI_cor_data<- master_dataset_study2 %>% 
    select(PGSI_TOTAL,
        GHM_TOTAL,
       PAS_6M_BET_MEAN_STAKE, 
       PAS_6M_BET_SD_STAKE, 
       PAS_6M_N_ACTIVITIES, 
       PAS_6M_BET_INTENSITY, 
       PAS_6M_MAX_DAY_STAKE, 
       PAS_6M_MAX_DAY_BETS, 
       PAS_6M_MEAN_DAILY_STAKE, 
       PAS_6M_SD_DAILY_STAKE, 
       PAS_6M_PERCENT_UNSOCIABLE_BETS, 
       PAS_6M_MONTHLY_BET_FREQ_DAYS, 
       PAS_6M_MONTHLY_SPEND, 
       PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
       PAS_6M_MEAN_DEPOSIT_AMOUNT, 
       PAS_6M_SD_DEPOSIT_AMOUNT, 
       PAS_6M_PERCENT_DECLINED_DEPOSITS, 
       PAS_6M_TOTAL_DEPOSIT_METHODS, 
       PAS_6M_PERCENT_CANCEL_WITHDRAWS, 
       PAS_6M_TOTAL_WITHDRAW_METHODS, 
       PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT, 
       PAS_6M_MONTHLY_DEPOSIT_AMOUNT, 
       PAS_6M_DEPOSIT_INTENSITY)

ggstatsplot::ggcorrmat(PGSI_cor_data, 
                       type = "parametric",
                       results.subtitle = FALSE,
                       ggcorrplot.args = c(lab_size = 3)) +
  theme(axis.text.y = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(axis.text.x = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(legend.title=element_text(color = "black", size = 12, face = "plain", family = "Poppins")) +
  theme(legend.title.align = .5) +
  theme(legend.text=element_text(color = "black", size = 8, face = "plain", family = "Poppins"))

Now let’s use all of these variables to create a composite index score of risk/involvement:

Show code
master_dataset_study2_with_risk_scores <- master_dataset_study2 %>%
  mutate(PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME = PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME*-1) %>% # Invert net outcome so that losses score positively (higher = greater losses)
  mutate(across(c(PAS_6M_BET_MEAN_STAKE, 
                  PAS_6M_BET_SD_STAKE, 
                  PAS_6M_N_ACTIVITIES, 
                  PAS_6M_BET_INTENSITY, 
                  PAS_6M_MAX_DAY_STAKE, 
                  PAS_6M_MAX_DAY_BETS, 
                  PAS_6M_MEAN_DAILY_STAKE, 
                  PAS_6M_SD_DAILY_STAKE, 
                  PAS_6M_PERCENT_UNSOCIABLE_BETS, 
                  PAS_6M_MONTHLY_BET_FREQ_DAYS, 
                  PAS_6M_MONTHLY_SPEND, 
                  PAS_6M_MEAN_DEPOSIT_AMOUNT, 
                  PAS_6M_SD_DEPOSIT_AMOUNT, 
                  PAS_6M_PERCENT_DECLINED_DEPOSITS, 
                  PAS_6M_TOTAL_DEPOSIT_METHODS, 
                  PAS_6M_PERCENT_CANCEL_WITHDRAWS, 
                  PAS_6M_TOTAL_WITHDRAW_METHODS, 
                  PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT, 
                  PAS_6M_MONTHLY_DEPOSIT_AMOUNT, 
                  PAS_6M_DEPOSIT_INTENSITY), 
                scale, # scale function computes z-scores
                .names = "z_{.col}")) %>% # Create new named columns for easy selection below
  rowwise() %>% # Group by Rows
  mutate(`Composite Risk Score` = sum(c_across(starts_with("z_")), na.rm = TRUE)) %>% # Sum across all newly created Z scores
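  # Note: na.rm = TRUE means a missing component z-score simply contributes nothing to that person's sum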
  ungroup()


# colnames(master_dataset_study2_with_risk_scores)

Correlations

Now let’s look at how well this composite score correlates with PGSI scores in the survey sample:

Show code
master_dataset_study2_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>%
   mutate(PGSI_CATEGORY = factor(PGSI_CATEGORY, levels = c('No risk',
                                                              'Low risk',
                                                              'Moderate risk',
                                                              'High risk'))) %>%
ggplot() +
   geom_point(aes(x = PGSI_TOTAL, y = `Composite Risk Score`,
                   colour = PGSI_CATEGORY)) +
   geom_smooth(aes(x = PGSI_TOTAL, y = `Composite Risk Score`), 
               method = "lm",
               se = FALSE,
               color = "black") +
   scale_colour_manual(values = c("#2e4057",
                                  "#00798c", 
                                  "#edae49",
                                  "#d1495b")) +
   geom_hline(yintercept = 0, colour = "grey",  linetype="dashed") +
   scale_y_continuous(limits = c(-10,120)) +
   plot_theme +
  guides(color= guide_legend("PGSI category")) +
   labs(x = "PGSI total score") +
  scale_x_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 27))

Perform non-weighted correlation:

Show code
cor_PGSI <- cor.test(master_dataset_study2_with_risk_scores$PGSI_TOTAL, 
                         master_dataset_study2_with_risk_scores$`Composite Risk Score`,
                         method = "spearman")

cor_PGSI

    Spearman's rank correlation rho

data:  master_dataset_study2_with_risk_scores$PGSI_TOTAL and master_dataset_study2_with_risk_scores$`Composite Risk Score`
S = 850963267, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho 
0.317731 

Perform weighted correlation:

Show code
# This doesn't work nicely with NA values so let's remove them first:
master_dataset_study2_with_risk_scores_weighted_cor_data <- master_dataset_study2_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL))

weighted_cor_PGSI <- weightedCorr(x = master_dataset_study2_with_risk_scores_weighted_cor_data$PGSI_TOTAL, 
             y = master_dataset_study2_with_risk_scores_weighted_cor_data$`Composite Risk Score`, 
              method = "Spearman",
             weights = master_dataset_study2_with_risk_scores_weighted_cor_data$ALL_SAMPLE_WEIGHTS)

weighted_cor_PGSI
[1] 0.288293

Confirm how many people are involved in these calculations:

Show code
master_dataset_study2_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>% nrow()
[1] 1956

T-test

Perform a t-test to compare the composite risk scores of those who did and didn’t take part in the survey.

Unweighted t-test:

Show code
# Perform Welch's t-test
t_test_composite_study2 <- t.test(`Composite Risk Score` ~ STATUS, data = master_dataset_study2_with_risk_scores, var.equal = FALSE)

t_test_composite_study2

    Welch Two Sample t-test

data:  Composite Risk Score by STATUS
t = 15.233, df = 2368.3, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Participants and group Non-participants is not equal to 0
95 percent confidence interval:
 3.103619 4.020736
sample estimates:
    mean in group Participants mean in group Non-participants 
                     3.2811454                     -0.2810317 

And now a weighted t-test:

Show code
# Create survey design object for the whole sample:
survey_design_all_participants <- svydesign(
  id = ~1,
  data = master_dataset_study2_with_risk_scores,
  weights = ~ALL_SAMPLE_WEIGHTS
)

weighted_t_test_composite_study2 <-survey::svyttest(`Composite Risk Score` ~ STATUS, data = master_dataset_study2_with_risk_scores, design = survey_design_all_participants)

weighted_t_test_composite_study2

    Design-based t-test

data:  `Composite Risk Score` ~ STATUS
t = -16.864, df = 24791, p-value < 2.2e-16
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 -2.546679 -2.016333
sample estimates:
difference in mean 
         -2.281506 

Compute non-weighted summary scores and effect size:

Show code
# Calculate means and standard deviations
summary_composite_study2 <- master_dataset_study2_with_risk_scores %>%
  group_by(STATUS) %>%
  dplyr::summarize(
    mean = mean(`Composite Risk Score`),
    sd = sd(`Composite Risk Score`),
    n = n()
  ) 

gt(summary_composite_study2)
STATUS mean sd n
Participants 3.2811454 9.855798 1956
Non-participants -0.2810317 10.708913 22837
Show code
# Calculate Cohen's d with 95% confidence interval
cohen_d_result_study2 <- cohen.d(`Composite Risk Score` ~ STATUS, data = master_dataset_study2_with_risk_scores)

cohen_d_result_study2

Cohen's d

d estimate: 0.3346615 (small)
95 percent confidence interval:
    lower     upper 
0.2883902 0.3809328 

Compute weighted summary score and effect size:

Show code
# Rename the risk score as the statistic function doesn't like the spaces:
master_dataset_study2_with_risk_scores_temp_risk_name_change<- master_dataset_study2_with_risk_scores %>%
  rename(risk_score = `Composite Risk Score`)


# Create survey design objects for each STATUS group that include the composite risk scores:
survey_design_all_t <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_with_risk_scores_temp_risk_name_change,
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_non_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_with_risk_scores_temp_risk_name_change, STATUS == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_with_risk_scores_temp_risk_name_change, STATUS == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

risk_score_comparison <- weighted_summary_continuous("risk_score", 
                                                    survey_design_all_t,
                                                    survey_design_non_participants_t, 
                                                    survey_design_participants_t, 
                                                    participants_n, # NB: these n's are the full-completion counts defined in the sensitivity analysis above
                                                    non_participants_n,
                                                    overall_n) # supply all arguments the function expects

gt(risk_score_comparison)
Variable Overall Participants Non-participants Effect_size
risk_score -2.6 (5.54), -3.59 -0.47 (6.08), -1.53 -2.75 (5.46), -3.73 0.41 [0.36, 0.47]

Quartiles

Now, let’s divide the sample into quartiles based on their composite summary score:

Show code
# Calculate quartiles of the composite risk score
quartiles <- quantile(master_dataset_study2_with_risk_scores$`Composite Risk Score`, 
                      probs = c(0, 0.25, 0.5, 0.75, 1))


master_dataset_study2_with_risk_scores$Quartile <- cut(master_dataset_study2_with_risk_scores$`Composite Risk Score`, 
                                    breaks = quartiles,
                                    labels = c("Bottom quartile: least involved",
                                               "Second quartile",
                                               "Third quartile",
                                               "Top quartile: most involved"),
                                    include.lowest = TRUE)
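As a quick sanity check (not reported in the manuscript), each quartile should contain roughly a quarter of the sample:

# Sanity check: each quartile should hold ~25% of customers
master_dataset_study2_with_risk_scores %>%
  count(Quartile) %>%
  mutate(percent = round(n / sum(n) * 100, 1))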

Now compute the proportion of individuals from each quartile who went on to take part in the survey:

Show code
master_dataset_study2_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
  tab_header("Counts for granular survey status by composite risk score quartile", subtitle = NULL)
Counts for granular survey status by composite risk score quartile
SURVEY_STATUS_PGSI n
Bottom quartile: least involved
Complete 245
Incomplete 8
Not started 5946
Second quartile
Complete 348
Incomplete 15
Not started 5835
Third quartile
Complete 535
Incomplete 50
Not started 5613
Top quartile: most involved
Complete 828
Incomplete 113
Not started 5257
Show code
master_dataset_study2_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(STATUS) %>%
  mutate('Percent of quartile' = round(n/sum(n)*100,2)) %>%
  gt() %>%
  tab_header("Counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Counts (with percentages) for broad survey status by composite risk score quartile
STATUS n Percent of quartile
Bottom quartile: least involved
Participants 245 3.95
Non-participants 5954 96.05
Second quartile
Participants 348 5.61
Non-participants 5850 94.39
Third quartile
Participants 535 8.63
Non-participants 5663 91.37
Top quartile: most involved
Participants 828 13.36
Non-participants 5370 86.64

Now compute weighted counts/proportions:

Show code
# Compute weighted counts for granular survey status
weighted_counts_granular <- svytable(~Quartile + SURVEY_STATUS_PGSI, 
                                     svydesign(id = ~1, 
                                               data = master_dataset_study2_with_risk_scores, 
                                               weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame()


weighted_counts_granular %>%
  gt() %>%
  tab_header("Weighted counts for granular survey status by composite risk score quartile", subtitle = NULL)
Weighted counts for granular survey status by composite risk score quartile
Quartile SURVEY_STATUS_PGSI Freq
Bottom quartile: least involved Complete 2309.93893
Second quartile Complete 3267.87903
Third quartile Complete 4862.26711
Top quartile: most involved Complete 2948.77958
Bottom quartile: least involved Incomplete 75.38683
Second quartile Incomplete 142.03838
Third quartile Incomplete 463.32653
Top quartile: most involved Incomplete 390.28258
Bottom quartile: least involved Not started 56150.98467
Second quartile Not started 54996.77437
Third quartile Not started 49681.64042
Top quartile: most involved Not started 17909.09600
Show code
# Compute weighted counts and percentages for broad survey status
svytable(~Quartile + STATUS, 
                                  svydesign(id = ~1, 
                                            data = master_dataset_study2_with_risk_scores, 
                                            weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate('Percent of quartile' = round(Freq / sum(Freq) * 100, 2)) %>%
 gt() %>%
  tab_header("Weighted counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Weighted counts (with percentages) for broad survey status by composite risk score quartile
STATUS Freq Percent of quartile
Bottom quartile: least involved
Participants 2309.939 3.95
Non-participants 56226.371 96.05
Second quartile
Participants 3267.879 5.60
Non-participants 55138.813 94.40
Third quartile
Participants 4862.267 8.84
Non-participants 50144.967 91.16
Top quartile: most involved
Participants 2948.780 13.88
Non-participants 18299.379 86.12

And visualise this using a Sankey diagram (weighted figures):

Show code
# Create survey design object
survey_design_all_participants_sankey <- svydesign(
  id = ~1,
  data = master_dataset_study2_with_risk_scores,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# We first need to create a data frame that contains the flows' "from" and "to" categories and "intensity" values (i.e., N), with weightings applied:
weighted_counts <- svytable(~Quartile + SURVEY_STATUS_PGSI, survey_design_all_participants_sankey) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate(percent = Freq / sum(Freq) * 100) %>%
  ungroup()

# Create Sankey plot
desired_order_sankey <- c("Top quartile: most involved",
                          "Third quartile",
                          "Second quartile",
                          "Bottom quartile: least involved")

# non-weighted data: 
risk_to_survey_links<- master_dataset_study2_with_risk_scores %>%
    select(CLIENT_ID, SURVEY_STATUS_PGSI, Quartile) %>%
  group_by(Quartile, SURVEY_STATUS_PGSI) %>%
   summarise(
    n = n(),
    .groups = "drop"
  ) %>%
   group_by(Quartile) %>%
   mutate(
     percent = n/sum(n)*100) %>%
  select(-n) %>%
  ungroup() %>%
  mutate(Quartile =  factor(Quartile,
                              levels = desired_order_sankey)) %>%
  arrange(Quartile)

# From these flows we have to create a "node" data frame which provides a list of all categories (to and from) in the data frame: 
nodes <- data.frame(
  name = c(as.character(weighted_counts$Quartile), 
           as.character(weighted_counts$SURVEY_STATUS_PGSI)) %>% unique())

# For the networkD3 package, the links between categories are made using IDs and not the character names provided in the links data frame. This code assigns each "from" and "to" category a zero-based ID to make the numerical connections explicit:
weighted_counts$IDsource <- match(weighted_counts$Quartile, nodes$name) - 1
weighted_counts$IDtarget <- match(weighted_counts$SURVEY_STATUS_PGSI, nodes$name) - 1
 


# prepare colour scale
ColourScal ='d3.scaleOrdinal() .range(["#d1495b", 
"#edae49",
"#00798c",
"#2e4057",
"#35B779FF",
"#31688EFF",
"#31688EFF",
"#482878FF",
"#B4DE2CFF",
"#FDE725FF", 
"#26828EFF",
"#3E4A89FF",
"#440154FF",
"#6DCD59FF"
])'

# Create Sankey plot
sankey_plot2 <- sankeyNetwork(
  Links = weighted_counts, # risk_to_survey_links # non-weighted 
  Nodes = nodes,
  Source = "IDsource", 
  Target = "IDtarget",
  Value = "percent", 
  NodeID = "name", 
  iterations = 0,
  sinksRight = FALSE,
  fontSize = 16, 
  fontFamily = "Poppins",
  nodeWidth = 30,
  nodePadding = 25,
  colourScale = ColourScal
)

# Custom CSS with Google Fonts link
custom_css <- "
<link href='https://fonts.googleapis.com/css2?family=Poppins:wght@400;700&display=swap' rel='stylesheet'>
<style>
  text {
    font-family: 'Poppins', sans-serif !important;
  }
</style>
"

# Save plot as a HTML file with custom CSS
html_file <- tempfile(fileext = ".html")
saveWidget(
  widget = htmlwidgets::prependContent(sankey_plot2), 
  file = html_file,
  selfcontained = TRUE
)

# Read the saved HTML and ensure the custom CSS is properly embedded
html_content <- readLines(html_file, warn = FALSE)
html_content <- gsub("</head>", paste0(custom_css, "</head>"), html_content)
writeLines(html_content, con = "sankey_plot2.html")

# remove the temporary file
file.remove(html_file)
[1] TRUE
Show code
# sankey_plot2
Show code
# ggsankey_data <- master_dataset_study2_with_risk_scores %>%
#   # mutate(SURVEY_STATUS_FULL = case_when(SURVEY_STATUS_FULL == "Not started" ~ NA,
#   #                                       TRUE ~ SURVEY_STATUS_FULL)) %>%
#   make_long(Quartile, SURVEY_STATUS_START, SURVEY_STATUS_FULL)
# 
# ggplot(ggsankey_data, aes(x = x,
#                           next_x = next_x,
#                           node = node,
#                           next_node = next_node,
#                           fill = factor(node),
#                           label = node)) +
#                geom_sankey(flow.alpha = 0.5,
#                            node.color = "black",
#                            show.legend = FALSE)+
#   geom_sankey_label(size = 3,
#                     color = "black",
#                     fill= "white",
#                     hjust = -0.2,
#                     family="Poppins")+
#   theme_bw()+
#     theme( text=element_text(family="Poppins")) +
#   # scale_fill_viridis_d() +
#     scale_fill_manual(values = c("#9a031e",  # Bright red
#   "#277da1",  # Deep blue
#   "#f9c74f",  # Muted yellow
#   "#e36414",  # Soft orange
#   "#90be6d",  # Soft green
#   "#577590",  # Slate gray
#   "#4d908e",  # Teal
#   "#9f5f80"   # Purple
# )) +
#   theme(axis.title = element_blank()
#                   , axis.text.y = element_blank()
#                   , axis.ticks = element_blank()
#                   , panel.grid = element_blank())

Logistic regression

Now let’s perform a logistic regression model to identify the strongest predictors of survey uptake.

SSVS

Before we can actually run the model, we need to identify a suitable suite of predictor variables to include using Stochastic Search Variable Selection (SSVS) and the related SSVS package (note the actual SSVS process is commented out in rendered versions of this document as it takes a long time to run).

As we want to maximise the overall predictive ability of the model, we have opted to set the parameters of the SSVS so as to have a low threshold for including/selecting relevant variables. First, we have set the prior inclusion probability to 0.8, reflecting a belief that predictors are more likely to be included than not (a prior inclusion probability of 0.5 would reflect the belief that each predictor has an equal probability of being included/excluded). Second, we have set the cut-off for marginal inclusion probability (MIP) values to 0.4. There is no standard recommendation for what this cut-off should be, but others have used 0.5.

One thing we need to do before all of this is impute missing values for the IRSAD scores, as SSVS doesn’t seem to handle missing values well. To do this, we will impute the sample median as each missing value.

Show code
# First we need to convert logical and categorical variables to numeric for inclusion in the model:
master_dataset_study2_SSVS <- master_dataset_study2 %>%
  # Impute missing IRSAD scores
  mutate(IRSAD_SCORE = ifelse(is.na(IRSAD_SCORE), median(IRSAD_SCORE, na.rm = TRUE), IRSAD_SCORE)) %>%
  # Convert all required variables to numeric:
  mutate(across(c(
    'AGE',
    'IRSAD_SCORE',
       'DAYS_REGISTERED',
       'DAYS_SINCE_LAST_BET',
       'PAS_6M_BET_MEAN_STAKE',
       'PAS_6M_BET_SD_STAKE',
       'PAS_6M_N_ACTIVITIES',
       'PAS_6M_BET_INTENSITY',
       'PAS_6M_MAX_DAY_STAKE',
       'PAS_6M_MAX_DAY_BETS',
       'PAS_6M_MEAN_DAILY_STAKE',
       'PAS_6M_SD_DAILY_STAKE',
       'PAS_6M_PERCENT_UNSOCIABLE_BETS',
       'PAS_6M_MONTHLY_BET_FREQ_DAYS',
       'PAS_6M_MONTHLY_SPEND',
       'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME',
       'PAS_6M_MEAN_DEPOSIT_AMOUNT',
       'PAS_6M_SD_DEPOSIT_AMOUNT',
       'PAS_6M_PERCENT_DECLINED_DEPOSITS',
       'PAS_6M_TOTAL_DEPOSIT_METHODS',
       'PAS_6M_PERCENT_CANCEL_WITHDRAWS',
       'PAS_6M_TOTAL_WITHDRAW_METHODS',
       'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT',
       'PAS_6M_MONTHLY_DEPOSIT_AMOUNT',
       'PAS_6M_DEPOSIT_INTENSITY','EVER_USED_DEPOSIT_LIMIT',
                  'EVER_USED_BLOCK_OUT', 
                  'EVER_USED_CHECK_IN', 
                  'EVER_USED_CURFEW', 
                  'EVER_USED_DEPOSIT_OPTION', 
                  'EVER_USED_MARKET_CONTROL', 
                  'EVER_USED_TIMOUT', 
                  'EVER_USED_CANCELLED_WITHDRAWS', 
                  'EVER_USED_SELF_EXCLUSION',
    'TOTAL_ACTIVE_TOOLS',
    'NUMBER_CPT_CHANGES'), 
                ~ as.numeric(.))) %>%
  mutate(GENDER = case_when(
    GENDER == 'Female' ~ 0,
    GENDER == 'Male' ~ 1,
    TRUE ~ 2)) %>%
  mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1,
    TRUE ~ 2)) %>%
    rename(
    'Age' = 'AGE',
    'Gender' =  'GENDER',
    'IRSAD score' = 'IRSAD_SCORE',
    'Days since registration' = 'DAYS_REGISTERED',
    'Days since last bet' = 'DAYS_SINCE_LAST_BET',
    'Mean stake*' = 'PAS_6M_BET_MEAN_STAKE',
    'Stake SD*' = 'PAS_6M_BET_SD_STAKE',
    'No. of activities*' = 'PAS_6M_N_ACTIVITIES',
    'Betting intensity*' = 'PAS_6M_BET_INTENSITY',
    'Max daily stake*' = 'PAS_6M_MAX_DAY_STAKE',
    'Max daily bets*' = 'PAS_6M_MAX_DAY_BETS',
    'Mean daily stake*' = 'PAS_6M_MEAN_DAILY_STAKE',
    'Daily stake SD*' = 'PAS_6M_SD_DAILY_STAKE',
    '% unsociable bets*' = 'PAS_6M_PERCENT_UNSOCIABLE_BETS',
    'Monthly bet freq (days)*' = 'PAS_6M_MONTHLY_BET_FREQ_DAYS',
    'Monthly spend*' = 'PAS_6M_MONTHLY_SPEND',
    'Mean monthly net outcome*' = 'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME',
    'Mean deposit amount*' = 'PAS_6M_MEAN_DEPOSIT_AMOUNT',
    'Deposit amount SD*' = 'PAS_6M_SD_DEPOSIT_AMOUNT',
    '% declined deposits*' = 'PAS_6M_PERCENT_DECLINED_DEPOSITS',
    'No. deposit methods*' = 'PAS_6M_TOTAL_DEPOSIT_METHODS',
    '% cancel withdrawals*' = 'PAS_6M_PERCENT_CANCEL_WITHDRAWS',
    'No. withdrawal methods*' = 'PAS_6M_TOTAL_WITHDRAW_METHODS',
    'Max daily deposit amount*' = 'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT',
    'Monthly deposit amount*' = 'PAS_6M_MONTHLY_DEPOSIT_AMOUNT',
    'Deposit intensity*' = 'PAS_6M_DEPOSIT_INTENSITY',
    'Ever used deposit limit' = 'EVER_USED_DEPOSIT_LIMIT',
    'Ever used block out' = 'EVER_USED_BLOCK_OUT', 
    'Ever used check in' = 'EVER_USED_CHECK_IN', 
    'Ever used curfew' = 'EVER_USED_CURFEW', 
    'Ever used deposit option' = 'EVER_USED_DEPOSIT_OPTION', 
    'Ever used market control' = 'EVER_USED_MARKET_CONTROL', 
    'Ever used timeout' = 'EVER_USED_TIMOUT', 
    'Ever used cancelled withdrawals' = 'EVER_USED_CANCELLED_WITHDRAWS', 
    'Ever used self-exclusion' = 'EVER_USED_SELF_EXCLUSION',
    'Total active tools' = 'TOTAL_ACTIVE_TOOLS',
    'Number of changes to CPTs' = 'NUMBER_CPT_CHANGES'
  )


outcome <- 'STATUS'

predictors <- c(
       # demographic variables:
       'Age',
       'Gender',
       'IRSAD score',
       # Account/wagering variables
       'Days since registration',
       'Days since last bet',
       # All past six month wagers and transactions variables (As included in the composite score):
      'Mean stake*',
    'Stake SD*',
    'No. of activities*',
    'Betting intensity*',
    'Max daily stake*',
    'Max daily bets*',
    'Mean daily stake*',
    'Daily stake SD*',
    '% unsociable bets*',
    'Monthly bet freq (days)*',
    'Monthly spend*',
    'Mean monthly net outcome*',
    'Mean deposit amount*',
    'Deposit amount SD*',
    '% declined deposits*',
    'No. deposit methods*',
    '% cancel withdrawals*',
    'No. withdrawal methods*',
    'Max daily deposit amount*',
    'Monthly deposit amount*',
    'Deposit intensity*',
       # current and historical use of consumer protection tools:
       'Ever used deposit limit',
       'Ever used block out', 
       'Ever used check in', 
       'Ever used curfew', 
       'Ever used deposit option', 
       'Ever used market control', 
       'Ever used timeout', 
       'Ever used cancelled withdrawals', 
       'Ever used self-exclusion',
       'Total active tools',
       'Number of changes to CPTs'
        )

# str(master_dataset_study2_SSVS)

# master_dataset_study2_SSVS %>%
  # select(`IRSAD score`)

# SSSV_results_Study2 <- ssvs(data = master_dataset_study2_SSVS,
#                             x = predictors, y = outcome, # Variables, as prespecified above
#                             inprob = 0.8, #  prior inclusion probability:  0.5 reflects the belief that each predictor has an equal probability of being included/excluded
#                             continuous = FALSE, # Categorical outcome variable
#                             progress = TRUE # Provide update on progress of processing
#                             )
# 
# # Save file to avoid having to re-run every time
# saveRDS(SSSV_results_Study2, "SSSV_results_Study2")

# Load in:
SSSV_results_Study2 <- read_rds("SSSV_results_Study2")

Now that we have the outcomes from the variable selection process, let’s present them as a table and visually in a plot:

Show code
summary(SSSV_results_Study2, 
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  as.data.frame() %>%
gt()%>%
  tab_header(md("**Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2**"), 
             subtitle = NULL)
Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2
Variable MIP Avg Beta Avg Nonzero Beta Lower CI (95%) Upper CI (95%)
Age 1.0000 0.2783 0.2783 0.2257 0.3342
Days since last bet 1.0000 -0.3794 -0.3794 -0.4596 -0.3024
No. withdrawal methods* 1.0000 0.1980 0.1980 0.1393 0.2564
Monthly bet freq (days)* 0.9895 0.1473 0.1488 0.0799 0.2117
No. deposit methods* 0.9755 0.1185 0.1215 0.0603 0.1830
Mean deposit amount* 0.8053 -0.2226 -0.2764 -0.3901 0.0000
No. of activities* 0.4949 0.0450 0.0909 0.0000 0.1240
Stake SD* 0.3143 0.1098 0.3492 0.0000 0.4386
Mean stake* 0.3096 -0.1834 -0.5924 -0.7406 0.0000
Gender 0.2637 0.0221 0.0839 0.0000 0.1076
Deposit amount SD* 0.0689 -0.0063 -0.0908 -0.1249 0.0000
Ever used cancelled withdrawals 0.0547 0.0027 0.0489 0.0000 0.0256
Deposit intensity* 0.0364 0.0020 0.0538 0.0000 0.0000
% cancel withdrawals* 0.0255 0.0012 0.0484 0.0000 0.0000
Days since registration 0.0211 0.0011 0.0533 0.0000 0.0000
Total active tools 0.0189 0.0008 0.0412 0.0000 0.0000
Mean monthly net outcome* 0.0158 0.0012 0.0756 0.0000 0.0000
Betting intensity* 0.0111 0.0004 0.0348 0.0000 0.0000
Ever used curfew 0.0099 -0.0005 -0.0455 0.0000 0.0000
Max daily stake* 0.0085 -0.0005 -0.0563 0.0000 0.0000
% unsociable bets* 0.0080 0.0003 0.0333 0.0000 0.0000
Daily stake SD* 0.0066 -0.0001 -0.0178 0.0000 0.0000
Mean daily stake* 0.0059 -0.0002 -0.0269 0.0000 0.0000
Max daily deposit amount* 0.0059 -0.0001 -0.0231 0.0000 0.0000
Max daily bets* 0.0055 0.0002 0.0290 0.0000 0.0000
Ever used market control 0.0051 0.0002 0.0352 0.0000 0.0000
Ever used deposit limit 0.0050 0.0002 0.0359 0.0000 0.0000
Ever used timeout 0.0049 0.0001 0.0216 0.0000 0.0000
Ever used self-exclusion 0.0045 0.0001 0.0154 0.0000 0.0000
Monthly deposit amount* 0.0039 -0.0001 -0.0225 0.0000 0.0000
Monthly spend* 0.0038 -0.0001 -0.0323 0.0000 0.0000
% declined deposits* 0.0034 0.0001 0.0254 0.0000 0.0000
Ever used block out 0.0033 0.0000 0.0130 0.0000 0.0000
Ever used deposit option 0.0023 0.0000 0.0066 0.0000 0.0000
Number of changes to CPTs 0.0019 0.0000 -0.0088 0.0000 0.0000
Ever used check in 0.0013 0.0000 -0.0082 0.0000 0.0000
IRSAD score 0.0009 0.0000 0.0136 0.0000 0.0000
Show code
plot(SSSV_results_Study2,
     threshold = 0.4,
     title = "") +
  theme_classic() +
   scale_y_continuous(breaks = c(0, 0.25, 0.50, 0.75, 1), labels = c(0, 0.25, 0.50, 0.75, 1)) +
  theme(text=element_text(family="Poppins"),
      axis.text.x = element_text(angle = 45, hjust = 1)
      # axis.text.y = element_text(angle = 0, hjust = .5)
      ) +
  labs(title = "",
       x = "",
       y = "Inclusion Probability") +
  annotate("text", x = 18, y = 0.45, label = "MIP threshold", size =3, color = "black", fontface = "plain") +
  theme(legend.position = "none") 

Show code
ggsave("Figures/SSVS results plot Study 2.pdf",
       width = 8.5,
       height = 3.5,
       units = c("in")
)

Run model

Now let’s run the logistic regression model with our selection of predictor variables identified via SSVS:

Show code
SSSV_results_Study2$ssvs %>% as_tibble()
# A tibble: 0 × 0
Show code
# Set the data for analysis:
Study2_logistic_reg_data<- master_dataset_study2 %>%
 mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1))

# Create survey design object used for weighting:
survey_design_regression <- svydesign(
  id = ~1,
  data = Study2_logistic_reg_data,
  weights = ~ALL_SAMPLE_WEIGHTS
)
  
# Check variables for inclusion: 
summary(SSSV_results_Study2, 
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  filter(MIP>0.4)
 Variable                 MIP    Avg Beta Avg Nonzero Beta Lower CI (95%) Upper CI (95%)
 Age                      1.0000  0.2783   0.2783          0.2257          0.3342
 Days since last bet      1.0000 -0.3794  -0.3794         -0.4596         -0.3024
 No. withdrawal methods*  1.0000  0.1980   0.1980          0.1393          0.2564
 Monthly bet freq (days)* 0.9895  0.1473   0.1488          0.0799          0.2117
 No. deposit methods*     0.9755  0.1185   0.1215          0.0603          0.1830
 Mean deposit amount*     0.8053 -0.2226  -0.2764         -0.3901          0.0000
 No. of activities*       0.4949  0.0450   0.0909          0.0000          0.1240
Show code
# Run the model with survey weights applied:

study_2_weighted_logistic_regression <- svyglm(STATUS ~ AGE + 
                DAYS_SINCE_LAST_BET +
                PAS_6M_TOTAL_WITHDRAW_METHODS +
                PAS_6M_MONTHLY_BET_FREQ_DAYS +
                PAS_6M_TOTAL_DEPOSIT_METHODS + 
                PAS_6M_MEAN_DEPOSIT_AMOUNT +
                PAS_6M_N_ACTIVITIES,
      design = survey_design_regression, family = quasibinomial())

summary(study_2_weighted_logistic_regression) 

Call:
svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS + 
    PAS_6M_MONTHLY_BET_FREQ_DAYS + PAS_6M_TOTAL_DEPOSIT_METHODS + 
    PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_N_ACTIVITIES, design = survey_design_regression, 
    family = quasibinomial())

Survey design:
svydesign(id = ~1, data = Study2_logistic_reg_data, weights = ~ALL_SAMPLE_WEIGHTS)

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   -3.9154835  0.1304685 -30.011  < 2e-16 ***
AGE                            0.0246511  0.0019471  12.660  < 2e-16 ***
DAYS_SINCE_LAST_BET           -0.0076221  0.0010583  -7.202 6.11e-13 ***
PAS_6M_TOTAL_WITHDRAW_METHODS  0.3606934  0.0522027   6.909 4.98e-12 ***
PAS_6M_MONTHLY_BET_FREQ_DAYS   0.0124960  0.0054804   2.280 0.022610 *  
PAS_6M_TOTAL_DEPOSIT_METHODS   0.1432951  0.0351667   4.075 4.62e-05 ***
PAS_6M_MEAN_DEPOSIT_AMOUNT    -0.0006594  0.0001831  -3.602 0.000316 ***
PAS_6M_N_ACTIVITIES            0.0416386  0.0093267   4.464 8.06e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.731407)

Number of Fisher Scoring iterations: 6

Perform checks of model performance:

Show code
model2_pilot_performance_metrics <- check_model(study_2_weighted_logistic_regression)

model2_pilot_performance_metrics # Print the diagnostic panels

ggsave("Figures/model2_pilot_performance_metrics.pdf",
       plot = plot(model2_pilot_performance_metrics), # Save the diagnostics explicitly rather than relying on the last-displayed plot
       width = 7,
       height = 9,
       units = c("in")
)

I have tried various methods of transforming variables to account for large residuals but none seem to have any effect (first plot above). If anything, using variables on the original scale seems to improve model diagnostics.

Compute confidence intervals for the coefficient estimates:

Show code
summarisedCIs_study_2_weighted_logistic_regression <- cbind(coef(study_2_weighted_logistic_regression), confint(study_2_weighted_logistic_regression))

summarisedCIs_study_2_weighted_logistic_regression %>% as.data.frame() %>%
    rownames_to_column() %>%
  rename(Coefficient = V1) %>%
  mutate(across(c(Coefficient, `2.5 %`, `97.5 %`), ~round(., 3))) %>%
  arrange(desc(`2.5 %`)) %>%
   gt() %>%
  tab_header(md("**Regression coefficients with 95% CIs: Study 2**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Regression coefficients with 95% CIs: Study 2
Coefficient Lower CI: 2.5% Upper CI: 97.5%
PAS_6M_TOTAL_WITHDRAW_METHODS 0.361 0.258 0.463
PAS_6M_TOTAL_DEPOSIT_METHODS 0.143 0.074 0.212
PAS_6M_N_ACTIVITIES 0.042 0.023 0.060
AGE 0.025 0.021 0.028
PAS_6M_MONTHLY_BET_FREQ_DAYS 0.012 0.002 0.023
PAS_6M_MEAN_DEPOSIT_AMOUNT -0.001 -0.001 0.000
DAYS_SINCE_LAST_BET -0.008 -0.010 -0.006
(Intercept) -3.915 -4.171 -3.660

Exponentiate coefficients and interpret them as odds-ratios:

Show code
summarisedORs_study_2_weighted_logistic_regression <- exp(cbind(OR = coef(study_2_weighted_logistic_regression), confint(study_2_weighted_logistic_regression)))

summarisedORs_study_2_weighted_logistic_regression %>% as.data.frame() %>%
  rownames_to_column() %>%
  mutate(across(c(OR, `2.5 %`, `97.5 %`), ~round(., 4))) %>%
  arrange(desc(OR)) %>%
   gt() %>%
  tab_header(md("**Odds ratios with 95% CIs: Study 2**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Odds ratios with 95% CIs: Study 2
OR Lower CI: 2.5% Upper CI: 97.5%
PAS_6M_TOTAL_WITHDRAW_METHODS 1.4343 1.2948 1.5889
PAS_6M_TOTAL_DEPOSIT_METHODS 1.1541 1.0772 1.2364
PAS_6M_N_ACTIVITIES 1.0425 1.0236 1.0618
AGE 1.0250 1.0211 1.0289
PAS_6M_MONTHLY_BET_FREQ_DAYS 1.0126 1.0018 1.0235
PAS_6M_MEAN_DEPOSIT_AMOUNT 0.9993 0.9990 0.9997
DAYS_SINCE_LAST_BET 0.9924 0.9904 0.9945
(Intercept) 0.0199 0.0154 0.0257

Calculate Nagelkerke’s R squared for the model:

Show code
NagelkerkeR2_S2<- NagelkerkeR2(study_2_weighted_logistic_regression) %>% 
  as.data.frame() %>%
  print()
      N         R2
1 24793 0.08520162
Show code
# NOTE: R2 is calculated as a proportion here, so in the paper we multiplied it by 100 to form a percentage value

The model accounts for roughly 8.52% of the variance in uptake/response.
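
The conversion mentioned in the note above is just the rounded product:

Show code
# Convert the Nagelkerke R2 proportion to the percentage reported in the paper:
round(NagelkerkeR2_S2$R2 * 100, 2)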

Calculate overall p-value for model by comparing it with a null model:

Show code
anova(study_2_weighted_logistic_regression, 
      update(study_2_weighted_logistic_regression, ~1), # update here produces null model for comparison 
      test="Chisq")
Working (Rao-Scott) LRT for AGE DAYS_SINCE_LAST_BET PAS_6M_TOTAL_WITHDRAW_METHODS PAS_6M_MONTHLY_BET_FREQ_DAYS PAS_6M_TOTAL_DEPOSIT_METHODS PAS_6M_MEAN_DEPOSIT_AMOUNT PAS_6M_N_ACTIVITIES
 in svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS + 
    PAS_6M_MONTHLY_BET_FREQ_DAYS + PAS_6M_TOTAL_DEPOSIT_METHODS + 
    PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_N_ACTIVITIES, design = survey_design_regression, 
    family = quasibinomial())
Working 2logLR =  745.6723 p= < 2.22e-16 
(scale factors:  1.4 1.1 1 0.98 0.92 0.84 0.73 )
Show code
# Set seed so that labels are consistent in the plot:
set.seed(200)

ggstatsplot::ggcoefstats(study_2_weighted_logistic_regression,
                         digits = 4,
                         stats.label.color = "#00798c",
                         sort = "ascending",
                         vline.args = list(size = 1, color = "black"),
                         # ggstatsplot.layer = FALSE,
                          exclude.intercept = TRUE, # hide the intercept
                         stats.label.args = list(size = 3, direction = "y")) +
  scale_x_continuous(name = "Regression coefficient", limits = c(-0.1, 0.5), breaks = c(-.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5)) +
  scale_y_discrete(name = "", labels = c("Days since\n last bet",
                                         "Mean deposit\n amount",
                                         "Mean monthly bet\n freq (days)",
                                         "Age",
                                         "No. of activities",
                                         "No. deposit\n methods",
                                         "No. withdrawal\n methods")) +
  theme(panel.grid.minor = element_blank(),
        axis.line = element_blank()) +
  theme(axis.text = element_text(color = "black", size = 9, face = "plain"))+
  theme_light(base_family = "Poppins") +
  # theme(plot.margin = unit(c(.1,.1,.3,-3), "cm")) +
  theme(axis.text.y = element_text(color="black", size=12, face="plain", family = "Poppins")) +
  theme(axis.title.x = element_text(color="black", size=12, face ="plain", vjust = 0, family = "Poppins")) 

Show code
# Set height to 7 and width to 5.5 in portrait mode
ggsave("Figures/Study 2 log regression plot.pdf",
       width = 5,
       height = 7,
       units = c("in")
)

Study 2 Subsample Analysis

Okay, as noted by two reviewers during peer review, there is a key distinction in the inclusion criteria between our two studies. Study 1 invited participants who had gambled at least once in the past 30 days, whereas Study 2 invited those who had gambled at least once in the past six months. This difference could plausibly affect our outcomes and help explain the higher non-response bias in Study 2: since more regular gamblers appear to be more inclined to participate in studies, expanding the eligibility criteria to include less frequent gamblers makes responders less representative of the total invited sample. This is a great point, so let’s re-run all Study 2 analyses focusing only on participants who had gambled in the 30 days prior to invitation.

Process the data for analysis by doing the following:

  • Remove anybody who didn’t bet in the 30 days prior to invitation, so that the sample matches Study 1’s inclusion criterion
  • Produce some new sample weights to account for the restricted sample we’re now using, which has a different distribution of flagged vs. non-flagged customers (this step is hidden for commercial sensitivity)
Show code
master_dataset_study2_restricted_initial <- master_dataset_study2 %>%
  filter(PAS_30_BET_FREQ_DAYS != 0) 
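
# NOTE: a weight re-computation step (hidden here for commercial sensitivity)
# is then applied to this initial dataset to produce
# master_dataset_study2_restricted, which is used in all analyses below.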

What percentage of the original sample does this represent?

Show code
nrow(master_dataset_study2_restricted_initial)/nrow(master_dataset_study2)*100
[1] 51.62344

Uptake figures

How many people who had gambled in the last 30 days were invited in total:

Show code
nrow(master_dataset_study2_restricted)
[1] 12799

Let’s now look at how survey uptake varied over the three-week window during which the survey was open, in this restricted sample:

Show code
master_dataset_study2_restricted %>%
  as_tibble() %>%
  filter(SURVEY_STATUS_START != "Not started") %>%
  group_by(START_DATE) %>%
  summarise(Freq = n()) %>%
  ungroup() %>%
  ggplot(aes(x = START_DATE, y = Freq)) +
    plot_theme +
  geom_bar(stat = "identity", fill = "#00798c") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "",
       x = "Date",
       y = "Number of clients") +
  scale_y_continuous(breaks = seq(0, 1500, by = 500), limits = c(0,1500))

Now compute some figures to tell us the rates of uptake for the survey. These figures relate only to the refined, filtered dataset:

Show code
master_dataset_study2_restricted %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Restricted analyses", subtitle = NULL)
Counts for starting the survey: Restricted analyses
Survey status n Percent
Not started 11127 86.93648
Started 1672 13.06352
Show code
master_dataset_study2_restricted %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Restricted analyses", subtitle = NULL)
Counts for completing the survey up to the PGSI: Restricted analyses
Survey status n Percent
Complete 1514 11.829049
Incomplete 158 1.234471
Not started 11127 86.936479
Show code
master_dataset_study2_restricted %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Restricted analyses", subtitle = NULL)
Counts for completing the survey in full: Restricted analyses
Survey status n Percent
Complete 1115 8.711618
Incomplete 557 4.351902
Not started 11127 86.936479

We also need to work out how many people were removed by requiring a bet in the past 30 days, and how this relates to survey status:

Show code
master_dataset_study2 %>%
  as_tibble() %>%
filter(PAS_30_BET_FREQ_DAYS == 0) %>% # keep only those who didn't bet in the past 30 days (i.e., the people removed)
  count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for people removed from dataset because they hadn't gambled in the past 30-months", subtitle = NULL)
Counts for people removed from dataset because they hadn't gambled in the past 30-months
Survey status n
Complete 442
Incomplete 28
Not started 11524

Okay, so this restricted analysis removes 442 people who completed the survey and 28 people who started but didn’t finish it. We still have pretty large numbers of people who completed the survey, so it should be fine to proceed with these restricted analyses.

Compare samples

What are the general demographic and gambling characteristics of those who did and did not take part in the survey?

Before we can do this, we need to create weighted variable summaries to account for the oversampling of the at-risk population:

Show code
# Create survey design:
survey_design_all <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_restricted,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Create survey design objects for each STATUS group:
survey_design_non_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted, STATUS == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted, STATUS == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Get sample sizes for each group
participants_n <- sum(master_dataset_study2_restricted$STATUS == "Participants")
non_participants_n <- sum(master_dataset_study2_restricted$STATUS == "Non-participants")
overall_n <- nrow(master_dataset_study2_restricted)



# [Yes, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]

# To streamline this substantially, let's develop a function to calculate weighted summary for continuous variables and put it all together nicely for tabling later on:
# Function to calculate weighted summary for continuous variables
weighted_summary_continuous <- function(variable, design_all, design_non_participants, design_participants, participants_n, non_participants_n, overall_n) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate means and standard deviations
  mean_all <- svymean(var_formula, design_all, na.rm = TRUE)
  mean_non_ps <- svymean(var_formula, design_non_participants, na.rm = TRUE)
  mean_ps <- svymean(var_formula, design_participants, na.rm = TRUE)
  
  sd_all <- sqrt(svyvar(var_formula, design_all, na.rm = TRUE))
  sd_non_ps <- sqrt(svyvar(var_formula, design_non_participants, na.rm = TRUE))
  sd_ps <- sqrt(svyvar(var_formula, design_participants, na.rm = TRUE))
  
  # Calculate weighted medians
  median_all <- svyquantile(var_formula, design_all, 0.5, na.rm = TRUE)
  median_non_ps <- svyquantile(var_formula, design_non_participants, 0.5, na.rm = TRUE)
  median_ps <- svyquantile(var_formula, design_participants, 0.5, na.rm = TRUE)
   
  # Format for table
  summary_df <- tibble(
    Variable = variable,
    `Overall` = paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),
    `Participants` = paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),
    `Non-participants` = paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2))
  ) %>%
  # Compute effect size
  mutate(
    pooled_sd = sqrt(((participants_n - 1) * sd_ps^2 + (non_participants_n - 1) * sd_non_ps^2) /
                     (participants_n + non_participants_n - 2)),
    cohen_d = (coef(mean_ps) - coef(mean_non_ps)) / pooled_sd,
    SE = sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) + (cohen_d^2 / (2 * (participants_n + non_participants_n)))),
    CI_lower = cohen_d - 1.96 * SE,
    CI_upper = cohen_d + 1.96 * SE,
    Effect_size = paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]")
  ) %>%
  select(Variable, Overall, Participants, `Non-participants`, Effect_size)
  
  return(summary_df)
}

# Apply function to continuous variables
continuous_vars <- c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", 
                     "PAS_6M_BET_FREQ_DAYS", "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", 
                     "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME", "PAS_6M_PERCENT_UNSOCIABLE_BETS", 
                     "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY", "PAS_6M_MEAN_DEPOSIT_AMOUNT", 
                     "PAS_6M_PERCENT_DECLINED_DEPOSITS", "PAS_6M_PERCENT_CANCEL_WITHDRAWS")

all_continuous_summaries <- map_dfr(continuous_vars, 
                                    ~weighted_summary_continuous(.x, survey_design_all, 
                                                                 survey_design_non_participants, survey_design_participants, 
                                                                 participants_n, non_participants_n, overall_n))


# Categorical variables:

# Let's also develop a function to calculate and format percentages for categorical variables that we can join to the above outcomes for tabling. 

# Note: this function computes Cramér's V effect size, which is typically used for categorical variables with more than two levels, but it also works for variables with only two levels (in which case it's equivalent to the phi coefficient).

# UPDATE: Let's just compute odds ratios for the two-level variables, based on a reviewer suggestion (thanks ChatGPT for your help with this one!):

categorical_summary <- function(variable, design_all, design_non_participants, design_participants) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate proportions for overall sample
  prop_table_all <- svytable(var_formula, design_all)
  props_all <- prop.table(prop_table_all) %>%
    as.data.frame() %>%
    mutate(Overall = Freq * 100) %>%
    select(Variable = !!sym(variable), Overall)
  
  # Calculate proportions for non-participants
  prop_table_non_ps <- svytable(var_formula, design_non_participants)
  props_non_ps <- prop.table(prop_table_non_ps) %>%
    as.data.frame() %>%
    mutate('Non-participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Non-participants')
  
  # Calculate proportions for participants
  prop_table_ps <- svytable(var_formula, design_participants)
  props_ps <- prop.table(prop_table_ps) %>%
    as.data.frame() %>%
    mutate('Participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Participants')
  
  # Combine results and format
  summary_df <- props_ps %>%
    left_join(props_non_ps, by = "Variable") %>%
    left_join(props_all, by = "Variable") %>%
    mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%
    replace(is.na(.), "")

  # Determine if variable is binary (i.e., only two levels)
  unique_values <- length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))
  
  if (unique_values == 2 & variable != "GENDER") {
    # Ensure STATUS is treated as a factor with "Non-participants" as reference
    design_all$variables$STATUS <- relevel(as.factor(design_all$variables$STATUS), ref = "Non-participants")

    # Compute Odds Ratio (OR) using survey-weighted logistic regression
    logistic_model <- tryCatch(
      svyglm(as.formula(paste0(variable, " ~ STATUS")), 
             design = design_all, family = quasibinomial()),
      error = function(e) NULL
    )

    if (!is.null(logistic_model)) {
      coef_name <- grep("STATUS", names(coef(logistic_model)), value = TRUE)
      if (length(coef_name) > 0) {
        OR <- exp(coef(logistic_model)[coef_name])
        CI <- exp(confint(logistic_model)[coef_name, ])
        effect_size_text <- paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]")
      } else {
        effect_size_text <- "NA"
      }
    } else {
      effect_size_text <- "NA"
    }

  } else {
    # Compute Cramér’s V for categorical variables with more than two levels
    combined_table <- as.table(rbind(prop_table_non_ps, prop_table_ps))
    V <- cramersV(combined_table)
    V_rounded <- as.matrix(V$output) %>% t() %>% as.data.frame() %>%
      mutate(cramersV = as.numeric(cramersV)) %>%
      mutate(cramersV = round(cramersV, 2))
    
    CI <- confIntV(combined_table)
    CI_info <- CI$intermediate
    CIs <- CI_info$fisherZ.ci %>%
      t() %>%
      as.data.frame() %>%
      mutate_if(is.numeric, round, 2) %>%
      unite('Effect_size', 1, 2, sep = ", ")
    
    effect_size_text <- paste0(V_rounded, " [", CIs, "]")
  }

  # Add effect size to the table
  header_row <- data.frame(Variable = variable, Overall = "", Participants = "", 
                           `Non-participants` = "", `Effect_size` = effect_size_text)

  summary_df <- bind_rows(header_row, summary_df) %>%
    as_tibble() %>%
    select(Variable, Overall, Participants, `Non-participants`, `Effect_size`)

  return(summary_df)
}



# Apply function to categorical variables
categorical_vars <- c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")
all_categorical_summaries <- map_dfr(categorical_vars, 
                                     ~categorical_summary(.x, survey_design_all, 
                                                          survey_design_non_participants, survey_design_participants))

all_categorical_summaries <- all_categorical_summaries %>%
# Recode for consistency across summaries:
mutate(Variable = case_when(Variable == 1 ~ "Yes",
                            Variable == 0 ~ "No",
                            Variable == TRUE ~ "Yes",
                            Variable == FALSE ~ "No",
                            TRUE ~ Variable))

# Combine continuous and categorical summaries
weighted_comparison_summary_data <- bind_rows(all_continuous_summaries, all_categorical_summaries) %>%
  rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols = "Participants") %>%
  rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols = "Non-participants") %>%
  rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols = "Overall")

# Order data for table (as we have duplicate names in the Variable column, this is a pain in the butt: the variables need to be separated and rejoined in the right order)
gender_summary_tabled<- weighted_comparison_summary_data[15:18,1:5]
age_summary_tabled<- weighted_comparison_summary_data[1,1:5]
summary_table_extracted_vars_missing<- weighted_comparison_summary_data %>%
  filter(Variable !="AGE",
    Variable !="GENDER",
         Variable !="Male",
         Variable !="Female",
         Variable !="Unknown"
         )

# Join everything back together and provide some nice names for our variables:
weighted_comparison_summary_data_ordered <- bind_rows(age_summary_tabled,
          gender_summary_tabled,
          summary_table_extracted_vars_missing) %>%
   mutate(Variable = dplyr::recode(Variable,
    `AGE` = "Age",
    "GENDER" = "Gender",
    "IRSAD_SCORE" = "IRSAD score",
    "DAYS_REGISTERED" = "Days since registered",
    "DAYS_SINCE_LAST_BET" = "No. days since last bet",
    "PAS_6M_BET_FREQ_DAYS" = "No. betting days",
    "PAS_6M_BET_INTENSITY" = "Mean bets per active day",
    "PAS_6M_BET_MEAN_STAKE" = "Mean stake",
    "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" = "Mean net outcome per month",
    "PAS_6M_PERCENT_UNSOCIABLE_BETS" = "Percentage of bets midnight - 6AM",
    "PAS_6M_TOTAL_DEPOSIT_FREQ" = "No. deposits",
    "PAS_6M_DEPOSIT_INTENSITY" = "Mean no. deposits per active day",
    "PAS_6M_MEAN_DEPOSIT_AMOUNT" = "Mean deposit amount",
    "PAS_6M_PERCENT_DECLINED_DEPOSITS" = "Percentage of deposits declined",
    "PAS_6M_PERCENT_CANCEL_WITHDRAWS" = "Percentage of withdrawals cancelled",
    "EVER_USED_ANY_TOOL" = "Ever used a CPT",
    "ACTIVE_DEPOSIT_LIMIT" = "Active deposit limit",
    "RE_PAS_6M_ANY_TRIGGER" = "Flagged as 'at-risk' ≥ 1 times"
  ))


# Create final table:
weighted_comparison_summary_data_ordered %>%
  rename("Effect size" = Effect_size
         ) %>%
  gt() %>%
   tab_header(title = "Table 4.   Study 2 (past 30-day bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics") %>%
  tab_options(
    data_row.padding = px(1.5)
  ) %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramer's V [95% CIs])",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Table 4. Study 2 (past 30-day bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics
Variable Overall (N = 12799)1 Participants (N = 1514)1 Non-participants (N = 11285) Effect size2
Age 40.62 (15.44), 38 44.47 (15.39), 44 40.15 (15.38), 38 0.28 [0.23, 0.33]
Gender NA 0.03 [0.02, 0.03]
Female 14.18% 12.4% 14.4% NA
Male 84.72% 85.94% 84.57% NA
Unknown 1.1% 1.66% 1.03% NA
IRSAD score 1002.69 (75.08), 994.21 1004.22 (75.25), 994.59 1002.5 (75.06), 994.21 0.02 [-0.03, 0.08]
Days since registered 1259.45 (926.89), 1097 1393.52 (971.99), 1214 1243.06 (919.93), 1070 0.16 [0.11, 0.22]
No. days since last bet 8.6 (8.64), 4 5.91 (6.96), 3 8.93 (8.77), 5 -0.35 [-0.41, -0.3]
No. betting days 40.36 (38.75), 27 51.1 (43.5), 38 39.05 (37.93), 26 0.31 [0.26, 0.37]
Mean bets per active day 9.22 (15.7), 5.11 11.89 (19.35), 6.53 8.89 (15.16), 4.97 0.19 [0.14, 0.24]
Mean stake 32.15 (144.94), 10.4 26.65 (111.65), 9.38 32.82 (148.49), 10.59 -0.04 [-0.1, 0.01]
Mean net outcome per month -251.88 (2584.18), -36.03 -316.48 (1321.09), -46.34 -243.98 (2698.28), -35 -0.03 [-0.08, 0.03]
Percentage of bets midnight - 6AM 4.32 (9.42), 0.06 4.21 (8.66), 0.45 4.34 (9.51), 0 -0.01 [-0.07, 0.04]
No. deposits 60.74 (154.29), 14 85.34 (195.48), 21 57.74 (148.2), 13 0.18 [0.12, 0.23]
Mean no. deposits per active day 1.14 (1.49), 0.81 1.33 (1.85), 0.89 1.12 (1.43), 0.8 0.14 [0.09, 0.19]
Mean deposit amount 84.55 (348.17), 37.5 72.73 (199.6), 38.78 85.99 (362.15), 37.5 -0.04 [-0.09, 0.02]
Percentage of deposits declined 6.18 (12.35), 0 7.44 (14.88), 0 6.03 (12), 0 0.11 [0.06, 0.17]
Percentage of withdrawals cancelled 2.96 (11.33), 0 3.75 (13.02), 0 2.87 (11.11), 0 0.08 [0.02, 0.13]
Ever used a CPT NA 1.25 [1.06, 1.47]
No 83.17% 80.23% 83.53% NA
Yes 16.83% 19.77% 16.47% NA
Active deposit limit NA 1.12 [0.91, 1.38]
No 90.35% 89.42% 90.46% NA
Yes 9.65% 10.58% 9.54% NA
Flagged as 'at-risk' ≥ 1 times NA 1.34 [1.2, 1.5]
No 95.97% 94.86% 96.11% NA
Yes 4.03% 5.14% 3.89% NA
1 Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
2 Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramér's V [95% CIs]

PGSI

Also check the PGSI scores of the survey sample, for inclusion in the table later on. For this, we will remove anyone who didn’t pass the attention checks, to get the most reliable estimate:

Show code
survey_design_participants_filtered <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted, STATUS == "Participants" &
                  (ATTENTION_CHECK_1 == "PASSED" | is.na(ATTENTION_CHECK_1)) &
                  (ATTENTION_CHECK_2 == "PASSED" | is.na(ATTENTION_CHECK_2))),
  weights = ~ELIGIBLE_SURVEY_SAMPLE_WEIGHTS
)

# Weighted values:
mean_ps_PGSI <- svymean(~PGSI_TOTAL, survey_design_participants_filtered, na.rm = TRUE)
sd_ps_PGSI <- sqrt(svyvar(~PGSI_TOTAL, survey_design_participants_filtered, na.rm = TRUE))
median_ps_PGSI <- svyquantile(~PGSI_TOTAL, survey_design_participants_filtered, c(0.5), na.rm = TRUE)

# Extract desired values
mean_ps_PGSI_value <- coef(mean_ps_PGSI)
sd_ps_PGSI_value <- sd_ps_PGSI[1]
median_ps_PGSI_value <- coef(median_ps_PGSI) 

# Create sentence summarising results: 
summary_sentence_PGSI_study2 <- sprintf("The mean PGSI total score is %.2f (standard deviation = %.2f) and the median is %.2f.", 
                            mean_ps_PGSI_value, sd_ps_PGSI_value, median_ps_PGSI_value)

summary_sentence_PGSI_study2
[1] "The mean PGSI total score is 3.76 (standard deviation = 4.19) and the median is 2.00."

Non-Weighted comparisons

For transparency, we have also computed non-weighted comparisons between groups using the tbl_summary() function, as we did for Study 1 above:

Show code
master_dataset_study2_restricted %>%
  tbl_summary(
    include = c(# Demographic details
      AGE, 
      GENDER, 
      IRSAD_SCORE,
      DAYS_REGISTERED, 
       # Wagering
      DAYS_SINCE_LAST_BET, 
      PAS_6M_BET_FREQ_DAYS,
      PAS_6M_BET_INTENSITY,
      PAS_6M_BET_MEAN_STAKE,
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
      PAS_6M_PERCENT_UNSOCIABLE_BETS,
      # Transactions
      PAS_6M_TOTAL_DEPOSIT_FREQ,
      PAS_6M_DEPOSIT_INTENSITY,
      PAS_6M_MEAN_DEPOSIT_AMOUNT,
      PAS_6M_PERCENT_DECLINED_DEPOSITS,
      PAS_6M_PERCENT_CANCEL_WITHDRAWS,
      # Use of consumer protection tools
      EVER_USED_ANY_TOOL,
      ACTIVE_DEPOSIT_LIMIT,
      RE_PAS_6M_ANY_TRIGGER),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous2() ~ c(
      "{median} [{p25} - {p75}]",
      "{mean} ({sd})")),
    by = STATUS,
    label = c(
      AGE ~ "Age",
      GENDER ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      DAYS_REGISTERED ~ "No. of days since accounts registered",
      DAYS_SINCE_LAST_BET ~ "No. days since last bet",
      PAS_6M_BET_FREQ_DAYS ~ "No. betting days",
      PAS_6M_BET_INTENSITY ~ "Mean bets per active day",
      PAS_6M_BET_MEAN_STAKE ~ "Mean stake",
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~ "Mean net outcome per month",
      PAS_6M_PERCENT_UNSOCIABLE_BETS ~ "Percentage of bets placed between midnight & 6AM",
      PAS_6M_TOTAL_DEPOSIT_FREQ ~ "No. deposits",
      PAS_6M_DEPOSIT_INTENSITY ~ "Mean no. deposits per active betting day",
      PAS_6M_MEAN_DEPOSIT_AMOUNT ~ "Mean deposit amount",
      PAS_6M_PERCENT_DECLINED_DEPOSITS ~ "Percentage of deposits declined",
      PAS_6M_PERCENT_CANCEL_WITHDRAWS ~ "Percentage of withdrawals cancelled",
      EVER_USED_ANY_TOOL ~ "Ever used a SG",
      ACTIVE_DEPOSIT_LIMIT ~ "Active deposit limit",
      RE_PAS_6M_ANY_TRIGGER ~ "Flagged as 'at-risk' ≥ 1 times")
  ) %>%
  add_difference(
    list(all_continuous() ~ "cohens_d",  # For continuous variables
    all_categorical() ~ "smd")       # For categorical variables, you can use standardized mean difference or any other suitable method
  ) %>%
  modify_header(label = "**Variable**") %>% 
  modify_caption("**Table 3. Comparison of participants and non-participants: Demographic and gambling characteristics**") %>%
  bold_labels() %>%
  as_gt() %>%
  tab_options(data_row.padding = px(1.5))
Study 2 (past 30-day bettors) - Non-weighted comparison of participants and non-participants: Demographic and gambling characteristics
Variable Participants, N = 1,5141 Non-participants, N = 11,2851 Difference2 95% CI2,3
Age

0.21 0.16, 0.27
    Median [IQR] 41 [30 - 54] 37 [27 - 50]

    Mean (SD) 43 (15) 39 (15)

Gender

0.07 0.02, 0.12
    Female 158 (10%) 1,363 (12%)

    Male 1,327 (88%) 9,774 (87%)

    Unknown 29 (1.9%) 148 (1.3%)

IRSAD score

0.00 -0.06, 0.05
    Median [IQR] 1,000 [953 - 1,060] 998 [953 - 1,059]

    Mean (SD) 1,005 (75) 1,005 (76)

    Unknown 1 94

No. of days since accounts registered

0.15 0.09, 0.20
    Median [IQR] 1,250 [545 - 2,205] 1,135 [479 - 1,969]

    Mean (SD) 1,429 (998) 1,291 (942)

No. days since last bet

-0.33 -0.39, -0.28
    Median [IQR] 2 [1 - 6] 4 [2 - 12]

    Mean (SD) 5 (7) 8 (8)

No. betting days

0.36 0.31, 0.41
    Median [IQR] 54 [24 - 101] 36 [15 - 74]

    Mean (SD) 66 (49) 50 (44)

Mean bets per active day

0.21 0.15, 0.26
    Median [IQR] 9 [4 - 19] 6 [3 - 13]

    Mean (SD) 16 (25) 12 (22)

Mean stake

-0.06 -0.11, 0.00
    Median [IQR] 17 [7 - 54] 17 [7 - 56]

    Mean (SD) 75 (277) 96 (381)

Mean net outcome per month

-0.02 -0.07, 0.04
    Median [IQR] -149 [-1,230 - -12] -73 [-631 - -6]

    Mean (SD) -1,111 (3,185) -989 (7,240)

Percentage of bets placed between midnight & 6AM

0.01 -0.04, 0.07
    Median [IQR] 1 [0 - 6] 1 [0 - 6]

    Mean (SD) 5 (9) 5 (10)

No. deposits

0.28 0.23, 0.33
    Median [IQR] 57 [10 - 241] 27 [6 - 119]

    Mean (SD) 206 (392) 123 (280)

Mean no. deposits per active betting day

0.24 0.18, 0.29
    Median [IQR] 1.36 [0.50 - 2.91] 1.04 [0.40 - 2.12]

    Mean (SD) 2.31 (3.07) 1.73 (2.35)

Mean deposit amount

-0.06 -0.11, 0.00
    Median [IQR] 53 [25 - 134] 51 [22 - 150]

    Mean (SD) 155 (493) 206 (957)

Percentage of deposits declined

0.09 0.04, 0.14
    Median [IQR] 2 [0 - 9] 1 [0 - 9]

    Mean (SD) 8 (14) 7 (12)

Percentage of withdrawals cancelled

0.10 0.04, 0.15
    Median [IQR] 0 [0 - 4] 0 [0 - 0]

    Mean (SD) 9 (19) 7 (18)

Ever used a CPT 378 (25%) 2,341 (21%) 0.10 0.05, 0.15
Active deposit limit 161 (11%) 1,167 (10%) 0.01 -0.04, 0.06
Flagged as 'at-risk' ≥ 1 times 560 (37%) 3,443 (31%) 0.14 0.08, 0.19
1 n (%)
2 Cohen’s D; Standardized Mean Difference
3 CI = Confidence Interval

Sensitivity check comparison

Again, let’s see how this comparison differs if we define Participants by full survey completion:

Show code
# Check coding of variable that tells us whether somebody fully completed the survey or not
master_dataset_study2_restricted %>%
  count(STATUS,
           SURVEY_STATUS_FULL)
# A tibble: 4 × 3
  STATUS           SURVEY_STATUS_FULL     n
  <fct>            <chr>              <int>
1 Participants     Complete            1115
2 Participants     Incomplete           399
3 Non-participants Incomplete           158
4 Non-participants Not started        11127
Show code
master_dataset_study2_sensitivity_check   <- master_dataset_study2_restricted %>%
  mutate(SURVEY_STATUS_FULL_BINARY = case_when(SURVEY_STATUS_FULL == "Incomplete" ~ "Non-participants",
                                               SURVEY_STATUS_FULL == "Not started" ~ "Non-participants",
         TRUE ~ "Participants")) %>%
 mutate(SURVEY_STATUS_FULL_BINARY = fct_relevel(SURVEY_STATUS_FULL_BINARY,
                            "Participants",
                            "Non-participants")) # fix order of presentation

# Check recoding has worked:
master_dataset_study2_sensitivity_check %>%
  count(STATUS,
           SURVEY_STATUS_FULL_BINARY)
# A tibble: 3 × 3
  STATUS           SURVEY_STATUS_FULL_BINARY     n
  <fct>            <fct>                     <int>
1 Participants     Participants               1115
2 Participants     Non-participants            399
3 Non-participants Non-participants          11285
Show code
# Update survey designs:
survey_design_all <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_sensitivity_check,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Create survey design objects for each STATUS group:
survey_design_non_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Get sample sizes for each group
participants_n <- sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY == "Participants")
non_participants_n <- sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY == "Non-participants")
overall_n <- nrow(master_dataset_study2_sensitivity_check)




# Produce updated table
# [Again, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]

# To streamline this substantially, let's develop a function to calculate weighted summary for continuous variables and put it all together nicely for tabling later on:
# Function to calculate weighted summary for continuous variables
weighted_summary_continuous <- function(variable, design_all, design_non_participants, design_participants, participants_n, non_participants_n, overall_n) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate means and standard deviations
  mean_all <- svymean(var_formula, design_all, na.rm = TRUE)
  mean_non_ps <- svymean(var_formula, design_non_participants, na.rm = TRUE)
  mean_ps <- svymean(var_formula, design_participants, na.rm = TRUE)
  
  sd_all <- sqrt(svyvar(var_formula, design_all, na.rm = TRUE))
  sd_non_ps <- sqrt(svyvar(var_formula, design_non_participants, na.rm = TRUE))
  sd_ps <- sqrt(svyvar(var_formula, design_participants, na.rm = TRUE))
  
  # Calculate weighted medians
  median_all <- svyquantile(var_formula, design_all, 0.5, na.rm = TRUE)
  median_non_ps <- svyquantile(var_formula, design_non_participants, 0.5, na.rm = TRUE)
  median_ps <- svyquantile(var_formula, design_participants, 0.5, na.rm = TRUE)
   
  # Format for table
  summary_df <- tibble(
    Variable = variable,
    `Overall` = paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),
    `Participants` = paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),
    `Non-participants` = paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2))
  ) %>%
  # Compute effect size
  mutate(
    pooled_sd = sqrt(((participants_n - 1) * sd_ps^2 + (non_participants_n - 1) * sd_non_ps^2) /
                     (participants_n + non_participants_n - 2)),
    cohen_d = (coef(mean_ps) - coef(mean_non_ps)) / pooled_sd,
    SE = sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) + (cohen_d^2 / (2 * (participants_n + non_participants_n)))),
    CI_lower = cohen_d - 1.96 * SE,
    CI_upper = cohen_d + 1.96 * SE,
    Effect_size = paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]")
  ) %>%
  select(Variable, Overall, Participants, `Non-participants`, Effect_size)
  
  return(summary_df)
}

# Apply function to continuous variables
continuous_vars <- c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", 
                     "PAS_6M_BET_FREQ_DAYS", "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", 
                     "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME", "PAS_6M_PERCENT_UNSOCIABLE_BETS", 
                     "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY", "PAS_6M_MEAN_DEPOSIT_AMOUNT", 
                     "PAS_6M_PERCENT_DECLINED_DEPOSITS", "PAS_6M_PERCENT_CANCEL_WITHDRAWS")

all_continuous_summaries <- map_dfr(continuous_vars, 
                                    ~weighted_summary_continuous(.x, survey_design_all, 
                                                                 survey_design_non_participants, survey_design_participants, 
                                                                 participants_n, non_participants_n, overall_n))


# Categorical variables:

# Let's also develop a function to calculate and format percentages for categorical variables that we can join to the above outcomes for tabling. 

# Note: this function computes Cramér's V effect size, which is typically used for categorical variables with more than two levels, but it also works for variables with only two levels (in which case it's equivalent to the phi coefficient).

# UPDATE: Let's just compute odds ratios for the two-level variables, based on a reviewer suggestion (thanks ChatGPT for your help with this one!):

categorical_summary <- function(variable, design_all, design_non_participants, design_participants) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate proportions for overall sample
  prop_table_all <- svytable(var_formula, design_all)
  props_all <- prop.table(prop_table_all) %>%
    as.data.frame() %>%
    mutate(Overall = Freq * 100) %>%
    select(Variable = !!sym(variable), Overall)
  
  # Calculate proportions for non-participants
  prop_table_non_ps <- svytable(var_formula, design_non_participants)
  props_non_ps <- prop.table(prop_table_non_ps) %>%
    as.data.frame() %>%
    mutate('Non-participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Non-participants')
  
  # Calculate proportions for participants
  prop_table_ps <- svytable(var_formula, design_participants)
  props_ps <- prop.table(prop_table_ps) %>%
    as.data.frame() %>%
    mutate('Participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Participants')
  
  # Combine results and format
  summary_df <- props_ps %>%
    left_join(props_non_ps, by = "Variable") %>%
    left_join(props_all, by = "Variable") %>%
    mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%
    replace(is.na(.), "")

  # Determine if variable is binary (i.e., only two levels)
  unique_values <- length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))
  
  if (unique_values == 2 & variable != "GENDER") {
    # For this sensitivity analysis, participation is defined by full survey
    # completion, so the grouping variable must be SURVEY_STATUS_FULL_BINARY
    # (using the original STATUS column here would simply reproduce the
    # main-analysis odds ratios). Treat it as a factor with "Non-participants"
    # as the reference level:
    design_all$variables$SURVEY_STATUS_FULL_BINARY <- relevel(as.factor(design_all$variables$SURVEY_STATUS_FULL_BINARY), ref = "Non-participants")

    # Compute Odds Ratio (OR) using survey-weighted logistic regression
    logistic_model <- tryCatch(
      svyglm(as.formula(paste0(variable, " ~ SURVEY_STATUS_FULL_BINARY")), 
             design = design_all, family = quasibinomial()),
      error = function(e) NULL
    )

    if (!is.null(logistic_model)) {
      coef_name <- grep("STATUS", names(coef(logistic_model)), value = TRUE)
      if (length(coef_name) > 0) {
        OR <- exp(coef(logistic_model)[coef_name])
        CI <- exp(confint(logistic_model)[coef_name, ])
        effect_size_text <- paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]")
      } else {
        effect_size_text <- "NA"
      }
    } else {
      effect_size_text <- "NA"
    }

  } else {
    # Compute Cramér’s V for categorical variables with more than two levels
    combined_table <- as.table(rbind(prop_table_non_ps, prop_table_ps))
    V <- cramersV(combined_table)
    V_rounded <- as.matrix(V$output) %>% t() %>% as.data.frame() %>%
      mutate(cramersV = as.numeric(cramersV)) %>%
      mutate(cramersV = round(cramersV, 2))
    
    CI <- confIntV(combined_table)
    CI_info <- CI$intermediate
    CIs <- CI_info$fisherZ.ci %>%
      t() %>%
      as.data.frame() %>%
      mutate_if(is.numeric, round, 2) %>%
      unite('Effect_size', 1, 2, sep = ", ")
    
    effect_size_text <- paste0(V_rounded, " [", CIs, "]")
  }

  # Add effect size to the table
  header_row <- data.frame(Variable = variable, Overall = "", Participants = "", 
                           `Non-participants` = "", `Effect_size` = effect_size_text)

  summary_df <- bind_rows(header_row, summary_df) %>%
    as_tibble() %>%
    select(Variable, Overall, Participants, `Non-participants`, `Effect_size`)

  return(summary_df)
}



# Apply function to categorical variables
categorical_vars <- c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")
all_categorical_summaries <- map_dfr(categorical_vars, 
                                     ~categorical_summary(.x, survey_design_all, 
                                                          survey_design_non_participants, survey_design_participants))

all_categorical_summaries <- all_categorical_summaries %>%
# Recode for consistency across summaries:
mutate(Variable = case_when(Variable == 1 ~ "Yes",
                            Variable == 0 ~ "No",
                            Variable == TRUE ~ "Yes",
                            Variable == FALSE ~ "No",
                            TRUE ~ Variable))

# Combine continuous and categorical summaries
weighted_comparison_summary_data <- bind_rows(all_continuous_summaries, all_categorical_summaries) %>%
  rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols = "Participants") %>%
  rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols = "Non-participants") %>%
  rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols = "Overall")

# Order data for table (as we have duplicate names in the Variable column, this is a pain in the butt: the variables need to be separated and rejoined in the right order)
gender_summary_tabled<- weighted_comparison_summary_data[15:18,1:5]
age_summary_tabled<- weighted_comparison_summary_data[1,1:5]
summary_table_extracted_vars_missing<- weighted_comparison_summary_data %>%
  filter(Variable !="AGE",
    Variable !="GENDER",
         Variable !="Male",
         Variable !="Female",
         Variable !="Unknown"
         )

# Join everything back together and provide some nice names for our variables:
weighted_comparison_summary_data_ordered <- bind_rows(age_summary_tabled,
          gender_summary_tabled,
          summary_table_extracted_vars_missing) %>%
   mutate(Variable = dplyr::recode(Variable,
    `AGE` = "Age",
    "GENDER" = "Gender",
    "IRSAD_SCORE" = "IRSAD score",
    "DAYS_REGISTERED" = "Days since registered",
    "DAYS_SINCE_LAST_BET" = "No. days since last bet",
    "PAS_6M_BET_FREQ_DAYS" = "No. betting days",
    "PAS_6M_BET_INTENSITY" = "Mean bets per active day",
    "PAS_6M_BET_MEAN_STAKE" = "Mean stake",
    "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" = "Mean net outcome per month",
    "PAS_6M_PERCENT_UNSOCIABLE_BETS" = "Percentage of bets midnight - 6AM",
    "PAS_6M_TOTAL_DEPOSIT_FREQ" = "No. deposits",
    "PAS_6M_DEPOSIT_INTENSITY" = "Mean no. deposits per active day",
    "PAS_6M_MEAN_DEPOSIT_AMOUNT" = "Mean deposit amount",
    "PAS_6M_PERCENT_DECLINED_DEPOSITS" = "Percentage of deposits declined",
    "PAS_6M_PERCENT_CANCEL_WITHDRAWS" = "Percentage of withdrawals cancelled",
    "EVER_USED_ANY_TOOL" = "Ever used a CPT",
    "ACTIVE_DEPOSIT_LIMIT" = "Active deposit limit",
    "RE_PAS_6M_ANY_TRIGGER" = "Flagged as 'at-risk' ≥ 1 times"
  ))


# Create final table:
weighted_comparison_summary_data_ordered %>%
  rename("Effect size" = Effect_size
         ) %>%
  gt() %>%
   tab_header(title = "Table 4.1   Study 2 (past 30-day bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]") %>%
  tab_options(
    data_row.padding = px(1.5)
  ) %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramer's V [95% CIs])",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Table 4.1 Study 2 (past 30-day bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]
Variable Overall (N = 12799)1 Participants (N = 1115)1 Non-participants (N = 11684) Effect size2
Age 40.62 (15.44), 38 43.42 (14.68), 42 40.36 (15.48), 38 0.2 [0.14, 0.26]
Gender NA 0.03 [0.02, 0.03]
Female 14.18% 11.46% 14.43% NA
Male 84.72% 86.99% 84.51% NA
Unknown 1.1% 1.55% 1.05% NA
IRSAD score 1002.69 (75.08), 994.21 1007.51 (75.4), 1000.6 1002.24 (75.03), 993.97 0.07 [0.01, 0.13]
Days since registered 1259.45 (926.89), 1097 1417.35 (964.27), 1225 1244.9 (922.05), 1072 0.19 [0.12, 0.25]
No. days since last bet 8.6 (8.64), 4 5.84 (6.95), 3 8.85 (8.74), 5 -0.35 [-0.41, -0.29]
No. betting days 40.36 (38.75), 27 50.35 (42.96), 39 39.44 (38.21), 27 0.28 [0.22, 0.34]
Mean bets per active day 9.22 (15.7), 5.11 11.05 (18.38), 6.41 9.05 (15.42), 5 0.13 [0.07, 0.19]
Mean stake 32.15 (144.94), 10.4 25.11 (91.67), 9.02 32.8 (148.88), 10.58 -0.05 [-0.11, 0.01]
Mean net outcome per month -251.88 (2584.18), -36.03 -267.81 (1199.21), -41.62 -250.41 (2676.01), -35.5 -0.01 [-0.07, 0.05]
Percentage of bets midnight - 6AM 4.32 (9.42), 0.06 4.26 (8.81), 0.51 4.33 (9.48), 0 -0.01 [-0.07, 0.05]
No. deposits 60.74 (154.29), 14 77.52 (173.67), 19 59.2 (152.3), 14 0.12 [0.06, 0.18]
Mean no. deposits per active day 1.14 (1.49), 0.81 1.26 (1.8), 0.81 1.13 (1.45), 0.81 0.08 [0.02, 0.15]
Mean deposit amount 84.55 (348.17), 37.5 72.06 (204.19), 38 85.7 (358.52), 37.5 -0.04 [-0.1, 0.02]
Percentage of deposits declined 6.18 (12.35), 0 7.39 (14.83), 0 6.07 (12.09), 0 0.11 [0.05, 0.17]
Percentage of withdrawals cancelled 2.96 (11.33), 0 3.61 (13.22), 0 2.9 (11.14), 0 0.06 [0, 0.12]
Ever used a CPT NA 1.25 [1.06, 1.47]
No 83.17% 81.2% 83.35% NA
Yes 16.83% 18.8% 16.65% NA
Active deposit limit NA 1.12 [0.91, 1.38]
No 90.35% 90.33% 90.35% NA
Yes 9.65% 9.67% 9.65% NA
Flagged as 'at-risk' ≥ 1 times NA 1.34 [1.2, 1.5]
No 95.97% 95.61% 96.01% NA
Yes 4.03% 4.39% 3.99% NA
1 Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
2 Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramér's V [95% CIs]

Again, the effect sizes appear to be smaller in these comparisons. Let’s check whether this is because the group who started the survey but didn’t complete it has now been merged with non-responders (making that group more similar to responders), or because those who fully completed the survey are genuinely more similar to those who didn’t start it:

Show code
master_dataset_study2_restricted %>%
  tbl_summary(
    include = c(# Demographic details
      AGE, 
      GENDER, 
      IRSAD_SCORE,
      DAYS_REGISTERED, 
       # Wagering
      DAYS_SINCE_LAST_BET, 
      PAS_6M_BET_FREQ_DAYS,
      PAS_6M_BET_INTENSITY,
      PAS_6M_BET_MEAN_STAKE,
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
      PAS_6M_PERCENT_UNSOCIABLE_BETS,
      # Transactions
      PAS_6M_TOTAL_DEPOSIT_FREQ,
      PAS_6M_DEPOSIT_INTENSITY,
      PAS_6M_MEAN_DEPOSIT_AMOUNT,
      PAS_6M_PERCENT_DECLINED_DEPOSITS,
      PAS_6M_PERCENT_CANCEL_WITHDRAWS,
      # Use of consumer protection tools
      EVER_USED_ANY_TOOL,
      ACTIVE_DEPOSIT_LIMIT,
      RE_PAS_6M_ANY_TRIGGER),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous2() ~ c(
      "{median} [{p25} - {p75}]",
      "{mean} ({sd})")),
    by = SURVEY_STATUS_FULL,
    label = c(
      AGE ~ "Age",
      GENDER ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      DAYS_REGISTERED ~ "No. of days since accounts registered",
      DAYS_SINCE_LAST_BET ~ "No. days since last bet",
      PAS_6M_BET_FREQ_DAYS ~ "No. betting days",
      PAS_6M_BET_INTENSITY ~ "Mean bets per active day",
      PAS_6M_BET_MEAN_STAKE ~ "Mean stake",
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~ "Mean net outcome per month",
      PAS_6M_PERCENT_UNSOCIABLE_BETS ~ "Percentage of bets placed between midnight & 6AM",
      PAS_6M_TOTAL_DEPOSIT_FREQ ~ "No. deposits",
      PAS_6M_DEPOSIT_INTENSITY ~ "Mean no. deposits per active betting day",
      PAS_6M_MEAN_DEPOSIT_AMOUNT ~ "Mean deposit amount",
      PAS_6M_PERCENT_DECLINED_DEPOSITS ~ "Percentage of deposits declined",
      PAS_6M_PERCENT_CANCEL_WITHDRAWS ~ "Percentage of withdrawals cancelled",
      EVER_USED_ANY_TOOL ~ "Ever used a SG",
      ACTIVE_DEPOSIT_LIMIT ~ "Active deposit limit",
      RE_PAS_6M_ANY_TRIGGER ~ "Flagged as 'at-risk' ≥ 1 times")
  ) %>%
  modify_header(label = "Variable") %>% 
  modify_caption("Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey") %>%
  bold_labels() 
Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn’t complete it, and those who didn’t start the survey
Variable Complete, N = 1,1151 Incomplete, N = 5571 Not started, N = 11,1271
Age


    Median [IQR] 40 [30 - 52] 42 [29 - 56] 37 [27 - 50]
    Mean (SD) 42 (14) 44 (16) 39 (15)
Gender


    Female 103 (9.2%) 72 (13%) 1,346 (12%)
    Male 992 (89%) 473 (85%) 9,636 (87%)
    Unknown 20 (1.8%) 12 (2.2%) 145 (1.3%)
IRSAD score


    Median [IQR] 1,002 [955 - 1,065] 990 [949 - 1,055] 998 [953 - 1,059]
    Mean (SD) 1,008 (75) 1,000 (74) 1,005 (76)
    Unknown 1 1 93
No. of days since accounts registered


    Median [IQR] 1,282 [582 - 2,217] 1,165 [463 - 2,115] 1,135 [480 - 1,970]
    Mean (SD) 1,445 (986) 1,362 (1,023) 1,291 (941)
No. days since last bet


    Median [IQR] 3 [1 - 6] 2 [1 - 5] 4 [2 - 13]
    Mean (SD) 5 (7) 5 (6) 8 (8)
No. betting days


    Median [IQR] 52 [23 - 97] 67 [28 - 113] 36 [15 - 74]
    Mean (SD) 63 (47) 73 (50) 50 (44)
Mean bets per active day


    Median [IQR] 8 [4 - 17] 11 [4 - 22] 6 [3 - 13]
    Mean (SD) 15 (23) 18 (27) 12 (22)
Mean stake


    Median [IQR] 15 [6 - 48] 23 [9 - 71] 16 [7 - 55]
    Mean (SD) 67 (222) 99 (351) 96 (383)
Mean net outcome per month


    Median [IQR] -118 [-1,060 - -9] -366 [-1,671 - -28] -72 [-613 - -6]
    Mean (SD) -953 (3,008) -1,504 (3,356) -983 (7,284)
Percentage of bets placed between midnight & 6AM


    Median [IQR] 1 [0 - 6] 1 [0 - 5] 1 [0 - 6]
    Mean (SD) 5 (9) 5 (8) 5 (10)
No. deposits


    Median [IQR] 46 [9 - 205] 92 [21 - 316] 27 [6 - 117]
    Mean (SD) 184 (352) 270 (539) 121 (270)
Mean no. deposits per active betting day


    Median [IQR] 1.26 [0.44 - 2.63] 1.73 [0.83 - 3.53] 1.03 [0.40 - 2.10]
    Mean (SD) 2.17 (2.98) 2.75 (3.73) 1.71 (2.29)
Mean deposit amount


    Median [IQR] 50 [24 - 128] 73 [31 - 164] 50 [22 - 150]
    Mean (SD) 148 (517) 179 (393) 206 (963)
Percentage of deposits declined


    Median [IQR] 2 [0 - 8] 4 [0 - 9] 1 [0 - 9]
    Mean (SD) 8 (14) 8 (14) 7 (12)
Percentage of withdrawals cancelled


    Median [IQR] 0 [0 - 2] 0 [0 - 13] 0 [0 - 0]
    Mean (SD) 8 (19) 12 (22) 7 (18)
Ever used a CPT 267 (24%) 159 (29%) 2,293 (21%)
Active deposit limit 112 (10%) 69 (12%) 1,147 (10%)
Flagged as 'at-risk' ≥ 1 times 370 (33%) 266 (48%) 3,367 (30%)
1 n (%)

As with Study 1, it seems that those who started the survey but didn’t complete it are the more involved/intense bettors.

Composite risk scores

Now let’s produce a single composite score that represents people’s past-six-month gambling behaviour.

One thing we need to consider here is that some of the variables we’ve created could be biased by recent sign-up to the site. For example, one of the variables we created to summarise past-six-month behaviour is the total number of days somebody gambled, and another is the total amount somebody deposited. Both of these totals will be deflated for anybody who registered less than six months before sampling. By contrast, variables that are averages, or daily or monthly summaries, will not be affected by this.
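
As a quick illustration of the scale of this issue (a sketch: the 183-day cut-off is only an approximation of the six-month window), we can check what proportion of the restricted sample registered recently enough for their six-month totals to be truncated:

Show code
# Approximate share of customers whose account is younger than the six-month
# aggregation window (~183 days), for whom the 6-month totals are truncated:
master_dataset_study2_restricted %>%
  as_tibble() %>%
  summarise(PROP_RECENTLY_REGISTERED = mean(DAYS_REGISTERED < 183, na.rm = TRUE))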

Let’s first identify all the variables we engineered that summarise past six month behaviour, then classify them based on whether they will be subject to this bias or not:

Show code
PAS_6M_variables<- master_dataset_study2_restricted %>% 
  select(
         contains("PAS_6M"), # Identify all relevant past six months aggregate variables
         -contains("RE_"), # Remove risk identification system variables
         -contains("INCOME") # Remove variables relating  to percentage of income which are only available for survey participants
         ) %>%
  colnames() %>%
  as.data.frame() %>%
  rename(Variables = 1) %>%
  mutate(`Biased by sign-up date` = case_when(
                     Variables == "PAS_6M_BET_FREQ_DAYS" ~ "Yes", 
                     Variables == "PAS_6M_BET_FREQ_BETS" ~ "Yes", 
                     Variables == "PAS_6M_BET_MEAN_STAKE" ~ "No", 
                     Variables == "PAS_6M_BET_SD_STAKE" ~ "No", 
                     Variables == "PAS_6M_BET_SPEND" ~ "Yes", 
                     Variables == "PAS_6M_MEAN_ODDS" ~ "Yes", 
                     Variables == "PAS_6M_N_ACTIVITIES" ~ "No", 
                     Variables == "PAS_6M_BET_INTENSITY" ~ "No", 
                     Variables == "PAS_6M_BET_WIN" ~ "Yes", 
                     Variables == "PAS_6M_NET_OUTCOME" ~ "Yes", 
                     Variables == "PAS_6M_MAX_DAY_STAKE" ~ "No", 
                     Variables == "PAS_6M_MAX_DAY_BETS" ~ "No", 
                     Variables == "PAS_6M_MEAN_DAILY_STAKE" ~ "No", 
                     Variables == "PAS_6M_SD_DAILY_STAKE" ~ "No", 
                     Variables == "PAS_6M_TOTAL_UNSOCIABLE_BETS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_UNSOCIABLE_BETS" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_BET_FREQ_DAYS" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_SPEND" ~ "No", 
                     Variables == "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" ~ "No",
                     Variables == "PAS_6M_TOTAL_DEPOSIT_AMOUNT" ~ "Yes", 
                     Variables == "PAS_6M_MEAN_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_SD_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_TOTAL_DEPOSIT_FREQ" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_DECLINED_DEPOSITS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_DECLINED_DEPOSITS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_DEPOSIT_METHODS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_FREQ" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_AMOUNT" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_CANCEL_WITHDRAWS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_CANCEL_WITHDRAWS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_METHODS" ~ "No", 
                     Variables == "PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_DEPOSIT_INTENSITY" ~ "No",
                     TRUE ~ Variables)) 

# Table variables:
PAS_6M_variables %>%
  gt()%>%
  tab_header("All past-6-month aggregate summary variables", subtitle = "With risk of bias status for each variable")
All past-6-month aggregate summary variables
With risk of bias status for each variable
Variables Biased by sign-up date
PAS_6M_BET_FREQ_DAYS Yes
PAS_6M_BET_FREQ_BETS Yes
PAS_6M_BET_FREQ_INCL_LEGS Yes
PAS_6M_BET_MEAN_STAKE No
PAS_6M_BET_SD_STAKE No
PAS_6M_BET_SPEND Yes
PAS_6M_MEAN_ODDS Yes
PAS_6M_N_ACTIVITIES No
PAS_6M_BET_INTENSITY No
PAS_6M_BET_WIN Yes
PAS_6M_NET_OUTCOME Yes
PAS_6M_MAX_DAY_STAKE No
PAS_6M_MAX_DAY_BETS No
PAS_6M_MEAN_DAILY_STAKE No
PAS_6M_SD_DAILY_STAKE No
PAS_6M_TOTAL_UNSOCIABLE_BETS Yes
PAS_6M_PERCENT_UNSOCIABLE_BETS No
PAS_6M_MONTHLY_BET_FREQ_DAYS No
PAS_6M_MONTHLY_SPEND No
PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME No
PAS_6M_TOTAL_DEPOSIT_AMOUNT Yes
PAS_6M_MEAN_DEPOSIT_AMOUNT No
PAS_6M_SD_DEPOSIT_AMOUNT No
PAS_6M_TOTAL_DEPOSIT_FREQ Yes
PAS_6M_TOTAL_DECLINED_DEPOSITS Yes
PAS_6M_PERCENT_DECLINED_DEPOSITS No
PAS_6M_TOTAL_DEPOSIT_METHODS No
PAS_6M_TOTAL_WITHDRAW_FREQ Yes
PAS_6M_TOTAL_WITHDRAW_AMOUNT Yes
PAS_6M_TOTAL_CANCEL_WITHDRAWS Yes
PAS_6M_PERCENT_CANCEL_WITHDRAWS No
PAS_6M_TOTAL_WITHDRAW_METHODS No
PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT No
PAS_6M_MONTHLY_DEPOSIT_AMOUNT No
PAS_6M_MONTHLY_DEPOSIT_FREQ No
PAS_6M_DEPOSIT_INTENSITY No

Okay, now that we have all of the relevant variables, let’s look at how they correlate with PGSI and harm scores, using a series of bivariate correlations visualised as a heatmap. These results are unweighted:

Show code
PGSI_cor_data<- master_dataset_study2_restricted %>% 
    select(PGSI_TOTAL,
        GHM_TOTAL,
       PAS_6M_BET_MEAN_STAKE, 
       PAS_6M_BET_SD_STAKE, 
       PAS_6M_N_ACTIVITIES, 
       PAS_6M_BET_INTENSITY, 
       PAS_6M_MAX_DAY_STAKE, 
       PAS_6M_MAX_DAY_BETS, 
       PAS_6M_MEAN_DAILY_STAKE, 
       PAS_6M_SD_DAILY_STAKE, 
       PAS_6M_PERCENT_UNSOCIABLE_BETS, 
       PAS_6M_MONTHLY_BET_FREQ_DAYS, 
       PAS_6M_MONTHLY_SPEND, 
       PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
       PAS_6M_MEAN_DEPOSIT_AMOUNT, 
       PAS_6M_SD_DEPOSIT_AMOUNT, 
       PAS_6M_PERCENT_DECLINED_DEPOSITS, 
       PAS_6M_TOTAL_DEPOSIT_METHODS, 
       PAS_6M_PERCENT_CANCEL_WITHDRAWS, 
       PAS_6M_TOTAL_WITHDRAW_METHODS, 
       PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT, 
       PAS_6M_MONTHLY_DEPOSIT_AMOUNT, 
       PAS_6M_DEPOSIT_INTENSITY)

ggstatsplot::ggcorrmat(PGSI_cor_data, 
                       type = "parametric",
                       results.subtitle = FALSE,
                       ggcorrplot.args = c(lab_size = 3)) +
  theme(axis.text.y = element_text(color = "black", size = 9, face = "plain", family = "Poppins"),
        axis.text.x = element_text(color = "black", size = 9, face = "plain", family = "Poppins"),
        legend.title = element_text(color = "black", size = 12, face = "plain", family = "Poppins"),
        legend.title.align = .5,
        legend.text = element_text(color = "black", size = 8, face = "plain", family = "Poppins"))

Now let’s use all of these variables to create a composite index score of risk/involvement:

Show code
master_dataset_study2_restricted_with_risk_scores <- master_dataset_study2_restricted %>%
  mutate(PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME = PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME*-1) %>% # Invert outcome values
  mutate(across(c(PAS_6M_BET_MEAN_STAKE, 
                  PAS_6M_BET_SD_STAKE, 
                  PAS_6M_N_ACTIVITIES, 
                  PAS_6M_BET_INTENSITY, 
                  PAS_6M_MAX_DAY_STAKE, 
                  PAS_6M_MAX_DAY_BETS, 
                  PAS_6M_MEAN_DAILY_STAKE, 
                  PAS_6M_SD_DAILY_STAKE, 
                  PAS_6M_PERCENT_UNSOCIABLE_BETS, 
                  PAS_6M_MONTHLY_BET_FREQ_DAYS, 
                  PAS_6M_MONTHLY_SPEND, 
                  PAS_6M_MEAN_DEPOSIT_AMOUNT, 
                  PAS_6M_SD_DEPOSIT_AMOUNT, 
                  PAS_6M_PERCENT_DECLINED_DEPOSITS, 
                  PAS_6M_TOTAL_DEPOSIT_METHODS, 
                  PAS_6M_PERCENT_CANCEL_WITHDRAWS, 
                  PAS_6M_TOTAL_WITHDRAW_METHODS, 
                  PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT, 
                  PAS_6M_MONTHLY_DEPOSIT_AMOUNT, 
                  PAS_6M_DEPOSIT_INTENSITY), 
                ~ as.numeric(scale(.x)), # z-standardise; as.numeric() drops the one-column matrix that scale() returns
                .names = "z_{.col}")) %>% # Create new named columns for easy selection below
  rowwise() %>% # Group by Rows
  mutate(`Composite Risk Score` = sum(c_across(starts_with("z_")), na.rm = TRUE)) %>% # Sum across all newly created Z scores
  ungroup()


# colnames(master_dataset_study2_restricted_with_risk_scores)
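
As a quick sanity check, the distribution of the new composite can be inspected before it is used; because it is a sum of z-scores, it should centre near zero:

Show code
# Inspect the distribution of the composite score:
summary(master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`)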

Correlations

Now let’s look at how well this composite score correlates with PGSI scores in the survey sample:

Show code
master_dataset_study2_restricted_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>%
  mutate(PGSI_CATEGORY = factor(PGSI_CATEGORY, levels = c('No risk',
                                                          'Low risk',
                                                          'Moderate risk',
                                                          'High risk'))) %>%
  ggplot() +
  geom_point(aes(x = PGSI_TOTAL, y = `Composite Risk Score`,
                 colour = PGSI_CATEGORY)) +
  geom_smooth(aes(x = PGSI_TOTAL, y = `Composite Risk Score`), 
              method = "lm",
              se = FALSE,
              color = "black") +
  scale_colour_manual(values = c("#2e4057",
                                 "#00798c", 
                                 "#edae49",
                                 "#d1495b")) +
  geom_hline(yintercept = 0, colour = "grey", linetype = "dashed") +
  scale_y_continuous(limits = c(-10, 120)) +
  plot_theme +
  guides(color = guide_legend("PGSI category")) +
  labs(x = "PGSI total score") +
  scale_x_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 27))

Perform non-weighted correlation:

Show code
cor_PGSI <- cor.test(master_dataset_study2_restricted_with_risk_scores$PGSI_TOTAL, 
                         master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`,
                         method = "spearman")

cor_PGSI

    Spearman's rank correlation rho

data:  master_dataset_study2_restricted_with_risk_scores$PGSI_TOTAL and master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`
S = 389269642, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.3269856 

Perform weighted correlation:

Show code
# This doesn't work nicely with NA values, so let's remove them first:
master_dataset_study2_restricted_with_risk_scores_weighted_cor_data <- master_dataset_study2_restricted_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL))

weighted_cor_PGSI <- weightedCorr(x = master_dataset_study2_restricted_with_risk_scores_weighted_cor_data$PGSI_TOTAL, 
                                  y = master_dataset_study2_restricted_with_risk_scores_weighted_cor_data$`Composite Risk Score`, 
                                  method = "Spearman",
                                  weights = master_dataset_study2_restricted_with_risk_scores_weighted_cor_data$ALL_SAMPLE_WEIGHTS)

weighted_cor_PGSI
[1] 0.3427516

Confirm how many people are involved in these calculations:

Show code
master_dataset_study2_restricted_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>% nrow()
[1] 1514

T-test

Perform a t-test to compare the composite risk scores of those who did and didn’t take part in the survey.

Unweighted t-test:

Show code
# Perform Welch's t-test
t_test_composite_study2 <- t.test(`Composite Risk Score` ~ STATUS, data = master_dataset_study2_restricted_with_risk_scores, var.equal = FALSE)

t_test_composite_study2

    Welch Two Sample t-test

data:  Composite Risk Score by STATUS
t = 7.8814, df = 2203.5, p-value = 5.046e-15
alternative hypothesis: true difference in means between group Participants and group Non-participants is not equal to 0
95 percent confidence interval:
 1.454290 2.417725
sample estimates:
    mean in group Participants mean in group Non-participants 
                     1.7069962                     -0.2290113 

And now a weighted t-test:

Show code
# Create survey design object for the whole sample:
survey_design_all_participants <- svydesign(
  id = ~1,
  data = master_dataset_study2_restricted_with_risk_scores,
  weights = ~ALL_SAMPLE_WEIGHTS
)

weighted_t_test_composite_study2 <- survey::svyttest(`Composite Risk Score` ~ STATUS, design = survey_design_all_participants) # The design object already carries the data

weighted_t_test_composite_study2

    Design-based t-test

data:  `Composite Risk Score` ~ STATUS
t = -9.5198, df = 12797, p-value < 2.2e-16
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 -1.716717 -1.130472
sample estimates:
difference in mean 
         -1.423595 

Compute non-weighted summary scores and effect size:

Show code
# Calculate means and standard deviations
summary_composite_study2 <- master_dataset_study2_restricted_with_risk_scores %>%
  group_by(STATUS) %>%
  dplyr::summarize(
    mean = mean(`Composite Risk Score`),
    sd = sd(`Composite Risk Score`),
    n = n()
  ) 

gt(summary_composite_study2)
STATUS mean sd n
Participants 1.7069962 8.687786 1514
Non-participants -0.2290113 10.879017 11285
Show code
# Calculate Cohen's d with 95% confidence interval
cohen_d_result_study2 <- cohen.d(`Composite Risk Score` ~ STATUS, data = master_dataset_study2_restricted_with_risk_scores)

cohen_d_result_study2

Cohen's d

d estimate: 0.181896 (negligible)
95 percent confidence interval:
    lower     upper 
0.1282005 0.2355915 

Compute weighted summary score and effect size:

Show code
# Rename the risk score column, as the survey functions don't handle variable names with spaces well:
master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change<- master_dataset_study2_restricted_with_risk_scores %>%
  rename(risk_score = `Composite Risk Score`)


# Create survey design objects for each STATUS group that include the composite risk scores:
survey_design_all_t <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change,
  weights = ~ALL_SAMPLE_WEIGHTS
)
survey_design_non_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change, STATUS == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change, STATUS == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

risk_score_comparison<- weighted_summary_continuous("risk_score", 
                                                    survey_design_all_t,
                                                    survey_design_non_participants_t, 
                                                    survey_design_participants_t, 
                                                    participants_n,
                                                    non_participants_n)

gt(risk_score_comparison)
Variable Overall Participants Non-participants Effect_size
risk_score -3.24 (5.55), -3.99 -1.98 (5.46), -2.79 -3.4 (5.54), -4.13 0.26 [0.2, 0.32]
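
For reference, the weighted means/SDs (and the Cohen’s-d-style effect size) in the table above can be approximated directly with the survey package. This is only a rough sketch: the actual weighted_summary_continuous() helper is defined earlier in this document and may differ in its details (e.g., how it pools SDs or computes confidence intervals):

Show code
# Weighted means and SDs of the composite risk score by participation status:
w_mean_p  <- coef(svymean(~risk_score, survey_design_participants_t))
w_mean_np <- coef(svymean(~risk_score, survey_design_non_participants_t))
w_sd_p    <- sqrt(as.numeric(svyvar(~risk_score, survey_design_participants_t)))
w_sd_np   <- sqrt(as.numeric(svyvar(~risk_score, survey_design_non_participants_t)))

# Standardised mean difference using a simple pooled-SD approximation:
(w_mean_p - w_mean_np) / sqrt((w_sd_p^2 + w_sd_np^2) / 2)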

Quartiles

Now, let’s divide the sample into quartiles based on their composite summary score:

Show code
# Calculate quartiles of the composite risk score
quartiles <- quantile(master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`, 
                      probs = c(0, 0.25, 0.5, 0.75, 1))


master_dataset_study2_restricted_with_risk_scores$Quartile <- cut(master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`, 
                                    breaks = quartiles,
                                    labels = c("Bottom quartile: least involved",
                                               "Second quartile",
                                               "Third quartile",
                                               "Top quartile: most involved"),
                                    include.lowest = TRUE)

Now compute the proportion of individuals from each quartile who went on to take part in the survey:

Show code
master_dataset_study2_restricted_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
  tab_header("Counts for granular survey status by composite risk score quartile", subtitle = NULL)
Counts for granular survey status by composite risk score quartile
SURVEY_STATUS_PGSI n
Bottom quartile: least involved
Complete 252
Incomplete 11
Not started 2937
Second quartile
Complete 331
Incomplete 27
Not started 2842
Third quartile
Complete 418
Incomplete 48
Not started 2733
Top quartile: most involved
Complete 513
Incomplete 72
Not started 2615
Show code
master_dataset_study2_restricted_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(STATUS) %>%
  mutate('Percent of quartile' = round(n/sum(n)*100,2)) %>%
  gt() %>%
  tab_header("Counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Counts (with percentages) for broad survey status by composite risk score quartile
STATUS n Percent of quartile
Bottom quartile: least involved
Participants 252 7.88
Non-participants 2948 92.12
Second quartile
Participants 331 10.34
Non-participants 2869 89.66
Third quartile
Participants 418 13.07
Non-participants 2781 86.93
Top quartile: most involved
Participants 513 16.03
Non-participants 2687 83.97

Now compute weighted counts/proportions:

Show code
# Compute weighted counts for granular survey status
weighted_counts_granular <- svytable(~Quartile + SURVEY_STATUS_PGSI, 
                                     svydesign(id = ~1, 
                                               data = master_dataset_study2_restricted_with_risk_scores, 
                                               weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame()


weighted_counts_granular %>%
  gt() %>%
  tab_header("Weighted counts for granular survey status by composite risk score quartile", subtitle = NULL)
Weighted counts for granular survey status by composite risk score quartile
Quartile SURVEY_STATUS_PGSI Freq
Bottom quartile: least involved Complete 2745.7805
Second quartile Complete 3537.8544
Third quartile Complete 3296.0030
Top quartile: most involved Complete 1306.2995
Bottom quartile: least involved Incomplete 120.6171
Second quartile Incomplete 294.5093
Third quartile Incomplete 355.4089
Top quartile: most involved Incomplete 200.9272
Bottom quartile: least involved Not started 32108.0330
Second quartile Not started 30220.4648
Third quartile Not started 19822.1098
Top quartile: most involved Not started 5914.9925
Show code
# Compute weighted counts and percentages for broad survey status
svytable(~Quartile + STATUS, 
                                  svydesign(id = ~1, 
                                            data = master_dataset_study2_restricted_with_risk_scores, 
                                            weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate('Percent of quartile' = round(Freq / sum(Freq) * 100, 2)) %>%
 gt() %>%
  tab_header("Weighted counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Weighted counts (with percentages) for broad survey status by composite risk score quartile
STATUS Freq Percent of quartile
Bottom quartile: least involved
Participants 2745.781 7.85
Non-participants 32228.650 92.15
Second quartile
Participants 3537.854 10.39
Non-participants 30514.974 89.61
Third quartile
Participants 3296.003 14.04
Non-participants 20177.519 85.96
Top quartile: most involved
Participants 1306.299 17.60
Non-participants 6115.920 82.40

And visualise this using a Sankey diagram (weighted figures):

Show code
# Create survey design object
survey_design_all_participants_sankey <- svydesign(
  id = ~1,
  data = master_dataset_study2_restricted_with_risk_scores,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# We first need to create a data frame that contains the flow "from" and "to" categories and "intensity" values (i.e., N), with weightings applied:
weighted_counts <- svytable(~Quartile + SURVEY_STATUS_PGSI, survey_design_all_participants_sankey) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate(percent = Freq / sum(Freq) * 100) %>%
  ungroup()

# Ensure Quartiles are ordered correctly
weighted_counts$Quartile <- factor(weighted_counts$Quartile,
                                   levels = c("Top quartile: most involved",
                                              "Third quartile",
                                              "Second quartile",
                                              "Bottom quartile: least involved"),
                                   ordered = TRUE)

# Create Sankey plot
desired_order_sankey <- c("Top quartile: most involved",
                          "Third quartile",
                          "Second quartile",
                          "Bottom quartile: least involved")


# non-weighted data: 
risk_to_survey_links<- master_dataset_study2_restricted_with_risk_scores %>%
    select(CLIENT_ID, SURVEY_STATUS_PGSI, Quartile) %>%
  group_by(Quartile, SURVEY_STATUS_PGSI) %>%
   summarise(
    n = n(),
    .groups = "drop"
  ) %>%
   group_by(Quartile) %>%
   mutate(
     percent = n/sum(n)*100) %>%
  select(-n) %>%
  ungroup() %>%
  mutate(Quartile =  factor(Quartile,
                              levels = desired_order_sankey)) %>%
  arrange(Quartile)




# From these flows we have to create a "node" data frame which provides a list of all categories (to and from) in the data frame: 
nodes <- data.frame(
  name = c("Top quartile: most involved",
           "Third quartile",
           "Second quartile",
           "Bottom quartile: least involved",
           as.character(unique(weighted_counts$SURVEY_STATUS_PGSI))) 
)

# For the networkD3 package, the links between categories are made using numeric ids rather than the character names in the links data frame. This code assigns each "from" and "to" category a zero-based id to make the connections explicit:
weighted_counts$IDsource <- match(weighted_counts$Quartile, nodes$name) - 1
weighted_counts$IDtarget <- match(weighted_counts$SURVEY_STATUS_PGSI, nodes$name) - 1



# prepare colour scale
ColourScal ='d3.scaleOrdinal() .range(["#d1495b", 
"#edae49",
"#00798c",
"#2e4057",
"#35B779FF",
"#31688EFF",
"#31688EFF",
"#482878FF",
"#B4DE2CFF",
"#FDE725FF", 
"#26828EFF",
"#3E4A89FF",
"#440154FF",
"#6DCD59FF"
])'

# Create Sankey plot
sankey_plot2_restricted <- sankeyNetwork(
  Links = weighted_counts, 
  Nodes = nodes,
  Source = "IDsource", 
  Target = "IDtarget",
  Value = "percent", 
  NodeID = "name", 
  iterations = 0,
  sinksRight = FALSE,
  fontSize = 16, 
  fontFamily = "Poppins",
  nodeWidth = 30,
  nodePadding = 25,
  colourScale = ColourScal
)


# Custom CSS with Google Fonts link
custom_css <- "
<link href='https://fonts.googleapis.com/css2?family=Poppins:wght@400;700&display=swap' rel='stylesheet'>
<style>
  text {
    font-family: 'Poppins', sans-serif !important;
  }
</style>
"

# Save plot as a HTML file with custom CSS
html_file <- tempfile(fileext = ".html")
saveWidget(
  widget = sankey_plot2_restricted, # The custom CSS is injected into the saved HTML below
  file = html_file,
  selfcontained = TRUE
)

# Read the saved HTML and ensure the custom CSS is properly embedded
html_content <- readLines(html_file, warn = FALSE)
html_content <- gsub("</head>", paste0(custom_css, "</head>"), html_content)
writeLines(html_content, con = "sankey_plot2_restricted.html")

# remove the temporary file
file.remove(html_file)
[1] TRUE
Show code
# sankey_plot2_restricted
Show code
# ggsankey_data <- master_dataset_study2_restricted_with_risk_scores %>%
#   # mutate(SURVEY_STATUS_FULL = case_when(SURVEY_STATUS_FULL == "Not started" ~ NA,
#   #                                       TRUE ~ SURVEY_STATUS_FULL)) %>%
#   make_long(Quartile, SURVEY_STATUS_START, SURVEY_STATUS_FULL)
# 
# ggplot(ggsankey_data, aes(x = x,
#                           next_x = next_x,
#                           node = node,
#                           next_node = next_node,
#                           fill = factor(node),
#                           label = node)) +
#                geom_sankey(flow.alpha = 0.5,
#                            node.color = "black",
#                            show.legend = FALSE)+
#   geom_sankey_label(size = 3,
#                     color = "black",
#                     fill= "white",
#                     hjust = -0.2,
#                     family="Poppins")+
#   theme_bw()+
#     theme( text=element_text(family="Poppins")) +
#   # scale_fill_viridis_d() +
#     scale_fill_manual(values = c("#9a031e",  # Bright red
#   "#277da1",  # Deep blue
#   "#f9c74f",  # Muted yellow
#   "#e36414",  # Soft orange
#   "#90be6d",  # Soft green
#   "#577590",  # Slate gray
#   "#4d908e",  # Teal
#   "#9f5f80"   # Purple
# )) +
#   theme(axis.title = element_blank()
#                   , axis.text.y = element_blank()
#                   , axis.ticks = element_blank()
#                   , panel.grid = element_blank())

Logistic regression

Now let’s fit a logistic regression model to identify the strongest predictors of survey uptake.

SSVS

Before we can actually run the model, we need to identify a suitable suite of predictor variables to include using Stochastic Search Variable Selection (SSVS) and the related SSVS package (note that the actual SSVS process is commented out in rendered versions of this document as it takes a long time to run).

As we want to maximise the overall predictive ability of the model, we have opted to set the SSVS parameters so as to have a low threshold for including/selecting relevant variables. First, we have set the prior inclusion probability to 0.8, reflecting a belief that predictors are more likely to be included than not (a prior inclusion probability of 0.5 reflects the belief that each predictor has an equal probability of being included/excluded). Second, we have set the cut-off for marginal inclusion probability (MIP) values to 0.4. There is no standard recommendation for what this cut-off should be, but others have used 0.5.
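
To make these settings concrete, here is a minimal toy sketch of the same call pattern (not the study analysis); mtcars and the variable names below are stand-ins used purely for illustration:

Show code
# Toy illustration of the SSVS settings described above:
toy_ssvs <- ssvs(data = mtcars,
                 x = c("wt", "hp", "disp"), # Candidate predictors
                 y = "am",                  # Binary (0/1) outcome, analogous to STATUS below
                 continuous = FALSE,        # Outcome is categorical
                 inprob = 0.8,              # Prior inclusion probability, as set above
                 progress = FALSE)

# Predictors whose marginal inclusion probability (MIP) exceeds the 0.4 cut-off would be retained:
summary(toy_ssvs, interval = 0.95, ordered = TRUE)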

One thing we need to do before all of this is impute missing values for the IRSAD scores, as SSVS doesn’t seem to handle missing values well. To do this, we will impute the overall median IRSAD score as each missing value.

Show code
# First we need to convert logical and categorical variables to numeric for inclusion in the model:
master_dataset_study2_restricted_SSVS <- master_dataset_study2_restricted %>%
  # Impute missing IRSAD scores with the overall sample median:
  mutate(IRSAD_SCORE = ifelse(is.na(IRSAD_SCORE), median(IRSAD_SCORE, na.rm = TRUE), IRSAD_SCORE)) %>%
  # Convert all required variables to numeric:
  mutate(across(c(
    'AGE',
    'IRSAD_SCORE',
       'DAYS_REGISTERED',
       'DAYS_SINCE_LAST_BET',
       'PAS_6M_BET_MEAN_STAKE',
       'PAS_6M_BET_SD_STAKE',
       'PAS_6M_N_ACTIVITIES',
       'PAS_6M_BET_INTENSITY',
       'PAS_6M_MAX_DAY_STAKE',
       'PAS_6M_MAX_DAY_BETS',
       'PAS_6M_MEAN_DAILY_STAKE',
       'PAS_6M_SD_DAILY_STAKE',
       'PAS_6M_PERCENT_UNSOCIABLE_BETS',
       'PAS_6M_MONTHLY_BET_FREQ_DAYS',
       'PAS_6M_MONTHLY_SPEND',
       'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME',
       'PAS_6M_MEAN_DEPOSIT_AMOUNT',
       'PAS_6M_SD_DEPOSIT_AMOUNT',
       'PAS_6M_PERCENT_DECLINED_DEPOSITS',
       'PAS_6M_TOTAL_DEPOSIT_METHODS',
       'PAS_6M_PERCENT_CANCEL_WITHDRAWS',
       'PAS_6M_TOTAL_WITHDRAW_METHODS',
       'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT',
       'PAS_6M_MONTHLY_DEPOSIT_AMOUNT',
       'PAS_6M_DEPOSIT_INTENSITY',
                  'EVER_USED_DEPOSIT_LIMIT',
                  'EVER_USED_BLOCK_OUT', 
                  'EVER_USED_CHECK_IN', 
                  'EVER_USED_CURFEW', 
                  'EVER_USED_DEPOSIT_OPTION', 
                  'EVER_USED_MARKET_CONTROL', 
                  'EVER_USED_TIMOUT', 
                  'EVER_USED_CANCELLED_WITHDRAWS', 
                  'EVER_USED_SELF_EXCLUSION',
    'TOTAL_ACTIVE_TOOLS',
    'NUMBER_CPT_CHANGES'), 
                ~ as.numeric(.))) %>%
  mutate(GENDER = case_when(
    GENDER == 'Female' ~ 0,
    GENDER == 'Male' ~ 1,
    TRUE ~ 2)) %>%
  mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1,
    TRUE ~ 2)) %>%
    rename(
    'Age' = 'AGE',
    'Gender' =  'GENDER',
    'IRSAD score' = 'IRSAD_SCORE',
    'Days since registration' = 'DAYS_REGISTERED',
    'Days since last bet' = 'DAYS_SINCE_LAST_BET',
    'Mean stake*' = 'PAS_6M_BET_MEAN_STAKE',
    'Stake SD*' = 'PAS_6M_BET_SD_STAKE',
    'No. of activities*' = 'PAS_6M_N_ACTIVITIES',
    'Betting intensity*' = 'PAS_6M_BET_INTENSITY',
    'Max daily stake*' = 'PAS_6M_MAX_DAY_STAKE',
    'Max daily bets*' = 'PAS_6M_MAX_DAY_BETS',
    'Mean daily stake*' = 'PAS_6M_MEAN_DAILY_STAKE',
    'Daily stake SD*' = 'PAS_6M_SD_DAILY_STAKE',
    '% unsociable bets*' = 'PAS_6M_PERCENT_UNSOCIABLE_BETS',
    'Monthly bet freq (days)*' = 'PAS_6M_MONTHLY_BET_FREQ_DAYS',
    'Monthly spend*' = 'PAS_6M_MONTHLY_SPEND',
    'Mean monthly net outcome*' = 'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME',
    'Mean deposit amount*' = 'PAS_6M_MEAN_DEPOSIT_AMOUNT',
    'Deposit amount SD*' = 'PAS_6M_SD_DEPOSIT_AMOUNT',
    '% declined deposits*' = 'PAS_6M_PERCENT_DECLINED_DEPOSITS',
    'No. deposit methods*' = 'PAS_6M_TOTAL_DEPOSIT_METHODS',
    '% cancel withdrawals*' = 'PAS_6M_PERCENT_CANCEL_WITHDRAWS',
    'No. withdrawal methods*' = 'PAS_6M_TOTAL_WITHDRAW_METHODS',
    'Max daily deposit amount*' = 'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT',
    'Monthly deposit amount*' = 'PAS_6M_MONTHLY_DEPOSIT_AMOUNT',
    'Deposit intensity*' = 'PAS_6M_DEPOSIT_INTENSITY',
    'Ever used deposit limit' = 'EVER_USED_DEPOSIT_LIMIT',
    'Ever used block out' = 'EVER_USED_BLOCK_OUT', 
    'Ever used check in' = 'EVER_USED_CHECK_IN', 
    'Ever used curfew' = 'EVER_USED_CURFEW', 
    'Ever used deposit option' = 'EVER_USED_DEPOSIT_OPTION', 
    'Ever used market control' = 'EVER_USED_MARKET_CONTROL', 
    'Ever used timeout' = 'EVER_USED_TIMOUT', 
    'Ever used cancelled withdrawals' = 'EVER_USED_CANCELLED_WITHDRAWS', 
    'Ever used self-exclusion' = 'EVER_USED_SELF_EXCLUSION',
    'Total active tools' = 'TOTAL_ACTIVE_TOOLS',
    'Number of changes to CPTs' = 'NUMBER_CPT_CHANGES'
  )


outcome <- 'STATUS'

predictors <- c(
       # demographic variables:
       'Age',
       'Gender',
       'IRSAD score',
       # Account/wagering variables
       'Days since registration',
       'Days since last bet',
       # All past six month wagers and transactions variables (As included in the composite score):
      'Mean stake*',
    'Stake SD*',
    'No. of activities*',
    'Betting intensity*',
    'Max daily stake*',
    'Max daily bets*',
    'Mean daily stake*',
    'Daily stake SD*',
    '% unsociable bets*',
    'Monthly bet freq (days)*',
    'Monthly spend*',
    'Mean monthly net outcome*',
    'Mean deposit amount*',
    'Deposit amount SD*',
    '% declined deposits*',
    'No. deposit methods*',
    '% cancel withdrawals*',
    'No. withdrawal methods*',
    'Max daily deposit amount*',
    'Monthly deposit amount*',
    'Deposit intensity*',
       # current and historical use of consumer protection tools:
       'Ever used deposit limit',
       'Ever used block out', 
       'Ever used check in', 
       'Ever used curfew', 
       'Ever used deposit option', 
       'Ever used market control', 
       'Ever used timeout', 
       'Ever used cancelled withdrawals', 
       'Ever used self-exclusion',
       'Total active tools',
       'Number of changes to CPTs'
        )

# str(master_dataset_study2_restricted_SSVS)
# 
# master_dataset_study2_restricted_SSVS %>%
#   select(`IRSAD score`)
#
# SSSV_results_Study2_Restricted <- ssvs(data = master_dataset_study2_restricted_SSVS,
#                             x = predictors, y = outcome, # Variables, as prespecified above
#                             inprob = 0.8, #  prior inclusion probability:  0.5 reflects the belief that each predictor has an equal probability of being included/excluded
#                             continuous = FALSE, # Categorical outcome variable
#                             progress = TRUE # Provide update on progress of processing
#                             )
# 
# # Save file to avoid having to re-run every time
# saveRDS(SSSV_results_Study2_Restricted, "SSSV_results_Study2_Restricted")

# Load in:
SSSV_results_Study2_Restricted <- read_rds("SSSV_results_Study2_Restricted")

Now we have the outcomes from the variable selection process, let’s present them as a table and visually in a plot:

Show code
summary(SSSV_results_Study2_Restricted, 
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  as.data.frame() %>%
gt()%>%
  tab_header(md("**Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2 Restricted analyses**"), 
             subtitle = NULL)
Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2 Restricted analyses
Variable MIP Avg Beta Avg Nonzero Beta Lower CI (95%) Upper CI (95%)
Age 1.0000 0.2402 0.2402 0.1812 0.3081
Days since last bet 1.0000 -0.2822 -0.2822 -0.3595 -0.2054
No. withdrawal methods* 1.0000 0.1961 0.1961 0.1260 0.2685
Stake SD* 0.9869 0.5778 0.5854 0.3084 0.8716
Mean stake* 0.9861 -0.6759 -0.6854 -1.0373 -0.3545
No. deposit methods* 0.8692 0.1153 0.1326 0.0000 0.1861
Mean deposit amount* 0.6957 -0.2414 -0.3470 -0.5180 0.0000
Deposit intensity* 0.5405 0.0496 0.0917 0.0000 0.1261
No. of activities* 0.2973 0.0284 0.0955 0.0000 0.1227
Monthly bet freq (days)* 0.2119 0.0220 0.1036 0.0000 0.1280
Deposit amount SD* 0.1555 -0.0369 -0.2374 -0.2868 0.0000
% declined deposits* 0.0631 0.0050 0.0793 0.0000 0.0581
Max daily stake* 0.0260 -0.0098 -0.3770 0.0000 0.0000
Gender 0.0196 0.0011 0.0537 0.0000 0.0000
Betting intensity* 0.0165 0.0007 0.0448 0.0000 0.0000
Ever used cancelled withdrawals 0.0164 0.0008 0.0489 0.0000 0.0000
Daily stake SD* 0.0163 0.0056 0.3427 0.0000 0.0000
Mean monthly net outcome* 0.0131 0.0010 0.0743 0.0000 0.0000
Ever used curfew 0.0124 -0.0007 -0.0543 0.0000 0.0000
Mean daily stake* 0.0107 -0.0004 -0.0399 0.0000 0.0000
Ever used self-exclusion 0.0084 0.0003 0.0334 0.0000 0.0000
% cancel withdrawals* 0.0078 0.0003 0.0359 0.0000 0.0000
Total active tools 0.0072 0.0003 0.0401 0.0000 0.0000
Ever used market control 0.0071 0.0003 0.0356 0.0000 0.0000
Monthly deposit amount* 0.0067 0.0000 0.0011 0.0000 0.0000
Max daily deposit amount* 0.0061 -0.0005 -0.0795 0.0000 0.0000
% unsociable bets* 0.0057 0.0002 0.0383 0.0000 0.0000
Ever used timeout 0.0052 0.0001 0.0220 0.0000 0.0000
Max daily bets* 0.0051 0.0002 0.0368 0.0000 0.0000
Monthly spend* 0.0047 -0.0001 -0.0231 0.0000 0.0000
Ever used block out 0.0041 0.0001 0.0218 0.0000 0.0000
Ever used deposit option 0.0040 0.0000 0.0003 0.0000 0.0000
Days since registration 0.0037 0.0001 0.0320 0.0000 0.0000
IRSAD score 0.0035 0.0001 0.0387 0.0000 0.0000
Ever used check in 0.0028 0.0000 -0.0104 0.0000 0.0000
Ever used deposit limit 0.0019 0.0000 0.0123 0.0000 0.0000
Number of changes to CPTs 0.0015 0.0000 -0.0184 0.0000 0.0000
Show code
plot(SSSV_results_Study2_Restricted,
     threshold = 0.4,
     title = "") +
  theme_classic() +
   scale_y_continuous(breaks = c(0, 0.25, 0.50, 0.75, 1), labels = c(0, 0.25, 0.50, 0.75, 1)) +
  theme(text=element_text(family="Poppins"),
      axis.text.x = element_text(angle = 45, hjust = 1)
      # axis.text.y = element_text(angle = 0, hjust = .5)
      ) +
  labs(title = "",
       x = "",
       y = "Inclusion Probability") +
  annotate("text", x = 18, y = 0.45, label = "MIP threshold", size =3, color = "black", fontface = "plain") +
  theme(legend.position = "none") 

Show code
ggsave("Figures/SSVS results plot Study 2 Restricted analyses.pdf",
       width = 8.5,
       height = 3.5,
       units = c("in")
)

Run model

Now let’s run the logistic regression model with our selection of predictor variables identified via SSVS:

Show code
# SSSV_results_Study2_Restricted$ssvs %>% as_tibble()
# Set the data for analysis:
Study2_logistic_reg_data<- master_dataset_study2_restricted %>%
 mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1))

# Create survey design object used for weighting:
survey_design_regression <- svydesign(
  id = ~1,
  data = Study2_logistic_reg_data,
  weights = ~ALL_SAMPLE_WEIGHTS
)
  
# Check variables for inclusion: 
summary(SSSV_results_Study2_Restricted, 
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  filter(MIP>0.4)
 Variable                MIP    Avg Beta Avg Nonzero Beta Lower CI (95%) Upper CI (95%)
 Age                     1.0000  0.2402   0.2402           0.1812         0.3081
 Days since last bet     1.0000 -0.2822  -0.2822          -0.3595        -0.2054
 No. withdrawal methods* 1.0000  0.1961   0.1961           0.1260         0.2685
 Stake SD*               0.9869  0.5778   0.5854           0.3084         0.8716
 Mean stake*             0.9861 -0.6759  -0.6854          -1.0373        -0.3545
 No. deposit methods*    0.8692  0.1153   0.1326           0.0000         0.1861
 Mean deposit amount*    0.6957 -0.2414  -0.3470          -0.5180         0.0000
 Deposit intensity*      0.5405  0.0496   0.0917           0.0000         0.1261
Show code
# Run the model with weighting applied:
study_2_restricted_analysis_weighted_logistic_regression <- svyglm(STATUS ~ AGE + 
                DAYS_SINCE_LAST_BET +
                PAS_6M_TOTAL_WITHDRAW_METHODS +
                PAS_6M_BET_SD_STAKE + # New
                PAS_6M_BET_MEAN_STAKE + # New
                PAS_6M_TOTAL_DEPOSIT_METHODS + 
                PAS_6M_MEAN_DEPOSIT_AMOUNT +
                PAS_6M_DEPOSIT_INTENSITY, # New
      design = survey_design_regression, family = quasibinomial())

summary(study_2_restricted_analysis_weighted_logistic_regression) 

Call:
svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS + 
    PAS_6M_BET_SD_STAKE + PAS_6M_BET_MEAN_STAKE + PAS_6M_TOTAL_DEPOSIT_METHODS + 
    PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_DEPOSIT_INTENSITY, design = survey_design_regression, 
    family = quasibinomial())

Survey design:
svydesign(id = ~1, data = Study2_logistic_reg_data, weights = ~ALL_SAMPLE_WEIGHTS)

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   -3.1011096  0.1404899 -22.074  < 2e-16 ***
AGE                            0.0188877  0.0021737   8.689  < 2e-16 ***
DAYS_SINCE_LAST_BET           -0.0359699  0.0049748  -7.230 5.09e-13 ***
PAS_6M_TOTAL_WITHDRAW_METHODS  0.3543070  0.0612079   5.789 7.26e-09 ***
PAS_6M_BET_SD_STAKE            0.0021273  0.0004674   4.551 5.39e-06 ***
PAS_6M_BET_MEAN_STAKE         -0.0027559  0.0007975  -3.456 0.000551 ***
PAS_6M_TOTAL_DEPOSIT_METHODS   0.1385985  0.0400393   3.462 0.000539 ***
PAS_6M_MEAN_DEPOSIT_AMOUNT    -0.0008037  0.0003071  -2.617 0.008886 ** 
PAS_6M_DEPOSIT_INTENSITY       0.0231312  0.0178049   1.299 0.193916    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.014813)

Number of Fisher Scoring iterations: 6

Perform checks of model performance:

Show code
model2_pilot_performance_metrics_restricted_analysis <- check_model(study_2_restricted_analysis_weighted_logistic_regression)

# ggsave() with no plot argument saves the most recently displayed plot:
ggsave("Figures/model2_pilot_performance_metrics_restricted_analysis.pdf",
       width = 7,
       height = 9,
       units = c("in")
)

There is some minor collinearity here, but not enough to be problematic in this context: we aren’t using this model to make predictions on separate data or subsamples.
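
The collinearity judgement above is based on the check_model() panels; the variance inflation factors can also be inspected directly. Here is a minimal sketch using car::vif() (the car package was loaded for exactly this purpose), assuming it accepts the svyglm object as it does standard glm objects:

Show code
# VIFs for the weighted logistic model; values below ~5 are conventionally treated as low concern:
car::vif(study_2_restricted_analysis_weighted_logistic_regression)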

Compute confidence intervals for the coefficient estimates:

Show code
summarisedCIs_study_2_restricted_analysis_weighted_logistic_regression <- cbind(coef(study_2_restricted_analysis_weighted_logistic_regression),
                                                                                confint(study_2_restricted_analysis_weighted_logistic_regression))

summarisedCIs_study_2_restricted_analysis_weighted_logistic_regression %>% as.data.frame() %>%
  rownames_to_column() %>%
  rename(Coefficient = V1) %>%
  mutate(across(c(Coefficient, `2.5 %`, `97.5 %`), ~round(., 3))) %>%
  arrange(desc(`2.5 %`)) %>%
   gt() %>%
  tab_header(md("**Regression coefficients with 95% CIs: Study 2**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Regression coefficients with 95% CIs: Study 2
Coefficient Lower CI: 2.5% Upper CI: 97.5%
PAS_6M_TOTAL_WITHDRAW_METHODS 0.354 0.234 0.474
PAS_6M_TOTAL_DEPOSIT_METHODS 0.139 0.060 0.217
AGE 0.019 0.015 0.023
PAS_6M_BET_SD_STAKE 0.002 0.001 0.003
PAS_6M_MEAN_DEPOSIT_AMOUNT -0.001 -0.001 0.000
PAS_6M_BET_MEAN_STAKE -0.003 -0.004 -0.001
PAS_6M_DEPOSIT_INTENSITY 0.023 -0.012 0.058
DAYS_SINCE_LAST_BET -0.036 -0.046 -0.026
(Intercept) -3.101 -3.376 -2.826

Exponentiate coefficients and interpret them as odds-ratios:

Show code
summarisedORs_study_2_restricted_analysis_weighted_logistic_regression <- exp(cbind(OR = coef(study_2_restricted_analysis_weighted_logistic_regression),
                                                                                    confint(study_2_restricted_analysis_weighted_logistic_regression)))

summarisedORs_study_2_restricted_analysis_weighted_logistic_regression %>% as.data.frame() %>%
  rownames_to_column() %>%
  mutate(across(c(OR, `2.5 %`, `97.5 %`), ~round(., 4))) %>%
  arrange(desc(OR)) %>%
   gt() %>%
  tab_header(md("**Odds ratios with 95% CIs: Study 2**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Odds ratios with 95% CIs: Study 2
OR Lower CI: 2.5% Upper CI: 97.5%
PAS_6M_TOTAL_WITHDRAW_METHODS 1.4252 1.2641 1.6069
PAS_6M_TOTAL_DEPOSIT_METHODS 1.1487 1.0620 1.2424
PAS_6M_DEPOSIT_INTENSITY 1.0234 0.9883 1.0597
AGE 1.0191 1.0147 1.0234
PAS_6M_BET_SD_STAKE 1.0021 1.0012 1.0030
PAS_6M_MEAN_DEPOSIT_AMOUNT 0.9992 0.9986 0.9998
PAS_6M_BET_MEAN_STAKE 0.9972 0.9957 0.9988
DAYS_SINCE_LAST_BET 0.9647 0.9553 0.9741
(Intercept) 0.0450 0.0342 0.0593

Calculate Nagelkerke’s R squared for the model:

Show code
NagelkerkeR2_S2<- NagelkerkeR2(study_2_restricted_analysis_weighted_logistic_regression) %>% 
  as.data.frame() %>%
  print()
      N         R2
1 12799 0.05661822
Show code
# NOTE: R2 is calculate as a proportion here and so in the paper we have multiplied it by 100 to form a percentage value

The model accounts for 5.66% of the variance in uptake/response.
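
As the comment in the code above notes, this percentage is simply the proportion returned by NagelkerkeR2() scaled by 100:

Show code
# Convert the R2 proportion to the percentage reported in the paper:
round(NagelkerkeR2_S2$R2 * 100, 2) # = 5.66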

Calculate overall p-value for model by comparing it with a null model:

Show code
anova(study_2_restricted_analysis_weighted_logistic_regression, 
      update(study_2_restricted_analysis_weighted_logistic_regression, ~1), # update here produces null model for comparison 
      test="Chisq")
Working (Rao-Scott) LRT for AGE DAYS_SINCE_LAST_BET PAS_6M_TOTAL_WITHDRAW_METHODS PAS_6M_BET_SD_STAKE PAS_6M_BET_MEAN_STAKE PAS_6M_TOTAL_DEPOSIT_METHODS PAS_6M_MEAN_DEPOSIT_AMOUNT PAS_6M_DEPOSIT_INTENSITY
 in svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS + 
    PAS_6M_BET_SD_STAKE + PAS_6M_BET_MEAN_STAKE + PAS_6M_TOTAL_DEPOSIT_METHODS + 
    PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_DEPOSIT_INTENSITY, design = survey_design_regression, 
    family = quasibinomial())
Working 2logLR =  355.9009 p= < 2.22e-16 
(scale factors:  1.4 1.3 1.3 1.2 0.93 0.87 0.62 0.4 )
Show code
# Set seed so that labels are consistent in the plot:
set.seed(200)

ggstatsplot::ggcoefstats(study_2_restricted_analysis_weighted_logistic_regression,
                         digits = 4,
                         stats.label.color = "#00798c",
                         sort = "ascending",
                         vline.args = list(size = 1, color = "black"),
                         # ggstatsplot.layer = FALSE,
                          exclude.intercept = TRUE, # hide the intercept
                         stats.label.args = list(size = 3, direction = "y")) +
  scale_x_continuous(name = "Regression coefficient", limits = c(-0.1, 0.5), breaks = c(-.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5)) +
  scale_y_discrete(name = "", labels = c("Days since\n last bet",
                                         "Mean stake",
                                         "Mean deposit\n amount",
                                         "SD of stake",
                                         "Age",
                                         "Deposit intensity",
                                         "No. deposit\n methods",
                                         "No. withdrawal\n methods")) +
  theme(panel.grid.minor = element_blank(),
        axis.line = element_blank()) +
  theme(axis.text = element_text(color = "black", size = 9, face = "plain"))+
  theme_light(base_family = "Poppins") +
  # theme(plot.margin = unit(c(.1,.1,.3,-3), "cm")) +
  theme(axis.text.y = element_text(color="black", size=12, face="plain", family = "Poppins")) +
  theme(axis.title.x = element_text(color="black", size=12, face ="plain", vjust = 0, family = "Poppins")) 

Show code
# Set height to 7 and width to 5.5 in portrait mode
ggsave("Figures/Study 2 restricted analyses log regression plot.pdf",
       width = 5,
       height = 7,
       units = c("in")
)

Reproducibility of this analysis

Package stability

To ensure that this script always runs with the same package versions that were used during development, all packages are loaded using the groundhog.library() function from the groundhog package.

Rendering & publishing

This document is rendered in HTML format and published online using Quarto Pub by using the following command in the terminal:

  • quarto publish quarto-pub Survey_uptake_main_analyses.qmd

Session details

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.1 (2024-06-14 ucrt)
 os       Windows 11 x64 (build 26100)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_Australia.utf8
 ctype    English_Australia.utf8
 tz       Australia/Sydney
 date     2025-02-26
 pandoc   3.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version date (UTC) lib source
 car          * 3.1-2   2023-03-30 [1] CRAN (R 4.4.0)
 carData      * 3.0-5   2022-01-06 [1] CRAN (R 4.4.1)
 dplyr        * 1.1.4   2023-11-17 [1] CRAN (R 4.4.1)
 effsize      * 0.8.1   2020-10-05 [1] CRAN (R 4.4.2)
 fmsb         * 0.7.6   2024-01-19 [1] CRAN (R 4.4.2)
 forcats      * 1.0.0   2023-01-29 [1] CRAN (R 4.4.1)
 formattable  * 0.2.1   2021-01-07 [1] CRAN (R 4.4.1)
 ggplot2      * 3.5.1   2024-04-23 [1] CRAN (R 4.4.1)
 ggrain       * 0.0.4   2024-01-23 [1] CRAN (R 4.4.1)
 ggstatsplot  * 0.12.3  2024-04-06 [1] CRAN (R 4.4.0)
 ggstream     * 0.1.0   2021-05-06 [1] CRAN (R 4.4.2)
 ggtext       * 0.1.2   2022-09-16 [1] CRAN (R 4.4.2)
 groundhog    * 3.2.1   2024-09-29 [1] CRAN (R 4.4.1)
 gt           * 0.10.1  2024-01-17 [1] CRAN (R 4.4.0)
 gtExtras     * 0.5.0   2023-09-15 [1] CRAN (R 4.4.1)
 gtsummary    * 1.7.2   2023-07-15 [1] CRAN (R 4.4.0)
 haven        * 2.5.4   2023-11-30 [1] CRAN (R 4.4.1)
 Hmisc        * 5.1-3   2024-05-28 [1] CRAN (R 4.4.0)
 htmlwidgets  * 1.6.4   2023-12-06 [1] CRAN (R 4.4.1)
 kableExtra   * 1.4.0   2024-01-24 [1] CRAN (R 4.4.1)
 lubridate    * 1.9.3   2023-09-27 [1] CRAN (R 4.4.1)
 Matrix       * 1.7-0   2024-03-22 [1] CRAN (R 4.4.0)
 MBESS        * 4.9.3   2023-10-26 [1] CRAN (R 4.4.1)
 networkD3    * 0.4     2017-03-18 [1] CRAN (R 4.4.1)
 patchwork    * 1.2.0   2024-01-08 [1] CRAN (R 4.4.0)
 performance  * 0.11.0  2024-03-22 [1] CRAN (R 4.4.0)
 plotly       * 4.10.4  2024-01-13 [1] CRAN (R 4.4.1)
 purrr        * 1.0.2   2023-08-10 [1] CRAN (R 4.4.1)
 RColorBrewer * 1.1-3   2022-04-03 [1] CRAN (R 4.4.0)
 readr        * 2.1.5   2024-01-10 [1] CRAN (R 4.4.1)
 readxl       * 1.4.3   2023-07-06 [1] CRAN (R 4.4.1)
 rstantools   * 2.4.0   2024-01-31 [1] CRAN (R 4.4.2)
 scales       * 1.3.0   2023-11-28 [1] CRAN (R 4.4.1)
 see          * 0.8.4   2024-04-29 [1] CRAN (R 4.4.0)
 sessioninfo  * 1.2.2   2021-12-06 [1] CRAN (R 4.4.1)
 showtext     * 0.9-7   2024-03-02 [1] CRAN (R 4.4.1)
 showtextdb   * 3.0     2020-06-04 [1] CRAN (R 4.4.1)
 SSVS         * 2.0.0   2022-05-29 [1] CRAN (R 4.4.2)
 stringr      * 1.5.1   2023-11-14 [1] CRAN (R 4.4.1)
 survey       * 4.4-2   2024-03-20 [1] CRAN (R 4.4.1)
 survival     * 3.6-4   2024-04-24 [1] CRAN (R 4.4.0)
 sysfonts     * 0.8.9   2024-03-02 [1] CRAN (R 4.4.1)
 tibble       * 3.2.1   2023-03-20 [1] CRAN (R 4.4.1)
 tidyr        * 1.3.1   2024-01-24 [1] CRAN (R 4.4.1)
 tidyverse    * 2.0.0   2023-02-22 [1] CRAN (R 4.4.1)
 ufs          * 0.5.12  2024-03-09 [1] CRAN (R 4.4.2)
 viridis      * 0.6.5   2024-01-29 [1] CRAN (R 4.4.1)
 viridisLite  * 0.4.2   2023-05-02 [1] CRAN (R 4.4.1)
 wCorr        * 1.9.8   2023-08-19 [1] CRAN (R 4.4.2)
 weights      * 1.0.4   2021-06-10 [1] CRAN (R 4.4.1)

 [1] C:/Users/Robert Heirene/AppData/Local/R/win-library/4.4
 [2] C:/Program Files/R/R-4.4.1/library

──────────────────────────────────────────────────────────────────────────────
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default


locale:
[1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
[3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.utf8    

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] sessioninfo_1.2.2  RColorBrewer_1.1-3 showtext_0.9-7     showtextdb_3.0    
 [5] sysfonts_0.8.9     patchwork_1.2.0    ggtext_0.1.2       scales_1.3.0      
 [9] viridis_0.6.5      viridisLite_0.4.2  networkD3_0.4      ggrain_0.0.4      
[13] ggstream_0.1.0     rstantools_2.4.0   ggstatsplot_0.12.3 htmlwidgets_1.6.4 
[17] plotly_4.10.4      fmsb_0.7.6         see_0.8.4          performance_0.11.0
[21] car_3.1-2          carData_3.0-5      MBESS_4.9.3        effsize_0.8.1     
[25] SSVS_2.0.0         gtsummary_1.7.2    gtExtras_0.5.0     gt_0.10.1         
[29] formattable_0.2.1  kableExtra_1.4.0   ufs_0.5.12         wCorr_1.9.8       
[33] weights_1.0.4      Hmisc_5.1-3        survey_4.4-2       survival_3.6-4    
[37] Matrix_1.7-0       haven_2.5.4        readxl_1.4.3       lubridate_1.9.3   
[41] forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2       
[45] readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1     
[49] tidyverse_2.0.0    groundhog_3.2.1   

loaded via a namespace (and not attached):
  [1] splines_4.4.1          ggpp_0.5.7             cellranger_1.1.0      
  [4] datawizard_0.10.0      rpart_4.1.23           lifecycle_1.0.4       
  [7] lattice_0.22-6         MASS_7.3-60.2          insight_0.19.11       
 [10] backports_1.5.0        magrittr_2.0.3         sass_0.4.9            
 [13] rmarkdown_2.27         yaml_2.3.8             DBI_1.2.2             
 [16] minqa_1.2.7            DHARMa_0.4.6           abind_1.4-5           
 [19] nnet_7.3-19            ggrepel_0.9.5          correlation_0.8.4     
 [22] gdata_3.0.0            commonmark_1.9.1       svglite_2.1.3         
 [25] codetools_0.2-20       xml2_1.3.6             tidyselect_1.2.1      
 [28] shape_1.4.6.1          farver_2.1.2           lme4_1.1-35.3         
 [31] effectsize_0.8.8       smd_0.7.0              base64enc_0.1-3       
 [34] broom.helpers_1.15.0   jsonlite_1.8.8         gghalves_0.1.4        
 [37] mitml_0.4-5            Formula_1.2-5          iterators_1.0.14      
 [40] emmeans_1.10.2         systemfonts_1.1.0      foreach_1.5.2         
 [43] tools_4.4.1            ragg_1.3.2             Rcpp_1.0.12           
 [46] glue_1.7.0             mnormt_2.1.1           gridExtra_2.3         
 [49] pan_1.9                mgcv_1.9-1             xfun_0.44             
 [52] withr_3.0.0            fastmap_1.2.0          mitools_2.4           
 [55] boot_1.3-30            fansi_1.0.6            digest_0.6.35         
 [58] timechange_0.3.0       R6_2.5.1               estimability_1.5.1    
 [61] mice_3.16.0            textshaping_0.4.0      colorspace_2.1-0      
 [64] gtools_3.9.5           markdown_1.12          utf8_1.2.4            
 [67] generics_0.1.3         data.table_1.15.4      httr_1.4.7            
 [70] parameters_0.21.7      pkgconfig_2.0.3        gtable_0.3.5          
 [73] statsExpressions_1.5.4 htmltools_0.5.8.1      knitr_1.46            
 [76] rstudioapi_0.16.0      tzdb_0.4.0             reshape2_1.4.4        
 [79] coda_0.19-4.1          checkmate_2.3.1        nlme_3.1-164          
 [82] curl_5.2.1             nloptr_2.0.3           ggcorrplot_0.1.4.1    
 [85] parallel_4.4.1         foreign_0.8-86         pillar_1.9.0          
 [88] vctrs_0.6.5            jomo_2.7-6             xtable_1.8-4          
 [91] cluster_2.1.6          htmlTable_2.4.2        paletteer_1.6.0       
 [94] evaluate_0.23          zeallot_0.1.0          mvtnorm_1.2-5         
 [97] cli_3.6.2              compiler_4.4.1         rlang_1.1.3           
[100] labeling_0.4.3         rematch2_2.1.2         plyr_1.8.9            
[103] stringi_1.8.4          pander_0.6.5           munsell_0.5.1         
[106] lazyeval_0.2.2         glmnet_4.1-8           bayestestR_0.13.2     
[109] hms_1.1.3              gridtext_0.1.5         fontawesome_0.5.2     
[112] igraph_2.0.3           broom_1.0.6            RcppParallel_5.1.7    
[115] polynom_1.4-1         

END