Analysis document: Non-response bias in gambling surveys

Author

Anonymised for review

Published

February 26, 2025

Preamble

This document outlines the analysis code and associated outcomes for our study on response bias in gambling research studies. The analyses reported are for two studies, both of which are part of larger projects. The analyses included here cover all of those reported in our manuscript (plus some additional analyses not reported in the manuscript); the datasets analysed were pre-processed in separate scripts, where aggregate variables were produced to summarise gambling behaviour.

Study 1

This study is part of the overall project with the title “Understanding electronic gaming machine players”. The OSF page for this study can be found here (it is private as of the latest date this file was published).

Study 2

This study is part of the overall project with the title “Safer gambling online: Identifying and supporting at-risk consumers”. The OSF page for this study can be found here and for the overall project here (both are private as of the latest date this file was committed).

Load required packages

Set up analysis environment

Install and load the groundhog package to ensure consistency of the package versions used here:

Show set-up code
# install.packages("groundhog") # Install
library('groundhog')

# List desired packages:
packages <- c('tidyverse', # Clean, organise, and visualise data
              "readxl", #  load Excel files
              "haven", # Read stata files
              "survey", # weight study 2 data
              "weights", # Weighted statistics for study two
              "wCorr", # Weighted correlations (https://cran.r-project.org/web/packages/wCorr/wCorr.pdf)
              "ufs", # Cramver's V Effect size
              'kableExtra', # Make tables
              'formattable', #  Add visualisations to tables
              'gt', # Alternative table options
              'gtExtras', # Add colours to gt tables
              'gtsummary', # Create summary tables
              'SSVS', # Bayesian variable selection
              'effsize', # Cohen's d values
              'MBESS', # 95% CIs for above
              'car', # Compute VIFs for regression
              'performance', # Assess model performance
              'see', # Supports above package
              'fmsb', # Compute NagelkerkeR2
              'plotly', # Add interactive elements to figures
              'htmlwidgets', # Make plotly plots HTML format
              'ggstatsplot', # Produce plots with combined statistical output
              'rstantools', # required for above package
              'ggstream', # smooooth area charts
              'ggrain', # Raincloud plots
              'networkD3', # Sankey plots
              'viridis', # plot colours
              'scales', # Wrap labels in plots
              'ggtext', # also helps with plot labels
              "patchwork", # Assemble/combine plots
              'sysfonts', # Special fonts for figures
              'showtext', # Special fonts for figures
              'RColorBrewer', # Colours for figures
              'sessioninfo' # Detailed session info for reproducibility
              ) 

# Load desired package with versions specific to project start date:
# groundhog.library(packages, "2024-05-01", force.install = TRUE)
groundhog.library(packages, "2024-05-29")

# SANKEY:
# install.packages("remotes")
# remotes::install_github("davidsjoberg/ggsankey")
# library(ggsankey)


# SSVS:
# install.packages("remotes")
# remotes::install_github("sabainter/SSVS")
# library('SSVS')

Set up a standard theme for plots/data visualisations:

Show code
# Load new font for figures/graphs
font_add_google("Poppins")
showtext_auto()
# showtext_opts(dpi = 300)
showtext_auto(enable = TRUE)

# Save new theme for figures/graphs. This will determine the layout, presentation, font type and font size used in all data visualisations presented here:
plot_theme <- theme_classic() +
  theme(
    text = element_text(family = "Poppins"),
    plot.title = element_text(hjust = 0.5, size = 26),
    plot.subtitle = element_text(hjust = 0.5, size = 22),
    axis.text = element_text(size = 18),
    axis.title = element_text(size = 22),
    plot.caption = element_text(size = 16),
    legend.title = element_text(size = 18),
    legend.text = element_text(size = 16)
  )

plot_theme_no_sizing <- theme_classic() +
  theme(
    text = element_text(family = "Poppins"))

Study 1 Analysis

This study pertains to the analysis of a survey of EGM gamblers from a single venue in Australia.

Load data

Load in the master dataset containing all customer details (this step is hidden to protect privacy & anonymity).
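For transparency, a minimal sketch of what the hidden load step might look like, assuming an Excel export (the file path is a hypothetical placeholder; the data could equally arrive as a Stata file read via haven::read_dta()):

# Hypothetical sketch only: the real file path and identifiers are withheld
master_dataset_initial_study1 <- readxl::read_xlsx("data/study1_customer_master.xlsx")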

Data loaded.

People were sent the survey via different methods; let’s have a look at the number who completed surveys by method:

Show code
master_dataset_initial_study1 %>%
  dplyr::count(mode)
   mode     n
1       11122
2  link   215
3 paper     2
4    qr    50
Show code
# Isolate non-link participants for later:
non_link_customers <- master_dataset_initial_study1 %>%
  filter(mode == "paper" | mode == "qr")  %>%
as_tibble() %>%
  select(CLIENT_ID,
         mode)

Now, let’s process the data for analysis by doing the following:

  • Remove anybody who wasn’t invited to take part in the survey via email or telephone (this means removing some people who completed via other methods, but not all – some people invited via email took part via hardcopy, for example).
  • Remove anybody who didn’t pass the screening question, which asked whether they had gambled with the venue in the past 30 days.

This code contains customer ID numbers and so is hidden.

To summarise the findings of the hidden code in text: there were 4,301 individuals invited to take part in the survey. However, when the account data were provided by the organisation, 33 of these customers had no account data available for the six months preceding survey invitation and therefore should not have been invited. Five individuals from the remaining sample started the survey but stated they had not gambled in the past 30 days; these were removed from consideration as they did not meet our inclusion criteria for invitation. All analyses and survey uptake figures relate to the remaining 4,260 customers, unless “unfiltered” is specified.
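Although the actual code is hidden, here is a minimal sketch of the filtering logic. The has_account_data flag and the join key are hypothetical placeholders; master_dataset_initial_study1, invition_data, CLIENT_ID, and screen are the names used elsewhere in this document:

# Hypothetical sketch of the hidden filtering step:
master_dataset_study1 <- master_dataset_initial_study1 %>%
  semi_join(invition_data, by = "CLIENT_ID") %>% # keep only invited customers
  filter(has_account_data) %>%                   # drop the 33 with no account data
  filter(is.na(screen) | screen != "No")         # drop those who failed the 30-day screen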

Uptake figures

Let’s first look at how survey uptake varied over the window the survey was open:

Show code
master_dataset_study1 %>%
  filter(SURVEY_STATUS_START != "Not started" 
         # & is.na(START_DATE)
         ) %>%
  # select(1:10, SURVEY_STATUS_START,START_DATE) %>%
  # arrange(desc(START_DATE)) %>%
  # print(n=250)
  group_by(START_DATE) %>%
  summarise(Freq = sum(n())) %>%
  ungroup() %>%
  ggplot(aes(x = START_DATE, y = Freq)) +
    plot_theme +
  geom_bar(stat = "identity", fill = "#00798c") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "",
       x = "Date",
       y = "Number of clients") 

Show code
  # scale_y_continuous(breaks = seq(0, 1500, by = 500), limits = c(0,1500))

Now compute some figures to tell us the rates of uptake for the survey:

Original, unfiltered dataset (it’s unfiltered in the sense that it includes everyone invited, but not those who were ineligible to be invited because they don’t have any account data); so the 5 people who didn’t pass the eligibility check are still included:

Show code
master_dataset_initial_study1 %>%
    as_tibble() %>%
  semi_join(invition_data) %>%
  dplyr::count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Study 1 (unfiltered outcomes)", subtitle = NULL)
Counts for starting the survey: Study 1 (unfiltered outcomes)
Survey status n Percent
Not started 4039 94.634489
Started 229 5.365511
Show code
master_dataset_initial_study1 %>%
    as_tibble() %>%
  semi_join(invition_data) %>%
   dplyr::count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Study 1 (unfiltered outcomes)", subtitle = NULL)
Counts for completing the survey up to the PGSI: Study 1 (unfiltered outcomes)
Survey status n Percent
Complete 185 4.334583
Incomplete 44 1.030928
Not started 4039 94.634489
Show code
master_dataset_initial_study1 %>%
    as_tibble() %>%
  semi_join(invition_data) %>%
  dplyr::count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Study 1 (unfiltered outcomes)", subtitle = NULL)
Counts for completing the survey in full: Study 1 (unfiltered outcomes)
Survey status n Percent
Complete 160 3.748828
Incomplete 69 1.616682
Not started 4039 94.634489

Now compute the same uptake figures for the filtered dataset:

Show code
master_dataset_study1 %>% 
  as_tibble() %>%
  dplyr::count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Study 1", subtitle = NULL)
Counts for starting the survey: Study 1
Survey status n Percent
Not started 4039 94.812207
Started 221 5.187793
Show code
master_dataset_study1 %>% 
  as_tibble() %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Study 1", subtitle = NULL)
Counts for completing the survey up to the PGSI: Study 1
Survey status n Percent
Complete 183 4.2957746
Incomplete 38 0.8920188
Not started 4039 94.8122066
Show code
master_dataset_study1 %>% 
  as_tibble() %>%
  dplyr::count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Study 1", subtitle = NULL)
Counts for completing the survey in full: Study 1
Survey status n Percent
Complete 153 3.591549
Incomplete 68 1.596244
Not started 4039 94.812207

Survey flow

Show code
# colnames(master_dataset_study1)


# Define relevant survey questions (excluding non-survey & optional ones)

#Note that optional items have been removed here
survey_questions <- c("ipaddress", "screen", "tier",
                  "pokielink", "pokiefrequency", "pokievenuetype", "pokiemembership", "pokievenuenumber",
                  "budgetset", "budgetstrat", 
                  "betsatisfaction", "accountease", "accountuse", "accountimprove", "accountintent", 
                  "attention1", "accountsocial", "accountadtrack", "accountadsetspend", 
                  "accountadsettime", "accountadblock", "accountadcash", "accountadlink", 
                  "accountadtransfer", "accountlikesign", "accountlikepoints", "accountlikeonly", 
                  "accountlikestatus", "accountlikerequire",  "pastafford", 
                  "pastexcite", "pastback", "pastsold", "pastproblem", "attention2", "paststress", 
                  "pastcritic", "pasthousehold", "pastguilty", "skspent", "skwin", "sknetoption", 
                   "skfuturespend", "skfuturewin", "skfutureoption",  "skcomfort", 
                  "lifesatisfaction", "patience", "impulsive", "risks", "alcofreq", "alcostandard", 
                  "alcosix", "alcopokies", "techsmart", "techapple", "techbank", "sex",
                  "year", "post", "lang", "educ", "work", "marstat", "child", "income", 
                  "incomevar", "incomepredict", "westhqvenue", "westhqtier", 
                  "cbenjoy1", "cbmajor2", "cbtop3", "cbcomfort4", "cbenough5", "bc1temp", "bc2break", 
                  "bc3lazy", "bc4inap", "bc5fun", "bc6refuse", "bc7discipline", "bc8iron", "bc9pleasure", 
                  "bc10conc", "bc11goals", "bc12wrong", "bc13alter")


# Filter dataset to exclude non-starters (only include those who started the survey)
survey_data_filtered <- master_dataset_study1 %>%
  filter(!is.na(ipaddress) &
          ipaddress != "") %>%
  select(all_of(survey_questions))

# Inspect data to check we are plotting the right info:
# View(survey_data_filtered)
# print(survey_data_filtered, n=100)

# All looks good. I've ensured we are including all of the survey questions, other than IP address, which we used to tell who started the survey.

# Count non-NA responses for each question
response_counts <- survey_data_filtered %>%
  summarise(across(everything(), ~sum(!is.na(.) & . != ""), .names = "count_{.col}")) %>% # Account for completely empty responses and NA value
  pivot_longer(cols = everything(), names_to = "Question", values_to = "Respondents") %>%
  mutate(Question = gsub("count_", "", Question)) 


# Define survey section positions for text labels:
survey_sections <- data.frame(
  Section = c("Opened survey + screen", 
              "Membership details",
              "Gambling involvement", 
              "Tracking strategies",
              "Gambling satisfaction", 
              "Account-based gambling",
              "PGSI", 
              "Estimated expenditure", 
              "Life satisfaction", 
              "Personality traits", 
              "Alcohol use",
              "Technology literacy",
              "Demographics",
              "Self-control"),
  Start = c("ipaddress", 
            "tier", # Membership details
             "pokievenuetype",  #  Gambling involvement
               "budgetset",  #  Tracking strategies
               "betsatisfaction",  #  Gambling satisfaction
               "accountadsettime",  #  Perceptions of account-based systems
               "attention2",  #  PGSI
               "skfuturespend",  #  Estimated expenditure
               "lifesatisfaction",  #  Life satisfaction
               "impulsive",  #  Personality traits
               "alcosix",  #  Alcohol use
               "techapple",  #  Technology literacy
               "income",  #  Demographics
               "bc8iron"  #  Self-control
            )
)

# Define section start and end positions for lines:
section_lines <- data.frame(
  Position = c("tier",  # Start of Membership details
               "pokiefrequency",  # Start of Gambling involvement
               "budgetset",  # Start of Tracking strategies
               "betsatisfaction",  # Start of Gambling satisfaction
               "accountease",  # Start of Perceptions of account-based systems
               "pastafford",  # Start of PGSI
               "skspent",  # Start of Estimated expenditure
               "lifesatisfaction",  # Start of Life satisfaction
               "patience",  # Start of Personality traits
               "alcofreq",  # Start of Alcohol use
               "techsmart",  # Start of Technology literacy
               "sex",  # Start of Demographics
               "bc1temp",  # Start of Self-control
               "bc13alter" # End
               ))

# Convert positions to numeric for plotting
section_lines$Position <- as.numeric(factor(section_lines$Position, levels = survey_questions))

# Push the final line to the very end of the graph by changing its position:
section_lines<- section_lines %>%
  mutate(Position = case_when(Position == 88 ~89,
                              TRUE~  Position))

# Define questions in each section of the survey:
survey_sections_dot_colours <- data.frame(
  Section = c(
    rep("Opened survey + screen", 2),
    rep("Membership details", 2),
    rep("Gambling involvement", 4),
    rep("Tracking strategies", 2),
    rep("Gambling satisfaction", 1),
    rep("Account-based gambling", 18),
    rep("PGSI", 10),
    rep("Estimated expenditure", 7),
    rep("Life satisfaction", 1),
    rep("Personality traits", 3),
    rep("Alcohol use", 4),
    rep("Technology literacy", 3),
    rep("Demographics", 18),
    rep("Self-control", 13)
  ),
  Question = c(
    # Opened survey
    "ipaddress",
    "screen", # Screening Question
    
     # Membership details
     "pokielink", "tier",

    # Gambling involvement
    "pokiefrequency", "pokievenuetype", "pokiemembership", "pokievenuenumber", 
    
    # Tracking strategies
    "budgetset", "budgetstrat", 
    
    # Gambling satisfaction
    "betsatisfaction",

    # Perceptions of account-based systems
    "accountease", "accountuse", "accountimprove", "accountintent", 
    "attention1", "accountsocial", "accountadtrack", "accountadsetspend", 
    "accountadsettime", "accountadblock", "accountadcash", "accountadlink", 
    "accountadtransfer", "accountlikesign", "accountlikepoints", "accountlikeonly", 
    "accountlikestatus", "accountlikerequire", 

    # PGSI
    "pastafford", "pastexcite", "pastback", "pastsold", "pastproblem", "attention2", "paststress",
    "pastcritic", "pasthousehold", "pastguilty",

    # Estimated expenditure
    "skspent", "skwin", "sknetoption", 
     "skfuturespend", "skfuturewin", "skfutureoption", "skcomfort", 

    # Life satisfaction
    "lifesatisfaction",

    # Personality traits
    "patience", "impulsive", "risks", 

    # Alcohol use
    "alcofreq", "alcostandard", "alcosix", "alcopokies", 

    # Technology literacy
    "techsmart", "techapple", "techbank",  # Removed duplicate "alcopokies"

    # Demographics
    "sex",  "year", "post", "lang", "educ", "work", 
    "marstat", "child", "income", "incomevar", "incomepredict", "westhqvenue", 
     "westhqtier", "cbenjoy1", "cbmajor2", "cbtop3", "cbcomfort4", "cbenough5",
    
    # Self-control
    "bc1temp", "bc2break", "bc3lazy", "bc4inap", "bc5fun", "bc6refuse", "bc7discipline",
    "bc8iron", "bc9pleasure", "bc10conc", "bc11goals", "bc12wrong", "bc13alter"
  )
)



# Ensure response_counts data is structured
response_counts <- response_counts %>%
  # select(-Section) %>%
  full_join(survey_sections_dot_colours) %>%
  filter(!is.na(Respondents)) %>%
    mutate(Question = factor(Question, levels = survey_questions,
                            ordered = TRUE)) %>% # Preserve order
  # Fix order to account for people who completed later questions within the same section
  group_by(Section) %>%
  arrange(match(Question, survey_questions)) %>%  # Explicitly order by predefined question order
  mutate(Respondents = cummin(Respondents)) %>%  # Apply cumulative minimum
  ungroup()


# Update data to account for the absence of info about survey starts saved from Qualtrics (data extracted by someone else on our team):
response_counts <- response_counts %>% 
mutate(Respondents = case_when(Respondents == 221 ~ 1328,
                                TRUE ~ Respondents))

# Wrap section labels to a maximum of 25 characters
survey_sections <- survey_sections %>%
  mutate(Section = str_wrap(Section, width = 30)) 

# Create a lookup table for section labels
section_labels <- setNames(survey_sections$Section, survey_sections$Start)


# Ensure Section is an ordered factor based on survey flow
response_counts$Section <- factor(response_counts$Section, levels = unique(survey_sections$Section))


# Generate a perceptually uniform color scale using viridis
num_sections <- length(unique(response_counts$Section))
section_colors <- viridis(14, option = "mako")  # Blue-Green to Blue






# Plot
main_plot <- ggplot(response_counts, aes(x = Question, y = Respondents, group = 1, color = Section)) +
  geom_line(color = "grey", size = 0.8) +  # Line plot
  
  # Dots colored by ordered survey section (now fading from dark to light)
  geom_point(size = 1.1) +
  
  # Apply wrapped section labels in scale_color_manual()
  scale_color_manual(
    values = setNames(section_colors, levels(response_counts$Section))
  ) +

  # Add dashed vertical lines for section boundaries
  geom_vline(data = section_lines, aes(xintercept = Position - 0.5),
             color = "grey50", linetype = "dashed", size = 0.5) +
  
  # Ensure x-axis follows the predefined order
  scale_x_discrete(limits = survey_questions, labels = function(x) ifelse(x %in% names(section_labels), section_labels[x], "")) +
  
  # Add annotation arrow (straight; curvature = 0)
  geom_curve(aes(x = 36.3, y = 355, xend = 39.5, yend = 355), 
             curvature = 0, arrow = arrow(length = unit(0.03, "npc"), type = "closed"),
             color = "black", size = 0.3) +
  
  # Add text labels
  annotate("text", x = 34.5, y = 400, label = "Pre-defined\ncompletion\npoint", 
           size = 2, fontface = "plain", family = "Poppins") +
    annotate("text", x = 87.5, y = 450, label = "End of survey", 
           angle = 90, size = 2, fontface = "plain", family = "Poppins") +
  labs(
    y = "Number of Respondents\n(log-10 scale)"
  ) + 
  # Theme
  plot_theme +
  theme(
    axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1.03),
    axis.title.y = element_text(vjust = 2),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank(),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 12),
    plot.margin = margin(2, 5, 2, 15),
    legend.position = "none"
  )  


main_plot +
  labs(title = "Survey 1 flow with natural y-axis scale",
       y= "")

Show code
main_plot <- main_plot +   scale_y_log10(breaks = c(250, 500, 1000), labels = c("250", "500", "1000")) +
  labs(title = "Survey 1 flow with log-10 scale")

main_plot

Show code
main_plot <-  main_plot +   
  # Themes:
  plot_theme +
  theme(
    axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1.03),
    axis.title.y = element_text(vjust = 2),
    axis.ticks.x = element_blank(),
    axis.title.x = element_blank(),
    axis.text = element_text(size = 7),
    axis.title = element_text(size = 8),
    plot.margin = margin(2, 5, 2, 15),
    legend.position = "none"
  )  

ggsave("Figures/Study 1 survey flow.pdf", 
       plot = main_plot, 
       # device = cairo_pdf,  # Ensures smooth text rendering
       width = 16,          
       height = 8.5,         
       units = "cm",        
       dpi = 300)

Compare samples

What are the general demographic and gambling characteristics of those who did and did not take part in the survey?

Show code
master_dataset_study1 %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>% # Invert net outcome so losses are negative and vice versa
  tbl_summary(
    include = c(# Demographic details
      age_westhq, 
      gender_westhq, 
      IRSAD_SCORE,
      memberyears, 
       # Gambling
      days_since_lastplay,
      num_6sday,
      intensity_6sday,
      averagebet_mean6s,
      montly_net_outcome_6m, # 
      gdquiet6_percent,
      timedevice_sum6s
      
      ),
    type = all_continuous() ~ "continuous",
    statistic = list(all_continuous() ~ c(
      "{mean} ({sd}), {median}")),
     digits = all_continuous() ~ 2,
    by = STATUS,
    label = c(
      age_westhq ~ "Age",
      gender_westhq ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      memberyears ~ "No. of years as a member",
      days_since_lastplay ~ "No. days since last bet",
      num_6sday ~ "No. betting days",
      intensity_6sday ~ "Mean games played per active day",
      averagebet_mean6s ~ "Mean stake",
      montly_net_outcome_6m ~ "Mean net outcome per month",
      gdquiet6_percent ~ "Percentage of sessions between midnight & 6AM",
      timedevice_sum6s ~ "Total time on device (minutes)"
      )
  ) %>%
  add_overall() %>%
  add_difference(
    list(all_continuous() ~ "cohens_d",  # For continuous variables
    all_categorical() ~ "smd") 
  ) %>%
  modify_header(label = "Variable") %>% 
  as_tibble() %>%
  filter(Variable != "Unknown") %>% 
  rename("Effect_size" = `**Difference**`) %>%
   unite("Effect size", Effect_size, `**95% CI**`, sep = " [", remove = TRUE) %>%
   mutate("Effect size" = paste0(`Effect size`, "]")) %>%
  mutate(`Effect size` = case_when(`Effect size` == "NA [NA]" ~ "",
                                    TRUE ~ `Effect size`)) %>%

  gt() %>%
  tab_options(data_row.padding = px(1.5)) %>%
  tab_header(title = "Table 2.   Study 1 - Comparison of participants and non-participants: Demographic and gambling characteristics") %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables: Cramer's V [95% CIs])",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Table 2. Study 1 - Comparison of participants and non-participants: Demographic and gambling characteristics
Variable **Overall**, N = 4,260¹ **Participants**, N = 183¹ **Non-participants**, N = 4,077 Effect size²
Age 53.99 (14.80), 55.00 52.08 (13.97), 53.00 54.07 (14.84), 55.00 -0.13 [-0.28, 0.01]
Gender NA NA NA 0.13 [-0.02, 0.28]
F 2,223 (52%) 107 (58%) 2,116 (52%)
M 2,036 (48%) 76 (42%) 1,960 (48%)
IRSAD score 958.93 (77.46), 970.79 961.37 (76.20), 977.56 958.82 (77.52), 970.79 0.03 [-0.12, 0.18]
No. of years as a member 10.61 (7.85), 9.40 12.16 (9.11), 9.90 10.54 (7.78), 9.40 0.21 [0.06, 0.36]
No. days since last bet 14.61 (14.87), 10.00 17.53 (17.08), 12.00 14.48 (14.75), 9.00 0.21 [0.06, 0.35]
No. betting days 19.17 (26.01), 9.00 22.63 (25.78), 15.00 19.02 (26.01), 9.00 0.14 [-0.01, 0.29]
Mean games played per active day 1,074.20 (922.03), 841.00 1,178.75 (932.99), 1,017.70 1,069.51 (921.37), 828.80 0.12 [-0.03, 0.27]
Mean stake 3.66 (9.10), 1.43 3.26 (7.71), 1.70 3.68 (9.15), 1.41 -0.05 [-0.19, 0.10]
Mean net outcome per month 720.38 (1,497.50), 192.91 1,011.20 (1,734.67), 401.06 707.00 (1,484.57), 187.62 0.20 [0.05, 0.35]
Percentage of sessions between midnight & 6AM 26.31 (25.25), 18.39 20.98 (18.11), 18.17 26.56 (25.52), 18.39 -0.22 [-0.42, -0.02]
Total time on device (minutes) 2,555.85 (4,651.01), 811.19 2,892.40 (4,037.00), 1,631.00 2,540.74 (4,676.54), 791.18 0.08 [-0.07, 0.22]
¹ Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
² Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables: Cramér's V [95% CIs]

NOTE: the net outcome values and effect size were reverse scored, such that positive values indicated losses, and so we have inverted these values in the manuscript.

Another important thing to note about the above table is that the effect size for the categorical variable (gender) is actually a standardised mean difference value, which I don’t think makes much sense here. I’m going to compute Cramér’s V effect size here and put it into the table and the manuscript:

Show code
# Calculate Cramér's V and its confidence interval
combined_table <- master_dataset_study1 %>%
  group_by(STATUS, gender_westhq) %>%
  dplyr::count() %>%
  drop_na() %>%
  ungroup() %>%
  spread(key = gender_westhq, value = n, fill = 0) %>%
  select(-1) %>%
  as.matrix()


  cramersV(combined_table)
Cramér's V =  0.027
Show code
  CI <- confIntV(combined_table)
  CI_info<- CI$intermediate
   CI_info$fisherZ.ci %>%
  t() %>%
  as.data.frame() %>%
   mutate_if(is.numeric, round, 3) %>%
unite('Effect size CIs', 1, 2, sep = ", ") 
  Effect size CIs
1   -0.003, 0.057

Calculate odds ratio instead, as per the reviewer’s suggestion:

Show code
combined_table <- master_dataset_study1 %>%
  group_by(STATUS, gender_westhq) %>%
  dplyr::count() %>%
  drop_na() %>%
  ungroup() %>%
  spread(key = gender_westhq, value = n, fill = 0) %>%
  column_to_rownames(var="STATUS") %>% 
  as.matrix()

oddsratio(combined_table)
           Disease Nondisease Total
Exposed        107         76   183
Nonexposed    2116       1960  4076
Total         2223       2036  4259

    Odds ratio estimate and its significance probability

data:  combined_table
p-value = 0.08243
95 percent confidence interval:
 0.9657426 1.7610017
sample estimates:
[1] 1.304099

Sensitivity check comparison

How does the group comparison differ if we change the criterion for survey completion? Reviewers didn’t ask for this change specifically, but some suggestions led us to conduct these additional sensitivity checks.

For this, we will redefine participants as anyone who fully completed the survey, and non-participants as anyone who didn’t start or didn’t complete the survey.

Show code
# Check coding of variable that tells us whether somebody fully completed the survey or not
master_dataset_study1 %>%
  count(STATUS,
           SURVEY_STATUS_FULL)
# A tibble: 4 × 3
  STATUS           SURVEY_STATUS_FULL     n
  <fct>            <chr>              <int>
1 Participants     Complete             153
2 Participants     Incomplete            30
3 Non-participants Incomplete            38
4 Non-participants Not started         4039
Show code
master_dataset_study1_sensitivity_check   <- master_dataset_study1 %>%
  mutate(SURVEY_STATUS_FULL_BINARY = case_when(SURVEY_STATUS_FULL == "Incomplete" ~ "Non-participants",
                                               SURVEY_STATUS_FULL == "Not started" ~ "Non-participants",
         TRUE ~ "Participants")) %>%
 mutate(SURVEY_STATUS_FULL_BINARY = fct_relevel(SURVEY_STATUS_FULL_BINARY,
                            "Participants",
                            "Non-participants")) # fix order of presentation

# Check recoding has worked:
master_dataset_study1_sensitivity_check %>%
  count(STATUS,
           SURVEY_STATUS_FULL_BINARY)
# A tibble: 3 × 3
  STATUS           SURVEY_STATUS_FULL_BINARY     n
  <fct>            <fct>                     <int>
1 Participants     Participants                153
2 Participants     Non-participants             30
3 Non-participants Non-participants           4077
Show code
# Produce updated table
master_dataset_study1_sensitivity_check %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>% # Invert net outcome so losses are negative and vice versa
  tbl_summary(
    include = c(# Demographic details
      age_westhq, 
      gender_westhq, 
      IRSAD_SCORE,
      memberyears, 
       # Gambling
      days_since_lastplay,
      num_6sday,
      intensity_6sday,
      averagebet_mean6s,
      montly_net_outcome_6m, # 
      gdquiet6_percent,
      timedevice_sum6s
      
      ),
    type = all_continuous() ~ "continuous",
    statistic = list(all_continuous() ~ c(
      "{mean} ({sd}), {median}")),
     digits = all_continuous() ~ 2,
    by = SURVEY_STATUS_FULL_BINARY,
    label = c(
      age_westhq ~ "Age",
      gender_westhq ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      memberyears ~ "No. of years as a member",
      days_since_lastplay ~ "No. days since last bet",
      num_6sday ~ "No. betting days",
      intensity_6sday ~ "Mean games played per active day",
      averagebet_mean6s ~ "Mean stake",
      montly_net_outcome_6m ~ "Mean net outcome per month",
      gdquiet6_percent ~ "Percentage of sessions between midnight & 6AM",
      timedevice_sum6s ~ "Total time on device (minutes)"
      )
  ) %>%
  add_overall() %>%
  add_difference(
    list(all_continuous() ~ "cohens_d",  # For continuous variables
    all_categorical() ~ "smd") 
  ) %>%
  modify_header(label = "Variable") %>% 
  as_tibble() %>%
  filter(Variable != "Unknown") %>% 
  rename("Effect_size" = `**Difference**`) %>%
   unite("Effect size", Effect_size, `**95% CI**`, sep = " [", remove = TRUE) %>%
   mutate("Effect size" = paste0(`Effect size`, "]")) %>%
  mutate(`Effect size` = case_when(`Effect size` == "NA [NA]" ~ "",
                                    TRUE ~ `Effect size`)) %>%

  gt() %>%
  tab_options(data_row.padding = px(1.5))   %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables: Cramer's V [95% CIs])",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Variable **Overall**, N = 4,260¹ **Participants**, N = 153¹ **Non-participants**, N = 4,107 Effect size²
Age 53.99 (14.80), 55.00 50.59 (13.56), 51.00 54.11 (14.83), 55.00 -0.24 [-0.40, -0.08]
Gender NA NA NA 0.06 [-0.11, 0.22]
F 2,223 (52%) 84 (55%) 2,139 (52%)
M 2,036 (48%) 69 (45%) 1,967 (48%)
IRSAD score 958.93 (77.46), 970.79 965.15 (75.81), 977.56 958.69 (77.52), 970.79 0.08 [-0.08, 0.24]
No. of years as a member 10.61 (7.85), 9.40 11.77 (9.03), 9.50 10.56 (7.80), 9.40 0.15 [-0.01, 0.32]
No. days since last bet 14.61 (14.87), 10.00 16.59 (16.64), 10.00 14.54 (14.80), 10.00 0.14 [-0.02, 0.30]
No. betting days 19.17 (26.01), 9.00 22.18 (24.23), 15.00 19.06 (26.07), 9.00 0.12 [-0.04, 0.28]
Mean games played per active day 1,074.20 (922.03), 841.00 1,076.81 (875.62), 918.47 1,074.10 (923.81), 833.78 0.00 [-0.16, 0.16]
Mean stake 3.66 (9.10), 1.43 3.37 (8.38), 1.67 3.68 (9.12), 1.42 -0.03 [-0.19, 0.13]
Mean net outcome per month 720.38 (1,497.50), 192.91 907.84 (1,665.54), 328.17 713.24 (1,490.50), 190.50 0.13 [-0.03, 0.29]
Percentage of sessions between midnight & 6AM 26.31 (25.25), 18.39 21.61 (18.89), 18.17 26.49 (25.46), 18.39 -0.19 [-0.41, 0.02]
Total time on device (minutes) 2,555.85 (4,651.01), 811.19 2,747.25 (3,903.29), 1,604.52 2,548.72 (4,676.80), 796.10 0.04 [-0.12, 0.20]
¹ Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
² Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables: Cramér's V [95% CIs]
Show code
# Calculate Cramér's V and its confidence interval
combined_table <- master_dataset_study1_sensitivity_check %>%
  group_by(SURVEY_STATUS_FULL_BINARY, gender_westhq) %>%
  dplyr::count() %>%
  drop_na() %>%
  ungroup() %>%
  spread(key = gender_westhq, value = n, fill = 0) %>%
  select(-1) %>%
  as.matrix()


  cramersV(combined_table)
Cramér's V =  0.01
Show code
  CI <- confIntV(combined_table)
  CI_info<- CI$intermediate
   CI_info$fisherZ.ci %>%
  t() %>%
  as.data.frame() %>%
   mutate_if(is.numeric, round, 3) %>%
unite('Effect size CIs', 1, 2, sep = ", ") 
  Effect size CIs
1    -0.02, 0.041

The effect sizes appear to be smaller in these comparisons. Let’s check if this is because the group who started the survey but didn’t complete it have now been merged with non-responders, making the latter group more similar to responders, or because those who fully complete the survey are actually more similar to those who didn’t start it:

Show code
master_dataset_study1_sensitivity_check %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>% # Invert net outcome so losses are negative and vice versa
  tbl_summary(
    include = c(# Demographic details
      age_westhq, 
      gender_westhq, 
      IRSAD_SCORE,
      memberyears, 
       # Gambling
      days_since_lastplay,
      num_6sday,
      intensity_6sday,
      averagebet_mean6s,
      montly_net_outcome_6m, # 
      gdquiet6_percent,
      timedevice_sum6s
      
      ),
    type = all_continuous() ~ "continuous",
    statistic = list(all_continuous() ~ c(
      "{mean} ({sd}), {median}")),
     digits = all_continuous() ~ 2,
    by = SURVEY_STATUS_FULL,
    label = c(
      age_westhq ~ "Age",
      gender_westhq ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      memberyears ~ "No. of years as a member",
      days_since_lastplay ~ "No. days since last bet",
      num_6sday ~ "No. betting days",
      intensity_6sday ~ "Mean games played per active day",
      averagebet_mean6s ~ "Mean stake",
      montly_net_outcome_6m ~ "Mean net outcome per month",
      gdquiet6_percent ~ "Percentage of sessions between midnight & 6AM",
      timedevice_sum6s ~ "Total time on device (minutes)"
      )
  )%>%
  modify_header(label = "Variable") %>% 
  as_tibble() %>%
  filter(Variable != "Unknown") %>%
   gt() %>%
  tab_options(data_row.padding = px(1.5)) %>%
  tab_header(title = "Study 1 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey")
Study 1 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey
Variable **Complete**, N = 153 **Incomplete**, N = 68 **Not started**, N = 4,039
Age 50.59 (13.56), 51.00 56.29 (14.95), 59.00 54.08 (14.83), 55.00
Gender NA NA NA
F 84 (55%) 41 (60%) 2,098 (52%)
M 69 (45%) 27 (40%) 1,940 (48%)
IRSAD score 965.15 (75.81), 977.56 949.86 (75.71), 968.48 958.84 (77.55), 970.79
No. of years as a member 11.77 (9.03), 9.50 12.92 (8.95), 11.30 10.52 (7.77), 9.40
No. days since last bet 16.59 (16.64), 10.00 18.19 (17.26), 12.50 14.48 (14.75), 9.00
No. betting days 22.18 (24.23), 15.00 29.10 (30.27), 19.00 18.89 (25.96), 9.00
Mean games played per active day 1,076.81 (875.62), 918.47 1,599.03 (1,142.05), 1,305.49 1,065.27 (917.30), 827.81
Mean stake 3.37 (8.38), 1.67 2.33 (1.95), 1.64 3.70 (9.19), 1.41
Mean net outcome per month 907.84 (1,665.54), 328.17 1,471.40 (2,484.41), 789.93 700.21 (1,464.66), 181.50
Percentage of sessions between midnight & 6AM 21.61 (18.89), 18.17 27.62 (25.92), 21.11 26.47 (25.46), 18.24
Total time on device (minutes) 2,747.25 (3,903.29), 1,604.52 4,135.38 (5,111.33), 2,247.80 2,522.01 (4,665.22), 782.30

Interestingly, it seems that the group who started but didn’t complete the survey are most different, trending towards being more involved gamblers.

PGSI

Also check the PGSI scores of the survey sample for inclusion in the table later on. For this, we will remove anyone who didn’t pass the attention checks, to get the most reliable estimate:

Show code
summary_PGSI <- master_dataset_study1 %>%
  filter(attention1 == "2" & # Filter to corrective responses to attention checks
           attention2 == "Sometimes") %>%
  summarise(Mean_PGSI = mean(PGSI_TOTAL, na.rm = TRUE),
            SD_PGSI = sd(PGSI_TOTAL, na.rm = TRUE),
            Median_PGSI = median(PGSI_TOTAL, na.rm = TRUE))

mean_ps_PGSI_value <- summary_PGSI$Mean_PGSI
sd_ps_PGSI_value <- summary_PGSI$SD_PGSI
median_ps_PGSI_value <- summary_PGSI$Median_PGSI

# Can produce a sentence that summarises the outcomes:
summary_sentence_PGSI_study1 <- sprintf("The mean PGSI total score is %.2f (standard deviation = %.2f) and the median is %.2f.", 
                            mean_ps_PGSI_value, sd_ps_PGSI_value, median_ps_PGSI_value)

summary_sentence_PGSI_study1
[1] "The mean PGSI total score is 5.50 (standard deviation = 5.01) and the median is 4.00."

Composite risk scores

Now let’s produce a single composite score that represents people’s past six months gambling behaviour.

One thing we need to consider here is that some of the variables we’ve created could be biased by recent membership registration. However, looking at the data, all but 2 of the participants have been registered for at least six months, and the 2 people who had been registered for less than this had both been registered for 0.4 of a year, the equivalent of approximately five months. So, this is unlikely to affect our outcomes here.

Show code
 master_dataset_study1 %>% 
  filter(memberyears < 0.5) %>%
  dplyr::count(memberyears)
# A tibble: 1 × 2
  memberyears     n
        <dbl> <int>
1         0.4    51

Okay, now we have all of the relevant variables, let’s look at how they correlate with PGSI and harm scores using a series of bivariate correlations visualised as a heatmap:

Show code
PGSI_cor_data_study1<- master_dataset_study1 %>% 
  # colnames()
    select(PGSI_TOTAL,
           gd_distinct_6s, # Number of distinct games played (by brand)
         game_description_count6s, # Total number of sessions played
         timedevice_sum6s, # Total time on device
        timedevice_mean6s, #  Average time on device per session
        timedevice_std6s, # SD of Time on device
        turnover1_sum6s, # Total turnover
        turnover1_mean6s, # Mean turnover per session
        turnover1_std6s, # SD of Turnover
         actualwin_sum6s, # Total net outcome 
        actualwin_mean6s, # Mean net outcome per session
        actualwin_std6s, # SD of Session net outcome
        averagebet_mean6s, # Mean bet size
        averagebet_std6s, # SD of average bet size
        gamesplayed_sum6s, #  Total games played
       gamesplayed_mean6s, #  Mean games played per session
       gamesplayed_std6s, # SD of games played per session
       max_turnover6sday, # Maximum total turnover in a single day  
     max_gamesplayed6sday, # Maximum total games played in a single day 
       num_6sday, #   Number of active betting days
       intensity_6sday, # Number of games per active day
       averagebet_6mos, #   Average monthly bet freq (days)
       averageturnover_6mos, #   average monthly turnover 
       montly_net_outcome_6m) # Average monthly net outcome

ggstatsplot::ggcorrmat(PGSI_cor_data_study1, 
                                              type = "parametric",
                                              results.subtitle = FALSE,
                                              ggcorrplot.args = c(lab_size = 3)) +
  theme(axis.text.y = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(axis.text.x = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(legend.title=element_text(color = "black", size = 12, face = "plain", family = "Poppins")) +
  theme(legend.title.align = .5) +
  theme(legend.text=element_text(color = "black", size = 8, face = "plain", family = "Poppins"))

Now let’s use all of these variables to create a composite index score of risk/involvement:

Show code
master_dataset_study1_with_risk_scores <- master_dataset_study1 %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>% # Invert outcome values
   mutate(actualwin_mean6s = actualwin_mean6s*-1) %>% # Invert outcome values
  mutate(across(c(
            gd_distinct_6s, # Number of distinct games played (by brand)
         game_description_count6s, # Total number of sessions played
         timedevice_sum6s, # Total time on device
        timedevice_mean6s, #  Average time on device per session
        turnover1_sum6s, # Total turnover
        turnover1_mean6s, # Mean turnover per session
        turnover1_std6s, # SD of Turnover
         actualwin_sum6s, # Total net outcome 
        actualwin_mean6s, # Mean net outcome per session
        actualwin_std6s, # SD of Session net outcome
        averagebet_sum6s, # Total bet size
        averagebet_mean6s, # Mean bet size
        averagebet_std6s, # SD of average bet size
        gamesplayed_sum6s, #  Total games played
       gamesplayed_mean6s, #  Mean games played per session
       gamesplayed_std6s, # SD of games played per session
       max_turnover6sday, # Maximum total turnover in a single day  
     max_gamesplayed6sday, # Maximum total games played in a single day 
       num_6sday, #   Number of active betting days
       intensity_6sday, # Number of games per active day
       averagebet_6mos, #   Average monthly bet freq (days)
     montly_net_outcome_6m, # Average monthly net outcome
       averageturnover_6mos), 
                scale, # scale function computes z-scores
                .names = "z_{.col}")) %>% # Create new named columns for easy selection below
  rowwise() %>% # Group by Rows
  mutate(`Composite Risk Score` = sum(c_across(starts_with("z_")), na.rm = TRUE)) %>% # Sum across all newly created Z scores
  ungroup()


# colnames(master_dataset_study2_with_risk_scores)

Correlations

Now let’s look at how well this composite score correlates with PGSI scores in the survey sample:

Show code
master_dataset_study1_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>%
   mutate(PGSI_CATEGORY = factor(PGSI_CATEGORY, levels = c('No risk',
                                                              'Low risk',
                                                              'Moderate risk',
                                                              'High risk'))) %>%
ggplot() +
   geom_point(aes(x = PGSI_TOTAL, y = `Composite Risk Score`,
                   colour = PGSI_CATEGORY)) +
   geom_smooth(aes(x = PGSI_TOTAL, y = `Composite Risk Score`), 
               method = "lm",
               se = FALSE,
               color = "black") +
   scale_colour_manual(values = c("#2e4057",
                                  "#00798c", 
                                  "#edae49",
                                  "#d1495b")) +
   geom_hline(yintercept = 0, colour = "grey",  linetype="dashed") +
   scale_y_continuous(limits = c(-10,120)) +
   plot_theme +
  guides(color= guide_legend("PGSI category")) +
   labs(x = "PGSI total score") +
  scale_x_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 27))

Show code
cor_PGSI <- cor.test(master_dataset_study1_with_risk_scores$PGSI_TOTAL, 
                         master_dataset_study1_with_risk_scores$`Composite Risk Score`,
                         method = "spearman")

print(cor_PGSI)

    Spearman's rank correlation rho

data:  master_dataset_study1_with_risk_scores$PGSI_TOTAL and master_dataset_study1_with_risk_scores$`Composite Risk Score`
S = 846277, p-value = 0.02031
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.1714405 

Confirm how many people are involved in this calculation:

Show code
master_dataset_study1_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>% nrow()
[1] 183

T-test

Show code
# Perform Welch's t-test
t_test_composite_study1 <- t.test(`Composite Risk Score` ~ STATUS, data = master_dataset_study1_with_risk_scores, var.equal = FALSE)

t_test_composite_study1

    Welch Two Sample t-test

data:  Composite Risk Score by STATUS
t = 2.3269, df = 197.85, p-value = 0.02098
alternative hypothesis: true difference in means between group Participants and group Non-participants is not equal to 0
95 percent confidence interval:
 0.3172207 3.8424226
sample estimates:
    mean in group Participants mean in group Non-participants 
                    1.99047717                    -0.08934445 
Show code
# Calculate means and standard deviations
summary_composite_study1 <- master_dataset_study1_with_risk_scores %>%
  dplyr::group_by(STATUS) %>%
  dplyr::summarize(
    mean = mean(`Composite Risk Score`),
    sd = sd(`Composite Risk Score`),
    n = n()
  ) 

gt(summary_composite_study1)
STATUS mean sd n
Participants 1.99047717 11.84102 183
Non-participants -0.08934445 11.54773 4077
Show code
# Calculate Cohen's d with 95% confidence interval
cohen_d_result_study1 <- cohen.d(`Composite Risk Score` ~ STATUS, data = master_dataset_study1_with_risk_scores)

cohen_d_result_study1

Cohen's d

d estimate: 0.1799089 (negligible)
95 percent confidence interval:
     lower      upper 
0.03171688 0.32810096 

Quartiles

Now, let’s divide the sample into quartiles based on their composite summary score:

Show code
# Calculate quartiles of the composite risk score
quartiles_study1 <- quantile(master_dataset_study1_with_risk_scores$`Composite Risk Score`, 
                      probs = c(0, 0.25, 0.5, 0.75, 1))


  
master_dataset_study1_with_risk_scores$Quartile <- cut(master_dataset_study1_with_risk_scores$`Composite Risk Score`, 
                                    breaks = quartiles_study1,
                                    labels = c("Bottom quartile: least involved",
                                               "Second quartile",
                                               "Third quartile",
                                               "Top quartile: most involved"),
                                    include.lowest = TRUE)

Now compute the proportion of individuals from each quartile who went on to take part in the survey:

Show code
master_dataset_study1_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
  tab_header("Counts for granular survey status by composite risk score quartile", subtitle = NULL)
Counts for granular survey status by composite risk score quartile
SURVEY_STATUS_PGSI n
Bottom quartile: least involved
Complete 34
Incomplete 4
Not started 1027
Second quartile
Complete 40
Incomplete 7
Not started 1018
Third quartile
Complete 49
Incomplete 10
Not started 1006
Top quartile: most involved
Complete 60
Incomplete 17
Not started 988
Show code
master_dataset_study1_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(STATUS) %>%
  mutate('Percent of quartile' = round(n/sum(n)*100,2)) %>%
  gt() %>%
  tab_header("Counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Counts (with percentages) for broad survey status by composite risk score quartile
STATUS n Percent of quartile
Bottom quartile: least involved
Participants 34 3.19
Non-participants 1031 96.81
Second quartile
Participants 40 3.76
Non-participants 1025 96.24
Third quartile
Participants 49 4.60
Non-participants 1016 95.40
Top quartile: most involved
Participants 60 5.63
Non-participants 1005 94.37

And visualise this using a Sankey diagram:

Show code
# Create Sankey plot
desired_order_sankey <- c("Top quartile: most involved",
                          "Third quartile",
                          "Second quartile",
                          "Bottom quartile: least involved")

# We first need to create a data frame that contains the flows ("from" and "to" categories) and "intensity" values (i.e., N):
risk_to_survey_links<- master_dataset_study1_with_risk_scores %>%
    select(CLIENT_ID, SURVEY_STATUS_PGSI, Quartile) %>%
  group_by(Quartile, SURVEY_STATUS_PGSI) %>%
   summarise(
    n = n(),
    .groups = "drop"
  ) %>%
   group_by(Quartile) %>%
   mutate(
     percent = n/sum(n)*100) %>%
  select(-n) %>%
  ungroup() %>%
  mutate(Quartile =  factor(Quartile,
                              levels = desired_order_sankey)) %>%
  arrange(Quartile)  

# From these flows, we have to create a "node" data frame which provides a list of all categories (to and from) in the data frame: 
nodes <- data.frame(
  name=c(as.character(risk_to_survey_links$Quartile), 
  as.character(risk_to_survey_links$SURVEY_STATUS_PGSI)) %>% unique()
)

# For the networkD3 package, the links between categories are made using IDs and not the character names provided in the risk_to_survey_links data frame. This code assigns each "from" and "to" category a number to make the numerical connections explicit:
risk_to_survey_links$IDsource <- match(risk_to_survey_links$Quartile, nodes$name)-1 
risk_to_survey_links$IDtarget <- match(risk_to_survey_links$SURVEY_STATUS_PGSI, nodes$name)-1
 


# prepare colour scale
ColourScal ='d3.scaleOrdinal() .range(["#d1495b", 
"#edae49",
"#00798c",
"#2e4057",
"#35B779FF",
"#31688EFF",
"#31688EFF",
"#482878FF",
"#B4DE2CFF",
"#FDE725FF", 
"#26828EFF",
"#3E4A89FF",
"#440154FF",
"#6DCD59FF"
])'

# Create Sankey plot
sankey_plot1 <- sankeyNetwork(
  Links = risk_to_survey_links, 
  Nodes = nodes,
  Source = "IDsource", 
  Target = "IDtarget",
  Value = "percent", 
  NodeID = "name", 
  iterations = 0,
  sinksRight = FALSE,
  fontSize = 16, 
  fontFamily = "Poppins",
  nodeWidth = 30,
  nodePadding = 25,
  colourScale = ColourScal
)

# Custom CSS with Google Fonts link
custom_css <- "
<link href='https://fonts.googleapis.com/css2?family=Poppins:wght@400;700&display=swap' rel='stylesheet'>
<style>
  text {
    font-family: 'Poppins', sans-serif !important;
  }
</style>
"

# Save plot as a HTML file with custom CSS
html_file <- tempfile(fileext = ".html")
saveWidget(
  widget = sankey_plot1, # the custom CSS is injected into the saved HTML below
  file = html_file,
  selfcontained = TRUE
)

# Read the saved HTML and ensure the custom CSS is properly embedded
html_content <- readLines(html_file, warn = FALSE)
html_content <- gsub("</head>", paste0(custom_css, "</head>"), html_content)
writeLines(html_content, con = "sankey_plot1.html")

# remove the temporary file
file.remove(html_file)
[1] TRUE
Show code
# sankey_plot1

Logistic regression

Now let’s perform a logistic regression model to identify the strongest predictors of survey uptake.
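For orientation, here is a minimal sketch of the form that model will take once the predictor set has been chosen (the predictors shown are placeholders rather than the final SSVS-selected set, and the data frame is the SSVS-ready one prepared below):

# Sketch only: the final predictor set comes from the SSVS step below
uptake_model <- glm(STATUS ~ Age + Gender + `Number of active betting days*`,
                    data = master_dataset_study1_SSVS,
                    family = binomial)

summary(uptake_model)            # coefficients and Wald tests
car::vif(uptake_model)           # check multicollinearity
fmsb::NagelkerkeR2(uptake_model) # Nagelkerke pseudo R-squared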

SSVS

Before we can actually run the model, we need to identify a suitable suite of predictor variables to include, using Stochastic Search Variable Selection (SSVS) and the related SSVS package (note that the actual SSVS process is commented out in rendered versions of this document as it takes a long time to run).

As we want to maximise the overall predictive ability of the model, we have opted to set the parameters of the SSVS so as to have a low threshold for including/selecting relevant variables. First, we have set the prior inclusion probability to 0.9, reflecting a belief that predictors are more likely to be included than not (a prior inclusion probability of 0.5 would reflect the belief that each predictor has an equal probability of being included/excluded). Second, we have set the cut-off for marginal inclusion probability (MIP) values to 0.4. There is no standard recommendation for what this cut-off should be, but others have used 0.5. For this specific study, we did adjust this value in response to previous results, given that only 2 predictors were identified based on a lower MIP threshold, and we are trying to build a model that maximises the ability to predict survey engagement based on account data.
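For reference, a minimal sketch of the (slow, and therefore commented-out) SSVS run that these parameters feed into, based on the SSVS package documentation; inprob is the prior inclusion probability, and continuous = FALSE reflects the binary uptake outcome. Argument names may differ slightly across package versions:

# Sketch of the SSVS run, using the outcome/predictor objects defined below:
# ssvs_results <- ssvs(data = master_dataset_study1_SSVS,
#                      x = predictors,
#                      y = outcome,
#                      continuous = FALSE, # binary outcome (participation)
#                      inprob = 0.9)       # prior inclusion probability
# summary(ssvs_results, ordered = TRUE)    # inspect MIPs against the 0.4 cut-off
# plot(ssvs_results)                       # visualise marginal inclusion probabilities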

One thing we need to do before all of this is impute missing values for the IRSAD scores, as SSVS doesn’t seem to handle missing values well. To do this, we will impute the median value for each person’s state as their missing value.

There are also some SD variables which don’t have scores because people only bet once in the window; for these, we will set the value to 0. Finally, for the days since registration, we’ll set this to the median as well.

Show code
# Function to impute missing values with the median
impute_na_with_median <- function(data) {
  data %>%
    # Impute SD variables with 0
    mutate(across(contains("std6s"), ~ ifelse(is.na(.), 0, .))) %>%
    # Impute all other variables with median
    mutate(across(!contains("std6s"), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
}

# Prepare the dataset
master_dataset_study1_SSVS <- master_dataset_study1 %>%
  # Impute missing IRSAD scores
  mutate(IRSAD_SCORE = ifelse(is.na(IRSAD_SCORE), median(IRSAD_SCORE, na.rm = TRUE), IRSAD_SCORE)) %>%
  # Convert all required variables to numeric
  mutate(across(c(
    age_westhq,
    IRSAD_SCORE,
    memberyears,
    days_since_lastplay,
    gd_distinct_6s,
    game_description_count6s,
    timedevice_sum6s,
    timedevice_mean6s,
    timedevice_std6s,
    turnover1_sum6s,
    turnover1_mean6s,
    turnover1_std6s,
    actualwin_sum6s,
    actualwin_mean6s,
    actualwin_std6s,
    averagebet_mean6s,
    averagebet_std6s,
    gamesplayed_sum6s,
    gamesplayed_mean6s,
    gamesplayed_std6s,
    max_turnover6sday,
    max_gamesplayed6sday,
    num_6sday,
    intensity_6sday,
    averagebet_6mos,
    averageturnover_6mos), 
    ~ as.numeric(.))) %>%
  # Convert gender to numeric
  mutate(GENDER = case_when(
    gender_westhq == 'F' ~ 0,
    gender_westhq == 'M' ~ 1,
    TRUE ~ NA_real_)) %>%
  # Convert STATUS to numeric
  mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1)) %>%
  mutate(STATUS = as.numeric(STATUS)) %>%
  # Impute missing values with median and 0 for SD variables
  impute_na_with_median() %>%
  rename(
    'Age' = 'age_westhq',
    'Gender' = 'GENDER',
    'IRSAD score' = 'IRSAD_SCORE',
    'Days since registration' = 'memberyears',
    'Days since last bet' = 'days_since_lastplay',
    'No. of diff. game types*' = 'gd_distinct_6s',
    'Total sessions played*' = 'game_description_count6s',
    'Total time on device*' = 'timedevice_sum6s',
    'Mean time on device per session*' = 'timedevice_mean6s',
    'SD of time on device*' = 'timedevice_std6s',
    'Total turnover*' = 'turnover1_sum6s',
    'Mean turnover per session*' = 'turnover1_mean6s',
    'SD of turnover*' = 'turnover1_std6s',
    'Total net outcome*' = 'actualwin_sum6s',
    'Mean net outcome per session*' = 'actualwin_mean6s',
    'SD of session net outcome*' = 'actualwin_std6s',
    'Mean stake*' = 'averagebet_mean6s',
    'SD of stake*' = 'averagebet_std6s',
    'Total games played*' = 'gamesplayed_sum6s',
    'Mean no. games played per session*' = 'gamesplayed_mean6s',
    'SD of games played per session*' = 'gamesplayed_std6s',
    'Max turnover in a single day*' = 'max_turnover6sday',
    'Max no. games played in a single day*' = 'max_gamesplayed6sday',
    'Number of active betting days*' = 'num_6sday',
    'Mean no. games per active day*' = 'intensity_6sday',
    'Mean monthly bet freq (days)*' = 'averagebet_6mos',
    'Mean monthly turnover*' = 'averageturnover_6mos'
  ) %>%
  as.data.frame()

outcome <- 'STATUS'

predictors <- c(
  # Demographic variables:
  'Age',
  'Gender',
  'IRSAD score',
  
  # Account/wagering variables:
  'Days since registration',
  'Days since last bet',
  
  # All past six month gambling variables (As included in the composite score):
  'No. of diff. game types*',
  'Total sessions played*',
  'Total time on device*',
  'Mean time on device per session*',
  'SD of time on device*',
  'Total turnover*',
  'Mean turnover per session*',
  'SD of turnover*',
  'Total net outcome*',
  'Mean net outcome per session*',
  'SD of session net outcome*',
  'Mean stake*',
  'SD of stake*',
  'Total games played*',
  'Mean no. games played per session*',
  'SD of games played per session*',
  'Max turnover in a single day*',
  'Max no. games played in a single day*',
  'Number of active betting days*',
  'Mean no. games per active day*',
  'Mean monthly bet freq (days)*',
  'Mean monthly turnover*'
)


# Print summary statistics
print(summary(master_dataset_study1_SSVS %>% select(all_of(predictors))))
      Age            Gender        IRSAD score     Days since registration
 Min.   :18.00   Min.   :0.0000   Min.   : 780.9   Min.   : 0.40          
 1st Qu.:43.00   1st Qu.:0.0000   1st Qu.: 922.4   1st Qu.: 5.00          
 Median :55.00   Median :0.0000   Median : 970.8   Median : 9.40          
 Mean   :53.99   Mean   :0.4779   Mean   : 959.0   Mean   :10.57          
 3rd Qu.:65.00   3rd Qu.:1.0000   3rd Qu.: 997.9   3rd Qu.:14.03          
 Max.   :95.00   Max.   :1.0000   Max.   :1176.9   Max.   :59.70          
 Days since last bet No. of diff. game types* Total sessions played*
 Min.   : 0.00       Min.   :  1.00           Min.   :   1.0        
 1st Qu.: 3.00       1st Qu.:  7.00           1st Qu.:  13.0        
 Median :10.00       Median : 21.00           Median :  56.0        
 Mean   :14.61       Mean   : 35.36           Mean   : 182.5        
 3rd Qu.:22.00       3rd Qu.: 49.00           3rd Qu.: 182.0        
 Max.   :67.00       Max.   :283.00           Max.   :5500.0        
 Total time on device* Mean time on device per session* SD of time on device*
 Min.   :    0.0       Min.   :  0.000                  Min.   :  0.000      
 1st Qu.:  172.3       1st Qu.:  7.944                  1st Qu.:  6.794      
 Median :  811.2       Median : 12.840                  Median : 12.856      
 Mean   : 2555.8       Mean   : 18.019                  Mean   : 18.053      
 3rd Qu.: 2696.0       3rd Qu.: 21.377                  3rd Qu.: 22.914      
 Max.   :49777.8       Max.   :261.583                  Max.   :206.664      
 Total turnover*     Mean turnover per session* SD of turnover*  
 Min.   :      0.4   Min.   :    0.20           Min.   :    0.0  
 1st Qu.:   1909.7   1st Qu.:   87.53           1st Qu.:   78.4  
 Median :  11971.2   Median :  181.78           Median :  196.1  
 Mean   :  54122.7   Mean   :  413.79           Mean   :  479.6  
 3rd Qu.:  52419.7   3rd Qu.:  403.25           3rd Qu.:  488.4  
 Max.   :2850427.5   Max.   :29571.30           Max.   :26734.1  
 Total net outcome* Mean net outcome per session* SD of session net outcome*
 Min.   :-17910     Min.   :-650.000              Min.   :   0.0            
 1st Qu.:   110     1st Qu.:   5.664              1st Qu.:  48.4            
 Median :  1021     Median :  19.008              Median : 113.5            
 Mean   :  4121     Mean   :  35.180              Mean   : 183.6            
 3rd Qu.:  4502     3rd Qu.:  45.336              3rd Qu.: 229.7            
 Max.   :107738     Max.   :4999.980              Max.   :5164.6            
  Mean stake*        SD of stake*     Total games played*
 Min.   : 0.01771   Min.   : 0.0000   Min.   :     0     
 1st Qu.: 0.78506   1st Qu.: 0.2504   1st Qu.:  1471     
 Median : 1.42699   Median : 0.6212   Median :  7216     
 Mean   : 3.66419   Mean   : 2.2513   Mean   : 26677     
 3rd Qu.: 2.78693   3rd Qu.: 1.4415   3rd Qu.: 26613     
 Max.   :96.99538   Max.   :50.1473   Max.   :621406     
 Mean no. games played per session* SD of games played per session*
 Min.   :   1.00                    Min.   :   0.00                
 1st Qu.:  71.95                    1st Qu.:  57.39                
 Median : 118.17                    Median : 110.96                
 Mean   : 167.84                    Mean   : 164.24                
 3rd Qu.: 200.62                    3rd Qu.: 202.78                
 Max.   :3524.00                    Max.   :1748.12                
 Max turnover in a single day* Max no. games played in a single day*
 Min.   :     0.3              Min.   :    0                        
 1st Qu.:   872.0              1st Qu.:  670                        
 Median :  2904.7              Median : 1742                        
 Mean   :  6356.1              Mean   : 2264                        
 3rd Qu.:  7898.9              3rd Qu.: 3265                        
 Max.   :165993.5              Max.   :12945                        
 Number of active betting days* Mean no. games per active day*
 Min.   :  1.00                 Min.   :   0.0                
 1st Qu.:  3.00                 1st Qu.: 357.9                
 Median :  9.00                 Median : 841.0                
 Mean   : 19.17                 Mean   :1074.2                
 3rd Qu.: 24.00                 3rd Qu.:1526.2                
 Max.   :180.00                 Max.   :6526.0                
 Mean monthly bet freq (days)* Mean monthly turnover*
 Min.   : 0.1667               Min.   :     0.1      
 1st Qu.: 0.5000               1st Qu.:   318.3      
 Median : 1.5000               Median :  1995.2      
 Mean   : 3.1955               Mean   :  9020.4      
 3rd Qu.: 4.0000               3rd Qu.:  8736.6      
 Max.   :30.0000               Max.   :475071.2      
Show code
# master_dataset_study1_SSVS %>%
#   dplyr::count(Gender)


# master_dataset_study1_SSVS %>%
  # dplyr::count(STATUS)

# # # Perform SSVS analysis
# SSSV_results_Study1 <- ssvs(data = master_dataset_study1_SSVS,
#                             x = predictors, y = outcome,
#                             inprob = 0.95,
#                             continuous = FALSE,
#                             progress = TRUE)
# 
# 
# # Save file to avoid having to re-run every time
# saveRDS(SSSV_results_Study1, "SSSV_results_Study1")

# Load in:
SSSV_results_Study1 <- read_rds("SSSV_results_Study1")

Now that we have the outcomes from the variable selection process, let's present them in a table and visually in a plot:

Show code
summary(SSSV_results_Study1, 
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  as.data.frame() %>%
gt()%>%
  tab_header(md("**Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 1**"), 
             subtitle = NULL)
Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 1
Variable MIP Avg Beta Avg Nonzero Beta Lower CI (95%) Upper CI (95%)
Days since last bet 0.9924 0.3421 0.3447 0.1818 0.4982
No. of diff. game types* 0.9296 0.2820 0.3033 0.0000 0.4275
Max turnover in a single day* 0.6883 0.1669 0.2425 0.0000 0.3688
Days since registration 0.5135 0.1249 0.2432 0.0000 0.3615
Age 0.4119 -0.1062 -0.2577 -0.3740 0.0000
Number of active betting days* 0.2653 0.1062 0.4002 -57.5381 66.9408
Mean monthly bet freq (days)* 0.2631 -0.0747 -0.2840 -66.8354 57.5835
Mean monthly turnover* 0.2529 -0.1619 -0.6402 -62.8708 60.2791
Total turnover* 0.2482 0.1627 0.6557 -59.7995 63.1582
Max no. games played in a single day* 0.1089 0.0254 0.2330 0.0000 0.2560
Mean turnover per session* 0.0894 -0.0241 -0.2691 -0.2310 0.0014
SD of session net outcome* 0.0861 0.0140 0.1624 0.0000 0.1588
Mean stake* 0.0739 -0.0148 -0.1996 -0.1399 0.0000
Total time on device* 0.0732 -0.0216 -0.2953 -0.1889 0.0004
Total games played* 0.0676 -0.0144 -0.2131 -0.1512 0.0002
Gender 0.0662 -0.0091 -0.1368 -0.0891 0.0000
SD of stake* 0.0516 -0.0067 -0.1303 0.0000 0.0095
SD of turnover* 0.0430 0.0051 0.1182 0.0000 0.0000
Mean no. games played per session* 0.0305 -0.0017 -0.0574 0.0000 0.0000
Mean no. games per active day* 0.0287 -0.0027 -0.0934 0.0000 0.0000
Total net outcome* 0.0285 0.0021 0.0736 0.0000 0.0000
Total sessions played* 0.0261 -0.0023 -0.0870 0.0000 0.0000
SD of time on device* 0.0248 0.0005 0.0182 0.0000 0.0000
IRSAD score 0.0243 0.0016 0.0667 0.0000 0.0000
Mean time on device per session* 0.0235 -0.0023 -0.1001 0.0000 0.0000
SD of games played per session* 0.0211 0.0015 0.0701 0.0000 0.0000
Mean net outcome per session* 0.0130 0.0001 0.0094 0.0000 0.0000
Show code
plot(SSSV_results_Study1,
     threshold = 0.4,
     title = "") +
  theme_classic() +
   scale_y_continuous(breaks = c(0, 0.25, 0.50, 0.75, 1), labels = c(0, 0.25, 0.50, 0.75, 1)) +
  theme(text=element_text(family="Poppins"),
      axis.text.x = element_text(angle = 45, hjust = 1),
    plot.margin = margin(t = 10, r = 10, b = 10, l = 20)
      # axis.text.y = element_text(angle = 0, hjust = .5)
      ) +
  labs(title = "",
       x = "",
       y = "Inclusion Probability") +
  annotate("text", x = 18, y = 0.45, label = "MIP threshold", size =3, color = "black", fontface = "plain") +
  theme(legend.position = "none") 

Show code
ggsave("Figures/SSVS results plot Study 1.pdf",
       width = 8.5,
       height = 4,
       units = c("in")
)

Run model

Now let’s run the logistic regression model with our selection of predictor variables identified via SSVS:

Show code
Study1_logistic_reg_data<- master_dataset_study1 %>%
 mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1)) %>%
  filter(!is.na(memberyears))

# Run model
Study1_logistic_regression<- glm(STATUS ~  days_since_lastplay + 
                gd_distinct_6s +
                max_turnover6sday +
                memberyears +
                age_westhq,
      data = Study1_logistic_reg_data, family = "binomial") 

summary(Study1_logistic_regression) 

Call:
glm(formula = STATUS ~ days_since_lastplay + gd_distinct_6s + 
    max_turnover6sday + memberyears + age_westhq, family = "binomial", 
    data = Study1_logistic_reg_data)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -3.395e+00  3.239e-01 -10.481  < 2e-16 ***
days_since_lastplay  2.532e-02  4.973e-03   5.091 3.57e-07 ***
gd_distinct_6s       7.284e-03  1.715e-03   4.248 2.16e-05 ***
max_turnover6sday    2.192e-05  5.775e-06   3.796 0.000147 ***
memberyears          3.451e-02  1.012e-02   3.412 0.000646 ***
age_westhq          -1.773e-02  5.889e-03  -3.011 0.002605 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1484.9  on 4116  degrees of freedom
Residual deviance: 1426.7  on 4111  degrees of freedom
AIC: 1438.7

Number of Fisher Scoring iterations: 6

Check variance inflation factors (VIFs) for evidence of multicollinearity:

Show code
vif(Study1_logistic_regression)
days_since_lastplay      gd_distinct_6s   max_turnover6sday         memberyears 
           1.229496            1.232278            1.076284            1.317812 
         age_westhq 
           1.261586 

All VIFs are below 1.4, well under common rule-of-thumb cut-offs (e.g., 5 or 10), so there are no issues with multicollinearity.

Perform other checks of model performance:

Show code
# model1_performance_metrics <- check_model(Study1_logistic_regression)
# 
# ggsave("Figures/model1_performance_metrics.pdf",
#        width = 7,
#        height = 9,
#        units = c("in")
# )

Compute confidence intervals for the coefficient estimates:

Show code
summarisedCIs_Study1_logistic_regression <- cbind(coef(Study1_logistic_regression), confint(Study1_logistic_regression))

summarisedCIs_Study1_logistic_regression %>% as.data.frame() %>%
    rownames_to_column() %>%
  rename(Coefficient = V1) %>%
  mutate(across(c(Coefficient, `2.5 %`, `97.5 %`), ~round(., 3))) %>%
  arrange(desc(`2.5 %`)) %>%
   gt() %>%
  tab_header(md("**Regression coefficients with 95% CIs: Study 1**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Regression coefficients with 95% CIs: Study 1
Coefficient Lower CI: 2.5% Upper CI: 97.5%
days_since_lastplay 0.025 0.015 0.035
memberyears 0.035 0.014 0.054
gd_distinct_6s 0.007 0.004 0.011
max_turnover6sday 0.000 0.000 0.000
age_westhq -0.018 -0.029 -0.006
(Intercept) -3.395 -4.042 -2.772

Exponentiate the coefficients so they can be interpreted as odds ratios (e.g., the OR of about 1.026 for days since last bet below means each additional day since a customer's last bet multiplies the odds of participating by roughly 1.026):

Show code
summarisedORs_Study1_logistic_regression <- exp(cbind(OR = coef(Study1_logistic_regression), confint(Study1_logistic_regression)))

summarisedORs_Study1_logistic_regression %>% as.data.frame() %>%
  rownames_to_column() %>%
  mutate(across(c(OR, `2.5 %`, `97.5 %`), ~round(., 4))) %>%
  arrange(desc(OR)) %>%
   gt() %>%
  tab_header(md("**Odds ratios with 95% CIs: Study 2**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Odds ratios with 95% CIs: Study 1
OR Lower CI: 2.5% Upper CI: 97.5%
memberyears 1.0351 1.0145 1.0555
days_since_lastplay 1.0256 1.0155 1.0355
gd_distinct_6s 1.0073 1.0038 1.0106
max_turnover6sday 1.0000 1.0000 1.0000
age_westhq 0.9824 0.9711 0.9938
(Intercept) 0.0336 0.0176 0.0625

Calculate Nagelkerke’s R squared for the model:

Show code
NagelkerkeR2_S1<- NagelkerkeR2(Study1_logistic_regression) %>% 
  as.data.frame() %>%
  print()
     N         R2
1 4117 0.04643133
Show code
# NOTE: R2 is calculated as a proportion here, and so in the paper we have multiplied it by 100 to form a percentage value

The model accounts for approximately 4.64% of the variance in uptake/response.

Calculate overall p-value for model by comparing it with a null model:

Show code
anova(Study1_logistic_regression, 
      update(Study1_logistic_regression, ~1), # update here produces null model for comparison 
      test="Chisq")
Analysis of Deviance Table

Model 1: STATUS ~ days_since_lastplay + gd_distinct_6s + max_turnover6sday + 
    memberyears + age_westhq
Model 2: STATUS ~ 1
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1      4111     1426.7                          
2      4116     1485.0 -5  -58.294 2.736e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show code
# Set seed so that labels are consistent in the plot:
set.seed(200)

ggstatsplot::ggcoefstats(Study1_logistic_regression,
                         digits = 4,
                         stats.label.color = "#00798c",
                         sort = "ascending",
                         vline.args = list(size = 1, color = "black"),
                         # ggstatsplot.layer = FALSE,
                          exclude.intercept = TRUE, # hide the intercept
                         stats.label.args = list(size = 3, direction = "y")) +
  scale_x_continuous(name = "Regression coefficient", limits = c(-0.05, 0.05)
                     # breaks = c(-.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5)
                     ) +
  scale_y_discrete(name = "", labels = c("Age",
                                         "Max turnover\n in a single\n day",
                                         "No. of diff.\n game types",
                                         "Days since\n last bet",
                                         "Days since\n  registration"
                                         )) +
  theme(panel.grid.minor = element_blank(),
        axis.line = element_blank()) +
  theme(axis.text = element_text(color = "black", size = 9, face = "plain"))+
  theme_light(base_family = "Poppins") +
  # theme(plot.margin = unit(c(.1,.1,.3,-3), "cm")) +
  theme(axis.text.y = element_text(color="black", size=12, face="plain", family = "Poppins")) +
  theme(axis.title.x = element_text(color="black", size=12, face ="plain", vjust = 0, family = "Poppins")) 

Show code
# Set height to 7 and width to 5 in portrait mode
ggsave("Figures/Study 1 log regression plot.pdf",
       width = 5,
       height = 7,
       units = c("in")
)

Study 2 Analysis

This study pertains to the analysis of a survey of online gamblers from two sites in Australia.

Load data

Load in master dataset containing all customer details (this step is hidden to protect privacy & anonymity)

Data loaded.

How many people were invited in total:

Show code
nrow(master_dataset_study2_initial)
[1] 24829

Process the data for analysis by doing the following:

  • Remove anybody who didn't bet in the six months prior to survey participation, as this was our criterion for inclusion in the sample
  • Provide a single variable that tells us whether somebody is a participant or non-participant, depending on whether they completed the survey up to the PGSI or not
  • Recode and rename variables where required for later analysis and visualisation.
Show code
master_dataset_study2 <-  master_dataset_study2_initial %>%
  as_tibble() %>%
filter(PAS_6M_BET_FREQ_DAYS != 0) %>% # Remove anybody who didn't bet in the window of interest
  mutate(STATUS = case_when(SURVEY_STATUS_PGSI == "Incomplete" |  SURVEY_STATUS_PGSI == "Not started"~ "Non-participants", 
                            SURVEY_STATUS_PGSI == "Complete"  ~ "Participants")) %>%
mutate(STATUS = fct_relevel(STATUS,
                            "Participants",
                            "Non-participants")) %>%
  # Make numeric:
  mutate(GHM_OVER_PRIORITISE = as.numeric(GHM_OVER_PRIORITISE),
    GHM_STRAIN = as.numeric(GHM_STRAIN),
    GHM_SEVERE = as.numeric(GHM_SEVERE),
    IRSAD_SCORE = as.numeric(IRSAD_SCORE)
    ) %>% 
  # Enforce the order of factors so that they present properly in tables and graphs:
  mutate(PGSI_CATEGORY = factor(PGSI_CATEGORY, levels = c('No risk',
                                                              'Low risk',
                                                              'Moderate risk',
                                                              'High risk'))) 
  # Rename risk detection system summary variable values:
  # mutate(RE_PAS_6M_ANY_TRIGGER = case_when(RE_PAS_6M_ANY_TRIGGER == TRUE ~ "Flagged",
  #                                          RE_PAS_6M_ANY_TRIGGER == FALSE ~ "Not flagged"))

Uptake figures

Let’s first look at how survey uptake varied over the three-week window the survey was open:

Show code
master_dataset_study2_initial %>%
  as_tibble() %>%
  filter(SURVEY_STATUS_START != "Not started") %>%
  group_by(START_DATE) %>%
  summarise(Freq = sum(n())) %>%
  ungroup() %>%
  ggplot(aes(x = START_DATE, y = Freq)) +
    plot_theme +
  geom_bar(stat = "identity", fill = "#00798c") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "",
       x = "Date",
       y = "Number of clients") +
  scale_y_continuous(breaks = seq(0, 1500, by = 500), limits = c(0,1500))

Now compute some figures to tell us the rates of uptake for the survey. We will first do this for the original, unfiltered dataset and then for the refined, filtered dataset with people who didn't bet in the past six months removed.

Original dataset:

Show code
master_dataset_study2_initial %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Unfiltered dataset", subtitle = NULL)
Counts for starting the survey: Unfiltered dataset
Survey status n Percent
Not started 22684 91.360909
Started 2145 8.639091
Show code
master_dataset_study2_initial %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Unfiltered dataset", subtitle = NULL)
Counts for completing the survey up to the PGSI: Unfiltered dataset
Survey status n Percent
Complete 1959 7.889967
Incomplete 186 0.749124
Not started 22684 91.360909
Show code
master_dataset_study2_initial %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Unfiltered dataset", subtitle = NULL)
Counts for completing the survey in full: Unfiltered dataset
Survey status n Percent
Complete 1470 5.920496
Incomplete 675 2.718595
Not started 22684 91.360909

And now for the filtered dataset we will use for most of the analyses:

Show code
master_dataset_study2 %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Filtered dataset", subtitle = NULL)
Counts for starting the survey: Filtered dataset
Survey status n Percent
Not started 22651 91.360465
Started 2142 8.639535
Show code
master_dataset_study2 %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Filtered dataset", subtitle = NULL)
Counts for completing the survey up to the PGSI: Filtered dataset
Survey status n Percent
Complete 1956 7.8893236
Incomplete 186 0.7502118
Not started 22651 91.3604646
Show code
master_dataset_study2 %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Filtered dataset", subtitle = NULL)
Counts for completing the survey in full: Filtered dataset
Survey status n Percent
Complete 1467 5.916993
Incomplete 675 2.722543
Not started 22651 91.360465

We also need to figure out how many people were removed by requiring that everyone had bet in the past six months, and how this related to survey status:

Show code
master_dataset_study2_initial %>%
  as_tibble() %>%
filter(PAS_6M_BET_FREQ_DAYS == 0) %>% # keep anybody who didn't bet in the window of interest
  count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for people removed from dataset because they hadn't gambled in the past 6-months", subtitle = NULL)
Counts for people removed from dataset because they hadn't gambled in the past 6-months
Survey status n
Complete 3
Not started 33

Survey flow

How far through the survey did respondents get? The code below counts the number of non-missing responses to each question, in survey order, and plots the drop-off across survey sections:

Show code
# Define relevant survey questions (excluding non-survey & optional ones)
survey_questions <- c(
  "IPAddress", "GAM_ACCOUNTS", "GAM_ACTIVITIES", "GAM_HARM", "GAMB_SATISFACTION",
  "PGSI_Q1", "PGSI_Q2", "PGSI_Q3", "PGSI_Q4", "PGSI_Q5", "PGSI_Q6", "PGSI_Q7", "PGSI_Q8", "PGSI_Q9",
  "GHM_F1", "GHM_P1", "ATTENTION_CHECK_1", "GHM_P2", "GHM_R1", "GHM_R2", "GHM_PH1", "GHM_PH2",
  "GHM_WS1", "GHM_WS2", "GHM_L1", "LIFE_SATISFACTION", "ACT1_TRACK",
  "ACT2_1", "ACT2_2", "ACT2_3", "ACT2_4", "ACT2_5", "ACT2_6", "ACT2_7",
  "EST_PAST_SPEND", "EST_PAST_WIN", "EST_PAST_NET_RESULT", "EST_PAST_DEPOSIT", "EST_PAST_WITHDRAW",
  "EST_NEXT_SPEND", "EST_NEXT_WIN", "EST_NEXT_NET_RESULT", "EST_NEXT_ACCEPTED_RESULT",
  "EST_NEXT_DEPOSIT", "EST_NEXT_WITHDRAW",
  "SC_1", "SC_2", "SC_3", "SC_4", "SC_5", "SC_6", "SC_7", "SC_8", "SC_9", "SC_10", "SC_11",
  "ATTENTION_CHECK_2", "SC_12", "SC_13",
  "EDUCATION", "EMPLOYMENT", "MARRIED", "NO_CHILDREN", "HOUSEHOLD_INCOME",
  "INCOME_VARIABILITY", "INCOME_PREDICTABILITY",
  "CBA_FW_1", "CBA_FW_2", "CBA_FW_3", "CBA_FW_4", "CBA_FW_5"
)

# Filter dataset to exclude non-starters (only include those who started the survey)
survey_data_filtered <- master_dataset_study2 %>%
  filter(!is.na(IPAddress)) %>%
  select(all_of(survey_questions))

# Count non-NA responses for each question
response_counts <- survey_data_filtered %>%
  summarise(across(everything(), ~sum(!is.na(.)), .names = "count_{.col}")) %>%
  pivot_longer(cols = everything(), names_to = "Question", values_to = "Respondents") %>%
  mutate(Question = gsub("count_", "", Question)) %>%
  arrange(Respondents)  # Arrange in ascending order to maintain left-to-right order


# Define survey sections with their starting positions for text labels:
survey_sections <- data.frame(
  Section = c("Opened survey", "Gambling involvement", "Gambling satisfaction", "PGSI",
              "Gambling Harm Measure", "Life satisfaction", "Activity statements",
              "Estimated expenditure", "Self-Control", "Demographics"),
  Start = c("IPAddress", "GAM_ACTIVITIES", "GAMB_SATISFACTION", "PGSI_Q5",
            "GHM_R2", "LIFE_SATISFACTION", "ACT2_4",
            "EST_NEXT_SPEND", "SC_7", "INCOME_VARIABILITY")
)

# Define section start and end positions for lines:
section_boundaries <- data.frame(
  Position = c( "GAM_ACCOUNTS",  "PGSI_Q1",
               "GHM_F1", "LIFE_SATISFACTION", "ACT1_TRACK",
               "EST_PAST_SPEND", "SC_1"),
  Type = "Start"
)

section_endings <- data.frame(
  Position = c("GAMB_SATISFACTION",
               "SC_13",
               "CBA_FW_5"),
  Type = "End"
)


# Combine for plotting
section_lines <- bind_rows(section_boundaries, section_endings)

# Convert positions to numeric for plotting
section_lines$Position <- as.numeric(factor(section_lines$Position, levels = survey_questions))


section_lines <- section_lines %>% 
mutate(Position = case_when(Position == 71 ~ 71.5,
                             TRUE ~  Position))

survey_sections_dot_colours <- data.frame(
  Section = c(
    rep("Start of survey", 1),
    rep("Gambling involvement", 3),
    rep("Gambling satisfaction", 1),
    rep("PGSI", 9),
    rep("Gambling Harm Measure", 11),
    rep("Life satisfaction", 1),
    rep("Activity statements", 8),
    rep("Estimated expenditure", 11),
    rep("Self-Control", 14),
    rep("Demographics", 12)  # No "End of survey"
  ),
  Question = c(
    # Start of survey
    "IPAddress",
    
    # Gambling involvement
    "GAM_ACTIVITIES", "GAM_HARM", "GAM_ACCOUNTS",
    
    # Gambling satisfaction
    "GAMB_SATISFACTION",
    
    # PGSI
    "PGSI_Q1", "PGSI_Q2", "PGSI_Q3", "PGSI_Q4", "PGSI_Q5", 
    "PGSI_Q6", "PGSI_Q7", "PGSI_Q8", "PGSI_Q9",
    
    # Gambling Harm Measure
    "GHM_F1", "GHM_P1", "ATTENTION_CHECK_1", "GHM_P2", "GHM_R1", 
    "GHM_R2", "GHM_PH1", "GHM_PH2", "GHM_WS1", "GHM_WS2", "GHM_L1",
    
    # Life satisfaction
    "LIFE_SATISFACTION",
    
    # Activity statements
    "ACT1_TRACK", "ACT2_1", "ACT2_2", "ACT2_3", "ACT2_4", "ACT2_5", "ACT2_6","ACT2_7",
    
    # Estimated expenditure
    "EST_PAST_SPEND", "EST_PAST_WIN", "EST_PAST_NET_RESULT", "EST_PAST_DEPOSIT", 
    "EST_PAST_WITHDRAW", "EST_NEXT_SPEND", "EST_NEXT_WIN", "EST_NEXT_NET_RESULT",
    "EST_NEXT_ACCEPTED_RESULT", "EST_NEXT_DEPOSIT", "EST_NEXT_WITHDRAW",
    
    # Self-Control
    "SC_1", "SC_2", "SC_3", "SC_4", "SC_5", "SC_6", "SC_7", "SC_8", "SC_9", 
    "SC_10", "SC_11", "ATTENTION_CHECK_2", "SC_12", "SC_13",
    
    # Demographics
    "EDUCATION", "EMPLOYMENT", "MARRIED", "NO_CHILDREN", "HOUSEHOLD_INCOME",
    "INCOME_VARIABILITY", "INCOME_PREDICTABILITY", "CBA_FW_1", "CBA_FW_2", "CBA_FW_3",
    "CBA_FW_4", "CBA_FW_5"
  )
)

# Ensure response_counts data is structured
response_counts <- response_counts %>%
  mutate(Question = factor(Question, levels = survey_questions)) %>% # Preserve order
  full_join(survey_sections_dot_colours)

# Order questions by survey flow (ensuring left-to-right reading)
response_counts$Question <- factor(response_counts$Question, levels = survey_questions)

# Ensure no zero values before applying log transformation
response_counts <- response_counts %>%
  mutate(Respondents = ifelse(Respondents == 0, 1, Respondents))  # Replaces 0 with 1


# Create a lookup table for section labels
section_labels <- setNames(survey_sections$Section, survey_sections$Start)


# Ensure Section is an ordered factor based on survey flow
response_counts$Section <- factor(response_counts$Section, levels = unique(survey_sections$Section))

# # Define custom gradient: Blue-Green to Blue
# num_sections <- length(unique(response_counts$Section))
# section_colors <- rev(colorRampPalette(c(brewer.pal(9, "GnBu")[6], brewer.pal(9, "Blues")[9]))(num_sections))

# Generate a perceptually uniform color scale using viridis
num_sections <- length(unique(response_counts$Section))
section_colors <- viridis(num_sections, option = "mako")  # Blue-Green to Blue

# Ensure factor levels match
response_counts <- response_counts %>%
  mutate(Section = factor(Section, levels = survey_sections$Section))


# Plot with ordered and reversed colors
main_plot<- ggplot(response_counts, aes(x = Question, y = Respondents, group = 1, color = Section)) +
  geom_line(color = "grey", size = 0.9) +  # Line plot
  
  # Dots colored by ordered survey section (now fading from dark to light)
  geom_point(size = 1.1) +
  
  # Apply ordered and reversed color scale to sections
  # scale_color_manual(values = setNames(section_colors, levels(response_counts$Section))) +
  scale_color_manual(values = setNames(section_colors, levels(response_counts$Section))) +
  # Add dashed vertical lines for section boundaries
  geom_vline(data = section_lines, aes(xintercept = Position - 0.5),
             color = "grey50", linetype = "dashed", size = 0.5) +
  
  # Replace x-axis labels with section labels at survey start points
  scale_x_discrete(labels = function(x) ifelse(x %in% names(section_labels), section_labels[x], "")) +

  # Add curved annotation line
  geom_curve(aes(x = 17.5, y = 2420, xend = 14.5, yend = 2420), 
             curvature = 0, arrow = arrow(length = unit(0.03, "npc"), type = "closed"),
             color = "black", size = 0.3) +
  
  # Add text labels
  annotate("text", x = 19.3, y = 2550, label = "Pre-defined\ncompletion\npoint", 
           size = 2, fontface = "plain", family = "Poppins") +
    annotate("text", x = 70, y = 2350, label = "End of survey", 
           angle = 90, size = 2, fontface = "plain", family = "Poppins") +
  
  labs(
    y = "Number of Respondents\n(log-10 scale)"
  ) +
    # Themes:
  plot_theme +
  theme(axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1.03),
        axis.title.y = element_text(vjust = 3),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 14),
        plot.margin = margin(2, 5, 1, 8),
        legend.position = "none")  


main_plot +
  labs(title = "Survey 2 flow with natural y-axis scale",
       y= "")

Show code
main_plot <- main_plot +   scale_y_log10(breaks = c(2000, 3000, 4000), labels = c("2000", "3000", "4000")) +
  labs(title = "Survey 2 flow with log-10 scale")

main_plot

Show code
main_plot <-  main_plot +   
  # Themes:
  plot_theme +
  theme(axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1.03),
        axis.title.y = element_text(vjust = 3),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text = element_text(size = 7),
        axis.title = element_text(size = 8),
        plot.margin = margin(2, 5, 1, 8),
        legend.position = "none")  


ggsave("Figures/Study 2 survey flow.pdf", 
       plot = main_plot, 
       device = cairo_pdf,  # Ensures smooth text rendering
       width = 16,          
       height = 8.5,         
       units = "cm",        
       dpi = 300)           

Compare samples

What are the general demographic and gambling characteristics of those who did and did not take part in the survey?

Before we can do this, we need to create weighted variable summaries to account for the oversampling of the at-risk population:
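As a quick standalone illustration (toy data, not part of the analysis) of what the weights do: if one group is oversampled, the weights restore each group's share of the population when computing summaries:

# Toy data: the second group is oversampled 4:1; the weights undo the imbalance
toy <- data.frame(score = c(rep(1, 10), rep(5, 40)),
                  w     = c(rep(4, 10), rep(1, 40)))

mean(toy$score) # unweighted mean: 4.2

toy_design <- survey::svydesign(id = ~1, data = toy, weights = ~w)
survey::svymean(~score, toy_design) # weighted mean: 3.0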

Show code
# Create survey design:
survey_design_all <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Create survey design objects for each STATUS group:
survey_design_non_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2, STATUS == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2, STATUS == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Get sample sizes for each group
participants_n <- sum(master_dataset_study2$STATUS == "Participants")
non_participants_n <- sum(master_dataset_study2$STATUS == "Non-participants")
overall_n <- nrow(master_dataset_study2)



# [Yes, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]

# To streamline this substantially, let's develop a function to calculate a weighted summary for continuous variables and put it all together nicely for tabling later on:
# Function to calculate weighted summary for continuous variables
weighted_summary_continuous <- function(variable, design_all, design_non_participants, design_participants, participants_n, non_participants_n, overall_n) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate means and standard deviations
  mean_all <- svymean(var_formula, design_all, na.rm = TRUE)
  mean_non_ps <- svymean(var_formula, design_non_participants, na.rm = TRUE)
  mean_ps <- svymean(var_formula, design_participants, na.rm = TRUE)
  
  sd_all <- sqrt(svyvar(var_formula, design_all, na.rm = TRUE))
  sd_non_ps <- sqrt(svyvar(var_formula, design_non_participants, na.rm = TRUE))
  sd_ps <- sqrt(svyvar(var_formula, design_participants, na.rm = TRUE))
  
  # Calculate weighted medians
  median_all <- svyquantile(var_formula, design_all, 0.5, na.rm = TRUE)
  median_non_ps <- svyquantile(var_formula, design_non_participants, 0.5, na.rm = TRUE)
  median_ps <- svyquantile(var_formula, design_participants, 0.5, na.rm = TRUE)
   
  # Format for table
  summary_df <- tibble(
    Variable = variable,
    `Overall` = paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),
    `Participants` = paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),
    `Non-participants` = paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2))
  ) %>%
  # Compute effect size
  mutate(
    pooled_sd = sqrt(((participants_n - 1) * sd_ps^2 + (non_participants_n - 1) * sd_non_ps^2) /
                     (participants_n + non_participants_n - 2)),
    cohen_d = (coef(mean_ps) - coef(mean_non_ps)) / pooled_sd,
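    # SE via the standard large-sample approximation for the sampling variance of Cohen's d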
    SE = sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) + (cohen_d^2 / (2 * (participants_n + non_participants_n)))),
    CI_lower = cohen_d - 1.96 * SE,
    CI_upper = cohen_d + 1.96 * SE,
    Effect_size = paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]")
  ) %>%
  select(Variable, Overall, Participants, `Non-participants`, Effect_size)
  
  return(summary_df)
}

# Apply function to continuous variables
continuous_vars <- c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", 
                     "PAS_6M_BET_FREQ_DAYS", "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", 
                     "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME", "PAS_6M_PERCENT_UNSOCIABLE_BETS", 
                     "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY", "PAS_6M_MEAN_DEPOSIT_AMOUNT", 
                     "PAS_6M_PERCENT_DECLINED_DEPOSITS", "PAS_6M_PERCENT_CANCEL_WITHDRAWS")

all_continuous_summaries <- map_dfr(continuous_vars, 
                                    ~weighted_summary_continuous(.x, survey_design_all, 
                                                                 survey_design_non_participants, survey_design_participants, 
                                                                 participants_n, non_participants_n, overall_n))


# Categorical variables:

# Let's also develop a function to calculate and format percentages for categorical variables that we can join to the above outcomes for tabling. 

# Note: this function computes Cramér's V effect size, which is typically used for categorical variables with more than two levels, but it also works for variables with only two levels (in which case it's equivalent to the phi coefficient).

# UPDATE: Let's just compute odds ratios for the two-level variables, based on a reviewer suggestion (thanks ChatGPT for your help with this one!):

categorical_summary <- function(variable, design_all, design_non_participants, design_participants) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate proportions for overall sample
  prop_table_all <- svytable(var_formula, design_all)
  props_all <- prop.table(prop_table_all) %>%
    as.data.frame() %>%
    mutate(Overall = Freq * 100) %>%
    select(Variable = !!sym(variable), Overall)
  
  # Calculate proportions for non-participants
  prop_table_non_ps <- svytable(var_formula, design_non_participants)
  props_non_ps <- prop.table(prop_table_non_ps) %>%
    as.data.frame() %>%
    mutate('Non-participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Non-participants')
  
  # Calculate proportions for participants
  prop_table_ps <- svytable(var_formula, design_participants)
  props_ps <- prop.table(prop_table_ps) %>%
    as.data.frame() %>%
    mutate('Participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Participants')
  
  # Combine results and format
  summary_df <- props_ps %>%
    left_join(props_non_ps, by = "Variable") %>%
    left_join(props_all, by = "Variable") %>%
    mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%
    replace(is.na(.), "")

  # Determine if variable is binary (i.e., only two levels)
  unique_values <- length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))
  
  if (unique_values == 2 & variable != "GENDER") {
    # Ensure STATUS is treated as a factor with "Non-participants" as reference
    design_all$variables$STATUS <- relevel(as.factor(design_all$variables$STATUS), ref = "Non-participants")

    # Compute Odds Ratio (OR) using survey-weighted logistic regression
    logistic_model <- tryCatch(
      svyglm(as.formula(paste0(variable, " ~ STATUS")), 
             design = design_all, family = quasibinomial()),
      error = function(e) NULL
    )

    if (!is.null(logistic_model)) {
      coef_name <- grep("STATUS", names(coef(logistic_model)), value = TRUE)
      if (length(coef_name) > 0) {
        OR <- exp(coef(logistic_model)[coef_name])
        CI <- exp(confint(logistic_model)[coef_name, ])
        effect_size_text <- paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]")
      } else {
        effect_size_text <- "NA"
      }
    } else {
      effect_size_text <- "NA"
    }

  } else {
    # Compute Cramér’s V for categorical variables with more than two levels
    combined_table <- as.table(rbind(prop_table_non_ps, prop_table_ps))
    V <- cramersV(combined_table)
    V_rounded <- as.matrix(V$output) %>% t() %>% as.data.frame() %>%
      mutate(cramersV = as.numeric(cramersV)) %>%
      mutate(cramersV = round(cramersV, 2))
    
    CI <- confIntV(combined_table)
    CI_info <- CI$intermediate
    CIs <- CI_info$fisherZ.ci %>%
      t() %>%
      as.data.frame() %>%
      mutate_if(is.numeric, round, 2) %>%
      unite('Effect_size', 1, 2, sep = ", ")
    
    effect_size_text <- paste0(V_rounded, " [", CIs, "]")
  }

  # Add effect size to the table
  header_row <- data.frame(Variable = variable, Overall = "", Participants = "", 
                           `Non-participants` = "", `Effect_size` = effect_size_text)

  summary_df <- bind_rows(header_row, summary_df) %>%
    as_tibble() %>%
    select(Variable, Overall, Participants, `Non-participants`, `Effect_size`)

  return(summary_df)
}



# Apply function to categorical variables
categorical_vars <- c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")
all_categorical_summaries <- map_dfr(categorical_vars, 
                                     ~categorical_summary(.x, survey_design_all, 
                                                          
                                                          survey_design_non_participants, survey_design_participants))

all_categorical_summaries <- all_categorical_summaries %>%
# Recode for consistency across summaries:
mutate(Variable = case_when(Variable==1
                   ~ "Yes", 
                   Variable==0
                   ~ "No", 
                   Variable==TRUE
                   ~ "Yes", 
                   Variable==FALSE
                   ~ "No",
                   TRUE ~ Variable))

# Combine continuous and categorical summaries
weighted_comparison_summary_data <- bind_rows(all_continuous_summaries, all_categorical_summaries) %>%
  rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols = "Participants") %>%
  rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols = "Non-participants") %>%
  rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols = "Overall")

# Order data for table (as we have duplicate names in the Variable column, this is a pain in the butt that needs to be done by separating and rejoining the variables in the right order)
gender_summary_tabled<- weighted_comparison_summary_data[15:18,1:5]
age_summary_tabled<- weighted_comparison_summary_data[1,1:5]
summary_table_extracted_vars_missing<- weighted_comparison_summary_data %>%
  filter(Variable !="AGE",
    Variable !="GENDER",
         Variable !="Male",
         Variable !="Female",
         Variable !="Unknown"
         )

# Join everything back together and provide some nice names for our variables:
weighted_comparison_summary_data_ordered <- bind_rows(age_summary_tabled,
          gender_summary_tabled,
          summary_table_extracted_vars_missing) %>%
   mutate(Variable = dplyr::recode(Variable,
    `AGE` = "Age",
    "GENDER" = "Gender",
    "IRSAD_SCORE" = "IRSAD score",
    "DAYS_REGISTERED" = "Days since registered",
    "DAYS_SINCE_LAST_BET" = "No. days since last bet",
    "PAS_6M_BET_FREQ_DAYS" = "No. betting days",
    "PAS_6M_BET_INTENSITY" = "Mean bets per active day",
    "PAS_6M_BET_MEAN_STAKE" = "Mean stake",
    "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" = "Mean net outcome per month",
    "PAS_6M_PERCENT_UNSOCIABLE_BETS" = "Percentage of bets midnight - 6AM",
    "PAS_6M_TOTAL_DEPOSIT_FREQ" = "No. deposits",
    "PAS_6M_DEPOSIT_INTENSITY" = "Mean no. deposits per active day",
    "PAS_6M_MEAN_DEPOSIT_AMOUNT" = "Mean deposit amount",
    "PAS_6M_PERCENT_DECLINED_DEPOSITS" = "Percentage of deposits declined",
    "PAS_6M_PERCENT_CANCEL_WITHDRAWS" = "Percentage of withdrawals cancelled",
    "EVER_USED_ANY_TOOL" = "Ever used a CPT",
    "ACTIVE_DEPOSIT_LIMIT" = "Active deposit limit",
    "RE_PAS_6M_ANY_TRIGGER" = "Flagged as 'at-risk' ≥ 1 times"
  ))


# Create final table:
weighted_comparison_summary_data_ordered %>%
  rename("Effect size" = Effect_size
         ) %>%
  gt() %>%
   tab_header(title = "Table 3.   Study 2 (past 6-month bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics") %>%
  tab_options(
    data_row.padding = px(1.5)
  ) %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramer's V [95% CIs])",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Table 3. Study 2 (past 6-month bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics
Variable Overall (N = 24793)1 Participants (N = 1956)1 Non-participants (N = 22837) Effect size2
Age 38.07 (14.73), 35 43.57 (15.51), 43 37.66 (14.59), 34 0.4 [0.36, 0.45]
Gender NA 0.03 [0.03, 0.04]
Female 17.23% 13.2% 17.53% NA
Male 81.96% 85.32% 81.71% NA
Unknown 0.81% 1.47% 0.76% NA
IRSAD score 1004.88 (75.63), 999.58 1005.21 (75.52), 1000.57 1004.86 (75.64), 999.58 0 [-0.04, 0.05]
Days since registered 1131.27 (904.71), 905 1337.27 (967.47), 1168 1115.93 (898), 892 0.24 [0.2, 0.29]
No. days since last bet 49.13 (46.06), 43 27.47 (39.71), 5 50.74 (46.09), 47 -0.51 [-0.56, -0.46]
No. betting days 22.45 (31.97), 9 39.59 (41.78), 25 21.17 (30.73), 8 0.58 [0.53, 0.63]
Mean bets per active day 7.36 (12.41), 4 10.34 (17.22), 5.59 7.14 (11.95), 4 0.26 [0.21, 0.3]
Mean stake 31.48 (163.93), 10 26.35 (107.14), 9.12 31.86 (167.38), 10 -0.03 [-0.08, 0.01]
Mean net outcome per month -131.9 (1864.76), -12.5 -242.26 (1191.8), -25 -123.68 (1905.15), -12.11 -0.06 [-0.11, -0.02]
Percentage of bets midnight - 6AM 4.03 (11.54), 0 4.19 (10.3), 0 4.02 (11.63), 0 0.02 [-0.03, 0.06]
No. deposits 31.6 (111.44), 4 64.86 (175.22), 9 29.12 (104.74), 4 0.32 [0.27, 0.37]
Mean no. deposits per active day 1.03 (1.28), 0.81 1.2 (1.69), 0.8 1.02 (1.24), 0.82 0.14 [0.09, 0.19]
Mean deposit amount 77.52 (321.94), 31.82 72.83 (202.78), 37.5 77.87 (329.1), 31.21 -0.02 [-0.06, 0.03]
Percentage of deposits declined 6.59 (15.05), 0 7.08 (15.25), 0 6.55 (15.03), 0 0.03 [-0.01, 0.08]
Percentage of withdrawals cancelled 1.81 (9.25), 0 3.06 (11.82), 0 1.72 (9.02), 0 0.15 [0.1, 0.19]
Ever used a CPT NA 1.47 [1.28, 1.69]
No 86.93% 82.33% 87.27% NA
Yes 13.07% 17.67% 12.73% NA
Active deposit limit NA 1.23 [1.03, 1.47]
No 91.58% 89.96% 91.7% NA
Yes 8.42% 10.04% 8.3% NA
Flagged as 'at-risk' ≥ 1 times NA 1.91 [1.73, 2.12]
No 97.51% 95.61% 97.65% NA
Yes 2.49% 4.39% 2.35% NA
1 Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
2 Continuous variables: Cohen's d [95% confidence intervals (CIs)]; categorical variables with two levels: odds ratios [95% CIs]; categorical variables with three levels (i.e., gender): Cramér's V [95% CIs]

PGSI

Also check the PGSI scores of the survey sample for inclusion in the table later on. For this, we will remove anyone who didn't pass the attention checks, to get the most reliable estimate:

Show code
survey_design_participants_filtered <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2, STATUS == "Participants" &
                  (ATTENTION_CHECK_1 == "PASSED" | is.na(ATTENTION_CHECK_1)) &
                  (ATTENTION_CHECK_2 == "PASSED" | is.na(ATTENTION_CHECK_2))),
  weights = ~ELIGIBLE_SURVEY_SAMPLE_WEIGHTS
)

# Weighted values:
mean_ps_PGSI <- svymean(~PGSI_TOTAL, survey_design_participants_filtered, na.rm = TRUE)
sd_ps_PGSI <- sqrt(svyvar(~PGSI_TOTAL, survey_design_participants_filtered, na.rm = TRUE))
median_ps_PGSI <- svyquantile(~PGSI_TOTAL, survey_design_participants_filtered, c(0.5), na.rm = TRUE)

# Extract desired values
mean_ps_PGSI_value <- coef(mean_ps_PGSI)
sd_ps_PGSI_value <- sd_ps_PGSI[1]
median_ps_PGSI_value <- coef(median_ps_PGSI) 

# Create sentence summarising results: 
summary_sentence_PGSI_study2 <- sprintf("The mean PGSI total score is %.2f (standard deviation = %.2f) and the median is %.2f.", 
                            mean_ps_PGSI_value, sd_ps_PGSI_value, median_ps_PGSI_value)

summary_sentence_PGSI_study2
[1] "The mean PGSI total score is 3.64 (standard deviation = 4.12) and the median is 2.00."

Non-weighted comparisons

For transparency, we have also computed non-weighted comparisons between groups using the tbl_summary() function, as used to do the same in Study 1 above:

Show code
master_dataset_study2 %>%
  tbl_summary(
    include = c(# Demographic details
      AGE, 
      GENDER, 
      IRSAD_SCORE,
      DAYS_REGISTERED, 
       # Wagering
      DAYS_SINCE_LAST_BET, 
      PAS_6M_BET_FREQ_DAYS,
      PAS_6M_BET_INTENSITY,
      PAS_6M_BET_MEAN_STAKE,
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
      PAS_6M_PERCENT_UNSOCIABLE_BETS,
      # Transactions
      PAS_6M_TOTAL_DEPOSIT_FREQ,
      PAS_6M_DEPOSIT_INTENSITY,
      PAS_6M_MEAN_DEPOSIT_AMOUNT,
      PAS_6M_PERCENT_DECLINED_DEPOSITS,
      PAS_6M_PERCENT_CANCEL_WITHDRAWS,
      # Use of consumer protection tools
      EVER_USED_ANY_TOOL,
      ACTIVE_DEPOSIT_LIMIT,
      RE_PAS_6M_ANY_TRIGGER),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous2() ~ c(
      "{median} [{p25} - {p75}]",
      "{mean} ({sd})")),
    by = STATUS,
    label = c(
      AGE ~ "Age",
      GENDER ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      DAYS_REGISTERED ~ "No. of days since accounts registered",
      DAYS_SINCE_LAST_BET ~ "No. days since last bet",
      PAS_6M_BET_FREQ_DAYS ~ "No. betting days",
      PAS_6M_BET_INTENSITY ~ "Mean bets per active day",
      PAS_6M_BET_MEAN_STAKE ~ "Mean stake",
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~ "Mean net outcome per month",
      PAS_6M_PERCENT_UNSOCIABLE_BETS ~ "Percentage of bets placed between midnight & 6AM",
      PAS_6M_TOTAL_DEPOSIT_FREQ ~ "No. deposits",
      PAS_6M_DEPOSIT_INTENSITY ~ "Mean no. deposits per active betting day",
      PAS_6M_MEAN_DEPOSIT_AMOUNT ~ "Mean deposit amount",
      PAS_6M_PERCENT_DECLINED_DEPOSITS ~ "Percentage of deposits declined",
      PAS_6M_PERCENT_CANCEL_WITHDRAWS ~ "Percentage of withdrawals cancelled",
      EVER_USED_ANY_TOOL ~ "Ever used a SG",
      ACTIVE_DEPOSIT_LIMIT ~ "Active deposit limit",
      RE_PAS_6M_ANY_TRIGGER ~ "Flagged as 'at-risk' ≥ 1 times")
  ) %>%
  add_difference(
    list(all_continuous() ~ "cohens_d",  # For continuous variables
    all_categorical() ~ "smd")       # For categorical variables, you can use standardized mean difference or any other suitable method
  ) %>%
  modify_header(label = "**Variable**") %>% 
  modify_caption("**Table 3. Comparison of participants and non-participants: Demographic and gambling characteristics**") %>%
  bold_labels() %>%
  as_gt() %>%
  tab_options(data_row.padding = px(1.5))
Table 3. Comparison of participants and non-participants: Demographic and gambling characteristics
Variable Participants, N = 1,9561 Non-participants, N = 22,8371 Difference2 95% CI2,3
Age

0.33 0.28, 0.37
    Median [IQR] 40 [30 - 54] 34 [26 - 47]

    Mean (SD) 42 (15) 38 (14)

Gender

0.14 0.10, 0.19
    Female 221 (11%) 3,570 (16%)

    Male 1,701 (87%) 19,054 (83%)

    Unknown 34 (1.7%) 213 (0.9%)

IRSAD score

-0.01 -0.05, 0.04
    Median [IQR] 1,001 [954 - 1,060] 1,001 [954 - 1,061]

    Mean (SD) 1,006 (75) 1,006 (76)

    Unknown 1 191

No. of days since accounts registered

0.24 0.20, 0.29
    Median [IQR] 1,188 [494 - 2,151] 927 [377 - 1,822]

    Mean (SD) 1,380 (992) 1,155 (915)

No. days since last bet

-0.52 -0.57, -0.47
    Median [IQR] 3 [2 - 24] 31 [4 - 66]

    Mean (SD) 22 (36) 45 (46)

No. betting days

0.63 0.58, 0.68
    Median [IQR] 38 [12 - 88] 12 [3 - 40]

    Mean (SD) 54 (49) 29 (38)

Mean bets per active day

0.30 0.25, 0.34
    Median [IQR] 7 [3 - 16] 4 [2 - 10]

    Mean (SD) 14 (23) 9 (17)

Mean stake

-0.04 -0.08, 0.01
    Median [IQR] 15 [6 - 45] 13 [6 - 38]

    Mean (SD) 68 (257) 83 (419)

Mean net outcome per month

-0.07 -0.11, -0.02
    Median [IQR] -65 [-710 - -4] -17 [-134 - -1]

    Mean (SD) -870 (2,846) -529 (5,139)

Percentage of bets placed between midnight & 6AM

0.04 -0.01, 0.08
    Median [IQR] 1 [0 - 5] 0 [0 - 3]

    Mean (SD) 5 (10) 5 (12)

No. deposits

0.43 0.39, 0.48
    Median [IQR] 26 [4 - 159] 6 [2 - 35]

    Mean (SD) 162 (354) 66 (206)

Mean no. deposits per active betting day

0.32 0.27, 0.36
    Median [IQR] 1.08 [0.39 - 2.49] 1.00 [0.33 - 1.60]

    Mean (SD) 1.99 (2.81) 1.36 (1.90)

Mean deposit amount

-0.03 -0.08, 0.02
    Median [IQR] 50 [22 - 119] 42 [15 - 101]

    Mean (SD) 148 (481) 173 (859)

Percentage of deposits declined

0.04 -0.01, 0.09
    Median [IQR] 1 [0 - 8] 0 [0 - 7]

    Mean (SD) 7 (14) 7 (15)

Percentage of withdrawals cancelled

0.21 0.16, 0.25
    Median [IQR] 0 [0 - 0] 0 [0 - 0]

    Mean (SD) 7 (18) 4 (14)

Ever used a CPT 439 (22%) 3,576 (16%) 0.17 0.13, 0.22
Active deposit limit 201 (10%) 2,059 (9.0%) 0.04 0.00, 0.09
Flagged as 'at-risk' ≥ 1 times 588 (30%) 4,202 (18%) 0.27 0.23, 0.32
1 n (%)
2 Cohen’s D; Standardized Mean Difference
3 CI = Confidence Interval

Sensitivity check comparison

How does the group comparison differ if we change the criterion for survey completion? Reviewers didn’t ask for this change specifically, but some suggestions led us to conduct these additional sensitivity checks.

For this, we will redefine participants as anyone who fully completed the survey, and non-participants as anyone who either didn't start the survey or didn't complete it in full. These will be weighted comparisons.

Show code
# Check coding of variable that tells us whether somebody fully completed the survey or not
master_dataset_study2 %>%
  count(STATUS,
           SURVEY_STATUS_FULL)
# A tibble: 4 × 3
  STATUS           SURVEY_STATUS_FULL     n
  <fct>            <chr>              <int>
1 Participants     Complete            1467
2 Participants     Incomplete           489
3 Non-participants Incomplete           186
4 Non-participants Not started        22651
Show code
master_dataset_study2_sensitivity_check   <- master_dataset_study2 %>%
  mutate(SURVEY_STATUS_FULL_BINARY = case_when(SURVEY_STATUS_FULL == "Incomplete" ~ "Non-participants",
                                               SURVEY_STATUS_FULL == "Not started" ~ "Non-participants",
         TRUE ~ "Participants")) %>%
 mutate(SURVEY_STATUS_FULL_BINARY = fct_relevel(SURVEY_STATUS_FULL_BINARY,
                            "Participants",
                            "Non-participants")) # fix order of presentation

# Check recoding has worked:
master_dataset_study2_sensitivity_check %>%
  count(STATUS,
           SURVEY_STATUS_FULL_BINARY)
# A tibble: 3 × 3
  STATUS           SURVEY_STATUS_FULL_BINARY     n
  <fct>            <fct>                     <int>
1 Participants     Participants               1467
2 Participants     Non-participants            489
3 Non-participants Non-participants          22837
Show code
# Update survey designs:
survey_design_all <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_sensitivity_check,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Create survey design objects for each STATUS group:
survey_design_non_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Get sample sizes for each group
participants_n <- sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY == "Participants")
non_participants_n <- sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY == "Non-participants")
overall_n <- nrow(master_dataset_study2_sensitivity_check)




# Produce updated table
# [Again, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]

# To streamline this substantially, let's develop a function to calculate a weighted summary for continuous variables and put it all together nicely for tabling later on:
# Function to calculate weighted summary for continuous variables
weighted_summary_continuous <- function(variable, design_all, design_non_participants, design_participants, participants_n, non_participants_n, overall_n) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate means and standard deviations
  mean_all <- svymean(var_formula, design_all, na.rm = TRUE)
  mean_non_ps <- svymean(var_formula, design_non_participants, na.rm = TRUE)
  mean_ps <- svymean(var_formula, design_participants, na.rm = TRUE)
  
  sd_all <- sqrt(svyvar(var_formula, design_all, na.rm = TRUE))
  sd_non_ps <- sqrt(svyvar(var_formula, design_non_participants, na.rm = TRUE))
  sd_ps <- sqrt(svyvar(var_formula, design_participants, na.rm = TRUE))
  
  # Calculate weighted medians
  median_all <- svyquantile(var_formula, design_all, 0.5, na.rm = TRUE)
  median_non_ps <- svyquantile(var_formula, design_non_participants, 0.5, na.rm = TRUE)
  median_ps <- svyquantile(var_formula, design_participants, 0.5, na.rm = TRUE)
   
  # Format for table
  summary_df <- tibble(
    Variable = variable,
    `Overall` = paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),
    `Participants` = paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),
    `Non-participants` = paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2))
  ) %>%
  # Compute effect size
  mutate(
    pooled_sd = sqrt(((participants_n - 1) * sd_ps^2 + (non_participants_n - 1) * sd_non_ps^2) /
                     (participants_n + non_participants_n - 2)),
    cohen_d = (coef(mean_ps) - coef(mean_non_ps)) / pooled_sd,
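    # SE of d below uses the standard large-sample approximation:
    # sqrt((n1 + n2)/(n1 * n2) + d^2 / (2 * (n1 + n2)))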
    SE = sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) + (cohen_d^2 / (2 * (participants_n + non_participants_n)))),
    CI_lower = cohen_d - 1.96 * SE,
    CI_upper = cohen_d + 1.96 * SE,
    Effect_size = paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]")
  ) %>%
  select(Variable, Overall, Participants, `Non-participants`, Effect_size)
  
  return(summary_df)
}

# Apply function to continuous variables
continuous_vars <- c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", 
                     "PAS_6M_BET_FREQ_DAYS", "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", 
                     "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME", "PAS_6M_PERCENT_UNSOCIABLE_BETS", 
                     "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY", "PAS_6M_MEAN_DEPOSIT_AMOUNT", 
                     "PAS_6M_PERCENT_DECLINED_DEPOSITS", "PAS_6M_PERCENT_CANCEL_WITHDRAWS")

all_continuous_summaries <- map_dfr(continuous_vars, 
                                    ~weighted_summary_continuous(.x, survey_design_all, 
                                                                 survey_design_non_participants, survey_design_participants, 
                                                                 participants_n, non_participants_n, overall_n))


# Categorical variables:

# Let's also develop a function to calculate and format percentages for categorical variables that we can join to the above outcomes for tabling. 

# Note: this function computes Cramér's V effect size, which is typically used for categorical variables with more than two levels, but it also works for variables with only two levels (in which case it's equivalent to the phi coefficient).

# UPDATE: Let's just compute odds ratios for the two-level variables based on a reviewer suggestion (thanks ChatGPT for your help with this one!):

categorical_summary <- function(variable, design_all, design_non_participants, design_participants) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate proportions for overall sample
  prop_table_all <- svytable(var_formula, design_all)
  props_all <- prop.table(prop_table_all) %>%
    as.data.frame() %>%
    mutate(Overall = Freq * 100) %>%
    select(Variable = !!sym(variable), Overall)
  
  # Calculate proportions for non-participants
  prop_table_non_ps <- svytable(var_formula, design_non_participants)
  props_non_ps <- prop.table(prop_table_non_ps) %>%
    as.data.frame() %>%
    mutate('Non-participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Non-participants')
  
  # Calculate proportions for participants
  prop_table_ps <- svytable(var_formula, design_participants)
  props_ps <- prop.table(prop_table_ps) %>%
    as.data.frame() %>%
    mutate('Participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Participants')
  
  # Combine results and format
  summary_df <- props_ps %>%
    left_join(props_non_ps, by = "Variable") %>%
    left_join(props_all, by = "Variable") %>%
    mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%
    replace(is.na(.), "")

  # Determine if variable is binary (i.e., only two levels)
  unique_values <- length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))
  
  if (unique_values == 2 & variable != "GENDER") {
    # Ensure STATUS is treated as a factor with "Non-participants" as reference
    design_all$variables$STATUS <- relevel(as.factor(design_all$variables$STATUS), ref = "Non-participants")

    # Compute Odds Ratio (OR) using survey-weighted logistic regression
    logistic_model <- tryCatch(
      svyglm(as.formula(paste0(variable, " ~ STATUS")), 
             design = design_all, family = quasibinomial()),
      error = function(e) NULL
    )

    if (!is.null(logistic_model)) {
      coef_name <- grep("STATUS", names(coef(logistic_model)), value = TRUE)
      if (length(coef_name) > 0) {
        OR <- exp(coef(logistic_model)[coef_name])
        CI <- exp(confint(logistic_model)[coef_name, ])
        effect_size_text <- paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]")
      } else {
        effect_size_text <- "NA"
      }
    } else {
      effect_size_text <- "NA"
    }

  } else {
    # Compute Cramér’s V for categorical variables with more than two levels
    combined_table <- as.table(rbind(prop_table_non_ps, prop_table_ps))
    V <- cramersV(combined_table)
    V_rounded <- as.matrix(V$output) %>% t() %>% as.data.frame() %>%
      mutate(cramersV = as.numeric(cramersV)) %>%
      mutate(cramersV = round(cramersV, 2))
    
    CI <- confIntV(combined_table)
    CI_info <- CI$intermediate
    CIs <- CI_info$fisherZ.ci %>%
      t() %>%
      as.data.frame() %>%
      mutate_if(is.numeric, round, 2) %>%
      unite('Effect_size', 1, 2, sep = ", ")
    
    effect_size_text <- paste0(V_rounded, " [", CIs, "]")
  }

  # Add effect size to the table
  header_row <- data.frame(Variable = variable, Overall = "", Participants = "", 
                           `Non-participants` = "", `Effect_size` = effect_size_text,
                           check.names = FALSE) # Keep the "Non-participants" column name intact so it binds cleanly with summary_df

  summary_df <- bind_rows(header_row, summary_df) %>%
    as_tibble() %>%
    select(Variable, Overall, Participants, `Non-participants`, `Effect_size`)

  return(summary_df)
}



# Apply function to categorical variables
categorical_vars <- c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")
all_categorical_summaries <- map_dfr(categorical_vars, 
                                     ~categorical_summary(.x, survey_design_all, 
                                                          
                                                          survey_design_non_participants, survey_design_participants))

all_categorical_summaries <- all_categorical_summaries %>%
# Recode for consistency across summaries:
mutate(Variable = case_when(Variable==1
                   ~ "Yes", 
                   Variable==0
                   ~ "No", 
                   Variable==TRUE
                   ~ "Yes", 
                   Variable==FALSE
                   ~ "No",
                   TRUE ~ Variable))

# Combine continuous and categorical summaries
weighted_comparison_summary_data <- bind_rows(all_continuous_summaries, all_categorical_summaries) %>%
  rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols = "Participants") %>%
  rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols = "Non-participants") %>%
  rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols = "Overall")

# Order data for table (as we have duplicate names in the Variable column, this is a pain in the butt that needs to be done by separating and rejoining the variables in the right order)
gender_summary_tabled <- weighted_comparison_summary_data[15:18,1:5]
age_summary_tabled <- weighted_comparison_summary_data[1,1:5]
summary_table_extracted_vars_missing<- weighted_comparison_summary_data %>%
  filter(Variable !="AGE",
    Variable !="GENDER",
         Variable !="Male",
         Variable !="Female",
         Variable !="Unknown"
         )

# Join everything back together and provide some nice names for our variables:
weighted_comparison_summary_data_ordered <- bind_rows(age_summary_tabled,
          gender_summary_tabled,
          summary_table_extracted_vars_missing) %>%
   mutate(Variable = dplyr::recode(Variable,
    `AGE` = "Age",
    "GENDER" = "Gender",
    "IRSAD_SCORE" = "IRSAD score",
    "DAYS_REGISTERED" = "Days since registered",
    "DAYS_SINCE_LAST_BET" = "No. days since last bet",
    "PAS_6M_BET_FREQ_DAYS" = "No. betting days",
    "PAS_6M_BET_INTENSITY" = "Mean bets per active day",
    "PAS_6M_BET_MEAN_STAKE" = "Mean stake",
    "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" = "Mean net outcome per month",
    "PAS_6M_PERCENT_UNSOCIABLE_BETS" = "Percentage of bets midnight - 6AM",
    "PAS_6M_TOTAL_DEPOSIT_FREQ" = "No. deposits",
    "PAS_6M_DEPOSIT_INTENSITY" = "Mean no. deposits per active day",
    "PAS_6M_MEAN_DEPOSIT_AMOUNT" = "Mean deposit amount",
    "PAS_6M_PERCENT_DECLINED_DEPOSITS" = "Percentage of deposits declined",
    "PAS_6M_PERCENT_CANCEL_WITHDRAWS" = "Percentage of withdrawals cancelled",
    "EVER_USED_ANY_TOOL" = "Ever used a CPT",
    "ACTIVE_DEPOSIT_LIMIT" = "Active deposit limit",
    "RE_PAS_6M_ANY_TRIGGER" = "Flagged as 'at-risk' ≥ 1 times"
  ))


# Create final table:
weighted_comparison_summary_data_ordered %>%
  rename("Effect size" = Effect_size
         ) %>%
  gt() %>%
   tab_header(title = "Table 3.1   Study 2 (past 6-month bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]") %>%
  tab_options(
    data_row.padding = px(1.5)
  ) %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with three levels (i.e., gender): Cramér's V [95% CIs]",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Table 3.1 Study 2 (past 6-month bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]
Variable Overall (N = 24793)¹ Participants (N = 1467)¹ Non-participants (N = 23326) Effect size²
Age 38.07 (14.73), 35 42.82 (14.78), 41 37.8 (14.69), 34 0.34 [0.29, 0.39]
Gender 0.03 [0.03, 0.04]
Female 17.23% 12.55% 17.5% NA
Male 81.96% 85.99% 81.73% NA
Unknown 0.81% 1.46% 0.77% NA
IRSAD score 1004.88 (75.63), 999.58 1007.88 (75.44), 1003.14 1004.71 (75.64), 998.46 0.04 [-0.01, 0.09]
Days since registered 1131.27 (904.71), 905 1362.54 (967.72), 1179 1118.05 (899.21), 894 0.27 [0.22, 0.32]
No. days since last bet 49.13 (46.06), 43 27.75 (40.04), 5 50.35 (46.08), 45 -0.49 [-0.55, -0.44]
No. betting days 22.45 (31.97), 9 38.78 (41.11), 24 21.51 (31.11), 8 0.54 [0.49, 0.6]
Mean bets per active day 7.36 (12.41), 4 9.73 (16.37), 5.38 7.22 (12.14), 4 0.2 [0.15, 0.25]
Mean stake 31.48 (163.93), 10 25.43 (89.58), 8.89 31.83 (167.18), 10 -0.04 [-0.09, 0.01]
Mean net outcome per month -131.9 (1864.76), -12.5 -204.51 (1078.02), -22.71 -127.75 (1899.84), -12.33 -0.04 [-0.09, 0.01]
Percentage of bets midnight - 6AM 4.03 (11.54), 0 4.36 (10.84), 0 4.01 (11.58), 0 0.03 [-0.02, 0.08]
No. deposits 31.6 (111.44), 4 58.58 (155.16), 9 30.06 (108.21), 4 0.26 [0.2, 0.31]
Mean no. deposits per active day 1.03 (1.28), 0.81 1.13 (1.64), 0.75 1.02 (1.26), 0.82 0.08 [0.03, 0.13]
Mean deposit amount 77.52 (321.94), 31.82 72.69 (208.96), 37.5 77.8 (327.23), 31.25 -0.02 [-0.07, 0.04]
Percentage of deposits declined 6.59 (15.05), 0 6.62 (14.46), 0 6.59 (15.08), 0 0 [-0.05, 0.06]
Percentage of withdrawals cancelled 1.81 (9.25), 0 2.99 (12.02), 0 1.74 (9.06), 0 0.14 [0.08, 0.19]
Ever used a CPT 1.47 [1.28, 1.69]
No 86.93% 82.97% 87.15% NA
Yes 13.07% 17.03% 12.85% NA
Active deposit limit 1.23 [1.03, 1.47]
No 91.58% 90.39% 91.65% NA
Yes 8.42% 9.61% 8.35% NA
Flagged as 'at-risk' ≥ 1 times 1.91 [1.73, 2.12]
No 97.51% 96.25% 97.58% NA
Yes 2.49% 3.75% 2.42% NA
¹ Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
² Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with three levels (i.e., gender): Cramér's V [95% CIs]

Note that this table says that there are 1467 people who completed the survey in full. This differs from the 1470 people listed as completing the survey in Table 1 of our manuscript because some people have been removed from this total due to ineligibility.
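A quick way to confirm this count from the recoded status variable above (a sketch; the figure of 1470 comes from the manuscript, not this dataset):

# Sketch: count eligible customers who fully completed the survey
master_dataset_study2_sensitivity_check %>%
  filter(SURVEY_STATUS_FULL_BINARY == "Participants") %>%
  nrow() # 1467 = 1470 completers minus those removed as ineligible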

Show code
master_dataset_study2 %>%
  tbl_summary(
    include = c(# Demographic details
      AGE, 
      GENDER, 
      IRSAD_SCORE,
      DAYS_REGISTERED, 
       # Wagering
      DAYS_SINCE_LAST_BET, 
      PAS_6M_BET_FREQ_DAYS,
      PAS_6M_BET_INTENSITY,
      PAS_6M_BET_MEAN_STAKE,
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
      PAS_6M_PERCENT_UNSOCIABLE_BETS,
      # Transactions
      PAS_6M_TOTAL_DEPOSIT_FREQ,
      PAS_6M_DEPOSIT_INTENSITY,
      PAS_6M_MEAN_DEPOSIT_AMOUNT,
      PAS_6M_PERCENT_DECLINED_DEPOSITS,
      PAS_6M_PERCENT_CANCEL_WITHDRAWS,
      # Use of consumer protection tools
      EVER_USED_ANY_TOOL,
      ACTIVE_DEPOSIT_LIMIT,
      RE_PAS_6M_ANY_TRIGGER),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous2() ~ c(
      "{median} [{p25} - {p75}]",
      "{mean} ({sd})")),
    by = SURVEY_STATUS_FULL,
    label = c(
      AGE ~ "Age",
      GENDER ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      DAYS_REGISTERED ~ "No. of days since accounts registered",
      DAYS_SINCE_LAST_BET ~ "No. days since last bet",
      PAS_6M_BET_FREQ_DAYS ~ "No. betting days",
      PAS_6M_BET_INTENSITY ~ "Mean bets per active day",
      PAS_6M_BET_MEAN_STAKE ~ "Mean stake",
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~ "Mean net outcome per month",
      PAS_6M_PERCENT_UNSOCIABLE_BETS ~ "Percentage of bets placed between midnight & 6AM",
      PAS_6M_TOTAL_DEPOSIT_FREQ ~ "No. deposits",
      PAS_6M_DEPOSIT_INTENSITY ~ "Mean no. deposits per active betting day",
      PAS_6M_MEAN_DEPOSIT_AMOUNT ~ "Mean deposit amount",
      PAS_6M_PERCENT_DECLINED_DEPOSITS ~ "Percentage of deposits declined",
      PAS_6M_PERCENT_CANCEL_WITHDRAWS ~ "Percentage of withdrawals cancelled",
      EVER_USED_ANY_TOOL ~ "Ever used a SG",
      ACTIVE_DEPOSIT_LIMIT ~ "Active deposit limit",
      RE_PAS_6M_ANY_TRIGGER ~ "Flagged as 'at-risk' ≥ 1 times")
  ) %>%
  modify_header(label = "**Variable**") %>% 
  modify_caption("Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey") %>%
  bold_labels() 
Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn’t complete it, and those who didn’t start the survey
Variable Complete, N = 1,467¹ Incomplete, N = 675¹ Not started, N = 22,651¹
Age


    Median [IQR] 39 [30 - 52] 42 [28 - 56] 34 [26 - 47]
    Mean (SD) 42 (14) 43 (16) 38 (14)
Gender


    Female 153 (10%) 85 (13%) 3,553 (16%)
    Male 1,290 (88%) 577 (85%) 18,888 (83%)
    Unknown 24 (1.6%) 13 (1.9%) 210 (0.9%)
IRSAD score


    Median [IQR] 1,004 [955 - 1,062] 990 [949 - 1,056] 1,001 [954 - 1,061]
    Mean (SD) 1,008 (75) 1,001 (74) 1,006 (76)
    Unknown 1 1 190
No. of days since accounts registered


    Median [IQR] 1,213 [510 - 2,179] 1,100 [447 - 2,043] 927 [375 - 1,822]
    Mean (SD) 1,396 (986) 1,319 (1,005) 1,154 (915)
No. days since last bet


    Median [IQR] 4 [2 - 27] 3 [1 - 15] 32 [4 - 66]
    Mean (SD) 23 (37) 17 (31) 46 (46)
No. betting days


    Median [IQR] 36 [11 - 84] 50 [18 - 101] 12 [3 - 39]
    Mean (SD) 51 (47) 63 (51) 29 (38)
Mean bets per active day


    Median [IQR] 7 [3 - 15] 9 [4 - 20] 4 [2 - 10]
    Mean (SD) 13 (21) 16 (25) 9 (17)
Mean stake


    Median [IQR] 13 [5 - 42] 20 [7 - 60] 13 [6 - 38]
    Mean (SD) 61 (206) 90 (331) 83 (420)
Mean net outcome per month


    Median [IQR] -52 [-551 - -2] -151 [-1,199 - -10] -17 [-131 - -1]
    Mean (SD) -740 (2,656) -1,238 (3,122) -524 (5,155)
Percentage of bets placed between midnight & 6AM


    Median [IQR] 1 [0 - 5] 1 [0 - 5] 0 [0 - 3]
    Mean (SD) 5 (11) 5 (9) 5 (12)
No. deposits


    Median [IQR] 20 [4 - 136] 54 [8 - 272] 6 [1 - 34]
    Mean (SD) 142 (316) 225 (499) 64 (199)
Mean no. deposits per active betting day


    Median [IQR] 1.00 [0.33 - 2.19] 1.50 [0.67 - 3.22] 1.00 [0.33 - 1.60]
    Mean (SD) 1.85 (2.70) 2.46 (3.48) 1.35 (1.86)
Mean deposit amount


    Median [IQR] 50 [22 - 114] 65 [26 - 150] 41 [15 - 100]
    Mean (SD) 143 (503) 166 (380) 173 (862)
Percentage of deposits declined


    Median [IQR] 0 [0 - 8] 3 [0 - 10] 0 [0 - 7]
    Mean (SD) 7 (14) 9 (15) 7 (15)
Percentage of withdrawals cancelled


    Median [IQR] 0 [0 - 0] 0 [0 - 8] 0 [0 - 0]
    Mean (SD) 7 (17) 10 (21) 4 (14)
Ever used a SG 315 (21%) 174 (26%) 3,526 (16%)
Active deposit limit 147 (10%) 76 (11%) 2,037 (9.0%)
Flagged as 'at-risk' ≥ 1 times 392 (27%) 277 (41%) 4,121 (18%)
¹ n (%)

Again, it appears to be because the group who started but didn't complete the survey are more involved/intense bettors than any other group.

Composite risk scores

Now let’s produce a single composite score that represents people’s past-six-month gambling behaviour.

One thing we need to consider here is that some of the variables we’ve created could be biased by recent sign-up to the site. For example, one of the variables we created to summarise past-six-month behaviour is the total number of days somebody gambled, while another is the total amount somebody deposited. Both of these totals could be deflated for people who registered with the company more recently. By contrast, variables that reflect averages, or daily or monthly summaries, will not be affected by this.
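To make the distinction concrete, here is a minimal sketch of how a total could be converted into a sign-up-robust monthly rate (illustrative only: the months_observed helper is hypothetical, and the actual monthly/average variables were produced in our pre-processing scripts):

# Illustrative sketch only: turn a six-month total into a monthly rate so that
# recently registered customers aren't penalised by a short observation window
master_dataset_study2 %>%
  mutate(
    months_observed = pmin(6, pmax(DAYS_REGISTERED, 1) / 30.44), # hypothetical helper: months observable, capped at six
    monthly_bet_days_example = PAS_6M_BET_FREQ_DAYS / months_observed # a rate comparable across sign-up dates
  )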

Let’s first identify all the variables we engineered that summarise past-six-month behaviour, then classify them based on whether they will be subject to this bias or not:

Show code
PAS_6M_variables<- master_dataset_study2 %>% 
  select(
         contains("PAS_6M"), # Identify all relevant past six months aggregate variables
         -contains("RE_"), # Remove risk identification system variables
         -contains("INCOME") # Remove variables relating  to percentage of income which are only available for survey participants
         ) %>%
  colnames() %>%
  as.data.frame() %>%
  rename(Variables = 1) %>%
  mutate(`Biased by sign-up date` = case_when(
                     Variables == "PAS_6M_BET_FREQ_DAYS" ~ "Yes", 
                     Variables == "PAS_6M_BET_FREQ_BETS" ~ "Yes", 
                     Variables == "PAS_6M_BET_MEAN_STAKE" ~ "No", 
                     Variables == "PAS_6M_BET_SD_STAKE" ~ "No", 
                     Variables == "PAS_6M_BET_SPEND" ~ "Yes", 
                     Variables == "PAS_6M_MEAN_ODDS" ~ "Yes", 
                     Variables == "PAS_6M_N_ACTIVITIES" ~ "No", 
                     Variables == "PAS_6M_BET_INTENSITY" ~ "No", 
                     Variables == "PAS_6M_BET_WIN" ~ "Yes", 
                     Variables == "PAS_6M_NET_OUTCOME" ~ "Yes", 
                     Variables == "PAS_6M_MAX_DAY_STAKE" ~ "No", 
                     Variables == "PAS_6M_MAX_DAY_BETS" ~ "No", 
                     Variables == "PAS_6M_MEAN_DAILY_STAKE" ~ "No", 
                     Variables == "PAS_6M_SD_DAILY_STAKE" ~ "No", 
                     Variables == "PAS_6M_TOTAL_UNSOCIABLE_BETS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_UNSOCIABLE_BETS" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_BET_FREQ_DAYS" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_SPEND" ~ "No", 
                     Variables == "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" ~ "No",
                     Variables == "PAS_6M_TOTAL_DEPOSIT_AMOUNT" ~ "Yes", 
                     Variables == "PAS_6M_MEAN_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_SD_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_TOTAL_DEPOSIT_FREQ" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_DECLINED_DEPOSITS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_DECLINED_DEPOSITS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_DEPOSIT_METHODS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_FREQ" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_AMOUNT" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_CANCEL_WITHDRAWS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_CANCEL_WITHDRAWS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_METHODS" ~ "No", 
                     Variables == "PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_DEPOSIT_INTENSITY" ~ "No",
                     Variables == "PAS_6M_BET_FREQ_INCL_LEGS" ~ "Yes", # total count, so affected by sign-up date
                     Variables == "PAS_6M_MONTHLY_DEPOSIT_FREQ" ~ "No", # monthly rate, so unaffected
                     TRUE ~ Variables)) # fallback: any still-unclassified variable shows its own name

# Table variables:
PAS_6M_variables %>%
  gt()%>%
  tab_header("All past-6-month aggregate summary variables", subtitle = "With risk of bias status for each variable")
All past-6-month aggregate summary variables
With risk of bias status for each variable
Variables Biased by sign-up date
PAS_6M_BET_FREQ_DAYS Yes
PAS_6M_BET_FREQ_BETS Yes
PAS_6M_BET_FREQ_INCL_LEGS Yes
PAS_6M_BET_MEAN_STAKE No
PAS_6M_BET_SD_STAKE No
PAS_6M_BET_SPEND Yes
PAS_6M_MEAN_ODDS Yes
PAS_6M_N_ACTIVITIES No
PAS_6M_BET_INTENSITY No
PAS_6M_BET_WIN Yes
PAS_6M_NET_OUTCOME Yes
PAS_6M_MAX_DAY_STAKE No
PAS_6M_MAX_DAY_BETS No
PAS_6M_MEAN_DAILY_STAKE No
PAS_6M_SD_DAILY_STAKE No
PAS_6M_TOTAL_UNSOCIABLE_BETS Yes
PAS_6M_PERCENT_UNSOCIABLE_BETS No
PAS_6M_MONTHLY_BET_FREQ_DAYS No
PAS_6M_MONTHLY_SPEND No
PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME No
PAS_6M_TOTAL_DEPOSIT_AMOUNT Yes
PAS_6M_MEAN_DEPOSIT_AMOUNT No
PAS_6M_SD_DEPOSIT_AMOUNT No
PAS_6M_TOTAL_DEPOSIT_FREQ Yes
PAS_6M_TOTAL_DECLINED_DEPOSITS Yes
PAS_6M_PERCENT_DECLINED_DEPOSITS No
PAS_6M_TOTAL_DEPOSIT_METHODS No
PAS_6M_TOTAL_WITHDRAW_FREQ Yes
PAS_6M_TOTAL_WITHDRAW_AMOUNT Yes
PAS_6M_TOTAL_CANCEL_WITHDRAWS Yes
PAS_6M_PERCENT_CANCEL_WITHDRAWS No
PAS_6M_TOTAL_WITHDRAW_METHODS No
PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT No
PAS_6M_MONTHLY_DEPOSIT_AMOUNT No
PAS_6M_MONTHLY_DEPOSIT_FREQ No
PAS_6M_DEPOSIT_INTENSITY No

Okay, now that we have all of the relevant variables, let’s look at how they correlate with PGSI and harm scores using a series of bivariate correlations visualised as a heatmap. These results are unweighted:

Show code
PGSI_cor_data<- master_dataset_study2 %>% 
    select(PGSI_TOTAL,
        GHM_TOTAL,
       PAS_6M_BET_MEAN_STAKE, 
       PAS_6M_BET_SD_STAKE, 
       PAS_6M_N_ACTIVITIES, 
       PAS_6M_BET_INTENSITY, 
       PAS_6M_MAX_DAY_STAKE, 
       PAS_6M_MAX_DAY_BETS, 
       PAS_6M_MEAN_DAILY_STAKE, 
       PAS_6M_SD_DAILY_STAKE, 
       PAS_6M_PERCENT_UNSOCIABLE_BETS, 
       PAS_6M_MONTHLY_BET_FREQ_DAYS, 
       PAS_6M_MONTHLY_SPEND, 
       PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
       PAS_6M_MEAN_DEPOSIT_AMOUNT, 
       PAS_6M_SD_DEPOSIT_AMOUNT, 
       PAS_6M_PERCENT_DECLINED_DEPOSITS, 
       PAS_6M_TOTAL_DEPOSIT_METHODS, 
       PAS_6M_PERCENT_CANCEL_WITHDRAWS, 
       PAS_6M_TOTAL_WITHDRAW_METHODS, 
       PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT, 
       PAS_6M_MONTHLY_DEPOSIT_AMOUNT, 
       PAS_6M_DEPOSIT_INTENSITY)

ggstatsplot::ggcorrmat(PGSI_cor_data, 
                       type = "parametric",
                       results.subtitle = FALSE,
                       ggcorrplot.args = c(lab_size = 3)) +
  theme(axis.text.y = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(axis.text.x = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(legend.title=element_text(color = "black", size = 12, face = "plain", family = "Poppins")) +
  theme(legend.title.align = .5) +
  theme(legend.text=element_text(color = "black", size = 8, face = "plain", family = "Poppins"))

Now let’s use all of these variables to create a composite index score of risk/involvement:

Show code
master_dataset_study2_with_risk_scores <- master_dataset_study2 %>%
  mutate(PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME = PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME*-1) %>% # Invert net outcome so that losses score positively (higher = greater losses)
  mutate(across(c(PAS_6M_BET_MEAN_STAKE, 
                  PAS_6M_BET_SD_STAKE, 
                  PAS_6M_N_ACTIVITIES, 
                  PAS_6M_BET_INTENSITY, 
                  PAS_6M_MAX_DAY_STAKE, 
                  PAS_6M_MAX_DAY_BETS, 
                  PAS_6M_MEAN_DAILY_STAKE, 
                  PAS_6M_SD_DAILY_STAKE, 
                  PAS_6M_PERCENT_UNSOCIABLE_BETS, 
                  PAS_6M_MONTHLY_BET_FREQ_DAYS, 
                  PAS_6M_MONTHLY_SPEND, 
                  PAS_6M_MEAN_DEPOSIT_AMOUNT, 
                  PAS_6M_SD_DEPOSIT_AMOUNT, 
                  PAS_6M_PERCENT_DECLINED_DEPOSITS, 
                  PAS_6M_TOTAL_DEPOSIT_METHODS, 
                  PAS_6M_PERCENT_CANCEL_WITHDRAWS, 
                  PAS_6M_TOTAL_WITHDRAW_METHODS, 
                  PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT, 
                  PAS_6M_MONTHLY_DEPOSIT_AMOUNT, 
                  PAS_6M_DEPOSIT_INTENSITY), 
                scale, # scale function computes z-scores
                .names = "z_{.col}")) %>% # Create new named columns for easy selection below
  rowwise() %>% # Group by Rows
  mutate(`Composite Risk Score` = sum(c_across(starts_with("z_")), na.rm = TRUE)) %>% # Sum across all newly created Z scores
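  # Note: na.rm = TRUE means a missing component z-score simply contributes nothing to that person's sum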
  ungroup()


# colnames(master_dataset_study2_with_risk_scores)

Correlations

Now let’s look at how well this composite score correlates with PGSI scores in the survey sample:

Show code
master_dataset_study2_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>%
   mutate(PGSI_CATEGORY = factor(PGSI_CATEGORY, levels = c('No risk',
                                                              'Low risk',
                                                              'Moderate risk',
                                                              'High risk'))) %>%
ggplot() +
   geom_point(aes(x = PGSI_TOTAL, y = `Composite Risk Score`,
                   colour = PGSI_CATEGORY)) +
   geom_smooth(aes(x = PGSI_TOTAL, y = `Composite Risk Score`), 
               method = "lm",
               se = FALSE,
               color = "black") +
   scale_colour_manual(values = c("#2e4057",
                                  "#00798c", 
                                  "#edae49",
                                  "#d1495b")) +
   geom_hline(yintercept = 0, colour = "grey",  linetype="dashed") +
   scale_y_continuous(limits = c(-10,120)) +
   plot_theme +
  guides(color= guide_legend("PGSI category")) +
   labs(x = "PGSI total score") +
  scale_x_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 27))

Perform non-weighted correlation:

Show code
cor_PGSI <- cor.test(master_dataset_study2_with_risk_scores$PGSI_TOTAL, 
                         master_dataset_study2_with_risk_scores$`Composite Risk Score`,
                         method = "spearman")

cor_PGSI

    Spearman's rank correlation rho

data:  master_dataset_study2_with_risk_scores$PGSI_TOTAL and master_dataset_study2_with_risk_scores$`Composite Risk Score`
S = 850963267, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho 
0.317731 

Perform weighted correlation:

Show code
# This doesn't work nicely with NA values so let's remove them first:
master_dataset_study2_with_risk_scores_weighted_cor_data <- master_dataset_study2_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL))

weighted_cor_PGSI <- weightedCorr(x = master_dataset_study2_with_risk_scores_weighted_cor_data$PGSI_TOTAL, 
             y = master_dataset_study2_with_risk_scores_weighted_cor_data$`Composite Risk Score`, 
              method = "Spearman",
             weights = master_dataset_study2_with_risk_scores_weighted_cor_data$ALL_SAMPLE_WEIGHTS)

weighted_cor_PGSI
[1] 0.288293

Confirm how many people are involved in these calculations:

Show code
master_dataset_study2_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>% nrow()
[1] 1956

T-test

Perform a t-test to compare the composite risk scores of those who did and didn’t take part in the survey.

Unweighted t-test:

Show code
# Perform Welch's t-test
t_test_composite_study2 <- t.test(`Composite Risk Score` ~ STATUS, data = master_dataset_study2_with_risk_scores, var.equal = FALSE)

t_test_composite_study2

    Welch Two Sample t-test

data:  Composite Risk Score by STATUS
t = 15.233, df = 2368.3, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Participants and group Non-participants is not equal to 0
95 percent confidence interval:
 3.103619 4.020736
sample estimates:
    mean in group Participants mean in group Non-participants 
                     3.2811454                     -0.2810317 

And now a weighted t-test:

Show code
# Create survey design object for the whole sample:
survey_design_all_participants <- svydesign(
  id = ~1,
  data = master_dataset_study2_with_risk_scores,
  weights = ~ALL_SAMPLE_WEIGHTS
)

weighted_t_test_composite_study2 <-survey::svyttest(`Composite Risk Score` ~ STATUS, data = master_dataset_study2_with_risk_scores, design = survey_design_all_participants)

weighted_t_test_composite_study2

    Design-based t-test

data:  `Composite Risk Score` ~ STATUS
t = -16.864, df = 24791, p-value < 2.2e-16
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 -2.546679 -2.016333
sample estimates:
difference in mean 
         -2.281506 

Compute non-weighted summary scores and effect size:

Show code
# Calculate means and standard deviations
summary_composite_study2 <- master_dataset_study2_with_risk_scores %>%
  group_by(STATUS) %>%
  dplyr::summarize(
    mean = mean(`Composite Risk Score`),
    sd = sd(`Composite Risk Score`),
    n = n()
  ) 

gt(summary_composite_study2)
STATUS mean sd n
Participants 3.2811454 9.855798 1956
Non-participants -0.2810317 10.708913 22837
Show code
# Calculate Cohen's d with 95% confidence interval
cohen_d_result_study2 <- cohen.d(`Composite Risk Score` ~ STATUS, data = master_dataset_study2_with_risk_scores)

cohen_d_result_study2

Cohen's d

d estimate: 0.3346615 (small)
95 percent confidence interval:
    lower     upper 
0.2883902 0.3809328 

Compute weighted summary score and effect size:

Show code
# Rename the risk score as the statistic function doesn't like the spaces:
master_dataset_study2_with_risk_scores_temp_risk_name_change<- master_dataset_study2_with_risk_scores %>%
  rename(risk_score = `Composite Risk Score`)


# Create survey design objects for each STATUS group that include the composite risk scores:
survey_design_all_t <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_with_risk_scores_temp_risk_name_change,
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_non_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_with_risk_scores_temp_risk_name_change, STATUS == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_with_risk_scores_temp_risk_name_change, STATUS == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

risk_score_comparison <- weighted_summary_continuous("risk_score", 
                                                    survey_design_all_t,
                                                    survey_design_non_participants_t, 
                                                    survey_design_participants_t, 
                                                    participants_n, # NB: these n's are the full-completion counts defined in the sensitivity analysis above
                                                    non_participants_n,
                                                    overall_n) # supply all arguments the function expects

gt(risk_score_comparison)
Variable Overall Participants Non-participants Effect_size
risk_score -2.6 (5.54), -3.59 -0.47 (6.08), -1.53 -2.75 (5.46), -3.73 0.41 [0.36, 0.47]

Quartiles

Now, let’s divide the sample into quartiles based on their composite summary score:

Show code
# Calculate quartiles of the composite risk score
quartiles <- quantile(master_dataset_study2_with_risk_scores$`Composite Risk Score`, 
                      probs = c(0, 0.25, 0.5, 0.75, 1))


master_dataset_study2_with_risk_scores$Quartile <- cut(master_dataset_study2_with_risk_scores$`Composite Risk Score`, 
                                    breaks = quartiles,
                                    labels = c("Bottom quartile: least involved",
                                               "Second quartile",
                                               "Third quartile",
                                               "Top quartile: most involved"),
                                    include.lowest = TRUE)
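As a quick sanity check (not reported in the manuscript), each quartile should contain roughly a quarter of the sample:

# Sanity check: each quartile should hold ~25% of customers
master_dataset_study2_with_risk_scores %>%
  count(Quartile) %>%
  mutate(percent = round(n / sum(n) * 100, 1))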

Now compute the proportion of individuals from each quartile who went on to take part in the survey:

Show code
master_dataset_study2_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
  tab_header("Counts for granular survey status by composite risk score quartile", subtitle = NULL)
Counts for granular survey status by composite risk score quartile
SURVEY_STATUS_PGSI n
Bottom quartile: least involved
Complete 245
Incomplete 8
Not started 5946
Second quartile
Complete 348
Incomplete 15
Not started 5835
Third quartile
Complete 535
Incomplete 50
Not started 5613
Top quartile: most involved
Complete 828
Incomplete 113
Not started 5257
Show code
master_dataset_study2_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(STATUS) %>%
  mutate('Percent of quartile' = round(n/sum(n)*100,2)) %>%
  gt() %>%
  tab_header("Counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Counts (with percentages) for broad survey status by composite risk score quartile
STATUS n Percent of quartile
Bottom quartile: least involved
Participants 245 3.95
Non-participants 5954 96.05
Second quartile
Participants 348 5.61
Non-participants 5850 94.39
Third quartile
Participants 535 8.63
Non-participants 5663 91.37
Top quartile: most involved
Participants 828 13.36
Non-participants 5370 86.64

Now compute weighted counts/proportions:

Show code
# Compute weighted counts for granular survey status
weighted_counts_granular <- svytable(~Quartile + SURVEY_STATUS_PGSI, 
                                     svydesign(id = ~1, 
                                               data = master_dataset_study2_with_risk_scores, 
                                               weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame()


weighted_counts_granular %>%
  gt() %>%
  tab_header("Weighted counts for granular survey status by composite risk score quartile", subtitle = NULL)
Weighted counts for granular survey status by composite risk score quartile
Quartile SURVEY_STATUS_PGSI Freq
Bottom quartile: least involved Complete 2309.93893
Second quartile Complete 3267.87903
Third quartile Complete 4862.26711
Top quartile: most involved Complete 2948.77958
Bottom quartile: least involved Incomplete 75.38683
Second quartile Incomplete 142.03838
Third quartile Incomplete 463.32653
Top quartile: most involved Incomplete 390.28258
Bottom quartile: least involved Not started 56150.98467
Second quartile Not started 54996.77437
Third quartile Not started 49681.64042
Top quartile: most involved Not started 17909.09600
Show code
# Compute weighted counts and percentages for broad survey status
svytable(~Quartile + STATUS, 
                                  svydesign(id = ~1, 
                                            data = master_dataset_study2_with_risk_scores, 
                                            weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate('Percent of quartile' = round(Freq / sum(Freq) * 100, 2)) %>%
 gt() %>%
  tab_header("Weighted counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Weighted counts (with percentages) for broad survey status by composite risk score quartile
STATUS Freq Percent of quartile
Bottom quartile: least involved
Participants 2309.939 3.95
Non-participants 56226.371 96.05
Second quartile
Participants 3267.879 5.60
Non-participants 55138.813 94.40
Third quartile
Participants 4862.267 8.84
Non-participants 50144.967 91.16
Top quartile: most involved
Participants 2948.780 13.88
Non-participants 18299.379 86.12

And visualise this using a Sankey diagram (weighted figures):

Show code
# Create survey design object
survey_design_all_participants_sankey <- svydesign(
  id = ~1,
  data = master_dataset_study2_with_risk_scores,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# We first need to create a data frame that contains the flows' "from" and "to" categories and "intensity" values (i.e., N), with weightings applied:
weighted_counts <- svytable(~Quartile + SURVEY_STATUS_PGSI, survey_design_all_participants_sankey) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate(percent = Freq / sum(Freq) * 100) %>%
  ungroup()

# Create Sankey plot
desired_order_sankey <- c("Top quartile: most involved",
                          "Third quartile",
                          "Second quartile",
                          "Bottom quartile: least involved")

# non-weighted data: 
risk_to_survey_links<- master_dataset_study2_with_risk_scores %>%
    select(CLIENT_ID, SURVEY_STATUS_PGSI, Quartile) %>%
  group_by(Quartile, SURVEY_STATUS_PGSI) %>%
   summarise(
    n = n(),
    .groups = "drop"
  ) %>%
   group_by(Quartile) %>%
   mutate(
     percent = n/sum(n)*100) %>%
  select(-n) %>%
  ungroup() %>%
  mutate(Quartile =  factor(Quartile,
                              levels = desired_order_sankey)) %>%
  arrange(Quartile)

# From these flows we have to create a "node" data frame which provides a list of all categories (to and from) in the data frame: 
nodes <- data.frame(
  name = c(as.character(weighted_counts$Quartile), 
           as.character(weighted_counts$SURVEY_STATUS_PGSI)) %>% unique())

# For the networkD3 package, the links between categories are made using IDs and not the character names provided in the links data frame. This code assigns each "from" and "to" category a zero-based ID to make the numerical connections explicit:
weighted_counts$IDsource <- match(weighted_counts$Quartile, nodes$name) - 1
weighted_counts$IDtarget <- match(weighted_counts$SURVEY_STATUS_PGSI, nodes$name) - 1
 


# prepare colour scale
ColourScal ='d3.scaleOrdinal() .range(["#d1495b", 
"#edae49",
"#00798c",
"#2e4057",
"#35B779FF",
"#31688EFF",
"#31688EFF",
"#482878FF",
"#B4DE2CFF",
"#FDE725FF", 
"#26828EFF",
"#3E4A89FF",
"#440154FF",
"#6DCD59FF"
])'

# Create Sankey plot
sankey_plot2 <- sankeyNetwork(
  Links = weighted_counts, # risk_to_survey_links # non-weighted 
  Nodes = nodes,
  Source = "IDsource", 
  Target = "IDtarget",
  Value = "percent", 
  NodeID = "name", 
  iterations = 0,
  sinksRight = FALSE,
  fontSize = 16, 
  fontFamily = "Poppins",
  nodeWidth = 30,
  nodePadding = 25,
  colourScale = ColourScal
)

# Custom CSS with Google Fonts link
custom_css <- "
<link href='https://fonts.googleapis.com/css2?family=Poppins:wght@400;700&display=swap' rel='stylesheet'>
<style>
  text {
    font-family: 'Poppins', sans-serif !important;
  }
</style>
"

# Save plot as a HTML file with custom CSS
html_file <- tempfile(fileext = ".html")
saveWidget(
  widget = htmlwidgets::prependContent(sankey_plot2), 
  file = html_file,
  selfcontained = TRUE
)

# Read the saved HTML and ensure the custom CSS is properly embedded
html_content <- readLines(html_file, warn = FALSE)
html_content <- gsub("</head>", paste0(custom_css, "</head>"), html_content)
writeLines(html_content, con = "sankey_plot2.html")

# remove the temporary file
file.remove(html_file)
[1] TRUE
Show code
# sankey_plot2
Show code
# ggsankey_data <- master_dataset_study2_with_risk_scores %>%
#   # mutate(SURVEY_STATUS_FULL = case_when(SURVEY_STATUS_FULL == "Not started" ~ NA,
#   #                                       TRUE ~ SURVEY_STATUS_FULL)) %>%
#   make_long(Quartile, SURVEY_STATUS_START, SURVEY_STATUS_FULL)
# 
# ggplot(ggsankey_data, aes(x = x,
#                           next_x = next_x,
#                           node = node,
#                           next_node = next_node,
#                           fill = factor(node),
#                           label = node)) +
#                geom_sankey(flow.alpha = 0.5,
#                            node.color = "black",
#                            show.legend = FALSE)+
#   geom_sankey_label(size = 3,
#                     color = "black",
#                     fill= "white",
#                     hjust = -0.2,
#                     family="Poppins")+
#   theme_bw()+
#     theme( text=element_text(family="Poppins")) +
#   # scale_fill_viridis_d() +
#     scale_fill_manual(values = c("#9a031e",  # Bright red
#   "#277da1",  # Deep blue
#   "#f9c74f",  # Muted yellow
#   "#e36414",  # Soft orange
#   "#90be6d",  # Soft green
#   "#577590",  # Slate gray
#   "#4d908e",  # Teal
#   "#9f5f80"   # Purple
# )) +
#   theme(axis.title = element_blank()
#                   , axis.text.y = element_blank()
#                   , axis.ticks = element_blank()
#                   , panel.grid = element_blank())

Logistic regression

Now let’s perform a logistic regression model to identify the strongest predictors of survey uptake.

SSVS

Before we can actually run the model, we need to identify a suitable suite of predictor variables to include using Stochastic Search Variable Selection (SSVS) and the related SSVS package (note the actual SSVS process is commented out in rendered versions of this document as it takes a long time to run).

As we want to maximise the overall predictive ability of the model, we have opted to set the parameters of the SSVS so as to have a low threshold for including/selecting relevant variables. First, we have set the prior inclusion probability to 0.8, reflecting a belief that predictors are more likely to be included than not (a prior inclusion probability of 0.5 would reflect the belief that each predictor has an equal probability of being included/excluded). Second, we have set the cut-off for marginal inclusion probability (MIP) values to 0.4. There is no standard recommendation for what this cut-off should be, but others have used 0.5.

One thing we need to do before all of this is impute missing values for the IRSAD scores, as SSVS doesn’t seem to handle missing values well. To do this, we will impute the sample median as each missing value.

Show code
# First we need to convert logical and categorical variables to numeric for inclusion in the model:
master_dataset_study2_SSVS <- master_dataset_study2 %>%
  # Impute missing IRSAD scores
  mutate(IRSAD_SCORE = ifelse(is.na(IRSAD_SCORE), median(IRSAD_SCORE, na.rm = TRUE), IRSAD_SCORE)) %>%
  # Convert all required variables to numeric:
  mutate(across(c(
    'AGE',
    'IRSAD_SCORE',
       'DAYS_REGISTERED',
       'DAYS_SINCE_LAST_BET',
       'PAS_6M_BET_MEAN_STAKE',
       'PAS_6M_BET_SD_STAKE',
       'PAS_6M_N_ACTIVITIES',
       'PAS_6M_BET_INTENSITY',
       'PAS_6M_MAX_DAY_STAKE',
       'PAS_6M_MAX_DAY_BETS',
       'PAS_6M_MEAN_DAILY_STAKE',
       'PAS_6M_SD_DAILY_STAKE',
       'PAS_6M_PERCENT_UNSOCIABLE_BETS',
       'PAS_6M_MONTHLY_BET_FREQ_DAYS',
       'PAS_6M_MONTHLY_SPEND',
       'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME',
       'PAS_6M_MEAN_DEPOSIT_AMOUNT',
       'PAS_6M_SD_DEPOSIT_AMOUNT',
       'PAS_6M_PERCENT_DECLINED_DEPOSITS',
       'PAS_6M_TOTAL_DEPOSIT_METHODS',
       'PAS_6M_PERCENT_CANCEL_WITHDRAWS',
       'PAS_6M_TOTAL_WITHDRAW_METHODS',
       'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT',
       'PAS_6M_MONTHLY_DEPOSIT_AMOUNT',
       'PAS_6M_DEPOSIT_INTENSITY','EVER_USED_DEPOSIT_LIMIT',
                  'EVER_USED_BLOCK_OUT', 
                  'EVER_USED_CHECK_IN', 
                  'EVER_USED_CURFEW', 
                  'EVER_USED_DEPOSIT_OPTION', 
                  'EVER_USED_MARKET_CONTROL', 
                  'EVER_USED_TIMOUT', 
                  'EVER_USED_CANCELLED_WITHDRAWS', 
                  'EVER_USED_SELF_EXCLUSION',
    'TOTAL_ACTIVE_TOOLS',
    'NUMBER_CPT_CHANGES'), 
                ~ as.numeric(.))) %>%
  mutate(GENDER = case_when(
    GENDER == 'Female' ~ 0,
    GENDER == 'Male' ~ 1,
    TRUE ~ 2)) %>%
  mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1,
    TRUE ~ 2)) %>%
    rename(
    'Age' = 'AGE',
    'Gender' =  'GENDER',
    'IRSAD score' = 'IRSAD_SCORE',
    'Days since registration' = 'DAYS_REGISTERED',
    'Days since last bet' = 'DAYS_SINCE_LAST_BET',
    'Mean stake*' = 'PAS_6M_BET_MEAN_STAKE',
    'Stake SD*' = 'PAS_6M_BET_SD_STAKE',
    'No. of activities*' = 'PAS_6M_N_ACTIVITIES',
    'Betting intensity*' = 'PAS_6M_BET_INTENSITY',
    'Max daily stake*' = 'PAS_6M_MAX_DAY_STAKE',
    'Max daily bets*' = 'PAS_6M_MAX_DAY_BETS',
    'Mean daily stake*' = 'PAS_6M_MEAN_DAILY_STAKE',
    'Daily stake SD*' = 'PAS_6M_SD_DAILY_STAKE',
    '% unsociable bets*' = 'PAS_6M_PERCENT_UNSOCIABLE_BETS',
    'Monthly bet freq (days)*' = 'PAS_6M_MONTHLY_BET_FREQ_DAYS',
    'Monthly spend*' = 'PAS_6M_MONTHLY_SPEND',
    'Mean monthly net outcome*' = 'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME',
    'Mean deposit amount*' = 'PAS_6M_MEAN_DEPOSIT_AMOUNT',
    'Deposit amount SD*' = 'PAS_6M_SD_DEPOSIT_AMOUNT',
    '% declined deposits*' = 'PAS_6M_PERCENT_DECLINED_DEPOSITS',
    'No. deposit methods*' = 'PAS_6M_TOTAL_DEPOSIT_METHODS',
    '% cancel withdrawals*' = 'PAS_6M_PERCENT_CANCEL_WITHDRAWS',
    'No. withdrawal methods*' = 'PAS_6M_TOTAL_WITHDRAW_METHODS',
    'Max daily deposit amount*' = 'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT',
    'Monthly deposit amount*' = 'PAS_6M_MONTHLY_DEPOSIT_AMOUNT',
    'Deposit intensity*' = 'PAS_6M_DEPOSIT_INTENSITY',
    'Ever used deposit limit' = 'EVER_USED_DEPOSIT_LIMIT',
    'Ever used block out' = 'EVER_USED_BLOCK_OUT', 
    'Ever used check in' = 'EVER_USED_CHECK_IN', 
    'Ever used curfew' = 'EVER_USED_CURFEW', 
    'Ever used deposit option' = 'EVER_USED_DEPOSIT_OPTION', 
    'Ever used market control' = 'EVER_USED_MARKET_CONTROL', 
    'Ever used timeout' = 'EVER_USED_TIMOUT', 
    'Ever used cancelled withdrawals' = 'EVER_USED_CANCELLED_WITHDRAWS', 
    'Ever used self-exclusion' = 'EVER_USED_SELF_EXCLUSION',
    'Total active tools' = 'TOTAL_ACTIVE_TOOLS',
    'Number of changes to CPTs' = 'NUMBER_CPT_CHANGES'
  )


outcome <- 'STATUS'

predictors <- c(
       # demographic variables:
       'Age',
       'Gender',
       'IRSAD score',
       # Account/wagering variables
       'Days since registration',
       'Days since last bet',
       # All past six month wagers and transactions variables (As included in the composite score):
      'Mean stake*',
    'Stake SD*',
    'No. of activities*',
    'Betting intensity*',
    'Max daily stake*',
    'Max daily bets*',
    'Mean daily stake*',
    'Daily stake SD*',
    '% unsociable bets*',
    'Monthly bet freq (days)*',
    'Monthly spend*',
    'Mean monthly net outcome*',
    'Mean deposit amount*',
    'Deposit amount SD*',
    '% declined deposits*',
    'No. deposit methods*',
    '% cancel withdrawals*',
    'No. withdrawal methods*',
    'Max daily deposit amount*',
    'Monthly deposit amount*',
    'Deposit intensity*',
       # current and historical use of consumer protection tools:
       'Ever used deposit limit',
       'Ever used block out', 
       'Ever used check in', 
       'Ever used curfew', 
       'Ever used deposit option', 
       'Ever used market control', 
       'Ever used timeout', 
       'Ever used cancelled withdrawals', 
       'Ever used self-exclusion',
       'Total active tools',
       'Number of changes to CPTs'
        )

# str(master_dataset_study2_SSVS)

# master_dataset_study2_SSVS %>%
  # select(`IRSAD score`)

# SSSV_results_Study2 <- ssvs(data = master_dataset_study2_SSVS,
#                             x = predictors, y = outcome, # Variables, as prespecified above
#                             inprob = 0.8, #  prior inclusion probability:  0.5 reflects the belief that each predictor has an equal probability of being included/excluded
#                             continuous = FALSE, # Categorical outcome variable
#                             progress = TRUE # Provide update on progress of processing
#                             )
# 
# # Save file to avoid having to re-run every time
# saveRDS(SSSV_results_Study2, "SSSV_results_Study2")

# Load in:
SSSV_results_Study2 <- read_rds("SSSV_results_Study2")

Now that we have the outcomes from the variable selection process, let’s present them as a table and visually in a plot:

Show code
summary(SSSV_results_Study2, 
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  as.data.frame() %>%
gt()%>%
  tab_header(md("**Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2**"), 
             subtitle = NULL)
Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2
Variable MIP Avg Beta Avg Nonzero Beta Lower CI (95%) Upper CI (95%)
Age 1.0000 0.2783 0.2783 0.2257 0.3342
Days since last bet 1.0000 -0.3794 -0.3794 -0.4596 -0.3024
No. withdrawal methods* 1.0000 0.1980 0.1980 0.1393 0.2564
Monthly bet freq (days)* 0.9895 0.1473 0.1488 0.0799 0.2117
No. deposit methods* 0.9755 0.1185 0.1215 0.0603 0.1830
Mean deposit amount* 0.8053 -0.2226 -0.2764 -0.3901 0.0000
No. of activities* 0.4949 0.0450 0.0909 0.0000 0.1240
Stake SD* 0.3143 0.1098 0.3492 0.0000 0.4386
Mean stake* 0.3096 -0.1834 -0.5924 -0.7406 0.0000
Gender 0.2637 0.0221 0.0839 0.0000 0.1076
Deposit amount SD* 0.0689 -0.0063 -0.0908 -0.1249 0.0000
Ever used cancelled withdrawals 0.0547 0.0027 0.0489 0.0000 0.0256
Deposit intensity* 0.0364 0.0020 0.0538 0.0000 0.0000
% cancel withdrawals* 0.0255 0.0012 0.0484 0.0000 0.0000
Days since registration 0.0211 0.0011 0.0533 0.0000 0.0000
Total active tools 0.0189 0.0008 0.0412 0.0000 0.0000
Mean monthly net outcome* 0.0158 0.0012 0.0756 0.0000 0.0000
Betting intensity* 0.0111 0.0004 0.0348 0.0000 0.0000
Ever used curfew 0.0099 -0.0005 -0.0455 0.0000 0.0000
Max daily stake* 0.0085 -0.0005 -0.0563 0.0000 0.0000
% unsociable bets* 0.0080 0.0003 0.0333 0.0000 0.0000
Daily stake SD* 0.0066 -0.0001 -0.0178 0.0000 0.0000
Mean daily stake* 0.0059 -0.0002 -0.0269 0.0000 0.0000
Max daily deposit amount* 0.0059 -0.0001 -0.0231 0.0000 0.0000
Max daily bets* 0.0055 0.0002 0.0290 0.0000 0.0000
Ever used market control 0.0051 0.0002 0.0352 0.0000 0.0000
Ever used deposit limit 0.0050 0.0002 0.0359 0.0000 0.0000
Ever used timeout 0.0049 0.0001 0.0216 0.0000 0.0000
Ever used self-exclusion 0.0045 0.0001 0.0154 0.0000 0.0000
Monthly deposit amount* 0.0039 -0.0001 -0.0225 0.0000 0.0000
Monthly spend* 0.0038 -0.0001 -0.0323 0.0000 0.0000
% declined deposits* 0.0034 0.0001 0.0254 0.0000 0.0000
Ever used block out 0.0033 0.0000 0.0130 0.0000 0.0000
Ever used deposit option 0.0023 0.0000 0.0066 0.0000 0.0000
Number of changes to CPTs 0.0019 0.0000 -0.0088 0.0000 0.0000
Ever used check in 0.0013 0.0000 -0.0082 0.0000 0.0000
IRSAD score 0.0009 0.0000 0.0136 0.0000 0.0000
Show code
plot(SSSV_results_Study2,
     threshold = 0.4,
     title = "") +
  theme_classic() +
   scale_y_continuous(breaks = c(0, 0.25, 0.50, 0.75, 1), labels = c(0, 0.25, 0.50, 0.75, 1)) +
  theme(text=element_text(family="Poppins"),
      axis.text.x = element_text(angle = 45, hjust = 1)
      # axis.text.y = element_text(angle = 0, hjust = .5)
      ) +
  labs(title = "",
       x = "",
       y = "Inclusion Probability") +
  annotate("text", x = 18, y = 0.45, label = "MIP threshold", size =3, color = "black", fontface = "plain") +
  theme(legend.position = "none") 

Show code
ggsave("Figures/SSVS results plot Study 2.pdf",
       width = 8.5,
       height = 3.5,
       units = c("in")
)

Run model

Now let’s run the logistic regression model with our selection of predictor variables identified via SSVS:

Show code
SSSV_results_Study2$ssvs %>% as_tibble()
# A tibble: 0 × 0
Show code
# Set the data for analysis:
Study2_logistic_reg_data<- master_dataset_study2 %>%
 mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1))

# Create survey design object used for weighting:
survey_design_regression <- svydesign(
  id = ~1,
  data = Study2_logistic_reg_data,
  weights = ~ALL_SAMPLE_WEIGHTS
)
  
# Check variables for inclusion: 
summary(SSSV_results_Study2, 
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  filter(MIP>0.4)
 Variable                 MIP    Avg Beta Avg Nonzero Beta Lower CI (95%) Upper CI (95%)
 Age                      1.0000  0.2783   0.2783          0.2257          0.3342
 Days since last bet      1.0000 -0.3794  -0.3794         -0.4596         -0.3024
 No. withdrawal methods*  1.0000  0.1980   0.1980          0.1393          0.2564
 Monthly bet freq (days)* 0.9895  0.1473   0.1488          0.0799          0.2117
 No. deposit methods*     0.9755  0.1185   0.1215          0.0603          0.1830
 Mean deposit amount*     0.8053 -0.2226  -0.2764         -0.3901          0.0000
 No. of activities*       0.4949  0.0450   0.0909          0.0000          0.1240
Show code
# Run the model with survey weights applied:

study_2_weighted_logistic_regression <- svyglm(STATUS ~ AGE + 
                DAYS_SINCE_LAST_BET +
                PAS_6M_TOTAL_WITHDRAW_METHODS +
                PAS_6M_MONTHLY_BET_FREQ_DAYS +
                PAS_6M_TOTAL_DEPOSIT_METHODS + 
                PAS_6M_MEAN_DEPOSIT_AMOUNT +
                PAS_6M_N_ACTIVITIES,
      design = survey_design_regression, family = quasibinomial())

summary(study_2_weighted_logistic_regression) 

Call:
svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS + 
    PAS_6M_MONTHLY_BET_FREQ_DAYS + PAS_6M_TOTAL_DEPOSIT_METHODS + 
    PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_N_ACTIVITIES, design = survey_design_regression, 
    family = quasibinomial())

Survey design:
svydesign(id = ~1, data = Study2_logistic_reg_data, weights = ~ALL_SAMPLE_WEIGHTS)

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   -3.9154835  0.1304685 -30.011  < 2e-16 ***
AGE                            0.0246511  0.0019471  12.660  < 2e-16 ***
DAYS_SINCE_LAST_BET           -0.0076221  0.0010583  -7.202 6.11e-13 ***
PAS_6M_TOTAL_WITHDRAW_METHODS  0.3606934  0.0522027   6.909 4.98e-12 ***
PAS_6M_MONTHLY_BET_FREQ_DAYS   0.0124960  0.0054804   2.280 0.022610 *  
PAS_6M_TOTAL_DEPOSIT_METHODS   0.1432951  0.0351667   4.075 4.62e-05 ***
PAS_6M_MEAN_DEPOSIT_AMOUNT    -0.0006594  0.0001831  -3.602 0.000316 ***
PAS_6M_N_ACTIVITIES            0.0416386  0.0093267   4.464 8.06e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.731407)

Number of Fisher Scoring iterations: 6

Perform checks of model performance:

Show code
model2_pilot_performance_metrics <- check_model(study_2_weighted_logistic_regression)

model2_pilot_performance_metrics # Print the diagnostic panels

ggsave("Figures/model2_pilot_performance_metrics.pdf",
       plot = plot(model2_pilot_performance_metrics), # Save the diagnostics explicitly rather than relying on the last-displayed plot
       width = 7,
       height = 9,
       units = c("in")
)

I have tried various methods of transforming variables to account for large residuals but none seem to have any effect (first plot above). If anything, using variables on the original scale seems to improve model diagnostics.

Compute confidence intervals for the coefficient estimates:

Show code
summarisedCIs_study_2_weighted_logistic_regression <- cbind(coef(study_2_weighted_logistic_regression), confint(study_2_weighted_logistic_regression))

summarisedCIs_study_2_weighted_logistic_regression %>% as.data.frame() %>%
    rownames_to_column() %>%
  rename(Coefficient = V1) %>%
  mutate(across(c(Coefficient, `2.5 %`, `97.5 %`), ~round(., 3))) %>%
  arrange(desc(`2.5 %`)) %>%
   gt() %>%
  tab_header(md("**Regression coefficients with 95% CIs: Study 2**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Regression coefficients with 95% CIs: Study 2
Coefficient Lower CI: 2.5% Upper CI: 97.5%
PAS_6M_TOTAL_WITHDRAW_METHODS 0.361 0.258 0.463
PAS_6M_TOTAL_DEPOSIT_METHODS 0.143 0.074 0.212
PAS_6M_N_ACTIVITIES 0.042 0.023 0.060
AGE 0.025 0.021 0.028
PAS_6M_MONTHLY_BET_FREQ_DAYS 0.012 0.002 0.023
PAS_6M_MEAN_DEPOSIT_AMOUNT -0.001 -0.001 0.000
DAYS_SINCE_LAST_BET -0.008 -0.010 -0.006
(Intercept) -3.915 -4.171 -3.660

Exponentiate coefficients and interpret them as odds-ratios:

Show code
summarisedORs_study_2_weighted_logistic_regression <- exp(cbind(OR = coef(study_2_weighted_logistic_regression), confint(study_2_weighted_logistic_regression)))

summarisedORs_study_2_weighted_logistic_regression %>% as.data.frame() %>%
  rownames_to_column() %>%
  mutate(across(c(OR, `2.5 %`, `97.5 %`), ~round(., 4))) %>%
  arrange(desc(OR)) %>%
   gt() %>%
  tab_header(md("**Odds ratios with 95% CIs: Study 2**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Odds ratios with 95% CIs: Study 2
OR Lower CI: 2.5% Upper CI: 97.5%
PAS_6M_TOTAL_WITHDRAW_METHODS 1.4343 1.2948 1.5889
PAS_6M_TOTAL_DEPOSIT_METHODS 1.1541 1.0772 1.2364
PAS_6M_N_ACTIVITIES 1.0425 1.0236 1.0618
AGE 1.0250 1.0211 1.0289
PAS_6M_MONTHLY_BET_FREQ_DAYS 1.0126 1.0018 1.0235
PAS_6M_MEAN_DEPOSIT_AMOUNT 0.9993 0.9990 0.9997
DAYS_SINCE_LAST_BET 0.9924 0.9904 0.9945
(Intercept) 0.0199 0.0154 0.0257

Calculate Nagelkerke’s R squared for the model:

Show code
NagelkerkeR2_S2<- NagelkerkeR2(study_2_weighted_logistic_regression) %>% 
  as.data.frame() %>%
  print()
      N         R2
1 24793 0.08520162
Show code
# NOTE: R2 is calculated as a proportion here, so in the paper we multiplied it by 100 to form a percentage value

The model accounts for roughly 8.52% of the variance in uptake/response.
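
The conversion mentioned in the note above is just the rounded product:

Show code
# Convert the Nagelkerke R2 proportion to the percentage reported in the paper:
round(NagelkerkeR2_S2$R2 * 100, 2)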

Calculate overall p-value for model by comparing it with a null model:

Show code
anova(study_2_weighted_logistic_regression, 
      update(study_2_weighted_logistic_regression, ~1), # update here produces null model for comparison 
      test="Chisq")
Working (Rao-Scott) LRT for AGE DAYS_SINCE_LAST_BET PAS_6M_TOTAL_WITHDRAW_METHODS PAS_6M_MONTHLY_BET_FREQ_DAYS PAS_6M_TOTAL_DEPOSIT_METHODS PAS_6M_MEAN_DEPOSIT_AMOUNT PAS_6M_N_ACTIVITIES
 in svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS + 
    PAS_6M_MONTHLY_BET_FREQ_DAYS + PAS_6M_TOTAL_DEPOSIT_METHODS + 
    PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_N_ACTIVITIES, design = survey_design_regression, 
    family = quasibinomial())
Working 2logLR =  745.6723 p= < 2.22e-16 
(scale factors:  1.4 1.1 1 0.98 0.92 0.84 0.73 )
Show code
# Set seed so that labels are consistent in the plot:
set.seed(200)

ggstatsplot::ggcoefstats(study_2_weighted_logistic_regression,
                         digits = 4,
                         stats.label.color = "#00798c",
                         sort = "ascending",
                         vline.args = list(size = 1, color = "black"),
                         # ggstatsplot.layer = FALSE,
                          exclude.intercept = TRUE, # hide the intercept
                         stats.label.args = list(size = 3, direction = "y")) +
  scale_x_continuous(name = "Regression coefficient", limits = c(-0.1, 0.5), breaks = c(-.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5)) +
  scale_y_discrete(name = "", labels = c("Days since\n last bet",
                                         "Mean deposit\n amount",
                                         "Mean monthly bet\n freq (days)",
                                         "Age",
                                         "No. of activities",
                                         "No. deposit\n methods",
                                         "No. withdrawal\n methods")) +
  theme(panel.grid.minor = element_blank(),
        axis.line = element_blank()) +
  theme(axis.text = element_text(color = "black", size = 9, face = "plain"))+
  theme_light(base_family = "Poppins") +
  # theme(plot.margin = unit(c(.1,.1,.3,-3), "cm")) +
  theme(axis.text.y = element_text(color="black", size=12, face="plain", family = "Poppins")) +
  theme(axis.title.x = element_text(color="black", size=12, face ="plain", vjust = 0, family = "Poppins")) 

Show code
# Set height to 7 and width to 5.5 in portrait mode
ggsave("Figures/Study 2 log regression plot.pdf",
       width = 5,
       height = 7,
       units = c("in")
)

Study 2 Subsample Analysis

Okay, as noted by two reviewers during peer review, there is a key distinction in the inclusion criteria between our two studies. Study 1 invited participants who had gambled at least once in the past 30 days, whereas Study 2 invited those who had gambled at least once in the past six months. This difference could plausibly affect our outcomes and help explain the higher non-response bias in Study 2: since more regular gamblers appear to be more inclined to participate in studies, expanding the eligibility criteria to include less frequent gamblers makes responders less representative of the total invited sample. This is a great point, so let’s re-run all Study 2 analyses focusing only on participants who had gambled in the 30 days prior to invitation.

Process the data for analysis by doing the following:

  • Remove anybody who didn’t bet in the 30 days prior to invitation, so that the sample matches Study 1’s inclusion criterion
  • Produce some new sample weights to account for the restricted sample we’re now using, which has a different distribution of flagged vs. non-flagged customers (this step is hidden for commercial sensitivity)
Show code
master_dataset_study2_restricted_initial <- master_dataset_study2 %>%
  filter(PAS_30_BET_FREQ_DAYS != 0) 
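
# NOTE: a weight re-computation step (hidden here for commercial sensitivity)
# is then applied to this initial dataset to produce
# master_dataset_study2_restricted, which is used in all analyses below.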

What percentage of the original sample does this represent?

Show code
nrow(master_dataset_study2_restricted_initial)/nrow(master_dataset_study2)*100
[1] 51.62344

Uptake figures

How many people who had gambled in the last 30 days were invited in total:

Show code
nrow(master_dataset_study2_restricted)
[1] 12799

Let’s now look at how survey uptake varied over the three-week window during which the survey was open, in this restricted sample:

Show code
master_dataset_study2_restricted %>%
  as_tibble() %>%
  filter(SURVEY_STATUS_START != "Not started") %>%
  group_by(START_DATE) %>%
  summarise(Freq = n()) %>%
  ungroup() %>%
  ggplot(aes(x = START_DATE, y = Freq)) +
    plot_theme +
  geom_bar(stat = "identity", fill = "#00798c") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "",
       x = "Date",
       y = "Number of clients") +
  scale_y_continuous(breaks = seq(0, 1500, by = 500), limits = c(0,1500))

Now compute some figures to tell us the rates of uptake for the survey. These figures relate only to the refined, filtered dataset:

Show code
master_dataset_study2_restricted %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Restricted analyses", subtitle = NULL)
Counts for starting the survey: Restricted analyses
Survey status n Percent
Not started 11127 86.93648
Started 1672 13.06352
Show code
master_dataset_study2_restricted %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Restricted analyses", subtitle = NULL)
Counts for completing the survey up to the PGSI: Restricted analyses
Survey status n Percent
Complete 1514 11.829049
Incomplete 158 1.234471
Not started 11127 86.936479
Show code
master_dataset_study2_restricted %>% 
  as_tibble() %>%
  count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Restricted analyses", subtitle = NULL)
Counts for completing the survey in full: Restricted analyses
Survey status n Percent
Complete 1115 8.711618
Incomplete 557 4.351902
Not started 11127 86.936479

We also need to work out how many people were removed by requiring a bet in the past 30 days, and how this relates to survey status:

Show code
master_dataset_study2 %>%
  as_tibble() %>%
filter(PAS_30_BET_FREQ_DAYS == 0) %>% # keep only those who didn't bet in the past 30 days (i.e., the people removed)
  count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
    cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for people removed from dataset because they hadn't gambled in the past 30-months", subtitle = NULL)
Counts for people removed from dataset because they hadn't gambled in the past 30-months
Survey status n
Complete 442
Incomplete 28
Not started 11524

Okay, so this restricted analysis removes 442 people who completed the survey and 28 people who started but didn’t finish it. We still have pretty large numbers of people who completed the survey, so it should be fine to proceed with these restricted analyses.

Compare samples

What are the general demographic and gambling characteristics of those who did and did not take part in the survey?

Before we can do this, we need to create weighted variable summaries to account for the oversampling of the at-risk population:

Show code
# Create survey design:
survey_design_all <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_restricted,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Create survey design objects for each STATUS group:
survey_design_non_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted, STATUS == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted, STATUS == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Get sample sizes for each group
participants_n <- sum(master_dataset_study2_restricted$STATUS == "Participants")
non_participants_n <- sum(master_dataset_study2_restricted$STATUS == "Non-participants")
overall_n <- nrow(master_dataset_study2_restricted)



# [Yes, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]

# To streamline this substantially, let's develop a function to calculate weighted summary for continuous variables and put it all together nicely for tabling later on:
# Function to calculate weighted summary for continuous variables
weighted_summary_continuous <- function(variable, design_all, design_non_participants, design_participants, participants_n, non_participants_n, overall_n) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate means and standard deviations
  mean_all <- svymean(var_formula, design_all, na.rm = TRUE)
  mean_non_ps <- svymean(var_formula, design_non_participants, na.rm = TRUE)
  mean_ps <- svymean(var_formula, design_participants, na.rm = TRUE)
  
  sd_all <- sqrt(svyvar(var_formula, design_all, na.rm = TRUE))
  sd_non_ps <- sqrt(svyvar(var_formula, design_non_participants, na.rm = TRUE))
  sd_ps <- sqrt(svyvar(var_formula, design_participants, na.rm = TRUE))
  
  # Calculate weighted medians
  median_all <- svyquantile(var_formula, design_all, 0.5, na.rm = TRUE)
  median_non_ps <- svyquantile(var_formula, design_non_participants, 0.5, na.rm = TRUE)
  median_ps <- svyquantile(var_formula, design_participants, 0.5, na.rm = TRUE)
   
  # Format for table
  summary_df <- tibble(
    Variable = variable,
    `Overall` = paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),
    `Participants` = paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),
    `Non-participants` = paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2))
  ) %>%
  # Compute effect size
  mutate(
    pooled_sd = sqrt(((participants_n - 1) * sd_ps^2 + (non_participants_n - 1) * sd_non_ps^2) /
                     (participants_n + non_participants_n - 2)),
    cohen_d = (coef(mean_ps) - coef(mean_non_ps)) / pooled_sd,
    SE = sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) + (cohen_d^2 / (2 * (participants_n + non_participants_n)))),
    CI_lower = cohen_d - 1.96 * SE,
    CI_upper = cohen_d + 1.96 * SE,
    Effect_size = paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]")
  ) %>%
  select(Variable, Overall, Participants, `Non-participants`, Effect_size)
  
  return(summary_df)
}

# Apply function to continuous variables
continuous_vars <- c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", 
                     "PAS_6M_BET_FREQ_DAYS", "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", 
                     "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME", "PAS_6M_PERCENT_UNSOCIABLE_BETS", 
                     "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY", "PAS_6M_MEAN_DEPOSIT_AMOUNT", 
                     "PAS_6M_PERCENT_DECLINED_DEPOSITS", "PAS_6M_PERCENT_CANCEL_WITHDRAWS")

all_continuous_summaries <- map_dfr(continuous_vars, 
                                    ~weighted_summary_continuous(.x, survey_design_all, 
                                                                 survey_design_non_participants, survey_design_participants, 
                                                                 participants_n, non_participants_n, overall_n))


# Categorical variables:

# Let's also develop a function to calculate and format percentages for categorical variables that we can join to the above outcomes for tabling. 

# Note: this function computes Cramér's V effect size, which is typically used for categorical variables with more than two levels, but it also works for variables with only two levels (in which case it's equivalent to the phi coefficient).

# UPDATE: Let's just compute odds ratios for the two-level variables, based on a reviewer suggestion (thanks ChatGPT for your help with this one!):

categorical_summary <- function(variable, design_all, design_non_participants, design_participants) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate proportions for overall sample
  prop_table_all <- svytable(var_formula, design_all)
  props_all <- prop.table(prop_table_all) %>%
    as.data.frame() %>%
    mutate(Overall = Freq * 100) %>%
    select(Variable = !!sym(variable), Overall)
  
  # Calculate proportions for non-participants
  prop_table_non_ps <- svytable(var_formula, design_non_participants)
  props_non_ps <- prop.table(prop_table_non_ps) %>%
    as.data.frame() %>%
    mutate('Non-participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Non-participants')
  
  # Calculate proportions for participants
  prop_table_ps <- svytable(var_formula, design_participants)
  props_ps <- prop.table(prop_table_ps) %>%
    as.data.frame() %>%
    mutate('Participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Participants')
  
  # Combine results and format
  summary_df <- props_ps %>%
    left_join(props_non_ps, by = "Variable") %>%
    left_join(props_all, by = "Variable") %>%
    mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%
    replace(is.na(.), "")

  # Determine if variable is binary (i.e., only two levels)
  unique_values <- length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))
  
  if (unique_values == 2 & variable != "GENDER") {
    # Ensure STATUS is treated as a factor with "Non-participants" as reference
    design_all$variables$STATUS <- relevel(as.factor(design_all$variables$STATUS), ref = "Non-participants")

    # Compute Odds Ratio (OR) using survey-weighted logistic regression
    logistic_model <- tryCatch(
      svyglm(as.formula(paste0(variable, " ~ STATUS")), 
             design = design_all, family = quasibinomial()),
      error = function(e) NULL
    )

    if (!is.null(logistic_model)) {
      coef_name <- grep("STATUS", names(coef(logistic_model)), value = TRUE)
      if (length(coef_name) > 0) {
        OR <- exp(coef(logistic_model)[coef_name])
        CI <- exp(confint(logistic_model)[coef_name, ])
        effect_size_text <- paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]")
      } else {
        effect_size_text <- "NA"
      }
    } else {
      effect_size_text <- "NA"
    }

  } else {
    # Compute Cramér’s V for categorical variables with more than two levels
    combined_table <- as.table(rbind(prop_table_non_ps, prop_table_ps))
    V <- cramersV(combined_table)
    V_rounded <- as.matrix(V$output) %>% t() %>% as.data.frame() %>%
      mutate(cramersV = as.numeric(cramersV)) %>%
      mutate(cramersV = round(cramersV, 2))
    
    CI <- confIntV(combined_table)
    CI_info <- CI$intermediate
    CIs <- CI_info$fisherZ.ci %>%
      t() %>%
      as.data.frame() %>%
      mutate_if(is.numeric, round, 2) %>%
      unite('Effect_size', 1, 2, sep = ", ")
    
    effect_size_text <- paste0(V_rounded, " [", CIs, "]")
  }

  # Add effect size to the table
  header_row <- data.frame(Variable = variable, Overall = "", Participants = "", 
                           `Non-participants` = "", `Effect_size` = effect_size_text)

  summary_df <- bind_rows(header_row, summary_df) %>%
    as_tibble() %>%
    select(Variable, Overall, Participants, `Non-participants`, `Effect_size`)

  return(summary_df)
}



# Apply function to categorical variables
categorical_vars <- c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")
all_categorical_summaries <- map_dfr(categorical_vars, 
                                     ~categorical_summary(.x, survey_design_all, 
                                                          survey_design_non_participants, survey_design_participants))

all_categorical_summaries <- all_categorical_summaries %>%
# Recode for consistency across summaries:
mutate(Variable = case_when(Variable == 1 ~ "Yes",
                            Variable == 0 ~ "No",
                            Variable == TRUE ~ "Yes",
                            Variable == FALSE ~ "No",
                            TRUE ~ Variable))

# Combine continuous and categorical summaries
weighted_comparison_summary_data <- bind_rows(all_continuous_summaries, all_categorical_summaries) %>%
  rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols = "Participants") %>%
  rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols = "Non-participants") %>%
  rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols = "Overall")

# Order data for table (as we have duplicate names in the Variable column, this is a pain in the butt: the variables need to be separated and rejoined in the right order)
gender_summary_tabled<- weighted_comparison_summary_data[15:18,1:5]
age_summary_tabled<- weighted_comparison_summary_data[1,1:5]
summary_table_extracted_vars_missing<- weighted_comparison_summary_data %>%
  filter(Variable !="AGE",
    Variable !="GENDER",
         Variable !="Male",
         Variable !="Female",
         Variable !="Unknown"
         )

# Join everything back together and provide some nice names for our variables:
weighted_comparison_summary_data_ordered <- bind_rows(age_summary_tabled,
          gender_summary_tabled,
          summary_table_extracted_vars_missing) %>%
   mutate(Variable = dplyr::recode(Variable,
    `AGE` = "Age",
    "GENDER" = "Gender",
    "IRSAD_SCORE" = "IRSAD score",
    "DAYS_REGISTERED" = "Days since registered",
    "DAYS_SINCE_LAST_BET" = "No. days since last bet",
    "PAS_6M_BET_FREQ_DAYS" = "No. betting days",
    "PAS_6M_BET_INTENSITY" = "Mean bets per active day",
    "PAS_6M_BET_MEAN_STAKE" = "Mean stake",
    "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" = "Mean net outcome per month",
    "PAS_6M_PERCENT_UNSOCIABLE_BETS" = "Percentage of bets midnight - 6AM",
    "PAS_6M_TOTAL_DEPOSIT_FREQ" = "No. deposits",
    "PAS_6M_DEPOSIT_INTENSITY" = "Mean no. deposits per active day",
    "PAS_6M_MEAN_DEPOSIT_AMOUNT" = "Mean deposit amount",
    "PAS_6M_PERCENT_DECLINED_DEPOSITS" = "Percentage of deposits declined",
    "PAS_6M_PERCENT_CANCEL_WITHDRAWS" = "Percentage of withdrawals cancelled",
    "EVER_USED_ANY_TOOL" = "Ever used a CPT",
    "ACTIVE_DEPOSIT_LIMIT" = "Active deposit limit",
    "RE_PAS_6M_ANY_TRIGGER" = "Flagged as 'at-risk' ≥ 1 times"
  ))


# Create final table:
weighted_comparison_summary_data_ordered %>%
  rename("Effect size" = Effect_size
         ) %>%
  gt() %>%
   tab_header(title = "Table 4.   Study 2 (past 30-day bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics") %>%
  tab_options(
    data_row.padding = px(1.5)
  ) %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramer's V [95% CIs])",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Table 4. Study 2 (past 30-day bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics
Variable Overall (N = 12799)1 Participants (N = 1514)1 Non-participants (N = 11285) Effect size2
Age 40.62 (15.44), 38 44.47 (15.39), 44 40.15 (15.38), 38 0.28 [0.23, 0.33]
Gender NA 0.03 [0.02, 0.03]
Female 14.18% 12.4% 14.4% NA
Male 84.72% 85.94% 84.57% NA
Unknown 1.1% 1.66% 1.03% NA
IRSAD score 1002.69 (75.08), 994.21 1004.22 (75.25), 994.59 1002.5 (75.06), 994.21 0.02 [-0.03, 0.08]
Days since registered 1259.45 (926.89), 1097 1393.52 (971.99), 1214 1243.06 (919.93), 1070 0.16 [0.11, 0.22]
No. days since last bet 8.6 (8.64), 4 5.91 (6.96), 3 8.93 (8.77), 5 -0.35 [-0.41, -0.3]
No. betting days 40.36 (38.75), 27 51.1 (43.5), 38 39.05 (37.93), 26 0.31 [0.26, 0.37]
Mean bets per active day 9.22 (15.7), 5.11 11.89 (19.35), 6.53 8.89 (15.16), 4.97 0.19 [0.14, 0.24]
Mean stake 32.15 (144.94), 10.4 26.65 (111.65), 9.38 32.82 (148.49), 10.59 -0.04 [-0.1, 0.01]
Mean net outcome per month -251.88 (2584.18), -36.03 -316.48 (1321.09), -46.34 -243.98 (2698.28), -35 -0.03 [-0.08, 0.03]
Percentage of bets midnight - 6AM 4.32 (9.42), 0.06 4.21 (8.66), 0.45 4.34 (9.51), 0 -0.01 [-0.07, 0.04]
No. deposits 60.74 (154.29), 14 85.34 (195.48), 21 57.74 (148.2), 13 0.18 [0.12, 0.23]
Mean no. deposits per active day 1.14 (1.49), 0.81 1.33 (1.85), 0.89 1.12 (1.43), 0.8 0.14 [0.09, 0.19]
Mean deposit amount 84.55 (348.17), 37.5 72.73 (199.6), 38.78 85.99 (362.15), 37.5 -0.04 [-0.09, 0.02]
Percentage of deposits declined 6.18 (12.35), 0 7.44 (14.88), 0 6.03 (12), 0 0.11 [0.06, 0.17]
Percentage of withdrawals cancelled 2.96 (11.33), 0 3.75 (13.02), 0 2.87 (11.11), 0 0.08 [0.02, 0.13]
Ever used a CPT NA 1.25 [1.06, 1.47]
No 83.17% 80.23% 83.53% NA
Yes 16.83% 19.77% 16.47% NA
Active deposit limit NA 1.12 [0.91, 1.38]
No 90.35% 89.42% 90.46% NA
Yes 9.65% 10.58% 9.54% NA
Flagged as 'at-risk' ≥ 1 times NA 1.34 [1.2, 1.5]
No 95.97% 94.86% 96.11% NA
Yes 4.03% 5.14% 3.89% NA
1 Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
2 Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramér's V [95% CIs]

PGSI

Also check the PGSI scores of the survey sample, for inclusion in the table later on. For this, we will remove anyone who didn’t pass the attention checks, to get the most reliable estimate:

Show code
survey_design_participants_filtered <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted, STATUS == "Participants" &
                  (ATTENTION_CHECK_1 == "PASSED" | is.na(ATTENTION_CHECK_1)) &
                  (ATTENTION_CHECK_2 == "PASSED" | is.na(ATTENTION_CHECK_2))),
  weights = ~ELIGIBLE_SURVEY_SAMPLE_WEIGHTS
)

# Weighted values:
mean_ps_PGSI <- svymean(~PGSI_TOTAL, survey_design_participants_filtered, na.rm = TRUE)
sd_ps_PGSI <- sqrt(svyvar(~PGSI_TOTAL, survey_design_participants_filtered, na.rm = TRUE))
median_ps_PGSI <- svyquantile(~PGSI_TOTAL, survey_design_participants_filtered, c(0.5), na.rm = TRUE)

# Extract desired values
mean_ps_PGSI_value <- coef(mean_ps_PGSI)
sd_ps_PGSI_value <- sd_ps_PGSI[1]
median_ps_PGSI_value <- coef(median_ps_PGSI) 

# Create sentence summarising results: 
summary_sentence_PGSI_study2 <- sprintf("The mean PGSI total score is %.2f (standard deviation = %.2f) and the median is %.2f.", 
                            mean_ps_PGSI_value, sd_ps_PGSI_value, median_ps_PGSI_value)

summary_sentence_PGSI_study2
[1] "The mean PGSI total score is 3.76 (standard deviation = 4.19) and the median is 2.00."

Non-Weighted comparisons

For transparency, we have also computed non-weighted comparisons between groups using the tbl_summary() function, as we did for Study 1 above:

Show code
master_dataset_study2_restricted %>%
  tbl_summary(
    include = c(# Demographic details
      AGE, 
      GENDER, 
      IRSAD_SCORE,
      DAYS_REGISTERED, 
       # Wagering
      DAYS_SINCE_LAST_BET, 
      PAS_6M_BET_FREQ_DAYS,
      PAS_6M_BET_INTENSITY,
      PAS_6M_BET_MEAN_STAKE,
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
      PAS_6M_PERCENT_UNSOCIABLE_BETS,
      # Transactions
      PAS_6M_TOTAL_DEPOSIT_FREQ,
      PAS_6M_DEPOSIT_INTENSITY,
      PAS_6M_MEAN_DEPOSIT_AMOUNT,
      PAS_6M_PERCENT_DECLINED_DEPOSITS,
      PAS_6M_PERCENT_CANCEL_WITHDRAWS,
      # Use of consumer protection tools
      EVER_USED_ANY_TOOL,
      ACTIVE_DEPOSIT_LIMIT,
      RE_PAS_6M_ANY_TRIGGER),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous2() ~ c(
      "{median} [{p25} - {p75}]",
      "{mean} ({sd})")),
    by = STATUS,
    label = c(
      AGE ~ "Age",
      GENDER ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      DAYS_REGISTERED ~ "No. of days since accounts registered",
      DAYS_SINCE_LAST_BET ~ "No. days since last bet",
      PAS_6M_BET_FREQ_DAYS ~ "No. betting days",
      PAS_6M_BET_INTENSITY ~ "Mean bets per active day",
      PAS_6M_BET_MEAN_STAKE ~ "Mean stake",
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~ "Mean net outcome per month",
      PAS_6M_PERCENT_UNSOCIABLE_BETS ~ "Percentage of bets placed between midnight & 6AM",
      PAS_6M_TOTAL_DEPOSIT_FREQ ~ "No. deposits",
      PAS_6M_DEPOSIT_INTENSITY ~ "Mean no. deposits per active betting day",
      PAS_6M_MEAN_DEPOSIT_AMOUNT ~ "Mean deposit amount",
      PAS_6M_PERCENT_DECLINED_DEPOSITS ~ "Percentage of deposits declined",
      PAS_6M_PERCENT_CANCEL_WITHDRAWS ~ "Percentage of withdrawals cancelled",
      EVER_USED_ANY_TOOL ~ "Ever used a SG",
      ACTIVE_DEPOSIT_LIMIT ~ "Active deposit limit",
      RE_PAS_6M_ANY_TRIGGER ~ "Flagged as 'at-risk' ≥ 1 times")
  ) %>%
  add_difference(
    list(all_continuous() ~ "cohens_d",  # For continuous variables
    all_categorical() ~ "smd")       # For categorical variables, you can use standardized mean difference or any other suitable method
  ) %>%
  modify_header(label = "**Variable**") %>% 
  modify_caption("**Table 3. Comparison of participants and non-participants: Demographic and gambling characteristics**") %>%
  bold_labels() %>%
  as_gt() %>%
  tab_options(data_row.padding = px(1.5))
Study 2 (past 30-day bettors) - Non-weighted comparison of participants and non-participants: Demographic and gambling characteristics
Variable Participants, N = 1,5141 Non-participants, N = 11,2851 Difference2 95% CI2,3
Age

0.21 0.16, 0.27
    Median [IQR] 41 [30 - 54] 37 [27 - 50]

    Mean (SD) 43 (15) 39 (15)

Gender

0.07 0.02, 0.12
    Female 158 (10%) 1,363 (12%)

    Male 1,327 (88%) 9,774 (87%)

    Unknown 29 (1.9%) 148 (1.3%)

IRSAD score

0.00 -0.06, 0.05
    Median [IQR] 1,000 [953 - 1,060] 998 [953 - 1,059]

    Mean (SD) 1,005 (75) 1,005 (76)

    Unknown 1 94

No. of days since accounts registered

0.15 0.09, 0.20
    Median [IQR] 1,250 [545 - 2,205] 1,135 [479 - 1,969]

    Mean (SD) 1,429 (998) 1,291 (942)

No. days since last bet

-0.33 -0.39, -0.28
    Median [IQR] 2 [1 - 6] 4 [2 - 12]

    Mean (SD) 5 (7) 8 (8)

No. betting days

0.36 0.31, 0.41
    Median [IQR] 54 [24 - 101] 36 [15 - 74]

    Mean (SD) 66 (49) 50 (44)

Mean bets per active day

0.21 0.15, 0.26
    Median [IQR] 9 [4 - 19] 6 [3 - 13]

    Mean (SD) 16 (25) 12 (22)

Mean stake

-0.06 -0.11, 0.00
    Median [IQR] 17 [7 - 54] 17 [7 - 56]

    Mean (SD) 75 (277) 96 (381)

Mean net outcome per month

-0.02 -0.07, 0.04
    Median [IQR] -149 [-1,230 - -12] -73 [-631 - -6]

    Mean (SD) -1,111 (3,185) -989 (7,240)

Percentage of bets placed between midnight & 6AM

0.01 -0.04, 0.07
    Median [IQR] 1 [0 - 6] 1 [0 - 6]

    Mean (SD) 5 (9) 5 (10)

No. deposits

0.28 0.23, 0.33
    Median [IQR] 57 [10 - 241] 27 [6 - 119]

    Mean (SD) 206 (392) 123 (280)

Mean no. deposits per active betting day

0.24 0.18, 0.29
    Median [IQR] 1.36 [0.50 - 2.91] 1.04 [0.40 - 2.12]

    Mean (SD) 2.31 (3.07) 1.73 (2.35)

Mean deposit amount

-0.06 -0.11, 0.00
    Median [IQR] 53 [25 - 134] 51 [22 - 150]

    Mean (SD) 155 (493) 206 (957)

Percentage of deposits declined

0.09 0.04, 0.14
    Median [IQR] 2 [0 - 9] 1 [0 - 9]

    Mean (SD) 8 (14) 7 (12)

Percentage of withdrawals cancelled

0.10 0.04, 0.15
    Median [IQR] 0 [0 - 4] 0 [0 - 0]

    Mean (SD) 9 (19) 7 (18)

Ever used a CPT 378 (25%) 2,341 (21%) 0.10 0.05, 0.15
Active deposit limit 161 (11%) 1,167 (10%) 0.01 -0.04, 0.06
Flagged as 'at-risk' ≥ 1 times 560 (37%) 3,443 (31%) 0.14 0.08, 0.19
1 n (%)
2 Cohen’s D; Standardized Mean Difference
3 CI = Confidence Interval

Sensitivity check comparison

Again, let’s see how this comparison differs if we define Participants by full survey completion:

Show code
# Check coding of variable that tells us whether somebody fully completed the survey or not
master_dataset_study2_restricted %>%
  count(STATUS,
           SURVEY_STATUS_FULL)
# A tibble: 4 × 3
  STATUS           SURVEY_STATUS_FULL     n
  <fct>            <chr>              <int>
1 Participants     Complete            1115
2 Participants     Incomplete           399
3 Non-participants Incomplete           158
4 Non-participants Not started        11127
Show code
master_dataset_study2_sensitivity_check   <- master_dataset_study2_restricted %>%
  mutate(SURVEY_STATUS_FULL_BINARY = case_when(SURVEY_STATUS_FULL == "Incomplete" ~ "Non-participants",
                                               SURVEY_STATUS_FULL == "Not started" ~ "Non-participants",
         TRUE ~ "Participants")) %>%
 mutate(SURVEY_STATUS_FULL_BINARY = fct_relevel(SURVEY_STATUS_FULL_BINARY,
                            "Participants",
                            "Non-participants")) # fix order of presentation

# Check recoding has worked:
master_dataset_study2_sensitivity_check %>%
  count(STATUS,
           SURVEY_STATUS_FULL_BINARY)
# A tibble: 3 × 3
  STATUS           SURVEY_STATUS_FULL_BINARY     n
  <fct>            <fct>                     <int>
1 Participants     Participants               1115
2 Participants     Non-participants            399
3 Non-participants Non-participants          11285
Show code
# Update survey designs:
survey_design_all <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_sensitivity_check,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Create survey design objects for each STATUS group:
survey_design_non_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

# Get sample sizes for each group
participants_n <- sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY == "Participants")
non_participants_n <- sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY == "Non-participants")
overall_n <- nrow(master_dataset_study2_sensitivity_check)




# Produce updated table
# [Again, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]

# To streamline this substantially, let's develop a function to calculate weighted summary for continuous variables and put it all together nicely for tabling later on:
# Function to calculate weighted summary for continuous variables
weighted_summary_continuous <- function(variable, design_all, design_non_participants, design_participants, participants_n, non_participants_n, overall_n) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate means and standard deviations
  mean_all <- svymean(var_formula, design_all, na.rm = TRUE)
  mean_non_ps <- svymean(var_formula, design_non_participants, na.rm = TRUE)
  mean_ps <- svymean(var_formula, design_participants, na.rm = TRUE)
  
  sd_all <- sqrt(svyvar(var_formula, design_all, na.rm = TRUE))
  sd_non_ps <- sqrt(svyvar(var_formula, design_non_participants, na.rm = TRUE))
  sd_ps <- sqrt(svyvar(var_formula, design_participants, na.rm = TRUE))
  
  # Calculate weighted medians
  median_all <- svyquantile(var_formula, design_all, 0.5, na.rm = TRUE)
  median_non_ps <- svyquantile(var_formula, design_non_participants, 0.5, na.rm = TRUE)
  median_ps <- svyquantile(var_formula, design_participants, 0.5, na.rm = TRUE)
   
  # Format for table
  summary_df <- tibble(
    Variable = variable,
    `Overall` = paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),
    `Participants` = paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),
    `Non-participants` = paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2))
  ) %>%
  # Compute effect size
  mutate(
    pooled_sd = sqrt(((participants_n - 1) * sd_ps^2 + (non_participants_n - 1) * sd_non_ps^2) /
                     (participants_n + non_participants_n - 2)),
    cohen_d = (coef(mean_ps) - coef(mean_non_ps)) / pooled_sd,
    SE = sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) + (cohen_d^2 / (2 * (participants_n + non_participants_n)))),
    CI_lower = cohen_d - 1.96 * SE,
    CI_upper = cohen_d + 1.96 * SE,
    Effect_size = paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]")
  ) %>%
  select(Variable, Overall, Participants, `Non-participants`, Effect_size)
  
  return(summary_df)
}

# Apply function to continuous variables
continuous_vars <- c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", 
                     "PAS_6M_BET_FREQ_DAYS", "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", 
                     "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME", "PAS_6M_PERCENT_UNSOCIABLE_BETS", 
                     "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY", "PAS_6M_MEAN_DEPOSIT_AMOUNT", 
                     "PAS_6M_PERCENT_DECLINED_DEPOSITS", "PAS_6M_PERCENT_CANCEL_WITHDRAWS")

all_continuous_summaries <- map_dfr(continuous_vars, 
                                    ~weighted_summary_continuous(.x, survey_design_all, 
                                                                 survey_design_non_participants, survey_design_participants, 
                                                                 participants_n, non_participants_n, overall_n))


# Categorical variables:

# Let's also develop a function to calculate and format percentages for categorical variables that we can join to the above outcomes for tabling. 

# Note: this function computes Cramér's V effect size, which is typically used for categorical variables with more than two levels, but it also works for variables with only two levels (in which case it's equivalent to the phi coefficient).

# UPDATE: Let's just compute odds ratios for the two-level variables, based on a reviewer suggestion (thanks ChatGPT for your help with this one!):

categorical_summary <- function(variable, design_all, design_non_participants, design_participants) {
  var_formula <- as.formula(paste0("~", variable))
  
  # Calculate proportions for overall sample
  prop_table_all <- svytable(var_formula, design_all)
  props_all <- prop.table(prop_table_all) %>%
    as.data.frame() %>%
    mutate(Overall = Freq * 100) %>%
    select(Variable = !!sym(variable), Overall)
  
  # Calculate proportions for non-participants
  prop_table_non_ps <- svytable(var_formula, design_non_participants)
  props_non_ps <- prop.table(prop_table_non_ps) %>%
    as.data.frame() %>%
    mutate('Non-participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Non-participants')
  
  # Calculate proportions for participants
  prop_table_ps <- svytable(var_formula, design_participants)
  props_ps <- prop.table(prop_table_ps) %>%
    as.data.frame() %>%
    mutate('Participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Participants')
  
  # Combine results and format
  summary_df <- props_ps %>%
    left_join(props_non_ps, by = "Variable") %>%
    left_join(props_all, by = "Variable") %>%
    mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%
    replace(is.na(.), "")

  # Determine if variable is binary (i.e., only two levels)
  unique_values <- length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))
  
  if (unique_values == 2 & variable != "GENDER") {
    # For this sensitivity analysis, participation is defined by full survey
    # completion, so the grouping variable must be SURVEY_STATUS_FULL_BINARY
    # (using the original STATUS column here would simply reproduce the
    # main-analysis odds ratios). Treat it as a factor with "Non-participants"
    # as the reference level:
    design_all$variables$SURVEY_STATUS_FULL_BINARY <- relevel(as.factor(design_all$variables$SURVEY_STATUS_FULL_BINARY), ref = "Non-participants")

    # Compute Odds Ratio (OR) using survey-weighted logistic regression
    logistic_model <- tryCatch(
      svyglm(as.formula(paste0(variable, " ~ SURVEY_STATUS_FULL_BINARY")), 
             design = design_all, family = quasibinomial()),
      error = function(e) NULL
    )

    if (!is.null(logistic_model)) {
      coef_name <- grep("STATUS", names(coef(logistic_model)), value = TRUE)
      if (length(coef_name) > 0) {
        OR <- exp(coef(logistic_model)[coef_name])
        CI <- exp(confint(logistic_model)[coef_name, ])
        effect_size_text <- paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]")
      } else {
        effect_size_text <- "NA"
      }
    } else {
      effect_size_text <- "NA"
    }

  } else {
    # Compute Cramér’s V for categorical variables with more than two levels
    combined_table <- as.table(rbind(prop_table_non_ps, prop_table_ps))
    V <- cramersV(combined_table)
    V_rounded <- as.matrix(V$output) %>% t() %>% as.data.frame() %>%
      mutate(cramersV = as.numeric(cramersV)) %>%
      mutate(cramersV = round(cramersV, 2))
    
    CI <- confIntV(combined_table)
    CI_info <- CI$intermediate
    CIs <- CI_info$fisherZ.ci %>%
      t() %>%
      as.data.frame() %>%
      mutate_if(is.numeric, round, 2) %>%
      unite('Effect_size', 1, 2, sep = ", ")
    
    effect_size_text <- paste0(V_rounded, " [", CIs, "]")
  }

  # Add effect size to the table
  header_row <- data.frame(Variable = variable, Overall = "", Participants = "", 
                           `Non-participants` = "", `Effect_size` = effect_size_text)

  summary_df <- bind_rows(header_row, summary_df) %>%
    as_tibble() %>%
    select(Variable, Overall, Participants, `Non-participants`, `Effect_size`)

  return(summary_df)
}



# Apply function to categorical variables
categorical_vars <- c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")
all_categorical_summaries <- map_dfr(categorical_vars, 
                                     ~categorical_summary(.x, survey_design_all, 
                                                          survey_design_non_participants, survey_design_participants))

all_categorical_summaries <- all_categorical_summaries %>%
# Recode for consistency across summaries:
mutate(Variable = case_when(Variable == 1 ~ "Yes",
                            Variable == 0 ~ "No",
                            Variable == TRUE ~ "Yes",
                            Variable == FALSE ~ "No",
                            TRUE ~ Variable))

# Combine continuous and categorical summaries
weighted_comparison_summary_data <- bind_rows(all_continuous_summaries, all_categorical_summaries) %>%
  rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols = "Participants") %>%
  rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols = "Non-participants") %>%
  rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols = "Overall")

# Order data for table (as we have duplicate names in the Variable column, this is a pain in the butt: the variables need to be separated and rejoined in the right order)
gender_summary_tabled<- weighted_comparison_summary_data[15:18,1:5]
age_summary_tabled<- weighted_comparison_summary_data[1,1:5]
summary_table_extracted_vars_missing<- weighted_comparison_summary_data %>%
  filter(Variable !="AGE",
    Variable !="GENDER",
         Variable !="Male",
         Variable !="Female",
         Variable !="Unknown"
         )

# Join everything back together and provide some nice names for our variables:
weighted_comparison_summary_data_ordered <- bind_rows(age_summary_tabled,
          gender_summary_tabled,
          summary_table_extracted_vars_missing) %>%
   mutate(Variable = dplyr::recode(Variable,
    `AGE` = "Age",
    "GENDER" = "Gender",
    "IRSAD_SCORE" = "IRSAD score",
    "DAYS_REGISTERED" = "Days since registered",
    "DAYS_SINCE_LAST_BET" = "No. days since last bet",
    "PAS_6M_BET_FREQ_DAYS" = "No. betting days",
    "PAS_6M_BET_INTENSITY" = "Mean bets per active day",
    "PAS_6M_BET_MEAN_STAKE" = "Mean stake",
    "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" = "Mean net outcome per month",
    "PAS_6M_PERCENT_UNSOCIABLE_BETS" = "Percentage of bets midnight - 6AM",
    "PAS_6M_TOTAL_DEPOSIT_FREQ" = "No. deposits",
    "PAS_6M_DEPOSIT_INTENSITY" = "Mean no. deposits per active day",
    "PAS_6M_MEAN_DEPOSIT_AMOUNT" = "Mean deposit amount",
    "PAS_6M_PERCENT_DECLINED_DEPOSITS" = "Percentage of deposits declined",
    "PAS_6M_PERCENT_CANCEL_WITHDRAWS" = "Percentage of withdrawals cancelled",
    "EVER_USED_ANY_TOOL" = "Ever used a CPT",
    "ACTIVE_DEPOSIT_LIMIT" = "Active deposit limit",
    "RE_PAS_6M_ANY_TRIGGER" = "Flagged as 'at-risk' ≥ 1 times"
  ))


# Create final table:
weighted_comparison_summary_data_ordered %>%
  rename("Effect size" = Effect_size
         ) %>%
  gt() %>%
   tab_header(title = "Table 4.1   Study 2 (past 30-day bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]") %>%
  tab_options(
    data_row.padding = px(1.5)
  ) %>%
  cols_align(
    align = "right",
    columns = c(Variable)
  ) %>%
    cols_align(
    align = "center",
    columns = c(2:4)
  ) %>%
  tab_footnote(
    footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
    locations = cells_column_labels(columns = c(2,3))
  )%>%
  tab_footnote(
    footnote = "Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramer's V [95% CIs])",
    locations = cells_column_labels(columns = vars(`Effect size`))
  )
Table 4.1 Study 2 (past 30-day bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]
Variable Overall (N = 12799)1 Participants (N = 1115)1 Non-participants (N = 11684) Effect size2
Age 40.62 (15.44), 38 43.42 (14.68), 42 40.36 (15.48), 38 0.2 [0.14, 0.26]
Gender NA 0.03 [0.02, 0.03]
Female 14.18% 11.46% 14.43% NA
Male 84.72% 86.99% 84.51% NA
Unknown 1.1% 1.55% 1.05% NA
IRSAD score 1002.69 (75.08), 994.21 1007.51 (75.4), 1000.6 1002.24 (75.03), 993.97 0.07 [0.01, 0.13]
Days since registered 1259.45 (926.89), 1097 1417.35 (964.27), 1225 1244.9 (922.05), 1072 0.19 [0.12, 0.25]
No. days since last bet 8.6 (8.64), 4 5.84 (6.95), 3 8.85 (8.74), 5 -0.35 [-0.41, -0.29]
No. betting days 40.36 (38.75), 27 50.35 (42.96), 39 39.44 (38.21), 27 0.28 [0.22, 0.34]
Mean bets per active day 9.22 (15.7), 5.11 11.05 (18.38), 6.41 9.05 (15.42), 5 0.13 [0.07, 0.19]
Mean stake 32.15 (144.94), 10.4 25.11 (91.67), 9.02 32.8 (148.88), 10.58 -0.05 [-0.11, 0.01]
Mean net outcome per month -251.88 (2584.18), -36.03 -267.81 (1199.21), -41.62 -250.41 (2676.01), -35.5 -0.01 [-0.07, 0.05]
Percentage of bets midnight - 6AM 4.32 (9.42), 0.06 4.26 (8.81), 0.51 4.33 (9.48), 0 -0.01 [-0.07, 0.05]
No. deposits 60.74 (154.29), 14 77.52 (173.67), 19 59.2 (152.3), 14 0.12 [0.06, 0.18]
Mean no. deposits per active day 1.14 (1.49), 0.81 1.26 (1.8), 0.81 1.13 (1.45), 0.81 0.08 [0.02, 0.15]
Mean deposit amount 84.55 (348.17), 37.5 72.06 (204.19), 38 85.7 (358.52), 37.5 -0.04 [-0.1, 0.02]
Percentage of deposits declined 6.18 (12.35), 0 7.39 (14.83), 0 6.07 (12.09), 0 0.11 [0.05, 0.17]
Percentage of withdrawals cancelled 2.96 (11.33), 0 3.61 (13.22), 0 2.9 (11.14), 0 0.06 [0, 0.12]
Ever used a CPT NA 1.25 [1.06, 1.47]
No 83.17% 81.2% 83.35% NA
Yes 16.83% 18.8% 16.65% NA
Active deposit limit NA 1.12 [0.91, 1.38]
No 90.35% 90.33% 90.35% NA
Yes 9.65% 9.67% 9.65% NA
Flagged as 'at-risk' ≥ 1 times NA 1.34 [1.2, 1.5]
No 95.97% 95.61% 96.01% NA
Yes 4.03% 4.39% 3.99% NA
1 Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
2 Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramér's V [95% CIs]

Again, the effect sizes appear to be smaller in these comparisons. Let’s check whether this is because the group who started the survey but didn’t complete it has now been merged with non-responders (making that group more similar to responders), or because those who fully completed the survey are genuinely more similar to those who didn’t start it:

Show code
master_dataset_study2_restricted %>%
  tbl_summary(
    include = c(# Demographic details
      AGE, 
      GENDER, 
      IRSAD_SCORE,
      DAYS_REGISTERED, 
       # Wagering
      DAYS_SINCE_LAST_BET, 
      PAS_6M_BET_FREQ_DAYS,
      PAS_6M_BET_INTENSITY,
      PAS_6M_BET_MEAN_STAKE,
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
      PAS_6M_PERCENT_UNSOCIABLE_BETS,
      # Transactions
      PAS_6M_TOTAL_DEPOSIT_FREQ,
      PAS_6M_DEPOSIT_INTENSITY,
      PAS_6M_MEAN_DEPOSIT_AMOUNT,
      PAS_6M_PERCENT_DECLINED_DEPOSITS,
      PAS_6M_PERCENT_CANCEL_WITHDRAWS,
      # Use of consumer protection tools
      EVER_USED_ANY_TOOL,
      ACTIVE_DEPOSIT_LIMIT,
      RE_PAS_6M_ANY_TRIGGER),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous2() ~ c(
      "{median} [{p25} - {p75}]",
      "{mean} ({sd})")),
    by = SURVEY_STATUS_FULL,
    label = c(
      AGE ~ "Age",
      GENDER ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      DAYS_REGISTERED ~ "No. of days since accounts registered",
      DAYS_SINCE_LAST_BET ~ "No. days since last bet",
      PAS_6M_BET_FREQ_DAYS ~ "No. betting days",
      PAS_6M_BET_INTENSITY ~ "Mean bets per active day",
      PAS_6M_BET_MEAN_STAKE ~ "Mean stake",
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~ "Mean net outcome per month",
      PAS_6M_PERCENT_UNSOCIABLE_BETS ~ "Percentage of bets placed between midnight & 6AM",
      PAS_6M_TOTAL_DEPOSIT_FREQ ~ "No. deposits",
      PAS_6M_DEPOSIT_INTENSITY ~ "Mean no. deposits per active betting day",
      PAS_6M_MEAN_DEPOSIT_AMOUNT ~ "Mean deposit amount",
      PAS_6M_PERCENT_DECLINED_DEPOSITS ~ "Percentage of deposits declined",
      PAS_6M_PERCENT_CANCEL_WITHDRAWS ~ "Percentage of withdrawals cancelled",
      EVER_USED_ANY_TOOL ~ "Ever used a SG",
      ACTIVE_DEPOSIT_LIMIT ~ "Active deposit limit",
      RE_PAS_6M_ANY_TRIGGER ~ "Flagged as 'at-risk' ≥ 1 times")
  ) %>%
  modify_header(label = "Variable") %>% 
  modify_caption("Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey") %>%
  bold_labels() 
Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn’t complete it, and those who didn’t start the survey
Variable Complete, N = 1,1151 Incomplete, N = 5571 Not started, N = 11,1271
Age


    Median [IQR] 40 [30 - 52] 42 [29 - 56] 37 [27 - 50]
    Mean (SD) 42 (14) 44 (16) 39 (15)
Gender


    Female 103 (9.2%) 72 (13%) 1,346 (12%)
    Male 992 (89%) 473 (85%) 9,636 (87%)
    Unknown 20 (1.8%) 12 (2.2%) 145 (1.3%)
IRSAD score


    Median [IQR] 1,002 [955 - 1,065] 990 [949 - 1,055] 998 [953 - 1,059]
    Mean (SD) 1,008 (75) 1,000 (74) 1,005 (76)
    Unknown 1 1 93
No. of days since accounts registered


    Median [IQR] 1,282 [582 - 2,217] 1,165 [463 - 2,115] 1,135 [480 - 1,970]
    Mean (SD) 1,445 (986) 1,362 (1,023) 1,291 (941)
No. days since last bet


    Median [IQR] 3 [1 - 6] 2 [1 - 5] 4 [2 - 13]
    Mean (SD) 5 (7) 5 (6) 8 (8)
No. betting days


    Median [IQR] 52 [23 - 97] 67 [28 - 113] 36 [15 - 74]
    Mean (SD) 63 (47) 73 (50) 50 (44)
Mean bets per active day


    Median [IQR] 8 [4 - 17] 11 [4 - 22] 6 [3 - 13]
    Mean (SD) 15 (23) 18 (27) 12 (22)
Mean stake


    Median [IQR] 15 [6 - 48] 23 [9 - 71] 16 [7 - 55]
    Mean (SD) 67 (222) 99 (351) 96 (383)
Mean net outcome per month


    Median [IQR] -118 [-1,060 - -9] -366 [-1,671 - -28] -72 [-613 - -6]
    Mean (SD) -953 (3,008) -1,504 (3,356) -983 (7,284)
Percentage of bets placed between midnight & 6AM


    Median [IQR] 1 [0 - 6] 1 [0 - 5] 1 [0 - 6]
    Mean (SD) 5 (9) 5 (8) 5 (10)
No. deposits


    Median [IQR] 46 [9 - 205] 92 [21 - 316] 27 [6 - 117]
    Mean (SD) 184 (352) 270 (539) 121 (270)
Mean no. deposits per active betting day


    Median [IQR] 1.26 [0.44 - 2.63] 1.73 [0.83 - 3.53] 1.03 [0.40 - 2.10]
    Mean (SD) 2.17 (2.98) 2.75 (3.73) 1.71 (2.29)
Mean deposit amount


    Median [IQR] 50 [24 - 128] 73 [31 - 164] 50 [22 - 150]
    Mean (SD) 148 (517) 179 (393) 206 (963)
Percentage of deposits declined


    Median [IQR] 2 [0 - 8] 4 [0 - 9] 1 [0 - 9]
    Mean (SD) 8 (14) 8 (14) 7 (12)
Percentage of withdrawals cancelled


    Median [IQR] 0 [0 - 2] 0 [0 - 13] 0 [0 - 0]
    Mean (SD) 8 (19) 12 (22) 7 (18)
Ever used a CPT 267 (24%) 159 (29%) 2,293 (21%)
Active deposit limit 112 (10%) 69 (12%) 1,147 (10%)
Flagged as 'at-risk' ≥ 1 times 370 (33%) 266 (48%) 3,367 (30%)
1 n (%)

As with Study 1, it seems that those who started the survey but didn’t complete it are the more involved/intense bettors.

Composite risk scores

Now let’s produce a single composite score that represents people’s past-six-month gambling behaviour.

One thing we need to consider here is that some of the variables we’ve created could be biased by recent sign-up to the site. For example, one of the variables we created to summarise past-six-month behaviour is the total number of days somebody gambled, and another is the total amount somebody deposited. Both of these totals will be deflated for anybody who registered less than six months before sampling. By contrast, variables that are averages, or daily or monthly summaries, will not be affected by this.
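
As a quick illustration of the scale of this issue (a sketch: the 183-day cut-off is only an approximation of the six-month window), we can check what proportion of the restricted sample registered recently enough for their six-month totals to be truncated:

Show code
# Approximate share of customers whose account is younger than the six-month
# aggregation window (~183 days), for whom the 6-month totals are truncated:
master_dataset_study2_restricted %>%
  as_tibble() %>%
  summarise(PROP_RECENTLY_REGISTERED = mean(DAYS_REGISTERED < 183, na.rm = TRUE))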

Let’s first identify all the variables we engineered that summarise past six month behaviour, then classify them based on whether they will be subject to this bias or not:

Show code
PAS_6M_variables<- master_dataset_study2_restricted %>% 
  select(
         contains("PAS_6M"), # Identify all relevant past six months aggregate variables
         -contains("RE_"), # Remove risk identification system variables
         -contains("INCOME") # Remove variables relating  to percentage of income which are only available for survey participants
         ) %>%
  colnames() %>%
  as.data.frame() %>%
  rename(Variables = 1) %>%
  mutate(`Biased by sign-up date` = case_when(
                     Variables == "PAS_6M_BET_FREQ_DAYS" ~ "Yes", 
                     Variables == "PAS_6M_BET_FREQ_BETS" ~ "Yes", 
                     Variables == "PAS_6M_BET_MEAN_STAKE" ~ "No", 
                     Variables == "PAS_6M_BET_SD_STAKE" ~ "No", 
                     Variables == "PAS_6M_BET_SPEND" ~ "Yes", 
                     Variables == "PAS_6M_MEAN_ODDS" ~ "Yes", 
                     Variables == "PAS_6M_N_ACTIVITIES" ~ "No", 
                     Variables == "PAS_6M_BET_INTENSITY" ~ "No", 
                     Variables == "PAS_6M_BET_WIN" ~ "Yes", 
                     Variables == "PAS_6M_NET_OUTCOME" ~ "Yes", 
                     Variables == "PAS_6M_MAX_DAY_STAKE" ~ "No", 
                     Variables == "PAS_6M_MAX_DAY_BETS" ~ "No", 
                     Variables == "PAS_6M_MEAN_DAILY_STAKE" ~ "No", 
                     Variables == "PAS_6M_SD_DAILY_STAKE" ~ "No", 
                     Variables == "PAS_6M_TOTAL_UNSOCIABLE_BETS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_UNSOCIABLE_BETS" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_BET_FREQ_DAYS" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_SPEND" ~ "No", 
                     Variables == "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" ~ "No",
                     Variables == "PAS_6M_TOTAL_DEPOSIT_AMOUNT" ~ "Yes", 
                     Variables == "PAS_6M_MEAN_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_SD_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_TOTAL_DEPOSIT_FREQ" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_DECLINED_DEPOSITS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_DECLINED_DEPOSITS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_DEPOSIT_METHODS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_FREQ" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_AMOUNT" ~ "Yes", 
                     Variables == "PAS_6M_TOTAL_CANCEL_WITHDRAWS" ~ "Yes", 
                     Variables == "PAS_6M_PERCENT_CANCEL_WITHDRAWS" ~ "No", 
                     Variables == "PAS_6M_TOTAL_WITHDRAW_METHODS" ~ "No", 
                     Variables == "PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_MONTHLY_DEPOSIT_AMOUNT" ~ "No", 
                     Variables == "PAS_6M_DEPOSIT_INTENSITY" ~ "No",
                     TRUE ~ Variables)) 

# Table variables:
PAS_6M_variables %>%
  gt()%>%
  tab_header("All past-6-month aggregate summary variables", subtitle = "With risk of bias status for each variable")
All past-6-month aggregate summary variables
With risk of bias status for each variable
Variables Biased by sign-up date
PAS_6M_BET_FREQ_DAYS Yes
PAS_6M_BET_FREQ_BETS Yes
PAS_6M_BET_FREQ_INCL_LEGS Yes
PAS_6M_BET_MEAN_STAKE No
PAS_6M_BET_SD_STAKE No
PAS_6M_BET_SPEND Yes
PAS_6M_MEAN_ODDS Yes
PAS_6M_N_ACTIVITIES No
PAS_6M_BET_INTENSITY No
PAS_6M_BET_WIN Yes
PAS_6M_NET_OUTCOME Yes
PAS_6M_MAX_DAY_STAKE No
PAS_6M_MAX_DAY_BETS No
PAS_6M_MEAN_DAILY_STAKE No
PAS_6M_SD_DAILY_STAKE No
PAS_6M_TOTAL_UNSOCIABLE_BETS Yes
PAS_6M_PERCENT_UNSOCIABLE_BETS No
PAS_6M_MONTHLY_BET_FREQ_DAYS No
PAS_6M_MONTHLY_SPEND No
PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME No
PAS_6M_TOTAL_DEPOSIT_AMOUNT Yes
PAS_6M_MEAN_DEPOSIT_AMOUNT No
PAS_6M_SD_DEPOSIT_AMOUNT No
PAS_6M_TOTAL_DEPOSIT_FREQ Yes
PAS_6M_TOTAL_DECLINED_DEPOSITS Yes
PAS_6M_PERCENT_DECLINED_DEPOSITS No
PAS_6M_TOTAL_DEPOSIT_METHODS No
PAS_6M_TOTAL_WITHDRAW_FREQ Yes
PAS_6M_TOTAL_WITHDRAW_AMOUNT Yes
PAS_6M_TOTAL_CANCEL_WITHDRAWS Yes
PAS_6M_PERCENT_CANCEL_WITHDRAWS No
PAS_6M_TOTAL_WITHDRAW_METHODS No
PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT No
PAS_6M_MONTHLY_DEPOSIT_AMOUNT No
PAS_6M_MONTHLY_DEPOSIT_FREQ No
PAS_6M_DEPOSIT_INTENSITY No

Okay, now that we have all of the relevant variables, let’s look at how they correlate with PGSI and harm scores, using a series of bivariate correlations visualised as a heatmap. These results are unweighted:

Show code
PGSI_cor_data<- master_dataset_study2_restricted %>% 
    select(PGSI_TOTAL,
        GHM_TOTAL,
       PAS_6M_BET_MEAN_STAKE, 
       PAS_6M_BET_SD_STAKE, 
       PAS_6M_N_ACTIVITIES, 
       PAS_6M_BET_INTENSITY, 
       PAS_6M_MAX_DAY_STAKE, 
       PAS_6M_MAX_DAY_BETS, 
       PAS_6M_MEAN_DAILY_STAKE, 
       PAS_6M_SD_DAILY_STAKE, 
       PAS_6M_PERCENT_UNSOCIABLE_BETS, 
       PAS_6M_MONTHLY_BET_FREQ_DAYS, 
       PAS_6M_MONTHLY_SPEND, 
       PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME,
       PAS_6M_MEAN_DEPOSIT_AMOUNT, 
       PAS_6M_SD_DEPOSIT_AMOUNT, 
       PAS_6M_PERCENT_DECLINED_DEPOSITS, 
       PAS_6M_TOTAL_DEPOSIT_METHODS, 
       PAS_6M_PERCENT_CANCEL_WITHDRAWS, 
       PAS_6M_TOTAL_WITHDRAW_METHODS, 
       PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT, 
       PAS_6M_MONTHLY_DEPOSIT_AMOUNT, 
       PAS_6M_DEPOSIT_INTENSITY)

ggstatsplot::ggcorrmat(PGSI_cor_data, 
                       type = "parametric",
                       results.subtitle = FALSE,
                       ggcorrplot.args = c(lab_size = 3)) +
  theme(axis.text.y = element_text(color = "black", size = 9, face = "plain", family = "Poppins"),
        axis.text.x = element_text(color = "black", size = 9, face = "plain", family = "Poppins"),
        legend.title = element_text(color = "black", size = 12, face = "plain", family = "Poppins"),
        legend.title.align = .5,
        legend.text = element_text(color = "black", size = 8, face = "plain", family = "Poppins"))

Now let’s use all of these variables to create a composite index score of risk/involvement:

Show code
master_dataset_study2_restricted_with_risk_scores <- master_dataset_study2_restricted %>%
  mutate(PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME = PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME*-1) %>% # Invert outcome values
  mutate(across(c(PAS_6M_BET_MEAN_STAKE, 
                  PAS_6M_BET_SD_STAKE, 
                  PAS_6M_N_ACTIVITIES, 
                  PAS_6M_BET_INTENSITY, 
                  PAS_6M_MAX_DAY_STAKE, 
                  PAS_6M_MAX_DAY_BETS, 
                  PAS_6M_MEAN_DAILY_STAKE, 
                  PAS_6M_SD_DAILY_STAKE, 
                  PAS_6M_PERCENT_UNSOCIABLE_BETS, 
                  PAS_6M_MONTHLY_BET_FREQ_DAYS, 
                  PAS_6M_MONTHLY_SPEND, 
                  PAS_6M_MEAN_DEPOSIT_AMOUNT, 
                  PAS_6M_SD_DEPOSIT_AMOUNT, 
                  PAS_6M_PERCENT_DECLINED_DEPOSITS, 
                  PAS_6M_TOTAL_DEPOSIT_METHODS, 
                  PAS_6M_PERCENT_CANCEL_WITHDRAWS, 
                  PAS_6M_TOTAL_WITHDRAW_METHODS, 
                  PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT, 
                  PAS_6M_MONTHLY_DEPOSIT_AMOUNT, 
                  PAS_6M_DEPOSIT_INTENSITY), 
                ~ as.numeric(scale(.x)), # z-standardise; as.numeric() drops the one-column matrix that scale() returns
                .names = "z_{.col}")) %>% # Create new named columns for easy selection below
  rowwise() %>% # Group by Rows
  mutate(`Composite Risk Score` = sum(c_across(starts_with("z_")), na.rm = TRUE)) %>% # Sum across all newly created Z scores
  ungroup()


# colnames(master_dataset_study2_restricted_with_risk_scores)
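
As a quick sanity check, the distribution of the new composite can be inspected before it is used; because it is a sum of z-scores, it should centre near zero:

Show code
# Inspect the distribution of the composite score:
summary(master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`)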

Correlations

Now let’s look at how well this composite score correlates with PGSI scores in the survey sample:

Show code
master_dataset_study2_restricted_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>%
  mutate(PGSI_CATEGORY = factor(PGSI_CATEGORY, levels = c('No risk',
                                                          'Low risk',
                                                          'Moderate risk',
                                                          'High risk'))) %>%
  ggplot() +
  geom_point(aes(x = PGSI_TOTAL, y = `Composite Risk Score`,
                 colour = PGSI_CATEGORY)) +
  geom_smooth(aes(x = PGSI_TOTAL, y = `Composite Risk Score`), 
              method = "lm",
              se = FALSE,
              color = "black") +
  scale_colour_manual(values = c("#2e4057",
                                 "#00798c", 
                                 "#edae49",
                                 "#d1495b")) +
  geom_hline(yintercept = 0, colour = "grey", linetype = "dashed") +
  scale_y_continuous(limits = c(-10, 120)) +
  plot_theme +
  guides(color = guide_legend("PGSI category")) +
  labs(x = "PGSI total score") +
  scale_x_continuous(breaks = seq(0, 25, by = 5), limits = c(0, 27))

Perform non-weighted correlation:

Show code
cor_PGSI <- cor.test(master_dataset_study2_restricted_with_risk_scores$PGSI_TOTAL, 
                         master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`,
                         method = "spearman")

cor_PGSI

    Spearman's rank correlation rho

data:  master_dataset_study2_restricted_with_risk_scores$PGSI_TOTAL and master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`
S = 389269642, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.3269856 

Perform weighted correlation:

Show code
# This doesn't work nicely with NA values, so let's remove them first:
master_dataset_study2_restricted_with_risk_scores_weighted_cor_data <- master_dataset_study2_restricted_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL))

weighted_cor_PGSI <- weightedCorr(x = master_dataset_study2_restricted_with_risk_scores_weighted_cor_data$PGSI_TOTAL, 
                                  y = master_dataset_study2_restricted_with_risk_scores_weighted_cor_data$`Composite Risk Score`, 
                                  method = "Spearman",
                                  weights = master_dataset_study2_restricted_with_risk_scores_weighted_cor_data$ALL_SAMPLE_WEIGHTS)

weighted_cor_PGSI
[1] 0.3427516

Confirm how many people are involved in these calculations:

Show code
master_dataset_study2_restricted_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL)) %>% nrow()
[1] 1514

T-test

Perform a t-test to compare the composite risk scores of those who did and didn’t take part in the survey.

Unweighted t-test:

Show code
# Perform Welch's t-test
t_test_composite_study2 <- t.test(`Composite Risk Score` ~ STATUS, data = master_dataset_study2_restricted_with_risk_scores, var.equal = FALSE)

t_test_composite_study2

    Welch Two Sample t-test

data:  Composite Risk Score by STATUS
t = 7.8814, df = 2203.5, p-value = 5.046e-15
alternative hypothesis: true difference in means between group Participants and group Non-participants is not equal to 0
95 percent confidence interval:
 1.454290 2.417725
sample estimates:
    mean in group Participants mean in group Non-participants 
                     1.7069962                     -0.2290113 

And now a weighted t-test:

Show code
# Create survey design object for the whole sample:
survey_design_all_participants <- svydesign(
  id = ~1,
  data = master_dataset_study2_restricted_with_risk_scores,
  weights = ~ALL_SAMPLE_WEIGHTS
)

weighted_t_test_composite_study2 <- survey::svyttest(`Composite Risk Score` ~ STATUS, design = survey_design_all_participants) # The design object already carries the data

weighted_t_test_composite_study2

    Design-based t-test

data:  `Composite Risk Score` ~ STATUS
t = -9.5198, df = 12797, p-value < 2.2e-16
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
 -1.716717 -1.130472
sample estimates:
difference in mean 
         -1.423595 

Compute non-weighted summary scores and effect size:

Show code
# Calculate means and standard deviations
summary_composite_study2 <- master_dataset_study2_restricted_with_risk_scores %>%
  group_by(STATUS) %>%
  dplyr::summarize(
    mean = mean(`Composite Risk Score`),
    sd = sd(`Composite Risk Score`),
    n = n()
  ) 

gt(summary_composite_study2)
STATUS mean sd n
Participants 1.7069962 8.687786 1514
Non-participants -0.2290113 10.879017 11285
Show code
# Calculate Cohen's d with 95% confidence interval
cohen_d_result_study2 <- cohen.d(`Composite Risk Score` ~ STATUS, data = master_dataset_study2_restricted_with_risk_scores)

cohen_d_result_study2

Cohen's d

d estimate: 0.181896 (negligible)
95 percent confidence interval:
    lower     upper 
0.1282005 0.2355915 

Compute weighted summary score and effect size:

Show code
# Rename the risk score column, as the survey functions don't handle variable names with spaces well:
master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change<- master_dataset_study2_restricted_with_risk_scores %>%
  rename(risk_score = `Composite Risk Score`)


# Create survey design objects for each STATUS group that include the composite risk scores:
survey_design_all_t <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change,
  weights = ~ALL_SAMPLE_WEIGHTS
)
survey_design_non_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change, STATUS == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

survey_design_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change, STATUS == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS
)

risk_score_comparison<- weighted_summary_continuous("risk_score", 
                                                    survey_design_all_t,
                                                    survey_design_non_participants_t, 
                                                    survey_design_participants_t, 
                                                    participants_n,
                                                    non_participants_n)

gt(risk_score_comparison)
Variable Overall Participants Non-participants Effect_size
risk_score -3.24 (5.55), -3.99 -1.98 (5.46), -2.79 -3.4 (5.54), -4.13 0.26 [0.2, 0.32]
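
For reference, the weighted means/SDs (and the Cohen’s-d-style effect size) in the table above can be approximated directly with the survey package. This is only a rough sketch: the actual weighted_summary_continuous() helper is defined earlier in this document and may differ in its details (e.g., how it pools SDs or computes confidence intervals):

Show code
# Weighted means and SDs of the composite risk score by participation status:
w_mean_p  <- coef(svymean(~risk_score, survey_design_participants_t))
w_mean_np <- coef(svymean(~risk_score, survey_design_non_participants_t))
w_sd_p    <- sqrt(as.numeric(svyvar(~risk_score, survey_design_participants_t)))
w_sd_np   <- sqrt(as.numeric(svyvar(~risk_score, survey_design_non_participants_t)))

# Standardised mean difference using a simple pooled-SD approximation:
(w_mean_p - w_mean_np) / sqrt((w_sd_p^2 + w_sd_np^2) / 2)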

Quartiles

Now, let’s divide the sample into quartiles based on their composite summary score:

Show code
# Calculate quartiles of the composite risk score
quartiles <- quantile(master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`, 
                      probs = c(0, 0.25, 0.5, 0.75, 1))


master_dataset_study2_restricted_with_risk_scores$Quartile <- cut(master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`, 
                                    breaks = quartiles,
                                    labels = c("Bottom quartile: least involved",
                                               "Second quartile",
                                               "Third quartile",
                                               "Top quartile: most involved"),
                                    include.lowest = TRUE)

Now compute the proportion of individuals from each quartile who went on to take part in the survey:

Show code
master_dataset_study2_restricted_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
  tab_header("Counts for granular survey status by composite risk score quartile", subtitle = NULL)
Counts for granular survey status by composite risk score quartile
SURVEY_STATUS_PGSI n
Bottom quartile: least involved
Complete 252
Incomplete 11
Not started 2937
Second quartile
Complete 331
Incomplete 27
Not started 2842
Third quartile
Complete 418
Incomplete 48
Not started 2733
Top quartile: most involved
Complete 513
Incomplete 72
Not started 2615
Show code
master_dataset_study2_restricted_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(STATUS) %>%
  mutate('Percent of quartile' = round(n/sum(n)*100,2)) %>%
  gt() %>%
  tab_header("Counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Counts (with percentages) for broad survey status by composite risk score quartile
STATUS n Percent of quartile
Bottom quartile: least involved
Participants 252 7.88
Non-participants 2948 92.12
Second quartile
Participants 331 10.34
Non-participants 2869 89.66
Third quartile
Participants 418 13.07
Non-participants 2781 86.93
Top quartile: most involved
Participants 513 16.03
Non-participants 2687 83.97

Now compute weighted counts/proportions:

Show code
# Compute weighted counts for granular survey status
weighted_counts_granular <- svytable(~Quartile + SURVEY_STATUS_PGSI, 
                                     svydesign(id = ~1, 
                                               data = master_dataset_study2_restricted_with_risk_scores, 
                                               weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame()


weighted_counts_granular %>%
  gt() %>%
  tab_header("Weighted counts for granular survey status by composite risk score quartile", subtitle = NULL)
Weighted counts for granular survey status by composite risk score quartile
Quartile SURVEY_STATUS_PGSI Freq
Bottom quartile: least involved Complete 2745.7805
Second quartile Complete 3537.8544
Third quartile Complete 3296.0030
Top quartile: most involved Complete 1306.2995
Bottom quartile: least involved Incomplete 120.6171
Second quartile Incomplete 294.5093
Third quartile Incomplete 355.4089
Top quartile: most involved Incomplete 200.9272
Bottom quartile: least involved Not started 32108.0330
Second quartile Not started 30220.4648
Third quartile Not started 19822.1098
Top quartile: most involved Not started 5914.9925
Show code
# Compute weighted counts and percentages for broad survey status
svytable(~Quartile + STATUS, 
                                  svydesign(id = ~1, 
                                            data = master_dataset_study2_restricted_with_risk_scores, 
                                            weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate('Percent of quartile' = round(Freq / sum(Freq) * 100, 2)) %>%
 gt() %>%
  tab_header("Weighted counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Weighted counts (with percentages) for broad survey status by composite risk score quartile
STATUS Freq Percent of quartile
Bottom quartile: least involved
Participants 2745.781 7.85
Non-participants 32228.650 92.15
Second quartile
Participants 3537.854 10.39
Non-participants 30514.974 89.61
Third quartile
Participants 3296.003 14.04
Non-participants 20177.519 85.96
Top quartile: most involved
Participants 1306.299 17.60
Non-participants 6115.920 82.40

And visualise this using a Sankey diagram (weighted figures):

Show code
# Create survey design object
survey_design_all_participants_sankey <- svydesign(
  id = ~1,
  data = master_dataset_study2_restricted_with_risk_scores,
  weights = ~ALL_SAMPLE_WEIGHTS
)

# We first need to create a data frame that contains the flow "from" and "to" categories and "intensity" values (i.e., N), with weightings applied:
weighted_counts <- svytable(~Quartile + SURVEY_STATUS_PGSI, survey_design_all_participants_sankey) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate(percent = Freq / sum(Freq) * 100) %>%
  ungroup()

# Ensure Quartiles are ordered correctly
weighted_counts$Quartile <- factor(weighted_counts$Quartile,
                                   levels = c("Top quartile: most involved",
                                              "Third quartile",
                                              "Second quartile",
                                              "Bottom quartile: least involved"),
                                   ordered = TRUE)

# Create Sankey plot
desired_order_sankey <- c("Top quartile: most involved",
                          "Third quartile",
                          "Second quartile",
                          "Bottom quartile: least involved")


# non-weighted data: 
risk_to_survey_links<- master_dataset_study2_restricted_with_risk_scores %>%
    select(CLIENT_ID, SURVEY_STATUS_PGSI, Quartile) %>%
  group_by(Quartile, SURVEY_STATUS_PGSI) %>%
   summarise(
    n = n(),
    .groups = "drop"
  ) %>%
   group_by(Quartile) %>%
   mutate(
     percent = n/sum(n)*100) %>%
  select(-n) %>%
  ungroup() %>%
  mutate(Quartile =  factor(Quartile,
                              levels = desired_order_sankey)) %>%
  arrange(Quartile)




# From these flows we have to create a "node" data frame which provides a list of all categories (to and from) in the data frame: 
nodes <- data.frame(
  name = c("Top quartile: most involved",
           "Third quartile",
           "Second quartile",
           "Bottom quartile: least involved",
           as.character(unique(weighted_counts$SURVEY_STATUS_PGSI))) 
)

# For the networkD3 package, the links between categories are made using numeric ids rather than the character names in the links data frame. This code assigns each "from" and "to" category a zero-based id to make the connections explicit:
weighted_counts$IDsource <- match(weighted_counts$Quartile, nodes$name) - 1
weighted_counts$IDtarget <- match(weighted_counts$SURVEY_STATUS_PGSI, nodes$name) - 1



# prepare colour scale
ColourScal ='d3.scaleOrdinal() .range(["#d1495b", 
"#edae49",
"#00798c",
"#2e4057",
"#35B779FF",
"#31688EFF",
"#31688EFF",
"#482878FF",
"#B4DE2CFF",
"#FDE725FF", 
"#26828EFF",
"#3E4A89FF",
"#440154FF",
"#6DCD59FF"
])'

# Create Sankey plot
sankey_plot2_restricted <- sankeyNetwork(
  Links = weighted_counts, 
  Nodes = nodes,
  Source = "IDsource", 
  Target = "IDtarget",
  Value = "percent", 
  NodeID = "name", 
  iterations = 0,
  sinksRight = FALSE,
  fontSize = 16, 
  fontFamily = "Poppins",
  nodeWidth = 30,
  nodePadding = 25,
  colourScale = ColourScal
)


# Custom CSS with Google Fonts link
custom_css <- "
<link href='https://fonts.googleapis.com/css2?family=Poppins:wght@400;700&display=swap' rel='stylesheet'>
<style>
  text {
    font-family: 'Poppins', sans-serif !important;
  }
</style>
"

# Save plot as a HTML file with custom CSS
html_file <- tempfile(fileext = ".html")
saveWidget(
  widget = sankey_plot2_restricted, # The custom CSS is injected into the saved HTML below
  file = html_file,
  selfcontained = TRUE
)

# Read the saved HTML and ensure the custom CSS is properly embedded
html_content <- readLines(html_file, warn = FALSE)
html_content <- gsub("</head>", paste0(custom_css, "</head>"), html_content)
writeLines(html_content, con = "sankey_plot2_restricted.html")

# remove the temporary file
file.remove(html_file)
[1] TRUE
Show code
# sankey_plot2_restricted
Show code
# ggsankey_data <- master_dataset_study2_restricted_with_risk_scores %>%
#   # mutate(SURVEY_STATUS_FULL = case_when(SURVEY_STATUS_FULL == "Not started" ~ NA,
#   #                                       TRUE ~ SURVEY_STATUS_FULL)) %>%
#   make_long(Quartile, SURVEY_STATUS_START, SURVEY_STATUS_FULL)
# 
# ggplot(ggsankey_data, aes(x = x,
#                           next_x = next_x,
#                           node = node,
#                           next_node = next_node,
#                           fill = factor(node),
#                           label = node)) +
#                geom_sankey(flow.alpha = 0.5,
#                            node.color = "black",
#                            show.legend = FALSE)+
#   geom_sankey_label(size = 3,
#                     color = "black",
#                     fill= "white",
#                     hjust = -0.2,
#                     family="Poppins")+
#   theme_bw()+
#     theme( text=element_text(family="Poppins")) +
#   # scale_fill_viridis_d() +
#     scale_fill_manual(values = c("#9a031e",  # Bright red
#   "#277da1",  # Deep blue
#   "#f9c74f",  # Muted yellow
#   "#e36414",  # Soft orange
#   "#90be6d",  # Soft green
#   "#577590",  # Slate gray
#   "#4d908e",  # Teal
#   "#9f5f80"   # Purple
# )) +
#   theme(axis.title = element_blank()
#                   , axis.text.y = element_blank()
#                   , axis.ticks = element_blank()
#                   , panel.grid = element_blank())

Logistic regression

Now let’s fit a logistic regression model to identify the strongest predictors of survey uptake.

SSVS

Before we can actually run the model, we need to identify a suitable suite of predictor variables to include using Stochastic Search Variable Selection (SSVS) and the related SSVS package (note that the actual SSVS process is commented out in rendered versions of this document as it takes a long time to run).

As we want to maximise the overall predictive ability of the model, we have opted to set the SSVS parameters so as to have a low threshold for including/selecting relevant variables. First, we have set the prior inclusion probability to 0.8, reflecting a belief that predictors are more likely to be included than not (a prior inclusion probability of 0.5 reflects the belief that each predictor has an equal probability of being included/excluded). Second, we have set the cut-off for marginal inclusion probability (MIP) values to 0.4. There is no standard recommendation for what this cut-off should be, but others have used 0.5.
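
To make these settings concrete, here is a minimal toy sketch of the same call pattern (not the study analysis); mtcars and the variable names below are stand-ins used purely for illustration:

Show code
# Toy illustration of the SSVS settings described above:
toy_ssvs <- ssvs(data = mtcars,
                 x = c("wt", "hp", "disp"), # Candidate predictors
                 y = "am",                  # Binary (0/1) outcome, analogous to STATUS below
                 continuous = FALSE,        # Outcome is categorical
                 inprob = 0.8,              # Prior inclusion probability, as set above
                 progress = FALSE)

# Predictors whose marginal inclusion probability (MIP) exceeds the 0.4 cut-off would be retained:
summary(toy_ssvs, interval = 0.95, ordered = TRUE)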

One thing we need to do before all of this is impute missing values for the IRSAD scores, as SSVS doesn’t seem to handle missing values well. To do this, we will impute the overall median IRSAD score as each missing value.

Show code
# First we need to convert logical and categorical variables to numeric for inclusion in the model:
master_dataset_study2_restricted_SSVS <- master_dataset_study2_restricted %>%
  # Impute missing IRSAD scores with the overall sample median:
  mutate(IRSAD_SCORE = ifelse(is.na(IRSAD_SCORE), median(IRSAD_SCORE, na.rm = TRUE), IRSAD_SCORE)) %>%
  # Convert all required variables to numeric:
  mutate(across(c(
    'AGE',
    'IRSAD_SCORE',
       'DAYS_REGISTERED',
       'DAYS_SINCE_LAST_BET',
       'PAS_6M_BET_MEAN_STAKE',
       'PAS_6M_BET_SD_STAKE',
       'PAS_6M_N_ACTIVITIES',
       'PAS_6M_BET_INTENSITY',
       'PAS_6M_MAX_DAY_STAKE',
       'PAS_6M_MAX_DAY_BETS',
       'PAS_6M_MEAN_DAILY_STAKE',
       'PAS_6M_SD_DAILY_STAKE',
       'PAS_6M_PERCENT_UNSOCIABLE_BETS',
       'PAS_6M_MONTHLY_BET_FREQ_DAYS',
       'PAS_6M_MONTHLY_SPEND',
       'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME',
       'PAS_6M_MEAN_DEPOSIT_AMOUNT',
       'PAS_6M_SD_DEPOSIT_AMOUNT',
       'PAS_6M_PERCENT_DECLINED_DEPOSITS',
       'PAS_6M_TOTAL_DEPOSIT_METHODS',
       'PAS_6M_PERCENT_CANCEL_WITHDRAWS',
       'PAS_6M_TOTAL_WITHDRAW_METHODS',
       'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT',
       'PAS_6M_MONTHLY_DEPOSIT_AMOUNT',
       'PAS_6M_DEPOSIT_INTENSITY',
                  'EVER_USED_DEPOSIT_LIMIT',
                  'EVER_USED_BLOCK_OUT', 
                  'EVER_USED_CHECK_IN', 
                  'EVER_USED_CURFEW', 
                  'EVER_USED_DEPOSIT_OPTION', 
                  'EVER_USED_MARKET_CONTROL', 
                  'EVER_USED_TIMOUT', 
                  'EVER_USED_CANCELLED_WITHDRAWS', 
                  'EVER_USED_SELF_EXCLUSION',
    'TOTAL_ACTIVE_TOOLS',
    'NUMBER_CPT_CHANGES'), 
                ~ as.numeric(.))) %>%
  mutate(GENDER = case_when(
    GENDER == 'Female' ~ 0,
    GENDER == 'Male' ~ 1,
    TRUE ~ 2)) %>%
  mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1,
    TRUE ~ 2)) %>%
    rename(
    'Age' = 'AGE',
    'Gender' =  'GENDER',
    'IRSAD score' = 'IRSAD_SCORE',
    'Days since registration' = 'DAYS_REGISTERED',
    'Days since last bet' = 'DAYS_SINCE_LAST_BET',
    'Mean stake*' = 'PAS_6M_BET_MEAN_STAKE',
    'Stake SD*' = 'PAS_6M_BET_SD_STAKE',
    'No. of activities*' = 'PAS_6M_N_ACTIVITIES',
    'Betting intensity*' = 'PAS_6M_BET_INTENSITY',
    'Max daily stake*' = 'PAS_6M_MAX_DAY_STAKE',
    'Max daily bets*' = 'PAS_6M_MAX_DAY_BETS',
    'Mean daily stake*' = 'PAS_6M_MEAN_DAILY_STAKE',
    'Daily stake SD*' = 'PAS_6M_SD_DAILY_STAKE',
    '% unsociable bets*' = 'PAS_6M_PERCENT_UNSOCIABLE_BETS',
    'Monthly bet freq (days)*' = 'PAS_6M_MONTHLY_BET_FREQ_DAYS',
    'Monthly spend*' = 'PAS_6M_MONTHLY_SPEND',
    'Mean monthly net outcome*' = 'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME',
    'Mean deposit amount*' = 'PAS_6M_MEAN_DEPOSIT_AMOUNT',
    'Deposit amount SD*' = 'PAS_6M_SD_DEPOSIT_AMOUNT',
    '% declined deposits*' = 'PAS_6M_PERCENT_DECLINED_DEPOSITS',
    'No. deposit methods*' = 'PAS_6M_TOTAL_DEPOSIT_METHODS',
    '% cancel withdrawals*' = 'PAS_6M_PERCENT_CANCEL_WITHDRAWS',
    'No. withdrawal methods*' = 'PAS_6M_TOTAL_WITHDRAW_METHODS',
    'Max daily deposit amount*' = 'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT',
    'Monthly deposit amount*' = 'PAS_6M_MONTHLY_DEPOSIT_AMOUNT',
    'Deposit intensity*' = 'PAS_6M_DEPOSIT_INTENSITY',
    'Ever used deposit limit' = 'EVER_USED_DEPOSIT_LIMIT',
    'Ever used block out' = 'EVER_USED_BLOCK_OUT', 
    'Ever used check in' = 'EVER_USED_CHECK_IN', 
    'Ever used curfew' = 'EVER_USED_CURFEW', 
    'Ever used deposit option' = 'EVER_USED_DEPOSIT_OPTION', 
    'Ever used market control' = 'EVER_USED_MARKET_CONTROL', 
    'Ever used timeout' = 'EVER_USED_TIMOUT', 
    'Ever used cancelled withdrawals' = 'EVER_USED_CANCELLED_WITHDRAWS', 
    'Ever used self-exclusion' = 'EVER_USED_SELF_EXCLUSION',
    'Total active tools' = 'TOTAL_ACTIVE_TOOLS',
    'Number of changes to CPTs' = 'NUMBER_CPT_CHANGES'
  )


outcome <- 'STATUS'

predictors <- c(
       # demographic variables:
       'Age',
       'Gender',
       'IRSAD score',
       # Account/wagering variables
       'Days since registration',
       'Days since last bet',
       # All past six month wagers and transactions variables (As included in the composite score):
      'Mean stake*',
    'Stake SD*',
    'No. of activities*',
    'Betting intensity*',
    'Max daily stake*',
    'Max daily bets*',
    'Mean daily stake*',
    'Daily stake SD*',
    '% unsociable bets*',
    'Monthly bet freq (days)*',
    'Monthly spend*',
    'Mean monthly net outcome*',
    'Mean deposit amount*',
    'Deposit amount SD*',
    '% declined deposits*',
    'No. deposit methods*',
    '% cancel withdrawals*',
    'No. withdrawal methods*',
    'Max daily deposit amount*',
    'Monthly deposit amount*',
    'Deposit intensity*',
       # current and historical use of consumer protection tools:
       'Ever used deposit limit',
       'Ever used block out', 
       'Ever used check in', 
       'Ever used curfew', 
       'Ever used deposit option', 
       'Ever used market control', 
       'Ever used timeout', 
       'Ever used cancelled withdrawals', 
       'Ever used self-exclusion',
       'Total active tools',
       'Number of changes to CPTs'
        )

# str(master_dataset_study2_restricted_SSVS)
# 
# master_dataset_study2_restricted_SSVS %>%
#   select(`IRSAD score`)
#
# SSSV_results_Study2_Restricted <- ssvs(data = master_dataset_study2_restricted_SSVS,
#                             x = predictors, y = outcome, # Variables, as prespecified above
#                             inprob = 0.8, #  prior inclusion probability:  0.5 reflects the belief that each predictor has an equal probability of being included/excluded
#                             continuous = FALSE, # Categorical outcome variable
#                             progress = TRUE # Provide update on progress of processing
#                             )
# 
# # Save file to avoid having to re-run every time
# saveRDS(SSSV_results_Study2_Restricted, "SSSV_results_Study2_Restricted")

# Load in:
SSSV_results_Study2_Restricted <- read_rds("SSSV_results_Study2_Restricted")

Now we have the outcomes from the variable selection process, let’s present them as a table and visually in a plot:

Show code
summary(SSSV_results_Study2_Restricted, 
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  as.data.frame() %>%
gt()%>%
  tab_header(md("**Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2 Restricted analyses**"), 
             subtitle = NULL)
Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2 Restricted analyses
Variable MIP Avg Beta Avg Nonzero Beta Lower CI (95%) Upper CI (95%)
Age 1.0000 0.2402 0.2402 0.1812 0.3081
Days since last bet 1.0000 -0.2822 -0.2822 -0.3595 -0.2054
No. withdrawal methods* 1.0000 0.1961 0.1961 0.1260 0.2685
Stake SD* 0.9869 0.5778 0.5854 0.3084 0.8716
Mean stake* 0.9861 -0.6759 -0.6854 -1.0373 -0.3545
No. deposit methods* 0.8692 0.1153 0.1326 0.0000 0.1861
Mean deposit amount* 0.6957 -0.2414 -0.3470 -0.5180 0.0000
Deposit intensity* 0.5405 0.0496 0.0917 0.0000 0.1261
No. of activities* 0.2973 0.0284 0.0955 0.0000 0.1227
Monthly bet freq (days)* 0.2119 0.0220 0.1036 0.0000 0.1280
Deposit amount SD* 0.1555 -0.0369 -0.2374 -0.2868 0.0000
% declined deposits* 0.0631 0.0050 0.0793 0.0000 0.0581
Max daily stake* 0.0260 -0.0098 -0.3770 0.0000 0.0000
Gender 0.0196 0.0011 0.0537 0.0000 0.0000
Betting intensity* 0.0165 0.0007 0.0448 0.0000 0.0000
Ever used cancelled withdrawals 0.0164 0.0008 0.0489 0.0000 0.0000
Daily stake SD* 0.0163 0.0056 0.3427 0.0000 0.0000
Mean monthly net outcome* 0.0131 0.0010 0.0743 0.0000 0.0000
Ever used curfew 0.0124 -0.0007 -0.0543 0.0000 0.0000
Mean daily stake* 0.0107 -0.0004 -0.0399 0.0000 0.0000
Ever used self-exclusion 0.0084 0.0003 0.0334 0.0000 0.0000
% cancel withdrawals* 0.0078 0.0003 0.0359 0.0000 0.0000
Total active tools 0.0072 0.0003 0.0401 0.0000 0.0000
Ever used market control 0.0071 0.0003 0.0356 0.0000 0.0000
Monthly deposit amount* 0.0067 0.0000 0.0011 0.0000 0.0000
Max daily deposit amount* 0.0061 -0.0005 -0.0795 0.0000 0.0000
% unsociable bets* 0.0057 0.0002 0.0383 0.0000 0.0000
Ever used timeout 0.0052 0.0001 0.0220 0.0000 0.0000
Max daily bets* 0.0051 0.0002 0.0368 0.0000 0.0000
Monthly spend* 0.0047 -0.0001 -0.0231 0.0000 0.0000
Ever used block out 0.0041 0.0001 0.0218 0.0000 0.0000
Ever used deposit option 0.0040 0.0000 0.0003 0.0000 0.0000
Days since registration 0.0037 0.0001 0.0320 0.0000 0.0000
IRSAD score 0.0035 0.0001 0.0387 0.0000 0.0000
Ever used check in 0.0028 0.0000 -0.0104 0.0000 0.0000
Ever used deposit limit 0.0019 0.0000 0.0123 0.0000 0.0000
Number of changes to CPTs 0.0015 0.0000 -0.0184 0.0000 0.0000
Show code
plot(SSSV_results_Study2_Restricted,
     threshold = 0.4,
     title = "") +
  theme_classic() +
   scale_y_continuous(breaks = c(0, 0.25, 0.50, 0.75, 1), labels = c(0, 0.25, 0.50, 0.75, 1)) +
  theme(text=element_text(family="Poppins"),
      axis.text.x = element_text(angle = 45, hjust = 1)
      # axis.text.y = element_text(angle = 0, hjust = .5)
      ) +
  labs(title = "",
       x = "",
       y = "Inclusion Probability") +
  annotate("text", x = 18, y = 0.45, label = "MIP threshold", size =3, color = "black", fontface = "plain") +
  theme(legend.position = "none") 

Show code
ggsave("Figures/SSVS results plot Study 2 Restricted analyses.pdf",
       width = 8.5,
       height = 3.5,
       units = c("in")
)

Run model

Now let’s run the logistic regression model with our selection of predictor variables identified via SSVS:

Show code
# SSSV_results_Study2_Restricted$ssvs %>% as_tibble()
# Set the data for analysis:
Study2_logistic_reg_data<- master_dataset_study2_restricted %>%
 mutate(STATUS = case_when(
    STATUS == 'Non-participants' ~ 0,
    STATUS == 'Participants' ~ 1))

# Create survey design object used for weighting:
survey_design_regression <- svydesign(
  id = ~1,
  data = Study2_logistic_reg_data,
  weights = ~ALL_SAMPLE_WEIGHTS
)
  
# Check variables for inclusion: 
summary(SSSV_results_Study2_Restricted, 
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  filter(MIP>0.4)
 Variable                MIP    Avg Beta Avg Nonzero Beta Lower CI (95%) Upper CI (95%)
 Age                     1.0000  0.2402   0.2402           0.1812         0.3081
 Days since last bet     1.0000 -0.2822  -0.2822          -0.3595        -0.2054
 No. withdrawal methods* 1.0000  0.1961   0.1961           0.1260         0.2685
 Stake SD*               0.9869  0.5778   0.5854           0.3084         0.8716
 Mean stake*             0.9861 -0.6759  -0.6854          -1.0373        -0.3545
 No. deposit methods*    0.8692  0.1153   0.1326           0.0000         0.1861
 Mean deposit amount*    0.6957 -0.2414  -0.3470          -0.5180         0.0000
 Deposit intensity*      0.5405  0.0496   0.0917           0.0000         0.1261
Show code
# Run the model with weighting applied:
study_2_restricted_analysis_weighted_logistic_regression <- svyglm(STATUS ~ AGE + 
                DAYS_SINCE_LAST_BET +
                PAS_6M_TOTAL_WITHDRAW_METHODS +
                PAS_6M_BET_SD_STAKE + # New
                PAS_6M_BET_MEAN_STAKE + # New
                PAS_6M_TOTAL_DEPOSIT_METHODS + 
                PAS_6M_MEAN_DEPOSIT_AMOUNT +
                PAS_6M_DEPOSIT_INTENSITY, # New
      design = survey_design_regression, family = quasibinomial())

summary(study_2_restricted_analysis_weighted_logistic_regression) 

Call:
svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS + 
    PAS_6M_BET_SD_STAKE + PAS_6M_BET_MEAN_STAKE + PAS_6M_TOTAL_DEPOSIT_METHODS + 
    PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_DEPOSIT_INTENSITY, design = survey_design_regression, 
    family = quasibinomial())

Survey design:
svydesign(id = ~1, data = Study2_logistic_reg_data, weights = ~ALL_SAMPLE_WEIGHTS)

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   -3.1011096  0.1404899 -22.074  < 2e-16 ***
AGE                            0.0188877  0.0021737   8.689  < 2e-16 ***
DAYS_SINCE_LAST_BET           -0.0359699  0.0049748  -7.230 5.09e-13 ***
PAS_6M_TOTAL_WITHDRAW_METHODS  0.3543070  0.0612079   5.789 7.26e-09 ***
PAS_6M_BET_SD_STAKE            0.0021273  0.0004674   4.551 5.39e-06 ***
PAS_6M_BET_MEAN_STAKE         -0.0027559  0.0007975  -3.456 0.000551 ***
PAS_6M_TOTAL_DEPOSIT_METHODS   0.1385985  0.0400393   3.462 0.000539 ***
PAS_6M_MEAN_DEPOSIT_AMOUNT    -0.0008037  0.0003071  -2.617 0.008886 ** 
PAS_6M_DEPOSIT_INTENSITY       0.0231312  0.0178049   1.299 0.193916    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.014813)

Number of Fisher Scoring iterations: 6

Perform checks of model performance:

Show code
model2_pilot_performance_metrics_restricted_analysis <- check_model(study_2_restricted_analysis_weighted_logistic_regression)

# ggsave() with no plot argument saves the most recently displayed plot:
ggsave("Figures/model2_pilot_performance_metrics_restricted_analysis.pdf",
       width = 7,
       height = 9,
       units = c("in")
)

There is some minor collinearity here, but not enough to be problematic in this context: we aren’t using this model to make predictions on separate data or subsamples.
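
The collinearity judgement above is based on the check_model() panels; the variance inflation factors can also be inspected directly. Here is a minimal sketch using car::vif() (the car package was loaded for exactly this purpose), assuming it accepts the svyglm object as it does standard glm objects:

Show code
# VIFs for the weighted logistic model; values below ~5 are conventionally treated as low concern:
car::vif(study_2_restricted_analysis_weighted_logistic_regression)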

Compute confidence intervals for the coefficient estimates:

Show code
summarisedCIs_study_2_restricted_analysis_weighted_logistic_regression <- cbind(coef(study_2_restricted_analysis_weighted_logistic_regression),
                                                                                confint(study_2_restricted_analysis_weighted_logistic_regression))

summarisedCIs_study_2_restricted_analysis_weighted_logistic_regression %>% as.data.frame() %>%
  rownames_to_column() %>%
  rename(Coefficient = V1) %>%
  mutate(across(c(Coefficient, `2.5 %`, `97.5 %`), ~round(., 3))) %>%
  arrange(desc(`2.5 %`)) %>%
   gt() %>%
  tab_header(md("**Regression coefficients with 95% CIs: Study 2**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Regression coefficients with 95% CIs: Study 2
Coefficient Lower CI: 2.5% Upper CI: 97.5%
PAS_6M_TOTAL_WITHDRAW_METHODS 0.354 0.234 0.474
PAS_6M_TOTAL_DEPOSIT_METHODS 0.139 0.060 0.217
AGE 0.019 0.015 0.023
PAS_6M_BET_SD_STAKE 0.002 0.001 0.003
PAS_6M_MEAN_DEPOSIT_AMOUNT -0.001 -0.001 0.000
PAS_6M_BET_MEAN_STAKE -0.003 -0.004 -0.001
PAS_6M_DEPOSIT_INTENSITY 0.023 -0.012 0.058
DAYS_SINCE_LAST_BET -0.036 -0.046 -0.026
(Intercept) -3.101 -3.376 -2.826

Exponentiate coefficients and interpret them as odds-ratios:

Show code
summarisedORs_study_2_restricted_analysis_weighted_logistic_regression <- exp(cbind(OR = coef(study_2_restricted_analysis_weighted_logistic_regression),
                                                                                    confint(study_2_restricted_analysis_weighted_logistic_regression)))

summarisedORs_study_2_restricted_analysis_weighted_logistic_regression %>% as.data.frame() %>%
  rownames_to_column() %>%
  mutate(across(c(OR, `2.5 %`, `97.5 %`), ~round(., 4))) %>%
  arrange(desc(OR)) %>%
   gt() %>%
  tab_header(md("**Odds ratios with 95% CIs: Study 2**"), subtitle = NULL) %>%
   cols_label('2.5 %' = "Lower CI: 2.5%",
              '97.5 %' = "Upper CI: 97.5%") 
Odds ratios with 95% CIs: Study 2
OR Lower CI: 2.5% Upper CI: 97.5%
PAS_6M_TOTAL_WITHDRAW_METHODS 1.4252 1.2641 1.6069
PAS_6M_TOTAL_DEPOSIT_METHODS 1.1487 1.0620 1.2424
PAS_6M_DEPOSIT_INTENSITY 1.0234 0.9883 1.0597
AGE 1.0191 1.0147 1.0234
PAS_6M_BET_SD_STAKE 1.0021 1.0012 1.0030
PAS_6M_MEAN_DEPOSIT_AMOUNT 0.9992 0.9986 0.9998
PAS_6M_BET_MEAN_STAKE 0.9972 0.9957 0.9988
DAYS_SINCE_LAST_BET 0.9647 0.9553 0.9741
(Intercept) 0.0450 0.0342 0.0593

Calculate Nagelkerke’s R squared for the model:

Show code
NagelkerkeR2_S2<- NagelkerkeR2(study_2_restricted_analysis_weighted_logistic_regression) %>% 
  as.data.frame() %>%
  print()
      N         R2
1 12799 0.05661822
Show code
# NOTE: R2 is calculate as a proportion here and so in the paper we have multiplied it by 100 to form a percentage value

The model accounts for 5.66% of the variance in uptake/response.
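
As the comment in the code above notes, this percentage is simply the proportion returned by NagelkerkeR2() scaled by 100:

Show code
# Convert the R2 proportion to the percentage reported in the paper:
round(NagelkerkeR2_S2$R2 * 100, 2) # = 5.66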

Calculate overall p-value for model by comparing it with a null model:

Show code
anova(study_2_restricted_analysis_weighted_logistic_regression, 
      update(study_2_restricted_analysis_weighted_logistic_regression, ~1), # update here produces null model for comparison 
      test="Chisq")
Working (Rao-Scott) LRT for AGE DAYS_SINCE_LAST_BET PAS_6M_TOTAL_WITHDRAW_METHODS PAS_6M_BET_SD_STAKE PAS_6M_BET_MEAN_STAKE PAS_6M_TOTAL_DEPOSIT_METHODS PAS_6M_MEAN_DEPOSIT_AMOUNT PAS_6M_DEPOSIT_INTENSITY
 in svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS + 
    PAS_6M_BET_SD_STAKE + PAS_6M_BET_MEAN_STAKE + PAS_6M_TOTAL_DEPOSIT_METHODS + 
    PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_DEPOSIT_INTENSITY, design = survey_design_regression, 
    family = quasibinomial())
Working 2logLR =  355.9009 p= < 2.22e-16 
(scale factors:  1.4 1.3 1.3 1.2 0.93 0.87 0.62 0.4 )
Show code
# Set seed so that labels are consistent in the plot:
set.seed(200)

ggstatsplot::ggcoefstats(study_2_restricted_analysis_weighted_logistic_regression,
                         digits = 4,
                         stats.label.color = "#00798c",
                         sort = "ascending",
                         vline.args = list(size = 1, color = "black"),
                         # ggstatsplot.layer = FALSE,
                          exclude.intercept = TRUE, # hide the intercept
                         stats.label.args = list(size = 3, direction = "y")) +
  scale_x_continuous(name = "Regression coefficient", limits = c(-0.1, 0.5), breaks = c(-.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5)) +
  scale_y_discrete(name = "", labels = c("Days since\n last bet",
                                         "Mean stake",
                                         "Mean deposit\n amount",
                                         "SD of stake",
                                         "Age",
                                         "Deposit intensity",
                                         "No. deposit\n methods",
                                         "No. withdrawal\n methods")) +
  theme(panel.grid.minor = element_blank(),
        axis.line = element_blank()) +
  theme(axis.text = element_text(color = "black", size = 9, face = "plain"))+
  theme_light(base_family = "Poppins") +
  # theme(plot.margin = unit(c(.1,.1,.3,-3), "cm")) +
  theme(axis.text.y = element_text(color="black", size=12, face="plain", family = "Poppins")) +
  theme(axis.title.x = element_text(color="black", size=12, face ="plain", vjust = 0, family = "Poppins")) 

Show code
# Set height to 7 and width to 5.5 in portrait mode
ggsave("Figures/Study 2 restricted analyses log regression plot.pdf",
       width = 5,
       height = 7,
       units = c("in")
)

Reproducibility of this analysis

Package stability

To ensure that this script always runs with the same package versions that were used during development, all packages are loaded using the groundhog.library() function from the groundhog package.

Rendering & publishing

This document is rendered in HTML format and published online using Quarto Pub by using the following command in the terminal:

  • quarto publish quarto-pub Survey_uptake_main_analyses.qmd

Session details

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.1 (2024-06-14 ucrt)
 os       Windows 11 x64 (build 26100)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_Australia.utf8
 ctype    English_Australia.utf8
 tz       Australia/Sydney
 date     2025-02-26
 pandoc   3.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version date (UTC) lib source
 car          * 3.1-2   2023-03-30 [1] CRAN (R 4.4.0)
 carData      * 3.0-5   2022-01-06 [1] CRAN (R 4.4.1)
 dplyr        * 1.1.4   2023-11-17 [1] CRAN (R 4.4.1)
 effsize      * 0.8.1   2020-10-05 [1] CRAN (R 4.4.2)
 fmsb         * 0.7.6   2024-01-19 [1] CRAN (R 4.4.2)
 forcats      * 1.0.0   2023-01-29 [1] CRAN (R 4.4.1)
 formattable  * 0.2.1   2021-01-07 [1] CRAN (R 4.4.1)
 ggplot2      * 3.5.1   2024-04-23 [1] CRAN (R 4.4.1)
 ggrain       * 0.0.4   2024-01-23 [1] CRAN (R 4.4.1)
 ggstatsplot  * 0.12.3  2024-04-06 [1] CRAN (R 4.4.0)
 ggstream     * 0.1.0   2021-05-06 [1] CRAN (R 4.4.2)
 ggtext       * 0.1.2   2022-09-16 [1] CRAN (R 4.4.2)
 groundhog    * 3.2.1   2024-09-29 [1] CRAN (R 4.4.1)
 gt           * 0.10.1  2024-01-17 [1] CRAN (R 4.4.0)
 gtExtras     * 0.5.0   2023-09-15 [1] CRAN (R 4.4.1)
 gtsummary    * 1.7.2   2023-07-15 [1] CRAN (R 4.4.0)
 haven        * 2.5.4   2023-11-30 [1] CRAN (R 4.4.1)
 Hmisc        * 5.1-3   2024-05-28 [1] CRAN (R 4.4.0)
 htmlwidgets  * 1.6.4   2023-12-06 [1] CRAN (R 4.4.1)
 kableExtra   * 1.4.0   2024-01-24 [1] CRAN (R 4.4.1)
 lubridate    * 1.9.3   2023-09-27 [1] CRAN (R 4.4.1)
 Matrix       * 1.7-0   2024-03-22 [1] CRAN (R 4.4.0)
 MBESS        * 4.9.3   2023-10-26 [1] CRAN (R 4.4.1)
 networkD3    * 0.4     2017-03-18 [1] CRAN (R 4.4.1)
 patchwork    * 1.2.0   2024-01-08 [1] CRAN (R 4.4.0)
 performance  * 0.11.0  2024-03-22 [1] CRAN (R 4.4.0)
 plotly       * 4.10.4  2024-01-13 [1] CRAN (R 4.4.1)
 purrr        * 1.0.2   2023-08-10 [1] CRAN (R 4.4.1)
 RColorBrewer * 1.1-3   2022-04-03 [1] CRAN (R 4.4.0)
 readr        * 2.1.5   2024-01-10 [1] CRAN (R 4.4.1)
 readxl       * 1.4.3   2023-07-06 [1] CRAN (R 4.4.1)
 rstantools   * 2.4.0   2024-01-31 [1] CRAN (R 4.4.2)
 scales       * 1.3.0   2023-11-28 [1] CRAN (R 4.4.1)
 see          * 0.8.4   2024-04-29 [1] CRAN (R 4.4.0)
 sessioninfo  * 1.2.2   2021-12-06 [1] CRAN (R 4.4.1)
 showtext     * 0.9-7   2024-03-02 [1] CRAN (R 4.4.1)
 showtextdb   * 3.0     2020-06-04 [1] CRAN (R 4.4.1)
 SSVS         * 2.0.0   2022-05-29 [1] CRAN (R 4.4.2)
 stringr      * 1.5.1   2023-11-14 [1] CRAN (R 4.4.1)
 survey       * 4.4-2   2024-03-20 [1] CRAN (R 4.4.1)
 survival     * 3.6-4   2024-04-24 [1] CRAN (R 4.4.0)
 sysfonts     * 0.8.9   2024-03-02 [1] CRAN (R 4.4.1)
 tibble       * 3.2.1   2023-03-20 [1] CRAN (R 4.4.1)
 tidyr        * 1.3.1   2024-01-24 [1] CRAN (R 4.4.1)
 tidyverse    * 2.0.0   2023-02-22 [1] CRAN (R 4.4.1)
 ufs          * 0.5.12  2024-03-09 [1] CRAN (R 4.4.2)
 viridis      * 0.6.5   2024-01-29 [1] CRAN (R 4.4.1)
 viridisLite  * 0.4.2   2023-05-02 [1] CRAN (R 4.4.1)
 wCorr        * 1.9.8   2023-08-19 [1] CRAN (R 4.4.2)
 weights      * 1.0.4   2021-06-10 [1] CRAN (R 4.4.1)

 [1] C:/Users/Robert Heirene/AppData/Local/R/win-library/4.4
 [2] C:/Program Files/R/R-4.4.1/library

──────────────────────────────────────────────────────────────────────────────
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default


locale:
[1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
[3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.utf8    

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] sessioninfo_1.2.2  RColorBrewer_1.1-3 showtext_0.9-7     showtextdb_3.0    
 [5] sysfonts_0.8.9     patchwork_1.2.0    ggtext_0.1.2       scales_1.3.0      
 [9] viridis_0.6.5      viridisLite_0.4.2  networkD3_0.4      ggrain_0.0.4      
[13] ggstream_0.1.0     rstantools_2.4.0   ggstatsplot_0.12.3 htmlwidgets_1.6.4 
[17] plotly_4.10.4      fmsb_0.7.6         see_0.8.4          performance_0.11.0
[21] car_3.1-2          carData_3.0-5      MBESS_4.9.3        effsize_0.8.1     
[25] SSVS_2.0.0         gtsummary_1.7.2    gtExtras_0.5.0     gt_0.10.1         
[29] formattable_0.2.1  kableExtra_1.4.0   ufs_0.5.12         wCorr_1.9.8       
[33] weights_1.0.4      Hmisc_5.1-3        survey_4.4-2       survival_3.6-4    
[37] Matrix_1.7-0       haven_2.5.4        readxl_1.4.3       lubridate_1.9.3   
[41] forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2       
[45] readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1     
[49] tidyverse_2.0.0    groundhog_3.2.1   

loaded via a namespace (and not attached):
  [1] splines_4.4.1          ggpp_0.5.7             cellranger_1.1.0      
  [4] datawizard_0.10.0      rpart_4.1.23           lifecycle_1.0.4       
  [7] lattice_0.22-6         MASS_7.3-60.2          insight_0.19.11       
 [10] backports_1.5.0        magrittr_2.0.3         sass_0.4.9            
 [13] rmarkdown_2.27         yaml_2.3.8             DBI_1.2.2             
 [16] minqa_1.2.7            DHARMa_0.4.6           abind_1.4-5           
 [19] nnet_7.3-19            ggrepel_0.9.5          correlation_0.8.4     
 [22] gdata_3.0.0            commonmark_1.9.1       svglite_2.1.3         
 [25] codetools_0.2-20       xml2_1.3.6             tidyselect_1.2.1      
 [28] shape_1.4.6.1          farver_2.1.2           lme4_1.1-35.3         
 [31] effectsize_0.8.8       smd_0.7.0              base64enc_0.1-3       
 [34] broom.helpers_1.15.0   jsonlite_1.8.8         gghalves_0.1.4        
 [37] mitml_0.4-5            Formula_1.2-5          iterators_1.0.14      
 [40] emmeans_1.10.2         systemfonts_1.1.0      foreach_1.5.2         
 [43] tools_4.4.1            ragg_1.3.2             Rcpp_1.0.12           
 [46] glue_1.7.0             mnormt_2.1.1           gridExtra_2.3         
 [49] pan_1.9                mgcv_1.9-1             xfun_0.44             
 [52] withr_3.0.0            fastmap_1.2.0          mitools_2.4           
 [55] boot_1.3-30            fansi_1.0.6            digest_0.6.35         
 [58] timechange_0.3.0       R6_2.5.1               estimability_1.5.1    
 [61] mice_3.16.0            textshaping_0.4.0      colorspace_2.1-0      
 [64] gtools_3.9.5           markdown_1.12          utf8_1.2.4            
 [67] generics_0.1.3         data.table_1.15.4      httr_1.4.7            
 [70] parameters_0.21.7      pkgconfig_2.0.3        gtable_0.3.5          
 [73] statsExpressions_1.5.4 htmltools_0.5.8.1      knitr_1.46            
 [76] rstudioapi_0.16.0      tzdb_0.4.0             reshape2_1.4.4        
 [79] coda_0.19-4.1          checkmate_2.3.1        nlme_3.1-164          
 [82] curl_5.2.1             nloptr_2.0.3           ggcorrplot_0.1.4.1    
 [85] parallel_4.4.1         foreign_0.8-86         pillar_1.9.0          
 [88] vctrs_0.6.5            jomo_2.7-6             xtable_1.8-4          
 [91] cluster_2.1.6          htmlTable_2.4.2        paletteer_1.6.0       
 [94] evaluate_0.23          zeallot_0.1.0          mvtnorm_1.2-5         
 [97] cli_3.6.2              compiler_4.4.1         rlang_1.1.3           
[100] labeling_0.4.3         rematch2_2.1.2         plyr_1.8.9            
[103] stringi_1.8.4          pander_0.6.5           munsell_0.5.1         
[106] lazyeval_0.2.2         glmnet_4.1-8           bayestestR_0.13.2     
[109] hms_1.1.3              gridtext_0.1.5         fontawesome_0.5.2     
[112] igraph_2.0.3           broom_1.0.6            RcppParallel_5.1.7    
[115] polynom_1.4-1         

END