Analysis document: Non-Response Bias in Gambling Surveys
Author
Anonymised for review
Published
February 26, 2025
Preamble
This analysis document outlines the analysis code and associated outcomes for our study on response bias in gambling research studies. The analyses reported are for two studies, both of which are part of larger projects. The analyses included here cover all of those reported in our manuscript (plus some additional analyses not reported in the manuscript); the datasets analysed were pre-processed in separate scripts, where aggregate variables were produced to summarise gambling behaviour.
Study 1
This study is part of the overall project with the title “Understanding electronic gaming machine players”. The OSF page for this study can be found here (it is private as of the latest date this file was published).
Study 2
This study is part of the overall project with the title “Safer gambling online: Identifying and supporting at-risk consumers”. The OSF page for this study can be found here and for the overall project here (both are private as of the latest date this file was committed).
Load required packages
Set up analysis environment
Install and load the groundhog package to ensure consistency of the package versions used here:
Show set-up code
# install.packages("groundhog") # Installlibrary('groundhog')# List desired packages:packages <-c('tidyverse', # Clean, organise, and visualise data"readxl", # load Excel files"haven", # Read stata files"survey", # weight study 2 data"weights", # Weighted statistics for study two"wCorr", # Weighted correlations (https://cran.r-project.org/web/packages/wCorr/wCorr.pdf)"ufs", # Cramver's V Effect size'kableExtra', # Make tables'formattable', # Add visualisations to tables'gt', # Alternative table options'gtExtras', # Add colours to gt tables'gtsummary', # Create summary tables'SSVS', # Bayesian variable selection'effsize', # Cohen's d values'MBESS', # 95% CIs for above'car', # Compute VIFs for regression'performance', # Assess model performance'see', # Supports above package'fmsb', # Compute NagelkerkeR2'plotly', # Add interactive elements to figures'htmlwidgets', # Make plotly plots HTML format'ggstatsplot', # Produce plots with combined statistical output'rstantools', # required for above package'ggstream', # smooooth area charts'ggrain', # Raincloud plots'networkD3', # Sankey plots'viridis', # plot colours'scales', # Wrap labels in plots'ggtext', # also helps with plot labels"patchwork", # Assemble/combine plots'sysfonts', # Special fonts for figures'showtext', # Special fonts for figures'RColorBrewer', # Colours for figures'sessioninfo'# Detailed session info for reproducibility ) # Load desired package with versions specific to project start date:# groundhog.library(packages, "2024-05-01", force.install = TRUE)groundhog.library(packages, "2024-05-29")# SANKEY:# install.packages("remotes")# remotes::install_github("davidsjoberg/ggsankey")# library(ggsankey)# SSVS:# install.packages("remotes")# remotes::install_github("sabainter/SSVS")# library('SSVS')
Set up a standard theme for plots/data visualisations:
Show code
# Load new font for figures/graphs
font_add_google("Poppins")
showtext_auto()
# showtext_opts(dpi = 300)
showtext_auto(enable = TRUE)

# Save new theme for figures/graphs. This will determine the layout, presentation,
# font type and font size used in all data visualisations presented here:
plot_theme <- theme_classic() +
  theme(text = element_text(family = "Poppins"),
        plot.title = element_text(hjust = 0.5, size = 26),
        plot.subtitle = element_text(hjust = 0.5, size = 22),
        axis.text = element_text(size = 18),
        axis.title = element_text(size = 22),
        plot.caption = element_text(size = 16),
        legend.title = element_text(size = 18),
        legend.text = element_text(size = 16))

plot_theme_no_sizing <- theme_classic() +
  theme(text = element_text(family = "Poppins"))
Study 1 Analysis
This study pertains to the analysis of a survey of EGM gamblers from a single venue in Australia.
Load data
Load in master dataset containing all customer details (this step is hidden to protect privacy & anonymity)
Data loaded.
People were sent the survey via different methods; let's look at the number of completed surveys by method:
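The counting code is hidden because it sits alongside identifying fields. A minimal sketch of the idea, assuming a hypothetical INVITE_METHOD column recording how each person was invited:

# Sketch only: INVITE_METHOD stands in for the (hidden) variable that
# records the invitation/completion method; the real name may differ.
master_dataset_initial_study1 %>%
  as_tibble() %>%
  dplyr::count(INVITE_METHOD, SURVEY_STATUS_START) %>%
  mutate(Percent = n / sum(n) * 100)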
Now, let’s process the data for analysis by doing the following:
Remove anybody who wasn’t invited to take part in the survey via email or telephone (this means removing some people who completed via other methods, but not all – some people invited via email took part via hardcopy, for example).
Remove anybody who didn't pass the screening question, which asked whether they had gambled with the venue in the past 30 days.
This code contains customer ID numbers and so is hidden.
To summarise the findings of the hidden code in text: there were 4,301 individuals invited to take part in the survey. However, when the account data were provided by the organisation, 33 of these customers had no account data available for the six months preceding the survey invitation and therefore should not have been invited. Five individuals from the remaining sample started the survey but stated they had not gambled in the past 30 days; they were removed from consideration as they did not meet our inclusion criteria for invitation. All analyses and survey uptake figures relate to the remaining 4,260 customers, unless “unfiltered” is specified.
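The filtering step itself is hidden; in outline it amounts to something like the sketch below (INVITE_METHOD is a hypothetical stand-in for the hidden invitation-method variable, and the screen coding is assumed):

# Sketch of the hidden pre-processing; not the exact (hidden) code.
master_dataset_study1 <- master_dataset_initial_study1 %>%
  as_tibble() %>%
  semi_join(invition_data) %>%                           # Keep invited customers with account data
  filter(INVITE_METHOD %in% c("Email", "Telephone")) %>% # Hypothetical column: invitation method
  filter(is.na(screen) | screen != "No")                 # Drop the five who failed the 30-day screen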
Uptake figures
Let’s first look at how survey uptake varied over the window the survey was open:
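The plotting code for Study 1 is hidden; it presumably mirrors the Study 2 version shown later in this document. A sketch, assuming the Study 1 data also record a START_DATE for each survey start (as Study 2 does):

# Sketch modelled on the Study 2 uptake plot further below.
master_dataset_study1 %>%
  filter(SURVEY_STATUS_START != "Not started") %>%
  group_by(START_DATE) %>%
  summarise(Freq = n()) %>%
  ggplot(aes(x = START_DATE, y = Freq)) +
  geom_bar(stat = "identity", fill = "#00798c") +
  labs(x = "Date", y = "Number of clients") +
  plot_theme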
Now compute some figures to tell us the rates of uptake for the survey:
Original, unfiltered dataset (unfiltered in the sense that it includes everyone invited, but not those who were ineligible to be invited because they have no account data; the 5 people who didn't pass the eligibility check are still included):
Show code
master_dataset_initial_study1 %>%
  as_tibble() %>%
  semi_join(invition_data) %>%
  dplyr::count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
  cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Study 1 (unfiltered outcomes)", subtitle = NULL)
Counts for starting the survey: Study 1 (unfiltered outcomes)

Survey status | n | Percent
Not started | 4039 | 94.634489
Started | 229 | 5.365511
Show code
master_dataset_initial_study1 %>%
  as_tibble() %>%
  semi_join(invition_data) %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
  cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Study 1 (unfiltered outcomes)", subtitle = NULL)
Counts for completing the survey up to the PGSI: Study 1 (unfiltered outcomes)

Survey status | n | Percent
Complete | 185 | 4.334583
Incomplete | 44 | 1.030928
Not started | 4039 | 94.634489
Show code
master_dataset_initial_study1 %>%
  as_tibble() %>%
  semi_join(invition_data) %>%
  dplyr::count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
  cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Study 1 (unfiltered outcomes)", subtitle = NULL)
Counts for completing the survey in full: Study 1 (unfiltered outcomes)

Survey status | n | Percent
Complete | 160 | 3.748828
Incomplete | 69 | 1.616682
Not started | 4039 | 94.634489
Now compute the same uptake figures for the filtered dataset:
Show code
master_dataset_study1 %>%
  as_tibble() %>%
  dplyr::count(SURVEY_STATUS_START) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
  cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Study 1", subtitle = NULL)
Counts for starting the survey: Study 1

Survey status | n | Percent
Not started | 4039 | 94.812207
Started | 221 | 5.187793
Show code
master_dataset_study1 %>%
  as_tibble() %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
  cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Study 1", subtitle = NULL)
Counts for completing the survey up to the PGSI: Study 1

Survey status | n | Percent
Complete | 183 | 4.2957746
Incomplete | 38 | 0.8920188
Not started | 4039 | 94.8122066
Show code
master_dataset_study1 %>%
  as_tibble() %>%
  dplyr::count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  gt() %>%
  cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Study 1", subtitle = NULL)
Counts for completing the survey in full: Study 1

Survey status | n | Percent
Complete | 153 | 3.591549
Incomplete | 68 | 1.596244
Not started | 4039 | 94.812207
Survey flow
Show code
# colnames(master_dataset_study1)

# Define relevant survey questions (excluding non-survey & optional ones)
# Note that optional items have been removed here
survey_questions <- c(
  "ipaddress", "screen", "tier",
  "pokielink", "pokiefrequency", "pokievenuetype", "pokiemembership", "pokievenuenumber",
  "budgetset", "budgetstrat", "betsatisfaction", "accountease", "accountuse",
  "accountimprove", "accountintent", "attention1", "accountsocial", "accountadtrack",
  "accountadsetspend", "accountadsettime", "accountadblock", "accountadcash",
  "accountadlink", "accountadtransfer", "accountlikesign", "accountlikepoints",
  "accountlikeonly", "accountlikestatus", "accountlikerequire", "pastafford",
  "pastexcite", "pastback", "pastsold", "pastproblem", "attention2", "paststress",
  "pastcritic", "pasthousehold", "pastguilty", "skspent", "skwin", "sknetoption",
  "skfuturespend", "skfuturewin", "skfutureoption", "skcomfort", "lifesatisfaction",
  "patience", "impulsive", "risks", "alcofreq", "alcostandard", "alcosix", "alcopokies",
  "techsmart", "techapple", "techbank", "sex", "year", "post", "lang", "educ", "work",
  "marstat", "child", "income", "incomevar", "incomepredict", "westhqvenue", "westhqtier",
  "cbenjoy1", "cbmajor2", "cbtop3", "cbcomfort4", "cbenough5", "bc1temp", "bc2break",
  "bc3lazy", "bc4inap", "bc5fun", "bc6refuse", "bc7discipline", "bc8iron", "bc9pleasure",
  "bc10conc", "bc11goals", "bc12wrong", "bc13alter")

# Filter dataset to exclude non-starters (only include those who started the survey)
survey_data_filtered <- master_dataset_study1 %>%
  filter(!is.na(ipaddress) & ipaddress != "") %>%
  select(all_of(survey_questions))

# Inspect data to check we are plotting the right info:
# View(survey_data_filtered)
# print(survey_data_filtered, n = 100)
# All looks good. I've ensured we are including all of the survey questions,
# other than IP address, which we used to tell who started the survey.

# Count non-NA responses for each question
response_counts <- survey_data_filtered %>%
  summarise(across(everything(), ~ sum(!is.na(.) & . != ""),
                   .names = "count_{.col}")) %>%  # Account for completely empty responses and NA values
  pivot_longer(cols = everything(), names_to = "Question", values_to = "Respondents") %>%
  mutate(Question = gsub("count_", "", Question))

# Define survey section positions for text labels:
survey_sections <- data.frame(
  Section = c("Opened survey + screen", "Membership details", "Gambling involvement",
              "Tracking strategies", "Gambling satisfaction", "Account-based gambling",
              "PGSI", "Estimated expenditure", "Life satisfaction", "Personality traits",
              "Alcohol use", "Technology literacy", "Demographics", "Self-control"),
  Start = c("ipaddress",
            "tier",             # Membership details
            "pokievenuetype",   # Gambling involvement
            "budgetset",        # Tracking strategies
            "betsatisfaction",  # Gambling satisfaction
            "accountadsettime", # Perceptions of account-based systems
            "attention2",       # PGSI
            "skfuturespend",    # Estimated expenditure
            "lifesatisfaction", # Life satisfaction
            "impulsive",        # Personality traits
            "alcosix",          # Alcohol use
            "techapple",        # Technology literacy
            "income",           # Demographics
            "bc8iron"))         # Self-control

# Define section start and end positions for lines:
section_lines <- data.frame(
  Position = c("tier",             # Start of Membership details
               "pokiefrequency",   # Start of Gambling involvement
               "budgetset",        # Start of Tracking strategies
               "betsatisfaction",  # Start of Gambling satisfaction
               "accountease",      # Start of Perceptions of account-based systems
               "pastafford",       # Start of PGSI
               "skspent",          # Start of Estimated expenditure
               "lifesatisfaction", # Start of Life satisfaction
               "patience",         # Start of Personality traits
               "alcofreq",         # Start of Alcohol use
               "techsmart",        # Start of Technology literacy
               "sex",              # Start of Demographics
               "bc1temp",          # Start of Self-control
               "bc13alter"))       # End

# Convert positions to numeric for plotting
section_lines$Position <- as.numeric(factor(section_lines$Position, levels = survey_questions))

# Push the final line to the very end of the graph by changing its position:
section_lines <- section_lines %>%
  mutate(Position = case_when(Position == 88 ~ 89,
                              TRUE ~ Position))

# Define questions in each section of the survey:
survey_sections_dot_colours <- data.frame(
  Section = c(rep("Opened survey + screen", 2),
              rep("Membership details", 2),
              rep("Gambling involvement", 4),
              rep("Tracking strategies", 2),
              rep("Gambling satisfaction", 1),
              rep("Account-based gambling", 18),
              rep("PGSI", 10),
              rep("Estimated expenditure", 7),
              rep("Life satisfaction", 1),
              rep("Personality traits", 3),
              rep("Alcohol use", 4),
              rep("Technology literacy", 3),
              rep("Demographics", 18),
              rep("Self-control", 13)),
  Question = c(
    # Opened survey
    "ipaddress", "screen",  # Screening question
    # Membership details
    "pokielink", "tier",
    # Gambling involvement
    "pokiefrequency", "pokievenuetype", "pokiemembership", "pokievenuenumber",
    # Tracking strategies
    "budgetset", "budgetstrat",
    # Gambling satisfaction
    "betsatisfaction",
    # Perceptions of account-based systems
    "accountease", "accountuse", "accountimprove", "accountintent", "attention1",
    "accountsocial", "accountadtrack", "accountadsetspend", "accountadsettime",
    "accountadblock", "accountadcash", "accountadlink", "accountadtransfer",
    "accountlikesign", "accountlikepoints", "accountlikeonly", "accountlikestatus",
    "accountlikerequire",
    # PGSI
    "pastafford", "pastexcite", "pastback", "pastsold", "pastproblem", "attention2",
    "paststress", "pastcritic", "pasthousehold", "pastguilty",
    # Estimated expenditure
    "skspent", "skwin", "sknetoption", "skfuturespend", "skfuturewin",
    "skfutureoption", "skcomfort",
    # Life satisfaction
    "lifesatisfaction",
    # Personality traits
    "patience", "impulsive", "risks",
    # Alcohol use
    "alcofreq", "alcostandard", "alcosix", "alcopokies",
    # Technology literacy
    "techsmart", "techapple", "techbank",  # Removed duplicate "alcopokies"
    # Demographics
    "sex", "year", "post", "lang", "educ", "work", "marstat", "child", "income",
    "incomevar", "incomepredict", "westhqvenue", "westhqtier", "cbenjoy1", "cbmajor2",
    "cbtop3", "cbcomfort4", "cbenough5",
    # Self-control
    "bc1temp", "bc2break", "bc3lazy", "bc4inap", "bc5fun", "bc6refuse", "bc7discipline",
    "bc8iron", "bc9pleasure", "bc10conc", "bc11goals", "bc12wrong", "bc13alter"))

# Ensure response_counts data is structured
response_counts <- response_counts %>%
  # select(-Section) %>%
  full_join(survey_sections_dot_colours) %>%
  filter(!is.na(Respondents)) %>%
  mutate(Question = factor(Question, levels = survey_questions, ordered = TRUE)) %>%  # Preserve order
  # Fix order to account for people who completed later questions within the same section
  group_by(Section) %>%
  arrange(match(Question, survey_questions)) %>%  # Explicitly order by predefined question order
  mutate(Respondents = cummin(Respondents)) %>%   # Apply cumulative minimum
  ungroup()

# Update data to account for absence of info about starts saved from Qualtrics
# (data extracted by someone else on our team)
response_counts <- response_counts %>%
  mutate(Respondents = case_when(Respondents == 221 ~ 1328,
                                 TRUE ~ Respondents))

# Wrap section labels to a maximum of 30 characters
survey_sections <- survey_sections %>%
  mutate(Section = str_wrap(Section, width = 30))

# Create a lookup table for section labels
section_labels <- setNames(survey_sections$Section, survey_sections$Start)

# Ensure Section is an ordered factor based on survey flow
response_counts$Section <- factor(response_counts$Section, levels = unique(survey_sections$Section))

# Generate a perceptually uniform colour scale using viridis
num_sections <- length(unique(response_counts$Section))
section_colors <- viridis(14, option = "mako")  # Blue-green to blue

# Plot
main_plot <- ggplot(response_counts, aes(x = Question, y = Respondents, group = 1, color = Section)) +
  geom_line(color = "grey", size = 0.8) +  # Line plot
  # Dots coloured by ordered survey section (now fading from dark to light)
  geom_point(size = 1.1) +
  # Apply wrapped section labels in scale_color_manual()
  scale_color_manual(values = setNames(section_colors, levels(response_counts$Section))) +
  # Add dashed vertical lines for section boundaries
  geom_vline(data = section_lines, aes(xintercept = Position - 0.5),
             color = "grey50", linetype = "dashed", size = 0.5) +
  # Ensure x-axis follows the predefined order
  scale_x_discrete(limits = survey_questions,
                   labels = function(x) ifelse(x %in% names(section_labels), section_labels[x], "")) +
  # Add curved annotation line
  geom_curve(aes(x = 36.3, y = 355, xend = 39.5, yend = 355), curvature = 0,
             arrow = arrow(length = unit(0.03, "npc"), type = "closed"),
             color = "black", size = 0.3) +
  # Add text labels
  annotate("text", x = 34.5, y = 400, label = "Pre-defined\ncompletion\npoint",
           size = 2, fontface = "plain", family = "Poppins") +
  annotate("text", x = 87.5, y = 450, label = "End of survey", angle = 90,
           size = 2, fontface = "plain", family = "Poppins") +
  labs(y = "Number of Respondents\n(log-10 scale)") +
  # Theme
  plot_theme +
  theme(axis.text.x = element_text(angle = 50, hjust = 1, vjust = 1.03),
        axis.title.y = element_text(vjust = 2),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.margin = margin(2, 5, 2, 15),
        legend.position = "none")

main_plot + labs(title = "Survey 1 flow with natural y-axis scale", y = "")
What are the general demographic and gambling characteristics of those who did and did not take part in the survey?
Show code
master_dataset_study1 %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>%  # Invert net outcome so losses are negative and vice versa
  tbl_summary(
    include = c(
      # Demographic details
      age_westhq, gender_westhq, IRSAD_SCORE, memberyears,
      # Gambling
      days_since_lastplay, num_6sday, intensity_6sday, averagebet_mean6s,
      montly_net_outcome_6m, gdquiet6_percent, timedevice_sum6s),
    type = all_continuous() ~ "continuous",
    statistic = list(all_continuous() ~ c("{mean} ({sd}), {median}")),
    digits = all_continuous() ~ 2,
    by = STATUS,
    label = c(age_westhq ~ "Age",
              gender_westhq ~ "Gender",
              IRSAD_SCORE ~ "IRSAD score",
              memberyears ~ "No. of years as a member",
              days_since_lastplay ~ "No. days since last bet",
              num_6sday ~ "No. betting days",
              intensity_6sday ~ "Mean games played per active day",
              averagebet_mean6s ~ "Mean stake",
              montly_net_outcome_6m ~ "Mean net outcome per month",
              gdquiet6_percent ~ "Percentage of sessions between midnight & 6AM",
              timedevice_sum6s ~ "Total time on device (minutes)")) %>%
  add_overall() %>%
  add_difference(list(all_continuous() ~ "cohens_d",  # For continuous variables
                      all_categorical() ~ "smd")) %>%
  modify_header(label = "Variable") %>%
  as_tibble() %>%
  filter(Variable != "Unknown") %>%
  rename("Effect_size" = `**Difference**`) %>%
  unite("Effect size", Effect_size, `**95% CI**`, sep = " [", remove = TRUE) %>%
  mutate("Effect size" = paste0(`Effect size`, "]")) %>%
  mutate(`Effect size` = case_when(`Effect size` == "NA [NA]" ~ "",
                                   TRUE ~ `Effect size`)) %>%
  gt() %>%
  tab_options(data_row.padding = px(1.5)) %>%
  tab_header(title = "Table 2. Study 1 - Comparison of participants and non-participants: Demographic and gambling characteristics") %>%
  cols_align(align = "right", columns = c(Variable)) %>%
  cols_align(align = "center", columns = c(2:4)) %>%
  tab_footnote(footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
               locations = cells_column_labels(columns = c(2, 3))) %>%
  tab_footnote(footnote = "Continuous variables: Cohen's D [95% Confidence intervals (CIs)]; Categorical variables: Cramer's V [95% CIs]",
               locations = cells_column_labels(columns = vars(`Effect size`)))
Table 2. Study 1 - Comparison of participants and non-participants: Demographic and gambling characteristics

Variable | Overall, N = 4,260¹ | Participants, N = 183¹ | Non-participants, N = 4,077 | Effect size²
Age | 53.99 (14.80), 55.00 | 52.08 (13.97), 53.00 | 54.07 (14.84), 55.00 | -0.13 [-0.28, 0.01]
Gender | NA | NA | NA | 0.13 [-0.02, 0.28]
F | 2,223 (52%) | 107 (58%) | 2,116 (52%) |
M | 2,036 (48%) | 76 (42%) | 1,960 (48%) |
IRSAD score | 958.93 (77.46), 970.79 | 961.37 (76.20), 977.56 | 958.82 (77.52), 970.79 | 0.03 [-0.12, 0.18]
No. of years as a member | 10.61 (7.85), 9.40 | 12.16 (9.11), 9.90 | 10.54 (7.78), 9.40 | 0.21 [0.06, 0.36]
No. days since last bet | 14.61 (14.87), 10.00 | 17.53 (17.08), 12.00 | 14.48 (14.75), 9.00 | 0.21 [0.06, 0.35]
No. betting days | 19.17 (26.01), 9.00 | 22.63 (25.78), 15.00 | 19.02 (26.01), 9.00 | 0.14 [-0.01, 0.29]
Mean games played per active day | 1,074.20 (922.03), 841.00 | 1,178.75 (932.99), 1,017.70 | 1,069.51 (921.37), 828.80 | 0.12 [-0.03, 0.27]
Mean stake | 3.66 (9.10), 1.43 | 3.26 (7.71), 1.70 | 3.68 (9.15), 1.41 | -0.05 [-0.19, 0.10]
Mean net outcome per month | 720.38 (1,497.50), 192.91 | 1,011.20 (1,734.67), 401.06 | 707.00 (1,484.57), 187.62 | 0.20 [0.05, 0.35]
Percentage of sessions between midnight & 6AM | 26.31 (25.25), 18.39 | 20.98 (18.11), 18.17 | 26.56 (25.52), 18.39 | -0.22 [-0.42, -0.02]
Total time on device (minutes) | 2,555.85 (4,651.01), 811.19 | 2,892.40 (4,037.00), 1,631.00 | 2,540.74 (4,676.54), 791.18 | 0.08 [-0.07, 0.22]

¹ Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
² Continuous variables: Cohen's D [95% Confidence intervals (CIs)]; Categorical variables: Cramer's V [95% CIs]
NOTE: the net outcome values and effect size were reverse scored, so positive values indicate losses; we have therefore inverted these values in the manuscript.
Another important thing to note about the above table is that the effect size for the categorical variable (gender) is actually a standardised mean difference, which I don't think makes much sense here. I'm going to compute the Cramér's V effect size here and put it into the table and the manuscript:
Show code
# Calculate Cramér's V and its confidence interval
combined_table <- master_dataset_study1 %>%
  group_by(STATUS, gender_westhq) %>%
  dplyr::count() %>%
  drop_na() %>%
  ungroup() %>%
  spread(key = gender_westhq, value = n, fill = 0) %>%
  select(-1) %>%
  as.matrix()

cramersV(combined_table)
Calculate odds ratio instead, as per the reviewer’s suggestion:
Show code
combined_table <- master_dataset_study1 %>%
  group_by(STATUS, gender_westhq) %>%
  dplyr::count() %>%
  drop_na() %>%
  ungroup() %>%
  spread(key = gender_westhq, value = n, fill = 0) %>%
  column_to_rownames(var = "STATUS") %>%
  as.matrix()

oddsratio(combined_table)
Disease Nondisease Total
Exposed 107 76 183
Nonexposed 2116 1960 4076
Total 2223 2036 4259
Odds ratio estimate and its significance probability
data: combined_table
p-value = 0.08243
95 percent confidence interval:
0.9657426 1.7610017
sample estimates:
[1] 1.304099
Sensitivity check comparison
How does the group comparison differ if we change the criterion for survey completion? Reviewers didn’t ask for this change specifically, but some suggestions led us to conduct these additional sensitivity checks.
For this, we will change the participant group to anyone who fully completed the survey and non-participants as anyone who didn’t start or didn’t complete the survey.
Show code
# Check coding of variable that tells us whether somebody fully completed the survey or not
master_dataset_study1 %>%
  count(STATUS, SURVEY_STATUS_FULL)
# A tibble: 4 × 3
STATUS SURVEY_STATUS_FULL n
<fct> <chr> <int>
1 Participants Complete 153
2 Participants Incomplete 30
3 Non-participants Incomplete 38
4 Non-participants Not started 4039
Show code
master_dataset_study1_sensitivity_check <- master_dataset_study1 %>%
  mutate(SURVEY_STATUS_FULL_BINARY = case_when(
    SURVEY_STATUS_FULL == "Incomplete" ~ "Non-participants",
    SURVEY_STATUS_FULL == "Not started" ~ "Non-participants",
    TRUE ~ "Participants")) %>%
  mutate(SURVEY_STATUS_FULL_BINARY = fct_relevel(SURVEY_STATUS_FULL_BINARY,
                                                 "Participants",
                                                 "Non-participants"))  # Fix order of presentation

# Check recoding has worked:
master_dataset_study1_sensitivity_check %>%
  count(STATUS, SURVEY_STATUS_FULL_BINARY)
# A tibble: 3 × 3
STATUS SURVEY_STATUS_FULL_BINARY n
<fct> <fct> <int>
1 Participants Participants 153
2 Participants Non-participants 30
3 Non-participants Non-participants 4077
Show code
# Produce updated table
master_dataset_study1_sensitivity_check %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>%  # Invert net outcome so losses are negative and vice versa
  tbl_summary(
    include = c(
      # Demographic details
      age_westhq, gender_westhq, IRSAD_SCORE, memberyears,
      # Gambling
      days_since_lastplay, num_6sday, intensity_6sday, averagebet_mean6s,
      montly_net_outcome_6m, gdquiet6_percent, timedevice_sum6s),
    type = all_continuous() ~ "continuous",
    statistic = list(all_continuous() ~ c("{mean} ({sd}), {median}")),
    digits = all_continuous() ~ 2,
    by = SURVEY_STATUS_FULL_BINARY,
    label = c(age_westhq ~ "Age",
              gender_westhq ~ "Gender",
              IRSAD_SCORE ~ "IRSAD score",
              memberyears ~ "No. of years as a member",
              days_since_lastplay ~ "No. days since last bet",
              num_6sday ~ "No. betting days",
              intensity_6sday ~ "Mean games played per active day",
              averagebet_mean6s ~ "Mean stake",
              montly_net_outcome_6m ~ "Mean net outcome per month",
              gdquiet6_percent ~ "Percentage of sessions between midnight & 6AM",
              timedevice_sum6s ~ "Total time on device (minutes)")) %>%
  add_overall() %>%
  add_difference(list(all_continuous() ~ "cohens_d",  # For continuous variables
                      all_categorical() ~ "smd")) %>%
  modify_header(label = "Variable") %>%
  as_tibble() %>%
  filter(Variable != "Unknown") %>%
  rename("Effect_size" = `**Difference**`) %>%
  unite("Effect size", Effect_size, `**95% CI**`, sep = " [", remove = TRUE) %>%
  mutate("Effect size" = paste0(`Effect size`, "]")) %>%
  mutate(`Effect size` = case_when(`Effect size` == "NA [NA]" ~ "",
                                   TRUE ~ `Effect size`)) %>%
  gt() %>%
  tab_options(data_row.padding = px(1.5)) %>%
  cols_align(align = "right", columns = c(Variable)) %>%
  cols_align(align = "center", columns = c(2:4)) %>%
  tab_footnote(footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
               locations = cells_column_labels(columns = c(2, 3))) %>%
  tab_footnote(footnote = "Continuous variables: Cohen's D [95% Confidence intervals (CIs)]; Categorical variables: Cramer's V [95% CIs]",
               locations = cells_column_labels(columns = vars(`Effect size`)))
Variable | Overall, N = 4,260¹ | Participants, N = 153¹ | Non-participants, N = 4,107 | Effect size²
Age | 53.99 (14.80), 55.00 | 50.59 (13.56), 51.00 | 54.11 (14.83), 55.00 | -0.24 [-0.40, -0.08]
Gender | NA | NA | NA | 0.06 [-0.11, 0.22]
F | 2,223 (52%) | 84 (55%) | 2,139 (52%) |
M | 2,036 (48%) | 69 (45%) | 1,967 (48%) |
IRSAD score | 958.93 (77.46), 970.79 | 965.15 (75.81), 977.56 | 958.69 (77.52), 970.79 | 0.08 [-0.08, 0.24]
No. of years as a member | 10.61 (7.85), 9.40 | 11.77 (9.03), 9.50 | 10.56 (7.80), 9.40 | 0.15 [-0.01, 0.32]
No. days since last bet | 14.61 (14.87), 10.00 | 16.59 (16.64), 10.00 | 14.54 (14.80), 10.00 | 0.14 [-0.02, 0.30]
No. betting days | 19.17 (26.01), 9.00 | 22.18 (24.23), 15.00 | 19.06 (26.07), 9.00 | 0.12 [-0.04, 0.28]
Mean games played per active day | 1,074.20 (922.03), 841.00 | 1,076.81 (875.62), 918.47 | 1,074.10 (923.81), 833.78 | 0.00 [-0.16, 0.16]
Mean stake | 3.66 (9.10), 1.43 | 3.37 (8.38), 1.67 | 3.68 (9.12), 1.42 | -0.03 [-0.19, 0.13]
Mean net outcome per month | 720.38 (1,497.50), 192.91 | 907.84 (1,665.54), 328.17 | 713.24 (1,490.50), 190.50 | 0.13 [-0.03, 0.29]
Percentage of sessions between midnight & 6AM | 26.31 (25.25), 18.39 | 21.61 (18.89), 18.17 | 26.49 (25.46), 18.39 | -0.19 [-0.41, 0.02]
Total time on device (minutes) | 2,555.85 (4,651.01), 811.19 | 2,747.25 (3,903.29), 1,604.52 | 2,548.72 (4,676.80), 796.10 | 0.04 [-0.12, 0.20]

¹ Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
² Continuous variables: Cohen's D [95% Confidence intervals (CIs)]; Categorical variables: Cramer's V [95% CIs]
Show code
# Calculate Cramér's V and its confidence interval
combined_table <- master_dataset_study1_sensitivity_check %>%
  group_by(SURVEY_STATUS_FULL_BINARY, gender_westhq) %>%
  dplyr::count() %>%
  drop_na() %>%
  ungroup() %>%
  spread(key = gender_westhq, value = n, fill = 0) %>%
  select(-1) %>%
  as.matrix()

cramersV(combined_table)
The effect sizes appear to be smaller in these comparisons. Let’s check if this is because the group who started the survey but didn’t complete it have now been merged with non-responders, making the latter group more similar to responders, or because those who fully complete the survey are actually more similar to those who didn’t start it:
Show code
master_dataset_study1_sensitivity_check %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>%  # Invert net outcome so losses are negative and vice versa
  tbl_summary(
    include = c(
      # Demographic details
      age_westhq, gender_westhq, IRSAD_SCORE, memberyears,
      # Gambling
      days_since_lastplay, num_6sday, intensity_6sday, averagebet_mean6s,
      montly_net_outcome_6m, gdquiet6_percent, timedevice_sum6s),
    type = all_continuous() ~ "continuous",
    statistic = list(all_continuous() ~ c("{mean} ({sd}), {median}")),
    digits = all_continuous() ~ 2,
    by = SURVEY_STATUS_FULL,
    label = c(age_westhq ~ "Age",
              gender_westhq ~ "Gender",
              IRSAD_SCORE ~ "IRSAD score",
              memberyears ~ "No. of years as a member",
              days_since_lastplay ~ "No. days since last bet",
              num_6sday ~ "No. betting days",
              intensity_6sday ~ "Mean games played per active day",
              averagebet_mean6s ~ "Mean stake",
              montly_net_outcome_6m ~ "Mean net outcome per month",
              gdquiet6_percent ~ "Percentage of sessions between midnight & 6AM",
              timedevice_sum6s ~ "Total time on device (minutes)")) %>%
  modify_header(label = "Variable") %>%
  as_tibble() %>%
  filter(Variable != "Unknown") %>%
  gt() %>%
  tab_options(data_row.padding = px(1.5)) %>%
  tab_header(title = "Study 1 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey")
Study 1 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey

Variable | Complete, N = 153 | Incomplete, N = 68 | Not started, N = 4,039
Age | 50.59 (13.56), 51.00 | 56.29 (14.95), 59.00 | 54.08 (14.83), 55.00
Gender | NA | NA | NA
F | 84 (55%) | 41 (60%) | 2,098 (52%)
M | 69 (45%) | 27 (40%) | 1,940 (48%)
IRSAD score | 965.15 (75.81), 977.56 | 949.86 (75.71), 968.48 | 958.84 (77.55), 970.79
No. of years as a member | 11.77 (9.03), 9.50 | 12.92 (8.95), 11.30 | 10.52 (7.77), 9.40
No. days since last bet | 16.59 (16.64), 10.00 | 18.19 (17.26), 12.50 | 14.48 (14.75), 9.00
No. betting days | 22.18 (24.23), 15.00 | 29.10 (30.27), 19.00 | 18.89 (25.96), 9.00
Mean games played per active day | 1,076.81 (875.62), 918.47 | 1,599.03 (1,142.05), 1,305.49 | 1,065.27 (917.30), 827.81
Mean stake | 3.37 (8.38), 1.67 | 2.33 (1.95), 1.64 | 3.70 (9.19), 1.41
Mean net outcome per month | 907.84 (1,665.54), 328.17 | 1,471.40 (2,484.41), 789.93 | 700.21 (1,464.66), 181.50
Percentage of sessions between midnight & 6AM | 21.61 (18.89), 18.17 | 27.62 (25.92), 21.11 | 26.47 (25.46), 18.24
Total time on device (minutes) | 2,747.25 (3,903.29), 1,604.52 | 4,135.38 (5,111.33), 2,247.80 | 2,522.01 (4,665.22), 782.30
Interestingly, it seems that the group who started but didn’t complete the survey are most different, trending towards being more involved gamblers.
PGSI
Also check the PGSI score of the survey sample for inclusion in the table later on. For this, we will remove anyone who didn't pass the attention checks, to get the most reliable estimate:
Show code
summary_PGSI <- master_dataset_study1 %>%
  filter(attention1 == "2" &           # Filter to correct responses to attention checks
           attention2 == "Sometimes") %>%
  summarise(Mean_PGSI = mean(PGSI_TOTAL, na.rm = TRUE),
            SD_PGSI = sd(PGSI_TOTAL, na.rm = TRUE),
            Median_PGSI = median(PGSI_TOTAL, na.rm = TRUE))

mean_ps_PGSI_value <- summary_PGSI$Mean_PGSI
sd_ps_PGSI_value <- summary_PGSI$SD_PGSI
median_ps_PGSI_value <- summary_PGSI$Median_PGSI

# Can produce a sentence that summarises the outcomes:
summary_sentence_PGSI_study1 <- sprintf(
  "The mean PGSI total score is %.2f (standard deviation = %.2f) and the median is %.2f.",
  mean_ps_PGSI_value, sd_ps_PGSI_value, median_ps_PGSI_value)

summary_sentence_PGSI_study1
[1] "The mean PGSI total score is 5.50 (standard deviation = 5.01) and the median is 4.00."
Composite risk scores
Now let’s produce a single composite score that represents people’s past six months gambling behaviour.
One thing we need to consider here is that some of the variables we've created could be biased by recent membership registration. However, looking at the data, all but 2 of the participants have been registered for at least six months, and the 2 people who had been registered for less than this had both been registered for 0.4 of a year, the equivalent of approximately five months. So, this is unlikely to affect our outcomes here.
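A check along these lines (a sketch; the hidden code may differ) underlies the tibble below:

# Sketch: count customers registered for less than half a year.
master_dataset_study1 %>%
  filter(memberyears < 0.5) %>%
  dplyr::count(memberyears)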
# A tibble: 1 × 2
memberyears n
<dbl> <int>
1 0.4 51
Okay, now that we have all of the relevant variables, let's look at how they correlate with PGSI and harm scores using a series of bivariate correlations visualised as a heatmap:
Show code
PGSI_cor_data_study1 <- master_dataset_study1 %>%
  # colnames()
  select(PGSI_TOTAL,
         gd_distinct_6s,            # Number of distinct games played (by brand)
         game_description_count6s,  # Total number of sessions played
         timedevice_sum6s,          # Total time on device
         timedevice_mean6s,         # Average time on device per session
         timedevice_std6s,          # SD of time on device
         turnover1_sum6s,           # Total turnover
         turnover1_mean6s,          # Mean turnover per session
         turnover1_std6s,           # SD of turnover
         actualwin_sum6s,           # Total net outcome
         actualwin_mean6s,          # Mean net outcome per session
         actualwin_std6s,           # SD of session net outcome
         averagebet_mean6s,         # Mean bet size
         averagebet_std6s,          # SD of average bet size
         gamesplayed_sum6s,         # Total games played
         gamesplayed_mean6s,        # Mean games played per session
         gamesplayed_std6s,         # SD of games played per session
         max_turnover6sday,         # Maximum total turnover in a single day
         max_gamesplayed6sday,      # Maximum total games played in a single day
         num_6sday,                 # Number of active betting days
         intensity_6sday,           # Number of games per active day
         averagebet_6mos,           # Average monthly bet freq (days)
         averageturnover_6mos,      # Average monthly turnover
         montly_net_outcome_6m)     # Average monthly net outcome

ggstatsplot::ggcorrmat(PGSI_cor_data_study1,
                       type = "parametric",
                       results.subtitle = FALSE,
                       ggcorrplot.args = c(lab_size = 3)) +
  theme(axis.text.y = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(axis.text.x = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(legend.title = element_text(color = "black", size = 12, face = "plain", family = "Poppins")) +
  theme(legend.title.align = .5) +
  theme(legend.text = element_text(color = "black", size = 8, face = "plain", family = "Poppins"))
Now let’s use all of these variables to create a composite index score of risk/involvement:
Show code
master_dataset_study1_with_risk_scores <- master_dataset_study1 %>%
  mutate(actualwin_sum6s = actualwin_sum6s*-1) %>%    # Invert outcome values
  mutate(actualwin_mean6s = actualwin_mean6s*-1) %>%  # Invert outcome values
  mutate(across(c(gd_distinct_6s,            # Number of distinct games played (by brand)
                  game_description_count6s,  # Total number of sessions played
                  timedevice_sum6s,          # Total time on device
                  timedevice_mean6s,         # Average time on device per session
                  turnover1_sum6s,           # Total turnover
                  turnover1_mean6s,          # Mean turnover per session
                  turnover1_std6s,           # SD of turnover
                  actualwin_sum6s,           # Total net outcome
                  actualwin_mean6s,          # Mean net outcome per session
                  actualwin_std6s,           # SD of session net outcome
                  averagebet_sum6s,          # Total bet size
                  averagebet_mean6s,         # Mean bet size
                  averagebet_std6s,          # SD of average bet size
                  gamesplayed_sum6s,         # Total games played
                  gamesplayed_mean6s,        # Mean games played per session
                  gamesplayed_std6s,         # SD of games played per session
                  max_turnover6sday,         # Maximum total turnover in a single day
                  max_gamesplayed6sday,      # Maximum total games played in a single day
                  num_6sday,                 # Number of active betting days
                  intensity_6sday,           # Number of games per active day
                  averagebet_6mos,           # Average monthly bet freq (days)
                  montly_net_outcome_6m,     # Average monthly net outcome
                  averageturnover_6mos),
                scale,                       # The scale function computes z-scores
                .names = "z_{.col}")) %>%    # Create new named columns for easy selection below
  rowwise() %>%                              # Group by rows
  mutate(`Composite Risk Score` = sum(c_across(starts_with("z_")), na.rm = TRUE)) %>%  # Sum across all newly created z-scores
  ungroup()

# colnames(master_dataset_study2_with_risk_scores)
Correlations
Now let’s look at how well this composite score correlates with PGSI scores in the survey sample:
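The call itself is hidden, but judging by the output it is of this form (a sketch):

# Sketch of the (hidden) correlation test reported below.
cor.test(master_dataset_study1_with_risk_scores$PGSI_TOTAL,
         master_dataset_study1_with_risk_scores$`Composite Risk Score`,
         method = "spearman")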
Spearman's rank correlation rho
data: master_dataset_study1_with_risk_scores$PGSI_TOTAL and master_dataset_study1_with_risk_scores$`Composite Risk Score`
S = 846277, p-value = 0.02031
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.1714405
Confirm how many people are involved in this calculation:
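The counting code is hidden; a sketch of the idea:

# Sketch: number of complete PGSI/composite-score pairs entering the correlation.
master_dataset_study1_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL), !is.na(`Composite Risk Score`)) %>%
  nrow()

The Welch t-test below (code also hidden) then compares composite risk scores between participants and non-participants: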
Welch Two Sample t-test
data: Composite Risk Score by STATUS
t = 2.3269, df = 197.85, p-value = 0.02098
alternative hypothesis: true difference in means between group Participants and group Non-participants is not equal to 0
95 percent confidence interval:
0.3172207 3.8424226
sample estimates:
mean in group Participants mean in group Non-participants
1.99047717 -0.08934445
Show code
# Calculate means and standard deviations
summary_composite_study1 <- master_dataset_study1_with_risk_scores %>%
  dplyr::group_by(STATUS) %>%
  dplyr::summarize(mean = mean(`Composite Risk Score`),
                   sd = sd(`Composite Risk Score`),
                   n = n())

gt(summary_composite_study1)
STATUS | mean | sd | n
Participants | 1.99047717 | 11.84102 | 183
Non-participants | -0.08934445 | 11.54773 | 4077
Show code
# Calculate Cohen's d with 95% confidence interval
cohen_d_result_study1 <- cohen.d(`Composite Risk Score` ~ STATUS,
                                 data = master_dataset_study1_with_risk_scores)

cohen_d_result_study1
Cohen's d
d estimate: 0.1799089 (negligible)
95 percent confidence interval:
lower upper
0.03171688 0.32810096
Quartiles
Now, let’s divide the sample into quartiles based on their composite summary score:
Show code
# Calculate quartiles of the composite risk score
quartiles_study1 <- quantile(master_dataset_study1_with_risk_scores$`Composite Risk Score`,
                             probs = c(0, 0.25, 0.5, 0.75, 1))

master_dataset_study1_with_risk_scores$Quartile <- cut(
  master_dataset_study1_with_risk_scores$`Composite Risk Score`,
  breaks = quartiles_study1,
  labels = c("Bottom quartile: least involved",
             "Second quartile",
             "Third quartile",
             "Top quartile: most involved"),
  include.lowest = TRUE)
Now compute the proportion of individuals from each quartile who went on to take part in the survey:
Show code
master_dataset_study1_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
  tab_header("Counts for granular survey status by composite risk score quartile", subtitle = NULL)
Counts for granular survey status by composite risk score quartile

Quartile | SURVEY_STATUS_PGSI | n
Bottom quartile: least involved | Complete | 34
Bottom quartile: least involved | Incomplete | 4
Bottom quartile: least involved | Not started | 1027
Second quartile | Complete | 40
Second quartile | Incomplete | 7
Second quartile | Not started | 1018
Third quartile | Complete | 49
Third quartile | Incomplete | 10
Third quartile | Not started | 1006
Top quartile: most involved | Complete | 60
Top quartile: most involved | Incomplete | 17
Top quartile: most involved | Not started | 988
Show code
master_dataset_study1_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(STATUS) %>%
  mutate('Percent of quartile' = round(n/sum(n)*100, 2)) %>%
  gt() %>%
  tab_header("Counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Counts (with percentages) for broad survey status by composite risk score quartile

Quartile | STATUS | n | Percent of quartile
Bottom quartile: least involved | Participants | 34 | 3.19
Bottom quartile: least involved | Non-participants | 1031 | 96.81
Second quartile | Participants | 40 | 3.76
Second quartile | Non-participants | 1025 | 96.24
Third quartile | Participants | 49 | 4.60
Third quartile | Non-participants | 1016 | 95.40
Top quartile: most involved | Participants | 60 | 5.63
Top quartile: most involved | Non-participants | 1005 | 94.37
And visualise this using a Sankey diagram:
Show code
# Create Sankey plot
desired_order_sankey <- c("Top quartile: most involved",
                          "Third quartile",
                          "Second quartile",
                          "Bottom quartile: least involved")

# We first need to create a data frame containing the flows: "from" and "to" categories
# and "intensity" values (i.e., N):
risk_to_survey_links <- master_dataset_study1_with_risk_scores %>%
  select(CLIENT_ID, SURVEY_STATUS_PGSI, Quartile) %>%
  group_by(Quartile, SURVEY_STATUS_PGSI) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Quartile) %>%
  mutate(percent = n/sum(n)*100) %>%
  select(-n) %>%
  ungroup() %>%
  mutate(Quartile = factor(Quartile, levels = desired_order_sankey)) %>%
  arrange(Quartile)

# From these flows we have to create a "node" data frame which provides a list of all
# categories (to and from) in the data frame:
nodes <- data.frame(name = c(as.character(risk_to_survey_links$Quartile),
                             as.character(risk_to_survey_links$SURVEY_STATUS_PGSI)) %>% unique())

# For the networkD3 package, the links between categories are made using IDs and not the
# character names provided in the links data frame. This code assigns each "from" and "to"
# category a number to make the numerical connections explicit:
risk_to_survey_links$IDsource <- match(risk_to_survey_links$Quartile, nodes$name) - 1
risk_to_survey_links$IDtarget <- match(risk_to_survey_links$SURVEY_STATUS_PGSI, nodes$name) - 1

# Prepare colour scale
ColourScal <- 'd3.scaleOrdinal() .range(["#d1495b", "#edae49","#00798c","#2e4057","#35B779FF","#31688EFF","#31688EFF","#482878FF","#B4DE2CFF","#FDE725FF", "#26828EFF","#3E4A89FF","#440154FF","#6DCD59FF"])'

# Create Sankey plot
sankey_plot1 <- sankeyNetwork(Links = risk_to_survey_links, Nodes = nodes,
                              Source = "IDsource", Target = "IDtarget",
                              Value = "percent", NodeID = "name", iterations = 0,
                              sinksRight = FALSE,
                              fontSize = 16, fontFamily = "Poppins",
                              nodeWidth = 30, nodePadding = 25,
                              colourScale = ColourScal)

# Custom CSS with Google Fonts link
custom_css <- "<link href='https://fonts.googleapis.com/css2?family=Poppins:wght@400;700&display=swap' rel='stylesheet'><style> text { font-family: 'Poppins', sans-serif !important; }</style>"

# Save plot as a HTML file with custom CSS
html_file <- tempfile(fileext = ".html")
saveWidget(widget = htmlwidgets::prependContent(sankey_plot1),
           file = html_file, selfcontained = TRUE)

# Read the saved HTML and ensure the custom CSS is properly embedded
html_content <- readLines(html_file, warn = FALSE)
html_content <- gsub("</head>", paste0(custom_css, "</head>"), html_content)
writeLines(html_content, con = "sankey_plot1.html")

# Remove the temporary file
file.remove(html_file)
[1] TRUE
Show code
# sankey_plot1
Logistic regression
Now let's perform a logistic regression model to identify the strongest predictors of survey uptake.
SSVS
Before we can actually run the model, we need to identify a suitable suite of predictor variables to include, using Stochastic Search Variable Selection (SSVS) and the related SSVS package (note that the actual SSVS process is commented out in rendered versions of this document as it takes a long time to process).
As we want to maximise the overall predictive ability of the model, we have opted to set the parameters of the SSVS so as to have a low threshold for including/selecting relevant variables. First, we have set the prior inclusion probability to 0.9, reflecting a belief that predictors are more likely to be included than not (a prior inclusion probability of 0.5 would reflect the belief that each predictor has an equal probability of being included or excluded). Second, we have set the cut-off for marginal inclusion probability (MIP) values to 0.4. There is no standard recommendation for what this cut-off should be, but others have used 0.5. For this specific study, we did adjust this value in response to previous results, given that only 2 predictors were identified based on a lower MIP threshold and we are trying to build a model that maximises the ability to predict survey engagement based on account data.
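In terms of the SSVS package, these two choices enter at different points: the prior inclusion probability is supplied via the inprob argument of ssvs(), while the MIP cut-off is applied afterwards when summarising or plotting the results. A sketch of the mapping (note the commented-out call further below uses inprob = 0.95):

# Sketch of where the two parameters above enter the SSVS workflow;
# dat/predictors/outcome are placeholders for the objects defined below.
results <- ssvs(data = dat, x = predictors, y = outcome,
                inprob = 0.9,          # Prior inclusion probability
                continuous = FALSE, progress = TRUE)
plot(results, threshold = 0.4)         # MIP cut-off applied at the summary/plotting stage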
One thing we need to do before all of this is impute missing values for the IRSAD scores, as SSVS doesn't seem to handle missing values well. To do this, we will impute the median value for each person's state as their missing value.
There are also some SD variables which don't have scores because those people only bet once in the window; for these, we will set the value to 0. Finally, for days since registration, we'll set missing values to the median as well.
Show code
# Function to impute missing values with the median
impute_na_with_median <- function(data) {
  data %>%
    # Impute SD variables with 0
    mutate(across(contains("std6s"), ~ ifelse(is.na(.), 0, .))) %>%
    # Impute all other variables with the median
    mutate(across(!contains("std6s"), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
}

# Prepare the dataset
master_dataset_study1_SSVS <- master_dataset_study1 %>%
  # Impute missing IRSAD scores
  mutate(IRSAD_SCORE = ifelse(is.na(IRSAD_SCORE), median(IRSAD_SCORE, na.rm = TRUE), IRSAD_SCORE)) %>%
  # Convert all required variables to numeric
  mutate(across(c(age_westhq, IRSAD_SCORE, memberyears, days_since_lastplay,
                  gd_distinct_6s, game_description_count6s, timedevice_sum6s,
                  timedevice_mean6s, timedevice_std6s, turnover1_sum6s, turnover1_mean6s,
                  turnover1_std6s, actualwin_sum6s, actualwin_mean6s, actualwin_std6s,
                  averagebet_mean6s, averagebet_std6s, gamesplayed_sum6s,
                  gamesplayed_mean6s, gamesplayed_std6s, max_turnover6sday,
                  max_gamesplayed6sday, num_6sday, intensity_6sday, averagebet_6mos,
                  averageturnover_6mos), ~ as.numeric(.))) %>%
  # Convert gender to numeric
  mutate(GENDER = case_when(gender_westhq == 'F' ~ 0,
                            gender_westhq == 'M' ~ 1,
                            TRUE ~ NA_real_)) %>%
  # Convert STATUS to numeric
  mutate(STATUS = case_when(STATUS == 'Non-participants' ~ 0,
                            STATUS == 'Participants' ~ 1)) %>%
  mutate(STATUS = as.numeric(STATUS)) %>%
  # Impute missing values with the median (and 0 for SD variables)
  impute_na_with_median() %>%
  rename('Age' = 'age_westhq',
         'Gender' = 'GENDER',
         'IRSAD score' = 'IRSAD_SCORE',
         'Days since registration' = 'memberyears',
         'Days since last bet' = 'days_since_lastplay',
         'No. of diff. game types*' = 'gd_distinct_6s',
         'Total sessions played*' = 'game_description_count6s',
         'Total time on device*' = 'timedevice_sum6s',
         'Mean time on device per session*' = 'timedevice_mean6s',
         'SD of time on device*' = 'timedevice_std6s',
         'Total turnover*' = 'turnover1_sum6s',
         'Mean turnover per session*' = 'turnover1_mean6s',
         'SD of turnover*' = 'turnover1_std6s',
         'Total net outcome*' = 'actualwin_sum6s',
         'Mean net outcome per session*' = 'actualwin_mean6s',
         'SD of session net outcome*' = 'actualwin_std6s',
         'Mean stake*' = 'averagebet_mean6s',
         'SD of stake*' = 'averagebet_std6s',
         'Total games played*' = 'gamesplayed_sum6s',
         'Mean no. games played per session*' = 'gamesplayed_mean6s',
         'SD of games played per session*' = 'gamesplayed_std6s',
         'Max turnover in a single day*' = 'max_turnover6sday',
         'Max no. games played in a single day*' = 'max_gamesplayed6sday',
         'Number of active betting days*' = 'num_6sday',
         'Mean no. games per active day*' = 'intensity_6sday',
         'Mean monthly bet freq (days)*' = 'averagebet_6mos',
         'Mean monthly turnover*' = 'averageturnover_6mos') %>%
  as.data.frame()

outcome <- 'STATUS'

predictors <- c(
  # Demographic variables:
  'Age', 'Gender', 'IRSAD score',
  # Account/wagering variables:
  'Days since registration', 'Days since last bet',
  # All past six month gambling variables (as included in the composite score):
  'No. of diff. game types*', 'Total sessions played*', 'Total time on device*',
  'Mean time on device per session*', 'SD of time on device*', 'Total turnover*',
  'Mean turnover per session*', 'SD of turnover*', 'Total net outcome*',
  'Mean net outcome per session*', 'SD of session net outcome*', 'Mean stake*',
  'SD of stake*', 'Total games played*', 'Mean no. games played per session*',
  'SD of games played per session*', 'Max turnover in a single day*',
  'Max no. games played in a single day*', 'Number of active betting days*',
  'Mean no. games per active day*', 'Mean monthly bet freq (days)*',
  'Mean monthly turnover*')

# Print summary statistics
print(summary(master_dataset_study1_SSVS %>% select(all_of(predictors))))
Age Gender IRSAD score Days since registration
Min. :18.00 Min. :0.0000 Min. : 780.9 Min. : 0.40
1st Qu.:43.00 1st Qu.:0.0000 1st Qu.: 922.4 1st Qu.: 5.00
Median :55.00 Median :0.0000 Median : 970.8 Median : 9.40
Mean :53.99 Mean :0.4779 Mean : 959.0 Mean :10.57
3rd Qu.:65.00 3rd Qu.:1.0000 3rd Qu.: 997.9 3rd Qu.:14.03
Max. :95.00 Max. :1.0000 Max. :1176.9 Max. :59.70
Days since last bet No. of diff. game types* Total sessions played*
Min. : 0.00 Min. : 1.00 Min. : 1.0
1st Qu.: 3.00 1st Qu.: 7.00 1st Qu.: 13.0
Median :10.00 Median : 21.00 Median : 56.0
Mean :14.61 Mean : 35.36 Mean : 182.5
3rd Qu.:22.00 3rd Qu.: 49.00 3rd Qu.: 182.0
Max. :67.00 Max. :283.00 Max. :5500.0
Total time on device* Mean time on device per session* SD of time on device*
Min. : 0.0 Min. : 0.000 Min. : 0.000
1st Qu.: 172.3 1st Qu.: 7.944 1st Qu.: 6.794
Median : 811.2 Median : 12.840 Median : 12.856
Mean : 2555.8 Mean : 18.019 Mean : 18.053
3rd Qu.: 2696.0 3rd Qu.: 21.377 3rd Qu.: 22.914
Max. :49777.8 Max. :261.583 Max. :206.664
Total turnover* Mean turnover per session* SD of turnover*
Min. : 0.4 Min. : 0.20 Min. : 0.0
1st Qu.: 1909.7 1st Qu.: 87.53 1st Qu.: 78.4
Median : 11971.2 Median : 181.78 Median : 196.1
Mean : 54122.7 Mean : 413.79 Mean : 479.6
3rd Qu.: 52419.7 3rd Qu.: 403.25 3rd Qu.: 488.4
Max. :2850427.5 Max. :29571.30 Max. :26734.1
Total net outcome* Mean net outcome per session* SD of session net outcome*
Min. :-17910 Min. :-650.000 Min. : 0.0
1st Qu.: 110 1st Qu.: 5.664 1st Qu.: 48.4
Median : 1021 Median : 19.008 Median : 113.5
Mean : 4121 Mean : 35.180 Mean : 183.6
3rd Qu.: 4502 3rd Qu.: 45.336 3rd Qu.: 229.7
Max. :107738 Max. :4999.980 Max. :5164.6
Mean stake* SD of stake* Total games played*
Min. : 0.01771 Min. : 0.0000 Min. : 0
1st Qu.: 0.78506 1st Qu.: 0.2504 1st Qu.: 1471
Median : 1.42699 Median : 0.6212 Median : 7216
Mean : 3.66419 Mean : 2.2513 Mean : 26677
3rd Qu.: 2.78693 3rd Qu.: 1.4415 3rd Qu.: 26613
Max. :96.99538 Max. :50.1473 Max. :621406
Mean no. games played per session* SD of games played per session*
Min. : 1.00 Min. : 0.00
1st Qu.: 71.95 1st Qu.: 57.39
Median : 118.17 Median : 110.96
Mean : 167.84 Mean : 164.24
3rd Qu.: 200.62 3rd Qu.: 202.78
Max. :3524.00 Max. :1748.12
Max turnover in a single day* Max no. games played in a single day*
Min. : 0.3 Min. : 0
1st Qu.: 872.0 1st Qu.: 670
Median : 2904.7 Median : 1742
Mean : 6356.1 Mean : 2264
3rd Qu.: 7898.9 3rd Qu.: 3265
Max. :165993.5 Max. :12945
Number of active betting days* Mean no. games per active day*
Min. : 1.00 Min. : 0.0
1st Qu.: 3.00 1st Qu.: 357.9
Median : 9.00 Median : 841.0
Mean : 19.17 Mean :1074.2
3rd Qu.: 24.00 3rd Qu.:1526.2
Max. :180.00 Max. :6526.0
Mean monthly bet freq (days)* Mean monthly turnover*
Min. : 0.1667 Min. : 0.1
1st Qu.: 0.5000 1st Qu.: 318.3
Median : 1.5000 Median : 1995.2
Mean : 3.1955 Mean : 9020.4
3rd Qu.: 4.0000 3rd Qu.: 8736.6
Max. :30.0000 Max. :475071.2
Show code
# master_dataset_study1_SSVS %>%
#   dplyr::count(Gender)
# master_dataset_study1_SSVS %>%
#   dplyr::count(STATUS)

# # Perform SSVS analysis
# SSSV_results_Study1 <- ssvs(data = master_dataset_study1_SSVS,
#                             x = predictors, y = outcome,
#                             inprob = 0.95,
#                             continuous = FALSE,
#                             progress = TRUE)
#
# # Save file to avoid having to re-run every time
# saveRDS(SSSV_results_Study1, "SSSV_results_Study1")

# Load in:
SSSV_results_Study1 <- read_rds("SSSV_results_Study1")
Now we have the outcomes from the variable selection process, let’s present them as a table and visually in a plot:
Show code
summary(SSSV_results_Study1,
        interval = 0.95,  # Credible interval
        ordered = TRUE) %>%
  as.data.frame() %>%
  gt() %>%
  tab_header(md("**Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 1**"),
             subtitle = NULL)
Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 1

Variable | MIP | Avg Beta | Avg Nonzero Beta | Lower CI (95%) | Upper CI (95%)
Days since last bet | 0.9924 | 0.3421 | 0.3447 | 0.1818 | 0.4982
No. of diff. game types* | 0.9296 | 0.2820 | 0.3033 | 0.0000 | 0.4275
Max turnover in a single day* | 0.6883 | 0.1669 | 0.2425 | 0.0000 | 0.3688
Days since registration | 0.5135 | 0.1249 | 0.2432 | 0.0000 | 0.3615
Age | 0.4119 | -0.1062 | -0.2577 | -0.3740 | 0.0000
Number of active betting days* | 0.2653 | 0.1062 | 0.4002 | -57.5381 | 66.9408
Mean monthly bet freq (days)* | 0.2631 | -0.0747 | -0.2840 | -66.8354 | 57.5835
Mean monthly turnover* | 0.2529 | -0.1619 | -0.6402 | -62.8708 | 60.2791
Total turnover* | 0.2482 | 0.1627 | 0.6557 | -59.7995 | 63.1582
Max no. games played in a single day* | 0.1089 | 0.0254 | 0.2330 | 0.0000 | 0.2560
Mean turnover per session* | 0.0894 | -0.0241 | -0.2691 | -0.2310 | 0.0014
SD of session net outcome* | 0.0861 | 0.0140 | 0.1624 | 0.0000 | 0.1588
Mean stake* | 0.0739 | -0.0148 | -0.1996 | -0.1399 | 0.0000
Total time on device* | 0.0732 | -0.0216 | -0.2953 | -0.1889 | 0.0004
Total games played* | 0.0676 | -0.0144 | -0.2131 | -0.1512 | 0.0002
Gender | 0.0662 | -0.0091 | -0.1368 | -0.0891 | 0.0000
SD of stake* | 0.0516 | -0.0067 | -0.1303 | 0.0000 | 0.0095
SD of turnover* | 0.0430 | 0.0051 | 0.1182 | 0.0000 | 0.0000
Mean no. games played per session* | 0.0305 | -0.0017 | -0.0574 | 0.0000 | 0.0000
Mean no. games per active day* | 0.0287 | -0.0027 | -0.0934 | 0.0000 | 0.0000
Total net outcome* | 0.0285 | 0.0021 | 0.0736 | 0.0000 | 0.0000
Total sessions played* | 0.0261 | -0.0023 | -0.0870 | 0.0000 | 0.0000
SD of time on device* | 0.0248 | 0.0005 | 0.0182 | 0.0000 | 0.0000
IRSAD score | 0.0243 | 0.0016 | 0.0667 | 0.0000 | 0.0000
Mean time on device per session* | 0.0235 | -0.0023 | -0.1001 | 0.0000 | 0.0000
SD of games played per session* | 0.0211 | 0.0015 | 0.0701 | 0.0000 | 0.0000
Mean net outcome per session* | 0.0130 | 0.0001 | 0.0094 | 0.0000 | 0.0000
Show code
plot(SSSV_results_Study1, threshold = 0.4, title = "") +
  theme_classic() +
  scale_y_continuous(breaks = c(0, 0.25, 0.50, 0.75, 1),
                     labels = c(0, 0.25, 0.50, 0.75, 1)) +
  theme(text = element_text(family = "Poppins"),
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.margin = margin(t = 10, r = 10, b = 10, l = 20)
        # axis.text.y = element_text(angle = 0, hjust = .5)
        ) +
  labs(title = "", x = "", y = "Inclusion Probability") +
  annotate("text", x = 18, y = 0.45, label = "MIP threshold",
           size = 3, color = "black", fontface = "plain") +
  theme(legend.position = "none")
Show code
ggsave("Figures/SSVS results plot Study 1.pdf",width =8.5,height =4,units =c("in"))
Run model
Now let’s run the logistic regression model with our selection of predictor variables identified via SSVS:
Show code
Study1_logistic_reg_data <- master_dataset_study1 %>%
  mutate(STATUS = case_when(STATUS == 'Non-participants' ~ 0,
                            STATUS == 'Participants' ~ 1)) %>%
  filter(!is.na(memberyears))

# Run model
Study1_logistic_regression <- glm(
  STATUS ~ days_since_lastplay + gd_distinct_6s + max_turnover6sday + memberyears + age_westhq,
  data = Study1_logistic_reg_data,
  family = "binomial")

summary(Study1_logistic_regression)
Call:
glm(formula = STATUS ~ days_since_lastplay + gd_distinct_6s +
max_turnover6sday + memberyears + age_westhq, family = "binomial",
data = Study1_logistic_reg_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.395e+00 3.239e-01 -10.481 < 2e-16 ***
days_since_lastplay 2.532e-02 4.973e-03 5.091 3.57e-07 ***
gd_distinct_6s 7.284e-03 1.715e-03 4.248 2.16e-05 ***
max_turnover6sday 2.192e-05 5.775e-06 3.796 0.000147 ***
memberyears 3.451e-02 1.012e-02 3.412 0.000646 ***
age_westhq -1.773e-02 5.889e-03 -3.011 0.002605 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1484.9 on 4116 degrees of freedom
Residual deviance: 1426.7 on 4111 degrees of freedom
AIC: 1438.7
Number of Fisher Scoring iterations: 6
Check variable inflation factors for evidence of multicollinearity:
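The check itself is not shown; with the car package (loaded in the set-up for exactly this purpose) it is a single call:

# Variance inflation factors for the fitted logistic regression.
car::vif(Study1_logistic_regression)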
# NOTE: R2 is calculated as a proportion here, so in the paper we have multiplied it by 100 to form a percentage value
The model accounts for 4.6431328% of variance in uptake/response.
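This value is presumably Nagelkerke's R², computed with fmsb (listed in the set-up for this purpose); a sketch of the conversion to a percentage:

# NagelkerkeR2() returns a list with the sample size (N) and the R2 proportion.
r2 <- fmsb::NagelkerkeR2(Study1_logistic_regression)$R2
r2 * 100  # Percentage value as reported in the paper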
Calculate overall p-value for model by comparing it with a null model:
Show code
anova(Study1_logistic_regression,
      update(Study1_logistic_regression, ~ 1),  # update here produces the null model for comparison
      test = "Chisq")
Analysis of Deviance Table
Model 1: STATUS ~ days_since_lastplay + gd_distinct_6s + max_turnover6sday +
memberyears + age_westhq
Model 2: STATUS ~ 1
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 4111 1426.7
2 4116 1485.0 -5 -58.294 2.736e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show code
# Set seed so that labels are consistent in the plot:
set.seed(200)

ggstatsplot::ggcoefstats(Study1_logistic_regression,
                         digits = 4,
                         stats.label.color = "#00798c",
                         sort = "ascending",
                         vline.args = list(size = 1, color = "black"),
                         # ggstatsplot.layer = FALSE,
                         exclude.intercept = TRUE,  # Hide the intercept
                         stats.label.args = list(size = 3, direction = "y")) +
  scale_x_continuous(name = "Regression coefficient",
                     limits = c(-0.05, 0.05)
                     # breaks = c(-.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5)
                     ) +
  scale_y_discrete(name = "",
                   labels = c("Age",
                              "Max turnover\n in a single\n day",
                              "No. of diff.\n game types",
                              "Days since\n last bet",
                              "Days since\n registration")) +
  theme(panel.grid.minor = element_blank(),
        axis.line = element_blank()) +
  theme(axis.text = element_text(color = "black", size = 9, face = "plain")) +
  theme_light(base_family = "Poppins") +
  # theme(plot.margin = unit(c(.1,.1,.3,-3), "cm")) +
  theme(axis.text.y = element_text(color = "black", size = 12, face = "plain", family = "Poppins")) +
  theme(axis.title.x = element_text(color = "black", size = 12, face = "plain", vjust = 0, family = "Poppins"))
Show code
# Set height to 7 and width to 5.5 in portrait modeggsave("Figures/Study 1 log regression plot.pdf",width =5,height =7,units =c("in"))
Study 2 Analysis
This study pertains to the analysis of a survey of online gamblers from two sites in Australia.
Load data
Load in master dataset containing all customer details (this step is hidden to protect privacy & anonymity)
Data loaded.
How many people were invited in total:
Show code
nrow(master_dataset_study2_initial)
[1] 24829
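The headline uptake rate can be computed directly from this total and the completion counts tabled below; a quick sketch using those figures:

# Sketch: response rate up to the PGSI in the unfiltered dataset
round(1959 / 24829 * 100, 2) # = 7.89%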
Process the data for analysis by doing the following:
Remove anybody who didn't bet in the six months prior to survey participation, as this was our criterion for inclusion in the sample
Provide a single variable that tells us whether somebody is a participant or non-participant, depending on whether they completed up to the PGSI or not
Recode and rename variables where required for later analysis and visualisation.
Show code
master_dataset_study2 <- master_dataset_study2_initial %>%as_tibble() %>%filter(PAS_6M_BET_FREQ_DAYS !=0) %>%# Remove anybody who didn't bet in the window of interestmutate(STATUS =case_when(SURVEY_STATUS_PGSI =="Incomplete"| SURVEY_STATUS_PGSI =="Not started"~"Non-participants", SURVEY_STATUS_PGSI =="Complete"~"Participants")) %>%mutate(STATUS =fct_relevel(STATUS,"Participants","Non-participants")) %>%# Make numeric:mutate(GHM_OVER_PRIORITISE =as.numeric(GHM_OVER_PRIORITISE),GHM_STRAIN =as.numeric(GHM_STRAIN),GHM_SEVERE =as.numeric(GHM_SEVERE),IRSAD_SCORE =as.numeric(IRSAD_SCORE) ) %>%# Enforce the order of factors so that they present properly in tables and graphs:mutate(PGSI_CATEGORY =factor(PGSI_CATEGORY, levels =c('No risk','Low risk','Moderate risk','High risk'))) # Rename risk detection system summary variable values:# mutate(RE_PAS_6M_ANY_TRIGGER = case_when(RE_PAS_6M_ANY_TRIGGER == TRUE ~ "Flagged",# RE_PAS_6M_ANY_TRIGGER == FALSE ~ "Not flagged"))
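As a quick sanity check on the recoding above, a minimal sketch (the same counts appear in the uptake tables below):

# Sketch: confirm the participant/non-participant split after processing
master_dataset_study2 %>% count(STATUS)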
Uptake figures
Let’s first look at how survey uptake varied over the three-week window the survey was open:
Show code
master_dataset_study2_initial %>%as_tibble() %>%filter(SURVEY_STATUS_START !="Not started") %>%group_by(START_DATE) %>%summarise(Freq =sum(n())) %>%ungroup() %>%ggplot(aes(x = START_DATE, y = Freq)) + plot_theme +geom_bar(stat ="identity", fill ="#00798c") +theme(axis.text.x =element_text(angle =45, hjust =1)) +labs(title ="",x ="Date",y ="Number of clients") +scale_y_continuous(breaks =seq(0, 1500, by =500), limits =c(0,1500))
Now compute some figures to tell us the rates of uptake for the survey. We will first do this for the original, unfiltered dataset and then for the refined, filtered dataset, in which people who didn't bet in the past six months have been removed.
Original dataset:
Show code
master_dataset_study2_initial %>%as_tibble() %>%count(SURVEY_STATUS_START) %>%mutate(Percent = n/sum(n)*100) %>%gt() %>%cols_label(SURVEY_STATUS_START ="Survey status") %>%tab_header("Counts for starting the survey: Unfiltered dataset", subtitle =NULL)
Counts for starting the survey: Unfiltered dataset

Survey status | n | Percent
Not started | 22684 | 91.360909
Started | 2145 | 8.639091
Show code
master_dataset_study2_initial %>%as_tibble() %>%count(SURVEY_STATUS_PGSI) %>%mutate(Percent = n/sum(n)*100) %>%gt() %>%cols_label(SURVEY_STATUS_PGSI ="Survey status") %>%tab_header("Counts for completing the survey up to the PGSI: Unfiltered dataset", subtitle =NULL)
Counts for completing the survey up to the PGSI: Unfiltered dataset

Survey status | n | Percent
Complete | 1959 | 7.889967
Incomplete | 186 | 0.749124
Not started | 22684 | 91.360909
Show code
master_dataset_study2_initial %>%as_tibble() %>%count(SURVEY_STATUS_FULL) %>%mutate(Percent = n/sum(n)*100) %>%gt() %>%cols_label(SURVEY_STATUS_FULL ="Survey status") %>%tab_header("Counts for completing the survey in full: Unfiltered dataset", subtitle =NULL)
Counts for completing the survey in full: Unfiltered dataset

Survey status | n | Percent
Complete | 1470 | 5.920496
Incomplete | 675 | 2.718595
Not started | 22684 | 91.360909
And now for the filtered dataset we will use for most of the analyses:
Show code
master_dataset_study2 %>%as_tibble() %>%count(SURVEY_STATUS_START) %>%mutate(Percent = n/sum(n)*100) %>%gt() %>%cols_label(SURVEY_STATUS_START ="Survey status") %>%tab_header("Counts for starting the survey: Filtered dataset", subtitle =NULL)
Counts for starting the survey: Filtered dataset

Survey status | n | Percent
Not started | 22651 | 91.360465
Started | 2142 | 8.639535
Show code
master_dataset_study2 %>%as_tibble() %>%count(SURVEY_STATUS_PGSI) %>%mutate(Percent = n/sum(n)*100) %>%gt() %>%cols_label(SURVEY_STATUS_PGSI ="Survey status") %>%tab_header("Counts for completing the survey up to the PGSI: Filtered dataset", subtitle =NULL)
Counts for completing the survey up to the PGSI: Filtered dataset

Survey status | n | Percent
Complete | 1956 | 7.8893236
Incomplete | 186 | 0.7502118
Not started | 22651 | 91.3604646
Show code
master_dataset_study2 %>%as_tibble() %>%count(SURVEY_STATUS_FULL) %>%mutate(Percent = n/sum(n)*100) %>%gt() %>%cols_label(SURVEY_STATUS_FULL ="Survey status") %>%tab_header("Counts for completing the survey in full: Filtered dataset", subtitle =NULL)
Counts for completing the survey in full: Filtered dataset

Survey status | n | Percent
Complete | 1467 | 5.916993
Incomplete | 675 | 2.722543
Not started | 22651 | 91.360465
We also need to figure out how many people were removed by requiring everyone to have bet in the past 6 months, and how this relates to survey status:
Show code
master_dataset_study2_initial %>%as_tibble() %>%filter(PAS_6M_BET_FREQ_DAYS ==0) %>%# keep anybody who didn't bet in the window of interestcount(SURVEY_STATUS_PGSI) %>%gt() %>%cols_label(SURVEY_STATUS_PGSI ="Survey status") %>%tab_header("Counts for people removed from dataset because they hadn't gambled in the past 6-months", subtitle =NULL)
Counts for people removed from dataset because they hadn't gambled in the past 6-months

Survey status | n
Complete | 3
Not started | 33
Survey flow
Show code
# Define relevant survey questions (excluding non-survey & optional ones)survey_questions <-c("IPAddress", "GAM_ACCOUNTS", "GAM_ACTIVITIES", "GAM_HARM", "GAMB_SATISFACTION","PGSI_Q1", "PGSI_Q2", "PGSI_Q3", "PGSI_Q4", "PGSI_Q5", "PGSI_Q6", "PGSI_Q7", "PGSI_Q8", "PGSI_Q9","GHM_F1", "GHM_P1", "ATTENTION_CHECK_1", "GHM_P2", "GHM_R1", "GHM_R2", "GHM_PH1", "GHM_PH2","GHM_WS1", "GHM_WS2", "GHM_L1", "LIFE_SATISFACTION", "ACT1_TRACK","ACT2_1", "ACT2_2", "ACT2_3", "ACT2_4", "ACT2_5", "ACT2_6", "ACT2_7","EST_PAST_SPEND", "EST_PAST_WIN", "EST_PAST_NET_RESULT", "EST_PAST_DEPOSIT", "EST_PAST_WITHDRAW","EST_NEXT_SPEND", "EST_NEXT_WIN", "EST_NEXT_NET_RESULT", "EST_NEXT_ACCEPTED_RESULT","EST_NEXT_DEPOSIT", "EST_NEXT_WITHDRAW","SC_1", "SC_2", "SC_3", "SC_4", "SC_5", "SC_6", "SC_7", "SC_8", "SC_9", "SC_10", "SC_11","ATTENTION_CHECK_2", "SC_12", "SC_13","EDUCATION", "EMPLOYMENT", "MARRIED", "NO_CHILDREN", "HOUSEHOLD_INCOME","INCOME_VARIABILITY", "INCOME_PREDICTABILITY","CBA_FW_1", "CBA_FW_2", "CBA_FW_3", "CBA_FW_4", "CBA_FW_5")# Filter dataset to exclude non-starters (only include those who started the survey)survey_data_filtered <- master_dataset_study2 %>%filter(!is.na(IPAddress)) %>%select(all_of(survey_questions))# Count non-NA responses for each questionresponse_counts <- survey_data_filtered %>%summarise(across(everything(), ~sum(!is.na(.)), .names ="count_{.col}")) %>%pivot_longer(cols =everything(), names_to ="Question", values_to ="Respondents") %>%mutate(Question =gsub("count_", "", Question)) %>%arrange(Respondents) # Arrange in ascending order to maintain left-to-right order# Define survey sections with their starting positions for text labels:survey_sections <-data.frame(Section =c("Opened survey", "Gambling involvement", "Gambling satisfaction", "PGSI","Gambling Harm Measure", "Life satisfaction", "Activity statements","Estimated expenditure", "Self-Control", "Demographics"),Start =c("IPAddress", "GAM_ACTIVITIES", "GAMB_SATISFACTION", "PGSI_Q5","GHM_R2", "LIFE_SATISFACTION", "ACT2_4","EST_NEXT_SPEND", "SC_7", "INCOME_VARIABILITY"))# Define section start and end positions for lines:section_boundaries <-data.frame(Position =c( "GAM_ACCOUNTS", "PGSI_Q1","GHM_F1", "LIFE_SATISFACTION", "ACT1_TRACK","EST_PAST_SPEND", "SC_1"),Type ="Start")section_endings <-data.frame(Position =c("GAMB_SATISFACTION","SC_13","CBA_FW_5"),Type ="End")# Combine for plottingsection_lines <-bind_rows(section_boundaries, section_endings)# Convert positions to numeric for plottingsection_lines$Position <-as.numeric(factor(section_lines$Position, levels = survey_questions))section_lines <- section_lines %>%mutate(Position =case_when(Position ==71~71.5,TRUE~ Position))survey_sections_dot_colours <-data.frame(Section =c(rep("Start of survey", 1),rep("Gambling involvement", 3),rep("Gambling satisfaction", 1),rep("PGSI", 9),rep("Gambling Harm Measure", 11),rep("Life satisfaction", 1),rep("Activity statements", 8),rep("Estimated expenditure", 11),rep("Self-Control", 14),rep("Demographics", 12) # No "End of survey" ),Question =c(# Start of survey"IPAddress",# Gambling involvement"GAM_ACTIVITIES", "GAM_HARM", "GAM_ACCOUNTS",# Gambling satisfaction"GAMB_SATISFACTION",# PGSI"PGSI_Q1", "PGSI_Q2", "PGSI_Q3", "PGSI_Q4", "PGSI_Q5", "PGSI_Q6", "PGSI_Q7", "PGSI_Q8", "PGSI_Q9",# Gambling Harm Measure"GHM_F1", "GHM_P1", "ATTENTION_CHECK_1", "GHM_P2", "GHM_R1", "GHM_R2", "GHM_PH1", "GHM_PH2", "GHM_WS1", "GHM_WS2", "GHM_L1",# Life satisfaction"LIFE_SATISFACTION",# Activity statements"ACT1_TRACK", "ACT2_1", "ACT2_2", "ACT2_3", "ACT2_4", "ACT2_5", 
"ACT2_6","ACT2_7",# Estimated expenditure"EST_PAST_SPEND", "EST_PAST_WIN", "EST_PAST_NET_RESULT", "EST_PAST_DEPOSIT", "EST_PAST_WITHDRAW", "EST_NEXT_SPEND", "EST_NEXT_WIN", "EST_NEXT_NET_RESULT","EST_NEXT_ACCEPTED_RESULT", "EST_NEXT_DEPOSIT", "EST_NEXT_WITHDRAW",# Self-Control"SC_1", "SC_2", "SC_3", "SC_4", "SC_5", "SC_6", "SC_7", "SC_8", "SC_9", "SC_10", "SC_11", "ATTENTION_CHECK_2", "SC_12", "SC_13",# Demographics"EDUCATION", "EMPLOYMENT", "MARRIED", "NO_CHILDREN", "HOUSEHOLD_INCOME","INCOME_VARIABILITY", "INCOME_PREDICTABILITY", "CBA_FW_1", "CBA_FW_2", "CBA_FW_3","CBA_FW_4", "CBA_FW_5" ))# Ensure response_counts data is structuredresponse_counts <- response_counts %>%mutate(Question =factor(Question, levels = survey_questions)) %>%# Preserve orderfull_join(survey_sections_dot_colours)# Order questions by survey flow (ensuring left-to-right reading)response_counts$Question <-factor(response_counts$Question, levels = survey_questions)# Ensure no zero values before applying log transformationresponse_counts <- response_counts %>%mutate(Respondents =ifelse(Respondents ==0, 1, Respondents)) # Replaces 0 with 1# Create a lookup table for section labelssection_labels <-setNames(survey_sections$Section, survey_sections$Start)# Ensure Section is an ordered factor based on survey flowresponse_counts$Section <-factor(response_counts$Section, levels =unique(survey_sections$Section))# # Define custom gradient: Blue-Green to Blue# num_sections <- length(unique(response_counts$Section))# section_colors <- rev(colorRampPalette(c(brewer.pal(9, "GnBu")[6], brewer.pal(9, "Blues")[9]))(num_sections))# Generate a perceptually uniform color scale using viridisnum_sections <-length(unique(response_counts$Section))section_colors <-viridis(num_sections, option ="mako") # Blue-Green to Blue# Ensure factor levels matchresponse_counts <- response_counts %>%mutate(Section =factor(Section, levels = survey_sections$Section))# Plot with ordered and reversed colorsmain_plot<-ggplot(response_counts, aes(x = Question, y = Respondents, group =1, color = Section)) +geom_line(color ="grey", size =0.9) +# Line plot# Dots colored by ordered survey section (now fading from dark to light)geom_point(size =1.1) +# Apply ordered and reversed color scale to sections# scale_color_manual(values = setNames(section_colors, levels(response_counts$Section))) +scale_color_manual(values =setNames(section_colors, levels(response_counts$Section))) +# Add dashed vertical lines for section boundariesgeom_vline(data = section_lines, aes(xintercept = Position -0.5),color ="grey50", linetype ="dashed", size =0.5) +# Replace x-axis labels with section labels at survey start pointsscale_x_discrete(labels =function(x) ifelse(x %in%names(section_labels), section_labels[x], "")) +# Add curved annotation linegeom_curve(aes(x =17.5, y =2420, xend =14.5, yend =2420), curvature =0, arrow =arrow(length =unit(0.03, "npc"), type ="closed"),color ="black", size =0.3) +# Add text labelsannotate("text", x =19.3, y =2550, label ="Pre-defined\ncompletion\npoint", size =2, fontface ="plain", family ="Poppins") +annotate("text", x =70, y =2350, label ="End of survey", angle =90, size =2, fontface ="plain", family ="Poppins") +labs(y ="Number of Respondents\n(log-10 scale)" ) +# Themes: plot_theme +theme(axis.text.x =element_text(angle =50, hjust =1, vjust =1.03),axis.title.y =element_text(vjust =3),axis.ticks.x =element_blank(),axis.title.x =element_blank(),axis.text =element_text(size =10),axis.title =element_text(size =14),plot.margin =margin(2, 5, 1, 
8),legend.position ="none") main_plot +labs(title ="Survey 2 flow with natural y-axis scale",y="")
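The y-axis label in the code above references a log-10 scale, while the variant rendered here uses the natural scale; a minimal sketch of how the logged version could be produced from the same plot object:

# Sketch: same survey-flow plot with a log-10 y-axis
main_plot + scale_y_log10()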
What are the general demographic and gambling characteristics of those who did and did not take part in the survey?
Before we can do this, we need to create weighted variable summaries to account for the oversampling of the at-risk population:
Show code
# Create survey design:survey_design_all <- survey::svydesign(id =~1,data = master_dataset_study2,weights =~ALL_SAMPLE_WEIGHTS)# Create survey design objects for each STATUS group:survey_design_non_participants <- survey::svydesign(id =~1,data =subset(master_dataset_study2, STATUS =="Non-participants"),weights =~ALL_SAMPLE_WEIGHTS)survey_design_participants <- survey::svydesign(id =~1,data =subset(master_dataset_study2, STATUS =="Participants"),weights =~ALL_SAMPLE_WEIGHTS)# Get sample sizes for each groupparticipants_n <-sum(master_dataset_study2$STATUS =="Participants")non_participants_n <-sum(master_dataset_study2$STATUS =="Non-participants")overall_n <-nrow(master_dataset_study2)# [Yes, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]# To streamline this substantially, let's develop a function to calculate weighted summary for continuous variables and put it all together nicely for tabling later on:# Function to calculate weighted summary for continuous variablesweighted_summary_continuous <-function(variable, design_all, design_non_participants, design_participants, participants_n, non_participants_n, overall_n) { var_formula <-as.formula(paste0("~", variable))# Calculate means and standard deviations mean_all <-svymean(var_formula, design_all, na.rm =TRUE) mean_non_ps <-svymean(var_formula, design_non_participants, na.rm =TRUE) mean_ps <-svymean(var_formula, design_participants, na.rm =TRUE) sd_all <-sqrt(svyvar(var_formula, design_all, na.rm =TRUE)) sd_non_ps <-sqrt(svyvar(var_formula, design_non_participants, na.rm =TRUE)) sd_ps <-sqrt(svyvar(var_formula, design_participants, na.rm =TRUE))# Calculate weighted medians median_all <-svyquantile(var_formula, design_all, 0.5, na.rm =TRUE) median_non_ps <-svyquantile(var_formula, design_non_participants, 0.5, na.rm =TRUE) median_ps <-svyquantile(var_formula, design_participants, 0.5, na.rm =TRUE)# Format for table summary_df <-tibble(Variable = variable,`Overall`=paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),`Participants`=paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),`Non-participants`=paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2)) ) %>%# Compute effect sizemutate(pooled_sd =sqrt(((participants_n -1) * sd_ps^2+ (non_participants_n -1) * sd_non_ps^2) / (participants_n + non_participants_n -2)),cohen_d = (coef(mean_ps) -coef(mean_non_ps)) / pooled_sd,SE =sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) + (cohen_d^2/ (2* (participants_n + non_participants_n)))),CI_lower = cohen_d -1.96* SE,CI_upper = cohen_d +1.96* SE,Effect_size =paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]") ) %>%select(Variable, Overall, Participants, `Non-participants`, Effect_size)return(summary_df)}# Apply function to continuous variablescontinuous_vars <-c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", "PAS_6M_BET_FREQ_DAYS", "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME", "PAS_6M_PERCENT_UNSOCIABLE_BETS", "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY", "PAS_6M_MEAN_DEPOSIT_AMOUNT", "PAS_6M_PERCENT_DECLINED_DEPOSITS", "PAS_6M_PERCENT_CANCEL_WITHDRAWS")all_continuous_summaries <-map_dfr(continuous_vars, ~weighted_summary_continuous(.x, survey_design_all, survey_design_non_participants, survey_design_participants, 
participants_n, non_participants_n, overall_n))# Categorical variables:# Let's also develop a function to calculate and format percentages for categorical variables that we can join to the above outcomes for tabling. # Note: this function is computes Cramer's effect size, which is typically used for categorical variables with more than two levels, But it also works for variables with only two levels (in which case it's equivalent to the phi coefficient).# UPDATE: Let's just compute odds ratios for the two level variables based on a reviewer suggestion (thanks ChatGPT for your help with this one!):categorical_summary <-function(variable, design_all, design_non_participants, design_participants) { var_formula <-as.formula(paste0("~", variable))# Calculate proportions for overall sample prop_table_all <-svytable(var_formula, design_all) props_all <-prop.table(prop_table_all) %>%as.data.frame() %>%mutate(Overall = Freq *100) %>%select(Variable =!!sym(variable), Overall)# Calculate proportions for non-participants prop_table_non_ps <-svytable(var_formula, design_non_participants) props_non_ps <-prop.table(prop_table_non_ps) %>%as.data.frame() %>%mutate('Non-participants'= Freq *100) %>%select(Variable =!!sym(variable), 'Non-participants')# Calculate proportions for participants prop_table_ps <-svytable(var_formula, design_participants) props_ps <-prop.table(prop_table_ps) %>%as.data.frame() %>%mutate('Participants'= Freq *100) %>%select(Variable =!!sym(variable), 'Participants')# Combine results and format summary_df <- props_ps %>%left_join(props_non_ps, by ="Variable") %>%left_join(props_all, by ="Variable") %>%mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%replace(is.na(.), "")# Determine if variable is binary (i.e., only two levels) unique_values <-length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))if (unique_values ==2& variable !="GENDER") {# Ensure STATUS is treated as a factor with "Non-participants" as reference design_all$variables$STATUS <-relevel(as.factor(design_all$variables$STATUS), ref ="Non-participants")# Compute Odds Ratio (OR) using survey-weighted logistic regression logistic_model <-tryCatch(svyglm(as.formula(paste0(variable, " ~ STATUS")), design = design_all, family =quasibinomial()),error =function(e) NULL )if (!is.null(logistic_model)) { coef_name <-grep("STATUS", names(coef(logistic_model)), value =TRUE)if (length(coef_name) >0) { OR <-exp(coef(logistic_model)[coef_name]) CI <-exp(confint(logistic_model)[coef_name, ]) effect_size_text <-paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]") } else { effect_size_text <-"NA" } } else { effect_size_text <-"NA" } } else {# Compute Cramér’s V for categorical variables with more than two levels combined_table <-as.table(rbind(prop_table_non_ps, prop_table_ps)) V <-cramersV(combined_table) V_rounded <-as.matrix(V$output) %>%t() %>%as.data.frame() %>%mutate(cramersV =as.numeric(cramersV)) %>%mutate(cramersV =round(cramersV, 2)) CI <-confIntV(combined_table) CI_info <- CI$intermediate CIs <- CI_info$fisherZ.ci %>%t() %>%as.data.frame() %>%mutate_if(is.numeric, round, 2) %>%unite('Effect_size', 1, 2, sep =", ") effect_size_text <-paste0(V_rounded, " [", CIs, "]") }# Add effect size to the table header_row <-data.frame(Variable = variable, Overall ="", Participants ="", `Non-participants`="", `Effect_size`= effect_size_text) summary_df <-bind_rows(header_row, summary_df) %>%as_tibble() %>%select(Variable, Overall, Participants, `Non-participants`, 
`Effect_size`)return(summary_df)}# Apply function to categorical variablescategorical_vars <-c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")all_categorical_summaries <-map_dfr(categorical_vars, ~categorical_summary(.x, survey_design_all, survey_design_non_participants, survey_design_participants))all_categorical_summaries <- all_categorical_summaries %>%# Recode for consistency across summaries:mutate(Variable =case_when(Variable==1~"Yes", Variable==0~"No", Variable==TRUE~"Yes", Variable==FALSE~"No",TRUE~ Variable))# Combine continuous and categorical summariesweighted_comparison_summary_data <-bind_rows(all_continuous_summaries, all_categorical_summaries) %>%rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols ="Participants") %>%rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols ="Non-participants") %>%rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols ="Overall")# Order data for table (As we have duplicate names in the variable column, this is a pain in the but that needs to be done by separating and rejoining the variables in the right order)gender_summary_tabled<- weighted_comparison_summary_data[15:18,1:5]age_summary_tabled<- weighted_comparison_summary_data[1,1:5]summary_table_extracted_vars_missing<- weighted_comparison_summary_data %>%filter(Variable !="AGE", Variable !="GENDER", Variable !="Male", Variable !="Female", Variable !="Unknown" )# Join everything back together and provide some nice names for our variables:weighted_comparison_summary_data_ordered <-bind_rows(age_summary_tabled, gender_summary_tabled, summary_table_extracted_vars_missing) %>%mutate(Variable = dplyr::recode(Variable,`AGE`="Age","GENDER"="Gender","IRSAD_SCORE"="IRSAD score","DAYS_REGISTERED"="Days since registered","DAYS_SINCE_LAST_BET"="No. days since last bet","PAS_6M_BET_FREQ_DAYS"="No. betting days","PAS_6M_BET_INTENSITY"="Mean bets per active day","PAS_6M_BET_MEAN_STAKE"="Mean stake","PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME"="Mean net outcome per month","PAS_6M_PERCENT_UNSOCIABLE_BETS"="Percentage of bets midnight - 6AM","PAS_6M_TOTAL_DEPOSIT_FREQ"="No. deposits","PAS_6M_DEPOSIT_INTENSITY"="Mean no. deposits per active day","PAS_6M_MEAN_DEPOSIT_AMOUNT"="Mean deposit amount","PAS_6M_PERCENT_DECLINED_DEPOSITS"="Percentage of deposits declined","PAS_6M_PERCENT_CANCEL_WITHDRAWS"="Percentage of withdrawals cancelled","EVER_USED_ANY_TOOL"="Ever used a CPT","ACTIVE_DEPOSIT_LIMIT"="Active deposit limit","RE_PAS_6M_ANY_TRIGGER"="Flagged as 'at-risk' ≥ 1 times" ))# Create final table:weighted_comparison_summary_data_ordered %>%rename("Effect size"= Effect_size ) %>%gt() %>%tab_header(title ="Table 3. Study 2 (past 6-month bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics") %>%tab_options(data_row.padding =px(1.5) ) %>%cols_align(align ="right",columns =c(Variable) ) %>%cols_align(align ="center",columns =c(2:4) ) %>%tab_footnote(footnote ="Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",locations =cells_column_labels(columns =c(2,3)) )%>%tab_footnote(footnote ="Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramer's V [95% CIs])",locations =cells_column_labels(columns =vars(`Effect size`)) )
Table 3. Study 2 (past 6-month bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics

Variable | Overall (N = 24793)¹ | Participants (N = 1956)¹ | Non-participants (N = 22837) | Effect size²
Age | 38.07 (14.73), 35 | 43.57 (15.51), 43 | 37.66 (14.59), 34 | 0.4 [0.36, 0.45]
Gender | | | NA | 0.03 [0.03, 0.04]
  Female | 17.23% | 13.2% | 17.53% | NA
  Male | 81.96% | 85.32% | 81.71% | NA
  Unknown | 0.81% | 1.47% | 0.76% | NA
IRSAD score | 1004.88 (75.63), 999.58 | 1005.21 (75.52), 1000.57 | 1004.86 (75.64), 999.58 | 0 [-0.04, 0.05]
Days since registered | 1131.27 (904.71), 905 | 1337.27 (967.47), 1168 | 1115.93 (898), 892 | 0.24 [0.2, 0.29]
No. days since last bet | 49.13 (46.06), 43 | 27.47 (39.71), 5 | 50.74 (46.09), 47 | -0.51 [-0.56, -0.46]
No. betting days | 22.45 (31.97), 9 | 39.59 (41.78), 25 | 21.17 (30.73), 8 | 0.58 [0.53, 0.63]
Mean bets per active day | 7.36 (12.41), 4 | 10.34 (17.22), 5.59 | 7.14 (11.95), 4 | 0.26 [0.21, 0.3]
Mean stake | 31.48 (163.93), 10 | 26.35 (107.14), 9.12 | 31.86 (167.38), 10 | -0.03 [-0.08, 0.01]
Mean net outcome per month | -131.9 (1864.76), -12.5 | -242.26 (1191.8), -25 | -123.68 (1905.15), -12.11 | -0.06 [-0.11, -0.02]
Percentage of bets midnight - 6AM | 4.03 (11.54), 0 | 4.19 (10.3), 0 | 4.02 (11.63), 0 | 0.02 [-0.03, 0.06]
No. deposits | 31.6 (111.44), 4 | 64.86 (175.22), 9 | 29.12 (104.74), 4 | 0.32 [0.27, 0.37]
Mean no. deposits per active day | 1.03 (1.28), 0.81 | 1.2 (1.69), 0.8 | 1.02 (1.24), 0.82 | 0.14 [0.09, 0.19]
Mean deposit amount | 77.52 (321.94), 31.82 | 72.83 (202.78), 37.5 | 77.87 (329.1), 31.21 | -0.02 [-0.06, 0.03]
Percentage of deposits declined | 6.59 (15.05), 0 | 7.08 (15.25), 0 | 6.55 (15.03), 0 | 0.03 [-0.01, 0.08]
Percentage of withdrawals cancelled | 1.81 (9.25), 0 | 3.06 (11.82), 0 | 1.72 (9.02), 0 | 0.15 [0.1, 0.19]
Ever used a CPT | | | NA | 1.47 [1.28, 1.69]
  No | 86.93% | 82.33% | 87.27% | NA
  Yes | 13.07% | 17.67% | 12.73% | NA
Active deposit limit | | | NA | 1.23 [1.03, 1.47]
  No | 91.58% | 89.96% | 91.7% | NA
  Yes | 8.42% | 10.04% | 8.3% | NA
Flagged as 'at-risk' ≥ 1 times | | | NA | 1.91 [1.73, 2.12]
  No | 97.51% | 95.61% | 97.65% | NA
  Yes | 2.49% | 4.39% | 2.35% | NA
¹ Continuous variables: mean (standard deviation), median; categorical variables: % of group
² Continuous variables: Cohen's d [95% confidence intervals (CIs)]; categorical variables with two levels: odds ratios [95% CIs]; categorical variables with three levels (i.e., gender): Cramer's V [95% CIs]
PGSI
Also check the PGSI scores of the survey sample for inclusion in the table later on. For this, we will remove anyone who didn't pass the attention checks so as to get the most reliable estimate:
Show code
survey_design_participants_filtered <- survey::svydesign(id =~1,data =subset(master_dataset_study2, STATUS =="Participants"& (ATTENTION_CHECK_1 =="PASSED"|is.na(ATTENTION_CHECK_1)) & (ATTENTION_CHECK_2 =="PASSED"|is.na(ATTENTION_CHECK_2))),weights =~ELIGIBLE_SURVEY_SAMPLE_WEIGHTS)# Weighted values:mean_ps_PGSI <-svymean(~PGSI_TOTAL, survey_design_participants_filtered, na.rm =TRUE)sd_ps_PGSI <-sqrt(svyvar(~PGSI_TOTAL, survey_design_participants_filtered, na.rm =TRUE))median_ps_PGSI <-svyquantile(~PGSI_TOTAL, survey_design_participants_filtered, c(0.5), na.rm =TRUE)# Extract desired valuesmean_ps_PGSI_value <-coef(mean_ps_PGSI)sd_ps_PGSI_value <- sd_ps_PGSI[1]median_ps_PGSI_value <-coef(median_ps_PGSI) # Create sentence summarising results: summary_sentence_PGSI_study2 <-sprintf("The mean PGSI total score is %.2f (standard deviation = %.2f) and the median is %.2f.", mean_ps_PGSI_value, sd_ps_PGSI_value, median_ps_PGSI_value)summary_sentence_PGSI_study2
[1] "The mean PGSI total score is 3.64 (standard deviation = 4.12) and the median is 2.00."
Non-Weighted comparisons
For transparency, we have also computed non-weighted comparisons between groups using the same tbl_summary() approach as used for Study 1 above:
Show code
master_dataset_study2 %>%tbl_summary(include =c(# Demographic details AGE, GENDER, IRSAD_SCORE, DAYS_REGISTERED, # Wagering DAYS_SINCE_LAST_BET, PAS_6M_BET_FREQ_DAYS, PAS_6M_BET_INTENSITY, PAS_6M_BET_MEAN_STAKE, PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME, PAS_6M_PERCENT_UNSOCIABLE_BETS,# Transactions PAS_6M_TOTAL_DEPOSIT_FREQ, PAS_6M_DEPOSIT_INTENSITY, PAS_6M_MEAN_DEPOSIT_AMOUNT, PAS_6M_PERCENT_DECLINED_DEPOSITS, PAS_6M_PERCENT_CANCEL_WITHDRAWS,# Use of consumer protection tools EVER_USED_ANY_TOOL, ACTIVE_DEPOSIT_LIMIT, RE_PAS_6M_ANY_TRIGGER),type =all_continuous() ~"continuous2",statistic =list(all_continuous2() ~c("{median} [{p25} - {p75}]","{mean} ({sd})")),by = STATUS,label =c( AGE ~"Age", GENDER ~"Gender", IRSAD_SCORE ~"IRSAD score", DAYS_REGISTERED ~"No. of days since accounts registered", DAYS_SINCE_LAST_BET ~"No. days since last bet", PAS_6M_BET_FREQ_DAYS ~"No. betting days", PAS_6M_BET_INTENSITY ~"Mean bets per active day", PAS_6M_BET_MEAN_STAKE ~"Mean stake", PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~"Mean net outcome per month", PAS_6M_PERCENT_UNSOCIABLE_BETS ~"Percentage of bets placed between midnight & 6AM", PAS_6M_TOTAL_DEPOSIT_FREQ ~"No. deposits", PAS_6M_DEPOSIT_INTENSITY ~"Mean no. deposits per active betting day", PAS_6M_MEAN_DEPOSIT_AMOUNT ~"Mean deposit amount", PAS_6M_PERCENT_DECLINED_DEPOSITS ~"Percentage of deposits declined", PAS_6M_PERCENT_CANCEL_WITHDRAWS ~"Percentage of withdrawals cancelled", EVER_USED_ANY_TOOL ~"Ever used a SG", ACTIVE_DEPOSIT_LIMIT ~"Active deposit limit", RE_PAS_6M_ANY_TRIGGER ~"Flagged as 'at-risk' ≥ 1 times") ) %>%add_difference(list(all_continuous() ~"cohens_d", # For continuous variablesall_categorical() ~"smd") # For categorical variables, you can use standardized mean difference or any other suitable method ) %>%modify_header(label ="**Variable**") %>%modify_caption("**Table 3. Comparison of participants and non-participants: Demographic and gambling characteristics**") %>%bold_labels() %>%as_gt() %>%tab_options(data_row.padding =px(1.5))
Table 3. Comparison of participants and non-participants: Demographic and gambling characteristics

Variable | Participants, N = 1,956¹ | Non-participants, N = 22,837¹ | Difference² | 95% CI²,³
Age | | | 0.33 | 0.28, 0.37
  Median [IQR] | 40 [30 - 54] | 34 [26 - 47]
  Mean (SD) | 42 (15) | 38 (14)
Gender | | | 0.14 | 0.10, 0.19
  Female | 221 (11%) | 3,570 (16%)
  Male | 1,701 (87%) | 19,054 (83%)
  Unknown | 34 (1.7%) | 213 (0.9%)
IRSAD score | | | -0.01 | -0.05, 0.04
  Median [IQR] | 1,001 [954 - 1,060] | 1,001 [954 - 1,061]
  Mean (SD) | 1,006 (75) | 1,006 (76)
  Unknown | 1 | 191
No. of days since accounts registered | | | 0.24 | 0.20, 0.29
  Median [IQR] | 1,188 [494 - 2,151] | 927 [377 - 1,822]
  Mean (SD) | 1,380 (992) | 1,155 (915)
No. days since last bet | | | -0.52 | -0.57, -0.47
  Median [IQR] | 3 [2 - 24] | 31 [4 - 66]
  Mean (SD) | 22 (36) | 45 (46)
No. betting days | | | 0.63 | 0.58, 0.68
  Median [IQR] | 38 [12 - 88] | 12 [3 - 40]
  Mean (SD) | 54 (49) | 29 (38)
Mean bets per active day | | | 0.30 | 0.25, 0.34
  Median [IQR] | 7 [3 - 16] | 4 [2 - 10]
  Mean (SD) | 14 (23) | 9 (17)
Mean stake | | | -0.04 | -0.08, 0.01
  Median [IQR] | 15 [6 - 45] | 13 [6 - 38]
  Mean (SD) | 68 (257) | 83 (419)
Mean net outcome per month | | | -0.07 | -0.11, -0.02
  Median [IQR] | -65 [-710 - -4] | -17 [-134 - -1]
  Mean (SD) | -870 (2,846) | -529 (5,139)
Percentage of bets placed between midnight & 6AM | | | 0.04 | -0.01, 0.08
  Median [IQR] | 1 [0 - 5] | 0 [0 - 3]
  Mean (SD) | 5 (10) | 5 (12)
No. deposits | | | 0.43 | 0.39, 0.48
  Median [IQR] | 26 [4 - 159] | 6 [2 - 35]
  Mean (SD) | 162 (354) | 66 (206)
Mean no. deposits per active betting day | | | 0.32 | 0.27, 0.36
  Median [IQR] | 1.08 [0.39 - 2.49] | 1.00 [0.33 - 1.60]
  Mean (SD) | 1.99 (2.81) | 1.36 (1.90)
Mean deposit amount | | | -0.03 | -0.08, 0.02
  Median [IQR] | 50 [22 - 119] | 42 [15 - 101]
  Mean (SD) | 148 (481) | 173 (859)
Percentage of deposits declined | | | 0.04 | -0.01, 0.09
  Median [IQR] | 1 [0 - 8] | 0 [0 - 7]
  Mean (SD) | 7 (14) | 7 (15)
Percentage of withdrawals cancelled | | | 0.21 | 0.16, 0.25
  Median [IQR] | 0 [0 - 0] | 0 [0 - 0]
  Mean (SD) | 7 (18) | 4 (14)
Ever used a SG | 439 (22%) | 3,576 (16%) | 0.17 | 0.13, 0.22
Active deposit limit | 201 (10%) | 2,059 (9.0%) | 0.04 | 0.00, 0.09
Flagged as 'at-risk' ≥ 1 times | 588 (30%) | 4,202 (18%) | 0.27 | 0.23, 0.32

¹ n (%)
² Cohen's D; Standardized Mean Difference
³ CI = Confidence Interval
Sensitivity check comparison
How does the group comparison differ if we change the criterion for survey completion? Reviewers didn’t ask for this change specifically, but some suggestions led us to conduct these additional sensitivity checks.
For this, we will redefine the participant group as anyone who fully completed the survey, and non-participants as anyone who didn't start or didn't complete it. These will be weighted comparisons.
Show code
# Check coding of variable that tells us whether somebody fully completed the survey or notmaster_dataset_study2 %>%count(STATUS, SURVEY_STATUS_FULL)
# A tibble: 4 × 3
STATUS SURVEY_STATUS_FULL n
<fct> <chr> <int>
1 Participants Complete 1467
2 Participants Incomplete 489
3 Non-participants Incomplete 186
4 Non-participants Not started 22651
Show code
master_dataset_study2_sensitivity_check <- master_dataset_study2 %>%mutate(SURVEY_STATUS_FULL_BINARY =case_when(SURVEY_STATUS_FULL =="Incomplete"~"Non-participants", SURVEY_STATUS_FULL =="Not started"~"Non-participants",TRUE~"Participants")) %>%mutate(SURVEY_STATUS_FULL_BINARY =fct_relevel(SURVEY_STATUS_FULL_BINARY,"Participants","Non-participants")) # fix order of presentation# Check recoding has worked:master_dataset_study2_sensitivity_check %>%count(STATUS, SURVEY_STATUS_FULL_BINARY)
# A tibble: 3 × 3
STATUS SURVEY_STATUS_FULL_BINARY n
<fct> <fct> <int>
1 Participants Participants 1467
2 Participants Non-participants 489
3 Non-participants Non-participants 22837
Show code
# Update survey designs:survey_design_all <- survey::svydesign(id =~1,data = master_dataset_study2_sensitivity_check,weights =~ALL_SAMPLE_WEIGHTS)# Create survey design objects for each STATUS group:survey_design_non_participants <- survey::svydesign(id =~1,data =subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY =="Non-participants"),weights =~ALL_SAMPLE_WEIGHTS)survey_design_participants <- survey::svydesign(id =~1,data =subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY =="Participants"),weights =~ALL_SAMPLE_WEIGHTS)# Get sample sizes for each groupparticipants_n <-sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY =="Participants")non_participants_n <-sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY =="Non-participants")overall_n <-nrow(master_dataset_study2_sensitivity_check)# Produce updated table# [Again, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]# To streamline this substantially, let's develop a function to calculate weighted summary for continuous variables and put it all together nicely for tabling later on:# Function to calculate weighted summary for continuous variablesweighted_summary_continuous <-function(variable, design_all, design_non_participants, design_participants, participants_n, non_participants_n, overall_n) { var_formula <-as.formula(paste0("~", variable))# Calculate means and standard deviations mean_all <-svymean(var_formula, design_all, na.rm =TRUE) mean_non_ps <-svymean(var_formula, design_non_participants, na.rm =TRUE) mean_ps <-svymean(var_formula, design_participants, na.rm =TRUE) sd_all <-sqrt(svyvar(var_formula, design_all, na.rm =TRUE)) sd_non_ps <-sqrt(svyvar(var_formula, design_non_participants, na.rm =TRUE)) sd_ps <-sqrt(svyvar(var_formula, design_participants, na.rm =TRUE))# Calculate weighted medians median_all <-svyquantile(var_formula, design_all, 0.5, na.rm =TRUE) median_non_ps <-svyquantile(var_formula, design_non_participants, 0.5, na.rm =TRUE) median_ps <-svyquantile(var_formula, design_participants, 0.5, na.rm =TRUE)# Format for table summary_df <-tibble(Variable = variable,`Overall`=paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),`Participants`=paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),`Non-participants`=paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2)) ) %>%# Compute effect sizemutate(pooled_sd =sqrt(((participants_n -1) * sd_ps^2+ (non_participants_n -1) * sd_non_ps^2) / (participants_n + non_participants_n -2)),cohen_d = (coef(mean_ps) -coef(mean_non_ps)) / pooled_sd,SE =sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) + (cohen_d^2/ (2* (participants_n + non_participants_n)))),CI_lower = cohen_d -1.96* SE,CI_upper = cohen_d +1.96* SE,Effect_size =paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]") ) %>%select(Variable, Overall, Participants, `Non-participants`, Effect_size)return(summary_df)}# Apply function to continuous variablescontinuous_vars <-c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", "PAS_6M_BET_FREQ_DAYS", "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME", "PAS_6M_PERCENT_UNSOCIABLE_BETS", "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY", "PAS_6M_MEAN_DEPOSIT_AMOUNT", "PAS_6M_PERCENT_DECLINED_DEPOSITS", 
"PAS_6M_PERCENT_CANCEL_WITHDRAWS")all_continuous_summaries <-map_dfr(continuous_vars, ~weighted_summary_continuous(.x, survey_design_all, survey_design_non_participants, survey_design_participants, participants_n, non_participants_n, overall_n))# Categorical variables:# Let's also develop a function to calculate and format percentages for categorical variables that we can join to the above outcomes for tabling. # Note: this function is computes Cramer's effect size, which is typically used for categorical variables with more than two levels, But it also works for variables with only two levels (in which case it's equivalent to the phi coefficient).# UPDATE: Let's just compute odds ratios for the two level variables based on a reviewer suggestion (thanks ChatGPT for your help with this one!):categorical_summary <-function(variable, design_all, design_non_participants, design_participants) { var_formula <-as.formula(paste0("~", variable))# Calculate proportions for overall sample prop_table_all <-svytable(var_formula, design_all) props_all <-prop.table(prop_table_all) %>%as.data.frame() %>%mutate(Overall = Freq *100) %>%select(Variable =!!sym(variable), Overall)# Calculate proportions for non-participants prop_table_non_ps <-svytable(var_formula, design_non_participants) props_non_ps <-prop.table(prop_table_non_ps) %>%as.data.frame() %>%mutate('Non-participants'= Freq *100) %>%select(Variable =!!sym(variable), 'Non-participants')# Calculate proportions for participants prop_table_ps <-svytable(var_formula, design_participants) props_ps <-prop.table(prop_table_ps) %>%as.data.frame() %>%mutate('Participants'= Freq *100) %>%select(Variable =!!sym(variable), 'Participants')# Combine results and format summary_df <- props_ps %>%left_join(props_non_ps, by ="Variable") %>%left_join(props_all, by ="Variable") %>%mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%replace(is.na(.), "")# Determine if variable is binary (i.e., only two levels) unique_values <-length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))if (unique_values ==2& variable !="GENDER") {# Ensure STATUS is treated as a factor with "Non-participants" as reference design_all$variables$STATUS <-relevel(as.factor(design_all$variables$STATUS), ref ="Non-participants")# Compute Odds Ratio (OR) using survey-weighted logistic regression logistic_model <-tryCatch(svyglm(as.formula(paste0(variable, " ~ STATUS")), design = design_all, family =quasibinomial()),error =function(e) NULL )if (!is.null(logistic_model)) { coef_name <-grep("STATUS", names(coef(logistic_model)), value =TRUE)if (length(coef_name) >0) { OR <-exp(coef(logistic_model)[coef_name]) CI <-exp(confint(logistic_model)[coef_name, ]) effect_size_text <-paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]") } else { effect_size_text <-"NA" } } else { effect_size_text <-"NA" } } else {# Compute Cramér’s V for categorical variables with more than two levels combined_table <-as.table(rbind(prop_table_non_ps, prop_table_ps)) V <-cramersV(combined_table) V_rounded <-as.matrix(V$output) %>%t() %>%as.data.frame() %>%mutate(cramersV =as.numeric(cramersV)) %>%mutate(cramersV =round(cramersV, 2)) CI <-confIntV(combined_table) CI_info <- CI$intermediate CIs <- CI_info$fisherZ.ci %>%t() %>%as.data.frame() %>%mutate_if(is.numeric, round, 2) %>%unite('Effect_size', 1, 2, sep =", ") effect_size_text <-paste0(V_rounded, " [", CIs, "]") }# Add effect size to the table header_row <-data.frame(Variable = variable, Overall ="", Participants ="", 
`Non-participants`="", `Effect_size`= effect_size_text) summary_df <-bind_rows(header_row, summary_df) %>%as_tibble() %>%select(Variable, Overall, Participants, `Non-participants`, `Effect_size`)return(summary_df)}# Apply function to categorical variablescategorical_vars <-c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")all_categorical_summaries <-map_dfr(categorical_vars, ~categorical_summary(.x, survey_design_all, survey_design_non_participants, survey_design_participants))all_categorical_summaries <- all_categorical_summaries %>%# Recode for consistency across summaries:mutate(Variable =case_when(Variable==1~"Yes", Variable==0~"No", Variable==TRUE~"Yes", Variable==FALSE~"No",TRUE~ Variable))# Combine continuous and categorical summariesweighted_comparison_summary_data <-bind_rows(all_continuous_summaries, all_categorical_summaries) %>%rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols ="Participants") %>%rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols ="Non-participants") %>%rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols ="Overall")# Order data for table (As we have duplicate names in the variable column, this is a pain in the but that needs to be done by separating and rejoining the variables in the right order)gender_summary_tabled<- weighted_comparison_summary_data[15:18,1:5]age_summary_tabled<- weighted_comparison_summary_data[1,1:5]summary_table_extracted_vars_missing<- weighted_comparison_summary_data %>%filter(Variable !="AGE", Variable !="GENDER", Variable !="Male", Variable !="Female", Variable !="Unknown" )# Join everything back together and provide some nice names for our variables:weighted_comparison_summary_data_ordered <-bind_rows(age_summary_tabled, gender_summary_tabled, summary_table_extracted_vars_missing) %>%mutate(Variable = dplyr::recode(Variable,`AGE`="Age","GENDER"="Gender","IRSAD_SCORE"="IRSAD score","DAYS_REGISTERED"="Days since registered","DAYS_SINCE_LAST_BET"="No. days since last bet","PAS_6M_BET_FREQ_DAYS"="No. betting days","PAS_6M_BET_INTENSITY"="Mean bets per active day","PAS_6M_BET_MEAN_STAKE"="Mean stake","PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME"="Mean net outcome per month","PAS_6M_PERCENT_UNSOCIABLE_BETS"="Percentage of bets midnight - 6AM","PAS_6M_TOTAL_DEPOSIT_FREQ"="No. deposits","PAS_6M_DEPOSIT_INTENSITY"="Mean no. 
deposits per active day","PAS_6M_MEAN_DEPOSIT_AMOUNT"="Mean deposit amount","PAS_6M_PERCENT_DECLINED_DEPOSITS"="Percentage of deposits declined","PAS_6M_PERCENT_CANCEL_WITHDRAWS"="Percentage of withdrawals cancelled","EVER_USED_ANY_TOOL"="Ever used a CPT","ACTIVE_DEPOSIT_LIMIT"="Active deposit limit","RE_PAS_6M_ANY_TRIGGER"="Flagged as 'at-risk' ≥ 1 times" ))# Create final table:weighted_comparison_summary_data_ordered %>%rename("Effect size"= Effect_size ) %>%gt() %>%tab_header(title ="Table 3.1 Study 2 (past 6-month bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]") %>%tab_options(data_row.padding =px(1.5) ) %>%cols_align(align ="right",columns =c(Variable) ) %>%cols_align(align ="center",columns =c(2:4) ) %>%tab_footnote(footnote ="Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",locations =cells_column_labels(columns =c(2,3)) )%>%tab_footnote(footnote ="Continuous variables: Cohen's D [(95% Confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with 3 levels (i.e., gender): Cramer's V [95% CIs])",locations =cells_column_labels(columns =vars(`Effect size`)) )
Table 3.1 Study 2 (past 6-month bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]

Variable | Overall (N = 24793)¹ | Participants (N = 1467)¹ | Non-participants (N = 23326) | Effect size²
Age | 38.07 (14.73), 35 | 42.82 (14.78), 41 | 37.8 (14.69), 34 | 0.34 [0.29, 0.39]
Gender | | | NA | 0.03 [0.03, 0.04]
  Female | 17.23% | 12.55% | 17.5% | NA
  Male | 81.96% | 85.99% | 81.73% | NA
  Unknown | 0.81% | 1.46% | 0.77% | NA
IRSAD score | 1004.88 (75.63), 999.58 | 1007.88 (75.44), 1003.14 | 1004.71 (75.64), 998.46 | 0.04 [-0.01, 0.09]
Days since registered | 1131.27 (904.71), 905 | 1362.54 (967.72), 1179 | 1118.05 (899.21), 894 | 0.27 [0.22, 0.32]
No. days since last bet | 49.13 (46.06), 43 | 27.75 (40.04), 5 | 50.35 (46.08), 45 | -0.49 [-0.55, -0.44]
No. betting days | 22.45 (31.97), 9 | 38.78 (41.11), 24 | 21.51 (31.11), 8 | 0.54 [0.49, 0.6]
Mean bets per active day | 7.36 (12.41), 4 | 9.73 (16.37), 5.38 | 7.22 (12.14), 4 | 0.2 [0.15, 0.25]
Mean stake | 31.48 (163.93), 10 | 25.43 (89.58), 8.89 | 31.83 (167.18), 10 | -0.04 [-0.09, 0.01]
Mean net outcome per month | -131.9 (1864.76), -12.5 | -204.51 (1078.02), -22.71 | -127.75 (1899.84), -12.33 | -0.04 [-0.09, 0.01]
Percentage of bets midnight - 6AM | 4.03 (11.54), 0 | 4.36 (10.84), 0 | 4.01 (11.58), 0 | 0.03 [-0.02, 0.08]
No. deposits | 31.6 (111.44), 4 | 58.58 (155.16), 9 | 30.06 (108.21), 4 | 0.26 [0.2, 0.31]
Mean no. deposits per active day | 1.03 (1.28), 0.81 | 1.13 (1.64), 0.75 | 1.02 (1.26), 0.82 | 0.08 [0.03, 0.13]
Mean deposit amount | 77.52 (321.94), 31.82 | 72.69 (208.96), 37.5 | 77.8 (327.23), 31.25 | -0.02 [-0.07, 0.04]
Percentage of deposits declined | 6.59 (15.05), 0 | 6.62 (14.46), 0 | 6.59 (15.08), 0 | 0 [-0.05, 0.06]
Percentage of withdrawals cancelled | 1.81 (9.25), 0 | 2.99 (12.02), 0 | 1.74 (9.06), 0 | 0.14 [0.08, 0.19]
Ever used a CPT | | | NA | 1.47 [1.28, 1.69]
  No | 86.93% | 82.97% | 87.15% | NA
  Yes | 13.07% | 17.03% | 12.85% | NA
Active deposit limit | | | NA | 1.23 [1.03, 1.47]
  No | 91.58% | 90.39% | 91.65% | NA
  Yes | 8.42% | 9.61% | 8.35% | NA
Flagged as 'at-risk' ≥ 1 times | | | NA | 1.91 [1.73, 2.12]
  No | 97.51% | 96.25% | 97.58% | NA
  Yes | 2.49% | 3.75% | 2.42% | NA
¹ Continuous variables: mean (standard deviation), median; categorical variables: % of group
² Continuous variables: Cohen's d [95% confidence intervals (CIs)]; categorical variables with two levels: odds ratios [95% CIs]; categorical variables with three levels (i.e., gender): Cramer's V [95% CIs]
Note that this table says that there are 1467 people who completed the survey in full. This differs from the 1470 people listed as completing the survey in Table 1 of our manuscript because three people were removed from this total due to ineligibility (they hadn't bet in the six months prior to the survey; see the counts above).
Show code
master_dataset_study2 %>%tbl_summary(include =c(# Demographic details AGE, GENDER, IRSAD_SCORE, DAYS_REGISTERED, # Wagering DAYS_SINCE_LAST_BET, PAS_6M_BET_FREQ_DAYS, PAS_6M_BET_INTENSITY, PAS_6M_BET_MEAN_STAKE, PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME, PAS_6M_PERCENT_UNSOCIABLE_BETS,# Transactions PAS_6M_TOTAL_DEPOSIT_FREQ, PAS_6M_DEPOSIT_INTENSITY, PAS_6M_MEAN_DEPOSIT_AMOUNT, PAS_6M_PERCENT_DECLINED_DEPOSITS, PAS_6M_PERCENT_CANCEL_WITHDRAWS,# Use of consumer protection tools EVER_USED_ANY_TOOL, ACTIVE_DEPOSIT_LIMIT, RE_PAS_6M_ANY_TRIGGER),type =all_continuous() ~"continuous2",statistic =list(all_continuous2() ~c("{median} [{p25} - {p75}]","{mean} ({sd})")),by = SURVEY_STATUS_FULL,label =c( AGE ~"Age", GENDER ~"Gender", IRSAD_SCORE ~"IRSAD score", DAYS_REGISTERED ~"No. of days since accounts registered", DAYS_SINCE_LAST_BET ~"No. days since last bet", PAS_6M_BET_FREQ_DAYS ~"No. betting days", PAS_6M_BET_INTENSITY ~"Mean bets per active day", PAS_6M_BET_MEAN_STAKE ~"Mean stake", PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~"Mean net outcome per month", PAS_6M_PERCENT_UNSOCIABLE_BETS ~"Percentage of bets placed between midnight & 6AM", PAS_6M_TOTAL_DEPOSIT_FREQ ~"No. deposits", PAS_6M_DEPOSIT_INTENSITY ~"Mean no. deposits per active betting day", PAS_6M_MEAN_DEPOSIT_AMOUNT ~"Mean deposit amount", PAS_6M_PERCENT_DECLINED_DEPOSITS ~"Percentage of deposits declined", PAS_6M_PERCENT_CANCEL_WITHDRAWS ~"Percentage of withdrawals cancelled", EVER_USED_ANY_TOOL ~"Ever used a SG", ACTIVE_DEPOSIT_LIMIT ~"Active deposit limit", RE_PAS_6M_ANY_TRIGGER ~"Flagged as 'at-risk' ≥ 1 times") ) %>%modify_header(label ="**Variable**") %>%modify_caption("Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey") %>%bold_labels()
Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn’t complete it, and those who didn’t start the survey
Variable | Complete, N = 1,467¹ | Incomplete, N = 675¹ | Not started, N = 22,651¹
Age
  Median [IQR] | 39 [30 - 52] | 42 [28 - 56] | 34 [26 - 47]
  Mean (SD) | 42 (14) | 43 (16) | 38 (14)
Gender
  Female | 153 (10%) | 85 (13%) | 3,553 (16%)
  Male | 1,290 (88%) | 577 (85%) | 18,888 (83%)
  Unknown | 24 (1.6%) | 13 (1.9%) | 210 (0.9%)
IRSAD score
  Median [IQR] | 1,004 [955 - 1,062] | 990 [949 - 1,056] | 1,001 [954 - 1,061]
  Mean (SD) | 1,008 (75) | 1,001 (74) | 1,006 (76)
  Unknown | 1 | 1 | 190
No. of days since accounts registered
  Median [IQR] | 1,213 [510 - 2,179] | 1,100 [447 - 2,043] | 927 [375 - 1,822]
  Mean (SD) | 1,396 (986) | 1,319 (1,005) | 1,154 (915)
No. days since last bet
  Median [IQR] | 4 [2 - 27] | 3 [1 - 15] | 32 [4 - 66]
  Mean (SD) | 23 (37) | 17 (31) | 46 (46)
No. betting days
  Median [IQR] | 36 [11 - 84] | 50 [18 - 101] | 12 [3 - 39]
  Mean (SD) | 51 (47) | 63 (51) | 29 (38)
Mean bets per active day
  Median [IQR] | 7 [3 - 15] | 9 [4 - 20] | 4 [2 - 10]
  Mean (SD) | 13 (21) | 16 (25) | 9 (17)
Mean stake
  Median [IQR] | 13 [5 - 42] | 20 [7 - 60] | 13 [6 - 38]
  Mean (SD) | 61 (206) | 90 (331) | 83 (420)
Mean net outcome per month
  Median [IQR] | -52 [-551 - -2] | -151 [-1,199 - -10] | -17 [-131 - -1]
  Mean (SD) | -740 (2,656) | -1,238 (3,122) | -524 (5,155)
Percentage of bets placed between midnight & 6AM
  Median [IQR] | 1 [0 - 5] | 1 [0 - 5] | 0 [0 - 3]
  Mean (SD) | 5 (11) | 5 (9) | 5 (12)
No. deposits
  Median [IQR] | 20 [4 - 136] | 54 [8 - 272] | 6 [1 - 34]
  Mean (SD) | 142 (316) | 225 (499) | 64 (199)
Mean no. deposits per active betting day
  Median [IQR] | 1.00 [0.33 - 2.19] | 1.50 [0.67 - 3.22] | 1.00 [0.33 - 1.60]
  Mean (SD) | 1.85 (2.70) | 2.46 (3.48) | 1.35 (1.86)
Mean deposit amount
  Median [IQR] | 50 [22 - 114] | 65 [26 - 150] | 41 [15 - 100]
  Mean (SD) | 143 (503) | 166 (380) | 173 (862)
Percentage of deposits declined
  Median [IQR] | 0 [0 - 8] | 3 [0 - 10] | 0 [0 - 7]
  Mean (SD) | 7 (14) | 9 (15) | 7 (15)
Percentage of withdrawals cancelled
  Median [IQR] | 0 [0 - 0] | 0 [0 - 8] | 0 [0 - 0]
  Mean (SD) | 7 (17) | 10 (21) | 4 (14)
Ever used a SG | 315 (21%) | 174 (26%) | 3,526 (16%)
Active deposit limit | 147 (10%) | 76 (11%) | 2,037 (9.0%)
Flagged as 'at-risk' ≥ 1 times | 392 (27%) | 277 (41%) | 4,121 (18%)

¹ n (%)
Again, this appears to be because the group who started but didn't complete the survey are more involved/intense bettors than any other group.
Composite risk scores
Now let's produce a single composite score that represents people's past-six-month gambling behaviour.
One thing we need to consider here is that some of the variables we've created could be biased by recent sign-up to the site. For example, one of the variables we created to summarise past-six-month behaviour is the total number of days somebody gambled, while another is the total amount somebody deposited. Both of these totals will be deflated for anybody who registered with the company less than six months before the survey, simply because they had less time in the window. By contrast, variables that are averages or daily/monthly summaries will not be affected by this, as illustrated in the sketch below.
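To make the distinction concrete, here is a hypothetical sketch (the tibble and all column names are illustrative, not variables from our dataset) showing how a six-month total penalises a recent sign-up while a per-month rate does not:

# Hypothetical sketch: two customers betting at the same underlying rate,
# one of whom registered only 30 days before the end of the window
library(tibble)
library(dplyr)
example <- tibble(
  customer = c("Long-standing", "Recent sign-up"),
  days_registered = c(900, 30),
  total_bet_days = c(60, 10) # six-month totals, capped by exposure
)
example %>%
  mutate(
    exposure_days = pmin(days_registered, 182), # at most ~6 months
    bet_days_per_month = total_bet_days / (exposure_days / 30.4)
  )
# Both customers bet on ~10 days per month, but the raw totals (60 vs 10)
# make the recent sign-up look far less involved.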
Let's first identify all the variables we engineered that summarise past-six-month behaviour, then classify them based on whether they will be subject to this bias or not:
Show code
PAS_6M_variables<- master_dataset_study2 %>%select(contains("PAS_6M"), # Identify all relevant past six months aggregate variables-contains("RE_"), # Remove risk identification system variables-contains("INCOME") # Remove variables relating to percentage of income which are only available for survey participants ) %>%colnames() %>%as.data.frame() %>%rename(Variables =1) %>%mutate(`Biased by sign-up date`=case_when( Variables =="PAS_6M_BET_FREQ_DAYS"~"Yes", Variables =="PAS_6M_BET_FREQ_BETS"~"Yes", Variables =="PAS_6M_BET_MEAN_STAKE"~"No", Variables =="PAS_6M_BET_SD_STAKE"~"No", Variables =="PAS_6M_BET_SPEND"~"Yes", Variables =="PAS_6M_MEAN_ODDS"~"Yes", Variables =="PAS_6M_N_ACTIVITIES"~"No", Variables =="PAS_6M_BET_INTENSITY"~"No", Variables =="PAS_6M_BET_WIN"~"Yes", Variables =="PAS_6M_NET_OUTCOME"~"Yes", Variables =="PAS_6M_MAX_DAY_STAKE"~"No", Variables =="PAS_6M_MAX_DAY_BETS"~"No", Variables =="PAS_6M_MEAN_DAILY_STAKE"~"No", Variables =="PAS_6M_SD_DAILY_STAKE"~"No", Variables =="PAS_6M_TOTAL_UNSOCIABLE_BETS"~"Yes", Variables =="PAS_6M_PERCENT_UNSOCIABLE_BETS"~"No", Variables =="PAS_6M_MONTHLY_BET_FREQ_DAYS"~"No", Variables =="PAS_6M_MONTHLY_SPEND"~"No", Variables =="PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME"~"No", Variables =="PAS_6M_TOTAL_DEPOSIT_AMOUNT"~"Yes", Variables =="PAS_6M_MEAN_DEPOSIT_AMOUNT"~"No", Variables =="PAS_6M_SD_DEPOSIT_AMOUNT"~"No", Variables =="PAS_6M_TOTAL_DEPOSIT_FREQ"~"Yes", Variables =="PAS_6M_TOTAL_DECLINED_DEPOSITS"~"Yes", Variables =="PAS_6M_PERCENT_DECLINED_DEPOSITS"~"No", Variables =="PAS_6M_TOTAL_DEPOSIT_METHODS"~"No", Variables =="PAS_6M_TOTAL_WITHDRAW_FREQ"~"Yes", Variables =="PAS_6M_TOTAL_WITHDRAW_AMOUNT"~"Yes", Variables =="PAS_6M_TOTAL_CANCEL_WITHDRAWS"~"Yes", Variables =="PAS_6M_PERCENT_CANCEL_WITHDRAWS"~"No", Variables =="PAS_6M_TOTAL_WITHDRAW_METHODS"~"No", Variables =="PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT"~"No", Variables =="PAS_6M_MONTHLY_DEPOSIT_AMOUNT"~"No", Variables =="PAS_6M_DEPOSIT_INTENSITY"~"No",TRUE~ Variables)) # Table variables:PAS_6M_variables %>%gt()%>%tab_header("All past-6-month aggregate summary variables", subtitle ="With risk of bias status for each variable")
All past-6-month aggregate summary variables
With risk of bias status for each variable
Variables | Biased by sign-up date
PAS_6M_BET_FREQ_DAYS | Yes
PAS_6M_BET_FREQ_BETS | Yes
PAS_6M_BET_FREQ_INCL_LEGS | PAS_6M_BET_FREQ_INCL_LEGS
PAS_6M_BET_MEAN_STAKE | No
PAS_6M_BET_SD_STAKE | No
PAS_6M_BET_SPEND | Yes
PAS_6M_MEAN_ODDS | Yes
PAS_6M_N_ACTIVITIES | No
PAS_6M_BET_INTENSITY | No
PAS_6M_BET_WIN | Yes
PAS_6M_NET_OUTCOME | Yes
PAS_6M_MAX_DAY_STAKE | No
PAS_6M_MAX_DAY_BETS | No
PAS_6M_MEAN_DAILY_STAKE | No
PAS_6M_SD_DAILY_STAKE | No
PAS_6M_TOTAL_UNSOCIABLE_BETS | Yes
PAS_6M_PERCENT_UNSOCIABLE_BETS | No
PAS_6M_MONTHLY_BET_FREQ_DAYS | No
PAS_6M_MONTHLY_SPEND | No
PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME | No
PAS_6M_TOTAL_DEPOSIT_AMOUNT | Yes
PAS_6M_MEAN_DEPOSIT_AMOUNT | No
PAS_6M_SD_DEPOSIT_AMOUNT | No
PAS_6M_TOTAL_DEPOSIT_FREQ | Yes
PAS_6M_TOTAL_DECLINED_DEPOSITS | Yes
PAS_6M_PERCENT_DECLINED_DEPOSITS | No
PAS_6M_TOTAL_DEPOSIT_METHODS | No
PAS_6M_TOTAL_WITHDRAW_FREQ | Yes
PAS_6M_TOTAL_WITHDRAW_AMOUNT | Yes
PAS_6M_TOTAL_CANCEL_WITHDRAWS | Yes
PAS_6M_PERCENT_CANCEL_WITHDRAWS | No
PAS_6M_TOTAL_WITHDRAW_METHODS | No
PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT | No
PAS_6M_MONTHLY_DEPOSIT_AMOUNT | No
PAS_6M_MONTHLY_DEPOSIT_FREQ | PAS_6M_MONTHLY_DEPOSIT_FREQ
PAS_6M_DEPOSIT_INTENSITY | No
Okay, now that we have all of the relevant variables, let's look at how they correlate with PGSI and harm scores using a series of bivariate correlations visualised as a heatmap. (Note that two variables, PAS_6M_BET_FREQ_INCL_LEGS and PAS_6M_MONTHLY_DEPOSIT_FREQ, fall through to the default branch of the case_when() above and so show their own names rather than a Yes/No classification in the table.) These results are unweighted:
Show code
PGSI_cor_data<- master_dataset_study2 %>%select(PGSI_TOTAL, GHM_TOTAL, PAS_6M_BET_MEAN_STAKE, PAS_6M_BET_SD_STAKE, PAS_6M_N_ACTIVITIES, PAS_6M_BET_INTENSITY, PAS_6M_MAX_DAY_STAKE, PAS_6M_MAX_DAY_BETS, PAS_6M_MEAN_DAILY_STAKE, PAS_6M_SD_DAILY_STAKE, PAS_6M_PERCENT_UNSOCIABLE_BETS, PAS_6M_MONTHLY_BET_FREQ_DAYS, PAS_6M_MONTHLY_SPEND, PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME, PAS_6M_MEAN_DEPOSIT_AMOUNT, PAS_6M_SD_DEPOSIT_AMOUNT, PAS_6M_PERCENT_DECLINED_DEPOSITS, PAS_6M_TOTAL_DEPOSIT_METHODS, PAS_6M_PERCENT_CANCEL_WITHDRAWS, PAS_6M_TOTAL_WITHDRAW_METHODS, PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT, PAS_6M_MONTHLY_DEPOSIT_AMOUNT, PAS_6M_DEPOSIT_INTENSITY)ggstatsplot::ggcorrmat(PGSI_cor_data, type ="parametric",results.subtitle =FALSE,ggcorrplot.args =c(lab_size =3)) +theme(axis.text.y =element_text(color ="black", size =9, face ="plain", family ="Poppins")) +theme(axis.text.x =element_text(color ="black", size =9, face ="plain", family ="Poppins")) +theme(legend.title=element_text(color ="black", size =12, face ="plain", family ="Poppins")) +theme(legend.title.align = .5) +theme(legend.text=element_text(color ="black", size =8, face ="plain", family ="Poppins"))
Now let’s use all of these variables to create a composite index score of risk/involvement:
Show code
master_dataset_study2_with_risk_scores <- master_dataset_study2 %>%mutate(PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME = PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME*-1) %>%# Invert outcome valuesmutate(across(c(PAS_6M_BET_MEAN_STAKE, PAS_6M_BET_SD_STAKE, PAS_6M_N_ACTIVITIES, PAS_6M_BET_INTENSITY, PAS_6M_MAX_DAY_STAKE, PAS_6M_MAX_DAY_BETS, PAS_6M_MEAN_DAILY_STAKE, PAS_6M_SD_DAILY_STAKE, PAS_6M_PERCENT_UNSOCIABLE_BETS, PAS_6M_MONTHLY_BET_FREQ_DAYS, PAS_6M_MONTHLY_SPEND, PAS_6M_MEAN_DEPOSIT_AMOUNT, PAS_6M_SD_DEPOSIT_AMOUNT, PAS_6M_PERCENT_DECLINED_DEPOSITS, PAS_6M_TOTAL_DEPOSIT_METHODS, PAS_6M_PERCENT_CANCEL_WITHDRAWS, PAS_6M_TOTAL_WITHDRAW_METHODS, PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT, PAS_6M_MONTHLY_DEPOSIT_AMOUNT, PAS_6M_DEPOSIT_INTENSITY), scale, # scale function computes z-scores.names ="z_{.col}")) %>%# Create new named columns for easy selection belowrowwise() %>%# Group by Rowsmutate(`Composite Risk Score`=sum(c_across(starts_with("z_")), na.rm =TRUE)) %>%# Sum across all newly created Z scoresungroup()# colnames(master_dataset_study2_with_risk_scores)
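As a quick check on the standardisation step (a sketch), each newly created z-score column should have a mean of approximately zero before summing:

# Sketch: verify the z-scored columns used in the composite
master_dataset_study2_with_risk_scores %>%
  summarise(across(starts_with("z_"), ~ mean(.x, na.rm = TRUE)))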
Correlations
Now let’s look at how well this composite score correlates with PGSI scores in the survey sample:
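The chunk that produced the output below is hidden in this rendering; judging from the output format, it presumably resembles the following sketch using base R’s cor.test() (an assumption, not the author’s confirmed code):

# Sketch of the hidden chunk (assumed): unweighted Spearman correlation
cor.test(master_dataset_study2_with_risk_scores$PGSI_TOTAL,
         master_dataset_study2_with_risk_scores$`Composite Risk Score`,
         method = "spearman")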
Spearman's rank correlation rho
data: master_dataset_study2_with_risk_scores$PGSI_TOTAL and master_dataset_study2_with_risk_scores$`Composite Risk Score`
S = 850963267, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.317731
Perform weighted correlation:
Show code
# This doesn't work nicely with NA values so let's remove them first:
master_dataset_study2_with_risk_scores_weighted_cor_data <- master_dataset_study2_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL))

weighted_cor_PGSI <- weightedCorr(
  x = master_dataset_study2_with_risk_scores_weighted_cor_data$PGSI_TOTAL,
  y = master_dataset_study2_with_risk_scores_weighted_cor_data$`Composite Risk Score`,
  method = "Spearman",
  weights = master_dataset_study2_with_risk_scores_weighted_cor_data$ALL_SAMPLE_WEIGHTS)

weighted_cor_PGSI
[1] 0.288293
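weightedCorr() returns a point estimate only. If an interval were wanted, a simple nonparametric bootstrap is one option; a sketch under that assumption (not part of the original script):

# Sketch (assumed): bootstrap 95% CI for the weighted Spearman correlation
set.seed(123)
boot_rhos <- replicate(1000, {
  idx <- sample(nrow(master_dataset_study2_with_risk_scores_weighted_cor_data), replace = TRUE)
  d <- master_dataset_study2_with_risk_scores_weighted_cor_data[idx, ]
  weightedCorr(x = d$PGSI_TOTAL, y = d$`Composite Risk Score`,
               method = "Spearman", weights = d$ALL_SAMPLE_WEIGHTS)
})
quantile(boot_rhos, c(0.025, 0.975))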
Confirm how many people are involved in these calculations, then compare the two groups on the composite score using a Welch two-sample t-test:
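The counting and testing code is hidden in this rendering; it presumably resembles this sketch (group sizes via count(), then the Welch test whose output follows):

# Sketch of the hidden chunk (assumed): group sizes, then Welch two-sample t-test
master_dataset_study2_with_risk_scores %>% count(STATUS)
t.test(`Composite Risk Score` ~ STATUS, data = master_dataset_study2_with_risk_scores)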
Welch Two Sample t-test
data: Composite Risk Score by STATUS
t = 15.233, df = 2368.3, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Participants and group Non-participants is not equal to 0
95 percent confidence interval:
3.103619 4.020736
sample estimates:
mean in group Participants mean in group Non-participants
3.2811454 -0.2810317
And now a weighted t-test:
Show code
# Create survey design object for the whole sample:
survey_design_all_participants <- svydesign(
  id = ~1,
  data = master_dataset_study2_with_risk_scores,
  weights = ~ALL_SAMPLE_WEIGHTS)

weighted_t_test_composite_study2 <- survey::svyttest(`Composite Risk Score` ~ STATUS,
                                                     data = master_dataset_study2_with_risk_scores,
                                                     design = survey_design_all_participants)
weighted_t_test_composite_study2
Design-based t-test
data: `Composite Risk Score` ~ STATUS
t = -16.864, df = 24791, p-value < 2.2e-16
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
-2.546679 -2.016333
sample estimates:
difference in mean
-2.281506
Compute non-weighted summary scores and effect size:
Show code
# Calculate means and standard deviations
summary_composite_study2 <- master_dataset_study2_with_risk_scores %>%
  group_by(STATUS) %>%
  dplyr::summarize(mean = mean(`Composite Risk Score`),
                   sd = sd(`Composite Risk Score`),
                   n = n())

gt(summary_composite_study2)
STATUS              mean         sd          n
Participants         3.2811454    9.855798    1956
Non-participants    -0.2810317   10.708913   22837
Show code
# Calculate Cohen's d with 95% confidence interval
cohen_d_result_study2 <- cohen.d(`Composite Risk Score` ~ STATUS,
                                 data = master_dataset_study2_with_risk_scores)
cohen_d_result_study2
Cohen's d
d estimate: 0.3346615 (small)
95 percent confidence interval:
lower upper
0.2883902 0.3809328
Compute weighted summary score and effect size:
Show code
# Rename the risk score as the statistic function doesn't like the spaces:
master_dataset_study2_with_risk_scores_temp_risk_name_change <- master_dataset_study2_with_risk_scores %>%
  rename(risk_score = `Composite Risk Score`)

# Create survey design objects for each STATUS group that include the composite risk scores:
survey_design_all_t <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_with_risk_scores_temp_risk_name_change,
  weights = ~ALL_SAMPLE_WEIGHTS)

survey_design_non_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_with_risk_scores_temp_risk_name_change, STATUS == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS)

survey_design_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_with_risk_scores_temp_risk_name_change, STATUS == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS)

risk_score_comparison <- weighted_summary_continuous("risk_score",
                                                     survey_design_all_t,
                                                     survey_design_non_participants_t,
                                                     survey_design_participants_t,
                                                     participants_n, non_participants_n)

gt(risk_score_comparison)
Variable     Overall              Participants          Non-participants      Effect_size
risk_score   -2.6 (5.54), -3.59   -0.47 (6.08), -1.53   -2.75 (5.46), -3.73   0.41 [0.36, 0.47]
Quartiles
Now, let’s divide the sample into quartiles based on their composite summary score:
Show code
# Calculate quartiles of the composite risk score
quartiles <- quantile(master_dataset_study2_with_risk_scores$`Composite Risk Score`,
                      probs = c(0, 0.25, 0.5, 0.75, 1))

master_dataset_study2_with_risk_scores$Quartile <- cut(
  master_dataset_study2_with_risk_scores$`Composite Risk Score`,
  breaks = quartiles,
  labels = c("Bottom quartile: least involved",
             "Second quartile",
             "Third quartile",
             "Top quartile: most involved"),
  include.lowest = TRUE)
Now compute the proportion of individuals from each quartile who went on to take part in the survey:
Show code
master_dataset_study2_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
  tab_header("Counts for granular survey status by composite risk score quartile", subtitle = NULL)
Counts for granular survey status by composite risk score quartile
Quartile                          SURVEY_STATUS_PGSI   n
Bottom quartile: least involved   Complete              245
                                  Incomplete              8
                                  Not started          5946
Second quartile                   Complete              348
                                  Incomplete             15
                                  Not started          5835
Third quartile                    Complete              535
                                  Incomplete             50
                                  Not started          5613
Top quartile: most involved       Complete              828
                                  Incomplete            113
                                  Not started          5257
Show code
master_dataset_study2_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(STATUS) %>%
  mutate('Percent of quartile' = round(n / sum(n) * 100, 2)) %>%
  gt() %>%
  tab_header("Counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Counts (with percentages) for broad survey status by composite risk score quartile
Quartile                          STATUS             n      Percent of quartile
Bottom quartile: least involved   Participants        245    3.95
                                  Non-participants   5954   96.05
Second quartile                   Participants        348    5.61
                                  Non-participants   5850   94.39
Third quartile                    Participants        535    8.63
                                  Non-participants   5663   91.37
Top quartile: most involved       Participants        828   13.36
                                  Non-participants   5370   86.64
Now compute weighted counts/proportions:
Show code
# Compute weighted counts for granular survey status
weighted_counts_granular <- svytable(~Quartile + SURVEY_STATUS_PGSI,
                                     svydesign(id = ~1,
                                               data = master_dataset_study2_with_risk_scores,
                                               weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame()

weighted_counts_granular %>%
  gt() %>%
  tab_header("Weighted counts for granular survey status by composite risk score quartile", subtitle = NULL)
Weighted counts for granular survey status by composite risk score quartile
Quartile                          SURVEY_STATUS_PGSI   Freq
Bottom quartile: least involved   Complete              2309.93893
Second quartile                   Complete              3267.87903
Third quartile                    Complete              4862.26711
Top quartile: most involved       Complete              2948.77958
Bottom quartile: least involved   Incomplete              75.38683
Second quartile                   Incomplete             142.03838
Third quartile                    Incomplete             463.32653
Top quartile: most involved       Incomplete             390.28258
Bottom quartile: least involved   Not started          56150.98467
Second quartile                   Not started          54996.77437
Third quartile                    Not started          49681.64042
Top quartile: most involved       Not started          17909.09600
Show code
# Compute weighted counts and percentages for broad survey status
svytable(~Quartile + STATUS,
         svydesign(id = ~1,
                   data = master_dataset_study2_with_risk_scores,
                   weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate('Percent of quartile' = round(Freq / sum(Freq) * 100, 2)) %>%
  gt() %>%
  tab_header("Weighted counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Weighted counts (with percentages) for broad survey status by composite risk score quartile
Quartile                          STATUS             Freq        Percent of quartile
Bottom quartile: least involved   Participants        2309.939    3.95
                                  Non-participants   56226.371   96.05
Second quartile                   Participants        3267.879    5.60
                                  Non-participants   55138.813   94.40
Third quartile                    Participants        4862.267    8.84
                                  Non-participants   50144.967   91.16
Top quartile: most involved       Participants        2948.780   13.88
                                  Non-participants   18299.379   86.12
And visualise this using a Sankey diagram (weighted figures):
Show code
# Create survey design object
survey_design_all_participants_sankey <- svydesign(
  id = ~1,
  data = master_dataset_study2_with_risk_scores,
  weights = ~ALL_SAMPLE_WEIGHTS)

# We first need to create a data frame that contains the "from" and "to" categories and "intensity" values (i.e., N) for each flow, with weightings applied:
weighted_counts <- svytable(~Quartile + SURVEY_STATUS_PGSI, survey_design_all_participants_sankey) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate(percent = Freq / sum(Freq) * 100) %>%
  ungroup()

# Fix the order of quartiles in the plot:
desired_order_sankey <- c("Top quartile: most involved",
                          "Third quartile",
                          "Second quartile",
                          "Bottom quartile: least involved")

# Non-weighted version of the flows (not used in the final plot):
risk_to_survey_links <- master_dataset_study2_with_risk_scores %>%
  select(CLIENT_ID, SURVEY_STATUS_PGSI, Quartile) %>%
  group_by(Quartile, SURVEY_STATUS_PGSI) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Quartile) %>%
  mutate(percent = n / sum(n) * 100) %>%
  select(-n) %>%
  ungroup() %>%
  mutate(Quartile = factor(Quartile, levels = desired_order_sankey)) %>%
  arrange(Quartile)

# From these flows we have to create a "node" data frame which provides a list of all categories (to and from) in the data frame:
nodes <- data.frame(name = c(as.character(weighted_counts$Quartile),
                             as.character(weighted_counts$SURVEY_STATUS_PGSI)) %>% unique())

# For the networkD3 package, the links between categories are made using numeric ids rather than the character names in the links data frame. This code assigns each "from" and "to" category a number to make the connections explicit:
weighted_counts$IDsource <- match(weighted_counts$Quartile, nodes$name) - 1
weighted_counts$IDtarget <- match(weighted_counts$SURVEY_STATUS_PGSI, nodes$name) - 1

# Prepare colour scale
ColourScal <- 'd3.scaleOrdinal() .range(["#d1495b", "#edae49","#00798c","#2e4057","#35B779FF","#31688EFF","#31688EFF","#482878FF","#B4DE2CFF","#FDE725FF", "#26828EFF","#3E4A89FF","#440154FF","#6DCD59FF"])'

# Create Sankey plot
sankey_plot2 <- sankeyNetwork(Links = weighted_counts, # risk_to_survey_links for the non-weighted version
                              Nodes = nodes,
                              Source = "IDsource",
                              Target = "IDtarget",
                              Value = "percent",
                              NodeID = "name",
                              iterations = 0,
                              sinksRight = FALSE,
                              fontSize = 16,
                              fontFamily = "Poppins",
                              nodeWidth = 30,
                              nodePadding = 25,
                              colourScale = ColourScal)

# Custom CSS with Google Fonts link
custom_css <- "<link href='https://fonts.googleapis.com/css2?family=Poppins:wght@400;700&display=swap' rel='stylesheet'><style> text { font-family: 'Poppins', sans-serif !important; }</style>"

# Save plot as a HTML file with custom CSS
html_file <- tempfile(fileext = ".html")
saveWidget(widget = htmlwidgets::prependContent(sankey_plot2),
           file = html_file,
           selfcontained = TRUE)

# Read the saved HTML and ensure the custom CSS is properly embedded
html_content <- readLines(html_file, warn = FALSE)
html_content <- gsub("</head>", paste0(custom_css, "</head>"), html_content)
writeLines(html_content, con = "sankey_plot2.html")

# Remove the temporary file
file.remove(html_file)
Now let’s perform a logistic regression model to identify the strongest predictors of survey uptake.
SSVS
Before we can actually run the model, we need to identify a suitable suite of predictor variables using Stochastic Search Variable Selection (SSVS) and the related SSVS package (note that the actual SSVS process is commented out in rendered versions of this document as it takes a long time to run).
As we want to maximise the overall predictive ability of the model, we have opted to set the parameters of the SSVS to give a low threshold for including/selecting relevant variables. First, we have set the prior inclusion probability to 0.8, reflecting a belief that predictors are more likely to be included than not (a prior inclusion probability of 0.5 would reflect the belief that each predictor has an equal probability of being included/excluded). Second, we have set the cut-off for marginal inclusion probability (MIP) values to 0.4. There is no standard recommendation for what this cut-off should be, but others have used 0.5.
One thing we need to do before all of this is impute missing values for the IRSAD scores, as SSVS doesn’t seem to handle missing values well. To do this, we will impute each person’s missing value using the median IRSAD score.
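The chunk below imputes a single overall median. If one instead wanted a state-wise imputation, it would look something like the following sketch (the STATE column name is a hypothetical placeholder, not a variable confirmed in this dataset):

# Hypothetical sketch: impute each person's missing IRSAD score using the
# median for their state (assumes a STATE column exists; not the original chunk)
master_dataset_study2 %>%
  group_by(STATE) %>%
  mutate(IRSAD_SCORE = ifelse(is.na(IRSAD_SCORE),
                              median(IRSAD_SCORE, na.rm = TRUE),
                              IRSAD_SCORE)) %>%
  ungroup()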
Show code
# First we need to convert logical and categorical variables to numeric for inclusion in the model:
master_dataset_study2_SSVS <- master_dataset_study2 %>%
  # Impute missing IRSAD scores
  mutate(IRSAD_SCORE = ifelse(is.na(IRSAD_SCORE), median(IRSAD_SCORE, na.rm = TRUE), IRSAD_SCORE)) %>%
  # Convert all required variables to numeric:
  mutate(across(c('AGE', 'IRSAD_SCORE', 'DAYS_REGISTERED', 'DAYS_SINCE_LAST_BET', 'PAS_6M_BET_MEAN_STAKE',
                  'PAS_6M_BET_SD_STAKE', 'PAS_6M_N_ACTIVITIES', 'PAS_6M_BET_INTENSITY', 'PAS_6M_MAX_DAY_STAKE',
                  'PAS_6M_MAX_DAY_BETS', 'PAS_6M_MEAN_DAILY_STAKE', 'PAS_6M_SD_DAILY_STAKE',
                  'PAS_6M_PERCENT_UNSOCIABLE_BETS', 'PAS_6M_MONTHLY_BET_FREQ_DAYS', 'PAS_6M_MONTHLY_SPEND',
                  'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME', 'PAS_6M_MEAN_DEPOSIT_AMOUNT', 'PAS_6M_SD_DEPOSIT_AMOUNT',
                  'PAS_6M_PERCENT_DECLINED_DEPOSITS', 'PAS_6M_TOTAL_DEPOSIT_METHODS', 'PAS_6M_PERCENT_CANCEL_WITHDRAWS',
                  'PAS_6M_TOTAL_WITHDRAW_METHODS', 'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT', 'PAS_6M_MONTHLY_DEPOSIT_AMOUNT',
                  'PAS_6M_DEPOSIT_INTENSITY', 'EVER_USED_DEPOSIT_LIMIT', 'EVER_USED_BLOCK_OUT', 'EVER_USED_CHECK_IN',
                  'EVER_USED_CURFEW', 'EVER_USED_DEPOSIT_OPTION', 'EVER_USED_MARKET_CONTROL', 'EVER_USED_TIMOUT',
                  'EVER_USED_CANCELLED_WITHDRAWS', 'EVER_USED_SELF_EXCLUSION', 'TOTAL_ACTIVE_TOOLS',
                  'NUMBER_CPT_CHANGES'),
                ~as.numeric(.))) %>%
  mutate(GENDER = case_when(GENDER == 'Female' ~ 0,
                            GENDER == 'Male' ~ 1,
                            TRUE ~ 2)) %>%
  mutate(STATUS = case_when(STATUS == 'Non-participants' ~ 0,
                            STATUS == 'Participants' ~ 1,
                            TRUE ~ 2)) %>%
  rename('Age' = 'AGE', 'Gender' = 'GENDER', 'IRSAD score' = 'IRSAD_SCORE',
         'Days since registration' = 'DAYS_REGISTERED', 'Days since last bet' = 'DAYS_SINCE_LAST_BET',
         'Mean stake*' = 'PAS_6M_BET_MEAN_STAKE', 'Stake SD*' = 'PAS_6M_BET_SD_STAKE',
         'No. of activities*' = 'PAS_6M_N_ACTIVITIES', 'Betting intensity*' = 'PAS_6M_BET_INTENSITY',
         'Max daily stake*' = 'PAS_6M_MAX_DAY_STAKE', 'Max daily bets*' = 'PAS_6M_MAX_DAY_BETS',
         'Mean daily stake*' = 'PAS_6M_MEAN_DAILY_STAKE', 'Daily stake SD*' = 'PAS_6M_SD_DAILY_STAKE',
         '% unsociable bets*' = 'PAS_6M_PERCENT_UNSOCIABLE_BETS', 'Monthly bet freq (days)*' = 'PAS_6M_MONTHLY_BET_FREQ_DAYS',
         'Monthly spend*' = 'PAS_6M_MONTHLY_SPEND', 'Mean monthly net outcome*' = 'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME',
         'Mean deposit amount*' = 'PAS_6M_MEAN_DEPOSIT_AMOUNT', 'Deposit amount SD*' = 'PAS_6M_SD_DEPOSIT_AMOUNT',
         '% declined deposits*' = 'PAS_6M_PERCENT_DECLINED_DEPOSITS', 'No. deposit methods*' = 'PAS_6M_TOTAL_DEPOSIT_METHODS',
         '% cancel withdrawals*' = 'PAS_6M_PERCENT_CANCEL_WITHDRAWS', 'No. withdrawal methods*' = 'PAS_6M_TOTAL_WITHDRAW_METHODS',
         'Max daily deposit amount*' = 'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT', 'Monthly deposit amount*' = 'PAS_6M_MONTHLY_DEPOSIT_AMOUNT',
         'Deposit intensity*' = 'PAS_6M_DEPOSIT_INTENSITY', 'Ever used deposit limit' = 'EVER_USED_DEPOSIT_LIMIT',
         'Ever used block out' = 'EVER_USED_BLOCK_OUT', 'Ever used check in' = 'EVER_USED_CHECK_IN',
         'Ever used curfew' = 'EVER_USED_CURFEW', 'Ever used deposit option' = 'EVER_USED_DEPOSIT_OPTION',
         'Ever used market control' = 'EVER_USED_MARKET_CONTROL', 'Ever used timeout' = 'EVER_USED_TIMOUT',
         'Ever used cancelled withdrawals' = 'EVER_USED_CANCELLED_WITHDRAWS', 'Ever used self-exclusion' = 'EVER_USED_SELF_EXCLUSION',
         'Total active tools' = 'TOTAL_ACTIVE_TOOLS', 'Number of changes to CPTs' = 'NUMBER_CPT_CHANGES')

outcome <- 'STATUS'

predictors <- c(
  # Demographic variables:
  'Age', 'Gender', 'IRSAD score',
  # Account/wagering variables:
  'Days since registration', 'Days since last bet',
  # All past six-month wagering and transaction variables (as included in the composite score):
  'Mean stake*', 'Stake SD*', 'No. of activities*', 'Betting intensity*', 'Max daily stake*',
  'Max daily bets*', 'Mean daily stake*', 'Daily stake SD*', '% unsociable bets*',
  'Monthly bet freq (days)*', 'Monthly spend*', 'Mean monthly net outcome*', 'Mean deposit amount*',
  'Deposit amount SD*', '% declined deposits*', 'No. deposit methods*', '% cancel withdrawals*',
  'No. withdrawal methods*', 'Max daily deposit amount*', 'Monthly deposit amount*', 'Deposit intensity*',
  # Current and historical use of consumer protection tools:
  'Ever used deposit limit', 'Ever used block out', 'Ever used check in', 'Ever used curfew',
  'Ever used deposit option', 'Ever used market control', 'Ever used timeout',
  'Ever used cancelled withdrawals', 'Ever used self-exclusion', 'Total active tools',
  'Number of changes to CPTs')

# str(master_dataset_study2_SSVS)
# master_dataset_study2_SSVS %>%
#   select(`IRSAD score`)

# SSSV_results_Study2 <- ssvs(data = master_dataset_study2_SSVS,
#                             x = predictors, y = outcome, # Variables, as prespecified above
#                             inprob = 0.8, # Prior inclusion probability: 0.5 reflects the belief that each predictor has an equal probability of being included/excluded
#                             continuous = FALSE, # Categorical outcome variable
#                             progress = TRUE) # Provide updates on progress of processing
#
# # Save file to avoid having to re-run every time
# saveRDS(SSSV_results_Study2, "SSSV_results_Study2")

# Load in:
SSSV_results_Study2 <- read_rds("SSSV_results_Study2")
Now we have the outcomes from the variable selection process, let’s present them as a table and visually in a plot:
Show code
summary(SSSV_results_Study2,
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  as.data.frame() %>%
  gt() %>%
  tab_header(md("**Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2**"), subtitle = NULL)
Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2
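The plotting chunk itself is not visible in this rendering (only its ggsave() call below survives); assuming it used the SSVS package’s plot method for ssvs result objects, it presumably resembled:

# Sketch (assumed): plot marginal inclusion probabilities from the SSVS results
plot(SSSV_results_Study2)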
ggsave("Figures/SSVS results plot Study 2.pdf",width =8.5,height =3.5,units =c("in"))
Run model
Now let’s run the logistic regression model with our selection of predictor variables identified via SSVS:
Show code
SSSV_results_Study2$ssvs %>%as_tibble()
# A tibble: 0 × 0
Show code
# Set the data for analysis:
Study2_logistic_reg_data <- master_dataset_study2 %>%
  mutate(STATUS = case_when(STATUS == 'Non-participants' ~ 0,
                            STATUS == 'Participants' ~ 1))

# Create survey design object used for weighting:
survey_design_regression <- svydesign(
  id = ~1,
  data = Study2_logistic_reg_data,
  weights = ~ALL_SAMPLE_WEIGHTS)

# Check variables for inclusion:
summary(SSSV_results_Study2,
        interval = 0.95, # Credible interval
        ordered = TRUE) %>%
  filter(MIP > 0.4)
Variable                     MIP      Avg Beta   Avg Nonzero Beta   Lower CI (95%)   Upper CI (95%)
Age                          1.0000    0.2783     0.2783             0.2257           0.3342
Days since last bet          1.0000   -0.3794    -0.3794            -0.4596          -0.3024
No. withdrawal methods*      1.0000    0.1980     0.1980             0.1393           0.2564
Monthly bet freq (days)*     0.9895    0.1473     0.1488             0.0799           0.2117
No. deposit methods*         0.9755    0.1185     0.1215             0.0603           0.1830
Mean deposit amount*         0.8053   -0.2226    -0.2764            -0.3901           0.0000
No. of activities*           0.4949    0.0450     0.0909             0.0000           0.1240
Show code
# Run the model with the weighting applied:
study_2_weighted_logistic_regression <- svyglm(
  STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS +
    PAS_6M_MONTHLY_BET_FREQ_DAYS + PAS_6M_TOTAL_DEPOSIT_METHODS +
    PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_N_ACTIVITIES,
  design = survey_design_regression,
  family = quasibinomial())

summary(study_2_weighted_logistic_regression)
Call:
svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS +
PAS_6M_MONTHLY_BET_FREQ_DAYS + PAS_6M_TOTAL_DEPOSIT_METHODS +
PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_N_ACTIVITIES, design = survey_design_regression,
family = quasibinomial())
Survey design:
svydesign(id = ~1, data = Study2_logistic_reg_data, weights = ~ALL_SAMPLE_WEIGHTS)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.9154835 0.1304685 -30.011 < 2e-16 ***
AGE 0.0246511 0.0019471 12.660 < 2e-16 ***
DAYS_SINCE_LAST_BET -0.0076221 0.0010583 -7.202 6.11e-13 ***
PAS_6M_TOTAL_WITHDRAW_METHODS 0.3606934 0.0522027 6.909 4.98e-12 ***
PAS_6M_MONTHLY_BET_FREQ_DAYS 0.0124960 0.0054804 2.280 0.022610 *
PAS_6M_TOTAL_DEPOSIT_METHODS 0.1432951 0.0351667 4.075 4.62e-05 ***
PAS_6M_MEAN_DEPOSIT_AMOUNT -0.0006594 0.0001831 -3.602 0.000316 ***
PAS_6M_N_ACTIVITIES 0.0416386 0.0093267 4.464 8.06e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 1.731407)
Number of Fisher Scoring iterations: 6
I have tried various methods of transforming variables to account for large residuals but none seem to have any effect (first plot above). If anything, using variables on the original scale seems to improve model diagnostics.
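For reference, a transformation check of the kind described here might look like the following sketch (an assumed example, not the author’s exact code; log1p() is one common choice for right-skewed, non-negative wagering variables):

# Sketch (assumed): refit with a log-transformed skewed predictor to see
# whether residual diagnostics improve
study_2_log_model <- svyglm(
  STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS +
    PAS_6M_MONTHLY_BET_FREQ_DAYS + PAS_6M_TOTAL_DEPOSIT_METHODS +
    log1p(PAS_6M_MEAN_DEPOSIT_AMOUNT) + PAS_6M_N_ACTIVITIES,
  design = survey_design_regression,
  family = quasibinomial())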
Compute confidence intervals for the coefficient estimates:
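The chunk is hidden in this rendering; a minimal sketch of what it presumably does (confint() works directly on svyglm fits):

# Sketch (assumed): Wald-type 95% CIs for the survey-weighted coefficients
confint(study_2_weighted_logistic_regression, level = 0.95)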
# NOTE: R2 is calculated as a proportion here, so in the paper we have multiplied it by 100 to form a percentage value
The model accounts for 8.52% of variance in uptake/response.
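The exact pseudo-R² calculation is hidden in this rendering. One deviance-based analogue that can be computed directly from the fitted svyglm object is sketched below (an assumption about the general approach, not the author’s exact code):

# Sketch (assumed): deviance-based pseudo-R2 for the weighted fit
pseudo_R2 <- 1 - study_2_weighted_logistic_regression$deviance /
  study_2_weighted_logistic_regression$null.deviance
pseudo_R2 * 100 # expressed as a percentage, as in the manuscript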
Calculate the overall p-value for the model by comparing it with a null model:
Show code
anova(study_2_weighted_logistic_regression,
      update(study_2_weighted_logistic_regression, ~1), # update here produces the null model for comparison
      test = "Chisq")
Working (Rao-Scott) LRT for AGE DAYS_SINCE_LAST_BET PAS_6M_TOTAL_WITHDRAW_METHODS PAS_6M_MONTHLY_BET_FREQ_DAYS PAS_6M_TOTAL_DEPOSIT_METHODS PAS_6M_MEAN_DEPOSIT_AMOUNT PAS_6M_N_ACTIVITIES
in svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS +
PAS_6M_MONTHLY_BET_FREQ_DAYS + PAS_6M_TOTAL_DEPOSIT_METHODS +
PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_N_ACTIVITIES, design = survey_design_regression,
family = quasibinomial())
Working 2logLR = 745.6723 p= < 2.22e-16
(scale factors: 1.4 1.1 1 0.98 0.92 0.84 0.73 )
Show code
# Set seed so that labels are consistent in the plot:
set.seed(200)

ggstatsplot::ggcoefstats(study_2_weighted_logistic_regression,
                         digits = 4,
                         stats.label.color = "#00798c",
                         sort = "ascending",
                         vline.args = list(size = 1, color = "black"),
                         # ggstatsplot.layer = FALSE,
                         exclude.intercept = TRUE, # hide the intercept
                         stats.label.args = list(size = 3, direction = "y")) +
  scale_x_continuous(name = "Regression coefficient",
                     limits = c(-0.1, 0.5),
                     breaks = c(-.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5)) +
  scale_y_discrete(name = "",
                   labels = c("Days since\n last bet", "Mean deposit\n amount", "Mean monthly bet\n freq (days)",
                              "Age", "No. of activities", "No. deposit\n methods", "No. withdrawal\n methods")) +
  theme(panel.grid.minor = element_blank(),
        axis.line = element_blank()) +
  theme(axis.text = element_text(color = "black", size = 9, face = "plain")) +
  theme_light(base_family = "Poppins") +
  # theme(plot.margin = unit(c(.1,.1,.3,-3), "cm")) +
  theme(axis.text.y = element_text(color = "black", size = 12, face = "plain", family = "Poppins")) +
  theme(axis.title.x = element_text(color = "black", size = 12, face = "plain", vjust = 0, family = "Poppins"))
Show code
# Set height to 7 and width to 5.5 in portrait mode
ggsave("Figures/Study 2 log regression plot.pdf",
       width = 5,
       height = 7,
       units = c("in"))
Study 2 Subsample Analysis
Okay, as noted by two reviewers during peer review, there is a key distinction in the inclusion criteria between our two studies. Study 1 invited participants who had gambled at least once in the past 30 days, whereas Study 2 invited those who had gambled at least once in the past six months. This difference could significantly impact our outcomes and help explain the higher nonresponse bias in Study 2. Since more regular gamblers appear to be more inclined to participate in studies, expanding the eligibility criteria to include less frequent gamblers makes the responders less representative of the total invited sample. This is a great point, so let’s re-run all Study 2 analyses focusing only on participants who had gambled in the 30 days prior to invitation.
Process the data for analysis by doing the following:
Remove anybody who didn’t bet in the 30 days prior to the survey invitation, as this is the inclusion criterion for this restricted sample
Produce some new sample weights to account for the different sample we’re now using, which has a different distribution of flagged vs non-flagged customers (this is hidden for commercial sensitivity; a generic sketch of this style of reweighting follows below)
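To illustrate the general principle without revealing the actual figures: weights of this kind are typically each stratum’s population share divided by its sample share. A generic sketch with entirely hypothetical strata and proportions (not the study’s real weights):

# Generic sketch of proportional reweighting (hypothetical values only):
# weight = population share of stratum / sample share of stratum
example_weights <- tibble(
  FLAGGED = c(TRUE, FALSE),
  population_share = c(0.05, 0.95), # hypothetical
  sample_share = c(0.30, 0.70)      # hypothetical
) %>%
  mutate(weight = population_share / sample_share)

example_weights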
How many people in total were invited who had gambled in the last 30 days:
Show code
nrow(master_dataset_study2_restricted)
[1] 12799
Let’s now look at how survey uptake varied over the three-week window the survey was open in this sample:
Show code
master_dataset_study2_restricted %>%
  as_tibble() %>%
  filter(SURVEY_STATUS_START != "Not started") %>%
  group_by(START_DATE) %>%
  summarise(Freq = sum(n())) %>%
  ungroup() %>%
  ggplot(aes(x = START_DATE, y = Freq)) +
  plot_theme +
  geom_bar(stat = "identity", fill = "#00798c") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "",
       x = "Date",
       y = "Number of clients") +
  scale_y_continuous(breaks = seq(0, 1500, by = 500), limits = c(0, 1500))
Now compute some figures to tell us the rates of uptake for the survey. These figures only relate to the refined, filtered dataset:
Show code
master_dataset_study2_restricted %>%
  as_tibble() %>%
  count(SURVEY_STATUS_START) %>%
  mutate(Percent = n / sum(n) * 100) %>%
  gt() %>%
  cols_label(SURVEY_STATUS_START = "Survey status") %>%
  tab_header("Counts for starting the survey: Restricted analyses", subtitle = NULL)
Counts for starting the survey: Restricted analyses
Survey status   n       Percent
Not started     11127   86.93648
Started          1672   13.06352
Show code
master_dataset_study2_restricted %>%
  as_tibble() %>%
  count(SURVEY_STATUS_PGSI) %>%
  mutate(Percent = n / sum(n) * 100) %>%
  gt() %>%
  cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for completing the survey up to the PGSI: Restricted analyses", subtitle = NULL)
Counts for completing the survey up to the PGSI: Restricted analyses
Survey status   n       Percent
Complete         1514   11.829049
Incomplete        158    1.234471
Not started     11127   86.936479
Show code
master_dataset_study2_restricted %>%
  as_tibble() %>%
  count(SURVEY_STATUS_FULL) %>%
  mutate(Percent = n / sum(n) * 100) %>%
  gt() %>%
  cols_label(SURVEY_STATUS_FULL = "Survey status") %>%
  tab_header("Counts for completing the survey in full: Restricted analyses", subtitle = NULL)
Counts for completing the survey in full: Restricted analyses
Survey status   n       Percent
Complete         1115    8.711618
Incomplete        557    4.351902
Not started     11127   86.936479
We also need to figure out how many people were removed by restricting the sample to past 30-day bettors, and how this relates to survey status:
Show code
master_dataset_study2 %>%
  as_tibble() %>%
  filter(PAS_30_BET_FREQ_DAYS == 0) %>% # keep anybody who didn't bet in the window of interest
  count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
  cols_label(SURVEY_STATUS_PGSI = "Survey status") %>%
  tab_header("Counts for people removed from the dataset because they hadn't gambled in the past 30 days", subtitle = NULL)
Counts for people removed from the dataset because they hadn't gambled in the past 30 days
Survey status   n
Complete          442
Incomplete         28
Not started     11524
Okay, so these restricted analyses remove 442 people who completed the survey and 28 people who started but didn’t finish it. We still have a reasonably large number of people who completed the survey, so it should be fine to proceed with these restricted analyses.
Compare samples
What are the general demographic and gambling characteristics of those who did and did not take part in the survey?
Before we can do this, we need to create weighted variable summaries to account for the oversampling of the at-risk population:
Show code
# Create survey design:
survey_design_all <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_restricted,
  weights = ~ALL_SAMPLE_WEIGHTS)

# Create survey design objects for each STATUS group:
survey_design_non_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted, STATUS == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS)

survey_design_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted, STATUS == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS)

# Get sample sizes for each group
participants_n <- sum(master_dataset_study2_restricted$STATUS == "Participants")
non_participants_n <- sum(master_dataset_study2_restricted$STATUS == "Non-participants")
overall_n <- nrow(master_dataset_study2_restricted)

# [Yes, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]

# To streamline this substantially, let's develop a function that calculates a weighted summary for continuous variables and puts it all together nicely for tabling later on:
weighted_summary_continuous <- function(variable, design_all, design_non_participants, design_participants,
                                        participants_n, non_participants_n, overall_n) {
  var_formula <- as.formula(paste0("~", variable))

  # Calculate means and standard deviations
  mean_all <- svymean(var_formula, design_all, na.rm = TRUE)
  mean_non_ps <- svymean(var_formula, design_non_participants, na.rm = TRUE)
  mean_ps <- svymean(var_formula, design_participants, na.rm = TRUE)
  sd_all <- sqrt(svyvar(var_formula, design_all, na.rm = TRUE))
  sd_non_ps <- sqrt(svyvar(var_formula, design_non_participants, na.rm = TRUE))
  sd_ps <- sqrt(svyvar(var_formula, design_participants, na.rm = TRUE))

  # Calculate weighted medians
  median_all <- svyquantile(var_formula, design_all, 0.5, na.rm = TRUE)
  median_non_ps <- svyquantile(var_formula, design_non_participants, 0.5, na.rm = TRUE)
  median_ps <- svyquantile(var_formula, design_participants, 0.5, na.rm = TRUE)

  # Format for table
  summary_df <- tibble(
    Variable = variable,
    `Overall` = paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),
    `Participants` = paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),
    `Non-participants` = paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2))) %>%
    # Compute effect size
    mutate(
      pooled_sd = sqrt(((participants_n - 1) * sd_ps^2 + (non_participants_n - 1) * sd_non_ps^2) /
                         (participants_n + non_participants_n - 2)),
      cohen_d = (coef(mean_ps) - coef(mean_non_ps)) / pooled_sd,
      SE = sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) +
                  (cohen_d^2 / (2 * (participants_n + non_participants_n)))),
      CI_lower = cohen_d - 1.96 * SE,
      CI_upper = cohen_d + 1.96 * SE,
      Effect_size = paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]")) %>%
    select(Variable, Overall, Participants, `Non-participants`, Effect_size)

  return(summary_df)
}

# Apply function to continuous variables
continuous_vars <- c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", "PAS_6M_BET_FREQ_DAYS",
                     "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME",
                     "PAS_6M_PERCENT_UNSOCIABLE_BETS", "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY",
                     "PAS_6M_MEAN_DEPOSIT_AMOUNT", "PAS_6M_PERCENT_DECLINED_DEPOSITS", "PAS_6M_PERCENT_CANCEL_WITHDRAWS")

all_continuous_summaries <- map_dfr(continuous_vars,
                                    ~weighted_summary_continuous(.x, survey_design_all, survey_design_non_participants,
                                                                 survey_design_participants, participants_n,
                                                                 non_participants_n, overall_n))

# Categorical variables:
# Let's also develop a function to calculate and format percentages for categorical variables that we can join to the above outcomes for tabling.
# Note: this function computes Cramér's V, which is typically used for categorical variables with more than two levels, but it also works for variables with only two levels (in which case it's equivalent to the phi coefficient).
# UPDATE: Let's just compute odds ratios for the two-level variables based on a reviewer suggestion (thanks ChatGPT for your help with this one!):
categorical_summary <- function(variable, design_all, design_non_participants, design_participants) {
  var_formula <- as.formula(paste0("~", variable))

  # Calculate proportions for overall sample
  prop_table_all <- svytable(var_formula, design_all)
  props_all <- prop.table(prop_table_all) %>%
    as.data.frame() %>%
    mutate(Overall = Freq * 100) %>%
    select(Variable = !!sym(variable), Overall)

  # Calculate proportions for non-participants
  prop_table_non_ps <- svytable(var_formula, design_non_participants)
  props_non_ps <- prop.table(prop_table_non_ps) %>%
    as.data.frame() %>%
    mutate('Non-participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Non-participants')

  # Calculate proportions for participants
  prop_table_ps <- svytable(var_formula, design_participants)
  props_ps <- prop.table(prop_table_ps) %>%
    as.data.frame() %>%
    mutate('Participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Participants')

  # Combine results and format
  summary_df <- props_ps %>%
    left_join(props_non_ps, by = "Variable") %>%
    left_join(props_all, by = "Variable") %>%
    mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%
    replace(is.na(.), "")

  # Determine if variable is binary (i.e., only two levels)
  unique_values <- length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))

  if (unique_values == 2 & variable != "GENDER") {
    # Ensure STATUS is treated as a factor with "Non-participants" as the reference
    design_all$variables$STATUS <- relevel(as.factor(design_all$variables$STATUS), ref = "Non-participants")

    # Compute odds ratio (OR) using survey-weighted logistic regression
    logistic_model <- tryCatch(
      svyglm(as.formula(paste0(variable, " ~ STATUS")), design = design_all, family = quasibinomial()),
      error = function(e) NULL)

    if (!is.null(logistic_model)) {
      coef_name <- grep("STATUS", names(coef(logistic_model)), value = TRUE)
      if (length(coef_name) > 0) {
        OR <- exp(coef(logistic_model)[coef_name])
        CI <- exp(confint(logistic_model)[coef_name, ])
        effect_size_text <- paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]")
      } else {
        effect_size_text <- "NA"
      }
    } else {
      effect_size_text <- "NA"
    }
  } else {
    # Compute Cramér's V for categorical variables with more than two levels
    combined_table <- as.table(rbind(prop_table_non_ps, prop_table_ps))
    V <- cramersV(combined_table)
    V_rounded <- as.matrix(V$output) %>%
      t() %>%
      as.data.frame() %>%
      mutate(cramersV = as.numeric(cramersV)) %>%
      mutate(cramersV = round(cramersV, 2))
    CI <- confIntV(combined_table)
    CI_info <- CI$intermediate
    CIs <- CI_info$fisherZ.ci %>%
      t() %>%
      as.data.frame() %>%
      mutate_if(is.numeric, round, 2) %>%
      unite('Effect_size', 1, 2, sep = ", ")
    effect_size_text <- paste0(V_rounded, " [", CIs, "]")
  }

  # Add effect size to the table
  header_row <- data.frame(Variable = variable, Overall = "", Participants = "",
                           `Non-participants` = "", `Effect_size` = effect_size_text)
  summary_df <- bind_rows(header_row, summary_df) %>%
    as_tibble() %>%
    select(Variable, Overall, Participants, `Non-participants`, `Effect_size`)

  return(summary_df)
}

# Apply function to categorical variables
categorical_vars <- c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")

all_categorical_summaries <- map_dfr(categorical_vars,
                                     ~categorical_summary(.x, survey_design_all, survey_design_non_participants,
                                                          survey_design_participants))

all_categorical_summaries <- all_categorical_summaries %>%
  # Recode for consistency across summaries:
  mutate(Variable = case_when(Variable == 1 ~ "Yes",
                              Variable == 0 ~ "No",
                              Variable == TRUE ~ "Yes",
                              Variable == FALSE ~ "No",
                              TRUE ~ Variable))

# Combine continuous and categorical summaries
weighted_comparison_summary_data <- bind_rows(all_continuous_summaries, all_categorical_summaries) %>%
  rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols = "Participants") %>%
  rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols = "Non-participants") %>%
  rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols = "Overall")

# Order data for table (as we have duplicate names in the Variable column, this is a pain in the butt that needs to be done by separating and rejoining the variables in the right order)
gender_summary_tabled <- weighted_comparison_summary_data[15:18, 1:5]
age_summary_tabled <- weighted_comparison_summary_data[1, 1:5]
summary_table_extracted_vars_missing <- weighted_comparison_summary_data %>%
  filter(Variable != "AGE", Variable != "GENDER", Variable != "Male", Variable != "Female", Variable != "Unknown")

# Join everything back together and provide some nice names for our variables:
weighted_comparison_summary_data_ordered <- bind_rows(age_summary_tabled, gender_summary_tabled,
                                                      summary_table_extracted_vars_missing) %>%
  mutate(Variable = dplyr::recode(Variable,
                                  `AGE` = "Age", "GENDER" = "Gender", "IRSAD_SCORE" = "IRSAD score",
                                  "DAYS_REGISTERED" = "Days since registered",
                                  "DAYS_SINCE_LAST_BET" = "No. days since last bet",
                                  "PAS_6M_BET_FREQ_DAYS" = "No. betting days",
                                  "PAS_6M_BET_INTENSITY" = "Mean bets per active day",
                                  "PAS_6M_BET_MEAN_STAKE" = "Mean stake",
                                  "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" = "Mean net outcome per month",
                                  "PAS_6M_PERCENT_UNSOCIABLE_BETS" = "Percentage of bets midnight - 6AM",
                                  "PAS_6M_TOTAL_DEPOSIT_FREQ" = "No. deposits",
                                  "PAS_6M_DEPOSIT_INTENSITY" = "Mean no. deposits per active day",
                                  "PAS_6M_MEAN_DEPOSIT_AMOUNT" = "Mean deposit amount",
                                  "PAS_6M_PERCENT_DECLINED_DEPOSITS" = "Percentage of deposits declined",
                                  "PAS_6M_PERCENT_CANCEL_WITHDRAWS" = "Percentage of withdrawals cancelled",
                                  "EVER_USED_ANY_TOOL" = "Ever used a CPT",
                                  "ACTIVE_DEPOSIT_LIMIT" = "Active deposit limit",
                                  "RE_PAS_6M_ANY_TRIGGER" = "Flagged as 'at-risk' ≥ 1 times"))

# Create final table:
weighted_comparison_summary_data_ordered %>%
  rename("Effect size" = Effect_size) %>%
  gt() %>%
  tab_header(title = "Table 4. Study 2 (past 30-day bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics") %>%
  tab_options(data_row.padding = px(1.5)) %>%
  cols_align(align = "right", columns = c(Variable)) %>%
  cols_align(align = "center", columns = c(2:4)) %>%
  tab_footnote(footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
               locations = cells_column_labels(columns = c(2, 3))) %>%
  tab_footnote(footnote = "Continuous variables: Cohen's d [95% Confidence Intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with three levels (i.e., gender): Cramer's V [95% CIs]",
               locations = cells_column_labels(columns = vars(`Effect size`)))
Table 4. Study 2 (past 30-day bettors) - Comparison of participants and non-participants: Demographic and gambling characteristics
Variable                               Overall (N = 12799)¹         Participants (N = 1514)¹     Non-participants (N = 11285)   Effect size²
Age                                    40.62 (15.44), 38            44.47 (15.39), 44            40.15 (15.38), 38              0.28 [0.23, 0.33]
Gender                                                                                                                          0.03 [0.02, 0.03]
  Female                               14.18%                       12.4%                        14.4%
  Male                                 84.72%                       85.94%                       84.57%
  Unknown                              1.1%                         1.66%                        1.03%
IRSAD score                            1002.69 (75.08), 994.21      1004.22 (75.25), 994.59      1002.5 (75.06), 994.21         0.02 [-0.03, 0.08]
Days since registered                  1259.45 (926.89), 1097       1393.52 (971.99), 1214       1243.06 (919.93), 1070         0.16 [0.11, 0.22]
No. days since last bet                8.6 (8.64), 4                5.91 (6.96), 3               8.93 (8.77), 5                 -0.35 [-0.41, -0.3]
No. betting days                       40.36 (38.75), 27            51.1 (43.5), 38              39.05 (37.93), 26              0.31 [0.26, 0.37]
Mean bets per active day               9.22 (15.7), 5.11            11.89 (19.35), 6.53          8.89 (15.16), 4.97             0.19 [0.14, 0.24]
Mean stake                             32.15 (144.94), 10.4         26.65 (111.65), 9.38         32.82 (148.49), 10.59          -0.04 [-0.1, 0.01]
Mean net outcome per month             -251.88 (2584.18), -36.03    -316.48 (1321.09), -46.34    -243.98 (2698.28), -35         -0.03 [-0.08, 0.03]
Percentage of bets midnight - 6AM      4.32 (9.42), 0.06            4.21 (8.66), 0.45            4.34 (9.51), 0                 -0.01 [-0.07, 0.04]
No. deposits                           60.74 (154.29), 14           85.34 (195.48), 21           57.74 (148.2), 13              0.18 [0.12, 0.23]
Mean no. deposits per active day       1.14 (1.49), 0.81            1.33 (1.85), 0.89            1.12 (1.43), 0.8               0.14 [0.09, 0.19]
Mean deposit amount                    84.55 (348.17), 37.5         72.73 (199.6), 38.78         85.99 (362.15), 37.5           -0.04 [-0.09, 0.02]
Percentage of deposits declined        6.18 (12.35), 0              7.44 (14.88), 0              6.03 (12), 0                   0.11 [0.06, 0.17]
Percentage of withdrawals cancelled    2.96 (11.33), 0              3.75 (13.02), 0              2.87 (11.11), 0                0.08 [0.02, 0.13]
Ever used a CPT                                                                                                                 1.25 [1.06, 1.47]
  No                                   83.17%                       80.23%                       83.53%
  Yes                                  16.83%                       19.77%                       16.47%
Active deposit limit                                                                                                            1.12 [0.91, 1.38]
  No                                   90.35%                       89.42%                       90.46%
  Yes                                  9.65%                        10.58%                       9.54%
Flagged as 'at-risk' ≥ 1 times                                                                                                  1.34 [1.2, 1.5]
  No                                   95.97%                       94.86%                       96.11%
  Yes                                  4.03%                        5.14%                        3.89%
¹ Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
² Continuous variables: Cohen's d [95% Confidence Intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with three levels (i.e., gender): Cramer's V [95% CIs]
PGSI
Also check the PGSI scores of the survey sample for inclusion in the table later on. For this, we will remove anyone who didn’t pass the attention checks, to get the most reliable estimate:
Show code
survey_design_participants_filtered <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted,
                STATUS == "Participants" &
                  (ATTENTION_CHECK_1 == "PASSED" | is.na(ATTENTION_CHECK_1)) &
                  (ATTENTION_CHECK_2 == "PASSED" | is.na(ATTENTION_CHECK_2))),
  weights = ~ELIGIBLE_SURVEY_SAMPLE_WEIGHTS)

# Weighted values:
mean_ps_PGSI <- svymean(~PGSI_TOTAL, survey_design_participants_filtered, na.rm = TRUE)
sd_ps_PGSI <- sqrt(svyvar(~PGSI_TOTAL, survey_design_participants_filtered, na.rm = TRUE))
median_ps_PGSI <- svyquantile(~PGSI_TOTAL, survey_design_participants_filtered, c(0.5), na.rm = TRUE)

# Extract desired values
mean_ps_PGSI_value <- coef(mean_ps_PGSI)
sd_ps_PGSI_value <- sd_ps_PGSI[1]
median_ps_PGSI_value <- coef(median_ps_PGSI)

# Create sentence summarising results:
summary_sentence_PGSI_study2 <- sprintf("The mean PGSI total score is %.2f (standard deviation = %.2f) and the median is %.2f.",
                                        mean_ps_PGSI_value, sd_ps_PGSI_value, median_ps_PGSI_value)
summary_sentence_PGSI_study2
[1] "The mean PGSI total score is 3.76 (standard deviation = 4.19) and the median is 2.00."
Non-Weighted comparisons
For transparency, we have also computed non-weighted comparisons between groups using the same tbl_summary() approach used for Study 1 above:
Show code
master_dataset_study2_restricted %>%
  tbl_summary(
    include = c(
      # Demographic details
      AGE, GENDER, IRSAD_SCORE, DAYS_REGISTERED,
      # Wagering
      DAYS_SINCE_LAST_BET, PAS_6M_BET_FREQ_DAYS, PAS_6M_BET_INTENSITY, PAS_6M_BET_MEAN_STAKE,
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME, PAS_6M_PERCENT_UNSOCIABLE_BETS,
      # Transactions
      PAS_6M_TOTAL_DEPOSIT_FREQ, PAS_6M_DEPOSIT_INTENSITY, PAS_6M_MEAN_DEPOSIT_AMOUNT,
      PAS_6M_PERCENT_DECLINED_DEPOSITS, PAS_6M_PERCENT_CANCEL_WITHDRAWS,
      # Use of consumer protection tools
      EVER_USED_ANY_TOOL, ACTIVE_DEPOSIT_LIMIT, RE_PAS_6M_ANY_TRIGGER),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous2() ~ c("{median} [{p25} - {p75}]", "{mean} ({sd})")),
    by = STATUS,
    label = c(
      AGE ~ "Age",
      GENDER ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      DAYS_REGISTERED ~ "No. of days since accounts registered",
      DAYS_SINCE_LAST_BET ~ "No. days since last bet",
      PAS_6M_BET_FREQ_DAYS ~ "No. betting days",
      PAS_6M_BET_INTENSITY ~ "Mean bets per active day",
      PAS_6M_BET_MEAN_STAKE ~ "Mean stake",
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~ "Mean net outcome per month",
      PAS_6M_PERCENT_UNSOCIABLE_BETS ~ "Percentage of bets placed between midnight & 6AM",
      PAS_6M_TOTAL_DEPOSIT_FREQ ~ "No. deposits",
      PAS_6M_DEPOSIT_INTENSITY ~ "Mean no. deposits per active betting day",
      PAS_6M_MEAN_DEPOSIT_AMOUNT ~ "Mean deposit amount",
      PAS_6M_PERCENT_DECLINED_DEPOSITS ~ "Percentage of deposits declined",
      PAS_6M_PERCENT_CANCEL_WITHDRAWS ~ "Percentage of withdrawals cancelled",
      EVER_USED_ANY_TOOL ~ "Ever used a SG",
      ACTIVE_DEPOSIT_LIMIT ~ "Active deposit limit",
      RE_PAS_6M_ANY_TRIGGER ~ "Flagged as 'at-risk' ≥ 1 times")) %>%
  add_difference(list(all_continuous() ~ "cohens_d", # For continuous variables
                      all_categorical() ~ "smd")) %>% # For categorical variables: standardised mean difference
  modify_header(label = "**Variable**") %>%
  modify_caption("**Table 3. Comparison of participants and non-participants: Demographic and gambling characteristics**") %>%
  bold_labels() %>%
  as_gt() %>%
  tab_options(data_row.padding = px(1.5))
Table 3. Comparison of participants and non-participants: Demographic and gambling characteristics
Variable                                            Participants, N = 1,514¹   Non-participants, N = 11,285¹   Difference²   95% CI²,³
Age                                                                                                             0.21          0.16, 0.27
  Median [IQR]                                      41 [30 - 54]                37 [27 - 50]
  Mean (SD)                                         43 (15)                     39 (15)
Gender                                                                                                          0.07          0.02, 0.12
  Female                                            158 (10%)                   1,363 (12%)
  Male                                              1,327 (88%)                 9,774 (87%)
  Unknown                                           29 (1.9%)                   148 (1.3%)
IRSAD score                                                                                                     0.00          -0.06, 0.05
  Median [IQR]                                      1,000 [953 - 1,060]         998 [953 - 1,059]
  Mean (SD)                                         1,005 (75)                  1,005 (76)
  Unknown                                           1                           94
No. of days since accounts registered                                                                           0.15          0.09, 0.20
  Median [IQR]                                      1,250 [545 - 2,205]         1,135 [479 - 1,969]
  Mean (SD)                                         1,429 (998)                 1,291 (942)
No. days since last bet                                                                                         -0.33         -0.39, -0.28
  Median [IQR]                                      2 [1 - 6]                   4 [2 - 12]
  Mean (SD)                                         5 (7)                       8 (8)
No. betting days                                                                                                0.36          0.31, 0.41
  Median [IQR]                                      54 [24 - 101]               36 [15 - 74]
  Mean (SD)                                         66 (49)                     50 (44)
Mean bets per active day                                                                                        0.21          0.15, 0.26
  Median [IQR]                                      9 [4 - 19]                  6 [3 - 13]
  Mean (SD)                                         16 (25)                     12 (22)
Mean stake                                                                                                      -0.06         -0.11, 0.00
  Median [IQR]                                      17 [7 - 54]                 17 [7 - 56]
  Mean (SD)                                         75 (277)                    96 (381)
Mean net outcome per month                                                                                      -0.02         -0.07, 0.04
  Median [IQR]                                      -149 [-1,230 - -12]         -73 [-631 - -6]
  Mean (SD)                                         -1,111 (3,185)              -989 (7,240)
Percentage of bets placed between midnight & 6AM                                                                0.01          -0.04, 0.07
  Median [IQR]                                      1 [0 - 6]                   1 [0 - 6]
  Mean (SD)                                         5 (9)                       5 (10)
No. deposits                                                                                                    0.28          0.23, 0.33
  Median [IQR]                                      57 [10 - 241]               27 [6 - 119]
  Mean (SD)                                         206 (392)                   123 (280)
Mean no. deposits per active betting day                                                                        0.24          0.18, 0.29
  Median [IQR]                                      1.36 [0.50 - 2.91]          1.04 [0.40 - 2.12]
  Mean (SD)                                         2.31 (3.07)                 1.73 (2.35)
Mean deposit amount                                                                                             -0.06         -0.11, 0.00
  Median [IQR]                                      53 [25 - 134]               51 [22 - 150]
  Mean (SD)                                         155 (493)                   206 (957)
Percentage of deposits declined                                                                                 0.09          0.04, 0.14
  Median [IQR]                                      2 [0 - 9]                   1 [0 - 9]
  Mean (SD)                                         8 (14)                      7 (12)
Percentage of withdrawals cancelled                                                                             0.10          0.04, 0.15
  Median [IQR]                                      0 [0 - 4]                   0 [0 - 0]
  Mean (SD)                                         9 (19)                      7 (18)
Ever used a SG                                      378 (25%)                   2,341 (21%)                     0.10          0.05, 0.15
Active deposit limit                                161 (11%)                   1,167 (10%)                     0.01          -0.04, 0.06
Flagged as 'at-risk' ≥ 1 times                      560 (37%)                   3,443 (31%)                     0.14          0.08, 0.19
¹ n (%)
² Cohen's d; Standardized Mean Difference
³ CI = Confidence Interval
Sensitivity check comparison
Again, let’s see how this comparison differs if we define Participants by full survey completion:
Show code
# Check the coding of the variable that tells us whether somebody fully completed the survey or not
master_dataset_study2_restricted %>%
  count(STATUS, SURVEY_STATUS_FULL)
# A tibble: 4 × 3
STATUS SURVEY_STATUS_FULL n
<fct> <chr> <int>
1 Participants Complete 1115
2 Participants Incomplete 399
3 Non-participants Incomplete 158
4 Non-participants Not started 11127
Show code
master_dataset_study2_sensitivity_check <- master_dataset_study2_restricted %>%
  mutate(SURVEY_STATUS_FULL_BINARY = case_when(
    SURVEY_STATUS_FULL == "Incomplete" ~ "Non-participants",
    SURVEY_STATUS_FULL == "Not started" ~ "Non-participants",
    TRUE ~ "Participants")) %>%
  mutate(SURVEY_STATUS_FULL_BINARY = fct_relevel(SURVEY_STATUS_FULL_BINARY,
                                                 "Participants", "Non-participants")) # fix order of presentation

# Check recoding has worked:
master_dataset_study2_sensitivity_check %>%
  count(STATUS, SURVEY_STATUS_FULL_BINARY)
# A tibble: 3 × 3
STATUS SURVEY_STATUS_FULL_BINARY n
<fct> <fct> <int>
1 Participants Participants 1115
2 Participants Non-participants 399
3 Non-participants Non-participants 11285
Show code
# Update survey designs:
survey_design_all <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_sensitivity_check,
  weights = ~ALL_SAMPLE_WEIGHTS)

# Create survey design objects for each STATUS group:
survey_design_non_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS)

survey_design_participants <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_sensitivity_check, SURVEY_STATUS_FULL_BINARY == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS)

# Get sample sizes for each group
participants_n <- sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY == "Participants")
non_participants_n <- sum(master_dataset_study2_sensitivity_check$SURVEY_STATUS_FULL_BINARY == "Non-participants")
overall_n <- nrow(master_dataset_study2_sensitivity_check)

# Produce updated table
# [Again, I know you can do all of the below using tbl_svysummary(), but it doesn't easily compute the right effect sizes I want]

# To streamline this substantially, let's re-use the function that calculates a weighted summary for continuous variables:
weighted_summary_continuous <- function(variable, design_all, design_non_participants, design_participants,
                                        participants_n, non_participants_n, overall_n) {
  var_formula <- as.formula(paste0("~", variable))

  # Calculate means and standard deviations
  mean_all <- svymean(var_formula, design_all, na.rm = TRUE)
  mean_non_ps <- svymean(var_formula, design_non_participants, na.rm = TRUE)
  mean_ps <- svymean(var_formula, design_participants, na.rm = TRUE)
  sd_all <- sqrt(svyvar(var_formula, design_all, na.rm = TRUE))
  sd_non_ps <- sqrt(svyvar(var_formula, design_non_participants, na.rm = TRUE))
  sd_ps <- sqrt(svyvar(var_formula, design_participants, na.rm = TRUE))

  # Calculate weighted medians
  median_all <- svyquantile(var_formula, design_all, 0.5, na.rm = TRUE)
  median_non_ps <- svyquantile(var_formula, design_non_participants, 0.5, na.rm = TRUE)
  median_ps <- svyquantile(var_formula, design_participants, 0.5, na.rm = TRUE)

  # Format for table
  summary_df <- tibble(
    Variable = variable,
    `Overall` = paste0(round(coef(mean_all), 2), " (", round(sd_all, 2), "), ", round(coef(median_all), 2)),
    `Participants` = paste0(round(coef(mean_ps), 2), " (", round(sd_ps, 2), "), ", round(coef(median_ps), 2)),
    `Non-participants` = paste0(round(coef(mean_non_ps), 2), " (", round(sd_non_ps, 2), "), ", round(coef(median_non_ps), 2))) %>%
    # Compute effect size
    mutate(
      pooled_sd = sqrt(((participants_n - 1) * sd_ps^2 + (non_participants_n - 1) * sd_non_ps^2) /
                         (participants_n + non_participants_n - 2)),
      cohen_d = (coef(mean_ps) - coef(mean_non_ps)) / pooled_sd,
      SE = sqrt((participants_n + non_participants_n) / (participants_n * non_participants_n) +
                  (cohen_d^2 / (2 * (participants_n + non_participants_n)))),
      CI_lower = cohen_d - 1.96 * SE,
      CI_upper = cohen_d + 1.96 * SE,
      Effect_size = paste0(round(cohen_d, 2), " [", round(CI_lower, 2), ", ", round(CI_upper, 2), "]")) %>%
    select(Variable, Overall, Participants, `Non-participants`, Effect_size)

  return(summary_df)
}

# Apply function to continuous variables
continuous_vars <- c("AGE", "IRSAD_SCORE", "DAYS_REGISTERED", "DAYS_SINCE_LAST_BET", "PAS_6M_BET_FREQ_DAYS",
                     "PAS_6M_BET_INTENSITY", "PAS_6M_BET_MEAN_STAKE", "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME",
                     "PAS_6M_PERCENT_UNSOCIABLE_BETS", "PAS_6M_TOTAL_DEPOSIT_FREQ", "PAS_6M_DEPOSIT_INTENSITY",
                     "PAS_6M_MEAN_DEPOSIT_AMOUNT", "PAS_6M_PERCENT_DECLINED_DEPOSITS", "PAS_6M_PERCENT_CANCEL_WITHDRAWS")

all_continuous_summaries <- map_dfr(continuous_vars,
                                    ~weighted_summary_continuous(.x, survey_design_all, survey_design_non_participants,
                                                                 survey_design_participants, participants_n,
                                                                 non_participants_n, overall_n))

# Categorical variables:
# Re-use the function that calculates and formats percentages for categorical variables.
# Note: this function computes Cramér's V for variables with more than two levels, and odds ratios for two-level variables (based on a reviewer suggestion):
categorical_summary <- function(variable, design_all, design_non_participants, design_participants) {
  var_formula <- as.formula(paste0("~", variable))

  # Calculate proportions for overall sample
  prop_table_all <- svytable(var_formula, design_all)
  props_all <- prop.table(prop_table_all) %>%
    as.data.frame() %>%
    mutate(Overall = Freq * 100) %>%
    select(Variable = !!sym(variable), Overall)

  # Calculate proportions for non-participants
  prop_table_non_ps <- svytable(var_formula, design_non_participants)
  props_non_ps <- prop.table(prop_table_non_ps) %>%
    as.data.frame() %>%
    mutate('Non-participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Non-participants')

  # Calculate proportions for participants
  prop_table_ps <- svytable(var_formula, design_participants)
  props_ps <- prop.table(prop_table_ps) %>%
    as.data.frame() %>%
    mutate('Participants' = Freq * 100) %>%
    select(Variable = !!sym(variable), 'Participants')

  # Combine results and format
  summary_df <- props_ps %>%
    left_join(props_non_ps, by = "Variable") %>%
    left_join(props_all, by = "Variable") %>%
    mutate(across(where(is.numeric), ~paste0(round(.x, 2), "%"))) %>%
    replace(is.na(.), "")

  # Determine if variable is binary (i.e., only two levels)
  unique_values <- length(unique(as.numeric(na.omit(design_all$variables[[variable]]))))

  if (unique_values == 2 & variable != "GENDER") {
    # Ensure STATUS is treated as a factor with "Non-participants" as the reference
    design_all$variables$STATUS <- relevel(as.factor(design_all$variables$STATUS), ref = "Non-participants")

    # Compute odds ratio (OR) using survey-weighted logistic regression
    logistic_model <- tryCatch(
      svyglm(as.formula(paste0(variable, " ~ STATUS")), design = design_all, family = quasibinomial()),
      error = function(e) NULL)

    if (!is.null(logistic_model)) {
      coef_name <- grep("STATUS", names(coef(logistic_model)), value = TRUE)
      if (length(coef_name) > 0) {
        OR <- exp(coef(logistic_model)[coef_name])
        CI <- exp(confint(logistic_model)[coef_name, ])
        effect_size_text <- paste0(round(OR, 2), " [", round(CI[1], 2), ", ", round(CI[2], 2), "]")
      } else {
        effect_size_text <- "NA"
      }
    } else {
      effect_size_text <- "NA"
    }
  } else {
    # Compute Cramér's V for categorical variables with more than two levels
    combined_table <- as.table(rbind(prop_table_non_ps, prop_table_ps))
    V <- cramersV(combined_table)
    V_rounded <- as.matrix(V$output) %>%
      t() %>%
      as.data.frame() %>%
      mutate(cramersV = as.numeric(cramersV)) %>%
      mutate(cramersV = round(cramersV, 2))
    CI <- confIntV(combined_table)
    CI_info <- CI$intermediate
    CIs <- CI_info$fisherZ.ci %>%
      t() %>%
      as.data.frame() %>%
      mutate_if(is.numeric, round, 2) %>%
      unite('Effect_size', 1, 2, sep = ", ")
    effect_size_text <- paste0(V_rounded, " [", CIs, "]")
  }

  # Add effect size to the table
  header_row <- data.frame(Variable = variable, Overall = "", Participants = "",
                           `Non-participants` = "", `Effect_size` = effect_size_text)
  summary_df <- bind_rows(header_row, summary_df) %>%
    as_tibble() %>%
    select(Variable, Overall, Participants, `Non-participants`, `Effect_size`)

  return(summary_df)
}

# Apply function to categorical variables
categorical_vars <- c("GENDER", "EVER_USED_ANY_TOOL", "ACTIVE_DEPOSIT_LIMIT", "RE_PAS_6M_ANY_TRIGGER")

all_categorical_summaries <- map_dfr(categorical_vars,
                                     ~categorical_summary(.x, survey_design_all, survey_design_non_participants,
                                                          survey_design_participants))

all_categorical_summaries <- all_categorical_summaries %>%
  # Recode for consistency across summaries:
  mutate(Variable = case_when(Variable == 1 ~ "Yes",
                              Variable == 0 ~ "No",
                              Variable == TRUE ~ "Yes",
                              Variable == FALSE ~ "No",
                              TRUE ~ Variable))

# Combine continuous and categorical summaries
weighted_comparison_summary_data <- bind_rows(all_continuous_summaries, all_categorical_summaries) %>%
  rename_with(~paste0("Participants (N = ", participants_n, ")"), .cols = "Participants") %>%
  rename_with(~paste0("Non-participants (N = ", non_participants_n, ")"), .cols = "Non-participants") %>%
  rename_with(~paste0("Overall (N = ", overall_n, ")"), .cols = "Overall")

# Order data for table (as we have duplicate names in the Variable column, this is a pain in the butt that needs to be done by separating and rejoining the variables in the right order)
gender_summary_tabled <- weighted_comparison_summary_data[15:18, 1:5]
age_summary_tabled <- weighted_comparison_summary_data[1, 1:5]
summary_table_extracted_vars_missing <- weighted_comparison_summary_data %>%
  filter(Variable != "AGE", Variable != "GENDER", Variable != "Male", Variable != "Female", Variable != "Unknown")

# Join everything back together and provide some nice names for our variables:
weighted_comparison_summary_data_ordered <- bind_rows(age_summary_tabled, gender_summary_tabled,
                                                      summary_table_extracted_vars_missing) %>%
  mutate(Variable = dplyr::recode(Variable,
                                  `AGE` = "Age", "GENDER" = "Gender", "IRSAD_SCORE" = "IRSAD score",
                                  "DAYS_REGISTERED" = "Days since registered",
                                  "DAYS_SINCE_LAST_BET" = "No. days since last bet",
                                  "PAS_6M_BET_FREQ_DAYS" = "No. betting days",
                                  "PAS_6M_BET_INTENSITY" = "Mean bets per active day",
                                  "PAS_6M_BET_MEAN_STAKE" = "Mean stake",
                                  "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" = "Mean net outcome per month",
                                  "PAS_6M_PERCENT_UNSOCIABLE_BETS" = "Percentage of bets midnight - 6AM",
                                  "PAS_6M_TOTAL_DEPOSIT_FREQ" = "No. deposits",
                                  "PAS_6M_DEPOSIT_INTENSITY" = "Mean no. deposits per active day",
                                  "PAS_6M_MEAN_DEPOSIT_AMOUNT" = "Mean deposit amount",
                                  "PAS_6M_PERCENT_DECLINED_DEPOSITS" = "Percentage of deposits declined",
                                  "PAS_6M_PERCENT_CANCEL_WITHDRAWS" = "Percentage of withdrawals cancelled",
                                  "EVER_USED_ANY_TOOL" = "Ever used a CPT",
                                  "ACTIVE_DEPOSIT_LIMIT" = "Active deposit limit",
                                  "RE_PAS_6M_ANY_TRIGGER" = "Flagged as 'at-risk' ≥ 1 times"))

# Create final table:
weighted_comparison_summary_data_ordered %>%
  rename("Effect size" = Effect_size) %>%
  gt() %>%
  tab_header(title = "Table 4.1 Study 2 (past 30-day bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]") %>%
  tab_options(data_row.padding = px(1.5)) %>%
  cols_align(align = "right", columns = c(Variable)) %>%
  cols_align(align = "center", columns = c(2:4)) %>%
  tab_footnote(footnote = "Continuous variables: Mean (standard deviation), median; Categorical variables: % of group",
               locations = cells_column_labels(columns = c(2, 3))) %>%
  tab_footnote(footnote = "Continuous variables: Cohen's d [95% Confidence Intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with three levels (i.e., gender): Cramer's V [95% CIs]",
               locations = cells_column_labels(columns = vars(`Effect size`)))
Table 4.1 Study 2 (past 30-day bettors); SENSITIVITY ANALYSIS - Comparison of participants and non-participants: Demographic and gambling characteristics [Participants defined by full completion of the survey]

| Variable | Overall (N = 12799)¹ | Participants (N = 1115)¹ | Non-participants (N = 11684) | Effect size² |
|---|---|---|---|---|
| Age | 40.62 (15.44), 38 | 43.42 (14.68), 42 | 40.36 (15.48), 38 | 0.2 [0.14, 0.26] |
| Gender | NA | | | 0.03 [0.02, 0.03] |
| Female | 14.18% | 11.46% | 14.43% | NA |
| Male | 84.72% | 86.99% | 84.51% | NA |
| Unknown | 1.1% | 1.55% | 1.05% | NA |
| IRSAD score | 1002.69 (75.08), 994.21 | 1007.51 (75.4), 1000.6 | 1002.24 (75.03), 993.97 | 0.07 [0.01, 0.13] |
| Days since registered | 1259.45 (926.89), 1097 | 1417.35 (964.27), 1225 | 1244.9 (922.05), 1072 | 0.19 [0.12, 0.25] |
| No. days since last bet | 8.6 (8.64), 4 | 5.84 (6.95), 3 | 8.85 (8.74), 5 | -0.35 [-0.41, -0.29] |
| No. betting days | 40.36 (38.75), 27 | 50.35 (42.96), 39 | 39.44 (38.21), 27 | 0.28 [0.22, 0.34] |
| Mean bets per active day | 9.22 (15.7), 5.11 | 11.05 (18.38), 6.41 | 9.05 (15.42), 5 | 0.13 [0.07, 0.19] |
| Mean stake | 32.15 (144.94), 10.4 | 25.11 (91.67), 9.02 | 32.8 (148.88), 10.58 | -0.05 [-0.11, 0.01] |
| Mean net outcome per month | -251.88 (2584.18), -36.03 | -267.81 (1199.21), -41.62 | -250.41 (2676.01), -35.5 | -0.01 [-0.07, 0.05] |
| Percentage of bets midnight - 6AM | 4.32 (9.42), 0.06 | 4.26 (8.81), 0.51 | 4.33 (9.48), 0 | -0.01 [-0.07, 0.05] |
| No. deposits | 60.74 (154.29), 14 | 77.52 (173.67), 19 | 59.2 (152.3), 14 | 0.12 [0.06, 0.18] |
| Mean no. deposits per active day | 1.14 (1.49), 0.81 | 1.26 (1.8), 0.81 | 1.13 (1.45), 0.81 | 0.08 [0.02, 0.15] |
| Mean deposit amount | 84.55 (348.17), 37.5 | 72.06 (204.19), 38 | 85.7 (358.52), 37.5 | -0.04 [-0.1, 0.02] |
| Percentage of deposits declined | 6.18 (12.35), 0 | 7.39 (14.83), 0 | 6.07 (12.09), 0 | 0.11 [0.05, 0.17] |
| Percentage of withdrawals cancelled | 2.96 (11.33), 0 | 3.61 (13.22), 0 | 2.9 (11.14), 0 | 0.06 [0, 0.12] |
| Ever used a CPT | NA | | | 1.25 [1.06, 1.47] |
| No | 83.17% | 81.2% | 83.35% | NA |
| Yes | 16.83% | 18.8% | 16.65% | NA |
| Active deposit limit | NA | | | 1.12 [0.91, 1.38] |
| No | 90.35% | 90.33% | 90.35% | NA |
| Yes | 9.65% | 9.67% | 9.65% | NA |
| Flagged as 'at-risk' ≥ 1 times | NA | | | 1.34 [1.2, 1.5] |
| No | 95.97% | 95.61% | 96.01% | NA |
| Yes | 4.03% | 4.39% | 3.99% | NA |

¹ Continuous variables: Mean (standard deviation), median; Categorical variables: % of group
² Continuous variables: Cohen's d [95% confidence intervals (CIs)]; Categorical variables with two levels: Odds ratios [95% CIs]; Categorical variables with three levels (i.e., gender): Cramer's V [95% CIs]
Again, the effect sizes appear to be smaller in these comparisons. Let’s check whether this is because the group who started the survey but didn’t complete it has now been merged with non-responders, making the latter group more similar to responders, or because those who fully completed the survey genuinely are more similar to those who didn’t start it:
Show code
master_dataset_study2_restricted %>%
  tbl_summary(
    include = c(
      # Demographic details
      AGE, GENDER, IRSAD_SCORE, DAYS_REGISTERED,
      # Wagering
      DAYS_SINCE_LAST_BET, PAS_6M_BET_FREQ_DAYS, PAS_6M_BET_INTENSITY, PAS_6M_BET_MEAN_STAKE,
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME, PAS_6M_PERCENT_UNSOCIABLE_BETS,
      # Transactions
      PAS_6M_TOTAL_DEPOSIT_FREQ, PAS_6M_DEPOSIT_INTENSITY, PAS_6M_MEAN_DEPOSIT_AMOUNT,
      PAS_6M_PERCENT_DECLINED_DEPOSITS, PAS_6M_PERCENT_CANCEL_WITHDRAWS,
      # Use of consumer protection tools
      EVER_USED_ANY_TOOL, ACTIVE_DEPOSIT_LIMIT, RE_PAS_6M_ANY_TRIGGER),
    type = all_continuous() ~ "continuous2",
    statistic = list(all_continuous2() ~ c("{median} [{p25} - {p75}]", "{mean} ({sd})")),
    by = SURVEY_STATUS_FULL,
    label = c(
      AGE ~ "Age",
      GENDER ~ "Gender",
      IRSAD_SCORE ~ "IRSAD score",
      DAYS_REGISTERED ~ "No. of days since accounts registered",
      DAYS_SINCE_LAST_BET ~ "No. days since last bet",
      PAS_6M_BET_FREQ_DAYS ~ "No. betting days",
      PAS_6M_BET_INTENSITY ~ "Mean bets per active day",
      PAS_6M_BET_MEAN_STAKE ~ "Mean stake",
      PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME ~ "Mean net outcome per month",
      PAS_6M_PERCENT_UNSOCIABLE_BETS ~ "Percentage of bets placed between midnight & 6AM",
      PAS_6M_TOTAL_DEPOSIT_FREQ ~ "No. deposits",
      PAS_6M_DEPOSIT_INTENSITY ~ "Mean no. deposits per active betting day",
      PAS_6M_MEAN_DEPOSIT_AMOUNT ~ "Mean deposit amount",
      PAS_6M_PERCENT_DECLINED_DEPOSITS ~ "Percentage of deposits declined",
      PAS_6M_PERCENT_CANCEL_WITHDRAWS ~ "Percentage of withdrawals cancelled",
      EVER_USED_ANY_TOOL ~ "Ever used a SG",
      ACTIVE_DEPOSIT_LIMIT ~ "Active deposit limit",
      RE_PAS_6M_ANY_TRIGGER ~ "Flagged as 'at-risk' ≥ 1 times")
  ) %>%
  modify_header(label = "Variable") %>%
  modify_caption("Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn't complete it, and those who didn't start the survey") %>%
  bold_labels()
Study 2 SENSITIVITY ANALYSIS - Comparing those who fully completed the survey, those who started but didn’t complete it, and those who didn’t start the survey

| Variable | Complete, N = 1,115¹ | Incomplete, N = 557¹ | Not started, N = 11,127¹ |
|---|---|---|---|
| Age: Median [IQR] | 40 [30 - 52] | 42 [29 - 56] | 37 [27 - 50] |
| Age: Mean (SD) | 42 (14) | 44 (16) | 39 (15) |
| Gender: Female | 103 (9.2%) | 72 (13%) | 1,346 (12%) |
| Gender: Male | 992 (89%) | 473 (85%) | 9,636 (87%) |
| Gender: Unknown | 20 (1.8%) | 12 (2.2%) | 145 (1.3%) |
| IRSAD score: Median [IQR] | 1,002 [955 - 1,065] | 990 [949 - 1,055] | 998 [953 - 1,059] |
| IRSAD score: Mean (SD) | 1,008 (75) | 1,000 (74) | 1,005 (76) |
| IRSAD score: Unknown | 1 | 1 | 93 |
| No. of days since accounts registered: Median [IQR] | 1,282 [582 - 2,217] | 1,165 [463 - 2,115] | 1,135 [480 - 1,970] |
| No. of days since accounts registered: Mean (SD) | 1,445 (986) | 1,362 (1,023) | 1,291 (941) |
| No. days since last bet: Median [IQR] | 3 [1 - 6] | 2 [1 - 5] | 4 [2 - 13] |
| No. days since last bet: Mean (SD) | 5 (7) | 5 (6) | 8 (8) |
| No. betting days: Median [IQR] | 52 [23 - 97] | 67 [28 - 113] | 36 [15 - 74] |
| No. betting days: Mean (SD) | 63 (47) | 73 (50) | 50 (44) |
| Mean bets per active day: Median [IQR] | 8 [4 - 17] | 11 [4 - 22] | 6 [3 - 13] |
| Mean bets per active day: Mean (SD) | 15 (23) | 18 (27) | 12 (22) |
| Mean stake: Median [IQR] | 15 [6 - 48] | 23 [9 - 71] | 16 [7 - 55] |
| Mean stake: Mean (SD) | 67 (222) | 99 (351) | 96 (383) |
| Mean net outcome per month: Median [IQR] | -118 [-1,060 - -9] | -366 [-1,671 - -28] | -72 [-613 - -6] |
| Mean net outcome per month: Mean (SD) | -953 (3,008) | -1,504 (3,356) | -983 (7,284) |
| Percentage of bets placed between midnight & 6AM: Median [IQR] | 1 [0 - 6] | 1 [0 - 5] | 1 [0 - 6] |
| Percentage of bets placed between midnight & 6AM: Mean (SD) | 5 (9) | 5 (8) | 5 (10) |
| No. deposits: Median [IQR] | 46 [9 - 205] | 92 [21 - 316] | 27 [6 - 117] |
| No. deposits: Mean (SD) | 184 (352) | 270 (539) | 121 (270) |
| Mean no. deposits per active betting day: Median [IQR] | 1.26 [0.44 - 2.63] | 1.73 [0.83 - 3.53] | 1.03 [0.40 - 2.10] |
| Mean no. deposits per active betting day: Mean (SD) | 2.17 (2.98) | 2.75 (3.73) | 1.71 (2.29) |
| Mean deposit amount: Median [IQR] | 50 [24 - 128] | 73 [31 - 164] | 50 [22 - 150] |
| Mean deposit amount: Mean (SD) | 148 (517) | 179 (393) | 206 (963) |
| Percentage of deposits declined: Median [IQR] | 2 [0 - 8] | 4 [0 - 9] | 1 [0 - 9] |
| Percentage of deposits declined: Mean (SD) | 8 (14) | 8 (14) | 7 (12) |
| Percentage of withdrawals cancelled: Median [IQR] | 0 [0 - 2] | 0 [0 - 13] | 0 [0 - 0] |
| Percentage of withdrawals cancelled: Mean (SD) | 8 (19) | 12 (22) | 7 (18) |
| Ever used a SG | 267 (24%) | 159 (29%) | 2,293 (21%) |
| Active deposit limit | 112 (10%) | 69 (12%) | 1,147 (10%) |
| Flagged as 'at-risk' ≥ 1 times | 370 (33%) | 266 (48%) | 3,367 (30%) |

¹ n (%)
As with Study 1, it seems that those who started the survey but didn’t complete it were the more involved/intense bettors.
Composite risk scores
Now let’s produce a single composite score that represents people’s gambling behaviour over the past six months.
One thing we need to consider here is that some of the variables we’ve created could be biased by recent sign-up to the site. For example, one of the variables we created to summarise past-six-month behaviour is the total number of days somebody gambled, while another is the total amount somebody deposited. Both of these totals will be deflated for accounts that were registered with the company more recently. By contrast, variables that are averages, or daily or monthly summaries, will not be affected by this; the sketch below illustrates the distinction.
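To make this concrete, here is a minimal illustration using made-up numbers (not our data): two hypothetical bettors with identical underlying behaviour, one registered for the full six-month window and one for only 30 days of it.

# Hypothetical illustration only: two bettors who bet on 10 days per month,
# one with an account that existed for the whole window (~183 days),
# one registered for just 30 days of it.
days_in_window <- c(183, 30)        # Days each account existed within the six-month window
monthly_betting_days <- c(10, 10)   # Identical underlying behaviour (betting days per month)

total_betting_days <- monthly_betting_days * (days_in_window / 30.4)
total_betting_days      # ~60 vs ~10: the total is deflated by a recent sign-up date
monthly_betting_days    # 10 vs 10: the monthly rate is unaffected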
Let’s first identify all the variables we engineered that summarise past six month behaviour, then classify them based on whether they will be subject to this bias or not:
Show code
PAS_6M_variables <- master_dataset_study2_restricted %>%
  select(contains("PAS_6M"),   # Identify all relevant past-six-month aggregate variables
         -contains("RE_"),     # Remove risk identification system variables
         -contains("INCOME")   # Remove variables relating to percentage of income, which are only available for survey participants
         ) %>%
  colnames() %>%
  as.data.frame() %>%
  rename(Variables = 1) %>%
  mutate(`Biased by sign-up date` = case_when(
    Variables == "PAS_6M_BET_FREQ_DAYS" ~ "Yes",
    Variables == "PAS_6M_BET_FREQ_BETS" ~ "Yes",
    Variables == "PAS_6M_BET_MEAN_STAKE" ~ "No",
    Variables == "PAS_6M_BET_SD_STAKE" ~ "No",
    Variables == "PAS_6M_BET_SPEND" ~ "Yes",
    Variables == "PAS_6M_MEAN_ODDS" ~ "Yes",
    Variables == "PAS_6M_N_ACTIVITIES" ~ "No",
    Variables == "PAS_6M_BET_INTENSITY" ~ "No",
    Variables == "PAS_6M_BET_WIN" ~ "Yes",
    Variables == "PAS_6M_NET_OUTCOME" ~ "Yes",
    Variables == "PAS_6M_MAX_DAY_STAKE" ~ "No",
    Variables == "PAS_6M_MAX_DAY_BETS" ~ "No",
    Variables == "PAS_6M_MEAN_DAILY_STAKE" ~ "No",
    Variables == "PAS_6M_SD_DAILY_STAKE" ~ "No",
    Variables == "PAS_6M_TOTAL_UNSOCIABLE_BETS" ~ "Yes",
    Variables == "PAS_6M_PERCENT_UNSOCIABLE_BETS" ~ "No",
    Variables == "PAS_6M_MONTHLY_BET_FREQ_DAYS" ~ "No",
    Variables == "PAS_6M_MONTHLY_SPEND" ~ "No",
    Variables == "PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME" ~ "No",
    Variables == "PAS_6M_TOTAL_DEPOSIT_AMOUNT" ~ "Yes",
    Variables == "PAS_6M_MEAN_DEPOSIT_AMOUNT" ~ "No",
    Variables == "PAS_6M_SD_DEPOSIT_AMOUNT" ~ "No",
    Variables == "PAS_6M_TOTAL_DEPOSIT_FREQ" ~ "Yes",
    Variables == "PAS_6M_TOTAL_DECLINED_DEPOSITS" ~ "Yes",
    Variables == "PAS_6M_PERCENT_DECLINED_DEPOSITS" ~ "No",
    Variables == "PAS_6M_TOTAL_DEPOSIT_METHODS" ~ "No",
    Variables == "PAS_6M_TOTAL_WITHDRAW_FREQ" ~ "Yes",
    Variables == "PAS_6M_TOTAL_WITHDRAW_AMOUNT" ~ "Yes",
    Variables == "PAS_6M_TOTAL_CANCEL_WITHDRAWS" ~ "Yes",
    Variables == "PAS_6M_PERCENT_CANCEL_WITHDRAWS" ~ "No",
    Variables == "PAS_6M_TOTAL_WITHDRAW_METHODS" ~ "No",
    Variables == "PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT" ~ "No",
    Variables == "PAS_6M_MONTHLY_DEPOSIT_AMOUNT" ~ "No",
    Variables == "PAS_6M_DEPOSIT_INTENSITY" ~ "No",
    TRUE ~ "Unclassified"))   # Fallback for variables not listed above (originally this fell
                              # through to the variable's own name, which rendered oddly in the table)

# Table variables:
PAS_6M_variables %>%
  gt() %>%
  tab_header("All past-6-month aggregate summary variables",
             subtitle = "With risk of bias status for each variable")
All past-6-month aggregate summary variables
With risk of bias status for each variable

| Variables | Biased by sign-up date |
|---|---|
| PAS_6M_BET_FREQ_DAYS | Yes |
| PAS_6M_BET_FREQ_BETS | Yes |
| PAS_6M_BET_FREQ_INCL_LEGS | Unclassified |
| PAS_6M_BET_MEAN_STAKE | No |
| PAS_6M_BET_SD_STAKE | No |
| PAS_6M_BET_SPEND | Yes |
| PAS_6M_MEAN_ODDS | Yes |
| PAS_6M_N_ACTIVITIES | No |
| PAS_6M_BET_INTENSITY | No |
| PAS_6M_BET_WIN | Yes |
| PAS_6M_NET_OUTCOME | Yes |
| PAS_6M_MAX_DAY_STAKE | No |
| PAS_6M_MAX_DAY_BETS | No |
| PAS_6M_MEAN_DAILY_STAKE | No |
| PAS_6M_SD_DAILY_STAKE | No |
| PAS_6M_TOTAL_UNSOCIABLE_BETS | Yes |
| PAS_6M_PERCENT_UNSOCIABLE_BETS | No |
| PAS_6M_MONTHLY_BET_FREQ_DAYS | No |
| PAS_6M_MONTHLY_SPEND | No |
| PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME | No |
| PAS_6M_TOTAL_DEPOSIT_AMOUNT | Yes |
| PAS_6M_MEAN_DEPOSIT_AMOUNT | No |
| PAS_6M_SD_DEPOSIT_AMOUNT | No |
| PAS_6M_TOTAL_DEPOSIT_FREQ | Yes |
| PAS_6M_TOTAL_DECLINED_DEPOSITS | Yes |
| PAS_6M_PERCENT_DECLINED_DEPOSITS | No |
| PAS_6M_TOTAL_DEPOSIT_METHODS | No |
| PAS_6M_TOTAL_WITHDRAW_FREQ | Yes |
| PAS_6M_TOTAL_WITHDRAW_AMOUNT | Yes |
| PAS_6M_TOTAL_CANCEL_WITHDRAWS | Yes |
| PAS_6M_PERCENT_CANCEL_WITHDRAWS | No |
| PAS_6M_TOTAL_WITHDRAW_METHODS | No |
| PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT | No |
| PAS_6M_MONTHLY_DEPOSIT_AMOUNT | No |
| PAS_6M_MONTHLY_DEPOSIT_FREQ | Unclassified |
| PAS_6M_DEPOSIT_INTENSITY | No |
Okay, now that we have all of the relevant variables, let’s look at how they correlate with PGSI and harm scores using a series of bivariate correlations visualised as a heatmap. These results are unweighted:
Show code
PGSI_cor_data <- master_dataset_study2_restricted %>%
  select(PGSI_TOTAL, GHM_TOTAL,
         PAS_6M_BET_MEAN_STAKE, PAS_6M_BET_SD_STAKE, PAS_6M_N_ACTIVITIES, PAS_6M_BET_INTENSITY,
         PAS_6M_MAX_DAY_STAKE, PAS_6M_MAX_DAY_BETS, PAS_6M_MEAN_DAILY_STAKE, PAS_6M_SD_DAILY_STAKE,
         PAS_6M_PERCENT_UNSOCIABLE_BETS, PAS_6M_MONTHLY_BET_FREQ_DAYS, PAS_6M_MONTHLY_SPEND,
         PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME, PAS_6M_MEAN_DEPOSIT_AMOUNT, PAS_6M_SD_DEPOSIT_AMOUNT,
         PAS_6M_PERCENT_DECLINED_DEPOSITS, PAS_6M_TOTAL_DEPOSIT_METHODS, PAS_6M_PERCENT_CANCEL_WITHDRAWS,
         PAS_6M_TOTAL_WITHDRAW_METHODS, PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT, PAS_6M_MONTHLY_DEPOSIT_AMOUNT,
         PAS_6M_DEPOSIT_INTENSITY)

ggstatsplot::ggcorrmat(PGSI_cor_data,
                       type = "parametric",
                       results.subtitle = FALSE,
                       ggcorrplot.args = c(lab_size = 3)) +
  theme(axis.text.y = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(axis.text.x = element_text(color = "black", size = 9, face = "plain", family = "Poppins")) +
  theme(legend.title = element_text(color = "black", size = 12, face = "plain", family = "Poppins")) +
  theme(legend.title.align = .5) +
  theme(legend.text = element_text(color = "black", size = 8, face = "plain", family = "Poppins"))
Now let’s use all of these variables to create a composite index score of risk/involvement:
Show code
master_dataset_study2_restricted_with_risk_scores <- master_dataset_study2_restricted %>%
  # Invert net outcome values so that higher scores reflect greater losses:
  mutate(PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME = PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME * -1) %>%
  mutate(across(c(PAS_6M_BET_MEAN_STAKE, PAS_6M_BET_SD_STAKE, PAS_6M_N_ACTIVITIES,
                  PAS_6M_BET_INTENSITY, PAS_6M_MAX_DAY_STAKE, PAS_6M_MAX_DAY_BETS,
                  PAS_6M_MEAN_DAILY_STAKE, PAS_6M_SD_DAILY_STAKE, PAS_6M_PERCENT_UNSOCIABLE_BETS,
                  PAS_6M_MONTHLY_BET_FREQ_DAYS, PAS_6M_MONTHLY_SPEND, PAS_6M_MEAN_DEPOSIT_AMOUNT,
                  PAS_6M_SD_DEPOSIT_AMOUNT, PAS_6M_PERCENT_DECLINED_DEPOSITS,
                  PAS_6M_TOTAL_DEPOSIT_METHODS, PAS_6M_PERCENT_CANCEL_WITHDRAWS,
                  PAS_6M_TOTAL_WITHDRAW_METHODS, PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT,
                  PAS_6M_MONTHLY_DEPOSIT_AMOUNT, PAS_6M_DEPOSIT_INTENSITY),
                scale,                    # The scale function computes z-scores
                .names = "z_{.col}")) %>% # Create new named columns for easy selection below
  rowwise() %>%                           # Group by rows
  mutate(`Composite Risk Score` = sum(c_across(starts_with("z_")), na.rm = TRUE)) %>% # Sum across all newly created z-scores
  ungroup()

# colnames(master_dataset_study2_restricted_with_risk_scores)
Correlations
Now let’s look at how well this composite score correlates with PGSI scores in the survey sample:
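The code chunk for this unweighted test isn’t displayed in the rendered output; the result below can be reproduced with a call along these lines (assuming the dataset and variable names used throughout this script):

# Unweighted Spearman correlation between PGSI scores and the composite risk score
# (cor.test drops incomplete pairs, i.e., non-participants without PGSI scores):
cor.test(master_dataset_study2_restricted_with_risk_scores$PGSI_TOTAL,
         master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`,
         method = "spearman")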
Spearman's rank correlation rho
data: master_dataset_study2_restricted_with_risk_scores$PGSI_TOTAL and master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`
S = 389269642, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.3269856
Perform weighted correlation:
Show code
# This doesn't work nicely with NA values, so let's remove them first:
master_dataset_study2_restricted_with_risk_scores_weighted_cor_data <- master_dataset_study2_restricted_with_risk_scores %>%
  filter(!is.na(PGSI_TOTAL))

weighted_cor_PGSI <- weightedCorr(
  x = master_dataset_study2_restricted_with_risk_scores_weighted_cor_data$PGSI_TOTAL,
  y = master_dataset_study2_restricted_with_risk_scores_weighted_cor_data$`Composite Risk Score`,
  method = "Spearman",
  weights = master_dataset_study2_restricted_with_risk_scores_weighted_cor_data$ALL_SAMPLE_WEIGHTS)

weighted_cor_PGSI
[1] 0.3427516
Now compare composite risk scores between participants and non-participants, first with a standard (unweighted) Welch two-sample t-test:
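The code for this comparison isn’t shown in the rendered output; the Welch test reported below corresponds to a call of roughly this form (variable names as used elsewhere in this script):

# Unweighted Welch two-sample t-test (Welch is t.test's default, var.equal = FALSE):
t.test(`Composite Risk Score` ~ STATUS,
       data = master_dataset_study2_restricted_with_risk_scores)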
Welch Two Sample t-test
data: Composite Risk Score by STATUS
t = 7.8814, df = 2203.5, p-value = 5.046e-15
alternative hypothesis: true difference in means between group Participants and group Non-participants is not equal to 0
95 percent confidence interval:
1.454290 2.417725
sample estimates:
mean in group Participants mean in group Non-participants
1.7069962 -0.2290113
And now a weighted t-test:
Show code
# Create survey design object for the whole sample:
survey_design_all_participants <- svydesign(id = ~1,
                                            data = master_dataset_study2_restricted_with_risk_scores,
                                            weights = ~ALL_SAMPLE_WEIGHTS)

weighted_t_test_composite_study2 <- survey::svyttest(`Composite Risk Score` ~ STATUS,
                                                     data = master_dataset_study2_restricted_with_risk_scores,
                                                     design = survey_design_all_participants)
weighted_t_test_composite_study2
Design-based t-test
data: `Composite Risk Score` ~ STATUS
t = -9.5198, df = 12797, p-value < 2.2e-16
alternative hypothesis: true difference in mean is not equal to 0
95 percent confidence interval:
-1.716717 -1.130472
sample estimates:
difference in mean
-1.423595
Compute non-weighted summary scores and effect size:
Show code
# Calculate means and standard deviations
summary_composite_study2 <- master_dataset_study2_restricted_with_risk_scores %>%
  group_by(STATUS) %>%
  dplyr::summarize(mean = mean(`Composite Risk Score`),
                   sd = sd(`Composite Risk Score`),
                   n = n())

gt(summary_composite_study2)
| STATUS | mean | sd | n |
|---|---|---|---|
| Participants | 1.7069962 | 8.687786 | 1514 |
| Non-participants | -0.2290113 | 10.879017 | 11285 |
Show code
# Calculate Cohen's d with 95% confidence interval
cohen_d_result_study2 <- cohen.d(`Composite Risk Score` ~ STATUS,
                                 data = master_dataset_study2_restricted_with_risk_scores)
cohen_d_result_study2
Cohen's d
d estimate: 0.181896 (negligible)
95 percent confidence interval:
lower upper
0.1282005 0.2355915
Compute weighted summary score and effect size:
Show code
# Rename the risk score as the statistic function doesn't like the spaces:
master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change <- master_dataset_study2_restricted_with_risk_scores %>%
  rename(risk_score = `Composite Risk Score`)

# Create survey design objects for each STATUS group that include the composite risk scores:
survey_design_all_t <- survey::svydesign(
  id = ~1,
  data = master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change,
  weights = ~ALL_SAMPLE_WEIGHTS)

survey_design_non_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change,
                STATUS == "Non-participants"),
  weights = ~ALL_SAMPLE_WEIGHTS)

survey_design_participants_t <- survey::svydesign(
  id = ~1,
  data = subset(master_dataset_study2_restricted_with_risk_scores_temp_risk_name_change,
                STATUS == "Participants"),
  weights = ~ALL_SAMPLE_WEIGHTS)

risk_score_comparison <- weighted_summary_continuous("risk_score",
                                                     survey_design_all_t,
                                                     survey_design_non_participants_t,
                                                     survey_design_participants_t,
                                                     participants_n, non_participants_n)

gt(risk_score_comparison)
| Variable | Overall | Participants | Non-participants | Effect_size |
|---|---|---|---|---|
| risk_score | -3.24 (5.55), -3.99 | -1.98 (5.46), -2.79 | -3.4 (5.54), -4.13 | 0.26 [0.2, 0.32] |
Quartiles
Now, let’s divide the sample into quartiles based on their composite summary score:
Show code
# Calculate quartiles of the composite risk score:
quartiles <- quantile(master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`,
                      probs = c(0, 0.25, 0.5, 0.75, 1))

master_dataset_study2_restricted_with_risk_scores$Quartile <- cut(
  master_dataset_study2_restricted_with_risk_scores$`Composite Risk Score`,
  breaks = quartiles,
  labels = c("Bottom quartile: least involved",
             "Second quartile",
             "Third quartile",
             "Top quartile: most involved"),
  include.lowest = TRUE)
Now compute the proportion of individuals from each quartile who went on to take part in the survey:
Show code
master_dataset_study2_restricted_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(SURVEY_STATUS_PGSI) %>%
  gt() %>%
  tab_header("Counts for granular survey status by composite risk score quartile", subtitle = NULL)
Counts for granular survey status by composite risk score quartile

| Quartile | SURVEY_STATUS_PGSI | n |
|---|---|---|
| Bottom quartile: least involved | Complete | 252 |
| Bottom quartile: least involved | Incomplete | 11 |
| Bottom quartile: least involved | Not started | 2937 |
| Second quartile | Complete | 331 |
| Second quartile | Incomplete | 27 |
| Second quartile | Not started | 2842 |
| Third quartile | Complete | 418 |
| Third quartile | Incomplete | 48 |
| Third quartile | Not started | 2733 |
| Top quartile: most involved | Complete | 513 |
| Top quartile: most involved | Incomplete | 72 |
| Top quartile: most involved | Not started | 2615 |
Show code
master_dataset_study2_restricted_with_risk_scores %>%
  group_by(Quartile) %>%
  dplyr::count(STATUS) %>%
  mutate('Percent of quartile' = round(n / sum(n) * 100, 2)) %>%
  gt() %>%
  tab_header("Counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Counts (with percentages) for broad survey status by composite risk score quartile

| Quartile | STATUS | n | Percent of quartile |
|---|---|---|---|
| Bottom quartile: least involved | Participants | 252 | 7.88 |
| Bottom quartile: least involved | Non-participants | 2948 | 92.12 |
| Second quartile | Participants | 331 | 10.34 |
| Second quartile | Non-participants | 2869 | 89.66 |
| Third quartile | Participants | 418 | 13.07 |
| Third quartile | Non-participants | 2781 | 86.93 |
| Top quartile: most involved | Participants | 513 | 16.03 |
| Top quartile: most involved | Non-participants | 2687 | 83.97 |
Now compute weighted counts/proportions:
Show code
# Compute weighted counts for granular survey status
weighted_counts_granular <- svytable(~Quartile + SURVEY_STATUS_PGSI,
                                     svydesign(id = ~1,
                                               data = master_dataset_study2_restricted_with_risk_scores,
                                               weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame()

weighted_counts_granular %>%
  gt() %>%
  tab_header("Weighted counts for granular survey status by composite risk score quartile", subtitle = NULL)
Weighted counts for granular survey status by composite risk score quartile

| Quartile | SURVEY_STATUS_PGSI | Freq |
|---|---|---|
| Bottom quartile: least involved | Complete | 2745.7805 |
| Second quartile | Complete | 3537.8544 |
| Third quartile | Complete | 3296.0030 |
| Top quartile: most involved | Complete | 1306.2995 |
| Bottom quartile: least involved | Incomplete | 120.6171 |
| Second quartile | Incomplete | 294.5093 |
| Third quartile | Incomplete | 355.4089 |
| Top quartile: most involved | Incomplete | 200.9272 |
| Bottom quartile: least involved | Not started | 32108.0330 |
| Second quartile | Not started | 30220.4648 |
| Third quartile | Not started | 19822.1098 |
| Top quartile: most involved | Not started | 5914.9925 |
Show code
# Compute weighted counts and percentages for broad survey status
svytable(~Quartile + STATUS,
         svydesign(id = ~1,
                   data = master_dataset_study2_restricted_with_risk_scores,
                   weights = ~ALL_SAMPLE_WEIGHTS)) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate('Percent of quartile' = round(Freq / sum(Freq) * 100, 2)) %>%
  gt() %>%
  tab_header("Weighted counts (with percentages) for broad survey status by composite risk score quartile", subtitle = NULL)
Weighted counts (with percentages) for broad survey status by composite risk score quartile

| Quartile | STATUS | Freq | Percent of quartile |
|---|---|---|---|
| Bottom quartile: least involved | Participants | 2745.781 | 7.85 |
| Bottom quartile: least involved | Non-participants | 32228.650 | 92.15 |
| Second quartile | Participants | 3537.854 | 10.39 |
| Second quartile | Non-participants | 30514.974 | 89.61 |
| Third quartile | Participants | 3296.003 | 14.04 |
| Third quartile | Non-participants | 20177.519 | 85.96 |
| Top quartile: most involved | Participants | 1306.299 | 17.60 |
| Top quartile: most involved | Non-participants | 6115.920 | 82.40 |
And visualise this using a Sankey diagram (weighted figures):
Show code
# Create survey design object
survey_design_all_participants_sankey <- svydesign(id = ~1,
                                                   data = master_dataset_study2_restricted_with_risk_scores,
                                                   weights = ~ALL_SAMPLE_WEIGHTS)

# We first need to create a data frame containing the flows' "from" and "to" categories
# and "intensity" values (i.e., N), with weightings applied:
weighted_counts <- svytable(~Quartile + SURVEY_STATUS_PGSI, survey_design_all_participants_sankey) %>%
  as.data.frame() %>%
  group_by(Quartile) %>%
  mutate(percent = Freq / sum(Freq) * 100) %>%
  ungroup()

# Ensure quartiles are ordered correctly
weighted_counts$Quartile <- factor(weighted_counts$Quartile,
                                   levels = c("Top quartile: most involved",
                                              "Third quartile",
                                              "Second quartile",
                                              "Bottom quartile: least involved"),
                                   ordered = TRUE)

# Desired node ordering for the Sankey plot
desired_order_sankey <- c("Top quartile: most involved",
                          "Third quartile",
                          "Second quartile",
                          "Bottom quartile: least involved")

# Non-weighted data:
risk_to_survey_links <- master_dataset_study2_restricted_with_risk_scores %>%
  select(CLIENT_ID, SURVEY_STATUS_PGSI, Quartile) %>%
  group_by(Quartile, SURVEY_STATUS_PGSI) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Quartile) %>%
  mutate(percent = n / sum(n) * 100) %>%
  select(-n) %>%
  ungroup() %>%
  mutate(Quartile = factor(Quartile, levels = desired_order_sankey)) %>%
  arrange(Quartile)

# From these flows we have to create a "node" data frame which provides a list of all categories (to and from) in the data frame:
nodes <- data.frame(name = c("Top quartile: most involved",
                             "Third quartile",
                             "Second quartile",
                             "Bottom quartile: least involved",
                             as.character(unique(weighted_counts$SURVEY_STATUS_PGSI))))

# For the networkD3 package, the links between categories are made using zero-based numeric ids
# and not the character names provided in the links data frame. This code assigns each "from"
# and "to" category a number to make the numerical connections explicit:
weighted_counts$IDsource <- match(weighted_counts$Quartile, nodes$name) - 1
weighted_counts$IDtarget <- match(weighted_counts$SURVEY_STATUS_PGSI, nodes$name) - 1

# Prepare colour scale
ColourScal <- 'd3.scaleOrdinal() .range(["#d1495b", "#edae49","#00798c","#2e4057","#35B779FF","#31688EFF","#31688EFF","#482878FF","#B4DE2CFF","#FDE725FF", "#26828EFF","#3E4A89FF","#440154FF","#6DCD59FF"])'

# Create Sankey plot
sankey_plot2_restricted <- sankeyNetwork(Links = weighted_counts, Nodes = nodes,
                                         Source = "IDsource", Target = "IDtarget",
                                         Value = "percent", NodeID = "name",
                                         iterations = 0, sinksRight = FALSE,
                                         fontSize = 16, fontFamily = "Poppins",
                                         nodeWidth = 30, nodePadding = 25,
                                         colourScale = ColourScal)

# Custom CSS with Google Fonts link
custom_css <- "<link href='https://fonts.googleapis.com/css2?family=Poppins:wght@400;700&display=swap' rel='stylesheet'><style> text { font-family: 'Poppins', sans-serif !important; }</style>"

# Save plot as an HTML file (the original call wrapped the widget in htmlwidgets::prependContent()
# with no content, which is a no-op, so the widget is saved directly here):
html_file <- tempfile(fileext = ".html")
saveWidget(widget = sankey_plot2_restricted,
           file = html_file,
           selfcontained = TRUE)

# Read the saved HTML and ensure the custom CSS is properly embedded
html_content <- readLines(html_file, warn = FALSE)
html_content <- gsub("</head>", paste0(custom_css, "</head>"), html_content)
writeLines(html_content, con = "sankey_plot2_restricted.html")

# Remove the temporary file
file.remove(html_file)
Now let’s perform a logistic regression model to identify the strongest predictors of survey uptake.
SSVS
Before we can actually run the model, we need to identify a suitable suite of predictor variables using Stochastic Search Variable Selection (SSVS) and the related SSVS package (note that the actual SSVS process is commented out in rendered versions of this document as it takes a long time to run).
As we want to maximise the overall predictive ability of the model, we have set the parameters of the SSVS to give a low threshold for including/selecting relevant variables. First, we have set the prior inclusion probability to 0.8, reflecting a belief that predictors are more likely to be included than not (a prior inclusion probability of 0.5 would reflect the belief that each predictor has an equal probability of being included or excluded). Second, we have set the cut-off for marginal inclusion probability (MIP) values to 0.4. There is no standard recommendation for what this cut-off should be, but others have used 0.5.
One thing we need to do before all of this is impute missing values for the IRSAD scores, as SSVS doesn’t seem to handle missing values well. To do this, we will impute the overall median IRSAD score as each missing value (as per the code below).
Show code
# First we need to convert logical and categorical variables to numeric for inclusion in the model:
master_dataset_study2_restricted_SSVS <- master_dataset_study2_restricted %>%
  # Impute missing IRSAD scores with the overall median:
  mutate(IRSAD_SCORE = ifelse(is.na(IRSAD_SCORE), median(IRSAD_SCORE, na.rm = TRUE), IRSAD_SCORE)) %>%
  # Convert all required variables to numeric:
  mutate(across(c('AGE', 'IRSAD_SCORE', 'DAYS_REGISTERED', 'DAYS_SINCE_LAST_BET',
                  'PAS_6M_BET_MEAN_STAKE', 'PAS_6M_BET_SD_STAKE', 'PAS_6M_N_ACTIVITIES',
                  'PAS_6M_BET_INTENSITY', 'PAS_6M_MAX_DAY_STAKE', 'PAS_6M_MAX_DAY_BETS',
                  'PAS_6M_MEAN_DAILY_STAKE', 'PAS_6M_SD_DAILY_STAKE', 'PAS_6M_PERCENT_UNSOCIABLE_BETS',
                  'PAS_6M_MONTHLY_BET_FREQ_DAYS', 'PAS_6M_MONTHLY_SPEND', 'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME',
                  'PAS_6M_MEAN_DEPOSIT_AMOUNT', 'PAS_6M_SD_DEPOSIT_AMOUNT', 'PAS_6M_PERCENT_DECLINED_DEPOSITS',
                  'PAS_6M_TOTAL_DEPOSIT_METHODS', 'PAS_6M_PERCENT_CANCEL_WITHDRAWS', 'PAS_6M_TOTAL_WITHDRAW_METHODS',
                  'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT', 'PAS_6M_MONTHLY_DEPOSIT_AMOUNT', 'PAS_6M_DEPOSIT_INTENSITY',
                  'EVER_USED_DEPOSIT_LIMIT', 'EVER_USED_BLOCK_OUT', 'EVER_USED_CHECK_IN', 'EVER_USED_CURFEW',
                  'EVER_USED_DEPOSIT_OPTION', 'EVER_USED_MARKET_CONTROL', 'EVER_USED_TIMOUT',
                  'EVER_USED_CANCELLED_WITHDRAWS', 'EVER_USED_SELF_EXCLUSION', 'TOTAL_ACTIVE_TOOLS',
                  'NUMBER_CPT_CHANGES'),
                ~as.numeric(.))) %>%
  mutate(GENDER = case_when(GENDER == 'Female' ~ 0,
                            GENDER == 'Male' ~ 1,
                            TRUE ~ 2)) %>%
  mutate(STATUS = case_when(STATUS == 'Non-participants' ~ 0,
                            STATUS == 'Participants' ~ 1,
                            TRUE ~ 2)) %>%
  rename('Age' = 'AGE',
         'Gender' = 'GENDER',
         'IRSAD score' = 'IRSAD_SCORE',
         'Days since registration' = 'DAYS_REGISTERED',
         'Days since last bet' = 'DAYS_SINCE_LAST_BET',
         'Mean stake*' = 'PAS_6M_BET_MEAN_STAKE',
         'Stake SD*' = 'PAS_6M_BET_SD_STAKE',
         'No. of activities*' = 'PAS_6M_N_ACTIVITIES',
         'Betting intensity*' = 'PAS_6M_BET_INTENSITY',
         'Max daily stake*' = 'PAS_6M_MAX_DAY_STAKE',
         'Max daily bets*' = 'PAS_6M_MAX_DAY_BETS',
         'Mean daily stake*' = 'PAS_6M_MEAN_DAILY_STAKE',
         'Daily stake SD*' = 'PAS_6M_SD_DAILY_STAKE',
         '% unsociable bets*' = 'PAS_6M_PERCENT_UNSOCIABLE_BETS',
         'Monthly bet freq (days)*' = 'PAS_6M_MONTHLY_BET_FREQ_DAYS',
         'Monthly spend*' = 'PAS_6M_MONTHLY_SPEND',
         'Mean monthly net outcome*' = 'PAS_6M_AVERAGE_MONTHLY_NET_OUTCOME',
         'Mean deposit amount*' = 'PAS_6M_MEAN_DEPOSIT_AMOUNT',
         'Deposit amount SD*' = 'PAS_6M_SD_DEPOSIT_AMOUNT',
         '% declined deposits*' = 'PAS_6M_PERCENT_DECLINED_DEPOSITS',
         'No. deposit methods*' = 'PAS_6M_TOTAL_DEPOSIT_METHODS',
         '% cancel withdrawals*' = 'PAS_6M_PERCENT_CANCEL_WITHDRAWS',
         'No. withdrawal methods*' = 'PAS_6M_TOTAL_WITHDRAW_METHODS',
         'Max daily deposit amount*' = 'PAS_6M_MAX_DAILY_DEPOSIT_AMOUNT',
         'Monthly deposit amount*' = 'PAS_6M_MONTHLY_DEPOSIT_AMOUNT',
         'Deposit intensity*' = 'PAS_6M_DEPOSIT_INTENSITY',
         'Ever used deposit limit' = 'EVER_USED_DEPOSIT_LIMIT',
         'Ever used block out' = 'EVER_USED_BLOCK_OUT',
         'Ever used check in' = 'EVER_USED_CHECK_IN',
         'Ever used curfew' = 'EVER_USED_CURFEW',
         'Ever used deposit option' = 'EVER_USED_DEPOSIT_OPTION',
         'Ever used market control' = 'EVER_USED_MARKET_CONTROL',
         'Ever used timeout' = 'EVER_USED_TIMOUT',
         'Ever used cancelled withdrawals' = 'EVER_USED_CANCELLED_WITHDRAWS',
         'Ever used self-exclusion' = 'EVER_USED_SELF_EXCLUSION',
         'Total active tools' = 'TOTAL_ACTIVE_TOOLS',
         'Number of changes to CPTs' = 'NUMBER_CPT_CHANGES')

outcome <- 'STATUS'

predictors <- c(
  # Demographic variables:
  'Age', 'Gender', 'IRSAD score',
  # Account/wagering variables:
  'Days since registration', 'Days since last bet',
  # All past six month wagers and transactions variables (as included in the composite score):
  'Mean stake*', 'Stake SD*', 'No. of activities*', 'Betting intensity*', 'Max daily stake*',
  'Max daily bets*', 'Mean daily stake*', 'Daily stake SD*', '% unsociable bets*',
  'Monthly bet freq (days)*', 'Monthly spend*', 'Mean monthly net outcome*', 'Mean deposit amount*',
  'Deposit amount SD*', '% declined deposits*', 'No. deposit methods*', '% cancel withdrawals*',
  'No. withdrawal methods*', 'Max daily deposit amount*', 'Monthly deposit amount*', 'Deposit intensity*',
  # Current and historical use of consumer protection tools:
  'Ever used deposit limit', 'Ever used block out', 'Ever used check in', 'Ever used curfew',
  'Ever used deposit option', 'Ever used market control', 'Ever used timeout',
  'Ever used cancelled withdrawals', 'Ever used self-exclusion', 'Total active tools',
  'Number of changes to CPTs')

# str(master_dataset_study2_restricted_SSVS)

# SSSV_results_Study2_Restricted <- ssvs(data = master_dataset_study2_restricted_SSVS,
#                                        x = predictors, y = outcome, # Variables, as prespecified above
#                                        inprob = 0.8,       # Prior inclusion probability; 0.5 reflects the belief that each predictor has an equal probability of being included/excluded
#                                        continuous = FALSE, # Categorical outcome variable
#                                        progress = TRUE)    # Provide updates on progress of processing
#
# # Save file to avoid having to re-run every time:
# saveRDS(SSSV_results_Study2_Restricted, "SSSV_results_Study2_Restricted")

# Load in:
SSSV_results_Study2_Restricted <- read_rds("SSSV_results_Study2_Restricted")
Now that we have the outcomes from the variable selection process, let’s present them as a table and visually in a plot:
Show code
summary(SSSV_results_Study2_Restricted,
        interval = 0.95,   # Credible interval
        ordered = TRUE) %>%
  as.data.frame() %>%
  gt() %>%
  tab_header(md("**Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2 Restricted analyses**"),
             subtitle = NULL)
Outcomes from the Stochastic Search Variable Selection (SSVS) process: Study 2 Restricted analyses
ggsave("Figures/SSVS results plot Study 2 Restricted analyses.pdf",
       width = 8.5,
       height = 3.5,
       units = c("in"))
Run model
Now let’s run the logistic regression model with our selection of predictor variables identified via SSVS:
Show code
# SSSV_results_Study2_Restricted$ssvs %>% as_tibble()

# Set the data for analysis:
Study2_logistic_reg_data <- master_dataset_study2_restricted %>%
  mutate(STATUS = case_when(STATUS == 'Non-participants' ~ 0,
                            STATUS == 'Participants' ~ 1))

# Create survey design object used for weighting:
survey_design_regression <- svydesign(id = ~1,
                                      data = Study2_logistic_reg_data,
                                      weights = ~ALL_SAMPLE_WEIGHTS)

# Check variables for inclusion:
summary(SSSV_results_Study2_Restricted,
        interval = 0.95,   # Credible interval
        ordered = TRUE) %>%
  filter(MIP > 0.4)
| Variable | MIP | Avg Beta | Avg Nonzero Beta | Lower CI (95%) | Upper CI (95%) |
|---|---|---|---|---|---|
| Age | 1.0000 | 0.2402 | 0.2402 | 0.1812 | 0.3081 |
| Days since last bet | 1.0000 | -0.2822 | -0.2822 | -0.3595 | -0.2054 |
| No. withdrawal methods* | 1.0000 | 0.1961 | 0.1961 | 0.1260 | 0.2685 |
| Stake SD* | 0.9869 | 0.5778 | 0.5854 | 0.3084 | 0.8716 |
| Mean stake* | 0.9861 | -0.6759 | -0.6854 | -1.0373 | -0.3545 |
| No. deposit methods* | 0.8692 | 0.1153 | 0.1326 | 0.0000 | 0.1861 |
| Mean deposit amount* | 0.6957 | -0.2414 | -0.3470 | -0.5180 | 0.0000 |
| Deposit intensity* | 0.5405 | 0.0496 | 0.0917 | 0.0000 | 0.1261 |
Show code
# Run the model with weighting applied:
study_2_restricted_analysis_weighted_logistic_regression <- svyglm(
  STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS +
    PAS_6M_BET_SD_STAKE +          # New
    PAS_6M_BET_MEAN_STAKE +        # New
    PAS_6M_TOTAL_DEPOSIT_METHODS +
    PAS_6M_MEAN_DEPOSIT_AMOUNT +
    PAS_6M_DEPOSIT_INTENSITY,      # New
  design = survey_design_regression,
  family = quasibinomial())

summary(study_2_restricted_analysis_weighted_logistic_regression)
Call:
svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS +
PAS_6M_BET_SD_STAKE + PAS_6M_BET_MEAN_STAKE + PAS_6M_TOTAL_DEPOSIT_METHODS +
PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_DEPOSIT_INTENSITY, design = survey_design_regression,
family = quasibinomial())
Survey design:
svydesign(id = ~1, data = Study2_logistic_reg_data, weights = ~ALL_SAMPLE_WEIGHTS)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.1011096 0.1404899 -22.074 < 2e-16 ***
AGE 0.0188877 0.0021737 8.689 < 2e-16 ***
DAYS_SINCE_LAST_BET -0.0359699 0.0049748 -7.230 5.09e-13 ***
PAS_6M_TOTAL_WITHDRAW_METHODS 0.3543070 0.0612079 5.789 7.26e-09 ***
PAS_6M_BET_SD_STAKE 0.0021273 0.0004674 4.551 5.39e-06 ***
PAS_6M_BET_MEAN_STAKE -0.0027559 0.0007975 -3.456 0.000551 ***
PAS_6M_TOTAL_DEPOSIT_METHODS 0.1385985 0.0400393 3.462 0.000539 ***
PAS_6M_MEAN_DEPOSIT_AMOUNT -0.0008037 0.0003071 -2.617 0.008886 **
PAS_6M_DEPOSIT_INTENSITY 0.0231312 0.0178049 1.299 0.193916
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 1.014813)
Number of Fisher Scoring iterations: 6
There is some minor collinearity here, but not enough to be problematic in this context: we aren’t using this model to make actual predictions on separate data or subsamples.
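The collinearity check itself isn’t displayed in this render; with the car package already loaded, it can be reproduced with something along these lines:

# Generalised variance inflation factors for the fitted weighted model;
# svyglm objects inherit from glm, so car::vif() should apply here.
# Values well below 5 would support the claim of only minor collinearity:
car::vif(study_2_restricted_analysis_weighted_logistic_regression)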
Compute confidence intervals for the coefficient estimates:
# NOTE: R2 is calculated as a proportion here, so in the paper we have multiplied it by 100 to form a percentage value
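The corresponding code isn’t shown in this render; the confidence intervals and the pseudo-R² reported below can be obtained along these lines (assuming the Nagelkerke R² from the fmsb package loaded above is the statistic used):

# 95% confidence intervals for the coefficient estimates:
confint(study_2_restricted_analysis_weighted_logistic_regression)

# Nagelkerke pseudo-R2 for the model (returned as a proportion):
fmsb::NagelkerkeR2(study_2_restricted_analysis_weighted_logistic_regression)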
The model accounts for 5.66% of the variance in uptake/response.
Calculate overall p-value for model by comparing it with a null model:
Show code
anova(study_2_restricted_analysis_weighted_logistic_regression,
      update(study_2_restricted_analysis_weighted_logistic_regression, ~1), # update() here produces the null model for comparison
      test = "Chisq")
Working (Rao-Scott) LRT for AGE DAYS_SINCE_LAST_BET PAS_6M_TOTAL_WITHDRAW_METHODS PAS_6M_BET_SD_STAKE PAS_6M_BET_MEAN_STAKE PAS_6M_TOTAL_DEPOSIT_METHODS PAS_6M_MEAN_DEPOSIT_AMOUNT PAS_6M_DEPOSIT_INTENSITY
in svyglm(formula = STATUS ~ AGE + DAYS_SINCE_LAST_BET + PAS_6M_TOTAL_WITHDRAW_METHODS +
PAS_6M_BET_SD_STAKE + PAS_6M_BET_MEAN_STAKE + PAS_6M_TOTAL_DEPOSIT_METHODS +
PAS_6M_MEAN_DEPOSIT_AMOUNT + PAS_6M_DEPOSIT_INTENSITY, design = survey_design_regression,
family = quasibinomial())
Working 2logLR = 355.9009 p= < 2.22e-16
(scale factors: 1.4 1.3 1.3 1.2 0.93 0.87 0.62 0.4 )
Show code
# Set seed so that labels are consistent in the plot:
set.seed(200)

ggstatsplot::ggcoefstats(study_2_restricted_analysis_weighted_logistic_regression,
                         digits = 4,
                         stats.label.color = "#00798c",
                         sort = "ascending",
                         vline.args = list(size = 1, color = "black"),
                         # ggstatsplot.layer = FALSE,
                         exclude.intercept = TRUE,   # Hide the intercept
                         stats.label.args = list(size = 3, direction = "y")) +
  scale_x_continuous(name = "Regression coefficient",
                     limits = c(-0.1, 0.5),
                     breaks = c(-.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5)) +
  scale_y_discrete(name = "",
                   labels = c("Days since\n last bet", "Mean stake", "Mean deposit\n amount",
                              "SD of stake", "Age", "Deposit intensity",
                              "No. deposit\n methods", "No. withdrawal\n methods")) +
  theme(panel.grid.minor = element_blank(),
        axis.line = element_blank()) +
  theme(axis.text = element_text(color = "black", size = 9, face = "plain")) +
  theme_light(base_family = "Poppins") +
  # theme(plot.margin = unit(c(.1,.1,.3,-3), "cm")) +
  theme(axis.text.y = element_text(color = "black", size = 12, face = "plain", family = "Poppins")) +
  theme(axis.title.x = element_text(color = "black", size = 12, face = "plain", vjust = 0, family = "Poppins"))
Show code
# Save at width 5 in and height 7 in (portrait):
ggsave("Figures/Study 2 restricted analyses log regression plot.pdf",
       width = 5,
       height = 7,
       units = c("in"))
Reproducibility of this analysis
Package stability
To ensure that every time this script runs it will use the same versions of the packages used during the script development, all packages are loaded using the groundhog.library() function from the groundhog package.
Rendering & publishing
This document is rendered in HTML format and published online using Quarto Pub by using the following command in the terminal:
quarto publish quarto-pub Survey_uptake_main_analyses.qmd