@martinctc
Last active March 12, 2025 15:14
[Simulate dataset, duplicate, and modify with a distribution] #R
# This script simulates a dataset, duplicates it across weekly dates, and
# modifies it with noise drawn from a normal (bell curve) distribution.

# Set up
library(tidyverse)
library(uuid)

# Simulate dataset: 100 people, each with a baseline `Emails_sent` count
temp_df <-
  tibble(
    PersonId = uuid::UUIDgenerate(n = 100, use.time = FALSE),
    Emails_sent = rnorm(100, mean = 30, sd = 5) %>% round() %>% abs()
  )

# Simulate weekly dates across 2023
seq_dates <- seq(
  from = as.Date("2023-01-01"),
  to = as.Date("2023-12-31"),
  by = "week"
)

# Repeat `temp_df` once per date and bind the copies together.
# `as.list()` ensures each element reaches the function as a Date.
full_df <-
  as.list(seq_dates) %>%
  map_dfr(function(x) {
    temp_df %>%
      mutate(MetricDate = x) %>%
      select(PersonId, MetricDate, Emails_sent)
  })
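
# Sketch (not part of the original script): a base R cross join builds the
# same kind of person-by-week table without a loop. The names `demo_people`,
# `demo_dates`, and `demo_grid` are hypothetical and hold toy values only.
demo_people <- data.frame(
  PersonId = c("a", "b", "c"),
  Emails_sent = c(28, 31, 35)
)
demo_dates <- data.frame(
  MetricDate = seq(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "week")
)

# `by = NULL` asks merge() for the Cartesian product of the two data frames
demo_grid <- merge(demo_people, demo_dates, by = NULL)

# Expect one row per person per weekly date
stopifnot(nrow(demo_grid) == nrow(demo_people) * nrow(demo_dates))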
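
# Add random noise to the dataset by PersonId
full_df_with_noise <-
  full_df %>%
  group_split(PersonId) %>%
  map_dfr(function(x) {
    total_rows <- nrow(x)

    # Every person has the same number of rows, so this seed gives every
    # person the same noise sequence.
    set.seed(total_rows)

    # Draw normal noise and rescale it to the range [-0.2, 0.2]
    x$normdist <- rnorm(total_rows, mean = 0.5, sd = 2) %>%
      scales::rescale(to = c(-.2, .2))

    # Prints one summary per person
    print(summary(x$normdist))

    # Apply the noise as a relative change of up to +/-20% around the
    # original value, which keeps `Emails_sent_2` non-negative
    x %>%
      mutate(Emails_sent_2 = Emails_sent * (1 + normdist)) %>%
      select(-normdist)
  })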
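
# Sketch (not part of the original script): why `set.seed(total_rows)` inside
# the loop gives each person the same noise. Resetting the generator to the
# same seed before each draw reproduces the same numbers; seeding once and
# drawing repeatedly does not. The `demo_*` names are hypothetical, and
# seeding once outside the loop is only one possible alternative.
demo_n <- 53

# Same seed before each draw: identical sequences
set.seed(demo_n)
demo_noise_a <- rnorm(demo_n, mean = 0.5, sd = 2)
set.seed(demo_n)
demo_noise_b <- rnorm(demo_n, mean = 0.5, sd = 2)
stopifnot(identical(demo_noise_a, demo_noise_b))

# Seed once, then draw twice: the second draw continues the random stream
set.seed(demo_n)
demo_noise_c <- rnorm(demo_n, mean = 0.5, sd = 2)
demo_noise_d <- rnorm(demo_n, mean = 0.5, sd = 2)
stopifnot(!identical(demo_noise_c, demo_noise_d))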

# Plot the weekly means of `Emails_sent` and `Emails_sent_2` over time
full_df_with_noise %>%
  group_by(MetricDate) %>%
  summarise(
    Emails_sent = mean(Emails_sent),
    Emails_sent_2 = mean(Emails_sent_2)
  ) %>%
  pivot_longer(
    cols = c(Emails_sent, Emails_sent_2),
    names_to = "Metric",
    values_to = "Value"
  ) %>%
  ggplot(aes(x = MetricDate, y = Value, color = Metric)) +
  geom_point() +
  geom_line()