transformations.Rmd
Out of the box, deident features a set of transformations to aid in the de-identification of data sets. Each transformation is implemented via R6Class and extends BaseDeident. User-defined transformations can be implemented in a similar manner.
To demonstrate the different transformations, we use the babynames data set, taken from the babynames package:
library(deident)
babynames <- babynames::babynames |>
  dplyr::filter(year > 2015)
babynames
#> # A tibble: 65,448 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 2016 F Emma 19471 0.0101
#> 2 2016 F Olivia 19327 0.0100
#> 3 2016 F Ava 16283 0.00844
#> 4 2016 F Sophia 16112 0.00835
#> # ℹ 65,444 more rows
Apply a cached random replacement cipher. Each re-occurrence of the same key receives the same replacement value.
Implemented deident options:
# Examples only:
deident(df, "psudonymize", ...)
deident(df, "Pseudonymizer", ...)
deident(df, Pseudonymizer, ...)
deident(df, Pseudonymizer$new(), ...)
psu <- Pseudonymizer$new()
deident(df, psu, ...)
By default, Pseudonymizer replaces values in variables with a random alphanumeric string of 5 characters. This behaviour can be changed by calling set_method on an instantiated Pseudonymizer with the desired function:
psu <- Pseudonymizer$new()
new_method <- function(key, ...) {
  # Ignore the key and return a random string of 12 lower-case letters
  paste(sample(letters, 12, replace = TRUE), collapse = "")
}
psu$set_method(new_method)
dlist_psu <- deident(babynames, psu, name)
dlist_psu
#> DeidentList
#> 1 step(s) implemented
#> Step 1 : 'Pseudonymizer' on variable(s) name
#> For data:
#> columns: year, sex, name, n, prop
apply_deident(babynames, dlist_psu)
#> # A tibble: 65,448 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 2016 F mwlefoodfiee 19471 0.0101
#> 2 2016 F xombwkkeulbf 19327 0.0100
#> 3 2016 F dpbcyvkgufiv 16283 0.00844
#> 4 2016 F jubqtxvafbsn 16112 0.00835
#> # ℹ 65,444 more rows
The first argument to the method receives the key to be transformed.
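The idea of a cached random replacement can be sketched in base R. This is a minimal illustration and not deident's implementation: make_cached_replacer is a hypothetical helper, and the sketch assumes keys contain no missing values and ignores the small chance that two keys draw the same random string.

```r
# Hypothetical helper (not part of deident): build a replacer that remembers
# the random string it handed out for each key.
make_cached_replacer <- function(n_chars = 5) {
  cache <- character(0)  # named vector: names are keys, values are replacements
  function(keys) {
    keys <- as.character(keys)
    # Only keys that have not been seen before get a fresh random string
    new_keys <- setdiff(unique(keys), names(cache))
    for (key in new_keys) {
      cache[[key]] <<- paste(
        sample(c(letters, LETTERS, 0:9), n_chars, replace = TRUE),
        collapse = ""
      )
    }
    # Look every key up in the cache, so repeats map to the same value
    unname(cache[keys])
  }
}

replace_names <- make_cached_replacer()
out <- replace_names(c("Emma", "Olivia", "Emma"))
out[1] == out[3]  # the repeated key receives the same replacement
```

Because the cache lives in the closure, a later call to replace_names() with "Emma" returns the same string again.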
Apply cryptographic hashing to a variable.
Implemented deident options:
# Examples only:
deident(df, "encrypt", ...)
deident(df, "Encrypter", ...)
deident(df, Encrypter, ...)
deident(df, Encrypter$new(), ...)
encrypt <- Encrypter$new()
deident(df, encrypt, ...)
At initialization, Encrypter can be given hash_key and seed values to control the cryptographic hashing. It is recommended that users set these values and do not disclose them.
encrypt <- Encrypter$new(hash_key = "deident_hash_key_123", seed = 202)
dlist_enc <- deident(babynames, encrypt, name)
dlist_enc
#> DeidentList
#> 1 step(s) implemented
#> Step 1 : 'Encrypter' on variable(s) name
#> For data:
#> columns: year, sex, name, n, prop
apply_deident(babynames, dlist_enc)
#> # A tibble: 65,448 × 5
#> year sex name n prop
#> <dbl> <chr> <hash> <int> <dbl>
#> 1 2016 F b80ab4da7d4e50cdc2897d002a5446cd8d69cc7d62d798535a4… 19471 0.0101
#> 2 2016 F f435872c4aaf1f167d583cd2a505513b9c6404590a930a70dcc… 19327 0.0100
#> 3 2016 F 68f6878121736390203d6bcfeb7ca94cc58100e34806999d33f… 16283 0.00844
#> 4 2016 F adfb6f7051dc1f87fdbf75f0d8783d054f8dd06b9d0748e699b… 16112 0.00835
#> # ℹ 65,444 more rows
Apply Gaussian white noise to a numeric variable.
Implemented deident options:
# Example only:
deident(df, "perturb", ...)
deident(df, "Perturber", ...)
deident(df, Perturber, ...)
deident(df, Perturber$new(), ...)
perturb <- Perturber$new()
deident(df, perturb, ...)
At initialization, Perturber can be given a noise function via the noise argument, which sets the scale of the white noise.
perturb <- Perturber$new(noise = adaptive_noise(0.2))
dlist_pert <- deident(babynames, perturb, prop)
dlist_pert
#> DeidentList
#> 1 step(s) implemented
#> Step 1 : 'Perturber(adaptive_noise(0.2))' on variable(s) prop
#> For data:
#> columns: year, sex, name, n, prop
apply_deident(babynames, dlist_pert)
#> # A tibble: 65,448 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 2016 F Emma 19471 0.0101
#> 2 2016 F Olivia 19327 0.0100
#> 3 2016 F Ava 16283 0.00841
#> 4 2016 F Sophia 16112 0.00848
#> # ℹ 65,444 more rows
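For intuition, Gaussian perturbation can be sketched in base R. The helper perturb_gaussian and its sd_ratio parameter are hypothetical, and scaling the noise by the column's standard deviation is an assumption made for illustration rather than a description of how adaptive_noise works.

```r
# Hypothetical helper (not part of deident): add zero-mean Gaussian noise
# whose standard deviation is a fraction of the variable's own spread.
perturb_gaussian <- function(x, sd_ratio = 0.2) {
  noise_sd <- sd_ratio * stats::sd(x, na.rm = TRUE)
  x + stats::rnorm(length(x), mean = 0, sd = noise_sd)
}

set.seed(101)  # fixed seed so the sketch is reproducible
prop <- c(0.0101, 0.0100, 0.00844, 0.00835)
perturbed <- perturb_gaussian(prop)
```

Tying the noise to the spread of the variable keeps the perturbation on the same order of magnitude as the data, whatever its units.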
Aggregate categorical values dependent on a user-supplied list. The list must be supplied to Blur at initialization.
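The aggregation described above amounts to a lookup from original categories to coarser ones. A minimal base R sketch follows; blur_categories and the lookup values are hypothetical, and leaving unmatched values unchanged is a choice made for this sketch, not documented deident behaviour.

```r
# Hypothetical helper (not part of deident): map each category to a coarser
# group using a named character vector as the lookup.
blur_categories <- function(x, lookup) {
  x <- as.character(x)
  blurred <- unname(lookup[x])
  # Values with no entry in the lookup are left unchanged in this sketch
  missing_entry <- is.na(blurred)
  blurred[missing_entry] <- x[missing_entry]
  blurred
}

lookup <- c(Emma = "Group A", Olivia = "Group A", Ava = "Group B")
blurred <- blur_categories(c("Emma", "Olivia", "Ava", "Sophia"), lookup)
blurred
#> [1] "Group A" "Group A" "Group B" "Sophia"
```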
Implemented deident options:
Aggregate numeric values dependent on a user-supplied vector of breaks/cuts. If no vector is supplied, NumericBlurrer defaults to a binary classification about 0.
Implemented deident options:
# Example only:
deident(df, "numeric_blur", ...)
deident(df, "NumericBlurrer", ...)
deident(df, NumericBlurrer, ...)
deident(df, NumericBlurrer$new(), ...)
numeric_blur <- NumericBlurrer$new()
deident(df, numeric_blur, ...)
At initialization, NumericBlurrer takes an argument cuts to define the limits of each interval.
numeric_blur <- NumericBlurrer$new(cuts = c(10, 30))
dlist_nb <- deident(babynames, numeric_blur, n)
dlist_nb
#> DeidentList
#> 1 step(s) implemented
#> Step 1 : 'NumericBlurrer' on variable(s) n
#> For data:
#> columns: year, sex, name, n, prop
apply_deident(babynames, dlist_nb)
#> # A tibble: 65,448 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <fct> <dbl>
#> 1 2016 F Emma (30, Inf] 0.0101
#> 2 2016 F Olivia (30, Inf] 0.0100
#> 3 2016 F Ava (30, Inf] 0.00844
#> 4 2016 F Sophia (30, Inf] 0.00835
#> # ℹ 65,444 more rows
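The same kind of interval binning can be reproduced with base R's cut(). The helper blur_numeric is a hypothetical stand-in rather than deident's implementation; it assumes right-closed intervals, and its labels differ slightly in spacing from the output above.

```r
# Hypothetical helper (not part of deident): bin a numeric vector into
# right-closed intervals bounded by the supplied cuts.
blur_numeric <- function(x, cuts = 0) {
  # Pad with -Inf and Inf so every value falls inside some interval
  breaks <- c(-Inf, sort(unique(cuts)), Inf)
  cut(x, breaks = breaks)
}

n <- c(5, 10, 25, 19471)
blurred <- blur_numeric(n, cuts = c(10, 30))
as.character(blurred)
#> [1] "(-Inf,10]" "(-Inf,10]" "(10,30]"   "(30,Inf]"
```

With the default cuts = 0, the sketch yields two intervals, mirroring the binary classification about 0 described earlier.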
Apply Shuffler to a data set having first grouped the data on column(s). The grouping needs to be defined at initialization.
Implemented deident options:
# Example only:
grouped_shuffle <- GroupedShuffler$new(year)
deident(babynames, grouped_shuffle, name)
At initialization, GroupedShuffler takes an argument limit such that if any aggregated subgroup has fewer than limit observations, all values are dropped.
grouped_shuffle <- GroupedShuffler$new(year, limit = 1)
dlist_groupshuffle <- deident(babynames, grouped_shuffle, name)
dlist_groupshuffle
#> DeidentList
#> 1 step(s) implemented
#> Step 1 : 'GroupedShuffler(group_on=year)' on variable(s) name
#> For data:
#> columns: year, sex, name, n, prop
apply_deident(babynames, dlist_groupshuffle)
#> # A tibble: 65,448 × 5
#> # Groups: year [2]
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 2016 F Ailani 19471 0.0101
#> 2 2016 F Liliyana 19327 0.0100
#> 3 2016 F Randell 16283 0.00844
#> 4 2016 F Powell 16112 0.00835
#> # ℹ 65,444 more rows
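A grouped shuffle with a minimum group size can be sketched in base R using ave(). The helper shuffle_within_groups is hypothetical, and replacing the values of undersized groups with NA is an assumption about what "dropped" means, made only for this sketch.

```r
# Hypothetical helper (not part of deident): permute x within each group and
# blank out groups that have fewer than `limit` observations.
shuffle_within_groups <- function(x, group, limit = 0) {
  idx <- seq_along(x)
  # Permute the row indices inside each group; sample.int() avoids the
  # sample() pitfall for groups of length one
  shuffled_idx <- ave(idx, group, FUN = function(i) i[sample.int(length(i))])
  out <- x[shuffled_idx]
  # Size of the group each row belongs to
  group_size <- ave(idx, group, FUN = length)
  out[group_size < limit] <- NA
  out
}

set.seed(1)
year <- c(2016, 2016, 2016, 2017, 2017, 2018)
name <- c("Emma", "Olivia", "Ava", "Liam", "Noah", "Mia")
shuffled <- shuffle_within_groups(name, year, limit = 2)
```

Names only move between rows that share a year, and the single 2018 row is set to NA because its group is smaller than limit.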