Transformations • deident

Out of the box, deident provides a set of transformations to aid in the de-identification of data sets. Each transformation is implemented as an R6Class and extends BaseDeident. User-defined transformations can be implemented in the same way.

To demonstrate the different transformations, we use the babynames data set from the babynames package, restricted to years after 2015:

library(deident)

babynames <- babynames::babynames |> 
  dplyr::filter(year > 2015) 

babynames
#> # A tibble: 65,448 × 5
#>    year sex   name       n    prop
#>   <dbl> <chr> <chr>  <int>   <dbl>
#> 1  2016 F     Emma   19471 0.0101 
#> 2  2016 F     Olivia 19327 0.0100 
#> 3  2016 F     Ava    16283 0.00844
#> 4  2016 F     Sophia 16112 0.00835
#> # ℹ 65,444 more rows

Pseudonymizer

Apply a cached random replacement cipher. Repeated occurrences of the same key receive the same replacement.

Implemented deident options:

# Examples only:
deident(df, "psudonymize", ...)
deident(df, "Pseudonymizer", ...)
deident(df, Pseudonymizer, ...)
deident(df, Pseudonymizer$new(), ...)

psu <- Pseudonymizer$new()
deident(df, psu, ...)

Options

By default, Pseudonymizer replaces each value with a random alphanumeric string of 5 characters. To change this, call set_method on an instantiated Pseudonymizer with the desired function:

psu <- Pseudonymizer$new()

new_method <- function(key, ...) {
  paste(sample(letters, 12, replace = TRUE), collapse = "")
}

psu$set_method(new_method)

dlist_psu <- deident(babynames, psu, name)
dlist_psu
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'Pseudonymizer' on variable(s) name 
#> For data:
#>    columns: year, sex, name, n, prop

apply_deident(babynames, dlist_psu)
#> # A tibble: 65,448 × 5
#>    year sex   name             n    prop
#>   <dbl> <chr> <chr>        <int>   <dbl>
#> 1  2016 F     mwlefoodfiee 19471 0.0101 
#> 2  2016 F     xombwkkeulbf 19327 0.0100 
#> 3  2016 F     dpbcyvkgufiv 16283 0.00844
#> 4  2016 F     jubqtxvafbsn 16112 0.00835
#> # ℹ 65,444 more rows

The first argument of the supplied function receives the key to be transformed.
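
The caching behaviour described above can be illustrated with a minimal base R sketch. The helper make_pseudonymizer is hypothetical and not part of deident. It only shows how a lookup table lets repeated keys map to the same random replacement, and it assumes non-missing, non-empty keys.

```r
# Hypothetical helper (not part of deident): returns a function that
# replaces each distinct key with a random string and remembers the mapping.
make_pseudonymizer <- function(n_char = 5) {
  cache <- character(0)                 # named vector: key -> replacement
  alphabet <- c(letters, LETTERS, 0:9)  # alphanumeric characters

  function(keys) {
    keys <- as.character(keys)
    # Only keys that have not been seen before get a new replacement
    for (key in setdiff(unique(keys), names(cache))) {
      cache[[key]] <<- paste(sample(alphabet, n_char, replace = TRUE),
                             collapse = "")
    }
    unname(cache[keys])
  }
}

pseudonymize <- make_pseudonymizer()
first  <- pseudonymize(c("Emma", "Olivia", "Emma"))
second <- pseudonymize("Emma")  # reuses the cached replacement
```

In this sketch nothing prevents two different keys from drawing the same random string. A stricter version would redraw on collision.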

Shuffler

Randomly reorder the values of a variable.

Implemented deident options:

# Examples only:
deident(df, "shuffle", ...)
deident(df, "Shuffler", ...)
deident(df, Shuffler, ...)
deident(df, Shuffler$new(), ...)

shuffle <- Shuffler$new()
deident(df, shuffle, ...)
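
Judging by its name and by the GroupedShuffler output shown later on this page, shuffling reorders the values within a column while leaving the other columns untouched. The base R sketch below illustrates that idea. shuffle_column is a hypothetical helper, not a deident function, and the toy data frame exists only for this example.

```r
# Hypothetical helper (not part of deident): permute one column of a data
# frame while leaving every other column in its original order.
shuffle_column <- function(df, column) {
  # sample.int() is used instead of sample() so that a numeric column of
  # length one is returned unchanged rather than being treated as a range
  df[[column]] <- df[[column]][sample.int(nrow(df))]
  df
}

toy <- data.frame(name = c("Emma", "Olivia", "Ava", "Sophia"),
                  n    = c(19471, 19327, 16283, 16112))
shuffled <- shuffle_column(toy, "name")
```

The set of names is preserved, but the link between each name and its row is broken.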

Encrypter

Apply cryptographic hashing to a variable.

Implemented deident options:

# Examples only:
deident(df, "encrypt", ...)
deident(df, "Encrypter", ...)
deident(df, Encrypter, ...)
deident(df, Encrypter$new(), ...)

encrypt <- Encrypter$new()
deident(df, encrypt, ...)

Options

At initialization, Encrypter can be given hash_key and seed values to control the hashing. Users are advised to set both values and keep them secret.

encrypt <- Encrypter$new(hash_key = "deident_hash_key_123", seed = 202)
dlist_enc <- deident(babynames, encrypt, name)
dlist_enc
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'Encrypter' on variable(s) name 
#> For data:
#>    columns: year, sex, name, n, prop

apply_deident(babynames, dlist_enc)
#> # A tibble: 65,448 × 5
#>    year sex   name                                                     n    prop
#>   <dbl> <chr> <hash>                                               <int>   <dbl>
#> 1  2016 F     b80ab4da7d4e50cdc2897d002a5446cd8d69cc7d62d798535a4… 19471 0.0101 
#> 2  2016 F     f435872c4aaf1f167d583cd2a505513b9c6404590a930a70dcc… 19327 0.0100 
#> 3  2016 F     68f6878121736390203d6bcfeb7ca94cc58100e34806999d33f… 16283 0.00844
#> 4  2016 F     adfb6f7051dc1f87fdbf75f0d8783d054f8dd06b9d0748e699b… 16112 0.00835
#> # ℹ 65,444 more rows
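
To see why the key should stay secret, here is a minimal sketch of keyed hashing. It assumes the openssl package is installed. Whether Encrypter computes its hashes this way internally is not stated here, and the role of seed is not covered.

```r
library(openssl)  # assumption: installed; used here only for illustration

# Key taken from the example above; real keys should be kept secret
hash_key <- "deident_hash_key_123"

names_in <- c("Emma", "Olivia", "Emma")

# sha256() with a key computes an HMAC, so the digest depends on the key
hashed <- as.character(sha256(names_in, key = hash_key))
other  <- as.character(sha256(names_in, key = "a_different_key"))
```

Equal inputs give equal digests under one key, so joins on the hashed column still work. Without the key, an outsider cannot reproduce the digests by hashing a list of candidate names.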

Perturber

Apply Gaussian white noise to a numeric variable.

Implemented deident options:

# Example only:
deident(df, "perturb", ...)
deident(df, "Perturber", ...)
deident(df, Perturber, ...)
deident(df, Perturber$new(), ...)

perturb <- Perturber$new()
deident(df, perturb, ...)

Options

At initialization, Perturber can be given a noise function via the noise argument. In the example below, adaptive_noise(0.2) sets the scale of the white noise.

perturb <- Perturber$new(noise = adaptive_noise(0.2))
dlist_pert <- deident(babynames, perturb, prop)
dlist_pert
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'Perturber(adaptive_noise(0.2))' on variable(s) prop 
#> For data:
#>    columns: year, sex, name, n, prop

apply_deident(babynames, dlist_pert)
#> # A tibble: 65,448 × 5
#>    year sex   name       n    prop
#>   <dbl> <chr> <chr>  <int>   <dbl>
#> 1  2016 F     Emma   19471 0.0101 
#> 2  2016 F     Olivia 19327 0.0100 
#> 3  2016 F     Ava    16283 0.00841
#> 4  2016 F     Sophia 16112 0.00848
#> # ℹ 65,444 more rows
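
The underlying idea can be sketched in base R as adding zero-mean Gaussian noise to each value. add_white_noise is a hypothetical helper, not a deident function. Tying the noise scale to the spread of the data is only an assumption about what an adaptive scale might mean.

```r
# Hypothetical helper (not part of deident): add zero-mean Gaussian noise
# with standard deviation `sd` to a numeric vector.
add_white_noise <- function(x, sd) {
  x + rnorm(length(x), mean = 0, sd = sd)
}

set.seed(1)
prop <- c(0.0101, 0.0100, 0.00844, 0.00835)

# Fixed noise scale
noisy_fixed <- add_white_noise(prop, sd = 0.0001)

# Noise scale tied to the spread of the data; whether adaptive_noise()
# works this way is an assumption
noisy_relative <- add_white_noise(prop, sd = 0.2 * sd(prop))
```

A larger sd hides individual values better but distorts summary statistics more.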

Blurrer

Aggregate categorical values according to a user-supplied list. The list must be supplied to Blurrer at initialization.

Implemented deident options:

# Example only:
letter_blur <- c(rep("Early", 13), rep("Late", 13))
names(letter_blur) <- letters

blur <- Blurrer$new(blur = letter_blur)
deident(df, blur, A)
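
The aggregation amounts to a lookup by name, which can be sketched in base R with the same letter_blur vector. blur_values is a hypothetical helper, not a deident function.

```r
# Lookup from the example above: first half of the alphabet -> "Early",
# second half -> "Late"
letter_blur <- c(rep("Early", 13), rep("Late", 13))
names(letter_blur) <- letters

# Hypothetical helper (not part of deident): replace each value by the
# group it maps to; values missing from the lookup become NA
blur_values <- function(x, lookup) {
  unname(lookup[as.character(x)])
}

# "a" and "m" map to "Early", "n" and "z" map to "Late"
blurred <- blur_values(c("a", "m", "n", "z"), letter_blur)
```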

NumericBlurrer

Aggregate numeric values according to a user-supplied vector of breaks (cuts). If no vector is supplied, NumericBlurrer defaults to a binary classification about 0.

Implemented deident options:

# Example only:
deident(df, "numeric_blur", ...)
deident(df, "NumericBlurrer", ...)
deident(df, NumericBlurrer, ...)
deident(df, NumericBlurrer$new(), ...)

numeric_blur <- NumericBlurrer$new()
deident(df, numeric_blur, ...)

Options

At initialization, NumericBlurrer takes an argument cuts that defines the limits of each interval.

numeric_blur <- NumericBlurrer$new(cuts = c(10, 30))
dlist_nb <- deident(babynames, numeric_blur, n)
dlist_nb
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'NumericBlurrer' on variable(s) n 
#> For data:
#>    columns: year, sex, name, n, prop

apply_deident(babynames, dlist_nb)
#> # A tibble: 65,448 × 5
#>    year sex   name   n            prop
#>   <dbl> <chr> <chr>  <fct>       <dbl>
#> 1  2016 F     Emma   (30, Inf] 0.0101 
#> 2  2016 F     Olivia (30, Inf] 0.0100 
#> 3  2016 F     Ava    (30, Inf] 0.00844
#> 4  2016 F     Sophia (30, Inf] 0.00835
#> # ℹ 65,444 more rows
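
The interval labels in this output resemble what base R's cut() produces with open-ended outer breaks. A minimal sketch of that binning follows. blur_numeric is a hypothetical helper, and the right-closed intervals are inferred from the labels shown above rather than from the package source.

```r
# Hypothetical helper (not part of deident): bin a numeric vector at the
# given cut points, with open-ended first and last intervals.
blur_numeric <- function(x, cuts = 0) {
  cut(x, breaks = c(-Inf, sort(cuts), Inf))
}

n <- c(5, 10, 20, 19471)
binned <- blur_numeric(n, cuts = c(10, 30))
# Three intervals: up to 10, above 10 up to 30, and above 30
table(binned)

# With the default single cut at 0 the result is a two-level factor
sign_split <- blur_numeric(c(-2, 3))
```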

GroupedShuffler

Apply Shuffler to a data set after first grouping it on one or more columns. The grouping must be defined at initialization.

Implemented deident options:

# Example only:
grouped_shuffle <- GroupedShuffler$new(year)
deident(babynames, grouped_shuffle, name)

Options

At initialization, GroupedShuffler takes an argument limit: if any aggregated subgroup has fewer than limit observations, all values are dropped.

grouped_shuffle <- GroupedShuffler$new(year, limit = 1)
dlist_groupshuffle <- deident(babynames, grouped_shuffle, name)
dlist_groupshuffle
#> DeidentList
#>    1 step(s) implemented 
#>    Step 1 : 'GroupedShuffler(group_on=year)' on variable(s) name 
#> For data:
#>    columns: year, sex, name, n, prop

apply_deident(babynames, dlist_groupshuffle)
#> # A tibble: 65,448 × 5
#> # Groups:   year [2]
#>    year sex   name         n    prop
#>   <dbl> <chr> <chr>    <int>   <dbl>
#> 1  2016 F     Ailani   19471 0.0101 
#> 2  2016 F     Liliyana 19327 0.0100 
#> 3  2016 F     Randell  16283 0.00844
#> 4  2016 F     Powell   16112 0.00835
#> # ℹ 65,444 more rows
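
Shuffling within groups can be sketched in base R as follows. shuffle_within_groups is a hypothetical helper and the vectors are toy data. Setting undersized groups to NA is only one possible reading of "dropped" and may differ from what GroupedShuffler does.

```r
# Hypothetical helper (not part of deident): shuffle `x` separately within
# each level of `group`. Groups smaller than `limit` are set to NA here,
# which is an assumption about how such groups are handled.
shuffle_within_groups <- function(x, group, limit = 0) {
  out <- x
  for (idx in split(seq_along(x), group)) {
    if (length(idx) < limit) {
      out[idx] <- NA
    } else {
      out[idx] <- x[idx[sample.int(length(idx))]]
    }
  }
  out
}

# Toy data for illustration only
year <- c(2016, 2016, 2016, 2017, 2017, 2015)
name <- c("Emma", "Olivia", "Ava", "Liam", "Noah", "Mia")
shuffled_by_year <- shuffle_within_groups(name, year, limit = 2)
```

Names never move between years, so per-year distributions are preserved. The single 2015 row falls below the limit and is blanked instead of being left identifiable.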

Drop

Define a column to be removed from the pipeline.

Implemented deident options:

# Example only:
deident(df, Drop, ...)

drop <- deident:::Drop$new()
deident(df, drop, ...)
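
In plain data frame terms, dropping a column is equivalent to the base R sketch below. drop_columns is a hypothetical helper, not a deident function.

```r
# Hypothetical helper (not part of deident): return the data frame without
# the named columns; names not present in `df` are ignored.
drop_columns <- function(df, columns) {
  df[, setdiff(names(df), columns), drop = FALSE]
}

# Toy data for illustration only
toy <- data.frame(year = c(2016, 2016),
                  name = c("Emma", "Olivia"),
                  n    = c(19471, 19327))
without_name <- drop_columns(toy, "name")
```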