reusing_methods.Rmd
NB: the following is an advanced usage of deident. If you are just getting started we recommend looking at the other vignettes first.
While the deident package implements several methods for deidentification, one of its key advantages is that methods can be re-used and shared across data sets, thanks to the ‘stateful’ nature of its design.
If you wish to share a unit between different pipelines, the cleanest approach is to initialize the method of interest and then pass it into the first pipeline:
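library(deident)
# Initialize the method once so that it can be shared between pipelines
psu <- Pseudonymizer$new()
# First pipeline: pseudonymize the name column of starwars
name_pipe <- starwars |>
  deident(psu, name)
apply_deident(starwars, name_pipe)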
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 SwlKL 172 77 blond fair blue 19 male mascu…
#> 2 UUEdL 167 75 NA gold yellow 112 none mascu…
#> 3 ieexo 96 32 NA white, bl… red 33 none mascu…
#> 4 mb92Q 202 136 none white yellow 41.9 male mascu…
#> 5 9QeuR 150 49 brown light brown 19 fema… femin…
#> 6 8HfdV 178 120 brown, grey light blue 52 male mascu…
#> 7 HIyvQ 165 75 brown light blue 47 fema… femin…
#> 8 gF0fi 97 32 NA white, red red NA none mascu…
#> 9 vjF0H 183 84 black light brown 24 male mascu…
#> 10 qZ3vE 182 77 auburn, white fair blue-gray 57 male mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
Having called apply_deident, the Pseudonymizer psu has learned encodings for each string in starwars$name. If these strings appear a second time, they will be replaced in the same way, and we can build a second pipeline using psu:
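# Five names psu has already encoded, plus five values from ShiftsWorked$Employee
combined.frm <- data.frame(
  ID = c(head(starwars$name, 5), head(ShiftsWorked$Employee, 5))
)
# Second pipeline: reuse the same psu object
reused_pipe <- combined.frm |>
  deident(psu, ID)
apply_deident(combined.frm, reused_pipe)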
#> ID
#> 1 SwlKL
#> 2 UUEdL
#> 3 ieexo
#> 4 mb92Q
#> 5 9QeuR
#> 6 zbiPl
#> 7 fS5Hb
#> 8 OGqaB
#> 9 G4XIp
#> 10 EOl8m
Since the first 5 values of combined.frm$ID are the same as the first 5 values of starwars$name, the first 5 values of each transformed data set are also the same.
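To see why a shared method gives matching output, it can help to picture the method as an object that carries a lookup table from one call to the next. The following is a minimal base R sketch of that idea only. The names `make_pseudonymizer`, `random_code` and `encode` are hypothetical and not part of deident, and deident's actual implementation may differ.

```r
# Hypothetical illustration only: NOT deident's implementation.
# A closure keeps a lookup table that persists between calls.
make_pseudonymizer <- function(code_length = 5) {
  lookup <- character(0)  # named vector: original value -> code

  random_code <- function() {
    paste(sample(c(letters, LETTERS, 0:9), code_length, replace = TRUE),
          collapse = "")
  }

  function(x) {
    x <- as.character(x)
    # Only values that have not been seen before need a new code
    new_keys <- setdiff(unique(x[!is.na(x)]), names(lookup))
    for (key in new_keys) {
      code <- random_code()
      # Redraw on the rare chance the code is already in use
      while (code %in% lookup) code <- random_code()
      lookup[[key]] <<- code
    }
    unname(lookup[x])
  }
}

encode <- make_pseudonymizer()

first  <- encode(c("Luke Skywalker", "C-3PO", "R2-D2"))
second <- encode(c("C-3PO", "Luke Skywalker", "A new name"))

# Values seen in the first call keep their codes in the second call
identical(second[1], first[2])
#> [1] TRUE
identical(second[2], first[1])
#> [1] TRUE
```

In this sketch the state lives in the closure's environment, so every call to `encode` sees the same table. Creating a second object with `make_pseudonymizer()` would start from an empty table and produce unrelated codes, which is why the same initialized method has to be passed to each pipeline.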