
pidpos: An R package for the detection of personally identifiable data
Source: vignettes/paper-vignette.Rmd
Summary
The pidpos package helps to identify the risk that individuals can be
identified from a dataset. Using part-of-speech (POS) tagging,
it extracts proper nouns from text fields, reducing the complexity of
the review process and enabling faster human oversight. The package also
provides tools for designing and implementing a redaction workflow.
Statement of need
Data collection and analysis have grown enormously in scale and scope, prompting international legislation to protect individuals’ rights over their own data (European Parliament and Council of the European Union 2016). This has heightened awareness of the responsibilities of data controllers (ICO, n.d.b) and the risks posed by large datasets (Clarke 2016). A central concern is personal identifiability, the ability to directly or indirectly identify an individual from a dataset (Finck and Pallas 2020), with breaches carrying significant reputational and financial consequences (ICO, n.d.a).
For small, structured datasets, manual inspection can identify personally identifiable data (PID) with reasonable effort. In large datasets, however, PID embedded within free-text fields or appearing rarely in a variable can easily be missed. Existing R packages such as PII (Patterson-Stein 2025) address this through pattern matching, which risks missing edge cases.
pidpos takes a different approach. Building on
part-of-speech tagging (by default the udpipe framework (Straka, Hajic, and Straková 2016; Wijffels
2023), with the option of a custom tagging engine), it
extracts all proper nouns from a dataset, deliberately accepting a
higher false positive rate, and provides tools to aid human review
rather than attempting full automation. This makes it robust to the edge
cases that pattern-matching approaches can miss.
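The core idea, keeping every token that a tagger labels as a proper noun, can be sketched in a few lines of base R. This is an illustration only and not the pidpos implementation: the helper name `extract_proper_nouns()` and the toy annotation are hypothetical, and the input is assumed to have the `doc_id`, `token` and `upos` columns that udpipe produces when its annotation is converted to a data frame.

```r
# Keep only tokens tagged as proper nouns (universal POS tag "PROPN").
# `annotation` is assumed to be a data frame with `doc_id`, `token` and
# `upos` columns, e.g. from as.data.frame(udpipe::udpipe_annotate(...)).
extract_proper_nouns <- function(annotation) {
  is_propn <- !is.na(annotation$upos) & annotation$upos == "PROPN"
  annotation[is_propn, c("doc_id", "token"), drop = FALSE]
}

# Hand-built stand-in for tagger output, so the sketch runs without a model.
toy_annotation <- data.frame(
  doc_id = rep("doc1", 5),
  token  = c("Oh", ",", "Ross", "is", "here"),
  upos   = c("INTJ", "PUNCT", "PROPN", "AUX", "ADV"),
  stringsAsFactors = FALSE
)

proper_nouns <- extract_proper_nouns(toy_annotation)
print(proper_nouns)

# With udpipe installed, real annotations could be obtained along these
# lines (not run here because it downloads a language model):
# model_info <- udpipe::udpipe_download_model(language = "english")
# model      <- udpipe::udpipe_load_model(file = model_info$file_model)
# annotation <- as.data.frame(udpipe::udpipe_annotate(model, x = "Oh, Ross is here"))
# extract_proper_nouns(annotation)
```

Because the filter keeps everything tagged `PROPN`, place names and brand names are flagged alongside personal names, which is the higher false positive rate the approach accepts in exchange for fewer misses.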
In practice
To install the current version of the pidpos package, use
the following code:
# install.packages("pak")
pak::pkg_install("Stat-Cook/pidpos")
The intended workflow breaks down into three stages:
- Detection of PID risks via pidpos()
- Preparation of redaction rules via report_to_redaction_rules() and auto_replace()
- Redaction of the original data via redact()
To illustrate this, we include a subset of the friends
package data set:
#> # A tibble: 20 × 4
#> scene utterance speaker text
#> <int> <int> <chr> <chr>
#> 1 1 1 Scene Directions [Scene: Central Perk, everyone is there.]
#> 2 1 2 Phoebe Buffay Oh, Ross, Mon, is it okay if I bring someone…
#> 3 1 3 Monica Geller Yeah.
#> 4 1 4 Ross Geller Sure. Yeah.
#> # ℹ 16 more rows
First, generate a PID report:
report <- pidpos(example_data)
head(report)
#> # A tibble: 6 × 6
#> ID Token Sentence Document Repeats `Affected Columns`
#> <glue> <chr> <chr> <chr> <int> <chr>
#> 1 Col:text Row:1 Central [Scene: Central… [Scene:… 1 `text`
#> 2 Col:text Row:1 Perk [Scene: Central… [Scene:… 1 `text`
#> 3 Col:speaker Row:2 Phoebe Phoebe Buffay Phoebe … 3 `speaker`
#> 4 Col:speaker Row:2 Buffay Phoebe Buffay Phoebe … 3 `speaker`
#> # ℹ 2 more rows
The report lists all detected proper nouns alongside their source
variable and position. By default, pidpos() uses the
udpipe framework for POS tagging, but the package is
designed to support alternative taggers. A ready-made script for using
spaCy is included, and users may supply a custom tagging function,
allowing the package to be integrated into existing NLP pipelines.
Further details are provided in Custom
Functions.
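Because the report is printed as a tibble, ordinary data-frame tools can support the human review step. The sketch below is not part of pidpos: it rebuilds four rows of the report shown above as a toy data frame, keeping only the `Token` and `Affected Columns` columns, and groups the flagged tokens by the column they came from so that a reviewer can work through one variable at a time.

```r
# Toy stand-in mirroring four rows of the report printed above.
toy_report <- data.frame(
  Token = c("Central", "Perk", "Phoebe", "Buffay"),
  `Affected Columns` = c("`text`", "`text`", "`speaker`", "`speaker`"),
  check.names = FALSE,        # keep the space in the column name
  stringsAsFactors = FALSE
)

# Number of flagged tokens per source column.
tokens_per_column <- table(toy_report[["Affected Columns"]])

# The flagged tokens themselves, grouped by source column.
tokens_by_column <- split(toy_report$Token, toy_report[["Affected Columns"]])

print(tokens_per_column)
print(tokens_by_column)
```

Grouping by column makes it easier to see when a whole variable, such as a speaker name, is identifying by design, while flags in free text can be reviewed case by case.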
To redact the data as well as identify risks, convert the report into redaction rules and attach replacement values:
raw_rules <- report_to_redaction_rules(report)
replacement_func <- make_random_replacement()
prepared_replacements <- auto_replace(raw_rules, replacement_func)
head(prepared_replacements)
#> # A tibble: 6 × 3
#> If From To
#> <chr> <chr> <chr>
#> 1 [Scene: Central Perk, everyone is there.] Central MWLEFOODFI
#> 2 [Scene: Central Perk, everyone is there.] Perk EEXOMBWKKE
#> 3 Phoebe Buffay Phoebe ULBFDPBCYV
#> 4 Phoebe Buffay Buffay KGUFIVJUBQ
#> # ℹ 2 more rows
Users may define replacement values manually or use the built-in automatic replacement tools, which include options such as random replacement and encryption. Full documentation of the available replacement strategies is provided in Automatic Replacement Tools.
Finally, apply the rules to produce a redacted dataset:
redacted_data <- redact(example_data, prepared_replacements)
head(redacted_data)
#> # A tibble: 6 × 4
#> scene utterance speaker text
#> <int> <int> <chr> <chr>
#> 1 1 1 Scene Directions [Scene: MWLEFOODFI EEXOMBWKKE, everyone…
#> 2 1 2 ULBFDPBCYV KGUFIVJUBQ Oh, TXVAFBSNAY, OREHALSPKZ, is it okay …
#> 3 1 3 CBWJCFDJAG FVSYBHAAGL Yeah.
#> 4 1 4 TXVAFBSNAY FVSYBHAAGL Sure. Yeah.
#> # ℹ 2 more rows
Multiple file API
When a project involves multiple files, three additional functions support batch processing:
- report_on_folder() to generate PID reports
- get_distinct_redaction_rules() to combine the distinct reports into a single set of raw redactions
- redact_at_folder() to produce redacted copies of the data
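The general shape of folder-level redaction can be sketched in base R. This sketch does not use or reproduce the pidpos functions: `apply_rules()` and `redact_csv_folder()` are hypothetical helpers. The meaning given to the `If`, `From` and `To` columns (when a cell equals `If`, replace each literal `From` with `To`) is an assumption read off the printed rule table, and the CSV input format is also an assumption.

```r
# Apply If/From/To rules to every character column of a data frame.
# Assumed semantics (not taken from the pidpos source): when a cell equals
# `If`, each literal occurrence of `From` in that cell is replaced by `To`.
apply_rules <- function(data, rules) {
  char_cols <- names(data)[vapply(data, is.character, logical(1))]
  for (col in char_cols) {
    original <- data[[col]]  # always match against the unredacted values
    for (i in seq_len(nrow(rules))) {
      hit <- !is.na(original) & original == rules$If[i]
      data[[col]][hit] <- gsub(rules$From[i], rules$To[i],
                               data[[col]][hit], fixed = TRUE)
    }
  }
  data
}

# Hypothetical helper: redact every CSV file in `in_dir` and write the
# redacted copies, under the same file names, to `out_dir`.
redact_csv_folder <- function(in_dir, out_dir, rules) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
  for (path in files) {
    data <- read.csv(path, stringsAsFactors = FALSE)
    write.csv(apply_rules(data, rules),
              file.path(out_dir, basename(path)), row.names = FALSE)
  }
  invisible(file.path(out_dir, basename(files)))
}

# Demonstration on a temporary folder holding one small file.
in_dir  <- tempfile("pid_in")
out_dir <- tempfile("pid_out")
dir.create(in_dir)
write.csv(
  data.frame(speaker = c("Phoebe Buffay", "Monica Geller"),
             stringsAsFactors = FALSE),
  file.path(in_dir, "scene.csv"), row.names = FALSE
)

rules <- data.frame(
  If   = c("Phoebe Buffay", "Phoebe Buffay"),
  From = c("Phoebe", "Buffay"),
  To   = c("ULBFDPBCYV", "KGUFIVJUBQ"),
  stringsAsFactors = FALSE
)

written  <- redact_csv_folder(in_dir, out_dir, rules)
redacted <- read.csv(written[1], stringsAsFactors = FALSE)
print(redacted)
```

Using one shared rule table for all files means the same name receives the same replacement everywhere, so records can still be linked across the redacted copies.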
Current applications
The pidpos package was developed for use in the
NuRS and AmReS research projects, which aim to extract and analyse
retrospective operational data from NHS Trusts to understand staff
retention and patient safety.
Contributions
RC, MA and SJ designed the package. RC implemented it and wrote the documentation, and MA carried out quality assurance. RC and SJ secured the funding for the work.
Acknowledgements
The development of pidpos was part of the NuRS and AmReS
projects funded by the Health Foundation.