
Reporting on every file in a folder
The PID.POS report
A common struggle in research is how data controllers can gain some level of confidence that large data sets do not contain personally identifiable data (PID). In some cases this requires only a brief inspection, because columns that often contain PID, such as name or ID, are obvious. However, data sets can contain broad free text fields or fields that are only needed in a small number of cases, or columns may have been shifted, placing PID in harder-to-detect locations. If a data set consists of roughly 10,000 or more observations, manual inspection has only a limited chance of finding rare PID, quite apart from the resource cost of making even a single pass over the data.
To help overcome these issues, as part of the PID.POS package we have implemented an API for the automated production of proper noun reports on all files found within the same directory. The intention is that, should a collection of data sets be required for transfer, they can be placed in a single location and the reports generated.
To demonstrate how this function works, we have supplied a collection of data sets featuring free text with the package. The free text data draws on the janeaustenr package, constructing three csv files:
Emma.csv
PridePrejudice.csv
SenseSensability.csv
with each file consisting of 1000 rows and four columns (a primary key, a line of text from the book, a reference category, and the string-length of the text). The first step in processing the data is to identify where the files are in your local folder structure:
library(pid.pos)
data_path <- system.file("vignette_data", package = "pid.pos")
list.files(data_path)
#> [1] "Emma.csv" "PridePrejudice.csv" "SenseSensability.csv"
#> [4] "Temp.csv"
and we check the files are the intended data:
emma.csv <- system.file("vignette_data", "Emma.csv", package = "pid.pos")
# kable() comes from knitr; calling it with the namespace avoids a missing library() call
knitr::kable(read.csv(emma.csv, nrows = 5))
Doc.ID | Text | Reference | Length |
---|---|---|---|
1 | | 54HhG | 0 |
2 | much better.” | 54HhG | 13 |
3 | very unwell, which he had had no previous suspicion of–and there was | BT26y | 69 |
4 | all there without me.” | BT26y | 22 |
5 | | 6QBpq | 0 |
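The same spot check can be repeated for every file in a folder rather than one file at a time. The following is a minimal sketch in base R. The helper name preview_folder is hypothetical and is not part of the package, and the sketch assumes every file of interest has a .csv extension.

```r
# Hypothetical helper (not part of pid.pos): read the first few rows of
# every csv file in a folder so the contents can be checked by eye.
preview_folder <- function(path, n = 5) {
  files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
  previews <- lapply(files, read.csv, nrows = n)
  names(previews) <- basename(files)
  previews
}

# Small self-contained demonstration using a fresh temporary folder.
demo_dir <- tempfile("preview_demo")
dir.create(demo_dir)
write.csv(data.frame(Doc.ID = 1:10, Text = rep("some text", 10)),
          file.path(demo_dir, "A.csv"), row.names = FALSE)
write.csv(data.frame(Doc.ID = 1:3, Text = c("x", "y", "z")),
          file.path(demo_dir, "B.csv"), row.names = FALSE)

previews <- preview_folder(demo_dir)
names(previews)
```

With the vignette data, preview_folder(data_path) would return one small data frame per file, which can then be printed or passed to a table formatter.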
To generate reports we call report_on_folder, which takes three arguments:
- data_path: the path to the data directory
- report_dir: a system path to where the proper noun reports should be saved
- to_remove [optional]: a vector of columns to be ignored, e.g. primary keys.
report_on_folder(data_path, report_dir = "Proper Noun Report")
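Conceptually, a folder-level report loops over the files, drops any ignored columns, and writes one output file per input file. The sketch below illustrates that pattern in base R only. It is not the package's implementation: sketch_folder_loop is a hypothetical name, the proper noun detection step is replaced by a placeholder that records which text columns would be scanned, and to_remove is assumed to be a character vector of column names.

```r
# Hypothetical sketch of the "one report per file" pattern; this is NOT the
# pid.pos implementation. The detection step is a placeholder that simply
# lists the character columns that would be scanned.
sketch_folder_loop <- function(data_path, report_dir, to_remove = NULL) {
  dir.create(report_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(data_path, pattern = "\\.csv$", full.names = TRUE)
  for (f in files) {
    dat <- read.csv(f, stringsAsFactors = FALSE)
    # Drop columns the caller asked to ignore, e.g. primary keys.
    dat <- dat[, setdiff(names(dat), to_remove), drop = FALSE]
    # Placeholder for proper noun detection: keep character columns only.
    is_text <- vapply(dat, is.character, logical(1))
    out <- data.frame(scanned_column = names(dat)[is_text])
    # One output file per input file, sharing the input's file name.
    write.csv(out, file.path(report_dir, basename(f)), row.names = FALSE)
  }
  invisible(basename(files))
}

# Demonstration on a temporary folder with a single toy data set.
in_dir <- tempfile("data")
dir.create(in_dir)
out_dir <- tempfile("reports")
write.csv(data.frame(Doc.ID = c("a1", "a2"),
                     Text = c("Emma went home", "Mr Knightley called")),
          file.path(in_dir, "Example.csv"), row.names = FALSE)

sketch_folder_loop(in_dir, out_dir, to_remove = "Doc.ID")
list.files(out_dir)
```

Because the key column is named in to_remove, only the Text column reaches the placeholder scan. This is the reason for ignoring primary keys: they carry no free text and would only add noise to a report.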
Once report_on_folder has been evaluated, the report_dir folder is populated with a set of csv files, one per data set found at data_path:
browseURL("Proper Noun Report")
Each of these files consists of 5 columns:
- doc_id: a reference to where the proper noun was detected
- token: the proper noun detected
- sentence: the full free text field
- Repeats: how many times sentence appeared in the data set
- Affected Columns: all the columns that sentence occurred in.
read.csv("Proper Noun Report/Emma.csv")
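To make the last two columns concrete, the sketch below derives a repeat count and a list of affected columns from a toy table of detections. It uses base R only. The input layout (one row per place a sentence was found) and the names hits and summary_tbl are assumptions made for illustration, and the package may compute these fields differently.

```r
# Toy input: one row per location where a flagged sentence was found.
hits <- data.frame(
  sentence = c("Emma went home", "Emma went home", "Mr Knightley called"),
  column   = c("Text", "Notes", "Text"),
  stringsAsFactors = FALSE
)

# Number of times each sentence appears.
repeats <- tapply(hits$column, hits$sentence, length)

# Distinct columns each sentence appears in, collapsed to a single string.
affected <- tapply(hits$column, hits$sentence,
                   function(x) paste(sort(unique(x)), collapse = ", "))

# Both tapply() results are ordered by sentence, so they line up row by row.
summary_tbl <- data.frame(
  sentence = names(repeats),
  Repeats = as.integer(repeats),
  `Affected Columns` = as.character(affected),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
summary_tbl
```

A sentence that repeats many times across several columns is often a template or a shifted field, so sorting a report by Repeats can be a quick way to prioritise manual review.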