Out of the box, pidpos() relies on the udpipe package to perform part-of-speech (POS) tagging.
However, the pidpos API is intentionally flexible and
allows users to:
- Supply their own custom POS tagging functions
- Implement custom filtering logic
- Integrate alternative tagging engines (e.g., Python libraries)
This vignette demonstrates each of these extensions.
A custom tagger must:
- Accept a single character string (a sentence)
- Return a data frame with one row per token
A custom filter must:
- Accept the tagger’s returned data frame
- Return a filtered version of that data frame
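The two contracts above can be tried out in base R before involving pidpos() at all. The following is a minimal sketch, not part of the package: check_tagger_output(), toy_tagger() and toy_filter() are hypothetical names, and the column names Token and upos are assumptions taken from the examples later in this vignette.

```r
# Hypothetical helper (not part of pidpos): verify that a tagger's return
# value is a data frame with a character token column and a POS column.
# The default column names are assumptions based on this vignette's examples.
check_tagger_output <- function(tagged, token_col = "Token", pos_col = "upos") {
  stopifnot(
    is.data.frame(tagged),
    token_col %in% names(tagged),
    pos_col %in% names(tagged),
    is.character(tagged[[token_col]])
  )
  invisible(tagged)
}

# A deliberately trivial tagger: one row per whitespace-separated token,
# every token tagged "X".
toy_tagger <- function(sentence) {
  tokens <- strsplit(sentence, "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  data.frame(
    Token = tokens,
    upos = rep("X", length(tokens)),
    stringsAsFactors = FALSE
  )
}

# A filter receives that data frame and returns a subset of its rows.
toy_filter <- function(tagged) {
  tagged[tagged$upos == "X", , drop = FALSE]
}

tagged <- check_tagger_output(toy_tagger("Joey sat down"))
toy_filter(tagged)
```

Running a check like this on a single sentence is a cheap way to catch a misnamed column before passing the tagger to pidpos().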
We begin by loading the package:
We will use the example dataset:
example.data <- head(the_one_in_massapequa, 20)
example.data
#> # A tibble: 20 × 4
#> scene utterance speaker text
#> <int> <int> <chr> <chr>
#> 1 1 1 Scene Directions "[Scene: Central Perk, everyone is there.]"
#> 2 1 2 Phoebe Buffay "Oh, Ross, Mon, is it okay if I bring someo…
#> 3 1 3 Monica Geller "Yeah."
#> 4 1 4 Ross Geller "Sure. Yeah."
#> 5 1 5 Joey Tribbiani "So, who's the guy?"
#> 6 1 6 Phoebe Buffay "Well, his name is Parker and I met him at …
#> 7 1 7 Chandler Bing "Oooh, did he put a little starch in your b…
#> 8 1 8 Phoebe Buffay "Yeah, he's really great though. He has thi…
#> 9 1 9 Monica Geller "Oh, by the way. Would it be okay if I gave…
#> 10 1 10 Ross Geller "Uh, yeah, you sure you want to after what …
#> 11 1 11 Monica Geller "Yeah, I'd really like to."
#> 12 1 12 Ross Geller "Okay, hopefully this time mom won't boo yo…
#> 13 1 13 Monica Geller "Yes! Every year Ross makes the toast, and …
#> 14 1 14 Chandler Bing "And you wonder why Ross is their favorite?"
#> 15 1 15 Monica Geller "No! Really! Any time Ross makes a toast ev…
#> 16 1 16 Joey Tribbiani "Well I can promise you, at least one perso…
#> 17 1 17 Monica Geller "Really you can do that?"
#> 18 1 18 Joey Tribbiani "Are you kidding me? Watch! Well I can't do…
#> 19 2 1 Scene Directions "[Scene: Chandler and Monica's, they're get…
#> 20 2 2 Chandler Bing "What are you doing?"
Custom POS Tagging Functions
As a minimal example, suppose we want to tag only the main characters from Friends as proper nouns.
friends <- c("Joey", "Phoebe", "Ross", "Chandler", "Monica", "Rachel")
We define a sentence-level tagging function:
sentence_tagger <- function(sentence) {
  # Strip punctuation, then split on whitespace
  clean_text <- gsub("[[:punct:]]", "", sentence)
  tokens <- strsplit(clean_text, "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  # One row per token; only the listed names are tagged as proper nouns
  tibble::tibble(
    Token = tokens,
    upos = ifelse(tokens %in% friends, "PROPN", "XXX")
  )
}
sentence_tagger("Joey sat in Central Perk")
#> # A tibble: 5 × 2
#> Token upos
#> <chr> <chr>
#> 1 Joey PROPN
#> 2 sat XXX
#> 3 in XXX
#> 4 Central XXX
#> 5 Perk XXX
To use this with pidpos(), we wrap it with custom_tagger():
friends_tagger <- custom_tagger(sentence_tagger)
friends_tagger(example.data$text)
#> # A tibble: 263 × 4
#> Token upos ID Sentence
#> <chr> <chr> <int> <chr>
#> 1 Scene XXX 1 [Scene: Central Perk, everyone is there.]
#> 2 Central XXX 1 [Scene: Central Perk, everyone is there.]
#> 3 Perk XXX 1 [Scene: Central Perk, everyone is there.]
#> 4 everyone XXX 1 [Scene: Central Perk, everyone is there.]
#> 5 is XXX 1 [Scene: Central Perk, everyone is there.]
#> 6 there XXX 1 [Scene: Central Perk, everyone is there.]
#> 7 Oh XXX 2 Oh, Ross, Mon, is it okay if I bring someone to your pa…
#> 8 Ross PROPN 2 Oh, Ross, Mon, is it okay if I bring someone to your pa…
#> 9 Mon XXX 2 Oh, Ross, Mon, is it okay if I bring someone to your pa…
#> 10 is XXX 2 Oh, Ross, Mon, is it okay if I bring someone to your pa…
#> # ℹ 253 more rows
We can now supply it to pidpos():
result <- pidpos(example.data, tagger = friends_tagger)
result
#> # A tibble: 10 × 6
#> ID Token Sentence Document Repeats `Affected Columns`
#> * <glue> <chr> <chr> <chr> <int> <chr>
#> 1 Col:speaker Row:2 Phoebe "Phoebe Buffa… "Phoebe… 3 `speaker`
#> 2 Col:text Row:2 Ross "Oh, Ross, Mo… "Oh, Ro… 1 `text`
#> 3 Col:speaker Row:3 Monica "Monica Gelle… "Monica… 6 `speaker`
#> 4 Col:speaker Row:4 Ross "Ross Geller" "Ross G… 3 `speaker`
#> 5 Col:speaker Row:5 Joey "Joey Tribbia… "Joey T… 3 `speaker`
#> 6 Col:speaker Row:7 Chandler "Chandler Bin… "Chandl… 3 `speaker`
#> 7 Col:text Row:13 Ross "Yes! Every y… "Yes! E… 1 `text`
#> 8 Col:text Row:14 Ross "And you wond… "And yo… 1 `text`
#> 9 Col:text Row:15 Ross "No! Really! … "No! Re… 1 `text`
#> 10 Col:text Row:19 Chandler "[Scene: Chan… "[Scene… 1 `text`
Custom Filtering Functions
The default filter_to_proper_noun() assumes:
- Proper nouns are labeled "PROPN"
- The POS column is named "upos"
If your tagger uses different conventions, you must supply a custom filter.
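Because a filter is just a function from data frame to data frame, the column name and the proper-noun labels can be treated as parameters. The sketch below shows one way to do that in base R; make_tag_filter() is a hypothetical helper, not part of pidpos, and the tag column with its "NP" and "NPS" labels is a made-up convention used only for illustration.

```r
# Hypothetical helper (not part of pidpos): build a filter function for an
# arbitrary POS column name and set of labels to keep.
make_tag_filter <- function(pos_col, keep_tags) {
  # Evaluate the arguments now so the returned closure captures fixed values
  force(pos_col)
  force(keep_tags)
  function(tag_frm) {
    tag_frm[tag_frm[[pos_col]] %in% keep_tags, , drop = FALSE]
  }
}

# Made-up convention for illustration: column "tag", labels "NP" and "NPS"
np_filter <- make_tag_filter("tag", c("NP", "NPS"))

tagged <- data.frame(
  Token = c("Joey", "sat", "down"),
  tag = c("NP", "VB", "RP"),
  stringsAsFactors = FALSE
)
np_filter(tagged)
```

Hard-coding the column and label in a small dedicated function is equally valid for a one-off analysis.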
For example, suppose:
- Proper nouns are labeled "NNP"
- The POS column is named "POS"
We define:
custom_filter <- function(tag_frm) {
  dplyr::filter(tag_frm, POS == "NNP")
}
And a compatible tagger:
friends_tagger2 <- custom_tagger(function(sentence) {
  if (is.na(sentence)) {
    return(tibble::tibble(Token = character(), POS = character()))
  }
  clean_text <- gsub("[[:punct:]]", "", sentence)
  tokens <- strsplit(clean_text, "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  tibble::tibble(
    Token = tokens,
    POS = ifelse(tokens %in% friends, "NNP", "XXX")
  )
})
Now use both:
pidpos(
  example.data,
  tagger = friends_tagger2,
  filter = custom_filter
)
#> # A tibble: 10 × 7
#> Token POS ID Sentence Document Repeats `Affected Columns`
#> * <chr> <chr> <glue> <chr> <chr> <int> <chr>
#> 1 Phoebe NNP Col:speaker Row:2 "Phoebe… "Phoebe… 3 `speaker`
#> 2 Ross NNP Col:text Row:2 "Oh, Ro… "Oh, Ro… 1 `text`
#> 3 Monica NNP Col:speaker Row:3 "Monica… "Monica… 6 `speaker`
#> 4 Ross NNP Col:speaker Row:4 "Ross G… "Ross G… 3 `speaker`
#> 5 Joey NNP Col:speaker Row:5 "Joey T… "Joey T… 3 `speaker`
#> 6 Chandler NNP Col:speaker Row:7 "Chandl… "Chandl… 3 `speaker`
#> 7 Ross NNP Col:text Row:13 "Yes! E… "Yes! E… 1 `text`
#> 8 Ross NNP Col:text Row:14 "And yo… "And yo… 1 `text`
#> 9 Ross NNP Col:text Row:15 "No! Re… "No! Re… 1 `text`
#> 10 Chandler NNP Col:text Row:19 "[Scene… "[Scene… 1 `text`
Using Other Tagging Frameworks (Python Example)
The tagging engine can also be replaced entirely. For example,
Python’s nltk library provides POS tagging
functionality.
The following Python function (included in the package as
nltk_function.py) defines a simple wrapper:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
def nltk_pos_tagger(sentence):
    tokens = word_tokenize(sentence)
    return pos_tag(tokens)
We can bridge this into R using reticulate (see this vignette for an intro to using reticulate).
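The central step in such a bridge is turning a list of (token, tag) pairs into a token-level data frame. The sketch below shows one way to do this in base R, using a hand-built list as a stand-in so that it runs without Python. It assumes the Python result arrives in R as a list of two-element lists, which is how reticulate is expected to convert a list of tuples; pairs_to_frame() is a hypothetical helper, not part of pidpos.

```r
# Stand-in for what a Python tagger might return through reticulate:
# one two-element list (token, tag) per token. Hand-built; no Python needed.
tagged <- list(
  list("Joey", "NNP"),
  list("sat", "VBD"),
  list("down", "RP")
)

# Hypothetical helper (not part of pidpos): extract the first and second
# element of every pair into two character columns.
pairs_to_frame <- function(tagged) {
  data.frame(
    Token = vapply(tagged, function(pair) as.character(pair[[1]]), character(1)),
    POS = vapply(tagged, function(pair) as.character(pair[[2]]), character(1)),
    stringsAsFactors = FALSE
  )
}

pairs_to_frame(tagged)

# An empty input yields a zero-row data frame with the same columns
pairs_to_frame(list())
```

Extracting the columns with vapply() keeps them as plain character vectors and gives a well-formed empty result when a sentence produces no tokens.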
Note: This section requires a working Python installation and the nltk module.
If these are not available, the code below will be skipped.
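reticulate::use_virtualenv("YOUR ENVIRONMENT", required = TRUE)
if (!reticulate::py_module_available("nltk")) {
  reticulate::py_install("nltk", pip = TRUE)
}
# nltk_pos_tagger() must already be available in the R session, for example
# by loading nltk_function.py with reticulate::source_python().
nltk_tagger <- custom_tagger(function(sentence) {
  tagged <- nltk_pos_tagger(sentence)
  # Bind the (token, tag) pairs row-wise into a two-column tibble
  tagged_frm <- tibble::as_tibble(
    do.call(rbind, tagged),
    .name_repair = "minimal"
  )
  colnames(tagged_frm) <- c("Token", "POS")
  # Coerce both columns to plain character vectors
  dplyr::mutate(
    tagged_frm,
    dplyr::across(dplyr::everything(), as.character)
  )
})
# Keep singular and plural proper nouns (Penn Treebank tags)
nltk_filter <- function(frm) {
  dplyr::filter(frm, POS %in% c("NNP", "NNPS"))
}
pidpos(
  example.data,
  tagger = nltk_tagger,
  filter = nltk_filter
)
Summary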
The pidpos() API separates:
- Tagging logic
- Filtering logic
This design allows users to:
- Implement lightweight dictionary-based taggers
- Swap in alternative POS frameworks
- Integrate external engines such as Python libraries
- Customize filtering criteria without modifying tagging code
By adhering to the simple contract:
- Taggers return a token-level data frame
- Filters return a subset of that data frame
pidpos() can operate with virtually any POS tagging system.
