Fuzzy (Approximate) String Matching
match_fz.Rd
Wraps stringdist::amatch() to perform "fuzzy" (approximate) string
matching while providing more informative output. Instead of an integer
vector of best match positions, this function returns a tibble with the
input, its corresponding best match, and the approximate string distance.
Arguments
- x
elements to be approximately matched: will be coerced to character unless it is a list consisting of integer vectors.
- table
lookup table for matching. Will be coerced to character unless it is a list consisting of integer vectors.
- method
Matching algorithm to use. See stringdist-metrics.
- maxDist
Elements in x will not be matched with elements of table if their distance is larger than maxDist. Note that the maximum distance between strings depends on the method: it should always be specified.
- ...
arguments passed on to stringdist::amatch()
Value
A tibble with 3 columns:
x: the input
table_match: the closest match of x
dist: the distance between x and its closest match (given the method selected)
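The following is a minimal sketch of the kind of output described above, built directly from stringdist::amatch() and stringdist::stringdist() rather than from this function's actual source. The sample vectors, the variable names (x, lookup, idx, result), and the choice of method = "osa" with maxDist = 2 are illustrative assumptions.

```r
library(stringdist)
library(tibble)

# Made-up sample data (assumption, not from the package)
x      <- c("aple", "bananna", "kiwi")
lookup <- c("apple", "banana", "cherry")

# amatch() returns the position of the best match in `lookup`,
# or NA when no element is within maxDist
idx <- amatch(x, lookup, method = "osa", maxDist = 2)

# Turn positions into the matched strings and attach the distance
table_match <- lookup[idx]
result <- tibble(
  x           = x,
  table_match = table_match,
  dist        = stringdist(x, table_match, method = "osa")
)

print(result)
```

Elements with no match within maxDist (here "kiwi") come back as NA in both table_match and dist.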
NOTES
Fuzzy string matching is SLOW. Expect this function to take >1 min for comparisons of more than 500 values for all methods.
For comparison of citation titles specifically, the "lcs" method is faster
than "osa" and seems to work better. Based on light experimentation, a good
maxDist value for citation titles is between 80 and 115.
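As a rough illustration of why maxDist depends on the method, the sketch below compares "lcs" and "osa" distances using stringdist directly. The "lcs" distance counts the characters left unpaired in both strings, so a single substitution costs 2 and distances grow with the length of the differing text. The example titles and thresholds are made up for illustration and are not the values recommended above.

```r
library(stringdist)

# A single substitution: 1 edit under "osa", 2 unpaired characters under "lcs"
d_osa <- stringdist("cat", "cut", method = "osa")
d_lcs <- stringdist("cat", "cut", method = "lcs")

# Hypothetical citation titles (assumption, for illustration only)
title  <- "Fuzzy matching of titles"
lookup <- c("Fuzzy matching of citation titles",
            "An unrelated paper about birds")

# The first lookup title differs by the 9 inserted characters "citation "
d_title <- stringdist(title, lookup[1], method = "lcs")

# Too strict a threshold yields no match; a looser one finds the first title
strict <- amatch(title, lookup, method = "lcs", maxDist = 5)
loose  <- amatch(title, lookup, method = "lcs", maxDist = 10)
```

Here strict is NA and loose is 1, which shows how a threshold that is too small for the chosen method silently drops plausible matches.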