Function std.numeric.gapWeightedSimilarity
The so-called "all-lengths gap-weighted string kernel" computes a
similarity measure between s
and t
based on all of their
common subsequences of all lengths. Gapped subsequences are also
included.
F gapWeightedSimilarity(alias comp, R1, R2, F)
(
R1 s,
R2 t,
F lambda
)
if (isRandomAccessRange!R1 && hasLength!R1 && isRandomAccessRange!R2 && hasLength!R2);
To understand what gapWeightedSimilarity(s, t, lambda)
computes,
consider first the case lambda = 1
and the strings s =
["Hello", "brave", "new", "world"]
and t = ["Hello", "new",
"world"]
. In that case, gapWeightedSimilarity
counts the
following matches:
- three matches of length 1, namely
"Hello"
,"new"
, and"world"
; - three matches of length 2, namely (
"Hello", "new"
), ("Hello", "world"
), and ("new", "world"
); - one match of length 3, namely (
"Hello", "new", "world"
).
The call gapWeightedSimilarity(s, t, 1)
simply counts all of
these matches and adds them up, returning 7.
string[] s = ["Hello", "brave", "new", "world"];
string[] t = ["Hello", "new", "world"];
assert(gapWeightedSimilarity(s, t, 1) == 7);
Note how the gaps in matching are simply ignored, for example ("Hello", "new"
) is deemed as good a match as ("new",
"world"
). This may be too permissive for some applications. To
eliminate gapped matches entirely, use lambda = 0
:
string[] s = ["Hello", "brave", "new", "world"];
string[] t = ["Hello", "new", "world"];
assert(gapWeightedSimilarity(s, t, 0) == 4);
The call above eliminated the gapped matches ("Hello", "new"
),
("Hello", "world"
), and ("Hello", "new", "world"
) from the
tally. That leaves only 4 matches.
The most interesting case is when gapped matches still participate in
the result, but not as strongly as ungapped matches. The result will
be a smooth, fine-grained similarity measure between the input
strings. This is where values of lambda
between 0 and 1 enter
into play: gapped matches are exponentially penalized with the
number of gaps with base lambda
. This means that an ungapped
match adds 1 to the return value; a match with one gap in either
string adds lambda
to the return value; ...; a match with a total
of n
gaps in both strings adds pow(lambda, n)
to the return
value. In the example above, we have 4 matches without gaps, 2 matches
with one gap, and 1 match with three gaps. The latter match is ("Hello", "world"
), which has two gaps in the first string and one gap
in the second string, totaling to three gaps. Summing these up we get
4 + 2 * lambda + pow(lambda, 3)
.
string[] s = ["Hello", "brave", "new", "world"];
string[] t = ["Hello", "new", "world"];
assert(gapWeightedSimilarity(s, t, 0.5) == 4 + 0.5 * 2 + 0.125);
gapWeightedSimilarity
is useful wherever a smooth similarity
measure between sequences allowing for approximate matches is
needed. The examples above are given with words, but any sequences
with elements comparable for equality are allowed, e.g. characters or
numbers. gapWeightedSimilarity
uses a highly optimized dynamic
programming implementation that needs 16 * min(s
extra bytes of memory and Ο(s
) time
to complete.
Authors
Andrei Alexandrescu, Don Clugston, Robert Jacques, Ilya Yaroshenko