Getting Started
Installing YAKE!
Install YAKE! using pip:
Usage (Command line)
How to use it from the command line:
Keyword Deduplication Methods
YAKE uses three methods to compute string similarity during keyword deduplication:
1. levs (Levenshtein Similarity)
- What it is: Measures the edit distance between two strings, that is, how many operations (insertions, deletions, substitutions) are needed to turn one string into another.
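- Formula used: similarity = 1 - distance / max(len(s1), len(s2)), where distance is the edit distance between the strings s1 and s2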
- Best for: Very accurate for small changes (e.g., "house" vs "horse")
- Performance: Medium speed
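As an illustration, a Levenshtein-based similarity can be sketched in pure Python. This is not YAKE's own code: the function names are hypothetical, and the sketch assumes the edit distance is normalized by the length of the longer string.

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def levs_similarity(s1: str, s2: str) -> float:
    """Similarity in [0, 1]; assumes normalization by the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein_distance(s1, s2) / longest


print(levs_similarity("casa", "caso"))  # one substitution over four characters
```

Under this assumption, a single edit costs less similarity in long strings than in short ones, which is why the method suits typos and small changes.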
2. jaro (Jaro Similarity)
- What it is: Measures similarity based on matching characters and their relative positions
- Implementation: Uses the jellyfish library
- Best for: Strings that differ by transpositions (e.g., "maria" vs "maira")
- Performance: Fast
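To show what the Jaro measure computes, here is a from-scratch sketch of the textbook Jaro formula. It is for illustration only: it is not the jellyfish implementation, and the function name is hypothetical.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two equal characters only count as a match within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2

    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that line up out of order are half-transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


print(round(jaro_similarity("maria", "maira"), 2))
```

For "maria" vs "maira" all five characters match and only one swap is counted, so the score stays high despite the reordering.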
3. seqm (SequenceMatcher Ratio)
- What it is: Uses Python's built-in difflib.SequenceMatcher
- Formula: ratio = 2 * M / T, where M is the number of matching characters, and T is the total number of characters in both strings.
- Best for: Good for detecting shared blocks in longer strings
- Performance: Fast
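The ratio can be reproduced with the standard library alone. The sketch below calls difflib.SequenceMatcher directly and also recomputes the value from M and T; the helper names are hypothetical and this is not YAKE's internal code.

```python
from difflib import SequenceMatcher


def seqm_ratio(s1: str, s2: str) -> float:
    """Similarity as reported by SequenceMatcher.ratio()."""
    return SequenceMatcher(None, s1, s2).ratio()


def seqm_ratio_by_hand(s1: str, s2: str) -> float:
    """Recompute 2 * M / T from the matching blocks."""
    matcher = SequenceMatcher(None, s1, s2)
    # M: total size of all matching blocks (the final sentinel block has size 0).
    m = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of characters in both strings.
    t = len(s1) + len(s2)
    return 2 * m / t if t else 1.0


print(seqm_ratio("casa", "caso"))          # M = 3 ("cas"), T = 8
print(seqm_ratio_by_hand("casa", "caso"))  # same value, computed from M and T
```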
Comparison Table
Method | Based on | Best for | Performance |
---|---|---|---|
levs | Edit operations | Typos and small changes | Medium |
jaro | Matching positions | Names and short strings with swaps | Fast |
seqm | Common subsequences | General phrase similarity | Fast |
Practical Examples
Compared Strings | levs | jaro | seqm |
---|---|---|---|
"casa" vs "caso" | 0.75 | 0.83 | 0.75 |
"machine" vs "mecine" | 0.71 | 0.88 | 0.82 |
"apple" vs "a pple" | 0.8 | 0.93 | 0.9 |
Recommendation: For general use with a good balance of speed and accuracy, seqm is a solid default (and it is YAKE's default). For stricter lexical similarity, choose levs. For names or when letter swaps are common, go with jaro.
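To make the role of a similarity function in deduplication concrete, here is a minimal sketch of a greedy filter. It is an assumption-laden illustration rather than YAKE's actual procedure: the deduplicate helper, the 0.9 threshold, and the sample candidates are all hypothetical, and the similarity function is pluggable so any of the three measures could be passed in.

```python
from difflib import SequenceMatcher


def seqm(s1: str, s2: str) -> float:
    """SequenceMatcher ratio, used here as the default similarity."""
    return SequenceMatcher(None, s1, s2).ratio()


def deduplicate(candidates, similarity=seqm, threshold=0.9):
    """Keep a candidate only if it is not too similar to one already kept.

    Candidates are assumed to arrive in order of relevance, so the first
    of several near-duplicates is the one that survives.
    """
    kept = []
    for candidate in candidates:
        if all(similarity(candidate, other) <= threshold for other in kept):
            kept.append(candidate)
    return kept


# Hypothetical candidates: the second is a near-duplicate of the first.
print(deduplicate(["machine learning", "machine learnin", "deep learning"]))
```

A lower threshold removes more near-duplicates; a threshold of 1.0 keeps everything, including exact repeats.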
Usage (Python)
How to use it from Python:
Simple usage using default parameters
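A minimal sketch of default usage, assuming the yake package is installed and exposes KeywordExtractor with an extract_keywords method. The wrapper function and its name are illustrative only.

```python
def extract_with_defaults(text):
    """Run YAKE! on `text` with its default settings."""
    import yake  # third-party dependency; install it before calling this helper

    extractor = yake.KeywordExtractor()
    # Each returned item is expected to pair a keyword with its score.
    return extractor.extract_keywords(text)
```

Calling extract_with_defaults on a paragraph of text and printing each returned item is enough to inspect the results.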
Specifying custom parameters
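A hedged sketch of custom configuration, assuming KeywordExtractor accepts the keyword arguments lan (language), n (maximum n-gram size), and top (number of keywords to return). The wrapper and its argument names are hypothetical, and other parameters such as the deduplication function are left out because their exact names may differ between versions.

```python
def extract_with_custom_parameters(text, language="en", max_ngram_size=3,
                                   num_keywords=10):
    """Run YAKE! on `text` with a few explicitly chosen settings."""
    import yake  # third-party dependency; install it before calling this helper

    extractor = yake.KeywordExtractor(
        lan=language,      # language of the input text
        n=max_ngram_size,  # longest keyword, measured in words
        top=num_keywords,  # how many keywords to return
    )
    return extractor.extract_keywords(text)
```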
Output
The lower the score, the more relevant the keyword is.
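Because lower scores mean higher relevance, ranking is an ascending sort. The sketch below assumes results are (keyword, score) pairs; the values are made up for illustration and are not real YAKE! output.

```python
# Made-up (keyword, score) pairs, not real YAKE! output.
keywords = [
    ("data science", 0.12),
    ("google", 0.03),
    ("platform", 0.25),
]

# Ascending sort: the smallest score (most relevant keyword) comes first.
ranked = sorted(keywords, key=lambda pair: pair[1])
best_keyword, best_score = ranked[0]

for keyword, score in ranked:
    print(f"{score:.2f}  {keyword}")
```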
Copyright ©2018-2025 INESC TEC. Distributed by an INESCTEC license.