Getting Started
Installing YAKE!
Install YAKE using pip; the package is published on PyPI as `yake`.
Usage (Command line)
How to use it from the command line:
Keyword Deduplication Methods
YAKE uses three methods to compute string similarity during keyword deduplication:
1. levs — Levenshtein Similarity
- What it is: Measures the edit distance between two strings — how many operations (insertions, deletions, substitutions) are needed to turn one string into another.
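- Formula used: the edit distance normalized to a similarity in [0, 1], typically `1 - distance(a, b) / max(len(a), len(b))`
- Best for: Small changes such as single-character typos (e.g., "house" vs "horse")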
- Performance: Medium speed
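To make the idea concrete, here is a minimal pure-Python sketch of a Levenshtein-based similarity. It assumes the distance is normalized by the length of the longer string; the function names are illustrative and are not YAKE's internals, which may differ in detail.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_similarity("casa", "caso"))  # 0.75
```

Under this assumption a single substitution in a four-letter word costs a quarter of the score, which is why short strings are penalized heavily for small edits.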
2. jaro — Jaro Similarity
- What it is: Measures similarity based on matching characters and their relative positions
- Implementation: Uses the `jellyfish` library
- Best for: More tolerant of transpositions (e.g., "maria" vs "maira")
- Performance: Fast
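For readers who want to see what the measure computes, the following is a sketch of the textbook Jaro similarity in plain Python. It is not the `jellyfish` code and is only meant to illustrate the definition; `jaro_similarity` here is a local helper name.

```python
def jaro_similarity(a: str, b: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0

    # Characters count as matching only within this distance of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)

    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order are half-transpositions.
    a_seq = [ch for ch, m in zip(a, a_matched) if m]
    b_seq = [ch for ch, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2

    return (
        matches / len(a)
        + matches / len(b)
        + (matches - transpositions) / matches
    ) / 3


print(round(jaro_similarity("maria", "maira"), 2))  # 0.93
```

All five characters of "maria" find a partner in "maira", and the swapped "r" and "i" count as one transposition, so the score stays high despite the reordering.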
3. seqm — SequenceMatcher Ratio
- What it is: Uses Python's built-in `difflib.SequenceMatcher`
- Formula: `ratio = 2 * M / T`, where M is the number of matching characters and T is the total number of characters in both strings.
- Best for: Detecting shared blocks in longer strings
- Performance: Fast
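The ratio can be reproduced with the standard library alone. The sketch below calls `difflib.SequenceMatcher` directly and also recomputes the value by hand from the matching blocks; the two helper names are illustrative, not part of YAKE.

```python
from difflib import SequenceMatcher


def seqm_ratio(a: str, b: str) -> float:
    """Similarity as reported by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def seqm_ratio_by_hand(a: str, b: str) -> float:
    """Recompute the same value as 2 * M / T from the matching blocks."""
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    matched = sum(block.size for block in blocks)  # M
    total = len(a) + len(b)                        # T
    return 2 * matched / total if total else 1.0


print(seqm_ratio("casa", "caso"))  # 0.75
```

For "casa" and "caso" the only matching block is "cas", so M is 3 and T is 8.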
Comparison Table
| Method | Based on | Best for | Performance |
|---|---|---|---|
| levs | Edit operations | Typos and small changes | Medium |
| jaro | Matching positions | Names and short strings with swaps | Fast |
| seqm | Common subsequences | General phrase similarity | Fast |
Practical Examples
| Compared Strings | levs | jaro | seqm |
|---|---|---|---|
| "casa" vs "caso" | 0.75 | 0.83 | 0.75 |
| "machine" vs "mecine" | 0.71 | 0.88 | 0.82 |
| "apple" vs "a pple" | 0.8 | 0.93 | 0.9 |
Recommendation: For general use with a good balance of speed and accuracy, seqm is a solid default (and it is YAKE's default). For stricter lexical similarity, choose levs. For names or when letter swaps are common, go with jaro.
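As an illustration of how a similarity function feeds into deduplication, here is a small sketch that drops any candidate too similar to a keyword already kept. The `deduplicate` helper and its `threshold` parameter are hypothetical and only approximate the idea; they are not YAKE's actual implementation.

```python
from difflib import SequenceMatcher


def seqm(a: str, b: str) -> float:
    """SequenceMatcher ratio, used here as the similarity function."""
    return SequenceMatcher(None, a, b).ratio()


def deduplicate(ranked_keywords, similarity=seqm, threshold=0.9):
    """Keep a keyword only if it is not too similar to one already kept.

    `ranked_keywords` is assumed to be ordered from most to least relevant,
    so the better-ranked variant of a near-duplicate pair survives.
    """
    kept = []
    for keyword in ranked_keywords:
        if all(similarity(keyword, other) <= threshold for other in kept):
            kept.append(keyword)
    return kept


candidates = ["machine learning", "machine learnings", "data science"]
print(deduplicate(candidates))  # ['machine learning', 'data science']
```

Swapping in a different similarity function changes which near-duplicates are caught, which is the practical effect of choosing between levs, jaro and seqm.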
Usage (Python)
How to use it from Python:
Simple usage using default parameters
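A minimal sketch of default usage, assuming the `yake` package is installed and exposes `KeywordExtractor` with an `extract_keywords` method that returns (keyword, score) pairs. The wrapper function name is illustrative.

```python
def extract_with_defaults(text: str):
    """Return (keyword, score) pairs for `text` using YAKE's default settings."""
    import yake  # third-party package; imported lazily so this file loads without it

    extractor = yake.KeywordExtractor()
    return extractor.extract_keywords(text)
```

Calling `extract_with_defaults("some long text ...")` should give a list that can be iterated as `for keyword, score in result`.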
Specifying custom parameters
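The sketch below passes explicit settings to the extractor. The keyword argument names (`lan`, `n`, `dedupLim`, `dedupFunc`, `windowsSize`, `top`) are given as best recalled from YAKE's documented examples and should be treated as assumptions; check them against the installed version, since names may differ between releases.

```python
def extract_with_custom_parameters(text: str):
    """Return (keyword, score) pairs using explicitly chosen settings."""
    import yake  # third-party package; imported lazily so this file loads without it

    extractor = yake.KeywordExtractor(
        lan="en",          # language of the input text
        n=3,               # maximum n-gram size of a keyword
        dedupLim=0.9,      # similarity threshold used for deduplication
        dedupFunc="seqm",  # similarity method: levs, jaro or seqm
        windowsSize=1,     # context window size
        top=20,            # number of keywords to return
    )
    return extractor.extract_keywords(text)
```

The `dedupFunc` value selects one of the three similarity methods described in the deduplication section.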
Output
The lower the score, the more relevant the keyword is.
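Because lower scores rank higher, results are sorted in ascending order. The sketch below uses made-up (keyword, score) pairs, not real extractor output, purely to show the ordering.

```python
# Hypothetical (keyword, score) pairs standing in for real extractor output.
keywords = [
    ("machine learning", 0.12),
    ("google", 0.03),
    ("data science", 0.07),
]


def most_relevant_first(pairs):
    """Sort ascending by score, because lower scores mean more relevant."""
    return sorted(pairs, key=lambda pair: pair[1])


for keyword, score in most_relevant_first(keywords):
    print(f"{score:.2f}  {keyword}")
```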
Copyright ©2018-2025 INESC TEC. Distributed under an INESCTEC license.