Semantic Matcher

SemanticMatcher

SemanticMatcher(embedding_model, threshold=0.7)

Semantic similarity helper built on top of EmbeddingModel.

Provides convenient methods for calculating semantic similarity between texts and matching items across lists. Scores are the cosine similarity of the embeddings; top_matches and match_lists drop results that score below threshold, while similarity returns the score unfiltered.

Attributes:

Name Type Description
embedding_model EmbeddingModel

The embedding model to use for encoding texts.

threshold float

Minimum similarity score (0-1) to consider a match. Default: 0.7.

Example
from at_scorer.ml import EmbeddingModel
from at_scorer.ml.semantic_matcher import SemanticMatcher

model = EmbeddingModel()
matcher = SemanticMatcher(model, threshold=0.75)

# Calculate similarity
score = matcher.similarity("Python developer", "Software engineer")

# Find top matches
candidates = ["Java", "Python", "JavaScript", "C++"]
matches = matcher.top_matches("Python developer", candidates, top_k=2)
# Illustrative result: [("Python", 0.95)]
# Candidates scoring below the 0.75 threshold (e.g. "JavaScript" at 0.72) are dropped.

Initialize the semantic matcher.

Parameters:

Name Type Description Default
embedding_model EmbeddingModel

The embedding model to use for encoding.

required
threshold float

Minimum similarity score to consider a match (0-1).

0.7
Source code in at_scorer/ml/semantic_matcher.py
def __init__(self, embedding_model: EmbeddingModel, threshold: float = 0.7):
    """Initialize the semantic matcher.

    Args:
        embedding_model: The embedding model to use for encoding.
        threshold: Minimum similarity score to consider a match (0-1).
    """
    self.embedding_model = embedding_model
    self.threshold = threshold
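
Judging from the source listings on this page, SemanticMatcher calls only three methods on the model: encode, encode_batch, and similarity. For unit tests that should not load a real embedding model, a stand-in with the same three methods may be enough. The sketch below is a hypothetical ToyEmbeddingModel (not part of at_scorer) that uses bag-of-words counts as a crude embedding, so its scores are not comparable to a real model's.

```python
import math
from collections import Counter


class ToyEmbeddingModel:
    """Hypothetical stand-in exposing the three methods SemanticMatcher calls."""

    def encode(self, text):
        # Word counts act as a crude "embedding" (assumption for illustration).
        return Counter(text.lower().split())

    def encode_batch(self, texts):
        return [self.encode(t) for t in texts]

    def similarity(self, vec_a, vec_b):
        # Cosine similarity over sparse word-count vectors.
        dot = sum(vec_a[word] * vec_b[word] for word in vec_a)
        norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
        norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)
```

An instance can be passed where the constructor expects an embedding model, for example SemanticMatcher(ToyEmbeddingModel(), threshold=0.5); the __init__ shown above stores the argument without any type check.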

Functions

similarity
similarity(text_a, text_b)

Calculate semantic similarity between two texts.

Parameters:

Name Type Description Default
text_a str

First text to compare.

required
text_b str

Second text to compare.

required

Returns:

Type Description
float

Cosine similarity score between 0 and 1 (higher = more similar).

Source code in at_scorer/ml/semantic_matcher.py
def similarity(self, text_a: str, text_b: str) -> float:
    """Calculate semantic similarity between two texts.

    Args:
        text_a: First text to compare.
        text_b: Second text to compare.

    Returns:
        Cosine similarity score between 0 and 1 (higher = more similar).
    """
    vec_a = self.embedding_model.encode(text_a)
    vec_b = self.embedding_model.encode(text_b)
    return self.embedding_model.similarity(vec_a, vec_b)
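
The body of EmbeddingModel.similarity is not shown on this page. As a reference point, here is a minimal sketch of plain cosine similarity, assuming embeddings are equal-length sequences of floats; the name cosine_similarity is illustrative. Raw cosine similarity ranges from -1 to 1, so the documented 0-1 range presumably depends on the embeddings themselves or on clamping inside EmbeddingModel.similarity, neither of which can be confirmed from this page.

```python
import math
from typing import Sequence


def cosine_similarity(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # no overlap
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
```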

top_matches
top_matches(query, candidates, top_k=5)

Find top-k most similar candidates to a query text.

Parameters:

Name Type Description Default
query str

The query text to match against.

required
candidates Iterable[str]

Iterable of candidate texts to search.

required
top_k int

Maximum number of matches to return.

5

Returns:

Type Description
list[tuple[str, float]]

List of (candidate, score) tuples, sorted by score descending. Only includes matches scoring at or above the threshold.

Source code in at_scorer/ml/semantic_matcher.py
def top_matches(
    self, query: str, candidates: Iterable[str], top_k: int = 5
) -> list[tuple[str, float]]:
    """Find top-k most similar candidates to a query text.

    Args:
        query: The query text to match against.
        candidates: Iterable of candidate texts to search.
        top_k: Maximum number of matches to return.

    Returns:
        List of (candidate, score) tuples, sorted by score descending.
        Only includes matches scoring at or above the threshold.
    """
    candidates_list = [c for c in candidates if c]
    if not query or not candidates_list:
        return []
    query_vec = self.embedding_model.encode(query)
    cand_vecs = self.embedding_model.encode_batch(candidates_list)
    scores = [self.embedding_model.similarity(query_vec, v) for v in cand_vecs]
    ranked = sorted(
        zip(candidates_list, scores, strict=False), key=lambda x: x[1], reverse=True
    )
    filtered = [(c, s) for c, s in ranked if s >= self.threshold]
    return filtered[:top_k]
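
The last statements of top_matches sort by score, drop everything below the threshold, and only then truncate to top_k, so fewer than top_k results can come back. The sketch below isolates that post-processing, with made-up scores standing in for model output; rank_filter_truncate is a hypothetical helper, not part of the library.

```python
def rank_filter_truncate(scored, threshold, top_k):
    # Highest score first.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    # Keep scores at or above the threshold (same >= comparison as the source).
    kept = [(text, score) for text, score in ranked if score >= threshold]
    return kept[:top_k]


# Made-up scores for illustration only.
scored = [("Java", 0.41), ("Python", 0.95), ("JavaScript", 0.72), ("C++", 0.38)]

strict = rank_filter_truncate(scored, threshold=0.75, top_k=2)
loose = rank_filter_truncate(scored, threshold=0.40, top_k=2)
print(strict)  # [('Python', 0.95)]
print(loose)   # [('Python', 0.95), ('JavaScript', 0.72)]
```

With the stricter threshold only one candidate survives even though top_k is 2; lowering the threshold lets top_k become the limiting factor.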

match_lists
match_lists(source, target, top_k=3)

Match items from source list to target list using semantic similarity.

For each item in the source list, finds the top-k most similar items in the target list.

Parameters:

Name Type Description Default
source Iterable[str]

Source list of texts to match.

required
target Iterable[str]

Target list of texts to match against.

required
top_k int

Maximum number of matches per source item.

3

Returns:

Type Description
list[tuple[str, str, float]]

List of (source_item, target_item, score) tuples for all matches scoring at or above the threshold.

Source code in at_scorer/ml/semantic_matcher.py
def match_lists(
    self, source: Iterable[str], target: Iterable[str], top_k: int = 3
) -> list[tuple[str, str, float]]:
    """Match items from source list to target list using semantic similarity.

    For each item in the source list, finds the top-k most similar items
    in the target list.

    Args:
        source: Source list of texts to match.
        target: Target list of texts to match against.
        top_k: Maximum number of matches per source item.

    Returns:
        List of (source_item, target_item, score) tuples for all matches
        scoring at or above the threshold.
    """
    results: list[tuple[str, str, float]] = []
    source_list = [s for s in source if s]
    target_list = [t for t in target if t]
    if not source_list or not target_list:
        return results
    target_vecs = self.embedding_model.encode_batch(target_list)
    for src in source_list:
        src_vec = self.embedding_model.encode(src)
        scores = [
            (tgt, self.embedding_model.similarity(src_vec, tgt_vec))
            for tgt, tgt_vec in zip(target_list, target_vecs, strict=False)
        ]
        best = sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]
        results.extend([(src, tgt, score) for tgt, score in best if score >= self.threshold])
    return results
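
match_lists returns one flat list, and a source item with no target at or above the threshold has no rows in it. When per-source lookups are more convenient, the rows can be regrouped. In the sketch below, group_by_source is a hypothetical helper and the rows are made-up values in the (source_item, target_item, score) shape.

```python
from collections import defaultdict


def group_by_source(matches):
    """Turn flat (source, target, score) rows into {source: [(target, score), ...]}."""
    grouped = defaultdict(list)
    for source_item, target_item, score in matches:
        grouped[source_item].append((target_item, score))
    return dict(grouped)


# Made-up rows for illustration only.
rows = [
    ("Python developer", "Python", 0.91),
    ("Python developer", "Django", 0.78),
    ("Data analyst", "SQL", 0.80),
]
grouped = group_by_source(rows)

# Source items that produced no rows can be found by comparing against the input.
sources = ["Python developer", "Data analyst", "Chef"]
unmatched = [s for s in sources if s not in grouped]
```

Because each source's rows are appended in the order match_lists emits them, the per-source lists stay sorted by score descending.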