Semantic Matcher

SemanticMatcher

SemanticMatcher(embedding_model, threshold=0.7)

Semantic similarity helper built on top of EmbeddingModel.

Provides convenient methods for calculating semantic similarity between texts and matching items across lists. Scores are the cosine similarity of embeddings. `similarity` returns the raw score; `top_matches` and `match_lists` drop results that score below `threshold`.

Attributes:

Name	Type	Description
`embedding_model`	`EmbeddingModel`	The embedding model to use for encoding texts.
`threshold`	`float`	Minimum similarity score (0-1) to consider a match. Default: 0.7.

Example

from at_scorer.ml import EmbeddingModel
from at_scorer.ml.semantic_matcher import SemanticMatcher

model = EmbeddingModel()
matcher = SemanticMatcher(model, threshold=0.75)

# Calculate similarity
score = matcher.similarity("Python developer", "Software engineer")

# Find top matches
candidates = ["Java", "Python", "JavaScript", "C++"]
matches = matcher.top_matches("Python developer", candidates, top_k=2)
# Illustrative result: [("Python", 0.95)]
# A score such as 0.72 for "JavaScript" falls below the 0.75 threshold and is dropped.

Initialize the semantic matcher.

Parameters:

Name	Type	Description	Default
`embedding_model`	`EmbeddingModel`	The embedding model to use for encoding.	required
`threshold`	`float`	Minimum similarity score to consider a match (0-1).	`0.7`

Source code in at_scorer/ml/semantic_matcher.py

def __init__(self, embedding_model: EmbeddingModel, threshold: float = 0.7):
    """Initialize the semantic matcher.

    Args:
        embedding_model: The embedding model to use for encoding.
        threshold: Minimum similarity score to consider a match (0-1).
    """
    self.embedding_model = embedding_model
    self.threshold = threshold
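
For unit tests it can help to swap in a lightweight stand-in for the embedding model. The methods on this page call only `encode`, `encode_batch`, and `similarity` on it, so a duck-typed stub is enough. The sketch below is a toy under stated assumptions: `BagOfWordsModel` is a hypothetical name, its vectors are word counts rather than learned embeddings, and the real `EmbeddingModel` may return a different vector type.

```python
import math
from collections import Counter


class BagOfWordsModel:
    """Hypothetical test double exposing encode, encode_batch, and similarity."""

    def encode(self, text: str) -> Counter:
        # Lowercased word counts stand in for a real embedding vector.
        return Counter(text.lower().split())

    def encode_batch(self, texts: list[str]) -> list[Counter]:
        return [self.encode(t) for t in texts]

    def similarity(self, vec_a: Counter, vec_b: Counter) -> float:
        # Cosine similarity over sparse word-count vectors.
        dot = sum(vec_a[w] * vec_b[w] for w in vec_a)
        norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
        norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0  # convention chosen here for empty texts
        return dot / (norm_a * norm_b)


stub = BagOfWordsModel()
same = stub.similarity(stub.encode("python developer"), stub.encode("Python developer"))
none = stub.similarity(stub.encode("python"), stub.encode("java"))
```

Because the stub is deterministic, tests built on it can assert exact orderings without loading a model. It only matches on shared words, so it says nothing about real semantic quality.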

Functions

similarity

similarity(text_a, text_b)

Calculate semantic similarity between two texts.

Parameters:

Name	Type	Description	Default
`text_a`	`str`	First text to compare.	required
`text_b`	`str`	Second text to compare.	required

Returns:

Type	Description
`float`	Cosine similarity score between 0 and 1 (higher = more similar).

Source code in at_scorer/ml/semantic_matcher.py

def similarity(self, text_a: str, text_b: str) -> float:
    """Calculate semantic similarity between two texts.

    Args:
        text_a: First text to compare.
        text_b: Second text to compare.

    Returns:
        Cosine similarity score between 0 and 1 (higher = more similar).
    """
    vec_a = self.embedding_model.encode(text_a)
    vec_b = self.embedding_model.encode(text_b)
    return self.embedding_model.similarity(vec_a, vec_b)
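
The docstring describes the score as a cosine similarity. As a point of reference, here is a minimal pure-Python sketch of cosine similarity on dense vectors. `cosine_similarity` is a hypothetical helper, not the implementation behind `EmbeddingModel.similarity`, which this page does not show. Plain cosine similarity can be negative for arbitrary vectors, so the 0 to 1 range stated above depends on how the model produces or rescales its scores.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])     # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])    # opposite directions
```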

top_matches

top_matches(query, candidates, top_k=5)

Find top-k most similar candidates to a query text.

Parameters:

Name	Type	Description	Default
`query`	`str`	The query text to match against.	required
`candidates`	`Iterable[str]`	Iterable of candidate texts to search.	required
`top_k`	`int`	Maximum number of matches to return.	`5`

Returns:

Type	Description
`list[tuple[str, float]]`	List of (candidate, score) tuples, sorted by score descending. Only includes matches at or above the threshold.

Source code in at_scorer/ml/semantic_matcher.py

def top_matches(
    self, query: str, candidates: Iterable[str], top_k: int = 5
) -> list[tuple[str, float]]:
    """Find top-k most similar candidates to a query text.

    Args:
        query: The query text to match against.
        candidates: Iterable of candidate texts to search.
        top_k: Maximum number of matches to return.

    Returns:
        List of (candidate, score) tuples, sorted by score descending.
        Only includes matches at or above the threshold.
    """
    candidates_list = [c for c in candidates if c]
    if not query or not candidates_list:
        return []
    query_vec = self.embedding_model.encode(query)
    cand_vecs = self.embedding_model.encode_batch(candidates_list)
    scores = [self.embedding_model.similarity(query_vec, v) for v in cand_vecs]
    ranked = sorted(
        zip(candidates_list, scores, strict=False), key=lambda x: x[1], reverse=True
    )
    filtered = [(c, s) for c, s in ranked if s >= self.threshold]
    return filtered[:top_k]
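
The ranking step can be illustrated without an embedding model by starting from precomputed scores. The sketch below mirrors the sort, threshold filter, and `top_k` slice from the source above; `rank_and_filter` and the scores are made up for illustration.

```python
def rank_and_filter(
    scored: list[tuple[str, float]], threshold: float, top_k: int
) -> list[tuple[str, float]]:
    """Sort by score descending, keep scores >= threshold, then cap at top_k."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    kept = [(text, score) for text, score in ranked if score >= threshold]
    return kept[:top_k]


# Illustrative scores, not output from a real model.
scored = [("Java", 0.40), ("Python", 0.95), ("JavaScript", 0.72), ("C++", 0.75)]
result = rank_and_filter(scored, threshold=0.75, top_k=2)
# result == [("Python", 0.95), ("C++", 0.75)]
```

A score exactly equal to the threshold is kept, and fewer than `top_k` items come back when few candidates clear the threshold.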

match_lists

match_lists(source, target, top_k=3)

Match items from source list to target list using semantic similarity.

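For each item in the source list, finds the top-k most similar items in the target list. Empty strings in either list are skipped, and results are grouped by source item in input order rather than sorted globally.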

Parameters:

Name	Type	Description	Default
`source`	`Iterable[str]`	Source list of texts to match.	required
`target`	`Iterable[str]`	Target list of texts to match against.	required
`top_k`	`int`	Maximum number of matches per source item.	`3`

Returns:

Type	Description
`list[tuple[str, str, float]]`	List of (source_item, target_item, score) tuples for all matches at or above the threshold.

Source code in at_scorer/ml/semantic_matcher.py

def match_lists(
    self, source: Iterable[str], target: Iterable[str], top_k: int = 3
) -> list[tuple[str, str, float]]:
    """Match items from source list to target list using semantic similarity.

    For each item in the source list, finds the top-k most similar items
    in the target list.

    Args:
        source: Source list of texts to match.
        target: Target list of texts to match against.
        top_k: Maximum number of matches per source item.

    Returns:
        List of (source_item, target_item, score) tuples for all matches
        at or above the threshold.
    """
    results: list[tuple[str, str, float]] = []
    source_list = [s for s in source if s]
    target_list = [t for t in target if t]
    if not source_list or not target_list:
        return results
    target_vecs = self.embedding_model.encode_batch(target_list)
    for src in source_list:
        src_vec = self.embedding_model.encode(src)
        scores = [
            (tgt, self.embedding_model.similarity(src_vec, tgt_vec))
            for tgt, tgt_vec in zip(target_list, target_vecs, strict=False)
        ]
        best = sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]
        results.extend([(src, tgt, score) for tgt, score in best if score >= self.threshold])
    return results
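
The shape of the output is easiest to see with a toy scorer. In this sketch `match_with_scorer` is a hypothetical stand-in that takes a scoring function in place of the embedding model, and `word_overlap` is a crude Jaccard overlap, not a semantic measure. The loop follows the same steps as the source above: drop empty strings, score every target, keep the best `top_k`, then filter by threshold.

```python
from typing import Callable, Iterable


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (toy stand-in for a model score)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def match_with_scorer(
    source: Iterable[str],
    target: Iterable[str],
    score: Callable[[str, str], float],
    threshold: float = 0.7,
    top_k: int = 3,
) -> list[tuple[str, str, float]]:
    results: list[tuple[str, str, float]] = []
    source_list = [s for s in source if s]  # drop empty strings
    target_list = [t for t in target if t]
    for src in source_list:
        scores = [(tgt, score(src, tgt)) for tgt in target_list]
        best = sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]
        results.extend((src, tgt, s) for tgt, s in best if s >= threshold)
    return results


matches = match_with_scorer(
    ["data engineer", "", "web developer"],
    ["senior data engineer", "web developer", "chef"],
    score=word_overlap,
    threshold=0.5,
)
```

Here the empty source string is ignored, and each remaining source item contributes one match, listed in source order.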