Monge Elkan
- class py_stringmatching.similarity_measure.monge_elkan.MongeElkan(sim_func=jaro_winkler_function)[source]
Computes the Monge-Elkan measure.
The Monge-Elkan similarity measure is a hybrid similarity measure that combines the benefits of sequence-based and set-based methods. This can be effective for domains in which more control is needed over the similarity measure. It implicitly uses a secondary similarity measure, such as Levenshtein, to compute the overall similarity score. See the string matching chapter in the DI book (Principles of Data Integration).
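As a rough illustration of how a hybrid measure of this kind is usually computed, the sketch below scores each token of the first bag against its best match in the second bag and averages those best scores. It is a minimal sketch, not the library's source: the names monge_elkan_sketch and char_ratio are hypothetical, difflib's ratio stands in for Jaro-Winkler as the secondary measure, and the input type checks are omitted.

```python
from difflib import SequenceMatcher


def char_ratio(s1, s2):
    """Hypothetical stand-in secondary measure (difflib ratio in [0, 1]).

    This is NOT Jaro-Winkler; it only keeps the sketch dependency-free.
    """
    return SequenceMatcher(None, s1, s2).ratio()


def monge_elkan_sketch(bag1, bag2, sim_func=char_ratio):
    """Average, over tokens in bag1, of the best secondary score against bag2."""
    # Assumption for this sketch: an empty bag scores 0.0.
    if not bag1 or not bag2:
        return 0.0
    total = 0.0
    for tok1 in bag1:
        # Best match for tok1 among all tokens of bag2.
        total += max(sim_func(tok1, tok2) for tok2 in bag2)
    # Normalize by the size of the first bag only.
    return total / len(bag1)


print(monge_elkan_sketch(['a', 'b'], ['a']))
print(monge_elkan_sketch(['a'], ['a', 'b']))
```

Because the sum is normalized by the size of the first bag only, swapping the arguments can change the score, so the measure computed this way is generally not symmetric.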
- Parameters:
sim_func (function) – Secondary similarity function. This is expected to be a sequence-based similarity measure (defaults to Jaro-Winkler similarity measure).
- sim_func
An attribute to store the secondary similarity function.
- Type:
function
- get_raw_score(bag1, bag2)[source]
Computes the raw Monge-Elkan score between two bags (lists).
- Parameters:
bag1 (list) – First input list.
bag2 (list) – Second input list.
- Returns:
Monge-Elkan similarity score (float).
- Raises:
TypeError – If the inputs are not lists or if one of the inputs is None.
Examples
>>> from py_stringmatching import Affine, MongeElkan, NeedlemanWunsch
>>> me = MongeElkan()
>>> me.get_raw_score(['Niall'], ['Neal'])
0.8049999999999999
>>> me.get_raw_score(['Niall'], ['Nigel'])
0.7866666666666667
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
0.8364448130130768
>>> me.get_raw_score([''], ['a'])
0.0
>>> me = MongeElkan(sim_func=NeedlemanWunsch().get_raw_score)
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
2.0
>>> me = MongeElkan(sim_func=Affine().get_raw_score)
>>> me.get_raw_score(['Comput.', 'Sci.', 'and', 'Eng.', 'Dept.,', 'University', 'of', 'California,', 'San', 'Diego'], ['Department', 'of', 'Computer', 'Science,', 'Univ.', 'Calif.,', 'San', 'Diego'])
2.25
References
Principles of Data Integration book