agatha.util.umls_util module

umls_util.py

This module is responsible for cross-referencing the UMLS MRCONSO file. It supports looking up UMLS terms from plaintext descriptions, and vice versa.

class agatha.util.umls_util.UmlsIndex(mrconso_path, **filter_kwargs)

Bases: object

The UmlsIndex is responsible for managing the MRCONSO file.

When we create the UmlsIndex we create the intermediate data structures required to index all UMLS keywords, and all plaintext atoms. You can download a MRCONSO file associated with a UMLS release here:

www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html

Take a look to see what the MRCONSO file format is supposed to look like:

https://www.ncbi.nlm.nih.gov/books/NBK9685/table/ch03.T.concept_names_and_sources_file_mr/

Parameters
  • mrconso_path (Path) – The path to a MRCONSO RRF file.

  • include_supressed_content – By default, this index will only consider terms that have not been marked as SUPPRESS. If this flag is set, we will include all terms.

  • filter_language – If set, this index will only consider names appearing in the selected language (default = ENG). If set to None, names in all languages will be considered.
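
The intermediate data structures mentioned above can be pictured as two lookup tables, one from code to texts and one from text to codes. The following is a minimal sketch of that idea, not the actual UmlsIndex implementation: the helper build_toy_index, the lowercase 'cui' and 'str' keys, and the placeholder codes are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_toy_index(
    atoms: Iterable[Dict[str, str]],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Builds code->texts and text->codes maps from parsed atoms.

    Hypothetical helper: assumes each atom has lowercase 'cui' and 'str'
    keys, which may not match the real module's field names.
    """
    code2texts: Dict[str, Set[str]] = defaultdict(set)
    text2codes: Dict[str, Set[str]] = defaultdict(set)
    for atom in atoms:
        code2texts[atom["cui"]].add(atom["str"])
        text2codes[atom["str"]].add(atom["cui"])
    return dict(code2texts), dict(text2codes)


# Placeholder atoms; these codes are made up and not real UMLS concepts.
toy_atoms = [
    {"cui": "C0000001", "str": "headache"},
    {"cui": "C0000001", "str": "cephalalgia"},
    {"cui": "C0000002", "str": "fever"},
]
code2texts, text2codes = build_toy_index(toy_atoms)
```

With both maps in memory, lookups in either direction are single dictionary accesses, which is presumably why the index is built once at construction time.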

codes()
Return type

Set[str]

contains_code(code)
Return type

bool

contains_pref_text_for_code(code)
Return type

bool

find_codes_with_close_text(text, ignore_case=False)

Returns the set of codes with text most similar to that provided.

Each text field of all managed atoms is compared to the given text. The set of codes with text that minimize edit distance with the given text are returned.

For example, if codes C1 and C2 are both equally distant to text, then both will be returned.

Return type

Set[str]
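
The tie-handling behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes Levenshtein distance as the edit distance and a plain dict from code to texts as the index, and the names edit_distance and closest_codes are hypothetical.

```python
from typing import Dict, Optional, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def closest_codes(
    code2texts: Dict[str, Set[str]], text: str, ignore_case: bool = False
) -> Set[str]:
    """Returns every code that has a text at the minimum edit distance."""
    if ignore_case:
        text = text.lower()
    best: Optional[int] = None
    result: Set[str] = set()
    for code, texts in code2texts.items():
        for candidate in texts:
            if ignore_case:
                candidate = candidate.lower()
            dist = edit_distance(candidate, text)
            if best is None or dist < best:
                # Strictly better match: restart the result set.
                best, result = dist, {code}
            elif dist == best:
                # Tie: keep this code alongside the others.
                result.add(code)
    return result


# Placeholder index; C1, C2, C3 are illustrative labels only.
toy_index = {"C1": {"cat"}, "C2": {"bat"}, "C3": {"horse"}}
tied = closest_codes(toy_index, "rat")
```

Here "cat" and "bat" are each one substitution away from "rat", so both C1 and C2 are returned.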

find_codes_with_pattern(pattern)

Returns the set of codes that have at least one text matching the regex pattern.

Return type

Set[str]
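
A rough sketch of pattern-based lookup over a code-to-texts mapping is shown below. Whether the real method uses search, match, or fullmatch semantics is not stated here, so the use of re.search is an assumption, as are the function name codes_matching and the placeholder codes.

```python
import re
from typing import Dict, Set


def codes_matching(code2texts: Dict[str, Set[str]], pattern: str) -> Set[str]:
    """Returns codes with at least one text that the pattern matches.

    Assumption: uses re.search, so the pattern may match anywhere in the
    text unless it is anchored with ^ or $.
    """
    regex = re.compile(pattern)
    return {
        code
        for code, texts in code2texts.items()
        if any(regex.search(text) for text in texts)
    }


# Placeholder index; the codes are illustrative labels only.
toy_index = {
    "C1": {"lung cancer"},
    "C2": {"breast cancer"},
    "C3": {"fever"},
}
cancer_codes = codes_matching(toy_index, r"cancer$")
```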

get_pref_text(code)
Return type

str

get_texts(code)
Return type

Set[str]

num_codes()
Return type

int

agatha.util.umls_util.atom_contains_all_fields(atom)
Return type

bool

agatha.util.umls_util.filter_atoms(mrconso_data, include_suppressed=False, filter_language='ENG', code_subset=None)

Filters the parsed lines of MRCONSO.

If include_suppressed is set, then atoms with SUPPRESS set will be included in the result.

If filter_language is not None, then only atoms with LAT set to the filter language will be included.

If code_subset is set, then only UMLS terms present in this set will be passed through the filter.

Return type

Iterable[Dict[str, str]]
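
The three filtering rules above can be sketched as a generator. This is not the module's implementation: the name filter_atoms_sketch, the lowercase 'suppress', 'lat', and 'cui' keys, and the test that treats only SUPPRESS == 'N' as unsuppressed are assumptions based on the MRCONSO format.

```python
from typing import Dict, Iterable, Iterator, Optional, Set


def filter_atoms_sketch(
    atoms: Iterable[Dict[str, str]],
    include_suppressed: bool = False,
    filter_language: Optional[str] = "ENG",
    code_subset: Optional[Set[str]] = None,
) -> Iterator[Dict[str, str]]:
    """Yields only the atoms that pass every active filter."""
    for atom in atoms:
        # Assumption: 'N' in SUPPRESS marks an atom as not suppressed.
        if not include_suppressed and atom["suppress"] != "N":
            continue
        if filter_language is not None and atom["lat"] != filter_language:
            continue
        if code_subset is not None and atom["cui"] not in code_subset:
            continue
        yield atom


# Placeholder atoms; the codes are illustrative labels only.
toy_atoms = [
    {"cui": "C1", "lat": "ENG", "suppress": "N"},
    {"cui": "C2", "lat": "SPA", "suppress": "N"},
    {"cui": "C3", "lat": "ENG", "suppress": "O"},
]
default_codes = [a["cui"] for a in filter_atoms_sketch(toy_atoms)]
```

With the defaults, only the unsuppressed English atom passes. Writing the filter as a generator keeps memory use flat, which matters for a file the size of MRCONSO.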

agatha.util.umls_util.parse_mrconso(mrconso_path)

Parses the MRCONSO file.

The MRCONSO file format is described at:

https://www.ncbi.nlm.nih.gov/books/NBK9685/table/ch03.T.concept_names_and_sources_file_mr/

Its columns are listed in umls_util.MRCONSO_FIELDNAMES.

This function takes each line of the MRCONSO.RRF file and parses out each field. The result is a sequence of dictionaries, where parse_mrconso(…)[i] contains all of the fields of line i. For instance, you can get the CUI of line i with parse_mrconso(…)[i]['cui'].

Parameters

mrconso_path (Path) – The filepath to MRCONSO.RRF. Must end in .RRF.

Return type

Iterable[Dict[str, str]]

Returns

The parsed MRCONSO data. Each entry contains the fields defined in MRCONSO_FIELDNAMES.
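
To make the line-to-dictionary mapping concrete, here is a minimal sketch of parsing pipe-delimited RRF lines. It is not the module's code. The FIELDNAMES list follows the MRCONSO column order from the UMLS reference table, but rendering the names in lowercase is an assumption suggested by the 'cui' key above; the actual contents of MRCONSO_FIELDNAMES may differ. The function name parse_rrf_lines and the sample line are made up for illustration.

```python
import io
from typing import Dict, Iterable, Iterator

# MRCONSO column order; lowercase naming is an assumption.
FIELDNAMES = [
    "cui", "lat", "ts", "lui", "stt", "sui", "ispref", "aui", "saui",
    "scui", "sdui", "sab", "tty", "code", "str", "srl", "suppress", "cvf",
]


def parse_rrf_lines(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yields one dict per non-empty RRF line, keyed by FIELDNAMES."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        values = line.split("|")
        # RRF lines end with a trailing '|', which leaves one empty value.
        if len(values) == len(FIELDNAMES) + 1 and values[-1] == "":
            values.pop()
        if len(values) != len(FIELDNAMES):
            raise ValueError(f"Unexpected field count: {len(values)}")
        yield dict(zip(FIELDNAMES, values))


# Made-up sample line in MRCONSO layout; not taken from a UMLS release.
sample = "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||DEMO|PT|X1|headache|0|N||\n"
rows = list(parse_rrf_lines(io.StringIO(sample)))
```

Each resulting dictionary can then be indexed by field name, for example rows[0]['cui'] for the concept identifier or rows[0]['str'] for the atom text.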