agatha.util.umls_util module
umls_util.py
This module cross-references UMLS MRCONSO, which lets us look up UMLS terms from plaintext descriptions and vice versa.
-
class agatha.util.umls_util.UmlsIndex(mrconso_path, **filter_kwargs)
Bases: object
The UmlsIndex is responsible for managing the MRCONSO file.
When we create the UmlsIndex we create the intermediate data structures required to index all UMLS keywords, and all plaintext atoms. You can download a MRCONSO file associated with a UMLS release here:
www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html
Take a look to see what the MRCONSO file format is supposed to look like:
https://www.ncbi.nlm.nih.gov/books/NBK9685/table/ch03.T.concept_names_and_sources_file_mr/
- Parameters
mrconso_path (Path) – The path to a MRCONSO RRF file.
include_supressed_content – By default, this index only considers terms that have not been marked as SUPPRESS. If this flag is set, all terms are included.
filter_language – If set, this index only considers names in the selected language (default = ENG). If set to None, all terms are considered.
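As a rough illustration of the kind of intermediate data structures such an index might hold, the sketch below builds a code-to-texts map and a code-to-preferred-text map from already-parsed atoms. `TinyIndex` is a hypothetical stand-in, not the UmlsIndex implementation. The lowercase field names and the rule for picking a preferred text (TS = P, STT = PF, ISPREF = Y) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class TinyIndex:
    """Hypothetical miniature of a code <-> text index (not agatha's class)."""

    def __init__(self, atoms: Iterable[Dict[str, str]]):
        self._code2texts: Dict[str, Set[str]] = defaultdict(set)
        self._code2pref: Dict[str, str] = {}
        for atom in atoms:
            code, text = atom["cui"], atom["str"]
            self._code2texts[code].add(text)
            # Assumed rule: the preferred atom has TS=P, STT=PF, ISPREF=Y.
            is_pref = (
                atom["ts"] == "P" and atom["stt"] == "PF" and atom["ispref"] == "Y"
            )
            if is_pref and code not in self._code2pref:
                self._code2pref[code] = text

    def codes(self) -> Set[str]:
        return set(self._code2texts)

    def contains_code(self, code: str) -> bool:
        return code in self._code2texts

    def get_texts(self, code: str) -> Set[str]:
        # Return a copy so callers cannot mutate the index.
        return set(self._code2texts.get(code, set()))

    def get_pref_text(self, code: str) -> str:
        return self._code2pref[code]

    def num_codes(self) -> int:
        return len(self._code2texts)


# Made-up atoms, reduced to the fields this sketch reads.
example_atoms = [
    {"cui": "C1", "str": "Aspirin", "ts": "P", "stt": "PF", "ispref": "Y"},
    {"cui": "C1", "str": "Acetylsalicylic acid", "ts": "S", "stt": "PF", "ispref": "Y"},
    {"cui": "C2", "str": "Asthma", "ts": "P", "stt": "PF", "ispref": "Y"},
]
tiny = TinyIndex(example_atoms)
```

With these atoms, `tiny.get_texts("C1")` holds both names for C1, while `tiny.get_pref_text("C1")` is the single preferred one.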
-
codes()
- Return type
Set[str]
-
contains_code(code)
- Return type
bool
-
contains_pref_text_for_code(code)
- Return type
bool
-
find_codes_with_close_text(text, ignore_case=False)
Returns the set of codes whose text is most similar to the given text.
Each text field of every managed atom is compared to the given text. The codes whose text minimizes the edit distance to the given text are returned.
For example, if codes C1 and C2 are equally distant from text, both are returned.
- Return type
Set[str]
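The tie-handling behaviour described for find_codes_with_close_text can be sketched as follows. This is a minimal illustration, assuming Levenshtein distance as the edit distance; the library may use a different measure or a faster implementation. `edit_distance`, `closest_codes`, and the `code2texts` mapping are hypothetical names introduced for this example.

```python
from typing import Dict, Optional, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def closest_codes(
    code2texts: Dict[str, Set[str]], text: str, ignore_case: bool = False
) -> Set[str]:
    """Return every code that has a text at the minimum distance to `text`."""
    if ignore_case:
        text = text.lower()
    best: Optional[int] = None
    result: Set[str] = set()
    for code, texts in code2texts.items():
        for candidate in texts:
            if ignore_case:
                candidate = candidate.lower()
            dist = edit_distance(candidate, text)
            if best is None or dist < best:
                # Strictly better match: start a new result set.
                best, result = dist, {code}
            elif dist == best:
                # Tie: keep all equally close codes.
                result.add(code)
    return result


toy = {"C1": {"cat"}, "C2": {"bat"}, "C3": {"elephant"}}
tied = closest_codes(toy, "hat")  # "cat" and "bat" are both one edit away
```

Here `tied` contains both C1 and C2, mirroring the equally-distant case in the description.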
-
find_codes_with_pattern(pattern)
Returns the set of codes with text that matches the regex pattern.
- Return type
Set[str]
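A pattern lookup of this kind might look like the sketch below. It assumes `re.search` semantics (a match anywhere in the text); the documentation does not say whether the library anchors the pattern, so treat that as an open assumption. `codes_matching` and `code2texts` are hypothetical names.

```python
import re
from typing import Dict, Set


def codes_matching(code2texts: Dict[str, Set[str]], pattern: str) -> Set[str]:
    """Return codes for which at least one text matches `pattern`."""
    regex = re.compile(pattern)
    return {
        code
        for code, texts in code2texts.items()
        # Assumption: an unanchored search, not a full-string match.
        if any(regex.search(text) for text in texts)
    }


texts_by_code = {
    "C1": {"heart attack", "myocardial infarction"},
    "C2": {"heart failure"},
    "C3": {"asthma"},
}
heart_codes = codes_matching(texts_by_code, r"^heart")
```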
-
get_pref_text(code)
- Return type
str
-
get_texts(code)
- Return type
Set[str]
-
num_codes()
- Return type
int
-
agatha.util.umls_util.atom_contains_all_fields(atom)
- Return type
bool
-
agatha.util.umls_util.filter_atoms(mrconso_data, include_suppressed=False, filter_language='ENG', code_subset=None)
Filters the lines of MRCONSO.
If include_suppressed is set, then atoms with SUPPRESS set will be included in the result.
If filter_language is not None, then only atoms with LAT set to the filter language will be included.
If code_subset is set, then only UMLS terms present in this set will be passed through the filter.
- Return type
Iterable[Dict[str, str]]
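The three filtering rules can be sketched as a generator over parsed atoms. This is an illustration, not the library's code. It assumes lowercase keys (`suppress`, `lat`, `cui`), assumes that code_subset holds CUIs, and treats only SUPPRESS = N as "not suppressed", following the MRCONSO convention. `filter_atoms_sketch` is a hypothetical name.

```python
from typing import Dict, Iterable, Iterator, Optional, Set


def filter_atoms_sketch(
    atoms: Iterable[Dict[str, str]],
    include_suppressed: bool = False,
    filter_language: Optional[str] = "ENG",
    code_subset: Optional[Set[str]] = None,
) -> Iterator[Dict[str, str]]:
    for atom in atoms:
        # Assumption: "N" is the only SUPPRESS value meaning "not suppressed".
        if not include_suppressed and atom["suppress"] != "N":
            continue
        if filter_language is not None and atom["lat"] != filter_language:
            continue
        # Assumption: code_subset is a set of CUIs.
        if code_subset is not None and atom["cui"] not in code_subset:
            continue
        yield atom


# Made-up atoms, reduced to the fields this sketch reads.
raw_atoms = [
    {"cui": "C1", "lat": "ENG", "suppress": "N", "str": "aspirin"},
    {"cui": "C2", "lat": "SPA", "suppress": "N", "str": "aspirina"},
    {"cui": "C3", "lat": "ENG", "suppress": "O", "str": "obsolete name"},
]
default_cuis = [a["cui"] for a in filter_atoms_sketch(raw_atoms)]
```

With the defaults only C1 survives: C2 is not English and C3 is suppressed.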
-
agatha.util.umls_util.parse_mrconso(mrconso_path)
Parses the MRCONSO file.
The MRCONSO file, as described in:
https://www.ncbi.nlm.nih.gov/books/NBK9685/table/ch03.T.concept_names_and_sources_file_mr/
has the columns listed in umls_util.MRCONSO_FIELDNAMES.
This function takes each line of the MRCONSO.RRF file and parses out each field. The result is a list of dictionaries, where parse_mrconso(…)[i] contains all of the fields of line i. For instance, you can get the CUI of line i with parse_mrconso(…)[i]['cui'].
- Parameters
mrconso_path (Path) – The filepath to MRCONSO.RRF. Must end in .RRF.
- Return type
Iterable[Dict[str, str]]
- Returns
List of parsed MRCONSO data. Each entry contains the fields defined in MRCONSO_FIELDNAMES.
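To make the line-to-dictionary mapping concrete, here is a minimal sketch of parsing pipe-delimited RRF lines. It is not the library's implementation. The column order follows the MRCONSO layout in the UMLS reference manual, but the lowercase key names other than 'cui' are an assumption, so check MRCONSO_FIELDNAMES for the real ones. `FIELDNAMES`, `parse_rrf_lines`, and the sample line are made up for illustration.

```python
from typing import Dict, Iterable, Iterator

# MRCONSO column order; key spelling (other than "cui") is assumed.
FIELDNAMES = [
    "cui", "lat", "ts", "lui", "stt", "sui", "ispref", "aui", "saui",
    "scui", "sdui", "sab", "tty", "code", "str", "srl", "suppress", "cvf",
]


def parse_rrf_lines(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per pipe-delimited MRCONSO line."""
    for line in lines:
        values = line.rstrip("\n").split("|")
        # Each RRF line ends with a trailing "|", which leaves one extra
        # empty value after the last field; zip() drops it.
        yield dict(zip(FIELDNAMES, values))


# A made-up line in MRCONSO layout (identifiers are not real UMLS entries).
sample_line = (
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC|PT|X1|"
    "Example Term|0|N||\n"
)
parsed = list(parse_rrf_lines([sample_line]))
```

After this, `parsed[0]['cui']` is the concept identifier of the first line, analogous to the parse_mrconso(…)[i]['cui'] access described above.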