agatha.construct.semrep_util module
SemRep Dask Utilities
This module helps run SemRep within the Agatha graph construction pipeline. For this to work, we need to run SemRep on each machine in our cluster, and extract all necessary information as edges.
To run SemRep, you must first start the servers for part-of-speech tagging and word-sense disambiguation, both of which are supplied through MetaMap. Specifically, we expect to find skrmedpostctl and wsdserverctl in the directory specified through config.semrep.metamap_bin_dir. Once these servers are started, we are free to run SemRep.
-
class agatha.construct.semrep_util.MetaMapServer(metamap_install_dir)
Bases: object
Manages connection to MetaMap
SemRep requires a connection to MetaMap, which means we need to launch the pos_server and wsd_server. This class is responsible for managing that server connection. We anticipate using one server per worker, meaning this class will be initialized using a dask_process_global initializer.
- Parameters
metamap_install_dir (Path) – The install location of MetaMap.
-
running()
-
start()
Starts the MetaMap servers, if not already running.
-
stop()
Stops the MetaMap servers, if running.
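The start/stop/running contract described above (start does nothing if the servers are already up, stop does nothing if they are not) can be illustrated with a minimal sketch. This is not Agatha's implementation: ExampleServer is a hypothetical class, and the managed process is a placeholder Python sleep command rather than skrmedpostctl or wsdserverctl, which may be launched quite differently.

```python
import subprocess
import sys
from typing import List, Optional


class ExampleServer:
    """Hypothetical stand-in that manages one background process."""

    def __init__(self, command: List[str]):
        self._command = command
        self._process: Optional[subprocess.Popen] = None

    def running(self) -> bool:
        # A process counts as running if it was started and has not exited.
        return self._process is not None and self._process.poll() is None

    def start(self) -> None:
        # Idempotent: only launch when nothing is running.
        if not self.running():
            self._process = subprocess.Popen(self._command)

    def stop(self) -> None:
        # Idempotent: only terminate when something is running.
        if self.running():
            self._process.terminate()
            self._process.wait()
        self._process = None


# Placeholder command (an assumption): a Python process that just sleeps.
server = ExampleServer([sys.executable, "-c", "import time; time.sleep(30)"])
server.start()
server.start()  # Second call is a no-op because the process is alive.
print("running:", server.running())
server.stop()
print("running:", server.running())
```

Making both calls idempotent lets every worker invoke start defensively before each job without spawning duplicate servers.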
-
class agatha.construct.semrep_util.SemRepRunner(semrep_install_dir, metamap_server, anaphora_resolution=True, dysonym_processing=True, lexicon_year=2006, mm_data_version='USAbase', mm_data_year='2006AA', relaxed_model=True, single_line_delim_input_w_id=True, use_generic_domain_extensions=False, use_generic_domain_modification=False, word_sense_disambiguation=True)
Bases: object
Responsible for running SemRep.
Given a MetaMap server and additional SemRep configuration, this class processes text and generates predicates. All SemRep flags are copied here and provided through the constructor. All defaults are preserved.
- Parameters
semrep_install_dir (Path) – Location where SemRep is installed.
metamap_server (MetaMapServer) – A connection to the MetaMapServer that enables us to actually run SemRep. We use this to ensure the server is running.
work_dir – Location to store intermediate files used to communicate with SemRep.
anaphora_resolution – SemRep Flag
dysonym_processing – SemRep Flag
lexicon_year (int) – The year, as an int, that we use with MetaMap. Ex: 2020
mm_data_version (str) – Specify which UMLS data version. Ex: USAbase
mm_data_year (str) – Specify UMLS release year. Ex: 2020AA
relaxed_model (bool) – SemRep Flag
use_generic_domain_extensions – SemRep Flag
use_generic_domain_modification – SemRep Flag
word_sense_disambiguation – SemRep Flag
-
run(input_path, output_path)
Actually calls SemRep with an input file.
- Parameters
input_path (Path) – The location of the SemRep input file.
- Return type
None
- Returns
The path produced by SemRep representing XML output.
-
class agatha.construct.semrep_util.UnicodeToAsciiRunner(unicode_to_ascii_jar_path)
Bases: object
Responsible for running the MetaMap unicode-to-ascii jar.
-
clean_text_for_metamap(s)
Cleans text so that it complies with MetaMap's input rules.
- Return type
str
-
-
agatha.construct.semrep_util.extract_entities_and_predicates_from_sentences(sentence_records, semrep_install_dir, unicode_to_ascii_jar_path, work_dir, lexicon_year, mm_data_year, mm_data_version)
Runs each sentence through SemRep and identifies predicates and entities.
Requires that get_metamap_server_initializer has been added to dask_process_global.
- Parameters
sentence_records (Bag) – Each record needs id and sent_text.
work_dir (Path) – A directory visible to all workers where SemRep intermediate files will be stored.
semrep_install_dir (Path) – The path where SemRep was installed.
- Return type
Bag
- Returns
One record per input sentence, where the id of each new record matches the input. However, returned records will only have entities and predicates.
-
agatha.construct.semrep_util.get_metamap_server_initializer(metamap_install_dir)
- Return type
Tuple[str, Callable[[], Any]]
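The return type above pairs a name with a zero-argument callable. One way such a pair might be consumed is a per-process registry that builds each object lazily, at most once. The sketch below only illustrates that pattern: register, get, and get_example_initializer are hypothetical names, and the actual dask_process_global API is not documented in this section.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical per-process registry; not the dask_process_global API.
_INITIALIZERS: Dict[str, Callable[[], Any]] = {}
_OBJECTS: Dict[str, Any] = {}


def register(name: str, initializer: Callable[[], Any]) -> None:
    """Remember how to build the object, without building it yet."""
    _INITIALIZERS[name] = initializer


def get(name: str) -> Any:
    """Build the named object on first access, then reuse it."""
    if name not in _OBJECTS:
        _OBJECTS[name] = _INITIALIZERS[name]()
    return _OBJECTS[name]


def get_example_initializer(install_dir: str) -> Tuple[str, Callable[[], Any]]:
    """Return a (name, initializer) pair, mirroring the return type above."""

    def _init() -> Any:
        # Placeholder object; a real initializer would construct a server.
        return {"install_dir": install_dir}

    return "example:server", _init


register(*get_example_initializer("/opt/example"))
print(get("example:server"))
```

Deferring construction to the first get call means the object is created inside whichever process uses it, which fits the one-server-per-worker expectation stated for MetaMapServer.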
-
agatha.construct.semrep_util.get_paths(semrep_install_dir=None, metamap_install_dir=None)
Looks up all of the files needed to run SemRep.
This function identifies the binaries and libraries needed to run SemRep. Additionally, this function asserts that all the needed files are actually present.
This function will find:
- skrmedpostctl: MetaMap’s SKR/Medpost Part-of-Speech Tagger Server
- wsdserverctl: MetaMap’s Word Sense Disambiguation (WSD) Server
- SEMREPrun.v*: The preamble needed to run SemRep
- semrep.v*.BINARY.Linux: The binary used to run SemRep
- lib: The Java libraries in SemRep
If only one of semrep_install_dir or metamap_install_dir is specified, then only that component's paths will be returned.
- Parameters
semrep_install_dir (Optional[Path]) – The install location of SemRep. Named public_semrep by default.
metamap_install_dir (Optional[Path]) – The install location of MetaMap. Named public_mm by default.
- Return type
Dict[str, Path]
- Returns
A dictionary of names and associated paths. If a name ends in _path, then it has been asserted with is_file(). If a name ends in _dir, it has been asserted with is_dir().
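The naming convention in the return value (names ending in _path are files, names ending in _dir are directories) can be checked with a few lines of pathlib. The sketch below is an illustration under assumptions: assert_paths_exist and the dictionary keys are hypothetical and are not the names get_paths actually returns.

```python
import tempfile
from pathlib import Path
from typing import Dict


def assert_paths_exist(paths: Dict[str, Path]) -> Dict[str, Path]:
    """Hypothetical helper: validate entries according to their name suffix."""
    for name, path in paths.items():
        if name.endswith("_path"):
            assert path.is_file(), f"Missing file for {name}: {path}"
        elif name.endswith("_dir"):
            assert path.is_dir(), f"Missing directory for {name}: {path}"
    return paths


# Demonstrate with a throwaway directory instead of a real install.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    binary = root / "example_binary"
    binary.write_text("placeholder")
    checked = assert_paths_exist({"example_dir": root, "example_path": binary})
    print(sorted(checked))
```

Failing early with a message that names the missing entry is generally easier to debug than a failure deep inside a worker process.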
-
agatha.construct.semrep_util.semrep_xml_to_records(xml_path)
Parses SemRep XML records to produce predicate records.
This parses SemRep XML output, generated by SemRep v1.8 via the --xml_output_format flag. Take a look [here][1] for more details on the XML spec. Additional details are below. We specifically focus on parsing XML records produced by the SemRepRunner.
XML Format Summary: The XML file starts with an overarching SemRepAnnotation object, containing multiple Document records, one per input text. These documents contain identified UMLS terms (Document > Utterance > Entity) and predicates (Document > Utterance > Predication). One document may have multiple utterances.
- Parameters
xml_path (Path) – Location of XML file to parse.
- Return type
List[Dict[str, Any]]
- Returns
A list of Python dicts, each of which corresponds to a detected predicate.
[1]:https://semrep.nlm.nih.gov/SemRep.v1.8_XML_output_desc.html
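To make the nesting in the XML format summary concrete, here is a minimal sketch that walks Document > Utterance > Entity and Document > Utterance > Predication with the standard library. It is not the parser used by semrep_xml_to_records: the element names come from the summary above, while the attribute names and values in the sample, and the summarize_semrep_xml helper itself, are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

# Illustrative input only. Element names follow the format summary; the
# attributes shown here are placeholders, not the real SemRep schema.
SAMPLE_XML = """
<SemRepAnnotation>
  <Document id="doc1">
    <Utterance id="doc1.u1">
      <Entity id="e1" name="Aspirin"/>
      <Entity id="e2" name="Headache"/>
      <Predication id="p1"/>
    </Utterance>
    <Utterance id="doc1.u2">
      <Entity id="e3" name="Tobacco"/>
    </Utterance>
  </Document>
</SemRepAnnotation>
"""


def summarize_semrep_xml(xml_text: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: collect entity and predication attributes per document."""
    root = ET.fromstring(xml_text.strip())
    summaries = []
    for document in root.findall("Document"):
        entities: List[Dict[str, str]] = []
        predications: List[Dict[str, str]] = []
        # One document may hold several utterances, so gather across all of them.
        for utterance in document.findall("Utterance"):
            entities.extend(dict(e.attrib) for e in utterance.findall("Entity"))
            predications.extend(
                dict(p.attrib) for p in utterance.findall("Predication")
            )
        summaries.append(
            {
                "document": dict(document.attrib),
                "entities": entities,
                "predications": predications,
            }
        )
    return summaries


for summary in summarize_semrep_xml(SAMPLE_XML):
    print(summary["document"], len(summary["entities"]), len(summary["predications"]))
```

Copying attributes generically with dict(elem.attrib) keeps the sketch independent of the exact attribute set, which should be taken from the linked XML description.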
-
agatha.construct.semrep_util.sentences_to_semrep_input(records, unicode_to_ascii_jar_path)
Processes sentence records for SemRep input.
The SemRepRunner, with the default single_line_delim_input_w_id flag set, expects input in the form:
```
id1|Sentence 1
id2|Sentence 2
…
```
This function converts Agatha sentence records, containing the sent_text and id fields into the single_line_delim_input_w_id format. Because each sentence must occur on its own line, this function will replace newline characters with spaces in output.
Recommended usage:
```python3
sentences.map_partitions(
    sentences_to_semrep_input,
    unicode_to_ascii_jar_path,
)
```
- Parameters
records (Iterable[Dict[str, Any]]) – Sentence records, each containing sent_text and id.
unicode_to_ascii_jar_path (Path) – The location of the MetaMap-provided jar.
- Return type
List[str]
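The record-to-line conversion described for sentences_to_semrep_input can be sketched in plain Python. This is a simplified illustration, not the library's code: to_semrep_input_lines is a hypothetical helper, and it skips the unicode-to-ascii step that the real function performs with the MetaMap-provided jar.

```python
from typing import Any, Dict, Iterable, List


def to_semrep_input_lines(records: Iterable[Dict[str, Any]]) -> List[str]:
    """Hypothetical helper: format records as `id|sentence` lines."""
    lines = []
    for record in records:
        # Each sentence must sit on one line, so newlines become spaces.
        text = str(record["sent_text"]).replace("\n", " ")
        lines.append(f"{record['id']}|{text}")
    return lines


example_records = [
    {"id": "s1", "sent_text": "Aspirin treats\nheadache."},
    {"id": "s2", "sent_text": "Tobacco causes cancer."},
]
print("\n".join(to_semrep_input_lines(example_records)))
```

Because the function maps an iterable of records to a list of strings, it has the shape expected by a per-partition operation such as the map_partitions call in the recommended usage.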