agatha.construct.semrep_util module

SemRep Dask Utilities

This module helps run SemRep within the Agatha graph construction pipeline. For this to work, we need to run SemRep on each machine in our cluster, and extract all necessary information as edges.

To run SemRep, you must first start the MetaMap servers for part-of-speech tagging and word-sense disambiguation. Both ship with MetaMap. Specifically, we expect to find skrmedpostctl and wsdserverctl in the directory specified through config.semrep.metamap_bin_dir. Once these servers are started, we are free to run SemRep.

class agatha.construct.semrep_util.MetaMapServer(metamap_install_dir)

Bases: object

Manages connection to MetaMap

SemRep requires a connection to MetaMap. This means we need to launch the pos_server and wsd_server. This class is responsible for managing that server connection. We anticipate using one server per worker, meaning this class will be initialized through a dask_process_global initializer.

Parameters: metamap_install_dir (Path) – The install location of MetaMap

running()

start(): Starts the MetaMap servers if they are not already running.

stop(): Stops the MetaMap servers if they are running.
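The start/stop/running contract described above can be illustrated with a minimal sketch. This is not Agatha's implementation: ManagedProcess is a hypothetical stand-in, and a sleeping Python child process takes the place of the MetaMap servers. It only shows the idempotent behavior, where start() does nothing if the process is already up and stop() does nothing if it is already down.

```python
import subprocess
import sys
from typing import List, Optional


class ManagedProcess:
    """Hypothetical stand-in for a server manager; not part of Agatha."""

    def __init__(self, command: List[str]):
        self._command = command
        self._proc: Optional[subprocess.Popen] = None

    def running(self) -> bool:
        # poll() returns None while the child process is still alive.
        return self._proc is not None and self._proc.poll() is None

    def start(self) -> None:
        # Idempotent: launch only if nothing is currently running.
        if not self.running():
            self._proc = subprocess.Popen(self._command)

    def stop(self) -> None:
        # Idempotent: terminate only if something is running.
        if self.running():
            self._proc.terminate()
            self._proc.wait()
        self._proc = None


# A sleeping child process stands in for a real server here.
server = ManagedProcess([sys.executable, "-c", "import time; time.sleep(30)"])
server.start()
first_pid = server._proc.pid
server.start()  # Second call is a no-op; the same process keeps running.
same_pid = server._proc.pid == first_pid
was_running = server.running()
server.stop()
is_running = server.running()
```

Making both calls safe to repeat means a per-worker initializer can call start() without first checking whether another task on the same worker already did.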

class agatha.construct.semrep_util.SemRepRunner(semrep_install_dir, metamap_server, anaphora_resolution=True, dysonym_processing=True, lexicon_year=2006, mm_data_version='USAbase', mm_data_year='2006AA', relaxed_model=True, single_line_delim_input_w_id=True, use_generic_domain_extensions=False, use_generic_domain_modification=False, word_sense_disambiguation=True)

Bases: object

Responsible for running SemRep.

Given a MetaMap server and additional SemRep configuration, this class actually processes text and generates predicates. All SemRep options are mirrored here as constructor arguments, and SemRep's defaults are preserved.

Parameters

semrep_install_dir (Path) – Location where SemRep is installed.
metamap_server (MetaMapServer) – A connection to the MetaMapServer that enables us to actually run SemRep. We use this to ensure the server is running.
work_dir – Location to store intermediate files used to communicate with SemRep.
anaphora_resolution – SemRep Flag
dysonym_processing – SemRep Flag
lexicon_year (int) – The year as an int which we use with MetaMap. Ex: 2020
mm_data_version (str) – Specify which UMLS data version. Ex: USAbase
mm_data_year (str) – Specify UMLS release year. Ex: 2020AA
relaxed_model (bool) – SemRep Flag
use_generic_domain_extensions – SemRep Flag
use_generic_domain_modification – SemRep Flag
word_sense_disambiguation – SemRep Flag

run(input_path, output_path)

Actually calls SemRep with an input file.

Parameters: input_path (Path) – The location of the SemRep Input file
Return type: None
Returns: Nothing. SemRep writes its XML output to output_path.

class agatha.construct.semrep_util.UnicodeToAsciiRunner(unicode_to_ascii_jar_path)

Bases: object

Responsible for running the MetaMap unicode-to-ASCII jar.

clean_text_for_metamap(s)

Cleans text so that it satisfies MetaMap's input restrictions.

Return type: str

agatha.construct.semrep_util.extract_entities_and_predicates_from_sentences(sentence_records, semrep_install_dir, unicode_to_ascii_jar_path, work_dir, lexicon_year, mm_data_year, mm_data_version)

Runs each sentence through SemRep and identifies predicates and entities.

Requires that the initializer returned by get_metamap_server_initializer has been added to dask_process_global.

Parameters

sentence_records (Bag) – Each record needs id and sent_text.
work_dir (Path) – A directory visible to all workers where SemRep intermediate files will be stored.
semrep_install_dir (Path) – The path where SemRep was installed.

Return type

Bag

Returns

One record per input sentence, where the id of the new record matches the input. However, returned records will only have entities and predicates.

agatha.construct.semrep_util.get_metamap_server_initializer(metamap_install_dir)

Return type: Tuple[str, Callable[[], Any]]
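The return type suggests a (name, initializer) pair: a string key plus a zero-argument callable that builds the per-process object on demand. A minimal sketch of that pattern follows. The registry functions, the key "metamap_server", and the FakeServer class are all hypothetical stand-ins for illustration; the real dask_process_global interface may differ.

```python
from typing import Any, Callable, Dict, Tuple


class FakeServer:
    """Hypothetical placeholder for an expensive per-process resource."""

    def __init__(self, install_dir: str):
        self.install_dir = install_dir


def make_initializer(install_dir: str) -> Tuple[str, Callable[[], Any]]:
    # Return a key and a zero-argument factory; nothing is built yet.
    def _init() -> Any:
        return FakeServer(install_dir)

    return "metamap_server", _init


# Hypothetical process-wide registry of factories and built objects.
_initializers: Dict[str, Callable[[], Any]] = {}
_instances: Dict[str, Any] = {}


def register(name: str, init: Callable[[], Any]) -> None:
    _initializers[name] = init


def get(name: str) -> Any:
    # Build the object on first access, then reuse it within this process.
    if name not in _instances:
        _instances[name] = _initializers[name]()
    return _instances[name]


register(*make_initializer("/path/to/public_mm"))
server_a = get("metamap_server")
server_b = get("metamap_server")  # Same object as server_a.
```

Deferring construction to a callable keeps the registration step cheap and serializable, so each worker process can build its own server object the first time a task needs it.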

agatha.construct.semrep_util.get_paths(semrep_install_dir=None, metamap_install_dir=None)

Looks up all of the files needed to run SemRep.

This function identifies the binaries and libraries needed to run SemRep. Additionally, this function asserts that all the needed files are actually present.

This function will find:

skrmedpostctl: MetaMap's SKR/MedPost Part-of-Speech Tagger Server
wsdserverctl: MetaMap's Word Sense Disambiguation (WSD) Server
SEMREPrun.v*: The preamble needed to run SemRep
semrep.v*.BINARY.Linux: The binary used to run SemRep
lib: The Java libraries in SemRep

If only one of semrep_install_dir or metamap_install_dir is specified, then only that component's paths will be returned.

Parameters

semrep_install_dir (Optional[Path]) – The install location of SemRep. Named public_semrep by default.
metamap_install_dir (Optional[Path]) – The install location of MetaMap. Named public_mm by default.

Return type

Dict[str, Path]

Returns

A dictionary of names and associated paths. If a name ends in _path, the path has been asserted with is_file(). If a name ends in _dir, the path has been asserted with is_dir().
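The naming convention of the returned dictionary can be sketched as follows. check_paths is a hypothetical helper, not Agatha's code, and the keys and file names are invented for the demonstration. It only shows how a _path suffix maps to an is_file() assertion and a _dir suffix to an is_dir() assertion.

```python
import tempfile
from pathlib import Path
from typing import Dict


def check_paths(paths: Dict[str, Path]) -> Dict[str, Path]:
    """Asserts that each entry exists, based on the suffix of its name."""
    for name, path in paths.items():
        if name.endswith("_path"):
            assert path.is_file(), f"Missing file for {name}: {path}"
        elif name.endswith("_dir"):
            assert path.is_dir(), f"Missing directory for {name}: {path}"
    return paths


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    lib_dir = root / "lib"
    lib_dir.mkdir()
    binary_path = root / "example_binary"
    binary_path.touch()

    checked = check_paths({
        "example_lib_dir": lib_dir,
        "example_binary_path": binary_path,
    })

    # A name ending in _path that points at nothing fails the check.
    try:
        check_paths({"missing_path": root / "does_not_exist"})
        missing_detected = False
    except AssertionError:
        missing_detected = True
```

Checking everything up front means a misconfigured install fails at lookup time rather than partway through a cluster job.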

agatha.construct.semrep_util.semrep_xml_to_records(xml_path)

Parses SemRep XML records to produce Predicate Records

This parses SemRep XML output, generated by SemRep v1.8 via the --xml_output_format flag. See the XML output description [1] for more details on the spec. Additional details are below. We specifically focus on parsing XML records produced by the SemRepRunner.

XML Format Summary: The XML file starts with an overarching SemRepAnnotation object, containing multiple Document records, one per input text. These documents contain identified UMLS terms (Document > Utterance > Entity) and predicates (Document > Utterance > Predication). One document may have multiple utterances.

Parameters: xml_path (Path) – Location of XML file to parse.
Return type: List[Dict[str, Any]]
Returns: A list of Python dicts, each corresponding to a detected predicate.
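The nesting described in the format summary can be walked with the standard library, as in the rough sketch below. The sample XML is hand-written, and its attribute names and values are assumptions rather than the real SemRep schema; only the element names come from the summary above. Unlike semrep_xml_to_records, which returns one dict per detected predicate, this hypothetical summarize_documents helper just collects ids per document.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

# Tiny hand-written sample; attribute names and values are assumptions.
SAMPLE_XML = """
<SemRepAnnotation>
  <Document id="doc1">
    <Utterance id="doc1.u1">
      <Entity id="doc1.e1" name="Aspirin"/>
      <Entity id="doc1.e2" name="Headache"/>
      <Predication id="doc1.p1"/>
    </Utterance>
    <Utterance id="doc1.u2">
      <Entity id="doc1.e3" name="Fever"/>
    </Utterance>
  </Document>
</SemRepAnnotation>
"""


def summarize_documents(xml_text: str) -> List[Dict[str, Any]]:
    """Collects entity and predication ids for each Document element."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for doc in root.iter("Document"):
        records.append({
            "id": doc.get("id"),
            # Document > Utterance > Entity
            "entities": [e.get("id") for e in doc.findall("./Utterance/Entity")],
            # Document > Utterance > Predication
            "predicates": [
                p.get("id") for p in doc.findall("./Utterance/Predication")
            ],
        })
    return records


records = summarize_documents(SAMPLE_XML)
```

Because one document may hold several utterances, the path expressions gather matches across all utterances of a document in document order.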

[1]: https://semrep.nlm.nih.gov/SemRep.v1.8_XML_output_desc.html

agatha.construct.semrep_util.sentences_to_semrep_input(records, unicode_to_ascii_jar_path)

Processes Sentence Records for SemRep Input

The SemRepRunner, with the default single_line_delim_input_w_id flag set, expects input in the form:

```
id1|Sentence 1
id2|Sentence 2
…
```

This function converts Agatha sentence records, containing the sent_text and id fields, into the single_line_delim_input_w_id format. Because each sentence must occur on its own line, this function replaces newline characters with spaces in the output.
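The conversion described above can be sketched in plain Python. This is a simplified illustration, not the module's implementation: to_single_line_input is a hypothetical name, and the sketch leaves out the unicode-to-ASCII step that the real function performs with the MetaMap-provided jar.

```python
from typing import Any, Dict, Iterable, List


def to_single_line_input(records: Iterable[Dict[str, Any]]) -> List[str]:
    """Formats records as `id|text` lines, one sentence per line."""
    lines = []
    for record in records:
        # Each sentence must fit on one line, so newlines become spaces.
        text = record["sent_text"].replace("\n", " ")
        lines.append(f"{record['id']}|{text}")
    return lines


example = to_single_line_input([
    {"id": "s1", "sent_text": "Aspirin treats\nheadache."},
    {"id": "s2", "sent_text": "A second sentence."},
])
# example == ["s1|Aspirin treats headache.", "s2|A second sentence."]
```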

Recommended Usage:

```python3
sentences.map_partitions(
    sentences_to_semrep_input,
    unicode_to_ascii_jar_path,
).to_textfiles(…)
```

Parameters

records (Iterable[Dict[str, Any]]) – Sentence records, each containing sent_text and id
unicode_to_ascii_jar_path (Path) – The location of the MetaMap-provided jar

Return type

List[str]