agatha.construct.semrep_util module

SemRep Dask Utilities

This module helps run SemRep within the Agatha graph construction pipeline. For this to work, we need to run SemRep on each machine in our cluster, and extract all necessary information as edges.

To run SemRep, you must first start the MetaMap servers for part-of-speech tagging and word-sense disambiguation. Both ship with MetaMap: this module expects to find skrmedpostctl and wsdserverctl in the directory specified through config.semrep.metamap_bin_dir. Once these servers are running, SemRep can be run.
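As a rough sketch of the startup step described above, the commands for the two control scripts can be assembled as follows. The helper name metamap_server_commands and the example install path are hypothetical, and the "start" argument is assumed to be the subcommand the MetaMap control scripts accept. In practice the MetaMapServer class in this module is what manages the servers.

```python
from pathlib import Path
from typing import List


def metamap_server_commands(metamap_bin_dir: Path) -> List[List[str]]:
    """Builds the commands that would start both MetaMap servers."""
    commands = []
    for script in ("skrmedpostctl", "wsdserverctl"):
        # Assumption: each control script takes a "start" subcommand.
        commands.append([str(metamap_bin_dir / script), "start"])
    return commands


# Hypothetical install location, used only for illustration.
cmds = metamap_server_commands(Path("/opt/public_mm/bin"))

# On a machine with MetaMap installed, each command could then be
# launched with subprocess.run(cmd, check=True).
```

Building the commands separately from running them keeps the sketch usable on machines without MetaMap installed.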

class agatha.construct.semrep_util.MetaMapServer(metamap_install_dir)

Bases: object

Manages connection to MetaMap

SemRep requires a connection to MetaMap. This means we need to launch the pos_server and wsd_server. This class is responsible for managing that server connection. We anticipate using one server per worker, meaning this class will be initialized through a dask_process_global initializer.

Parameters

metamap_install_dir (Path) – The install location of MetaMap

running()
start()

Call to start the MetaMap servers, if not already running.

stop()

Stops the MetaMap servers, if running
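The start/stop lifecycle above lends itself to a context manager, so that the servers are stopped even if processing fails. The following is a minimal sketch, not part of this module: running_server and FakeServer are hypothetical names, and FakeServer only mimics the three documented methods (it assumes running() returns a bool, which is not stated here).

```python
from contextlib import contextmanager


@contextmanager
def running_server(server):
    """Starts `server` on entry and stops it on exit, even on error."""
    server.start()
    try:
        yield server
    finally:
        server.stop()


class FakeServer:
    """Stand-in exposing the same three methods as MetaMapServer."""

    def __init__(self):
        self._running = False

    def running(self):
        return self._running

    def start(self):
        self._running = True

    def stop(self):
        self._running = False


fake = FakeServer()
with running_server(fake) as server:
    # Inside the block the server reports that it is running.
    was_running = server.running()
```

With a real MetaMapServer in place of FakeServer, the same pattern would wrap whatever work needs the servers.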

class agatha.construct.semrep_util.SemRepRunner(semrep_install_dir, metamap_server, anaphora_resolution=True, dysonym_processing=True, lexicon_year=2006, mm_data_version='USAbase', mm_data_year='2006AA', relaxed_model=True, single_line_delim_input_w_id=True, use_generic_domain_extensions=False, use_generic_domain_modification=False, word_sense_disambiguation=True)

Bases: object

Responsible for running SemRep.

Given a MetaMap server and additional SemRep configuration, this class processes text and generates predicates. The SemRep flags are exposed through the constructor, and all of SemRep's defaults are preserved.

Parameters
  • semrep_install_dir (Path) – Location where semrep is installed.

  • metamap_server (MetaMapServer) – A connection to the MetaMapServer that enables us to actually run SemRep. We use this to ensure server is running.

  • work_dir – Location to store intermediate files used to communicate with SemRep.

  • anaphora_resolution – SemRep Flag

  • dysonym_processing – SemRep Flag

  • lexicon_year (int) – The year as an int which we use with MetaMap. Ex: 2020

  • mm_data_version (str) – Specify which UMLS data version. Ex: USAbase

  • mm_data_year (str) – Specify UMLS release year. Ex: 2020AA

  • relaxed_model (bool) – SemRep Flag

  • use_generic_domain_extensions – SemRep Flag

  • use_generic_domain_modification – SemRep Flag

  • word_sense_disambiguation – SemRep Flag

run(input_path, output_path)

Actually calls SemRep with an input file.

Parameters

input_path (Path) – The location of the SemRep Input file

Return type

None

Returns

The path produced by SemRep representing XML output.

class agatha.construct.semrep_util.UnicodeToAsciiRunner(unicode_to_ascii_jar_path)

Bases: object

Responsible for running the MetaMap unicode to ascii jar

clean_text_for_metamap(s)

MetaMap places several restrictions on its input text. This method cleans a string so that it complies with them.

Return type

str

agatha.construct.semrep_util.extract_entities_and_predicates_from_sentences(sentence_records, semrep_install_dir, unicode_to_ascii_jar_path, work_dir, lexicon_year, mm_data_year, mm_data_version)

Runs each sentence through SemRep. Identifies Predicates and Entities

Requires get_metamap_server_initializer added to dask_process_global.

Parameters
  • sentence_records (Bag) – Each record needs id and sent_text.

  • work_dir (Path) – A directory visible to all workers where SemRep intermediate files will be stored.

  • semrep_install_dir (Path) – The path where semrep was installed.

Return type

Bag

Returns

One record per input sentence, where the id of each new record matches its input. However, returned records will only have entities and predicates.

agatha.construct.semrep_util.get_metamap_server_initializer(metamap_install_dir)
Return type

Tuple[str, Callable[[], Any]]

agatha.construct.semrep_util.get_paths(semrep_install_dir=None, metamap_install_dir=None)

Looks up all of the necessary files needed to run SemRep.

This function identifies the binaries and libraries needed to run SemRep. Additionally, this function asserts that all the needed files are actually present.

This function will find:
  • skrmedpostctl: Metamap’s SKR/Medpost Part-of-Speech Tagger Server

  • wsdserverctl: Metamap’s Word Sense Disambiguation (WSD) Server

  • SEMREPrun.v*: The preamble needed to run SemRep

  • semrep.v*.BINARY.Linux: The binary used to run SemRep

  • lib: The Java libraries in SemRep

If only one of semrep_install_dir or metamap_install_dir is specified, then only that component's paths will be returned.

Parameters
  • semrep_install_dir (Optional[Path]) – The install location of SemRep. Named public_semrep by default.

  • metamap_install_dir (Optional[Path]) – The install location of MetaMap. Named public_mm by default.

Return type

Dict[str, Path]

Returns

A dictionary of names and associated paths. If a name ends in _path, the path has been asserted to satisfy is_file(). If a name ends in _dir, the path has been asserted to satisfy is_dir().
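The suffix convention for the returned dictionary can be illustrated with a small checker. This is a sketch of the convention only, not the implementation of get_paths. The function name check_paths and the dictionary keys are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Dict


def check_paths(paths: Dict[str, Path]) -> None:
    """Asserts that *_path entries are files and *_dir entries are directories."""
    for name, path in paths.items():
        if name.endswith("_path"):
            assert path.is_file(), f"Missing file for {name}: {path}"
        elif name.endswith("_dir"):
            assert path.is_dir(), f"Missing directory for {name}: {path}"


with tempfile.TemporaryDirectory() as tmp:
    lib_dir = Path(tmp) / "lib"
    lib_dir.mkdir()
    binary_path = Path(tmp) / "binary"
    binary_path.touch()
    # Key names are illustrative, not the keys get_paths returns.
    example = {"example_lib_dir": lib_dir, "example_binary_path": binary_path}
    check_paths(example)
    checked = True
```

Because the checks happen up front, a missing binary surfaces as a clear assertion failure before any SemRep process is launched.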

agatha.construct.semrep_util.semrep_xml_to_records(xml_path)

Parses SemRep XML records to produce Predicate Records

This parses SemRep XML output, generated by SemRep v1.8 via the --xml_output_format flag. Take a look [here][1] to get more details on the XML spec. Additional details below. We specifically focus on parsing XML records produced by the SemRepRunner.

XML Format Summary: The XML file starts with an overarching SemRepAnnotation object, containing multiple Document records, one per input text. These documents contain identified UMLS terms (Document > Utterance > Entity) and predicates (Document > Utterance > Predication). One document may have multiple utterances.

Parameters

xml_path (Path) – Location of XML file to parse.

Return type

List[Dict[str, Any]]

Returns

A list of python dicts wherein each corresponds to a detected predicate.

[1]:https://semrep.nlm.nih.gov/SemRep.v1.8_XML_output_desc.html
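The nesting described in the format summary (SemRepAnnotation > Document > Utterance > Predication) can be walked with the standard library's ElementTree. This is a minimal sketch, not the parser this module uses: parse_semrep_xml_text is a hypothetical name, the sample document is invented, and its id attributes are placeholders rather than the attributes defined by the SemRep XML spec.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def parse_semrep_xml_text(xml_text: str) -> List[Dict[str, Any]]:
    """Returns one dict per Predication, with its enclosing context."""
    root = ET.fromstring(xml_text.strip())  # The SemRepAnnotation element
    records = []
    for document in root.findall("Document"):
        for utterance in document.findall("Utterance"):
            for predication in utterance.findall("Predication"):
                # Copy raw attributes rather than assume specific names.
                records.append({
                    "document": dict(document.attrib),
                    "utterance": dict(utterance.attrib),
                    "predication": dict(predication.attrib),
                })
    return records


# Invented sample; attribute names are placeholders.
SAMPLE = """
<SemRepAnnotation>
  <Document id="D1">
    <Utterance id="D1.U1">
      <Entity id="D1.E1"/>
      <Predication id="D1.P1"/>
    </Utterance>
    <Utterance id="D1.U2">
      <Predication id="D1.P2"/>
    </Utterance>
  </Document>
</SemRepAnnotation>
"""

records = parse_semrep_xml_text(SAMPLE)
```

Each record keeps the attributes of the enclosing Document and Utterance, so a predicate can be traced back to the input text that produced it.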

agatha.construct.semrep_util.sentences_to_semrep_input(records, unicode_to_ascii_jar_path)

Processes Sentence Records for SemRep Input

The SemRepRunner, with the default single_line_delim_input_w_id flag set, expects input in the form:

```
id1|Sentence 1
id2|Sentence 2
```

This function converts Agatha sentence records, containing the sent_text and id fields, into the single_line_delim_input_w_id format. Because each sentence must occur on its own line, this function replaces newline characters with spaces in the output.

Recommended usage:

```python3
sentences.map_partitions(
  sentences_to_semrep_input,
  unicode_to_ascii_jar_path,
).to_textfiles(…)
```

Parameters
  • records (Iterable[Dict[str, Any]]) – Sentence records, each containing sent_text and id

  • unicode_to_ascii_jar_path (Path) – The location of the metamap-provided jar

Return type

List[str]
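The core of the conversion, one id|text line per record with newlines replaced by spaces, can be sketched in plain Python. This simplified version is an illustration only: the name to_semrep_lines is hypothetical, and it skips both the unicode-to-ASCII jar and any MetaMap-specific text cleaning that the real function performs.

```python
from typing import Any, Dict, Iterable, List


def to_semrep_lines(records: Iterable[Dict[str, Any]]) -> List[str]:
    """Formats records as `id|sentence`, one sentence per line."""
    lines = []
    for record in records:
        # Each sentence must stay on one line, so newlines become spaces.
        text = record["sent_text"].replace("\r", " ").replace("\n", " ")
        lines.append(f"{record['id']}|{text}")
    return lines


example_records = [
    {"id": "s1", "sent_text": "Aspirin treats\nheadache."},
    {"id": "s2", "sent_text": "Second sentence."},
]
lines = to_semrep_lines(example_records)
# lines == ["s1|Aspirin treats headache.", "s2|Second sentence."]
```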