agatha.construct.text_util module

agatha.construct.text_util.add_bow_to_analyzed_sentence(records, bow_field='bow', token_field='tokens', entity_field='entities', mesh_heading_field='mesh_headings', ngram_field='ngrams')
Return type

Dict[str, Any]

agatha.construct.text_util.analyze_sentences(records, text_field, token_field='tokens', entity_field='entities')

Parses the text fields of all records using SciSpacy. Requires that text_util:nlp and text_util:stopwords have both been loaded into dask_process_global.

Parameters

records – A partition of records to parse; each must contain text_field.
text_field – The name of the field we wish to parse.
token_field – The output field for all basic tokens. These are sub-records containing information such as POS tag and lemma.
entity_field – The output field for all entities, which are multi-token phrases.

Returns a list of records with token and entity fields.

Return type

Iterable[Dict[str, Any]]

agatha.construct.text_util.entity_to_id(entity, sentence, token_field='tokens')
Return type

str

agatha.construct.text_util.get_adjacent_sentences(sentence_record)

Given the i’th sentence, return the keys for sentence i-1 and i+1 if they exist.

Return type

Set[str]
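The adjacency lookup can be sketched as follows. This is a minimal sketch, not the library's implementation: the record field names (pmid, version, sent_idx, sent_total) and the key format ({SENTENCE_TYPE}:{pmid}:{version}:{sent_idx}, per split_sentences below) are assumptions inferred from the surrounding documentation, and SENTENCE_TYPE is a hypothetical stand-in for the module's real prefix constant.

```python
from typing import Any, Dict, Set

# Hypothetical stand-in for the module-level sentence-type prefix.
SENTENCE_TYPE = "s"

def get_adjacent_sentences(sentence_record: Dict[str, Any]) -> Set[str]:
    """Return the keys of sentences i-1 and i+1, when they exist."""
    pmid = sentence_record["pmid"]
    version = sentence_record["version"]
    idx = sentence_record["sent_idx"]
    total = sentence_record["sent_total"]
    keys = set()
    if idx > 0:
        keys.add(f"{SENTENCE_TYPE}:{pmid}:{version}:{idx - 1}")
    if idx + 1 < total:
        keys.add(f"{SENTENCE_TYPE}:{pmid}:{version}:{idx + 1}")
    return keys
```

Note that the first and last sentences of a document yield only one neighbor each, and a single-sentence document yields none.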

agatha.construct.text_util.get_entity_keys(sentence_record)
Return type

List[str]

agatha.construct.text_util.get_entity_text(entity, sentence, token_field='tokens')
Return type

str

agatha.construct.text_util.get_interesting_token_keys(sentence_record)
Return type

List[str]

agatha.construct.text_util.get_mesh_keys(sentence_record)
Return type

List[str]

agatha.construct.text_util.get_ngram_keys(sentence_record)
Return type

List[str]

agatha.construct.text_util.get_scispacy_initalizer(scispacy_version)
Return type

Tuple[str, Callable[[], Any]]

agatha.construct.text_util.get_sentence_id(pmid, version, sent_idx)
Return type

str
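A plausible sketch of this helper, assuming it composes the {SENTENCE_TYPE}:{pmid}:{version}:{sent_idx} key format described under split_sentences below; SENTENCE_TYPE here is a hypothetical placeholder for the module's actual prefix constant.

```python
# Hypothetical placeholder for the module's sentence-type prefix.
SENTENCE_TYPE = "s"

def get_sentence_id(pmid: int, version: int, sent_idx: int) -> str:
    """Compose a sentence key as {SENTENCE_TYPE}:{pmid}:{version}:{sent_idx}."""
    return f"{SENTENCE_TYPE}:{pmid}:{version}:{sent_idx}"
```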

agatha.construct.text_util.get_stopwordlist_initializer(stopword_path)
Return type

Tuple[str, Callable[[], Any]]

agatha.construct.text_util.mesh_to_id(mesh_code)
Return type

str

agatha.construct.text_util.ngram_to_id(ngram_text)
Return type

str

agatha.construct.text_util.sentence_to_id(sent)
Return type

str

agatha.construct.text_util.split_sentences(records, text_data_field='text_data', id_field='id', min_sentence_len=None, max_sentence_len=None)

Splits a document into its collection of sentences. Iterating over the text-field elements in order, we split each element into sentences and create a new record for each. All fields from the original document, as well as the text-field metadata (minus the actual text itself), are copied over.

If min_sentence_len / max_sentence_len are specified, sentences whose length falls outside that range are discarded.

id_field will be set to {SENTENCE_TYPE}:{pmid}:{version}:{sent_idx}

For instance:

{
  "status": "Published",
  "umls": ["C123", "C456"],
  "text_fields": [{
    "text": "Title 1",
    "type": "title"
  }, {
    "text": "This is an abstract. This is another sentence.",
    "type": "abstract:raw"
  }]
}

becomes:

[{
  "status": "Published",
  "umls": ["C123", "C456"],
  "sent_text": "Title 1",
  "sent_type": "title",
  "sent_idx": 0,
  "sent_total": 3
}, {
  "status": "Published",
  "umls": ["C123", "C456"],
  "sent_text": "This is an abstract.",
  "sent_type": "abstract:raw",
  "sent_idx": 1,
  "sent_total": 3
}, {
  "status": "Published",
  "umls": ["C123", "C456"],
  "sent_text": "This is another sentence.",
  "sent_type": "abstract:raw",
  "sent_idx": 2,
  "sent_total": 3
}]

Return type

List[Dict[str, Any]]
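The behavior above can be sketched for a single record. This is a simplified illustration, not the library's implementation: it uses a naive period-based splitter in place of the real sentence segmenter, operates on one record rather than a partition, and omits id_field assignment; the helper name split_sentences_sketch is hypothetical.

```python
from typing import Any, Dict, List, Optional

def split_sentences_sketch(
    record: Dict[str, Any],
    text_data_field: str = "text_data",
    min_sentence_len: Optional[int] = None,
    max_sentence_len: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Split one document record into per-sentence records (naive splitter)."""
    # Every field except the text data is copied into each sentence record.
    base = {k: v for k, v in record.items() if k != text_data_field}
    sentences: List[Dict[str, Any]] = []
    for field in record[text_data_field]:
        # Naive stand-in for the real sentence segmenter.
        for text in filter(None, (s.strip() for s in field["text"].split("."))):
            if min_sentence_len is not None and len(text) < min_sentence_len:
                continue
            if max_sentence_len is not None and len(text) > max_sentence_len:
                continue
            sentences.append(dict(base, sent_text=text, sent_type=field["type"]))
    # sent_idx counts across all text fields; sent_total covers the document.
    for idx, sent in enumerate(sentences):
        sent["sent_idx"] = idx
        sent["sent_total"] = len(sentences)
    return sentences
```

This mirrors the worked example: a title plus a two-sentence abstract yields three records, each carrying the document-level fields and a shared sent_total.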

agatha.construct.text_util.token_to_id(token)
Return type

str