agatha.construct.text_util module
-
agatha.construct.text_util.add_bow_to_analyzed_sentence(records, bow_field='bow', token_field='tokens', entity_field='entities', mesh_heading_field='mesh_headings', ngram_field='ngrams')
- Return type
Dict[str, Any]
-
agatha.construct.text_util.analyze_sentences(records, text_field, token_field='tokens', entity_field='entities')
Parses the text fields of all records using SciSpacy. Requires that text_util:nlp and text_util:stopwords have both been loaded into dask_process_global.
@param records: A partition of records to parse; each must contain text_field.
@param text_field: The name of the field we wish to parse.
@param token_field: The output field for all basic tokens. These are sub-records containing information such as POS tag and lemma.
@param entity_field: The output field for all entities, which are multi-token phrases.
@return: A list of records with token and entity fields.
- Return type
Iterable[Dict[str, Any]]
-
agatha.construct.text_util.entity_to_id(entity, sentence, token_field='tokens')
- Return type
str
-
agatha.construct.text_util.get_adjacent_sentences(sentence_record)
Given the i’th sentence, return the keys for sentence i-1 and i+1 if they exist.
- Return type
Set[str]
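The neighbor lookup described for get_adjacent_sentences can be illustrated with a minimal sketch. It is not the module's implementation: the function names, the "pmid" and "version" record fields, and the "s" value used for SENTENCE_TYPE are assumptions made for illustration. Only the {SENTENCE_TYPE}:{pmid}:{version}:{sent_idx} key layout and the sent_idx / sent_total fields come from the split_sentences documentation on this page.

```python
from typing import Any, Dict, Set

# Hypothetical stand-in for SENTENCE_TYPE; the real constant is not shown on this page.
SENTENCE_TYPE = "s"


def sketch_sentence_id(pmid: int, version: int, sent_idx: int) -> str:
    """Builds a key in the {SENTENCE_TYPE}:{pmid}:{version}:{sent_idx} layout."""
    return f"{SENTENCE_TYPE}:{pmid}:{version}:{sent_idx}"


def sketch_adjacent_sentences(sentence_record: Dict[str, Any]) -> Set[str]:
    """Returns keys for sentences i-1 and i+1, skipping any outside the document."""
    idx = sentence_record["sent_idx"]
    total = sentence_record["sent_total"]
    pmid = sentence_record["pmid"]        # assumed field name
    version = sentence_record["version"]  # assumed field name
    neighbors: Set[str] = set()
    if idx > 0:
        # A previous sentence exists only when this is not the first one.
        neighbors.add(sketch_sentence_id(pmid, version, idx - 1))
    if idx + 1 < total:
        # A next sentence exists only when this is not the last one.
        neighbors.add(sketch_sentence_id(pmid, version, idx + 1))
    return neighbors
```

Under these assumptions, the first and last sentences of a document yield a single neighbor key, and a one-sentence document yields an empty set.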
-
agatha.construct.text_util.get_entity_keys(sentence_record)
- Return type
List[str]
-
agatha.construct.text_util.get_entity_text(entity, sentence, token_field='tokens')
- Return type
str
-
agatha.construct.text_util.get_interesting_token_keys(sentence_record)
- Return type
List[str]
-
agatha.construct.text_util.get_mesh_keys(sentence_record)
- Return type
List[str]
-
agatha.construct.text_util.get_ngram_keys(sentence_record)
- Return type
List[str]
-
agatha.construct.text_util.get_scispacy_initalizer(scispacy_version)
- Return type
Tuple[str, Callable[[], Any]]
-
agatha.construct.text_util.get_sentence_id(pmid, version, sent_idx)
- Return type
str
-
agatha.construct.text_util.get_stopwordlist_initializer(stopword_path)
- Return type
Tuple[str, Callable[[], Any]]
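The Tuple[str, Callable[[], Any]] return type suggests a (name, initializer) pair, where the name matches a key such as text_util:stopwords that analyze_sentences expects in dask_process_global. The following is a rough sketch of that pattern, not the module's code: the function name sketch_stopword_initializer, the one-word-per-line file format, and the lowercasing are all assumptions.

```python
from pathlib import Path
from typing import Any, Callable, Set, Tuple


def sketch_stopword_initializer(stopword_path: Path) -> Tuple[str, Callable[[], Any]]:
    """Returns a (name, loader) pair; the file is only read when the loader is called."""

    def _init() -> Set[str]:
        # Assumption: one stopword per line, blank lines ignored, stored lowercase.
        with open(stopword_path) as stop_file:
            return {line.strip().lower() for line in stop_file if line.strip()}

    # The name is the key under which the loaded object would be registered.
    return "text_util:stopwords", _init
```

Returning a callable instead of the loaded set defers the file read, so each worker process could load its own copy on first use.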
-
agatha.construct.text_util.mesh_to_id(mesh_code)
- Return type
str
-
agatha.construct.text_util.ngram_to_id(ngram_text)
- Return type
str
-
agatha.construct.text_util.sentence_to_id(sent)
- Return type
str
-
agatha.construct.text_util.split_sentences(records, text_data_field='text_data', id_field='id', min_sentence_len=None, max_sentence_len=None)
Splits a document into its collection of sentences. Text field elements are processed in order: each one is split into sentences, and a new element is created in the result for each sentence. All fields from the original document, as well as the fields of the text element (minus the actual text itself), are copied over.
If min_sentence_len or max_sentence_len is specified, sentences whose length falls outside the given range are excluded.
id_field will be set to {SENTENCE_TYPE}:{pmid}:{version}:{sent_idx}
For instance:
    {
      "status": "Published",
      "umls": ["C123", "C456"],
      "text_fields": [
        {"text": "Title 1", "type": "title"},
        {"text": "This is an abstract. This is another sentence.", "type": "abstract:raw"}
      ]
    }
becomes:
    [
      {"status": "Published", "umls": ["C123", "C456"], "sent_text": "Title 1", "sent_type": "title", "sent_idx": 0, "sent_total": 3},
      {"status": "Published", "umls": ["C123", "C456"], "sent_text": "This is an abstract.", "sent_type": "abstract:raw", "sent_idx": 1, "sent_total": 3},
      {"status": "Published", "umls": ["C123", "C456"], "sent_text": "This is another sentence.", "sent_type": "abstract:raw", "sent_idx": 2, "sent_total": 3}
    ]
- Return type
List[Dict[str, Any]]
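The transformation in the split_sentences example can be approximated with a short sketch. It makes several assumptions and is not the module's implementation: it handles a single document record, uses a naive regex splitter in place of whatever sentence splitter the module actually uses, measures sentence length in characters, assigns indices after filtering, defaults text_data_field to "text_fields" to match the example (the documented default is 'text_data'), and leaves out id_field because the example record carries no pmid or version.

```python
import re
from typing import Any, Dict, List, Optional


def naive_sentence_split(text: str) -> List[str]:
    """Stand-in splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sketch_split_sentences(
    record: Dict[str, Any],
    text_data_field: str = "text_fields",
    min_sentence_len: Optional[int] = None,
    max_sentence_len: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Expands one document record into one record per sentence."""
    # Every document field except the text container is copied to each sentence.
    shared = {k: v for k, v in record.items() if k != text_data_field}
    sentences: List[Dict[str, Any]] = []
    for text_elem in record[text_data_field]:
        # Text-element fields are carried over with a sent_ prefix, minus the text.
        extra = {f"sent_{k}": v for k, v in text_elem.items() if k != "text"}
        for sent_text in naive_sentence_split(text_elem["text"]):
            if min_sentence_len is not None and len(sent_text) < min_sentence_len:
                continue
            if max_sentence_len is not None and len(sent_text) > max_sentence_len:
                continue
            sentences.append({**shared, **extra, "sent_text": sent_text})
    # Assumption: indices are assigned after filtering so they stay contiguous.
    for idx, sent in enumerate(sentences):
        sent["sent_idx"] = idx
        sent["sent_total"] = len(sentences)
    return sentences
```

Applied to the input record from the example, this sketch yields three sentence records whose sent_text, sent_type, sent_idx and sent_total fields match the documented output.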
-
agatha.construct.text_util.token_to_id(token)
- Return type
str