agatha.construct.document_pipeline module

agatha.construct.document_pipeline.get_covid_documents(config)
Return type

Bag

agatha.construct.document_pipeline.get_medline_documents(config)
Return type

Bag

agatha.construct.document_pipeline.perform_document_independent_tasks(config, documents, ckpt_prefix, semrep_work_dir=None)

Performs tasks that don’t require communication between documents.

Performs all of the document processing operations that must happen on each document separately. It is important to keep different input textual features separate, because this lets us update or invalidate particular sets of checkpoints faster.

Parameters
  • config (ConstructConfig) – Construction configuration

  • documents (Bag) – Collection of texts to process

  • ckpt_prefix (str) – To avoid collisions and to improve caching, each call to this function should use a different prefix that indicates the type of the corresponding documents. For instance, a call with medline documents could use the medline prefix.

  • semrep_work_dir (Optional[Path]) – The location to store semrep intermediate files. Only used if semrep has been installed and configured.

Return type

None
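
The role of ckpt_prefix can be illustrated with a minimal, self-contained sketch. This is not agatha's implementation: the checkpoints dictionary, checkpoint_name, and run_independent_task are hypothetical names invented for illustration, and plain Python lists stand in for Bag. It only shows how a distinct prefix per document source keeps per-document results from colliding and lets one source's checkpoints be invalidated without touching the other's.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for a checkpoint store; not part of agatha.
checkpoints: Dict[str, List[str]] = {}


def checkpoint_name(ckpt_prefix: str, task: str) -> str:
    """Builds a checkpoint key that is unique per (prefix, task) pair."""
    return f"{ckpt_prefix}_{task}"


def run_independent_task(
    documents: List[str],
    ckpt_prefix: str,
    task: str,
    per_document_fn: Callable[[str], str],
) -> List[str]:
    """Applies per_document_fn to each document on its own, caching by key."""
    key = checkpoint_name(ckpt_prefix, task)
    if key not in checkpoints:
        # No document needs information from any other document here.
        checkpoints[key] = [per_document_fn(doc) for doc in documents]
    return checkpoints[key]


# Same task, two document sources, two different prefixes (assumed values).
medline = run_independent_task(
    ["Aspirin reduces fever."], "medline", "lowercase", str.lower
)
covid = run_independent_task(
    ["COVID-19 spreads fast."], "covid", "lowercase", str.lower
)

# Invalidating one source leaves the other source's checkpoint in place.
del checkpoints[checkpoint_name("covid", "lowercase")]
```

With a shared prefix, the second call would have returned the first call's cached result, which is the collision the parameter description warns about.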