agatha.construct.checkpoint module

A singleton responsible for saving and loading dask bags.

agatha.construct.checkpoint.checkpoint(name, bag=None, verbose=None, allow_partial=None, halt_after=None, textfile=False, **compute_kw)

Stores the contents of the bag as a series of files.

This function takes the partitions of the input bag and writes them to files within a directory associated with the input name. The location of each checkpoint directory depends on the checkpoint root, which is set with set_root.

For each optional argument of this function (other than bag), there is an associated module-level parameter that can be set globally.

The module-level parameter checkpoint_root, set with set_root, must be set before calling checkpoint.
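The fallback from a per-call argument to a module-level parameter can be pictured with a short sketch. This is not agatha's code: the names `sketch_set_verbose`, `sketch_get_verbose`, `resolve_verbose`, and the default value are assumptions made for illustration only.

```python
from typing import Optional

# Hypothetical module-level state; the default shown here is an assumption.
_PARAMS = {"verbose": False}


def sketch_set_verbose(is_verbose: bool) -> None:
    """Set the global default used when a call does not pass `verbose`."""
    _PARAMS["verbose"] = is_verbose


def sketch_get_verbose() -> bool:
    """Read the current global default."""
    return _PARAMS["verbose"]


def resolve_verbose(verbose: Optional[bool] = None) -> bool:
    """Prefer the per-call value; fall back to the global one when it is None."""
    return sketch_get_verbose() if verbose is None else verbose
```

Under this pattern, `None` means "not specified", so an explicit `False` passed by the caller still overrides a global `True`.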

Usage:

checkpoint(name) - returns the load op for checkpoint “name”.

checkpoint(name, bag) - writes bag to checkpoint “name” and returns the load op. If disable() was called, returns the input bag instead.

Parameters
  • name (str) – The name of the checkpoint directory to lookup or save to

  • bag (Optional[Bag]) – If set, save this bag. Otherwise, we will require that this checkpoint has already been saved.

  • verbose (Optional[bool]) – Print helper info. If unspecified, defaults to module-level parameter.

  • allow_partial (Optional[bool]) – If true, partial files present in an unfinished checkpoint directory will not be overwritten. If false, unfinished checkpoints will be recomputed in full. Defaults to module-level parameter if unset.

  • halt_after (Optional[str]) – If set to the name of the current checkpoint, the agatha process will stop after computing its contents. This is useful for partial pipeline runs, for instance when computing training data for an ML model.

  • textfile (bool) – If set, the checkpoint is stored in plaintext format, which is used to save strings. In this case, the function returns None.

Return type

Optional[Bag]

Returns

A dask bag that, if computed, _LOADS_ the specified checkpoint. This means that future operations can depend on the loading of intermediate data, rather than the intermediate computations themselves.
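The save-then-load behavior described above can be approximated with the standard library alone. The sketch below is a simplified stand-in, not agatha's implementation: it models a bag as a list of zero-argument functions, and the names `sketch_checkpoint`, `__done__`, and `part-*.pkl` are assumptions chosen for illustration.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List, Optional

# Hypothetical stand-in for a lazy, partitioned collection: each partition is
# a zero-argument function that produces a list of records when called.
Partition = Callable[[], list]


def sketch_checkpoint(
    root: Path,
    name: str,
    partitions: Optional[List[Partition]] = None,
    allow_partial: bool = True,
) -> List[Partition]:
    """Write each partition under root/name, then return loaders for the files."""
    ckpt_dir = root / name
    done_file = ckpt_dir / "__done__"  # assumed marker name, not agatha's
    if not done_file.is_file():
        if partitions is None:
            raise FileNotFoundError(f"Checkpoint {name!r} has not been saved")
        ckpt_dir.mkdir(parents=True, exist_ok=True)
        for idx, part in enumerate(partitions):
            part_file = ckpt_dir / f"part-{idx:05d}.pkl"
            if allow_partial and part_file.is_file():
                continue  # keep files left behind by an unfinished run
            part_file.write_bytes(pickle.dumps(part()))
        done_file.touch()  # mark the checkpoint as complete
    # The returned loaders read from disk, so the upstream work is not redone.
    files = sorted(ckpt_dir.glob("part-*.pkl"))
    return [lambda f=f: pickle.loads(f.read_bytes()) for f in files]


calls = []  # records every time the "expensive" computation actually runs


def make_partition(values):
    def compute():
        calls.append(values)
        return [v * 2 for v in values]
    return compute


root = Path(tempfile.mkdtemp())
loaders = sketch_checkpoint(
    root, "doubled", [make_partition([1, 2]), make_partition([3])]
)
first = [rec for load in loaders for rec in load()]
# A later call with only the name loads the saved files.
again = [rec for load in sketch_checkpoint(root, "doubled") for rec in load()]
```

Once the done marker exists, later calls skip the computation entirely, which is the property that lets downstream steps depend on loading rather than recomputing.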

agatha.construct.checkpoint.ckpt(bag_name, ckpt_prefix=None, **kwargs)

Simple checkpoint interface

This is syntactic sugar for the most common use case. You can replace ` my_dask_bag = checkpoint("my_dask_bag", my_dask_bag) ` with ` ckpt("my_dask_bag") `

Calling this function will replace the variable associated with bag_name after computing its checkpoint. This means that calling compute on later uses of bag_name will load that bag from storage, rather than perform all intermediate computations again.

Parameters
  • bag_name (str) – The name of a local variable corresponding to a dask bag. This bag will be computed and stored to a checkpoint of the same name. The bag variable will be replaced with a new bag that can be loaded from this checkpoint.

  • ckpt_prefix (Optional[str]) – If set, the provided string will be prefixed to the bag_name checkpoint. This allows the same variable names to be associated with different checkpoints. For instance, the document_pipeline functions create a bag named “sentences” regardless of the set of documents used to create those sentences. By specifying a prefix, different calls to document_pipeline can create different checkpoints.

Return type

None
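The effect of replacing a variable by name, and of ckpt_prefix, can be sketched with an explicit dictionary standing in for the caller's local variables. This is an illustration under assumptions, not agatha's code: `sketch_ckpt` and `fake_checkpoint` are hypothetical, the underscore used to join prefix and name is a guess, and the real function looks up the caller's variable itself instead of taking a namespace argument.

```python
from typing import Any, Callable, Dict, Optional


def sketch_ckpt(
    namespace: Dict[str, Any],
    bag_name: str,
    checkpoint_fn: Callable[[str, Any], Any],
    ckpt_prefix: Optional[str] = None,
) -> None:
    """Replace namespace[bag_name] with the result of checkpointing it."""
    # Assumed naming rule: prefix and variable name joined by an underscore.
    ckpt_name = bag_name if ckpt_prefix is None else f"{ckpt_prefix}_{bag_name}"
    namespace[bag_name] = checkpoint_fn(ckpt_name, namespace[bag_name])


saved = {}  # records which checkpoint names were written


def fake_checkpoint(name, value):
    # Stand-in for checkpoint(name, bag): store the value, return a load marker.
    saved[name] = value
    return ("loaded", name)


scope = {"sentences": ["a", "b"]}
sketch_ckpt(scope, "sentences", fake_checkpoint, ckpt_prefix="medline")
```

After the call, `scope["sentences"]` no longer holds the original data but the object returned by the checkpoint function, and a second document set could reuse the variable name "sentences" with a different prefix without colliding.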

agatha.construct.checkpoint.clear_all_ckpt()
Return type

None

agatha.construct.checkpoint.clear_ckpt(name)
Return type

None

agatha.construct.checkpoint.clear_halt_point()
Return type

None

agatha.construct.checkpoint.disable()
Return type

None

agatha.construct.checkpoint.enable()
Return type

None

agatha.construct.checkpoint.get_allow_partial()
Return type

bool

agatha.construct.checkpoint.get_checkpoints_like(glob_pattern)
Return type

Set[Path]

agatha.construct.checkpoint.get_done_file_path(name)
Return type

Path

agatha.construct.checkpoint.get_or_make_ckpt_dir(name)
Return type

Path

agatha.construct.checkpoint.get_root()
Return type

Path

agatha.construct.checkpoint.get_verbose()
Return type

bool

agatha.construct.checkpoint.is_ckpt_done(name)
Return type

bool

agatha.construct.checkpoint.set_allow_partial(allow)
Return type

None

agatha.construct.checkpoint.set_halt_point(name)
Return type

None

agatha.construct.checkpoint.set_root(ckpt_root)
Return type

None

agatha.construct.checkpoint.set_verbose(is_verbose)
Return type

None