agatha.util.sqlite3_lookup module

class agatha.util.sqlite3_lookup.Sqlite3Bow(db_path, table_name='sentences', key_column_name='id', value_column_name='bow', **kwargs)

Bases: agatha.util.sqlite3_lookup.Sqlite3LookupTable

For backwards compatibility, Sqlite3Bow allows for alternate default table, key, and value names. However, newer tables following the default Sqlite3LookupTable schema will still work.

class agatha.util.sqlite3_lookup.Sqlite3Graph(db_path, table_name='graph', key_column_name='node', value_column_name='neighbors', **kwargs)

Bases: agatha.util.sqlite3_lookup.Sqlite3LookupTable

For backwards compatibility, Sqlite3Graph allows for alternate default table, key, and value names. However, newer tables following the default Sqlite3LookupTable schema will still work.

class agatha.util.sqlite3_lookup.Sqlite3LookupTable(db_path, table_name='lookup_table', key_column_name='key', value_column_name='value', disable_cache=False)

Bases: object

Dict-like interface for Sqlite3 key-value tables

Assumes that the provided sqlite3 path has a table containing string keys and json-encoded string values. By default, the table name is lookup_table, with columns key and value.

This interface is pickle-able, and provides caching and preloading. Note that instances of this object that are recovered from pickles will _NOT_ retain the preloading or caching information from the original.

Parameters
  • db_path (Path) – The file-system location of the Sqlite3 file.

  • table_name (str) – The sql table name to find within db_path.

  • key_column_name (str) – The string column of table_name. Performance of the Sqlite3LookupTable will depend on whether an index has been created on key_column_name.

  • value_column_name (str) – The json-encoded string column of table_name

  • disable_cache (bool) – If set, objects resulting from json parsing will not be cached.
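The default schema described above can be illustrated with the standard library alone. The following is a minimal sketch, not Agatha's implementation: the helper names build_example_table and lookup are hypothetical, the TEXT column types and the index name are assumptions, and the lookup function only mimics what a dict-like read of a json-encoded value might do.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def build_example_table(db_path: Path, records: dict) -> None:
    """Write `records` into a table that follows the documented default schema."""
    conn = sqlite3.connect(str(db_path))
    try:
        # Default names from the documentation: table lookup_table, columns key and value.
        # TEXT column types are an assumption of this sketch.
        conn.execute("CREATE TABLE lookup_table (key TEXT, value TEXT)")
        conn.executemany(
            "INSERT INTO lookup_table (key, value) VALUES (?, ?)",
            [(key, json.dumps(value)) for key, value in records.items()],
        )
        # An index on the key column keeps point lookups fast (index name is hypothetical).
        conn.execute("CREATE INDEX lookup_table_key_idx ON lookup_table (key)")
        conn.commit()
    finally:
        conn.close()


def lookup(db_path: Path, key: str):
    """Fetch one key and decode its json-encoded value, raising KeyError if absent."""
    conn = sqlite3.connect(str(db_path))
    try:
        row = conn.execute(
            "SELECT value FROM lookup_table WHERE key = ?", (key,)
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        raise KeyError(key)
    return json.loads(row[0])


example_db = Path(tempfile.mkdtemp()) / "example.sqlite3"
build_example_table(example_db, {"a": [1, 2], "b": {"x": "y"}})
print(lookup(example_db, "a"))
```

A file built this way uses the default table and column names, so it should in principle be readable by pointing db_path at it and leaving the other constructor arguments at their defaults.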

clear_cache()

Removes contents of internal cache

Return type

None

connected()

True if the database connection has been made.

Return type

bool

disable_cache()

Disables the use of internal cache

Return type

None

enable_cache()

Enables the use of internal cache

Return type

None

is_preloaded()

True if database has been loaded to memory.

Return type

bool

iterate(where=None)

Returns an iterator over the underlying database. If where is specified, only rows satisfying that condition are returned. Note that, when writing a where clause, the columns are named key and value.
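To show the kind of condition a where clause can express, here is a rough standard-library sketch. The function iterate_rows is hypothetical and is not the library's code; whether the real iterate yields parsed objects or raw strings is not stated here, so decoding the value is an assumption of this sketch.

```python
import json
import sqlite3


def iterate_rows(conn: sqlite3.Connection, where=None):
    """Yield (key, parsed value) pairs, optionally restricted by a SQL condition."""
    query = "SELECT key, value FROM lookup_table"
    if where is not None:
        # The condition is spliced in as raw SQL, so it must come from trusted code.
        query += " WHERE " + where
    for key, value in conn.execute(query):
        yield key, json.loads(value)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup_table (key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO lookup_table VALUES (?, ?)",
    [("s:1", json.dumps([1])), ("s:2", json.dumps([2])), ("e:1", json.dumps([3]))],
)

# The condition refers to the columns by the names key and value.
sentence_rows = list(iterate_rows(conn, where="key LIKE 's:%'"))
all_rows = list(iterate_rows(conn))
```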

keys()

Get all keys from the Sqlite3 Table.

Recalls _all_ keys from the connected database. This operation may be slow or even infeasible for larger tables.

Return type

Set[str]

Returns

The set of all keys from the connected database.

preload()

Copies the database to memory.

This is done by dumping the on-disk contents into RAM, and _does not_ perform any json parsing. This improves performance because subsequent sqlite3 calls no longer have to travel to storage.

Return type

None
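One way to copy an on-disk database into memory without parsing any json is the standard library's sqlite3.Connection.backup (Python 3.7+). The sketch below assumes that approach purely for illustration; load_into_memory is a hypothetical helper, and the actual preload implementation may work differently.

```python
import sqlite3
import tempfile
from pathlib import Path


def load_into_memory(db_path: Path) -> sqlite3.Connection:
    """Return an in-memory connection holding a full copy of the file at db_path."""
    disk_conn = sqlite3.connect(str(db_path))
    memory_conn = sqlite3.connect(":memory:")
    try:
        # Copies the database pages as-is; stored values stay as encoded strings.
        disk_conn.backup(memory_conn)
    finally:
        disk_conn.close()
    return memory_conn


# Build a small on-disk database to copy.
source_path = Path(tempfile.mkdtemp()) / "source.sqlite3"
setup_conn = sqlite3.connect(str(source_path))
setup_conn.execute("CREATE TABLE lookup_table (key TEXT, value TEXT)")
setup_conn.execute("INSERT INTO lookup_table VALUES (?, ?)", ("a", "[1, 2]"))
setup_conn.commit()
setup_conn.close()

memory_conn = load_into_memory(source_path)
# Queries against memory_conn no longer touch the file on disk.
stored = memory_conn.execute(
    "SELECT value FROM lookup_table WHERE key = ?", ("a",)
).fetchone()[0]
```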

agatha.util.sqlite3_lookup.compile_kv_json_dir_to_sqlite3(json_data_dir, result_database_path, agatha_install_path, merge_duplicates, verbose)

Merges all key/value json entries into an indexed sqlite3 table

This function assumes that json_data_dir contains many *.json files. Each file should contain one json object per line. Each object should contain a “key” and a “value” field. This function will use the C++ create_lookup_table utility by executing a subprocess.

Parameters
  • json_data_dir (Path) – The location containing *.json files.

  • result_database_path (Path) – The location to store the result sqlite3 db.

  • agatha_install_path (Path) – The location containing the “tools” directory, where create_lookup_table has been built.

  • merge_duplicates (bool) – The create_lookup_table utility has two modes. If merge_duplicates is False, then we assume there are no key collisions and each value is stored as-is. If True, then we combine values associated with duplicate keys into arrays of unique elements.

  • verbose (bool) – If set, print intermediate output of create_lookup_table.

Return type

None
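The two modes selected by merge_duplicates can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the C++ utility itself: merge_records is a hypothetical name, and details such as element order, wrapping a lone value in a one-element array, and what happens on a collision when merging is off are assumptions.

```python
import json


def merge_records(lines, merge_duplicates: bool) -> dict:
    """Build a key -> value mapping from json lines with "key" and "value" fields."""
    table = {}
    for line in lines:
        record = json.loads(line)
        key, value = record["key"], record["value"]
        if not merge_duplicates:
            # Values are stored as-is; keys are assumed not to collide.
            # (In this sketch a later duplicate would simply overwrite.)
            table[key] = value
            continue
        # Collect the distinct values seen for this key, in first-seen order.
        merged = table.setdefault(key, [])
        if value not in merged:
            merged.append(value)
    return table


example_lines = [
    '{"key": "a", "value": 1}',
    '{"key": "a", "value": 2}',
    '{"key": "a", "value": 1}',
    '{"key": "b", "value": "x"}',
]
merged_table = merge_records(example_lines, merge_duplicates=True)
plain_table = merge_records(example_lines[-1:], merge_duplicates=False)
```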

agatha.util.sqlite3_lookup.create_lookup_table(key_value_records, result_database_path, intermediate_data_dir, agatha_install_path, merge_duplicates=False, verbose=False)

Creates an Sqlite3 table compatible with Sqlite3LookupTable

Each element of the key_value_records bag is converted to json and written to disk. Then, one machine calls the create_lookup_table tool in order to index all records into an Sqlite3LookupTable compatible database. Warning: if used in a distributed setting, the master node will be the one to call the create_lookup_table utility.

Parameters
  • key_value_records – A dask bag containing dicts. Each dict should have a “key” and a “value” field.

  • result_database_path – The location to write the Sqlite3 file.

  • intermediate_data_dir – The location to write intermediate json text files. Warning: if any json files exist beforehand, they will be erased.

  • agatha_install_path – The root of Agatha, wherein the tools directory can be located.

  • merge_duplicates – If set, create_lookup_table will perform the more expensive operation of combining distinct values associated with the same key.

  • verbose – If set, the create_lookup_table utility will print intermediate output.

Return type

None

agatha.util.sqlite3_lookup.export_key_value_records(key_value_records, export_dir)

Converts a Dask bag of Dicts into a collection of json files.

In order to create a lookup table, we must first export all data as json. This function maps each element of the input bag to a json encoded string and writes one file per partition to the export_dir. WARNING: this function will delete any json files already present in export_dir.

Parameters
  • key_value_records (Bag) – A dask bag containing dicts.

  • export_dir (Path) – The location to write json files. Will erase any if present beforehand.

Return type

None
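The export step described above can be approximated without Dask. In this sketch, plain Python lists stand in for the partitions of a bag; export_partitions and the numbered file names are hypothetical and chosen only to show the one-file-per-partition, one-object-per-line layout and the removal of existing json files.

```python
import json
import tempfile
from pathlib import Path


def export_partitions(partitions, export_dir: Path) -> None:
    """Write each partition to its own file, one json object per line."""
    export_dir.mkdir(parents=True, exist_ok=True)
    # Remove json files left over from a previous export.
    for stale in export_dir.glob("*.json"):
        stale.unlink()
    for index, partition in enumerate(partitions):
        out_path = export_dir / f"{index}.json"  # file naming is an assumption
        with out_path.open("w") as out_file:
            for record in partition:
                out_file.write(json.dumps(record) + "\n")


export_dir = Path(tempfile.mkdtemp())
(export_dir / "stale.json").write_text("{}\n")

partitions = [
    [{"key": "a", "value": 1}],
    [{"key": "b", "value": 2}, {"key": "c", "value": 3}],
]
export_partitions(partitions, export_dir)
written = sorted(path.name for path in export_dir.glob("*.json"))
```

A directory laid out this way matches the input that compile_kv_json_dir_to_sqlite3 is documented to expect: many *.json files, each with one “key”/“value” object per line.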