How to Embed the Agatha Semantic Graph¶
We use PyTorch-BigGraph (PTBG) to embed our semantic graph. PTBG is a distributed knowledge-graph embedding tool: it spreads training across multiple machines and takes node and edge types into account when learning embeddings. PTBG is a complex tool that requires a number of preprocessing steps.
PTBG Process Outline¶
1. Create a single directory that contains all semantic graph edges.
   - This is produced by running `agatha.construct`.
   - Edges are stored as small key-value JSON files.
   - The directory may contain a large number of files.
2. Convert the graph to the PTBG input format.
   - PTBG requires that we index and partition all nodes and edges.
   - Look into `tools/convert_graph_for_pytorch_biggraph` for how to do this.
3. Create a PTBG config.
   - Specify all node / edge types.
   - Specify the location of all input files.
   - The parameters of this config must match the options used in the conversion.
4. Launch the PTBG training cluster.
   - Use 10-20 machines; too many will slow this process.
   - Wait at least 5 epochs; this will take days.
5. Index the resulting embeddings for use in Agatha.
   - Agatha needs to know where to find each embedding, given the node name.
   - Use `tools/py_scripts/ptbg_index_embeddings.py` to create a lookup table that maps each node name to its embedding metadata.
Convert Edges to PTBG format¶
The PTBG conversion tool is a multi-threaded, single-machine program that indexes every node and edge of the input graph for PTBG distributed training. The settings used to run this tool determine values in the resulting PTBG config, so save the exact command you run for later steps.
Warning: This program is extremely memory intensive. If you're running on Palmetto, make sure to grab the 1.5 or 2 TB node.
To begin, build the `convert_graph_for_pytorch_biggraph` tool:
cd /path/to/agatha/tools/convert_graph_for_pytorch_biggraph
make
This will produce `graph_to_ptbg`. You can take a look at how this tool works with the `./graph_to_ptbg --help` command.
If you want to embed the entire graph, you can run this conversion with:
./graph_to_ptbg -i <json_edge_dir> -o <ptbg_data_dir>
By default, this will include all expected node and relationship types, as described in the Agatha paper.
If you only want to embed part of the graph, you can select the specific node and relation types to include; edges involving excluded types will be ignored.
To select a subset of nodes, you will need to supply the optional `--types` and `--relations` arguments. Here's an example of using these flags to select only nodes and relationships between UMLS terms (type `m`) and predicates (type `p`):
./graph_to_ptbg \
-i <json_edge_dir> \
-o <ptbg_data_dir> \
--types "mp" \
--relations "mp pm"
Note that the argument passed with `--types` should be a string where each character indicates a desired node type. Nodes of types outside this list will not be included in the output.
The argument passed with `--relations` should be a string of space-separated relationship types, each a two-character string. Relationships are also directed in PTBG, meaning that if you would like to select both `UMLS -> predicate` edges as well as `predicate -> UMLS` edges, you will need to specify both edge types (`mp` and `pm`).
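Because the config in the next step must list exactly the same relation strings, it can help to generate them programmatically. The following sketch is a hypothetical helper (not part of the Agatha codebase): it expands a set of node types into every possible directed two-character relation string, which you can then filter down to the relations that actually occur in your graph and reuse for both the `--relations` flag and the config.
from itertools import product

def directed_relations(node_types: str) -> list:
  """Expand node types (e.g. "mp") into every directed two-character
  relation string (e.g. ["mm", "mp", "pm", "pp"])."""
  return ["".join(pair) for pair in product(sorted(node_types), repeat=2)]

# Example: only UMLS terms (m) and predicates (p), keeping just the
# UMLS<->predicate relations that actually occur in the graph.
types = "mp"
relations = [r for r in directed_relations(types) if r in {"mp", "pm"}]
print("--types", types)
print("--relations", " ".join(relations))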
Create a PTBG Config¶
Now that you have converted the Agatha semantic graph for PTBG, you need to write a configuration script. Here are the official docs for the PTBG config.
The following is an example PTBG config. The parts you need to worry about occur in the header section of the `get_torchbiggraph_config` function. You should copy this and change what you need.
#!/usr/bin/env python3
def get_torchbiggraph_config():

  # CHANGE THESE #########################################################

  DATA_ROOT = "/path/to/data/root"
  """ This is the location you specified with the `-o` flag when running
  `convert_graph_for_pytorch_biggraph`. That tool should have created
  `DATA_ROOT/entities` and `DATA_ROOT/edges`. This process will create
  `DATA_ROOT/embeddings`. """

  PARTS = 100
  """ This is the number of partitions that all nodes and edges have been
  split between when running `convert_graph_for_pytorch_biggraph`. By
  default, we create 100 partitions. If you specified `--partition-count`
  (`-c`), then you need to change this value to reflect the new partition
  count. """

  ENT_TYPES = "selmnp"
  """ This is the set of entity types specified when running
  `convert_graph_for_pytorch_biggraph`. The above value is the default. If
  you used the `--types` flag, then you need to set this value
  accordingly. """

  RELATIONS = [ "ep", "es", "lp", "ls", "mp", "ms", "np", "ns", "pe", "pl",
                "pm", "pn", "ps", "se", "sl", "sm", "sn", "sp", "ss" ]
  """ This is the ordered list of relationships that you specified when
  running `convert_graph_for_pytorch_biggraph`. The above is the default.
  If you specified `--relations` then you need to set this value
  accordingly. """

  EMBEDDING_DIM = 512
  """ This is the number of floats per embedding per node in the resulting
  embedding. """

  NUM_COMPUTE_NODES = 20
  """ This is the number of computers used to compute the embedding. We
  find that around 20 machines is the sweet spot. More or fewer result in
  slower embeddings. """

  THREADS_PER_NODE = 24
  """ This is the number of threads that each machine will use to compute
  embeddings. """

  #########################################################################

  config = dict(
      # IO Paths
      entity_path=DATA_ROOT+"/entities",
      edge_paths=[DATA_ROOT+"/edges"],
      checkpoint_path=DATA_ROOT+"/embeddings",
      # Graph structure
      entities={t: {'num_partitions': PARTS} for t in ENT_TYPES},
      relations=[
          dict(name=rel, lhs=rel[0], rhs=rel[1], operator='translation')
          for rel in sorted(RELATIONS)
      ],
      # Scoring model
      dimension=EMBEDDING_DIM,
      comparator='dot',
      bias=True,
      # Training
      num_epochs=5,
      num_uniform_negs=50,
      loss_fn='softmax',
      lr=0.02,
      # Evaluation during training
      eval_fraction=0,
      # One per allowed thread
      workers=THREADS_PER_NODE,
      num_machines=NUM_COMPUTE_NODES,
      distributed_init_method="env://",
      num_partition_servers=-1,
  )
  return config
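Because a mismatch between the conversion flags and this config is easy to introduce, it can be worth a quick consistency check before launching the cluster. The following sketch is a hypothetical check (not part of PTBG or Agatha) that loads your config script and verifies that every relation is a two-character string whose endpoints are declared entity types; `CONFIG_PATH` is an assumed placeholder.
import importlib.util

# Assumed path to the config script you wrote above.
CONFIG_PATH = "/path/to/config.py"

spec = importlib.util.spec_from_file_location("ptbg_config", CONFIG_PATH)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
config = module.get_torchbiggraph_config()

entity_types = set(config["entities"])
for relation in config["relations"]:
  name = relation["name"]
  assert len(name) == 2, f"Relation {name!r} is not two characters"
  assert relation["lhs"] in entity_types, f"Unknown lhs type in {name!r}"
  assert relation["rhs"] in entity_types, f"Unknown rhs type in {name!r}"
print(f"{len(entity_types)} entity types and "
      f"{len(config['relations'])} relations look consistent")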
Launch the PTBG training cluster¶
Now there is only a little more bookkeeping necessary to launch PTBG distributed training. This next step will look familiar if you've already taken a look at the docs for training the Agatha deep learning model. After all, both techniques use PyTorch distributed.
You need every machine in your compute cluster to have some environment variables set. The official docs on PyTorch distributed environment variables can be found here. These variables are:
- `MASTER_ADDR`: The hostname of the master node. In our case, this is just one of the workers.
- `MASTER_PORT`: An unused port used for communication. This can be almost anything. We use `12910`.
- `NODE_RANK`: The rank of this machine with respect to the "world". If there are 3 total machines, then each should have `NODE_RANK` set to a different value in `[0, 1, 2]`. Note that we use `NODE_RANK` while default PyTorch uses `RANK`; it doesn't really matter as long as you're consistent.
- `WORLD_SIZE`: The total number of machines.
The simplest way to set these variables is to use a nodefile, which is just a file that contains each machine's address. If you're on PBS, you will have a file called `$PBS_NODEFILE`, and if you're on SLURM you will have a variable called `$SLURM_NODELIST`. To remain platform agnostic, we assume you have copied any necessary nodefile to `~/.nodefile`.
You can set all of the necessary variables with this snippet, run on each node. Preferably, this should be somewhere in your `~/.bashrc`.
export NODEFILE="$HOME/.nodefile"
export NODE_RANK=$(grep -n $HOSTNAME $NODEFILE | awk 'BEGIN{FS=":"}{print $1-1}')
export MASTER_ADDR=$(head -1 $NODEFILE)
export MASTER_PORT=12910
export WORLD_SIZE=$(cat $NODEFILE | wc -l)
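If you want to sanity-check the result on each machine before launching training, a short Python check like the following (a hypothetical helper, not part of Agatha) prints the distributed settings and catches an out-of-range rank early.
import os

# Read the variables exported by the snippet above.
master_addr = os.environ["MASTER_ADDR"]
master_port = int(os.environ["MASTER_PORT"])
node_rank = int(os.environ["NODE_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

assert 0 <= node_rank < world_size, "NODE_RANK must be in [0, WORLD_SIZE)"
print(f"Rank {node_rank}/{world_size} will connect to {master_addr}:{master_port}")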
Now that you've set up the appropriate environment variables, you're ready to start training. To launch a training process on each compute node, we're going to use `parallel`.
parallel \
--sshloginfile $NODEFILE \
--ungroup \
--nonall \
"torchbiggraph_train --rank \$NODE_RANK /path/to/config.py"
Warning: It is important to escape the `$` when referencing `$NODE_RANK` in the above parallel command. With the escape (`\$`) we're saying "evaluate `NODE_RANK` on the remote machine". Without the escape we're saying "evaluate `NODE_RANK` on this machine."
Expect graph embedding to take a very long time. Typically, we run for approximately 5 epochs, which is the most we can compute in our normal 72-hour time limit.
Index the Resulting Embeddings¶
Now you should have a directory at `DATA_ROOT/embeddings` that is full of files that look like `embeddings_{type}_{part}.v{epoch}.h5`. Each one represents a matrix where row `i` corresponds to entity `i` of the given type and partition. Therefore, we need to build a small index that helps us reference each embedding given an entity name. For this, we use the `ptbg_index_embeddings.py` tool, which you can find at `agatha/tools/py_scripts/ptbg_index_embeddings.py`.
Like all tools, you can use the `--help` option to get more information. Here's an example of how to run it:
cd /path/to/agatha/tools/py_scripts
./ptbg_index_embeddings.py \
<DATA_ROOT>/entities \
entities.sqlite3
This tool will create an sqlite3 lookup table (compatible with `Sqlite3LookupTable`) that maps each entity to its embedding location. The result will be stored in `entities.sqlite3` (or wherever you specify). The Agatha embedding lookup table will use this, along with the embeddings directory, to rapidly look up vectors for each entity.
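As a rough illustration of what that metadata makes possible, the sketch below reads a single vector back by hand with `h5py`, assuming you already know an entity's type, partition, row index, and checkpoint epoch. The specific values and the `"embeddings"` dataset name are assumptions for illustration; in practice the Agatha embedding lookup table handles this for you.
import h5py

# Assumed values for illustration; they would normally come from the
# entities.sqlite3 lookup table built above.
DATA_ROOT = "/path/to/data/root"
ent_type, part, row, epoch = "m", 17, 12345, 5

# File names follow the embeddings_{type}_{part}.v{epoch}.h5 pattern above.
path = f"{DATA_ROOT}/embeddings/embeddings_{ent_type}_{part}.v{epoch}.h5"
with h5py.File(path, "r") as h5_file:
  # Assumed dataset name; row i holds the vector for entity i of this
  # type and partition.
  vector = h5_file["embeddings"][row]
print(vector.shape)  # (EMBEDDING_DIM,), e.g. (512,)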