Topic Model Queries on the Agatha Semantic Network¶
Our prior work, Moliere, performed hypothesis generation through a
graph-analytic and topic-modeling approach. Occasionally, we would like to run
this same approach using the Agatha topic network. This document describes how
to use the agatha.topic_query module to perform topic-model queries, and how to
interpret your results.
TL;DR¶
This is the recommended way to run the query process. First, create a file
called query.conf and fill it with the following information:
graph_db: "<path to graph.sqlite3>"
bow_db: "<path to sentences.sqlite3>"
topic_model {
  num_topics: 100
}
Look into agatha/topic_query/topic_query_config.proto to get more details on
the TopicQueryConfig specification.
Now you can run queries using the following syntax:
python3 -m agatha.topic_query query.conf \
  --source <source term> \
  --target <target term> \
  --result_path <desired place to put result>
Here is a real-life example of a query:
python3 -m agatha.topic_query configs/query_2020.conf \
  --source l:noun:tobacco \
  --target l:noun:cancer \
  --result_path ./tobacco_cancer.pb
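When many source/target pairs need to be queried, the command above can be assembled programmatically. The following is a minimal sketch rather than part of Agatha: build_query_command and run_queries are hypothetical helper names, and the only flags used are the ones shown above.

```python
import subprocess
from typing import List, Sequence, Tuple


def build_query_command(
    config_path: str, source: str, target: str, result_path: str
) -> List[str]:
    """Assemble the argv list for one topic-model query."""
    return [
        "python3", "-m", "agatha.topic_query", config_path,
        "--source", source,
        "--target", target,
        "--result_path", result_path,
    ]


def run_queries(config_path: str, queries: Sequence[Tuple[str, str, str]]) -> None:
    """Run each (source, target, result_path) query in turn."""
    for source, target, result_path in queries:
        command = build_query_command(config_path, source, target, result_path)
        # check=True raises CalledProcessError if a query exits with an error.
        subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues with node names that contain special characters.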
Viewing Results¶
Once your query finishes, you will have a binary file containing all topic
model information. This is stored in a compressed proto format, which should
enable easy programmatic access to all the components of the query result. You
can view more details on the proto specification at
agatha/topic_query/topic_query_result.proto.
Here’s a short Python script that loads a proto result file for use:
from agatha.topic_query import topic_query_result_pb2
result = topic_query_result_pb2.TopicQueryResult()
with open("<result path>", 'rb') as proto_file:
    result.ParseFromString(proto_file.read())
You now have access to: result.path, result.documents, and result.topics.
If you want to cut to the chase, you can simply print out all proto result details using the following script:
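One way to do this is sketched below; it is not necessarily the script shipped with Agatha. It assumes the agatha package and its generated topic_query_result_pb2 module are importable, and that path, documents, and topics are repeated fields. It also relies on the standard protobuf behaviour that printing a message yields its text format. The names load_result and summarize are hypothetical.

```python
def load_result(result_path):
    """Parse a serialized TopicQueryResult from disk."""
    # Imported lazily so this file can be loaded without agatha installed.
    from agatha.topic_query import topic_query_result_pb2

    result = topic_query_result_pb2.TopicQueryResult()
    with open(result_path, "rb") as proto_file:
        result.ParseFromString(proto_file.read())
    return result


def summarize(result):
    """Count the entries in each top-level field (assumed to be repeated)."""
    return {
        "path": len(result.path),
        "documents": len(result.documents),
        "topics": len(result.topics),
    }
```

Calling print(load_result("<result path>")) then dumps every field of the result in protobuf text format, while summarize gives a quick size check first.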
Running Queries with Node Names¶
In order to run queries, you will need to know the particular node names of the
elements you would like to explore. Nodes of the Agatha network can be explored
by looking at the set of node entities in the graph database. You can explore
these in sqlite3 with the following syntax:
sqlite3 .../graph.sqlite3 \
'select node from graph where node like "%<query term>%" limit 10'
Here’s an actual example:
sqlite3 graph.sqlite3 'select node from graph where node like "%dimentia%" limit 10'
> e:amyotrophic_lateral_sclerosis/parkinsonism_dimentia_complex
> e:dimentia_complex
> e:hiv-associated_dimentia
> e:mild_dimentia
> e:three-dimentianl_(_3d_)
> l:adj:three-dimentianl
> l:noun:dimentia
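The same lookup can be done from Python with the standard sqlite3 module. This is a minimal sketch that assumes only what the command above shows, namely a table called graph with a node column; find_nodes is a hypothetical helper name.

```python
import sqlite3
from typing import List


def find_nodes(conn: sqlite3.Connection, term: str, limit: int = 10) -> List[str]:
    """Return up to `limit` node names containing `term` as a substring."""
    # Bound parameters keep the search term out of the SQL text itself.
    # Note that "%" and "_" inside `term` still act as LIKE wildcards.
    rows = conn.execute(
        "SELECT node FROM graph WHERE node LIKE ? LIMIT ?",
        (f"%{term}%", limit),
    )
    return [row[0] for row in rows]
```

For example, find_nodes(sqlite3.connect("graph.sqlite3"), "dimentia") mirrors the shell query shown above.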
Note that node names follow particular patterns. All valid node names start with
a leading “type” character. These are specified in agatha/util/entity_types.py.
Here are the existing entity types at the time of writing:
ENTITY_TYPE="e"
LEMMA_TYPE="l"
MESH_TERM_TYPE="m"
UMLS_TERM_TYPE="m"
NGRAM_TYPE="n"
PREDICATE_TYPE="p"
SENTENCE_TYPE="s"
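Before launching a query, it can help to check that a term carries one of these type characters. The sketch below is illustrative only: KNOWN_TYPES and node_type are hypothetical names, the descriptions are informal glosses of the constants above, and the ":" separator after the type character is an assumption drawn from examples such as l:noun:cancer.

```python
# Type characters copied from the listing above; MeSH and UMLS terms share "m".
KNOWN_TYPES = {
    "e": "entity",
    "l": "lemma",
    "m": "MeSH or UMLS term",
    "n": "n-gram",
    "p": "predicate",
    "s": "sentence",
}


def node_type(node_name: str) -> str:
    """Return the leading type character of a node name, or raise ValueError."""
    # Assumption: the type character is immediately followed by ":".
    if node_name[:1] not in KNOWN_TYPES or node_name[1:2] != ":":
        raise ValueError(f"not a recognized node name: {node_name!r}")
    return node_name[0]
```

A bare term such as "tobacco" is rejected, which catches the common mistake of passing a plain word instead of a node name like l:noun:tobacco.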
Configuration¶
Just like the Agatha network construction process, the query process needs many parameters, which are specified either through command-line arguments or through a configuration file. We recommend creating a configuration for the typical query case, omitting only the query term parameters. This gives you the simplest query interface when running these topic-model queries yourself.
Look into agatha/topic_query/topic_query_config.proto to get more details on the
TopicQueryConfig specification. Here is a fuller example of a configuration
that we actually use on Palmetto.
# TopicQueryConfig
# source: Omitted
# target: Omitted
# result_path: Omitted
graph_db: "/zfs/safrolab/users/jsybran/agatha/data/releases/2020/graph.sqlite3"
bow_db: "/zfs/safrolab/users/jsybran/agatha/data/releases/2020/sentences.sqlite3"
topic_model {
  num_topics: 20
  min_support_count: 2
  truncate_size: 250
}
# Advanced
max_sentences_per_path_elem: 2000
max_degree: 1000
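To compare several settings, for instance different values of num_topics, the configuration text can be generated rather than edited by hand. This is a minimal sketch using only the fields shown above; render_config is a hypothetical helper and not part of Agatha.

```python
def render_config(graph_db: str, bow_db: str, num_topics: int) -> str:
    """Render a minimal TopicQueryConfig in the text format shown above."""
    # Paths are written verbatim; this sketch does not escape embedded quotes.
    return (
        f'graph_db: "{graph_db}"\n'
        f'bow_db: "{bow_db}"\n'
        "topic_model {\n"
        f"  num_topics: {num_topics}\n"
        "}\n"
    )
```

Writing the returned string to a file such as query.conf gives a configuration that can be passed as the first argument to python3 -m agatha.topic_query.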