1. Creating a Dinf model¶
This page discusses how to construct a minimal simulation-only Dinf model,
with example code for each component. The complete
working model can be found at the bottom of the page,
and is also in the git repository
under examples/bottleneck/model.py
.
Writing a Dinf model requires a familiarity with Python,
and the examples use
demes,
msprime,
and numpy
APIs.
1.1. Overview¶
The primary use of Dinf is to infer simulation parameters that produce data closely resembling some target empirical dataset. In the diagram below, the rectangles represent the major components of a Dinf model, and the arrows indicate the flow of data.
\(\theta\) are the model parameters, such as population sizes, split times, or migration rates.
\(G(\theta)\) is the Generator function that simulates genetic data for concrete values of the model parameters. This is a genetic simulator such as msprime.
Target is the target dataset, which is usually an observed dataset in a VCF or BCF file.
\(x\) indicates data features that are either produced by the generator, or sampled from the target dataset. Features from the generator have distribution \(p_g(x; \theta)\), and features from the target dataset have distribution \(p_t(x)\).
\(D(x)\) is the Discriminator function (nominally a convolutional neural network), which is trained to distinguish between data produced by the generator and data sampled from the target dataset.
\(Pr(x \in \text{Target})\) is the output prediction from the discriminator—the probability that a given input feature \(x\) is from the target distribution, rather than from the generator distribution.
A Dinf model is organised as a Python script (a .py
file)
containing a dinf_model
variable,
which must be an instance of the DinfModel
class.
This object describes various components of the model.
import dinf
dinf_model = dinf.DinfModel(
parameters=parameters,
generator_func=generator,
target_func=None,
)
Defining the model then procedes by describing the parameters
,
writing the generator_func
function,
and writing the target_func
function (if any).
It’s not necessary to specify the architecture of the discriminator.
1.2. Parameters¶
In this example, we’ll have two inferrable parameters,
N0
and N1
, which correspond to the population sizes of a
deme before and after a bottleneck.
DinfModel.parameters
are defined via the Parameters
class,
whose constructor accepts named Param
instances by keyword.
Each keyword is the name of a parameter (N0
and N1
in the code below)
and each parameter has a lower bound (low
) and an upper bound (high
).
Parameters defined in this way have a uniform prior, e.g.
N0
\(\sim\)Uniform(10, 30000).
When performing a simulation study (as we’ll do here), the truth
value
will be used to produce the target dataset instead of using empirical data.
parameters = dinf.Parameters(
N0=dinf.Param(low=10, high=30_000, truth=10_000),
N1=dinf.Param(low=10, high=30_000, truth=200),
)
1.3. Generator¶
The generator is probably the most complicated part of the Dinf model. A genetic simulator must take the parameters and turn them into genetic data. The genetic data must then be transformed to match the expected input of the discriminator. The inputs to the discriminator are referred to as the features of the data, and the transformation of the genetic data into the desired format is called feature extraction.
1.3.1. Features¶
Dinf features are typically genotype matrices,
or a collection of matrices (e.g. when modelling multiple populations).
It’s important that there is agreement between the features extracted from the
generator and the features extracted from the target dataset.
Ideally, the only differences will be due to the accuracy of the simulation
model and the parameter values, not due to data filtering or quality issues.
To ensure consistency, the Dinf API provides helper classes for
feature extraction, which have methods for extracting features from
VCF files and from tskit.TreeSequence
objects.
The Feature matrices page describes the different feature extraction
classes provided by the Dinf API,
and shows heatmap plots for matrices from each class.
In the example model here, we’ll use the BinnedHaplotypeMatrix
feature extractor class.
num_individuals = 16
features = dinf.BinnedHaplotypeMatrix(
num_individuals=num_individuals,
num_loci=64,
ploidy=2,
phased=False,
maf_thresh=0.05,
)
1.3.2. Genetic simulator¶
DinfModel.generator_func
is a user-defined function that returns data
features using a genetic simulator.
The function accepts a single positional argument seed
,
followed by one keyword argument for each inferrable parameter.
So for our example there’s one argument for N0
and one for N1
.
Note that the python syntax def generator(seed, *, N0, N1):
means
that all parameters after the *
are keyword-only parameters.
The generator function must return the features that were extracted
from the simulated data.
There are a lot of choices to be made when simulating genetic data.
In the example below, the bottleneck demographic model is described using
a templated Demes YAML
string.
This demographic model is simulated using the
msprime
coalescent simulator, whose output is a tskit.TreeSequence
.
Finally, the features are extracted from the output using the
BinnedHaplotypeMatrix.from_ts()
method of the features
object
we defined above.
import string
import demes
import msprime
import numpy as np
recombination_rate = 1.25e-8
mutation_rate = 1.25e-8
sequence_length = 1_000_000
def demography(*, N0, N1):
model = string.Template(
"""
description: Two-epoch model with recent bottleneck.
time_units: generations
demes:
- name: A
epochs:
- start_size: $N0
end_time: 100
- start_size: $N1
end_time: 0
"""
).substitute(N0=N0, N1=N1)
return demes.loads(model)
def generator(seed, *, N0, N1):
"""Simulate a two-epoch model with msprime."""
rng = np.random.default_rng(seed)
graph = demography(N0=N0, N1=N1)
demog = msprime.Demography.from_demes(graph)
seed1, seed2 = rng.integers(low=1, high=2**31, size=2)
ts = msprime.sim_ancestry(
samples=num_individuals,
demography=demog,
sequence_length=sequence_length,
recombination_rate=recombination_rate,
random_seed=seed1,
record_provenance=False,
)
ts = msprime.sim_mutations(ts, rate=mutation_rate, random_seed=seed2)
feature_matrix = features.from_ts(ts)
return feature_matrix
Note that Dinf has no explicit requirement to use demes
for the demographic model, nor to use msprime
as the simulator.
It would even be possible to write a generator function with a
simulator that doesn’t output a tskit.TreeSequence
.
Dinf provides discriminator networks
that work well with the provided
feature extraction classes,
so substituting custom features will likely require additional effort
and testing.
1.4. Target¶
When defining the DinfModel
object above,
we set target_func=None
. This means that Dinf will reuse the
generator function to create the target dataset. When simulating the
target dataset, the truth
values specified for the parameters
will be passed to the generator function.
1.5. Discriminator¶
We have not specified anything about the discriminator.
By default Dinf will use ExchangeableCNN
,
a small exchangeable convolutional neural network,
that treats haplotypes within populations as exchangeable.
1.6. Complete example¶
Putting this all together into one file we obtain:
import string
import demes
import msprime
import numpy as np
import dinf
recombination_rate = 1.25e-8
mutation_rate = 1.25e-8
num_individuals = 16
sequence_length = 1_000_000
parameters = dinf.Parameters(
N0=dinf.Param(low=10, high=30_000, truth=10_000),
N1=dinf.Param(low=10, high=30_000, truth=200),
)
def demography(*, N0, N1):
model = string.Template(
"""
description: Two-epoch model with recent bottleneck.
time_units: generations
demes:
- name: A
epochs:
- start_size: $N0
end_time: 100
- start_size: $N1
end_time: 0
"""
).substitute(N0=N0, N1=N1)
return demes.loads(model)
features = dinf.BinnedHaplotypeMatrix(
num_individuals=num_individuals,
num_loci=64,
ploidy=2,
phased=False,
maf_thresh=0.05,
)
def generator(seed, *, N0, N1):
"""Simulate a two-epoch model with msprime."""
rng = np.random.default_rng(seed)
graph = demography(N0=N0, N1=N1)
demog = msprime.Demography.from_demes(graph)
seed1, seed2 = rng.integers(low=1, high=2**31, size=2)
ts = msprime.sim_ancestry(
samples=num_individuals,
demography=demog,
sequence_length=sequence_length,
recombination_rate=recombination_rate,
random_seed=seed1,
record_provenance=False,
)
ts = msprime.sim_mutations(ts, rate=mutation_rate, random_seed=seed2)
feature_matrix = features.from_ts(ts)
return feature_matrix
dinf_model = dinf.DinfModel(
target_func=None,
generator_func=generator,
parameters=parameters,
)