3. Inferring parameters
This page shows how to infer parameters using adversarial Monte Carlo.
We’ll continue to use the two-parameter bottleneck model from the Creating a Dinf model page.
Results will be plotted with the dinf-plot command, exported as tab-separated values with the dinf-tabulate command, and loaded back into Python with the dinf.load_results() function.
The results below were obtained in under 10 minutes on an 8-core laptop, with CPU-only training for the neural network. Note that models with more parameters may need substantially more resources.
3.1. Adversarial Monte Carlo
Using the dinf mc subcommand, we’ll do 20 iterations of Monte Carlo simulations, with 1000 training replicates and 1000 proposal replicates in each iteration. Each iteration will:

- train the discriminator on 500 samples from the generator (with parameter values sampled from the sampling distribution) and 500 samples from the target dataset (using the truth parameter values, as this is a simulation-only model),
- sample 1000 proposals from the generator (with parameter values taken from the sampling distribution),
- apply the discriminator to the proposals to obtain predictions of how well they match the target dataset,
- approximate the discriminator output as a function of the model parameters, using a weighted kernel density estimate of the proposals, where the weights are given by the discriminator predictions (see the sketch below), and
- set the sampling distribution for the next iteration to the approximate distribution that was just constructed.
We’ll also evaluate the discriminator on 1000 test replicates (500 from each of the prior distribution and the target dataset). This test dataset is constructed only once, not in each iteration like the training or proposal datasets.
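To build intuition for the weighted KDE in the last two steps above, here is a minimal sketch using scipy’s gaussian_kde, with made-up toy arrays. This is not Dinf’s implementation (which, for example, also respects the parameter bounds).

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1234)

# Hypothetical toy data: 1000 two-parameter proposals, plus made-up
# discriminator predictions in [0, 1].
thetas = rng.uniform(low=[100, 10], high=[30000, 3000], size=(1000, 2))
preds = rng.uniform(size=1000)

# Weight each proposal by its discriminator prediction.
kde = gaussian_kde(thetas.T, weights=preds)

# The next iteration would sample its parameters from this distribution.
next_thetas = kde.resample(size=1000).T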
%%bash
rm -fr out/bottleneck-mc
mkdir -p out/bottleneck-mc
dinf mc \
--seed 1234 \
--epochs 5 \
--iterations 20 \
--training-replicates 1000 \
--test-replicates 1000 \
--proposal-replicates 1000 \
--model ../../examples/bottleneck/model.py \
--output-folder out/bottleneck-mc
3.2. The output folder
Once complete, the output folder will contain a sequence of numbered subfolders, one for each iteration.
%%bash
ls out/bottleneck-mc
0
1
10
11
12
13
14
15
16
17
18
19
2
3
4
5
6
7
8
9
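Note that ls has listed the folder names lexicographically (0, 1, 10, 11, …, 9), not numerically. To find the final iteration programmatically, sort the names numerically instead. A minimal sketch, assuming the output folder layout shown above:

from pathlib import Path

# Sort the numbered iteration folders numerically, not lexicographically.
folders = sorted(Path("out/bottleneck-mc").iterdir(), key=lambda p: int(p.name))
print(folders[-1])  # out/bottleneck-mc/19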
Inside each of these subfolders are two files: (1) a trained neural network (the .nn file), and (2) a data file (the .npz file) containing the parameter proposals and their corresponding discriminator predictions.
%%bash
ls out/bottleneck-mc/19
data.npz
discriminator.nn
3.3. Plotting results
Let’s plot the results using dinf-plot with the gan subcommand. This subcommand produces many figures, including diagnostics and results. Below, we specify an .svg file as the --output-file, as this is well suited to viewing in a web browser. Note that although we specified only one file, many .svg files are produced, with names based on the given filename. When working from the command line, it may be more convenient to specify a .pdf as the output file, which will produce a single multipage pdf file. Alternatively, omitting the --output-file option will open each figure in turn in an interactive window.
%%bash
dinf-plot gan \
--model ../../examples/bottleneck/model.py \
--output-file /tmp/bottleneck-mc.svg \
out/bottleneck-mc
First, we’ll look at the discriminator training metrics in each iteration.
from IPython.display import SVG, display
display(SVG(filename="/tmp/bottleneck-mc_metrics.svg"))
In any given iteration, we see that the training accuracy increases over the 5 training epochs. We also see that the training accuracy declines as the iterations progress, suggesting that our proposed model parameters produce data that is increasingly difficult for the discriminator to distinguish from the target dataset. Interestingly, the accuracy on the test dataset increases for the first 10 or so iterations, then begins to decrease.
Next, we’ll look at the entropy of the discriminator predictions for the proposals, relative to uniformity. This measures how much the sampling distributions are changing from one iteration to the next, where a value of 1.0 indicates no change.
display(SVG(filename="/tmp/bottleneck-mc_entropy.svg"))
From this, we see that after iteration 12, the distributions are not changing much. The changes we observe after this point may just be noise due to resampling.
Another way to see how well the adversarial training has performed is to look at the distribution of the discriminator predictions themselves. Recall that the discriminator outputs a value between 0 and 1, with values near 0 meaning the discriminator predicts that the sample came from the generator, and values near 1 meaning it predicts that the sample came from the target dataset. If the discriminator cannot distinguish generated data from the target dataset at all, then it should always guess 0.5. Below are violin plots of the discriminator outputs in each iteration. Each blue blob is a kernel density estimate of the discriminator output for that iteration, with dark blue marks indicating the 2.5%, 50%, and 97.5% quantiles.
display(SVG(filename="/tmp/bottleneck-mc__Pr.svg"))
In early iterations, we see that most of the discriminator predictions are near 0, but from around iteration 12 onward, the discriminator is making many more non-zero predictions. The discriminator doesn’t end up predicting 0.5 for every sample, but this is likely due to bias from training it on a finite (and small) training dataset.
Let’s now look at some results. Below are violin plots of the proposals, with one figure produced for each of the parameters (N0 and N1 for this model). Recall that the proposals are drawn from the sampling distribution, so these figures show the marginal sampling distributions for each iteration.
display(SVG(filename="/tmp/bottleneck-mc_N0.svg"))
display(SVG(filename="/tmp/bottleneck-mc_N1.svg"))
The sampling distributions for the first iteration reflect the prior distribution, which is uniformly distributed between the upper and lower bounds. By around iteration 12, the sampling distributions are close to convergence, and they are concentrated around the truth values (indicated by the red horizontal lines).
To look more closely at the final marginal sampling distributions, we’ll use dinf-plot with the hist subcommand. We’ve specified the --weighted option here, which weights the histogram and KDE by the discriminator predictions. Thus, we’re plotting the sampling distributions for what would become the hypothetical next iteration. These figures show a marginal histogram (the dark blue lines), a marginal KDE (the light blue lines), the 2.5%, 50%, and 97.5% quantiles (the horizontal black line with whiskers), and the truth value (the vertical red line).
%%bash
dinf-plot hist \
--weighted \
--kde \
--model ../../examples/bottleneck/model.py \
--output-file /tmp/bottleneck-mc_hist.svg \
out/bottleneck-mc/19/data.npz
display(SVG(filename="/tmp/bottleneck-mc_hist_N0.svg"))
display(SVG(filename="/tmp/bottleneck-mc_hist_N1.svg"))
It looks like the marginal sampling distributions are roughly centered on the truth values, and the 50% quantile (the median) provides a reasonable point estimate for the parameter values.
So far, we’ve only looked at the marginal distributions, i.e. we’ve looked at N0 and N1 separately. But what about the joint sampling distribution of N0 and N1? Do N0 and N1 covary? We can visualise two parameters jointly using dinf-plot with the hist2d subcommand, which produces a 2d histogram. Note that if the model contains more than two parameters, 2d histograms will be plotted for all unique pairs of parameters.
%%bash
dinf-plot hist2d \
--weighted \
--model ../../examples/bottleneck/model.py \
--output-file /tmp/bottleneck-mc_hist2d.svg \
out/bottleneck-mc/19/data.npz
In the 2d histogram below, darker squares indicate a larger density of samples in any given histogram bin. Red lines indicate the truth values for each axis. Note that in two dimensions the data are more sparse than in one dimension, so sampling noise can appear more pronounced.
display(SVG(filename="/tmp/bottleneck-mc_hist2d.svg"))
From the 2d histogram, it looks like there might be a small negative correlation between N0 and N1 in our sampling distribution. However, we would need to do a proper statistical test to confirm if this is indeed the case.
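As an aside, a similar weighted 2d histogram can be drawn directly with matplotlib. This is a minimal sketch (assuming matplotlib is installed) that uses dinf.load_results(), described in a later section, rather than Dinf’s plotting code.

import matplotlib.pyplot as plt
import dinf

data = dinf.load_results("out/bottleneck-mc/19/data.npz")
fig, ax = plt.subplots()
# Weight each (N0, N1) pair by its discriminator prediction.
ax.hist2d(data["N0"], data["N1"], bins=30, weights=data["_Pr"], cmap="Blues")
ax.set_xlabel("N0")
ax.set_ylabel("N1")
fig.savefig("/tmp/bottleneck-mc_hist2d_mpl.svg")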
3.4. Exporting results
The plots above have all been produced from the files contained in the output folder. But what if we want to do some further analysis using another programming language such as R? Or simply copy and paste the numbers that are shown in the plots? For these reasons, there is a dinf-tabulate command which can be used to export information as tab-separated values. We highlight two useful dinf-tabulate subcommands below.
To export the data, we can use the dinf-tabulate data subcommand. This outputs tab-separated columns of the proposals and predictions from a given iteration, which can be easily loaded into R using its builtin read.delim() function. Below, we’ll pipe the output to head to look at the first 10 lines of the data (i.e., the header row plus the first 9 proposals). The special column _Pr holds the discriminator prediction, while the other columns (N0 and N1 for the current model) hold the parameter values that were sampled for a given proposal.
%%bash
dinf-tabulate data out/bottleneck-mc/19/data.npz | head
_Pr N0 N1
5.023707151412963867e-01 9.532425493668109993e+03 2.012924025355642925e+02
4.258708059787750244e-01 1.033074900540984709e+04 2.911974441341429838e+02
3.749374449253082275e-01 1.021407472189349755e+04 4.541637375653106687e+02
3.892780244350433350e-01 1.065375948935513043e+04 2.672710234594799203e+02
5.223942399024963379e-01 1.106780616309789366e+04 1.606729152186022702e+02
1.232456415891647339e-01 1.065732058832546136e+04 1.739923551630824647e+02
2.221430391073226929e-01 1.017977344376559631e+04 1.254285065270190103e+02
8.898271918296813965e-01 1.063855603698634877e+04 2.234555748498159460e+02
9.749054163694381714e-02 9.424131493141929241e+03 4.804094953561136094e+02
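The same TSV stream can of course be consumed from Python too. For example, here is a sketch that parses it with pandas (assuming pandas is installed; any TSV reader would do).

import io
import subprocess

import pandas as pd

# Capture the dinf-tabulate output and parse the tab-separated values.
tsv = subprocess.run(
    ["dinf-tabulate", "data", "out/bottleneck-mc/19/data.npz"],
    check=True, capture_output=True, text=True,
).stdout
df = pd.read_csv(io.StringIO(tsv), sep="\t")
df.head()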
While we could use this exported data to calculate the quantiles shown in the marginal histograms, the dinf-tabulate quantiles subcommand already provides this functionality.
%%bash
dinf-tabulate quantiles \
--weighted \
out/bottleneck-mc/19/data.npz
Param 0.025 0.5 0.975
N0 8555.312310257525 10107.844498815852 11599.36477673148
N1 80.33750036679618 202.55073615722165 418.6929774611843
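For reference, these weighted quantiles can be approximated in Python with a small helper. This is a sketch based on interpolating the weighted empirical CDF (Dinf’s own calculation may differ in its details), and it uses dinf.load_results(), described in the next section.

import numpy as np
import dinf

def weighted_quantile(x, weights, q):
    # Interpolate the weighted empirical CDF at the requested quantiles.
    order = np.argsort(x)
    x, weights = x[order], weights[order]
    cdf = np.cumsum(weights) / np.sum(weights)
    return np.interp(q, cdf, x)

data = dinf.load_results("out/bottleneck-mc/19/data.npz")
for name in ("N0", "N1"):
    print(name, weighted_quantile(data[name], data["_Pr"], [0.025, 0.5, 0.975]))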
3.5. Loading results into Python
Instead of exporting the results, we can directly load an .npz file (the proposals and predictions) back into Python for further analysis using the dinf.load_results() function. Below, we’ll load the results for the final iteration.
import dinf
# Load data as a numpy structured array.
data = dinf.load_results("out/bottleneck-mc/19/data.npz")
# Show the column names.
data.dtype.names
('_Pr', 'N0', 'N1')
# Select the first 5 proposals.
data[:5]
array([(0.5023707 , 9532.42549367, 201.29240254),
(0.4258708 , 10330.74900541, 291.19744413),
(0.37493744, 10214.07472189, 454.16373757),
(0.38927802, 10653.75948936, 267.27102346),
(0.52239424, 11067.8061631 , 160.67291522)],
dtype=[('_Pr', '<f4'), ('N0', '<f8'), ('N1', '<f8')])
The data variable is a numpy structured array where each row corresponds to one proposal. Rows are accessed using integer indexes (including slice notation like :5), while columns are accessed using string indexes that match the desired column name. The columns are _Pr, which holds the discriminator predictions, plus one column for each model parameter (N0 and N1 for the current model).
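For example:

# Column access by name; row access by integer index or slice.
data["_Pr"][:3]   # discriminator predictions for the first 3 proposals
data[0]["N0"]     # the N0 value of the first proposal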
Recall that in the 2d histogram we observed what seemed to be a correlation between N0 and N1. Let’s quantify that correlation.
import numpy as np
# Calculate the covariance matrix between variables N0 and N1,
# weighted by the discriminator predictions. The predictions
# are in the special data column named '_Pr'.
cov = np.cov(data["N0"], data["N1"], aweights=data["_Pr"])
# Calculate the Pearson correlation coefficient.
r = cov[0,1] / np.sqrt(cov[0,0] * cov[1,1])
r
-0.25164680901678693
This shows that there is a negative correlation between N0 and N1. Let’s do 1000 bootstrap replicates over the data to get confidence intervals on the correlation coefficient r.
rng = np.random.default_rng(1234)
rs = []
for _ in range(1000):
# Resample the data, with replacement.
boot = rng.choice(data, size=len(data))
# Calculate the (weighted) covariance matrix for this bootstrap replicate.
cov = np.cov(boot["N0"], boot["N1"], aweights=boot["_Pr"])
# Calculate the Pearson correlation coefficient.
r = cov[0,1] / np.sqrt(cov[0,0] * cov[1,1])
rs.append(r)
# 2.5%, 50%, and 97.5% quantiles.
np.quantile(rs, q=[0.025, 0.5, 0.975])
array([-0.31805837, -0.25133503, -0.17809329])
We can now be confident in concluding that there is a correlation between N0 and N1. Recall that we previously considered the median value for each parameter to be a reasonable point estimate. But in light of this correlation, it might be prudent to use a summary of the data that considers N0 and N1 jointly.
One option is to take the proposal with the largest discriminator prediction.
i = np.argmax(data["_Pr"])
data[i]
(0.9925424, 9702.84947226, 147.11742045)
However, our sample size is quite small (1000 proposals), so this “best” proposal may be quite noisy. An appealing alternative is to use the geometric median, which generalises the median to more than one dimension. This is not a trivial calculation, so we provide the dinf.Parameters.geometric_median() function for this as part of the API.
from numpy.lib.recfunctions import structured_to_unstructured
# Convert the model parameters to a regular 2d numpy array.
thetas = structured_to_unstructured(data[["N0", "N1"]])
# Find the point in parameter space that minimises the weighted
# sum of Euclidean distances to the proposals.
dinf.Parameters.geometric_median(thetas, probs=data["_Pr"])
array([10113.25590158, 203.16706071])
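For intuition, the weighted geometric median can also be approximated with a general-purpose optimiser. The following is a rough sketch (not Dinf’s implementation), reusing the thetas and data variables from the cells above.

from scipy.optimize import minimize

# The weighted geometric median minimises the weighted sum of
# Euclidean distances to the proposals.
def objective(point):
    return np.sum(data["_Pr"] * np.linalg.norm(thetas - point, axis=1))

result = minimize(objective, x0=thetas.mean(axis=0))
result.x  # should land close to the value computed above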
3.6. Using the sampling distribution
Finally, we note that a point estimate may not always be the desired summary of the inference process. For instance, parameter inference with Dinf may be merely a first step towards doing more realistic simulations for some other analysis. In this case, an arguably better approach is to draw the simulation parameters from the sampling distribution inferred in the final iteration. The following code prints 10 draws with distinct parameter values.
# Load the model file.
dinf_model = dinf.DinfModel.from_file("../../examples/bottleneck/model.py")
# Draw 10 samples from the weighted KDE of the final iteration.
rng = np.random.default_rng(314159)
new_thetas = dinf_model.parameters.sample_kde(
thetas, probs=data["_Pr"], size=10, rng=rng,
)
# Print the drawn parameters.
for seed, theta in enumerate(new_thetas):
param_dict = {
param_name: value
for param_name, value in zip(dinf_model.parameters, theta)
}
print(seed, param_dict)
# Supposing that the model's `generator_func()` was adapted for
# a new purpose, e.g. to save the tree sequence to a file,
# and renamed to `sim()`, then calling `sim(seed, **param_dict)`
# would call this function with the sampled parameters.
0 {'N0': 8655.93341451247, 'N1': 231.17159307323115}
1 {'N0': 11022.533817932557, 'N1': 133.19436404301533}
2 {'N0': 8843.393700551951, 'N1': 290.60902521498645}
3 {'N0': 9667.363913212599, 'N1': 400.0837457897171}
4 {'N0': 10016.122770325122, 'N1': 375.2437263743045}
5 {'N0': 9559.53429953312, 'N1': 187.32153220521639}
6 {'N0': 11676.488012544825, 'N1': 232.54576449422353}
7 {'N0': 9585.974626142555, 'N1': 173.0333239314761}
8 {'N0': 10249.245868916712, 'N1': 153.1319379190742}
9 {'N0': 10744.694073692921, 'N1': 154.72931025906146}