This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
using Plots, Distributions, StatsPlots
Plots.pyplot()
Plots.PyPlotBackend()
Introduction
Traditional Bayesian methods require a fully specified statistical model so that a likelihood can be calculated. In particular, this requires specifying the distribution of the error terms. In IO, and economics more broadly, many applied models avoid placing distributional assumptions on error terms, and instead use GMM for estimation.
There are two ways to convert a GMM model into something suitable for Bayesian methods. One is to simply add distributional assumptions to the error terms; this approach has been applied to random coefficients demand models.
A second approach is the Quasi-Bayesian one, in which the GMM objective function is simply used in place of the likelihood. We will look at it in some detail.
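Concretely, if $g_i(\theta)$ denotes the moment conditions, $W$ a weighting matrix, and $\pi(\theta)$ a prior, the quasi-posterior is proportional to the exponentiated (negative) GMM objective times the prior,

$$ p_n(\theta) \propto \exp\left\{ -\frac{n}{2} \left(\frac{1}{n}\sum_{i=1}^n g_i(\theta)\right)' W \left(\frac{1}{n}\sum_{i=1}^n g_i(\theta)\right) \right\} \pi(\theta). $$

This is what the quasi-Bayesian code later in these notes computes.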
Random Coefficients IV Logit
As a working example, we'll use a random coefficients multinomial logit with endogeneity. To keep things simple for illustration, we'll assume that there's a single endogenous variable, $x \in \mathbb{R}^J$, and market shares are given by

$$ s_j = s_j^*(x, \xi; \beta, \sigma) = \int \frac{e^{x_j \beta + \xi_j + \sigma x_j \nu}}{1 + \sum_{k=1}^J e^{x_k \beta + \xi_k + \sigma x_k \nu}} d\Phi(\nu), \qquad j = 1, \ldots, J, $$

where $\xi \in \mathbb{R}^J$ are unobserved demand shocks and $\Phi$ is the standard normal CDF. The moment condition will be

$$ \Er\left[ \xi z' \right] = 0, $$

where $z \in \mathbb{R}^J$ are instruments.
Fully specified likelihood
One way to estimate the model is to fully specify a parametric likelihood. In the simulated data, we know the correct likelihood, so the results we get will be about as good as we can possibly hope to get. Compared to GMM, the full likelihood approach here assumes that (conditional on the instruments, $z$), the market demand shocks are normally distributed,

$$ \xi | z \sim N(0, \Omega). $$

The endogenous variable, $x$, has a normal first stage,

$$ x = \Pi z + \Xi \xi + u, \qquad u \sim N(0, \Sigma_x), $$

and to keep the number of parameters small, we'll impose $\Pi=\pi I_J$ and $\Xi=\rho I_J$.
Finally, we'll assume the observed market shares $s_j$ come from $M$ draws from a Multinomial distribution,

$$ M s \sim \text{Multinomial}\left(M, s^*(x, \xi; \beta, \sigma)\right), $$

where $s^*()$ is the share function implied by the random coefficients model. This approach removes the need to solve for $\xi$ as a function of $s$ and the other parameters. However, it has the downsides of needing to know $M$, and needing to sample $\xi$ along with the parameters to generate the posterior.
We implement the above model in Turing.jl. This gives a convenient way to both simulate the model and compute the posterior. However, it has the downside of putting an extra layer of abstraction between us and the core computations. This ends up costing us a bit of computation time, and arguably making the posterior calculation harder to debug and extend.
using Turing, FastGaussQuadrature, LinearAlgebra, JLD2, NNlib
@model rcivlogit(x=missing, z=missing, s=missing,
                 M=100, ν=gausshermite(12),
                 param=missing, N=10, J=1, ::Type{T}=Float64) where {T <: Real} =
begin
    if x === missing
        @assert s === missing && z === missing
        x = Matrix{T}(undef, J, N)
        s = Matrix{Int64}(undef, J+1, N)
        z = randn(J, N)
    end
    J, N = size(x)
    ξ = Matrix{T}(undef, J, N)
    if !ismissing(param)
        # parameters supplied explicitly (used when simulating data)
        β = param.β
        σ = param.σ
        Σx = param.Σx
        π = param.π
        ρ = param.ρ
        Ω = param.Ω
    else
        # Priors
        β ~ Normal(0, 20.0)
        σ ~ truncated(Normal(1.0, 10.0), 0, Inf)
        ρ ~ Normal(0, 20.0)
        Σx ~ Wishart(J+2, diagm(ones(J))*1/(2*0.001))
        Ω ~ Wishart(J+2, diagm(ones(J))*1/(2*0.001))
        π ~ Normal(0.0, 20.0)
    end
    # The likelihood
    ξ .~ MvNormal(zeros(J), Symmetric(Ω))
    μ = zeros(typeof(x[1][1]*β), J + 1, length(ν[1]))
    for i in 1:N
        # normal first stage for the endogenous variable
        x[:,i] ~ MvNormal(π*z[:,i] + ρ*ξ[:,i], Symmetric(Σx))
        # integrate out the random coefficient with Gauss-Hermite quadrature
        μ[1:J,:] .= x[:,i]*β + ξ[:,i] .+ x[:,i]*σ*sqrt(2)*ν[1]'
        μ = softmax(μ, dims=1)
        p = (μ*ν[2])/sqrt(Base.π)
        # observed shares are M multinomial draws with probabilities p
        s[:,i] ~ Multinomial(M, p)
    end
    return(x=x, z=z, s=s, ξ=ξ)
end
# some parameters for simulating data
J = 2
pm = (β=-1.0, σ=0.1, ρ=0.5, Σx=diagm(ones(J)), Ω=I+0.5*ones(J,J), π=1.0)
N = 20
M = 100
# simulate the data
data=rcivlogit(missing, missing ,missing, M, gausshermite(3), pm, N, J, Float64)();
# not sure why x, s, z are Array{Any}, but it will cause problems later, so fix it
data = (x=Float64.(data.x), s=Int64.(data.s), z=Float64.(data.z), ξ=Float64.(data.ξ))
(x = [-0.8291668656715948 0.8114037841724092 … -2.493005300292991 -1.73538882102078; -1.9459132449679515 -0.5991443951465847 … -1.5243336512421664 1.427601236461319],
 s = [25 47 … 54 62; 52 32 … 24 7; 23 21 … 22 31],
 z = [-0.16462757602447928 -1.02062940510034 … 0.20135486439450068 -1.374768780860035; -0.21948746864078345 0.7966223359730757 … -0.05976092739945117 0.4069354101999431],
 ξ = [-0.949184612573837 2.270710162643411 … -1.2875415745627992 -0.7480245491443309; -0.773157795731879 0.4941916451642972 … -1.3757647826041028 0.6742938892206334])
Some remarks on the code:

- When a Turing model is passed missing arguments, it samples them from the specified distributions. When the arguments are not missing, they're treated as data and held fixed while calculating the posterior.

- We use Gauss-Hermite quadrature to integrate out the random coefficient. Three integration points will not calculate the integral very accurately, but since we use the same integration approach during estimation, it will work out well. (A short check of the quadrature convention follows these remarks.)

- The Wishart prior distributions for the covariance matrices are not entirely standard. The inverse Wishart distribution is the conjugate prior for the covariance matrix of a Normal distribution, and is a common choice. However, when using HMC for sampling, conjugacy does not matter. The modern view is that the inverse Wishart puts too much weight on covariance matrices with high correlation, and other priors are preferred. The Wishart prior used here helps to avoid regions with degenerate covariance matrices (which were causing numeric problems during the tuning stages of NUTS with other priors, like the LKJ distribution).
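To make the quadrature convention concrete, here is a quick illustrative check (not part of the model code) that for $\nu \sim N(0,1)$, $E[f(\nu)] \approx \frac{1}{\sqrt{\pi}}\sum_k \omega_k f(\sqrt{2} a_k)$, where $(a, \omega) = $ gausshermite(n). This is the same change of variables used in the model above.
# illustrative check of the Gauss-Hermite change of variables used above
a, ω = gausshermite(12)
f(v) = exp(0.1*v)                      # an arbitrary smooth test function
approx = sum(ω .* f.(sqrt(2) .* a)) / sqrt(Base.π)
exact = exp(0.1^2/2)                   # E[exp(tν)] = exp(t²/2) for ν ~ N(0,1)
approx ≈ exact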
Results
We can sample from the posterior with the following code. It takes some time to run.
model = rcivlogit(data.x, data.z, data.s, M, gausshermite(3), missing, N,J)
chain = Turing.sample(model, NUTS(0.65), 1000, progress=true, verbose=true)
dir = @__DIR__
JLD2.@save "$dir/turing.jld2" chain model data
Let’s look at the posteriors. The chain also contains posteriors for all $JN$ values of $\xi$, which are not going to be displayed.
dir = @__DIR__
JLD2.@load "$dir/turing.jld2" chain model data
chkeys = Symbol[]
for k in keys(pm)
    for ck ∈ filter(x->occursin(Regex(String(k)), String(x)), keys(chain))
        push!(chkeys, ck)
    end
end
display(plot(chain[chkeys]))
display(describe(chain[chkeys])[1])
Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64

           β   -1.2436    0.4066     0.0129    0.0291   187.0492    0.9991
           σ    0.6761    0.3108     0.0098    0.0206   247.3818    0.9991
           ρ    0.4569    0.3015     0.0095    0.0179   244.5416    0.9992
     Σx[1,1]    1.4559    0.8060     0.0255    0.0398   408.6881    1.0019
     Σx[2,1]    0.5549    0.5705     0.0180    0.0246   507.8836    0.9991
     Σx[1,2]    0.5549    0.5705     0.0180    0.0246   507.8836    0.9991
     Σx[2,2]    2.3462    1.2436     0.0393    0.0593   363.2574    0.9990
      Ω[1,1]    2.6946    1.7484     0.0553    0.0960   295.5635    0.9990
      Ω[2,1]    1.2312    1.0636     0.0336    0.0477   443.6418    0.9990
      Ω[1,2]    1.2312    1.0636     0.0336    0.0477   443.6418    0.9990
      Ω[2,2]    3.1023    2.2316     0.0706    0.1431   175.3450    0.9991
           π    0.8452    0.3464     0.0110    0.0157   380.0702    0.9991

(1 column omitted)
We can see that the chain mixed pretty well. The posterior means of the covariance matrices are somewhat higher than the true values, but the posteriors of the other parameters appear centered on their true values. Poorly estimated covariances are typical for this model (and many others), regardless of estimation method.
DynamicHMC Implementation
DynamicHMC.jl is a useful alternative to Turing.jl. With DynamicHMC, you must write a function to evaluate the log posterior. This is arguably slightly more work than Turing’s modeling language (although it avoids needing to learn Turing’s language), but it also gives you more control over the code. This tends to result in more efficient code.
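To see the workflow in isolation, here is a minimal sketch (not part of the original code, and assuming the same DynamicHMC and TransformVariables API used below) that samples the posterior of the mean of a normal sample. The full model below follows exactly the same steps: a log posterior function, a transformation, an AD gradient wrapper, and the sampler.
# minimal DynamicHMC sketch: posterior for μ with y ~ N(μ, 1) and μ ~ N(0, 10)
using DynamicHMC, LogDensityProblems, TransformVariables, Distributions, Random, ForwardDiff
y = randn(50) .+ 2.0
logpost(θ) = loglikelihood(Normal(θ.μ, 1.0), y) + logpdf(Normal(0, 10), θ.μ)
ttoy = as((μ = asℝ,))                    # identity transformation for μ
Ptoy = TransformedLogDensity(ttoy, logpost)
∇Ptoy = ADgradient(:ForwardDiff, Ptoy)
restoy = DynamicHMC.mcmc_keep_warmup(Random.GLOBAL_RNG, ∇Ptoy, 1000)
mean(θ.μ for θ in transform.(ttoy, restoy.inference.chain))   # ≈ mean(y)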
using DynamicHMC, Random, LogDensityProblems, Parameters, TransformVariables, Distributions, MCMCChains
defaultpriors = (β = Normal(0, 20),
                 σ = truncated(Normal(0, 10), 0, Inf),
                 π = Normal(0, 20),
                 ρ = Normal(0, 20),
                 Σx = Wishart(J+2, diagm(ones(J))*1/(2*0.001)),
                 Ω  = Wishart(J+2, diagm(ones(J))*1/(2*0.001)))
# build an expression summing the log prior density of every parameter in defaultpriors
macro evalpriors(priors)
    first = true
    expr = :(1+1)
    for k in keys(defaultpriors)
        if first
            expr = :(logpdf($(priors).$k, $k))
            first = false
        else
            expr = :($expr + logpdf($(priors).$k, $k))
        end
    end
    return(esc(expr))
end
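The macro splices in one logpdf term per entry of defaultpriors. For the priors above, the generated expression is (roughly) logpdf(d.priors.β, β) + logpdf(d.priors.σ, σ) + ... + logpdf(d.priors.Ω, Ω), which can be verified with @macroexpand:
# illustrative: inspect the expression generated by the macro
@macroexpand @evalpriors(d.priors)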
# We define a struct to hold the data relevant to the model
struct RCIVLogit{T,Stype}
    x::Matrix{T}
    z::Matrix{T}
    s::Matrix{Stype}
    N::Int64
    J::Int64
    ν::typeof(gausshermite(1))
    priors::typeof(defaultpriors)
end
# evaluate the log posterior
function (d::RCIVLogit)(θ)
    @unpack ξ, β, σ, π, ρ, sξ, uξ, sx, ux = θ
    @unpack x, z, s, N, J, ν = d
    T = typeof(d.x[1,1]*β)
    # covariance matrices built from log scales (sx, sξ) and Cholesky factors (ux, uξ)
    Σx = Symmetric(ux'*Diagonal(exp.(sx).^2)*ux)
    Ω  = Symmetric(uξ'*Diagonal(exp.(sξ).^2)*uξ)
    # priors
    logp = @evalpriors(d.priors)
    # loglikelihood
    logp += loglikelihood(MvNormal(zeros(J), Ω), ξ)
    logp += loglikelihood(MvNormal(zeros(J), Σx), x - π*z - ρ*ξ)
    μ = zeros(T, J + 1, length(ν[1]))
    for i in 1:N
        @views μ[1:J,:] .= x[:,i]*β + ξ[:,i] .+ x[:,i]*σ*sqrt(2)*ν[1]'
        μ = softmax(μ, dims=1)
        p = (μ*ν[2])/sqrt(Base.π)
        logp += logpdf(Multinomial(sum(s[:,i]), p), s[:,i])
    end
    return(logp)
end
ν = gausshermite(3)
prob = RCIVLogit(data.x, data.z, data.s, N, J,ν , defaultpriors)
θ0 = ( ξ=data.ξ, β=pm.β, σ=pm.σ, π=pm.π, ρ=pm.ρ,
sξ = log.(sqrt.(diag(pm.Ω))),
uξ=cholesky(pm.Ω ./ sqrt.(diag(pm.Ω)*diag(pm.Ω)')).U,
sx = log.(sqrt.(diag(pm.Σx))),
ux=cholesky(pm.Σx ./ sqrt.(diag(pm.Σx)*diag(pm.Σx)')).U
)
prob(θ0)
# A transformation from ℝ^(# parameters) to the parameters
# see TransformVariables.jl
t= as((ξ=as(Array,asℝ,J,N),
β = asℝ, σ=asℝ₊, π=asℝ, ρ=asℝ,
sξ=as(Array,asℝ,J),
uξ=CorrCholeskyFactor(J),
sx=as(Array,asℝ,J),
ux=CorrCholeskyFactor(J)) )
x0 = inverse(t, θ0)
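# illustrative check (not in the original): the transformation round-trips, and
# dimension(t) gives the number of unconstrained parameters DynamicHMC samples over
θcheck = transform(t, x0)
θcheck.β ≈ θ0.β && θcheck.σ ≈ θ0.σ && θcheck.ξ ≈ θ0.ξ
dimension(t)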
# wrap our logposterior and transform into the struc that DynamicHMC
# uses for sampling
P = TransformedLogDensity(t, prob)
∇P = ADgradient(:ForwardDiff, P);
warmup_stages = default_warmup_stages(;doubling_stages=2, M=Symmetric)
results = DynamicHMC.mcmc_keep_warmup(Random.GLOBAL_RNG, ∇P, 1000;
initialization = (q = x0, ),
reporter = LogProgressReport(nothing, 25, 15),
warmup_stages=warmup_stages)
JLD2.@save "$dir/dhmc.jld2" results
JLD2.@load "$dir/dhmc.jld2" results
post = transform.(t,results.inference.chain)
p = post[1]
names = vcat([length(p[s])==1 ? String(s) : String.(s).*string.(1:length(p[s])) for s in keys(p)]...)
vals = hcat([vcat([length(v)==1 ? v : vec(v) for v in values(p)]...) for p in post]...)'
chain = MCMCChains.Chains(reshape(vals, size(vals)..., 1), names)
display(plot(chain[["β","σ","ρ","π"]]))
display(describe(chain[["β","σ","ρ","π"]])[1])
Summary Statistics
parameters mean std naive_se mcse ess rhat
Symbol Float64 Float64 Float64 Float64 Float64 Float64
β -1.1455 0.4021 0.0127 0.0357 76.3266 1.0292
σ 0.5666 0.2756 0.0087 0.0186 195.7986 0.9990
π 0.8488 0.3193 0.0101 0.0151 518.1321 0.9990
ρ 0.4704 0.2925 0.0093 0.0231 156.4125 1.0056
Quasi-Bayesian
Consider the following GMM setup. We have a data-dependent function $g_i(\theta)$ such that $\Er[g_i(\theta_0)] = 0$. Let the usual GMM estimator be

$$ \hat{\theta}^{GMM} = \arg\min_\theta \frac{n}{2} \left(\frac{1}{n}\sum_{i=1}^n g_i(\theta)\right)' W \left(\frac{1}{n}\sum_{i=1}^n g_i(\theta)\right). $$

Assume the usual regularity conditions so that

$$ \sqrt{n}\left(\hat{\theta}^{GMM} - \theta_0\right) \leadsto N\left(0, (D'WD)^{-1}(D'W\Omega W D)(D'WD)^{-1}\right), $$
where $D = D_\theta \Er[g_i(\theta_0)]$ and $\frac{1}{\sqrt{n}} \sum g_i(\theta_0) \to N(0, \Omega)$.
Quasi-Bayesian approaches are based on sampling from a quasi-posterior that also converges to $N\left(0, (D'WD)^{-1}(D'W\Omega W D) (D'WD)^{-1} \right)$ as $n\to \infty$.
Let $\Sigma = (D'WD)^{-1}(D'W\Omega W D) (D'WD)^{-1}$. Note that the log density of the asymptotic distribution, $N(\hat{\theta}^{GMM}, \Sigma/n)$, is

$$ -\frac{n}{2}\left(\theta - \hat{\theta}^{GMM}\right)' \Sigma^{-1} \left(\theta - \hat{\theta}^{GMM}\right) + C. $$

Compare that to minus the GMM objective function, which we'll denote by $\log(p_n(\theta))$,

$$ \log(p_n(\theta)) = -\frac{n}{2}\left(\frac{1}{n}\sum_{i=1}^n g_i(\theta)\right)' W \left(\frac{1}{n}\sum_{i=1}^n g_i(\theta)\right). $$
If we take a second order Taylor expansion around $\theta_0$ (that is, linearize $\frac{1}{n}\sum_i g_i(\theta)$ around $\theta_0$),

$$ \log(p_n(\theta)) \approx -\frac{n}{2}\left(\frac{1}{n}\sum_{i=1}^n g_i(\theta_0) + D(\theta - \theta_0)\right)' W \left(\frac{1}{n}\sum_{i=1}^n g_i(\theta_0) + D(\theta - \theta_0)\right), $$

and complete the square,

$$ \log(p_n(\theta)) \approx -\frac{n}{2}\left(\theta - \theta_0 + (D'WD)^{-1}D'W \frac{1}{n}\sum_{i=1}^n g_i(\theta_0)\right)' (D'WD) \left(\theta - \theta_0 + (D'WD)^{-1}D'W \frac{1}{n}\sum_{i=1}^n g_i(\theta_0)\right) + C, $$

where $C$ is a constant that does not depend on $\theta$.
From this we can see that if we treat $\log(p_n(\theta))$ as a log posterior distribution, then conditional on the data, the quasi-posterior is approximately normal with a data-dependent mean and variance $\frac{1}{n}(D'WD)^{-1}$.
Finally, note that the usual asymptotic expansion of the GMM estimator gives

$$ \hat{\theta}^{GMM} \approx \theta_0 - (D'WD)^{-1}D'W \frac{1}{n}\sum_{i=1}^n g_i(\theta_0), $$

then

$$ \log(p_n(\theta)) \approx -\frac{n}{2}\left(\theta - \hat{\theta}^{GMM}\right)'(D'WD)\left(\theta - \hat{\theta}^{GMM}\right) + C. $$

That is, the quasi-posterior is approximately normal with mean equal to the usual GMM estimate and variance $\frac{1}{n}(D'WD)^{-1}$.
If we treat the quasi-posterior as a real posterior, the quasi-posterior mean or median will be consistent, since each approaches $\hat{\theta}^{GMM}$ and $\hat{\theta}^{GMM}$ is consistent.
Additionally, if we use quasi-posterior credible intervals for inference, they may or may not be valid. In general, the quasi-posterior variance, $\frac{1}{n}(D'WD)^{-1}$, differs from the asymptotic variance of $\hat{\theta}^{GMM}$, $\frac{1}{n}(D'WD)^{-1}(D'W\Omega W D)(D'WD)^{-1}$, so quasi-posterior credible intervals will not have correct coverage. However, if we use an efficient weighting matrix, $W=\Omega^{-1}$, then the two variances coincide and quasi-posterior credible intervals will be asymptotically valid.
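In practice, an efficient weighting matrix can be obtained in two steps, as we do below: sample the quasi-posterior once with a convenient $W$ (for example, the identity), then set

$$ \hat{W} = \hat{\Omega}^{-1} \quad \text{with} \quad \hat{\Omega} = \widehat{\mathrm{Var}}\left(g_i(\bar{\theta})\right), $$

where $\bar{\theta}$ is the first-step quasi-posterior mean, and sample again with $W = \hat{W}$.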
See these notes for more information, or for complete details.
Implementation
To compute moment conditions, we need to solve for $\xi$ given the parameters. Here’s the associated code. It’s similar to what we used in BLPDemand.jl.
using DynamicHMC, Random, LogDensityProblems, Parameters, TransformVariables, Distributions, ForwardDiff, NLsolve
# random coefficients logit shares, integrating over the random coefficient with
# quadrature nodes and weights ν (the weights are assumed normalized to sum to 1)
function share(δ::AbstractVector, σ, x::AbstractVector, ν::typeof(gausshermite(1)))
    J = length(δ)
    S = length(ν[1])
    s = zeros(promote_type(eltype(δ), eltype(σ)), size(δ))
    si = similar(s)
    @inbounds for i in 1:S
        si .= δ + σ*x*ν[1][i]
        # to prevent overflow from exp(δ + ...)
        simax = max(maximum(si), 0)
        si .-= simax
        si .= exp.(si)
        si .= si./(exp.(-simax) + sum(si))
        s .+= si*ν[2][i]
    end
    s .+= eps(eltype(s))
    return(s)
end
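As a quick illustrative check (the variables δtest, xtest, and νn exist only for this example), with $\sigma = 0$ and quadrature weights normalized to sum to one, share reduces to the ordinary multinomial logit shares:
gh = gausshermite(3)
νn = (gh[1], gh[2]./sum(gh[2]))          # normalized weights, as used below
δtest = [0.5, -0.3]
xtest = [1.0, 2.0]
share(δtest, 0.0, xtest, νn) ≈ exp.(δtest)./(1 + sum(exp.(δtest)))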
# invert the share function: solve s = share(δ, σ, x, ν) for δ
function delta(s::AbstractVector{T}, x::AbstractVector{T},
               σ::T, ν::typeof(gausshermite(1))) where T
    tol = 1e-6
    out = try
        sol = NLsolve.fixedpoint(d->(d .+ log.(s) .- log.(share(d, σ, x, ν))),
                                 log.(s) .- log.(1-sum(s)),
                                 method = :anderson, m=4,
                                 xtol=tol, ftol=tol,
                                 iterations=100, show_trace=false)
        sol.zero
    catch
        # fall back to the plain logit inversion if the fixed point iteration fails
        log.(s) .- log.(1-sum(s))
    end
    return(out)
end
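A similar illustrative check (again with example values) confirms that delta inverts the share function up to the solver tolerance:
gh = gausshermite(3)
νn = (gh[1], gh[2]./sum(gh[2]))
xtest = [1.0, 2.0]
stest = share([0.5, -0.3], 0.2, xtest, νn)
maximum(abs.(share(delta(stest, xtest, 0.2, νn), 0.2, xtest, νn) .- stest)) < 1e-5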
# Use the implicit function theorem to make ForwardDiff work with delta()
function delta(s::AbstractVector, x::AbstractVector,
               σ::D, ν::typeof(gausshermite(1))) where {D <: ForwardDiff.Dual}
    σval = ForwardDiff.value.(σ)
    δ = delta(s, x, σval, ν)
    # implicit function theorem: dδ/dσ = -(∂share/∂δ)⁻¹ (∂share/∂σ)
    ∂δ = ForwardDiff.jacobian(d -> share(d, σval, x, ν), δ)
    ∂σ = ForwardDiff.jacobian(s -> share(δ, s..., x, ν), [σval])
    out = Vector{typeof(σ)}(undef, length(δ))
    Jv = try
        -∂δ \ ∂σ
    catch
        zeros(eltype(∂σ), size(∂σ))
    end
    # chain rule with the partials already attached to σ
    Jc = zeros(ForwardDiff.valtype(D), length(σ), ForwardDiff.npartials(D))
    for i in eachindex(σ)
        Jc[i,:] .= ForwardDiff.partials(σ[i])
    end
    Jn = Jv * Jc
    for i in eachindex(out)
        out[i] = D(δ[i], ForwardDiff.Partials(tuple(Jn[i,:]...)))
    end
    return(out)
end
delta (generic function with 2 methods)
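The point of the second method is to let automatic differentiation pass through the share inversion. As an illustrative check (again with example values stest, xtest, and νn), ForwardDiff can now differentiate delta with respect to $\sigma$:
gh = gausshermite(3)
νn = (gh[1], gh[2]./sum(gh[2]))
xtest = [1.0, 2.0]
stest = share([0.5, -0.3], 0.2, xtest, νn)
ForwardDiff.derivative(σ->delta(stest, xtest, σ, νn), 0.2)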
Now the code for the quasi-posterior.
using DynamicHMC, Random, LogDensityProblems, Parameters, TransformVariables, Distributions
defaultpriors = (β = Normal(0, 20),
                 σ = truncated(Normal(0, 10), 0, Inf))

struct RCIVLogitQB{T,Stype}
    x::Matrix{T}
    z::Matrix{T}
    s::Matrix{Stype}
    N::Int64
    J::Int64
    ν::typeof(gausshermite(1))
    priors::typeof(defaultpriors)
    W::Matrix{T}
end

macro evalpriors(priors)
    first = true
    expr = :(1+1)
    for k in keys(defaultpriors)
        if first
            expr = :(logpdf($(priors).$k, $k))
            first = false
        else
            expr = :($expr + logpdf($(priors).$k, $k))
        end
    end
    return(esc(expr))
end
function (d::RCIVLogitQB)(θ)
    @unpack β, σ = θ
    @unpack x, z, s, N, J, ν, priors, W = d
    T = typeof(x[1,1]*β)
    # priors
    logp = @evalpriors(d.priors)
    # quasi-loglikelihood: -N/2 times the GMM objective
    m = zeros(T, J*J)
    for i in 1:N
        # invert shares for the demand shocks, then form the moments vec(ξ z')
        ξ = delta(s[:,i], x[:,i], σ, ν) - x[:,i]*β
        m .+= vec(ξ*z[:,i]')
    end
    m ./= N
    logp += -0.5*N*m'*W*m #logpdf(MvNormal(zeros(length(m)), W), m)
    return(logp)
end
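In equations, the function above computes the log quasi-posterior

$$ \log p_n(\beta, \sigma) = \log\pi(\beta) + \log\pi(\sigma) - \frac{N}{2}\, \bar{m}(\beta,\sigma)' W\, \bar{m}(\beta,\sigma), \qquad \bar{m}(\beta,\sigma) = \frac{1}{N}\sum_{i=1}^N \mathrm{vec}\left(\xi_i(\beta,\sigma)\, z_i'\right), $$

where $\xi_i(\beta,\sigma) = \delta(s_i, x_i, \sigma) - x_i\beta$ and $\delta$ is the share inversion defined above.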
gh = gausshermite(3)
ν = (gh[1],gh[2]/sqrt(Base.π))
prob = RCIVLogitQB(data.x, data.z, max.(1,data.s[1:J,:])./M, N, J,ν , defaultpriors, diagm(ones(J*J)))
θ0 = (β=pm.β, σ=pm.σ )
prob(θ0)
t=as((β = asℝ, σ=asℝ₊))
x0 = inverse(t, θ0)
P = TransformedLogDensity(t, prob)
∇P = ADgradient(:ForwardDiff, P)
LogDensityProblems.logdensity_and_gradient(∇P, x0);
The chain here actually takes very little time to run, but we still cache the results.
warmup=default_warmup_stages(stepsize_search=nothing, doubling_stages=2)
results = DynamicHMC.mcmc_keep_warmup(Random.GLOBAL_RNG, ∇P, 1000;
initialization = (q = x0, ϵ=1e-5 ),
reporter = LogProgressReport(nothing, 25, 15),
warmup_stages =warmup)
JLD2.@save "$dir/qb.jld2" results
JLD2.@load "$dir/qb.jld2" results
post = transform.(t,results.inference.chain)
p = post[1]
names = vcat([length(p[s])==1 ? String(s) : String.(s).*string.(1:length(p[s])) for s in keys(p)]...)
vals = hcat([vcat([length(v)==1 ? v : vec(v) for v in values(p)]...) for p in post]...)'
chain = MCMCChains.Chains(reshape(vals, size(vals)..., 1), names)
display(describe(chain[["β","σ"]])[1])
display(plot(chain[["β","σ"]]))
Summary Statistics
parameters mean std naive_se mcse ess rhat
Symbol Float64 Float64 Float64 Float64 Float64 Float64
β -1.2810 0.3689 0.0117 0.0257 154.9630 1.0000
σ 0.9447 0.6902 0.0218 0.0512 136.3898 1.0009
The quasi-posterior means look okay. However, the standard deviations and quantiles should not be used for inference, because we didn't use an efficient weighting matrix above. Let's compute one and repeat.
σ = mean(chain[:σ])
β = mean(chain[:β])
m = zeros(J*J, N)
for i in 1:N
    ξ = delta(prob.s[:,i], prob.x[:,i], σ, ν) - prob.x[:,i]*β
    m[:,i] .= vec(ξ*prob.z[:,i]')
end
W = inv(cov(m'))
prob = RCIVLogitQB(data.x, data.z, max.(1,data.s[1:J,:])./M, N, J,ν , defaultpriors,W)
θ0 = (β=pm.β, σ=pm.σ )
prob(θ0)
P = TransformedLogDensity(t, prob)
∇P = ADgradient(:ForwardDiff, P)
LogDensityProblems.logdensity_and_gradient(∇P, x0);
warmup=default_warmup_stages(stepsize_search=nothing, doubling_stages=2)
results = DynamicHMC.mcmc_keep_warmup(Random.GLOBAL_RNG, ∇P, 1000;initialization = (q = x0, ϵ=1e-5 ), reporter = LogProgressReport(nothing, 25, 15), warmup_stages =warmup)
JLD2.@save "$dir/qbw.jld2" results
JLD2.@load "$dir/qbw.jld2" results
post = transform.(t,results.inference.chain)
p = post[1]
names = vcat([length(p[s])==1 ? String(s) : String.(s).*string.(1:length(p[s])) for s in keys(p)]...)
vals = hcat([vcat([length(v)==1 ? v : vec(v) for v in values(p)]...) for p in post]...)'
chain = MCMCChains.Chains(reshape(vals, size(vals)..., 1), names)
display(describe(chain[["β","σ"]])[1])
display(plot(chain[["β","σ"]]))
Summary Statistics
parameters mean std naive_se mcse ess rhat
Symbol Float64 Float64 Float64 Float64 Float64 Float64
β -1.1734 0.3905 0.0123 0.0370 71.9467 1.0124
σ 0.7136 0.5478 0.0173 0.0480 94.2200 1.0084
These posterior standard deviations and quantiles can be used for inference.
About
This is meant to accompany these slides on Bayesian methods in IO.
This document was created using Weave.jl. The code is available on github.