Assignment: reproducing Grieco & McDevitt (2017)

Paul Schrimpf 2026-01-26

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

The source is available on GitHub.

\[ \def\indep{\perp\!\!\!\perp} \def\Er{\mathrm{E}} \def\R{\mathbb{R}} \def\En{{\mathbb{E}_n}} \def\Pr{\mathrm{P}} \newcommand{\norm}[1]{\left\Vert {#1} \right\Vert} \newcommand{\abs}[1]{\left\vert {#1} \right\vert} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \def\inprob{\,{\buildrel p \over \rightarrow}\,} \def\indist{\,{\buildrel d \over \rightarrow}\,} \]

Download the .qmd file for this assignment from https://github.com/UBCECON567/Dialysis/blob/master/docs/qmd/dialysis-2.qmd, and put it in the same folder that you used for part 1 (a different folder will also work, but may lead to extra package installation). For the bibliography to work, you’ll also need https://github.com/UBCECON567/Dialysis/blob/master/docs/qmd/dialysis.bib in the same folder.

Change the author on the 4th line of dialysis-2.qmd, and change eval: false to eval: true on line 22 (leaving it as false will prevent the code from being executed when you run quarto). You can write your answers below each Question block. If you want, you can delete sections that are not questions, but that’s up to you.

If using another tool for your answers, turn in both your code and prose. These could be in the same file (like a Jupyter or Pluto notebook) or separate (e.g. a .jl file of code and pdf of prose). Your code should run, so it should include copies of code blocks from this document that you execute.

You may use LLMs, but please disclose where and how. Also, be sure to carefully check their output.

Package Environment

Julia’s package management uses “environments” — groups of packages installed together. It is generally a good idea to create a separate environment for each project. The code below activates the environment in the current directory, checks if the Dialysis package is part of that environment, and, if not, adds it. You can manage the environment interactively by pressing ] in the Julia REPL (and press Ctrl-C to switch back to normal REPL mode). See the Pkg documentation for much more information.

import Pkg
Pkg.activate(".")
try
    using Dialysis # This assignment itself is in the "Dialysis" package. We will use some of the functions from it.
catch
    @warn "Dialysis package is not part of the current environment. Adding it now."
    Pkg.develop(PackageSpec(url="https://github.com/UBCECON567/Dialysis"))
    using Dialysis
    Pkg.add(["DataFrames","Statistics","StatsBase","Dates","StatsPlots","Plots"])
end
using DataFrames, Statistics, StatsPlots, Plots

Environments can be nested. The packages in the global environment will always be available. You can switch the global environment by entering activate while in package mode, or Pkg.activate() in normal mode. The global environment is a good place for utility packages that you always want available. In my global environment, I have: debugging tools (Debugger and Infiltrator), interactive utilities (TerminalPager and OhMyREPL), and development tools (Revise). You generally should not put much else in your global environment. For any assignment or other project, you should make a project specific environment that contains all packages necessary for the associated scripts and documents to run.
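As a quick illustration of switching environments programmatically (the package added here is just an example), you can create a throwaway environment without touching your global one:

```julia
import Pkg

# Activate a fresh environment in a temporary directory; its Project.toml
# records only what we add here, leaving the global environment untouched.
Pkg.activate(mktempdir())
Pkg.add("Statistics")   # added to this environment only
Pkg.status()            # lists just this environment's packages

# Switch back to the default (global) environment.
Pkg.activate()
```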

Data Preparation

We begin by taking the data you created in part 1 of the assignment. You can modify the cell below if you wish (e.g. if you prefer different variable names or definitions).

The code below is separated into a module. Modules are organizational tools that group associated functions, variables, and type definitions together. They also serve as abstraction barriers. Packages loaded with using or import inside a module are not available outside the module. Functions and other objects defined inside a module are only available outside it if we qualify them as, for example, Cleaning.guesstype instead of just guesstype (unless the module has an explicit export statement).
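For instance, here is a minimal toy module (hypothetical, not part of the assignment code) illustrating these scoping rules:

```julia
module Tiny

using Statistics   # loaded inside Tiny only

export visible     # names listed here are brought in by `using .Tiny`

visible(x) = mean(x)
hidden(x) = median(x)   # not exported: callable only as Tiny.hidden

end

using .Tiny
visible([1, 2, 3])       # works because `visible` is exported
Tiny.hidden([1, 2, 3])   # qualified access to a non-exported name
# note: `using Statistics` inside Tiny does not make `mean` available out here
```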

"""
Functions used in cleaning Dialysis Facility Reports data.
"""
module Cleaning

using Dates, DataFrames, StatsBase, Statistics, Dialysis

"""
    guesstype(x)

Try to guess at appropriate type for x.
"""
function guesstype(x::AbstractArray{T}) where T <:Union{S,Missing} where S <: AbstractString
    if all(occursin.(r"(^\.$)|(^(-|)\d+)$",skipmissing(x)))
        return Int
    elseif all(occursin.(r"(^\.$)|(^(-|\d)\d*(\.|)\d*$)",skipmissing(x)))
        return Float64
    elseif all(occursin.(
        r"(^\.$)|(^\d\d\D{3}\d{4}$|^\d\d/\d\d/\d{4}$)",
        skipmissing(x)))
        return Date
    else
        return S
    end
end
guesstype(x) = eltype(x)


parse(t, x) = Base.parse(t, x)
# we need a parse that "converts" a string to string
parse(::Type{S}, x::S) where S <: AbstractString = x
# a version of parse that works for the date formats in this data
parse(::Type{Dates.Date}, x::AbstractString) = occursin(r"\D{3}",x) ? Date(x, "dduuuyyyy") : Date(x,"m/d/y")

"""
        converttype(x)

    Convert `x` from an array of strings to a more appropriate numeric type
    if possible.
    """
function converttype(x::AbstractArray{T}) where T <: Union{Missing, AbstractString}
    etype = guesstype(x)
    return([(ismissing(val) || val==".") ? missing : parse(etype,val)
            for val in x])
end
converttype(x) = x

function combinefiscalyears(x::AbstractArray{T}) where T <: Union{Missing,Number}
    if all(ismissing.(x))
        return(missing)
    else
        v = median(skipmissing(x))
        return(isnan(v) ? missing : v)
    end
end
function combinefiscalyears(x)
    # use the most common value
    if all(ismissing.(x))
        return(missing)
    else
        # `maximum` on the countmap would compare keys, not counts;
        # `argmax` returns the key with the largest count
        return(argmax(countmap(collect(skipmissing(x)))))
    end
end

upcase(x) = Base.uppercase(x)
upcase(m::Missing) = missing

function dayssince(year, dates)
    today = Date(year, 12, 31)
    past = [x.value for x in today .- dates if x.value>=0]
    if length(past)==0
        return(missing)
    else
        return(minimum(past))
    end
end

function load_and_clean_data()
    dialysis, datadic = Dialysis.loadDFR(recreate=false)

    # fix the identifier strings. some years they're listed as ="IDNUMBER", others, they're just IDNUMBER
    dialysis.provfs = replace.(dialysis.provfs, "="=>"")
    dialysis.provfs = replace.(dialysis.provfs,"\""=>"")
    # convert strings to numeric types
    dialysis=mapcols(Cleaning.converttype, dialysis)


    dialysis = combine(groupby(dialysis, [:provfs,:year]),
        names(dialysis) .=> Cleaning.combinefiscalyears .=> names(dialysis))
    sort!(dialysis, [:provfs, :year])

    pt = Symbol.(filter(x->occursin.(r"PT$",x),names(dialysis)))
    ft = Symbol.(filter(x->occursin.(r"FT$",x),names(dialysis)))
    dialysis[!,:labor]=sum.(skipmissing.(eachrow(dialysis[!,pt])))*0.5 +
                       sum.(skipmissing.(eachrow(dialysis[!,ft])))

    dialysis.hiring = panellag(:labor, dialysis, :provfs, :year, -1) - dialysis.labor
    dialysis.investment = panellag(:totstas_f, dialysis, :provfs, :year, -1) - dialysis.totstas_f


    dialysis.forprofit = (x->(x=="Unavailable" ? missing :
        x=="For Profit")).(dialysis.owner_f)

    # Chains
    dialysis.fresenius = (x->(ismissing(x) ? false :
            occursin(r"(FRESENIUS|FMC)",x))).(dialysis.chainnam)
    dialysis.davita = (x->(ismissing(x) ? false :
            occursin(r"(DAVITA)",x))).(dialysis.chainnam)
    # could add more

    # State inspection rates
    inspect = combine(groupby(dialysis, :provfs),
        :surveydt_f => x->[unique(skipmissing(x))])
    rename!(inspect, [:provfs, :inspection_dates])
    df=innerjoin(dialysis, inspect, on=:provfs)
    @assert nrow(df)==nrow(dialysis)

    df=transform(df, [:year, :inspection_dates] => (y,d)->Cleaning.dayssince.(y,d))
    rename!(df, names(df)[end] =>:days_since_inspection)
    df[!,:inspected_this_year] = ((df[!,:days_since_inspection].>=0) .&
        (df[!,:days_since_inspection].<365))

    # then take the mean by state
    stateRates = combine(groupby(df, [:state, :year]),
                    :inspected_this_year =>
            (x->mean(skipmissing(x))) => :state_inspection_rate)
    # if there are no inspections in a state in a year, then
    # mean(skipmissing(x)) will be mean([]) = NaN. 0 makes more sense
    stateRates.state_inspection_rate[isnan.(stateRates.state_inspection_rate)] .= 0

    df = innerjoin(df, stateRates, on=[:state, :year])
    @assert nrow(df)==nrow(dialysis)

    # competitors
    df[!,:provcity] = Cleaning.upcase.(df[!,:provcity])
    comps = combine(groupby(df,[:provcity,:year]),
                    :dy =>
            (x -> count(!ismissing, x)) =>
            :competitors
           )
    comps = comps[.!ismissing.(comps.provcity),:]
    dialysis = outerjoin(df, comps, on = [:provcity,:year], matchmissing=:equal)
    @assert nrow(dialysis)==nrow(df)

    dialysis, datadic
end

end
dialysis, datadic = Cleaning.load_and_clean_data()
describe(dialysis)

Estimation of $\alpha$

Quality

To proxy for quality, Grieco and McDevitt (2017) use the septic infection rate adjusted for patient characteristics. That is, they use the residuals from regressing infection rate on patient characteristics. There are a number of Julia packages that can be used for regression. I like FixedEffectModels. As the name suggests, it is focused on fixed effect models, but also works for regressions without fixed effects. It is written by economists, so it has features that economists use a lot — convenient ways to get robust and/or clustered standard errors, and good support for IV. The GLM package is a reasonable alternative.

Usage of FixedEffectModels is fairly similar to R, in that it has a formula interface for constructing X matrices from a DataFrame. You can modify this regression if you want, but you do not need to.

using FixedEffectModels, CategoricalArrays
dialysis[!,:idcat] = categorical(dialysis[!,:provfs]) # FixedEffectModels requires clustering and fixed effect variables to be categorical
qreg = reg(dialysis, @formula(sepi ~ days_since_inspection + age +
    sex + vin + ppavf + clmcntcom + hgm),
           Vcov.cluster(:idcat),
           save=true) # saves residuals
dialysis[!,:quality] = -qreg.residuals
dialysis1=dialysis
qreg

OLS and Fixed Effects Estimates

Question

Reproduce columns 2,3, 5, and 6 of Table 5. The syntax for fixed effects regression is shown below. Be sure to add the other columns. If you’d like, you could use RegressionTables.jl to produce tables that look a lot like the ones in the paper.

# -Inf causes an error in reg()
log_infmiss = x->ifelse(!ismissing(x) && x>0, log(x), missing)

dialysis2 = let
    dialysis = dialysis1
    dialysis[!,:lpy] = log_infmiss.(dialysis[!,:dy]) # or hdy or phd?
    dialysis[!,:logL] = log_infmiss.(dialysis[!,:labor])
    dialysis[!,:logK] = log_infmiss.(dialysis[!,:totstas_f])
    dialysis
end
# you may want to restrict sample to match sample that can be
# used in control function estimates
reg(dialysis2, @formula(lpy ~ quality + logK + logL + fe(idcat)),
    Vcov.cluster(:idcat))

Estimation of $\alpha$

As discussed in section 5 of Grieco and McDevitt (2017), the coefficient on quality, $\alpha$, is estimated from

\[ y_{jt} = \alpha q_{jt} + \Phi(\underbrace{h_{jt}, k_{jt}, l_{jt}, x_{jt}}_{w_{jt}}) + \epsilon_{jt} \]

with a second noisy measure of quality, $z_{jt}$, used to instrument for $q_{jt}$. To estimate $\alpha$, first the exogenous variables, $w$, can be partialed out to give:

\[ y_{jt} - \Er[y_{jt} | w_{jt} ] = \alpha (q_{jt} - \Er[q_{jt}|w_{jt}]) + \epsilon_{jt} \]

where we used the assumption that $\Er[\epsilon_{jt} | w_{jt} ] = 0$ and the fact that $\Er[\Phi(w) | w] = \Phi(w)$. Under the assumption that $\Er[\epsilon| z, w] = 0$, we can estimate $\alpha$ based on the moment condition:

\[ \begin{align*} 0 = & \Er[\epsilon f(z,w) ] \\ 0 = & \Er\left[ \left(y_{jt} - \Er[y_{jt} | w_{jt} ] - \alpha (q_{jt} - \Er[q_{jt}|w_{jt}])\right) f(z_{jt},w_{jt}) \right] \end{align*} \]

If $Var(\epsilon|z,w)$ is constant, the efficient choice of $f(z,w)$ is

\[ \Er[\frac{\partial \epsilon}{\partial \alpha} |z, w ] = \Er[q| z, w] - \Er[q|w] \]

To estimate $\alpha$, we simply replace these conditional expectations with regression estimates, and replace the unconditional expectation with a sample average. Let $\hat{\Er}[y|w]$ denote a nonparametric estimate of the regression of $y$ on $w$. Then,

\[ \hat{\alpha} = \frac{\sum_{j,t} (y_{jt} - \hat{E}[y|w_{jt}])(\hat{E}[q|z_{jt},w_{jt}] - \hat{E}[q|w_{jt}])} {\sum_{j,t} (q_{jt} - \hat{E}[q|w_{jt}])(\hat{E}[q|z_{jt},w_{jt}] - \hat{E}[q|w_{jt}])} \]
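In code, once the residualized series are in hand, $\hat{\alpha}$ is just a ratio of inner products. A minimal sketch (the vector names ey, eq, and fz are hypothetical stand-ins for $y-\hat{E}[y|w]$, $q-\hat{E}[q|w]$, and $\hat{E}[q|z,w]-\hat{E}[q|w]$; the partiallinearIV function described below does the real work):

```julia
# α̂ as a ratio of inner products of residualized series.
# ey = y - Ê[y|w], eq = q - Ê[q|w], fz = Ê[q|z,w] - Ê[q|w]
estimate_alpha(ey, eq, fz) = sum(ey .* fz) / sum(eq .* fz)

# Sanity check with ϵ = 0 and true α = 2: the ratio recovers α exactly.
eq = randn(1000)
ey = 2.0 .* eq
fz = copy(eq)
estimate_alpha(ey, eq, fz)   # == 2.0 up to floating-point error
```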

The function partiallinearIV in Dialysis.jl will estimate this model. Also included are two methods for estimating $\hat{E}[y|w]$. polyreg estimates a polynomial series regression; that is, it regresses $y$ on a polynomial of degree $d$ in $w$. To allow the regression to approximate any function, the degree must increase with the sample size, but to control variance, the degree must not increase too quickly. We will not worry too much about the choice of degree here.
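To make the idea concrete, here is a minimal sketch of series regression for a single regressor (the real polyreg in Dialysis.jl handles multivariate $w$; this toy version is only illustrative):

```julia
# Polynomial series regression of y on x: regress y on (1, x, x^2, ..., x^degree)
# and return the fitted values, an estimate of E[y|x].
function polyfit(y, x; degree=3)
    X = hcat((x .^ d for d in 0:degree)...)  # design matrix, columns x^0, ..., x^degree
    b = X \ y                                # least-squares coefficients
    return X * b                             # fitted values Ê[y|x]
end

x = collect(range(-1, 1, length=200))
y = sin.(2 .* x) .+ 0.1 .* randn(200)
yhat = polyfit(y, x; degree=5)               # smooth approximation to sin(2x)
```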

An alternative method (and what Grieco and McDevitt (2017) used) is local linear regression. To estimate $\hat{E}[y|x_{jt}]$, local linear regression estimates a linear regression of $y$ on $x$, but weights observations by how close $x_{it}$ is to $x_{jt}$. That is,

\[ \hat{E}[y|x_{jt}] = x_{jt} \hat{\beta}(x_{jt}) \]

where

\[ \hat{\beta}(x_{jt}) = \argmin_\beta \sum_{i,t} (y_{it} - x_{it}\beta)^2 k((x_{it} - x_{jt})/h_n) \]

Here $k()$ is some function with its maximum at 0 (and some other properties), like $k(x) \propto e^{-x^2}$. The bandwidth, $h_n$, determines how much weight to place on observations close to versus far from $x_{jt}$. Similar to the degree in polynomial regression, the bandwidth must decrease toward 0 as the sample size grows to allow local linear regression to approximate any function, but to control variance the bandwidth must not decrease too quickly. We will not worry too much about the choice of bandwidth. The function locallinear in Dialysis.jl estimates a local linear regression.
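A minimal sketch of the idea for a scalar regressor, using a Gaussian kernel (the actual locallinear in Dialysis.jl differs in details such as bandwidth handling and multivariate regressors):

```julia
using LinearAlgebra

# Local linear regression: fit y on x by weighted least squares, with weights
# that decay with the distance from the evaluation point x0.
function locallin(y, x, x0; h=0.1)
    k = exp.(-((x .- x0) ./ h) .^ 2)       # Gaussian kernel weights
    X = hcat(ones(length(x)), x .- x0)     # local intercept and slope
    W = Diagonal(k)
    b = (X' * W * X) \ (X' * W * y)        # weighted least squares
    return b[1]                            # intercept = fitted value Ê[y|x = x0]
end
```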

::: {.callout-tip title="Local linear"}

The locallinear function in Dialysis.jl is not very efficient and will be somewhat slow for this dataset. You can try it if you want, but my past experience has shown that polyreg gives similar results.

:::

Question

Estimate $\alpha$ using the following code. You may want to modify some aspects of it and/or report estimates of $\alpha$ for different choices of instrument, nonparametric estimation method, degree or bandwidth. Compare your estimate(s) of $\alpha$ with the ones in Tables 5 and 6 of Grieco and McDevitt (2017).

# create indicator for observations usable in estimation of α
inc1 = ((dialysis[!,:dy] .> 0) .& (dialysis[!,:labor] .> 0) .&
    (dialysis[!,:totstas_f] .> 0) .&
    .!ismissing.(dialysis[!,:quality]) .&
    .!ismissing.(dialysis[!,:smr]) .&
    (dialysis[!,:investment].==0) .&
    (dialysis[!,:hiring].!=0))
inc1[ismissing.(inc1)] .= false
dialysis[!,:inc1] = inc1
dialysis[!,:lsmr] = log.(dialysis[!,:smr] .+ .01)
# As degree → ∞ and/or bandwidth → 0, whether we use :smr or
# some transformation as the instrument should not matter. However,
# for fixed degree or bandwidth it will have some (hopefully small)
# impact.

(α, Φ, αreg, eyqz)=partiallinearIV(:lpy,  # y
                                   :quality, # q
                                   :lsmr,   # z
                                   [:hiring, :logL, :logK,
                                    :state_inspection_rate, :competitors], # w
                                   dialysis[findall(dialysis[!,:inc1]),:];
                                   npregress=(xp, xd,yd)->polyreg(xp,xd,yd,degree=1),
                                   parts=true
                                   # You may want to change the degree here
                                   #
                                   # You could also change `polyreg`  to
                                   # `locallinear` and `degree` to
                                   # `bandwidthmultiplier`
                                   #
                                   # locallinear will likely take some time to
                                   # compute (≈350 seconds on my computer)
                                   )
# we will need these later in step 2
dialysis[!,:Φ] = similar(dialysis[!,:lpy])
dialysis[:,:Φ] .= missing
rows = findall(dialysis[!,:inc1])
dialysis[rows,:Φ] = Φ
dialysis[!,:ey] = similar(dialysis[!,:lpy])
dialysis[:,:ey] .= missing
dialysis[rows,:ey] = eyqz[:,1]
dialysis[!,:eq] = similar(dialysis[!,:lpy])
dialysis[:,:eq] .= missing
dialysis[rows,:eq] = eyqz[:,2]
dialysis[!,:ez] = similar(dialysis[!,:lpy])
dialysis[:,:ez] .= missing
dialysis[rows,:ez] = eyqz[:,3]
println("Estimate of α=$α")
αreg

Estimation of Labor and Capital Coefficients

Brief introduction to GMM

The coefficients on labor and capital are estimated by GMM. The idea of GMM is as follows. We have a model that implies

\[ \Er[c(y,x;\theta) | z ] = 0 \]

where $y$, $x$, and $z$ are observed variables. $c(y,x;\theta)$ is some known function of the data and some parameters we want to estimate, $\theta$. Often, $c(y,x;\theta)$ are the residuals from some equation. For example, for linear IV, we’d have

\[ c(y,x;\theta) = y - x\theta \]

The conditional moment restriction above implies that

\[ \Er[c(y,x;\theta)f(z) ] = 0 \]

for any function $f()$. We can then estimate $\theta$ by replacing the population expectation with a sample average and finding $\hat{\theta}$ such that

\[ \En[c(y,x;\hat{\theta})f(z) ] \approx 0 \]

The dimension of $f(z)$ should be greater than or equal to the dimension of $\theta$, so we have at least as many equations as unknowns. We find this $\hat{\theta}$ by minimizing a quadratic form of these equations. That is,

\[ \hat{\theta} = \argmin_\theta \En[g_i(\theta)] W_n \En[g_i(\theta)]' \]

where $g_i(\theta) = c(y_i, x_i;\theta)f(z_i)$, and $W_n$ is some positive definite weighting matrix.
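To connect this to least squares: in the just-identified linear case with $f(z) = x$ (treating $x_i$ as a row vector), the sample moment condition can be solved exactly,

\[ 0 = \En[ x_i'(y_i - x_i \hat{\theta}) ] \quad \Longrightarrow \quad \hat{\theta} = \left( \sum_i x_i' x_i \right)^{-1} \sum_i x_i' y_i, \]

which is the OLS estimator. So in the question below, the GMM minimizer should coincide with the OLS coefficients regardless of the choice of $W_n$.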

Question

As practice with GMM, use it to estimate a simple regression model,

$y = x\beta + \epsilon$

assuming $\Er[\epsilon|x] = 0$. Test your code on simulated data. The following will help you get started.

using LinearAlgebra

function sim_ols(n; β = ones(3))
    x = randn(n, length(β))
    ϵ = randn(n)
    y = x*β + ϵ
    return(x,y)
end
β = ones(2)
(x, y) = sim_ols(100; β=β)
βols = (x'*x) \ (x'*y)

function gmm_objective(β; W=I)
    gi = (y - x*β) .* x
    Egi = mean(gi, dims=1) # note that Egi will be 1 × length(β)
    error("This is incomplete; you must finish it")
    # It is is likely that the code you will write will return a 1 x 1,
    # 2 dimensional array. For compatibility with Optim, you need to
    # return a scalar. If foo is a 1x1 array, write `foo[1]` to return a scalar instead of
    # 1x1 array
end

# minimize gmm_objective
using Optim # github page: https://github.com/JuliaNLSolvers/Optim.jl
import ForwardDiff, ADTypes # optimizers work best with derivatives; these packages enable automatic differentiation

# docs : http://julianlsolvers.github.io/Optim.jl/stable/
res = optimize(gmm_objective,
               zeros(size(β)), # initial value
               BFGS(), # algorithm, see http://julianlsolvers.github.io/Optim.jl/stable/
               autodiff=ADTypes.AutoForwardDiff())
βgmm = res.minimizer
import Test
Test.@test βgmm ≈ βols
res, βgmm, βols

Estimating $\beta$

The model implies that

\[ \omega_{jt} = \Phi(w_{jt}) - \beta_k k_{jt} - \beta_l l_{jt} \]

and

\[ y_{jt} - \alpha q_{jt} - \beta_k k_{jt} - \beta_l l_{jt} = g(\omega_{jt-1}) + \eta_{jt} \]

The timing and exogeneity assumptions imply that

\[ \Er[\eta_{jt} | k_{jt}, l_{jt}, w_{jt-1}] = 0 \]

Given a value of $\beta$, and our above estimates of $\Phi$ and $\alpha$, we can compute $\omega$ from the equation above, and then estimate $g()$ and $\eta$ by a nonparametric regression of $y_{jt} - \alpha q_{jt} - \beta_k k_{jt} - \beta_l l_{jt}$ on $\omega_{jt-1}$. $\beta$ can then be estimated by finding the value of $\beta$ that comes closest to satisfying the moment condition

\[ \Er[\eta(\beta)_{jt} k_{jt}] = 0 \text{ and } \Er[\eta(\beta)_{jt} l_{jt}] = 0 \]

To do this, we minimize

\[ Q_n(\beta) = \left( \frac{1}{N} \sum_{j,t} \eta(\beta)_{jt} (k_{jt}, l_{jt}) \right) W_n \left( \frac{1}{N} \sum_{j,t} \eta(\beta)_{jt} (k_{jt}, l_{jt}) \right)' \]

Question

Write the body of the $Q_n(\beta)$ function below. Use it to estimate $\beta$. Compare your results with those of Grieco and McDevitt (2017). Optionally, explore robustness of your results to changes in the specification.

Qn(degree) = let
    (ωfunc, ηfunc) = errors_gm(:lpy, :logK, :logL, :quality, :Φ, :provfs, :year,
                               dialysis, α; degree=degree)
    function Qn(β)
        η = ηfunc(β) # note that η will be vector of length = size(dialysis,1)
                     # η will include missings
        error("You must write the body of this function")
    end
end
res = optimize(Qn(1),    # objective
               [0.0, 0.0], # lower bounds, should not be needed, but
               # may help numerical performance
               [1.0, 1.0], # upper bounds
               [0.4, 0.2], # initial value
               Fminbox(BFGS()),  # algorithm
               autodiff=ADTypes.AutoForwardDiff(),
               Optim.Options(show_trace=true))
@show res
@show βhat=res.minimizer

For diagnosing optimization problems, it can be helpful to plot the objective function. This is especially true here because there are only two parameters to estimate. Here’s some plotting code in case you want to try it.

using Plots
f = Qn(1)
βk = βhat[1].*range(0.7,1.3,length=10) # might want to adjust
βl = βhat[2].*range(0.7,1.3,length=10) # might want to adjust
surface(βk,βl,(x,y)->f([x,y]),xlabel="βk",ylabel="βl") # or contour in place of surface

<div id="refs" class="references csl-bib-body hanging-indent" entry-spacing="0">

<div id="ref-grieco2017" class="csl-entry">

Grieco, Paul L. E., and Ryan C. McDevitt. 2017. “Productivity and Quality in Health Care: Evidence from the Dialysis Industry.” The Review of Economic Studies 84 (3): 1071–1105. https://doi.org/10.1093/restud/rdw042.

</div>

</div>