Sunday 28 October 2012

Survival analysis with time varying covariates measured at random times by design


Stephen Rathbun, Xiao Song, Benjamin Neustifter and Saul Shiffman have a new paper in Applied Statistics (JRSS C). This considers estimation of a proportional hazards survival model in the presence of time dependent covariates which are only intermittently sampled at random time points. Specifically, they are interested in examples relating to ecological momentary assessment, where data collection may be via electronic devices like smartphones and the decision on sampling times can be automated. They consider a self-correcting point-process sampling design, where the intensity of the sampling process depends on the past history of sampling times, which allows an individual to have random sampling times that are more regular than would be achieved from a Poisson process.

The proposed method of estimation is to use inverse intensity weighting to obtain an estimate of an individual's integrated hazard up to the event time. Specifically, for an individual with hazard h(t), sampling times t_1, ..., t_m and point-process sampling intensity rho(t), the estimator is the weighted sum of h(t_j)/rho(t_j) over the sampling times. This then replaces the integrated hazard in an approximate log-likelihood.
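A minimal sketch of this inverse-intensity weighting (notation and function names mine, not taken from the paper):

```python
def iiw_integrated_hazard(sample_times, intensities, hazard):
    """Inverse-intensity-weighted estimate of an integrated hazard:
    hazard values at the random sampling times, each divided by the
    point-process sampling intensity at that time."""
    return sum(hazard(t) / rho for t, rho in zip(sample_times, intensities))

# Toy check: constant hazard h(t) = 0.2 sampled at unit intensity over
# [0, 10]; the weighted sum recovers the true integral 0.2 * 10 = 2.
times = [0.5 + i for i in range(10)]
est = iiw_integrated_hazard(times, [1.0] * 10, lambda t: 0.2)
```

The weights 1/rho(t_j) compensate for the fact that covariate values are seen more often where the sampling intensity is high, in the same spirit as Horvitz-Thompson estimation.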

In part of the simulation study, and in an application, the exact point-process intensity is unknown and is taken from empirical estimates from the sample. Estimating the sampling intensity did not seem to have major consequences for the quality of the model estimates. This suggests the approach might be applicable in other survival models where the covariates are sampled longitudinally in a subject specific manner, provided a reasonable model for sampling can be devised.

Drawbacks of the method seem to be that covariates measured at baseline (which is not a random time point) cannot be incorporated in the estimate, and that the covariates must apparently be measured at the event time, which may not be the case in medical contexts. The underlying hazard also needs to be specified parametrically, but as the authors state, flexible spline models can be used.

Friday 26 October 2012

Estimating parametric semi-Markov models from panel data using phase-type approximations


Andrew Titman has a new paper in Statistics and Computing. This extends previous work on fitting semi-Markov models to panel data using phase-type distributions. Here, rather than assuming each state has a Coxian phase-type sojourn distribution, the model assumes a standard parametric sojourn distribution (e.g. Weibull or Gamma). The computational tractability of phase-type distributions is exploited by approximating the parametric semi-Markov model by a model in which each state has a 5-phase phase-type sojourn distribution. In order to achieve this, a family of approximations to the parametric distribution, with scale parameter 1, is developed by solving a relatively large one-off optimization, assuming the optimal phase-type parameters for given shape parameters evolve as B-spline functions. The resulting phase-type approximations can then be scaled to give an approximation for any combination of scale and shape parameter, and then embedded into the overall semi-Markov process. The resulting approximate likelihood appears to be very close to the exact likelihood, both in terms of shape and magnitude.
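The scaling property that makes the one-off unit-scale optimization reusable can be checked numerically. The sketch below is a generic phase-type survival computation with made-up rates (and 3 phases rather than the paper's 5), not the paper's code:

```python
def mat_scale(A, c):
    return [[c * x for x in row] for row in A]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(A, terms=60):
    """Matrix exponential by Taylor series; adequate for the small,
    well-scaled subgenerator matrices used here."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = mat_scale(mat_mul(term, A), 1.0 / k)
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

def ph_survival(alpha, S, t):
    """P(sojourn > t) = alpha * exp(S t) * 1 for a phase-type
    distribution with initial vector alpha and subgenerator S."""
    E = mat_exp(mat_scale(S, t))
    n = len(alpha)
    return sum(alpha[i] * sum(E[i][j] for j in range(n)) for i in range(n))

# A 3-phase Coxian subgenerator (illustrative rates): off-diagonals
# are phase-progression rates, row sums are minus the exit rates.
S = [[-1.5, 1.0, 0.0],
     [0.0, -2.0, 1.2],
     [0.0, 0.0, -0.8]]
alpha = [1.0, 0.0, 0.0]   # start in phase 1

# Multiplying every rate by c rescales time by 1/c, so one family of
# unit-scale approximations serves every value of the scale parameter.
c, t = 2.5, 0.9
lhs = ph_survival(alpha, mat_scale(S, c), t)
rhs = ph_survival(alpha, S, c * t)
```

This is exactly why the approximating family only needs to be built for scale parameter 1.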

In general a 2-phase Coxian phase-type model is likely to give similar results to a Weibull or Gamma model. The main advantages of the Weibull or Gamma model are that its parameters are identifiable under the null (Markov) model, so standard likelihood ratio tests can be used to compare models (unlike for the 2-phase model), and that it requires fewer parameters, so may be useful for smaller sample sizes.

Constrained parametric model for simultaneous inference of two cumulative incidence functions


Haiwen Shi, Yu Cheng and Jong-Hyeon Jeong have a new paper in Biometrical Journal. This paper is somewhat similar in aims to the pre-print by Hudgens, Li and Fine in that it is concerned with parametric estimation in competing risks models. In particular, the focus is on building models for the cumulative incidence functions (CIFs) but ensuring that the CIFs sum to less than 1 at the asymptote as time tends to infinity. Hudgens, Li and Fine dealt with interval censored data but without covariates. Here, the data are assumed to be observed up to right-censoring but the emphasis is on simultaneously obtaining regression models directly for each CIF in a model with two risks.

The approach taken in the current paper is to assume that the CIFs sum to 1 at the asymptote, and to model the cause 1 CIF using a modified three-parameter logistic function with covariates entering via an appropriate link function. The CIF for the second competing risk is assumed to also have a three-parameter logistic form, but covariates only affect this CIF through the probability of this risk ever occurring.
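The structure of the constraint can be sketched as follows. This uses a log-logistic distribution function as a stand-in for the authors' modified three-parameter logistic, and all parameter values are illustrative, not taken from the paper:

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def loglogistic_cdf(t, scale, shape):
    """A proper distribution function: 0 at t = 0, 1 at infinity."""
    if t <= 0.0:
        return 0.0
    return 1.0 / (1.0 + (t / scale) ** (-shape))

def cif_pair(t, z, eta=0.2, theta=0.5, b1=2.0, c1=1.5, b2=3.0, c2=1.2):
    """Competing-risk CIFs constrained to sum to 1 at the asymptote.
    The covariate z moves the cause-1 asymptote p1 through a logit
    link; cause 2 receives the complementary mass 1 - p1, so
    covariates affect it only through the probability of that risk
    ever occurring (as in the model reviewed above)."""
    p1 = expit(eta + theta * z)
    F1 = p1 * loglogistic_cdf(t, b1, c1)
    F2 = (1.0 - p1) * loglogistic_cdf(t, b2, c2)
    return F1, F2
```

However extreme z is, p1 stays in (0, 1), so neither CIF can swallow the other's asymptotic mass.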

When a particular risk in a competing risks model is of primary interest, the Fine-Gray model is attractive because it makes interpretation of the covariate effects straightforward. The model of Shi et al is aimed at cases where both risks are considered important, but still seems to require that one risk be treated as primary. The main danger of the approach is that the model for the effect of covariates on the second risk may be unrealistic, yet will have an effect on the estimates for the first risk. If we only care about the first risk, the Fine-Gray model would be a safer bet. If we care about both risks, it might be wiser to choose a model based on the cause-specific hazards, which are guaranteed to induce a model with well-behaved CIFs, albeit at the expense of some interpretability of the resulting CIFs.

Obtaining a model with a direct CIF effect for each cause seems an almost impossible task because, if we allow a covariate to affect the CIF in such a way that a sufficiently extreme covariate leads to a CIF arbitrarily close to 1, it must have a knock-on effect on the other CIF. The only way around this would be to have a model that assigns maximal asymptote probabilities to the CIFs at infinity that are independent of any covariates, e.g. F_k(t | z) = p_k G_k(t; z), where the G_k are increasing functions taking values in [0,1] and p_1 + p_2 <= 1. The need to restrict the p_k to be independent of covariates would make the model quite inflexible, however.

Sunday 14 October 2012

Assessing age-at-onset risk factors with incomplete covariate current status data under proportional odds models


Chi-Chung Wen and Yi-Hau Chen have a new paper in Statistics in Medicine. This considers estimation of a proportional odds regression model for current status data in cases where a subset of the covariates may be missing at random for a subset of the patient population.

It is assumed that the probability that a portion of the covariates is missing depends on all the other observable outcomes (the failure status, the survey time and the rest of the covariate vector). The authors propose to fit a logistic regression model, involving all subjects in the dataset, for this probability of missingness. To fit the regression model for the current status data itself, they propose to use what they term a "validation likelihood estimator." This involves only working with the subset of patients with complete data but maximizing a likelihood that conditions on the fact that the whole covariate vector was observed. An advantage of using the proportional odds model over other candidate models (e.g. proportional hazards) is that the resulting likelihood remains of the proportional odds form.

Clearly a disadvantage of this "validation likelihood estimator" is that the data from subjects with incomplete covariates are not used directly in the regression model. As a result, the estimator is likely to be less efficient than approaches that effectively attempt to impute the missing covariate values. The authors argue that the validation likelihood approach will tend to be more robust, since it is not necessary to make (parametric) assumptions about the conditional distribution of the missing covariates.
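A hedged sketch of a validation-type likelihood for proportional odds current status data. The baseline specification and all variable names here are my own, not the authors':

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def validation_loglik(params, records):
    """Validation log-likelihood for proportional odds current status
    data: computed on complete cases only, but conditioning on the
    event that the covariates were observed (R = 1).  Each record is
    (c, delta, z, pi1, pi0): monitoring time c, status delta = 1 if
    T <= c, covariate z, and fitted missingness-model probabilities
    pi_d = P(R = 1 | delta = d, c, z) from a logistic regression on
    all subjects.  The baseline specification
    logit F(t | z) = a0 + a1*log(t) + b*z is hypothetical."""
    a0, a1, b = params
    ll = 0.0
    for c, delta, z, pi1, pi0 in records:
        F = expit(a0 + a1 * math.log(c) + b * z)      # P(T <= c | z)
        num = pi1 * F if delta else pi0 * (1.0 - F)
        den = pi1 * F + pi0 * (1.0 - F)               # P(R = 1 | c, z)
        ll += math.log(num / den)
    return ll
# When pi1 == pi0 the conditioning cancels and this reduces to the
# ordinary complete-case current status log-likelihood.
```

The denominator is what distinguishes the validation likelihood from naive complete-case analysis: it corrects for the fact that completeness itself depends on the observed outcome.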

Friday 12 October 2012

Nonparametric estimation of the cumulative intensities in an interval censored competing risks model


Halina Frydman and Jun Liu have a new paper in Lifetime Data Analysis. This concerns non-parametric estimation for competing risks models under interval censoring. The problem of estimating the cumulative incidence functions (or sub-distribution functions) under interval censoring has been considered by Hudgens et al (2001) and involves an extension of the NPMLE for standard survival data under interval censoring.

The resulting estimates of the cumulative incidence functions are only defined up to increments on intervals. Moreover, the intervals on which the CIFs are defined are not the same for each competing risk. This causes problems if one wants to convert the CIFs into estimates of the cumulative cause-specific hazards. Frydman and Liu propose estimating the cumulative cause-specific hazards by first constraining the NPMLEs of the CIFs to have the same intervals of support (NB: this is just a subset of the set of all NPMLEs, obtained by sub-dividing the intervals) and then adopting a convention to distribute the increment within the resulting sub-intervals (they assume an equal distribution across each sub-interval).
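Once the CIF increments sit on a common grid, the conversion to cumulative cause-specific hazards is a simple application of the identity dLambda_k(t) = dF_k(t) / S(t-). A minimal illustration of that identity (not the authors' implementation):

```python
def cumulative_cause_specific_hazards(increments):
    """Given CIF increments on a common grid of support intervals
    (one aligned list per competing risk), return the cumulative
    cause-specific hazards via dLambda_k = dF_k / S-, where S- is
    the overall survival just before the interval."""
    n = len(increments[0])
    S = 1.0                                   # overall survival so far
    cumhaz = [0.0 for _ in increments]
    out = []
    for j in range(n):
        dF = [inc[j] for inc in increments]
        for k, d in enumerate(dF):
            cumhaz[k] += d / S
        S -= sum(dF)                          # all-cause mass used up
        out.append(tuple(cumhaz))
    return out
```

With a single risk and increments [0.5, 0.5] this gives cumulative hazards 0.5 and then 0.5 + 0.5/0.5 = 1.5.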

In addition they show that an ad hoc estimator of the cumulative hazards, based on the convention that the mass of each support interval of the NPMLE for each CIF is placed at its midpoint, leads to biased results. They also show that their estimator has standard N^0.5 convergence when the support of the observation time distribution is discrete and finite.

Tuesday 2 October 2012

Effect of an event occurring over time and confounded by health status: estimation and interpretation. A study based on survival data simulations with application on breast cancer


Alexia Savignoni, David Hajage, Pascale Tubert-Bitter and Yann De Rycke have a new paper in Statistics in Medicine. This considers developing illness-death type models to investigate the effect of pregnancy on the risk of recurrence of cancer amongst breast cancer patients. The authors give a fairly clear account of different potential models, with particular reference to the hazard ratio between the pregnant and non-pregnant states. The simplest model to consider is a Cox model with a single time dependent covariate X(t) representing pregnancy, here giving a constant hazard ratio exp(beta). This can be extended by assuming non-proportional hazards, which effectively makes the effect time dependent, i.e. a hazard ratio exp(beta(t)). Alternatively, an unrestricted Cox-Markov model could be fitted with separate covariate effects and non-parametric baseline hazards for each pregnancy state, yielding a hazard ratio of the form lambda_02(t)exp(beta_2'Z) / lambda_01(t)exp(beta_1'Z). This model can be restricted by allowing a shared baseline hazard for the two states, giving either a hazard ratio exp(gamma + (beta_2 - beta_1)'Z) under a Cox model with a fixed pregnancy effect, or exp(gamma(t) + (beta_2 - beta_1)'Z) for a time dependent effect.

If we were only interested in this hazard ratio, and any of these models seems feasible, there doesn't actually seem to be much point in formulating the model as an illness-death model. Note that the transition rate into the pregnancy state does not feature in any of the above quantities, but would be estimated in the illness-death model. The above models can be fitted by a Cox model with a time dependent covariate (representing pregnancy) that has an interaction with the time-fixed covariates. The real power of a multi-state model approach would only become apparent if we were interested in the overall survival for different covariates, treating pregnancy as a random event.
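For instance, fitting such a model with standard Cox software amounts to splitting each subject's follow-up at the pregnancy time into counting-process (start, stop] rows; the sketch below uses illustrative names, not the authors' code:

```python
def counting_process_rows(entry, exit, event, preg_time=None):
    """Split one subject's follow-up at the pregnancy time so a Cox
    model with the time dependent indicator X(t) = 1{pregnant by t}
    can be fitted with standard software.  Returns rows of
    (start, stop, status, preg)."""
    if preg_time is None or preg_time >= exit:
        return [(entry, exit, event, 0)]
    # Censored at the pregnancy time in the non-pregnant row, then
    # at risk with preg = 1 until the true end of follow-up.
    return [(entry, preg_time, 0, 0), (preg_time, exit, event, 1)]

rows = counting_process_rows(0.0, 8.0, 1, preg_time=3.0)
# -> [(0.0, 3.0, 0, 0), (3.0, 8.0, 1, 1)]
```

Interactions between preg and fixed covariates then give the state-specific covariate effects described above.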

The time dependent effects are represented simply via a piecewise constant time indicator in the model; the authors do acknowledge that a spline model would have been better. Another issue that could have been considered is whether the effect of pregnancy depends on time since the start of the pregnancy (i.e. a semi-Markov effect). An issue in their data example is that pregnancy is only ascertained via a successful birth, meaning there may be some truncation in the sample (through births prevented by relapse/death).

Applying competing risks regression models: an overview

Bernhard Haller, Georg Schmidt and Kurt Ulm have a new paper in Lifetime Data Analysis. This reviews approaches to building regression models for competing risks data. In particular, they consider cause-specific hazard regression, subdistribution hazard regression (via both the Fine-Gray model and pseudo-observations), mixture models and vertical modelling. The distinction between mixture models and vertical modelling is the order of the conditioning. In mixture models, the factorization P(T, D) = P(T | D)P(D) implies a separate time-to-event model is developed for each cause of death, whereas in vertical modelling, P(T, D) = P(D | T)P(T), meaning there is an overall "all cause" model for survival with a time dependent model for the conditional risk of the different causes. Vertical modelling fits in much more closely with the standard hazard-based formulation used in classical competing risks, and Haller et al also prefer it to mixture modelling for computational reasons. The authors conclude, however, that vertical modelling's main purpose is as an exploratory tool to check modelling assumptions which may be made in a more standard competing risks model. They suggest that in a study, particularly a clinical trial, it would be more appropriate to use a Cox model either on the cause-specific hazards or on the subdistribution hazard, with the choice between the two depending on the particular research question of interest.
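The two factorizations can be made concrete on a toy discrete joint distribution of (time, cause); the numbers are entirely illustrative:

```python
# Joint distribution of (event time T, cause D) on a toy grid.
joint = {(1, 'a'): 0.2, (1, 'b'): 0.1,
         (2, 'a'): 0.3, (2, 'b'): 0.4}

pD = {}   # P(D): marginal cause probabilities (mixture's first factor)
pT = {}   # P(T): all-cause time distribution (vertical's first factor)
for (t, d), p in joint.items():
    pD[d] = pD.get(d, 0.0) + p
    pT[t] = pT.get(t, 0.0) + p

# Mixture modelling: P(T, D) = P(T | D) P(D), i.e. a separate
# time-to-event distribution per cause, mixed with weights P(D).
time_given_cause = {(t, d): p / pD[d] for (t, d), p in joint.items()}

# Vertical modelling: P(T, D) = P(D | T) P(T), i.e. an all-cause time
# model plus a time dependent conditional cause distribution.
cause_given_time = {(t, d): p / pT[t] for (t, d), p in joint.items()}
```

Both factorizations recover the same joint distribution; the modelling choice is which factor gets the regression structure.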