Liqun Wang's methodology research is mainly in the areas of applied probability and statistical inference for complex data analysis. The following are some specific research areas and topics.
This research concerns the statistical distribution of the time at which a random process first passes a threshold, for example, the time when the population of an endangered species reaches a certain critical level, or the time when the number of individuals infected with a disease reaches a limit. The first passage time, or boundary crossing probability (BCP), of diffusion processes has been studied intensively for decades in many disciplines, such as biology, economics, engineering, epidemiology, physics, seismology and statistics. It also plays an important role in modern finance, for example in credit risk modeling and barrier option pricing. However, the computation of the BCP is a long-standing and challenging problem, since explicit analytic solutions exist only for linear and very few nonlinear boundaries.
For nonlinear boundaries, the mainstream research has traditionally been based on the Kolmogorov partial differential equation for the transition probability density, and focuses on approximate solutions of certain integral or differential equations for the first passage time (FPT) density. These methods usually apply only to one-sided, smooth boundaries, and the accuracy of the numerical approximation is difficult, if at all possible, to assess. In many real problems, however, two-sided and discontinuous boundaries arise, for example in the pricing of barrier options and other financial derivatives in quantitative finance.
In contrast, Wang and Pötzelberger (1997) use a novel method to derive an explicit integral representation of the BCP for Brownian motion crossing any piecewise linear boundary. This formula is then used to approximate the BCP for general nonlinear boundaries. The approach was subsequently extended to two-sided boundary crossing problems by Pötzelberger and Wang (2001), where an approximation error rate was also derived. They also construct an optimal partition of the time interval, which significantly improves the accuracy of the piecewise linear approximation of the BCP for a nonlinear boundary.
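The piecewise linear case lends itself to a simple conditional Monte Carlo implementation: simulate the Brownian motion at the partition points only, and multiply the explicit Brownian-bridge non-crossing probabilities over the subintervals. The sketch below is a minimal illustration of this idea for a one-sided upper boundary; the function name and the constant-boundary check are mine, not taken from the papers.

```python
import numpy as np

def bcp_piecewise_linear(c, t, n_sims=100_000, seed=0):
    """Monte Carlo estimate of P(W(s) < c(s) for all s in [0, t[-1]]),
    where W is standard Brownian motion and c is the piecewise linear
    boundary with values c[i] at the partition points t[i], t[0] = 0."""
    rng = np.random.default_rng(seed)
    c, t = np.asarray(c, float), np.asarray(t, float)
    dt = np.diff(t)
    # simulate W at the partition points only
    steps = rng.normal(scale=np.sqrt(dt), size=(n_sims, dt.size))
    W = np.concatenate([np.zeros((n_sims, 1)), np.cumsum(steps, axis=1)], axis=1)
    d = c - W                        # gaps to the boundary at the nodes
    inside = np.all(d > 0, axis=1)   # path must stay below the boundary at every node
    # Brownian-bridge non-crossing probability on each subinterval:
    # 1 - exp(-2 * d_{i-1} * d_i / (t_i - t_{i-1}))
    p = np.where(inside[:, None],
                 1.0 - np.exp(-2.0 * d[:, :-1] * d[:, 1:] / dt), 0.0)
    return float(np.prod(p, axis=1).mean())
```

For a constant boundary the piecewise linear representation is exact, so the estimate can be checked against the classical reflection-principle formula P(max W(s) < c on [0, T]) = 2Φ(c/√T) − 1.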
Later, Wang and Pötzelberger (2007) further extend this approach to a class of diffusion processes that can be expressed as piecewise monotone (not necessarily one-to-one) functionals of Brownian motion. This class contains many interesting processes arising in real applications, e.g., Ornstein-Uhlenbeck processes, growth processes and geometric Brownian motion with time-dependent drift. Moreover, this approach yields explicit BCP formulas for certain nonlinear boundaries, which are useful for evaluating and comparing computational methods and algorithms for boundary crossing problems.
Recently, Jin and Wang (2015) use this approach to derive the boundary crossing density of Brownian motion for curved, possibly discontinuous boundaries.
The problem of measurement error (ME) arises when a regression analysis involves predictor variables that either cannot be measured directly (latent variables) or are measured with substantial error (imprecise measurements). Examples of such variables include long-term systolic blood pressure, cholesterol level, drug concentration in a patient's blood, exposure to air pollutants or radioactive substances, social ability and family wealth. It is well known that statistical methods that ignore the ME and simply use the indirect observations produce biased and inconsistent estimates.
Overcoming the ME problem usually requires extra information beyond the main sample data, such as validation data, replicate observations or instrumental data. Another challenge in nonlinear inference with ME is that the objective function to be minimized or maximized typically involves multiple integrals with no closed form, so the entailed numerical optimization is difficult or intractable. ME models are widely used in biostatistics to analyze data from epidemiology and the environmental, medical and health sciences. They are also called errors-in-variables models in econometrics, and latent variable models in psychology and other social sciences.
A fundamental question in statistical inference for ME models is model identifiability, because it directly relates to the existence of consistent estimators of the unknown parameters. The problem arises from the well-known fact that a linear ME model with normal covariates and errors is not identifiable, and therefore cannot be consistently estimated without extra information or restrictive assumptions. Closely related theoretical questions are what minimal extra information or weakest assumptions suffice for model identifiability, and how to construct consistent estimators from the available information.
Wang (2003, 2004) shows that, in general, a nonlinear model with Berkson-type ME is identifiable without extra information or assumptions. Moreover, these papers construct root-n consistent estimators based on the first two conditional moments of the response variable given the observed predictor variables. This approach is further developed by Wang (2007) into a unified framework for the estimation of nonlinear mixed-effects models and Berkson ME models. Historically, these two classes of models have very different origins and are therefore treated separately in the literature.
For the case of classical ME, Wang and Hsiao (1995, 1996, 2011) show that model identifiability can be achieved if sufficiently many instrumental variables (IVs) are available. They also construct root-n consistent estimators based on the first two conditional moments of the response and the observed predictors given the IVs. More recently, this approach has been extended to ME problems in generalized linear models and mixed-effects models by Abarin and Wang (2012), Li and Wang (2012), and Xu, Ma and Wang (2015).
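Both the bias from ignoring classical ME and the instrumental variable remedy can be seen in a toy simulated linear model (a schematic illustration only, not the estimators of the cited papers): the naive slope is attenuated toward zero by the reliability ratio Var(x)/(Var(x)+Var(u)), while the IV estimator remains consistent.

```python
import numpy as np

# Toy model: y = 2x + eps, where the true predictor x is observed only
# through the error-contaminated proxy w = x + u; z is an instrument
# correlated with x but independent of u and eps.
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                      # unobserved predictor, Var(x) = 1
w = x + rng.normal(size=n)                  # proxy with ME, Var(u) = 1
z = x + rng.normal(size=n)                  # instrument with independent noise
y = 2.0 * x + rng.normal(scale=0.5, size=n)

naive = np.sum(w * y) / np.sum(w * w)       # OLS of y on w: attenuated
iv = np.sum(z * y) / np.sum(z * w)          # IV estimator: consistent
# plim(naive) = 2 * Var(x)/(Var(x)+Var(u)) = 1, while plim(iv) = 2
```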
The autoregressive conditional heteroscedasticity (ARCH) model and its various generalizations have been widely used to analyze economic and financial data, such as GDP, inflation, stock prices and interest rates. There is also a large body of empirical studies of agricultural and industrial commodity prices using GARCH models. However, it is well documented in the literature that many economic variables, including GDP, inflation and commodity prices, are imprecisely measured. Although the measurement error problem has been studied extensively in econometrics and statistics, it had not been investigated in GARCH models with mismeasured response processes.
Salamh and Wang (2015) is the first attempt in the literature to address this problem. In contrast to models with covariate measurement error, they show that all model parameters are identifiable from the observed proxy process alone, so no extra information is needed. Moreover, they propose a set of moment conditions that are sufficient for identifiability and can be used to construct generalized method of moments (GMM) estimators of the unknown parameters. They also investigate the impact of measurement error on parameter estimation and show that it induces bias in the naive maximum likelihood estimator.
For nonlinear measurement error models, Wang (2003, 2004) derives consistent estimators by simultaneously minimizing the distances of the response variable and its square from their conditional means given the observed predictors. This method is a natural extension of the ordinary least squares method, in which only the first-order distance is minimized. Later, Wang and Leblanc (2008) show that, in a general nonlinear regression model, this second-order least squares estimator (SLSE) is asymptotically more efficient than the ordinary LSE if the model error has a nonzero third moment, and that the two estimators have the same asymptotic variance otherwise.
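In an error-free nonlinear regression, the second-order least squares idea can be sketched by stacking the first- and second-order residuals and minimizing their squared distances jointly. The sketch below is a minimal illustration with identity weighting; the function name is mine, and it omits the optimal weighting matrix and the ME corrections of the cited papers.

```python
import numpy as np
from scipy.optimize import least_squares

def sls_estimate(x, y, f, theta0):
    """Second-order least squares for y = f(x; beta) + eps with E(eps) = 0
    and Var(eps) = sigma2.  theta = (beta..., sigma2).  The residuals
    y - E(y|x) and y^2 - E(y^2|x) are stacked, so both squared distances
    are minimized simultaneously (identity weighting for simplicity)."""
    def resid(theta):
        beta, sigma2 = theta[:-1], theta[-1]
        m1 = f(x, beta)          # E(y | x)
        m2 = m1**2 + sigma2      # E(y^2 | x)
        return np.concatenate([y - m1, y**2 - m2])
    return least_squares(resid, theta0).x
```

Note that sigma2 is estimated jointly with the regression parameters; for symmetric model errors the SLSE and the ordinary LSE have the same asymptotic variance, consistent with Wang and Leblanc (2008).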
This method has also been used to obtain consistent estimators in censored linear models by Abarin and Wang (2009), linear mixed models by Li and Wang (2013), dynamic panel data models by Salamh and Wang (2015a), and autoregressive conditionally heteroscedastic (ARCH) models by Salamh and Wang (2015b).
A theoretical and practical challenge in the estimation of nonlinear or generalized linear models with measurement error or random effects is that the likelihood or other estimating functions involve distributions of the unobserved variables, which have to be integrated out. Consequently, the estimating functions contain multiple integrals that usually do not admit closed forms. This causes numerical difficulties and sometimes makes the computation and optimization infeasible. Wang (2004) proposes a simulation-based method to overcome this numerical problem in estimating nonlinear measurement error models. In particular, by using a "simulation by parts" technique, a consistent estimator is obtained even with a finite number of simulated points. This is in contrast to most other simulation-based methods in the literature, which require the number of simulation points to increase to infinity for the estimators to be consistent.
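The key idea can be seen on a toy example: a squared simulated average is biased for the square of an expectation, because the simulation noise does not cancel, while the product of two independent simulated averages is unbiased for any fixed number of draws. This is a schematic illustration of the principle, not the estimator of Wang (2004).

```python
import numpy as np

# Target: (E[g(Z)])^2 with g(z) = z and Z ~ N(0,1), so the truth is 0.
rng = np.random.default_rng(42)
R, S = 200_000, 5            # R replications, S simulated points each

m1 = rng.normal(size=(R, S)).mean(axis=1)   # one simulated average per replication
m2 = rng.normal(size=(R, S)).mean(axis=1)   # an independent second stream

biased = (m1**2).mean()      # E[m1^2] = 0 + Var(g)/S = 1/S = 0.2, biased for fixed S
unbiased = (m1 * m2).mean()  # E[m1*m2] = (E[g])^2 = 0, unbiased for any fixed S
```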
This simulation-based method was later used to obtain root-n consistent estimators in nonlinear mixed-effects models for longitudinal data (Wang 2007), nonlinear semiparametric models with classical measurement error (Wang and Hsiao 2011), generalized linear models with measurement error (Abarin and Wang 2012), and generalized linear mixed models with measurement error (Li and Wang 2012b). More recently, Li and Wang (2016) extend this approach to deal with the missing data problem in generalized linear mixed models.
Modern research in science and engineering often involves numerical optimization and integration of high-dimensional functions. In statistics, moreover, complex data structures require highly sophisticated statistical modeling and inference procedures. Consequently, practical and efficient computational methods and algorithms are crucial. The task is particularly challenging when the problem is high-dimensional.
Fu and Wang (2002) develop a discretization-based multivariate sampling algorithm, which is fairly efficient at generating large samples of independent points from relatively high-dimensional distributions without knowing the normalizing constant. This method overcomes many typical drawbacks of Markov chain Monte Carlo (MCMC) methods, such as problems associated with ill-shaped or disconnected sample spaces. The algorithm is applied to a large class of global optimization problems in engineering design and reliability assessment by Wang, Shan and Wang (2004) and Wang, Wang and Shan (2005), and to Bayesian finite mixture modeling of genetic and environmental data by Wang and Fu (2007) and Xue et al. (2005).
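A generic discretization-based sampler in this spirit (my own schematic, not the Fu and Wang (2002) algorithm itself) partitions a bounding box into cells, weights each cell by the unnormalized density at its centre, and then draws independent points cell by cell:

```python
import numpy as np

def grid_sampler(logf, lows, highs, bins, n, seed=0):
    """Draw n independent points approximately distributed as the
    unnormalized density exp(logf) on the box [lows, highs]: discretize
    into bins^d cells, pick cells with probability proportional to the
    density at the cell centre, then sample uniformly within each cell."""
    rng = np.random.default_rng(seed)
    lows, highs = np.asarray(lows, float), np.asarray(highs, float)
    d = lows.size
    edges = [np.linspace(lows[k], highs[k], bins + 1) for k in range(d)]
    centres = [0.5 * (e[:-1] + e[1:]) for e in edges]
    grid = np.stack(np.meshgrid(*centres, indexing="ij"), axis=-1).reshape(-1, d)
    logw = logf(grid)
    w = np.exp(logw - logw.max())            # numerically stable unnormalized weights
    idx = rng.choice(grid.shape[0], size=n, p=w / w.sum())
    width = (highs - lows) / bins
    return grid[idx] + rng.uniform(-0.5, 0.5, size=(n, d)) * width
```

Because the draws are independent, no burn-in or convergence diagnostics are needed, which is one of the advantages over MCMC noted above; the price is that a naive grid grows exponentially with the dimension, so the practical algorithm requires a smarter discretization.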