
## Poster Session I-1

Conference
11:30 AM — 12:00 PM KST
Local
Jul 19 Mon, 10:30 PM — 11:00 PM EDT

### GMOTE: Gaussian-based minority oversampling technique for imbalanced classification adapting tail probability of outliers

Seung Jee Yang (Hanyang University)

1
Imbalanced data substantially affect the performance of standard classification models. As a solution, oversampling methods such as the synthetic minority oversampling technique (SMOTE) have been proposed. However, because methods such as SMOTE use linear interpolation to generate synthetic instances, the synthetic data space may resemble a polygon. Furthermore, oversampling methods generate synthetic outliers in minority classes. In this paper, we propose a Gaussian-based minority oversampling technique (GMOTE) that takes a statistical perspective on imbalanced datasets. The proposed method generates instances by using a Gaussian mixture model to avoid linear interpolation and to account for outliers. Motivated by the clustering-based multivariate Gaussian outlier score, we propose handling local outliers through the tail probability of each instance, computed from the Mahalanobis distance. Experiments were conducted on a representative set of benchmark datasets, and the performance of GMOTE was compared with that of other methods. When combined with a classification and regression tree or a support vector machine, GMOTE yields better accuracy and F1-score, and the experimental results demonstrate its robust performance.
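The tail-probability idea can be illustrated with a minimal sketch. It assumes a single two-dimensional Gaussian component with known mean and covariance (the abstract works with a Gaussian mixture), in which case the squared Mahalanobis distance follows a chi-square distribution with 2 degrees of freedom and its upper tail is `exp(-d^2 / 2)`. The function names and toy numbers are hypothetical, and this is not the authors' implementation.

```python
import math

def mahalanobis_sq_2d(x, mean, cov):
    """Squared Mahalanobis distance of a 2-D point from a Gaussian component."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance must be positive definite")
    # Inverse of the 2x2 covariance matrix, written out explicitly.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    return (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))

def tail_probability_2d(x, mean, cov):
    """Upper-tail probability of the squared distance (chi-square, 2 d.o.f.)."""
    return math.exp(-0.5 * mahalanobis_sq_2d(x, mean, cov))

# Toy component (assumed values): variances 1 and 4, no correlation.
mean = (0.0, 0.0)
cov = ((1.0, 0.0), (0.0, 4.0))
center_p = tail_probability_2d((0.0, 0.0), mean, cov)  # point at the mean
far_p = tail_probability_2d((3.0, 0.0), mean, cov)     # 3 standard deviations away
```

A small tail probability marks an instance as a likely outlier of its component; how such instances are then treated during oversampling is specific to GMOTE and is not shown here.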

### Exact inference for an exponential parameter under generalized progressive type II hybrid censored competing risk data

Subin Cho (Daegu University)

1
Progressive censoring has the drawback that it might take a very long time to observe the m-th failure and complete the life test. For this reason, the generalized progressive type II censoring scheme was introduced. In addition, it is known that more than one risk factor may be present at the same time. In this paper, we discuss exact inference for a competing risks model with generalized progressive type II hybrid censored exponential data. We derive the conditional moment generating function of the maximum likelihood estimators of the scale parameters of the exponential distribution and the resulting lower confidence bound under the generalized progressive type II hybrid censoring scheme. The example data show that the PDF of the MLE is almost symmetric.
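As a point of reference, the sketch below computes the maximum likelihood estimates of the exponential scale parameters in a simpler setting than the one in the abstract: ordinary progressive type II censoring with competing risks and no hybrid time limits. In that setting the estimate for cause j is the total time on test divided by the number of failures from cause j. The function name and the toy data are hypothetical.

```python
def exponential_cr_mle(times, causes, removals):
    """Scale (mean) MLEs per cause under progressive type II censoring.

    times    -- the m observed failure times
    causes   -- cause label of each observed failure
    removals -- number of surviving units withdrawn at each failure
    """
    if not (len(times) == len(causes) == len(removals)):
        raise ValueError("inputs must have the same length")
    # Total time on test: each failure time also counts for the units removed with it.
    total_time = sum((1 + r) * t for t, r in zip(times, removals))
    counts = {}
    for cause in causes:
        counts[cause] = counts.get(cause, 0) + 1
    return {cause: total_time / d for cause, d in counts.items()}

# Toy data (assumed): 6 units on test, 3 observed failures, 3 units removed.
times = [1.0, 2.0, 4.0]
causes = [1, 2, 1]
removals = [1, 0, 2]
estimates = exponential_cr_mle(times, causes, removals)
```

A cause with no observed failures has no finite estimate, which is one reason exact inference is usually stated conditionally on the failure counts.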

### Meta-analysis methods for multiple related markers: applications to microbiome studies with the results on multiple $\alpha$-diversity indices

Hyunwook Koh (The State University of New York, Korea)

2
Meta-analysis is a practical and powerful analytic tool that enables unified statistical inference across the results from multiple studies. Notably, researchers often report results on multiple related markers in each study (e.g., various $\alpha$-diversity indices in microbiome studies). However, univariate meta-analyses are limited to combining the results on a single common marker at a time, whereas existing multivariate meta-analyses are limited to situations where marker-by-marker correlations are given in each study. Thus, here we introduce two meta-analysis methods, namely, multi-marker meta-analysis (mMeta) and adaptive multi-marker meta-analysis (aMeta), to combine multiple studies across multiple related markers with no a priori results on marker-by-marker correlations. mMeta is a statistical estimator for a pooled estimate and its standard error across all the studies and markers, whereas aMeta is a statistical test whose test statistic is the minimum p-value among marker-specific meta-analyses. mMeta conducts both effect estimation and hypothesis testing based on a weighted average of marker-specific pooled estimates while estimating marker-by-marker correlations non-parametrically via permutations, yet its power is only moderate. In contrast, aMeta closely approaches the highest power among marker-specific meta-analyses, yet it is limited to hypothesis testing. While their applications can be broader, we illustrate the use of mMeta and aMeta to combine microbiome studies across multiple $\alpha$-diversity indices. We evaluate mMeta and aMeta in silico and apply them to real microbiome studies on the disparity in $\alpha$-diversity by HIV infection status.
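The building blocks mentioned above can be sketched as follows, assuming standard inverse-variance (fixed-effect) pooling for each marker and normal-theory p-values. The marker names and numbers are made up, the function names are hypothetical, and the permutation step that mMeta and aMeta use to handle marker-by-marker correlations is not shown.

```python
import math

def pooled_estimate(estimates, std_errors):
    """Inverse-variance pooled estimate and its standard error for one marker."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    est = sum(w * e for w, e in zip(weights, estimates)) / total
    return est, math.sqrt(1.0 / total)

def two_sided_p(est, se):
    """Two-sided p-value of est / se under a standard normal reference."""
    return math.erfc(abs(est / se) / math.sqrt(2.0))

# Toy study-level results (assumed): per marker, (estimates, standard errors).
markers = {
    "richness": ([0.5, 0.3], [0.1, 0.1]),
    "shannon": ([0.0, 0.2], [0.2, 0.2]),
}
pooled = {name: pooled_estimate(e, s) for name, (e, s) in markers.items()}
p_values = {name: two_sided_p(*pooled[name]) for name in pooled}

# The minimum marker-specific p-value is the kind of statistic aMeta is built on;
# it still has to be calibrated, since it is not itself a valid p-value.
min_marker = min(p_values, key=p_values.get)
```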

### Estimation for a nonlinear regression model with non-zero mean errors and an application to a biomechanical model

Hojun You (Seoul National University)

4
We propose a modified least squares estimator for a nonlinear regression model with non-zero mean errors, motivated by a head-neck position tracking application. A nonlinear regression with multiplicative errors can be handled under the framework of the proposed method. In addition, we assume temporal dependence in the errors. We propose not only a modified least squares procedure for parameter estimation, but also a penalized least squares procedure for simultaneous parameter estimation and selection. Asymptotic properties of the proposed estimators, especially local consistency and the oracle property of the penalized least squares estimator, are established under plausible assumptions imposed on the nonlinear function, the errors, and the penalty function. A simulation study demonstrates that the proposed estimation performs well in both parameter estimation and selection with temporally correlated errors. The analysis of head-neck position tracking data and a comparison with existing methods show better performance of the proposed method in terms of the variance accounted for (VAF).

### Neural network-based clustering for ischemic stroke patients

Su Hoon Choi (Chonnam National University)

1
Finding clusters of similar stroke patients is important because it can lead to discovering new patterns and more effective ways to manage stroke. Although lifetime clustering is an important tool, it remains a relatively unexplored topic. In general, the degree of risk is classified using SPI-II, a traditional risk score used to stratify the risk of stroke recurrence. SPI-II is a verified and reliable stroke risk score. However, existing tools for predicting stroke outcome risks may have limitations because they cannot consider all possible variables. In this study, we compare several lifetime clustering methods, including the deep lifetime clustering (DLC) method, which is a neural network-based clustering model. The performance of the clustering methods was evaluated on real-world survival datasets of patients with ischemic stroke. Based on previous studies, the SPI-II scores are grouped into three groups: low, medium, and high risk. Accordingly, we fix the number of clusters at three for all methods. The metrics used to evaluate the resulting clusters are the concordance index, the Brier score, and the log-rank score. The analysis was conducted on 7,650 patients with acute ischemic stroke from a local comprehensive stroke center registry. Compared with SPI-II stroke risk scores and the other clustering methods, the DLC model performed much better on all evaluation indices. These results suggest that the DLC method may be useful for grouping stroke patients with similar outcome risks. Our study has the inherent limitation that it included only data from a single stroke center registry in the Republic of Korea. Therefore, further research with independent cohorts is likely to be required. Nevertheless, this study is the first to apply a neural network-based clustering method to real-world data on stroke patients.
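One of the evaluation metrics named above, the concordance index, can be sketched as follows. This is a minimal version of Harrell's C for right-censored data in which pairs with tied event times are skipped. The risk scores would be a per-cluster or per-patient risk ranking; the toy data are made up and the function name is hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Share of comparable pairs in which the higher risk score fails first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data (assumed): follow-up times, event indicators (0 = censored), risk scores.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
risk = [3, 2, 2, 0]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to a risk ranking that agrees with every comparable pair.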

### Principal component analysis of amplitude and phase variation in multivariate functional data

Soobin Kim (Seoul National University)

5
In many situations, multivariate functional data have both phase and amplitude variations. A common approach is to remove phase variations using a selected function alignment method and then apply functional principal component analysis (FPCA) to the aligned functions, which contain only amplitude variations. To consider both types of variations, we propose an extension of FPCA for amplitude and phase variation to multivariate cases. The original functions are decomposed into amplitude functions and warping functions, and the warping functions are transformed into square-integrable functions via a centered log-ratio transformation. Multivariate FPCA is then performed on each amplitude and phase component with data-adaptive weights to balance the variational effects. We demonstrate the usefulness of the proposed method through a real data analysis of sea climate data in Korea.
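A discretized sketch of a centered log-ratio (clr) transformation is given below. It assumes that the transformation is applied to the derivative of a warping function on [0, 1], which is positive and integrates to one, and that the functions are sampled on a uniform grid of midpoints. Both the function names and this setup are assumptions rather than the authors' exact construction.

```python
import math

def clr(density):
    """Centered log-ratio: log-density minus its average over the grid."""
    logs = [math.log(v) for v in density]
    center = sum(logs) / len(logs)
    return [value - center for value in logs]

def clr_inverse(values):
    """Map a clr-transformed function back to a density on [0, 1]."""
    expd = [math.exp(v) for v in values]
    mean = sum(expd) / len(expd)  # grid average approximates the integral over [0, 1]
    return [e / mean for e in expd]

# Toy warping function (assumed): gamma(t) = t^2, so its derivative is 2t.
n = 100
grid = [(k + 0.5) / n for k in range(n)]
warp_derivative = [2.0 * t for t in grid]

psi = clr(warp_derivative)     # unconstrained, sums to zero on the grid
recovered = clr_inverse(psi)   # back to the derivative of the warping function
```

The transformed function is no longer constrained to be positive or to integrate to one, so standard FPCA tools can be applied to it.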

### Clustering non-stationary advanced metering infrastructure data

Donghyun Kang (Chung-Ang University)

2
We propose a clustering method for advanced metering infrastructure (AMI) data in Korea. As AMI data are non-stationary, we consider time-dependent frequency domain principal components analysis and develop a new clustering method based on the time-varying eigenvectors. Our method provides a meaningful result that differs from the clustering results obtained with conventional methods, such as K-means and K-centres functional clustering. We further apply the clustering results to the evaluation of the electricity price system in South Korea, and validate the reform of the progressive electricity tariff system.


## Poster Session I-2

Conference
9:30 PM — 10:00 PM KST
Local
Jul 20 Tue, 8:30 AM — 9:00 AM EDT

### Geometrically Adapted Langevin Algorithm (GALA) for Markov Chain Monte Carlo (MCMC) simulations

Mariya Mamajiwala (University College London)

3
MCMC is a class of methods for sampling from a given probability distribution. Of its myriad variants, the one based on the simulation of Langevin dynamics, which approaches the target distribution asymptotically, has gained prominence. The dynamics is captured through a Stochastic Differential Equation (SDE), with the drift term given by the gradient of the log-likelihood function with respect to the parameters of the distribution. However, the unbounded variation of the noise (i.e., the diffusion term) tends to slow down the convergence, which limits the usefulness of the method. By recognizing that the solution of the Langevin dynamics may be interpreted as evolving on a suitably constructed Riemannian Manifold (RM), considerable improvement in the performance of the method can be realised. Specifically, based on the notion of stochastic development, a concept available in the differential geometric treatment of SDEs, we propose a geometrically adapted variant of MCMC. Unlike the standard Euclidean case, in our setting the drift term in the modified MCMC dynamics is constrained within the tangent space of an RM defined through the Fisher information metric and the related connection. We show, through extensive numerical simulations, how such a mathematically tenable geometric restriction of the flow enables significantly faster and more accurate convergence of the algorithm.
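For orientation, the standard Euclidean baseline that the abstract contrasts with can be sketched as an unadjusted Langevin algorithm, that is, an Euler-Maruyama discretization of the Langevin SDE. The sketch assumes a one-dimensional standard normal target, a fixed step size, and hypothetical function names. It is not the geometrically adapted algorithm (GALA) itself.

```python
import math
import random

def grad_log_density(x):
    """Gradient of the log-density of the assumed target, a standard normal."""
    return -x

def unadjusted_langevin(n_steps, step, x0=0.0, seed=0):
    """Simulate x <- x + step * grad log p(x) + sqrt(2 * step) * noise."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * grad_log_density(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples

samples = unadjusted_langevin(n_steps=50_000, step=0.05)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

Without a Metropolis correction the chain targets a slightly biased distribution: for this target the stationary variance is 1 / (1 - step / 2) rather than 1, so smaller steps reduce the bias at the cost of slower mixing.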

### Bayes estimation for the Weibull distribution under generalized adaptive hybrid progressive censored competing risks data

Yeongjae Seong (Daegu University)

2
Adaptive progressive hybrid censoring schemes have become quite popular in reliability and lifetime-testing studies. However, the drawback of the adaptive progressive hybrid censoring scheme is that it might take a very long time to complete the life test. For this reason, the generalized adaptive progressive hybrid censoring scheme was introduced. In this research, a competing risks model is considered under a generalized adaptive progressive hybrid censoring scheme. When the failure times are Weibull distributed, maximum likelihood estimates of the unknown model parameters are obtained, and their existence and uniqueness are shown. The asymptotic distribution of the maximum likelihood estimators is used to construct approximate confidence intervals via the observed Fisher information matrix. Moreover, Bayes point estimates and highest probability density credible intervals of the unknown parameters are presented, and the Gibbs sampling technique is used to approximate the corresponding estimates.

### Large deviations of mean-field interacting particle systems in a fast varying environment

Sarath Yasodharan (Indian Institute of Science)

2
We study large deviations of a “fully coupled” finite state mean-field interacting particle system in a fast varying environment. The empirical measure of the particles evolves in the slow time scale and the random environment evolves in the fast time scale. Our main result is the path-space large deviation principle for the joint law of the empirical measure process of the particles and the occupation measure process of the fast environment. This extends previous results known for two time scale diffusions to two time scale mean-field models with jumps. Our proof is based on the method of stochastic exponentials. We characterise the rate function by studying a certain variational problem associated with an exponential martingale.

### Stochastic homogenisation of Gaussian fields

Leandro Chiarini (Utrecht University)

3
In this poster we prove the convergence of a sequence of random fields that generalise the Gaussian Free Field and bi-Laplacian field. Such fields are defined in terms of non-homogeneous elliptic operators which will be sampled at random. Under standard assumptions of stochastic homogenisation, we identify the limit fields as the usual GFF and bi-Laplacian fields up to a multiplicative constant.

### Concentration inequality for U-statistics for uniformly ergodic Markov chains, and applications

Quentin Duchemin (Université Gustave Eiffel)

2

### A Bayesian illness-death model to approach the incidence of recurrent hip fracture and death in elderly patients

Fran Llopis-Cardona (Foundation for the Promotion of Health and Biomedical Research of Valencia Region (FISABIO))

2
Multi-state models are a wide class of stochastic process models in which individuals can move between different states over time. These models are of special interest in survival analysis because they can handle a wide range of complex scenarios. We focus on the so-called illness-death model, which includes an initial state, an illness state, and a death state, and is considered a generalization of the competing risks framework. In an illness-death scenario, competing risks models involve time to illness and time to death but do not provide evidence of the transition from illness to death. Illness-death models, however, add this transition, which makes them preferable when progression to death after a non-terminal disease is a relevant outcome. We use an illness-death model to study the evolution of patients who have suffered a hip fracture. The dataset comes from the PREV2FO cohort and includes 34,491 patients aged 65 years and older who were discharged alive after a hospitalization for an osteoporotic hip fracture and followed until a recurrent hip fracture and death. Transition times, from the initial fracture to refracture and death, and from refracture to death, are modelled via Cox proportional hazards models with Weibull baseline hazard functions. For simplicity, we adjust for the covariates sex and age at discharge. The transition from refracture to death is defined with regard to the time from initial fracture to refracture. We use a Bayesian approach to estimate the posterior distribution of the model parameters via Markov chain Monte Carlo (MCMC) methods. Based on this distribution, we estimate posterior distributions for the cumulative incidences of refracture and death, as well as transition probabilities, which include the event-free probability, the probability of remaining in the refracture state, and the probability of death after refracture. We also estimate cause-specific hazard ratios to assess the effect of covariates on each transition.
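The transition model described above can be sketched for a single transition. The sketch assumes the common parameterization of a Weibull baseline hazard, `scale * shape * t^(shape - 1)`, multiplied by `exp(beta . x)`. The abstract does not state the parameterization, and the function names and parameter values are hypothetical.

```python
import math

def linear_predictor(beta, x):
    """Inner product of regression coefficients and covariates."""
    return sum(b * v for b, v in zip(beta, x))

def weibull_ph_hazard(t, shape, scale, beta, x):
    """Hazard at time t > 0: Weibull baseline times exp(beta . x)."""
    return scale * shape * t ** (shape - 1.0) * math.exp(linear_predictor(beta, x))

def weibull_ph_survival(t, shape, scale, beta, x):
    """Probability of not having made the transition by time t."""
    cumulative_hazard = scale * t ** shape * math.exp(linear_predictor(beta, x))
    return math.exp(-cumulative_hazard)

# Assumed toy values: covariates are (indicator, centred age); beta is illustrative only.
beta = [math.log(2.0), 0.05]
hazard_ratio = (weibull_ph_hazard(1.5, 1.3, 0.1, beta, [1.0, 0.0])
                / weibull_ph_hazard(1.5, 1.3, 0.1, beta, [0.0, 0.0]))
```

Under proportional hazards the ratio does not depend on time, so `hazard_ratio` equals `exp(beta[0])` whatever the shape and scale. In a Bayesian fit, such quantities would be evaluated at each posterior draw of the parameters.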

### The contact process with two types of particles and priority: metastability and convergence in infinite volume

Mariela Pentón Machado (Instituto de Matemática e Estatística, Universidade de São Paulo)

2
We consider a symmetric finite-range contact process on Z with two types of particles (or infections), which propagate according to the same supercritical rate and die (or heal) at rate 1. Particles of type 1 can occupy any site in $(-\infty,0]$ that is empty or occupied by a particle of type 2 and, analogously, particles of type 2 can occupy any site in $[1,+\infty)$ that is empty or occupied by a particle of type 1. We prove that this system exhibits two metastable states: one with the two species and the other one with the family that survives the competition. In addition, we study the convergence of the process when it is defined in infinite volume.

### A nonparametric instrumental approach to endogeneity in competing risks models

Jad Beyhum (ORSTAT, Katholieke Universiteit Leuven)

3
This paper discusses endogenous treatment models with duration outcomes, competing risks and random right censoring. The endogeneity issue is solved using a discrete instrumental variable. We show that the competing risks model generates a non-parametric quantile instrumental regression problem. The cause-specific cumulative incidence, the cause-specific hazard and the subdistribution hazard can be recovered from the regression function. A distinguishing feature of the model is that censoring and competing risks prevent identification at some quantiles. We characterize the set of quantiles for which exact identification is possible and give partial identification results for other quantiles. We outline an estimation procedure and discuss its properties. The finite sample performance of the estimator is evaluated through simulations. We apply the proposed method to the Health Insurance Plan of Greater New York experiment.