# Infinite Divisibility and Compound Poisson Law: Related Count Data Models and High-Dimensional Variable Selection

【Author】 张慧铭 (Zhang Huiming)

【Advisor】 李波 (Li Bo)

【Author information】 Central China Normal University (华中师范大学), Mathematical Statistics, 2016, Master's degree

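【Abstract (translated from the Chinese)】 Using tools such as generating functions, this thesis studies the probabilistic properties, statistical inference and numerical computation of the discrete compound Poisson distribution (DCP distribution for short), gives a fairly comprehensive survey of DCP distributions and related regression models, and in particular examines penalized estimation for count data regression. The DCP distributions considered here are defined through a specific form of their generating function. Feller's well-known characterization states that the discrete compound Poisson distributions are exactly the discrete infinitely divisible distributions, which can be viewed as a special case of the Lévy-Khintchine characterization of infinitely divisible distributions. In particular, when the parameters $\{\alpha_i\}_{i=1}^{\infty}$ of that generating function are allowed to take negative values and their sum converges absolutely, the distribution is called a pseudo discrete compound Poisson distribution; it inherits some of the properties of the DCP distribution. Chapter 1 introduces the main tools of the thesis (generating functions and the Fourier transform) and completes Feller's proof of the characterization of discrete infinite divisibility; it gives a brief introduction to the Lasso and other high-dimensional variable selection methods; and it introduces the Bayesian Lasso, discusses the case of infinitely divisible priors, and proposes using a suitable zero-inflated distribution as the prior to obtain sparse estimates with few nonzero coefficients. Chapter 2 discusses characterizations of the DCP distribution (process), lists ten different proofs of its probability mass function in the appendix, and catalogues more than a hundred special cases or subfamilies of the DCP distribution from the literature. It uses the Stein-Chen method and the operator semigroup method to derive upper bounds on the total variation distance between a sum of independent discrete random variables and the corresponding DCP distribution, and also obtains a triangular array approximation of the DCP distribution. Chapter 3 discusses statistics of the DCP distribution, parameter estimation, the FFT algorithm, and the discrete Kolmogorov-Smirnov test. Chapter 4 studies several statistical applications based on the DCP distribution: 1) using the cumulant estimation and Fourier transform estimation of Chapter 3, we fit DCP distributions to two actuarial claim data sets that exhibit zero inflation and overdispersion; 2) we prove that every discrete distribution whose probability at zero exceeds 0.5 is a pseudo discrete compound Poisson distribution, and from this, using the zero-inflation property of the pseudo DCP distribution together with the technique of adding virtual frequencies, we obtain a method for fitting arbitrary discrete distributions and compare the discrete K-S test with the chi-squared test; 3) we examine generalized linear models for count data based on the DCP distribution and use penalized estimation to select important regressors. In particular, we obtain necessary and sufficient conditions (analogous to the Karush-Kuhn-Tucker conditions) for the Elastic net estimates of negative binomial regression coefficients to be nonzero (zero). Negative binomial regressions based on maximum likelihood, the Lasso penalty and the Elastic net penalty are then fitted to hunting spider count data and compared; 4) we describe the discrete frailty model and the cure rate model (a long-term survivor model with competing causes) derived from special cases of the DCP distribution; 5) we look ahead to the problem of approximating discrete distributions by mixed Poisson distributions. Because the choice of mixing coefficients is infinite-dimensional and complex, estimating the coefficients of the mixture becomes a high-dimensional problem.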
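
The FFT computation of the DCP probability mass function mentioned for Chapter 3 can be illustrated with a minimal sketch. It assumes the standard DCP parametrization, with generating function $G(z) = \exp\{\sum_{i \ge 1} \lambda\alpha_i (z^i - 1)\}$, $\lambda > 0$, $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$, truncated to finitely many $\alpha_i$; the formula is not reproduced in this record, so that form is an assumption. The function names `dcp_pmf_fft` and `dcp_pmf_recursion` and the grid size `n_points` are illustrative and not taken from the thesis.

```python
import numpy as np


def dcp_pmf_fft(rates, n_points=256):
    """Approximate DCP pmf on 0..n_points-1 by inverting the pgf with the FFT.

    rates[i-1] holds the product lambda * alpha_i for i = 1..r (assumed
    parametrization). Mass beyond n_points wraps around (aliasing), so
    n_points must be large enough for the tail to be negligible.
    """
    rates = np.asarray(rates, dtype=float)
    if len(rates) >= n_points:
        raise ValueError("n_points must exceed the number of rates")
    c = np.zeros(n_points)
    c[1:len(rates) + 1] = rates          # c[i] = lambda * alpha_i, c[0] = 0
    lam = rates.sum()                    # lambda, since the alpha_i sum to 1
    # fft(c)[k] = sum_i c[i] * z_k**i with z_k = exp(-2j*pi*k/n_points),
    # so exp(fft(c) - lam) is the pgf evaluated at the roots of unity.
    pgf_on_circle = np.exp(np.fft.fft(c) - lam)
    return np.fft.ifft(pgf_on_circle).real


def dcp_pmf_recursion(rates, n_points=256):
    """Same pmf via the recursion n*p[n] = sum_i i*c[i]*p[n-i], for comparison."""
    rates = np.asarray(rates, dtype=float)
    p = np.zeros(n_points)
    p[0] = np.exp(-rates.sum())
    for n in range(1, n_points):
        i = np.arange(1, min(n, len(rates)) + 1)
        p[n] = np.sum(i * rates[i - 1] * p[n - i]) / n
    return p


# Illustrative parameters: lambda*alpha_1 = 1.2, lambda*alpha_2 = 0.5, lambda*alpha_3 = 0.3
example_rates = [1.2, 0.5, 0.3]
pmf_fft = dcp_pmf_fft(example_rates)
pmf_rec = dcp_pmf_recursion(example_rates)
print(np.max(np.abs(pmf_fft - pmf_rec)))
```

The two routes should agree to rounding error as long as `n_points` is large enough that the mass beyond the grid is negligible; otherwise the FFT result wraps the tail around onto small counts.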

【Abstract】 In this master's thesis, we explore the probability theory, statistical inference and numerical computation of the discrete compound Poisson (DCP) distribution. We give a comprehensive literature review of DCP distributions and their applications in related statistical models for count data, and in particular we discuss penalized generalized linear models for count data regression. The DCP distributions considered here are defined through a specific form of their probability generating function. Feller's famous characterization states that a discrete distribution is compound Poisson if and only if it is discrete infinitely divisible; this is a special case of the Lévy-Khintchine formula. When the parameters $\{\alpha_i\}_{i=1}^{\infty}$ of the generating function may take negative values and their sum is absolutely convergent, the distribution is called a pseudo discrete compound Poisson distribution. In the first chapter, we introduce the main tools (the probability generating function and the Fourier transform) as preliminaries and repair the flawed proof of Feller's characterization; we then give a short introduction to the Lasso and its generalizations for variable selection. We close the chapter with infinitely divisible prior distributions in the Bayesian Lasso, and we envisage an appropriate zero-inflated distribution as a prior that yields sparse estimates of the nonzero coefficients. Chapter II discusses characterizations of the DCP distribution (process); ten proofs of its probability mass function are given in the Appendix, and over a hundred special cases or subfamilies of the DCP distribution are listed in a table with references. We use the Stein-Chen method and the operator semigroup method to obtain upper bounds on the total variation distance between a sum of independent discrete random variables and a related discrete compound Poisson random variable, and we use row sums of a random triangular array to approximate the discrete compound Poisson distribution.
Chapter III studies statistics, parameter estimation and FFT computation of the DCP probability mass function. Chapter IV first applies cumulant estimation and Fourier transform estimation to actuarial claim data with zero inflation and overdispersion, and then compares the Kolmogorov-Smirnov test with the chi-squared test. We give a theorem that a discrete distribution is a pseudo discrete compound Poisson distribution if its probability of zero is larger than its probability of being nonzero, that is, larger than 0.5. Furthermore, using this zero-inflation property of the pseudo discrete compound Poisson distribution together with a technique of adding virtual frequencies, we obtain an algorithm to fit any discrete distribution. Chapter IV also discusses count data GLMs related to the DCP distribution and uses penalized estimation to select important regression variables. In particular, we consider the Elastic net estimates of negative binomial regression and give a necessary and sufficient condition (similar to the Karush-Kuhn-Tucker conditions) for nonzero (zero) coefficient estimates. We analyze a real example, a spider count data set, by negative binomial regression fitted with maximum likelihood, the Lasso penalty and the Elastic net penalty. Next, we set out the survival functions in the discrete frailty model and cure rate models (long-term survivor models with competing causes), which are derived from some DCP distributions. In the last section, we look forward to future work on using mixed Poisson distributions to approximate any discrete distribution, and we state the problem of variable selection among mixture components: owing to the complexity of the mixture, this becomes a high-dimensional problem.
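
The pseudo compound Poisson statement can be checked numerically on a small case. The sketch below assumes the standard parametrization $G(z) = \exp\{\sum_{i \ge 1} \lambda\alpha_i (z^i - 1)\}$ and recovers the products $\lambda\alpha_i$ from a given probability mass function by expanding $\log G(z)$ as a power series. It is not the virtual-frequency fitting algorithm of the thesis, and the name `pseudo_dcp_rates` is hypothetical.

```python
import numpy as np


def pseudo_dcp_rates(pmf, n_terms):
    """Return c[i] = lambda * alpha_i, i = 1..n_terms, for a pmf with pmf[0] > 0.5.

    Inverts the recursion n*p[n] = sum_{i=1}^{n} i*c[i]*p[n-i], which follows
    from differentiating G(z) = exp(sum_i c[i] * (z**i - 1)). Negative c[i]
    indicate a pseudo (not genuine) compound Poisson distribution.
    """
    pmf = np.asarray(pmf, dtype=float)
    if pmf[0] <= 0.5:
        raise ValueError("this sketch only covers P(X = 0) > 0.5")
    p = np.zeros(n_terms + 1)
    m = min(len(pmf), n_terms + 1)
    p[:m] = pmf[:m]                      # zero-pad or truncate the pmf
    c = np.zeros(n_terms + 1)
    for n in range(1, n_terms + 1):
        i = np.arange(1, n)              # previously computed coefficients
        c[n] = (n * p[n] - np.sum(i * c[i] * p[n - i])) / (n * p[0])
    return c[1:]


# Bernoulli example with P(X = 0) = 0.7 and P(X = 1) = 0.3
bernoulli_rates = pseudo_dcp_rates([0.7, 0.3], n_terms=60)
print(bernoulli_rates[:4])                       # signs alternate
print(bernoulli_rates.sum(), -np.log(0.7))       # both approximate lambda
```

For this Bernoulli case the coefficients have the closed form $(-1)^{k+1}(3/7)^k / k$: they alternate in sign and are absolutely summable, so the distribution is pseudo compound Poisson but not compound Poisson, consistent with the fact that a Bernoulli variable is not infinitely divisible.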

【Key words】