Principal axis analysis (PAA) is a geometrically motivated, computationally fast projection pursuit method ([k]). In this work package, (Obj. 1) techniques rooted in invariant coordinate selection (ICS) will be brought to bear on the pressing statistical science problem of analysing high dimensional, low sample size (HDLSS) data, while (Obj. 2) exploring the promise of further challenging extensions of ICS, including the use of group symmetry ideas.
The operational tool to be developed is a new methodology for HDLSS data analysis, exploiting the complementary sphered and unsphered forms of PAA. Based on fundamental results in distance geometry and on (a variant of) ICS, and implemented via singular value decompositions, it will be a theoretically well-grounded, computationally efficient addition to the data analyst's armoury.
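To fix ideas, the following minimal sketch (Python/NumPy; the helper name and tolerance are illustrative assumptions, and the PAA projection index itself, specified in [k] and [l], is not reproduced here) shows the single thin SVD from which both score representations follow – the source of the computational efficiency when p >> N:

```python
import numpy as np

def paa_scores(X, tol=1e-10):
    """Complementary PCA score representations of an N x p data matrix
    (hypothetical helper; PAA's projection index from [k], [l] would then
    be applied to these scores and is not reproduced here)."""
    N = X.shape[0]
    Xc = X - X.mean(axis=0)                  # column-centre the data
    # One thin SVD suffices: O(N^2 p) when p >> N, fast in the HDLSS setting.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))          # number of nontrivial components
    unsphered = U[:, :r] * s[:r]             # PC scores, scale retained
    sphered = U[:, :r] * np.sqrt(N)          # whitened scores: covariance I_r
    return unsphered, sphered
```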
HDLSS data (p >> N) – cf. k >> N in (de facto) discrete data contexts (WP1) – are ‘a dramatically increasing feature of the practical environment’ [9]. The field is developing rapidly, stimulated by the major INI programme [9] held in 2008; FC’s paper [l] was presented later that year. Most recently, see (his published discussion of) the RSS Read Paper [23].
Objective 1: Recent geometric contributions include [6], which generalises [24]. In particular, [l] extends PAA to handle HDLSS data, uniquely combining fundamental distance geometry ([m]) with (a variant of) ICS – sphered PAA being a special case of ICS, while unsphered PAA is not. However, this methodology remains experimental; the aim of this work item is to develop it into an operational tool. Given the vast datasets involved in practice, our primary focus is on speed of computation.
The starting point will be the preparatory actions already taken. Theory shows that, for HDLSS data, affine independence of the observed data vectors – as holds almost surely for continuous data – implies that the sphered nontrivial PCA score vectors always form a regular simplex. Further theoretical investigation, exploiting the linear, invertible map between squared distances and centred inner products whose spectral decomposition was obtained in [m], leads directly to an explicit (conveniently, linear) discriminant function for the fundamental two-class problem – and, indeed, to an operational version of it even when class memberships are unknown. Proof-of-concept simulation studies validate these findings. In particular, extending a study in [24], the p-asymptotic error rate 0.15 of our new discriminant function is appreciably lower than the previous best rate of 0.20, attained by, among others, the support vector machine.
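By way of illustration, the following self-contained check (Python/NumPy) verifies the regular simplex property numerically; the classical double-centering relation B = -(1/2) J D J between squared distances and centred inner products stands in here for the map analysed in [m]:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10, 500                                    # HDLSS setting: p >> N
X = rng.standard_normal((N, p))                   # continuous data, hence
                                                  # affinely independent a.s.

# Squared-distance matrix D and the double-centering map B = -(1/2) J D J,
# recovering the centred inner-product (Gram) matrix from D.
D = np.square(X[:, None, :] - X[None, :, :]).sum(axis=-1)
J = np.eye(N) - np.ones((N, N)) / N
B = -0.5 * J @ D @ J                              # equals Xc @ Xc.T

# Sphered nontrivial PCA score vectors (up to overall scale): the
# eigenvectors of B with nonzero eigenvalue.
w, V = np.linalg.eigh(B)
Z = V[:, w > 1e-8 * w.max()]                      # N x (N-1) sphered scores

# Regular simplex check: all pairwise distances between the N rows of Z
# are equal.
DZ = np.square(Z[:, None, :] - Z[None, :, :]).sum(axis=-1)
off = DZ[~np.eye(N, dtype=bool)]
print(np.allclose(off, off[0]))                   # True
```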
These fruitful developments will be carried forward via refinement, extension and benchmarking of this experimental methodology. In particular, PAA for HDLSS data will be compared, in both theory and practice, with existing methodologies. The anticipated outcome of this work is a major paper, directed at the HDLSS research community, combining theory, simulation and real-world examples.
Objective 2: Two blue-sky lines of enquiry, in particular, show exceptional promise:
The implications of these advances remain to be fully tested in practice. Further challenging questions include: What are the limits to the applicability of ICS? Can ICS be extended to functional data? Is a nonlinear form of ICS possible? Finally, the key transformation underlying ICS is defined by a pair of affine equivariant shape functionals (V1, V2), while the relevant group symmetry result holds for any such pair (see the sketch below for one standard choice): what light does this throw on the open question of choosing an optimal (V1, V2) for a specified statistical purpose?
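To make the (V1, V2) eigenproblem concrete, here is a minimal sketch (Python/NumPy) of the ICS transformation for one standard scatter pair – the sample covariance together with a fourth-moment scatter, as in FOBI. It is purely illustrative of the common structure, not the optimal pair the question asks after:

```python
import numpy as np

def ics_fobi(X):
    """Invariant coordinates from the scatter pair (V1, V2) with
    V1 = sample covariance and V2 = fourth-moment scatter (the FOBI
    choice).  Any other pair of affine equivariant scatter functionals
    could be substituted: the eigenproblem for V1^{-1} V2 is the same."""
    N, p = X.shape
    Xc = X - X.mean(axis=0)
    V1 = Xc.T @ Xc / N                            # first scatter: covariance
    L = np.linalg.cholesky(V1)
    Y = Xc @ np.linalg.inv(L).T                   # V1-standardised data
    r2 = np.sum(Y ** 2, axis=1)                   # squared Mahalanobis norms
    V2 = (Y * r2[:, None]).T @ Y / (N * (p + 2))  # fourth-moment scatter of Y
    # Eigenproblem for V1^{-1} V2, solved symmetrically in standardised
    # coordinates; eigenvalues are generalised kurtoses (ascending).
    kurt, W = np.linalg.eigh(V2)
    return Y @ W, kurt                            # invariant coordinate scores
```

Interest then typically centres on the coordinates with extreme generalised kurtoses, where ICS is known to reveal group structure and outliers.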