GEPHEX Publications

2023

Ferdjaoui A, Affeldt S, and Nadif M. WordGraph: a python package for reconstructing interactive causal graphical models from text data. The 17th ACM International Conference on Web Search and Data Mining. (WSDM'24, accepted) - (WSDM|pdf)

We present WordGraph, a Python package for exploring the topics of document corpora. WordGraph reconstructs causal graphical models from the vocabulary of text data and offers interactive visualizations of term networks. The easy-to-use package ships with a pre-built pipeline that exposes the main modules through Jupyter widgets. As a result, the whole vocabulary exploration process fits within a single Jupyter notebook cell, with straightforward parameter settings and interactive plots. The WordGraph pipeline is fully customizable: widgets can be added or removed, and default parameters can be changed. To assist users who have no background in Python or Jupyter notebooks but want to explore the topics of large corpora, we also propose the automatic generation of a web-application-style dashboard from the customizable Jupyter notebook pipeline (a demonstration video of the package is provided here). WordGraph is available through a GitHub repository.

Keywords Co-clustering · Causal network reconstruction · Text data

Falissard L, Affeldt S, and Nadif M. Attentive perturbation: Extending prefix tuning to large language models inner representations. The 9th International Conference on Machine Learning, Optimization and Data Science. (LOD'23) - (LOD|pdf)

From adapters to prefix-tuning, parameter efficient fine-tuning (PEFT) has been a well investigated research field in the past few years, which has led to an entire family of alternative approaches for large language model fine-tuning. All these methods rely on the fundamental idea of introducing additional learnable parameters to the model, while freezing all pre-trained representations during training. This fine-tuning process is generally done through refitting all model parameters to the new, supervised objective function. This process, however, still requires a considerable amount of computing power, which might not be readily available to everyone. In addition, even with the use of transfer learning, this method requires substantial amounts of data. In this article, we propose a novel and fairly straightforward extension of the prefix-tuning approach to modify both the model’s attention weight and its internal representations. Our proposal introduces a “token-tuning” method relying on soft lookup based embeddings derived using attention mechanisms. We call this efficient extension “attentive perturbation”, and empirically show that it outperforms other PEFT methods on most natural language understanding tasks in the few-shot learning setting. 

Keywords Large language models · Parameter efficient fine-tuning · Adapters · Prefix-tuning · Natural language processing · Natural Language Understanding
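To make the prefix-tuning idea mentioned in the abstract more concrete, here is a minimal pure-Python sketch of standard prefix-tuning for a single attention query: the pre-trained keys and values stay frozen, and a few extra (learnable) prefix key/value vectors are prepended to them. This illustrates plain prefix-tuning only, not the attentive perturbation method of the paper; the function names `attend` and `attend_with_prefix` and the toy vectors are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


def attend_with_prefix(query, keys, values, prefix_keys, prefix_values):
    """Prefix-tuning: prepend trainable key/value vectors to the frozen ones.

    Only prefix_keys and prefix_values would be updated during fine-tuning;
    keys and values come from the frozen pre-trained model.
    """
    return attend(query, prefix_keys + keys, prefix_values + values)


# Toy example with 2 frozen tokens and 1 prefix vector (hypothetical values).
frozen_keys = [[1.0, 0.0], [0.0, 1.0]]
frozen_values = [[1.0, 0.0], [0.0, 1.0]]
prefix_k = [[0.0, 0.0]]
prefix_v = [[5.0, 5.0]]

baseline, _ = attend([1.0, 0.0], frozen_keys, frozen_values)
tuned, tuned_weights = attend_with_prefix(
    [1.0, 0.0], frozen_keys, frozen_values, prefix_k, prefix_v
)
print(baseline, tuned, tuned_weights)
```

The prefix changes the attention output without touching any frozen parameter, which is what keeps the number of trainable parameters small.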

2022

Ferdjaoui A, Tlati A, Affeldt S, and Nadif M. CORPEX : Analyse exploratoire d'un corpus biomédical à l'aide de la classification croisée. Extraction et Gestion des Connaissances. (EGC'23) - (EGC|pdf)

We propose an interface that supports corpus analysis through interactive visualizations of co-clusters, so that users can explore the topics of a set of texts. The user can create or load a corpus of documents, clean it, and study the terms and the documents simultaneously. This article details the functionalities related to the dynamic generation of corpora, especially in a biomedical context, as well as the loading of document-term matrices for already pre-processed corpora. Analyzing the corpus by co-clustering (cross-classification) and jointly visualizing the terms and documents according to the co-partitioning are effective tools for quickly understanding the topics in a corpus. Results are saved automatically, which makes it easy to rerun different co-clustering analyses and obtain cross-views of the topics at different levels of granularity.

Keywords Co-clustering · Web interface · Regularization · Information Retrieval · Biomedical text mining

2021

Affeldt S, Labiod L, and Nadif M. Regularized bi-directional co-clustering. Statistics and Computing, 31(3), 1-17. (2021) - (STCO | pdf)

The simultaneous clustering of documents and words, known as co-clustering, has proven to be more efficient than one-sided clustering for dealing with sparse and high-dimensional datasets. Text data are also generally unbalanced and directional in essence. Recently, the von Mises-Fisher (vMF) mixture model was proposed to deal with unbalanced data while taking advantage of the directional nature of text. In this paper, we propose a general co-clustering framework based on a matrix formulation of vMF model-based co-clustering. This formulation leads to a flexible framework for text co-clustering that can easily incorporate both the word-word semantic relationships and the document-document similarities. In contrast with existing methods, which generally rely on an additive incorporation of similarities, we propose a bi-directional multiplicative regularization that better captures the underlying text data structure. Extensive evaluations on various real-world text datasets demonstrate the performance of the proposed approach over baseline and competitive methods, both in terms of clustering results and co-cluster topic coherence.

Keywords Co-clustering · Regularization · Information Retrieval · Text mining
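As a rough illustration of what a multiplicative, bi-directional incorporation of similarities can look like, the sketch below smooths a document-term matrix on both sides, by a document-document similarity matrix on the left and a word-word similarity matrix on the right, and then projects each document onto the unit sphere, in line with the directional view of text. The product form `doc_sim x doc_term x word_sim`, the function names and the toy matrices are assumptions made for illustration; this is not the RBDCo objective from the paper.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def l2_normalize_rows(matrix):
    """Scale each row to unit Euclidean norm (all-zero rows stay zero)."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm if norm > 0 else 0.0 for x in row])
    return normalized


def regularize(doc_term, doc_sim, word_sim):
    """Assumed multiplicative form: doc_sim @ doc_term @ word_sim, then row-normalize."""
    smoothed = matmul(matmul(doc_sim, doc_term), word_sim)
    return l2_normalize_rows(smoothed)


# Toy data (hypothetical): 2 documents, 3 words; words 0 and 1 are semantically close.
doc_term = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
doc_sim = [[1.0, 0.0], [0.0, 1.0]]
word_sim = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(regularize(doc_term, doc_sim, word_sim))
```

In this toy example the first document never contains word 1, yet it receives a non-zero weight for it after regularization because word 1 is similar to word 0. An additive scheme would instead add a separate similarity term to the objective rather than reshaping the data matrix itself.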

Existing co-clustering algorithms generally rely on the input document-term matrix. While some of them also consider word-word semantic correlations, they fail to consider side information arising from both word-word semantic correlations and document-document similarities. To fill this gap, we propose Regularized Bi-directional Co-clustering (RBDCo), based on an appropriate matrix formulation. The figures below demonstrate the coherence of the word clusters obtained from the RBDCo co-clusters.

Hay Fever disease. 

Migraine disease. 

Otitis disease. 

AMD disease. 


Results are based on the PubMed5 dataset, which consists of approximately 12,500 biomedical abstracts downloaded from the Medline database. The abstracts cover 5 diseases and were published between 2000 and 2008.


Each document is originally labeled with the corresponding disease, namely Age-related Macular Degeneration (AMD), Otitis, Kidney Stones, Hay Fever and Migraine. The color of the vertices reflects the word association score, with warmer colors corresponding to higher scores, and the thickness of the edges represents the strength of the association.

Kidney Calculi disease. 


Affeldt S, Labiod L, and Nadif M. Approche ensemble pour le co-clustering par blocs sur des données textuelles: Application au biomédical. Extraction et Gestion des Connaissances: Actes EGC'2021. (2021) - (EGC | pdf)

We propose a block co-clustering based on an ensemble approach that fuses multiple basic co-clusterings into a consensus structured affinity matrix. The basic co-clusterings are derived from the same text data and generated by the same co-clustering method. This fusion process reinforces the individual quality of the block co-clusterings within a single consensus matrix. Our approach enables a completely unsupervised co-clustering, in which the number of co-clusters is automatically inferred from a non-trivial generalized modularity criterion. The associated objective function allows the joint learning of the aggregation of the basic co-clusterings and of the consensus co-clustering. Experimental results on several real-world datasets demonstrate the value of our approach compared with competitive co-clustering methods.

Keywords Co-clustering · Ensemble method · Information Retrieval · Text mining

Affeldt S, Labiod L, and Nadif M. Regularized Dual-PPMI Co-clustering for Text Data. SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. (2021) - (SIGIR | pdf)

Keywords Co-clustering · Regularization · Information Retrieval · Text mining

2020

Affeldt S, Labiod L, and Nadif M. Ensemble Block Co-clustering: a Unified Framework for Text Data. 29th ACM International Conference on Information and Knowledge Management, CIKM (2020) - (ACM | pdf)


In this paper, we propose a unified framework for Ensemble Block Co-clustering (EBCO), which aims to fuse multiple basic co-clusterings into a consensus structured affinity matrix. Each co-clustering to be fused is obtained by applying a co-clustering method to the same document-term dataset. This fusion process reinforces the individual quality of the multiple basic co-clusterings within a single consensus matrix. Moreover, the proposed framework enables a completely unsupervised co-clustering in which the number of co-clusters is automatically inferred from the non-trivial generalized modularity. We first define an explicit objective function that allows the joint learning of the aggregation of the basic co-clusterings and of the consensus block co-clustering. Then, we show that EBCO generalizes one-sided ensemble clustering to an ensemble block co-clustering context. We also establish theoretical equivalence to spectral co-clustering and weighted double spherical k-means clustering for textual data. Experimental results on various real-world document-term datasets demonstrate that EBCO is an efficient competitor to some state-of-the-art ensemble and co-clustering methods.

Keywords Co-clustering · Ensemble method · Information Retrieval · Text mining
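The idea of fusing several basic partitions into one consensus affinity matrix can be sketched with the classical one-sided co-association matrix, where each entry is the fraction of basic partitions that place two documents in the same cluster. This is the simpler one-sided ensemble setting that the abstract says EBCO generalizes, not the EBCO objective itself; the function name `co_association` and the toy partitions are assumptions made for illustration.

```python
def co_association(partitions):
    """Build a consensus affinity matrix from several partitions of the same items.

    partitions: list of label lists, one per basic clustering, all of equal length.
    Entry (i, j) is the fraction of partitions in which items i and j share a cluster.
    """
    n_items = len(partitions[0])
    n_partitions = len(partitions)
    counts = [[0] * n_items for _ in range(n_items)]
    for labels in partitions:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_partitions for c in row] for row in counts]


# Three basic clusterings of four documents (hypothetical labels).
# Label values are arbitrary: only co-membership matters.
basic_partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]

for row in co_association(basic_partitions):
    print(row)
```

Because the matrix depends only on co-membership, partitions with permuted cluster labels contribute identically, and pairs that are grouped together consistently across runs receive the highest affinity.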



We propose a novel Ensemble Block Co-clustering (EBCO) framework in which the input is a collection of document-term matrix co-clusterings. The output of the framework is a consensus block co-clustering.


EBCO Framework


EBCO proposes relevant distributions of document topics in its co-clusters. The adjacent picture summarizes the topic distribution for EBCO co-partitions with a number of co-clusters ranging from 10 (top) to 7 (bottom). Pie charts give the percentage of documents for each disease associated with the EBCO co-clusters.


As can be seen, several co-clusters are stable and keep a clear predominant topic when the number of co-clusters changes, such as AMD (gray pie charts), Otitis (blue pie charts), Migraine (light green pie charts without Raynaud Disease) and Hay Fever (brown pie charts). Other co-clusters provide interesting biomedical indications of actual relationships between diseases.

The PubMed10 dataset consists of approximately 15,000 biomedical abstracts downloaded from the Medline database. The abstracts cover 10 diseases and were published between 2000 and 2008. Each document is originally labeled with the corresponding disease.
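The per-co-cluster percentages shown in the pie charts can be derived from the document labels with a simple count. The sketch below is a minimal illustration under assumed inputs: `cluster_ids` holds the co-cluster assigned to each document, `disease_labels` holds its original disease label, and both the function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def topic_percentages(cluster_ids, disease_labels):
    """Return, for each co-cluster, the percentage of its documents per disease label."""
    counts = defaultdict(Counter)
    for cluster, disease in zip(cluster_ids, disease_labels):
        counts[cluster][disease] += 1
    percentages = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        percentages[cluster] = {
            disease: 100.0 * n / total for disease, n in counter.items()
        }
    return percentages


# Toy assignment of five documents to two co-clusters (hypothetical data).
cluster_ids = [0, 0, 0, 1, 1]
disease_labels = ["AMD", "AMD", "Otitis", "Migraine", "Migraine"]

for cluster, shares in topic_percentages(cluster_ids, disease_labels).items():
    print(cluster, shares)
```

A co-cluster with one dominant share, like the second one here, corresponds to the stable single-topic co-clusters described above, while mixed shares point to co-clusters that group documents from several diseases.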