PubMed records were collected with Biopython by querying PubMed one day at a time from January 1, 1980 to December 31, 2025 [1]. The search query for each day was ("{date_str}"[Date - Create] : "{date_str}"[Date - Create]) AND (hasabstract) NOT preprint[filter], and up to 10,000 papers were retrieved per day. Because the full dataset contained more than 25 million papers, random subsampling was applied for downstream analyses. For the visualization analysis in Part 1, 50,000 papers were randomly sampled per year. For the temporal trend and future topic prediction analyses in Parts 2 and 3, 10,000 papers were randomly sampled per year, and cancer-related papers were then selected by matching cancer-associated terms, such as cancer, tumor, and malignancy, in the title and abstract.
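The day-by-day retrieval described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the helper names (build_daily_query, iter_days, fetch_pmids_for_day) are hypothetical, the YYYY/MM/DD date format is an assumption, and the email argument is a placeholder that NCBI asks callers to supply. Only the query template and the 10,000-record cap come from the text.

```python
from datetime import date, timedelta


def build_daily_query(day: date) -> str:
    """Return the PubMed search term for a single creation date."""
    date_str = day.strftime("%Y/%m/%d")  # assumed format; the text only shows {date_str}
    return (
        f'("{date_str}"[Date - Create] : "{date_str}"[Date - Create]) '
        "AND (hasabstract) NOT preprint[filter]"
    )


def iter_days(start: date, end: date):
    """Yield every calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def fetch_pmids_for_day(day: date, email: str, retmax: int = 10_000) -> list:
    """Return up to retmax PubMed IDs for one day (requires network access)."""
    from Bio import Entrez  # imported lazily so the helpers above work offline

    Entrez.email = email
    handle = Entrez.esearch(db="pubmed", term=build_daily_query(day), retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])
```

Looping fetch_pmids_for_day over iter_days(date(1980, 1, 1), date(2025, 12, 31)) would issue one search per day; rate limiting and retry handling are omitted here.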
To annotate papers by research area, Scopus All Science Journal Classification (ASJC) labels were assigned based on the journal in which each paper was published [2]. These labels were used as external annotations to characterize papers by broad and detailed scientific categories and to evaluate whether the embedding space preserved meaningful biological structure.
For each paper, the title and abstract were concatenated and converted into a 768-dimensional embedding using the sentence-transformers/all-mpnet-base-v2 model, which is based on MPNet and fine-tuned within the Sentence-Transformers framework [3,4]. These embeddings were used as numerical representations of the semantic content of each publication.
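A minimal sketch of the embedding step is given below. The function names, the single-space separator between title and abstract, and the batch size are assumptions made for illustration; only the model name and the 768-dimensional output come from the text.

```python
def build_input_text(title: str, abstract: str, sep: str = " ") -> str:
    """Concatenate title and abstract; the separator is an assumed choice."""
    parts = (title, abstract)
    return sep.join(p.strip() for p in parts if p and p.strip())


def embed_papers(texts, batch_size: int = 64):
    """Encode texts into 768-dimensional vectors (downloads the model on first use)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    return model.encode(texts, batch_size=batch_size, show_progress_bar=False)
```

Calling embed_papers on a list of strings produced by build_input_text would return an array with one row per paper.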
Dimensionality reduction was performed for visualization and temporal trend analysis. In Part 1, t-SNE was applied to visualize the embedding space in two dimensions and to examine whether papers formed biologically meaningful clusters [5]. In Parts 2 and 3, PCA was used instead of t-SNE because PCA better preserves the overall structure of the original embedding distribution and is therefore more appropriate for analyzing time-dependent shifts in the embedding space [6].
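The PCA projection used in Parts 2 and 3 could look like the sketch below. Because PCA is a linear map, a fitted model can place papers from every time bin in the same coordinate system, which is what makes shifts comparable across time. The function name and the random demo data are hypothetical, and the use of scikit-learn's PCA class is an assumption, since the text does not name the implementation.

```python
import numpy as np
from sklearn.decomposition import PCA


def project_to_pca(embeddings: np.ndarray, n_components: int = 2):
    """Fit PCA on the embeddings and return (coordinates, fitted model)."""
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(embeddings)
    return coords, pca


# Random vectors stand in for real 768-dimensional paper embeddings.
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(200, 768))
demo_coords, demo_pca = project_to_pca(demo_embeddings)
```

The returned model can later be applied to additional papers with demo_pca.transform, so new points land in the same PC1/PC2 space.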
For temporal trend analysis, cancer-related papers were grouped into 5-year time bins. Within the PCA space, the centroid of each time bin was defined as the point given by the median PC1 value and the median PC2 value of the papers in that bin. Temporal changes in centroid location were tracked to examine directional shifts in the semantic center of cancer-related literature over time. To further characterize spatial changes in the embedding distribution, the two-dimensional PCA space was divided into 20 grid regions. Relative paper density was calculated for each grid region within each 5-year time bin, such that the densities across all grid regions summed to 1 within a given time bin. This analysis was used to identify regions that became denser or sparser over time.
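The centroid and density calculations can be illustrated with the sketch below. It assumes a 5 x 4 layout for the 20 grid regions and grid edges spanning the full range of the data, neither of which is specified in the text. All function names and the toy coordinates are hypothetical.

```python
import numpy as np


def bin_start(year: int, first_year: int = 1980, width: int = 5) -> int:
    """Map a publication year to the first year of its 5-year bin."""
    return first_year + ((year - first_year) // width) * width


def median_centroids(coords: np.ndarray, years: np.ndarray) -> dict:
    """Return {bin start year: (median PC1, median PC2)} for each time bin."""
    bins = np.array([bin_start(int(y)) for y in years])
    return {
        int(b): tuple(np.median(coords[bins == b], axis=0))
        for b in np.unique(bins)
    }


def grid_density(coords: np.ndarray, x_edges, y_edges) -> np.ndarray:
    """Fraction of papers in each grid region; fractions sum to 1."""
    counts, _, _ = np.histogram2d(coords[:, 0], coords[:, 1], bins=[x_edges, y_edges])
    return counts / counts.sum()


# Toy PCA coordinates and publication years.
coords = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 1.0], [1.0, 3.0], [3.0, 3.0]])
years = np.array([1980, 1982, 1984, 1985, 1989])
centroids = median_centroids(coords, years)

# Shared edges for all time bins: 5 columns x 4 rows = 20 regions (assumed layout).
x_edges = np.linspace(coords[:, 0].min(), coords[:, 0].max(), 6)
y_edges = np.linspace(coords[:, 1].min(), coords[:, 1].max(), 5)
density_1980 = grid_density(coords[years < 1985], x_edges, y_edges)
```

Using the same edges for every time bin keeps the grid regions aligned, so densities can be compared region by region across bins. A time bin with no papers would need separate handling to avoid division by zero.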
To identify topics enriched in selected grid regions, abstracts of papers mapped to those grids were analyzed using BERTopic [7]. PubMedBERT was used as the embedding model within BERTopic, UMAP was used for dimensionality reduction, and class-based TF-IDF (c-TF-IDF) was used for keyword extraction. The resulting topic keywords were used to interpret the major themes represented in dense or predicted future regions of the embedding space.
To model the temporal movement of cancer-related literature in the embedding space, polynomial regression was applied to the centroid trajectory across time bins. The model was implemented in scikit-learn using make_pipeline(PolynomialFeatures(degree=2), LinearRegression()) [5]. A prediction model was first trained using data from 1980–2019 and used to predict the centroid location for 2020–2024, which was then compared with the observed centroid. A final model was subsequently trained using data from 1980–2024 to project the expected centroid location for 2025–2029 and to infer possible future research hotspots.
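The two-stage fitting procedure can be sketched as follows. The pipeline call is the one named in the text; everything else is an assumption for illustration: the time bin is encoded as an integer index (0 for 1980–1984 through 8 for 2020–2024), PC1 and PC2 are fitted jointly as a two-column target, and the centroids are synthetic values that follow an exact quadratic so that the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def fit_centroid_model(bin_index, centroids):
    """Fit a degree-2 polynomial mapping time-bin index to (PC1, PC2)."""
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    X = np.asarray(bin_index, dtype=float).reshape(-1, 1)
    model.fit(X, np.asarray(centroids, dtype=float))
    return model


# Synthetic centroids for nine bins; real values would come from the PCA space.
bin_index = np.arange(9)
synthetic = np.column_stack([0.5 * bin_index**2 - bin_index, 2.0 * bin_index + 1.0])

# Stage 1: train on 1980-2019 (bins 0-7), predict 2020-2024 (bin 8), compare.
validation_model = fit_centroid_model(bin_index[:8], synthetic[:8])
predicted_2020 = validation_model.predict([[8.0]])[0]
validation_error = float(np.linalg.norm(predicted_2020 - synthetic[8]))

# Stage 2: train on 1980-2024 (bins 0-8), project 2025-2029 (bin 9).
final_model = fit_centroid_model(bin_index, synthetic)
projected_2025 = final_model.predict([[9.0]])[0]
```

Here validation_error is the Euclidean distance between the predicted and observed centroid, one possible way to quantify the comparison described above.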
The code used for this analysis is available at: https://github.com/TaerimLee/BCH394P_Final