Figure 1. Two-dimensional visualization of PubMed paper embeddings with biological context.
Titles and abstracts of PubMed papers were encoded into 768-dimensional embedding vectors using a pretrained MPNet model and projected into two dimensions by t-SNE. Each point represents one paper. Colors indicate external labels used for annotation in each panel. (A) t-SNE projection of all papers colored by high-level Scopus ASJC domains. (B) t-SNE projection of biomedical papers colored by disease-related categories. (C) t-SNE projection of cancer-related papers colored by cancer type or organ of interest. (D) t-SNE projection of breast cancer papers colored by omics modality, including genomics, transcriptomics, and proteomics.
At the broadest level, papers showed clear organization by high-level scientific domain (Figure 1A). Biomedical Science occupied the largest portion of the embedding space, while Chemistry, Engineering, and Physics formed distinguishable regions, indicating that the embeddings captured broad disciplinary context from text. We next examined whether this structure was preserved at finer biological levels. Within the biomedical literature, papers grouped by disease-related categories such as infectious diseases, cancer, and endocrinology occupied relatively distinct regions (Figure 1B), suggesting that the embedding space retained disease-level semantic information. A similar pattern was observed within cancer-related papers. Papers annotated by cancer type or organ, including breast, skin, liver, lung, colon, and pancreatic cancer, formed localized groupings (Figure 1C), indicating that the embeddings captured more specific cancer-related context. Even within breast cancer papers, partial separation was observed by omics modality, including genomics, transcriptomics, and proteomics (Figure 1D).
Together, these results show that MPNet-based embeddings preserve hierarchical biological context across multiple levels, from broad scientific domains to disease categories, cancer types, and omics-level distinctions.
Figure 2. PCA-based visualization of temporal shifts in cancer-related PubMed paper embeddings.
Papers were grouped into 5-year time bins. Each point represents one paper. Crosses indicate the temporal center of each time bin, defined by the median PC1 and PC2 values. (A) PCA projection of cancer-related papers across 5-year time bins with temporal centers overlaid. (B) Magnified view of the temporal centers shown in panel A. (C) PCA projection of papers with a 4 × 5 grid overlaid on the PC space for density analysis.
To investigate temporal changes in cancer research topics, we focused on cancer-related papers and projected their MPNet-based embeddings into a two-dimensional space using PCA. Unlike t-SNE, which emphasizes local structure at the cost of distorting global relationships, PCA preserves the overall distribution of the data and is therefore more suitable for tracking time-dependent shifts in the embedding space. Cancer-related papers were grouped into 5-year time bins, and the temporal center of each group was defined by the median PC1 and PC2 values. When these centers were overlaid on the PCA projection, a gradual directional shift was observed across time (Figure 2A). This pattern became more apparent in the magnified view of the centroid region (Figure 2B), suggesting that the semantic center of cancer-related literature shifted gradually over time. To further examine how the distribution changed across the PC space, the two-dimensional projection was divided into a 4 × 5 grid, and papers from each time bin were visualized separately within this framework (Figure 2C). The grids were labeled sequentially from Grid 1 to Grid 20, starting from the upper-left corner and proceeding to the lower-right corner. Across time, the overall distribution showed a gradual shift toward the lower-left region of the PC space.
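The binning, temporal-center, and grid-labeling steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study. It assumes PCA computed by SVD on mean-centered embeddings, equal-width grid cells spanning the observed PC range, and a layout of 5 rows by 4 columns (the text gives only "4 × 5", so the orientation is a guess). All function names and the random stand-in data are hypothetical.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the first two principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T

def time_bin_start(years, first_year=1980, width=5):
    """Map each publication year to the first year of its bin (1987 -> 1985)."""
    years = np.asarray(years)
    return first_year + ((years - first_year) // width) * width

def temporal_centers(pcs, bins):
    """Median PC1/PC2 per time bin, as {bin_start: array([pc1, pc2])}."""
    return {int(b): np.median(pcs[bins == b], axis=0) for b in np.unique(bins)}

def grid_labels(pcs, n_rows=5, n_cols=4):
    """Label points 1..n_rows*n_cols, row by row from upper-left to lower-right."""
    x, y = pcs[:, 0], pcs[:, 1]
    x_edges = np.linspace(x.min(), x.max(), n_cols + 1)
    y_edges = np.linspace(y.min(), y.max(), n_rows + 1)
    # Use interior edges only, so points on the outer boundary stay in range.
    col = np.digitize(x, x_edges[1:-1])              # 0 = leftmost column
    row_from_bottom = np.digitize(y, y_edges[1:-1])  # 0 = bottom row
    row = (n_rows - 1) - row_from_bottom             # 0 = top row
    return row * n_cols + col + 1

# Random stand-in data; real input would be the 768-dimensional paper embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(400, 768))
years = rng.integers(1980, 2020, size=400)

pcs = pca_2d(embeddings)
bins = time_bin_start(years)
centers = temporal_centers(pcs, bins)
grids = grid_labels(pcs)
```

Using the median rather than the mean for each temporal center follows the definition given in the text and makes the center less sensitive to outlying papers.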
Figure 3. Temporal changes in grid density and topic enrichment within the PCA embedding space of cancer-related papers.
(A) Temporal changes in relative density for each grid from 1980–1984 to 2015–2019. (B) BERTopic-based topic analysis of papers mapped to Grid 9, showing representative keyword-defined topics and their normalized frequencies over time.
To quantify temporal changes in the spatial distribution of cancer-related literature, we calculated the relative density of papers within each of the 20 grid regions for every 5-year time bin (Figure 3A). Some regions displayed a continuous decrease in relative density, such as Grid 8, whereas others showed non-monotonic behavior, such as Grid 5, which increased initially and then declined. In contrast, Grid 9 exhibited the most consistent and pronounced increase across time, indicating that this region represented a research area that attracted progressively greater attention from 1980 to 2019. To characterize the themes enriched in this region, we analyzed the abstracts of papers mapped to Grid 9 using BERTopic (Figure 3B). Among the identified topics, we highlighted three representative groups: (i) therapy resistance-related terms, including miRNA, COX-2, inhibitors, estrogen, therapy, and resistance; (ii) epigenetics-related terms, including epigenetic, methylation, histone, azacytidine, and modifications; and (iii) three-dimensional culture-related terms, including 3D, spheroids, organoids, culture, and platform. Notably, the latter two topics became particularly prominent around the 2010s.
These results suggest that the grid-density approach can identify regions of the embedding space associated with emerging or increasingly active research areas, and that topic modeling of dense grids provides interpretable biological themes underlying those temporal shifts.
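The grid-density calculation can be illustrated with a short sketch. It assumes that "relative density" means the fraction of a time bin's papers that fall into each grid, and it ranks grids by the least-squares slope of that fraction over time. Both choices are assumptions made for illustration, the function names are hypothetical, and the toy input uses only two grids.

```python
import numpy as np

def relative_density(grids, bins, n_grids=20):
    """Return (sorted bin starts, [n_bins x n_grids] matrix of per-bin paper fractions)."""
    grids = np.asarray(grids)
    bins = np.asarray(bins)
    bin_starts = np.unique(bins)
    density = np.zeros((len(bin_starts), n_grids))
    for i, b in enumerate(bin_starts):
        # Grid labels are 1-based, so shift to 0-based before counting.
        counts = np.bincount(grids[bins == b] - 1, minlength=n_grids)
        density[i] = counts / counts.sum()
    return bin_starts, density

def density_slopes(bin_starts, density):
    """Least-squares slope of relative density against time (per year) for each grid."""
    t = (bin_starts - bin_starts[0]).astype(float)
    return np.polyfit(t, density, deg=1)[0]

# Toy input: grid 2 gains share in every successive bin.
toy_grids = np.array([1, 1, 2, 1, 2, 2, 2, 2, 2])
toy_bins = np.array([1980] * 3 + [1985] * 3 + [1990] * 3)

bin_starts, density = relative_density(toy_grids, toy_bins, n_grids=2)
slopes = density_slopes(bin_starts, density)
fastest_growing_grid = int(np.argmax(slopes)) + 1
```

A linear slope is only one way to summarize a trend. It would not distinguish a steady rise from the rise-then-decline pattern described for Grid 5, so inspecting the full density curves remains necessary.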
Figure 4. Prediction of the future centroid location in the PCA embedding space.
(A) PCA projection of training papers (1980–2019) and test papers (2020–2024), with the centroid trajectory of the training set shown in blue. The predicted centroid for 2020–2024 is shown in red, and the actual centroid calculated from the 2020–2024 papers is shown in green. Ellipses indicate covariance-based dispersion of papers within each time bin and the estimated future distribution. (B) Magnified view of the centroid region shown in panel A.
To evaluate whether the temporal shift in cancer-related literature was predictable, we trained a polynomial regression model on the centroid trajectory derived from papers published between 1980 and 2019. Using the historical positions of the centroids in PCA space, we predicted the centroid location for papers published in 2020–2024 and compared it with the actual centroid derived from the papers published during that period. The predicted centroid was located close to the actual 2020–2024 centroid in the PCA space (Figure 4A). This agreement was also evident in the magnified view of the centroid region (Figure 4B), where the predicted and observed points appeared highly similar in both direction and magnitude of shift. In addition, the covariance-based ellipses provided a visual summary of the dispersion pattern within each time bin and the estimated future distribution.
These results suggest that the time-dependent semantic shift of cancer-related literature follows a sufficiently regular trajectory to allow short-term prediction of its future position in the embedding space.
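As an illustration of the extrapolation step, the sketch below fits a polynomial to each PC coordinate of the temporal centers as a function of bin index and evaluates it one bin ahead. The polynomial degree (2 here), the use of bin index as the predictor, and the two-standard-deviation ellipse centered on the mean are all assumptions, since the text does not specify them. Function names and the toy trajectory are hypothetical.

```python
import numpy as np

def predict_next_center(centers, degree=2):
    """Fit PC1(t) and PC2(t) over bin index t and extrapolate one bin ahead."""
    centers = np.asarray(centers, dtype=float)
    t = np.arange(len(centers))
    coefs = np.polyfit(t, centers, deg=degree)    # shape (degree + 1, 2)
    t_next = len(centers)
    powers = t_next ** np.arange(degree, -1, -1)  # [t^degree, ..., t, 1]
    return powers @ coefs

def covariance_ellipse(points, n_std=2.0):
    """Return (center, width, height, angle_deg) of an n_std covariance ellipse."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(points, rowvar=False))
    major = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order
    angle = np.degrees(np.arctan2(major[1], major[0]))
    width, height = 2 * n_std * np.sqrt(eigvals[::-1])
    return center, width, height, angle

# Toy trajectory of eight temporal centers lying on an exact quadratic path.
t = np.arange(8)
toy_centers = np.column_stack([0.1 * t, -0.05 * t**2])
predicted = predict_next_center(toy_centers)  # close to (0.8, -3.2)

toy_points = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -0.5], [0.0, 0.5]])
center, width, height, angle = covariance_ellipse(toy_points)
```

With only eight historical centers, higher polynomial degrees can fit the training trajectory closely yet extrapolate poorly, so the degree is worth checking against a held-out bin, as was done with the 2020–2024 papers.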
Figure 5. Prediction of future semantic shift and characterization of topics enriched in the predicted region.
(A) PCA projection of cancer-related papers published from 1980 to 2024, with the centroid trajectory shown in blue. The predicted centroid and ellipse for 2025–2029 are shown in orange. (B) BERTopic-based topic analysis of papers mapped to Grid 14, showing representative topics and their normalized frequencies over time.
When the model was extended using papers from 1980 to 2024, the centroid of cancer-related papers was predicted to move toward the lower-right region of the PCA space (Figure 5A). Given this projection and the earlier enrichment of Grid 9, we treated the neighboring grid to its lower right, Grid 14, as a likely region of future concentration and examined it to identify the associated topics (Figure 5B). Among the identified topics, we focused on three representative groups: (i) anticancer drug-related terms; (ii) drug delivery-related terms; and (iii) computational and statistical analysis terms. Notably, Grid 14 contained not only treatment-related keywords but also themes related to precise drug delivery and statistical simulation.
These results suggest that future research hotspots may expand beyond therapeutic agents themselves to include selective delivery strategies and computational approaches.