6.4 Node Selection

Figure 1. The default selection criterion is based on the g-index. When k=1, it reduces to the standard g-index.

CiteSpace provides several ways to sample records to form the final networks. These criteria are known as node selection criteria.

Since version 4.0.R3, the default selection is based on the notion of the g-index. The g-index was introduced by Leo Egghe in 2006 to remedy some of the weaknesses of the popular h-index. Unlike the h-index, the g-index takes into account the number of citations of an author's most important publications. Specifically, the g-index is the largest number g such that the g most highly cited publications have, on average, at least g citations each (equivalently, at least g² citations in total). The g-index is always at least as large as the h-index, and usually larger. Because it admits more items, g is more suitable than h as a way to select nodes from each time slice. To make it even more flexible, we modify the g-index by introducing a scaling factor k. The k can be any positive number, so the user can control the overall size of the resultant network to meet their particular needs.
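The following Python sketch illustrates the scaled g-index described above. It assumes the scaling factor enters the condition as g² ≤ k × (sum of the citations of the g most cited items), which reduces to the standard g-index when k=1. The function name and this exact form of the condition are illustrative assumptions, not CiteSpace's own code.

```python
def scaled_g_index(citations, k=1.0):
    """Largest g with g^2 <= k * (sum of the g highest citation counts).

    Assumption: k scales the cumulative citation total. With k=1 this is
    the standard g-index, capped at the number of items supplied.
    """
    ranked = sorted(citations, reverse=True)
    running_total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        running_total += count
        if rank * rank <= k * running_total:
            g = rank
    return g


# Hypothetical citation counts for the items in one time slice.
slice_citations = [10, 8, 5, 4, 3, 2, 1]
print(scaled_g_index(slice_citations))         # k = 1, standard g-index
print(scaled_g_index(slice_citations, k=2))    # larger k admits more nodes
print(scaled_g_index(slice_citations, k=0.5))  # smaller k admits fewer
```

Under this assumption, raising k selects more nodes from a slice and lowering k selects fewer, which is how k controls the size of the network.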

Figure 2. The g and k for each slice are reported, which gives you an idea of how to adjust the scaling factor k.

Figure 3. The visualized network based on the g-index selection (k=5). Cluster labels are based on the keywords of citing papers selected by log-likelihood ratios (LLR).

The next option is to use Top N per slice to select nodes. If you enter a value of 50, CiteSpace will select the 50 most cited or most frequently occurring items from each slice to construct a network, depending on the node types you selected in the previous step. If you selected multiple node types, these nodes will be ranked by the number of times they appear in the records for each slice.
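As a rough illustration of Top N per slice, the Python sketch below counts how often each item occurs in each slice and keeps the N most frequent ones. The input format (pairs of slice label and item) and the alphabetical tie-breaking rule are assumptions made for this example; CiteSpace may handle ties differently.

```python
from collections import Counter


def top_n_per_slice(records, n):
    """Return {slice_label: [the n most frequent items in that slice]}.

    records: iterable of (slice_label, item) pairs (hypothetical format).
    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    counts = {}
    for slice_label, item in records:
        counts.setdefault(slice_label, Counter())[item] += 1

    selected = {}
    for slice_label, counter in counts.items():
        ranked = sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))
        selected[slice_label] = [item for item, _ in ranked[:n]]
    return selected


# Hypothetical cited references observed in two one-year slices.
records = [
    (2001, "a"), (2001, "a"), (2001, "b"),
    (2001, "c"), (2001, "c"), (2001, "c"),
    (2002, "a"), (2002, "d"), (2002, "d"),
]
print(top_n_per_slice(records, 2))
```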

The third selection method is Top N% per slice. For example, you can select the top 15% most cited items or most frequent items per slice. You can also select the entire dataset by specifying top 100% (as long as you raise the upper limit value high enough, say, 10,000 per slice).
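The interaction between the percentage and the upper limit can be sketched in Python as follows. The function name is hypothetical, and rounding the requested count up is an assumption of this example rather than documented CiteSpace behavior.

```python
import math


def top_percent_count(items_in_slice, percent, upper_limit):
    """Number of items kept from one slice under Top N% with an upper limit.

    Assumption: the requested count is rounded up, then capped.
    """
    requested = math.ceil(items_in_slice * percent / 100)
    return min(requested, upper_limit)


# A hypothetical slice with 2,000 distinct items.
print(top_percent_count(2000, 15, 100))      # the cap of 100 applies
print(top_percent_count(2000, 15, 10000))    # the full 15% is kept
print(top_percent_count(2000, 100, 10000))   # the entire slice is kept
```

This is why selecting the entire dataset with top 100% only works when the upper limit is at least as large as the number of items in the largest slice.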

The fourth method is Threshold Interpolation. It selects both nodes and links, and it is complex. I recommend that you explore the other selection criteria before this one.

The fifth option, Select Citers, must be used together with one of three other methods: Top N, Top N%, or Threshold Interpolation. With Select Citers, you select records based on the distribution of their citations by specifying an interval of that distribution. For example, an interval of [5, max] will include records that have 5 or more citations. After this selection, you need to choose which of the three methods to apply to the selected records.
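The interval filter in Select Citers can be pictured with the short Python sketch below. The record structure and the field name times_cited are hypothetical, and passing None as the upper bound stands in for "max".

```python
def select_citers(records, low, high=None):
    """Keep records whose citation count lies in the interval [low, high].

    high=None means no upper bound, i.e. the interval [low, max].
    The 'times_cited' field name is an assumption of this sketch.
    """
    return [
        record for record in records
        if record["times_cited"] >= low
        and (high is None or record["times_cited"] <= high)
    ]


# Hypothetical citing records.
citers = [
    {"title": "Paper A", "times_cited": 12},
    {"title": "Paper B", "times_cited": 4},
    {"title": "Paper C", "times_cited": 5},
]
kept = select_citers(citers, low=5)  # the interval [5, max]
print([record["title"] for record in kept])
```

The filtered records would then be passed to Top N, Top N%, or Threshold Interpolation as usual.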

The size of a visualized network influences the clarity and complexity of the patterns we can learn from the visualization, and the structure of a network depends on the number of nodes selected for each time slice. It is unlikely that we will know in advance whether a Top N of 100 will generate a more desirable network than a Top N of 50.

Here are some suggestions:

First, begin with the g-index default and generate a network visualization. Then check the modularity of the network, the number of clusters, and the average silhouette scores. We won’t learn much from the network if there are only a couple of clusters. We won’t get a big picture if there are hundreds of clusters either. A good target is about 7~10 major clusters, each with 10 or more members and a high silhouette value (e.g. > 0.70).
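The rule of thumb above can be written down as a small check in Python. This is only a sketch: the cluster records, their field names, and the function names are hypothetical, and the sizes and silhouette values would have to be copied from CiteSpace's cluster summary by hand.

```python
def major_clusters(clusters, min_size=10, min_silhouette=0.70):
    """Clusters with at least min_size members and silhouette above the cutoff."""
    return [
        cluster for cluster in clusters
        if cluster["size"] >= min_size and cluster["silhouette"] > min_silhouette
    ]


def in_suggested_range(clusters, fewest=7, most=10):
    """True if the number of major clusters falls within the suggested 7~10."""
    return fewest <= len(major_clusters(clusters)) <= most


# Hypothetical cluster summary: eight solid clusters plus two weak ones.
summary = [{"size": 25, "silhouette": 0.85} for _ in range(8)]
summary.append({"size": 4, "silhouette": 0.95})    # too few members
summary.append({"size": 30, "silhouette": 0.55})   # silhouette too low
print(len(major_clusters(summary)), in_suggested_range(summary))
```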

You can then try a higher k or a larger Top N, e.g. aiming at ~100 nodes per slice. If your computer is powerful enough, you can certainly try ~1,000 nodes per slice or even more.

You should start from a small network (although if you include many slices, even a Top N of 50 can accumulate to a large network), and then enlarge it based on your initial assessment.

Finally, note that the largest network is not necessarily the most informative one. Make clear the questions you want to answer first.

From Version 4.0.R4, CiteSpace provides a revised parameter, Link Retaining Factor, in the project properties to give the user more control over link selection. This parameter is particularly useful when you face a hairball of a highly connected network.

As illustrated in the following figure, the combination of a link retaining factor of 2 and a look back years value of 8 gives the network with the highest clarity for both the g-index and the Top N node selection options. The g-index is recommended because it tends to generate a network with fewer small clusters than the Top N option.