CMSC 191: Computational Social Network Analysis
Constructing Network Data
This topic presents the technical and ethical foundations of constructing computational social network data. The process of transforming relational or behavioral datasets into graph structures is discussed, emphasizing how data source selection, sampling strategy, and preprocessing determine the reliability of subsequent analysis. Core techniques for data wrangling using Pandas are described as essential for ensuring consistency between node and edge lists, which in turn enables accurate graph instantiation in NetworkX or iGraph.
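A minimal sketch of that workflow is given below. It assumes hypothetical nodes.csv and edges.csv files with "id", "source", "target", and "weight" columns; the file and column names are illustrative, not prescribed by the handout.

```python
# Minimal sketch: building a graph from hypothetical node and edge CSV files.
# File names and column names ("id", "source", "target", "weight") are assumptions.
import pandas as pd
import networkx as nx

# Load hypothetical node and edge lists.
nodes = pd.read_csv("nodes.csv")   # columns: id, plus any node attributes
edges = pd.read_csv("edges.csv")   # columns: source, target, weight

# Basic integrity checks: drop duplicates and edges that reference unknown nodes.
nodes = nodes.drop_duplicates(subset="id")
edges = edges.drop_duplicates(subset=["source", "target"])
known = set(nodes["id"])
edges = edges[edges["source"].isin(known) & edges["target"].isin(known)]

# Instantiate the graph from the cleaned edge list.
G = nx.from_pandas_edgelist(edges, source="source", target="target",
                            edge_attr="weight")

# Include isolates listed in the node file but absent from the edge list,
# then attach node attributes keyed by id.
G.add_nodes_from(nodes["id"])
nx.set_node_attributes(G, nodes.set_index("id").to_dict("index"))

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```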
The handout further details methods for pseudonymization and hashing as mechanisms for privacy preservation, acknowledging the trade-off between anonymization and structural distortion. The concept of interoperability is addressed through the use of standardized file formats (.gml, .graphml, .net) to maintain fidelity across computational tools. Ultimately, the construction of network data is positioned as both a computational engineering challenge and an ethical responsibility that defines the quality of all subsequent analytical outcomes.
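As a concrete illustration of the pseudonymization step, the sketch below hashes node identifiers with a salted SHA-256 digest before the graph is exported. The salt value, the bundled example graph, and the output file name are demonstration assumptions, not part of the handout.

```python
# Minimal sketch, assuming a salted SHA-256 digest is an acceptable
# pseudonymization scheme; the salt and the example graph are illustrative.
import hashlib
import networkx as nx

def pseudonymize(node_id, salt="cmsc191-demo-salt"):
    """Return a short, stable, non-reversible pseudonym for a node identifier."""
    digest = hashlib.sha256((salt + str(node_id)).encode("utf-8")).hexdigest()
    return digest[:12]

# Stand-in for a real, identified social network.
G = nx.karate_club_graph()

# Relabel nodes with pseudonyms: structure is preserved, identities are not.
H = nx.relabel_nodes(G, {n: pseudonymize(n) for n in G.nodes})

# Export in a standard format so the pseudonymized graph travels across tools.
nx.write_graphml(H, "pseudonymized_network.graphml")
```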
Apply data collection, preprocessing, and anonymization techniques for building network datasets.
Demonstrate ethical responsibility in handling and preparing relational data.
Utilize data interoperability standards to ensure consistency across analytical platforms.
Why is data cleaning considered both a technical and ethical process?
How do file formats and metadata affect interoperability between tools?
How can poor preprocessing distort analytical outcomes in network research?
In what ways does standardization contribute to reproducibility and collaboration?
Constructing Network Data* (class handout)
1. From Raw Logs to Relational Maps
2. Data Sources and Modes of Collection
Sourcing the Social Graph: Surveys, APIs, and Archives
Sampling Bias and the Ethics of Data Transformation
3. Data Cleaning, Preprocessing, and Anonymization
The Tidy Network: Standardizing with Pandas
Pseudonymization and Hashing: Securing Node Attributes
4. Cross-Software Data Interoperability
Portability and Standard Formats (.gml, .graphml, .net)
Metadata Requirements and Reproducible Workflows
5. Essential Software Toolkit Summary
Pandas: Used for structured data loading, cleaning, merging, and initial standardization of node and edge lists.
NetworkX & iGraph: Primary Python libraries for creating, manipulating, and exporting the in-memory graph object G(V, E).
Gephi & Pajek: Specialized tools for interactive visualization, large-scale layout algorithms, and specific network partitioning algorithms, requiring high fidelity data import via standard formats.
6. Weaving Structure and Responsibility
Note: Links marked with an asterisk (*) lead to materials accessible only to members of the University community. Please log in with your official University account to view them.
The semester at a glance:
Constructing Network Data
Validity and Reliability . . .
Project Development . . .
Implementation . . .
Wasserman, Stanley, and Katherine Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994. (Core Text)
Access Note: Published research articles and books are linked to their respective sources. Some materials are freely accessible within the University network or when logged in with official University credentials. Others will be provided to enrolled students through the class learning management system (LMS).