Advanced data mining 26

Advanced Data Mining (CSE 554)

Syllabus

This course covers the analysis of relational and network-structured data — the kind of data that arises when entities interact, connect, or co-occur. Most real-world datasets (social networks, transaction logs, the Web, biological networks) are inherently relational, and classical tabular methods fail to capture the structure embedded in these relationships. From the perspective of Relational Big Data Analysis, the course examines how large-scale relational datasets can be characterised, modeled, and processed efficiently. From the perspective of Relational Data Mining, the course develops algorithmic tools for discovering patterns, communities, and predictive structure within relational data. Topics include graph models of real-world networks, community detection, influence propagation, recommendation systems, and association rule mining.

Letter Criteria

This course adopts an absolute grading policy. Final letter grades will be assigned based on the total score accumulated throughout the semester.

A+ : 95–100

A0 : 90–94

A- : 85–89

B+ : 80–84

B0 : 75–79

B- : 70–74

C+ : 65–69

C0 : 60–64

C- : 55–59

D+ : 50–54

D0 : 45–49

D- : 40–44

F : below 40

Evaluation Criteria

Paper critique : 10%
Midterm exam : 30%
Final exam : 30%
Individual Projects : 30%
Additional bonus score will be granted

T1. Introduction to Network Analysis — Why study networks; networks as a universal language across social, information, biological, and technological domains. Motivating questions on structure vs. dynamics, data availability, and real-world impact; overview of the two central themes: network structure & evolution and processes & dynamics.

T2. Random Graph Models — The Erdős–Rényi random graph as a baseline null model. Analysis of expected degree, degree distribution, clustering coefficient, and the emergence of the giant connected component and small diameter; what G(n,p) captures and where it fails to match real networks.

T3. The Small-World Phenomenon — The "six degrees of separation" effect and the Watts–Strogatz model. Reconciling high clustering with short path lengths by rewiring a regular lattice with a few random "shortcuts"; the trade-off between local structure and global reachability.

T4. Decentralized Search & Navigation — How individuals find short paths using only local information (Kleinberg's model). Adding geographic structure to small-world networks; why a specific distribution of long-range links makes networks navigable, and the conditions for efficient decentralized search.

T5. Network Centrality & Online Human Behavior — Measuring node importance via betweenness, closeness/farness, and related centrality measures. Applying network science to online behavior: how people express themselves through actions and text, and how user behavior shapes the success of online platforms.

T6. Signed Networks & Structural Balance — Networks with positive (friend) and negative (enemy) edges. Heider's structural balance theory ("friend of a friend is a friend"), analysis of signed triads, and status theory for explaining and predicting signed relationships.

T7. Cascading Behavior — Decision-Based Diffusion — How behaviors, products, and ideas spread through a network when adoption is a strategic decision. Game-theoretic / threshold models of diffusion, the role of network structure in adoption, and the coexistence of competing behaviors.

T8. Network Cascades & Probabilistic Contagion — Epidemic-style spreading models (e.g., SIR/SIS) where contagion is probabilistic rather than decision-driven. Information cascades, contagion thresholds, and modeling the spread of diseases and information through populations.

T9. Influence Maximization — Selecting a seed set of nodes to maximize the spread of influence. The Independent Cascade and Linear Threshold models, the submodularity of the influence function, and the greedy algorithm with its (1 − 1/e) approximation guarantee. (Includes supplementary handout.)

T10. Outbreak Detection — Placing sensors to detect outbreaks (e.g., contamination, information) as quickly as possible. Formulating outbreak detection as a submodular optimization problem and the CELF algorithm for scaling the greedy approach efficiently.

T11. Power-Laws & Scale-Free Networks — Heavy-tailed degree distributions in real networks. Identifying and fitting power-laws, the difference from exponential distributions, and the structural consequences of having highly connected "hub" nodes.

T12. Network Evolution & Preferential Attachment — Generative models explaining how power-laws emerge as networks grow. Continuous-time analysis of preferential ("rich-get-richer") attachment, derivation of the resulting degree distribution, and empirical patterns of network growth and densification.

T13. Kronecker Graphs — A recursive generative model for realistic large-scale networks. How Kronecker products produce graphs that match observed properties (heavy tails, small diameter, densification), and parameter estimation for fitting Kronecker graphs to real data.

T14. PageRank & Link Analysis — Ranking nodes via random walks on graphs. The PageRank algorithm, the random-surfer model with teleportation, the power-iteration computation, and related link-analysis methods (e.g., HITS, Personalized PageRank). (Includes supplementary handout.)

T15. The Strength of Weak Ties & Community Structure — Granovetter's theory linking tie strength to network position. Triadic closure, bridges and local bridges, and how weak ties grant access to novel information; introduction to detecting community structure in networks.\

T16. Spectral Clustering — Partitioning graphs using the eigenstructure of matrix representations. The three stages—building a matrix (adjacency/Laplacian), computing eigenvalues and eigenvectors, and grouping nodes; the connection between the graph Laplacian, conductance, and graph cuts.

T17. Overlapping Communities — Detecting communities when a node can belong to many social circles simultaneously. The Clique Percolation Method (k-clique communities) and model-based approaches (e.g., BigCLAM) that allow nodes to participate in multiple overlapping groups.

T18. Graph Representation Learning — Learning low-dimensional vector representations of nodes for machine-learning tasks. Random-walk-based embedding methods (DeepWalk / node2vec), the trade-off between homophily and structural equivalence via biased walks, and applications to node classification and link prediction.

2026F courses

Paper critique [Manual] [Template]

Related textbook & lecture

MetaPaper https://www.cs.cmu.edu/~christos/MetaPaper/
How to do research https://www.cs.ucr.edu/~eamonn/public/SDM_How_to_do_Research_Keogh.pdf
아주 쉬운 논문쓰기 https://oslab.kaist.ac.kr/wp-content/uploads/esos_files/paperwriting.pdf
학문을 직업으로 삼으려는 젊은 학자들을 위하여 http://www.ekera.org/bbs/board.php?bo_table=letter&wr_id=3407
학술논문작성법 https://www.edwith.org/howtopaper

Page updated

Google Sites

Report abuse