Project Overview
This project explores the structure of Facebook's social network by identifying communities and clusters of connected users. Using graph theory and machine learning techniques, the analysis compares multiple clustering algorithms and community detection methods to uncover meaningful patterns in how users form groups and interact within the network. The project demonstrates both traditional graph-based community detection and modern machine learning clustering approaches on real-world social network data.
Key Objectives
- Identify natural communities within the Facebook social network graph
- Compare traditional community detection algorithms with ML clustering approaches
- Extract and engineer local graph features for node characterization
- Evaluate clustering quality using multiple metrics and validation techniques
Dataset
Source: facebook_combined.txt - Facebook social network graph
Size: 4,039 nodes and 88,234 edges representing user connections
Type: Undirected graph with unlabeled nodes
Features Extracted (20+ local graph features):
- Centrality Measures: Degree, eigenvector, closeness, betweenness, subgraph, load centrality
- Link Prediction Features: Jaccard coefficient, Adamic-Adar index, preferential attachment
- Ego-graph Features: Transitivity, PageRank, degree centrality computed on 2-hop neighborhoods
- Structural Features: Clustering coefficient, core number, eccentricity, triangles, square clustering
- Network Metrics: Shortest path statistics, degree assortativity, Estrada index
Methods & Techniques
Community Detection Algorithms:
Clauset-Newman-Moore
Label Propagation
Louvain Method
Kernighan-Lin
Machine Learning Clustering:
K-Means
Agglomerative (Ward)
Gaussian Mixture Models
OPTICS
Affinity Propagation
Self-Organizing Maps
Advanced Techniques:
- Node2Vec graph embeddings (64 dimensions) for representation learning
- PCA dimensionality reduction to 5 principal components
- Quantile transformation for handling outliers and distribution normalization
- Gap statistic for determining optimal cluster count
Results & Performance
0.7289
Best Silhouette Score
0.4373
Best Davies-Bouldin
9844.54
Max Calinski-Harabasz
Key Findings
- Best Overall Method: Agglomerative Clustering with Ward linkage achieved the highest silhouette score (0.7289) and lowest Davies-Bouldin index (0.4373) with 8 clusters
- K-Means Performance: Highly effective with silhouette score of 0.7152 for 8 clusters, demonstrating robustness across metrics
- Node2Vec Advantage: Graph embeddings produced competitive results (7 optimal clusters) by capturing network topology more effectively than raw features
- Outlier Impact: Removing outliers significantly improved clustering quality across all metrics
- Community Structure: Network exhibits clear modular structure with meaningful community boundaries detected by multiple algorithms
Technologies Used
Python
NetworkX
scikit-learn
Node2Vec
PyCaret
pandas
NumPy
matplotlib
seaborn
sklearn-som
Gap-Statistic
Google Colab