Facebook Social Network Clustering

Project Overview

This project explores the structure of Facebook's social network by identifying communities and clusters of connected users. Using graph theory and machine learning techniques, the analysis compares multiple clustering algorithms and community detection methods to uncover meaningful patterns in how users form groups and interact within the network. The project demonstrates both traditional graph-based community detection and modern machine learning clustering approaches on real-world social network data.

            Key Objectives
            Identify natural communities within the Facebook social network graph
Compare traditional community detection algorithms with ML clustering approaches
Extract and engineer local graph features for node characterization
Evaluate clustering quality using multiple metrics and validation techniques

        

Dataset

Source: facebook_combined.txt - Facebook social network graph

Size: 4,039 nodes and 88,234 edges representing user connections

Type: Undirected graph with unlabeled nodes

Features Extracted (20+ local graph features):

Centrality Measures: Degree, eigenvector, closeness, betweenness, subgraph, load centrality
Link Prediction Features: Jaccard coefficient, Adamic-Adar index, preferential attachment
Ego-graph Features: Transitivity, PageRank, degree centrality computed on 2-hop neighborhoods
Structural Features: Clustering coefficient, core number, eccentricity, triangles, square clustering
Network Metrics: Shortest path statistics, degree assortativity, Estrada index

Methods & Techniques

Community Detection Algorithms:

Clauset-Newman-Moore Label Propagation Louvain Method Kernighan-Lin

Machine Learning Clustering:

K-Means Agglomerative (Ward) Gaussian Mixture Models OPTICS Affinity Propagation Self-Organizing Maps

Advanced Techniques:

Node2Vec graph embeddings (64 dimensions) for representation learning
PCA dimensionality reduction to 5 principal components
Quantile transformation for handling outliers and distribution normalization
Gap statistic for determining optimal cluster count

Results & Performance

0.7289

Best Silhouette Score

0.4373

Best Davies-Bouldin

9844.54

Max Calinski-Harabasz

Optimal Clusters

                Key Findings
                Best Overall Method: Agglomerative Clustering with Ward linkage achieved the highest silhouette score (0.7289) and lowest Davies-Bouldin index (0.4373) with 8 clusters
K-Means Performance: Highly effective with silhouette score of 0.7152 for 8 clusters, demonstrating robustness across metrics
Node2Vec Advantage: Graph embeddings produced competitive results (7 optimal clusters) by capturing network topology more effectively than raw features
Outlier Impact: Removing outliers significantly improved clustering quality across all metrics
Community Structure: Network exhibits clear modular structure with meaningful community boundaries detected by multiple algorithms

            

Technologies Used

Python NetworkX scikit-learn Node2Vec PyCaret pandas NumPy matplotlib seaborn sklearn-som Gap-Statistic Google Colab

← Back to All Projects