# Analysis of Wikipedia with scikit-network

This notebook shows how to apply [scikit-network](https://scikit-network.readthedocs.io/) to analyse the network structure of Wikipedia, through its hyperlinks, as well as the textual content of Wikipedia, through the words used in the summaries of the articles.

We consider the [Wikivitals](https://netset.telecom-paris.fr/pages/wikivitals.html) dataset of the [netset](https://netset.telecom-paris.fr) collection. This dataset consists of the [top 10,000 (vital) articles of Wikipedia](https://fr.wikipedia.org/wiki/Wikipédia:Articles_vitaux/Niveau_4).

## Getting started

To install scikit-network, uncomment and run the following command, then restart the kernel:

In [None]:
# !pip install scikit-network

## Import

In [None]:
from IPython.display import SVG

In [None]:
import numpy as np
from scipy import sparse
from matplotlib import pyplot as plt

In [None]:
from sknetwork.data import load_netset
from sknetwork.ranking import PageRank, top_k
from sknetwork.hierarchy import LouvainHierarchy
from sknetwork.embedding import Spectral
from sknetwork.linalg import normalize
from sknetwork.utils import WardDense, get_neighbors, membership_matrix
from sknetwork.visualization import svg_digraph, svg_dendrogram

In [None]:
# used for 2D visualization
from sklearn.manifold import TSNE

## Data

All datasets of the [netset](https://netset.telecom-paris.fr) collection can be easily imported with scikit-network.

In [None]:
wikivitals = load_netset('wikivitals')

In [None]:
# hyperlinks
adjacency = wikivitals.adjacency
names = wikivitals.names
labels = wikivitals.labels
names_labels = wikivitals.names_labels

In [None]:
# bipartite graph between articles and words
biadjacency = wikivitals.biadjacency
words = wikivitals.names_col

In [None]:
adjacency

In [None]:
biadjacency

In [None]:
# categories
print(names_labels)

## Sample

Let's have a look at a random article.

In [None]:
i = np.random.choice(len(names))
print(names[i])

In [None]:
# label
label = labels[i]
print(names_labels[label])

In [None]:
# some outgoing hyperlinks
nodes = get_neighbors(adjacency, i)
print(names[nodes[:10]])

In [None]:
# some incoming hyperlinks
nodes = get_neighbors(sparse.csr_matrix(adjacency.T), i)
print(names[nodes[:10]])

In [None]:
# some words
nodes = get_neighbors(biadjacency, i)
print(words[nodes[:10]])
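
The neighbor lookups above can be illustrated on a small example. The following is a minimal sketch, assuming the graph is stored as a SciPy CSR matrix; the toy graph and the helper `neighbors_csr` are hypothetical and only mimic what `get_neighbors` returns, without claiming to reproduce its implementation.

```python
import numpy as np
from scipy import sparse

# toy directed graph with 4 nodes (hypothetical data, not Wikivitals)
toy_adjacency = sparse.csr_matrix(np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]))

def neighbors_csr(matrix, node):
    """Column indices of the non-zero entries in one row of a CSR matrix."""
    return matrix.indices[matrix.indptr[node]:matrix.indptr[node + 1]]

# successors of node 2 (outgoing links)
out_neighbors = neighbors_csr(toy_adjacency, 2)
# predecessors of node 2 (incoming links), read from the transposed matrix
in_neighbors = neighbors_csr(sparse.csr_matrix(toy_adjacency.T), 2)
print(out_neighbors, in_neighbors)
```

Transposing the matrix swaps the roles of sources and targets, which is why the incoming hyperlinks are read from `adjacency.T`.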

## PageRank

We first use personalized [PageRank](https://en.wikipedia.org/wiki/PageRank), restarting from the articles of a given category, to select the most typical articles of each category.

In [None]:
pagerank = PageRank()

In [None]:
# number of articles per category
n_selection = 50

In [None]:
# selection of articles
selection = []
for label in np.arange(len(names_labels)):
    ppr = pagerank.fit_transform(adjacency, seeds=(labels==label))
    scores = ppr * (labels==label)
    selection.append(top_k(scores, n_selection))
selection = np.array(selection)
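
To make the idea of personalized PageRank concrete, here is a minimal sketch by power iteration on a toy graph. The function `personalized_pagerank`, its parameters `damping` and `n_iter`, and the toy data are assumptions made for illustration; the solver actually used by scikit-network may differ.

```python
import numpy as np
from scipy import sparse

def personalized_pagerank(adjacency, seeds, damping=0.85, n_iter=100):
    """Power iteration for a random walk that restarts at the seed nodes."""
    n = adjacency.shape[0]
    restart = seeds.astype(float) / seeds.sum()
    out_degrees = np.asarray(adjacency.sum(axis=1)).ravel()
    inv_degrees = np.divide(1.0, out_degrees, out=np.zeros(n), where=out_degrees > 0)
    # row-stochastic transition matrix (rows of sinks are left empty)
    transition = sparse.diags(inv_degrees) @ adjacency
    scores = restart.copy()
    for _ in range(n_iter):
        walk = transition.T @ scores
        # mass lost at sinks or by restart is sent back to the seeds
        scores = damping * walk + (1 - damping * walk.sum()) * restart
    return scores

# toy directed graph (hypothetical data), with node 0 as the only seed
toy_adjacency = sparse.csr_matrix(np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]))
toy_seeds = np.array([True, False, False, False])
toy_scores = personalized_pagerank(toy_adjacency, toy_seeds)
print(toy_scores)
```

In the notebook, the scores are additionally multiplied by the category mask so that only articles of the category itself are ranked.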

In [None]:
# show selection
for label, name_label in enumerate(names_labels):
    print('---')
    print(label, name_label)
    print(names[selection[label,:5]])

## Graph embedding

We now represent each node of the graph by a low-dimensional vector, and visualize the result in 2D with [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).

In [None]:
# dimension of the embedding
n_components = 20

In [None]:
# embedding
spectral = Spectral(n_components)
embedding = spectral.fit_transform(adjacency)
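
As an illustration of what a spectral embedding computes, the sketch below embeds a toy undirected graph using the eigenvectors of its normalized Laplacian. This is one common variant, stated here as an assumption; the exact normalization and solver used by `Spectral` may differ, and all names in the block are hypothetical.

```python
import numpy as np

# toy undirected graph: two triangles joined by the edge 2-3 (hypothetical data)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
toy = np.zeros((6, 6))
for u, v in edges:
    toy[u, v] = toy[v, u] = 1

degrees = toy.sum(axis=1)
inv_sqrt = 1 / np.sqrt(degrees)
# symmetric normalized Laplacian I - D^{-1/2} A D^{-1/2}
laplacian = np.eye(6) - inv_sqrt[:, None] * toy * inv_sqrt[None, :]
# eigenvalues are returned in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)

n_dim = 2
# skip the first (trivial) eigenvector and rescale by D^{-1/2}
toy_embedding = inv_sqrt[:, None] * eigenvectors[:, 1:n_dim + 1]
print(toy_embedding)
```

On this toy graph, the first coordinate of the embedding has one sign on the first triangle and the opposite sign on the second, which is the kind of structure the 2D visualization is meant to reveal.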

In [None]:
# visualization by TSNE
tsne = TSNE(2)
embedding_ = tsne.fit_transform(embedding[selection.ravel()])

In [None]:
# vector of labels of the selection
labels_selection = np.repeat(np.arange(len(names_labels)), n_selection)

In [None]:
# show embedding
plt.figure(figsize=(6, 6))
for label, name_label in enumerate(names_labels):
    mask = labels_selection==label
    plt.scatter(embedding_[mask,0], embedding_[mask,1], s=50)
    center = np.mean(embedding_[mask], axis=0)
    plt.text(center[0], center[1], name_label)

## Text mining

In the following, we consider the bipartite graph between articles and words used in their summaries.

We first select topical words, that is, words whose occurrences are concentrated in the articles of a single category.

In [None]:
# select topical words: more than `threshold` of the word's occurrences fall in a single category
threshold = 0.5
membership = membership_matrix(labels)
labels_words_mask = (normalize(biadjacency.T).dot(membership) > threshold).T.toarray()
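
The selection of topical words can be checked by hand on a small example. The following sketch reproduces the same computation with plain NumPy and SciPy, assuming that `normalize` rescales each row to sum to 1 and that `membership_matrix` returns a one-hot matrix of labels; the toy data and variable names are hypothetical.

```python
import numpy as np
from scipy import sparse

# toy data (hypothetical): 4 articles, 3 words, 2 categories
toy_labels = np.array([0, 0, 1, 1])
toy_biadjacency = sparse.csr_matrix(np.array([
    [3, 1, 0],
    [2, 1, 0],
    [0, 1, 4],
    [1, 1, 2],
]))

# one-hot membership matrix, shape (articles, categories)
toy_membership = sparse.csr_matrix(np.eye(2)[toy_labels])

# word-article matrix, rescaled so that each word (row) sums to 1
word_article = sparse.csr_matrix(toy_biadjacency.T, dtype=float)
word_totals = np.asarray(word_article.sum(axis=1)).ravel()
word_article = sparse.diags(1 / word_totals) @ word_article

# share of each word's occurrences per category, shape (words, categories)
shares = (word_article @ toy_membership).toarray()
# topical words of each category, shape (categories, words)
toy_mask = (shares > 0.5).T
print(toy_mask)
```

Here the first word is topical for category 0, the last word for category 1, and the middle word, spread evenly over all articles, is topical for neither.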

In [None]:
# number of words per category
n_selection = 50

In [None]:
# use frequency to select top words
counts = biadjacency.T.dot(np.ones(len(names)))

In [None]:
# selection of words by frequency
selection_words = []
for label in np.arange(len(names_labels)):
    mask = labels_words_mask[label]
    scores = counts * mask
    selection_words.append(top_k(scores, n_selection))
selection_words = np.array(selection_words)

In [None]:
# show selection
for label, name_label in enumerate(names_labels):
    print('---')
    print(label, name_label)
    print(words[selection_words[label,:5]])

## Graph co-embedding

We co-embed articles and words in the same vector space.

In [None]:
# dimension of the embedding
n_components = 20

In [None]:
# embedding
spectral = Spectral(n_components)
spectral.fit(biadjacency)
embedding_articles = spectral.embedding_row_
embedding_words = spectral.embedding_col_
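
One way to obtain such a co-embedding is through the singular value decomposition of the normalized biadjacency matrix. The sketch below illustrates this on a toy matrix; it is an assumption about the general technique rather than a description of what `Spectral` does internally, and all names are hypothetical.

```python
import numpy as np

# toy biadjacency (hypothetical data): 4 articles x 5 words
toy_b = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 1],
    [0, 1, 1, 2, 0],
], dtype=float)

row_w = toy_b.sum(axis=1)
col_w = toy_b.sum(axis=0)
# normalized biadjacency D1^{-1/2} B D2^{-1/2}
normalized = toy_b / np.sqrt(row_w)[:, None] / np.sqrt(col_w)[None, :]

# singular values come in decreasing order; the first pair is trivial
left, singular_values, right_t = np.linalg.svd(normalized, full_matrices=False)

n_dim = 2
# rows (articles) and columns (words) share the same n_dim-dimensional space
rows_2d = left[:, 1:n_dim + 1] / np.sqrt(row_w)[:, None]
cols_2d = right_t[1:n_dim + 1].T / np.sqrt(col_w)[:, None]
print(rows_2d.shape, cols_2d.shape)
```

Left singular vectors give the coordinates of articles and right singular vectors those of words, so both sets of points can be compared directly.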

In [None]:
# visualization by TSNE
tsne = TSNE(2)
embedding_articles_ = tsne.fit_transform(embedding_articles[selection.ravel()])
embedding_words_ = tsne.fit_transform(embedding_words[selection_words.ravel()])

In [None]:
# show embedding of articles
plt.figure(figsize=(6,6))
for label, name_label in enumerate(names_labels):
    mask = (labels_selection==label)
    plt.scatter(embedding_articles_[mask,0], embedding_articles_[mask,1], s=50)
    center = np.mean(embedding_articles_[mask], axis=0)
    plt.text(center[0], center[1], name_label)

In [None]:
# show embedding of words (labels_selection is reused: both selections have n_selection items per category)
plt.figure(figsize=(6,6))
for label, name_label in enumerate(names_labels):
    mask = (labels_selection==label)
    plt.scatter(embedding_words_[mask,0], embedding_words_[mask,1], s=50)
    center = np.mean(embedding_words_[mask], axis=0)
    plt.text(center[0], center[1], name_label)

## Hierarchical structure

Finally, we show the hierarchical structure of articles and words in the embedding space.

In [None]:
ward = WardDense()

In [None]:
# hierarchy of articles
label = 0
index = selection[label]
dendrogram_articles = ward.fit_transform(embedding_articles[index])
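
For readers unfamiliar with Ward clustering, here is a minimal sketch on toy vectors using SciPy's `linkage` function. It assumes that the dendrogram follows SciPy's linkage format, where each row describes one merge; the toy data and variable names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy vectors (hypothetical data): two well-separated groups in the plane
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0, 0.1, size=(5, 2)),
    rng.normal(5, 0.1, size=(5, 2)),
])

# each row describes one merge: (cluster a, cluster b, height, size)
toy_dendrogram = linkage(points, method='ward')
# cut the dendrogram into at most 2 clusters
toy_clusters = fcluster(toy_dendrogram, t=2, criterion='maxclust')
print(toy_dendrogram.shape, toy_clusters)
```

A dendrogram over n points has n - 1 merges, and cutting it at a given number of clusters plays the same role as the `n_clusters` argument of the visualization.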

In [None]:
# visualization
image = svg_dendrogram(dendrogram_articles, names=names[index], rotate=True, width=200, scale=2, n_clusters=4)
SVG(image)

In [None]:
# hierarchy of words
index = selection_words[label]
dendrogram_words = ward.fit_transform(embedding_words[index])

In [None]:
# visualization
image = svg_dendrogram(dendrogram_words, names=words[index], rotate=True, width=200, scale=2, n_clusters=4)
SVG(image)