
Hands on Network Machine Learning

Eric W. Bridgeford and Jaewon Chung

For questions, reach out to ericwb95@gmail.com or j1c@jhu.edu

Follow the slides: ericwb.me/lectures/intro_graph_ML.html

1 / 328

What is the traditional framework for learning in science?

  • tabular data: $n$ observations with $d$ features/dimensions
    • We have lots of algorithms that allow us to learn from this data across different languages
    • \texttt{sklearn} in \texttt{python}, \texttt{keras} in \texttt{R}, etc.


Person      Biological Sex   Height   Age
Person 1    Male             5'9"     28
Person 2    Female           5'5"     24
...         ...              ...      ...
4 / 328

What is the traditional framework for learning in science?

  • tabular data: $n$ observations with $d$ features/dimensions
    • We have lots of algorithms that allow us to learn from this data across different languages
    • \texttt{sklearn} in \texttt{python}, \texttt{keras} in \texttt{R}, etc.
  • Devise techniques that allow us to calculate useful quantities about each observation
Person      Biological Sex   Height   Age
Person 1    Male             5'9"     28
Person 2    Female           5'5"     24
...         ...              ...      ...
5 / 328

Coin flip example

  • Coin flip experiment: have a coin with probability of landing on heads of $p$
  • Question: If I flip the coin $15$ times and it lands on heads $6$, can you estimate the probability of landing on heads?
  • Anybody? Why is it what it is?
6 / 328

Coin flip example

  • Coin flip experiment: have a coin with probability of landing on heads of $p$
  • Question: If I flip the coin $15$ times and it lands on heads $6$, can you estimate the probability of landing on heads?
  • Intuitive answer: $\frac{6}{15}$
7 / 328

Coin flip example

  • Coin flip experiment: have a coin with probability of landing on heads of $p$
  • Question: If I flip the coin $15$ times and it lands on heads $6$, can you estimate the probability of landing on heads?
Rigorous answer

$\mathbf{x}_i \sim \text{Bern}(p)$ $i.i.d.$, heads (value $1$) and tails (value $0$)

8 / 328

Coin flip example

  • Coin flip experiment: have a coin with probability of landing on heads of $p$
  • Question: If I flip the coin $15$ times and it lands on heads $6$, can you estimate the probability of landing on heads?
Rigorous answer

$\mathbf{x}_i \sim \text{Bern}(p)$ $i.i.d.$, heads (value $1$) and tails (value $0$)

$$\mathcal L(x_1, \ldots, x_{15}) = \prod_{i = 1}^{15} p^{x_i}(1 - p)^{1 - x_i}$$

9 / 328

Coin flip example

  • Coin flip experiment: have a coin with probability of landing on heads of $p$
  • Question: If I flip the coin $15$ times and it lands on heads $6$, can you estimate the probability of landing on heads?
Rigorous answer

$\mathbf{x}_i \sim \text{Bern}(p)$ $i.i.d.$, heads (value $1$) and tails (value $0$)

$$\mathcal L(x_1, \ldots, x_{15}) = \prod_{i = 1}^{15} p^{x_i}(1 - p)^{1 - x_i}$$

take derivative of $\ell = \log(\mathcal L)$ and set equal to zero $\Rightarrow \hat p = \frac{\sum_{i = 1}^{15}x_i}{15}$ is the MLE

10 / 328

Coin flip example

  • Coin flip experiment: have a coin with probability of landing on heads of $p$
  • Question: If I flip the coin $15$ times and it lands on heads $6$, can you estimate the probability of landing on heads?
Rigorous answer

$\mathbf{x}_i \sim \text{Bern}(p)$ $i.i.d.$, heads (value $1$) and tails (value $0$)

$$\mathcal L(x_1, \ldots, x_{15}) = \prod_{i = 1}^{15} p^{x_i}(1 - p)^{1 - x_i}$$

take derivative of $\ell = \log(\mathcal L)$ and set equal to zero $\Rightarrow \hat p = \frac{\sum_{i = 1}^{15}x_i}{15}$ is the MLE

check that we found a maximum (second derivative, check extrema)
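As a quick sanity check, here is a minimal \texttt{python} sketch (using made-up flip data) of the MLE we just derived; the estimate is simply the sample mean of the flips:

```python
import numpy as np

# Hypothetical flips: 1 = heads, 0 = tails (6 heads out of 15 flips).
x = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# The Bernoulli MLE is the sample mean: sum(x_i) / 15.
p_hat = x.sum() / len(x)
print(p_hat)  # 0.4, i.e., 6/15
```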

11 / 328

Coin flip example

  • Coin flip experiment: have a coin with probability of landing on heads of $p$
  • Question: If I flip the coin $15$ times and it lands on heads $6$, can you estimate the probability of landing on heads?
  • Intuitive and rigorous answers align
    • Statistics allows us to be rigorous about things we find intuitive
12 / 328

Coin flip example

  • Coin flip experiment: have a coin with probability of landing on heads of $p$
  • Question: If I flip the coin $15$ times and it lands on heads $6$, can you estimate the probability of landing on heads?
  • Intuitive and rigorous answers align
    • Statistics allows us to be rigorous about things we find intuitive
    • Statistics can also allow us to be rigorous about things that are more complicated or unintuitive
13 / 328

What is a network?

  • Network/Graph $G = (\mathcal V, \mathcal E)$
    • $\mathcal V$ are nodes/vertices
    • $\mathcal E$ are edges: connect one node/vertex to another
15 / 328

Layout plots

  • Provide a visualization of the nodes and edges in the network in some arbitrary space
16 / 328

Adjacency matrices

  • $A$ is an $n \times n$ adjacency matrix for a network with $n$ nodes: $$a_{ij} = \begin{cases}1, & e_{ij} \in \mathcal E \\ 0, & e_{ij} \notin \mathcal E\end{cases}$$
  • The incident nodes of an adjacency $a_{ij}$ are $i$ and $j$
17 / 328

Adjacency matrices

  • $A$ is an $n \times n$ adjacency matrix for a network with $n$ nodes: $$a_{ij} = \begin{cases}1, & e_{ij} \in \mathcal E \\ 0, & e_{ij} \notin \mathcal E\end{cases}$$
  • if $a_{ij} = 1$, nodes $i$ and $j$ are neighbors (edge between them)
18 / 328

Paths describe edges of travel from one node to the next

  • Path from $i$ to $j$: a sequence of nodes (no repeats) and edges $e_{i'j'}$ starting at node $i$ and ending at node $j$
  • What's a path from Staten Island (SI) to Bronx (BX)?
19 / 328

Two paths from SI to BX



20 / 328

Directed networks

  • Edges aren't necessarily "symmetric"
    • The lanes of the Brooklyn Bridge going from BK to MH are out of service
    • There is an edge from MH to BK, but there is no edge from BK to MH
  • asymmetric adjacency matrix
22 / 328

Networks with self-loops

  • self-loops allow nodes to connect back to themselves
    • A bridge from a point in SI to another point within SI (crossing a creek?)
  • non-hollow adjacency matrix
23 / 328

Networks with weights

  • For every edge, a weight $w_{ij}$ indicates information about the "strength" of the edge
    • What is the level of traffic congestion for each bridge?
  • non-binary adjacency matrix
24 / 328

Simple networks

  • undirected, unweighted, loopless
    • symmetric, binary, hollow adjacency matrix
25 / 328

Degrees

  • node degree $degree(i)$: the number of nodes that $i$ has edges to
  • Degree of BK: 2
26 / 328

Calculating node degrees

$$d_i = degree(i) = \sum_{j = 1}^n a_{ij}$$

  • Sum the adjacency matrix column-wise
    • since the network is simple (loopless), $a_{ii} = 0$
  • For simple networks, we can write $d_i = \sum_{j \neq i}a_{ij}$
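A minimal \texttt{numpy} sketch of this computation on a toy simple network (the adjacency values are illustrative only):

```python
import numpy as np

# Toy simple network: symmetric, binary, hollow adjacency matrix on 4 nodes.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

# Node degrees: sum the adjacency matrix column-wise.
d = A.sum(axis=0)
print(d)  # [2 2 3 1]
```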
27 / 328

Degree matrix

  • The degree matrix $D$ is the matrix with node degrees on the diagonal, and $0$ otherwise:

$$D = \begin{bmatrix}d_1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & d_n\end{bmatrix}$$

28 / 328

Degree matrix

  • The degree matrix $D$ is the matrix with node degrees on the diagonal, and $0$ otherwise:

$$\begin{aligned}D &= \begin{bmatrix}d_1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & d_n\end{bmatrix} \\ &= \begin{bmatrix}d_1 & & \\ & \ddots & \\ & & d_n \end{bmatrix}\end{aligned}$$

29 / 328

Degree matrix

  • For the New York example, it looks like this:
30 / 328

Network Laplacian

  • as long as all $d_i > 0$, we can take the reciprocal of the degree matrix

$$D^{-1} = \begin{bmatrix}\frac{1}{d_1} & & \\ & \ddots & \\ & & \frac{1}{d_n}\end{bmatrix}$$

31 / 328

Network Laplacian

  • as long as all $d_i > 0$, we can take the reciprocal of the degree matrix

$$D^{-1} = \begin{bmatrix}\frac{1}{d_1} & & \\ & \ddots & \\ & & \frac{1}{d_n}\end{bmatrix}$$

  • we can also take the square root: $$D^{-\frac{1}{2}} = \begin{bmatrix}\frac{1}{\sqrt{d_1}} & & \\ & \ddots & \\ & & \frac{1}{\sqrt{d_n}}\end{bmatrix}$$
32 / 328

Network Laplacian

  • The network Laplacian is defined as: $L = D^{-\frac{1}{2}}A D^{-\frac{1}{2}}$
33 / 328

Network Laplacian

  • The network Laplacian is defined as: $L = D^{-\frac{1}{2}}A D^{-\frac{1}{2}}$

$$\begin{aligned}L &= \begin{bmatrix} \frac{1}{\sqrt{d_1}} & & \\ & \ddots & \\ & & \frac{1}{\sqrt{d_n}}\end{bmatrix} \begin{bmatrix}a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn}\end{bmatrix} \begin{bmatrix}\frac{1}{\sqrt{d_1}} & & \\ & \ddots & \\ & & \frac{1}{\sqrt{d_n}}\end{bmatrix} \\ &= \begin{bmatrix} \frac{a_{11}}{d_1} & \cdots & \frac{a_{1n}}{\sqrt{d_1}\sqrt{d_n}} \\ \vdots & \ddots & \vdots \\ \frac{a_{n1}}{\sqrt{d_n}\sqrt{d_1}} & \cdots & \frac{a_{nn}}{d_n}\end{bmatrix}\end{aligned}$$

34 / 328

Network Laplacian

  • The network Laplacian is defined as: $L = D^{-\frac{1}{2}}A D^{-\frac{1}{2}}$
  • the entries $l_{ij} = \frac{a_{ij}}{\sqrt{d_i} \sqrt{d_j}}$
35 / 328

Network Laplacian

  • The network Laplacian is defined as: $L = D^{-\frac{1}{2}}A D^{-\frac{1}{2}}$
  • the entries $l_{ij} = \frac{a_{ij}}{\sqrt{d_i} \sqrt{d_j}}$
    • similar to the adjacency matrix, but "normalized" by the degrees of incident nodes for every edge
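A minimal \texttt{numpy} sketch of the Laplacian as defined on these slides, continuing the toy adjacency matrix from before (illustrative values only):

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

d = A.sum(axis=0)                     # node degrees (all must be > 0)
D_inv_sqrt = np.diag(1 / np.sqrt(d))  # D^{-1/2}

# Network Laplacian: entries l_ij = a_ij / (sqrt(d_i) * sqrt(d_j)).
L = D_inv_sqrt @ A @ D_inv_sqrt
```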
36 / 328

Network Laplacian example



37 / 328

Diffusion connectome

  • neuron: functional "unit" of the brain
    • axon: "projection" from the center (cell body) of the neuron
38 / 328

Diffusion connectome

  • neuron: functional "unit" of the brain
    • axon: "projection" from the center (cell body) of the neuron
  • synapse: junction of the brain that bridges two neurons
    • the neurons can "communicate" through the synapses
39 / 328

Diffusion connectome

  • cannot see synapses and neurons without microscopy
    • microscopy requires sections of the brain (illegal)
40 / 328

Diffusion connectome

  • cannot see synapses and neurons without microscopy
    • microscopy requires sections of the brain (illegal)
  • the brain is highly optimized
    • the axons tend to be "bundled together" in "fiber tracts"
    • we can see these with MRI

(Source: Muse)

41 / 328

Diffusion connectome

  • brain area: collection of neurons with similar function
  • measure which brain areas are connected with other brain areas
    • idea: if there is an axonal fiber tract, neurons from these brain areas can communicate

(Source: Schiffler 2017)

42 / 328

Diffusion connectome

  • nodes: brain areas
  • edges: does an axonal fiber tract link from one brain area to the other?

(Source: Schiffler 2017/me)

43 / 328

Attributes for network data

  • attributes describe characteristics of the nodes or edges in the network
  • Diffusion connectome: left and right hemisphere
    • What do you notice?
44 / 328

Attributes for network data

  • attributes describe characteristics of the nodes or edges in the network
  • Diffusion connectome: left and right hemisphere
    • homophily (like with like): More within-hemisphere connections than between-hemisphere connections
45 / 328

Back to coin flips

  • Let's imagine a game of flipping a coin
    • You win if it lands on heads less than $5$ times, and if it lands on heads more than $5$ times, your friend wins
47 / 328

Back to coin flips

  • Let's imagine a game of flipping a coin
    • You win if it lands on heads less than $5$ times, and if it lands on heads more than $5$ times, your friend wins
  • Outcome: $9$ heads
  • Was it rigged? Is the coin biased?
48 / 328

Making the coin flip experiment more formal

  • Model: $\mathbf x_i \sim \text{Bern}(p)$ $i.i.d.$
  • Question: is $p > 0.5$?
  • With $H_0 : p = 0.5$ and $H_A : p > 0.5$, do we have evidence to reject $H_0$ when the number of heads is $9$?
49 / 328

Making the coin flip experiment more formal

  • With $H_0 : p = 0.5$ and $H_A : p > 0.5$, do we have evidence to reject $H_0$ when the number of heads is $9$?
50 / 328

Making the coin flip experiment more formal

  • With $H_0 : p = 0.5$ and $H_A : p > 0.5$, do we have evidence to reject $H_0$ when the number of heads is $9$?
  • Null: $\sum_{i = 1}^{10} \mathbf x_i \sim \text{Binomial}(10, 0.5)$
  • $p$-value $= \sum_{k = 9}^{10} \binom{10}{k} 0.5^{k}(1 - 0.5)^{10 - k} = .011$
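A minimal \texttt{scipy} sketch of this $p$-value, assuming the Binomial null above:

```python
from scipy.stats import binom

# Under H0, the number of heads in 10 flips is Binomial(10, 0.5).
# One-sided p-value: P(9 or more heads) = P(X > 8).
p_value = binom.sf(8, 10, 0.5)
print(round(p_value, 3))  # ~0.011
```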
51 / 328

Making the coin flip experiment more formal

  • With $H_0 : p = 0.5$ and $H_A : p > 0.5$, do we have evidence to reject $H_0$ when the number of heads is $9$?
  • Null: $\sum_{i = 1}^{10} \mathbf x_i \sim \text{Binomial}(10, 0.5)$
  • $p$-value $= \sum_{k = 9}^{10} \binom{10}{k} 0.5^{k}(1 - 0.5)^{10 - k} = .011$
  • What did the model let us do?
52 / 328

Making the coin flip experiment more formal

  • With $H_0 : p = 0.5$ and $H_A : p > 0.5$, do we have evidence to reject $H_0$ when the number of heads is $9$?
  • Null: $\sum_{i = 1}^{10} \mathbf x_i \sim \text{Binomial}(10, 0.5)$
  • $p$-value $= \sum_{k = 9}^{10} \binom{10}{k} 0.5^{k}(1 - 0.5)^{10 - k} = .011$
  • What did the model let us do?
    • Get really specific and precise about this tangible concept ($9$ heads in $10$ flips seems like an awful lot)
53 / 328

Why do we use random network models?

  • Networks are fundamentally different from the $n$ observation, $d$ feature framework
    • Collection of nodes and edges, not observations with features
54 / 328

Why do we use random network models?

  • Networks are fundamentally different from the $n$ observation, $d$ feature framework
    • Collection of nodes and edges, not observations with features
  • All of the machinery built for tabular data will not natively work with a network
55 / 328

Approach 1: nodes are observations, edges are features

  • Features in traditional machine learning describe each observation
    • isolate information about a particular observation
Person      Biological Sex   Height   Age
Person 1    Male             5'9"     28
Person 2    Female           5'5"     24
...         ...              ...      ...
56 / 328

Approach 1: nodes are observations, edges are features

  • Features in traditional machine learning describe each observation
  • Edges define relationships amongst the nodes
    • the edges do not inherently isolate information about each node, so they aren't features for observations
57 / 328

Approach 1: nodes are observations, edges are features

  • Features in traditional machine learning describe each observation
  • Edges define relationships amongst the nodes
    • the edges do not inherently isolate information about each node, so they aren't features for observations
58 / 328

Approach 2: treat all possible adjacency matrices like a coin flip

  • $\mathbf{x}_i \sim \text{Bern}(p)$
    • Affix a probability ($p$) to an outcome of $1$, and $1 - p$ to an outcome of $0$
  • There are a finite number of entries in an adjacency matrix, taking finitely many values
    • There are a finite number of adjacency matrices for $n$-node simple networks
59 / 328

Approach 2: treat all possible adjacency matrices like a coin flip

  • $\mathbf{x}_i \sim \text{Bern}(p)$
    • Affix a probability ($p$) to an outcome of $1$, and $1 - p$ to an outcome of $0$
  • There are a finite number of entries in an adjacency matrix, taking finitely many values
    • There are a finite number of adjacency matrices for $n$-node simple networks
  • What if we just affix a probability to each possible network?
60 / 328

Number of possible $2$-node adjacency matrices

$$\begin{bmatrix} 0 & 1 \\ 1 & 0\end{bmatrix} \text{ or } \begin{bmatrix} 0 & 0 \\ 0 & 0\end{bmatrix}$$

61 / 328

Number of possible $3$-node adjacency matrices

$$\begin{bmatrix} 0 & 1 & 1\\ 1 & 0 & 1 \\ 1 & 1 & 0\end{bmatrix} \text{ or } \begin{bmatrix} 0 & 1 & 0\\ 1 & 0 & 1 \\ 0 & 1 & 0\end{bmatrix} \text{ or }\begin{bmatrix} 0 & 0 & 1\\0 & 0 & 1 \\ 1 & 1 & 0\end{bmatrix} \text{ or }$$

$$\begin{bmatrix} 0 & 1 & 1\\ 1 & 0 & 0 \\ 1 & 0 & 0\end{bmatrix} \text{ or } \begin{bmatrix} 0 & 0 & 1\\ 0 & 0 & 0 \\ 1 & 0 & 0\end{bmatrix} \text{ or }\begin{bmatrix} 0 & 0 & 0\\0 & 0 & 1 \\ 0 & 1 & 0\end{bmatrix} \text{ or }$$

$$\begin{bmatrix} 0 & 1 & 0\\ 1 & 0 & 0 \\ 0 & 0 & 0\end{bmatrix} \text{ or } \begin{bmatrix} 0 & 0 & 0\\ 0 & 0 & 0 \\ 0 & 0 & 0\end{bmatrix}.$$

62 / 328

Approach 2: treat all possible adjacency matrices like a coin flip

  • Number of possible adjacency matrices for simple networks with $n$ nodes?
    • $2^{\binom{n}{2}}$
    • When $n = 50$, this is well over the number of atoms in the universe
  • Good luck keeping track of all of that!
63 / 328

Approach 2.5: Treat edges like coin flips

Coin flip experiment
  • Outcomes: $x$ is heads ($1$) or tails ($0$)
    • Each flip $\mathbf x_i \sim \text{Bern}(p)$ $i.i.d.$
  • $p$ gives the probability of heads ($1$) and $1-p$ gives the probability of tails ($0$)
65 / 328

Approach 2.5: Treat edges like coin flips

Coin flip experiment
  • Outcomes: $x$ is heads ($1$) or tails ($0$)
    • Each flip $\mathbf x_i \sim \text{Bern}(p)$ $i.i.d.$
  • $p$ gives the probability of heads ($1$) and $1-p$ gives the probability of tails ($0$)
Network experiment
  • Outcomes: Adjacency matrices $A$, where each entry $a_{ij}$ is $0$ or $1$
    • $\mathbf{a}_{ij} \sim \text{Bern}(p_{ij})$ independently
  • $p_{ij}$ gives the probability of an edge ($a_{ij} = 1$) and $1 - p_{ij}$ gives the probability of no edge ($a_{ij} = 0$)
66 / 328

Independent Erdös Rényi Model

  • $\mathbf{A} \sim IER_n(P)$
  • $P$ is an $n \times n$ probability matrix
    • Each $\mathbf{a}_{ij} \sim \text{Bern}(p_{ij})$ independently
  • Equipped with the usual properties: $\mathbb{E}[\mathbf{a}_{ij}] = p_{ij}$, variance, etc.
67 / 328

Advantages of the Independent Erdös Rényi Model

  • Can get a very precise description: any network structure is admissible
68 / 328

Disadvantages of the Independent Erdös Rényi Model

  • Let's say we have a network: how do we learn about $P$ if we assume the network is $IER_n(P)$?
69 / 328

Disadvantages of the Independent Erdös Rényi Model

  • Let's say we have a network: how do we learn about $P$ if we assume the network is $IER_n(P)$?
    • An observed network $A$ has a single observation $a_{ij}$ of each $\mathbf{a}_{ij}$
    • Learning about $p_{ij}$ would be based on a single zero or one (only one entry for a single adjacency matrix)
70 / 328

Erdös Rényi Random Network Model

  • What if we assume every $p_{ij} = p$?
72 / 328

Erdös Rényi Random Network Model

  • What if we assume every $p_{ij} = p$?
  • $\mathbf A \sim ER_n(p)$ means that every $\mathbf a_{ij} \sim \text{Bern}(p)$ $i.i.d.$
    • No structure at all: every edge (regardless of any attributes about the nodes) has probability $p$
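A minimal \texttt{numpy} sketch of sampling a simple $ER_n(p)$ network (the values of $n$ and $p$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 0.3

# Sample each potential edge (i < j) independently as Bern(p),
# then symmetrize; the diagonal stays 0 (simple network).
A = np.zeros((n, n), dtype=int)
iu = np.triu_indices(n, k=1)
A[iu] = rng.binomial(1, p, size=len(iu[0]))
A = A + A.T
```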
73 / 328

What is the motivation for the Stochastic Block Model?

  • Nodes might not be structureless, but the structure might be simple enough to not have to resort to the $IER_n(P)$ case
  • What if we added a single attribute to each node, and allowed the nodes to be in similar "groups"?
    • learn about edges for nodes within groups, rather than learning about edges one at a time
75 / 328

Stochastic block model (working example)

  • Group of $100$ students, in one of two schools
  • Nodes: students
  • Edges: are a pair of students friends?
  • Attributes: which school does the student attend?
76 / 328

Stochastic block model (working example)

  • Group of $100$ students, in one of two schools
  • Nodes: students
  • Edges: are a pair of students friends?
  • Attributes: which school does the student attend?
    • expected: ability to model that students in the same school are (in general) more likely to be friends
77 / 328

Community assignment vector

  • Community assignment vector $\vec z$
    • Each element $z_i$ for a node $i$ takes one of $K$ possible values
    • The value $z_i$ is called the community of node $i$
78 / 328

Block matrix

  • Block matrix $B$: assigns probabilities to edges belonging to different pairs of communities
    • If there are $K$ communities, $B$ is a $K \times K$ matrix
79 / 328

Stochastic block model (SBM)

  • $\mathbf A \sim SBM_n(\vec z, B)$
    • $\mathbf a_{ij} ; z_i = k, z_j = l \sim \text{Bern}(b_{kl})$ $ind.$
    • block matrix defines the edge probabilities, given the community assignments
80 / 328

Stochastic block model (SBM)

  • $\mathbf A \sim SBM_n(\vec z, B)$
  • $\mathbf a_{ij} ; z_i = k, z_j = l \sim \text{Bern}(b_{kl})$ $ind.$
    • edges are correlated based on the communities of their incident nodes $i$ and $j$
    • once we know the community assignments, the edges are otherwise independent
81 / 328

How do we get a probability matrix for an SBM?

$$P = CBC^{\top}$$

  • $C$: one-hot encoding of the community assignment vector $\vec z$

    • $C$ is an $n \times K$ matrix, and: $$c_{ik} = \begin{cases}1, & z_i = k \\ 0, & z_i \neq k\end{cases}$$
  • $CB$ is an $n \times K$ matrix times a $K \times K$ matrix, so $n \times K$

$$CB = \begin{bmatrix}c_{11} & \cdots & c_{1K} \\ & \vdots & \\ c_{n1} & \cdots & c_{nK} \end{bmatrix} \begin{bmatrix}b_{11} & \cdots & b_{1K} \\ \vdots & \ddots & \vdots \\ b_{K1} & \cdots & b_{KK}\end{bmatrix}$$

82 / 328

Matrix Multiplication (Revisited)

Matrix product $AB$ is defined, if $A$ has $n$ rows and $m$ columns, and $B$ has $m$ rows with $w$ columns, to be:

$$AB = \begin{bmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{bmatrix} \begin{bmatrix} b_{11} & \cdots & b_{1w} \\ \vdots & \ddots & \vdots \\ b_{m1} & \cdots & b_{mw} \end{bmatrix}$$

  • $(AB)_{11} = \sum_{d = 1}^m a_{1d} b_{d1}$, the inner product of the first row of $A$ and the first column of $B$
  • In general, $(AB)_{ij} = \sum_{d = 1}^m a_{id} b_{dj}$
83 / 328

Breaking down the probability matrix for an SBM

$$P = CBC^{\top}$$

$$CB = \begin{bmatrix}c_{11} & \cdots & c_{1K} \\ & \vdots & \\ c_{n1} & \cdots & c_{nK} \end{bmatrix} \begin{bmatrix}b_{11} & \cdots & b_{1K} \\ \vdots & \ddots & \vdots \\ b_{K1} & \cdots & b_{KK}\end{bmatrix}$$

  • $(CB)_{ik} = \sum_{l = 1}^{K}c_{il} b_{lk}$

But: $$c_{il} = \begin{cases}1, & z_i = l\\ 0,& z_i \neq l\end{cases}$$

84 / 328

Breaking down the probability matrix for an SBM

$$P = CBC^{\top}$$

$$CB = \begin{bmatrix}c_{11} & \cdots & c_{1K} \\ & \vdots & \\ c_{n1} & \cdots & c_{nK} \end{bmatrix} \begin{bmatrix}b_{11} & \cdots & b_{1K} \\ \vdots & \ddots & \vdots \\ b_{K1} & \cdots & b_{KK}\end{bmatrix}$$

  • $(CB)_{ik} = \sum_{l = 1}^{K}c_{il} b_{lk}$

But: $$c_{il} = \begin{cases}1, & z_i = l\\ 0,& z_i \neq l\end{cases}$$

  • so $(CB)_{ik} = c_{iz_i} b_{z_i k} + \sum_{l \neq z_i} c_{il} b_{lk} = 1 \cdot b_{z_i k} + 0$
85 / 328

Breaking down the probability matrix for an SBM

$$P = CBC^{\top}$$

$$CB = \begin{bmatrix}c_{11} & \cdots & c_{1K} \\ & \vdots & \\ c_{n1} & \cdots & c_{nK} \end{bmatrix} \begin{bmatrix}b_{11} & \cdots & b_{1K} \\ \vdots & \ddots & \vdots \\ b_{K1} & \cdots & b_{KK}\end{bmatrix}$$

  • $(CB)_{ik} = \sum_{l = 1}^{K}c_{il} b_{lk} = b_{z_i k}$
86 / 328

Breaking down the probability matrix for an SBM

$$P = CBC^{\top}$$

  • each row is a node, and the column entries are the probabilities that the community of node $i$ connects with each of the communities $k$

$$CB = \begin{bmatrix}b_{z_{1}1} & \cdots & b_{z_{1} K} \\ & \vdots & \\ b_{z_n 1} & \cdots & b_{z_n K}\end{bmatrix}$$

87 / 328

Breaking down the probability matrix for an SBM

$$P = CBC^{\top}$$

Right-multiply $CB$ by $C^{\top}$:

$$CBC^{\top} = \begin{bmatrix}b_{z_{1}1} & \cdots & b_{z_{1} K} \\ & \vdots & \\ b_{z_n 1} & \cdots & b_{z_n K}\end{bmatrix}\begin{bmatrix} c_{11} & & c_{n1} \\ \vdots & \cdots & \vdots \\ c_{1K} & & c_{nK} \end{bmatrix}$$

$$(CBC^{\top})_{ij} = \sum_{l = 1}^{K}b_{z_{i}l}c_{jl}$$

Since $c_{jl} = 1$ when $z_{j} = l$ and $0$ otherwise:

$$(CBC^{\top})_{ij} = b_{z_{i} z_{j}}$$

88 / 328

Breaking down the probability matrix for an SBM

$$P = CBC^{\top} = \begin{bmatrix} b_{z_{1}z_{1}} & \cdots & b_{z_{1} z_{n}} \\ \vdots & \ddots & \vdots \\ b_{z_{n}z_{1}} & \cdots & b_{z_{n} z_{n}} \end{bmatrix}$$

  • Note that if $\mathbf a_{ij}; z_i = k, z_j = l \sim \text{Bern}(b_{kl})$, we want $p_{ij} = b_{kl}$
    • This gives us that $p_{ij} = b_{z_i z_j}$
89 / 328

Breaking down the probability matrix for an SBM

$$P = CBC^{\top} = \begin{bmatrix} b_{z_{1}z_{1}} & \cdots & b_{z_{1} z_{n}} \\ \vdots & \ddots & \vdots \\ b_{z_{n}z_{1}} & \cdots & b_{z_{n} z_{n}} \end{bmatrix}$$

  • Note that if $\mathbf a_{ij}; z_i = k, z_j = l \sim \text{Bern}(b_{kl})$, we want $p_{ij} = b_{kl}$
    • This gives us that $p_{ij} = b_{z_i z_j}$
  • Therefore, we could equivalently write the model $\mathbf A \sim SBM_n(\vec z, B)$ as:
    • $\mathbf a_{ij} \sim \text{Bern}(b_{z_i z_j})$ $ind.$
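A minimal \texttt{numpy} sketch of building $P = CBC^\top$ from a made-up community assignment vector and block matrix:

```python
import numpy as np

z = np.array([0, 0, 0, 1, 1, 1])   # community of each of n = 6 nodes (0-indexed)
B = np.array([[0.6, 0.1],          # K x K block matrix
              [0.1, 0.4]])

C = np.eye(2)[z]                   # n x K one-hot encoding of z

# Probability matrix: P = C B C^T, so p_ij = B[z_i, z_j].
P = C @ B @ C.T
assert np.allclose(P, B[z][:, z])
```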
90 / 328

Probability matrix for SBM example



91 / 328

Latent position matrix

  • For each node $i$, a latent position vector $\vec x_i$ is a $d$-dimensional vector
  • The latent position matrix just stacks these latent position vectors into an $n \times d$ matrix

$$X = \begin{bmatrix} \leftarrow & \vec x_1^\top & \rightarrow \\ & \vdots & \\ \leftarrow & \vec x_n^\top & \rightarrow \end{bmatrix}$$

  • $d$ is called the latent dimensionality
  • What do you notice about this arrangement of the latent positions?
93 / 328

Latent position matrix

  • For each node $i$, a latent position vector $\vec x_i$ is a $d$-dimensional vector
  • The latent position matrix just stacks these latent position vectors into an $n \times d$ matrix

$$X = \begin{bmatrix} \leftarrow & \vec x_1^\top & \rightarrow \\ & \vdots & \\ \leftarrow & \vec x_n^\top & \rightarrow \end{bmatrix}$$

  • $d$ is called the latent dimensionality
  • Key observation: the latent position matrix is tabular
94 / 328

Example of a latent position matrix



95 / 328

Random Dot Product Graph Model (RDPG)

  • $\mathbf A \sim RDPG_n(X)$
    • $\mathbf a_{ij} ; \vec x_i = \vec x, \vec x_j = \vec y \sim \text{Bern}(\vec x^\top \vec y)$ $ind.$
    • edges are correlated based on the latent positions of their incident nodes $i$ and $j$
96 / 328

Random Dot Product Graph Model (RDPG)

  • $\mathbf A \sim RDPG_n(X)$
    • $\mathbf a_{ij} ; \vec x_i = \vec x, \vec x_j = \vec y \sim \text{Bern}(\vec x^\top \vec y)$ $ind.$
    • once we know the latent positions of the nodes, the edges are otherwise independent
97 / 328

Probability matrix for an $RDPG_n(X)$ network

  • $P = XX^\top$ $$P = \begin{bmatrix} \leftarrow & \vec x_1^\top & \rightarrow \\ & \vdots & \\ \leftarrow & \vec x_n^\top & \rightarrow \end{bmatrix}\begin{bmatrix} \uparrow & & \uparrow \\ \vec x_1 & \cdots & \vec x_n\\ \downarrow & & \downarrow \end{bmatrix}$$
  • $\Rightarrow p_{ij} = \vec x_i^\top \vec x_j$
98 / 328

Probability matrix for an $RDPG_n(X)$ network

  • $P = XX^\top$ $$P = \begin{bmatrix} \leftarrow & \vec x_1^\top & \rightarrow \\ & \vdots & \\ \leftarrow & \vec x_n^\top & \rightarrow \end{bmatrix}\begin{bmatrix} \uparrow & & \uparrow \\ \vec x_1 & \cdots & \vec x_n\\ \downarrow & & \downarrow \end{bmatrix}$$
  • $\Rightarrow p_{ij} = \vec x_i^\top \vec x_j$
  • so $\mathbf A \sim RDPG_n(X)$ is equivalent to:
    • $\mathbf a_{ij} \sim \text{Bern}(\vec x_i^\top \vec x_j)$ $ind.$
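A minimal \texttt{numpy} sketch of sampling an $RDPG_n(X)$ network from made-up latent positions (chosen so every $\vec x_i^\top \vec x_j$ is a valid probability):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 0.7, size=(20, 2))   # n x d latent position matrix

P = X @ X.T                               # p_ij = x_i^T x_j (here always in [0, 1])
n = P.shape[0]

# Sample the upper triangle independently, symmetrize, keep the diagonal at 0.
A = np.zeros((n, n), dtype=int)
iu = np.triu_indices(n, k=1)
A[iu] = rng.binomial(1, P[iu])
A = A + A.T
```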
99 / 328

Properties of Random Networks

  • Just like networks themselves, random networks can be described using properties
    • Instead of describing realized properties, the properties are now random
    • E.g., random degrees, etc.
101 / 328

Properties of Random Networks

  • Just like networks themselves, random networks can be described using properties
    • Instead of describing realized properties, the properties are now random
    • E.g., random degrees, etc.
  • Note that if $\mathbf a_{ij} \sim \text{Bern}(p_{ij})$, then by LOTUS:

$$\begin{aligned}\mathbb E[\mathbf a_{ij}] &= \sum_{a \in \{0, 1\}}a \cdot \mathbb P(\mathbf a_{ij} = a) \\ &= 0 \cdot (1 - p_{ij}) + 1 \cdot p_{ij} \\ &= p_{ij}\end{aligned}$$

102 / 328

Random Network Expected Degree

  • networks have degrees $d_i$ for each node $i$; random networks have random degrees $\mathbf d_i$
  • Network degree: $d_i = \sum_{j \neq i} a_{ij}$
  • Random network degree: $\mathbf d_i = \sum_{j \neq i} \mathbf a_{ij}$
103 / 328

Random Network Expected Degree

  • networks have degrees $d_i$ for each node $i$; random networks have random degrees $\mathbf d_i$
  • Network degree: $d_i = \sum_{j \neq i} a_{ij}$
  • Random network degree: $\mathbf d_i = \sum_{j \neq i} \mathbf a_{ij}$
    • No longer a scalar: it has a distribution (it is random)
104 / 328

Expected Node Degree

  • expected node degree: $\mathbb E[\mathbf d_i]$
105 / 328

Expected Node Degree

  • expected node degree: $\mathbb E[\mathbf d_i]$
  • can use basic properties of expected values (linearity): $$\begin{aligned}\mathbb E[\mathbf d_i] &= \mathbb E \big[\sum_{j \neq i} \mathbf a_{ij} \big] \\ &= \sum_{j \neq i}\mathbb E[\mathbf a_{ij}] \\ &= \sum_{j \neq i} p_{ij},\end{aligned}$$ where we plugged in $\mathbb E[\mathbf a_{ij}] = p_{ij}$ from before
106 / 328

Expected Node Degree (SBM)

  • If $\mathbf A \sim SBM_n(\vec z, B)$, we get a special result
  • Let's consider the expected degree of a node $i$, if it is in a particular community $k$:

$$\mathbb E[\mathbf d_i; z_i = k] = \sum_{j \neq i} \mathbb E[\mathbf a_{ij}; z_i = k]$$

107 / 328

Expected Node Degree (SBM)

  • If $\mathbf A \sim SBM_n(\vec z, B)$, we get a special result
  • Let's consider the expected degree of a node $i$, if it is in a particular community $k$:

$$\mathbb E[\mathbf d_i; z_i = k] = \sum_{j \neq i} \mathbb E[\mathbf a_{ij}; z_i = k]$$

  • Recall: $p_{ij} = b_{z_i z_j}$, where we know that $z_i = k$
    • $\Rightarrow \mathbb E[\mathbf a_{ij}; z_i = k]$ can be written in the form $b_{k z_j}$
108 / 328

Expected Node Degree (SBM)

  • $\Rightarrow \mathbb E[\mathbf a_{ij}; z_i = k]$ can be written in the form $b_{k z_j}$
  • $\mathbb 1_{\{z_j = l\}} = \begin{cases} 1, & z_{j} = l \\ 0, & z_{j} \neq l\end{cases}$ $$b_{kz_j} = \sum_{l = 1}^K \mathbb 1_{\{z_j = l\}}b_{k l}$$
109 / 328

Expected Node Degree (SBM)

  • $\Rightarrow \mathbb E[\mathbf a_{ij}; z_i = k]$ can be written in the form $b_{k z_j}$
  • $\mathbb 1_{\{z_j = l\}} = \begin{cases} 1, & z_{j} = l \\ 0, & z_{j} \neq l\end{cases}$ $$b_{kz_j} = \sum_{l = 1}^K \mathbb 1_{\{z_j = l\}}b_{k l}$$ $$\begin{aligned}\mathbb E[\mathbf d_i; z_i = k] &= \sum_{j \neq i} \sum_{l = 1}^K \mathbb 1_{ \{z_j = l\} }b_{k l}\end{aligned}$$
110 / 328

Expected Node Degree (SBM)

  • $\Rightarrow \mathbb E[\mathbf a_{ij}; z_i = k]$ can be written in the form $\sum_{l = 1}^K\mathbb 1_{\{z_j = l\}}b_{k l}$ $$\begin{aligned}\mathbb E[\mathbf d_i; z_i = k] &= \sum_{j \neq i} \sum_{l = 1}^K \mathbb 1_{ \{z_j = l\} }b_{k l} \\ &= \sum_{l = 1}^K \sum_{j \neq i} \mathbb 1_{\{z_j = l\}}b_{kl} \\ &= \sum_{l = 1}^K \sum_{j : j \neq i, z_j = l} b_{kl}\end{aligned}$$
111 / 328

Expected Node Degree (SBM)

$$\begin{aligned}\mathbb E[\mathbf d_i; z_i = k] &= \sum_{l = 1}^K \sum_{j : j \neq i, z_j = l} b_{kl}\\ &= \sum_{j : j \neq i, z_{j} = k} b_{kk} + \sum_{l \neq k}\sum_{j : z_{j} = l}b_{kl}\end{aligned}$$

112 / 328

Expected Node Degree (SBM)

$$\begin{aligned}\mathbb E[\mathbf d_i; z_i = k] &= \sum_{l = 1}^K \sum_{j : j \neq i, z_j = l} b_{kl}\\ &= \sum_{j : j \neq i, z_{j} = k} b_{kk} + \sum_{l \neq k}\sum_{j : z_{j} = l}b_{kl}\end{aligned}$$

  • We are summing a fixed value (block probabilities) for each node, depending on which community it is in
113 / 328

Expected Node Degree (SBM)

$$\begin{aligned}\mathbb E[\mathbf d_i; z_i = k] &= \sum_{l = 1}^K \sum_{j : j \neq i, z_j = l} b_{kl}\\ &= \sum_{j : j \neq i, z_{j} = k} b_{kk} + \sum_{l \neq k}\sum_{j : z_{j} = l}b_{kl}\end{aligned}$$

  • We are summing a fixed value (block probabilities) for each node, depending on which community it is in $$\begin{aligned}\mathbb E[\mathbf d_i; z_i = k] &= b_{kk}\sum_{j : j \neq i, z_j = k}1 + \sum_{l \neq k} b_{kl} \sum_{j : z_j = l}1\end{aligned}$$
114 / 328

Expected Node Degree (SBM)

$$\begin{aligned}\mathbb E[\mathbf d_i; z_i = k] &= b_{kk}\sum_{j : j \neq i, z_j = k}1 + \sum_{l \neq k} b_{kl} \sum_{j : z_j = l}1\end{aligned}$$

115 / 328

Expected Node Degree (SBM)

$$\begin{aligned}\mathbb E[\mathbf d_i; z_i = k] &= b_{kk}\sum_{j : j \neq i, z_j = k}1 + \sum_{l \neq k} b_{kl} \sum_{j : z_j = l}1\end{aligned}$$

  • If $n_l = \sum_{j : z_j = l} 1$ are the counts for each community, we get: $$\mathbb E[\mathbf d_i; z_i = k] = (n_k - 1) b_{kk} + \sum_{l \neq k} n_{l} b_{kl}$$
    • $(n_k - 1)$: we summed over all of the nodes in community $k$, except for node $i$
    • $n_l$: we summed over all of the nodes in communities $l \neq k$ (so node $i$ is not a part of them)
116 / 328

Expected Node Degree (SBM)

$$\mathbb E[\mathbf d_i; z_i = k] = (n_k - 1) b_{kk} + \sum_{l \neq k} n_{l} b_{kl}$$

  • The expected node degrees for an SBM are homogeneous (the same) within a community
  • no dependence of $\mathbb E[\mathbf d_i; z_i = k]$ on $i$ itself
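A minimal \texttt{numpy} sketch of this formula with made-up community sizes and block matrix:

```python
import numpy as np

z = np.array([0] * 60 + [1] * 40)      # community assignments
B = np.array([[0.5, 0.1],
              [0.1, 0.3]])
n_counts = np.bincount(z)              # community sizes [60, 40]

def expected_degree(k):
    # (n_k - 1) * b_kk + sum over l != k of n_l * b_kl
    within = (n_counts[k] - 1) * B[k, k]
    between = sum(n_counts[l] * B[k, l] for l in range(len(n_counts)) if l != k)
    return within + between

print(expected_degree(0))  # 59 * 0.5 + 40 * 0.1 = 33.5
print(expected_degree(1))  # 39 * 0.3 + 60 * 0.1 = 17.7
```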
117 / 328

Expected values of matrices

  • If $\mathbf X$ is a random $n \times m$ matrix with entries $\mathbf x_{ij}$ which are random, then: $$\mathbb E[\mathbf X] = \begin{bmatrix} \mathbb E[\mathbf x_{11}] & \cdots & \mathbb E[\mathbf x_{1m}] \\ \vdots & \ddots & \vdots \\ \mathbb E[\mathbf x_{n1}] & \cdots & \mathbb E[\mathbf x_{nm}]\end{bmatrix}$$
118 / 328

Random degree matrix

  • Recall the degree matrix for an $n$-node network $A$ with node degrees $d_i$: $$D = \begin{bmatrix} d_1 & & \\ & \ddots & \\ & & d_n\end{bmatrix}$$
  • The random degree matrix of a random matrix $\mathbf A$ with random degrees $\mathbf d_i$: $$\mathbf D = \begin{bmatrix} \mathbf d_1 & & \\ & \ddots & \\ & & \mathbf d_n\end{bmatrix}$$
119 / 328
Expected degree matrix

  • The expected degree matrix of a random matrix $\mathbf A$ with random degrees $\mathbf d_i$: $$\mathcal D = \mathbb E[\mathbf D] = \begin{bmatrix} \mathbb E[\mathbf d_1] & & \\ & \ddots & \\ & & \mathbb E[\mathbf d_n]\end{bmatrix}$$
120 / 328

Expected degree matrix

  • The expected degree matrix of a random matrix $\mathbf A$ with random degrees $\mathbf d_i$: $$\mathcal D = \mathbb E[\mathbf D] = \begin{bmatrix} \mathbb E[\mathbf d_1] & & \\ & \ddots & \\ & & \mathbb E[\mathbf d_n]\end{bmatrix}$$
Expected value of random network
  • Since $\mathbb E[\mathbf X]$ is the matrix with entries $\mathbb E[\mathbf x_{ij}]$, we have: $$\mathbb E[\mathbf A] = P$$
    • since $\mathbb E[\mathbf a_{ij}] = p_{ij}$
121 / 328

Population network Laplacian

  • Remember the network Laplacian is $L = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$
    • has entries $l_{ij} = \frac{a_{ij}}{\sqrt{d_i} \sqrt{d_j}}$
    • to calculate it, we need $d_i > 0$ for all $i$, or equivalently, $a_{ij} = 1$ for at least one $j$ for every node $i$
122 / 328

Population network Laplacian

  • Remember the network Laplacian is $L = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$
    • has entries $l_{ij} = \frac{a_{ij}}{\sqrt{d_i} \sqrt{d_j}}$
    • to calculate it, we need $d_i > 0$ for all $i$, or equivalently, $a_{ij} = 1$ for at least one $j$ for every node $i$
  • The population network Laplacian is: $$\mathcal L = \mathcal D^{-\frac{1}{2}}P \mathcal D^{-\frac{1}{2}} = \mathbb E[\mathbf D]^{-\frac{1}{2}} \mathbb E[\mathbf A] \mathbb E[\mathbf D]^{-\frac{1}{2}}$$
    • has entries $\ell_{ij} = \frac{p_{ij}}{\sqrt{\mathbb E[\mathbf d_{i}]} \sqrt{\mathbb E[\mathbf d_{j}]}}$
    • To calculate it, we need $\mathbb E[\mathbf d_i] > 0$ for all $i$, or equivalently, $p_{ij} > 0$ for at least one $j$ for every node $i$
123 / 328

Academic network example

  • nodes: academics at a university
  • edges: have a pair of academics published a paper together?
  • communities: subject area (color)

(Source: kbucha73)

125 / 328

Academic network example

  • nodes: academics at a university
  • edges: have a pair of academics published a paper together?
  • communities: subject area (color)

(Source: kbucha73)

  • the "big shots" have higher degrees than other people even though they are in the same community
126 / 328

Degree-correction factors

  • recall: node degrees in an SBM are homogeneous within community: all have the same node degree
    • not realistic: most real networks will have some nodes that have more connections than others
127 / 328

Degree-correction factors

  • recall: node degrees in an SBM are homogeneous within community: all have the same node degree
    • not realistic: most real networks will have some nodes that have more connections than others
  • Degree-correction factor $\theta_i$ indicates "node importance"
128 / 328

Degree-correction vector/matrix

  • Degree-correction vector $\vec \theta$ is a length-$n$ vector of degree-correction factors for each node $$\vec \theta = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$
  • Degree-correction matrix $\Theta$ is an $n \times n$ diagonal matrix of degree-correction factors $$\Theta = \begin{bmatrix} \theta_1 & & \\ & \ddots & \\ & & \theta_n \end{bmatrix}$$
129 / 328

Degree-corrected SBM (DCSBM)

  • $\mathbf A \sim DCSBM_n(\vec \theta, \vec z, B)$
    • $\mathbf a_{ij} \sim \text{Bern}(\theta_i\theta_j b_{z_i z_j})$ $ind.$
    • edges are correlated by the degree-correction factors, as well as their community assignments
130 / 328

Degree-corrected SBM (DCSBM)

  • $\mathbf A \sim DCSBM_n(\vec \theta, \vec z, B)$
    • $\mathbf a_{ij} \sim \text{Bern}(\theta_i\theta_j b_{z_i z_j})$ $ind.$
    • edges are correlated by the degree-correction factors, as well as their community assignments
  • $p_{ij} = \theta_i\theta_j b_{z_i z_j}$
    • $\theta_i$ is bigger: higher probability (than in an SBM) for node $i$
    • $\theta_i$ is smaller: lower probability (than in an SBM) for node $i$
  • Probability matrix $P = \Theta C B C^\top \Theta^\top$
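A minimal \texttt{numpy} sketch of the DCSBM probability matrix with made-up degree-correction factors (kept small enough that every $p_{ij} \leq 1$):

```python
import numpy as np

z = np.array([0, 0, 0, 1, 1, 1])
B = np.array([[0.6, 0.1],
              [0.1, 0.4]])
theta = np.array([1.2, 1.0, 0.5, 1.1, 0.9, 0.6])   # degree-correction factors

C = np.eye(2)[z]
Theta = np.diag(theta)

# P = Theta C B C^T Theta^T, i.e. p_ij = theta_i * theta_j * B[z_i, z_j].
P = Theta @ C @ B @ C.T @ Theta.T
assert np.allclose(P, np.outer(theta, theta) * B[z][:, z])
```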
131 / 328

DCSBM Example

132 / 328

Degree-correction factor allows degree heterogeneities

$$\mathbb E[\mathbf d_i; z_i = k] = \theta_i \bigg(\sum_{l \neq k}\bigg[ b_{kl}\sum_{j : z_j = l} \theta_j\bigg] + b_{kk} \sum_{j \neq i : z_j = k}\theta_j\bigg)$$

(Handout page 64)

133 / 328

Degree-correction factor allows degree heterogeneities

$$\mathbb E[\mathbf d_i; z_i = k] = \theta_i \bigg(\sum_{l \neq k}\bigg[ b_{kl}\sum_{j : z_j = l} \theta_j\bigg] + b_{kk} \sum_{j \neq i : z_j = k}\theta_j\bigg)$$

(Handout page 64)

  • $\theta_i$ "degree-corrects" the degrees of each node
    • higher $\theta_i$: higher degree (than other nodes in the same community)
    • lower $\theta_i$: lower degree (than other nodes in the same community)
134 / 328

The concept of positive semi-definiteness

A matrix with real entries $R$ is positive semi-definite (PSD) if for some other matrix $Q$ with real entries, we can write:

$$R = QQ^\top$$

  • scary name; simple meaning
  • $Q$ is called the square-root matrix for $R$, and is often written as $\sqrt R$
136 / 328

The concept of positive semi-definiteness

A matrix with real entries $R$ is positive semi-definite (PSD) if for some other matrix $Q$ with real entries, we can write:

$$R = QQ^\top$$

  • scary name; simple meaning
  • $Q$ is called the square-root matrix for $R$, and is often written as $\sqrt R$
  • The probability matrix for an $RDPG_n(X)$ network is PSD: $P = XX^\top$
  • the latent position matrix $X$ is the square-root matrix for $P$
137 / 328

PSDness and block models

  • Remember that:
  • If $\mathbf A \sim SBM_n(\vec z, B)$, then $P = C B C^\top$
  • If $\mathbf A \sim DCSBM_n(\vec \theta, \vec z, B)$, then $P = \Theta C B C^\top \Theta^\top$
138 / 328

PSDness and block models

  • Remember that:
  • If $\mathbf A \sim SBM_n(\vec z, B)$, then $P = C B C^\top$
  • If $\mathbf A \sim DCSBM_n(\vec \theta, \vec z, B)$, then $P = \Theta C B C^\top \Theta^\top$
  • These almost look like the condition for positive semi-definiteness:
  • if $B$ has a square-root matrix $\sqrt B$, then:

SBM: $P = (C\sqrt B)(C\sqrt B)^\top = XX^\top$ where $X = C\sqrt B$

DCSBM: $P = (\Theta C \sqrt B) (\Theta C \sqrt B)^\top = XX^\top$ where $X = \Theta C \sqrt B$

139 / 328

PSDness and block models

  • Remember that:
  • If $\mathbf A \sim SBM_n(\vec z, B)$, then $P = C B C^\top$
  • If $\mathbf A \sim DCSBM_n(\vec \theta, \vec z, B)$, then $P = \Theta C B C^\top \Theta^\top$
  • These almost look like the condition for positive semi-definiteness:
  • if $B$ has a square-root matrix $\sqrt B$, then:

SBM: $P = (C\sqrt B)(C\sqrt B)^\top = XX^\top$ where $X = C\sqrt B$

DCSBM: $P = (\Theta C \sqrt B) (\Theta C \sqrt B)^\top = XX^\top$ where $X = \Theta C \sqrt B$

  • if $B$ is PSD, the probability matrix of the block model is PSD
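A minimal \texttt{numpy} sketch, assuming a made-up homophilic (hence PSD) block matrix, of forming a square-root matrix $\sqrt B$ and the latent position matrix $X = C\sqrt B$:

```python
import numpy as np

B = np.array([[0.6, 0.2],          # homophilic, hence PSD
              [0.2, 0.4]])

# One valid square-root matrix: use the eigendecomposition B = Q diag(w) Q^T.
w, Q = np.linalg.eigh(B)
sqrt_B = Q @ np.diag(np.sqrt(w)) @ Q.T
assert np.allclose(sqrt_B @ sqrt_B.T, B)

# Latent position matrix X = C sqrt(B); it yields the same probability matrix.
z = np.array([0, 0, 0, 1, 1, 1])
C = np.eye(2)[z]
X = C @ sqrt_B
assert np.allclose(X @ X.T, C @ B @ C.T)
```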
140 / 328

PSDness (the $2 \times 2$ case)

If $B$ is a $2 \times 2$ real matrix, $B$ is PSD if:

  1. $b_{11} \geq 0$
  • always satisfied, since block matrices have probabilities
  • valid probabilities cannot be negative
141 / 328

PSDness (the $2 \times 2$ case)

If $B$ is a $2 \times 2$ real matrix, $B$ is PSD if:

  1. $b_{11} \geq 0$, and:
  2. $det(B) \geq 0$.

$$B = \begin{bmatrix}b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$$ The determinant $det(B) = b_{11}b_{22} - b_{12}b_{21}$

  • condition 2 is equivalent to $b_{11} b_{22} \geq b_{12}b_{21}$
142 / 328

Homophilic block matrices

$$B = \begin{bmatrix}b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$$

  • A block matrix $B$ is called homophilic if the on-diagonal entries $b_{kk}$ exceed the off-diagonal entries $b_{kl}$ for all $k \neq l$
  • homophilic: like goes with like
  • within-community connections ($b_{kk}$) are more probable than between-community connections ($b_{kl}$ where $k \neq l$)
143 / 328

Homophilic block matrices

$$B = \begin{bmatrix}b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$$

  • A block matrix $B$ is called homophilic if the on-diagonal entries $b_{kk}$ exceed the off-diagonal entries $b_{kl}$ for all $k \neq l$
  • homophilic: like goes with like
  • within-community connections ($b_{kk}$) are more probable than between-community connections ($b_{kl}$ where $k \neq l$)
  • homophilic block matrices are PSD, as $b_{11}b_{22} > b_{12}b_{21}$ (condition $2$ satisfied)
144 / 328

Homophilic block matrices

$$B = \begin{bmatrix}b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$$

  • A block matrix $B$ is called homophilic if the on-diagonal entries $b_{kk}$ exceed the off-diagonal entries $b_{kl}$ for all $k \neq l$
  • homophilic: like goes with like
  • within-community connections ($b_{kk}$) are more probable than between-community connections ($b_{kl}$ where $k \neq l$)
  • homophilic block matrices are PSD, as $b_{11}b_{22} > b_{12}b_{21}$ (condition $2$ satisfied)
  • $b_{11}$ and $b_{22}$ are each greater than $b_{12}$ and $b_{21}$
145 / 328

Homophilic block matrix example



146 / 328

Block Models with PSD block matrices and RDPGs

If $\mathbf A^{(1)}$ and $\mathbf A^{(2)}$ are $IER_n(P)$ networks with probability matrices $P^{(1)}$ and $P^{(2)}$, we call them stochastically equivalent if $P^{(1)} = P^{(2)}$

  • stochastically equivalent: two random quantities are equal in terms of the parameters that govern them (for $IER_n(P)$ networks, probability matrices)
147 / 328

Block Models with PSD block matrices and RDPGs

If $\mathbf A^{(1)}$ and $\mathbf A^{(2)}$ are $IER_n(P)$ networks with probability matrices $P^{(1)}$ and $P^{(2)}$, we call them stochastically equivalent if $P^{(1)} = P^{(2)}$

Recall that if $B$ is PSD, for a random network $\mathbf A^{(1)} \sim SBM_n(\vec z, B)$ that:

SBM: $P^{(1)} = (C\sqrt B)(C\sqrt B)^\top = XX^\top$ where $X = C\sqrt B$

148 / 328

Block Models with PSD block matrices and RDPGs

If $\mathbf A^{(1)}$ and $\mathbf A^{(2)}$ are $IER_n(P)$ networks with probability matrices $P^{(1)}$ and $P^{(2)}$, we call them stochastically equivalent if $P^{(1)} = P^{(2)}$

Recall that if $B$ is PSD, for a random network $\mathbf A^{(1)} \sim SBM_n(\vec z, B)$ that:

SBM: $P^{(1)} = (C\sqrt B)(C\sqrt B)^\top = XX^\top$ where $X = C\sqrt B$

$\mathbf A^{(2)} \sim RDPG_n(X)$ where $X = C\sqrt B$

  • $\mathbf A^{(1)}$ (an SBM) is stochastically equivalent to $\mathbf A^{(2)}$ (an RDPG)
149 / 328

Block Models with PSD block matrices and RDPGs

If $\mathbf A^{(1)}$ and $\mathbf A^{(2)}$ are $IER_n(P)$ networks with probability matrices $P^{(1)}$ and $P^{(2)}$, we call them stochastically equivalent if $P^{(1)} = P^{(2)}$

Recall that if $B$ is PSD, for a random network $\mathbf A^{(1)} \sim SBM_n(\vec z, B)$ that:

SBM: $P^{(1)} = (C\sqrt B)(C\sqrt B)^\top = XX^\top$ where $X = C\sqrt B$

$\mathbf A^{(2)} \sim RDPG_n(X)$ where $X = C\sqrt B$

  • $\mathbf A^{(1)}$ (an SBM) is stochastically equivalent to $\mathbf A^{(2)}$ (an RDPG)
  • SBMs (and DCSBMs) with PSD block matrices are stochastically equivalent to some RDPG
150 / 328

Stochastic equivalence of SBMs and RDPGs example



151 / 328

Stochastic equivalence of DCSBMs and RDPGs example



152 / 328

Block Models with PSD block matrices and RDPGs

  • SBMs (and DCSBMs) with PSD block matrices are stochastically equivalent to some RDPG
  • PSD matrices are very intuitive and easy to work with
  • lots of linear algebra tricks and manipulations are designed for PSD matrices
153 / 328

Focus for second section

  • develop techniques for RDPGs
  • All techniques for RDPGs "work" for RDPGs and block models with PSD block matrices
  • they work more generally with non-PSD block matrices (advanced; unintuitive)

Key application

  • the latent position matrix for RDPGs is tabular: $X$ is $n \times d$
  • Nodes are like "observations"
  • latent dimensions of nodes are like "features"
154 / 328

Learning Representations of Networks

  • Probability matrix $P$ tells us about structure of network
156 / 328

Learning Representations of Networks

  • Probability matrix $P$ tells us about structure of network
  • But we never observe $P$! We observe adjacency matrices $A$
157 / 328

Learning Representations of Networks

  • Probability matrix $P$ tells us about structure of network
  • But we never observe $P$! We observe adjacency matrices $A$
  • So we study "$\hat P$" from representations of $A$

158 / 328

Latent Positions from Probability Matrix

  • Suppose we have a probability matrix $P$
    • Assume symmetric (undirected)
160 / 328

Latent Positions from Probability Matrix

  • Suppose we have a probability matrix $P$
    • Assume symmetric (undirected)
  • If $P$ is PSD, then $P = XX^\top$ with $X = \sqrt P$.
    • Recall: PSD means you can take square roots.
161 / 328

Latent Positions from Probability Matrix

  • Suppose we have a probability matrix $P$
    • Assume symmetric (undirected)
  • If $P$ is PSD, then $P = XX^\top$ with $X = \sqrt P$.
    • Recall: PSD means you can take square roots.
  • Assume $X$ has $n$ rows and $d$ columns
    • $n$ nodes, $d$ latent dimensions
    • none of the columns are redundant (the rank of $X$ is $d$).
    • entries are in $\mathbb R$
162 / 328

Latent Positions from Probability Matrix

  • Suppose we have a probability matrix $P$
    • Assume symmetric (undirected)
  • If $P$ is PSD, then $P = XX^\top$ with $X = \sqrt P$.
    • Recall: PSD means you can take square roots.
  • Assume $X$ has $n$ rows and $d$ columns
    • $n$ nodes, $d$ latent dimensions
    • none of the columns are redundant (the rank of $X$ is $d$).
    • entries are in $\mathbb R$
  • What does it mean to take a square root of a matrix?
163 / 328

Eigendecomposition ($\texttt{evd}$) of Symmetric Matrix

  • Recall that if $R$ is an $n \times n$ symmetric matrix, then $R$ has an $\texttt{evd}$:
164 / 328

Eigendecomposition ($\texttt{evd}$) of Symmetric Matrix

  • Recall that if $R$ is an $n \times n$ symmetric matrix, then $R$ has an $\texttt{evd}$:

$$R = Q\Lambda Q^\top$$

  • $\Lambda$: $n \times n$ diagonal matrix of the ordered (in decreasing order) eigenvalues of $R$
  • $Q$: $n \times n$ matrix whose columns are eigenvectors of $R$.
  • If $R$ is rank $d$, $\Lambda$ has $d$ non-zero eigenvalues; $n - d$ zero eigenvalues.
165 / 328

$\texttt{evd}$ of Symmetric + PSD Matrix

  • Assume $R$ is also PSD with rank $d$.
166 / 328

$\texttt{evd}$ of Symmetric + PSD Matrix

  • Assume $R$ is also PSD with rank $d$.
  • Then $R$ has $d$ non-negative eigenvalues
    • $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d > 0$
167 / 328

$\texttt{evd}$ of Symmetric + PSD Matrix

  • Assume $R$ is also PSD with rank $d$.
  • Then $R$ has $d$ non-negative eigenvalues
    • $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d > 0$
  • Keeping the $d$ non-zero eigenvalues:
    • $Q_d$ : $n \times d$ matrix w/ first $d$ eigenvectors of $R$
    • $\Lambda_d$ : $d \times d$ diagonal matrix w/ first $d$ eigenvalues of $R$ $$Q_d = \begin{bmatrix} \uparrow & & \uparrow \\ \vec{q_1} & \cdots & \vec{q_d} \\ \downarrow & & \downarrow \end{bmatrix}, \quad \Lambda_d = \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_d \end{bmatrix}$$
  • Decompose $R$ by $R = Q_d\Lambda_d Q_d^\top$
168 / 328

$\texttt{evd}$ of Probability Matrix $P$

  • Let $P$ be an $n \times n$ PSD matrix with rank $d$.
169 / 328

$\texttt{evd}$ of Probability Matrix $P$

  • Let $P$ be an $n \times n$ PSD matrix with rank $d$.
  • We can obtain latent positions of $P$ by $\texttt{evd}(P)$

$$P = Q_d \Lambda_d Q_d^\top = (Q_d \sqrt{\Lambda_d})(Q_d \sqrt{\Lambda_d})^\top = XX^\top$$

170 / 328

$\texttt{evd}$ of Probability Matrix $P$

  • Let $P$ be an $n \times n$ PSD matrix with rank $d$.
  • We can obtain latent positions of $P$ by $\texttt{evd}(P)$

$$P = Q_d \Lambda_d Q_d^\top = (Q_d \sqrt{\Lambda_d})(Q_d \sqrt{\Lambda_d})^\top = XX^\top$$ with

$$X = Q_d \sqrt{\Lambda_d} = \sqrt{P}$$

171 / 328

$\texttt{evd}$ of Probability Matrix $P$

  • Let $P$ be an $n \times n$ PSD matrix with rank $d$.
  • We can obtain latent positions of $P$ by $\texttt{evd}(P)$

$$P = Q_d \Lambda_d Q_d^\top = (Q_d \sqrt{\Lambda_d})(Q_d \sqrt{\Lambda_d})^\top = XX^\top$$ with

$$X = Q_d \sqrt{\Lambda_d} = \sqrt{P}$$ and

$$\sqrt{\Lambda_d} = \begin{bmatrix} \sqrt{\lambda_1} & & \\ & \ddots & \\ & & \sqrt{\lambda_d} \end{bmatrix}$$
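A minimal \texttt{numpy} sketch of recovering latent positions from a rank-$d$ PSD probability matrix via its $\texttt{evd}$ (using a made-up $2$-community SBM probability matrix):

```python
import numpy as np

z = np.array([0, 0, 0, 1, 1, 1])
B = np.array([[0.6, 0.2],
              [0.2, 0.4]])
C = np.eye(2)[z]
P = C @ B @ C.T                    # rank-2 PSD probability matrix

# np.linalg.eigh returns eigenvalues in increasing order, so keep the top d.
evals, evecs = np.linalg.eigh(P)
d = 2
Q_d, Lam_d = evecs[:, -d:], evals[-d:]

X = Q_d * np.sqrt(Lam_d)           # X = Q_d sqrt(Lambda_d)
assert np.allclose(X @ X.T, P)
```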

172 / 328

Limitations of Eigendecomposition

  • In practice, we never observe $P$, but observe the adjacency matrix $A$.
  • So let's compute $\texttt{evd}(A)$!
173 / 328

Limitations of Eigendecomposition

  • In practice, we never observe $P$, but observe the adjacency matrix $A$.
  • So let's compute $\texttt{evd}(A)$!
  • If $A$ is PSD, then $\texttt{evd}(A)$ gives us real latent positions of $A$.
174 / 328

Limitations of Eigendecomposition

  • In practice, we never observe $P$, but observe the adjacency matrix $A$.
  • So let's compute $\texttt{evd}(A)$!
  • If $A$ is PSD, then $\texttt{evd}(A)$ gives us real latent positions of $A$.
  • But what happens when $A$ is not PSD?
    • Can happen even if $P$ is PSD.
175 / 328

Limitations of Eigendecomposition

  • Consider the following probability and sampled adjacency matrix:

$$P = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}, \quad A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

176 / 328

Limitations of Eigendecomposition

  • Consider the following probability and sampled adjacency matrix:

$$P = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}, \quad A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

  • $A$ is not PSD: it has eigenvalues $1$ and $-1$. $$\Lambda = \begin{bmatrix} 1 & \\ & -1 \end{bmatrix}$$
177 / 328

Limitations of Eigendecomposition

  • Consider the following probability and sampled adjacency matrix:

$$P = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}, \quad A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

  • $A$ is not PSD: it has eigenvalues $1$ and $-1$. $$\Lambda = \begin{bmatrix} 1 & \\ & -1 \end{bmatrix}$$
  • When we compute the square root of $\Lambda$, we get a non-real matrix! $$\sqrt\Lambda = \begin{bmatrix} 1 & \\ & i \end{bmatrix}$$
178 / 328

Limitations of Eigendecomposition

  • Consider the following probability and sampled adjacency matrix:

$$P = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}, \quad A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

  • $A$ is not PSD: it has eigenvalues $1$ and $-1$. $$\Lambda = \begin{bmatrix} 1 & \\ & -1 \end{bmatrix}$$
  • When we compute the square root of $\Lambda$, we get a non-real matrix! $$\sqrt\Lambda = \begin{bmatrix} 1 & \\ & i \end{bmatrix}$$ $\implies$ results in non-real latent positions!
179 / 328

Singular Value Decomposition ($\texttt{svd}$) of Symmetric Matrices

  • Recall that if $R$ is an $n \times n$ symmetric matrix, then $R$ has an $\texttt{svd}$:
180 / 328

Singular Value Decomposition ($\texttt{svd}$) of Symmetric Matrices

  • Recall that if $R$ is an $n \times n$ symmetric matrix, then $R$ has an $\texttt{svd}$: $$R = U \Sigma V^\top$$
  • $\Sigma$ : $n \times n$ diagonal matrix of the ordered (in decreasing order) singular values of $R$
  • $U$ : $n \times n$ matrix whose columns are singular vectors of $R$.
    • Singular values are always real and non-negative
    • symmetry: first $d$ columns of $V$ equal to first $d$ cols of $U$
181 / 328

Singular Value Decomposition ($\texttt{svd}$) of Symmetric Matrices

  • Recall that if $R$ is an $n \times n$ symmetric matrix, then $R$ has an $\texttt{svd}$: $$R = U \Sigma V^\top$$
  • $\Sigma$ : $n \times n$ diagonal matrix of the ordered (in decreasing order) singular values of $R$
  • $U$ : $n \times n$ matrix whose columns are singular vectors of $R$.
    • Singular values are always real and non-negative
    • symmetry: first $d$ columns of $V$ equal to first $d$ cols of $U$
  • If $R$ is rank $d$, $\Sigma$ has $d$ non-zero SVs; $n - d$ zero SVs
182 / 328

Singular Value Decomposition (svd\texttt{svd}) of Symmetric Matrices

  • Recall that if RR is a n×nn\times n symmetric matrix, then RR has svd\texttt{svd}:R=UΣVR = U \Sigma V^\top
  • Σ\Sigma : n×nn\times n diagonal matrix of the ordered (in decreasing order) singular values of RR
  • UU : n×nn\times n the matrix whose columns are singular vectors of RR.
    • Singular values are always real and non-negative
    • symmetry: first dd columns of VV equal to first dd cols of UU
  • If RR is rank dd, Σ\Sigma has dd non-zero SVs; ndn-d zero SVs
  • Taking the dd non-zero singular values: R=UdΣdUdR = U_d \Sigma_d U_d^\top
183 / 328

svd\texttt{svd} of Probability Matrix PP

  • Taking svd\texttt{svd} of PP: P=XX,X=UdΣd P = XX^\top, X = U_d \sqrt{\Sigma_d}
184 / 328

svd\texttt{svd} of Probability Matrix PP

  • Taking svd\texttt{svd} of PP: P=XX,X=UdΣd P = XX^\top, X = U_d \sqrt{\Sigma_d}
  • Applying svd\texttt{svd} to AA while keeping dd number of singular values/vectors: X^=U^dΣ^d\hat X = \hat U_d \sqrt{\hat \Sigma_d}
  • Gives us an estimate of the probability matrix PP. P^=X^X^\hat P = \hat X \hat X^\top
185 / 328

svd\texttt{svd} of Probability Matrix PP

  • Taking svd\texttt{svd} of PP: P=XX,X=UdΣd P = XX^\top, X = U_d \sqrt{\Sigma_d}
  • Applying svd\texttt{svd} to AA while keeping dd number of singular values/vectors: X^=U^dΣ^d\hat X = \hat U_d \sqrt{\hat \Sigma_d}
  • Gives us an estimate of the probability matrix PP. P^=X^X^\hat P = \hat X \hat X^\top
  • This is called the adjacency spectral embedding.
186 / 328

Adjacency Spectral Embedding

  • “Adjacency” - operates on the adjacency matrix AA
187 / 328

Adjacency Spectral Embedding

  • “Adjacency” - operates on the adjacency matrix AA
  • “Spectral” - intuition relies on the eigenvalues/eigenvectors
188 / 328

Adjacency Spectral Embedding

  • “Adjacency” - operates on the adjacency matrix AA
  • “Spectral” - intuition relies on the eigenvalues/eigenvectors
  • “Embedding” - finds a mathematical structure contained in another (a tabular structure contained within the adjacency matrix).
189 / 328

Adjacency Spectral Embedding

  • “Adjacency” - operates on the adjacency matrix AA
  • “Spectral” - intuition relies on the eigenvalues/eigenvectors
  • “Embedding” - finds a mathematical structure contained in another (a tabular structure contained within the adjacency matrix).
Procedure
  1. Decompose A=UΣVA= U\Sigma V^\top via svd\texttt{svd}
  2. Keep dd largest singular values, and obtain UdU_d and Σd\Sigma_d.
  3. Compute Σd\sqrt{\Sigma_d} where entries are σi\sqrt{\sigma_i} for 1,,d1,\ldots,d.
  4. Compute estimated latent positions X^=UdΣd\hat X = U_d \sqrt{\Sigma_d}.
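A sketch of this procedure in plain numpy (here A is any symmetric adjacency matrix and d the chosen number of dimensions; the graspologic package provides a ready-made AdjacencySpectralEmbed for the same task):

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding: return n x d estimated latent positions."""
    U, S, Vt = np.linalg.svd(A)        # 1. decompose A = U Sigma V^T
    U_d, S_d = U[:, :d], S[:d]         # 2. keep the d largest singular values/vectors
    return U_d * np.sqrt(S_d)          # 3.-4. Xhat = U_d sqrt(Sigma_d)

# e.g. Xhat = ase(A, d=2); Phat = Xhat @ Xhat.T estimates the probability matrix
```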
190 / 328

ase\texttt{ase} Example - Sampling from SBMs

  • Suppose we have SBM with n=100n = 100 nodes and K=2K = 2 communities.
191 / 328

ase\texttt{ase} Example - True vs Estimated Probability Matrices

192 / 328

ase\texttt{ase} Example - Estimated Latent Positions

193 / 328

Non-identifiability of RDPG

  • Non-identifiability: multiple solutions to the same problem
    • Many "reasonable" latent positions XX produce same PP
194 / 328

Non-identifiability of RDPG

  • Non-identifiability: multiple solutions to the same problem
    • Many "reasonable" latent positions XX produce same PP
  • Rotational non-identifiability: rotation of latent positions
  • Given a rotation matrix WW, and Y=XWY=XW:
    • RDPGn(X)_n(X) and RDPGn(Y)_n(Y) have same probability matrix PP
195 / 328

Non-identifiability of RDPG

  • Non-identifiability: multiple solutions to the same problem
    • Many "reasonable" latent positions XX produce same PP
  • Rotational non-identifiability: rotation of latent positions
  • Given a rotation matrix WW, and Y=XWY=XW:
    • RDPGn(X)_n(X) and RDPGn(Y)_n(Y) have same probability matrix PP
    • XX and YY are rotationally equivalent
196 / 328

Non-identifiability of RDPG

  • Non-identifiability: multiple solutions to the same problem
    • Many "reasonable" latent positions XX produce same PP
  • Rotational non-identifiability: rotation of latent positions
  • Given a rotation matrix WW, and Y=XWY=XW:
    • RDPGn(X)_n(X) and RDPGn(Y)_n(Y) have same probability matrix PP
    • XX and YY are rotationally equivalent

P=YY=XW(XW)=XWWX=XIdX=XX \begin{aligned} P &= YY^\top \\ &= XW(XW)^\top \\ &= XWW^\top X^\top \\ &= XI_dX^\top \\ &= XX^\top \end{aligned}
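A quick numerical check of this identity (the latent positions and the rotation angle below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.1, 0.6, size=(5, 2))            # some latent positions
theta = np.pi / 3
W = np.array([[np.cos(theta), -np.sin(theta)],    # a 2 x 2 rotation matrix
              [np.sin(theta),  np.cos(theta)]])

Y = X @ W
print(np.allclose(X @ X.T, Y @ Y.T))              # True: same probability matrix P
```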

197 / 328

Visualization of Rotational Non-identifiability

198 / 328

Visualization of Rotational Non-identifiability

  • Caveat: be careful when comparing different latent positions
199 / 328

Why use ase\texttt{ase} at all?

  1. ase\texttt{ase} tabularizes your adjacency matrix
    • n×dn\times d real matrix, where dd was the number of dimensions that you selected to embed into
    • explore the nodes using more traditional machine learning approaches
200 / 328

Why use ase\texttt{ase} at all?

  1. ase\texttt{ase} tabularizes your adjacency matrix
    • n×dn\times d real matrix, where dd was the number of dimensions that you selected to embed into
    • explore the nodes using more traditional machine learning approaches
  2. ase\texttt{ase} decouples the dependencies of the random network
    • Want to study PP which encodes structure of networks
    • Let's study XX latent positions, which also encodes same structure
201 / 328

Why use ase\texttt{ase} at all?

  1. ase\texttt{ase} tabularizes your adjacency matrix
    • n×dn\times d real matrix, where dd was the number of dimensions that you selected to embed into
    • explore the nodes using more traditional machine learning approaches
  2. ase\texttt{ase} decouples the dependencies of the random network
    • Want to study PP which encodes structure of networks
    • Let's study XX latent positions, which also encodes same structure
  3. If we assume A\mathbf A is a random network,
    • theoretical results     \implies as our networks get larger (more data), our estimates of PP get better
    • X^\hat X is a useful surrogate for studying XX
    • can be useful even under model misspecification
202 / 328

Laplacian Spectral Embedding (lse\texttt{lse})

  • Remember the network Laplacian is L=D12AD12L = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}
204 / 328

Laplacian Spectral Embedding (lse\texttt{lse})

  • Remember the network Laplacian is L=D12AD12L = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}
  • Laplacian spectral embedding is applying svd\texttt{svd} to LL
    • keep dd number of singular values/vectors:

L=XX,X=UdΣdL = XX^\top, X = U_d \sqrt{\Sigma_d}

  • Here, XX is the latent positions of the network Laplacian
205 / 328

Laplacian Spectral Embedding (lse\texttt{lse})

  • Remember the network Laplacian is L=D12AD12L = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}
  • Laplacian spectral embedding is applying svd\texttt{svd} to LL
    • keep dd number of singular values/vectors:

L=XX,X=UdΣdL = XX^\top, X = U_d \sqrt{\Sigma_d}

  • Here, XX is the latent positions of the network Laplacian
  • Conceptually, think of it as ase(L)\texttt{ase}(L)
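A minimal lse sketch, reusing the ase helper from earlier (assumes the network has no isolated nodes, so every degree is positive):

```python
import numpy as np

def lse(A, d):
    """Laplacian spectral embedding: embed L = D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    return ase(L, d)                                    # same SVD-and-scale step as ase
```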
206 / 328

ase\texttt{ase} vs lse\texttt{lse}

  • Suppose A\mathbf A is a DCSBMn(θ,z,B)_n(\vec \theta, \vec z, B) random network with n=100n = 100 nodes and K=2K = 2 communities.
207 / 328

ase\texttt{ase} vs lse\texttt{lse}

  • Suppose A\mathbf A is a DCSBMn(θ,z,B)_n(\vec \theta, \vec z, B) random network with n=100n = 100 nodes and K=2K = 2 communities.
208 / 328

ase\texttt{ase} vs lse\texttt{lse}

  • lse\texttt{lse} preserves within-community structure with degree-corrections
  • Intuitively makes sense since LL is normalized by node degree
209 / 328

ase\texttt{ase} vs lse\texttt{lse}

  • How do we determine when to use lse\texttt{lse}?
210 / 328

ase\texttt{ase} vs lse\texttt{lse}

  • How do we determine when to use lse\texttt{lse}?
  • Heavy-tailed degree distributions
211 / 328

Practical Considerations

  • dd latent dimensions is a hyperparameter.
212 / 328

Practical Considerations

  • dd latent dimensions is a hyperparameter.
  • How do we choose dd?
    • Visual inspection of singular values (scree plot)
213 / 328

Practical Considerations

  • dd latent dimensions is a hyperparameter.
  • How do we choose dd?
    • Visual inspection of singular values (scree plot)
    • Automated "elbow" detection algorithms
214 / 328

Practical Considerations

  • dd latent dimensions is a hyperparameter.
  • How do we choose dd?
    • Visual inspection of singular values (scree plot)
    • Automated "elbow" detection algorithms
    • Try multiple dd and evaluate performance on subsequent tasks
215 / 328

What is community detection?

  • Goal: find groups of nodes that are more "similar" to each other than to all other nodes.
217 / 328

What is community detection?

  • Goal: find groups of nodes that are more "similar" to each other than to all other nodes.
  • Let's suppose you have a network AA with nn nodes.
218 / 328

What is community detection?

  • Goal: find groups of nodes that are more "similar" to each other than to all other nodes.
  • Let's suppose you have a network AA with nn nodes.
    • Assume ASBMnA\sim SBM_n, but we don't know z\vec z!
219 / 328

What is community detection?

  • Goal: find groups of nodes that are more "similar" to each other than to all other nodes.
  • Let's suppose you have a network AA with nn nodes.
    • Assume ASBMnA\sim SBM_n, but we don't know z\vec z!
    • Community detection algorithms estimate z\vec z.
220 / 328

What is community detection?

  • Goal: find groups of nodes that are more "similar" to each other than to all other nodes.
  • Let's suppose you have a network AA with nn nodes.
    • Assume ASBMnA\sim SBM_n, but we don't know z\vec z!
    • Community detection algorithms estimate z\vec z.
  • Note: In unsupervised learning, called clustering
221 / 328

What is community detection?

  • Goal: find groups of nodes that are more "similar" to each other than to all other nodes.
  • Let's suppose you have a network AA with nn nodes.
    • Assume ASBMnA\sim SBM_n, but we don't know z\vec z!
    • Community detection algorithms estimate z\vec z.
  • Note: In unsupervised learning, called clustering
Procedure for community detection
  1. Learn representations of your network
    • Embed using ase\texttt{ase} or lse\texttt{lse} in dd dimensions
222 / 328

What is community detection?

  • Goal: find groups of nodes that are more "similar" to each other than to all other nodes.
  • Let's suppose you have a network AA with nn nodes.
    • Assume ASBMnA\sim SBM_n, but we don't know z\vec z!
    • Community detection algorithms estimate z\vec z.
  • Note: In unsupervised learning, called clustering
Procedure for community detection
  1. Learn representations of your network
    • Embed using ase\texttt{ase} or lse\texttt{lse} in dd dimensions
  2. Cluster the representations
    • e.g. k-means
223 / 328

What is community detection?

  • Goal: find groups of nodes that are more "similar" to each other than to all other nodes.
  • Let's suppose you have a network AA with nn nodes.
    • Assume ASBMnA\sim SBM_n, but we don't know z\vec z!
    • Community detection algorithms estimate z\vec z.
  • Note: In unsupervised learning, called clustering
Procedure for community detection
  1. Learn representations of your network
    • Embed using ase\texttt{ase} or lse\texttt{lse} in dd dimensions
  2. Cluster the representations
    • e.g. k-means
  3. Evaluate your Clusters
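One way to run this pipeline with sklearn, on a toy stand-in for the estimated latent positions (the synthetic blobs below merely mimic what ase would return for a 3-community network):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# toy stand-in for Xhat: three well-separated blobs in 2 dimensions
rng = np.random.default_rng(0)
z = np.repeat([0, 1, 2], 100)                                  # "true" communities
centers = np.array([[0.8, 0.1], [0.1, 0.8], [0.5, 0.5]])
Xhat = centers[z] + 0.05 * rng.standard_normal((300, 2))

# 2. cluster the representations
zhat = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xhat)

# 3. evaluate the clusters, here against the known labels (ARI near 1)
print(adjusted_rand_score(z, zhat))
```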
224 / 328

Example - Sampling from DCSBM

  • Consider a DCSBMn_n with n=300n = 300 nodes and K=3K = 3 communities.
  • Randomly shuffle the ordering of nodes.
225 / 328

Example - Sampling from DCSBM

  • Consider a DCSBMn_n with n=300n = 300 nodes and K=3K = 3 communities.
  • Randomly shuffle the ordering of nodes.
226 / 328

Example - Sampling from DCSBM

  • Consider a DCSBMn_n with n=300n = 300 nodes and K=3K = 3 communities.
  • Randomly shuffle the ordering of nodes.
  • Can you find the communities?
227 / 328

Example - Latent Positions from ase\texttt{ase}

  • Estimate latent positions X^\hat X from ase\texttt{ase} of sampled adjacency matrix
  • Label latent positions by a priori known communities
228 / 328

Example - Latent Positions from ase\texttt{ase}

  • Estimate latent positions X^\hat X from ase\texttt{ase} of sampled adjacency matrix
  • Label latent positions by a priori known communities
229 / 328

Example - Latent Positions from ase\texttt{ase}

  • Estimate latent positions X^\hat X from ase\texttt{ase} of sampled adjacency matrix
  • Label latent positions by a priori known communities
  • X^\hat X retains the community structure
230 / 328

Estimate communities with 3-means

  • Use 3-means clustering on X^\hat X
  • Evaluate clustering by visualizing classification
231 / 328

Estimate communities with 3-means

  • Use 3-means clustering on X^\hat X
  • Evaluate clustering by visualizing classification
232 / 328

Clustering with Unknown Number of Communities

  • In real data, the number of communities is not obvious.
  • How do we find "optimal" number of communities?
233 / 328

Clustering with Unknown Number of Communities

  • In real data, the number of communities is not obvious.
  • How do we find "optimal" number of communities?
Procedure
  1. Learn representations of your network
    • Embed using ase\texttt{ase} or lse\texttt{lse} in dd dimensions
234 / 328

Clustering with Unknown Number of Communities

  • In real data, the number of communities is not obvious.
  • How do we find "optimal" number of communities?
Procedure
  1. Learn representations of your network
    • Embed using ase\texttt{ase} or lse\texttt{lse} in dd dimensions
  2. For k2,3,,Kk\in{2, 3, \ldots, K} number of communities
    1. Use k-means to cluster latent positions
    2. Compute silhouette score
235 / 328

Clustering with Unknown Number of Communities

  • In real data, the number of communities is not obvious.
  • How do we find "optimal" number of communities?
Procedure
  1. Learn representations of your network
    • Embed using ase\texttt{ase} or lse\texttt{lse} in dd dimensions
  2. For k2,3,,Kk\in{2, 3, \ldots, K} number of communities
    1. Use k-means to cluster latent positions
    2. Compute silhouette score
  3. Choose kk that has the highest silhouette score
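A sketch of this loop with sklearn (Xhat is the n x d embedding from step 1; k_max is whatever upper bound you are willing to consider):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(Xhat, k_max=10):
    """Return the number of clusters in {2, ..., k_max} with the best silhouette score."""
    scores = {}
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xhat)
        scores[k] = silhouette_score(Xhat, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```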
236 / 328

Example

237 / 328

Practical Considerations

  • Choice of embeddings
    • lse\texttt{lse} can be robust to degree variations
    • Can even consider non-spectral embeddings!
      • node2vec
238 / 328

Practical Considerations

  • Choice of embeddings
    • lse\texttt{lse} can be robust to degree variations
    • Can even consider non-spectral embeddings!
      • node2vec
  • Choice of embedding dimension dd
    • Different dd can give different clustering results
239 / 328

Practical Considerations

  • Choice of embeddings
    • lse\texttt{lse} can be robust to degree variations
    • Can even consider non-spectral embeddings!
      • node2vec
  • Choice of embedding dimension dd
    • Different dd can give different clustering results
  • Choice of clustering algorithm
    • Gaussian mixture models (gmm\texttt{gmm}) can give better results
    • Fuzzy clustering (soft clustering)
240 / 328

Two-Truths of Spectral Clustering

  • Averaged from 100s of human brain networks
    • Assume a 4-block SBM: {left, right} x {gray matter, white matter}
  • Embed using ase\texttt{ase} or lse\texttt{lse}
  • Estimate 2 communities using gmm\texttt{gmm}
241 / 328

Two-Truths of Spectral Clustering

Clusters gmm\texttt{gmm} o ase\texttt{ase} gmm\texttt{gmm} o lse\texttt{lse}
1 right GM , left GM right GM , right WM
2 right WM , left WM left GM , left WM
242 / 328

Two-Truths of Spectral Clustering

  • Core-periphery: densely and sparsely connected within, densely connected across
  • Affinity: densely connected within, sparsely connected across
243 / 328

What is Two-sample Hypothesis Testing?

  • Given two networks A(1)A^{(1)} and A(2)A^{(2)}:
245 / 328

What is Two-sample Hypothesis Testing?

  • Given two networks A(1)A^{(1)} and A(2)A^{(2)}:
    • Are they the "same"? (null hypothesis H0H_0)
246 / 328

What is Two-sample Hypothesis Testing?

  • Given two networks A(1)A^{(1)} and A(2)A^{(2)}:
    • Are they the "same"? (null hypothesis H0H_0)
    • Are they "different"? (alternate hypothesis HAH_A)
247 / 328

What is Two-sample Hypothesis Testing?

  • Given two networks A(1)A^{(1)} and A(2)A^{(2)}:
    • Are they the "same"? (null hypothesis H0H_0)
    • Are they "different"? (alternate hypothesis HAH_A)
General Strategy
  1. Assume a model for the networks
  2. Formulate a hypothesis test based on the model
  3. Compute a test statistic
248 / 328

What is Two-sample Hypothesis Testing?

  • Given two networks A(1)A^{(1)} and A(2)A^{(2)}:
    • Are they the "same"? (null hypothesis H0H_0)
    • Are they "different"? (alternate hypothesis HAH_A)
General Strategy
  1. Assume a model for the networks
  2. Formulate a hypothesis test based on the model
  3. Compute a test statistic
Assumptions
  • Same number of nodes in both networks
  • One-to-one correspondence of nodes between both networks
249 / 328

Example - Traffic Patterns

  • Goal: examine traffic pattern differences during the daytime (7AM-7PM) compared to night time (7PM-7AM).
250 / 328

Example - Traffic Patterns

  • Goal: examine traffic pattern differences during the daytime (7AM-7PM) compared to night time (7PM-7AM).
  • 100 towns (nodes) from K=3K=3 states
    • 45 towns from NY
    • 30 towns from NJ
    • 25 towns from PA
251 / 328

Example - Traffic Patterns

  • Goal: examine traffic pattern differences during the daytime (7AM-7PM) compared to night time (7PM-7AM).
  • 100 towns (nodes) from K=3K=3 states
    • 45 towns from NY
    • 30 towns from NJ
    • 25 towns from PA
  • Measure the number of commuters from pairs of towns
    • If more than 1,000 drivers regularly make this commute, we add an edge between the pair of towns
252 / 328

Example - Traffic Patterns

  • Goal: examine traffic pattern differences during the daytime (7AM-7PM) compared to night time (7PM-7AM).
  • 100 towns (nodes) from K=3K=3 states
    • 45 towns from NY
    • 30 towns from NJ
    • 25 towns from PA
  • Measure the number of commuters from pairs of towns
    • If more than 1,000 drivers regularly make this commute, we add an edge between the pair of towns
  • We have two networks:
    • the first network between 7 AM and 7 PM (covering the bulk of the work day)
    • the second network between 7 PM and 7 AM (covering the bulk of night time)
253 / 328

Example - Traffic Patterns

254 / 328

Testing Whether Block Matrices are Different

  • Given block matrices B(1)B^{(1)} and B(2)B^{(2)}

B(1)=[b11(1)b12(1)b13(1)b21(1)b22(1)b23(1)b31(1)b32(1)b33(1)],      B(2)=[b11(2)b12(2)b13(2)b21(2)b22(2)b23(2)b31(2)b32(2)b33(2)] B^{(1)} = \begin{bmatrix} b^{(1)}_{11} & b^{(1)}_{12} & b^{(1)}_{13} \\ b^{(1)}_{21} & b^{(1)}_{22} & b^{(1)}_{23} \\ b^{(1)}_{31} & b^{(1)}_{32} & b^{(1)}_{33} \end{bmatrix},\;\;\; B^{(2)} = \begin{bmatrix} b^{(2)}_{11} & b^{(2)}_{12} & b^{(2)}_{13} \\ b^{(2)}_{21} & b^{(2)}_{22} & b^{(2)}_{23} \\ b^{(2)}_{31} & b^{(2)}_{32} & b^{(2)}_{33} \end{bmatrix}

255 / 328

Testing Whether Block Matrices are Different

  • Given block matrices B(1)B^{(1)} and B(2)B^{(2)}

B(1)=[b11(1)b12(1)b13(1)b21(1)b22(1)b23(1)b31(1)b32(1)b33(1)],      B(2)=[b11(2)b12(2)b13(2)b21(2)b22(2)b23(2)b31(2)b32(2)b33(2)] B^{(1)} = \begin{bmatrix} b^{(1)}_{11} & b^{(1)}_{12} & b^{(1)}_{13} \\ b^{(1)}_{21} & b^{(1)}_{22} & b^{(1)}_{23} \\ b^{(1)}_{31} & b^{(1)}_{32} & b^{(1)}_{33} \end{bmatrix},\;\;\; B^{(2)} = \begin{bmatrix} b^{(2)}_{11} & b^{(2)}_{12} & b^{(2)}_{13} \\ b^{(2)}_{21} & b^{(2)}_{22} & b^{(2)}_{23} \\ b^{(2)}_{31} & b^{(2)}_{32} & b^{(2)}_{33} \end{bmatrix}

  • First hypothesis: H0:B(1)=B(2);  HA:B(1)B(2)H_0: B^{(1)} = B^{(2)}; \ \ H_A: B^{(1)} \neq B^{(2)}
256 / 328

Testing Whether Block Matrices are Different

  • Given block matrices B(1)B^{(1)} and B(2)B^{(2)}

B(1)=[b11(1)b12(1)b13(1)b21(1)b22(1)b23(1)b31(1)b32(1)b33(1)],      B(2)=[b11(2)b12(2)b13(2)b21(2)b22(2)b23(2)b31(2)b32(2)b33(2)] B^{(1)} = \begin{bmatrix} b^{(1)}_{11} & b^{(1)}_{12} & b^{(1)}_{13} \\ b^{(1)}_{21} & b^{(1)}_{22} & b^{(1)}_{23} \\ b^{(1)}_{31} & b^{(1)}_{32} & b^{(1)}_{33} \end{bmatrix},\;\;\; B^{(2)} = \begin{bmatrix} b^{(2)}_{11} & b^{(2)}_{12} & b^{(2)}_{13} \\ b^{(2)}_{21} & b^{(2)}_{22} & b^{(2)}_{23} \\ b^{(2)}_{31} & b^{(2)}_{32} & b^{(2)}_{33} \end{bmatrix}

  • More concrete hypothesis:

H0,kl:bkl(1)=bkl(2);  HA,kl:bkl(1)bkl(2)H_{0, kl}: b_{kl}^{(1)} = b^{(2)}_{kl}; \ \ H_{A, kl} : b_{kl}^{(1)} \neq b^{(2)}_{kl}

257 / 328

Testing Whether Block Matrices are Different

  • Given block matrices B(1)B^{(1)} and B(2)B^{(2)}

B(1)=[b11(1)b12(1)b13(1)b21(1)b22(1)b23(1)b31(1)b32(1)b33(1)],      B(2)=[b11(2)b12(2)b13(2)b21(2)b22(2)b23(2)b31(2)b32(2)b33(2)] B^{(1)} = \begin{bmatrix} b^{(1)}_{11} & b^{(1)}_{12} & b^{(1)}_{13} \\ b^{(1)}_{21} & b^{(1)}_{22} & b^{(1)}_{23} \\ b^{(1)}_{31} & b^{(1)}_{32} & b^{(1)}_{33} \end{bmatrix},\;\;\; B^{(2)} = \begin{bmatrix} b^{(2)}_{11} & b^{(2)}_{12} & b^{(2)}_{13} \\ b^{(2)}_{21} & b^{(2)}_{22} & b^{(2)}_{23} \\ b^{(2)}_{31} & b^{(2)}_{32} & b^{(2)}_{33} \end{bmatrix}

  • More concrete hypothesis:

H0,kl:bkl(1)=bkl(2);  HA,kl:bkl(1)bkl(2)H_{0, kl}: b_{kl}^{(1)} = b^{(2)}_{kl}; \ \ H_{A, kl} : b_{kl}^{(1)} \neq b^{(2)}_{kl}

  • This turns statements about matrices into statements about individual probabilities
258 / 328

Fisher's Exact Test

  • For each kk and ll community pair, we can compute the 2×22\times 2 contingency table:
night time day time
Number of edges a b
Number of non-edges c d
259 / 328

Fisher's Exact Test

  • For each kk and ll community pair, we can compute the 2×22\times 2 contingency table:
night time day time
Number of edges a b
Number of non-edges c d
  • Intuitively, tests whether proportion of edges is independent of time of day.
260 / 328

Fisher's Exact Test

  • For each kk and ll community pair, we can compute the 2×22\times 2 contingency table:
night time day time
Number of edges a b
Number of non-edges c d
  • Intuitively, tests whether proportion of edges is independent of time of day.

  • Now we can compute the p-value for each H0,klH_{0, kl} hypothesis!

261 / 328

Fisher's Exact Test

  • For each kk and ll community pair, we can compute the 2×22\times 2 contingency table:
night time day time
Number of edges a b
Number of non-edges c d
  • Intuitively, tests whether proportion of edges is independent of time of day.

  • Now we can compute the p-value for each H0,klH_{0, kl} hypothesis!

  • What about multiple comparisons?
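Before turning to that question, note that a single one of these tests is one call to scipy (the counts below are placeholders, not the traffic data):

```python
from scipy.stats import fisher_exact

# 2 x 2 contingency table for one (k, l) pair of communities:
#               night time   day time
# edges             a            b
# non-edges         c            d
table = [[20, 45],   # a, b   (placeholder counts)
         [80, 55]]   # c, d

odds_ratio, p_value = fisher_exact(table)
print(p_value)       # p-value for the H_{0,kl} hypothesis
```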

262 / 328

Multiple Comparisons Problem

  • Suppose we have 5000 fair coins (equal probability of heads and tails)
263 / 328

Multiple Comparisons Problem

  • Suppose we have 5000 fair coins (equal probability of heads and tails)
  • For each coin
    • Flip the coin 500 times
264 / 328

Multiple Comparisons Problem

  • Suppose we have 5000 fair coins (equal probability of heads and tails)
  • For each coin
    • Flip the coin 500 times
    • Test H0(i):p(i)=0.5H_0^{(i)} : p^{(i)} = 0.5 against HA(i):p(i)0.5H_A^{(i)}: p^{(i)} \neq 0.5
265 / 328

Multiple Comparisons Problem

  • Suppose we have 5000 fair coins (equal probability of heads and tails)
  • For each coin
    • Flip the coin 500 times
    • Test H0(i):p(i)=0.5H_0^{(i)} : p^{(i)} = 0.5 against HA(i):p(i)0.5H_A^{(i)}: p^{(i)} \neq 0.5
    • Compute the p-value of the hypothesis that the coin is fair
266 / 328

Multiple Comparisons Problem

  • Suppose we have 5000 fair coins (equal probability of heads and tails)
  • For each coin
    • Flip the coin 500 times
    • Test H0(i):p(i)=0.5H_0^{(i)} : p^{(i)} = 0.5 against HA(i):p(i)0.5H_A^{(i)}: p^{(i)} \neq 0.5
    • Compute the p-value of the hypothesis that the coin is fair
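A short simulation of this setup (using scipy's binomtest, available in scipy 1.7+; with 5000 fair coins, roughly 5% of the uncorrected p-values dip below 0.05 by chance alone):

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n_coins, n_flips = 5000, 500

# flip each fair coin 500 times and test H0: p = 0.5
heads = rng.binomial(n_flips, 0.5, size=n_coins)
pvals = np.array([binomtest(h, n_flips, 0.5).pvalue for h in heads])

print((pvals < 0.05).mean())     # ~0.05: false rejections with no correction
```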
267 / 328

Multiple Comparisons Problem

  • Familywise error rate (FWER): probability of incorrectly rejecting at least one true null hypothesis
268 / 328

Multiple Comparisons Problem

  • Familywise error rate (FWER): probability of incorrectly rejecting at least one true null hypothesis
  • Rescale p-values via Holm-Bonferroni correction.
    • No assumptions on dependence of hypothesis tests
269 / 328

Multiple Comparisons Problem

  • Familywise error rate (FWER): probability of incorrectly rejecting at least one true null hypothesis
  • Rescale p-values via Holm-Bonferroni correction.
    • No assumptions on dependence of hypothesis tests
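A sketch of applying the correction to a vector of per-block p-values (statsmodels provides an implementation; the p-values below are placeholders):

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.02, 0.04, 0.30, 0.70]                    # placeholder p-values
reject, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(reject)       # True where the corrected test still rejects H_{0,kl}
print(pvals_adj)    # Holm-Bonferroni adjusted p-values
```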
270 / 328

Example - Traffic Patterns

  • Hypothesis: H0,kl:bkl(1)=bkl(2);  HA,kl:bkl(1)bkl(2)H_{0, kl}: b_{kl}^{(1)} = b^{(2)}_{kl}; \ \ H_{A, kl} : b_{kl}^{(1)} \neq b^{(2)}_{kl}
271 / 328

Example - Traffic Patterns

  • Hypothesis: H0,kl:bkl(1)=bkl(2);  HA,kl:bkl(1)bkl(2)H_{0, kl}: b_{kl}^{(1)} = b^{(2)}_{kl}; \ \ H_{A, kl} : b_{kl}^{(1)} \neq b^{(2)}_{kl}
  • Take the minimum of the p-values for each H0,klH_{0, kl} hypothesis
    • If the minimum p-value is less than 0.05, reject the null hypothesis
272 / 328

Example - Traffic Patterns

  • Hypothesis: H0,kl:bkl(1)=bkl(2);  HA,kl:bkl(1)bkl(2)H_{0, kl}: b_{kl}^{(1)} = b^{(2)}_{kl}; \ \ H_{A, kl} : b_{kl}^{(1)} \neq b^{(2)}_{kl}
  • Take the minimum of the p-values for each H0,klH_{0, kl} hypothesis
    • If the minimum p-value is less than 0.05, reject the null hypothesis
  • Reject the null hypothesis
273 / 328

Testing If Block Matrices are Multiples of Each Other

  • What if probability of edge was, in general, 50% higher during daytime?
    • Was the previous result just due to a scaling factor?
274 / 328

Testing If Block Matrices are Multiples of Each Other

  • What if probability of edge was, in general, 50% higher during daytime?
    • Was the previous result just due to a scaling factor?
    • Want to find topological differences
275 / 328

Testing If Block Matrices are Multiples of Each Other

  • What if probability of edge was, in general, 50% higher during daytime?
    • Was the previous result just due to a scaling factor?
    • Want to find topological differences
  • Modify previous hypothesis: H0:B(1)=αB(2);  HA:B(1)αB(2)H_0: B^{(1)} = \alpha B^{(2)}; \ \ H_A: B^{(1)} \neq \alpha B^{(2)}
  • We can take α\alpha to be the ratio p(1)/p(2)p^{(1)}/p^{(2)}
276 / 328

Example - Traffic Patterns

  • Hypothesis: H0:B(1)=αB(2);  HA:B(1)αB(2)H_0: B^{(1)} = \alpha B^{(2)}; \ \ H_A: B^{(1)} \neq \alpha B^{(2)}
  • Use chi-squared test
  • Reject the null hypothesis
  • After adjusting for differences in density, there are still differences in traffic patterns.
277 / 328

Other interesting tests?

  1. Are the latent positions from two networks different?

    H0:X=YW; HA:XYWH_0: X = YW ;\ H_A: X \neq YW

278 / 328

Other interesting tests?

  1. Are the latent positions from two networks different?

    H0:X=YW; HA:XYWH_0: X = YW ;\ H_A: X \neq YW

  2. Are the distributions of latent positions from two networks different?

    H_0: F_X = F_Y; \ \ H_A: F_X \neq F_Y

279 / 328

Other interesting tests?

  1. Are the latent positions from two networks different?

    H0:X=YW; HA:XYWH_0: X = YW ;\ H_A: X \neq YW

  2. Are the distributions of latent positions from two networks different?

    H_0: F_X = F_Y; \ \ H_A: F_X \neq F_Y

  3. Which latent positions are different from two networks?

    H0:Xi=YiW; HA:XiYiWH_0: X_i = Y_iW ;\ H_A: X_i \neq Y_iW

280 / 328

Focus for third section

  • coding examples for network analysis
281 / 328

Code example

Google colab notebook

283 / 328

Outline

Graph Neural Networks

284 / 328

Random walks and diffusion methods

  • develop a stochastic process (random) that generates paths through the network
    • "random walk": paths generated from the stochastic process, but with repeats
    • "diffusion": start somewhere in the network, and end somewhere (randomly)
285 / 328

Random walks and diffusion methods

  • develop a stochastic process (random) that generates paths through the network
    • "random walk": paths generated from the stochastic process, but with repeats
    • "diffusion": start somewhere in the network, and end somewhere (randomly)
  • diffusion methods on networks
    • do not analyze the network itself
286 / 328

Random walks and diffusion methods

  • develop a stochastic process (random) that generates paths through the network
    • "random walk": paths generated from the stochastic process, but with repeats
    • "diffusion": start somewhere in the network, and end somewhere (randomly)
  • diffusion methods on networks
    • do not analyze the network itself
    • analyze the walks diffusing through the network
287 / 328

Finite-space markov chains

  • Finite-space markov chain: random system where the possible events that can occur are finite
    • for network data, the "events" are visits to nodes in the network
    • If the network has nn nodes, there are nn possible events
  • Random walk for network data of length TT: (s0,s1,...,sT)(s_0, \mathbf s_1, ..., \mathbf s_T)
    • st\mathbf s_t is a node in the network (random)
    • s0s_0 is termed the source node for the random walk, and is usually fixed
288 / 328

First-order markov chains

  • Random walk: (s0,s1,...,sT)(s_0, \mathbf s_1, ..., \mathbf s_T)
    • s1,...,sT\mathbf s_1, ..., \mathbf s_T are determined through an evolutionary (random) process
  • First-order markov chain
    • Imagine that we have walked the sequence (s0,...,st)(s_0, ..., s_t)
289 / 328

First-order markov chains

  • Random walk: (s0,s1,...,sT)(s_0, \mathbf s_1, ..., \mathbf s_T)
    • s1,...,sT\mathbf s_1, ..., \mathbf s_T are determined through an evolutionary (random) process
  • First-order markov chain
    • Imagine that we have walked the sequence (s0,...,st)(s_0, ..., s_t) P(st+1st,...,s0)=P(st+1st)\mathbb P(\mathbf s_{t + 1} | s_{t}, ..., s_0) = \mathbb P(\mathbf s_{t+1} | s_{t})
  • "The next node in the walk depends only on the current node that we are at"
290 / 328

First-order markov chains with networks

  • ptt+1={1dt,att+1=10,att+1=0p_{tt+1} = \begin{cases}\frac{1}{d_t},& a_{tt+1} = 1 \\ 0, & a_{tt+1} = 0\end{cases}
    • If we are at a node tt, we choose the next node to visit in our walk randomly from the neighbors of node tt
291 / 328

First-order markov chains with networks

  • ptt+1={1dt,att+1=10,att+1=0p_{tt+1} = \begin{cases}\frac{1}{d_t},& a_{tt+1} = 1 \\ 0, & a_{tt+1} = 0\end{cases}
    • If we are at a node tt, we choose the next node to visit in our walk randomly from the neighbors of node tt
    • tt has dtd_t neighbors, so each neighbor has an equal chance of being visited
  • ptt+1p_{tt+1}: transition probability from node tt to the next node t+1t+1
292 / 328

Transition probability matrix

  • n×nn \times n matrix where the entries pijp_{ij} are the transition probabilities from ii to jj
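For an unweighted network with no isolated nodes, this matrix is just the adjacency matrix with each row divided by its degree; a one-line numpy sketch:

```python
import numpy as np

def transition_matrix(A):
    """P[i, j] = 1/d_i if (i, j) is an edge, and 0 otherwise."""
    return A / A.sum(axis=1, keepdims=True)
```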


293 / 328

Transition example



294 / 328

Obtaining a transition probability vector

  • To actually implement a random walk programmatically, we need to return to matrices
  • What we will do is use a one-hot encoding of our current node
    • If st=is_{t} = i, we will let x(i)\vec x^{(i)} be the vector where:

xj(i)={1,j=i0,jix_j^{(i)} = \begin{cases}1, & j = i \\ 0, & j \neq i\end{cases}

  • If we are at MH, x(MH)=[01000]\vec x^{(MH)} = \begin{bmatrix}0 & 1 & 0 & 0 & 0\end{bmatrix}^{\top}
295 / 328

Obtaining a transition probability vector

  • We'll define the probability vector from node ii to be:

pi=Px(i)=[p11p1npn1pnn][x1(i)xn(i)]=[p11pn1p1npnn][x1(i)xn(i)]\begin{aligned} \vec p_i &= P^\top \vec x^{(i)} \\ &= \begin{bmatrix} p_{11} & \cdots & p_{1n} \\ \vdots & \ddots & \vdots \\ p_{n1} & \cdots & p_{nn}\end{bmatrix}^{\top}\begin{bmatrix}x_1^{(i)} \\ \vdots \\ x_n^{(i)}\end{bmatrix} \\ &= \begin{bmatrix} p_{11} & \cdots & p_{n1} \\ \vdots & \ddots & \vdots \\ p_{1n} & \cdots & p_{nn}\end{bmatrix}\begin{bmatrix}x_1^{(i)} \\ \vdots \\ x_n^{(i)}\end{bmatrix}\end{aligned}

296 / 328

Obtaining a transition probability vector

Continuing, we get:

\begin{aligned} \vec p_i &= P^\top \vec x^{(i)} \\ &= \begin{bmatrix}\sum_{j = 1}^n p_{j1}x_{j}^{(i)} \\ \vdots \\ \sum_{j = 1}^n p_{jn}x_{j}^{(i)}\end{bmatrix} \\ &= \begin{bmatrix}p_{i1} \\ \vdots \\ p_{in}\end{bmatrix} \end{aligned} because x_i^{(i)} = 1 is the only non-zero entry of \vec x^{(i)} (the other terms in each sum drop out)

297 / 328
The multinoulli distribution

xMultinoulli(p)\mathbf x \sim \text{Multinoulli}(\vec p) if:

  1. p\vec p is a probability vector (i=1npi=1\sum_{i = 1}^n p_i = 1)
  2. x\mathbf x has realizations x{1,,n}x \in \{1, \cdots, n\} (it is categorical)
  3. P(x=i)=pi\mathbb P(\mathbb x = i) = p_i
298 / 328
The multinoulli distribution

xMultinoulli(p)\mathbf x \sim \text{Multinoulli}(\vec p) if:

  1. p\vec p is a probability vector (i=1npi=1\sum_{i = 1}^n p_i = 1)
  2. x\mathbf x has realizations x{1,,n}x \in \{1, \cdots, n\} (it is categorical)
  3. P(x=i)=pi\mathbb P(\mathbb x = i) = p_i
Generating next entry of the random walk

Let st+1Multinoulli(pst)\mathbf s_{t + 1} \sim \text{Multinoulli}(\vec p_{s_t})

299 / 328

Algorithm for generating random walks


random_walk(P, s0, T):
"""
P: the n x n transition probability matrix.
s0: the source node.
T: the walk length.
"""
Let s[0] = s0.
For t from 1:T:
    Let x be the one-hot vector with x[i] = 1 where i = s[t - 1], and 0 otherwise.
    Let p = transpose(P) * x.
    Let s[t] be a Multinoulli sample with probability vector p.
Return the random walk s.
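A direct numpy translation of this procedure (np.random.choice plays the role of the Multinoulli sample; note that row s of P is exactly transpose(P) times the one-hot vector for s):

```python
import numpy as np

def random_walk(P, s0, T, rng=None):
    """Generate a first-order random walk of length T starting at source node s0."""
    rng = np.random.default_rng() if rng is None else rng
    n = P.shape[0]
    walk = [s0]
    for _ in range(T):
        p = P[walk[-1]]                     # transition probabilities out of the current node
        walk.append(rng.choice(n, p=p))     # Multinoulli sample of the next node
    return walk
```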
300 / 328

Graph Neural Network Lexicon

  • Note that we have two networks we are dealing with
  1. The network: the object we are studying
  2. The GNN: the network that we are using implicitly as part of the algorithm
  • We will be studying a technique called DeepWalk\texttt{DeepWalk}
302 / 328

Generating a "corpus" for our netework

  • DeepWalk\texttt{DeepWalk} was designed from natural language processing techniques
    • most of the language around it uses NLP lingo
  • corpus: a collection of text (e.g., from a stack of documents) that we learn from
    • For DeepWalk\texttt{DeepWalk}, we use random walks to construct a "corpus"
303 / 328

Generating a "corpus" for our network


generate_corpus(T, R):
"""
T: the walk length.
R: the number of walks per node in the network.
"""
For each source node i in 1:n:
    For each walk r in 1:R:
        Use the procedure developed to generate a random walk of length T,
        starting at node i.
Return the collection of random walks.
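Building on the random_walk sketch above (P is the transition probability matrix, passed in here rather than assumed globally):

```python
import numpy as np

def generate_corpus(P, T, R, rng=None):
    """Generate R random walks of length T from every node in the network."""
    rng = np.random.default_rng() if rng is None else rng
    n = P.shape[0]
    return [random_walk(P, i, T, rng) for i in range(n) for _ in range(R)]
```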
304 / 328

Skipgram GNN

  • Idea: the random walks hold properties about the nodes
    • Nodes that tend to have lots of shared neighbors will tend to be "closer" on random walks, in general
305 / 328

Skipgram GNN

  • Idea: the random walks hold properties about the nodes
    • Nodes that tend to have lots of shared neighbors will tend to be "closer" on random walks, in general
  • Dummy goal: given a node ii, predict the probability that another node jj is ω\omega-close to it
    • ω\omega: the "window width"
    • If we can figure out nodes that tend to be "close" to a node ii, perhaps we can use the GNN to extract information about the network
306 / 328

Skipgram GNN

  • Idea: the random walks hold properties about the nodes
    • Nodes that tend to have lots of shared neighbors will tend to be "closer" on random walks, in general
  • Dummy goal: given a node ii, predict the probability that another node jj is ω\omega-close to it
    • ω\omega: the "window width"
    • If we can figure out nodes that tend to be "close" to a node ii, perhaps we can use the GNN to extract information about the network
  • Nodes that we are given are called center nodes
    • Nodes that we want to predict are called context nodes
    • "Context": under what contexts (nearby nodes) does our center node tend to show up?
307 / 328

Using the corpus to produce training samples

  • Center nodes: white
  • Context nodes: gray
308 / 328

The embedding process

  • The GNN itself is extremely simple
    • Input layer: nn nodes (one for each node in the network)
    • Hidden layer: dd nodes
    • dd: "embedding dimensions"
    • Output layer: nn nodes (one for each node in the network)
309 / 328

Architecture of a skip-gram network


310 / 328

The embedding layer

  • To connect the input layer to the hidden nodes, we have an "embedding layer"
    • The "embedding layer" describes a weight matrix WW which is n×dn \times d
    • x(i)\vec x^{(i)} is nn-dimensional W=[w(1)w(n)]W = \begin{bmatrix} \leftarrow & \vec w^{(1)\top} & \rightarrow \\ & \vdots & \\ \leftarrow & \vec w^{(n)\top} & \rightarrow \end{bmatrix}
311 / 328

The embedding matrix

  • To connect the input layer to the hidden nodes, we have an "embedding matrix"
  • We take the product Wx(i)W^\top \vec x^{(i)} to be the "embedding" of the node ii
    • Since WW^\top is d×nd \times n dimensions and x(i)\vec x^{(i)} is an nn-vector, the product is dd-dimensional W=[w(1)w(n)]W^\top = \begin{bmatrix} \uparrow & & \uparrow \\ \vec w^{(1)} & \cdots & \vec w^{(n)} \\ \downarrow & & \downarrow\end{bmatrix}
312 / 328

The embedding process

Wx(i)=[w(1)w(n)][x1(i)xn(i)]=[j=1nw1(j)xj(i)j=1nwd(j)xj(i)]\begin{aligned}W^\top \vec x^{(i)} &= \begin{bmatrix} \uparrow & & \uparrow \\ \vec w^{(1)} & \cdots & \vec w^{(n)} \\ \downarrow & & \downarrow\end{bmatrix} \begin{bmatrix}x^{(i)}_1 \\ \vdots \\ x^{(i)}_n\end{bmatrix} \\ &= \begin{bmatrix}\sum_{j = 1}^n w^{(j)}_1 x_j^{(i)} \\ \vdots \\ \sum_{j = 1}^n w^{(j)}_d x_j^{(i)}\end{bmatrix}\end{aligned}

  • Remember: xj(i)=1x_j^{(i)} = 1 only when j=ij = i, and is zero otherwise (one-hot encoding)
313 / 328

The embedding process

Wx(i)=[j=1nw1(j)xj(i)j=1nwd(j)xj(i)]=[w1(i)wd(i)]=w(i)\begin{aligned}W^\top \vec x^{(i)} &= \begin{bmatrix}\sum_{j = 1}^n w^{(j)}_1 x_j^{(i)} \\ \vdots \\ \sum_{j = 1}^n w^{(j)}_d x_j^{(i)}\end{bmatrix}\\ &= \begin{bmatrix}w^{(i)}_1 \\ \vdots \\ w^{(i)}_{d}\end{bmatrix} = \vec w^{(i)}\end{aligned}

  • The "embedding" of x(i)x^{(i)} really just extracts the ithi^{th} row of WW
314 / 328

The "softmax" function

  • When we are trying to predict probabilities, a typical choice is the "softmax" function
  • If you have done logistic regression, you have used this before
  • The softmax for an nn-element vector y\vec y is the nn-element vector with entries:

σ(y)j=[exp(yj)j=1nexp(yj)]\sigma(\vec y)_j = \begin{bmatrix}\frac{\exp(y_j)}{\sum_{j' = 1}^n \exp(y_{j'})}\end{bmatrix}

315 / 328

The "softmax" function

  • When we are trying to predict probabilities, a typical choice is the "softmax" function
  • If you have done logistic regression, you have used this before
  • The softmax for an nn-element vector y\vec y is the nn-element vector with entries:

σ(y)j=[exp(yj)j=1nexp(yj)]\sigma(\vec y)_j = \begin{bmatrix}\frac{\exp(y_j)}{\sum_{j' = 1}^n \exp(y_{j'})}\end{bmatrix}

  • Note that the exponential is strictly positive (for non-infinite yjy_j), so the numerator is strictly less than the denominator
  • By construction, j=1nσ(y)j=1\sum_{j = 1}^n \sigma(\vec y)_j = 1
    • it produces probabilities
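In code, the usual numerically stable version subtracts the maximum before exponentiating (a standard trick, not something specific to these slides):

```python
import numpy as np

def softmax(y):
    """Map an n-vector of scores to an n-vector of probabilities summing to 1."""
    expy = np.exp(y - y.max())     # subtracting the max avoids overflow
    return expy / expy.sum()
```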
316 / 328

The "projection" matrix provides predicted probabilities

  • The first step in the projection layer is to use another matrix VV to "project" the embedding back to an nn-dimensional vector
    • We do this by taking the product Vw(i)V^\top \vec w^{(i)}
    • VV is a d×nd \times n matrix, and w(i)\vec w^{(i)} is a dd-vector, so we obtain a nn-vector
  • We then apply the softmax to the result, and the predicted probability for context node jj with center node ii is: pj(i)(W,V)=σ(Vw(i))jp^{(i)}_j(W, V) = \sigma\big(V^{\top} \vec w^{(i)}\big)_j
317 / 328

A forwards pass through a Skipgram network

  • To conclude the description of the Skipgram network, we use a loss function
  • The loss function is the cross-entropy loss, which for a single training sample (\vec x^{(i)}, \vec x^{(j)}) is: L_{i,j}(W, V) = -\vec x^{(j)\top} \log\big(\vec p^{(i)}(W, V)\big)
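A toy numpy sketch of one forward pass and its loss, with randomly initialized W and V (the sizes n = 10, d = 3 and the node indices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 3                          # toy sizes: 10 nodes, 3 embedding dimensions
W = rng.standard_normal((n, d))       # embedding matrix
V = rng.standard_normal((d, n))       # projection matrix

def forward_loss(i, j):
    """Cross-entropy loss for center node i and context node j."""
    w_i = W[i]                                 # embedding of the center node (row i of W)
    scores = V.T @ w_i                         # one score per candidate context node
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # softmax -> predicted probabilities
    return -np.log(probs[j])                   # cross-entropy against the one-hot target

print(forward_loss(0, 4))
```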
318 / 328

Forwards pass through a Skipgram network (schematic view)


319 / 328

How do we actually train the Skipgram network?

  1. Randomly reorder the training data into "epochs" which contain one instance of every training sample (in random order).
320 / 328

How do we actually train the Skipgram network?

  1. Randomly reorder the training data into "epochs" which contain one instance of every training sample (in random order).
  2. Split the epochs into BB batches.
321 / 328

How do we actually train the Skipgram network?

  1. Randomly reorder the training data into "epochs" which contain one instance of every training sample (in random order).
  2. Split the epochs into BB batches.
  3. For each batch:
    • Run a forwards pass for every training sample in that batch, and compute the cross-entropy loss.
    • Compute the cumulative loss as the sum of the cross-entropy losses for each training sample in the batch.
    • Use backpropagation to update the embedding matrix WW and the projection matrix VV.
322 / 328

How do we actually train the Skipgram network?

  1. Randomly reorder the training data into "epochs" which contain one instance of every training sample (in random order).
  2. Split the epochs into BB batches.
  3. For each batch:
    • Run a forwards pass for every training sample in that batch, and compute the cross-entropy loss.
    • Compute the cumulative loss as the sum of the cross-entropy losses for each training sample in the batch.
    • Use backpropagation to update the embedding matrix WW and the projection matrix VV.
  4. Repeat for all epochs, until an arbitrary stopping criterion is met.
323 / 328

How do we actually train the Skipgram network?

  1. Randomly reorder the training data into "epochs" which contain one instance of every training sample (in random order).
  2. Split the epochs into BB batches.
  3. For each batch:
    • Run a forwards pass for every training sample in that batch, and compute the cross-entropy loss.
    • Compute the cumulative loss as the sum of the cross-entropy losses for each training sample in the batch.
    • Use backpropagation to update the embedding matrix WW and the projection matrix VV.
  4. Repeat for all epochs, until an arbitrary stopping criterion is met.
  5. Use the rows w(i)\vec w^{(i)} as the embedding for node ii.
324 / 328

Skip-gram networks in action

  • What we described is known as DeepWalk\texttt{DeepWalk}
    • it uses first-order random walks
  • There are many other diffusion-based GNNs that use other types of random walks
    • node2vec\texttt{node2vec}: second-order biased random walks
325 / 328

Skip-gram networks in action


  • 2 homophilic communities, and each community has nodes with high degrees (core) and low degrees (periphery)
326 / 328

Skip-gram networks in action


  • 2 homophilic communities, and each community has nodes with high degrees (core) and low degrees (periphery)
    • DCSBM??
327 / 328

Skip-gram networks in action

328 / 328