In this article, I will re-implement and discuss the paper “Disrupting resilient criminal networks through data analysis: the case of Sicilian Mafia” by Lucia Cavallaro, Annamaria Ficara, Pasquale De Meo, Giacomo Fiumara, Salvatore Catanese, Ovidiu Bagdasar, and Antonio Liotta.
They published real-world data, which I will also explore on my side. The GitHub repository of the project, with the code and data, can be found here.
Background
This study focuses on the structure of the Sicilian Mafia as a criminal network. The whole network is made of sub-groups referred to as families.
The authors built two real-world anonymized datasets, based on raw data derived from juridical acts relating to a Mafia gang that operated in Sicily (Italy) during the first decade of the 2000s:
- a dataset of phonecalls
- a dataset of physical meetings
The authors apply Social Network Analysis (SNA) techniques to identify the best way to disrupt the network (which member to arrest, in what order…).
Strategies to disrupt networks rely on:
- human capital, a strategy in which we investigate the resources possessed by actors of a network
- social capital, a strategy in which we arrest the actors with the largest number of social links in the network (e.g. brokers that act as bridges between two markets)
In the social capital approach, removing the actors with the largest betweenness centrality tends to be the best technique. Even after a disruption, the network might still exist: this is known as network resilience, and it is what this paper focuses on.
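To make this concrete, here is a tiny toy example of mine (not taken from the paper): on a barbell graph, the node bridging the two cliques has the highest betweenness centrality, and removing it splits the network in two.
import networkx as nx

# Toy example: two 5-node cliques joined by a single bridge node
toy = nx.barbell_graph(5, 1)
bc = nx.betweenness_centrality(toy)
broker = max(bc, key=bc.get)                # the bridge node scores highest
toy.remove_node(broker)
print(nx.number_connected_components(toy))  # -> 2: the network falls apart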
Data
For this part, I won’t cover in detail what the authors did; instead, I will explore this dataset with my own approach and visualizations. I will rely on networkx (quite common), but also on nx_altair, a library (which I think isn’t maintained anymore) that mixes Altair and NetworkX.
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import altair as alt
# Networks
import networkx as nx
import nx_altair as nxa
from networkx.algorithms.community import greedy_modularity_communities
We will first focus on the physical meeting dataset and build the corresponding graph:
G = nx.Graph()
# Build the meeting graph: each row contains two node ids and an edge weight
with open('Datasets/Montagna_meetings_edgelist.csv', 'r') as file:
    for row in file:
        r = row.split()
        n1, n2, w = int(r[0]), int(r[1]), int(r[2])
        G.add_node(n1)
        G.add_node(n2)
        G.nodes[n1]['id'] = n1
        G.nodes[n2]['id'] = n2
        G.add_edge(n1, n2)
        G.edges[(n1, n2)]['weight'] = w
        G.edges[(n1, n2)]['method'] = 'meetings'
We attributed a weight to each edge based on how frequently two nodes met.
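As a quick check (my own addition, not in the paper), we can look at the edges with the largest weights, i.e. the pairs of people who met most often:
top_meetings = sorted(G.edges(data=True), key=lambda e: e[2]['weight'], reverse=True)[:3]
print(top_meetings)  # the three most frequent meeting pairs (ids are anonymized)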
Now, let’s make a first plot of the network and the meetings between the nodes.
pos = nx.spring_layout(G)
chart = nxa.draw_networkx(
    G=G,
    pos=pos,
    width='weight:N',
    node_tooltip=['id:N']
).properties(
    title="Sicilian Mafia physical meetings", height=500, width=500
).interactive()
chart
We can now describe the network a bit further (number of nodes, edges, degrees…):
c_degree = nx.degree_centrality(G)
c_degree = list(c_degree.values())
c = list(greedy_modularity_communities(G))
nb_nodes = len(list(G.nodes()))
nb_edges = len(list(G.edges()))
degrees = list(dict(G.degree()).values())
avg_degree = np.mean(degrees)
median_degree = np.median(degrees)
max_degree = np.max(degrees)
min_degree = np.min(degrees)
avg_centrality = np.mean(c_degree)
min_centrality = np.min(c_degree)
max_centrality = np.max(c_degree)
median_centrality = np.median(c_degree)
print("Number of nodes: ", nb_nodes)
print("Number of edges: ", nb_edges)
print("Average degree: ", avg_degree)
print("Median degree: ", median_degree)
print("Max degree: ", max_degree)
print("Min degree: ", min_degree)
print("Average centrality: ", avg_centrality)
print("Median centrality: ", median_centrality)
print("Max centrality: ", max_centrality)
print("Min centrality: ", min_centrality)
This yields the following:
Number of nodes: 101
Number of edges: 256
Average degree: 5.069306930693069
Median degree: 3.0
Max degree: 24
Min degree: 1
Average centrality: 0.050693069306930696
Median centrality: 0.03
Max centrality: 0.24
Min centrality: 0.01
On average, each person is connected to 5 others (which is quite high). There seems to be at least one person with a large centrality, since one node has 24 edges.
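We can check which node carries this maximum degree (a small addition of mine; the ids are anonymized anyway):
# Node with the highest degree in the meeting network
hub = max(dict(G.degree()).items(), key=lambda kv: kv[1])
print(hub)  # (node id, 24): the most connected person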
We can now look at the distribution of the degrees:
hist = alt.Chart(pd.DataFrame(degrees, columns=['Degrees'])).mark_bar().encode(
    alt.X("Degrees:Q", bin=alt.Bin(maxbins=25)),
    y='count()'
).properties(title="Degree Distribution", height=300, width=500)
hist
The degree distribution is much closer to that of a Barabási–Albert model than to that of a random graph, meaning that the network shows strong signs of preferential attachment.
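As a rough visual comparison (my own sketch, not part of the paper), we can generate a Barabási–Albert graph of similar size and overlay its degree distribution; the parameter m=3 is an assumption chosen to roughly match the observed average degree:
# Barabási–Albert graph with roughly the same number of nodes and edges
ba = nx.barabasi_albert_graph(n=nb_nodes, m=3, seed=42)
ba_degrees = list(dict(ba.degree()).values())
df = pd.DataFrame({
    'Degrees': degrees + ba_degrees,
    'Graph': ['Mafia meetings'] * len(degrees) + ['Barabási–Albert'] * len(ba_degrees)
})
alt.Chart(df).mark_bar(opacity=0.6).encode(
    alt.X("Degrees:Q", bin=alt.Bin(maxbins=25)),
    y=alt.Y('count()', stack=None),
    color='Graph:N'
).properties(title="Degree distributions", height=300, width=500)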
The two datasets have 47 nodes in common, so we can plot them together:
# Add the phone-call edges to the same graph
with open('Datasets/Montagna_phonecalls_edgelist.csv', 'r') as file:
    for row in file:
        r = row.split()
        n1, n2, w = int(r[0]), int(r[1]), int(r[2])
        G.add_node(n1)
        G.add_node(n2)
        G.nodes[n1]['id'] = n1
        G.nodes[n2]['id'] = n2
        G.add_edge(n1, n2)
        G.edges[(n1, n2)]['weight'] = w
        G.edges[(n1, n2)]['method'] = 'phonecalls'

pos = nx.spring_layout(G)
chart = nxa.draw_networkx(
    G=G,
    pos=pos,
    width='weight:N'
).properties(
    title="Sicilian Mafia all meetings", height=500, width=500
).interactive()
chart
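Before going further, we can verify the overlap between the two datasets (a quick sanity check of mine; node_set is a small helper I define here):
def node_set(path):
    # Collect all node ids appearing in an edge list file
    with open(path, 'r') as f:
        return {int(x) for row in f for x in row.split()[:2]}

meet_nodes = node_set('Datasets/Montagna_meetings_edgelist.csv')
call_nodes = node_set('Datasets/Montagna_phonecalls_edgelist.csv')
print(len(meet_nodes & call_nodes))  # 47 nodes appear in both datasets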
Now, we can try to display nodes in a different color if their betweenness centrality is above a certain threshold (here, 0.15). We identify three central nodes, which also look like the main information bottlenecks in the graph. I do not have much information on these nodes, but we can easily suppose that these people play a key role in their communities. Zoom in on the graph to see more details:
c_betweenness = nx.betweenness_centrality(G)
# Flag the nodes whose betweenness centrality exceeds the 0.15 threshold
for n in G.nodes():
    G.nodes[n]['is_central'] = c_betweenness[n] > 0.15
chart = nxa.draw_networkx(
    G=G,
    pos=nx.spring_layout(G),
    width='weight:N',
    node_tooltip=['id', 'is_central']
)
edges = chart.layer[0].properties(height=500, width=500)
nodes = chart.layer[1].properties(height=500, width=500)
nodes = nodes.encode(
    fill='is_central:N',
    color='is_central:N'
).properties(
    height=500, width=500, title="Centrality"
)
central = (edges + nodes).interactive()
central
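To see exactly which nodes are flagged, we can also print the nodes whose betweenness centrality exceeds the threshold (a small addition of mine):
central_nodes = {n: round(b, 3) for n, b in c_betweenness.items() if b > 0.15}
print(central_nodes)  # the few nodes acting as the main information bottlenecks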
We can now run a simple community detection algorithm, using the greedy modularity algorithm from NetworkX:
c = list(greedy_modularity_communities(G))
# Tag each node with the index of the community it belongs to
for i, community in enumerate(c):
    for n in community:
        G.nodes[n]['community'] = str(i)
chart = nxa.draw_networkx(
    G=G,
    pos=nx.spring_layout(G),
    width='weight:N',
    node_tooltip=['id', 'community']
)
edges = chart.layer[0].properties(height=500, width=500)
nodes = chart.layer[1].properties(height=500, width=500)
nodes = nodes.encode(
    fill='community:N',
    color='community:N'
).properties(
    height=500, width=500, title="Community"
)
community = (edges + nodes).interactive()
community
8 communities have been detected, each appearing in a different color.
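We can also check how large each detected community is (my own addition):
# Number of members per detected community
for i, comm in enumerate(c):
    print(f"Community {i}: {len(comm)} members")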
Network Disruption
As we have seen, there are some key players, and some communities are larger than others. When disrupting a criminal network, it is therefore important to arrest the right people and to prevent the network from remaining resilient.
The authors use a metric to measure the network resilience:
\[\rho = 1 - \left| \frac{LCC_{i}(G_i) - LCC_{0}(G_0)}{LCC_{0}(G_0)} \right|\]
where \(LCC\) denotes the size (total number of nodes) of the Largest Connected Component: \(LCC_0(G_0)\) for the original graph and \(LCC_i(G_i)\) for the graph after the \(i\)-th removal. A connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths, and which is connected to no additional vertices outside of this component. In brief, this metric tracks by how much we damaged the LCC of the network by removing one or a few nodes.
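In code, this metric boils down to a one-line helper (a sketch of mine, with hypothetical LCC sizes, not the authors' implementation):
def resilience(lcc_current, lcc_initial):
    # rho = 1 - |(LCC_i - LCC_0) / LCC_0|
    return 1 - abs((lcc_current - lcc_initial) / lcc_initial)

print(resilience(100, 100))  # 1.0: the LCC is untouched
print(resilience(40, 100))   # 0.4: the LCC has lost 60% of its nodes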
In the whole graph, the connected components can be computed as such:
for x in nx.connected_components(G):
    print(x)
{0, 1, 2}
{3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153}
{16, 17}
{73, 74}
{106, 107}
As we can visually observe, there is one major connected component. We can see the number of nodes in each sub-graph using:
for x in nx.connected_components(G):
    print(nx.number_of_nodes(G.subgraph(x)))
3
145
2
2
2
The LCC is therefore the max of those, which is 145.
We can therefore define a function that computes the LCC size:
def lcc_size(graph):
    # Size (number of nodes) of the largest connected component
    compsize = []
    for c in nx.connected_components(graph):
        compsize.append(nx.number_of_nodes(graph.subgraph(c)))
    return max(compsize)
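On the full graph, this returns the value we found above:
print(lcc_size(G))  # 145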
The idea behind the metric is to iteratively remove:
- either the node with the highest collective influence, in a sequential manner,
- or the top 5 nodes with the highest collective influence, in a “block” manner,
and to see how the metric evolves. If you want to learn more about collective influence, make sure to check this article.
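The disruption functions below rely on a collective_influence_centality helper (as it is named in the calls further down), whose implementation is not reproduced in this article. Here is a minimal stand-in of mine, assuming the standard collective influence definition of Morone & Makse with ball radius ℓ = 1; it is not necessarily the authors’ exact code:
def collective_influence_centality(graph, toremove, weight=None, ell=1):
    # CI(i) = (k_i - 1) * sum of (k_j - 1) over the nodes j at distance ell from i
    scores = {}
    for i in graph.nodes():
        ki = graph.degree(i, weight=weight)
        dist = nx.single_source_shortest_path_length(graph, i, cutoff=ell)
        frontier = [j for j, d in dist.items() if d == ell]
        scores[i] = (ki - 1) * sum(graph.degree(j, weight=weight) - 1 for j in frontier)
    # Fill toremove with node ids sorted by decreasing collective influence,
    # matching the interface used by the disruption functions below
    for n in sorted(scores, key=scores.get, reverse=True):
        toremove.append(n)
    return toremove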
To gradually remove the nodes (in a sequential manner), the authors propose a function:
import array
import copy

nrem = 5  # number of nodes removed per block

def disruption_sequence(Graph, weight=None):
    Gor = copy.deepcopy(Graph)
    lccinit = lcc_size(Graph)
    dflcc = pd.DataFrame()
    dflccvar = pd.DataFrame()
    dictx = dict()
    dicty = dict()
    kiter = 0
    toremove = array.array('i', [])
    while Gor.number_of_nodes() > nrem:
        i = 0
        while i < nrem:
            if Gor.number_of_nodes() <= nrem:
                break
            toremove = array.array('i', [])
            # Record the current LCC size and the resilience score rho
            dictx[kiter] = lcc_size(Gor)
            dicty[kiter] = 1 - (abs((lcc_size(Gor) - lccinit) / lccinit))
            # Re-rank the nodes by collective influence after every single removal
            toremove = collective_influence_centality(Gor, toremove, weight=weight)
            Gor.remove_node(toremove[0])
            toremove.pop(0)
            kiter += 1
            i += 1
    dflcc['No'] = list(dictx.keys())
    dflccvar['No'] = dflcc['No']
    dflcc['centrality'] = list(dictx.values())
    dflccvar['centrality'] = list(dicty.values())
    return dflcc, dflccvar
dflcc, dflccvar = disruption_sequence(G)
We store the LCC and the score of the network given the sequential disruption. We can then plot the result:
plt.figure(figsize=(12,8))
plt.plot(dflcc['No'], dflcc['centrality'])
plt.title("LCC Size with sequential disruption")
plt.show()
We can also plot the score described above:
plt.figure(figsize=(12,8))
plt.plot(dflccvar['No'], dflccvar['centrality'])
plt.title("Centrality score with sequential disruption")
plt.show()
For the block removal, one simply removes 5 nodes at a time before re-computing the collective influence ranking:
def disruption_block(Graph, weight=None):
    Gor = copy.deepcopy(Graph)
    lccinit = lcc_size(Graph)
    dflcc = pd.DataFrame()
    dflccvar = pd.DataFrame()
    dictx = dict()
    dicty = dict()
    kiter = 0
    toremove = array.array('i', [])
    while Gor.number_of_nodes() > nrem:
        i = 0
        toremove = array.array('i', [])
        # Record the LCC size and the resilience score rho once per block
        dictx[kiter] = lcc_size(Gor)
        dicty[kiter] = 1 - (abs((lcc_size(Gor) - lccinit) / lccinit))
        # Rank the nodes by collective influence once, then remove the top nrem
        toremove = collective_influence_centality(Gor, toremove, weight=weight)
        while i < nrem:
            if Gor.number_of_nodes() <= nrem:
                break
            Gor.remove_node(toremove[0])
            toremove.pop(0)
            kiter += 1
            i += 1
    dflcc['No'] = list(dictx.keys())
    dflccvar['No'] = dflcc['No']
    dflcc['centrality'] = list(dictx.values())
    dflccvar['centrality'] = list(dicty.values())
    return dflcc, dflccvar
And as before, we can compute the LCC and the score as a function of the number of iterations:
dflcc, dflccvar = disruption_block(G)
plt.figure(figsize=(12,8))
plt.plot(dflcc['centrality'])
plt.title("LCC Size with block disruption")
plt.show()
plt.figure(figsize=(12,8))
plt.plot(dflccvar['centrality'])
plt.title("Centrality score with block disruption")
plt.show()
Note that the results are quite similar when you compare the number of people who have to be arrested.
Discussion
The authors argue that betweenness centrality is the best metric to apply (better than collective influence), given its focus on information bottlenecks. 5% of the network is often responsible for more than 70% of the information (and this is real-world data!).
The dataset provided is static (meaning that we do not know how the network will react to a disruption). Overall, the metric provided is interesting, and this kind of approach could be explored in criminal network investigations.