I've been reading comics since I was a little kid. Neil Gaiman, Alan Moore, Warren Ellis and Art Spiegelman are among my favorite authors now, but when I was younger I was a big fan of Marvel Super Heroes.
When I discovered the Marvel Social Graph dataset, I immediately wanted to find out how the different characters influenced each other.
By using cluster analysis, I am going to figure out how my heroes actually interacted with each other in the Marvel Universe.
The Marvel dataset is composed of a list of co-occurrences of Super Heroes. For example, every time Spider Man appears in a comic book with Captain America, we will have a line with both their names. To illustrate this, here is a visual of the data in Data Science Studio.
Because of these connections, it is possible to create a graph, where each node is a character and each link (or edge) indicates the presence of a co-occurrence in the dataset.
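To make the idea concrete, here is a minimal sketch of how such a graph can be built from co-occurrence lines. It uses only the Python standard library rather than the tooling used for this post, the sample rows are made up for illustration, and `build_graph` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up sample rows: each pair stands for one line of the dataset.
cooccurrences = [
    ("Spider Man", "Captain America"),
    ("Spider Man", "Iron Man"),
    ("Captain America", "Iron Man"),
    ("Captain America", "Thor"),
    ("Spider Man", "Captain America"),  # the same pair can show up in several comics
]


def build_graph(pairs):
    """Return an adjacency map: character -> set of characters they co-occur with."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # ignore a character paired with itself
        neighbors[a].add(b)
        neighbors[b].add(a)
    return dict(neighbors)


graph = build_graph(cooccurrences)

# Number of distinct neighbors of each character.
degree = {hero: len(adjacent) for hero, adjacent in graph.items()}
print(degree)
```

Using sets means a repeated pair still produces a single link, which matches the "presence of a co-occurrence" definition of an edge.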
I did all the network analysis with Python (mostly using the networkx package) and used sigmajs to build a small web application to visualize the network.
Here is the first visualization of our Marvel Social Network. The node size corresponds to the degree (number of edges adjacent to the node) in the graph and the node color to graph clusters detected with the Louvain method. I used the sigmajs force layout to arrange it.
Now, most people will tell you it is rarely interesting to plot the whole graph because it will be hard to interpret. But in this case, we can validate a few hypotheses:
the heroes with the highest degree are Captain America, Spider Man and Iron Man. So these heroes are found to be of high importance in the social graph, which is logical since they are among the stars of the Marvel Universe.
some clusters appear:
We would like to understand the internal structure of the Marvel graph but how can we prune the graph from all the unknown characters, such as Fah Lo Suee, polluting our graph?
One way to do that would be to use edge weights. So far, an edge was defined by the existence of at least one co-occurrence.
However, Captain America and Spider Man appear together in so many comics that it would be smart of us to take this into account.
The solution is for us to use edge weights. An edge weight is some extra information about the edge that we add to our graph to change its shape. In our case, this extra information will be the number of co-occurrences between characters.
For example, the edge weight of the link between Spider Man and Captain America is the number of comics they both appear in together.
If we prune the graph by keeping only edges (and corresponding nodes) having a weight higher than a specific threshold (K for example), we will drastically simplify the graph: all nodes with fewer than K appearances in the dataset and all edges with weight less than K will disappear.
So the more we increase K, the simpler the graph will be and we will get closer to the Marvel Graph skeleton.
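The weighting and pruning steps described above can be sketched as follows. This is an illustration with made-up rows and hypothetical helper names (`edge_weights`, `prune`), written with the standard library only. Whether an edge whose weight is exactly K survives is an assumption here: the sketch keeps it.

```python
from collections import Counter

# Made-up sample rows: one pair per co-occurrence line.
cooccurrences = (
    [("Spider Man", "Captain America")] * 3
    + [("Captain America", "Spider Man")]  # reversed order, same edge
    + [("Spider Man", "Iron Man")] * 2
    + [("Fah Lo Suee", "Iron Man")]
)


def edge_weights(pairs):
    """Count how many times each unordered pair of characters co-occurs."""
    weights = Counter()
    for a, b in pairs:
        if a != b:
            weights[tuple(sorted((a, b)))] += 1
    return weights


def prune(weights, k):
    """Keep edges with weight >= k (assumed boundary) and the nodes they touch."""
    kept = {edge: w for edge, w in weights.items() if w >= k}
    nodes = {hero for edge in kept for hero in edge}
    return kept, nodes


weights = edge_weights(cooccurrences)
kept_edges, kept_nodes = prune(weights, k=2)
print(kept_edges)
print(kept_nodes)
```

With K = 2 on this toy input, the single Fah Lo Suee co-occurrence is dropped and the character disappears with it, while the heavier edges remain.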
For example, here is the graph generated when the value of K is 10.
It is starting to get easier to describe the Marvel Universe. Three heroes have their own very rich universe (cluster): Spider Man, Captain America and Thor. On the bottom left, the X-Men are still clustered together.
Meanwhile, Iron Man, Hulk, the Fantastic 4, Hawkeye, Ant Man and Vision all belong to the central cluster. This is because the ratio of their appearances together (as the Avengers) to their appearances alone is very high.
We can do a similar analysis for K = 30.
This is the backbone of the Marvel Universe. There is a cluster for Spider Man, Hulk (dark green), Namor (pale green), Thor and the Fantastic 4 (pink), though Captain America and Iron Man are still clustered together in the Avengers team.
There is a broad literature on detecting influencers in social networks (for example, Twitter and Facebook).
Numerous graph criteria have been derived to describe them: degree centrality, PageRank, closeness centrality or betweenness centrality. I find betweenness centrality very interesting because its value represents how important a node is for conveying information between individuals in the network who are not directly connected to each other.
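To give a feel for what betweenness centrality measures, here is a small standard-library sketch of Brandes' algorithm for an undirected, unweighted graph. It is not the code used for this post, and the toy graph (two triangles joined by a single "bridge" node) is invented for illustration.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm), unweighted and undirected."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search from source, counting shortest paths.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from source
        dist = {v: -1 for v in graph}   # -1 means "not visited yet"
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: s / 2 for v, s in score.items()}


# Invented toy graph: two tight groups that only communicate through "bridge".
toy = {
    "x1": {"x2", "x3", "bridge"},
    "x2": {"x1", "x3"},
    "x3": {"x1", "x2"},
    "bridge": {"x1", "a1"},
    "a1": {"a2", "a3", "bridge"},
    "a2": {"a1", "a3"},
    "a3": {"a1", "a2"},
}

scores = betweenness(toy)
print(max(scores, key=scores.get))
```

The bridge node has only two neighbors, fewer than the group members it connects to, yet every shortest path between the two groups passes through it, so it gets the highest score. In networkx, `betweenness_centrality` computes the same kind of measure (normalized by default).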
Obviously, Marvel stars like Spider Man, Thor and Hulk are good candidates and do have very high values on every centrality measure imaginable. But if we draw the graph for K = 50, the real cornerstone of the Marvel Universe (the character with the highest betweenness value) clearly appears.
It is Beast! What is surprising is that he is not the X-Men's biggest star (Wolverine, Storm, Professor X...) but he is the main link between the X-Men and the Avengers. So if I were to choose a negotiator between these two groups in a modern company, to target a person to dismantle a terrorist network or to choose my brand ambassador in the Marvel Universe, Beast would be the perfect candidate.
I had a lot of fun mining the Marvel graph for clusters and influencers.
Obviously this analysis is a pretty basic one, so here are some other things I would consider interesting:
So if you see that Marvel or DC has some new data to share, contact me on Twitter @prrgutierrez, I would be happy to do a second post on the subject!