Enron words

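This is the bipartite document–word network of the Enron email collection from the UCI Bag of Words dataset. Left nodes are documents and right nodes are words. Each edge represents one occurrence of a word in a document, so a word that appears several times in the same document is connected to it by multiple edges.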
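
As a rough illustration of how occurrence counts map onto this network, the sketch below aggregates (document, word, count) triples into left size, right size, volume, and unique edge count. The triple layout, the toy data, and the function name `summarize` are assumptions made for this example; they are not taken from the dataset files.

```python
from collections import Counter

def summarize(triples):
    """Aggregate (doc, word, count) triples into basic bipartite statistics."""
    multiplicity = Counter()
    for doc, word, count in triples:
        # Repeated (doc, word) pairs accumulate into one edge multiplicity.
        multiplicity[(doc, word)] += count

    docs = {doc for doc, _ in multiplicity}
    words = {word for _, word in multiplicity}
    return {
        "left_size": len(docs),               # number of documents
        "right_size": len(words),             # number of distinct words
        "volume": sum(multiplicity.values()), # edges counted with multiplicity
        "unique_edges": len(multiplicity),    # distinct (doc, word) pairs
    }

# Hypothetical toy input: two documents, two words.
toy = [(1, "energy", 3), (1, "meeting", 1), (2, "energy", 2)]
stats = summarize(toy)
print(stats)
```

In this toy input the volume exceeds the unique edge count because "energy" occurs more than once per document, which is the same relationship that holds between the volume and unique edge count of the full dataset.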

Metadata

Code		`EN`
Internal name		`bag-enron`
Name		Enron words
Data source		http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
Availability		Dataset is available for download
Consistency check		Dataset passed all tests
Category		Text network
Node meaning		Document, word
Edge meaning		Occurrence
Network format		Bipartite, undirected
Edge type		Unweighted, multiple edges

Size	n =	67,960
Left size	n₁ =	39,861
Right size	n₂ =	28,099
Volume	m =	6,412,172
Unique edge count	m̿ =	3,710,420
Wedge count	s =	3,214,624,476
Claw count	z =	2,510,007,422,598
Cross count	x =	2,191,825,474,071,012
Square count	q =	45,471,014,642
4-Tour count	T₄ =	376,634,510,028
Maximum degree	d_max =	7,190
Maximum left degree	d_1max =	2,120
Maximum right degree	d_2max =	7,190
Average degree	d =	188.704
Average left degree	d₁ =	160.863
Average right degree	d₂ =	228.199
Fill	p =	0.003 312 71
Average edge multiplicity	m̃ =	1.728 15
Size of LCC	N =	67,960
Diameter	δ =	6
50-Percentile effective diameter	δ_0.5 =	2.492 21
90-Percentile effective diameter	δ_0.9 =	3.606 21
Median distance	δ_M =	3
Mean distance	δ_m =	2.992 72
Gini coefficient	G =	0.707 894
Balanced inequality ratio	P =	0.224 254
Left balanced inequality ratio	P₁ =	0.225 645
Right balanced inequality ratio	P₂ =	0.156 346
Relative edge distribution entropy	H_er =	0.897 344
Power law exponent	γ =	1.269 14
Tail power law exponent	γ_t =	1.991 00
Degree assortativity	ρ =	−0.174 109
Degree assortativity p-value	p_ρ =	0.000 00
Spectral separation	|λ₁[A] / λ₂[A]| =	1.700 69
Controllability	C =	14,724
Relative controllability	C_r =	0.216 657
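
Several of the statistics above can be cross-checked from the basic counts. The following is a minimal sketch that assumes the usual bipartite definitions (average degree 2m/n, left and right average degrees m/n₁ and m/n₂, fill as unique edges divided by n₁·n₂, average multiplicity as volume divided by unique edges, and relative controllability C/n). These formulas are assumptions of the example, although they do reproduce the tabulated values to the precision shown.

```python
# Basic counts copied from the statistics table.
n1, n2 = 39_861, 28_099            # left size (documents), right size (words)
m, m_unique = 6_412_172, 3_710_420 # volume, unique edge count
controllability = 14_724           # C

# Derived quantities under the assumed definitions.
n = n1 + n2                        # size
avg_degree = 2 * m / n             # d
avg_left = m / n1                  # d1
avg_right = m / n2                 # d2
fill = m_unique / (n1 * n2)        # p
avg_multiplicity = m / m_unique    # average edge multiplicity
rel_control = controllability / n  # C_r

print(f"n   = {n}")
print(f"d   = {avg_degree:.3f}")
print(f"d1  = {avg_left:.3f}")
print(f"d2  = {avg_right:.3f}")
print(f"p   = {fill:.8f}")
print(f"m~  = {avg_multiplicity:.5f}")
print(f"C_r = {rel_control:.6f}")
```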

[1]	Jérôme Kunegis. KONECT – The Koblenz Network Collection. In Proc. Int. Conf. on World Wide Web Companion, pages 1343–1350, 2013.
[2]	M. Lichman. UCI Machine Learning Repository, 2013.