1.18.0 User Documentation for Apache MADlib
HITS
Contents

Given a graph, the HITS (Hyperlink-Induced Topic Search) algorithm outputs the authority score and hub score of every vertex, where authority estimates the value of the content of the page and hub estimates the value of its links to other pages. This algorithm was originally developed to rate web pages [1].

HITS
hits( vertex_table,
vertex_id,
edge_table,
edge_args,
out_table,
max_iter,
threshold,
grouping_cols
)


Arguments

vertex_table

TEXT. Name of the table containing the vertex data for the graph. Must contain the column specified in the 'vertex_id' parameter below.

vertex_id

TEXT, default = 'id'. Name of the column in 'vertex_table' containing vertex ids. The vertex ids can be of type INTEGER or BIGINT with no duplicates. They do not need to be contiguous.

edge_table

TEXT. Name of the table containing the edge data. The edge table must contain columns for source vertex and destination vertex.

edge_args

TEXT. A comma-delimited string containing multiple named arguments of the form "name=value". The following parameters are supported for this string argument:

• src (INTEGER or BIGINT): Name of the column containing the source vertex ids in the edge table. Default column name is 'src'.
• dest (INTEGER or BIGINT): Name of the column containing the destination vertex ids in the edge table. Default column name is 'dest'.

out_table

TEXT. Name of the table to store the result of HITS. It will contain a row for every vertex from 'vertex_table' with the following columns:

• vertex_id : The id of a vertex. Will use the input parameter 'vertex_id' for column naming.
• authority : The vertex authority score.
• hub : The vertex hub score.
• grouping_cols : Grouping column values (if any) associated with the vertex_id.

A summary table is also created that contains information regarding the number of iterations required for convergence. It is named by adding the suffix '_summary' to the 'out_table' parameter.

max_iter (optional)

INTEGER, default: 100. The maximum number of iterations allowed. Each iteration consists of both authority and hub phases.

threshold (optional)

FLOAT8, default: (1/number of vertices * 1000). Threshold must be set to a value between 0 and 1, inclusive of end points. If the difference between two consecutive iterations of authority AND two consecutive iterations of hub is smaller than 'threshold', then the computation stops. That is, both authority and hub value differences must be below the specified threshold for the algorithm to stop. If you set the threshold to 0, then you will force the algorithm to run for the full number of iterations specified in 'max_iter'.

grouping_cols (optional)
TEXT, default: NULL. A single column or a list of comma-separated columns that divides the input data into discrete groups, resulting in one distribution per group. When this value is NULL, no grouping is used and a single model is generated for all data.
Note
Expressions are not currently supported for 'grouping_cols'.

Examples
1. Create vertex and edge tables to represent the graph:
DROP TABLE IF EXISTS vertex, edge;
CREATE TABLE vertex(
id INTEGER
);
CREATE TABLE edge(
src INTEGER,
dest INTEGER,
user_id INTEGER
);
INSERT INTO vertex VALUES
(0),
(1),
(2),
(3),
(4),
(5),
(6);
INSERT INTO edge VALUES
(0, 1, 1),
(0, 2, 1),
(0, 4, 1),
(1, 2, 1),
(1, 3, 1),
(2, 3, 1),
(2, 5, 1),
(2, 6, 1),
(3, 0, 1),
(4, 0, 1),
(5, 6, 1),
(6, 3, 1);

2. Running HITS with default values for optional parameters:
DROP TABLE IF EXISTS hits_out, hits_out_summary;
'vertex',             -- Vertex table
'id',                 -- Vertex id column
'edge',               -- Edge table
'src=src, dest=dest', -- Comma delimited string of edge arguments
'hits_out');          -- Output table of HITS
SELECT * FROM hits_out ORDER BY id;

 id |      authority       |         hub
----+----------------------+----------------------
0 |    8.43871829093e-07 |    0.338306115082665
1 |    0.158459587238244 |    0.527865350448059
2 |    0.405627969689677 |    0.675800764727558
3 |    0.721775835521825 |    3.95111934817e-07
4 |    0.158459587238244 |    3.95111934817e-07
5 |    0.316385413093048 |    0.189719957843216
6 |    0.405199928761102 |    0.337944978189241
(7 rows)

SELECT * FROM hits_out_summary;

 __iterations__
-----------------+
17
(1 row)

3. Running HITS with max_iter of 3 results in different authority and hub scores:
DROP TABLE IF EXISTS hits_out, hits_out_summary;
'vertex',             -- Vertex table
'id',                 -- Vertex id column
'edge',               -- Edge table
'src=src, dest=dest', -- Comma delimited string of edge arguments
'hits_out',           -- Output table
3);                  -- Max iteration
SELECT * FROM hits_out ORDER BY id;

 id |     authority     |        hub
----+-------------------+--------------------
0 |   0.08653327387778 | 0.375721659592363
1 |   0.18388320699029 | 0.533118571043218
2 |   0.43266636938891 | 0.654974244424525
3 |   0.70308285025699 | 0.040618557793769
4 |   0.18388320699029 | 0.040618557793769
5 |   0.30286645857224 | 0.182783510071961
6 |   0.38939973245002 | 0.330025782074373
(7 rows)

SELECT * FROM hits_out_summary;

 __iterations__
-----------------+
3
(1 row)

4. Running HITS with a low threshold of 0.00001 results in more iterations for convergence:
DROP TABLE IF EXISTS hits_out, hits_out_summary;
'vertex',             -- Vertex table
'id',                 -- Vertex id column
'edge',               -- Edge table
'src=src, dest=dest', -- Comma delimited string of edge arguments
'hits_out',           -- Output table
NULL,                 -- Default max_iter
0.00001);             -- Threshold
SELECT * FROM hits_out ORDER BY id;

 id |      authority       |         hub
----+----------------------+---------------------
0 |    1.15243075426e-09 |     0.33800946769422
1 |    0.158264459912827 |    0.527792117750177
2 |    0.405384672299625 |    0.675965453766535
3 |     0.72186275724613 |    5.39583282614e-10
4 |    0.158264459912827 |    5.39583282614e-10
5 |    0.316493740997913 |    0.189793242747412
6 |    0.405356461070609 |    0.337985666133163
(7 rows)

SELECT * FROM hits_out_summary;

 __iterations__
-----------------+
25
(1 row)

5. Running HITS with both max_iter and threshold:
DROP TABLE IF EXISTS hits_out, hits_out_summary;
'vertex',             -- Vertex table
'id',                 -- Vertex id column
'edge',               -- Edge table
'src=src, dest=dest', -- Comma delimited string of edge arguments
'hits_out',           -- Output table
20,                   -- Default max_iter
0.00001);             -- Threshold
SELECT * FROM hits_out ORDER BY id;

 id |      authority       |         hub
----+----------------------+---------------------
0 |    7.11260011825e-08 |    0.33810307986005
1 |    0.158326035587958 |   0.527815233930963
2 |    0.405461453180491 |   0.675913495026452
3 |    0.721835343230399 |   3.33021322089e-08
4 |    0.158326035587958 |   3.33021322089e-08
5 |    0.316459563893809 |   0.189770119973925
6 |    0.405307074424261 |   0.337972831786458
(7 rows)

SELECT * FROM hits_out_summary;

 __iterations__
-----------------+
20
(1 row)

The algorithm stopped at 20 iterations even though the convergence for threshold of 0.00001 is at 25 iterations. This is because max_iter was set to 20.
6. Running HITS with grouping column and default values for max_iter and threshold. Add more rows to the edge table to create different graphs based on the user_id column.
INSERT INTO edge VALUES
(0, 1, 2),
(0, 2, 2),
(0, 4, 2),
(1, 2, 2),
(1, 3, 2),
(2, 3, 2),
(3, 0, 2),
(4, 0, 2),
(5, 6, 2),
(6, 3, 2);
DROP TABLE IF EXISTS hits_out, hits_out_summary;
'vertex',             -- Vertex table
'id',                 -- Vertex id column
'edge',               -- Edge table
'src=src, dest=dest', -- Comma delimited string of edge arguments
'hits_out',           -- Output table
NULL,                 -- Default max_iter
NULL,                 -- Threshold
'user_id');           -- Grouping column
SELECT * FROM hits_out ORDER BY user_id, id;

 user_id | id |      authority       |         hub
---------+----+----------------------+----------------------
1 |  0 |    8.43871829093e-07 |    0.338306115082665
1 |  1 |    0.158459587238244 |    0.527865350448059
1 |  2 |    0.405627969689677 |    0.675800764727558
1 |  3 |    0.721775835521825 |    3.95111934817e-07
1 |  4 |    0.158459587238244 |    3.95111934817e-07
1 |  5 |    0.316385413093048 |    0.189719957843216
1 |  6 |    0.405199928761102 |    0.337944978189241
2 |  0 |    1.60841750444e-05 |    0.632262085114062
2 |  1 |    0.316079985713431 |    0.632529390899584
2 |  2 |    0.632364174872359 |    0.316347297480213
2 |  3 |    0.632694582987791 |    8.04208767442e-06
2 |  4 |    0.316079985713431 |    8.04208767442e-06
2 |  5 |                    0 |    1.22712519446e-10
2 |  6 |    2.45425034248e-10 |    0.316347297480213
(14 rows)

SELECT * FROM hits_out_summary order by user_id;

 user_id | __iterations__
---------+----------------
1 |             17
2 |             16
(2 rows)


Notes
1. On a Greenplum cluster, the edge table should be distributed by the source vertex id column for better performance.
2. This implementation of the HITS algorithm supports multigraph and each duplicated edge is considered for counting when calculating authority and hub scores.

Literature

[1] Kleinerg, Jon M., "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM, Sept. 1999. https://www.cs.cornell.edu/home/kleinber/auth.pdf