WebIsAGraph

A very large hypernymy graph from a Web corpus

Stefano Faralli (1), Irene Finocchi (2), Simone Paolo Ponzetto (3), Paola Velardi (2)

(1) University of Rome Unitelma Sapienza, Italy: stefano.faralli@unitelmasapienza.it
(2) University of Rome Sapienza, Italy: irene.finocchi@uniroma1.it, velardi@di.uniroma1.it
(3) University of Mannheim, Germany: simone@informatik.uni-mannheim.de

We present the WebIsAGraph, a very large hypernymy graph compiled from a dataset of is-a relationships extracted from the CommonCrawl. We provide the resource together with a Neo4j plugin that enables efficient searching and querying over such a large graph.

License

All the following resources are released under the Creative Commons Attribution 4.0 International license (CC BY 4.0).

The WebIsAGraph

Click here to download the Neo4j dump of the graph we used for our experiments. The dump can be loaded on your machine with the neo4j-admin tool, as described in the Neo4j documentation: https://neo4j.com/docs/operations-manual/current/tools/dump-load/

Tuple Lucene indexes

These companion tuple datasets from WebIsADb are needed to add the metadata to the WebIsAGraph.

term_0.zip (53,806 KB), term_1.zip (339,002 KB), term_2.zip (260,666 KB), term_3.zip (132,427 KB), term_4.zip (84,153 KB), term_5.zip (64,572 KB), term_6.zip (42,697 KB), term_7.zip (34,845 KB), term_8.zip (37,656 KB), term_9.zip (31,193 KB), term_a.zip (8,739,473 KB), term_b.zip (7,336,726 KB), term_c.zip (12,962,392 KB), term_d.zip (6,932,319 KB), term_e.zip (5,730,732 KB), term_f.zip (7,792,534 KB), term_g.zip (4,695,330 KB), term_h.zip (5,217,468 KB), term_i.zip (5,217,468 KB), term_j.zip (2,034,182 KB), term_k.zip (1,569,434 KB), term_l.zip (5,873,690 KB), term_m.zip (9,009,297 KB), term_n.zip (4,347,512 KB), term_o.zip (4,137,990 KB), term_p.zip (11,387,218 KB), term_q.zip (533,516 KB), term_r.zip (6,622,599 KB), term_s.zip (14,052,667 KB), term_t.zip (7,577,672 KB), term_u.zip (2,195,589 KB), term_v.zip (2,487,705 KB), term_w.zip (4,586,234 KB), term_x.zip (215,776 KB), term_y.zip (1,005,775 KB), term_z.zip (288,331 KB).

Once the files are downloaded, unzip them and place their contents in your preferred <path of tuples dataset>. The resulting structure of <path of tuples dataset> must match the following:

<path of tuples dataset>/term_0/
...
<path of tuples dataset>/term_9/
<path of tuples dataset>/term_a/
<path of tuples dataset>/term_b/
...
<path of tuples dataset>/term_z/
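
Before configuring the plugin, it can be useful to check that the unzipped dataset has the layout shown above. The following is a minimal Python sketch, not part of the released resources: the function name missing_term_folders is hypothetical, and it assumes that exactly the 36 folders term_0 to term_9 and term_a to term_z are expected, as suggested by the list of archives.

```python
import string
from pathlib import Path

# Assumption: one folder per archive, i.e. term_0..term_9 and term_a..term_z.
EXPECTED_SUFFIXES = string.digits + string.ascii_lowercase


def missing_term_folders(dataset_path):
    """Return the names of the expected term_* folders that are absent."""
    root = Path(dataset_path)
    return [
        f"term_{suffix}"
        for suffix in EXPECTED_SUFFIXES
        if not (root / f"term_{suffix}").is_dir()
    ]
```

Calling missing_term_folders("<path of tuples dataset>") returns an empty list when every expected folder is present; otherwise it lists the folders that still need to be unzipped.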

Neo4j plugin

Click here to download the Neo4j plugin we developed to query the graph and combine the WebIsADb original tuple information. Once downloaded you need to:

    1. stop your Neo4j service (https://neo4j.com/docs/operations-manual/current/installation/);
    2. copy the jar file into the <yourneo4jinstallationfolder>/plugins/ folder;
    3. edit the file <yourneo4jinstallationfolder>/conf/neo4j.conf as follows:

       dbms.security.procedures.whitelist=algo.*,apoc.*,webisadb.*
       dbms.security.procedures.unrestricted=algo.*,apoc.*,webisadb.*

    4. create the folder <yourneo4jinstallationfolder>/webisadb/;
    5. create a text file webisadb.conf with the following lines:

       lucenestore TAB <path of tuples dataset>
       stopwords TAB <path to stopwords.tsv>

    6. restart your Neo4j service (https://neo4j.com/docs/operations-manual/current/installation/).

Here, <path of tuples dataset> is the path where you placed the tuples dataset (it must end with a "/"), and <path to stopwords.tsv> is the path where you placed a list of stopwords (here you can download our list).
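
The configuration file described in the steps above can also be generated with a short script. The sketch below is only illustrative: the function name write_webisadb_conf is hypothetical, and it assumes that TAB stands for a single tab character and that the file holds exactly the two key/value lines listed above. The location of the file is left to the caller.

```python
from pathlib import Path


def write_webisadb_conf(conf_path, tuples_path, stopwords_path):
    """Write a two-line webisadb.conf and return the lines written."""
    # The instructions require the tuples path to end with "/".
    if not tuples_path.endswith("/"):
        tuples_path += "/"
    # Assumption: key and value are separated by one tab character.
    lines = [
        f"lucenestore\t{tuples_path}",
        f"stopwords\t{stopwords_path}",
    ]
    Path(conf_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

The function appends the trailing "/" to the tuples path when it is missing, since forgetting it is an easy mistake to make.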

The plugin offers the following procedures and functions:

  • webisadb.tuples(relationship): retrieves the WebIsADb tuples of an IsA relationship.

usage example with Cypher:

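// Select the relationship between the nodes labelled 'celebrity' and 'katy perry'.
MATCH (m)-[r]->(n)
WHERE n.label = 'katy perry' AND m.label = 'celebrity'
WITH r
// Retrieve the tuples behind that relationship and return the first 10.
CALL webisadb.tuples(r) YIELD ID AS I, PLD AS PL, PATTERN AS P, SENTENCE AS S
RETURN I, PL, P, S LIMIT 10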

output:

I              PL                       P       S
"68541691_0"  "evilbeetgossip.com"      "p20c" "In any case, Katy Perry is ...."
"103876600_0" "stylehive.com"           "p20a" "Katy Perry is the latest celebrity ..."
"32920342_0"  "glamboulevard.com"       "p20a" "Katy Perry is the latest celebrity ..."
"41085659_0"  "thehollywoodgossip.com"  "p6"   "So Lady Gaga, Madonna, Beyonce,  ..."
"40960692_0"  "thegeekyglobe.com"       "p5"   "For example if you search ..."
"436617510_0" "variety.com"             "p8a"  "ss says: November 14, 2013 ..."
"432333045_0" "yahoo.com"               "p20c" "Katy Perry is the most ..."
"14533257_0"  "yahoo.com"               "p20a" "Katy Perry is the latest ..."
"235295090_1" "fashion-stylist.net"     "p3a"  "They make shoes and belts ..."
"244275951_0" "thecelebritycafe.com"    "p20a" "Katy Perry is the latest ..."
  • webisadb.stopword(node_label): returns 1 if node_label is a stopword.
  • webisadb.specialchars(node_label): returns the number of special characters in node_label.
  • webisadb.noisystring(node_label): returns the ratio of the number of special characters to the total number of characters in node_label.
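
To illustrate what the last two measures compute, here is a small Python sketch. It is not the plugin's implementation: the function names are hypothetical, and the definition of a special character (any character that is neither alphanumeric nor whitespace) is an assumption, since the exact character set used by the plugin is not stated here.

```python
def special_chars(label):
    """Count the special characters in a node label."""
    # Assumption: "special" means neither alphanumeric nor whitespace.
    return sum(1 for ch in label if not (ch.isalnum() or ch.isspace()))


def noisy_string(label):
    """Ratio of special characters to the total number of characters."""
    if not label:
        return 0.0  # avoid dividing by zero on an empty label
    return special_chars(label) / len(label)
```

Under this assumption, a clean label such as "katy perry" has a ratio of 0, while a label made mostly of punctuation has a ratio close to 1, which makes the ratio a simple signal for filtering noisy nodes.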

The datasets of the node polysemy prediction experiment

Click here to download the WordNet dataset.

Click here to download the DBpedia dataset.

Click here to download the WordNet U DBpedia dataset.