Snowball Algorithm Description

From GM-RKB
Jump to navigation Jump to search

This page describes the Snowball Relation Extraction Algorithm in detail.

Definition


Input/Output

This section describes Snowball's input/output requirements

Input: Initial Seed Tuples

One of the inputs is the initial set of seed tuples (Training Cases). For the OrganizationHeadquarterLocation() relation a sample set of seeds is:

| ORGANIZATION || LOCATION | | MICROSOFT || REDMOND | | IBM || ARMONK | | BOEING || SEATTLE | | INTEL || SANTA CLARA |


Input: Annotated Training Corpus

The second input is a corpus where the entities in the relation have been annotated.


Output: Tuples

This is an example of a Snowball output tuple: <Tuple> <TupleField> <FieldName>LOCALIZATION</FieldName> <FieldValue>+periplasm+</FieldValue> </TupleField> <TupleField> <FieldName>PROTEIN</FieldName> <FieldValue>+zncl2+</FieldValue> </TupleField> <TupleRank>0.30240786</TupleRank> <SupportingDocId>%2Fhome%2Fhomira%2Fdata%2FtrainSecondary%2F884</SupportingDocId> </Tuple> <Tuple> <TupleField> <FieldName>LOCALIZATION</FieldName> <FieldValue>+periplasm+</FieldValue> </TupleField> <TupleField> <FieldName>PROTEIN</FieldName> <FieldValue>+pao1161+</FieldValue> </TupleField> <TupleRank>0.305802</TupleRank> <SupportingDocId>%2Fhome%2Fhomira%2Fdata%2FtrainSecondary%2F6066</SupportingDocId> </Tuple>


Output: Patterns

This is an example of a Snowball pattern: <Pattern> <TermVector> </TermVector> <TermVector> <VectorElement> <Term>insertion</Term> <Weight>0.4082483</Weight> </VectorElement> <VectorElement> <Term>h</Term> <Weight>0.4082483</Weight> </VectorElement> <VectorElement> <Term>soluble</Term> <Weight>0.4082483</Weight> </VectorElement> <VectorElement> <Term>protein</Term> <Weight>0.4082483</Weight> </VectorElement> <VectorElement> <Term>produced</Term> <Weight>0.4082483</Weight> </VectorElement> <VectorElement> <Term>volcanii</Term> <Weight>0.4082483</Weight> </VectorElement> </TermVector> <TermVector> </TermVector> <SupportingTupleId>2</SupportingTupleId> <PatternRank>1.0</PatternRank> <PatternClosure>01</PatternClosure> </Pattern>


Algorithm


Process

  1. Start with seed set R of tuples
  2. Generate set P of patterns from R <= GeneratePatterns()
  3. Compute support and confidence for each pattern in P
  4. Discard patterns with low support or confidence
  5. Generate new set T of tuples matching patterns P <= GenerateTuples(Patterns)
  6. Compute confidence of each tuple in T
  7. Add to R the tuples t2T with conf(t)> threshold.
  8. Go back to step 2

Algorithm - Functions


GeneratePatterns() Function

  1. Assign the first context vector representation V as the centroid of cluster C
  2. For Vi, compute the similarity with the centroids of each existing clusters, let S be the highest similarity.
  3. If S the threshold value (St), add the item to the corresponding cluster and re-compute the cluster centroids, else use Vi to form a new cluster
  4. Go to step 2 if another V needs to be clustered, else proceed to step 5
  5. Return the centroids of each cluster, add them (i.e. promoted extraction patterns) to candidate pattern set

  sub GeneratePatterns
    foreach (< oseed; `seed >, str) in Occurrences
  (1) tS =< ls; t1;ms; t2; rs > = makeOccurrence(str);
  (2) cluster best = FindClosestCluster(tS,  sim);
      if(best)
         best.Add(tS );
         best.UpdateCentroid(tS);
         best.AddTuple(< oseed; `seed >);
      else
         CreateNewCluster(< oseed; `seed >, tS);
  (3)Patterns = FilterPatterns( Clusters, sup);
    return Patterns;


GenerateTuples(Patterns) Function

  sub GenerateTuples(Patterns)
    foreach text segment in corpus
  (1) {<o,l>, < ls; t1;ms; t2; rs>} =
        = CreateOccurrence(text segment);
      TC = < o;` >;
      SimBest = 0;
      foreach p in Patterns
  (2)   sim = Match(< ls; t1;ms; t2; rs >; p);
        if (sim >= Tsim)
  (3)   UpdatePatternSelectivity(p, TC);
          if(sim   SimBest)
            SimBest = sim;
            PBest = p;
      if(SimBest >= Tsim)
        CandidateTuples [TC ].Patterns [PBest ] =
          = SimBest;
    return CandidateTuples;


Open Issues

Some of the recognized open problems for the application of Snowball are:

  • Automated setting of internal thresholds. Many thresholds are required. E.g.:
    • the maximum number of words that a pattern can be found in.
    • a minimum threshold on the similarity between records in clusters (similar to the selection of the number of clusters).

See: Snowball Internal Parameters