PPLRE Shina

Mission: To produce credible output from the application of Snowball to PPLRE.
Next Milestone: Our next group meeting at SFU with Zhongmin is scheduled for Monday March 12th at 10AM.
- The hope is that you will have updated results for PO, PL, and OPL that include:
- 1) some optimization of Snowball
- 2) results that include the false negatives
- 3) results that use 196 seeds

Current Tasks

070226: Please send me information on what you did to create the OPL results (e.g. new script and output directory?)
070226: Continue to automate the process of generating the evaluation measures, where possible.
070226: Optimize Snowball's PO and PL F-score (on 196 seeds) by updating the following Snowball parameters:
- Iterations: 45, 20, and 100 (or some reasonable increment)
- Cluster.minSimilary: 0.06, 0.03, 0.09.
- TupleExtractor.minSimilarity: 0.05, 0.025, 0.075.
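
One possible way to enumerate the runs for this optimization is sketched below. It assumes a full grid over the three parameters is wanted (varying one parameter at a time would need fewer runs), and the dictionary keys simply repeat the names as written in these notes; they have not been checked against Snowball's actual configuration file.

```python
from itertools import product

# Values copied from the task list above. The keys repeat the parameter
# names as written in these notes; they are not verified against Snowball.
PARAM_GRID = {
    "Iterations": [45, 20, 100],
    "Cluster.minSimilary": [0.06, 0.03, 0.09],
    "TupleExtractor.minSimilarity": [0.05, 0.025, 0.075],
}


def parameter_settings(grid):
    """Yield one dict per combination of parameter values (full grid)."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


settings = list(parameter_settings(PARAM_GRID))
print(len(settings))  # 3 * 3 * 3 = 27 runs
```

Each dict could then be written into a config file for one Snowball run and tagged with a RUNID for the evaluation table.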

Optimize Snowball's PO and PL F-score (on 196 seeds) by updating the following Snowball parameters:
- Training set size: ~1200, ~15000, 0.
Unit test the division into many files by comparing the performance when the sentences are split across many files vs. when they are kept together.
Unit test to ensure that all seeds count. E.g. include the seed file in the test set which should always result in the seed being discovered.
Apply Snowball to the entire corpus.
Analyze whether Snowball makes effective use of many seeds (e.g. thousands of seeds).
Analyze Snowball using 10-fold cross validation.
Document the installation process and test it by reinstalling in a new directory (e.g. create seed files from scratch).
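
For the 10-fold cross validation task above, a minimal sketch of the fold split might look like this. It assumes the unit being folded is a flat list (seeds or curated sentences; the notes do not say which), and the `seed0`..`seed195` names are placeholders, not real seed records.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs; every item lands in exactly one test fold."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split repeatable
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Placeholder identifiers standing in for the 196 seeds.
seeds = [f"seed{n}" for n in range(196)]
splits = list(k_fold_splits(seeds))
```

With 196 items the test folds hold 19 or 20 items each; P/R/F would be computed per fold and then averaged.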

070219: Fix the problem with entities that have dashes inside. E.g. in sentence 6061.6 the protein "47-kda lipoprotein" is currently being changed to "47 - kda lipoprotein" and the difference (the spaces beside the - dash) causes the match test to fail.
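
A possible fix for the dash problem is to normalize both strings before the match test. The sketch below is only one option, and `rejoin_dashes` is a hypothetical helper name. Note that it also collapses dashes that were genuinely written with spaces, so it should be applied to both the entity name and the sentence text, not just one side.

```python
import re


def rejoin_dashes(text):
    """Remove the spaces around a dash that sits between two word characters."""
    return re.sub(r"(?<=\w) - (?=\w)", "-", text)


def same_entity(a, b):
    """Match test that ignores spacing differences around dashes."""
    return rejoin_dashes(a) == rejoin_dashes(b)


print(rejoin_dashes("47 - kda lipoprotein"))  # 47-kda lipoprotein
```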
070222: Create a tab-delimited (graph) file with RELTYPE|TRAINSIZE|RUNID|TP|FP|FN|TN|Precision|Recall|F-Score
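
A minimal sketch of how the rows of this file could be produced, assuming the standard definitions of precision, recall, and F-score (F1). The function names are hypothetical, and the counts in the example call are made up for illustration only; they are not results.

```python
import csv
import io

COLUMNS = ["RELTYPE", "TRAINSIZE", "RUNID", "TP", "FP", "FN", "TN",
           "Precision", "Recall", "F-Score"]


def prf(tp, fp, fn):
    """Return (precision, recall, F1); 0.0 where a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def write_results(handle, runs):
    """Write one tab-delimited row per (reltype, trainsize, runid, tp, fp, fn, tn)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for reltype, trainsize, runid, tp, fp, fn, tn in runs:
        p, r, f = prf(tp, fp, fn)
        writer.writerow([reltype, trainsize, runid, tp, fp, fn, tn,
                         f"{p:.3f}", f"{r:.3f}", f"{f:.3f}"])


# Illustrative counts only, not real output.
buffer = io.StringIO()
write_results(buffer, [("PO", 1200, 1, 8, 2, 8, 0)])
print(buffer.getvalue())
```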
070219: Update your Train code to only train on records with a +1 label.
070219: Update your Test code to calculate P/R/F only from sentences that have been curated.
070219: Update your Test code to set predictions that match a record with a -1 label to be "FP".
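
The three 070219 Train/Test rules above could be sketched as follows. The record layout (a tuple of sentence id and two arguments, mapped to a +1 or -1 label) is an assumption, not the actual format of "curated.tab". The notes also do not say how to score a prediction in a curated sentence that matches no curated record, so this sketch counts those separately as "unmatched" instead of guessing.

```python
def training_records(curated):
    """Train only on records with a +1 label."""
    return [key for key, label in curated.items() if label == 1]


def score(predictions, curated):
    """Count TP/FP/FN/TN using only sentences that have been curated.

    predictions: set of (sentence_id, arg1, arg2) tuples (assumed layout)
    curated: dict mapping the same kind of tuple to +1 or -1
    """
    curated_sentences = {key[0] for key in curated}
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0, "unmatched": 0}
    for pred in predictions:
        if pred[0] not in curated_sentences:
            continue  # sentence was never curated: leave it out of P/R/F
        label = curated.get(pred)
        if label == 1:
            counts["TP"] += 1
        elif label == -1:
            counts["FP"] += 1  # matches a record labeled -1
        else:
            counts["unmatched"] += 1  # scoring rule not stated in the notes
    for key, label in curated.items():
        if key not in predictions:
            counts["FN" if label == 1 else "TN"] += 1
    return counts
```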
070219: Write a script to produce evaluation measures for OPL triple predictions. The script takes inputs "out.po.tab", "out.pl.tab", and "curated.tab". The rule for joining PO and PL relationships is that 1) the sentence must be the same and 2) the protein name must be the same... Try to look through Zhongmin's code for ideas and to ensure that the two of you are doing it similarly.
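
The joining rule above could be sketched like this. It assumes PO rows are (sentence id, protein, organism) and PL rows are (sentence id, protein, location); the real column order of "out.po.tab" and "out.pl.tab" may differ, and `join_po_pl` is a hypothetical name. This is not Zhongmin's implementation, only an illustration of the two-condition join.

```python
from collections import defaultdict


def join_po_pl(po_rows, pl_rows):
    """Join PO and PL predictions that share the same sentence and protein.

    Returns a sorted list of (sentence_id, organism, protein, location).
    """
    locations = defaultdict(set)
    for sentence_id, protein, location in pl_rows:
        locations[(sentence_id, protein)].add(location)
    triples = set()
    for sentence_id, protein, organism in po_rows:
        # .get avoids creating empty entries for proteins with no PL row
        for location in locations.get((sentence_id, protein), ()):
            triples.add((sentence_id, organism, protein, location))
    return sorted(triples)
```

The resulting triples can then be compared against the curated OPL records in the same way as the PO and PL predictions.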
070226: We reviewed the results of testing against the whole document instead of sentence by sentence. Unexpectedly, the performance was worse. We looked at some of the records but could not determine the reason for this. However, we felt that the current sentence-by-sentence approach is not penalizing Snowball.