PPLRE Curated Data v1.4

From GM-RKB

Overview


Data Issue


Data Statistics

  • NOTE: These statistics are for v1.3, not specifically for v1.4.





High-level Statistics

| | All Docs | Train Docs | Test Docs |
| OPL Relation Cases | 398 | 333 | 65 |
| OPL Cases in 1 Sentence | | | |
| OPL Cases in 2 Sentences | | | |
| Passages with no OPL relation | | | |
| Documents | 746 | 614 | 132 |
| Sentences | | | |
| Words | | | |
| Organism Instances | | | |
| Distinct Organisms | | | |
| Protein Instances | | | |
| Distinct Proteins | | | |
| Location Instances | | | |
| Distinct Locations | | | |

  1. Documents (distinct values of the third tab-separated column, excluding TUPLE lines):

% grep -v TUPLE OPL.test.tab OPL.train.tab | perl -ne 'chomp;@t=split /\t/;print "$t[2]\n"' | sort -nu | wc -l
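
The same count can be sketched in Python. This is a minimal sketch that assumes, as the one-liner above does, that the files are tab-separated, that lines containing TUPLE are skipped, and that the third column identifies the document. The function name and the sample rows are hypothetical, and unlike `sort -nu` the sketch compares identifiers as strings.

```python
def count_distinct_docs(lines):
    """Count distinct third-column values, skipping lines that contain TUPLE."""
    doc_ids = set()
    for line in lines:
        if "TUPLE" in line:              # mirrors `grep -v TUPLE`
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2:
            doc_ids.add(fields[2])       # mirrors `$t[2]` in the perl step
    return len(doc_ids)


# Hypothetical rows; the real column layout of OPL.*.tab is not documented here.
sample = [
    "TUPLE\tx\ty\n",
    "a\tb\t101\n",
    "c\td\t101\n",
    "e\tf\t202\n",
]
print(count_distinct_docs(sample))
```

To apply it to the real data, the lines of OPL.test.tab and OPL.train.tab would be read and passed in together.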

Overlap with ePSORTdb

  • TBD
  • First, the Swiss-Prot ID needs to be associated with each protein.

Distinct Organisms per Document

Orgs   AllDCnt  AllDPrp  TstDCnt  TstDPrp
0            7     0.9%        0     0.0%
1          378    50.7%       73    55.3%  ***
2          203    27.2%       34    25.8%  *
3           94    12.6%       15    11.4%  *
4           40     5.4%        6     4.5%  *
5           13     1.7%        2     1.5%
6            5     0.7%        2     1.5%
8            2     0.3%        0     0.0%
9            2     0.3%        0     0.0%
11           1     0.1%        0     0.0%
TOTAL      745   100.0%      132   100.0%

Distinct Proteins per Document

  • For all documents in PPLRE Curated Data v1.3
  • E.g. there are eight (8) documents with one distinct Protein name.
  • E.g. there is one (1) document with thirty-six distinct Protein names.
  • Note that the co-occurrence resolution is currently weak for proteins.

Prots  AllDCnt  AllDPrp  TstDCnt  TstDPrp  *~1%
0            3     0.4%        2     1.5%
1            8     1.1%        3     2.3%
2           28     3.8%        7     5.3%  **
3           35     4.7%        6     4.5%  ***
4           40     5.4%        7     5.3%  ***
5           48     6.4%       12     9.1%  ****
6           66     8.9%        7     5.3%  *****
7           50     6.7%        9     6.8%  *****
8           60     8.1%       14    10.6%  ******
9           64     8.6%        6     4.5%  ******
10          52     7.0%        9     6.8%  *****
11          41     5.5%        5     3.8%  ****
12          46     6.2%        4     3.0%  ****
13          35     4.7%       10     7.6%  ****
14          41     5.5%        8     6.1%  ***
15          27     3.6%        3     2.3%  **
16          18     2.4%        5     3.8%  *
17          18     2.4%        2     1.5%
18          14     1.9%        3     2.3%
19           8     1.1%        2     1.5%  *
20           6     0.8%        0     0.0%  *
21          11     1.5%        3     2.3%  *
22           6     0.8%        1     0.8%  *
23           3     0.4%        2     1.5%
24           5     0.7%        0     0.0%  *
25           2     0.3%        0     0.0%
26           4     0.5%        2     1.5%
27           2     0.3%        0     0.0%
28           1     0.1%        0     0.0%
29           1     0.1%        0     0.0%
30           1     0.1%        0     0.0%
36           1     0.1%        0     0.0%
TOTAL      745   100.0%      132   100.0%

Distinct Locations per Document

Locs   AllDCnt  AllDPrp  TstDCnt  TstDPrp  *=10%
0           19     2.6%        4     3.0%
1          410    55.0%       71    53.8%  ****
2          199    26.7%       33    25.0%  *
3           86    11.5%       19    14.4%  *
4           20     2.7%        3     2.3%
5            9     1.2%        2     1.5%
6            2     0.3%        0     0.0%
TOTAL      745   100.0%      132   100.0%
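
The three per-document tables above share one shape: for each document, count its distinct entity names, then tabulate how many documents fall at each count. A minimal sketch of that tabulation follows. The input format (a mapping from document ID to a list of entity names) and all names in the code are assumptions for illustration, not the format of the curated files.

```python
from collections import Counter


def distinct_count_distribution(doc_entities):
    """Tabulate documents by their number of distinct entity names.

    doc_entities: dict of document ID -> iterable of entity names
                  (hypothetical input format).
    Returns rows of (n_distinct, n_docs, proportion), sorted by n_distinct.
    """
    per_doc = Counter(len(set(names)) for names in doc_entities.values())
    total = sum(per_doc.values())
    return [(k, per_doc[k], per_doc[k] / total) for k in sorted(per_doc)]


# Toy data only; repeated names within a document count once.
docs = {
    "d1": ["nucleus"],
    "d2": ["nucleus", "membrane", "nucleus"],
    "d3": [],
    "d4": ["cytoplasm"],
}
rows = distinct_count_distribution(docs)
for n_distinct, n_docs, prop in rows:
    print(f"{n_distinct}\t{n_docs}\t{prop:.1%}")
```

Documents with no entity of the given type land in the 0 row, matching the first row of each table.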


Sentence Location with an Organism

  • For all test documents in PPLRE Curated Data v1.3
  • Sentences with [math]x[/math] entities are counted [math]x[/math] times.

Sentence  Mentions  Prop    Cum.    Visual
0              122  25.7%   25.7%   ***********
1               55  11.6%   37.3%   ****
2               52  10.9%   48.2%   ****
3               49  10.3%   58.5%   ***
4               45   9.5%   68.0%   ***
5               36   7.6%   75.6%   **
6               39   8.2%   83.8%   **
7               23   4.8%   88.6%   *
8               21   4.4%   93.1%   *
9               13   2.7%   95.8%
10               7   1.5%   97.3%   *
11               7   1.5%   98.7%   *
12               4   0.8%   99.6%
14               1   0.2%   99.8%
15               1   0.2%  100.0%
Total          475 100.0%

Sentence Location with a Protein

  • For all test documents in PPLRE Curated Data v1.3
  • Sentences with [math]x[/math] entities are counted [math]x[/math] times.

Sentence  Mentions  Prop    Cum.    Visual (*=10)
0              235  10.2%   10.2%   **********************
1              283  12.3%   22.5%   ***************************
2              318  13.8%   36.3%   *****************************
3              302  13.1%   49.4%   ****************************
4              303  13.1%   62.5%   ****************************
5              211   9.2%   71.7%   ********************
6              176   7.6%   79.3%   ****************
7              148   6.4%   85.7%   *************
8              113   4.9%   90.6%   **********
9               86   3.7%   94.4%   *******
10              52   2.3%   96.6%   ****
11              34   1.5%   98.1%   **
12              14   0.6%   98.7%
13              12   0.5%   99.2%
14               8   0.3%   99.6%   *
15              10   0.4%  100.0%   *
Total         2305 100.0%


Sentence Location with a Location

  • For all test documents in PPLRE Curated Data v1.3
  • Sentences with [math]x[/math] entities are counted [math]x[/math] times.

SENTENCE_ID  Count  *=10
0               77  ******
1               60  ****
2               50  ***
3               30  *
4               40  **
5               32  *
6               32  *
7               20
8               16
9                9  *
10               9  *
11               5  *
12               2
13               1
14               1
15               2
Total          386
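
The counting rule used in the three tables above (a sentence with x entities is counted x times) together with the proportion and cumulative columns can be sketched as follows. The input is assumed to be one sentence index per entity mention; the function name and toy data are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def mention_position_table(mention_positions):
    """Tabulate entity mentions by sentence index.

    mention_positions: iterable of sentence indices, one entry per entity
    mention, so a sentence with x entities contributes x entries.
    Returns rows of (sentence_index, count, proportion, cumulative_proportion).
    """
    counts = Counter(mention_positions)
    total = sum(counts.values())
    positions = sorted(counts)
    running = accumulate(counts[p] for p in positions)
    return [(p, counts[p], counts[p] / total, c / total)
            for p, c in zip(positions, running)]


# Toy data: three mentions in sentence 0, one in sentence 1, and so on.
rows = mention_position_table([0, 0, 0, 1, 2, 2, 5, 5])
for pos, count, prop, cum in rows:
    print(f"{pos}\t{count}\t{prop:.1%}\t{cum:.1%}")
```

Sentence positions with no mentions are omitted from the output, just as sentence 13 is absent from the Organism table.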


Sentence Location of Curated Record

The position of a sentence appears to be an important feature in predicting OPL() relations. An analysis of the v1.3 curated dataset suggests that a prediction based on one of the first few sentences is, on average, significantly more likely to be correct than a prediction based on a later sentence. One piece of evidence is that the distributions of the sentences referenced by the True Actual (+1) and False Actual (-1) records differ significantly. The first sentence, for example, accounts for more than one third (~37%) of the true actual records but only about one quarter (~27%) of the false actual records. The contrast becomes starker for the later sentences in the abstract. After the seventh sentence (sentence=6), for example, only about 6% of the true actuals remain, compared with about 16% of the false actuals (summing the rows for sentences 7 to 14 of the Records columns in the table below). All other things being equal, the odds of making a true positive prediction are more favourable from the top sentences than from the later sentences.

One confounding variable in this analysis is that many true actual records are repeated PSID/Sentence pairs, whereas there are no repeats in the false actual data. However, when only unique document/sentence pairs are analyzed, the difference in distributions holds. The difference on the first sentence even becomes a little stronger (26.58% vs 40.17%).

          Records                      Unique PSID/Sent
sentence  false actual  true actual    false actual  true actual
0         26.60%        36.90%         26.58%        40.17%
1         20.40%        18.30%         20.40%        18.83%
2          9.60%        12.30%          9.58%        10.88%
3          8.50%         5.10%          8.50%         5.44%
4          7.00%         6.30%          6.96%         5.86%
5          6.00%         8.10%          6.03%         6.28%
6          6.20%         7.20%          6.18%         5.86%
7          4.90%         0.60%          4.95%         0.84%
8          5.40%         1.20%          5.41%         1.67%
9          3.40%         1.20%          3.40%         1.26%
10         1.40%         0.60%          1.39%         0.84%
11         0.50%         0.30%          0.46%         0.42%
12         0.00%         0.90%          0.00%         0.84%
13         0.00%         0.90%          0.00%         0.84%
14         0.20%         0.00%          0.15%         0.00%

  • "Records" treats each row in the data as an independent record.
  • "Unique PSID/Sent" discards repeated records (the false actuals do not include repeats).
  • These numbers are from the train dataset, but the pattern also holds in the test dataset.
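
The share of records that fall after a given sentence can be recomputed directly from the "Records" columns of the table above. The sketch below copies those percentages by hand; the function name is hypothetical, and because the published values are rounded the sums are approximate.

```python
# Percentages from the "Records" columns above, indexed by sentence position 0..14.
false_actual = [26.6, 20.4, 9.6, 8.5, 7.0, 6.0, 6.2, 4.9,
                5.4, 3.4, 1.4, 0.5, 0.0, 0.0, 0.2]
true_actual = [36.9, 18.3, 12.3, 5.1, 6.3, 8.1, 7.2, 0.6,
               1.2, 1.2, 0.6, 0.3, 0.9, 0.9, 0.0]


def tail_mass(distribution, first_index):
    """Percentage of records at sentence positions >= first_index."""
    return sum(distribution[first_index:])


# Records located after the seventh sentence, i.e. at sentence index 7 or later.
true_tail = tail_mass(true_actual, 7)
false_tail = tail_mass(false_actual, 7)
print(f"true actuals after sentence 6:  {true_tail:.1f}%")
print(f"false actuals after sentence 6: {false_tail:.1f}%")
```

With the rounded table values this gives 5.7% for the true actuals and 15.8% for the false actuals.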

What could the reason be for this difference in distributions? Is it a fundamental feature of PubMed abstracts that the first few sentences contain the most reliable localization claims? Or is it a problem in the data and in how we ran the initial round of experiments? My guess is the former, because we noticed a similar effect in the DUC text summarization task: in news articles, the first few sentences were shown to be of greater importance to the summary than the others. While I have encountered this insight in the text summarization literature, I have not encountered it in the IE literature.


Number of relations per sentence

The number of relations per sentence is correlated with the position of the sentence in the abstract. A quick check suggested that the average number of relations per sentence is 1.39. The first sentence, however, has 1.27 relations on average, while the third sentence (sentence=2) has 1.68.

Sent  AvgRels
0     1.27
1     1.30
2     1.68
3     1.50
4     1.50
5     1.67
6     1.59

  • based on true actuals in the training set
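
One way to arrive at per-position averages like those above is to divide the number of true actual records at a sentence position by the number of distinct document/sentence pairs at that position. The sketch below follows that reading, which is an assumption: the source does not spell out the computation, and the record format and names are hypothetical.

```python
from collections import defaultdict


def avg_relations_by_position(records):
    """Average relations per sentence, grouped by sentence position.

    records: iterable of (doc_id, sentence_index) pairs, one per true relation
    (hypothetical format). Only sentences with at least one relation are
    counted, so every average is >= 1.
    """
    per_sentence = defaultdict(int)
    for doc_id, sent in records:
        per_sentence[(doc_id, sent)] += 1

    totals = defaultdict(int)    # relations at each sentence position
    n_sents = defaultdict(int)   # distinct (doc, sentence) pairs at each position
    for (_, sent), n in per_sentence.items():
        totals[sent] += n
        n_sents[sent] += 1
    return {s: totals[s] / n_sents[s] for s in sorted(totals)}


# Toy data: d1 has two relations in sentence 0, d2 has one, d1 has one in sentence 2.
toy = [("d1", 0), ("d1", 0), ("d2", 0), ("d1", 2)]
averages = avg_relations_by_position(toy)
print(averages)
```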

This analysis was originally inspired by the question of why looking only at the first few sentences improved performance. The slight strengthening of the first-sentence effect when looking only at unique records suggested that the first sentence contains fewer relations on average than some of the later sentences.


Multiple Sentences

Because many of the algorithms are sentence-level algorithms, it is important to know the proportion of relations that reside in a single sentence versus in multiple sentences. The larger the proportion of relations that span multiple sentences, the lower the expected performance. Overall, about three fifths (61.1%) of the relations reside in a single sentence. There is, however, a significant difference in this proportion between the train set and the test set: the test set has almost three quarters (73.8%) of its relations in a single sentence, versus 58.6% in the train set. This bias may affect the measured performance of algorithms that are trained on the train set and evaluated on the test set.

v1.3         Train              Test               Overall
             Records  Prop%     Records  Prop%     Records  Prop%
SingleSent   195       58.6%    48        73.8%    243       61.1%
MultiSent    138       41.4%    17        26.2%    155       38.9%
Total        333      100.0%    65       100.0%    398      100.0%
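
The train/test gap in the single-sentence proportion can be checked from the counts in the table above. The source does not say which test, if any, was used. The sketch below uses a pooled two-proportion z-test with a normal approximation, chosen here purely as an illustration; the function name is hypothetical.

```python
import math


def two_proportion_z(success_a, total_a, success_b, total_b):
    """Pooled two-proportion z-test (normal approximation, two-sided)."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return p_a, p_b, z, p_value


# Single-sentence relations: 195 of 333 in train, 48 of 65 in test.
p_train, p_test, z, p_value = two_proportion_z(195, 333, 48, 65)
print(f"train={p_train:.1%} test={p_test:.1%} z={z:.2f} p={p_value:.3f}")
```

Under these assumptions the statistic comes out at roughly z = 2.3, with a two-sided p-value of about 0.02. With only 65 test relations the approximation is coarse, so this is a rough check and not a definitive result.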