PPLRE Evaluation - Nearest Neighbor v1.0


This page describes the performance results of the PPLRE Nearest Neighbor Algorithm v1.0 on the PPLRE Evaluation Task, against the PPLRE Curated Data v1.3 and the PPLRE Tagged Corpus v2.3, as reported by the PPLRE Automated Evaluation System.


Overview

The evaluation is currently underway. Some preliminary results appear below.

  • Token-based Distance: The best results so far come from a simple word/token-based distance between concepts.
  • Confidence Score: The distance-based confidence score is effective at ranking on precision. It can be used to optimize the F-Score and the Top X measures.
  • Recall: The algorithm can achieve a Recall that is almost as high as that of the PPLRE Cooccurrence algorithm, and when Recall is optimized its Precision is significantly higher than cooccurrence's.

Next Steps

  • Try a k-Nearest Neighbor implementation. The motivation is that this may improve the recognition of the OP() relation because it is a One-to-Many Relation.
  • Try passages of two sentences. The motivation is that this has not been possible with the other algorithms.
  • Try

Detail

Current Optimal

  • adjust the distance metric to more accurately reflect distance (endTokenId minus startTokenId; sketched after this list)
  • perfEvalRun_0408172026_30910
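A minimal sketch of that metric, assuming the variable naming used in the Configuration section near the bottom of this page:

    # Token-based distance as a simple difference of token ids
    # (endTokenId minus startTokenId, per the list item above).
    $tokDist_ = $endTokenId_ - $startTokenId_ ;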

TP/FP/TN/FN: 37/154/84/28, Pre/Rec/F/Acc/P20/MaxF: 0.194/0.569/0.289/0.399/0.450/0.326

Single-sentence based classifier

A simple way to apply the Nearest Neighbor Algorithm to Relation Recognition is to pick the nearest neighbors among the pairs of sought entity types. In this way an entity of type A at the beginning of a sentence will not be associated with an entity of type B at the other end of the sentence if another pair of type A and type B entities exists between them.
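A minimal sketch of this pairing rule, assuming token positions for each mention are available (the data layout and names below are hypothetical):

    use strict;
    use warnings;

    # Hypothetical mention lists for one sentence: each entry holds the
    # mention text and its token position within the sentence.
    my @typeA = ( { text => 'Protein K', pos => 1 } );
    my @typeB = ( { text => 'E. coli',   pos => 14 } );

    # Pair each type-A mention with its nearest type-B mention by
    # token-based distance.
    for my $a (@typeA) {
        my ( $best, $bestDist );
        for my $b (@typeB) {
            my $dist = abs( $b->{pos} - $a->{pos} );
            ( $best, $bestDist ) = ( $b, $dist )
                if !defined $bestDist || $dist < $bestDist;
        }
        printf "pair: %s <-> %s (distance %d tokens)\n",
            $a->{text}, $best->{text}, $bestDist if $best;
    }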

Note the significant difference in P20:

TP/FP/TN/FN: 39/162/83/26, Pre/Rec/F/Acc/P20/MaxF: 0.194/0.600/0.293/0.394/0.250/0.320


  • with boosting

TP/FP/TN/FN: 37/154/84/28, Pre/Rec/F/Acc/P20/MaxF: 0.194/0.569/0.289/0.399/0.400/0.326


  • with sentence to start-of-next-sentence edges

TP/FP/TN/FN: 36/149/84/29, Pre/Rec/F/Acc/P20/MaxF: 0.195/0.554/0.288/0.403/0.450/0.320


  • simple token distance (penalized sentence links)

TP/FP/TN/FN: 38/153/85/27, Pre/Rec/F/Acc/P20/MaxF: 0.199/0.585/0.297/0.406/0.450/0.323


  • everything equally distant after 10 tokens

TP/FP/TN/FN: 35/122/89/30, Pre/Rec/F/Acc/P20/MaxF: 0.223/0.538/0.315/0.449/0.065/0.318


  • firstXsentences

1 sentence: TP/FP/TN/FN: 18/54/124/47, Pre/Rec/F/Acc/P20/MaxF: 0.250/0.277/0.263/0.584/0.450/0.269
3 sentences: TP/FP/TN/FN: 25/105/105/40, Pre/Rec/F/Acc/P20/MaxF: 0.192/0.385/0.256/0.473/0.350/0.273
6 sentences: TP/FP/TN/FN: 29/134/95/36, Pre/Rec/F/Acc/P20/MaxF: 0.178/0.446/0.254/0.422/0.350/0.284
9 sentences: TP/FP/TN/FN: 35/153/85/30, Pre/Rec/F/Acc/P20/MaxF: 0.186/0.538/0.277/0.396/0.450/0.308


  • Cooccur

TP/FP/TN/FN: 40/243/80/25, Pre/Rec/F/Acc/P20/MaxF: 0.141/0.615/0.230/0.309/0.141/0.230


Multi-sentence (Discourse) based classifier

This approach updates the Text Graph to join Sentences that share a Concept Entity from one of the sought Concept Types (ORG/PROT/LOC).
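A minimal sketch of the graph update, under the assumption that each sentence carries a list of typed mentions (the data layout below is hypothetical):

    use strict;
    use warnings;

    # Hypothetical input: sentences with their typed entity mentions.
    my @sentences = (
        { id => 0, mentions => [ [ 'PROT', 'protein K' ], [ 'LOC', 'outer membrane' ] ] },
        { id => 3, mentions => [ [ 'PROT', 'protein K' ], [ 'ORG', 'E. coli' ] ] },
    );

    # Join sentences that share a mention of a sought type (ORG/PROT/LOC)
    # by adding an edge between them in the Text Graph.
    my ( %seenIn, @edges );
    for my $s (@sentences) {
        for my $m ( @{ $s->{mentions} } ) {
            my ( $type, $text ) = @$m;
            next unless $type =~ /^(?:ORG|PROT|LOC)$/;
            push @edges, [ $_, $s->{id} ] for @{ $seenIn{"$type/$text"} || [] };
            push @{ $seenIn{"$type/$text"} }, $s->{id};
        }
    }
    print "edge: sentence $_->[0] <-> sentence $_->[1]\n" for @edges;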

Motivating Example

The following sentences illustrate the possible benefit. In the first sentence the protein is the first word and the organism is the last word: "[PROTEIN Protein K] is an [LOCATION outer membrane] protein found in pathogenic encapsulated strains of [ORGANISM Escherichia coli]." (ref: PPLRE Corpus 610.a.0) Later in the abstract, however, another sentence places these two concepts significantly closer: "These data suggest that [PROTEIN protein K] is a functional porin in [ORGANISM E. coli]."

Issues

The multi-sentence approach can, however, also increase the number of False Negatives relative to the sentence-level approach. For example, in the following sentence one of the relations is not predicted by the multiple-sentence based approach: "We confirm and extend previous studies by demonstrating that [PROTEIN Lst] is located in the [LOCATION outer membrane] and is surface exposed in both [ORGANISMa N. gonorrhoeae] and [ORGANISMb N. meningitidis]." (ref: PPLRE Corpus 10230.a.3). The multi-sentence approach does not make a prediction for the organism at the end of the sentence because, in another sentence, that organism is associated with a nearer protein.
This problem may be resolved by introducing multi-class predictions, which will require k-nearest neighbor support.

Confidence Score

Two features are used to associate a Confidence Score

A candidate confidence score is to

Distribution of confidence scores (bin, count, percentage of total):

    0.000000	9	1.571%
    0.025000	64	11.169%
    0.050000	86	15.009%
    0.075000	66	11.518%
    0.100000	44	7.679%
    0.125000	67	11.693%
    0.150000	37	6.457%
    0.200000	48	8.377%
    0.250000	31	5.410%
    0.325000	44	7.679%
    0.500000	39	6.806%
    1.000000	38	6.632%

Broken down by relation (All / PO / PL), as count and percentage per bin:

bin	All n	All %	PO n	PO %	PL n	PL %
-	38	3.1%	18	2.8%	20	3.5%
0.025	180	14.7%	88	13.6%	91	15.9%
0.050	197	16.1%	96	14.8%	101	17.6%
0.075	112	9.2%	58	9.0%	54	9.4%
0.100	120	9.8%	67	10.3%	53	9.3%
0.125	86	7.0%	47	7.3%	39	6.8%
0.150	87	7.1%	43	6.6%	44	7.7%
0.200	84	6.9%	52	8.0%	32	5.6%
0.250	95	7.8%	59	9.1%	36	6.3%
0.325	57	4.7%	31	4.8%	26	4.5%
0.400	37	3.0%	19	2.9%	18	3.1%
0.500	90	7.4%	69	10.6%	21	3.7%
0.800	3	0.2%	0	0.0%	3	0.5%
1.000	36	2.9%	1	0.2%	35	6.1%

TOTAL	1222	100.0%	648	100.0%	573	100.0%

Code (bins the last whitespace-separated field of each line into 0.025-wide buckets and prints bin, count, and percentage): rev PL_P* | awk '{print $1}' | rev | perl -ne 'chomp; $f{int($_*40)}++; $n++; END { printf "%f\t%d\t%1.3f%%\n", $_/40, $f{$_}, 100*$f{$_}/$n for sort { $a <=> $b } keys %f }'

Performance

This single-sentence based approach achieves an F-Score of 0.349.

    • runDir_0405105214_30329
    • op_t=0.06, pl_t=0.08, TP/FP/TN/FN: 14/18/132/51, Pre/Rec/F/Acc: 0.438/0.215/0.289/0.679
    • Find an example of the benefit of the discourse-level approach.

Recall optimization

Unexpectedly, this approach achieves a higher recall rate than the multi-sentence approach. It identifies two additional true positives (40 vs. 38).

TP/FP/TN/FN: 40/161/83/25, Pre/Rec/F: 0.199/0.615/0.301


Distance tempered confidence

The distance between entities in a relation is strongly correlated with the confidence in the prediction (4 words for PL, 7 words for OP). A sketch of a distance-tempered score appears below.
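Consistent with the Configuration shown near the bottom of this page ($confidence_ /= $distance_), a minimal sketch of such a score (the base confidence of 1.0 is an assumption):

    # Hypothetical sketch: temper a base confidence of 1.0 by the
    # token-based distance, so nearer pairs receive higher confidence.
    sub tempered_confidence {
        my ($tokDist_) = @_;
        return 0 unless $tokDist_ > 0;
        return 1.0 / $tokDist_;   # e.g. 4 tokens -> 0.25, 7 tokens -> ~0.143
    }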


Joint Confidence Score

To support the PPLRE P20 score a single confidence score is required, based on the two confidence scores for the two relations. A simple test was performed to determine whether the summation or the product of the two confidences achieved better performance on the PPLRE P20 test.

  • Summation: 29.2%
  • Multiplication: 36.0%

As a result, the joint confidence score will be based on the multiplication of the two scores, as sketched below.
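A minimal sketch of the chosen rule (subroutine name hypothetical):

    # Joint confidence as the product of the two per-relation confidence
    # scores, the better-performing of the two options tested above.
    sub joint_confidence {
        my ( $opConf_, $plConf_ ) = @_;
        return $opConf_ * $plConf_;
    }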


Do not create edges for neighboring sentences

Performance degrades slightly when an edge is created between the last word in a sentence and the first word in the following sentence.

  • F-Measure: Maximum drops from 0.286 to 0.258
  • Recall: Maximum drops from 0.554 to 0.523
  • Precision: Maximum drops from 0.556 to 0.526

Note that this analysis involved removing the edge rather than adding one. The reason is that the data structure used by the current token-based distance function has the last token of one sentence immediately before the first token of the following sentence. To work around this, the distance was stretched significantly whenever a sentence break separated the tokens ($distance_ = $tokDist_ + 999*$sentDist_).
This analysis should be redone once we move to multiple-sentence based relations.
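The workaround above, written out as a small helper for clarity (subroutine name hypothetical; the formula is the one quoted in the note):

    # Stretch the distance heavily whenever a sentence break separates the
    # two tokens, so cross-sentence neighbors are strongly penalized.
    sub penalized_distance {
        my ( $tokDist_, $sentDist_ ) = @_;
        return $tokDist_ + 999 * $sentDist_;
    }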


Max distance soft-threshold

An internal threshold was introduced that forced every nearest-neighbor candidate beyond a certain distance to be treated as equally distant to all other candidates beyond that line. The intuition is that beyond that line the direct semantic connection has lost its significance. This threshold did not improve performance. A sketch of the capping rule follows the list below.

  • If the threshold was set to ~9 tokens then there was no effect on the best performance. This suggests that all of the nearest neighbors for the best classifier were no further than 9 tokens apart.
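A minimal sketch of the cap (subroutine name and the default of 10 tokens are illustrative, following the ~9-10 token observation above):

    # Treat every candidate beyond the soft threshold as equally distant:
    # ordering among nearby candidates is preserved, distant ones all tie.
    sub capped_distance {
        my ( $distance_, $maxDist_ ) = @_;
        $maxDist_ = 10 unless defined $maxDist_;   # illustrative default
        return $distance_ > $maxDist_ ? $maxDist_ : $distance_;
    }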

Performance on the v1.3 TRAIN dataset

Because the NN approach is not trained, the dataset that is set aside for training can be used to further test the algorithm's performance. Aside from the benefit of an additional empirical data point, another benefit is a more stable result, because this dataset is larger.

Unexpectedly, the performance drops significantly on this dataset. The performance is cut approximately in half relative to the test set (~30% precision and ~14% F-Score; runDir_0405220635_6570).

op_t=0.12, pl_t=0.22, TP/FP/TN/FN: 31/74/594/302, Pre/Rec/F/Acc: 0.295/0.093/0.142/0.624

Given that this is the first experiment on this dataset, this score may be the result of a bug... need to find out.

However, evidence that there is no significant bug is that:

  • the optimal values attained by the two minimum confidence thresholds are identical to the ones identified on the test set (currently op=0.12 and pl=0.21)
  • the proportion of True Actuals and False Actuals is the same in both the train and the test set: 69% false actuals and 31% true actuals.

perfEvalRun_0407222352_10407

This is the performance of the optimal run to date.

  • Given the requirement of F-Score >= 25%, the highest Precision achieved is 52.2%.
  • F-Measure 0.323
  • This precision is achieved when the OP() threshold is set to 0.6 and the PL() threshold is set to 0.1.


  • Configuration
    • $distance_ = $tokDist_ ;            # token-based distance between the two entities
    • $distance_ += 999*$sentDist_ ;      # heavy penalty for each intervening sentence break
    • $confidence_ /= $distance_ ;        # confidence is inversely proportional to distance

op_t	pl_t	TP	FP	TN	FN	P	R	F	A
0.00	0.00	36	151	86	29	0.193	0.554	0.286	0.404
0.10	0.15	18	25	126	47	0.419	0.277	0.333	0.667
0.13	0.22	10	9	140	55	0.526	0.154	0.238	0.701
0.21	0.40	4	1	144	61	0.800	0.062	0.114	0.705