KDD 2009 Abstracts Analysis

From GM-RKB

A KDD-2009 Abstracts Analysis is an Analysis Report for a Research Paper Abstract Analysis Task applied to the Research Papers presented at KDD-2009.



Ngram-based Analysis

Two-Word Terms

| Papers | Term |
| 18 | data mining |
| 15 | experimental result |
| 13 | social network |
| 7 | search engine |
| 7 | recommender system |
| 7 | machine learning |
| 7 | learning algorithm |
| 6 | training data |
| 6 | learning method |
| 6 | labeled data |
| 5 | training dataset |
| 5 | synthetic dataset |
| 5 | data point |
| 5 | collaborative filtering |
| 4 | unlabeled data |
| 4 | training example |
| 4 | time series |
| 4 | optimization problem |
| 4 | learning problem |
| 4 | frequent itemset |
| 4 | empirical result |
| 4 | benchmark dataset |

One-Word Terms

| Papers | Term |
| 87 | Data |
| 69 | Paper |
| 56 | Algorithm |
| 51 | Result |
| 47 | Dataset |
| 46 | Problem |
| 43 | Model |
| 39 | Information |
| 36 | Application |
| 36 | Approach |
| 34 | Product |
| 32 | Network |
| 32 | Number |
| 28 | Framework |
| 28 | Feature |
| 26 | Analysis |
| 26 | Time |
| 24 | Experiment |
| 24 | Knowledge |

Code

This section contains the basic code used to identify the 1-grams, 2-grams, and 3-grams in each abstract.

<code>
#!/bin/bash

# Mirror the conference site
wget -r http://kdd09.crowdvine.com/

# The loop below reads the talk pages (files named 4??? and 5???) and the
# helper scripts from the parent directory, so it must run in a subdirectory.
cd .

for file in $(cd ..; /bin/ls 4??? 5???)
do
  fileLoc="../$file"

  # Determine talk type: reduce the track name to its leading letter
  track=$(grep ">Track:<" "$fileLoc" | awk '{print $2}' | sed "s/://")
  trackG=$(echo "$track" | perl -ne 'chomp; s/([SPWDT]).*/$1/g; print $_')

  # Extract the abstract: the line after the first one containing class="body
  start=$(grep -n 'class="body' "$fileLoc" | head -1 | cut -d: -f1)
  start=$((start + 1))
  abstract=$(sed -n "${start}p" "$fileLoc" | sed "s/<p>//" | sed "s/<\/p>//" | sed "s/  / /g")
  # Put spaces around punctuation so that it splits off as separate tokens
  abstractP=$(echo "$abstract" | perl -ne 's/([\.\,\:\;\"])/ $1 /g; print $_')
  mkdir -p "$trackG"
  echo "$abstractP" > "$trackG/$file"

  # One stemmed n-gram per line, duplicates removed within each abstract
  ../unigram.pl < "$trackG/$file" | ../stemmer.pl | sort -u > "${file}.unigram"
  ../bigram.pl < "$trackG/$file" | ../stemmer.pl | sort -u > "${file}.bigram"
  ../trigram.pl < "$trackG/$file" | ../stemmer.pl | sort -u > "${file}.trigram"

done
</code>
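
The "Papers" counts in the tables above can be derived from the per-abstract n-gram files: because of `sort -u`, each term appears at most once per file, so counting a term's occurrences across all files gives the number of papers that mention it. The following is a minimal sketch of that tally, not the script actually used for this page. The function name `paper_counts` and its minimum-count argument are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts).
# Usage: paper_counts MIN FILE...
# Counts in how many files each term occurs and prints "| count | term |"
# rows, most frequent first, for terms found in at least MIN files.
# Assumes each input file lists a term at most once (as after `sort -u`).
paper_counts() {
  local min="$1"
  shift
  cat "$@" |
    LC_ALL=C sort |
    uniq -c |
    LC_ALL=C sort -k1,1nr -k2 |
    awk -v min="$min" '$1 >= min {
      n = $1          # number of files containing the term
      $1 = ""         # drop the count, leaving the term itself
      sub(/^ /, "")
      print "| " n " | " $0 " |"
    }'
}
```

For example, `paper_counts 4 *.bigram` would list every two-word term that occurs in at least four abstracts.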

===bigram.pl===
Based on
<code>
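#!/usr/bin/perl
use strict;
use warnings;

my %count;
my $word2 = "";    # previous word; empty until the first word has been read
while (<>) {
  chomp;
  tr/A-Z/a-z/;     # lowercase the line
  foreach my $word1 (split) {
    # The very first word has no predecessor, so it starts no bigram.
    $count{"$word2 $word1"}++ if $word2 ne "";
    $word2 = $word1;
  }
}

# Print each distinct bigram once, most frequent first.
foreach my $bigram (sort numerically keys %count) {
  print "$bigram\n";
}

sub numerically {    # compare two bigrams by their counts
  $count{$b} <=> $count{$a};    # decreasing order
}
</code>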
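
The driver script also calls `unigram.pl` and `trigram.pl`, which are not listed here. The sliding-window idea in `bigram.pl` generalizes to any n, and the sketch below illustrates this under stated assumptions; it is not a reconstruction of those scripts. The function name `ngrams` is hypothetical. Unlike `bigram.pl`, it prints every occurrence rather than each distinct n-gram, so it relies on a later `sort -u` for deduplication.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts).
# Usage: ngrams N < text
# Prints every N-gram of the lowercased, whitespace-separated words on
# stdin, one per line. As in bigram.pl, the window runs across line breaks.
ngrams() {
  awk -v n="$1" '
    {
      for (i = 1; i <= NF; i++) {
        words[++total] = tolower($i)
        if (total >= n) {
          # Join the last n words seen so far into one n-gram.
          gram = words[total - n + 1]
          for (j = total - n + 2; j <= total; j++)
            gram = gram " " words[j]
          print gram
        }
      }
    }'
}
```

For example, `ngrams 3 < abstract.txt | sort -u` would list the distinct trigrams of one abstract, where `abstract.txt` stands for any of the extracted abstract files.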

Semantic Analysis of Technical Terms

  • coming in Dec 2009