KDD 2009 Abstracts Analysis
A KDD-2009 Abstracts Analysis is an Analysis Report for a Research Paper Abstract Analysis Task applied to the Research Papers presented at KDD-2009.
- Context:
- It makes use of a Technical Term Identification Algorithm.
- See: Technical Term.
Ngram-based Analysis
- This report is based on a Word-level Semantic Analysis Task (based on N-grams of Stemmed Words).
Two-Word Terms
| Papers | Term |
|---|---|
| 18 | data mining |
| 15 | experimental result |
| 13 | social network |
| 7 | search engine |
| 7 | recommender system |
| 7 | machine learning |
| 7 | learning algorithm |
| 6 | training data |
| 6 | learning method |
| 6 | labeled data |
| 5 | training dataset |
| 5 | synthetic dataset |
| 5 | data point |
| 5 | collaborative filtering |
| 4 | unlabeled data |
| 4 | training example |
| 4 | time series |
| 4 | optimization problem |
| 4 | learning problem |
| 4 | frequent itemset |
| 4 | empirical result |
| 4 | benchmark dataset |
One-Word Terms
| Papers | Term |
|---|---|
| 87 | Data |
| 69 | Paper |
| 56 | Algorithm |
| 51 | Result |
| 47 | Dataset |
| 46 | Problem |
| 43 | Model |
| 39 | Information |
| 36 | Application |
| 36 | Approach |
| 34 | Product |
| 32 | Network |
| 32 | Number |
| 28 | Framework |
| 28 | Feature |
| 26 | Analysis |
| 26 | Time |
| 24 | Experiment |
| 24 | Knowledge |
Code
This section contains the basic code used to extract the 1-grams, 2-grams, and 3-grams from each abstract.
<code>
#!/bin/bash
# Mirror the conference site
wget -r http://kdd09.crowdvine.com/
cd .
# Process every downloaded page in the parent directory named 4??? or 5???
for file in $(cd .. && /bin/ls 4??? 5???)
do
  fileLoc="../$file"
  # Determine talk type: reduce the track name to its initial letter
  track=$(grep ">Track:<" "$fileLoc" | awk '{print $2}' | sed "s/://")
  trackG=$(echo "$track" | perl -ne 'chomp; s/([SPWDT]).*/$1/g; print $_')
  # Extract the abstract: the line after the first class="body match
  start=$(grep -n 'class="body' "$fileLoc" | head -1 | cut -d: -f1)
  start=$((start + 1))
  abstract=$(sed -n "${start}p" "$fileLoc" | sed 's/<p>//; s/<\/p>//; s/&nbsp;/ /g')
  # Pad punctuation with spaces so that each mark becomes its own token
  abstractP=$(echo "$abstract" | perl -ne 's/([\.\,\:\;\"])/ $1 /g; print $_')
  mkdir -p "$trackG"
  echo "$abstractP" > "$trackG/$file"
  # Write one sorted, de-duplicated list of stemmed n-grams per abstract
  ../unigram.pl < "$trackG/$file" | ../stemmer.pl | sort -u > "${file}.unigram"
  ../bigram.pl < "$trackG/$file" | ../stemmer.pl | sort -u > "${file}.bigram"
  ../trigram.pl < "$trackG/$file" | ../stemmer.pl | sort -u > "${file}.trigram"
done
</code>
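The script above writes one de-duplicated term list per abstract, but the step that turns those lists into the Papers counts in the tables is not shown. A minimal sketch of one way to do it follows; `paper_counts` is a hypothetical helper, not part of the original scripts, and it assumes every input file lists each term at most once (as `sort -u` guarantees).

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not one of the original scripts):
# for each term, count how many per-abstract files in DIR contain it.
# Usage: paper_counts DIR EXT     e.g. paper_counts . bigram
paper_counts() {
  local dir="$1" ext="$2"
  # Each file lists a term at most once, so the number of occurrences
  # across all files equals the number of papers that use the term.
  cat "$dir"/*."$ext" 2>/dev/null \
    | LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -k1,1nr -k2 \
    | awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n "\t" $0 }'
}
```

Running `paper_counts . bigram | head -22` in the directory holding the `.bigram` files would print tab-separated count and term pairs, most widespread term first.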
===bigram.pl===
Base on
<code>
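#!/bin/perl
# Count the word bigrams on standard input and print them, most frequent first.
use strict;
use warnings;

my %count;
my $word2 = "";                      # previous word; empty before the first word
while (<>) {
    chomp;
    tr/A-Z/a-z/;                     # lowercase the line
    foreach my $word1 (split) {
        # The very first word has no predecessor, so it starts no bigram
        $count{"$word2 $word1"}++ if $word2 ne "";
        $word2 = $word1;
    }
}
foreach my $bigram (sort numerically keys %count) {
    print "$bigram\n";
}
sub numerically {                    # compare two bigrams by their counts
    $count{$b} <=> $count{$a};       # decreasing order
}
</code>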
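The `unigram.pl` and `trigram.pl` scripts called by the driver are not listed here. As a rough sketch of how the same sliding-window idea extends to any n, the function below emits every n-gram of lowercased words. The name `ngrams` and the choice of awk are assumptions made for illustration; this is not the original Perl code.

```shell
#!/bin/bash
# Hypothetical generalisation of bigram.pl (not the original unigram.pl or
# trigram.pl): print every n-gram of lowercased, whitespace-separated words
# read from standard input, one n-gram per line, in order of occurrence.
# Usage: ngrams N < abstract.txt
ngrams() {
  awk -v n="$1" '
    {
      for (i = 1; i <= NF; i++) {
        window[++seen] = tolower($i)      # words are numbered across lines
        if (seen >= n) {
          gram = window[seen - n + 1]
          for (j = seen - n + 2; j <= seen; j++)
            gram = gram " " window[j]
          print gram
          delete window[seen - n + 1]     # the oldest word is no longer needed
        }
      }
    }'
}
```

Unlike `bigram.pl`, this sketch does not order its output by frequency. Inside the driver that makes no difference, because each list is passed through `sort -u` afterwards.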
Semantic Analysis of Technical Terms
- coming in Dec 2009