Gabor Melli's v2nodeClassify.Predict.150405

From GM-RKB

Gabor_Melli's_v2nodeClassify.Predict.150405 is a supervised product taxonomy node prediction system.



References

2015

v2nodeClassify.Predict.150405

PREDICT TAXO-NODE OF UNLABELED DATA

This notebook implements a supervised product taxonomy classification system based on a one-vs-rest supervised classification algorithm.
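The training step itself is not shown in this notebook; as a minimal illustration of the one-vs-rest approach on taxonomy paths (illustrative data and labels, not the notebook's; `LinearSVC` fits one binary classifier per class):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# tiny illustrative corpus: taxonomy paths with their target taxo-node labels
train_paths = [
    "Automotive > Replacement Parts > Switches",
    "Books > Subjects > Mystery Thriller",
    "Electronics > Computers Accessories",
]
train_labels = ["AU>PA", "BK>BK", "CE>PC"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_paths)

clf = LinearSVC()   # fits one binary classifier per class (one-vs-rest)
clf.fit(X, train_labels)

pred = clf.predict(vectorizer.transform(["Automotive > Switches"]))
```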

To Do

  • Add text preprocessing
  • Make featurization a subroutine
  • Include a feature for the type of record (taxonomy, or document)
  • Include dictionary-based features
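The "make featurization a subroutine" item could look like the following sketch (a hypothetical `featurize` helper mirroring the surface features computed inline in the cells below):

```python
import re
import pandas as pd

def featurize(series):
    """Derive simple surface features from a Series of taxonomy-path strings."""
    feats = pd.DataFrame(index=series.index)
    feats['strLen'] = series.str.len()                # character length
    feats['tokensCount'] = series.apply(              # same split pattern as In[27]
        lambda s: len(re.split(r'[>| |&]+', s)))
    return feats

df = featurize(pd.Series(["Automotive > Parts", "Books > Mystery"]))
```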
In [1]:
# LIBRARIES
debug = 1
 
import pandas as pd
if debug: print "pandas version: " + pd.__version__   # pandas version: 0.15.2
from pandas import DataFrame, Series

import numpy as np
if debug: print "numpy version: " + np.__version__    # numpy version: 1.9.2
from numpy  import random # random

from re     import split

from sklearn import preprocessing, svm, cross_validation  # labelEncoder
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
import sklearn.feature_extraction.text 
from sklearn.externals import joblib
if debug: print "sklearn version: " + sklearn.__version__    # sklearn version: 0.15.2

import gc
pandas version: 0.15.2
numpy version: 1.9.2
sklearn version: 0.15.2
In [2]:
# GLOBALS

from datetime import datetime
from time import time
dstamp=datetime.now().strftime("%y%m%d")
if debug: print "dstamp=" + dstamp # dstamp=141113
dstamp="150405"

tstamp=datetime.now().strftime("%y%m%d%H%M%S")
if debug: print "tstamp=" + tstamp # tstamp=141113032844

dataDir = "../data/"
modelsDir = "models/"

dictionaryFile="PC_BN_PF_terms.141107b.tsv"

corpusBasedUnigramVectorizerFilename = "corpusBasedUnigramVectorizer." + dstamp + ".sklearn"
corpusBasedUnigramVectorizerFile = modelsDir + corpusBasedUnigramVectorizerFilename

dictPopularPairsFilename = "dictPopularPairs." + dstamp + ".tsv"
dictPopularPairsFile = modelsDir + dictPopularPairsFilename

dictBasedUnigramVectorizerListFilename = "dictBasedUnigramVectorizerList." + dstamp + ".sklearn"
dictBasedUnigramVectorizerListFile = modelsDir + dictBasedUnigramVectorizerListFilename

modelFilename="svm_clfr_all." + dstamp + ".sklearn"
modelFile=modelsDir + modelFilename

colNameRecordType        = 'recordType'
colNameRecordTextContent = 'recordTextContent'
colNameRecordTextContentOrig = 'recordTextContentOrig'
colNameRecordLabel       = 'recordLabel'
colNameRecordSource      = 'recordSource'
colNameTermType          = 'type'
colNameTermCategory                 = 'category'
colNameRecordTextContentTokens      = 'tokens'
colNameRecordTextContentTokensCount = 'tokensCount'
dstamp=150522
tstamp=150522174440
In [19]:
fileToProcess = 28

df_unlbldTaxoDataInfo = pd.DataFrame([
   { 'fileId':0, 'colNameRecordSource': 'TST',        'colNameRecordType': 'taxoPath',  'dataFilename': 'test2.txt'},
   { 'fileId':1, 'colNameRecordSource': 'SHO',        'colNameRecordType': 'taxoPath',  'dataFilename': 'offerTaxo_unlbld_SHO.140515.tsv'},
   { 'fileId':2, 'colNameRecordSource': 'TRG',        'colNameRecordType': 'taxoPath',  'dataFilename': 'offerTaxo_unlabld_IMR-Target.141001.txt'},
   { 'fileId':3, 'colNameRecordSource': 'CJ',         'colNameRecordType': 'taxoPath',  'dataFilename': 'offerTaxo_unlabld_CJ.141001b.txt'},
   { 'fileId':4, 'colNameRecordSource': 'CPROD_sml',  'colNameRecordType': 'passage',   'dataFilename': 'CPROD1_unlbld_0-40.141030.tsv'},
   { 'fileId':5, 'colNameRecordSource': 'CPROD_med',  'colNameRecordType': 'passage',   'dataFilename': 'CPROD1_unlbld_170-230.141030.tsv'},
   { 'fileId':6, 'colNameRecordSource': 'dictTerms',  'colNameRecordType': 'term',      'dataFilename': 'PC_BN_PF_terms.141107.tsv', 'dataColName':'term'},
   { 'fileId':7, 'colNameRecordSource': 'dictTerms2', 'colNameRecordType': 'term',      'dataFilename': 'PC_BN_PF_BPC_terms.141208b.tsv', 'dataColName':'term'},
   { 'fileId':8, 'colNameRecordSource': 'terms2',     'colNameRecordType': 'term',      'dataFilename': 'toLabel.terms.150111b.tsv', 'dataColName':'term'},
   { 'fileId':9, 'colNameRecordSource': 'all',        'colNameRecordType': 'mixed',     'dataFilename': 'pcTaxo_labeled.150107b.tsv', 'dataColName':'recordTextContent'},
   { 'fileId':10,'colNameRecordSource': 'EUS',        'colNameRecordType': 'taxoPath',  'dataFilename': 'pcTaxo_unlabld_EUS.150306.tsv'},
   { 'fileId':11,'colNameRecordSource': 'EUK',        'colNameRecordType': 'taxoPath',  'dataFilename': 'pcTaxo_unlabld_EUK.150306.tsv'},
   { 'fileId':12,'colNameRecordSource': 'EUSnotDelta','colNameRecordType': 'taxoPath',  'dataFilename': 'pcTaxo_unlabld_EUSnonDelta.150306.tsv'},
   { 'fileId':13,'colNameRecordSource': 'become',         'colNameRecordType': 'taxoPath',  'dataFilename': 'become_category_mappings.tsv', 'dataColName':'Feed Category'},
   { 'fileId':14,'colNameRecordSource': 'commission_junction_1','colNameRecordType': 'taxoPath',  'dataFilename': 'commission_junction_category_mappings.ascii.b.1of2.tsv', 'dataColName':'Feed Category'},
   { 'fileId':34,'colNameRecordSource': 'commission_junction_2','colNameRecordType': 'taxoPath',  'dataFilename': 'commission_junction_category_mappings.ascii.b.2of2.tsv', 'dataColName':'Feed Category'},
   { 'fileId':15,'colNameRecordSource': 'ebay',           'colNameRecordType': 'taxoPath',  'dataFilename': 'ebay_category_mapping.140413.tsv', 'dataColName':'Feed Category'},
   { 'fileId':16,'colNameRecordSource': 'ebay_nondelta',  'colNameRecordType': 'taxoPath',  'dataFilename': 'ebay_nondelta_category_mapping.tsv', 'dataColName':'Feed Category'},
   { 'fileId':17,'colNameRecordSource': 'ebay_uk',        'colNameRecordType': 'taxoPath',  'dataFilename': 'ebay_uk_category_mapping.tsv', 'dataColName':'Feed Category'},
   { 'fileId':18,'colNameRecordSource': 'impact_radius',  'colNameRecordType': 'taxoPath',  'dataFilename': 'impact_radius_category_mappings.tsv', 'dataColName':'Feed Category'},
   { 'fileId':19,'colNameRecordSource': 'link_share',     'colNameRecordType': 'taxoPath',  'dataFilename': 'link_share_category_mappings.ascii.tsv', 'dataColName':'Feed Category'},
   { 'fileId':20,'colNameRecordSource': 'pricegrabber',   'colNameRecordType': 'taxoPath',  'dataFilename': 'pricegrabber_category_mappings.tsv', 'dataColName':'Feed Category'},
   { 'fileId':21,'colNameRecordSource': 'pricegrabber_uk','colNameRecordType': 'taxoPath',  'dataFilename': 'pricegrabber_uk_category_mappings.tsv', 'dataColName':'Feed Category'},
   { 'fileId':22,'colNameRecordSource': 'shopping',       'colNameRecordType': 'taxoPath',  'dataFilename': 'shopping_category_mappings.tsv', 'dataColName':'Feed Category'},
   { 'fileId':23,'colNameRecordSource': 'shopzilla',      'colNameRecordType': 'taxoPath',  'dataFilename': 'shopzilla_category_mappings.tsv', 'dataColName':'Feed Category'},
   { 'fileId':24,'colNameRecordSource': 'walmart',        'colNameRecordType': 'taxoPath',  'dataFilename': 'walmart_category_mappings.tsv', 'dataColName':'Feed Category'},
   { 'fileId':25,'colNameRecordSource': 'viglink',        'colNameRecordType': 'taxoPath',  'dataFilename': 'viglink_category_mappings.tsv', 'dataColName':'Feed Category'},
   { 'fileId':27,'colNameRecordSource': 'shopzilla_uk',   'colNameRecordType': 'taxoPath',  'dataFilename': 'shopzilla_uk_category_mappings.tsv', 'dataColName':'Feed Category'},
   { 'fileId':28,'colNameRecordSource': 'amazon',         'colNameRecordType': 'taxoPath',  'dataFilename': 'amazon_category_mappings.tsv', 'dataColName':'Feed Category'},
   { 'fileId':29,'colNameRecordSource': 'amazon_uk',      'colNameRecordType': 'taxoPath',  'dataFilename': 'amazon_uk_category_mappings.tsv', 'dataColName':'Feed Category'},
    ])
#],index)

#if debug: print df_unlbldTaxoDataInfo.loc[fileToProcess]
df_unlbldTaxoDataInfo[(df_unlbldTaxoDataInfo.fileId==fileToProcess)]
Out[19]:
colNameRecordSource colNameRecordType dataColName dataFilename fileId
28 amazon taxoPath Feed Category amazon_category_mappings.tsv 28

Read in the unlabeled data

In [20]:
# load a cherry-picked file

df = df_unlbldTaxoDataInfo[(df_unlbldTaxoDataInfo.fileId==fileToProcess)]
dataFilename = df.iloc[0]['dataFilename']
dataCode     = df.iloc[0]['colNameRecordSource']
dataColName  = df.iloc[0]['dataColName']

df_unlabeledData = DataFrame(pd.read_csv(dataDir + dataFilename, delimiter='\t', quoting=3, skipinitialspace=True))

#df_unlabeledData.rename(columns={'term': colNameRecordTextContent}, inplace=True)
df_unlabeledData.rename(columns={'taxoPath': colNameRecordTextContent}, inplace=True)

#df_unlabeledData = df_unlabeledData.loc[random.choice(df_unlabeledData.index, 10, replace=False)] # random sample
#df_unlabeledData.index = range(0, len(df_unlabeledData))

if debug:
    print "dataFilename =", dataFilename
    print "dataColName =", dataColName
    print "dataCode =", dataCode
    print "record count:", len(df_unlabeledData) # 16893
    print "\nsample:\n", df_unlabeledData.loc[random.choice(df_unlabeledData.index, 10, replace=False)] # random sample
dataFilename = amazon_category_mappings.tsv
dataColName = Feed Category
dataCode = amazon
record count: 26012

sample:
                                           Feed Category Viglink Category
4034   Automotive/Categories/Replacement Parts/Switch...               AU
4304   Automotive/Categories/Tires & Wheels/Tires/Tra...               AU
12651  Departments>Boys>Clothing>Suits & Sport Coats>...               FS
5503       Baby>Baby Boys>Clothing>Jackets & Coats>Vests               FS
4381   Automotive/Categories/Tools & Equipment/Garage...               AU
13554  Electronics/Categories/Computers & Accessories...            AU,CE
9966   Categories>Outdoor Recreation>Camping & Hiking...   FS,AU,HB,OT,HG
2220   Automotive/Categories/Interior Accessories/Sea...         AU,CE,SF
14660  Grocery & Gourmet Food>Categories>Snack Foods>...         HB,OT,HG
416                                   >Products>Bargains               FS

Clean the data

In [21]:
df_unlabeledData.rename(columns={dataColName:colNameRecordTextContent}, inplace=True)

# clean up the data
df_unlabeledData[colNameRecordTextContentOrig] = df_unlabeledData[colNameRecordTextContent] # keep the orig
df_unlabeledData[colNameRecordTextContent] = df_unlabeledData[colNameRecordTextContent].fillna('value is missing on this record') # fill missing data
#if debug: df_unlabeledData[colNameRecordTextContent].isnull()
df_unlabeledData[colNameRecordTextContent] = df_unlabeledData[colNameRecordTextContent].str.replace("/"," ")
df_unlabeledData[colNameRecordTextContent] = df_unlabeledData[colNameRecordTextContent].str.replace(":"," ")
df_unlabeledData[colNameRecordTextContent] = df_unlabeledData[colNameRecordTextContent].str.replace(","," ")
df_unlabeledData[colNameRecordTextContent] = df_unlabeledData[colNameRecordTextContent].str.replace(";"," ")
df_unlabeledData[colNameRecordTextContent] = df_unlabeledData[colNameRecordTextContent].str.replace("~"," ")
df_unlabeledData[colNameRecordTextContent] = df_unlabeledData[colNameRecordTextContent].str.replace("&"," ")
df_unlabeledData[colNameRecordTextContent] = df_unlabeledData[colNameRecordTextContent].str.replace(">"," > ")
df_unlabeledData[colNameRecordTextContent] = df_unlabeledData[colNameRecordTextContent].str.replace("  "," ")

if debug:
    print "\nsample:\n", df_unlabeledData.loc[random.choice(df_unlabeledData.index, 10, replace=False)] # random sample
sample:
                                       recordTextContent Viglink Category  \
10903  Categories > Replacement Parts > Steering Syst...            AU,HB 
19287  Patio Lawn  Garden Categories Lawn Mowers  Out...               AU 
4684   Automotive > Categories > Motorcycle  Powerspo...   FS,AU,HB,OT,HG 
6841   Books Subjects Mystery Thriller  Suspense Thri...               BK 
11976  Clothing Shoes  Jewelry Departments Men Shops …               AU 
15500  Home  Kitchen Categories Kitchen  Dining Kitch...               AU 
22054  Sports  Outdoors Categories Fan Shop Home  Kit...               AU 
17578  Industrial  Scientific > Categories > Professi...               HG 
13990  Electronics > Categories > Computers  Accessor...   AU,HB,OT,JW,HG 
24996  Toys  Games Categories Sports  Outdoor Play Be...               SF 

                                   recordTextContentOrig
10903  Categories>Replacement Parts>Steering System>P...
19287  Patio, Lawn & Garden/Categories/Lawn Mowers & …
4684   Automotive>Categories>Motorcycle & Powersports...
6841   Books/Subjects/Mystery, Thriller & Suspense/Th...
11976  Clothing, Shoes & Jewelry/Departments/Men/Shop...
15500  Home & Kitchen/Categories/Kitchen & Dining/Kit...
22054  Sports & Outdoors/Categories/Fan Shop/Home & K...
17578  Industrial & Scientific>Categories>Professiona...
13990  Electronics>Categories>Computers & Accessories...
24996  Toys & Games/Categories/Sports & Outdoor Play/...
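The chained .str.replace calls above can be collapsed into a single pass (a sketch with a hypothetical `clean_path` helper; note that a regex squeeze of repeated spaces is slightly more thorough than the chain's single "  " → " " replace):

```python
import re
import pandas as pd

def clean_path(s):
    s = re.sub(r'[/:,;~&]', ' ', s)    # separators the chain maps to a space
    s = s.replace('>', ' > ')          # pad the path delimiter
    return re.sub(r' {2,}', ' ', s)    # squeeze runs of spaces

paths = pd.Series(["Automotive/Categories/Switches", "Baby>Boys>Vests"])
cleaned = paths.apply(clean_path)
```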

Read in the trained model

In [22]:
svm_clfr_all = joblib.load(modelFile) 

Read in the feature-extraction model(s)

In [23]:
# the feature-extraction models are divided between two files
list_cntVectorizer = joblib.load(dictBasedUnigramVectorizerListFile) 
df_popularCategoryTypes=pd.DataFrame().from_csv(dictPopularPairsFile, sep='\t') 
if debug: print df_popularCategoryTypes
    index category type  termCount
0       0        A   BN       6314
1       2       AE   BN       9654
2       5       AE   PL       6398
3       8       AU   BN       3763
4       9       AU   PC       7594
5      24       CE   PC       3404
6      25       CE   PF       3347
7      65       FS   BN       5017
8      66       FS   PC      11948
9      83       HG   BN       3791
10     84       HG   PC       5532
11     85       HG   PF       3086
12    111       OT   PF       8838

Extract the features

Create the feature vector

In [24]:
# create an empty feature frame with one row per unlabeled record
df_extrContentFeatures = pd.DataFrame(np.empty((len(df_unlabeledData.index),0)))
In [25]:
df_unlabeledData[colNameRecordTextContent].str.len()
Out[25]:
0     44
1     41
2     52
3     33
4     26
5     28
6     32
7     34
8     36
9     44
10    61
11    54
12    52
13    54
14    31
...
25997    68
25998    48
25999    52
26000    46
26001    53
26002    47
26003    63
26004    58
26005    59
26006    71
26007    60
26008    73
26009    72
26010    66
26011    67
Name: recordTextContent, Length: 26012, dtype: int64
In [26]:
corpusBasedUnigramVectorizer = joblib.load(corpusBasedUnigramVectorizerFile) 
In [27]:
df_extrContentFeatures['strLen']         = df_unlabeledData[colNameRecordTextContent].str.len()

df_unlabeledDataDerived = pd.DataFrame(np.empty((len(df_unlabeledData.index),0))) # shell array

df_unlabeledDataDerived['pathNodes']  = df_unlabeledData.recordTextContent.apply(lambda s: s.split(' > '))
df_unlabeledDataDerived[colNameRecordTextContentTokens] = df_unlabeledData.recordTextContent.apply(lambda s: split('[>| |&]+',s))
# present a sample
if debug>=2: print "derived data:", df_unlabeledDataDerived.loc[random.choice(df_unlabeledDataDerived.index, 5, replace=False)]

df_extrContentFeatures['tokensCount'] = df_unlabeledDataDerived[colNameRecordTextContentTokens].str.len()

# present a sample
# error check
if len(df_extrContentFeatures.index)<2: print "ERROR!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
# report a sample
elif debug: print "df_extrContentFeatures:\n", df_extrContentFeatures.loc[random.choice(df_extrContentFeatures.index, 5, replace=False)]
df_extrContentFeatures:
       strLen  tokensCount
8258       81            9
6654       40            6
20883      55            7
1254       47            6
16551      62            6
In [28]:
# time consuming

for index, record in df_popularCategoryTypes.iterrows():  # ordered as a debugging aid

    productType, productCategory = record[colNameTermType], record[colNameTermCategory]

    t0 = time()
    cntVectorizer = list_cntVectorizer[index] 
    cv_list_pre = cntVectorizer.transform(df_unlabeledData[colNameRecordTextContent])
    cv_list = cv_list_pre.toarray().sum(axis=1).tolist()

    df_extrContentFeatures[productType+"_"+productCategory+"_terms"] = cv_list
    df_extrContentFeatures[productType+"_"+productCategory+"_terms2tokens"] = cv_list / df_extrContentFeatures[colNameRecordTextContentTokensCount]
    timeDelta = time() - t0

    if debug: print "index:", index, "\tproductCategory:", record['category'], " productType:", record['type'], "timeDelta:", timeDelta

    if debug>=3:
       print cntVectorizer.transform(df_testTextItem['textItem']).toarray()
       print cv_list, "\n"

# report a sample
if len(df_extrContentFeatures.index)<2: print "ERROR!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
# report a sample
elif debug: print df_extrContentFeatures.loc[random.choice(df_extrContentFeatures.index, 2, replace=True)]
index: 0 	productCategory: A  productType: BN timeDelta: 1.06999993324
index: 1 	productCategory: AE  productType: BN timeDelta: 0.930000066757
index: 2 	productCategory: AE  productType: PL timeDelta: 1.02300000191
index: 3 	productCategory: AU  productType: BN timeDelta: 0.766000032425
index: 4 	productCategory: AU  productType: PC timeDelta: 0.646000146866
index: 5 	productCategory: CE  productType: PC timeDelta: 0.602999925613
index: 6 	productCategory: CE  productType: PF timeDelta: 0.668000221252
index: 7 	productCategory: FS  productType: BN timeDelta: 0.890999794006
index: 8 	productCategory: FS  productType: PC timeDelta: 0.790999889374
index: 9 	productCategory: HG  productType: BN timeDelta: 0.779000043869
index: 10 	productCategory: HG  productType: PC timeDelta: 0.679000139236
index: 11 	productCategory: HG  productType: PF timeDelta: 0.631000041962
index: 12 	productCategory: OT  productType: PF timeDelta: 0.882999897003
       strLen  tokensCount  BN_A_terms  BN_A_terms2tokens  BN_AE_terms  \
24467     103           12           3               0.25            1 
4452       76           10           4               0.40            0 

       BN_AE_terms2tokens  PL_AE_terms  PL_AE_terms2tokens  BN_AU_terms  \
24467            0.083333            2            0.166667            4 
4452             0.000000            1            0.100000            6 

       BN_AU_terms2tokens         …          PC_FS_terms  \
24467            0.333333         …                    4 
4452             0.600000         …                    6 

       PC_FS_terms2tokens  BN_HG_terms  BN_HG_terms2tokens  PC_HG_terms  \
24467            0.333333            6                 0.5           10 
4452             0.600000            2                 0.2            6 

       PC_HG_terms2tokens  PF_HG_terms  PF_HG_terms2tokens  PF_OT_terms  \
24467            0.833333            7            0.583333            9 
4452             0.600000            4            0.400000            7 

       PF_OT_terms2tokens
24467                0.75
4452                 0.70

[2 rows x 28 columns]
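Each pass of the loop above counts how many of one category's dictionary terms occur in each record. The core of that step, with a made-up two-record corpus and a fixed hypothetical AU vocabulary (the notebook loads its per-category vectorizers from disk):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Automotive Switches Relays", "Books Mystery Thriller"]

# a vectorizer restricted to one category's (hypothetical) dictionary terms;
# with a fixed vocabulary, transform() needs no prior fit
au_vectorizer = CountVectorizer(vocabulary=["automotive", "switches", "relays"])
hits = au_vectorizer.transform(docs).toarray().sum(axis=1).tolist()
# hits[i] = number of AU dictionary tokens appearing in record i
```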
In [29]:
srs_textCorpus = df_unlabeledData[colNameRecordTextContent]

# apply this feature extractor
sprs_vectorizedTokens = corpusBasedUnigramVectorizer.transform(srs_textCorpus)

from scipy.sparse import csr_matrix, issparse, isspmatrix, isspmatrix_csc, isspmatrix_csr, isspmatrix_bsr, isspmatrix_lil, isspmatrix_dok, isspmatrix_coo, isspmatrix_dia
import numpy as np
from scipy import int8

sprs_vectorizedTokens2 = csr_matrix(sprs_vectorizedTokens, dtype=int8) # squeeze int64 to save memory

narr_vectorizedTokens = sprs_vectorizedTokens2.toarray()

df_vectorizedTokens = pd.DataFrame(narr_vectorizedTokens)

# for debugging keep the feature names
tokenDict = corpusBasedUnigramVectorizer.get_feature_names()

if debug>=2:
   label_prefix="ut"
   unigramTokenFeatureNames=[label_prefix + "_" + str(i)  for i in range(df_vectorizedTokens.shape[1])]
   df_vectorizedTokens.columns = unigramTokenFeatureNames
   print tokenDict # [u'00', u'08', u'09', u'10', u'100', u'1000', u'1001', ..., u'0042g3c1a41e', ..., u'146qqsspagenamezwdvwqqrdz1qqcmdzviewitem',

# report a sample
if debug: print "df_vectorizedTokens sample\n", df_vectorizedTokens.loc[random.choice(df_vectorizedTokens.index, 4, replace=False)]
df_vectorizedTokens sample
       0      1      2      3      4      5      6      7      8      9      \
496        0      0      0      0      0      0      0      0      0      0 
5912       0      0      0      0      0      0      0      0      0      0 
21392      0      0      0      0      0      0      0      0      0      0 
3899       0      0      0      0      0      0      0      0      0      0 

       …    20228  20229  20230  20231  20232  20233  20234  20235  20236  \
496    …        0      0      0      0      0      0      0      0      0 
5912   …        0      0      0      0      0      0      0      0      0 
21392  …        0      0      0      0      0      0      0      0      0 
3899   …        0      0      0      0      0      0      0      0      0 

       20237
496        0
5912       0
21392      0
3899       0

[4 rows x 20238 columns]
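The dtype squeeze used in the cell above can be seen on a tiny matrix (a sketch; int8 is only safe while individual cell counts stay below 128):

```python
import numpy as np
from scipy.sparse import csr_matrix

counts = csr_matrix(np.array([[0, 3, 1], [2, 0, 5]]))  # platform default int dtype
small = csr_matrix(counts, dtype=np.int8)              # 1 byte per stored value
dense = small.toarray()                                # the values are unchanged
```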
In [30]:
# DEBUG: relabel the column names to be unique
if debug>=2: 
   #unigramTokenFeatureNames=[label_prefix+"_"+str(i) for i in range(df_vectorizedTokens.shape[1])]
   unigramTokenFeatureNames=[label_prefix+"_"+tokenDict[i] for i in range(df_vectorizedTokens.shape[1])]
   df_vectorizedTokens.columns = unigramTokenFeatureNames

   # report a sample
   df_vectorizedTokens.loc[random.choice(df_vectorizedTokens.index, 3, replace=False)]
In [31]:
#merge the separate feature spaces
#df_extrFeaturesTest=df_extrContentFeaturesTest.join(df_vectorizedTokensTest)

# fyi, using .join instead appears to be more memory-intensive than using merge()
df_extrFeatures = pd.merge(df_extrContentFeatures, df_vectorizedTokens, how='inner', left_index=True, right_index=True, sort=True,
      suffixes=('_x', '_y'), copy=True)

print "shape: ", df_extrFeatures.shape # (16000, 7706) # (31056, 13468)

# report a sample
df_extrFeatures.loc[random.choice(df_extrFeatures.index, 3, replace=False)]
# write-out to a file
#df_extrFeaturesTest.to_csv("df_extrFeaturesTest." + dstamp + ".csv", sep='\t', encoding='utf-8')
shape:  (26012, 20266)
Out[31]:
strLen tokensCount BN_A_terms BN_A_terms2tokens BN_AE_terms BN_AE_terms2tokens PL_AE_terms PL_AE_terms2tokens BN_AU_terms BN_AU_terms2tokens ... 20228 20229 20230 20231 20232 20233 20234 20235 20236 20237
110 10 2 0 0.000000 0 0.00 0 0.00 0 0.000000 ... 0 0 0 0 0 0 0 0 0 0
16953 67 7 2 0.285714 0 0.00 0 0.00 1 0.142857 ... 0 0 0 0 0 0 0 0 0 0
820 30 4 2 0.500000 1 0.25 1 0.25 1 0.250000 ... 0 0 0 0 0 0 0 0 0 0

3 rows × 20266 columns
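On small frames, the index-aligned merge used above behaves like this (illustrative columns; an inner merge on both indexes is equivalent to DataFrame.join here):

```python
import pandas as pd

left = pd.DataFrame({'strLen': [44, 41]})   # hand-crafted content features
right = pd.DataFrame({'tok0': [1, 0]})      # hand-crafted token features

# align rows by index, keeping only indexes present in both frames
merged = pd.merge(left, right, how='inner',
                  left_index=True, right_index=True, sort=True)
```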

In [32]:
#df_extrFeaturesSlice = df_extrFeatures[0:1]
#ndarr_preds_All = svm_clfr_all.predict(df_extrFeaturesSlice)

ndarr_preds_All = svm_clfr_all.predict(df_extrFeatures)
# ndarr_preds_All_Test[0:5]
# array(['n/a', 'n/a', 'CB>OT', 'n/a', 'n/a'], dtype=object)

gc.collect()
ndarr_preds_All_Score = svm_clfr_all.decision_function(df_extrFeatures)
#ndarr_preds_All_Score[:3]

#import numpy as np
ndarr_preds_All_Score = np.column_stack((ndarr_preds_All,ndarr_preds_All_Score.max(axis=1)))
ndarr_preds_All_Score
# array([ ['CT>HW', -0.7373890220954527],   ['n/a', -0.8697072603624554],    ['CT>HW', -0.6463626722157896], ...
Out[32]:
array([ ['IS>IN', -0.7487199845348264],
       ['AU>PA', -0.7075546114593144],
       ['AU>PA', -0.3692320688771711],
       ..., 
       ['IS>IN', 0.41506620500102787],
       ['HG>HI', -0.7264187657103042],
       ['FS>SH>WO', 0.35970082967814143] ], dtype=object)
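The label-plus-score pairing above reduces to an argmax over the per-class margins (made-up class names and scores; in the one-vs-rest setup, predict() returns the class whose margin is largest):

```python
import numpy as np

class_names = np.array(['AU>PA', 'BK>BK'], dtype=object)
scores = np.array([[0.7, -0.3],     # per-class decision_function margins
                   [-0.9, 0.4]])    # one row per record

preds = class_names[scores.argmax(axis=1)]               # winning class per record
pred_with_score = np.column_stack((preds, scores.max(axis=1)))
```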
In [33]:
df_predLabels = pd.DataFrame(ndarr_preds_All_Score)
df_predLabels.columns = ["taxoLabel","predScore"]
# report a sample
df_predLabels.loc[random.choice(df_predLabels.index, 3, replace=False)]
Out[33]:
taxoLabel predScore
25447 GM>HW -0.7726987
23146 FS>CL>ME 0.03568075
24000 HG>HI 3.229172
In [34]:
df_predictions = df_unlabeledData.join(df_predLabels)

predsFile="df_predictions." + dataCode + "." + dstamp + ".tsv"

df_predictions.to_csv(predsFile, sep='\t')

if debug:
  print df_predictions.loc[random.choice(df_predictions.index, 25, replace=False)]
                                       recordTextContent Viglink Category  \
6833   Books Subjects Mystery Thriller  Suspense Thri...               BK 
4902   Automotive > Categories > Replacement Parts > …            AU,HG 
2632   Automotive Categories Paint Body  Trim Body He...               AU 
14398  Grocery  Gourmet Food > Categories > Cooking  …            OT,HG 
20193  Power Tool Parts  Accessories > Power Drill Pa...            AU,HG 
17413  Industrial  Scientific > Categories > Occupati...               HG 
382                > Products > Apparel > Jackets  Vests               FS 
17872  Kitchen  Bath Fixtures > Bathroom Fixtures > T...         AU,OT,HG 
17826  Industrial  Scientific > Categories > Test Mea...         AU,OT,HG 
7825   Categories > Camera  Photo > Accessories > Fil...               AU 
5382   Baby Products > Categories > Feeding > Solid F...   FS,HB,OT,JW,HG 
17753  Industrial  Scientific > Categories > Tapes Ad...               AU 
21081  Replacement Parts > Engine Cooling  Climate Co...            AU,HG 
17655  Industrial  Scientific > Categories > Raw Mate...            OT,HG 
7351   Books Subjects Teen  Young Adult Science Ficti...               BK 
4181   Automotive Categories Replacement Parts Switch...               AU 
15776  Home  Kitchen > Categories > Home Décor > Area...         HB,OT,HG 
6329   Books Subjects Children's Books Sports  Outdoo...               BK 
19819  Pet Supplies Categories Dogs Collars Harnesses...               AU 
19116  Outdoor Recreation > Skates Skateboards  Scoot...         FS,AU,HG 
12937  Departments > Women > Accessories > Scarves  W...   FS,HB,OT,JW,HG 
25186  Toys  Games > Categories > Electronics for Kid...               HG 
21302  Shops > Surf Skate  Street > Accessories > Wal...               FS 
6991   Books Subjects Reference Encyclopedias  Subjec...               BK 
7145   Books Subjects Science  Math Chemistry General...               BK 

                                   recordTextContentOrig taxoLabel   predScore
6833   Books/Subjects/Mystery, Thriller & Suspense/Th...     BK>BK    1.927918
4902   Automotive>Categories>Replacement Parts>Caps>C...     AU>PA    3.052673
2632   Automotive/Categories/Paint, Body & Trim/Body/...     AU>PA    2.556363
14398  Grocery & Gourmet Food>Categories>Cooking & Ba...     FD>FD   0.4565988
20193  Power Tool Parts & Accessories>Power Drill Par...     HG>HI -0.08937569
17413  Industrial & Scientific>Categories>Occupationa...     IS>IN   0.5579817
382                    >Products>Apparel>Jackets & Vests     FS>CL   -0.423632
17872  Kitchen & Bath Fixtures>Bathroom Fixtures>Toil...     HG>KD   0.2588397
17826  Industrial & Scientific>Categories>Test, Measu...     IS>IN   0.2553237
7825   Categories>Camera & Photo>Accessories>Filters …     CP>CA    1.046376
5382   Baby Products>Categories>Feeding>Solid Feeding...     FB>OT    1.469127
17753  Industrial & Scientific>Categories>Tapes, Adhe...     IS>IN   0.1974295
21081  Replacement Parts>Engine Cooling & Climate Con...     AU>PA    3.572494
17655  Industrial & Scientific>Categories>Raw Materia...     IS>IN  0.09769334
7351   Books/Subjects/Teen & Young Adult/Science Fict...     BK>BK     1.67285
4181   Automotive/Categories/Replacement Parts/Switch...     AU>PA    4.677591
15776  Home & Kitchen>Categories>Home Décor>Area Rugs...     HG>HD   0.3092907
6329   Books/Subjects/Children's Books/Sports & Outdo...     BK>BK    1.261812
19819  Pet Supplies/Categories/Dogs/Collars, Harnesse...     PT>DO    1.122119
19116  Outdoor Recreation>Skates, Skateboards & Scoot...     SF>OT    1.538451
12937  Departments>Women>Accessories>Scarves & Wraps>...  FS>AC>WO   0.2966394
25186  Toys & Games>Categories>Electronics for Kids>R...     HO>TO  -0.4961235
21302  Shops>Surf, Skate & Street>Accessories>Wallets...  FS>AC>WO  -0.2004225
6991   Books/Subjects/Reference/Encyclopedias & Subje...     BK>BK   0.3891943
7145   Books/Subjects/Science & Math/Chemistry/Genera...     BK>BK   0.5336897

#predsFile="predsJoined." + taxoCode + "." + dstamp + ".tsv"
#df_predsJoined=pd.merge(df_predictions, df_VIG_lab, on='taxoLabel', suffixes=['_left', '_right'])
#df_predsJoined.to_csv(predsFile, sep='\t')





In [ ]: