News Classification Algorithms

This page describes the text classification algorithms implemented in our CATA tool.

The overall challenge of press release classification can be described as follows:

Given:

  • Collection (corpus) of documents D
  • Set of classes C
  • For each class in C there exists a set of terms characterizing that class

            Terms are unique within one class, but a single term can appear in any number of classes

Goal:

  • For each document D_i, predict the corresponding class C_j
  • This is a multiclass classification problem (not a multilabel one),

            because each press release typically represents exactly one corporate event

Subgoal:

  • Calculate the probability distribution of documents over classes using the class terms
  • This gives a document-class matrix M of size |D| x |C|, where M_{ij} is the probability that document i belongs to class j
  • For each document, select the class with the maximal probability
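
As an illustration, a minimal sketch of this selection step in Python, assuming M is already computed as a NumPy array (the names here are illustrative):

    import numpy as np

    # Hypothetical document-class matrix M: |D| x |C|,
    # M[i, j] = probability that document i belongs to class j.
    M = np.array([
        [0.7, 0.2, 0.1],   # document 0
        [0.1, 0.3, 0.6],   # document 1
    ])

    # For each document, pick the class with the maximal probability.
    predicted_class = M.argmax(axis=1)   # -> array([0, 2])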

 

Preliminary estimation of ALG1 quality

 

Estimation of binary prediction quality,
using the JV dataset jv_train,
fixed output: (3025, 13)

 

 2 classes:    precision:  90.4% recall:  75.7% fscore:  82.4%
 3 classes:    precision:  90.7% recall:  69.4% fscore:  78.6%
 4 classes:    precision:  91.0% recall:  64.1% fscore:  75.2%
 5 classes:    precision:  90.0% recall:  53.9% fscore:  67.5%
 6 classes:    precision:  90.5% recall:  52.6% fscore:  66.6%
 7 classes:    precision:  90.5% recall:  52.1% fscore:  66.1%
 8 classes:    precision:  90.8% recall:  51.3% fscore:  65.6%
 9 classes:    precision:  90.9% recall:  50.3% fscore:  64.7%
10 classes:    precision:  90.8% recall:  48.9% fscore:  63.5%
11 classes:    precision:  91.1% recall:  48.4% fscore:  63.2%
15 classes:    precision:  91.8% recall:  47.0% fscore:  62.1%
20 classes:    precision:  91.7% recall:  45.2% fscore:  60.6%
36 classes:    precision:  91.8% recall:  44.3% fscore:  59.8%
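
For reference, a minimal sketch of how such a one-vs-rest binary evaluation could be computed with scikit-learn (the labels and predictions here are placeholders, not the real JV data):

    from sklearn.metrics import precision_recall_fscore_support

    # Placeholder one-vs-rest labels: 1 = the target class (e.g. JV), 0 = all other classes.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    # Binary precision/recall/fscore for the target class only.
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    print(f"precision: {p:.1%} recall: {r:.1%} fscore: {f:.1%}")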

---------------------------------------------------------------------

artificial cases:

2 classes (keywords for the control class; random keywords from the corpus):

    train:  precision:  91.2% recall:  93.3% fscore:  92.2%
    test:   precision:  91.5% recall:  94.5% fscore:  93.0%

1 class in the CATA input:

    train:  precision: 100.0% recall:  50.9% fscore:  67.4%
    test:   precision: 100.0% recall:  49.6% fscore:  66.3%

---------------------------------------------------- 

 

Summary of the experiments with ALG1 (as a binary classifier):

Good:
    - ALG1 shows high and stable precision (~91%)

Bad:
    - recall drops as new classes are added
    - recall strongly depends on the samples from the other classes

 

---------------------------------------------------------

 

Since in practice we use ALG1 as a multiclass classifier (not as a binary one),
we also need to test it on a multiclass dataset.

 

Comparison of results on the binary dataset (2newsgroups) and the multiclass dataset (20newsgroups):

 

Evaluation of CATA as a binary classifier on the binary dataset:

2news_test 2classes:    precision:  84.9% recall:  86.6% fscore:  85.8%
2news_test 20classes:   precision:  92.8% recall:  26.0% fscore:  40.6%

 

Evaluation of CATA as a binary classifier on the multiclass dataset:

20news_test 20classes:  precision:  35.1% recall:  26.6% fscore:  30.3%
20news_test 2classes:   precision:   6.7% recall:  82.1% fscore:  12.4%

 

Evaluation of CATA as a multiclass classifier on the multiclass dataset (averaged metrics):

macro averaging:    precision:  44.9% recall:  44.0% fscore:  43.8%
micro averaging:    precision:  46.9% recall:  46.9% fscore:  46.9%
weighted averaging: precision:  47.5% recall:  46.9% fscore:  46.6%
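
A minimal sketch of how these averaged metrics can be computed with scikit-learn (labels and predictions are placeholders):

    from sklearn.metrics import precision_recall_fscore_support

    # Placeholder multiclass labels and predictions (the real experiment uses 20 classes).
    y_true = [0, 1, 2, 2, 1, 0, 2, 1]
    y_pred = [0, 2, 2, 2, 0, 0, 1, 1]

    for avg in ("macro", "micro", "weighted"):
        p, r, f, _ = precision_recall_fscore_support(y_true, y_pred, average=avg)
        print(f"{avg}: precision {p:.1%} recall {r:.1%} fscore {f:.1%}")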

 
Summary:
 - on the binary 2news dataset, ALG1 shows results similar to those on the binary JV dataset
 - on the multiclass dataset ALG1 shows worse results, with weighted precision of 47.5% and recall of 46.9%

 

--------------------------------------------------------------------------------

 

Results with the Tfidf algorithm (compared with ALG1)

20news_test (16 keywords, 1-ngrams)

ALG1 weighted stats:        precision:  47.5% recall:  46.9% fscore:  46.6%
Tfidf ngrams=1:             precision:  59.3% recall:  54.8% fscore:  55.6%

 

20news_test (32 keywords, 1-2-ngrams)

random benchmark            precision:   5.2% recall:   4.5% fscore:   4.7%
ALG1 weighted stats:        precision:  52.3% recall:  52.1% fscore:  51.4%

Tfidf ng=2                  precision:  61.5% recall:  59.0% fscore:  59.0%
Tfidf ng=1                  precision:  63.4% recall:  60.6% fscore:  60.8%
Tfidf ng=1,fake             precision:  70.0% recall:  55.0% fscore:  60.5%
Tfidf ng=2,stemm            precision:  63.2% recall:  60.6% fscore:  60.8%
Tfidf ng=1,stemm            precision:  64.6% recall:  61.9% fscore:  62.2%
Tfidf ng=1,stemm,fake       precision:  69.3% recall:  58.2% fscore:  62.3%
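
For context, a minimal sketch of a TF-IDF keyword classifier along these lines, assuming each class is represented by a keyword list scored against documents via cosine similarity (the keyword lists and variable names are illustrative, not the actual CATA configuration; the "stemm" variants would additionally stem tokens before vectorization):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Illustrative class keyword lists (the real setup uses 16/32 keywords per class).
    class_keywords = {
        "joint_ventures": "joint venture alliance partnership agreement",
        "acquisitions":   "acquisition merger takeover buyout stake",
    }
    documents = [
        "The companies announced a strategic alliance and a new joint venture.",
        "The board approved the takeover bid and the merger terms.",
    ]

    # Fit TF-IDF on documents plus the class keyword "documents"; 1-grams here,
    # ngram_range=(1, 2) would correspond to the 1-2-ngram variants above.
    vectorizer = TfidfVectorizer(ngram_range=(1, 1))
    vecs = vectorizer.fit_transform(documents + list(class_keywords.values()))
    doc_vecs, class_vecs = vecs[:len(documents)], vecs[len(documents):]

    # Document-class similarity matrix; pick the best-matching class per document.
    sims = cosine_similarity(doc_vecs, class_vecs)
    labels = list(class_keywords)
    print([labels[row.argmax()] for row in sims])   # -> ['joint_ventures', 'acquisitions']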

 

cut with mincut=0.15:
Tfidf ng=1,stemm,fake       precision:  70.2% recall:  57.9% fscore:  62.4%
cut with mincut=0.30:
Tfidf ng=1,stemm,fake       precision:  83.5% recall:  37.9% fscore:  50.7%
cut with mincut=0.45:
Tfidf ng=1,stemm,fake       precision:  93.0% recall:   9.5% fscore:  16.6%
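
The mincut parameter appears to act as a minimum-confidence threshold; a minimal sketch of such a cut, assuming scores holds the document-class similarities (the function name is hypothetical):

    import numpy as np

    def predict_with_mincut(scores: np.ndarray, mincut: float) -> np.ndarray:
        """Return the argmax class per document, or -1 (no prediction)
        when the best score falls below the mincut threshold."""
        best = scores.argmax(axis=1)
        best_score = scores.max(axis=1)
        return np.where(best_score >= mincut, best, -1)

    scores = np.array([[0.50, 0.10], [0.20, 0.25]])
    print(predict_with_mincut(scores, mincut=0.30))   # -> [ 0 -1]

Raising mincut trades recall for precision, which matches the trend in the tables above.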

 

Summary on the 20newsgroups dataset (with labels):

- new experiment setup:
    calculate the quality of class prediction on a multiclass dataset;
    use the mixed 20-class dataset, predict every class, and calculate averaged metrics
- random benchmark  precision:   5.2% recall:   4.5% fscore:   4.7%
- ALG1              precision:  52.3% recall:  52.1% fscore:  51.4%
- Tfidf tuned       precision:  70.2% recall:  57.9% fscore:  62.4%
- from the literature we know that a supervised model trained on the labeled 20news train set
    predicts on the test set with precision and fscore > 80%:
    http://nlp.stanford.edu/wiki/Software/Classifier/20_Newsgroups

 

---------------------------------------------------------------------------------------------

 

SnP full dataset, JV labeled (test part): 57243 samples, of which 858 are labeled JV.
We know labels only for the JV class, so the metrics concern only this class.

ALG1 fails on this dataset: it returns only the first 28 records and breaks.

 

random benchmark            precision:   1.2% recall:   0.9% fscore:   1.1%            

 

Tfidf ng=1                  precision:   7.8% recall:  48.4% fscore:  13.5%
Tfidf ng=2                  precision:   8.6% recall:  53.6% fscore:  14.9%
Tfidf ng=3                  precision:   8.8% recall:  53.0% fscore:  15.0%
Tfidf ng=2,fake32           precision:  20.7% recall:  15.6% fscore:  17.8%
Tfidf ng=3,fake32           precision:  20.7% recall:  15.5% fscore:  17.7%
Tfidf ng=2,stemm            precision:   9.9% recall:  64.8% fscore:  17.1%
Tfidf ng=3,stemm            precision:  10.0% recall:  64.3% fscore:  17.2%
Tfidf ng=2,stemm,fake16     precision:  22.5% recall:  19.2% fscore:  20.7%
Tfidf ng=3,stemm,fake16     precision:  22.7% recall:  19.2% fscore:  20.8%
Tfidf ng=2,stemm,fake32     precision:  20.5% recall:  23.0% fscore:  21.6%
Tfidf ng=3,stemm,fake32     precision:  20.4% recall:  22.7% fscore:  21.5%
Tfidf ng=2,stemm,fake128    precision:  21.6% recall:   3.7% fscore:   6.4%

 

cut with mincut=0.15:
Tfidf ng=2,stemm,fake32     precision:  20.7% recall:  23.0% fscore:  21.8%
cut with mincut=0.30:
Tfidf ng=2,stemm,fake32     precision:  35.6% recall:  12.6% fscore:  18.6%
cut with mincut=0.45:
Tfidf ng=2,stemm,fake32     precision:  40.0% recall:   0.2% fscore:   0.5%

 

--------

 

Summary on the SnP500 dataset and the "Alliances and joint ventures" labels:

- experiment setup:
    calculate the quality of class prediction on a multiclass dataset;
    use the full SnP500 dataset, predict every class, but calculate metrics using only the JV labels
- random benchmark  precision:   1.2% recall:   0.9% fscore:   1.1%
- ALG1              NA (fails on this dataset, see above)
- Tfidf tuned       precision:  20.7% recall:  23.0% fscore:  21.8%