Hierarchical confusion matrix for classification performance evaluation

Author Notes

Abstract

This study proposes the novel concept of hierarchical confusion matrix, opening the door for popular confusion-matrix-based (flat) evaluation measures from binary classification problems, while considering the peculiarities of hierarchical classification problems. The concept is developed to a generalised form and proven its applicability to all types of hierarchical classification problems including directed acyclic graphs, multi-path labelling, and non-mandatory leaf-node prediction. Finally, measures based on the novel confusion matrix are used for three real-world hierarchical classification applications and compared to established evaluation measures. The results, the conformity with important attributes of hierarchical classification schemes and its broad applicability justify its recommendation.

evaluation metrics, hierarchical classification problems, hierarchical confusion matrix

1 Introduction

Hierarchical classification is the task of assigning m multiple classes out of n multiple, overlapping classes (⁠ $C_{i}$ ⁠) disposed in a hierarchical structure. Classes in the hierarchical classification task can be divided into sub-classes or grouped into super classes. Often, hierarchical classification is wrongly referred to other problems (Kiritchenko et al., 2006; Silla & Freitas, 2011) such as hierarchical clustering of objects to generate structures of classes (Gauch & Whittaker, 1981; Gordon, 1987). This work explicitly understands hierarchical classification as allocation of objects to a set of given classes rather than the design of classes and their structure itself.

Machine learning research started to focus on (hierarchically) structured classification problems in the context of text categorisation in the mid-1990s (Koller & Sahami, 1997). From then on, hierarchical classification has been applied in many fields, namely text categorisation, music genre classification, and bioinformatic applications including sequence classification, and protein function prediction (Freitas & Carvalho, 2007; Silla & Freitas, 2011). Despite of intensive research, many related questions in the field remain open, as stated, e.g. in Kosmopoulos et al. (2015).

Hierarchical classification evaluation is part of ongoing research and various measures have been proposed so far (Costa et al., 2007; Sokolova et al., 2006). While established evaluation measures for binary and multi-class classification problems exist, there is still the lack of consensus on which measures to employ in the context of hierarchical classification problems (Costa et al., 2007).

Hierarchical classification studies frequently adopted conventional, flat measures from binary classification problems such as accuracy, precision, or recall (Costa et al., 2007; Kiritchenko et al., 2006; Kosmopoulos et al., 2015). Due to their simple structure based on the confusion matrix, these measures are easy accessibly to human inspection. However, flat measures do not appropriately consider the complex peculiarities of hierarchical classification problems. The measures in Kiritchenko et al., 2006 (hierarchical precision, recall, and F-score) are observed to frequently occur in recent studies and try to bridge the evaluation of binary and hierarchical classification problems.

In this work, we propose a novel concept for the evaluation of hierarchical classification problems: the hierarchical confusion matrix. This approach opens the door for conventional measures in the context of hierarchical classification and to complete the field of hierarchical classification evaluation. Further, we extend the concept to a generalised form and prove its applicability to all types of hierarchical classification problems including directed acyclic graphs, multi-path label, and non-mandatory leaf-node prediction. Finally, we demonstrate the application of measures based on the novel concept for the comparison of classifiers in three exemplary benchmarks related to different types of hierarchical classification problems.

An implementation of the hierarchical confusion matrix in Python programming language, and the examples and experiments of this study are available on GitHub.¹

The brief remainder of this work is organised as follows. Section 2 formally introduces the problems of hierarchical classification. In Section 3, a summary of proposed evaluation measures for hierarchical classification is presented. Section 4 presents the novel hierarchical confusion matrix, discusses the conformity to important attributes, and proves applicability to all types of hierarchical classification problems. In Section 5, the novel evaluation concept is applied onto experimental data and compared to existing measures revealing its usefulness and reasonability. Finally, Section 6 concludes this paper and summarises its contribution.

2 Hierarchical classification

Supervised machine learning allows access to labelled data for training and testing stages. Classification describes supervised machine learning using categorical labels, and stands for the allocation of objects to a set of existing classes. The field of classification can be divided into the sub-problems: binary, multi-class, multi-label, and hierarchical classification, as shown in Figure 1. In the binary classification problem, objects need to be allocated to one of two non-overlapping classes (⁠ $C_{1}$ ⁠, $C_{2}$ ⁠). Multi-class classification requires the choice of one out of n multiple, non-overlapping classes (⁠ $C_{i}$ ⁠) for a given input object. Multi-label classification represents the assignment of objects to m multiple classes out of n multiple, non-overlapping classes (⁠ $C_{i}$ ⁠).

Figure 1.

Classification problems by example. The selected class(es) is (are) marked. (a) Binary classification. (b) Multi-class classification. (c) Multi-label classification. (d) Hierarchical classification. Note that the single selected category (dark chocolate) causes the selection of multiple parental vertices in the graph on a path up to the root node.

Open in new tab Download slide

Before we define hierarchical classification, let us introduce the notation used in this paper. We define a set of data Ω and a structure Γ. The structure Γ can be described by a graph, where the nodes represent the n classes $C_{i} \in Γ$ and the edges stand for the possible relationships between these classes, namely ancestry and descendance. For each class $C_{i}$ ⁠, let us denote the set of all descendants by $D {C_{i}}$ and the set of all neighbouring nodes (with at least one common ancestor) by $N {C_{i}}$ ⁠.

The root node $R \in Γ$ is the only node without ancestors $A {R} = \emptyset$ ⁠. Leaf nodes i are nodes without any descendants. An allocation path ${\vec{P}}_{i}$ is a series of classes (connected nodes) within Γ. This work requires three conditions for ${\vec{P}}_{i}$ ⁠: (1) for each consecutive pair of elements $C_{a}, C_{b} \in {\vec{P}}_{i}$ the ancestry relationship $C_{a} \in A {C_{b}}$ holds, (2) the connectivity to the root node $R \in {\vec{P}}_{i}$ ⁠, and (3) the connectivity to the allocated class $C_{i} \in {\vec{P}}_{i}$ ⁠. Let us denote the true class of a given object $o_{d} \in Ω$ as $C {o_{d}}$ ⁠.

$P_{d}$ describes the set of all predicted allocation paths for $o_{d}$ ⁠, and a single, predicted allocation path is given by ${\hat{P}}_{d} \in P_{d}$ ⁠. $W_{d}$ denotes the set of all true allocation paths for $o_{d}$ based on Γ and ${\tilde{P}}_{d} \in W_{d}$ defines a single, true allocation path. Moreover, let us denote the intersection of ${\hat{P}}_{d}$ and ${\tilde{P}}_{d}$ by ${\hat{P}}_{d} \cap {\tilde{P}}_{d}$ ⁠, the (longest, consecutively connected) common path of prediction of ${\hat{P}}_{d}$ and ${\tilde{P}}_{d}$ by ${\hat{P}}_{d} \bar{\cap} {\tilde{P}}_{d}$ ⁠, and the leaf node of a path ${\vec{P}}_{i}$ as $Z {{\vec{P}}_{i}}$ ⁠.

Hierarchical classification is a structured classification problem and represents the allocation of one or multiple paths ${\vec{P}}_{x} \in P_{d}$ within the structure Γ to a given object $o_{d} \in Ω$ ⁠. Table 1 summarises the definitions of this work.

Table 1.

Open in new tab

Definitions for hierarchical classification

Symbol	Definition
Γ	Structure of hierarchical classification problem
Ψ	Number of allocated paths of classification problem
Φ	Mandatoriness for leaf-node prediction
Ω	Set of data
$o_{d}$	A single data set in O
$C_{i}$	Class i, node in Γ
R	Root node of Γ
$C {o_{d}}$	True class of $o_{d}$ in Γ
$D {C_{i}}$	Set of descendants of $C_{i}$
$N {C_{i}}$	Set of neighbours of $C_{i}$
${\vec{P}}_{i}$	Allocation path to $C_{i}$
$Z {{\vec{P}}_{i}}$	Leaf node of ${\vec{P}}_{i}$
$P_{a} \bar{\cap} P_{b}$	Longest, consecutively connected, common path
$P_{d}$	Set of all predicted allocation paths for $o_{d}$
${\hat{P}}_{d}$	A single, predicted allocation path for $o_{d}$ in $P_{d}$
$W_{d}$	Set of all true allocation paths for $o_{d}$
${\tilde{P}}_{d}$	A single, true allocation path for $o_{d}$ in $W_{d}$

Symbol	Definition
Γ	Structure of hierarchical classification problem
Ψ	Number of allocated paths of classification problem
Φ	Mandatoriness for leaf-node prediction
Ω	Set of data
$o_{d}$	A single data set in O
$C_{i}$	Class i, node in Γ
R	Root node of Γ
$C {o_{d}}$	True class of $o_{d}$ in Γ
$D {C_{i}}$	Set of descendants of $C_{i}$
$N {C_{i}}$	Set of neighbours of $C_{i}$
${\vec{P}}_{i}$	Allocation path to $C_{i}$
$Z {{\vec{P}}_{i}}$	Leaf node of ${\vec{P}}_{i}$
$P_{a} \bar{\cap} P_{b}$	Longest, consecutively connected, common path
$P_{d}$	Set of all predicted allocation paths for $o_{d}$
${\hat{P}}_{d}$	A single, predicted allocation path for $o_{d}$ in $P_{d}$
$W_{d}$	Set of all true allocation paths for $o_{d}$
${\tilde{P}}_{d}$	A single, true allocation path for $o_{d}$ in $W_{d}$

Table 1.

Open in new tab

Definitions for hierarchical classification

Symbol	Definition
Γ	Structure of hierarchical classification problem
Ψ	Number of allocated paths of classification problem
Φ	Mandatoriness for leaf-node prediction
Ω	Set of data
$o_{d}$	A single data set in O
$C_{i}$	Class i, node in Γ
R	Root node of Γ
$C {o_{d}}$	True class of $o_{d}$ in Γ
$D {C_{i}}$	Set of descendants of $C_{i}$
$N {C_{i}}$	Set of neighbours of $C_{i}$
${\vec{P}}_{i}$	Allocation path to $C_{i}$
$Z {{\vec{P}}_{i}}$	Leaf node of ${\vec{P}}_{i}$
$P_{a} \bar{\cap} P_{b}$	Longest, consecutively connected, common path
$P_{d}$	Set of all predicted allocation paths for $o_{d}$
${\hat{P}}_{d}$	A single, predicted allocation path for $o_{d}$ in $P_{d}$
$W_{d}$	Set of all true allocation paths for $o_{d}$
${\tilde{P}}_{d}$	A single, true allocation path for $o_{d}$ in $W_{d}$

Symbol	Definition
Γ	Structure of hierarchical classification problem
Ψ	Number of allocated paths of classification problem
Φ	Mandatoriness for leaf-node prediction
Ω	Set of data
$o_{d}$	A single data set in O
$C_{i}$	Class i, node in Γ
R	Root node of Γ
$C {o_{d}}$	True class of $o_{d}$ in Γ
$D {C_{i}}$	Set of descendants of $C_{i}$
$N {C_{i}}$	Set of neighbours of $C_{i}$
${\vec{P}}_{i}$	Allocation path to $C_{i}$
$Z {{\vec{P}}_{i}}$	Leaf node of ${\vec{P}}_{i}$
$P_{a} \bar{\cap} P_{b}$	Longest, consecutively connected, common path
$P_{d}$	Set of all predicted allocation paths for $o_{d}$
${\hat{P}}_{d}$	A single, predicted allocation path for $o_{d}$ in $P_{d}$
$W_{d}$	Set of all true allocation paths for $o_{d}$
${\tilde{P}}_{d}$	A single, true allocation path for $o_{d}$ in $W_{d}$

According to Silla and Freitas (2011), Borges et al. (2013), Kosmopoulos et al. (2015), the different sub-problems of hierarchical classification can be categorised based on three aspects: structure Γ, number of paths Ψ, and label depth Φ. First, the literature commonly differentiates two types of structures Γ that put the classes into a hierarchical relationship: trees T and directed acyclic graphs DAGs (Silla & Freitas, 2011). Both of these structures are presented in Figure 2. A tree is a graph in which any two nodes are connected by exactly one path, whereas a DAG allows nodes to possess multiple parental nodes. T implies $| W_{d} | = 1$ and DAG entails $| W_{d} | \geq 1$ for a single class resp. single labelled object $o_{d}$ ⁠. Moreover, this work assumes that a classifier is required to predict a specific path for DAG problems, while T problems require the prediction of a class only. Second, Silla and Freitas (2011) distinguishe single (SPL) and multi-path label (MPL) hierarchical classification depending on the number of allocated paths $Ψ = | P_{d} |$ per object $o_{d} \in Ω$ ⁠. SPL implies $| P_{d} | = 1$ ⁠, while MPL entails $| P_{d} | \geq 1$ ⁠. Third, hierarchical classification problems differ by their required label depth Φ; there are problems that necessarily require full depth labelling (FD) meaning the mandatory leaf-node prediction (MLNP), and such that do not (partial depth labelling, PD or non-mandatory leaf-node prediction, NMLNP) (Costa et al., 2007; Freitas & Carvalho, 2007).

Figure 2.

Structures of hierarchical classification. (a) Trees. (b) Directed acyclic graphs (DAGs).

Open in new tab Download slide

Algorithms approaching the hierarchical classification problem can be grouped into three categories (Dumais & Chen, 2000; Kiritchenko et al., 2006; Silla & Freitas, 2011): flat, local-model, and global-model approaches. These three approaches and their sub-types are outlined in Figure 3.

Figure 3.

Approaches to hierarchical classification. The use of classifiers is marked with dashed lines. (a) Flat approach transforms leaf node classification to a multi-class problem. (b) Global-model approach considers all classes at the same time. (c) Local classifier per node employs a binary classifier for each vertex of the structure. (d) Local classifier per parent node utilises a multi-class classifier for each parental node. (e) Local classifier per level uses a multi-class classifier at each level of the taxonomy.

Open in new tab Download slide

The flat approach transforms the hierarchical classification problem into a multi-class problem using the leaf nodes only. Three types of local-model approaches are mentioned by Silla and Freitas (2011): local classifier per node, local classifier per parent node, and local classifier per level. The local classifier per node approach employs a binary classifier for each node of the class hierarchy, while the local classifier per parent approach trains a multi-class classifier for all non-leaf nodes. The local classifier per level approach consists in using a multi-class classifier for all labels present in the levels of a hierarchical structure. The global approach utilises one single classifier that captures the whole hierarchical class structure. Kiritchenko et al. (2006) argue that local-model approaches produce consistent labelling and are efficient to train. This comes at the cost that local models are highly sensitive to ancestral decisions and hardly interpretable due to the vast amount of different classifiers. Therefore, they pronounce for the use of global models that probably have a higher chance for correct predictions. However, the training of global-model approaches is complex and contrary to the other approaches the classification problem is not decomposed to simpler problems. In general, the performance of classifiers depends on their training step. A larger amount of data available and less contradictions in the data set supports the training of better models. This is an argument for local classifiers per level and flat approaches. Local classifiers per node and parent node face a shrinking size of training data for the deeper levels, while global-model approaches face contradicting data as there is no decomposition of the data for different classifiers.

All approaches have their point and with regard to the diversity of algorithms found in the literature, one can conclude that different types of data and hierarchical classification problems require different, customised approaches.

3 Related works

In the context of the binary classification problem, there are four possible cases that occur: a true positive prediction (TP), a true negative prediction (TN), a false positive prediction (FP), and a false negative prediction (FN). Based on the number of times these four cases occur, that make up the confusion matrix, many different performance measures have been proposed (Fawcett, 2006). Amongst these measures there are accuracy (ACC), precision (PPV), recall (TPR), false negative rate (FNR), false positive rate (FPR), specificity (TNR), prevalence (PT), F1-score (F1), and Matthews correlation coefficient (MCC) to name a few; their definitions are listed in Table 2.

Table 2.

Open in new tab

Common binary classification measures

Measure	Definition
ACC	$= \frac{TP + TN}{TP + TN + FP + FN}$
PPV	$= \frac{TP}{TP + FP}$
TPR	$= \frac{TP}{TP + FN}$
FNR	$= \frac{FP}{FN + TP}$
FPR	$= \frac{FP}{FP + TN}$
TNR	$= \frac{TN}{TN + FP}$
PT	$= \frac{\sqrt{TPR (1 - TNR)} + TNR - 1}{TPR + TNR - 1}$
F1	$= \frac{2 TP}{2 TP + FP + FN}$
MCC	$= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) (TP + FN) (TN + FP) (TN + FN)}}$

Table 2.

Open in new tab

Common binary classification measures

Measure	Definition
ACC	$= \frac{TP + TN}{TP + TN + FP + FN}$
PPV	$= \frac{TP}{TP + FP}$
TPR	$= \frac{TP}{TP + FN}$
FNR	$= \frac{FP}{FN + TP}$
FPR	$= \frac{FP}{FP + TN}$
TNR	$= \frac{TN}{TN + FP}$
PT	$= \frac{\sqrt{TPR (1 - TNR)} + TNR - 1}{TPR + TNR - 1}$
F1	$= \frac{2 TP}{2 TP + FP + FN}$
MCC	$= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) (TP + FN) (TN + FP) (TN + FN)}}$

As it turns out, hierarchical classification is a more complex problem, and the definition of the confusion matrix elements is not trivial. Kiritchenko et al. (2006) plead for the consideration of partially correct classification, the discriminatory punishment of more distant errors and errors at upper levels. Take the snack product categorisation in Figure 1d as an example; the wrong classification of white chocolate could be considered to be more accurate compared to the wrong classification of muffins or even a snack from different flavours.

Kosmopoulos et al. (2015) discuss further issues, namely over- and under-specialisation, alternative paths, pairing, and long distance problems. With regard to our snack product category example, over- and under-specialisation could occur in NMLNP problems, where the classifier predicts dark chocolate, and the true label is set as sweet snack or the other way around. Moreover, the MPL classification problem can increase the complexity of evaluation further.

Due to these complexities, specific measures for the evaluation of hierarchical classification problems have been proposed. Kosmopoulos et al. (2015) differentiates pair-based and set-based measures. Pair-based measures assign costs to pairs of predicted and true classes; these measures are most suitable for tree and SPL problems. A possible extension to MPL problems is the relaxation for each class to be part of more than one pair. Set-based measures take the whole set of predicted and true classes into account, including ancestry and descendance relationships. Symmetric difference loss and hierarchical precision and recall (Kiritchenko et al., 2005) count amongst the set-based measures.

Costa et al. (2007) categorise the evaluation measures into distance-based, depth-based, semantics-based, and hierarchy-based measures. Distance-based measures consider the distance of true and predicted classes in a structure (Sun & Lim, 2001; Wang et al., 1999). A possible issue with these measures is the neglection of different levels, as these measures commonly do not consider the difficulty for accurate classification at deeper levels and the severity of misclassification at shallower levels.

A possible mean to encounter this shortcoming is the weighting of structural edges which is described as depth-based measures. Besides the number of edges, the depth of true and predicted classes is taken into consideration. However, the new problem of setting the weights occurs. Blockeel et al. (2002) and Holden and Freitas (2006) employ exponential weighting. A possible issue with these (distance-based) measures is the discrimination and over-penalisation of leaf nodes with different depths. In addition to that, the definition of depth becomes more challenging in DAGs.

Semantics-based measures employ the concept of similarity between classes and objects to predict the performance of classification models (Sun & Lim, 2001). From this perspective, similarity or distance can be quantified comparing different feature vectors in a multi-dimensional space similar to the hierarchical clustering problems (Gauch & Whittaker, 1981; Gordon, 1987). One drawback doing so is that during the comparison of two algorithms with different features but identical predictions, evaluation results in different performances. Hierarchy-based measures consider structure-related information. Mainly, these measures consider the intersection of allocation paths, with a focus either on descendants (Ipeirotis et al., 2001) or ancestors (Kiritchenko et al., 2004).

Finally, we also find loss function-based evaluation measures in the literature (Cesa-Bianchi et al., 2005, 2006) that employ functions similar to the model training for evaluation purposes.

4 Hierarchical confusion matrix

The orientation on binary classification evaluation measures in the context of hierarchical classification problems can be observed in many works such as Sun and Lim (2001), Kiritchenko et al. (2005), Kiritchenko et al. (2006), Dumais and Chen (2000), Sokolova and Lapalme (2009). Especially the measures outlined in Kiritchenko et al., 2005, 2006 (hierarchical precision (hP), recall (hR), and F-score (hF)) enjoy increased attention in the ongoing discussion on hierarchical classification evaluation and are frequently used in studies, e.g. Borges et al. (2013). Their definition for SPL problems is shown below, including a degree of freedom meaning that hP is considered β times more important than hR:

\begin{aligned} hP = \frac{\sum_{i} | {\tilde{P}}_{i} \cap {\hat{P}}_{i} |}{\sum_{i} | {\hat{P}}_{i} |} hR = \frac{\sum_{i} | {\tilde{P}}_{i} \cap {\hat{P}}_{i} |}{\sum_{i} | {\tilde{P}}_{i} |} \\ hF = \frac{(1 + β^{2}) \cdot hP \cdot hR}{(β^{2} \cdot hP + hR)} \end{aligned}

(1)

Silla and Freitas (2011) recommend the use of hP, hR, and hF, as extensions to both, trees and DAGs, and to both, SPL and MPL problems. However, these measures come with the problem of defining the relative importance of hP and hR, and they do not consider true negatives. As Chicco and Jurman (2020) argue, more elaborate measures such as the MCC score represent more reliable statistical rates since they score high if and only if the performance was good in all four fields of the confusion matrix. Unmistakably, hP, hR, and hF strongly resemble to the binary classification measures PPV, TPR, and F1 based on the confusion matrix:

\begin{aligned} PPV = \frac{TP}{TP + FP} TPR = \frac{TP}{TP + FN} \\ F 1 = \frac{(1 + β^{2}) \cdot PPV \cdot TPR}{(β^{2} \cdot PPV + TPR)} \end{aligned}

(2)

Open questions arise on the definition of the confusion matrix for hierarchical classification problems. The observation for hP, hR, and hF motivates the proposition of a novel confusion matrix for hierarchical classification, that opens the door for the application of (all) binary classification measures (also those taking true negatives into account such as MCC score) while considering the peculiarities of hierarchical classification. The concept of the hierarchical confusion matrix defines the four fields of the confusion matrix with a connotation to decisions that are done along the prediction path(s) as follows:

True positives (⁠ ${TP}_{H}$ ⁠)—the number of nodes that were correctly labelled positively by the predicted path.
True negatives (⁠ ${TN}_{H}$ ⁠)—the number of nodes that were correctly labelled negatively by the predicted path.
False positives (⁠ ${FP}_{H}$ ⁠)—the number of nodes that were wrongly labelled positively by the predicted path.
False negatives (⁠ ${FN}_{H}$ ⁠)—the number of nodes that were wrongly labelled negatively by the predicted path.

After determining ${TP}_{H}$ ⁠, ${TN}_{H}$ ⁠, ${FP}_{H}$ ⁠, and ${FN}_{H}$ for a single $o_{d} \in Ω$ as a function of $P_{d}$ and $W_{d}$ ⁠, the values of hierarchical confusion can be calculated as the sum over the whole data set and used to determine the ‘flat’ evaluation measures of binary classification (see Table 2):

\begin{aligned} TP = \sum_{o_{d} \in Ω} {TP}_{H} {P_{d}, W_{d}} TN = \sum_{o_{d} \in Ω} {TN}_{H} {P_{d}, W_{d}} \\ FP = \sum_{o_{d} \in Ω} {FP}_{H} {P_{d}, W_{d}} FN = \sum_{o_{d} \in Ω} {FN}_{H} {P_{d}, W_{d}} \end{aligned}

(3)

In the following, we formalise this concept of hierarchical confusion beginning from its simplest form for (⁠ $Γ =$ ⁠, $Ψ = SPL$ ⁠, $Φ = MLNP$ ⁠) problems and then extend it to a generalised form that is applicable to all hierarchical classification problems. Finally, we discuss how the novel concept conforms desirable attributes for hierarchical classification measures.

4.1 Hierarchical confusion for $Γ = T$ ⁠, $Ψ = SPL$ ⁠, $Φ = MLNP$

According to the verbal definition of the confusion matrix elements in the prior section, we formalise the elements as follows. Since (⁠ $Γ = T$ ⁠, $Ψ = SPL$ ⁠, $Φ = MLNP$ ⁠) problems imply $W_{d} = {\tilde{P}}_{d}$ and $P_{d} = {\hat{P}}_{d}$ ⁠, we define the elements as function of a given ${\tilde{P}}_{d}$ and ${\hat{P}}_{d}$ instead of a given $W_{d}$ and $P_{d}$ ⁠. The true positive (⁠ ${TP}_{H}$ ⁠) equals the number of nodes in the intersection of ${\hat{P}}_{d}$ and ${\tilde{P}}_{d}$ ⁠. The true negative (⁠ ${TN}_{H}$ ⁠) is given by the union of the neighbours of each node on the common path² and the number of (correctly) negative predicted descendants of the common path’s leaf node. The number of (correctly) negative predicted descendants of the common path’s leaf node can be obtained by the set of descendants of the common path’s leaf node excluding the FN node of the true path ${\tilde{P}}_{d}$ and the false positive node of the predicted path ${\hat{P}}_{d}$ ⁠.

The false positive (⁠ ${FP}_{H}$ ⁠) equals the number of nodes of the predicted path that are not part of the common path. The false negative (⁠ ${FN}_{H}$ ⁠) is given by the number of nodes of the true path that are not part of the common path. An example for the evaluation of a prediction in the context of a hierarchical tree classification problem is outlined in Figure 4a. To summarise, the four aforementioned cases are formalised below:

\begin{aligned} {TP}_{H} {{\tilde{P}}_{d}, {\hat{P}}_{d}} & = | {\hat{P}}_{d} \cap {\tilde{P}}_{d} ∖ R | \\ {TN}_{H} {{\tilde{P}}_{d}, {\hat{P}}_{d}} & = | {⋃_{C_{i} \in {\hat{P}}_{d} \bar{\cap} {\tilde{P}}_{d}} N {C_{i}}} ∖ {\tilde{P}}_{d} | \\ + | D {Z {{\hat{P}}_{d} \bar{\cap} {\tilde{P}}_{d}}} ∖ {{\tilde{P}}_{d} \cup {\hat{P}}_{d}} | \\ {FP}_{H} {{\tilde{P}}_{d}, {\hat{P}}_{d}} & = | {\hat{P}}_{d} ∖ {\tilde{P}}_{d} | \\ {FN}_{H} {{\tilde{P}}_{d}, {\hat{P}}_{d}} & = | {\tilde{P}}_{d} ∖ {\hat{P}}_{d} | \end{aligned}

(4)

Figure 4.

Four exemplary predictions and their evaluation using the concept of hierarchical confusion. The true class(es) is (are) marked with a cross. The predictions are marked with numbered arrows and prediction paths are marked with an dotted resp. dashed lines. The tables right to the diagrams present the resulting numbers for the confusion matrix fields and the relevant nodes that were considered in the specific context. (a) Hierarchical (Γ = T, Ψ = SPL, Φ = MLNP) classification problem. (b) Hierarchical (Γ = DAG, Ψ = SPL, Φ = MLNP) classification problem. (c) Hierarchical (Γ = DAG, Ψ = MPL, Φ = MLNP) classification problem. (d) Hierarchical (Γ = DAG, Ψ = MPL, Φ = NMLNP) classification problem.

Open in new tab Download slide

4.2 Hierarchical confusion for $Ψ = SPL$ ⁠, $Φ = MLNP$

In order to extend any definition for ${TP}_{H}$ ⁠, ${TN}_{H}$ ⁠, ${FP}_{H}$ ⁠, and ${FN}_{H}$ from the context of a T to a DAG problem, we propose a two-step ‘benevolent’ approach:

Step 1: Select the path i (⁠ ${\tilde{P}}_{i} \in W_{d}$ ⁠) that resembles most to ${\hat{P}}_{d}$ (SPL problems require $| P_{d} | = 1$ ⁠):
$i = max_{\begin{matrix} {\tilde{P}}_{i} \in W_{d} \end{matrix}} | {\tilde{P}}_{i} \bar{\cap} {\hat{P}}_{d} |$
(5)
Step 2: Calculate ${TP}_{H}$ ⁠, ${TN}_{H}$ ⁠, ${FP}_{H}$ ⁠, and ${FN}_{H}$ like for a (⁠ $Γ = T$ ⁠, $Ψ = SPL$ ⁠, $Φ = MLNP$ ⁠) problem using ${\tilde{P}}_{i}$ and ${\hat{P}}_{d}$ ⁠.

We call this approach ‘benevolent’ as one ${\tilde{P}}_{i}$ of the possible true paths in $W_{d}$ is chosen to enable the achievement of the best possible performance for a given prediction ${\hat{P}}_{d}$ ⁠. A key assumption here is that the prediction is a path, while the true label is considered a node. Predictions of local and some global models output a path, while predictions of flat and some global models output a node only. For methods that predict a node, we use the benevolent approach to infer and score a path. This strategy gives maximal leeway to models predicting a node. An example for the evaluation of a (⁠ $Γ = DAG$ ⁠) problem is outlined in Figure 4b.

4.3 Hierarchical confusion for $Φ = MLNP$

The next step towards a generalised hierarchical confusion concept is the extension to MPL problems. Similar to the DAG extension, we can decompose the MPL problem to SPL problems. To do so, we first need to extend the notation. In MPL problems, $C {o_{d}}$ is a set of n true classes of $o_{d}$ in Γ. The classifier predicts m paths. The number of predicted paths m and the number of true classes n are not necessarily equal in applications, as n is considered to be unknown to the classifier a priori. $W_{d, j}$ is the set of all true allocation paths for $o_{d}$ connected to $C_{j} \in C {o_{d}}$ ⁠. Tree problems imply $| W_{d, j} | = 1$ ⁠. Based on this redefinition, the concept of confusion matrix can be extended to MPL problems with the following, ‘benevolent’ approach:

Step 1: Create a map M, that stores values $m = M [{\hat{P}}_{k}]$ for each ${\hat{P}}_{k} \in P_{d}$ ⁠.
Step 2: For each ${\hat{P}}_{k} \in P_{d}$ ⁠, determine the m values as follows:
- Step 2.1: Find the path i of any class j in $W_{d, j}$ that resembles most to ${\hat{P}}_{k}$ ⁠:
  $(i, j) = max_{\begin{matrix} {\tilde{P}}_{i} \in W_{d, j} \end{matrix}} | {\tilde{P}}_{i} \bar{\cap} {\hat{P}}_{k} |$
  (6)
- Step 2.2: Set m accordingly to the number of intersections between ${\hat{P}}_{k}$ and the found ${\tilde{P}}_{i} \in W_{d, j}$ ⁠:
  $m = | {\tilde{P}}_{i} \bar{\cap} {\hat{P}}_{k} | \Rightarrow M [{\hat{P}}_{k}]$
  (7)
Step 3: For each ${\hat{P}}_{k} \in P_{d}$ (in descending order of related m values stored in M) determine ${TP}_{H, k}$ ⁠, ${TN}_{H, k}$ ⁠, ${FP}_{H, k}$ ⁠, and ${FN}_{H, k}$ as follows:
- Step 3.1: If $W_{d, j} \neq \emptyset$ ⁠, go to Step 3.2. Otherwise, if $W_{d, j}$ is empty, if there are more predictions m than true classes n, consider the whole prediction as an FP:
  $\begin{aligned} {TP}_{H, k} = 0 {TN}_{H, k} = 0 \\ {FP}_{H, k} = | {\hat{P}}_{k} ∖ R | {FN}_{H, k} = 0 \end{aligned}$
  (8)
- Step 3.2: Find the path i of any class j in $W_{d, j}$ that resembles most to ${\hat{P}}_{k}$ ⁠:
  $(i, j) = max_{\begin{matrix} {\tilde{P}}_{i} \in W_{d, j} \end{matrix}} | {\tilde{P}}_{i} \bar{\cap} {\hat{P}}_{k} |$
  (9)
- Step 3.3: Calculate ${TP}_{H, k}$ ⁠, ${TN}_{H, k}$ ⁠, ${FP}_{H, k}$ ⁠, and ${FN}_{H, k}$ like a (⁠ $Γ = T$ ⁠, $Ψ = SPL$ ⁠, $Φ = MLNP$ ⁠) problem using ${\tilde{P}}_{i}$ and ${\hat{P}}_{k}$ ⁠.
- Step 3.4: Remove all paths in $W_{d, j}$ related to class j. Remove $C_{j}$ from the set of true labels $C {o_{d}}$ ⁠.
Step 4: Calculate the fields of the confusion matrix as sum of values determined by the single predictions ${\hat{P}}_{k} \in P_{d}$ ⁠. For all remaining classes $C_{j}$ that were not removed from $C {o_{d}}$ (if there are more true classes n than predictions m), consider the shortest paths inside as FNs:
$\begin{aligned} {TP}_{H} = \sum_{{\hat{P}}_{k} \in P_{d}} {TP}_{H, k} {TN}_{H} = \sum_{{\hat{P}}_{k} \in P_{d}} {TN}_{H, k} \\ {FP}_{H} = \sum_{{\hat{P}}_{k} \in P_{d}} {FP}_{H, k} \\ {FN}_{H} = \sum_{{\hat{P}}_{k} \in P_{d}} {FN}_{H, k} + \sum_{C_{j} \in C {o_{d}}} min_{\forall {\tilde{P}}_{i} \in W_{d, j}} | {\tilde{P}}_{i} ∖ R | \end{aligned}$
(10)

An example for the evaluation of a (⁠ $Γ = DAG$ ⁠, $Ψ = MPL$ ⁠) problem is depicted in Figure 4c.

4.4 Generalised hierarchical confusion (for all problems)

Finally, the discussion of NMLNP problems remains. As it turns out, the approach presented for MPL problems is applicable to NMLNP problems as well, as shown in Figure 4d. NMLNP adds complexity to the game by not requiring the prediction to the deepest level of the taxonomy. This allows for misclassification errors such as over- and under-specialisation. The discussed approach in the previous subsection is also suitable in this case, as the confusion matrix element definitions consider paths and their intersection.

4.5 Conformity with desirable attributes of evaluation

There is a set of important attributes that shall be ideally fulfilled by hierarchical evaluation schemes as demanded by, e.g. Costa et al. (2007) and Kiritchenko et al. (2006): (1) the independence of subjective, user-specified parameters; (2) metrics should give credit to partially correct classification; (3) distant errors and higher level errors should be penalised more heavily; (4) the applicability of the evaluation schemes to a broad range of problems.

First, the hierarchical confusion matrix by construction does not depend on subjective, user-specific parameters.

Second, let us consider $Ψ = SPL$ ⁠, $Φ = MLNP$ w.l.o.g.; let us assume ${\hat{P}}_{d}$ to be a partially correct prediction path, i.e. there exists a path ${\tilde{P}}_{i} \in W_{d}$ such that $| {\tilde{P}}_{i} \bar{\cap} {\hat{P}}_{d} | > 0$ ⁠. Thus, ${TP}_{H}, {TN}_{H} > 0$ ⁠, which implies the evaluation scheme gives credit to partially correct classification.

Third, let ${\hat{P}}_{1}$ and ${\hat{P}}_{2}$ be two paths in $Ψ = SPL$ ⁠, $Φ = MLNP$ ⁠, where the number of correctly selected classes inside ${\hat{P}}_{2}$ is higher than for ${\hat{P}}_{1}$ ⁠. Then ${TP}_{H}$ and ${TN}_{H}$ of ${\hat{P}}_{2}$ are greater than for ${\hat{P}}_{1}$ and vice versa for ${FP}_{H}$ and ${FN}_{H}$ ⁠. Therefore, the higher level error of ${\hat{P}}_{1}$ is punished more heavily. This holds also for $Ψ = MPL$ in case of multiple true paths, as the assignment of the true path for evaluation is benevolent. For NMLNP problems, over-specifications are less heavily punished than errors at higher levels as long as the depth of over-specification is lower. More precisely, in this case the number of ${TP}_{H}$ is maximal, ${FN}_{H}$ is zero, ${FP}_{H}$ corresponds to the level depth of over-specifications, and ${TN}_{H}$ is maximal minus the level depth of over-specifications. The case of under-specification follows the same lines.

To the best of the authors knowledge, the proposed concept of hierarchical confusion matrix is the first evaluation scheme that is by construction applicable to all types of hierarchical classification problems and thus conforms the fourth attribute.

5 Experiments

In this section, we apply our concept of the hierarchical confusion matrix onto three, real-world hierarchical classification problems with differing levels of complexity. Employing binary performance measures (see Table 2) using the hierarchical confusion matrix allows for the comparison of different classification models in the three experiments. Code and data for reproduction can be found on GitHub.

5.1 Transposon classification

Transposon classification is a task from bioinformatics and computational biology (Riehl et al., 2022). Transposons are DNA sequences that are able to move inside the genome using copy & paste and cut & paste transposition mechanisms and are suspected to play an important role as driver for genome evolution. Moreover, they can be classified into a tree structure depending on structural and protein features obtained by the sequences that determine their specific transposition mechanisms. The transposon classification task is a (⁠ $Γ = T$ ⁠, $Ψ = SPL$ ⁠, $Φ = MLNP$ ⁠) classification problem. Within this experiment, we use the combined data set mipsREdat-PGSB,³ RepBase v23.08,⁴ and the taxonomy proposed in Riehl et al. (2022).

Table 3 presents the hierarchical confusion matrix and evaluation measures based on it for different classification models that were compared in Riehl et al. (2022). Situations in which a clear ranking based on one measure is complicated, the consideration of multiple measures as presented in the table can facilitate model ranking. The results show that evaluation measures based on the novel hierarchical confusion matrix span the whole range, comparable to the related binary measures.

Table 3.

Open in new tab

Transposon classification experiment

Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	Rank
HC_GA	19,145	27,690	7,854	7,744	75.02%	70.91%	71.20%	71.05%	49.08%	2
HC_LGA	18,420	25,582	8,598	8,469	72.05%	68.18%	68.50%	68.34%	43.33%	3
NLLCPN	17,608	24,765	9,410	9,281	69.39%	65.17%	65.48%	65.33%	37.93%	4
RFSB	22,833	34,026	4,131	4,056	87.41%	84.68%	84.92%	84.80%	74.06%	1
TERL	366	630	18,776	26,523	2.15%	1.91%	1.36%	1.59%	$-$ 95.58%	6
TopDown	1,415	2,743	16,603	25,474	8.99%	7.85%	5.26%	6.30%	$-$ 81.49%	5

Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	Rank
HC_GA	19,145	27,690	7,854	7,744	75.02%	70.91%	71.20%	71.05%	49.08%	2
HC_LGA	18,420	25,582	8,598	8,469	72.05%	68.18%	68.50%	68.34%	43.33%	3
NLLCPN	17,608	24,765	9,410	9,281	69.39%	65.17%	65.48%	65.33%	37.93%	4
RFSB	22,833	34,026	4,131	4,056	87.41%	84.68%	84.92%	84.80%	74.06%	1
TERL	366	630	18,776	26,523	2.15%	1.91%	1.36%	1.59%	$-$ 95.58%	6
TopDown	1,415	2,743	16,603	25,474	8.99%	7.85%	5.26%	6.30%	$-$ 81.49%	5

Table 3.

Open in new tab

Transposon classification experiment

Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	Rank
HC_GA	19,145	27,690	7,854	7,744	75.02%	70.91%	71.20%	71.05%	49.08%	2
HC_LGA	18,420	25,582	8,598	8,469	72.05%	68.18%	68.50%	68.34%	43.33%	3
NLLCPN	17,608	24,765	9,410	9,281	69.39%	65.17%	65.48%	65.33%	37.93%	4
RFSB	22,833	34,026	4,131	4,056	87.41%	84.68%	84.92%	84.80%	74.06%	1
TERL	366	630	18,776	26,523	2.15%	1.91%	1.36%	1.59%	$-$ 95.58%	6
TopDown	1,415	2,743	16,603	25,474	8.99%	7.85%	5.26%	6.30%	$-$ 81.49%	5

Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	Rank
HC_GA	19,145	27,690	7,854	7,744	75.02%	70.91%	71.20%	71.05%	49.08%	2
HC_LGA	18,420	25,582	8,598	8,469	72.05%	68.18%	68.50%	68.34%	43.33%	3
NLLCPN	17,608	24,765	9,410	9,281	69.39%	65.17%	65.48%	65.33%	37.93%	4
RFSB	22,833	34,026	4,131	4,056	87.41%	84.68%	84.92%	84.80%	74.06%	1
TERL	366	630	18,776	26,523	2.15%	1.91%	1.36%	1.59%	$-$ 95.58%	6
TopDown	1,415	2,743	16,603	25,474	8.99%	7.85%	5.26%	6.30%	$-$ 81.49%	5

5.2 GermEval 2019 competition, Task 1A

The GermEval 2019 competition (Remus et al., 2019) was a competition hold by the Language Technology Group of Universität Hamburg, Germany. In total, 18 teams participated to the competition and worked on multiple tasks. Here, we consider Task 1A and 1B, related to hierarchical classification of blurb texts in German language—short descriptions of books. The data set covers 20,784 books and a 3-level DAG structure with 343 categories. All participant submissions, data set, and taxonomy were downloaded from the competition’s website.⁵

Task 1A is a (⁠ $Γ = (1 - level) T$ ⁠, $Ψ = MPL$ ⁠, $Φ = MLNP$ ⁠) classification problem. The challenge exercise was to classify one or multiple top-level categories for the blurbs that represent literature genres. Table 4 summarises the confusion matrix, evaluation measures calculated by us [hierarchical confusion (HC)] and the reported performances by the organisers of the competition (GE) for each participant and two baseline models (Baseline, Contender). The organisers employed micro-averaged recall, precision, and F1-score (Silla & Freitas, 2011; Sorower, 2010) as well as subset accuracy (Sorower, 2010) for evaluation purposes. This table allows for the comparison of the reported evaluation measures. Also, it enables the comparison of the different resulting rankings based on the F1-score. Additionally, the MCC score based on the hierarchical confusion matrix is reported.

Table 4.

Open in new tab

GermEval 2019 competition, Task 1A experiment

	Hierarchical confusion matrix performance evaluation (HC)									Reported performances (Remus et al., 2019)
Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	ACC	PPV	TPR	F1	Rank (HC)	Rank (Remus et al., 2019)
Averbis	3,613	28,863	584	857	95.75%	86.09%	80.83%	83.37%	80.99%	79.00%	86.00%	81.00%	83.00%	6	6
Comtravo-DS	3,690	29,517	1,078	780	94.70%	77.39%	82.55%	79.89%	76.89%	72.00%	81.00%	83.00%	82.00%	7	8
DFKI-SLT	3,787	28,933	536	683	96.41%	87.60%	84.72%	86.14%	84.09%	82.00%	88.00%	85.00%	86.00%	3	3
EricssonResearch	3,769	28,891	455	701	96.58%	89.23%	84.32%	86.70%	84.79%	84.00%	89.00%	84.00%	87.00%	1	1
Fosil-hsmw	3,719	29,003	694	751	95.77%	84.27%	83.20%	83.73%	81.30%	79.00%	84.00%	83.00%	84.00%	5	5
HSHL	3,647	28,877	777	823	95.31%	82.44%	81.59%	82.01%	79.32%	77.00%	82.00%	82.00%	82.00%	8	7
HUIU	3,608	28,808	867	862	94.94%	80.63%	80.72%	80.67%	77.76%	76.00%	81.00%	81.00%	81.00%	9	9
Raghavan	3,747	28,983	522	723	96.34%	87.77%	83.83%	85.75%	83.68%	83.00%	88.00%	84.00%	86.00%	4	4
TwistBytes	3,852	29,551	752	618	96.06%	83.67%	86.17%	84.90%	82.65%	79.00%	87.00%	86.00%	86.00%	2	2
Baseline	3,344	29,084	544	1,126	95.10%	86.01%	74.81%	80.02%	77.49%	71.00%	86.00%	75.00%	80.00%
Contender	3,809	29,835	2,301	661	91.91%	62.34%	85.21%	72.00%	68.53%	74.00%	82.00%	85.00%	84.00%

	Hierarchical confusion matrix performance evaluation (HC)									Reported performances (Remus et al., 2019)
Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	ACC	PPV	TPR	F1	Rank (HC)	Rank (Remus et al., 2019)
Averbis	3,613	28,863	584	857	95.75%	86.09%	80.83%	83.37%	80.99%	79.00%	86.00%	81.00%	83.00%	6	6
Comtravo-DS	3,690	29,517	1,078	780	94.70%	77.39%	82.55%	79.89%	76.89%	72.00%	81.00%	83.00%	82.00%	7	8
DFKI-SLT	3,787	28,933	536	683	96.41%	87.60%	84.72%	86.14%	84.09%	82.00%	88.00%	85.00%	86.00%	3	3
EricssonResearch	3,769	28,891	455	701	96.58%	89.23%	84.32%	86.70%	84.79%	84.00%	89.00%	84.00%	87.00%	1	1
Fosil-hsmw	3,719	29,003	694	751	95.77%	84.27%	83.20%	83.73%	81.30%	79.00%	84.00%	83.00%	84.00%	5	5
HSHL	3,647	28,877	777	823	95.31%	82.44%	81.59%	82.01%	79.32%	77.00%	82.00%	82.00%	82.00%	8	7
HUIU	3,608	28,808	867	862	94.94%	80.63%	80.72%	80.67%	77.76%	76.00%	81.00%	81.00%	81.00%	9	9
Raghavan	3,747	28,983	522	723	96.34%	87.77%	83.83%	85.75%	83.68%	83.00%	88.00%	84.00%	86.00%	4	4
TwistBytes	3,852	29,551	752	618	96.06%	83.67%	86.17%	84.90%	82.65%	79.00%	87.00%	86.00%	86.00%	2	2
Baseline	3,344	29,084	544	1,126	95.10%	86.01%	74.81%	80.02%	77.49%	71.00%	86.00%	75.00%	80.00%
Contender	3,809	29,835	2,301	661	91.91%	62.34%	85.21%	72.00%	68.53%	74.00%	82.00%	85.00%	84.00%

Table 4.

Open in new tab

GermEval 2019 competition, Task 1A experiment

	Hierarchical confusion matrix performance evaluation (HC)									Reported performances (Remus et al., 2019)
Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	ACC	PPV	TPR	F1	Rank (HC)	Rank (Remus et al., 2019)
Averbis	3,613	28,863	584	857	95.75%	86.09%	80.83%	83.37%	80.99%	79.00%	86.00%	81.00%	83.00%	6	6
Comtravo-DS	3,690	29,517	1,078	780	94.70%	77.39%	82.55%	79.89%	76.89%	72.00%	81.00%	83.00%	82.00%	7	8
DFKI-SLT	3,787	28,933	536	683	96.41%	87.60%	84.72%	86.14%	84.09%	82.00%	88.00%	85.00%	86.00%	3	3
EricssonResearch	3,769	28,891	455	701	96.58%	89.23%	84.32%	86.70%	84.79%	84.00%	89.00%	84.00%	87.00%	1	1
Fosil-hsmw	3,719	29,003	694	751	95.77%	84.27%	83.20%	83.73%	81.30%	79.00%	84.00%	83.00%	84.00%	5	5
HSHL	3,647	28,877	777	823	95.31%	82.44%	81.59%	82.01%	79.32%	77.00%	82.00%	82.00%	82.00%	8	7
HUIU	3,608	28,808	867	862	94.94%	80.63%	80.72%	80.67%	77.76%	76.00%	81.00%	81.00%	81.00%	9	9
Raghavan	3,747	28,983	522	723	96.34%	87.77%	83.83%	85.75%	83.68%	83.00%	88.00%	84.00%	86.00%	4	4
TwistBytes	3,852	29,551	752	618	96.06%	83.67%	86.17%	84.90%	82.65%	79.00%	87.00%	86.00%	86.00%	2	2
Baseline	3,344	29,084	544	1,126	95.10%	86.01%	74.81%	80.02%	77.49%	71.00%	86.00%	75.00%	80.00%
Contender	3,809	29,835	2,301	661	91.91%	62.34%	85.21%	72.00%	68.53%	74.00%	82.00%	85.00%	84.00%

	Hierarchical confusion matrix performance evaluation (HC)									Reported performances (Remus et al., 2019)
Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	ACC	PPV	TPR	F1	Rank (HC)	Rank (Remus et al., 2019)
Averbis	3,613	28,863	584	857	95.75%	86.09%	80.83%	83.37%	80.99%	79.00%	86.00%	81.00%	83.00%	6	6
Comtravo-DS	3,690	29,517	1,078	780	94.70%	77.39%	82.55%	79.89%	76.89%	72.00%	81.00%	83.00%	82.00%	7	8
DFKI-SLT	3,787	28,933	536	683	96.41%	87.60%	84.72%	86.14%	84.09%	82.00%	88.00%	85.00%	86.00%	3	3
EricssonResearch	3,769	28,891	455	701	96.58%	89.23%	84.32%	86.70%	84.79%	84.00%	89.00%	84.00%	87.00%	1	1
Fosil-hsmw	3,719	29,003	694	751	95.77%	84.27%	83.20%	83.73%	81.30%	79.00%	84.00%	83.00%	84.00%	5	5
HSHL	3,647	28,877	777	823	95.31%	82.44%	81.59%	82.01%	79.32%	77.00%	82.00%	82.00%	82.00%	8	7
HUIU	3,608	28,808	867	862	94.94%	80.63%	80.72%	80.67%	77.76%	76.00%	81.00%	81.00%	81.00%	9	9
Raghavan	3,747	28,983	522	723	96.34%	87.77%	83.83%	85.75%	83.68%	83.00%	88.00%	84.00%	86.00%	4	4
TwistBytes	3,852	29,551	752	618	96.06%	83.67%	86.17%	84.90%	82.65%	79.00%	87.00%	86.00%	86.00%	2	2
Baseline	3,344	29,084	544	1,126	95.10%	86.01%	74.81%	80.02%	77.49%	71.00%	86.00%	75.00%	80.00%
Contender	3,809	29,835	2,301	661	91.91%	62.34%	85.21%	72.00%	68.53%	74.00%	82.00%	85.00%	84.00%

First of all, the evaluation results and rankings for both evaluation approaches resemble strongly. The rankings differ by only one pair (Comtravo-DS and HSHL). Next, the differences in precision, recall, and F1-score between the hierarchical confusion matrix and micro-averaged measures are relatively small (on average 2.41%, 0.10%, and 1.44%). However, larger differences become obvious with regard to accuracy (on average 17.53%). Differences in accuracy could be explained by the different perception of true negatives in our proposed confusion matrix and subset accuracy.

5.3 GermEval 2019 competition, Task 1B

Task 1B is a (⁠ $Γ = DAG$ ⁠, $Ψ = MPL$ ⁠, $Φ = NMLNP$ ⁠) classification problem. The challenge exercise was to classify one or multiple categories from the DAG for the blurbs, not necessarily to the deepest level. Table 5 summarises the results similar to the task before. Again, resulting rankings are identical except for one pair (EricssonResearch and TwistBytes). Differences in accuracy, precision, recall, and F1-score (67.44%, 2.44%, 5.20%, and 4.25%) are higher compared to the previous task. Again, accuracy differs most. As Task 1B considers a more complex structure (DAG vs. T), differences between the evaluation measures become more obvious.

Table 5.

Open in new tab

GermEval 2019 competition, Task 1B experiment

	Hierarchical confusion matrix performance evaluation (HC)									Reported performances (Remus et al., 2019)
Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	ACC	PPV	TPR	F1	Rank (HC)	Rank (Remus et al., 2019)
Averbis	8,552	125,951	4,683	6,558	92.29%	64.62%	56.60%	60.34%	56.24%	27.00%	68.00%	61.00%	64.00%	3	3
Comtravo-DS	7,187	111,871	3,376	7,923	91.33%	68.04%	47.56%	55.99%	52.36%	19.00%	70.00%	53.00%	60.00%	6	6
DKFI-SLT	7,049	112,567	2,256	8,061	92.06%	75.75%	46.65%	57.74%	55.56%	21.00%	78.00%	52.00%	62.00%	4	4
EricssonResearch	8,498	119,546	3,208	6,612	92.88%	72.60%	56.24%	63.38%	60.10%	38.00%	74.00%	62.00%	67.00%	1	2
HSHL	7,167	106,488	3,025	7,943	91.20%	70.32%	47.43%	56.65%	53.21%	26.00%	72.00%	54.00%	62.00%	5	5
TwistBytes	9,174	130,886	4,747	5,936	92.91%	65.90%	60.71%	63.20%	59.35%	25.00%	71.00%	65.00%	68.00%	2	1
Baseline	5,183	97,128	964	9,927	90.38%	84.32%	34.30%	48.77%	50.00%	15.00%	85.00%	39.00%	53.00%
Contender	7,693	118,017	2,854	7,417	92.45%	72.94%	50.91%	59.97%	57.05%	25.00%	76.00%	56.00%	64.00%

	Hierarchical confusion matrix performance evaluation (HC)									Reported performances (Remus et al., 2019)
Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	ACC	PPV	TPR	F1	Rank (HC)	Rank (Remus et al., 2019)
Averbis	8,552	125,951	4,683	6,558	92.29%	64.62%	56.60%	60.34%	56.24%	27.00%	68.00%	61.00%	64.00%	3	3
Comtravo-DS	7,187	111,871	3,376	7,923	91.33%	68.04%	47.56%	55.99%	52.36%	19.00%	70.00%	53.00%	60.00%	6	6
DKFI-SLT	7,049	112,567	2,256	8,061	92.06%	75.75%	46.65%	57.74%	55.56%	21.00%	78.00%	52.00%	62.00%	4	4
EricssonResearch	8,498	119,546	3,208	6,612	92.88%	72.60%	56.24%	63.38%	60.10%	38.00%	74.00%	62.00%	67.00%	1	2
HSHL	7,167	106,488	3,025	7,943	91.20%	70.32%	47.43%	56.65%	53.21%	26.00%	72.00%	54.00%	62.00%	5	5
TwistBytes	9,174	130,886	4,747	5,936	92.91%	65.90%	60.71%	63.20%	59.35%	25.00%	71.00%	65.00%	68.00%	2	1
Baseline	5,183	97,128	964	9,927	90.38%	84.32%	34.30%	48.77%	50.00%	15.00%	85.00%	39.00%	53.00%
Contender	7,693	118,017	2,854	7,417	92.45%	72.94%	50.91%	59.97%	57.05%	25.00%	76.00%	56.00%	64.00%

Table 5.

Open in new tab

GermEval 2019 competition, Task 1B experiment

	Hierarchical confusion matrix performance evaluation (HC)									Reported performances (Remus et al., 2019)
Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	ACC	PPV	TPR	F1	Rank (HC)	Rank (Remus et al., 2019)
Averbis	8,552	125,951	4,683	6,558	92.29%	64.62%	56.60%	60.34%	56.24%	27.00%	68.00%	61.00%	64.00%	3	3
Comtravo-DS	7,187	111,871	3,376	7,923	91.33%	68.04%	47.56%	55.99%	52.36%	19.00%	70.00%	53.00%	60.00%	6	6
DKFI-SLT	7,049	112,567	2,256	8,061	92.06%	75.75%	46.65%	57.74%	55.56%	21.00%	78.00%	52.00%	62.00%	4	4
EricssonResearch	8,498	119,546	3,208	6,612	92.88%	72.60%	56.24%	63.38%	60.10%	38.00%	74.00%	62.00%	67.00%	1	2
HSHL	7,167	106,488	3,025	7,943	91.20%	70.32%	47.43%	56.65%	53.21%	26.00%	72.00%	54.00%	62.00%	5	5
TwistBytes	9,174	130,886	4,747	5,936	92.91%	65.90%	60.71%	63.20%	59.35%	25.00%	71.00%	65.00%	68.00%	2	1
Baseline	5,183	97,128	964	9,927	90.38%	84.32%	34.30%	48.77%	50.00%	15.00%	85.00%	39.00%	53.00%
Contender	7,693	118,017	2,854	7,417	92.45%	72.94%	50.91%	59.97%	57.05%	25.00%	76.00%	56.00%	64.00%

	Hierarchical confusion matrix performance evaluation (HC)									Reported performances (Remus et al., 2019)
Model	TP	TN	FP	FN	ACC	PPV	TPR	F1	MCC	ACC	PPV	TPR	F1	Rank (HC)	Rank (Remus et al., 2019)
Averbis	8,552	125,951	4,683	6,558	92.29%	64.62%	56.60%	60.34%	56.24%	27.00%	68.00%	61.00%	64.00%	3	3
Comtravo-DS	7,187	111,871	3,376	7,923	91.33%	68.04%	47.56%	55.99%	52.36%	19.00%	70.00%	53.00%	60.00%	6	6
DKFI-SLT	7,049	112,567	2,256	8,061	92.06%	75.75%	46.65%	57.74%	55.56%	21.00%	78.00%	52.00%	62.00%	4	4
EricssonResearch	8,498	119,546	3,208	6,612	92.88%	72.60%	56.24%	63.38%	60.10%	38.00%	74.00%	62.00%	67.00%	1	2
HSHL	7,167	106,488	3,025	7,943	91.20%	70.32%	47.43%	56.65%	53.21%	26.00%	72.00%	54.00%	62.00%	5	5
TwistBytes	9,174	130,886	4,747	5,936	92.91%	65.90%	60.71%	63.20%	59.35%	25.00%	71.00%	65.00%	68.00%	2	1
Baseline	5,183	97,128	964	9,927	90.38%	84.32%	34.30%	48.77%	50.00%	15.00%	85.00%	39.00%	53.00%
Contender	7,693	118,017	2,854	7,417	92.45%	72.94%	50.91%	59.97%	57.05%	25.00%	76.00%	56.00%	64.00%

5.4 Discussion

Based on the results of Task 1A and Task 1B in Tables 4 and 5, we observe that the range of our measures is consistently lower. Therefore, we conclude that our measures are more robust. Similarly, Figure 5 shows the differences between the proposed and reported metrics for the two tasks of the GermEval 2019 Competition. The differences in accuracy are much higher than for the other metrics. Also, the differences in Task 1A (⁠ $Γ = (1 - level) T$ ⁠, $Ψ = MPL$ ⁠, $Φ = MLNP$ ⁠) are lower than for Task 1B (⁠ $Γ = DAG$ ⁠, $Ψ = MPL$ ⁠, $Φ = NMLNP$ ⁠). Our accuracy is consistently higher, while for other metrics, this is not the case. For Taks 1B, our metrics are consistently lower, while there is no such trend for Task 1A. The higher accuracy of the hierarchical confusion matrix and the (mostly) lower PPV, TPR, and F1 can be explained by the larger numbers of TN due to the proposed benevolent approach.

Figure 5.

Differences in metrics (proposed—reported). Differences of the different evaluation metrics ACC, PPV, TPR, and F1 between the proposed hierarchical confusion matrix and the reported metrics from the GermEval2019 competition for Task 1A and Task 1B.

Open in new tab Download slide

Moreover, we determined the Kendall’s rank correlations (Kendall, 1938; Kosmopoulos et al., 2015) to assess the similarity of rankings for the two different classification tasks 1A and 1B of the GermEval competition, depending on different measures (ACC, PPV, TPR, F1) and evaluation schemes [HC = hierarchical confusion vs. GE = schemes used in Remus et al. (2019)]. We also compare HC, which is horizontal, to GE which is vertical, and this is termed as across (Table 6). Kendall’s rank correlations range from $- 1$ (opposite ranking) to +1 (exact same ranking) and quantifies the similarity of rankings. Overall, we observe a large disagreement between rankings based on different measures when compared within the same schemes as well across the schemes for both tasks. In Task 1A, ACC correlates perfectly with F1 for HC and it has a high correlation with PPV for GE, while it does not correlate with the other measures. In Task 1B, there is almost no strong correlation for HC, and a strong correlation for ACC with PPV and for TPR with F1 for GE. The comparison across HC and GE reveals no significant relationship between the evaluation schemes and measures. Taken together, this suggests that the different metrics are largely complementary and that they will capture different aspects of the comparison.

Table 6.

Open in new tab

Kendall’s rank correlation matrix of rankings

HC	ACC	PPV	TPR	F1	GE	ACC	PPV	TPR	F1	Across	ACC	PPV	TPR	F1
Task 1A
ACC	1.00	$-$ 0.33	0.11	1.00	ACC	1.00	0.61	0.06	0.06	ACC	0.50	0.22	0.44	0.00
PPV		1.00	$-$ 0.55	$-$ 0.33	PPV		1.00	0.00	$-$ 0.11	PPV	0.17	0.22	$-$ 0.11	$-$ 0.11
TPR			1.00	0.11	TPR			1.00	0.00	TPR	0.06	$-$ 0.11	0.11	0.00
F1				1.00	F1				1.00	F1	0.50	0.22	0.44	0.00
Task 1B
ACC	1.00	$-$ 0.47	0.07	0.20	ACC	1.00	0.60	$-$ 0.07	0.07	ACC	$-$ 0.20	$-$ 0.33	0.60	0.47
PPV		1.00	0.20	0.33	PPV		1.00	0.07	0.20	PPV	$-$ 0.07	0.07	$-$ 0.33	$-$ 0.20
TPR			1.00	$-$ 0.20	TPR			1.00	0.87	TPR	0.20	$-$ 0.20	$-$ 0.07	0.07
F1				1.00	F1				1.00	F1	$-$ 0.47	$-$ 0.07	0.33	0.47

HC	ACC	PPV	TPR	F1	GE	ACC	PPV	TPR	F1	Across	ACC	PPV	TPR	F1
Task 1A
ACC	1.00	$-$ 0.33	0.11	1.00	ACC	1.00	0.61	0.06	0.06	ACC	0.50	0.22	0.44	0.00
PPV		1.00	$-$ 0.55	$-$ 0.33	PPV		1.00	0.00	$-$ 0.11	PPV	0.17	0.22	$-$ 0.11	$-$ 0.11
TPR			1.00	0.11	TPR			1.00	0.00	TPR	0.06	$-$ 0.11	0.11	0.00
F1				1.00	F1				1.00	F1	0.50	0.22	0.44	0.00
Task 1B
ACC	1.00	$-$ 0.47	0.07	0.20	ACC	1.00	0.60	$-$ 0.07	0.07	ACC	$-$ 0.20	$-$ 0.33	0.60	0.47
PPV		1.00	0.20	0.33	PPV		1.00	0.07	0.20	PPV	$-$ 0.07	0.07	$-$ 0.33	$-$ 0.20
TPR			1.00	$-$ 0.20	TPR			1.00	0.87	TPR	0.20	$-$ 0.20	$-$ 0.07	0.07
F1				1.00	F1				1.00	F1	$-$ 0.47	$-$ 0.07	0.33	0.47

Table 6.

Open in new tab

Kendall’s rank correlation matrix of rankings

HC	ACC	PPV	TPR	F1	GE	ACC	PPV	TPR	F1	Across	ACC	PPV	TPR	F1
Task 1A
ACC	1.00	$-$ 0.33	0.11	1.00	ACC	1.00	0.61	0.06	0.06	ACC	0.50	0.22	0.44	0.00
PPV		1.00	$-$ 0.55	$-$ 0.33	PPV		1.00	0.00	$-$ 0.11	PPV	0.17	0.22	$-$ 0.11	$-$ 0.11
TPR			1.00	0.11	TPR			1.00	0.00	TPR	0.06	$-$ 0.11	0.11	0.00
F1				1.00	F1				1.00	F1	0.50	0.22	0.44	0.00
Task 1B
ACC	1.00	$-$ 0.47	0.07	0.20	ACC	1.00	0.60	$-$ 0.07	0.07	ACC	$-$ 0.20	$-$ 0.33	0.60	0.47
PPV		1.00	0.20	0.33	PPV		1.00	0.07	0.20	PPV	$-$ 0.07	0.07	$-$ 0.33	$-$ 0.20
TPR			1.00	$-$ 0.20	TPR			1.00	0.87	TPR	0.20	$-$ 0.20	$-$ 0.07	0.07
F1				1.00	F1				1.00	F1	$-$ 0.47	$-$ 0.07	0.33	0.47

HC	ACC	PPV	TPR	F1	GE	ACC	PPV	TPR	F1	Across	ACC	PPV	TPR	F1
Task 1A
ACC	1.00	$-$ 0.33	0.11	1.00	ACC	1.00	0.61	0.06	0.06	ACC	0.50	0.22	0.44	0.00
PPV		1.00	$-$ 0.55	$-$ 0.33	PPV		1.00	0.00	$-$ 0.11	PPV	0.17	0.22	$-$ 0.11	$-$ 0.11
TPR			1.00	0.11	TPR			1.00	0.00	TPR	0.06	$-$ 0.11	0.11	0.00
F1				1.00	F1				1.00	F1	0.50	0.22	0.44	0.00
Task 1B
ACC	1.00	$-$ 0.47	0.07	0.20	ACC	1.00	0.60	$-$ 0.07	0.07	ACC	$-$ 0.20	$-$ 0.33	0.60	0.47
PPV		1.00	0.20	0.33	PPV		1.00	0.07	0.20	PPV	$-$ 0.07	0.07	$-$ 0.33	$-$ 0.20
TPR			1.00	$-$ 0.20	TPR			1.00	0.87	TPR	0.20	$-$ 0.20	$-$ 0.07	0.07
F1				1.00	F1				1.00	F1	$-$ 0.47	$-$ 0.07	0.33	0.47

One of the disadvantages of complex evaluation measures is the complexity and related run-time for applications. In general, the run-time of the proposed approach is linear to the number of predictions, and non-linear to the complexity of the graph measured, e.g. in number of nodes and edges. To quantitatively assess the performance of the implementation of our hierarchical confusion matrix approach, we present run-times (on a normal, Intel i5 8th Gen state-of-the-art desktop computer) and complexities of problems in Table 7. The table summarises the run-times, number of records processed, complexity of hierarchies measured in number of edges, as well as run-time per record (Run-time 1) and run-time per record per edges (Run-time 2). As the run-times show, the hierarchical confusion matrix is a computationally efficient measure. As Run-time 2 indicates, the run-time divided by records and number of edges indicates a non-linear relationship between complexity and run-time.

Table 7.

Open in new tab

Run-time analysis of hierarchical confusion matrix

Exercise	Run-time [s]	#Records	#Edges	Run-time 1 [s]	Run-time 2 [s]
GermEval 2019 Task 1A	6.09	91,476	8	6.66 E-05	8.32 E-06
Transposon Classification	8.31	54,036	18	1.54 E-04	8.54 E-06
GermEval 2019 Task 1B	297.16	91,476	343	3.25 E-03	9.47 E-06

Exercise	Run-time [s]	#Records	#Edges	Run-time 1 [s]	Run-time 2 [s]
GermEval 2019 Task 1A	6.09	91,476	8	6.66 E-05	8.32 E-06
Transposon Classification	8.31	54,036	18	1.54 E-04	8.54 E-06
GermEval 2019 Task 1B	297.16	91,476	343	3.25 E-03	9.47 E-06

Table 7.

Open in new tab

Run-time analysis of hierarchical confusion matrix

Exercise	Run-time [s]	#Records	#Edges	Run-time 1 [s]	Run-time 2 [s]
GermEval 2019 Task 1A	6.09	91,476	8	6.66 E-05	8.32 E-06
Transposon Classification	8.31	54,036	18	1.54 E-04	8.54 E-06
GermEval 2019 Task 1B	297.16	91,476	343	3.25 E-03	9.47 E-06

Exercise	Run-time [s]	#Records	#Edges	Run-time 1 [s]	Run-time 2 [s]
GermEval 2019 Task 1A	6.09	91,476	8	6.66 E-05	8.32 E-06
Transposon Classification	8.31	54,036	18	1.54 E-04	8.54 E-06
GermEval 2019 Task 1B	297.16	91,476	343	3.25 E-03	9.47 E-06

6 Conclusions

In this work, we proposed the novel concept of hierarchical confusion matrix, which has been extended to a generalised form and we proved its applicability to all types of hierarchical classification problems including DAG structures, MPL, and NMLNP problems. Moreover, we showed how it conforms various desirable attributes of hierarchical evaluation schemes. The evaluation concept was demonstrated on real-world applications, benchmarking hierarchical classification problems. The evaluation results were compared to alternative evaluation measures present in literature, revealing and supporting the reasonability of the proposed concept. The novel concept of hierarchical confusion matrix allows for the evaluation of hierarchical classification problems using measures based on binary classification problems. Due to its holistic applicability, reasonability and simplicity, it will facilitate future research in hierarchical classification benchmarking studies and potentially sets the new standard for hierarchical classification evaluation. Future studies could focus on assessing consistency of classification metrics and on comparing these. Follow-up studies could apply classification metrics on data sets using different hierarchies and examine whether similar rankings result out of these applications. This could indicate whether these metrics are robust and consistent.

Acknowledgments

The authors thank the referees, the Associate Editor, and the Joint Editor of the journal for thoughtful comments and suggestions that greatly improved the quality of the manuscript. We also thank the Evergrande Center for Immunologic Diseases for their Startup funds support.

Data availability

The implementation of the hierarchical confusion matrix in Python programming language, and the examples and experiments of this study are available on GitHub (https://github.com/DerKevinRiehl/HierarchicalConfusionMatrix).

References

Blockeel

Bruynooghe

Džeroski

Ramon

, &

Struyf

(

2002

Hierarchical multi-classification

. In

Workshop Notes of the KDD’02 Workshop on Multi-Relational Data Mining

(pp.

–

). KU Leuven. https://lirias.kuleuven.be/1653518?limo=0

Borges

H. B.

Silla

C. N. Jr.

, &

Nievola

J. C.

(

2013

An evaluation of global-model hierarchical classification algorithms for hierarchical classification problems with single path of labels

Computers & Mathematics with Applications

(

1991

–

2002

. https://doi.org/10.1016/j.camwa.2013.06.027

Google Scholar

Crossref

WorldCat

Cesa-Bianchi

Gentile

Tironi

, &

Zaniboni

(

2005

Incremental algorithms for hierarchical classification

Advances in Neural Information Processing Systems

233

–

240

. https://proceedings.neurips.cc/paper/2004/hash/78f7d96ea21ccae89a7b581295f34135-Abstract.html

Google Scholar

OpenURL Placeholder Text

WorldCat

Cesa-Bianchi

Gentile

, &

Zaniboni

(

2006

Hierarchical classification: Combining bayes with svm. In Proceedings of the 23rd International Conference on Machine Learning

(pp.

177

–

184

). https://doi.org/10.1145/1143844.1143867

Chicco

, &

Jurman

(

2020

The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation

BMC Genomics

(

–

. https://doi.org/10.1186/s12864-019-6413-7

Google Scholar

Crossref

WorldCat

Costa

Lorena

Carvalho

A. C. P. L. F.

, &

Freitas

(

2007

A review of performance evaluation measures for hierarchical classifiers. In Evaluation Methods for Machine Learning II: Papers from the AAAI-2007 Workshop

(pp.

–

). https://kar.kent.ac.uk/14562/

Dumais

, &

Chen

(

2000

Hierarchical classification of web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

(pp.

256

–

263

). https://doi.org/10.1145/345508.345593

Fawcett

(

2006

An introduction to ROC analysis

Pattern Recognition Letters

(

861

–

874

. https://doi.org/10.1016/j.patrec.2005.10.010

Google Scholar

Crossref

WorldCat

Freitas

, &

Carvalho

(

2007

A tutorial on hierarchical classification with applications in bioinformatics

Research and Trends in Data Mining Technologies and Applications

175

–

208

. https://doi.org/10.4018/978-1-59904-271-8.ch007

Google Scholar

OpenURL Placeholder Text

WorldCat

Gauch

H. G. Jr.

, &

Whittaker

R. H.

(

1981

Hierarchical classification of community data

The Journal of Ecology

(

537

–

557

. https://doi.org/10.2307/2259682

Google Scholar

Crossref

WorldCat

Gordon

A. D.

(

1987

A review of hierarchical classification

Journal of the Royal Statistical Society: Series A (General)

150

(

119

–

137

. https://doi.org/10.2307/2981629

Google Scholar

Crossref

WorldCat

Holden

, &

Freitas

A. A.

(

2006

Hierarchical classification of G-protein-coupled receptors with a PSO/ACO algorithm. In Proceedings of the IEEE Swarm Intelligence Symposium (SIS’06) (pp. 77–84). IEEE Press

Ipeirotis

P. G.

Gravano

, &

Sahami

(

2001

Probe, count, and classify: Categorizing hidden web databases. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (pp. 67–78)

. https://doi.org/10.1145/375663.375671

Kendall

M. G.

(

1938

A new measure of rank correlation

Biometrika

(

1–2

–

. https://doi.org/10.1093/biomet/30.1-2.81

Google Scholar

OpenURL Placeholder Text

WorldCat

Kiritchenko

Matwin

, &

Famili

A. F.

(

2004

Hierarchical text categorization as a tool of associating genes with gene ontology codes. In European Workshop on Data Mining and Text Mining in Bioinformatics (pp. 30–34)

Kiritchenko

Matwin

, &

Famili

A. F.

(

2005

Functional annotation of genes using hierarchical text categorization. In Proceedings of the ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics

. https://www.researchgate.net/profile/Svetlana-Kiritchenko/publication/44046343_Functional_Annotation_of_Genes_Using_Hierarchical_Text_Categorization/links/09e4150b3962ea0f9c000000/Functional-Annotationof- Genes-Using-Hierarchical-Text-Categorization.pdf

Kiritchenko

Matwin

Nock

, &

Famili

A. F.

(

2006

Learning and evaluation in the presence of class hierarchies: Application to text categorization. In Conference of the Canadian Society for Computational Studies of Intelligence (pp. 395–406). Springer

Koller

, &

Sahami

(

1997

Hierarchically classifying documents using very few words (Technical Report). Stanford InfoLab

Kosmopoulos

Partalas

Gaussier

Paliouras

, &

Androutsopoulos

(

2015

Evaluation measures for hierarchical classification: A unified view and novel approaches

Data Mining and Knowledge Discovery

(

820

–

865

. https://doi.org/10.1007/s10618-014-0382-x

Google Scholar

Crossref

WorldCat

Remus

Aly

, &

Biemann

(

2019

Germeval 2019 Task 1: Hierarchical classification of blurbs. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019)

Riehl

Riccio

Miska

E. A.

, &

Hemberg

(

2022

Transposonultimate: Software for transposon classification, annotation and detection

Nucleic Acids Research

(

e64

–

e64

. https://doi.org/10.1093/nar/gkac136

Silla

C. N.

, &

Freitas

A. A.

(

2011

A survey of hierarchical classification across different application domains

Data Mining and Knowledge Discovery

(

1–2

–

. https://doi.org/10.1007/s10618-010-0175-9

Google Scholar

OpenURL Placeholder Text

WorldCat

Sokolova

Japkowicz

, &

Szpakowicz

(

2006

Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In Australasian Joint Conference on Artificial Intelligence (pp. 1015–1021). Springer

Sokolova

, &

Lapalme

(

2009

A systematic analysis of performance measures for classification tasks

Information Processing & Management

(

427

–

437

. https://doi.org/10.1016/j.ipm.2009.03.002

Google Scholar

Crossref

WorldCat

Sorower

M. S.

(

2010

A literature survey on algorithms for multi-label learning. Department of Computer Science, Oregon State University.

Sun

, &

Lim

E.-P.

(

2001

Hierarchical text classification and evaluation. In Proceedings 2001 IEEE International Conference on Data Mining (pp. 521–528). IEEE

Wang

Zhou

, &

Liew

S. C.

(

1999

Building hierarchical classifiers using class proximity. In VLDB (Vol. 99, pp. 7–10). Citeseer

Footnotes

https://github.com/DerKevinRiehl/HierarchicalConfusionMatrix

The union of the neighbours of each node on the common path excluding nodes of the true path or duplicates in this set. In more complex problems such as DAGs it is possible that nodes occur multiple times in the union of the neighbours of each node on the common path, therefore they are excluded. Moreover, it is possible that nodes of the true path appear in this union set, and therefore are explicitly removed from the set.

https://pgsb.helmholtz-muenchen.de/plant/recat/index.jsp

https://www.girinst.org/repbase/

https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2019-hmc/germeval2019t1-public-data-final.zip

Author notes

Conflict of interest: None declared.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Download all slides

Month:	Total Views:
July 2023	35
August 2023	4
September 2023	9
October 2023	13
November 2023	14
December 2023	49
January 2024	120
February 2024	90
March 2024	81
April 2024	73
May 2024	112
June 2024	133
July 2024	183
August 2024	188
September 2024	208
October 2024	240
November 2024	207
December 2024	201
January 2025	194
February 2025	172
March 2025	278
April 2025	235
May 2025	17

Article Contents

Hierarchical confusion matrix for classification performance evaluation

Abstract

1 Introduction

2 Hierarchical classification

3 Related works

4 Hierarchical confusion matrix

4.1 Hierarchical confusion for $Γ = T$ ⁠, $Ψ = SPL$ ⁠, $Φ = MLNP$

4.2 Hierarchical confusion for $Ψ = SPL$ ⁠, $Φ = MLNP$

4.3 Hierarchical confusion for $Φ = MLNP$

4.4 Generalised hierarchical confusion (for all problems)

4.5 Conformity with desirable attributes of evaluation

5 Experiments

5.1 Transposon classification

5.2 GermEval 2019 competition, Task 1A

5.3 GermEval 2019 competition, Task 1B

5.4 Discussion

6 Conclusions

Acknowledgments

Data availability

References

Footnotes

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Hierarchical confusion matrix for classification performance evaluation

Abstract

1 Introduction

2 Hierarchical classification

3 Related works

4 Hierarchical confusion matrix

4.1 Hierarchical confusion for Γ=T⁠, Ψ=SPL⁠, Φ=MLNP

4.2 Hierarchical confusion for Ψ=SPL⁠, Φ=MLNP

4.3 Hierarchical confusion for Φ=MLNP

4.4 Generalised hierarchical confusion (for all problems)

4.5 Conformity with desirable attributes of evaluation

5 Experiments

5.1 Transposon classification

5.2 GermEval 2019 competition, Task 1A

5.3 GermEval 2019 competition, Task 1B

5.4 Discussion

6 Conclusions

Acknowledgments

Data availability

References

Footnotes

Author notes

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only

4.1 Hierarchical confusion for $Γ = T$ ⁠, $Ψ = SPL$ ⁠, $Φ = MLNP$

4.2 Hierarchical confusion for $Ψ = SPL$ ⁠, $Φ = MLNP$

4.3 Hierarchical confusion for $Φ = MLNP$