Abstract

This study proposes the novel concept of the hierarchical confusion matrix, which opens the door for popular confusion-matrix-based (flat) evaluation measures from binary classification while accounting for the peculiarities of hierarchical classification problems. The concept is developed into a generalised form and proven applicable to all types of hierarchical classification problems, including directed acyclic graphs, multi-path labelling, and non-mandatory leaf-node prediction. Finally, measures based on the novel confusion matrix are applied to three real-world hierarchical classification applications and compared to established evaluation measures. The results, the conformity with important attributes of hierarchical classification schemes, and the broad applicability justify its recommendation.

1 Introduction

Hierarchical classification is the task of assigning m classes out of n overlapping classes (Ci) arranged in a hierarchical structure. Classes in the hierarchical classification task can be divided into sub-classes or grouped into super-classes. Hierarchical classification is often wrongly conflated with related problems (Kiritchenko et al., 2006; Silla & Freitas, 2011) such as hierarchical clustering, which generates structures of classes from objects (Gauch & Whittaker, 1981; Gordon, 1987). This work explicitly understands hierarchical classification as the allocation of objects to a set of given classes rather than the design of the classes and their structure itself.

Machine learning research started to focus on (hierarchically) structured classification problems in the context of text categorisation in the mid-1990s (Koller & Sahami, 1997). Since then, hierarchical classification has been applied in many fields, namely text categorisation, music genre classification, and bioinformatic applications including sequence classification and protein function prediction (Freitas & Carvalho, 2007; Silla & Freitas, 2011). Despite intensive research, many related questions in the field remain open, as stated, e.g. in Kosmopoulos et al. (2015).

Hierarchical classification evaluation is part of ongoing research, and various measures have been proposed so far (Costa et al., 2007; Sokolova et al., 2006). While established evaluation measures for binary and multi-class classification problems exist, there is still a lack of consensus on which measures to employ in the context of hierarchical classification problems (Costa et al., 2007).

Hierarchical classification studies frequently adopt conventional, flat measures from binary classification problems such as accuracy, precision, or recall (Costa et al., 2007; Kiritchenko et al., 2006; Kosmopoulos et al., 2015). Due to their simple structure based on the confusion matrix, these measures are easily accessible to human inspection. However, flat measures do not appropriately consider the complex peculiarities of hierarchical classification problems. The measures in Kiritchenko et al. (2006) (hierarchical precision, recall, and F-score) appear frequently in recent studies and aim to bridge the evaluation of binary and hierarchical classification problems.

In this work, we propose a novel concept for the evaluation of hierarchical classification problems: the hierarchical confusion matrix. This approach opens the door for conventional measures in the context of hierarchical classification, helping to complete the field of hierarchical classification evaluation. Further, we extend the concept to a generalised form and prove its applicability to all types of hierarchical classification problems including directed acyclic graphs, multi-path labelling, and non-mandatory leaf-node prediction. Finally, we demonstrate the application of measures based on the novel concept for the comparison of classifiers in three exemplary benchmarks related to different types of hierarchical classification problems.

An implementation of the hierarchical confusion matrix in the Python programming language, together with the examples and experiments of this study, is available on GitHub.1

The remainder of this work is organised as follows. Section 2 formally introduces the problems of hierarchical classification. In Section 3, a summary of proposed evaluation measures for hierarchical classification is presented. Section 4 presents the novel hierarchical confusion matrix, discusses its conformity to important attributes, and proves its applicability to all types of hierarchical classification problems. In Section 5, the novel evaluation concept is applied to experimental data and compared to existing measures, demonstrating its usefulness and plausibility. Finally, Section 6 concludes this paper and summarises its contribution.

2 Hierarchical classification

Supervised machine learning allows access to labelled data for training and testing stages. Classification describes supervised machine learning using categorical labels, and stands for the allocation of objects to a set of existing classes. The field of classification can be divided into the sub-problems: binary, multi-class, multi-label, and hierarchical classification, as shown in Figure 1. In the binary classification problem, objects need to be allocated to one of two non-overlapping classes (C1, C2). Multi-class classification requires the choice of one out of n multiple, non-overlapping classes (Ci) for a given input object. Multi-label classification represents the assignment of objects to m multiple classes out of n multiple, non-overlapping classes (Ci).

Figure 1.

Classification problems by example. The selected class(es) is (are) marked. (a) Binary classification. (b) Multi-class classification. (c) Multi-label classification. (d) Hierarchical classification. Note that the single selected category (dark chocolate) causes the selection of multiple parental vertices in the graph on a path up to the root node.

Before we define hierarchical classification, let us introduce the notation used in this paper. We define a set of data Ω and a structure Γ. The structure Γ can be described by a graph, where the nodes represent the n classes Ci ∈ Γ and the edges stand for the possible relationships between these classes, namely ancestry and descendance. For each class Ci, let us denote the set of all ancestors by A{Ci}, the set of all descendants by D{Ci}, and the set of all neighbouring nodes (with at least one common ancestor) by N{Ci}.

The root node R ∈ Γ is the only node without ancestors, A{R} = ∅. Leaf nodes are nodes without any descendants. An allocation path Pi is a series of classes (connected nodes) within Γ. This work requires three conditions for Pi: (1) for each consecutive pair of elements Ca, Cb ∈ Pi the ancestry relationship Ca ∈ A{Cb} holds, (2) connectivity to the root node, R ∈ Pi, and (3) connectivity to the allocated class, Ci ∈ Pi. Let us denote the true class of a given object od ∈ Ω as C{od}.

Pd describes the set of all predicted allocation paths for od, and a single predicted allocation path is given by P^d ∈ Pd. Wd denotes the set of all true allocation paths for od based on Γ, and P~d ∈ Wd defines a single true allocation path. Moreover, let us denote the intersection of P^d and P~d by P^d ∩ P~d, the (longest, consecutively connected) common path of P^d and P~d by P^d ∩̄ P~d, and the leaf node of a path Pi by Z{Pi}.

Hierarchical classification is a structured classification problem and represents the allocation of one or multiple paths Px ∈ Pd within the structure Γ to a given object od ∈ Ω. Table 1 summarises the definitions of this work.

Table 1.

Definitions for hierarchical classification

Symbol | Definition
Γ | Structure of the hierarchical classification problem
Ψ | Number of allocated paths of the classification problem
Φ | Mandatoriness of leaf-node prediction
Ω | Set of data
od | A single data object in Ω
Ci | Class i, node in Γ
R | Root node of Γ
C{od} | True class of od in Γ
D{Ci} | Set of descendants of Ci
N{Ci} | Set of neighbours of Ci
Pi | Allocation path to Ci
Z{Pi} | Leaf node of Pi
Pa ∩̄ Pb | Longest, consecutively connected, common path
Pd | Set of all predicted allocation paths for od
P^d | A single predicted allocation path for od in Pd
Wd | Set of all true allocation paths for od
P~d | A single true allocation path for od in Wd
According to Silla and Freitas (2011), Borges et al. (2013), and Kosmopoulos et al. (2015), the different sub-problems of hierarchical classification can be categorised based on three aspects: structure Γ, number of paths Ψ, and label depth Φ. First, the literature commonly differentiates two types of structures Γ that put the classes into a hierarchical relationship: trees (T) and directed acyclic graphs (DAGs) (Silla & Freitas, 2011). Both structures are presented in Figure 2. A tree is a graph in which any two nodes are connected by exactly one path, whereas a DAG allows nodes to possess multiple parental nodes. T implies |Wd| = 1 and DAG entails |Wd| ≥ 1 for a single-labelled object od. Moreover, this work assumes that a classifier is required to predict a specific path for DAG problems, while T problems require the prediction of a class only. Second, Silla and Freitas (2011) distinguish single-path label (SPL) and multi-path label (MPL) hierarchical classification depending on the number of allocated paths Ψ = |Pd| per object od ∈ Ω. SPL implies |Pd| = 1, while MPL entails |Pd| ≥ 1. Third, hierarchical classification problems differ by their required label depth Φ: there are problems that require full-depth labelling (FD), i.e. mandatory leaf-node prediction (MLNP), and those that do not (partial-depth labelling, PD, or non-mandatory leaf-node prediction, NMLNP) (Costa et al., 2007; Freitas & Carvalho, 2007).

Figure 2.

Structures of hierarchical classification. (a) Trees. (b) Directed acyclic graphs (DAGs).

Algorithms approaching the hierarchical classification problem can be grouped into three categories (Dumais & Chen, 2000; Kiritchenko et al., 2006; Silla & Freitas, 2011): flat, local-model, and global-model approaches. These three approaches and their sub-types are outlined in Figure 3.

Figure 3.

Approaches to hierarchical classification. The use of classifiers is marked with dashed lines. (a) The flat approach transforms leaf-node classification into a multi-class problem. (b) The global-model approach considers all classes at the same time. (c) A local classifier per node employs a binary classifier for each vertex of the structure. (d) A local classifier per parent node utilises a multi-class classifier for each parental node. (e) A local classifier per level uses a multi-class classifier at each level of the taxonomy.

The flat approach transforms the hierarchical classification problem into a multi-class problem using the leaf nodes only. Three types of local-model approaches are mentioned by Silla and Freitas (2011): local classifier per node, local classifier per parent node, and local classifier per level. The local classifier per node approach employs a binary classifier for each node of the class hierarchy, while the local classifier per parent node approach trains a multi-class classifier for each non-leaf node. The local classifier per level approach consists in using a multi-class classifier for all labels present at each level of the hierarchical structure. The global approach utilises one single classifier that captures the whole hierarchical class structure. Kiritchenko et al. (2006) argue that local-model approaches produce consistent labelling and are efficient to train. This comes at the cost that local models are highly sensitive to ancestral decisions and hard to interpret due to the vast number of different classifiers. Therefore, they argue for the use of global models, which arguably have a higher chance of correct predictions. However, the training of global-model approaches is complex, and contrary to the other approaches the classification problem is not decomposed into simpler problems. In general, the performance of classifiers depends on their training step. A larger amount of available data and fewer contradictions in the data set support the training of better models. This is an argument for local classifiers per level and flat approaches. Local classifiers per node and per parent node face a shrinking amount of training data at deeper levels, while global-model approaches face contradicting data, as there is no decomposition of the data across different classifiers.

All approaches have their merits, and with regard to the diversity of algorithms found in the literature, one can conclude that different types of data and hierarchical classification problems require different, customised approaches.

3 Related works

In the context of the binary classification problem, four possible cases occur: a true positive prediction (TP), a true negative prediction (TN), a false positive prediction (FP), and a false negative prediction (FN). The counts of these four cases make up the confusion matrix, on which many different performance measures are based (Fawcett, 2006). Amongst these measures are accuracy (ACC), precision (PPV), recall (TPR), false negative rate (FNR), false positive rate (FPR), specificity (TNR), prevalence threshold (PT), F1-score (F1), and the Matthews correlation coefficient (MCC), to name a few; their definitions are listed in Table 2.

Table 2.

Common binary classification measures

Measure | Definition
ACC | (TP + TN) / (TP + TN + FP + FN)
PPV | TP / (TP + FP)
TPR | TP / (TP + FN)
FNR | FN / (FN + TP)
FPR | FP / (FP + TN)
TNR | TN / (TN + FP)
PT | (√(TPR · (1 − TNR)) + TNR − 1) / (TPR + TNR − 1)
F1 | 2TP / (2TP + FP + FN)
MCC | (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
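
For concreteness, the flat measures above can be computed directly from the four confusion-matrix counts. The following sketch mirrors the definitions in Table 2; the function and variable names are our own, and no guarding against zero denominators (e.g. for a degenerate class distribution) is included:

```python
import math

def binary_measures(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Flat evaluation measures computed from confusion-matrix counts (Table 2)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    ppv = tp / (tp + fp)                      # precision
    tpr = tp / (tp + fn)                      # recall / sensitivity
    fnr = fn / (fn + tp)                      # miss rate
    fpr = fp / (fp + tn)                      # fall-out
    tnr = tn / (tn + fp)                      # specificity
    pt = (math.sqrt(tpr * (1 - tnr)) + tnr - 1) / (tpr + tnr - 1)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"ACC": acc, "PPV": ppv, "TPR": tpr, "FNR": fnr,
            "FPR": fpr, "TNR": tnr, "PT": pt, "F1": f1, "MCC": mcc}
```

These formulas apply unchanged once the four counts are derived from the hierarchical confusion matrix introduced in Section 4.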

As it turns out, hierarchical classification is a more complex problem, and the definition of the confusion matrix elements is not trivial. Kiritchenko et al. (2006) plead for the consideration of partially correct classification and for the discriminatory punishment of more distant errors and of errors at upper levels. Take the snack product categorisation in Figure 1d as an example: the wrong classification as white chocolate could be considered more accurate than the wrong classification as muffins, or even as a snack of a different flavour.

Kosmopoulos et al. (2015) discuss further issues, namely over- and under-specialisation, alternative paths, pairing, and long-distance problems. With regard to our snack product category example, over- and under-specialisation could occur in NMLNP problems, where the classifier predicts dark chocolate while the true label is set as sweet snack, or the other way around. Moreover, the MPL classification problem can increase the complexity of evaluation further.

Due to these complexities, specific measures for the evaluation of hierarchical classification problems have been proposed. Kosmopoulos et al. (2015) differentiate pair-based and set-based measures. Pair-based measures assign costs to pairs of predicted and true classes; these measures are most suitable for tree and SPL problems. A possible extension to MPL problems is the relaxation that each class may be part of more than one pair. Set-based measures take the whole set of predicted and true classes into account, including ancestry and descendance relationships. Symmetric difference loss and hierarchical precision and recall (Kiritchenko et al., 2005) count amongst the set-based measures.

Costa et al. (2007) categorise the evaluation measures into distance-based, depth-based, semantics-based, and hierarchy-based measures. Distance-based measures consider the distance between true and predicted classes in a structure (Sun & Lim, 2001; Wang et al., 1999). A possible issue with these measures is that they neglect level differences, as they commonly consider neither the difficulty of accurate classification at deeper levels nor the severity of misclassification at shallower levels.

A possible means to counter this shortcoming is the weighting of structural edges, which characterises depth-based measures. Besides the number of edges, the depth of the true and predicted classes is taken into consideration. However, this raises the new problem of setting the weights. Blockeel et al. (2002) and Holden and Freitas (2006) employ exponential weighting. A possible issue with these (distance-based) measures is the discrimination and over-penalisation of leaf nodes with different depths. In addition, the definition of depth becomes more challenging in DAGs.

Semantics-based measures employ the concept of similarity between classes and objects to assess the performance of classification models (Sun & Lim, 2001). From this perspective, similarity or distance can be quantified by comparing feature vectors in a multi-dimensional space, similar to hierarchical clustering problems (Gauch & Whittaker, 1981; Gordon, 1987). One drawback is that two algorithms with different features but identical predictions may be evaluated to different performances. Hierarchy-based measures consider structure-related information. Mainly, these measures consider the intersection of allocation paths, with a focus either on descendants (Ipeirotis et al., 2001) or ancestors (Kiritchenko et al., 2004).

Finally, we also find loss-function-based evaluation measures in the literature (Cesa-Bianchi et al., 2005, 2006), which employ functions similar to those used in model training for evaluation purposes.

4 Hierarchical confusion matrix

The orientation towards binary classification evaluation measures in the context of hierarchical classification problems can be observed in many works, such as Sun and Lim (2001), Kiritchenko et al. (2005, 2006), Dumais and Chen (2000), and Sokolova and Lapalme (2009). Especially the measures outlined in Kiritchenko et al. (2005, 2006) (hierarchical precision (hP), recall (hR), and F-score (hF)) enjoy increased attention in the ongoing discussion on hierarchical classification evaluation and are frequently used in studies, e.g. Borges et al. (2013). Their definition for SPL problems is shown below, including a degree of freedom by which hP is considered β times more important than hR:

(1)

Silla and Freitas (2011) recommend the use of hP, hR, and hF, as they extend to both trees and DAGs, and to both SPL and MPL problems. However, these measures come with the problem of defining the relative importance of hP and hR, and they do not consider true negatives. As Chicco and Jurman (2020) argue, more elaborate measures such as the MCC score represent more reliable statistical rates, since they score high if and only if the performance is good in all four fields of the confusion matrix. Unmistakably, hP, hR, and hF strongly resemble the binary classification measures PPV, TPR, and F1 based on the confusion matrix:

PPV = TP / (TP + FP),  TPR = TP / (TP + FN),  F1 = 2TP / (2TP + FP + FN)  (2)

Open questions arise regarding the definition of the confusion matrix for hierarchical classification problems. The observation for hP, hR, and hF motivates the proposition of a novel confusion matrix for hierarchical classification that opens the door for the application of (all) binary classification measures (including those taking true negatives into account, such as the MCC score) while considering the peculiarities of hierarchical classification. The concept of the hierarchical confusion matrix defines the four fields of the confusion matrix with a connotation to the decisions that are made along the prediction path(s) as follows:

  • True positives (TPH)—the number of nodes that were correctly labelled positively by the predicted path.

  • True negatives (TNH)—the number of nodes that were correctly labelled negatively by the predicted path.

  • False positives (FPH)—the number of nodes that were wrongly labelled positively by the predicted path.

  • False negatives (FNH)—the number of nodes that were wrongly labelled negatively by the predicted path.

After determining TPH, TNH, FPH, and FNH for a single od ∈ Ω as a function of Pd and Wd, the values of the hierarchical confusion matrix can be calculated as the sum over the whole data set and used to determine the 'flat' evaluation measures of binary classification (see Table 2):

TP = Σ_{od ∈ Ω} TPH,d,  TN = Σ_{od ∈ Ω} TNH,d,  FP = Σ_{od ∈ Ω} FPH,d,  FN = Σ_{od ∈ Ω} FNH,d  (3)
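
The dataset-level aggregation described above amounts to a plain element-wise sum of the per-object matrices. As a minimal sketch (function name ours), with per-object results given as (TPH, TNH, FPH, FNH) tuples:

```python
def aggregate_confusion(per_object):
    """Sum per-object hierarchical confusion tuples (TPH, TNH, FPH, FNH)
    over the whole data set, yielding the dataset-level TP, TN, FP, FN."""
    totals = [0, 0, 0, 0]
    for fields in per_object:
        for i, value in enumerate(fields):
            totals[i] += value
    return tuple(totals)
```

The resulting four counts can then be plugged into any flat measure of Table 2.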

In the following, we formalise this concept of hierarchical confusion, beginning from its simplest form for (Γ=T, Ψ=SPL, Φ=MLNP) problems and then extending it to a generalised form that is applicable to all hierarchical classification problems. Finally, we discuss how the novel concept conforms to desirable attributes of hierarchical classification measures.

4.1 Hierarchical confusion for Γ=T, Ψ=SPL, Φ=MLNP

According to the verbal definition of the confusion matrix elements in the prior section, we formalise the elements as follows. Since (Γ=T, Ψ=SPL, Φ=MLNP) problems imply Wd = {P~d} and Pd = {P^d}, we define the elements as functions of a given P~d and P^d instead of a given Wd and Pd. The true positives (TPH) equal the number of nodes in the intersection of P^d and P~d. The true negatives (TNH) are given by the union of the neighbours of each node on the common path2 and the (correctly) negatively predicted descendants of the common path's leaf node. The latter can be obtained as the set of descendants of the common path's leaf node excluding the false negative node of the true path P~d and the false positive node of the predicted path P^d.

The false positives (FPH) equal the number of nodes of the predicted path that are not part of the common path. The false negatives (FNH) are given by the number of nodes of the true path that are not part of the common path. An example for the evaluation of a prediction in the context of a hierarchical tree classification problem is outlined in Figure 4a. To summarise, the four aforementioned cases are formalised below:

TPH = |P^d ∩ P~d|,  FPH = |P^d \ (P^d ∩̄ P~d)|,  FNH = |P~d \ (P^d ∩̄ P~d)|,
TNH = |⋃_{Ci ∈ P^d ∩̄ P~d} N{Ci} ∪ (D{Z{P^d ∩̄ P~d}} \ (P^d ∪ P~d))|  (4)
Figure 4.

Four exemplary predictions and their evaluation using the concept of hierarchical confusion. The true class(es) is (are) marked with a cross. The predictions are marked with numbered arrows, and prediction paths are marked with dotted or dashed lines, respectively. The tables to the right of the diagrams present the resulting numbers for the confusion matrix fields and the relevant nodes that were considered in the specific context. (a) Hierarchical (Γ = T, Ψ = SPL, Φ = MLNP) classification problem. (b) Hierarchical (Γ = DAG, Ψ = SPL, Φ = MLNP) classification problem. (c) Hierarchical (Γ = DAG, Ψ = MPL, Φ = MLNP) classification problem. (d) Hierarchical (Γ = DAG, Ψ = MPL, Φ = NMLNP) classification problem.
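
To make the tree-case definitions of Section 4.1 concrete, the following sketch computes the four fields for a single (Γ=T, Ψ=SPL, Φ=MLNP) prediction. It is a minimal illustration under our own naming, not the authors' reference implementation: paths are node lists starting at the root (so the common path is simply the common prefix and is never empty), and `neighbours`/`descendants` are precomputed dictionaries of node sets:

```python
def tree_confusion(pred_path, true_path, neighbours, descendants):
    """Hierarchical confusion for one (T, SPL, MLNP) prediction.

    pred_path, true_path: node lists from the root to the leaf.
    neighbours[c]: nodes sharing at least one ancestor with c.
    descendants[c]: all descendants of c in the hierarchy.
    """
    # Longest common path: both paths start at the root of a tree,
    # so it is the common prefix of the two node lists.
    common = []
    for p, t in zip(pred_path, true_path):
        if p != t:
            break
        common.append(p)

    tp = len(common)                      # nodes correctly labelled positive
    fp = len(pred_path) - len(common)     # predicted nodes off the true path
    fn = len(true_path) - len(common)     # true nodes missed by the prediction

    # True negatives: neighbours of every node on the common path, plus the
    # descendants of the common path's leaf that lie on neither path.
    tn_nodes = set()
    for node in common:
        tn_nodes |= neighbours.get(node, set())
    tn_nodes |= (descendants.get(common[-1], set())
                 - set(pred_path) - set(true_path))
    return tp, len(tn_nodes), fp, fn
```

With the snack taxonomy of Figure 1d in mind, a prediction that goes wrong only at the last level keeps the whole remaining common path as true positives and contributes a single FP/FN pair.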

4.2 Hierarchical confusion for Ψ=SPL, Φ=MLNP

In order to extend any definition for TPH, TNH, FPH, and FNH from the context of a T problem to a DAG problem, we propose a two-step 'benevolent' approach:

  • Step 1: Select the path P~i ∈ Wd that most resembles P^d (SPL problems require |Pd| = 1):

    P~i = argmax_{P~ ∈ Wd} |P^d ∩ P~|  (5)

  • Step 2: Calculate TPH, TNH, FPH, and FNH as for a (Γ=T, Ψ=SPL, Φ=MLNP) problem using P~i and P^d.

We call this approach 'benevolent' as the true path P~i ∈ Wd is chosen such that the best possible performance is achieved for the given prediction P^d. A key assumption here is that the prediction is a path, while the true label is considered a node. Local models and some global models predict a path, while flat models and some global models predict a node only. For methods that predict a node, we use the benevolent approach to infer and score a path. This strategy gives maximal leeway to models predicting a node. An example for the evaluation of a (Γ=DAG) problem is outlined in Figure 4b.
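
The benevolent selection of Step 1 reduces to a maximisation of node overlap. A minimal sketch (names are ours; paths are node lists, and ties resolve arbitrarily, as no tie-breaking rule is prescribed):

```python
def best_true_path(pred_path, true_paths):
    """Benevolent choice: among all true allocation paths in Wd, return
    the one sharing the most nodes with the predicted path."""
    pred = set(pred_path)
    return max(true_paths, key=lambda path: len(pred & set(path)))
```

The selected path is then scored against the prediction exactly as in the tree case of Section 4.1.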

4.3 Hierarchical confusion for Φ=MLNP

The next step towards a generalised hierarchical confusion concept is the extension to MPL problems. Similar to the DAG extension, we can decompose the MPL problem into SPL problems. To do so, we first need to extend the notation. In MPL problems, C{od} is a set of n true classes of od in Γ. The classifier predicts m paths. The numbers of predicted paths m and of true classes n are not necessarily equal in applications, as n is considered to be unknown to the classifier a priori. Wd,j is the set of all true allocation paths for od connected to Cj ∈ C{od}. Tree problems imply |Wd,j| = 1. Based on this redefinition, the concept of the confusion matrix can be extended to MPL problems with the following 'benevolent' approach:

  • Step 1: Create a map M that stores values m = M[P^k] for each P^k ∈ Pd.

  • Step 2: For each P^k ∈ Pd, determine the m values as follows:

    • Step 2.1: Find the path P~i of any class Cj in Wd,j that most resembles P^k:

      P~i = argmax_{P~ ∈ Wd,j, Cj ∈ C{od}} |P^k ∩ P~|  (6)

    • Step 2.2: Set m to the number of intersections between P^k and the found P~i ∈ Wd,j:

      m = M[P^k] = |P^k ∩ P~i|  (7)

  • Step 3: For each P^k ∈ Pd (in descending order of the related m values stored in M), determine TPH,k, TNH,k, FPH,k, and FNH,k as follows:

    • Step 3.1: If Wd,j ≠ ∅, go to Step 3.2. Otherwise, if no true classes remain because there are more predictions m than true classes n, consider the whole prediction as an FP:
      (8)

    • Step 3.2: Find the path P~i of any class Cj in Wd,j that most resembles P^k:

      P~i = argmax_{P~ ∈ Wd,j, Cj ∈ C{od}} |P^k ∩ P~|  (9)

    • Step 3.3: Calculate TPH,k, TNH,k, FPH,k, and FNH,k as for a (Γ=T, Ψ=SPL, Φ=MLNP) problem using P~i and P^k.

    • Step 3.4: Remove all paths in Wd,j related to class Cj. Remove Cj from the set of true labels C{od}.

  • Step 4: Calculate the fields of the confusion matrix as the sum of the values determined for the single predictions P^k ∈ Pd. For all remaining classes Cj that were not removed from C{od} (if there are more true classes n than predictions m), consider the shortest paths to them as FNs:
    (10)

An example for the evaluation of a (Γ=DAG, Ψ=MPL) problem is depicted in Figure 4c.
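
The steps above can be sketched as follows. This is a simplified illustration under our own naming: `path_confusion` stands for any single-pair scorer such as the tree case of Section 4.1, and excluding the root node from the FP/FN counts of Steps 3.1 and 4 is our reading of "the whole prediction" and "the shortest paths":

```python
def path_overlap(a, b):
    """Number of shared nodes between two allocation paths."""
    return len(set(a) & set(b))

def multipath_confusion(pred_paths, true_paths, path_confusion):
    """Hierarchical confusion for one object in an MPL problem.

    pred_paths: list of predicted allocation paths (node lists).
    true_paths: dict mapping each true class Cj to its candidate paths Wd,j.
    path_confusion: single-pair scorer returning (TP, TN, FP, FN).
    """
    remaining = {c: list(paths) for c, paths in true_paths.items()}

    # Steps 1-2: score every prediction by its best achievable overlap m.
    def best_m(pred):
        return max((path_overlap(pred, t)
                    for ts in remaining.values() for t in ts), default=0)
    ordered = sorted(pred_paths, key=best_m, reverse=True)

    tp = tn = fp = fn = 0
    for pred in ordered:
        if not remaining:
            # Step 3.1: surplus prediction -- every predicted node below
            # the root counts as a false positive (our assumption).
            fp += len(pred) - 1
            continue
        # Steps 3.2-3.3: benevolent match against the remaining classes.
        cls, best = max(((c, t) for c, ts in remaining.items() for t in ts),
                        key=lambda ct: path_overlap(pred, ct[1]))
        dtp, dtn, dfp, dfn = path_confusion(pred, best)
        tp, tn, fp, fn = tp + dtp, tn + dtn, fp + dfp, fn + dfn
        del remaining[cls]                 # Step 3.4
    # Step 4: unmatched true classes contribute their shortest path as FNs.
    for ts in remaining.values():
        fn += min(len(t) for t in ts) - 1  # root excluded, our assumption
    return tp, tn, fp, fn
```

Sorting by the precomputed m values makes the greedy matching assign the best-fitting true class to the strongest prediction first, in line with the benevolent spirit of the scheme.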

4.4 Generalised hierarchical confusion (for all problems)

Finally, the discussion of NMLNP problems remains. As it turns out, the approach presented for MPL problems is applicable to NMLNP problems as well, as shown in Figure 4d. NMLNP adds complexity by not requiring predictions to reach the deepest level of the taxonomy. This allows for misclassification errors such as over- and under-specialisation. The approach discussed in the previous subsection remains suitable in this case, as the confusion matrix element definitions consider paths and their intersections.

4.5 Conformity with desirable attributes of evaluation

There is a set of important attributes that should ideally be fulfilled by hierarchical evaluation schemes, as demanded by, e.g. Costa et al. (2007) and Kiritchenko et al. (2006): (1) independence of subjective, user-specified parameters; (2) metrics should give credit to partially correct classification; (3) distant errors and higher-level errors should be penalised more heavily; (4) the evaluation scheme should be applicable to a broad range of problems.

First, the hierarchical confusion matrix by construction does not depend on subjective, user-specific parameters.

Second, let us consider Ψ=SPL, Φ=MLNP w.l.o.g., and let us assume P^d to be a partially correct prediction path, i.e. there exists a path P~i ∈ Wd such that |P~i ∩̄ P^d| > 0. Thus, TPH, TNH > 0, which implies that the evaluation scheme gives credit to partially correct classification.

Third, let P^1 and P^2 be two predicted paths in a Ψ=SPL, Φ=MLNP problem, where the number of correctly selected classes in P^2 is higher than in P^1. Then the TPH and TNH of P^2 are greater than those of P^1, and vice versa for FPH and FNH. Therefore, the higher-level error of P^1 is punished more heavily. This also holds for Ψ=MPL in the case of multiple true paths, as the assignment of the true path for evaluation is benevolent. For NMLNP problems, over-specialisations are punished less heavily than errors at higher levels as long as the depth of the over-specialisation is lower. More precisely, in this case the number of TPH is maximal, FNH is zero, FPH corresponds to the level depth of the over-specialisation, and TNH is maximal minus the level depth of the over-specialisation. The case of under-specialisation follows the same lines.

To the best of the authors' knowledge, the proposed concept of the hierarchical confusion matrix is the first evaluation scheme that is by construction applicable to all types of hierarchical classification problems and thus conforms to the fourth attribute.

5 Experiments

In this section, we apply our concept of the hierarchical confusion matrix to three real-world hierarchical classification problems with differing levels of complexity. Employing binary performance measures (see Table 2) based on the hierarchical confusion matrix allows for the comparison of different classification models in the three experiments. Code and data for reproduction can be found on GitHub.
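The measures in Table 2 follow the standard confusion-matrix definitions. As a sketch (standard formulas, not code from the paper's repository), the following reproduces the percentages of the HC_GA row of Table 3 from its four counts:

```python
import math

def measures(tp, tn, fp, fn):
    """Standard binary measures from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    ppv = tp / (tp + fp)                      # precision
    tpr = tp / (tp + fn)                      # recall
    f1 = 2 * ppv * tpr / (ppv + tpr)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, ppv, tpr, f1, mcc

# HC_GA row of Table 3: TP, TN, FP, FN.
acc, ppv, tpr, f1, mcc = measures(19_145, 27_690, 7_854, 7_744)
print(f"{acc:.2%} {ppv:.2%} {tpr:.2%} {f1:.2%} {mcc:.2%}")
# → 75.02% 70.91% 71.20% 71.05% 49.08%
```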

5.1 Transposon classification

Transposon classification is a task from bioinformatics and computational biology (Riehl et al., 2022). Transposons are DNA sequences that are able to move inside the genome using copy-and-paste and cut-and-paste transposition mechanisms and are suspected to play an important role as a driver of genome evolution. Moreover, they can be classified into a tree structure depending on structural and protein features derived from the sequences that determine their specific transposition mechanisms. The transposon classification task is a (Γ=T, Ψ=SPL, Φ=MLNP) classification problem. Within this experiment, we use the combined data set of mipsREdat-PGSB and RepBase v23.08, and the taxonomy proposed in Riehl et al. (2022).

Table 3 presents the hierarchical confusion matrix and evaluation measures based on it for the different classification models compared in Riehl et al. (2022). In situations where a clear ranking based on a single measure is difficult, considering multiple measures as presented in the table can facilitate model ranking. The results show that evaluation measures based on the novel hierarchical confusion matrix span the whole range, comparable to the related binary measures.

Table 3.

Transposon classification experiment

| Model   | TP     | TN     | FP     | FN     | ACC    | PPV    | TPR    | F1     | MCC     | Rank |
| HC_GA   | 19,145 | 27,690 | 7,854  | 7,744  | 75.02% | 70.91% | 71.20% | 71.05% | 49.08%  | 2    |
| HC_LGA  | 18,420 | 25,582 | 8,598  | 8,469  | 72.05% | 68.18% | 68.50% | 68.34% | 43.33%  | 3    |
| NLLCPN  | 17,608 | 24,765 | 9,410  | 9,281  | 69.39% | 65.17% | 65.48% | 65.33% | 37.93%  | 4    |
| RFSB    | 22,833 | 34,026 | 4,131  | 4,056  | 87.41% | 84.68% | 84.92% | 84.80% | 74.06%  | 1    |
| TERL    | 366    | 630    | 18,776 | 26,523 | 2.15%  | 1.91%  | 1.36%  | 1.59%  | -95.58% | 6    |
| TopDown | 1,415  | 2,743  | 16,603 | 25,474 | 8.99%  | 7.85%  | 5.26%  | 6.30%  | -81.49% | 5    |

5.2 GermEval 2019 competition, Task 1A

The GermEval 2019 competition (Remus et al., 2019) was a competition held by the Language Technology Group of Universität Hamburg, Germany. In total, 18 teams participated in the competition and worked on multiple tasks. Here, we consider Tasks 1A and 1B, which concern the hierarchical classification of German-language blurb texts, i.e. short descriptions of books. The data set covers 20,784 books and a 3-level DAG structure with 343 categories. All participant submissions, the data set, and the taxonomy were downloaded from the competition's website.

Task 1A is a (Γ=(1-level) T, Ψ=MPL, Φ=MLNP) classification problem. The challenge was to assign one or multiple top-level categories, representing literature genres, to each blurb. Table 4 summarises the confusion matrix, the evaluation measures calculated by us [hierarchical confusion (HC)], and the performances reported by the organisers of the competition (GE) for each participant and two baseline models (Baseline, Contender). The organisers employed micro-averaged recall, precision, and F1-score (Silla & Freitas, 2011; Sorower, 2010) as well as subset accuracy (Sorower, 2010) for evaluation purposes. The table allows for the comparison of the reported evaluation measures and of the resulting rankings based on the F1-score. Additionally, the MCC based on the hierarchical confusion matrix is reported.
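For reference, the organisers' micro-averaged measures and subset accuracy can be sketched on toy label sets as follows. The data and function name are illustrative assumptions; this is not the competition's evaluation code.

```python
# Micro-averaged precision/recall/F1 and subset accuracy for
# multi-label predictions, on hypothetical genre labels.
def micro_scores(true_sets, pred_sets):
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    ppv = tp / (tp + fp)
    tpr = tp / (tp + fn)
    f1 = 2 * ppv * tpr / (ppv + tpr)
    # Subset accuracy only credits exact matches of the full label set.
    subset_acc = sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)
    return ppv, tpr, f1, subset_acc

true_sets = [{"Fiction"}, {"Fiction", "Crime"}]
pred_sets = [{"Fiction"}, {"Fiction"}]
print(micro_scores(true_sets, pred_sets))
```

Here the second prediction is only partially correct: micro-averaging still credits the shared label (F1 = 0.8), while subset accuracy drops to 0.5.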

Table 4.

GermEval 2019 competition, Task 1A experiment. Columns TP to MCC: hierarchical confusion matrix performance evaluation (HC); columns marked (GE): reported performances (Remus et al., 2019).

| Model | TP | TN | FP | FN | ACC | PPV | TPR | F1 | MCC | ACC (GE) | PPV (GE) | TPR (GE) | F1 (GE) | Rank (HC) | Rank (GE) |
| Averbis | 3,613 | 28,863 | 584 | 857 | 95.75% | 86.09% | 80.83% | 83.37% | 80.99% | 79.00% | 86.00% | 81.00% | 83.00% | 6 | 6 |
| Comtravo-DS | 3,690 | 29,517 | 1,078 | 780 | 94.70% | 77.39% | 82.55% | 79.89% | 76.89% | 72.00% | 81.00% | 83.00% | 82.00% | 7 | 8 |
| DFKI-SLT | 3,787 | 28,933 | 536 | 683 | 96.41% | 87.60% | 84.72% | 86.14% | 84.09% | 82.00% | 88.00% | 85.00% | 86.00% | 3 | 3 |
| EricssonResearch | 3,769 | 28,891 | 455 | 701 | 96.58% | 89.23% | 84.32% | 86.70% | 84.79% | 84.00% | 89.00% | 84.00% | 87.00% | 1 | 1 |
| Fosil-hsmw | 3,719 | 29,003 | 694 | 751 | 95.77% | 84.27% | 83.20% | 83.73% | 81.30% | 79.00% | 84.00% | 83.00% | 84.00% | 5 | 5 |
| HSHL | 3,647 | 28,877 | 777 | 823 | 95.31% | 82.44% | 81.59% | 82.01% | 79.32% | 77.00% | 82.00% | 82.00% | 82.00% | 8 | 7 |
| HUIU | 3,608 | 28,808 | 867 | 862 | 94.94% | 80.63% | 80.72% | 80.67% | 77.76% | 76.00% | 81.00% | 81.00% | 81.00% | 9 | 9 |
| Raghavan | 3,747 | 28,983 | 522 | 723 | 96.34% | 87.77% | 83.83% | 85.75% | 83.68% | 83.00% | 88.00% | 84.00% | 86.00% | 4 | 4 |
| TwistBytes | 3,852 | 29,551 | 752 | 618 | 96.06% | 83.67% | 86.17% | 84.90% | 82.65% | 79.00% | 87.00% | 86.00% | 86.00% | 2 | 2 |
| Baseline | 3,344 | 29,084 | 544 | 1,126 | 95.10% | 86.01% | 74.81% | 80.02% | 77.49% | 71.00% | 86.00% | 75.00% | 80.00% | | |
| Contender | 3,809 | 29,835 | 2,301 | 661 | 91.91% | 62.34% | 85.21% | 72.00% | 68.53% | 74.00% | 82.00% | 85.00% | 84.00% | | |

First of all, the evaluation results and rankings of the two evaluation approaches strongly resemble each other. The rankings differ in only one pair (Comtravo-DS and HSHL). Next, the differences in precision, recall, and F1-score between the hierarchical confusion matrix and the micro-averaged measures are relatively small (on average 2.41%, 0.10%, and 1.44%). However, larger differences become obvious with regard to accuracy (on average 17.53%). The differences in accuracy can be explained by the different treatment of true negatives in our proposed confusion matrix compared to subset accuracy.
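The accuracy gap can be illustrated on a toy example: subset accuracy ignores true negatives, whereas a confusion-matrix accuracy credits every label correctly left unassigned. The label universe and data below are hypothetical.

```python
# Toy multi-label setting: four possible labels, two samples.
labels = {"A", "B", "C", "D"}
true_sets = [{"A"}, {"A", "B"}]
pred_sets = [{"A"}, {"A", "C"}]

# Subset accuracy: exact match of the full label set only.
subset_acc = sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)

# Confusion-matrix accuracy: true negatives count as correct decisions.
tp = tn = fp = fn = 0
for t, p in zip(true_sets, pred_sets):
    tp += len(t & p)
    fp += len(p - t)
    fn += len(t - p)
    tn += len(labels - t - p)          # correctly unassigned labels
cm_acc = (tp + tn) / (tp + tn + fp + fn)

print(subset_acc, cm_acc)  # → 0.5 0.75
```

Even this tiny example shows the confusion-matrix accuracy exceeding subset accuracy; with many classes and sparse label sets (as in GermEval), the gap grows accordingly.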

5.3 GermEval 2019 competition, Task 1B

Task 1B is a (Γ=DAG, Ψ=MPL, Φ=NMLNP) classification problem. The challenge was to assign one or multiple categories from the DAG to each blurb, not necessarily down to the deepest level. Table 5 summarises the results analogously to the previous task. Again, the resulting rankings are identical except for one pair (EricssonResearch and TwistBytes). The differences in accuracy, precision, recall, and F1-score (67.44%, 2.44%, 5.20%, and 4.25%) are larger than for the previous task; accuracy again differs most. As Task 1B involves a more complex structure (DAG vs. T), the differences between the evaluation measures become more pronounced.

Table 5.

GermEval 2019 competition, Task 1B experiment. Columns TP to MCC: hierarchical confusion matrix performance evaluation (HC); columns marked (GE): reported performances (Remus et al., 2019).

| Model | TP | TN | FP | FN | ACC | PPV | TPR | F1 | MCC | ACC (GE) | PPV (GE) | TPR (GE) | F1 (GE) | Rank (HC) | Rank (GE) |
| Averbis | 8,552 | 125,951 | 4,683 | 6,558 | 92.29% | 64.62% | 56.60% | 60.34% | 56.24% | 27.00% | 68.00% | 61.00% | 64.00% | 3 | 3 |
| Comtravo-DS | 7,187 | 111,871 | 3,376 | 7,923 | 91.33% | 68.04% | 47.56% | 55.99% | 52.36% | 19.00% | 70.00% | 53.00% | 60.00% | 6 | 6 |
| DFKI-SLT | 7,049 | 112,567 | 2,256 | 8,061 | 92.06% | 75.75% | 46.65% | 57.74% | 55.56% | 21.00% | 78.00% | 52.00% | 62.00% | 4 | 4 |
| EricssonResearch | 8,498 | 119,546 | 3,208 | 6,612 | 92.88% | 72.60% | 56.24% | 63.38% | 60.10% | 38.00% | 74.00% | 62.00% | 67.00% | 1 | 2 |
| HSHL | 7,167 | 106,488 | 3,025 | 7,943 | 91.20% | 70.32% | 47.43% | 56.65% | 53.21% | 26.00% | 72.00% | 54.00% | 62.00% | 5 | 5 |
| TwistBytes | 9,174 | 130,886 | 4,747 | 5,936 | 92.91% | 65.90% | 60.71% | 63.20% | 59.35% | 25.00% | 71.00% | 65.00% | 68.00% | 2 | 1 |
| Baseline | 5,183 | 97,128 | 964 | 9,927 | 90.38% | 84.32% | 34.30% | 48.77% | 50.00% | 15.00% | 85.00% | 39.00% | 53.00% | | |
| Contender | 7,693 | 118,017 | 2,854 | 7,417 | 92.45% | 72.94% | 50.91% | 59.97% | 57.05% | 25.00% | 76.00% | 56.00% | 64.00% | | |

5.4 Discussion

Based on the results of Task 1A and Task 1B in Tables 4 and 5, we observe that the range of our measures is consistently smaller, suggesting that our measures are more robust. Similarly, Figure 5 shows the differences between the proposed and reported metrics for the two tasks of the GermEval 2019 competition. The differences in accuracy are much larger than for the other metrics, and the differences in Task 1A (Γ=(1-level) T, Ψ=MPL, Φ=MLNP) are smaller than for Task 1B (Γ=DAG, Ψ=MPL, Φ=NMLNP). Our accuracy is consistently higher, while no such trend holds for the other metrics. For Task 1B, our metrics are consistently lower, while there is no such trend for Task 1A. The higher accuracy of the hierarchical confusion matrix and the (mostly) lower PPV, TPR, and F1 can be explained by the larger numbers of TN due to the proposed benevolent approach.

Figure 5.

Differences in metrics (proposed minus reported). Differences of the evaluation metrics ACC, PPV, TPR, and F1 between the proposed hierarchical confusion matrix and the reported metrics from the GermEval 2019 competition for Task 1A and Task 1B.

Moreover, we determined Kendall's rank correlations (Kendall, 1938; Kosmopoulos et al., 2015) to assess the similarity of the rankings for the two classification tasks 1A and 1B of the GermEval competition, depending on the different measures (ACC, PPV, TPR, F1) and evaluation schemes [HC = hierarchical confusion vs. GE = schemes used in Remus et al. (2019)]. We also compare rankings across the two schemes (HC vs. GE), termed 'Across' in Table 6. Kendall's rank correlation ranges from −1 (opposite ranking) to +1 (identical ranking) and quantifies the similarity of two rankings. Overall, we observe a large disagreement between rankings based on different measures, both within the same scheme and across schemes, for both tasks. In Task 1A, ACC correlates perfectly with F1 for HC and highly with PPV for GE, while it does not correlate with the other measures. In Task 1B, there is almost no strong correlation for HC, and strong correlations of ACC with PPV and of TPR with F1 for GE. The comparison across HC and GE reveals no significant relationship between the evaluation schemes and measures. Taken together, this suggests that the different metrics are largely complementary and capture different aspects of the comparison.
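Kendall's rank correlation can be computed directly from pairwise comparisons. A minimal sketch for rankings without ties follows (illustrative, not the implementation used for Table 6):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau (Kendall, 1938) for two tie-free rankings:
    (concordant - discordant) pairs over all pairs."""
    pairs = list(combinations(range(len(rank_a)), 2))
    concordant = sum(
        (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) > 0 for i, j in pairs)
    discordant = sum(
        (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0 for i, j in pairs)
    return (concordant - discordant) / len(pairs)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # identical rankings: 1.0
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # opposite rankings: -1.0
print(kendall_tau([1, 2, 3, 4], [1, 2, 4, 3]))  # one swapped pair
```

Swapping a single adjacent pair of four items flips one of the six pairs, giving tau = (5 − 1)/6 ≈ 0.67.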

Table 6.

Kendall's rank correlation matrix of rankings. 'Within HC' and 'within GE' correlate rankings inside the respective evaluation scheme; 'across' compares HC rankings (rows) with GE rankings (columns).

Task 1A, within HC:
|     | ACC  | PPV  | TPR  | F1   |
| ACC | 1.00 | 0.33 | 0.11 | 1.00 |
| PPV |      | 1.00 | 0.55 | 0.33 |
| TPR |      |      | 1.00 | 0.11 |
| F1  |      |      |      | 1.00 |

Task 1A, within GE:
|     | ACC  | PPV  | TPR  | F1   |
| ACC | 1.00 | 0.61 | 0.06 | 0.06 |
| PPV |      | 1.00 | 0.00 | 0.11 |
| TPR |      |      | 1.00 | 0.00 |
| F1  |      |      |      | 1.00 |

Task 1A, across (HC rows vs. GE columns):
|     | ACC  | PPV  | TPR  | F1   |
| ACC | 0.50 | 0.22 | 0.44 | 0.00 |
| PPV | 0.17 | 0.22 | 0.11 | 0.11 |
| TPR | 0.06 | 0.11 | 0.11 | 0.00 |
| F1  | 0.50 | 0.22 | 0.44 | 0.00 |

Task 1B, within HC:
|     | ACC  | PPV  | TPR  | F1   |
| ACC | 1.00 | 0.47 | 0.07 | 0.20 |
| PPV |      | 1.00 | 0.20 | 0.33 |
| TPR |      |      | 1.00 | 0.20 |
| F1  |      |      |      | 1.00 |

Task 1B, within GE:
|     | ACC  | PPV  | TPR  | F1   |
| ACC | 1.00 | 0.60 | 0.07 | 0.07 |
| PPV |      | 1.00 | 0.07 | 0.20 |
| TPR |      |      | 1.00 | 0.87 |
| F1  |      |      |      | 1.00 |

Task 1B, across (HC rows vs. GE columns):
|     | ACC  | PPV  | TPR  | F1   |
| ACC | 0.20 | 0.33 | 0.60 | 0.47 |
| PPV | 0.07 | 0.07 | 0.33 | 0.20 |
| TPR | 0.20 | 0.20 | 0.07 | 0.07 |
| F1  | 0.47 | 0.07 | 0.33 | 0.47 |

One disadvantage of complex evaluation measures is the computational cost of applying them. In general, the run-time of the proposed approach is linear in the number of predictions and non-linear in the complexity of the graph, measured, e.g. in the number of nodes and edges. To quantitatively assess the performance of our implementation of the hierarchical confusion matrix, Table 7 presents run-times (on a standard desktop computer with an 8th-generation Intel i5 processor) and the complexities of the problems. The table summarises the run-times, the number of records processed, the complexity of the hierarchies measured in number of edges, as well as the run-time per record (Run-time 1) and the run-time per record per edge (Run-time 2). As the run-times show, the hierarchical confusion matrix is computationally efficient. The slight increase of Run-time 2 with graph complexity indicates a mildly non-linear relationship between complexity and run-time.
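The normalised run-times in Table 7 are plain ratios and can be reproduced directly from the raw columns:

```python
# Table 7 raw columns: total run-time [s], #records, #edges.
rows = {
    "GermEval 2019 Task 1A": (6.09, 91_476, 8),
    "Transposon Classification": (8.31, 54_036, 18),
    "GermEval 2019 Task 1B": (297.16, 91_476, 343),
}
for name, (seconds, records, edges) in rows.items():
    rt1 = seconds / records    # Run-time 1: seconds per record
    rt2 = rt1 / edges          # Run-time 2: seconds per record per edge
    print(f"{name}: {rt1:.2e} s/record, {rt2:.2e} s/record/edge")
```

Running this reproduces the Run-time 1 and Run-time 2 columns of Table 7 to the displayed precision.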

Table 7.

Run-time analysis of hierarchical confusion matrix

| Exercise | Run-time [s] | #Records | #Edges | Run-time 1 [s] | Run-time 2 [s] |
| GermEval 2019 Task 1A | 6.09 | 91,476 | 8 | 6.66E-05 | 8.32E-06 |
| Transposon Classification | 8.31 | 54,036 | 18 | 1.54E-04 | 8.54E-06 |
| GermEval 2019 Task 1B | 297.16 | 91,476 | 343 | 3.25E-03 | 9.47E-06 |

6 Conclusions

In this work, we proposed the novel concept of the hierarchical confusion matrix, developed it into a generalised form, and proved its applicability to all types of hierarchical classification problems, including DAG structures, MPL, and NMLNP problems. Moreover, we showed that it conforms to various desirable attributes of hierarchical evaluation schemes. The evaluation concept was demonstrated on three real-world hierarchical classification benchmarks, and the results were compared to alternative evaluation measures from the literature, supporting the soundness of the proposed concept. The hierarchical confusion matrix allows for the evaluation of hierarchical classification problems using measures based on binary classification. Due to its broad applicability, soundness, and simplicity, it can facilitate future hierarchical classification benchmarking studies and could become a standard for hierarchical classification evaluation. Future studies could assess and compare the consistency of classification metrics. Follow-up studies could apply these metrics to data sets with different hierarchies and examine whether similar rankings result, which would indicate whether the metrics are robust and consistent.

Acknowledgments

The authors thank the referees, the Associate Editor, and the Joint Editor of the journal for thoughtful comments and suggestions that greatly improved the quality of the manuscript. We also thank the Evergrande Center for Immunologic Diseases for their Startup funds support.

Data availability

The implementation of the hierarchical confusion matrix in Python programming language, and the examples and experiments of this study are available on GitHub (https://github.com/DerKevinRiehl/HierarchicalConfusionMatrix).

References

Blockeel, H., Bruynooghe, M., Džeroski, S., Ramon, J., & Struyf, J. (2002). Hierarchical multi-classification. In Workshop Notes of the KDD'02 Workshop on Multi-Relational Data Mining (pp. 21–35). KU Leuven. https://lirias.kuleuven.be/1653518?limo=0

Borges, H. B., Silla, C. N. Jr., & Nievola, J. C. (2013). An evaluation of global-model hierarchical classification algorithms for hierarchical classification problems with single path of labels. Computers & Mathematics with Applications, 66(10), 1991–2002. https://doi.org/10.1016/j.camwa.2013.06.027

Cesa-Bianchi, N., Gentile, C., Tironi, A., & Zaniboni, L. (2005). Incremental algorithms for hierarchical classification. Advances in Neural Information Processing Systems, 17, 233–240. https://proceedings.neurips.cc/paper/2004/hash/78f7d96ea21ccae89a7b581295f34135-Abstract.html

Cesa-Bianchi, N., Gentile, C., & Zaniboni, L. (2006). Hierarchical classification: Combining Bayes with SVM. In Proceedings of the 23rd International Conference on Machine Learning (pp. 177–184). https://doi.org/10.1145/1143844.1143867

Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1), 1–13. https://doi.org/10.1186/s12864-019-6413-7

Costa, E., Lorena, A., Carvalho, A. C. P. L. F., & Freitas, A. (2007). A review of performance evaluation measures for hierarchical classifiers. In Evaluation Methods for Machine Learning II: Papers from the AAAI-2007 Workshop (pp. 1–6). https://kar.kent.ac.uk/14562/

Dumais, S., & Chen, H. (2000). Hierarchical classification of web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 256–263). https://doi.org/10.1145/345508.345593

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874. https://doi.org/10.1016/j.patrec.2005.10.010

Freitas, A., & Carvalho, A. (2007). A tutorial on hierarchical classification with applications in bioinformatics. Research and Trends in Data Mining Technologies and Applications, 175–208. https://doi.org/10.4018/978-1-59904-271-8.ch007

Gauch, H. G. Jr., & Whittaker, R. H. (1981). Hierarchical classification of community data. The Journal of Ecology, 69(2), 537–557. https://doi.org/10.2307/2259682

Gordon, A. D. (1987). A review of hierarchical classification. Journal of the Royal Statistical Society: Series A (General), 150(2), 119–137. https://doi.org/10.2307/2981629

Holden, N., & Freitas, A. A. (2006). Hierarchical classification of G-protein-coupled receptors with a PSO/ACO algorithm. In Proceedings of the IEEE Swarm Intelligence Symposium (SIS'06) (pp. 77–84). IEEE Press.

Ipeirotis, P. G., Gravano, L., & Sahami, M. (2001). Probe, count, and classify: Categorizing hidden web databases. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (pp. 67–78). https://doi.org/10.1145/375663.375671

Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1–2), 81–93. https://doi.org/10.1093/biomet/30.1-2.81

Kiritchenko, S., Matwin, S., & Famili, A. F. (2004). Hierarchical text categorization as a tool of associating genes with gene ontology codes. In European Workshop on Data Mining and Text Mining in Bioinformatics (pp. 30–34).

Kiritchenko, S., Matwin, S., & Famili, A. F. (2005). Functional annotation of genes using hierarchical text categorization. In Proceedings of the ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. https://www.researchgate.net/profile/Svetlana-Kiritchenko/publication/44046343_Functional_Annotation_of_Genes_Using_Hierarchical_Text_Categorization/links/09e4150b3962ea0f9c000000/Functional-Annotationof- Genes-Using-Hierarchical-Text-Categorization.pdf

Kiritchenko, S., Matwin, S., Nock, R., & Famili, A. F. (2006). Learning and evaluation in the presence of class hierarchies: Application to text categorization. In Conference of the Canadian Society for Computational Studies of Intelligence (pp. 395–406). Springer.

Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words (Technical Report). Stanford InfoLab.

Kosmopoulos, A., Partalas, I., Gaussier, E., Paliouras, G., & Androutsopoulos, I. (2015). Evaluation measures for hierarchical classification: A unified view and novel approaches. Data Mining and Knowledge Discovery, 29(3), 820–865. https://doi.org/10.1007/s10618-014-0382-x

Remus, S., Aly, R., & Biemann, C. (2019). GermEval 2019 Task 1: Hierarchical classification of blurbs. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019).

Riehl, K., Riccio, C., Miska, E. A., & Hemberg, M. (2022). TransposonUltimate: Software for transposon classification, annotation and detection. Nucleic Acids Research, 50(11), e64–e64. https://doi.org/10.1093/nar/gkac136

Silla, C. N., & Freitas, A. A. (2011). A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1–2), 31–72. https://doi.org/10.1007/s10618-010-0175-9

Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In Australasian Joint Conference on Artificial Intelligence (pp. 1015–1021). Springer.

Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437. https://doi.org/10.1016/j.ipm.2009.03.002

Sorower, M. S. (2010). A literature survey on algorithms for multi-label learning. Department of Computer Science, Oregon State University.

Sun, A., & Lim, E.-P. (2001). Hierarchical text classification and evaluation. In Proceedings 2001 IEEE International Conference on Data Mining (pp. 521–528). IEEE.

Wang, K., Zhou, S., & Liew, S. C. (1999). Building hierarchical classifiers using class proximity. In VLDB (Vol. 99, pp. 7–10). Citeseer.

Footnotes

The union of the neighbours of each node on the common path, excluding nodes of the true path and duplicates. In more complex problems such as DAGs, nodes can occur multiple times in the union of the neighbours of each node on the common path, which is why duplicates are excluded. Moreover, nodes of the true path can appear in this union set and are therefore explicitly removed.

Author notes

Conflict of interest: None declared.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.