Thesis
University of Helsinki
Faculty of sciences, Kumpula
Department of mathematics and statistics
Prof. Dr. Jukka Corander
Master's thesis
A comparison between different classification methods: Logistic Regression, Support Vector Machines, Naive Bayes and Bayesian Networks.
Tiedekunta/Osasto Fakultet/Sektion – Faculty
Sciences, Kumpula
Laitos/Institution– Department
Mathematics and Statistics
Tekijä/Författare – Author
Pedro Ernesto Alonso de los Reyes
Työn nimi / Arbetets titel – Title
A Comparison Between Different Classification Methods: Logistic Regression, Support Vector Machines, Naive Bayes and Bayesian Networks
Oppiaine /Läroämne – Subject
Bayesian Statistics
Työn laji/Arbetets art – Level
Master's
Aika/Datum – Month and year
August 2014
Sivumäärä/ Sidoantal – Number of pages
49
Tiivistelmä/Referat – Abstract
The purpose of this thesis is to compare different classification methods on the basis of their results for accuracy, precision and recall. The comparisons made are logistic regression against naive Bayes, support vector machines against naive Bayes, and a full Bayesian network against the others.
This thesis was done at the University of Helsinki. The methods are divided into sections that cover the main idea of each method, its explanation, and the actual comparison between them; the focus is on understanding each method, applying it to a data set and obtaining the response.
It was concluded that the Bayesian methods are well suited to the classification task: they are as good as their counterparts, and sometimes better. While support vector machines remain the best all-around method, the Bayesian approach has very little downside in its learning curve and approximates the traditional methods' power well. The resulting ranking was: logistic regression as the least reliable method for classification, then naive Bayes, next Bayesian networks, and finally support vector machines as the best.
Avainsanat – Nyckelord – Keywords
Bayes, Bayes Networks, Support Vector Machines, Logistic Regression, Classification.
Säilytyspaikka – Förvaringställe – Where deposited
Muita tietoja – Övriga uppgifter – Additional information
Table of Contents
1 Introduction 1
1.1 Generative 2
1.2 Discriminative 2
2 Data Used 3
3 Methods 3
3.1 Logistic Regression 4
3.1.1 Log-odds 4
3.1.2 Model 4
3.1.3 Objective function 6
3.1.4 Implementations 7
3.2 Support Vector Machines 7
3.2.1 Bases of classification 7
3.2.2 Model 8
3.2.3 Mathematical concepts 8
3.2.4 Vector operations needed 9
3.2.5 Hyperplanes as Decision Surfaces 9
3.2.6 Algorithm 10
3.2.7 Optimisation 10
3.2.8 SVM Linearly Separable Data: Binary Classification in Dual Form 10
3.2.9 Dual Problem: Derivation 11
3.2.10 Kernel Trick 12
3.2.11 Implementations 14
3.3 Bayesian networks 14
3.3.1 Model 14
3.3.1.1 DAGS 15
3.3.1.2 CPT 15
3.3.1.3 Bayesian networks 16
3.3.1.4 Bayesian networks DAG parametrisation 17
3.3.2 Inference in BN 20
3.3.2.1 Factors 20
3.3.2.2 The join-tree 22
3.3.2.3 Message passing 24
3.3.3 Naive Bayes 25
3.3.3.1 Structure and design 26
3.3.3.2 Implementation 27
3.3.4 Full Bayesian Network 27
3.3.4.1 Structure and design 27
4 Results 29
4.0.1 Plots 29
5 Discussion 35
6 Conclusion 35
7 Appendix A (Code) 36
7.1 Code in Logistic Regression 36
7.2 Code used in Support Vector Machines 38
7.3 Code in Naive Bayes 38
7.4 Code for Bayes Network data set 1 39
7.5 Code for Bayes Network data set 2 40
8 Bibliography 41
List of Abbreviations and Symbols
LR
Logistic regression
SVM
Support vector machines
NB
Naive Bayes
BN
Bayesian Networks
1 Introduction
This thesis focuses on machine learning, viewed as a subfield of computer science and artificial intelligence that deals with the construction and study of systems that learn from data, instead of following explicitly programmed instructions. The main topic of this document is classification.
There are three main types of machine learning: supervised, unsupervised and reinforcement learning. The supervised type, in the words of B. Liu, is "also called classification or inductive learning in machine learning. This type of learning is analogous to human learning from past experience to gain new knowledge in order to improve our ability to perform real-world tasks. However, since computers do not have 'experiences', machine learning learns from data, which are collected in the past and represent past experiences in some real-world applications".[[1]]
There are several types of supervised learning tasks. The models considered here are all used to predict the value of a discrete class attribute; some of them do so by minimising a cost function through optimisation, while the Bayesian approach requires neither a cost function nor an explicit optimisation step.
In learning, one thing to consider is whether the model is generative (for example Bayesian networks) or discriminative (for example logistic regression)[[2], [3]]. The two approaches will be explained shortly.
The key topics covered in this document are logistic regression, support vector machines, naive Bayes and Bayesian networks.
To compare the methods several measures will be used: accuracy, precision and recall.
The working hypotheses in this thesis are the following:
1. Hypothesis #1: Naive Bayes is as good as or better than logistic regression. [[4]]
2. Hypothesis #2: Bayesian networks are as good as logistic regression.[[5]]
3. Hypothesis #3: Naive Bayes and Bayesian networks can approach the classification power of support vector machines.[[6]]
While these hypotheses have been covered in the literature, in [[3], [5], [7]–[9]], the objective of this thesis is to present a clear picture of the different models and methods; building on that literature, this document serves to illustrate the efficiency of the methods. To assess the validity of the hypotheses, the measures used here are accuracy, precision and recall, defined in binary classification (with TP, TN, FP and FN the counts of true positives, true negatives, false positives and false negatives) as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
These measures have the following meaning: accuracy measures how many examples were classified as what they really are, precision is the proportion of correctly classified positives among all examples classified as positive, and recall is the fraction of the relevant (truly positive) instances that were retrieved.
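As an illustration, the three measures can be computed directly from the confusion-matrix counts. The following is a minimal Python sketch (the thesis's own implementations are in Appendix A; the helper name here is ours):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```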
The generative and discriminative approaches are explained next. These are important because each has advantages and disadvantages over the other, for example in what they model, how they do it, and which parameters they use, to name a few.
1.1 Generative
Generative classifiers use Bayes' rule to obtain the probability of each class given the data. In supervised learning problems for classification, what is sought is an approximation of an unknown target function f: X → Y that maps the data points to the corresponding class.
As a start, let Y be a binary random variable and X the data, with k vectors of n boolean attributes each. More explicitly, X = (X_1, ..., X_n), where X_i is the boolean variable denoting the ith attribute of X.
Now Bayes' rule states:
P(Y = y_m | X = x_k) = P(X = x_k | Y = y_m) P(Y = y_m) / Σ_j P(X = x_k | Y = y_j) P(Y = y_j),
where y_m is the mth possible value of Y and x_k the kth vector in the data set X; the summation is the normalising constant. Thus the way this classifier works is to estimate P(X | Y) directly from the data, as well as P(Y). After counting the attributes, one uses Bayes' rule to determine P(Y | X = x_k) for any new instance x_k.
This is a generative model, which can be thought of as describing how to generate random instances of X conditioned on the target attribute.[[2]]
1.2 Discriminative
In discriminative classifiers the distribution can be thought of as directly discriminating the value of the target Y for any given instance x of X. The aim is the same as in the generative case: to learn a function of the form f: X → Y, or P(Y | X) in probability notation.
The form is defined by P(Y | X; θ), where θ are the parameters of the model. Then, from probability calculus, the likelihood function is:
L(θ) = Π_i P(y_i | x_i; θ).
Following Bayes' rule, this can be combined with a prior to get the joint probability, P(X, Y) = P(Y | X) P(X); obtaining the posterior is then a matter of normalising, as before.[10]
But a difference here is that a discriminative model does not care how the data was generated; it simply categorises a given vector. As an example, consider the need to predict the labels y of a set of vectors x: this becomes predicting the most likely label as
ŷ = argmax_y P(y | x),
where P(y | x) is the posterior shown before. This is akin to trying to model the decision boundary between the classes.
The difference between the two approaches is that in the first the modelling concerns the joint probability P(X, Y), while in the second the interest lies in modelling the conditional probability P(Y | X). As shown here, both models involve Bayes' theorem, but apply it in different forms.
2 Data Used
The data used to compare the models was taken from an online Machine Learning class on coursera.com, taught by professor Andrew Ng and taken in 2013; it is artificial data intended for getting to know all the classifiers. From now on, Y denotes the points in the data and C the classes.
The records in the first data set are made-up admissions to a school based on two test scores; they are used in logistic regression and naive Bayes to classify applicants into admitted or not. The second data set contains random points with two features and is used in support vector machines, naive Bayes and Bayesian networks.
This was done to compare the models; the aim of the comparison is to check how well hypotheses #1, #2 and #3 hold.
3 Methods
The methods used here were selected because they are among the most applied in the literature and because the data were available, having previously been used in the course taken; the two methods developed in that course (logistic regression and support vector machines) are compared against the Bayesian approach.
3.1 Logistic Regression
The first classifier explained here is to be considered the baseline against which the others will be measured. The data used here is the first set mentioned previously; since the other classifiers perform better on it, as is expected from the literature, it will not be compared again.
To start with logistic regression (LR), the log-odds are explained first, as they form an important part of the classifier; then come the general model used here, the objective function, the implementation (referred to the appendices) and the results obtained.
3.1.1 Log-odds
Odds are sometimes a better scale than probabilities to represent uncertainty. They are used to represent payoffs for bets; for example, 3-1 against means that a one-unit bet will pay 3. The correspondence to probability is as follows: 3-1 against is represented as odds of 1/3 in favour of the event (probability p = 1/4), and 3-1 in favour as odds of 3. In general the following expressions are obtained:
odds = p / (1 - p) and p = odds / (1 + odds),
so the log-odds (or logit) is log(p / (1 - p)).
The reason to use this is that the LR model assumes the log-odds of an observation can be represented as a linear function of the input variables.
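The conversions above can be sketched in a few lines of Python (illustrative helpers, not the thesis's appendix code):

```python
import math

def odds(p):
    """Odds in favour of an event with probability p."""
    return p / (1 - p)

def prob_from_odds(o):
    """Invert the odds: p = o / (1 + o)."""
    return o / (1 + o)

def log_odds(p):
    """The logit, which LR models as a linear function of the inputs."""
    return math.log(odds(p))

# 3-1 against means p = 1/4, so the odds in favour are 1/3
```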
3.1.2 Model
LR is concerned with modelling the input features, taken as independent from one another, to obtain the corresponding class; in a dichotomous data set like the one used here, a class 1 or 0. The model separates the data vectors into the corresponding classes, taking 1 as the positive example and 0 as the negative (false) one.
To map the data in general, not only into two classes, a logistic function is used, of the form
g(z) = 1 / (1 + e^(-z)),
whose range is (0, 1) and whose domain is (-∞, ∞); it is therefore a valid probability that can be used to assign the most likely label to an instance.
Then, to obtain the LR model from g, the variable z is written as a linear sum of the form
z = β_0 + β_1 x_1 + ... + β_n x_n,
where the x's are the independent variables and the β's are constant terms representing unknown parameters. Then z can be considered an index combining the x's.
The unknown parameters are called the coefficients of the equation, β_0 being the intercept and the remaining β_i's the slopes.
The model is then, more formally:
P(Y = 1 | x) = 1 / (1 + e^(-(β_0 + Σ_i β_i x_i)))    (LR.1)
with this being a valid probability.
What is being looked for here is the odds ratio P(Y = 1 | x) / P(Y = 0 | x) for the two classes in the data. Given one vector x, the complementary probability is
P(Y = 0 | x) = 1 - P(Y = 1 | x) = e^(-(β_0 + Σ_i β_i x_i)) / (1 + e^(-(β_0 + Σ_i β_i x_i))),
so the odds ratio becomes
P(Y = 1 | x) / P(Y = 0 | x) = e^(β_0 + Σ_i β_i x_i).
Then, taking logarithms,
log [ P(Y = 1 | x) / P(Y = 0 | x) ] = β_0 + Σ_i β_i x_i.
This is the LR formula with two classes given one data vector: the log-odds is linear in x. The probability in LR.1 will also be called the hypothesis function
h_θ(x) = 1 / (1 + e^(-θ^T x))    (LR.2)
where θ are the coefficients and T indicates the transpose in matrix form.
As mentioned before, a cost function will be defined to obtain an estimate of the parameters θ.
3.1.3 Objective function
To develop this classifier one needs to measure how good the predictions are compared to the truth; in this case, how costly it will be to make an error labelling an instance as class 1 when it is 0, or vice versa.
The cost function can be defined as suits one's needs; here the definition is based on the literature on LR.
The two cases can be specified as:
Cost(h_θ(x), y) = -log(h_θ(x))      if y = 1
Cost(h_θ(x), y) = -log(1 - h_θ(x))  if y = 0
Then, for the m data points in the training set, the cost function is:
J(θ) = -(1/m) Σ_{i=1}^{m} [ y^(i) log h_θ(x^(i)) + (1 - y^(i)) log(1 - h_θ(x^(i))) ].
One must then seek the θ that minimises this function; for this the sigmoid function has a useful property:
g'(z) = g(z)(1 - g(z)).
The minimum can be searched for by setting the partial derivatives to 0:
∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i) = 0,
where the property of the sigmoid was applied together with the usual chain rule for derivatives.
The last formula can be used in the gradient descent algorithm.
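The gradient descent update built from that formula can be sketched as follows; this is a minimal Python illustration under our own naming (the thesis's actual code is in Appendix A):

```python
import math

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Minimise the LR cost J(theta) by batch gradient descent.
    X is a list of feature vectors; a leading 1 is added for the intercept."""
    X = [[1.0] + list(x) for x in X]
    theta = [0.0] * len(X[0])
    m = len(y)
    for _ in range(iters):
        # h_i = g(theta . x_i) for every training example
        h = [sigmoid(sum(t * xj for t, xj in zip(theta, x))) for x in X]
        # grad_j = (1/m) * sum_i (h_i - y_i) * x_ij
        grad = [sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m
                for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

def predict(theta, x):
    """Label 1 when the modelled probability is at least 0.5."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return 1 if sigmoid(z) >= 0.5 else 0
```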
3.1.4 Implementations
See Appendix A 4.1.
3.2 Support Vector Machines
The next topic in data classification is support vector machines (SVM), a model first introduced by Vapnik in 1979; this part can be considered an introduction to SVM.[12] Here we focus on classification, covering linearly separable elements, the SVM method in its dual formulation, and the kernel trick (first in general, then Gaussian) for the data sets used here.
SVM are considered state-of-the-art classifiers due to their generalisation performance (error rates on test sets); the comparison with the Bayesian methods is to see how well the latter can approach the former.
3.2.1 Bases of classification
To introduce SVMs, first consider the example in the book by Statnikov: deciding "whether on a shore line the things being looked at are houses or boats. The simplest way to do that is to look at the coast line and say everything above that line is a house and everything below a boat. In that case the decision surface will be the line that separates the two classes. This is obviously not ideal: as some boats can be outside the water, they will be mislabeled as houses"[13]. Classification algorithms build models in a similar way. The coast line in this example is one of the possible decision surfaces, while there are infinitely many others that could classify the objects with no errors.
Once a decision surface is defined, the following classification rule would suffice: new objects (points) are classified as boats if they fall below the decision surface, and as houses if they do not.
3.2.2 Model
SVM is a classification model that solves the problem of identifying objects. SVM is very important because it works well even when the number of variables in the data is very large and the sample size is small, and it can learn from a very large range of models while avoiding overfitting.
SVM seeks a linear decision surface, one that looks like a line in a two-dimensional plane, separating the classes of samples with the largest distance, or largest margin, between the borderline samples; the points that lie on this border are called support vectors.
If no linear decision surface can separate the samples, the SVM can still do the required classification by mapping the data into a much higher-dimensional feature space. The mathematical projection into this feature space is known as the kernel trick.
3.2.3 Mathematical concepts
This classification method requires a geometrical representation of the samples in order to build a decision surface. The representation is as follows: assuming that an object is described by n characteristics, called features or variables, represent each object as a vector in n-dimensional space. E.g. a patient in a hospital could be represented by two variables, systolic blood pressure = 120 mmHg and age = 49 years; this would be a vector in R^2 whose tail is at the origin and whose head is at the point (120, 49) in the plane. This representation allows geometrically representing the decision surface that separates the classes, in this case two.
3.2.4 Vector operations needed
The algorithms in SVM require their inputs in the form of vectors represented in R^n, where n is the number of features or characteristics of interest that describe a sample. The SVM then performs mathematical operations on these vectors in order to build a decision surface. These operations consist of: scalar multiplication, addition, subtraction, computation of the Euclidean length, and the dot product. The definition of a vector v is v = (v_1, ..., v_n), located in the space R^n.
3.2.5 Hyperplanes as Decision Surfaces
A hyperplane is a linear decision surface that splits the space into two parts; it is an (n - 1)-dimensional subspace of R^n. With it, a simple decision rule for classification can be implemented: all samples (points) lying above the hyperplane are classified as members of one class, and all below as the other class.
In SVM classification the fundamental things needed are thus a hyperplane and the rule above.
To define it mathematically, a hyperplane consists of a point P_0 on it and a vector w perpendicular to it. As an example, consider the two vectors from the origin O of the coordinates to P_0 and to an arbitrary point P on the hyperplane. Since P and P_0 both lie on the hyperplane, the vector P - P_0 lies in the hyperplane, and w is perpendicular to it. This is equivalent to w · (P - P_0) = 0, or w · P - w · P_0 = 0. Defining b = -w · P_0, the equation of the hyperplane becomes w · x + b = 0. This equation also holds for any parallel hyperplane w · x + b' = 0 with b' ≠ b; the parameter b controls which of the infinitely many parallel hyperplanes is obtained.
The distance between two parallel hyperplanes w · x + b_1 = 0 and w · x + b_2 = 0 is given by D = |b_1 - b_2| / ||w||, which can be viewed as the Euclidean length of a vector between the hyperplanes.
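As a small illustration of the decision rule and the distance formula above, the following Python sketch (helper names are ours, not from the thesis) classifies a point by the sign of w · x + b and measures the gap between two parallel hyperplanes:

```python
import math

def classify(w, b, x):
    """Decision rule: the sign of w . x + b decides the class (+1 or -1)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

def parallel_distance(w, b1, b2):
    """Distance between the hyperplanes w . x + b1 = 0 and w . x + b2 = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(b1 - b2) / norm
```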
3.2.6 Algorithm
To find hyperplanes the following algorithm can be applied:
1.-Find a point in the space of the equation, call it P_0.
2.-Place a vector w orthogonal to the desired hyperplane.
3.-Set the equation of the hyperplane as w · (x - P_0) = 0, the hyperplane with normal w passing through P_0.
4.-Optional: add or subtract a value b from that equation to get more (parallel) hyperplanes.
As this algorithm gives an infinite number of hyperplanes, the one with the largest margin separating the classes is needed; the optimisation of that equation follows.
3.2.7 Optimisation
SVM is a method to find the best separating hyperplane for the data, once the point in space and the decision surface that allow classifying samples into two classes have been located. SVM is not about just finding a hyperplane that separates the samples, but the hyperplane with the largest margin, or "gap", between the borderline samples, i.e. the samples on the border of the margin.
Convex functions are those for which, for any two points in the space, the function lies below the straight line connecting the values at those two points. They also have the property that any local minimum is a global one. To carry out this optimisation, quadratic programming (QP) is needed: an optimisation problem such that the function to optimise (the objective) is quadratic, subject to linear constraints.
If the objective function of a QP problem is convex, it is called a convex QP problem; these kinds of problems are easily and efficiently solvable using greedy search algorithms, due to the convexity property that every local minimum is also a global one.
3.2.8 SVM Linearly Separable Data: Binary Classification in Dual Form
This is the simplest SVM that can be done: the data falls into two visibly separated groups and the hyperplane can be found with no extra complications; more explicitly, the kernel trick need not be used. This will be explained with the dual formulation of the problem.
This type of SVM is also referred to as hard-margin, used when the data is separable and has no outliers within the classes, unlike the soft-margin case where there are outliers.
The dual formulation is a convex QP problem with N variables α_1, ..., α_N, N being the number of samples to classify.
The objective here is to maximise
W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
with the constraints α_i ≥ 0 and Σ_i α_i y_i = 0. This problem comes from the standard form of the optimisation: minimise (1/2)||w||^2, with the constraints y_i (w · x_i + b) ≥ 1 for all i.
This can be transformed into a Lagrange problem, because solving a convex optimisation problem equals optimising the following Lagrangian function:
L(w, b, α) = (1/2)||w||^2 - Σ_i α_i [ y_i (w · x_i + b) - 1 ]    [SVM.1]
with L being the Lagrangian function. Here the only extra conditions being dealt with are the inequalities α_i ≥ 0, due to the Lagrange transformation.
The dual formulation equals the primal (not shown here) if they satisfy some equivalence conditions, such as: the objective function is convex, the inequality constraints g_i are convex, and there exists some value that makes every g_i strictly less than 0. Since these conditions hold here, the solutions of the dual and the primal are equivalent.
3.2.9 Dual Problem: Derivation
The dual problem defined in SVM.1 can be solved in a more convenient way using a Lagrangian reformulation, as follows. Here f(w) = (1/2)||w||^2 is the objective function and g_i(w, b) = 1 - y_i (w · x_i + b) ≤ 0 are the inequality constraints, which can be expressed as:
L(w, b, α) = (1/2)||w||^2 + Σ_i α_i [ 1 - y_i (w · x_i + b) ]    SVM.2
Then the current goal is to find the minimum of the Lagrangian. As the Lagrangian is a convex function, from a previous point it is known that only one minimum occurs, where the derivatives are zero. Thus the next step is to get:
∂L/∂w = w - Σ_i α_i y_i x_i = 0,    ∂L/∂b = -Σ_i α_i y_i = 0.
Then:
w = Σ_i α_i y_i x_i,    Σ_i α_i y_i = 0.
Plugging the previous values into the Lagrangian equation:
L = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j).
Maximising this expression with respect to α, applying the constraints from the partial derivatives, gives the dual formulation of the problem:
maximise W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
with constraints α_i ≥ 0 for all i and Σ_i α_i y_i = 0.
This version of the problem is easy to solve, and it depends on the data only through the dot products x_i · x_j; in the kernel trick this will become important.
This problem can be solved with an array of techniques; any type of hill climbing can work, for example Sequential Minimal Optimisation (SMO), with the constraints stated above. Next, the kernel trick is explained.
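A full SMO solver is beyond this sketch, but the dual objective itself is simple to evaluate, and w can be recovered from a dual solution. The following Python sketch (function names are illustrative, not from the thesis) shows both:

```python
def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(alpha)
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def weight_vector(alpha, X, y):
    """Recover w = sum_i alpha_i y_i x_i from the dual solution."""
    n_feat = len(X[0])
    return [sum(alpha[i] * y[i] * X[i][j] for i in range(len(X)))
            for j in range(n_feat)]
```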
3.2.10 Kernel Trick
The kernel trick is a transformation of the data space, used when the data are not linearly separable, or when the mix of classes is too complex for another technique.
The idea is to map the features into a new, higher-dimensional space and find a hyperplane there; i.e. if the features are two-dimensional and there is no clear separation, then mapping them into three dimensions with the kernel trick may reveal a hyperplane able to separate them better.
Here the explanation is first general; then a Gaussian kernel is applied to the current data.
Thus the trick is defined as follows.
For a normal input vector x = (x_1, ..., x_n), consider a function φ that maps the features into a higher-dimensional space:
φ: R^n → R^N, with N > n.
The mapped data might be close to linearly separable, so running the SVM on these features instead of the original ones can be a better approach.
Then a kernel function K on data points x, z is defined as:
K(x, z) = φ(x) · φ(z).
Instead of applying the SVM to the original x_i, it is applied to the new learned data φ(x_i); to this end, simply replace each input x_i with φ(x_i). Since the dual SVM problem can be written in terms of dot products of data points, the new dot products are those obtained from the kernel function. Since both the SVM optimisation problem and the resulting classifier handle data only through dot products, even if computing φ(x) is inefficient, the SVM can be efficient as long as the computation of the kernel is done efficiently.
As an example, consider the kernel function K(x, z) = (x · z)^2. This can be calculated in time linear in the dimension of x, as follows:
K(x, z) = (Σ_i x_i z_i)(Σ_j x_j z_j) = Σ_i Σ_j (x_i x_j)(z_i z_j) = φ(x) · φ(z),
where φ is the extension of the previously described transformation to N = n^2 dimensions, that is, the mapping whose coordinates are φ_{ij}(x) = x_i x_j.
Generalising, a kernel of this polynomial form corresponds to a feature mapping into a much higher-dimensional feature space. The kernel used here, for the second data set, was the Gaussian kernel or RBF, defined as
K(x, z) = exp( -||x - z||^2 / (2σ^2) ).
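The Gaussian kernel and the Gram matrix the dual problem works with can be sketched as follows (a minimal Python illustration, not the thesis's appendix code):

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq / (2 * sigma ** 2))

def gram_matrix(X, sigma=1.0):
    """Kernel (Gram) matrix: the dot products the dual SVM sees instead of raw x_i . x_j."""
    return [[rbf_kernel(a, b, sigma) for b in X] for a in X]
```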
3.2.11 Implementations
See Appendix A 4.2.
3.3 Bayesian networks
This is the main topic of this thesis. The points to be covered are the definition of Bayesian networks (BN), their implementation, their inference algorithms, and some concepts such as the directed acyclic graph (DAG), to name a few. Then, with the same data as for LR and SVM, two types of BN are created, a naive Bayes and a full BN, and compared to the classification of the previous algorithms. The full BNs are built in GeNIe and the data input through jSMILE: the networks are implemented in the former and the data classified with the latter.
3.3.1 Model
The main idea of a BN, also called a belief network, is to model the joint probability of a series of events in a compact way. E.g. if there are seven variables with two outcomes each, the probability table needed to specify all the events has 2^7 - 1 = 127 free entries. These are too many for anyone to specify: it can be difficult to know with certainty what probability to assign to, say, event 120 while making sure everything sums to 1; since this is a probability distribution, Σ_x P(x) = 1 has to hold in order to be valid. A BN allows shrinking that amount, so that a lower number of probabilities needs to be specified in the model. To obtain this specification, graphs are used: the variables involved are modelled as the nodes of a DAG and the influence each variable (node) exerts on the others as arcs; this is the reason they are also called belief networks. The probabilities are specified with a conditional probability table (CPT) on each variable (node) of the graph.
3.3.1.1 DAGS
A DAG is a graphical representation of the variables involved in a problem, its structure, and so on. In a BN the nodes represent propositional variables, and the edges can be thought of as representing direct causal influences; it may not always be that way, but here they will be regarded as such.
BN.I1.- A DAG with 4 variables.
Own creation.
DAGs, as their name indicates, cannot contain cycles: no node may be connected to itself, or put simply, it should not be possible to go from one node back to the same node by following the arrows.
3.3.1.2 CPT
A CPT is a table that specifies the conditional probability of a node given the nodes connected to it (its parents). The CPT of a variable given its parents is represented by Θ; in Fig. BN.I1 the CPT for node H would be Θ_{H|R,S}, giving P(H | R, S). This table gives the probability of each state of the node for every combination of its parents; e.g. the CPT of node H could be as follows, assuming each node has two states (T, F):
R | S | H | P(H | R, S)
T | T | T | 0.95
T | T | F | 0.05
T | F | T | 0.9
T | F | F | 0.1
F | T | T | 0.8
F | T | F | 0.2
F | F | T | 0
F | F | F | 1
T1.- CPT of one variable with 2 parents.
Own creation.
Where, for each combination of the parents, Σ_h P(H = h | r, s) = 1, to make a proper probability. Also, it is worth noting that some entries in the table can be inferred from others, as P(H = F | r, s) = 1 - P(H = T | r, s); therefore, of the complete table of 2^3 = 8 entries, only four probabilities are needed to specify the network, and the rest are completed by the previous result. This is the compactness that a BN uses to model the probability of an event with fewer parameters than the full joint.
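The CPT in table T1 can be stored compactly by keeping only the T rows, exactly as the text notes; a small Python sketch (names are ours, for illustration):

```python
# CPT for H given parents (R, S), matching table T1; only P(H=T | r, s) is stored
cpt_H = {
    (True, True): 0.95,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}

def p_h(h, r, s):
    """P(H = h | R = r, S = s); the F rows follow by complement."""
    p_true = cpt_H[(r, s)]
    return p_true if h else 1.0 - p_true
```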
3.3.1.3 Bayesian networks
Then a BN is composed of two parts: a DAG called the structure (G) and a set of CPTs called the parametrisation (Θ).
In a BN the nodes can be considered as parents, descendants or non-descendants of one another, that is:
-Parents(V).- The set of variables N in a DAG G with a directed edge N → V.
-Descendants(V).- The set of variables N in G reachable from V through directed edges V → ... → N.
-Non-descendants(V).- All the other variables, in neither Parents(V) nor Descendants(V), in the DAG G.
An example is in Fig. BN.I1; for the variables W and R the following results:
-Parents(R).- ∅. As R has no parents.
-Descendants(R).- {W, H}. As these nodes are connected by directed edges from R.
-Non-descendants(R).- {S}. The remaining node of the graph.
-Parents(W).- {R}. As the directed edge goes from R → W.
-Descendants(W).- ∅. As W has no outgoing directed edges.
-Non-descendants(W).- {H, S}. The remaining nodes of the graph.
The graphical structure allows seeing whether any variable depends on any other; this is the advantage that allows simplifying the probability of an event: instead of specifying the full joint, looking at the structure a simplification can be obtained. E.g. in Fig. BN.I1, assuming the variables are all binary, the full probability table would consist of 2^4 - 1 = 15 free entries, while the BN needs only 8 (1 for R, 1 for S, 2 for W | R and 4 for H | R, S), which is a good simplification; with more variables the full joint can become unmanageable.
The independence of the variables is a powerful constraint to take advantage of, where the following result is handy:
I(V, Parents(V), Non-descendants(V))    BN.1
which states that any variable V is independent (I) of its non-descendants given its parents. These are called the Markovian assumptions, denoted Markov(G), stated for every variable V in the graph. When the DAG is viewed as a causal structure, the Parents(V) can be thought of as the direct causes of V, and the Descendants(V) as the effects of V. The formal interpretation of a DAG is as a set of conditional independence statements; it makes no reference to the notion of causality, even if that interpretation is convenient. If a DAG is constructed with the notion of causality, it is possible to get a DAG that agrees with our perception; however, it is also possible to make one that does not match the perception and still states the same independencies that the previous one did. Hence the need to parametrise the DAG, where the quantification process is involved; having a DAG that agrees with the perception makes this easier.
3.3.1.4 Bayesian networks DAG parametrisation
To set the CPTs based on the DAG structure, conforming to the independencies declared by this particular structure, a parametrisation Θ (the CPTs) must be stated that captures the state of beliefs in the domain at hand. First create a DAG G that is consistent with the independencies known in the underlying domain. This DAG is then a partial specification of the form the belief distribution P should take; more specifically, G says that the distribution must satisfy Markov(G) as defined earlier. While this is a constraint on the possible distributions, it does not uniquely define one.
An additional set of conditional probabilities is needed: for every variable X in G with parents U, the probability P(x | u) has to be provided for every value x of X and every instantiation u of the parents U. As it turns out, several of these probabilities are redundant, as they can be inferred from probability calculus:
Σ_x P(x | u) = 1 and, for a binary X, P(X = F | u) = 1 - P(X = T | u).
This explains how the BN is able to shrink the number of probabilities needed relative to the joint distribution.
The factorisation from probability calculus (the chain rule) is:
P(x_1, ..., x_n) = Π_i P(x_i | x_1, ..., x_{i-1}).
Then, with the parametrisation of a BN and the independence assumptions above, the factorisation that gives the joint probability in Fig. BN.I1 becomes:
P(w, h, r, s) = P(w | r) P(h | r, s) P(r) P(s).
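This factorisation can be evaluated directly from the CPTs. The following Python sketch uses hypothetical parameter values (the numbers are made up for illustration) for the DAG of Fig. BN.I1, with edges R → W, R → H and S → H:

```python
# Hypothetical CPT values; only the probabilities of the True state are stored,
# the False entries follow by complement.
p_r, p_s = 0.2, 0.1
p_w_given_r = {True: 0.8, False: 0.1}
p_h_given_rs = {(True, True): 0.95, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}

def bern(p_true, value):
    """P(V = value) for a binary variable with P(V = True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(w, h, r, s):
    """P(w, h, r, s) = P(w | r) P(h | r, s) P(r) P(s), the BN factorisation."""
    return (bern(p_w_given_r[r], w) * bern(p_h_given_rs[(r, s)], h)
            * bern(p_r, r) * bern(p_s, s))

# The 8 stored numbers replace the 15 free entries of the full joint table.
```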
Then the formal definition of a BN becomes:
Definition 1. A BN for variables Z is a pair (G, Θ), where:
. G is a DAG over the variables in Z, called the network structure.
. Θ is a set of CPTs, one for each variable in Z, called the network parametrisation.
The network has some properties of probabilistic independence, like:
-The distribution P induced by the parametrisation Θ of a DAG G is guaranteed to satisfy all the Markov(G) independence assumptions in BN.1, for every variable X in the network.
-Also a few independencies called the graphoid axioms are satisfied; these follow from the definition of the independence statement
I_P(X, Z, Y),
which means the distribution P finds the variables X independent of Y given Z, or formally P(x | z, y) = P(x | z).
The probabilistic independencies obtained from the graphoid axioms are:
-Symmetry. I(X, Z, Y) iff I(Y, Z, X). This is: if knowing y has no influence on x, then knowing x has no influence on y.
-Decomposition. I(X, Z, Y ∪ W) only if I(X, Z, Y) and I(X, Z, W). This is: if yw does not influence x, then y alone, or w alone, has no influence on x.
-Composition. I(X, Z, Y) and I(X, Z, W) only if I(X, Z, Y ∪ W). This does not hold in general: two pieces of information may each be irrelevant on their own while their combination is relevant.
-Weak union. I(X, Z, Y ∪ W) only if I(X, Z ∪ Y, W). This is: if yw is not relevant to x, then after learning the part y, the rest w does not become relevant.
-Contraction. I(X, Z, Y) and I(X, Z ∪ Y, W) only if I(X, Z, Y ∪ W). This property states that if, after learning irrelevant information y, the information w is found to be irrelevant to x, then the combined information yw must have been irrelevant to begin with.
-Intersection. I(X, Z ∪ W, Y) and I(X, Z ∪ Y, W) only if I(X, Z, Y ∪ W).
This last axiom holds only for strictly positive probability distributions (no 0 in any CPT); it says that if information w is irrelevant given y, and information y is irrelevant given w, then the combined information yw is irrelevant to begin with. While this is not true in general, it holds in the case of strictly positive distributions.
These properties allow stating more independencies than BN.1, making them an addendum to the Markovian assumptions; in some cases they help to prove them. Together, the graphoid axioms allow representing independencies beyond Markov(G).
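As a small numeric illustration of the first axiom, here is a Python sketch (not part of the thesis code; all probability values below are made up) that builds a distribution in which X and Y are independent given Z by construction, and checks both directions of symmetry:

```python
import itertools

# Toy distribution P(x, y, z) = P(z) P(x | z) P(y | z), so I_P(X, Z, Y)
# holds by construction; symmetry says I_P(Y, Z, X) must then hold too.
p_z = {0: 0.3, 1: 0.7}
p_x_z = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.4}  # keyed (z, x)
p_y_z = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.9, (1, 1): 0.1}  # keyed (z, y)

joint = {(x, y, z): p_z[z] * p_x_z[(z, x)] * p_y_z[(z, y)]
         for x, y, z in itertools.product((0, 1), repeat=3)}

def indep_given_z(first, second):
    """Check P(a, b | z) = P(a | z) P(b | z) for key coordinates first/second."""
    for z in (0, 1):
        pz = sum(p for k, p in joint.items() if k[2] == z)
        for a in (0, 1):
            for b in (0, 1):
                pab = sum(p for k, p in joint.items()
                          if k[first] == a and k[second] == b and k[2] == z)
                pa = sum(p for k, p in joint.items() if k[first] == a and k[2] == z)
                pb = sum(p for k, p in joint.items() if k[second] == b and k[2] == z)
                if abs(pab / pz - (pa / pz) * (pb / pz)) > 1e-9:
                    return False
    return True

print(indep_given_z(0, 1), indep_given_z(1, 0))  # True True
```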
It is also worth noting that independence in a graph can be asserted through the concept of d-separation, where nodes are viewed as valves along the paths connecting them. This is an important concept in BNs, and a full treatment goes beyond the scope of this thesis; the following theorem summarises it in a way that exemplifies it:
Theorem 1. To test whether X and Y are independent (d-separated) given the nodes in Z, where X, Y and Z are sets of nodes in a DAG G, check whether they are disconnected in a new DAG G', obtained by the following steps:
. Delete all outgoing edges from nodes in Z.
. Recursively delete any leaf node W from G as long as W is not in X ∪ Y ∪ Z.
As an example of the algorithm, consider BN.I1 and test whether d-sep(W, R, S) holds.
-The first step is to delete the outgoing edges from R: the edges R → W and R → H.
-Next, recursively delete any leaves not in {W, R, S}; the only such leaf is H, so it is deleted. (Leaves are nodes with no outgoing edges.)
The new DAG G' contains only W, R and S, all disconnected, so d-sep(W, R, S) is true.
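The pruning test of Theorem 1, together with the BN.I1 example, can be sketched in Python (not the thesis code; the edge set R → W, R → H, S → H is the one from Fig. 1):

```python
def d_separated(edges, X, Y, Z):
    """Theorem 1 pruning test: are X and Y d-separated by Z in the DAG?"""
    keep = set(X) | set(Y) | set(Z)
    # Step 1: delete all edges outgoing from nodes in Z.
    g = {(u, v) for (u, v) in edges if u not in Z}
    # Step 2: recursively delete leaf nodes (no outgoing edges) not in X|Y|Z.
    while True:
        nodes = {n for e in g for n in e}
        leaves = {n for n in nodes
                  if n not in keep and not any(u == n for (u, _) in g)}
        if not leaves:
            break
        g = {(u, v) for (u, v) in g if u not in leaves and v not in leaves}
    # d-separated iff no undirected path connects X and Y in the pruned graph.
    adj = {}
    for u, v in g:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    frontier, seen = set(X), set(X)
    while frontier:
        for m in adj.get(frontier.pop(), ()):
            if m not in seen:
                seen.add(m)
                frontier.add(m)
    return not (seen & set(Y))

edges = {("R", "W"), ("R", "H"), ("S", "H")}
print(d_separated(edges, {"W"}, {"S"}, {"R"}))  # True: d-sep(W, R, S)
print(d_separated(edges, {"R"}, {"S"}, {"H"}))  # False: H is a common child
```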
The last notion of independence touched on here is the concept of the Markov blanket, MB(X).
This is a set of variables that, when known, render any other variable irrelevant to the considered variable, in this case X.
Theorem 2. Let P be a distribution over a set of variables, and let X be one of them. A Markov blanket for X is given by parents(X), children(X) and spouses(X).
Here the spouses(X) are the nodes that share a common child with X. This is another way of stating independence between variables.
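Theorem 2 translates directly into a few set operations; a Python sketch (not the thesis code, reusing the Fig. 1 edges):

```python
def markov_blanket(edges, x):
    """Parents, children and spouses (co-parents of x's children) of node x."""
    parents = {u for (u, v) in edges if v == x}
    children = {v for (u, v) in edges if u == x}
    spouses = {u for (u, v) in edges if v in children and u != x}
    return parents | children | spouses

edges = {("R", "W"), ("R", "H"), ("S", "H")}
print(sorted(markov_blanket(edges, "R")))  # ['H', 'S', 'W']
```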
3.3.2 Inference in BN
The inference performed in the BN, here used for classification, is to obtain the probability of each class for the data vectors and pick the most plausible explanation, i.e. the class they belong to. To do this, factors are explained first, followed by the jointree algorithm.
3.3.2.1 Factors
A factor is a function over a set of variables, mapping each instantiation of them to a non-negative number. It is usually represented by a table listing the possible values the variables can take and the non-negative number assigned to each combination.
This is an example of a factor:
B   C   D   f(B, C, D)
T   T   T   0.95
T   T   F   0.05
T   F   T   0.9
T   F   F   0.1
F   T   T   0.8
F   T   F   0.2
F   F   T   0
F   F   F   1
T1. A factor representation.
Taken from [15].
Here the variables are B, C and D, the values they can take are T and F, and the table entries are non-negative numbers. As can be seen, factors can represent the CPTs in a BN. It is worth noting, though, that a factor does not necessarily represent a probability distribution over its variables; factors are used here because they provide a clear and computationally effective representation of conditional probabilities, and they can go from conditionals to marginals in the process. So the definition of a factor is:
Definition 1. A factor f over variables X is a function that maps each instantiation x of the variables in X to a non-negative number f(x).
Factors are also called potentials, but that name will not be adopted here.
With factors, two key operations will be performed that help with the inference process: the first is summing out a variable, the second is multiplying two factors together. These operations can be considered the building blocks of BN inference:
Definition 2. Given a factor f over variables X and a variable X in X, the result of summing out (or projecting away) the variable X is another factor over the set Y = X − {X}, denoted Σ_X f and defined as:
(Σ_X f)(y) = Σ_x f(x, y).
Definition 3. The result of multiplying two factors, f1 over variables X and f2 over variables Y, denoted f1 f2, is a factor over Z = X ∪ Y defined as:
(f1 f2)(z) = f1(x) f2(y), where x and y are the instantiations of X and Y consistent with z.
Multiplication is commutative and associative and is done over the common variables; likewise, the sum is taken over instantiations consistent with the values the summed-out variable(s) can take.
Inference then reduces to multiplying and projecting factors in the BN to get a joint or a conditional probability.
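Definitions 2 and 3 can be made concrete with a minimal factor implementation (a Python sketch, not the thesis code; variables are assumed binary for brevity):

```python
import itertools

class Factor:
    """A factor: maps each instantiation of its variables to a number."""
    def __init__(self, vars_, table):
        self.vars = tuple(vars_)   # variable names, in order
        self.table = dict(table)   # instantiation tuple -> non-negative number

    def sum_out(self, var):
        """Definition 2: project the factor onto its variables minus {var}."""
        i = self.vars.index(var)
        out = {}
        for inst, val in self.table.items():
            key = inst[:i] + inst[i + 1:]
            out[key] = out.get(key, 0.0) + val
        return Factor(self.vars[:i] + self.vars[i + 1:], out)

    def __mul__(self, other):
        """Definition 3: multiply on common variables (binary variables only)."""
        vars_ = self.vars + tuple(v for v in other.vars if v not in self.vars)
        idx = [vars_.index(v) for v in other.vars]
        out = {}
        for inst in itertools.product((True, False), repeat=len(vars_)):
            out[inst] = (self.table[inst[:len(self.vars)]]
                         * other.table[tuple(inst[j] for j in idx)])
        return Factor(vars_, out)

# Multiply P(R) by P(W | R), then sum out W: recovers P(R) up to rounding.
f_R = Factor(["R"], {(True,): 0.8, (False,): 0.2})
f_WR = Factor(["R", "W"], {(True, True): 0.8, (True, False): 0.2,
                           (False, True): 0.5, (False, False): 0.5})
f = (f_R * f_WR).sum_out("W")
print(f.table)
```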
E.g., consider again Fig. 1 and the following factors:
R   P(R)
T   0.8
F   0.2
Table 2. Node R in Fig. 1.
Source: Own creation.
S   P(S)
T   0.9
F   0.1
Table 3. Node S in Fig. 1.
Source: Own creation.
R   W   P(W | R)
T   T   0.8
T   F   0.2
F   T   0.5
F   F   0.5
Table 4. Node W in Fig. 1.
Source: Own creation.
R   S   H   P(H | R, S)
T   T   T   0.8
T   T   F   0.2
T   F   T   0.5
T   F   F   0.5
F   T   T   0.9
F   T   F   0.1
F   F   T   1
F   F   F   0
Table 5. Node H in Fig. 1.
Source: Own creation.
The joint probability will be the multiplication of the factors:
P(R, S, W, H) = P(R) P(S) P(W | R) P(H | R, S),
that is, for the instantiation with all variables T:
P(R=T, S=T, W=T, H=T) = (0.8)(0.9)(0.8)(0.8) = 0.4608,
the joint probability of the event happening. This is the basis for the classification that follows: the class whose event has the highest probability will be the class assigned.
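This calculation can be checked numerically with a Python sketch (not the thesis code; the CPT entries are those of Tables 2-5 above):

```python
import itertools

p_R = {True: 0.8, False: 0.2}                    # Table 2
p_S = {True: 0.9, False: 0.1}                    # Table 3
p_W = {(True, True): 0.8, (True, False): 0.2,    # Table 4, keyed (r, w)
       (False, True): 0.5, (False, False): 0.5}
p_H = {(True, True, True): 0.8, (True, True, False): 0.2,    # Table 5,
       (True, False, True): 0.5, (True, False, False): 0.5,  # keyed (r, s, h)
       (False, True, True): 0.9, (False, True, False): 0.1,
       (False, False, True): 1.0, (False, False, False): 0.0}

def joint(r, s, w, h):
    """P(r, s, w, h) = P(r) P(s) P(w | r) P(h | r, s)."""
    return p_R[r] * p_S[s] * p_W[(r, w)] * p_H[(r, s, h)]

print(joint(True, True, True, True))  # 0.4608, up to float rounding
total = sum(joint(*i) for i in itertools.product((True, False), repeat=4))
print(round(total, 10))  # 1.0: the factorisation is a proper distribution
```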
3.3.2.2 The join-tree
With the definitions of factors and a BN in place, the inference approach used here, the jointree algorithm, can be defined. There are other algorithms, such as variable elimination and bucket elimination, to name a couple; here the focus is on constructing a tree modelling the BN and using a message-passing scheme to get the probabilities.
Constructing an elimination tree for a set of factors is straightforward: just build a tree and assign each factor to a node in it. Any elimination tree constructed in this fashion is good enough to calculate the probabilities completely, but the interest here is in a low-width tree, to minimise the amount of work performed.
Define a jointree as follows:
Definition 4. A jointree (JT) for a DAG is a pair (T, C), where T is a tree and C is a function that maps each node i in the tree T to a label C_i, called a cluster. The JT must satisfy the following:
1. Each cluster C_i is a set of nodes from the DAG.
2. Each family in the DAG must appear in some cluster.
3. If a node appears in two clusters C_i and C_j, it must also appear in every cluster C_k on the path connecting nodes i and j in the JT. This is known as the JT property, or running intersection.
A family in a BN is defined as a variable together with its parents.
The separators of a JT are attached to the edges i-j, denoted S_ij and defined as S_ij = C_i ∩ C_j. The width of a JT is the size of its largest cluster minus one.
To get a JT from a BN, the following procedure is carried out:
1. Transform the arcs into undirected edges.
2. Moralise the new graph, that is, add an edge between every two nodes that share a common child.
3. Triangulate the moral graph, that is, add an edge wherever there is a chordless cycle of more than three nodes, breaking such cycles.
4. Assign the factors to the tree nodes whose clusters contain their variables.
Take into consideration that the tree has to respect the running intersection property, which states: if X ∈ C_i and X ∈ C_j, then X is in every cluster on the unique path in the tree between C_i and C_j. E.g., if one cluster has the variables ABC and its neighbour has ABD, the running intersection holds; with DEF instead, it does not, and the tree cannot be a JT.
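Step 2 of the procedure, moralisation, can be sketched in Python (not the thesis code; the edges are the Fig. 1 ones, and triangulation and cluster selection are omitted as they are more involved):

```python
def moralise(edges):
    """Return the moral graph of a DAG as a set of undirected edges."""
    undirected = {frozenset(e) for e in edges}
    parents_of = {}
    for u, v in edges:
        parents_of.setdefault(v, set()).add(u)
    # Connect every pair of nodes sharing a common child.
    for parents in parents_of.values():
        for a in parents:
            for b in parents:
                if a != b:
                    undirected.add(frozenset((a, b)))
    return undirected

edges = {("R", "W"), ("R", "H"), ("S", "H")}
print(sorted(tuple(sorted(e)) for e in moralise(edges)))
# [('H', 'R'), ('H', 'S'), ('R', 'S'), ('R', 'W')]
```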
E. g. For this network:
BN.I2. DAG example for JT.
Source:[6] done with [8].
After following the steps, the corresponding JT obtained is:
The corresponding factors can then be assigned to the clusters that contain their families: one on node DFG, one on ACE, one on ADF, one on ABD and one on EFH.
In this case the only node without a factor is AEF, but that is no problem, as not every node needs one; nodes can be empty depending on the calculations needed.
3.3.2.3 Message passing
In all the inference algorithms, the concept of message passing is required to calculate any probability sought. For the JT there are two main architectures, Shenoy-Shafer and Hugin; each has advantages and disadvantages, e.g. in space-time complexity, but on some JTs they are equivalent. The one explained here is Shenoy-Shafer.
The first step in this architecture is to select a node as the root; it is common to select a node containing the variables whose probability is sought. Then the passing of messages begins, in two phases, inward and outward, i.e. towards the root and away from the root; the former is also known as the collect or pull phase, the latter as the distribute or push phase. After the first phase, the probability of the variables in the root is known. The messages go as follows: node i sends a message to node j only when it has received the messages from all its other neighbours k. The message from node i to node j is a factor, the projection onto the separator:
M_ij = ( f_i Π_{k ≠ j} M_ki ) projected onto S_ij,
where f_i is the product of the factors assigned to node i. Note that the cluster C_i includes exactly the variables of the factor f_i Π_{k ≠ j} M_ki, so the projection onto the separator variables S_ij is the same as summing out the variables C_i − S_ij from the factor. This gives the most commonly used form:
M_ij = Σ_{C_i − S_ij} f_i Π_{k ≠ j} M_ki.
What this says is that the incoming messages are multiplied by the factors in the node, then a sum is taken over the variables not in the separator; the result is passed along the separator to the next node, which waits until all its incoming messages have been gathered, and the process repeats.
In the end, the root node contains the joint probability of its variables and the evidence passed to it:
f_i Π_k M_ki = Pr(C_i, e).
If the sought-after probability is a conditional, simply normalising (dividing the joint by the probability of the evidence) gives the final probability.
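As a worked example of the inward phase, here is a Python sketch (not the thesis code) on a two-cluster jointree for Fig. 1: cluster C1 = {R, W} holds P(R) and P(W | R), cluster C2 = {R, S, H} holds P(S) and P(H | R, S), the separator is {R}, and C1 is the root. The CPT numbers are those of Tables 2-5.

```python
import itertools

p_R = {True: 0.8, False: 0.2}
p_S = {True: 0.9, False: 0.1}
p_W = {(True, True): 0.8, (True, False): 0.2,
       (False, True): 0.5, (False, False): 0.5}
p_H = {(True, True, True): 0.8, (True, True, False): 0.2,
       (True, False, True): 0.5, (True, False, False): 0.5,
       (False, True, True): 0.9, (False, True, False): 0.1,
       (False, False, True): 1.0, (False, False, False): 0.0}

# Inward message C2 -> C1: multiply C2's factors and sum out S and H,
# i.e. project onto the separator {R}.
message = {r: sum(p_S[s] * p_H[(r, s, h)]
                  for s, h in itertools.product((True, False), repeat=2))
           for r in (True, False)}

# At the root, the product of C1's factors and the incoming message is the
# joint P(R, W); summing out R then gives the marginal of W.
p_RW = {(r, w): p_R[r] * p_W[(r, w)] * message[r]
        for r, w in itertools.product((True, False), repeat=2)}
p_W_marg = {w: p_RW[(True, w)] + p_RW[(False, w)] for w in (True, False)}
print(p_W_marg)  # P(W=T) = 0.74, P(W=F) = 0.26, up to float rounding
```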
This covers general inference in BNs; next is naive Bayes, where the inference mechanism is the same.
3.3.3 Naive Bayes
Naïve Bayes (NB) is the simplest BN: it makes the strong (naïve) assumption that every node is conditionally independent of the rest given the class. This assumption may not hold in the real world, but as a classification model NB has good power, sometimes better than LR, and it can approximate the SVM model. The main points here are the structure of NB, the implementation in Genie and jSMILE and with Matlab's own function, the results, and some plots.
3.3.3.1 Structure and design
The structure of an NB is quite simple: the class is at the top and the feature nodes are its children, separated so that their only interaction is through the class.
Here is a simple diagram of the NB:
NB.I1. An example of naïve Bayes.
Source: Own creation with [16].
In this simple example of NB, as previously stated, the class is the parent of all the nodes. In this setting, the probability of the class being 0 or 1 can be calculated as the Maximum Likelihood Estimate (MLE) or with Laplacian smoothing, which makes it more robust to new incoming data. The independencies stated by Fig. NB.I1 are:
1.-6. The pairwise statements X_i ⊥ X_j | C for every pair of distinct features X_i and X_j in NB.I1 (six statements for four features).
Using the d-separation algorithm discussed previously, this is what allows the CPTs to be P(C), P(X_1 | C), P(X_2 | C), P(X_3 | C), P(X_4 | C). Taking the factors approach, the joint can then be expressed as P(C) Π_i P(X_i | C).
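The NB classification rule, picking the class that maximises P(C) Π_i P(x_i | C), can be sketched in Python (not the thesis code; the priors and conditionals below are made-up illustration values for two binary features):

```python
p_C = {0: 0.4, 1: 0.6}      # made-up class prior
p_x1 = {0: 0.3, 1: 0.8}     # P(X1 = True | C), made up
p_x2 = {0: 0.5, 1: 0.7}     # P(X2 = True | C), made up

def classify(x):
    """Return the class with the highest P(C) * prod_i P(x_i | C)."""
    scores = {}
    for c, prior in p_C.items():
        s = prior
        for p, xi in zip((p_x1, p_x2), x):
            s *= p[c] if xi else 1 - p[c]
        scores[c] = s
    return max(scores, key=scores.get)

print(classify([True, True]))    # 1: both features favour class 1
print(classify([False, False]))  # 0
```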
3.3.3.2 Implementation
See Appendix A 4.3.
3.3.4 Full Bayesian Network
A full BN, unlike NB, makes no strong assumptions on independence, so in the previous data set the exams could be considered independent of each other, or one could influence the other; here the former is taken to explain the class. In the SVM examples, as the features have no real meaning, the consideration is that the features influence the class, not the other way around. In the BN diagrams for the first SVM data set, the two features are considered independent of each other, with the class depending on them. For the second data set the dependencies are not clear, so some modelling decisions were taken.
3.3.4.1 Structure and design
For the BN, the considerations previously mentioned were used on the second data set, while both data sets were discretised into fifteen nodes, each node representing a range of values with an increment of 0.5, for design purposes only. Each node was also allowed a maximum of 3 parents, again for design.
The first BN, looks like:
Fig. 4. Full BN for the first data set.
Source: [17] for the model from data and the software.
For the classification, see the code in Appendix A 4.3. The code is done in Java with the jSMILE library, as mentioned in Appendix 4.4.
The second BN was done with the same discretisation approach, but this time with steps of 0.1 and the same constraints on the parents.
It looks like this:
Fig. 5. Full BN for the second data set.
Source: [17] for the model from data and the software.
See Appendix A 4.5, for details on the actual implementation.
4 Results
For LR applied to the data sets used here:
LR.I1. View of the data set of students by their scores.
Source: Andrew Ng class. [14]
LR.I2. Logistic regression line; everything below the line is class 1.
Source: Andrew Ng class. [14]
The accuracy in this case is 89%, while the precision and recall are calculated as:
Precision = TP / (TP + FP) and Recall = TP / (TP + FN) [11],
where TP, FP and FN are the counts of true positives, false positives and false negatives. For this case: Precision = 90%, Recall = 91%.
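These scores, used throughout the results below, can be computed from the confusion counts; a Python sketch (not the thesis code; the labels in the example are made up):

```python
def scores(y_true, y_pred):
    """Accuracy, precision and recall from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# One false positive and one false negative out of six points.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(scores(y_true, y_pred))  # all three scores are 2/3 here
```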
In SVM, the following results were obtained:
Here are the plots for the SVM, the data, the linear kernel and the gaussian kernel.
SVM.I1.- Data for the SVM.
Source: Andrew Ng class. [14]
SVM.I2.- An SVM with a linear decision boundary, where C is 1000, so very few errors are accepted.
Source: Andrew Ng class. [14]
SVM.I3.- The same SVM with C = 0.1, accepting more misclassifications.
Source: Andrew Ng class. [14]
SVM.I4.- View of the data for the gaussian kernel.
Source: Andrew Ng class. [14]
SVM.I6.- Gaussian kernel decision boundary, C = 1, more errors accepted.
Source: Andrew Ng class. [14]
SVM.I7.- Gaussian kernel decision boundary, C = 1000, fewer errors accepted.
Source: Andrew Ng class. [14]
The results in these cases were as follows:
For the first SVM, with C = 1000, the plot shows a perfect match, so accuracy, precision and recall are all 100%; this data was fairly linearly separable.
For the second SVM, with C = 0.1, the results were a little different, as it missed one point: accuracy 98%, precision 100% (no false positives), recall 95%.
With the gaussian kernel and C = 1000, the results were: accuracy 92%; precision again 100%, since no negative examples were flagged as positives; recall 86%, since several positives were classified as negatives. As this data was not linear at all, this makes sense.
With the same kernel, C = 1 and sigma held constant, the results were: accuracy 93%; precision 97%, as now several points were flagged as positives while being negatives, shown in the plot by points outside the decision boundary; recall 91%, better than before, but this is the trade-off that has to be made with precision.
The NB model was not treated as a linear classification method, so there are no plots of lines separating the data. For all the data sets the structure is the one discussed previously: the class as the parent, and all the nodes as children, independent of each other.
Fig. 1 NB.I1. The first (exams) data set tested with NB.
Source: Andrew Ng class. [14]
Fig. 2 NB.I2. The second data set tested with NB.
Source: Andrew Ng class. [14]
Fig. 3 NB.I3. The last data set tested with NB.
Source: Andrew Ng class. [14]
The results were as follows. For the exams data set, Fig. 1:
Accuracy of 93%.
Precision of 93%.
Recall of 95%.
For the second, unnamed data set, Fig. 2:
Accuracy of 100%.
Precision of 100%.
Recall of 100%.
Finally, for the third data set, Fig. 3:
Accuracy of 75%.
Precision of 75%.
Recall of 82%.
In the full BN model the results were, for data set 1 and then data set 2:
(1)
Accuracy of 98%.
Precision of 95%.
Recall of 100%.
(2)
Accuracy of 88%.
Precision of 87%.
Recall of 91%.
This shows a slight difference that the parent constraints may have had on the BN classification.
5 Discussion
6 Conclusion
The main points covered in this thesis relate to the classification power of different models, both in the traditional machine learning approach, like SVM, and in their Bayesian counterparts. The focus for these models was on accuracy, precision and recall, to see whether the Bayesian models were as good as the traditional forms, sometimes even better. In the present thesis the data did not follow some important machine learning practices, such as cross-validation or separating the data into training and test sets, to name a couple.
As shown in this thesis, for the linearly separable data naïve Bayes has better accuracy, precision and recall than logistic regression, and can have power equivalent to support vector machines; overall, on linearly separable data naïve Bayes showed better scores in these measures.
Support vector machines also showed better scores where the data was linearly separable, and where it was not, they showed better scores than the rest, a good result.
In the Bayesian network counter-example the support vector machine was very good at separating the data, but the full Bayesian network can be thought of as comparable in the measures, if the information loss from discretising the data is taken as enough to account for the misclassifications.
In conclusion, the three hypotheses hold: logistic regression is the least powerful model, as presumed, and support vector machines the best, but it was shown that the Bayesian approach can be as effective in some cases, or comparable in power.
7 Appendix A (Code)
This is the code used for the implementations of the methods.
7.1 Code in Logistic Regression
The LR model was implemented in Matlab, as were all the other models shown here, using the following code:
function g = sigmoid(z)
% Element-wise logistic function.
g = 1./(1+exp(-z));
end
This was for formula LR.6.
Then one step of the gradient is calculated as follows, shown here in the un-vectorised form with explicit loops:
function [J, grad] = costFunction(theta, X, y)
% Logistic regression cost J and one gradient step, un-vectorised.
m = length(y);
sums = [];
for i = 1:m
sums = [sums; -y(i)*log(sigmoid(theta'*X(i,:)'))-(1-y(i))*log(1-sigmoid(theta'*X(i,:)'))];
end
J = (1/m) * sum(sums);
temp_thetas = [];
for k = 1:size(theta,1)
tempo = [];
for j = 1:m
tempo(j) = (sigmoid(theta'*X(j,:)')-y(j))*X(j,k);
end
temp_thetas(k) = sum(tempo);
end
grad = (1/m).*temp_thetas;
end
The "one step" remark is made because the complete gradient descent was carried out with Matlab's built-in functions, to spend less time computing the gradient, as shown in the following code:
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
fprintf('Cost at theta found by fminunc: %f\n', cost);
fprintf('theta: \n');
This searches automatically for the best line separating the data, with the cost function as the minimisation objective.
7.2 Code used in Support Vector Machines.
This part shows the SVM with a linear decision boundary and with the gaussian kernel, using Matlab with the ready-made functions svmTrain.m and svmPredict.m; only the linear and gaussian kernel functions need to be set up.
For the linear kernel, like:
function sim = linearKernel(x1, x2)
x1 = x1(:); x2 = x2(:);
sim = x1' * x2;
end
Where the x1=x1(:) is done to make sure the vectors received are column vectors.
For the gaussian, like:
function sim = gaussianKernel(x1, x2, sigma)
% Gaussian (RBF) kernel between vectors x1 and x2.
x1 = x1(:); x2 = x2(:);
res = x1 - x2;
sim = exp(-(res'*res) / (2*sigma^2));
end
With these functions defined, it is a matter of feeding them to the Matlab training function to see the resulting SVM.
7.3 Code in Naive Bayes.
The naïve Bayes implementation is, as usual, done in Matlab, but this time both with Matlab's own function and with Genie and jSMILE; Matlab provides a function that does the NB classification with no external help, so the data points are classified in both ways. The data is the same as for LR: two exams and accepted or not accepted.
For the integrated Matlab function the code for implementing the NB on the first data is:
NaiveBayes.fit(X,y)
This gives an NB fitted with a gaussian distribution for the data, as that is the default setting of the function. Here X are the observations and y are the classes.
The same goes for the second example, the SVM data set; only in this case, as there is no real meaning behind the data to classify, there is no need to specify names. The function is the same, giving the same gaussian distribution but different parameter values.
7.4 Code for Bayes Network data set 1.
This is the implementation of the BN for the first data set: with jSMILE, given the BN built in Genie, calculate the probability of each vector for both classes, then choose the class with the higher probability as the predicted class. The code for the first network goes as follows (this is the main part, omitting unnecessary things like the setup, the imports, and so on):
Network thesis1 = new Network();
thesis1.readFile("/Users/pedroalonso/Dropbox/thesis1data.xdsl");
thesis1.updateBeliefs();
String[] nodes = thesis1.getAllNodeIds();
int count = 0;
for(int i=0;i< nodes.length;i++){
System.out.println(nodes[i]);
count++;
}
Scanner data = new Scanner(new FileInputStream("/Users/pedroalonso/Dropbox/genie.csv"));
String line = data.nextLine();
StringTokenizer nodes_names = new StringTokenizer(line, ",");
String[] bnNodesNames = new String[count];
int nodes_count = 0;
while(nodes_names.hasMoreTokens()) {
bnNodesNames[nodes_count] = nodes_names.nextToken();
nodes_count++;
}
int num_data = 0;
int[] yp = new int[52];
while(data.hasNext()){
String vector1 = data.nextLine();
StringTokenizer firstV = new StringTokenizer(vector1, ",");
int tokens = firstV.countTokens();
for(int i = 0; i < tokens-1;i++) {
thesis1.setEvidence(bnNodesNames[i], ("State"+firstV.nextToken()));
}
String c = firstV.nextToken();
System.out.println("class ="+c);
double c1 = 0, c2= 0;
if(c.equals("1")) {
thesis1.setEvidence(bnNodesNames[15], ("State"+c));
thesis1.updateBeliefs();
double p_e = thesis1.probEvidence();
c1 = p_e;
thesis1.setEvidence(bnNodesNames[15], ("State0"));
thesis1.updateBeliefs();
double p_not_e = thesis1.probEvidence();
c2 = p_not_e;
}
if(c.equals("0")){
thesis1.setEvidence(bnNodesNames[15], ("State"+c));
thesis1.updateBeliefs();
double p_e = thesis1.probEvidence();
c2 = p_e;
thesis1.setEvidence(bnNodesNames[15], ("State1"));
thesis1.updateBeliefs();
double p_not_e = thesis1.probEvidence();
c1 = p_not_e;
}
if(c1 > c2) {
yp[num_data] = 1;
}
if(c2 > c1){
yp[num_data] = 0;
}
num_data++;
}
7.5 Code for Bayes Network data set 2.
Here is the code used for the second BN. The implementation is the following, using jSMILE [10] and the BN model from Genie [9] (omitting unnecessary setup):
Network thesis2 = new Network();
thesis2.readFile("/Users/pedroalonso/Dropbox/thesis3.xdsl");
thesis2.updateBeliefs();
String[] bnNodes = thesis2.getAllNodeIds();
for(int i = 0; i < bnNodes.length; i++) {
System.out.println(bnNodes[i]);
}
System.out.println(bnNodes.length);
Scanner data = new Scanner(new FileInputStream("/Users/pedroalonso/Dropbox/thesis2.csv"));
String line = data.nextLine();
System.out.println(line);
StringTokenizer tokens = new StringTokenizer(line,",");
int count_nodes = 0;
while(tokens.hasMoreElements()) {
String tok = tokens.nextToken();
bnNodes[count_nodes] = tok;
System.out.println(tok);
count_nodes++;
}
int count_data = 0;
int[] yp = new int[863];
while(data.hasNext()) {
String vector = data.nextLine();
tokens = new StringTokenizer(vector, ",");
int count_tokens = tokens.countTokens();
for(int i = 0; i < count_tokens-1; i++) {
thesis2.setEvidence(bnNodes[i], "State"+tokens.nextToken());
}
String c = tokens.nextToken();
double c1 = 0, c2 = 0;
if(c.equals("1")) {
thesis2.setEvidence(bnNodes[15], "State"+c);
thesis2.updateBeliefs();
double p_e = thesis2.probEvidence();
c1 = p_e;
thesis2.setEvidence(bnNodes[15], "State0");
thesis2.updateBeliefs();
double p_not_e = thesis2.probEvidence();
c2 = p_not_e;
}
if (c.equals("0")) {
thesis2.setEvidence(bnNodes[15], "State"+c);
thesis2.updateBeliefs();
double p_e = thesis2.probEvidence();
c2 = p_e;
thesis2.setEvidence(bnNodes[15], "State1");
thesis2.updateBeliefs();
double p_not_e = thesis2.probEvidence();
c1 = p_not_e;
}
if(c1 > c2) {
yp[count_data] = 1;
}
if(c2 > c1){
yp[count_data] = 0;
}
count_data++;
}
for(int i = 0; i < yp.length; i++) {
System.out.println(yp[i]);
}
8 Bibliography
[1] B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd ed. Berlin, Heidelberg: Springer, 2011, p. 532.
[2] T. M. Mitchell, Machine Learning, 1st ed. Portland: McGraw-Hill Science/Engineering/Math, 1997, p. 414.
[3] M. G. Madden, “On the Classification Performance of TAN and General Bayesian Networks,” Springer, vol. 1, pp. 3–16, 2009.
[4] A. Y. Ng and M. I. Jordan, “On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes,” Mach. Learn., vol. 1, no. 1, p. 8, 2001.
[5] T. Roos, H. Wettig, P. Grünwald, and P. Myllymäki, On Discriminative Bayesian Network Classifiers, 2nd ed. Amsterdam: Springer Science + Business Media, 2005, pp. 267–296.
[6] S. Hassan and M. Rafi, “Comparing SVM and Naïve Bayes Classifiers for Text Categorization with Wikitology as Knowledge Enrichment,” Mach. Learn., vol. 1, no. 5, p. 3, 2012.
[7] F. Pernkopf and J. Bilmes, “Discriminative versus Generative Parameter and Structure Learning of Bayesian Network Classifiers,” Mach. Learn., vol. 1, no. 1, p. 8, 2004.
[8] H. Zhang, “Full Bayesian Network Classifiers,” Proc. 22nd Int. Conf. Mach. Learn., vol. 1, no. 1, p. 8, 2006.
[9] N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian Network Classifiers,” Mach. Learn., vol. 29, pp. 131–163, 1997.
[10] C. M. Bishop and J. Lasserre, Generative or Discriminative? Getting the Best of Both Worlds, 1st ed. UK: Oxford University Press, 2007, pp. 3–24.
[11] C. Goutte and E. Gaussier, “A Probabilistic Interpretation of Precision, Recall and F-score, with Implication for Evaluation,” Pascal2, vol. 3408, pp. 345–359, 2005.
[12] C. Cortes and V. Vapnik, “Support-Vector Networks,” Mach. Learn., vol. 20, pp. 273–297, 1995.
[13] A. Statnikov, “A Gentle Introduction to Support Vector Machines in Biomedicine, Volume 1: Theory and Methods,” New York, 2010.
[14] A. Y. Ng, “Regression Classification,” Machine Learning course, Coursera, 2013. [Online]. Available: https://class.coursera.org/ml-004. [Accessed: 14-Oct-2013].
[15] A. Darwiche, Modeling and Reasoning with Bayesian Networks, 1st ed. New York: Cambridge University Press, 2009, p. 548.
[16] J. Textor, “Adjustment Criteria in Causal Diagrams: An Algorithmic Perspective,” Utrecht, 2013.
[17] M. J. Drużdżel, “Intelligent Decision Support Systems Based on SMILE,” Artif. Intell., vol. 2, no. 1, pp. 26–29, 2005.