Jupyter Snippet P4M IntroToDataScience

Jupyter Snippet P4M IntroToDataScience

A basic introduction to Data Science

Disclaimer: This is not a full introduction and unashamedly focusses on the areas I’m most interested in (optimisation). We are going to spend a small amount of time looking at howto extract data contained in text files. The focus is then on giving you an introduction to how some of the machine learning algorithms work under the hood (rather than just calling the existing implementations from a library). Overall this is very basic though, with a thorough treatment requiring at least a semester course.


In this exerercise we are going to work through a fairly common scenario. After developing a couple of algorithms for solving a particular problem (or implementing some “standard” algorithms as well as your shiny new and hopefully improved algorithm), you test the algorithm on some instances and want to know which algorithm is “best”? Or which algorithm is best for what type of data?

Here we are going to go through this type of analysis by focussing on alternative linear programming algorithms. So here are the elements that go into the analysis:

  • The problem to be solved: $\min c^T x$ subject to $A x = b,\ x \ge 0$ where $A$ is a $m\times n$ matrix, $x$ is a vector of decision variables and $b$ and $c$ are appropriate constant vectors. In practice we normally allow some variants (for example inequality constraints, arbitrary lower and upper bounds on the variables).
  • Algorithms: Three standard algorithms, primal simplex, dual simplex, and the barrier method, as implemented by the CPLEX solver library
  • Instances: The MIPLIB2010 “standard” test instances. The data sets are actually for mixed integer problems (where some variables are required to be binary or general integers), but for these tests we will throw away the binary requirements. There are 87 instances in this data set

To make things easy you will not be required to re-run all of the tests but are given the results from running each algorithm on each instance.

Getting the data

In an ideal world the data would already be sitting in a database or simple CSV file for you to read. However this may not always be possible. It is not uncommon for the data to be sitting in text files (perhaps the log files produced as the algorithms are run). Since such text files can get quite large, it makes sense to compress these.

Your first task then is to obtain the data and display a few lines of one of the log files to see what these look like. That data can be found as a zip file at http://users.monash.edu/~andrease/Files/Other/cplexLP.zip. You could download this, unzip it and look at it - but we want to do things with python here. The basic library functions you need are

  • urllib using the urllib.request.urlopen("name.of.url/to-load") function. Basic usage information is provided in this how-to. It is probably easiest to just read this data and write it to file.
  • zipfile module provides methods for reading (and writing) zip archives. The basic usage is to create zip=ZipFile(<file>,'r') archive, then use zip.namelist() to inspect the contents and zip.open(name,'r') to open the named file. Complete documentation is here
  • Note that all log files are ASCII. To convert a binary (ASCII) string to the standard python unicode string use .decode("utf-8")
# add your code here to get http://users.monash.edu/~andrease/Files/Other/cplexLP.zip

# Add code here to read one of the files
# Note: if the output is too long you may want to manually 
#       toggle scrolling under `Cell->Current Outputs` in the notebook menu

Extracting Features

For each instance we want to extract the following information:

  • The name (identifier) for the data set. This is encoded in the name of each file in the Zip archive (after the cplexLP/) or in the log file itself as part of the “Problem name”
  • Characteristics or features of the instance to be solved. The log file contains some “statistics” of the problem such as the number of constraints and variables, number of non-zero coefficients, etc. See the CPLEX documentation on displaying problem statistics for more informaiton
  • The solution algorithm used: this is again shown in the “Problem name” or can be seen as an integer in the log file as “method for linear optimisation” (1=primal, 2=dual, 4=barrier)
  • The time required using the solution algorithm: You could take the solution time in seconds, but for the analysis here using the “Deterministic time” in “ticks” is likely to be more useful as a way of assessing which algorithm is more efficient. Having said that, you may also want to extract the time in seconds. See this brief discussion on ticks or CPLEX documentation for a little more information about what these mysterious ticks mean

In particular from the section of the log files that looks like this:

Objective sense      : Minimize
Variables            :   18380  [Fix: 7282,  Box: 11098]
Objective nonzeros   :       2
Linear constraints   :     576  [Less: 486,  Equal: 90]
  Nonzeros           :  109706
  RHS nonzeros       :      30

Variables            : Min LB: 0.000000         Max UB: 212.0000       
Objective nonzeros   : Min   : 51.00000         Max   : 100.0000       
Linear constraints   :
  Nonzeros           : Min   : 1.000000         Max   : 224.0000       
  RHS nonzeros       : Min   : 1.000000         Max   : 1.000000     

...lines deleted...

Deterministic time = 296.40 ticks  (956.13 ticks/sec)

We want to extract fields like the count of number of variables, the number of non-zeros or range of coefficient values.

Python functions and libraries to use

When dealing with text files, regular expressions are your friend. They form a mini-language of their own for expressing string patterns. Regular expressions are not unique to python, with relatively similar syntax used in a variety of tools (particularly in linux/unix). The regular expression documentation gives all the details, but to get you started the string r"^[abc]\d*" would match any line that

  • is at the start of a line (as incdicated by the ^)
  • has one of the characters in the set {‘a’,‘b’,‘c’} (written as “[abc]") first
  • is followed by zero or more digits “\d” = digit, * means “zero or more”, you could use “+” for “one or more”
  • the string is written as a raw string (r"") to make sure that “" is not interpreted as an escape character - since regular expressions make frequent use of the backslash
  • it is also particularly useful and easy to mark a substring as a group. For example r"([xyz]+)\1" means that we have a group consisting of one or more characters in {x,y,z} followed by the same group (group 1) repeating a second time.
  • import re to get the library, then use re.search(pattern,string) to find a pattern in the given string, re.search(pattern,string).group(1) would give the substring of the first group in the time the pattern was matched (if the pattern was found at all).

Regular expression testing

To test your skills at writing regular expressions, try the following exercises:

import re
def testre(regex,string):
    "Simple test function to print all matches of a regular expression in a string"
    matches = re.findall(regex,string)
    return matches
testre(__INSERT_REGEX_HERE__, # insert regular expressions to find any capitalised words with 2 letters or more
    """Some words, like Australia I know, have Capital letters, see Figure 3 for more"""
    )# should find 4: Some;Australia;Capital;Figure
testre(__INSERT_REGEX_HERE__, # outer LaTeX environments: \begin{somename}up to\end{somename}
      \begin{test}\end{fail}  \begin{array}1 & 2\\3 & 4 \\end{array}
      """) # should find 3: enumerate;nested;array
[float(f) for f in 
        # find floating point numbers (including integers and scientific notation)
      "15.3 - -100 1.2e-03  NaN 1ee4 +2E+6 ")  # 7: 15.3;-100;1.2e-03;NaN;1;4;+2E+6
] # output 15.3, -100.0, 0.0012, nan, 1.0, 4.0, 2000000.0]
4: Some;Australia;Capital;Figure
3: enumerate;nested;array
7: 15.3;-100;1.2e-03;NaN;1;4;+2E+6

[15.3, -100.0, 0.0012, nan, 1.0, 4.0, 2000000.0]

Extracting the data

Have a go at extracting some/all of the data. Store the data in a pandas DataFrame object. See the 10 minute introduction to Pandas if you have not come across this before. Essentially it is a “table like” data structure that makes it easy to manipulate data.

We want to extract the following fields relating to Variables (Var), Constraints (Con) or the Objective function (Obj): VarCnt, VarFix,VarBox, ObjNZ,ConNZ, ConLess, ConEqual, ConNZ, RhsNZ, VarMin,VarMax,ObjMin,ObjMax,ConMin,ConMax,RhsMin,RhsMax. We also want the performance of the algorithm: PrimalT or DualT or BarrierT for the ticks.

For this exercise there is no need to get all of the fields that might be interesting them all (as getting them all can be a bit fiddly). To get a complete data set we will load a DataFrame of results extracted from these files below.

import re,pandas
import numpy as np

###### insert your code here ###############

data # finish by displaying the DataFrame obtained
# let's just check for missing values
.dataframe tbody tr th {
    vertical-align: top;

.dataframe thead th {
    text-align: right;

Starting with just the data

To make things easy, here is a way to just start with the data you should have by now (you can also skip the above and just start from here). As a reminder on using data frames, this pandas selection tutorial provides a good summary on how to access rows,columns etc

Also to make things easy there is a version of the data that has

  • All timing (tick) information for the different algoritms removed
  • Replace the min/max value of coefficients with the magnituded (log) of the range
  • Scaled all columns to have entries in the range $[-1,1]$ (to have similar magnitudes)

You can load the raw data (including ‘PrimalT’, ‘DualT’ and ‘BarrierT’ performance fields) and the preprocessed features using the commands below:

import pandas
from urllib.request import urlopen
data = pandas.read_json(urlopen("http://users.monash.edu/~andrease/Files/Other/cplexLP.json"))
features = pandas.read_json(urlopen("http://users.monash.edu/~andrease/Files/Other/features.json"))
features.head(5) # show first few rows
Reload using: data = pandas.read_json(open("cplexLP.json","r"))

Visualising our data

The set of features is fairly high dimensional. How do we make sense of this? Let’s try to do a plot. For this we need to reduce the data to two dimensions. Enter PCA (Principal Component Analysis). For most of the “machine learning” type algorithms in this workshop we will use the scikit-learn set of libraries. This includes a PCA method. However so we can see more easily what is going on here, let’s do it directly using the numpy matrix routines.

Useful functions:

  • data.values gives the whole frame as a matrix.
  • We need to form the covariance matrix (subtract the column mean, multiply by the transpose and divide by n-1): $cov= \frac{1}{n-1}(X-I\bar x)^T (X-I\bar x)$. For this you need np.mean which takes an argument axis=d to take the mean in the d’th dimension, and X.T is the transpose of numpy matrix X. Or just call numpy.cov - but be careful not to get features & observations (columns and rows) mixed up
  • np.linalg.eig(X) returns the (eigenvalues, eigenvectors) as vector/matrix respectively.
  • We only want to keep the two largest eigenvectors and plot the projection onto these two axes
  • I’m assuming you are using matplotlib (matplotlib.pyplot.scatter) but you can use anything else for plotting

Question: Is the scaling of the different attributes proposed above sufficient? Shoud we do something more radical?

import numpy as np
import matplotlib.pyplot as plt
## insert your code here

For alternative, more complicated ways to map data down to 2 dimensions there are plenty of other tools in sklearn library (look at Manifold Learning). Looking at underlying algoirthms is well beyond the scope of this tutorial. However you may want to have have a go at a few of these such as TSNE or MDS. Note that some of these manifold learning methods are randomised so won’t return the same result each time

import numpy as np
from sklearn.manifold import TSNE, MDS,LocallyLinearEmbedding,SpectralEmbedding
## you may want to give this a go ###

Unsupervised learning - clustering

In unsupervised learning we are looking for some structure without having any real idea of what is “correct”. The simplest case of unsupervised learning is clustering - grouping data into clusters of points that are close together.

The simplest way to do this is to apply the cluster.KMeans model from the scklearn library. See this tutorial for more information. Try splitting the data (based on the features table) into 3 clusters.

from sklearn import cluster
### insert your code here and plot the results ###

Writing our own k-means algorithm

The basic algorithm is very simple - so simple that we can write our own

Pick an initial assignment of points to clusters (or an initial set of centres of the clusters)
Repeat until converged:
    Assign each point to the nearest centre (euclidean distance)
    Move the center of each cluster to the centroid (average of all points in the cluster)

This method is a heuristic: It is not guaranteed to give an optimal solution (the one with the least average distance of points to their centre) and it will converge to different solutions depending on the initial solution we start with.

Some usuful numpy funtions

  • np.random.randint(low,high,n) returns array length n of integers in range(low,high)
  • np.mean(matrix,axis=d) = array of means (along either columns or rows depending on axis
  • np.linalg.norm = norm of an array
  • np.argmin(matrix,axis=d) = like mean iterates over elements along just one axis and returns the minimum index
  • X[v > 0] - returns a submatrix of X depending on which elements of v are positive. Whether v may be of length number of rows (to select whole rows) or same dimension as X (to select submatrix). Similarly X[:,v>0] would select submatrix based on columns where v is positive.
import numpy as np
def kmeansHeur(data,p):
    """Given a n x m matrix with n points and m features 
    return list/np.array length n of labels (in range(p))  
    data = np.array(data) # just to make convert from DataFrame or other data type
    ### insert your algorithm here #####
    return lbl

lblsHeur = kmeansHeur(features,ncluster) # fit model to data
print("Points per cluster=",[sum(1 for i in lblsHeur if i==k) for k in range(ncluster)])
# do a scatter plot here using c=lbslHeur to get the dots coloured by cluster

This answer is probably not the same that you got wiht KMeans from sklearn. Which one is “right” or at least “better”? Need to formally define the objective function that we are trying to optimise. Essentially we are minimising the sum of squared norm distances: $$\min_{c_k,C_k} \sum_k \sum_{i\in C_k} ||c_k - x_i||^2$$ Where $c_k$ and $C_k$ are the centre of cluster $k$ and the set of points in the cluster respectively. Each $x_i$ is a point of the data.

Compute the objective function for both your solution and the KMeans solution. Which one is better? Is either of these optimal (and how would we know?)

def clusterObj(data,label):
    "Input: feature matrix (pts x features) and label for each feature."
    data = np.array(data)
    # compute the centre for each cluster and return the sum of squared distances
(35.92617097573589, 44.328479351670445)

Note that our method is a randomised heuristic and the result depends on the starting point. Rerun your kmeansHeur repeatedly (say 1000 times) and keep the best result. How does this compare to the result from a single run of sklearn.cluster.KMeans ?

### insert your code here ####

Supervised Learning - Classification with Support Vector Machines

We might want to figure out which algorithm works fastest for a given problem. That is, given a linear program and its characteristics, can we decide which algorithm to use to get the solution as fast as possible. Support vector machines are one model for such a classification task. It supposes that there is a linear function of the factors that determines which of 2 classes a point belongs to. That is, that there exists a set of weights $w_i$ and constant $W$ such that $\sum_i w_i f_i < W$ if one algorithm is faster and $\sum_i w_i f_i > W$ when the other method is faster where $f_i$ are the features of the instance we are interested in solving. How do we decide what the $w_i$ (and $W$) should be? This is supervised learning where we are using existing training data to fit the parameters.

Let’s start by trying to work out for which instances the primal vs dual simplex method is faster. We have primalT and dualT that tells us which is faster for the test data. So we want to find a vector $w$ and constant $W$ such that $\sum_i w_i f_{ki} \le W-1$ if PrimalT < DualT and $\sum_i w_i f_{ki} \ge W+1$ otherwise. (Where $f_{ki}$ is the i-th feature of instance k. Finding such a set of $w_i$ and $W$ is just a linear programming problem. Some things we need to consider is: The limit of 1 is just to ensure we get some minimum separation (we could pick any number as multiplying each row by a constant would not change the problem other than to increase the non-zero difference between the LHS and W

  • If the instances can be separated, we want to make the 2 hyperplanes (defined by the $\pm1$ on the right hand side) as far apart as possible(so that we get a clean separation). The separation distance is $2/||w||$. So maximising the separation is equivalent to $\min ||w||^2$ which gives us a quadratic program with linear constraints.
  • It may not be possible to separate the points cleanly into two groups with a single hyperplane. Hence we might want to penalise any point that is mis-classified, perhaps based on how far it is away from the hyperplane. Let $v_k$ be the distance that point $k$ is away from the hyperplane then we might want to $\min ||w||^2 + \sum_k v_k$ with $\sum_i w_i f_{ki} \le W-1+v_k$,
  • We can put the two objectives together - by noting that maximisation is the same as minimising the negative and placing greater priority on minimising violations: $\min ||w||^2 + \alpha \sum_k v_k$ subject to $\sum_i w_i f_{ki} \le W+v_k-s$ for $k$ such that the primal solution is better (with correspoding $\ge$ constraints for the other points). Here $\alpha$ is an arbitrary weight that provides a trade-off between violations (misclassifications) and the distance separating the hyperplanes.
  • Some variants of this are possible. For example we could replace the$||w||^2$ term by $\max_i |w_i|$ (for example) which can be written in a linear program, or even drop it entirely. When there are violations these are minimised if $||w||$ is minimised so we may not need this.

Hence for an initial experiment we want to set up an SVM training function that takes our features as input data together with the list instances for which data['PrimalT'] < data['dualT'] (this python expression returns a boolean column). Let I be those instances and IC be the complement. The we want to solve the following linear program: $$\min_{v,w,W} \sum_{i\in I\cup IC} v_i$$ $$s.t.\ \sum_{k\in F} D_{if} w_f \le W - 1 + v_i\quad\forall\ i\in I$$ $$\quad \sum_{k\in F} D_{if} w_f \ge W + 1 - v_i\quad\forall\ i\in IC$$ $$ v_i \ge 0\quad\forall\ i$$ Here $D_{if}$ is the data for instance $i$ and feature $f$ out of the set of features $F$.

How to set this LP up using the CPLEX solver:

  • import docplex.mp.model as MP
  • mdl = MP.Model() create a mathematical programming (optimisation) model
  • x = mdl.binary_var_list(n) create list of variables x[0]x[n-1] that are all in {0,1} - not needed here
  • x = mdl.continuous_var_list(n,lb=-mdl.infinity,ub=3*np.ones(n)) create list of variables $-\infty < x[i] \le 3$ for i=0,…,n-1
  • mdl.add( mdl.sum(x[i] for i in range(n)) == 1) add the constraint $\sum_{i=0}^{n-1} x_i=1$
  • mdl.minimize( x[0]+x[1]**2) set the objective (linear + quadratic term). Note: CPLEX handles linear and quadratic programming problems but not more general non-linear optimisation problems
  • CPXparam = mdl.context.cplex_parameters access parameters - complete list of CPLEX parameters. It may be a good idea to set CPXparam.timelimit=60 seconds (to stop it running too long if the problem is not set up correctly), CPXparam.threads=1. This model should run in no more than a second though.
  • solution = mdl.solve(log_output=True) does the optimisation and shows logs some information to the screen as it runs. solution == None if the solver fails to obtain a solution.
  • mdl.get_solve_status() gives the final solution status (as a string)
  • solution.get_value(x[0]+x[1]) would return the final value of x[0]+x[1] in the solution. Alternatively you can also use solution[x[2]] to get the value of x[2] in the solution. (More informmation on the solution class see SolveSolution)
# Mathematical programming optimisation model
import docplex.mp.model as MP

def SVM(features,label):
    """features: the usual matrix of points x features
    label: a list/array (length number of points) of bools/integers that
           identifies the points to be separated from the rest
    Returns: array/list length number of features+1 of w_i plus the constant W"""
    # insert your code here and return the solution
    return [solution.get_value(w[f]) for f in F]
soln = SVM(features,data['PrimalT']<data['DualT'])
print("Weights =",soln)

What can we do to improve the fit? How to deal with non-linear separations? Try adding additional “features” that capture the non-linear effects. For example we might expect that the effectiveness depends on the size or density of the constraint matrix (density = ‘ConNZ’ / (‘VarCnt’ x ‘ConCnt’)). Create an extended feature set and try again. The idea here is that if we have say a two dimensional set of features (x,y for each point) and we wanted to separate those points that are inside an elipse centered at the origin from those outside, then we can find such a separating elipse by including features $x^2$ and $y^2$ with the optimisation choosing some combination $w_x x^2 + w_y y^2 \le W$ giving us an elipse while still solving a lienar problem (since the $x^2$ is just a constant coefficent for the variable $w_x$ in the optimisation).

Try creating some additional features. Remember for DataFrames we can easily create new columns by doing arithmetic with whole columns. For example features['eqSq'] = features['ConEqual']**2 would create a new column eqSq that contains the number of equality constraints (ConEqual) squared.

expFeat = features.copy() # expanded set of features
### add features here ####
expSoln = SVM(expFeat,data['PrimalT']<data['DualT'])

How well is our classifier doing?

To analyse the performance we should think about:

  • Do we care about:
    1. how far we are on the wrong side of the line? (this is what we are optimising at the moment)
    2. how many instances are classified incorrectly?
    3. how much extra compute time we would incur if we used the minimum?
  • How well it works for data it hasn’t seen?

Below compute some alternative measures of the quality of the fit. Then address the second issue by using a random split of the instances to create 4 groups (of about 21 each). Then we used 3 parts as the “training” data to fit the SVM model, and the other group for testing to validate that the model is acutally OK on data that wasn’t used for the training. By repeating this 4 times for each group we can compute a more accurate average performance of the approach on this type of data.

def SVMviolation(feat,w,lower=data['PrimalT'],higher=data['DualT']):
    """Input: feat = matrix of features, w - list/array of weights (length features + 1), 
     higher/lower = performance measure
     Returns total violation of sum( w[i]*feat[i]) <= W or >= W depending on if lower[i] < or > higher[i]"""
    ### insert your code here
def SVMextraT(feat,w,lower=data['PrimalT'],higher=data['DualT']):
    """Input: feat = matrix of features, w - list/array of weights (length features + 1), 
     higher/lower = performance measure  (if feat * w < W we run the 'lower' method)
     Returns total extra time for running the slower algorithm"""
    #### insert your code here 
def SVMcount(feat,w,lower=data['PrimalT'],higher=data['DualT']):
    """Input: feat = matrix of features, w - list/array of weights (length features + 1), 
     higher/lower = performance measure  (if feat * w < W we run the 'lower' method)
     Returns total extra time for running the slower algorithm"""
    ### insert your code here

print("Results from training on whole data")
print("Effectiveness of solution: %.2f viol, %.2f ticks, %d misclassified"%(
print("Expanded feature set soln: %.2f viol, %.2f ticks, %d misclassified"%(

Now partition your data into 4 subsets of about 21 instances each. Take each subset in turn as the test data and use the remainder to train (fit) the SVM. How well, on average, does this approach work for data it hasn’t been trained on?

Note: while in general one might want to do the splitting randomly, this could lead to somewhat misleading results. The data sets are very variable in size & complexity, so it would make sense to try to split them a bit more systematically to ensure none of the groups only have very big/complex problems or only small trivial data sets. If you are feeling creative, make up a way to split the instances that deterministic or less random and more likely to be balanced.

# do things correctly using a subset of data for training only
def SVMtest(feat,nGroup=4,method=SVM,lower=data['PrimalT'],higher=data['DualT']):
    """Split features feat into nGroup groups, train using method, evaluate using lower/higher"""
    ### insert your code here
    return (violation,extraT,count) # total performance across all 4 groups with the 3 performance measures

print("Results when using original features: %.2f violation, %.2f ticks, %d misclassified"%
print("Results when using expanded features: %.2f violation, %.2f ticks, %d misclassified"%
CPXPARAM_Read_DataCheck                          1
Tried aggregator 1 time.
LP Presolve eliminated 2 rows and 18 columns.
Aggregator did 1 substitutions.
Reduced LP has 55 rows, 57 columns, and 745 nonzeros.
Presolve time = 0.00 sec. (0.14 ticks)
Initializing dual steep norms . . .

Iteration log . . .
Iteration:     1   Dual objective     =             0.000000
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation =  25.846194539417073 7/3 points misclassified out of 58
Separation =  0.0001707715384478243
CPXPARAM_Read_DataCheck                          1
Tried aggregator 1 time.
LP Presolve eliminated 2 rows and 14 columns.
Reduced LP has 68 rows, 74 columns, and 920 nonzeros.
Presolve time = 0.00 sec. (0.13 ticks)
Initializing dual steep norms . . .

Iteration log . . .
Iteration:     1   Dual objective     =             0.000000
Iteration:    62   Dual objective     =            33.875482
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation =  34.94454272579978 5/11 points misclassified out of 70
Separation =  3.8695707264822075e-05
CPXPARAM_Read_DataCheck                          1
Tried aggregator 1 time.
LP Presolve eliminated 0 rows and 13 columns.
Reduced LP has 64 rows, 69 columns, and 889 nonzeros.
Presolve time = 0.00 sec. (0.12 ticks)
Initializing dual steep norms . . .

Iteration log . . .
Iteration:     1   Dual objective     =             0.000000
Iteration:    62   Dual objective     =            37.870247
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation =  38.05281703487499 5/13 points misclassified out of 64
Separation =  0.001642740150009337
CPXPARAM_Read_DataCheck                          1
Tried aggregator 1 time.
LP Presolve eliminated 0 rows and 11 columns.
Reduced LP has 69 rows, 76 columns, and 959 nonzeros.
Presolve time = 0.00 sec. (0.13 ticks)
Initializing dual steep norms . . .

Iteration log . . .
Iteration:     1   Dual objective     =             0.000000
Iteration:    62   Dual objective     =            35.678993
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation =  35.904024755019826 3/14 points misclassified out of 69
Separation =  0.007035393760542038
Performance with 4 groups:1856.27 viol, 9805082.54 ticks, 48 misclassified
CPXPARAM_Read_DataCheck                          1
Tried aggregator 1 time.
LP Presolve eliminated 0 rows and 18 columns.
Reduced LP has 62 rows, 68 columns, and 1230 nonzeros.
Presolve time = 0.00 sec. (0.17 ticks)
Initializing dual steep norms . . .

Iteration log . . .
Iteration:     1   Dual objective     =             0.000000
Perturbation started.
Iteration:    51   Dual objective     =             0.000000
Removing perturbation.
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation =  16.6574843028596 2/6 points misclassified out of 62
Separation =  0.0005100711480481556
CPXPARAM_Read_DataCheck                          1
Tried aggregator 1 time.
LP Presolve eliminated 0 rows and 16 columns.
Reduced LP has 61 rows, 69 columns, and 1202 nonzeros.
Presolve time = 0.00 sec. (0.16 ticks)
Initializing dual steep norms . . .

Iteration log . . .
Iteration:     1   Dual objective     =             0.000000
Iteration:    62   Dual objective     =            22.943956
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation =  23.44995181413229 6/6 points misclassified out of 61
Separation =  0.0007552419195052046
CPXPARAM_Read_DataCheck                          1
Tried aggregator 1 time.
LP Presolve eliminated 1 rows and 20 columns.
Reduced LP has 67 rows, 72 columns, and 1301 nonzeros.
Presolve time = 0.00 sec. (0.21 ticks)
Initializing dual steep norms . . .

Iteration log . . .
Iteration:     1   Dual objective     =             0.000000
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation =  21.538678238361175 4/2 points misclassified out of 68
Separation =  1.1258324539094591e-05
CPXPARAM_Read_DataCheck                          1
Tried aggregator 1 time.
LP Presolve eliminated 0 rows and 17 columns.
Reduced LP has 70 rows, 77 columns, and 1385 nonzeros.
Presolve time = 0.00 sec. (0.19 ticks)
Initializing dual steep norms . . .

Iteration log . . .
Iteration:     1   Dual objective     =             0.000000
Iteration:    62   Dual objective     =            16.821517
Solved with status JobSolveStatus.OPTIMAL_SOLUTION
Total violation =  18.917275053481482 5/1 points misclassified out of 70
Separation =  0.0001448685861530269
Performance with 4 groups:1125.87 viol, 9787073.67 ticks, 36 misclassified

(1125.871757496142, 9787073.669999996, 36)

Extension exercises

  • We could modify our SVM approach to use an objective that better matches what we want to achieve (e.g. minimise additional computational effort from misclassified instances)
  • Use the sklearn builtin method LinearSVC to implement a support vector machine. Documentation is available here The basic approach is:
from sklearn import svm
mdl = svm.LinearSVC()
mdl.fit(X,y) # X = training data, y is -1 when PrimalT < DualT and +1 otherwise
mdl.predict(x) # predict outcome using test data x