Clustering

k-Means Clustering

Example: Irises

A demonstration of k-Means Clustering using the Iris flower data set
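Before reaching for axle's `KMeans`, it may help to see the algorithm itself. The sketch below is a minimal, plain-Scala rendition of Lloyd's algorithm (not axle's implementation): it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points.

```scala
// Minimal k-Means (Lloyd's algorithm) on 2-D points; illustrative only,
// independent of the axle API used below.
object KMeansSketch {

  type Point = (Double, Double)

  // squared Euclidean distance (sufficient for nearest-centroid comparisons)
  def dist2(a: Point, b: Point): Double = {
    val (dx, dy) = (a._1 - b._1, a._2 - b._2)
    dx * dx + dy * dy
  }

  def mean(ps: Vector[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  def cluster(points: Vector[Point], k: Int, iterations: Int): Vector[Point] = {
    // seed centroids with the first k distinct points
    // (a real implementation would seed randomly; empty clusters are dropped)
    val init = points.distinct.take(k)
    (1 to iterations).foldLeft(init) { (centroids, _) =>
      points
        .groupBy(p => centroids.minBy(c => dist2(p, c))) // assignment step
        .values
        .map(mean)                                       // update step
        .toVector
    }
  }
}
```

axle's `KMeans` below does the same work, but over arbitrary feature spaces, with feature normalization and a pluggable metric.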

Imports for Distance quanta

``````import edu.uci.ics.jung.graph.DirectedSparseGraph
import cats.implicits._
import spire.algebra._
import axle._
import axle.quanta.Distance
import axle.quanta.DistanceConverter
import axle.jung._

implicit val fieldDouble: Field[Double] = spire.implicits.DoubleAlgebra

implicit val distanceConverter = {
  import axle.algebra.modules.doubleRationalModule
  Distance.converterGraphK2[Double, DirectedSparseGraph]
}``````

Import the Irises data set

``````import axle.data.Irises
import axle.data.Iris``````
``````val ec = scala.concurrent.ExecutionContext.global
val blocker = cats.effect.Blocker.liftExecutionContext(ec)
implicit val cs = cats.effect.IO.contextShift(ec)

val irisesIO = new Irises[cats.effect.IO](blocker)
val irises = irisesIO.irises.unsafeRunSync()``````

Make a 2-D Euclidean space implicitly available for clustering

``````import org.jblas.DoubleMatrix
import axle.algebra.distance.Euclidean
import axle.jblas.rowVectorInnerProductSpace

implicit val nrootDouble: NRoot[Double] = spire.implicits.DoubleAlgebra

implicit val space: Euclidean[DoubleMatrix, Double] = {
  implicit val ringInt: Ring[Int] = spire.implicits.IntAlgebra
  implicit val inner = rowVectorInnerProductSpace[Int, Int, Double](2)
  new Euclidean[DoubleMatrix, Double]
}``````
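The `Euclidean` space above supplies the metric k-Means uses to compare feature vectors. For intuition, the same distance can be written directly over plain `List[Double]`s (a sketch independent of axle and jblas):

```scala
// Euclidean distance between two feature vectors of equal length
def euclidean(xs: List[Double], ys: List[Double]): Double =
  math.sqrt(xs.zip(ys).map { case (x, y) => (x - y) * (x - y) }.sum)
```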

Build a classifier of irises based on sepal length and width using the k-Means algorithm

``````import spire.random.Generator.rng
import axle.ml.KMeans
import axle.ml.PCAFeatureNormalizer
import distanceConverter.cm``````
``````val irisFeaturizer =
  (iris: Iris) => List((iris.sepalLength in cm).magnitude.toDouble, (iris.sepalWidth in cm).magnitude.toDouble)

val normalizer = (PCAFeatureNormalizer[DoubleMatrix] _).curried.apply(0.98)

val classifier: KMeans[Iris, List, DoubleMatrix] =
  KMeans[Iris, List, DoubleMatrix](
    irises,
    N = 2,
    irisFeaturizer,
    normalizer,
    K = 3,
    iterations = 20)(rng)``````

Produce a "confusion matrix" relating actual species (rows) to assigned cluster (columns)

``````import axle.ml.ConfusionMatrix

val confusion = ConfusionMatrix[Iris, Int, String, Vector, DoubleMatrix](
  classifier,
  irises.toVector,
  _.species,
  0 to 2)``````
``````confusion.show
// res1: String = """  1  49   0 :  50 Iris-setosa
//  34   0  16 :  50 Iris-versicolor
//  16   0  34 :  50 Iris-virginica
//
//  51  49  50
// """``````

Visualize the final (two-dimensional) centroid positions

``````import axle.visualize.KMeansVisualization
import axle.visualize.Color._

val colors = Vector(red, blue, green)

val vis = KMeansVisualization[Iris, List, DoubleMatrix](classifier, colors)``````

Create the SVG

``````import axle.web._
import cats.effect._

vis.svg[IO]("docwork/images/k_means.svg").unsafeRunSync()``````

Plot the average distance to the cluster centroid at each iteration:

``````import scala.collection.immutable.TreeMap
import axle.visualize._

val plot = Plot(
  () => classifier.distanceLogSeries,
  connect = true,
  drawKey = true,
  colorOf = colors,
  title = Some("KMeans Mean Centroid Distances"),
  xAxis = Some(0d),
  xAxisLabel = Some("step"),
  yAxis = Some(0),
  yAxisLabel = Some("average distance to centroid"))``````

Create the SVG

``````import axle.web._
import cats.effect._

plot.svg[IO]("docwork/images/kmeansvsiteration.svg").unsafeRunSync()``````

Example: Federalist Papers

Imports

``````import axle.data.FederalistPapers
import FederalistPapers.Article``````

``````val ec = scala.concurrent.ExecutionContext.global
val blocker = cats.effect.Blocker.liftExecutionContext(ec)
implicit val cs = cats.effect.IO.contextShift(ec)

val articlesIO = FederalistPapers.articles[cats.effect.IO](blocker)

val articles = articlesIO.unsafeRunSync()``````

The result is a `List[Article]`. How many articles are there?

``````articles.size
// res5: Int = 86``````

Construct a `Corpus` object to assist with content analysis

``````import axle.nlp._
import axle.nlp.language.English

import spire.algebra.CRing
implicit val ringLong: CRing[Long] = spire.implicits.LongAlgebra

val corpus = Corpus[Vector, Long](articles.map(_.text).toVector, English)``````

Define a feature extractor using top words and bigrams.

``````val frequentWords = corpus.wordsMoreFrequentThan(100)
// frequentWords: List[String] = List(
//   "the",
//   "of",
//   "to",
//   "and",
//   "in",
//   "a",
//   "be",
//   "that",
//   "it",
//   "is",
//   "which",
//   "by",
//   "as",
// ...``````
``````val topBigrams = corpus.topKBigrams(200)
// topBigrams: List[(String, String)] = List(
//   ("of", "the"),
//   ("to", "the"),
//   ("in", "the"),
//   ("to", "be"),
//   ("that", "the"),
//   ("it", "is"),
//   ("by", "the"),
//   ("of", "a"),
//   ("the", "people"),
//   ("on", "the"),
//   ("would", "be"),
//   ("will", "be"),
//   ("for", "the"),
// ...``````
``````val numDimensions = frequentWords.size + topBigrams.size
// numDimensions: Int = 403``````
``````import axle.syntax.talliable.talliableOps

def featureExtractor(fp: Article): List[Double] = {

  val tokens = English.tokenize(fp.text.toLowerCase)
  val wordCounts = tokens.tally[Long]
  val bigramCounts = bigrams(tokens).tally[Long]
  // +0.1 smooths the raw counts so that no feature is exactly zero
  val wordFeatures = frequentWords.map(wordCounts(_) + 0.1)
  val bigramFeatures = topBigrams.map(bigramCounts(_) + 0.1)
  wordFeatures ++ bigramFeatures
}``````

Place a `MetricSpace` implicitly in scope that defines the space in which to measure similarity of Articles.

``````import spire.algebra._

import axle.algebra.distance.Euclidean

import org.jblas.DoubleMatrix

implicit val fieldDouble: Field[Double] = spire.implicits.DoubleAlgebra
implicit val nrootDouble: NRoot[Double] = spire.implicits.DoubleAlgebra

implicit val space = {
  implicit val ringInt: Ring[Int] = spire.implicits.IntAlgebra
  implicit val inner = axle.jblas.rowVectorInnerProductSpace[Int, Int, Double](numDimensions)
  new Euclidean[DoubleMatrix, Double]
}``````

Create 4 clusters using k-Means

``````import axle.ml.KMeans
import axle.ml.PCAFeatureNormalizer``````
``````import cats.implicits._
import spire.random.Generator.rng

val normalizer = (PCAFeatureNormalizer[DoubleMatrix] _).curried.apply(0.98)

val classifier = KMeans[Article, List, DoubleMatrix](
  articles,
  N = numDimensions,
  featureExtractor,
  normalizer,
  K = 4,
  iterations = 100)(rng)``````

Show cluster vs author in a confusion matrix:

``````import axle.ml.ConfusionMatrix

val confusion = ConfusionMatrix[Article, Int, String, Vector, DoubleMatrix](
  classifier,
  articles.toVector,
  _.author,
  0 to 3)``````
``````confusion.show
// res6: String = """12  0 39  1 : 52 HAMILTON
//  0  0  3  0 :  3 HAMILTON AND MADISON
//  1  5  5  4 : 15 MADISON
//  0  0  5  0 :  5 JAY
//  1  0  8  2 : 11 HAMILTON OR MADISON
//
// 14  5 60  7
// """``````