Text

Natural Language Processing (NLP), Linguistics, and Programming Languages

Language Modules

Natural-language-specific stop words, tokenization, stemming, etc.

English

Currently English is the only language module. A language module supports tokenization, stemming, and stop words. The stemmer is from tartarus.org and is released under a compatible BSD license. (It is not yet available via Maven, so its source has been checked into the Axle GitHub repository.)
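Each of these operations can also be used on its own. A minimal sketch (the inputs are illustrative; tokenize, stopWords, and stem are the same members used in the chained example below):

import axle.nlp.language.English

English.tokenize("Four score and seven years ago") // split a string into word tokens
English.stopWords.contains("and")                  // test whether a token is a stop word
English.stem("dedicated")                          // reduce a single word to its stem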

Example

val text = """
Now we are engaged in a great civil war, testing whether that nation, or any nation,
so conceived and so dedicated, can long endure. We are met on a great battle-field of
that war. We have come to dedicate a portion of that field, as a final resting place
for those who here gave their lives that that nation might live. It is altogether
fitting and proper that we should do this.
"""

Usage

import axle.nlp.language.English

English.
  tokenize(text.toLowerCase).
  filterNot(English.stopWords.contains).
  map(English.stem).
  mkString(" ")
// res0: String = "now we engag great civil war test whether nation ani nation so conceiv so dedic can long endur we met great battle-field war we have come dedic portion field final rest place those who here gave live nation might live altogeth fit proper we should do"

Edit Distance

See the Wikipedia page on Edit distance

Levenshtein

See the Wikipedia page on Levenshtein distance

Imports and implicits

import org.jblas.DoubleMatrix

import cats.implicits._

import spire.algebra.Ring
import spire.algebra.NRoot

import axle._
import axle.nlp.Levenshtein
import axle.jblas._

// Levenshtein is parameterized by a matrix type (jblas's DoubleMatrix here),
// so it needs a linear-algebra witness for DoubleMatrix as well as Ring and
// NRoot instances for the Int distance values.
implicit val ringInt: Ring[Int] = spire.implicits.IntAlgebra
implicit val nrootInt: NRoot[Int] = spire.implicits.IntAlgebra
implicit val laJblasInt = linearAlgebraDoubleMatrix[Int]
implicit val space = Levenshtein[IndexedSeq, Char, DoubleMatrix, Int]()

Usage

space.distance("the quick brown fox", "the quik brown fax")
// res2: Int = 2
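Here the second string is reached from the first by two single-character edits: deleting the c of "quick" and substituting a for o in "fox". Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions, hence 2.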

Usage with spire's distance operator

Imports

import axle.algebra.metricspaces.wrappedStringSpace
import spire.syntax.metricSpace.metricSpaceOps

Usage

"the quick brown fox" distance "the quik brown fax"
// res3: Int = 2

"the quick brown fox" distance "the quik brown fox"
// res4: Int = 1

"the quick brown fox" distance "the quick brown fox"
// res5: Int = 0

Vector Space Model

See the Wikipedia page on Vector space model

Example

val corpus = Vector(
    "a tall drink of water",
    "the tall dog drinks the water",
    "a quick brown fox jumps the other fox",
    "the lazy dog drinks",
    "the quick brown fox jumps over the lazy dog",
    "the fox and the dog are tall",
    "a fox and a dog are tall",
    "lorem ipsum dolor sit amet"
)

Unweighted Distance

The simplest application of the vector space model to documents is the unweighted space:

import cats.implicits._

import spire.algebra.Field
import spire.algebra.NRoot

import axle.nlp.language.English
import axle.nlp.TermVectorizer

implicit val fieldDouble: Field[Double] = spire.implicits.DoubleAlgebra
implicit val nrootDouble: NRoot[Double] = spire.implicits.DoubleAlgebra

val vectorizer = TermVectorizer[Double](English.stopWords) // term counts, with English stop words removed
val v1 = vectorizer(corpus(1))
// v1: Map[String, Double] = Map(
//   "tall" -> 1.0,
//   "dog" -> 1.0,
//   "drinks" -> 1.0,
//   "water" -> 1.0
// )

val v2 = vectorizer(corpus(2))
// v2: Map[String, Double] = Map(
//   "brown" -> 1.0,
//   "quick" -> 1.0,
//   "jumps" -> 1.0,
//   "fox" -> 2.0,
//   "other" -> 1.0
// )

UnweightedDocumentVectorSpace, together with .normed, provides a spire.algebra.MetricSpace for these document vectors:

import axle.nlp.UnweightedDocumentVectorSpace

implicit val unweighted = UnweightedDocumentVectorSpace().normed
unweighted.distance(v1, v2)
// res7: Double = 3.4641016151377544

unweighted.distance(v1, v1)
// res8: Double = 0.0
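These values are consistent with a plain Euclidean metric over the term-count vectors (an illustration, not a reading of the implementation): v1 and v2 share no terms, so every coordinate contributes its full squared count, giving √(4·1² + 4·1² + 2²) = √12 ≈ 3.464, and the distance from any vector to itself is 0.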

Compute a "distance matrix" for a given set of vectors using the metric space:

import axle.jblas._
import axle.algebra.DistanceMatrix

val dm = DistanceMatrix(corpus.map(vectorizer))
dm.distanceMatrix.show
// res9: String = """0.000000 1.732051 3.316625 2.449490 3.162278 2.000000 2.000000 2.828427
// 1.732051 0.000000 3.464102 1.732051 3.000000 1.732051 1.732051 3.000000
// 3.316625 3.464102 0.000000 3.316625 2.236068 2.645751 2.645751 3.605551
// 2.449490 1.732051 3.316625 0.000000 2.449490 2.000000 2.000000 2.828427
// 3.162278 3.000000 2.236068 2.449490 0.000000 2.449490 2.449490 3.464102
// 2.000000 1.732051 2.645751 2.000000 2.449490 0.000000 0.000000 2.828427
// 2.000000 1.732051 2.645751 2.000000 2.449490 0.000000 0.000000 2.828427
// 2.828427 3.000000 3.605551 2.828427 3.464102 2.828427 2.828427 0.000000"""

dm.distanceMatrix.max
// res10: Double = 3.605551275463989

TF-IDF Distance
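TF-IDF weights each term count by a function of its inverse document frequency in the corpus, so words that occur in many documents contribute less to the distance than rare ones. A common formulation (the exact weighting used by TFIDFDocumentVectorSpace may differ in detail) is w(t, d) = tf(t, d) · log(N / df(t)), where tf(t, d) is the count of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t.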

import axle.nlp.TFIDFDocumentVectorSpace

val tfidf = TFIDFDocumentVectorSpace(corpus, vectorizer).normed
tfidf.distance(v1, v2)
// res11: Double = 4.068944074907273

tfidf.distance(v1, v1)
// res12: Double = 0.0

Angluin Learner

Models Dana Angluin's language learner. The learner is presented with a Text, a sequence of expressions (interspersed with pauses, written ♯) drawn from a language, and after each observation it produces a guess at a Grammar for that language.

Example: Baby Angluin Learner

Imports

import axle._
import axle.lx._
import Angluin._

Setup

val mHi = Symbol("hi")
val mIm = Symbol("I'm")
val mYour = Symbol("your")
val mMother = Symbol("Mother")
val mShut = Symbol("shut")
val mUp = Symbol("up")

val Σ = Alphabet(Set(mHi, mIm, mYour, mMother, mShut, mUp))

val s1 = Expression(mHi :: mIm :: mYour :: mMother :: Nil)
val s2 = Expression(mShut :: mUp :: Nil)
val ℒ = Language(Set(s1, s2))

val T = Text(s1 :: ♯ :: ♯ :: s2 :: ♯ :: s2 :: s2 :: Nil) // ♯ is a pause (an empty expression)

val ɸ = memorizingLearner

Usage

import axle.algebra.lastOption

val outcome = lastOption(ɸ.guesses(T))
// outcome: Option[Grammar] = Some(
//   value = HardCodedGrammar(
//     ℒ = Language(
//       sequences = Set(
//         Expression(symbols = List('hi, 'I'm, 'your, 'Mother)),
//         Expression(symbols = List('shut, 'up))
//       )
//     )
//   )
// )

outcome.get.ℒ
// res14: Language = Language(
//   sequences = Set(
//     Expression(symbols = List('hi, 'I'm, 'your, 'Mother)),
//     Expression(symbols = List('shut, 'up))
//   )
// )

T
// res16: Text = Iterable(
//   Expression(symbols = List('hi, 'I'm, 'your, 'Mother)),
//   Expression(symbols = List()),
//   Expression(symbols = List()),
//   Expression(symbols = List('shut, 'up)),
//   Expression(symbols = List()),
//   Expression(symbols = List('shut, 'up)),
//   Expression(symbols = List('shut, 'up))
// )

T.isFor(ℒ)
// res17: Boolean = true

Gold Paradigm

Models the Gold paradigm of language identification in the limit from positive text.

Example: Baby Gold Learner

Imports

import axle._
import axle.lx._
import GoldParadigm._

Setup

val mHi = Morpheme("hi")
val mIm = Morpheme("I'm")
val mYour = Morpheme("your")
val mMother = Morpheme("Mother")
val mShut = Morpheme("shut")
val mUp = Morpheme("up")

val Σ = Vocabulary(Set(mHi, mIm, mYour, mMother, mShut, mUp))

val s1 = Expression(mHi :: mIm :: mYour :: mMother :: Nil)
val s2 = Expression(mShut :: mUp :: Nil)

val ℒ = Language(Set(s1, s2))

val T = Text(s1 :: ♯ :: ♯ :: s2 :: ♯ :: s2 :: s2 :: Nil) // ♯ is a pause (an empty expression)

val ɸ = memorizingLearner

Usage

import axle.algebra.lastOption

lastOption(ɸ.guesses(T)).get
// res19: Grammar = HardCodedGrammar(
//   ℒ = Language(
//     sequences = Set(
//       Expression(
//         morphemes = List(
//           Morpheme(s = "hi"),
//           Morpheme(s = "I'm"),
//           Morpheme(s = "your"),
//           Morpheme(s = "Mother")
//         )
//       ),
//       Expression(morphemes = List(Morpheme(s = "shut"), Morpheme(s = "up")))
//     )
//   )
// )

lastOption(ɸ.guesses(T)).get.ℒ
// res20: Language = Language(
//   sequences = Set(
//     Expression(
//       morphemes = List(
//         Morpheme(s = "hi"),
//         Morpheme(s = "I'm"),
//         Morpheme(s = "your"),
//         Morpheme(s = "Mother")
//       )
//     ),
//     Expression(morphemes = List(Morpheme(s = "shut"), Morpheme(s = "up")))
//   )
// )

T
// res21: Text = Text(
//   expressions = List(
//     Expression(
//       morphemes = List(
//         Morpheme(s = "hi"),
//         Morpheme(s = "I'm"),
//         Morpheme(s = "your"),
//         Morpheme(s = "Mother")
//       )
//     ),
//     Expression(morphemes = List()),
//     Expression(morphemes = List()),
//     Expression(morphemes = List(Morpheme(s = "shut"), Morpheme(s = "up"))),
//     Expression(morphemes = List()),
//     Expression(morphemes = List(Morpheme(s = "shut"), Morpheme(s = "up"))),
//     Expression(morphemes = List(Morpheme(s = "shut"), Morpheme(s = "up")))
//   )
// )

T.isFor(ℒ)
// res22: Boolean = true

Python Grammar

This is part of a larger project on source code search algorithms.

python2json.py will take any Python 2.6 (or older) source file and return a JSON document representing its abstract syntax tree. There are a couple of minor problems with it, but for the most part it works.

As an example, let's say we have the following Python in example.py:

x = 1 + 2
print x

Invoke the script like so to turn example.py into json:

python2json.py -f example.py

You can also provide the input via stdin:

cat example.py | python2json.py

I find it useful to pipe the output through Python's built-in JSON pretty-printer when debugging:

cat example.py | python2json.py | python -mjson.tool

The pretty-printed result in this case is:

{
    "_lineno": null,
    "node": {
        "_lineno": null,
        "spread": [
            {
                "_lineno": 2,
                "expr": {
                    "_lineno": 2,
                    "left": {
                        "_lineno": 2,
                        "type": "Const",
                        "value": "1"
                    },
                    "right": {
                        "_lineno": 2,
                        "type": "Const",
                        "value": "2"
                    },
                    "type": "Add"
                },
                "nodes": [
                    {
                        "_lineno": 2,
                        "name": "x",
                        "type": "AssName"
                    }
                ], 
                "type": "Assign"
            }, 
            {
                "_lineno": 3,
                "nodes": [
                    {
                        "_lineno": 3,
                        "name": "x",
                        "type": "Name"
                    }
                ],
                "type": "Printnl"
            }
        ],
        "type": "Stmt"
    },
    "type": "Module"
}
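
Once the AST is in JSON form it can be consumed from Scala for indexing or search. Axle does not itself ship a reader for this format; the following is a hypothetical sketch using the circe JSON library, assuming the output above has been saved to example.json:

import io.circe.Json
import io.circe.parser.parse

// Read the python2json output (hypothetical file name) and parse it
val astText: String = scala.io.Source.fromFile("example.json").mkString

// The top-level node type ("Module" for the document above)
val moduleType: Either[io.circe.Error, String] =
  parse(astText).flatMap(_.hcursor.downField("type").as[String])

// The types of the module's statements ("Assign" and "Printnl" above)
val statementTypes: Either[io.circe.Error, List[String]] =
  parse(astText)
    .flatMap(_.hcursor.downField("node").downField("spread").as[List[Json]])
    .map(_.flatMap(_.hcursor.downField("type").as[String].toOption))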

Future Work

Python Grammar organization

AST

Linguistics