Text
Natural Language Processing (NLP), Linguistics, and Programming Languages
Language Modules
Natural-language-specific stop words, tokenization, stemming, etc.
English
Currently English is the only language module. A language module supports tokenization, stemming, and stop words. The stemmer is from tartarus.org and is released under a compatible BSD license. (It is not yet available via Maven, so its source has been checked into the Axle GitHub repo.)
Example
val text = """
Now we are engaged in a great civil war, testing whether that nation, or any nation,
so conceived and so dedicated, can long endure. We are met on a great battle-field of
that war. We have come to dedicate a portion of that field, as a final resting place
for those who here gave their lives that that nation might live. It is altogether
fitting and proper that we should do this.
"""
Usage
import axle.nlp.language.English
English.
tokenize(text.toLowerCase).
filterNot(English.stopWords.contains).
map(English.stem).
mkString(" ")
// res0: String = "now we engag great civil war test whether nation ani nation so conceiv so dedic can long endur we met great battle-field war we have come dedic portion field final rest place those who here gave live nation might live altogeth fit proper we should do"
Edit Distance
See the Wikipedia page on Edit distance
Levenshtein
See the Wikipedia page on Levenshtein distance
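Before looking at Axle's implementation, the metric itself is worth sketching. The following is a minimal, standalone dynamic-programming Levenshtein distance in plain Scala (independent of Axle's types), useful for checking the library's results by hand:

```scala
// Minimal dynamic-programming Levenshtein distance.
// d(i)(j) holds the edit distance between a's first i chars and b's first j chars.
def levenshtein(a: String, b: String): Int = {
  val d = Array.ofDim[Int](a.length + 1, b.length + 1)
  for (i <- 0 to a.length) d(i)(0) = i // i deletions
  for (j <- 0 to b.length) d(0)(j) = j // j insertions
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val substitutionCost = if (a(i - 1) == b(j - 1)) 0 else 1
    d(i)(j) = List(
      d(i - 1)(j) + 1,                    // deletion
      d(i)(j - 1) + 1,                    // insertion
      d(i - 1)(j - 1) + substitutionCost  // substitution (or match)
    ).min
  }
  d(a.length)(b.length)
}

levenshtein("the quick brown fox", "the quik brown fax") // 2
```

Here the distance is 2: one deletion ("quick" to "quik") plus one substitution ("fox" to "fax").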
Imports and implicits
import org.jblas.DoubleMatrix
import cats.implicits._
import spire.algebra.Ring
import spire.algebra.NRoot
import axle._
import axle.nlp.Levenshtein
import axle.jblas._
implicit val ringInt: Ring[Int] = spire.implicits.IntAlgebra
implicit val nrootInt: NRoot[Int] = spire.implicits.IntAlgebra
implicit val laJblasInt = linearAlgebraDoubleMatrix[Int]
implicit val space = Levenshtein[IndexedSeq, Char, DoubleMatrix, Int]()
Usage
space.distance("the quick brown fox", "the quik brown fax")
// res2: Int = 2
Usage with Spire's distance operator
Imports
import axle.algebra.metricspaces.wrappedStringSpace
import spire.syntax.metricSpace.metricSpaceOps
Usage
"the quick brown fox" distance "the quik brown fax"
// res3: Int = 2
"the quick brown fox" distance "the quik brown fox"
// res4: Int = 1
"the quick brown fox" distance "the quick brown fox"
// res5: Int = 0
Vector Space Model
See the Wikipedia page on Vector space model
Example
val corpus = Vector(
"a tall drink of water",
"the tall dog drinks the water",
"a quick brown fox jumps the other fox",
"the lazy dog drinks",
"the quick brown fox jumps over the lazy dog",
"the fox and the dog are tall",
"a fox and a dog are tall",
"lorem ipsum dolor sit amet"
)
Unweighted Distance
The simplest application of the vector space model to documents is the unweighted space:
import cats.implicits._
import spire.algebra.Field
import spire.algebra.NRoot
import axle.nlp.language.English
import axle.nlp.TermVectorizer
implicit val fieldDouble: Field[Double] = spire.implicits.DoubleAlgebra
implicit val nrootDouble: NRoot[Double] = spire.implicits.DoubleAlgebra
val vectorizer = TermVectorizer[Double](English.stopWords)
val v1 = vectorizer(corpus(1))
// v1: Map[String, Double] = Map(
// "tall" -> 1.0,
// "dog" -> 1.0,
// "drinks" -> 1.0,
// "water" -> 1.0
// )
val v2 = vectorizer(corpus(2))
// v2: Map[String, Double] = Map(
// "brown" -> 1.0,
// "quick" -> 1.0,
// "jumps" -> 1.0,
// "fox" -> 2.0,
// "other" -> 1.0
// )
The UnweightedDocumentVectorSpace object defines a space method, which returns a spire.algebra.MetricSpace for document vectors:
import axle.nlp.UnweightedDocumentVectorSpace
implicit val unweighted = UnweightedDocumentVectorSpace().normed
unweighted.distance(v1, v2)
// res7: Double = 3.4641016151377544
unweighted.distance(v1, v1)
// res8: Double = 0.0
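The normed unweighted distance above is just the Euclidean distance between term-count vectors, with absent terms treated as zero. A hand-rolled check in plain Scala (term counts copied from the v1 and v2 maps printed above):

```scala
// Euclidean distance between sparse term vectors; absent terms count as 0.0.
def euclidean(u: Map[String, Double], v: Map[String, Double]): Double = {
  val keys = u.keySet ++ v.keySet
  math.sqrt(keys.toSeq.map { k =>
    val d = u.getOrElse(k, 0d) - v.getOrElse(k, 0d)
    d * d
  }.sum)
}

val u1 = Map("tall" -> 1d, "dog" -> 1d, "drinks" -> 1d, "water" -> 1d)
val u2 = Map("brown" -> 1d, "quick" -> 1d, "jumps" -> 1d, "fox" -> 2d, "other" -> 1d)

euclidean(u1, u2) // 3.4641016151377544, i.e. sqrt(12), matching unweighted.distance(v1, v2)
euclidean(u1, u1) // 0.0
```

The two documents share no terms, so every coordinate differs: four squared differences of 1.0 from u1, plus 1 + 1 + 1 + 4 + 1 from u2, giving sqrt(12).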
Compute a "distance matrix" for a given set of vectors using the metric space:
import axle.jblas._
import axle.algebra.DistanceMatrix
val dm = DistanceMatrix(corpus.map(vectorizer))
dm.distanceMatrix.show
// res9: String = """0.000000 1.732051 3.316625 2.449490 3.162278 2.000000 2.000000 2.828427
// 1.732051 0.000000 3.464102 1.732051 3.000000 1.732051 1.732051 3.000000
// 3.316625 3.464102 0.000000 3.316625 2.236068 2.645751 2.645751 3.605551
// 2.449490 1.732051 3.316625 0.000000 2.449490 2.000000 2.000000 2.828427
// 3.162278 3.000000 2.236068 2.449490 0.000000 2.449490 2.449490 3.464102
// 2.000000 1.732051 2.645751 2.000000 2.449490 0.000000 0.000000 2.828427
// 2.000000 1.732051 2.645751 2.000000 2.449490 0.000000 0.000000 2.828427
// 2.828427 3.000000 3.605551 2.828427 3.464102 2.828427 2.828427 0.000000"""
dm.distanceMatrix.max
// res10: Double = 3.605551275463989
TF-IDF Distance
import axle.nlp.TFIDFDocumentVectorSpace
val tfidf = TFIDFDocumentVectorSpace(corpus, vectorizer).normed
tfidf.distance(v1, v2)
// res11: Double = 4.068944074907273
tfidf.distance(v1, v1)
// res12: Double = 0.0
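TF-IDF reweights each term frequency by how rare the term is across the corpus, so that common terms contribute less to the distance than distinctive ones. A minimal sketch of the standard weighting, tf(t, d) * log(N / df(t)); this is illustrative only, and the exact smoothing TFIDFDocumentVectorSpace applies may differ:

```scala
// Sketch of standard TF-IDF weighting: tf(t, d) * log(N / df(t)),
// where N is the corpus size and df(t) is the number of documents containing t.
// Assumes every term in doc occurs somewhere in corpusVectors (df > 0).
def tfidfWeights(
    doc: Map[String, Double],
    corpusVectors: Seq[Map[String, Double]]): Map[String, Double] = {
  val n = corpusVectors.size.toDouble
  doc.map { case (term, tf) =>
    val df = corpusVectors.count(_.contains(term)).toDouble
    term -> tf * math.log(n / df)
  }
}
```

A term appearing in every document gets weight log(N/N) = 0, while a term unique to one document gets the maximum boost, log(N).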
Angluin Learner
Models Dana Angluin's Language Learner.
Example: Baby Angluin Learner
Imports
import axle._
import axle.lx._
import Angluin._
Setup
val mHi = Symbol("hi")
val mIm = Symbol("I'm")
val mYour = Symbol("your")
val mMother = Symbol("Mother")
val mShut = Symbol("shut")
val mUp = Symbol("up")
val Σ = Alphabet(Set(mHi, mIm, mYour, mMother, mShut, mUp))
val s1 = Expression(mHi :: mIm :: mYour :: mMother :: Nil)
val s2 = Expression(mShut :: mUp :: Nil)
val ℒ = Language(Set(s1, s2))
val T = Text(s1 :: ♯ :: ♯ :: s2 :: ♯ :: s2 :: s2 :: Nil)
val ɸ = memorizingLearner
Usage
import axle.algebra.lastOption
val outcome = lastOption(ɸ.guesses(T))
// outcome: Option[Grammar] = Some(
// value = HardCodedGrammar(
// ℒ = Language(
// sequences = Set(
// Expression(symbols = List('hi, 'I'm, 'your, 'Mother)),
// Expression(symbols = List('shut, 'up))
// )
// )
// )
// )
outcome.get.ℒ
// res14: Language = Language(
// sequences = Set(
// Expression(symbols = List('hi, 'I'm, 'your, 'Mother)),
// Expression(symbols = List('shut, 'up))
// )
// )
ℒ
// res15: Language = Language(
// sequences = Set(
// Expression(symbols = List('hi, 'I'm, 'your, 'Mother)),
// Expression(symbols = List('shut, 'up))
// )
// )
T
// res16: Text = Iterable(
// Expression(symbols = List('hi, 'I'm, 'your, 'Mother)),
// Expression(symbols = List()),
// Expression(symbols = List()),
// Expression(symbols = List('shut, 'up)),
// Expression(symbols = List()),
// Expression(symbols = List('shut, 'up)),
// Expression(symbols = List('shut, 'up))
// )
T.isFor(ℒ)
// res17: Boolean = true
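The memorizing learner's strategy is simple: after each utterance in the text, guess the language consisting of exactly the expressions seen so far, ignoring silent tokens. A standalone sketch of that idea in plain Scala (the names here are illustrative, not Axle's Angluin types):

```scala
// Standalone sketch of a memorizing learner: after each utterance,
// guess the language made of exactly the expressions observed so far.
type Expr = List[String]

// None models the silent token ♯
def guesses(text: List[Option[Expr]]): List[Set[Expr]] =
  text.scanLeft(Set.empty[Expr]) { (seen, utterance) =>
    utterance match {
      case Some(e) => seen + e // remember a new (or repeated) expression
      case None    => seen     // silence adds nothing
    }
  }.tail

val s1 = List("hi", "I'm", "your", "Mother")
val s2 = List("shut", "up")
val text = List(Some(s1), None, None, Some(s2), None, Some(s2), Some(s2))

guesses(text).last // Set(List(hi, I'm, your, Mother), List(shut, up))
```

Such a learner converges to the target language exactly when the text eventually presents every expression of the language, which is why T.isFor(ℒ) matters above.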
Gold Paradigm
Models the Gold Paradigm.
Example: Baby Gold Learner
Imports
import axle._
import axle.lx._
import GoldParadigm._
Setup
val mHi = Morpheme("hi")
val mIm = Morpheme("I'm")
val mYour = Morpheme("your")
val mMother = Morpheme("Mother")
val mShut = Morpheme("shut")
val mUp = Morpheme("up")
val Σ = Vocabulary(Set(mHi, mIm, mYour, mMother, mShut, mUp))
val s1 = Expression(mHi :: mIm :: mYour :: mMother :: Nil)
val s2 = Expression(mShut :: mUp :: Nil)
val ℒ = Language(Set(s1, s2))
val T = Text(s1 :: ♯ :: ♯ :: s2 :: ♯ :: s2 :: s2 :: Nil)
val ɸ = memorizingLearner
Usage
import axle.algebra.lastOption
lastOption(ɸ.guesses(T)).get
// res19: Grammar = HardCodedGrammar(
// ℒ = Language(
// sequences = Set(
// Expression(
// morphemes = List(
// Morpheme(s = "hi"),
// Morpheme(s = "I'm"),
// Morpheme(s = "your"),
// Morpheme(s = "Mother")
// )
// ),
// Expression(morphemes = List(Morpheme(s = "shut"), Morpheme(s = "up")))
// )
// )
// )
ℒ
// res20: Language = Language(
// sequences = Set(
// Expression(
// morphemes = List(
// Morpheme(s = "hi"),
// Morpheme(s = "I'm"),
// Morpheme(s = "your"),
// Morpheme(s = "Mother")
// )
// ),
// Expression(morphemes = List(Morpheme(s = "shut"), Morpheme(s = "up")))
// )
// )
T
// res21: Text = Text(
// expressions = List(
// Expression(
// morphemes = List(
// Morpheme(s = "hi"),
// Morpheme(s = "I'm"),
// Morpheme(s = "your"),
// Morpheme(s = "Mother")
// )
// ),
// Expression(morphemes = List()),
// Expression(morphemes = List()),
// Expression(morphemes = List(Morpheme(s = "shut"), Morpheme(s = "up"))),
// Expression(morphemes = List()),
// Expression(morphemes = List(Morpheme(s = "shut"), Morpheme(s = "up"))),
// Expression(morphemes = List(Morpheme(s = "shut"), Morpheme(s = "up")))
// )
// )
T.isFor(ℒ)
// res22: Boolean = true
Python Grammar
This is part of a larger project on source code search algorithms.
The python2json.py script takes any Python 2.6 (or older) source file and returns a JSON document representing its abstract syntax tree. It has a couple of minor known problems, but for the most part it works.
As an example, let's say we have the following Python in example.py:
x = 1 + 2
print x
Invoke the script like so to turn example.py into json:
python2json.py -f example.py
You can also provide the input via stdin:
cat example.py | python2json.py
I find it useful to chain this pretty-printer when debugging:
cat example.py | python2json.py | python -mjson.tool
The pretty-printed result in this case is:
{
"_lineno": null,
"node": {
"_lineno": null,
"spread": [
{
"_lineno": 2,
"expr": {
"_lineno": 2,
"left": {
"_lineno": 2,
"type": "Const",
"value": "1"
},
"right": {
"_lineno": 2,
"type": "Const",
"value": "2"
},
"type": "Add"
},
"nodes": [
{
"_lineno": 2,
"name": "x",
"type": "AssName"
}
],
"type": "Assign"
},
{
"_lineno": 3,
"nodes": [
{
"_lineno": 3,
"name": "x",
"type": "Name"
}
],
"type": "Printnl"
}
],
"type": "Stmt"
},
"type": "Module"
}
Future Work
Python Grammar organization
- factor out axle-ast-python
AST
- move ast view xml (how is it able to refer to xml.Node?)
- ast.view.AstNodeFormatter (xml.Utility.escape)
- ast.view.AstNodeFormatterXhtmlLines
- ast.view.AstNodeFormatterXhtml
- Tests for axle.ast
- Redo axle.ast.* (rm throws, more typesafe)
- cats.effect for axle.ast.python2
Linguistics
- Nerode Partition
- Finish Angluin Learner
- Motivation for Gold Paradigm, Angluin Learner