Higher-order functions and function literalsProgramming Practice TutorialsStringBuilderMaps

Maps

What is a map?

To compare different authors, or to identify a good match in a web search, we can use a histogram of a document. It contains all the words used, and for each word how often it was used.

In other words, given an input text, we want to compute a mapping

\[ \textit{words} \rightarrow \mathbb{N} \]
that maps a word \(w\) to the number of times it is used in the text.

We therefore need a data type that can store pairs of (word, count), that is, pairs of (String, Int). It should support the following operations:

This data type is called a map or dictionary. A map implements a mapping from some key type to some value type.

The Java library provides a parameterized data type Map<K,V>, where K is the key type and V is the value type. We can think of it as a container for (K,V) pairs.

We can create such a map as follows:

>>> val m1 = mapOf(Pair("A", 3), Pair("B", 7))
>>> m1
{A=3, B=7}

Instead of having to write Pair for every mapping, we can use the small utility function to. It does nothing but combine two elements into a pair, and makes the syntax for creating a map nicer:

>>> 23 to 19
(23, 19)
>>> "CS109" to "Otfried"
(CS109, Otfried)
>>> val m = mapOf("A" to 7, "B" to 13)
>>> m
{A=7, B=13}

We can access the mappings using the nice mathematical syntax \(m[x]\) for the result of mapping key \(x\) by the mapping \(m\):

>>> m["A"]
7
>>> m["B"]
13
>>> m["C"]
null
Note that requesting a key that is not in the map will return the value null. This means that the result type of the map access is actually V? (see nullable types).

This means that you have to first check the result for null before you can do anything with it:

>>> m["B"] + 7
error: infix call corresponds to a dot-qualified call 'm["B"].plus(7)'
which is not allowed on a nullable receiver 'm["B"]'. Use ?.-qualified
call instead

Alternatively, you can use the getOrElse method: if the key is not in the map, it will use the provided code to compute a default value:

>>> m.getOrElse("B") { 99 }
13
>>> m.getOrElse("C") { 99 }
99
>>> m.getOrElse("B") { 99 } + 7
20
>>> m.getOrElse("C") { 99 } + 7
106
The result type of getOrElse is the (not nullable) value type V, so there is no need to check for null.

We can check if a mapping is defined for a given key using the in operator:

>>> "C" in m
false
>>> "A" in m
true

You can determine the number of entries in a map m as m.size, use m.isEmpty() to determine if the map has no mappings, and you can iterate over all the entries of a map using a for-loop:

>>> val m = mapOf("A" to 7, "B" to 13)
>>> m.size
2
>>> for ((k, v) in m)
...   println("$k -> $v")
A -> 7
B -> 13

Mutable maps

Often we need a mutable map, where we can add, update, and remove mappings. We use the syntax m[key] on the left side of an assignment to add or change a mapping. Mappings are removed using m.remove(key).

>>> val m = mutableMapOf("A" to 7, "B" to 13)
>>> m
{A=7, B=13}
>>> m["C"] = 13
>>> m
{A=7, B=13, C=13}
>>> m.remove("A")
7
>>> m
{B=13, C=13}
>>> m["B"] = 42
>>> m
{B=42, C=13}

Another useful method is getOrPut. If the given key is in the map, it returns the value from the map. Otherwise, it executes the given piece of code, stores the value in the map (for the given key), and returns the value:

>>> m.getOrPut("B") { 99 }
42
>>> m
{B=42, C=13}
>>> m.getOrPut("D") { 99 }
99
>>> m
{B=42, C=13, D=99}

Computing a word histogram

Here is my first attempt at a histogram program (histogram1.kts):

fun histogram(fname: String): Map<String, Int> {
  val file = java.io.File(fname)
  val hist = mutableMapOf<String, Int>()

  file.forEachLine {
    if (it != "") {
      val words = it.split(Regex("[ ,:;.?!<>()-]+"))
      for (word in words) {
      	if (word == "") continue
	val upword = word.toUpperCase()
	hist[upword] = hist.getOrElse(upword) { 0 } + 1
      }
    }
  }
  return hist
}

if (args.size != 1) {
  println("Usage: kts histogram1.kts <file name>")
  kotlin.system.exitProcess(1)
}
  
val fname = args[0]
val hist = histogram(fname)
println(hist)

The function histogram creates an empty map hist from strings to integers. It then looks at all words in the file, converts them to uppercase. Using getOrElse, we obtain the current mapping of the word or zero if the word is not yet there. We increase the number by one, and store the new number in the map.

When I run it on the text file text.txt, I get this output:

$ kts histogram1.kts text.txt {WHEN=2, I=3, STARTED=1,
    PROGRAMMING=1, THERE=1, WERE=2, NO=1, GRAPHICAL=2, DISPLAYS=2,
    ALL=1, COMPUTER=7, INPUT=1, AND=2, OUTPUT=2, WAS=5, DONE=1,
    USING=1, TEXT=1, A=9, SHARED=1, BY=4, MANY=1, USERS=2, EACH=1,
    USER=2, CONNECTED=1, TO=3, THE=14, FROM=2, TERMINAL=3, WHICH=1,
    IS=2, CALLED=1, THIS=1, WAY=1, BECAUSE=1, IT=1, ENDPOINT=1, OF=4,
    CONNECTION=1, EARLIEST=1, TERMINALS=1, LOOKED=1, LIKE=1,
    COMBINATION=1, PRINTER=1, WITH=4, KEYBOARD=2, IN=1, MIDDLE=1,
    SCHOOL=1, SOMETIMES=1, ALLOWED=1, PLAY=1, THAT=2, USED=1, SUCH=1,
    PRINTING=2, PRINTERS=1, LATER=1, REPLACED=1, CRT=1, COULD=1,
    TYPICALLY=1, DISPLAY=1, MATRIX=1, 25X80=1, CHARACTERS=2, ASCII=1,
    ALPHABET=1, LETTERS=1, DIGITS=1, SOME=1, SPECIAL=1, INTERACTED=1,
    TYPING=1, COMMAND=2, ON=2, RESPONDED=1}

The output is not very pretty, so I should use a nicer way of printing out the map. This can be done by iterating over the contents of the map, as follows (histogram2.kts):

fun printHistogram(h: Map<String, Int>) {
  for ((word, count) in h)
    println("%20s: %d".format(word, count))
}

The output is much nicer:

$ kts histogram2.kts text.txt 
                WHEN: 2
                   I: 3
             STARTED: 1
         PROGRAMMING: 1
               THERE: 1
                WERE: 2
                  NO: 1
           GRAPHICAL: 2
            DISPLAYS: 2
                 ALL: 1
            COMPUTER: 7
               INPUT: 1
                 AND: 2
              OUTPUT: 2
                 WAS: 5
                DONE: 1
               USING: 1
                TEXT: 1
                   A: 9
              SHARED: 1
                  BY: 4
                MANY: 1
               USERS: 2
                EACH: 1
                USER: 2
           CONNECTED: 1
                  TO: 3
                 THE: 14
                FROM: 2
            TERMINAL: 3
               WHICH: 1
                  IS: 2
              CALLED: 1
                THIS: 1
                 WAY: 1
             BECAUSE: 1
                  IT: 1
            ENDPOINT: 1
                  OF: 4
          CONNECTION: 1
            EARLIEST: 1
           TERMINALS: 1
              LOOKED: 1
                LIKE: 1
         COMBINATION: 1
             PRINTER: 1
                WITH: 4
            KEYBOARD: 2
                  IN: 1
              MIDDLE: 1
              SCHOOL: 1
           SOMETIMES: 1
             ALLOWED: 1
                PLAY: 1
                THAT: 2
                USED: 1
                SUCH: 1
            PRINTING: 2
            PRINTERS: 1
               LATER: 1
            REPLACED: 1
                 CRT: 1
               COULD: 1
           TYPICALLY: 1
             DISPLAY: 1
              MATRIX: 1
               25X80: 1
          CHARACTERS: 2
               ASCII: 1
            ALPHABET: 1
             LETTERS: 1
              DIGITS: 1
                SOME: 1
             SPECIAL: 1
          INTERACTED: 1
              TYPING: 1
             COMMAND: 2
                  ON: 2
           RESPONDED: 1

This is still not perfect, because it's quite hard to find the word we are looking for. It would be better if the list was sorted. We can do this by converting the map to a sorted map (histogram3.kts):

fun printHistogram(h: Map<String, Int>) {
  val s = h.toSortedMap()
  for ((word, count) in s)
    println("%20s: %d".format(word, count))
}

Now the output becomes:

$ kts histogram3.kts text.txt 
               25X80: 1
                   A: 9
                 ALL: 1
             ALLOWED: 1
            ALPHABET: 1
                 AND: 2
               ASCII: 1
             BECAUSE: 1
                  BY: 4
              CALLED: 1
          CHARACTERS: 2
         COMBINATION: 1
             COMMAND: 2
            COMPUTER: 7
           CONNECTED: 1
          CONNECTION: 1
               COULD: 1
                 CRT: 1
              DIGITS: 1
             DISPLAY: 1
            DISPLAYS: 2
                DONE: 1
                EACH: 1
            EARLIEST: 1
            ENDPOINT: 1
                FROM: 2
           GRAPHICAL: 2
                   I: 3
                  IN: 1
               INPUT: 1
          INTERACTED: 1
                  IS: 2
                  IT: 1
            KEYBOARD: 2
               LATER: 1
             LETTERS: 1
                LIKE: 1
              LOOKED: 1
                MANY: 1
              MATRIX: 1
              MIDDLE: 1
                  NO: 1
                  OF: 4
                  ON: 2
              OUTPUT: 2
                PLAY: 1
             PRINTER: 1
            PRINTERS: 1
            PRINTING: 2
         PROGRAMMING: 1
            REPLACED: 1
           RESPONDED: 1
              SCHOOL: 1
              SHARED: 1
                SOME: 1
           SOMETIMES: 1
             SPECIAL: 1
             STARTED: 1
                SUCH: 1
            TERMINAL: 3
           TERMINALS: 1
                TEXT: 1
                THAT: 2
                 THE: 14
               THERE: 1
                THIS: 1
                  TO: 3
           TYPICALLY: 1
              TYPING: 1
                USED: 1
                USER: 2
               USERS: 2
               USING: 1
                 WAS: 5
                 WAY: 1
                WERE: 2
                WHEN: 2
               WHICH: 1
                WITH: 4

How do maps work?

Maps are implemented using a hash table, which allows extremely fast insertion, removal, and search, but does not maintain any ordering on the keys. (Come to CS206 to learn about hash tables.)

A pronounciation dictionary

Let's build a real "dictionary", one that maps words to other words. In this case, we want to build a map that maps English words to their pronounciation (English pronounciation is so unpredictable, it would be nice to have a tool to help with this).

We will use the data file cmudict.txt. Here are a few example lines from this file:

## Date: 9-7-94
##
...
ADHERES AH0 D HH IH1 R Z
ADHERING AH0 D HH IH1 R IH0 NG
ADHESIVE AE0 D HH IY1 S IH0 V
ADHESIVE(2) AH0 D HH IY1 S IH0 V
...
The format is roughly this: A line starting with # is a comment. Other lines start with a word in upper case, then a space, then the pronounciation in a particular format (the phonemes are separated by spaces, we will not go into details).

Some words have more than one correct pronounciation, see "adhesive" above. As you can see, the additional pronounciations are indicated by the number in parentheses after the word. Here is another example for a word with three pronounciations (and partly different meanings):

MINUTE  M IH1 N AH0 T
MINUTE(2)  M AY0 N UW1 T
MINUTE(3)  M AY0 N Y UW1 T

Here is a function that reads the file and constructs a mapping. It simply ignores the additional pronounciations and only uses the first one for each word (note that it returns an immutable map even though it internally works with a mutable map): (cmudict1.kts):

fun readPronounciations(): Map<String,String> {
  val file = java.io.File("cmudict.txt")
  var m = mutableMapOf<String, String>()
  file.forEachLine {
    l ->
      if (l[0].isLetter()) {
        val p = l.trim().split(Regex("\\s+"), 2)
        val word = p[0].toLowerCase()
        if (!("(" in word))
	  m[word] = p[1]
      }
  }
  return m
}

Here's a quick test of the function:

>>> val m = readPronounciations()
>>> m["minute"]
M IH1 N AH0 T
>>> m["knight"]
N AY1 T
>>> m["night"]
N AY1 T
>>> m["weird"]
W IH1 R D

Now let's put our map to some use. English has many words that are homophones: they sound the same, like “be” and “bee”, or ”sewing” and ”sowing”. We want to find some more examples. What would be the largest number of different words that all have the same pronounciation?

To determine this, we need to create a dictionary for the opposite direction: mapping pronounciations to words. Since there may be several words with the same pronounciation, this will be a Map<String, Set<String>>, that is a mapping from strings to sets of strings.

We write a general function that computes the inverse function of a map:

fun reverseMap(m: Map<String, String>): Map<String,Set<String>> {
  var r = mutableMapOf<String,MutableSet<String>>()
  for ((word, pro) in m) {
    val s = r.getOrElse(pro) { mutableSetOf<String>() }
    s.add(word)
    r[pro] = s
  }
  return r
}

We test with some examples:

>>> val r = reverseMap(m)
>>> r[m["knight"]]
[knight, night, nite]
>>> r[m["weird"]]
[weird]
>>> r[m["minute"]]
[minot, minott, minute, mynatt]
>>> r[m["be"]]
[b, b., be, bea, bee]

And now we use the reverse map to display all the words that have at least a certain number of homophones:

fun showHomophones(k: Int) {
  val m = readPronounciations()
  var r = reverseMap(m)
  for ((pro, words) in r) {
    if (words.size >= k) {
      print("$pro (${words.size} words):")
      println("  " + words.joinToString(separator=" "))
    }
  }
}

Here is the output for \(k = 10\):

>>> showHomophones(10)
OW1 (10 words):  au aux eau eaux o o' o. oh ow owe
S IY1 (10 words):  c c. cea cie sci sea see si sie sieh
S IY1 Z (11 words):  c.'s c.s cees saez sea's seas sease sees seese seize sies
K EH1 R IY0 (10 words):  carey carie carrie cary cheri kairey kari kary kerrey kerry
F R IY1 Z (10 words):  freas frease frees freese freeze freis frese friese frieze friis
SH UW1 (11 words):  hsu schoo schou schue schuh shew shiu shoe shoo shu shue
L AO1 R IY0 (11 words):  laurie laury lawrie lawry lorey lori lorie lorrie lorry lory lowrie
M EY1 Z (10 words):  mae's maes mais maize mase may's mayes mays mayse maze
R OW1 (10 words):  reaux rheault rho ro roe roh rohe row rowe wroe

Let's try another puzzle: There are words in English that sound the same if you remove the first letter: "knight" and "night" is an example. Let's try to find all such words:

fun findWords() {
  val m = readPronounciations()
  for ((word, pro) in m) {
    val ord = word.substring(1)
    if (pro == m[ord])
      println(word)
  }
}

And here is the output:

>>> findWords()
ai
ailes
aisle
aisles
ar
eau
eaux
ee
eerie
eide
eiden
eike
eiler
eiseman
eisenberg
el
em
en
eng
es
eudy
eula
eury
ex
extra
gnats
gnu
herb
hmong
hour
hours
hwan
hwang
knab
knabb
knack
knapp
knapper
knauer
knaus
knauss
knave
kneale
knebel
knee
kneece
kneed
kneel
kneer
knees
knell
kneller
kness
knew
knicely
knick
knicks
knies
kniess
knight
knight's
knightly
knights
knill
knipp
knipper
knipple
knit
knobbe
knoble
knock
knode
knoell
knoles
knoll
knope
knot
knoth
knots
knott
knuckles
knut
knuts
llama
llana
llanes
llano
llewellyn
lloyd
mme
mmonoxide
ngo
ourso
pfahl
pfarr
pfeffer
pfister
pfizer
pfohl
pfund
psalter
psalters
pty
scent
scents
schau
whole
whorton
wrack
wracked
wracking
wrage
wrap
wrapped
wrappers
wrath
wrather
wray
wreck
wrecker
wren
wrench
wrenn
wrest
wresting
wriggle
wright
wright's
wrights
wring
wringer
wringing
wrisley
wrist
wriston
write
writer
writes
wrobel
wroe
wrona
wrote
wroten
wrought
wrubel
wruck
wrung
wrye
yu
How many of these words do you know?

There is a word in English where you can remove both the first letter or the second letter, and it still sounds the same. (In other words, if the whole word is XYABCDE, then YABCDE and XABCDE sound the same as XYABCDE.)

Can you figure out which word this is?

Higher-order functions and function literalsProgramming Practice TutorialsStringBuilderMaps