Programming project 5

Download the archive pp5.zip. It contains all the files for this project.

Part 1: ListSet

In this task you implement the two missing methods from our implementation of the set ADT using a Python list (discussed in the lecture).

The implementation is in listset.py. You need to fill in the missing methods intersection and difference. (Do not change anything else in the file.)

You can use the unit test script test_listset.py for testing.

Important: Remember that the methods union, intersection, and difference do not modify the set. They return a new set for the result. Have a look at the implementation of union to see how this is done.

Submission: Upload your file listset.py to the submission server.

Part 2: Dissociated Press

In this project we implement the Dissociated press algorithm, which can generate a random text based on a given text. For instance, based on kaist.txt, it could generate this text:

all over the outset has evolved into a world leader in its school
specializing in the korean engineering and some it industrial
organizations. research emphasis from an emerging economy equipped
with national and initially staffed with academic asset in 1997 as the
lowest in korea advanced institute of high-tech manpower for
bachelor’s master’s and enhanced its school programs in march 2009
further expanded kaist’s academic asset in the school’s founding was
established in high-technology industries all lectures at kaist was a
global educational power. before the united states. before the means
of high-tech manpower for bachelor’s

The algorithm is based on n-grams: An n-gram is a sequence of \(n\) words from the original text. We will not take frequencies of n-grams into account. For instance, the text

  The bee is the bee of the bees

has the 2-grams

the bee
the bees
bee is
bee of
is the
of the

The dissociated press algorithm starts by printing a random n-gram that begins a sentence. Then it takes the last \(n-1\) words it has printed, and chooses a random n-gram that starts with the same \(n-1\) words. It prints the last word of this n-gram, and repeats. So every consecutive \(n\) words of the output text are an n-gram of the original text. In can sometimes happen that the original text contains no n-gram starting with the \(n-1\) words just printed. In this case, the algorithm simply stops.

To keep it simple, we most ignore punctuation and produce a text nearly without punctuation, as shown above. (We only separate sentences using a period.)

Project setup

I have already implemented the main algorithm in the function dissociated_press. Its arguments are the name of the original text file, the value \(n\), and the number of words of the new text to be produced. The generated text is then printed directly to the output.

I have also made the function read_text, which reads a text file and returns the words from the file in a list, all in lowercase:

>>> words = read_text("kaist.txt") 
>>> words
['.', 'korea', 'advanced', 'institute', 'of', 'science', 'and',
'technology', '(kaist)', 'was', 'established', 'in', '1971', 'as',
'the', 'nation’s', 'first', 'graduate', 'school', 'specializing',
'in', 'science', 'and', 'engineering', 'education', 'and', 'research',
'.', 'the', 'school’s', 'founding', 'was', 'a', 'catalyst', 'for',
'korea’s', 'rapid', 'rise', 'from', 'a', 'producer', 'of', 'light',
'industry', 'goods', 'to', 'a', 'world', 'leader', 'in',
'high-technology', 'industries', '.', 'the', 'political',
... and so on ...

(Note that we insert a single period . as an extra "word" to separate sentences.)

Your task is to implement the three missing functions find_ngrams, find_starters, and make_ngram_map (explained in detail below). When you have implemented all three functions, the program will work and will generate nonsense output based on four different texts: The text about KAIST mentioned above, the Declaration of Independence of the USA, a speech given by the KAIST president, and the book "Alice in Wonderland".

However, you should really test your functions one by one, as you implement them.

The three functions

find_ngrams(word_list, n)

The function find_ngrams takes as argument a list of words and an integer parameter \(n > 1\). It must return a set of tuples. Each tuple has \(n\) members and represents an ngram.

Here is an example for n = 2 and n = 3:

>>> find_ngrams(["the", "bee", "is", "the", "bee", "of", "the", "bees"], 2)
{('bee', 'of'), ('is', 'the'), ('the', 'bee'), ('bee', 'is'), 
('of', 'the'), ('the', 'bees')}
>>> find_ngrams(["the", "bee", "is", "the", "bee", "of", "the", "bees"], 3)
{('of', 'the', 'bees'), ('bee', 'is', 'the'), ('the', 'bee', 'is'), 
('is', 'the', 'bee'), ('bee', 'of', 'the'), ('the', 'bee', 'of')}
find_starters(ngrams)

The function find_starters takes as argument a set of n-grams (in the same format given by find_ngrams, that is, a set of tuples with \(n\) elements each).

The output is a List of exactly those n-grams that start with a period.

Here is an example:

s = { ("a", "bird"), (".", "a"), ("bird", "is"), (".", "bird"), 
      ("is", "flying"), (".", "the"), ("the", "bird"), 
      ("flying", ".") }
>>> find_starters(s)
[('.', 'a'), ('.', 'bird'), ('.', 'the')]

Note that the output is a Python List, not a set.

make_ngram_map(ngrams)

The function make_ngram_map takes as argument a set of n-grams (in the same format given by find_ngrams, that is, a set of tuples with \(n\) elements each).

The output is a map, where the keys are \(n-1\)-grams (that is tuples with \(n-1\) elements), and the values are sets of strings.

The map contains a key for each \(n-1\) tuple that occurs as the beginning of one of the n-grams. The corresponding value is the set of all the words that appear as the last word in such an n-gram.

For example:

>>> ng = {('bee', 'of'), ('is', 'the'), ('the', 'bee'), ('bee', 'is'), 
...       ('of', 'the'), ('the', 'bees')}
>>> make_ngram_map(ng)
{('bee',): {'is', 'of'}, ('the',): {'bee', 'bees'}, 
 ('is',): {'the'}, ('of',): {'the'}}

Note that even for n = 2, that is, when n-1 == 1, the keys in the map are 1-tuples, not just strings.

Here is an example for n = 3:

>>> ng3 = {('of', 'the', 'bees'), ('bee', 'is', 'the'), ('the', 'bee', 'is'), 
...        ('is', 'the', 'bee'), ('bee', 'of', 'the'), ('the', 'bee', 'of')}
>>> make_ngram_map(ng3)
{('bee', 'is'): {'the'}, ('bee', 'of'): {'the'}, 
('is', 'the'): {'bee'}, ('of', 'the'): {'bees'}, 
('the', 'bee'): {'is', 'of'}}

Submission: Upload your file ngrams.py to the submission server.