ICS 211 Homework 12

Language Model

Recently, AI programs using Large Language Models (LLMs) have started to produce text that is similar to what many humans can create. They do this by looking at a large training set of human-created text and using probability to choose words that are likely to follow the preceding words.

In this assignment, we will use similar principles, but only look at one preceding word. For example, in the sentence "the quick brown fox jumps over the lazy dog", the word "the" is followed once by "quick" and once by "lazy". If trained only on this sentence, our language model will, once it generates the word "the", choose either "quick" or "lazy" as the next word, each with equal (50%) probability.

Suppose the training set consists of the following 10 quotes:

All that we are is the result of what we have thought.
If you judge people, you have no time to love them.
The most courageous act is still to think for yourself. Aloud.
The greatest wealth is to live content with little.
The future belongs to those who prepare for it today.
I have no special talent. I am only passionately curious.
The successful warrior is the average man, with laser-like focus.
Those who dare to fail miserably can achieve greatly.
A great man is always willing to be little.
The root of suffering is attachment.

In this training set, the word "is" occurs 6 times: twice followed by "the" and once each followed by "still", "to", "always", and "attachment". So, after this simplified language model has generated the word "is", it will choose "the" with probability 2/6 (1/3), and each of the other four words with probability 1/6.

This training set is too small to give good results. With a larger training set, even such a simple language model may occasionally produce text that seems to make sense.

Your assignment is to implement such a language model.

1. Processing the Training Set (50%)

To process the training set, your code must

read files whose names are given as parameters to main (in an array of strings),
break each file into words using Scanner.next(),
convert each word to lowercase, and
store in a HashMap each word (as the key) and the list of words that can follow it with their frequency (as the value). You are only allowed (and you only need) to use the get and put methods of HashMap.
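The first steps above (splitting with Scanner.next() and lowercasing) might look roughly like the following sketch. The class name WordReaderSketch and the method readWords are hypothetical, and the example reads from a string rather than a file so that it runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WordReaderSketch {
    /** Splits the input on whitespace with Scanner.next() and lowercases each token. */
    static List<String> readWords(Scanner in) {
        List<String> words = new ArrayList<>();
        while (in.hasNext()) {
            words.add(in.next().toLowerCase());
        }
        return words;
    }

    public static void main(String[] args) {
        // For a real file, the Scanner would be built with new Scanner(new File(name)),
        // which can throw FileNotFoundException.
        Scanner in = new Scanner("The root of suffering is attachment.");
        System.out.println(readWords(in)); // [the, root, of, suffering, is, attachment.]
    }
}
```

Because Scanner.next() splits only on whitespace, punctuation stays attached to the word ("attachment." keeps its period). The sample output further down shows the same effect.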

In more detail, the value in the hash map is an object of a new class WordMapValue that you create. It holds both the number of occurrences of the key in the text and a list of followers. The elements of that list are objects of another new class, FollowingWord, that you also create. Each FollowingWord contains a following word and the number of times that word follows the key.

In the above example of the 10 quotes, for map key "is", the WordMapValue has the number 6 and a list containing 5 objects of type FollowingWord. One FollowingWord in the list has the string "the" with a frequency of 2, and the other four have the strings "still", "to", "always", and "attachment", each with a frequency of 1. The FollowingWord values in the list may be in any order.
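One possible shape for these two classes is sketched below. Only the class names WordMapValue and FollowingWord come from the assignment; the field names, the addFollower method, and the WordMapSketch class are hypothetical. The sketch also assumes that an occurrence is counted only when a following word exists, so that the follower frequencies always add up to the occurrence count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

/** A word that follows the key, and how many times it does so. */
class FollowingWord {
    final String word;
    int count = 1; // created when the pair is first seen

    FollowingWord(String word) {
        this.word = word;
    }
}

/** Map value: occurrences of the key plus the list of its followers. */
class WordMapValue {
    int occurrences = 0;
    final List<FollowingWord> followers = new ArrayList<>();

    /** Records one more occurrence of the key, followed by next. */
    void addFollower(String next) {
        occurrences++;
        for (FollowingWord f : followers) {
            if (f.word.equals(next)) {
                f.count++;
                return;
            }
        }
        followers.add(new FollowingWord(next));
    }
}

class WordMapSketch {
    /** Builds the map from an ordered word list using only get and put. */
    static HashMap<String, WordMapValue> build(List<String> words) {
        HashMap<String, WordMapValue> map = new HashMap<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            WordMapValue value = map.get(words.get(i));
            if (value == null) {
                value = new WordMapValue();
                map.put(words.get(i), value);
            }
            value.addFollower(words.get(i + 1));
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> words = List.of("is", "the", "is", "still", "is", "the");
        WordMapValue is = build(words).get("is");
        // Prints the occurrence count, then each follower with its frequency.
        System.out.println(is.occurrences);
        for (FollowingWord f : is.followers) {
            System.out.println(f.word + " " + f.count);
        }
    }
}
```

Because the final word of the input has no follower, this sketch does not count that last occurrence. Whether that matches the intended definition of "occurrences" is an assumption to check against the assignment.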

You may use any list from the Java standard library.

2. Generating Text (50%)

In this part, your code must generate text beginning with the word "the", choosing each subsequent word at random so that the likelihood of a word being chosen equals the likelihood that it follows the preceding word in the training set.

You do this by getting a random number that is at least 0 and less than 1, and multiplying it by the number of occurrences of the preceding word in the text. For example, in the 10 quotes above, "the" occurs seven times, so the scaled random number will be between 0.0 and 6.9999999.

Once you have this random number, go through the elements of the list associated with "the", decrementing your random number by the frequency of the word, until your random number becomes 0 or less. Again with the same example of the 10 quotes, "the" is followed by "result", "most", "greatest", "future", "successful", "average", and "root", each occurring just once. Assuming these words are in the list in the given order, and that the random number is 3.7, it is decremented by one for "result", giving 2.7, then again by one for "most", giving 1.7, by one for "greatest", giving 0.7, and finally by one for "future", giving -0.3, meaning we select "future" as the word to follow our first word "the".
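The decrementing walk described above can be sketched as follows. To keep the example short and self-contained, it uses two parallel arrays in place of the list of FollowingWord objects; the class name SelectionSketch and the method pick are hypothetical, and the follower list is assumed to be non-empty.

```java
import java.util.Random;

class SelectionSketch {
    /** Subtracts each follower's frequency from r until r is 0 or less. */
    static String pick(String[] words, int[] counts, double r) {
        for (int i = 0; i < words.length; i++) {
            r -= counts[i];
            if (r <= 0) {
                return words[i];
            }
        }
        // Reached only if r was not below the sum of counts; fall back to the last word.
        return words[words.length - 1];
    }

    public static void main(String[] args) {
        String[] words = {"result", "most", "greatest", "future", "successful", "average", "root"};
        int[] counts = {1, 1, 1, 1, 1, 1, 1};

        // The worked example from the text: 3.7 selects "future".
        System.out.println(pick(words, counts, 3.7));

        // A random draw: "the" occurs 7 times, so scale a [0, 1) value by 7.
        double r = new Random().nextDouble() * 7;
        System.out.println(pick(words, counts, r));
    }
}
```

With unequal frequencies the same walk gives heavier words a wider slice of the range: for followers "the" (2) and "still" (1), any value up to 2 selects "the" and any value above 2 selects "still".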

Then repeat the random selection starting with the word "future". Since "future" occurs only once in the training set, where it is followed by "belongs", the next word selected will necessarily be "belongs". Then repeat by selecting a word to follow "belongs", and so on.

Finally, bring all your code together in the main method of a TestLanguage.java class that:

reads all the files specified on its command line, and adds their contents to its (initially empty) training set
prints the word "the"
prints a randomly selected word (chosen as described above) to follow the preceding word, repeating until 300 words have been printed.
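The overall generation loop might be organized along the lines of this sketch. It is not the required TestLanguage class: the name GenerationLoopSketch, the generate method, and the use of a UnaryOperator to stand in for the random selection are all hypothetical. It assumes that the initial "the" counts toward the word total, and that generation stops early if a word has no recorded follower, a case the assignment does not specify.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

class GenerationLoopSketch {
    /** Returns up to total words, starting with first and chaining through nextWord. */
    static String generate(String first, int total, UnaryOperator<String> nextWord) {
        StringBuilder out = new StringBuilder(first);
        String current = first;
        for (int printed = 1; printed < total; printed++) {
            current = nextWord.apply(current);
            if (current == null) {
                break; // assumption: stop when no follower is known
            }
            out.append(' ').append(current);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A fixed chain stands in for the random selection so the output is predictable.
        Map<String, String> chain = new HashMap<>();
        chain.put("the", "future");
        chain.put("future", "belongs");
        chain.put("belongs", "to");
        System.out.println(generate("the", 4, chain::get)); // the future belongs to
    }
}
```

In the real program, the function passed as nextWord would look up the current word in the hash map and apply the random selection described earlier, and total would be 300.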

As an example of what you might see, this is one output of the program when given as training input the file constitution.txt:

the contrary notwithstanding. section 7. all impeachments. when not be electors shall have power to promote the more than to make or any state deprive any foreign nations, and duties shall be taken captive on the treasury of representatives shall have warned them by the president shall be deprived of the world's constitutions have passed the united states, and they shall constitute a state legislature, which a written declaration, or, on claim for more perfect union, according to the earth, the office and disciplining, the list of two- thirds, expel a president from office of the first class shall be entitled if this union a majority of the accused shall devolve on the vice president, if this article. 15th amendment to the senate and consent of government of every state of choice of honor, trust or more perfect union, and tyranny, already begun with certain rights of the united states. section 1. the united states, whose character is not be a president whenever two-thirds of the laws. section 2. the united states than to service or his death, resignation or any law direct. this article by declaring what may propose amendments 1st amendment section 3. treason shall be discharged by any poll tax shall name in office. thereupon congress may empower the militia, being twenty-one days (sundays excepted) after the constitution, shall be necessary for the said office, he shall be denied or affirmation, and regulations respecting the twelfth. in this purpose if such term. section 5. sections 1 and the united states. a republican form of his continuance in all vacancies happen by the judges in office. section 2. the powers and no tax shall enjoy any state to discharge the president from office, the right do. and certify, and excises, to deny any of america the several states,

In this example, the language model found that "states" was a likely word to follow "united" and "several", and that "legislature" was a likely word to follow "state". The language model also produced "more perfect union", a phrase that occurs in the Constitution. On the other hand, much of the text is nonsense, as we would expect from such a simple model.

Turning in the Assignment

Once you are done, find your src directory and navigate to edu.ics211.h12, then turn in the .java files to Laulima.