Deadline: May 17th, 2012 -- to be sent to me by email
Implement a module that reads a collection of texts into data structures that allow us to iterate through these texts word by word (possibly also sentence by sentence or paragraph by paragraph), and compute various statistics for each word.
For each position i (in each sentence/paragraph s) in each document d, your system should keep track of:
Your module should also have a data structure that represents the vocabulary of the text collection. We will need it to iterate over the vocabulary and compute word occurrence statistics.
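As a rough illustration of the kind of module described above, here is a minimal sketch in Python. The function names (tokenize, build_corpus), the lowercase regular-expression tokenizer, and the choice of integer word ids are all assumptions made for the example, not requirements of the assignment; sentence or paragraph boundaries are not tracked here.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_corpus(raw_texts):
    """Return (docs, vocab, word_counts).

    docs        -- list of documents, each a list of integer word ids
    vocab       -- list mapping word id -> word string
    word_counts -- Counter mapping word id -> occurrences in the whole collection
    """
    word_to_id = {}
    vocab = []
    docs = []
    word_counts = Counter()
    for text in raw_texts:
        ids = []
        for word in tokenize(text):
            if word not in word_to_id:
                # First time we see this word: give it the next free id.
                word_to_id[word] = len(vocab)
                vocab.append(word)
            ids.append(word_to_id[word])
        word_counts.update(ids)
        docs.append(ids)
    return docs, vocab, word_counts

# Hypothetical two-document collection, used only to show the iteration.
raw_texts = ["The cat sat on the mat.", "The dog sat."]
docs, vocab, word_counts = build_corpus(raw_texts)

# Iterate word by word: document index d, position i, word string.
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        print(d, i, vocab[w])
```

Storing word ids rather than strings keeps the later per-word and per-topic count tables small and lets them be indexed directly.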
Deadline: May 31st, 2012 -- to be sent to me by email
Add a parameter K to your system -- the number of topics in the data.
Implement a function/method/subroutine that assigns a random integer between 1 and K (or 0 and K-1) to each position (token) in a document in the collection (if a document has length 100, you will assign 100 random integers). These integers represent the topics in our document collection.
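The random assignment described above could look roughly like the following sketch. It assumes documents are stored as lists of word ids and uses the 0 to K-1 convention; the function name assign_random_topics and the seed argument are hypothetical and only there to make the example reproducible.

```python
import random

def assign_random_topics(docs, K, seed=None):
    """Return one random topic in 0..K-1 for every token of every document.

    The result has the same shape as docs: topics[d][i] is the topic
    assigned to position i of document d.
    """
    rng = random.Random(seed)
    return [[rng.randrange(K) for _ in doc] for doc in docs]

# Hypothetical toy collection: two documents given as lists of word ids.
docs = [[0, 1, 2, 3, 0, 4], [0, 5, 2]]
K = 3
topics = assign_random_topics(docs, K, seed=0)
```

A document of length 100 therefore receives exactly 100 topic integers, one per token.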
For each word w in the vocabulary of the document collection, compute the following (keep all these values in different data structures):
For each topic k (from 1 to K, or from 0 to K-1), compute the following counts (keep them in different data structures):
Deadline: June 14th, 2012 -- to be sent to me by email
Split your data into a training portion (90%) and a testing portion (10%).
Set every element of the alpha vector to the same value, 50/K, and set beta to 0.01.
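A minimal sketch of the split and the hyperparameter setup follows. It assumes the split is made at the document level (the assignment does not say whether to split by document or by token) and that beta is a single symmetric value; the function names split_documents and init_hyperparameters are hypothetical.

```python
import random

def split_documents(docs, train_fraction=0.9, seed=0):
    """Shuffle the documents and split them into (train, test) lists.

    Splitting whole documents is an assumption made for this sketch.
    """
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(docs)))
    train = [docs[i] for i in indices[:cut]]
    test = [docs[i] for i in indices[cut:]]
    return train, test

def init_hyperparameters(K):
    """Return (alpha, beta): alpha has K elements equal to 50/K, beta is 0.01."""
    alpha = [50.0 / K] * K
    beta = 0.01
    return alpha, beta

# Hypothetical collection of 20 tiny documents (lists of word ids).
all_docs = [[i, i + 1] for i in range(20)]
train_docs, test_docs = split_documents(all_docs)
alpha, beta = init_hyperparameters(K=10)
```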
Until convergence, or for a fixed number of iterations (N = 1000 or more), where one iteration is one full pass over the training portion of the document collection:
Deadline: June 28th, 2012 -- to be sent to me by email
Train the model for different values of alpha and beta, and plot the perplexity of the test data based on the computed distributions.
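For reference, perplexity is commonly computed as the exponential of the negative average log-likelihood per token. The sketch below assumes the computed distributions are available as theta (per-document topic probabilities) and phi (per-topic word probabilities); these names, and the assumption that theta is already known for the test documents, are illustrative and not prescribed by the assignment.

```python
import math

def perplexity(docs, theta, phi):
    """Perplexity of docs under document-topic and topic-word distributions.

    docs  -- list of documents, each a list of word ids
    theta -- theta[d][k] = probability of topic k in document d
    phi   -- phi[k][w]   = probability of word w under topic k
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Probability of the word, marginalised over topics.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)

# Sanity check with hypothetical values: if every topic is uniform over a
# 4-word vocabulary, each word has probability 0.25.
test_docs = [[0, 1, 2, 3]]
theta = [[0.5, 0.5]]
phi = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
pp = perplexity(test_docs, theta, phi)
```

Lower perplexity means the model assigns higher probability to the held-out text, so the values for different alpha and beta settings can be compared directly on the same test data.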
Deadline: End of the course
Change the basic LDA according to your favourite paper on the topic.