In a previous post I discussed how to generate a Markov Chain with a basic algorithm, but memory usage was a limiting factor of that algorithm. In this post we will attempt to solve that memory problem. If you are unfamiliar with how to generate a Markov Chain, check out the post The Making of a Markov Chain.

In theory a Markov Chain models the statistical dependency of a system through the redundant information in the data. If a system produces a large proportion of redundant information relative to true information, then in any finite data set produced by that system there is a high probability that some sequences will never occur. As demonstrated in the previous posts on Markov Chains, a chain of order five can start to reproduce some of the statistical properties of English. Because written English is highly structured, the text must contain at least some amount of redundant information, so there are many sequences of text that are unlikely to occur or will never occur. Our Markov Chain algorithm could therefore improve its memory performance by not allocating memory for sequences that do not occur in the given text.
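To get a feel for how sparse real text is, we can count the distinct 5-grams in a sample and compare that against the number of possible 5-grams. The snippet below is only illustrative; the repeated pangram is a stand-in for a real training text, which would show the same effect at a larger scale.

```python
from collections import Counter

# A hypothetical sample; a real training text would be much longer.
text = "the quick brown fox jumps over the lazy dog " * 50

n = 5
# Count every length-5 substring that actually occurs.
observed = Counter(text[i:i + n] for i in range(len(text) - n + 1))

# With a 27-symbol alphabet (a-z plus space) there are 27**5
# possible 5-grams, but only a tiny fraction ever occur.
possible = 27 ** n
print(len(observed), "distinct 5-grams out of", possible, "possible")
```

Because the sample repeats with a period of 44 characters, at most 44 distinct 5-grams can occur, against more than fourteen million possibilities.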

Design Goals

Before we start programming the new memory system we need to consider what operations the Markov Chain algorithm requires from it. The first step in generating a Markov Chain is to count the number of times each sequence occurs in the data and save the result to memory; to minimize the memory usage of the Markov Chain, memory for the sequences should be allocated dynamically as the sequences are found. The second step involves normalizing the data so that the probability of any given character occurring next is known in terms of the sequence that precedes it, so it must be possible to iterate through all of the allocated memory in a manner that allows for this normalization. Finally, text is generated by looking up the probability distribution of characters given the last few characters. From our earlier assumption that the data is sparse, there is a high probability that the Markov Chain will try to look up values for sequences that have never been allocated, so the memory system must be able to determine that a sequence has not been allocated and return a reasonable default value, in this case zero.
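The three requirements can be sketched in a few lines using a dictionary of counters. This is not the TextGen data structure described below, just a minimal illustration of the operations it must support; the names `counts`, `train`, and `probability` are mine.

```python
from collections import defaultdict, Counter

counts = defaultdict(Counter)

def train(text, order):
    # Requirement 1: allocate storage only for sequences
    # that actually occur in the training text.
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1

def probability(prefix, char):
    # Requirements 2 and 3: normalize the stored counts, and
    # return 0 for any sequence that was never seen, without
    # allocating anything new.
    if prefix not in counts:
        return 0.0
    total = sum(counts[prefix].values())
    return counts[prefix][char] / total

train("abab", 1)
print(probability("a", "b"))  # 1.0, since every 'a' is followed by 'b'
```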

The Algorithm

To implement this data structure we will use a class with three attributes and three methods. The three attributes hold the apparent shape of the array, the indices that contain data, and the data itself. One method sets a numerical value for any n-gram, a second returns the value associated with an n-gram, and a third normalizes all of the values.

The following pseudocode describes one way to implement this data structure.

class sdata
    #the apparent shape of the full array
    shape = (1,)
    #the data will either be an empty list,
    #a list of sdata instances, or a list
    #of integers. The data in the list is
    #maintained in the order in which it was
    #added.
    data = []
    #a list of the indices that contain
    #non-zero data in the order in which they
    #were added.
    indices = []

    def setter(key, value)
        index = key[0]
        key_pass = key[1:]
        #check to see if index is within the bounds
        check_index(index)

        #check if that index has been used before
        if index in indices
            id = indices.index(index)
            #update the value. if data[id] is an
            #instance of sdata then
            #sdata.setter(key_pass, value) should
            #be called.
            data[id][key_pass] = value
        else
            #add the index to the list of indices
            indices.append(index)
            #add an empty sdata object with the
            #correct shape to the data
            data.append(sdata(shape[1:]))
            #the assignment should call
            #sdata.setter(key_pass, value) on the
            #empty sdata that was just added.
            data[-1][key_pass] = value

    def getter(key)
        index = key[0]
        key_pass = key[1:]
        #check to see if index is within the bounds
        check_index(index)

        #check if that index has been used before
        if index in indices
            id = indices.index(index)
            #return the value. if data[id] is an
            #instance of sdata then
            #sdata.getter(key_pass) should
            #be called.
            return data[id][key_pass]
        else
            tmp_shape = shape
            while (len(key) != 0)
                #for each index in key check
                #to see if it is within the
                #bounds of the corresponding
                #dimension
                index = key[0]
                key = key[1:]
                check_index(index)
                tmp_shape = tmp_shape[1:]

            if len(tmp_shape) != 0
                #if requested data is
                #multidimensional return an empty
                #sdata of the correct size
                return sdata(tmp_shape)
            else
                #return the default value
                return 0

    def normalize()
        if len(shape) != 1
            #if data is a list of sdata instances
            for sub_array in data
                #normalize each instance of sdata
                sub_array.normalize()
        else
            total = sum(data)
            #normalize the list of values
            if total != 0
                data /= total

The full python source code is available at the GitHub repository Iprocess-Projects: TextGen_1.2.
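For readers who want something runnable without the full repository, here is a compact Python sketch of the same structure. It is an illustration rather than the TextGen_1.2 source: it uses a plain dict in place of the parallel indices/data lists, and for brevity it always returns 0 for an unallocated key, where the pseudocode returns an empty sdata for partial keys.

```python
class SparseArray:
    """Sketch of the sdata structure: only indices that are
    actually set consume memory."""

    def __init__(self, shape):
        self.shape = shape
        self.data = {}  # index -> SparseArray (inner dims) or number

    def __setitem__(self, key, value):
        index, rest = key[0], key[1:]
        if not 0 <= index < self.shape[0]:
            raise IndexError(index)
        if rest:
            # Allocate an inner array on first use only.
            child = self.data.setdefault(index, SparseArray(self.shape[1:]))
            child[rest] = value
        else:
            self.data[index] = value

    def __getitem__(self, key):
        index, rest = key[0], key[1:]
        if not 0 <= index < self.shape[0]:
            raise IndexError(index)
        if index not in self.data:
            return 0  # default for any sequence never stored
        return self.data[index][rest] if rest else self.data[index]

    def normalize(self):
        if len(self.shape) > 1:
            for child in self.data.values():
                child.normalize()
        else:
            total = sum(self.data.values())
            if total:
                for k in self.data:
                    self.data[k] /= total

a = SparseArray((3, 3))
a[(0, 1)] = 3
a[(0, 2)] = 1
a.normalize()
print(a[(0, 1)], a[(2, 2)])  # 0.75 0
```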

The Result

This method of storing the Markov Chain data is much more efficient than the naive method used in the original version of TextGen. TextGen_1.1 allocated memory for all possible n-grams at a given order, so my computer, which has approximately 4GB of available memory, would run out of memory when using a Markov Chain of an order higher than five. When training TextGen_1.2 on the same text with a fifth order Markov Chain, the memory usage was negligible; in fact, on my computer TextGen_1.2 can calculate up to fifteenth order Markov Chains when operating on a 1MB text file. In addition to the reduction in memory usage, the normalization step of the algorithm runs much faster, since it now normalizes only the non-trivial values in memory rather than iterating through all possible n-grams.

In a quick comparison test of the two versions of TextGen I loaded the book The Mysterious Island by Jules Verne from Project Gutenberg and modified the code to record how long it took to generate the Markov Chains. TextGen_1.1 took 366.22 seconds to finish with fifth order chains; TextGen_1.2 took 47.50 seconds with fifth order chains and 151.51 seconds with fifteenth order chains.

Since this new version can produce much higher order chains, let's compare the output of the two versions of the program; again I used The Mysterious Island as the training text. Here is a section of the output from TextGen_1.1 at fifth order.

arrived very little apprecipitated in the nautilus forty millions for every useful trees cleared to feared only for longer on a descended the meanwhile that placed his needed in the rent barren going since the chest part of it applied that in heard and we canoes the mysters of this ebb to the stormy cloud dissipate instinct all the house new zealand wind his parts and a height happened.

it is there besieged to fear off the island yonder grass.

cyrus harding.

at these so called the steel. the reply direction near an island but reach itit was not wrong fallen great in the exterity. these quantity of those of iron enclosure yield.

herbert islet in trace of so i see had azotic acid them up by acquainted by strong rotten in a stained any serious acquaint set food which cyrus harding the blow granite for the animal the neither having something the matter.

Now for a section of the output from TextGen_1.2 at fifteenth order.

by jove said spilett and look carefully ayrton for it is possible that under the masses of trees which covered the ground. the thick grass completely muffled their footsteps. the colonists for one evening pencroft listening at the door of his room heard these words escape from his lips

no here i never

the sailor was wrong to despise the proceedings of the genius of the island now i do not suppose that illness would ever attack them.

all were indeed wonderfully well. herbert had a lively and reverent love for the engineer. and he made known his ideas to his companions came up with him. the place where the footprints were to be found cyrus remarked to the reporter held him back with them

yes indeed said pencroft.

it will be enough

the reporter herbert will be dead.

As demonstrated, this new version is not only more memory efficient and faster, but the higher order chains also significantly improve the quality of the generated text.

This new method works well, but there is still room for improvement. At a minimum each n-gram needs $n+4$ bytes to store all of the relevant data: n bytes for the n-gram string and four bytes to store the number of times that n-gram has appeared. If the training text is composed of N characters, then it contains fewer than N n-grams, so the minimum necessary memory is less than $N*(n+4)$ bytes. For fifteenth order Markov Chains trained on a 1MB text file the required memory would therefore be approximately 19MB, yet when training TextGen_1.2 on The Mysterious Island, which is 1.08MB, approximately 4GB of memory was used. So an improved algorithm could expect up to a two-hundred times decrease in memory usage. While TextGen_1.2 and the algorithms that it is based on are significantly improved from the previous version, there is still room for more optimization.
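The estimate above can be checked with a few lines of arithmetic; the 1MB text size and the 4GB measurement are the figures quoted in the text.

```python
# Each n-gram needs at most n + 4 bytes, and a text of N characters
# contains fewer than N distinct n-grams.
n = 15                 # order of the Markov Chain
N = 1_000_000          # characters in a 1MB training text
upper_bound = N * (n + 4)
print(upper_bound / 1e6, "MB")  # 19.0 MB

# TextGen_1.2 used roughly 4GB on a text of about this size, so the
# potential improvement factor is about:
print(4e9 / upper_bound)  # roughly 200x
```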