Random Writing
Adit Pareek
August 27, 2021

1 Introduction and Goals

The purpose of this assignment is to produce text similar to a known piece of work using a “random writing” method that employs Markov chains. This lab also had the added goal of introducing the concepts of abstraction, I/O, exceptions, and maps in Java. My personal goal for this assignment was to understand the fundamentals of white-box testing.

2 Solution Design

2.1 Assumptions

It was assumed that all four arguments would be entered as command line arguments, with none missing. It was assumed that all files named on the command line are in the same directory as the RandomWriter class, if they exist. It was assumed that the source file and the output file, if they exist, are .txt files. It was assumed that carriage returns on Windows, which make each newline two characters long, should be removed during preprocessing so that the extra characters do not distort the output of the program. It was assumed that the output of Random.nextInt(), which is pseudo-random, could be treated as truly random and would give a good representation of the data. It was assumed that readText() would be executed before writeText(). It was assumed that the seed string would be written at the beginning of the output file and that a new random seed would be cut short if it would push the output past the desired length.

2.2 General Algorithm

This lab is built on the concept of a Markov chain. Markov chains are “stochastic processes for which the description of the present state fully captures all the information that could influence the future evolution of the process” [1]. In simpler terms, a Markov chain is a random model in which the next state depends only on the current state.
Essentially, given a substring of length k, we can record every character that follows that substring in the source text and randomly choose one of them, so that characters that follow it more often are chosen more often. The last k characters of the output, including the newly chosen one, then become the next substring to look up. This process continues until the designated length of output is reached.

Two algorithms were considered: re-scanning the source for every substring on every pass, or using a HashMap to preprocess every seed and the characters that can follow it. Although the first method uses only a modest amount of space, its running time would be prohibitive, because the program would have to loop through the file every time a new character had to be generated. This translates to a time complexity of at least O(n^2) for readText() and writeText(). In the end, the latter algorithm was chosen. Despite its high space complexity, its time complexity is only O(n) for both readText() and writeText(). For the purposes of this program, therefore, time was considered more important than space.

HashMaps were used throughout the program. In readText(), the HashMap stores each substring mapped to the characters that can follow it. This proved helpful when writing the newly generated random text to the output file, as it avoided further iteration through the source file.

3 Completed Assignment

The desired scope of the project was to create a program that mimics the writing in a text file using a Markov chain. With further optimizations, this program could be used to “ghost write” for authors, musicians, and speech-givers.

3.1 Validating Inputs

Inputs are validated in the main routine. The very first step in the program is to read the command line arguments for the input filename, output filename, k-level analysis value, and the length of the output.
Once the command line arguments are read, the program checks that they are feasible. First, the program checks whether k is less than zero. The crudest form of analysis is k equal to zero (random selection), so any smaller value is not allowed. Similarly, length is checked to make sure it is non-negative, since the length of the output written to a file can only be zero or positive. Then the source and output files are checked. The source file must be readable and must have a length greater than k. A source file no longer than k cannot be used, as there are not enough characters in the file for the k-level analysis to proceed. The output file must either exist and be writable, or not exist and be created by the program. If any input fails validation, the program prints an error message to the console and exits.

After validating inputs, readText() and writeText() are called in succession. Both calls are surrounded by try-catch blocks: since the methods perform I/O and can throw checked exceptions, Java requires the caller to either catch them or declare them. All the inputs were validated in O(1) time.

[1] https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/markov-chain

3.2 Reading Text

The overall goal of readText() is to preprocess the source by mapping each seed to the characters that follow it, using a HashMap. This makes writing the final output to the result file easier, more modular, and more testable.

Once readText() is called, a File is created from the source file name and passed to a FileReader. A FileReader was chosen over other file-reading mechanisms because it reads a file character by character, which is how the algorithm consumes the text.
The FileReader is then opened and a while loop runs over the entire length of the file. For the very first entry in the HashMap (which was declared in the RandomWriter constructor), the program reads k characters one at a time. Once the full substring has been read, the HashMap stores the substring (excluding the current character) as the key and the current character as the value. For example, if k is equal to 2, the program reads three letters: the first two are the precursor to the third. In the HashMap, the key is the first two characters and the value is the third character. Then the program shifts the current substring over by one index and the following character becomes the next value.

In subsequent iterations the algorithm is the same, but the program does not need to count out k characters again. Instead, it builds the next key from the previous key without its first character, followed by the previous value (the last character) put into the HashMap. A StringBuffer was employed to build and combine these substrings because this class is more efficient (time complexity-wise) than regular Strings for concatenation.

The HashMap’s key is a String, but its value is an ArrayList of Strings. An ArrayList was chosen over other data structures because its length can grow and because it keeps duplicates, so it made more sense than a set or a regular array. If a key is already present, the new value is appended to that key’s existing ArrayList (even if the same value is already in it) and the list is put back into the HashMap.

After the preprocessing of the HashMap is complete, the entire file is read into a String and a random seed (with a length equal to k) is chosen.
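
As a rough illustration of the preprocessing described above, the following Java sketch builds the same kind of map from a String that is already in memory. It is not the submitted code: the class name MarkovMapSketch, the methods buildMap and randomSeed, and the use of substring() in place of the FileReader and StringBuffer sliding window are all assumptions made for brevity, and carriage returns are assumed to have been removed already.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch; names and structure are illustrative, not the report's code.
final class MarkovMapSketch {

    // Maps every k-character substring of text to the list of characters that
    // follow it. Duplicates are kept, so frequent followers are drawn more often.
    static Map<String, List<String>> buildMap(String text, int k) {
        Map<String, List<String>> followers = new HashMap<>();
        if (k == 0) {
            return followers; // the report leaves the map empty when k is 0
        }
        for (int i = 0; i + k < text.length(); i++) {
            String key = text.substring(i, i + k);
            String next = String.valueOf(text.charAt(i + k));
            followers.computeIfAbsent(key, unused -> new ArrayList<>()).add(next);
        }
        return followers;
    }

    // Picks a random k-character seed from the source text.
    // Assumes text.length() > k, which input validation is meant to guarantee.
    static String randomSeed(String text, int k, Random random) {
        int start = random.nextInt(text.length() - k);
        return text.substring(start, start + k);
    }

    public static void main(String[] args) {
        System.out.println(buildMap("abcabd", 2));
        System.out.println(randomSeed("abcabd", 2, new Random()));
    }
}
```

Because substring() copies k characters on every step, this version does O(nk) work rather than the O(n) of the sliding-window approach described in the report; it trades that for brevity.
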
The seed is stored in a private field that every method of a RandomWriter object can access, and it is initialized in the RandomWriter constructor.

One important design choice to note is that, if k is equal to zero, no preprocessing is necessary. A k-level analysis of zero simply chooses a random seed for each output, rather than basing it on previous characters. As a result, the HashMap remains empty for a k-level analysis of zero.

Because there is only one while loop, the text was read and preprocessed in O(n) time.

3.3 Writing Text

The overall goal of writeText() is to take the preprocessing done in readText() and write randomly generated text to an output file.

When writeText() is called, a File is created for the output and passed to a FileWriter object. The first step is to check whether the seed is longer than the desired output length. If so, the program cuts off the excess length of the seed and writes the truncated seed to the file. Without this check, the full seed would be written and the output would not meet the length requested in the method call.

Otherwise, the output file is filled iteratively according to the preprocessing done in readText(). Using the randomly generated seed from readText(), the method looks for possible next characters to add to the output file. If the current seed is a key in the HashMap, a random letter is selected from the ArrayList of values for that key. The letter is selected with the Random class from the Java standard library: .nextInt() generates a random number from 0 (inclusive) to the size of the value ArrayList (exclusive), and this number is the index of the letter to be chosen.

On the other hand, if the current seed is not in the HashMap or k is equal to 0, the program first uses .nextInt() to generate a random value between 0 and the length of the file minus k.
The subtraction is needed because, without it, taking the ensuing substring could cause an index-out-of-bounds error. Using the same method as outlined at the end of readText(), a new substring is generated depending on the value of k. Again, if the current length of the output plus the length of the new seed is greater than the designated output length, the new seed is cut off to fit.

In either case, the output file is filled with the newly generated random text and then closed. The text was written to the output file in O(n) time.

4 Software Test Methodology

The testing strategy employed in this lab is known as white-box testing. In contrast to black-box testing, white-box testing looks at the internals of the code (intermediary steps) to make sure the program is producing the desired output. For this lab specifically, large-scale black-box testing would not suffice because the human eye cannot tell minor differences apart without digging into the internals of how the program generates its output.

4.1 Validating Inputs

The very first facet of the program to test was its method of validating inputs. Although these checks were implemented in main, it was important to test that they truly worked.

The first test case set k to a negative value. According to the project specifications, the program should report an error if k is negative. Accordingly, if k is negative, the program prints a specific error message about k and exits. Similarly, if the output length is negative, the program prints a specific error message about the length and exits.

Another easily testable case was confirming whether the source file could be read and whether the result file could be written to. To test these, multiple file names (both existent and nonexistent) were passed as command line arguments to make sure they were handled correctly.
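
The checks exercised by these test cases could look roughly like the sketch below. The class InputValidationSketch, the validate method, the argument order in main, and the wording of the messages are all hypothetical, since the report does not show the actual code in main. Note that File.length() counts bytes, so it matches the character count only for single-byte encodings.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch of the validation order described in Section 3.1.
final class InputValidationSketch {

    // Returns a message for the first failed check, or null if every check passes.
    static String validate(int k, int length, File source, File output) {
        if (k < 0) {
            return "k must be non-negative";
        }
        if (length < 0) {
            return "length must be non-negative";
        }
        if (!source.canRead()) {
            return "source file cannot be read";
        }
        if (source.length() <= k) { // byte count; an approximation of characters
            return "source file must contain more than k characters";
        }
        try {
            // createNewFile() returns false, without error, if the file already exists.
            output.createNewFile();
        } catch (IOException e) {
            return "output file cannot be created";
        }
        if (!output.canWrite()) {
            return "output file cannot be written";
        }
        return null;
    }

    public static void main(String[] args) {
        if (args.length != 4) {
            System.err.println("usage: <source> <result> <k> <length>");
            return;
        }
        String error = validate(Integer.parseInt(args[2]), Integer.parseInt(args[3]),
                new File(args[0]), new File(args[1]));
        System.out.println(error == null ? "inputs look valid" : error);
    }
}
```

Returning a message instead of printing and exiting directly keeps each check easy to exercise from a test, which is one possible way to automate the manual cases listed above.
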
When a file could not be used, an error specifying the reason for exiting was printed to the console.

These tests were essentially pure black-box testing: just by looking at the output of a test case, a pass-or-fail decision could be made from human observation alone. As a result, this part of the program could be validated by black-box testing.

4.2 Reading Text

Testing readText() involved confirming that the HashMap was configured correctly for all values of k. For large files, black-box testing would not suffice, because the number of entries in the HashMap would be too large to check by simple observation. By scaling down the frame of testing, however, a pseudo-black-box approach was used to confirm that text was read and preprocessed correctly by readText().

This testing method began by defining a test source file that is easy to trace and understand. A new source file was created with the following characters (including spaces):

a b c d

On pen and paper, the correct HashMap configurations were worked out for values of k ranging from 0 to 3. For example, for k = 1, the HashMap for that file would look as follows:

{_=[b, c, d], a=[_], b=[_], c=[_]}

where _ represents a space. This was cross-checked against a run of readText() to confirm that the resulting HashMap had the same contents. This was a good general method of seeing whether the program performed as intended on a smaller scale. The same process was repeated with multiple configurations of text and new characters (such as newlines) to ensure compatibility.

For larger values of k, the key of the HashMap (the created substring) was printed at every iteration to make sure its length remained constant throughout the run. This was verified with a check that printed any substring that did not have the correct length.
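
A check of this kind can be sketched as follows. The names MapCheckSketch, buildMap, and badKeys are hypothetical, and the map is built from an in-memory String rather than through readText(); the sketch only illustrates comparing the map for the “a b c d” file against the hand-derived one and flagging keys of the wrong length.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical test helper; buildMap is a stand-in for readText().
final class MapCheckSketch {

    // Maps each k-character substring of text to the characters that follow it.
    static Map<String, List<String>> buildMap(String text, int k) {
        Map<String, List<String>> map = new HashMap<>();
        for (int i = 0; k > 0 && i + k < text.length(); i++) {
            map.computeIfAbsent(text.substring(i, i + k), unused -> new ArrayList<>())
               .add(String.valueOf(text.charAt(i + k)));
        }
        return map;
    }

    // Returns every key whose length differs from k; an empty list means the check passed.
    static List<String> badKeys(Map<String, List<String>> map, int k) {
        List<String> bad = new ArrayList<>();
        for (String key : map.keySet()) {
            if (key.length() != k) {
                bad.add(key);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Hand-derived map for the file "a b c d" with k = 1.
        Map<String, List<String>> expected = Map.of(
                " ", List.of("b", "c", "d"),
                "a", List.of(" "),
                "b", List.of(" "),
                "c", List.of(" "));
        Map<String, List<String>> actual = buildMap("a b c d", 1);
        System.out.println("matches hand-derived map: " + expected.equals(actual));
        System.out.println("keys with wrong length: " + badKeys(actual, 1));
    }
}
```
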
This method of testing, an element of white-box testing, helped verify that substring creation was working correctly.

There were, however, major edge cases that had to be accounted for. The first was dealing with newline characters. On Windows, a line break is stored as two characters, a carriage return (“\r”) followed by a newline (“\n”). Therefore, under the assumption that a line break should be only one character, all carriage returns were skipped by the program. This was especially relevant during testing, since it had to be confirmed manually that carriage returns were removed so that they did not cause further problems when creating the HashMap. This edge case had a tendency to break many of my test cases; for example, a source file containing a single newline with k = 1 should not have run, but it did. When testing manually, I checked whether the carriage return character ever appeared in the resulting HashMap. If it did, there was an error in the reading operation.

Another major edge case was very large files, with tens of thousands of lines of text. Due to their length, it was difficult to test them completely and check each key-value pair. To test the HashMaps for these files, I used the same principles as for the smaller test files and periodically printed the current string, the HashMap size (compared to the value expected mathematically, the length minus k), and the last few additions to the HashMap alongside the corresponding substrings from the source file. readText() was tested further in tandem with writeText().

4.3 Writing Text

Testing the method for writing text involved checking that random seeds were chosen correctly, that excess-length seeds were cut off, and that substrings were iterated through without skipping any. An important facet of testing this method is making sure the length of the output does not exceed the desired length.
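
For reference, the generation loop whose output length is under test might be sketched as below. This is an assumption-laden sketch, not the submitted writeText(): it returns a String instead of writing through a FileWriter, the names WriteTextSketch, generate, and randomSeed are hypothetical, and for k = 0 it assumes one random source character is appended per step.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of the writing loop from Section 3.3.
final class WriteTextSketch {

    // Random k-character slice of the source; assumes source.length() > k.
    static String randomSeed(String source, int k, Random random) {
        int start = random.nextInt(source.length() - k);
        return source.substring(start, start + k);
    }

    static String generate(String source, Map<String, List<String>> followers,
                           int k, int length, Random random) {
        StringBuilder out = new StringBuilder();
        String seed = randomSeed(source, k, random);
        out.append(seed); // the seed itself starts the output
        while (out.length() < length) {
            List<String> options = (k == 0) ? null : followers.get(seed);
            if (options != null) {
                // Known seed: draw one follower and slide the window by one character.
                String next = options.get(random.nextInt(options.size()));
                out.append(next);
                seed = seed.substring(1) + next;
            } else if (k == 0) {
                // Assumption: k = 0 means a uniformly random source character.
                out.append(source.charAt(random.nextInt(source.length())));
            } else {
                // Dead end: restart from a fresh random seed.
                seed = randomSeed(source, k, random);
                out.append(seed);
            }
        }
        if (out.length() > length) {
            out.setLength(length); // cut off any excess from the last seed
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> followers = new HashMap<>();
        followers.put("a", List.of("b", "b"));
        followers.put("b", List.of("a"));
        System.out.println(generate("abab", followers, 1, 10, new Random()));
    }
}
```

With this shape, the length test reduces to asserting that the returned text has exactly the requested number of characters, including when the seed alone is longer than the request.
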
Although writeText() has a while loop that guards the output length, it was still necessary to test the character count for each call of the method. After every call of writeText(), a testing method read through the file and counted the characters to confirm that the file’s length was equal to the desired length.

An edge case that had to be tested was when the length of newSeed is greater than the desired file length at the beginning of the program (e.g. newSeed is “hello” and the output length is 4). Originally, there was no failsafe restricting the program output to just 4 characters, so one had to be added. In the main while loop in writeText(), a similar check was then added to ensure that the length of the output file was never greater or less than the desired output length. By testing on smaller files, this issue was identified and resolved.

Another specific case was k being equal to zero, which implied that a random seed had to be selected on every single iteration of writeText(). The random seed generation was confirmed by print-statement debugging and cross-verified with a script to make sure that, over millions of iterations, all locations in the file were targeted by the random seed generation. This principle also extended to the case where a random seed needed to be generated with k greater than zero because there were no further Markov connections.

Overall, the testing philosophy behind this part of the program was a mix of white-box testing (to make sure random seeds were selected from all over the file and had the correct length) and black-box testing (to ensure that the seeds looked like real words from the source file).

4.4 Testing Reading and Writing Text Together

To test readText() and writeText() together, a helper testing method named frequencyAnalysis() was constructed. This method loops through an output file and returns a HashMap mapping every character substring to how often it occurs.
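
A minimal sketch of such a helper is shown below. It assumes the output has already been read into a String and that the substring width is passed in as a parameter; the class name and the signature are illustrative, since the report does not give them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a frequencyAnalysis() helper.
final class FrequencyAnalysisSketch {

    // Counts how often each substring of the given width occurs in text.
    // Overlapping occurrences are counted separately.
    static Map<String, Integer> frequencyAnalysis(String text, int width) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + width <= text.length(); i++) {
            counts.merge(text.substring(i, i + width), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(frequencyAnalysis("abab", 2));
    }
}
```

Dividing each count by the total number of substrings gives an observed frequency that can be compared with the proportion expected from the source text.
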
The frequency analysis was useful for testing whether the program was choosing letters with the correct probability. On large outputs (thousands of characters), the share of selections for each character should level off around its expected probability. This provided a thorough test of whether the program was choosing future values correctly.

4.5 Interesting Results

Interesting text files and their sources are collated in the project submission.