WordWhacker V0.4

                       Published  Actual
Birmingham New Street  07:10      07:10
London Euston          08:30      08:32

Public sector workers are striking today over pensions. I expected the trains to be a disaster, but we left New Street on time and got into Euston only slightly late, so the day hasn’t started too badly at least.

Anyway, as previously covered, I have been trying to prove that there are some words and phrases that are equivalent to others in some way.

v0.4

Back from holiday and refreshed, I realised that to get better results I’d also need to look at combinations of words. For the results to make sense, though, I’d need to take account of the types of word going into the phrase: for instance, two adjectives together do not make sense, whereas an adjective followed by a noun does.
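To illustrate the idea, here is a minimal sketch of pairing an adjective with a noun so that their values add up to a target. The class name `PairSketch`, the tiny word lists (borrowed from the results further down) and the scoring by summed character codes are illustrative assumptions, not the application code itself. Grouping the nouns by value first means each adjective needs only one lookup. It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: find adjective-noun pairs whose values sum to a target.
class PairSketch {

    // Value of a word: the sum of its character codes (assumed scoring scheme).
    static int value(String word) {
        int total = 0;
        for (char c : word.toCharArray()) {
            total += c;
        }
        return total;
    }

    // Group words by value so that a needed remainder can be looked up directly.
    static Map<Integer, List<String>> indexByValue(List<String> words) {
        Map<Integer, List<String>> index = new HashMap<>();
        for (String word : words) {
            index.computeIfAbsent(value(word), k -> new ArrayList<>()).add(word);
        }
        return index;
    }

    // Every "adjective noun" phrase whose combined value equals the target, sorted.
    static TreeSet<String> matchingPairs(List<String> adjectives, List<String> nouns, int target) {
        Map<Integer, List<String>> nounsByValue = indexByValue(nouns);
        TreeSet<String> pairs = new TreeSet<>();
        for (String adjective : adjectives) {
            int remainder = target - value(adjective);
            for (String noun : nounsByValue.getOrDefault(remainder, Collections.<String>emptyList())) {
                pairs.add(adjective + " " + noun);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> adjectives = Arrays.asList("Lovely", "Downy", "Fit");
        List<String> nouns = Arrays.asList("Wax", "Nest", "Snoopy", "Orchestra");
        // prints [Downy Nest, Fit Snoopy, Lovely Wax]
        System.out.println(matchingPairs(adjectives, nouns, value("Happiness")));
    }
}
```

Note that the space between the two words is not counted here; only the two word values are added.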

Some more searching brought me to WordNet, which has word lists broken down into four files, .adj, .adv, .noun and .verb, which unsurprisingly contain lists of adjectives, adverbs, nouns and verbs respectively.
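As a rough sketch of how such a list can be reduced to plain words: take the first space-separated token of each line, keep it only if it starts with a letter, and optionally drop entries joined with underscores. The sample lines below are made up for illustration and are not copied from WordNet; `WordListSketch` and `extractWords` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: pull the usable words out of the lines of a word list file.
class WordListSketch {

    static List<String> extractWords(List<String> lines, boolean includePhrases) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            // The word is assumed to be the first space-separated token on the line.
            String first = line.split(" ")[0];
            if (!first.matches("^[a-zA-Z].*")) {
                continue; // header text, blank lines, entries starting with a digit
            }
            if (!includePhrases && first.contains("_")) {
                continue; // multi-word phrase such as far_and_away
            }
            words.add(first);
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up sample lines; the trailing fields are placeholders.
        List<String> lines = Arrays.asList(
                "  some header text",
                "far_and_away r 1",
                "happily r 2",
                "1st a 1",
                "joyful a 3");
        System.out.println(extractWords(lines, false)); // prints [happily, joyful]
        System.out.println(extractWords(lines, true));  // prints [far_and_away, happily, joyful]
    }
}
```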

I updated the application some more, changing its control flow slightly to cope with the word lists and adding some protection so as not to process every possible combination of every word pair that exists, as this could lead to huge files and massive computation times. The flow of control was updated to the following:

  1. Set the charset to use
  2. Set the text case, could be:
    1. Camel - i.e. capitalise the first letter of each word
    2. Lower - i.e. all words are processed as being lower case
    3. Both - creates word lists containing both lower and camel case
  3. Set whether phrases should be included - the source files contain phrases like a_good_deal, far_and_away or to_the_contrary; these can be excluded
  4. Set a list of words that should be included in the list of word pairs - for instance, to target happiness and joy, “happiness, joy” would be entered
  5. Generate the results
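As a small illustration of step 2, here is a sketch of how the text-case option might expand a single word, assuming "Camel" means an upper-case first letter followed by lower case. `CaseSketch` and `variants` are hypothetical names. The case matters because, when character codes are summed, an upper-case letter is worth 32 less than its lower-case form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: the case variants of a word that would be added to a word list.
class CaseSketch {

    enum TextCase { CAMEL, LOWER, BOTH }

    static List<String> variants(String word, TextCase textCase) {
        List<String> result = new ArrayList<>();
        String lower = word.toLowerCase(Locale.ROOT);
        if (textCase == TextCase.LOWER || textCase == TextCase.BOTH) {
            result.add(lower);
        }
        if ((textCase == TextCase.CAMEL || textCase == TextCase.BOTH) && !lower.isEmpty()) {
            // Upper-case the first letter only, e.g. happy -> Happy
            result.add(Character.toUpperCase(lower.charAt(0)) + lower.substring(1));
        }
        return result;
    }

    // Sum of character codes, used here only to show the effect of case on the value.
    static int value(String word) {
        int total = 0;
        for (char c : word.toCharArray()) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        for (String variant : variants("hAPPY", TextCase.BOTH)) {
            System.out.println(variant + " = " + value(variant));
        }
    }
}
```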

Here is the Java code to achieve the above:

package words;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.FilenameFilter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.commons.csv.CSVPrinter;

/**
 * Class to generate numerical values for words and compare equivalence to other words.
 *
 * @author a
 */
public class WordWhackerV04 {

    public enum Charset {
        ASCII, UNICODE, POSITIONAL
    }

    public enum TextCase {
        CAMEL, LOWER, BOTH
    }

    public static Map<Character,Integer> letters = new HashMap<Character,Integer>();
    public static Map<Character,Integer> asciiletters = new HashMap<Character,Integer>();
    static {
        asciiletters.put('A', 65);asciiletters.put('B', 66);asciiletters.put('C', 67);
        asciiletters.put('D', 68);asciiletters.put('E', 69);asciiletters.put('F', 70);
        asciiletters.put('G', 71);asciiletters.put('H', 72);asciiletters.put('I', 73);
        asciiletters.put('J', 74);asciiletters.put('K', 75);asciiletters.put('L', 76);
        asciiletters.put('M', 77);asciiletters.put('N', 78);asciiletters.put('O', 79);
        asciiletters.put('P', 80);asciiletters.put('Q', 81);asciiletters.put('R', 82);
        asciiletters.put('S', 83);asciiletters.put('T', 84);asciiletters.put('U', 85);
        asciiletters.put('V', 86);asciiletters.put('W', 87);asciiletters.put('X', 88);
        asciiletters.put('Y', 89);asciiletters.put('Z', 90);

        asciiletters.put('a', 97); asciiletters.put('b', 98); asciiletters.put('c', 99);
        asciiletters.put('d', 100);asciiletters.put('e', 101);asciiletters.put('f', 102);
        asciiletters.put('g', 103);asciiletters.put('h', 104);asciiletters.put('i', 105);
        asciiletters.put('j', 106);asciiletters.put('k', 107);asciiletters.put('l', 108);
        asciiletters.put('m', 109);asciiletters.put('n', 110);asciiletters.put('o', 111);
        asciiletters.put('p', 112);asciiletters.put('q', 113);asciiletters.put('r', 114);
        asciiletters.put('s', 115);asciiletters.put('t', 116);asciiletters.put('u', 117);
        asciiletters.put('v', 118);asciiletters.put('w', 119);asciiletters.put('x', 120);
        //ASCII space is 32 decimal (0x20 hex)
        asciiletters.put('y', 121);asciiletters.put('z', 122);asciiletters.put(' ', 32);

        letters.put('a', 1); letters.put('b', 2); letters.put('c', 3);
        letters.put('d', 4); letters.put('e', 5); letters.put('f', 6);
        letters.put('g', 7); letters.put('h', 8); letters.put('i', 9);
        letters.put('j', 10);letters.put('k', 11);letters.put('l', 12);
        letters.put('m', 13);letters.put('n', 14);letters.put('o', 15);
        letters.put('p', 16);letters.put('q', 17);letters.put('r', 18);
        letters.put('s', 19);letters.put('t', 20);letters.put('u', 21);
        letters.put('v', 22);letters.put('w', 23);letters.put('x', 24);
        letters.put('y', 25);letters.put('z', 26);letters.put(' ', 0);
    }

    public Map<String,Integer> adjs = new HashMap<String,Integer>();
    public Map<String,Integer> advs = new HashMap<String,Integer>();
    public Map<String,Integer> nouns = new HashMap<String,Integer>();
    public Map<String,Integer> verbs = new HashMap<String,Integer>();

    BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
    public Charset useCharset = Charset.ASCII;
    private TextCase useTextCase = TextCase.BOTH;
    private boolean includePhrases = false;
    private static final int MAX_RESULTS = 1024;//Max rows in Excel is 65536
    private int resultCount = 0;

    /**
     * @param args a {@link java.lang.String}[] of program arguments
     */
    public static void main(String[] args) {
        WordWhackerV04 whacker = new WordWhackerV04();
        whacker.driveApp();
    }

    /**
     * Utility method to control the flow of the application
     */
    private void driveApp() {
        String input = "";
        try {
            this.changeCharset();
            this.changeTextCase();
            this.changeIncludePhrases();
            this.createWordlists();

            //a null input means stdin has been closed, so stop rather than loop forever
            while(input != null && !"0".equals(input)) {
                System.out.println("1. Generate equivalences");
                System.out.println("0. Exit");

                input = stdin.readLine();

                if("1".equalsIgnoreCase(input)) {
                    this.generateEquivalence();
                }
            }
            System.exit(0);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to set the charset for seeding
     */
    private void changeCharset() {
        String input = "";
        try {
            System.out.println("Please enter the ID of the charset to use [ASCII]");
            System.out.println("1. ASCII");
            System.out.println("2. Unicode");
            System.out.println("3. Positional");

            input = stdin.readLine();

            if("1".equalsIgnoreCase(input)) {
                this.useCharset = Charset.ASCII;
            } else if("2".equalsIgnoreCase(input)) {
                this.useCharset = Charset.UNICODE;
            } else if("3".equalsIgnoreCase(input)) {
                this.useCharset = Charset.POSITIONAL;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to set whether phrases should be included
     */
    private void changeIncludePhrases() {
        String input = "";
        try {
            System.out.println("Please set whether to include phrases [No]");
            System.out.println("1. Yes");
            System.out.println("2. No");

            input = stdin.readLine();

            if("1".equalsIgnoreCase(input)) {
                this.includePhrases = true;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to choose which case of characters to use.
     */
    private void changeTextCase() {
        String input = "";
        try {
            System.out.println("Please set the text case to process to use [BOTH]");
            System.out.println("1. Camel");
            System.out.println("2. Lower");
            System.out.println("3. Both");

            input = stdin.readLine();

            if("1".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.CAMEL;
            } else if("2".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.LOWER;
            } else if("3".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.BOTH;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Reads every word list file in the dict directory into the word maps
     */
    private void createWordlists() {
        //build the path from its parts so it works on any operating system
        File dictDir = new File(System.getProperty("user.dir"), "dict");
        if(dictDir.isDirectory()) {
            File[] files = dictDir.listFiles(new FilenameFilter() {
                @Override
                public boolean accept(File dir, String name) {
                    return !name.endsWith(".csv") && !name.endsWith(".txt");
                }
            });
            if(files != null) {
                for(File file : files) {
                    createStrippedList(file);
                }
            }
        }
    }

    /**
     * Method to generate the equivalent words and phrases
     */
    private void generateEquivalence() {
        String input = "";
        try {
            System.out.println("Enter specific words that should appear separated by , (comma): ");
            input = stdin.readLine();

            List<String> specificWords = null;
            if(input != null && input.length() > 0) {
                String[] inSpecWords = input.split(",");
                specificWords = new ArrayList<String>(inSpecWords.length*2);
                for(String specWord : inSpecWords) {
                    specWord = specWord.trim(); //allow input such as "happiness, joy"
                    if(specWord.length() == 0) {
                        continue;
                    }
                    if(useTextCase == TextCase.LOWER || useTextCase == TextCase.BOTH) {
                        specificWords.add(specWord.toLowerCase());
                    }
                    if(useTextCase == TextCase.CAMEL || useTextCase == TextCase.BOTH) {
                        specificWords.add(String.format("%s%s",
                                                        Character.toUpperCase(
                                                                specWord.charAt(0)),
                                                        specWord.substring(1)
                                                                .toLowerCase()));
                    }
                }
            }

            System.out.println("Please enter the text to generate equivalence for: ");
            input = stdin.readLine();

            int stringValue = getWordValue(input);
            File outFile = new File(input+".csv");
            FileWriter writer = new FileWriter(outFile);
            final CSVPrinter printer = new CSVPrinter(writer);
            this.resultCount = 0; //each run gets its own allowance of MAX_RESULTS rows
            try {
                this.saveMatchingWords(printer, stringValue);
                this.saveMatchingAdjNoun(printer, specificWords, stringValue);
                this.saveMatchingVerbAdv(printer, specificWords, stringValue);
            } catch(MaxResultsReachedException mrre) {
                System.out.println(mrre.getMessage());
            } finally {
                writer.close(); //make sure everything written reaches the file
            }
            System.out.println(outFile.getName()+" file created");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method for saving the matching words to a given value of source word
     *
     * @param printer pointer to the CSV file to write to
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingWords(CSVPrinter printer, int stringValue)
                                                       throws MaxResultsReachedException {
        this.appendRowData(printer,getKeysByValue(adjs,stringValue),"adj",stringValue);
        this.appendRowData(printer,getKeysByValue(advs,stringValue),"adv",stringValue);
        this.appendRowData(printer,getKeysByValue(nouns,stringValue),"noun",stringValue);
        this.appendRowData(printer,getKeysByValue(verbs,stringValue),"verb",stringValue);
    }

    /**
     * Method to generate Adjective-Noun pairs which match the value of the
     * source word, all results will include one of the provided specificWords
     * or all possible matches if this is empty
     *
     * @param printer pointer to the CSV file to write to
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingAdjNoun(CSVPrinter printer, List<String> specificWords,
                                      int stringValue) throws MaxResultsReachedException {
        this.saveMatchingPair(printer,adjs,nouns,specificWords,stringValue,"adj-noun");
    }

    /**
     * Method to generate Verb-Adverb pairs which match the value of the
     * source word, all results will include one of the provided specificWords
     * or all possible matches if this is empty
     *
     * @param printer pointer to the CSV file to write to
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingVerbAdv(CSVPrinter printer, List<String> specificWords,
                                      int stringValue) throws MaxResultsReachedException {
        this.saveMatchingPair(printer,verbs,advs,specificWords,stringValue,"verb-adv");
    }

    /**
     * Method to save the matching words
     *
     * @param printer pointer to the CSV file to write to
     * @param map1 pointer to the first map of words to use
     * @param map2 pointer to the second map of words to use
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @param type String containing the type of word or phrase
     * @throws MaxResultsReachedException
     */
    private void saveMatchingPair(CSVPrinter printer, Map<String, Integer> map1,
                                  Map<String, Integer> map2, List<String> specificWords,
                                  int stringValue, String type)
                                                       throws MaxResultsReachedException {
        if(specificWords != null && specificWords.size()>0) {
            for(String specificWord : specificWords) {
                if(map1.containsKey(specificWord)) {
                    Map<String, Integer> tmpMap = new HashMap<String, Integer>();
                    tmpMap.put(specificWord, map1.get(specificWord));
                    processWordPairs(printer, stringValue,tmpMap,map2,type);
                }
                if(map2.containsKey(specificWord)) {
                    Map<String, Integer> tmpMap = new HashMap<String, Integer>();
                    tmpMap.put(specificWord, map2.get(specificWord));
                    processWordPairs(printer, stringValue,map1,tmpMap,type);
                }
            }
        } else {
            processWordPairs(printer, stringValue,map1,map2,type);
        }
    }

    /**
     * Method for processing word pairs
     *
     * @param printer pointer to the CSV file to write to
     * @param stringValue value of the source word
     * @param map1 pointer to the first map of words to use
     * @param map2 pointer to the second map of words to use
     * @param type String containing the type of word or phrase
     * @throws MaxResultsReachedException
     */
    private void processWordPairs(CSVPrinter printer, int stringValue,
                                  Map<String, Integer> map1, Map<String, Integer> map2,
                                  String type) throws MaxResultsReachedException {
        Set<Map.Entry<String, Integer>> map1Vals = map1.entrySet();
        for(Map.Entry<String, Integer> entry : map1Vals) {
            if(entry.getValue() < stringValue) { //only process if less than
                int remVal = stringValue - entry.getValue();
                Set<String> map2Vals = getKeysByValue(map2,remVal);
                for(String map2Val : map2Vals) {
                    //the space joining the pair is not counted in the value
                    appendRowData(printer, entry.getKey() + " " + map2Val,
                                  type,entry.getValue()+remVal);
                }
            }
        }
    }

    /**
     * Iterates through a set of matches, writing each as a row in the results csv file
     *
     * @param printer pointer to the CSV file to write to
     * @param col1 set of values to write to the csv file
     * @param col2 the type of word/phrase to write to the csv
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void appendRowData(CSVPrinter printer, Set<String> col1, String col2,
                                      int stringValue) throws MaxResultsReachedException {
        for(String value : col1) {
            appendRowData(printer, value, col2, stringValue);
        }
    }

    /**
     * Writes the result as a row in the output csv file
     *
     * @param printer pointer to the CSV file to write to
     * @param col1 result to write to the csv file
     * @param col2 the type of word/phrase to write to the csv
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void appendRowData(CSVPrinter printer, String col1, String col2,
                                      int stringValue) throws MaxResultsReachedException {
        if(resultCount<MAX_RESULTS) {
            printer.println(new String[]{col1,col2,Integer.toString(stringValue)});
            resultCount++;
        } else {
            throw new MaxResultsReachedException(
                    "The maximum number of results has been reached");
        }
    }

    /**
     * Utility method to get all the matching Keys of a Map by the given value
     *
     * @param <K> the key object
     * @param <V> the value object
     * @param map a Map to search through
     * @param value the value to search for
     *
     * @return a set of keys which match the given value
     */
    private <K, V> Set<K> getKeysByValue(Map<K, V> map, V value) {
         Set<K> keys = new HashSet<K>();
         for (Entry<K, V> entry : map.entrySet()) {
             if (entry.getValue().equals(value)) {
                 keys.add(entry.getKey());
             }
         }
         return keys;
    }

    /**
     * Method to strip the source word list according to the options entered by the user.
     * This will take out phrases and populate the maps with word numeric values.
     *
     * @param file a link to the source word list file.
     */
    private void createStrippedList(File file) {
        Map<String, Integer> mapPointer = this.getWordMap(file);
        if(mapPointer == null) {
            return; //not one of the .adj, .adv, .noun or .verb files
        }
        BufferedReader bufferedStream = null;
        try {
            bufferedStream = new BufferedReader(
                             new InputStreamReader(
                             new FileInputStream(file)));
            String line = "";
            while((line = bufferedStream.readLine()) != null) {
                String word = getWord(line);
                if(word.matches("^[a-zA-Z].*")) {
                    if(!this.includePhrases && word.contains("_")) {
                        continue;
                    }
                    if(useTextCase == TextCase.LOWER || useTextCase == TextCase.BOTH) {
                        word = word.toLowerCase();
                        mapPointer.put(word, this.getWordValue(word));
                    }
                    if(useTextCase == TextCase.CAMEL || useTextCase == TextCase.BOTH) {
                        word = String.format("%s%s",Character.toUpperCase(word.charAt(0)),
                                                    word.substring(1).toLowerCase());
                        mapPointer.put(word, this.getWordValue(word));
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if(bufferedStream != null) {
                try {
                    bufferedStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * Method to get the actual word from a source file, this extracts only
     * the pertinent bit from the source line
     *
     * @param sentence a String sentence to process
     *
     * @return the first word in the sentence.
     */
    private static String getWord(String sentence) {
        String[] items = sentence.split(" ");
        return items[0];
    }

    /**
     * Method to get a Map of Strings to equivalent values based on a given source file
     *
     * @param file a pointer to a source word list file
     *
     * @return a Map of words to numeric values, or null for an unrecognised file type.
     */
    private Map<String, Integer> getWordMap(File file) {
        Map<String, Integer> tmpMap = null;
        if(file.getName().endsWith(".adj")) {
            tmpMap=this.adjs;
        } else if(file.getName().endsWith(".adv")) {
            tmpMap=this.advs;
        } else if(file.getName().endsWith(".noun")) {
            tmpMap=this.nouns;
        } else if(file.getName().endsWith(".verb")) {
            tmpMap=this.verbs;
        }
        return tmpMap;
    }

    /**
     * Method to return the numeric value for a given word
     *
     * @param word a {@link java.lang.String} containing the word
     * @return an int representing the word's numeric value
     */
    private int getWordValue(String word) {
        int returnable = 0;
        char[] chars = word.toCharArray();
        for(char theChar : chars) {
            Integer charValue = null;
            switch(useCharset) {
                case ASCII:
                    charValue = asciiletters.get(theChar);
                break;
                case UNICODE:
                    //note: getNumericValue maps A-Z and a-z to 10-35 and gives
                    //a negative value for characters with no numeric value;
                    //it does not return the Unicode code point
                    charValue = Character.getNumericValue(theChar);
                break;
                case POSITIONAL:
                    charValue = letters.get(Character.toLowerCase(theChar));
                break;
                default:
                break;
            }
            if(charValue != null) {
                returnable = returnable + charValue;
            }
        }
        return returnable;
    }

    /**
     * Exception defined as a nested class
     */
    private static class MaxResultsReachedException extends Exception {
        private static final long serialVersionUID = 1L;

        public MaxResultsReachedException(String message) {
            super(message);
        }
    }
}

The flow through the application prompts the user to enter their choices and then generates the output file. In the example below we can see that more than 1024 possibilities (the value of MAX_RESULTS) would have been generated:

Please enter the ID of the charset to use [ASCII]
1. ASCII
2. Unicode
3. Positional
1
Please set the text case to process to use [BOTH]
1. Camel
2. Lower
3. Both
1
Please set whether to include phrases [No]
1. Yes
2. No
2
1. Generate equivalences
0. Exit
1
Enter specific words that should appear separated by , (comma): 

Please enter the text to generate equivalence for: 
Happiness
The maximum number of results has been reached
Happiness.csv file created
1. Generate equivalences
0. Exit
0

Generating on Happiness gives many options for output. Most are taken up by single matching words, but the result set does cover some pairs. Here is an example of the output CSV file:

Tenacious,adj,939
Unlimited,adj,939
Wished-for,adj,939
Gainfully,adv,939
Excitedly,adv,939
Certainly,adv,939
Orchestra,noun,939
Whitetail,noun,939
Foresight,noun,939
Implement,verb,939
Orientate,verb,939
Recapture,verb,939
Lovely Wax,adj-noun,939
Downy Nest,adj-noun,939
Fit Snoopy,adj-noun,939

During my investigations I have found some very funny combinations, as profanities and negative words were not removed from the dictionaries. So, while I have been able to prove that my friend's company is “Perpetual Happiness” (positional), “Righteous Happiness” (ASCII), and “Phenomenal Happiness” (Unicode), I have also seen some results far from complimentary.
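The three schemes give different totals for the same word, which is why each one produces different matches. Here is a minimal sketch, assuming the ASCII option sums character codes, the Unicode option sums `Character.getNumericValue` (which maps the Latin letters to 10 through 35 rather than to their code points), and the positional option counts a=1 to z=26. `SchemeSketch` is a hypothetical name, and for simplicity only the letters A-Z and a-z are scored.

```java
// Hypothetical sketch: score one word under the three schemes.
class SchemeSketch {

    private static boolean isLatinLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // Sum of ASCII codes, e.g. 'H' = 72 and 'a' = 97.
    static int ascii(String word) {
        int total = 0;
        for (char c : word.toCharArray()) {
            if (isLatinLetter(c)) {
                total += c;
            }
        }
        return total;
    }

    // Sum of Character.getNumericValue, where 'a' and 'A' are 10 and 'z' and 'Z' are 35.
    static int numeric(String word) {
        int total = 0;
        for (char c : word.toCharArray()) {
            if (isLatinLetter(c)) {
                total += Character.getNumericValue(c);
            }
        }
        return total;
    }

    // Sum of alphabet positions, ignoring case: a = 1 through z = 26.
    static int positional(String word) {
        int total = 0;
        for (char c : word.toCharArray()) {
            if (isLatinLetter(c)) {
                total += Character.toLowerCase(c) - 'a' + 1;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String word = "Happiness";
        // prints ASCII=939 Unicode=188 Positional=107
        System.out.println("ASCII=" + ascii(word)
                + " Unicode=" + numeric(word)
                + " Positional=" + positional(word));
    }
}
```

Under these assumptions the Unicode total is simply the positional total plus 9 per letter, so those two schemes agree for words of the same length but can diverge once the number of letters differs.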

At this point I have stopped developing the script as it has achieved what I wanted it to. There are four builds I have in mind for the future, each with a learning aim:

  1. Add the code to a git repo and add to GitHub
  2. Make the processing distributed using Hadoop or similar; this should enable me to create multi-word sentences
  3. Turn this into a webapp using Spring so that I can learn more about this framework
  4. Move the Hadoop version of the webapp into the cloud using Amazon/Cloud Foundry or similar.

Who knows, if I get another 80 minutes free time I might implement them.