WordWhacker V0.4
Station | Published | Actual |
---|---|---|
Birmingham New Street | 07:10 | 07:10 |
London Euston | 08:30 | 08:32 |
Public sector workers are striking today over pensions. I expected the trains to be a disaster, but we left New Street on time and got into Euston only slightly late, so the day hasn’t started too badly at least.
Anyway, as previously covered, I have been trying to prove that there are some words and phrases that are equivalent to others in some way.
v0.4
Back from holiday and refreshed, I realised that to get better results I’d also need to look at combinations of words. For the results to make sense, though, I’d need to take account of the types of word going into the phrase: for instance, two adjectives together do not make sense, whereas an adjective followed by a noun does.
Some more searching brought me to WordNet, which has word lists broken down into four files (.adj, .adv, .noun and .verb) which, unsurprisingly, contain lists of adjectives, adverbs, nouns and verbs respectively.
I updated the application some more, changing its control flow slightly to cope with the word lists and adding some protection so as not to process every possible combination of every word pair that exists, which could lead to huge files and massive computation times. The flow of control was updated to the following:
- Set the charset to use
- Set the text case, which could be:
  - Camel - i.e. capitalise the first letter of each word
  - Lower - i.e. all words are processed as lower case
  - Both - creates word lists containing both lower and camel case
- Set whether phrases should be included - the source files contain phrases like a_good_deal, far_and_away or to_the_contrary; these can be excluded
- Set a list of words that should be included in the list of word pairs - for instance, to target happiness and joy, “happiness, joy” would be entered
- Generate the results
Here is the Java code to achieve the above:
package words;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.FilenameFilter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.commons.csv.CSVPrinter;

/**
 * Class to generate numerical values for words and compare equivalence to other words.
 *
 * @author a
 */
public class WordWhackerV04 {

    public enum Charset {
        ASCII, UNICODE, POSITIONAL
    }

    public enum TextCase {
        CAMEL, LOWER, BOTH
    }

    public static Map<Character,Integer> letters = new HashMap<Character,Integer>();
    public static Map<Character,Integer> asciiletters = new HashMap<Character,Integer>();
    static {
        asciiletters.put('A', 65);asciiletters.put('B', 66);asciiletters.put('C', 67);
        asciiletters.put('D', 68);asciiletters.put('E', 69);asciiletters.put('F', 70);
        asciiletters.put('G', 71);asciiletters.put('H', 72);asciiletters.put('I', 73);
        asciiletters.put('J', 74);asciiletters.put('K', 75);asciiletters.put('L', 76);
        asciiletters.put('M', 77);asciiletters.put('N', 78);asciiletters.put('O', 79);
        asciiletters.put('P', 80);asciiletters.put('Q', 81);asciiletters.put('R', 82);
        asciiletters.put('S', 83);asciiletters.put('T', 84);asciiletters.put('U', 85);
        asciiletters.put('V', 86);asciiletters.put('W', 87);asciiletters.put('X', 88);
        asciiletters.put('Y', 89);asciiletters.put('Z', 90);

        asciiletters.put('a', 97); asciiletters.put('b', 98); asciiletters.put('c', 99);
        asciiletters.put('d', 100);asciiletters.put('e', 101);asciiletters.put('f', 102);
        asciiletters.put('g', 103);asciiletters.put('h', 104);asciiletters.put('i', 105);
        asciiletters.put('j', 106);asciiletters.put('k', 107);asciiletters.put('l', 108);
        asciiletters.put('m', 109);asciiletters.put('n', 110);asciiletters.put('o', 111);
        asciiletters.put('p', 112);asciiletters.put('q', 113);asciiletters.put('r', 114);
        asciiletters.put('s', 115);asciiletters.put('t', 116);asciiletters.put('u', 117);
        asciiletters.put('v', 118);asciiletters.put('w', 119);asciiletters.put('x', 120);
        // The ASCII space is decimal 32 (hex 0x20); the listing previously used 20
        asciiletters.put('y', 121);asciiletters.put('z', 122);asciiletters.put(' ', 32);

        letters.put('a', 1); letters.put('b', 2); letters.put('c', 3);
        letters.put('d', 4); letters.put('e', 5); letters.put('f', 6);
        letters.put('g', 7); letters.put('h', 8); letters.put('i', 9);
        letters.put('j', 10);letters.put('k', 11);letters.put('l', 12);
        letters.put('m', 13);letters.put('n', 14);letters.put('o', 15);
        letters.put('p', 16);letters.put('q', 17);letters.put('r', 18);
        letters.put('s', 19);letters.put('t', 20);letters.put('u', 21);
        letters.put('v', 22);letters.put('w', 23);letters.put('x', 24);
        letters.put('y', 25);letters.put('z', 26);letters.put(' ', 0);
    }

    public Map<String,Integer> adjs = new HashMap<String,Integer>();
    public Map<String,Integer> advs = new HashMap<String,Integer>();
    public Map<String,Integer> nouns = new HashMap<String,Integer>();
    public Map<String,Integer> verbs = new HashMap<String,Integer>();

    BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
    public Charset useCharset = Charset.ASCII;
    private TextCase useTextCase = TextCase.BOTH;
    private boolean includePhrases = false;
    private static final int MAX_RESULTS = 1024;//Max rows in Excel is 65536
    private int resultCount = 0;

    /**
     * @param args a {@link java.lang.String}[] of program arguments
     */
    public static void main(String[] args) {
        WordWhackerV04 whacker = new WordWhackerV04();
        whacker.driveApp();
    }

    /**
     * Utility method to control the flow of the application
     */
    private void driveApp() {
        String input = "";
        try {
            this.changeCharset();
            this.changeTextCase();
            this.changeIncludePhrases();
            this.createWordlists();

            while(!"0".equals(input)) {
                System.out.println("1. Generate equivalences");
                System.out.println("0. Exit");

                input = stdin.readLine();
                if(input == null) {
                    break; // end of input: stop rather than loop forever
                }

                if("1".equalsIgnoreCase(input)) {
                    this.generateEquivalence();
                }
            }
            System.exit(0);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to set the charset for seeding
     */
    private void changeCharset() {
        String input = "";
        try {
            System.out.println("Please enter the ID of the charset to use [ASCII]");
            System.out.println("1. ASCII");
            System.out.println("2. Unicode");
            System.out.println("3. Positional");

            input = stdin.readLine();

            if("1".equalsIgnoreCase(input)) {
                this.useCharset = Charset.ASCII;
            } else if("2".equalsIgnoreCase(input)) {
                this.useCharset = Charset.UNICODE;
            } else if("3".equalsIgnoreCase(input)) {
                this.useCharset = Charset.POSITIONAL;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to set whether phrases should be included
     */
    private void changeIncludePhrases() {
        String input = "";
        try {
            System.out.println("Please set whether to include phrases [No]");
            System.out.println("1. Yes");
            System.out.println("2. No");

            input = stdin.readLine();

            if("1".equalsIgnoreCase(input)) {
                this.includePhrases = true;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method to choose which case of characters to use.
     */
    private void changeTextCase() {
        String input = "";
        try {
            System.out.println("Please set the text case to process to use [BOTH]");
            System.out.println("1. Camel");
            System.out.println("2. Lower");
            System.out.println("3. Both");

            input = stdin.readLine();

            if("1".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.CAMEL;
            } else if("2".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.LOWER;
            } else if("3".equalsIgnoreCase(input)) {
                this.useTextCase = TextCase.BOTH;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Reads every word list found in the dict directory and stores it in the matching Map
     */
    private void createWordlists() {
        // Two-argument File constructor avoids hard-coding the Windows path separator
        File dictDir = new File(System.getProperty("user.dir"), "dict");
        if(dictDir.isDirectory()) {
            File[] files = dictDir.listFiles(new FilenameFilter() {
                @Override
                public boolean accept(File dir, String name) {
                    // Skip generated results (.csv) and notes (.txt)
                    return !name.endsWith(".csv") && !name.endsWith(".txt");
                }
            });
            if(files != null) {
                for(File file : files) {
                    createStrippedList(file);
                }
            }
        }
    }

    /**
     * Method to generate the equivalent words and phrases
     */
    private void generateEquivalence() {
        String input = "";
        // Reset the counter so a second run is not cut short by the previous run's total
        this.resultCount = 0;
        try {
            System.out.println("Enter specific words that should appear separated by , (comma): ");
            input = stdin.readLine();

            List<String> specificWords = null;
            if(input != null && input.trim().length() > 0) {
                String[] inSpecWords = input.split(",");
                specificWords = new ArrayList<String>(inSpecWords.length*2);
                for(String rawWord : inSpecWords) {
                    // Trim so that "happiness, joy" yields "joy" rather than " joy"
                    String specWord = rawWord.trim();
                    if(specWord.length() == 0) {
                        continue;
                    }
                    if(useTextCase == TextCase.LOWER || useTextCase == TextCase.BOTH) {
                        specificWords.add(specWord.toLowerCase());
                    }
                    if(useTextCase == TextCase.CAMEL || useTextCase == TextCase.BOTH) {
                        specificWords.add(String.format("%s%s",
                                Character.toUpperCase(specWord.charAt(0)),
                                specWord.substring(1).toLowerCase()));
                    }
                }
            }

            System.out.println("Please enter the text to generate equivalence for: ");
            input = stdin.readLine();
            if(input == null) {
                return; // no more input to read
            }

            int stringValue = getWordValue(input);
            File outFile = new File(input+".csv");
            // Keep a handle on the writer so the file can be closed once the rows are written
            FileWriter writer = new FileWriter(outFile);
            final CSVPrinter printer = new CSVPrinter(writer);
            try {
                this.saveMatchingWords(printer, stringValue);
                this.saveMatchingAdjNoun(printer, specificWords, stringValue);
                this.saveMatchingVerbAdv(printer, specificWords, stringValue);
            } catch(MaxResultsReachedException mrre) {
                System.out.println(mrre.getMessage());
            } finally {
                writer.close();
            }
            System.out.println(outFile.getName()+" file created");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Method for saving the matching words to a given value of source word
     *
     * @param printer pointer to the CSV file to write to
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingWords(CSVPrinter printer, int stringValue)
            throws MaxResultsReachedException {
        this.appendRowData(printer,getKeysByValue(adjs,stringValue),"adj",stringValue);
        this.appendRowData(printer,getKeysByValue(advs,stringValue),"adv",stringValue);
        this.appendRowData(printer,getKeysByValue(nouns,stringValue),"noun",stringValue);
        this.appendRowData(printer,getKeysByValue(verbs,stringValue),"verb",stringValue);
    }

    /**
     * Method to generate Adjective-Noun pairs which match the value of the
     * source word, all results will include one of the provided specificWords
     * or all possible matches if this is empty
     *
     * @param printer pointer to the CSV file to write to
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingAdjNoun(CSVPrinter printer, List<String> specificWords,
            int stringValue) throws MaxResultsReachedException {
        this.saveMatchingPair(printer,adjs,nouns,specificWords,stringValue,"adj-noun");
    }

    /**
     * Method to generate Verb-Adverb pairs which match the value of the
     * source word, all results will include one of the provided specificWords
     * or all possible matches if this is empty
     *
     * @param printer pointer to the CSV file to write to
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void saveMatchingVerbAdv(CSVPrinter printer, List<String> specificWords,
            int stringValue) throws MaxResultsReachedException {
        this.saveMatchingPair(printer,verbs,advs,specificWords,stringValue,"verb-adv");
    }

    /**
     * Method to save the matching words
     *
     * @param printer pointer to the CSV file to write to
     * @param map1 pointer to the first map of words to use
     * @param map2 pointer to the second map of words to use
     * @param specificWords a list of words that should appear in the results
     * @param stringValue value of the source word
     * @param type String containing the type of word or phrase
     * @throws MaxResultsReachedException
     */
    private void saveMatchingPair(CSVPrinter printer, Map<String, Integer> map1,
            Map<String, Integer> map2, List<String> specificWords,
            int stringValue, String type)
            throws MaxResultsReachedException {
        if(specificWords != null && specificWords.size()>0) {
            for(String specificWord : specificWords) {
                if(map1.containsKey(specificWord)) {
                    Map<String, Integer> tmpMap = new HashMap<String, Integer>();
                    tmpMap.put(specificWord, map1.get(specificWord));
                    processWordPairs(printer, stringValue,tmpMap,map2,type);
                }
                if(map2.containsKey(specificWord)) {
                    Map<String, Integer> tmpMap = new HashMap<String, Integer>();
                    tmpMap.put(specificWord, map2.get(specificWord));
                    processWordPairs(printer, stringValue,map1,tmpMap,type);
                }
            }
        } else {
            processWordPairs(printer, stringValue,map1,map2,type);
        }
    }

    /**
     * Method for processing word pairs
     *
     * @param printer pointer to the CSV file to write to
     * @param stringValue value of the source word
     * @param map1 pointer to the first map of words to use
     * @param map2 pointer to the second map of words to use
     * @param type String containing the type of word or phrase
     * @throws MaxResultsReachedException
     */
    private void processWordPairs(CSVPrinter printer, int stringValue,
            Map<String, Integer> map1, Map<String, Integer> map2,
            String type) throws MaxResultsReachedException {
        Set<Map.Entry<String, Integer>> map1Vals = map1.entrySet();
        for(Map.Entry<String, Integer> entry : map1Vals) {
            if(entry.getValue() < stringValue) { //only process if less than
                // The second word must make up exactly the remaining value
                int remVal = stringValue - entry.getValue();
                Set<String> map2Vals = getKeysByValue(map2,remVal);
                for(String map2Val : map2Vals) {
                    appendRowData(printer, entry.getKey() + " " + map2Val,
                            type,entry.getValue()+remVal);
                }
            }
        }
    }

    /**
     * Iterates through a set of matches, writing each as a row in the results csv file
     *
     * @param printer pointer to the CSV file to write to
     * @param col1 set of values to write to the csv file
     * @param col2 the type of word/phrase to write to the csv
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void appendRowData(CSVPrinter printer, Set<String> col1, String col2,
            int stringValue) throws MaxResultsReachedException {
        for(String value : col1) {
            appendRowData(printer, value, col2, stringValue);
        }
    }

    /**
     * Writes the result as a row in the output csv file
     *
     * @param printer pointer to the CSV file to write to
     * @param col1 result to write to the csv file
     * @param col2 the type of word/phrase to write to the csv
     * @param stringValue value of the source word
     * @throws MaxResultsReachedException
     */
    private void appendRowData(CSVPrinter printer, String col1, String col2,
            int stringValue) throws MaxResultsReachedException {
        if(resultCount<MAX_RESULTS) {
            printer.println(new String[]{col1,col2,Integer.toString(stringValue)});
            resultCount++;
        } else {
            throw new MaxResultsReachedException("The maximum number of results has been reached");
        }
    }

    /**
     * Utility method to get all the matching Keys of a Map by the given value
     *
     * @param <K> the key object
     * @param <V> the value object
     * @param map a Map to search through
     * @param value the value to search for
     *
     * @return a set of keys which match the given value
     */
    private <K, V> Set<K> getKeysByValue(Map<K, V> map, V value) {
        Set<K> keys = new HashSet<K>();
        for (Entry<K, V> entry : map.entrySet()) {
            if (entry.getValue().equals(value)) {
                keys.add(entry.getKey());
            }
        }
        return keys;
    }

    /**
     * Method to strip the source word list according to the options entered by the user.
     * This will take out phrases and populate the maps with word numeric values.
     *
     * @param file a link to the source word list file.
     */
    private void createStrippedList(File file) {
        BufferedReader bufferedStream = null;
        try {
            Map<String, Integer> mapPointer = this.getWordMap(file);
            if(mapPointer == null) {
                // Not one of the .adj/.adv/.noun/.verb files, so there is nothing to load
                return;
            }
            bufferedStream = new BufferedReader(
                    new InputStreamReader(
                            new FileInputStream(file)));
            String line = "";
            while((line = bufferedStream.readLine()) != null) {
                String word = getWord(line);
                // Only keep entries that start with a letter
                if(word.matches("^[a-zA-Z].*")) {
                    if(!this.includePhrases && word.contains("_")) {
                        continue;
                    }
                    if(useTextCase == TextCase.LOWER || useTextCase == TextCase.BOTH) {
                        word = word.toLowerCase();
                        mapPointer.put(word, this.getWordValue(word));
                    }
                    if(useTextCase == TextCase.CAMEL || useTextCase == TextCase.BOTH) {
                        word = String.format("%s%s",Character.toUpperCase(word.charAt(0)),
                                word.substring(1).toLowerCase());
                        mapPointer.put(word, this.getWordValue(word));
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if(bufferedStream != null) {
                try {
                    bufferedStream.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /**
     * Method to get the actual word from a source file, this extracts only
     * the pertinent bit from the source line
     *
     * @param sentence a String sentence to process
     *
     * @return the first word in the sentence.
     */
    private static String getWord(String sentence) {
        String[] items = sentence.split(" ");
        return items[0];
    }

    /**
     * Method to get a Map of Strings to equivalent values based on a given source file
     *
     * @param file a pointer to a source word list file
     *
     * @return a Map of words to numeric values, or null if the file type is not recognised.
     */
    private Map<String, Integer> getWordMap(File file) {
        Map<String, Integer> tmpMap = null;
        if(file.getName().endsWith(".adj")) {
            tmpMap=this.adjs;
        } else if(file.getName().endsWith(".adv")) {
            tmpMap=this.advs;
        } else if(file.getName().endsWith(".noun")) {
            tmpMap=this.nouns;
        } else if(file.getName().endsWith(".verb")) {
            tmpMap=this.verbs;
        }
        return tmpMap;
    }

    /**
     * Method to return the numeric value for a given word
     *
     * @param word a {@link java.lang.String} containing the word
     * @return an int representing the word's numeric value
     */
    private int getWordValue(String word) {
        int returnable = 0;
        char[] chars = word.toCharArray();
        for(char theChar : chars) {
            Integer charValue = null;
            switch(useCharset) {
                case ASCII:
                    // Characters missing from the table (e.g. '-') return null and are skipped
                    charValue = asciiletters.get(theChar);
                    break;
                case UNICODE:
                    // Note: getNumericValue returns 10-35 for the letters a-z in either case
                    // and -1 for characters with no numeric value; it is not the code point
                    charValue = Character.getNumericValue(theChar);
                    break;
                case POSITIONAL:
                    charValue = letters.get(Character.toLowerCase(theChar));
                    break;
                default:
                    break;
            }
            if(charValue != null) {
                returnable = returnable + charValue;
            }
        }
        return returnable;
    }

    /**
     * Exception defined as inner class
     */
    private static class MaxResultsReachedException extends Exception {
        private static final long serialVersionUID = 1L;

        public MaxResultsReachedException(String message) {
            super(message);
        }
    }
}
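As an aside, the pair search in the listing scans the whole second map once for every word in the first map. A minimal sketch of one alternative, assuming the word values are already computed, is to group the second list by value up front so each lookup is a single map access. The class and method names here (`PairIndexSketch`, `indexByValue`, `findPairs`) are hypothetical and not part of WordWhackerV04, and the sketch needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of WordWhackerV04: pair search through a value index. */
class PairIndexSketch {

    /** Groups words by numeric value so that a lookup by value is a single map access. */
    static Map<Integer, List<String>> indexByValue(Map<String, Integer> words) {
        Map<Integer, List<String>> index = new HashMap<>();
        for (Map.Entry<String, Integer> e : words.entrySet()) {
            index.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return index;
    }

    /** Returns "first second" pairs whose values add up to target, capped at maxResults. */
    static List<String> findPairs(Map<String, Integer> first,
                                  Map<Integer, List<String>> secondByValue,
                                  int target, int maxResults) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, Integer> e : first.entrySet()) {
            int remaining = target - e.getValue();
            if (remaining <= 0) {
                continue; // the first word already uses up the whole target
            }
            List<String> seconds = secondByValue.getOrDefault(remaining, Collections.<String>emptyList());
            for (String second : seconds) {
                if (pairs.size() >= maxResults) {
                    return pairs; // same idea as the MAX_RESULTS cap
                }
                pairs.add(e.getKey() + " " + second);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // ASCII values worked out by hand for a few words from the sample output
        Map<String, Integer> adjs = new HashMap<>();
        adjs.put("Lovely", 635);
        adjs.put("Downy", 529);
        Map<String, Integer> nouns = new HashMap<>();
        nouns.put("Wax", 304);
        nouns.put("Nest", 410);
        System.out.println(findPairs(adjs, indexByValue(nouns), 939, 1024));
    }
}
```

The trade-off is a little extra memory for the index in exchange for avoiding a full scan per candidate word; the order of the returned pairs depends on the HashMap iteration order.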
The flow through the application prompts the user to enter their choices and then generates the output file. In the example below the run hits the MAX_RESULTS cap of 1024, so more possibilities existed than were written to the file.
Please enter the ID of the charset to use [ASCII]
1. ASCII
2. Unicode
3. Positional
1
Please set the text case to process to use [BOTH]
1. Camel
2. Lower
3. Both
1
Please set whether to include phrases [No]
1. Yes
2. No
2
1. Generate equivalences
0. Exit
1
Enter specific words that should appear separated by , (comma):

Please enter the text to generate equivalence for:
Happiness
The maximum number of results has been reached
Happiness.csv file created
1. Generate equivalences
0. Exit
0
Generating on Happiness gives many options for output. Most are taken up by single matching words, but the result set does cover some pairs. Here is an example of the output CSV file:
Tenacious,adj,939
Unlimited,adj,939
Wished-for,adj,939
Gainfully,adv,939
Excitedly,adv,939
Certainly,adv,939
Orchestra,noun,939
Whitetail,noun,939
Foresight,noun,939
Implement,verb,939
Orientate,verb,939
Recapture,verb,939
Lovely Wax,adj-noun,939
Downy Nest,adj-noun,939
Fit Snoopy,adj-noun,939
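The rows above can be checked by hand. The following is a small sketch, assuming the ASCII option, that sums the character codes of the letters only. `AsciiValueCheck` and `asciiLetterValue` are hypothetical names used for illustration. Spaces and hyphens are skipped here because the lookup table in the listing has no entry for '-', and the pair rows are written as the sum of the two word values without the space.

```java
/** Hypothetical check, not part of WordWhackerV04: recomputes the sample rows. */
class AsciiValueCheck {

    /** Sums the ASCII codes of the letters A-Z and a-z, ignoring every other character. */
    static int asciiLetterValue(String text) {
        int total = 0;
        for (char c : text.toCharArray()) {
            boolean isLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            if (isLetter) {
                total += c; // a char widens to its code, e.g. 'H' is 72
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] samples = {"Happiness", "Tenacious", "Wished-for", "Lovely Wax", "Fit Snoopy"};
        for (String sample : samples) {
            // Each of these prints 939
            System.out.println(sample + " = " + asciiLetterValue(sample));
        }
    }
}
```

For example, Happiness is 72 + 97 + 112 + 112 + 105 + 110 + 101 + 115 + 115 = 939, and Lovely (635) plus Wax (304) gives the same total.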
During my investigations I have found some very funny combinations, as profanities and negative words were not removed from the dictionaries. So, while I have been able to prove that my friend’s company is “Perpetual Happiness” (positional), “Righteous Happiness” (ASCII), and “Phenomenal Happiness” (Unicode), I have also seen some results that are far from complimentary.
At this point I have stopped developing the script as it has achieved what I wanted it to. There are four builds I have in mind for the future, all with a learning aim:
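The three options give different totals for the same text, which is why the matches differ. The sketch below compares them for a single word; `SchemeComparison` and its method names are hypothetical. It relies on the documented behaviour of `Character.getNumericValue`, which the Unicode option in the listing calls: it returns 10 to 35 for the Latin letters regardless of case and -1 for characters with no numeric value, so it is not the same as the character's code point.

```java
/** Hypothetical comparison, not part of WordWhackerV04: one word under three schemes. */
class SchemeComparison {

    /** Sum of Character.getNumericValue, as the Unicode option in the listing does. */
    static int numericValueSum(String text) {
        int total = 0;
        for (char c : text.toCharArray()) {
            total += Character.getNumericValue(c); // 'a'/'A' -> 10 ... 'z'/'Z' -> 35
        }
        return total;
    }

    /** Sum of alphabet positions (a=1 ... z=26); anything else counts as 0. */
    static int positionalSum(String text) {
        int total = 0;
        for (char c : text.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                total += c - 'a' + 1;
            }
        }
        return total;
    }

    /** Sum of the raw character codes, which equals the ASCII value for plain letters. */
    static int codePointSum(String text) {
        int total = 0;
        for (char c : text.toCharArray()) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        String word = "Happiness";
        System.out.println("numeric value sum: " + numericValueSum(word)); // 188
        System.out.println("positional sum:    " + positionalSum(word));   // 107
        System.out.println("code point sum:    " + codePointSum(word));    // 939
    }
}
```

For letters only, the numeric-value sum is the positional sum plus 9 per letter (107 + 9 × 9 = 188 here), so the two options only rank words differently when the candidates differ in length.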
- Add the code to a git repo and add to GitHub
- Make the processing distributed using Hadoop or similar; this should enable me to create multiple-word sentences
- Turn this into a webapp using Spring so that I can learn more about this framework
- Move the Hadoop version of the webapp into the cloud using Amazon/Cloud Foundry or similar
Who knows, if I get another 80 minutes of free time I might implement them.