Prog 3: Which Language
Remember how in class we used a short program to count how many times each alphabetic character appeared on an input line? We then modified it to read from Shakespeare's Macbeth instead, counting how many of each letter there are in that document. This led us to wonder how frequency count data like that might be useful in other contexts, such as knowing which letters to guess next in a hangman word-guessing game, or decoding encrypted text. We also wondered whether the letter frequency counts would work as a signature for a particular language, functioning as a sort of bar code for that language. This program explores exactly that question!
Running the Program
The program will be written in four stages within Codio. Once finished with all four stages, running the program will look like the following:
    Program 3: Which Language.
    Select from the following stages of output to display:
       1. Letter frequency counts
       2. Letter frequency order
       3. Get user input and display frequency counts
       4. Get user input, display frequency counts, and display language
       0. Exit the program
    Your choice --> 4
    Letter Frequency Counts:
        Engl  Finn  Fren  Germ  Hung  Ital  Port  Span
    A:  6018  9416  6544  5068  7541  8935  9939 10052
    B:  1464   448  1081  2060  1746  1221  1173  1387
    C:  2144   636  3028  3126  1014  3865  2855  3222
    D:  3331  1013  2698  4592  2265  2945  3687  3815
    E:  9270  7187 12782 14779  8280  9364 10551 10861
    F:  1701   297  1101  1464   932  1099  1039   653
    G:  1333   187   772  2503  2975  1423   983   936
    H:  5244  2384  1004  4775  1687  1485  1413  1263
    I:  4653  8022  5583  7062  2961  8128  4179  3964
    J:    38  1331   538   148  1049     5   216   429
    K:   691  3952    28  1080  3551    51    19    18
    L:  3294  4648  4240  3089  4658  4814  2082  3985
    M:  2481  3526  3269  2861  3812  2790  4415  3021
    N:  4987  7958  5958  9026  4725  5966  4455  5704
    O:  6054  4219  5315  2208  3529  8668  8898  7605
    P:  1000  1331  2297   466   448  2060  1862  1842
    Q:   121    78   907    84    77   548   954   883
    R:  4518  1784  5986  5955  2802  5357  5328  5706
    S:  4943  5425  6898  5852  4538  4863  6774  6468
    T:  7055  7620  6025  5477  5632  5373  3729  3932
    U:  2590  4226  5373  3327   963  3081  3840  3591
    V:   657  1784  1566   661  1297  1287  1351   893
    W:  1926   120    78  1797   138   141    68    61
    X:   112    45   328    89    49    42   260    91
    Y:  1637  1403   247   169  1998   216    65   891
    Z:    15     3   343   939  2742   490   314   301
    Letter frequency order:
    Engl  Finn  Fren  Germ  Hung  Ital  Port  Span
    E     A     E     E     E     E     E     E
    T     I     S     N     A     A     A     A
    O     N     A     I     T     O     O     O
    A     T     T     R     N     I     S     S
    H     E     R     S     L     N     R     R
    N     S     N     T     S     T     N     N
    S     L     I     A     M     R     M     L
    I     U     U     H     K     S     I     I
    R     O     O     D     O     L     U     T
    D     K     L     U     G     C     T     D
    L     M     M     C     I     U     D     U
    U     H     C     L     R     D     C     C
    M     R     D     M     Z     M     L     M
    C     V     P     G     D     P     P     P
    W     Y     V     O     Y     H     H     B
    F     J     F     B     B     G     V     H
    Y     P     B     W     H     V     B     G
    B     D     H     F     V     B     F     V
    G     C     Q     K     J     F     G     Y
    P     B     G     Z     C     Q     Q     Q
    K     F     J     V     U     Z     Z     F
    V     G     Z     P     F     Y     X     J
    Q     W     X     Y     P     W     J     Z
    X     Q     Y     J     W     K     W     X
    J     X     W     X     Q     X     Y     W
    Z     Z     K     Q     X     J     K     K
    Copy and paste a paragraph of text to be analyzed, followed by ^z (PC) or ^d (Mac):
    Ma per arrivare a un agreement bisogna
    essere in due. E dato che il governo intende resistere sui numeri della manovra, è necessario offrire garanzie all’Europa e ai mercati. Perciò sono stati stabiliti due capisaldi: uno tecnico, l’altro più politico. La riduzione strutturale del debito viene fissato come un «obiettivo strategico», non a caso sottolineato da Di Maio dopo il vertice. La linea dell’esecutivo è che per far ripartire l’Italia sia necessario «cambiare approccio» con una manovra espansiva «dopo anni di cure rigoriste senza risultati», ma s
    A:51 B:5 C:20 D:15 E:55 F:4 G:6 H:2 I:54 J:0 K:0 L:22 M:10 N:29 O:39 P:13 Q:0 R:37 S:28 T:31 U:15 V:9 W:0 X:0 Y:0 Z:3
    Letter frequency order:
    Engl  Finn  Fren  Germ  Hung  Ital  Port  Span  User
    E     A     E     E     E     E     E     E     E
    T     I     S     N     A     A     A     A     I
    O     N     A     I     T     O     O     O     A
    A     T     T     R     N     I     S     S     O
    H     E     R     S     L     N     R     R     R
    N     S     N     T     S     T     N     N     T
    S     L     I     A     M     R     M     L     N
    I     U     U     H     K     S     I     I     S
    R     O     O     D     O     L     U     T     L
    D     K     L     U     G     C     T     D     C
    L     M     M     C     I     U     D     U     D
    U     H     C     L     R     D     C     C     U
    M     R     D     M     Z     M     L     M     P
    C     V     P     G     D     P     P     P     M
    W     Y     V     O     Y     H     H     B     V
    F     J     F     B     B     G     V     H     G
    Y     P     B     W     H     V     B     G     B
    B     D     H     F     V     B     F     V     F
    G     C     Q     K     J     F     G     Y     Z
    P     B     G     Z     C     Q     Q     Q     H
    K     F     J     V     U     Z     Z     F     J
    V     G     Z     P     F     Y     X     J     K
    Q     W     X     Y     P     W     J     Z     Q
    X     Q     Y     J     W     K     W     X     W
    J     X     W     X     Q     X     Y     W     X
    Z     Z     K     Q     X     J     K     K     Y
    Language is Italian
Stages
Write your program in the stages shown below. Each stage builds on the previous ones. To get credit within the Codio environment, you must run and pass the tests for each stage before going on to the next.
1. Display the letter frequency counts for the eight language files, writing this code in main.cpp.
Running the program should look like:
    Program 3: Which Language.
    Select from the following stages of output to display:
       1. Letter frequency counts
       2. Letter frequency order
       3. Get user input and display frequency counts
       4. Get user input, display frequency counts, and display language
       0. Exit the program
    Your choice --> 1
    Letter Frequency Counts:
        Engl  Finn  Fren  Germ  Hung  Ital  Port  Span
    A:  6018  9416  6544  5068  7541  8935  9939 10052
    B:  1464   448  1081  2060  1746  1221  1173  1387
    C:  2144   636  3028  3126  1014  3865  2855  3222
    D:  3331  1013  2698  4592  2265  2945  3687  3815
    E:  9270  7187 12782 14779  8280  9364 10551 10861
    F:  1701   297  1101  1464   932  1099  1039   653
    G:  1333   187   772  2503  2975  1423   983   936
    H:  5244  2384  1004  4775  1687  1485  1413  1263
    I:  4653  8022  5583  7062  2961  8128  4179  3964
    J:    38  1331   538   148  1049     5   216   429
    K:   691  3952    28  1080  3551    51    19    18
    L:  3294  4648  4240  3089  4658  4814  2082  3985
    M:  2481  3526  3269  2861  3812  2790  4415  3021
    N:  4987  7958  5958  9026  4725  5966  4455  5704
    O:  6054  4219  5315  2208  3529  8668  8898  7605
    P:  1000  1331  2297   466   448  2060  1862  1842
    Q:   121    78   907    84    77   548   954   883
    R:  4518  1784  5986  5955  2802  5357  5328  5706
    S:  4943  5425  6898  5852  4538  4863  6774  6468
    T:  7055  7620  6025  5477  5632  5373  3729  3932
    U:  2590  4226  5373  3327   963  3081  3840  3591
    V:   657  1784  1566   661  1297  1287  1351   893
    W:  1926   120    78  1797   138   141    68    61
    X:   112    45   328    89    49    42   260    91
    Y:  1637  1403   247   169  1998   216    65   891
    Z:    15     3   343   939  2742   490   314   301
2. Sort the letters for each language and display them with the most frequent first and the least frequent last, adding this code to main.cpp.
Running the program should now look like the following:
    Program 3: Which Language.
    Select from the following stages of output to display:
       1. Letter frequency counts
       2. Letter frequency order
       3. Get user input and display frequency counts
       4. Get user input, display frequency counts, and display language
       0. Exit the program
    Your choice --> 2
    Letter Frequency Counts:
        Engl  Finn  Fren  Germ  Hung  Ital  Port  Span
    A:  6018  9416  6544  5068  7541  8935  9939 10052
    B:  1464   448  1081  2060  1746  1221  1173  1387
    C:  2144   636  3028  3126  1014  3865  2855  3222
    D:  3331  1013  2698  4592  2265  2945  3687  3815
    E:  9270  7187 12782 14779  8280  9364 10551 10861
    F:  1701   297  1101  1464   932  1099  1039   653
    G:  1333   187   772  2503  2975  1423   983   936
    H:  5244  2384  1004  4775  1687  1485  1413  1263
    I:  4653  8022  5583  7062  2961  8128  4179  3964
    J:    38  1331   538   148  1049     5   216   429
    K:   691  3952    28  1080  3551    51    19    18
    L:  3294  4648  4240  3089  4658  4814  2082  3985
    M:  2481  3526  3269  2861  3812  2790  4415  3021
    N:  4987  7958  5958  9026  4725  5966  4455  5704
    O:  6054  4219  5315  2208  3529  8668  8898  7605
    P:  1000  1331  2297   466   448  2060  1862  1842
    Q:   121    78   907    84    77   548   954   883
    R:  4518  1784  5986  5955  2802  5357  5328  5706
    S:  4943  5425  6898  5852  4538  4863  6774  6468
    T:  7055  7620  6025  5477  5632  5373  3729  3932
    U:  2590  4226  5373  3327   963  3081  3840  3591
    V:   657  1784  1566   661  1297  1287  1351   893
    W:  1926   120    78  1797   138   141    68    61
    X:   112    45   328    89    49    42   260    91
    Y:  1637  1403   247   169  1998   216    65   891
    Z:    15     3   343   939  2742   490   314   301
    Letter frequency order:
    Engl  Finn  Fren  Germ  Hung  Ital  Port  Span
    E     A     E     E     E     E     E     E
    T     I     S     N     A     A     A     A
    O     N     A     I     T     O     O     O
    A     T     T     R     N     I     S     S
    H     E     R     S     L     N     R     R
    N     S     N     T     S     T     N     N
    S     L     I     A     M     R     M     L
    I     U     U     H     K     S     I     I
    R     O     O     D     O     L     U     T
    D     K     L     U     G     C     T     D
    L     M     M     C     I     U     D     U
    U     H     C     L     R     D     C     C
    M     R     D     M     Z     M     L     M
    C     V     P     G     D     P     P     P
    W     Y     V     O     Y     H     H     B
    F     J     F     B     B     G     V     H
    Y     P     B     W     H     V     B     G
    B     D     H     F     V     B     F     V
    G     C     Q     K     J     F     G     Y
    P     B     G     Z     C     Q     Q     Q
    K     F     J     V     U     Z     Z     F
    V     G     Z     P     F     Y     X     J
    Q     W     X     Y     P     W     J     Z
    X     Q     Y     J     W     K     W     X
    J     X     W     X     Q     X     Y     W
    Z     Z     K     Q     X     J     K     K
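One possible way to turn a column of counts into a column of sorted letters is sketched below. The function name lettersByFrequency is invented for this example, and using std::stable_sort is just one option; a hand-written sort would work as well. The stable sort keeps letters with equal counts in alphabetical order, which appears consistent with the sample output (for instance R before V in the Finnish column, where both counts are 1784).

```cpp
#include <algorithm>
#include <array>
#include <string>

// Returns the 26 letters ordered from most frequent to least frequent.
// counts[0] is the tally for 'A' and counts[25] the tally for 'Z'.
// (Hypothetical helper, not part of the provided sample program.)
std::string lettersByFrequency(const std::array<int, 26>& counts) {
    std::string order = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    // stable_sort leaves letters with equal counts in alphabetical order.
    std::stable_sort(order.begin(), order.end(),
                     [&counts](char a, char b) {
                         return counts[a - 'A'] > counts[b - 'A'];
                     });
    return order;
}
```

Given the English counts from the table above, this returns the letters in the same order as the Engl column, starting E, T, O, A.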
3. Prompt for user input and count the letter frequencies in it, adding this code to main.cpp.
Then display the sorted letters for each language again, this time also showing the sorted letters for the user input in the right-most column. Running the program should now look like the following, where Italian text is pasted in at the prompt to be analyzed:
    Program 3: Which Language.
    Select from the following stages of output to display:
       1. Letter frequency counts
       2. Letter frequency order
       3. Get user input and display frequency counts
       4. Get user input, display frequency counts, and display language
       0. Exit the program
    Your choice --> 3
    Letter Frequency Counts:
        Engl  Finn  Fren  Germ  Hung  Ital  Port  Span
    A:  6018  9416  6544  5068  7541  8935  9939 10052
    B:  1464   448  1081  2060  1746  1221  1173  1387
    C:  2144   636  3028  3126  1014  3865  2855  3222
    D:  3331  1013  2698  4592  2265  2945  3687  3815
    E:  9270  7187 12782 14779  8280  9364 10551 10861
    F:  1701   297  1101  1464   932  1099  1039   653
    G:  1333   187   772  2503  2975  1423   983   936
    H:  5244  2384  1004  4775  1687  1485  1413  1263
    I:  4653  8022  5583  7062  2961  8128  4179  3964
    J:    38  1331   538   148  1049     5   216   429
    K:   691  3952    28  1080  3551    51    19    18
    L:  3294  4648  4240  3089  4658  4814  2082  3985
    M:  2481  3526  3269  2861  3812  2790  4415  3021
    N:  4987  7958  5958  9026  4725  5966  4455  5704
    O:  6054  4219  5315  2208  3529  8668  8898  7605
    P:  1000  1331  2297   466   448  2060  1862  1842
    Q:   121    78   907    84    77   548   954   883
    R:  4518  1784  5986  5955  2802  5357  5328  5706
    S:  4943  5425  6898  5852  4538  4863  6774  6468
    T:  7055  7620  6025  5477  5632  5373  3729  3932
    U:  2590  4226  5373  3327   963  3081  3840  3591
    V:   657  1784  1566   661  1297  1287  1351   893
    W:  1926   120    78  1797   138   141    68    61
    X:   112    45   328    89    49    42   260    91
    Y:  1637  1403   247   169  1998   216    65   891
    Z:    15     3   343   939  2742   490   314   301
    Letter frequency order:
    Engl  Finn  Fren  Germ  Hung  Ital  Port  Span
    E     A     E     E     E     E     E     E
    T     I     S     N     A     A     A     A
    O     N     A     I     T     O     O     O
    A     T     T     R     N     I     S     S
    H     E     R     S     L     N     R     R
    N     S     N     T     S     T     N     N
    S     L     I     A     M     R     M     L
    I     U     U     H     K     S     I     I
    R     O     O     D     O     L     U     T
    D     K     L     U     G     C     T     D
    L     M     M     C     I     U     D     U
    U     H     C     L     R     D     C     C
    M     R     D     M     Z     M     L     M
    C     V     P     G     D     P     P     P
    W     Y     V     O     Y     H     H     B
    F     J     F     B     B     G     V     H
    Y     P     B     W     H     V     B     G
    B     D     H     F     V     B     F     V
    G     C     Q     K     J     F     G     Y
    P     B     G     Z     C     Q     Q     Q
    K     F     J     V     U     Z     Z     F
    V     G     Z     P     F     Y     X     J
    Q     W     X     Y     P     W     J     Z
    X     Q     Y     J     W     K     W     X
    J     X     W     X     Q     X     Y     W
    Z     Z     K     Q     X     J     K     K
    Copy and paste a paragraph of text to be analyzed, followed by ^z (PC) or ^d (Mac):
    Ma per arrivare a un agreement bisogna
    essere in due. E dato che il governo intende resistere sui numeri della manovra, è necessario offrire garanzie all’Europa e ai mercati. Perciò sono stati stabiliti due capisaldi: uno tecnico, l’altro più politico. La riduzione strutturale del debito viene fissato come un «obiettivo strategico», non a caso sottolineato da Di Maio dopo il vertice. La linea dell’esecutivo è che per far ripartire l’Italia sia necessario «cambiare approccio» con una manovra espansiva «dopo anni di cure rigoriste senza risultati», ma s
    A:51 B:5 C:20 D:15 E:55 F:4 G:6 H:2 I:54 J:0 K:0 L:22 M:10 N:29 O:39 P:13 Q:0 R:37 S:28 T:31 U:15 V:9 W:0 X:0 Y:0 Z:3
    Letter frequency order:
    Engl  Finn  Fren  Germ  Hung  Ital  Port  Span  User
    E     A     E     E     E     E     E     E     E
    T     I     S     N     A     A     A     A     I
    O     N     A     I     T     O     O     O     A
    A     T     T     R     N     I     S     S     O
    H     E     R     S     L     N     R     R     R
    N     S     N     T     S     T     N     N     T
    S     L     I     A     M     R     M     L     N
    I     U     U     H     K     S     I     I     S
    R     O     O     D     O     L     U     T     L
    D     K     L     U     G     C     T     D     C
    L     M     M     C     I     U     D     U     D
    U     H     C     L     R     D     C     C     U
    M     R     D     M     Z     M     L     M     P
    C     V     P     G     D     P     P     P     M
    W     Y     V     O     Y     H     H     B     V
    F     J     F     B     B     G     V     H     G
    Y     P     B     W     H     V     B     G     B
    B     D     H     F     V     B     F     V     F
    G     C     Q     K     J     F     G     Y     Z
    P     B     G     Z     C     Q     Q     Q     H
    K     F     J     V     U     Z     Z     F     J
    V     G     Z     P     F     Y     X     J     K
    Q     W     X     Y     P     W     J     Z     Q
    X     Q     Y     J     W     K     W     X     W
    J     X     W     X     Q     X     Y     W     X
    Z     Z     K     Q     X     J     K     K     Y
4. Using the data gathered in the previous stages, compare the sorted letters from the user input against the sorted letters for each of the eight languages, computing a difference for each comparison. Add this code to main.cpp.
The language with the smallest difference is reported as the language of the user input. Running the program will now also display that language, and should look like the following:
    Program 3: Which Language.
    Select from the following stages of output to display:
       1. Letter frequency counts
       2. Letter frequency order
       3. Get user input and display frequency counts
       4. Get user input, display frequency counts, and display language
       0. Exit the program
    Your choice --> 4
    Letter Frequency Counts:
        Engl  Finn  Fren  Germ  Hung  Ital  Port  Span
    A:  6018  9416  6544  5068  7541  8935  9939 10052
    B:  1464   448  1081  2060  1746  1221  1173  1387
    C:  2144   636  3028  3126  1014  3865  2855  3222
    D:  3331  1013  2698  4592  2265  2945  3687  3815
    E:  9270  7187 12782 14779  8280  9364 10551 10861
    F:  1701   297  1101  1464   932  1099  1039   653
    G:  1333   187   772  2503  2975  1423   983   936
    H:  5244  2384  1004  4775  1687  1485  1413  1263
    I:  4653  8022  5583  7062  2961  8128  4179  3964
    J:    38  1331   538   148  1049     5   216   429
    K:   691  3952    28  1080  3551    51    19    18
    L:  3294  4648  4240  3089  4658  4814  2082  3985
    M:  2481  3526  3269  2861  3812  2790  4415  3021
    N:  4987  7958  5958  9026  4725  5966  4455  5704
    O:  6054  4219  5315  2208  3529  8668  8898  7605
    P:  1000  1331  2297   466   448  2060  1862  1842
    Q:   121    78   907    84    77   548   954   883
    R:  4518  1784  5986  5955  2802  5357  5328  5706
    S:  4943  5425  6898  5852  4538  4863  6774  6468
    T:  7055  7620  6025  5477  5632  5373  3729  3932
    U:  2590  4226  5373  3327   963  3081  3840  3591
    V:   657  1784  1566   661  1297  1287  1351   893
    W:  1926   120    78  1797   138   141    68    61
    X:   112    45   328    89    49    42   260    91
    Y:  1637  1403   247   169  1998   216    65   891
    Z:    15     3   343   939  2742   490   314   301
    Letter frequency order:
    Engl  Finn  Fren  Germ  Hung  Ital  Port  Span
    E     A     E     E     E     E     E     E
    T     I     S     N     A     A     A     A
    O     N     A     I     T     O     O     O
    A     T     T     R     N     I     S     S
    H     E     R     S     L     N     R     R
    N     S     N     T     S     T     N     N
    S     L     I     A     M     R     M     L
    I     U     U     H     K     S     I     I
    R     O     O     D     O     L     U     T
    D     K     L     U     G     C     T     D
    L     M     M     C     I     U     D     U
    U     H     C     L     R     D     C     C
    M     R     D     M     Z     M     L     M
    C     V     P     G     D     P     P     P
    W     Y     V     O     Y     H     H     B
    F     J     F     B     B     G     V     H
    Y     P     B     W     H     V     B     G
    B     D     H     F     V     B     F     V
    G     C     Q     K     J     F     G     Y
    P     B     G     Z     C     Q     Q     Q
    K     F     J     V     U     Z     Z     F
    V     G     Z     P     F     Y     X     J
    Q     W     X     Y     P     W     J     Z
    X     Q     Y     J     W     K     W     X
    J     X     W     X     Q     X     Y     W
    Z     Z     K     Q     X     J     K     K
    Copy and paste a paragraph of text to be analyzed, followed by ^z (PC) or ^d (Mac):
    Ma per arrivare a un agreement bisogna
    essere in due. E dato che il governo intende resistere sui numeri della manovra, è necessario offrire garanzie all’Europa e ai mercati. Perciò sono stati stabiliti due capisaldi: uno tecnico, l’altro più politico. La riduzione strutturale del debito viene fissato come un «obiettivo strategico», non a caso sottolineato da Di Maio dopo il vertice. La linea dell’esecutivo è che per far ripartire l’Italia sia necessario «cambiare approccio» con una manovra espansiva «dopo anni di cure rigoriste senza risultati», ma s
    A:51 B:5 C:20 D:15 E:55 F:4 G:6 H:2 I:54 J:0 K:0 L:22 M:10 N:29 O:39 P:13 Q:0 R:37 S:28 T:31 U:15 V:9 W:0 X:0 Y:0 Z:3
    Letter frequency order:
    Engl  Finn  Fren  Germ  Hung  Ital  Port  Span  User
    E     A     E     E     E     E     E     E     E
    T     I     S     N     A     A     A     A     I
    O     N     A     I     T     O     O     O     A
    A     T     T     R     N     I     S     S     O
    H     E     R     S     L     N     R     R     R
    N     S     N     T     S     T     N     N     T
    S     L     I     A     M     R     M     L     N
    I     U     U     H     K     S     I     I     S
    R     O     O     D     O     L     U     T     L
    D     K     L     U     G     C     T     D     C
    L     M     M     C     I     U     D     U     D
    U     H     C     L     R     D     C     C     U
    M     R     D     M     Z     M     L     M     P
    C     V     P     G     D     P     P     P     M
    W     Y     V     O     Y     H     H     B     V
    F     J     F     B     B     G     V     H     G
    Y     P     B     W     H     V     B     G     B
    B     D     H     F     V     B     F     V     F
    G     C     Q     K     J     F     G     Y     Z
    P     B     G     Z     C     Q     Q     Q     H
    K     F     J     V     U     Z     Z     F     J
    V     G     Z     P     F     Y     X     J     K
    Q     W     X     Y     P     W     J     Z     Q
    X     Q     Y     J     W     K     W     X     W
    J     X     W     X     Q     X     Y     W     X
    Z     Z     K     Q     X     J     K     K     Y
    Language is Italian
This output should appropriately change, accurately identifying the language of the text that is supplied.
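The assignment does not spell out how the difference between two letter orders should be computed, so the following is only one possible sketch, not the required formula. It assumes each order is a 26-letter string such as a column of the table above, and it sums how far each letter's position in the user order is from its position in the language order. The names orderDifference, closestLanguage and lettersUsed are invented for this example; lettersUsed limits the comparison to the letters that actually occurred in the input.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Sum of how far each letter's position in userOrder is from its position
// in languageOrder. Only the first lettersUsed entries of userOrder are
// compared. Assumes both strings contain each of A-Z exactly once.
int orderDifference(const std::string& userOrder,
                    const std::string& languageOrder,
                    int lettersUsed) {
    int total = 0;
    for (int i = 0; i < lettersUsed; i++) {
        int positionInLanguage =
            static_cast<int>(languageOrder.find(userOrder[i]));
        total += std::abs(i - positionInLanguage);
    }
    return total;
}

// Index of the language order with the smallest difference from userOrder.
// If two languages tie, the earlier one in the list is kept.
int closestLanguage(const std::string& userOrder,
                    const std::vector<std::string>& languageOrders,
                    int lettersUsed) {
    int best = 0;
    for (int i = 1; i < static_cast<int>(languageOrders.size()); i++) {
        if (orderDifference(userOrder, languageOrders[i], lettersUsed) <
            orderDifference(userOrder, languageOrders[best], lettersUsed)) {
            best = i;
        }
    }
    return best;
}
```

Fed the eight language columns and the User column from the sample run above, with lettersUsed set to 20 (the number of letters with a nonzero count), this particular formula selects the Italian column. Other formulas may behave differently on shorter samples.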
Resources, Submitting your Program
The Codio project contains the eight training data files in English, Finnish, French, German, Hungarian, Italian, Portuguese and Spanish, a file of text in different languages to use for testing (Text for Testing.txt), and a sample program that reads from a file and does letter counts (countInputCharsFromFile.cpp), very similar to the one we discussed in class. This set of files is also included at the bottom of this page in Archive.zip for those working in a different programming environment before uploading to Codio.
Added Oct 5: In this program we are attempting to count only the standard alphabetical characters. It would be possible (though, again, not required here!) to handle multi-byte input for languages such as Chinese, Korean, Japanese, or Hebrew using the multi-byte UTF-8 format, explained at https://en.wikipedia.org/wiki/UTF-8.
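For the curious only, here is a small sketch of what "multi-byte" means in practice. In UTF-8 a character outside the basic A-Z range is stored as more than one byte, and every byte after the first has the bit pattern 10xxxxxx. The hypothetical function below (its name is made up for this example) counts characters rather than bytes by skipping those continuation bytes. It assumes the input is well-formed UTF-8, and it is not needed for this program.

```cpp
#include <cstddef>
#include <string>

// Counts the characters (code points) in a UTF-8 string rather than bytes.
// Continuation bytes look like 10xxxxxx, so every byte that does NOT match
// that pattern starts a new character. Assumes well-formed UTF-8.
std::size_t countUtf8Characters(const std::string& text) {
    std::size_t characters = 0;
    for (unsigned char byte : text) {
        if ((byte & 0xC0) != 0x80) {
            characters++;
        }
    }
    return characters;
}
```

For example, the word "più" from the Italian sample is four bytes long in UTF-8 but is counted here as three characters.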
Added Oct 6: I suggest you only compare characters that *are* present in the pasted-in input.
Added Oct 14: Codio is currently not working to test and submit your programs. We will instead be using Zybooks to validate the running of your program, in section 7.7 Program 3: Which Language. You will need to develop your program in Codio as we've been doing and test it there to ensure it is correct. Then copy and paste it into the solution window in Zybooks section 7.7. While in Develop mode you can test your program in that environment. Once you are convinced it is running correctly, select the Submit mode button, and then the orange Submit for Grading button, which will run the tests. For this particular program you can test it as many times as you would like.
Other runs of the program (with the intermediate tables of counts omitted) give the following results, where the first five give the correct language, and the last three don't:
Correct:
Fewer hours of sleep on an average school night
Language is English
Päivän lakko kohdistuu suuriin työnantajiin
Language is Finnish
Sie ist keine typische Physik-Nobelpreisträgerin.
Language is German
A miniszterelnök a Kossuth Rádiónak adott pénteki interjújában nagyon fontosnak
Language is Hungarian
La riduzione strutturale del debito viene fissato
Language is Italian
Incorrect:
El director del Instituto Politécnico Nacional, Mario Alberto Rodríguez Casas
Language is Italian
Voilà Roche et Aznavour intégrés dans la petite cour de celle qu'ils appellent tous
Language is Italian
Pautas que, tradicionalmente, têm sido mais associadas à esquerda.
Language is French
Refining the distance formula affects the correctness of the results. Not surprisingly, languages that are somewhat similar (French, Italian, Portuguese, Spanish) tend to get confused with each other when the user input is short. The longer the text to be analyzed, the more accurate the program's results tend to be.
Grading Rubric for the 45 Style Points (updated 10/11)
(20%) Identifier (variable and function) names are meaningful, in camelCase.
Identifiers are not meaningful.
Many identifiers are not meaningful.
Most identifiers are meaningful, but some are ambiguous or are not in camelCase.
Nearly all identifiers are meaningful and in camelCase.
All identifiers are meaningful and in camelCase.
(20%) Comments. Every section of code (but not every line!) must be commented.
Program has no comments.
Documentation is missing in several significant parts of the program.
Documentation is mostly adequate and/or header is missing.
Header is present; major sections are documented, though with too much or too little detail. Parameters are not documented.
Program has header and each major section of code is appropriately commented. All functions have header documentation, along with the role of parameters.
(25%) Functional Decomposition
Program is all inside main(), though it should be broken up into functions.
Program uses a few functions, but major areas of functionality are not broken up into functions and should be.
Functions are used to avoid redundant code or to organize program ideas; however, parameter access permissions are not as restrictive as they should be.
Functions are used both to avoid redundant code as well as to organize program ideas, with parameters used correctly.
Program is easy to understand and maintain due to appropriate use of functions.
(25%) Appropriate data, looping and decision structures are used
Structures important to program clarity and efficiency are not used, resulting in program not running correctly.
Choices of structures result in program being both unclear and inefficient.
Choices of structures result in program being either unclear or inefficient.
Program is correct, with a few minor areas that could be made more clear or efficient.
Program is structured optimally for both clarity and efficiency.
(10%) Indentation represents logical meaning. At most a single blank line within a function, and at most two blank lines between functions.
Indentation does not reflect program meaning. Code regularly has extra blank lines and/or commented-out sections of code.
Layout is substantially inconsistent in terms of indentation and blank lines.
Layout is often inconsistent in terms of indentation or extra blank lines.
Program layout is consistent with only a few exceptions.
Program layout is consistent and follows all common conventions.