String Parsing

String parsing is the processing of string input data. This input typically arrives via the console, input fields, or external files. You performed such a task in the data structures assignment when you converted infix mathematical expressions into postfix expressions.

There is a much greater need for string parsing when it comes to working with external files: your browser parses HTML files in order to display a website. Data from apps or games is written to local files. Even databases use external, formatted files to store tables of information. In fact, most software is designed to be flexible, which usually means it reads its configuration from an external file.

We are going to parse strings read in from external files. If you need a refresher on how we read and write external files, please review the videos HERE. If you have completed the activities on the "File Input/Output" page, then you have worked with simple files. Those files held some information, but each line contained only one piece of data. If you happen to know what that data means, it can be useful, but what if you don't? What if the data is incomplete? It wouldn't be very useful, would it?

This is why each line in a data file often has more than one piece of data. But how can we get the data out?

Data Files

Download the file [avengers.csv (Source, accessed February 17, 2016)] and open it. The data inside contains information about characters from the Marvel universe who have been part of the Avengers at some point. Let's say we wanted to find out the ratio of FEMALE to MALE Avengers over their history.

First we would need to open the file and set ourselves up for reading it. Create a new class called AvengersFileParser and type the following:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class AvengersFileParser {
  
 private static BufferedReader avengersInput;
 
 public static void main(String[] args) {
  try {
   avengersInput = new BufferedReader(new FileReader("FILEPATH/avengers.csv"));
   
   String avenger = avengersInput.readLine();
   while (avenger != null) {
    // Do something!
    
    avenger = avengersInput.readLine();
   }
   
   avengersInput.close();
  } catch (FileNotFoundException e) {
   // The file was not found! We should do something about that
   
  } catch (IOException e) {
   // IO Error. We should do something about this too!
  }
 }
}

The variable "avenger" holds one line of our CSV file. Now we have to figure out how to get the information out of that line so we can count male and female members. By this point you should know about the "split" method on strings. It splits a string on a delimiter (a character, or sequence of characters, that marks where one value ends and the next begins) and returns the pieces as an array of values. In Java, the argument to "split" is treated as a regular expression. What would be a good delimiter here?

Add the line that declares "data" to the loop, as shown below:

    while (avenger != null) {

        String[] data = avenger.trim().split(",");
    
        avenger = avengersInput.readLine();
    }

Aside: Why did we put "trim()" into the code? What happens if it isn't there?

Now we have each Avenger as an array. What do you notice about those arrays? Which piece of data do we need in order to count the male and female Avengers?
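One way to look at what "split" produces is to print every element of the array along with its index. The sketch below is only an illustration: the class name SplitDemo, the helper parseLine, and the sample line are all made up for this example and are not taken from avengers.csv.

```java
public class SplitDemo {

    // Splits one line of comma-separated text into its fields,
    // using the same trim().split(",") call as the loop above.
    public static String[] parseLine(String line) {
        return line.trim().split(",");
    }

    public static void main(String[] args) {
        // Hypothetical sample line, NOT a real row from avengers.csv.
        String line = "  Jane Example,42,YES,FEMALE  ";

        String[] data = parseLine(line);

        // Print how many fields we got, then each field with its index.
        System.out.println("Number of fields: " + data.length);
        for (int i = 0; i < data.length; i++) {
            System.out.println("data[" + i + "] = \"" + data[i] + "\"");
        }
    }
}
```

Printing the index next to each value makes it easier to see which position holds the piece of data you are after. Try the same loop on a real line from the file and compare.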

Final Code:

The final code we need to find the ratio of FEMALE to MALE Avengers is shown below. The new code is the two counters, the "if ... else if" block inside the loop, and the output after the loop:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

public class AvengersFileParser {
  
 private static BufferedReader avengersInput;
 
 public static void main(String[] args) {
  try {
   avengersInput = new BufferedReader(new FileReader("src\\fileparsing\\avengers.csv"));
   
   int maleCount = 0;
   int femaleCount = 0;
   
   String avenger = avengersInput.readLine();
   
   while (avenger != null) {
    // Split the current line into its comma-separated fields
    String[] data = avenger.trim().split(",");
    
    if (data[3].toLowerCase().equals("male")) {
     ++maleCount;
    } else if (data[3].toLowerCase().equals("female")) {
     ++femaleCount;
    }
    
    avenger = avengersInput.readLine();
   }
   
   System.out.println("The number of male Avengers is: " + maleCount);
   System.out.println("The number of female Avengers is: " + femaleCount);
   
   float ratio = (float)femaleCount / maleCount;
   System.out.println("For every male Avenger, there are " + ratio + " female Avengers.");
   
   avengersInput.close();
  } catch (FileNotFoundException e) {
   // The file was not found! We should do something about that
   
  } catch (IOException e) {
   // IO Error. We should do something about this too!
  }
 }
}
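The code above only reaches "close()" when no exception is thrown while reading. A possible alternative is Java's try-with-resources statement, which closes the reader automatically. The sketch below is one way this might look, not the required solution: the class name CountFieldSketch, the method countMatches, and the in-memory sample data are assumptions made for illustration, and a StringReader stands in for the file so the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class CountFieldSketch {

    // Counts the lines whose field at index `column` equals `value`, ignoring case.
    public static int countMatches(Reader source, int column, String value) throws IOException {
        int count = 0;
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] data = line.trim().split(",");
                // Skip lines that are too short rather than risk an
                // ArrayIndexOutOfBoundsException.
                if (data.length > column && data[column].equalsIgnoreCase(value)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical in-memory data standing in for a file.
        String sample = "A,1,x,MALE\nB,2,y,FEMALE\nC,3,z,female\n";
        System.out.println(countMatches(new StringReader(sample), 3, "female")); // prints 2
    }
}
```

To read from a real file, you could pass a FileReader instead of the StringReader. The length check is a design choice: it quietly ignores malformed lines, which may or may not be what you want for your data.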

Questions:

  • Why do we have an "if ... else if" statement rather than an "if ... else"?
  • Why, when we calculate the ratio, do we cast the first integer to "float"?
    • What happens if we remove the "float" from the calculation?
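If you would like to experiment with the last two questions, a small sketch like the one below may help. The class name RatioDemo, the two helper methods, and the counts are hypothetical and chosen only for illustration; they are not the real counts from the file.

```java
public class RatioDemo {

    // Both operands are int, so Java performs integer division.
    public static int intRatio(int a, int b) {
        return a / b;
    }

    // Casting one operand to float makes Java perform floating-point division.
    public static float floatRatio(int a, int b) {
        return (float) a / b;
    }

    public static void main(String[] args) {
        // Hypothetical counts, not taken from avengers.csv.
        int femaleCount = 1;
        int maleCount = 4;

        System.out.println(intRatio(femaleCount, maleCount));   // prints 0
        System.out.println(floatRatio(femaleCount, maleCount)); // prints 0.25
    }
}
```

Compare the two printed values, then try other counts to see when the two results agree and when they differ.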

Activities

Download the [data.csv] file and answer the questions:

  1. What do you think this data shows?
  2. What is the average salary of the people in the file?
  3. How many people have the education level "High School"?
  4. How many people's last names start with the letter "M"?
  5. How many people have more than 15 years of experience with a salary less than $50,000?