Project Reflection and Documentation

Reflection: What does research mean to you?

Research is a collaborative process composed of a series of failures interrupted by the occasional success. This definition can be divided into three parts that outline the three central characteristics of research. Research is a:

1. collaborative process

2. composed of a series of failures

3. interrupted by the occasional success

The rest of this reflection will define and elaborate on these three parts.

First and foremost, research is collaborative. At the most basic level, all current research stems from previous research and is dependent on related knowledge. Thus, even if an individual is the sole person investigating a particular topic, her work is still collaborative because she depends on related knowledge in order to advance her own field. Typically, though, research is also collaborative in a more direct sense in that there are multiple people working together to solve a problem. The term "research community" encapsulates and reflects the collaborative nature of research. Among other things, healthy communities work together towards a common goal, set clear policies and obligations, promote interaction among their different members, and maintain a fair organization and structure conducive to community activity. Although competing for recognition or resources may complicate relations, ideally research groups working on the same topic would reflect these same characteristics.

Second, research is composed of a series of failures. The word "failure" here refers to anything that is not the realization of the end goal, so the term encapsulates any number of events that are not strictly negative. An overused and slightly clichéd, albeit fitting, example is Thomas Edison's hundreds of attempts at creating the light bulb. All were failures, but all were still useful for understanding the problem and guiding Edison to a working solution. "Failures" may also produce new areas and topics of research and may refine the research community's working knowledge. Hence, failure as used here is a nuanced term.

Finally, research produces the occasional success. Like the word "failure", the word "success" is also nuanced. Here it refers to an advancement that directly contributes to achieving the end goal. To continue with the clichéd example, success would be Thomas Edison producing the working light bulb. A better term here may be "milestone" instead of "success", since it better recognizes the repetitive and largely unending nature of research. Research continues because there is still knowledge to gain, and even after many years of striving to accomplish some goal, there remains the satisfaction of the possibility and challenge of further discovery.

End of Week #1 Update

Our initial meetings with Dr. LaToza and Sahar entailed gaining familiarity with the project. Dr. LaToza introduced the basic motivation behind the project and what previous work had been done, especially with regard to ActiveDocumentation, RulePad, and the HTML machine learning algorithm/project. He gave us several resources to read and consider, all of which gave some insight into how developers code, what developers focus on as they program, how developers use and interact with IDEs and APIs, and so on. Thus far we have read about seven papers, all fairly diverse albeit applicable. Our first couple of meetings felt somewhat abstract as we tried to pin down some larger ideas and concepts regarding the project as a whole. In the third meeting we had an involved discussion that revolved around the different uses for our tool and the different workflows that might result from those uses. For example, a developer familiar with a code base may know what rules they wish to add. On the other hand, a developer who is unfamiliar with a project may want to explore rules or have the machine suggest possible rules. We met a fourth time this week, on Friday, to further discuss how we would measure complexity and what kind of initial parameters we would provide to our algorithm.

How many programmers does it take to change a light bulb?

None – It’s a hardware problem

For the hackathon, we were divided at random into six groups. Our goal was to find a classifier that would most accurately predict the quality of a wine based on a set of characteristics. It was fascinating to see how much team members differed in their approaches with regard to preferred tools, exploring the data, and finding a method that classified the observations well.

While I immediately took to exploring the data in R and looking for correlations among our possible features for our model, two other members immediately started using pandas (an open-source data analysis tool for Python) and exploring what classifiers seemed to work best. I used a heatmap of the correlations to figure out what features we might exclude from our training dataset. Then we tried re-training different classifiers to see if their performance improved.

What we ended up using was a random forest classifier trained on the given training dataset with citric acid and free sulfur dioxide dropped. A random forest classifier builds many decision trees, each from a subset of the training data, and then combines the votes from the decision trees to determine the final class of a given instance. We dropped the citric acid and free sulfur dioxide variables from our training set because they were highly correlated with other features and thus added little information while potentially hurting our classifier's accuracy. We also used cross-validation to estimate the accuracy of our classifier.
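The exact calls we used during the hackathon aren't recorded here, but a minimal scikit-learn sketch of the approach would look roughly like the following; the file name wine_train.csv and the column names are assumptions standing in for the actual dataset we were given.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical file and column names; adjust to the real hackathon dataset.
df = pd.read_csv("wine_train.csv")

# Inspect pairwise correlations to spot redundant features (a heatmap of this
# matrix is what guided our feature selection).
print(df.corr().round(2))

# Drop the highly correlated features, then split features from the label.
X = df.drop(columns=["quality", "citric acid", "free sulfur dioxide"])
y = df["quality"]

# Random forest: many decision trees, each trained on a bootstrap sample,
# whose votes are combined into a final prediction.
clf = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validation to estimate accuracy without a separate holdout set.
scores = cross_val_score(clf, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```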

The hackathon was a great way to work with and get to know other students from both the REU and the ASL summer research groups. It also gave me the opportunity to learn about different machine learning algorithms. I am not familiar with many machine learning topics, so it was valuable to see the different algorithms people used to approach such a problem, particularly the random forest classifier that my group used.

End of Week #2 Update

We continued to have meetings with Dr. LaToza and Sahar to explore what the tool may look like. We are in a unique position with our project in that, while other groups are analyzing data that has been given to them, we are essentially designing our data. We have multiple streams of data available to us, including an abstract syntax tree (AST) of the code, cursor information, and search information. Through some combination of this data, we wish to identify a token of interest and then generate several possible queries about design rules to present to the user. Therefore, the first task is to find which streams of data are most useful for identifying the parts of the code that are most relevant to the user. Once those sections of code are identified, the next challenge lies in deriving useful and interesting design rules to present to the user. We are currently focusing on the first task, identifying the most useful sources of data. This includes exploring what information about user activity can be obtained from the IDE and how to access that information.

Additionally, we are actively learning JavaScript and XPath. We spent a large portion of our time installing all the necessary components to make the ActiveDocumentation tool function properly on our personal machines, so that we can use the tool and explore its different capabilities.

I read several interesting papers this week, all of varying relevance to our project. I read a paper about a group of researchers who used a statistical language model to perform code suggestion. From their paper, I started to wonder how our project could be phrased as an NLP problem, where instead of trying to fill in a missing word, we are trying to identify rhetorical devices or write a summary of a corpus of information. I conducted a light search to see what unusual applications of NLP might be connected to such an idea. There were several interesting papers, including one about identifying different kinds and styles of imagery in ancient Chinese poetry, which had several surprising parallels with our project. These papers may not ultimately relate to the final product we produce, but they contain interesting solutions to consider, especially in terms of formatting and aggregating information. As I've read, I've continued to update my set of summaries of the papers and ideas I've explored.

For every action, there is an equal and opposite criticism.

-Steven Wright

I decided to critique a research paper that we read more recently. It is one of a number of papers that discuss how developers learn and look at code, particularly in team settings. I chose this one because it seemed especially helpful: the paper is written and organized clearly, and the authors systematically discuss the tools, methods, and questions developers rely on in their daily workspaces and team settings. The paper is titled Information Needs in Collocated Software Teams (Ko et al., 2007). My summary and discussion of the paper is below.

This research team observed and analyzed the behavior of seventeen developers to identify developers' day-to-day information needs. They identify knowledge about design and program behavior as the type of information most often deferred, since unavailable coworkers were often the only source of that knowledge. The researchers identify many sources of information developers consult to accomplish a task, including check-in logs, bug reports, content management systems, version control systems, and other coworkers. Much of the information about a project's design and history is known by individuals but not necessarily recorded or updated in a formal document.

One significant question developers had when exposing code to teammates was whether they had followed their team's conventions. Some developers in the study tried to use static analysis tools to check for fault-prone design patterns, but the tools gave too many false positives or incomprehensible recommendations. Other challenges developers faced were merging code bases and determining differences in the current submission. Additional key activities in understanding execution behavior were tied to circumstances in which a developer was using vendor code, joining a new team, obtaining ownership of code, or debugging. Developers asked specific questions about what code caused different program states and behaviors and what was statically related to the code of interest. They pursued the answers starting from a hypothesis, which they largely developed by using their intuition, asking coworkers, looking at execution logs, scouring bug reports, and using the debugger. They then refined their hypothesis by asking questions of varying scope, from questions about a specific function definition to questions about code that performed similar operations; they most often used the search tool to answer these questions. The researchers also found that developers heavily relied on the debugger to answer questions about code behavior.

The researchers found that coworkers were the most frequently accessed source of information and that the information needs for which coworkers were most often consulted were about design. When coworkers were unavailable to answer these questions, developers' tasks often became blocked. The researchers state that "design intent was also difficult to find. Information about rationale and intent existed sometimes in unsearchable places like whiteboards and personal notebooks or in unexpected places like bug reports." They explore questions about what documentation can be written cost-effectively and suggest that a demand-driven approach to recording design knowledge might avoid wasting effort on design information that is never read or goes stale before being read. ActiveDocumentation and RulePad, which Sahar has worked on, provide an active means of documenting code effectively. Moreover, the researchers' observation that writing and updating documentation is often forgotten or neglected supports our project, because our tool would help document code that is partially or completely undocumented and help developers easily update documentation for their projects.

We have identified cursor location as a possible source of important information about developer activity, but from this paper we may also want to investigate using the debugger to identify areas of interest. Additionally, we should mention a possible limitation of our tool: it can currently only be used by an individual, while many environments in which code is developed are collaborative, with multiple individuals accessing and editing the code. Adapting the tool for developer team environments may be left for future work.

End of Week #3 Update

This week was an interesting week as we tried to pin down some more specifics about our algorithm. We began the week thinking of a general algorithm for finding rules: use information like cursor placement to identify a token, use that token to generate a set of queries, and then present those queries to the user. The details of this algorithm would vary significantly depending on how many queries we could make per second, which was information we didn't have. To get an estimate of how long it takes to run XPath queries, we ran a few tests that entailed executing a set of XPath queries on srcML files generated from projects of varying size, from thousands of lines of code to hundreds of thousands. Our results indicated that a query could take anywhere from about 0.02 seconds to 2 seconds depending on the size of the original code base; the larger the codebase, the longer the query time. Therefore we began to rethink our algorithm, and even the way we framed it.
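For reference, each timing test amounted to something like the sketch below, which parses an srcML file with lxml and times a single XPath query. The file name project.xml and the particular query are placeholders, and the namespace shown is the standard srcML one; the actual queries we ran varied in complexity.

```python
import time
from lxml import etree

# Standard srcML namespace; project.xml is a placeholder for an srcML
# document generated from one of the test projects.
SRCML_NS = {"src": "http://www.srcML.org/srcML/src"}
tree = etree.parse("project.xml")

# Example query: the name of every function in the project.
query = "//src:function/src:name"

start = time.perf_counter()
results = tree.xpath(query, namespaces=SRCML_NS)
elapsed = time.perf_counter() - start

print(f"{len(results)} matches in {elapsed:.3f} s")
```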

Our task until our next meeting is to look at previous work on association rule miners to identify whether an existing machine learning algorithm works for our problem, and if not, to see what kinds of representations they use that might be helpful in our situation. Identifying an algorithm or representation may significantly assist us in understanding which parts of our available sources of information are most useful. One idea on the table with regard to how we store our data is to index it in some way.

Additionally, this week we got to give a ten-minute presentation of our research project for the rest of Dr. LaToza's research groups. It went well, and Dr. LaToza helped record some of the feedback we got when people asked us questions. We will use this feedback as we continue to think about our project and when we are preparing for our poster presentation at the end of the summer. Below are the slides from our presentation.

Tuesday Group Presentation

End of Week #4 Update

This week's update is going to be brief since it's hard to describe specifically what we are looking at without adding too many implementation details. We delved deeply into association rule mining techniques. Specifically, we have been exploring variations of a structure called a frequent pattern tree (FP-tree). FP-trees are more efficient at association rule mining than the Apriori algorithm because they don't enumerate every possible candidate itemset. However, the difficulty lies in finding an FP-tree variant that can handle incremental data; this is important for our application because the code base will be modified, so we need a way to update the FP-tree without simply re-traversing the entire database. The other difficulty is that most FP-tree algorithms work recursively, which means we can hit stack overflow issues if our tree is too deep, so we also need to look at iterative implementations of the algorithm. One final consideration is how to select support and confidence thresholds.
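We haven't committed to an implementation yet, but as a rough sketch of the mining step itself (setting aside the incremental-update problem), an off-the-shelf FP-growth implementation such as mlxtend's can mine frequent itemsets and candidate rules from a one-hot table in which each row is a class and each column is a mined attribute. The column names below are purely illustrative, not our real feature set.

```python
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Toy one-hot table: each row is a class, each column a mined attribute.
data = pd.DataFrame(
    [
        {"has_annotation_A": True,  "extends_Base": True,  "has_public_foo": True},
        {"has_annotation_A": True,  "extends_Base": True,  "has_public_foo": False},
        {"has_annotation_A": False, "extends_Base": True,  "has_public_foo": False},
        {"has_annotation_A": True,  "extends_Base": False, "has_public_foo": True},
    ]
)

# FP-growth builds an FP-tree over the table and reads frequent itemsets off
# of it, rather than generating every candidate set as Apriori does.
itemsets = fpgrowth(data, min_support=0.5, use_colnames=True)

# Derive candidate association rules that meet the confidence threshold.
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

The min_support and min_threshold values here are arbitrary; choosing them sensibly is exactly the open question above, and the sketch says nothing about updating the tree as the code base changes.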

Other things we have considered this week are how we will represent different elements of the code base in the database we use to build our FP-tree. How would we go about listing '@A' in our table as a feature of the database? What kind of information would we need to supply the algorithm with in order to produce our desired rules?

Finally, this Wednesday there was a guest speaker and we all gave three-minute elevator pitches for our group projects. They went well, and it was good to get some practice presenting our work. Below is the single slide we used, as per the guidelines we were given.

Adventure, yeah. I guess that’s what you call it when everybody comes back alive.

– Mercedes Lackey

A discovery about research along a journey...

It's funny. The idea and process of research are being made available and advertised to increasingly younger age groups. It's no longer a matter of trying to introduce research to middle- and high-school-aged students; it's a matter of involving them, deeply and in a very hands-on way. Now, we're introducing research to elementary-age students. Consequently, many students of middle school age and older could probably give you at least a short description of research and what it might involve. By the end of high school, there's a fair chance a student has gotten experience working in or shadowing a research environment; chances are, they can tell you a fair bit about what research means to them and describe their experience. And this is all well and good!

I've found it interesting that no matter how many outlines of research are given, the actual process varies quite a lot from professor to professor and project to project. Last summer, I was given a large set of papers to read and told to ask questions and report on what I had learned; gradually, my professor guided me into a topic of research by teaching me how to ask questions. That research area was more constrained and specific, so most of the papers I read related directly to my research. This summer, I was given a topic and even a desired outcome, and guided to different areas that might be relevant to that goal. However, this area of research sits side-saddle in the research world, straddling several major areas of computer science; this makes it easy to follow an irrelevant path into papers that are only tangentially related to my work.

The role of graduate students also varied. Last summer, Molly helped me learn a lot of the soft skills associated with research: keeping track of papers and LaTeX citations, workflow, tips on reading a technical paper. As the summer progressed, she became more involved in the technical aspects of the paper's contents, while continuing to help me learn good practices and research habits. This summer, Sahar has been actively guiding us and helping us with the research from the start. This is largely because she has been working on this project and in this area for the last two years, so she has a lot of knowledge and experience with the work. She doesn't necessarily help us with soft skills in research, but the REU provides that support for us, so it's not necessary for her to provide that guidance (although she has no problem providing it when we ask!).

Even our meetings with the professors, and how they lead us through the research experience, vary. We have consistently met with Dr. LaToza and Sahar two to four times a week for the entire summer, with our readings and direction changing significantly each time. My meetings with Dr. Hering were, generally speaking, more structured: while I was in the initial stages of reading papers, she would simply ask me to report on what I had read and what I didn't understand from my readings. Then she would answer my questions and ask me what I thought the authors had missed. Once I had settled on a research topic and begun to work, we would meet once or twice a week after I had worked on my paper or a simulation, and she would correct my work. This summer, our topic is much broader, and few (if any) researchers have done work even close to what we are hoping to do. Consequently, we read many tangentially related papers and then try to synthesize that information to solve the challenge at hand. As a result, our meetings with Dr. LaToza and Sahar are far more frequent and unstructured.

Most people could probably give a general outline of research, its goals, and its process. However, the in-between is fascinating and unexpected. In many ways, both of my summer research experiences have had many similarities. In both, I have spent a period getting familiar with the research area, asking questions, and trying to understand gaps in the literature. I've had careful and consistent guidance from both my faculty mentors and the graduate students who worked with them. They've both been highly enjoyable experiences! It wouldn't be hard to make a general outline of research that could fit both experiences.

I've always enjoyed looser, more organically-structured learning experiences, and research fits the bill perfectly. However, I have also come to understand the impact of differences in teaching, learning, and guiding styles. Both were good experiences, but both were very distinct as well!

End of Week 5 & 6 Update

Week 5 included Independence Day, and so was slightly abbreviated. Additionally, our work for the last two weeks has been closely related, so it makes sense to recap them all at once.

These past two weeks, we got our first look at what our data might look like. We are using a set of functions from a Python library to navigate the AST contained in the srcML document and mine information from it to feed into our association rule mining algorithm. A lot of time and effort has gone into creating the Python script that mines the code for features. I'll quickly list some of the challenges I encountered and how I am working to redesign what features we extract and how we extract them; a small sketch of the extraction step follows the list:

(1) Including broadly applicable attributes about the code clutters the data, making it difficult to discover more interesting attribute associations.

(2) Repeatedly mining the data in stages is critical since it is difficult to represent and associate all relevant information in a single step.

(3) Navigating srcML can be tricky. Spending time with the srcML document to understand its structure and the different variations in encoding that can be produced is critical.
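To give a flavor of the extraction step without the implementation details, here is a minimal sketch of mining a couple of coarse class-level attributes from an srcML document with lxml. The element paths (for example, the annotation element) and the attribute vocabulary are illustrative assumptions, not our actual feature set, and project.xml is a placeholder file name.

```python
from lxml import etree

SRCML_NS = {"src": "http://www.srcML.org/srcML/src"}

def class_attributes(srcml_path):
    """Return a set of coarse attributes for each class in an srcML document."""
    tree = etree.parse(srcml_path)
    rows = {}
    for cls in tree.xpath("//src:class", namespaces=SRCML_NS):
        name_el = cls.find("src:name", namespaces=SRCML_NS)
        name = name_el.text if name_el is not None else "<anonymous>"
        attrs = set()
        # One coarse attribute per annotation on the class, e.g. "has_annotation_A".
        for ann in cls.xpath("src:annotation/src:name", namespaces=SRCML_NS):
            attrs.add(f"has_annotation_{ann.text}")
        # One coarse attribute per function declared inside the class.
        for fn in cls.xpath(".//src:function/src:name", namespaces=SRCML_NS):
            attrs.add(f"has_function_{fn.text}")
        rows[name] = attrs
    return rows

print(class_attributes("project.xml"))
```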

During the meeting, I proposed we start from an existing rule and see if we could create a set of databases, and a process for mining those databases, that could lead us back to that rule. Thus, in the coming week, this is what we're focusing on: a single rule and how we could derive it. Implementing the algorithm and mining the data from the code is time consuming but interesting! I look forward to examining more of the data we produce and working towards a practical algorithm and solution to our challenge.

End of Week #7 Update

This week we worked on refining the set of attributes that we output for each class. We were trying to understand whether we would get closer to generating the rules we are interested in if we output more specific attributes. The difficulty is in knowing how specific to make the attributes, since making them too specific might result in an explosion of attributes. For example, we could generate attributes like "has public function foo()" and "has function foo() that returns type String", or we could instead make a single more specific attribute like "has public function foo() that returns type String". However, generating attributes like the latter could result in an explosion of attributes that gets us no closer to finding the rules we wish to find (there is a small sketch of this tradeoff after the questions below). Current questions of interest include:

(1) Are we expressive enough? How many more features are needed such that we can find the rules we are interested in?

(2) How tractable are the solutions we are coming up with? (In terms of run time, rules generated, etc.)

(3) What other kinds of information from the user do we need to figure out more interesting rules?
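As a hypothetical illustration of the granularity tradeoff mentioned above (the visibility, name, and return type here are just the earlier example, not real features):

```python
# Hypothetical sketch of attribute granularity for a single function signature.
def coarse_attributes(visibility, name, return_type):
    # Two separate, broadly reusable attributes.
    return {
        f"has_{visibility}_function_{name}",
        f"has_function_{name}_returns_{return_type}",
    }

def specific_attributes(visibility, name, return_type):
    # One combined attribute: more precise, but every distinct combination
    # of visibility, name, and return type becomes its own attribute.
    return {f"has_{visibility}_function_{name}_returns_{return_type}"}

print(coarse_attributes("public", "foo", "String"))
print(specific_attributes("public", "foo", "String"))
```

Across a large codebase, the combined form multiplies the attribute vocabulary, which is exactly the explosion we are worried about.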

In other news, we are coming up on the last three weeks of the REU, which contain several deadlines. By the end of next week, we need to have made a video about our project. I've been working on the video, and hopefully we will have it done within about 45 more minutes of work. We also have a PowerPoint presentation to prepare (due week 9) and a poster presentation (due week 10). Our project has been undergoing a lot of developments recently, which we expect to continue, so we set up a time with the George Mason Sp@rk office to print our poster a few days before the poster presentation. We are beginning to plan the poster.

In the next several days (beginning of week 8), we will finish our video and design our poster. We will also discuss how we will structure our presentation. As always, we will also continue to work on our research with Dr. LaToza as we try to refine the attributes that we output.

Week 8 & 9 Updates

These last couple of weeks have been filled with a mixture of deadlines for the REU program and updates to our research. After a series of tests in which we continued to refine the attributes we output to our database, we decided to test our methodology on the codebase for Eclipse, which is an open-source IDE. We obtained the code from GitHub and have been conducting various tests to analyze the rules that we produce, how to group classes into focus and peripheral classes, and what makes one frequent itemset more interesting than another. Based on these results, we have further refined our algorithm.

With regards to the video, the one that we created is posted on the home page of this blog. It contains a concise summary of what our work has entailed this summer!

We also completed our PowerPoint presentation, which is directly below this week's update. It helped us decide what kinds of information to include on the poster and how we could present it.

DevUXD Final Presentation.pptx

Finally, we finished our poster for the poster presentation that will happen the last Friday of the REU. We are excited to see all of our hard work come together after so many weeks of work.

Week 10 Update

This is the last week of the REU, which culminates in a poster presentation that includes all the groups who performed research at GMU this summer. We will be at our poster to discuss and answer questions about our project for half the session, and we will be able to explore other people's projects for the other half.

As far as our project goes, we were able to successfully discover design rules given a code base. We currently have several fine-tuning decisions that need to be made, but they will be postponed until the start of the semester. We hope to integrate our algorithm with RulePad and publish our results within the next year or so.

The final details of uploading and organizing our code and writing our paper are complete, as is packing to move back home.