Discussion Notes SQ 2009 Week 2

Paper:

V. Garcia, E. Almeida, L. Lisboa, A. Martins, S. Meira, D. Lucredio, and R. Fortes. Toward a code search engine based on the state-of-art and practice. In APSEC '06: Proceedings of the XIII Asia Pacific Software Engineering Conference, pages 61-70, Washington, DC, USA, 2006. IEEE Computer Society.

Contributions:

- A survey of research on code search engines from 1990-2006.

- Presents a set of requirements for code search engines, repository systems, and environment integration.

Discussion:

- The paper covers the work done on source code search engines from 1990 to 2006. The work covered reflects the spirit of its time. In the 90's, under the influence of CMM, information needed to be organized in a structured way. In the 2000's, under the influence of Web 2.0, users can help classify the information according to their needs, i.e., folksonomy, an approach that is less controlled, free-flowing, and dynamic. Between 1990 and 2000 we also see more influence of data mining, which combines database and artificial intelligence techniques. As a result, we can now dig up our own information without having to know the contents of the repository in advance.

- Looking back, between 1950 and 1970 the main use of computing shifted from military to business and then to the personal computer. By 1980 there was much more software: there were more computers on desks and it was easier to write your own programs.

- The paper discusses component search, but we need to be careful about the definition of components, because the idea of a component in the 1960's is different from that in the 1990's.

- We discussed the difference between white-box and black-box components in source code search. They can be seen from two points of view: the producer's and the user's.

- We discussed the requirements for the search engines.

i) Retrieval Algorithms. "Through the years these were the mostly used approaches for retrieving reusable software, including most of the works presented here. Therefore, we believe that they should be part of any reuse-oriented search engine, mainly because these are already well-proven and easy-to-use solutions." -> The fact that they have been used before in prototypes does not mean that they are the best IR techniques to use for source code. The authors did not cover any IR technique applied to anything other than plain text.
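As a concrete illustration of the kind of text-based retrieval the surveyed works rely on, here is a minimal sketch of keyword matching over tokenized identifiers with TF-IDF weighting. The corpus, the identifier-splitting rules, and the scoring details are assumptions made for illustration, not the paper's actual algorithm.

```python
import math
import re
from collections import Counter

def tokenize(code):
    """Split source text into lower-cased terms, breaking camelCase identifiers apart."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", code):
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word)
        tokens.extend(p.lower() for p in parts)
    return tokens

def tf_idf_scores(query, documents):
    """Score each document (name -> source text) against the query with TF-IDF."""
    doc_terms = {name: Counter(tokenize(src)) for name, src in documents.items()}
    n_docs = len(documents)
    scores = {}
    for name, counts in doc_terms.items():
        score = 0.0
        for term in set(tokenize(query)):
            tf = counts[term]
            df = sum(1 for c in doc_terms.values() if term in c)
            if tf and df:
                score += tf * math.log(n_docs / df)
        scores[name] = score
    return scores

# Hypothetical two-file corpus; only StackImpl should score above zero for this query.
corpus = {
    "StackImpl.java": "public class Stack { void push(Object o) { } Object pop() { } }",
    "QueueImpl.java": "public class Queue { void enqueue(Object o) { } Object dequeue() { } }",
}
print(tf_idf_scores("push pop stack", corpus))
```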

iii) High recall and precision. It is better to offer more precision, sacrificing some recall: it is more important that the top retrieved elements are relevant. Studies have shown that users usually do not look at the 2nd or 3rd page of results, only the first one. There are three thresholds:

- First match - best.

- Above the fold (the first screen of the first page) - second best.

- On the first page - third best.
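To make that trade-off concrete, here is a small sketch computing precision at a cut-off and overall recall for a ranked result list; the ranking and the relevance judgements below are invented for illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant (roughly what the first screen shows)."""
    top = ranked[:k]
    return sum(1 for r in top if r in relevant) / k

def recall(ranked, relevant):
    """Fraction of all relevant items that were retrieved at all."""
    return sum(1 for r in ranked if r in relevant) / len(relevant)

ranked = ["A", "B", "C", "D", "E"]   # hypothetical ranked results
relevant = {"A", "C", "F", "G"}      # hypothetical relevant set (F and G are never retrieved)

print(precision_at_k(ranked, relevant, 3))  # 0.67 -> two of the top three are relevant
print(recall(ranked, relevant))             # 0.5  -> only half of the relevant items were found
```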

iv) IDE Integration. Minimum overhead is not a good reason to implement IDE integration (it does not really take any time to switch between the IDE and the search tool), but using the IDE to get the context of the search is a good reason (see the sketch after these notes).

The requirements discussed do not offer much support for the search process followed by the developer.
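On the IDE integration point above, a hypothetical sketch of what "using the IDE to get the context of the search" could mean: derive query terms from the buffer the developer is currently editing instead of asking for explicit keywords. The buffer contents and the stop-word list are assumptions for illustration; no real IDE API is used.

```python
import re
from collections import Counter

def context_query(editor_buffer, max_terms=5):
    """Build a search query from the identifiers in the code currently being edited."""
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", editor_buffer)
    stop_words = {"public", "private", "void", "int", "return", "new", "class", "if", "else"}
    counts = Counter(t.lower() for t in identifiers if t.lower() not in stop_words)
    return [term for term, _ in counts.most_common(max_terms)]

# Hypothetical contents of the active editor buffer.
buffer = """
public List<Order> filterOrdersByCustomer(List<Order> orders, Customer customer) {
    // the developer is writing order-filtering logic here
}
"""
print(context_query(buffer))  # e.g. ['order', 'list', 'customer', ...]
```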

Ideas to implement:

- Look for matches not directly in the source code, but in some representation of it, such as UML diagrams or another abstraction.
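One way to read that idea in code: extract a structural summary (class and method names, a crude stand-in for a class diagram) and match queries against that summary instead of the raw text. The example below works on Python sources via the standard ast module; the query format is an assumption for illustration.

```python
import ast

def structural_summary(source):
    """Return {class_name: [method_names]} as a lightweight model of the code."""
    tree = ast.parse(source)
    summary = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
            summary[node.name] = methods
    return summary

def matches(summary, wanted_methods):
    """A class matches if it offers all of the requested operations."""
    return [cls for cls, methods in summary.items()
            if set(wanted_methods) <= set(methods)]

source = """
class Stack:
    def push(self, item): ...
    def pop(self): ...

class Logger:
    def write(self, msg): ...
"""
print(matches(structural_summary(source), ["push", "pop"]))  # ['Stack']
```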

- Repository Systems Requirements:

- xi. Repository familiarity. "Reuse occurs more frequently with well-known assets." Does the fact that many people use an asset mean it has better quality? It basically resembles the chicken-and-egg problem: do we use things because they are famous, or are they famous because a lot of people already use them?