“Each graduate of the Master of Library and Information Science program is able to… design, query, and evaluate information retrieval systems”
Introduction
An information retrieval system involves the acquisition of information, from a collection of information resources, that satisfies an information need. Rubin (2010, page 282) defines an information retrieval system as a “device interposed between a potential end-user of an information collection and the information collection itself. For a given information program, the purpose of the system is to capture wanted items and filter out unwanted items from the information collection.” For librarians, much, if not all, information retrieval will involve a database. As with the design of any system in use by the information organization, whether computerized or not, the users of those systems and their typical needs must be known; it makes no sense to create or bring in technology no one can or cares to operate. The principles of database design, querying, and evaluation are all important if we are going to help implement new database systems and also use them effectively, both for ourselves and for the patrons of our information organization.
Database Design
When designing a database, several things need to be considered. The goals for the system need to be defined, and the users and their characteristics need to be understood. Determining how the information will be represented and what the search engine will look like are also important.
Databases are an information retrieval technology so ubiquitous that one can be found for just about any subject of interest, and each database should have a set of goals defined for its use. For example, the Internet Game Database (IGDB) has a goal “to gather, preserve, and distribute knowledge about games” (IGDB, 2016). A completely different database is EWG’s Skin Deep Cosmetic Database, whose mission is “to use the power of information to protect human health and the environment” (EWG’s Skin Deep Cosmetic Database, 2016). Libraries may have more generalized databases for patrons to work with; however, it is still important to be familiar with the goals of each database in use. Understanding the goals, such as the EBSCO Library Literature & Information Science Full Text database’s goal of providing “broad coverage” of library studies, helps users decide whether the database might contain information of use to them.
The potential users are another important aspect of database design. Using the example of the IGDB, one would want to find out from video gamers what types of information they would be interested in finding, in what formats, how easy the database should be to search, and other aspects of video gaming. It makes little sense to talk to people who have never played video games and have no interest in ever playing them; it might, however, make sense to talk to people who play a lot of board games, since the domains are related. Thoroughly understanding the users will inform not only the overall design, but the search and storage mechanisms as well.
The representation of information is probably one of the most critical aspects of database design because it determines whether the information will be found at all. As mentioned previously, understanding what users will want from the database helps determine how information should be represented in it. Assuming that full-text searching is more costly than exposing just the terms and concepts that represent a document, describing the document using only those terms and concepts allows the document to be found quickly and, perhaps more importantly, along with all the other documents to which it is related.
There are many ways to represent information; the best is the one that helps users find as much information that addresses their needs as possible. Controlled vocabularies, attributes, disambiguation, pre-coordination, and post-coordination are all ways to help users find useful information.
Databases normally consist of “records”, each of which describes information about an entity, whether it is a book, an audio recording of a symphony, a painting, or even a person (as in the records of a student at a particular university). Records are a form of “metadata”, which is “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource” (Weedman, 2008). Records consist of recorded values of attributes of an entity, such as Author, Date, and Subject. A set of records constitutes a database, and it is these records that we search when looking for information contained within the database (Meadow et al., 2007, chapter 2). We can use controlled vocabularies to assign attributes like “Subject” to a record, but it must be done consistently so that other documents assigned the same subject logically belong to the same class (Lancaster, 2003, chapter 6). Controlled vocabularies like the Library of Congress Subject Headings (LCSH) have very precise definitions for every subject they describe. Since many libraries use LCSH terms as Subject attributes for their own materials, consistency in subject assignment is maintained not only within the walls of the institution, but also across other institutions from which material may be accessed, such as through interlibrary loan. This standardization of subject terms means the user can be reasonably assured that materials indexed by the same subject attribute will be related.
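To make this concrete, here is a minimal sketch in Python of a record whose Subject attribute is constrained to a controlled vocabulary. The field names and vocabulary terms are my own invented examples, not drawn from any particular system.

    # A record with attributes such as Author, Date, and Subject, where
    # Subject must come from a controlled vocabulary (terms invented here).
    CONTROLLED_VOCABULARY = {"Astronomy", "Chemistry", "Library science"}

    def make_record(record_id, author, date, subjects):
        """Build a record, rejecting subject terms outside the vocabulary."""
        invalid = set(subjects) - CONTROLLED_VOCABULARY
        if invalid:
            raise ValueError(f"Uncontrolled subject terms: {invalid}")
        return {"id": record_id, "author": author,
                "date": date, "subjects": set(subjects)}

    record = make_record(1, "Smith, J.", "2015", ["Astronomy"])

Rejecting terms outside the vocabulary is exactly what keeps subject assignment consistent across records.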
Records can be indexed using one or more of their attributes, which allows them to be searched more quickly. Care must be taken to use terms and concepts that are not ambiguous. The classic example “She liked candy more than her mom” shows the confusion that can arise from ambiguity: does “she” like candy more than her mom likes candy, or does she like candy more than she likes her mom? Likewise, if a book on the planet Mercury were assigned the same subject as a book on the chemical mercury, the ambiguity in the word would make any search results on mercury less helpful than if less ambiguous terms had been used.
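The speed-up from indexing can be sketched the same way: an inverted index maps each (disambiguated) subject term to the set of record IDs carrying it, so a subject lookup never has to scan every record. The records below are invented for illustration.

    from collections import defaultdict

    records = [
        {"id": 1, "subjects": {"Mercury (Planet)"}},
        {"id": 2, "subjects": {"Mercury (Element)"}},
    ]

    # Build an inverted index: subject term -> set of record IDs.
    subject_index = defaultdict(set)
    for rec in records:
        for term in rec["subjects"]:
            subject_index[term].add(rec["id"])

    # Disambiguated terms keep the planet and the element apart.
    print(subject_index["Mercury (Planet)"])   # {1}
    print(subject_index["Mercury (Element)"])  # {2}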
Finally, ‘pre-coordination’ of terms exists to help researchers find the correct subject headings to use. Pre-coordination means that someone took the time to put common terms together, like Philosophy—History. This helps researchers when they don’t know which subject heading terms they should be using, because it allows them to recognize the correct terms when browsing the pre-coordinated headings used in library catalogs (Mann, 2005, chapter 2). Post-coordinate terms are terms that can be combined with an “AND” when using computerized catalogs to find the intersection of terms already known to the researcher. So, for example, if I want to find documents on the history of Great Britain, I can do a search on “History” AND “Great Britain”, as in the toy sketch below.
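Here, post-coordination is simply set intersection over the index; the index contents are invented.

    subject_index = {
        "History": {1, 2, 3},
        "Great Britain": {2, 3, 4},
    }

    def and_query(index, *terms):
        """Return the IDs of records indexed under every term."""
        sets = [index.get(term, set()) for term in terms]
        return set.intersection(*sets) if sets else set()

    print(and_query(subject_index, "History", "Great Britain"))  # {2, 3}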
Principles of Querying
There are two key things to know if you want to be able to search effectively: the design of the databases you are using, and some searching strategies. Understanding the design of the database and how the data is structured will help when composing queries (Weedman, 2008) and when sifting through results. For example, if the database allows Boolean logic to include or exclude terms, or allows expressions that choose partial sets of records using specific commands, it becomes possible to search on several terms simultaneously (Hock, 2013, chapter 3). Also, if the ranking algorithms are understood, it becomes more efficient to find the most relevant results for a given query. When searching the Internet using Google, one of the ranking signals is how often a returned result (or webpage) is linked to by other webpages; the idea is that the page that is linked to most often is of higher importance than pages that are rarely linked to. There may be instances when this is a relevant selection criterion, but in general it can seem arbitrary, and it is quite possible that the top-ranked pages are linked to for reasons having nothing to do with the current query criteria. The number of links is just one feature that can be used for ranking results; others include word frequency (i.e., how many times the words in the query appear in a document), the rarity of a word, the position of a word (for example, in titles versus paragraph bodies), and even the size of the font used (Witten et al., 2007, chapter 4).
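To illustrate just one of those features, here is a toy word-frequency ranker in Python. It is a deliberately simplified sketch, not how Google or any real engine actually ranks, and the documents are invented.

    def tf_score(query_terms, document):
        """Score a document by how often the query terms appear in it."""
        words = document.lower().split()
        return sum(words.count(term.lower()) for term in query_terms)

    docs = {
        "doc1": "mercury is the smallest planet and mercury orbits fastest",
        "doc2": "the element mercury is a liquid metal",
    }

    query = ["mercury", "planet"]
    ranked = sorted(docs, key=lambda d: tf_score(query, docs[d]), reverse=True)
    print(ranked)  # ['doc1', 'doc2']

A real engine would combine this score with the other features above, such as link counts, term rarity, word position, and font size.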
Obviously, if a specific controlled vocabulary is used, it is helpful to be familiar with that vocabulary. The way data is stored is also useful to know. For example, if names are stored last-name-first instead of first-name-last, a query on first name followed by last name would not turn up any results, and the query would need to be reissued the other way around, last name first.
Librarians are also super-searchers. We know how to apply different search strategies depending on our goals. We know how to use the information returned from a query to formulate a more targeted query, for example to learn the common terminology for a specific subject area. Boolean logic allows us to include only as much information as we want and to exclude the terms and attributes we don’t. We can use a database of databases, Ulrich’s Periodicals Directory, to identify a periodical and the database services that provide access to it; we can use the Foundation Directory Online when we need to find money for our programs and institutions; and we know which specific databases to query for a given information need. We know how to use the tools available, whether it is the index at the back of a book or a tag cloud on a specific website, to get the information. Finally, we know how to mine search results for the right search terms for the next, more focused search.
Evaluation
The ability to evaluate the effectiveness of a database is another important skill for librarians when performing information retrieval tasks. We need to be able to tell whether the system is working properly.
One way to do this is to determine “coverage”: the percentage of the information published during a specific time frame that is available in the database. Probably the most common method is to take reliable bibliographies and check them against the database. This can be quite time consuming, and it can be difficult to know whether you are actually working with a “comprehensive” bibliography.
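In code, the coverage check itself is a simple set comparison; the titles here are invented placeholders.

    # What fraction of a trusted bibliography appears in the database?
    bibliography = {"Title A", "Title B", "Title C", "Title D"}
    database_holdings = {"Title A", "Title C", "Title E"}

    covered = bibliography & database_holdings
    coverage = len(covered) / len(bibliography)
    print(f"Coverage: {coverage:.0%}")  # Coverage: 50%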
Another, more common way to tell whether a database is working properly is to use measures of precision and recall. Precision is the ratio of relevant material retrieved during a search to the total material retrieved, while recall is the ratio of relevant material retrieved to the total relevant material available in the system (Meadow et al., 2007, chapter 16). Looking at the database in its entirety, if the precision of a specific query is 50%, half of the returned results are relevant, while a recall of 50% means that only half of the relevant documents were returned. In general, an increase in one measure results in a decrease in the other, so one should decide whether improving precision or recall matters more when evaluating the effectiveness of a system. It should be noted, however, that it is usually overkill to shoot for 100% on either measure unless it is required that 100% of relevant documents be found (such as during a patent search).
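A minimal sketch of the two measures for a single query, assuming (as evaluators must) that the full set of relevant documents is somehow known; the document IDs are invented.

    retrieved = {1, 2, 3, 4}   # documents the query returned
    relevant = {3, 4, 5, 6}    # documents actually relevant

    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)  # fraction of results that are relevant
    recall = len(hits) / len(relevant)      # fraction of relevant items retrieved
    print(precision, recall)                # 0.5 0.5 (the 50%/50% case above)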
Coursework & Work Experience
I took several courses that helped me with this competency. First of all, INFO-202 was all about working with databases and creating records for them. I am including three documents from this course to demonstrate my ability to generate database goals, rules for adding records to a database, and the records themselves. In INFO-240, I learned how to make my own website findable by web crawlers by using good index terms in metadata. In INFO-247, I worked on assignments where I needed to find the correct subject terms (using LCSH) to label different, unrelated articles. I also worked with a team on a project to define the vocabulary for a weblog on what librarians wear (turns out, pretty much everything). As evidence that I understand how corporate databases are maintained, I include a presentation based on an interview with a corporate librarian; it describes how metadata is created for internal reports, which are kept in a large database. In INFO-246, I worked on defining classes based on statistical text mining; it can be argued that the classes, based on words and terms found in news articles, could serve as a vocabulary for later retrieval. And, almost ten years ago, I worked on a system, part of a scanner, that would automatically identify documents placed on the scanner. This system, called Biblio, used several types of learning algorithms, including Support Vector Machines (SVM), neural networks (NN), and statistical language processing (SLP), to classify both words and document structure (e.g., title, abstract, paragraph, image) and automatically extract metadata from the structured and unstructured documents that were scanned. My evidence includes a paper that was published in the International Journal of Document Analysis and Recognition (IJDAR) (Staelin et al., 2007). This paper demonstrates my knowledge of automatic information retrieval approaches.
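As an illustration of the kind of classification Biblio performed (a toy sketch with invented features and labels, not the actual Biblio code), a Support Vector Machine can be trained to assign document regions to metadata classes:

    from sklearn.svm import SVC

    # Each row describes a text region: [font_size, page_y_position, word_count].
    X_train = [[24, 0.95, 8], [10, 0.50, 120], [12, 0.90, 40], [10, 0.30, 200]]
    y_train = ["title", "body", "abstract", "body"]

    model = SVC(kernel="linear")
    model.fit(X_train, y_train)

    # A large-font, short region near the top of the page:
    print(model.predict([[22, 0.93, 10]]))  # likely ['title']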
Evidence
The first piece of evidence comes from a group project I worked on during the INFO-202 class, which I took in the spring of 2011. I include three documents, Group3_SOP.docx, Group_3_Rules.doc, and Vans_records.docx, which can be found on the evidence page. Taken together, they show my ability to develop a statement of purpose (or goals), create rules for the data allowed in a database, and write example records based on those goals and rules. The database was designed for an application for keeping track of the items in your refrigerator, both for inventory purposes and for tracking expiration/sell-by dates. The idea was that you could use a barcode reader both to enter data and to get reports on things like expired items, so the barcode itself was an important search field in the database. The files I have submitted as evidence are just a few of the several generated during the design phase of the database.
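A hedged sketch of what one inventory record might have looked like; the field names are my paraphrase of the design described above, not the exact fields from the submitted documents.

    from datetime import date

    record = {
        "barcode": "0123456789012",   # primary search key, read by the scanner
        "item": "Milk, 2%",
        "quantity": 1,
        "sell_by": date(2011, 4, 15),
    }

    def is_expired(rec, today=None):
        """Flag items past their sell-by date for the expiration report."""
        return (today or date.today()) > rec["sell_by"]

    print(is_expired(record))  # True for any date after 2011-04-15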
The next piece of evidence, “Present_Scrubbed.pptx”, also on the evidence page, contains a presentation I gave for INFO-247 in the spring of 2013, based on an interview I conducted with a corporate librarian. It shows that I understand how certain types of information are entered into corporate databases and tracked, and how a corporate database of technical reports is queried. These technical reports document inventions and academic papers (once they are cleared for publication by the patent attorneys) for internal use by researchers. The database is important not only for historical reasons, but also as a way for researchers to learn what has already been done and avoid “reinventing the wheel”. One of the most interesting aspects of this work was learning how keywords entered by researchers become metadata for each document, used later when searching. The keywords are not a controlled vocabulary but rather a folksonomy; the corporate librarian publishes a list of keywords already used in the database to help seekers find materials, and researchers apply appropriate keywords to their own submitted documents.
The final piece of evidence for this competency is “Biblio.pdf”, a paper published in the International Journal of Document Analysis and Recognition (IJDAR) in 2007 (Staelin et al., 2007). The paper reports work I did that is directly related to information retrieval. On this project, I wrote all the code for an automatic information retrieval system using machine-learning algorithms such as Support Vector Machines (SVM), neural networks, and Bayes’ theorem. The code was implemented as part of scanner software that searched scanned-in documents for information that could be used for metadata and document recognition. Most of the data I used for recognition were journal articles and legal documents. The metadata extraction and recognition were based on many different features, such as physical structure (columns, images, titles, subtitles, section titles, etc.) as well as word types and frequencies. During training, features from a set of training documents were fed into the various learning algorithms, which learned the feature values characteristic of each metadata class. During search, classes were assigned based on how closely the feature values matched a particular metadata class. As the results section of the paper shows, recall and a form of precision were used to measure how well Biblio’s classification engine worked, and therefore whether search results were relevant. In this case, recall was defined as correctly classified documents / (correctly classified + incorrectly classified documents), that is, the percentage of correctly classified documents. We measured precision with a measure called “specificity”, which captures whether words that are not part of the metadata (non-metadata) are classified as metadata or not. Markov and Larose (2007, page 109) define precision and recall similarly:
Precision = True Positives / (True Positives + False Positives): the correctly classified documents divided by all documents assigned to this class, correctly or not.
Recall = True Positives / (True Positives + False Negatives): the correctly classified documents divided by all documents that should have been assigned to this class.
In our paper, the “TrueNegativeCount” is the number of words that should not be classified as metadata and are in fact NOT classified as metadata (i.e., classified correctly). Specificity is then True Negatives / (True Negatives + False Positives): the true negative count divided by the true negative count plus the non-metadata words incorrectly classified as metadata. This measure gives the percentage of non-metadata correctly classified as non-metadata. I believe this shows my ability to evaluate the results of information retrieval tasks using measures such as recall and specificity, which is really another way of measuring precision.
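The three count-based measures can be sketched together; the counts below are invented for illustration, not taken from the paper’s results.

    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    def specificity(tn, fp):
        """Fraction of non-metadata correctly left unclassified as metadata."""
        return tn / (tn + fp)

    tp, fp, fn, tn = 80, 10, 20, 890
    print(precision(tp, fp))     # ~0.889
    print(recall(tp, fn))        # 0.8
    print(specificity(tn, fp))   # ~0.989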
Conclusions
Information retrieval is an important skill for librarians and other information professionals. We need to understand how to design databases, how to find information in those databases through querying, and how to evaluate them. I believe the evidence I’ve presented demonstrates my understanding of this competency, as well as my ability to create automatic systems that help classify information for later retrieval; such classification systems can match documents contained in a database based on the expected features of relevant documents. As a computer programmer, I have the skills necessary to write a database from the ground up. The design work from three different courses in the MLIS program takes that skill to a new level, as I can now design records based on the goals of a database and effectively evaluate its performance using measures like precision and recall.
References
EWG’s Skin Deep Cosmetic Database. (2016). About. Retrieved March 20, 2016, from http://www.ewg.org/skindeep/site/about.php
Hock, R. (2013). The extreme searcher’s internet handbook: A guide for the serious searcher, 4th edition. CyberAge Books. Medford, NJ.
IGDB. (2016). What is IGDB? Retrieved March 20, 2016, from https://www.igdb.com/faq
Lancaster, F.W. (2003). Indexing and abstracting in theory and practice, 3rd edition. F.W. Lancaster, Champaign, IL.
Mann, T. (2005). The Oxford guide to library research: How to find reliable information online and offline. Oxford University Press. New York, NY.
Markov, Z. and Larose, D.T. (2007). Data mining the web: Uncovering patterns in web content, structure, and usage. John Wiley & Sons, Inc. Hoboken, NJ.
Meadow, C.T., Boyce, B.R., Kraft, D.H., and Barry, C. (2007). Text information retrieval systems, 3rd edition. Emerald Group Publishing Limited. Bingley, U.K.
Rubin, R.E. (2010). Foundations of library and information science, 3rd edition. Neal-Schuman Publishers. New York, NY.
Staelin, C., Elad, M., Greig, D., Shmueli, O., & Vans, M. (2007). Biblio: automatic meta-data extraction. International Journal of Document Analysis and Recognition (IJDAR), 10(2), 113-126.
Weedman, J. (2008). Information retrieval: Designing, querying, and evaluating information systems. In: The portable MLIS: Insights from the experts, Haycock, K. and Sheldon, B.E., eds. Libraries Unlimited. Westport, CT.
Witten, I.H., Gori, M., and Numerico, T. (2007). Web dragons: Inside the myths of search engine technology. Morgan Kaufmann Publishers. San Francisco, CA.