Information Retrieval

Overview

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). The term "unstructured data" refers to data which does not have clear, semantically overt, easy-for-a-computer structure. IR is also used to facilitate "semi-structured" search, such as finding a document where the title contains Java and the body contains threading.

The field of information retrieval also covers supporting users in browsing or filtering document collections, or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. Given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class(es), if any, each of a set of documents belongs to.

Terminology

Terms

docID

Document Frequency

Precision

Recall

Relevance Contingency Table

Effectiveness

Relevance

Corpus

Tokenization

Stop Words

Term-Document Matrix

Inverted Index

Permuterm Index

k-gram Index

Post-Filtering

Jaccard Coefficient

Document Analysis

Metadata Fields

Zones

Weighted Zone Scoring (Ranked Boolean Retrieval)

Relevance Feedback

Query Expansion

Vector Space Scoring Model

Definitions

Each document has a unique serial number, known as the document identifier (docID).

The number of documents which contain each term.

What fraction of the returned results are relevant to the information need?

What fraction of the relevant documents in the collection were returned by the system?

                 Relevant                Non-Relevant
Retrieved        True Positives (tp)     False Positives (fp)
Non-Retrieved    False Negatives (fn)    True Negatives (tn)

Precision: P = tp / (tp + fp)

Recall: R = tp / (tp + fn)

Accuracy: A = (tp + tn) / (tp + tn + fp + fn)

F-measure: F = 1 / (α(1/P) + (1 − α)(1/R)), where the weight α ∈ [0, 1] sets the trade-off between (1) Precision and (2) Recall; α = 0.5 gives the balanced F1 = 2PR / (P + R).
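The four measures above can be sketched in Python; the counts in the example are made up purely for illustration:

```python
def precision(tp, fp):
    # Fraction of returned results that are relevant.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of relevant documents that were returned.
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f_measure(p, r, alpha=0.5):
    # Weighted harmonic mean; alpha = 0.5 gives the balanced F1 = 2PR / (P + R).
    return 1 / (alpha / p + (1 - alpha) / r)

# Hypothetical counts: 40 relevant retrieved, 10 non-relevant retrieved,
# 20 relevant missed, 930 non-relevant correctly excluded.
p, r = precision(40, 10), recall(40, 20)
print(p, r, accuracy(40, 930, 10, 20), round(f_measure(p, r), 3))
```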

A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need.

A collection of texts (documents) over which the retrieval system operates; also called the document collection.

  • Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.

  • A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

  • A type is the class of all tokens containing the same character sequence.

  • A term is a (perhaps normalized) type that is included in the IR system’s dictionary.

  • Stop words are extremely common words which appear to be of little value in helping select documents matching a user need; they are excluded from the vocabulary entirely.

  • They are collected into a stop list, the members of which are then discarded during indexing.
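The tokenization and stop-list steps above can be sketched as follows; the regular expression and the tiny stop list are illustrative assumptions, not from the text:

```python
import re

# A tiny illustrative stop list; real systems use longer, collection-specific lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def tokenize(text):
    """Chop a character sequence into lower-cased tokens, discarding punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def terms(text):
    """Tokens with stop-list members discarded, as done during indexing."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]

print(terms("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```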

Procedure

    1. Collect the documents to be indexed.

    2. Tokenize the text.

    3. Do linguistic preprocessing of tokens.

    4. Index the documents that each term occurs in.
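Steps 1–4 can be sketched as a minimal in-memory inverted index, assuming the documents have already been tokenized and linguistically preprocessed:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping docID -> list of preprocessed terms.
    Returns term -> sorted list of docIDs (the postings list)."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: ["new", "home", "sales"], 2: ["home", "prices", "rise"]}
index = build_inverted_index(docs)
print(index["home"])   # [1, 2]
# Document frequency of a term is simply the length of its postings list.
print(len(index["home"]))  # 2
```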

  • For Wildcard Search. Place a special symbol $ into the character set to mark the end of a term, and construct a permuterm index, in which the various rotations of each term (augmented with $) all link to the original vocabulary term.

  • Ex. Hello$, ello$H, llo$He, lo$Hel, o$Hell, $Hello
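A minimal sketch of generating the permuterm rotations shown above:

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; in a permuterm index each rotation
    links back to the original vocabulary term."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
```

A wildcard query such as m*n is then answered by rotating it so the * falls at the end (n$m*) and doing a prefix lookup over the stored rotations.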

  • For Wildcard Search. A k-gram is a sequence of k characters using a special character $ to denote the beginning or end of a term.

  • Ex. castle: $ca, cas, ast, stl, tle, le$.

For Wildcard Search. A simple string-matching operation that checks the terms enumerated by the Boolean query on the 3-gram index against the original wildcard query, filtering out false matches.

J(A, B) = |A ∩ B| / |A ∪ B|. Computing the Jaccard Coefficient requires the set of k-grams in q (the query) and t (a vocabulary term). As the scan proceeds, the IR system moves from one vocabulary term t to the next, computing the Jaccard Coefficient between q and t on the fly. If the coefficient exceeds a preset threshold, t is added to the output; if not, the system moves on to the next term in the postings.
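A minimal sketch, assuming 3-grams with $ boundary markers as described above:

```python
def kgrams(term, k=3):
    """Set of k-grams of term, with '$' marking the start and end."""
    augmented = "$" + term + "$"
    return {augmented[i:i + k] for i in range(len(augmented) - k + 1)}

def jaccard(a, b):
    """Jaccard Coefficient of two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

q, t = kgrams("castle"), kgrams("casle")  # e.g. matching a misspelled term
print(sorted(q))
print(jaccard(q, t))  # 0.375
```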

  • Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document.

  • By metadata, we mean specific forms of data about a document, such as its author(s), title and date of publication.

  • This metadata would generally include fields such as the date of creation and the format of the document, as well as the author and possibly the title of the document.

  • Zones are similar to fields, except the contents of a zone can be arbitrary free text. Whereas a field may take on a relatively small set of values, a zone can be thought of as an arbitrary, unbounded amount of text.

  • Ex. Document Title, Abstract, Affiliation

    • Weighted zone scoring in such a collection would require three weights g1=0.2, g2=0.3 and g3=0.5, respectively corresponding to the Author, Title and Body zones.

    • It corresponds to an application in which a match in the Author zone is least important to the overall score, the Title zone somewhat more, and the Body contributes even more.

    • If the term "smart city" were to appear in the Title and Body zones but not the Author zone of a document, the score of this document would be 0.3 + 0.5 = 0.8.

  • g_i = the weight of zone i; the weights sum to 1 (Σ g_i = 1).

  • s_i = the Boolean score (0 or 1) indicating whether the query matches zone i of the document.

  • The weighted zone score Σ g_i · s_i therefore lies in [0, 1].
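Using the example weights g1=0.2, g2=0.3, g3=0.5 above, weighted zone scoring can be sketched as follows; the substring-containment match is a simplification of the Boolean zone match:

```python
def weighted_zone_score(query, doc_zones, weights):
    """Sum of g_i * s_i, where s_i is 1 if the query matches zone i, else 0.
    Zone weights are assumed to sum to 1, so the score lies in [0, 1]."""
    return sum(g for zone, g in weights.items()
               if query in doc_zones.get(zone, ""))

weights = {"author": 0.2, "title": 0.3, "body": 0.5}
doc = {"author": "Alice",
       "title": "smart city planning",
       "body": "a survey of smart city sensor networks"}
print(round(weighted_zone_score("smart city", doc, weights), 2))  # 0.8
```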

The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set.

In particular, the user gives feedback on the relevance of documents in an initial set of results.

The basic procedure is:

    1. The user issues a (short, simple) query.

    2. The system returns an initial set of retrieval results.

    3. The user marks some returned documents as relevant or non-relevant.

    4. The system computes a better representation of the information need based on the user feedback.

    5. The system displays a revised set of retrieval results.
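Step 4 above is classically implemented with the Rocchio algorithm (not named in the text; shown here as an illustrative sketch with the commonly cited default weights alpha=1, beta=0.75, gamma=0.15): the query vector is moved toward the centroid of the documents marked relevant and away from the centroid of those marked non-relevant.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def rocchio(q, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant),
    with negative term weights clipped to zero (the usual convention)."""
    rel_c = centroid(relevant) if relevant else [0.0] * len(q)
    non_c = centroid(non_relevant) if non_relevant else [0.0] * len(q)
    return [max(0.0, alpha * qi + beta * r - gamma * s)
            for qi, r, s in zip(q, rel_c, non_c)]

q = [1.0, 0.0, 0.0]                       # original query vector
rel = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]  # docs marked relevant
non = [[0.0, 0.0, 1.0]]                   # docs marked non-relevant
print([round(w, 3) for w in rocchio(q, rel, non)])  # [1.0, 0.75, 0.225]
```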

  • Users give additional input on query words or phrases, possibly suggesting additional query terms.

  • The central question in this form of query expansion is how to generate alternative or expanded queries for the user.

  • The most common form of query expansion is global analysis, using some form of thesaurus.

  • For each term t in a query, the query can be automatically expanded with synonyms and related words of t from the thesaurus.

  • Use of a thesaurus can be combined with ideas of term weighting: for instance, one might weight added terms less than original query terms.
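A minimal sketch of thesaurus-based expansion with down-weighted added terms; the toy thesaurus and the weight values are assumptions (real systems use resources such as WordNet or an automatically derived thesaurus):

```python
# Toy thesaurus mapping a term to its synonyms and related words.
THESAURUS = {"car": ["automobile", "vehicle"], "house": ["home", "dwelling"]}

def expand_query(query_terms, weight=1.0, expansion_weight=0.5):
    """Return (term, weight) pairs: original terms keep full weight,
    thesaurus synonyms are added with a lower weight."""
    expanded = [(t, weight) for t in query_terms]
    for t in query_terms:
        expanded += [(syn, expansion_weight) for syn in THESAURUS.get(t, [])]
    return expanded

print(expand_query(["car", "rental"]))
# [('car', 1.0), ('rental', 1.0), ('automobile', 0.5), ('vehicle', 0.5)]
```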

    1. Boolean Retrieval

    2. Term Vocabulary & Postings Lists

    3. Dictionaries and Tolerant Retrieval

    4. Index Construction

    5. Index Compression

    6. Scoring, Term Weighting and the Vector Space Model

    7. Computing Scores in a Complete Search System

    8. Evaluation in Information Retrieval

    9. Relevance Feedback and Query Expansion