Download Abbreviation Dictionary

The United States Postal Service website contains a list of Primary Street Suffix Names and the corresponding list of standard abbreviations. The lookup table can be found at the website below:

Postal Addressing Standards, Appendix C: C1 Street Suffix Abbreviations

This information can be the basis of an Abbreviation Dictionary file. To create a new Abbreviation Dictionary file, create a new .txt file, rename it with a .dic extension, and then copy and paste the Abbreviation Dictionary below into it.
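As a minimal sketch of that step, the snippet below writes a few suffix pairs to a .dic file. The three pairs are standard USPS suffix abbreviations; the file name `street_suffixes.dic`, the helper `write_dictionary`, and the tab-separated one-pair-per-line layout are assumptions made for illustration, and the layout your application expects may differ.

```python
from pathlib import Path

# A few pairs from the USPS street suffix table (full name -> abbreviation).
SUFFIXES = {"AVENUE": "AVE", "BOULEVARD": "BLVD", "STREET": "ST"}


def write_dictionary(path, entries):
    """Write one 'FULL<TAB>ABBR' pair per line (assumed layout); return the path."""
    lines = [f"{full}\t{abbr}" for full, abbr in sorted(entries.items())]
    target = Path(path)
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    out = write_dictionary("street_suffixes.dic", SUFFIXES)
    print(out.read_text(encoding="utf-8"))
```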

Design. Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictionary of biomedical abbreviations. To test the coverage of the database, we used an independently created list of abbreviations from the China Medical Tribune.

Measurements. We measured the recall and precision of the algorithm in identifying abbreviations from the Medstract corpus. We also measured the recall when searching for abbreviations from the China Medical Tribune against the database.

Results. On the Medstract corpus, our algorithm achieves up to 83% recall at 80% precision. Applying the algorithm to all of MEDLINE yielded a database of 781,632 high-scoring abbreviations. Of all the abbreviations in the list from the China Medical Tribune, 88% were in the database.

Understanding biomedical literature is particularly challenging because of its expanding vocabulary, including the unfettered introduction of new abbreviations. An automatic method to define abbreviations would help researchers by providing a self-updating abbreviation dictionary and also facilitate computer analysis of text.

Nevertheless, the numerous lists of abbreviations covering many domains attest to broad interest in identifying them. Opaui, a web portal for abbreviations, contains links to 152 lists.4 Some are compiled by individuals or groups.5,6 Others accept submissions from users over the internet.7,8 For the medical domain, a manually collected published dictionary contains over 10,000 entries.9

Because of the size and growth of the biomedical literature, manual compilations of abbreviations suffer from problems of completeness and timeliness. Automated methods for finding abbreviations are therefore of great potential value. In general, these methods scan text for candidate abbreviations and then apply an algorithm to match them with the surrounding text. Most abbreviation finders fall into one of three types.

Another approach recognizes that the alignment between an abbreviation and its long form often follows a set of patterns.13,14,15 Thus, a set of carefully and manually crafted rules governing allowed patterns can recognize abbreviations. Furthermore, one can control the performance of the system by adjusting the set of rules, trading off the leniency with which a rule allows matches against the number of errors that it introduces.
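To make the leniency trade-off concrete, here is a small sketch of two hand-written rules. Both rules, the stop-word list, and the function names are hypothetical illustrations, not rules taken from any of the cited systems.

```python
# Hypothetical list of words a lenient rule may skip over.
STOP_WORDS = {"of", "the", "and", "in", "for", "to"}


def strict_rule(abbr, words):
    """Each abbreviation letter must be the initial of one of the last len(abbr) words."""
    tail = words[-len(abbr):]
    if len(tail) != len(abbr):
        return None
    if all(w[0].lower() == c.lower() for w, c in zip(tail, abbr)):
        return " ".join(tail)
    return None


def lenient_rule(abbr, words):
    """Like strict_rule, but stop words between matched words may be skipped."""
    matched = []
    i = len(abbr) - 1
    for w in reversed(words):
        if i < 0:
            break
        if w[0].lower() == abbr[i].lower():
            matched.append(w)
            i -= 1
        elif w.lower() in STOP_WORDS and matched:
            matched.append(w)
        else:
            return None
    return " ".join(reversed(matched)) if i < 0 else None


if __name__ == "__main__":
    words = "activities of daily living".split()
    print(strict_rule("ADL", words))
    print(lenient_rule("ADL", words))
```

The strict rule rejects "activities of daily living (ADL)" because "of" breaks the run of initials, while the lenient rule accepts it. The lenient rule finds more long forms but also admits more accidental matches.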

In their rule-based system, Pustejovsky et al. introduced an interesting innovation by including lexical information.14 Their insight is that abbreviations are often composed from noun phrases and that constraining the search to definitions in the noun phrases closest to the abbreviation will improve precision. With the search constrained, they found that they could further tune their rules to also improve recall.

Finally, there is one completely different approach to abbreviation search based on compression.16 The idea here is that a correct abbreviation gives better clues to the best compression model for the surrounding text than an incorrect one. Thus, a normalized compression ratio built from the abbreviation gives a score capable of distinguishing abbreviations.

System architecture. We used a machine-learning approach to find and score abbreviations. First, we scan text to find possible abbreviations, align them with their prefix strings, and then collect a feature vector based on eight characteristics of the abbreviation and alignment. Finally, we apply binary logistic regression to generate a score from the feature vector.
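The first stage of such a pipeline, scanning for candidates and pairing each with the text before it, might look roughly like the sketch below. The regular expression, the prefix window of two words per abbreviation character, and the name `find_candidates` are assumptions for illustration; the source does not give these details.

```python
import re

# Assumed candidate pattern: a short token in parentheses that starts with
# a letter or digit, such as "(AR)" or "(Pol-II)".
CANDIDATE = re.compile(r"\(([A-Za-z0-9][\w+\-/]{0,9})\)")


def find_candidates(text, max_words_per_letter=2):
    """Yield (abbreviation, prefix) pairs for each parenthesised token in text."""
    for match in CANDIDATE.finditer(text):
        abbr = match.group(1)
        words_before = text[:match.start()].split()
        window = max_words_per_letter * len(abbr)  # assumed prefix length
        prefix = " ".join(words_before[-window:])
        if prefix:
            yield abbr, prefix


if __name__ == "__main__":
    sentence = "We studied the androgen receptor (AR) in mice."
    print(list(find_candidates(sentence)))
```

Each pair would then be aligned, converted to a feature vector, and scored.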

*These features are used to calculate the score of an alignment using Equation 3. We identified syllable boundaries using the algorithm used in TeX [22]. The column indicates the weight given to each feature. The sign of the weight indicates whether that feature is favorably associated with real abbreviations.

Next we generated all possible alignments between the abbreviations and prefixes in our set of 1000. This yielded our complete training set, which consisted of (1) alignments of incorrect abbreviations, (2) correct alignments of correct abbreviations, and (3) incorrect alignments of correct abbreviations. We converted these alignments into feature vectors.
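One way to enumerate alignments is sketched below, under the assumption that an alignment assigns every abbreviation character, in order, to a matching character of the prefix. The source does not spell out its alignment procedure, so the exhaustive recursion and the name `all_alignments` are illustrative only; the number of alignments can grow quickly for long prefixes.

```python
def all_alignments(abbr, prefix):
    """Return every in-order, case-insensitive match of abbr characters to prefix positions.

    Each alignment is a tuple of prefix indices, one per abbreviation character.
    """
    abbr_l, prefix_l = abbr.lower(), prefix.lower()
    results = []

    def extend(a_idx, p_start, chosen):
        # All abbreviation characters placed: record the finished alignment.
        if a_idx == len(abbr_l):
            results.append(tuple(chosen))
            return
        # Try every later prefix position holding the next abbreviation character.
        for p_idx in range(p_start, len(prefix_l)):
            if prefix_l[p_idx] == abbr_l[a_idx]:
                extend(a_idx + 1, p_idx + 1, chosen + [p_idx])

    extend(0, 0, [])
    return results


if __name__ == "__main__":
    print(all_alignments("AR", "androgen receptor"))
```

For "AR" and "androgen receptor" this produces three alignments. Only the one that places "R" at the start of "receptor" would count as a correct alignment of a correct abbreviation; the other two would be incorrect alignments of a correct abbreviation.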

where p is the probability of seeing an abbreviation, X is the feature vector, and β is the vector of weights. Thus, training this model consists of finding the vector β that maximizes the difference between the positive and negative examples.
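Scoring with a fitted binary logistic regression model maps the weighted sum of the features to a probability through the logistic function. The sketch below shows that step only; the feature values and weights in the usage example are made up and are not the ones reported in the source.

```python
import math


def logistic_score(features, weights, intercept=0.0):
    """Return the logistic-regression probability for one feature vector."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    # Evaluate the logistic function in a form that cannot overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Hypothetical two-feature example; a negative weight penalises its feature.
    print(logistic_score([1.0, 0.5], [2.0, -1.0]))
```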

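We ran our algorithm against the Medstract gold standard (after correcting 6 typographical errors in the XML file) and generated a list of the predicted abbreviations, definitions, and their scores. With these predictions, we calculated the recall and precision at every possible score cutoff, generating a recall/precision curve. Recall is defined as the number of correct abbreviations found divided by the total number of abbreviations in the gold standard, and precision as the number of correct abbreviations found divided by the total number of abbreviations predicted.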
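The cutoff sweep can be sketched as follows. The function name, the data layout, and the toy predictions are assumptions for illustration and do not come from the Medstract evaluation.

```python
def recall_precision_curve(predictions, gold):
    """Compute (cutoff, recall, precision) for every distinct score.

    predictions: list of ((abbreviation, long_form), score) tuples.
    gold: set of correct (abbreviation, long_form) pairs.
    """
    curve = []
    for cutoff in sorted({score for _, score in predictions}, reverse=True):
        kept = {pair for pair, score in predictions if score >= cutoff}
        correct = len(kept & gold)
        # recall = correct / all gold pairs; precision = correct / all kept pairs
        curve.append((cutoff, correct / len(gold), correct / len(kept)))
    return curve


if __name__ == "__main__":
    preds = [
        (("AR", "androgen receptor"), 0.9),
        (("ER", "estrogen receptor"), 0.6),
        (("in", "intermediate"), 0.2),
    ]
    gold = {
        ("AR", "androgen receptor"),
        ("ER", "estrogen receptor"),
        ("PCR", "polymerase chain reaction"),
    }
    for row in recall_precision_curve(preds, gold):
        print(row)
```

Lowering the cutoff can only keep or raise recall, while precision may fall as lower-scoring predictions are admitted.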

In addition, we evaluated the coverage of the database against a list of abbreviations from the China Medical Tribune, a weekly Chinese-language newspaper covering medical news from Chinese journals.17 The website includes a dictionary of 452 commonly used English medical abbreviations with their long forms. We searched the database for these abbreviations (after correcting 21 spelling errors) and calculated the recall as the fraction of these abbreviations that were found in the database.

Abbreviations Predicted in Medstract Gold Standard. We calculated the recall and precision of the abbreviations found with every possible score cutoff. Some scores are labelled on the curve. When the score cutoff is 0.14, seven of the abbreviations the algorithm found were not identified in the gold standard but nevertheless looked correct (primary ethylene response element (PERE), basic helix-loop-helix (bHLH), intermediate neuroblasts defective (ind), Ca2+-sensing receptor (CaSR), GABA(B) receptor (GABA(B)R1), Polymerase II (Pol II), GABAB receptor (GABA(B)R2)). The arrow points to the adjusted performance if these abbreviations had been included in Medstract. The performance of the Acromed system on this gold standard, as reported in Pustejovsky et al.,14 is shown for comparison.

At a score cutoff of 0.14, the algorithm made eight errors. Seven of those errors are abbreviations missing from the gold standard: primary ethylene response element (PERE), basic helix-loop-helix (bHLH), intermediate neuroblasts defective (ind), Ca2+-sensing receptor (CaSR), GABA(B) receptor (GABA(B)R1), polymerase II (Pol II), and GABAB receptor (GABA(B)R2). The final error occurred when an unfortunate sequence of words in the prefix yielded a higher scoring alignment than the long form: Fas and Fas ligand (FasL).

Then we scanned all MEDLINE abstracts until the end of 2001 for abbreviations. This required 70 hours of computation using five processors on a Sun Enterprise E3500 running Solaris 2.6. In all, we processed 6,426,981 MEDLINE abstracts (only about half of the 11,447,996 citations had abstracts) at an average rate of 25.5 abstracts/second.
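As a quick check on these figures, the reported rate follows from the abstract count and 70 hours of elapsed time. The variable names below are ours.

```python
abstracts = 6_426_981
citations = 11_447_996
hours = 70

# Throughput over 70 hours of elapsed time, in abstracts per second.
rate = abstracts / (hours * 3600)
# Share of MEDLINE citations that had an abstract.
share = abstracts / citations

print(f"{rate:.1f} abstracts/second")
print(f"{share:.1%} of citations had abstracts")
```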

From this scan, we identified a total of 1,948,246 abbreviations from MEDLINE, and 20.7% of them were defined in more than one abstract. Only 2.7% were found in five or more abstracts; 2,748,848 (42.8%) of the abstracts defined at least 1 abbreviation and 23.7% of them defined 2 or more.

Of the nearly two million abbreviation/definition pairs, there were only 719,813 distinct abbreviations because many of them had different definitions (e.g., AR can stand for autosomal recessive, androgen receptor, amphiregulin, aortic regurgitation, or aldose reductase, among others). More than one definition was available for 156,202 abbreviations (21.7%).

Thus, we created a robust method for identifying abbreviations using supervised machine learning. The method uses a set of features that describe different patterns seen commonly within abbreviations. We evaluated it against the Medstract gold standard because it was easily available, it eliminated the need to develop an alternate standard, and it provided a reference point to compare methods.

The largest remaining source of error was from our strong assumption that the abbreviation must be inside parentheses and the long form outside. The algorithm missed seven abbreviations that immediately preceded the long form, which was inside parentheses. To handle this problem, the candidate finder should also allow this pattern. However, it is unclear how adding more candidates may impact the precision.
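A candidate finder that also accepts the reversed pattern could be sketched like this. The regular expression and the name `find_reversed_candidates` are hypothetical, and this has not been evaluated against any corpus.

```python
import re

# Assumed pattern: a single short token immediately followed by a parenthesised
# phrase of two or more words, e.g. "PCR (polymerase chain reaction)".
REVERSED = re.compile(r"\b([A-Za-z][\w\-]{1,9})\s+\(([^()]+\s[^()]+)\)")


def find_reversed_candidates(text):
    """Return (abbreviation, long_form) pairs where the long form is in parentheses."""
    return [(m.group(1), m.group(2).strip()) for m in REVERSED.finditer(text)]


if __name__ == "__main__":
    sentence = "We amplified DNA by PCR (polymerase chain reaction) overnight."
    print(find_reversed_candidates(sentence))
```

A pattern this loose also matches ordinary parenthetical remarks, such as "mice (see Figure 2)", so every extra candidate would still need to be scored. That is where the effect on precision would show up.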

Our precision in this evaluation was hurt by abbreviations missing from the gold standard. Our algorithm identified eight of these, and seven had scores higher than 0.14. Disregarding these cases yields a precision of 99% at 82% recall, which is comparable to Acromed at 98% and 72%.

Believing the algorithm to have sufficient performance, we ran it against all of MEDLINE and put the results in a database as an abbreviation server. During validation, we found that the server contained 88% of the abbreviations from the dictionary in the China Medical Tribune. Since this list was created independently of MEDLINE, the results demonstrate that this server contains most of the common abbreviations of interest to medical professionals. To improve the recall even further, Yu has shown that linking to external dictionaries of abbreviations can augment the ability of automated methods to assign definitions that are not indicated in the text.15