Analysis & Evaluation

Text-Mining at Condé Nast:

An Answer to Consolidation and Control of Information


Vogue Cover, July 1962

Hadass Blank

LIS 698 Seminar and Practicum, Summer II 2010

Dr. Tula Giannini, Dean Pratt-SILS



I completed my Practicum experience with the Condé Nast Archive, the digital corporate library division of the Condé Nast publishing company. Upon my acceptance into this program, I wanted to learn how this archive used digital and information technology to both organize and enhance its materials. Drawing on what I have learned from courses at SILS thus far, and on my previous knowledge of art history, fashion, and magazine production, I hoped to translate my theoretical understanding of digital librarianship into real-world practice. During this experience, I worked alongside my supervisor Brian Cross, the Associate Director of Digital Assets and primary information architect for the Vogue Digital Archive Project and Keyworder program. He was also instrumental in the formation of Condé Nast’s budding text-mining software.

The Vogue Digital Archive Project was my main assignment at Condé Nast. Along with a team of freelancers, I manually entered keywords for each Vogue image ever published, based upon the specialized language the captions provided. After completing a “keywording” exercise that tested both my fashion knowledge and my cataloging capabilities, I was chosen to experiment with the new Keyworder program and write about its effectiveness and shortcomings. I also worked on creating the vocabularies associated with the fashion styles of Vogue images, which would eventually be added to the type-ahead language used in the company’s text-mining software. In effect, I was able to experience how Condé Nast predicts user searches. These exercises inspired me to focus my research paper on text-mining software and its place within a digital publishing company such as Condé Nast.

What is Text-Mining and Data-Mining?

Text-mining is simply the process of extracting high-quality information from text. This information is usually derived from user trends revealed by patterns in language. The practice also involves structuring the input text (most often by parsing it linguistically into a database), isolating the patterns within the structured data, and then evaluating and interpreting what is yielded. Typical text-mining tasks include text categorization, text clustering, concept extraction, production of granular taxonomies, and sentiment analysis.
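The structuring and pattern-isolating steps described above can be illustrated with a minimal sketch. It assumes a handful of invented captions and a tiny illustrative stop-word list; none of the data, names, or functions below come from the Condé Nast archive or its software.

```python
import re
from collections import Counter

# Hypothetical captions, invented for illustration only.
CAPTIONS = [
    "Silk evening gown with beaded bodice",
    "Wool coat over a silk blouse",
    "Beaded evening bag and silk gloves",
]

# A tiny, illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "with", "over", "of"}


def tokenize(text):
    """Structure raw text: lowercase it and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(documents):
    """Isolate a simple pattern: how often each meaningful term occurs."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return counts


frequencies = term_frequencies(CAPTIONS)
top_terms = frequencies.most_common(3)  # the most frequent terms overall
```

Term counting is only the simplest of the tasks listed above; categorization, clustering, and sentiment analysis build on the same structured representation.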

Although the idea of text-mining originated in the late twentieth century, the field is currently being accelerated by rapid technological change. Text-mining is, in fact, an interdisciplinary practice that draws upon information retrieval, data mining, and computational linguistics. Because most available information is stored as text, text-mining is considered to have high commercial value. Currently, text-mining is becoming popular with larger media companies, such as Condé Nast, which use it to distinguish information and to provide users with more successful searches by predicting and standardizing the language used. On the other side, editors of this software, like Brian Cross, benefit from cultivating the text-mining software and type-ahead languages because doing so gives them control of content and a streamlined archive.

Data-mining, similar to text-mining, can also be used to uncover patterns within data, while using smaller parts of information. The mining process can be ineffective if the provided textual samples are not representative of the overarching themes of the data. For instance, neither text-mining nor data-mining can recognize human nuances that may be present in the larger body of data if those patterns are not present in the sample being investigated. This is particularly true of archival documents belonging to Condé Nast, in which highly specific fashion information is often contained within separate captions and articles. This inability of the software to find patterns has led service providers, like Condé Nast, to adopt programs that predict user interactions with their digital offerings, and thus to define for themselves what should be “mined.”

Text-Mining with Condé Nast

The emerging world of text-mining is proving especially useful to publishers who maintain large databases of information, and thus require comprehensive indexing for accurate retrieval. In other words, the more organized the system, and the more specific the input of the search, the greater the accuracy and satisfaction of the output. To make this outcome more frequent, programmers are starting to supply this input to users in the form of type-ahead prompts and text-mining.
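A type-ahead prompt of the kind described above can be sketched as a prefix lookup against a controlled vocabulary. The class name, method names, and vocabulary terms below are hypothetical and chosen only for illustration; this is not a description of the software Condé Nast actually uses.

```python
from bisect import bisect_left


class TypeAhead:
    """Suggest controlled-vocabulary terms that start with what the user typed."""

    def __init__(self, vocabulary):
        # Sort by lowercase key so that all terms sharing a prefix are adjacent.
        self._entries = sorted((term.lower(), term) for term in vocabulary)
        self._keys = [key for key, _ in self._entries]

    def suggest(self, prefix, limit=5):
        """Return up to `limit` terms whose lowercase form starts with `prefix`."""
        prefix = prefix.lower()
        start = bisect_left(self._keys, prefix)  # first key >= prefix
        results = []
        for key, term in self._entries[start:]:
            if not key.startswith(prefix) or len(results) == limit:
                break
            results.append(term)
        return results


# Hypothetical vocabulary terms, invented for illustration.
type_ahead = TypeAhead(
    ["Haute Couture", "Halter Dress", "Hat", "Ready-to-Wear", "Handbag"]
)
suggestions = type_ahead.suggest("ha")
```

Because the user chooses from the suggested terms rather than typing free text, the search input arrives already standardized, which is the effect the paragraph above describes.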

Condé Nast has recently adopted the software program {confidential name}, which provides semantic cues to computers so that they can answer specific user inquiries from the text, without removing the publisher’s controls over public access. Condé Nast uses this software program for its Keyworder project. The company adopted the program when it started facing an overload of content and unstructured data. By applying a layer of semantically interpreted metadata alongside the content, {confidential name} can surface this information, making it more visible, understandable, organized, centralized, and ready for analysis. {Confidential name}’s technology automatically annotates the unstructured data, identifying context such as people, categories, and entities, which in theory outputs a standardized language. {Confidential name} also claims to identify nuance and meaning in content, preparing the output for various applications. At this time, however, it is almost impossible for a computer program to understand human nuance without specific directions. My position at Condé Nast and my role with the Vogue Digital Archive Project require me to act as the human editor who corrects its “mistakes,” while providing additional specific text-mining and type-ahead vocabularies.

Additionally, text-mining software can be used to build large files of information about specific people and events. The Special Events project I completed (as seen in the Nature of Work section) entailed this type of data organization. Since the mid-1960s, Condé Nast photographers have documented various events and uploaded them to the digital archive. The titles for these events have become inconsistent through the years, and it was my job to organize them. There were over 1,800 event entries, which included film screenings, book parties, fashion collections, galas, balls, and fragrance launches. Eventually, this consistent language will be added to the text-mining vocabulary, enabling more efficient user searches and requiring less human editing of the software outputs.
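The work of reconciling inconsistent event titles was done by hand, but the underlying idea can be sketched in code: reduce each title to a comparison key so that variants of the same event fall into one group for an editor to review. The titles and function names below are invented for illustration and are not drawn from the archive.

```python
import re
from collections import defaultdict


def match_key(title):
    """Reduce a title to lowercase words and digits separated by single spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def group_variants(titles):
    """Group titles that share the same comparison key."""
    groups = defaultdict(list)
    for title in titles:
        groups[match_key(title)].append(title)
    return dict(groups)


# Hypothetical event titles, invented for illustration.
EVENT_TITLES = [
    "Fragrance Launch - Paris",
    "fragrance launch, Paris",
    "FRAGRANCE LAUNCH  PARIS",
    "Book Party: New York",
]

variant_groups = group_variants(EVENT_TITLES)
```

A sketch like this only catches differences in case, spacing, and punctuation; choosing the preferred title for each group, and spotting variants that differ in wording, still requires human judgment.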

I also added PRISM controlled vocabularies, which represent a publishing industry standard for metadata, for my Sample Issue Reviews Project (as seen in the Nature of Work section). These entries will contribute to the efficiency of the data provided. After I completed my reviews, these sample issues were the first to be “mined” in the Vogue Keyworder program.

Text-Mining Using Semantics and Lexicons

In linguistics, semantics is the study of the relationship of words to their meanings. Research and development departments of companies such as Condé Nast are currently investigating text-mining techniques to further automate the analysis process, eventually eliminating the need for a human proofreader. Until recently, even after adopting an OCR process, websites predominantly used text-based searches that yielded results containing the specific words or phrases the user entered, without regard to relationships or nuance. Now, when using a semantic program, the text-mining software can separate content based upon its meaning and context, rather than just by a specific word. When using semantics, especially in a web format, the availability of machine-readable metadata enables automated agents and other software to access the internet more intelligently. In other words, cultivators of the databases would be able to perform tasks automatically and locate related information on behalf of the user.

Also in linguistics, the lexicon of a language is its stock of expressions, words, and vocabularies. In other words, a lexicon, which is closely related to semantics, is the language’s inventory of lexemes and the patterns in which they combine. For instance, efficient computing is consistently advancing as a field, and allows new forms of human-computer interaction, in addition to the use of a standardized natural language. There is a common perception that the future of human-computer interaction lies in the understanding of cultural themes, such as entertainment, aesthetics, and publishing. By studying the relationship between this natural language and effective information, and by understanding its computational treatment, providers like Brian Cross can treat this practice as crucial to the future development of the program.

Understanding Pattern Recognition to Predict User Trends

Pattern recognition identifies data combinations based on statistics, where the patterns to be classified are usually groups of measurements or observations. A complete pattern recognition system consists of a tracker that gathers the user observations to be classified or described, an extraction mechanism that computes numeric or symbolic information from the observations, and a description scheme that classifies or describes the observations using these extracted features. At Condé Nast, after the team receives the automatic text-mining results from {confidential name} and gathers the data from user case studies and usability testing, they still need to manually correct the computer-generated information into accurate language. They then analyze these findings to create their own vocabularies and patterns to dictate their target audience’s customary language.
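The three stages named above (gathering observations, extracting features, and describing them) can be sketched as three small functions. This is a minimal illustration under stated assumptions: the observations are invented search queries, the category terms are hypothetical, and a deliberately simple count-based scoring rule stands in for a real statistical classifier.

```python
from collections import Counter

# Hypothetical category vocabularies, invented for illustration.
CATEGORY_TERMS = {
    "garment": {"gown", "coat", "dress"},
    "event": {"gala", "launch", "screening"},
}


def gather(raw_lines):
    """Tracker stage: keep the non-empty queries, normalized to lowercase."""
    return [line.strip().lower() for line in raw_lines if line.strip()]


def extract(query):
    """Extraction stage: compute word counts as simple numeric features."""
    return Counter(query.split())


def describe(features):
    """Description stage: label with the best-matching category, if any."""
    scores = {
        category: sum(features[term] for term in terms)
        for category, terms in CATEGORY_TERMS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


# Hypothetical user queries, invented for illustration.
RAW_QUERIES = ["red evening gown", "   ", "Fragrance Launch gala", "vogue 1962"]

observations = gather(RAW_QUERIES)
labels = [describe(extract(query)) for query in observations]
```

The "unknown" label marks the cases that would fall to a human editor, which mirrors the manual correction step described above.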

Pattern recognition also extends to images, not just text. Condé Nast currently employs this technology for facial recognition in its Model Identification Projects (as seen in the Nature of Work section). This project entailed identifying models in runway shows using a DAM client application reminiscent of Adobe Bridge. It involved researching select Ready-to-Wear (RTW) and Haute Couture (HC) runway shows in New York and Paris, and tagging specific models in the Condé Nast database. With this project, I was able to use my metadata, taxonomy, and cataloging knowledge, as well as my art history and fashion background. I also successfully applied the facial recognition program to my Dior Slides project (as seen in the Nature of Work section).


Condé Nast’s text-mining software objectives included understanding and identifying web usage, user profiles, web analytics, and data streams. More specifically, publishing organizations have started dedicating their resources to tracking users’ behavior on their online databases to better understand and satisfy users’ needs. As a direct result, web usage mining tools were developed to help them use web logs to discover usage patterns and profiles. Many publishing companies treat this information as valuable evidence or as case studies for usability. In addition, with this data, companies like Condé Nast are better able to generate accurate text-mining languages that will best satisfy their target audiences.
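Web usage mining of the kind described above can be sketched as extracting search queries from log lines and counting them. The log format here (a user identifier followed by a request path) is a simplifying assumption made for illustration, as are the `/search` path and the `q` parameter; real server logs carry more fields and vary by platform.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Hypothetical log lines in an assumed "user path" format.
LOG_LINES = [
    "user1 /search?q=evening+gown",
    "user2 /search?q=Evening+Gown",
    "user1 /search?q=dior",
    "user3 /issue/1962-07",
]


def extract_queries(lines):
    """Return (user, query) pairs for every search request in the log."""
    pairs = []
    for line in lines:
        user, _, path = line.partition(" ")
        parsed = urlparse(path)
        if parsed.path != "/search":
            continue  # skip requests that are not searches
        for query in parse_qs(parsed.query).get("q", []):
            pairs.append((user, query.lower()))
    return pairs


search_pairs = extract_queries(LOG_LINES)
query_counts = Counter(query for _, query in search_pairs)
```

Counts like these show which terms users actually type, which is the evidence a publisher would draw on when deciding what to add to a type-ahead vocabulary.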

To create these knowledge-based platforms, Condé Nast’s text-mining software uses natural-language processing while planning, designing, and developing a comprehensive media product that would satisfy their target audiences’ needs. Condé Nast uses its text-mining software to link concept terms in processed text to a related thesaurus. The company trusts that these text- and data-mining products can only become more useful if the features of a subject classification system are incorporated into text-mining techniques and products. In other words, the specialized role of human language technologies in the library and information science venue has the potential to become standardized, and thus predicted.
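Linking terms in processed text to a thesaurus can be sketched as a lookup from variant words to preferred concept terms. The thesaurus entries and function name below are invented for illustration and do not reflect Condé Nast’s actual vocabularies or software.

```python
import re

# Hypothetical thesaurus: each variant maps to one preferred concept term.
THESAURUS = {
    "frock": "dress",
    "gown": "dress",
    "dress": "dress",
    "topcoat": "coat",
    "overcoat": "coat",
    "coat": "coat",
}


def link_concepts(text):
    """Return the sorted preferred terms for every thesaurus word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({THESAURUS[token] for token in tokens if token in THESAURUS})


# A hypothetical caption, invented for illustration.
concepts = link_concepts("A silk frock under a wool topcoat")
```

With this mapping in place, a search for "dress" would also retrieve a caption that only mentions a "frock", which is how a subject classification system can make mined text more useful.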



Brunelli, Roberto.  Template Matching Techniques in Computer Vision: Theory and Practice. Wiley Publications: MA, 2009.

Haravu, L.J. and A. Neelameghan. “Text Mining and Data Mining in Knowledge Organization and Discovery: The Making of Knowledge-Based Products.” Knowledge Organization and Classification in International Information Retrieval. Ed. Nancy J. Williamson and Clare Beghtol. Binghamton, NY: Haworth, 2003.

Hawwash, Basheer and Olfa Nasraoui. “Mining and Tracking Evolving Web User Trends from Large Web Server Logs.” Statistical Analysis and Data Mining. Vol. 3 (2). Wiley Periodicals, Inc: MA. 03/11/2010. Pg. 106-125.

Roe, David. “Nstein’s TME 5.0: Optimize Your Web Content for the Semantic Web.” CMSWire. Simpler Media Group, Inc. 6/08/2009.

Valitutti, Alessandro, and Carlo Strapparava. “Developing Affective Lexical Resources.” Psychology Journal 2 (1): 2004. Pg. 61-83.

**Note: Because this website is open to the public, I have omitted all confidential company information and specific software names.


Hadass Blank,
Aug 17, 2010, 10:14 AM