Very interesting Dutch search company
Wageningen Universiteit Library has just upgraded its Collexis Search software to Collexis High Definition Search 6.0. The upgrade was motivated by the new version's extended features and improved usability.
In the age of Google search, why Collexis?
The search engine uses concepts rather than keywords. Before we go further, look at how it helps the library search.
The Collexis implementation benefits the user community by guiding users to the appropriate “entrance/path”: they either click on relevant words or enter a question into a search box, whereupon an appropriate entry is suggested. If the results are not specific enough, additional search suggestions are offered. The TUlib Instruction program teaches the visitor:
– How to set up a literature search.
– Which sources of information are best to use.
– Different search methods to find information.
– How to evaluate retrieved information.
– Where and how to obtain information.
– How to manage and use the literature references.
– How to maintain your subject knowledge in your discipline.
Karin Clavel, Product Researcher, explained, “TU Delft Library is using the Collexis software because of the significant added value it offers to library users. Specifically, the added value encompasses two factors: one, it makes it easier to enter into searches and two, it helps the user find pertinent information in a more efficient manner.”
Q: Collexis makes use of thesauri. What are they?
A: A thesaurus is a specialized vocabulary (“repository of knowledge”) for a community of interest, such as medicine or IT. It contains selected words, terms and concepts with their semantic relations in a hierarchical structure and can also contain synonyms and homonyms. The use of thesauri allows Collexis to identify concepts with multiple different forms, but which represent the same idea in a given field of study. As an example, in medicine the concept “drug” may also be called a Prescription, a Narcotic, a Pharmaceutical, a Pill, etc.
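To make the synonym idea concrete, here is a minimal sketch of a thesaurus lookup. It reuses the “drug” example from the answer above; the data structure and the function names (build_synonym_index, concept_for) are assumptions made for illustration and are not Collexis APIs.

```python
# Hypothetical toy thesaurus: preferred concept name -> list of synonyms.
THESAURUS = {
    "drug": ["prescription", "narcotic", "pharmaceutical", "pill"],
}


def build_synonym_index(thesaurus):
    """Map every term (preferred name or synonym) to its concept name."""
    index = {}
    for concept, synonyms in thesaurus.items():
        index[concept.lower()] = concept
        for synonym in synonyms:
            index[synonym.lower()] = concept
    return index


def concept_for(term, index):
    """Return the concept a term denotes, or None if the term is unknown."""
    return index.get(term.lower())


SYNONYM_INDEX = build_synonym_index(THESAURUS)
print(concept_for("Pill", SYNONYM_INDEX))
print(concept_for("Narcotic", SYNONYM_INDEX))
```

Both lookups resolve to the same concept, which is what lets a concept-based search treat different surface forms as one idea.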
Q: Why are concepts used instead of (key)words?
A: A thesaurus contains the concepts that are important for conceptually describing the documents (sources) of the application domain. Each concept has a name, the preferred term that is used for denoting that concept. The name of a concept should be clear and understandable, even if the concept is shown outside its semantic context. For each concept, synonyms can be described in the thesaurus. A synonym is a natural language variation that is used to denote the same concept. The concepts are organized into a hierarchy according to the specialization / generalization relation.
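The structure described here (a preferred term, synonyms, and a place in a specialization / generalization hierarchy) could be modelled roughly as follows. This is only a sketch: the Concept class, the ancestors helper and the example concepts are hypothetical and do not reflect how Collexis stores a thesaurus internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    name: str                                  # preferred term
    synonyms: List[str] = field(default_factory=list)
    parent: Optional["Concept"] = None         # more general concept, if any


def ancestors(concept):
    """Walk up the hierarchy and list the names of all broader concepts."""
    chain = []
    node = concept.parent
    while node is not None:
        chain.append(node.name)
        node = node.parent
    return chain


# Hypothetical three-level hierarchy, most general first.
substance = Concept("chemical substance")
drug = Concept("drug", ["pharmaceutical", "pill"], parent=substance)
analgesic = Concept("analgesic", ["painkiller"], parent=drug)

print(ancestors(analgesic))
```

A search for a general concept can then also pick up documents indexed under its more specific descendants.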
Q: What is a Collexis Fingerprint?
A: The content of a text is represented by a Fingerprint, a small and unique representation of the text. The Collexis Abstraction component uses a thesaurus to find concepts in a text or a query. It exploits the synonyms and the hierarchy of the thesaurus to recognize the concepts in the text and to estimate the relevance of the concept for denoting the text. A series of concepts with their relative weights are referred to as a conceptual fingerprint. A complex set of algorithms determines the selection and weight of each concept in the fingerprint. A typical fingerprint is only 400 bytes in size. Fingerprints from the same content source are usually stored together in one database, called a Collexion.
Q: How is the indexing performed?
A: Indexing is the process of creating a conceptual fingerprint from a text. In Collexis this automated indexing mechanism performs the following steps on the text: removing the stop words, normalizing the text, selecting concepts by comparison with the thesaurus, clustering the concepts and attaching a relative weight to the concepts by means of a set of algorithms and by measuring the specificity, similarity and frequency of the concepts.
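The indexing steps listed above can be sketched in a heavily simplified form. The example below assumes a tiny hypothetical stop-word list and term table, handles single-word terms only, and weights concepts purely by relative frequency. The actual Collexis algorithms (clustering, specificity and similarity measures) are not described in enough detail here to reproduce.

```python
import re
from collections import Counter

# Hypothetical stop-word list and thesaurus term table, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "for", "to", "in"}
TERM_TO_CONCEPT = {
    "drug": "drug",
    "pill": "drug",
    "pharmaceutical": "drug",
    "trial": "clinical trial",
    "study": "clinical trial",
}


def fingerprint(text):
    """Return {concept: relative weight} for the concepts found in text."""
    # Normalize: lower-case and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Select concepts by comparison with the thesaurus terms.
    counts = Counter(TERM_TO_CONCEPT[t] for t in tokens if t in TERM_TO_CONCEPT)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Attach a relative weight (here: share of all concept occurrences).
    return {concept: n / total for concept, n in counts.items()}


print(fingerprint("The pill and the drug in a trial"))
```

In this sketch “pill” and “drug” collapse into one concept, so it receives two thirds of the weight and “clinical trial” the remaining third.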
Q: How does Collexis generate its search results?
A: Collexis employs vector matching: comparing a search query with the Fingerprints from the records in a Collexion. The outcome is a very accurate and relevant list of content items and/or experts in the form of a list of records. It is also possible to over-specify a query (i.e., to use a considerable piece of text), thus adding context to the query. This context helps the system improve the accuracy of the query and return references to those content items that are contextually related. The system administrator can enlarge or reduce the set of returned documents by entering a threshold that indicates the minimum “distance” between the records returned and the query. Matching of a search query with Collexion records can be performed on multiple Collexions at a time.
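As a rough illustration of vector matching with a threshold, the sketch below represents fingerprints as concept-to-weight dictionaries and compares them by cosine distance. Both choices are assumptions: the answer above does not say which distance measure Collexis uses, and the threshold is treated here as a maximum allowed distance. The names cosine_distance and match are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two {concept: weight} fingerprints."""
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # nothing in common with an empty fingerprint
    return 1.0 - dot / (norm_a * norm_b)


def match(query_fp, collexion, max_distance=0.5):
    """Return (record_id, distance) pairs within the threshold, closest first."""
    scored = sorted(
        (cosine_distance(query_fp, fp), record_id)
        for record_id, fp in collexion.items()
    )
    return [(record_id, d) for d, record_id in scored if d <= max_distance]


# Hypothetical Collexion with three records.
collexion = {
    "doc1": {"drug": 1.0},
    "doc2": {"drug": 0.5, "clinical trial": 0.5},
    "doc3": {"patent": 1.0},
}
for record_id, distance in match({"drug": 1.0}, collexion):
    print(record_id, round(distance, 3))
```

The partially matching record doc2 is still returned, just further down the list, while doc3 shares no concept with the query and falls outside the threshold. Raising or lowering max_distance enlarges or reduces the result set.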
Q: What makes Collexis different?
A: For one thing, it makes use of thesauri for information retrieval. Thesauri differentiate Collexis from full text search engines. The high quality search is based on semantics that have been defined in a thesaurus: synonymous terms and terms in different languages are linked to a single concept; hierarchical relations between concepts, links between definitions and terms, and other semantic relationships are exploited in the search applications. They help to highlight those terms in a document or query that are meaningful to the searcher.
Additionally, Collexis’ matching technology is unique. The matching technology computes “distances” between the query and the content items that are being searched. This means that partially matching documents can be found too. It also implies that users do not have to construct a complicated (Boolean) search query, but can simply enter a free text search without the risk of getting “no results” because of the extensive search text. In fact, with matching technology more search text in general means better results. Moreover, the matching process is extremely fast.
There is yet another aspect which differentiates Collexis: manipulation to facilitate discoveries. This has to do with the fact that the Fingerprints generated by the software can be easily manipulated by the computer: they can be aggregated, associated, clustered, etc. These manipulations allow Collexis to also make available information that goes beyond the level of a single document. Information distributed over different documents can be made visible for the searcher; patterns manifest in a group of documents - e.g., a group of documents written by one author or belonging to a particular semantic category - can be found by aggregation.
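Aggregation can be pictured as combining several fingerprints into one. The sketch below sums concept weights over a group of documents (for example, everything written by one author) and renormalizes. This is one plausible way to aggregate, chosen for illustration, and the function name aggregate is hypothetical.

```python
from collections import defaultdict


def aggregate(fingerprints):
    """Combine {concept: weight} fingerprints into one normalized profile."""
    totals = defaultdict(float)
    for fp in fingerprints:
        for concept, weight in fp.items():
            totals[concept] += weight
    total = sum(totals.values())
    if total == 0:
        return {}
    return {concept: weight / total for concept, weight in totals.items()}


# Hypothetical fingerprints of two documents by the same author.
author_documents = [
    {"drug": 1.0},
    {"drug": 0.5, "clinical trial": 0.5},
]
print(aggregate(author_documents))
```

The resulting profile describes the group rather than any single document, and can itself be matched against queries or other profiles.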
Q: How does Collexis deal with low concept density documents or queries?
A: A standard option is to index a document without a thesaurus. This process incorporates most of the indexing steps (stop words, normalization, etc.), but will generate a fingerprint with word-based entries instead of concept entries. Since Collexis is able to work with multiple thesauri simultaneously, such a “free text” fingerprint can be used in addition to a thesaurus-based fingerprint and can take into account terms not present in the thesaurus. These word-based entries can be based on any number of consecutive words (bigrams, trigrams, etc.). Naturally, such a free text fingerprint does not offer the advantages of a thesaurus-based fingerprint, such as multilingualism and synonymy.
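A word-based fingerprint of this kind might look like the sketch below, which counts single words and bigrams after stop-word removal. The stop-word list, the frequency-based weighting and the function names are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "for", "to", "in"}


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def free_text_fingerprint(text, max_n=2):
    """Return {word or n-gram: relative weight} without using a thesaurus."""
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", text.lower())
        if t not in STOP_WORDS
    ]
    counts = Counter()
    # Note: n-grams are built after stop-word removal, so they may span
    # a position where a stop word was dropped.
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: c / total for term, c in counts.items()}


print(free_text_fingerprint("gene therapy for gene editing"))
```

Setting max_n=3 would add trigrams as well. Because the entries are literal words, “gene therapy” and a synonym or translation of it would not be linked, which is the trade-off the answer describes.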
Q: Where can Collexis be applied?
A: Collexis can be applied wherever it is important to retrieve information in a swift, easy and high quality manner, whether within an organization or externally. Typical application fields are knowledge discovery (drug discovery), policy making (trend analysis, unrevealed relationships, etc.) and competitor analysis (gap mining, comparison, searching patent databases).
Q: What can Collexis do when there is no thesaurus for our line of business?
A: We can build one with relative ease. Collexis offers tools that can analyze documents and generate candidate terms for inclusion in a thesaurus. Optionally, starting from existing lists of terms, a thesaurus can be expanded quickly.
Q: What are the limitations of Collexis?
A: Aside from practical considerations there is no real limit to the amount of content that can be fingerprinted by and stored in Collexis.
Q: How accurate is Collexis?
A: Because the Collexis fingerprint creation algorithm uses a thesaurus, it yields fingerprints that are extremely accurate. Consequently, the outcome of a search is an accurate and relevant list of content and/or experts. Over-specifying a query does not hurt the results, as it would with Boolean search techniques. Quite the contrary: using a sizable piece of text as a search term will only yield a better search result.
Q: How fast does it work?
A: The fingerprinting technique guarantees an unmatched combination of speed and performance. Since a typical fingerprint is only 400 bytes in size, the matching process is extremely fast. A collection containing 500,000 fingerprints will on average be matched in 20 milliseconds.
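Taking the quoted figures at face value, a quick back-of-the-envelope calculation shows what they imply for storage and throughput. The constants come from the answer above; the derived numbers are plain arithmetic, not vendor-published benchmarks.

```python
FINGERPRINT_BYTES = 400      # typical fingerprint size quoted above
FINGERPRINTS = 500_000       # collection size quoted above
MATCH_TIME_MS = 20           # average matching time quoted above

total_bytes = FINGERPRINT_BYTES * FINGERPRINTS
total_megabytes = total_bytes // 1_000_000
fingerprints_per_second = FINGERPRINTS * 1000 // MATCH_TIME_MS

print(f"Collection size: {total_megabytes} MB")
print(f"Implied throughput: {fingerprints_per_second:,} fingerprints/second")
```

In other words, the whole collection is about 200 MB, small enough to hold in memory, and the stated timing corresponds to comparing roughly 25 million fingerprints per second.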
Q: What databases does Collexis support?
A: Any database which can be converted to text entries – which is almost everything. Collexis can deal with situations where an organization has a multitude of databases with as many database systems, containing both structured and unstructured information. The system processes content from any type of database as well as information that is not stored in databases, such as web pages or e-mails.
Q: How does Collexis deal with different languages?
A: The concepts used in the conceptual fingerprinting algorithm are real-world entities rather than language-defined terms or phrases. Using a multilingual thesaurus, Collexis can match a query in one language with a fingerprint collection referring to information items in other languages.
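Cross-language matching follows once terms from several languages point at the same language-neutral concept. The sketch below uses a hypothetical term table with made-up concept identifiers (C001, C002) and a deliberately naive whitespace tokenizer. It illustrates the principle only and is not the Collexis data model.

```python
# Hypothetical multilingual thesaurus: (language, term) -> concept identifier.
TERM_TO_CONCEPT = {
    ("en", "drug"): "C001",
    ("en", "pharmaceutical"): "C001",
    ("nl", "geneesmiddel"): "C001",
    ("de", "arzneimittel"): "C001",
    ("en", "patent"): "C002",
    ("nl", "octrooi"): "C002",
}


def concepts_in(text, language):
    """Return the set of concept identifiers found in a text."""
    return {
        TERM_TO_CONCEPT[(language, word)]
        for word in text.lower().split()
        if (language, word) in TERM_TO_CONCEPT
    }


def shares_concept(query, query_lang, document, document_lang):
    """True if the query and the document have at least one concept in common."""
    return bool(
        concepts_in(query, query_lang) & concepts_in(document, document_lang)
    )


print(shares_concept("geneesmiddel", "nl", "a new drug", "en"))
print(shares_concept("octrooi", "nl", "a new drug", "en"))
```

The Dutch query term and the English document meet at the concept level, so no translation of the document itself is needed.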
Q: Is integration difficult?
A: Collexis can be easily integrated into any desired application. It supports open technology and comes with Java and .NET development kits.
Q: How long has Collexis been in business?
A: Since 1999, when the company was started in Geldermalsen, the Netherlands as a spin-off of the EU project SHARED.