Data Mining

EXALEAD CloudView is used to identify, extract and process information about textual and multimedia data both inside the enterprise and on the Web. Specific types of data mining our customers are using to tap into the value of Big Data include text mining, Web data mining, multimedia analysis and sentiment analysis.

Specific Types of Data Mining

Text Mining

Text mining, or text analytics, is the process of analyzing text to identify and extract the meaningful data and patterns it contains at both an embedded and contextual level. This enriched information is used to make search results more relevant, to automatically classify and cluster data for navigation and filtering, and to support qualitative and quantitative analytics. It also makes it possible to integrate structured and unstructured data into a meaningful whole (for example, integrating CRM data with social media content or website activity logs).

Among vendors providing text analytics, EXALEAD is uniquely able to apply advanced processing on a massive scale using a minimal number of commodity servers. It is also unique in the modularity and configurability of its extensive semantic processing pipeline.


  • Reveal the hidden information intelligence in unstructured data
  • Add valuable context to structured data

Web Data Mining

In Web data mining, one seeks to identify, extract and process relevant Web content according to a specific crawl objective. For example, one may want to extract pertinent details from online supplier catalogs to validate, enrich and extend an internal parts database, or one may wish to glean real estate market intelligence from online classifieds (see the AKERYS case study).

To achieve optimal results for mining the world's first and foremost Big Data source, the Web, EXALEAD provides a discriminating crawl ecosystem that yields high-quality results while maximizing performance and minimizing the size of the index. It offers:

  • Comprehensive Data Capture
    The system can capture Web content in unstructured, semi-structured and structured form, including Deep Web content that is dynamically generated as a result of form input and/or database querying.
  • Qualitative Filtering
    The platform provides configurable qualitative filtering, for example, excluding certain document types, treating the content of a site as a single page to avoid crowding out other relevant sources (website collapsing), and detecting and applying special rules for duplicate and near-duplicate content.
  • High Performance
    To maximize performance, CloudView enables you to regulate the breadth and depth of a crawl according to your business needs and resources, and to employ a refined update strategy, zeroing in on pertinent new or modified content rather than re-crawling and re-indexing all content.
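
As an illustration of the near-duplicate detection mentioned under Qualitative Filtering, the following is a minimal sketch of one common approach: comparing documents by the overlap of their word shingles (Jaccard similarity). It is not CloudView's implementation, which this page does not describe. The function names, the shingle size and the threshold are assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts yield a single shingle; empty texts yield none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; treated as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, threshold=0.8):
    """Flag two documents as near-duplicates above a similarity threshold.

    The 0.8 default is an arbitrary illustrative value, not a recommendation.
    """
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold


if __name__ == "__main__":
    page_a = "The quick brown fox jumps over the lazy dog"
    page_b = "The quick brown fox jumps over the lazy cat"
    # 6 of the 8 distinct shingles are shared, so the similarity is 0.75.
    print(jaccard(shingles(page_a), shingles(page_b)))
    print(is_near_duplicate(page_a, page_b, threshold=0.7))
```

At Web scale, pairwise comparison like this is usually replaced by a hashing scheme such as MinHash, but the underlying similarity measure is the same.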

CloudView is also uniquely designed to help you avoid placing an undue load on the visited site or violating data ownership and privacy policies.
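
To make the idea of a considerate crawl concrete, here is a small sketch of how any crawler can honor a site's robots.txt rules using Python's standard urllib.robotparser module. It is a generic illustration rather than a description of CloudView's crawler. The user agent name, the URLs and the robots.txt content are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed in memory so that the example
# runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser, user_agent, url):
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    agent = "example-bot"  # hypothetical crawler name
    print(may_crawl(parser, agent, "https://example.com/catalog"))     # True
    print(may_crawl(parser, agent, "https://example.com/private/x"))   # False
    # Seconds to wait between requests, if the site declares a delay.
    print(parser.crawl_delay(agent))                                   # 5
```

A crawler would typically check may_crawl before each request and sleep for the declared delay between requests to the same host.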

Once your Web content is collected, EXALEAD CloudView primes it for use with the same robust semantic processing pipeline described above under Text Mining.

Multimedia Analysis

Multimedia content is the fastest-growing type of user-generated content, with millions of photos, audio files and videos uploaded to the Web and enterprise servers daily. Exploiting this type of content at Big Data scale is impossible if we must rely solely on human tagging or basic associated metadata, such as file names, to access and understand it.

Fortunately, EXALEAD CloudView integrates seamlessly with technologies like automatic speech-to-text transcription and object-recognition processing (content-based image retrieval), enabling clients to structure multimedia content from the inside out. This brings critical new accessibility to large-volume multimedia collections and enables the development of innovative applications in fields like medicine, media, publishing, environmental science, forensics and digital asset management.

Sentiment Analysis

CloudView's sentiment analysis framework uses semantic technologies to automatically discover, extract and summarize the emotions and attitudes expressed in unstructured content. This processing is sometimes applied to behind-the-firewall content like email messages, call recordings and customer/constituent surveys. More commonly, however, it is applied to the Web, the most comprehensive repository of public sentiment concerning everything from ideas and issues to people, products and companies.

Sentiment analysis on the Web typically entails collecting data from select Web sources (industry sites, the media, blogs, forums, social networks, etc.), cross-referencing this content with target entities represented in internal systems (services, products, people, programs, etc.), and extracting and summarizing the sentiments expressed in this cross-referenced content in the CloudView index.
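
The workflow above (collect, cross-reference, summarize) can be sketched in a few lines. The example below is a deliberately simplified, lexicon-based illustration and not CloudView's sentiment framework, which relies on semantic technologies this page does not detail. The word lists, entity names and sample documents are invented for the example.

```python
import re
from collections import defaultdict

# Toy sentiment lexicons (illustrative assumptions, not a real resource).
POSITIVE = {"great", "love", "excellent", "reliable"}
NEGATIVE = {"poor", "hate", "broken", "slow"}


def score(text):
    """Count positive words minus negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def summarize(documents, entities):
    """Cross-reference documents with target entities and total their scores.

    An entity is matched by a simple case-insensitive substring test, and the
    whole document's score is attributed to every entity it mentions.
    """
    summary = defaultdict(lambda: {"mentions": 0, "score": 0})
    for doc in documents:
        lowered = doc.lower()
        for entity in entities:
            if entity.lower() in lowered:
                summary[entity]["mentions"] += 1
                summary[entity]["score"] += score(doc)
    return dict(summary)


if __name__ == "__main__":
    # Hypothetical collected Web content and internal product names.
    collected = [
        "I love the Widget X, it is reliable.",
        "Widget X arrived broken and support was slow.",
        "Gadget Y is great.",
    ]
    for entity, stats in summarize(collected, ["Widget X", "Gadget Y"]).items():
        print(entity, stats)
```

The per-entity totals are the kind of summarized values that could then be indexed and exposed as facets or dashboard metrics.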

Once this knowledge base of sentiment data has been created, it can be exploited via full-text search and faceted navigation, quantitative dashboarding and freestyle exploratory analytics. See the Sentiment Analytics section on the Any-User Analytics page.