Classifying and Searching Hidden-Web Databases

Panos Ipeirotis, Assistant Professor
Stern School of Business, New York University


Abstract

Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces, so traditional search engines do not index this information. One way to facilitate access to "hidden-web" databases is through Yahoo!-like directories, which organize these databases manually into categories that users can browse. An alternative is a "metasearcher," which provides a unified query interface for searching many databases at once. As a step towards improving access to hidden-web databases, we have developed QProber, a system that automatically categorizes and searches autonomous, hidden-web databases. To categorize a database, QProber uses just the number of matches generated by a small number of query probes, which are derived using state-of-the-art machine learning techniques. To search over "uncooperative" hidden-web databases, QProber exploits the database categorization to extract a small, topically focused document sample from each database, and produces a statistical summary of the database contents from that sample. The content summaries can then be used during metasearching to select the most appropriate databases for a given query, a critical task for search scalability and effectiveness. Specifically, QProber identifies the most relevant databases for a query by exploiting both the database classification information and the extracted summaries. QProber produces high-quality database selection decisions, which in turn help return highly relevant search results.
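The two ideas in the abstract, classifying a database from probe match counts and selecting databases from sample-based content summaries, can be illustrated with a minimal Python sketch. This is not QProber's implementation. The probe lists, the function names, the "most matches wins" rule, and the product-of-fractions score are all assumptions made for illustration. In particular, the sketch ranks databases from summaries alone and does not combine them with classification information as the abstract describes.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical probes. The abstract says probes are derived with machine
# learning; these are hand-written stand-ins for illustration only.
PROBES: Dict[str, List[str]] = {
    "Health": ["cancer", "diabetes treatment"],
    "Computers": ["compiler", "operating system"],
}


def classify_by_probes(count_matches: Callable[[str], int],
                       probes: Dict[str, List[str]]) -> str:
    """Return the category whose probes generate the most matches.

    count_matches(query) stands in for submitting a query to the database's
    search interface and reading only the reported number of matches.
    The decision rule here is a simplifying assumption.
    """
    totals = {cat: sum(count_matches(q) for q in queries)
              for cat, queries in probes.items()}
    return max(totals, key=totals.get)


def build_summary(sample_docs: List[str]) -> Dict[str, float]:
    """Estimate, for each word, the fraction of sampled documents containing it."""
    doc_freq: Counter = Counter()
    for doc in sample_docs:
        doc_freq.update(set(doc.lower().split()))
    n = len(sample_docs)
    return {word: count / n for word, count in doc_freq.items()}


def rank_databases(query: str,
                   summaries: Dict[str, Dict[str, float]]) -> List[str]:
    """Rank databases by the product of per-word document fractions (assumed score)."""
    def score(summary: Dict[str, float]) -> float:
        result = 1.0
        for word in query.lower().split():
            result *= summary.get(word, 0.0)
        return result

    return sorted(summaries, key=lambda name: score(summaries[name]), reverse=True)
```

In use, `count_matches` would wrap a real search interface, and `sample_docs` would hold the documents retrieved while probing. The point of the sketch is that neither step requires the database to export its contents or any metadata.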