Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15)

Information retrieval, Scientific archives -- Research, Database management, Data mining


Prior work proposed "Data Near Here" (DNH), a data search engine for scientific archives modeled on Internet search engines. DNH performs a periodic, asynchronous scan of each dataset in an archive, extracting lightweight features that are combined to form a dataset summary. During a search, DNH assesses the similarity of the search terms to the summary features and returns to the user, at interactive timescales, a ranked list of datasets for further exploration and analysis. We will demonstrate the search capabilities and ancillary metadata-browsing features for an archive of observational oceanographic data. Comparing search terms against complete datasets might seem ideal, but doing so would rule out interactive search speeds for archives of realistic size. We include an analysis showing that our summary-based approach gives a reasonable approximation of such a "complete dataset" similarity measure.
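
The scan-then-search flow described above can be illustrated with a minimal Python sketch. It is not the DNH implementation: the feature choice (a per-variable minimum and maximum), the scoring function, and all names (`Summary`, `summarize`, `range_score`, `search`, and the sample datasets) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    name: str
    ranges: dict  # variable name -> (min, max); an assumed summary feature


def summarize(name, records):
    """Scan a dataset once, keeping only per-variable min/max (hypothetical features)."""
    ranges = {}
    for rec in records:
        for var, value in rec.items():
            lo, hi = ranges.get(var, (value, value))
            ranges[var] = (min(lo, value), max(hi, value))
    return Summary(name, ranges)


def range_score(query, summary):
    """Return 1.0 when the ranges overlap, decaying toward 0 as the gap grows.

    The gap is scaled by the query width; this scaling is an assumption,
    not the similarity measure used by DNH.
    """
    q_lo, q_hi = query
    s_lo, s_hi = summary
    gap = max(0.0, s_lo - q_hi, q_lo - s_hi)
    width = max(q_hi - q_lo, 1e-9)  # guard against zero-width queries
    return 1.0 / (1.0 + gap / width)


def search(query, summaries):
    """Rank datasets by mean per-variable score, touching only the summaries."""
    results = []
    for s in summaries:
        scores = [
            range_score(q, s.ranges[var]) if var in s.ranges else 0.0
            for var, q in query.items()
        ]
        results.append((sum(scores) / len(scores), s.name))
    return sorted(results, key=lambda r: r[0], reverse=True)


# Invented sample archive: two tiny "datasets" of observations.
archive = {
    "cruise_a": [{"depth": 5.0, "temperature": 12.0},
                 {"depth": 50.0, "temperature": 8.0}],
    "station_b": [{"depth": 200.0, "temperature": 4.0},
                  {"depth": 300.0, "temperature": 3.5}],
}

# Done ahead of time (the periodic, asynchronous scan).
summaries = [summarize(name, records) for name, records in archive.items()]

# Done at query time: only the small summaries are consulted.
query = {"depth": (0.0, 20.0), "temperature": (10.0, 14.0)}
ranked = search(query, summaries)
for score, name in ranked:
    print(f"{name}: {score:.2f}")
```

The point of the sketch is the cost split: `summarize` reads every record but runs offline, while `search` reads only one small summary per dataset, which is what would keep query time independent of dataset size.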

