Crossing the Boundaries of Domains and Languages.


Preprocessing Tool for CriES

Description

The preprocessing tool performs the following steps on the Yahoo! Answers dataset:

  • Filter the questions and answers to the selected categories (Computer & Internet, Mathematics & Science and Health)
  • Extract topics from the dataset
  • Extract user graphs

The output of the tool will be:

  1. A subset of the dataset in XML
  2. A user graph modeling questioner-answerer relation in GraphML format
  3. A topic file consisting of 60 multi-lingual topics in TREC format

Download

The preprocessing tool is written in Java and can be downloaded as an executable jar file: cries_preprocessing.jar. If you are interested in the source code, please contact Philipp Sorg.

Documentation

The preprocessing tool is implemented in Java. You will need a Java 1.6 runtime environment to run the program.

Command to run the tool:
java -jar cries_preprocessing.jar -Dxml_file=<Yahoo! Answers XML file> -Doutput_dir=<output directory>

Comments:
- The preprocessing tool can handle gzipped XML input files (in this case, the file FullOct2007.xml.gz).
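If you want to read the same input in your own code, gzipped and plain XML can be handled with one code path. The following is a minimal sketch, not the tool's actual implementation: the class GzipAwareInput and its open method are hypothetical names, and the sketch assumes that checking the two gzip magic bytes is an acceptable way to detect compression.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of cries_preprocessing.jar.
class GzipAwareInput {

    // Opens a file and decompresses it on the fly when the content
    // starts with the gzip magic bytes 0x1f 0x8b; otherwise the raw
    // stream is returned unchanged.
    static InputStream open(String path) throws IOException {
        BufferedInputStream in = new BufferedInputStream(new FileInputStream(path));
        in.mark(2);            // remember the start of the stream
        int b1 = in.read();
        int b2 = in.read();
        in.reset();            // rewind so the caller sees the full content
        boolean gzipped = (b1 == 0x1f && b2 == 0x8b);
        return gzipped ? new GZIPInputStream(in) : in;
    }
}
```

The returned stream can be passed to any XML parser, for example GzipAwareInput.open("FullOct2007.xml.gz").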

The following output files will be generated:

  • cries_topics.xml This file contains the questions that are used as topics in the CriES challenge.
    • identifier = uri tag of the question in Yahoo dataset
    • title = subject tag in Yahoo dataset
    • description = content tag in Yahoo dataset
    • Additional information (if this information is used for retrieval, this must be stated in the submitted results)
      • category = The category of the topic
      • questioner = User id of the user who posted the question
      • answerer = User id of the user who posted the selected best answer to this question
  • cries_questions.xml A subset of the original Yahoo dataset (using the same XML schema) that contains all questions in the relevant categories.
  • cries_questioner_answerer_graph.xml.gz A user-user graph in GraphML XML format that contains edges from questioners (all users who posted questions) to answerers (the users who posted the best answer to a question posted by the questioner). This graph might be useful for computing a priori probabilities of users.
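One simple way to derive a user prior from the questioner-answerer graph is to count how often each user appears as the target of an edge. The sketch below is only an illustration under stated assumptions: it relies on the standard GraphML edge element with source and target attributes, it assumes that edges point from questioner to answerer as described above, and the class AnswererPrior with its methods is a hypothetical name, not part of the tool. Whether the graph stores repeated question-answer pairs as separate edges or as edge weights is not specified here, so the counts may need to be adapted.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of cries_preprocessing.jar.
class AnswererPrior {

    // Counts, for every user id, how many <edge> elements have it as target,
    // i.e. how many best answers the user gave (assumed edge direction).
    static Map<String, Integer> inDegree(InputStream graphml) throws XMLStreamException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(graphml);
        try {
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "edge".equals(r.getLocalName())) {
                    String target = r.getAttributeValue(null, "target");
                    if (target != null) {
                        Integer old = counts.get(target);
                        counts.put(target, old == null ? 1 : old + 1);
                    }
                }
            }
        } finally {
            r.close();
        }
        return counts;
    }

    // Turns the raw counts into relative frequencies that sum to 1.
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> prior = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            prior.put(e.getKey(), e.getValue() / (double) total);
        }
        return prior;
    }
}
```

To use it on the generated file, wrap the input in a java.util.zip.GZIPInputStream, since the graph is written as cries_questioner_answerer_graph.xml.gz. The streaming parser keeps memory usage low, which may matter if the graph is large.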

Please refer to our Evaluation Guidelines for instructions on how to submit your expert search results.

cries_automatic_eval.trec_rel.txt is a TREC-style relevance file that assigns each topic exactly one relevant user, namely the user who wrote the best answer to the topic question. This file can be used for testing and debugging, but because only one user per topic is marked as relevant, it will most likely heavily underestimate the values of the evaluation measures.
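For quick debugging, the relevance file can be loaded into a map from topic to relevant user. The sketch below assumes the common four-column TREC qrels layout (topic id, iteration, document id, relevance), with the user id in the document id column; the exact layout of cries_automatic_eval.trec_rel.txt should be checked before relying on it. QrelsReader and bestAnswerers are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of cries_preprocessing.jar.
class QrelsReader {

    // Reads whitespace-separated qrels lines of the assumed form
    //   <topic> <iteration> <user id> <relevance>
    // and returns, per topic, the user judged relevant (relevance > 0).
    static Map<String, String> bestAnswerers(Reader source) throws IOException {
        Map<String, String> relevant = new HashMap<String, String>();
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4) {
                continue; // skip blank or malformed lines
            }
            if (Integer.parseInt(fields[3]) > 0) {
                relevant.put(fields[0], fields[2]);
            }
        }
        return relevant;
    }
}
```

Comparing the top-ranked user of a run against this map gives a rough sanity check only, for the reason stated above.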

cries/preprocessing.txt · Last modified: 2010/05/10 13:46 by pso
© 2008 Institute AIFB, University of Karlsruhe & ISWeb, University of Koblenz.
All rights reserved.