Back links: [[:cries|CriES Workshop]] - [[challenge|CriES Pilot Challenge]]

====== Preprocessing Tool for CriES ======

===== Description =====

The preprocessing tool performs the following steps on the Yahoo! Answers dataset:
  * Filter the questions and answers to the selected categories (Computer & Internet, Mathematics & Science, and Health)
  * Extract topics from the dataset
  * Extract user graphs

The output of the tool is:
  - A subset of the dataset in XML
  - A user graph modeling the questioner-answerer relation in GraphML format
  - A topic file consisting of 60 multilingual topics in TREC format

===== Download =====

The preprocessing tool is written in Java and can be downloaded as an executable jar file: {{:cries:cries_preprocessing.jar}}. If you are interested in the source code, please contact [[philipp.sorg@kit.edu|Philipp Sorg]].

===== Documentation =====

Because the preprocessing tool is implemented in Java, you will need a Java 1.6 runtime environment to run it.

Command to run the tool:\\
''java -jar cries_preprocessing.jar -Dxml_file= -Doutput_dir=''

Comments:
  - The preprocessing tool can handle gzipped XML input files (in this case the file FullOct2007.xml.gz)

The following output files will be generated:
  * **cries_topics.xml** This file contains the questions that are used as topics in the CriES challenge.
    * identifier = uri tag of the question in the Yahoo dataset
    * title = subject tag in the Yahoo dataset
    * description = content tag in the Yahoo dataset
    * Additional information (if this information is used for retrieval, this must be stated in the submitted results)
      * category = The category of the topic
      * questioner = User id of the user who posted the question
      * answerer = User id of the user who posted the selected best answer to this question
  * **cries_questions.xml** A subset of the original Yahoo dataset (using the same XML schema) that contains all questions in the relevant categories.
  * **cries_questioner_answerer_graph.xml.gz** A user-user graph in GraphML XML format that contains edges from questioners (all users who posted questions) to answerers (the users who posted the best answer to a question posted by the questioner). This graph might be useful for computing a priori probabilities (priors) of users.

Please refer to our [[evaluation_guideline|Evaluation Guidelines]] for instructions on how to submit your expert search results.

{{:cries:cries_automatic_eval.trec_rel.txt|}} is a TREC-style relevance file that assigns each topic exactly one relevant user, namely the user who wrote the best answer to the topic question. This file can be used for testing and debugging, but it will most probably heavily underestimate the values of evaluation measures, because every other user is treated as non-relevant for that topic.
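
As an illustration of how the questioner-answerer graph could be turned into user priors, the following Python sketch counts, for each user, the number of incoming edges and normalises the counts so that they sum to one. This is only one possible prior and not part of the preprocessing tool. It assumes that each GraphML ''edge'' element points from a questioner (''source'') to an answerer (''target'') and that node ids correspond to user ids; the function names ''answerer_priors'' and ''load_priors'' are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def answerer_priors(graphml_source):
    """Return {user_id: prior} for a GraphML file path or binary file object.

    Assumption: every <edge> goes from a questioner (source) to an
    answerer (target), and node ids are user ids.
    """
    in_degree = Counter()
    # iterparse streams the XML instead of building the whole tree first.
    for _event, elem in ET.iterparse(graphml_source, events=("end",)):
        # Strip an optional "{namespace}" prefix before comparing tag names.
        if elem.tag.rsplit("}", 1)[-1] == "edge":
            in_degree[elem.get("target")] += 1
            elem.clear()  # drop attributes of edges already counted
    total = sum(in_degree.values())
    if total == 0:
        return {}
    # Relative share of best answers, used here as a simple prior.
    return {user: count / total for user, count in in_degree.items()}


def load_priors(path="cries_questioner_answerer_graph.xml.gz"):
    """Read the gzipped graph file produced by the preprocessing tool."""
    with gzip.open(path, "rb") as handle:
        return answerer_priors(handle)
```

Calling ''load_priors()'' in the output directory would return a dictionary mapping user ids to priors. Users who never gave a best answer do not appear in it, so a retrieval model built on this sketch would have to decide how to smooth their prior.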