A Customized Web Spider for Why-QA Pairs Corpus Preparation
Research Areas: Semantic Web
Keywords: Non-Factoid questions, web crawler, Latent Dirichlet Allocation, Topic Modeling, Natural Language Processing
With the growth of research on improving non-factoid question answering systems, there is a need for an open-domain non-factoid dataset. Some datasets exist for non-factoid and even how-type questions, but none consists solely of open-domain why-type questions covering the full range of question formats. Why-questions play a significant role and are asked in every domain. They are more complex and harder to answer automatically because they seek the reasoning behind the task involved. They are prevalent, asked out of curiosity by real users, and their answers therefore depend on the users’ needs, knowledge, context and experience. This paper develops a customized web crawler that gathers why-questions, irrespective of domain, from five popular question answering sources on the Web, viz. Answers.com, Yahoo! Answers, Suzan Verberne’s open-source dataset, Quora and Ask.com. Along with each question, its category, document title and appropriate answer candidates are maintained in the dataset. The distribution of why-questions by type and category is also illustrated. To the best of our knowledge, this is the first sufficiently large dataset of 2000 open-domain why-questions with their relevant answers, and it should help stimulate research aimed at improving the performance of why-type non-factoid question answering systems (why-QAS).
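The collection step described above can be illustrated with a minimal Python sketch. It is not the crawler developed in the paper: the HTML markup (elements carrying a `question` class), the names `WhyQARecord`, `QuestionExtractor` and `extract_why_questions`, and the rule that a why-question starts with the word "why" are all assumptions made for illustration. It uses only the standard library and parses an inline sample page rather than fetching anything from the Web.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumption: a why-question is any question whose first word is "why".
WHY_PATTERN = re.compile(r"^\s*why\b", re.IGNORECASE)


@dataclass
class WhyQARecord:
    """One dataset row: question, category, document title, answer candidates."""
    question: str
    category: str
    title: str
    answers: list = field(default_factory=list)


class QuestionExtractor(HTMLParser):
    """Collects the text of elements whose class list contains 'question'.

    The class name is hypothetical; a real site needs its own selector.
    """

    def __init__(self):
        super().__init__()
        self._open_tag = None   # tag name of the question element being read
        self._buf = []          # text fragments of the current question
        self.questions = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._open_tag is None and "question" in classes:
            self._open_tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            text = " ".join("".join(self._buf).split())  # normalize whitespace
            if text:
                self.questions.append(text)
            self._open_tag = None


def extract_why_questions(html, category, title):
    """Return a WhyQARecord for every why-question found in the page."""
    parser = QuestionExtractor()
    parser.feed(html)
    parser.close()
    return [
        WhyQARecord(question=q, category=category, title=title)
        for q in parser.questions
        if WHY_PATTERN.match(q)
    ]


SAMPLE = """
<h2 class="question">Why is the sky blue?</h2>
<h2 class="question">How do I bake bread?</h2>
<h2 class="question title">why do cats purr?</h2>
"""

records = extract_why_questions(SAMPLE, category="Science", title="Sample page")
for record in records:
    print(record.question)
```

On the sample page the how-question is dropped and the two why-questions are kept, each with an empty list of answer candidates. Why-questions phrased differently (for example "What is the reason ...") would need a broader pattern than the one assumed here.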
 S. Verberne, L.W.J. Boves, N.H.J. Oostdijk and P.A.J.M. Coppen, “Data for question answering: the case of why”, 2006.
 shuzi, “GitHub - shuzi/insuranceQA: A question answering corpus in insurance domain”, 2015. [Online]. Available: https://github.com/shuzi/insuranceQA. [Accessed Feb. 9, 2021].
 D. Cohen, L. Yang, and W. B. Croft, “WikiPassageQA: A benchmark collection for research on non-factoid answer passage retrieval”, In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 1165-1168.
 A. Dulceanu, T. Le Dinh, W. Chang, T. Bui, D.S. Kim, M.C.Vu, and S. Kim, “PhotoshopQuiA: A corpus of non-factoid questions and answers for why-question answering”, In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018, pp. 2763-2770.
 Stack Overflow - Where Developers Learn, Share, & Build Careers. [Online]. Available: https://stackoverflow.com/. [Accessed Feb. 9, 2021].
 Adobe Support Community. [Online]. Available: https://forums.adobe.com/welcome. [Accessed Feb. 9, 2021].
 Graphic Design Stack Exchange. [Online]. Available: https://graphicdesign.stackexchange.com. [Accessed Feb. 9, 2021].
 Super User Stack Exchange. [Online]. Available: https://superuser.com. [Accessed Feb. 9, 2021].
 Adobe Photoshop Family. [Online]. Available: https://feedback.photoshop.com. [Accessed Feb. 9, 2021].
 K. Jiang, D. Wu and H. Jiang, “FreebaseQA: a new factoid QA data set matching Trivia-style question-answer pairs with freebase”, In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), June 2019, pp. 318-323.
 T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis and E. Grefenstette, “The narrativeqa reading comprehension challenge”, Transactions of the Association for Computational Linguistics, vol. 6, pp. 317-328, 2018.
 H. Hashemi, M. Aliannejadi, H. Zamani and W.B. Croft, “ANTIQUE: A non-factoid question answering benchmark”, In European Conference on Information Retrieval, Springer, Cham, 2020, pp. 166-173.
 Yahoo! Answers. [Online]. Available: https://answers.yahoo.com/. [Accessed Feb. 9, 2021].
 A. Colas, S. Kim, F. Dernoncourt, S. Gupte, D.Z. Wang and D.S. Kim, “TutorialVQA: Question Answering Dataset for Tutorial Videos”, arXiv preprint arXiv:1912.01046, 2019.
 Answers. [Online]. Available: https://www.answers.com/. [Accessed Feb. 9, 2021].
 S. Verberne, “Data Download,” [Online]. Available: http://sverberne.ruhosting.nl/wordpress/research/data-download/. [Accessed Feb. 9, 2021].
 Quora. [Online]. Available: https://www.quora.com/. [Accessed Feb. 9, 2021].
 Ask. [Online]. Available: https://www.ask.com/. [Accessed Feb. 9, 2021].
 Scrapy. [Online]. Available: https://scrapy.org/. [Accessed Feb. 9, 2021].
 BeautifulSoup. [Online]. Available: https://pypi.org/project/beautifulsoup4/. [Accessed Feb. 9, 2021].
 A. Rahman and V. Ng, “Coreference resolution with world knowledge”, In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, 2011, pp. 814-824.
 A. Anandkumar, D. P. Foster, D. Hsu, S.M. Kakade and Y.K. Liu, “A spectral algorithm for latent dirichlet allocation”, Algorithmica, vol. 72, no. 1, 2015, pp. 193-214.
 D. M. Blei, A.Y. Ng and M. I. Jordan, “Latent Dirichlet allocation”, Journal of Machine Learning Research, 2003, pp. 993-1022.
 D.S. Chang and K.S. Choi, “Causal relation extraction using cue phrase and lexical pair probabilities”, In International Conference on Natural Language Processing, Springer, Berlin, Heidelberg, 2004, pp. 61-70.
 M. Breja and S.K. Jain, “Why-type Question Classification in Question Answering System”, In FIRE (Working Notes), 2017, pp. 149-153.
 M. Breja and S.K. Jain, “Analysis of Why-Type Questions for the Question Answering System”, In European Conference on Advances in Databases and Information Systems, Springer, Cham, 2018, pp. 265-273.
 H. Fu and Y. Fan, “Music information seeking via social Q&A: An analysis of questions in music StackExchange community”, In Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, 2016, pp. 139-142.
 Financial Opinion Mining and Question Answering. [Online]. Available: https://sites.google.com/view/fiqa/home/. [Accessed Feb. 9, 2021].
 A.F.U.R. Khilji, R. Manna, S.R. Laskar, P. Pakray, D. Das, S. Bandyopadhyay and A. Gelbukh, “Question classification and answer extraction for developing a cooking QA system”, Computación y Sistemas, vol. 24, no. 2, 2020, pp. 921-927.
 V. Koeman, L. A. Dennis, M. Webster, M. Fisher and K. Hindriks, “The "Why did you do that?" Button: Answering Why-questions for end users of Robotic Systems”, In 7th International Workshop on Engineering Multi-Agent Systems (EMAS 2019), 2019, pp. 152-172.
 Yahoo! Language Data. [Online]. Available: https://webscope.sandbox.yahoo.com/catalog.php?datatype=l. [Accessed Feb. 9, 2021].
 E. Hovy, U. Hermjakob, and D. Ravichandran, “A question/answer typology with surface text patterns”, In Proceedings of the Human Language Technology conference (HLT), San Diego, CA, 2002.
 A. Mishra and S.K. Jain, “A survey on question answering systems with classification”, Journal of King Saud University-Computer and Information Sciences, vol. 28, no. 3, 2016, pp. 345-361.
 The Baron. [Online]. Available: https://www.thebaron.info/archives/technology/reuters-technical-development-chronology-1991-1994. [Accessed Mar. 2, 2021].
 G. Smith, “Newspapers on CD-ROM”, Serials: The Journal for the Serials Community, vol. 5, no. 3, 1992, pp. 17-22.
 M. Breja and S.K. Jain, “Analyzing Linguistic Features for Classifying Why-Type Non-Factoid Questions”, International Journal of Information Technology and Web Engineering (IJITWE), vol. 16, no. 3, 2021, pp. 21-38.
 M. Breja and S.K. Jain, “Why-Type Question to Query Reformulation for Efficient Document Retrieval”, International Journal of Information Retrieval Research (IJIRR), vol. 12, no. 1, 2022, pp. 1-18.
 M. Breja and S.K. Jain, “Analyzing Linguistic Features for Answer Re-Ranking of Why-Questions”, Journal of Cases on Information Technology (JCIT), vol. 24, no. 3, 2022, pp. 1-16.
 M. Breja and S.K. Jain, “A survey on non-factoid question answering systems”, International Journal of Computers and Applications, 2021, pp. 1-8.
 Y. Niu, “Analysis of semantic classes: toward non-factoid question answering”, University of Toronto, 2007.