CAI Datasets

Here I collect links and descriptions of datasets available online that are related to conversational artificial intelligence (CAI). This list will continue to grow.

Freebase is a practical, scalable tuple database used to structure general human knowledge. Its data is collaboratively created, structured, and maintained, and the database is huge and freely available. Facts are organized as triplets (subject, type1.type2.predicate, object), in which two entities, subject and object (identified by mids), are connected by the relation type1.type2.predicate.
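
For illustration, here is a minimal sketch of how one such triplet could be represented in Python; the mids and the predicate name are invented placeholders, not actual Freebase entries.

```python
# Minimal sketch of a Freebase-style fact as a (subject, predicate, object) triple.
# The mids and predicate below are illustrative placeholders, not real Freebase data.
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

fact = Triple(
    subject="/m/0abc12",                    # hypothetical mid for an entity
    predicate="location.country.capital",   # relation in type1.type2.predicate form
    object="/m/0xyz34",                     # hypothetical mid for the related entity
)

print(fact.subject, fact.predicate, fact.object)
```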

A dataset covering 14,042 ambiguous questions from NQ-open, an existing open-domain QA benchmark. The task is to predict a set of question and answer pairs, where each plausible answer is associated with a disambiguated rewriting of the original question.
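
The record below is a hypothetical sketch of what one such entry might look like; the field names, question, and answers are invented for illustration and are not the dataset's exact schema.

```python
# Hypothetical sketch of an ambiguous-question record: one ambiguous question mapped to
# several (disambiguated question, answer) pairs. Field names and content are illustrative.
example = {
    "question": "When did the movie Titanic come out?",  # ambiguous: which Titanic film?
    "qa_pairs": [
        {"question": "When did the 1997 movie Titanic come out?", "answer": "1997"},
        {"question": "When did the 1953 movie Titanic come out?", "answer": "1953"},
    ],
}

# The task is to predict the full set of pairs, not a single answer.
for pair in example["qa_pairs"]:
    print(pair["question"], "->", pair["answer"])
```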

CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing set splits: “Random split”, which is the main evaluation split, and “Question token split”; see the paper for details.
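
A hypothetical record in this style might look like the following; the question, choices, and field names are invented for illustration rather than taken from the dataset.

```python
# Illustrative multiple-choice record: one question, five options
# (one correct, four distractors). Content and field names are made up.
example = {
    "question": "Where would you put a cup after drinking from it?",
    "choices": {"A": "sink", "B": "forest", "C": "car", "D": "ocean", "E": "closet"},
    "answerKey": "A",
}

predicted = "A"  # a model would score each choice and pick one
print("correct" if predicted == example["answerKey"] else "wrong")
```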

A dataset for answering complex questions that require reasoning over multiple web snippets. It contains a large set of complex questions in natural language and can be used in multiple ways: by interacting with a search engine, as a reading comprehension task (12,725,989 web snippets), and as a semantic parsing task (questions paired with SPARQL queries).
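
The sketch below illustrates the idea of pairing a complex question with a SPARQL query; the question, query, and field names are invented and only imitate Freebase-style predicates, they are not taken from the dataset.

```python
# Hypothetical sketch of a complex question paired with a SPARQL query and retrieved
# web snippets. The query and field names are illustrative only.
example = {
    "question": "What movies star the actor who played the lead in Movie X?",
    "sparql": """
        SELECT ?film WHERE {
          ?movie_x ns:film.film.starring ?cast .
          ?cast ns:film.performance.actor ?actor .
          ?other_cast ns:film.performance.actor ?actor .
          ?film ns:film.film.starring ?other_cast .
        }
    """,
    "web_snippets": ["...retrieved snippet 1...", "...retrieved snippet 2..."],
}

print(example["question"])
```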

DROP, from the Allen Institute for Artificial Intelligence, is “A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs”. DROP is a crowdsourced, adversarially created, 96k-question benchmark, in which a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
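
As a worked illustration of the kind of discrete reasoning involved (here, simple subtraction over numbers found in the passage); the passage, question, and answer are invented.

```python
# Invented DROP-style example requiring discrete reasoning over numbers in the passage.
passage = "The home team scored 21 points in the first half and 13 points in the second half."
question = "How many more points did the home team score in the first half than in the second?"

# A system must locate the two numbers and perform an arithmetic operation over them.
answer = 21 - 13
print(answer)  # 8
```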

DuReader 2.0 is a large-scale open-domain Chinese dataset from Baidu for Machine Reading Comprehension (MRC) and Question Answering (QA). It contains more than 300K questions, 1.4M evidence documents, and corresponding human-generated answers.

A semantic parsing dataset introduced in “Large-scale semantic parsing via schema matching and lexicon extension” (Proceedings of the Annual Meeting of the Association for Computational Linguistics).

A semantic parsing dataset of database queries introduced in “Learning to parse database queries using inductive logic programming” (Proceedings of the National Conference on Artificial Intelligence).

A question answering dataset with 100,000 real Bing questions, each paired with a human-generated answer. It has since grown into a 1,000,000-question dataset and also includes a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search task.

MultiQA, from the Allen Institute for Artificial Intelligence, is a framework for training and evaluating reading comprehension models over arbitrary sets of datasets. All datasets are converted to a single format, and the release is accompanied by an AllenNLP DatasetReader and model that enable easy training and evaluation on multiple subsets of datasets.

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text (a span) from the corresponding reading passage, or the question may be unanswerable.
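
A minimal sketch of the span-extraction setup, with an invented passage and question: the answer is a substring of the context, located by its character offset, which mirrors how SQuAD answers are specified.

```python
# Minimal sketch of SQuAD-style span extraction: the answer is a substring of the passage,
# identified by its character offset. The passage and question are invented for illustration.
context = "SQuAD was released by researchers at Stanford University in 2016."
question = "Who released SQuAD?"

answer_start = context.index("Stanford University")  # character offset into the passage
answer_text = context[answer_start:answer_start + len("Stanford University")]

print(answer_start, answer_text)
```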

Question-answer pairs obtained from non-experts. This dataset is built using Freebase as the knowledge base and contains 5,810 question-answer pairs. It was created by crawling questions through the Google Suggest API and then obtaining answers using Amazon Mechanical Turk. WebQuestions is built on Freebase because all answers are defined as Freebase entities.

This dataset from Facebook contains six languages, around 100K utterances, 11 domains, and 117 intent classes. “Through our experiments, we show that a shared multilingual NLU model for multiple languages improves performance significantly compared with a per-language model, for all languages, thereby enabling faster language scale-up.”
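
The records below are a hypothetical sketch of utterances annotated with a domain, an intent class, and slots; the labels and field names are invented for illustration rather than copied from the dataset.

```python
# Hypothetical sketch of multilingual NLU annotations: each utterance carries a domain,
# an intent class, and slot values. Labels and field names are illustrative.
examples = [
    {"language": "en", "utterance": "set an alarm for 7 am",
     "domain": "alarm", "intent": "alarm/set_alarm", "slots": {"datetime": "7 am"}},
    {"language": "es", "utterance": "pon una alarma a las 7 de la mañana",
     "domain": "alarm", "intent": "alarm/set_alarm", "slots": {"datetime": "las 7 de la mañana"}},
]

# A shared multilingual model predicts the same intent and slot labels for both languages.
for ex in examples:
    print(ex["language"], ex["intent"], ex["slots"])
```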

Large multi-domain Wizard-of-Oz dataset for dialog modeling; a sketch of an annotated turn follows the lists below.

Domains: universal, restaurant, hotel, attraction, taxi, train, hospital, police

Act types: inform, request, select, recommend, not found, request booking info, offer booking, inform booked, decline booking, welcome, greet, bye, req more

Slots: address, postcode, phone, name, no of choices, area, price range, type, internet, parking, stars, open hours, departure, destination, leave after, arrive by, no of people, reference no., train ID, ticket price, travel time, department, day, no of days
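
Below is the annotation sketch mentioned above: an illustrative, simplified view of how a single turn could be annotated with dialogue acts and slot values. The act and slot names are drawn from the lists above, but the structure is a simplification, not the exact corpus format.

```python
# Simplified, illustrative annotation of one system turn with dialogue acts and slots.
turn = {
    "speaker": "system",
    "utterance": "I found 3 moderately priced restaurants in the centre. Shall I book one?",
    "acts": [
        {"domain": "restaurant", "act": "inform",
         "slots": {"no of choices": "3", "price range": "moderate", "area": "centre"}},
        {"domain": "restaurant", "act": "offer booking", "slots": {}},
    ],
}

for act in turn["acts"]:
    print(act["domain"], act["act"], act["slots"])
```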

Sequence of questions and answers in dialogue form.

A large English-language dialogue dataset from Semantic Machines (now part of Microsoft), featuring natural conversations about tasks involving calendars, weather, places, and people. Each turn is annotated with an executable dataflow program featuring API calls, function composition, and complex constraints built from strings, numbers, dates, and times. It is believed to be the largest and most complex task-oriented dialogue dataset (as of 2021).
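
To give a feel for the annotation style, here is a hypothetical example; the lispy program string only imitates the flavor of a dataflow program and is not drawn from the dataset.

```python
# Hypothetical illustration of a turn paired with an executable program built from API
# calls and composed functions. The program string imitates the style only; it is invented.
turn = {
    "user_utterance": "Schedule lunch with Ana next Friday at noon",
    "program": '(CreateEvent (Event.subject "lunch") '
               '(Event.attendees (FindPerson "Ana")) '
               '(Event.start (DateTimeFromDowTime (NextDow "Friday") (Noon))))',
}

print(turn["program"])
```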

Story Cloze Test is a commonsense reasoning framework for evaluating story understanding, story generation, and script learning. This test requires a system to choose the correct ending to a four-sentence story.
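
An invented instance in this style: a four-sentence story, two candidate endings, and a label marking the correct one.

```python
# Invented Story Cloze-style instance; the story text and endings are made up for illustration.
story = [
    "Maya planted tomato seeds in her garden.",
    "She watered them every morning before work.",
    "After a few weeks, small green fruits appeared.",
    "By late summer the plants were heavy with ripe tomatoes.",
]
endings = ["Maya picked a basketful and made sauce.", "Maya threw the plants away in disgust."]
correct = 0  # index of the right ending

print(" ".join(story))
print("Correct ending:", endings[correct])
```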

ROCStories is the corpus of five-sentence commonsense stories that enables the Story Cloze Test. It captures a set of causal and temporal commonsense relations between daily events, and as a collection of everyday life stories it can also be used for story generation.

An open-domain dialogue task for training agents that can converse knowledgeably about open-domain topics.
