Tag

linguistics

14 datasets
  • TalkBank

    Offsite — About TalkBank: The goal of TalkBank is to foster fundamental research in the study of human and animal communication. It will construct sample databases within each of the subfields studying communication. It will use these databases to advance the development of standards and tools for creating, sharing, searching, and commenting upon primary materials via ...
  • The Speech Accent Archive

    Offsite — From website: The speech accent archive uniformly presents a large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read the same paragraph and are carefully transcribed. The archive is used by people who wish to compare and analyze the accents of different English speakers. On [about ...
  • A Million Syllabi

    Free Download — A data set of over a million syllabi gathered by Dan Cohen’s Syllabus Finder tool from 2002 to 2009. It could be the largest collection of syllabi ever gathered by several orders of magnitude. See a more detailed description on Dan Cohen’s blog. Format: the data are JSON records separated by newlines. Caution: this data is messy and comes with no warranty.
  • Statistical Machine Translation - Europarl Parallel Corpus

    Offsite — Overview: The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romanic (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish. The goal of the extraction and processing was to generate sentence aligned text for ...
  • MOCHA-TIMIT

    Offsite — Authors: Alan Wrench, Queen Margaret University College. Funded by: Engineering and Physical Sciences Research Council. When created: November 1999. Purpose: Phonetically balanced dataset for training an automatic speech recognition system. Availability: English speakers available here free for non-commercial use and may be distributed on CDROM for a ...
  • 20 Newsgroups Dataset (De-Duped Version)

    Free Download — The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It is speculated that it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a ...
  • Speech Accent Archive: 1200+ speech samples from a variety of language backgrounds

    Offsite — The speech accent archive uniformly presents a large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read the same paragraph and are carefully transcribed. The archive is used by people who wish to compare and analyze the accents of different English speakers. The elicitation paragraph: Please call Stella. Ask her ...
  • The IBL Corpus

    Offsite — The IBL Corpus was collected by the University of Plymouth and the University of Edinburgh as part of the EPSRC funded project IBL, Instruction-based Learning for Mobile Robots (GR/M90023, GR/M90160). IBL focused on the problem of how natural language instructions can be used by an intelligent embodied agent to build a hierarchy of complex functions based on a ...
  • eXtended WordNet

    Offsite — From website: WordNet is a lexical database for English that has been widely adopted in artificial intelligence and computational linguistics for a variety of practical applications. Since WordNet was designed as a lexical database, it exhibits some limitations when used for knowledge processing applications. Often one needs to retrieve words that are ...
  • Google Books Ngrams

    Offsite — Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set). Each of the links will directly download a fragment of the given corpus. For ...
  • Google Labs - Books Ngram Viewer

    Offsite — Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the updated versions will have distinct and persistent version identifiers (20090715 for the current set). Each of the links below will directly download a fragment of the given corpus. For instance, ...
  • Rap Data Pack #1: Jay-Z

    Free Download — Here is the first Rap Data Pack™ released by StapleCrops.com. It includes the raw data from Hov’s complete body of work: word count, readability, release dates, Geo Codes, etc.
  • Twitter Wordbag

    No Data — The service for this API has ceased. Our apologies for the inconvenience this may cause. Twitter Wordbag unlocks word importance and frequency data across the Twitter universe. Discover any Twitter user’s word usage frequency, or most characteristic words, relative to the average of all Twitter users, by simply querying a screen name or user ID. This Twitter word API ...
  • Twitter Word Usage

    No Data — The service for this API has ceased. Our apologies for the inconvenience this may cause. Twitter word statistics for any term! Discover how commonly a given word is used on Twitter. This Twitter word API allows you to query a term and access valuable statistics about its usage on Twitter. Query any term to obtain word usage statistics, including global frequency, ...
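
The A Million Syllabi entry in this list describes its download as JSON records separated by newlines and warns that the data is messy. A minimal Python sketch of reading such a file defensively might look like the following. The function name `iter_json_records` and the file name in the usage note are hypothetical, and skipping blank, malformed, or non-object lines is an assumption made here for illustration, not something the dataset description prescribes.

```python
import json
from typing import Iterable, Iterator


def iter_json_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per line of newline-delimited JSON.

    Assumption for this sketch: blank lines, lines that fail to parse,
    and lines whose top-level value is not an object are skipped silently.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # messy data: skip lines that are not valid JSON
        if isinstance(record, dict):
            yield record
```

Because the function accepts any iterable of strings, it can be given an open file handle directly, for example `with open("syllabi.json", encoding="utf-8") as fh: records = list(iter_json_records(fh))`, where `syllabi.json` is a placeholder file name. For a file of this size, iterating lazily rather than building a list is likely the safer choice.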