Character Subtitles English
I. Subtitles for the Deaf and Hard of Hearing (SDH)
This section applies to subtitles for the deaf and hard of hearing created for English language content (i.e. intralingual subtitles). For English subtitles for non-English language content, please see Section II.
This guide is directed towards transcribers and translators who work in English. It contains guidelines about English spelling and punctuation conventions, line-breaking issues and common mistakes, as well as tips on how to make your English subtitles in the TED Translators program a better source text for translations into other languages.
You can use either American or British spelling and punctuation rules, but please select one of the conventions and use it consistently in your subtitles. Consider making the first note in the Amara editor one that states whether you've used American or British English, to better inform the reviewer. As a reviewer, don't change the spelling and punctuation rules to your preferred variety of English if the subtitles use US or British English consistently (for the most part).
In American English, separate dots / ellipses from other words with a space, before and after the dots (do not send subtitles back if there's no space before and after ellipses, as this should be considered a minor punctuation issue).
Every subtitle whose length exceeds 42 characters must be broken into two lines. No subtitle can go beyond the total length of 84 characters. Generally, each line should be broken only after a linguistic "whole" or "unit," no matter if it's the only line in the subtitle, or the first or second line in a longer subtitle. This means that sometimes it's necessary to rephrase the subtitle in order to make it possible to break lines without breaking apart any linguistic units, e.g. splitting apart an adjective and the noun that it refers to. For important and useful rules regarding line breaking in any language, see this guide. Below, you will find additional English-specific line breaking advice:
The examples below show places in a sentence where lines can be broken. The ideal places to break are marked by the green slashes, while the orange slashes indicate places where it would be OK to break the line if breaking at the green slashes were not possible. Note that you don't normally break lines that do not exceed 42 characters; the examples below are simply used to show various grammatical contexts where a sentence can be broken, not to suggest that you should break subtitles into very short lines.
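The length limits stated above (42 characters per line, 84 per subtitle) can be checked mechanically. The following is a minimal sketch, not part of any TED or Amara tooling: the function name `check_subtitle` is hypothetical, and it assumes that the line break itself does not count toward the 84-character total. It only checks lengths; choosing a good break point between linguistic units still needs a human.

```python
MAX_LINE = 42       # maximum characters per line, per the guideline above
MAX_SUBTITLE = 84   # maximum characters per subtitle

def check_subtitle(text: str) -> list[str]:
    """Return a list of length problems found in one subtitle (empty if none).

    Assumption: lines are separated by "\n" and the line break is not
    counted toward the subtitle's total length.
    """
    problems = []
    lines = text.split("\n")
    total = sum(len(line) for line in lines)
    if total > MAX_SUBTITLE:
        problems.append(f"subtitle has {total} characters (limit {MAX_SUBTITLE})")
    for number, line in enumerate(lines, start=1):
        if len(line) > MAX_LINE:
            problems.append(f"line {number} has {len(line)} characters (limit {MAX_LINE})")
    return problems
```

A subtitle that returns an empty list satisfies both limits; any other result lists what needs to be rephrased or re-broken.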
English transcripts, as well as translations from other languages into English, will often serve as the starting point for further translations. This is why it is advisable to think about the future translations while creating English subtitles, and to find ways to make it easier to spread the ideas in the English subtitles in other target languages.
Here, we have sentences with relative clauses. If possible without breaking the reading speed and subtitle length limits (and if the subtitles don't have to be synchronized with important action in the video), try to keep the clauses together in one subtitle. Even if the transcript splits the sentence apart, you can fix it in your translation. Examples:
Gonna, wanna, kinda, sorta, gotta and 'cause are ways of pronouncing going to, want to, kind of, sort of, have got to (usually with a contraction, i.e. "I've got to" etc.) and because, respectively. Do not use them in English subtitles. Instead, use the full form (e.g. going to where you hear gonna). The only exception is when the speaker uses these forms purposefully, to affect a certain kind of dialect or idiosyncrasy of speech.
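A reviewer could flag these forms automatically before deciding whether each one is a purposeful stylistic choice. The sketch below is only an illustration: the function name `find_informal` is hypothetical, and the mapping contains just the six forms listed above. Plain "cause" without an apostrophe is deliberately not flagged, since it is an ordinary English word.

```python
import re

# Informal pronunciation spellings and their full forms, as listed above.
INFORMAL = {
    "gonna": "going to",
    "wanna": "want to",
    "kinda": "kind of",
    "sorta": "sort of",
    "gotta": "have got to",
    "'cause": "because",
}

# Match the forms as whole words; the lookbehind avoids matching inside
# another word or directly after another apostrophe.
_PATTERN = re.compile(
    r"(?<![\w'])(gonna|wanna|kinda|sorta|gotta|'cause)\b", re.IGNORECASE
)

def find_informal(text: str) -> list[tuple[str, str]]:
    """Return (form as written, suggested full form) for each informal form."""
    normalized = text.replace("\u2019", "'")  # treat curly apostrophes as straight
    return [
        (match.group(1), INFORMAL[match.group(1).lower()])
        for match in _PATTERN.finditer(normalized)
    ]
```

The function only reports candidates and does not rewrite the text, because "gotta" usually needs a contraction that fits the sentence ("I've got to") and some uses are intentional.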
This item relates mostly to English transcripts. Subtitles are meant to represent natural (though relatively correct) speech, so the style should not be cleaned up too much; otherwise the subtitles sound unnecessarily formal and more like written language than speech. One common example is removing too many sentence-initial "and" and "so." In written English, starting consecutive sentences with such connectors is often seen as a fault in style ("And it was complete. And I called my friend. And my friend was so surprised!"). In spoken English, however, these connectors often produce an unbroken stream of related clauses in the absence of the formal connectors typical of written English (such as "accordingly," "what is more," etc.). Removing too many may make the subtitles sound disjointed, so leave as many as possible. Connectors may be removed to resolve reading-speed issues, of course, and once you have gained a strong sense of how to slightly edit subtitles for clarity, it will be OK for you to remove a few initial and's. When in doubt, leave it in.
Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.
Our results confirm that word frequencies based on subtitles are a good estimate of daily language exposure and capture much of the variance in word processing efficiency. In addition, our database is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.
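To make the two measures concrete, the sketch below computes word frequency (raw count and count per million words) and contextual diversity from a toy corpus. It is an illustration under stated assumptions, not the authors' pipeline: the function name `frequency_stats` is hypothetical, the input is assumed to be already segmented into words, and contextual diversity is taken here to be the proportion of films in which a word occurs at least once.

```python
from collections import Counter

def frequency_stats(films: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Compute per-word statistics from already segmented subtitles.

    films maps a film identifier to its list of word tokens.
    """
    word_counts = Counter()   # total occurrences of each word
    film_counts = Counter()   # number of films containing each word
    for tokens in films.values():
        word_counts.update(tokens)
        film_counts.update(set(tokens))  # count each film at most once per word

    total_words = sum(word_counts.values())
    n_films = len(films)
    return {
        word: {
            "count": count,
            "per_million": count * 1_000_000 / total_words,
            "contextual_diversity": film_counts[word] / n_films,
        }
        for word, count in word_counts.items()
    }

# Toy example with two hypothetical films.
toy = {"film_1": ["我", "喜欢", "你"], "film_2": ["我", "我", "看"]}
stats = frequency_stats(toy)
```

In the toy corpus, "我" occurs three times across both films while "喜欢" occurs once in a single film, so the two words differ in both frequency and contextual diversity.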
Research on the Chinese language is becoming an important theme in psycholinguistics. Not only is Chinese one of the most widely spoken languages in the world, it also differs in interesting ways from the alphabetic writing systems used in the Western world. For example, the logographic writing system makes it impossible to compute the word's phonology on the basis of non-lexical letter to sound conversions [1]. Another characteristic of the Chinese writing system is that there are no spaces between the words. This is likely to have consequences for eye movement control in reading [2]. Finally, a Chinese character represents a syllable, which most of the time is a morpheme (i.e., the smallest meaningful element), and many Chinese words in fact are disyllabic compound words [3].
Research on the Chinese language requires reliable information about word characteristics, so that the stimulus materials can be manipulated and controlled properly. By far the most important word feature is word frequency. In this text, we first describe the frequency measures that are available for Chinese. Then, we describe the contribution a new frequency measure based on film subtitles is making in other languages and we present a similar database for Mandarin Chinese.
When reading Table 1, it is important to keep in mind that many corpora were meant to be representative of the language produced in Chinese-speaking regions and not necessarily of the language heard and read daily by Chinese-speaking people. In addition, some of these sources are copyright protected. One main problem with Chinese word frequencies is that Chinese words are not written separately, which makes segmenting the corpus into words labor-intensive if one wants information beyond single-character frequencies (Chinese words can consist of one to four or even more characters). This situation is currently changing, due to the availability of automatic parsers and part-of-speech taggers, as we will see below.
All in all, despite the existence of several frequency lists in Chinese, there are only three sources that provide easy access for individual researchers and other people interested in the Chinese language. The first is CCL ( :8080/ccl_corpus), which gives access to the unsegmented and untagged corpus and provides information about character frequencies but not word frequencies. The second is LCSMCS ( -bin/yuliao/), which gives word frequencies based on the segmented part of the corpus (2 million words). Unfortunately, words have to be entered separately on the website. Part of the single-character word frequencies from LCSMCS are also available in the Chinese Single-character Word Database (CSWD; available at _norm/psychnorms.html). This database provides information about 2,390 single-character Chinese words, including nouns, verbs, and adjectives [11]. Finally, there is the Lancaster Corpus of Mandarin Chinese ( ), which provides frequency information for 5,000 words in A Frequency Dictionary of Mandarin Chinese: Core Vocabulary for Learners [3] and for a larger set of 50,000 words upon request from the authors (also released by Richard Xiao on ).
Encouraged by the above findings, we decided to compile a word and character frequency list based on Chinese subtitles. A potential problem in this work is that, unlike in most writing systems, there are no spaces between the words in Chinese. Therefore, word segmentation (i.e. splitting the character sequence into words) is a critical step in collecting Chinese word frequencies. Fortunately, in the last decade automatic word segmentation programs with good output have become available [for a review see 3]. These algorithms are trained on a tagged corpus (i.e., a corpus in which all the words have been identified and given their correct syntactic role) and are then applied to new materials [19]. Their performance is regularly compared in competitions such as the SIGHAN Bakeoff (www.sighan.org; SIGHAN: a Special Interest Group of the Association for Computational Linguistics). A program that consistently performed among the best is ICTCLAS ( ) [3]. It incorporates part-of-speech information (PoS, i.e. the syntactic roles of the words, such as noun, verb, adjective, etc.) and generates multiple hidden Markov models, from which the one with the highest probability is selected [19], [20]. This not only provides the correct segmentation for the vast majority of sentences, but also has the advantage that the most likely syntactic roles of the words are given, which makes it possible to additionally calculate PoS-dependent frequencies. The algorithm is expected to work well for film subtitles, because these subtitles are of limited syntactic complexity (most of them are short, simple sentences) and because the program is able to recognize out-of-vocabulary words such as foreign names, which often occur in subtitles but are rarely covered by regular vocabularies. The program was also used to parse the LCMC corpus.
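To illustrate what word segmentation involves, here is a minimal sketch of forward maximum matching, a much simpler dictionary-based technique than the hidden-Markov-model approach described above. It is not how ICTCLAS works and assigns no part-of-speech tags; the function name `forward_max_match`, the toy lexicon, and the default maximum word length of four characters are assumptions made for the example.

```python
def forward_max_match(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Split an unspaced character sequence into words.

    At each position, take the longest candidate (up to max_len characters)
    found in the lexicon; fall back to a single character if none matches.
    """
    words = []
    position = 0
    while position < len(text):
        longest = min(max_len, len(text) - position)
        for size in range(longest, 0, -1):
            candidate = text[position:position + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                position += size
                break
    return words

# Toy lexicon; a real system would use a large vocabulary or a trained model.
toy_lexicon = {"我们", "喜欢", "电影"}
segmented = forward_max_match("我们喜欢看电影", toy_lexicon)
```

Because the method depends entirely on the lexicon, it cannot recognize out-of-vocabulary words such as foreign names, which is one reason statistical segmenters are preferred for subtitle corpora.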