Since 1993, international interest in the SF-36 has increased exponentially, with translations completed
in more than 60 countries. Key milestones for the project were the publication of a 1998 special issue of the
Journal of Clinical Epidemiology on IQOLA Project methods and validation studies from 15 countries, and a
2003 Quality of Life Research paper that compared the impact of diseases such as Crohn's disease on health status in eight countries.
Since 1991, more than 1,000 papers have been written about the SF-36 by researchers in countries outside the
United States. In addition, IQOLA researchers have written SF-36 scoring documentation and user's manuals
for Australia, Canada, Denmark, France, Germany, Italy, Japan, Spain, Sweden, and the United Kingdom.
Research Protocol Stage One: Translation
The translation methods adopted by the IQOLA Project included the production of forward and backward
translations, use of difficulty and quality ratings, pilot testing, and cross-cultural comparison of the translations.
In the first step of the process, at least two native-speaking translators independently translated the SF-36 from
English into the target language. For the initial group of 14 countries, each translator produced one translation
of the SF-36 items and established a list of all possible translations of the response choices. Translators placed
emphasis on conceptual rather than literal equivalence, and the choice of wording was to be compatible with a reading
level of age 14 or lower. Translators also rated the difficulty of translating each item and response choice.
In each country, the ordinal and interval properties of all translations of the response choices were evaluated in a
Thurstone-like scaling exercise. In brief, the translation team determined the translation of the end points, or
anchors, of each SF-36 response continuum (e.g., "excellent" and "poor"), and a group of native speakers was asked to
position all possible translations of the remaining response choices (e.g., "very good", "good", "fair") on a 100 mm
LASA (linear analogue self-assessment) scale. The aim of the exercise was to produce additional information that would help in selecting response choices
that had values similar to those in the original instrument. However, the translation of the response choices was not
based on information from the Thurstone exercise alone; other criteria (e.g., clarity, common language use) also were considered.
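As an illustration of how data from such a scaling exercise might be summarized, the following minimal Python sketch averages the positions that raters mark on the 100 mm line for each candidate translation and picks the candidate whose mean lies closest to a target value. The candidate labels, ratings, target values, and function names are all hypothetical and are not IQOLA data or IQOLA procedures; as noted above, such numbers would be only one input alongside clarity and common language use.

```python
from statistics import mean

# Hypothetical ratings in mm, measured from the "poor" anchor (0)
# toward the "excellent" anchor (100). These are not real IQOLA data.
ratings_mm = {
    "candidate A": [78, 82, 75, 80],
    "candidate B": [60, 66, 58, 64],
    "candidate C": [30, 35, 28, 27],
    "candidate D": [70, 72, 68, 70],
}

# Assumed target positions for the source-language choices (illustrative only).
targets_mm = {"very good": 80, "good": 60, "fair": 30}


def mean_positions(ratings):
    """Average marked position (mm) for each candidate translation."""
    return {candidate: mean(values) for candidate, values in ratings.items()}


def closest_candidate(target, positions):
    """Candidate whose mean position lies nearest to the target value."""
    return min(positions, key=lambda c: abs(positions[c] - target))


positions = mean_positions(ratings_mm)
selection = {
    choice: closest_candidate(target, positions)
    for choice, target in targets_mm.items()
}
```

In this made-up example, candidate D falls between the targets for "good" and "very good" and is not selected for either; a translation team could use that kind of result to flag a wording whose perceived intensity does not match any source response choice.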
The translators and National Principal Investigator met to agree on a preliminary forward translation, which was then given to
two other bilinguals who rated its quality, using the criteria of clarity, common language use, and conceptual equivalence.
In some countries, quality raters also provided an overall rating of the acceptability of the translation; if a translated item
was deemed unacceptable, the raters proposed an alternative. The quality ratings were given to the National Principal Investigator,
who discussed the information with the original translators and modified the translation as needed, to develop a revised
preliminary forward translation.
This forward translation was then given to two translators who were native English speakers, who translated the questionnaire back
into English. The backward translations were reviewed by researchers at the Health Assessment Lab (HAL) for conceptual equivalence with the original source version.
Items or response choices that were deemed not to be conceptually equivalent were discussed with the National Principal Investigator.
The translations also were discussed, item by item, at an international investigators meeting, and modifications were made to some translations.
These discussions helped to streamline the translations cross-culturally and contributed to the standardization of the SF-36. Thus, at
the end of the translation process, the translations had been discussed in terms of their national (translators and National Principal
Investigator), bi-national (National Principal Investigator and HAL), and cross-national (all National Principal Investigators and HAL) performance.
Finally, the translations were pilot-tested in individual countries through administration to up to 50 respondents who differed in health
status. Difficulties encountered by respondents were noted, and the translations were revised as needed.
The translation protocol has been modified slightly for work outside the original 14 sponsored countries. In essence, the procedure
used to translate the SF questionnaires is the standard translation process recommended by the Medical Outcomes Trust Scientific Advisory
Committee, upcoming ISPOR principles of good practice, and others. Translations are developed using at least two independent forward translations.
One or two backward translations (into English) then are reviewed for conceptual equivalence with the original source form by U.S.-based
research staff at HAL. In addition, small pilot tests with patients are conducted to evaluate acceptability and understanding of the translation,
and international harmonization has been conducted with some translations as circumstances warrant. However, some steps that were part of the
full IQOLA translation process (e.g., Thurstone-like scaling exercise, formal quality ratings) generally have not been done in additional
countries due to constraints of time and resources. The SF-12 Health Survey translations have been developed from the SF-36, as the SF-12
items are included in the SF-36. The SF-8 translations also were developed following standard translation procedures.
Experience to date suggests that the SF-36, SF-12 and SF-8 can be adapted for use in other countries with relatively minor changes to the content
of the forms, providing support for the use of the SF translations. For the SF-36, the most difficult items to translate were Physical Functioning
items that used examples of activities (e.g., bowling, golf) and distances that are not common outside of the United States (e.g., blocks), and
items that used colloquial expressions such as "pep" or "blue". In addition, in the Thurstone scaling exercise, a notable percentage (although not
the majority) of respondents had difficulty positioning the response choice "a good bit of the time" between "some of the time" and "most of the time".
These difficulties were addressed during the translation process; for example, rather than "bowling" and "golf", translations use examples of moderate
activities that are appropriate for the country in question (e.g., cycling, gardening, tai chi). Many of the issues with the most difficult items
also were addressed in Version 2.0 of the SF-36; for example, specific distances are used instead of blocks, and "pep" and "blue" have been replaced by synonyms.
Research Protocol Stage Two: Tests of Scaling Assumptions
Following the translation stage, the second research stage tested the assumptions underlying item scoring and scale construction, including data quality,
scaling and scoring assumptions, and the reliability of the SF-36 scales. Tests included evaluation of item and scale-level descriptive statistics;
examination of the equality of item-scale correlations, item internal consistency and item discriminant validity; and estimation of scale score reliability
using internal consistency and test-retest methods. Results from these tests were used to determine if standard algorithms for the construction and scoring
of the eight SF-36 scales could be used in each country and to provide information that could be used in translation improvement. When scaling assumptions
were not met, evidence was sought to determine if this was due to translation problems or to country-specific differences in the definition or structure of health.
Establishing that translated scales meet the standards used in tests of scaling assumptions is a necessary but not sufficient condition for their use. Examination
of the validity and comparability of the scales also is needed for their proper interpretation.
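Three of the tests named in this stage (item internal consistency, item discriminant validity, and internal consistency reliability) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the IQOLA analysis code: the item responses are invented, the function names are hypothetical, and the 0.40 cut-off for a corrected item-scale correlation is an assumed convention that the text above does not state.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of item columns (one list per item)."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def corrected_item_scale_r(items, index):
    """Correlate one item with the sum of the OTHER items in its scale,
    so the item does not inflate its own item-scale correlation."""
    rest = [sum(row) - row[index] for row in zip(*items)]
    return pearson(items[index], rest)


# Invented responses: three items of one hypothesized scale, five respondents.
own_scale = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
# Invented score on a different scale for the same five respondents.
other_scale_score = [3, 1, 4, 2, 3]

alpha = cronbach_alpha(own_scale)
r_own = corrected_item_scale_r(own_scale, 0)
r_other = pearson(own_scale[0], other_scale_score)

passes_consistency = r_own >= 0.40      # assumed threshold, see lead-in
passes_discriminant = r_own > r_other   # item relates more to its own scale
```

Repeating the last two checks for every item against every scale gives a simple tally of how often the scaling assumptions hold, which is the kind of evidence that could point to a problem item in a translation.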
Research Protocol Stage Three: Validation and Norming
In the third research stage, data from clinical studies, general population surveys, and other studies were analyzed to address issues of validity
and evaluate the equivalence of interpretations across countries. Validity is the extent to which a score means what it is supposed to mean:
whether it has the intended interpretation. Studies of validity increase understanding of the meaning of a score and the meaning of changes or
differences in that score. The validity of questionnaires in the health field has most often been evaluated by means of content, construct, and
criterion validity. For the SF-36, evidence of all types of validity is relevant because of its widespread use across a variety of applications.
IQOLA researchers also have utilized a number of other methods to explore the cross-cultural equivalence of the SF-36 translations. These techniques
included structural equation modeling, tests of differential item functioning, and detection of item bias.
When enough evidence has been accumulated to show that a scale measures the intended health concept and does not measure other concepts, the scale is said
to be validated. However, the process of validation continues as long as new information is produced about the interpretation and meaning of scores.
In the absence of agreed-upon criteria, or "gold standards" for validating health measures, normative data can be very useful in interpreting scale scores.
Normative data make it possible to interpret the scale score for an individual respondent or the average score for a group of respondents in comparison to
the distribution of scores for other individuals in the norming sample. Norm-based comparisons require valid norms for a well-defined and representative
sample of the population of interest.
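A norm-based comparison of this kind can be sketched as follows. The example assumes a standardized metric with a mean of 50 and a standard deviation of 10 in the norming sample, which is a common convention but an assumption here; the norm mean, norm standard deviation, norming sample, and function names are hypothetical and are not published SF-36 norms.

```python
def norm_based_score(score, norm_mean, norm_sd):
    """Express a scale score relative to the norm: 50 is the norm mean,
    and each 10 points is one norm standard deviation (assumed metric)."""
    z = (score - norm_mean) / norm_sd
    return 50 + 10 * z


def percentile_rank(score, norm_sample):
    """Percentage of the norming sample scoring at or below `score`."""
    at_or_below = sum(1 for s in norm_sample if s <= score)
    return 100 * at_or_below / len(norm_sample)


# Hypothetical norm parameters and norming sample (illustrative values only).
HYPOTHETICAL_NORM_MEAN = 70
HYPOTHETICAL_NORM_SD = 20
hypothetical_norm_sample = [40, 55, 60, 70, 75, 80, 85, 90, 95, 100]

standardized = norm_based_score(80, HYPOTHETICAL_NORM_MEAN, HYPOTHETICAL_NORM_SD)
rank = percentile_rank(80, hypothetical_norm_sample)
```

With these invented values, a raw score of 80 lies half a standard deviation above the norm mean. The same functions apply to a group mean in place of an individual score, provided the norms come from a well-defined and representative sample.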
The IQOLA Project developed a protocol for collecting normative general population data. The recommended sample size was 2,500 to 3,000 respondents, which
would allow for comparisons of SF-36 scale scores by gender and ten-year age groups. The sample was to be representative of the age and gender of the population
and various regions in a country, at a minimum. If a sample was not truly representative of the general population, the departures and their implications were
to be documented. Construction of sampling weights, to weight back to a nationally representative sample, was encouraged. While the preferred methodology for
data collection was mail-out/mail-back of self-reported questionnaires, other means of collecting data were acceptable, such as interviewer or telephone
administration. Essentially, the protocol recommended that customary methods of data collection, which would lead to the highest response rates, should be
used if possible. Vigorous follow-up of non-respondents was encouraged, and a response rate of two-thirds or greater was targeted. Finally, if possible, a
description of non-respondents was to be reported. SF-36 items were to be placed first in the survey, so that respondents' answers to the SF-36 would not be
influenced by any preceding questions. In addition to the SF-36, the protocol called for the collection of a standard set of additional data elements, including
sociodemographic information and self-reported chronic conditions.
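One simple way to weight a sample back toward the population is post-stratification, in which each respondent receives a weight equal to the population share of his or her stratum divided by that stratum's share of the sample. The sketch below illustrates that approach together with a check of the two-thirds response rate target. It is a hedged illustration only: the protocol described above does not specify a weighting method, and the strata, population shares, scores, and function names are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_shares):
    """Weight per stratum = population share / sample share."""
    n = len(sample_strata)
    counts = Counter(sample_strata)
    return {s: population_shares[s] / (counts[s] / n) for s in counts}


def weighted_mean(values, weights):
    """Mean of `values` with one weight per respondent."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def meets_response_target(returned, eligible):
    """True if the response rate is two-thirds or greater (integer arithmetic)."""
    return 3 * returned >= 2 * eligible


# Hypothetical sample: stratum label and scale score for ten respondents.
strata = ["women 35-44"] * 6 + ["men 35-44"] * 4
scores = [80, 75, 85, 70, 90, 80, 60, 70, 65, 65]

# Hypothetical population shares for the same strata (must sum to 1).
population_shares = {"women 35-44": 0.5, "men 35-44": 0.5}

stratum_weights = poststratification_weights(strata, population_shares)
respondent_weights = [stratum_weights[s] for s in strata]

unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, respondent_weights)
```

In this invented example, women are over-represented in the sample, so their responses are weighted down and the weighted mean falls below the unweighted mean.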