International Quality of Life Assessment

The IQOLA Project

History

The International Quality of Life Assessment (IQOLA) Project began in 1991, with the goal of developing validated translations of a health status questionnaire for use in multinational clinical trials and other international studies of health. One of the first questions project researchers addressed was which questionnaire to translate. The SF-36® Health Survey was available in "developmental" form in 1988 and in "standard" form in 1990. By 1991, a number of U.S. studies had documented its acceptability, reliability and validity, and research had indicated that it was applicable across heterogeneous populations in the United States. Preliminary work in several European countries suggested that it could be translated successfully. Other factors that favored the SF-36 were that it was a comprehensive measure of generic health status, and that its brevity meant it could be supplemented with other generic and disease-specific measures in clinical studies. Thus, the SF-36 was chosen as the initial health status measure to be translated in the IQOLA Project.

The first meeting of the IQOLA Project took place in September of 1991 in Paris. Participants included the IQOLA Project Principal Investigator and other U.S.-based Health Assessment Lab (HAL) research staff, National Principal Investigators from the first five sponsored countries (France, Germany, Italy, the Netherlands, and Sweden), and staff from Mapi Research Institute, who assisted in the coordination of the project in its early stages. Additional sponsored researchers joined the project in 1992 (Australia, Belgium, Canada, Japan, Spain, United Kingdom) and 1993 (Denmark and Norway). IQOLA research procedures were refined and augmented in a series of meetings from 1991 to 1993, which included representatives from all fourteen countries. The IQOLA Project followed a three-stage research protocol for translating and testing the SF-36, as outlined in this table and discussed in more detail below.

Stage	Research Protocol	Products
1	Translation following a standard process.	Questionnaires which can be used in data collection.
2	Formal psychometric tests of the assumptions underlying item scoring and construction of multi-item scales.	Scoring algorithms which can be used to make standardized comparisons.
3	Studies to evaluate validity and the equivalence of interpretations across countries.	Validation and norming studies that provide a basis for interpretation of scores.

Since 1993, international interest in the SF-36 has grown rapidly, with translations completed in more than 60 countries. Key milestones for the project were the publication of a 1998 special issue of the Journal of Clinical Epidemiology on IQOLA Project methods and validation studies from 15 countries, and a 2003 Quality of Life Research paper that compared the impact of diseases such as Crohn's disease on health status in eight countries. Since 1991, more than 1,000 papers have been written about the SF-36 by researchers in countries outside the United States. In addition, IQOLA researchers have written SF-36 scoring documentation and user's manuals for Australia, Canada, Denmark, France, Germany, Italy, Japan, Spain, Sweden, and the United Kingdom.

Research Protocol – Stage One: Translation

The translation methods adopted by the IQOLA Project included the production of forward and backward translations, use of difficulty and quality ratings, pilot testing, and cross-cultural comparison of the translations.

In the first step of the process, at least two native-speaking translators independently translated the SF-36 from English into the target language. For the initial group of 14 countries, each translator produced one translation of the SF-36 items and established a list of all possible translations of the response choices. Translators placed emphasis on conceptual rather than literal equivalence, and the choice of wording was to be compatible with a reading level of age 14 or lower. Translators also rated the difficulty of translating each item and response choice.

In each country, the ordinal and interval properties of all translations of the response choices were evaluated in a Thurstone-like scaling exercise. In brief, the translation team determined the translation of the end points, or anchors, of each SF-36 response continuum (e.g., "excellent" and "poor"), and a group of native speakers was asked to position all possible translations of the remaining response choices (e.g., "very good", "good", "fair") on a 100 mm LASA scale. The aim of the exercise was to produce additional information that would help in selecting response choices with values similar to those in the original instrument. However, the translation of the response choices was not based on information from the Thurstone-like exercise alone; other criteria (e.g., clarity, common language use) also were considered.
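As a rough illustration of how marks from such an exercise might be summarized, the sketch below averages each candidate wording's position on the 100 mm line and picks the candidate nearest a target value. The function names, the ratings, and the target value are hypothetical and are not taken from the IQOLA Project; as noted above, the project also weighed clarity and common language use, which this sketch ignores.

```python
from statistics import mean

def mean_positions(ratings):
    """Average the marks (in mm, 0-100) that raters gave each candidate wording."""
    return {wording: mean(marks) for wording, marks in ratings.items()}

def closest_candidate(ratings, target_mm):
    """Return the candidate whose mean position is nearest the target value."""
    positions = mean_positions(ratings)
    return min(positions, key=lambda w: abs(positions[w] - target_mm))

# Hypothetical marks from five raters for two candidate translations of "good".
ratings = {
    "candidate_a": [55, 60, 58, 62, 65],
    "candidate_b": [70, 75, 72, 78, 80],
}

# Hypothetical position of the source-language response choice on the same line.
best = closest_candidate(ratings, target_mm=61.0)
print(best)
```

Using the mean is only one possible summary; a median or an inspection of the spread of marks would serve equally well for flagging wordings that raters place inconsistently.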

The translators and National Principal Investigator met to agree on a preliminary forward translation, which was then given to two other bilinguals who rated its quality, using the criteria of clarity, common language use, and conceptual equivalence. In some countries, quality raters also provided an overall rating of the acceptability of the translation; if a translated item was deemed unacceptable, the raters proposed an alternative. The quality ratings were given to the National Principal Investigator, who discussed the information with the original translators and modified the translation as needed, to develop a revised preliminary forward translation.

This forward translation was then given to two translators who were native English speakers, who translated the questionnaire back into English. The backward translations were reviewed by researchers at HAL for conceptual equivalence with the original source version. Items or response choices that were deemed not to be conceptually equivalent were discussed with the National Principal Investigator. The translations also were discussed, item by item, at an international investigators' meeting and modifications were made to some translations. These discussions helped to streamline the translations cross-culturally and contributed to the standardization of the SF-36. Thus, at the end of the translation process, the translations had been discussed in terms of their national (translators and National Principal Investigator), bi-national (National Principal Investigator and HAL), and cross-national (all National Principal Investigators and HAL) performance.

Finally, the translations were pilot-tested in individual countries through administration to up to 50 respondents who differed in health status. Difficulties encountered by respondents were noted, and the translations were revised as needed.

The translation protocol has been modified slightly for work outside the original fourteen sponsored countries. In essence, the procedure used to translate the SF questionnaires is the standard translation process recommended by the Medical Outcomes Trust Scientific Advisory Committee, upcoming ISPOR principles of good practice, and others. Translations are developed using at least two independent forward translations. One or two backward translations (into English) then are reviewed for conceptual equivalence with the original source form by U.S.-based research staff at HAL. In addition, small pilot tests with patients are conducted to evaluate acceptability and understanding of the translation, and international harmonization has been conducted with some translations as circumstances warrant. However, some steps that were part of the full IQOLA translation process (e.g., Thurstone-like scaling exercise, formal quality ratings) generally have not been done in additional countries due to constraints of time and resources. The SF-12 Health Survey translations have been developed from the SF-36, as the SF-12 items are included in the SF-36. The SF-8 translations also were developed following standard translation procedures.

Experience to date suggests that the SF-36, SF-12 and SF-8 can be adapted for use in other countries with relatively minor changes to the content of the forms, providing support for the use of the SF translations. For the SF-36, the most difficult items to translate were Physical Functioning items which used examples of activities (e.g., bowling, golf) and distances that are not common outside of the United States (e.g., blocks); and items which used colloquial expressions such as "pep" or "blue". In addition, in the Thurstone-like scaling exercise, a notable percentage (although not the majority) of respondents had difficulty positioning the response choice "a good bit of the time" between "some of the time" and "most of the time". These difficulties were addressed during the translation process; for example, rather than "bowling" and "golf", translations use examples of moderate activities that are appropriate for the country in question (e.g., cycling, gardening, tai chi). Many of the issues with the most difficult items also were addressed in Version 2.0 of the SF-36; for example, specific distances are used instead of blocks and "pep" and "blue" have been replaced by synonyms.

Research Protocol – Stage Two: Tests of Scaling Assumptions

Following the translation stage, the second research stage tested the assumptions underlying item scoring and scale construction, including data quality, scaling and scoring assumptions, and the reliability of the SF-36 scales. Tests included evaluation of item and scale-level descriptive statistics; examination of the equality of item-scale correlations, item internal consistency and item discriminant validity; and estimation of scale score reliability using internal consistency and test-retest methods. Results from these tests were used to determine if standard algorithms for the construction and scoring of the eight SF-36 scales could be used in each country and to provide information that could be used in translation improvement. When scaling assumptions were not met, evidence was sought to determine if this was due to translation problems or to country-specific differences in the definition or structure of health.
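Two of the checks named above can be sketched in a few lines: internal-consistency reliability of a scale, and the correlation of an item with the rest of its scale. The sketch uses Cronbach's alpha as the internal-consistency estimate, which is a common choice but an assumption here, since the text does not name a specific coefficient. The item responses are made up, and the code is not the IQOLA Project's own analysis software.

```python
from statistics import mean, variance

def cronbach_alpha(items):
    """Cronbach's alpha for one scale.

    `items` is a list of columns, one per item, each holding the scores of
    the same respondents in the same order.
    """
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def corrected_item_scale_r(items, index):
    """Correlate one item with the sum of the remaining items in its scale,
    so that the item does not inflate its own correlation."""
    rest = [sum(s for j, s in enumerate(row) if j != index)
            for row in zip(*items)]
    return pearson(items[index], rest)

# Hypothetical responses of five people to a three-item scale.
scale_items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(scale_items)
r_first_item = corrected_item_scale_r(scale_items, 0)
```

A test of item discriminant validity would extend this by comparing each item's correlation with its own scale against its correlations with the other scales.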

Establishing that translated scales meet standards used in tests of scaling assumptions is a necessary but insufficient prerequisite for their use. Examination of the validity and comparability of the scales is necessary for their proper interpretation.

Research Protocol – Stage Three: Validation and Norming

In the third research stage, data from clinical studies, general population surveys, and other studies were analyzed to address issues of validity and evaluate the equivalence of interpretations across countries. Validity is the extent to which a score means what it is supposed to mean: whether it has the intended interpretation. Studies of validity increase understanding of the meaning of a score and the meaning of changes or differences in that score. The validity of questionnaires in the health field has most often been evaluated by means of content, construct, and criterion validity. For the SF-36, evidence of all types of validity is relevant because of its widespread use across a variety of applications. IQOLA researchers also have utilized a number of other methods to explore the cross-cultural equivalence of the SF-36 translations. These techniques included structural equation modeling, tests of differential item functioning, and detection of item bias.

When enough evidence has been accumulated to show that a scale measures the intended health concept and does not measure other concepts, the scale is said to be validated. However, the process of validation continues as long as new information is produced about the interpretation and meaning of scores.

In the absence of agreed-upon criteria, or "gold standards" for validating health measures, normative data can be very useful in interpreting scale scores. Normative data make it possible to interpret the scale score for an individual respondent or the average score for a group of respondents in comparison to the distribution of scores for other individuals in the norming sample. Norm-based comparisons require valid norms for a well-defined and representative sample of the population of interest.
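One common way to make a norm-based comparison concrete is to express a score as a standardized distance from the norming sample's mean. The sketch below assumes a T-score convention (mean 50, standard deviation 10 in the norming sample); the norm mean and standard deviation shown are hypothetical placeholders, not published SF-36 norms.

```python
def norm_based_score(score, norm_mean, norm_sd):
    """Rescale a scale score so the norming sample has mean 50 and SD 10."""
    z = (score - norm_mean) / norm_sd  # distance from the norm mean in SD units
    return 50 + 10 * z

# Hypothetical norm: mean 80 and SD 20 on a 0-100 scale.
respondent = norm_based_score(70, norm_mean=80, norm_sd=20)
print(respondent)  # 45.0, i.e. half a standard deviation below the norm mean
```

The comparison is only as good as the norms behind it, which is why the representativeness of the norming sample matters.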

The IQOLA Project developed a protocol for collecting normative general population data. The recommended sample size was 2,500 to 3,000 respondents, which would allow for comparisons of SF-36 scale scores by gender and ten-year age groups. The sample was to be representative of the age and gender of the population and various regions in a country, at a minimum. If a sample was not truly representative of the general population, the departures and their implications were to be documented. Construction of sampling weights, to weight back to a nationally representative sample, was encouraged. While the preferred methodology for data collection was mail-out/mail-back of self-reported questionnaires, other means of collecting data were acceptable, such as interviewer or telephone administration. Essentially, the protocol recommended that customary methods of data collection, which would lead to the highest response rates, should be used if possible. Vigorous follow-up of non-respondents was encouraged, and a response rate of two-thirds or greater was targeted. If possible, a description of non-respondents also was to be reported. SF-36 items were to be placed first in the survey, so that respondents' answers to the SF-36 would not be influenced by any preceding questions. In addition to the SF-36, the protocol called for the collection of a standard set of additional data elements, including sociodemographic information and self-reported chronic conditions.
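The idea of weighting a sample back to a nationally representative one can be illustrated with simple post-stratification, in which each stratum's weight is its population share divided by its sample share. This is a minimal sketch under that assumption; the protocol does not specify a weighting method, and the strata, population shares, and scores below are invented for illustration.

```python
from collections import Counter

def poststratification_weights(sample_strata, population_shares):
    """Weight per stratum = population share / sample share."""
    n = len(sample_strata)
    counts = Counter(sample_strata)
    return {stratum: population_shares[stratum] / (count / n)
            for stratum, count in counts.items()}

def weighted_mean(scores, strata, weights):
    """Mean of scores, with each respondent weighted by their stratum's weight."""
    total_weight = sum(weights[s] for s in strata)
    weighted_sum = sum(score * weights[s] for score, s in zip(scores, strata))
    return weighted_sum / total_weight

# Hypothetical sample in which women are over-represented (60% vs. 50%).
sample_strata = ["F 35-44"] * 60 + ["M 35-44"] * 40
population_shares = {"F 35-44": 0.5, "M 35-44": 0.5}
scores = [80.0] * 60 + [70.0] * 40

weights = poststratification_weights(sample_strata, population_shares)
adjusted_mean = weighted_mean(scores, sample_strata, weights)
```

In this made-up example the unweighted mean is 76, while the weighted mean moves to about 75 because the over-represented stratum is weighted down.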