Address for reprints: Robert J. Goldberg, Ph.D., Department of Quantitative Health Sciences, University of Massachusetts Medical School, 55 Lake Avenue North, Worcester, MA 01655, Robert.Goldberg@umassmed.edu, Tel: (508) 856-3991, Fax: (508) 856-4596
The publisher's final edited version of this article is available at Am J Med
In this article we provide an overview of the different data collection approaches that are commonly utilized in carrying out clinical, public health, and translational research. We discuss several of the factors researchers need to consider in using data collected in questionnaire surveys, from proxy informants, through the review of medical records, and in the collection of biologic samples. We hope that the points raised in this overview will lead to the collection of rich and high quality data in observational studies and randomized controlled trials.
Keywords: data collection approaches, clinical research, observational studies
In a recent editorial, we described the different types of observational studies and randomized controlled trial designs that investigators often utilize in carrying out clinical and public health research 1. Although two of the most important steps in successfully carrying out a research project are the clear formulation of key testable hypotheses and careful selection of a cost-efficient, rigorous study design, less information is available for researchers with respect to contemporary methods of high quality and reliable data collection. With increasing attention being paid to patient-reported outcomes in observational, comparative effectiveness, and clinical trials research, data collection approaches that combine medical record abstraction, patient interviews, and administrative data will be more commonly utilized in the future.
In the present editorial, we discuss a number of issues that pertain to the collection of high-quality data in the conduct of clinical, translational, and epidemiologic research projects and ways to enhance the collection of reliable and meaningful data. We also discuss issues related to the accuracy of these data, and factors to consider in the possible independent confirmation of information collected from different data sources. The data collection instruments reviewed include questionnaire surveys and patient self-reported data; use of proxy/informant information; hospital and ambulatory medical records; and analysis of biologic materials.
Much of the information gathered in observational epidemiologic studies is collected in the form of patient/participant self-reports on standardized questionnaires which are either self or interviewer administered in person, by phone, or via mail or the internet. The factors on which information is routinely collected in these studies include socio-demographic characteristics, lifestyle practices, medical history, and use of prescribed and/or over the counter medications. Questions are also often asked about participants’ knowledge of, and attitudes toward, various lifestyle and disease predisposing factors. With increasing attention being paid to patient reported outcomes by funding agencies such as the National Institutes of Health (NIH), Agency for Healthcare Research and Quality (AHRQ), and the newly formed Patient-Centered Outcomes Research Institute (PCORI), measures of patient-centered factors such as Quality of Life (QoL), depression, anxiety, and cognitive and functional status are increasingly included in these surveys. The CONSORT (consolidated standards of reporting trials) Statement was recently updated to include standards for reporting patient reported outcomes in randomized controlled trials, highlighting the increasing awareness of the inclusion of such measures as key outcomes of these rigorous investigations 2.
Patient reported outcomes are ideally measured using standardized, validated instruments to promote the collection of high-quality data and allow for meaningful comparisons across observational studies or randomized trials. Use of standardized assessments also facilitates pooling of data across studies with the goal of establishing clinically relevant cut-points or clinically meaningful change in important patient related outcomes in response to a lifestyle intervention or medical treatment. Recent federally funded initiatives, such as the NIH Toolbox (www.nihtoolbox.org) and the Patient Reported Outcomes Measurement Information System (PROMIS) (www.nihpromis.org), have highlighted the importance of harmonization of patient reported outcomes data collection instruments.
Surveyed individuals are typically asked to respond to these questions in either a yes/no manner, on a Likert type scale (e.g., very often - not at all often), or with open-ended responses. The choice of responses is dictated by the investigator and, of course, by the standardized instrument (if one is used). The selection of the type of response desired is often made on the basis of the difficulty of the question asked and the depth of knowledge and level of precision the investigator would like to have about a particular factor.
Standardized instruments often have different forms that vary in length, so an investigator can decide whether a ‘long’ (e.g. SF-36) 3,4 or ‘short’ (SF-12) version is best suited for their study. Tests with multiple length versions typically have published psychometric properties (e.g., sensitivity and specificity of screening tests) which guide investigators in choosing a test version. For example, a consenting study participant might be asked a series of questions about their level of physical activity, either in the present or during a recent period of pertinent exposure. The number and depth of these questions would be determined, in part, by how this variable would be used in subsequent analyses and presented in peer reviewed publications. If the factor of physical activity was to be simply used as a controlling variable in either stratified or multivariable adjusted regression analyses, then a briefer assessment of physical activity might be more acceptable with the added benefit of reduced respondent burden. On the other hand, if an investigator is particularly interested in the role of type of aerobic activity, level of exercise intensity, or duration of physical activity, then a more extensive battery of questions might be asked about this factor with objective validation of self-reported activity carried out or a standardized instrument used.
Although the use of validated, standardized instruments is preferred, these data collection tools are not always available. If standardized instruments do not exist for a specific construct to be measured, investigators will often create ‘home-grown’ scales. It is extremely important to carefully design these home-grown instruments, ideally with the input of a psychometrician, and to pilot test all measures before using them in a formal research study. Ideally, these pilot efforts would involve validation of the instrument against a ‘gold-standard’ (e.g., clinical diagnosis) or important study outcome. One needs to carefully weigh the need for independent validation of participant responses, and the attendant costs and logistical issues associated with such validation, against simply discussing the lack of validation of certain variables as a study limitation. These decisions should be discussed with a senior, experienced mentor who has been involved in observational clinical research studies or randomized trials for many years. The advantages and disadvantages of questionnaire data are summarized in Table 1.
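The validation of a home-grown instrument against a ‘gold-standard’ described above can be illustrated with a small worked example. The following is a minimal sketch only, assuming a pilot study in which each participant has a yes/no screening result and a yes/no clinical diagnosis; the function name and the pilot data are hypothetical and are not drawn from any study cited here.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    screen_positive: Sequence[bool], gold_positive: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of a screen against a gold standard."""
    if len(screen_positive) != len(gold_positive):
        raise ValueError("Both sequences must describe the same participants.")

    pairs = list(zip(screen_positive, gold_positive))
    true_pos = sum(1 for s, g in pairs if s and g)
    false_neg = sum(1 for s, g in pairs if not s and g)
    true_neg = sum(1 for s, g in pairs if not s and not g)
    false_pos = sum(1 for s, g in pairs if s and not g)

    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("Need at least one diseased and one non-diseased participant.")

    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical pilot data for eight participants (illustration only).
pilot_screen = [True, True, False, True, False, False, True, False]
pilot_gold = [True, True, True, False, False, False, True, False]

pilot_sensitivity, pilot_specificity = sensitivity_specificity(pilot_screen, pilot_gold)
print(f"sensitivity={pilot_sensitivity:.2f}, specificity={pilot_specificity:.2f}")
```

In practice a pilot sample would need to be far larger than this, and confidence intervals around both estimates would ordinarily be reported.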
Table 1. Advantages and Disadvantages of Questionnaire Survey Data
Advantages:
- Can collect personal and/or risk factor data not typically contained in hospital/ambulatory care records
- Can elicit information in an analytically desirable and standardized manner
- Can maintain high survey response rates through various financial or other incentives
Disadvantages:
- Validating individual survey responses can be difficult, burdensome, costly, and of questionable utility
- If response rates are less than desirable, one may question the representativeness of the study sample and its generalizability
- Responses might differ if questions are asked in-person vs. by phone vs. by mail/internet
The collection of information about study participants through the use of proxy respondents can be one of the more challenging tasks for an investigator. Moreover, the accuracy/validity of the proxy’s responses, and the extent of the proxy’s knowledge about various health related aspects of the study participant, need to be thoughtfully considered in determining the type and quantity of information to be elicited from the proxy respondent. On the other hand, especially in observational studies where the cases or controls in a retrospective study may have died or may not be capable of/competent to provide their own responses, information from proxies may be the only source of data available. In some situations, informant perspectives are important data elements, even if different from those of the patient. For instance, family member reports of the type and amount of assistance a patient requires with activities of daily living may be qualitatively different from, but equally important as, those reported by the patient.
Informal caregivers are increasingly being recognized as ‘stakeholders’ in many research studies, particularly those that focus on patient reported outcomes such as quality of life. In cases of questionable mental status, or non-communicative state of a patient, informants can be very helpful and important in providing information to help establish a ‘baseline’ for a patient. In these situations, informants can report on the patient’s level of cognitive and physical function as well as level of independence, important outcomes in many contemporary clinical research studies. For some domains, validated informant questionnaires exist. For instance, the Informant Questionnaire on Cognitive Decline in the Elderly is an informant measure of cognitive function, and informant responses on the SF-36 and activities of daily living scales 5 have been used as assessments of health related quality of life and functional status with varying results 6,7.
Due to its ubiquity, and the abundance of high-quality data embedded within it, a commonly used source of information in clinical research studies is the medical record. Information contained in hospital or ambulatory care records may be used either as the sole source of data, or as a complement to other instruments used to elicit information. Decisions about the adequacy of using the medical record as the sole or main source of data for a given study hinge on the investigator’s hypotheses, study sample size, budget and timeline, as well as the extent and type of data available in a given record system. Medical records can be important sources of information that can reliably document participants’ medical history and clinical, laboratory, or physiologic profile at varying time points in a cost-efficient manner. On the other hand, the data contained in medical records can be frustrating to use and, in some cases, conflicting or of questionable accuracy, due to the non-standardized manner in which this information is collected, recorded, and/or abstracted by various health care professionals and members of research teams. The increasing use of electronic medical records and their merger with administrative data has eased data abstraction efforts and, with increasing use of standardized data entry sets, reduced data heterogeneity.
One major limitation of using the medical record as a primary data source is that potentially important patient reported information is often lacking; what is recorded is typically limited to a “chief complaint” or symptoms directly related to the present complaint. If clinical information is stigmatized (e.g., sexual history, alcohol or drug use), or difficult to systematically assess in primary care settings (e.g., cognitive status, depression), it is often under-reported in the medical record. It is also important to note that factors (e.g., medication use) are defined by clinicians, not by trained study staff or study participants, and certain variables may not be accurately coded. Moreover, the extent of documentation about key medical history or clinical variables can vary widely between providers (including conflicting data) and health care systems. This heterogeneity can create considerable difficulties in either the construction of key study variables or in their use.
For example, in studying a purported association between macular degeneration and a number of different dietary components, it would be important to document the presence of various medical history conditions which may affect an individual’s dietary practices as well as the development of macular degeneration. In this example, we would be particularly interested in ascertaining the presence of a history of type 2 diabetes mellitus based on information contained in medical records. As such, one needs to consider how this condition and related chronic medical conditions would be classified based on information contained in medical records. For example, is diabetes considered present if there is a simple notation of this condition in the patient’s medical history by a sole provider? On the other hand, might there be a need for the documentation of various key elements of each condition to be noted in the medical records (e.g., multiple elevated serum glucose levels obtained under fasting conditions) before a diagnosis of diabetes can be accepted? For several relatively common conditions, such as heart failure and stroke, independently and extensively validated algorithms have been developed to ascertain the presence of these important chronic diseases 8–10.
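The two classification approaches contrasted above (a simple notation versus documented key elements) can be written down as explicit rules before abstraction begins. The following is a minimal sketch, not a validated algorithm: the record structure, the function names, the 126 mg/dL fasting glucose cutoff, and the requirement of two elevated readings are all assumptions made for illustration, and a given study would need to set its own criteria.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative assumptions only; each study must define its own criteria.
FASTING_GLUCOSE_CUTOFF_MG_DL = 126.0
MIN_ELEVATED_READINGS = 2


@dataclass
class AbstractedRecord:
    """Hypothetical fields abstracted from one participant's medical record."""
    diabetes_noted_in_history: bool
    fasting_glucose_mg_dl: List[float] = field(default_factory=list)


def has_diabetes_lenient(record: AbstractedRecord) -> bool:
    """Accept the diagnosis on a simple notation in the medical history."""
    return record.diabetes_noted_in_history


def has_diabetes_strict(record: AbstractedRecord) -> bool:
    """Accept the diagnosis only with multiple elevated fasting glucose values."""
    elevated = sum(
        1 for value in record.fasting_glucose_mg_dl
        if value >= FASTING_GLUCOSE_CUTOFF_MG_DL
    )
    return elevated >= MIN_ELEVATED_READINGS


# A participant with a notation but only one elevated reading is classified
# differently under the two rules.
example = AbstractedRecord(diabetes_noted_in_history=True,
                           fasting_glucose_mg_dl=[110.0, 131.0])
print(has_diabetes_lenient(example), has_diabetes_strict(example))
```

Writing the rule down in this way makes the acceptance criteria explicit and allows a sensitivity analysis in which results are compared under the lenient and the strict definition.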
Depending on the major research questions under study, the resources available, and the amount of variability/precision the investigator is willing to accept in documenting the presence (or, equally importantly, the absence) of each of these comorbid conditions, rules of acceptance and rejection can be applied in the consideration of these factors. Similarly, the investigator might also decide to simply ask the survey participant whether or not diabetes had ever been diagnosed in their past. This should be a very simple thing to do, but the investigator needs to have considered beforehand how they will analyze the data if personal responses are not consistent with their medical record findings. Table 2 summarizes the advantages and disadvantages of using medical records.
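One way to plan for inconsistent responses is to quantify, in advance, how often self-report and the medical record agree. The following is a minimal sketch, assuming yes/no diabetes status from both sources for the same participants; Cohen's kappa is used here as one common chance-corrected agreement statistic, which is our choice for illustration rather than a method prescribed in this editorial, and the data are hypothetical.

```python
from typing import List, Sequence


def cohens_kappa(source_a: Sequence[bool], source_b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two yes/no data sources."""
    if len(source_a) != len(source_b) or len(source_a) == 0:
        raise ValueError("Both sources must cover the same non-empty sample.")

    n = len(source_a)
    observed = sum(1 for a, b in zip(source_a, source_b) if a == b) / n
    prop_yes_a = sum(source_a) / n
    prop_yes_b = sum(source_b) / n
    # Agreement expected by chance if the two sources were independent.
    expected = prop_yes_a * prop_yes_b + (1 - prop_yes_a) * (1 - prop_yes_b)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is complete.")
    return (observed - expected) / (1 - expected)


def discordant_indices(source_a: Sequence[bool], source_b: Sequence[bool]) -> List[int]:
    """Positions of participants whose two sources disagree, for follow-up review."""
    return [i for i, (a, b) in enumerate(zip(source_a, source_b)) if a != b]


# Hypothetical data for ten participants (illustration only).
self_report = [True, True, False, False, True, False, False, False, True, False]
medical_record = [True, False, False, False, True, False, True, False, True, False]

kappa = cohens_kappa(self_report, medical_record)
to_review = discordant_indices(self_report, medical_record)
print(f"kappa={kappa:.2f}, participants to review: {to_review}")
```

The list of discordant participants can then be handled by a pre-specified rule, for example accepting the record, accepting the self-report, or adjudicating each case.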
Table 2. Advantages and Disadvantages of Hospital/Ambulatory Care Records