Within the simplifications of the school curriculum, this is incorrect. A deeper and more general discussion shows that the mass of the pendulum does have a small influence on the period, but the effect is negligible in appropriate experimental settings (Nelson and Olsson). Hands-on materials and the computer simulation both allowed almost the same operations in the experimental space (Klahr). In the hands-on experiment, students were given three masses of 50 g each that could be combined to create various weights.
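For reference, in the small-angle approximation the period of an ideal simple pendulum depends only on its length and the local gravitational acceleration:

\[
T \approx 2\pi \sqrt{\frac{\ell}{g}}
\]

The mass does not appear in this expression; it enters only through higher-order effects such as air drag or the finite size of the bob, which is why the school-level treatment regards the period as mass-independent.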
Adding mass did not affect the effective length of the pendulum. The angular displacement was measured on an angle meter, and parallax errors had to be taken into account. Various features of the simulation were altered or removed, including those that allowed variation of string length and gravity.
The version used for this study only allowed manipulation of mass and angular displacement. Measuring the period in the simulation experiment was accomplished with a digital stopwatch.
The only difference between the hands-on and the computer-simulated experiments was the measurement uncertainty: the simulation reports periods to a tenth of a millisecond, whereas the stopwatch in the hands-on experiment has a markedly larger measurement uncertainty.
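To illustrate this difference in measurement uncertainty, the following R sketch simulates repeated hand-timed measurements of a hypothetical period; the assumed true period of 1.4 s and the reaction-time spread of 0.1 s are illustrative assumptions, not values from the study.

```r
set.seed(1)
true_period <- 1.4   # hypothetical oscillation period in seconds (assumption)

# Hand timing: each measurement perturbed by reaction-time noise (sd = 0.1 s)
stopwatch <- true_period + rnorm(20, mean = 0, sd = 0.1)
sd(stopwatch)        # empirical spread of the hand-timed values, roughly 0.1 s

# The simulation, by contrast, reports the period to a tenth of a millisecond
round(true_period, 4)
```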
The task was to explore the relationship between the insulation of a thermometer and the temperature it shows when the experiment is left in an environment with constant temperature. The experimental setup consisted of three foam cubes of different sizes and a digital thermometer. The foam cubes could be used to systematically vary the thickness of the sheathing, and thus the amount of insulation provided.
Each foam cube had a hole drilled halfway into it, allowing the participants to place the thermometer in the center of the cube. Like the simple pendulum experiment, this experiment is very easy to conduct. Both studies employed the same laboratory procedures. After a brief introduction to the contexts, participants were asked to state an initial hypothesis. To avoid irrelevant hypotheses, students chose from three predefined options. In the simple pendulum task, the hypotheses were that increasing the pendulum mass causes (a) an increase in the oscillation time, (b) no change in the oscillation time, or (c) a decrease in the oscillation time.
The temperature task also provided three hypotheses: increasing the thickness of the sheathing surrounding a thermometer causes (a) an increase in the temperature, (b) no change in the temperature, or (c) a decrease in the temperature.
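For context, the physics behind this task can be summarized by Newton's law of cooling; assuming the thermometer starts at temperature \(T_0\) in an environment at constant temperature \(T_{\text{env}}\):

\[
T(t) = T_{\text{env}} + \left(T_0 - T_{\text{env}}\right) e^{-t/\tau}
\]

Thicker sheathing increases the time constant \(\tau\) but leaves the equilibrium reading \(T_{\text{env}}\) unchanged, so, analogous to the pendulum task, the insulation does not affect the temperature the thermometer eventually shows.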
Next, students tested their hypothesis experimentally with the available setup by collecting their own data. The experimental setting was constructed in such a way that the quality of the data allowed clear inferences.
Although participants had as much time available as they needed, the experimental phase typically lasted 5 to 10 minutes. In both studies, data collection occurred right after the participants conducted the experiment.
For Study 1, we conducted semi-structured interviews with all the participants immediately afterwards. Students were first asked whether they maintained or rejected their initial hypothesis and to justify their decision. Because the interviews were semi-structured, interviewers were free to ask more or different questions to elicit further justifications. This approach inevitably produces variance, which is the primary reason we did not count frequencies across justifications.
We will discuss this issue further later in the paper. All the participants were interviewed. Again, all the participants completed the questionnaire. The methodological approach and a detailed description of the development of the questionnaire are presented in the next sections.
Therefore, it is necessary to induce a manifest behavior to quantitatively operationalize the different types of justifications. Much work in the field of argumentation consequently utilizes spoken language, such as studying group conversations. This approach is justifiable considering the dialogical aspect of argumentation, yet it is also problematic in the context of our work, as social desirability may bias the justifications given during an interview (Nederhof). For instance, an eighth grader interviewed face-to-face by an unknown adult researcher about their justifications for or against an initial hypothesis might be afraid to say something he or she assumes might be seen as inappropriate in physics class, for example, justifying a claim by referring to gut feelings.
Utilizing written justifications is a possible alternative, but it raises similar concerns. For the reasons outlined above, the most appropriate assessment method may be to ask students anonymously to rate their agreement with a list of justifications. This approach may alleviate the social desirability bias of an interview setting (Nederhof; Richman et al.).
In addition, interviews and text analyses are time-demanding methods that are not suitable for studying structural relations between latent constructs, because such analyses usually demand large sample sizes. Accordingly, research in this field must be conducted in a quantitative as well as a qualitative manner, which requires highly efficient and economical methods, such as self-administered questionnaires. Consequently, we chose to assess the justifications quantitatively by presenting students with a series of statements in a paper-and-pencil format and asking them to indicate to what extent these assertions applied to their own justification for supporting or rejecting their hypothesis.
We argue, though, that this critique is not justified here. In contrast to many of the studies this critique addresses, our instrument can only be employed within a certain situation, specifically, while conducting a scientific experiment.
It would not be appropriate to use this instrument to assess epistemic cognition without any accompanying lab work. After conducting the experiment, the participants were asked to indicate the extent to which each statement applied to their justification.
Each item refers to only a single category. Students rated the extent to which the items applied to their decision on a five-point Likert-type scale ranging from 0 (does not apply) to 4 (fully applies). We decided to use a five-step Likert-type scale because we planned to analyze the data within the statistical framework of confirmatory factor analysis (CFA), using a maximum likelihood (ML) algorithm to estimate parameters (Brown). There is sufficient evidence that a five-point scale fulfills the requirements regarding data quality under ML estimation (e.g., Beauducel).
The questionnaire was developed in three steps: (1) development of an item battery covering selected categories identified in Study 1, (2) evaluation of content validity, and (3) evaluation of psychometric quality. For the item battery, we used the coded interviews from Study 1 to establish a sufficient number of items. However, because some of the passages taken from the interviews had to be rephrased during item development, the matching of items to the different justification types was ensured via expert rating, which we describe in the Results section.
Finally, the content-validated set of items was analyzed to assess psychometric quality in terms of item difficulty, item variance, and discrimination. Furthermore, evidence regarding the construct validity and the theoretically assumed factorial structure is given below. All the interviews from Study 1 were transcribed verbatim. We used an iterative grounded theory approach to derive categories from the transcripts.
We began with the interviews from all the participants who worked on the simple pendulum task, both hands-on and simulation. The justifications identified in these interviews were then grouped by similarity into categories. Each category comprised a specific type of justification that could be used regardless of whether a student changed his or her initial hypothesis. Next, we analyzed the interviews from the students who worked on the temperature in solid bodies task in the same way to look for evidence of completeness, validity, and transferability of our categorization.
To ensure the reliability of the coding process, two raters, who were trained on a small subset of transcripts using a coding manual, analyzed all the interviews. The raters decided, for every justification identified in the transcripts, whether a category does not apply (coded as 0), partly applies (1), or fully applies (2).
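As an illustration of the reliability analysis, the following R sketch computes the percentage agreement and a weighted kappa for such 0/1/2 codes; the irr package and the small ratings data frame are assumed for demonstration and are not part of the study materials.

```r
library(irr)  # inter-rater reliability statistics

# Hypothetical codes: one row per identified justification, one column per rater
ratings <- data.frame(
  rater1 = c(0, 1, 2, 2, 0, 1, 2, 0),
  rater2 = c(0, 1, 2, 1, 0, 1, 2, 0)
)

agree(ratings)                        # percentage agreement
kappa2(ratings, weight = "squared")   # weighted kappa for ordinal 0-2 codes
```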
In addition, the percentage agreement was analyzed; detailed results for the inter-rater reliabilities of each category are provided in Table 1. All the data from Study 2 were extracted into a tabular format. Responses to the Likert-scaled items were coded from 0 (does not apply) to 4 (fully applies). Accordingly, we used robust ML estimators in the CFA, which are seen as unbiased when using ordinal indicators with five or more steps (Pui-Wa and Qiong; Rhemtulla et al.).
All the models were calculated within the statistical framework R (R Core Team) using the software package lavaan (Rosseel). The metric of the latent variables was scaled by fixing the variances of the latent factors to 1 (Brown). In line with research question (a), we aimed to generate a broad spectrum of possible justifications in Study 1.
Participants were randomly assigned to either the hands-on or the computer-simulated simple pendulum context. All the students had sufficient experience in experimentation. Students who did not choose an incorrect initial hypothesis were not included in the analysis in either study, as it could be assumed that those students did not perceive the data as anomalous. In contrast, we assume that those students with incorrect initial hypotheses saw a discrepancy between their hypotheses and their own collected experimental data.
Table 2 describes these categories in general and provides example statements from the simple pendulum task. Interview 1 (8th-grade student, 14 years old, simple pendulum task, hands-on experiment, wrong initial hypothesis):

Student 1: I found out that the heavier the pendulum, the longer it takes [the time of oscillation]. Once I had 2. Then I had way less [mass] but only had like 2. The difference is just minimal.

We identified two types of justifications in this interview.
This justification indicates a lack of knowledge of measurement uncertainties. Because the student did not explicitly refer to measurement uncertainties, we name this category measurement uncertainties implicit. Interview 2 (8th-grade student, 13 years old, simple pendulum task, simulation experiment, wrong initial hypothesis):
Student 2: I assumed that the greater the mass, the longer the time of swing, because the air drag is bigger then. In my observations, it was the same [the time of oscillation] no matter which weight I used. I measured once with 0. Always 2.

Student 2: Because I observed something totally different when I conducted the experiment. The computer has its reasons for that.

Student 2: The time of swing stays the same if you change the mass.

Research question (b) aims to develop an instrument to empirically assess the use of different justifications in lab work learning situations.
Because Study 1 led to ten different categories of justifications—which is quite a lot in terms of test development—we selected four categories for operationalization in Study 2: intuition, appeal to an authority, measurement uncertainties explicit, and data as evidence. This selection, which we elaborate on below, was based on the general relevance of these justification types for learning science with respect to the literature.
Intuition was included because students gave non-rational justifications in the interviews as a matter of course. The use of this type of justification is particularly relevant for science education research, as it relates to the distinction between hot and cold cognition. Further, intuition is known to be an important factor in learning science, but it has received little attention in science education research (Fensham and Marton). Investigating the use of intuition in lab work is particularly relevant, as it is well known from other disciplines that people tend to rely on intuition in statistical decision-making (Kahneman and Tversky). Appeal to an authority was included because it is highly relevant to know the extent to which expertise, which is already integrated into the experiment because an expert put it together for the students, influences cognition during processes of data evaluation and experimental observation (Hug and McNeill). Measurement uncertainties explicit was included because little work in the context of argumentation examines the influence of measurement uncertainty in data used to justify a claim, despite the fact that evaluating quantitative data in order to draw conclusions is not possible without estimating the uncertainty.
The category data as evidence was operationalized for obvious reasons: the justification of claims on the basis of measurement data used as evidence is at the core of science and is addressed in science standards. The developed items were all based on the interview answers coded in Study 1. Additionally, for the categories intuition and data as evidence, we were able to draw on established instruments, such as the Rational-Experiential Inventory (Epstein et al.).
To further ensure content validity, eight graduate students (26–34 years old) from different domains (three in physics, two in chemistry, one in biology, one in English, and one in arts) were asked to judge item texts regarding the category of justification they address. The broad academic backgrounds ensured that the content validity could be extended beyond the domain of physics. The experts were first presented with a detailed description of the justification types.
Among the 88 items, 63 matched this criterion: 18 for the category intuition, 13 for measurement uncertainties explicit, 12 for appeal to an authority, and 20 for data as evidence. We must note that for six items in the category appeal to an authority, only six instead of seven of the eight experts agreed on the classification; however, we chose to include these items in the questionnaire.
To correct this, we rephrased those six items and put stronger emphasis on expert knowledge. Six inadequate phrasings of items were identified and revised. Although these six items were not re-rated by the experts, we argue in favor of the content validity of these items because the deficiencies were obvious upon comparison with valid items. A set of 63 content-valid items was then subjected to an evaluation of psychometric quality.
To further reduce the item battery to a reasonable number of items, we initially used two criteria to select the items with the highest psychometric quality: (1) items with extreme item difficulty were excluded, and (2) items with low discrimination were excluded. We want to note that item difficulty is a technical term that does not imply that the instrument assesses an underlying construct of ability or skill (Kline). Item selection based solely on discrimination carries the risk of low variance, that is, the risk that the resulting scale will not be able to differentiate sufficiently across a wide range of test scores, as extreme item difficulties naturally reduce discrimination.
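A minimal sketch of these classical item statistics in R; the simulated matrix items (students in rows, 0–4 item scores in columns) is a hypothetical stand-in for the actual questionnaire data.

```r
set.seed(1)
# Hypothetical response matrix: 200 students x 10 items, scores 0-4
items <- matrix(sample(0:4, 200 * 10, replace = TRUE), nrow = 200)

difficulty <- colMeans(items) / 4    # proportion of the maximum score per item
item_var   <- apply(items, 2, var)   # item variance

# Discrimination as corrected item-total correlation (item vs. sum of the rest)
discrimination <- sapply(seq_len(ncol(items)), function(j) {
  cor(items[, j], rowSums(items[, -j]))
})
```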
Consequently, we added a third criterion to the item selection: (3) the items with the largest selection indices were selected from the remaining set, not exceeding eight items per category of justification. This procedure led to a final set of 31 items. The item difficulty of the remaining items falls in the medium range, although the difficulty intervals of the four scales overlap. In addition, CFA estimates discriminant and convergent validity and allows the testing of competing models (for a more in-depth review of the features of CFA, see Brown). To estimate the factorial validity, we defined four competing models.
Models 1–3 include four factors that reflect the four categories of justification. Model 1 contains all 31 items, while Models 2 and 3 use a reduced version of Model 1 with five items per category (the items with the highest CFA-based factor loadings).
Models 2 and 3 differ in that Model 2 allows for covariance of the factors, while Model 3 has an orthogonal factor structure. All evaluations of model fit were accomplished with regard to accepted standards for the interpretation of fit indices (Hu and Bentler). The goodness-of-fit indices for all the models are reported in Table 3.
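A sketch of how this model comparison might be specified in lavaan; the item names (int1, ..., evi5) and the data frame d are hypothetical placeholders, not the actual item labels or data of the questionnaire.

```r
library(lavaan)

# Model 2: four correlated factors, five items each (hypothetical item names)
model2 <- '
  intuition   =~ int1 + int2 + int3 + int4 + int5
  authority   =~ aut1 + aut2 + aut3 + aut4 + aut5
  uncertainty =~ unc1 + unc2 + unc3 + unc4 + unc5
  evidence    =~ evi1 + evi2 + evi3 + evi4 + evi5
'

# d: data frame holding the coded 0-4 responses (placeholder).
# Robust ML ("MLR"); std.lv = TRUE fixes the latent variances to 1.
fit2 <- cfa(model2, data = d, estimator = "MLR", std.lv = TRUE)

# Model 3: same measurement model, but with uncorrelated (orthogonal) factors
fit3 <- cfa(model2, data = d, estimator = "MLR", std.lv = TRUE,
            orthogonal = TRUE)

fitMeasures(fit2, c("cfi", "tli", "rmsea", "srmr"))

# Model-based scale reliabilities can then be obtained, for instance,
# via semTools::reliability(fit2)
```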
Model 1 shows a poor fit, as the CFI is too low. The competing Model 3 fits noticeably worse than Model 2, as all of its fit indices are worse. Model 4, a one-factor model, is discarded because not all the fit indices meet the cutoff criteria.
Hence, Model 2 shows the best fit to the data and was used for further analysis. The reliability of the scales can be calculated directly from the CFA model as the proportion of true variance to total variance of the measurement (Brown). The CFA-based reliabilities are high for all four scales. The aim of Study 1 was to identify the different types of justifications that students give for hypotheses in physics when faced with quantitative anomalous data obtained from their own experiments.
We found ten different categories of justifications. Some correspond to categories already known from the literature, while others are novel to the analysis of justifications, such as those referring to experimental competences, measurement uncertainties, and the suitability of the experimental setup. This points to an important difference between our study and that of Chinn and Brewer: while we investigated student-generated experimental data, they used entire theories with plausible predefined initial hypotheses and predefined data sources.
The fact that we found both previously known and new categories shows that argumentation is influenced by the situation in which it occurs while also adhering to general strategies (Kind). The identified justification types indicate that there are both rational (e.g., data as evidence) and non-rational (e.g., intuition) justifications. Hence, we argue that non-rationality in argumentation in science instruction should not be overlooked (Sinatra). It may, for example, be the case that students make non-rational decisions when they do not have enough information at hand to make informed choices.
This is in line with Petty and Cacioppo, who state in their elaboration likelihood model of persuasion (ELM) that an inability to process information can lead to peripheral (non-rational) choices.
These problems may lead to the use of the justification types ignorance, measurement uncertainties implicit, and use of theoretical concepts.
We argue that this is a restriction. It is worth noting that another study by Lin, which also used laboratory experiments, identified 17 categories of justifications. Although all 17 categories can be matched to our categorization, the allocation is somewhat problematic, as it involves some overlap. Furthermore, Lin suggests that the justification "accept anomalous data but do not know why" does not lead a participant to conceptual change, even though the anomalous data are accepted. We doubt that this is sufficiently proven.
Accepting data, even without knowing why, can still involve a conceptual change. The results of this study are also particularly relevant for practitioners. Our proposed categorization of justifications given by students allows science teachers to anticipate the possible range of justifications students might generate in the context of lab work.
For example, if students refer to gut feelings or ignore data completely when justifying a claim, teachers can provide help by supporting students with prepared worksheets that focus on the evaluation of the evidence at hand.
This might include a discussion of measurement uncertainties when estimating the quality of the data. Thus, awareness of the fact that students will not always use justifications favored by science educators (e.g., data as evidence) can help teachers react appropriately. More detailed implications of the results for practice are described elsewhere (Ludwig and Priemer). In summary, three new aspects characterize our Study 1 results: (a) the use of self-collected experimental data is now included in the assessment of justifications.
The latter is valuable because conceptual change is hard to reach and to assess (Posner et al.). Of course, our categorization has limitations. Given that the categories of justifications found in the simple pendulum task could also be found in the temperature in solid bodies task, we conclude that our categorization is discerning and comprehensive within the scope of the methodology used.
However, as noted above, justifications can be context-dependent (Chinn et al.). By addressing two contexts in our study, we made sure that the results do not depend on a single topic. Hence, we provide a good starting point for further research. The categories measurement uncertainties explicit and implicit are probably mostly encountered in lab work situations in which the uncertainty of measurement plays a major role, which is more often the case in the domain of physics than in other subjects.
Scholars and practitioners should be aware of this constraint. Both physics contexts of our studies have further characteristics that limit their generalizability. Further, our participants had little prior knowledge in these contexts, which may influence their use of justifications, for example, when they refer to known theories. While conducting the experiments to generate their own quantitative data, the students collected evidence of varying quality and quantity (for example, in the number of repetitions and the precision of measurements) and documented their results differently.
This led to variation in the resources students had at hand to recapitulate their experimental work when giving justifications. It remains an open question whether our justification types are valid in settings in which students are not required to interpret anomalous data.
Furthermore, we do not know whether younger or older students would use the same justifications. We emphasize that we did not determine the frequency of use of the different justification types for the following reasons: First, the practice of quantitative analysis of qualitative data is commonly criticized (Hammer and Berland). Second, comparing frequencies across categories would have required us to ensure that the interviews elicited all the justifications a student might have in mind.
This was not the aim of the interview questions. Eliciting every justification is especially difficult with categories such as intuition, which students might see as inappropriate in science classrooms and therefore might not mention, even if such categories did play a role in their decisions. The four selected categories (data as evidence, measurement uncertainties explicit, intuition, and appeal to an authority) are of general interest in lab work settings because they reflect known non-rational justifications that are also found in other studies (such as using intuition and referring to experts) and because they focus on the evaluation of the collected data as evidence and on the quality of data with respect to measurement uncertainties.
The questionnaire can be administered in 5–10 minutes and is thus especially suitable for medium- to large-scale assessments. The target group is eighth- and ninth-grade students. The results obtained from this questionnaire are highly contextual and directly related to the specific laboratory task. The results of the CFA confirm the claimed underlying four-factor structure of the questionnaire, and the uniformly high factor loadings indicate good convergent validity (Brown). The absence of strong correlations between latent factors is essential for gaining evidence of divergent validity.
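The latent factor correlations on which this judgment rests can be read directly from a fitted lavaan model; a brief sketch, reusing the hypothetical fit2 object from the sketch above:

```r
# Model-implied correlations among the four latent factors
lavInspect(fit2, "cor.lv")

# Standardized estimates with significance tests for the factor covariances
standardizedSolution(fit2)
```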
Brown suggests an upper limit for such factor correlations. In our work, only three of the six possible structural correlations between factors reached significance, and these were moderate in size. This supports the claim that all four underlying categories of justification can be measured independently. Model-based reliability estimators consistently return high results.
Accordingly, we argue that the questionnaire measures the use of justifications in a reliable manner. Thus, we gained evidence for the quality of the questionnaire in terms of content validity (expert rating), factorial validity (CFA), discriminant validity (correlations between factors), convergent validity (high factor loadings of indicators), and reliability (model-based reliability estimators). The final questionnaire is available as supplementary material accompanying the online version of this article (Online Resource 1).
Note that an English translation is provided here; however, the questionnaire was developed and distributed in German. Thus, the authors caution that the textual validity of the translated items has not been explicitly investigated.
Nevertheless, the questionnaire and the data we have collected concerning it provide a good basis for further research. Of course, the psychometric evaluation of the instrument also has limitations.
Due to the exclusion of items in the development of the test, the four justification categories may be underrepresented in the questionnaire. Moreover, the questionnaire may not be able to sufficiently differentiate between persons in the very extreme lower and upper ends of the scale due to the lack of items in these ranges.
While the chosen model, Model 2, shows a satisfactory fit to the empirical data, other items representing the four justification categories might have led to another model and hence possibly a better model fit. Finally, we tested only a limited number of competing models, specifically only one- and four-factor models. We thus do not know whether another factorial structure might fit better than Model 2. Again, we want to stress that we decided against reporting the means of the newly developed scales due to a possible research bias that can occur when scale analysis, item selection, and hypothesis testing are carried out on the same sample (Kline). The questionnaire in its present form is a valuable tool for assessing justifications frequently used by students.
How do personal factors, such as the ability to evaluate data or domain-specific knowledge, influence how students justify hypotheses?
Will a highly motivated student automatically justify claims on the basis of measurement data as evidence? Will the use of justifications vary with age? Knowing what influences the use of justifications, how it develops over time, and how the use of rational rather than non-rational justifications can be promoted is only one side of the coin: science educators also have to investigate how the use of different types of justifications may affect learning outcomes in the science lab.
It is conceivable, for example, that the use of data as evidence or the evaluation of uncertainties might lead to better learning outcomes. These hypotheses can now be empirically investigated. Furthermore, justifications are components of arguments that are often built to persuade, whether oneself, a classmate, or the scientific community; therefore, it seems especially important to investigate the relationship between persuasion and the use of different justifications.
Some of these questions are currently being pursued in further research, in which we are applying our questionnaire in two large-scale studies with high school students participating in lab work courses. Employing methods of latent variable modeling, our first analyses show, for example, that students who use data as evidence in their justification are more likely to state a correct hypothesis after experimentation, while relying on intuition leads to a less permanent decision (Ludwig). We could also demonstrate an influence of the learning environment (real vs. virtual) on these relationships.
These results point to the fact that it is now possible to investigate, at a fine-grained level, the process of stating a scientific hypothesis based on experimentally derived data.

References

Abi-El-Mona, I. International Journal of Science Education, 33(4).
Albert, E. Development of the concept of heat in children. Science Education, 62(3).
Anderson, R. Inquiry as an organizing theme for science curricula. In Ledermann (Eds.).
London: Lawrence Erlbaum.
Asterhan, C. Argumentation and explanation in conceptual change: Indications from protocol analyses of peer-to-peer dialog.
Cognitive Science, 33(3).
Beauducel, A. On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA.
Betsch, C.
Brown, T. Confirmatory factor analysis for applied research. New York: Guilford Press.
Brown, D.

The following tables are meant to aid in collecting the raw experimental data.
Do not use these tables in the final lab report; follow the example table in the handout on how to write a results section.

Table 1. Time-series data for your own group. Record the dependent variable each assigned day.
Procedure 1: Dependent variable?

Table 2. Each group must have a sediment score for each treatment for four days; these data will be used to generate a time-series graph for the lab report.
Flask opening treatment. Interpretations and conclusions: check the class standard flasks for the sediment scale. The class data table contains the data that must be used for the Chi-square test in your lab report.

Table 3. Bio class data for Procedure II (page 5 of the lab manual). The sediment scores are based upon the score assigned to each flask on Day 5.

Sediment score on Day 5 | No cotton plug | Cotton plug
Group 1                 |                |
Group 2                 |                |
Group 3                 |                |
Group 4                 |                |
Group 5                 |                |
Group 6                 |                |
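As an illustration of the Chi-square test on the class data, the following R sketch uses invented counts of flasks per sediment category; the numbers are placeholders only and must be replaced by the actual Day 5 class data.

```r
# Hypothetical Day 5 counts: rows = treatment, columns = sediment category
obs <- matrix(c(5, 1,
                2, 4),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("No cotton plug", "Cotton plug"),
                              c("Sediment present", "No sediment")))

chisq.test(obs)  # tests whether sediment scores depend on the treatment
```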
Formatting a testable hypothesis

What is a real hypothesis? Examples of testable hypotheses:
- Chocolate may cause pimples.
- Salt in soil may affect plant growth.
- Plant growth may be affected by the color of the light.
- Bacterial growth may be affected by temperature.

Some experiments completely support a hypothesis and some do not.
If a hypothesis is shown to be wrong, the experiment was not a failure. All experimental results contribute to knowledge. Experiments that do or do not support a hypothesis may lead to even more questions and more experiments.
After a year, the farmer finds that erosion on the traditionally farmed hill is greater than on the no-till hill. The plants on the no-till plots are taller and the soil moisture is higher. The farmer decides to convert to no-till farming for future crops.
The farmer continues researching to see what other factors may help reduce erosion. As scientists conduct experiments and make observations to test a hypothesis, over time they collect a lot of data. If a hypothesis explains all the data and none of the data contradicts the hypothesis, the hypothesis becomes a theory. A scientific theory is supported by many observations and has no major inconsistencies. A theory must be constantly tested and revised.
Once a theory has been developed, it can be used to predict behavior. A theory provides a model of reality that is simpler than the phenomenon itself. Even a theory can be overthrown if conflicting data is discovered.
However, a longstanding theory that has lots of evidence to back it up is less likely to be overthrown than a newer theory. Skip to main content. Physical Geography. Search for:. Scientific Method You have probably learned that the scientific method is a series of steps that help to investigate. Scientific Questioning The most important thing a scientist can do is to ask questions. What makes Mount St.
Helens more explosive and dangerous than the volcano on Mauna Loa, Hawaii? What makes the San Andreas fault different than the Wasatch Fault? Why does Earth have so many varied life forms but other planets in the solar system do not?
- What impacts could a warmer planet have on weather and climate systems?

Scientific Research

To answer a question, a scientist first finds out what is already known about the topic by reading books and magazines, searching the Internet, and talking to experts.

Example: The farmer researches no-till farming on the Internet, at the library, at the local farming supply store, and elsewhere.

Example: The farmer conducts an experiment on two separate hills.
In this experiment: What is the independent variable? What are the experimental controls? What is the dependent variable?
Science does not prove anything beyond a shadow of a doubt. Scientists seek evidence that supports or refutes an idea. If there is no significant evidence to refute an idea and a lot of evidence to support it, the idea is accepted.