I was inspired to write this post after reflecting on James Boutin’s series of posts critiquing the construction and use of data in schools. There are a lot of ways to screw up evaluations, beginning with misguided initial theories, terrible instrument design, and inept analysis and interpretation. In this post, I’m not going to tell you all of the ways you can fail and how to succeed. There are too many for a single post. Instead, I want to provide the big picture process for doing evaluation scientifically so that you know what you should be getting into when you decide to evaluate.
Evaluation has two components – assessing the causal processes and developing the monitoring system (i.e. benchmarks) to continually assess them. The causal assessment tells you what about your program and what about your operating environment are influencing your outcomes. It allows you to say something like, “participation in our interview-skills training program increases the probability of employment by 25%, but the lack of access to public transportation decreases our clients’ probability by 30%.” The benchmarks allow you to keep track of these influential variables and outcomes and detect any changes or problems with the program. They allow you to say, “over the past year, 50 clients have participated in our interview skills training, but 40 did not have access to public transportation.” These two pieces of information can play a very influential role in persuading city government to expand train or bus routes in your direction or to increase funding for bus passes.
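To make the arithmetic behind claims like those concrete, here is a minimal sketch using invented client records (every name and number below is hypothetical, not from a real program): it compares employment rates between participants and non-participants, and between clients with and without transit access.

```python
# Hypothetical client records: (participated, has_transit, employed).
# All figures are invented for illustration; a real causal estimate
# needs a proper research design, not a raw comparison of rates.
clients = (
    [(True, True, True)] * 4
    + [(True, False, True)] * 2
    + [(True, False, False)] * 2
    + [(False, True, True)] * 2
    + [(False, True, False)] * 2
    + [(False, False, True)] * 2
    + [(False, False, False)] * 2
)

def rate(records, cond):
    """Employment rate among the records matching cond."""
    subset = [r for r in records if cond(r)]
    return sum(1 for r in subset if r[2]) / len(subset)

participation_effect = rate(clients, lambda r: r[0]) - rate(clients, lambda r: not r[0])
transit_effect = rate(clients, lambda r: r[1]) - rate(clients, lambda r: not r[1])
print(f"participation: {participation_effect:+.0%}")  # +25%
print(f"transit access: {transit_effect:+.0%}")       # +25%
```

A raw difference in rates like this is only suggestive; untangling whether the program or the environment is driving the gap is exactly what the expert-led causal analysis described below is for.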
My suggestion for a general strategy is to perform a causal analysis once every five or ten years and use the findings to select which benchmarks to track. This 5-10 year interval is a heuristic. Some programs operate in environments that change quickly relative to others. The more dynamic your environment and the more changes you make to your program, the more often you will have to redo the causal analysis. In the example above, a new bus line might change the interview program’s dynamics in several indirect ways: more clients may come from new areas, changing group dynamics, and better access to other resources, like a public library or health facilities, may improve participants’ job chances for reasons unrelated to your program.
Causal Evaluation: Assessing causal relationships is not only the most important part of evaluation, but also the most difficult and the most susceptible to bias, misinterpretation, and generally terrible research. That is why I strongly advise hiring an expert, typically someone with at least master’s-level training in appropriate research methodologies. Causal inference demands the highest standards of social science research and some of the most sophisticated methods we’ve developed (which is why I describe this approach as doing evaluation “scientifically”). In essence, I suggest paying $10,000-$50,000 (or more for larger, more complex programs and organizations) once every five to ten years to hire a well-qualified contract researcher or consultant. My earlier post “Researching With Nonprofits” goes a bit into what this process might be like. Even better would be to hire one full-time, but I won’t get into the difficulties of financing operating costs.
The most important part of putting the causal evaluation together is the program logic model (for entrepreneurs, this is why you must make one). Writing out the logic model gives you an explicit understanding of what you believe are the most important processes determining your program’s outcomes and is the starting point for designing the analysis. Depending on how much data you can gather and to what extent you’re able to randomly select clients to participate in programs, you can expect several waves of data collection or possibly one big one. Large amounts of data allow for several sophisticated analyses that provide evidence for causal inference. Small datasets require multiple measures over time to both gather enough data and add temporal variables that help support causal inference. So, if you’re a small organization or the program is small, you can expect waves of data collection lasting for a period of time determined by the turnover in your program.
So what do you get for your investment? It depends on the results. If the study fails to find any significant causal connections and there’s nothing wrong with the data, then a full program review is in order, since your program logic model has not received empirical validation. This is the difference between benchmarks and a causal analysis, and why benchmarks alone are not enough. For the interview skills program example, benchmarks would say, “40 clients used the service and 30 received a job offer.” Great, right? Nope. The causal analysis might conclude that those 30 people would have gotten those jobs without the training.
Benchmarks tell you what’s happening. The causal analysis can tell you whether you should take credit for it. The overall goal then is to get the causal part right and then ride on the results for as long as the causal dynamics remain stable.
Benchmarking: If the study succeeds in isolating key causal relationships, then those variables become benchmarks. To go back to the interview skills program example, if you find that, say, access to transportation, clients’ education level, and involvement in other programs all affect the probability of receiving a job offer, then you collect that information, put it into a spreadsheet, and monitor the changes. If the rate of job offers decreases, you can look and find that your client base in the last cycle was less educated or less involved in the rest of your programs. Thus, you can say that the program is working with more disadvantaged clients and that you need to do more to get clients involved in other programs. Hopefully, you can see how this might inspire confidence among your staff and board and encourage donors to open their wallets.
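The “spreadsheet and monitor” step can be automated with a few lines of code. A minimal sketch, assuming invented per-cycle figures and field names of my own choosing (offers, no_transit): it recomputes the job-offer rate each cycle and flags any drop of ten or more percentage points so you know to check the tracked drivers.

```python
# Per-cycle benchmarks; every figure here is invented for illustration.
cycles = [
    {"cycle": "2023-H1", "clients": 50, "offers": 30, "no_transit": 12},
    {"cycle": "2023-H2", "clients": 48, "offers": 29, "no_transit": 11},
    {"cycle": "2024-H1", "clients": 52, "offers": 20, "no_transit": 25},
]

def offer_rate(c):
    """Share of clients in a cycle who received a job offer."""
    return c["offers"] / c["clients"]

ALERT_DROP = 0.10  # flag drops of 10+ percentage points between cycles

flagged = []
for prev, curr in zip(cycles, cycles[1:]):
    drop = offer_rate(prev) - offer_rate(curr)
    if drop >= ALERT_DROP:
        flagged.append(curr["cycle"])
        print(f"{curr['cycle']}: offer rate fell {drop:.0%}; "
              f"no_transit went {prev['no_transit']} -> {curr['no_transit']}")
```

In this made-up run, the alert points at a plausible driver (transit access worsened), which is the kind of story you can take to staff, board, and funders.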
Long-Term Planning: The basic feature of planning evaluations over time is understanding the dynamics in your environment. As mentioned above, programs not only have their own dynamics, which may change over time, but they also operate within dynamic environments whose causal processes will change. I see three indicators that a new causal analysis might be necessary. First, front-line staff and program managers can recognize when dynamics are changing. Changes in client demographics, new complaints about new issues, or decreasing contact with potential employers can each indicate new dynamics entering the program. Second, changes in benchmarks can indicate underlying changes in the causal dynamics. For example, in the interview skills program, if job offers decline and none of the other measures change correspondingly, it might be time to do another causal analysis. Finally, dynamics will likely change when you substantively alter your programs. If you redesign your program to include resume writing or professional writing, factors associated with writing, like immigration status, race, and education, will likely influence how well clients do in the program and, if the writing component has an impact, the rate of job offers.
Lastly, I would like to take note of the national and sector-level governments, organizations, and thinkers currently pushing for accountability. While I believe that data-informed program development and evaluation is the way to go, there isn’t a one-size-fits-all approach to developing good data, and organizations’ limited capacity to conduct their own high-standard evaluations is probably the single biggest barrier to accountability. Anyone can do research, but doing good research by social scientific standards requires specific training in hypothesis testing, data collection design, and data analysis. If the accountability movement wants to succeed, it needs to develop the financial and technical resources necessary for organizations to build this capacity.