Uninformed: Informative Information for the Uninformed

Vol 8» 2007.Sep


This document has attempted to convey the potential benefits of generalizing data flow information along generalization tiers. Each generalization tier is used to represent the data flow behaviors of an abstract or concrete software element such as an instruction, basic block, procedure, and so on. Using this concept, data flow information can be collected at the most specific tier, the instruction tier, and then generalized to increasingly less-specific tiers. The generalization process has the effect of reducing the amount of data that must be considered at once while still conveying a general description of the manner in which data flows within an application.

Generalized data flow information can be immediately used in conjunction with existing graph reachability problems. For instance, a common task that involves determining reachable data flow paths between a conceptual source and sink location within an application can potentially benefit from operating on generalized data flow information. This paper has illustrated these potential benefits by defining the Progressive Qualified Elaboration (PQE) algorithm which can be used to progressively determine reachability at each generalization tier. By starting at the least specific generalization tier and progressing toward the most specific, it is possible to restrict the amount of data flow information that must be considered at once to a minimal set. This is accomplished by using reachable paths found at each generalization tier to qualify the set of data flow paths that must be considered at more specific generalization tiers.

While these benefits are thought to be present, the author has yet to conclusively prove this to be the case. The results presented in this paper do not prove the presumed usefulness of generalizing data flow information beyond the procedure tier. The author believes that analysis of large applications involving hundreds of modules could benefit from generalizing data flow information to the data type, module, and more abstract tiers. However, at the time of this writing, conclusive data has not been collected to prove this usefulness. The author hopes to collect information that either confirms or refutes this point during future research.

At present, the underlying implementation used to generate the results described in this paper has a number of known limitations. The first limitation is that it does not currently take into account formal parameters that are not passed at a call site, such as fields, global variables, and so on. This significantly restricts the accuracy of the data flow model that it is currently capable of generating. This limitation represents a more general problem of needing to better refine the underlying completeness of the data flow information that is captured.

While the algorithms presented in this paper were portrayed in the context of data flow analysis, it is entirely possible to apply them to other fields as well, such as control flow analysis. The PQE algorithm itself is conceptually generic in that it simply describes a process that can be employed to qualify the next set of analysis information that must be considered from a more generic set of analysis information. This may facilitate future research directions.