Project

General

Profile

Feature #2251

improve the call graph generation

Added by Constantin Asofiei about 10 years ago. Updated over 9 years ago.

Status:
Closed
Priority:
Normal
Target version:
-
Start date:
03/03/2014
Due date:
% Done:

100%

Estimated time:
30.00 h
billable:
No
vendor_id:
GCD

ca_upd20140222d.zip (12.7 KB) Constantin Asofiei, 03/03/2014 10:52 AM

relative_filenames_in_the_preprocessor.zip (1.9 KB) Greg Shah, 03/12/2014 12:38 PM

ca_upd20140408a.zip (33.8 KB) Constantin Asofiei, 04/08/2014 02:42 PM

ca_upd20140424b.zip (29.2 KB) Constantin Asofiei, 04/24/2014 08:15 AM

ca_upd20140425a.zip (29.3 KB) Constantin Asofiei, 04/25/2014 08:51 AM

ca_upd20140513b.zip (1.59 KB) Constantin Asofiei, 05/13/2014 09:00 AM

ca_upd20140715d.zip (9.17 KB) Constantin Asofiei, 08/01/2014 12:09 PM


Related issues

Related to Conversion Tools - Bug #2356: pphints bug - nested includes are missing from the .pphints file New
Related to Conversion Tools - Bug #2357: callgraph - set schema triggers as entry points New
Related to Conversion Tools - Feature #3277: implement call-graph v3 Closed

History

#1 Updated by Constantin Asofiei about 10 years ago

I've fixed the rules so that the callgraph can be run with the current P2J and also to use the p2j.cfg.xml's file-separator - see attached for the changes. I think the only issue left for the current code to work properly is the file-name case-sensitivity. My findings are these:
  1. the included files are not reported in the .ast.graph file.
  2. for each file specified e.g. in rootlist.xml, we generate its callgraph. But I can't find how the ambiguous RUN VALUE is reported... I've tried enabling debugging, but unless I set it to trace, there is no additional info logged. And trace is, well, trace: there is lots of output. Anyway, we should enhance the .graph XML so that we record, for each node in the graphs:
    - the callgraph for the include files (not sure how easy this can be done, we will need to modify/access the preprocessor, I think)
    - any problems (such as RUN VALUE, RUN with missing files, etc) are recorded, too
    - any other info (like library calls, SESSION:EXPORT, PROPATH, there was a hint file used, etc)
    The idea is, if we record this data in XMLs, it will be easily walked by other customer-specific tools, and they don't need to re-walk the full AST set. In the end, it might even be possible to generate a GML file from all .ast.graph nodes, and visualize the full graph in tools like yEd.
  3. am I right that the run_generated.rules tries to find 4GL code which generates other 4GL files at runtime, and invokes them via a RUN statement? I'm curious, did MAJIC really use this feature?
  4. invocation_reports.rules - was this an early attempt at generating reports? Because it logs nodes like APPLY, PUBLISH, FUNCTION ... IN SUPER, etc.
  5. do you think it might be possible to make the call-graph-generator incremental? I'm thinking of this:
    - at first, determine the callgraph for all nodes. Generate a report with files containing ambiguous RUN VALUE cases, which will need hints to be disambiguated.
    - on a subsequent run, check if there is an existing .ast.graph file. If so, load it, and:
    a. if the graph is final (there are no ambiguous RUN statements), then do not re-do this graph
    b. if the graph is not final (there are ambiguous RUN statements), then re-generate the graph, but only for the sub-nodes which contain "ambiguous RUN statements". Even this can be improved by re-generating only the nodes which have a hints file now but did not have one in the previous run (this would mean that once a hint file is added, that hint file must be final). This caching will make it easy to add hints, re-generate to include them, and so on, until no more ambiguities are found, as it avoids re-generating the callgraph for non-ambiguous call graphs (see the sketch after this list).
    c. for the ambiguous RUN statements: when they are resolved, record all the possible targets reached by the ambiguous RUN AND the ambiguous expression, so that the fact that a RUN is determined via hints is not lost.
  6. generating a .ast.graph file for each AST is nice, but we are missing the info about conex components. Currently, if file A calls file B, then the .ast.graph file for A contains the sub-callgraph for file B. We should generate the conex components, so that we know the set of programs which are part of the same conex component, because a customer might decide that an entire sub-set of files is dead, as it is part of a module that is no longer maintained/needed.
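A minimal sketch of the caching check from item 5. It assumes the persisted graph uses the call-site/ambiguous markup shown later in note 8; the file naming and the "re-generate on any remaining ambiguity" decision are assumptions, not existing P2J behavior.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class IncrementalGraphCheck
{
   /** Returns true if the persisted graph is missing or still contains
       ambiguous call-sites and therefore must be re-generated. */
   public static boolean needsRegeneration(File astFile) throws Exception
   {
      File graphFile = new File(astFile.getPath() + ".graph");

      if (!graphFile.exists())
      {
         // nothing cached yet: first run for this file
         return true;
      }

      Document doc = DocumentBuilderFactory.newInstance()
                                           .newDocumentBuilder()
                                           .parse(graphFile);
      NodeList sites = doc.getElementsByTagName("call-site");

      for (int i = 0; i < sites.getLength(); i++)
      {
         Element site = (Element) sites.item(i);
         if ("true".equals(site.getAttribute("ambiguous")))
         {
            // graph is not final; re-generate (ideally only the sub-nodes
            // which gained a hints file since the previous run)
            return true;
         }
      }

      // no ambiguous call-sites left: the cached graph is final, skip it
      return false;
   }
}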

#2 Updated by Constantin Asofiei about 10 years ago

For the GUI project, there are cases when a parent/child relationship between two programs can be determined from the arguments of a RUN statement: i.e. a string literal parameter is passed with a program name, which will be executed by the procedure targeted by the RUN statement.

This case (and others where a child program can be determined in a customer-specific way) can be solved by the rules from a customer_specific_call_graph file: this can check for customer-specific patterns and, when a match is found, register the resolved program as a child.

The assumption here is that the procedure targeted by the matched RUN statement will at some point execute a RUN VALUE statement which calls the program specified via a parameter. Even if we add a node to the callgraph, the RUN VALUE (part of the called procedure) which executes these programs is still ambiguous, unless hints are used to disambiguate it. If all possible targets are already resolved from the arguments passed by the caller, a hint with an empty program name list will be enough to disambiguate it.

Note that patterns like these are not limited to procedures, user-def func calls can be used too.

Some thoughts about how we determine the parent (i.e. caller) for these kinds of programs: I think it should be OK to set as the parent the external program where the customer-specific pattern is found. To make it more accurate, we can use hints to determine the external proc (to which an internal proc or user-def function belongs) for the matched RUN statement or user-def function call. But, in terms of determining the conex components, this is not needed, as it is enough to resolve only one link between the current conex component and a child program to make that child program part of the current conex component.

#3 Updated by Greg Shah about 10 years ago

Some good references:

http://en.wikipedia.org/wiki/GraphML
http://en.wikipedia.org/wiki/YEd

Please note that GraphML is not the same as the Graph Modelling Language (GML - http://en.wikipedia.org/wiki/Graph_Modelling_Language).

#4 Updated by Greg Shah about 10 years ago

Code Review 0222d

Overall, I am fine with the changes.

My only concern is with CallGraphWorker.Callgraph.prepareFilename(). The modified code is not yet ready to run on Windows, but these comments were removed:

-         // WARNING: Unix-specific filename processing!
-         // TODO: Change to make it os-independent. 

-         if (absPath != null && filename.charAt(0) == File.separatorChar)
+         if (absPath != null && filename.startsWith(separator))

Since Windows filenames can include a drive letter and colon, detecting absolute filenames is more complex than the updated version can handle. Please do fix this. It probably makes sense to use the J2SE File class to do this work for us.
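For reference, a minimal sketch of the OS-independent check via the J2SE File class; File.isAbsolute() applies the platform's own rules, so Windows drive-letter paths are covered. The sample paths are illustrative only.

import java.io.File;

public class PathCheck
{
   public static void main(String[] args)
   {
      String[] names = { "/opt/majic/a.p", "C:\\majic\\a.p", "rel/a.p" };

      for (String name : names)
      {
         // true for "/opt/majic/a.p" on Unix and "C:\majic\a.p" on Windows
         System.out.println(name + " -> " + new File(name).isAbsolute());
      }
   }
}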

#5 Updated by Greg Shah about 10 years ago

I think the only issue left for the current code to work properly is the file-name case-sensitivity.

Yes, please add this.

the included files are not reported in the .ast.graph file.

Yes, I definitely agree that we need to report include files as well as procedure files that are not dead.

So in addition to storing this information in the graph, I would like this to be part of the resulting dead files report too.

But I can't find how the ambiguous RUN VALUE is reported.

It may not be reported properly at all. It (and other ambiguous call linkage features) certainly MUST be reported clearly to the user when the tool is run, so that the user will know exactly which features require hints. It is expected that the process is very iterative, with each run getting closer to the final run (one in which there are no additional hints needed).

the callgraph for the include files (not sure how easy this can be done, we will need to modify/access the preprocessor, I think)

Please do review the "pphints" data that the preprocessor generates. This data probably already provides what we need. You can see how we use this in our reports processing:

rules/reports/consolidated_reports.xml (search on preproc and hints)
com/goldencode/p2j/preproc/PreprocessorHintsWorker
com/goldencode/p2j/preproc/PreprocessorHints

any problems (such as RUN VALUE, RUN with missing files, etc) are recorded, too

Yes.

any other info (like library calls, SESSION:EXPORT, PROPATH, there was a hint file used, etc)

Good idea.

The idea is, if we record this data in XMLs, it will be easily walked by other customer-specific tools, and they don't need to re-walk the full AST set.

This makes sense. It also makes it feasible for us to do some interesting reports (as part of our standard set of reports).

In the end, it might even be possible to generate a GML file from all .ast.graph nodes, and visualize the full graph in tools like yEd.

Yes, this was one of the original goals of having the callgraph data. This would be fantastic!

am I right that the run_generated.rules tries to find 4GL code which generates other 4GL files at runtime, and invokes them via a RUN statement

Yes, it was meant to show output to the user that such a feature was encountered. The dynamic_code_generation.xml was also being used for this purpose.

I'm curious, did MAJIC really use this feature?

Yes, it did. We helped TIMCO find these locations and they wrote replacement 4GL code that used other features in order to bypass this requirement. I think you did some work for them on replacement code. Possibly in #1327?

invocation_reports.rules - was this an early attempt at generating reports?

Yes and no. This was really some early code based on the idea that the call graph processing would be used for more than just a dead file analysis. Somewhere we had to identify all possible linkage types in the call tree. By putting in reporting that documented linkage types that weren't yet supported, we could be sure that the person running the call graph tools would not ignore these nodes should they be encountered.

We ultimately do want to implement a range of code improvements that can only be implemented using call graph analysis. In other words, we want to implement PatternEngine based walking of the call graph that will allow a whole new level of processing that cannot be implemented with our current single-source-tree-at-a-time approach. An example is to determine which shared variables are actually being used in the call tree. Today, we cannot detect dead shared variables because we cannot know which downstream code may or may not be accessing them.

In general, one can think of 2 kinds of linkages:

  1. program to program or "external linkage" - any mechanism by which one compilation unit of 4GL code can invoke code in a separate compilation unit
  2. program internal or "internal linkage" - any mechanism by which code can call other code within the same compilation unit

A compilation unit is the fully-preprocessed file (the cache file in our terms) that contains an external procedure (and any internal procedures, user defined functions, triggers... that are also contained in that file).

For dead files analysis, only the external linkage case must be considered. But I don't want to ignore the internal linkage cases. In particular, I would like to handle them as follows:

  • our call graph design should plan for their eventual inclusion, at a minimum, it should not be hard to add them later
  • it would be best for us to at least lay the groundwork by providing matching rules for all of these possible internal linkage mechanisms, and we will want some form of output for each match

Our original approach created a single call graph output for each root node. I'm not sure that is the best approach long term.

do you think it might be possible to make the call-graph-generator incremental

Yes, I think it is feasible and desirable.

I'm thinking of this:
- at first, determine the callgraph for all nodes.
...

I like the approach. We will need to document how to reset the environment to get a "from scratch run".

generating a .ast.graph file for each AST is nice, but we are missing the info about conex components

What do you mean by the term "conex"?

This case (and others where a child program can be determined in a customer-specific way) can be solved by the rules from a customer_specific_call_graph file: this can check for customer-specific patterns and, when a match is found, register the resolved program as a child.

This sounds reasonable.

The assumption here is that the procedure targeted by the matched RUN statement will at some point execute a RUN VALUE statement which calls the program specified via a parameter. Even if we add a node to the callgraph, the RUN VALUE (part of the called procedure) which executes these programs is still ambiguous, unless hints are used to disambiguate it. If all possible targets are already resolved from the arguments passed by the caller, a hint with an empty program name list will be enough to disambiguate it.

Yes. This imprecision won't affect the dead files analysis, but it could affect other more advanced usage of the call graph. For example, if the actual linkage is from a procedure that is not the targeted compilation unit (the program being executed by the matched RUN statement), then the linkage will not be allocated to the proper location. For these advanced cases, we will eventually need a reasonable way to allocate the found linkage to the right downstream location.

#6 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

  1. program internal or "internal linkage" - any mechanism by which code can call other code within the same compilation unit

With persistent procedures, the non-private internal entries of a program are exposed to the outside world... and, with super procedures, determining the right program where the internal procedure resides will be a 100% hint file job (or customer-specific rules).

But I don't want to ignore the internal linkage cases. In particular, I would like to handle them as follows:

OK, I'll keep this in mind.

generating a .ast.graph file for each AST is nice, but we are missing the info about conex components

What do you mean by the term "conex"?

Conex was not the best choice for describing what I meant; when we deal with callgraphs, given the entry point, we require that there is at least one path between the entry point and any other node (as we are in a directed graph). What we are missing is which are the maximal callgraph components: the callgraph for program P1 is maximal if there is no callgraph for program Px which includes the callgraph for P1, as a sub-graph. Currently, if we look at the .graph files, there is no easy way of determining the maximal components.

#7 Updated by Greg Shah about 10 years ago

program internal or "internal linkage" - any mechanism by which code can call other code within the same compilation unit

With persistent procedures, the non-private internal entries of a program are exposed to the outside world... and, with super procedures, determining the right program where the internal procedure resides will be a 100% hint file job (or customer-specific rules).

Yes, I understand. And it is not just internal procedures, but also user-defined functions that are impacted.

And this is not just due to persistent/super procs, but also appserver support. Of course, with appserver, we would be more concerned with the possible inbound calls unless the project code was self-referencing.

As part of this task, please catalog the different linkage mechanisms. I will provide a starting point:

Call sites:

  1. run filename - an external procedure defined in the specified source file
  2. run filename ON server_handle - the target session is found at runtime via the given handle which can be:
    - the local session handle
    - a remote appserver handle
  3. run internal_proc - procedure stmt defined in the local source file
  4. run internal_proc - procedure stmt defined in super
  5. run internal_proc IN proc_handle - target is found at runtime via the given handle which can be:
    - the local file
    - a procedure that is on the stack
    - a local persistent procedure
    - a remote persistent procedure
  6. run internal_proc - procedure stmt defined as API call to Windows DLL or UNIX shared library
  7. run value(expr) - an internal or external procedure found by evaluating the expression and then matching the name to one of the cases above
  8. run value(expr) ON server_handle - same as case 2 but with a runtime-generated filename
  9. run value(expr) IN proc_handle - same as case 5 but with a runtime-generated procedure name
  10. run library_ref - runs a specific external procedure that is packaged in a procedure library (.pl file)
  11. run super - runs the immediate super procedure for the currently executing internal procedure
  12. function call - function stmt defined in local file
  13. function call - function stmt defined IN SUPER
  14. function call - function stmt defined IN proc_handle with the target file found at runtime via the given handle:
    - the local file
    - a procedure that is on the stack
    - a local persistent procedure
    - a remote persistent procedure
  15. super() - calls the user-defined function that exists in the immediate super procedure for the currently executing local user-defined function
  16. dynamic-function(funcname_expr) - calls the user-defined function in the local file that is determined at runtime by evaluating the funcname_expr; based on the associated function stmt, this may be any of the following:
    - function in local file
    - function defined IN SUPER
    - function defined IN proc_handle
  17. dynamic-function(funcname_expr IN proc_handle) - calls the user-defined function defined in the procedure specified by the given handle, which is determined at runtime to be one of:
    - the local file
    - a procedure that is on the stack
    - a local persistent procedure
    - a remote persistent procedure
  18. shell-based command execution of a child process (e.g. unix or os-command)
  19. stream I/O to/from a child process (input thru, output thru and input-output thru)
  20. use of PUBLISH to invoke internal procedures that are registered with SUBSCRIBE
  21. invocation of a statically linked user C function via CALL
  22. usage of DDE EXECUTE (command execution)
  23. COM automation object method calls
  24. OCX control method calls
  25. .NET method calls/constructor (property access?)
  26. OO 4GL method calls
  27. OO 4GL property getters/setters (hidden dispatch)
  28. OO 4GL constructors/destructors (hidden dispatch)
  29. class events pub/sub
  30. dynamic-invoke
  31. CALL handle object
  32. forced invocation of a widget trigger via the APPLY language stmt (which can invoke arbitrary code that is on the call stack)
  33. forced invocation of a procedure trigger via the APPLY language stmt
  34. ON statement invocation of a procedure via PERSISTENT RUN
  35. input blocking stmts (events may be generated during these):
    - choose
    - insert
    - message (only with VIEW-AS ALERT-BOX, UPDATE or SET)
    - prompt-for
    - set
    - update
    - readkey (not when reading from file)
    - wait-for
    - pause
  36. process-events will cause events to be processed; in combination with apply, enable or direct setting of the sensitive attribute, it is possible this could be a vector for triggers to be called
  37. web services invocation
  38. SOAP invocation
  39. stored procedure invocation

Entry Points

  1. external procedure - each 4GL source file
  2. internal procedure - procedure stmt
  3. user-defined functions
  4. ON stmt - trigger definitions, includes UI, procedure and database event processing, these can be defined as an in-line block of code or as a PERSISTENT RUN of a procedure name (this latter form is really a call site, see above)
  5. KW_TRIGGERS - trigger phrase on widgets, including dynamically created widgets
  6. TRIGGER-PROCEDURE - schema-level database triggers
  7. OO 4GL user-defined methods
  8. OO 4GL user-defined properties (getters/setters)
  9. OO 4GL constructor
  10. OO 4GL destructor
  11. internal procedures registered via SUBSCRIBE statement
  12. OCX Event Procedures
  13. Asynchronous Request Procedure PROCEDURE/KW_PROC/KW_ASYNC/KW_EVT_PROC
  14. DDE-NOTIFY trigger
  15. exported web services
  16. appserver exports (internal procedure, external procedure or user-defined function)

Most of this came from some comments previously left behind in common-progress.rules. Some of the call sites invoke code outside of the 4GL. This means that those sites are not relevant to dead file analysis but they may be useful in more advanced call graph analysis.

What we are missing is which are the maximal callgraph components: the callgraph for program P1 is maximal if there is no callgraph for program Px which includes the callgraph for P1, as a sub-graph.

OK, I understand. Yes, you're right. This is something we need to derive.

#8 Updated by Constantin Asofiei about 10 years ago

Greg, some thoughts about the callgraph generation. Assume we have an explicit RUN ext-prog.p. call in an internal procedure IP in a program P. We have some cases (this applies to user-def functions/triggers too):
  1. if IP is never called, then we should not add ext-prog.p to the callgraph.
  2. if IP is called from P, then ext-prog.p should be added to P's callgraph.
  3. if IP is called from P2, then ext-prog.p should be linked as P2->P(IP)->ext-prog.
With this in mind, to generate a correct call-graph, only the call sites from the external procedure should be checked when adding an external program to the callgraph. The call-sites from the internal entries will be added only if the internal entry is actually executed. Thus, I want to keep a call-sites node with all the possible call-sites from the current program (no matter where they appear). For the first pass, we can add only the external programs to the call-graph (no matter where the RUN statement appears), and this will make it easy to integrate the other call-sites into the callgraph. The structure would look like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--Persisted CallGraph-->
<graph-root>
  <call-graph-node filename="./a.p.ast">
    <preprocessor-graph>
       <include end-column="3" end-line="4" file=".\a.i" start-column="1" start-line="1">
         <include end-column="2" end-line="4" file=".\b.i" start-column="1" start-line="4"/>
       </include>
       <include end-column="2" end-line="4" file=".\b.i" start-column="1" start-line="4"/>
    </preprocessor-graph>

    <call-sites>
      <call-site id="1" ast-id="1" type="KW_RUN" line="15" column="1">
         <target type="external-procedure" value="c.p" hint="true"/>
         <target type="external-procedure" value="d.p" hint="true" missing="true"/>
      </call-site>

      <call-site id="2" ast-id="2" type="KW_RUN" line="20" column="1">
         <target type="procedure" value="internal-proc" hint="true"/>
      </call-site>

      <call-site id="3" ast-id="3" type="KW_RUN" line="25" column="1" ambiguous="true"/>
    </call-sites>

    <environment>
      <ast id="1" type="KW_PROPATH" line="3" column"1">
         <!-- TOOD: maybe expand each element of the propath? -->
         <target value="custum-pro-path" />
      </ast>
      <ast id="1" type="KW_EXPORT" line="3" column"1">
         <target value="appserver-export-path" hint="true"/>
      </ast>
    </environment>

    <!-- we keep only the full callgraph, using current program as entry-point:  the additional 
         info is kept per each .p.ast.graph file.
         -->
    <call-graph-node call-site-id="1" filename="./b.p.ast">
       <!-- the call-site-id's for all b.p.ast's children are relative to b.p.ast.graph -->
       <call-graph-node call-site-id="1" filename="./c.p.ast">
    </call-graph-node>
  </call-graph-node>
</graph-root>

with:
  1. the preprocessor-graph is an example of the content of a .p.pphints file. I don't think we need to merge the preprocessor graph with the actual call graph, as we want to work with the .p.cache file, not the .i include files.
  2. the call-sites includes all call sites from the current program (no matter where they appear). The example contains only KW_RUN, but this can be easily expanded for the other types (user-def func calls, triggers, UI event-processing statements, etc). ast-id attributes are referencing nodes in the associated .p.ast file.
  3. the environment contains data about PROPATH and SESSION:EXPORT (and other relevant info). This can help in restricting the available external programs, from this program.
  4. each call-graph-node contains a call-site-id attribute, referencing a call-site ID, relative to the .p.ast.graph file for the parent node. Even if we make this ID unique (as the AST IDs are), the reference will still be in the parent's .p.ast.graph file.
  5. the callgraph is built using the current program as entry-point: the additional info for each callgraph node is not duplicated, but kept in the node's .p.ast.graph file.
  6. the filename currently will allow easy identification of the program referenced by this node. When adding the other possible call-sites, we should add some target and type attributes, where:
    - target represents e.g. the name of an external program, library, procedure, etc.
    - type is some string with the type of this node.
    Note that if a call-site has multiple targets, then all targets will be added to the callgraph.
  7. same as with the preprocessor case: if a target is invoked from multiple call-sites, we should add a node for each one. Currently, an external program invoked multiple times in the current program is linked only once.

#9 Updated by Constantin Asofiei about 10 years ago

Something I forgot: we should allow rootlist.xml to have nodes like <node folder="./path/to/folder/"/>, to include all the programs in that folder.

#10 Updated by Greg Shah about 10 years ago

the preprocessor-graph is an example of the content of a .p.pphints file. I don't think we need to merge the preprocessor graph with the actual call graph, as we want to work with the .p.cache file, not the .i include files

Yes, I agree we don't want to duplicate the data. The important thing is to know the exact list of source files that are required for every call-site. If the call-site is from an include file, then there may be a list of containing include files that are also in-scope. This information probably should be stored for each call-site.

The example contains only KW_RUN, but this can be easily expanded for the other types (user-def func calls, triggers, UI event-processing statements, etc). ast-id attributes are referencing nodes in the associated .p.ast file.

My only concern with the example is that we probably should "qualify" the RUN call-site more specifically. Why not create a more specific call-site "type" that is a unique and mutually exclusive list of types (RUN_FILENAME, RUN_INTERNAL_PROC, RUN_VALUE...)? I think that overall, this will be pretty useful information. What are the pros/cons of this?

the environment contains data about PROPATH and SESSION:EXPORT (and other relevant info). This can help in restricting the available external programs, from this program.

I worry that this is too disconnected from the original AST to be useful. It is only in the context of the original AST that we can determine to what degree these statements affect call-sites.

Your other notes seem pretty straightforward.

#11 Updated by Greg Shah about 10 years ago

OK, I don't want to throw you off track or open the scope of this too much. But we are at a point in this analysis that it makes sense to consider an alternate approach.

At some point this year, I hope to rewrite TRPL using a much cleaner approach. One of the core ideas will be to back the entire conversion project with a database as the storage/access mechanism. While some of our data will fit nicely into a relational model, the core data structures that we really deal with are ASTs. Tree data structures don't fit nicely into a relational data model. Assume each AST node is a row in a table (more likely a set of rows in a set of tables, like having different tables for the core node and annotations). In order to properly model the connections between nodes, the relational model requires self-referencing between the rows of these tables that represent each parent and child. In other words, to walk a subtree, one has to arbitrarily dereference the same set of tables many times. This is not something that can be easily delegated to SQL. We still may use an embedded H2 database for the data that fits well, but I don't think it makes sense to force fit the ASTs into that structure.

The NoSQL solutions are not of interest. Although the table structure may not be best for tree data, tree data is still highly structured data. The NoSQL solutions don't support the kinds of highly structured data that we need. In addition, these solutions often lack the kind of transactional support (ACID) that we would require for a real production solution.

Instead, I have long been considering using graph database technology. In particular, I have been watching Neo4J (http://www.neo4j.org/), an open source graph database that is written in Java, can be embedded in our solution and supports the kinds of scalability and ACID requirements we would have. Most importantly, it is designed to store and efficiently query exactly the kind of data structures that we have. Not only is it efficient to query this database, it is very efficient to query the relationships between nodes. Of course, a tree is just a simple form of a graph.

The call graph is a much more complex graph, as it would need to represent features like cycles (connections from the call graph to itself).

I am wondering if we shouldn't leverage the graph database rather than trying to implement our own approach from scratch.

In addition, I am struggling with the decision of how closely to tie the ASTs and the call graph. As I noted previously, much of the call graph can only really be understood in the context of the AST itself. And in fact the call graph itself can be thought of as a kind of "overlay network" of connections between AST nodes (that may be in different trees).

Please take a few hours to explore the Neo4J technology and to think about this question. I would be willing to take a small hit (a week?) to leverage this technology if it gives us a better/more powerful result. It is my suspicion that this would be the case, but I really don't know if we can do it in a small amount of time.

In the case where we do use Neo4J, I would NOT expect full integration of the ASTs into the graph database (too much effort). But I wanted to mention the issue and consider a design where in the future, the ASTs and call graph are part of the same database and are fully interconnected into a single graph. This utopian future would allow the full set of advanced analysis, refactoring and optimizations that we really want to provide.

#12 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

If the call-site is from an include file, then there may be a list of containing include files that are also in-scope. This information probably should be stored for each call-site.

Is it possible to determine which include file (if any) an AST comes from? Isn't this information lost after preprocessing?

Why not create a more specific call-site "type" that is a unique and mutually exclusive list of types (RUN_FILENAME, RUN_INTERNAL_PROC, RUN_VALUE...)? I think that overall, this will be pretty useful information. What are the pros/cons of this?

I was thinking that the type at the call-site would be the AST's type and the target node would disambiguate... i.e. allow multiple kinds of targets for a call-site, as a RUN VALUE may target native code, an internal proc or even an external proc.

I worry that this is too disconnected from the original AST to be useful. It is only in the context of the original AST that we can determine to what degree these statements affect call-sites.

Hmm... I'll think about whether we can move this info to the call site.

Btw, something else: when hints are used to disambiguate the call site, we should use the token name (indexed by the location) as the hint name.

#13 Updated by Greg Shah about 10 years ago

-------- Original Message --------
Subject: Neo4J licensing
Date: Tue, 04 Mar 2014 17:11:04 -0500
From: Eric Faulhaber <>
Organization: Golden Code Development
To: Greg Shah <>

Greg,

Since we haven't gone open source yet and we have no immediate plans to do so, how do you see the licensing working if we integrate Neo4J into P2J?

Thanks,
Eric

We don't have to go with Neo4J, it is just the best known of the options. If a commercial license doesn't work for us, we can choose from one of many other solutions including Titan (http://thinkaurelius.github.io/titan/), OrientDB (http://en.wikipedia.org/wiki/OrientDB), HypergraphDB (http://www.hypergraphdb.org/index) or possibly even Filament (http://filament.sourceforge.net/).

#14 Updated by Greg Shah about 10 years ago

Is it possible to determine which include file (if any) an AST comes from? Isn't this information lost after preprocessing?

Yes. I think that PreprocessorHints.getSourcesFor() does this. If you need something different, you can modify PreprocessorHints.

Btw, something else: when hints are used to disambiguate the call site, we should use the token name (indexed by the location) as the hint name.

OK.

#15 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

Instead, I have long been considering using graph database technology. In particular, I have been watching Neo4J (http://www.neo4j.org/), an open source graph database that is written in Java, can be embedded in our solution and supports the kinds of scalability and ACID requirements we would have. Most importantly, it is designed to store and efficiently query exactly the kind of data structures that we have. Not only is it efficient to query this database, it is very efficient to query the relationships between nodes. Of course, a tree is just a simple form of a graph.

I've been looking into these graph DB implementations, and the major points are these:
  1. HypergraphDB
    - nodes can be of any Java type, as long as the type exposes a default c'tor.
    - a node's properties are encapsulated in Java beans, and inheritance is supported. This is not that great for us, as the set of annotations which can be applied to an AST depends at least on the token type of that AST... thus, our ASTs would work better with property graphs, which support (key->value) pairs for edges and nodes.
  2. Titan DB
    - is a property graph; values can be of any Java type, and this makes it easy to emulate our AST annotations.
    - uses the TinkerPop stack for working with graphs, http://www.tinkerpop.com/ .
    - the Gremlin module provides a Groovy-style query language. This can be used either from a shell or from Java code.
    - what looks nice to me is that it allows HBase storage and it can be integrated with Hadoop: we might benefit from this, if we move from the current sequential way of applying the rules to an AST, to a more parallel approach.
  3. Neo4J
    - same as Titan DB, it is a property graph.
    - is built on top of the Blueprints APIs (from TinkerPop); at some point, it provided support for Gremlin too, but now they moved on to Cypher as a query language, and Gremlin is no longer officially supported.
    - has its own proprietary storage.

Both Neo4J and Titan work in a similar way, as they are both property graphs. At this time, I'm inclined to go with Titan, especially because the storage backend is not restricted to a proprietary format.

I am wondering if we shouldn't leverage the graph database rather than trying to implement our own approach from scratch.

In addition, I am struggling with the decision of how closely to tie the ASTs and the call graph. As I noted previously, much of the call graph can only really be understood in the context of the AST itself. And in fact the call graph itself can be thought of as a kind of "overlay network" of connections between AST nodes (that may be in different trees).

Yes, I think we should go ahead with a graph DB, because what we were discussing the other days will be a lot easier with graph DBs. For a first pass, the nodes in this DB will reference ASTs (via the AST ID), and some minimal info (like token type, column, row, etc) will be copied in the graph.

For the callgraph implementation, it is not required to load all AST trees into the graph DB. How this can work:
  • initially, the ASTs for all external-programs in the project will be added as nodes. Later on, this can be expanded to include the internal entries and triggers, too.
  • we can either determine the call-graph starting from each possible entry point, or assume the entire external programs/internal entry set as entry points and generate the call-graph for each one.
  • rootlist.xml should allow explicit internal-entries too, as entry points.
  • when a call-site is encountered, a node is created for it and linked with its enclosing external-program AST.
  • if an AST node is identified as a call-site, all outgoing edges will determine the possible targets for this call-site
  • it is possible to have non-AST nodes, to map external libraries, web services, etc
  • if the AST for an outgoing edge is not found and the edge targets 4GL code, then we are in a "missing file/internal entry" case. Add a node with a bogus AST ID and mark it as "missing".
  • once the callgraph generation is complete, all AST nodes with no outgoing or incoming edges are dead files/internal entries
  • for the include files:
    - initially, all include files are added as "include-file" nodes
    - AST nodes for an external program will link to include files which are included directly (the first level of inclusion). The graph represented in the .p.pphints file is also built.
    - AST nodes for a call-site will link to the include file from where this call-site originates
    - once complete, all the include files with no edges are dead files (as the links between include files are added only when an external program is added).
  • to determine the maximal sub-callgraph components, we need to start with each entry point and do a BFS traversal of the graph. The nodes in the resulting tree structure are nodes reachable from that entry point.
  • nodes which are not reachable from the entry-points but have edges (incoming or outgoing) are part of sub-graphs which can be considered dead.
  • to determine the maximal components (see the sketch after this note), choose an unvisited entry-point node and:
    - go up the incoming edges and stop when an unvisited entry point is found, which has no incoming edges (avoiding cycles).
    - if there is no such node, this is the root for a new maximal component.
    - else, choose the found node as the root and BFS from it.

As a side note, the above still applies even if we don't use the graph DB at this time; instead, the graph will be emulated with XMLs and other data structures.
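A minimal sketch of the BFS reachability/dead-node idea above, written against the Blueprints 2.x API that is settled on later in this task; the "filename" property name is an assumption and the traversal deliberately ignores edge labels.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;

public class Reachability
{
   /** BFS from the entry points, returning every reachable vertex. */
   public static Set<Vertex> reachable(Graph graph, Iterable<Vertex> entryPoints)
   {
      Set<Vertex> visited = new HashSet<>();
      Deque<Vertex> queue = new ArrayDeque<>();

      for (Vertex entry : entryPoints)
      {
         queue.add(entry);
      }

      while (!queue.isEmpty())
      {
         Vertex v = queue.poll();

         if (!visited.add(v))
         {
            // already seen (cycles are expected in a call graph)
            continue;
         }

         // follow the outgoing call-site edges to every possible target
         for (Vertex child : v.getVertices(Direction.OUT))
         {
            queue.add(child);
         }
      }

      return visited;
   }

   /** Anything never reached from an entry point is a dead candidate. */
   public static void reportDead(Graph graph, Set<Vertex> visited)
   {
      for (Vertex v : graph.getVertices())
      {
         if (!visited.contains(v))
         {
            Object name = v.getProperty("filename");
            System.out.println("dead candidate: " + name);
         }
      }
   }
}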

#16 Updated by Greg Shah about 10 years ago

At this time, I'm inclined to go with Titan, especially because the storage backend is not restricted to a proprietary format.

And as Eric correctly highlighted, it avoids a GPL licensing issue.

what looks nice to me is that it allows HBase storage and it can be integrated with Hadoop: we might benefit from this, if we move from the current sequential way of applying the rules to an AST, to a more parallel approach

Yes, although Hbase seems to be pretty heavy for our near term use-cases. I also worry about the ACID/transaction support in such a scenario.

I would really prefer a lightweight and embeddable (written in Java and able to be hosted in-process like H2) storage engine. We already spend way too much time installing/configuring/managing PostgreSQL. Using a native solution will be a large and unnecessary cost for implementing and using our tools.

if the AST for an outgoing edge is not found and the edge targets 4GL code, then we are in a "missing file/internal entry" case. Add a node with a bogus AST ID and mark it as "missing"

Or if it is ambiguous, we might want to mark it as such...

- initially, all include files are added as "include-file" nodes
- AST nodes for an external program will link to include files which are included directly (the first level of inclusion). The graph represented in the .p.pphints file is also built.
- AST nodes for a call-site will link to the include file from where this call-site originates
- once complete, all the include files with no edges are dead files (as the links between include files are added only when an external program is added)

Actually, I think this is close, but not quite correct. The initial setup must link each external procedure with include-file nodes for all include files that are required for that external proc. The include file nodes probably need to have an edge in the direction of includer --> includee. The reason is because when a call-site is linked to an include file that is deeply nested, there may be many include files between that one and the "first level of inclusion" that is linked directly to the external proc node. These include files are required and are not dead if something they include is not dead.

we can either determine the call-graph starting from each possible entry point, or assume the entire external programs/internal entry set as entry points and generate the call-graph for each one

Yes, this must be possible. The key idea here is to properly setup the database and then to have separate tools to do specific analysis such as dead file analysis. We can even open this up to ad-hoc queries. My point is that we should separate the tools that do analysis (use of) the database from the tool that creates/populates the database in the first place.

#17 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

what looks nice to me is that it allows HBase storage and it can be integrated with Hadoop: we might benefit from this, if we move from the current sequential way of applying the rules to an AST, to a more parallel approach

Yes, although Hbase seems to be pretty heavy for our near term use-cases. I also worry about the ACID/transaction support in such a scenario.
I would really prefer a lightweight and embeddable (written in Java and able to be hosted in-process like H2) storage engine. We already spend way too much time installing/configuring/managing PostgreSQL. Using a native solution will be a large and unnecessary cost for implementing and using our tools.

Titan supports other storage backends, BerkeleyDB and Persistit both being ACID compliant and embedded in the same JVM. For both, the required configuration is minimal: just a folder for the DB files needs to be provided.
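A minimal sketch of what the embedded setup could look like, assuming the Titan 0.4.x configuration keys ("storage.backend", "storage.directory"); these should be double-checked against the bundled version.

import org.apache.commons.configuration.BaseConfiguration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class GraphDbOpen
{
   public static TitanGraph open(String dbDir)
   {
      BaseConfiguration conf = new BaseConfiguration();

      // in-process storage engine: no external server to install or manage
      conf.setProperty("storage.backend", "persistit");
      conf.setProperty("storage.directory", dbDir);

      return TitanFactory.open(conf);
   }
}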

- initially, all include files are added as "include-file" nodes
- AST nodes for an external program will link to include files which are included directly (the first level of inclusion). The graph represented in the .p.pphints file is also built.
- AST nodes for a call-site will link to the include file from where this call-site originates
- once complete, all the include files with no edges are dead files (as the links between include files are added only when an external program is added)

Actually, I think this is close, but not quite correct. The initial setup must link each external procedure with include-file nodes for all include files that are required for that external proc. The include file nodes probably need to have an edge in the direction of includer --> includee. The reason is because when a call-site is linked to an include file that is deeply nested, there may be many include files between that one and the "first level of inclusion" that is linked directly to the external proc node.

Yes, this is what I meant by "The graph represented in the .p.pphints file is also built".

These include files are required and are not dead if something they include is not dead.

Just to emphasize something: if an external program is not dead, then its entire include graph (i.e. all include files listed by .p.pphints) is not dead, either. We can't mark an include file as dead just because it is not explicitly targeted by a call-site, as the include might contain other kinds of code, not just internal entries/triggers.

My point is that we should separate the tools that do analysis (use of) the database from the tool that creates/populates the database in the first place.

OK, I understand. But some knowledge about how the graph is expected to be analysed is required, when populating it.

#18 Updated by Greg Shah about 10 years ago

Titan supports other storage backends, BerkeleyDB and Persistit both being ACID compliant and embedded in the same JVM. For both, the required configuration is minimal: just a folder for the DB files needs to be provided.

Persistit is preferred over BerkeleyDB, since Persistit is completely implemented in Java and requires no native components to be installed.

Just to emphasize something: if an external program is not dead, then its entire include graph (i.e. all include files listed by .p.pphints) is not dead, either.

Yes, this is right. My (poorly worded) point was that by having more explicit edges from call-sites to specific includes, it would enable some future analysis approaches such as looking for dead code at a sub-file level (e.g. we do some of this in unreachable code analysis today).

But some knowledge about how the graph is expected to be analysed is required, when populating it.

Absolutely.

#19 Updated by Constantin Asofiei about 10 years ago

This note will keep a history of the TitanDB required dependencies (and their associated licences).

For blueprints, pipes, gremlin, colt, slf4j, kryo, the licence text must be kept with the jar.

About Gremlin: it relies on antlr-2.7.7, which may conflict with our antlr version.

1a. Core dependencies:

akiban-persistit-3.3.0.jar: Apache Licence 2
blueprints-core-2.4.0.jar: https://github.com/tinkerpop/blueprints/blob/master/LICENSE.txt
colt-1.2.0.jar: http://acs.lbl.gov/software/colt/license.html
commons-codec-1.7.jar
commons-collections-3.2.1.jar
commons-configuration-1.6.jar
commons-lang-2.5.jar
commons-logging-1.1.1.jar
guava-14.0.1.jar: Apache License 2.0
high-scale-lib-1.1.2.jar: https://github.com/stephenc/high-scale-lib/blob/master/LICENSE
hppc-0.4.2.jar: Apache License 2.0
kryo-2.21-shaded.jar: New BSD License https://raw.github.com/EsotericSoftware/kryo/master/license.txt
metrics-core-3.0.0-BETA3.jar: Apache License 2.0
slf4j-api-1.7.5.jar: http://slf4j.org/license.html
spatial4j-0.3.jar: Apache License 2.0
titan-core-0.4.2.jar: Apache License 2.0
titan-persistit-0.4.2.jar: Apache License 2.0

1b. Lucene dependencies:

lucene-analyzers-common-4.4.0.jar: Apache License 2.0
lucene-core-4.4.0.jar: Apache License 2.0
lucene-queries-4.4.0.jar: Apache License 2.0
lucene-queryparser-4.4.0.jar: Apache License 2.0
lucene-spatial-4.4.0.jar: Apache License 2.0
titan-lucene-0.4.2.jar: Apache License 2.0

2. Gremlin dependencies:

antlr-2.7.7.jar
asm-3.1.jar
gremlin-groovy-2.4.0.jar: https://github.com/tinkerpop/gremlin/blob/master/LICENSE.txt
gremlin-java-2.4.0.jar: https://github.com/tinkerpop/gremlin/blob/master/LICENSE.txt
groovy-1.8.9.jar: Apache License 2.0
pipes-2.4.0.jar: https://github.com/tinkerpop/pipes/blob/master/LICENSE.txt

3. Remaining dependencies (will be added only if they are required):

ant-1.8.3.jar
ant-launcher-1.8.3.jar
asm-analysis-3.1.jar
asm-commons-3.1.jar
asm-tree-3.1.jar
asm-util-3.1.jar
commons-beanutils-1.7.0.jar
commons-beanutils-core-1.7.0.jar
commons-digester-1.8.jar
commons-io-2.1.jar
concurrent-1.3.4.jar
frames-2.4.0.jar
gmetric4j-1.0.3.jar
ivy-2.3.0.jar
jackson-annotations-2.1.2.jar
jackson-core-2.1.2.jar
jackson-databind-2.1.2.jar
jackson-datatype-json-org-2.1.2.jar
jansi-1.5.jar
javassist-3.18.0-GA.jar
jettison-1.3.3.jar
jline-1.0.jar
json-20090211.jar
jsr305-1.3.9.jar
log4j-1.2.16.jar
metrics-ganglia-3.0.0-BETA3.jar
metrics-graphite-3.0.0-BETA3.jar
oncrpc-1.0.7.jar
slf4j-log4j12-1.7.5.jar
spatial4j-0.3.jar
stax-api-1.0.1.jar
velocity-1.7.jar

#20 Updated by Constantin Asofiei about 10 years ago

I don't like how PatternEngine.callGraphMode hard-codes a "callgraph-walking" mode in PatternEngine; it doesn't feel natural... the only thing PatternEngine should be aware of is the list of ASTs to be processed. Instead, the CallGraphGenerator should work like this:
  • use a custom Iterator which retrieves AST instances from the head of a queue, until the queue is empty (see the sketch after this list). Initially, the queue contains the ASTs for all the entry points. This iterator instance is passed to a ProcessEngine.processASTs method.
  • a rule like callgraph/generate_call_graph is applied to all the ASTs returned by the custom iterator. The iterator will ignore ASTs which are already processed (or the already-processed ASTs will not even end up in the queue).
  • when the callgraph generation rules are applied, if call-sites are found in an AST (the AST may be for an external program, internal procedure, etc), it will add the targets of these call-sites to the tail of the entry-point queue (only if the targets are not external, e.g. to web-services, native libraries, etc), besides creating the appropriate nodes/edges in the graphDB.
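A sketch of the queue-driven iterator described above; the class and method names are hypothetical, not existing P2J code, and the enqueue() call stands in for whatever the call-graph rules would invoke when they resolve a new 4GL target.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;

public class WorkQueueIterator<T> implements Iterator<T>
{
   private final Deque<T> queue = new ArrayDeque<>();
   private final Set<T> processed = new HashSet<>();

   public WorkQueueIterator(Iterable<T> entryPoints)
   {
      for (T entry : entryPoints)
      {
         queue.add(entry);
      }
   }

   /** Called by the rules when a call-site target resolves to 4GL code. */
   public void enqueue(T target)
   {
      if (!processed.contains(target))
      {
         queue.add(target);
      }
   }

   @Override
   public boolean hasNext()
   {
      // skip anything that was processed after it was queued
      while (!queue.isEmpty() && processed.contains(queue.peek()))
      {
         queue.poll();
      }

      return !queue.isEmpty();
   }

   @Override
   public T next()
   {
      if (!hasNext())
      {
         throw new NoSuchElementException();
      }

      T ast = queue.poll();
      processed.add(ast);

      return ast;
   }

   @Override
   public void remove()
   {
      throw new UnsupportedOperationException();
   }
}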
On a side note, I want to add a com.goldencode.p2j.graphdb package to hide the graph DB being used. We will have a GraphDBFactory to create a GraphDB instance (depending on the graph DB being used), with GraphDB defining some common APIs to access/write to the graph DB (see the interface sketch after this list), as in:
  • find node
  • create node
  • create edge
  • get in/out edges
  • DFS/BFS traversal
  • etc
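An illustrative sketch of what that abstraction might look like; the interface and method names are hypothetical (note 23 below ends up dropping this layer in favor of using the Blueprints APIs directly).

import java.util.Map;

public interface GraphDB
{
   /** Look up a node by a unique key/value pair; null if not found. */
   Object findNode(String key, Object value);

   /** Create a node carrying the given properties and return it. */
   Object createNode(Map<String, Object> properties);

   /** Create a directed edge with the given label between two nodes. */
   Object createEdge(Object from, Object to, String label);

   /** Incoming or outgoing edges of a node. */
   Iterable<Object> getEdges(Object node, boolean outgoing);

   /** Visit nodes starting from the given one, breadth- or depth-first. */
   void traverse(Object start, boolean breadthFirst, Visitor visitor);

   /** Callback invoked for each node reached during a traversal. */
   interface Visitor
   {
      void visit(Object node);
   }
}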

#21 Updated by Constantin Asofiei about 10 years ago

About the PatternEngine's callgraph dependencies: I think I understand why it was needed for the PatternEngine to "know" about the callgraph mode: to be able to apply the descend/ascend/walk/next-child rules on this graph. I will keep the existing approach, which applies the rule sets as the callgraph is walked, but I will make this generic: it will not be dependent on the .p.ast files, instead it will walk ASTs. I will change CallGraphNode to be a generic node which points to a graph DB node.

#22 Updated by Greg Shah about 10 years ago

The plan so far is reasonable. I like the abstraction of the graph database.

I don't like how PatternEngine.callGraphMode hard-codes a "callgraph-walking" mode in PatternEngine; it doesn't feel natural... the only thing PatternEngine should be aware of is the list of ASTs to be processed.
...
About the PatternEngine's callgraph dependencies: I think I understand why it was needed for the PatternEngine to "know" about the callgraph mode: to be able to apply the descend/ascend/walk/next-child rules on this graph.

The idea is that you are not walking an AST, you are walking the call-graph itself. So the "walk" is very different since you are not going to visit all nodes in a tree (at least that was the original idea) and you will walk across trees as part of the same graph.

The current call-graph mode is just a "simple" walk of the trees in the graph. I don't think we use it in any of our code today. The approach in the future is meant to be much smarter, only walking the nodes that are actually accessible via that call tree and walking across trees (and maybe back) as needed. This is out of scope for the current work.

2. Gremlin dependencies:

I am a bit worried about the performance of Groovy. My last experience with it was very bad, but I do know they have improved things since then.

Is Gremlin the only query mechanism for Titan? What is your opinion of it?

#23 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

I like the abstraction of the graph database.

After some more thought, I think it is better to use the Blueprints APIs, which are implemented by both Neo4J and Titan. If Titan usage is completely hidden behind Blueprints, we can easily switch to Neo4J or another Blueprints-compatible graph DB. Otherwise, our graph API interfaces would need to mostly duplicate Blueprints for e.g. query support and other usage. It is not just nodes/edges we need to think about...

This is out of scope for the current work.

I was thinking that a tool which e.g. determines the dead files would be written directly in TRPL, and this is why I want to allow the graph DB to be walked via TRPL rules. Also, the changes are not very complex (and I want to get rid of the CallGraphNode and CallGraphPersister classes).

I am a bit worried about the performance of Groovy. My last experience with it was very bad, but I do know they have improved things since then.

Is Gremlin the only query mechanism for Titan? What is your opinion of it?

I don't plan to use Gremlin from Java code, as they already have a query/filter mechanism via Java APIs (and it is better to avoid the Groovy overhead). I think Gremlin may be better used from TRPL code. What about antlr-2.7.7 - there is this comment in progress.g:

This grammar is highly specific to ANTLR 2.7.4 and it may not be valid in future versions of ANTLR. This is due to the fact that the generated code and the grammar syntax of ANTLR can change between versions. These factors are the most important constraints defining the structure of the resulting grammar. As a result, it is possible that the grammar will require changes to run in future versions of ANTLR.

#24 Updated by Greg Shah about 10 years ago

After some more thought, I think it is better to use the Blueprints APIs, which are implemented by both Neo4J and Titan.

OK. If their API is good there is no reason to recreate the wheel.

I was thinking that a tool which e.g. determines the dead files would be written directly in TRPL, and this is why I want to allow the graph DB to be walked via TRPL rules. Also, the changes are not very complex (and I want to get rid of the CallGraphNode and CallGraphPersister classes).

OK.

I don't plan to use Gremlin from Java code, as they already have a query/filter mechanism via Java APIs (and it is better to avoid the Groovy overhead). I think Gremlin may be better used from TRPL code.

I am pretty sure that in the future, TRPL code will be a DSL hosted in Scala (which in itself is hosted in the JVM). We will still have the pattern engine driving the process, but the code that is executed will be written in Scala.

As such I'd like to avoid mixing Groovy and TRPL (Scala). I guess the key questions:

1. What is the benefit of Gremlin?
2. What will the costs be (in performance and/or integration issues)?

What about antlr-2.7.7 - there is this comment in progress.g:

It is likely that the grammars could be modified easily to work in 2.7.7. Later this year it is likely that we would be shifting to ANTLR 3.x or possibly even ANTLR 4.x. We already have enough work in plan to migrate our own grammars without being held back by Gremlin's requirements for 2.7.7. This sounds like one more "con" for Gremlin, though it may not be too much of a problem in the short term.

#25 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

As such I'd like to avoid mixing Groovy and TRPL (Scala). I guess the key questions:

1. What is the benefit of Gremlin?

A first impression is that it is just a "shorthand" for accessing the Pipes APIs and the graph's APIs related to nodes/edges.

2. What will the costs be (in performance and/or integration issues)?

I'll take a deeper look when I start building the tools which walk the graph, and see how Gremlin behaves vs. direct Java code (which should look like "converted" Gremlin code). In any case, there is nothing stopping us from adding some utilities in CallGraphWorker.

#26 Updated by Greg Shah about 10 years ago

A first impression is that it is just a "shorthand" for accessing the Pipes APIs and the graph's APIs related to nodes/edges.

OK, then we won't want to expose this at the TRPL level. Ultimately (i.e. not now), if there are syntactic constructs that are useful (and which we cannot support in a native TRPL syntax) then we will enhance TRPL to cleanly support the syntax we need. The TRPL 2.0 (as I have been calling it) will already be featuring a much improved syntax as part of the move to Scala.

I don't see the need for the Pipes usage either considering that TRPL already has a well defined idea of pipelining.

Part of the idea of moving to a database backed TRPL implementation (whether it is a graph database or relational) is to natively and naturally translate TRPL code into database operations. I have already done quite a bit of analysis of how we can potentially get a massive improvement in performance by using this mechanism. The work is theoretical at this time, I haven't proven my theory yet. But I am pretty sure it will work. Anyway, let's stay away from Gremlin and Pipes for now.

#27 Updated by Constantin Asofiei about 10 years ago

I've committed today's work to the task branch on devsrv01. What I've accomplished (the work is still rough, but it works):
  • loading of the code set into the graph DB. It is enough to do this in MAJIC:
    java -classpath p2j/build/lib/p2j.jar com.goldencode.p2j.uast.CallGraphGenerator
    
  • after the graph is loaded, it goes through rootlist.xml and generates the RUN filename linkages (for now).
  • PatternEngine can be used to walk the callgraph, touching e.g. all nodes reachable from majic.p, like this:
    java -classpath p2j/build/lib/p2j.jar com.goldencode.p2j.pattern.PatternEngine -c callgraph/list_dependencies . "majic.p.ast" 
    

Currently, a graph node represents an external procedure and the edges are annotated with the call-site ID (the ID of an AST within the external procedure, outside of the graph DB) and other useful info. What I currently don't like in the PatternEngine is that, when I want to walk the callgraph, it moves me through the external procedure AST, too (this is how it was previously implemented). I want to remove this: when walking the graph, only the nodes loaded in the graph should be walked.
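
For reference, a minimal sketch of what such an annotated linkage looks like through the Blueprints API (the property names mirror the discussion above, but the exact keys and the TinkerGraph stand-in are illustrative, not the CallGraphWorker implementation):

import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

public class CallGraphSketch
{
   public static void main(String[] args)
   {
      // in-memory graph stands in for the Titan instance used by the tools
      Graph graph = new TinkerGraph();

      // one vertex per external procedure
      Vertex caller = graph.addVertex(null);
      caller.setProperty("filename", "majic.p");

      Vertex callee = graph.addVertex(null);
      callee.setProperty("filename", "p1.p");

      // the edge represents the call-site; the ID points back to the AST
      // node of the RUN statement inside the caller's .ast file
      Edge call = graph.addEdge(null, caller, callee, "calls");
      call.setProperty("call-site-id", 12884902013L);

      graph.shutdown();
   }
}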

#28 Updated by Greg Shah about 10 years ago

I did a quick code review. It looks really good! Very impressive.

In my opinion, the native Java query access is pretty easy to read/understand. I think the key for us will be doing a smart translation of TRPL behavior into efficient access/filtering of the graph.

#29 Updated by Constantin Asofiei about 10 years ago

Blueprints has a built-in feature to serialize the entire DB to a GraphML file. I've tried importing this into yEd, but there are problems with the long properties: yEd doesn't support long values, only integer values...

#30 Updated by Constantin Asofiei about 10 years ago

With a little hack (replace all "long" types with "string" in the .graphml file) I've managed to open the graph in yEd for the entire MAJIC code set. Although yEd cannot display useful data directly on screen (like node text or edge labels, as it reads these from some proprietary yEd nodes), the graph can be looked at. Even so, I'll have the CallGraphGenerator generate the .graphml file, too.
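
For reference, the export path is roughly this (GraphMLWriter is the Blueprints serializer mentioned above; the output filename and the post-processing step that downgrades "long" to "string" for yEd are illustrative):

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.util.io.graphml.GraphMLWriter;

public class GraphMLExport
{
   public static void export(Graph graph) throws Exception
   {
      // serialize the entire graph DB into GraphML
      Path out = Paths.get("callgraph.graphml");
      OutputStream os = new FileOutputStream(out.toFile());
      try
      {
         GraphMLWriter.outputGraph(graph, os);
      }
      finally
      {
         os.close();
      }

      // yEd does not understand attr.type="long", so downgrade it to string
      String xml = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
      xml = xml.replace("attr.type=\"long\"", "attr.type=\"string\"");
      Files.write(out, xml.getBytes(StandardCharsets.UTF_8));
   }
}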

Now I'm moving on to more advanced call-sites (e.g. RUN VALUE and customer-specific call-sites); here is how it works:
  • until the entire AST set is in the graph DB, I need to annotate a link between two nodes with the linkage type (e.g. external program, internal proc, include, etc.), the call-site id and/or include file info.
  • about missing and ambiguous nodes: for the same reason (the graph DB doesn't have the entire AST set loaded), I have to link a call-site from an AST node (for an external program) to a MISSING or AMBIGUOUS node (as I can't annotate the call-site's specific AST in the graph DB). When the AMBIGUOUS node has no more incoming edges, then the callgraph is ready for analysis.
  • the TRPL rules are built so that other call-site types can be added easily. I'll add a few other call-sites (like socket, web service and appserver connections), to make sure the design is easy enough to add other call-sites.

Finally, for the GUI/server projects, please let me know which call-sites need to be addressed first. Currently, the only items on my short list are external-program linkages, via RUN filename or RUN VALUE, and other cases (like TRIGGER-PROCEDURE procedures or appserver entry points) which require the node to be marked as "external-usage" (i.e. an expected entry point).

#31 Updated by Greg Shah about 10 years ago

For this current work, include any call-site that can cause an external linkage, including the non-4GL ones.

  • All forms of the RUN statement that have external linkage, including native API calls. This includes RUN filename, RUN filename ON, RUN VALUE, RUN SUPER, RUN library_member and RUN internal_proc_that_is_a_native_api_call. You can ignore the RUN internal_proc forms.
  • All forms of child process launching (stream and OS-COMMAND).
  • ON statement with RUN PERSISTENT.
  • All schema references to a given TRIGGER-PROCEDURE. (Which will allow determination of missing TRIGGER-PROCEDURES and unreferenced TRIGGER-PROCEDURES)
  • Web Services
  • SOAP
  • usage of DDE EXECUTE (command execution)
  • COM automation object method calls
  • OCX control method calls

Technically, the following also have external linkage (but aren't used in the code that will be analyzed). You can avoid these unless it is very quick to add the support:

  • CALL object usage
  • invocation of a statically linked user C function via CALL
  • DYNAMIC-INVOKE
  • any OO 4GL object reference (includes having a member of the given class type, instantiation (NEW), method calls, property access)
  • .NET method calls/constructor (property access?)
  • Stored Procedures

Can you think of any external linkage type that I have not listed above?

Internal forms of linkage can be left until later.

#32 Updated by Constantin Asofiei about 10 years ago

Can you think of any external linkage type that I have not listed above?

I think all forms of the CONNECT method (for appserver, web services and sockets) and the server socket's ENABLE-CONNECTIONS should be treated too. I'll keep an eye out for other cases.

Some notes:
1. Is there any reason why we can't generate the callgraph from the final ASTs, after conversion, and must use only the ASTs generated by the front phase? Currently, callgraph generation is possible either in F3 mode or via CallGraphGenerator.main; and in F3 mode, info like the relative-name annotation is missing, as it is added by the annotations phase.
2. The run rcode-lib.pl<<rcode-proc.r>>. statement is not supported by the conversion rules; it abends in the Core Code Conversion phase with an "Unsupported RUN mode LIBRARY_REF" error.

#33 Updated by Greg Shah about 10 years ago

1. Is there any reason why we can't generate the callgraph from the final ASTs, after conversion, and must use only the ASTs generated by the front phase?

My primary concern is that our current conversion approach is "lossy". During unreachable processing, annotations and core conversion, we can make significant changes to the ASTs. We hide, drop and/or modify the ASTs in many cases. This means that processing the ASTs at that time may generate very different results than on the input trees.

I don't know for sure the extent of the problem. For the unreachable processing in particular, it may be a good thing to process the call graph after that since it should not (I hope) negatively change the outcome and may only help it (unless we have bugs). But even during the annotations phase we start making changes that are potentially dangerous, so it is hard to tell the scope of the issue.

2. The run rcode-lib.pl<<rcode-proc.r>>. statement is not supported by the conversion rules; it abends in the Core Code Conversion phase with an "Unsupported RUN mode LIBRARY_REF" error.

Yes, this is expected. So far, we actually haven't encountered this being used in any customer code (and we have run our tools on many potential customers' code as part of conversion project assessments).

However, we should go ahead and allow this to convert to something that will fail at runtime. That way the conversion won't die/abend. And during conversion we should output a big ERROR message listing the specific library reference that will fail at runtime.

#34 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

This means that processing the ASTs at that time may generate very different results than on the input trees.

OK, this makes sense. I will work only with the info set on the input trees. This means I will not have access to the "relative-name" annotation, as it is set in the annotations phase, but I think the filename should be enough.

However, we should go ahead and allow this to convert to something that will fail at runtime. That way the conversion won't die/abend. And during conversion we should output a big ERROR message listing the specific library reference that will fail at runtime.

OK, I've changed it to output an ERROR message for the RUN library-ref and ON ... PERSISTENT RUN library-ref cases. The code still converts to the usual ControlFlowOps.invoke and registerTriggerPersistent APIs; emitting a different API is not required, as it is the runtime's job to determine what needs to be executed.

#35 Updated by Constantin Asofiei about 10 years ago

The AST for e.g. an INPUT THROUGH statement is broken and is fixed by annotations/input-output.rules:181 (the "fixup KW_THROUGH children to be a single array" comment); this fixup needs to be moved to fixups/post_parse_fixups.xml, so that it will be picked up by the callgraph generation code.

#36 Updated by Greg Shah about 10 years ago

Yes, please do this for any "fixes" that are done later but which are needed for call-graph processing.

#37 Updated by Constantin Asofiei about 10 years ago

  • % Done changed from 0 to 60
I've added to testcases/uast/callgraph-tests the current tests for callgraph generation. What is working:
  • RUN external-program/library-ref variants. Note that the RUN SUPER statement is not needed, as this will always target an internal procedure
  • PROCEDURE ... EXTERNAL cases: these are linked at the procedure definition, not the call-site.
  • native process launching
  • ON ... PERSISTENT triggers
  • TRIGGER PROCEDURE and schema triggers

I've added some report listings too (driven from callgraph/reports.xml). You can look at the *.lst files in the testcases/uast/callgraph-tests folder to see how they would look.

Now I'm running the callgraph rules with the server/majic project, to see how it behaves. Tomorrow I expect to finish the support for the remaining cases (appserver/socket/web services connections + dde/com/ocx) and cleanup/javadoc the code.

#38 Updated by Constantin Asofiei about 10 years ago

  • % Done changed from 60 to 0

Greg, is there a way to get the concatenated result from a string expression containing only literals, like "-H localhost -S 3363" + " foo-string" + "bar", at conversion time?

#39 Updated by Greg Shah about 10 years ago

We might have created something like that, but I don't recall it.

However, there is some useful code in operators.rules that is pretty close. See the section starting at line 194 where we are merging all character expression input (into a single concat call) that are part of the same extended string concatenation operation.
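
Not the operators.rules logic itself, but to illustrate the folding idea in isolation, something like this would do (the Ast interface and the node type names are hypothetical placeholders, not the real TRPL/ProgressAst API):

import java.util.List;

public class ConcatFolder
{
   /** Hypothetical, minimal view of an AST node, just for this illustration. */
   public interface Ast
   {
      String getType();        // e.g. "PLUS" or "STRING"
      String getText();        // literal text of STRING nodes, without quotes
      List<Ast> getChildren();
   }

   /**
    * Returns the concatenated literal value when the subtree consists only of
    * string literals joined by "+", otherwise null (runtime-evaluated case).
    */
   public static String fold(Ast node)
   {
      if ("STRING".equals(node.getType()))
      {
         return node.getText();
      }
      if ("PLUS".equals(node.getType()) && node.getChildren().size() == 2)
      {
         String left  = fold(node.getChildren().get(0));
         String right = fold(node.getChildren().get(1));
         return (left != null && right != null) ? left + right : null;
      }
      return null;
   }
}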

#40 Updated by Constantin Asofiei about 10 years ago

About DDE, COM and OCX call-sites:
  1. a COM-HANDLE var is initialized via a CREATE expr com-handle-var [CONNECT TO expr] statement or via the :COM-HANDLE attribute, with these notes:
    - currently, we can't parse the CONNECT TO phrase properly. Also, should the CREATE statement notice that we are in a COM-HANDLE var case and convert differently?
    - the loadControls(expr) method (for a CONTROL-FRAME widget) can be used to load data from an external file.
  2. for the DDE case, I think external linkage can be limited to DDE INITIATE. DDE EXECUTE, DDE SEND, DDE REQUEST and DDE GET all require a dde-id, which can be obtained only via DDE INITIATE.

#41 Updated by Greg Shah about 10 years ago

we can't parse the CONNECT TO phrase properly.

Please provide more details.

Also, should the CREATE statement notice that we are in a COM-HANDLE var case and convert differently?

This seems reasonable. What do you propose?

the loadControls(expr) method (for a frame Control-Frame widget) can be used to load data from an external file

Yes, this seems like the best external linkage point for OCX.

for the DDE case, external linkage I think can be limited to DDE INITIATE

Agreed.

#42 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

we can't parse the CONNECT TO phrase properly.

Please provide more details.

It's a combination of CONNECT TO and CREATE VALUE, like this: create value(ch) hcom connect to "bla.xls"., which produces this error:

./dde_com_ocx.p
line 13:23: unexpected token: connect
    at com.goldencode.p2j.uast.ProgressParser.create_widget_stmt(ProgressParser.java:24498)
    at com.goldencode.p2j.uast.ProgressParser.stmt_list(ProgressParser.java:21863)
    at com.goldencode.p2j.uast.ProgressParser.statement(ProgressParser.java:5302)
    at com.goldencode.p2j.uast.ProgressParser.single_block(ProgressParser.java:4276)
    at com.goldencode.p2j.uast.ProgressParser.block(ProgressParser.java:4039)
    at com.goldencode.p2j.uast.ProgressParser.external_proc(ProgressParser.java:3965)
    at com.goldencode.p2j.uast.AstGenerator.parse(AstGenerator.java:1416)
    at com.goldencode.p2j.uast.AstGenerator.processFile(AstGenerator.java:942)
    at com.goldencode.p2j.uast.AstGenerator.processFile(AstGenerator.java:814)
    at com.goldencode.p2j.uast.ScanDriver.scan(ScanDriver.java:203)
    at com.goldencode.p2j.uast.ScanDriver.scan(ScanDriver.java:122)
    at com.goldencode.p2j.convert.ConversionDriver.runScanDriver(ConversionDriver.java:385)
    at com.goldencode.p2j.convert.ConversionDriver.front(ConversionDriver.java:282)
    at com.goldencode.p2j.convert.ConversionDriver.main(ConversionDriver.java:1729)

Also, should the CREATE statement notice that we are in a COM-HANDLE var case and convert differently?

This seems reasonable. What do you propose?

I see that the parser emits CREATE_OBJECT nodes, but not for all cases. I think the only problem is the CREATE VALUE case, where it should notice that we are targeting a COM-HANDLE var and go down the create_object_stmt path. Currently, a create value(ch) hcom. parses like:

    <ast col="0" id="12884902012" line="0" text="statement" type="STATEMENT">
      <ast col="1" id="12884902013" line="13" text="create" type="CREATE_WIDGET">
        <ast col="8" id="12884902016" line="13" text="value" type="KW_VALUE">
          <ast col="0" id="12884902018" line="0" text="expression" type="EXPRESSION">
            <ast col="14" id="12884902019" line="13" text="ch" type="VAR_CHAR">
              <annotation datatype="java.lang.Long" key="oldtype" value="2345"/>
              <annotation datatype="java.lang.Long" key="refid" value="12884902002"/>
            </ast>
          </ast>
        </ast>
        <ast col="18" id="12884902020" line="13" text="hcom" type="VAR_COM_HANDLE">
          <annotation datatype="java.lang.Long" key="oldtype" value="2345"/>
          <annotation datatype="java.lang.Long" key="refid" value="12884901989"/>
        </ast>
      </ast>
    </ast>

while a create "Excel.Application" hcom. parses like:
    <ast col="0" id="12884902025" line="0" text="statement" type="STATEMENT">
      <ast col="1" id="12884902026" line="16" text="create" type="CREATE_OBJECT">
        <ast col="0" id="12884902029" line="0" text="expression" type="EXPRESSION">
          <ast col="8" id="12884902030" line="16" text="&quot;Excel.Application&quot;" type="STRING"/>
        </ast>
        <ast col="28" id="12884902033" line="16" text="hcom" type="VAR_COM_HANDLE">
          <annotation datatype="java.lang.Long" key="oldtype" value="2345"/>
          <annotation datatype="java.lang.Long" key="refid" value="12884901989"/>
        </ast>
      </ast>
    </ast>

#43 Updated by Greg Shah about 10 years ago

Using VALUE for CREATE Automation is not noted in the documentation. Are you sure that this works in the 4GL as COM automation?

The bizarre thing is that the CREATE Automation statement already uses an expr as the 2nd token, so there is never a need to use VALUE. But the 4GL is so nasty that it would not surprise me if they actually defer the determination of the type of the CREATE until runtime when they can see the evaluated text of the expression.

All of your concerns above related to this would be resolved by parser changes. The biggest issue is resolving the ambiguity between the CREATE widget use of VALUE and the CREATE Automation case.

#44 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

Using VALUE for CREATE Automation is not noted in the documentation. Are you sure that this works in the 4GL as COM automation?

There are no compile errors in 4GL for a CREATE VALUE(...) var-com-handle statement. But there are compile errors in 4GL for a CREATE button var-com-handle, create "a" + "b" var-com-handle. or not-a-com-handle:loadControls(). This suggests that the COM-HANDLE data type is very different from HANDLE and the parser uses this to disambiguate the CREATE statement.

All of your concerns above related to this would be resolved by parser changes. The biggest issue is resolving the ambiguity between the CREATE widget use of VALUE and the CREATE Automation case.

Yes, this is a parser-level problem. I took a quick look at create_object_stmt and create_widget_stmt and I'm not sure how I can find the 3rd token type (relative to the CREATE syntax), as this is too far away for LA to access it (we need to jump over the Automation expression first, which may be very complex). This might need to be fixed in the post-parse fixups phase: just alter create_widget_stmt to accept the CONNECT TO phrase and let post-parse fixups change the AST type from CREATE_WIDGET to CREATE_OBJECT.

#45 Updated by Greg Shah about 10 years ago

This might need to be fixed in the post-parse fixups phase: just alter create_widget_stmt to accept the CONNECT TO phrase and let post-parse fixups change the AST type from CREATE_WIDGET to CREATE_OBJECT.

This is definitely one solution.

I'm not sure how I can find the 3rd token type (relative to the CREATE syntax), as this is too far away for LA to access it (we need to jump over the Automation expression first, which may be very complex).

The way to do it in the parser is to implement a "left-factoring" approach.

The idea: create a common create_widget_or_object rule that matches the left portion of both the create_widget_stmt and the create_object_stmt. Then the create_widget_or_object rule will call the proper sub-rule for the right portions. Depending on the sub-rule called, the root node will have to be set to the proper type.

It is a bit more intrusive to the parser, but it is the best approach.

#46 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

The idea: create a common create_widget_or_object rule that matches the left portion of both the create_widget_stmt and the create_object_stmt. Then the create_widget_or_object rule will call the proper sub-rule for the right portions. Depending on the sub-rule called, the root node will have to be set to the proper type.

It is a bit more intrusive to the parser, but it is the best approach.

OK, I'll fix it after the main callgraph changes.

On a side note, there are cases in the server project where there are typos in preprocessor usage, like {a.i &aaa "bla"} when it should have been {a.i &aaa = "bla"}. Currently, we throw an exception in PreprocessorHints.createArgument:922, as the value for this aaa argument is missing. I think we need some tools to analyze the .p.pphints files and check for problems...

#47 Updated by Greg Shah about 10 years ago

If the .pphints problems are not hindering your work, then create a new bug task for this and we will work it later. You can add it to the "Code Improvements" milestone.

#48 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

If the .pphints problems are not hindering your work, then create a new bug task for this and we will work it later. You can add it to the "Code Improvements" milestone.

I had to fix the exception now (allow a null value if it is not used), and I'll leave the analysis for another task.

Also, about include files: I see that the P2J preprocessor does not support relative names via "..", only relative names starting with the location of the program. Is this correct? I ask because I want to assume that the project-relative name for an include file from the .pphints file will be obtained via external-program-name + pphints-include-name.

#49 Updated by Greg Shah about 10 years ago

I see that the P2J preprocessor does not support relative names via "..", only relative names starting with the location of the program. Is this correct?

As far as I know it does work. Have you found a problem? Attached is a very simple example that can be duplicated in the testcases/uast/ directory.

The FileScope class hides the majority of the file processing for the preprocessor. It uses the File class from J2SE, plus Configuration.normalize() and Utils.canonicalizePath() to handle relative name problems. See the 2nd constructor and the open() method.

#50 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

As far as I know it does work. Have you found a problem?

What is not working for me is this case:

./run_test/include-test2.p
Warning [./run_test/include-test2.p line 1, col 21]: Include file "../include-test2.i" not found.

with include-test2.p having this content:
{../include-test2.i}

message "include-test".

./include-test2.i exists in project root.

I think this is because FileScope doesn't keep track of the folder in which the 4GL program resides... Also, propath is set to . - should I set propath to .:run_test ?

#51 Updated by Greg Shah about 10 years ago

I think this is because FileScope doesn't keep track of the folder in which the 4GL program resides...

As far as I know, the 4GL doesn't consider this either. It only considers two things:

  1. The current working directory of the program. I think it attempts to find relative files from this location as a first pass.
  2. The PROPATH. I think it takes a second pass checking each directory in the PROPATH + the relative path to see if the requested file exists.

I can't be completely sure of this, because the current-directory behavior might just be a consequence of the PROPATH always (or virtually always) containing it, implicitly or explicitly.

Anyway, as far as I know, the preprocessor should not be considering the path of the file that contains the include reference.

Also, propath is set to . - should I set propath to .:run_test ?

I would expect the 4GL to require this, as we would too.
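
A minimal sketch of that two-pass lookup, assuming the PROPATH has already been split into directory entries (the class and method names are placeholders, not the FileScope implementation):

import java.io.File;
import java.util.List;

public class IncludeResolver
{
   /**
    * Resolve an include reference as described above: first against the
    * current working directory, then against each PROPATH entry in order.
    * Returns null when the file cannot be found anywhere.
    */
   public static File resolve(String includeName, List<String> propath)
   {
      // pass 1: relative to the current working directory
      File direct = new File(includeName);
      if (direct.exists())
      {
         return direct;
      }

      // pass 2: each PROPATH directory plus the relative include name
      for (String dir : propath)
      {
         File candidate = new File(dir, includeName);
         if (candidate.exists())
         {
            return candidate;
         }
      }

      return null;
   }
}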

#52 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

Anyway, as far as I know, the preprocessor should not be considering the path of the file that contains the include reference.

I think you are right; what the 4GL does, I think, is to search first in the folder where "pro" was started and after that the PROPATH. For example, if I have a structure like:
~/callgraph/a.p
~/a.i

and a.p contains only {../a.i} then:
  1. cd ~ && pro -p callgraph/a.p will not find a.i.
  2. cd ~/callgraph && pro -p a.p will find a.i.

I would expect the 4GL to require this, as we would too.

I've changed the propath, but there is a problem resolving the project-relative name for the include file.

#53 Updated by Constantin Asofiei about 10 years ago

About the preprocessor parser problem which can be reproduced with an include file like this one:

&IF "{&bbb}" = "aaa" &THEN
message "then".
&ELSEIF "{&bbb}" = "ddd" &THEN
{include-test3.i}
message "else".
&ENDIF

include-test3.i will be reported in .pphints, but will not be processed to .p.cache.

I think this is because the condtext rule in text.g does not disable brace expansion when the &ELSEIF condition does not emit (line 762):

            // did the IF portion emit?
            if (emit == 0)
            {
               // no emit yet, so evaluate the expression
               if (Preprocessor.evaluate(env, x))
               {
                  // force the following block to emit
                  emit = 1;
               }
               else
               {
                  // disable include expansion; must be done here BEFORE any
                  // matching may inadvertantly access lookahead and trigger
                  // an expansion; this is OK as long as it is above the
                  // &THEN
                  env.setExpandBraces(false);
               }
            }
            else /* this needs to be added */
            {
               // disable include expansion; must be done here BEFORE any
               // matching may inadvertantly access lookahead and trigger
               // an expansion; this is OK as long as it is above the
               // &THEN
               env.setExpandBraces(false);
            }

#54 Updated by Greg Shah about 10 years ago

Yes, I think you are right.

#55 Updated by Constantin Asofiei about 10 years ago

I've added a good version for review in the 2251 branch. rules/callgraph/callgraph_lib.rules contains details about how the nodes/edges are configured in the graph DB.

BTW, do you have a rebase script, so I can rebase?

What is left on the list:
  • finish callgraph testing with the server project and majic
  • document the structure of the files generated by the callgraph tools
  • conversion regression testing on MAJIC

#56 Updated by Greg Shah about 10 years ago

BTW, do you have a rebase script, so I can rebase?

Not yet. Please go through the process manually based on my previous description. After doing this a few times, we can decide what scripting makes sense.

#57 Updated by Greg Shah about 10 years ago

Code Review branch rev 10492

I have some questions regarding the "database design". There are inconsistencies:

1. External targets have a "node-type" but call-sites don't. I prefer the type approach to be used consistently. Is there a reason not to do that?

I also wonder if we shouldn't have the types be integer values that are our AST token types. This seems to be an approach which will be consistent with how we will probably use it when we back TRPL with the graph DB. Of course, we can add new artificial token types as needed to extend this model into the call graph processing.

Another advantage is this will eliminate a great deal of the string processing and it should be much faster and more compact (it will scale better to very large size projects).

2. I prefer to avoid hard-coding knowledge of the meaning of things into "key" names. For example:

If a call-site is determined to be ambiguous and its target cannot be determined until runtime because the call-site uses a runtime-evaluated expression, then its call-site-key should end with a "_value" suffix.

I prefer for there to be a boolean property that "marks" the call-site in a way that is easy to understand. Special knowledge is not required to understand a boolean property named "runtime-evaluation". But it is needed to understand that in order to find "runtime-evaluation" call-sites, you must make a list of all call-sites whose call-site-key ends in _value.

I expect that this will help make things easier to query as well, leading to less development time later as we add new "applications" that use the graph database for different purposes.

3. The PatternEngine should not have any behavior that is specific to the Progress 4GL. Right now, it works because of hard-coding to 4GL concepts like "external-procedure". The PatternEngine (like the rest of TRPL) is supposed to be generic.

4. In FileScope.open(), please move this duplicated code:

// re-adjust the name to the resolved one
fileName = Configuration.normalizeFilename(fullname);

Into the if (hit) section at the bottom. It seems to me that it only has to be done in that one place, not the 4 locations above.

#58 Updated by Greg Shah about 10 years ago

Please do a bzr check-in for your work each night, unless there are no changes. It doesn't matter what the state of the code is, including whether it compiles or not.

There are multiple reasons for this:

1. It is much easier to keep up with code review work if the amount to review is smaller. By checking in each day, you allow reviewers to keep up with you.

2. It facilitates conversation about design issues earlier in the process.

3. It naturally does a backup of your work each night.

#59 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

1. External targets have a "node-type" but call-sites don't. I prefer the type approach to be used consistently. Is there a reason not to do that?

In the graph DB, from the AST files associated with an external program, only the root AST node (for the external program) has an associated node in the DB. The call-sites don't have such a node; instead, the info about the call-site is specified in the edge's properties, via the call-site-key property. For this reason, some info can't be annotated at the call-site, and some custom nodes (like the ones related to the external targets) are needed. Should I have created graph DB nodes for the call-sites' ASTs, too?

I also wonder if we shouldn't have the types be integer values that are our AST token types. This seems to be an approach which will be consistent with how we will probably use it when we back TRPL with the graph DB. Of course, we can add new artificial token types as needed to extend this model into the call graph processing.

I agree node-type is best used as an integer, but I don't think they can be matched with AST token types. Currently, only the external-procedure and schema-table cases have both a graph DB node and an associated AST node. The other cases specify targets outside of the legacy code. Consider this: as e.g. a UNIX cmd statement doesn't have an associated node in the graph DB, the native-process launch command must have its own DB node, to which all these edges will point. When each call-site has its own node in the graph DB, the "external target" nodes will no longer be required and can easily be replaced with properties set at the call-site.

The info about the call-site is specified in the call-site-key property of an edge: this is really dependent on the call-site's token type. I thought about using the AST token types for the call-site-key, but it is not enough to distinguish between e.g. a RUN filename and a RUN filename ON SERVER statement, unless some additional token types are added.

If a call-site is determined to be ambiguous and its target cannot be determined until runtime because the call-site uses a runtime-evaluated expression, then its call-site-key should end with a "_value" suffix.

I prefer for there to be a boolean property that "marks" the call-site in a way that is easy to understand. Special knowledge is not required to understand a boolean property named "runtime-evaluation". But it is needed to understand that in order to find "runtime-evaluation" call-sites, you must make a list of all call-sites whose call-site-key ends in _value.

Here, the problem is again related to the fact that a call-site doesn't have an associated node in the graph DB, and I can't set a property at the call-site. This is why the "ambiguous" node exists in the graph DB. "_value" is just something to distinguish between the call-sites which have a hard-coded target and the ones with targets evaluated at runtime, but can be removed.

3. The PatternEngine should not have any behavior that is specific to the Progress 4GL. Right now, it works because of hard-coding to 4GL concepts like "external-procedure". The PatternEngine (like the rest of TRPL) is supposed to be generic.

I understand this. My changes just adjusted the previous logic (which worked with .ast.graph files) to work with the graph DB. Do I need to make it work differently?

4. In FileScope.open(), please move this duplicated code:
Into the if (hit) section at the bottom. It seems to me that it only has to be done in that one place, not the 4 locations above.

Done.

From your notes, my conclusions are these:
  1. the node-type and call-site-key properties are best stored as integer IDs.
  2. remove _value usage and replace it with a boolean parameter passed to callgraph_lib APIs.

#60 Updated by Constantin Asofiei about 10 years ago

About mapping the call-site-key string values to token types:
  • includes: this is used to mark edges for include files. An artificial INCLUDES token is needed.
  • run_filename, run_filename_on_server, run_on_server, run_value, run_library_ref, run_library_ref_on_server: KW_RUN can be used, but the edge needs additional info about the call-site, to disambiguate these cases.
  • table_trigger: KW_TAB_TRG
  • native_procedure: KW_EXTERN can be used, as KW_EXTERN in RELEASE EXTERNAL is dropped
  • native_process: kw_btos, kw_dos, kw_os2, kw_unix, kw_vms, kw_os_cmd, input_thru, input_output_thru, output_thru
  • connect: KW_CONN
  • server_socket: KW_ENABLE_C
  • com_object: CREATE_OBJECT
  • dde_initiate: DDE_INITIATE
  • ocx_controls: KW_LOADCTRL
About mapping the node-type string values to token types:
A. nodes with an associated AST:
  • external-procedure - no associated token type. BLOCK can be used, with a special root property set to true.
  • schema-table - TABLE can be used
B. nodes with no associated AST:
  • include - no associated token type. An artificial INCLUDE token is needed.
  • missing-external-procedure, library-procedure, native-procedure, native-process, network-connection, com-object, dde-initiate, ocx-control - as there is no DB node for the outbound call-site, an edge originating from an external-procedure node will have one of these nodes as its inbound vertex, to disambiguate the call-site's target. When the graph DB contains all nodes for each AST, not just external-procedure, these can be removed and replaced with properties set at the call-site AST. So, for now we need artificial tokens for each of these cases.

#61 Updated by Greg Shah about 10 years ago

Should I have created graph DB nodes for the call-sites' ASTs, too?

I think so, yes. This would be consistent with our planned approach in the future and it may also lead to a more consistent approach in the short term.

I agree node-type is best used as an integer, but I don't think they can be matched with AST token types. Currently, only the external-procedure and schema-table cases have both a graph DB node and an associated AST node.

Go ahead with your plan to create artificial token types in progress.g to represent the type of any node or edge that does not map directly to a current token type.

I thought about using the AST token types for the call-site-key, but it is not enough to distinguish between e.g. a RUN filename and a RUN filename ON SERVER statement, unless some additional token types are added.

It is OK to add an artificial token type for these cases too, if that is helpful.

3. The PatternEngine should not have any behavior that is specific to the Progress 4GL. Right now, it works because of hard-coding to 4GL concepts like "external-procedure". The PatternEngine (like the rest of TRPL) is supposed to be generic.

I understand this. My changes just adjusted the previous logic (which worked with .ast.graph files) to work with the graph DB. Do I need to make it work differently?

I'm referring to code such as this (from PatternEngine.run()):

Vertex node = CallGraphWorker.findNodeById("external-procedure", id);

This is hard coded to the 4GL. walkGraph() also has such a dependency.

run_filename, run_filename_on_server, run_on_server, run_value, run_library_ref, run_library_ref_on_server: KW_RUN can be used, but the edge needs additional info about the call-site, to disambiguate these cases.

Go ahead with an artificial token type for each edge type. You can make it as verbose as needed to be clear (e.g. RUN_FILENAME_ON_SERVER).

#62 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

Should I have created graph DB nodes for the call-sites' ASTs, too?

I think so, yes. This would be consistent with our planned approach in the future and it may also lead to a more consistent approach in the short term.

OK, this brings up some other issues:
  • If the entire AST structure for an external-procedure is loaded into the DB, the in-DB ASTs can't be used directly in TRPL: instead, I think a DBRegistryPlugin implementation (similar to InMemoryRegistryPlugin) is needed, which will load/unload ASTs from the DB into/from ProgressAst instances, as the TRPL rules still need ProgressAst instances.
  • for the current task, how far do I go with the changes: only allow CallGraphGenerator to load the external-procedure ASTs into the graph DB, or change the entire conversion process to work with the DBRegistryPlugin?

I'm referring to code such as this (from PatternEngine.run()):

Vertex node = CallGraphWorker.findNodeById("external-procedure", id);

This is hard coded to the 4GL. walkGraph() also has such a dependency.

This is somehow related to the previous point.
  1. in graph-based walk mode, we need to determine how we pass the initial nodes to PatternEngine:
    - in the previous mode, the list of .ast files is used to specify the initial list of external programs, even for graph mode
    - with the current changes, I don't think we need to rely on physical .ast files for the initial list of graph nodes. To determine the initial list of nodes, we can change the arguments for PatternEngine to accept the name of a property/token (e.g. external-procedure) and a list or spec of possible values for that property. It might even be possible to accept a complex TRPL boolean expression, which can be evaluated for each node.
  2. for a graph-based PatternEngine walk, when call-site linkages are reached, do we need to get out of the current external-procedure AST and start walking the target AST? I.e. when a RUN p1.p is reached, do we start walking p1.p's AST, or is p1.p just registered with the PatternEngine, to be walked later?

#63 Updated by Constantin Asofiei about 10 years ago

Please review the progress.g changes in 10494, they are related to parsing the CREATE automation-object com-handle-var. statements.

#64 Updated by Greg Shah about 10 years ago

Code Review branch rev 10494

Except for progress.g, I am fine with the other changes.

I have these questions about progress.g:

1. Can the expr option be used with CREATE_WIDGET or is it only used for CREATE_OBJECT? I understand that VALUE can be used for both, which is what caused this left-factoring to be needed. But it is not clear that expr can be used in both places. If not, then the expr alternative should set is_com_handle to true.

2. I don't think your use of semantic predicates in ()? will work properly. Check the generated code. I think it will be generating a syntactic predicate exception for these cases, which is not how a semantic predicate works. The idea of a semantic predicate is that the code you put there is added to the end of the normal predication checking logic that the parser generates, using an && operator. So the alternative will be avoided if that semantic predicate is false. The problem is that semantic predicates are not recognized inside a ()? subrule. They should be but it doesn't work.

      ({ is_com_handle }? com_connect_option)?
      ({ !is_com_handle }? in_widget_pool_clause)?
      ({ !is_com_handle }? attribute_assign_clause)?
      ({ !is_com_handle }? trigger_phrase)?

This can be written in two ways that will work:

      (
           { is_com_handle }? com_connect_option
         | // empty alternative
      )
      (
           { !is_com_handle }? in_widget_pool_clause
         | // empty alternative
      )
      (
           { !is_com_handle }? attribute_assign_clause
         | // empty alternative
      )
      (
           { !is_com_handle }? trigger_phrase
         | // empty alternative
      )

OR

      (
           { is_com_handle }? com_connect_option
         | { !is_com_handle }? in_widget_pool_clause
         | { !is_com_handle }? attribute_assign_clause
         | { !is_com_handle }? trigger_phrase
      )*

The first way limits the matching to the specified order. If the 4GL in fact can match in a different order, then the 2nd approach is needed.

#65 Updated by Greg Shah about 10 years ago

If the entire AST structure for an external-procedure is loaded into the DB, the in-DB ASTs can't be used directly in TRPL: instead, I think a DBRegistryPlugin implementation (similar to InMemoryRegistryPlugin) is needed, which will load/unload ASTs from the DB into/from ProgressAst instances, as the TRPL rules still need ProgressAst instances.

I don't want to go this far right now. I'm just suggesting putting placeholder graph nodes in for the other ASTs.

only allow CallGraphGenerator to load the external-procedure ASTs into the graph DB

Yes.

or change the entire conversion process to work with the DBRegistryPlugin?

No.

I don't think we need to rely on physical .ast files for the initial list of graph nodes. To determine the initial list of nodes, we can change the arguments for PatternEngine to accept the name of a property/token (e.g. external-procedure) and a list or spec of possible values for that property. It might even be possible to accept a complex TRPL boolean expression, which can be evaluated for each node.

If possible, I'd like to hide this knowledge behind an interface and allow the functionality to be "plugged in" by the specific implementation.

when a RUN p1.p is reached, do we start walking p1.p's AST, or is p1.p just registered with the PatternEngine, to be walked later?

This is a good question. I think the most correct approach is to walk into p1.p first and then after that portion of the call tree is complete, process the rest of the original AST. This is conceptually closest to what happens at runtime and it also may be important for certain kinds of analysis like tracking shared variable usage.
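
A sketch of that depth-first ordering at the graph DB level (Blueprints API; the "calls" edge label and the visitor callback are assumptions about how the walker could be wired, and the interleaving with the caller's own AST walk is not shown):

import java.util.HashSet;
import java.util.Set;

import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Vertex;

public class DepthFirstWalker
{
   /** Callback applied to each graph node, in call order. */
   public interface Visitor
   {
      void visit(Vertex node);
   }

   private final Set<Object> visited = new HashSet<Object>();

   /**
    * Visit the given node, then recurse into each RUN target before the
    * walk of the caller is considered complete (mirrors the runtime call
    * order described above).
    */
   public void walk(Vertex node, Visitor visitor)
   {
      if (!visited.add(node.getId()))
      {
         return;   // already processed; avoids cycles in the call graph
      }

      visitor.visit(node);

      for (Vertex target : node.getVertices(Direction.OUT, "calls"))
      {
         walk(target, visitor);
      }
   }
}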

#66 Updated by Constantin Asofiei about 10 years ago

In bzr there are these changes:
  • Switched to integer token types for call-site-key and node-type.
  • Fixed progress.g for CREATE automation object
  • create a graph node for each call-site. I did not load the entire tree for the external program, just the needed call-site nodes. The callgraph-related rules should work when the entire external-program tree is loaded, too.
What remains:
  • fix PatternEngine to walk the graph DB properly and hide the 4GL dependency.

#67 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

BTW, do you have a rebase script, so I can rebase?

Not yet. Please go through the process manually based on my previous description. After doing this a few times, we can decide what scripting makes sense.

bzr-rewrite is not installed on devsrv01, so I did the following on my system:

A. to rebase trunk_master with the filesrv01 repo; while in trunk_checkout:
  1. bzr unbind
  2. bzr rebase filesrv01_p2j_checkout
  3. bzr push --overwrite ~/secure/code/p2j_repo/p2j/trunk/
  4. bzr bind
B. to rebase task_branch with trunk_master (which is now on the same revision as filesrv01); while in task_branch_checkout:
  1. bzr unbind
  2. bzr rebase trunk_checkout
  3. resolving conflicts, then:
    - bzr resolve
    - bzr rebase-continue
  4. bzr push --overwrite ~/secure/code/p2j_repo/p2j/active/2251/
  5. bzr bind

#68 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

I don't think we need to rely on physical .ast files for the initial list of graph nodes. To determine the initial list of nodes, we can change the arguments for PatternEngine to accept the name of a property/token (e.g. external-procedure) and a list or spec of possible values for that property. It might even be possible to accept a complex TRPL boolean expression, which can be evaluated for each node.

If possible, I'd like to hide this knowledge behind an interface and allow the functionality to be "plugged in" by the specific implementation.

The proposed approach for this plugin is:
  • add a pattern.GraphNodeFilter interface, with a boolean acceptNode(Vertex) method.
  • add a uast.ExternalProcNodeFilter class implementing the GraphNodeFilter interface
  • add a configuration parameter named graph-node-filter, whose value is the name of a pattern.GraphNodeFilter implementation; this will be resolved by Configuration.getGraphNodeFilter and can default to uast.ExternalProcNodeFilter
  • PatternEngine.walkGraph and PatternEngine.run will use the pattern.GraphNodeFilter implementation to check if a certain node can be traversed or not.

This approach will still rely on the physical AST files to determine the initial nodes. It just hides the knowledge about which nodes can be traversed and which can not.
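
Roughly, the plugin shape would be (a sketch only; the interface and the default implementation follow the proposal above, while the "node-type" property and the constant it is compared against are placeholders):

import com.tinkerpop.blueprints.Vertex;

/** Decides which graph DB nodes the PatternEngine is allowed to traverse. */
interface GraphNodeFilter
{
   boolean acceptNode(Vertex node);
}

/** Default filter: only nodes representing 4GL external procedures are walked. */
class ExternalProcNodeFilter implements GraphNodeFilter
{
   // hypothetical token id; the real value would come from the parser types
   private static final int EXTERNAL_PROCEDURE = -1;

   @Override
   public boolean acceptNode(Vertex node)
   {
      Object type = node.getProperty("node-type");
      return type instanceof Integer &&
             ((Integer) type).intValue() == EXTERNAL_PROCEDURE;
   }
}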

#69 Updated by Greg Shah about 10 years ago

Code Review branch rev 10504

1. In the native_process function, please add support for the undocumented kw_ctos as a shell program. The parser supports this because we found it in customer source code.

2. In progress.g, instead of BEGIN_ARTIFICIAL and END_ARTIFICIAL, please use BEGIN_CALLGRAPH and END_CALLGRAPH. So many of the other tokens in the list are already "artificial", I don't want readers to be confused.

#70 Updated by Greg Shah about 10 years ago

By the way, the rebasing approach that you took looks reasonable. How hard did you find it to use?

#71 Updated by Greg Shah about 10 years ago

The proposed approach for this plugin is:

Very good.

#72 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

By the way, the rebasing approach that you took looks reasonable. How hard did you find it to use?

Rebasing trunk_master with the filesrv01 repo was a straight shot, as there were no conflicts. For rebasing task_branch with trunk_master, there was a conflict related to the history entry for a single file (common-progres.rules). As I expected, the only problems will be with conflict resolution: conflicts can build up if rebasing is not done frequently. Also, if there is a large number of branch-related revisions of a file and the file gets changed in the trunk, conflict resolution may be required for more than one branch revision of this file; but these cases should be rare.

#73 Updated by Constantin Asofiei about 10 years ago

rev 10507 contains the changes from note 69's review, the PatternEngine changes in note 68 plus:
- licence files for the lib jars (see updated note 19 for the lucene jars)
- reports.xml has documentation about the structure of the generated reports.

#74 Updated by Constantin Asofiei about 10 years ago

After analysing the report listings for the server sources, I have these cases which are reported wrong:
  1. RUN <internal-proc>. where the internal procedure is defined in a super-procedure, is converted as a KW_RUN/FILENAME AST. This is falsely identified as a "missing external procedure". What needs to be done is to identify the set of ALL defined internal procedures, and if a RUN's target is an internal procedure, treat it as such.
  2. RUN port-type SET h-port-type ON SERVER h-web-service. needs to be disambiguated and treated like a port-type, not a missing external procedure. The problem is that a RUN ... port-type ON SERVER h-web-service. can't be disambiguated easily, as, at conversion time, this can be either a port-type or an external procedure (as the AST is a KW_RUN/FILENAME).
  3. the connect-string in a CONNECT method call case is not resolved properly.
  4. for some reason, some .p.dict files are picked up as dead files... these files should not have been loaded into the graph in the first place.

As a side note, there are some PROCEDURE ... EXTERNAL cases where the library name is defined as e.g. kernel32, without the .dll suffix.

#75 Updated by Greg Shah about 10 years ago

Code Review branch rev 10507

1. What is the effort involved to add all unreferenced include files to the dead files list?

2. The PatternEngine still has a reference to ProgressParserTokenTypes.EXTERNAL_PROCEDURE.

Otherwise it all looks very good.

#76 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

1. What is the effort involved to add all unreferenced include files to the dead files list?

This is already done; I forgot to document it.

2. The PatternEngine still has a reference to ProgressParserTokenTypes.EXTERNAL_PROCEDURE.

Thanks for noticing, I'll fix it.

#77 Updated by Greg Shah about 10 years ago

As a side note, there are some PROCEDURE ... EXTERNAL cases where the library name is defined as e.g. kernel32, without the .dll suffix.

Yes, this is one more quirk of how the library support works on Windows. From #1634:

On Windows, it is also possible to leave the library extension (.dll) off of the specified library name during loading and the operating system itself (the LoadLibrary API) will add it (actually, it will add ".DLL" if you have no extension). I don't think the 4GL does this.

I think it is best to strip off the .dll from the library names (if present) and let them all case-insensitively match otherwise. This will reduce the possible number of libraries listed which are really the same dll. There are still cases where the same lib can be loaded with different relative paths, but removing the path prefix is not safe to do because the libraries can really be different ones in that case.

#78 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

I think it is best to strip off the .dll from the library names (if present) and let them all case-insensitively match otherwise. This will reduce the possible number of libraries listed which are really the same dll. There are still cases where the same lib can be loaded with different relative paths, but removing the path prefix is not safe to do because the libraries can really be different ones in that case.

OK. But note that, for native libraries, I'm not using just the library name to distinguish the targets; I'm using the routine name too, and the target will be like kernel32:CloseHandle.
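
A trivial sketch of that normalization (the class and method names are illustrative):

public class LibraryTarget
{
   /**
    * Build a target key such as "kernel32:CloseHandle", stripping an optional
    * ".dll" suffix and lower-casing the library name so matches are
    * case-insensitive.
    */
   public static String key(String library, String routine)
   {
      String lib = library.toLowerCase();
      if (lib.endsWith(".dll"))
      {
         lib = lib.substring(0, lib.length() - 4);
      }
      return lib + ":" + routine;
   }
}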

#79 Updated by Constantin Asofiei about 10 years ago

There is a way to get rid of the non-indexed suffix search when looking for external programs: keep, alongside the filename property, a reverse-filename property. Thus, we are no longer looking for a suffix, but for a prefix, which is indexed properly by Titan.
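
A small demonstration of why the two lookups are equivalent (plain Java, no graph DB involved; the property names are the ones described above):

public class ReverseFilename
{
   /** Reverse a string: "path/to/filename.p" becomes "p.emanelif/ot/htap". */
   public static String reverse(String s)
   {
      return new StringBuilder(s).reverse().toString();
   }

   public static void main(String[] args)
   {
      String stored   = "path/to/filename.p";   // "filename" property
      String reversed = reverse(stored);        // "reverse-filename" property
      String lookedUp = "/filename.p";          // suffix we want to match

      // a suffix match on "filename" ...
      boolean suffixMatch = stored.endsWith(lookedUp);

      // ... is equivalent to a prefix match on "reverse-filename", which a
      // graph DB index can answer efficiently
      boolean prefixMatch = reversed.startsWith(reverse(lookedUp));

      System.out.println(suffixMatch + " == " + prefixMatch);
   }
}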

#80 Updated by Constantin Asofiei about 10 years ago

Rev 10510 contains:
  • fix for the RUN <internal-proc> reported as missing external procedures.
  • fix for RUN port-type SET h-port-type ON SERVER h-web-service
  • fix for the connect-string from CONNECT methods in assignments
  • strip the .dll suffix from libraries
  • some other fixes and improvements.

Next will be the fix for the {&FILENAME} preprocessor variable.

#81 Updated by Greg Shah about 10 years ago

Code Review branch rev 10510

1. Please add a comment into the init-rules of post_parse_fixups.xml explaining that the "delete if exists" approach needs to be improved to better support incremental conversions.

2. I'm a little worried that the CallGraphWorker.resolveExternalPrograms() approach of using the reverse-filename for matching can potentially lead to false-positive matches. Do we need to account for that?

#82 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

2. I'm a little worried that the CallGraphWorker.resolveExternalPrograms() approach of using the reverse-filename for matching can potentially lead to false-positive matches. Do we need to account for that?

I don't think it is possible for this to happen. When searching, the name of the file is prefixed with a Unix-style separator (which is always used in the "filename" property). Thus, matching e.g. a /filename.p suffix is always equivalent to matching a p.emanelif/ prefix on the "reverse-filename" property.

#83 Updated by Greg Shah about 10 years ago

My concern is that this algorithm can match multiple filenames in the project, making the results ambiguous.

Assume that the project contains the following:

path/to/first/filename.p
path/to/second/filename.p
other/path/filename.p

A reverse search for p.emanelif/ will match all 3, when at runtime it will match only one of these.

#84 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

path/to/first/filename.p
path/to/second/filename.p
other/path/filename.p

A reverse search for p.emanelif/ will match all 3, when at runtime it will match only one of these.

OK, I now understand your concern. This is not related to how matches are done: searching for a reversed prefix or a suffix will return the same results. The problem here is this: if the code has a RUN filename.p statement, with filename.p defined as in your example, at conversion time we don't know the active PROPATH at the moment RUN filename.p will be executed - so we can't tell for sure which version will be executed.

The current approach is to add all possible matches as targets. An alternative is to mark these cases as ambiguous and let them be disambiguated via hints.

#85 Updated by Greg Shah about 10 years ago

OK, thanks for clarifying this.

#86 Updated by Constantin Asofiei about 10 years ago

Branch rev 10512 contains:
- the fix for the FILE-NAME preproc var
- some misc fixes
- some performance improvements for callgraph and report generation, but these are not enough. I think I'm missing some index configuration, as CallGraphWorker.findNodeById doesn't look like it uses an index (it takes almost 70ms for a two-property query in a DB with ~120k nodes).
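
For reference, the Blueprints-level way to declare such key indexes looks roughly like this (whether Titan needs additional type/index configuration on top of this is not shown; the property names are assumptions about what findNodeById queries on):

import com.tinkerpop.blueprints.KeyIndexableGraph;
import com.tinkerpop.blueprints.Vertex;

public class IndexSetup
{
   /**
    * Ask the graph DB to index the vertex properties used by the lookups,
    * so they no longer degrade into full scans over ~120k nodes.
    */
   public static void createIndexes(KeyIndexableGraph graph)
   {
      graph.createKeyIndex("node-type", Vertex.class);
      graph.createKeyIndex("filename", Vertex.class);
   }
}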

#87 Updated by Greg Shah about 10 years ago

Code Review branch rev 10512

Everything looks good, except that the ProgressPatternWorker change is not needed. AnnotatedAst.getSymbolicTokenType() will get you the same result when used with a ProgressAst instance.

#88 Updated by Constantin Asofiei about 10 years ago

I think I'm missing some index configuration, as CallGraphWorker.findNodeById doesn't look like it uses an index (it takes almost 70ms for a two-property query in a DB with ~120k nodes).

The branch rev 10513 fixes this problem.

Greg Shah wrote:

Everything looks good, except that the ProgressPatternWorker change is not needed. AnnotatedAst.getSymbolicTokenType() will get you the same result when used with a ProgressAst instance.

I need the ProgressPatternWorker change because, in the pipeline's init-rules, there is no AST available... and I can't use this to access AnnotatedAst APIs.

#89 Updated by Greg Shah about 10 years ago

Code Review branch rev 10513

Everything looks good.

One question: it looks to me like each call-graph run is fresh and we don't support incremental call-graph processing. Is that right?

#90 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

One question: it looks to me like each call-graph run is fresh and we don't support incremental call-graph processing. Is that right?

Yes. For incremental processing, the current generate_call_graph.xml is not viable. We need an update_call_graph.xml, which will walk only the nodes adjacent to the AMBIGUOUS node, and any new targets exposed after resolving an ambiguous call-site.

Besides the update_call_graph.xml tool, I don't think there is anything else to be done.

#91 Updated by Constantin Asofiei about 10 years ago

Branch rev 10514 adds the "-u" argument for CallGraphGenerator, which uses the passed filename list and all ambiguous nodes, to check if new targets can be resolved from the provided hints.

#92 Updated by Greg Shah about 10 years ago

Code Review branch rev 10514

I am good with the changes. If you still think it is finished, go ahead with regression testing.

#93 Updated by Greg Shah about 10 years ago

I saw your note about the performance metrics in #2260. The numbers seem pretty good for processing such a large project.

1. What is your general sense about the performance? Do you feel it is "good", "bad" or something else?

2. How would you assess the performance impact of moving the entire conversion process/TRPL to use the graph DB approach?

3. Now that you have implemented a complete solution using the graph DB, what are your current thoughts about the functionality, costs and benefits of using it to back TRPL?

#94 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

1. What is your general sense about the performance? Do you feel it is "good", "bad" or something else?

The performance is OK as long as proper indexes are added. Although I noticed that, if devsrv01 is busy, the AST loading from disk (the code-set load part) takes 2-3 times longer.

2. How would you assess the performance impact of moving the entire conversion process/TRPL to use the graph DB approach?

Removing the current disk dependency (the requirement to (de)serialize entire ASTs from/to disk) will bring a performance improvement. Also, currently we use a read-only AST (this) and a mutable AST (copy). My goal would be to change the AnnotatedAst infrastructure to not work with in-memory trees (so the entire AST structure for e.g. an external procedure is not brought into memory) and to use the graph DB directly. The this/copy approach can, I think, be simulated by obtaining two transactions to the graph DB, where the "copy" transaction is committed only when the AST for the external procedure is persisted. Also, many annotations (e.g. "refid") can be simulated with edges, for example between a variable reference and its declaration.

I can't give a specific percentage for the performance improvement, but I'm confident it will be visible.
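
As a rough illustration of the last two points (two transactions for this/copy, and "refid" as an edge), assuming a Blueprints 2.3+ TransactionalGraph; all names below are mine, not from the P2J code base:

import com.tinkerpop.blueprints.TransactionalGraph;
import com.tinkerpop.blueprints.Vertex;

public class RefidEdgeSketch
{
   /**
    * Replace a numeric "refid" annotation with an explicit edge from the
    * variable reference to its declaration; the mutating "copy" view becomes
    * visible to other readers only when the transaction is committed.
    */
   public static void linkReferenceToDeclaration(TransactionalGraph graph,
                                                 Object referenceId,
                                                 Object declarationId)
   {
      Vertex reference = graph.getVertex(referenceId);
      Vertex declaration = graph.getVertex(declarationId);

      graph.addEdge(null, reference, declaration, "refid");

      // analogous to the point where the external procedure's AST is persisted
      graph.commit();
   }
}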

3. Now that you have implemented a complete solution using the graph DB, what are your current thoughts about the functionality, costs and benefits of using it to back TRPL?

When moving TRPL to a graph DB, I think we can say the graph DB will contain more than one graph for the entire code set, as the callgraph edges and the AST tree-related edges are semantically different. For this reason, we should organize things very carefully. I think the graph DB can be hidden behind the existing APIs accessible from TRPL, so there should be minimal changes in the TRPL code. It is just a matter of being smart in building these APIs and organizing the tree structure in the graph DB efficiently: an example is that the graph DB has no inherent order for a node's adjacent nodes, so we will need to order the edges for the parent-child relationship explicitly (better said, we will need edge properties to order a node's children). Also, constraints can be added to the edge properties, so that there can be one-to-one, one-to-many or many-to-one relationships between the nodes.
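
For example, the parent-child order could be recovered from an integer property stored on each edge. A minimal sketch, with the "child" edge label and "index" property as assumed names:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;

public class OrderedChildrenSketch
{
   /** Return a node's children in their original AST order. */
   public static List<Vertex> childrenInOrder(Vertex parent)
   {
      List<Edge> edges = new ArrayList<Edge>();
      for (Edge e : parent.getEdges(Direction.OUT, "child"))
      {
         edges.add(e);
      }

      // the graph DB does not preserve any child order by itself, so the
      // order is stored as an integer "index" property on each edge
      Collections.sort(edges, new Comparator<Edge>()
      {
         public int compare(Edge a, Edge b)
         {
            Integer left  = a.getProperty("index");
            Integer right = b.getProperty("index");
            return left.compareTo(right);
         }
      });

      List<Vertex> children = new ArrayList<Vertex>();
      for (Edge e : edges)
      {
         children.add(e.getVertex(Direction.IN));
      }
      return children;
   }
}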

One thing I will miss is the ability to look at an external procedure's AST directly, but this can easily be solved by providing a tool to extract an AST (sub-)tree from the graph DB to a file.

About using the Blueprints APIs: TRPL has problems when the same API is defined in multiple interfaces or when the API has varargs arguments; thus some blueprints.VertexQuery APIs can't be used from TRPL. But I think it is better to hide the graph DB APIs behind the AnnotatedAst and other TRPL APIs, and not expose them to the TRPL code.

My conclusion is that, for a first pass, it is best to hide the graph DB usage from TRPL code and move all the graph access into the AnnotatedAst and PatternWorker infrastructure. Once this is complete, we can move on to improving the TRPL performance by adding explicit edges between a reference and its definition.

#95 Updated by Constantin Asofiei about 10 years ago

See branch rev 10517 (which was rebased) for the latest changes (some doc changes and a minor fix). This is placed into runtime testing. Conversion testing has passed for MAJIC; the server code is reconverting now.

BTW, to "release" a branch to the trunk, is there another way beside just copying the branch content (without the .bzr folder) over the trunk checkout, and then commit the trunk checkout?

#96 Updated by Greg Shah about 10 years ago

BTW, to "release" a branch to the trunk, is there another way beside just copying the branch content (without the .bzr folder) over the trunk checkout, and then commit the trunk checkout?

Are you asking about our temporary approach (while our real trunk is not yet on devsrv01)?

#97 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

BTW, to "release" a branch to the trunk, is there another way beside just copying the branch content (without the .bzr folder) over the trunk checkout, and then commit the trunk checkout?

Are you asking about our temporary approach (while our real trunk is not yet on devsrv01)?

I'm asking what the approach will be for releasing the branch changes into the main trunk on devsrv01 (I know at this time the devsrv01 trunk is not yet official). I think we should automate this, and:
1. automatically determine the files which are not in the branch checkout (i.e. were explicitly deleted) but are still in the trunk checkout, and delete them
2. copy the changed and new files to the trunk checkout
3. commit the trunk checkout, using a message passed as an argument to the script

#98 Updated by Constantin Asofiei about 10 years ago

The update passed runtime and conversion testing (both server and MAJIC code).

Regarding note 97: I think bzr merge should do the trick, to merge the branch into the trunk, but I'll confirm when I release this update.

#99 Updated by Greg Shah about 10 years ago

Regarding note 97: I think bzr merge should do the trick, to merge the branch into the trunk

Yes, that was my plan.

I think you need to rebase first. We can pull the latest code to the devsrv01 trunk master and rebase from that.

Then you can bzr merge on your local system from your branch checkout to the trunk checkout (that is still bound to filesrv01).

#100 Updated by Constantin Asofiei about 10 years ago

Branch revision 10518 was committed to the trunk as revs 10496/10497, both on the trunk master on devsrv01 and on filesrv01.

#101 Updated by Constantin Asofiei about 10 years ago

  • % Done changed from 0 to 100
  • Status changed from WIP to Review

#102 Updated by Greg Shah about 10 years ago

  • Status changed from Review to Closed

#103 Updated by Constantin Asofiei about 10 years ago

Branch rev 10504 contains the fix for generating the callgraph reports on Windows (the filename should have been wrapped in a fixupFileName call when passed to printf).

Released to bzr rev 10504.

#104 Updated by Constantin Asofiei about 10 years ago

This fixes the incompatibility when running the callgraph on a machine whose filename case-sensitivity differs from that of the machine where the legacy code originates.

#105 Updated by Greg Shah about 10 years ago

You can check in and distribute the update.

#106 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

You can check in and distribute the update.

Committed to bzr rev 10507.

#107 Updated by Constantin Asofiei about 10 years ago

I had convinced myself that the feature which allows empty string array hints to disambiguate a call-site was properly implemented and tested... it turns out it wasn't. Attached is the fix.

#108 Updated by Constantin Asofiei about 10 years ago

This passed conversion testing and fixes the problem from note 107.

#109 Updated by Greg Shah about 10 years ago

Code Review 0425a

I'm fine with the changes. You can check them in and distribute.

#110 Updated by Constantin Asofiei about 10 years ago

Greg Shah wrote:

Code Review 0425a

I'm fine with the changes. You can check them in and distribute.

Committed to bzr rev 10514.

#111 Updated by Constantin Asofiei almost 10 years ago

Fix for a problem when generating the list of dead include files from the callgraph DB.
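
Conceptually, the dead-include list is the set of include-file nodes that no other node points to. A sketch under assumed names ("type"/"INCLUDE" and "file-name" are not the real schema):

import java.util.ArrayList;
import java.util.List;

import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;

public class DeadIncludeSketch
{
   /** Collect include-file nodes that nothing in the call graph points to. */
   public static List<Object> deadIncludes(Graph graph)
   {
      List<Object> dead = new ArrayList<Object>();
      for (Vertex v : graph.getVertices("type", "INCLUDE"))
      {
         // no incoming edge means no program or other include pulls this file in
         if (!v.getEdges(Direction.IN).iterator().hasNext())
         {
            dead.add(v.getProperty("file-name"));
         }
      }
      return dead;
   }
}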

Committed to bzr rev 10530.

#112 Updated by Greg Shah almost 10 years ago

Please see #2260 for 2 open questions/issues.

#113 Updated by Constantin Asofiei over 9 years ago

This is an update which fixes some problems related to schema triggers: the main problem here is that the ASTs for the schemas are located in data/namespace (relative to the current folder) - it is not correct to search the basepath for the schema ASTs.

Also, it adds a tool to list the edges of a specific node.
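
This is not the actual tool, just a sketch of what listing a node's edges looks like with the Blueprints API:

import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;

public class ListEdgesSketch
{
   /** Print every edge touching the node with the given ID. */
   public static void listEdges(Graph graph, Object nodeId)
   {
      Vertex node = graph.getVertex(nodeId);
      for (Edge e : node.getEdges(Direction.BOTH))
      {
         System.out.println(e.getVertex(Direction.OUT).getId()
                            + " -[" + e.getLabel() + "]-> "
                            + e.getVertex(Direction.IN).getId());
      }
   }
}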

#114 Updated by Greg Shah over 9 years ago

Code Review 0715d

The changes look good. I especially like that edge listing tool. No regression testing is needed. You can check in and distribute when ready.

What would be your estimate of the time needed to fix the two known issues in #2260? Please create a new task for each one, including your time estimate. Make me a watcher. The tasks should be in this same project.

#115 Updated by Constantin Asofiei over 9 years ago

0715d.zip was released as rev 10593.

Tasks #2356 and #2357 were added for the issues in #2260.

#116 Updated by Greg Shah about 7 years ago
