Testing¶
Even with models that are formulated and coded by hand, testing and validating the results is an important step. In practice, there is often an iterative cycle of solving the model, validating the results, and revising the formulation, code, or data when something looks off.
Invalid results could have many causes, such as:
inaccuracies in the initial problem description,
mistakes or over-simplifications in formulating the mathematical model for the problem,
mistakes or oversights in implementing the model,
inaccuracies in the data.
Corrections/revisions might need to be made at any of the model development stages in response to any of the above root causes. For models and code generated by an LLM, errors could occur in any of these places as well.
To fully validate a model, you would need to review the formulation, code, and solution for accuracy, and make updates and revisions as needed. For users just getting started with mathematical optimization, it can be challenging to accurately review and evaluate the correctness of the model and code generated by an LLM.
One good way to get started with this process is to conduct a sanity check of the solution provided.
Ensure that the objective value is plausible¶
Often the first and most straightforward way of sanity-checking a model is to look at the objective value and reason about whether it makes sense.
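As a minimal sketch of what such a check could look like in code, the toy gurobipy model below stands in for your real, LLM-generated one, and the back-of-envelope figure is simply a hand calculation from the toy data:

```python
import gurobipy as gp
from gurobipy import GRB

# Toy stand-in for an LLM-generated model: ship at least 100 units at 3.5 per unit.
# With your real model, compare model.ObjVal against your own back-of-envelope figure.
model = gp.Model("objective_sanity_check")
x = model.addVar(lb=0, name="units_shipped")
model.addConstr(x >= 100, name="demand")
model.setObjective(3.5 * x, GRB.MINIMIZE)
model.optimize()

if model.Status == GRB.OPTIMAL:
    rough_estimate = 100 * 3.5  # hand calculation: minimum demand times unit cost
    print(f"Objective value: {model.ObjVal:.2f}, back-of-envelope: {rough_estimate:.2f}")
    if not 0.5 * rough_estimate <= model.ObjVal <= 2.0 * rough_estimate:
        print("Objective differs from the estimate by more than 2x -- investigate.")
else:
    print(f"Model did not solve to optimality (status {model.Status}).")
```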
Visualize the solution¶
Digging deeper, visualizing the solution is an intuitive way of inspecting the model results, and no expert knowledge is required for it! Since we are already working with LLMs, we might as well go all the way and ask the LLM to generate a visualization of the solution. Practically all images in the Example prompts section were created this way.
For example, after having the LLM generate a solution you could say:
Put the results in a stacked bar graph, with the time on the X-axis and the load on the Y-axis. The base load should be at the bottom in one color while all the deployments should be stacked on top of that with a different color for each deployment.
The LLM will likely use one of the popular graph packages like matplotlib or plotly. If you are not happy with the type of graph used by the LLM, you can find inspiration by looking at the extensive collection of different graph types in the packages’ documentation.
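For reference, a minimal matplotlib sketch of such a stacked bar chart might look like the following; the hours, base load, and deployment names are invented purely for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: replace with the values from your own solution.
hours = np.arange(24)                      # time periods on the X-axis
base_load = np.full(24, 40.0)              # constant base load
deployments = {
    "deployment A": np.random.default_rng(0).uniform(0, 10, 24),
    "deployment B": np.random.default_rng(1).uniform(0, 15, 24),
}

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(hours, base_load, label="base load", color="lightgray")

# Stack each deployment on top of everything plotted so far.
bottom = base_load.copy()
for name, load in deployments.items():
    ax.bar(hours, load, bottom=bottom, label=name)
    bottom += load

ax.set_xlabel("Hour")
ax.set_ylabel("Load")
ax.legend()
plt.show()
```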
Another type of graph that you might be interested in is a network graph using the networkx package, which was used in the Incident Response Planning example. For example, you could use a prompt as follows:
Put the results in a DAG so that the dependencies become clear using the `networkx` package. The X-axis should represent the time the service was started, and each node should be colored according to the initial priority value.
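Again for reference, a minimal networkx sketch along these lines could look like the code below; the services, dependencies, start times, and priorities are all made up, and the layout simply places each node at its start time on the X-axis:

```python
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical data: service -> (start time, priority), plus dependency edges.
services = {"db": (0, 3), "cache": (1, 2), "api": (2, 5), "web": (3, 4)}
dependencies = [("db", "api"), ("cache", "api"), ("api", "web")]

G = nx.DiGraph()
G.add_nodes_from(services)
G.add_edges_from(dependencies)

# Place each node at (start time, arbitrary Y) and color it by priority.
pos = {name: (start, i) for i, (name, (start, _)) in enumerate(services.items())}
colors = [services[name][1] for name in G.nodes]

nx.draw_networkx(G, pos, node_color=colors, cmap=plt.cm.viridis,
                 with_labels=True, node_size=1200, font_color="white")
plt.xlabel("Start time")
plt.show()
```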
Verify that the solution satisfies your constraints¶
Try to verify that the solution is feasible, satisfying all of the constraints in the model. For instance, after the solution is calculated, you could ask:
Help me write some code that checks that the solution is feasible.
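The resulting code typically re-evaluates every constraint against the original data, independently of the solver. A minimal hand-written sketch for a hypothetical capacity-and-demand model could look like this, with placeholder data and solution values standing in for whatever your own model returns (for gurobipy, the values you would plug in come from each variable's `X` attribute):

```python
# Hypothetical data and solution: replace with your own inputs and the
# variable values reported by the solver.
capacity = {"machine_1": 80, "machine_2": 60}
demand = 120
solution = {"machine_1": 75, "machine_2": 45}   # units assigned per machine

TOL = 1e-6  # small tolerance for floating-point round-off

violations = []
# Capacity constraints: assigned units must not exceed each machine's capacity.
for machine, assigned in solution.items():
    if assigned > capacity[machine] + TOL:
        violations.append(f"{machine}: {assigned} exceeds capacity {capacity[machine]}")

# Demand constraint: total assigned units must cover the demand.
if sum(solution.values()) < demand - TOL:
    violations.append(f"total {sum(solution.values())} does not cover demand {demand}")

print("Feasible" if not violations else "Violations found:")
for v in violations:
    print(" -", v)
```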
Another method is to use one of Gurobi’s open-source tools, for instance the gurobi-modelanalyzer project. The most accessible approach from this project would be to use the solution checker.
Other specific edge cases that indicate validity¶
Some specific edge cases you may want to test when evaluating model validity could be:
Test a solution of all 0 values for the decision variables. Should this be feasible or infeasible? Does the objective make sense? (See the sketch after this list.)
Test a solution where all decision variables are set to their min or max bound. Does the objective function trend in the expected direction? If the model is infeasible, does that make sense?
Test a known feasible point. Do the objective and other constraint values match what you currently observe for your application?
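As a sketch of the first of these tests, assuming the LLM-generated code gives you access to a gurobipy model, you can fix every decision variable to 0 by setting its bounds and re-optimizing. The tiny model below only stands in for your real one; the same pattern works for fixing variables to their minimum or maximum bounds:

```python
import gurobipy as gp
from gurobipy import GRB

# Toy stand-in for the LLM-generated model; swap in your own model object.
model = gp.Model("edge_case")
x = model.addVars(3, lb=0, ub=10, name="x")
model.addConstr(x.sum() >= 5, name="min_total")  # makes the all-zero point infeasible
model.setObjective(x.sum(), GRB.MINIMIZE)
model.update()

# Fix every decision variable to 0 and re-solve.
for v in model.getVars():
    v.LB = 0
    v.UB = 0
model.optimize()

if model.Status == GRB.INFEASIBLE:
    print("All-zero solution is infeasible -- is that what you expect?")
elif model.Status == GRB.OPTIMAL:
    print(f"All-zero solution is feasible, objective {model.ObjVal:.2f} -- does that make sense?")
```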
I found the model has issues, now what?¶
If something seems wrong at this point, it is very likely that the LLM misunderstood the problem. If you are able to find the source of the mistake, you can point it out to the LLM and instruct it on how to make changes. However, in our extensive testing we found that it is far more likely that the LLM has fundamentally misinterpreted the prompt, and that the issue cannot be easily fixed by making small changes to the existing model.
Instead, we suggest revisiting the prompt, checking whether it runs afoul of any of the guidance in the Tips and pitfalls chapter, and trying out different versions of the prompt.