February 15, 2005

Schema from program or program from schema?

One of the goals of my C++ metaprogramming system is that it should be possible to generate a schema for game data directly from an annotated C++ program. This schema is important because it decouples the game engine from the tools that are used to process the game's data.

There seem to be three options here. The first option is the one I am currently investigating and also the motivation for my C# to C++ translator: generate the schema from the code. The second is to generate the code from the schema. The third is to write both the schema and the code by hand.

The last option is undesirable. It violates the Don't Repeat Yourself principle. The schema and the program express the same thing and one of the two is redundant.

The second option, generate the code from the schema, has some appeal, primarily simplicity of implementation. It has one problem I can't see a solution to though. A schema is data-centric whereas an OO program hides data within classes that expose a more abstract public interface. The schema language could of course be extended to include OO constructs for information hiding but then it would become coupled to the structure of the program, which is undesirable.

So let's say I generate C++ classes from a data-centric schema. The generated C++ classes will inevitably be data-centric too. They might just be structs with public data members. Or one might generate accessor function pairs for the data members. But that is just C++ gorp to make a C-style struct look like an OO class.

Then, being a good OO programmer, I would have to write more classes that exposed a better OO abstraction. These classes might be initialized directly from the automatically generated classes. Here is an example:

// Hand written schema
[schema]
  [dynamite]
    [field name="fuseTime" type="int" default="20"/]
  [/dynamite]
[/schema]

// Automatically generated from the schema
struct SDynamite
{
    int fuseTime;
};

// Hand written OO abstraction for dynamite
class CDynamite
{
public:

    CDynamite(const SDynamite &d): fuseTime(d.fuseTime) {}

    // Abstract interface to dynamite
    void StartCountDown();
    void Update(float dt);

private:
    int fuseTime;
};

Of course it would be possible to embed the SDynamite as a private data member of CDynamite. But then CDynamite is tightly coupled to the schema format. It does not allow the internals of CDynamite to change without changing the schema. Notice the redundancy in the constructor. It is essentially converting between the schema format and the internal representation of CDynamite.
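
A minimal sketch of the embedding alternative may make the coupling easier to see. The class name CEmbeddedDynamite, the members remaining and counting, and the HasExploded query are all hypothetical, and the sketch assumes fuseTime is measured in seconds, which the schema does not state.

```cpp
#include <cassert>

// Stand-in for the struct generated from the schema.
struct SDynamite
{
    int fuseTime;
};

// Hypothetical variant of CDynamite that embeds the generated struct whole.
class CEmbeddedDynamite
{
public:
    explicit CEmbeddedDynamite(const SDynamite &d)
        : data(d), remaining(0.0f), counting(false)
    {
        assert(d.fuseTime >= 0);
    }

    // Assumption: fuseTime is a duration in seconds.
    void StartCountDown()
    {
        remaining = static_cast<float>(data.fuseTime);
        counting = true;
    }

    void Update(float dt)
    {
        if (counting && remaining > 0.0f)
            remaining -= dt;
    }

    // Hypothetical query, added only so the sketch has observable state.
    bool HasExploded() const { return counting && remaining <= 0.0f; }

private:
    SDynamite data;   // every field of the schema lives here, wanted or not
    float remaining;  // runtime state that the schema knows nothing about
    bool counting;
};
```

The constructor no longer copies fields one by one, but any field added to, removed from, or retyped in the schema now changes the layout of this class directly.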

So generating the code from the schema does not appear to eliminate the redundancy. It just moves it around. The fundamental problem is the schema does not contain all the information necessary to generate an OO abstraction to itself. Whereas, through metadata annotations or naming conventions, OO code has sufficient information to generate a data-centric schema without redundancy.

// Hand written annotated dynamite class
CDL class Dynamite
{
public:

    PROPERTY
    ATTR(Serialized)
    ATTR(Default=20)
    int GetFuseTime() const;
    void SetFuseTime(int);

    // Abstract interface to dynamite
    void StartCountDown();
    void Update(float dt);

    // ...
};

// Automatically generated schema
[schema]
  [dynamite]
    [field name="FuseTime" type="int" default="20"/]
  [/dynamite]
[/schema]

Both examples express how to convert between the schema format and an object. In the second example, the schema format is inferred from the conversion. In the first example, both the schema format and the conversion must be separately expressed. There must be redundancy in the first example.
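
The generation step itself could be quite small once the metadata has been extracted. The following is only a sketch under that assumption: PropertyInfo, ClassInfo and GenerateSchema are hypothetical names, and the records are assumed to be what a metacompiler would collect from the PROPERTY and ATTR annotations. Parsing the annotated C++ is the hard part and is not shown.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical metadata records, assumed to come from the annotations.
struct PropertyInfo
{
    std::string name;          // e.g. "FuseTime"
    std::string type;          // e.g. "int"
    std::string defaultValue;  // e.g. "20"
};

struct ClassInfo
{
    std::string schemaName;    // schema element name, e.g. "dynamite"
    std::vector<PropertyInfo> properties;
};

// Emits the bracketed schema notation used in this post, one line per
// element and without indentation.
std::string GenerateSchema(const std::vector<ClassInfo> &classes)
{
    std::string out = "[schema]\n";
    for (const ClassInfo &c : classes)
    {
        assert(!c.schemaName.empty());
        out += "[" + c.schemaName + "]\n";
        for (const PropertyInfo &p : c.properties)
        {
            out += "[field name=\"" + p.name + "\" type=\"" + p.type +
                   "\" default=\"" + p.defaultValue + "\"/]\n";
        }
        out += "[/" + c.schemaName + "]\n";
    }
    out += "[/schema]\n";
    return out;
}
```

Feeding it one ClassInfo named "dynamite" with a single FuseTime property of type int and default 20 yields the same elements as the generated schema shown earlier.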

In the case where the OO class has the same or similar structure to the schema (probably a high proportion of cases), this redundancy can also be eliminated.

CDL class Dynamite
{
public:

    // Abstract interface to dynamite
    void StartCountDown();
    void Update(float dt);

private:

    PROPERTY
    ATTR(Serialized)
    ATTR(Default=20)
    int fuseTime;
};

This is similar to embedding an SDynamite data member in CDynamite but with one key difference. Embedding SDynamite is all or nothing. It is not possible to say I only want the fuse time from SDynamite but these other schema fields will be represented differently.
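
To illustrate the mixed case, here is a sketch in plain C++ with the CDL annotations written as comments, since CDL does not exist yet. The blast radius field, its accessors and IsInBlast are hypothetical and not part of the original example. One field maps straight onto a member variable while the other is exposed to the schema through an accessor pair and stored in a different form.

```cpp
#include <cassert>
#include <cmath>

class Dynamite
{
public:
    Dynamite() : fuseTime(20), blastRadiusSq(0.0f) {}

    // Would carry PROPERTY / ATTR(Serialized): the schema sees a
    // "BlastRadius" field, but the class stores the squared radius.
    float GetBlastRadius() const { return std::sqrt(blastRadiusSq); }
    void SetBlastRadius(float r)
    {
        assert(r >= 0.0f);
        blastRadiusSq = r * r;
    }

    // Hypothetical query that benefits from the squared representation:
    // no square root is needed for the distance test.
    bool IsInBlast(float dx, float dy) const
    {
        return dx * dx + dy * dy <= blastRadiusSq;
    }

private:
    // Would carry PROPERTY / ATTR(Serialized) / ATTR(Default=20):
    // this field has the same shape in the schema and in the class.
    int fuseTime;

    // Internal representation only; never appears in the schema.
    float blastRadiusSq;
};
```

With an embedded SDynamite, both fields would have to be stored exactly as the schema declares them.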

I see a potential problem with generating a schema from code as well. Which language? If I am using multiple languages, say a scripting language and C# or C++, which do I generate the schema from? If I want to be able to implement game components in both languages, I probably want both. So I need to allow my preprocessor to accept metadata from multiple sources. For that reason, I will start calling it a metacompiler.

Comments:
Quick question: are you planning to release this schema generation tool (and/or other things, like C#->C++ translator)? :)
 
Given my day job, there is no way I will ever be able to release a fully working version of CDL or my C# to C++ translator. I just don't have time.

CDL does not even exist yet. It is still in the planning stages. I have a prototype C# to C++ translator that might be useful as a starting point for someone interested in doing it properly. I am willing to make the source available.
 
Couldn't you derive CDynamite from SDynamite? That way you don't need the redundant constructor and field declarations.

Also, I'd add a constructor to SDynamite initializing fuseTime to 20.
 
CDynamite could be derived from SDynamite. That would be quite similar to adding SDynamite as a data member to CDynamite as I suggested in my original post. As I see it, the problem is it makes the CDynamite class tightly coupled to SDynamite and thus the schema.

The whole point of CDynamite is to provide a more abstract and less data-centric encapsulation of dynamite. A class is more than just its member variables after all. The idea is that code sees dynamite in terms of its public interface (StartCountDown and Update) whereas tools see the automatically generated data-centric schema (fuseTime).

I agree that adding a constructor to SDynamite to initialize fuseTime to 20 is a good idea. That is easy to generate from the schema.
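
As a sketch of what the generated struct might then look like, assuming the generator simply copies each field's default attribute into a constructor initializer:

```cpp
// Possible generator output for the hand written schema in the post.
// Assumption: default="20" on the field becomes the initializer value.
struct SDynamite
{
    SDynamite() : fuseTime(20) {}

    int fuseTime;
};
```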
 
I have to admit that right now I favor the second approach myself.

It has the advantage of decoupling the class in question from what it exposes to the outside world. I've never been comfortable with the idea of editing member variables on an object directly. After all, I might want to expose a very different interface than the one I really have (for example, I might have a quaternion internally, but expose a set of Euler angles, or I might not even have a value at all, but that value gets applied to many different member variables and causes other functions to trigger).

I think that having a class take a structure in its constructor with all the data necessary to initialize itself is a good idea anyway. You probably want to do that to be able to load it from any format you want (although this is something I'm becoming less and less a fan of; the more I think about having a good data-cooking system, the less it's needed or even desirable).

Besides, I can't help feeling that adding all the extra information to the source code is overloading things a bit too much. I'm a fan of having things do one thing and one thing only. Keeping the schema and the code separate "feels" right.

It's definitely worth thinking how the data cooking and the serialization from the final data format play into all of this.
 
I agree that a file format should not be tied to the member variables of the serialized classes. That is one of the things that I have observed to cause game code to "harden" more quickly. Neither generating schema from program nor program from schema requires that the file format be coupled to the member variables though. I started out trying to put some example code in this comment but it is beyond the formatting facilities of this blog management system. I'll post it on the main page in a moment.

I share your concerns regarding adding annotations to the source code. It does clutter the code, although it works well for Java and C# because they have a more concise syntax for annotations. I'm going to try to make the CDL syntax a little nicer. Is it in violation of the single responsibility principle? My feeling is no.

Even if an object has a single responsibility (such as representing an orientation) it must still address multiple orthogonal concerns in order to fully succeed in that responsibility. Here are some examples of orthogonal concerns for the orientation class. What is its public API? How is the data represented internally? How is memory allocated for orientation objects? How is their lifetime managed (i.e. when and how are they destroyed)? How does the code report failure (exceptions, error code, error callback)?

Perhaps it is because of my experience with languages where objects are serializable by default, but I think "how are orientation objects serialized?" is not an additional responsibility but rather one of the essential orthogonal concerns that must be addressed in taking on the single responsibility.
 