The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

How to parse?

I'd like to parse some JavaScript source code, to alter it to add a new statement to the beginning of every function. For example, given an input line like ...

this.getWidgetsByFilter = function(unaryFunc, onlyOne){

... I'd like to insert a new statement so that the output line looks like this ...

this.getWidgetsByFilter = function(unaryFunc, onlyOne){ inspect.trace("this.getWidgetsByFilter = function(unaryFunc, onlyOne)");

... the purpose of this being so that I can profile (e.g. stack-trace) functions, even anonymous functions as in the example above.

My problem is how to find "function" statements: I need to exclude "function" when it's within a quoted string, or within either kind of comment.

I can hack a parser together using C#, but I'd like to ask how I 'should' go about parsing something like this. I know nothing about parser theory: tokenizing, lexing, yaccing, or anything like that. Perhaps you could tell me which tools to use for the job, or outline the algorithm that I should implement in C#?
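To illustrate what I mean, here is a naive first cut, sketched in JavaScript itself rather than C# (and I know it will break on strings and comments containing the word "function" -- that's exactly my question):

```javascript
// Naive instrumentation pass: insert an inspect.trace(...) call after every
// "function(...){" header. This deliberately ignores strings, comments and
// regexp literals, so it will mangle code that mentions "function" inside
// them. "inspect.trace" is my own tracing function.
function instrument(source) {
  // Capture everything from the start of the line through the parameter
  // list, then replay it as the argument of the trace call.
  return source.replace(
    /^(.*?function\s*\([^)]*\))\s*\{/gm,
    (match, header) => header + '{ inspect.trace(' + JSON.stringify(header.trim()) + ');'
  );
}
```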
Christopher Wells
Sunday, September 03, 2006
 
 
for something like that you can probably get away with a regular expression match.

some perl hackery should be fine.
SomeDude
Sunday, September 03, 2006
 
 
Unless, of course, you read all of the posts about how useless Perl is, in which case you could use an ad hoc mixture of C, shell scripting, AWK, sed, grep, regular expressions, Lisp, etc.
*myName
Sunday, September 03, 2006
 
 
> you can probably get away with a regular expression match

When I read that, my first thought was http://www.schipul.com/en/q/?1680 :-)

And, in fact, ironically, because of JavaScript's support for regular expression literals, apparently I do need to parse it: http://www.mozilla.org/js/language/js20/rationale/syntax.html#regular-expressions

For example, the following is a valid statement:

  url += cssStr.match(/^[\t\s\w()\/.\\'"-:#=&?]*\)/)[0]; // url string

My first attempt to parse that (without support for 'regular expression literals') detected that as an unterminated string, because of the unmatched '"'.

I'll write a simple C# program to parse it; but I was wondering whether my using C#, without ever having studied parsing as a computer science topic, is an example of "if all you have is a hammer, ...".
Christopher Wells
Sunday, September 03, 2006
 
 
There's an open-source javascript interpreter called (IIRC) Rhino, which has a parser you could use. The javascript source compressor/obfuscator we use is based on this, rather than on Regexps, so it's far less susceptible to subtle syntax-munging bugs that affect the meaning of the code.
Matt
Monday, September 04, 2006
 
 
Well, I did it: it took me nearly 700 lines of C# code. My algorithm is that I lex or tokenize the input stream into the following categories of token:

* C++-style comment
* C-style comment
* String literal
* Numeric literal
* Identifier or reserved word
* Punctuation
* Whitespace
* Regexp literal

I need to tokenize, so that I can use the type of the previous token to decide whether a '/' is a division punctuator or the beginning of a regexp literal.

Anyway that was a bit of tedious coding! And it would be worse if I wanted to go further (e.g. to combine the tokens into expressions and statements and so on).
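The previous-token trick, heavily simplified (a sketch in JavaScript, not my actual 700-line C# lexer), looks something like this:

```javascript
// Heuristic: whether a "/" starts a regexp literal depends on the token that
// precedes it. After an identifier, a literal, ")" or "]" it is a division
// operator; otherwise (start of input, after "=", "(", ",", "return", ...)
// it begins a regexp literal. A token here is {type, value} or null.
function slashStartsRegexp(prevToken) {
  if (prevToken === null) return true;            // start of input
  if (prevToken.type === 'punctuation') {
    // After a closing bracket a "/" is division, e.g. (a + b) / 2
    return prevToken.value !== ')' && prevToken.value !== ']';
  }
  if (prevToken.type === 'word') {
    // After most identifiers "/" divides, but after keywords like
    // "return" or "typeof" a regexp literal may follow.
    return ['return', 'typeof', 'case', 'in', 'new', 'delete'].includes(prevToken.value);
  }
  // After a numeric or string literal, "/" is division.
  return false;
}
```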

So, my question is, what kind of tools are used to process a grammar definition of the following form:

IdentifierName ::
  IdentifierStart
  IdentifierName IdentifierPart

IdentifierStart ::
  UnicodeLetter
  $
  _
  \ UnicodeEscapeSequence

IdentifierPart ::
  IdentifierStart
  UnicodeCombiningMark
  UnicodeDigit
  UnicodeConnectorPunctuation
  \ UnicodeEscapeSequence

... etc ...
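By hand, a production like IdentifierName above maps to a small loop-based matcher -- this sketch is ASCII-only and ignores the Unicode and escape-sequence alternatives, but it shows the shape of what I'd otherwise want a tool to generate:

```javascript
// Hand-written equivalent of the IdentifierName grammar above,
// restricted to ASCII and omitting "\ UnicodeEscapeSequence".
function isIdentifierStart(ch) {
  return /[A-Za-z$_]/.test(ch);
}
function isIdentifierPart(ch) {
  return isIdentifierStart(ch) || /[0-9]/.test(ch);
}
// Scan an IdentifierName at position pos; return the index just past its
// end, or pos unchanged if no identifier starts there.
function matchIdentifierName(input, pos) {
  if (pos >= input.length || !isIdentifierStart(input[pos])) return pos;
  let end = pos + 1;
  while (end < input.length && isIdentifierPart(input[end])) end++;
  return end;
}
```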
Christopher Wells
Monday, September 04, 2006
 
 
I've used JavaCC before when building a tokenizer/parser. It worked out really well.

But I'm looking at ANTLR for an upcoming project, since it'll be in C# and JavaCC (obviously) won't cut it. So far, the ANTLR documentation looks good, and it can output C# code. So I'd tentatively recommend it.
BenjiSmith
Tuesday, September 05, 2006
 
 
I suppose you want to write a compiled parser + lexer that can handle these functions at runtime? If you only need to do this offline, you could even write macros within Emacs.

Compilers are one area where the average programmer gets nicely separated from the computer scientist. There's a good reason why these guys are in business :-)
230 Volts
Tuesday, September 05, 2006
 
 
Wow. Thanks, Benji.
Christopher Wells
Tuesday, September 05, 2006
 
 
The best way to do this, as someone mentioned above, is to hack an existing interpreter.

Sounds like you're on the right track with your lexical analyzer.

Are you handling things like
var = 'this string\'s syntax is confusing and it contains some "weird \" constructs" including an open "';
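A scanner that survives that one needs to treat the backslash as consuming the next character. Something like this minimal sketch (JavaScript, single-line strings only):

```javascript
// Scan a string literal starting at input[pos] (which must be ' or ").
// Returns the index just past the closing quote, or throws if the literal
// is unterminated before end of line. Escaped quotes (\' and \") do not
// terminate the literal, because the backslash consumes them.
function scanStringLiteral(input, pos) {
  const quote = input[pos];
  let i = pos + 1;
  while (i < input.length && input[i] !== '\n') {
    if (input[i] === '\\') { i += 2; continue; }  // skip escaped character
    if (input[i] === quote) return i + 1;         // closing quote found
    i++;
  }
  throw new Error('Unterminated string literal at ' + pos);
}
```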

And are you handling evals in the code, or are you open to ignoring them?

It's not a trivial task.
Dror Matalon
Wednesday, September 06, 2006
 
 
> Sounds like you're on the right track with your lexical analyzer.

Yes, it's working: it emits tokens (I think, if I've got the terminology right).

I see from one of the code examples in http://www.cs.usfca.edu/~parrt/course/652/lectures/antlr.html that ANTLR appears to be switching based on a look-ahead of the next token: is 'look-ahead' usual? I haven't tried to implement any stages beyond the lexer, but my first thought would have been to model it as look-behind: i.e. given a token, what you do with it depends on the previous tokens, not on the next ones.
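For my own understanding, here's the peek-don't-consume pattern I think ANTLR's generated code is using, sketched over a toy grammar (not real JavaScript):

```javascript
// Minimal LL(1)-style parser skeleton over a pre-tokenized stream.
// "peek" inspects the next token without consuming it -- that one-token
// look-ahead is what drives the decision of which rule to apply.
// Toy grammar: statement := "var" IDENT | IDENT "(" ")"
function parseStatement(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];        // look-ahead: inspect, don't consume
  const next = () => tokens[pos++];      // consume and advance
  if (peek() === 'var') {                // decision made purely by look-ahead
    next();                              // consume "var"
    return { kind: 'declaration', name: next() };
  }
  const name = next();
  if (next() !== '(' || next() !== ')') throw new Error('expected "()"');
  return { kind: 'call', name };
}
```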

> Are you handling things like ...

Yes: both types of string, and the escape char within string and regexp literals; a throw if a literal hasn't been terminated before end-of-line; and I colorize each character and write it to the console, so I can inspect the colorized source code to verify that each one has been classified properly.

It's running OK against the 15 KLOC script that I want to instrument; so that's OK.

I sympathise with the author of ANTLR though, who says "I ... wanted something that reproduced what I built by hand anyway": this seems like the kind of software that should be built by a machine.

> And are you handling evals in the code, or are you open to ignoring them?

No, I'd forgotten about eval. Yes, I think I'd best ignore them (not insert an extra instrumentation statement into any string): in general I can't tell whether any given string is going to be evaled (and the string to be evaled might be built from several fragments in the source code); the proper place to instrument eval would have to be in the run-time script interpreter, I think.
Christopher Wells
Wednesday, September 06, 2006
 
 
Christopher, the term for what you want to do is "aspect-oriented development."

"Aspect-oriented" means that you can intercept functions and run additional code before, after, or around those functions.  You have the original source code; a set of rules about when to inject additional code; and the combined code that merges the different aspects (the original functions and the new functionality).

Google for "aspect-oriented javascript" and you'll find plenty of resources.
Flow
Wednesday, September 06, 2006
 
 

This topic is archived. No further replies will be accepted.
