The Design of Software (CLOSED)

Deleting text between two markers

Hi,

This is probably a pretty simple one, but the Google searches I've tried haven't turned up an answer I can use.

I have a couple of hundred files to process for my employer. Each file contains about 200 KB of text littered with markers: @@@. I would like to delete all the text between each pair of markers, leaving only the text that isn't inside them. Some marked text runs over multiple lines, so I can't use "Process Lines Containing" style matching.

For example, I want to remove everything except [2]:

@@@ [1] This is some marked text, it would span many lines which I'm not going to bore you with. @@@

[2] This is some non-marked text which could also go on for many lines which I'm not going to bore you with.

@@@ [3] This is some more marked text which could go on for many lines which I'm not going to bore you with. @@@


Sorry this is such a barmy request, but if anybody can help I will be most grateful.
Gavin Laking
Thursday, August 03, 2006
 
 
Assuming your marked text forms separate paragraphs, this would be as follows in Perl:

#!/usr/bin/perl -w-

$/ = "\n\n";
my $re = '^@@@[^@]+@@@\s*$'; # avoid niggling warnings about @$ interpolation

while(<>) {
    print unless (/$re/o);
}
George Jansen
Thursday, August 03, 2006
 
 
My comment wrapped -- everything from "#" to "interpolation" is a comment following the single-quoted regexp.
George Jansen
Thursday, August 03, 2006
 
 
Thank you very much George! I encountered a little problem though. I wonder if you (or another magician) can do your thing again?!

What if a block spans multiple (varying number of) paragraphs? For example:

@@@ This is a marked block of text which is very boring to type and must be very boring to read.

Well let's just waste a few more lines of text, noting the paragraph break I used a few seconds ago... toodle de doo, hmm it's sunny outside today... @@@

Hope you can help! But thanks so far if I've asked too much!
Gavin Laking
Thursday, August 03, 2006
 
 
sed '/^@@@/,/@@@$/d' < in > out
captain damage
Thursday, August 03, 2006
 
 
A few more lines of Perl, but not many:

#!/usr/bin/perl -w-

=head2 Stripping text between markers

Strip all text embedded between '@@@' markers. Allow for the possibility
that they don't fall at paragraph breaks.

The simplest way to deal with this is to use the '@@@' as a regular
expression for "split": this turns the string into a series of items,
alternating the wanted and unwanted. The only problem is, what if the
file ends in something like
@@@do not include .. blah blah EOF?
This we could deal with, but we'll pretend that the file terminator
implicitly supplies any missing @@@.

Note that a file BEGINNING with @@@ gives us a leading empty string
(not undef), which is harmless once the pieces are joined again. The
shift below only guards against an undefined first element.

=cut

my $ats = '@@@';

$/ = undef; my $string = <>;  # grab everything;


my @all_items = split(/$ats/, $string);
my @unmarked = map { $all_items[$_] } grep { $_ % 2 == 0 }
(0 .. $#all_items);

shift (@unmarked) unless (defined($unmarked[0]));

my $new_string = join('', @unmarked);

=head3 Clean up

The process may have left us with extra white space caused by the
removal of the markers.

=cut

$new_string =~ s/^\s*(.+?)\s*$/$1\n/s;  # non-greedy, so trailing white space is really trimmed
$new_string =~ s/ {2,}/ /g;
$new_string =~ s/\n{3,}/\n\n/g;

print $new_string;
George Jansen
Thursday, August 03, 2006
 
 

