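Come to think of it, 寻正 has quite a lot of articles, and deleting them by hand is tedious. Save the script below as deletejunks.pl. Assuming the Web root of 新语丝 is /var/www/localhost/htdocs/, run: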
./deletejunks.pl /var/www/localhost/htdocs/ 西风独自凉
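This prints the first few lines of each of 西风's articles, followed by the article's full path. Look over the output at this point and check for mistakes.
To actually delete the files, run: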
./deletejunks.pl /var/www/localhost/htdocs/ 西风独自凉 2>deletelist && xargs -a deletelist rm
Mind the locale setting; on my machine it is zh_CN.gb2312.
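#!/usr/bin/perl
# The deprecated `encoding` pragma is replaced by explicit GB2312 handling.
use Encode qw(decode);
binmode STDOUT, ':encoding(gb2312)';
# Decode the author name so it can be matched against the decoded article text.
$ARGV[1] = decode('gb2312', $ARGV[1]) if defined $ARGV[1];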
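sub delete_by_author {
    my ($txtfile) = @_;
    my $author    = $ARGV[1];
    my @header;           # top non-blank lines read so far
    my $is_author = 0;    # set once a header line ends with the author name

    # Skip unreadable files silently: STDERR is reserved for the delete list.
    open(my $article, '<:encoding(gb2312)', $txtfile) or return;
    ## search the top 4 non-blank lines in the article (stops early at end of file)
    while (@header < 4 and defined(my $line = <$article>)) {
        next if $line =~ /^\s*$/;
        push @header, $line;
        # \Q...\E matches the name literally; \s*$ tolerates trailing whitespace
        $is_author = 1 if $line =~ /\Q$author\E\s*$/;
    }
    close $article;
    return unless $is_author;

    # Header lines go to STDOUT for review; the path goes to STDERR, once per file.
    print @header;
    print {STDERR} $txtfile, "\n";
}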
sub recurse {
    my ($path) = @_;
    ## append a trailing / if it's not there
    $path .= '/' if $path !~ /\/$/;
    for my $eachFile ( glob($path . '*') ) {
        if ( -d $eachFile ) {
            ## the entry is a directory: pass it to the routine ( recursion )
            recurse($eachFile);
        } elsif ( $eachFile =~ /\.txt$/ ) {
            delete_by_author($eachFile);
        }
    }
}
die "Usage: deletejunks.pl BaseDirectory Author\n" if @ARGV < 2;
## initial call ... $ARGV[0] is the first command line argument
recurse($ARGV[0]);
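
The delete command shown earlier hands deletelist straight to xargs, which by default splits its input on whitespace and interprets quotes, so a path containing a space would be broken into pieces. Anything else that lands on stderr would also end up in the list, which is one more reason to review it first. Below is a minimal sketch of a more defensive delete step. It assumes GNU xargs (for the -d and -r options) and uses made-up file names inside a temporary directory instead of the real Web root; the `pages` directory and the file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sandbox standing in for the Web root; nothing outside it is touched.
root=$(mktemp -d)
mkdir -p "$root/pages"
printf 'title\nauthor\n' > "$root/pages/a.txt"
printf 'title\nauthor\n' > "$root/pages/b c.txt"    # path containing a space
printf 'title\nother\n'  > "$root/pages/keep.txt"

# Stand-in for the list that deletejunks.pl writes to stderr: one path per line.
deletelist="$root/deletelist"
printf '%s\n' "$root/pages/a.txt" "$root/pages/b c.txt" > "$deletelist"

# -d '\n' : treat each whole line as one argument (GNU xargs)
# -r      : do not run rm at all when the list is empty (GNU xargs)
# --      : end of options, so a path starting with '-' is not read as a flag
xargs -r -d '\n' -a "$deletelist" rm --

remaining=$(ls "$root/pages")
echo "$remaining"    # prints: keep.txt
```

Pointing `deletelist` at the file produced by the `2>deletelist` redirection would apply the same xargs line to the real list, once that list has been checked.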