|
Program duplicator version 1.0.9
Program duplicator version 1.0.9
Initial revision 2002-08-31; Last revision 2004-05-31
1 Download
2 File readme
3 Usage and options summary
4 Description
5 Project revision history
6 License
1 Download
Sources: src/duplicator-1.0.9.tgz [43 Kb ]
Win9x-EXE (minGW cross-compiled): mingw/duplicator.zip [38 Kb ]
2 File readme
duplicator --- locator of duplicated and plagiarized documents
SUPPORTED ENVIRONMENTS
http://www.gnu.org GNU/Linux
http://www.mingw.org MinGW --- Minimalist GNU For Windows
COMPILATION
Enter make (or gmake) in the directory where sources reside
BRIEF INSTRUCTION
This program can be used for the location of duplicated and
plagiarized documents in a text collections. Before you use this
program you have to concantenate all documents of the text collection
into one file, separating documents by special "sentinel" symbol (byte),
which is not encountered in any of the documents of the coollection.
Usually symbol '\000' works fine, by default program uses line feed
'\n' symbol to separate documents, which can be overrided by option
--record-separator (this assumes that all records in the file have no
linefeed symbol, you can change it to '\015'; linefeed separators are
useful because they allow use of Unix commands 'head' and 'tail' for
location of the records). Suppose that you've put your collection with
'\000'-separated records into file COLLECTION.TXT.
Next, you have to construct suffix array
duplicator -m --record-separator=0 COLLECTION.TXT
This action can take quite a long time (hours, or even days,
especially if the length of COLLECTION.TXT times five is less than the
size of RAM). The behaviour (in particular, memory consumption) can be
adjusted using special switch --memory-for-index. The size of
resulting suffix array may exceed file size limit on your system (some
systems can not have more than 2Gb files). This can be dealt with
option --max-file-size.
Using suffix array constructed you can use all other options for
search of duplicates and plagiarized documents. The most useful
commands are
duplicator -r --record-separator=0 COLLECTION.TXT
duplicator -l --record-separator=0 COLLECTION.TXT
The first command outputs R-measure for each document w.r.t the rest
of the collection into file COLLECTION.TXT.rrepeats.txt
Second command outputs longest repeated substrings of each document in
the rest of the collection into file COLLECTION.TXT.lrepeats.txt
More detailed description of R-measure can be found in the following
paper
Khmelev D., Teahan W. A repetition based measure for
verification of text collections and for text categorization.
SIGIR'2003, July 28--August 1, 2003, Toronto, Canada.
License conditions are described in file LICENSE.txt
3 Usage and options summary
user@computer$ ./duplicator --help
Usage: duplicator -[mc12rls] [LONGOPTIONS] FILE
-m --makesuffixarray construct suffix array in FILE.s00, FILE.s01...
-c --checkorder check the correctness of suffix array
-s --lcpstat puts Lcp statistics to FILE.lcpstat.txt
-l --lrepeat puts LCS info into FILE.lrepeats.txt
-r --rrepeat puts approximate R repeats to FILE.rrepeats.txt
-1 --countrepetitions record stats to FILE.cmnsbstr.txt
-2 --countr puts R statistics to FILE.rstat.txt
-h, --help display this help and exit
-d, --description display complete description
-v, --version display version and exit
Long options: (each option requires a numerical argument)
--record-separator=<num> the code in 0..255 of record separator character
--max-file-size=<num> (in Mb's), by default 2000 (Mb),
--memory-for-index=<num> size of each piece in multi-stage sort (100 Mb)
The following options manage the data representation in FILE.lrepeats.txt
--lrepeat-level=<num> output LCS longer than <num> only (default 50)
--lrepeat-max-list-length (default 2000) the maximal number of
LCS per record displayed in FILE.lrepeats.txt
--lrepeat-outstrlen=<num> sample LCS displayed are at most <num>(default 20)
The following options manage the work of action -r or -1
--rrepeat-max-list=<num> (default 10) the pretenders list size
--r-level=<num> omit Lcps shorter than --r-level (default 1)
4 Description
user@computer$ ./duplicator --description
<Usage information from the previous section is omitted>
*** DESCRIPTION
(input files: FILE)
(general options: --record-separator
--out-recnum-offset --zero-recnum-offset)
This program give some information on duplications and repetitions in
file FILE. Repetitions are calculated between "records", which should
be separated by record separator (RS), which '\n' is by default,
and can be modified with --record-separator <RS code> option. Duplicator
simply substitutes each <RS> character with zero '\000' character after
loading source FILE into memory, so any zero characters in FILE makes an
error unless <RS> is '\000'.
With options -l, -r, -1, -2, duplicator produces some output files,
containing infomation on records. The records by default are numbered
from 1. If you prefer, you can make offsets to be zero using
--zero-recnum-offset option. Other offsets (in a range from
-1000*1000*1000..1000*1000*1000) can be obtained set with
--out-recnum-offset option accepting integer argument in a form
--out-recnum-offset=<num>
*** Option -m
(input files: FILE)
(output files: FILE.s00, FILE.s01 etc)
(temporary files: _tmppool%03d.bin)
(sub-options: --max-file-size --memory-for-index)
In order to compute repetition characteristics, the duplicator needs to
construct suffix array. You should call it with option -m to do it:
duplicator -m FILE
If you have enough memory (about size of FILE times 5), then the
program would construct the suffix array and put it into FILE.s00. If
your FILE is really large (more than 512 Mb), then it could happen
that the size of suffix array will exceed the limit for file size in
your system (which is usually 2Gb). If suffix array is larger that
2000 than it is splitted into several pieces
numbered, resp. FILE.s00, FILE.s01 etc. You can modify maximum file
size using option --max-file-size setting the maximal file size on
your system in Mb's.
More usual problem is the shortage of memory. If your RAM size is less
than the size of FILE, then upgrade your memory. If your RAM size is
slightly more than size of FILE (e.g. 1.5 times more), then duplicator
can make suffix array in pieces (assuming that you have
8*length(FILE) free space on your hard drive). The size of each piece
in Mb's is defined by option --memory-for-index (default 100Mb).
Program will construct temporary files of with names _tmppool%03d.bin
in working directory and will remove them after it has been finished
successfully.
Suffix array construction is a hard job, so the program will never
overwrite the existing file(s). Hence, if you wish to reconstruct
suffix array of if the program was interrupted, you should manually
delete all the files related. Notice also that only this option
requires a lot of memory, all other functions memory requirement are
much lighter: they simply load FILE into memory and than operate
scanning through suffix array consequently, loading only a small part
of it (at most the size of longest record elements)
*** Option -c
(input files: FILE FILE.s00 [and FILE.s01 etc if exist])
(output: stdout)
(sub-options: --display-suffix-sample)
outputs the report into stdout. It checks the number of suffixes,
information on mis-sorted suffixes.
You can use option --display-suffix-sample to display a sample of
20 suffixes from the middle of the suffix array (more precisely,
last 20 suffixes of first 1/30 of suffix array)
Suffix array structure is quite simple. After the input file is loaded
into memory. it is logically preceded by <RS> character and is kept in
a single large array of characters S[0..n], where n is the length of
file (in fact, if the last record did not end with <RS> character, the
text is appended by another <RS> character). The main fact for us is
that the text itself is S[1..n]. A sub-array S[i..n] is called a suffix
number i. Suffix i is empty if S[i]=0 and non-empty otherwise. In
suffix array FILE.s00 only non-empty suffixes are kept. Each non-empty
suffix is a 4-byte positive number. This numbers are sorted in
lexicographical order of corresponding articles and form suffix
array. To be absolutely precise if two suffixes S[i_1..n] and S[i_2..n]
are the same as *zero-tailored*string*, than the order is determined by
by order of number i_1 and i_2, i.e., if i_1<i_2 then
S[i_1..n]<S[i_2..n]. This definition of order imply that there are no
"equal" suffixes at all.
The structure if FILE.s00 is as follows, where offsets and
sizes are given in bytes
offset size value meaning
0 4 T a total number of non-empty suffixes
4 4 i_1 S[i_1..n] --- the smallest suffix
8 4 i_2 S[i_2..n]
....................
4*T 4 i_T S[i_T..N] --- the largest suffix
If 4*T is larger than --max-file-size, then for some k we have the
following structure of FILE.s00:
offset size value meaning
0 4 T a total number of non-empty suffixes
4 4 i_1 S[i_1..n] --- the smallest suffix
8 4 i_2 S[i_2..n]
....................
4*k 4 i_k S[i_k..N]
4*(k+1) 4 0 trailing zero means that the suffix array is
continued in next FILE.s01
FILE.s01 structure:
offset size value meaning
0 4 i_{k+1} S[i_{k+1}..n]
4 4 i_{k+2} S[i_{k+2}..n]
....................
The trailing zero means that we should go for the next suffix array
file in FILE.s01, FILE.s02 etc.
*** Option -l (--lrepeat)
(input files: FILE FILE.s00 [and FILE.s01 etc if exist])
(output: FILE.lrepeats.txt)
FILE. More precisely, for each record R it determines a list of other
records, which have common substring with the record R of length at
least --lrepeat-level (which is 50 by default).
The program keeps no more than --lrepeat-max-list-length records in a
list of CS records for each record R while calculation. The records
with longer CS have the priority and they replace those with smaller
CS.
In output file FILE.lrepeats.txt information on records is
displayed line-per-record, e.g.,
1 1920 L=30 repsnum=0
2 269 L=33 repsnum=0
3 198 L=39 repsnum=0
4 2143 L=2143 repsnum=1 (16,2143,2391,4533,"is not under pressur")
5 270 L=270 repsnum=2 (12751,270,4535,4804,"reported the farmer ") (14502
,164,4623,4786," and have matured re")
6 444 L=28 repsnum=0
...........................
where the first column contains the number of the record, the second
being the record length, the third being the length of longest
repetition of substring in this record. The rest of the line
represents the records which have at least --lrepeat-level common
substring with a given one. It begins with "repsnum=<num>", where
<num> is the number of these records, possibly 0. Each longest
repetition with other records is given by three scalars
"(rec,lcs,offs,loffs,str)" in brackets, separated by commas. Here
'rec' is the record number, 'lcs' being the longest common substring
between rec and record for this line, 'offs' being the number of byte,
starting the shared string, 'loffs' being the number of the last byte
of the shared string, and 'str' being first few characters of shared
string. The values of loffs and lcs are useful with 'head' and 'tail'
Unix commands, which allow to view the repeated string completely. In
order to do it, just type
head -c`loffs` FILE |tail -c`lcs`
where `loffs` and `lcs` are numerical values from a quinter above. For
example, the example above is obtained for the file reuters.txt and
the command would be:
head -c4533 reuters.txt|tail -c 2143|head -c70
producing the following output (cutted)
is not under pressure to act quickly on its proposed equity offering a
The string str is a standard C string with echoed special symbols and ".
The maximal length of str is --lrepeat-outstrlen (default 20).
A final remark: this mode is better used to detect that there are no
repeted strings in the corpora. The problem is that for each pair of
records only the repetition of maximal length is reported, while there
could be a lot of repetitions of the same length which are not
displayed.
*** Option -r (--rrepeat)
(input files: FILE FILE.s00 [and FILE.s01 etc if exist])
(output: FILE.rrepeats.txt)
(sub-options: --rrepeat-max-list --r-level)
Finds out all approximate repeats for R statistics.
*** Option -1 (--countrepetitions)
(input files: FILE FILE.s00 [and FILE.s01 etc if exist])
(output: FILE.cmnsbstr.txt)
Counts repetition statistics.
*** Option -2 (--countr)
(input files: FILE FILE.s00 [and FILE.s01 etc if exist])
(output: FILE.rstat.txt)
(sub-options: --r-level)
Counts R statistics only
*** Process management
Ctrl-C should be used to display the progress in lengthy subroutines.
For Unix users: To stop the program suspend it pressing Ctrl-Z and use
kill -9. You can also send a signal kill -2 to the program to check
current status.
For Win32 users: To stop the program close the console window.
5 Project revision history
Files of the project were modified on the following dates:
2002-08-31
2002-09-01
2002-09-10
2002-09-12
2002-09-25
2002-10-23
2002-10-25
2003-03-08
2003-05-12
2003-05-16
2004-05-31
6 License
duplicator - locator of duplicated and plagiarized documents
Available at http://www.math.toronto.edu/dkhmelev/PROGS/tacu/
Author:
Dmitry V. Khmelev
dkhmelev((at))math.toronto.edu
[change ((at)) to @ in order to get proper address - antispam]
University of Toronto,
Department of Mathematics,
100 St George Street,
M5S 3G3 ON,
Canada
LICENSING TERMS
This program is granted free of charge for research and education
purposes. However you must obtain a license from the author to use it
for commercial purposes.
Scientific results produced using the software provided shall
acknowledge the use of duplicator. The proper references are:
D. Khmelev, Text Analysis and Conversion Utilities
http://www.math.toronto.edu/dkhmelev/PROGS/tacu/
Khmelev D., Teahan W. A repetition based measure for
verification of text collections and for text categorization.
SIGIR'2003, July 28-August 1, 2003, Toronto, Canada.
http://www.math.toronto.edu/dkhmelev/PAPERS/published/2003/sigir/sig.html
Moreover shall the author of duplicator be informed about the
publication.
The software must not be modified and distributed without prior
permission of the author.
By using duplicator you agree to the licensing terms.
NO WARRANTY
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT
WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER
PARTIES PROVIDE THE PROGRAM ÄS IS" WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF
THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO
LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY
OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED
OF THE POSSIBILITY OF SUCH DAMAGES.
1 Download
2 File readme
3 Usage and options summary
4 Description
5 Project revision history
6 License
|