Post #49 by Five, 2020-02-16, 10:55 PM
Re: The Validity of MD5 Checksums

shntool tutorial by Jason Jordan, 2004-05-05

part one of three

Code:
shntool 2.0.3 tutorial
----------------------

This is a fairly brief but hopefully useful document on how to use shntool and
its various modes.  It is a work in progress, so take what it says with a grain
of salt.  I will try to keep it current with each new release.  However, even if
this document becomes dated, shntool's built-in help screens as well as its man
page (i.e. the README.txt for Windows folks) will be kept current, so if you
have any troubles you can consult those sources of information.  The help
screens can be accessed from any mode by giving the '-h' command-line switch.

NOTE: For the purposes of this tutorial, all sample commands will be given in
their long form (e.g. 'shntool len' as opposed to 'shnlen'), since some
platforms do not support symbolic links, which allow one to use the short form.
Also, the '%' represents the command prompt, whatever platform you are on.
Most of you know this, but this is for those who don't, so that I don't get any
email saying "When I run '% shntool', it says 'command not found'!!!!"  :^)

shntool's modes can generally be split into three categories - modes that simply
display information about given files, modes that create new files based on
input files, and modes that do something else that is not covered by the above
two categories.


Contents:

1. Modes that display information
  1a. len mode
  1b. info mode
  1c. md5 mode
  1d. cue mode
2. Modes that create files
  2a. fix mode
  2b. join mode
  2c. split mode
  2d. strip mode
  2e. conv mode
  2f. pad mode
3. Miscellaneous modes
  3a. cat mode
  3b. cmp mode
4. Custom format modules
  4a. cust format


=================================
1. Modes that display information
=================================

Currently len, info, md5, and cue modes are the only modes that simply show
information about files.  All of these modes read filenames from the command
line, or from standard input if none are given on the command line.


------------
1a. len mode
------------

len mode shows a one-line summary detailing many properties of a given file.
Below is sample output from len mode:

% shntool len *.shn
    length     expanded size   cdr  WAVE problems filename
    18:39.49      197506892    ---   --   ---xx   gd72-08-27d2t01.shn
     8:48.56       93270956    ---   --   ---xx   gd72-08-27d2t02.shn
     4:58.46       52675436    ---   --   ---xx   gd72-08-27d2t03.shn
    12:18.62      130329068    ---   --   ---xx   gd72-08-27d2t04.shn
     5:32.39       58656572    ---   --   ---xx   gd72-08-27d2t05.shn
    50:18.27       532438924 B                    (totals for 5 files, 0.5612 overall compression ratio)
%

NOTE: overall compression ratio is simply the total size of the actual files on
      the disk divided by the total size of the WAVE data (i.e. the header,
      data, and extra RIFF chunks) contained in the files.  Thus, any data
      appended or prepended to input files (such as ID3 tags, ID3v2 tags or seek
      tables) will increase the overall compression ratio, even pushing it above
      1.0000!  While seemingly counterintuitive, this makes sense, since the
      extra data only serves to reverse the effect of any compression done to a
      given file.


Here are some files that show off many of the property/problem flags described
in the "Explanation of output columns" section below:

% shntool len < test.list
    length     expanded size   cdr  WAVE problems filename
     0:00.543          6030    cxx   --   -----   doh.wav
     0:00.543          6030    cxx   --   ----j   doh-withjunk.wav
     4:08.31       43820156    ---   --   3--xx   test-ok.shn
     4:08.31       43820156    ---   --   3----   test-ok.wav
     0:19.535        215420    cxx   he   -----   test-he.wav
     0:06.123        135876    cxx   -e   -----   test-e.wav
     0:06.123        135049    cxx   --   ---t-   test-t.wav
     3:40.40       38901578    -b-   -e   ---xx   test-be.shn
    10:01.65      106169336    ---   h-   ---xx   test-h.shn
    22:32.076      233209631 B                    (totals for 9 files, 0.6327 overall compression ratio)
%

You can specify an alternate totals unit with the -u command-line switch.  For
example, running:

% shntool len -u mb *.shn

on the first set of files listed above will produce identical output, except for
the totals line which will be shown in terms of megabytes instead of bytes:

    50:18.27         507.77 MB                    (totals for 5 files, 0.5612 overall compression ratio)
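
The totals arithmetic can be checked by hand.  The sketch below is only an
illustration, not shntool's own code.  It assumes that 'mb' means 2^20
(1048576) bytes, which is consistent with the sample output above, and
to_megabytes is a made-up helper name.

```python
# Illustrative only: reproduce the 'MB' total from the byte total shown above.
# Assumption: one MB is 2**20 bytes; to_megabytes is a hypothetical helper.
BYTES_PER_MB = 2 ** 20


def to_megabytes(total_bytes: int) -> str:
    """Format a byte count as megabytes with two decimals."""
    return f"{total_bytes / BYTES_PER_MB:.2f} MB"


# Expanded sizes of the five files in the first len example.
sizes = [197506892, 93270956, 52675436, 130329068, 58656572]
total = sum(sizes)

print(total, "B")
print(to_megabytes(total))
```

With those five sizes the sketch prints 532438924 B and 507.77 MB, the same
figures as the two totals lines.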


Explanation of output columns
-----------------------------

The 'length' column shows the length of the WAVE data in that file, in m:ss.nnn
format.  If the WAVE data is CD-quality, then the length is shown in m:ss.ff
format, where ff is a number from 00 to 74 that best approximates the number of
frames (2352-byte blocks) remaining after m:ss.  If all files given are
CD-quality, then the total length is displayed in m:ss.ff format; otherwise, the
total length will be displayed in m:ss.nnn format.

Note on rounding:  If the WAVE data is CD-quality, then its length is rounded to
                   the nearest frame.  Otherwise, it is rounded to the nearest
                   millisecond.
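
As a rough illustration of the m:ss.ff format, here is a small sketch that
turns a CD-quality data size into a length.  It is not taken from shntool.  It
assumes 2352 bytes per frame, 75 frames per second, and rounding to the nearest
frame, and format_cd_length is a made-up name.

```python
# Illustrative only: format a CD-quality data size as m:ss.ff.
# Assumptions: 2352 bytes per frame, 75 frames per second, and rounding
# to the nearest frame; format_cd_length is a hypothetical helper.
FRAME_BYTES = 2352
FRAMES_PER_SEC = 75


def format_cd_length(data_size: int) -> str:
    frames = (data_size + FRAME_BYTES // 2) // FRAME_BYTES  # nearest frame
    seconds, ff = divmod(frames, FRAMES_PER_SEC)
    minutes, ss = divmod(seconds, 60)
    return f"{minutes}:{ss:02d}.{ff:02d}"


# 48969228 is the "data size" from the info mode sample in section 1b.
print(format_cd_length(48969228))
```

This prints 4:37.45, which agrees with the "length" field of that sample.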

The 'expanded size' column shows the total size of the WAVE header, WAVE data
and any other RIFF chunks appended to the file.  Essentially this shows exactly
how large a file is (or will be when it is decompressed).  NOTE:  Do not rely on
this field for audio size!  If you simply want to know how many bytes of audio
are in a file, run it through info mode, and look at the "data size" field in
its output.

The following three columns - cdr, WAVE and problems - attempt to show
properties and/or problems associated with the corresponding file.  Each entry
under a particular column stands for a specific property/problem.  In all three
columns, whenever that entry is applicable and checks out okay, a '-' will
appear in its place; and whenever that entry is not applicable or cannot be
determined, an 'x' will appear in its place.  However, if a particular entry
does not check out okay, you will see a unique letter corresponding to what
went wrong.  Read on for more information about what these letters mean.

The 'cdr' column shows properties of CD-quality WAVE data.  There are three
entries under this column.  The first entry will contain a 'c' if the WAVE data
is not CD-quality.  The second entry will contain a 'b' if the data is
CD-quality, but not cut on a sector boundary.  The third entry will contain an
's' if the data is CD-quality, but too short to be burned (i.e. shorter than
705600 bytes, which is 4 seconds' worth of CD-quality WAVE data).
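
The rules for the 'cdr' column can be sketched in a few lines.  This is a guess
at the logic based on the description above, not shntool's implementation.  It
assumes that CD-quality means 2 channels, 16 bits/sample and 44100 samples/sec,
and that "too short" means fewer than 705600 bytes.  cdr_flags is a made-up
name.

```python
# Illustrative only: derive the three 'cdr' column entries.
# Assumptions: CD-quality means 2 channels, 16 bits/sample, 44100 samples/sec;
# cdr_flags is a hypothetical helper, not part of shntool.
SECTOR_BYTES = 2352
MIN_BURNABLE_BYTES = 705600  # 4 seconds at 176400 bytes/sec


def cdr_flags(channels, bits_per_sample, samples_per_sec, data_size):
    cd_quality = (channels, bits_per_sample, samples_per_sec) == (2, 16, 44100)
    if not cd_quality:
        return "cxx"  # the remaining entries are not applicable
    boundary = "-" if data_size % SECTOR_BYTES == 0 else "b"
    short = "-" if data_size >= MIN_BURNABLE_BYTES else "s"
    return "-" + boundary + short


# CD-quality data that is not sector-aligned (values from section 1b).
print(cdr_flags(2, 16, 44100, 48969228))
# Made-up parameters for a file that is not CD-quality.
print(cdr_flags(1, 8, 11025, 5986))
```

The first call prints -b- and the second prints cxx.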

The 'WAVE' column shows properties of the WAVE data for any file. These properties
are not problems; they are just indicators of WAVE data that is not canonical.
There are two entries under this column.  The first entry will contain an 'h' if
the WAVE header is not canonical (44 bytes). The second entry will contain an
'e' if the WAVE file contains extra RIFF chunks, other than the required 'fmt'
and 'data' chunks.  Files that exhibit one or both of these properties can be
made canonical by stripping the unnecessary data via shntool's built-in strip
mode.

The 'problems' column shows problems with the WAVE header, WAVE data or the file
itself, for the given file.  There are five entries under this column.  The
first entry will contain a '3' if the file contains an ID3v2 tag.  The second
entry will contain an 'a' if the audio data is not block-aligned, i.e. the data
size is not a multiple of the block align.  The third entry will contain an 'i'
if the header size plus the reported data size is greater than the calculated
total size taken from the header (i.e. chunk size + 8).  The fourth entry will
contain a 't' if the calculated total size is greater than the file's actual
size, and the file is not compressed (e.g. a .wav file).  The fifth entry will
contain a 'j' if the calculated total size is less than the file's actual size,
and the file is not compressed.  The last two entries are only verified for WAVE
data that is not compressed, since it would take far too long to verify this for
compressed WAVE data as well.
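
The checks behind the 'problems' column can be sketched in the same way.  Again
this only follows the description above and is not shntool's code.
problem_flags and its parameter names are made up for this example.

```python
# Illustrative only: derive the 'problems' column entries described above.
# problem_flags and its parameters are hypothetical names for this sketch.
def problem_flags(has_id3v2, data_size, block_align, header_size,
                  chunk_size, actual_size, compressed):
    total_size = chunk_size + 8  # calculated total size from the header
    id3 = "3" if has_id3v2 else "-"
    aligned = "-" if data_size % block_align == 0 else "a"
    consistent = "i" if header_size + data_size > total_size else "-"
    if compressed:
        truncated = junk = "x"  # not verified for compressed files
    else:
        truncated = "t" if total_size > actual_size else "-"
        junk = "j" if total_size < actual_size else "-"
    return id3 + aligned + consistent + truncated + junk


# Values from the info mode sample in section 1b (a compressed .shn file).
print(problem_flags(False, 48969228, 4, 44, 48969264, 21108269, True))
```

For that file the sketch prints ---xx, the same pattern the .shn files show in
the len mode samples.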

Summary of one-character abbreviations:

  all columns:

  '-'  this particular entry is OK
  'x'  this particular entry is not applicable or cannot be determined

  cdr column:

  'c'  data is not [C]D-quality
  'b'  CD-quality WAVE data is not cut on a sector [b]oundary
  's'  CD-quality WAVE data is too [s]hort to be burned

  WAVE column:

  'h'  WAVE [h]eader is not canonical
  'e'  WAVE file contains [e]xtra chunks

  problems column:

  '3'  file contains an ID[3]v2 tag
  'a'  audio data is not block-[a]ligned
  'i'  WAVE header is [i]nconsistent about data size and/or file size
  't'  WAVE file seems to be [t]runcated
  'j'  WAVE file seems to have [j]unk appended to it


-------------
1b. info mode
-------------

info mode shows a detailed, multi-line listing of the properties of a given
file.  Below is sample output from info mode when run on just one file.

NOTE: for CD-quality files, the sector-misalignment is simply the remainder
      when the data size is divided by 2352; i.e. it is the number of bytes
      by which the audio data exceeds the previous sector boundary.


% shntool info kottke1992-07-04d1t17.shn
-------------------------------------------------------------------------------
file name:                    kottke1992-07-04d1t17.shn
handled by:                   shn format module
length:                       4:37.45
WAVE format:                  0x0001 (Microsoft PCM)
channels:                     2
bits/sample:                  16
samples/sec:                  44100
average bytes/sec:            176400
rate (calculated):            176400
block align:                  4
header size:                  44 bytes
data size:                    48969228 bytes
chunk size:                   48969264 bytes
total size (chunk size + 8):  48969272 bytes
actual file size:             21108269
file is compressed:           yes
compression ratio:            0.4311
CD-quality properties:
  CD quality:                 yes
  cut on sector boundary:     no
  sector misalignment:        588 bytes
  long enough to be burned:   yes
WAVE properties:
  non-canonical header:       no
  extra RIFF chunks:          no
Possible problems:
  file contains ID3v2 tag:    no
  data chunk block-aligned:   yes
  inconsistent header:        no
  file probably truncated:    unknown
  junk appended to file:      unknown
  odd data size has pad byte: n/a
Extra shn-specific info:
  seekable:                   no


NOTE: compression ratio is simply the total size of the actual file on the disk
      divided by the total size of the WAVE data (i.e. the header, data, and
      extra RIFF chunks) contained in the file.  Thus, any data appended or
      prepended to the file (such as ID3 tags, ID3v2 tags or seek tables) will
      increase the compression ratio, even pushing it above 1.0000!  While
      seemingly counterintuitive, this makes sense, since the extra data only
      serves to reverse the effect of any compression done to the file.
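
Two of the derived fields in the listing above are easy to recompute.  The
sketch below is only a sanity check on the sample numbers.  It assumes that the
"total size of the WAVE data" used for the ratio is the "total size (chunk size
+ 8)" field, and the variable names are made up.

```python
# Illustrative only: recompute two derived fields of the info listing above.
# The variable names are made up; the numbers come from the sample output.
SECTOR_BYTES = 2352

data_size = 48969228      # "data size"
chunk_size = 48969264     # "chunk size"
actual_size = 21108269    # "actual file size"

total_size = chunk_size + 8
misalignment = data_size % SECTOR_BYTES
ratio = actual_size / total_size

print(f"sector misalignment: {misalignment} bytes")
print(f"compression ratio:   {ratio:.4f}")
```

This prints 588 bytes and 0.4311, matching the listing.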


------------
1c. md5 mode
------------

md5 mode computes the MD5 fingerprint of the WAVE data contained within input
files.  This can be used to catalog unique sources of audio, and to determine
whether files contain identical audio data, even when they are stored in
different compression formats.  The
string "[shntool]" is added to the output to distinguish these MD5 sums from
normal MD5 sums.  If you want to calculate the composite MD5 fingerprint from
a set of files, use the -c option.  The composite MD5 sum can be useful for
fingerprinting a file set, or identifying file sets that contain the exact same
audio data, but different track breaks (e.g. file sets that have been "fixed",
with no padding added).

NOTE:  The -c option is equivalent to the following commands:

       a)  % shntool cat -nh -np -nr <files> | md5sum

       b)  % shntool join -nopad <files>
           % shntool md5 joined.wav

       The advantage of the -c option over a) is that it can be used on systems
       that don't have md5sum installed, and the advantage over b) is that no
       extra disk space is required for the joined file.


Here is the output for one source of a particular show:

% shntool md5 *.shn
b3b7d3f6c6b0ffc88e6588f4f279d97e  [shntool]  ph1993-08-20d1t01.shn
dc989e4aa15b31b8389814d7ac945c87  [shntool]  ph1993-08-20d1t02.shn
e3e3276ce8c1d5aac3ba9c603d9a1810  [shntool]  ph1993-08-20d1t03.shn
1cc17eaef4c086222bbb5974e61de72f  [shntool]  ph1993-08-20d1t04.shn
6a1b42c3d592b3004212dc6ac36649ea  [shntool]  ph1993-08-20d1t05.shn
0c177bcb31e882efee9bd5930cf9c2ec  [shntool]  ph1993-08-20d1t06.shn
0bf26039c8bb42c15518f0c4988dd01e  [shntool]  ph1993-08-20d1t07.shn
b82f32b6c727e465ba7a925a2bf0f7f7  [shntool]  ph1993-08-20d1t08.shn
1164b3207df6916621808ff4ee2ac9b7  [shntool]  ph1993-08-20d1t09.shn
32dc984bd441a4703a5b65902583ec45  [shntool]  ph1993-08-20d2t01.shn
14f466e265499a56aacbfb7144057d37  [shntool]  ph1993-08-20d2t02.shn
8a35ff394515baa062610172f125e376  [shntool]  ph1993-08-20d2t03.shn
a23d9996e43df5107a802b3aba4a2830  [shntool]  ph1993-08-20d2t04.shn
8a35eb14fcbb0dacbad1915a07746400  [shntool]  ph1993-08-20d2t05.shn
1808c90d6728dd151eafd3d6135d1ee0  [shntool]  ph1993-08-20d2t06.shn
afe8dd6c67afa59d2f31fb04dc6a78c1  [shntool]  ph1993-08-20d2t07.shn
136f555973b5a92433522f0fccf05e7d  [shntool]  ph1993-08-20d3t01.shn
8cffae094165d8a5bffacd5198dd53a7  [shntool]  ph1993-08-20d3t02.shn
08c9532a2c07750cb7ccb5cbe678da6b  [shntool]  ph1993-08-20d3t03.shn
a8c76355bd329d563b79d4ac75485314  [shntool]  ph1993-08-20d3t04.shn
e2429980fd995cb19764b7768ff188e3  [shntool]  ph1993-08-20d3t05.shn
%

Here is an example showing how the MD5 sum of WAVE data remains constant
even though the compression formats differ:

% shntool md5 example.*
e09f22c64d717ed89c6009b52fcfddd2  [shntool]  example.aiff
e09f22c64d717ed89c6009b52fcfddd2  [shntool]  example.ape
e09f22c64d717ed89c6009b52fcfddd2  [shntool]  example.flac
e09f22c64d717ed89c6009b52fcfddd2  [shntool]  example.ofr
e09f22c64d717ed89c6009b52fcfddd2  [shntool]  example.pac
e09f22c64d717ed89c6009b52fcfddd2  [shntool]  example.shn
e09f22c64d717ed89c6009b52fcfddd2  [shntool]  example.wav
%

Here's one way to find the MD5 fingerprint of all your audio files:

% find /audio/dir | shntool md5 2>/dev/null

Here's an example showing the usefulness of the -c option.  Notice how the
individual MD5 fingerprints change, while the composite MD5 fingerprint
remains constant (as long as no padding is added to the fixed files):

% shntool md5 *.flac
3ec1532ce893a8e845a3a1c5ff6db537  [shntool]  gd1966-07-16d01t08.flac
3a5e83dec13f396be2f0d10b848f1f30  [shntool]  gd1966-07-16d01t09.flac
% shntool md5 -c *.flac
09fdf2f9d7c55b007a5cc738a67662ee  [shntool]  composite
% shntool fix -o flac -nopad *.flac
shntool [fix]: warning: no shift direction specified - assuming backward shift
gd1966-07-16d01t08.flac --> gd1966-07-16d01t08-fixed.flac ... done.
gd1966-07-16d01t09.flac --> gd1966-07-16d01t09-fixed.flac ... done.
File 'gd1966-07-16d01t09-fixed.flac' was not padded, though it needs 420 bytes of padding.
% shntool md5 *fixed*.flac
d666d3a7ff07981210d57707e6de86be  [shntool]  gd1966-07-16d01t08-fixed.flac
1007bed4c0b891213e5b3e837d41c6a5  [shntool]  gd1966-07-16d01t09-fixed.flac
% shntool md5 -c *fixed*.flac
09fdf2f9d7c55b007a5cc738a67662ee  [shntool]  composite
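
The reason the composite fingerprint survives a change in track breaks can be
shown without shntool at all.  The sketch below models each track as a plain
byte string, which is a simplification, and composite_md5 is a made-up helper,
not shntool's implementation.

```python
# Illustrative only: why a composite MD5 survives moved track breaks.
# Tracks are modeled as raw byte strings; composite_md5 is a hypothetical helper.
import hashlib


def composite_md5(tracks):
    """MD5 over the audio bytes of all tracks, taken in order."""
    digest = hashlib.md5()
    for audio in tracks:
        digest.update(audio)
    return digest.hexdigest()


audio = bytes(range(256)) * 64                 # stand-in for a show's audio
original = [audio[:6000], audio[6000:]]        # original track break
shifted = [audio[:5880], audio[5880:]]         # break moved, no padding added

per_track_differs = (hashlib.md5(original[0]).hexdigest()
                     != hashlib.md5(shifted[0]).hexdigest())
print(per_track_differs)
print(composite_md5(original) == composite_md5(shifted))
```

Both lines print True: the per-track sums change when the break moves, while
the sum over the concatenated audio does not.  Adding padding would change the
concatenated bytes, and with them the composite sum.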


------------
1d. cue mode
------------

The purpose of cue mode is to generate a CUE sheet or a set of split points
from a set of files.  You can use a CUE sheet to burn data joined from a set
of files, or to re-split the same joined data by feeding the CUE sheet to
split mode.  Since CUE sheets are sector-aligned by design, cue mode also
allows you to create explicit byte-offset split points from a set of files.
That way, you can re-split the joined data in exactly the same places in which
it was originally split, whether the tracks were sector-aligned or not.

Here is an example showing a CUE sheet created from a set of files:

% shntool cue *.flac
shntool [cue]: warning: no output type specified - assuming CUE sheet
FILE "joined.wav" WAVE
  TRACK 01 AUDIO
    INDEX 01 0:00:00
  TRACK 02 AUDIO
    INDEX 01 1:30:07
  TRACK 03 AUDIO
    INDEX 01 2:57:43
  TRACK 04 AUDIO
    INDEX 01 9:25:44
  TRACK 05 AUDIO
    INDEX 01 17:23:58
  TRACK 06 AUDIO
    INDEX 01 27:47:63
  TRACK 07 AUDIO
    INDEX 01 37:05:16
  TRACK 08 AUDIO
    INDEX 01 40:53:63
%

If you want raw split points instead, use the '-s' option:

% shntool cue -s *.flac
15892464
31323936
99769488
184121616
294206976
392527632
432857376
%
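
The two outputs above describe the same track breaks.  As a rough check, the
sketch below converts the byte offsets into CUE index times.  It assumes
CD-quality audio (2352 bytes per frame, 75 frames per second) and offsets that
fall on frame boundaries.  offset_to_index is a made-up name.

```python
# Illustrative only: turn byte-offset split points into CUE index times.
# Assumptions: CD-quality audio (2352 bytes per frame, 75 frames per second)
# and offsets that fall on frame boundaries; offset_to_index is hypothetical.
FRAME_BYTES = 2352
FRAMES_PER_SEC = 75


def offset_to_index(offset: int) -> str:
    frames = offset // FRAME_BYTES
    seconds, ff = divmod(frames, FRAMES_PER_SEC)
    minutes, ss = divmod(seconds, 60)
    return f"{minutes}:{ss:02d}:{ff:02d}"


split_points = [15892464, 31323936, 99769488, 184121616,
                294206976, 392527632, 432857376]

# Track 01 starts at offset 0; each split point starts the next track.
for number, offset in enumerate([0] + split_points, start=1):
    print(f"TRACK {number:02d} INDEX 01 {offset_to_index(offset)}")
```

The index times printed for tracks 01 through 08 match the CUE sheet shown
earlier in this section.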