SaReGaMa: An Automatic Printed Music Recognition System



Index

  1. Introduction
    1.1 On-line vs. Off-line
    1.2 A Brief comparison with Optical Character Recognition
  2. Motivation for the project
  3. Terminologies
    3.1 Names of musical symbols
  4. Review of past work
  5. Methodology
    5.1 Detection of staff lines
    5.2 Hough transform
    5.3 Detection of bar lines
    5.4 Recognition of note symbols
    5.4 Implementation
  6. MIDI
  7. Input Output
  8. Conclusions
    8.1 Improvements
  9. Some other links around the world
  10. References

1. Introduction

Automatic optical music recognition is the automatic analysis of images of musical notation, either handwritten or printed. Music notation represents musical information in symbolic form. It carries much of the information that other music information processing studies depend on, so most laboratories working in that field need to enter music notation into a computer. In most cases this is done manually, using a computer keyboard or mouse. Automatic recognition would be preferable.

1.1 On-line vs. Off-line

Automatic optical music recognition can be roughly classified into two categories: on-line and off-line.

In an on-line system, the machine analyses the musical score and generates the result almost instantaneously. Such a system can be attached to a device, such as robotic arms positioned at a piano, to perform the piece of music in real time. The machine must therefore carry out the analysis in a short time, which implies that the system may not have enough time to analyze the whole score before generating its output.

In an off-line system, the score is first digitized as an image file and stored. Optical scanners are usually used, with cameras as an alternative. The stored image is then analyzed by the computer and converted into a binary form, using a coding that should be suitable both for performing the piece of music and for re-printing the score. Since an off-line system can analyze the whole score before generating the output, recognition accuracy is improved. For example, a sophisticated semantic checker can be developed to correct suspected mistakes made in an earlier stage of recognition.

1.2 A Brief comparison with Optical Character Recognition

Text is printed by placing characters from several fonts onto a blank sheet, while a musical score lays musical symbols, which can be treated as characters in some special fonts, onto a sheet of staff lines. A musical score can be divided into groups, with the symbols attached to the same staff being considered as one group. These groups are then analogous to the rows of characters in optical character recognition. These observations suggest that current optical character recognition systems might be adapted to perform optical music recognition. This would be desirable, since optical character recognition has been under development for many years and such a system could be modified to build an optical music recognition system.

Unfortunately, there are some major differences between text and musical scores that make it difficult to adapt current optical character recognition systems to recognize music.


Owing to these major differences, ordinary optical character recognition techniques do not perform well on musical scores. Special techniques have been developed to handle musical scores efficiently and effectively.

2. Motivation for the Project

An automatic printed music recognition system would make it practical to convert large quantities of printed music into computer-readable form. This is similar to the need for automatic conversion of engineering drawings, circuit layouts and the like from existing documents. Once the music is stored electronically in some music representation language, it can be manipulated freely, which makes possible applications such as musicological analysis, point-of-sale printing and the production of new editions.

Some other applications of printed music recognition are stated below:


Above all, the recognition of images of printed musical characters presents us with a challenging problem in the field of image pattern matching and semantic analysis.

3. Terminologies

staff line: On a musical score, a staff line is a long, thin horizontal line which defines a co-ordinate system. Along the line from left to right is the time axis. The scale on this axis is roughly linear. The direction perpendicular to the staff lines is the dimension of pitch, with higher positions denoting higher pitches.
staff space: The distance between adjacent staff lines of the same stave.
stave (staff): A stave is also known as a staff. It is a group of five staff lines.
bar line: A vertical line in a musical score that separates notes into groups called "bar units".
ledger line: Ledger lines are additional horizontal lines added near a note symbol when a note lies too far above or below the staff. They help to clarify the positions of these notes.
bar unit: A small unit of a piece of music. Each bar unit occupies the same length of time.
note symbol: A symbol in a musical score that represents a musical note and its duration. The pitch of a note is determined by the vertical position of the note symbol relative to the staff.
note head: The elliptical portion of the note symbol. For whole notes and half notes, the note head is hollow. For other notes, the note head is a solid ellipse.
note stem: The vertical line segment of a note symbol. Except for whole notes, all note symbols have a stem, with one end touching the note head.
note flag: The tail part of a note symbol, which determines the type of the note. The flag is at the end of the note stem opposite the note head. Whole, half and quarter notes have no flags. An eighth note has one flag, a sixteenth note has two flags, and so on.
voice: A voice is a musical line. A voice may correspond to a single instrument, though the piano part of a score is usually notated as two or more voices.
slur: A thin, wide, curved line that spans a group of note symbols. Slurs may span several bar units.
pedal marking: A pedal marking tells a pianist how to control the foot pedals of a piano.
dynamic marking: Dynamic markings are present in a musical score to indicate the loudness of subsequent notes.



                                     Figure 2: Names of various basic components on a musical score.

3.1 Names of musical symbols


Here is a list of musical symbols, with the names given by their side.


                                                       Figure 3: Musical symbols

4. Review of past work

Work on automatic recognition of printed music began in the late 1960s and early 1970s with the research of Pruslin and Prerau at MIT. The limitations of the hardware available at that time for acquiring and manipulating images restricted what could be done, but some progress was made using techniques including low-pass filtering and contour tracing. In the mid and late 1980s, Matsushima and Katayose at Waseda University built the WABOT-2 keyboard-playing robot, whose vision system uses mask-matching implemented in hardware, in conjunction with localized measurements, to read nursery song sheets. Most of the work through the 1990s has concentrated on locating staves and on isolating and recognizing symbols.

Pruslin preprocesses the music image by eliminating all thin horizontal and vertical lines, including many bare staff-line sections and stems. This results in an image of isolated symbols, such as note heads and beams, which are then recognized using contour-tracing methods. Prerau describes a "fragmentation and assemblage" method for treating staff lines and isolating music symbols.

Some automatic recognition systems for music notation have been developed. Nakamura and Fujinaga have proposed using projection profiles: the type and position of each symbol are recognized by extracting shape and position features from its horizontal or vertical projection, since certain points in the pattern of a symbol carry distinctive features. The advantage of these methods is simplicity. If symbols are connected or vertically aligned, however, recognition is difficult. These methods can recognize only simple monophonic notation, such as children's songs.

Tojo has proposed a recognition method that classifies symbols into large groups according to the shape of the rectangle circumscribing each symbol, and then discriminates between symbols by structural analysis. This method requires careful elimination of the staff lines as a preprocessing step and a fine segmentation of symbols. Matsushima has developed a high-speed recognition system for real-time musical performance by a robot. This system uses dedicated hardware to detect symbols in about 10 seconds, and is too inflexible to handle complex notation.

The difficulties facing recognition systems are various. In simple notation, the density of symbols is low, so connections between symbols need not be considered. In complex notation, symbols are drawn with high density, so connection, overlap and containment of symbols appear everywhere. This, in addition to the overlap with the staff lines, makes segmentation of symbols very difficult. From the point of view of semantics, simple notation can be interpreted uniquely by means of simple rules. Complex notation contains ambiguous descriptions, so proper knowledge is required to interpret them. Because of these difficulties, existing recognition systems cannot deal with complex notation.

5. Methodology



Stages of the recognition process. Figure 4 shows the processing flow of a typical automatic optical music recognition system. This is not the only possible one; other processing flows are possible.


                                                       Figure 4: The processing flow

A common first operation in a music recognition system is thresholding, which converts a grey-scale image into a binary image. Other forms of preprocessing are sometimes used for noise reduction. The other important steps in the recognition of printed music documents include staff line identification, symbol classification, symbol recognition and analysis of the relative positions of the symbols.
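As an illustration of the thresholding step, here is a minimal Python sketch. The fixed cutoff of 128 and the function name `threshold` are assumptions made for the example; the report does not say how the threshold was chosen.

```python
def threshold(gray, cutoff=128):
    """Convert a grey-scale image (rows of 0-255 values) to a binary image.

    Returns rows of 0/1 values, where 1 marks a black (ink) pixel.
    The fixed cutoff is an assumption, not a value taken from the report.
    """
    return [[1 if value < cutoff else 0 for value in row] for row in gray]


# A tiny 3x4 "scan": one dark horizontal line crossed by one dark column.
gray = [
    [250, 240, 30, 245],
    [20, 25, 15, 35],
    [255, 250, 40, 248],
]
binary = threshold(gray)
```

The later sketches on this page use the same convention: an image is a list of rows, and 1 means black.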

Brief Synopsis
In an off-line optical music recognition system, the musical score is first scanned with an optical scanner. The output of this process is a bitmap, which is the input to the optical music recognition.
The bitmap is analyzed to detect the staff lines. Details of this are discussed in section 5.1. After this there are two ways to proceed. One is to remove all staff line segments that do not overlap with other musical symbols. This isolates the musical symbols that were connected by the staff lines. The other is to keep the staff lines. With this option, recognition in the later stages must use techniques that work in the presence of staff lines; for example, template matching methods can be used. We use the first approach. Figure 5 explains more.


                                                       Figure 5: Removal of staff lines
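The first option can be sketched as follows. This is only one simple way to decide whether a staff line segment overlaps a symbol (it checks the pixels directly above and below the line); the function name and the rule itself are assumptions, not the project's actual code.

```python
def remove_staff_line(binary, top, thickness):
    """Erase a horizontal staff line occupying rows top .. top+thickness-1.

    A column of the line is erased only if the pixels just above and just
    below the line are both white, i.e. no symbol crosses the line there.
    """
    height = len(binary)
    width = len(binary[0])
    result = [row[:] for row in binary]
    bottom = top + thickness  # first row below the line
    for x in range(width):
        above = binary[top - 1][x] if top > 0 else 0
        below = binary[bottom][x] if bottom < height else 0
        if not above and not below:
            for y in range(top, bottom):
                result[y][x] = 0
    return result


# One staff line (row 1) crossed by a vertical stroke in column 2.
image = [
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]
cleaned = remove_staff_line(image, top=1, thickness=1)
```

After the call only the vertical stroke remains, so the symbol is no longer joined to its neighbours through the staff line.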

Once the staff lines have been detected, the next step is to detect the bar lines. Detecting the bar lines helps to separate a musical score into smaller bar units. Because the image can then be partitioned into smaller units, the optical music recognition can work more quickly and requires less memory. Section 5.3 explains this in more detail.

With the preprocessing stage done, we come to the recognition of symbols. The image is processed bar unit by bar unit. First, the note symbols are recognized. Next comes the recognition of attributive symbols. Recognition of symbols is done using neural nets. Section 5.4 explains this in detail.

After recognising the symbols, the output is written in a predefined format. This format is read by another program which then creates a MIDI equivalent of the original musical score. This can now be played on any sound card.

5.1 Detection of staff lines

Staff lines play a major role in optical music recognition. Most musical symbols of a musical score are laid out around the staff lines in a two-dimensional manner. The horizontal axis is the time axis, while the vertical axis gives the pitch of note symbols.

To a human reader, staff lines are important because they help the reader to find precisely the vertical position of a note. From this information, the human reader can know the pitch of the note.

Interestingly, the importance of the staff lines to an optical music recognition system is quite different. Computers can often accurately find the position of a note symbol on the vertical axis without the aid of the auxiliary lines. However, the staff lines embed some other information that is very important for optical music recognition.

The following information is important, for the reasons given.

1. The thickness of the staff lines
   The thickness of the staff lines in pixel units tells the optical music recognition system about the quality of printing of the original musical score and the resolution of the scanning process used to convert the score to a bitmap. Hence it is used to set up many thresholds and acts as the tolerance value for many measurements and comparisons.

2. Staff spacing
   The amount of space between adjacent staff lines gives the optical music recognition system a very important hint about the resolution of the scanned bitmap, as well as the size of the score printing. The staff spacing gives a size normalization that is useful for the subsequent recognition stages. Sizes and distances can be measured in units that are normalized to the staff spacing. This avoids the inflexibility of absolute measures and static threshold values.

3. The inclination of the staff lines
   Most of the time, the staff lines in the bitmap of the original musical score are not exactly horizontal. This is because -


   The inclination of the staff lines lets the recognition system know the image skew. This can help to improve the accuracy of the recognition system. For example, when the skew is too large, the system may rotate the image before further recognition.

   Although the staff lines contain such useful information, their presence makes optical music recognition difficult.

  1. The staff lines graphically connect most musical symbols, thus interfering with the recognition of the symbols.
  2. Staff lines disturb the contours of the musical symbols.
  3. Musical symbols that have `hollow' regions, such as sharp and flat symbols and half notes, may have their hollow regions intersected by a staff line. The staff line runs through the hollow region, dividing it into two separate regions. This makes the recognition of such symbols difficult.

    So the staff lines act, to some extent, as noise in the recognition of musical symbols. Yet without identifying and locating them it is impossible to recognize the musical score: at the very least, the pitch of a note cannot be determined if the staff lines are not identified.

    While staff lines make the recognition of musical symbols difficult, the musical symbols also make the identification of staff lines difficult. The presence of musical symbols acts as noise in the staff line identification process. This is especially true for symbols that are long and thin, e.g. slurs. Unfortunately, such ``noise'' is substantial. The signal to ``noise'' ratio is just too low. Many general image processing techniques fail. Methods specially tailored for optical music recognition are needed.

    Projection on the vertical axis

    Perhaps the most straightforward method to locate the staff lines is to project the whole image onto the vertical axis of the image. This is illustrated in Figure 6. A group of five equally spaced peaks in the projection reflects the presence of a staff. The staff line thickness can be found from the width of a peak, and the staff spacing is the distance between successive peaks within a group of five. Then, what about the staff line inclination?

    The inclination is estimated using the Hough transform. This method is described in detail in the next section.


    Figure 6: A fragment of a musical score and its vertical projection
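    The projection method can be sketched in Python as below. The `min_fill` fraction, the use of medians and all function names are assumptions chosen for the example, and the image is assumed to be free of skew.

```python
def row_projection(binary):
    """Project the image onto the vertical axis: black pixels per row."""
    return [sum(row) for row in binary]


def find_staff_lines(binary, min_fill=0.5):
    """Return (top_row, thickness) for each peak in the row projection.

    A row belongs to a peak if at least min_fill of its pixels are black
    (an assumed threshold).
    """
    width = len(binary[0])
    lines = []
    start = None
    # The trailing 0 is a sentinel that closes a peak touching the last row.
    for y, count in enumerate(row_projection(binary) + [0]):
        if count >= min_fill * width:
            if start is None:
                start = y
        elif start is not None:
            lines.append((start, y - start))
            start = None
    return lines


def staff_metrics(lines):
    """Median line thickness and median distance between successive lines."""
    thicknesses = sorted(thickness for _, thickness in lines)
    gaps = sorted(b[0] - a[0] for a, b in zip(lines, lines[1:]))
    return thicknesses[len(thicknesses) // 2], gaps[len(gaps) // 2]


# Synthetic staff: five lines, 2 pixels thick, 5 pixels apart.
height, width = 26, 10
image = [[0] * width for _ in range(height)]
for top in (2, 7, 12, 17, 22):
    for y in (top, top + 1):
        image[y] = [1] * width
image[4][3] = 1  # a stray symbol pixel, too small to form a peak

lines = find_staff_lines(image)
thickness, spacing = staff_metrics(lines)
```

    Taking medians means that the larger gap between two staves on the same page does not distort the spacing estimate, since gaps inside a staff outnumber gaps between staves.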


    5.2 Hough Transform

    The Hough transform, patented by Hough (1962), is a method commonly used in image processing for locating straight lines in an image. It can find lines in all orientations and positions. In short, the Hough transform is a voting process in which each black pixel of the image votes for the candidate lines that it could belong to. Candidate lines that get higher vote counts correspond to lines in the image.
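    The voting process can be illustrated with a small sketch. It uses the common (angle, distance) parametrization of a line, rho = x cos(theta) + y sin(theta), which is an assumption here; the function name and the coarse set of angles are also illustrative only.

```python
import math
from collections import Counter


def hough_lines(binary, angles_deg):
    """Let every black pixel vote for the lines that pass through it.

    A candidate line is identified by (angle, rho), where angle is the
    direction of the line's normal in degrees and rho is the rounded
    distance of the line from the origin. A horizontal line therefore
    has angle 90.
    """
    votes = Counter()
    for y, row in enumerate(binary):
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            for angle in angles_deg:
                theta = math.radians(angle)
                rho = round(x * math.cos(theta) + y * math.sin(theta))
                votes[(angle, rho)] += 1
    return votes


# A 5x5 image containing a single horizontal line in row 2.
image = [[0] * 5 for _ in range(5)]
image[2] = [1] * 5

votes = hough_lines(image, angles_deg=[0, 45, 90, 135])
best_line, best_count = votes.most_common(1)[0]
```

    All five pixels agree on the candidate (90, 2), the horizontal line at row 2, while every other candidate collects at most two votes.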

    Although staff lines are long, thin straight lines in the musical score and the Hough transform can detect straight lines in an image, empirical results showed that the Hough transform is not a robust method for staff line identification. The following are some suggested reasons for the failure. Nonetheless, we have used this method in our approach.

    1. The large number of musical symbols present on the musical score acts as noise in the identification, making the Hough transform report more lines than desired.
    2. The staff lines do not appear strictly straight in the image. This may be caused by printing errors in the score, or by errors introduced by the scanner. As the Hough transform is for straight lines, it may not give good results for slightly curved lines.
    3. The staff lines in the image have a substantial thickness. As a result, a ``thick'' staff line may contain one or more thin lines. These thin lines are voted for by the staff line, causing false lines to be reported. See Figure 7.




                                                   Figure 7: A thick line can contain several thin lines.

    Our work
    Some modifications were tried to reduce the effect of (1) above. One of them was to restrict the slope of the candidate lines to the range [-1, 1] (that is, a skew of at most +-45 degrees). This prevents bar lines and note stems from being captured. Furthermore, long vertical runs of black pixels are not considered by the transform, because they are most probably not part of a staff line (remember that staff lines are thin). So only short vertical runs of black pixels were allowed to vote for the candidate lines.

    To keep the thickness of the staff lines from causing problems, we took the Hough transform of the whole image and then found the overall angle by which the whole image is inclined; this cancels out the variations due to the thickness of the staff lines. We then rotate the image accordingly to get an unskewed image. This in turn helps in bar line recognition and symbol identification.
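    The two restrictions and the overall skew estimate can be sketched as follows. This is one possible reading of the description, not the project's code: lines are written as y = slope * x + intercept, each candidate angle is scored by the sum of squared vote counts, and the names `short_run_mask` and `estimate_skew` are hypothetical. The final rotation of the image is left out.

```python
import math
from collections import Counter


def short_run_mask(binary, max_run):
    """Keep only black pixels lying in vertical runs of at most max_run."""
    height, width = len(binary), len(binary[0])
    mask = [[0] * width for _ in range(height)]
    for x in range(width):
        y = 0
        while y < height:
            if binary[y][x]:
                start = y
                while y < height and binary[y][x]:
                    y += 1
                if y - start <= max_run:
                    for yy in range(start, y):
                        mask[yy][x] = 1
            else:
                y += 1
    return mask


def estimate_skew(binary, max_run, candidates_deg):
    """Return the candidate angle (degrees) whose lines gather the most
    concentrated votes. Candidates should stay within +-45 degrees, so
    the slope is in [-1, 1]. Rows grow downwards, so a positive angle
    means lines that descend towards the right."""
    mask = short_run_mask(binary, max_run)
    best_angle, best_score = None, -1
    for angle in candidates_deg:
        slope = math.tan(math.radians(angle))
        votes = Counter()
        for y, row in enumerate(mask):
            for x, pixel in enumerate(row):
                if pixel:
                    votes[round(y - slope * x)] += 1  # intercept bin
        score = sum(count * count for count in votes.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# One staff line skewed by 5 degrees, plus a long vertical stem.
height, width = 30, 60
slope = math.tan(math.radians(5))
image = [[0] * width for _ in range(height)]
for x in range(width):
    image[round(10 + slope * x)][x] = 1
for y in range(height):
    image[y][20] = 1

skew = estimate_skew(image, max_run=3, candidates_deg=[-10, -5, 0, 5, 10])
```

    The stem is a vertical run of 30 pixels, so it casts no votes at all, and the staff line's votes fall into a single intercept bin only at the correct angle.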

    The Hough transform is a rather computationally expensive operation, with a time complexity of about O(n^3), where n is the maximum of the image height and image width, in pixels.

    5.3 Detection of bar lines

    The next processing step is the detection of the bar lines. Bar lines are thin, vertical lines in a musical score. Unlike staff lines, bar lines are seldom intersected by other musical symbols. This makes the detection of the bar lines easier.

    Our Approach
    We have not specifically treated slurs that cross over bar lines. Bar lines show up as sharp peaks in the projection of the image onto the horizontal axis. Moreover, the symbol density close to the bar lines is low, which makes their detection easier. Our method is robust in the sense that it can still detect the bar lines when symbols are closely packed around them, or even when slurs cross them.
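    A minimal sketch of peak detection in the column projection is shown below. It assumes that a bar line is a column that is black over nearly the whole height of the staff, which separates it from note stems; the `min_fill` value and the function name are assumptions.

```python
def find_bar_lines(binary, staff_top, staff_bottom, min_fill=0.95):
    """Return (start_column, end_column_exclusive) for each bar line.

    The image is projected onto the horizontal axis, counting only the
    rows from staff_top to staff_bottom inclusive. A column counts as
    part of a bar line if at least min_fill of those rows are black.
    """
    span = staff_bottom - staff_top + 1
    width = len(binary[0])
    profile = [
        sum(binary[y][x] for y in range(staff_top, staff_bottom + 1))
        for x in range(width)
    ]
    bars = []
    start = None
    # The trailing 0 is a sentinel that closes a peak in the last column.
    for x, count in enumerate(profile + [0]):
        if count >= min_fill * span:
            if start is None:
                start = x
        elif start is not None:
            bars.append((start, x))
            start = None
    return bars


# A 9-row staff with bar lines in columns 0 and 11 and a stem in column 5.
height, width = 9, 12
image = [[0] * width for _ in range(height)]
for y in (0, 2, 4, 6, 8):
    image[y] = [1] * width          # five staff lines
for y in range(height):
    image[y][0] = image[y][11] = 1  # two bar lines
for y in range(0, 6):
    image[y][5] = 1                 # a note stem, shorter than the staff

bars = find_bar_lines(image, staff_top=0, staff_bottom=8)
```

    The stem column is black in 8 of the 9 rows, which falls just below the threshold, so only the two bar lines are reported.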

    5.4 Recognition of note symbols

    The note symbols are recognised using a neural net.

    What is a neural net ?
    A neural network is a computational model that shares some of the properties of brains: it consists of many simple units working in parallel with no central control. The connections between units have numeric weights that can be modified by the learning element.

    Why did we implement neural nets ?
    Initially we were in two minds: use either neural nets or Huet's method to recognize the notes. In the end we decided upon the neural net implementation.

    Firstly, we are recognizing only printed music. Unlike text, printed music doesn't come in different fonts; only the relative sizes of symbols differ. Since our aim was not to recognize handwritten music, we had only a small number of symbols to recognize.

    Secondly, recognition of the symbols by the neural net was fast enough. This is a positive point for reading music on-line.

    Pitfalls
    The net takes an appreciable time to learn symbols. Since it behaves like a black box, we don't have much control over the functioning of the net. Also, if two symbols in the character set resemble each other too closely, there might be some errors in recognising these symbols.

    Specifications of the net
    The neural net we are implementing consists of one input layer with 81 nodes and one hidden layer with 100 nodes. The output layer consists of 10 nodes, each standing for a unique symbol.


                                                                   Figure 8: A simple neural net.
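    To make the layer sizes concrete, here is a sketch of a forward pass through an 81-100-10 network. Only the layer sizes come from the text above. The sigmoid activation, the random untrained weights, and the reading of the 81 inputs as a flattened 9x9 glyph are all assumptions, and training is not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(n_in, n_out, rng):
    """One weight row per output unit; the last entry of a row is the bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(layer, inputs):
    """Compute the activations of one fully connected layer."""
    return [
        sigmoid(sum(w * v for w, v in zip(weights, inputs)) + weights[-1])
        for weights in layer
    ]


def classify(hidden, output, pixels):
    """Return the index of the strongest output unit and all ten scores."""
    scores = forward(output, forward(hidden, pixels))
    label = max(range(len(scores)), key=scores.__getitem__)
    return label, scores


rng = random.Random(0)
hidden = make_layer(81, 100, rng)   # 81 inputs -> 100 hidden units
output = make_layer(100, 10, rng)   # 100 hidden units -> 10 symbol classes

pixels = [0.0] * 81                 # e.g. a 9x9 binary glyph, flattened
label, scores = classify(hidden, output, pixels)
```

    With untrained weights the chosen label is meaningless; the point is only the shape of the computation, in which the symbol whose output node is most active is taken as the result.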

    6. MIDI



    What is MIDI
    MIDI is to a sampled audio waveform as sheet music is to a compact disc recording. MIDI stands for Musical Instrument Digital Interface and is a standard for the digital communication of musical data. MIDI allows you to connect various MIDI-compatible music devices (synthesizers, for instance) together and control them from other MIDI-compatible equipment (a computer, for instance).

    MIDI file format
    MIDI has also defined a file format for the interchange and playback of what is, effectively, a binary form of sheet music. A MIDI file specifies which notes should be played at what time and by which instruments, in order to create a piece of music. A piece of software on the host computer interprets the commands in the MIDI file and causes the hardware to output the correct notes of the musical instrument.

    MIDI is binary data, and a MIDI file is therefore a binary file. You can't load a MIDI file into a text editor and view it. (Well, you can, but it will look like gibberish, since the data is not ASCII, i.e. text. Of course, you can use my MIDI File Disassembler/Assembler utility, available on this web site, to convert a MIDI file to readable text.)

    MIDI files are not specific to any particular computer platform or product.

    Our Work
    After the symbols are recognised, an intermediate output is generated, which is read by a routine and converted into the equivalent standard MIDI format. Only single-track MIDI files are generated, as the musical score is assumed to contain the voice of only one instrument, i.e. guitar. The time specification, pitch etc. of the notes to be played are picked from the intermediate format itself.

    A binary MIDI file is written using some routines. The MIDI file thus generated can be played on any sound card (assuming that software which understands the MIDI format is installed).
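    The layout of a single-track standard MIDI file can be sketched in Python as below. This is not the project's routine: the input format (a list of pitch and duration pairs), the function names, and the default tick resolution, channel and velocity are assumptions made for the example.

```python
import struct


def var_len(value):
    """Encode a non-negative integer as a MIDI variable-length quantity."""
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(out))


def build_midi(notes, ticks_per_quarter=480, channel=0, velocity=64):
    """Build a format 0 (single-track) MIDI file as a bytes object.

    notes is a list of (midi_pitch, duration_in_ticks) pairs that are
    played one after another.
    """
    track = bytearray()
    for pitch, duration in notes:
        # Note on, immediately after the previous event.
        track += var_len(0) + bytes([0x90 | channel, pitch, velocity])
        # Note off, after the note's duration has elapsed.
        track += var_len(duration) + bytes([0x80 | channel, pitch, 0])
    track += var_len(0) + b"\xFF\x2F\x00"  # end-of-track meta event
    # Header chunk: length 6, format 0, one track, time division.
    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, ticks_per_quarter)
    return header + b"MTrk" + struct.pack(">I", len(track)) + bytes(track)


# Three quarter notes: C4, D4, E4 (MIDI pitches 60, 62, 64).
data = build_midi([(60, 480), (62, 480), (64, 480)])
```

    Writing `data` to a file opened in binary mode (for example with `open("score.mid", "wb")`, a hypothetical file name) gives a file that MIDI players should accept.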

    7. INPUT OUTPUT

    Here the input is a scanned image of the printed music, and the output is sent to the sound card of a system to play the recognised music. Initially we plan to eliminate the need to recognize the key signature, which tells what key the music is played in. Apart from that, we assume that symbols such as staccato, portato and accent marks are absent from the scanned input.
    Sample Input file.
    Sample output file(MIDI FILE).

    8. Conclusions

    In this project, optical music recognition was investigated. The existence of staff lines makes optical music recognition unique among optical recognition topics. Since musical symbols are connected together by staff lines, recognizing the symbols successfully requires special techniques that are tailored for optical music recognition.

    The staff lines are important since they tell the size of the symbols, the quality of scanning and the image skew. However, they are at the same time noise in the recognition of the musical symbols on the score. So, intuitively, staff lines should be removed before further processing. However, there are some optical music recognition systems that employ template matching techniques to recognize the symbols in the presence of staff lines.

    With the staff lines removed, the musical symbols can be isolated from each other. A neural network approach was used to recognize them. Converting the musical score to a representation that captures its information completely is also a tedious task. The final output was written in the form of a MIDI file.

    8.1 Further Improvements

    At present, the program can recognize only a few symbols, and only when they are not printed too closely together. We are trying to train our neural net for as many symbols as possible. Up to now we have been able to identify the treble clef, the time signature, the whole note and a full note. The program ignores tempo, rests, accidentals and duration dots. Here is a list of the places where further improvements to the program are suggested:

    Add the capability of recognising rests and duration dots
    Duration dots should be easy to identify because of their size. After staff line removal, any remaining symbol that has a square bounding box with a side length smaller than half the staff space should be a duration dot. Whole and half rests appear as short strokes whose height is approximately half the staff spacing. They are rectangular, so they can be identified by examining the size of the bounding box as well as the symbol area. Whole rests and half rests differ in the position at which they appear on the staff. Other rest symbols have approximately the same size as one another, so a check on the size of the bounding box should be able to identify them.
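    The bounding-box tests suggested above could look like the following sketch. The tolerance of one pixel, the 90% fill ratio and the function names are assumptions, and the bounding boxes and areas are taken as already computed by an earlier segmentation step.

```python
def box_size(box):
    """box = (left, top, right, bottom) in inclusive pixel coordinates."""
    left, top, right, bottom = box
    return right - left + 1, bottom - top + 1


def is_duration_dot(box, staff_space, tolerance=1):
    """A roughly square box whose side is under half the staff space."""
    width, height = box_size(box)
    return abs(width - height) <= tolerance and max(width, height) < staff_space / 2


def is_whole_or_half_rest(box, area, staff_space, tolerance=1, min_fill=0.9):
    """A wide, solid rectangle about half a staff space high.

    area is the number of black pixels of the symbol inside the box.
    """
    width, height = box_size(box)
    return (
        abs(height - staff_space / 2) <= tolerance
        and width > height
        and area >= min_fill * width * height
    )


staff_space = 10
dot_box = (0, 0, 3, 3)    # 4x4 pixels
rest_box = (0, 0, 9, 4)   # 10x5 pixels, fully black

dot_found = is_duration_dot(dot_box, staff_space)
rest_found = is_whole_or_half_rest(rest_box, area=50, staff_space=staff_space)
```

    Telling a whole rest from a half rest would then need the vertical position of the box relative to the staff lines, which this sketch does not look at.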

    Allow multiple instruments to be played simultaneously
    This could be done using the MIDI routines, which allow the voices of multiple instruments to be written to up to 16 channels, giving more engrossing music.

    Allow different time-signatures
    At present we have considered only the 4/4 time signature, as it gives a uniform beat. Other time signatures, such as 3/4, could also be allowed, although their beats would be more difficult to program using MIDI.

    SOME OF THE LINKS FROM AROUND THE WORLD

    Optical Music Recognition Group

    REFERENCES

    1) Y. Nakamura et al., "Input method of Note and Realization of Folk Music Database," TG PRL78-73, pp. 41-50,Institute of Electronics and Comm. Engineers. of Japan (IECE), (in Japanese)(1978).

    2) I. Fujinaga et al., "Issues in the Design of Optical Music Recognition System," Proc. 1989 International Computer Music Conf., Columbus, Ohio.

    3) A. Tojo and H. Aoyama, "Automatic Recognition of Music Score," Proc. 6th ICPR, Munich, W.Germany,p. 1223.

    4) T. Matsushima et al., "Automatic High Speed Recognition of Printed Music (WABOT 2 Vision System)," Proc. Int'l Conf. on Advanced Robotics , Tokyo, pp. 477-482, (1985).

