Second New Zealand Computer Science Graduate Conference


Optical Music Recognition:
A Generalised Approach

David Bainbridge
Department of Computer Science, University of Canterbury, Christchurch


ABSTRACT

The aim of Optical Music Recognition (OMR) is to automatically convert optically scanned pages of music into a machine-readable format. Significant results have been accomplished in the field; however, the static design philosophy that was originally necessary to investigate the suitability of techniques is now proving too restrictive for practical OMR systems. Sufficient work has been completed to permit the design of dynamic OMR systems that are no longer constrained in the class of music they recognise. In this report a generic OMR system is adapted to solve the problem in a general manner. In particular, details of a specially designed language, PRIMELA, are described. PRIMELA is capable of describing the arbitrary primitive graphical shapes that constitute musical features, and forms a key component of the general system.


1 Introduction

The aim of Optical Music Recognition (OMR) is to convert optically scanned pages of music into a versatile machine-readable format. Consider some of the benefits of such a system: you could listen to a piece of printed music without any training in musical notation; a clarinetist could scan a tune and have it transposed automatically; a soloist could have the computer play an accompaniment for rehearsal; an editor could read in an old edition and touch it up using a music publishing program; or a publisher could reduce an orchestrated work to a piano part with little fuss, and convert the piece to Braille at almost no extra cost.

Like other fields of Document Image Analysis (DIA), the task of recognising a page of music requires the comprehension of two-dimensional relationships. This distinguishes the problem from the extensively researched area of Optical Character Recognition (OCR), where--in its most basic form--only one-dimensional lines of text must be processed. In fact, music typically includes text, so OCR is a sub-task of OMR.

A deficiency in most OMR projects, even recent ones, is their lack of generality and extendibility. Implemented systems tend to solve particular subsets of the music recognition problem, classifying only restricted sets of symbols with hard-wired code. Although the techniques developed can be extended to increase the set of music recognised, or applied to different music problems (for example handwritten music or mediaeval music), such alterations are only possible with expert knowledge and access to the source code. Music, however, is rich in its diversity of notation. Thus, a system that is static in the set of music that it can classify is of limited use.

The primary aim of the Canterbury OMR project is the development of a general system that is capable of recognising different music notation systems such as Common Music Notation (CMN), percussion, and tablature, as well as being adaptable to the differences that occur in notation between instruments, and changes that occur over time within one music notation, past, present or future.

The problem of OMR is simplified by decomposing it into smaller tasks. There is widespread agreement on the sub-tasks that must be solved, although researchers are still investigating the suitability of different methods for solving these problems [2]. Existing work, in general, can be broken down into four categories: staff line identification, musical object location, symbol identification, and musical semantics.

In the remainder of this paper, we will present a generic system that serves to illustrate the four key stages of an OMR system by way of example, and proceed by discussing the design of a system capable of performing these stages in a general manner. In particular this paper will focus on PRIMELA (see Footnote 1), a component of the general design for recognising arbitrary primitive graphical shapes. The paper concludes by analysing the implemented work and discussing its implications.

2 A Generic OMR System

In music, information is conveyed by superimposing shapes on the staff. This differs significantly from OCR, where objects are disjoint, and it complicates the task of deciding where one object ends and the next starts. Locating the staff lines is the first step in the OMR process towards solving this problem. Staff lines can be found using horizontal projections [2]. Choosing sections for the projection on the left-hand side and right-hand side of the page helps compensate for music that has been scanned at a skew.
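
To make the projection step concrete, the sketch below (in Python, using illustrative names rather than anything from the system described here) counts black pixels per row in a narrow vertical strip and reports rows whose counts stand out as candidate staff-line positions; projecting a strip at each side of the page gives the two estimates used to compensate for skew.

  # Minimal sketch of staff-line detection by horizontal projection.
  # 'image' is a 2-D list of rows of 0/1 pixels (1 = black); all names
  # are illustrative and not taken from the system described in the paper.

  def horizontal_projection(image, col_start, col_end):
      """Count black pixels in each row, restricted to one vertical strip."""
      return [sum(row[col_start:col_end]) for row in image]

  def candidate_staff_rows(projection, strip_width, threshold=0.8):
      """Rows whose black-pixel count is close to the strip width are
      likely to be crossed by a staff line."""
      return [y for y, count in enumerate(projection)
              if count >= threshold * strip_width]

  def staff_rows_left_right(image, strip_width=40):
      """Project a strip at each side of the page; comparing the two sets
      of candidate rows gives an estimate of the scanning skew."""
      width = len(image[0])
      left = candidate_staff_rows(
          horizontal_projection(image, 0, strip_width), strip_width)
      right = candidate_staff_rows(
          horizontal_projection(image, width - strip_width, width), strip_width)
      return left, right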

Once the staff lines have been located, individual musical objects are isolated by removing the staff lines. A standard algorithm traverses the length of each line, changing black pixels to white so long as there is nothing immediately above or below the staff line at that point. Figure 1 shows a typical result of applying the algorithm. Despite care being taken to keep objects whole, the algorithm does cause some fragmentation, in particular to objects that blend tangentially with staff lines, such as a bass clef or a minim notehead. Fragmentation is an important problem that a realistic OMR system must solve.
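
A minimal sketch of this removal rule, assuming the staff-line rows have already been found and that each line is roughly one pixel thick (a thicker line would need a wider check):

  # Sketch of staff-line removal: walk along each detected staff-line row and
  # blank a black pixel only when the pixels immediately above and below are
  # white, i.e. nothing else appears to pass through that point.
  # The one-pixel line thickness is a simplifying assumption.

  def remove_staff_lines(image, staff_rows):
      height = len(image)
      for y in staff_rows:
          for x in range(len(image[y])):
              if image[y][x] == 1:
                  above = image[y - 1][x] if y > 0 else 0
                  below = image[y + 1][x] if y < height - 1 else 0
                  if above == 0 and below == 0:
                      image[y][x] = 0   # isolated staff-line pixel: erase it
      return image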

Figure 1: An example piece with staff lines removed.

With the musical objects isolated, one might believe the problem has been reduced to that of OCR. Unfortunately this is not the case. There are significant differences between the objects found in text and music. Text is characterised by one-dimensional lines of uniformly sized symbols static in shape, where each symbol is intentionally distinct from others. Music, on the other hand, is characterised by complex two-dimensional relationships between symbols of variable size and dynamic shape. Musical shapes frequently consist of multiple components and are often intentionally similar in appearance.

Rather than attempting to classify such complex shapes `straight off', an OMR system works at a sub-symbol level, detecting the simple geometric primitive shapes that make up symbols (such as note-heads, stems and beams), and assembling them into musical features, guided by musical knowledge. This step is illustrated in Figure 2, where a bounding box has been drawn around each classified musical feature.
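
As a rough illustration of the assembly idea (not the system's actual musical knowledge), adjacent primitives such as a note-head and a stem can be grouped into a single musical feature when their bounding boxes touch:

  # Illustrative grouping of primitives into a musical feature. A primitive is
  # (kind, x0, y0, x1, y1); the touching test and the notehead/stem pairing
  # are simplifying assumptions, not the system's actual knowledge base.

  def boxes_touch(a, b, slack=2):
      _, ax0, ay0, ax1, ay1 = a
      _, bx0, by0, bx1, by1 = b
      return not (ax1 + slack < bx0 or bx1 + slack < ax0 or
                  ay1 + slack < by0 or by1 + slack < ay0)

  def assemble_notes(primitives):
      heads = [p for p in primitives if p[0] == "notehead"]
      stems = [p for p in primitives if p[0] == "stem"]
      return [("note", head, stem)
              for head in heads for stem in stems if boxes_touch(head, stem)]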

Figure 2: The example piece of music with located musical features.

Finally the OMR system must interpret the musical semantics of the classified shapes. By this we mean translating the graphically assembled shapes into their musical meaning. Unfortunately there is no recognised standard for representing an interpreted page of music. In Figure 3 the musical semantics of the example piece is shown using an internal text format. Translating the format into alternative file formats is a straightforward process.

Figure 3: Musical semantics of the example piece.

3 A Generalised Design

Now that a generic OMR system has been described, let us consider the steps necessary to build a general OMR system.

First, staff lines must be located and removed. The algorithm described for the generic system, relaxed slightly to recognise staffs built from an arbitrary number of lines, provides a general strategy. However, in removing the restriction of five lines per staff, a complication can arise from text in the image. Letters from the Roman alphabet, for example, can have three prominent levels where ink is present, the letter `E' being the definitive example. For small cross-sections, this can result in false staffs being detected. Fortunately, this problem can be eradicated by insisting on an OCR pre-processing phase that recognises and removes all text.

Next we must generalise the detection of primitive musical shapes. The topic of pattern recognition covers a wide variety of problems, and there are numerous techniques applicable to detecting musical shapes. Existing work is characterised by the development of customised routines that detect individual primitives. An improvement would be a homogeneous framework that permits the arbitrary specification of graphical shapes, of which pattern recognition routine to use, and of how the results are to be processed. This, in essence, describes PRIMELA, a language designed specifically for such a role. It is discussed in detail in Section 4.

Specifying how to assemble primitives together requires musical knowledge. Again, it is common to find OMR systems where such information has been hard-wired into the code. A better approach would be to form a link to a knowledge-base system where the taxonomy is expressed abstractly. Artificial Intelligence offers various models for expressing knowledge: frame-based systems, semantic networks and rule-based knowledge bases are all capable of representing this knowledge. The author is currently investigating a rule-based approach using a derivative of Definite Clause Grammars [4].

Finally, how can the specification of musical semantics be made general? A description of this part of an OMR system is often omitted in the literature, since the operations are too specific to a particular implementation. In broad terms, this phase of an OMR system consists of multiple passes over a graph-like data structure, applying the effect of one class of musical feature (such as key signatures) to other groups of musical features (such as notes). A generalised form of this stage must be a dynamic component of the system.
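
As an illustration of one such pass, the sketch below applies a key signature to the notes that follow it. The flat, position-ordered list of features is a simplifying stand-in for the graph-like data structure, and bar-line handling is ignored.

  # One illustrative semantics pass: apply the current key signature to the
  # notes that follow it. The flat, position-ordered list of features is a
  # stand-in for the graph-like data structure mentioned in the text.

  def apply_key_signatures(features):
      """features: list of dicts ordered left to right, e.g.
      {'kind': 'key_signature', 'sharps': ['F', 'C']} or
      {'kind': 'note', 'letter': 'F', 'accidental': None}."""
      current_sharps = set()
      for feature in features:
          if feature['kind'] == 'key_signature':
              current_sharps = set(feature['sharps'])
          elif feature['kind'] == 'note':
              if feature['accidental'] is None and feature['letter'] in current_sharps:
                  feature['accidental'] = 'sharp'
      return features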

4 Arbitrary Primitive Specification 

PRIMELA is an LALR language. It uses familiar computer language constructs, such as block structures and `;' as a terminator, and follows a C++ style of syntax. It has been successfully implemented using Yacc. A `program' of primitive descriptions is usually structured as a top-level file `including' a suite of primitives, where each primitive description corresponds to one file. A single primitive description is similar to a C++ object with three compulsory sections: graphical description, variable instantiation, and matching control.

It is envisaged that there are three types of people involved with a general OMR system: the implementor, a `primitive design engineer', and a musical user. The implementor has expert knowledge in both computing and music. However, once the general system has been written, the implementor is no longer required. The role of the primitive design engineer is to design and maintain the dynamic configuration of the system, and thus requires a degree of computer and music knowledge. The musical user, on the other hand, is only interested in running the system, so requires only rudimentary computer skills coupled with the musical knowledge specific to the class of music being processed.

Currently the implementation of PRIMELA supports four types of pattern recognition algorithm in a homogeneous manner. The categories are: projections (vertical and horizontal) [2], the Hough transform [3], template matching [6], and slicing techniques [1], though there is no reason why this cannot be extended to include other techniques in the future.
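
Of the four categories, template matching is the simplest to illustrate. The sketch below scores a candidate region against a binary template by pixel agreement; the scoring rule is an illustrative choice, not the routine used in the implementation.

  # Illustrative template match: score one candidate region of the image
  # against a binary template of the same size. The agreement score is a
  # simplifying choice for illustration only.

  def template_match_score(region, template):
      """region, template: equal-sized 2-D lists of 0/1 pixels.
      Returns the fraction of template pixels the region agrees with."""
      rows, cols = len(template), len(template[0])
      agree = sum(1 for y in range(rows) for x in range(cols)
                  if region[y][x] == template[y][x])
      return agree / float(rows * cols)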

4.1 Graphical Description

The graphical section of a description comes first. It defines an arbitrarily complex shape and is similar in notation to that used in drawing packages. In fact this part of the primitive description can be generated using a customised drawing package.

The graphical section specifies a sequence of drawing operations, each of which either defines a line shape (rectangle, ellipse, etc.) or performs a flood-fill operation. Shapes can be combined into hierarchical blocks, which can be subjected to transformations (scaling, rotation, and translation; see Footnote 2). The canvas supports multiple region values, the motivation being the ability to increase the relative importance of some lines or shapes within a primitive compared with other parts of the same primitive.

For the purposes of music recognition, the defined shape must be resolution independent. Freeform curves are an ideal choice [5]. Not only are they the natural choice when defining primitives such as a note tail or a bass clef, they are also a super-set of the conic sections. This means that all shapes (line, rectangle, ellipse, etc.) that the language offers can be generated using freeform curves. Such a homogeneous treatment of graphical shapes simplifies the implementation. Freeform curves are also well behaved under affine transformation: a freeform curve can be scaled, translated, and rotated by first transforming its control points and then generating the curve. Thus a transformed freeform curve naturally retains the resolution-independent property.
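
This property can be demonstrated with a cubic Bezier segment (one common freeform curve; Figure 4 uses a B-spline): transforming the control points and then evaluating the curve yields the same points as evaluating first and transforming afterwards.

  # Demonstration that a freeform curve can be transformed by transforming its
  # control points. A cubic Bezier segment is used here for brevity; the same
  # argument applies to B-splines. All names are illustrative.

  def bezier_point(control, t):
      """Evaluate a cubic Bezier at parameter t, control = four (x, y) pairs."""
      (x0, y0), (x1, y1), (x2, y2), (x3, y3) = control
      s = 1.0 - t
      x = s**3*x0 + 3*s**2*t*x1 + 3*s*t**2*x2 + t**3*x3
      y = s**3*y0 + 3*s**2*t*y1 + 3*s*t**2*y2 + t**3*y3
      return x, y

  def scale_and_translate(point, sx, sy, dx, dy):
      x, y = point
      return x * sx + dx, y * sy + dy

  control = [(0, 0), (1, 3), (3, 3), (4, 0)]
  scaled_control = [scale_and_translate(p, 2, 2, 5, 1) for p in control]

  t = 0.3
  direct = scale_and_translate(bezier_point(control, t), 2, 2, 5, 1)
  via_control_points = bezier_point(scaled_control, t)
  assert all(abs(a - b) < 1e-9 for a, b in zip(direct, via_control_points))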

4.2 Variable Instantiation  

The next section of a primitive description instantiates variables. There are two distinct categories of variable: searching option variables and dynamic variables.

Searching option variables control how rectangular areas for searching are defined, which image to use for matching, as well as any post processing steps such as noise filtering and joining fragmented shapes together.

Searching option variables play an important role in dealing with fragmented objects. The first choice a primitive design engineer must make is whether the entire page is searched for the primitive, or just rectangular regions based on the isolated shapes located after staff line removal. In the latter case, if fragmentation of the primitive is possible, variables are specified that extend the perimeter of the bounding box. Enhancements are possible by controlling when the matching algorithm is applied, for example, only when the current pixel under consideration is black.
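
As a rough sketch of the bounding-box extension, assuming an isolated shape's box is given as (x0, y0, x1, y1) and the per-side margins stand in for the searching option variables:

  # Illustrative bounding-box extension for a possibly fragmented primitive.
  # (x0, y0, x1, y1) is the box of one isolated shape after staff removal;
  # the per-side margins stand in for PRIMELA searching option variables.

  def extended_search_region(box, left=0, right=0, top=0, bottom=0,
                             image_width=None, image_height=None):
      x0, y0, x1, y1 = box
      x0, y0 = max(0, x0 - left), max(0, y0 - top)
      x1 = x1 + right if image_width is None else min(image_width - 1, x1 + right)
      y1 = y1 + bottom if image_height is None else min(image_height - 1, y1 + bottom)
      return x0, y0, x1, y1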

A searching option variable also controls which image the match is performed in. The choice is between the original image and the image from which the staff lines have been removed. The second image is always used to define the rectangular region; however, when the pattern recognition routine is called, the image passed to the matching function depends on this searching option variable.

Dynamic variables are used in the matching control block of the primitive description. The variable instantiation block is where they are initialised. Commonly used dynamic variables are thresholds that a pattern recognition routine must exceed to be considered as a match, and limits for noise filtering.

4.3 Matching Control

There are several sequential operations to matching. Not all are mandatory, and default behaviours exist for those omitted. First a pre-condition may be applied to the specified rectangular region, such as insisting the number of black pixels in the region exceeds a certain value, and is usually included to improve efficiency. Only if the condition is satisfied does the matching algorithm proceed to the next operation, rectangle dimension modification. This stage is primarily for noise tolerance and alters, if the designer desires, the dimensions of the rectangular region.

Once these pre-processing operations have been performed, the pattern recognition routine specified in the graphical section is invoked. Each primitive found in the region is copied and then subjected to a post-condition check. Only if the detected primitive passes this check is its certainty value calculated. A certainty value is in the range [0.0, 1.0], where `1.0' represents a definite match. A definite match is removed, whereas a lower-rated match remains in the image, since it may produce a better match with a different primitive description. Certainty values are used in the assembly stage of the OMR process to judge the overall likelihood of the constructed musical feature.

Finally if the noise option has been specified, the area from which the primitive has been removed is `cleaned up'. This is accomplished by examining all the isolated shapes that remain, removing those that meet the condition specified by the designer, such as a maximum width and height.
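
The control flow just described, from pre-condition through to noise clean-up, can be sketched as follows. Every step is reduced to a caller-supplied placeholder function, so only the ordering of operations is illustrated; none of the names correspond to actual PRIMELA constructs.

  # Skeleton of the matching control flow: pre-condition, rectangle adjustment,
  # pattern recognition, post-condition, certainty, removal and noise clean-up.
  # Every step is a caller-supplied function; only the ordering is illustrated.

  def match_primitive(image, region, steps, definite=1.0):
      if not steps['pre_condition'](image, region):
          return []
      region = steps['adjust_region'](region)
      results = []
      for found in steps['recognise'](image, region):
          if not steps['post_condition'](found):
              continue
          certainty = steps['certainty'](found)        # value in [0.0, 1.0]
          results.append((found, certainty))
          if certainty >= definite:                    # definite match:
              steps['remove'](image, found)            # take it out of the image
              steps['clean_noise'](image, found)       # tidy fragments left behind
      return results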

4.4 An Example PRIMELA Description

  


Figure 4: A PRIMELA description for a treble clef.

In Figure 4 the PRIMELA code specifying a treble clef is shown. The description specifies a match by horizontal projection, where the graphical section defines an ideal projection outline. This was accomplished by extracting a representative treble clef from a page of music and automatically generating its horizontal projection. Using the projection as a backdrop for the customised drawing package, a freeform curve (a B-spline in this case) was fitted to the outline shape.

The instantiation section of the description specifies that the isolated shapes be used as the bases for rectangular regions, as well as defining some dynamic variables used in the matching process. In the matching block, there is no pre-condition or post-condition to satisfy; however, the y-dimension of the rectangular area is increased slightly to compensate for noise. By leaving the brackets, {% %}, empty for the copy and remove statements, their default behaviour of copying or removing (respectively) every pixel in the rectangular region is invoked.
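
To make this matching step concrete, the sketch below compares a candidate region's horizontal projection with an ideal outline sampled at the same rows; the normalisation and similarity measure are illustrative choices, not the scoring used by the description in Figure 4.

  # Illustrative projection match for a primitive such as the treble clef:
  # compare a candidate region's horizontal projection with an ideal outline.
  # The similarity measure and normalisation are assumptions for illustration.

  def normalised_projection(region):
      counts = [sum(row) for row in region]
      peak = max(counts) or 1
      return [c / float(peak) for c in counts]

  def projection_certainty(region, ideal_outline):
      """ideal_outline: ideal projection values in [0, 1], one per region row."""
      observed = normalised_projection(region)
      error = sum(abs(o - i) for o, i in zip(observed, ideal_outline)) / len(observed)
      return max(0.0, 1.0 - error)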

5 Conclusions

PRIMELA is still in prototype form, and it is anticipated that the language will be altered in the future, as the system is more stringently tested. Initial results are encouraging. Examples of CMN and Gregorian chant music have been processed successfully.

The implemented solution to fragmentation is, as far as the author knows, novel to OMR. Since it is a general strategy it is more flexible than earlier ad hoc methods; however, there is an increase in computational cost associated with it. This is not viewed as problematic, since the application of the fragmentation algorithm is under the control of the primitive description designer and hence is used only where needed.

The system being developed at Canterbury has illustrated that each key component of a generic system is possible. The current stage of work is to combine these processes into a uniform, usable package.

Acknowledgements

The author would like to thank his supervisor, Tim Bell, for his continued support and motivation. Thanks also to Paddy Krishnan, for his comments on earlier drafts of this paper. The work was made possible through funding by the New Zealand Vice-Chancellors' Committee.

References

1
David Bainbridge. Preliminary Experiments in Musical Score Recognition. BEng thesis, Department of Computer Science, University of Edinburgh, June 1991.

2
Dorothea Blostein and Henry S. Baird. A critical survey of music image analysis. In H. S. Baird, H. Bunke, and K. Yamamoto, editors, Structured Document Image Analysis. Springer, 1992.

3
R.D. Boyle and R.C. Thomas. Computer Vision: A First Course. Artificial Intelligence Texts. Blackwell Scientific Publications, 1988.

4
Ivan Bratko. Prolog Programming for Artificial Intelligence. Addison-Wesley Publishing Company, 1990.

5
James D. Foley, Andries van Dam, Steven K. Feiner, and John F. Hughes. Computer Graphics: Principles and Practice. Addison-Wesley Publishing Company, second edition, 1990.

6
Ian Witten, Alistair Moffat, and Timothy Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994.

Footnotes

1. PRIMELA is an acronym for PRIMitive Expression LAnguage.
2. This last feature is purely for convenience.
 

