COMPX241 Emerging Project Ideas

The Soundtrack of our Lives

Project Manager: Ben Worsnop

Team Members: Vishaan Bhardwaj; Timothy Binu Olakkattu; Kenji Olegario; Simon Peun; and Timothy Yang

Weekly Update Time: Mon 4-5pm, J.B.07

Key Idea: Use the live CCTV footage that a restaurant or bar typically has to estimate the age of its patrons, and from that determine what the song playlist should be.

Okay, so the overall algorithm needed would go something like this:

  1. Detect where faces are in the video stream
  2. Apply age detection to the located regions with faces in them
  3. From the returned information, determine what songs to play

For Step 2, you might even find that an existing, already-developed solution implicitly includes a solution for Step 1.

There is an implied Step 0: how to get frames of content out of the video, which can be achieved through the use of widely available libraries, such as OpenCV. Steps 1 and 2 look to be most attainable through the application of trained Deep Learning models. For Step 3, working with online music APIs would allow you to retrieve information about songs, such as when a song was released and how popular it has been over time.
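To make Step 0 concrete, here is a minimal sketch using OpenCV's Python bindings; the filename cctv.mp4 is a placeholder, and saving roughly one frame per second is an arbitrary choice you would tune for your own footage.

    import cv2

    cap = cv2.VideoCapture("cctv.mp4")
    fps = cap.get(cv2.CAP_PROP_FPS) or 25   # fall back to 25 if the container doesn't report a rate

    frame_index = 0
    saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:                           # end of video (or a read error)
            break
        if frame_index % int(fps) == 0:      # keep roughly one frame per second
            cv2.imwrite(f"frame_{saved:04d}.jpg", frame)
            saved += 1
        frame_index += 1

    cap.release()
    print(f"Wrote {saved} frames")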

An alternative for Step 0 is to use a command-line video program (such as ffmpeg) to do the heavy lifting, invoked from your code through a function call such as system(my_cmd) from the Standard C Library, or its equivalent in whichever programming language you are using. To take the example just a bit further: if you were to initialise the variable my_cmd to the string ffmpeg -ss 5 -i input -frames:v 1 -q:v 2 output.jpg and pass it into system() as a parameter, then when the system() function call returns there is a JPEG image file on the file-system that corresponds to the frame that is 5 seconds into the video file. Your code can then read in that file and set it as the input to Step 1, which detects where any faces in the image occur.

One additional note to make with regards to system() is that it returns an integer, which you need to pay attention to. If it returns 0, the command executed successfully. If it returns any value other than zero, then there was a problem encountered trying to run the command, and (in our case) in all likelihood the output file wasn't generated. Your code should print out an error message, and go no further in trying to identify faces in that particular frame. Code that doesn't check return values and carries on regardless tends to ultimately crash later on.
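A minimal sketch of that alternative in Python, using subprocess in place of C's system() and checking the return code as just described; input.mp4 and output.jpg are placeholder filenames.

    import subprocess

    cmd = ["ffmpeg", "-ss", "5", "-i", "input.mp4", "-frames:v", "1", "-q:v", "2", "output.jpg"]
    result = subprocess.run(cmd)

    if result.returncode != 0:
        # Non-zero means ffmpeg hit a problem; in all likelihood output.jpg was not written,
        # so report the error and skip face detection for this frame.
        print(f"ffmpeg failed with exit code {result.returncode}")
    else:
        # output.jpg now holds the frame 5 seconds in, ready to be passed to Step 1.
        print("Frame extracted to output.jpg")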

For Steps 1 and 2, the good news is that from a quick scan of available open source projects and libraries, there look to be quite a few projects/packages out there that have worked on one or other of the required capabilities, which would assist in achieving the overall aim for The Soundtrack of our Lives project. The bad news is that from a quick scan of available open source projects and libraries, there look to be quite a few relevant projects/packages out there that will need sifting through to figure out how/if they can be used!

In terms of creative ideas, Step 3 holds the most potential. If you have determined the age of just one person, what exactly does that mean in terms of the music you should play? Now bump that requirement up to a set of users whose ages have been determined. If a mixed age group, do you play one song motivated by the age of one person, and then line up the next song based on the next age that's been determined, or is there something more sophisticated that can be done? And how about that transition from one song to the next. If you conceive of the task for this part of the project as being akin to a DJ, then there is a real art in selecting the next song to play. In which case the sorts of content analysis tools mentioned in the list of potential resources for the Musical Mashups project would be equally applicable here.

Some potentially useful resources:

Once Upon a Time there was ...

Project Manager: Hannah Carino

Team Members: Shahd Abusaleh; Shriha Deo; Josephine George Jeba Kumar; Chloe Lee; and Mahi Patel

Weekly Update Time: Mon 4-5pm, J.B.07

Key Idea: Create an editing environment that allows an author to develop interactive touch-screen hypermedia-enriched children's stories, akin to the physical LeapPad educational toy, but in a digital form. In supporting the authoring process, look to draw upon generative AI algorithms such as ChatGPT and DALL-E to assist in the production of artwork, sound effects, and even help with aspects of writing the story itself.

For some light-hearted context, writing a children's story is, well, child's play, right? Certainly that's the view that Bernard and Manny in Black Books have... to start with: Clip 1, Clip 2, Clip 3.

Similar to the Musical Mashups project, the emphasis in this project is on the editing environment needed to produce the end result. Let's refer to this as the authoring environment, to emphasise the kind of user that this project is designed for. I'm picturing a mum or dad who has a reasonable level of experience using desktop tools (through their job, for instance), who makes up stories to tell their kids as part of the going-to-bed ritual in addition to the books they read to them, and who is interested in expanding upon this through Once Upon a Time there was ...

In terms of the output produced by the authoring environment, HTML+CSS combined with JavaScript provides all the features necessary for a rich multimedia experience for a young reader. Pop that on to a tablet and you essentially have your LeapPad-equivalent digital solution. Such a statement does, however, gloss over some of the finer details of what is needed. For instance:

  • There is definitely a skill in generating an age-appropriate interface, say for kids aged in the range 4–8;
  • Operating on a touch-screen, rather than a regular screen with a computer-mouse, can introduce some additional challenges such as the fact that using a finger or stylus to interact with the displayed content can obscure the view of the screen at a critical moment; and
  • How about keeping that notion of a book going? Wouldn't it be nice if the young reader could swipe their finger from right-to-left, or left-to-right, to effect a page turn while they are reading the story?

In terms of standing up the authoring environment, for this project my first thought goes to basing the work around an open source Rich Text editor such as CKEditor or ProseMirror: something you can splice your own code into, to provide that augmented assist with generative AI tools (a sketch of calling one such API appears after the list), such as:

  1. DALL-E, for example, to generate images, such as backgrounds and individual picture elements;
  2. ChatGPT, for example, for text writing assistance through a broad range of features covering things like the style of writing used, researching background on a topic or location, and deciding on a character's name (to name just a few that come to mind); and
  3. Stable Audio, for example, for audio sound effect generation (Here's a quick experiment getting Stable Audio to generate: the sound effect of a large rock producing a splash when dropped into water).
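As a rough idea of what that splice-in code might look like, here is a hedged sketch of calling an image-generation API over plain HTTP (OpenAI's Images endpoint in this case). The endpoint, model name and response fields are quoted from memory, so check them against the current API documentation; OPENAI_API_KEY is assumed to be set in the author's environment.

    import os
    import requests

    API_KEY = os.environ["OPENAI_API_KEY"]   # assumed to be set up beforehand

    def generate_picture_element(prompt: str) -> str:
        """Ask the image API for one picture and return the URL it reports."""
        response = requests.post(
            "https://api.openai.com/v1/images/generations",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"model": "dall-e-3", "prompt": prompt, "n": 1, "size": "1024x1024"},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["data"][0]["url"]

    print(generate_picture_element("a watercolour castle on a hill, children's picture-book style"))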

It is also possible to develop add-ins/add-ons to fully formed word-processing applications such as Microsoft Word, and Google Docs.

But let's step things up a further notch for this project. How about we make the book reading a tactile experience for the young reader? We happen to be fortunate enough to have Professor Stefan Rüger visiting Waikato at the moment, from the Knowledge Media Institute (KMi) at The Open University in the UK. He is currently involved in the SENSE project, which among other things is researching pedagogical uses of a 3D printed stylus they have developed that can provide tactile responses. Check out their picture-book demo and their texture-book demos. Source code is available through their GitHub repository.

More specifically, the developed stylus can be plugged into any device that supports USB-based I/O for mouse and audio, and through the actuator embedded in the device, a user holding the stylus can be programmatically provided with different tactile (and audio) sensations, in real-time. The typical way the device is used is to play/vibrate different audio sounds depending on where the stylus is touching on the screen. Sounds like a perfect fit with the goals of Once Upon a Time there was ...

For potentially useful resources, see the hyperlinks provided in the project description text above.

Really, Guess Who?

Project Manager: Caleb Archer

Team Members: Matthew Cook; Mathew Jacob; Bishal Kandel; Eli Murray; and Alexander Trotter

Weekly Update Time: Wed 1-2pm, K.B.07

Key Idea: Take the classic board game of Guess Who? and develop a digital version that allows players to vary what the starting set of pictures is: a set of people based on real-world photos? a clip-art styled cute animals round?

This project falls into the Smoke and Mirrors coding-space of: take a classic board or card game, and reimagine it as a digital version—looking to push beyond, in particular, the static elements of the classic game, which typically follow as a consequence of its physical form. For Guess Who? the most obvious element to target is the set of pictures a particular game is played with. The genesis of the idea for this project was to imagine the game being played with photos of real people—famous people, say, where details such as colour of eyes, length of hair, etc. are known. In terms of implementation details, "known" means: attributes that are available in machine-readable form. This is needed so that, in making the selection of (famous) people a particular game will be played with, it is guaranteed that there is a unique set of answers to questions that leads to the identification of each person in that set.

Game Play. Unless you opt for a high-risk strategy, where the questions you ask deliberately target a unique feature that only one (or two) of the pictures has and you then get lucky early on in the game, the main strategy used in playing Guess Who? is to ask a question in each round that allows you to flip down (ideally) half of the pictures that remain up. With this in mind, then, when compiling the set of attributes that will be in operation during a game, move beyond having the minimal set of features that allows each person to be identified uniquely. Include some attributes that are double-ups on other attributes that identify a particular subgroup, or better still, have additional attributes that straddle across subsets of other attributes, to mix things up a bit. Note that including these additional attributes won't interfere with the ability to uniquely identify the picture that has been chosen, but it will help make the game more interesting to play.

The world of Fandom suggests that there could very well be online resources containing the sorts of details needed to develop the envisioned project of Really, Guess Who? Just a little bit of web searching turned up Celeb Heights.

If it turns out to be the case that Fandom sites do have this information, but not in sufficient quantities, or not in a form that is easy to access machine-readably, then shift your attention to Linked Open Data resources, such as Wikidata. This particular Linked Open Data source is "a free and open knowledge base" containing billions of facts about millions of things, people being just one subcategory. For example, here is a link to the entry about the actor David Tennant (Q214601 as a web page), which, along with a photo of the actor, you'll see includes details such as hair colour and height; age can be worked out from his date of birth; gender; nationality; and the types of acting he has done (stage, TV, film). The particularly handy thing about this resource is that it is also available in machine-readable form, such as David Tennant (Q214601 in JSON format).
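As a taste of the machine-readable side, here is a small sketch that pulls the JSON form of that entity and reads off a couple of properties. P569 (date of birth) is a property ID I am confident of; P2048 (height) is quoted from memory, so verify the IDs on wikidata.org before building on them.

    import requests

    def get_entity(qid: str) -> dict:
        url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
        return requests.get(url, timeout=30).json()["entities"][qid]

    claims = get_entity("Q214601")["claims"]        # David Tennant

    dob = claims["P569"][0]["mainsnak"]["datavalue"]["value"]["time"]
    print("Date of birth:", dob)                    # an ISO-like timestamp string

    if "P2048" in claims:                           # height, if recorded
        height = claims["P2048"][0]["mainsnak"]["datavalue"]["value"]
        print("Height:", height["amount"], height["unit"])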

If the number of characteristic attributes about a person across Fandom databases and/or Linked Open Data are too few, then you could augment this with attributes established using image analysis of the chosen photos. In this scenario, having selected a set of celebrity photos, image analysis could be applied to determine if a particular person is wearing a hat, or has a moustache, and so on. In setting up the gallery-board to play a game, a little bit like Google reCAPTCHA, these photos could then be shown to a user, getting them to confirm the person really is wearing a hat. Without this step, it would be possible for the software to include a photo in the gallery-board where, to the users playing the game, the celebrity is not wearing a hat, but because of the incorrect image analysis classification the software will treat it as if it is, and confusion will ensue.

Working with Linked Open Data would also allow you to expand the subject matter used in a round of Really, Guess Who? While it might take a bit more effort following the linked attributes (predicates in Linked Data parlance), more challenging sets of celebrities could be formed where questions about non-visual features are needed to establish which person is the chosen person. For example, the chosen person has won an Oscar, and has appeared in movies where Steven Spielberg was the director. This is an interesting line to pursue in the project that really leverages the transformation in the game that is possible by being in the digital realm; however, so users playing the game don't have to be Mensa-level mind-readers to divine what sort of attributes separate the different people, the software could display which predicates "are in play" for the chosen set of people. This would mean a user would get to see Academy Award (Oscar) as a question topic that could be asked, and therefore pose the question, has the chosen person won an Oscar, or even pose the question as 2 or more Oscars if they like, and are confident about what that is likely to do in terms of how many pictures they can flip down.
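To give a flavour of what following those predicates looks like in practice, here is a hedged sketch that asks the Wikidata Query Service for cast members of Spielberg-directed films together with awards they have received. The property IDs (P57 director, P161 cast member, P166 award received) are standard Wikidata properties; the item ID Q8877 for Steven Spielberg is quoted from memory, so double-check it before relying on it.

    import requests

    QUERY = """
    SELECT DISTINCT ?personLabel ?awardLabel WHERE {
      ?film wdt:P57 wd:Q8877 .          # films directed by Steven Spielberg (verify Q-id)
      ?film wdt:P161 ?person .          # ... and their cast members
      ?person wdt:P166 ?award .         # awards those people have received
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 50
    """

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "ReallyGuessWho/0.1 (student project)"},
        timeout=60,
    )
    for row in response.json()["results"]["bindings"]:
        print(row["personLabel"]["value"], "-", row["awardLabel"]["value"])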

Another way to expand on the distinguishing features of the pictures used is to go beyond basing the game around people, and instead focus on flora or fauna. Plenty more attributes to focus on here. With animals does it have fur or not? 2, 4, 6, 8 legs? And so on. Or how about going for a science angle, and have the items shown be elements of the periodic table, with the questions such as is its boiling point greater than 100 degrees Celsius? If you manage to develop a robust technique for traversing the Linked Data predicates, it could very well be that your game could even start by asking what sort of category you would like to play, and your system takes things from there!

While varying the selection of pictures used is an obvious area to target in a morphed digital version of the game, there are areas that could also be zhuzhed up. How about playing against a computer player? Or designing a game mechanic that allows for more than 2 players to compete: on your turn it could be you choose which player you wish to take a guess at. If you uniquely identify the chosen picture, that player is knocked out. With this game dynamic, it would be easy to imagine alliances forming, so to balance things out perhaps there needs to be an additional game dynamic that allows a targeted player to strike back?

A sampling of Linked Data resources:

Further potentially useful Linked Data resources provided in the How Many Elephants? project description.

How Many Elephants?

Project Manager: Hannah Murphy

Team Members: Yaman Faisal Hamid; Henry Hitchins; Savinu Mathusinghe; Garv Singh; and Enzo Zozaya de Oliveira

Weekly Update Time: Wed 2-3pm, K.B.07

Key Idea: Develop an app that makes it easier to interpret values such as height, weight, volume, time, temperature, power, force, etc.

There is a penchant in the school education system for expressing maths and science problems in terms of real-world situations. For example, here is one that turned up on Quora: Billy walked from home to school at a speed of 6km/hr. He came back on the same route at a speed of 4km/hr. If he took a total of 9000 seconds. What was his average speed?

Notwithstanding the actual maths a student needs to use to solve this problem, my first instinct is to question the first value given for Billy walking to school: 6 km/h seems a bit fast for walking, doesn't it? Here's my reasoning. When I was a kid back in the UK I have a dim recollection of being told that a "Bobby on the Beat" (aka Police Constable) was expected to walk at 2–3 miles per hour when on duty, walking their beat. Converting that to km/h (something I have a better handle on, having moved to NZ!), that comes out as 4.8 km/h tops. So (as long as my recollection is correct) yes, to me it seems a bit fast. But is my intuition correct? Through a bit of web searching, I was able to find some websites of decent repute that gave values of typical walking speeds, broken down by age, and yes this aligned with my thinking.

Now let's take the 9000 seconds part of the question. That's an odd unit to be expressing the question in. Sixty seconds in a minute, 60 minutes in an hour. OK, so that's 3600 seconds in an hour, and I'm starting to get a better feel for how long getting to and from school has taken Billy. Well over 2 hours, but not quite as long as 3 hours. Straightaway this makes the idea of maintaining 6 km/h walking to school somewhat dubious. If it is meant to be nothing out of the ordinary, then it raises the additional question of why Billy's walking speed going home is so much slower—dropping by as much as a third!
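For the record, Billy's numbers work out as follows (a quick sketch, just to fill in the arithmetic):

    total_hours = 9000 / 3600           # 2.5 hours door-to-door and back

    # If the one-way distance is d km, then d/6 + d/4 = 2.5, so:
    d = total_hours / (1 / 6 + 1 / 4)   # = 6.0 km each way

    average_speed = 2 * d / total_hours
    print(d, "km each way;", average_speed, "km/h average")   # 6.0 km each way; 4.8 km/h average

So Billy is allegedly covering 6 km each way, and his average speed of 4.8 km/h lands, amusingly, right on the Bobby-on-the-Beat top speed mentioned above.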

Sidebar: Faced with all this, if I were Billy and that's what I had to do every day to get to school, I'd be thinking about getting a bike!

The selected question about walking speeds, time and (implied) distance happens to fall into an area where I am confident about my breakdown of the numbers; however, move on to other quantities, such as weight, and I'm on more uncertain ground. If I was reading a question about an African elephant that weighs 8 tonnes, I wouldn't have the same ability to reason about whether that value is typical for such an animal: perhaps a bit on the light side, perhaps a bit heavier than usual. Or perhaps, completely off the charts. Take a moment to reflect. What do you think?

The core aim of How Many Elephants? (HME) is to develop a software environment that helps people develop a better grasp of what quantities and values actually mean. To allow a person to think about things in more tangible terms. Examples of quantities such as speed, time, distance, and weight (mass) have been mentioned. The scope of the project would be to provide assistance across a wide range of scales, and the various standards we have for measuring them. Start with the SI units, and expand upon things from there. From the seven base units, there are 22 coherent derived units including energy (Joules), force (Newtons), and frequency (Hertz), and I think many of us would dearly benefit from gaining a better intuition for what quantities expressed in these units mean.

For certain items on the News, you'll see journalists including comparisons to help viewers get a better sense of a value that has just been mentioned. If a distance, for example, they might re-express the value in terms of the equivalent number of rugby pitches (NZ), soccer pitches (UK), or basketball courts (USA). When distances get larger, how about expressing values in terms of travelling between two locations the user is familiar with, starting with where they live. One of my favourite comparisons is to do with the quantity of mined gold in the world, which "best estimates currently available suggest that [it is] around 212,582 tonnes" (World Gold Council). Using a density value of 19.3 g/cm3 for gold, and the dimensions of an Olympic-sized swimming pool (ideally with a depth of 3 metres, 50x25x3), then we arrive at the result that all the world's gold that has been mined throughout history would actually fit into 3 Olympic-sized swimming pools! I also enjoy the quirky comparisons Alex Horne, the creator and co-star of Taskmaster, sometimes gives, such as in this clip where he expressed distances measured in terms of Noel Edmonds (a British TV presenter). The software environment developed for this project could not only help in performing these sorts of conversions, it could go one step further, and even put forward some suggested conversions.
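The swimming-pool claim is easy to check, and is exactly the kind of calculation HME should be able to do on demand; a quick sketch:

    mass_g = 212_582 * 1_000_000              # tonnes -> grams
    volume_m3 = (mass_g / 19.3) / 1_000_000   # divide by density (g/cm^3), then cm^3 -> m^3

    pool_m3 = 50 * 25 * 3                     # Olympic-sized pool: 3750 m^3
    print(round(volume_m3), "m^3 of gold =", round(volume_m3 / pool_m3, 2), "pools")
    # roughly 11,000 m^3, i.e. about 2.9 Olympic-sized swimming pools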

For some comparisons it should also be possible to develop visual aids to show what this means. How many Eiffel Towers stacked upon one another does it take to equal the depth of the Grand Canyon? For the massive pipes that are built in hydro-electric power stations, how many double-decker buses driving side-by-side could fit down the widest part of the pipe? In looking to generate such visual aids, I would consider using something like the DALL-E API to generate the "to be compared against" item, most likely in a schematic/clip-art style, tightly cropped.

As to how to approach this problem, Linked Data/Linked Open Data is definitely a topic you should look into. In particular Wikidata. As noted in Really, Guess Who? this resource provides a gateway to a vast array of semantically marked up data in a machine-readable way. For example, here's the Wikidata entry about the Eiffel Tower (concept/entity, Q243) in a human-friendly form, and the exact same entity, but now in JSON format.

You'll also need to develop a capability for representing values in a wide array of units. The SI units make for a good starting point, however you will need to support more than just those, as indicated by the fact that there are then 22 coherent derived units, and that is just within the metric system. Depending on who the user is, the imperial system for distances and weights might be the more common frame of reference. Even sticking within the metric system, I doubt many people would find expressing temperature in the official SI unit of Kelvin, for instance, particularly intuitive.

Here's the kind of features I think HME should be capable of providing:

  1. The user provides a value and specifies its units.
  2. The text provided is "soft" parsed by HME and a message issued to confirm the software has correctly interpreted what sort of value has been entered (a first-cut parsing sketch appears just after this list).
  3. HME then asks something along the lines of, what would you like to do?
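Here is what a first cut of that "soft" parsing might look like; the unit aliases are illustrative only and would grow considerably in the real system.

    import re

    UNIT_ALIASES = {
        "s": "seconds", "sec": "seconds", "secs": "seconds", "seconds": "seconds",
        "km/h": "kilometres per hour", "kmh": "kilometres per hour",
        "kg": "kilograms", "t": "tonnes", "tonnes": "tonnes",
    }

    def soft_parse(text: str):
        """Pull a number and a unit word out of whatever the user typed."""
        match = re.match(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z/]+)\s*$", text)
        if not match:
            return None
        value = float(match.group(1))
        unit = UNIT_ALIASES.get(match.group(2).lower(), match.group(2))
        return value, unit

    print(soft_parse("9000 seconds"))   # (9000.0, 'seconds')
    print(soft_parse("6 km/h"))         # (6.0, 'kilometres per hour')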

There is a range of possible things the software could assist with, including but not restricted to:

  • Basic conversion. The ability to convert, for example, seconds to hours, but paying attention to displaying the converted-to value in a human-friendly way (a small formatting sketch appears after this list). So if converting a high value in seconds to hours, don't just express it as 2.25 hours; it's easier to understand it as 2 hours and 15 minutes. And if the converted number, strictly speaking, runs to a large number of decimal places, don't show that version immediately: display a value that is clearly labelled as "roughly rounds to" first, and then provide the option for revealing the exact value of the conversion. In cases where the converted number is so large (or small) that scientific notation is being used to display the number (again roughly rounded to avoid many decimal places), then have the interface provide a feature that allows the user to break down what that format means to better understand it: just how many zeros are in it, and where do they feature in the number?
  • Is equivalent to. HME could provide more than one way to support this ability. It could present a fixed list of choices, and ask the user which one (or ones) they want to select. It could also allow the user to freely pick what they want the value converted to, access Linked Data resources to see if it can determine the requested entity, and if it can—perhaps, displaying what it found to the user first, to confirm—then proceeding with the "is equivalent to".
  • Sanity/Reality check. This could be seen as a variation on the is equivalent to capability. For the number I have, if I convert it to something I have a better intuition for, does it make sense? This could be at the prompting of the user, as to what makes for a good entity to compare against, or else could have a more computer-prompted angle where it suggests some things to convert it to—things which have already been pre-determined as useful things to compare against, for a given type of unit.
  • Alex Horne inspired. Yet another variation on the is equivalent to capability, but this time coming up with some quite unusual comparisons. Could be pitched as a "surprise me" feature, in which case it would be best done through some dynamic selection of what to convert to, rather than relying on some pre-canned fixed list of entities to choose from, which would become a bit stale over time.
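On the basic-conversion bullet above, the human-friendly formatting is simple enough to sketch directly:

    def seconds_to_friendly(seconds: int) -> str:
        """Turn a raw count of seconds into a '2 hours and 15 minutes' style answer."""
        hours, remainder = divmod(seconds, 3600)
        minutes, secs = divmod(remainder, 60)
        parts = []
        if hours:
            parts.append(f"{hours} hour{'s' if hours != 1 else ''}")
        if minutes:
            parts.append(f"{minutes} minute{'s' if minutes != 1 else ''}")
        if secs or not parts:
            parts.append(f"{secs} second{'s' if secs != 1 else ''}")
        return " and ".join(parts)

    print(seconds_to_friendly(8100))   # 2 hours and 15 minutes
    print(seconds_to_friendly(9000))   # 2 hours and 30 minutes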

When I started writing this description, I was actually thinking of students working on these sorts of problems as being the intended end-user; however, having looked a bit more into example maths and science questions, I think this project would also assist educationalists in setting their questions. The question setter could have included the detail that Billy was running a bit late when leaving home in the morning, and so decided to jog in. Although I have to say, I'm still a bit troubled by the distance involved in this question. Most councils have regulations stating that if a student lives more than a certain distance away from their school (this was set at 3 miles in Dundee, where I grew up), then the council is required to provide transportation, such as a bus, to get them to school.

In addition to the Linked Data resources given in Really, Guess Who?, further potentially useful resources:

Trusting my Eyes and Ears

Project Manager: Abbie Reid

Team Members: Alexander Kashpir; Pratham Sethi; Emma Teale; and Suhani Vakil

Weekly Update Time: Wed 2-3pm, K.B.07

Key Idea: Create a way to digitally sign media such as images and audio so it is easy to verify who produced/published the image.


Thought provoking: The Guardian's "Points of View" advertising campaign, 1986.

Imagine if the original photo had been the one on the left, but with a wider field of view: as the Guardian's advertising campaign highlights (but through a different point of view), it is surprisingly easy to manipulate someone's perception of an incident through careful selection of what is shown to the reader. In the increasingly polarised political world we now seem to live in, let's say a Guardian newspaper journalist had published our imagined wider-angled version of the photo on the left showing the "skinhead" lunging to save the man from the tumbling bricks as part of an article they had written; but then someone else took that photo, cropped it to make it look like the skinhead was trying to steal the man's briefcase, and then posted it to social media. The capability the Trusting my Eyes and Ears project is looking to develop is a way for someone surfing the web to easily check the validity of the photo and, in the case of the cropped one, learn that it is not a valid version of the photo the journalist had published.

Moreover, now that creating image and audio deepfakes is becoming so easy, the broader aim of this project is to develop a technique that helps journalists and others have a way of publishing media content such as photos and audio that is verifiably theirs, and importantly can be shown to be fake if the image is visually manipulated in any way. As a corollary to this, a successful solution that addresses this problem could even be used by sites such as OpenAI's DALL-E to sign the images their algorithm produces, allowing anyone seeing such an image to easily confirm that it was indeed generated by this generative AI algorithm.

As to an approach? How about (conceptually) embedding a QR code into an image using steganography so it is not visible, with the crucial pixels encoding the QR code spread throughout the image (say in a spiral, or concentric rings). The image is verifiable by a software app extracting the QR code, and then confirming its checksum (or similar) is correct. In the event the image has been tampered with, then this will affect the QR values extracted, meaning the checksum will no longer be correct.
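Here is a deliberately simplified sketch of that hiding step, using Pillow: it spreads the bits of a short payload (which in the full system would be the QR-code data) across the least-significant bits of the blue channel at seeded pseudo-random positions, rather than the spiral or concentric rings suggested above; original.png and the payload text are placeholders.

    import random
    from PIL import Image

    def payload_positions(width, height, n_bits, seed=241):
        rng = random.Random(seed)                       # same seed => same spread of pixels
        return rng.sample([(x, y) for x in range(width) for y in range(height)], n_bits)

    def embed(image_path, payload: bytes, out_path):
        img = Image.open(image_path).convert("RGB")
        bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
        for (x, y), bit in zip(payload_positions(*img.size, len(bits)), bits):
            r, g, b = img.getpixel((x, y))
            img.putpixel((x, y), (r, g, (b & ~1) | bit))  # overwrite the LSB of the blue channel
        img.save(out_path)                                # must be a lossless format, e.g. PNG

    def extract(image_path, n_bytes: int) -> bytes:
        img = Image.open(image_path).convert("RGB")
        positions = payload_positions(*img.size, n_bytes * 8)
        bits = [img.getpixel((x, y))[2] & 1 for (x, y) in positions]
        return bytes(sum(bit << i for i, bit in enumerate(bits[k*8:(k+1)*8])) for k in range(n_bytes))

    embed("original.png", b"signed-checksum-goes-here", "signed.png")
    print(extract("signed.png", 25))   # tampering with those pixels corrupts this payload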

In terms of developing the signing media content capability, all the mainstream programming languages have packages and libraries to assist with generating QR-codes, and to read, write and manipulate image and audio formats. Note also most media formats have the ability to include textual metadata in their header sections (for example EXIF metadata), and so this is an area that should be assessed as to its fit-for-purpose for this project—for instance, storing a signed checksum as embedded metadata based on the pixels in the image/PCM values in the audio.

An advantage of the steganography approach is that it should prove possible to develop a more robust solution to tamper detection than the embedded metadata approach; after all, some malicious actor could simply remove the checksum from the embedded metadata, manipulate the media content, and then repost it. Encountering this version of the file online, there is nothing left in the file that lets you determine it has been interfered with. A downside to the steganography approach is that it can only be applied to lossless media formats such as GIF and PNG (images) and FLAC (audio). When signing something like a JPEG or MP3 file this could be handled in the encoding software by changing its output format, however—with just the change in file format alone—it will mean the resulting file will be non-trivially larger than the original. And that is before the hidden data has been added in, an aspect that will degrade the level of compression that can be achieved by the media format.

The techniques this project draws upon from the field of cryptography are likely the hardest, conceptually, to work with, which is why I have concentrated on this area in the resources section below. Asymmetric encryption, in general, looks to be a good approach to follow, but even focusing on that, there is still a lot to get your head around.
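As a starting point on the cryptography side, here is a minimal sketch using the Python cryptography package and Ed25519 signatures: hash the raw pixel data, sign the digest with the publisher's private key, and verify a copy later with the matching public key. The filenames are placeholders, and how the signature travels with the image (embedded metadata, steganography, or a sidecar file) is left open, as discussed above.

    import hashlib
    from PIL import Image
    from cryptography.hazmat.primitives.asymmetric import ed25519
    from cryptography.exceptions import InvalidSignature

    def pixel_digest(path: str) -> bytes:
        """Checksum over the pixel values only (not the container's metadata)."""
        img = Image.open(path).convert("RGB")
        return hashlib.sha256(img.tobytes()).digest()

    # The publisher (e.g. the journalist) holds the private key...
    private_key = ed25519.Ed25519PrivateKey.generate()
    public_key = private_key.public_key()

    signature = private_key.sign(pixel_digest("photo.png"))

    # ...and anyone with the public key can check a copy they encounter later.
    try:
        public_key.verify(signature, pixel_digest("photo_from_the_web.png"))
        print("Pixels match what the publisher signed")
    except InvalidSignature:
        print("This is not a valid version of the published photo")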

Let's turn our attention, now, to the verifying side of this project. It would make for a great Smoke and Mirrors demo if you could use a web browser to visit someone else's website, and through an enhanced feature of the browser, such as a right-click when over an image, select a menu item that allows you to run a verification check that establishes the authenticity of the photo. Or else, just through the regular act of visiting web pages your verification technique is run against images that the page contains, overlaying a small semi-transparent verified icon in, let's say, the bottom-right corner.

Of course, as this is Smoke and Mirrors, the example websites you visit are actually ones under your control, otherwise how else would you be accessing websites that include media with your embedded metadata/steganography in them? It's just a small sleight of hand to say: imagine an online world where our technique is already in use, as this in no way detracts from the fundamental capabilities of the approach that has been developed. To really help you sell the idea of the project, your demo could show the audience a "good actor" site, such as the Guardian newspaper, where the authentication checksums all check out. Then you could proceed to copy one of the images from that site into an image editor (a step done, say, by a different person on the team, representing the "bad actor"), who then adds it into their website. The original presenter then visits this other site, which includes the modified image, and the enhanced verification capability in the browser shows that this is a photo not to be trusted.

This is not as tall an order as it might first sound. There is a nifty technique around called user-scripting, which lets you splice bespoke JavaScript that you have control over into your web browser, which then gets run when you visit other people's websites. You can achieve this by installing a browser extension such as TamperMonkey to your browser. Then you are free to install whatever userscripts you see fit, which are typically keyed to spring into life when you visit a website that matches certain regular expressions. As you are able to specify the JavaScript you would like to run, this means you are able to access the Document Object Model (DOM) for the webpage that has been loaded in, and so through that change CSS and HTML elements, effecting visual changes in the page displayed.

Taking a step back, but keeping the main idea of the project in mind, there are some additional/alternative lines of technical development that can be pursued. For example:

  • Build a website that: (a) lets content providers upload their media files to generate versions that have been signed using Trusting my Eyes and Ears; and (b) lets a general user surfing the web provide a URL to a media resource, or else a media file they have downloaded, and have its validity checked.
  • Or how about heading down the route of actually inventing your own media file format(s), to ensure all the features needed to perform verification are included. As remarked earlier, this could be achieved by simply co-opting an existing file format, but with the added twist that certain embedded metadata must be present, otherwise the file is deemed corrupt.
    You could then take a web server such as Apache2 or Jetty, and add in support for your new file format. When a request comes in to serve up a file that is in your new format (say .vpng for Verified PNG), your code pulls a sleight of hand: it uses the code you have developed for reading your format, however the end result that gets streamed back is actually in the base format—controlled through setting the Content-Type (MIME type) field in the returned HTTP header to be image/png—meaning the web browser receiving the file data will know how to display it (a minimal sketch of this trick appears after this list).
    Or you could go all in on the newly invented verifiable file format(s), and learn how to compile up a web browser such as Firefox from scratch, and extend the code base to include support for your file format(s).
  • Consider combining the idea of embedded metadata and steganography. This would enable you to experiment with reducing the amount of information that needs to be encoded directly in the steganography encoded data. Instead of being self-sufficient in the data represented, the job of the hidden part of the message now becomes one of redirection: directing the software that processes the image to the name of a metadata field in the header, which is where the larger payload of data is, that gets used for verification. Of course, for this to work, the redirection step itself would need to be tamper-proof.
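For the .vpng idea, here is a minimal sketch of the serving-side sleight of hand using Python's built-in http.server; .vpng is of course an invented extension, and the verification step itself is not shown.

    from http.server import HTTPServer, SimpleHTTPRequestHandler

    class VerifiedPNGHandler(SimpleHTTPRequestHandler):
        def guess_type(self, path):
            if path.endswith(".vpng"):
                # Label the invented format as plain PNG so any browser can render it;
                # in the real system the verification check would run before serving.
                return "image/png"
            return super().guess_type(path)

    HTTPServer(("localhost", 8080), VerifiedPNGHandler).serve_forever()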

Potentially useful resources:

The Ultimate Video Player

Project Manager: Michael Peddie

Team Members: Luke Fraser-Brown; Jayden Litolff; Shreyaa Senthil Kumar; Jack Unsworth; and Leo Van Der Merwe

Weekly Update Time: Wed 1-2pm, K.B.07

Key Idea: Implement a video player that takes things to the next level: (i) it provides closed captions even if the video being played doesn't have this embedded in the video; (ii) it allows you to search and find the part of the video you want to go to based on text present in the video as well as text that is spoken; (iii) when watching a longer video such as a movie it provides a timeline that is actually usable; and (iv) I can pop in the video ID from a popular streaming platform such as YouTube and experience that video in this enhanced form.

Sounds pretty ambitious, right? Well here's a breakdown of the building blocks I would look to use to achieve this.

  • Closed captions are always an option. Investigate Deep Learning Speech-To-Text (STT)—also referred to as ASR, for Automatic Speech Recognition—capabilities such as Mozilla's DeepSpeech or OpenAI's Whisper code-bases. There are plenty of other projects listed on GitHub. Claims of performance in real-time will need to be assessed, as this clearly depends on the specs of the processor being used. Depending on the computation cost of running real-time speech-to-text, a contingency plan would be for the software to generate this in the background and notify you when it has been done. A scenario where the latter would be of benefit is a video of a lecture (or other form of instructional material) that has just been recorded, where the transcribed text would then be useful to search by to locate a particular part of the video. Likewise, speech-to-text could be run after a TV show or movie had been watched, in case you ever wanted to find a particular section of the video again. It might even be the case that, while it is possible to achieve real-time STT, this is at the expense of accuracy, and so there would be value in using the real-time version of the text in the first instance, and swapping in the more accurate version when it becomes available. (An end-to-end sketch combining this with the mainstream-video-sites item below appears after this list.)
  • Text search within the video. The previous item has already touched on one source where text can come from for text searching: speech-to-text applied to the audio track of the video. Applying OCR to frames of content from the video is another. A fully automated version of this would start with a technique for identifying whether text exists in a frame or not—GitHub has a range of projects tagged as performing text-detection, for instance. Then for frames that do, it would apply an OCR algorithm to it (e.g., Tesseract, or the Google Vision API). To avoid doing this for every frame of the video (time-consuming and/or costly), let's add in a technique that compares the currently selected frame with the previous one, and determines if it is visually different enough to warrant applying the OCR technique. That, ideally, would need to be a fast comparison. The processing requirements could be reduced by including a minimum time-step between chosen frames. Or how about looking at using a keyframe detection algorithm to pick out where important scene changes occur, and base the application of text-detection around that?
    In relying on automatically derived text, whether it comes from the audio or images (or both), there are inevitably going to be errors in it. This will certainly impact upon the idea of using this text to support the user navigating to particular points in the video. What sorts of things can be done to help mitigate this? If it's a short video, the user might be prepared to manually correct the whole thing, in which case a custom form of text editor that lays out the text according to time would be handy. For a longer video, maybe culling out common/frequently used words and just showing the more distinctive ones would be sufficient. After all, is anyone really going to then search the video for a word like of?
    To perform the text search feature needed in the interface there are a range of text-matching algorithms that can be used, with different strengths and weaknesses. As this is the Ultimate Video Player, pay attention to what gives a good user experience. For example, ignoring case matching is likely preferred by the user. And how about ignoring the ends of words so it doesn't matter if I searched for explosion or explosions? (done using a technique called stemming).
  • A Timeline suitable for longer duration videos. For this area of work in the project, all I can say with any certainty is that I really don't like the way this is implemented in the mainstream video players we all use! The issue with vertical scrollbars when documents get long—which has been known about for decades—seems to have been baked into the interaction experience of video players (but now horizontally) with absolutely no forethought at all. My instinct tells me, however, that there must be a better way to do this. I do at this point need to declare that I haven't extensively searched for solutions, rather taken the view that if there was already a better solution out there, surely the mainstream video players would be making use of it. My proposed approach to this aspect of the project is to take inspiration from the HCI research community, which has developed techniques such as fish-eye distortion and Speed-dependent Automatic Zooming, and "think fresh."
    In doing so, I wouldn't be totally surprised if you come across some HCI projects looking at the specific problem of long duration timelines in video players. If you do, take a moment to reflect on why you think their idea didn't make it beyond being presented as an academic paper. Is there something about the work that would make it difficult to generalise to video players used across a wide range of situations? After all, the needs of a user watching a one hour recorded lecture will be quite different to someone watching a movie. Or is it perhaps the case that no one took on the software engineering task of developing an industrial-strength version of the in-the-lab devised widget? I don't suppose you know of anyone enrolled in a Software Engineering degree who will be spending 6 weeks in a team undertaking a video-based software development project, by any chance, do you?
    Kudos to the member of the class who pointed out an interesting timeline representation technique the company SpaceX uses in its live feed for rocket launches (seen in this clip on YouTube). That's an interesting way to augment the notion of time represented as a horizontal line. How about something like that, but interactive to aid navigation?
  • The ultimate video player works with mainstream video sites. A Smoke and Mirrors approach here would be to look into a package like Pytube, which has the ability to determine the video URL that is behind a YouTube ID. This could then be used to allow your (ultimate) video player to play the YouTube video. To be clear, in the spirit of Smoke and Mirrors, this particular capability is about making it easier to imagine how the sorts of capabilities your project has developed could play out in reality. For instance, if—in seeing a demo of your Ultimate Video Player—YouTube realises they've fallen behind the competition and need to catch up (stat!), the obvious thing would be to acquire the team that did the work!
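Tying the first and last bullets above together, here is a hedged end-to-end sketch: fetch a video with Pytube, then run the open source Whisper model over it to get time-stamped text for captions and search. Both package APIs are written from memory (and Pytube in particular changes often), so check them against the current documentation; VIDEO_ID is a placeholder.

    from pytube import YouTube
    import whisper

    # Download the video behind a YouTube ID (progressive = audio+video in one file).
    yt = YouTube("https://www.youtube.com/watch?v=VIDEO_ID")
    stream = yt.streams.filter(progressive=True, file_extension="mp4").first()
    stream.download(filename="video.mp4")

    # Transcribe it; each segment comes back with start/end times in seconds.
    model = whisper.load_model("base")
    result = model.transcribe("video.mp4")

    for segment in result["segments"]:
        print(f"[{segment['start']:7.1f}s - {segment['end']:7.1f}s] {segment['text']}")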

OK, on the assumption that you haven't been acquired by YouTube/Google, then the next best way that occurs to me to bring your work to the masses, would be to base your implementation work on an existing open source video player framework such as dash.js.

Potentially useful resources:

That would be a D'oh from me!

Project Manager: Bert Downs

Team Members: Riley Cooney; Christian Florencio; Jacob Koppens; Brady Lethbridge; and Kai Meiklejohn

Weekly Update Time: Mon 4-5pm, J.B.07

Key Idea: When I'm interacting with my desktop with the mouse, sometimes the actual click I've just done didn't quite go how I planned, and the wrong thing in terms of windowing has just happened: I was off by a few pixels when I clicked, or something like a popup window suddenly appeared just as I was clicking. In my mind I'm thinking, that's not what I meant to happen, what I actually want is .... The aim of this project is to add to a desktop windowing environment the ability to control what happens to the windows by voice: undoing the last thing that happened (in the event of a mistake) would be a game changer, but the project should also look to support a range of regular desktop operations as well, spoken in a way that is natural for the user.

I originally considered calling this project D'oh, paying homage to Homer Simpson's oft-said phrase when things go wrong for him. However, as the project also includes the idea of supporting regular desktop interactions, but voiced naturally—rather than an enforced keyword-restricted vocabulary, such as being forced to say stilted words like File➵Edit➵Copy—I felt it didn't fully embody what the project is looking to achieve. This is what led to the evolved title of That would be a D'oh from me!

Next I considered naming the software system to be developed VALET—for Voice Activated natural Language desktop windowing EnvironmenT. A bit of a tortuous acronym, I admit, but it would lead to being able to say things like Hey Valet, restore that tab I just closed, and Hey Valet, iconify all the windows that are open. But then I got to thinking that locking in the name of the computer system in this way, forcing the user to have to use this name, is antithetical to the "speak how you want" aspect of the project. It might be something the Big Tech companies force you to do (Hey, Siri), continually requiring you to reinforce their chosen brand-name recognition. But that's not the game we're playing here. Sure, start with Valet as the name if you like, but include a capability to change this, if the user so wishes.

Enough chit-chat. Let's talk specifics. This is a project where you would get into the nuts and bolts of a reasonably intricate piece of software: a desktop window manager, figuring out the key places to splice in the additional capabilities That would be a D'oh from me! brings to the table.

The desktop window managers for Windows and for macOS are closed source, so the obvious place to go, to experiment with this project idea, is GNU/Linux and the open source desktop window managers that have been written for that. Yes, I'm afraid that is plural (window managers), so one of the steps needed in this project is to invest some time assessing the options that are "on offer" as it were. As the elements of a window manager are strongly connected to the underlying Operating System, a lot of them are written in the C programming language.

Some good news is that a core Unix philosophy, which the GNU/Linux world is part of, is for things to be modular. This starts small with the command-line utilities that Unix provides, and then extends upwards through various layers of abstraction to a fully operational desktop environment, such as Ubuntu, the distribution of GNU/Linux used in our R-block labs for instance. One of the critical layers of abstraction needed to provide a graphical desktop is the X Window System (often shortened to X11, which references its major version release). This provides the basic functionality of graphical windows and capturing input through devices such as mouse and keyboard, but is very bare-bones with respect to visual appearance. This is where the next layer of abstraction kicks in, providing a toolkit of nicely styled "widgets" such as buttons and menus. Examples of widget toolkits are GTK, Motif, and Qt. Finally we are at a point where the windowing/desktop environment can be built, out of those lower layers of abstraction. The Gnome desktop environment makes use of GTK, for instance, and the particular part of Gnome that provides the window manager is called Mutter.

Wikipedia has an extensive list of X11 Window Managers (WMs), however the links from that page to further Wikipedia pages about specific WMs are more patchy—which is not to say there aren't other sources available about these online, it might just require a bit more searching. Some of the Window Managers are billed as "simple" and/or "lightweight", meaning it will be more straightforward to get your head around their code bases, however you will also likely find many of them were "hobbyist" projects for someone, and are not actively maintained anymore. By comparison, Window Managers such as Mutter and Compiz are associated with mainstream GNU/Linux distributions such as Fedora and Ubuntu. They will definitely be well maintained, as well as providing documentation mapping out the design and classes used; however they will undoubtedly be large code bases, even if the number of points you need to identify, for where your additional code gets spliced in, stays the same. That said, the payoff for being able to effect the necessary changes in one of those mainstream Window Managers is that your project's impact will be much bigger. Pushing further still, if it proves possible to add these voice capabilities one level lower—within the X11 layer of things—then you will have come up with a solution that will universally work across all Unix desktop environments!

In terms of the voice activation side of this project, as mentioned at the start, I am looking for the software to support a natural way for users to speak their instructions. The tasks of both speech-to-text and grammatically understanding the text that makes up a sentence fall under the topic of Natural Language Processing (NLP). For the former, DeepSpeech is a widely used open source command-line tool for performing this. There are also online APIs that can be utilised from providers such as OpenAI and Google.

For the task of grammatically recognising the words spoken in natural text, the particular NLP tool that does this is called a Part of Speech (PoS) Tagger. Again there are many libraries and packages around that do this. If you would like to get a general sense of how this works out in practice, then you can experiment with one of the online demonstration sites, such as the following one based on the Stanford PoS Tagger, and another from CMU. The CMU one, while admittedly quite a bit older, is useful as it gets you used to the underlying PoS nomenclature for tag names that PoS taggers use.
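To get that same taste programmatically, here is a small sketch using NLTK's off-the-shelf English tagger; resource names can vary slightly between NLTK versions, and the tags shown in the comment are indicative rather than guaranteed.

    import nltk

    nltk.download("punkt")                         # tokeniser model
    nltk.download("averaged_perceptron_tagger")    # PoS tagger model

    command = "reopen the window that has just been minified"
    tokens = nltk.word_tokenize(command)
    print(nltk.pos_tag(tokens))
    # e.g. [('reopen', 'VB'), ('the', 'DT'), ('window', 'NN'), ('that', 'WDT'),
    #       ('has', 'VBZ'), ('just', 'RB'), ('been', 'VBN'), ('minified', 'VBN')]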

As an initial phase, focus on the idea of voice commands that trigger functionality that already exists within the Window Manager, such as iconifying the window that is currently selected. Then look to expand upon this to provide examples of voice commands that trigger existing functionality in the WM, but where these commands do not have direct one-to-one mappings to the WM function that is being run. An example of this would be saying out loud, "reopen the window that has just been minified", and—taking things a bit further—being able to respond the same way if the word "iconified" was used instead of "minified". To implement this, you will likely find you need to enrich the state information the window manager keeps.

Next, target new capabilities that do not exist at all, such as asking the system to re-open a window that has actually been quit. Sounds like a pipedream? Well how about this for an idea: with a small change in the code, you could reprogramme the WM so a mouse click on the close/quit "X" in the top-right corner actually does a minify operation, with the added special feature that it sets a 'visible' field in the WM for tasks (newly introduced by you!) that controls whether or not that item actually gets drawn in the taskbar. By iconifying the window (rather than quitting) and setting its 'visible' field to false it will appear as if the window has been quit. Then if the user asks for the window to be restored ... hey presto: un-minify it, set its 'visible' field to true, and there it is! The idea, if you like, could be compared with the trash-can metaphor in a desktop environment. Of course, unlike the trash-can metaphor, you won't want these windows hanging around indefinitely, so after a bit of time—probably quite a short amount of time—the window is actually quit for real.

Phrases like "Hey, Siri ..." and "Alexa, ..." are known as wake-up phrases. For this project, you could run with the wake-up phrase "Hey, Valet ..." or come up with another name for your system, however in the spirit of the user-centred approach this project takes, an item of work for one of the team would be to allow the user to choose their own wake-up phrase. Have them record that a few times, and then use this to fine-tune the speech-to-text model used, to heighten the accuracy with which the system can detect this phrase being said.

Thinking about the learnability of the system by a user, as the user is speaking you might also like to experiment with displaying text that shows candidate phrases that the system is capable of performing, based on the spoken text that has been recognised so far. As the user continues to speak, the number of text items that are displayed reduces, ideally down to a single one, which becomes the action that is performed. There will be some nuances to how best to do this: perhaps the text only appears if there is a bit of a pause in what the user is saying; if the options get down to one and it is not what the user was after, there needs to be a natural way for them to side-step that action being performed; and in the case that it is the item they want, but they are continuing to speak, the courteous thing for the software to do is to wait until they have finished, before enacting the operation.

If done well, this feature in particular would enable users to learn about features that they might not even be aware exist in the underlying mouse-clicking/system settings desktop environment, such as the ability to save the current windows session, and have it restored later. Further, the more spoken ways the system knows how a particular action can be expressed, the higher the likelihood of the user coming across it.

In the world of text-only terminal consoles, tmux (short for terminal multiplexer) is a command-line program that allows you to operate different rectangular regions of your screen with different running programs. A common use of tmux is to open up your terminal to be full-sized, start tmux and then use the keyboard commands it provides to organise the different tasks you want to run in different regions of the screen, in a way that very much resembles windows in a desktop environment.

It is this analogy that led a classmate in COMPX241 to suggest tmux as a candidate for the environment in which the project, That would be a D'oh from me, works. While this would probably not deliver as much pizzazz as demonstrating the idea with a full graphical desktop (a concern expressed by the classmate), it is the case that it would involve working with a smaller code-base that is well maintained, unlike some of the lightweight/hobbyist graphical window managers out there. That said, to implement an end-to-end solution, many of the same technical steps would need to be addressed, spanning voice input and how the recognised text matches to operations performed within the interactive environment. Depending on the team size, it could very well be that there is capacity to look at both. And if you are looking to step up the impact of how That would be a D'oh from me operates with tmux, then how about figuring out a way to make this work when the terminal session being run is through an ssh session?

  • Google Query, ubuntu "window manager";
    One of the results: https://askubuntu.com/questions/72549/how-to-determine-which-window-manager-and-desktop-environment-is-running
    https://askubuntu.com/questions/159505/how-to-switch-windows-manager-on-the-fly

Bingally Bong

Project Manager: Zhibo Xu

Team Members: Daniel Aneke; Harry Meyer; Luka Milosevic; Hansh Shinde; and Rakaipaka Smiler

Weekly Update Time: Wed 1-2pm, K.B.07

Key Idea: Develop an app that bingally bongs a person on their phone with information that might be of use to them, given the location they are in. Could be any sort of information, but note that the project was originally conceived as a bespoke mobile-phone travel app for families with kids travelling abroad.

For this project I imagine two distinct phases to the software app developed: Explorer mode and Wisdom of the Crowd. To be honest, when in Explorer mode, the app isn't that supportive—but that's OK, as you are the intrepid explorer! What it is doing, though, is running GPS the whole time, and paying attention to when you seem to spend a lot of time in one location. There's usually a reason, either good or bad, for such a "hotspot". Maybe you were figuring out how best to get into the city centre from the airport (Someone didn't plan ahead, did they Dad?). Or perhaps you stopped at a cafe (was it any good?), or were viewing one of the sights to see.
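A rough sketch of that hotspot detection: walk through a day's time-stamped GPS fixes and flag any stretch where the phone stayed within a small radius for longer than a threshold. The radius and dwell-time values below are guesses to be tuned.

    from math import radians, sin, cos, asin, sqrt

    def distance_m(p, q):
        """Haversine distance in metres between two (lat, lon) points."""
        lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371000 * asin(sqrt(a))

    def find_hotspots(fixes, radius_m=75, min_minutes=15):
        """fixes: list of (timestamp_seconds, lat, lon), in time order."""
        hotspots, anchor = [], 0
        for i in range(1, len(fixes)):
            if distance_m(fixes[anchor][1:], fixes[i][1:]) > radius_m:
                dwell = fixes[i - 1][0] - fixes[anchor][0]
                if dwell >= min_minutes * 60:
                    hotspots.append((fixes[anchor][1:], dwell / 60))
                anchor = i
        return hotspots   # [((lat, lon), minutes_spent), ...]; the final stretch of the
                          # day could be closed off in the same way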

Having run your app in Explorer mode, at the end of the day, when you plug in to your laptop (say), it shows you these hotspots on a map and asks you to enter some information to explain what was happening, which it stores centrally. The enriched information that is built up by the explorers feeds the Wisdom of the Crowd side of the app. In this latter mode when you find yourself at the airport, it vibrates to let you know there is information potentially relevant to where you are that it can show you. In this case, it could inform you of what previous people determined was a good course of action for getting into the city centre. This could even factor in the time of day that they did this, and/or the size and ages of the members that make up your travelling group.

An added twist to the app in Explorer mode is that it lets you take photos, and/or is integrated with the GPS locations of photos you have been taking during the day. These might be useful to show someone using the Wisdom of the Crowd side of the app to help that user orientate themselves.

Unfortunately we don't have the budget to send you to any exotic locations to trial the software you develop, however the ideas expressed in this project work equally well when applied to the idea of someone new to our university's campus.

The above stated mantra about always developing as a web app (unless there is a technical impediment that prevents you) can be applied here. Some sort of back-end store will be needed for the explorer generated content. As food for thought, take a look around the Paradise Gardens showcase, which illustrates a technique for spatial/proximity searching, and is built using our very own Open Source Greenstone Digital Library software.

For Bingally Bong (BB) to work as the content entered by Explorers grows, measures need to be in place so those that follow are not continually spammed by information. There therefore needs to be a way to align the interests of those that follow with the body of information stored in the system. An interesting angle to take, then, could be looking to see if the app can be hooked in with an existing social media platform, such as Facebook. The idea here would be that BB could apply a Topic Modelling algorithm across the content that an individual has posted to Facebook (FB), and from that establish a focused set of keywords/topics with which to filter the text content that BB has.
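As a toy illustration of the topic modelling step, here is a sketch using scikit-learn's LDA implementation; the posts are stand-in text, and how real Facebook content would be obtained (data export, API, etc.) is a separate question for the team.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    posts = [
        "Amazing ramen place near the station, best lunch of the trip",
        "Museum day! The impressionist gallery was worth the queue",
        "Delayed again... two hours at the gate with two tired kids",
        "Found a great playground right behind the cathedral",
    ]

    vectoriser = CountVectorizer(stop_words="english")
    counts = vectoriser.fit_transform(posts)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    terms = vectoriser.get_feature_names_out()
    for topic_idx, topic in enumerate(lda.components_):
        top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]   # strongest words per topic
        print(f"Topic {topic_idx}: {', '.join(top_terms)}")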

In addition to the core two-part aspect to this project, there is also a high degree of enrichment such a bespoke tourist map app could provide:

  • Currency conversion (and other forms of conversion?), the convention in this part of the world for tipping, other customs.
  • Language translation (text-to-speech, speech-to-text) and/or a library of photos of things that would be hard to translate.
  • Acknowledging that you won't be on-line all the time when overseas, support "Laid-back Searching", whereby you enter web search queries when they occur to you—on the spot, as it were—but these queries only get run once you (but more importantly your phone!) are back in a WiFi hotspot.

Hot Air Balloons: Just Blowin' in the Wind

Project Manager: Ethan MacLeod

Team Members: Musawar Ahmad; Ernest Divina; Annycah Libunao; Syed Ashtar Ali Rizvi; and Charles Serrato

Weekly Update Time: Wed 2-3pm, K.B.07

Key Idea: 3D virtual experience flying a balloon, using Google Earth or equivalent, so it is the actual world you are flying over. Add to this actual real-time wind data for the area of the earth you are flying over, along with a gamification element such as challenges that you need to achieve, but overall in keeping with the generally tranquil nature of flying a balloon. If networked, this would allow you to meet other players, Just Blowin' in the Wind.

To help kick off some ideas, you could take a look at the online demo of the following open source project which utilises WebGL to produce a hot air balloon flying over procedurally generated terrain.
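On the real-time wind side of the key idea, here is a hedged sketch of pulling live wind data for the patch of Earth the balloon is over, using the free Open-Meteo forecast API (no key required). The endpoint is real, but the parameter and field names are quoted from memory, so check them against the Open-Meteo documentation before building on this.

    import requests

    def current_wind(lat: float, lon: float):
        response = requests.get(
            "https://api.open-meteo.com/v1/forecast",
            params={"latitude": lat, "longitude": lon, "current_weather": "true"},
            timeout=30,
        )
        weather = response.json()["current_weather"]
        return weather["windspeed"], weather["winddirection"]   # km/h, degrees

    speed, bearing = current_wind(-37.79, 175.28)   # roughly Hamilton, NZ
    print(f"Wind: {speed} km/h, coming from {bearing} degrees")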

I've always been intrigued by games that require a lot of dedication to accomplish anything: Minecraft being a case in point. The dedication required is borderline tedious, but somehow it works! Slow Television is another example that turns out to have quite the following. These observations serve as the kernel of the idea for this project, with hot air ballooning the focus. In broad brush strokes, you have to achieve some overall goal, say take on the challenge of flying your balloon from London to Paris, that will take some time.

One direction to take the game play for this project is like Minecraft: actually make it rather tedious to achieve the goal—quite fiddly, even—requiring you to pay continuous attention to various factors, making small adjustments every now and then (analogous to the farming style games where you continually have to water plants, feed animals). Events happen along the way, requiring you to interleave your response to the situation with your steering of the balloon.

Or maybe the approach for game play really commits to the passive idea, in keeping with Slow TV. If pursuing the latter, I would still look to have events happen along the way, which require you to respond to the situation. Doing this would perhaps act as a substitute for the notion of "something interesting to look at" that might occur from time-to-time in a Slow TV recording. In the spirit of Slow TV, how about such events being meeting other travellers in the world of Hot Air Balloons: Just Blowin' in the Wind?


Smoke and Mirror Projects: From the Vaults

The Smoke and Mirrors brand has been a signature component of the Software Engineering programme at the University of Waikato from its inception. First run in 2003, it started life as a group-based project that 1st Year SE students, who had been learning to program in the C++ language, undertook. In 2010 it moved to the 2nd year level, with Java being the programming language taught, where it has remained since.

It is one of the great pleasures in my job at Waikato to be involved in providing the Smoke and Mirrors experience for our SE students, and for so many years—for all of the years the projects have run, in fact! There even came a point where I was due to be on sabbatical in the semester Smoke and Mirrors was scheduled to run; however, a year in advance the department had changed the semester it ran in, so I could continue running the projects.

I haven't been able to locate any written record of the projects run in that first year, sadly. One from that year that does come to mind, however, was a team developing a software-based arbitrary-precision calculator. As part of their presentation to the department at the end of the semester, they demonstrated their GUI-based calculator calculating π to ... well ... a very high precision! For the years 2004–2009 I have been able to track down the titles of the projects that ran, which at least hints at the variety of projects undertaken. For more recent years, I still have the project briefs that the students start with, when the projects are launched.

With a nod to posterity, here are the projects by year, working back from newest to oldest.