Takeaways from a young researcher at an International Machine Learning Conference
Recently I had the opportunity to attend ACM (Association for Computing Machinery) Multimedia 2019 after having a paper accepted for AVEC (the Audio-Visual Emotion Challenge, rebranding to the Audio-Visual Engagement Challenge in 2020).
During the conference, I met some great people and learned a great deal about the applications of media-based data and fusion in multimodal contexts. However, I also noticed a few things that I thought would be interesting to mention from a more “naive” perspective. I had the luck to talk to some researchers who were part of the organizing committee or worked closely with it, and they reassured me that they are always actively looking to improve the quality of work presented at conferences like these. Still, organizational bandwidth is difficult to manage in research, especially since organizers mostly volunteer their time to make these conferences happen.
In discussing my experiences, I use medicine as a running example, because it makes the potential effects of the considerations raised in this article easier to describe.
Focus on “State of The Art”
During several of the paper presentations and talks, the phrase “State-of-the-Art” was thrown around. Easily over half of the talks I listened to measured their results against a “State-of-the-Art” result, usually performing on par with it or beating it on an error metric.
This alarmed me.
“State-of-the-Art” was always measured in terms of traditional metrics such as MSE (Mean Squared Error), accuracy, and occasionally precision/recall or sensitivity/specificity (ways to analyze false-positive and false-negative errors). In machine learning courses and online tutorials, these metrics are classically taught as the way to evaluate models, so it’s no surprise that they are the primary metrics used to evaluate models at conferences. However, in recent years machine learning practitioners have been re-evaluating the validity of these metrics. A strong score on one of them can be an anomalous result caused by faulty model training, sampling error, or over-/underfitting (the bias/variance trade-off). As a result, reported “State-of-the-Art” results hold little value unless they are scrutinized against measures of statistical validity and fit.
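To make the concern concrete, here is a minimal Python sketch. It is my own toy illustration, not taken from any paper at the conference. The labels are hypothetical (95 negatives and 5 positives), the "model" simply predicts negative every time, and the helper names `confusion_counts` and `summarize` are made up for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def summarize(y_true, y_pred):
    """Return accuracy, sensitivity (recall), specificity and precision."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Each ratio falls back to 0.0 when its denominator is empty.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical, heavily imbalanced data: 95 healthy cases, 5 ill cases.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

metrics = summarize(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

On this toy data the accuracy is 0.95 while the sensitivity is 0.0: the model looks strong on the headline metric and still misses every positive case.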
However, there is room to be optimistic that academic practitioners are becoming increasingly aware of the pitfalls of traditional metrics. In the case of the AVEC Challenge this year, the organizers proposed using the Concordance Correlation Coefficient (CCC) to evaluate regression models. CCC is a statistical measure of inter-rater reliability and reproducibility. In a regression setting, it is computed between the true values and the predicted values of a test or validation set. It measures how closely the predictions agree with the true values, accounting for both the correlation between the two sets and the differences in their means and variances. This is useful for understanding how well a model can produce reproducible results rather than statistically anomalous ones.
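As a rough sketch of what this looks like in practice, the snippet below computes CCC using Lin's standard definition, 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), with population (1/N) statistics. I am assuming this is the variant the challenge intends. The function names and the toy values are hypothetical, and Pearson correlation is included only for contrast.

```python
import math


def _moments(x, y):
    """Population means, variances and covariance of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return mean_x, mean_y, var_x, var_y, cov


def pearson(x, y):
    """Pearson correlation: sensitive to linear association only."""
    _, _, var_x, var_y, cov = _moments(x, y)
    return cov / math.sqrt(var_x * var_y)


def ccc(x, y):
    """Lin's concordance correlation coefficient (assumed variant)."""
    mean_x, mean_y, var_x, var_y, cov = _moments(x, y)
    # The squared mean difference penalizes systematic offsets.
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical ground truth and predictions that are offset by a constant.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_shifted = [v + 2.0 for v in y_true]

print("Pearson:", pearson(y_true, y_shifted))
print("CCC:", ccc(y_true, y_shifted))
```

In this toy case the predictions are perfectly correlated with the truth (Pearson of 1.0), yet the constant offset pulls CCC down to 0.5, which is the kind of disagreement a pure correlation score would hide.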
CCC is not an end-all, be-all measure. It does not evaluate study validity or data validity. It does what it is meant to do: evaluate the variability of measured data against a “truth”, an outcome metric. However, it does bring attention to a significant issue in machine learning, and in computational medical studies in particular: reproducibility. Optimizing models against a metric like CCC encourages data science and machine learning practitioners to identify models with higher reproducibility, rather than just reducing a loss metric, which may lead to anomalous models.
Novelty of methodology should be another dimension of “State-of-the-Art”. In 2018 and 2019, BERT (Bidirectional Encoder Representations from Transformers) took the NLP community by storm. Within a year, the BERT paper was cited almost 3,000 times and inspired several spin-off concepts. BERT is innovative not just because of its effectiveness in pushing benchmarks dictated by minimizing loss, but because of its capacity to build on transformers and enable bidirectional training in them, which was previously not possible. BERT pushed the boundary of what was perceived to be possible.
A fair portion of the work at the conference repurposed recent State-of-the-Art breakthroughs (e.g., BERT embeddings, attention networks, GANs), typically gaining only a 1–2% increase in performance and calling the result “State-of-the-Art” on that basis. Little was done to show which mechanisms were affected by the models or representations acquired through these methods, making it hard to understand what was actually improving performance. Though it’s important to understand the capacity that new breakthroughs in machine learning offer us, it’s just as important to present new ideas and push the boundaries of knowledge.
There were several papers at the conference that did break through conceptual boundaries. A few of my favorites include:
- HyperLearn: A Distributed Approach for Representation Learning in Datasets With Many Modalities — (University of Amsterdam, Presented by
- A Multimodal View into Music’s Effect on Human Neural, Physiological, and Emotional Experience (USC, presented by Timothy Greer)
- Self Supervised Face-Groupings on Graphs (KIT, presented by
Socially and Ethically Driven Projects
While attending ACM Multimedia 2019, I noticed there were few projects directly aiming at social or ethical goals. I don’t fault anyone for this. When the scope of a conference is rather technical in nature, the papers, submissions, and talks tend to be technical as well. Much of the focus was on applying methods to multimodal retrieval, fusion, and learning, on bridging the gaps between different modalities, and on making sense of multimodal research for real-world use.
As multimodal systems, especially machine learning-based systems, become prevalent in real-world production and use, I personally believe researchers should place emphasis on the social and ethical factors of their published work, for a few reasons.
1. Real-world production and use equate to real-world consequences:
A defining trend of the 21st century is the increasing reliance on computational models for almost all aspects of life, both online and offline. This includes how we use the internet, how we date, how we learn, and so on. The research being done now will directly inform how the technology is used in production in the future. Specific to medicine, we can expect that in the near future learning models will inform doctors about our mental and physical state and our predisposition to illness. If these systems are not vetted with rigorous ethical and social considerations, and are launched into production prematurely, we may inadvertently harm lives.
An existing example of potential harm to lives is the use of “Facial Recognition AI” in law enforcement. Law enforcement agencies argue that this tool makes it much easier to discern criminals from bystanders. However, AI technologists generally know that there are real shortcomings to Facial Recognition AI that can only be solved by taking real-world considerations seriously. In 2015, a Google algorithm classified a black woman who used the Google Photos app as a “Gorilla”. This failure suggested a serious lack of training data representing people of African descent. This was likely unintentional, but may be a result of the underrepresentation of African/Black researchers in AI. Google has since stated that it has fixed the issue. A similar system may unintentionally learn to associate skin pigment with a proclivity towards crime if it is trained on data labelled by law enforcement with systematic bias against certain ethnic groups. Without an understanding of the potential biases in data collection and labelling, issues like these may spread into other realms of life. Government agencies such as the NIH (National Institutes of Health) are already aware of these underlying issues and are trying to address bias in data collection in healthcare.
Placing all of the ethical responsibility for a technology on the researchers who developed it might be a little ridiculous. However, I believe researchers should be stakeholders in the work that they publish. Researchers are now, more than ever, in a position of power from which they can influence social change. They are also able to influence the real-world outcomes of technology for the better.
2. Grounding research in tangible social/ethical goals may optimize for more creative approaches:
I have much less empirical evidence to support this stance. However, I believe there is a case to be made.
Take medicine, where a main social goal is to reduce the misdiagnosis of illness. Building a model for diagnosis involves a few considerations. We must reduce the false-positive rate, and especially the false-negative rate. We must also understand the learned priors of a model to ensure that the model is learning the proper information.
Instead of focusing on the accuracy of the model (which is often done in conference and industry papers), researchers may develop loss functions that minimize the false-negative rate. In addition, there may be more emphasis on building invertible transforms through the layers of a deep neural network, to directly map back and easily interpret what is being learned in the model. Both are highly directed lines of investigation that arise from grounding research in tangible social and ethical goals.
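One simple way to encode this priority, offered only as an illustrative sketch, is a class-weighted binary cross-entropy that charges more for under-predicting a positive case than for a false alarm. This is a common weighting trick rather than a method from any paper discussed here. The name `weighted_bce`, the parameter `fn_weight`, and its value of 5.0 are all hypothetical choices.

```python
import math


def weighted_bce(y_true, p_pred, fn_weight=5.0, eps=1e-12):
    """Binary cross-entropy with extra weight on the positive-class term.

    y_true: list of 0/1 labels (1 = ill).
    p_pred: list of predicted probabilities of the positive class.
    fn_weight: hypothetical multiplier; values above 1 make a missed
        positive (a likely false negative) cost more than a false alarm.
    """
    total = 0.0
    for t, p in zip(y_true, p_pred):
        # Clamp to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += -(fn_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Two equally "wrong" predictions in opposite directions.
missed_case = weighted_bce([1], [0.1])   # ill patient scored as unlikely ill
false_alarm = weighted_bce([0], [0.9])   # healthy patient scored as likely ill

print("cost of missed case:", round(missed_case, 3))
print("cost of false alarm:", round(false_alarm, 3))
```

With `fn_weight=5.0` the missed case costs about five times as much as the equally confident false alarm, so a model trained against this loss is pushed towards fewer false negatives, at the price of more false positives.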
This focus on social goals is already bearing fruit in research. In the NeurIPS 2018 Best Paper “Neural Ordinary Differential Equations” (https://arxiv.org/abs/1806.07366), the authors propose using ordinary differential equations instead of discrete layers in neural networks, because conventional networks have difficulty treating continuous time as a dimension of the data. This route of investigation was sparked by David Duvenaud of the University of Toronto while trying to model medical records over time to better understand health outcomes. Though still largely a proof of concept, ideas grounded in social goals like this one may spark future research with as much groundbreaking effect as Yann LeCun’s Convolutional Neural Network or Ian Goodfellow’s GAN (Generative Adversarial Network).
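The core idea can be sketched in a few lines. The paper observes that a residual update of the form h_next = h + f(h) looks like one Euler step of the differential equation dh/dt = f(h, t). The toy code below is only an illustration under my own assumptions: `dynamics` is a hypothetical stand-in for the learned network (chosen as f(h) = -h because its exact solution is known), and the fixed-step Euler loop replaces the black-box ODE solvers used in the paper.

```python
import math


def dynamics(h, t, w=-1.0):
    """Hypothetical stand-in for a learned function f(h, t); here dh/dt = w * h."""
    return w * h


def euler_integrate(h0, t0, t1, steps):
    """Integrate dh/dt = dynamics(h, t) from t0 to t1 with fixed-step Euler.

    Each iteration has the same shape as a residual block:
    h_next = h + dt * f(h, t).
    """
    h = h0
    t = t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * dynamics(h, t)
        t += dt
    return h


exact = math.exp(-1.0)                        # closed-form solution at t = 1
coarse = euler_integrate(1.0, 0.0, 1.0, 4)     # like a 4-layer residual stack
fine = euler_integrate(1.0, 0.0, 1.0, 1000)    # approaches the continuous limit

print("exact :", exact)
print("coarse:", coarse)
print("fine  :", fine)
```

As the number of steps grows, the discrete stack converges towards the continuous solution, which is why a state defined by an ODE can be evaluated at arbitrary, irregularly spaced time points such as the dates in a medical record.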
3. A priori social/ethical knowledge can inform unique approaches:
During the AVEC Challenge Workshop session, Albert Ali Salah, a professor at Utrecht University in the Netherlands, proposed a unique approach for analyzing and quantifying depression, based on previous research by him and his collaborators. In that research, they noticed that refugees in their study who had experienced trauma showed a very noticeable “deep breath” or “sigh” when recalling traumatic experiences. Using this a priori knowledge of a key paralinguistic feature, they were able to build a simple Extreme Learning Machine model that far surpassed the MSE and CCC performance of the baseline model. This feature and approach may be very valuable for building models of trauma and PTSD behavior that can inform the future diagnosis and treatment of PTSD patients.
Another very interesting concept discussed at the conference, by Hayley Hung, a professor at Delft University of Technology in the Netherlands, was ecological validity. Ecological validity addresses the extent to which research can generalize to real-world, or “in-the-wild”, situations. Often at conferences, models are built on top of standardized datasets created by research labs or corporate entities. These datasets provide a common benchmark against which researchers can evaluate the strength of their models. However, modeling a standardized dataset and modeling an in-the-wild situation can be very different.
Curated datasets often attempt to reduce noise and optimize towards a single task at hand. This is done so that the evaluation framework focuses on a task of specific interest to researchers. However, it also raises the issue of potential biases, which are usually unintentional. For example, emotion databases attempt to capture facial expressions that correspond to a certain emotion (“disgust”, “anger”, “happiness”). Qualitatively, there can be distinct differences in how different cultures display emotions. Annotators from one culture may label a facial image as “shock”, while annotators from another may label it as “disgust”. A model trained on such labels may lack ecological validity when deployed in the wild, especially when the data includes individuals from multiple cultures.
Overall, this was a very good first international conference experience for me. I learned quite a lot from the people I met, gained a lot of insight, and was able to connect with researchers from across the world, even as an early-career researcher. I highly recommend that anyone attending conferences like these stay inquisitive and open to the ideas being discussed and presented, but also be equally critical. The proliferation of novel ideas and the rigor of constructive criticism will ultimately shape the future of the technology being developed around the world in multimedia and computational research.