Training Large Language Models (LLMs) with in-domain data can significantly enhance their performance, leading to more accurate and reliable question-answering (QA) systems essential for supporting clinical decision-making and educating patients.
This study introduces LLMs trained on in-domain, well-curated ophthalmic datasets. We also present a substantial open-source ophthalmic language dataset for model training. Our LLMs (EYE-Llama) were first pre-trained on an ophthalmology-specific dataset that includes paper abstracts, textbooks, EyeWiki, and Wikipedia articles. Subsequently, the models were fine-tuned on a diverse range of QA datasets. The LLMs at each stage were then compared to baseline Llama 2, ChatDoctor, and ChatGPT (GPT3.5) models on four distinct test sets and evaluated quantitatively (Accuracy, F1 score, and BERTScore) and qualitatively by two ophthalmologists.
On the American Academy of Ophthalmology (AAO) test set, with BERTScore as the metric, our models surpassed both Llama 2 and ChatDoctor in F1 score and performed on par with ChatGPT, which was trained with 175 billion parameters (EYE-Llama: 0.57, Llama 2: 0.56, ChatDoctor: 0.56, and ChatGPT: 0.57). On the MedMCQA test set, the fine-tuned models achieved higher accuracy than the Llama 2 and ChatDoctor models (EYE-Llama: 0.39, Llama 2: 0.33, ChatDoctor: 0.29). However, ChatGPT outperformed EYE-Llama with an accuracy of 0.55. On the PubMedQA test set, the fine-tuned model was more accurate than the Llama 2, ChatGPT, and ChatDoctor models (EYE-Llama: 0.96, Llama 2: 0.90, ChatGPT: 0.93, ChatDoctor: 0.92).
The study shows that pre-training and fine-tuning LLMs like EYE-Llama enhance their performance in specific medical domains. Our EYE-Llama models surpass baseline Llama 2 in all evaluations, highlighting the effectiveness of specialized LLMs in medical QA systems. (Funded by NEI R15EY035804 (MNA) and UNC Charlotte Faculty Research Grant (MNA).)
In learning-to-rank problems, a privileged feature is one that is available during model training but not at test time. Such features naturally arise in merchandised recommendation systems; for instance, “user clicked this item” as a feature is predictive of “user purchased this item” in the offline data, but is clearly not available during online serving. Another source of privileged features is those that are too expensive to compute online but feasible to add offline. Privileged features distillation (PFD) refers to a natural idea: train a “teacher” model using all features (including privileged ones) and then use it to train a “student” model that does not use the privileged features. In this paper, we first study PFD empirically on three public ranking datasets and an industrial-scale ranking problem derived from Amazon’s logs. We show that PFD outperforms several baselines (no-distillation, pretraining-finetuning, self-distillation, and generalized distillation) on all these datasets. Next, we analyze why and when PFD performs well via both empirical ablation studies and theoretical analysis for linear models. Both investigations uncover an interesting non-monotone behavior: as the predictive power of a privileged feature increases, the performance of the resulting student model initially increases but then decreases. We show that the reason for the later decrease in performance is that a very predictive privileged teacher produces predictions with high variance, which lead to high-variance student estimates and inferior test performance.
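The teacher-then-student recipe described above can be illustrated with a minimal sketch for linear models. Everything here is an assumption made for illustration and is not taken from the paper: a single regular feature `x`, a single privileged feature `z`, a squared-error objective, and a hypothetical mixing weight `alpha` that blends the true labels with the teacher's predictions.

```python
def fit_teacher(xs, zs, ys):
    """Least squares for y ~ a*x + b*z, solved via the 2x2 normal equations."""
    sxx = sum(x * x for x in xs)
    szz = sum(z * z for z in zs)
    sxz = sum(x * z for x, z in zip(xs, zs))
    sxy = sum(x * y for x, y in zip(xs, ys))
    szy = sum(z * y for z, y in zip(zs, ys))
    det = sxx * szz - sxz * sxz
    a = (sxy * szz - sxz * szy) / det
    b = (sxx * szy - sxz * sxy) / det
    return a, b


def fit_student(xs, targets):
    """Least squares for target ~ w*x, using the regular feature only."""
    return sum(x * t for x, t in zip(xs, targets)) / sum(x * x for x in xs)


def pfd_student(xs, zs, ys, alpha=0.5):
    """Train a teacher on (x, z), then a student on x with blended targets.

    alpha is a hypothetical mixing weight: alpha=1 ignores the teacher,
    alpha=0 trains the student purely on teacher predictions.
    """
    a, b = fit_teacher(xs, zs, ys)
    teacher_preds = [a * x + b * z for x, z in zip(xs, zs)]
    targets = [alpha * y + (1 - alpha) * p for y, p in zip(ys, teacher_preds)]
    return fit_student(xs, targets)


# Toy data: the label depends on both the regular and the privileged feature.
xs = [1.0, 2.0, 3.0, 4.0]
zs = [1.0, -1.0, 1.0, -1.0]
ys = [2 * x + 3 * z for x, z in zip(xs, zs)]
student_weight = pfd_student(xs, zs, ys, alpha=0.3)
```

At serving time only `student_weight` and `x` are needed, so the privileged feature `z` never has to be computed online.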
Abstract. We study the problem of online path learning with non-additive gains, which is a central problem appearing in several applications, including ensemble structured prediction. We present new online algorithms for path learning with non-additive count-based gains for the three settings of full information, semi-bandit and full bandit with very favorable regret guarantees. A key component of our algorithms is the definition and computation of an intermediate context-dependent automaton that enables us to use existing algorithms designed for additive gains. We further apply our methods to the important application of ensemble structured prediction. Finally, beyond count-based gains, we give an efficient implementation of the EXP3 algorithm for the full bandit setting with an arbitrary (non-additive) gain.
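For orientation, the textbook EXP3 update over an explicitly enumerated set of arms can be sketched as follows. This is not the efficient path-based implementation the abstract refers to: it assumes each path is treated as one arm, that gains lie in [0, 1], and the names `gain_fn`, `gamma`, and `rng` are hypothetical.

```python
import math
import random


def exp3(gain_fn, n_arms, n_rounds, gamma=0.1, rng=None):
    """Basic EXP3: only the gain of the pulled arm is observed each round."""
    rng = rng or random.Random(0)
    weights = [1.0] * n_arms
    for t in range(n_rounds):
        total = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        gain = gain_fn(t, arm)  # assumed to lie in [0, 1]
        # Importance-weighted estimate of the gain of the pulled arm.
        estimate = gain / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
    total = sum(weights)
    return [(1 - gamma) * w / total + gamma / n_arms for w in weights]


# Toy usage: arm 0 always pays 1, arm 1 always pays 0.
final_probs = exp3(lambda t, arm: 1.0 if arm == 0 else 0.0, n_arms=2, n_rounds=500)
```

With one arm per path this enumeration grows exponentially with the graph size, which is the cost an efficient implementation has to avoid.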
Abstract. The standard techniques for online learning of combinatorial objects perform multiplicative updates followed by projections into the convex hull of all the objects. However, this methodology can be expensive if the convex hull contains many facets. For example, the convex hull of n-symbol Huffman trees is known to have exponentially many facets. We get around this difficulty by exploiting extended formulations, which encode the polytope of combinatorial objects in a higher dimensional “extended” space with only polynomially many facets. We develop a general framework for converting extended formulations into efficient online algorithms with good relative loss bounds. We present applications of our framework to online learning of Huffman trees and permutations. The regret bounds of the resulting algorithms are within a factor of O(√log(n)) of the state-of-the-art specialized algorithms for permutations, and depending on the loss regimes, improve on or match the state-of-the-art for Huffman trees. Our method is general and can be applied to other combinatorial objects.
Abstract. This thesis develops algorithms for learning combinatorial objects. A combinatorial object is a structured concept composed of components. Examples are permutations, Huffman trees, binary search trees and paths in a directed graph. Learning combinatorial objects is a challenging problem: First, the number of combinatorial objects is typically exponential in the number of components. Second, the convex hull of these objects is a polytope whose characterization in the original space may have exponentially many facets, or a description of the polytope in terms of facets/inequalities may not even be known. Finally, the loss of each object could be a complicated function of its components and may not be simply additive as a function of the components. In this thesis, we explore a wide variety of combinatorial objects and address the challenges above. For each combinatorial object, we go beyond the original space of the problem and introduce auxiliary spaces and representations. The representation of the objects in these auxiliary spaces admits additive losses and polytopes with polynomially many facets. This allows us to extend well-known algorithms like Expanded Hedge and Component Hedge to these combinatorial objects for the first time.
Abstract. A deep embedding forest-based (DEF) model for improving on-line serving time for classification learning methods and other tasks such as predicting user selection of search results provided in response to a query or image, speech or text recognition. Initially, a deep neural network (DNN) model is trained to determine parameters of an embedding layer, a stacking layer, deep layers and a scoring layer, thereby reducing high dimensional features. After training the DNN model, the parameters of the deep layers and the scoring layer of the DNN model are discarded and the parameters of the embedding layer and the stacking layer are extracted. The extracted parameters from the DNN model then initialize parameters of an embedding layer and a stacking layer of the DEF model such that only a forest layer of the DEF model is then required to be trained. Output from the DEF model is stored in computer memory.
Abstract. We introduce a new method for efficiently learning permutations that exploits the technique of extended formulation from the combinatorial optimization community. This extended formulation technique encodes the hard-to-describe polytope of permutations as the projection of an easier to describe polytope in a higher dimensional space. Although the best special-purpose algorithm for permutations has a slightly better regret bound, our methodology yields regret bounds like those of other published algorithms. The new methodology can be generalized to other combinatorial objects. It also has an elegant method of determining the algorithm’s initial weight vector and the use of the extended formulation leads to a faster and more natural way to make appropriate predictions.
Abstract. We consider the problem of repeatedly solving a variant of the same dynamic programming problem in successive trials. An instance of the type of problems we consider is to find a good binary search tree in a changing environment. At the beginning of each trial, the learner probabilistically chooses a tree with the n keys at the internal nodes and the n + 1 gaps between keys at the leaves. The learner is then told the frequencies of the keys and gaps and is charged the average search cost for the chosen tree. The problem is online because the frequencies can change between trials. The goal is to develop algorithms with the property that their total average search cost (loss) in all trials is close to the total loss of the best tree chosen in hindsight for all trials. The challenge, of course, is that the algorithm has to deal with an exponential number of trees. We develop a general methodology for tackling such problems for a wide class of dynamic programming algorithms. Our framework allows us to extend online learning algorithms like Hedge and Component Hedge to a significantly wider class of combinatorial objects than was possible before.
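As background for the Hedge algorithm mentioned above, here is a minimal sketch of plain Hedge over an explicitly listed set of experts. It assumes each expert stands for one candidate object (for example one tree), that losses lie in [0, 1], and that the learning rate `eta` is chosen by hand; the names are hypothetical and this is the naive baseline, not the framework from the paper.

```python
import math


def hedge(loss_rounds, eta=0.5):
    """Run Hedge on a sequence of per-expert loss vectors.

    Returns the final weight vector and the learner's total expected loss.
    """
    n = len(loss_rounds[0])
    weights = [1.0 / n] * n
    total_expected_loss = 0.0
    for losses in loss_rounds:
        # The learner samples an expert from `weights`; record its expected loss.
        total_expected_loss += sum(w * l for w, l in zip(weights, losses))
        # Multiplicative update followed by renormalization.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return weights, total_expected_loss


# Toy usage: expert 0 never incurs loss, expert 1 always does.
final_weights, learner_loss = hedge([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
```

Keeping one weight per object is only feasible for small sets; with exponentially many trees this explicit list is exactly what a more compact representation has to replace.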
Abstract. Deep Neural Networks (DNN) have demonstrated superior ability to extract high level embedding vectors from low level features. Despite the success, the serving time is still the bottleneck due to expensive run-time computation of multiple layers of dense matrices. GPGPU, FPGA, or ASIC-based serving systems require additional hardware that is not in the mainstream design of most commercial applications. In contrast, tree or forest-based models are widely adopted because of low serving cost, but heavily depend on carefully engineered features. This work proposes a Deep Embedding Forest model that benefits from the best of both worlds. The model consists of a number of embedding layers and a forest/tree layer. The former maps high dimensional (hundreds of thousands to millions) and heterogeneous low-level features to lower dimensional (thousands) vectors, and the latter ensures fast serving. Built on top of a representative DNN model called Deep Crossing, and two forest/tree-based models including XGBoost and LightGBM, a two-step Deep Embedding Forest algorithm is demonstrated to achieve on-par or slightly better performance as compared with the DNN counterpart, with only a fraction of serving time on conventional hardware. After comparing with a joint optimization algorithm called partial fuzzification, also proposed in this paper, it is concluded that the two-step Deep Embedding Forest has achieved near optimal performance. Experiments based on large scale data sets (up to 1 billion samples) from a major sponsored search engine prove the efficacy of the proposed model.
Abstract. Personalized health-care is trending and individuals tend to wear sensors in order to record their own health data. As a part of this trend, any redundancy in the data captured by wearable sensors must be exploited to reduce the number of devices one may wear. In this thesis, we work with a device which senses breathing and pulse through a pressure tube and pulse oximetry, respectively. Extracting the dependency between these two measurements, we approximately predict the breathing rate by first reconstructing the breathing signal using the data coming from the finger-tip sensor, and then detecting the peaks in the reconstructed signal. For breathing signal reconstruction, two different techniques are used: (1) applying low- and high-pass filters on the pulse signal and (2) training a neural network on a prepared dataset. Our experiments show that neural networks perform better than filters in reconstructing the breathing signal and, consequently, in predicting the breathing rate.
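The final step described above, turning a reconstructed breathing signal into a rate by detecting peaks, can be sketched as follows. This is a minimal illustration under stated assumptions: the signal is already reconstructed and reasonably smooth, each breath produces exactly one local maximum, and the function names, the `min_height` threshold, and the synthetic test signal are all hypothetical.

```python
import math


def count_peaks(signal, min_height=0.0):
    """Count strict local maxima whose value exceeds min_height."""
    return sum(
        1
        for i in range(1, len(signal) - 1)
        if signal[i - 1] < signal[i] > signal[i + 1] and signal[i] > min_height
    )


def breathing_rate_bpm(signal, sample_rate_hz):
    """Estimate breaths per minute as peaks divided by duration in minutes."""
    duration_min = len(signal) / sample_rate_hz / 60.0
    return count_peaks(signal) / duration_min


# Synthetic stand-in for a reconstructed signal: 0.25 Hz sine, 10 Hz sampling, 60 s.
sample_rate_hz = 10
synthetic = [math.sin(2 * math.pi * 0.25 * t / sample_rate_hz) for t in range(600)]
rate = breathing_rate_bpm(synthetic, sample_rate_hz)
```

On noisy real data a plain local-maximum test would over-count, so some smoothing or a minimum distance between peaks would likely be needed before this step.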