In modern scientific discovery, it is becoming increasingly critical to uncover whether one property of a dataset is related to another. MGC (pronounced "magic"), or Multiscale Generalized Correlation, provides a framework for investigating relationships between properties of a dataset and the underlying geometries of those relationships, all while requiring sample sizes feasible in real data scenarios. Our work can be found in the MGC R package, and we are currently summarizing our results in a manuscript.
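To give a flavor of the statistic MGC builds on: it generalizes distance correlation by scanning over local neighborhood scales of the two distance matrices. Below is a minimal numpy sketch of the global (single-scale) distance correlation only, for one-dimensional data; the multiscale machinery in the MGC package itself is more involved, and the function name here is illustrative.

```python
import numpy as np

def dist_corr(x, y):
    """Global distance correlation between two 1-D samples: the
    single-scale statistic that MGC generalizes by restricting the
    distance matrices to k-nearest-neighbor scales."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # pairwise distance matrices
    A = np.abs(x[:, None] - x[None, :])
    B = np.abs(y[:, None] - y[None, :])
    # double-center each matrix (row, column, and grand means removed)
    A = A - A.mean(0) - A.mean(1)[:, None] + A.mean()
    B = B - B.mean(0) - B.mean(1)[:, None] + B.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

x = np.arange(20, dtype=float)
print(dist_corr(x, 2 * x + 1))  # perfect linear dependence -> 1.0
```

Unlike Pearson correlation, this statistic is nonzero for any form of dependence, not just linear, which is what makes the multiscale extension useful for discovering relationship geometry.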
Supervised learning techniques designed for the situation where dimensionality exceeds sample size tend to overfit as the dimensionality of the data increases. To remedy this HDLSS (high dimensionality, low sample size) situation, we attempt to learn a lower-dimensional representation of the data before learning a classifier. That is, we project the data to a more manageable dimensionality, where standard classification or clustering techniques apply better since there are fewer dimensions to overfit. A number of previous works have focused on how to strategically reduce dimensionality in the unsupervised case, yet in the labeled HDLSS (LHDLSS) regime, few works have attempted to devise dimensionality reduction techniques that leverage the labels associated with the data. In this package, we provide several methods for feature extraction, some utilizing labels and some not, along with utilities to simplify cross-validated efforts to identify the best feature extraction method, measuring performance with several classification algorithms. Additionally, we provide a series of adaptable benchmark simulations to serve as a standard for future investigative efforts into supervised HDLSS learning. Finally, we provide a comprehensive comparison of the included algorithms across a range of benchmark simulations and real data applications. Our work can be found in the LOL R package, and we are currently summarizing our results in a manuscript.
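As a rough illustration of the idea of a label-aware projection, the sketch below combines the difference of class-conditional means with top principal components of the class-centered data, then projects onto an orthonormalized basis. This is a hedged numpy sketch of a LOL-style projection assuming exactly two classes; the actual algorithms in the LOL package differ in detail, and `lol_project` is an illustrative name, not the package's API.

```python
import numpy as np

def lol_project(X, y, k):
    """Sketch of a supervised low-rank projection: the first direction
    is the difference of the two class means, the remaining k-1 are
    principal components of the per-class-centered data."""
    classes = np.unique(y)
    mu = {c: X[y == c].mean(axis=0) for c in classes}
    delta = (mu[classes[0]] - mu[classes[1]])[:, None]    # (d, 1)
    Xc = np.vstack([X[y == c] - mu[c] for c in classes])  # center per class
    # top right-singular vectors = principal components
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = np.hstack([delta, Vt[: k - 1].T])                 # (d, k)
    Q, _ = np.linalg.qr(W)                                # orthonormalize
    return X @ Q

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))   # n=40 samples, d=100 dims: HDLSS
y = np.repeat([0, 1], 20)
X[y == 1] += 1.0                 # shift class 1 so labels carry signal
Z = lol_project(X, y, k=5)
print(Z.shape)  # (40, 5)
```

The point of leading with the mean-difference direction is that it is exactly the direction an unsupervised method like PCA can miss when the class signal is small relative to the noise variance.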
In modern graph analytics, it is often the case that dimensions of graph attributes can be batched together to simplify summarizing models of a graph. For instance, in binary graphs, the popular Stochastic Block Model (SBM) groups the vertices of a graph into vertex communities, and analyzes the probability of edges existing between the different vertex communities. In the SIEM, we generalize the Stochastic Block Model and the Independent Edge Model to treat a graph as a collection of edge communities with disparate interactions between the respective communities. In conjunction with the graphs processed by NDMG, we are able to identify unique edge community interactions between graphs obtained from diffusion MRI (dMRI) and functional MRI (fMRI). In particular, we show that ipsilateral connectivity (edges connecting vertices in the same hemisphere) exceeds contralateral connectivity (edges connecting vertices in opposite hemispheres) more strongly in dMRI graphs than in fMRI graphs (see the hemispheric connectivity notebook). On the other hand, homotopic connectivity (between the same brain region in opposite hemispheres; e.g., an edge connecting the primary motor cortex in the left and right hemispheres) exceeds heterotopic connectivity (between different regions in opposite hemispheres; e.g., an edge connecting the left primary motor cortex to the right primary visual cortex) more strongly in fMRI graphs than in dMRI graphs (see the bilateral connectivity notebook). Finally, using the SIEM, we are able to show that pooling across sites reveals the presence of batch effects in both bilateral and hemispheric connectivity, for dMRI and fMRI graphs alike. This suggests the need for improved methods for model generalizability in human connectomics. Our model, and several other existing models, are being prepared into the graphstats R package.
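The core computation behind the ipsilateral-versus-contralateral comparison can be sketched simply: partition the edges into two communities by hemisphere labels and compare their mean connectivity. The toy example below is a hedged numpy sketch under that two-community assumption; the function and variable names are illustrative, not the SIEM or graphstats API.

```python
import numpy as np

def community_rates(A, hemi):
    """Mean edge weight within vs. across hemispheres, treating the
    graph as two edge communities (ipsilateral vs. contralateral).
    `hemi` is a 0/1 hemisphere label per vertex."""
    same = hemi[:, None] == hemi[None, :]
    off_diag = ~np.eye(len(hemi), dtype=bool)   # exclude self-loops
    ipsi = A[same & off_diag].mean()
    contra = A[~same].mean()
    return ipsi, contra

# toy dMRI-like graph: dense within hemispheres, sparse across
rng = np.random.default_rng(1)
hemi = np.repeat([0, 1], 10)
p = np.where(hemi[:, None] == hemi[None, :], 0.8, 0.2)
A = (rng.random((20, 20)) < p).astype(float)
A = np.triu(A, 1); A = A + A.T                  # symmetric, no self-loops
ipsi, contra = community_rates(A, hemi)
print(ipsi > contra)
```

The actual SIEM goes further by fitting per-community edge probabilities and testing the difference formally, rather than just comparing sample means.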
Scientific validity and clinical utility depend on our ability to generate replicable scientific results. Potential obstacles to reproducibility and replicability include measurement error, variability of sample demographics, data acquisition details, analysis methods, and statistical errors. We developed a set of statistical and computational principles to guide the design of pipelines that eliminate as many sources of variability as possible. As an example, we introduce NeuroData's MRI to Graph (NDMG) pipeline. Running NDMG on data from 17 functional studies and 11 diffusion studies (a total of 3,571 individuals and 5,993 scans) demonstrates that certain coarse-scale connectome properties are consistently preserved across all studies, regardless of scanner manufacturer, acquisition protocols, and demographics. Nevertheless, other, finer-scale connectome properties remain highly variable, even after harmonizing data processing and controlling for these additional sources of variability. This work therefore suggests the need for further efforts to mitigate sources of variability in multimodal connectome data. Our pipeline is packaged in a Docker container and as the ndmg python package, and our in-progress manuscript can be found on BioRxiv.
When looking at the relationship between weighted networks and traditional small-world measures, my colleagues and I found that existing measures were seriously lacking. Not only were existing measures highly dependent on nuisance parameters such as density, but they also failed to account for a key feature of weighted networks: connection strength. For example, two networks with the same nodes connected, one with connections of varying (weighted) strengths and the other with equivalent (binarized) strengths, would receive the exact same small-world coefficient.
To rectify this issue, we introduce the Small-World Propensity in our paper, Small-World Propensity in Weighted, Real-World Networks. The Small-World Propensity provides a robust statistic for assessing small-world structure in networks of varying densities. We find that, compared to higher-level organisms such as humans, the C. elegans network demonstrates remarkably low Small-World Propensity, a significant observation since C. elegans represents the canonical biological example of a small-world network.
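Concretely, the statistic measures how far the observed clustering coefficient C and characteristic path length L deviate from those of density-matched lattice and random null networks. A minimal sketch of that combination step, assuming the null-network quantities have already been computed (estimating them is the bulk of the real method):

```python
import numpy as np

def small_world_propensity(C_obs, L_obs, C_latt, C_rand, L_latt, L_rand):
    """Combine clustering (C) and path length (L) deviations from
    lattice and random null networks into one score in [0, 1];
    each fractional deviation is clipped to [0, 1]."""
    dC = np.clip((C_latt - C_obs) / (C_latt - C_rand), 0, 1)
    dL = np.clip((L_obs - L_rand) / (L_latt - L_rand), 0, 1)
    return 1 - np.sqrt((dC ** 2 + dL ** 2) / 2)

# lattice-like clustering with near-random path length:
# the small-world regime, so the score is close to 1
swp = small_world_propensity(C_obs=0.5, L_obs=2.1,
                             C_latt=0.5, C_rand=0.1,
                             L_latt=10.0, L_rand=2.0)
print(round(swp, 2))  # 0.99
```

Because both deviations enter symmetrically, a network is penalized whether it loses clustering relative to the lattice or gains path length relative to the random graph, which is what makes the score robust across densities.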
In the past 20 years, applying computational statistics to the brain has become an intriguing problem for researchers. One of the most popular approaches is ROI timeseries analysis of fMRI data. To conduct ROI timeseries analysis, researchers map regions of the brain to an atlas of arbitrary size; regions may be determined anatomically, functionally, or by other features such as activity correlations over a massive dataset. The atlas essentially "maps" every volume-pixel, or voxel, of the brain to a region. Then, given a new scan, we determine which voxels lie within each region and form, for each voxel, a vector of intensities over the timecourse. We then average the activity levels of all voxels contained in each region of interest, and are left with a "timeseries" for each ROI.
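The averaging step described above is mechanically simple; here is a hedged numpy sketch with synthetic data, where `bold` stands in for a registered 4-D fMRI volume and `atlas` for an integer-labeled parcellation (names are illustrative):

```python
import numpy as np

def roi_timeseries(bold, atlas):
    """Average the BOLD signal over all voxels in each atlas region.
    `bold`: (x, y, z, t) 4-D scan; `atlas`: (x, y, z) integer labels,
    with 0 meaning 'outside any region'."""
    labels = np.unique(atlas)
    labels = labels[labels != 0]
    # boolean-mask the spatial dims, then average over voxels:
    # one mean timeseries per region of interest
    return np.stack([bold[atlas == lab].mean(axis=0) for lab in labels])

rng = np.random.default_rng(2)
atlas = rng.integers(0, 4, size=(8, 8, 8))   # 3 ROIs plus background
bold = rng.normal(size=(8, 8, 8, 50))        # 50 timepoints
ts = roi_timeseries(bold, atlas)
print(ts.shape)  # (3, 50)
```

In practice the hard part is everything upstream of this step: registering the scan to the atlas space, choosing the atlas, and deciding how (or whether) to weight voxels before averaging.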
If that sounds pretty complicated, you are not alone in your thoughts. Each step along the way from brain image to timeseries has many researcher-defined options and parameters, and the common academic is left puzzled about which to choose for a given study. As part of our MR-images-to-graphs effort, our team is looking into some of the most common timeseries extraction methods and comparing their performance. To compare performance, we seek the measures that show the most robustness across scans for a single subject. Stated another way, we are looking for the measures that allow researchers to extract timeseries that are reliable for a single subject, and particularly, that will be repeatable whenever analysis is conducted on that same subject following a future fMRI scan. We have introduced numerous new statistics and measures, in addition to proposing a standardized, optimal processing pipeline to ensure repeatability of fMRI analysis.
Have you ever been exploring a city and been curious about your safety? Stroll Safe compiles the latest crime data for a given area, and uses a custom algorithm to determine the crime intensity of that area (a combination of crime frequency and crime severity). It presents users with a heat map of the area, and provides push notifications if they walk into a dangerous area. By simply pressing the "I feel unsafe" button at the bottom, a user can easily dial 911, call an emergency contact, or call a taxi. Our project won the Everyblock API award, and we are currently working on improving the project for a potential startup. Read more about our project on Devpost.
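To make the frequency-plus-severity idea concrete, here is a purely hypothetical sketch of how such a score could be computed per map cell. This is not the app's actual algorithm; the categories, weights, and function name are all invented for illustration.

```python
def crime_intensity(incidents, severity_weights):
    """Hypothetical score for one heat-map cell: sum a severity
    weight for each reported incident, so both how *often* and how
    *serious* crimes are drive the intensity."""
    return sum(severity_weights.get(kind, 1.0) for kind in incidents)

# illustrative severity weights, not real data
weights = {"assault": 5.0, "robbery": 4.0, "theft": 2.0}
cell = ["theft", "theft", "assault"]
print(crime_intensity(cell, weights))  # 9.0
```

A cell with many minor incidents and a cell with a few severe ones can end up with similar intensities, which is the behavior a combined frequency-and-severity score is meant to capture.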
My contributions: Along with being a primary developer on the project, my contributions chiefly focused on developing the entire UI. Having never worked with Android before, I accomplished this daunting task with tons of reading, tutorials, lots of caffeine, and very little sleep :P
Technologies used: Android Studio, Java, XML, Python, Google Maps API, OpenDataPhilly, Everyblock API
Press: Technical.ly article
A few future projects are in progress, along with some awesome new hackathon ideas, so stay tuned for some cool new things!