Asked  7 Months ago    Answers:  5   Viewed   26 times

I have a large data.frame of character data that I want to convert based on what is commonly called a dictionary in other languages.

Currently I am going about it like so:

foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)
foo <- replace(foo, foo == "AA", "0101")
foo <- replace(foo, foo == "AC", "0102")
foo <- replace(foo, foo == "AG", "0103")

This works fine, but it is obviously not pretty and seems silly to repeat the replace statement each time I want to replace one item in the data.frame.

Is there a better way to do this since I have a dictionary of approximately 25 key/value pairs?

 Answers

22
map = setNames(c("0101", "0102", "0103"), c("AA", "AC", "AG"))
foo[] <- map[unlist(foo)]

This assumes that map covers every value in foo; any value without a key in map (such as "AT" or "GG" in the example) becomes NA. It would feel less like a 'hack' and be more efficient in both space and time if foo were a character matrix, in which case:

matrix(map[foo], nrow=nrow(foo), dimnames=dimnames(foo))

Both matrix and data frame variants run afoul of R's 2^31-1 limit on vector size when there are millions of SNPs and thousands of samples.
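
If map might not cover every value, one option is to recode column by column and leave unmatched values (and NA) untouched. This is a minimal sketch, assuming that is the behaviour you want; recode_keep is a hypothetical helper name, not part of base R:

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)
map <- setNames(c("0101", "0102", "0103"), c("AA", "AC", "AG"))

# Hypothetical helper: translate values that have a key in map,
# keep everything else (including NA) as it was.
recode_keep <- function(x, map) {
  hit <- match(x, names(map))            # NA where x has no key in map
  ifelse(is.na(hit), x, unname(map)[hit])
}

# foo[] <- ... replaces the columns while keeping foo a data.frame
foo[] <- lapply(foo, recode_keep, map = map)
foo
```

With the example data, "AA" and "AG" are translated, while "AT", "GG", "GC" and the NA in snp3 pass through unchanged.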

Tuesday, June 1, 2021
 
Optimus
answered 7 Months ago
58

This is because the values in your gene column are not gene IDs; they are peptide IDs (they start with ENSP). To get the information you need, try replacing ensembl_gene_id with ensembl_peptide_id:

G_list <- getBM(filters = "ensembl_peptide_id", 
                attributes = c("ensembl_peptide_id", "entrezgene", "description"),
                values = genes, mart = mart)

Also, what you are really looking for is the hgnc_symbol attribute.

Here is the complete code to get your output:

library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- df$gene   # peptide IDs (ENSP...) from the "gene" column used in merge() below
df <- df[, -4]
G_list <- getBM(filters = "ensembl_peptide_id",
                attributes = c("ensembl_peptide_id", "hgnc_symbol"),
                values = genes, mart = mart)
merge(df, G_list, by.x = "gene", by.y = "ensembl_peptide_id")
Sunday, August 1, 2021
 
anon_swe
answered 4 Months ago
17

This happens because you assigned the two-column matrix returned by apply to a single new column, so the result is a matrix stored in one column. You can convert it back to a normal data.frame with

 do.call(data.frame, df)

A more straightforward method is to assign to two columns. I use lapply instead of apply because the columns can be of different classes: apply converts its input to a matrix, so with mixed classes every column becomes 'character', whereas lapply returns a list and preserves each column's class.

df[paste0('new.letters', names(df)[2:3])] <- lapply(df[2:3], fun.split)
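
The difference can be seen on a small example. This is only a sketch: the data frame and first_letter (a stand-in for fun.split, whose definition is not shown here) are assumptions made for illustration.

```r
df <- data.frame(id = 1:3,
                 x = c("a1", "b2", "c3"),
                 y = c("d4", "e5", "f6"),
                 stringsAsFactors = FALSE)
first_letter <- function(v) substr(v, 1, 1)  # hypothetical stand-in for fun.split

# apply() returns a 3 x 2 character matrix; assigning it to one name
# stores the whole matrix in a single column.
bad <- df
bad$new <- apply(bad[2:3], 2, first_letter)
ncol(bad)            # still counts "new" as one column
is.matrix(bad$new)

# do.call(data.frame, ...) splits the matrix column into ordinary columns.
flat <- do.call(data.frame, bad)
ncol(flat)

# lapply() returns a list with one element per column, so each result
# lands in its own new column.
good <- df
good[paste0("new.letters", names(good)[2:3])] <- lapply(good[2:3], first_letter)
names(good)
```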
Friday, August 13, 2021
 
Baba
answered 4 Months ago
25

The numeric keys were entered as numbers and you are fetching them as strings. I suggest that you stick to one convention for your dictionary.

Sub TestDict()
  Dim dict As New Dictionary
  dict.Add 1, "one"
  dict.Add "2", "two"

  Debug.Print dict("1")     ' Nothing
  Debug.Print dict(1)       ' one

  Debug.Print dict("2")    ' two
  Debug.Print dict(2)      ' Nothing
End Sub

Solution

Choose a convention for your dictionary and stick to it. In this application I would adopt the convention of always converting keys to strings, both when inserting and when fetching. A few changes in your code can achieve that:

If Not PODict.Exists(CStr(Range("A" & i).Value)) Then ' could use .Text also

PODict.Add CStr(ID), Names


SOArr = Split(PODict(CStr(POSelected)), ",") ' maybe not needed here, but to illustrate
Thursday, August 26, 2021
 
Andrea Ligios
answered 4 Months ago
80

Add another two dummy entries to your corrections dict:

corrections = {'male'   : 'male',    # dummy entry for male
               'female' : 'female',  # dummy entry for female
               'mail'   : 'male', 
               'maela'  : 'male', 
               'maae'   : 'male'}

Now, use map and fillna:

df11.Gender = df11.Gender.map(corrections).fillna('other')
df11

   course_id  AcademicYear_to  months  TotalFee  Gender
0        260             2017      24       100    male
1        260             2018      12       140    male
2        274             2016      36       300    male
3        274             2017      24       340  female
4        274             2018      12       200   other
5        285             2017      24       300   other
6        285             2018      12       200    male
Friday, August 27, 2021
 
laconicdev
answered 4 Months ago