Chemical Formula Parser












8












$begingroup$


As an exercise, I wrote a simple context free parser modeled from a pushdown automaton for chemical formulas such as NO2 or Fe(NO3)3.



To explain a bit about the decisions made:




  1. The state functions (pstart, pchem and pnum) are lambdas because they carry around a lot of context. They could be methods of a class, it doesn't really matter.


  2. The state functions have the input character injected because I realized halfway that I have to process some more after the line ended and I don't want to duplicate the state functions.



I am mostly concerned about the general messiness of the parser. There is no obvious way to know where variables are mutated, control flow is all over the place and consequently I find it difficult to reason about the program. Is there a better code structure?



Finally, I'd like to know if I caught all illegal syntax. Note repeating chemicals like NN is allowed.



#include<iostream>
#include<stack>
#include<unordered_map>
#include<string>
#include<iterator>

using namespace std::literals;

enum class state_t { start, num, chem, error };

double parse(std::string line, const std::unordered_map<std::string, double>& m)
{
auto b = line.begin(), e = line.end();
std::stack<double> stk;
int num = 0;
std::string chem;
state_t state = state_t::start;

auto pstart = [&](char c) {
if(std::isupper(c))
{
chem = ""s + c;
state = state_t::chem;
return true;
}
else if(std::isdigit(c))
{
if(stk.empty())
state = state_t::error;
else
{
num = c - '0';
state = state_t::num;
return true;
}
}
else if(c == '(')
{
stk.push(-1);
return true;
}
else if(c == ')')
{
double sum = 0;
while(!stk.empty() && stk.top() > 0)
{
sum += stk.top();
stk.pop();
}
if(stk.empty())
state = state_t::error;
else
{
stk.pop();
stk.push(sum);
return true;
}
}
else
state = state_t::error;

return false;
};

auto pnum = [&](char c) {
if(std::isdigit(c))
{
num = 10 * num + c - '0';
return true;
}
else
{
stk.top() *= num;
state = state_t::start;
}
return false;
};

auto pchem = [&](char c){
if(std::islower(c))
{
chem += c;
return true;
}
else
{
if(auto it = m.find(chem); it != m.end())
{
stk.push(it->second);
state = state_t::start;
}
else
state = state_t::error;
}
return false;
};

while(b != e)
{
switch(state)
{
case state_t::start:
if(pstart(*b))
b++;
break;
case state_t::num:
if(pnum(*b))
b++;
break;
case state_t::chem:
if(pchem(*b))
b++;
break;
default:
return -1;
}
}

switch(state)
{
case state_t::num:
pnum('n');
break;
case state_t::chem:
pchem('n');
break;
}

if(state == state_t::error)
return -1;

double sum = 0;
while(!stk.empty() && stk.top() > 0)
{
sum += stk.top();
stk.pop();
}
if(stk.size() > 0) // expected ')'
return -1;

return sum;
}

int main()
{
std::unordered_map<std::string, double> m;
m["Na"] = 23.5;
m["Cl"] = 35.5;
m["O"] = 16;
m["N"] = 14;
m["Fe"] = 55.8;

std::string line;
while(getline(std::cin, line))
std::cout << parse(std::move(line), m) << 'n';
}









share|improve this question











$endgroup$








  • 1




    $begingroup$
    I realized if(stk.size() > 1) should be if(stk.size() > 0), should I edit?
    $endgroup$
    – 友人A
    Jan 27 at 11:08






  • 2




    $begingroup$
    Feel free to edit as long as there are no answers. But try to not edit as time goes by.
    $endgroup$
    – Incomputable
    Jan 27 at 11:26






  • 2




    $begingroup$
    Note, the singular is “automaton”.
    $endgroup$
    – Konrad Rudolph
    Jan 27 at 14:26






  • 1




    $begingroup$
    There is a standard for writing chemical formulas to be machine parsable so that they are not ambiguous. Its called SMILE Simple Molecule Input Line Entry. The trouble with your system is that for more complex molecules you can't be unambiguous.
    $endgroup$
    – Martin York
    Jan 27 at 18:50








  • 1




    $begingroup$
    You said that you wrote a parser but the result of the function is double so clearly you’re doing way more than just parsing, and it’s not at all clear (from the names, nor the usage) what transformation you’re performing. I’m guessing you are calculating the molecular weight? In which case the function parse should be renamed appropriately.
    $endgroup$
    – Konrad Rudolph
    Jan 28 at 11:32


















8












$begingroup$


As an exercise, I wrote a simple context free parser modeled from a pushdown automaton for chemical formulas such as NO2 or Fe(NO3)3.



To explain a bit about the decisions made:




  1. The state functions (pstart, pchem and pnum) are lambdas because they carry around a lot of context. They could be methods of a class, it doesn't really matter.


  2. The state functions have the input character injected because I realized halfway that I have to process some more after the line ended and I don't want to duplicate the state functions.



I am mostly concerned about the general messiness of the parser. There is no obvious way to know where variables are mutated, control flow is all over the place and consequently I find it difficult to reason about the program. Is there a better code structure?



Finally, I'd like to know if I caught all illegal syntax. Note repeating chemicals like NN is allowed.



#include<iostream>
#include<stack>
#include<unordered_map>
#include<string>
#include<iterator>

using namespace std::literals;

enum class state_t { start, num, chem, error };

double parse(std::string line, const std::unordered_map<std::string, double>& m)
{
auto b = line.begin(), e = line.end();
std::stack<double> stk;
int num = 0;
std::string chem;
state_t state = state_t::start;

auto pstart = [&](char c) {
if(std::isupper(c))
{
chem = ""s + c;
state = state_t::chem;
return true;
}
else if(std::isdigit(c))
{
if(stk.empty())
state = state_t::error;
else
{
num = c - '0';
state = state_t::num;
return true;
}
}
else if(c == '(')
{
stk.push(-1);
return true;
}
else if(c == ')')
{
double sum = 0;
while(!stk.empty() && stk.top() > 0)
{
sum += stk.top();
stk.pop();
}
if(stk.empty())
state = state_t::error;
else
{
stk.pop();
stk.push(sum);
return true;
}
}
else
state = state_t::error;

return false;
};

auto pnum = [&](char c) {
if(std::isdigit(c))
{
num = 10 * num + c - '0';
return true;
}
else
{
stk.top() *= num;
state = state_t::start;
}
return false;
};

auto pchem = [&](char c){
if(std::islower(c))
{
chem += c;
return true;
}
else
{
if(auto it = m.find(chem); it != m.end())
{
stk.push(it->second);
state = state_t::start;
}
else
state = state_t::error;
}
return false;
};

while(b != e)
{
switch(state)
{
case state_t::start:
if(pstart(*b))
b++;
break;
case state_t::num:
if(pnum(*b))
b++;
break;
case state_t::chem:
if(pchem(*b))
b++;
break;
default:
return -1;
}
}

switch(state)
{
case state_t::num:
pnum('n');
break;
case state_t::chem:
pchem('n');
break;
}

if(state == state_t::error)
return -1;

double sum = 0;
while(!stk.empty() && stk.top() > 0)
{
sum += stk.top();
stk.pop();
}
if(stk.size() > 0) // expected ')'
return -1;

return sum;
}

int main()
{
std::unordered_map<std::string, double> m;
m["Na"] = 23.5;
m["Cl"] = 35.5;
m["O"] = 16;
m["N"] = 14;
m["Fe"] = 55.8;

std::string line;
while(getline(std::cin, line))
std::cout << parse(std::move(line), m) << 'n';
}









share|improve this question











$endgroup$








  • 1




    $begingroup$
    I realized if(stk.size() > 1) should be if(stk.size() > 0), should I edit?
    $endgroup$
    – 友人A
    Jan 27 at 11:08






  • 2




    $begingroup$
    Feel free to edit as long as there are no answers. But try to not edit as time goes by.
    $endgroup$
    – Incomputable
    Jan 27 at 11:26






  • 2




    $begingroup$
    Note, the singular is “automaton”.
    $endgroup$
    – Konrad Rudolph
    Jan 27 at 14:26






  • 1




    $begingroup$
    There is a standard for writing chemical formulas to be machine parsable so that they are not ambiguous. Its called SMILE Simple Molecule Input Line Entry. The trouble with your system is that for more complex molecules you can't be unambiguous.
    $endgroup$
    – Martin York
    Jan 27 at 18:50








  • 1




    $begingroup$
    You said that you wrote a parser but the result of the function is double so clearly you’re doing way more than just parsing, and it’s not at all clear (from the names, nor the usage) what transformation you’re performing. I’m guessing you are calculating the molecular weight? In which case the function parse should be renamed appropriately.
    $endgroup$
    – Konrad Rudolph
    Jan 28 at 11:32
















8












8








8


2



$begingroup$


As an exercise, I wrote a simple context free parser modeled from a pushdown automaton for chemical formulas such as NO2 or Fe(NO3)3.



To explain a bit about the decisions made:




  1. The state functions (pstart, pchem and pnum) are lambdas because they carry around a lot of context. They could be methods of a class, it doesn't really matter.


  2. The state functions have the input character injected because I realized halfway that I have to process some more after the line ended and I don't want to duplicate the state functions.



I am mostly concerned about the general messiness of the parser. There is no obvious way to know where variables are mutated, control flow is all over the place and consequently I find it difficult to reason about the program. Is there a better code structure?



Finally, I'd like to know if I caught all illegal syntax. Note repeating chemicals like NN is allowed.



#include<iostream>
#include<stack>
#include<unordered_map>
#include<string>
#include<iterator>

using namespace std::literals;

enum class state_t { start, num, chem, error };

double parse(std::string line, const std::unordered_map<std::string, double>& m)
{
auto b = line.begin(), e = line.end();
std::stack<double> stk;
int num = 0;
std::string chem;
state_t state = state_t::start;

auto pstart = [&](char c) {
if(std::isupper(c))
{
chem = ""s + c;
state = state_t::chem;
return true;
}
else if(std::isdigit(c))
{
if(stk.empty())
state = state_t::error;
else
{
num = c - '0';
state = state_t::num;
return true;
}
}
else if(c == '(')
{
stk.push(-1);
return true;
}
else if(c == ')')
{
double sum = 0;
while(!stk.empty() && stk.top() > 0)
{
sum += stk.top();
stk.pop();
}
if(stk.empty())
state = state_t::error;
else
{
stk.pop();
stk.push(sum);
return true;
}
}
else
state = state_t::error;

return false;
};

auto pnum = [&](char c) {
if(std::isdigit(c))
{
num = 10 * num + c - '0';
return true;
}
else
{
stk.top() *= num;
state = state_t::start;
}
return false;
};

auto pchem = [&](char c){
if(std::islower(c))
{
chem += c;
return true;
}
else
{
if(auto it = m.find(chem); it != m.end())
{
stk.push(it->second);
state = state_t::start;
}
else
state = state_t::error;
}
return false;
};

while(b != e)
{
switch(state)
{
case state_t::start:
if(pstart(*b))
b++;
break;
case state_t::num:
if(pnum(*b))
b++;
break;
case state_t::chem:
if(pchem(*b))
b++;
break;
default:
return -1;
}
}

switch(state)
{
case state_t::num:
pnum('n');
break;
case state_t::chem:
pchem('n');
break;
}

if(state == state_t::error)
return -1;

double sum = 0;
while(!stk.empty() && stk.top() > 0)
{
sum += stk.top();
stk.pop();
}
if(stk.size() > 0) // expected ')'
return -1;

return sum;
}

int main()
{
std::unordered_map<std::string, double> m;
m["Na"] = 23.5;
m["Cl"] = 35.5;
m["O"] = 16;
m["N"] = 14;
m["Fe"] = 55.8;

std::string line;
while(getline(std::cin, line))
std::cout << parse(std::move(line), m) << 'n';
}









share|improve this question











$endgroup$




As an exercise, I wrote a simple context free parser modeled from a pushdown automaton for chemical formulas such as NO2 or Fe(NO3)3.



To explain a bit about the decisions made:




  1. The state functions (pstart, pchem and pnum) are lambdas because they carry around a lot of context. They could be methods of a class, it doesn't really matter.


  2. The state functions have the input character injected because I realized halfway that I have to process some more after the line ended and I don't want to duplicate the state functions.



I am mostly concerned about the general messiness of the parser. There is no obvious way to know where variables are mutated, control flow is all over the place and consequently I find it difficult to reason about the program. Is there a better code structure?



Finally, I'd like to know if I caught all illegal syntax. Note repeating chemicals like NN is allowed.



#include<iostream>
#include<stack>
#include<unordered_map>
#include<string>
#include<iterator>

using namespace std::literals;

enum class state_t { start, num, chem, error };

double parse(std::string line, const std::unordered_map<std::string, double>& m)
{
auto b = line.begin(), e = line.end();
std::stack<double> stk;
int num = 0;
std::string chem;
state_t state = state_t::start;

auto pstart = [&](char c) {
if(std::isupper(c))
{
chem = ""s + c;
state = state_t::chem;
return true;
}
else if(std::isdigit(c))
{
if(stk.empty())
state = state_t::error;
else
{
num = c - '0';
state = state_t::num;
return true;
}
}
else if(c == '(')
{
stk.push(-1);
return true;
}
else if(c == ')')
{
double sum = 0;
while(!stk.empty() && stk.top() > 0)
{
sum += stk.top();
stk.pop();
}
if(stk.empty())
state = state_t::error;
else
{
stk.pop();
stk.push(sum);
return true;
}
}
else
state = state_t::error;

return false;
};

auto pnum = [&](char c) {
if(std::isdigit(c))
{
num = 10 * num + c - '0';
return true;
}
else
{
stk.top() *= num;
state = state_t::start;
}
return false;
};

auto pchem = [&](char c){
if(std::islower(c))
{
chem += c;
return true;
}
else
{
if(auto it = m.find(chem); it != m.end())
{
stk.push(it->second);
state = state_t::start;
}
else
state = state_t::error;
}
return false;
};

while(b != e)
{
switch(state)
{
case state_t::start:
if(pstart(*b))
b++;
break;
case state_t::num:
if(pnum(*b))
b++;
break;
case state_t::chem:
if(pchem(*b))
b++;
break;
default:
return -1;
}
}

switch(state)
{
case state_t::num:
pnum('n');
break;
case state_t::chem:
pchem('n');
break;
}

if(state == state_t::error)
return -1;

double sum = 0;
while(!stk.empty() && stk.top() > 0)
{
sum += stk.top();
stk.pop();
}
if(stk.size() > 0) // expected ')'
return -1;

return sum;
}

int main()
{
std::unordered_map<std::string, double> m;
m["Na"] = 23.5;
m["Cl"] = 35.5;
m["O"] = 16;
m["N"] = 14;
m["Fe"] = 55.8;

std::string line;
while(getline(std::cin, line))
std::cout << parse(std::move(line), m) << 'n';
}






c++ parsing






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 27 at 14:38







友人A

















asked Jan 27 at 10:26









友人A友人A

436




436








  • 1




    $begingroup$
    I realized if(stk.size() > 1) should be if(stk.size() > 0), should I edit?
    $endgroup$
    – 友人A
    Jan 27 at 11:08






  • 2




    $begingroup$
    Feel free to edit as long as there are no answers. But try to not edit as time goes by.
    $endgroup$
    – Incomputable
    Jan 27 at 11:26






  • 2




    $begingroup$
    Note, the singular is “automaton”.
    $endgroup$
    – Konrad Rudolph
    Jan 27 at 14:26






  • 1




    $begingroup$
    There is a standard for writing chemical formulas to be machine parsable so that they are not ambiguous. Its called SMILE Simple Molecule Input Line Entry. The trouble with your system is that for more complex molecules you can't be unambiguous.
    $endgroup$
    – Martin York
    Jan 27 at 18:50








  • 1




    $begingroup$
    You said that you wrote a parser but the result of the function is double so clearly you’re doing way more than just parsing, and it’s not at all clear (from the names, nor the usage) what transformation you’re performing. I’m guessing you are calculating the molecular weight? In which case the function parse should be renamed appropriately.
    $endgroup$
    – Konrad Rudolph
    Jan 28 at 11:32
















  • 1




    $begingroup$
    I realized if(stk.size() > 1) should be if(stk.size() > 0), should I edit?
    $endgroup$
    – 友人A
    Jan 27 at 11:08






  • 2




    $begingroup$
    Feel free to edit as long as there are no answers. But try to not edit as time goes by.
    $endgroup$
    – Incomputable
    Jan 27 at 11:26






  • 2




    $begingroup$
    Note, the singular is “automaton”.
    $endgroup$
    – Konrad Rudolph
    Jan 27 at 14:26






  • 1




    $begingroup$
    There is a standard for writing chemical formulas to be machine parsable so that they are not ambiguous. Its called SMILE Simple Molecule Input Line Entry. The trouble with your system is that for more complex molecules you can't be unambiguous.
    $endgroup$
    – Martin York
    Jan 27 at 18:50








  • 1




    $begingroup$
    You said that you wrote a parser but the result of the function is double so clearly you’re doing way more than just parsing, and it’s not at all clear (from the names, nor the usage) what transformation you’re performing. I’m guessing you are calculating the molecular weight? In which case the function parse should be renamed appropriately.
    $endgroup$
    – Konrad Rudolph
    Jan 28 at 11:32










1




1




$begingroup$
I realized if(stk.size() > 1) should be if(stk.size() > 0), should I edit?
$endgroup$
– 友人A
Jan 27 at 11:08




$begingroup$
I realized if(stk.size() > 1) should be if(stk.size() > 0), should I edit?
$endgroup$
– 友人A
Jan 27 at 11:08




2




2




$begingroup$
Feel free to edit as long as there are no answers. But try to not edit as time goes by.
$endgroup$
– Incomputable
Jan 27 at 11:26




$begingroup$
Feel free to edit as long as there are no answers. But try to not edit as time goes by.
$endgroup$
– Incomputable
Jan 27 at 11:26




2




2




$begingroup$
Note, the singular is “automaton”.
$endgroup$
– Konrad Rudolph
Jan 27 at 14:26




$begingroup$
Note, the singular is “automaton”.
$endgroup$
– Konrad Rudolph
Jan 27 at 14:26




1




1




$begingroup$
There is a standard for writing chemical formulas to be machine parsable so that they are not ambiguous. Its called SMILE Simple Molecule Input Line Entry. The trouble with your system is that for more complex molecules you can't be unambiguous.
$endgroup$
– Martin York
Jan 27 at 18:50






$begingroup$
There is a standard for writing chemical formulas to be machine parsable so that they are not ambiguous. Its called SMILE Simple Molecule Input Line Entry. The trouble with your system is that for more complex molecules you can't be unambiguous.
$endgroup$
– Martin York
Jan 27 at 18:50






1




1




$begingroup$
You said that you wrote a parser but the result of the function is double so clearly you’re doing way more than just parsing, and it’s not at all clear (from the names, nor the usage) what transformation you’re performing. I’m guessing you are calculating the molecular weight? In which case the function parse should be renamed appropriately.
$endgroup$
– Konrad Rudolph
Jan 28 at 11:32






$begingroup$
You said that you wrote a parser but the result of the function is double so clearly you’re doing way more than just parsing, and it’s not at all clear (from the names, nor the usage) what transformation you’re performing. I’m guessing you are calculating the molecular weight? In which case the function parse should be renamed appropriately.
$endgroup$
– Konrad Rudolph
Jan 28 at 11:32












2 Answers
2






active

oldest

votes


















6












$begingroup$

Using a state machine is fine, but is usually harder to get right than writing the parser with focus on the grammar. You're also mixing a computation with the parsing which adds more information to consider when analyzing the code.
I would recommend to separate the parser code from the computation and then writing the parser strictly following the grammar you want to parse. I'll try to illustrate what I mean by giving a simplified version of your parser. Say you have a grammar:



formula = *(group)
group = element [ count ]
element = uppercase [ lowercase ]
count = "0" .. "9"


You can now give a function for each non-terminal:




// formula = *(group) ; * becomes while (...) ...
std::list<group> parse_formula(std::stringstream& s)
{
std::list<group> rv;

while (!s.eof())
{
rv.push_back(parse_group(s));
}

return rv;
}

// group = element [ count ]
group parse_group(std::stringstream& s)
{
group g;
group.element = parse_element(s);
try
{
group.count = parse_count(s);
}
catch (const parse_failed&)
{
group = 1;
}
}

// element = uppercase [lowercase]
std::string parse_element(std::stringstream& s)
{
if (!std::isupper(s.peek()))
{
throw parse_failed(...);
}

std::string element = s.get();;

if (std::islower(s.peek()))
{
s.get(ch);
element += ch;
}

return element;
}

// count = [0-9]
unsigned parse_count(std::stringstream& s)
{
if (!std::isdigit(s.peek()))
{
throw parse_failed(...);
}

unsigned rv;
s >> rv; // this actually violates the grammar as it reads multiple digits
return rv;
}


You can then iterate over the list of groups and compute the sum.






share|improve this answer











$endgroup$





















    6












    $begingroup$

    Lambdas Too Complex



    You say that the reason behind your lambdas is that they carry a lot of state around, but it is mutating plenty of it as well. Extracting your lambdas into methods in their own right would make it easier to keep track of where your state is used and where it is mutated.



    Methods Should Calculate Or Mutate, Not Both



    Lambdas that calculate something and then mutate something else are notoriously difficult to debug and maintain. When you create your methods, it should either calculate something (such as the atomic weight) or change something (such as changing your state).



    Variable Names Unclear



    Names like pstart, num and m do not tell me anything about the purpose of these variables. There is nothing wrong with LongVariableNamesThatExplainWhatTheyAreDoing.






    share|improve this answer









    $endgroup$













      Your Answer





      StackExchange.ifUsing("editor", function () {
      return StackExchange.using("mathjaxEditing", function () {
      StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
      StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
      });
      });
      }, "mathjax-editing");

      StackExchange.ifUsing("editor", function () {
      StackExchange.using("externalEditor", function () {
      StackExchange.using("snippets", function () {
      StackExchange.snippets.init();
      });
      });
      }, "code-snippets");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "196"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f212312%2fchemical-formula-parser%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      6












      $begingroup$

      Using a state machine is fine, but is usually harder to get right than writing the parser with focus on the grammar. You're also mixing a computation with the parsing which adds more information to consider when analyzing the code.
      I would recommend to separate the parser code from the computation and then writing the parser strictly following the grammar you want to parse. I'll try to illustrate what I mean by giving a simplified version of your parser. Say you have a grammar:



      formula = *(group)
      group = element [ count ]
      element = uppercase [ lowercase ]
      count = "0" .. "9"


      You can now give a function for each non-terminal:




      // formula = *(group) ; * becomes while (...) ...
      std::list<group> parse_formula(std::stringstream& s)
      {
      std::list<group> rv;

      while (!s.eof())
      {
      rv.push_back(parse_group(s));
      }

      return rv;
      }

      // group = element [ count ]
      group parse_group(std::stringstream& s)
      {
      group g;
      group.element = parse_element(s);
      try
      {
      group.count = parse_count(s);
      }
      catch (const parse_failed&)
      {
      group = 1;
      }
      }

      // element = uppercase [lowercase]
      std::string parse_element(std::stringstream& s)
      {
      if (!std::isupper(s.peek()))
      {
      throw parse_failed(...);
      }

      std::string element = s.get();;

      if (std::islower(s.peek()))
      {
      s.get(ch);
      element += ch;
      }

      return element;
      }

      // count = [0-9]
      unsigned parse_count(std::stringstream& s)
      {
      if (!std::isdigit(s.peek()))
      {
      throw parse_failed(...);
      }

      unsigned rv;
      s >> rv; // this actually violates the grammar as it reads multiple digits
      return rv;
      }


      You can then iterate over the list of groups and compute the sum.






      share|improve this answer











      $endgroup$


















        6












        $begingroup$

        Using a state machine is fine, but is usually harder to get right than writing the parser with focus on the grammar. You're also mixing a computation with the parsing which adds more information to consider when analyzing the code.
        I would recommend to separate the parser code from the computation and then writing the parser strictly following the grammar you want to parse. I'll try to illustrate what I mean by giving a simplified version of your parser. Say you have a grammar:



        formula = *(group)
        group = element [ count ]
        element = uppercase [ lowercase ]
        count = "0" .. "9"


        You can now give a function for each non-terminal:




        // formula = *(group) ; * becomes while (...) ...
        std::list<group> parse_formula(std::stringstream& s)
        {
        std::list<group> rv;

        while (!s.eof())
        {
        rv.push_back(parse_group(s));
        }

        return rv;
        }

        // group = element [ count ]
        group parse_group(std::stringstream& s)
        {
        group g;
        group.element = parse_element(s);
        try
        {
        group.count = parse_count(s);
        }
        catch (const parse_failed&)
        {
        group = 1;
        }
        }

        // element = uppercase [lowercase]
        std::string parse_element(std::stringstream& s)
        {
        if (!std::isupper(s.peek()))
        {
        throw parse_failed(...);
        }

        std::string element = s.get();;

        if (std::islower(s.peek()))
        {
        s.get(ch);
        element += ch;
        }

        return element;
        }

        // count = [0-9]
        unsigned parse_count(std::stringstream& s)
        {
        if (!std::isdigit(s.peek()))
        {
        throw parse_failed(...);
        }

        unsigned rv;
        s >> rv; // this actually violates the grammar as it reads multiple digits
        return rv;
        }


        You can then iterate over the list of groups and compute the sum.






        share|improve this answer











        $endgroup$
















          6












          6








          6





          $begingroup$

          Using a state machine is fine, but is usually harder to get right than writing the parser with focus on the grammar. You're also mixing a computation with the parsing which adds more information to consider when analyzing the code.
          I would recommend to separate the parser code from the computation and then writing the parser strictly following the grammar you want to parse. I'll try to illustrate what I mean by giving a simplified version of your parser. Say you have a grammar:



          formula = *(group)
          group = element [ count ]
          element = uppercase [ lowercase ]
          count = "0" .. "9"


          You can now give a function for each non-terminal:




          // formula = *(group) ; * becomes while (...) ...
          std::list<group> parse_formula(std::stringstream& s)
          {
          std::list<group> rv;

          while (!s.eof())
          {
          rv.push_back(parse_group(s));
          }

          return rv;
          }

          // group = element [ count ]
          group parse_group(std::stringstream& s)
          {
          group g;
          group.element = parse_element(s);
          try
          {
          group.count = parse_count(s);
          }
          catch (const parse_failed&)
          {
          group = 1;
          }
          }

          // element = uppercase [lowercase]
          std::string parse_element(std::stringstream& s)
          {
          if (!std::isupper(s.peek()))
          {
          throw parse_failed(...);
          }

          std::string element = s.get();;

          if (std::islower(s.peek()))
          {
          s.get(ch);
          element += ch;
          }

          return element;
          }

          // count = [0-9]
          unsigned parse_count(std::stringstream& s)
          {
          if (!std::isdigit(s.peek()))
          {
          throw parse_failed(...);
          }

          unsigned rv;
          s >> rv; // this actually violates the grammar as it reads multiple digits
          return rv;
          }


          You can then iterate over the list of groups and compute the sum.






          share|improve this answer











          $endgroup$



          Using a state machine is fine, but is usually harder to get right than writing the parser with focus on the grammar. You're also mixing a computation with the parsing which adds more information to consider when analyzing the code.
          I would recommend to separate the parser code from the computation and then writing the parser strictly following the grammar you want to parse. I'll try to illustrate what I mean by giving a simplified version of your parser. Say you have a grammar:



          formula = *(group)
          group = element [ count ]
          element = uppercase [ lowercase ]
          count = "0" .. "9"


          You can now give a function for each non-terminal:




          // formula = *(group) ; * becomes while (...) ...
          std::list<group> parse_formula(std::stringstream& s)
          {
          std::list<group> rv;

          while (!s.eof())
          {
          rv.push_back(parse_group(s));
          }

          return rv;
          }

          // group = element [ count ]
          group parse_group(std::stringstream& s)
          {
          group g;
          group.element = parse_element(s);
          try
          {
          group.count = parse_count(s);
          }
          catch (const parse_failed&)
          {
          group = 1;
          }
          }

          // element = uppercase [lowercase]
          std::string parse_element(std::stringstream& s)
          {
          if (!std::isupper(s.peek()))
          {
          throw parse_failed(...);
          }

          std::string element = s.get();;

          if (std::islower(s.peek()))
          {
          s.get(ch);
          element += ch;
          }

          return element;
          }

          // count = [0-9]
          unsigned parse_count(std::stringstream& s)
          {
          if (!std::isdigit(s.peek()))
          {
          throw parse_failed(...);
          }

          unsigned rv;
          s >> rv; // this actually violates the grammar as it reads multiple digits
          return rv;
          }


          You can then iterate over the list of groups and compute the sum.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Jan 27 at 14:38

























          answered Jan 27 at 12:13









          CornholioCornholio

          48617




          48617

























              6












              $begingroup$

              Lambdas Too Complex



              You say that the reason behind your lambdas is that they carry a lot of state around, but it is mutating plenty of it as well. Extracting your lambdas into methods in their own right would make it easier to keep track of where your state is used and where it is mutated.



              Methods Should Calculate Or Mutate, Not Both



              Lambdas that calculate something and then mutate something else are notoriously difficult to debug and maintain. When you create your methods, it should either calculate something (such as the atomic weight) or change something (such as changing your state).



              Variable Names Unclear



              Names like pstart, num and m do not tell me anything about the purpose of these variables. There is nothing wrong with LongVariableNamesThatExplainWhatTheyAreDoing.






              share|improve this answer









              $endgroup$


















                6












                $begingroup$

                Lambdas Too Complex



                You say that the reason behind your lambdas is that they carry a lot of state around, but it is mutating plenty of it as well. Extracting your lambdas into methods in their own right would make it easier to keep track of where your state is used and where it is mutated.



                Methods Should Calculate Or Mutate, Not Both



                Lambdas that calculate something and then mutate something else are notoriously difficult to debug and maintain. When you create your methods, it should either calculate something (such as the atomic weight) or change something (such as changing your state).



                Variable Names Unclear



                Names like pstart, num and m do not tell me anything about the purpose of these variables. There is nothing wrong with LongVariableNamesThatExplainWhatTheyAreDoing.






                share|improve this answer









                $endgroup$
















                  6












                  6








                  6





                  $begingroup$

                  Lambdas Too Complex



                  You say that the reason behind your lambdas is that they carry a lot of state around, but it is mutating plenty of it as well. Extracting your lambdas into methods in their own right would make it easier to keep track of where your state is used and where it is mutated.



                  Methods Should Calculate Or Mutate, Not Both



                  Lambdas that calculate something and then mutate something else are notoriously difficult to debug and maintain. When you create your methods, it should either calculate something (such as the atomic weight) or change something (such as changing your state).



                  Variable Names Unclear



                  Names like pstart, num and m do not tell me anything about the purpose of these variables. There is nothing wrong with LongVariableNamesThatExplainWhatTheyAreDoing.






                  share|improve this answer









                  $endgroup$



                  Lambdas Too Complex



                  You say that the reason behind your lambdas is that they carry a lot of state around, but it is mutating plenty of it as well. Extracting your lambdas into methods in their own right would make it easier to keep track of where your state is used and where it is mutated.



                  Methods Should Calculate Or Mutate, Not Both



                  Lambdas that calculate something and then mutate something else are notoriously difficult to debug and maintain. When you create your methods, it should either calculate something (such as the atomic weight) or change something (such as changing your state).



                  Variable Names Unclear



                  Names like pstart, num and m do not tell me anything about the purpose of these variables. There is nothing wrong with LongVariableNamesThatExplainWhatTheyAreDoing.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Jan 27 at 12:15









                  Joe CJoe C

                  1,045211




                  1,045211






























                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Code Review Stack Exchange!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      Use MathJax to format equations. MathJax reference.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f212312%2fchemical-formula-parser%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Probability when a professor distributes a quiz and homework assignment to a class of n students.

                      Aardman Animations

                      Are they similar matrix